Week 12: Tries, Huffman coding, Pattern Matching
Tries | 2/37 |
Tries are trees organised using parts of keys (rather than whole keys)
... Tries | 3/37 |
Tries are useful for prefix-search of strings (radix search)
Cost of searching is O(d), where d is the length of the key (independent of the number of items N)
... Tries | 4/37 |
Tries can be implemented using BST-like nodes:
... Tries | 5/37 |
Trie representation (using BST-like nodes):
typedef struct TrieNode *Link;

typedef struct TrieNode {
   char keybit;    // one char from key
   int  finish;    // last char in key?
   Item data;      // no Item if !finish
   Link child;
   Link sibling;
} TrieNode;

typedef struct { Link root; } TrieRep;
typedef TrieRep *Trie;
typedef char *Key;
Trie Operations | 6/37 |
Searching traverses path, using char-by-char from Key:
TrieNode *find(Trie t, Key k) {
   char *c = k;
   TrieNode *curr = t->root;
   while (*c != '\0' && curr != NULL) {
      // scan siblings at this level
      while (curr != NULL && curr->keybit != *c)
         curr = curr->sibling;
      if (curr == NULL) return NULL;
      if (*(c+1) == '\0') return curr;
      curr = curr->child;   // move down one level
      c++;                  // get next character
   }
   return NULL;
}
... Trie Operations | 7/37 |
Searching and deletion in Tries:
Item *search(Trie t, Key k) {
   TrieNode *n = find(t,k);
   if (n == NULL) return NULL;
   return (n->finish) ? &(n->data) : NULL;
}

void delete(Trie t, Key k) {
   TrieNode *n = find(t,k);
   if (n == NULL) return;
   n->finish = 0;   // lazy deletion: node stays, but no longer marks a key
}
... Trie Operations | 8/37 |
Insertion into Trie:
TrieNode *newTrieNode(Key k, int i, Item it) {
   TrieNode *new = malloc(sizeof(TrieNode));
   new->keybit = k[i];
   if (k[i+1] != '\0')
      new->finish = 0;
   else {
      new->finish = 1;
      new->data = it;
   }
   new->child = NULL;
   new->sibling = NULL;
   return new;
}
... Trie Operations | 9/37 |
Insertion into Trie (cont):
void insert(Trie t, Item it) {
   Key k = key(it);
   TrieNode *n = find(t,k);
   if (n != NULL) {
      n->finish = 1;
      n->data = it;   // replaces any existing Item
      return;
   }
   if (t->root == NULL)
      t->root = newTrieNode(k,0,it);
   ...
... Trie Operations | 10/37 |
Insertion into Trie (cont):
   ...
   TrieNode *curr = t->root, *prev;
   int i;
   for (i = 0; k[i] != '\0'; i++) {
      // scan siblings at this level
      prev = NULL;
      while (curr != NULL && curr->keybit != k[i]) {
         prev = curr;
         curr = curr->sibling;
      }
      if (curr == NULL)   // no node for k[i]: add new sibling
         curr = prev->sibling = newTrieNode(k,i,it);
      if (k[i+1] == '\0') break;
      if (curr->child == NULL)
         curr->child = newTrieNode(k,i+1,it);
      curr = curr->child;   // move down one level
   }
}
Tries (Example) | 11/37 |
Word matching and prefix matching with a standard trie.
The above example and the following slides are from "Data Structures and Algorithms in Java"; Sixth Edition; Michael T. Goodrich, Roberto Tamassia and Michael H. Goldwasser; 2014; Wiley.
Compressed Tries | 12/37 |
A compressed trie merges chains of nodes that each have only one subtree, so that every internal node has at least two subtrees.
Another example: Compact representation of compressed trie
The above example is from "Data Structures and Algorithms in Java"; Sixth Edition; Michael T. Goodrich, Roberto Tamassia and Michael H. Goldwasser; 2014; Wiley.
Suffix Tries | 13/37 |
A suffix trie is a compressed trie containing all the suffixes of the given text.
The above example is from "Data Structures and Algorithms in Java"; Sixth Edition; Michael T. Goodrich, Roberto Tamassia and Michael H. Goldwasser; 2014; Wiley, and the text from Wikipedia.
Huffman coding | 14/37 |
A Huffman code is a prefix code that is commonly used for lossless data compression.
Huffman coding uses a specific method for choosing the representation of each symbol, resulting in a prefix code (sometimes called a "prefix-free code").
Text Compression using Huffman coding | 15/37 |
Code … mapping of each character to a binary code word
Prefix code … binary code such that no code word is prefix of another code word
Encoding tree …
Text compression problem
Given a text T, find a prefix code that yields the shortest encoding of T
Huffman coding: Building tree | 16/37 |
Building tree:
Another example:
The above example and text are from Wikipedia at https://en.wikipedia.org/wiki/Huffman_coding.
Huffman coding: Example | 17/37 |
Text: a fast runner need never be afraid of the dark
Huffman coding: Analysis | 18/37 |
Analysis of Huffman's algorithm:
Pattern Matching | 19/37 |
Given two strings T (text) and P (pattern),
the pattern matching problem consists of finding a substring of T equal to P
Applications:
Pattern Matching: Brute-force | 21/37 |
Brute-force pattern matching algorithm
BruteForceMatch(T,P):
|  Input  text T of length n, pattern P of length m
|  Output starting index of a substring of T equal to P
|         -1 if no such substring exists
|
|  for all i=0..n-m do
|  |  j=0                              // check from left to right
|  |  while j<m ∧ T[i+j]=P[j] do      // test ith shift of pattern
|  |  |  j=j+1
|  |  |  if j=m then
|  |  |  |  return i                  // entire pattern checked
|  |  |  end if
|  |  end while
|  end for
|  return -1                          // no match found
Pattern Matching: Brute-force, Analysis | 22/37 |
Brute-force pattern matching runs in O(n·m) time in the worst case
Example of worst case (forward checking): T = aaa…ah, P = aaah
Boyer-Moore Algorithm | 23/37 |
The Boyer-Moore pattern matching algorithm is based on two heuristics:
Looking-glass heuristic: compare P with a substring of T moving backwards (right to left)
Character-jump heuristic: when a mismatch occurs at T[i]=c:
   if P contains c, shift P to align the last occurrence of c in P with T[i]
   otherwise, shift P to align P[0] with T[i+1]
... Boyer-Moore Algorithm | 24/37 |
Example:
... Boyer-Moore Algorithm | 25/37 |
Boyer-Moore uses the last-occurrence function L, mapping the alphabet Σ to integers:
L[c] is the largest index i such that P[i]=c, or -1 if c does not occur in P
In the character-jump step i=i+m-min(j,1+L[c]), two cases arise:
Case 1: j ≤ 1+L[c] ⇒ shift amount m-j (pattern moves one position right)
Case 2: 1+L[c] < j ⇒ shift amount m-(1+L[c]) (aligns the last occurrence of c with T[i])
Analysis of Boyer-Moore: O(n·m+|Σ|) time in the worst case
The Knuth-Morris-Pratt algorithm compares the pattern to the text left-to-right, but shifts the pattern more intelligently than the brute-force algorithm.
When a mismatch occurs, the pattern can be shifted by reusing what is already known: the largest prefix of the matched part that is also a suffix of it still matches the text.
KMP preprocesses the pattern to find matches of its prefixes with itself.
Construction of the failure function is similar to the KMP algorithm itself:
Produced: 18 Oct 2017
Example: Σ = {a,b,c,d}, P = acab

c      a   b   c   d
L(c)   2   3   1  -1
... Boyer-Moore Algorithm 26/37
BoyerMooreMatch(T,P,Σ):
| Input text T of length n, pattern P of length m, alphabet Σ
| Output starting index of a substring of T equal to P
| -1 if no such substring exists
|
| L=lastOccurenceFunction(P,Σ)
| i=m-1, j=m-1 // start at end of pattern
| repeat
| | if T[i]=P[j] then
| | if j=0 then
| | return i // match found at i
| | else
| | i=i-1, j=j-1
| | end if
| | else // character-jump
| | i=i+m-min(j,1+L[T[i]])
| | j=m-1
| | end if
| until i≥n
| return -1 // no match
... Boyer-Moore Algorithm 27/37
Exercise 1: Boyer-Moore algorithm 28/37
Σ = {a,b,c,d}, P = abacab, T = abacaabadcabacabaabb

c      a   b   c   d
L(c)   4   5   3  -1

13 comparisons in total
... Boyer-Moore Algorithm 29/37
Worst case example: T = aaa…a, P = baaa (about n·m character comparisons)
⇒ Boyer-Moore significantly faster than brute-force on English text
Knuth-Morris-Pratt Algorithm 30/37
... Knuth-Morris-Pratt Algorithm 31/37
... Knuth-Morris-Pratt Algorithm 32/37
Example: P = abaaba

j      0  1  2  3  4  5
P[j]   a  b  a  a  b  a
F(j)   0  0  1  1  2  3
... Knuth-Morris-Pratt Algorithm 33/37
KMPMatch(T,P):
| Input text T of length n, pattern P of length m
| Output starting index of a substring of T equal to P
| -1 if no such substring exists
|
| F=failureFunction(P)
| i=0, j=0 // start from left
| while i<n do
| | if T[i]=P[j] then
| | if j=m-1 then
| | return i-j // match found at i-j
| | else
| | i=i+1, j=j+1
| | end if
| | else // mismatch at P[j]
| | if j>0 then
| | j=F[j-1] // resume comparing P at F[j-1]
| | else
| | i=i+1
| | end if
| | end if
| end while
| return -1 // no match
... Knuth-Morris-Pratt Algorithm 34/37
Example: P = abacab, T = abacaabadcabacabaabb

j      0  1  2  3  4  5
P[j]   a  b  a  c  a  b
F(j)   0  0  1  0  1  2
... Knuth-Morris-Pratt Algorithm 35/37
failureFunction(P):
| Input pattern P of length m
| Output failure function for P
|
| F[0]=0
| i=1, j=0
| while i<m do
| | if P[i]=P[j] then // we have matched j+1 characters
| | F[i]=j+1
| | i=i+1, j=j+1
| | else if j>0 then // use failure function to shift P
| | j=F[j-1]
| | else
| | F[i]=0 // no match
| | i=i+1
| | end if
| end while
| return F
⇒ failure function can be computed in O(m) time
... Knuth-Morris-Pratt Algorithm 36/37
⇒ KMP's algorithm runs in optimal time O(m+n)
Boyer-Moore vs KMP 37/37
Example alphabet: Σ = {A,C,G,T} (DNA sequences)