Week 12: Tries, Huffman coding, Pattern Matching


Tries


Tries2/37

Tries are trees organised using parts of keys (rather than whole keys)

[Diagram:Pics/tries/trie-example-small.png]


... Tries3/37

Tries are useful for prefix-search of strings (radix search)

Each node in a trie (implementing a dictionary) ... Depth of trie d = length of longest key value

Cost of searching: O(d)   (independent of the number of keys N)


... Tries4/37

Tries can be implemented using BST-like nodes:

[Diagram:Pics/tries/trie-as-bst-small.png]


... Tries5/37

Trie representation (using BST-like nodes):

typedef struct TrieNode *Link;

typedef struct TrieNode {
   char keybit; // one char from key
   int  finish; // last char in key?
   Item data;   // no Item if !finish
   Link child;
   Link sibling;
} TrieNode;

typedef struct { Link root; } TrieRep;

typedef TrieRep *Trie;
typedef char *Key;


Trie Operations6/37

Searching traverses path, using char-by-char from Key:

TrieNode *find(Trie t, Key k)
{
   char *c = k;
   TrieNode *curr = t->root;
   while (*c != '\0' && curr != NULL) {
      // scan siblings
      while (curr != NULL && curr->keybit != *c)
         curr = curr->sibling;
      if (curr == NULL) return NULL;
      if (*(c+1) == '\0') return curr;
      curr = curr->child; // move down one level
      c++;                // get next character
   }
   return NULL;
}


... Trie Operations7/37

Searching and deletion in Tries:

Item *search(Trie t, Key k)
{
   TrieNode *n = find(t,k);
   if (n == NULL) return NULL;
   return (n->finish) ? &(n->data) : NULL;
}

void delete(Trie t, Key k)
{
   TrieNode *n = find(t,k);
   if (n == NULL) return;
   n->finish = 0; // lazy deletion: node remains, flag cleared
}


... Trie Operations8/37

Insertion into Trie:

TrieNode *newTrieNode(Key k, int i, Item it)
{
   TrieNode *new = malloc(sizeof(TrieNode));
   new->keybit = k[i];
   if (k[i+1] != '\0')
      new->finish = 0;
   else {
      new->finish = 1;
      new->data = it;
   }
   new->child = NULL;
   new->sibling = NULL;
   return new;
}


... Trie Operations9/37

Insertion into Trie (cont):

void insert(Trie t, Item it)
{
   Key k = key(it);
   TrieNode *n = find(t,k);
   if (n != NULL) {
      n->finish = 1;
      n->data = it; // replaces any existing Item
      return;
   }
   if (t->root == NULL) {
      t->root = newTrieNode(k,0,it);
   }
   ...


... Trie Operations10/37

Insertion into Trie (cont):

   ...
   TrieNode *curr = t->root, *prev;
   int i;
   for (i = 0; k[i] != '\0'; i++) {
      // scan siblings
      prev = NULL;
      while (curr != NULL && curr->keybit != k[i]) {
         prev = curr;
         curr = curr->sibling;
      }
      if (curr == NULL) // add new sibling
         curr = prev->sibling = newTrieNode(k,i,it);
      if (k[i+1] == '\0') break;
      if (curr->child == NULL)
         curr->child = newTrieNode(k,i+1,it);
      curr = curr->child; // move down one level
   }
}


Tries (Example)11/37

Word matching and prefix matching with a standard trie.

[Diagram:Pics/tries/trie2-small.png]


The above example and the following slides are from "Data Structures and Algorithms in Java"; Sixth Edition; Michael T. Goodrich, Roberto Tamassia and Michael H. Goldwasser; 2014; Wiley.


Compressed Tries 12/37

A compressed trie merges each chain of nodes with only one child into a single node, so that every internal node has at least two children.

[Diagram:Pics/tries/trie3-small.png]

Another example: Compact representation of compressed trie

[Diagram:Pics/tries/trie4-small.png]


The above example is from "Data Structures and Algorithms in Java"; Sixth Edition; Michael T. Goodrich, Roberto Tamassia and Michael H. Goldwasser; 2014; Wiley.


Suffix Tries 13/37

A suffix trie is a compressed trie containing all the suffixes of the given text.

[Diagram:Pics/tries/trie5-small.png]


The above example is from "Data Structures and Algorithms in Java"; Sixth Edition; Michael T. Goodrich, Roberto Tamassia and Michael H. Goldwasser; 2014; Wiley, with text from Wikipedia.


Huffman coding14/37

A Huffman code is a prefix code that is commonly used for lossless data compression.

Huffman coding uses a specific method for choosing the representation for each symbol, resulting in a prefix code (sometimes called a "prefix-free code")


Text Compression using Huffman coding15/37

Code … mapping of each character to a binary code word

Prefix code … binary code such that no code word is prefix of another code word

Encoding tree

Text compression problem

Given a text T, find a prefix code that yields the shortest encoding of T


Huffman coding: Building tree16/37

Building tree:

[Diagram:Pics/tries/huffman-tree1-small.png]

Another example:

[Diagram:Pics/tries/huffman-1-small.png]

[Diagram:Pics/tries/huffman-tbl1-small.png]


The above example and text are from Wikipedia at https://en.wikipedia.org/wiki/Huffman_coding.


Huffman coding: Example17/37

Text: a fast runner need never be afraid of the dark

[Diagram:Pics/tries/huffman-example-small.png]


Huffman coding: Analysis18/37

Analysis of Huffman's algorithm:


Pattern Matching19/37

Given two strings T (text) and P (pattern),
the pattern matching problem consists of finding a substring of T equal to P

Applications:



Pattern Matching: Brute-force21/37

Brute-force pattern matching algorithm

BruteForceMatch(T,P):
|  Input  text T of length n, pattern P of length m
|  Output starting index of a substring of T equal to P
|         -1 if no such substring exists
|
|  for all i=0..n-m do
|  |  j=0                           // check from left to right
|  |  while j<m ∧ T[i+j]=P[j] do    // test ith shift of pattern
|  |     j=j+1
|  |     if j=m then
|  |        return i                // entire pattern checked
|  |     end if
|  |  end while
|  end for
|  return -1                        // no match found


Pattern Matching: Brute-force, Analysis22/37

Brute-force pattern matching runs in O(n·m)

Examples of worst case (forward checking):


Boyer-Moore Algorithm23/37

The Boyer-Moore pattern matching algorithm is based on two heuristics:


... Boyer-Moore Algorithm24/37

Example:

[Diagram:Pics/tries/boyer-moore-small.png]


... Boyer-Moore Algorithm25/37

Boyer-Moore algorithm preprocesses pattern P and alphabet Σ to build the last-occurrence function L, mapping each c ∈ Σ to the index of the last occurrence of c in P (-1 if c does not occur in P)

Example: Σ = {a,b,c,d}, P = acab

  c     a   b   c   d
  L(c)  2   3   1   -1


... Boyer-Moore Algorithm26/37

BoyerMooreMatch(T,P,Σ):
|  Input  text T of length n, pattern P of length m, alphabet Σ
|  Output starting index of a substring of T equal to P
|         -1 if no such substring exists
|
|  L=lastOccurrenceFunction(P,Σ)
|  i=m-1, j=m-1                 // start at end of pattern
|  repeat
|  |  if T[i]=P[j] then
|  |     if j=0 then
|  |        return i            // match found at i
|  |     else
|  |        i=i-1, j=j-1
|  |     end if
|  |  else                      // character-jump
|  |     i=i+m-min(j,1+L[T[i]])
|  |     j=m-1
|  |  end if
|  until i≥n
|  return -1                    // no match


... Boyer-Moore Algorithm27/37

Case 1: j ≤ 1+L[c]

[Diagram:Pics/tries/boyer-moore-case1-small.png]

Case 2: 1+L[c] < j

[Diagram:Pics/tries/boyer-moore-case2-small.png]


Exercise 1: Boyer-Moore algorithm28/37

For the alphabet Σ = {a,b,c,d}

  1. compute last-occurrence function L for pattern P = abacab
  2. trace Boyer-Moore on P and text T = abacaabadcabacabaabb
    • how many comparisons are needed?


  c     a   b   c   d
  L(c)  4   5   3   -1

[Diagram:Pics/tries/boyer-moore-example-small.png]


13 comparisons in total


... Boyer-Moore Algorithm29/37

Analysis of Boyer-Moore algorithm:


Knuth-Morris-Pratt Algorithm30/37

The Knuth-Morris-Pratt algorithm …

Reminder:


... Knuth-Morris-Pratt Algorithm31/37

When a mismatch occurs …


[Diagram:Pics/tries/kmp-shift-small.png]


... Knuth-Morris-Pratt Algorithm32/37

KMP preprocesses the pattern to find matches of its prefixes with itself

Example: P = abaaba
  j     0  1  2  3  4  5
  P[j]  a  b  a  a  b  a
  F(j)  0  0  1  1  2  3

[Diagram:Pics/tries/kmp-failure-function-small.png]


... Knuth-Morris-Pratt Algorithm33/37

KMPMatch(T,P):
|  Input  text T of length n, pattern P of length m
|  Output starting index of a substring of T equal to P
|         -1 if no such substring exists
|
|  F=failureFunction(P)
|  i=0, j=0                    // start from left
|  while i<n do
|  |  if T[i]=P[j] then
|  |     if j=m-1 then
|  |        return i-j         // match found at i-j
|  |     else
|  |        i=i+1, j=j+1
|  |     end if
|  |  else                     // mismatch at P[j]
|  |     if j>0 then
|  |        j=F[j-1]           // resume comparing P at F[j-1]
|  |     else
|  |        i=i+1
|  |     end if
|  |  end if
|  end while
|  return -1                   // no match


Exercise 2: Knuth-Morris-Pratt Algorithm34/37

  1. compute failure function F for pattern P = abacab
  2. trace Knuth-Morris-Pratt on P and text T = abacaabadcabacabaabb


  j     0  1  2  3  4  5
  P[j]  a  b  a  c  a  b
  F(j)  0  0  1  0  1  2

[Diagram:Pics/tries/kmp-example-small.png]


... Knuth-Morris-Pratt Algorithm35/37

Construction of the failure function is similar to the KMP algorithm itself:

failureFunction(P):
|  Input  pattern P of length m
|  Output failure function for P
|
|  F[0]=0
|  i=1, j=0
|  while i<m do
|  |  if P[i]=P[j] then   // we have matched j+1 characters
|  |     F[i]=j+1
|  |     i=i+1, j=j+1
|  |  else if j>0 then    // use failure function to shift P
|  |     j=F[j-1]
|  |  else
|  |     F[i]=0           // no match
|  |     i=i+1
|  |  end if
|  end while
|  return F

Analysis of failure function computation:

⇒  failure function can be computed in O(m) time


... Knuth-Morris-Pratt Algorithm36/37

Analysis of Knuth-Morris-Pratt algorithm:

⇒  KMP's algorithm runs in optimal time O(m+n)


Boyer-Moore vs KMP37/37

Boyer-Moore algorithm

Knuth-Morris-Pratt algorithm


Produced: 18 Oct 2017