Week 12: Tries, Huffman coding, Pattern Matching


Tries


Tries2/37

Tries are trees organised using parts of keys (rather than whole keys)

[Diagram:Pics/tries/trie-example-small.png]


... Tries3/37

Tries are useful for prefix-search of strings (radix search)

Each node in a trie (implementing a dictionary) ... Depth of trie d = length of longest key value

Cost of searching: O(d)   (independent of the number of keys N)


... Tries4/37

Tries can be implemented using BST-like nodes:

[Diagram:Pics/tries/trie-as-bst-small.png]


... Tries5/37

Trie representation (using BST-like nodes):

typedef struct TrieNode *Link;

typedef struct TrieNode {
   char keybit; // one char from key
   int  finish; // last char in key?
   Item data;   // no Item if !finish
   Link child;
   Link sibling;
} TrieNode;

typedef struct { Link root; } TrieRep;

typedef TrieRep *Trie;
typedef char *Key;


Trie Operations6/37

Searching traverses path, using char-by-char from Key:

TrieNode *find(Trie t, Key k)
{
   char *c = k;
   TrieNode *curr = t->root;
   while (*c != '\0' && curr != NULL) {
      // scan siblings
      while (curr != NULL && curr->keybit != *c)
         curr = curr->sibling;
      if (curr == NULL) return NULL;
      if (*(c+1) == '\0') return curr;
      curr = curr->child; // move down one level
      c++;                // get next character
   }
   return NULL;
}


... Trie Operations7/37

Searching and deletion in Tries:

Item *search(Trie t, Key k)
{
   TrieNode *n = find(t,k);
   if (n == NULL) return NULL;
   return (n->finish) ? &(n->data) : NULL;
}

void delete(Trie t, Key k)
{
   TrieNode *n = find(t,k);
   if (n == NULL) return;
   n->finish = 0; // lazy deletion: node remains, flag cleared
}


... Trie Operations8/37

Insertion into Trie:

TrieNode *newTrieNode(Key k, int i, Item it)
{
   TrieNode *new = malloc(sizeof(TrieNode));
   new->keybit = k[i];
   if (k[i+1] != '\0')
      new->finish = 0;
   else {
      new->finish = 1;
      new->data = it;
   }
   new->child = NULL;
   new->sibling = NULL;
   return new;
}


... Trie Operations9/37

Insertion into Trie (cont):

void insert(Trie t, Item it)
{
   Key k = key(it);
   TrieNode *n = find(t,k);
   if (n != NULL) {
      n->finish = 1;
      n->data = it; // replaces any existing Item
      return;
   }
   if (t->root == NULL) {
      t->root = newTrieNode(k,0,it);
   }
   ...


... Trie Operations10/37

Insertion into Trie (cont):

   ...
   TrieNode *curr = t->root, *prev;
   int i;
   for (i = 0; k[i] != '\0'; i++) {
      // scan siblings
      prev = NULL;
      while (curr != NULL && curr->keybit != k[i]) {
         prev = curr;
         curr = curr->sibling;
      }
      if (curr == NULL) // add new sibling
         curr = prev->sibling = newTrieNode(k,i,it);
      if (k[i+1] == '\0') break;
      if (curr->child == NULL)
         curr->child = newTrieNode(k,i+1,it);
      curr = curr->child; // move down one level
   }
}


Tries (Example)11/37

Word matching and prefix matching with a standard trie.

[Diagram:Pics/tries/trie2-small.png]


The above example and the following slides are from "Data Structures and Algorithms in Java"; Sixth Edition; Michael T. Goodrich, Roberto Tamassia and Michael H. Goldwasser; 2014; Wiley.


Compressed Tries 12/37

A compressed trie merges each chain of nodes with only one child into a single node, so that every internal node has at least two children.

[Diagram:Pics/tries/trie3-small.png]

Another example: Compact representation of compressed trie

[Diagram:Pics/tries/trie4-small.png]


The above example is from "Data Structures and Algorithms in Java"; Sixth Edition; Michael T. Goodrich, Roberto Tamassia and Michael H. Goldwasser; 2014; Wiley.


Suffix Tries 13/37

A suffix trie is a compressed trie containing all the suffixes of the given text.

[Diagram:Pics/tries/trie5-small.png]


The above example is from "Data Structures and Algorithms in Java"; Sixth Edition; Michael T. Goodrich, Roberto Tamassia and Michael H. Goldwasser; 2014; Wiley, with text from Wikipedia.


Huffman coding14/37

A Huffman code is a prefix code that is commonly used for lossless data compression.

Huffman coding uses a specific method for choosing the representation for each symbol, resulting in a prefix code (sometimes called a "prefix-free code")


Text Compression using Huffman coding15/37

Code … mapping of each character to a binary code word

Prefix code … binary code such that no code word is prefix of another code word

Encoding tree

Text compression problem

Given a text T, find a prefix code that yields the shortest encoding of T


Huffman coding: Building tree16/37

Building tree:

[Diagram:Pics/tries/huffman-tree1-small.png]

Another example:

[Diagram:Pics/tries/huffman-1-small.png]

[Diagram:Pics/tries/huffman-tbl1-small.png]


The above example and text are from Wikipedia at https://en.wikipedia.org/wiki/Huffman_coding.


Huffman coding: Example17/37

Text: a fast runner need never be afraid of the dark

[Diagram:Pics/tries/huffman-example-small.png]


Huffman coding: Analysis18/37

Analysis of Huffman's algorithm:


Pattern Matching19/37

Given two strings T (text) and P (pattern),
the pattern matching problem consists of finding a substring of T equal to P

Applications:



Pattern Matching: Brute-force21/37

Brute-force pattern matching algorithm

BruteForceMatch(T,P):
|  Input  text T of length n, pattern P of length m
|  Output starting index of a substring of T equal to P
|         -1 if no such substring exists
|
|  for all i=0..n-m do
|  |  j=0                           // check from left to right
|  |  while j<m ∧ T[i+j]=P[j] do    // test ith shift of pattern
|  |     j=j+1
|  |     if j=m then
|  |        return i                // entire pattern checked
|  |     end if
|  |  end while
|  end for
|  return -1                        // no match found


Pattern Matching: Brute-force, Analysis22/37

Brute-force pattern matching runs in O(n·m)

Examples of worst case (forward checking):


Boyer-Moore Algorithm23/37

The Boyer-Moore pattern matching algorithm is based on two heuristics:


... Boyer-Moore Algorithm24/37

Example:

[Diagram:Pics/tries/boyer-moore-small.png]


... Boyer-Moore Algorithm25/37

Boyer-Moore algorithm preprocesses pattern P and alphabet Σ to build the last-occurrence function L, mapping each c ∈ Σ to the index of the last occurrence of c in P (-1 if c does not occur in P)

Example: Σ = {a,b,c,d}, P = acab

  c     a   b   c   d
  L(c)  2   3   1   -1


... Boyer-Moore Algorithm26/37

BoyerMooreMatch(T,P,Σ):
|  Input  text T of length n, pattern P of length m, alphabet Σ
|  Output starting index of a substring of T equal to P
|         -1 if no such substring exists
|
|  L=lastOccurrenceFunction(P,Σ)
|  i=m-1, j=m-1                 // start at end of pattern
|  repeat
|  |  if T[i]=P[j] then
|  |     if j=0 then
|  |        return i            // match found at i
|  |     else
|  |        i=i-1, j=j-1
|  |     end if
|  |  else                      // character-jump
|  |     i=i+m-min(j,1+L[T[i]])
|  |     j=m-1
|  |  end if
|  until i≥n
|  return -1                    // no match


... Boyer-Moore Algorithm27/37

Case 1: j ≤ 1+L[c]

[Diagram:Pics/tries/boyer-moore-case1-small.png]

Case 2: 1+L[c] < j

[Diagram:Pics/tries/boyer-moore-case2-small.png]


Exercise 1: Boyer-Moore algorithm28/37

For the alphabet Σ = {a,b,c,d}

  1. compute last-occurrence function L for pattern P = abacab
  2. trace Boyer-Moore on P and text T = abacaabadcabacabaabb
    • how many comparisons are needed?


  c     a   b   c   d
  L(c)  4   5   3   -1

[Diagram:Pics/tries/boyer-moore-example-small.png]


13 comparisons in total


... Boyer-Moore Algorithm29/37

Analysis of Boyer-Moore algorithm:


Knuth-Morris-Pratt Algorithm30/37

The Knuth-Morris-Pratt algorithm …

Reminder:


... Knuth-Morris-Pratt Algorithm31/37

When a mismatch occurs …


[Diagram:Pics/tries/kmp-shift-small.png]


... Knuth-Morris-Pratt Algorithm32/37

KMP preprocesses the pattern to find matches of its prefixes with itself

Example: P = abaaba
  j     0  1  2  3  4  5
  P[j]  a  b  a  a  b  a
  F(j)  0  0  1  1  2  3

[Diagram:Pics/tries/kmp-failure-function-small.png]


... Knuth-Morris-Pratt Algorithm33/37

KMPMatch(T,P):
|  Input  text T of length n, pattern P of length m
|  Output starting index of a substring of T equal to P
|         -1 if no such substring exists
|
|  F=failureFunction(P)
|  i=0, j=0                    // start from left
|  while i<n do
|  |  if T[i]=P[j] then
|  |     if j=m-1 then
|  |        return i-j         // match found at i-j
|  |     else
|  |        i=i+1, j=j+1
|  |     end if
|  |  else                     // mismatch at P[j]
|  |     if j>0 then
|  |        j=F[j-1]           // resume comparing P at F[j-1]
|  |     else
|  |        i=i+1
|  |     end if
|  |  end if
|  end while
|  return -1                   // no match


Exercise 2: Knuth-Morris-Pratt Algorithm34/37

  1. compute failure function F for pattern P = abacab
  2. trace Knuth-Morris-Pratt on P and text T = abacaabadcabacabaabb


  j     0  1  2  3  4  5
  P[j]  a  b  a  c  a  b
  F(j)  0  0  1  0  1  2

[Diagram:Pics/tries/kmp-example-small.png]


... Knuth-Morris-Pratt Algorithm35/37

Construction of the failure function is similar to the KMP algorithm itself:

failureFunction(P):
|  Input  pattern P of length m
|  Output failure function for P
|
|  F[0]=0
|  i=1, j=0
|  while i<m do
|  |  if P[i]=P[j] then   // we have matched j+1 characters
|  |     F[i]=j+1
|  |     i=i+1, j=j+1
|  |  else if j>0 then    // use failure function to shift P
|  |     j=F[j-1]
|  |  else
|  |     F[i]=0           // no match
|  |     i=i+1
|  |  end if
|  end while
|  return F

Analysis of failure function computation:

⇒  failure function can be computed in O(m) time


... Knuth-Morris-Pratt Algorithm36/37

Analysis of Knuth-Morris-Pratt algorithm:

⇒  KMP's algorithm runs in optimal time O(m+n)


Boyer-Moore vs KMP37/37

Boyer-Moore algorithm

Knuth-Morris-Pratt algorithm


Produced: 18 Oct 2017