Boyer-Moore String Matching

COMP2521 20T2 ♢ Boyer-Moore String Matching [0/12]
❖ String Matching

String matching problem

Example:
T = i l i k e p a t t e r n s
P = p a t

a match occurs when T  and P  are aligned as follows

T = i l i k e p a t t e r n s
P =           p a t
COMP2521 20T2 ♢ Boyer-Moore String Matching [1/12]
❖ ... String Matching

A naive approach to solving this problem works as follows

[Diagram:Pics/naive-string-match.png]

COMP2521 20T2 ♢ Boyer-Moore String Matching [2/12]
❖ Boyer-Moore Algorithm

The Boyer-Moore string matching algorithm

It is based on two heuristics:
COMP2521 20T2 ♢ Boyer-Moore String Matching [3/12]
❖ ... Boyer-Moore Algorithm

Boyer-Moore algorithm preprocesses pattern P and alphabet Σ

L maps Σ to integers such that L(c) is defined as Example:  Σ = {…,a,b,c,d,e,f,…},   P = acab  

c...abcdef...
L(c)...231-1-1-1...

L can be represented by an array indexed by the ascii codes of the chars

COMP2521 20T2 ♢ Boyer-Moore String Matching [4/12]
❖ ... Boyer-Moore Algorithm

The lastOccurences function to build L

intArray lastOccurences(P, Σ):
|  Input  pattern string P, alphabet Σ
|  Output array containing last
|            position of each character in pattern
|         characters not in pattern have "position" -1
|
|  L = make array of size |Σ|
|  m = length(P)
|  // set all values in L to -1
|  for each ch ∈ Σ do
|     L[ch] = -1
|  end for
|  for each i = 0 .. m-1 do
|     L[P[i]] = i
|  end for
|  return L

COMP2521 20T2 ♢ Boyer-Moore String Matching [5/12]
❖ ... Boyer-Moore Algorithm

When a mismatch occurs at T[i] = ch ...

  • if P  contains ch  ⇒ shift P  to align the last occurrence of ch  in P  with T[i]
  • otherwise ⇒ shift P  to align P[0]  with T[i+1]    (a.k.a. "big jump") Examples:

    [Diagram:Pics/mismatches.png]

  • COMP2521 20T2 ♢ Boyer-Moore String Matching [6/12]
    ❖ ... Boyer-Moore Algorithm


    A complete example of matching with multiple "big jumps":


    [Diagram:Pics/boyer-moore.png]

    COMP2521 20T2 ♢ Boyer-Moore String Matching [7/12]
    ❖ ... Boyer-Moore Algorithm

    int BoyerMooreMatch(T,P,Σ):
    |  Input  text T of length n, pattern P of length m, alphabet Σ
    |  Output starting index of a substring of T equal to P
    |         -1 if no such substring exists
    |
    |  L=lastOccurences(P,Σ)
    |  i=m-1, j=m-1                 // start at end of pattern
    |  repeat
    |  |  if T[i]=P[j] then
    |  |     if j=0 then
    |  |        return i            // match found at i
    |  |     else
    |  |        i=i-1, j=j-1
    |  |     end if
    |  |  else                      // character-jump
    |  |     i=i+m-min(j,1+L[T[i]])
    |  |     j=m-1
    |  |  end if
    |  until i≥n
    |  return -1                    // no match
    

    COMP2521 20T2 ♢ Boyer-Moore String Matching [8/12]
    ❖ ... Boyer-Moore Algorithm

    Case 1: j ≤ 1+L[c]

    [Diagram:Pics/boyer-moore-case1.png]

    COMP2521 20T2 ♢ Boyer-Moore String Matching [9/12]
    ❖ ... Boyer-Moore Algorithm

    Case 2: 1+L[c] < j

    [Diagram:Pics/boyer-moore-case2.png]

    COMP2521 20T2 ♢ Boyer-Moore String Matching [10/12]
    ❖ Example Execution

    For the alphabet Σ = {a,b,c,d} and P  = abacab ...

    1. compute the last-occurrence table L

      cabcd
      L(c)453-1

    2. count comparisons searching for P in T = abacaabadcabacabaabb

      [Diagram:Pics/boyer-moore-example.png]

    COMP2521 20T2 ♢ Boyer-Moore String Matching [11/12]
    ❖ Analysis of Algorithm

    Reminder:

    Analysis of Boyer-Moore algorithm:
    Example of worst case:     T  = aaa … a   P  = baaa

    Worst case may occur in images or DNA sequences but unlikely in text

    Boyer-Moore significantly faster than brute-force on English text

    COMP2521 20T2 ♢ Boyer-Moore String Matching [12/12]


    Produced: 8 Aug 2020