Strings and Matching

COMP2521 20T1 ♢ Strings and Matching [0/12]
❖ Strings

A string is a sequence of symbols.

An alphabet Σ is the set of possible symbols in strings.

Examples of strings:

Examples of alphabets:


Often, the "symbols" are letters (so we call them characters from now on)

COMP2521 20T1 ♢ Strings and Matching [1/12]
❖ ... Strings

Notation:

COMP2521 20T1 ♢ Strings and Matching [2/12]
❖ ... Strings


Example ...

The string  a/a  of length 3 over the ASCII alphabet has


Note:   "" means the same as λ   (= empty string)

COMP2521 20T1 ♢ Strings and Matching [3/12]
❖ ... Strings

ASCII (American Standard Code for Information Interchange)

[Diagram:Pics/ascii.png]

Often, the character with ascii code 0 is called "NUL" rather than "NULL"

COMP2521 20T1 ♢ Strings and Matching [4/12]
❖ ... Strings

Reminder ...

In C, a string is an array of chars (bytes) containing ASCII codes

Because strings are so common, C provides convenient syntax:

char str[] = "hello";
// same as 
char str[] = {'h', 'e', 'l', 'l', 'o', '\0'};

Note: str[] will have 6 elements either way

COMP2521 20T1 ♢ Strings and Matching [5/12]
❖ ... Strings

C provides string manipulation functions via #include <string.h>

Examples:

// return length of string
int strlen(char *str)
// copy one string into a string buffer
char *strcpy(char *dest, char *src)
// make a copy of a string (uses malloc())
char *strdup(char *)
// concatenate two strings, into buffer holding dest
char *strcat(char *dest, char *src)
// find substring pat inside string str
char *strstr(char *str, char *pat)
// copy at most n chars from src into a dest buffer
char *strncpy(char *dest, char *src, int n)

COMP2521 20T1 ♢ Strings and Matching [6/12]
❖ ... Strings

Details of one #include <string.h> function ...

char *strncat(char *dest, char *src, int n)

COMP2521 20T1 ♢ Strings and Matching [7/12]
❖ Pattern Matching

strstr() provides a simple example of string matching

The general string matching (aka pattern matching) problem

strstr(str,pat) returns pointer to first occurence of pat in str

T  is sometimes called "haystack", P  is sometimes called "needle"

Note that regular expression pattern matching is quite a different beast.

COMP2521 20T1 ♢ Strings and Matching [8/12]
❖ ... Pattern Matching

Example of simple (brute force) pattern matching:

[Diagram:Pics/pattern-match.png]

COMP2521 20T1 ♢ Strings and Matching [9/12]
❖ ... Pattern Matching

Brute-force pattern matching algorithm:

BruteForceMatch(T,P):
|  Input  text T of length n, pattern P of length m
|  Output starting index of a substring of T equal to P
|         -1 if no such substring exists
|
|  for all i = 0..n-m do
|  |  j = 0                       // check from left to right
|  |  while j<m ∧ T[i+j]=P[j] do  // test ith shift of pattern
|  |     j = j+1
|  |     if j=m then
|  |        return i              // entire pattern checked
|  |     end if
|  |  end while
|  end for
|  return -1                      // no match found

COMP2521 20T1 ♢ Strings and Matching [10/12]
❖ ... Pattern Matching

C version of above algorithm:

int bruteForce(char *t, char *p)
{
   int n = strlen(t);  //length of string
   int m = strlen(p);  //length of pattern
   // for each starting position of pattern
   for (int i = 0; i <= n-m; i++) {
      // scan pattern, comparing against string
      for (int j = 0; j < m; j++) {
         // mismatch detected, terminate scan
         if (T[i+j] != P[j]) break;
      }
      // if reached end of pattern, have a match
      if (j == m) return i;
   }
   return -1;  // no match
}

COMP2521 20T1 ♢ Strings and Matching [11/12]
❖ Complexity Analysis

Brute-force pattern matching runs in O(n·m)

Examples of worst case (forward checking):

COMP2521 20T1 ♢ Strings and Matching [12/12]


Produced: 25 Jul 2020