Week 11


2-3-4 Trees1/75

2-3-4 trees have variable-size nodes

[Diagram:Pics/trees/2-3-4-tree-small.png]


... 2-3-4 Trees2/75

2-3-4 tree implementation:

typedef struct node Node;
typedef struct node *Tree;
struct node {
    int  order;    // 2, 3 or 4
    Item data[3];  // items in node
    Tree child[4]; // links to subtrees
};

Example:

[Diagram:Pics/trees/2-3-4-nodes-small.png]


... 2-3-4 Trees3/75

Search algorithm:

search(Tree, Key)
{
   if (empty(Tree)) return NOT_FOUND
   // scan root node, looking for key
   if (∃ i, key(data[i]) == Key)
      return Node containing data[i]
   if (Key < key(data[0]))
      return search(child[0],Key)
   if (∃ i, key(data[i]) < Key < key(data[i+1]))
      return search(child[i],Key)
   if (Key > key(data[order-1]))
      return search(child[N],Key)
   return NOT_FOUND
}


... 2-3-4 Trees4/75

Insertion algorithm:

insert(Tree, Item)
{
   Node = search(Tree, key(Item)
   Parent = parent of Node
   if (order(Node) < 4)
      insert Item in Node, order++
   else {
      promote = Node.data[1]  // middle value
      NodeL   = new Node containing data[0]
      NodeR   = new Node containing data[2]
      if (key(Item) < key(data[1]))
         insert Item in NodeL
      else
         insert Item in NodeR
      insert promote into Parent
      while (order(Parent) == 4)
         continue promote/split upwards
      if (isRoot(Parent) && order(Parent) == 4)
         split root, making new root
   }
}


... 2-3-4 Trees5/75

Insertion into a 2-node or 3-node:

[Diagram:Pics/trees/2-3-4-add-small.png]

Insertion into a 4-node (requires a split):

[Diagram:Pics/trees/2-3-4-split-small.png]


... 2-3-4 Trees6/75

Splitting the root:

[Diagram:Pics/trees/2-3-4-split-root-small.png]


... 2-3-4 Trees7/75

2-3-4 tree performance ...

Insertion (into tree of depth d) = O(d) comparisons

Search (in tree of depth d) = O(d) comparisons Depth of 2-3-4 tree with N nodes = log4N < d < log2N

Note that all paths in a 2-3-4 tree have same length d


... 2-3-4 Trees8/75

Variations on 2-3-4 trees ...

Variation #1: why stop at 4? why not 2-3-4-5 trees? or M-way trees?

Variation #2: don't have "variable-sized" nodes


Red-Black Trees


Red-Black Trees10/75

Red-black trees are a representation of 2-3-4 trees using BST nodes.

Definition of a red-black tree

Insertion algorithm: avoids worst case O(n) behaviour

Search algorithm: standard BST search


... Red-Black Trees11/75

Representing 4-nodes in red-black trees:

[Diagram:Pics/trees/234-rb-nodes-small.png]


Note: some texts colour the links rather than the nodes.


... Red-Black Trees12/75

Representing 3-nodes in red-black trees (two styles):

[Diagram:Pics/trees/234-rb-nodes2-small.png]


... Red-Black Trees13/75

Equivalent trees (one 2-3-4, one red-black):

[Diagram:Pics/trees/234-rb-tree2-small.png]


... Red-Black Trees14/75

Red-black tree implementation:

typedef enum {RED,BLACK} Colr;
typedef struct Node *Link;
typedef struct Node *Tree;
typedef struct Node {
   Item data;   // actual data
   Colr colour; // relationship to parent
   Link left;   // left subtree
   Link right;  // right subtree
} Node;

RED = node is part of the same 2-3-4 node as its parent (sibling)

BLACK = node is a child of the 2-3-4 node containing the parent


... Red-Black Trees15/75

Making new nodes requires a colour:

Node *newNode(Item it, Colr c)
{
   Node *new = malloc(sizeof(Node));
   assert(new != NULL);
   new->data = it; new->colour = c;
   new->left = new->right = NULL;
   return new;
}


... Red-Black Trees16/75

Searching method is standard BST search:

Item *search(Tree t, Key k)
{
   if (t == NULL) return NULL;
   int diff = cmp(k, key(t->data));
   if (diff < 0)
      return search(t->left, k);
   else if (diff > 0)
      return search(t->right, k);
   else // matches
      return &(t->data);
}


Exercise 1: 2-3-4 vs Red-Black Insertion17/75

Show the 2-3-4 tree resulting from the insertion of:

10  5  9  6  2  4  20  15  18  19  17  12  13  14


Symbol Tables


Symbol Tables19/75

A symbol table (dictionary) is a collection of items

SymTab insert(SymTab t, Item it) { ... }
Item *search(SymTab t, Key k) { ... }

Applications of symbol tables:


Key-Indexed Symbol Table20/75

Consider the following special case:

Leads to very efficient representation


... Key-Indexed Symbol Table21/75

Data representation:

[Diagram:Pics/searching/key-indexed-small.png]


Note: UNUSED is distinguished from all other Item values.

Essentially a simple form of hashing (see later).


Exercise 2: Key-to-index Mapping22/75

Define a function which

E.g. lo == 12, hi == 27, a[16]

indexOf(12) == 0, indexOf(17) = 5,

indexOf(21) == 9, indexOf(27) == 15


Symbol Table Representations23/75

Symbol tables can be represented in many ways:

  • key-indexed array (max # items, restricted key space)
  • key-sorted array (max # items, using binary search)
  • linked list (unlimited items, sorted list?)
  • binary search tree (unlimited items, traversal orders)

    Costs (assuming N items):

    TypeSearch costs
     minmaxavg
    key-indexed11*1
    sorted array1log2Nlog2N
    linked list1NN/2
    BSTree1N*log2N


    Hashing


    Hashing25/75

    Key-indexed arrays had "perfect" search performance O(1)

    Hashing allows us to approximate this performance, but


    ... Hashing26/75

    The ideal for key-indexed collections:

    courses["COMP3311"] = "Database Systems";
    printf("%s\n", courses["COMP3311"]);
    

    Almost as good:

    courses[h("COMP3311")] = "Database Systems";
    printf("%s\n", courses[h("COMP3311")]);
    

    In practice:

    item = {"COMP3311","Database Systems"};
    courses = insert(courses, item);
    printf("%s\n", search(courses, "COMP3311"));
    


    ... Hashing27/75

    To use arbitrary values as keys, we need three things:


    ... Hashing28/75

    Generalised ADT for a collection of Items

    Interface:

    typedef struct CollectionRep *Collection;
    
    Collection newCollection();    // make new empty collection
    Item *search(Collection, Key); // find item with key
    void insert(Collection, Item); // add item into collection
    void delete(Collection, Key);  // drop item with key
    

    Implementation:

    typedef struct CollectionRep {
       ... some data structure to hold multiple Items ...
    } CollectionRep;
    


    ... Hashing29/75

    For hash tables, we make one change to interface:

    typedef struct HashTabRep *HashTable;
    // make new empty table of size N
    HashTable newHashTable(int);
    Item *search(HashTable, Key); // find item with key
    void insert(HashTable, Item); // add item into collection
    void delete(HashTable, Key);  // drop item with key
    

    Implementation:

    typedef struct HashTabRep {...Items[N]...} HashTabRep;
    ... plus ...
    int hash(Key k, int N);  // hash function giving 0..N-1
    


    Exercise 3: Hash Lab30/75

    Implement a HashLab which:


    Hashing 31/75

    Hashing is a method for maintaining a collection

    Requires a function to map keys to indexes:   hash:  Key → 0..N-1

    [Diagram:Pics/hashing/hashing-review-small.png]


    ... Hashing 32/75

    Hash table interface:

    typedef struct HashTabRep *HashTable;
    // make new empty table of size N
    HashTable newHashTable(int);
    // find item with key
    Item *search(HashTable, Key);
    // add item into collection
    void insert(HashTable, Item);
    // drop item with key
    void delete(HashTable, Key);
    


    Exercise 4: Hash Lab33/75

    Implement a HashLab which:


    ... Hashing 34/75

    Example hash table implementation:

    typedef struct HashTabRep {
       int  N;       // size of array
       Item **items; // array of (Item *)
    } HashTabRep;
    
    HashTable newHashTable(int N)
    {
       HashTable new = malloc(sizeof(HashTabRep));
       new->items = malloc(N*sizeof(Item *));
       new->N = N;
       for (int i = 0; i < N; i++)
          { new->items[i] = NULL; }
       return new;
    }
    


    ... Hashing 35/75

    Idealised versions of HashTable operations:

    Item *search(HashTable ht, Key k)
    {
        int i = hash(k, ht->N);
        return ht->items[i];
    }
    void insert(HashTable ht, Item it)
    {
        int i = hash(key(it), ht->N);
        ht->items[i] = newItem(it);
    }
    void delete(HashTable ht, Key k)
    {
        int i = hash(k, ht->N);
        free(ht->items[i]);
        ht->items[i] = NULL;
    }
    


    Hash Functions36/75

    Points to note:


    ... Hash Functions37/75

    Basic idea behind hash function

    int hash(Key key, int N)
    {
       int val = convert key to int;
       return val % N;
    }
    

    If keys are ints, conversion is easy (identity function)

    How to convert keys which are strings?   (e.g. "COMP1927" or "9300035")


    Exercise 5: Hash Functions (i)38/75

    Consider this potential hash function:

    int hash(char *key, int N)
    {
        int h = 0; char *c;
        for (c = key; *c != '\0'; c++)
            h = h + *c;
        return h % N;
    }
    

    How does this function convert strings to ints?

    What are the deficiencies with this function and how can it be improved?


    ... Hash Functions39/75

    A slightly more sophisticated hash function

    int hash(char *key, int N)
    {
       int h = 0;  char *c;
       int a = 127; // a prime number
       for (c = key; *c != '\0'; c++)
          h = (a * h + *c) % N;
       return h;
    }
    

    Converts strings into integers in table range.

    But poor choice of a (e.g. 128) can result in poor hashing.


    ... Hash Functions40/75

    To use all of value in hash, with suitable "randomization":

    int hash(char *key, int N)
    {
       int h = 0, a = 31415, b = 21783;
       char *c;
       for (c = key; *c != '\0'; c++) {
          a = a*b % (N-1);
          h = (a * h + *c) % N;
       }
       return h;
    }
    

    This approach is known as universal hashing.


    ... Hash Functions41/75

    A real hash function (from PostgreSQL DBMS):

    
    hash_any(unsigned char *k, register int keylen, int N)
    {
        register uint32 a, b, c, len;
        // set up internal state
        len = keylen;
        a = b = 0x9e3779b9;
        c = 3923095;
        // handle most of the key, in 12-char chunks
        while (len >= 12) {
            a += (k[0] + (k[1] << 8) + (k[2] << 16) + (k[3] << 24));
            b += (k[4] + (k[5] << 8) + (k[6] << 16) + (k[7] << 24));
            c += (k[8] + (k[9] << 8) + (k[10] << 16) + (k[11] << 24));
            mix(a, b, c);
            k += 12; len -= 12;
        }
        // collect any data from remaining bytes into a,b,c
        mix(a, b, c);
        return c % N;
    }
    


    ... Hash Functions42/75

    Where mix is defined as:

    
    #define mix(a,b,c) \
    { \
      a -= b; a -= c; a ^= (c>>13); \
      b -= c; b -= a; b ^= (a<<8);  \
      c -= a; c -= b; c ^= (b>>13); \
      a -= b; a -= c; a ^= (c>>12); \
      b -= c; b -= a; b ^= (a<<16); \
      c -= a; c -= b; c ^= (b>>5);  \
      a -= b; a -= c; a ^= (c>>3);  \
      b -= c; b -= a; b ^= (a<<10); \
      c -= a; c -= b; c ^= (b>>15); \
    }
    

    i.e. scrambles all of the bits from the bytes of the key value


    Hash Table ADT43/75

    Enhanced concrete data representation:

    #include "Item.h"  // Item has key and data
    
    #define NoItem distinguished Item value
    
    typedef struct HashTabRep {
       Item *items; // array of Items
       int  nslots; // # elements in array  (was called N)
       int  nitems; // # items stored in array
    } HashTabRep;
    
    typedef HashTabRep *HashTable;
    


    Exercise 6: NoItem values44/75

    Suggest suitable NoItem values if


    ... Hash Table ADT45/75

    Hash table initialisation:

    HashTable newHashTable(int N)
    {
       HashTabRep *new = malloc(sizeof(HashTabRep));
       assert(new != NULL);
       new->items = malloc(N*sizeof(Item));
       assert(new->items != NULL);
       for (int i = 0; i < N; i++)
          new->items[i] = NoItem;
       new->nitems = 0; new->nslots = N;
       return new;
    }
    


    ... Hash Table ADT46/75

    Search function

    Item *search(HashTable ht, Key k) {
       int i = hash(k, ht->nslots);
       if (ht->items[i] == NoItem)
          return NULL;
       else if (key(ht->items[i]) != k)
          return NULL;
       else
          return &(ht->items[i]);
    }
    


    ... Hash Table ADT47/75

    Functions to maintain hash table:

    void insert(HashTable ht, Item it) {
       int i = hash(key(it), ht->nslots);
       if (ht->items[i] == NoItem)
          { ht->items[i] = it; ht->nitems++; }
       else if (key(ht->items[i] == key(it))
          ht->items[i] = it;  // update
       else { // (key(ht->items[i] != key(it))
          // ... what to do? 
       }
    }
    void delete(HashTable ht, Key k) {
       int i = hash(k, ht->nslots);
       if (ht->items[i] == NoItem)
          return;
       else if (key(ht->items[i] == k)
          { ht->items[i] = NoItem; ht->nitems--; }
       else { // (key(ht->items[i] != key(it))
          return;  // no item with key k in table
       }
    }
    


    Problems with Hashing48/75

    In ideal scenarios, search cost in hash table is O(1).

    Problems with hashing:


    Exercise 7: Expanding Hash Table49/75

    Write a function

    HashTable expand(HashTable ht) { ... }
    

    which doubles the number of slots in a hash table


    Collision Resolution50/75

    Three approaches to dealing with hash collisions:


    Separate Chaining51/75

    Solve collisions by having multiple items per array entry.

    Make each element the start of linked-list of Items.

    [Diagram:Pics/hashing/hash-linked-small.png]


    ... Separate Chaining52/75

    Concrete data structure for hashing via chaining

    typedef struct HashTabRep {
       List *lists; // array of Lists of Items
       int  nslots; // # elements in array
       int  nitems; // # items stored in HashTable
    } HashTabRep;
    
    HashTable newHashTable(int N)
    {
       HashTabRep *new = malloc(sizeof(HashTabRep));
       assert(new != NULL);
       new->lists = malloc(N*sizeof(List));
       assert(new->lists != NULL);
       for (int i = 0; i < N; i++)
          new->lists[i] = newList();
       new->nslots = N; new->nitems = 0;
       return new;
    }
    


    ... Separate Chaining53/75

    Using the List ADT, search becomes:

    #include "List.h" 
    Item *search(HashTable ht, Key k)
    {
       int i = hash(k, ht->nslots);
       return ListSearch(ht->lists[i], k);
    }
    

    Even without List abstraction, easy to implement.

    Using sorted lists gives only small performance gain.


    ... Separate Chaining54/75

    Other list operations are also simple:

    #include "List.h"
    
    void insert(HashTable ht, Item it) {
       Key k = key(it);
       int i = hash(k, ht->nslots);
       ListInsert(ht->lists[i], it);
    }
    void delete(HashTable ht, Key k) {
       int i = hash(k, ht->nslots);
       ListDelete(ht->lists[i], k);
    }
    

    Essentially: select a list; operate on that list.


    ... Separate Chaining55/75

    Cost analysis:

    Ratio of items/slots is called load α = M/N


    Linear Probing56/75

    Collision resolution by finding a new location for Item

    Examples:

    [Diagram:Pics/hashing/hash-linear-small.png]


    Hashing 57/75

    Hashing is a method for maintaining a collection

    Requires a function to map keys to indexes:   hash:  Key → 0..N-1

    [Diagram:Pics/hashing/hashing-review-small.png]


    ... Hashing 58/75

    Hash table interface:

    typedef struct HashTabRep *HashTable;
    // make new empty table of size N
    HashTable newHashTable(int);
    // find item with key
    Item *search(HashTable, Key);
    // add item into collection
    void insert(HashTable, Item);
    // drop item with key
    void delete(HashTable, Key);
    


    ... Hashing 59/75

    Possible concrete data representation:

    #define NoItem distinguished Item value
    
    typedef struct HashTabRep {
       Item *items; // array of Items
       int  nslots; // # elements in array
       int  nitems; // # items stored in array
    } HashTabRep;
    
    typedef HashTabRep *HashTable;
    

    Assume:  key(NoItem) matches no real Key value


    ... Hashing 60/75

    Consider a hash table with N slots ...

    When using chaining

    When using a fixed-size array


    Linear Probing61/75

    [Diagram:Pics/hashing/hash-linear-small.png]


    ... Linear Probing62/75

    Insert function for linear probing:

    void insert(HashTable ht, Item it)
    {
       assert(ht->nitems < ht->nslots);
       int N = ht->nslots;
       Item *a = ht->items;
       Key k = key(it);
       int i, j, h = hash(k,N);
       for (j = 0; j < N; j++) {
          i = (h+j)%N;
          if (a[i] == NoItem) break;
          if (eq(k,key(a[i]))) break;
       }
       if (a[i] == NoItem) ht->nitems++;
       a[i] = it;
    }
    


    ... Linear Probing63/75

    Search function for linear probing:

    Item *search(HashTable ht, Key k)
    {
       int N = ht->nslots;
       Item *a = ht->items;
       int i, j, h = hash(k,N);
       for (j = 0; j < N; j++) {
          i = (h+j)%N;
          if (a[i] == NoItem) return NULL;
          if (eq(k,key(a[i]))) return &(a[i]);
       }
       return NULL;
    }
    


    ... Linear Probing64/75

    Search cost analysis:

    Example costs:
    load (α)1/2  2/3  3/4  9/10
    search hit1.52.03.05.5
    search miss2.55.08.555.5
    Assumes reasonably uniform data and good hash function.


    ... Linear Probing65/75

    Deletion slightly tricky for linear probing.

    Need to ensure no NoItem in middle of "probe path"
    (i.e. previously relocated items moved to appropriate location)

    [Diagram:Pics/hashing/hash-probe-delete-small.png]


    ... Linear Probing66/75

    Delete function for linear probing:

    void delete(HashTable ht, Key k)
    {
       int N = ht->nslots;
       Item *a = ht->items;
       int i, j, h = hash(k,N);
       for (j = 0; j < N; j++) {
          i = (h+j)%N;
          if (a[i] == NoItem) return; // k not in table
          if (eq(k,key(a[i]))) break;
       }
       a[i] = NoItem;
       ht->nitems--;
       // clean up probe path
       j = i+1;
       while (a[j] != NoItem) {
          Item it = a[j];
          a[j] = NoItem;
          ht->nitems--;
          insert(ht, it);
          j = (j+1)%N;
       }
    }
    


    Exercise 8: Linear Probing Example67/75

    Consider a linear-probed hash table

    Show the result of inserting items with these keys
    1. 1, 2, 3, 4, 5, 6, 7, 8, 9
    2. 15, 6, 20, 3, 17, 14, 33, 5
    into an initially empty table


    Exercise 9: Alternative Deletion Handling68/75

    To simplify NoItem deletion problem ...

    Give a data structure for this and re-implement functions.


    ... Linear Probing69/75

    A problem with linear probing: clusters

    E.g. insert 5, 6, 15, 16, 7, 17, with hash = k%10

    [Diagram:Pics/hashing/clustering-small.png]


    Double Hashing70/75

    Double hashing improves on linear probing:

    To generate relatively prime


    ... Double Hashing71/75

    Concrete data structures for hashing via double hashing:

    typedef struct HashTabRep {
       Item *items; // array of Items
       int  nslots; // # elements in array
       int  nitems; // # items stored in HashTable
       int  nhash2; // second hash mod
    } HashTabRep;
    
    #define hash2(k,N2) (((k)%N2)+1)
    
    HashTable newHashTable(int N)
    {
       HashTabRep *new = malloc(sizeof(HashTabRep));
       assert(new != NULL);
       new->items = malloc(N*sizeof(Item));
       assert(new->items != NULL);
       for (int i = 0; i < N; i++)
          new->items[i] = NoItem;
       new->nslots = N; new->nitems = 0;
       new->nhash2 = findSuitablePrime(N);
       return new;
    }
    


    ... Double Hashing72/75

    Search function for double hashing:

    Item *search(HashTable ht, Key k)
    {
       int N = ht->nslots;
       Item *data = ht->items;
       int i, j, h = hash(k,N);
       int incr = hash2(k,ht->nhash2);
       for (j = 0, i = h; j < N; j++) {
          if (eq(k,key(data[i]) == 0)
             return &(data[i]);
          i = (i+incr)%N;
       }
       return NULL;
    }
    


    ... Double Hashing73/75

    Insert function for double hashing:

    void insert(HashTable ht, Item it)
    {
       int N = ht->nslots;
       Item *data = ht->items;
       Key k = key(it);
       int i, j, h = hash(k,N);
       int incr = hash2(k,ht->nhash2);
       for (j = 0; j < N; j += incr) {
          ix = (i+j)%N;
          if (cmp(k,key(data[ix]) == 0)
             break;
          else if (data[ix] == NoItem)
             break;
       }
       assert(j != N); // table full
       if (data[ix] == NoItem) ht->nitems++;
       data[ix] = it;
    }
    


    ... Double Hashing74/75

    Costs for double hashing:

    load (α)1/2  2/3  3/4  9/10
    search hit1.41.61.82.6
    search miss1.52.03.05.5

    Can be significantly better than linear probing


    Hashing Summary75/75

    Collision resolution approaches:

    Only chaining allows α > 1, but performance degrades once α > 1

    Once M exceeds initial choice of N,


    Produced: 9 Oct 2017