Hashed Files

COMP9315 21T1 ♢ Hashed Files ♢ [0/15]
❖ Hashing

Basic idea: use key value to compute page address of tuple.

[Diagram:Pics/file-struct/hash.png]

e.g. a tuple with key = v  is stored in page i, where i = h(v)

Requires: hash function h(v) that maps KeyDomain → [0..b-1].
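For example, with b = 4 pages and an illustrative hash function h(v) = v mod 4, a tuple with key v = 13 would be stored in page h(13) = 1.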

COMP9315 21T1 ♢ Hashed Files ♢ [1/15]
❖ Hashing (cont)

PostgreSQL hash function (simplified):


Datum hash_any(unsigned char *k, int keylen)
{
   uint32 a, b, c, len, *ka = (uint32 *)k;
   /* Set up the internal state */
   len = keylen;
   a = b = c = 0x9e3779b9+len+3923095;
   /* handle most of the key */
   while (len >= 12) {
      a += ka[0]; b += ka[1]; c += ka[2];
      mix(a, b, c);
      ka += 3;  len -= 12;
   }
   ... collect data from remaining bytes into a,b,c ...
   mix(a, b, c);
   return UInt32GetDatum(c);
}

See backend/access/hash/hashfunc.c for details  (incl mix())

COMP9315 21T1 ♢ Hashed Files ♢ [2/15]
❖ Hashing (cont)

hash_any() gives hash value as 32-bit quantity (uint32).

Two ways to map raw hash value into a page address:
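One usual way is to take the hash value mod the number of pages (as getPageViaHash() does on a later slide); a common alternative, when the number of pages is a power of two, is to mask off the low-order bits of the hash. A minimal sketch of both (illustrative helper names, not PostgreSQL functions):

typedef unsigned int uint32;
typedef uint32 PageID;

// works for any file size b
PageID pageViaMod(uint32 hval, uint32 b)
{
   return hval % b;
}

// works when file size b = 2^k: keep the k low-order bits of the hash
PageID pageViaMask(uint32 hval, uint32 k)
{
   return hval & ((1u << k) - 1);
}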

COMP9315 21T1 ♢ Hashed Files ♢ [3/15]
❖ Hashing Performance

Aims: distribute tuples evenly amongst buckets, so that most pages are nearly full.

Note: if data distribution not uniform, address distribution can't be uniform.

Best case: every bucket contains same number of tuples.

Worst case: every tuple hashes to same bucket.

Average case: some buckets have more tuples than others.

Use overflow pages to handle "overfull" buckets  (cf. sorted files)

All tuples in each bucket must have same hash value.

COMP9315 21T1 ♢ Hashed Files ♢ [4/15]
❖ Hashing Performance (cont)

Two important measures for hash files:
   load factor:  L  =  r / (b·c)    (r = # tuples, b = # data pages, c = tuples per page)
   average overflow chain length:  Ov  =  b_ov / b

Three cases for distribution of tuples in a hashed file:

Case      L        Ov
Best      ≅ 1      0
Worst     >> 1     **
Average   < 1      0 < Ov < 1

(** performance is same as Heap File)

To achieve average case, aim for   0.75  ≤  L  ≤  0.9.
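For example, using L = r / (b·c) (r tuples, c tuples per page, b data pages): storing r = 100,000 tuples at c = 100 tuples per page with a target L = 0.8 needs b = 100000 / (0.8 × 100) = 1250 data pages.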

COMP9315 21T1 ♢ Hashed Files ♢ [5/15]
❖ Selection with Hashing

Select via hashing on unique key k (one)

// select * from R where k = val
getPageViaHash(R, val, P)
for each tuple t in page P {
    if (t.k == val) return t
}
for each overflow page Q of P {
    for each tuple t in page Q {
        if (t.k == val) return t
}   }

Cost_one :   Best = 1,   Avg = 1 + Ov/2,   Worst = 1 + max(OvLen)
(the average assumes the matching tuple is found about halfway along the overflow chain)

COMP9315 21T1 ♢ Hashed Files ♢ [6/15]
❖ Selection with Hashing (cont)


Working out which page, given a key ...

getPageViaHash(Reln R, Value key, Page p)
{
   uint32 h = hash_any(key, len(key));
   PageID pid = h % nPages(R);
   get_page(R, pid, p);
}

COMP9315 21T1 ♢ Hashed Files ♢ [7/15]
❖ Selection with Hashing (cont)

Select via hashing on non-unique hash key nk (pmr)

// select * from R where nk = val
getPageViaHash(R, val, P) 
for each tuple t in page P {
    if (t.nk == val) add t to results
}
for each overflow page Q of P {
    for each tuple t in page Q {
        if (t.nk == val) add t to results
}   }
return results

Cost_pmr  =  1 + Ov

If Ov is small (e.g. 0 or 1), very good retrieval cost

COMP9315 21T1 ♢ Hashed Files ♢ [8/15]
❖ Selection with Hashing (cont)

Hashing does not help with range queries** ...

Cost_range  =  b + b_ov


Selection on attribute j which is not hash key ...

Cost_one,   Cost_range,   Cost_pmr  =  b + b_ov


** unless the hash function is order-preserving (and most aren't)

COMP9315 21T1 ♢ Hashed Files ♢ [9/15]
❖ Insertion with Hashing

Insertion uses a similar process to one queries.

// insert tuple t with key=val into rel R
getPageViaHash(R, val, P) 
if room in page P {
    insert t into P; return
}
for each overflow page Q of P {
    if room in page Q {
        insert t into Q; return
}   }
add new overflow page Q
link Q to previous page
insert t into Q

Cost_insert :    Best: 1r + 1w    Worst: (1 + max(OvLen))r + 2w

COMP9315 21T1 ♢ Hashed Files ♢ [10/15]
❖ Deletion with Hashing

Similar performance to select on non-unique key:

// delete from R where k = val
// dataFile(R) = data file ... ovFile(R) = overflow file
getPageViaHash(R, val, P)
ndel = delTuples(P,k,val)
if (ndel > 0) put_page(dataFile(R),P.pid,P)
for each overflow page Q of P {
    ndel = delTuples(Q,k,val)
    if (ndel > 0) put_page(ovFile(R),Q.pid,Q)
}

Extra cost over select is cost of writing back modified pages.

Method works for both unique and non-unique hash keys.

COMP9315 21T1 ♢ Hashed Files ♢ [11/15]
❖ Problem with Hashing...

So far, discussion of hashing has assumed a fixed file size (b).

What size file to use?

Problem: change file size ⇒ change hash function ⇒ rebuild file

Methods for hashing with files whose size changes:

COMP9315 21T1 ♢ Hashed Files ♢ [12/15]
❖ Flexible Hashing

All flexible hashing methods start with a hash function that converts a value to a bit-string:

uint32 hash(unsigned char *val)

Require a function to extract d  bits from bit-string:

uint32 bits(int d, uint32 val)

Use result of bits() as page address.
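A minimal sketch, assuming bits() simply takes the d low-order bits of the hash value (which bits a particular scheme uses is its own choice):

// take the d low-order bits of a 32-bit hash value (assumes 0 < d < 32)
uint32 bits(int d, uint32 val)
{
   return val & ((1u << d) - 1);
}

// e.g. bits(3, h) gives a page address in the range [0..7]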

COMP9315 21T1 ♢ Hashed Files ♢ [13/15]
❖ Flexible Hashing (cont)

Important concept for flexible hashing: splitting

Splitting redistributes the tuples in one bucket between that bucket and a new bucket, using one extra bit of the hash value to decide where each tuple goes (see the sketch after the example below).

Result: expandable data file, never requiring a complete file rebuild
COMP9315 21T1 ♢ Hashed Files ♢ [14/15]
❖ Flexible Hashing (cont)

Example of splitting:

[Diagram:Pics/file-struct/split.png]

Tuples only show key value; assume h(val) = val
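A minimal, self-contained sketch of the splitting step (assuming h(val) = val as in the example, and a simplified in-memory bucket; splitBucket() and its types are illustrative, not part of any real file structure code):

#include <stdio.h>

#define MAXKEYS 8
typedef unsigned int uint32;
typedef struct { uint32 keys[MAXKEYS]; int nkeys; } Bucket;

// split a bucket: examine one extra bit (bit d) of each key's hash value;
// keys with bit d = 0 stay in the old bucket, keys with bit d = 1 move
void splitBucket(Bucket *oldb, Bucket *newb, int d)
{
   int i, n = oldb->nkeys;
   oldb->nkeys = newb->nkeys = 0;
   for (i = 0; i < n; i++) {
      uint32 k = oldb->keys[i];          // h(k) = k in this example
      if ((k >> d) & 1)
         newb->keys[newb->nkeys++] = k;  // bit d is 1: move to new page
      else
         oldb->keys[oldb->nkeys++] = k;  // bit d is 0: stay in old page
   }
}

int main(void)
{
   int i;
   // splitting page 1 of a 4-page file uses hash bit 2, so tuples either
   // stay in page 1 or move to its "buddy" page 1 + 4 = 5
   Bucket b = {{1, 5, 9, 13}, 4}, nb = {{0}, 0};
   splitBucket(&b, &nb, 2);
   printf("page 1:");
   for (i = 0; i < b.nkeys; i++) printf(" %u", b.keys[i]);
   printf("\npage 5:");
   for (i = 0; i < nb.nkeys; i++) printf(" %u", nb.keys[i]);
   printf("\n");
   return 0;
}

Here keys 1 and 9 stay in page 1 (their hash values end in 001), while 5 and 13 move to page 5 (hash values ending in 101).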

COMP9315 21T1 ♢ Hashed Files ♢ [15/15]


Produced: 7 Mar 2021