❖ Hashing |
Basic idea: use key value to compute page address of tuple.
e.g. tuple with key = v is stored in page i
Requires: hash function h(v) that maps KeyDomain → [0..b-1].
❖ Hashing (cont) |
PostgreSQL hash function (simplified):
Datum hash_any(unsigned char *k, int keylen) { uint32 a, b, c, len, *ka = (uint32 *)k; /* Set up the internal state */ len = keylen; a = b = c = 0x9e3779b9+len+3923095; /* handle most of the key */ while (len >= 12) { a += ka[0]; b += ka[1]; c += ka[2]; mix(a, b, c); ka += 3; len -= 12; } ... collect data from remaining bytes into a,b,c ... mix(a, b, c); return UInt32GetDatum(c); }
See backend/access/hash/hashfunc.c
mix()
❖ Hashing (cont) |
hash_any()
uint32
Two ways to map raw hash value into a page address:
uint32 hashToPageNum(uint32 hval) { uint32 mask = 0xFFFFFFFF; return (hval & (mask >> (32-k))); }
uint32 hashToPageNum(uint32 hval) { return (hval % b); }
❖ Hashing Performance |
Aims:
Best case: every bucket contains same number of tuples.
Worst case: every tuple hashes to same bucket.
Average case: some buckets have more tuples than others.
Use overflow pages to handle "overfull" buckets (cf. sorted files)
All tuples in each bucket must have same hash value.
❖ Hashing Performance (cont) |
Two important measures for hash files:
Case | L | Ov |
Best | ≅ 1 | 0 |
Worst | >> 1 | ** |
Average | < 1 | 0<Ov<1 |
To achieve average case, aim for 0.75 ≤ L ≤ 0.9.
❖ Selection with Hashing |
Select via hashing on unique key k (one)
// select * from R where k = val
getPageViaHash(R, val, P)
for each tuple t in page P {
if (t.k == val) return t
}
for each overflow page Q of P {
for each tuple t in page Q {
if (t.k == val) return t
} }
Costone : Best = 1, Avg = 1+Ov/2 Worst = 1+max(OvLen)
❖ Selection with Hashing (cont) |
Working out which page, given a key ...
getPageViaHash(Reln R, Value key, Page p) { uint32 h = hash_any(key, len(key)); PageID pid = h % nPages(R); get_page(R, pid, buf); }
❖ Selection with Hashing (cont) |
Select via hashing on non-unique hash key nk (pmr)
// select * from R where nk = val
getPageViaHash(R, val, P)
for each tuple t in page P {
if (t.nk == val) add t to results
}
for each overflow page Q of P {
for each tuple t in page Q {
if (t.nk == val) add t to results
} }
return results
Costpmr = 1 + Ov
If Ov is small (e.g. 0 or 1), very good retrieval cost
❖ Selection with Hashing (cont) |
Hashing does not help with range queries** ...
Costrange = b + bov
Selection on attribute j which is not hash key ...
Costone, Costrange, Costpmr = b + bov
** unless the hash function is order-preserving (and most aren't)
❖ Insertion with Hashing |
Insertion uses similar process to one queries.
// insert tuple t with key=val into rel R
getPageViaHash(R, val, P)
if room in page P {
insert t into P; return
}
for each overflow page Q of P {
if room in page Q {
insert t into Q; return
} }
add new overflow page Q
link Q to previous page
insert t into Q
Costinsert : Best: 1r + 1w Worst: 1+max(OvLen))r + 2w
❖ Deletion with Hashing |
Similar performance to select on non-unique key:
// delete from R where k = val // f = data file ... ovf = ovflow file getPageViaHash(R, val, P) ndel = delTuples(P,k,val) if (ndel > 0) put_page(dataFile(R),P.pid,P) for each overflow page Q of P { ndel = delTuples(Q,k,val) if (ndel > 0) put_page(ovFile(R),Q.pid,Q) }
Extra cost over select is cost of writing back modified pages.
Method works for both unique and non-unique hash keys.
❖ Problem with Hashing... |
So far, discussion of hashing has assumed a fixed file size (b).
What size file to use?
Methods for hashing with files whose size changes:
❖ Flexible Hashing |
All flexible hashing methods ...
uint32 hash(unsigned char *val)
Require a function to extract d bits from bit-string:
unit32 bits(int d, uint32 val)
Use result of bits()
❖ Flexible Hashing (cont) |
Important concept for flexible hashing: splitting
101
0101
1101
0101
101
1101
101
❖ Flexible Hashing (cont) |
Example of splitting:
Tuples only show key value; assume h(val) = val
Produced: 7 Mar 2021