❖ Signature-based indexing |
Reminder: file organisation for signature indexing (two files)
One signature slot per tuple slot; unused signature slots are zeroed.
We use the terms "signature" and "descriptor" interchangeably
❖ Concatenated Codewords (CATC) |
In a concatenated codewords (catc) indexing scheme
❖ CATC Example |
Consider the following tuple (from bank deposit database)
Branch | AcctNo | Name | Amount |
Perryridge | 102 | Hayes | 400 |
It has the following codewords/descriptor (for m = 16, ui = 4 )
Ai | cw(Ai) |
Perryridge | 0101 |
102 | 1001 |
Hayes | 1010 |
400 | 1100 |
desc(t) | 1100101010010101 |
❖ CATC Queries |
To answer query q in CATC
E.g. consider the query (Perryridge, ?, Hayes, ?)
Ai | cw(Ai) |
Perryridge | 0101 |
? |
0000 |
Hayes |
1010 |
? |
0000 |
desc(q) | 0000101000000101 |
❖ CATC Queries (cont) |
Once we have a query descriptor, we search the signature file:
pagesToCheck = {} // scan r signatures for each descriptor D[i] in signature file { if (matches(D[i],desc(q))) { pid = pageOf(tupleID(i)) pagesToCheck = pagesToCheck ∪ pid } } // then scan bsq = bq + δ pages to check for matches
Matching can be implemented efficiently ...
#define matches(sig,qdesc) ((sig & qdesc) == qdesc)
❖ Example SIMC Query |
Consider the query and the example database:
Signature | Deposit Record |
0000101000000101 |
(Perryridge,?,Hayes,?) |
1010100101101001 |
(Brighton,217,Green,750) |
1100101010010101 |
(Perryridge,102,Hayes,400) |
1010011010010110 |
(Downtown,101,Johnshon,512) |
0110101001010011 |
(Mianus,215,Smith,700) |
1010101011000101 |
(Clearview,117,Throggs,295) |
1001010100111001 |
(Redwood,222,Lindsay,695) |
Gives two matches: one true match, one false match.
❖ CATC Parameters |
False match probablity pF = likelihood of a false match
How to reduce likelihood of false matches?
Since ui 's are relatively small, hash collisions may be a serious issue
But making ui's means larger signatures ⇒ optimisation problem
❖ CATC Parameters (cont) |
How to determine "optimal" m and u ?
Choice of ui values
❖ Query Cost for CATC |
Cost to answer pmr query: Costpmr = bD + bsq
E.g. m=64, B=8192, r=104 ⇒ cD = 1024, bD=10
bsq includes pages with rq matching tuples and rF false matches
Expected false matches = rF = (r - rq).pF ≅ r.pF if rq ≪ r
E.g. Worst bsq = rq+rF, Best bsq = 1, Avg bsq = ceil(b(rq+rF)/r)
❖ Variations on CATC |
CATC has one descriptor per tuple ... potentially inefficient.
Alternative approach: one descriptor for each data page.
Every attribute of every tuple in page contributes to descriptor.
Size of page descriptor mp = ( 1/loge 2 )2 . c.n . loge ( 1/pF )
Size of codewords is proportionally larger (unless attribute domain small)
E.g. n = 4, c = 64, pF = 10-3 ⇒ mp ≅ 3680bits ≅ 460bytes
Typically, pages are 1..8KB ⇒ 8..64 PD/page (cPD).
E.g. mp ≅ 460, B = 8192, cPD ≅ 17
❖ Variations on CATC (cont) |
Improvement: store b × mp-bit page descriptors as mp × b-bit "bit-slices"
If b = 2x then uses same storage as page descriptors
Query cost: scan ui / 2 bit-slices for each known attribute
If k is set of known attribute values, #slices = ∑i∈k ui / 2
E.g. b = 128, m = 256, n = 4, ui = 16
(a,?,c,?)
compared to scan of 128 page descriptors, where each PD is 64-bits (8-bytes)
❖ Comparison with SIMC |
Assume same m , pF , n for each method ...
CATC has ui-bit codewords, each has ≅ui / 2 bits set to 1
SIMC has m-bit codewords, each has k bits set to 1
Signatures for both have m bits, with ≅m / 2 bits set to 1
CATC has flexibility in ui, but small(er) codewords so more hash collisions
SIMC has less hash collisions, but has errors from "unfortunate" overlays
Produced: 28 Mar 2021