❖ Signature-based indexing |
Reminder: file organisation for signature indexing (two files)
One signature slot per tuple slot; unused signature slots are zeroed.
❖ Superimposed Codewords (SIMC) |
In a superimposed codewords (simc) indexing scheme
OR
OR
OR
bits desc = 0 for (i = 1; i <= n; i++) { bits cw = codeword(A[i],m,k) desc = desc | cw }
❖ SIMC Example |
Consider the following tuple (from bank deposit database)
Branch | AcctNo | Name | Amount |
Perryridge | 102 | Hayes | 400 |
It has the following codewords/descriptor (for m = 12, k = 2 )
Ai | cw(Ai) |
Perryridge | 010000000001 |
102 | 000000000011 |
Hayes | 000001000100 |
400 | 000010000100 |
desc(t) | 010011000111 |
❖ SIMC Queries |
To answer query q in SIMC
E.g. consider the query (Perryridge, ?, ?, ?)
Ai | cw(Ai) |
Perryridge | 010000000001 |
? |
000000000000 |
? |
000000000000 |
? |
000000000000 |
desc(q) | 010000000001 |
❖ SIMC Queries (cont) |
Once we have a query descriptor, we search the signature file:
pagesToCheck = {} // scan r signatures for each descriptor D[i] in signature file { if (matches(D[i],desc(q))) { pid = pageOf(tupleID(i)) pagesToCheck = pagesToCheck ∪ pid } } // then scan bsq = bq + δ pages to check for matches
Matching can be implemented efficiently ...
#define matches(sig,qdesc) ((sig & qdesc) == qdesc)
❖ Example SIMC Query |
Consider the query and the example database:
Signature | Deposit Record |
010000000001 |
(Perryridge,?,?,?) |
100101001001 |
(Brighton,217,Green,750) |
010011000111 |
(Perryridge,102,Hayes,400) |
101001001001 |
(Downtown,101,Johnshon,512) |
101100000011 |
(Mianus,215,Smith,700) |
010101010101 |
(Clearview,117,Throggs,295) |
100101010011 |
(Redwood,222,Lindsay,695) |
Gives two matches: one true match, one false match.
❖ SIMC Parameters |
False match probablity pF = likelihood of a false match
How to reduce likelihood of false matches?
Having k too high ⇒ increased overlapping.
Having k too low ⇒ increased hash collisions.
❖ SIMC Parameters (cont) |
How to determine "optimal" m and k?
m = ( 1/loge 2 )2 . n . loge ( 1/pF )
Formula from Bloom (1970), "Space/Time Trade-offs in Hash Coding with Allowable Errors", Communications of the ACM, 13 (7): 422–426
❖ Query Cost for SIMC |
Cost to answer pmr query: Costpmr = bD + bsq
E.g. m=64, B=8192, r=104 ⇒ cD = 1024, bD=10
bsq includes pages with rq matching tuples and rF false matches
Expected false matches = rF = (r - rq).pF ≅ r.pF if rq ≪ r
E.g. Worst bsq = rq+rF, Best bsq = 1, Avg bsq = ceil(b(rq+rF)/r)
❖ Page-level SIMC |
SIMC has one descriptor per tuple ... potentially inefficient.
Alternative approach: one descriptor for each data page.
Every attribute of every tuple in page contributes to descriptor.
Size of page descriptor (PD) (clearly larger than tuple descriptor):
Typically, pages are 1..8KB ⇒ 8..64 PD/page (cPD).
E.g. mp ≅ 460, B = 8192, cPD ≅ 17
❖ Page-level SIMC (cont) |
Algorithm for evaluating pmr query using page descriptors
pagesToCheck = {} // scan b mp-bit page descriptors for each descriptor D[i] in signature file { if (matches(D[i],desc(q))) { pid = i pagesToCheck = pagesToCheck ∪ pid } } // read and scan bsq data pages for each pid in pagesToCheck { Buf = getPage(dataFile,pid) check tuples in Buf for answers }
❖ Bit-sliced SIMC (cont) |
Algorithm for evaluating pmr query using bit-sliced descriptors
matches = ~0 //all ones // scan m r-bit slices for each bit i set to 1 in desc(q) { slice = fetch bit-slice i matches = matches & slice } for each bit i set to 1 in matches { fetch page i scan page for matching records }
Effective because desc(q) typically has less than half bits set to 1
❖ Comparison of Approaches |
Tuple-based
❖ Signature-based Indexing in PostgreSQL |
PostgreSQL supports signature based indexing via the bloom
(Signature-based indexes like this are often called Bloom filters)
Creating a Bloom index
create index Idx on R using bloom (a1,a2,a3) with (length=64, col1=3, col2=4, col3=2);
Example: 10000 tuples, query = select * from R where a1=55 and a2=42, no matching tuples, random numeric values for attributes
No indexes ... execution time 15ms B-tree index on all attributes ... execution time 12ms Bloom index ... execution time 0.4ms
For more details, see PostgreSQL doc, Appendix F.5
Produced: 6 Apr 2021