SIMC Indexing

COMP9315 21T1 ♢ SIMC Indexing ♢ [0/16]
❖ Signature-based indexing

Reminder: file organisation for signature indexing (two files)

[Diagram:Pics/select/sigfile1.png]

One signature slot per tuple slot; unused signature slots are zeroed.

COMP9315 21T1 ♢ SIMC Indexing ♢ [1/16]
❖ Superimposed Codewords (SIMC)

In a superimposed codewords (simc) indexing scheme

A tuple descriptor desc(t) is Method (assuming all n attributes are used in descriptor):

bits desc = 0 
for (i = 1; i <= n; i++) {
   bits cw = codeword(A[i],m,k)
   desc = desc | cw
}

COMP9315 21T1 ♢ SIMC Indexing ♢ [2/16]
❖ SIMC Example

Consider the following tuple (from bank deposit database)

Branch AcctNo Name Amount
Perryridge 102 Hayes 400

It has the following codewords/descriptor (for m = 12,   k = 2 )

Ai cw(Ai)
Perryridge 010000000001
102 000000000011
Hayes 000001000100
400 000010000100
desc(t) 010011000111
COMP9315 21T1 ♢ SIMC Indexing ♢ [3/16]
❖ SIMC Queries

To answer query q in SIMC

E.g. consider the query (Perryridge, ?, ?, ?).

Ai cw(Ai)
Perryridge 010000000001
? 000000000000
? 000000000000
? 000000000000
desc(q) 010000000001
COMP9315 21T1 ♢ SIMC Indexing ♢ [4/16]
❖ SIMC Queries (cont)

Once we have a query descriptor, we search the signature file:

pagesToCheck = {}
// scan r signatures
for each descriptor D[i] in signature file {
    if (matches(D[i],desc(q))) {
        pid = pageOf(tupleID(i))
        pagesToCheck = pagesToCheck ∪ pid
    }
}
// then scan bsq = bq + δ pages to check for matches

Matching can be implemented efficiently ...

#define matches(sig,qdesc) ((sig & qdesc) == qdesc)

COMP9315 21T1 ♢ SIMC Indexing ♢ [5/16]
❖ Example SIMC Query

Consider the query and the example database:

Signature Deposit Record
010000000001 (Perryridge,?,?,?)
100101001001 (Brighton,217,Green,750)
010011000111 (Perryridge,102,Hayes,400)
101001001001 (Downtown,101,Johnshon,512)
101100000011 (Mianus,215,Smith,700)
010101010101 (Clearview,117,Throggs,295)
100101010011 (Redwood,222,Lindsay,695)

Gives two matches:  one true match,  one false match.

COMP9315 21T1 ♢ SIMC Indexing ♢ [6/16]
❖ SIMC Parameters

False match probablity pF  =  likelihood of a false match

How to reduce likelihood of false matches?


Larger m means larger signature file ⇒ read more signature data.

Having k too high    increased overlapping.
Having k too low    increased hash collisions.

COMP9315 21T1 ♢ SIMC Indexing ♢ [7/16]
❖ SIMC Parameters (cont)

How to determine "optimal" m and k?

  1. start by choosing acceptable pF
      (e.g. pF ≤ 10-4 i.e. one false match in 10,000)
  2. then choose m and k to achieve no more than this pF.
Formulae to derive m  and k  given pF  and n:

k  =  1/loge2 . loge ( 1/pF )

m  =  ( 1/loge 2 )2 . n . loge ( 1/pF )


Formula from Bloom (1970), "Space/Time Trade-offs in Hash Coding with Allowable Errors", Communications of the ACM, 13 (7): 422–426

COMP9315 21T1 ♢ SIMC Indexing ♢ [8/16]
❖ Query Cost for SIMC

Cost to answer pmr query: Costpmr = bD + bsq

bD = ceil( r/cD )  and  cD = floor(B/ceil(m/8))

E.g. m=64,   B=8192,   r=104    ⇒    cD = 1024,   bD=10

bsq includes pages with rq  matching tuples and rF  false matches

Expected false matches = rF  =  (r - rq).pF  ≅  r.pF   if rq ≪ r

E.g. Worst bsq = rq+rF,    Best bsq = 1,    Avg bsq = ceil(b(rq+rF)/r)

COMP9315 21T1 ♢ SIMC Indexing ♢ [9/16]
❖ Page-level SIMC

SIMC has one descriptor per tuple ... potentially inefficient.

Alternative approach: one descriptor for each data page.

Every attribute of every tuple in page contributes to descriptor.

Size of page descriptor (PD) (clearly larger than tuple descriptor):

E.g. n = 4, c = 64, pF = 10-3   ⇒   mp ≅ 3680bits ≅ 460bytes

Typically, pages are 1..8KB    8..64 PD/page (cPD).

E.g. mp ≅ 460,   B = 8192,   cPD ≅ 17

COMP9315 21T1 ♢ SIMC Indexing ♢ [10/16]
❖ Page-level SIMC (cont)

File organisation for page-level superimposed codeword index

[Diagram:Pics/select/simc-2level.png]

COMP9315 21T1 ♢ SIMC Indexing ♢ [11/16]
❖ Page-level SIMC (cont)

Algorithm for evaluating pmr query using page descriptors

pagesToCheck = {}
// scan b mp-bit page descriptors
for each descriptor D[i] in signature file {
    if (matches(D[i],desc(q))) {
        pid = i
        pagesToCheck = pagesToCheck ∪ pid
    }
}
// read and scan bsq data pages
for each pid in pagesToCheck {
    Buf = getPage(dataFile,pid)
    check tuples in Buf for answers
}

COMP9315 21T1 ♢ SIMC Indexing ♢ [12/16]
❖ Bit-sliced SIMC

Improvement: store b m-bit page descriptors as m b-bit "bit-slices"

[Diagram:Pics/select/bit-sliced.png]

COMP9315 21T1 ♢ SIMC Indexing ♢ [13/16]
❖ Bit-sliced SIMC (cont)

Algorithm for evaluating pmr query using bit-sliced descriptors

matches = ~0  //all ones
// scan m r-bit slices
for each bit i set to 1 in desc(q) {
   slice = fetch bit-slice i
   matches = matches & slice
}
for each bit i set to 1 in matches {
   fetch page i
   scan page for matching records
}


Effective because desc(q) typically has less than half bits set to 1

COMP9315 21T1 ♢ SIMC Indexing ♢ [14/16]
❖ Comparison of Approaches

Tuple-based

Page-based Bit-sliced All signature files are roughly the same size, for a given pF
COMP9315 21T1 ♢ SIMC Indexing ♢ [15/16]
❖ Signature-based Indexing in PostgreSQL

PostgreSQL supports signature based indexing via the bloom module

(Signature-based indexes like this are often called Bloom filters)

Creating a Bloom index

create index Idx on R using bloom (a1,a2,a3)
with   (length=64, col1=3, col2=4, col3=2);

Example: 10000 tuples, query = select * from R where a1=55 and a2=42, no matching tuples, random numeric values for attributes

No indexes ... execution time 15ms
B-tree index on all attributes ... execution time 12ms
Bloom index ... execution time 0.4ms

For more details, see PostgreSQL doc, Appendix F.5

COMP9315 21T1 ♢ SIMC Indexing ♢ [16/16]


Produced: 6 Apr 2021