CATC Indexing

COMP9315 21T1 ♢ CATC Indexing ♢ [0/12]
❖ Signature-based indexing

Reminder: file organisation for signature indexing (two files)

[Diagram:Pics/select/sigfile1.png]

One signature slot per tuple slot; unused signature slots are zeroed.


We use the terms "signature" and "descriptor" interchangeably

COMP9315 21T1 ♢ CATC Indexing ♢ [1/12]
❖ Concatenated Codewords (CATC)

In a concatenated codewords (catc) indexing scheme

A tuple descriptor (signature) desc(t) is
The order that the concenated codewords appears doesn't matter, as long as it's done consistently
COMP9315 21T1 ♢ CATC Indexing ♢ [2/12]
❖ CATC Example

Consider the following tuple (from bank deposit database)

Branch AcctNo Name Amount
Perryridge 102 Hayes 400

It has the following codewords/descriptor (for m = 16,   ui = 4 )

Ai cw(Ai)
Perryridge 0101
102 1001
Hayes 1010
400 1100
desc(t) 1100101010010101
COMP9315 21T1 ♢ CATC Indexing ♢ [3/12]
❖ CATC Queries

To answer query q in CATC

E.g. consider the query (Perryridge, ?, Hayes, ?).

Ai cw(Ai)
Perryridge 0101
? 0000
Hayes 1010
? 0000
desc(q) 0000101000000101
COMP9315 21T1 ♢ CATC Indexing ♢ [4/12]
❖ CATC Queries (cont)

Once we have a query descriptor, we search the signature file:

pagesToCheck = {}
// scan r signatures
for each descriptor D[i] in signature file {
    if (matches(D[i],desc(q))) {
        pid = pageOf(tupleID(i))
        pagesToCheck = pagesToCheck ∪ pid
    }
}
// then scan bsq = bq + δ pages to check for matches

Matching can be implemented efficiently ...

#define matches(sig,qdesc) ((sig & qdesc) == qdesc)

COMP9315 21T1 ♢ CATC Indexing ♢ [5/12]
❖ Example SIMC Query

Consider the query and the example database:

Signature Deposit Record
0000101000000101 (Perryridge,?,Hayes,?)
1010100101101001 (Brighton,217,Green,750)
1100101010010101 (Perryridge,102,Hayes,400)
1010011010010110 (Downtown,101,Johnshon,512)
0110101001010011 (Mianus,215,Smith,700)
1010101011000101 (Clearview,117,Throggs,295)
1001010100111001 (Redwood,222,Lindsay,695)

Gives two matches:  one true match,  one false match.

COMP9315 21T1 ♢ CATC Indexing ♢ [6/12]
❖ CATC Parameters

False match probablity pF  =  likelihood of a false match

How to reduce likelihood of false matches?

Larger m means larger signature file ⇒ read more signature data.

Since ui 's are relatively small, hash collisions may be a serious issue

But making ui's means larger signatures ⇒ optimisation problem

COMP9315 21T1 ♢ CATC Indexing ♢ [7/12]
❖ CATC Parameters (cont)

How to determine "optimal" m  and u ?

  1. start by choosing acceptable pF
      (e.g. pF ≤ 10-4 i.e. one false match in 10,000)
  2. then choose m  to achieve no more than this pF.
Formulae to derive "good" m:   m  =  ( 1/loge 2 )2 . n . loge ( 1/pF )

Choice of ui values

COMP9315 21T1 ♢ CATC Indexing ♢ [8/12]
❖ Query Cost for CATC

Cost to answer pmr query: Costpmr = bD + bsq

bD = ceil( r/cD )  and  cD = floor(B/ceil(m/8))

E.g. m=64,   B=8192,   r=104    ⇒    cD = 1024,   bD=10

bsq includes pages with rq  matching tuples and rF  false matches

Expected false matches = rF  =  (r - rq).pF  ≅  r.pF   if rq ≪ r

E.g. Worst bsq = rq+rF,    Best bsq = 1,    Avg bsq = ceil(b(rq+rF)/r)

COMP9315 21T1 ♢ CATC Indexing ♢ [9/12]
❖ Variations on CATC

CATC has one descriptor per tuple ... potentially inefficient.

Alternative approach: one descriptor for each data page.

Every attribute of every tuple in page contributes to descriptor.

Size of page descriptor mp  =  ( 1/loge 2 )2 . c.n . loge ( 1/pF )

Size of codewords is proportionally larger   (unless attribute domain small)

E.g. n = 4, c = 64, pF = 10-3   ⇒   mp ≅ 3680bits ≅ 460bytes

Typically, pages are 1..8KB    8..64 PD/page (cPD).

E.g. mp ≅ 460,   B = 8192,   cPD ≅ 17

COMP9315 21T1 ♢ CATC Indexing ♢ [10/12]
❖ Variations on CATC (cont)

Improvement: store b × mp-bit page descriptors as mp × b-bit "bit-slices"

If b = 2x  then uses same storage as page descriptors

Query cost: scan ui / 2 bit-slices for each known attribute

If k is set of known attribute values, #slices = i∈k ui / 2

E.g. b = 128, m = 256, n = 4, ui = 16

(a,?,c,?) requires scan of 2 × 8 128-bit (16-byte) slices

compared to scan of 128 page descriptors, where each PD is 64-bits (8-bytes)

COMP9315 21T1 ♢ CATC Indexing ♢ [11/12]
❖ Comparison with SIMC


Assume same m , pF , n  for each method ...

CATC has ui-bit codewords, each has ≅ui / 2  bits set to 1

SIMC has m-bit codewords, each has k bits set to 1

Signatures for both have m  bits, with ≅m / 2  bits set to 1

CATC has flexibility in ui, but small(er) codewords so more hash collisions

SIMC has less hash collisions, but has errors from "unfortunate" overlays

COMP9315 21T1 ♢ CATC Indexing ♢ [12/12]


Produced: 28 Mar 2021