Selection on Multiple Attributes


Multi-dimensional (Nd) Selection


Operations for Nd Select2/150

N-dimensional select queries = condition on several attributes.


Tuple Space3/150

One view of N-dimensional selection on a relation R...

E.g. if N=D, we are checking existence of a tuple at a point


Heaps4/150

Heap files can handle pmr or space using standard method:

// select * from R where Cond
f = openFile(fileName("R"),READ);
for (p = 0; p < nPages(f); p++) {
    buf = getPage(f, p);
    for (i = 0; i < nTuples(buf); i++) {
        tup = getTuple(buf,i);
        if (matches(tup,Cond))
            add tup to result set
    }
}

Costpmr  =  Costspace  =  b     (i.e. O(n), worst case scenario)


Multiple Indexes5/150

DBMSs already support building multiple indexes on a table.

Which indexes to build, depends on which queries are asked.

If we don't know queries, index all attribute subsets:

create table R (a int, b int, c int);
create index Rax on R (a);
create index Rbx on R (b);
create index Rcx on R (c);
create index Rabx on R (a,b);
create index Racx on R (a,c);
create index Rbcx on R (b,c);
create index Rallx on R (a,b,c);


... Multiple Indexes6/150

Characteristics of indexes:

Since single-attribute indexes are common in DBMSs, consider first how to use these.


... Multiple Indexes7/150

Generalised view of pmr and space queries:

select * from R
where  a1 op1 C1 and a2 op1 C2
       and ... and an opn Cn 

where, for pmr, all opi are equality tests

Possible approaches to handling such queries ...

  1. use index on one ai to reduce tuple tests
  2. use indexes on all ai, and intersect answer sets


Selection using One Index8/150

Using index on one attribute:

If several attributes have indexes:


... Selection using One Index9/150

Abstract algorithm:

// select * from R where Cond
// Cond = a1op1C1 and a2op2C2 and a3op3C3 ...
Tids = IndexLookup(R, a[i], op[i], C[i])
// gives { tid1, tid2, ...} for tuples satisfying aiopiCi
Pages = { }
foreach tid in Tids {
    if (pageOf(tid)  Pages)
        Pages = Pages  {pageOf(tid)}
}
f = openFile(fileName("R"),READ)
foreach page in Pages {
    Buf = getPage(f,page);
    foreach tup in Buf {
        if (matches(tup, (a1 op1 C1 ...))
            add tup to Results
}   }


... Selection using One Index10/150

Cost of this approach to selection:

Note: some pages might contain


Selection using Many Indexes11/150

Using indexes on several attributes:

Mi = { tid(t) | t ∈ R ∧ t[ai] opi Ci }


... Selection using Many Indexes12/150

Abstract algorithm:

// select * from R where a1 op1 C1 and a2 op2 C2 ...
for (i = 1; i <= #QueryTerms; i++) {
    if (no index on a[i]) continue
    Tids = IndexLookup(R, a[i], op[i], C[i])
    PageSets[i] = { }
    foreach tid in Tids {
        if (pageOf(tid)  PageSets[i])
            PageSets[i] = PageSets[i]  pageOf(tid)
}   }
// PageSets[i] contains pages with tuples matching (ai opi Ci)
...


... Selection using Many Indexes13/150

...
// compute intersection of Pages[i]
Pages = PageSets[1]
for (i = 2; i <= #QueryTerms; i++) {
    Pages = Pages  PageSets[i];
}
// finally ... fetch data pages
f = openFile(fileName("R"),READ);
foreach page in Pages {
    Buf = getPage(f,page);
    // this page has at least 1 match
    foreach tup in Buf {
        if (matches(tup, (a1 op1 C1 ...))
            add tup to Results
}   }


Bitmap Indexes14/150

A bitmap index assists computation of result sets in pmr.

[Diagram:Pics/select/bitmap-index-small.png]


... Bitmap Indexes15/150

Structure of bitmap index:


... Bitmap Indexes16/150

Consider using bitmap index to solve query:

select * from Employees where dept='Sales' and gender='M'

Method:

tSales = bitmap[dept,'Sales']
tMales = bitmap[gender,'M']
matches = tSales & tMales
for (i in 0..r-1) {
   if (matches[i] == 1) {
      tid = slotToTupleId(i)
      fetch tuple t via TupleId tid
}  }


... Bitmap Indexes17/150

Costs associated with bitmap indexes ...

Storage costs:

Query-time costs:


N-dimensional Hashing


Hashing and pmr19/150

Consider these pmr queries on a table hashed using salary

Q1: select * from Employees
    where  salary = 60000 and gender = 'M'

Q2: select * from Employees
    where  dept = 'Sales' and salary = 60K

Q3: select * from Employees
    where  salary > 60000 and dept = 'Sales'

CostQ1 = 1+Ov,     CostQ2  =  b+bOv,     CostQ3  =  b+bOv


... Hashing and pmr20/150

Problem: hashing only effective if hash key used in query.

No problem if all queries involve conditions like key=val

But frequently tables are accessed in multiple ways.

Can we devise hashing that "involves" attributes?

Yes, using multi-attribute hashing (mah) ...


... Hashing and pmr21/150

Multi-attribute hashing ...

Multi-attribute hashing "parameterises" the indexing scheme: Not as good as hashing on one attribute; better than linear scan.


... Hashing and pmr22/150

Multi-attribute hashing parameters:


MA.Hashing Example23/150

Consider branch,acctNo,name,amount) table (+ hash values)

branch h(B) acctNo h(Ac) name h(N) amount
Brighton 1011 217 1001 Green 0101 750
Downtown 0000 101 0101 Johnson 1101 512
Mianus 1101 215 1011 Smith 0001 700
Perryridge 0100 102 0110 Hayes 0010 400
Redwood 0110 222 1110 Lindsay 1000 695
Round Hill 1111 305 0001 Turner 0110 350
Clearview 1110 117 0101 Throggs 0110 295

Note that we ignore the amount attribute; we are assuming, effectively, that nobody will want to ask queries like   select * from where amount=533


... MA.Hashing Example24/150

Hash parameters:    d=3     d1=1    d2=1    d3=1

Choice vector:

[Diagram:Pics/select/choice-vector-small.png]

This choice vector tells us:


... MA.Hashing Example25/150

Consider the tuple:

branch acctNo name amount
Brighton 217 Green 750

Hash value (page address) is computed by:

[Diagram:Pics/select/mahins1-small.png]


... MA.Hashing Example26/150

Consider the tuple:

branch acctNo name amount
Downtown 101 Johnston 512

Hash value (page address) is computed by:

[Diagram:Pics/select/mahins2-small.png]


... MA.Hashing Example27/150

Consider the tuple:

branch acctNo name amount
Round Hill 305 Turner 350

Hash value (page address) is computed by:

[Diagram:Pics/select/mahins3-small.png]


... MA.Hashing Example28/150

Hash values for all tuples:

tuple Hash
(Brighton,217,Green,750) 111
(Downtown,101,Johnshon,512) 110
(Mianus,215,Smith,700) 111
(Perryridge,102,Hayes,400) 000
(Redwood,222,Lindsay,695) 000
(Round Hill,305,Turner,350) 011
(Clearview,117,Throggs,295) 010


... MA.Hashing Example29/150

If inserted, in order of appearance, into a database containing 8 pages:

[Diagram:Pics/select/mahex1-small.png]


MA.Hashing Hash Functions30/150

Auxiliary definitions:

#define HashSize 32
typedef unsigned int HashVal;

// extracts i'th bit from hash value
#define bit(i,h) ((h) & (1 << (i)))

// choice vector elems
typedef struct { int attr, int bit } CVelem;
typedef CVelem ChoiceVec[HashSize];

// compute hash for i'th attribute of tuple t
HashVal hash(Tuple t, int i) { ... }


... MA.Hashing Hash Functions31/150

Produce combined d-bit hash value for tuple tup:

HashVal hash(Tuple tup, ChoiceVec cv, int d)
{
    HashVal h[nAttr(tup)];  // hash for each attr
    HashVal res = 0, oneBit;
    int     i, attr, bpos;
    for (i = 0; i < nAttr(tup); i++)
        h[i] = hash(tup,i);
    for (i = 0; i < d; i++) {
        attr = cv[i].attr;  bpos = cv[i].bit;
        oneBit = bit(bpos, h[attr]);
        res = res | (oneBit << (i-bpos));
    }
    return res;
}


Queries with MA.Hashing32/150

In a partial match query:

E.g.

select amount
from   Deposit
where  branch = 'Brighton' and name = 'Green'

for which we use the shorthand   (Brighton, ?, Green, ?)


... Queries with MA.Hashing33/150

If we try to form a composite hash for a query, we don't know values for some bits:

[Diagram:Pics/select/mahquery1-small.png]


... Queries with MA.Hashing34/150

How to answer this query?

query Hash
(Brighton, ?, Green, ?) 1*1

Any matching tuples must be in pages with addresses 101 or 111.

We need to read just these pages.


... Queries with MA.Hashing35/150

Consider the query:

select amount
from   Deposit
where  name = 'Green'

[Diagram:Pics/select/mahquery2-small.png]

Need to check pages: 100, 101, 110, 111.

(With original hashing scheme, would have read all pages)


... Queries with MA.Hashing36/150

Consider the query:

select amount
from   Deposit
where  branch='Brighton' and acctNo=217 and name='Green'

[Diagram:Pics/select/mahquery3-small.png]

Need to check only page 111.


MA.hashing Query Algorithm37/150

// Build the partial hash value (e.g. 10*0*1)
// Treats query like tuple with some attr values missing
nstars = 0;
for each attribute i in query Q {
    if (hasValue(Q,i)) {
        set d[i] bits in composite hash
            using choice vector and hash(Q,i)
    } else {
        set d[i] *'s in composite hash
            using choice vector
        nstars++;
    }
}
...


... MA.hashing Query Algorithm38/150

...
// Use the partial hash to find candidate pages
f = openFile(fileName("R"),READ);
for (i = 0; i < 2**nstars; i++) {
    P = composite hash
    replace *'s in P
        using i and choice vector
    Buf = getPage(f, P);
    for each tuple T in Buf {
        if (T satisfies pmr query)
            add T to results
    }
}


Query Cost for MA.Hashing39/150

Multi-attribute hashing handles a range of query types, e.g.

select * from R where a=1
select * from R where d=2
select * from R where b=3 and c=4
select * from R where a=5 and b=6 and c=7

A relation with n attributes has 2n different query types.

Different query types have different costs   (different no. of *'s)


... Query Cost for MA.Hashing40/150

A query type Q may be defined via a set of attribute numbers,
indicating which attributes have values specified.

Example Query Types:

Query Type
(v1, ?, ?, ?) {1}
(v1, ?, v3, ?) {1,3}
(?, ?, v3, v4) {3,4}
(v1, v2, v3, v4) {1,2,3,4}
(?, ?, ?, ?) {}   (open query)


... Query Cost for MA.Hashing41/150

In practice, different query types are not equally likely.

E.g. maybe most of the queries asked are of the form

select * from R where a=1

with occasionally a query of the form

select * from R where b=3 and c=4

For each application, we can define a query distribution
which gives the probability pQ of asking each query type Q.


... Query Cost for MA.Hashing42/150

Consider a query of type Q with m attributes unspecified.

Each unspecified Ai contributes di *'s.

Total number of *'s is   s  =  ∑i ∉ Q di.

Number of pages to read is   2s  =  ∏i ∉ Q 2di.

If we assume no overflow pages, Cost(Q) = 2s (where s is determined by Q)


... Query Cost for MA.Hashing43/150

Min query cost occurs when all attributes are used in query

Min Costpmr  =  1

Max query cost occurs when no attributes are specified

Max Costpmr  =  2d  =  b

Average cost is given by weighted sum over all query types:

Avg Costpmr  =  ∑Q pQi ∉ Q 2di

Since all query types are possible, aim to minimise average cost.


Optimising MA.Hashing Cost44/150

For a given application, the aim is to minimise Costpmr.

Can be achieved by choosing appropriate values for di   (cv)

Heuristics:

Trade-off: making query type Qj more efficient makes Qk less efficient.

This is a combinatorial optimisation problem, and can be handled by standard optimisation techniques e.g. simulated annealing.


MA.Hashing Cost Example45/150

Our example database has 16 possible query types:

Query type Cost pQ
(?, ?, ?, ?) 8 0
(br, ?, ?, ?) 4 0.25
(?, ac, ?, ?) 4 0
(br, ac, ?, ?) 2 0
(?, ?, nm, ?) 4 0
(br, ?, nm, ?) 2 0
(?, ac, nm, ?) 2 0.25
(br, ac, nm, ?) 1 0
(?, ?, ?, amt) 8 0
(br, ?, ?, amt) 4 0
(?, ac, ?, amt) 4 0
(br, ac, ?, amt) 2 0
(?, ?, nm, amt) 4 0
(br, ?, nm, amt) 2 0.5
(?, ac, nm, amt) 2 0
(br, ac, nm, amt) 1 0

Cost values are based on earlier choice vector   (dbr = dac = dnm = 1)
pQ values can be determined by observation of DB use.


... MA.Hashing Cost Example46/150

Consider r=106, Nr=100, b=104, d=14.

Attribute br occurs in 0.5+0.25 used query types
allocate many bits to it e.g. d1=6.

Attribute nm occurs in 0.5+0.25 of queries
allocate many bits to it e.g. d3=4.

Attribute amt occurs in 0.5 of queries
allocate less bits to it e.g. d4=2.

Attribute ac occurs in 0.25 of queries
allocate least bits to it e.g. d2=2.


... MA.Hashing Cost Example47/150

With bits distributed as: d1=6, d2=2, d3=4, d4=2

Query type Cost pQ
(br, ?, ?, ?) 28 = 256 0.25
(?, ac, nm, ?) 28 = 256 0.25
(br, ?, nm, amt) 22 = 4 0.5

Cost  =  0.5 × 22 + 0.25 × 28 + 0.25 × 28  =  130


MA.Hashing vs One-Attribute Hashing48/150

Multi-attribute hashing is basically just a different way of forming hash value for each tuple.

Hence, can be used with all flexible hashing schemes.

As the file grows, the di's are increased one by one, using the choice vector.

Difference between mah and standard hashing:
queries "suggest" a set of pages rather than a single page.

Cost for insertion, deletion, ordered scan is same as for underlying hashing scheme.


Grid Files49/150

The mah approach attempts to combine all attributes in the hash value.

Alternative approach is to keep hash values separate and use auxiliary structure to provide combination.

A generalisation of extendible hashing, called grid files provides this.

Ext.hashing uses a directory on one attribute.

Grid files use a k-dimensional directory to handle k attributes.


... Grid Files50/150


[Diagram:Pics/select/gridfile-small.png]

A simple 2-dimensional grid file


Selection with Grid Files51/150

Consider the following grid file directory:

[Diagram:Pics/select/gridquerys-small.png]

for a table R with two attributes a and b


... Selection with Grid Files52/150

If  d1=3  and  d2=2 ...

select...where a=C1 and b=C2 one cell
select...where a=C1 one row (four cells)
select...where b=C2 one column (eight cells)

Number of directory cells visited is similar to number of pages visted in mah.


... Selection with Grid Files53/150

Execution of select...where b=C2:

[Diagram:Pics/select/gridquery1-small.png]

where  hash(C1) = ...001  and  hash(C2) = ...110


... Selection with Grid Files54/150

Selection for pmr in a 2d grid file ( nAttrs = d )

for (i = 0; i < nAttr(tup); i++) {
    if (!hasValue(tup,i))
        { lo[i] = 0; hi[i] = 2**d-1 }
    else {
        val = bits(d[i], getAttr(tup,i))
        lo[i] = val; hi[i] = val
}   }
Pages = []
for (i = lo[1]; i <= hi[1]; i++) {
    for (j = lo[2]; j <= lo[2]; j++) {
        if (!(grid[i][j] in Pages))
            add grid[i][j] to Pages
}   }
for each P in Pages {
    Buf = getPage(f,P)
    check Buf for solutions
}


Query Cost for Grid Files55/150

Cost depends on:

For pmr with all attributes specified C(x,y,z)  =  2


... Query Cost for Grid Files56/150

For pmr with j < k attributes specified

C(?,y,?)  =  (gq + m)

Typically, 1 ≤ gq ≤ 50   (i.e. grid directories are relatively small)


Other Operations on Grid Files57/150

Insertion has several cases:

Deletion can be achieved by marking (so Costpmr + bq w)

As with all hashing schemes, grid file does not help ordered scan.


N-dimensional Tree Indexes


Multi-dimensional Tree Indexes59/150

We have seen how hashing can be extended to handle multiple attributes.

Can tree-based index structures be generalised as well?

Yes ... and a very large number of techniques have been suggested in research literature.

E.g.   BSP-trees, Cell-trees, LSD-trees, SS-trees, TV-trees, X-trees, ...

We consider three popular examples:   kd-trees, Quad-trees, R-trees.


Example 2d Relation60/150

As an example for multi-dimensional trees, consider the following relation:

create table Rel (
    A1 char(1),  # 'a'..'z'
    A2 integer   # 0..9
);

with example tuples:

Rel('a',1)  Rel('a',5)  Rel('b',2)
Rel('d',1)  Rel('d',2)  Rel('d',4)
Rel('d',8)  Rel('g',3)  Rel('j',7)
Rel('m',1)  Rel('r',5)  Rel('z',9)


... Example 2d Relation61/150

And the tuple-space for the above tuples:

[Diagram:Pics/select/2d-space-small.png]


kd-Trees62/150

kd-trees are multi-way search trees where

For a tree with L levels and k attributes: Partitions "tuple space" into smaller and smaller regions at each level.


Example kd-Tree63/150

[Diagram:Pics/select/kd-tree-small.png]

Each leaf node contains a reference to a bucket containing tuples from a small region of k-dimensional space.


... Example kd-Tree64/150

How this tree partititons the tuple space:

[Diagram:Pics/select/kd-tree-space-small.png]


Searching in kd-Trees65/150

Consider first how to answer pmr queries ...

If all attributes have values specified:

If e.g. attribute A2 is unspecified


... Searching in kd-Trees66/150

As a tree traversal, the algorithm is best specified recursively.

Parameters for search function:

Other available information: The search commences via   Search(Q, 0, kdTreeRoot)


... Searching in kd-Trees67/150

Search(Query Q, Level L, Node N)
{
    if (isLeaf(N)) {
        Buf = getPage(f,N.page)
        check Buf for matching tuples
    } else {
        ai = attrLev[L]
        if (hasValue(Q,ai)) {
            Val = getAttr(Q,ai)
            newN = search N for Val
            Search(Q, L+1, newN)
        } else {
            for each child newN of N
                Search(Q, L+1, newN)
}   }   }


... Searching in kd-Trees68/150

For space queries ...

We require a change to the above Search algorithm.

if (hasValue(Q,ai)) {
    Val = getAttr(Q,ai)
    newN = search N for Val
    Search(Q, L+1, newN)
}

becomes

if (isRange(Q,ai)) {
    for each node N on current level {
        for each value V in range(Q,ai)) {
            newN = childOf(N,V)
            Search(Q, L+1, newN)
}   }   }


Query cost for kd-Trees69/150

Different kinds of queries (i.e. different specified attributes)

The query cost will clearly also be affected by the depth of the tree.

If we include all n attributes as indexes, then tree has depth ≥ n.

Depth of tree is also affected by branching factor at each node.

Note: kd-trees are not necessarily balanced.


... Query cost for kd-Trees70/150

Query examples:

Open query:   (?,?,?,?,?,...)   (zero attributes specified)

Point query:   (a,b,c,d,e,...)   (all attributes specified) Generic pmr query:   (a,?,c,?,e,...)


Insertion into kd-Trees71/150

Input: tuple with all values specified.

Method:

traverse tree to leaf node
    (reaching data page P1)
if (not full P1)
    insert tuple
else {
    create new data page P2
    compute new partition point Part
    distribute tuples between P1 and P2
        (using Part)
    create new internal node N with Part
    link N into tree, link N to P1 and P2
}
write any modified pages (data or index)


... Insertion into kd-Trees72/150

Example insertion into non-full data page:

[Diagram:Pics/select/kd-tree1-small.png]


... Insertion into kd-Trees73/150

Example insertion into full data page:

[Diagram:Pis/file-structc/kd-tree2-small.png]


kd-B-Trees74/150

kd-B-trees: a disk-based index structure using kd-tree ideas.

Aims to have a tree which is:

Basic idea: distribute kd-tree structure over a number of pages: Requires modification of insertion method to do re-balancing
(changes to insertion make it more complex and potentially much more expensive)


... kd-B-Trees75/150

Simple example kd-B-tree   (max 2 kd-tree levels in each node):

[Diagram:Pics/select/k-d-b-tree-small.png]


... kd-B-Trees76/150

What this produces:

Potential problems:


Selection in kd-B-Trees77/150

To answer a pmr query:

Pages = kdBsearch(query,root)
for each page P in Pages {
    Buf = getPage(f,P)
    scan Buf for matching tuples
}

kdbSearch(Query, Page)
{
    if (Page is a data page)
        Pages = Page
    else {
        Buf = getPage(kdbf,Page)
        search kd-tree in Buf
            using attributes from Query
        Pages = []
        for each indicated external subtree S {
            P = kdbSearch(Query, S)
            add P to Pages
    }   }
    return Pages
}


... Selection in kd-B-Trees78/150

Size of kd-B-tree:

Obviously, depth increases for not perfectly-balanced trees.


... Selection in kd-B-Trees79/150

If all attributes are specified (i.e. point search):

Costpmr  =  (d+1)

If some attributes unspecified ...

Can potentially search a large region of tree:

space queries can be handled similarly to pmr queries.


Insertion/Deletion in kd-B-Trees80/150

Conceptually, insertion is same as for kd-tree:

Added complication: kd-tree is spread over many index pages


... Insertion/Deletion in kd-B-Trees81/150

Minimum insert cost is 1w   (no split, no change to kd-tree)

Insert cost with split and kd-tree update: 3w
(need to write re-partitioned data pages and modified index page)

Worst case insert cost is high (tree/data re-organisation).

Deletion is straightforward (mark as deleted).

If data page occupancy falls too low, might consider


Quad Trees82/150

Quad trees use a regular, recursive partitioning of kd tuple space.

At each level, partition space into non-overlapping kd hypercubes.

For 2d case (others are much harder to visualise):

Aim: each "leaf" partition contains approx same number of tuples.


... Quad Trees83/150

Quad-tree on example 2d tuple space:

[Diagram:Pics/select/quad-tree-space-small.png]


... Quad Trees84/150

Basis for the partitioning:


... Quad Trees85/150

The previous partitioning gives this tree structure, e.g.

[Diagram:Pics/select/quad-tree-small.png]


... Quad Trees86/150

Each node of a k dimensional quad-tree has 2k children.

Quad-trees originally devised as memory based structures, for spatial data.

In this domain, k = 2 or k = 3 and branching is fairly small.

For index pages in a disk-based data structure:


Insertion into Quad Tree87/150

Inserting a new tuple into a 2d quad-tree indexed file involves:

traverse tree to leaf node
    (reaching data page P1)
if (not full P1)
    insert tuple
else {
    create new data pages P2, P3, P4
    distribute tuples between P1 .. P4
    create new internal node N
    link N into tree, link N to P1 .. P4
}
write any modified pages (data or index)


... Insertion into Quad Tree88/150

Example:

[Diagram:Pics/select/quad-insert-small.png]

 
Cost   =   traversal cost + update cost
  =   read ≥ 1 node pages + write ≥ 1 data pages


... Insertion into Quad Tree89/150

Potential problems with quad-trees:

The problems can be overcome by:


Query with Quad Tree90/150

For pmr query

For kd quad-tree, and query with n specified attributes:


... Query with Quad Tree91/150

For space query

Example: 2d quad-tree with query a ≤ A1 ≤ b ∧ c ≤ A2 ≤ d


... Query with Quad Tree92/150

Space query example:

[Diagram:Pics/select/quad-query-small.png]

Need to traverse: red(NW), green(NW,NE,SW,SE), blue(NE,SE).


R-Trees93/150

R-trees use a flexible, hierachical partitioning of kd tuple space.

Could be viewed as a more flexible version of quad-tree.


... R-Trees94/150

Example 2d R-tree node (with two levels of children):

[Diagram:Pics/select/r-tree-node-small.png]


... R-Trees95/150

[Diagram:Pics/select/r-tree-small.png]


... R-Trees96/150

The R-tree was initally designed for indexing polygons
  (via their minimum bounding rectangles (MBR))

It adapts naturally to tuple-space points (i.e. tuples)
  (each MBR represents a region of the tuple-space)

Goals in constructing R-trees (cf. B-trees)


Query with R-trees97/150

Designed to handle space queries and "where-am-I" queries.

"Where-am-I" query: find all regions containing a given point P:

Space (region) queries are handled in a similar way


... Query with R-trees98/150

Algorithm for "where am I" query in 2d R-tree:

Regions = RtreeSearch(a,b,root)

RtreeSearch(x,y,Node) {
    Regions = []
    if (leaf(Node)) {
        for each (x1,y1,x2,y2,P) in Node {
            if (x1,y1,x2,y2) contains (a,b) {
                add (x1,y1,x2,y2) to Regions
    }   }   }
    else {
        for each (x1,y1,x2,y2,Child) in Node {
            if (x1,y1,x2,y2) contains (a,b) {
                Rs = RtreeSearch(x,y,Child)
                add Rs to Regions
    }   }   }
    return Regions;
}


Insertion into R-tree99/150

Insertion of an object R occurs as follows:

Note that R may be a point or a polygon.


... Insertion into R-tree100/150

Heuristics in R-tree insertion ...

Choose child to expand to contain R

Choose a child (from several) to contain R Partition objects between old/new data pages


... Insertion into R-tree101/150

Quadratic split method to partition objects in S:

find objects a and b in S
    that produce the largest MBR
place a in S1
place b in S2
for every other object c in the S {
    s1 = current size of S1
    i1 = size of S1 with c added
    s2 = current size of S2
    i2 = size of S2 with c added
    if ((s1-i1) < (s2-i2))
        add c to S1
    else
        add c to S2
}


R-tree variants102/150

R+ tree does not permit overlapping regions.

R* tree uses delayed splitting on insertion. SS-tree uses (hyper)spherical regions rather than (hyper)cubes


R-trees in PostgreSQL103/150

Up to version 8.2, PostgreSQL had separate R-tree implementation

From version 8.2 on, R-trees implemented using GiST.


GiST in PostgreSQL104/150

PostgreSQL provides an implementation of Generalized Search Trees (GiST).

Above discussion of tree-based indexes shows commonalities:

Some examples:


... GiST in PostgreSQL105/150

GiST trees have the following structural constraints:

However, the "predicate" constraints are looser than those for any trees discussion so far.

Users are free to implement their own p to produce their own "flavour" of search tree.

Details: src/backend/access/gist


Signature-based Selection


Indexing with Signatures107/150

Signature-based indexing: an "efficient" linear scan.

Why abandon sub-linear complexity?   (e.g. O(1) for hashing)

Arising from work with high-d feature vectors in multimedia:

We consider two applications of signatures to indexing:


... Indexing with Signatures108/150

A signature is a compact (lossy) descriptor for a tuple.

Scanning the whole data file is too expensive, so

For this to be better than a linear scan of the data, require:


Signature Files109/150

File organisation for signature indexing (two files)

[Diagram:Pics/select/sigfile-small.png]

(One signature slot per tuple slot in the data file; unused signature slots are zeroed)


... Signature Files110/150

Note that signatures do not determine tuple placement.

The only requirement is that

The data file could be organised via hashing, B-tree, curves, etc.

Thus, signatures can be used in conjunction with other indexing schemes.


Signatures111/150

Signatures (descriptors)

There are two main types of tuple signature: We use the following terminology consistently:


... Signatures112/150

A tuple r consists of n attributes A1..An.

A codeword cw(Ai) is

A tuple descriptor is built by combining n codewords.


Generating Codewords113/150

A method for generating a k-in-m codeword for attribute Ai

seed = hash(A[i])
nbits = 0
codeword = 0  /* m zero bits */
while (nbits < k) {
    i = random(0,m-1)
    if (bit i not set in codeword) {
        set bit i in codeword
        nbits++
    }
}

Requires bit-string operations and a pseudo-random number generator.


Superimposed Codewords (SIMC)114/150

In a superimposed codewords (simc) indexing scheme

A tuple descriptor desc(r) is


SIMC Example115/150

Consider the following tuple (from a bank deposit database)

branch acctNo name amount
Perryridge 102 Hayes 400

It has the following codewords/descriptor (for m = 12,   k = 2)

Ai cw(Ai)
Perryridge 010000000001
102 000000000011
Hayes 000001000100
400 000010000100
desc(r) 010011000111


... SIMC Example116/150

Consider a very small example database, with associated signatures:

Signature Tuple
100101001001 (Brighton,217,Green,750)
010101010101 (Clearview,117,Throggs,295)
101001001001 (Downtown,101,Johnshon,512)
101100000011 (Mianus,215,Smith,700)
010011000111 (Perryridge,102,Hayes,400)
100101010011 (Redwood,222,Lindsay,695)
011110111010 (Round Hill,305,Turner,350)


SIMC Queries117/150

To answer query q in SIMC

desc(q) is formed by bitwise OR of codewords for supplied attributes.

E.g. consider the query (Perryridge, ?, ?, ?).

Ai cw(Ai)
Perryridge 010000000001
? 000000000000
? 000000000000
? 000000000000
desc(q) 010000000001


... SIMC Queries118/150

Once we have a query descriptor, we search the signature file:

for each descriptor D[i] in signature file {
    if (D[i] matches desc(q))
        mark tuple R[i] as a match
}
for each marked tuple R {
    fetch tuple R
}

The critical assumption to make this approach viable:

scanning the signature file and comparing descriptors
is faster than scanning the tuple file and checking tuples


... SIMC Queries119/150

What sort of signature comparison is required to answer the example query?

Clearly, any matching tuple must have the value "Perryridge" for A1.

Any tuple with "Perryridge" must have these bits 010000000001 set.

So, checking whether each descriptor has these bits set, gives us the matches.

This can be generalised to multiple supplied values in the query:

In other words ...
for any tuple r that is a match for query q
the 1-bits in desc(q) are a subset of the 1-bits in desc(r)


... SIMC Queries120/150

Query/tuple descriptor matching can be implemented efficiently using logic operations on bit-strings.

Note that if   bits(A) ⊆ bits(B)   then   A AND B = A.

Thus, the query/tuple descriptor matching can be implemented as:

matches(Ri,q) if desc(q) AND desc(Ri) = desc(q)


Example SIMC Query121/150

Consider the query and the example database:

Signature Tuple
010000000001 (Perryridge,?,?,?)
100101001001 (Brighton,217,Green,750)
010101010101 (Clearview,117,Throggs,295)
101001001001 (Downtown,101,Johnshon,512)
101100000011 (Mianus,215,Smith,700)
010011000111 (Perryridge,102,Hayes,400)
100101010011 (Redwood,222,Lindsay,695)
011110111010 (Round Hill,305,Turner,350)


... Example SIMC Query122/150

The query has potential matches:

branch acctNo name amount
Clearview 117 Throggs 295
Perryridge 102 Hayes 400

The first is an example of a false match.

False matches are caused by:


False Matches123/150

Example:

Ai cw(Ai)
Perryridge 010000000001
102 000000000011
Hayes 000001000100
400 000010000100
desc(r) 010011000111
Ai cw(Ai)
Clearview 010000010000
117 000100000001
Throggs 000001000100
295 000100000100
desc(r) 010101010101


... False Matches124/150

How to reduce likelihood of false matches?

Increasing m helps, but at the expense of reading more descriptor data.

Having k too high    increased overlapping.

Having k too low    increased codeword collisions.

Thus, for SIMC schemes, there is the notion of choosing optimal m, k.


Design with SIMC125/150

SIMC schemes have the notion of false match probability pF.

As suggested above, pF is affected by m, k and n (#attributes)

In designing a SIMC index for a data file, the aim is to choose indexing parameters (m, k) to minimise the likelihood of false matches.

  1. start by choosing acceptable pF   (e.g. pF ≤ 10-5 i.e. one false match in 10,000)
  2. then choose m and k to achieve no more than this pF.
Formulae to derive m and k given pF and n:

k  =  1/loge2 . loge ( 1/pF )

m  =  ( 1/loge 2 )2 . n . loge ( 1/pF )


... Design with SIMC126/150

Design example:


Query Cost for SIMC127/150

A page-oriented view of how SIMC queries are answered:

determine query descriptor QD
for each page SB of tuple descriptors {
    for each tuple descriptor RD in page SB {
        if (RD matches QD) {
	    add tuple R to list of matches
            add pageOf(R) to Pages
        }
    }
}
for each page P in Pages {
    Buf = getPage(f,P)
    check Buf for matching tuples
}

With a buffer manager, the previous tuple-oriented version would behave like this.


... Query Cost for SIMC128/150

To answer pmr query, must read r signatures.

Since one signature per tuple, cannot feasibly store them in memory.

If signature contains m bits, can fit ND = 8B/m signatures per page.

Thus, must read bD = r/ND pages of signature data.

E.g. m=32,   B=1024,   r=105     ⇒     ND = 256,   bD=40

How many data pages do we read?


... Query Cost for SIMC129/150

Number of data pages read depends on:

Total tuples to read = rq + rF.

Expected false matches = rF  =  (r - rq).pF.

Assume matching tuples are uniformly distributed over the data pages.

Then total data pages read = bq  =  (rq+rF)/r × b

Costpmr  =  bD + bq


... Query Cost for SIMC130/150

For one queries, can use same optimisation as for Heaps.

That is, stop the search as soon as the single matching tuple is found.

The average cost in the case would be:

Costone :   Best = 2     Average = bD/2 + δ     Worst = bD + b


... Query Cost for SIMC131/150

SIMC provides no assistance in answering range, space queries, unless the data file is sorted.

If the data file is sorted,

SIMC provides no asistance at all for sim queries   (linear scan).


Applications of SIMC132/150

SIMC is useful for pmr, which allows a wide range of query types, e.g.

select * from R where a=1
select * from R where a=3 and c=1 and d=4
select * from R where a=2 and b=3 and c=4 and d=5

The above discussion has assumed that n attributes are indexed.

In fact, SIMC is reasonably flexible about how many attributes are involved

Thus, it also has applications in information retrieval:


Two-level SIMC133/150

Scanning one descriptor for every tuple is not efficient.

How to reduce number of descriptors?

Have a smaller number of larger descriptors.

E.g. one descriptor for each data page.

Every attribute of every tuple in page contributes to descriptor.

How large is a page descriptor (BD)?

Use formulae above, but with Nr.n ``attributes''.

Typically, pages are 1..8KB    8..64 BD/page (NBD).


Two-Level SIMC Files134/150

File organisation for two-level superimposed codeword index

[Diagram:Pics/select/simc-2lev-small.png]


... Two-Level SIMC Files135/150

Alternative file organisation:

[Diagram:Pics/select/simc-2lev1-small.png]

(Assumes that we need to read the data pages anyway, so may as well include tuple descriptors there ... or simply omit them altogether)


Queries with Two-level SIMC136/150

Method for answering queries:

form query descriptor QD
for each page descriptor BD {
    if (BD matches QD)
        add P to Pages
}
for each page P in Pages {
    read page P
    check for matching tuples
    // could do this using tuple descriptors
}

Costpmr  =  b/NBD + bq


Bit-sliced SIMC137/150

Reading all page descriptors is still not efficient, especially when only a few bits are set.

To improve efficiency, reconsider the matching process:

Signature Tuple
010000000001 (Perryridge,?,?,?)
100101001001 (Brighton,217,Green,750)
010101010101 (Clearview,117,Throggs,295)
101001001001 (Downtown,101,Johnshon,512)
101100000011 (Mianus,215,Smith,700)
010011000111 (Perryridge,102,Hayes,400)
100101010011 (Redwood,222,Lindsay,695)
011110111010 (Round Hill,305,Turner,350)


... Bit-sliced SIMC138/150

Rather than storing b m-bit page descriptors, store m b-bit descriptor slices.

[Diagram:Pics/select/bit-slice-small.png]


... Bit-sliced SIMC139/150

To answer queries:

form query descriptor QD
// Result is a bit-slice
Result = AllOnes
for each bit b set in QD {
    read bit-slice b
    Result = Result AND b
}

Result has bit i set if page i is candidate page.

Queries much more efficient because often only a few bits set in QD.

However, updates are expensive ... need to set k different slices for each insert.

Changing b is very expensive.


Query Cost for Bit-sliced SIMC140/150

A major cost in SIMC queries is reading the page-descriptor slices.

Each slice is b bits long (one bit for each data file page).

Thus, each slice occupies 8B/b bytes.

Generally, we can fit several page descriptor slices per page.

However, we make the pessimistic assumption ...

reading a slice means reading one page from the page descriptor slices file


... Query Cost for Bit-sliced SIMC141/150

Consider a query (val1, ?, val2, ?, ...) for an n attribute relation:

In fact, we can optimise further by always reading just the first j of these mk slices.

Thus, the descriptor-reading cost in queries can be made constant.

The downside of this trick is a (slight) increase in likelihood of false matches.

Costpmr   =   j + bq


Disjoint Codewords (DJC)142/150

In a disjoint codewords (djc) indexing scheme

A tuple descriptor desc(t) is


DJC example143/150

Consider the following tuple (from a bank deposit database)

branch acctNo name amount
Perryridge 102 Hayes 400

It has the following codewords/descriptor (for m = 4,   k = 2)

Ai cw(Ai)
Perryridge 0101
102 0011
Hayes 1001
400 0101
desc(t) 0101001110010101

(Note: length of tuple = ~28 bytes,   length of descriptor = 16 bits = 2 bytes)


DJC Queries144/150

Use a similar approach to SIMC:

desc(q) is formed by concatentation of codewords


... DJC Queries145/150

E.g. consider the query (Perryridge, ?, Hayes, ?)

Ai cw(Ai)
Perryridge 0101
? 0000
Hayes 1001
? 0000
desc(t) 0101000010010000

If the query has s supplied attributes, then desc(q) has sk bits set to 1.


... DJC Queries146/150

Searching the signature file is done in the same way as for SIMC:

Matches = {}
for each descriptor D[i] in sig file {
    if (D[i] matches desc(q))
        Matches = Matches  TupleId[i]
}
for each Tid in Matches
    fetch tuple t via Tid

The matching process is also the same as for SIMC

matches(Ti,q) if desc(q) AND desc(Ti) = desc(q)


Example DJC Query147/150

Consider the query and the example database:

Signature Tuple
0101 0000 1001 0000 (Perryridge,?,Hayes,?)
1010 0110 1001 0110 (Brighton,217,Green,750)
0101 0101 0101 0101 (Clearview,117,Throggs,295)
1010 1010 1001 1010 (Downtown,101,Johnson,512)
1010 0011 0011 0101 (Mianus,215,Smith,700)
0101 0011 1001 1010 (Perryridge,102,Hayes,400)
1001 0101 0011 1010 (Redwood,222,Lindsay,695)
0101 0110 1001 0110 (Round Hill,305,Turner,350)


False Matches148/150

The query has potential matches:

branch acctNo name amount
Perryridge 102 Hayes 400
Round Hill 305 Turner 350

DJC also suffers from the "false match" problem, due to:

Note:


... False Matches149/150

How to reduce likelihood of false matches?

Since descriptor is formed from concatentation of codewords: Thus, for each Ai, mi bits in the codeword, with mi/2 1-bits.

DJC can be "tuned" to the mix of pmr queries used on the relation.


Query Cost for DJC150/150

If we assume that

then the average query cost analysis is exactly the same as for SIMC.

However, we can make sure that certain query types are answered with less likelihood of false matches by adjusting the codeword lengths.

The optimisations from SIMC (two-level, bit-slice) can also be applied.

As for SIMC, DJC assists with one and pmr, but not range, space, sim.


Produced: 15 May 2016