Selection on One Attribute


Selection


Varieties of Selection2/160

Selection is a relational algebra operation that ...

We consider three distinct styles of selection:

Each style has several possible file structures/techniques.


... Varieties of Selection3/160

We can view a relation as defining a tuple space

E.g. if N = n (i.e. the query specifies a value for every attribute), we are checking for the existence of a tuple at a single point in the tuple space


One-dimensional Selection


Operations for 1d Select5/160

One-dimensional select queries = condition on single attribute.


... Operations for 1d Select6/160

[Diagram:Pics/select/qspace-small.png]


... Operations for 1d Select7/160

Other operations on relations:


... Operations for 1d Select8/160

In the algorithms below, we assume:


Implementing Select Efficiently9/160

Two basic approaches:

Our analyses assume: 1 input buffer available for each relation.

If more buffers are available, most methods benefit.


Heap Files

Note: this is not "heap" as in the partially-ordered tree data structure.
It simply means an unordered collection of tuples in a file.


Heap File Structure11/160

The simplest possible file organisation.

New tuples inserted at end of file; tuples deleted by marking.

[Diagram:Pics/file-struct/heap-small.png]


Selection in Heaps12/160

For all selection queries, the only possible strategy is:

// select * from R where C
f = openFile(fileName("R"),READ);
for (p = 0; p < nPages(f); p++) {
    buf = readPage(f, p);
    for (i = 0; i < nTuples(buf); i++) {
        tup = getTuple(buf,i);
        if (matches(tup,C))
            add tup to result set
    }
}

i.e. linear scan through file searching for matching tuples


... Selection in Heaps13/160

The heap is scanned from the first to the last page:

[Diagram:Pics/file-struct/heapscan-small.png]

Cost_range  =  Cost_pmr  =  b

If we know that only one tuple matches the query (one query),
a simple optimisation is to stop the scan once that tuple is found.

Cost_one :      Best = 1      Average = b/2      Worst = b
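
A sketch of this early-exit scan for a one query, using the same helpers as the loop above:

// select * from R where k = val   (k unique, at most one match)
f = openFile(fileName("R"),READ);
for (p = 0; p < nPages(f); p++) {
    buf = readPage(f, p);
    for (i = 0; i < nTuples(buf); i++) {
        tup = getTuple(buf,i);
        if (matches(tup,C))
            return tup;   // stop scan at first (and only) match
    }
}
return NOT_FOUND;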


Insertion in Heaps14/160

Insertion: new tuple is appended to file (in last page).

f = openFile(fileName("R"),READ|WRITE);
p = nPages(f) - 1;        // index of last page
buf = readPage(f, p);     // request page
if (isFull(buf))          // all slots used
    { p++; clear(buf); }  // start a new last page
if (tooLarge(newTup,buf)) // not enough space
    { deal with oversize tuple }
addTuple(newTup, buf);
writePage(f, p, buf);     // mark page as dirty & release

Cost_insert  =  1r + 1w         (plus possible extra writes for oversize recs)


... Insertion in Heaps15/160

Alternative strategy:

PostgreSQL's strategy:


... Insertion in Heaps16/160

PostgreSQL's tuple insertion:

heap_insert(Relation relation,    // relation desc
            HeapTuple newtup,     // new tuple data
            CommandId cid, ...)   // SQL statement


Deletion in Heaps17/160

Deletion for one conditions, e.g.

delete from Employees where id = 12345

Method:

Cost_delete,one :   best: 1r + 1w    avg: (b/2)r + 1w    worst: br + 1w


... Deletion in Heaps18/160

Deletion for pmr or range conditions, e.g.

delete from Employees where dept = 'Marketing'
delete from Employees where 40 <= age and age < 50

Method:

Cost_delete  =  br + b_q·w     (where b_q is #pages with matches)


... Deletion in Heaps19/160

Implementation of deletion:

// Deletion in heaps ... delete from R where C
f = openFile(fileName("R"),READ|WRITE);
for (p = 0; p < nPages(f); p++) {
    buf = readPage(f, p);
    ndels = 0;
    for (i = 0; i < nTuples(buf); i++) {
        tup = getTuple(buf,i);
        if (matches(tup,C))
            { ndels++; delTuple(buf,i); }
    }
    if (ndels > 0) writePage(f, p, buf);
}


... Deletion in Heaps20/160

How to deal with free slots:

Insertion with free slots:

f = openFile(fileName("R"),READ|WRITE);
for (p = 0; p < nPages(f); p++) {
    buf = readPage(f, p);
    if (spaceInBuf(buf,newTup))
        { addTuple(newTup,buf); break; }
}
if (p == nPages(f)) // not yet added
    { clear(buf); addTuple(newTup,buf); }
writePage(f, p, buf);


... Deletion in Heaps21/160

PostgreSQL tuple deletion:

heap_delete(Relation relation,    // relation desc
            ItemPointer tid, ..., // tupleID
            CommandId cid, ...)   // SQL statement


Updates in Heaps22/160

Example:   update R set F = val where Condition

Analysis for updates is similar to that for deletion

Cost_update  =  br + b_q·w

A complication: new version of tuple may not fit in page.

Solution:   delete and then insert
(Cost harder to analyse; depends on how many "oversize" tuples are produced)


... Updates in Heaps23/160


// Update in heaps ... update R set F = val where C
f = openFile(fileName("R"),READ|WRITE);
for (p = 0; p < nPages(f); p++) {
    buf = readPage(f, p);
    nmods = 0;
    for (i = 0; i < nTuples(buf); i++) {
        tup = getTuple(buf,i);
        if (matches(tup,C)) {
            nmods++;
            modTuple(tup,F,val);       // make new version of tuple
            if (!tooLarge(tup,buf))
                putTuple(buf,i,tup);   // replace old version in place
            else {
                delTuple(buf,i);       // remove old version ...
                insertTupleInHeap(f,tup);  // ... re-insert new one elsewhere
            }
    }   }
    if (nmods > 0) writePage(f, p, buf);
}


... Updates in Heaps24/160

PostgreSQL tuple update:

heap_update(Relation relation,     // relation desc
            ItemPointer otid,      // old tupleID
            HeapTuple newtup, ..., // new tuple data
            CommandId cid, ...)    // SQL statement


Heaps in PostgreSQL25/160

PostgreSQL stores all table data in heap files (by default).

Typically there are also associated index files.

If a file is more useful in some other form:

Heap file implementation:   src/backend/access/heap


... Heaps in PostgreSQL26/160

PostgreSQL "heap files" use 3 physical files


... Heaps in PostgreSQL27/160

Free space map

Visibility map


Sorted Files


Sorted Files29/160

Records stored in file in order of some field k (the sort key).

Makes searching more efficient; makes insertion less efficient

[Diagram:Pics/file-struct/insert-in-sorted-small.png]


... Sorted Files30/160

In order to simplify insertion, use overflow blocks.

[Diagram:Pics/file-struct/sorted-file-small.png]

Large-scale file re-arrangement occurs less frequently.


... Sorted Files31/160

Conceptual view of sorted file structure:

[Diagram:Pics/file-struct/sfile1-small.png]

Total number of overflow blocks = b_ov.

Average overflow chain length:   Ov  =  b_ov / b.

Bucket = data page + its overflow page(s)


Selection in Sorted Files32/160

For one queries on sort key, use binary search.

// select * from R where k = val  (file sorted on R.k)
lo = 0; hi = b-1;
while (lo <= hi) {
    mid = (lo+hi) div 2;
    (tup,loVal,hiVal) = searchBucket(f,mid,k,val);
    if (tup != null) return tup;
    else if (val < loVal) hi = mid - 1;
    else if (val > hiVal) lo = mid + 1;
    else return NOT_FOUND;  // val lies in this bucket's range, but is absent
}
return NOT_FOUND;


... Selection in Sorted Files33/160

Search a page and its overflow chain for a key value

searchBucket(f,p,k,val)
{
    tup = null;
    buf = readPage(f,p);
    (tup,min,max) = searchPage(buf,k,val,+INF,-INF)
    if (tup != null) return(tup,min,max);
    ovf = openOvFile(f);
    ovp = ovflow(buf);
    while (tup == null && ovp != NO_PAGE) {
        buf = readPage(ovf,ovp);
        (tup,min,max) = searchPage(buf,k,val,min,max)
        ovp = ovflow(buf);
    }     
    return (tup,min,max);
}


... Selection in Sorted Files34/160

Search within a page for key; also find min/max values

searchPage(buf,k,val,min,max)
{
    res = null;
    for (i = 0; i < nTuples(buf); i++) {
        tup = getTuple(buf,i);
        if (tup.k == val) res = tup;
        if (tup.k < min) min = tup.k;
        if (tup.k > max) max = tup.k;
    }
    return (res,min,max);
}


... Selection in Sorted Files35/160

The above method treats each bucket as a single large page.

Cases:

Cost_one :      Best = 1      Worst = log2 b + b_ov


... Selection in Sorted Files36/160

Average Costone is messier to analyse:

Cost_one :    Average  ≅  (log2 b − 1)·(1 + Ov)

To be more "precise"

In general, average case is difficult to analyse since dependent on other factors such as data distribution.


... Selection in Sorted Files37/160

For pmr query, on non-unique attribute k

[Diagram:Pics/file-struct/sfile-pmr-small.png]

Begin by locating a page p containing k=val   (as for one query).

Scan backwards and forwards from p to find matches.

Thus, Cost_pmr  =  Cost_one + (b_q − 1)·(1 + Ov)


... Selection in Sorted Files38/160

For range queries on unique sort key (e.g. primary key):

Cost_range  =  Cost_one + (b_q − 1)·(1 + Ov)

If secondary key, similar method to pmr.

[Diagram:Pics/file-struct/sfile-range-small.png]


... Selection in Sorted Files39/160

So far, have assumed query condition involves sort key k.

If condition contains attribute j, not the sort key

Cost_one,   Cost_range,   Cost_pmr   as for heap files


Insertion in Sorted Files40/160

Insertion is more complex than for Heaps.

Thus, Cost_insert  =  Cost_one + δ·w

where δ  =  1 or 2, depending on whether we need to create a new overflow block


Deletion in Sorted Files41/160

Deletion involves:

Cost depends on selection criteria   (i.e. on how many matching tuples there are)

Thus, Cost_delete  =  Cost_select + b_q·w


Hashed Files


Hashing43/160

Basic idea: use key value to compute page address of tuple.

[Diagram:Pics/file-struct/hash-small.png]

e.g. tuple with key = v is stored in page h(v)

Requires: hash function h(v) that maps KeyDomain → [0..b-1].


Hash Functions44/160

Points to note:


... Hash Functions45/160

Usual approach in hash function:

May need separate hash function for each data type

Or, convert each data value to a string and hash that.
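
For example, a minimal sketch of the string-conversion approach (hash_int is an illustrative name, not PostgreSQL code); it reuses the hash_any function shown below:

// hash an integer key by formatting it as a string,
// then hashing the resulting bytes
unsigned int hash_int(int val)
{
    char buf[16];
    int len = sprintf(buf, "%d", val);
    return hash_any((unsigned char *)buf, len);
}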


... Hash Functions46/160

PostgreSQL hash function (simplified):


hash_any(unsigned char *k, register int keylen)
{
    register uint32 a, b, c, len;
    /* Set up the internal state */
    len = keylen;
    a = b = 0x9e3779b9;
    c = 3923095;
    /* handle most of the key */
    while (len >= 12) {
        a += (k[0] + (k[1] << 8) + (k[2] << 16) + (k[3] << 24));
        b += (k[4] + (k[5] << 8) + (k[6] << 16) + (k[7] << 24));
        c += (k[8] + (k[9] << 8) + (k[10] << 16) + (k[11] << 24));
        mix(a, b, c);
        k += 12; len -= 12;
    }

    /* collect any data from last 11 bytes into a,b,c */
    mix(a, b, c);
    return c;
}


... Hash Functions47/160

Where mix is defined as:


#define mix(a,b,c) \
{ \
  a -= b; a -= c; a ^= (c>>13); \
  b -= c; b -= a; b ^= (a<<8);  \
  c -= a; c -= b; c ^= (b>>13); \
  a -= b; a -= c; a ^= (c>>12); \
  b -= c; b -= a; b ^= (a<<16); \
  c -= a; c -= b; c ^= (b>>5);  \
  a -= b; a -= c; a ^= (c>>3);  \
  b -= c; b -= a; b ^= (a<<10); \
  c -= a; c -= b; c ^= (b>>15); \
}

i.e. scrambles all of the bits from the bytes of the key value

See backend/access/hash/hashfunc.c for details.


... Hash Functions48/160

Above functions give hash value as 32-bit quantity (uint32).

Two ways to map raw hash value into a page address:


Hashing Performance49/160

Aims:

Note: if the data itself is not uniformly distributed (e.g. many duplicate key values), the block address distribution cannot be uniform.

Best case: every block contains same number of tuples.

Worst case: every tuple hashes to same block.

Average case: some blocks have more tuples than others.


... Hashing Performance50/160

With non-uniform distribution or too many tuples, some blocks will fill.

Use overflow blocks to contain "excess" tuples; one chain of overflow blocks for each primary block   (containing all tuples with same hash value).

[Diagram:Pics/file-struct/hash-file-small.png]


... Hashing Performance51/160

Trade-off (assuming reasonably uniform hash function):

Want to minimise overflow chain length, yet keep blocks as full as possible.


... Hashing Performance52/160

Two important measures for hash files:

Three cases for distribution of tuples in a hashed file:

Case      L       Ov
Best      ≅ 1     0
Worst     >> 1    **
Average   < 1     0 < Ov < 1

(** performance is same as Heap File)

To achieve average case, aim for   0.75  ≤  L  ≤  0.9.
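
As a small worked example (illustrative numbers): to hold r tuples at capacity c tuples/page with target load factor L, the file needs b = ceil(r/(L·c)) pages:

// r = 100000 tuples, c = 50 tuples/page, target L = 0.8
// b = ceil(100000 / (0.8 * 50)) = 2500 pages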


Selection with Hashing53/160

Best performance occurs for one queries on hash key field.

Basic strategy:

Best Cost_one  =  1    (find in data page)

Average Cost_one  =  1 + Ov/2    (scan half of ovflow chain)

Worst Cost_one  =  1 + max(OvLen)    (find in last page of ovflow chain)


... Selection with Hashing54/160

Select via hashing on unique hash key k (one)

Overview:

P = hashToPage(val)
for each tuple t in page P {
    if (t.k = val) return t
}
for each overflow page Q of P {
    for each tuple t in page Q {
        if (t.k = val) return t
}   }


... Selection with Hashing55/160

Details of select via hashing on unique key k:

// select * from R where k = val
f = openFile(fileName("R"),READ);
ovf = openFile(ovFileName("R"),READ);
p = hash(val) % nPages(f);
buf = readPage(f, p);
for (i = 0; i < nTuples(buf); i++) {
    tup = getTuple(buf,i);
    if (tup.k == val) return tup;
}
ovp = ovflow(buf);
while (ovp != NO_PAGE) {
    buf = readPage(ovf,ovp);
    for (i = 0; i < nTuples(buf); i++) {
        tup = getTuple(buf,i);
        if (tup.k == val) return tup;
    }
    ovp = ovflow(buf);   // move to next page in chain
}
return NOT_FOUND;


... Selection with Hashing56/160

Select via hashing on non-unique hash key k (pmr)

// select * from R where k = val
f = openFile(fileName("R"),READ);
ovf = openFile(ovFileName("R"),READ);
p = hash(val) % nPages(f);
buf = readPage(f, p);
for (i = 0; i < nTuples(buf); i++) {
    tup = getTuple(buf,i);
    if (tup.k == val) append tup to results
}
ovp = ovflow(buf);
while (ovp != NO_PAGE) {
    buf = readPage(ovf,ovp);
    for (i = 0; i < nTuples(buf); i++) {
        tup = getTuple(buf,i);
        if (tup.k == val) append tup to results
    }
    ovp = ovflow(buf);   // move to next page in chain
}
return results;

Cost_pmr  =  1 + Ov


... Selection with Hashing57/160

Unless the hash function is order-preserving (and most aren't) hashing does not help with range queries.

Cost_range  =  b + b_ov

Selection on attribute j which is not hash key ...

Cost_one,   Cost_range,   Cost_pmr  =  b + b_ov


Insertion with Hashing58/160

Insertion uses similar process to one queries.

Overview:

P = hashToPage(val)
if room in page P {
    insert t into P; return
}
for each overflow page Q of P {
    if room in page Q {
        insert t into Q; return
}   }
add new overflow page Q
link Q to previous overflow page
insert t into Q


... Insertion with Hashing59/160

Details of insertion into hashed file:

f = openFile(fileName("R"),READ|WRITE);
p = hash(tup.k) % nPages(f);
buf = readPage(f,p);
if (!isFull(buf) && addTuple(buf,tup) >= 0)
    { writePage(f, p, buf); return; }
ovf = openFile(ovFileName("R"),READ|WRITE);
prevp = p;  inOvflow = false;
p = ovflow(buf);
while (p != NO_PAGE) {
    buf = readPage(ovf,p);
    if (!isFull(buf) && addTuple(buf,tup) >= 0)
        { writePage(ovf, p, buf); return; }
    prevp = p;  inOvflow = true;
    p = ovflow(buf);
}
// no space in bucket; link new ovflow page to last page in chain
newp = nPages(ovf);
ovflow(buf) = newp;
writePage((inOvflow ? ovf : f), prevp, buf);
clear(buf);
addTuple(buf, tup);
writePage(ovf, newp, buf);


... Insertion with Hashing60/160

Variation in costs determined by iteration on overflow chain.

Best Cost_insert  =  1r + 1w

Average Cost_insert  =  (1+Ov)r + 1w

Worst Cost_insert  =  (1+max(OvLen))r + 2w


Deletion with Hashing61/160

Similar performance to select:

// delete from R where k = val
// f = data file ... ovf = ovflow file
p = hash(val) % b;
buf = readPage(f,p);
delTuples(f,p,buf,k,val);
p = ovflow(buf);
while (p != NO_PAGE) {
    buf = readPage(ovf,p);
    delTuples(ovf,p,buf,k,val);
    p = ovflow(buf);
}

Extra cost over select is cost of writing back modified blocks.

Method works for both unique and non-unique hash keys.


... Deletion with Hashing62/160

Function for deleting all matches in a buffer:

delTuples(outf, p, buf, k, val)
{
    int i;
    int ndels = 0;
    for (i = 0; i < nTuples(buf); i++) {
        tup = getTuple(buf,i);
        if (tup.k == val)
            { ndels++; delTuple(buf,i); }
    }
    if (ndels > 0)
        writePage(outf, p, buf);
}

Realistic deletion would also handle free-list of empty pages.


Another Problem with Hashing...63/160

So far, discussion of hashing has assumed a fixed file size (fixed b).

This is known as static hashing.

What size file to use?


... Another Problem with Hashing...64/160

If we change the file size, by adding more blocks to accommodate more tuples, then we are effectively changing the hash function.

In order to be able to use the new hash function to find existing tuples, all tuples need to be removed and re-inserted.

In other words, this requires complete re-organisation of the file.
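
A sketch of such a re-organisation, in the style of the algorithms above (makeHashFile and tmpName are illustrative assumptions):

// rebuild hash file of R with newb pages
old = openFile(fileName("R"), READ);
new = makeHashFile(tmpName("R"), newb);
for each tuple t in old (data pages + ovflow pages)
    insert t into new    // lands in page hash(t.k) % newb
replace old file by new file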

Obvious disadvantages:


Flexible Hashing65/160

Several hashing schemes have been proposed to deal with dynamic files.

Two methods access blocks via a directory:

A third method expands the file systematically, so needs no directory:

All methods:


... Flexible Hashing66/160

Require a function to act on hash values to give d bits

#define HashSize 32
typedef unsigned int HashVal;

// return low-order d bits of h
HashVal bits(int d, HashVal h)
{
    HashVal mask;
    assert(d > 0 && d <= HashSize);
    mask = 0xffffffff >> (HashSize-d);
    return (h & mask);
}

// return high-order d bits of h
HashVal bits'(int d, HashVal h)
{
    assert(d > 0 && d <= HashSize);
    return (h >> (HashSize-d));
}


... Flexible Hashing67/160

Important concept for flexible hashing: splitting

Aim: expandable data file, requiring minimal large reorganisation


... Flexible Hashing68/160

Example of splitting:

[Diagram:Pics/file-struct/split-small.png]

Tuples only show key value; assume h(val) = val


Extendible Hashing69/160

File organisation:

Directory could live in page 0 and be cached in memory buffer.


... Extendible Hashing70/160

[Diagram:Pics/file-struct/exhash1-small.png]


Selection with Ext.Hashing71/160

Size of directory = 2d; d is called depth.

Hash function h produces a (e.g. 32-bit) bit-string.

Use first d bits of it to access directory to obtain block address.

Selection method for one queries:

p = directory[bits'(d,h(val))];
buf = readPage(f,p);
for (i = 0; i < nTuples(buf); i++) {
    tup = getTuple(buf,i);
    if (tup.k == val) return tup;
}

No overflow blocks, so Cost_one  =  1

Assume: a copy of the directory is pinned in DBMS buffer pool.


Insertion with Ext.Hashing72/160

For insertion, use selection method to determine block.

If selected block S not full, simply insert tuple.

What to do if selected block full?

The partitioning process is called splitting.


... Insertion with Ext.Hashing73/160

Basis for splitting?

All tuples r in this block have same bits'(d,hash(r.k))

So ... use d+1 bits of hash (i.e. bits'(d+1,hash(r.k)))

This gives twice as many possible hash values.

Since hash values index directory, directory size is doubled.

E.g. d=4 gives indexes 0..15, d=5 gives indexes 0..31


... Insertion with Ext.Hashing74/160

Doubling directory size and adding one block creates many empty directory slots.

What to do with these slots?   Make them point to existing (buddy) pages.

[Diagram:Pics/file-struct/exhash2-small.png]

The idea: some parts are hashed effectively using d bits, others using d+1 bits.
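
A sketch of the directory-doubling step (names are illustrative): with high-order d-bit indexing, old slot i fans out to slots 2i and 2i+1, which both initially point to the old slot's page:

PageId *doubleDirectory(PageId *dir, int *d)
{
    int i, oldSize = 1 << *d;
    PageId *newDir = malloc(2 * oldSize * sizeof(PageId));
    for (i = 0; i < oldSize; i++) {
        newDir[2*i] = dir[i];     // both new slots share the
        newDir[2*i+1] = dir[i];   // old slot's (buddy) page
    }
    free(dir);  (*d)++;
    return newDir;
}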


... Insertion with Ext.Hashing75/160

If we split a block with two pointers to it, no need to make directory larger.

[Diagram:Pics/file-struct/exhash3-small.png]


Ext.Hashing Example76/160

Consider the following set of tuples describing bank deposits:

Branch       Acct.No   Name      Amount
Brighton     217       Green     750
Downtown     101       Johnson   500
Mianus       215       Smith     700
Perryridge   102       Hayes     400
Redwood      222       Lindsay   700
Round Hill   305       Turner    350
Clearview    117       Throggs   295


... Ext.Hashing Example77/160

We hash on the branch name, with the following hash function:

Branch       Hash Value
Brighton     0010 1101 1111 1011
Clearview    1101 0101 1101 1110
Downtown     1010 0011 1010 0000
Mianus       1000 0111 1110 1101
Perryridge   1111 0001 0010 0100
Redwood      1011 0101 1010 0110
Round Hill   0101 1000 0011 1111


... Ext.Hashing Example78/160

Assume we have a file with page capacity c=2   (two tuples per page).

Start with an initially empty file:

[Diagram:Pics/file-struct/exhashex0-small.png]

Add Brighton... tuple (hash=0010...).
Add Downtown... tuple (hash=1010...).

[Diagram:Pics/file-struct/exhashex1-small.png]


... Ext.Hashing Example79/160

Add Mianus... tuple (hash=1000...).

[Diagram:Pics/file-struct/exhashex2-small.png]


... Ext.Hashing Example80/160

Add Perryridge... tuple (hash=1111...).

[Diagram:Pics/file-struct/exhashex3-small.png]


... Ext.Hashing Example81/160

Add Redwood... tuple (hash=1011...).

[Diagram:Pics/file-struct/exhashex4-small.png]


... Ext.Hashing Example82/160

Add Round Hill... tuple (hash=0101...).

[Diagram:Pics/file-struct/exhashex5-small.png]


... Ext.Hashing Example83/160

Add Clearview... tuple (hash=1101...).

[Diagram:Pics/file-struct/exhashex6-small.png]


... Ext.Hashing Example84/160

Add another Perryridge... tuple (hash=1111...).

[Diagram:Pics/file-struct/exhashex7-small.png]


... Ext.Hashing Example85/160

Add another Round Hill... tuple (hash=0101...).

[Diagram:Pics/file-struct/exhashex8-small.png]


... Ext.Hashing Example86/160

Assumption: directory is small enough to be stored in memory.

Most of the time we read and write exactly one block.

If we need to split, we write just one extra block.

Conditions to minimise splitting:

On average, Cost_insert  =  1r + (1+δ)w

(δ is a small value depending on b, load, hash function,...)


... Ext.Hashing Example87/160

A potential problem:

A degenerate case:


Deletion with Ext.Hashing88/160

Similar to ordinary static hash file - mark tuple as removed.

However, might also wish to reclaim space:

Both cases:


... Deletion with Ext.Hashing89/160

What to do with empty blocks from file perspective?

Empty block might create hole in middle of file.

Two methods to deal with this:


Linear Hashing90/160

File organisation:

Advantages over other approaches:

Uses systematic method of growing data file ...


... Linear Hashing91/160

File grows linearly (one block at a time, at regular intervals).

Can be viewed as having "phases" of expansion; during each phase, b doubles.

[Diagram:Pics/file-struct/linhash1-small.png]


Selection with Lin.Hashing92/160

If b = 2^d, the file behaves exactly like standard hashing.

Use d bits of the hash to compute the block address.

// select * from R where k = val
P = bits(d,hash(val));  --least sig bits
for each tuple t in page P
         and its overflow pages {
    if (t.k == val) return t;
}

Average Cost_one  =  1 + Ov


... Selection with Lin.Hashing93/160

If b != 2^d, treat different parts of the file differently.

[Diagram:Pics/file-struct/linhash2-small.png]

Parts A and C are treated as if part of a file of size 2^(d+1).

Part B is treated as if part of a file of size 2^d.

Part D does not yet exist (B expands into it).


... Selection with Lin.Hashing94/160

Modified algorithm:

// select * from R where k = val
h = hash(val);
p = bits(d,h);
if (p < sp) { p = bits(d+1,h); }
buf = readPage(f,p);
// now proceed as before
for each tuple R in block p
         and its overflow blocks {
    if (R.k == val) return R;
}


Insertion with Lin.Hashing95/160

To see why selection works, need to look at how file expands.

[Diagram:Pics/file-struct/linhash3-small.png]


Lin.Hashing Example96/160

Assume we have a file with page capacity c=2   (two tuples per page).

Start with an initially empty file:

[Diagram:Pics/file-struct/linhashex0-small.png]

Add Brighton... tuple (hash=...1011).

Add Downtown... tuple (hash=...0000).

[Diagram:Pics/file-struct/linhashex1-small.png]


... Lin.Hashing Example97/160

Add Mianus... tuple (hash=...1101).

[Diagram:Pics/file-struct/linhashex2-small.png]

Add Perryridge... tuple (hash=...0100).

[Diagram:Pics/file-struct/linhashex3-small.png]


... Lin.Hashing Example98/160

Add Redwood... tuple (hash=...0110).

[Diagram:Pics/file-struct/linhashex4-small.png]


... Lin.Hashing Example99/160

Add Round Hill... tuple (hash=...1111).

[Diagram:Pics/file-struct/linhashex5-small.png]


... Lin.Hashing Example100/160

Add Clearview... tuple (hash=...1110).

[Diagram:Pics/file-struct/linhashex6-small.png]


... Lin.Hashing Example101/160

Add another Clearview... tuple.

[Diagram:Pics/file-struct/linhashex7-small.png]


Linear Hashing Insertion102/160

Abstract view:

p = bits(d,hash(tup.k));
if (p < sp) p = bits(d+1,hash(tup.k));
// bucket p = page p + its ovflow pages
for each page Q in bucket p {
    if (space in Q) {
        insert tuple into Q
        break
    }
}
if (no insertion) {
    add new ovflow page to bucket p
    insert tuple into new page
}
if (need to split) {
    partition tuples from bucket sp
          into buckets sp and sp+2^d
    sp++;
    if (sp == 2^d) { d++; sp = 0; }
}


... Linear Hashing Insertion103/160

Detailed algorithm:

h = hash(tup.k);
p = bits(d,h);
if (p < sp) p = bits(d+1,h);
// start insertion into bucket p
inserted = 0; inOvflow = 0; prevp = p;
buf = readPage(f,p);
if (isFull(buf))
    { p = ovflow(buf); }   // try the ovflow chain
else {
    // have space in data page
    addTuple(buf,tup); // assume it fits
    writePage(f, p, buf);
    inserted = 1;
}
...


... Linear Hashing Insertion104/160

...
// attempt insertion into ovflow pages
while (!inserted && p != NO_PAGE) {
    inOvflow = 1;
    buf = readPage(ovf,p);
    if (!isFull(buf)) {
        addTuple(buf,tup);
        writePage(ovf, p, buf);
        inserted = 1;
    }
    if (!inserted)
        { prevp = p; p = ovflow(buf); }
}
// add a new ovflow page
if (!inserted) {
    // link last page in bucket to the new page
    ovflow(buf) = nPages(ovf);
    outf = inOvflow ? ovf : f;
    writePage(outf, prevp, buf);
    clear(buf);
    addTuple(buf, tup);
    writePage(ovf, nPages(ovf), buf);
}
// tuple inserted
// tuple inserted


Splitting105/160

How to decide that we "need to split"?

In extendible hashing, blocks were split on overflow.

In linear hashing, we always split block sp.

Two approaches to triggering a split:


... Splitting106/160

How do we accomplish partitioning?

All tuples in sp plus its overflow blocks have the same d-bit hash (e.g. 01).

New block is at address sp+2^d.

We consider d+1 bits of hash (giving e.g. 001, 101).

This gives, for each tuple, an address of either sp or sp+2^d.

Place tuples according to their d+1 hash bits.


... Splitting107/160

Partitioning process for block sp=01:

[Diagram:Pics/file-struct/linhash4-small.png]


... Splitting108/160

Some observations on splitting:

Simplest if we maintain free list of overflow pages.


... Splitting109/160

Detailed splitting algorithm:


// partition tuples between two buckets
oldp = sp;  newp = sp + 2^d;
buf = readPage(f,sp);
clear(oldBuf); clear(newBuf);
for (i = 0; i < nTuples(buf); i++) {
    tup = getTuple(buf,i);
    hp = bits(d+1,hash(tup.k));
    if (hp == newp)
        addTuple(newBuf,tup);
    else
        addTuple(oldBuf,tup);
}
p = ovflow(buf);  oldOv = newOv = 0;
while (p != NO_PAGE) {
    ovbuf = readPage(ovf,p);
    for (i = 0; i < nTuples(ovbuf); i++) {
        tup = getTuple(ovbuf,i);
        hp = bits(d+1,hash(tup.k));
        if (hp == newp) {
            if (isFull(newBuf)) {
                nextp = nextFree(ovf);
                ovflow(newBuf) = nextp;
                outf = newOv ? ovf : f;
                writePage(outf, newp, newBuf);
                newOv++; newp = nextp; clear(newBuf);
            }
            addTuple(newBuf, tup);
        }
        else {
            if (isFull(oldBuf)) {
                nextp = nextFree(ovf);
                ovflow(oldBuf) = nextp;
                outf = oldOv ? ovf : f;
                writePage(outf, oldp, oldBuf);
                oldOv++; oldp = nextp; clear(oldBuf);
            }
            addTuple(oldBuf, tup);
        }
    }
    nextp = ovflow(ovbuf);
    addToFreeList(ovf,p);  // old ovflow page no longer needed
    p = nextp;
}
// write out final (possibly partial) page of each bucket
writePage((oldOv ? ovf : f), oldp, oldBuf);
writePage((newOv ? ovf : f), newp, newBuf);
sp++;
if (sp == 2^d) { d++; sp = 0; }


Insertion Cost110/160

If no split required, cost same as for standard hashing:

Best Cost_insert  =  1r + 1w

Average Cost_insert  =  (1+Ov)r + 1w

Worst Cost_insert  =  (1+max(OvLen))r + 1w


... Insertion Cost111/160

If split occurs, incur Costinsert plus cost of partitioning.

Partitioning cost involves:

On average, Cost_partition  =  (1+Ov)r + (2+Ov)w


Deletion with Lin.Hashing112/160

Deletion is similar to ordinary static hash file.

But might wish to contract file when enough tuples removed.

Rationale: r shrinks but b stays large  ⇒  wasted space.

Method: remove last block in file (contracts linearly).

Involves a coalesce procedure which is an inverse split.
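
One plausible sketch of coalescing (the details here are an assumption; the notes only name the operation), using the same conventions as the insertion code, with b = sp + 2^d data pages:

// coalesce = inverse of split
if (sp == 0) { d--; sp = 2^d; }  // wrap back to previous phase
sp--; b--;
// page b is now the ex-partner of page sp (b == sp + 2^d)
for each tuple t in bucket b (data page + ovflow pages)
    insert t into bucket sp
release page b and its ovflow pages (e.g. onto a free list)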


Hash Files in PostgreSQL113/160

PostgreSQL uses linear hashing on tables which have been indexed via:

create index Ix on R using hash (k);

Hash file implementation: backend/access/hash

Note: "bucket" = data page + associated ovflow pages


... Hash Files in PostgreSQL114/160

PostgreSQL uses slightly different file organisation ...

If overflow pages become empty, add to free list and re-use.


... Hash Files in PostgreSQL115/160

PostgreSQL hash file structure:

[Diagram:Pics/file-struct/pgsql-hashfile-small.png]


Indexes


Selection via Trees117/160

Tree-based file access methods


Indexing118/160

The basic idea behind a database index is the same as for a book index.

[Diagram:Pics/file-struct/book-index-small.png]

A table of (key,usage) pairs:


Indexing File Structure119/160

[Diagram:Pics/file-struct/index-small.png]


Indexes120/160

A 1-d index is based on the value of a single attribute A.

Some possible properties of A:

Taxonomy of index types, based on properties of index attribute:

primary      index on unique field, file sorted on this field
clustering   index on non-unique field, file sorted on this field
secondary    file not sorted on this field


... Indexes121/160

Indexes themselves may be structured in several ways:

dense every tuple is referenced by an entry in the index file
sparse only some tuples are referenced by index file entries
single-level tuples are accessed directly from the main index file
multi-level may need to access several index pages to reach tuple

Index structures aim to minimise storage and search costs.


... Indexes122/160

A relation/file may have:

Indexes are typically stored in files separate from the data   (possibly cached in memory).


Primary Index123/160

Primary index: ordered file whose "tuples" have two fields:

Dense: one index tuple per data tuple   (no requirement on data file ordering)

Sparse: one index tuple per data page     (requires primary key sorted data file)

Index file has blocking factor c_i   (where typically c_i ≫ c)

Index has i blocks in total:    Dense: i = ceil(r/c_i)    Sparse: i = ceil(b/c_i)
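
A small worked example (illustrative numbers, not from these notes):

// r = 100000 tuples, b = 5000 data pages, ci = 100 entries/page
// dense index:   i = ceil(100000/100) = 1000 index blocks
// sparse index:  i = ceil(5000/100)   = 50 index blocks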


... Primary Index124/160

One possible implementation for index blocks:

typedef struct {
    IndexHeader info;
    IndexEntry  entries[];
} IndexBlock;

typedef struct {
    Datum key;
    int   pageNum;
    int   tupleNum;
} IndexEntry;

Note that pageNum+tupleNum is equivalent to TupleId.
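
One plausible encoding of that equivalence (the field widths are assumptions, not PostgreSQL's ItemPointer layout):

typedef unsigned int TupleId;

#define TUP_BITS 8   // allows up to 256 tuples per page

TupleId mkTupleId(int pageNum, int tupleNum)
    { return (pageNum << TUP_BITS) | tupleNum; }
int pageOf(TupleId tid)  { return tid >> TUP_BITS; }
int tupleOf(TupleId tid) { return tid & ((1 << TUP_BITS) - 1); }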


Dense Primary Index125/160

[Diagram:Pics/file-struct/prim-index-small.png]


Sparse Primary Index126/160

[Diagram:Pics/file-struct/prim-index1-small.png]


Index Overheads127/160

Consider a relation and index with the following parameters:

(continued)


... Index Overheads128/160

For each page in the index file ...

Size of index file:


Selection with Prim.Index129/160

For one queries:

ix = binary search index for entry with key K
if nothing found { return NotFound }
b = getPage(ix.pageNum)
t = getTuple(b,ix.tupleNum)
   -- may require reading overflow pages
return t

Worst cost:   read log2 i index pages  +  read 1+Ov data pages.

Thus, Cost_one,prim  =  log2 i + 1 + Ov


... Selection with Prim.Index130/160

For range queries on primary key:

For pmr queries involving the primary key:

For space queries involving the primary key:

For queries not involving the primary key, the index gives no help.


Insertion with Prim.Index131/160

Overview:

insert tuple into page P
find location for new entry in index file
  // could check whether it already exists
insert new index entry (k,P,i) into index file

Problem: order of index entries must be maintained.

On average, this requires us to read/write half of index file.

Cost_insert,prim  =  (log2 i)r + (i/2)·(1r + 1w) + (1+Ov)r + (1+δ)w


Deletion with Prim.Index132/160

Overview:

find tuple using index
mark tuple as deleted
remove index entry for tuple

If we delete index entries by marking ...

If we delete index entry by index file reorganisation ...


Clustering Index133/160

Index on non-unique ordering attribute Ac.

Usually a sparse index: one pointer to the first tuple containing each value.

Assists with:


... Clustering Index134/160

[Diagram:Pics/file-struct/clus-index-small.png]


... Clustering Index135/160

Insertions are expensive: rearrange index file and data file.

This applies whether new value or new tuple containing known value.

One "optimisation": reserve whole block for one particular value.

Deletions relatively cheap (similar to primary index).

(Note: can't mark index entry for value X until all X tuples are deleted)


Secondary Index136/160

Generally, dense index on non-unique attribute A_s

Problem: multiple tuples with same value for A_s.

Three possible solutions:


... Secondary Index137/160

First style of secondary index:

[Diagram:Pics/file-struct/sec-index-small.png]


... Secondary Index138/160

Third (favoured) style of secondary index:

[Diagram:Pics/file-struct/sec-index1-small.png]


Selection with Sec.Index139/160

Since non-unique attribute, there are no (explicit) one queries.

For pmr queries involving A_s:

binary search for key K in index file 1
for each entry for K in index file 2 {
    b = getPage(ix.pageNum)
    search for tuple within buffer b
}

Assume that answering the query q requires:

Cost_pmr  =  (log2 i + a_q + b_q·(1 + Ov))r


... Selection with Sec.Index140/160

For range queries e.g. Lo ≤ A_s ≤ Hi:

binary search for key Lo in index file 1
FOR each entry in index file 2 until Hi DO
    access tuple R via TupleID from index
END

Analysis almost same as for pmr ...

Cost_range  =  (log2 i + a_q + Σ_{k=Lo..Hi} b_k·(1 + Ov))r

Note: may read some blocks multiple times   (alleviate with buffer pool)


Insertion/Deletion with Sec.Index141/160

Insertion:

As for primary index:

Deletion:

Can use mark-style (tombstone) deletion for tuples.

Caution: can only mark the index entry for k once no entries for k remain in the block-address blocks.


Multi-level Indexes142/160

In indexes described so far, search begins with binary search of index.

Requires log2 i index block accesses.

We can improve this by putting a second index on the first index.

Second index is sparse; (key,pointer) to first entry in first index block.

If there are i first-level index blocks, there will be i_2 = ceil(i/c_i) second-level index blocks.

If second-level index too big, add third level ...

Maximum levels:   t  =  ceil(log_ci r)     (dense index)

Top-level of index is always just a single block.
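
A worked illustration (assumed numbers):

// r = 1000000 index entries, ci = 100 entries per index page
// t = ceil(log_ci r) = ceil(log100(1000000)) = 3 levels
//   level 1: 10000 blocks,  level 2: 100 blocks,  level 3 (top): 1 block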


... Multi-level Indexes143/160

[Diagram:Pics/file-struct/ml-index-small.png]


Select with ML.Index144/160

For one query on indexed key field:

I = top level index block
for level = t to 1 {
    read index block I
    search index block for J'th entry
        where index[J].key <= K < index[J+1].key
    if J=0 { return NotFound }
    I = index[J].block
}
-- I is now address of data block
search block I and its overflow blocks

Read t index blocks and 1+Ov data blocks.

Thus, Cost_one,mli  =  (t + 1 + Ov)r

(Recall that t = ceil(log_ci r), whereas a single-level index needs log2 i)


B-Trees145/160

B-trees are multi-way search trees (MSTs) with the properties:

B-tree insertion and deletion methods:

Advantages of B-trees over general MSTs:


... B-Trees146/160

Example B-tree (depth=3, n=3):

[Diagram:Pics/file-struct/btree0-small.png]


B-Tree Depth147/160

Depth depends on effective branching factor  (i.e. how full nodes are).

Simulation studies (random insertions and deletions) show typical B-tree nodes are 69% full.

Gives   load L_i = 0.69 × c_i   and   depth of tree ≈ log_Li r .

Example: c_i = 128,    L_i = 88

Level   #nodes    #keys
root    1         87
1       88        7656
2       7744      673728
3       681472    59288064

Note: c_i is generally larger than 128 for a real B-tree.


B+Trees148/160

In database context, nodes are index blocks.

Two possible ways to structure them:

Higher levels of a B+ tree have greater c_i  ⇒  a B+ tree may have fewer levels.
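
One possible node layout (a sketch; the names and fields are assumptions, not PostgreSQL's):

typedef struct {
    int     nkeys;            // entries currently in node
    bool    isLeaf;
    Datum   keys[MAXKEYS];
    PageId  child[MAXKEYS+1]; // internal nodes: child block pointers
    TupleId tids[MAXKEYS];    // leaf nodes: pointers to data tuples
    PageId  next;             // leaf nodes: next leaf, for range scans
} BTreeNode;

In practice the two pointer arrays would overlay each other (e.g. via a union), since a node is either internal or leaf.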


B-Tree Index149/160

[Diagram:Pics/file-struct/btree-small.png]


B+Tree Index150/160

[Diagram:Pics/file-struct/b+tree-small.png]


Insertion into B-Trees151/160

Overview of the method:

  1. find the leaf node and the position in that node where the entry would be stored
  2. if the node is not full, insert the entry into the appropriate spot
  3. if the node is full, split it into two half-full nodes and promote the middle entry to the parent
  4. if the parent is full, split and promote again (and so on upwards)
Note: if duplicates are not allowed and the key is found, may stop after step 1.


... Insertion into B-Trees152/160

When splitting a node and promoting the middle element, we may find that the parent node is full.

In this case we split the parent in the same way as the child and promote its middle element.

This splitting may propagate all the way to root.

In this case we create a new root containing a single entry.

This is the only exception to the half-full rule.


B-Tree Insertion Cost153/160

Insertion cost  =  Cost_treeSearch + Cost_treeInsert + Cost_dataInsert

Best case: write one page (most of the time)

Cost_insert  =  Dr + 1w + 1r + 1w

Common case: 3 node writes (rearrange 2 leaves + parent)

Cost_insert  =  Dr + 3w + 1r + 1w


... B-Tree Insertion Cost154/160

Worst case: 2D-1 node writes (propagate to root)

Cost_insert  =  Dr + (2D-1)w + 1r + 1w


Deletion from B-Trees155/160

Overview of the method:

  1. find entry in node nd
  2. if nd is a non-leaf node:
    1. remove entry
    2. promote its immediate successor from child node to take its place
    3. continue as if operation were deletion of promoted child
  3. if nd is a leaf node
    1. remove entry
    2. if still more than half full, compact
    3. if < half full, rearrange entries between parent and siblings to make half full


Selection with B+Trees156/160

For one queries:

N = B-tree root node
while (N is not a leaf node) {
   scan N to find reference to next node
   N = next node
}
scan leaf node N to find entry for K
access tuple t using TupleId from N

Cost_one  =  (D + 1)r

Generally, D is around 2 or 3, even for very large data files.

A further optimisation is to buffer the B-tree root page   (⇒ total D page reads)


... Selection with B+Trees157/160

For range queries (assume sorted on index attribute):

search index to find leaf node for Lo
for each leaf node entry until Hi found {
    access tuple t using TupleId from entry
}

Cost_range  =  (D + b_i + b_q)r

(If we hold the root block in memory ⇒ read only D−1 index blocks)


... Selection with B+Trees158/160

For pmr, need index on ≥ 1 query attribute.

Could have indexes on several attributes:

Pages = AllPages   // start with every data page as a candidate
for each Ai = k in query {
    find pages P containing Ai = k   // via index on Ai
    Pages = Pages ∩ P
}
for each P in Pages {
    scan P for matching tuples
}

If q mentions attributes a_i, a_j, a_k, then

Cost_pmr  =  (D_i + D_j + D_k + |b_qi ∩ b_qj ∩ b_qk|)r

For space queries, treat as a combination of pmr and range.


B-trees in PostgreSQL159/160

PostgreSQL implements Lehman/Yao-style B-trees.

A variant that works effectively in high-concurrency environments.

B-tree implementation: backend/access/nbtree


... B-trees in PostgreSQL160/160

Interface functions for B-trees

// build Btree index on relation
Datum btbuild(rel,index,...)
// insert index entry into Btree
Datum btinsert(rel,key,tupleid,index,...)
// start scan on Btree index
Datum btbeginscan(rel,key,scandesc,...)
// get next tuple in a scan
Datum btgettuple(scandesc,scandir,...)
// close down a scan
Datum btendscan(scandesc)


Produced: 12 Jun 2016