Scan, Sort, Project
Implementing Relational Operations |
Relational Operations | 2/90 |
DBMS core = relational engine, with implementations of
... Relational Operations | 3/90 |
Implementation of relational operations in DBMS:
... Relational Operations | 4/90 |
All relational operations return a set of tuples.
Can represent a typical operation programmatically as:
ResultSet = {}// initially an empty set while (t = nextRelevantTuple()) {// format tuple according to projection t' = formatResultTuple(t,Projection)// add next relevant tuple to result set ResultSet = ResultSet ∪ t' } return ResultSet
All of the hard work is in the nextRelevantTuple()
... Relational Operations | 5/90 |
nextRelevantTuple()
nextRelevantTuple()
ResultSet
ResultSet
... Relational Operations | 6/90 |
There are three "dimensions of variation" in this system:
Query Types | 7/90 |
Queries fall into a number of classes:
SQL | RelAlg | a.k.a. | ||||
select * from R |
R | - | ||||
select from R |
Proj[x,y]R | - | ||||
select * from R order by |
Sort[x]R | ord |
... Query Types | 8/90 |
SQL | RelAlg | a.k.a. | ||||
select * from R where id = |
Sel[id=k]R | one | ||||
select * from R where a = |
Sel[a=k]R | - | ||||
select * from R where a= and b= |
Sel[a=j ∧ b=k]R | pmr | ||||
select * from R where a> and a< |
Sel[a>j ∧ a<k]R | rng | ||||
select * from R where a> and a< and b> and b< |
Sel[...]R | space |
... Query Types | 9/90 |
SQL | RelAlg | a.k.a. | ||||
select * from R,S where R.id = S.r |
R Join[id=r] S | - | ||||
select * from R,S where R. =S. and R. =S. |
R Join[v=w ∧ x=y] S | - | ||||
select * from R,S where R. S. |
R Join[...] S | - | ||||
select * from R where R.* |
R ≅ Obj | sim |
Cost Models |
Cost Models | 11/90 |
An important aspect of this course is
Rather, we attempt to develop cost models
... Cost Models | 12/90 |
Assumptions in our cost models:
... Cost Models | 13/90 |
In developing cost models, we also assume:
... Cost Models | 14/90 |
Typical values for measures used in cost models:
Symbol | E.g. Value | |||
r | 106 | |||
R | 128 bytes | |||
b | 105 | |||
B | 8192 bytes | |||
c | 60 | |||
Tr,Tw | 10 msec | |||
- | ≅ 0 | |||
answers for query q |
bq | ≥ 0 |
Example file structures | 15/90 |
When describing file structures
... Example file structures | 16/90 |
Consider three simple file structures:
Some records in each page may be marked as "deleted".
... Example file structures | 17/90 |
Heap file with b = 4, c = 4:
... Example file structures | 18/90 |
Sorted file with b = 4, c = 4:
... Example file structures | 19/90 |
Hashed file with b = 3, c = 4, h(k) = k%3
... Example file structures | 20/90 |
Indexed file with b = 4, c = 4, bi = 2, ci = 8:
Scanning |
Scanning | 22/90 |
Consider the query:
select * from T;
Conceptually:
for each tuple t in relation T { add tuple t to result set }
... Scanning | 23/90 |
Implemented via iteration over file containing T:
for each page P in file of relation T { for each tuple t in page P { add tuple t to result set } }
Cost: read every data page once
Cost = b.Tr
... Scanning | 24/90 |
In terms of file operations:
// implementation of "select * from T" File inf;// data file handle int p;// input file page number Buffer buf;// input file buffer int i;// current record in input buf Tuple t;// data for current record inf = openFile(fileName("T"), READ) for (p = 0; p < nPages(inf); p++) { buf = readPage(inf,p); for (i = 0; i < nTuples(buf); i++) { t = getTuple(buf,i); add t to result set } }
... Scanning | 25/90 |
Scan implementation when file has overflow pages, e.g.
... Scanning | 26/90 |
In this case, the implementation changes to:
for each page P in file of relation T { for each tuple t in page P { add tuple t to result set } for each overflow page V of page P { for each tuple t in page V { add tuple t to result set } } }
Cost: read each data and overflow page once
Cost = (b + bOv).Tr
where bOv = total number of overflow pages
... Scanning | 27/90 |
In terms of file operations:
// implementation of "select * from T" File inf;// data file handle File ovf;// overflow file handle int p;// input file page number int ovp;// overflow file page number Buffer buf;// input file buffer int i;// current record in input buf Tuple t;// data for current record inf = openFile(fileName("T"), READ) ovf = openFile(ovFileName("T"), READ) for (p = 0; p < nPages(inf); p++) { buf = readPage(inf,p); for (i = 0; i < nTuples(buf); i++) { t = getTuple(buf,i); add t to result set } ovp = ovflow(buf); while (ovp != NO_PAGE) { buf = readPage(ovf,ovp); for (i = 0; i < nTuples(buf); i++) { t = getTuple(buf,i); add t to result set } ovp = ovflow(buf); } }
Cost: read data+ovflow page Cost = (b+bov).Tr
Selection via Scanning | 28/90 |
Consider a one query like:
select * from Employee where id = 762288;
In an unordered file, search for matching record requires:
Guaranteed at most one answer; could be in any page.
... Selection via Scanning | 29/90 |
In terms of file operations (assuming var delcarations as before):
inf = openFile(fileName("Employee"), READ); for (p = 0; p < nPages(inf); p++) buf = readPage(inf,p); for (i = 0; i < nTuples(buf); i++) { t = getTuple(buf,i); if (getField(t,"id") == 762288) return t; } }
For different selection condition, simply replace
(getField(t,"id")==762288)
... Selection via Scanning | 30/90 |
Cost analysis for one searching in unordered file
File Copying | 31/90 |
Consider an SQL statement like:
create table T as (select * from S);
Effectively, copies data from one file to another.
Conceptually:
make empty relation T for each tuple t in relation S { append tuple t to relation T }
... File Copying | 32/90 |
In terms of file operations:
File inf,outf;// input/output file handles int ip,op;// input/output page numbers int i;// record number in input buf Tuple t;// current record Buffer buf;// input file buffer Buffer obuf;// output file buffer inf = openFile(fileName("S"), READ); outf = openFile(fileName("T"), CREATE); clear(obuf); for (ip = op = 0; ip < nPages(inf); ip++) { buf = readPage(inf, ip); for (i = 0; i < nTuples(buf); i++) { t = getTuple(i, buf); addTuple(t, obuf); if (isFull(obuf)) { writePage(outf,op++,obuf); clear(obuf); } } } if (nTuples(obuf) > 0) writePage(outf,op,obuf);
... File Copying | 33/90 |
Cost analysis for file copying:
Iterators | 34/90 |
Higher-levels of DBMS are given a view of scanning as:
cursor = initScan(relName,condition); while (tup = getNextTuple(cursor)) { process tup } endScan(cursor);
Also known as iterator.
... Iterators | 35/90 |
Implementation of simple scan iterator (via file operations):
typedef struct { File inf;// data file handle Buffer buf;// input buffer int curp;// current page number int curi;// current record number Expr cond;// representation of condition } Cursor;
... Iterators | 36/90 |
Implementation of simple scan iterator (continued):
Cursor *initScan(char *rel, char *cond) { Cursor *c; c = malloc(sizeof(Cursor)); c->inf = openFile(fileName(rel),READ); c->buf = readPage(c->inf,0); c->curp = 0; c->curi = 0; c->cond = makeTestableCondition(cond); return c; } void endScan(Course *c) { closeFile(c->inf); freeExpr(c->cond); free(c); }
... Iterators | 37/90 |
Implementation of simple scan iterator (continued):
Tuple getNextTuple(Cursor *c) { getNextTuple: if (c->curi < nTuples(c->buf)) return getTuple(c->buf, c->curi++); else {// no more tuples in this page; get next page c->curp++; if (c->curp == nPages(c->inf)) return NULL;// no more pages else { c->buf = readPage(c->inf,c->curp); c->curi = 0; goto getNextTuple; } } }
... Iterators | 38/90 |
Implementation of full iterator interface via file operations:
typedef struct { File inf;// data file handle File ovf;// overflow file handle Buffer buf;// input buffer int curp;// current page number int curop;// current ovflow page number int curi;// current record number Expr cond;// representation of condition } Cursor;
... Iterators | 39/90 |
Implementation of full iterator interface (continued):
Cursor *initScan(char *rel, char *cond) { Cursor *c; c = malloc(sizeof(Cursor)); c->inf = openFile(fileName(rel),READ); c->ovf = openFile(ovFileName(rel),READ); c->buf = readPage(c->inf,0); c->curp = 0; c->curop = NO_PAGE; c->curi = 0; c->cond = makeTestableCondition(cond) return c; } void endScan(Course *c) { closeFile(c->inf); if (c->ovf) closeFile(c->ovf); freeExpr(c->cond); free(c); }
... Iterators | 40/90 |
Implementation of scanning interface (continued):
Tuple getNextTuple(Cursor *c) { getNextTuple: if (c->curi < nTuples(c->buf)) return getTuple(c->buf, c->curi++); else {// no more tuples in this page; get next page if (c->curop == NO_PAGE) { c->curop = ovflow(c->buf); if (c->curop != NO_PAGE) {// start ovflow chain scan getNextOvPage: c->buf = readPage(c->ovf,c->curop); c->curi = 0; goto getNextTuple; } else { getNextDataPage: c->curp++; if (c->curp == nPages(c->inf)) return NULL;// no more pages else { c->buf = readPage(c->inf,c->curp); c->curi = 0; goto getNextTuple; } } } else {// continue ovflow chain scan c->curop = ovflow(c->buf); if (c->curop == NO_PAGE) goto getNextDataPage; else goto getNextOvPage; } } }
Scanning in PostgreSQL | 41/90 |
Scanning defined in: /backend/access/heap/heapam.c
Implements iterator data/operations:
HeapScanDesc
scan = heap_beginscan(rel,...,nkeys,keys)
initscan()
tup = heap_getnext(scan, direction)
heapgettup()
heap_endscan(scan)
scan
HeapKeyTest()
... Scanning in PostgreSQL | 42/90 |
typedef struct HeapScanDescData {// scan parameters Relation rs_rd;// heap relation descriptor Snapshot rs_snapshot;// snapshot ... tuple visibility int rs_nkeys;// number of scan keys ScanKey rs_key;// array of scan key descriptors ...// state set up at initscan time PageNumber rs_npages;// number of pages to scan PageNumber rs_startpage;// page # to start at ...// scan current state, initally set to invalid HeapTupleData rs_ctup;// current tuple in scan PageNumber rs_cpage;// current page # in scan Buffer rs_cbuf;// current buffer in scan ... } HeapScanDescData;
Scanning in other File Structures | 43/90 |
Above examples are for heap files
btree
hash
gist
gin
Sorting |
The Sort Operation | 45/90 |
Sorting is explicit in queries only in the order by
select * from Students order by name;
More important, sorting is used internally in other operations:
group by
External Sorting | 46/90 |
Sort methods such as quicksort are designed for in-memory data.
For data on disks, need external sorting techniques.
The standard external sorting method (merge sort) works by
... External Sorting | 47/90 |
Sorting tuples within pages
Two-way Merge Sort | 48/90 |
Requires three in-memory buffers:
Assumption: cost of merge on two buffers ≅ 0.
... Two-way Merge Sort | 49/90 |
Two-way merge-sort method:
read each page into buffer, sort it, write it numberOfRuns = b; runLength = 1; while (numberOfRuns > 1) { for each pair of adjacent runs { merge the pair of runs to output, by - read pages from runs into input buffers, one page at a time - apply merge algorithm to transfer tuples to output buffer - flush output buffer when full and when merge finished } numberOfRuns = numberOfRuns / 2 runLength = runLength * 2 }
... Two-way Merge Sort | 50/90 |
Example:
... Two-way Merge Sort | 51/90 |
Two-way merge-sort method (improved):
numberOfRuns = b; runLength = 1; while (numberOfRuns > 1) { for each pair of adjacent runs { merge the pair of runs to output, by - read pages from runs into input buffers, one page at a time - if (runLength == 1) sort contents of each input buffer - apply merge algorithm to transfer tuples to output buffer - flush output buffer when full and when merge finished } numberOfRuns = numberOfRuns / 2 runLength = runLength * 2 }
Avoids first pass to sort contents of individual pages.
... Two-way Merge Sort | 52/90 |
Consider file where b = 2k:
... Two-way Merge Sort | 53/90 |
Example:
Merging Two Sorted Pages | 54/90 |
Method using operations on files and buffers:
// Pre: buffers B1,B2; outfile position op // Post: tuples from B1,B2 output in order i1 = i2 = 0; clear(Out); R1 = getTuple(B1,i1); R2 = getTuple(B2,i2); while (i1 < nTuples(B1) && i2 < nTuples(B2)) { if (lessThan(R1,R2)) { addTuple(R1,Out); i1++; R1 = getTuple(B1,i1); } else { addTuple(R2,Out); i2++; R2 = getTuple(B2,i2); } if (isFull(Out)) { writePage(outf,op++,Out); clear(Out); } } for (i1=i1; i1 < nTuples(B1); i1++) { addTuple(getTuple(B1,i1), Out); if (isFull(Out)) { writePage(outf,op++,Out); clear(Out); } } for (i2=i2; i2 < nTuples(B2); i2++) { addTuple(getTuple(B2,i2), Out); if (isFull(Out)) { writePage(outf,op++,Out); clear(Out); } } if (nTuples(Out) > 0) writePage(outf,op,Out);
Merging Runs vs Merging Pages | 55/90 |
In the above, we merged two input buffers.
In general, we need to merge sorted "runs" of pages.
The only difference that this makes to the above method:
R1 = getTuple(B1,i1);
becomes
if (i1 == nTuples(B1)) { B1 = readPage(inf,ip++); i1 = 0; } R1 = getTuple(B1,i1);
Comparison for Sorting | 56/90 |
Above assumes that we have a function to compare tuples.
Mechanism needs to be generic, to handle all of:
select * from Employee order by eid; select * from Employee order by name; select * from Employee order by age;
Envisage a function tupCompare(r1,r2,f)
strcmp
r1
r2
f
r1.f < r2.f
r1.f > r2.f
r1.f == r2.f
... Comparison for Sorting | 57/90 |
SQL allows sorting based on multiple fields, e.g.
select * from Employee order by name,age;
How to specify sort condition for several attributes?
E.g. sort key is (A1,A2,..An)
int tupCompare(Tuple r1, Tuple r2, SortKeys A) { if (r1.A1 != r2.A1) return(r1.A1 - r2.A1); else if (r1.A2 != r2.A2) return(r1.A2 - r2.A2); else if ... ... else if (r1.An != r2.An) return(r1.An - r2.An); else return 0; }
Assume that ri.Ai
Cost of Two-way Merge Sort | 58/90 |
For a file containing b data pages:
Example: Relation with r=105 and c=50 ⇒ b=2000 pages.
Number of passes for sort: ⌈log22000⌉ = 11
Reads/writes entire file 11 times! Can we do better?
n-Way Merge Sort | 59/90 |
Merge passes use: B total memory buffers, n input buffers, B-n output buffers
Typically, consider only one output buffer, i.e. B = n + 1
Note that initial sort pass uses all B buffers ..
... n-Way Merge Sort | 60/90 |
Method:
// Produce B-page-long runs for each group of B pages in Rel { read pages into memory buffers sort group in memory write pages out to Temp }// Merge runs until everything sorted // n-way merge, where n=B-1 numberOfRuns = ⌈b/B⌉ while (numberOfRuns > 1) { for each group of n runs in Temp { merge into a single run via input buffers write run to newTemp via output buffer } numberOfRuns = ⌈numberOfRuns/n⌉ Temp = newTemp// swap input/output files }
... n-Way Merge Sort | 61/90 |
Method for merging n runs (n input buffers, 1 output buffer):
for i = 1..n { read first page of run[i] into a buffer[i] set current tuple cur[i] to first tuple in buffer[i] } while (more than 1 run still has tuples) { s = find buffer with smallest tuple as cur[i] copy tuple cur[i] to output buffer if (output buffer full) { write it and clear it} advance cur[i] to next tuple if (no more tuples in buffer[i]) { if (no more pages in run[i]) mark run[i] as complete else { read next page of run[i] into buffer[i] set cur[i] to first tuple in buffer[i] } } } copy tuples in non-empty buffer to output
Cost of n-Way Merge Sort | 62/90 |
Consider file where b = 4096, B = 16 total buffers:
For b data pages and n=15 input buffers (15-way merge)
... Cost of n-Way Merge Sort | 63/90 |
Costs (number of passes) for varying b and B (n=B-1):
B=3 | B=16 | B=128 | ||||
7 | 2 | 1 | ||||
10 | 3 | 2 | ||||
13 | 4 | 2 | ||||
17 | 5 | 3 | ||||
20 | 5 | 3 |
In the above, we assume that
Sorting in PostgreSQL | 64/90 |
Sort uses a polyphase merge-sort (from Knuth):
SortTuple
qsort()
If memory fills while reading, form "runs" and do disk-based sort.
... Sorting in PostgreSQL | 65/90 |
Disk-based sort has phases:
workMem
Effectively, a "tape" is a sorted run.
Implementation of "tapes": backend/utils/sort/logtape.c
... Sorting in PostgreSQL | 66/90 |
Sorting is generic and comparison operators are defined in catalog:
// gets pointer to function via pg_operator SelectSortFunction(Oid sortOperator, bool nulls_first, Oid *sortFunction, int *sortFlags);// returns negative, zero, positive ApplySortFunction(FmgrInfo *sortFunction, int sortFlags, Datum datum1, bool isNull1, Datum datum2, bool isNull2);
Flags indicate: ascending/descending, nulls-first/last.
Projection |
The Projection Operation | 68/90 |
Consider the query:
select distinct name,age from Employee;
If the Employee
(94002, John, Sales, Manager, 32) (95212, Jane, Admin, Manager, 39) (96341, John, Admin, Secretary, 32) (91234, Jane, Admin, Secretary, 21)
then the result of the projection is:
(Jane, 21) (Jane, 39) (John, 32)
Note that duplicate tuples (e.g. (John,32)
... The Projection Operation | 69/90 |
The projection operation needs to:
(straightforward, whichever file organisation is used)
(straightforward, manipulating internal record structure)
(not as simple as other operations ...)
Removing Attributes | 70/90 |
Projecting attributes involves creating a new tuple, using only some values from the original tuple.
Precisely how to achieve this depends on tuple internals.
Removing attributes from fixed-length tuples:
... Removing Attributes | 71/90 |
Removing attributes from variable-length tuples:
Sort-based Projection | 72/90 |
Overview of the method:
Rel
Temp
... Sort-based Projection | 73/90 |
The method, in detail:
// Inputs: relName, attrList inf = openFile(fileName(relName),READ); tempf = openFile(tmpName,CREATE); clear(outbuf); j = 0; for (p = 0; p < nPages(inf); p++) { buf = readPage(inf,p); for (i = 0; i < nTuples(buf); i++) { tup = getTuple(buf,i); newtup = project(tup,attrList); addTuple(newtup,outbuf); if (isFull(outbuf)) { writePage(tempf,j++,outbuf); clear(outbuf); } } } mergeSort(tempf);
... Sort-based Projection | 74/90 |
tempf = openFile(tmpName,READ); outf = openFile(result,CREATE); clear(outbuf); prev = EMPTY; j = 0; for (p = 0; p < nPages(tempf); p++) { buf = readPage(tempf,p); for (i = 0; i < nTuples(buf); i++) { tup = getTuple(buf,i); if (tupCompare(tup,prev) != 0) { addTuple(tup,outbuf); if (isFull(outbuf)) { writePage(outf,j++,outbuf); clear(outbuf); } prev = tup; } } }
Cost of Sort-based Projection | 75/90 |
The costs involved are (assuming B+1 buffers for sort):
Rel
Temp
Temp
Temp
Note that we often ignore cost of writing the result; especially when comparing different algorithms for the same relational operation.
Improving Sort-based Projection | 76/90 |
Some approaches for improving the cost:
Hash-based Projection | 77/90 |
Overview of the method:
Rel
Hash Functions | 78/90 |
Hash function h(tuple,range)
mod
... Hash Functions | 79/90 |
Usual approach in hash function:
unsigned int hash(char *val, int b) { char *cp; unsigned int v, sum = 0; for (c = val; *c != '\0'; c++) { v = *c + (*(c+1) << 8); sum += (sum + 2153*v) % 19937; } return(sum % b); }
Hash-based Projection | 80/90 |
Partitioning phase:
... Hash-based Projection | 81/90 |
Algorithm for partitioning phase:
for each page P in relation Rel { for each tuple t in page P { t' = project(t, attrList) H = h1(t', B-1) write t' to partition[H] } }
Each partition could be implemented as a simple data file.
... Hash-based Projection | 82/90 |
Duplicate elimination phase:
... Hash-based Projection | 83/90 |
Algorithm for duplicate elimination phase:
for each partition P in 0..B-2 { for each tuple t in partition P { H = h2(t, B-1) if (!(t occurs in buffer[H])) append t to buffer H } output contents of all buffers clear all buffers }
Cost of Hash-based Projection | 84/90 |
The total cost is the sum of the following:
Rel
... Cost of Hash-based Projection | 85/90 |
If the largest partition had more than B-1 pages
Index-only Projection | 86/90 |
Under the conditions:
Basic idea:
... Index-only Projection | 87/90 |
Method:
for each entry I in index file { tup = project(I.key, attrList) if (tupCompare(tup,prev) != 0) { addTuple(outbuf,tup) if (isFull(outbuf)) { writePage(outf,op++,outbuf); clear(outbuf); } prev = tup; } }
"for each index entry": loop over index pages and loop over entries in each page
Cost of Index-only Projection | 88/90 |
Assume that the index (see details later):
Index
Comparison of Projection Methods | 89/90 |
Difficult to compare, since they make different assumptions:
Best case scenario for each (assuming B+1 in-memory buffers):
Projection in PostgreSQL | 90/90 |
Code for projection forms part of execution iterators:
ExecProject(projInfo,...)
ExecTargetList(...)
ExecStoreTuple(newTuple,...)
Produced: 13 Jun 2016