Implementing Join


Join1/87

DBMSs are engines to store, combine and filter information.

Filtering is achieved via selection and projection.

The join operation (⋈) is the primary means of combining information.

Because join is a potentially expensive operation and is used frequently,

many methods have been developed for its implementation.

(We use a running example to compare costs of the various join processing methods)


... Join2/87

Types of join:

Focus on simple equijoin, since it is common in practice (R.pk = S.fk)


Join Example3/87

Consider a university database with the schema:

create table Student(
   id     integer primary key,
   name   text,  ...
);
create table Enrolled(
   stude  integer references Student(id),
   subj   text references Subject(code),  ...
);
create table Subject(
   code   text primary key,
   title  text,  ...
);

And the following request on this database:

List names of students in all subjects, arranged by subject.


... Join Example4/87

The result of this request would look like:

Subj      Name
--------  -----------------
COMP1011  Chen Hwee Ling
COMP1011  John Smith
COMP1011  Ravi Shastri
...
COMP1021  David Jones
COMP1021  Stephen Mao
...
COMP3311  Dean Jones
COMP3311  Mark Taylor
COMP3311  Sashin Tendulkar


... Join Example5/87

An SQL query to provide this information:

select E.subj, S.name
from   Student S, Enrolled E
where  S.id = E.stude
order  by E.subj, S.name;

And its relational algebra equivalent:

Sort[subj] ( Project[subj,name] ( Join[id=stude](Student,Enrolled) ) )

The core of the query is the join Join[id=stude](Student,Enrolled)

To simplify writing of formulae, S = Student, E = Enrolled.


... Join Example6/87

Some database statistics:

Sym   Meaning                     Value
rS    # student records           20,000
rE    # enrollment records        80,000
cS    Student records/page        20
cE    Enrolled records/page       40
bS    # data pages in Student     1,000
bE    # data pages in Enrolled    2,000

Also, in cost analyses below, N = number of memory buffers.


... Join Example7/87

Out = Student ⋈ Enrolled relation statistics:

Sym    Meaning                   Value
rOut   # tuples in result        80,000
cOut   result records/page       80
bOut   # data pages in result    1,000

Notes:


Join via Cross-product8/87

Join can be defined as a cross-product followed by selection:

Join[Cond](R,S)   =   Select[Cond]( R × S )

For the example query, could implement

Join[id=stude](Student,Enrolled)
as
Select[id=stude](Student × Enrolled)

Cross-product contains 20,000 × 80,000 = 1,600,000,000 tuples.


... Join via Cross-product9/87

For Temp = (Student × Enrolled)

I/O costs:

Assuming Tw=Tr=0.01s, this will take around 500 hours!


... Join via Cross-product10/87

Because of the enormous size and cost of the cross-product,

DBMSs do not implement join via cross-product.

DBMSs implement only join and provide cross-product as:

R × S  =  Join[true](R,S)
or, in SQL

select * from R,S


Nested-Loop Join


Nested Loop Join12/87

The simplest join algorithm:

Algorithm to compute   Join[Cond](R,S):

for each tuple r in R {
    for each tuple s in S {
        if ((r,s) satisfies join condition) {
            add (r,s) to result
}   }   }

R is the outer relation; S is the inner relation.


... Nested Loop Join13/87

Requires (at least) three memory buffers (2 input, 1 output).

[Diagram:Pics/join/nest-loop-small.png]


... Nested Loop Join14/87

Abstract algorithm for Join[Cond](R,S) (with 3 memory buffers):

for each page of relation R {
    read into buffer rBuf
    for each page of relation S {
        read into buffer sBuf
        for each record r in rBuf {
            for each record s in sBuf {
                if ((r,s) satisfies Cond) {
                    add combined(r,s) to OutBuf
                    write OutBuf when full
}   }   }   }   }


... Nested Loop Join15/87

Detailed algorithm for Join[Cond](R,S) (with 3 memory buffers):

// rf: file for R, sf: file for S, of: output file
outp = 0; clearBuf(oBuf);
for (rp = 0; rp < nPages(rf); rp++) {
   readPage(rf, rp, rBuf);
   for (sp = 0; sp < nPages(sf); sp++) {
      readPage(sf, sp, sBuf);
      for (i = 0; i < nTuples(rBuf); i++) {
         rTup = getTuple(rBuf, i);
         for (j = 0; j < nTuples(sBuf); j++) {
            sTup = getTuple(sBuf, j);
            if (satisfies(rTup,sTup,Cond)) {
               rsTup = combine(rTup,sTup);
               addTuple(oBuf, rsTup);
               if (isFull(oBuf)) {
                  writePage(of, outp++, oBuf);
                  clearBuf(oBuf);
}   }   }   }   }   }
// flush any leftover tuples in the output buffer
if (nTuples(oBuf) > 0) writePage(of, outp++, oBuf);


... Nested Loop Join16/87

The three-memory-buffer nested loop join requires:

Cost   =   bR + bR·bS

If we use S as the outer relation in the join

Cost   =   bS + bS·bR

It is (slightly) better to use smaller relation as outer relation.
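As a sanity check, these cost formulae can be evaluated directly. A minimal sketch (page counts taken from the running example; `nljCost` is an illustrative helper, not part of any DBMS):

```c
#include <assert.h>

/* Page-level nested loop join cost: the outer relation is read once,
   and the whole inner relation is re-read for every outer page. */
long nljCost(long bOuter, long bInner) {
    return bOuter + bOuter * bInner;
}
```

With bS = 1,000 and bE = 2,000 this reproduces the figures on the next slide, and confirms that the smaller relation is (slightly) better as outer.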


Nested Loop Join on Example17/87

If Student is outer relation and Enrolled is inner:

Cost = bS + bS·bE
= 1,000 + 1,000 × 2,000 = 2,001,000

If Enrolled is outer relation and Student is inner:

Cost = bE + bE·bS
= 2,000 + 2,000 × 1,000 = 2,002,000

Cost of nested-loop join is too high   (5 hours, if Tr=0.01 sec)


Implementing Join Better18/87

Aims of effective join computation:

Range of costs for Join(R,S)


Block Nested Loop Join19/87

If at least bR+2 memory buffers available:

[Diagram:Pics/join/blk-nest-loop-small.png]


... Block Nested Loop Join20/87

Algorithm for nested loop join with bR+2 memory buffers:

read all of R's pages into memory buffers
for each page of relation S {
    read page into S's input buffer
    for each tuple s in S's buffer {
        for each tuple r in R's memory buffers {
            if ((r,s) satisfies JoinCond)) {
                add (r,s) to output buffer
                write output buffer when full
}   }   }   } 

Note that R effectively becomes the inner relation in this scheme.


... Block Nested Loop Join21/87

This method requires:

Cost   =   bR + bS

Notes:


... Block Nested Loop Join22/87

Further performance improvements:

Example:


Block Nested Loop Join on Example23/87

If ≥ 1002 memory buffers are available:

Cost = bS + bE
= 1,000 + 2,000 = 3,000

This is considerably better than 2×10^6    (30 secs vs 5 hours).

But what if we have only N memory buffers, where N < bR ,  N < bS?


... Block Nested Loop Join on Example24/87

In the general case, read the outer relation in runs of N-2 pages

for each run of N-2 pages from R {
    read N-2 of R's pages into memory buffers
    for each page of relation S {
        read page into S's input buffer
        for each tuple s in S's buffer do
            for each tuple r in R's memory buffers {
                if ((r,s) satisfies JoinCond)) {
                    add (r,s) to output buffer
                    write output buffer when full
}   }   }   }   }


... Block Nested Loop Join on Example25/87

Block nested loop join requires

Cost   =   bR + bS · ⌈bR / (N-2)⌉

Notes:


... Block Nested Loop Join on Example26/87

Costs for various buffer pool sizes:

N      Outer     Inner     #runs   Cost
22     Student   Enrolled  50      101,000
52     Student   Enrolled  20      41,000
102    Student   Enrolled  10      21,000
1002   Student   Enrolled  1       3,000
22     Enrolled  Student   100     102,000
52     Enrolled  Student   40      42,000
102    Enrolled  Student   20      22,000
1002   Enrolled  Student   2       4,000
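The table entries can be reproduced from the cost formula. A sketch (N is the number of memory buffers; `runs` and `bnljCost` are illustrative helpers):

```c
#include <assert.h>

/* Block nested loop join: the outer relation is read once, in
   ceil(bOuter/(N-2)) runs; the inner relation is scanned once per run. */
long runs(long bOuter, long N) { return (bOuter + (N-2) - 1) / (N-2); }

long bnljCost(long bOuter, long bInner, long N) {
    return bOuter + runs(bOuter, N) * bInner;
}
```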


Block Nested Loop Join in Practice27/87

Why block nested loop join is very useful in practice ...

Many queries have the form

select * from R,S where r.i=s.j and r.x=k

This would typically be evaluated as

Join[i=j]( Sel[r.x=k](R), S )

If |Sel[r.x=k](R)| is small, the selection result may fit entirely in memory (in a small number of buffers)


Join Conditions and Methods28/87

Nested loop join makes no assumptions about join conditions.

for each pair of tuples (r,s) {
    check join condition on (r,s)
    if satisfied, add to results
}

To improve join:

As noted above, simple equijoin is a common join condition.

Thus, a range of other join algorithms has been developed specifically for equality join conditions.


Index Nested Loop Join29/87

Most joins considered so far have a common problem:

If there is an index on S, we can avoid such repeated scanning.

Consider Join[R.i=S.j](R,S):

for each tuple r in relation R {
    use index to select tuples
        from S where s.j = r.i
    for each selected tuple s from S {
        add (r,s) to result
}   }

(For ordered indexes (e.g. Btree), this also assists join conditions like R.i<S.j)


... Index Nested Loop Join30/87

This method requires:

Cost   =   bR + rR·SelS    (SelS is the cost of performing a select on S).


... Index Nested Loop Join31/87

For index lookup:

 
Note: building an index "on the fly" to perform a join can be very cost-effective.


Index Nested Loop Join on Example32/87

Case 1: Join[id=stude](Student,Enrolled)

 
Cost = bS + rS·btreeE
     = 1,000 + 20,000 × (3+1.01)  ≈  81,200


... Index Nested Loop Join on Example33/87

Case 2: Join[id=stude](Student,Enrolled)

 
Cost = bS + rS·btreeE
     = 1,000 + 20,000 × (3+4)  =  141,000


... Index Nested Loop Join on Example34/87

Case 3: Join[id=stude](Student,Enrolled)

 
Cost = bE + rE·hashS
= 2,000 + 80,000 × 1.1 = 90,000


Optimised Index Nested Loop Join35/87

Consider the following scenario for Join[R.i=S.j](R,S):

Could save repeated index scans with the same R.i value


... Optimised Index Nested Loop Join36/87

Abstract algorithm for optimised index nested loop join:

for each tuple r in relation R {
   if (prev == r.i)
      use selected tuples in buffer(s)
   else {
      use index to select tuples
         from S where s.j = r.i
      store selected tuples in buffer(s)
   }
   for each selected tuple s from S
      add (r,s) to result
   prev = r.i
}

Cost savings depend on repetition factor, #buffers, size of index scans
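The caching idea can be sketched in C over in-memory int-key arrays. Here `indexCount` is a hypothetical stand-in for a real index lookup (a DBMS would use a B-tree or hash index); `probes` counts how many lookups were avoided:

```c
#include <stddef.h>

size_t probes;   /* counts simulated index lookups into S */

/* Hypothetical index lookup on S: number of tuples with s.j == key. */
static size_t indexCount(const int *S, size_t nS, int key) {
    size_t c = 0;
    probes++;
    for (size_t j = 0; j < nS; j++)
        if (S[j] == key) c++;
    return c;
}

/* Optimised index nested loop join: skip the index probe whenever
   consecutive R tuples (e.g. from a sorted scan) repeat the join value. */
size_t optIndexJoin(const int *R, size_t nR, const int *S, size_t nS) {
    size_t matches = 0, cached = 0;
    int prev = 0, havePrev = 0;
    probes = 0;
    for (size_t i = 0; i < nR; i++) {
        if (!havePrev || R[i] != prev) {    /* new join value: probe index */
            cached = indexCount(S, nS, R[i]);
            prev = R[i];  havePrev = 1;
        }
        matches += cached;                  /* repeated value: reuse result */
    }
    return matches;
}
```

With R = (2,2,2,5) only two index probes are needed instead of four, illustrating the saving from the repetition factor.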


Sort-Merge Join


Sort-Merge Join38/87

Basic approach:

Advantages: Disadvantages:


... Sort-Merge Join39/87

Method requires several cursors to scan sorted relations:

[Diagram:Pics/join/sort-merge-small.png]


... Sort-Merge Join40/87

Abstract algorithm for merge phase of Join[R.i=S.j](R,S):


r = first tuple in R
s = first tuple in S
while (r != eof and s != eof) {
    // align cursors to start of next common run
    while (r != eof and r.i < s.j) { r = next tuple in R }
    while (s != eof and r.i > s.j) { s = next tuple in S }
    // scan common run, generating result tuples
    while (r != eof and r.i == s.j) {
        ss = s   // set to start of run
        while (ss != eof and ss.j == r.i) {
            add (r,ss) to result
            ss = next tuple in S
        }
        r = next tuple in R
    }
    s = ss   // start search for next run
}
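A minimal in-memory version of this merge phase, over already-sorted int-key arrays (a sketch; `out` collects (R-index, S-index) pairs and is an illustrative convention, not the tuple format a DBMS would use):

```c
#include <stddef.h>
#include <assert.h>

/* Merge phase of sort-merge join on two sorted int-key arrays.
   Records up to maxOut matching (i,j) index pairs in out;
   returns the total number of matching pairs. */
size_t mergeJoin(const int *R, size_t nR, const int *S, size_t nS,
                 size_t out[][2], size_t maxOut) {
    size_t r = 0, s = 0, n = 0;
    while (r < nR && s < nS) {
        /* align cursors to start of next common run */
        if (R[r] < S[s]) { r++; continue; }
        if (R[r] > S[s]) { s++; continue; }
        /* scan common run: for each R tuple, rescan the S run */
        size_t sRun = s;
        int key = R[r];
        while (r < nR && R[r] == key) {
            for (s = sRun; s < nS && S[s] == key; s++) {
                if (n < maxOut) { out[n][0] = r; out[n][1] = s; }
                n++;
            }
            r++;
        }
        /* s now sits just past the run */
    }
    return n;
}
```

Note how `sRun` plays the role of the `ss` cursor above: the S run is rescanned once per R tuple in the run.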


Sidetrack: Iterators41/87

Sort-merge join implementation is simplified by use of iterators.

Typical usage of iterator:

Iterator *iter; Tuple tup;
iter = startScan("Rel","i=5");
while ((tup = nextTuple(iter)) != NULL) {
    process(tup);
}
endScan(iter);


... Sidetrack: Iterators42/87

typedef struct {
    File   inf;  // input file
    Buffer buf;  // buffer holding current page
    int    curp; // current page during scan
    int    curr; // index of current record in page
} Iterator;

// simple linear scan; no condition
Iterator *startScan(char *relName) {
    Iterator *iter = malloc(sizeof(Iterator));
    iter->inf  = openFile(fileName(relName),READ);
    iter->curp = 0;
    iter->curr = -1;
    readPage(iter->inf, iter->curp, iter->buf);
    return iter;
}


... Sidetrack: Iterators43/87

Tuple nextTuple(Iterator *iter) {
    // check if reached end of current page
    if (iter->curr == nTuples(iter->buf)-1) {
        // check if reached end of data file
        if (iter->curp == nPages(iter->inf)-1)
            return NULL;
        iter->curp++;
        readPage(iter->inf, iter->curp, iter->buf);
        iter->curr = -1;
    }
    iter->curr++;
    return getTuple(iter->buf, iter->curr);
}
// curp and curr hold indexes of most recently read page/record


... Sidetrack: Iterators44/87

// assume TupleID packs a (page,rec) pair, e.g.
// typedef struct { int page, rec; } TupleID;

TupleID scanCurrent(Iterator *iter) {
    // form TupleID for current record
    TupleID tid = { iter->curp, iter->curr };
    return tid;
}

void setScan(Iterator *iter, TupleID tid) {
    assert(tid.page >= 0 && tid.page < nPages(iter->inf));
    if (iter->curp != tid.page) {
        iter->curp = tid.page;
        readPage(iter->inf, iter->curp, iter->buf);
    }
    assert(tid.rec >= 0 && tid.rec < nTuples(iter->buf));
    iter->curr = tid.rec;
}

void endScan(Iterator *iter) {
    closeFile(iter->inf);
    free(iter);
}


Sort-Merge Join45/87

Concrete algorithm using iterators:

Iterator *ri, *si;  Tuple rtup, stup;

ri = startScan("SortedR");
si = startScan("SortedS");
while ((rtup = nextTuple(ri)) != NULL
       && (stup = nextTuple(si)) != NULL) {
    // align cursors to start of next common run
    while (rtup != NULL && rtup.i < stup.j)
           rtup = nextTuple(ri);
    if (rtup == NULL) break;
    while (stup != NULL && rtup.i > stup.j)
           stup = nextTuple(si);
    if (stup == NULL) break;
    // must have (r.i == s.j) here
...


... Sort-Merge Join46/87

...
    // remember start of current run in S
    TupleID startRun = scanCurrent(si);
    // scan common run, generating result tuples
    while (rtup != NULL && rtup.i == stup.j) {
        while (stup != NULL && stup.j == rtup.i) {
            addTuple(outbuf, combine(rtup,stup));
            if (isFull(outbuf)) {
                writePage(outf, outp++, outbuf);
                clearBuf(outbuf);
            }
            stup = nextTuple(si);
        }
        rtup = nextTuple(ri);
        setScan(si, startRun);
    }
}


... Sort-Merge Join47/87

Buffer requirements:


... Sort-Merge Join48/87

Cost of sort-merge join.

Step 1: sort each relation that is not already sorted:

Step 2: merge sorted relations:


Sort-Merge Join on Example49/87

Case 1:   Join[id=stude](Student,Enrolled)

 
Cost  =  bS + bE  =  3,000    (i.e. minimal cost)


... Sort-Merge Join on Example50/87

Case 2:   Join[id=stude](Student,Enrolled)

 
Cost = sort(S) + sort(E) + bS + bE
= bS·⌈log30 bS⌉ + bE·⌈log30 bE⌉ + bS + bE
= 1,000 × 3 + 2,000 × 3 + 1,000 + 2,000
= 12,000


... Sort-Merge Join on Example51/87

Case 3:   Join[id=stude](Student,Enrolled)

 
Cost depends on which relation is outer and which is inner.


... Sort-Merge Join on Example52/87

Case 3 (continued) ...

If E is outer relation:

If S is outer relation:


Sidetrack 2: More on Iterators53/87

Above description of iterators:

In the general case, an iterator involves: A typical SQL query involves many iterators


... Sidetrack 2: More on Iterators54/87

Requires a more general definition of execution state:

typedef struct {
    Oper   op;    // operation (sel,sort,join,...)
    Reln   r1;    // first relation
    Reln   r2;    // second relation (if any)
    Buffer *bufs; // buffers used by operation
    int    curp1; // index of current page for r1
    int    curr1; // index of current record in page
    int    curp2; // index of current page for r2
    int    curr2; // index of current record in page
    Cond   cond;  // condition for choosing tuple(s)
} Iterator;

For PostgreSQL details, see include/nodes/execnodes.h


Hash Join


Hash Join56/87

Basic idea:

Requires sufficient memory buffers.

Other issues:

Variations:   simple,   grace,   hybrid.


Simple Hash Join57/87

Basic approach:

Makes the assumption: whole of S hashes into memory


... Simple Hash Join58/87

Data flow:

[Diagram:Pics/join/hash-simple-small.png]


... Simple Hash Join59/87

Algorithm for ideal simple hash join Join[R.i=S.j](R,S):

for each tuple r in relation R
   { insert r into buffer[h(r.i)] }
for each tuple s in relation S {
   for each tuple r in buffer[h(s.j)] {
      if ((r,s) satisfies join condition) {
         add (r,s) to result
      }
   }
}

Cost = bR + bS    (minimum possible cost)
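The ideal case can be sketched in memory over int keys. The fixed bucket count and capacity are assumptions of this sketch (a real implementation hashes pages into N-2 memory buffers and must handle overflow):

```c
#include <stddef.h>

#define NBUF   8     /* number of in-memory hash buckets (assumed) */
#define BUFCAP 16    /* capacity of each bucket (assumed) */

static int    bucket[NBUF][BUFCAP];
static size_t bcount[NBUF];

static size_t h(int k) { return (size_t)k % NBUF; }

/* Ideal simple hash join: hash all of R into memory, then probe with
   each S tuple; returns the number of matching pairs. */
size_t simpleHashJoin(const int *R, size_t nR, const int *S, size_t nS) {
    size_t matches = 0;
    for (size_t b = 0; b < NBUF; b++) bcount[b] = 0;
    for (size_t i = 0; i < nR; i++) {          /* build phase */
        size_t b = h(R[i]);
        bucket[b][bcount[b]++] = R[i];         /* assumes no bucket overflows */
    }
    for (size_t j = 0; j < nS; j++) {          /* probe phase */
        size_t b = h(S[j]);
        for (size_t i = 0; i < bcount[b]; i++)
            if (bucket[b][i] == S[j]) matches++;
    }
    return matches;
}
```

Each relation is read exactly once, matching the bR + bS cost above.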


... Simple Hash Join60/87

Consider that we have N buffers available.

If bR ≤ N-2 buffers, no need to hash   (use nested loop).

In practice, the size of the hash table bhR > bR (e.g. due to data skew),
so the hash table for R is even less likely to fit in memory.

Can be handled by a variation on above algorithm:


... Simple Hash Join61/87

Algorithm for realistic simple hash join Join[R.i=S.j](R,S):

for each tuple r in relation R {
   if (buffer[h(r.i)] is full) {
      for each tuple s in relation S {
         for each tuple rr in buffer[h(s.j)] {
            if ((rr,s) satisfies join condition) {
               add (rr,s) to result
            }
         }
      }
      clear all hash table buffers
   }
   insert r into buffer[h(r.i)]
}
// one final scan of S for tuples still in the buffers
for each tuple s in relation S {
   for each tuple rr in buffer[h(s.j)] {
      if ((rr,s) satisfies join condition)
         add (rr,s) to result
}  }

Note: requires multiple passes over the S relation.


... Simple Hash Join62/87

Cost depends on N and on properties of data/hash.

Worst case:

Best case:


Grace Hash Join63/87

Basic approach:

Similar approach to sort-merge join, except:

Requires enough buffer space to hold largest partition of inner relation.


... Grace Hash Join64/87

Partition phase:

[Diagram:Pics/join/hash1-small.png]

This is applied to each relation R and S.


... Grace Hash Join65/87

Probe/join phase:

[Diagram:Pics/join/hash2-small.png]

The second hash function (h2) simply speeds up the matching process.
Without it, would need to scan entire R partition for each record in S partition.


... Grace Hash Join66/87

Abstract algorithm for Join[R.i=S.j](R,S):

// assume h(val) generates [0..N-2]
// assume h2(val) generates [0..N-3]

// Partition phase (each relation -> N-1 partitions)
// 1 input buffer, N-1 output buffers

for each tuple r in relation R 
    add r to partition h(r.i) in output file R'
for each tuple s in relation S
    add s to partition h(s.j) in output file S'
...


... Grace Hash Join67/87

Abstract algorithm for Join[R.i=S.j](R,S) (cont.)

// Probe/join phase
// 1 input buffer for S, 1 output buffer
// N-2 buffers to build hash table for R partition

for each partition p = 0 .. N-2 {
    // Build in-memory hash table for partition p of R'
    for each tuple r in partition p of R'
        insert r into buffer h2(r.i)
    
    // Scan partition p of S', probing for matching tuples
    for each tuple s in partition p of S' {
        b = h2(s.j)
        for all matching tuples r in buffer b
            add (r,s) to result
}   }
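The key property exploited here is that joinable tuples always land in same-numbered partitions of R' and S'. A sketch with in-memory arrays standing in for the partition files (NPART and `h` are illustrative choices):

```c
#include <stddef.h>

#define NPART 4                               /* number of partitions (assumed) */

static size_t h(int k) { return (size_t)k % NPART; }

/* Probe/join phase over matching partitions only: count equal-key pairs.
   Partition membership is recomputed on the fly instead of reading
   partition files R' and S'. */
size_t graceJoinCount(const int *R, size_t nR, const int *S, size_t nS) {
    size_t count = 0;
    for (size_t p = 0; p < NPART; p++)        /* for each partition p */
        for (size_t i = 0; i < nR; i++) {
            if (h(R[i]) != p) continue;       /* R[i] not in partition p of R' */
            for (size_t j = 0; j < nS; j++)
                if (h(S[j]) == p && S[j] == R[i])
                    count++;                  /* match found within partition p */
        }
    return count;
}
```

Because h(r.i) == h(s.j) whenever r.i == s.j, scanning only like-numbered partitions loses no results.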


... Grace Hash Join68/87

Concrete algorithm for partitioning:

Buffer iBuf, oBuf[N-1];
File inf, outf[N-1]; char rel[100];
int i, r, h, ip, op[N-1]; Tuple tup;
for (i = 0; i < N-1; i++) {
    clearBuf(oBuf[i]);   op[i] = 0;
    sprintf(rel, "%s%d", "Rel", i);
    outf[i] = openFile(fileName(rel), WRITE);
}
inf = openFile(fileName("Rel"), READ);
for (ip = 0; ip < nPages(inf); ip++) {
    readPage(inf, ip, iBuf);
    for (r = 0; r < nTuples(iBuf); r++) {
        tup = getTuple(iBuf, r);
        h = hash(tup.i, N-1);
        addTuple(oBuf[h], tup);
        if (isFull(oBuf[h])) {
            writePage(outf[h], op[h]++, oBuf[h]);
            clearBuf(oBuf[h]);
}   }   }
// flush partially-filled output buffers
for (i = 0; i < N-1; i++)
    if (nTuples(oBuf[i]) > 0)
        writePage(outf[i], op[i]++, oBuf[i]);


... Grace Hash Join69/87

Cost of grace hash join:

Total Cost   =   3 (bR + bS)


... Grace Hash Join70/87

The above cost analysis assumes:

We achieve this situation if:


... Grace Hash Join71/87

Possibilities for dealing with "over-long" partitions of R


Grace Hash Join on Example72/87

For the example Join[id=stude](Student,Enrolled):

Cost = 3 (bS + bE)
= 3 (1,000 + 2,000) = 9,000


Hybrid Hash Join73/87

An optimisation if we have √bR < N < bR+2

When we come to scan and partition S relation Final phase is same as grace join, but with only k-m partitions.


... Hybrid Hash Join74/87

Some observations:

Other notes:


... Hybrid Hash Join75/87

Need to choose appropriate m and k to minimise cost

Approach to maximise saving:


... Hybrid Hash Join76/87

Data flow for hybrid hash join (partitioning R):

[Diagram:Pics/join/hybhash1-small.png]


... Hybrid Hash Join77/87

Data flow for hybrid hash join (partitioning S):

[Diagram:Pics/join/hybhash2-small.png]

After this, proceed as for grace hash join.


... Hybrid Hash Join78/87

Cost of hybrid hash join:

Cost  =  bR + bS + k·PR + k·PS + k·PR + k·PS
      =  bR + bS + 2·k·(PR + PS)
      =  bR + bS + 2·k·(⌈bR/(m+k)⌉ + ⌈bS/(m+k)⌉)
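Plugging numbers into this formula (a sketch; m = number of in-memory partitions, k = number of on-disk partitions, `hybridCost` an illustrative helper):

```c
#include <assert.h>

static long ceilDiv(long a, long b) { return (a + b - 1) / b; }

/* Hybrid hash join I/O cost per the formula above:
   bR + bS + 2k(ceil(bR/(m+k)) + ceil(bS/(m+k))). */
long hybridCost(long bR, long bS, long m, long k) {
    long PR = ceilDiv(bR, m + k);   /* pages per partition of R */
    long PS = ceilDiv(bS, m + k);   /* pages per partition of S */
    return bR + bS + 2 * k * (PR + PS);
}
```

With k = 0 (everything in memory) the cost collapses to bR + bS; with m = 0 it approaches the 3(bR + bS) of grace hash join.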

How to determine k:


Hybrid Hash Join on Example79/87

Case 1:   N = 100 buffers, bR = 1000

Case 2:   N = 200 buffers, bR = 1000 Case 3:   N = 502 buffers, bR = 1000


Pointer-based Join80/87

Conventional join algorithms set up R ↔ S connections via attribute values.

Join could be performed faster if direct connections already existed.

Such a modification to conventional RDBMS structure would be worthwhile:


... Pointer-based Join81/87

The basic idea for pointer-based join is:

for each tuple r in relation R {
    for each rid associated with r {
        fetch tuple s from S via rid
        add (r,s) to result relation
    }
}

Often, each R tuple is associated with only one rid, so the inner loop is not needed.


... Pointer-based Join82/87

The advantage over value-based joins:

The (potential) disadvantages:


General Join Conditions83/87

Above examples all used simple equijoin e.g. Join[i=j](R,S).

For theta-join e.g Join[i<j](R,S):


... General Join Conditions84/87

For multi-equality (pmr) join e.g. Join[i=j ∧ k=l](R,S)


Join Summary85/87

No single join algorithm is superior in some overall sense.

Which algorithm is best for a given query depends on:

Choosing the "best" join algorithm is critical because the cost difference between best and worst case can be very large.

E.g.   Join[id=stude](Student,Enrolled):   3,000 ... 2,000,000

In some cases, it may be worth modifying access methods "on the fly" (e.g. add index) to enable an efficient join algorithm.


... Join Summary86/87

Comparison of join costs   (from Zeller/Gray VLDB90, assumes bR = bS = b)

[Diagram:Pics/join/hash-join-costs-small.png]


Join in PostgreSQL87/87

Join implementations are under: src/backend/executor

PostgreSQL supports three kinds of join:

Query optimiser chooses appropriate join, by considering


Produced: 29 May 2016