Implementing Join

COMP9315 21T1 ♢ Implementing Join ♢ [0/9]
❖ Join

DBMSs are engines to store, combine and filter information.

Join () is the primary means of combining information.

Join is important and potentially expensive

Most common join condition: equijoin, e.g. (R.pk = S.fk)

Join varieties (natural, inner, outer, semi, anti) all behave similarly.

We consider three strategies for implementing join

COMP9315 21T1 ♢ Implementing Join ♢ [1/9]
❖ Join Example

Consider a university database with the schema:

create table Student(
   id     integer primary key,
   name   text,  ...
);
create table Enrolled(
   stude  integer references Student(id),
   subj   text references Subject(code),  ...
);
create table Subject(
   code   text primary key,
   title  text,  ...
);

We use this example for each join implementation, by way of comparison

COMP9315 21T1 ♢ Implementing Join ♢ [2/9]
❖ Join Example (cont)

Goal: List names of students in all subjects, arranged by subject.

SQL query to provide this information:

select E.subj, S.name
from   Student S
       join Enrolled E on (S.id = E.stude)
order  by E.subj, S.name;

And its relational algebra equivalent:

Sort[subj] ( Project[subj,name] ( Join[id=stude](Student,Enrolled) ) )


To simplify formulae, we denote Student by S and Enrolled by E
COMP9315 21T1 ♢ Implementing Join ♢ [3/9]
❖ Join Example (cont)

Some database statistics:

Sym Meaning Value
rS # student records 20,000
rE # enrollment records 80,000
cS Student records/page 20
cE Enrolled records/page 40
bS # data pages in Student 1,000
bE # data pages in Enrolled 2,000

Also, in cost analyses later, N = number of memory buffers.

COMP9315 21T1 ♢ Implementing Join ♢ [4/9]
❖ Join Example (cont)

Relation statistics for Out = Student ⨝ Enrolled

Sym Meaning Value
rOut # tuples in result 80,000
COut result records/page 80
bOut # data pages in result 1,000

Notes:

COMP9315 21T1 ♢ Implementing Join ♢ [5/9]
❖ Implementing Join

A naive join implementation strategy

for each tuple TS in Students {
   for each tuple TE in Enrolled {
      if (testJoinCondition(C,TS,TE)) {
         T1 = concat(TS,TE)
         T2 = project([subj,name],T1)
         ResultSet = ResultSet ∪ {T2}
}  }  }

Problems:

COMP9315 21T1 ♢ Implementing Join ♢ [6/9]
❖ Implementing Join (cont)

An alternative naive join implementation strategy

for each tuple TE in Enrolled {
   for each tuple TS in Students {
      if (testJoinCondition(C,TS,TE)) {
         T1 = concat(TS,TE)
         T2 = project([subj,name],T1)
         ResultSet = ResultSet ∪ {T2}
}  }  }

Relatively minor performance difference ...

Terminology: relation in outer loop is outer; other relation is inner
COMP9315 21T1 ♢ Implementing Join ♢ [7/9]
❖ Join Summary

None of nested-loop/sort-merge/hash join is superior in some overall sense.

Which strategy is best for a given query depends on:

Given query Q, choosing the "best" join strategy is critical;
the cost difference between best and worst case can be very large.

E.g.   Join[id=stude](Student,Enrolled):   3,000 ... 2,000,000  page reads

COMP9315 21T1 ♢ Implementing Join ♢ [8/9]
❖ Join in PostgreSQL

Join implementations are under: src/backend/executor

PostgreSQL suports the three join methods that we discuss:

The query optimiser chooses appropriate join, by considering
COMP9315 21T1 ♢ Implementing Join ♢ [9/9]


Produced: 22 Mar 2021