❖ Join |
DBMSs are engines to store, combine and filter information.
Join (⨝) is the primary means of combining information.
Join is important and potentially expensive
Most common join condition: equijoin, e.g. (R.pk = S.fk)
Join varieties (natural, inner, outer, semi, anti) all behave similarly.
We consider three strategies for implementing join
❖ Join Example |
Consider a university database with the schema:
create table Student( id integer primary key, name text, ... ); create table Enrolled( stude integer references Student(id), subj text references Subject(code), ... ); create table Subject( code text primary key, title text, ... );
We use this example for each join implementation, by way of comparison
❖ Join Example (cont) |
Goal: List names of students in all subjects, arranged by subject.
SQL query to provide this information:
select E.subj, S.name from Student S join Enrolled E on (S.id = E.stude) order by E.subj, S.name;
And its relational algebra equivalent:
Student
Enrolled
❖ Join Example (cont) |
Some database statistics:
Sym | Meaning | Value |
rS | # student records | 20,000 |
rE | # enrollment records | 80,000 |
cS | Student |
20 |
cE | Enrolled |
40 |
bS | # data pages in Student |
1,000 |
bE | # data pages in Enrolled |
2,000 |
Also, in cost analyses later, N = number of memory buffers.
❖ Join Example (cont) |
Relation statistics for Out
Sym | Meaning | Value |
rOut | # tuples in result | 80,000 |
COut | result records/page | 80 |
bOut | # data pages in result | 1,000 |
Notes:
Enrolled
subj
name
❖ Implementing Join |
A naive join implementation strategy
for each tuple TS in Students { for each tuple TE in Enrolled { if (testJoinCondition(C,TS,TE)) { T1 = concat(TS,TE) T2 = project([subj,name],T1) ResultSet = ResultSet ∪ {T2} } } }
Problems:
❖ Implementing Join (cont) |
An alternative naive join implementation strategy
for each tuple TE in Enrolled { for each tuple TS in Students { if (testJoinCondition(C,TS,TE)) { T1 = concat(TS,TE) T2 = project([subj,name],T1) ResultSet = ResultSet ∪ {T2} } } }
Relatively minor performance difference ...
❖ Join Summary |
None of nested-loop/sort-merge/hash join is superior in some overall sense.
Which strategy is best for a given query depends on:
E.g. Join[id=stude](Student,Enrolled): 3,000 ... 2,000,000 page reads
❖ Join in PostgreSQL |
Join implementations are under: src/backend/executor
PostgreSQL suports the three join methods that we discuss:
nodeNestloop.c
nodeMergejoin.c
nodeHashjoin.c
Produced: 22 Mar 2021