Hash Join

COMP9315 21T1 ♢ Hash Join ♢ [0/17]
❖ Hash Join

Basic idea:

Requires sufficent memory buffers Other issues: Variations:   simple,   grace,   hybrid.
COMP9315 21T1 ♢ Hash Join ♢ [1/17]
❖ Simple Hash Join

Basic approach:

No overflows allowed in in-memory hash table
COMP9315 21T1 ♢ Hash Join ♢ [2/17]
❖ Simple Hash Join (cont)

Data flow in hash join:

[Diagram:Pics/join/hash-join.png]

COMP9315 21T1 ♢ Hash Join ♢ [3/17]
❖ Simple Hash Join (cont)

Algorithm for simple hash join Join[R.i=S.j](R,S):

for each tuple r in relation R {
   if (buffer[h(R.i)] is full) {
      for each tuple s in relation S {
         for each tuple rr in buffer[h(S.j)] {
            if ((rr,s) satisfies join condition) {
               add (rr,s) to result
      }  }  }
      clear all hash table buffers
   }
   insert r into buffer[h(R.i)]
}

Best case:  # join tests  ≤  rS.cR    (cf. nested-loop  rS.rR)

COMP9315 21T1 ♢ Hash Join ♢ [4/17]
❖ Simple Hash Join (cont)

Cost for simple hash join ...

Best case: all tuples of R fit in the hash table

Good case: refill hash table m times (where m ≥ ceil(bR / (N-3)) ) Worst case: everything hashes to same page
COMP9315 21T1 ♢ Hash Join ♢ [5/17]
❖ Grace Hash Join

Basic approach (for R ⨝ S ):

For best-case cost (O(bR + bS) ): If < √bR buffers or poor hash distribution
COMP9315 21T1 ♢ Hash Join ♢ [6/17]
❖ Grace Hash Join (cont)

Partition phase (applied to both R and S):

[Diagram:Pics/join/grace-hash1.png]

COMP9315 21T1 ♢ Hash Join ♢ [7/17]
❖ Grace Hash Join (cont)

Probe/join phase:

[Diagram:Pics/join/grace-hash2.png]

The second hash function (h2) simply speeds up the matching process.
Without it, would need to scan entire R partition for each record in S partition.

COMP9315 21T1 ♢ Hash Join ♢ [8/17]
❖ Grace Hash Join (cont)

Cost of grace hash join:

Total Cost   =   2bR + 2bS + bR + bS   =   3 (bR + bS)
COMP9315 21T1 ♢ Hash Join ♢ [9/17]
❖ Hybrid Hash Join

A variant of grace hash join if we have √bR < N < bR+2

When we come to scan and partition S relation Final phase is same as grace join, but with only k-1  partitions.

Comparison:

COMP9315 21T1 ♢ Hash Join ♢ [10/17]
❖ Hybrid Hash Join (cont)

First phase of hybrid hash join (partitioning R):

[Diagram:Pics/join/hyb-hash1.png]

COMP9315 21T1 ♢ Hash Join ♢ [11/17]
❖ Hybrid Hash Join (cont)

Next phase of hybrid hash join (partitioning S):

[Diagram:Pics/join/hyb-hash2.png]

COMP9315 21T1 ♢ Hash Join ♢ [12/17]
❖ Hybrid Hash Join (cont)

Final phase of hybrid hash join (finishing join):

[Diagram:Pics/join/hyb-hash3.png]

COMP9315 21T1 ♢ Hash Join ♢ [13/17]
❖ Hybrid Hash Join (cont)

Some observations:

Other notes: For k partitions, Cost = (3-2/k).(bR+bS)
COMP9315 21T1 ♢ Hash Join ♢ [14/17]
❖ Costs for Join Example

SQL query on student/enrolment database:

select E.subj, S.name
from   Student S join Enrolled E on (S.id = E.stude)
order  by E.subj

And its relational algebra equivalent:

Sort[subj] ( Project[subj,name] ( Join[id=stude](Student,Enrolled) ) )

Database:  rS = 20000,  cS = 20,  bS = 1000,  rE = 80000,  cE = 40,  bE = 2000


We are interested only in the cost of Join,  with N  buffers

COMP9315 21T1 ♢ Hash Join ♢ [15/17]
❖ Costs for Join Example (cont)

Costs for hash join variants on example (N=103):

Hash Join Method Cost Analysis Cost
Hybrid Hash Join (3-2/k).(bS+bE) = 2.8((1000+2000)
assuming k = 10 ... and one partition fits in 91 pages
8700
Grace Hash Join 3(bS+bE) = 3(1000+2000) 9000
Simple Hash Join bS + bE.ceil(bR/(N-3)) =
1000 + ceil(1000/100).2000 = 1000 + 10.2000
21000
Sort-merge Join sort(S) + sort(E) + bS + bE =
2.1000.2 + 2.2000.2 + 1000 + 2000
11000
Nested-loop Join bS + bE.ceil(bS/(N-2)) =
1000 + 2000.ceil(1000/101) = 1000 + 10.2000
21000

COMP9315 21T1 ♢ Hash Join ♢ [16/17]
❖ Join in PostgreSQL

Join implementations are under: src/backend/executor

PostgreSQL suports three kinds of join:

Query optimiser chooses appropriate join, by considering
COMP9315 21T1 ♢ Hash Join ♢ [17/17]


Produced: 25 Apr 2021