Query Cost Estimation

COMP9315 21T1 ♢ Cost Estimation ♢ [0/9]
❖ Query Cost Estimation

Without executing a plan, cannot always know its precise cost.

Thus, query optimisers estimate costs via:

Result size estimated by statistical measures on relations, e.g.
rS cardinality of relation S
RS avg size of tuple in relation S
V(A,S) # distinct values of attribute A
min(A,S) min value of attribute A
max(A,S) max value of attribute A
COMP9315 21T1 ♢ Cost Estimation ♢ [1/9]
❖ Estimating Projection Result Size

Straightforward, since we know:

Assume page size B,   bout = ceil(rT / cout),   where cout = floor(B/Rout)

If using select distinct ...

COMP9315 21T1 ♢ Cost Estimation ♢ [2/9]
❖ Estimating Selection Result Size

Selectivity = fraction of tuples expected to satisfy a condition.

Common assumption: attribute values uniformly distributed.

Example: Consider the query

select * from Parts where colour='Red'

If   V(colour,Parts)=4,   r=1000  ⇒  |σcolour=red(Parts)|=250

In general, | σA=c(R) |  ≅  rR / V(A,R)


Heuristic used by PostgreSQL: | σA=c(R) |  ≅  r/10

COMP9315 21T1 ♢ Cost Estimation ♢ [3/9]
❖ Estimating Selection Result Size (cont)

Estimating size of result for e.g.

select * from Enrolment where year > 2015;

Could estimate by using:

Assume: min(year)=2010, max(year)=2019, |Enrolment|=105
Heuristic used by some systems:   | σA>c(R) | ≅ r/3
COMP9315 21T1 ♢ Cost Estimation ♢ [4/9]
❖ Estimating Selection Result Size (cont)

Estimating size of result for e.g.

select * from Enrolment where course <> 'COMP9315';

Could estimate by using:

e.g. | V(course,Enrolment) | = 2000,   | σA<>c(E) | = r * 1999/2000


Heuristic used by some systems:   | σA<>c(R) | ≅ r

COMP9315 21T1 ♢ Cost Estimation ♢ [5/9]
❖ Estimating Selection Result Size (cont)

How to handle non-uniform attribute value distributions?

So, for part colour example, might have distribution like:

White: 35%   Red: 30%   Blue: 25%   Silver: 10%

Use histogram as basis for determining # selected tuples.

Disadvantage: cost of storing/maintaining histograms.

COMP9315 21T1 ♢ Cost Estimation ♢ [6/9]
❖ Estimating Selection Result Size (cont)

Summary: analysis relies on operation and data distribution:

E.g. select * from R where a = k;

Case 1:   uniq(R.a)   ⇒   0 or 1 result

Case 2:   rR  tuples && size(dom(R.a)) = n   ⇒   rR / n  results

E.g. select * from R where a < k;

Case 1:   k ≤ min(R.a)   ⇒   0 results

Case 2:   k > max(R.a)   ⇒   ≅ rR results

Case 3:   size(dom(R.a)) = n   ⇒   ? min(R.a) ... k ... max(R.a)  ?

COMP9315 21T1 ♢ Cost Estimation ♢ [7/9]
❖ Estimating Join Result Size

Analysis relies on semantic knowledge about data/relations.

Consider equijoin on common attr: R ⨝a S

Case 1:   values(R.a) ∩ values(S.a) = {}   ⇒   size(R ⨝a S) = 0

Case 2:   uniq(R.a) and uniq(S.a)   ⇒   size(R ⨝a S)  ≤  min(|R|, |S|)

Case 3:   pkey(R.a) and fkey(S.a)   ⇒   size(R ⨝a S)  ≤  |S|

COMP9315 21T1 ♢ Cost Estimation ♢ [8/9]
❖ Cost Estimation: Postscript

Inaccurate cost estimation can lead to poor evaluation plans.

Above methods can (sometimes) give inaccurate estimates.

To get more accurate cost estimates:

Either way, optimisation process costs more (more than query?)

Trade-off between optimiser performance and query performance.

COMP9315 21T1 ♢ Cost Estimation ♢ [9/9]


Produced: 5 Apr 2021