IBM Research
Non-Confidential | 7-12-2005 | Volker Markl © 2005 IBM Corporation
Consistently Estimating the Selectivity of Conjuncts of Predicates
Volker Markl, Nimrod Megiddo, Marcel Kutsch, Tam Minh Tran, Peter Haas, Utkarsh Srivastava
IBM Research
Agenda
Consistency and Bias Problems in Cardinality Estimation
The Maximum Entropy Solution
Iterative Scaling
Performance Analysis
Related Work
Conclusions
What is the problem?
Consider the following three correlated attributes: Make, Color, and Model.
[Figure: each attribute shown with a value and its cardinality: Make = 'Mazda' (100,000), Color = 'red' (200,000), Model = '323' (200,000)]
How to estimate the cardinality of the predicate
… Make = 'Mazda' AND Model = '323' AND Color = 'red' ?
(real cardinality: 49,000)
Without any additional knowledge
Legend: s(?) denotes the selectivity of ?
Selectivity( Make = 'Mazda' AND Model = '323' AND Color = 'red' )
Independence assumption:
s( Make = 'Mazda' ) * s( Model = '323' ) * s( Color = 'red' )
= (100,000 / 1,000,000) * (200,000 / 1,000,000) * (200,000 / 1,000,000) = 0.004
Base cardinality: 1,000,000
Estimated cardinality: 0.004 * 1,000,000 = 4,000
Additional knowledge given (1):
Selectivity( Make = 'Mazda' AND Model = '323' AND Color = 'red' )
Additional knowledge:
Make AND Model: card( 'Mazda' AND '323' ) = 50,000
case 1: s( Make AND Model ) * s( Color )
= (50,000 / 1,000,000) * (200,000 / 1,000,000) = 0.01
Estimated cardinality: 10,000
Additional knowledge given (2):
Selectivity( Make = 'Mazda' AND Model = '323' AND Color = 'red' )
Additional knowledge:
Make AND Model: card( 'Mazda' AND '323' ) = 50,000
Make AND Color: card( 'Mazda' AND 'red' ) = 90,000
case 1: s( Make AND Model ) * s( Color ) = 0.01, estimated cardinality: 10,000
case 2: s( Make AND Color ) * s( Model )
= (90,000 / 1,000,000) * (200,000 / 1,000,000) = 0.018
Estimated cardinality: 18,000
Additional knowledge given (3):
Selectivity( Make = 'Mazda' AND Model = '323' AND Color = 'red' )
Additional knowledge:
Make AND Model: card( 'Mazda' AND '323' ) = 50,000
Make AND Color: card( 'Mazda' AND 'red' ) = 90,000
Model AND Color: card( '323' AND 'red' ) = 150,000
case 1: s( Make AND Model ) * s( Color ) = 0.01, estimated cardinality: 10,000
case 2: s( Make AND Color ) * s( Model ) = 0.018, estimated cardinality: 18,000
case 3: s( Model AND Color ) * s( Make )
= (150,000 / 1,000,000) * (100,000 / 1,000,000) = 0.015
Estimated cardinality: 15,000
Why is this a problem?
case 0: s( Make ) * s( Model ) * s( Color ) = 0.004, estimated cardinality: 4,000
case 1: s( Make AND Model ) * s( Color ) = 0.010, estimated cardinality: 10,000
case 2: s( Make AND Color ) * s( Model ) = 0.018, estimated cardinality: 18,000
case 3: s( Model AND Color ) * s( Make ) = 0.015, estimated cardinality: 15,000
[Figure: three plans for the same query carry different estimates: an index scan on (Make, Color) (90,000 rows) followed by a FETCH of Model, estimate 18,000; an index scan on (Model, Color) (150,000 rows) followed by a FETCH of Make, estimate 15,000; and an index intersection of Make, Color, and Model, estimate 4,000]
Cardinality bias: fleeing from knowledge to ignorance
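The four conflicting estimates above are plain arithmetic on the same statistics. A minimal sketch (our own code; the variable names are not from the talk, the numbers are from the slides) reproduces them:

```python
# Base-table cardinality and the statistics from the running example.
N = 1_000_000
s_make, s_model, s_color = 100_000 / N, 200_000 / N, 200_000 / N
s_make_model = 50_000 / N    # card('Mazda' AND '323')
s_make_color = 90_000 / N    # card('Mazda' AND 'red')
s_model_color = 150_000 / N  # card('323' AND 'red')

# Four "legal" estimates for the very same conjunct, depending on which
# multivariate statistic the optimizer happens to use first.
cases = {
    "case 0": s_make * s_model * s_color,   # independence only
    "case 1": s_make_model * s_color,
    "case 2": s_make_color * s_model,
    "case 3": s_model_color * s_make,
}
for name, sel in cases.items():
    print(name, sel, round(sel * N))        # selectivity, estimated cardinality
```

All four disagree with each other and with the real cardinality of 49,000, which is the inconsistency the next slides address.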
What has happened?
Inconsistent model: different estimates for the same intermediate result, due to multivariate statistics with overlapping information.
Bias during plan selection results in the selection of sub-optimal plans.
Bias avoidance means keeping the model consistent. The state of the art is to do bookkeeping of the first multivariate statistic used, and to ignore further overlapping multivariate statistics.
This does not solve the problem, as ignoring knowledge also means bias.
The bias is arbitrary; it depends on which statistics are used first during optimization.
The only possible solution is to exploit all knowledge consistently.
Problem: Only partial knowledge of the DNF atoms
[Figure: Venn diagram over the predicates Mazda, 323, and red; its eight regions are the DNF atoms Mazda & 323 & red, Mazda & 323 & ¬red, Mazda & ¬323 & red, ¬Mazda & 323 & red, Mazda & ¬323 & ¬red, ¬Mazda & 323 & ¬red, ¬Mazda & ¬323 & red, and ¬Mazda & ¬323 & ¬red; single-predicate cardinalities: Make = 'Mazda' (100,000), Model = '323' (200,000), Color = 'red' (200,000)]
Additional knowledge:
Make AND Model: card( 'Mazda' AND '323' ) = 50,000
Model AND Color: card( '323' AND 'red' ) = 150,000
Make AND Color: card( 'Mazda' AND 'red' ) = 90,000
Legend:
DNF = disjunctive normal form
¬X denotes not X
How to compute the missing values of the distribution?
Probability( Make = 'Mazda' AND Model = '323' AND Color = 'red' )
[Figure: the same Venn diagram of DNF atoms, with the atom Mazda & 323 & red as the sought, unknown value]
Solution: Information Entropy H( X ) = -∑ xi log( xi )
Entropy is a measure of the "uninformedness" of a probability distribution X = (x1, …, xm) with x1 + … + xm = 1.
Maximizing the information entropy for the unknown selectivities, using the known selectivities as constraints, avoids bias.
The less is known about a probability distribution, the larger its entropy:
Nothing known → uniformity: s(X = ?) = 1/m
Marginals known → independence: s(X = ? and Y = ?) = s(X = ?) * s(Y = ?)
Thus: the principle of maximum entropy generalizes the uniformity and independence assumptions used in today's query optimizers.
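As a quick illustration of "uninformedness" (our own sketch, not from the talk), the uniform distribution attains the maximum entropy log(m) among m-point distributions, while any more informative, skewed distribution scores lower:

```python
import math

def entropy(xs):
    """H(X) = -sum x_i * log(x_i), skipping zero-probability points."""
    return -sum(x * math.log(x) for x in xs if x > 0)

m = 8
uniform = [1 / m] * m                                    # "nothing known"
skewed = [0.5, 0.3, 0.1, 0.05, 0.02, 0.01, 0.01, 0.01]   # more informative

print(entropy(uniform))  # log(8), the maximum for m = 8
print(entropy(skewed))   # strictly smaller
```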
Entropy Maximization for Cardinality Estimation
Given some selectivities (single and conjunctive) over a space of n predicates p1, …, pn,
choose a model which is consistent with this knowledge but otherwise as uniform as possible:
maximize the entropy of the probability distribution X = (x_b | b ∈ {0,1}^n):
max H( X ) = -∑_{b ∈ {0,1}^n} x_b log x_b
x_b is the selectivity of the DNF atom ∧_{i ∈ {1,…,n}} p_i^{b_i}
b_i = 0 means that predicate p_i is negated in the DNF atom
b_i = 1 means that predicate p_i is a positive term in the DNF atom
Legend:
{0,1}^n denotes the n-fold cross product of the set {0,1}, i.e., {0,1} × … × {0,1} (n times)
Also, for a predicate p: p^1 = p, p^0 = not p
Maximum Entropy Principle – Example:
Knowledge s_Y, Y ∈ T, with T = { {1}, {2}, {3}, {1,2}, {1,3}, {2,3}, ∅ }:
s1 = s(Mazda) = 0.1
s2 = s(323) = 0.2
s3 = s(red) = 0.2
s1,2 = s(Mazda & 323) = 0.05
s1,3 = s(Mazda & red) = 0.09
s2,3 = s(red & 323) = 0.15
[Figure: Venn diagram over Mazda, 323, and red with the eight DNF atoms labeled by their bit vectors 000, 001, 010, 011, 100, 101, 110, 111; the atoms 100, 101, 110, 111 together make up s1]
Constraint, e.g., for s1:
s1 = x100 + x101 + x110 + x111
Maximum Entropy Principle – Example:
Knowledge: s1 = 0.1, s2 = 0.2, s3 = 0.2, s1,2 = 0.05, s1,3 = 0.09, s2,3 = 0.15
Objective function:
max H( X ) = -∑_{b ∈ {0,1}^3} x_b log x_b
Constraints:
0.10 = s1 = x100 + x101 + x110 + x111
0.20 = s2 = x010 + x011 + x110 + x111
0.20 = s3 = x001 + x011 + x101 + x111
0.05 = s1,2 = x110 + x111
0.09 = s1,3 = x101 + x111
0.15 = s2,3 = x011 + x111
1.00 = s∅ = x000 + x001 + x010 + x011 + x100 + x101 + x110 + x111
General solution: Iterative Scaling
Solving the constrained optimization problem:
Minimize the objective function
∑_{b ∈ {0,1}^n} x_b log x_b
satisfying the |T| constraints, T ⊆ 2^{{1,…,n}}:
for all Y ∈ T: ∑_{b ∈ C(Y)} x_b = s_Y
[Figure: Venn diagram with the eight DNF atoms p1^b1 ∧ p2^b2 ∧ p3^b3 labeled 000 … 111]
Legend:
2^{{1,…,n}} denotes the powerset of {1,…,n}
C(Y) denotes all DNF atoms that contribute to Y, i.e., formally,
C(Y) := { b ∈ {0,1}^n | ∀ i ∈ Y: b_i = 1 } and C(∅) := {0,1}^n
Maximum Entropy and Lagrange Multipliers
The objective function ∑_{b ∈ {0,1}^n} x_b log x_b is convex.
We can build a Lagrangian function by associating a multiplier λ_Y with each constraint and subtracting the constraints from the objective function:
L( X, λ ) = ∑_{b ∈ {0,1}^n} x_b log x_b - ∑_{Y ∈ T} λ_Y ( ∑_{b ∈ C(Y)} x_b - s_Y )
Differentiation with respect to x_b and equating to zero yields conditions for the minimum:
for each b ∈ {0,1}^n: ln x_b + 1 - ∑_{Y ∈ P(b,T)} λ_Y = 0
Exponentiation of the Lagrange multipliers in the derivatives (z_Y := e^{λ_Y}) yields the product form:
x_b = e^{-1} ∏_{Y ∈ P(b,T)} z_Y
Replacing x_b in each constraint yields a condition in the exponentiated Lagrange multipliers z_Y:
for each Y ∈ T: ∑_{b ∈ C(Y)} e^{-1} ∏_{W ∈ P(b,T)} z_W = s_Y
Legend:
P(b,T) ⊆ T denotes the indexes Y of all known selectivities s_Y to which DNF atom b contributes its value x_b:
P(b,T) = { Y ∈ T | ∀ i ∈ Y: b_i = 1 } ∪ {∅}
Iterative Scaling
We can now isolate z_Y for a particular Y ∈ T:
z_Y* = e · s_Y / ∑_{b ∈ C(Y)} ∏_{W ∈ P(b,T)\{Y}} z_W
and thus iteratively compute z_Y from all z_W, W ∈ T\{Y}.
This algorithm is called Iterative Scaling (Darroch and Ratcliff, 1972) and converges to a stable set of Lagrangian multipliers z_Y, Y ∈ T.
This stable point minimizes the objective function and satisfies all constraints.
We can compute all DNF atoms x_b from these stable multipliers using
x_b = e^{-1} ∏_{Y ∈ P(b,T)} z_Y
and can in turn compute all missing selectivities
s_Y = ∑_{b ∈ C(Y)} x_b
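The whole procedure fits in a few lines. Below is a sketch of iterative scaling for the running example (our own code, not the paper's implementation; `P`, `C`, and the product form follow the definitions above). It reproduces the maximum-entropy estimate x111 ≈ 0.049918 shown on the later slides:

```python
import math
from itertools import product

# Known selectivities s_Y from the running example; Y is a frozenset of
# predicate indexes (1 = Mazda, 2 = 323, 3 = red).
s = {
    frozenset(): 1.0,
    frozenset({1}): 0.1,
    frozenset({2}): 0.2,
    frozenset({3}): 0.2,
    frozenset({1, 2}): 0.05,
    frozenset({1, 3}): 0.09,
    frozenset({2, 3}): 0.15,
}
n = 3
atoms = list(product((0, 1), repeat=n))   # all b in {0,1}^n
T = list(s)

def P(b):
    """P(b,T): indexes Y of known selectivities to which atom b contributes."""
    return [Y for Y in T if all(b[i - 1] == 1 for i in Y)]

def C(Y):
    """C(Y): all DNF atoms that contribute to s_Y."""
    return [b for b in atoms if all(b[i - 1] == 1 for i in Y)]

def x(b, z):
    """Product form x_b = e^-1 * prod_{Y in P(b,T)} z_Y."""
    v = math.exp(-1)
    for Y in P(b):
        v *= z[Y]
    return v

z = {Y: 1.0 for Y in T}   # exponentiated Lagrange multipliers
for _ in range(300):      # sweeps over all constraints
    for Y in T:
        # Isolate z_Y: rescale it so that constraint s_Y holds exactly.
        z[Y] = s[Y] / sum(x(b, z) / z[Y] for b in C(Y))

x111 = x((1, 1, 1), z)    # maximum-entropy estimate for Mazda & 323 & red
print(round(x111, 6))
```

The first update matches the trace on the next slides: with all z = 1, z_1 becomes e · 0.1 / 4 = 0.067957.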
Maximum Entropy Solution of the Example
Selectivity( Make = 'Mazda' AND Model = '323' AND Color = 'red' )
s1,2,3 = x111 = ???
Knowledge:
s(Mazda) = s1 = 0.1, s(323) = s2 = 0.2, s(red) = s3 = 0.2
s(Mazda & 323) = s1,2 = 0.05, s(Mazda & red) = s1,3 = 0.09, s(red & 323) = s2,3 = 0.15
[Figure: Venn diagram with the eight DNF atoms labeled 000 … 111]
Iterative Scaling
Knowledge: s1 = 0.1, s2 = 0.2, s3 = 0.2, s1,2 = 0.05, s1,3 = 0.09, s2,3 = 0.15, s∅ = 1
[Figure: the eight DNF atoms 000 … 111, each shown as the product of the multipliers z_Y, Y ∈ P(b,T), that make up its value x_b; here the atoms x100, x101, x110, x111 contributing to s1 are highlighted]
1st iteration, updating z1:
z1* = e · s1 / ∑_{b ∈ C({1})} ∏_{W ∈ P(b,T)\{1}} z_W
New multipliers: z1 = 0.067957, z2 = 1, z1,2 = 1, z3 = 1, z1,3 = 1, z2,3 = 1, z∅ = 1
Implied selectivities: s1 = 0.1, s2 = 0.785759, s1,2 = 0.05, s3 = 0.785759, s1,3 = 0.05, s2,3 = 0.392879, s∅ = 1.571518
Iterative Scaling
Knowledge: s1 = 0.1, s2 = 0.2, s3 = 0.2, s1,2 = 0.05, s1,3 = 0.09, s2,3 = 0.15, s∅ = 1
[Figure: multiplier diagram; the atoms x010, x011, x110, x111 contributing to s2 are highlighted]
1st iteration, updating z2:
z2* = e · s2 / ∑_{b ∈ C({2})} ∏_{W ∈ P(b,T)\{2}} z_W
New multipliers: z1 = 0.067957, z2 = 0.254531, z1,2 = 1, z3 = 1, z1,3 = 1, z2,3 = 1, z∅ = 1
Implied selectivities: s1 = 0.062727, s2 = 0.2, s1,2 = 0.012727, s3 = 0.492879, s1,3 = 0.031363, s2,3 = 0.1, s∅ = 0.985759
Iterative Scaling
Knowledge: s1 = 0.1, s2 = 0.2, s3 = 0.2, s1,2 = 0.05, s1,3 = 0.09, s2,3 = 0.15, s∅ = 1
[Figure: multiplier diagram; the atoms x110, x111 contributing to s1,2 are highlighted]
1st iteration, updating z1,2:
z1,2* = e · s1,2 / ∑_{b ∈ C({1,2})} ∏_{W ∈ P(b,T)\{1,2}} z_W
New multipliers: z1 = 0.067957, z2 = 0.254531, z1,2 = 3.928794, z3 = 1, z1,3 = 1, z2,3 = 1, z∅ = 1
Implied selectivities: s1 = 0.1, s2 = 0.237273, s1,2 = 0.05, s3 = 0.511516, s1,3 = 0.05, s2,3 = 0.118637, s∅ = 1.023032
Iterative Scaling
Knowledge: s1 = 0.1, s2 = 0.2, s3 = 0.2, s1,2 = 0.05, s1,3 = 0.09, s2,3 = 0.15, s∅ = 1
[Figure: multiplier diagram; the atoms x001, x011, x101, x111 contributing to s3 are highlighted]
1st iteration, updating z3:
z3* = e · s3 / ∑_{b ∈ C({3})} ∏_{W ∈ P(b,T)\{3}} z_W
New multipliers: z1 = 0.067957, z2 = 0.254531, z1,2 = 3.928794, z3 = 0.390994, z1,3 = 1, z2,3 = 1, z∅ = 1
Implied selectivities: s1 = 0.069550, s2 = 0.165023, s1,2 = 0.034775, s3 = 0.2, s1,3 = 0.019550, s2,3 = 0.046386, s∅ = 0.711516
Iterative Scaling
Knowledge: s1 = 0.1, s2 = 0.2, s3 = 0.2, s1,2 = 0.05, s1,3 = 0.09, s2,3 = 0.15, s∅ = 1
[Figure: multiplier diagram; the atoms x101, x111 contributing to s1,3 are highlighted]
1st iteration, updating z1,3:
z1,3* = e · s1,3 / ∑_{b ∈ C({1,3})} ∏_{W ∈ P(b,T)\{1,3}} z_W
New multipliers: z1 = 0.067957, z2 = 0.254531, z1,2 = 3.928794, z3 = 0.390994, z1,3 = 4.603645, z2,3 = 1, z∅ = 1
Implied selectivities: s1 = 0.14, s2 = 0.200248, s1,2 = 0.07, s3 = 0.27045, s1,3 = 0.09, s2,3 = 0.081611, s∅ = 0.781966
Iterative Scaling
Knowledge: s1 = 0.1, s2 = 0.2, s3 = 0.2, s1,2 = 0.05, s1,3 = 0.09, s2,3 = 0.15, s∅ = 1
[Figure: multiplier diagram; the atoms x011, x111 contributing to s2,3 are highlighted]
1st iteration, updating z2,3:
z2,3* = e · s2,3 / ∑_{b ∈ C({2,3})} ∏_{W ∈ P(b,T)\{2,3}} z_W
New multipliers: z1 = 0.067957, z2 = 0.254531, z1,2 = 3.928794, z3 = 0.390994, z1,3 = 4.603645, z2,3 = 1.837978, z∅ = 1
Implied selectivities: s1 = 0.177709, s2 = 0.268637, s1,2 = 0.107709, s3 = 0.338839, s1,3 = 0.127709, s2,3 = 0.15, s∅ = 0.850355
Iterative Scaling
Knowledge: s1 = 0.1, s2 = 0.2, s3 = 0.2, s1,2 = 0.05, s1,3 = 0.09, s2,3 = 0.15, s∅ = 1
[Figure: multiplier diagram; all eight atoms contribute to s∅]
1st iteration, updating z∅:
z∅* = e · s∅ / ∑_{b ∈ C(∅)} ∏_{W ∈ P(b,T)\{∅}} z_W
New multipliers: z1 = 0.067957, z2 = 0.254531, z1,2 = 3.928794, z3 = 0.390994, z1,3 = 4.603645, z2,3 = 1.837978, z∅ = 1.175979
Implied selectivities: s1 = 0.208982, s2 = 0.315911, s1,2 = 0.126664, s3 = 0.398468, s1,3 = 0.150183, s2,3 = 0.176397, s∅ = 1
DNF atoms after the 1st iteration: x000 = 0.432619, x001 = 0.169152, x010 = 0.110115, x011 = 0.079133, x100 = 0.029399, x101 = 0.052919, x110 = 0.029399, x111 = 0.097264
Maximum Entropy Solution of the Example
Selectivity( Make = 'Mazda' AND Model = '323' AND Color = 'red' )
s1,2,3 = x111 = 0.049918
Iterations: 241
Knowledge:
s(Mazda) = s1 = 0.1, s(323) = s2 = 0.2, s(red) = s3 = 0.2
s(Mazda & 323) = s1,2 = 0.05, s(Mazda & red) = s1,3 = 0.09, s(red & 323) = s2,3 = 0.15
Converged DNF atoms:
x000 = 0.740082, x001 = 0.009918, x010 = 0.049918, x011 = 0.100082, x100 = 0.009918, x101 = 0.040082, x110 = 0.000082, x111 = 0.049918
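It is easy to check that these converged atoms satisfy every known selectivity simultaneously, which no single independence-based estimate did. A sketch (our own code; the dictionary keys are the bit vectors b1 b2 b3, with the atom-to-value mapping reconstructed from the constraints):

```python
# DNF-atom selectivities of the converged solution, keyed by the bit
# vector b1 b2 b3 (1 = Mazda, 2 = 323, 3 = red).
x = {
    "000": 0.740082, "001": 0.009918, "010": 0.049918, "011": 0.100082,
    "100": 0.009918, "101": 0.040082, "110": 0.000082, "111": 0.049918,
}

def s(Y):
    """Sum over C(Y): all atoms b with b_i = 1 for every i in Y."""
    return sum(v for b, v in x.items() if all(b[i - 1] == "1" for i in Y))

# Every known selectivity is met at once -- the model is consistent.
for Y, want in [({1}, 0.1), ({2}, 0.2), ({3}, 0.2),
                ({1, 2}, 0.05), ({1, 3}, 0.09), ({2, 3}, 0.15), (set(), 1.0)]:
    print(sorted(Y), round(s(Y), 6), want)
```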
Let’s compare:
case 2: s( Make AND Color ) * s( Model ) =0.018 estimated card: 18,000
case 3: s( Model AND Color ) * s( Make ) =0.015 estimated card: 15,000
Selectivity( Make = ‘Mazda’ AND Model = ‘323’AND Color = ‘red’ )
case 1: s( Make AND Model ) * s(Color) =0.010 estimated card: 10,000
case 0: s( Make) * s( Model ) * s(Color) =0.004 estimated card: 4,000
Real : s( Model AND Color AND Make ) =0.049 actual card: 49,000
ME: s( Model AND Color ) * s( Make ) =0.049918 estimated card: 49,918
Error: 10x
Error: 5x
Error: 2.5x
Error: 3x
Almost no error
Forward Estimation: Predicting s1,2,3, given …
[Figure: box plots (1st to 4th quartile plus median) of the absolute estimation error over 200 queries, for nine knowledge sets: (1) s1, s2, s3 only; (2.1a/b/c) additionally one of s1,2, s1,3, s2,3; (2.2a/b/c) additionally two of them; (2.3) all three; (3) s1,2,3 itself. The annotated 75th-percentile errors fall from 2,138 and 788 through 79, 79, 42, 65, 11, 9, and 6 down to 0 as more knowledge is used; the largest error (100th percentile) is 9,583]
Legend: 200 queries
Comparing DB2 and ME: Predicting s1,2,3, given …
[Figure: box plots (quartiles and median) of the absolute estimation error over 200 queries, DB2's state-of-the-art estimate vs. the maximum-entropy (ME) estimate, for the knowledge sets 2.2a (s1,3, s2,3), 2.2b (s1,2, s1,3), 2.2c (s1,2, s2,3), and 2.3 (s1,2, s1,3, s2,3); the ME error is consistently lower]
Legend: 200 queries
Backward Estimation: Given s1,2,3, predicting …
[Figure: box plots (quartiles and median) of the absolute estimation error over 200 queries, DB2 vs. ME, when predicting s1,2 (MAKE = ? AND MODEL = ?), s1,3 (MAKE = ? AND COLOR = ?), and s2,3 (MODEL = ? AND COLOR = ?) from knowledge of s1,2,3]
Legend: 200 queries
Computation Cost
[Figure: time until convergence of iterative scaling, plotted against the number of predicates |P| (5 to 18) for different numbers of known selectivities |T| (0 to 10)]
Related Work
Selectivity Estimation
SAC+79 P. G. Selinger et al.: Access Path Selection in a Relational DBMS. SIGMOD 1979
Chr83 S. Christodoulakis: Estimating record selectivities. Inf. Syst. 8(2): 105-115 (1983)
Lyn88 C. A. Lynch: Selectivity Estimation and Query Optimization in Large Databases with Highly Skewed Distribution of Column Values. VLDB 1988: 240-251
PC84 G. Piatetsky-Shapiro, C. Connell: Accurate Estimation of the Number of Tuples Satisfying a Condition. SIGMOD Conference 1984: 256-276
PIH+96 V. Poosala et al.: Improved histograms for selectivity estimation of range predicates. SIGMOD 1996
Recommending, Constructing, and Maintaining Multivariate Statistics
AC99 A. Aboulnaga, S. Chaudhuri: Self-tuning Histograms: Building Histograms Without Looking at Data. SIGMOD 1999: 181-192
BCG01 N. Bruno, S. Chaudhuri, L. Gravano: STHoles: A Multidimensional Workload-Aware Histogram. SIGMOD 2001
BC02 N. Bruno, S. Chaudhuri: Exploiting Statistics on Query Expressions for Optimization. SIGMOD 2002
BC03 N. Bruno, S. Chaudhuri: Efficient Creation of Statistics over Query Expressions. ICDE 2003
BC04 N. Bruno, S. Chaudhuri: Conditional Selectivity for Statistics on Query Expressions. SIGMOD 2004: 311-322
SLM+01 M. Stillger, G. Lohman, V. Markl, M. Kandil: LEO – DB2's Learning Optimizer. VLDB 2001
IMH+04 I. F. Ilyas, V. Markl, P. J. Haas, P. G. Brown, A. Aboulnaga: CORDS: Automatic discovery of correlations and soft functional dependencies. SIGMOD 2004
CN00 S. Chaudhuri, V. Narasayya: Automating Statistics Management for Query Optimizers. ICDE 2000: 339-348
DGR01 A. Deshpande, M. Garofalakis, R. Rastogi: Independence is Good: Dependency-Based Histogram Synopses for High-Dimensional Data. SIGMOD 2001
GJW+03 C. Galindo-Legaria, M. Joshi, F. Waas, et al.: Statistics on Views. VLDB 2003: 952-962
GTK01 L. Getoor, B. Taskar, D. Koller: Selectivity Estimation using Probabilistic Models. SIGMOD 2001
PI97 V. Poosala, Y. Ioannidis: Selectivity Estimation without value independence. VLDB 1997
Entropy and Maximum Entropy
Sha48 C. E. Shannon: A mathematical theory of communication. Bell System Technical Journal 27: 379-423 and 623-656, July and October 1948
DR72 J. N. Darroch, D. Ratcliff: Generalized iterative scaling for log-linear models. The Annals of Mathematical Statistics 43, 1972: 1470-1480
GP00 W. Greiff, J. Ponte: The maximum-entropy approach and probabilistic IR models. ACM TOIS 18(3): 246-287, 2000
GS85 S. Guiasu, A. Shenitzer: The principle of maximum entropy. The Mathematical Intelligencer 7(1), 1985
Conclusions
Problem: inconsistent cardinality model and bias in today's query optimizers, due to overlapping multivariate statistics (MD histograms, etc.)
To reduce bias, today's optimizers only use a consistent subset of the available multivariate statistics.
Cardinality estimates stay suboptimal despite better information.
Bias towards plans without proper statistics ("fleeing from knowledge to ignorance").
Solution: maximizing information entropy
Generalizes the concepts of uniformity and independence used in today's query optimizers.
All statistics are utilized: cardinality estimates improve, some by orders of magnitude.
The cardinality model is consistent: no bias towards particular plans.
Consistent estimates are computed in subsecond time for up to 10 predicates per table; however, the algorithm is exponential in the number of predicates.
Not covered in the talk (see paper): reducing algorithm complexity through pre-processing.
Impact on query performance: speedup, sometimes by orders of magnitude.
Future work: extension to join estimates.