IBM Research Non-Confidential | 7-12-2005 | Volker Markl © 2005 IBM Corporation Presentation...

IBM Research Non-Confidential | 7-12-2005 | Volker Markl © 2005 IBM Corporation http://w3.ibm.com/ibm/presentations Consistently Estimating the Selectivity of Conjuncts of Predicates Volker Markl, Nimrod Megiddo, Marcel Kutsch, Tam Minh Tran, Peter Haas, Utkarsh Srivastava

Page 1:

IBM Research

Non-Confidential | 7-12-2005 | Volker Markl © 2005 IBM Corporation

Consistently Estimating the Selectivity of Conjuncts of Predicates

Volker Markl, Nimrod Megiddo, Marcel Kutsch, Tam Minh Tran, Peter Haas, Utkarsh Srivastava

Page 2:

Agenda

Consistency and Bias Problems in Cardinality Estimation

The Maximum Entropy Solution

Iterative Scaling

Performance Analysis

Related Work

Conclusions

Page 3:

What is the problem?

Consider the following three attributes: Make, Model, and Color.

[Figure: the three attributes drawn with links between them. Legend: a link denotes a correlation.]

Page 4:

How to estimate the cardinality of the predicate

Make = ‘Mazda’ AND Model = ‘323’ AND Color = ‘red’

Attribute   Value     Cardinality
Make        ‘Mazda’   100,000
Model       ‘323’     200,000
Color       ‘red’     200,000

(real cardinality: 49,000)

Page 5:

Without any additional knowledge

Selectivity( Make = ‘Mazda’ AND Model = ‘323’ AND Color = ‘red’ )

Legend: s(?) denotes the selectivity of ?

Known cardinalities: Make = ‘Mazda’: 100,000; Model = ‘323’: 200,000; Color = ‘red’: 200,000

Base cardinality: 1,000,000

Independence assumption:

s( Make = ‘Mazda’ ) * s( Model = ‘323’ ) * s( Color = ‘red’ ) = (100,000 / 1,000,000) * (200,000 / 1,000,000) * (200,000 / 1,000,000) = 0.004

Estimated cardinality: 0.004 * 1,000,000 = 4,000
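The independence estimate on this slide is plain arithmetic, and a minimal sketch makes it easy to replay. The function and variable names below are illustrative and not taken from the talk; only the cardinalities come from the slide.

```python
# Base table size and single-predicate cardinalities, as given on the slide.
BASE_CARD = 1_000_000
CARD = {"make": 100_000, "model": 200_000, "color": 200_000}


def independence_selectivity(cards, base):
    """Multiply the individual selectivities card / base (independence assumption)."""
    sel = 1.0
    for card in cards:
        sel *= card / base
    return sel


sel = independence_selectivity(CARD.values(), BASE_CARD)
est_card = sel * BASE_CARD
print(f"selectivity = {sel:.3f}, estimated cardinality = {est_card:.0f}")
```

The estimate uses no information about how the three attributes relate to each other, which is why it ends up far below the real cardinality of 49,000.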

Page 6:

Additional knowledge given (1):

Selectivity( Make = ‘Mazda’ AND Model = ‘323’ AND Color = ‘red’ )

Known cardinalities: Make = ‘Mazda’: 100,000; Model = ‘323’: 200,000; Color = ‘red’: 200,000

Additional knowledge:

Make AND Model: card(‘Mazda’ AND ‘323’) = 50,000

case 1: s( Make AND Model ) * s( Color ) = (50,000 / 1,000,000) * (200,000 / 1,000,000) = 0.01

estimated card: 10,000

[Figure legend: a link between two attributes is labelled with the conjunct "Pred X AND Pred Y" and its cardinality.]

Page 7:

Additional knowledge given (2):

Selectivity( Make = ‘Mazda’ AND Model = ‘323’ AND Color = ‘red’ )

Known cardinalities: Make = ‘Mazda’: 100,000; Model = ‘323’: 200,000; Color = ‘red’: 200,000

Additional knowledge:

Make AND Model: card(‘Mazda’ AND ‘323’) = 50,000

Make AND Color: card(‘Mazda’ AND ‘red’) = 90,000

case 1: s( Make AND Model ) * s( Color ) = 0.01, estimated card: 10,000

case 2: s( Make AND Color ) * s( Model ) = (90,000 / 1,000,000) * (200,000 / 1,000,000) = 0.018

estimated card: 18,000

[Figure legend: a link between two attributes is labelled with the conjunct "Pred X AND Pred Y" and its cardinality.]

Page 8:

Additional knowledge given (3):

Selectivity( Make = ‘Mazda’ AND Model = ‘323’ AND Color = ‘red’ )

Known cardinalities: Make = ‘Mazda’: 100,000; Model = ‘323’: 200,000; Color = ‘red’: 200,000

Additional knowledge:

Make AND Model: card(‘Mazda’ AND ‘323’) = 50,000

Make AND Color: card(‘Mazda’ AND ‘red’) = 90,000

Model AND Color: card(‘323’ AND ‘red’) = 150,000

case 1: s( Make AND Model ) * s( Color ) = 0.01, estimated card: 10,000

case 2: s( Make AND Color ) * s( Model ) = 0.018, estimated card: 18,000

case 3: s( Model AND Color ) * s( Make ) = (150,000 / 1,000,000) * (100,000 / 1,000,000) = 0.015

estimated card: 15,000

[Figure legend: a link between two attributes is labelled with the conjunct "Pred X AND Pred Y" and its cardinality.]
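The three cases can be replayed side by side to show that the estimate depends only on which pairwise statistic happens to be used. This is a minimal sketch with illustrative names; the helper `estimate` is hypothetical and simply combines one pairwise cardinality with the remaining single-attribute selectivity under independence.

```python
BASE = 1_000_000
single = {"make": 100_000, "model": 200_000, "color": 200_000}
pair = {
    ("make", "model"): 50_000,   # case 1
    ("make", "color"): 90_000,   # case 2
    ("model", "color"): 150_000, # case 3
}


def estimate(pair_key):
    """Use one pairwise statistic and assume the third attribute is independent."""
    (rest,) = set(single) - set(pair_key)
    selectivity = (pair[pair_key] / BASE) * (single[rest] / BASE)
    return selectivity * BASE


estimates = {key: round(estimate(key)) for key in pair}
for key, card in estimates.items():
    print(" AND ".join(key), "->", card)
```

All three numbers are estimates of the same result, which is the inconsistency the following slides address.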

Page 9:

Why is this a problem?

case 0: s( Make ) * s( Model ) * s( Color ) = 0.004, estimated card: 4,000

case 2: s( Make AND Color ) * s( Model ) = 0.018, estimated card: 18,000

case 3: s( Model AND Color ) * s( Make ) = 0.015, estimated card: 15,000

[Figure: three alternative plans for the same result, with their cardinality estimates:]

Index Scan on (Make, Color): 90,000, then FETCH and apply Model: 18,000

Index Scan on (Model, Color): 150,000, then FETCH and apply Make: 15,000

Index Intersect of Make, Color, and Model: 4,000

Cardinality Bias: Fleeing from Knowledge to Ignorance

The plan that uses no multivariate statistics receives the lowest estimate, so the optimizer is biased towards it.

Page 10:

What has happened?

Inconsistent model: different estimates for the same intermediate result

due to multivariate statistics with overlapping information

Bias during plan selection results in the selection of sub-optimal plans

Bias avoidance means keeping the model consistent

State of the art is to keep track of the first multivariate statistic used and to ignore further overlapping multivariate statistics

This does not solve the problem, as ignoring knowledge also means bias

The bias is arbitrary: it depends on which statistics are used first during optimization

The only possible solution is to exploit all knowledge consistently

Page 11:

Problem: Only partial knowledge of the DNF atoms

[Figure: Venn diagram of the three predicates Mazda, 323, and red. Its eight regions are the DNF atoms:]

Mazda & 323 & red
Mazda & 323 & ¬red
Mazda & ¬323 & red
Mazda & ¬323 & ¬red
¬Mazda & 323 & red
¬Mazda & 323 & ¬red
¬Mazda & ¬323 & red
¬Mazda & ¬323 & ¬red

Known cardinalities: Make = ‘Mazda’: 100,000; Model = ‘323’: 200,000; Color = ‘red’: 200,000

Additional knowledge:

Make AND Model: card(‘Mazda’ AND ‘323’) = 50,000

Model AND Color: card(‘323’ AND ‘red’) = 150,000

Make AND Color: card(‘Mazda’ AND ‘red’) = 90,000

Legend:

DNF = disjunctive normal form

¬X denotes not X

Page 12:

How to compute the missing values of the distribution?

Probability( Make = ‘Mazda’ AND Model = ‘323’ AND Color = ‘red’ )

[Figure: Venn diagram of Mazda, 323, and red with its eight DNF atoms (every combination of the three predicates and their negations); the atom Mazda & 323 & red is the one to be estimated.]

Page 13:

Solution: Information Entropy H( X ) = -∑ xi log( xi )

Entropy is a measure for the “uninformedness” of a probability distribution

X = (x1, …, xm) with x1 + … + xm = 1

Maximizing information entropy for unknown selectivities, using known selectivities as constraints, will avoid bias

The less is known about a probability distribution, the larger the entropy

Nothing known: uniformity, s(X = ?) = 1/m

Only marginals known: independence, s(X = ? and Y = ?) = s(X = ?) * s(Y = ?)

Thus: the principle of maximum entropy generalizes the uniformity and independence assumptions used in today’s query optimizers
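Both claims on this slide can be checked numerically on small distributions. The sketch below is illustrative only: the example distributions are made up for the purpose, and the natural logarithm is assumed.

```python
import math


def entropy(dist):
    """Shannon entropy H(X) = -sum(x_i * log(x_i)); terms with x_i = 0 contribute 0."""
    return -sum(x * math.log(x) for x in dist if x > 0)


# Nothing known: the uniform distribution over m = 4 values has the largest entropy.
uniform = [0.25, 0.25, 0.25, 0.25]
skewed = [0.7, 0.1, 0.1, 0.1]

# Only marginals known, s(X) = 0.1 and s(Y) = 0.2. Atoms are ordered as
# (X and Y, X and not Y, not X and Y, not X and not Y).
independent = [0.1 * 0.2, 0.1 * 0.8, 0.9 * 0.2, 0.9 * 0.8]
correlated = [0.1, 0.0, 0.1, 0.8]  # same marginals, but X implies Y

print(entropy(uniform), entropy(skewed))
print(entropy(independent), entropy(correlated))
```

Among distributions with the same marginals, the independent one has the higher entropy, which is the sense in which maximum entropy generalizes the independence assumption.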

Page 14:

Entropy Maximization for Cardinality Estimation

Given some selectivities (single and conjunctive) over a space of n predicates p1, …, pn,

choose a model that is consistent with this knowledge but otherwise as uniform as possible:

maximize the entropy of the probability distribution $X = (x_b \mid b \in \{0,1\}^n)$

$$\max\big(H(X)\big) = \max\Big(-\sum_{b \in \{0,1\}^n} x_b \log x_b\Big)$$

$x_b$ is the selectivity of the DNF atom

$$\bigwedge_{i \in \{1,\dots,n\}} p_i^{b_i}$$

$b_i = 0$ means that predicate $p_i$ is negated in the DNF atom; $b_i = 1$ means that predicate $p_i$ is a positive term in the DNF atom

Legend:

$\{0,1\}^n$ denotes the n-fold cross product of the set {0,1}, i.e., $\{0,1\} \times \dots \times \{0,1\}$ (n times)

Also, for a predicate p: $p^1 = p$ and $p^0 = \text{not } p$

Page 15:

Maximum Entropy Principle – Example:

Knowledge $s_Y$, $Y \in T$:

$T = \{\{1\}, \{2\}, \{3\}, \{1,2\}, \{1,3\}, \{2,3\}, \emptyset\}$

s1 = s(Mazda) = 0.1, s2 = s(323) = 0.2, s3 = s(red) = 0.2

s1,2 = s(Mazda & 323) = 0.05

s1,3 = s(Mazda & red) = 0.09

s2,3 = s(red & 323) = 0.15

Constraints:

s1 = (Mazda & ¬323 & ¬red) + (Mazda & ¬323 & red) + (Mazda & 323 & ¬red) + (Mazda & 323 & red)

s1 = x100 + x101 + x110 + x111

[Figure: Venn diagram of Mazda, 323, and red whose eight regions are labelled with the bit vectors 000, 001, 010, 011, 100, 101, 110, 111 (bit order: Mazda, 323, red; ¬ denotes negation).]

Page 16:

Maximum Entropy Principle – Example:

Knowledge $s_Y$, $Y \in T$:

$T = \{\{1\}, \{2\}, \{3\}, \{1,2\}, \{1,3\}, \{2,3\}, \emptyset\}$

s1 = s(Mazda) = 0.1, s2 = s(323) = 0.2, s3 = s(red) = 0.2

s1,2 = s(Mazda & 323) = 0.05

s1,3 = s(Mazda & red) = 0.09

s2,3 = s(red & 323) = 0.15

Constraints:

0.10 = s1 = x100 + x101 + x110 + x111
0.20 = s2 = x010 + x011 + x110 + x111
0.20 = s3 = x001 + x011 + x101 + x111
0.05 = s1,2 = x110 + x111
0.09 = s1,3 = x101 + x111
0.15 = s2,3 = x011 + x111
1.00 = sØ = x000 + x001 + x010 + x011 + x100 + x101 + x110 + x111

Objective Function:

$$\max\big(H(X)\big) = \max\Big(-\sum_{b \in \{0,1\}^3} x_b \log x_b\Big)$$

[Figure: Venn diagram of Mazda, 323, and red whose eight regions are labelled with the bit vectors 000 to 111.]
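In this particular example the seven linear constraints leave exactly one degree of freedom among the eight atoms, so the maximum-entropy point can be found with a one-dimensional search over t = x111. The sketch below is not the method of the talk (which uses iterative scaling); it is only a cross-check under that observation, with illustrative function names and a simple ternary search.

```python
import math

# Known selectivities from the slide.
S1, S2, S3 = 0.10, 0.20, 0.20
S12, S13, S23 = 0.05, 0.09, 0.15


def atoms(t):
    """Express all eight atoms through the single free value t = x111."""
    x110, x101, x011 = S12 - t, S13 - t, S23 - t
    x100 = S1 - x110 - x101 - t
    x010 = S2 - x110 - x011 - t
    x001 = S3 - x101 - x011 - t
    x000 = 1.0 - (x100 + x010 + x001 + x110 + x101 + x011 + t)
    return {"000": x000, "001": x001, "010": x010, "011": x011,
            "100": x100, "101": x101, "110": x110, "111": t}


def entropy(t):
    return -sum(x * math.log(x) for x in atoms(t).values())


def maximize(lo, hi, steps=200):
    """Ternary search; the entropy is concave along the feasible segment."""
    for _ in range(steps):
        a = lo + (hi - lo) / 3
        b = hi - (hi - lo) / 3
        if entropy(a) < entropy(b):
            lo = a
        else:
            hi = b
    return (lo + hi) / 2


# All atoms stay positive only for 0.04 < t < 0.05.
x111 = maximize(0.04 + 1e-12, 0.05 - 1e-12)
print(f"x111 is approximately {x111:.4f}")
```

With more predicates or fewer known selectivities there is more than one free parameter, which is where a general procedure such as iterative scaling is needed.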

Page 17:

Solving the Constrained Optimization Problem

Minimize the objective function:

$$\sum_{b \in \{0,1\}^n} x_b \log x_b$$

Satisfying the |T| constraints, with $T \subseteq 2^{\{1,\dots,n\}}$:

$$\text{for all } Y \in T: \quad \sum_{b \in C(Y)} x_b = s_Y$$

General solution: Iterative Scaling

[Figure: Venn diagram of p1, p2, and p3 whose eight regions are the DNF atoms, labelled with the bit vectors 000 to 111.]

Legend:

$2^{\{1,\dots,n\}}$ denotes the powerset of {1, …, n}

C(Y) denotes all DNF atoms that contribute to Y, i.e., formally,

$C(Y) := \{b \in \{0,1\}^n \mid \forall i \in Y : b_i = 1\}$ and $C(\emptyset) := \{0,1\}^n$
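The index sets in the constraints are easy to enumerate for small n. The sketch below mirrors the definition of C(Y) and also shows its companion P(b, T), the set of known selectivities to which an atom b contributes. It assumes that the empty set is listed in T, so that it is always part of P(b, T); the function names are illustrative.

```python
from itertools import product


def atoms(n):
    """All bit vectors b in {0,1}^n, as tuples."""
    return list(product((0, 1), repeat=n))


def C(Y, n):
    """Atoms b with b_i = 1 for every i in Y (predicates are numbered from 1)."""
    return [b for b in atoms(n) if all(b[i - 1] == 1 for i in Y)]


def P(b, T):
    """Index sets Y in T to which atom b contributes; the empty set always qualifies."""
    return [Y for Y in T if all(b[i - 1] == 1 for i in Y)]


T = [frozenset(), frozenset({1}), frozenset({2}), frozenset({3}),
     frozenset({1, 2}), frozenset({1, 3}), frozenset({2, 3})]

print(C({1}, 3))      # the four atoms 100, 101, 110, 111 that make up s1
print(P((1, 0, 1), T))  # atom 101 contributes to the empty set, {1}, {3}, and {1,3}
```

The number of atoms is 2^n, so this direct enumeration is only practical for a small number of predicates.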

Page 18:

Maximum Entropy and Lagrange Multipliers

The objective function $\sum_{b \in \{0,1\}^n} x_b \log x_b$ is convex.

We can build a Lagrangian function by associating a multiplier $\lambda_Y$ with each constraint $\sum_{b \in C(Y)} x_b = s_Y$ and subtracting the constraints from the objective function:

$$L(X, \lambda) = \sum_{b \in \{0,1\}^n} x_b \log x_b - \sum_{Y \in T} \lambda_Y \Big( \sum_{b \in C(Y)} x_b - s_Y \Big)$$

Differentiating with respect to $x_b$ and equating to zero yields the conditions for the minimum:

$$\frac{\partial L}{\partial x_b} = \ln x_b + 1 - \sum_{Y \in P(b,T)} \lambda_Y = 0 \quad \text{for each } b \in \{0,1\}^n$$

Exponentiating the Lagrange multipliers in the derivatives yields the product form:

$$x_b = \frac{1}{e} \prod_{Y \in P(b,T)} z_Y, \qquad z_Y = e^{\lambda_Y}$$

Replacing $x_b$ in each constraint yields a condition in the exponentiated Lagrange multipliers $z_Y$:

$$\sum_{b \in C(Y)} \prod_{W \in P(b,T)} z_W = s_Y \cdot e \quad \text{for each } Y \in T$$

Legend:

$P(b,T) \subseteq T$ denotes the indexes Y of all known selectivities $s_Y$ to which DNF atom b contributes its value $x_b$:

$P(b,T) = \{Y \in T \mid \forall i \in Y : b_i = 1\} \cup \{\emptyset\}$

Page 19:

Iterative Scaling

We can now isolate $z_Y$ for a particular $Y \in T$:

$$z_Y = \frac{s_Y \cdot e}{\sum_{b \in C(Y)} \prod_{W \in P(b,T) \setminus \{Y\}} z_W}$$

and thus iteratively compute $z_Y$ from all $z_W$, $W \in T \setminus \{Y\}$

This algorithm is called Iterative Scaling (Darroch and Ratcliff, 1972) and converges to a stable set of Lagrangian multipliers $z_Y$, $Y \in T$

This stable point minimizes the objective function and satisfies all constraints

We can compute all DNF atoms $x_b$ from these stable multipliers using

$$x_b = \frac{1}{e} \prod_{Y \in P(b,T)} z_Y$$

and can in turn compute all missing selectivities

$$s_Y = \sum_{b \in C(Y)} x_b$$
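The update rule can be turned into a few lines of code. The following is a minimal sketch, not the authors' implementation: each multiplier z_Y is set to s_Y * e divided by the sum, over the atoms in C(Y), of the product of all other multipliers, and the atoms are then x_b = (1/e) times the product of their multipliers. It assumes that the empty set (with selectivity 1) is part of the knowledge, that the multipliers are updated in the order in which they are listed, and that a fixed number of sweeps stands in for a proper convergence test. All names are illustrative.

```python
import math
from itertools import product


def iterative_scaling(known, n, sweeps=2000):
    """known maps frozenset index sets Y (including frozenset()) to s_Y."""
    atoms = list(product((0, 1), repeat=n))
    T = list(known)

    def covers(b, Y):
        # True if atom b has a 1 at every (1-based) position in Y.
        return all(b[i - 1] == 1 for i in Y)

    def product_of_multipliers(b, z, skip=None):
        # Product of z_W over all W in P(b, T), optionally leaving out one Y.
        result = 1.0
        for W in T:
            if W != skip and covers(b, W):
                result *= z[W]
        return result

    z = {Y: 1.0 for Y in T}
    for _ in range(sweeps):
        for Y in T:
            total = sum(product_of_multipliers(b, z, skip=Y)
                        for b in atoms if covers(b, Y))
            z[Y] = known[Y] * math.e / total

    x = {b: product_of_multipliers(b, z) / math.e for b in atoms}
    return z, x


# Knowledge of the running example, listed in the order z1, z2, z1,2, z3, z1,3, z2,3, zØ.
known = {
    frozenset({1}): 0.10, frozenset({2}): 0.20, frozenset({1, 2}): 0.05,
    frozenset({3}): 0.20, frozenset({1, 3}): 0.09, frozenset({2, 3}): 0.15,
    frozenset(): 1.0,
}
z_first, x_first = iterative_scaling(known, n=3, sweeps=1)
z, x = iterative_scaling(known, n=3)
print(z_first[frozenset({1})], x[(1, 1, 1)])
```

Running a single sweep is a convenient way to compare intermediate multipliers against a hand calculation, while the default number of sweeps is deliberately generous for a problem with only eight atoms.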

Page 20:

Maximum Entropy Solution of the Example

Selectivity( Make = ‘Mazda’ AND Model = ‘323’ AND Color = ‘red’ )

s1,2,3 = x111 = ???

Knowledge:

s(Mazda) = s1 = 0.1, s(323) = s2 = 0.2, s(red) = s3 = 0.2

s(Mazda & 323) = s1,2 = 0.05, s(Mazda & red) = s1,3 = 0.09, s(red & 323) = s2,3 = 0.15

[Figure: Venn diagram of Mazda, 323, and red whose eight regions are labelled with the bit vectors 000 to 111.]

Page 21:

Iterative Scaling

Knowledge: s1 = 0.1, s2 = 0.2, s3 = 0.2, s1,2 = 0.05, s1,3 = 0.09, s2,3 = 0.15, sØ = 1

1st Iteration, update of z1 (s1 is the sum of the atoms x100, x101, x110, x111):

$$z_1 = \frac{s_1 \cdot e}{\sum_{b \in C(\{1\})} \prod_{W \in P(b,T) \setminus \{\{1\}\}} z_W}$$

Multipliers: z1 = 0.067957, z2 = 1, z1,2 = 1, z3 = 1, z1,3 = 1, z2,3 = 1, zØ = 1

Resulting selectivities: s1 = 0.1, s2 = 0.785759, s1,2 = 0.05, s3 = 0.785759, s1,3 = 0.05, s2,3 = 0.392879, sØ = 1.571518

[Figure: table of the multipliers z1, z2, z3, z1,2, z1,3, z2,3, zØ that contribute to each DNF atom 000 to 111.]

Page 22:

Iterative Scaling

Knowledge: s1 = 0.1, s2 = 0.2, s3 = 0.2, s1,2 = 0.05, s1,3 = 0.09, s2,3 = 0.15, sØ = 1

1st Iteration, update of z2 (s2 is the sum of the atoms x010, x011, x110, x111):

$$z_2 = \frac{s_2 \cdot e}{\sum_{b \in C(\{2\})} \prod_{W \in P(b,T) \setminus \{\{2\}\}} z_W}$$

Multipliers: z1 = 0.067957, z2 = 0.254531, z1,2 = 1, z3 = 1, z1,3 = 1, z2,3 = 1, zØ = 1

Resulting selectivities: s1 = 0.062727, s2 = 0.2, s1,2 = 0.012727, s3 = 0.492879, s1,3 = 0.031363, s2,3 = 0.1, sØ = 0.985759

[Figure: table of the multipliers z1, z2, z3, z1,2, z1,3, z2,3, zØ that contribute to each DNF atom 000 to 111.]

Page 23:

Iterative Scaling

Knowledge: s1 = 0.1, s2 = 0.2, s3 = 0.2, s1,2 = 0.05, s1,3 = 0.09, s2,3 = 0.15, sØ = 1

1st Iteration, update of z1,2 (s1,2 is the sum of the atoms x110, x111):

$$z_{1,2} = \frac{s_{1,2} \cdot e}{\sum_{b \in C(\{1,2\})} \prod_{W \in P(b,T) \setminus \{\{1,2\}\}} z_W}$$

Multipliers: z1 = 0.067957, z2 = 0.254531, z1,2 = 3.928794, z3 = 1, z1,3 = 1, z2,3 = 1, zØ = 1

Resulting selectivities: s1 = 0.1, s2 = 0.237273, s1,2 = 0.05, s3 = 0.511516, s1,3 = 0.05, s2,3 = 0.118637, sØ = 1.023032

[Figure: table of the multipliers z1, z2, z3, z1,2, z1,3, z2,3, zØ that contribute to each DNF atom 000 to 111.]

Page 24:

Iterative Scaling

Knowledge: s1 = 0.1, s2 = 0.2, s3 = 0.2, s1,2 = 0.05, s1,3 = 0.09, s2,3 = 0.15, sØ = 1

1st Iteration, update of z3 (s3 is the sum of the atoms x001, x011, x101, x111):

$$z_3 = \frac{s_3 \cdot e}{\sum_{b \in C(\{3\})} \prod_{W \in P(b,T) \setminus \{\{3\}\}} z_W}$$

Multipliers: z1 = 0.067957, z2 = 0.254531, z1,2 = 3.928794, z3 = 0.390994, z1,3 = 1, z2,3 = 1, zØ = 1

Resulting selectivities: s1 = 0.069550, s2 = 0.165023, s1,2 = 0.034775, s3 = 0.2, s1,3 = 0.019550, s2,3 = 0.046386, sØ = 0.711516

[Figure: table of the multipliers z1, z2, z3, z1,2, z1,3, z2,3, zØ that contribute to each DNF atom 000 to 111.]

Page 25:

Iterative Scaling

Knowledge: s1 = 0.1, s2 = 0.2, s3 = 0.2, s1,2 = 0.05, s1,3 = 0.09, s2,3 = 0.15, sØ = 1

1st Iteration, update of z1,3 (s1,3 is the sum of the atoms x101, x111):

$$z_{1,3} = \frac{s_{1,3} \cdot e}{\sum_{b \in C(\{1,3\})} \prod_{W \in P(b,T) \setminus \{\{1,3\}\}} z_W}$$

Multipliers: z1 = 0.067957, z2 = 0.254531, z1,2 = 3.928794, z3 = 0.390994, z1,3 = 4.603645, z2,3 = 1, zØ = 1

Resulting selectivities: s1 = 0.14, s2 = 0.200248, s1,2 = 0.07, s3 = 0.27045, s1,3 = 0.09, s2,3 = 0.081611, sØ = 0.781966

[Figure: table of the multipliers z1, z2, z3, z1,2, z1,3, z2,3, zØ that contribute to each DNF atom 000 to 111.]

Page 26:

Iterative Scaling

Knowledge: s1 = 0.1, s2 = 0.2, s3 = 0.2, s1,2 = 0.05, s1,3 = 0.09, s2,3 = 0.15, sØ = 1

1st Iteration, update of z2,3 (s2,3 is the sum of the atoms x011, x111):

$$z_{2,3} = \frac{s_{2,3} \cdot e}{\sum_{b \in C(\{2,3\})} \prod_{W \in P(b,T) \setminus \{\{2,3\}\}} z_W}$$

Multipliers: z1 = 0.067957, z2 = 0.254531, z1,2 = 3.928794, z3 = 0.390994, z1,3 = 4.603645, z2,3 = 1.837978, zØ = 1

Resulting selectivities: s1 = 0.177709, s2 = 0.268637, s1,2 = 0.107709, s3 = 0.338839, s1,3 = 0.127709, s2,3 = 0.15, sØ = 0.850355

[Figure: table of the multipliers z1, z2, z3, z1,2, z1,3, z2,3, zØ that contribute to each DNF atom 000 to 111.]

Page 27:

Iterative Scaling

Knowledge: s1 = 0.1, s2 = 0.2, s3 = 0.2, s1,2 = 0.05, s1,3 = 0.09, s2,3 = 0.15, sØ = 1

1st Iteration, update of zØ (sØ is the sum of all eight atoms):

$$z_\emptyset = \frac{s_\emptyset \cdot e}{\sum_{b \in C(\emptyset)} \prod_{W \in P(b,T) \setminus \{\emptyset\}} z_W}$$

Multipliers: z1 = 0.067957, z2 = 0.254531, z1,2 = 3.928794, z3 = 0.390994, z1,3 = 4.603645, z2,3 = 1.837978, zØ = 1.175979

Resulting selectivities: s1 = 0.208982, s2 = 0.315911, s1,2 = 0.126664, s3 = 0.398468, s1,3 = 0.150183, s2,3 = 0.176397, sØ = 1

DNF atoms after the 1st iteration (bit order: Mazda, 323, red):

x000 = 0.432619
x001 = 0.169152
x010 = 0.110115
x011 = 0.079133
x100 = 0.029399
x101 = 0.052919
x110 = 0.029399
x111 = 0.097264

[Figure: table of the multipliers z1, z2, z3, z1,2, z1,3, z2,3, zØ that contribute to each DNF atom 000 to 111.]

Page 28:

Maximum Entropy Solution of the Example

Selectivity( Make = ‘Mazda’ AND Model = ‘323’ AND Color = ‘red’ )

s1,2,3 = x111 = 0.049918

Iterations: 241

Knowledge:

s(Mazda) = s1 = 0.1, s(323) = s2 = 0.2, s(red) = s3 = 0.2

s(Mazda & 323) = s1,2 = 0.05, s(Mazda & red) = s1,3 = 0.09, s(red & 323) = s2,3 = 0.15

DNF atoms of the solution (bit order: Mazda, 323, red):

x000 = 0.740082
x001 = 0.009918
x010 = 0.049918
x011 = 0.100082
x100 = 0.009918
x101 = 0.040082
x110 = 0.000082
x111 = 0.049918
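The eight values of the solution can be checked against the known selectivities by summing the atoms that each selectivity covers. The sketch below assumes the assignment of values to atoms shown in the dictionary; that assignment is inferred from the constraints, since the positions in the original diagram are not recoverable. Names are illustrative.

```python
# Atom values of the maximum-entropy solution (bit order: Mazda, 323, red).
# The assignment of values to atoms is an assumption inferred from the constraints.
x = {"000": 0.740082, "001": 0.009918, "010": 0.049918, "011": 0.100082,
     "100": 0.009918, "101": 0.040082, "110": 0.000082, "111": 0.049918}


def selectivity(Y):
    """Sum the atoms whose bit is 1 at every (1-based) position in Y."""
    return sum(v for b, v in x.items() if all(b[i - 1] == "1" for i in Y))


known = {(1,): 0.10, (2,): 0.20, (3,): 0.20,
         (1, 2): 0.05, (1, 3): 0.09, (2, 3): 0.15, (): 1.0}
for Y, s in known.items():
    print(Y, round(selectivity(Y), 6), s)
```

Every known selectivity is reproduced, and the previously unknown s1,2,3 is simply the single atom x111.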

Page 29:

Let’s compare:

Selectivity( Make = ‘Mazda’ AND Model = ‘323’ AND Color = ‘red’ )

Estimate                                       Selectivity   Cardinality             Error
case 0: s( Make ) * s( Model ) * s( Color )    0.004         estimated card: 4,000   10x
case 1: s( Make AND Model ) * s( Color )       0.010         estimated card: 10,000  5x
case 2: s( Make AND Color ) * s( Model )       0.018         estimated card: 18,000  2.5x
case 3: s( Model AND Color ) * s( Make )       0.015         estimated card: 15,000  3x
ME: s( Make AND Model AND Color )              0.049918      estimated card: 49,918  almost no error
Real: s( Make AND Model AND Color )            0.049         actual card: 49,000
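The error factors can be recomputed from the cardinalities on this slide. The sketch below uses an illustrative `error_factor` helper that divides the larger of estimate and actual by the smaller; computed this way, the factors come out slightly different from the figures on the slide, which appear to be rounded.

```python
ACTUAL = 49_000
estimates = {"case 0": 4_000, "case 1": 10_000, "case 2": 18_000,
             "case 3": 15_000, "ME": 49_918}


def error_factor(estimate, actual):
    """Ratio (>= 1) between the estimated and the actual cardinality."""
    return max(estimate, actual) / min(estimate, actual)


factors = {name: round(error_factor(est, ACTUAL), 2) for name, est in estimates.items()}
for name, factor in factors.items():
    print(f"{name}: {factor}x")
```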

Page 30:

Forward Estimation: Predicting s1,2,3, given …

[Figure: box plot of the absolute estimation error (y-axis, 0 to 1000) over 200 queries for the knowledge sets 1, 2.1b, 2.1c, 2.1a, 2.2c, 2.2a, 2.2b, 2.3, and 3 (x-axis). Annotations under the axis list the given selectivities: s1, s2, s3; single pairs out of s1,3, s2,3, s1,2; two pairs; all three pairs; and s1,2,3. Data labels as extracted: "75%: 2138", "788", "79 7942 65", "11 9 6 0", "100%: 9583". Legend: 1st to 4th quartile, median.]

Page 31:

Comparing DB2 and ME: Predicting s1,2,3, given …

[Figure: box plots of the absolute estimation error (y-axis, 0 to 1000) over 200 queries, with one pair of bars, labelled SOTA/DB2 and ME, for each of the knowledge sets 2.2a, 2.2b, 2.2c, and 2.3. Annotations under the axis list the given selectivities: s1,3 and s2,3; s1,2 and s1,3; s1,2 and s2,3; s1,2, s1,3, and s2,3. Data labels as extracted: "44 4411 979 65 43 6". Legend: 1st to 4th quartile and median (one copy of the legend reads mean).]

Page 32:

Backward Estimation: Given s1,2,3, predicting …

[Figure: box plots of the absolute estimation error (y-axis, 0 to 1200) over 200 queries, with one pair of bars, labelled SOTA/DB2 and ME, for each predicted selectivity: s1,2 (MAKE = ? AND MODEL = ?), s1,3 (MAKE = ? AND COLOR = ?), and s2,3 (MODEL = ? AND COLOR = ?). Legend: 1st to 4th quartile and median (one copy of the legend reads mean).]

Page 33:

Computation Cost

[Figure: time until convergence of iterative scaling (y-axis, 0 to 100) plotted against the number of predicates |P| (x-axis, 5 to 18); a further axis labelled |T| runs from 0 to 10.]

Page 34:

Related Work

Selectivity Estimation

SAC+79 P. G. Selinger et al.: Access Path Selection in a Relational DBMS. SIGMOD 1979

Chr83 S. Christodoulakis: Estimating record selectivities. Inf. Syst. 8(2): 105-115 (1983)

Lyn88 C. A. Lynch: Selectivity Estimation and Query Optimization in Large Databases with Highly Skewed Distribution of Column Values. VLDB 1988: 240-251

PC84 G. Piatetsky-Shapiro, C. Connell: Accurate Estimation of the Number of Tuples Satisfying a Condition. SIGMOD Conference 1984: 256-276

PIH+96 V. Poosala et al.: Improved histograms for selectivity estimation of range predicates. SIGMOD 1996

Recommending, Constructing, and Maintaining Multivariate Statistics

AC99 A. Aboulnaga, S. Chaudhuri: Self-tuning Histograms: Building Histograms Without Looking at Data. SIGMOD 1999: 181-192

BCG01 N. Bruno, S. Chaudhuri, L. Gravano: STHoles: A Multidimensional Workload-Aware Histogram. SIGMOD 2001

BC02 N. Bruno, S. Chaudhuri: Exploiting Statistics on Query Expressions for Optimization. SIGMOD 2002

BC03 N. Bruno, S. Chaudhuri: Efficient Creation of Statistics over Query Expressions. ICDE 2003

BC04 N. Bruno, S. Chaudhuri: Conditional Selectivity for Statistics on Query Expressions. SIGMOD 2004: 311-322

SLM+01 M. Stillger, G. Lohman, V. Markl, M. Kandil: LEO – DB2’s Learning Optimizer. VLDB 2001

IMH+04 I. F. Ilyas, V. Markl, P. J. Haas, P. G. Brown, A. Aboulnaga: CORDS: Automatic discovery of correlations and soft functional dependencies. SIGMOD 2004

CN00 S. Chaudhuri, V. Narasayya: Automating Statistics Management for Query Optimizers. ICDE 2000: 339-348

DGR01 A. Deshpande, M. Garofalakis, R. Rastogi: Independence is Good: Dependency-Based Histogram Synopses for High-Dimensional Data. SIGMOD 2001

GJW+03 C. Galindo-Legaria, M. Joshi, F. Waas, et al.: Statistics on Views. VLDB 2003: 952-962

GTK01 L. Getoor, B. Taskar, D. Koller: Selectivity Estimation using Probabilistic Models. SIGMOD 2001

PI97 V. Poosala, Y. Ioannidis: Selectivity Estimation without value independence. VLDB 1997

Entropy and Maximum Entropy

Sha48 C. E. Shannon: A mathematical theory of communication. Bell System Technical Journal 27: 379-423 and 623-656, July and October 1948

DR72 J. N. Darroch, D. Ratcliff: Generalized iterative scaling for log-linear models. The Annals of Mathematical Statistics 43: 1470-1480, 1972

GP00 W. Greiff, J. Ponte: The maximum-entropy approach and probabilistic IR models. ACM Trans. Inf. Syst. 18(3): 246-287, 2000

GS85 S. Guiasu, A. Shenitzer: The principle of maximum entropy. The Mathematical Intelligencer 7(1), 1985

Page 35:

Conclusions

Problem: Inconsistent cardinality model and bias in today’s query optimizers due to overlapping multivariate statistics (MD histograms, etc.)

To reduce bias, today’s optimizers only use a consistent subset of the available multivariate statistics

Cardinality estimates are suboptimal despite better information

Bias towards plans without proper statistics (“fleeing from knowledge to ignorance”)

Solution: Maximizing Information Entropy

Generalizes the concepts of uniformity and independence used in today’s query optimizers

All statistics are utilized: cardinality estimates improve, some by orders of magnitude

The cardinality model is consistent: no bias towards particular plans

Consistent estimates are computed in subsecond time for up to 10 predicates per table; however, the algorithm is exponential in the number of predicates

Not covered in the talk (see paper):

Reducing algorithm complexity through pre-processing

Impact on query performance: speedup, sometimes by orders of magnitude

Future Work: Extension to join estimates