Query Execution 2 and Query Optimization Instructor: Matei Zaharia cs245.stanford.edu


Page 1: Query Execution 2 and Query Optimizationweb.stanford.edu/class/cs245/slides/07-Query-Optimization-p1.pdf · Query Execution 2 and Query Optimization Instructor: Matei Zaharia cs245.stanford.edu.

Query Execution 2 and Query Optimization

Instructor: Matei Zaharia
cs245.stanford.edu

Query Execution Overview

Query representation (e.g. SQL)
→ Logical query plan (e.g. relational algebra)
→ Optimized logical plan
→ Physical plan (code/operators to run)

(Diagram label: Query optimization)

Example SQL Query

SELECT title
FROM StarsIn
WHERE starName IN (
    SELECT name
    FROM MovieStar
    WHERE birthdate LIKE '%1960'
);

(Find the movies with stars born in 1960)

Parse Tree

<Query>
  <SFW>
    SELECT <SelList>: <Attribute> title
    FROM <FromList>: <RelName> StarsIn
    WHERE <Condition>: <Tuple> IN <Query>
      <Tuple>: <Attribute> starName
      <Query>: ( <Query> )
        <Query>
          <SFW>
            SELECT <SelList>: <Attribute> name
            FROM <FromList>: <RelName> MovieStar
            WHERE <Condition>: <Attribute> LIKE <Pattern>
              <Attribute>: birthDate
              <Pattern>: '%1960'

Logical Query Plan

Πtitle
  σstarName=name
    ⨯
      StarsIn
      Πname
        σbirthdate LIKE '%1960'
          MovieStar

Improved Logical Query Plan

Πtitle
  ⨝starName=name
    StarsIn
    Πname
      σbirthdate LIKE '%1960'
        MovieStar

Estimate Result Sizes

[Diagram: the logical plan over StarsIn and MovieStar (with its Π and σ operators), with intermediate results annotated "Need expected size"]

One Physical Plan

Hash join (parameters: join order, memory size, project attributes, ...)
  Seq scan on StarsIn
  Index scan on MovieStar
Scan parameters: select condition, ...

Another Physical Plan

Hash join (parameters: join order, memory size, project attributes, ...)
  Index scan on StarsIn
  Seq scan on MovieStar
Scan parameters: select condition, ...

Another Physical Plan

Sort-merge join
  Seq scan on StarsIn
  Seq scan on MovieStar

Estimating Plan Costs

Logical plan
→ Physical plan candidates: P1 P2 … Pn
→ Costs: C1 C2 … Cn
→ Pick best!

Execution Methods: Once We Have a Plan, How to Run it?

Several options that trade off complexity, performance and startup time

Example: Simple Query

SELECT quantity * price
FROM orders
WHERE productId = 75

Πquantity*price (σproductId=75 (orders))

Method 1: Interpretation

interface Operator {
  Tuple next();
}

class TableScan: Operator {
  String tableName;
}

class Select: Operator {
  Operator parent;
  Expression condition;
}

class Project: Operator {
  Operator parent;
  Expression[] exprs;
}

interface Expression {
  Value compute(Tuple in);
}

class Attribute: Expression {
  String name;
}

class Times: Expression {
  Expression left, right;
}

class Equals: Expression {
  Expression left, right;
}

Example Expression Classes

class Attribute: Expression {
  String name;

  Value compute(Tuple in) {
    return in.getField(name);  // probably better to use a numeric field ID instead
  }
}

class Times: Expression {
  Expression left, right;

  Value compute(Tuple in) {
    return left.compute(in) * right.compute(in);
  }
}

Example Operator Classes

class TableScan: Operator {
  String tableName;

  Tuple next() {
    // read & return next record from file
  }
}

class Project: Operator {
  Operator parent;
  Expression[] exprs;

  Tuple next() {
    tuple = parent.next();
    if (tuple == null) return null;  // end of input: pass it on to the caller
    fields = [expr.compute(tuple) for expr in exprs];
    return new Tuple(fields);
  }
}

Running Our Query with Interpretation

ops = Project(
  expr = Times(Attr("quantity"), Attr("price")),
  parent = Select(
    expr = Equals(Attr("productId"), Literal(75)),
    parent = TableScan("orders")
  )
);

while (true) {
  Tuple t = ops.next();  // recursively calls Operator.next() and Expression.compute()
  if (t != null) {
    out.write(t);
  } else {
    break;
  }
}

Pros & cons of this approach?
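The record-at-a-time interpreter outlined above can be sketched in runnable form. This is a minimal Python sketch, not the course's implementation: tuples are plain dictionaries, `TableScan` reads from an in-memory list instead of a file, and the sample `orders` rows are made up for illustration.

```python
from typing import Dict, List, Optional

Row = Dict[str, int]  # assumed representation: field name -> value


class Attr:
    def __init__(self, name: str):
        self.name = name

    def compute(self, row: Row) -> int:
        return row[self.name]


class Literal:
    def __init__(self, value: int):
        self.value = value

    def compute(self, row: Row) -> int:
        return self.value


class Times:
    def __init__(self, left, right):
        self.left, self.right = left, right

    def compute(self, row: Row) -> int:
        return self.left.compute(row) * self.right.compute(row)


class Equals:
    def __init__(self, left, right):
        self.left, self.right = left, right

    def compute(self, row: Row) -> bool:
        return self.left.compute(row) == self.right.compute(row)


class TableScan:
    def __init__(self, rows: List[Row]):
        self._it = iter(rows)  # stands in for reading records from a file

    def next(self) -> Optional[Row]:
        return next(self._it, None)


class Select:
    def __init__(self, condition, parent):
        self.condition, self.parent = condition, parent

    def next(self) -> Optional[Row]:
        # Pull from the parent until a row passes the condition or input ends.
        while (row := self.parent.next()) is not None:
            if self.condition.compute(row):
                return row
        return None


class Project:
    def __init__(self, exprs, parent):
        self.exprs, self.parent = exprs, parent

    def next(self) -> Optional[List[int]]:
        row = self.parent.next()
        if row is None:
            return None
        return [e.compute(row) for e in self.exprs]


def run(op) -> List[List[int]]:
    out = []
    while (t := op.next()) is not None:
        out.append(t)
    return out


orders = [  # made-up sample data
    {"productId": 75, "quantity": 2, "price": 10},
    {"productId": 12, "quantity": 1, "price": 99},
    {"productId": 75, "quantity": 3, "price": 5},
]
plan = Project(
    exprs=[Times(Attr("quantity"), Attr("price"))],
    parent=Select(Equals(Attr("productId"), Literal(75)), TableScan(orders)),
)
results = run(plan)
```

Every output row costs several dynamically dispatched `next()` and `compute()` calls, which is the per-record overhead that the next method tries to amortize.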

Method 2: Vectorization

Interpreting query plans one record at a time is simple, but it’s too slow
» Lots of virtual function calls and branches for each record (recall Jeff Dean’s numbers)

Keep recursive interpretation, but make Operators and Expressions run on batches

Implementing Vectorization

class TupleBatch {
  // Efficient storage, e.g.
  // schema + column arrays
}

interface Operator {
  TupleBatch next();
}

class Select: Operator {
  Operator parent;
  Expression condition;
}

...

class ValueBatch {
  // Efficient storage
}

interface Expression {
  ValueBatch compute(TupleBatch in);
}

class Times: Expression {
  Expression left, right;
}

...

Typical Implementation

Values stored in columnar arrays (e.g. int[]) with a separate bit array to mark nulls

Tuple batches fit in L1 or L2 cache

Operators use SIMD instructions to update both values and null fields without branching
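To make the batch interfaces concrete, here is a small Python sketch of a vectorized `Times` expression over columnar arrays with a separate null mask. It only illustrates the data layout and the one-call-per-batch structure: plain Python lists stand in for typed arrays, there is no SIMD here, and the class bodies and sample values are assumptions rather than the course's code.

```python
from typing import Dict, List


class ValueBatch:
    """One column of values plus a parallel null mask (True means NULL)."""

    def __init__(self, values: List[int], nulls: List[bool]):
        self.values, self.nulls = values, nulls


class TupleBatch:
    """A batch of rows stored column by column."""

    def __init__(self, columns: Dict[str, ValueBatch]):
        self.columns = columns


class Attr:
    def __init__(self, name: str):
        self.name = name

    def compute(self, batch: TupleBatch) -> ValueBatch:
        return batch.columns[self.name]


class Times:
    def __init__(self, left, right):
        self.left, self.right = left, right

    def compute(self, batch: TupleBatch) -> ValueBatch:
        lhs, rhs = self.left.compute(batch), self.right.compute(batch)
        # Multiply every slot unconditionally and combine the null masks
        # separately, so no per-row "is this null?" test is needed.
        values = [a * b for a, b in zip(lhs.values, rhs.values)]
        nulls = [a | b for a, b in zip(lhs.nulls, rhs.nulls)]
        return ValueBatch(values, nulls)


batch = TupleBatch({
    "quantity": ValueBatch([2, 0, 3], [False, True, False]),  # middle value is NULL
    "price": ValueBatch([10, 99, 5], [False, False, False]),
})
product = Times(Attr("quantity"), Attr("price")).compute(batch)
```

The expression tree is still walked recursively, but `compute` is called once per batch rather than once per record; the value stored in a null slot is ignored by consumers that check the mask.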


Pros & Cons of Vectorization

+ Faster than record-at-a-time if the query processes many records

+ Relatively simple to implement

– Lots of nulls in batches if query is selective

– Data travels between CPU & cache a lot


Method 3: Compilation

Turn the query into executable code


Compilation Example

Πquantity*price (σproductId=75 (orders))

class MyQuery {
  void run() {
    // OrdersTuple: generated class with the right field types for orders table
    Iterator<OrdersTuple> in = openTable("orders");
    for (OrdersTuple t: in) {
      if (t.productId == 75) {
        out.write(Tuple(t.quantity * t.price));
      }
    }
  }
}

Can also theoretically generate vectorized code
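The same idea can be tried out in a few lines of Python: build source text specialized to one query, compile it, and call the result. This is only a sketch under stated assumptions. Real systems typically generate Java, C++ or LLVM IR from a plan tree; here the filter and projection are passed in as already-rendered Python expressions, and `generate_source` and `compile_query` are hypothetical helper names.

```python
from typing import Callable, Dict, Iterable, List

Row = Dict[str, int]


def generate_source(filter_expr: str, project_expr: str) -> str:
    """Build the source of a function specialized to one query."""
    return (
        "def run_query(rows):\n"
        "    out = []\n"
        "    for t in rows:\n"
        f"        if {filter_expr}:\n"
        f"            out.append({project_expr})\n"
        "    return out\n"
    )


def compile_query(filter_expr: str, project_expr: str) -> Callable[[Iterable[Row]], List[int]]:
    namespace: dict = {}
    # exec is acceptable here only because the text was generated by us,
    # not taken from untrusted input.
    exec(generate_source(filter_expr, project_expr), namespace)
    return namespace["run_query"]


run_query = compile_query('t["productId"] == 75', 't["quantity"] * t["price"]')

orders = [  # made-up sample data
    {"productId": 75, "quantity": 2, "price": 10},
    {"productId": 12, "quantity": 1, "price": 99},
    {"productId": 75, "quantity": 3, "price": 5},
]
results = run_query(orders)
```

The generated loop contains no operator or expression objects at all, at the price of a code generation and compilation step before the first row is produced.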


Pros & Cons of Compilation

+ Potential to get fastest possible execution

+ Leverage existing work in compilers

– Complex to implement

– Compilation takes time

– Generated code may not match hand-written code

What’s Used Today?

Depends on context & other bottlenecks

Transactional databases (e.g. MySQL): mostly record-at-a-time interpretation

Analytical systems (Vertica, Spark SQL): vectorization, sometimes compilation

ML libs (TensorFlow): mostly vectorization (the records are vectors!), some compilation

Query Optimization


Outline

What can we optimize?

Rule-based optimization

Data statistics

Cost models

Cost-based plan selection


What Can We Optimize?

Operator graph: what operators do we run, and in what order?

Operator implementation: for operators with several impls (e.g. join), which one to use?

Access paths: how to read each table?
» Index scan, table scan, C-store projections, …

Typical Challenge

There is an exponentially large set of possible query plans:

(Access paths for table 1) ⨯ (Access paths for table 2) ⨯ (Algorithms for join 1) ⨯ (Algorithms for join 2) ⨯ …

Result: we’ll need techniques to prune the search space and reduce the complexity involved

What is a Rule?

Procedure to replace part of the query plan based on a pattern seen in the plan

Example: When I see expr OR TRUE for an expression expr, replace this with TRUE


Implementing Rules

Each rule is typically a function that walks through the query plan to search for its pattern

void replaceOrTrue(Plan plan) {
  for (node in plan.nodes) {
    if (node instanceof Or) {
      if (node.right == Literal(true)) {
        plan.replace(node, Literal(true));
        break;
      }
      // Similar code if node.left == Literal(true)
    }
  }
}

Pattern matched: an Or node whose left child (node.left) is expr and whose right child (node.right) is TRUE

Implementing Rules

Rules are often grouped into phases
» E.g. simplify Boolean expressions, push down selects, choose join algorithms, etc

Each phase runs rules until they no longer apply

plan = originalPlan;
while (true) {
  for (rule in rules) {
    rule.apply(plan);
  }
  if (plan was not changed by any rule) break;
}
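The rule function and the run-until-no-change loop can be combined into a small working example. The sketch below is an assumption-laden illustration, not Catalyst or the course's code: expressions are immutable Python dataclasses, each rule returns a rewritten node instead of mutating a plan in place, and `and_true` is a second made-up rule added to show rules enabling one another.

```python
from dataclasses import dataclass
from typing import Callable, List, Union


@dataclass(frozen=True)
class Lit:
    value: bool


@dataclass(frozen=True)
class Var:
    name: str


@dataclass(frozen=True)
class Or:
    left: "Expr"
    right: "Expr"


@dataclass(frozen=True)
class And:
    left: "Expr"
    right: "Expr"


Expr = Union[Lit, Var, Or, And]
Rule = Callable[[Expr], Expr]


def or_true(e: Expr) -> Expr:
    # expr OR TRUE  =>  TRUE (on either side)
    if isinstance(e, Or) and Lit(True) in (e.left, e.right):
        return Lit(True)
    return e


def and_true(e: Expr) -> Expr:
    # expr AND TRUE  =>  expr (on either side)
    if isinstance(e, And):
        if e.left == Lit(True):
            return e.right
        if e.right == Lit(True):
            return e.left
    return e


def apply_everywhere(rule: Rule, e: Expr) -> Expr:
    """Rewrite the children first, then try the rule at this node."""
    if isinstance(e, (Or, And)):
        e = type(e)(apply_everywhere(rule, e.left), apply_everywhere(rule, e.right))
    return rule(e)


def optimize(e: Expr, rules: List[Rule]) -> Expr:
    while True:
        before = e
        for rule in rules:
            e = apply_everywhere(rule, e)
        if e == before:  # no rule changed the expression: fixed point reached
            return e


expr = And(Var("a"), Or(Var("b"), And(Lit(True), Lit(True))))
simplified = optimize(expr, [or_true, and_true])
```

Here `or_true` cannot fire on the first pass; it only applies after `and_true` has reduced `TRUE AND TRUE` to `TRUE`, which is why the loop repeats until nothing changes.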

Result

Simple rules can work together to optimize complex query plans (if designed well):

SELECT * FROM users WHERE
(age>=16 && loc==CA) || (age>=16 && loc==NY) || age>=18

(age>=16) && (loc==CA || loc==NY) || age>=18

(age>=16 && loc IN (CA, NY)) || age>=18

age>=18 || (age>=16 && loc IN (CA, NY))

Example Extensible Optimizer

For Thursday, you’ll read about Spark SQL’s Catalyst optimizer
» Written in Scala using its pattern matching features to simplify writing rules
» >500 contributors worldwide, >1000 types of expressions, and hundreds of rules

We’ll also use Spark SQL in assignment 2

Common Rule-Based Optimizations

Simplifying expressions in select, project, etc
» Boolean algebra, numeric expressions, string expressions, etc
» Many redundancies because queries are optimized for readability or generated by code

Simplifying relational operator graphs
» Select, project, join, etc

These relational optimizations have the most impact

Common Rule-Based Optimizations

Selecting access paths and operator implementations in simple cases
» Index column predicate ⇒ use index
» Small table ⇒ use hash join against it
» Aggregation on field with few values ⇒ use in-memory hash table

Also very high impact

Rules also often used to do type checking and analysis (easy to write recursively)

Common Relational Rules

Push selects as far down the plan as possible

Recall:

σp(R ⨝ S) = σp(R) ⨝ S if p only references R

σq(R ⨝ S) = R ⨝ σq(S) if q only references S

σp∧q(R ⨝ S) = σp(R) ⨝ σq(S) if p on R, q on S


Idea: reduce # of records early to minimize work in later ops; enable index access paths

Common Relational Rules

Push projects as far down as possible

Recall:

Πx(σp(R)) = Πx(σp(Πx∪z(R)))    where z = the fields in p

Πx∪y(R ⨝ S) = Πx∪y((Πx∪z(R)) ⨝ (Πy∪z(S)))    where x = fields in R, y = fields in S, z = fields in both

Idea: don’t process fields you’ll just throw away

Project Rules Can Backfire!

Example: R has fields A, B, C, D, E
p: A=3 ∧ B=“cat”
x: {E}

Πx(σp(R)) vs Πx(σp(Π{A,B,E}(R)))

What if R has Indexes?

Index on A (A = 3) and index on B (B = “cat”): intersect buckets to get pointers to matching tuples

In this case, should do σp(R) first!

Bottom Line

Many possible transformations aren’t always good for performance

Need more info to make good decisions
» Data statistics: properties about our input or intermediate data to be used in planning
» Cost models: how much time will an operator take given certain input data statistics?

What Are Data Statistics?

Information about the tuples in a relation that can be used to estimate size & cost
» Example: # of tuples, average size of tuples, # distinct values for each attribute, % of null values for each attribute

Typically maintained by the storage engine as tuples are added & removed in a relation
» File formats like Parquet can also have them

Some Statistics We’ll Use

For a relation R,

T(R) = # of tuples in R

S(R) = average size of R’s tuples in bytes

B(R) = # of blocks to hold all of R’s tuples

V(R, A) = # distinct values of attribute A in R


Example

R:
A    B  C   D
cat  1  10  a
cat  1  20  b
dog  1  30  a
dog  1  40  c
bat  1  50  d

A: 20 byte string
B: 4 byte integer
C: 8 byte date
D: 5 byte string

T(R) = 5      S(R) = 37
V(R, A) = 3   V(R, C) = 5
V(R, B) = 1   V(R, D) = 4
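These numbers can be reproduced directly from the table. The short Python sketch below assumes fixed-width fields (so S(R) is simply the sum of the declared widths) and stores the relation as a list of dictionaries; the function names mirror the statistics and are otherwise arbitrary.

```python
from typing import Dict, List

rows: List[Dict[str, object]] = [
    {"A": "cat", "B": 1, "C": 10, "D": "a"},
    {"A": "cat", "B": 1, "C": 20, "D": "b"},
    {"A": "dog", "B": 1, "C": 30, "D": "a"},
    {"A": "dog", "B": 1, "C": 40, "D": "c"},
    {"A": "bat", "B": 1, "C": 50, "D": "d"},
]
field_bytes = {"A": 20, "B": 4, "C": 8, "D": 5}  # widths from the example


def T(relation: List[Dict[str, object]]) -> int:
    """Number of tuples."""
    return len(relation)


def S(widths: Dict[str, int]) -> int:
    """Tuple size in bytes, assuming every field is fixed-width."""
    return sum(widths.values())


def V(relation: List[Dict[str, object]], attr: str) -> int:
    """Number of distinct values of one attribute."""
    return len({row[attr] for row in relation})


stats = {
    "T": T(rows),
    "S": S(field_bytes),
    "V": {attr: V(rows, attr) for attr in field_bytes},
}
```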

Challenge: Intermediate Tables

Keeping stats for tables on disk is easy, but what about intermediate tables that appear during a query plan?

Examples:

σp(R): We already have T(R), S(R), V(R, a), etc, but how to get these for tuples that pass p?

R ⨝ S: How many and what types of tuples pass the join condition?

Should we do (R ⨝ S) ⨝ T or R ⨝ (S ⨝ T) or (R ⨝ T) ⨝ S?

Stat Estimation Methods

Algorithms to estimate subplan stats

An ideal algorithm would have:
1) Accurate estimates of stats
2) Low cost
3) Consistent estimates (e.g. different plans for a subtree give same estimated stats)

Can’t always get all this!

Size Estimates for W = R1 ⨯ R2

S(W) = S(R1) + S(R2)

T(W) = T(R1) ⨯ T(R2)

Size Estimate for W = σA=a(R)

S(W) = S(R)
(Not true if some variable-length fields are correlated with value of A)

T(W) = ?

Example

R:
A    B  C   D
cat  1  10  a
cat  1  20  b
dog  1  30  a
dog  1  40  c
bat  1  50  d

V(R,A)=3   V(R,B)=1   V(R,C)=5   V(R,D)=4

W = σZ=val(R)

What is the probability that a given tuple will be in the answer?

T(W) = T(R) / V(R,Z)

Assumption:

Values in select expression Z=val are uniformly distributed over all V(R, Z) values


Alternate Assumption:

Values in select expression Z=val are uniformly distributed over a domain with DOM(R, Z) values


Example (alternate assumption)

R:
A    B  C   D
cat  1  10  a
cat  1  20  b
dog  1  30  a
dog  1  40  c
bat  1  50  d

V(R,A)=3, DOM(R,A)=10
V(R,B)=1, DOM(R,B)=10
V(R,C)=5, DOM(R,C)=10
V(R,D)=4, DOM(R,D)=10

W = σZ=val(R)

What is the probability that a given tuple will be in the answer?

T(W) = T(R) / DOM(R,Z)

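Selection Cardinality

SC(R, A) = average # records that satisfy equality condition on R.A

SC(R,A) = T(R) / V(R,A)      (values uniformly distributed over the V(R,A) values present)
       or T(R) / DOM(R,A)    (values uniformly distributed over the domain)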
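As a quick numeric check, the sketch below evaluates both selection cardinality estimates for the five-tuple example relation, using the V and DOM values given on the earlier slides. The dictionary layout and variable names are illustrative choices only.

```python
t_r = 5                                  # T(R) for the example relation
v = {"A": 3, "B": 1, "C": 5, "D": 4}     # V(R, Z) for each attribute Z
dom = {"A": 10, "B": 10, "C": 10, "D": 10}  # DOM(R, Z) from the alternate example


def selection_cardinality(num_tuples: int, num_values: int) -> float:
    """Expected number of tuples matching Z = val under a uniformity assumption."""
    return num_tuples / num_values


# Assumption 1: values uniform over the V(R, Z) distinct values present.
by_distinct = {z: selection_cardinality(t_r, v[z]) for z in v}
# Assumption 2: values uniform over the whole domain of DOM(R, Z) values.
by_domain = {z: selection_cardinality(t_r, dom[z]) for z in dom}
```

For attribute A the first assumption predicts about 1.67 matching tuples and the second predicts 0.5; the gap shows how much the choice of assumption matters when the column covers only part of its domain.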

What About W = σz≥val(R)?

T(W) = ?

Solution 1: T(W) = T(R) / 2

Solution 2: T(W) = T(R) / 3

Solution 3: Estimate Fraction of Values in Range

Example: R has a column Z with Min=1, Max=20, V(R,Z)=10

W = σz≥15(R)

f = (20-15+1) / (20-1+1) = 6/20 (fraction of range)

T(W) = f ⨯ T(R)

Equivalently, if we know values in column:

f = fraction of distinct values ≥ val

T(W) = f ⨯ T(R)
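The range estimate is easy to check numerically. In the sketch below, `range_fraction` follows the formula from the example for an integer column; the clamping for values outside [Min, Max] and the figure T(R) = 1000 are assumptions added for illustration, since the slide does not give T(R).

```python
def range_fraction(lo: int, hi: int, val: int) -> float:
    """Fraction of the integer range [lo, hi] with value >= val."""
    if val <= lo:
        return 1.0  # every value in the range qualifies
    if val > hi:
        return 0.0  # no value in the range qualifies
    return (hi - val + 1) / (hi - lo + 1)


f = range_fraction(lo=1, hi=20, val=15)  # (20-15+1) / (20-1+1) = 6/20
t_r = 1000                               # hypothetical T(R)
estimated_t_w = f * t_r
```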

What About More Complex Expressions?

E.g. estimate selectivity for

SELECT * FROM R
WHERE user_defined_func(a) > 10

Size Estimate for W = R1 ⨝ R2

Let X = attributes of R1
    Y = attributes of R2

Case 1: X ∩ Y = ∅:

Same as R1 ⨯ R2

Case 2: W = R1 ⨝ R2, X ∩ Y = A

R1(A, B, C), R2(A, D)

Assumption (“containment of value sets”):
V(R1, A) ≤ V(R2, A) ⇒ Every A value in R1 is in R2
V(R2, A) ≤ V(R1, A) ⇒ Every A value in R2 is in R1

Computing T(W) when V(R1, A) ≤ V(R2, A)

R1(A, B, C), R2(A, D)

Take 1 tuple from R1 and match it against R2

1 tuple matches with T(R2) / V(R2, A) tuples...

so T(W) = T(R1) ⨯ T(R2) / V(R2, A)

V(R1, A) ≤ V(R2, A) ⇒ T(W) = T(R1) ⨯ T(R2) / V(R2, A)

V(R2, A) ≤ V(R1, A) ⇒ T(W) = T(R1) ⨯ T(R2) / V(R1, A)

In General for W = R1 ⨝ R2

T(W) = T(R1) ⨯ T(R2) / max(V(R1, A), V(R2, A))

Where A is the common attribute set
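The general formula can be compared against an actual join on synthetic data. In this Python sketch the relation sizes and value distributions are made up so that both assumptions hold exactly (uniform values and containment of value sets), which is why the estimate matches the true join size; on real data it would only be an approximation.

```python
from collections import Counter
from typing import List


def estimate_join_size(t_r1: int, t_r2: int, v_r1_a: int, v_r2_a: int) -> float:
    """T(W) = T(R1) * T(R2) / max(V(R1, A), V(R2, A))."""
    return t_r1 * t_r2 / max(v_r1_a, v_r2_a)


def true_join_size(r1_a: List[int], r2_a: List[int]) -> int:
    """Exact size of the equi-join on A, from per-value counts."""
    c1, c2 = Counter(r1_a), Counter(r2_a)
    return sum(n * c2[value] for value, n in c1.items())


# Hypothetical data: R1 has 1000 tuples over 50 distinct A values,
# R2 has 200 tuples over 100 distinct A values that include all of R1's.
r1_a = [i % 50 for i in range(1000)]
r2_a = [i % 100 for i in range(200)]

estimate = estimate_join_size(len(r1_a), len(r2_a), len(set(r1_a)), len(set(r2_a)))
actual = true_join_size(r1_a, r2_a)
```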

Case 2 with Alternate Assumption

Values uniformly distributed over domain

R1(A, B, C), R2(A, D)

Each R1 tuple matches T(R2) / DOM(R2, A) tuples of R2, so

T(W) = T(R1) ⨯ T(R2) / DOM(R2, A) = T(R1) ⨯ T(R2) / DOM(R1, A)

(Assume DOM(R1, A) and DOM(R2, A) are the same)

Tuple Size after Join

In all cases:

S(W) = S(R1) + S(R2) - S(A)

where S(A) = size of attribute A

Using Similar Ideas, Can Estimate Sizes of:

ΠA,B(R)

σA=a ∧ B=b(R)

R ⨝ S with common attributes A, B, C

Set union, intersection, difference, …


For Complex Expressions, Need Intermediate T, S, V Results

E.g. W = σA=a(R1) ⨝ R2

Treat σA=a(R1) as relation U:

T(U) = T(R1) / V(R1, A)    S(U) = S(R1)

Also need V(U, *) !!


To Estimate V

E.g., U = σA=a(R1)

Say R1 has attributes A, B, C, D

V(U, A) =

V(U, B) =

V(U, C) =

V(U, D) =


Example

R1:
A    B  C   D
cat  1  10  10
cat  1  20  20
dog  1  30  10
dog  1  40  30
bat  1  50  10

V(R1, A) = 3    V(R1, B) = 1    V(R1, C) = 5    V(R1, D) = 3

U = σA=a(R1)


V(U, A) = 1

V(U, B) = 1

V(U, C) = T(R1) / V(R1, A)

V(U, D) = somewhere in between…
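To see how these estimates compare with actual counts, here is a small Python sketch over the five-row R1 from the example. The helper distinct_counts and the variable names are assumptions made for illustration; the data is the table from the slide.

```python
# The example relation R1(A, B, C, D) from the slide.
R1 = [
    ("cat", 1, 10, 10),
    ("cat", 1, 20, 20),
    ("dog", 1, 30, 10),
    ("dog", 1, 40, 30),
    ("bat", 1, 50, 10),
]
COLUMNS = ("A", "B", "C", "D")


def distinct_counts(rows):
    """Return {attribute: number of distinct values} for a list of tuples."""
    return {name: len({row[i] for row in rows}) for i, name in enumerate(COLUMNS)}


v_r1 = distinct_counts(R1)            # V(R1, *) as given on the slide
estimated_t_u = len(R1) / v_r1["A"]   # T(U) = T(R1) / V(R1, A)
print("V(R1, *) =", v_r1)
print("estimated T(U) =", estimated_t_u)

# Actual statistics of U = select A = a from R1, for each possible a.
for a in ("cat", "dog", "bat"):
    u = [row for row in R1 if row[0] == a]
    print(a, "T(U) =", len(u), "V(U, *) =", distinct_counts(u))
```

For a = cat the actual counts are V(U, A) = 1, V(U, B) = 1, V(U, C) = 2, and V(U, D) = 2, so D does land between 1 and T(U) as the slide suggests.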


Possible Guess in U = σA≥a(R)

V(U, A) = V(R, A) / 2

V(U, B) = V(R, B)


For Joins: U = R1(A,B) ⨝ R2(A,C)

We’ll use the following estimates:

V(U, A) = min(V(R1, A), V(R2, A))

V(U, B) = V(R1, B)

V(U, C) = V(R2, C)

Called “preservation of value sets”


Example:

Z = R1(A,B) ⨝ R2(B,C) ⨝ R3(C,D)

R1: T(R1) = 1000    V(R1,A) = 50     V(R1,B) = 100

R2: T(R2) = 2000    V(R2,B) = 200    V(R2,C) = 300

R3: T(R3) = 3000    V(R3,C) = 90     V(R3,D) = 500


Partial Result: U = R1 ⨝ R2

T(U) = 1000 × 2000 / 200

V(U,A) = 50

V(U,B) = 100

V(U,C) = 300


End Result: Z = U ⨝ R3

T(Z) = 1000 × 2000 × 3000 / (200 × 300)

V(Z,A) = 50

V(Z,B) = 100

V(Z,C) = 90

V(Z,D) = 500
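The three-relation example can be replayed in Python by carrying T and V through each join. This is a minimal sketch: the dict layout and the function name join_stats are assumptions for illustration, and it handles only joins with exactly one common attribute, as in this example.

```python
def join_stats(left, right):
    """Estimate T and V for a natural join with one common attribute.

    Each argument is {"T": tuple count, "V": {attribute: distinct count}}.
    """
    common = set(left["V"]) & set(right["V"])
    if len(common) != 1:
        raise ValueError("this sketch handles exactly one common attribute")
    (a,) = common
    # T(W) = T(R1) x T(R2) / max(V(R1, A), V(R2, A))
    t = left["T"] * right["T"] / max(left["V"][a], right["V"][a])
    # Join attribute: take the min; other attributes keep their V.
    v = {a: min(left["V"][a], right["V"][a])}
    for side in (left, right):
        for attr, count in side["V"].items():
            v.setdefault(attr, count)
    return {"T": t, "V": v}


# Statistics from the example.
r1 = {"T": 1000, "V": {"A": 50, "B": 100}}
r2 = {"T": 2000, "V": {"B": 200, "C": 300}}
r3 = {"T": 3000, "V": {"C": 90, "D": 500}}

u = join_stats(r1, r2)
z = join_stats(u, r3)
print("T(U) =", u["T"], "V(U, *) =", u["V"])
print("T(Z) =", z["T"], "V(Z, *) =", z["V"])
```

Running it gives T(U) = 10000.0 and T(Z) = 100000.0, with the same V values as listed in the example.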


Another Statistic: Histograms

[Figure: histogram of the number of tuples in R with A value in a given range; labels 10, 20, 30, 40 on the A axis and counts 5, 10, 15, 12]

σA=a(R) = ?

σA≥a(R) = ?

Requires some care to set bucket boundaries
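One common way to answer the two questions above is to assume values are spread evenly inside each bucket. The sketch below does that for an equi-width histogram on an integer attribute. The bucket boundaries, the counts, and the function names estimate_eq and estimate_ge are all hypothetical, and the within-bucket uniformity assumption is not stated on the slide.

```python
# Hypothetical equi-width histogram on integer attribute A:
# each entry is (lo, hi, count), covering lo <= A < hi.
BUCKETS = [(0, 10, 5), (10, 20, 10), (20, 30, 15), (30, 40, 12)]


def estimate_eq(a):
    """Estimate the size of select A = a, spreading a bucket's count
    evenly over the integer values it covers (an assumption)."""
    for lo, hi, count in BUCKETS:
        if lo <= a < hi:
            return count / (hi - lo)
    return 0.0


def estimate_ge(a):
    """Estimate the size of select A >= a: whole buckets above a,
    plus the matching fraction of the bucket that contains a."""
    total = 0.0
    for lo, hi, count in BUCKETS:
        if a <= lo:
            total += count
        elif a < hi:
            total += count * (hi - a) / (hi - lo)
    return total


print(estimate_eq(25))  # 1.5
print(estimate_ge(25))  # 19.5
```

The estimate is only as good as the buckets: a value that falls in a wide or skewed bucket gets a coarse answer, which is why bucket boundaries need care.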


Outline

What can we optimize?

Rule-based optimization

Data statistics

Cost models

Cost-based plan selection
