Query Planing & Optimization - Imperial College Londonpjm/CO572/lectures/Query... · Query Planing...

Post on 22-May-2020

17 views 0 download

Transcript of Query Planing & Optimization - Imperial College Londonpjm/CO572/lectures/Query... · Query Planing...

Query Planing & Optimization

Holger Pirk

1 / 38

Query Optimization

2 / 38

Motivation

Query OptimizationStart with a correct planCreate a better plan

I maintain correctnessI a better plan is often much more complicated

A step backSecondary goal: Performance

I some kind of numeric valuePrimary goal: Correctness

I a boolean

3 / 38

Plan Correctness

Expectation ManagementCorrectness is hard to prove when semantics are fuzzyQuery optimizers settle for equivalenceThis puts the burden of correctness on the initial compiler

Visualisation

SQL

Plan

Compiler

Optimizer

4 / 38

Plan equivalenceRelational Algebra

is closed, i.e., every operator takes relations as input and producesrelationsOperators are easily composableSyntactically correct plans: easySemantically correct plans: harderSemantically equivalent plans: very tricky

Semantic equivalencePlans are (semantically) equivalent if they (provably) produce thesame output on any databasePlan equivalence is quite hard to prove

IdeaDivide and Conquer

5 / 38

TransformationsRationale

An equivalent transformation of a subplan is an equivalenttransformation of the entire planFor example: adjacent selections can be reordered

HenceCreate localized transformation rules in the form Pattern =>Rewrite

I For example Select(Select(input, condition1), condition2)=> Select(Select(input, condition2), condition1)

ApplicationTraverse the plan tree from the root on (in any order)For every traversed node, see if the pattern matchesIf so, replace it with the rewrite and start again from the rootIf the pattern never matched, you are done

6 / 38

The two dimensions of query optimization

Algorithm-Agnostic Algorithm-AwareData-AgnosticData-Aware

7 / 38

The two dimensions of query optimization

Logical PhysicalRule-BasedCost-Based

8 / 38

Logical Optimization

9 / 38

Rule-Based Logical Query Optimization

10 / 38

Rule-Based Query Optimization

What could possibly be done. . .without knowledge of the data andwithout knowledge of algorithms?

Well,apply heuristics: things that are (almost) universally good,understand the principles,build something very portable/robust (on relational algebra)

11 / 38

Operator Pushdown

Selectionscan be pushed through joins if they only refer to attributes from oneside of the joinThis is often a good optimization (dependes on join costs &selectivity)

12 / 38

Operator Reordering: Recall

HenceCreate localized transformation rules in the form Pattern =>Rewrite

I For example Select(Select(input, condition1), condition2)=> Select(Select(input, condition2), condition1)

ApplicationTraverse the plan tree from the root on (in any order)For every traversed node, see if the pattern matchesIf so, replace it with the rewrite and start again from the rootIf the pattern never matched, you are done

The problemHow do you decide when to apply the ruleHow do you avoid infinite loops through contradicting rules

13 / 38

Illustration: Pushing downselections

σpriority=“urgent”

Order Customer

σstatus<>“fulfilled”

⨝order.customer_id = customer.id

Γnation, min(order.date)

14 / 38

Operator Reordering

Many operators are associativeTwo-way JoinsSelectionsUnionsDifferences

This allows us to swap their orderIn the hope of reducing costs

Heavily heuristics drivenrelative selectivities can be estimated from the shapes of predicates

I s(==) < s(<) < s(<>)

15 / 38

Rule-Based Query Optimization

The solution: guardsPlace guard conditions on the rules

I An easy exampleI Select(Select(input, condition1), condition2) if

condition1.cmp = ’>’ and condition2.cmp = ’==’=> Select(Select(input, condition2), condition1)

ContextRule-based Query Optimization is the standard in "simple" DBMSs

I MonetDB, Spark

16 / 38

Transformations

Before

⨝order.customer_id = customer.id

σpriority=“urgent” Customer

σstatus<>“fulfilled”

Order

Γnation, min(order.date)

After

⨝order.customer_id = customer.id

σstatus<>“fulfilled” Customer

σpriority=“urgent”

Order

Γnation, min(order.date)

17 / 38

Rule-Based Query Optimization in Action

Before Optimization

⨝order.customer_id = customer.id

σstatus<>“fulfilled” Customer

σpriority=“urgent”

Order

Γnation, min(order.date)

Customerid nation1 UK2 USA3 China4 Uk

Orderstatus priority c_id datex x 1 17x x 2 12x x 1 5x x 3 93x x 3 21x x 3 42x x 1 31x x 2 8x x 3 74x x 2 44x x 1 94x x 2 88

. . . and four million more18 / 38

Solution

Γcustomer_id, min(nation), min(order.date)

σstatus<>“fulfilled” Customer

σpriority=“urgent”

Order

Γnation, min(order.date)

⨝order.customer_id = customer.id

⨝order.customer_id = customer.id

Γcustomer_id, min(order.date) Customer

σpriority=“urgent”

Order

Γnation, min(order.date)

σstatus<>“fulfilled”

19 / 38

Problems with rule-based optimization

Very brittle: a missing rule can have devastating consequencesContradicting rules lead to infinite cyclesOften wrong

I Does not take data into account

20 / 38

Cost-based logical query optimization

21 / 38

Cost Metric

What constitutes a "better" planNot an easy question to answerWe define some numeric cost metric

For nowSum of the number of tuples produced by each operator

22 / 38

Cost-Based logical query optimization

IdeaCost are data dependent. . .. . . but we don’t know the data (before running the query). . .. . . so, let’s estimate it!

A simple approachLet’s Estimate the number of tuples produced by an equality select

I (remember, joins are cross products with selects on top)We are selecting one value out of all values in the database

I Assuming uniform distribution: 1distinct values in column

Let’s keep the number of distinct values as a "statistic"Selectivity of priority=="urgent" predicate: 50%In practice: only very few orders are urgent, say 2%

23 / 38

Statistics

Histogramskeep a tuple count for every unique value for every columnEquality predicate selectivity is simply occurences of a value

total tuple count

General predicate estimates basically evaluate the query on thehistogram first

Status Histogram

pendingshippingshipped fulfilled0

20

40

60

80

Priority Histogram

urgent normal0

20

40

60

80

100

24 / 38

Complicating Factors

25 / 38

Attribute Correllation

The questionWhat is the selectivity of the second selection (status=="pending" )Assume we have a histogramwell, it is 2%(assuming attribute independence)

Now, assume the followingThe median time to fulfill an order is a weekSome orders take more than two weeksSomeone gets upset about this and tries to fix itThe person sets all order that are pending for more than a week, topriority urgentThat means, that 50% of the pending orders are now urgent

26 / 38

Dealing with correllation

Multidimensional histograms assuming independence

Pending Shipping Shipped FulfilledNormal 160 20 10 10Urgent 16 2 1 1

27 / 38

Dealing with correllation

Multidimensional histograms

Pending Shipping Shipped FulfilledNormal 80 20 10 10Urgent 96 2 1 1

28 / 38

Physical Optimization

29 / 38

It all comes down to the metric

Examples for physical Cost MetricsNumber of volcano function callsSum of all produced Tuples (intermediate and final)Number if Page Faults (I/O)CPU costsmax(I/O, CPU)Total Intermediate Size

30 / 38

Physical plan optimization

Counting tuples is hard, counting costs is harderPhysical plans are more complicated: they don’t only contain therelational operator but the algorithmDifferent algorithms have different costs

I In terms of intermediate sizesI In terms of CPU (i.e., function calls)

For example: Nested Loop JoinsI require less space (no need for overallocation)I and don’t require hash calculationI But induce more comparisons

31 / 38

Rule-based physical optimization

32 / 38

Rule-based physical optimization

State of the artis rule based

I Remember the rules for join algorithms? Yeah, that!Very focused on hardware:

I ParallelismI Cache-conscious partitioning

Cost based optimization of physical plans is a research topic(incidentally, one of mine)When it comes to indices: use them!

33 / 38

Cost-based physical optimization

34 / 38

Access Path Selection

Data can be read from multiple sourcesThe base tableA column-store indexA tree indexBitmaps

Indices usually don’t contain all the necessary dataThey are mainly used for tuple selection

I not attribute projection

They may need to be combined with base table data

35 / 38

Access Path Selection

ExampleA customer has six attributes: id, name, address, nation,phone, accountNumber

Suppose you have a column-index on nation

The query is select * from customer where nation = "UK"

36 / 38

Worksheet: By-Reference Bulk Processing of DSM Data

Consider the following queryALTER TABLE ordersADD FOREIGN KEY (customer_id) REFERENCES customer(id);

select phone, count(*) from orders, customer where order.date > 2018-01-01and customer.nation = UK and customer_id = customer.id group by phone

Attributes (assume all integers)I Order: date, status, priority, discount, customer_idI Customer: id, name, address, nation, phone, accountNumber

TaskFind the best plan in terms of Page Faults/Cache Misses

I Selectivities:F order.date > 2018-01-01: 10%F customer.nation = UK: 90%

I DSM, spanned pages, 5M Orders, 1M customersI 512 KB Cache, 64 Byte Pages

37 / 38

The End

38 / 38