Query Planing & Optimization - Imperial College Londonpjm/CO572/lectures/Query... · Query Planing...
Embed Size (px)
Transcript of Query Planing & Optimization - Imperial College Londonpjm/CO572/lectures/Query... · Query Planing...
-
Query Planing & Optimization
Holger Pirk
1 / 38
-
Query Optimization
2 / 38
-
Motivation
Query OptimizationStart with a correct planCreate a better plan
I maintain correctnessI a better plan is often much more complicated
A step backSecondary goal: Performance
I some kind of numeric valuePrimary goal: Correctness
I a boolean
3 / 38
-
Plan Correctness
Expectation ManagementCorrectness is hard to prove when semantics are fuzzyQuery optimizers settle for equivalenceThis puts the burden of correctness on the initial compiler
Visualisation
SQL
Plan
Compiler
Optimizer
4 / 38
-
Plan equivalenceRelational Algebra
is closed, i.e., every operator takes relations as input and producesrelationsOperators are easily composableSyntactically correct plans: easySemantically correct plans: harderSemantically equivalent plans: very tricky
Semantic equivalencePlans are (semantically) equivalent if they (provably) produce thesame output on any databasePlan equivalence is quite hard to prove
IdeaDivide and Conquer
5 / 38
-
TransformationsRationale
An equivalent transformation of a subplan is an equivalenttransformation of the entire planFor example: adjacent selections can be reordered
HenceCreate localized transformation rules in the form Pattern =>Rewrite
I For example Select(Select(input, condition1), condition2)=> Select(Select(input, condition2), condition1)
ApplicationTraverse the plan tree from the root on (in any order)For every traversed node, see if the pattern matchesIf so, replace it with the rewrite and start again from the rootIf the pattern never matched, you are done
6 / 38
-
The two dimensions of query optimization
Algorithm-Agnostic Algorithm-AwareData-AgnosticData-Aware
7 / 38
-
The two dimensions of query optimization
Logical PhysicalRule-BasedCost-Based
8 / 38
-
Logical Optimization
9 / 38
-
Rule-Based Logical Query Optimization
10 / 38
-
Rule-Based Query Optimization
What could possibly be done. . .without knowledge of the data andwithout knowledge of algorithms?
Well,apply heuristics: things that are (almost) universally good,understand the principles,build something very portable/robust (on relational algebra)
11 / 38
-
Operator Pushdown
Selectionscan be pushed through joins if they only refer to attributes from oneside of the joinThis is often a good optimization (dependes on join costs &selectivity)
12 / 38
-
Operator Reordering: Recall
HenceCreate localized transformation rules in the form Pattern =>Rewrite
I For example Select(Select(input, condition1), condition2)=> Select(Select(input, condition2), condition1)
ApplicationTraverse the plan tree from the root on (in any order)For every traversed node, see if the pattern matchesIf so, replace it with the rewrite and start again from the rootIf the pattern never matched, you are done
The problemHow do you decide when to apply the ruleHow do you avoid infinite loops through contradicting rules
13 / 38
-
Illustration: Pushing downselections
σpriority=“urgent”
Order Customer
σstatus“fulfilled”
⨝order.customer_id = customer.id
Γnation, min(order.date)
14 / 38
-
Operator Reordering
Many operators are associativeTwo-way JoinsSelectionsUnionsDifferences
This allows us to swap their orderIn the hope of reducing costs
Heavily heuristics drivenrelative selectivities can be estimated from the shapes of predicates
I s(==) < s(
-
Rule-Based Query Optimization
The solution: guardsPlace guard conditions on the rules
I An easy exampleI Select(Select(input, condition1), condition2) if
condition1.cmp = ’>’ and condition2.cmp = ’==’=> Select(Select(input, condition2), condition1)
ContextRule-based Query Optimization is the standard in "simple" DBMSs
I MonetDB, Spark
16 / 38
-
Transformations
Before
⨝order.customer_id = customer.id
σpriority=“urgent” Customer
σstatus“fulfilled”
Order
Γnation, min(order.date)
After
⨝order.customer_id = customer.id
σstatus“fulfilled” Customer
σpriority=“urgent”
Order
Γnation, min(order.date)
17 / 38
-
Rule-Based Query Optimization in Action
Before Optimization
⨝order.customer_id = customer.id
σstatus“fulfilled” Customer
σpriority=“urgent”
Order
Γnation, min(order.date)
Customerid nation1 UK2 USA3 China4 Uk
Orderstatus priority c_id datex x 1 17x x 2 12x x 1 5x x 3 93x x 3 21x x 3 42x x 1 31x x 2 8x x 3 74x x 2 44x x 1 94x x 2 88
. . . and four million more18 / 38
-
Solution
Γcustomer_id, min(nation), min(order.date)
σstatus“fulfilled” Customer
σpriority=“urgent”
Order
Γnation, min(order.date)
⨝order.customer_id = customer.id
⨝order.customer_id = customer.id
Γcustomer_id, min(order.date) Customer
σpriority=“urgent”
Order
Γnation, min(order.date)
σstatus“fulfilled”
19 / 38
-
Problems with rule-based optimization
Very brittle: a missing rule can have devastating consequencesContradicting rules lead to infinite cyclesOften wrong
I Does not take data into account
20 / 38
-
Cost-based logical query optimization
21 / 38
-
Cost Metric
What constitutes a "better" planNot an easy question to answerWe define some numeric cost metric
For nowSum of the number of tuples produced by each operator
22 / 38
-
Cost-Based logical query optimization
IdeaCost are data dependent. . .. . . but we don’t know the data (before running the query). . .. . . so, let’s estimate it!
A simple approachLet’s Estimate the number of tuples produced by an equality select
I (remember, joins are cross products with selects on top)We are selecting one value out of all values in the database
I Assuming uniform distribution: 1distinct values in columnLet’s keep the number of distinct values as a "statistic"Selectivity of priority=="urgent" predicate: 50%In practice: only very few orders are urgent, say 2%
23 / 38
-
Statistics
Histogramskeep a tuple count for every unique value for every columnEquality predicate selectivity is simply occurences of a valuetotal tuple countGeneral predicate estimates basically evaluate the query on thehistogram first
Status Histogram
pendingshippingshipped fulfilled0
20
40
60
80
Priority Histogram
urgent normal0
20
40
60
80
100
24 / 38
-
Complicating Factors
25 / 38
-
Attribute Correllation
The questionWhat is the selectivity of the second selection (status=="pending" )Assume we have a histogramwell, it is 2%(assuming attribute independence)
Now, assume the followingThe median time to fulfill an order is a weekSome orders take more than two weeksSomeone gets upset about this and tries to fix itThe person sets all order that are pending for more than a week, topriority urgentThat means, that 50% of the pending orders are now urgent
26 / 38
-
Dealing with correllation
Multidimensional histograms assuming independence
Pending Shipping Shipped FulfilledNormal 160 20 10 10Urgent 16 2 1 1
27 / 38
-
Dealing with correllation
Multidimensional histograms
Pending Shipping Shipped FulfilledNormal 80 20 10 10Urgent 96 2 1 1
28 / 38
-
Physical Optimization
29 / 38
-
It all comes down to the metric
Examples for physical Cost MetricsNumber of volcano function callsSum of all produced Tuples (intermediate and final)Number if Page Faults (I/O)CPU costsmax(I/O, CPU)Total Intermediate Size
30 / 38
-
Physical plan optimization
Counting tuples is hard, counting costs is harderPhysical plans are more complicated: they don’t only contain therelational operator but the algorithmDifferent algorithms have different costs
I In terms of intermediate sizesI In terms of CPU (i.e., function calls)
For example: Nested Loop JoinsI require less space (no need for overallocation)I and don’t require hash calculationI But induce more comparisons
31 / 38
-
Rule-based physical optimization
32 / 38
-
Rule-based physical optimization
State of the artis rule based
I Remember the rules for join algorithms? Yeah, that!Very focused on hardware:
I ParallelismI Cache-conscious partitioning
Cost based optimization of physical plans is a research topic(incidentally, one of mine)When it comes to indices: use them!
33 / 38
-
Cost-based physical optimization
34 / 38
-
Access Path Selection
Data can be read from multiple sourcesThe base tableA column-store indexA tree indexBitmaps
Indices usually don’t contain all the necessary dataThey are mainly used for tuple selection
I not attribute projection
They may need to be combined with base table data
35 / 38
-
Access Path Selection
ExampleA customer has six attributes: id, name, address, nation,phone, accountNumber
Suppose you have a column-index on nationThe query is select * from customer where nation = "UK"
36 / 38
-
Worksheet: By-Reference Bulk Processing of DSM Data
Consider the following queryALTER TABLE ordersADD FOREIGN KEY (customer_id) REFERENCES customer(id);
select phone, count(*) from orders, customer where order.date > 2018-01-01and customer.nation = UK and customer_id = customer.id group by phone
Attributes (assume all integers)I Order: date, status, priority, discount, customer_idI Customer: id, name, address, nation, phone, accountNumber
TaskFind the best plan in terms of Page Faults/Cache Misses
I Selectivities:F order.date > 2018-01-01: 10%F customer.nation = UK: 90%
I DSM, spanned pages, 5M Orders, 1M customersI 512 KB Cache, 64 Byte Pages
37 / 38
-
The End
38 / 38