Query Planing & Optimization - Imperial College Londonpjm/CO572/lectures/Query... · Query Planing...

of 38 /38
Query Planing & Optimization Holger Pirk 1 / 38

Embed Size (px)

Transcript of Query Planing & Optimization - Imperial College Londonpjm/CO572/lectures/Query... · Query Planing...

  • Query Planing & Optimization

    Holger Pirk

    1 / 38

  • Query Optimization

    2 / 38

  • Motivation

    Query OptimizationStart with a correct planCreate a better plan

    I maintain correctnessI a better plan is often much more complicated

    A step backSecondary goal: Performance

    I some kind of numeric valuePrimary goal: Correctness

    I a boolean

    3 / 38

  • Plan Correctness

    Expectation ManagementCorrectness is hard to prove when semantics are fuzzyQuery optimizers settle for equivalenceThis puts the burden of correctness on the initial compiler

    Visualisation

    SQL

    Plan

    Compiler

    Optimizer

    4 / 38

  • Plan equivalenceRelational Algebra

    is closed, i.e., every operator takes relations as input and producesrelationsOperators are easily composableSyntactically correct plans: easySemantically correct plans: harderSemantically equivalent plans: very tricky

    Semantic equivalencePlans are (semantically) equivalent if they (provably) produce thesame output on any databasePlan equivalence is quite hard to prove

    IdeaDivide and Conquer

    5 / 38

  • TransformationsRationale

    An equivalent transformation of a subplan is an equivalenttransformation of the entire planFor example: adjacent selections can be reordered

    HenceCreate localized transformation rules in the form Pattern =>Rewrite

    I For example Select(Select(input, condition1), condition2)=> Select(Select(input, condition2), condition1)

    ApplicationTraverse the plan tree from the root on (in any order)For every traversed node, see if the pattern matchesIf so, replace it with the rewrite and start again from the rootIf the pattern never matched, you are done

    6 / 38

  • The two dimensions of query optimization

    Algorithm-Agnostic Algorithm-AwareData-AgnosticData-Aware

    7 / 38

  • The two dimensions of query optimization

    Logical PhysicalRule-BasedCost-Based

    8 / 38

  • Logical Optimization

    9 / 38

  • Rule-Based Logical Query Optimization

    10 / 38

  • Rule-Based Query Optimization

    What could possibly be done. . .without knowledge of the data andwithout knowledge of algorithms?

    Well,apply heuristics: things that are (almost) universally good,understand the principles,build something very portable/robust (on relational algebra)

    11 / 38

  • Operator Pushdown

    Selectionscan be pushed through joins if they only refer to attributes from oneside of the joinThis is often a good optimization (dependes on join costs &selectivity)

    12 / 38

  • Operator Reordering: Recall

    HenceCreate localized transformation rules in the form Pattern =>Rewrite

    I For example Select(Select(input, condition1), condition2)=> Select(Select(input, condition2), condition1)

    ApplicationTraverse the plan tree from the root on (in any order)For every traversed node, see if the pattern matchesIf so, replace it with the rewrite and start again from the rootIf the pattern never matched, you are done

    The problemHow do you decide when to apply the ruleHow do you avoid infinite loops through contradicting rules

    13 / 38

  • Illustration: Pushing downselections

    σpriority=“urgent”

    Order Customer

    σstatus“fulfilled”

    ⨝order.customer_id = customer.id

    Γnation, min(order.date)

    14 / 38

  • Operator Reordering

    Many operators are associativeTwo-way JoinsSelectionsUnionsDifferences

    This allows us to swap their orderIn the hope of reducing costs

    Heavily heuristics drivenrelative selectivities can be estimated from the shapes of predicates

    I s(==) < s(

  • Rule-Based Query Optimization

    The solution: guardsPlace guard conditions on the rules

    I An easy exampleI Select(Select(input, condition1), condition2) if

    condition1.cmp = ’>’ and condition2.cmp = ’==’=> Select(Select(input, condition2), condition1)

    ContextRule-based Query Optimization is the standard in "simple" DBMSs

    I MonetDB, Spark

    16 / 38

  • Transformations

    Before

    ⨝order.customer_id = customer.id

    σpriority=“urgent” Customer

    σstatus“fulfilled”

    Order

    Γnation, min(order.date)

    After

    ⨝order.customer_id = customer.id

    σstatus“fulfilled” Customer

    σpriority=“urgent”

    Order

    Γnation, min(order.date)

    17 / 38

  • Rule-Based Query Optimization in Action

    Before Optimization

    ⨝order.customer_id = customer.id

    σstatus“fulfilled” Customer

    σpriority=“urgent”

    Order

    Γnation, min(order.date)

    Customerid nation1 UK2 USA3 China4 Uk

    Orderstatus priority c_id datex x 1 17x x 2 12x x 1 5x x 3 93x x 3 21x x 3 42x x 1 31x x 2 8x x 3 74x x 2 44x x 1 94x x 2 88

    . . . and four million more18 / 38

  • Solution

    Γcustomer_id, min(nation), min(order.date)

    σstatus“fulfilled” Customer

    σpriority=“urgent”

    Order

    Γnation, min(order.date)

    ⨝order.customer_id = customer.id

    ⨝order.customer_id = customer.id

    Γcustomer_id, min(order.date) Customer

    σpriority=“urgent”

    Order

    Γnation, min(order.date)

    σstatus“fulfilled”

    19 / 38

  • Problems with rule-based optimization

    Very brittle: a missing rule can have devastating consequencesContradicting rules lead to infinite cyclesOften wrong

    I Does not take data into account

    20 / 38

  • Cost-based logical query optimization

    21 / 38

  • Cost Metric

    What constitutes a "better" planNot an easy question to answerWe define some numeric cost metric

    For nowSum of the number of tuples produced by each operator

    22 / 38

  • Cost-Based logical query optimization

    IdeaCost are data dependent. . .. . . but we don’t know the data (before running the query). . .. . . so, let’s estimate it!

    A simple approachLet’s Estimate the number of tuples produced by an equality select

    I (remember, joins are cross products with selects on top)We are selecting one value out of all values in the database

    I Assuming uniform distribution: 1distinct values in columnLet’s keep the number of distinct values as a "statistic"Selectivity of priority=="urgent" predicate: 50%In practice: only very few orders are urgent, say 2%

    23 / 38

  • Statistics

    Histogramskeep a tuple count for every unique value for every columnEquality predicate selectivity is simply occurences of a valuetotal tuple countGeneral predicate estimates basically evaluate the query on thehistogram first

    Status Histogram

    pendingshippingshipped fulfilled0

    20

    40

    60

    80

    Priority Histogram

    urgent normal0

    20

    40

    60

    80

    100

    24 / 38

  • Complicating Factors

    25 / 38

  • Attribute Correllation

    The questionWhat is the selectivity of the second selection (status=="pending" )Assume we have a histogramwell, it is 2%(assuming attribute independence)

    Now, assume the followingThe median time to fulfill an order is a weekSome orders take more than two weeksSomeone gets upset about this and tries to fix itThe person sets all order that are pending for more than a week, topriority urgentThat means, that 50% of the pending orders are now urgent

    26 / 38

  • Dealing with correllation

    Multidimensional histograms assuming independence

    Pending Shipping Shipped FulfilledNormal 160 20 10 10Urgent 16 2 1 1

    27 / 38

  • Dealing with correllation

    Multidimensional histograms

    Pending Shipping Shipped FulfilledNormal 80 20 10 10Urgent 96 2 1 1

    28 / 38

  • Physical Optimization

    29 / 38

  • It all comes down to the metric

    Examples for physical Cost MetricsNumber of volcano function callsSum of all produced Tuples (intermediate and final)Number if Page Faults (I/O)CPU costsmax(I/O, CPU)Total Intermediate Size

    30 / 38

  • Physical plan optimization

    Counting tuples is hard, counting costs is harderPhysical plans are more complicated: they don’t only contain therelational operator but the algorithmDifferent algorithms have different costs

    I In terms of intermediate sizesI In terms of CPU (i.e., function calls)

    For example: Nested Loop JoinsI require less space (no need for overallocation)I and don’t require hash calculationI But induce more comparisons

    31 / 38

  • Rule-based physical optimization

    32 / 38

  • Rule-based physical optimization

    State of the artis rule based

    I Remember the rules for join algorithms? Yeah, that!Very focused on hardware:

    I ParallelismI Cache-conscious partitioning

    Cost based optimization of physical plans is a research topic(incidentally, one of mine)When it comes to indices: use them!

    33 / 38

  • Cost-based physical optimization

    34 / 38

  • Access Path Selection

    Data can be read from multiple sourcesThe base tableA column-store indexA tree indexBitmaps

    Indices usually don’t contain all the necessary dataThey are mainly used for tuple selection

    I not attribute projection

    They may need to be combined with base table data

    35 / 38

  • Access Path Selection

    ExampleA customer has six attributes: id, name, address, nation,phone, accountNumber

    Suppose you have a column-index on nationThe query is select * from customer where nation = "UK"

    36 / 38

  • Worksheet: By-Reference Bulk Processing of DSM Data

    Consider the following queryALTER TABLE ordersADD FOREIGN KEY (customer_id) REFERENCES customer(id);

    select phone, count(*) from orders, customer where order.date > 2018-01-01and customer.nation = UK and customer_id = customer.id group by phone

    Attributes (assume all integers)I Order: date, status, priority, discount, customer_idI Customer: id, name, address, nation, phone, accountNumber

    TaskFind the best plan in terms of Page Faults/Cache Misses

    I Selectivities:F order.date > 2018-01-01: 10%F customer.nation = UK: 90%

    I DSM, spanned pages, 5M Orders, 1M customersI 512 KB Cache, 64 Byte Pages

    37 / 38

  • The End

    38 / 38