RIOS: Runtime Integrated Query Optimizer for Spark

Youfu Li (UCLA), Mingda Li (UCLA), Ling Ding (UCLA) and Matteo Interlandi (Microsoft)

Cloud Computing Programs

Open Source Data-Intensive Scalable Computing (DISC) Platforms: Hadoop MapReduce and Spark
◦ functional API
◦ map and reduce User-Defined Functions
◦ RDD transformations (filter, flatMap, zipPartitions, etc.)

Several years later, introduction of high-level SQL-like declarative query languages (and systems)
◦ Conciseness
◦ Pick a physical execution plan from a number of alternatives

Query Optimization

Two-step process
◦ Logical optimizations (e.g., filter pushdown)
◦ Physical optimizations (e.g., join orders and implementation)

Physical optimizer in RDBMS:
◦ Cost-based
◦ Data statistics (e.g., predicate selectivities, cost of data access, etc.)

The role of the cost-based optimizer is to
(1) enumerate some set of equivalent plans
(2) estimate the cost of each
(3) select a sufficiently good plan
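The three steps above can be sketched in a few lines of Python. This is a toy illustration only, not the optimizer of any real system: the table names, row counts, selectivities, and the cost model (sum of intermediate result sizes) are all assumptions made up for the example.

```python
from itertools import permutations

# Hypothetical table cardinalities (rows); illustrative numbers only.
ROWS = {"A": 1_000, "B": 50_000, "C": 200}
# Hypothetical join-predicate selectivities, keyed by an unordered pair of tables.
SELECTIVITY = {
    frozenset("AB"): 0.001,
    frozenset("BC"): 0.01,
    frozenset("AC"): 0.005,
}


def plan_cost(order):
    """Toy cost model: sum of the estimated sizes of all intermediate results."""
    joined = [order[0]]
    size = ROWS[order[0]]
    cost = 0.0
    for table in order[1:]:
        # Apply every predicate linking the new table to the tables already joined.
        sel = 1.0
        for other in joined:
            sel *= SELECTIVITY.get(frozenset((other, table)), 1.0)
        size = size * ROWS[table] * sel
        cost += size
        joined.append(table)
    return cost


def choose_plan(tables):
    # (1) enumerate a set of equivalent left-deep join orders
    candidates = list(permutations(tables))
    # (2) estimate the cost of each and (3) select the cheapest one
    return min(candidates, key=plan_cost)


best = choose_plan(["A", "B", "C"])
print(best, plan_cost(best))
```

With these made-up numbers, orders that join the two small tables first are far cheaper than orders that start with the large table B, which is the kind of difference a cost-based optimizer is meant to detect.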

Query Optimization: Why Important?

[Figure: AsterixDB, Scale Factor = 10; bars labeled W and R; log-scale axis with ticks from 0.25 to 4096]

Bad plans over Big Data can be disastrous!

Challenges for Cost-based Optimizer in DISC

Lack of upfront statistics:
◦ data sits in HDFS and unstructured

Even if input statistics are available:
◦ Correlations between predicates
◦ Exponential error propagation in joins
◦ Arbitrary UDFs
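To make the error propagation point concrete, here is a small arithmetic sketch. It rests on a simplifying assumption that is not from the slides: every join estimate is off by the same multiplicative factor, and the errors multiply along a chain of joins.

```python
def compounded_error(per_join_error, num_joins):
    """Multiplicative error after chaining num_joins estimates.

    Assumes (for illustration) that each estimate is off by the same
    factor and that the errors multiply from one join to the next.
    """
    return per_join_error ** num_joins


# An estimate that is off by 2x per join grows quickly with plan depth.
for joins in (1, 2, 4, 6):
    print(joins, "joins ->", compounded_error(2.0, joins), "x error")
```

Under this assumption the error grows exponentially in the number of joins, so even modest per-join mistakes can mislead the choice of join order in deep plans.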

Cost-based Optimizer in DISC: State of the Art

Pre-existing statistics
◦ Spark CBO [1]
• Collect and store statistics
• No runtime plan revision

Bad statistics
◦ Adaptive Query planning [2]
• Assumption is that some initial statistics exist

No upfront statistics
◦ Pilot runs (samples) [3]
• Samples are expensive
• Only foreign-key joins
• No runtime plan revision

Traditional Query Planning vs. RIOS

[Figure: Traditional: Logical Plan, Statistics, Query Optimizer, Physical Plan, Execution. RIOS: Logical Plan, Planning, Execution, Runtime Stats, Physical plan.]

Runtime Integrated Optimizer for Spark

Key idea: Execute-Gather-Aggregate-Plan strategy (EGAP)
◦ Query plans are lazily executed
◦ Statistics are gathered at runtime
◦ Aggregate statistics after gathering
◦ Joins are greedily planned for execution
◦ Plan can be dynamically changed if a bad decision was made

Neither upfront statistics nor pilot runs are required
◦ Raw dataset size is required for initial guess

Support for non-foreign-key joins
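The EGAP loop can be sketched in plain Python over tiny in-memory "partitions". This is a minimal illustration under stated assumptions, not the RIOS implementation: the function names, the record-count statistic, the smallest-table-first greedy rule, and the sample data are all hypothetical, and no Spark API is used. The dynamic plan revision after a bad decision is not modeled here.

```python
def execute(partitions, predicate):
    """Execute: run the pending filter on every partition."""
    return [[row for row in part if predicate(row)] for part in partitions]


def gather(partitions):
    """Gather: each partition reports its own record count."""
    return [len(part) for part in partitions]


def aggregate(counts):
    """Aggregate: combine per-partition counts into one table-level count."""
    return sum(counts)


def plan(stats, remaining):
    """Plan: greedily pick the smallest remaining table (ties broken by name)."""
    return min(sorted(remaining), key=lambda name: stats[name])


def hash_join(left_rows, right_rows, key):
    """Plain in-memory hash join on a shared key column."""
    index = {}
    for right in right_rows:
        index.setdefault(right[key], []).append(right)
    return [{**left, **right}
            for left in left_rows
            for right in index.get(left[key], [])]


def egap(tables, predicates, key):
    # Execute, Gather, and Aggregate for every base table.
    data, stats = {}, {}
    for name, partitions in tables.items():
        filtered = execute(partitions, predicates.get(name, lambda row: True))
        stats[name] = aggregate(gather(filtered))
        data[name] = [row for part in filtered for row in part]

    # Plan and Execute one join at a time, using the gathered statistics.
    # This toy version only uses base-table statistics; it does not gather
    # statistics on intermediate results.
    remaining = set(tables)
    first = plan(stats, remaining)
    remaining.remove(first)
    order, result = [first], data[first]
    while remaining:
        nxt = plan(stats, remaining)
        remaining.remove(nxt)
        result = hash_join(result, data[nxt], key)
        order.append(nxt)
    return order, result


tables = {
    "A": [[{"k": 1, "a": 10}, {"k": 2, "a": 20}], [{"k": 3, "a": 30}]],
    "B": [[{"k": 1, "b": "x"}, {"k": 2, "b": "y"},
           {"k": 3, "b": "z"}, {"k": 4, "b": "w"}]],
    "C": [[{"k": 3, "c": True}], [{"k": 2, "c": False}]],
}
predicates = {"A": lambda row: row["a"] >= 30}
order, result = egap(tables, predicates, key="k")
print(order, result)
```

In the sample data, A is larger than C before filtering but smaller afterwards, so the join order chosen from runtime statistics differs from what raw table sizes alone would suggest.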

Runtime Optimizer: an Example

Assumption: A < C

[Figure sequence over tables A, B, and C, one slide per step: Execute, Gather, Aggregate (Driver), Plan, Execute, Gather, Plan, Execute]

Runtime Optimizer: Wrong Guess

Assumption: A < C

σ(A) > σ(C)

[Figure sequence over tables A, B, and C, including a Repartition Step]

Runtime Integrated Optimizer for Spark

Spark batch execution model allows late binding of joins

Set of Statistics:
◦ Join estimations (based on sampling or sketches)
◦ Number of records
◦ Average size of each record

Statistics are aggregated using a Spark job or accumulators
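A rough sketch of how the record count and average record size could be combined from per-partition measurements is given below. It is plain Python standing in for a Spark job or accumulators; the `PartitionStats` class, the function names, and the way record size is measured (UTF-8 length of the record's string form) are assumptions for illustration only.

```python
from dataclasses import dataclass


@dataclass
class PartitionStats:
    records: int
    total_bytes: int


def gather_partition(rows):
    """Per-partition statistics; record size is approximated by the
    UTF-8 length of each record's string form (an assumption)."""
    sizes = [len(str(row).encode("utf-8")) for row in rows]
    return PartitionStats(records=len(rows), total_bytes=sum(sizes))


def aggregate_stats(per_partition):
    """Combine per-partition statistics into table-level statistics."""
    records = sum(s.records for s in per_partition)
    total_bytes = sum(s.total_bytes for s in per_partition)
    avg = total_bytes / records if records else 0.0
    return {"records": records,
            "total_bytes": total_bytes,
            "avg_record_bytes": avg}


partitions = [["aa", "bbbb"], ["cc"]]
summary = aggregate_stats([gather_partition(p) for p in partitions])
print(summary)
```

Because counts and byte totals are simple sums, they can be merged in any order, which is what makes them suitable for accumulator-style aggregation.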

Join implementations are picked based on thresholds
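A threshold-based choice can be as simple as the sketch below. The slides do not state the thresholds or the exact rule RIOS uses, so the 10 MB value, the function name, and the two-way broadcast/shuffle split are illustrative assumptions.

```python
# Hypothetical threshold; the slides do not give the actual value.
BROADCAST_THRESHOLD_BYTES = 10 * 1024 * 1024


def pick_join(left_bytes, right_bytes, threshold=BROADCAST_THRESHOLD_BYTES):
    """Return (join kind, side to broadcast) from estimated input sizes."""
    if min(left_bytes, right_bytes) <= threshold:
        # The smaller input fits under the threshold: broadcast it.
        side = "left" if left_bytes <= right_bytes else "right"
        return ("broadcast", side)
    # Both inputs are large: fall back to a shuffle-based join.
    return ("shuffle", None)


print(pick_join(1024, 10**9))
print(pick_join(10**9, 2 * 10**9))
```

The estimated sizes would come from the runtime statistics, for example number of records multiplied by average record size.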

Challenges and Optimizations

Execute - Block and revise execution plans without wasting computation

Aggregate - Efficient accumulation of statistics

Plan - Try to schedule as many broadcast joins as possible

Experiments

Q1: What is the performance of RIOS compared to regular Spark, pilot runs and Spark-CBO?

Q2: How expensive are wrong guesses?

Mini benchmark with 3 Fact Tables

[Figure: x-axis Scale Factor (1, 10, 100, 1000); log-scale y-axis with ticks from 4 to 4096; series: spark good-order, RIOS R2R, RIOS W2R, pilot-run, spark wrong-order]

Q1: RIOS is always faster than Spark and pilot run

Q2: Not much, around 15% in the worst case

TPCDS and TPCH Queries

[Figure: panels Query 17, Query 50, Query 28, Query 9; x-axis Scale Factor (1, 10, 100, 1000); series: spark good-order, CBO-plan, RIOS, pilot-run, spark bad-order]

Q1: RIOS is always the fastest approach

Conclusions

RIOS: cost-based query optimizer for Spark

Statistics are gathered at runtime (no need for initial statistics or pilot runs)

Late binding of joins

Up to 2x faster than the best plan generated by pilot run, and >100x faster than previous approaches for fact table joins.

Experiment Configuration

๏ Datasets:
• TPCDS
• TPCH

๏ Configuration:
• 16 machines, each with 4 cores (2 hyper threads per core), 32GB of RAM, 1TB disk
• Spark 2.2.1
• Scale factor from 1 to 1000 (~1TB)

Reference

1. https://databricks.com/blog/2017/08/31/cost-based-optimizer-in-apache-spark-2-2.html.

2. S. Agarwal, S. Kandula, N. Bruno, M.-C. Wu, I. Stoica, and J. Zhou. Re-optimizing data-parallel computing. In NSDI, 2012.

3. K. Karanasos, A. Balmin, M. Kutsch, F. Ozcan, V. Ercegovac, C. Xia, and J. Jackson. Dynamically optimizing queries over large scale data platforms. In SIGMOD, pages 943–954.

4. S. Chaudhuri, G. Das, and V. Narasayya. Optimized stratified sampling for approximate query processing. TODS, 32(2), Jun 2007.

5. O. Papapetrou, W. Siberski, and W. Nejdl. Cardinality estimation and dynamic length adaptation for bloom filters. Distributed and Parallel Databases, 28(2):119–156, 2010.

Thank you
