RIOS: Runtime Integrated Query Optimizer for Spark

Youfu Li (UCLA), Mingda Li (UCLA), Ling Ding (UCLA) and Matteo Interlandi (Microsoft)


Cloud Computing Programs

Open-source Data-Intensive Scalable Computing (DISC) platforms: Hadoop MapReduce and Spark
◦ Functional API
◦ map and reduce User-Defined Functions
◦ RDD transformations (filter, flatMap, zipPartitions, etc.)

Several years later, introduction of high-level SQL-like declarative query languages (and systems)
◦ Conciseness
◦ Pick a physical execution plan from a number of alternatives

Query Optimization

Two-step process
◦ Logical optimizations (e.g., filter pushdown)
◦ Physical optimizations (e.g., join order and join implementation)

Physical optimizer in RDBMS:
◦ Cost-based
◦ Data statistics (e.g., predicate selectivities, cost of data access, etc.)

The role of the cost-based optimizer is to
(1) enumerate some set of equivalent plans
(2) estimate the cost of each
(3) select a sufficiently good plan
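The three roles listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the table cardinalities, the selectivities, and the cost model (sum of intermediate result sizes over left-deep plans) are hypothetical and do not come from the slides or from any particular optimizer.

```python
from itertools import permutations

# Hypothetical base-table cardinalities (rows); not taken from the slides.
CARD = {"A": 1_000, "B": 50_000, "C": 200_000}
# Hypothetical join selectivities; 1.0 means "no join predicate" (cross product).
SEL = {frozenset("AB"): 0.001, frozenset("BC"): 0.0005, frozenset("AC"): 1.0}


def plan_cost(order):
    """Assumed cost model: sum of intermediate result sizes of a left-deep plan."""
    joined = [order[0]]
    rows = CARD[order[0]]
    cost = 0.0
    for table in order[1:]:
        # Apply the selectivity of every predicate linking `table` to the prefix.
        sel = 1.0
        for prev in joined:
            sel *= SEL[frozenset((prev, table))]
        rows = rows * CARD[table] * sel
        cost += rows
        joined.append(table)
    return cost


def choose_plan(tables):
    # (1) enumerate a set of equivalent plans (here: all left-deep join orders)
    candidates = list(permutations(tables))
    # (2) estimate the cost of each and (3) select the cheapest one
    return min(candidates, key=plan_cost)


best = choose_plan(["A", "B", "C"])
print(best, plan_cost(best))
```

With these made-up numbers the cheapest orders join A and B first and bring in C last; orders that start with the predicate-free pair A, C are far more expensive because they build a cross product.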

Query Optimization: Why Important?

[Figure: bar chart of execution time in seconds (log scale, 0.25 to 16384) at Scale Factor = 10 for Spark, AsterixDB, Hive and Pig, with one bar labelled W and one labelled R per system. Bar labels in extraction order: 431, 17, 343, 21, "unable to finish in 5+ hours", 276, 1510 2 (as extracted), 954.]

Bad plans over Big Data can be disastrous!

Challenges for Cost-based Optimizer in DISC

Lack of upfront statistics:
◦ Data sits in HDFS and is unstructured

Even if input statistics are available:
◦ Correlations between predicates
◦ Exponential error propagation in joins
◦ Arbitrary UDFs
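The "exponential error propagation in joins" point can be made concrete with a little arithmetic. The sketch below assumes a simplified estimator (output size = product of input sizes and join selectivities) and assumes every selectivity is misestimated by the same factor; the table sizes and selectivities are hypothetical.

```python
def estimated_rows(cards, sels):
    """Simplified estimate: product of input sizes and join selectivities."""
    rows = 1.0
    for c in cards:
        rows *= c
    for s in sels:
        rows *= s
    return rows


def relative_error(num_joins, per_join_factor):
    """Ratio estimate/true when each join selectivity is off by the same factor."""
    cards = [1_000.0] * (num_joins + 1)      # hypothetical table sizes
    true_sels = [0.001] * num_joins          # hypothetical true selectivities
    est_sels = [s * per_join_factor for s in true_sels]
    return estimated_rows(cards, est_sels) / estimated_rows(cards, true_sels)


for n in (1, 2, 4, 8):
    # A 2x error per join compounds to roughly 2**n after n joins.
    print(n, round(relative_error(n, 2.0), 3))
```

Under this model a modest per-join error grows geometrically with the number of joins, which is why estimates for deep join trees can be unreliable even when base-table statistics exist.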

Cost-based Optimizer in DISC: State of the Art

Pre-existing statistics
◦ Spark CBO [1]
• Collect and store statistics
• No runtime plan revision

Bad statistics
◦ Adaptive Query planning [2]
• Assumption is that some initial statistics exist

No upfront statistics
◦ Pilot runs (samples) [3]
• Samples are expensive
• Only foreign-key joins
• No runtime plan revision

Traditional Query Planning VS RIOS

[Diagram, traditional: boxes labelled Logical Plan, Statistics, Query Optimizer, Physical Plan and Execution.]

[Diagram, RIOS: boxes labelled Logical Plan, Planning, Execution, Runtime Stats and Physical plan.]

Runtime Integrated Optimizer for Spark

Key idea: Execute-Gather-Aggregate-Plan strategy (EGAP)
◦ Query plans are lazily executed
◦ Statistics are gathered at runtime
◦ Statistics are aggregated after gathering
◦ Joins are greedily planned for execution
◦ The plan can be dynamically changed if a bad decision was made

Neither upfront statistics nor pilot runs are required
◦ Raw dataset size is required for the initial guess

Support for non-foreign-key joins
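The Execute-Gather-Aggregate-Plan loop can be sketched as a toy, in-memory simulation. Everything here is an assumption made for illustration: the function names, the use of record counts as the only statistic, and the greedy rule "run the pending join with the smallest measured inputs". It is not the RIOS implementation, which operates on Spark jobs rather than Python lists.

```python
def gather(partitions):
    """Gather: one record count per partition."""
    return [len(p) for p in partitions]


def aggregate(per_partition_counts):
    """Aggregate: combine the per-partition statistics into one number."""
    return sum(per_partition_counts)


def hash_join(left, right, key):
    """Execute: plain in-memory hash join on a shared column name."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


def egap(datasets, joins):
    """datasets: name -> list of partitions (lists of dict rows).
    joins: (left_name, right_name, key) triples still to be executed."""
    data = {n: [r for part in parts for r in part] for n, parts in datasets.items()}
    stats = {n: aggregate(gather(parts)) for n, parts in datasets.items()}
    order, pending = [], list(joins)
    while pending:
        # Plan: greedily pick the pending join with the smallest measured inputs.
        left, right, key = min(pending, key=lambda j: stats[j[0]] + stats[j[1]])
        pending.remove((left, right, key))
        # Execute the chosen join, then gather statistics on its output.
        name = left + right
        data[name] = hash_join(data[left], data[right], key)
        stats[name] = len(data[name])
        order.append(name)
        # Joins that referenced either input now reference the new result.
        pending = [(name if l in (left, right) else l,
                    name if r in (left, right) else r, k) for l, r, k in pending]
    return order, data[order[-1]]


datasets = {
    "A": [[{"a": 1}, {"a": 2}]],
    "B": [[{"a": 1, "c": 10}, {"a": 2, "c": 20}], [{"a": 1, "c": 30}]],
    "C": [[{"c": 10}, {"c": 20}], [{"c": 30}, {"c": 40}]],
}
order, result = egap(datasets, [("A", "B", "a"), ("B", "C", "c")])
print(order, len(result))
```

Because the next join is chosen only after the statistics of the previous step are known, each planning decision is based on measured sizes rather than on estimates.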

Runtime Optimizer: an Example

[Animated diagram: datasets A, B and C joined under the assumption A < C, with a statistics marker (S) attached to each dataset and a Driver node. The slide titles give the sequence of steps: Execute step, Gather step, Aggregate step, Plan step, Execute step (result labelled AB), Gather step, Plan step, Execute step (result labelled ABC).]

Runtime Optimizer: Wrong Guess

[Animated diagram: the same datasets A, B and C, with selections σ applied to A and C. The assumption is A < C, but σ(A) > σ(C). Statistics (S) are gathered on the inputs, a Repartition Step is shown for B, an intermediate result labelled BC appears with its own statistics, and the final result is labelled ABC.]
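The wrong-guess case can be illustrated with a small sketch: the first decision uses raw dataset sizes only, and the decision is revisited once the filtered sizes are measured. The data, the predicates, and the helper names are hypothetical; the flag at the end only marks when a corrective step such as the repartition shown on the slide would be needed, assuming B had been prepared for the initially guessed join.

```python
def pick_first_join(sizes):
    """Join B with whichever of A or C is currently believed to be smaller."""
    return "AB" if sizes["A"] <= sizes["C"] else "BC"


def apply_filter(rows, predicate):
    return [r for r in rows if predicate(r)]


# Hypothetical raw inputs: A is smaller than C before any selection.
A = list(range(100))
C = list(range(1_000))

raw_sizes = {"A": len(A), "C": len(C)}
initial_guess = pick_first_join(raw_sizes)      # based on raw sizes only

# Selections applied at execution time; the one on C is far more selective.
A_f = apply_filter(A, lambda x: x % 2 == 0)     # keeps 50 rows
C_f = apply_filter(C, lambda x: x < 10)         # keeps 10 rows

runtime_sizes = {"A": len(A_f), "C": len(C_f)}
revised = pick_first_join(runtime_sizes)        # based on measured sizes

needs_repartition = revised != initial_guess
print(initial_guess, revised, needs_repartition)  # AB BC True
```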

Runtime Integrated Optimizer for Spark

Spark batch execution model allows late binding of joins

Set of Statistics:
◦ Join estimations (based on sampling or sketches)
◦ Number of records
◦ Average size of each record

Statistics are aggregated using a Spark job or accumulators

Join implementations are picked based on thresholds
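A minimal sketch of how per-partition statistics might be merged and then used for a threshold-based join choice. The `Stats` class, the 10 MB threshold, and the "broadcast"/"shuffle" labels are assumptions made for illustration; the slides do not state the actual thresholds or the set of join implementations.

```python
from dataclasses import dataclass


@dataclass
class Stats:
    num_records: int
    total_bytes: int

    @property
    def avg_record_size(self):
        return self.total_bytes / self.num_records if self.num_records else 0.0

    def merge(self, other):
        # Associative and commutative, so partial results combine in any order.
        return Stats(self.num_records + other.num_records,
                     self.total_bytes + other.total_bytes)


def aggregate(per_partition):
    """Fold the statistics reported by each partition into one summary."""
    total = Stats(0, 0)
    for s in per_partition:
        total = total.merge(s)
    return total


# Hypothetical threshold; 10 MB is chosen here only for illustration.
BROADCAST_THRESHOLD_BYTES = 10 * 1024 * 1024


def pick_join(left, right, threshold=BROADCAST_THRESHOLD_BYTES):
    """Broadcast the smaller side if it fits under the threshold."""
    smaller = min(left.total_bytes, right.total_bytes)
    return "broadcast" if smaller <= threshold else "shuffle"


small = aggregate([Stats(1_000, 64_000), Stats(500, 32_000)])
big = aggregate([Stats(10_000_000, 2_000_000_000)])
print(small.avg_record_size, pick_join(small, big), pick_join(big, big))
```

Keeping the merge operation associative and commutative is what makes it suitable for accumulator-style aggregation, where partial results arrive in no particular order.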

Challenges and Optimizations

Execute: block and revise execution plans without wasting computation

Aggregate: efficient accumulation of statistics

Plan: try to schedule as many broadcast joins as possible

Experiments

Q1: What is the performance of RIOS compared to regular Spark, pilot runs and Spark CBO?

Q2: How expensive are wrong guesses?

Mini benchmark with 3 Fact Tables

[Figure: bar chart of execution time in seconds (log scale, 4 to 16384) against Scale Factor (1, 10, 100, 1000). Series: spark good-order, RIOS R2R, RIOS W2R, pilot-run, spark wrong-order. Bar labels in extraction order (the assignment to series is not recoverable from the transcript): 11, 25, 92, 1482; 9, 24, 86, 1353; 11, 25, 93, 1521; 74, 1312, 9095, "unable to finish in 5+ hours"; 69, 1242, 8951, "unable to finish in 5+ hours".]

Q1: RIOS is always faster than Spark and pilot run

Q2: Not much, around 15% in the worst case

TPCDS and TPCH Queries

[Figure: four panels (Query 17, Query 50, Query 28, Query 9), each showing execution time in seconds (log scale, 4 to 16384) against Scale Factor (1, 10, 100, 1000). Series: spark good-order, CBO-plan, RIOS, pilot-run, spark bad-order. Bar labels in extraction order (the assignment to series and panels is not recoverable from the transcript): 19, 37, 130, 2468; 15, 25, 114, 1879; 13, 17, 51, 1298; 30, 62, 255, 6717; 17, 19, 69, 2171; 10, 14, 66, 1861; 9, 12, 30, 1109; 16, 29, 187, 6589; 11, 17, 56, 1973; 10, 14, 45, 1437; 9, 11, 28, 1073; 12, 21, 109, 5760; 15, 22, 72, 2114; 13, 18, 60, 1779; 13, 17, 47, 1177; 19, 30, 239, 7632; 23, 40, 176, 3253; 21, 34, 163, 2308; 15, 22, 61, 1399; 44, 97, 828, "unable to finish in 5+ hours".]

Q1: RIOS is always the fastest approach

Conclusions

RIOS: cost-based query optimizer for Spark

Statistics are gathered at runtime (no need for initial statistics or pilot runs)

Late binding of joins

Up to 2x faster than the best plan generated by pilot run, and >100x faster than previous approaches for fact table joins.

Experiment Configuration

๏ Datasets:
• TPCDS
• TPCH

๏ Configuration:
• 16 machines, each with 4 cores (2 hyper threads per core), 32GB of RAM, 1TB disk
• Spark 2.2.1
• Scale factor from 1 to 1000 (~1TB)

References

1. https://databricks.com/blog/2017/08/31/cost-based-optimizer-in-apache-spark-2-2.html
2. S. Agarwal, S. Kandula, N. Bruno, M.-C. Wu, I. Stoica, and J. Zhou. Re-optimizing data-parallel computing. In NSDI, 2012.
3. K. Karanasos, A. Balmin, M. Kutsch, F. Ozcan, V. Ercegovac, C. Xia, and J. Jackson. Dynamically optimizing queries over large scale data platforms. In SIGMOD, pages 943–954.
4. S. Chaudhuri, G. Das, and V. Narasayya. Optimized stratified sampling for approximate query processing. TODS, 32(2), Jun 2007.
5. O. Papapetrou, W. Siberski, and W. Nejdl. Cardinality estimation and dynamic length adaptation for bloom filters. Distributed and Parallel Databases, 28(2):119–156, 2010.

Thank you