Generalized Sub-Query Fusion for Eliminating Redundant I/O ...
Transcript of Generalized Sub-Query Fusion for Eliminating Redundant I/O ...
Generalized Sub-Query Fusion for Eliminating Redundant I/O from Big-Data QueriesKaushik Rajan,
Partho Sarthi, Akash Lal (Microsoft Research)
Abhishek Modi, Ashit Gosalia, Prakhar Jain, Mo Liu, (Microsoft)
Saurabh Kalikar (Intel, interned at Microsoft Research)
Big data query compilation
1 1 1 1
2 2 2 2
Query compilation
SQL Query
Data parallel execution
scheduling/execution exchange
Map 1
Reduce 2
Reduce 3 Execution Plan
Query compilation and execution in big-data systems (Spark, Hadoop, Snowflake, Amazon Redshift, Google
BigQuery, Azure Synapse)
Big data query compilation
1 1 1 1
2 2 2 2
scheduling/execution exchange
Plans with fewer stages preferable
Map 1
Reduce 2
Reduce 3 Execution Plan
Exchanges expensive as they induce disk and network I/O
Data parallel execution
Query compilation
SQL Query
Redundant stages of processing
β’ TPCDS, 40% of queries have redundant I/O
β’ 16% of all queries, High-impact spend at least 50% time on stages with redundant I/O
β’ 9% medium impact, spend 10-50% time on stages with redundant I/O
Redundancy analysis in SPARK on TPCDS
RESIN: MapReduce reasoning during optimization
State-of-the-art
Logical plan
Physical plan
Map 1
Reduce 2Produce plans with
fewer stages
SQLβSQL+MR
RESIN
MR ops
Logical plan
SQLSQL β SQL rewrites SQL PhyOps,
exchanges Physical plan
Map 1
Reduce 2
Reduce 3
RESIN: MapReduce reasoning during optimization
Logical plan
SQLSQL β SQL rewrites SQL PhyOps,
exchanges Physical plan
State-of-the-art
Map 1
Reduce 2
Reduce 3
Logical plan
Physical plan
MR opsMap 1
Reduce 2Produce plans with
fewer stages
SQLβSQL+MR
RESIN
1. Subquery Fusion2. Binary elimination
ResinMapResinReduce
Impact of RESIN on I/O and Memory
Rest of the talk
1. ResinMap and ResinReduce
2. Generalized sub-query fusion
3. Implementation on Spark
4. Experimental evaluation
T1 = SELECT ππ, βπ β βπ1,π πππππ β π πππππ1
FROM rawLogsWHERE βπ1 β₯ 0 β§ βπ1 <
24 β§ π πππππ1 β₯ 0
signals = SELECT * FROM T1UNION ALL
SELECT * FROM T2
Union
Project
iotLogs : (ππ, βπ1, π πππππ1, βπ2, π πππππ2)
T2 = SELECT ππ, βπ β βπ2,π πππππ β π πππππ2
FROM rawLogsWHERE βπ2 β₯ 0 β§ βπ2 <
24 β§ π πππππ2 β₯ 0
ResinMap[/*1*/{Filter(βπ1 β₯ 0 β§ βπ1 < 24 β§π πππππ1 β₯ 0)), Cols(id, βπ β βπ1,
π πππππ β π πππππ1)}, /*2*/{Filter(βπ2 β₯ 0 β§ βπ2 < 24 β§π πππππ2 β₯ 0), Cols(id, βπ β βπ2,
π πππππ β π πππππ2)}
(a) SQL Query(b) Standardexecution plan
(c) Optimizedexecution plan
signals
signals
scaniotLogs
I scan iotLogs
FilterProjectFilter
scan iotLogsiotLogs :
(ππ, βπ1, π πππππ1, βπ2, π πππππ2)
ResinMapA row-wise operator, can produce multiple output rows per input row
T1 = SELECT ππ, βπ β βπ1,π πππππ β π πππππ1
FROM rawLogsWHERE βπ1 β₯ 0 β§ βπ1 <
24 β§ π πππππ1 β₯ 0
signals = SELECT * FROM T1UNION ALL
SELECT * FROM T2
Union
Project
iotLogs : (ππ, βπ1, π πππππ1, βπ2, π πππππ2)
T2 = SELECT ππ, βπ β βπ2,π πππππ β π πππππ2
FROM rawLogsWHERE βπ2 β₯ 0 β§ βπ2 <
24 β§ π πππππ2 β₯ 0
ResinMap[/*1*/{Filter(βπ1 β₯ 0 β§ βπ1 < 24 β§π πππππ1 β₯ 0)), Cols(id, βπ β βπ1,
π πππππ β π πππππ1)}, /*2*/{Filter(βπ2 β₯ 0 β§ βπ2 < 24 β§π πππππ2 β₯ 0), Cols(id, βπ β βπ2,
π πππππ β π πππππ2)}
(a) SQL Query(b) Standardexecution plan
(c) Optimizedexecution plan
signals
signals
scaniotLogs
I scan iotLogs
FilterProjectFilter
scan iotLogsiotLogs :
(ππ, βπ1, π πππππ1, βπ2, π πππππ2)
ResinMapA row-wise operator, can produce multiple output rows per input row
ResinMap[/*1*/{Filter(βπ1 β₯ 0 β§ βπ1 < 24 β§π πππππ1 β₯ 0)), Cols(id, βπ β βπ1,
π πππππ β π πππππ1)}, /*2*/{Filter(βπ2 β₯ 0 β§ βπ2 < 24 β§π πππππ2 β₯ 0), Cols(id, βπ β βπ2,
π πππππ β π πππππ2)}
signals
scan iotLogs
ResinMapA row-wise operator, can produce multiple output rows per input row
//Each mapper m processes a partition rawlogs[m]Method ResinMap(m) {
foreach ππ, βπ1, π πππππ1, βπ2, π πππππ2 β πππ‘πΏπππ π {if(βπ1 β₯ 0 β§ βπ1 < 24 β§ π πππππ1 β₯ 0) {
βπ = βπ1; π πππππ = π πππππ1; ππ’π‘ππ’π‘ ππ, βπ, π πππππ}if(βπ2 β₯ 0 β§ βπ2 < 24 β§ π πππππ2 β₯ 0) {
βπ = βπ2; π πππππ = π πππππ2; ππ’π‘ππ’π‘ ππ, βπ, π πππππ}
}
Single table select, project, union queries in one stage
ResinReduce
πΊπππ’ππ΅π¦ππ,s1βmax(π πππππ1)πΊπππ’ππ΅π¦ππ,π 2βmππ(π πππππ2)
Join (π1 = π2)
Project(c1 β ππ, π 1) Project(c2 β ππ, π 2)
πΉπππ‘ππ (βπ1 β€ 12) πΉπππ‘ππ (βπ2 β€ 18)
iotLogs : (ππ, βπ1, π πππππ1, βπ2, π πππππ2)
iotLogs : (ππ, βπ1, π πππππ1, βπ2, π πππππ2)
ResinReduce[(πππ¦ = ππ),/*1*/{filter(βπ1 β€ 12)), aggregate(
)π 1 β
max(π πππππ1 , ππ1 β πππ’ππ‘ β )}/*2*/{ filter(βπ2 β€ 18)aggregate (s2 βπππ π πππππ2 ππ2 β πππ’ππ‘ β )}]
iotLogs : (ππ, βπ1, π πππππ1, βπ2, π πππππ2)
(Filter(ππ1 > 0 β§ ππ2 > 0)
Eliminate multiple shuffles from single table join queries
Filter( βπ1 β€ 12 β¨ βπ2 β€ 18)
Project(ππ, π πππππ1) Project(ππ, π πππππ2)
Key based operator, process rows sharing key, produce one row
ResinReduce[(πππ¦ = ππ),/*1*/{filter, aggregate}/*2*/{ filter,aggregate}]
Join (πππ = ππ)
Filter(π4)(βπ‘))
Project(πππ‘π¦, πππ)
πΊπππ’ππ΅π¦πππ‘π¦,s1βmax(π πππππ) πΊπππ’ππ΅π¦πππ‘π¦,π 2βmax(π πππππ)
Join (π1 = π2)
Scan signals Scan signals Scan dInfo
Filter(π1(βπ)) Filter(π2(βπ))
Project(id, π πππππ) Project(id, π πππππ)
Project(πππ‘π¦, πππ)
Scan dInfo
Project(πππ‘π¦, π πππππ)
Join (πππ = ππ)
Project(πππ‘π¦, π πππππ)
Filter(π3(βπ‘))
S1 S2
S3
S4
S5 S6
S7 S8
S9
Project(c1 β πππ‘π¦, π 1) Project(c2 β πππ‘π¦, π 2)
Join (πππ = ππ)
Filter( π1 β§ π3 β¨ (π2 β§ π4)(βπ, βπ‘)
Filter(π1 β¨ π2(βπ))
Project(id,signal,hr) Project(πππ‘π¦, πππ,ht)
Project(πππ‘π¦, π πππππ, βπ, βπ‘)
S10
Filter(π3 β¨ π4(βπ‘))
Scan dInfoScan signals
S11
S12
S13
Sub-query fusion
Eliminate scans/shuffles from multi-table queries
Join (πππ = ππ)
Filter(π4(βπ‘))
Project(πππ‘π¦, πππ)
Scan signals Scan signals Scan dInfo
Filter(π1(βπ)) Filter(π2(βπ))
Project(id, π πππππ) Project(id, π πππππ)
Project(πππ‘π¦, πππ)
Scan dInfo
Project(πππ‘π¦, π πππππ)
Join (πππ = ππ)
Project(πππ‘π¦, π πππππ)
Filter(π3(βπ‘))
S2
S3
S4
Filter(π1 β¨ π2(βπ))
Project(id,signal,hr) Project(πππ‘π¦, πππ,ht)
S10
Filter(π3 β¨ π4(βπ‘))
Scan dInfoScan signals
S11
Sub-query fusion
Eliminate scans/shuffles from multi-stage queries
Filter(π1(βπ)) Filter(π2(βπ))
S1
Filter(π1 β§ π3) Filter(π2 β§ π4)
πΊπππ’ππ΅π¦πππ‘π¦,s1βmax(π πππππ) πΊπππ’ππ΅π¦πππ‘π¦,π 2βmax(π πππππ)
Join (π1 = π2)
Project(c1 β πππ‘π¦, π 1) Project(c2 β πππ‘π¦, π 2)
Join (πππ = ππ)
Project(πππ‘π¦, π πππππ)
Join (πππ = ππ)
Project(πππ‘π¦, π πππππ)
Sub-query fusion
Eliminate scans/shuffles from multi-stage queries
Filter(π1 β¨ π2(βπ))
Project(id,signal,hr) Project(πππ‘π¦, πππ,ht)
Filter(π3 β¨ π4(βπ‘))
Scan dInfoScan signals
Filter(π1(βπ)) Filter(π2(βπ))Filter(π1 β§ π3) Filter(π2 β§ π4)
S5 S6
Join (πππ = ππ)Filter( π1 β§ π3 β¨ (π2 β§ π4)(βπ, βπ‘)
Project(πππ‘π¦, π πππππ, βπ, βπ‘)
Filter(π1 β§ π3) Filter(π2 β§ π4)
S12
πΊπππ’ππ΅π¦πππ‘π¦,s1βmax(π πππππ) πΊπππ’ππ΅π¦πππ‘π¦,π 2βmax(π πππππ)
Join (π1 = π2)
Project(c1 β πππ‘π¦, π 1) Project(c2 β πππ‘π¦, π 2)
πΊπππ’ππ΅π¦πππ‘π¦,s1βmax(π πππππ) πΊπππ’ππ΅π¦πππ‘π¦,π 2βmax(π πππππ)
Join (π1 = π2)
Project(c1 β πππ‘π¦, π 1) Project(c2 β πππ‘π¦, π 2)
Sub-query fusion
Eliminate scans/shuffles from multi-stage queries
Filter(π1 β¨ π2(βπ))
Project(id,signal,hr) Project(πππ‘π¦, πππ,ht)
Filter(π3 β¨ π4(βπ‘))
Scan dInfoScan signals
Join (πππ = ππ)Filter( π1 β§ π3 β¨ (π2 β§ π4)(βπ, βπ‘)
Project(πππ‘π¦, π πππππ, βπ, βπ‘)
Filter(π1 β§ π3) Filter(π2 β§ π4)
S7 S8
S9S13
ResinReduce[(πππ¦ = πππ‘π¦),/*1*/{filter(π1 β§ π3), aggregate(
)π 1 β
max(π πππππ , ππ1 β πππ’ππ‘ β )}/*2*/{ filter(π2 β§ π4)
aggregate (s2 β πππ₯ π πππππ ππ2 βπππ’ππ‘ β )}]
(Filter(ππ1 > 0 β§ ππ2 > 0)
Join (πππ = ππ)
Filter(π4)(βπ‘))
Project(πππ‘π¦, πππ)
πΊπππ’ππ΅π¦πππ‘π¦,s1βmax(π πππππ) πΊπππ’ππ΅π¦πππ‘π¦,π 2βmax(π πππππ)
Join (π1 = π2)
Scan signals Scan signals Scan dInfo
Filter(π1(βπ)) Filter(π2(βπ))
Project(id, π πππππ) Project(id, π πππππ)
Project(πππ‘π¦, πππ)
Scan dInfo
Project(πππ‘π¦, π πππππ)
Join (πππ = ππ)
Project(πππ‘π¦, π πππππ)
Filter(π3(βπ‘))
S1 S2
S3
S4
S5 S6
S7 S8
S9
Project(c1 β πππ‘π¦, π 1) Project(c2 β πππ‘π¦, π 2)
Join (πππ = ππ)
Filter( π1 β§ π3 β¨ (π2 β§ π4)(βπ, βπ‘)
Filter(π1 β¨ π2(βπ))
Project(id,signal,hr) Project(πππ‘π¦, πππ,ht)
Project(πππ‘π¦, π πππππ, βπ, βπ‘)
S10
Filter(π3 β¨ π4(βπ‘))
Scan dInfoScan signals
S11
S12
S13
Sub-query fusion
Eliminate scans/shuffles from multi-table queries
ResinReduce[(πππ¦ = πππ‘π¦),/*1*/{filter(π1 β§ π3), aggregate(
)π 1 β
max(π πππππ , ππ1 β πππ’ππ‘ β )}/*2*/{ filter(π2 β§ π4)
aggregate (s2 β πππ₯ π πππππ ππ2 βπππ’ππ‘ β )}]
(Filter(ππ1 > 0 β§ ππ2 > 0)S13
In the paper
β’Parameters for ResinMap and ResinReduceoperators, semantics and implementation
β’ Fusing of operators without increasing the number of rows shuffled
β’ Fusion rules for all sparkSQL operators, conditions under which fusion is possible
Rest of the talk
1. ResinMap and ResinReduce
2. Generalized sub-query fusion
3. Implementation on Spark
4. Experimental evaluation
Implementation
Implemented RESIN on catalyst optimizer in SPARK 2.4
1. Added logical and physical operators for ResinMap andResinReduce
2. Added a new batch of optimization rulesβ’ Perform fusion in a single traversal of the tree
β’ Perform Union and Join elimination by checking fused parent
β’ Introduce exchanges if parent after fusion cannot be eliminated
3. Added implementations for our operators with codegen support
Evaluation
β’ Evaluated with TPCDS at 1TB and 10TB scale, data stored in parquet
β’ Two different clusters <120 cores, 480 GB memory> and <480 cores, 1.6TB memory>
β’ Detailed evaluation of 40 (out of 104) queries with redundant I/O
β’ Note: baseline that already has basic I/O optimizations (predicate and project pushdown to store, exchange reuse)
Speedup at 1TB
Impact of RESIN on I/O and Memory
Conclusions
β’Big-data optimizers produce plans with redundant I/O and compute
β’Proposed optimizer extensions to perform first class map-reduce reasoning
β’Added generic map and reduce operators, rewrites that fuse stages and eliminate redundant I/O
β’Demonstrated savings in terms of latency, disk and network I/O
Thank You
Email [email protected] or any of the other authors to contact us