Query Processing and Optimization in Modern Database Systems
Viktor Leis
Architecture of Traditional RDBMSs
feature                 technique
transaction isolation   locking (2PL)
synchronization         latching ("lock coupling")
large data sets         buffer management
durability              ARIES-style logging
indexing                B+tree
storage                 slotted pages (row-wise)
SQL                     iterator model (interpreter)
parallelization         Exchange operators
query optimization      DP, indep. assumption

- optimizing (random) disk I/O operations
Traditional RDBMSs on Modern Hardware
feature                 technique                      overhead [1]
transaction isolation   locking (2PL)                  16%
synchronization         latching ("lock coupling")     14%
large data sets         buffer management              35%
durability              ARIES-style logging            12%
indexing                B+tree
storage                 slotted pages (row-wise)
SQL                     iterator model (interpreter)
parallelization         Exchange operators
query optimization      DP, indep. assumption

[1] OLTP Through the Looking Glass (Harizopoulos et al., SIGMOD 2008)
Modern Database Systems
- OLAP: column stores (Vectorwise, Vertica, Microsoft Apollo, IBM BLU)
- OLTP: main-memory systems (e.g., Microsoft Hekaton, VoltDB)
- OLAP & OLTP: HANA, HyPer
HyPer in 2017
feature                 HyPer in 2017               contributions
transaction isolation   MVCC, precision locking
synchronization         -                           Part I
large data sets         -
durability              physiological logging
indexing                Adaptive Radix Tree         Master's [ICDE 2013]
storage                 Data Blocks
SQL                     LLVM compilation
parallelization         morsel-driven parallelism   Part II
query optimization      DP, indep. assumption       Part III
Part I: Synchronization on Multi-Core CPUs
ICDE 2014, TKDE 2016, DaMoN 2016
Synchronization
- default index structure in HyPer: Adaptive Radix Tree
- latch acquisition causes cache misses
[Figure: throughput (M operations/second) vs. number of threads (5 to 20) for "no synchronization" and "lock coupling"]
- this explains single-threaded databases (VoltDB, HyPer 2011)
Hardware Transactional Memory
- recent feature offered by Intel CPUs (from Haswell)
+ the easiest way to synchronize data structures
+ often very good scalability
− not yet widespread
− scalability issues can be hard to debug
Optimistic Lock Coupling
- idea: writers acquire latches (only on modified nodes)
- readers validate accesses using version counters (restart if necessary)
+ very general technique
+ easy to use
− may lead to restarts
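The reader/writer protocol above can be sketched on a tiny two-level tree. This is a minimal illustration, not HyPer's implementation: the `Node` class, its fields, and the `lookup` function are hypothetical names, and the "odd version means a writer is active" convention is an assumption of this sketch.

```python
import threading


class Node:
    """Tree node with a writer latch and a version counter (hypothetical layout)."""

    def __init__(self):
        self.latch = threading.Lock()  # acquired by writers only
        self.version = 0               # assumption: odd while a writer is active
        self.entries = {}              # key -> child Node (inner) or value (leaf)

    def read_version(self):
        """Return the current version, or None if a writer is active."""
        v = self.version
        return None if v % 2 == 1 else v

    def validate(self, seen):
        """True if no writer touched the node since `seen` was read."""
        return self.version == seen

    def write(self, key, value):
        """Writers latch only the node they modify and bump its version."""
        with self.latch:
            self.version += 1          # mark the node as being modified
            self.entries[key] = value
            self.version += 1          # publish: readers with the old version restart


def lookup(root, inner_key, leaf_key):
    """Latch-free read of root -> leaf; restarts whenever validation fails."""
    while True:
        v_root = root.read_version()
        if v_root is None:
            continue                   # writer active on root: restart
        leaf = root.entries.get(inner_key)
        if not root.validate(v_root):
            continue                   # root changed while we read it: restart
        if leaf is None:
            return None
        v_leaf = leaf.read_version()
        # "Coupling": re-check the parent once the child's version is known.
        if v_leaf is None or not root.validate(v_root):
            continue
        value = leaf.entries.get(leaf_key)
        if not leaf.validate(v_leaf):
            continue                   # leaf changed while we read it: restart
        return value


root, leaf = Node(), Node()
leaf.write("b", 42)
root.write("a", leaf)
print(lookup(root, "a", "b"))
```

Readers never acquire a latch and therefore never write to shared memory, which is the point of the technique; the price is the restart loop in `lookup`.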
Read-Optimized Write Exclusion (ROWEX)
- idea: writers acquire latches (on modified nodes)
- writers ensure that reads are always safe
+ reads always succeed
− more difficult than optimistic lock coupling (but easier than lock-free techniques)
Conclusions
[Figure: throughput (M operations/second) vs. number of threads (5 to 20) for "no synchronization", "lock coupling", "Opt. Lock Coupling", "ROWEX", and "HTM"]
- latching (does not scale), lock-free data structures (scalable but slow), and HTM (not widespread) have major problems
- Optimistic Lock Coupling and ROWEX are scalable and practical
Part II: Intra-Query Parallelization on Multi-Core CPUs
SIGMOD 2014, VLDB 2015
Motivation: Many, Many Cores
[Figure: cores per CPU (1 to 30) by year (2000 to 2016) for NetBurst (Foster, Paxville), Core (Kentsfield, Lynnfield), Nehalem (Beckton, Westmere EX), Sandy Bridge EP, Ivy Bridge EP, Ivy Bridge EX, Haswell EP, Broadwell EP, Broadwell EX, and Skylake EP]
Parallel Query Processing in HyPer
- break input into work units ("morsels")
- worker threads grab morsels dynamically ("work stealing")
- # worker threads = # hardware threads
- requires all operators to be aware of parallelism
- better scalability than Exchange operators
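The dispatching scheme described above can be sketched as follows. The function `run_morsel_wise` and its parameters are hypothetical names chosen for this illustration. Note that CPython's global interpreter lock prevents these threads from running CPU-bound work truly in parallel, so the sketch shows only how morsels are handed out, not the resulting speedup.

```python
import os
import threading


def run_morsel_wise(data, process_morsel, morsel_size=1000, num_workers=None):
    """Split `data` into morsels; each worker grabs the next one when it is free."""
    # One worker per hardware thread unless the caller says otherwise.
    num_workers = num_workers or os.cpu_count() or 1
    next_start = 0
    dispatch_lock = threading.Lock()
    partials = [[] for _ in range(num_workers)]  # one result list per worker

    def grab_morsel():
        """Hand out the next unprocessed morsel, or None when input is exhausted."""
        nonlocal next_start
        with dispatch_lock:
            if next_start >= len(data):
                return None
            start = next_start
            next_start += morsel_size
        return data[start:start + morsel_size]

    def worker(worker_id):
        while True:
            morsel = grab_morsel()
            if morsel is None:
                return
            partials[worker_id].append(process_morsel(morsel))

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return [r for per_worker in partials for r in per_worker]


# Usage: a parallel sum, with one partial sum per morsel.
partial_sums = run_morsel_wise(list(range(10_000)), sum, morsel_size=1000, num_workers=4)
print(sum(partial_sums))
```

Because morsels are assigned on demand rather than up front, a worker that finishes early simply takes more morsels, which balances load without any planning-time partitioning.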
Example 1: Hash Join
Phase 1: process T morsel-wise and store NUMA-locally
Phase 2: scan NUMA-local storage area and insert pointers into HT
[Figure: each core (blue, red, green) takes the next morsel of T and materializes it into its own storage area; each core then scans its storage area and inserts pointers into the global hash table]
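The two build phases of the hash join example can be sketched in a few lines. All names here (`build_hash_table`, `probe`, the `(area, offset)` pairs standing in for pointers) are hypothetical, Python lists stand in for NUMA-local storage areas, and the single lock around hash table inserts is a simplification of this sketch rather than a statement about the real system.

```python
import threading
from collections import defaultdict


def build_hash_table(tuples, key_of, num_workers=3, morsel_size=4):
    """Build a global hash table over `tuples` (a list) in two phases."""
    storage_areas = [[] for _ in range(num_workers)]  # one area per worker
    hash_table = defaultdict(list)                    # key -> [(area, offset), ...]
    lock = threading.Lock()
    cursor = [0]                                      # start of the next morsel

    def next_morsel():
        with lock:
            start = cursor[0]
            cursor[0] += morsel_size
        return tuples[start:start + morsel_size]      # empty once input is exhausted

    def phase1(w):
        # Process the input morsel-wise and store tuples in the worker's own area.
        while True:
            morsel = next_morsel()
            if not morsel:
                return
            storage_areas[w].extend(morsel)

    def phase2(w):
        # Scan the worker's own area and insert "pointers" into the global table.
        for offset, t in enumerate(storage_areas[w]):
            with lock:
                hash_table[key_of(t)].append((w, offset))

    for phase in (phase1, phase2):
        threads = [threading.Thread(target=phase, args=(w,)) for w in range(num_workers)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()                                  # phase 2 starts after phase 1 ends
    return storage_areas, hash_table


def probe(storage_areas, hash_table, key):
    """Follow the stored pointers back to the materialized tuples."""
    return [storage_areas[w][o] for w, o in hash_table.get(key, [])]


build_side = [(i % 5, f"t{i}") for i in range(20)]
areas, table = build_hash_table(build_side, key_of=lambda t: t[0])
print(sorted(probe(areas, table, 3)))
```

Separating the phases means the hash table is populated only after all tuples have been materialized, so its final size is known before the first insert.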
Example 2: Window Functions
select a, b, rank() over (partition by a order by b) from r
1. hash partitioning (thread-local)
2. combine
3. sort/evaluation
3.1. inter-partition parallelism
3.2. intra-partition parallelism
[Figure: thread 1 and thread 2 each hash-partition their input locally; the partitions are then combined, sorted, and evaluated]
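The three steps of the window function example can be traced with a small sequential sketch. The function name `rank_over_partitions` and the parameter `num_partitions` are assumptions of this illustration; the loops over threads and hash partitions are the places where a parallel implementation would distribute work.

```python
from collections import defaultdict


def rank_over_partitions(thread_inputs, num_partitions=4):
    """Evaluate rank() over (partition by a order by b) on rows (a, b)."""
    # 1. Hash partitioning: every thread partitions its own rows locally.
    thread_local = []
    for rows in thread_inputs:
        buckets = defaultdict(list)
        for a, b in rows:
            buckets[hash(a) % num_partitions].append((a, b))
        thread_local.append(buckets)

    # 2. Combine: merge the thread-local buckets of each hash partition.
    combined = defaultdict(list)
    for buckets in thread_local:
        for p, rows in buckets.items():
            combined[p].extend(rows)

    # 3. Sort/evaluation: hash partitions are independent of each other.
    result = []
    for p in sorted(combined):
        rows = sorted(combined[p])       # orders by a, then by b
        for i, (a, b) in enumerate(rows):
            if i == 0 or rows[i - 1][0] != a:
                start = i                # first row of a new window partition
                rank = 1
            elif rows[i - 1][1] != b:
                rank = i - start + 1     # ties keep the rank, later rows skip ahead
            result.append((a, b, rank))
    return result


rows_thread_1 = [(1, 10), (2, 5), (1, 10)]
rows_thread_2 = [(1, 7), (2, 5), (2, 9)]
print(sorted(rank_over_partitions([rows_thread_1, rows_thread_2])))
```

All rows with the same value of `a` land in the same hash partition, so each partition can be sorted and ranked without looking at any other.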
Scalability on 32-core System (TPC-H Queries)
[Figure: one panel per TPC-H query (1 to 22); x-axis: threads (1 to 64); y-axis: speedup over HyPer (0 to 40); systems: HyPer, Vectorwise]
Part III: Query Optimization
VLDB 2016
Query Optimization
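SELECT ... FROM R, S, T WHERE ...
[Figure: the query is turned into a join plan over R, S, and T (join algorithms HJ and INL) using three optimizer components: cardinality estimation, cost model, and plan space enumeration]
- Do we need a new architecture for query optimizers, too?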
Join Order Benchmark
- Internet Movie Database data set (4 GB)
- much more challenging than synthetic benchmarks like TPC-H
- 113 queries with 3 to 16 joins
Cardinality Estimation: PostgreSQL
[Figure: estimation error vs. number of joins (0 to 6); y-axis: log scale, from 1e8 underestimation to 1e4 overestimation; shown: 5th, 25th, median, 75th, and 95th percentile]
Cardinality Estimation: Commercial Systems
[Figure: one panel each for PostgreSQL, DBMS A, DBMS B, DBMS C, and HyPer; estimation error vs. number of joins (0 to 6); y-axis: log scale, from 1e8 underestimation to 1e4 overestimation; shown: 5th, 25th, median, 75th, and 95th percentile]
Conclusions
- query optimization is essential
- most (random) join orders are bad
- optimizers will find good plans for most queries
- cardinality estimation is usually the reason for bad plans
- cost model much less important (with memory-resident data)
- relative plan quality decreases when more indexes are available
- operators should not rely on estimates (if possible)
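One way to see why cardinality estimates go wrong is the independence assumption listed in the feature tables: selectivities of individual predicates are multiplied as if the predicates were unrelated. The sketch below is a toy example on a single table with two perfectly correlated columns; the function names and the data are made up for illustration, and real optimizers work from statistics rather than scanning the rows.

```python
def estimate_with_independence(rows, predicates):
    """Estimate the result size by multiplying per-predicate selectivities."""
    n = len(rows)
    estimate = float(n)
    for pred in predicates:
        selectivity = sum(1 for r in rows if pred(r)) / n
        estimate *= selectivity          # treats every predicate as independent
    return estimate


def true_cardinality(rows, predicates):
    """Count the rows that satisfy all predicates together."""
    return sum(1 for r in rows if all(p(r) for p in predicates))


# Toy data: the second column always equals the first (perfect correlation).
rows = [(i % 10, i % 10) for i in range(1000)]
predicates = [lambda r: r[0] == 3, lambda r: r[1] == 3]

print(estimate_with_independence(rows, predicates))
print(true_cardinality(rows, predicates))
```

Each predicate alone selects one tenth of the rows, so the estimate is a tenth of a tenth of the table, while the true result is a full tenth. Every additional correlated predicate or join multiplies in another such factor, which is one plausible reading of why the underestimation in the figures grows with the number of joins.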
Future Work
feature
transaction isolation   MVCC, precision locking
synchronization         Optimistic Lock Coupling
large data sets         ?
durability              ?
indexing                Adaptive Radix Tree
storage                 Data Blocks
SQL                     LLVM compilation
parallelization         morsel-driven parallelism
query optimization      index-based join sampling (CIDR 2017)