Query Processing and Optimization in Modern Database Systems

Viktor Leis

March 13, 2017

Architecture of Traditional RDBMSs

feature                technique
transaction isolation  locking (2PL)
synchronization        latching (“lock coupling”)
large data sets        buffer management
durability             ARIES-style logging
indexing               B+tree
storage                slotted pages (row-wise)
SQL                    iterator model (interpreter)
parallelization        Exchange operators
query optimization     DP, indep. assumption

• goal: optimizing (random) disk I/O operations

Traditional RDBMSs on Modern Hardware

feature                technique                     overhead [1]
transaction isolation  locking (2PL)                 16%
synchronization        latching (“lock coupling”)    14%
large data sets        buffer management             35%
durability             ARIES-style logging           12%
indexing               B+tree
storage                slotted pages (row-wise)
SQL                    iterator model (interpreter)
parallelization        Exchange operators
query optimization     DP, indep. assumption

[1] OLTP Through the Looking Glass (Harizopoulos et al., SIGMOD 2008)

Modern Database Systems

• OLAP: column stores (Vectorwise, Vertica, Microsoft Apollo, IBM BLU)
• OLTP: main-memory systems (e.g., Microsoft Hekaton, VoltDB)
• OLAP & OLTP: HANA, HyPer

HyPer in 2017

feature                HyPer in 2017              contributions
transaction isolation  MVCC, precision locking
synchronization        -                          Part I
large data sets        -
durability             physiological logging
indexing               Adaptive Radix Tree        Master’s [ICDE 2013]
storage                Data Blocks
SQL                    LLVM compilation
parallelization        morsel-driven parallelism  Part II
query optimization     DP, indep. assumption      Part III

Part I: Synchronization on Multi-Core CPUs

ICDE 2014, TKDE 2016, DaMoN 2016

Synchronization

• default index structure in HyPer: Adaptive Radix Tree
• latch acquisition causes cache misses

[Figure: throughput in M operations/second (25 to 100) vs. threads (5 to 20); series: no synchronization, lock coupling]

• this explains single-threaded databases (VoltDB, HyPer 2011)

Hardware Transactional Memory

• recent feature offered by Intel CPUs (from Haswell)
+ the easiest way to synchronize data structures
+ often very good scalability
− not yet widespread
− scalability issues can be hard to debug

Optimistic Lock Coupling

• idea: writers acquire latches (only on modified nodes)
• readers validate accesses using version counters (restart if necessary)
+ very general technique
+ easy to use
− may lead to restarts
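The version-counter validation described above can be illustrated with a minimal sketch. It covers a single node only; a real index would also validate the parent node's version after reading the child. The class and method names (`VersionedNode`, `optimistic_read`) and the even/odd version encoding are assumptions made for illustration, not HyPer's implementation.

```python
import threading


class VersionedNode:
    """A node guarded by a version counter and a writer latch (hypothetical names)."""

    def __init__(self, value):
        self.value = value
        self.version = 0                # even = no writer active, odd = write in progress
        self._latch = threading.Lock()  # excludes concurrent writers only

    def write(self, new_value):
        # Writers acquire the latch and bump the version before and after the change.
        with self._latch:
            self.version += 1           # odd: readers can detect the active writer
            self.value = new_value
            self.version += 1           # even again, but different from the old version

    def optimistic_read(self):
        # Readers never take the latch; they validate with the version counter instead.
        restarts = 0
        while True:
            before = self.version
            if before % 2 == 1:         # a writer is active: restart
                restarts += 1
                continue
            value = self.value          # unsynchronized read
            if self.version == before:  # validation succeeded: nothing changed meanwhile
                return value, restarts
            restarts += 1               # version changed during the read: restart
```

A reader pays only two counter loads when no writer interferes, which is why reads do not cause the cache-line invalidations that latch acquisition does; the price is that a read may have to be repeated.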

Read-Optimized Write Exclusion (ROWEX)

• idea: writers acquire latches (on modified nodes)
• writers ensure that reads are always safe
+ reads always succeed
− more difficult than optimistic lock coupling (but easier than lock-free techniques)

Conclusions

[Figure: throughput in M operations/second (25 to 100) vs. threads (5 to 20); series: no synchronization, lock coupling, Opt. Lock Coupling, ROWEX, HTM]

• latching (does not scale), lock-free data structures (scalable but slow), and HTM (not widespread) have major problems
• Optimistic Lock Coupling and ROWEX are scalable and practical

Part II: Intra-Query Parallelization on Multi-Core CPUs

SIGMOD 2014, VLDB 2015

Motivation: Many, Many Cores

[Figure: cores per CPU (1 to 30) vs. year (2000 to 2016); data points: NetBurst (Foster), NetBurst (Paxville), Core (Kentsfield), Core (Lynnfield), Nehalem (Beckton), Nehalem (Westmere EX), Sandy Bridge EP, Ivy Bridge EP, Ivy Bridge EX, Haswell EP, Broadwell EP, Broadwell EX, Skylake EP]

Parallel Query Processing in HyPer

• break input into work units (“morsels”)
• worker threads grab morsels dynamically (“work stealing”)
• # worker threads = # hardware threads
• requires all operators to be aware of parallelism
• better scalability than Exchange operators
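The dispatching scheme in the list above can be sketched as follows. This is a simplified illustration, not HyPer's scheduler: the function name `run_morsel_driven` and its parameters are hypothetical, a single shared queue stands in for the dispatcher, and Python threads show the scheduling pattern only (they do not give real CPU parallelism).

```python
import os
import threading
from queue import Queue, Empty


def run_morsel_driven(data, morsel_size, process, num_workers=None):
    """Split `data` into morsels and let worker threads grab them dynamically."""
    if num_workers is None:
        num_workers = os.cpu_count() or 1    # one worker per hardware thread

    morsels = Queue()
    for start in range(0, len(data), morsel_size):
        morsels.put(data[start:start + morsel_size])

    # One result list per worker, so workers never write to shared state.
    partials = [[] for _ in range(num_workers)]

    def worker(worker_id):
        while True:
            try:
                morsel = morsels.get_nowait()   # grab the next morsel, if any is left
            except Empty:
                return                          # no work left: the worker finishes
            partials[worker_id].append(process(morsel))

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return [result for part in partials for result in part]
```

For example, `sum(run_morsel_driven(list(range(1000)), 64, sum, num_workers=4))` adds up per-morsel sums. Because morsels are handed out on demand, a worker that gets slow morsels simply processes fewer of them, which balances load without static partitioning.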

Example 1: Hash Join

[Figure: parallel build phase of a hash join on input T; labels: morsel, next morsel, storage areas of the blue, red, and green cores, global hash table]

Phase 1: process T morsel-wise and store NUMA-locally

Phase 2: scan NUMA-local storage area and insert pointers into the global hash table (HT)
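The two build phases can be sketched in a few lines. The sketch runs sequentially and uses plain Python lists as stand-ins for the per-core storage areas, so it shows the data flow only; the function names `build_hash_table` and `probe` and the input layout (a list of morsels per worker) are assumptions for illustration.

```python
from collections import defaultdict


def build_hash_table(morsels_per_worker, key):
    """Two-phase build: materialize per worker, then insert references into one table."""
    # Phase 1: every worker copies the tuples of its morsels into its own storage area.
    storage_areas = []
    for morsels in morsels_per_worker:
        area = []
        for morsel in morsels:
            area.extend(morsel)
        storage_areas.append(area)

    # Phase 2: every worker scans its own area and inserts references (not copies)
    # into the shared hash table.
    hash_table = defaultdict(list)
    for area in storage_areas:
        for tup in area:
            hash_table[key(tup)].append(tup)
    return storage_areas, hash_table


def probe(hash_table, tuples, key):
    """Return all (build tuple, probe tuple) pairs with matching keys."""
    return [(build, t) for t in tuples for build in hash_table.get(key(t), [])]
```

Keeping the tuples in the worker's own storage area and sharing only references is what allows the tuple data to stay local to the core that produced it.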

Example 2: Window Functions

select a, b, rank() over (partition by a order by b) from r

1. hash partitioning (thread-local; thread 1, thread 2)
2. combine
3. sort/evaluation
3.1. inter-partition parallelism
3.2. intra-partition parallelism
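The three steps can be traced with a small sequential sketch for the `rank()` query above. The function name `rank_over_partitions` is hypothetical, the "threads" are simulated by slicing the input, and a dictionary keyed by `a` stands in for hash partitioning; the parallel variants of step 3 are not modeled.

```python
from collections import defaultdict


def rank_over_partitions(rows, num_threads=2):
    """Evaluate rank() over (partition by a order by b) for rows of the form (a, b)."""
    # 1. Thread-local partitioning: each simulated thread partitions its input slice.
    chunk = max(1, (len(rows) + num_threads - 1) // num_threads)
    local_partitions = []
    for t in range(num_threads):
        parts = defaultdict(list)
        for a, b in rows[t * chunk:(t + 1) * chunk]:
            parts[a].append((a, b))
        local_partitions.append(parts)

    # 2. Combine: merge the thread-local partitions that share a partition key.
    combined = defaultdict(list)
    for parts in local_partitions:
        for a, part in parts.items():
            combined[a].extend(part)

    # 3. Sort each partition by b and assign ranks; ties share a rank and
    #    the following rank skips ahead, as rank() requires.
    result = []
    for part in combined.values():
        previous_b, rank = object(), 0
        for position, (a, b) in enumerate(sorted(part, key=lambda r: r[1]), start=1):
            if b != previous_b:
                rank, previous_b = position, b
            result.append((a, b, rank))
    return result
```

Partitions are independent after step 2, so different partitions can be sorted by different threads; a single very large partition would instead need a parallel sort within the partition.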

Scalability on 32-core System (TPC-H Queries)

[Figure: one panel per TPC-H query (1 to 22); x-axis: threads (1 to 64); y-axis: speedup over HyPer (0 to 40); systems: HyPer, Vectorwise]

Part III: Query Optimization

VLDB 2016

Query Optimization

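SELECT ... FROM R, S, T WHERE ...

[Figure: the query is translated into a join plan over R, S, and T (operators: HJ, INL) using three optimizer components: cardinality estimation, cost model, plan space enumeration]

• Do we need a new architecture for query optimizers, too?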
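How the three components interact can be sketched with a toy optimizer. Everything here is an assumption made for illustration: cardinalities are estimated as the product of base table sizes and join selectivities (independence assumption), the cost of a plan is the sum of its estimated intermediate result sizes, and plans are enumerated by dynamic programming over relation subsets without cross products. The join graph is assumed to be connected.

```python
from itertools import combinations


def estimate_cardinality(rels, base_card, selectivity):
    """Cardinality estimation: base sizes times all applicable join selectivities."""
    card = 1.0
    for r in rels:
        card *= base_card[r]
    for (a, b), sel in selectivity.items():
        if a in rels and b in rels:
            card *= sel
    return card


def best_plan(base_card, selectivity):
    """Plan space enumeration by dynamic programming over subsets of relations."""
    relations = sorted(base_card)
    best = {frozenset([r]): (0.0, r) for r in relations}   # subset -> (cost, plan)
    for size in range(2, len(relations) + 1):
        for subset in combinations(relations, size):
            s = frozenset(subset)
            out_card = estimate_cardinality(s, base_card, selectivity)
            for left_size in range(1, size):
                for left in combinations(subset, left_size):
                    l, r = frozenset(left), s - frozenset(left)
                    if l not in best or r not in best:
                        continue        # one side has no cross-product-free plan
                    connected = any((a in l and b in r) or (a in r and b in l)
                                    for a, b in selectivity)
                    if not connected:
                        continue        # skip cross products
                    # Cost model: cost of both inputs plus the size of this join's result.
                    cost = best[l][0] + best[r][0] + out_card
                    if s not in best or cost < best[s][0]:
                        best[s] = (cost, (best[l][1], best[r][1]))
    return best[frozenset(relations)]
```

With `base_card = {"R": 1000, "S": 100, "T": 10}` and `selectivity = {("R", "S"): 0.01, ("S", "T"): 0.1}`, the enumeration prefers joining S and T first, because that intermediate result is estimated to be smaller than the result of joining R and S.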

Join Order Benchmark

• Internet Movie Database data set (4 GB)
• much more challenging than synthetic benchmarks like TPC-H
• 113 queries with 3 to 16 joins

Cardinality Estimation: PostgreSQL

[Figure: estimation error vs. number of joins (0 to 6); y-axis on a log scale, from 1e8 (underestimation) through 1 to 1e4 (overestimation); shown: 5th, 25th, median, 75th, and 95th percentile]

Cardinality Estimation: Commercial Systems

[Figure: one panel per system (PostgreSQL, DBMS A, DBMS B, DBMS C, HyPer); estimation error vs. number of joins (0 to 6); y-axis on a log scale, from 1e8 (underestimation) through 1 to 1e4 (overestimation); shown: 5th, 25th, median, 75th, and 95th percentile]

Conclusions

• query optimization is essential
• most (random) join orders are bad
• optimizers will find good plans for most queries
• cardinality estimation is usually the reason for bad plans
• cost model much less important (with memory-resident data)
• relative plan quality decreases when more indexes are available
• operators should not rely on estimates (if possible)
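One way estimates go wrong is the independence assumption listed in the feature tables: selectivities of several predicates are multiplied as if the columns were unrelated. The sketch below uses synthetic data with two perfectly correlated columns, chosen only to make the effect visible; the function names are hypothetical and real systems estimate selectivities from statistics rather than by scanning the table.

```python
def selectivity(rows, predicate):
    """Fraction of rows that satisfy a single predicate."""
    return sum(1 for row in rows if predicate(row)) / len(rows)


def independence_estimate(rows, predicates):
    """Estimated result size: table size times the product of the selectivities."""
    estimate = float(len(rows))
    for predicate in predicates:
        estimate *= selectivity(rows, predicate)
    return estimate


def true_cardinality(rows, predicates):
    """Actual number of rows that satisfy all predicates."""
    return sum(1 for row in rows if all(p(row) for p in predicates))


# Synthetic table: the second column always equals the first (perfect correlation).
rows = [(i % 10, i % 10) for i in range(1000)]
predicates = [lambda row: row[0] == 3, lambda row: row[1] == 3]

estimated = independence_estimate(rows, predicates)   # about 1000 * 0.1 * 0.1
actual = true_cardinality(rows, predicates)           # every row with 3 in column one
print(f"estimated {estimated:.0f}, actual {actual}")
```

Here a single conjunctive filter is underestimated by a factor of ten; such errors multiply as more predicates and joins are combined, which matches the growing spread of the error with the number of joins in the plots.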

Future Work

feature
transaction isolation  MVCC, precision locking
synchronization        Optimistic Lock Coupling
large data sets        ?
durability             ?
indexing               Adaptive Radix Tree
storage                Data Blocks
SQL                    LLVM compilation
parallelization        morsel-driven parallelism
query optimization     index-based join sampling (CIDR 2017)