Staged Database Systems
Transcript of Staged Database Systems
Carnegie Mellon Databases
Staged Database Systems
Thesis Oral
Stavros Harizopoulos
2
Database world: a 30,000 ft view
[Figure: users reach the DBMS over the internet; data is offloaded from the OLTP system to the DSS]
Sarah: "Buy this book"
Jeff: "Which store needs more advertising?"
OLTP: Online Transaction Processing (many short-lived requests)
DSS: Decision Support Systems (few long-running queries)
DB systems fuel most e-applications
Improved performance → impact on everyday life
3
New HW/SW requirements
• More capacity, throughput, efficiency
• CPUs run much faster than they can access data
[Figure: CPU cycle vs. memory access: ~10 cycles in the '80s, ~300 cycles today; DSS stresses the I/O subsystem]
Need to optimize all levels of memory hierarchy
4
The further, the slower
• Keep data close to CPU
• Locality and predictability are key
DBMS core design contradicts above goals
Overlap memory accesses with computation
Modify algorithms and structures to exhibit more locality
5
Thread-based execution in DBMS
• Queries are handled by a pool of threads
• Threads execute independently
• No means to exploit common operations
[Figure: the DBMS thread pool executes threads with no coordination]
StagedDB: a new design to expose locality across threads
6
Staged Database Systems
• Organize system components into stages
• No need to change algorithms / structures
[Figure: queries flow straight into a conventional DBMS vs. through StagedDB's Stage 1, Stage 2, Stage 3]
High concurrency → locality across requests
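The staging idea can be sketched in a few lines: each stage owns a queue, and draining a whole queue before the next stage runs lets concurrent queries execute the same code back to back. A toy illustration only; the stage names and stage functions below are invented, not the real system's components.

```python
from collections import deque

class Stage:
    """One self-contained stage with its own queue of pending work."""
    def __init__(self, name, work):
        self.name, self.work, self.queue = name, work, deque()

def run(stages, queries):
    """Drain each stage's queue completely before the next stage runs, so
    all concurrent queries reuse the same (cache-resident) stage code."""
    for q in queries:
        stages[0].queue.append(q)
    results = []
    for i, st in enumerate(stages):
        while st.queue:
            out = st.work(st.queue.popleft())
            if i + 1 < len(stages):
                stages[i + 1].queue.append(out)
            else:
                results.append(out)
    return results

# Toy stand-ins for parse -> optimize -> execute:
stages = [Stage("parse", str.split),
          Stage("optimize", lambda toks: [t.upper() for t in toks]),
          Stage("execute", " ".join)]
print(run(stages, ["select name", "select age"]))
# -> ['SELECT NAME', 'SELECT AGE']
```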
7
Thesis
"By organizing and assigning system components into self-contained stages, database systems can exploit instruction and data commonality across concurrent requests, thereby improving performance."
8
Summary of main results
• 56% - 96% fewer I-misses
  • STEPS: full-system evaluation on Shore
• 1.2x - 2x throughput
  • QPipe: full-system evaluation on BerkeleyDB
[Figure: the memory hierarchy: L1 instruction/data caches, L2-L3, RAM, disks]
9
Contributions and dissemination
• Introduced StagedDB design
• Scheduling algorithms for staged systems
• Built novel query engine design
  • QPipe engine maximizes data and work sharing
• Addressed instruction cache in OLTP
  • STEPS applies to any DBMS with few changes
CMU-TR'02, CIDR'03, VLDB'04, SIGMOD'05, IEEE Data Eng. '05, ICDE'06 demo (submitted), CMU-TR'05, HDMS'05, VLDB J. (submitted), TODS (submitted)
10
Outline
• Introduction
• QPipe
• STEPS
• Conclusions
11
Query-centric design of DB engines
• Queries are evaluated independently
• No means to share across queries
• Need a new design to exploit common data, instructions, and work across operators
12
QPipe: operator-centric engine
• Conventional: "one query, many operators"
• QPipe: "one operator, many queries"
• Relational operators become Engines
• Queries break up into tasks and queue up
[Figure: at runtime, conventional per-query threads vs. QPipe per-operator queues]
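One way to picture the operator-centric idea: a dispatcher breaks each plan into per-operator packets and queues them at that operator's engine, so tasks from different queries meet at the same queue. The plan encoding and operator names here are invented for illustration.

```python
from collections import defaultdict, deque

def dispatch(plans):
    """Route each (operator, argument) task of every query plan to the
    queue of the matching operator engine, QPipe-style."""
    queues = defaultdict(deque)
    for qid, tasks in plans.items():
        for op, arg in tasks:
            queues[op].append((qid, arg))
    return queues

plans = {
    "Q1": [("scan", "ORDERS"), ("join", "o_orderkey = l_orderkey")],
    "Q2": [("scan", "ORDERS"), ("aggregate", "count(*)")],
}
queues = dispatch(plans)
# Both scans of ORDERS now wait in the same engine queue, where the scan
# engine can notice the overlap and serve them with one pass over the table.
```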
13
QPipe design
[Figure: conventional design: a thread pool evaluates query plans over the storage engine; QPipe: a packet dispatcher routes query packets (Q) to per-operator engines (Engine-S, Engine-J, Engine-A) that read/write through the storage engine]
14
Reusing data & work in QPipe
• Detect overlap at run time
• Shared pages and intermediate results are simultaneously pipelined to parent nodes
[Figure: Q1 and Q2 overlap in one operator; simultaneous pipelining feeds both from a single evaluation]
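A minimal sketch of the run-time sharing described above, under invented names: Q1's scan is under way; when Q2 attaches, it first copies the pages Q1 has already consumed, and from then on a single read feeds both consumers.

```python
def scan_with_sp(pages, attach_at):
    """Simulate simultaneous pipelining on a shared scan.

    Q1 scans all pages; Q2 attaches after 'attach_at' pages have been
    produced, copies that completed prefix, and both queries are then
    pipelined each remaining page from one read.
    Returns (q1_result, q2_result, number_of_page_reads).
    """
    q1, q2, reads = [], [], 0
    for i, page in enumerate(pages):
        if i == attach_at:
            q2.extend(q1)            # copy the already-produced prefix to Q2
        reads += 1                   # one physical read...
        q1.append(page)
        if i >= attach_at:
            q2.append(page)          # ...feeds both consumers
    return q1, q2, reads

pages = ["p%d" % i for i in range(8)]
q1, q2, reads = scan_with_sp(pages, attach_at=3)
# Without sharing, the two scans would cost 16 reads; with SP both queries
# see all 8 pages from 8 reads.
```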
15
Mechanisms for sharing
• Multi-query optimization: requires workload knowledge, not used in practice
• Materialized views: requires workload knowledge
• Buffer pool management: opportunistic
• Shared scans (RedBrick, Teradata, SQL Server): limited use
QPipe complements above approaches
16
Experimental setup
• QPipe prototype
  • Built on top of BerkeleyDB, 7,000 lines of C++
  • Shared-memory buffers, native OS threads
• Platform
  • 2GHz Pentium 4, 2GB RAM, 4 SCSI disks
• Benchmarks
  • TPC-H (4GB)
17
Sharing order-sensitive scans
[Figure: TPC-H Query 4 plans for Q1 and Q2: index scans of ORDERS and LINEITEM feed a merge-join, sort, and aggregate]
• Two clients send the query at different intervals
• QPipe performs 2 separate joins
[Figure: the merge-join's index-scan inputs are order-sensitive; only the order-insensitive part of the scans can be shared]
18
Sharing order-sensitive scans
• Two clients send the query at different intervals
• QPipe performs 2 separate joins
[Graph: total response time (sec, 0-300) vs. time difference between arrivals (0-140), Baseline vs. QPipe w/SP]
19
TPC-H workload
• Clients use a pool of 8 TPC-H queries
• QPipe reuses large scans, runs up to 2x faster
• ...while maintaining low response times
[Graph: throughput (queries/hr, 0-80) vs. number of clients (0-12), for QPipe w/SP, DBMS X, and Baseline]
20
QPipe: conclusions
• DB engines evaluate queries independently
• Limited existing mechanisms for sharing
• QPipe requires few code changes
• SP is a simple yet powerful technique
• Allows dynamic sharing of data and work
• Other benefits (not described here)
  • I-cache, D-cache performance
  • Efficiently execute MQO plans
21
Outline
• Introduction
• QPipe
• STEPS
• Conclusions
22
Online Transaction Processing
Need solution for instruction cache-residency
[Graph: L1-I cache sizes for various CPUs by year introduced ('96-'04) vs. max on-chip L2/L3 cache size; log scale, 10KB-10MB]
• High-end servers, non-I/O-bound
• L1-I stalls are 20-40% of execution time
• Instruction caches cannot grow
23
Related work
• Hardware and compiler approaches
  • Increased block size, stream buffer [Ranganathan98]
  • Code layout optimizations [Ramirez01]
• Database software approaches
  • Instruction cache for DSS [Padmanabhan01] [Zhou04]
  • Instruction cache for OLTP: challenging!
24
STEPS for cache-resident code
STEPS: Synchronized Transactions through Explicit Processor Scheduling
• Microbenchmark: eliminate 96% of L1-I misses
• TPC-C: eliminate 2/3 of misses, 1.4x speedup
[Figure: a transaction's code path: Begin, Select, Update, Insert, Delete, Commit]
Keep the thread model, insert sync points
(the code path is still larger than the I-cache)
Multiplex execution, reuse instructions
25
I-cache aware context-switching
[Figure: without STEPS, thread 1 and then thread 2 each run select() through s1-s7 and miss on every instruction; with STEPS, a context-switch (CTX) point fires once the executed code fills the I-cache, so thread 2 hits on the instructions thread 1 just loaded]
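The hit/miss pattern above can be reproduced with a tiny LRU cache model (a simplification: fully-associative cache, one instruction block per access, parameters invented). Executing the code path in cache-sized segments across all threads turns every thread after the first into cache hits.

```python
from collections import OrderedDict

def lru_misses(trace, cache_size):
    """Count misses of a fully-associative LRU cache over a block trace."""
    cache, misses = OrderedDict(), 0
    for blk in trace:
        if blk in cache:
            cache.move_to_end(blk)
        else:
            misses += 1
            cache[blk] = None
            if len(cache) > cache_size:
                cache.popitem(last=False)
    return misses

def no_steps(n_threads, n_blocks):
    """Each thread streams through the whole code path before the next starts."""
    return [b for _ in range(n_threads) for b in range(n_blocks)]

def with_steps(n_threads, n_blocks, segment):
    """CTX every 'segment' blocks: all threads replay the same
    cache-resident segment before execution moves on."""
    trace = []
    for start in range(0, n_blocks, segment):
        for _ in range(n_threads):
            trace.extend(range(start, min(start + segment, n_blocks)))
    return trace

THREADS, CODE, CACHE = 10, 100, 32   # code path 100 blocks, cache holds 32
print(lru_misses(no_steps(THREADS, CODE), CACHE))            # 1000 misses
print(lru_misses(with_steps(THREADS, CODE, CACHE), CACHE))   # 100 misses (90% fewer)
```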
26
Placing CTX calls in source
AutoSTEPS tool
[Figure: AutoSTEPS pipeline: DBMS binary → valgrind → instruction memory references → STEPS simulation → memory addresses for CTX → gdb → source lines to insert CTX (file1.c:30, file2.c:40, ...)]
• Comparable performance to manual placement
• ...while being more conservative
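The simulation step can be sketched as follows: walk the instruction trace and emit a CTX point whenever the code executed since the last switch stops fitting in the I-cache. This is a toy stand-in (the block size and the distinct-block criterion are my simplifications); the real tool works on valgrind traces and maps the chosen addresses back to source lines with gdb.

```python
def pick_ctx_points(trace, cache_blocks, block_size=64):
    """Return trace indices where a context-switch call should go: one each
    time the instructions executed since the previous CTX span more
    distinct cache blocks than the I-cache holds."""
    points, window = [], set()
    for i, addr in enumerate(trace):
        window.add(addr // block_size)
        if len(window) > cache_blocks:
            points.append(i)          # CTX just before this instruction
            window = {addr // block_size}
    return points

# A straight-line trace touching 6 distinct 64-byte blocks with a 2-block
# "cache": a CTX lands every time a third new block appears.
trace = [0, 8, 64, 128, 192, 256, 320]
print(pick_ctx_points(trace, cache_blocks=2))
# -> [3, 5]
```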
27
Experimental setup (1st part)
• Implemented on top of Shore
• AMD AthlonXP
  • 64KB L1-I + 64KB L1-D, 256KB L2
• Microbenchmark
  • Index fetch on an in-memory index
• Fast CTX for both systems, warm cache
28
Microbenchmark: L1-I misses
STEPS eliminates 92-96% of misses for additional threads
[Graph: L1-I cache misses (1K-4K) vs. concurrent threads (1-10), Shore vs. Shore w/Steps, on AthlonXP]
29
L1-I misses & speedup
[Graphs: L1-I miss reduction % (40%-100%, with upper limit) and speedup (1.1-1.4) vs. concurrent threads (10-80), on AthlonXP]
Steps achieves max performance for 6-10 threads
• No need for larger thread groups
30
Challenges in full-system operation
So far:
• Threads are interested in the same Op
• Uninterrupted flow
• No thread scheduler
Full-system requirements:
• High concurrency on similar Ops
• Handle exceptions
  • Disk I/O, locks, latches, abort
• Co-exist with system threads
  • Deadlock detection, buffer pool housekeeping
31
System design
• Fast CTX through fixed scheduling
• Repair thread structures at exceptions
• Modify only the thread package
[Figure: execution teams of threads queue at STEPS wrappers for Ops X, Y, Z; a stray thread leaves its team for another Op]
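A sketch of the wrapper logic, under the invented names below: the team is scheduled with a fixed round-robin CTX loop, and any thread that raises an exception is peeled off as a stray so the fast path never has to handle it.

```python
class ExecutionTeam:
    """Threads executing the same Op, context-switched in a fixed order."""
    def __init__(self, threads):
        self.team = list(threads)
        self.strays = []

    def run_round(self, hits_exception):
        """One scheduling round: each team member runs its code segment;
        members that hit an exception (disk I/O, lock wait, abort) leave
        the team as strays and go back to the regular scheduler."""
        keep = []
        for t in self.team:
            (self.strays if hits_exception(t) else keep).append(t)
        self.team = keep

team = ExecutionTeam(["t1", "t2", "t3", "t4"])
team.run_round(lambda t: t == "t3")   # say t3 blocks on disk I/O
# team.team is now ['t1', 't2', 't4']; team.strays is ['t3']
```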
32
Experimental setup (2nd part)
• AMD AthlonXP
  • 64KB L1-I + 64KB L1-D, 256KB L2
• TPC-C (wholesale parts supplier)
  • 2GB RAM, 2 disks
  • 10-30 warehouses (1-3GB), 100-300 users
  • Zero think time, in-memory, lazy commits
33
One transaction: payment
• STEPS outperforms the baseline system
• 1.4x speedup, 65% fewer L1-I misses
[Graph: normalized cycles and L1-I misses (20%-100%) vs. number of users (100-300)]
34
Mix of four transactions
[Graph: normalized cycles and L1-I misses (20%-100%) vs. number of users (100-200)]
• Xaction mix reduces team size
• Still, 56% fewer L1-I misses
35
STEPS: conclusions
• STEPS can handle full OLTP workloads
• Significant improvements in TPC-C
  • 65% fewer L1-I misses
  • 1.2x - 1.4x speedup
STEPS minimizes both capacity and conflict misses without increasing I-cache size or associativity
36
StagedDB: future work
• Promising platform for chip multiprocessors
  • DBMS suffer from CPU-to-CPU cache misses
  • StagedDB allows work to follow data -- not the other way around!
• Resource scheduling
  • Stages cluster requests for DB locks, I/O
  • Potential for deeper, more effective scheduling
37
Conclusions
• New hardware, new requirements
• Server core design remains the same
• Need new design to fit modern hardware
StagedDB:Optimizes all memory hierarchy levels
Promising design for future installations
38
The speaker would like to thank:
his academic advisor, Anastassia Ailamaki,
his thesis committee members, Panos K. Chrysanthis, Christos Faloutsos, Todd C. Mowry, and Michael Stonebraker,
and his coauthors, Kun Gao, Vladislav Shkapenyuk, and Ryan Williams
Thank you
39
QPipe backup
40
An Engine in detail
• tuple batching → I-cache
• query grouping → I- & D-cache
[Figure: an Engine: a queue feeding the main routine, Engine parameters, a scheduling thread managing free and busy threads running the relational operator code, with simultaneous pipelining]
Harizopoulos04 (VLDB), Zhou03 (VLDB), Padmanabhan01 (ICDE), Zhou04 (SIGMOD)
41
Simultaneous Pipelining in QPipe
[Figure: without SP, Q1 writes the join's results and Q2 later reads them; with SP, (1) Q2 attaches to the running join, (2) the COMPLETE portion is copied to Q2, and the SP coordinator (3) reads remaining results and (4) pipelines them to both Q1 and Q2]
42
Sharing data & work across queries
[Figure: Query 1, "Find average age of students enrolled in both class A and class B": scans of TABLE A and TABLE B feed a merge-join and aggregate; Query 2, a max over a scan of TABLE A (data sharing opportunity); Query 3, a min over the same scans and merge-join (work sharing opportunity)]
43
Sharing opportunities at run time
• Q1 executes operator R
• Q2 arrives with R in its plan
[Figure: the sharing potential is the overlap between result production for R in Q1 and in Q2; without SP, Q1 writes R's results and Q2 reads them; with SP, the coordinator pipelines R's output to Q1 and Q2 simultaneously]
44
TPC-H workload
• Clients use a pool of 8 TPC-H queries
• QPipe reuses large scans, runs up to 2x faster
• ...while maintaining low response times
[Graph: throughput (queries/hr, 0-80) vs. number of clients (0-12), for QPipe w/SP, DBMS X, and Baseline]
[Graph: average response time (0-1200) vs. think time (0-240 sec), Baseline vs. QPipe w/SP]
45
STEPS backup
46
Smaller L1-I cache
[Graph: normalized cycles, L1-I misses, branch mispredictions, L1-D misses, branches, BTB misses, and instruction stalls (20%-120%; one bar at 209%) for AthlonXP and Pentium III, 10 threads]
Steps outperforms Shore even on smaller caches (PIII)
• 62-64% fewer mispredicted branches on both CPUs
47
SimFlex: L1-I misses
[Graph: simulated L1-I cache misses (2K-10K) vs. associativity (direct, 2-way, 4-way, 8-way, full) for Shore and Steps with 16KB, 32KB, and 64KB caches (MIN shown); 64-byte cache blocks, 10 threads, AthlonXP model]
Steps eliminates all capacity misses (16, 32KB caches)
• Up to 89% overall miss reduction (upper limit is 90%)
48
One Xaction: payment
Steps outperforms Shore
• 1.4x speedup, 65% fewer L1-I misses
• 48% fewer mispredicted branches
[Graph: normalized cycles, L1-I, L1-D, L2-I, L2-D misses, and mispredicted branches (20%-100%) vs. number of warehouses (10-30)]
49
Mix of four Xactions
[Graph: normalized cycles, L1-I, L1-D, L2-I, L2-D misses, and mispredicted branches vs. number of warehouses (10-20); two bars exceed 100% (121%, 125%)]
• Xaction mix reduces average team size (4.3 in 10W)
• Still, Steps has 56% fewer L1-I misses (out of 77% max)