Weaving Relations for Cache Performance

Anastassia Ailamaki (Carnegie Mellon); David DeWitt, Mark Hill, and Marios Skounakis (University of Wisconsin-Madison)


Transcript of Weaving Relations for Cache Performance

Page 1: Weaving Relations for Cache Performance

Weaving Relations for Cache Performance

Anastassia Ailamaki, Carnegie Mellon

David DeWitt, Mark Hill, and Marios Skounakis, University of Wisconsin-Madison

Page 2: Weaving Relations for Cache Performance

Memory Hierarchies

[Diagram: PROCESSOR EXECUTION PIPELINE -> L1 I-CACHE / L1 D-CACHE -> L2 CACHE -> MAIN MEMORY]

Cache misses are extremely expensive ($$$)

Page 3: Weaving Relations for Cache Performance

[Chart: cycles per instruction and memory latency (in cycles), VAX 11/780 vs. Pentium II Xeon]

1 access to memory ≈ 1000 instruction opportunities

Processor/Memory Speed Gap

Page 4: Weaving Relations for Cache Performance

[Charts: memory stall time (%) for DBMSs A, B, C, D, broken into L1 Data, L2 Data, L1 Instruction, and L2 Instruction stalls, for three queries: range selection (no index), range selection (clustered index), and join (no index)]

Data accesses on caches: 19%-86% of memory stalls

PII Xeon running NT 4.0, 4 commercial DBMSs: A, B, C, D
Memory-related delays: 40%-80% of execution time

Breakdown of Memory Delays

Page 5: Weaving Relations for Cache Performance

Data Placement on Disk Pages

Slotted Pages:
- Used by all commercial DBMSs
- Store table records sequentially
- Intra-record locality (attributes of record r together)
- Doesn't work well on today's memory hierarchies

Alternative: Vertical partitioning [Copeland'85]
- Store n-attribute table as n single-attribute tables
- Inter-record locality, saves unnecessary I/O
- Destroys intra-record locality => expensive to reconstruct records

Contribution: Partition Attributes Across ... have the cake and eat it, too
- Inter-record locality + low record reconstruction cost
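To make the vertical-partitioning trade-off concrete, here is a minimal Python sketch (the table, record ids, and values are illustrative, not from the talk): each attribute lives in its own single-attribute table keyed by record id, so a single-attribute scan is cheap, but rebuilding a record takes one lookup (in general, one join) per attribute.

```python
# Illustrative sketch of vertical partitioning (DSM): an n-attribute table
# becomes n single-attribute tables keyed by record id. Scanning one
# attribute touches only that table; reconstructing a record touches all n.

ssn  = {1: 1237, 2: 4322, 3: 1563}
name = {1: "Jane", 2: "John", 3: "Jim"}
age  = {1: 30, 2: 45, 3: 20}

def scan_ages():
    # inter-record locality: only the age table is read
    return list(age.values())

def reconstruct(rid):
    # intra-record locality is gone: one access per attribute table
    return (ssn[rid], name[rid], age[rid])
```

PAX's goal, as the slide says, is to keep the cheap single-attribute scan while avoiding the per-attribute reconstruction cost.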

Page 6: Weaving Relations for Cache Performance

Outline
- The memory/processor speed gap
- What's wrong with slotted pages?
- Partition Attributes Across (PAX)
- Performance results
- Summary

Page 7: Weaving Relations for Cache Performance

[NSM page diagram: PAGE HEADER | RH1 1237 Jane 30 | RH2 4322 John 45 | RH3 1563 Jim 20 | RH4 7658 Susan 52 | ...]

Table R:

RID  SSN   Name   Age
1    1237  Jane   30
2    4322  John   45
3    1563  Jim    20
4    7658  Susan  52
5    2534  Leon   43
6    8791  Dan    37

- Records are stored sequentially
- Offsets to start of each record at end of page

Formal name: NSM (N-ary Storage Model)

Current Scheme: Slotted Pages
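The slotted-page scheme above can be sketched as follows. This is a simplified model, not Shore's or any commercial DBMS's actual page format: records grow from the front of the page and the slot array of record offsets conceptually grows backward from the end.

```python
# Simplified sketch of an NSM slotted page: record bytes grow from the
# front; an array of offsets to each record's start sits (conceptually)
# at the end of the page. Record headers and free-space compaction are
# omitted -- this is a model, not a real DBMS page.

PAGE_SIZE = 8192
SLOT_SIZE = 4  # bytes per slot entry

class SlottedPage:
    def __init__(self):
        self.data = bytearray(PAGE_SIZE)
        self.free = 0      # next free byte for record data
        self.slots = []    # offsets to the start of each record

    def insert(self, record: bytes) -> int:
        # record data and its new slot must both fit in the remaining gap
        if self.free + len(record) > PAGE_SIZE - SLOT_SIZE * (len(self.slots) + 1):
            raise ValueError("page full")
        off = self.free
        self.data[off:off + len(record)] = record
        self.free += len(record)
        self.slots.append(off)
        return len(self.slots) - 1   # slot number of the new record

    def record(self, slot: int) -> bytes:
        start = self.slots[slot]
        end = self.slots[slot + 1] if slot + 1 < len(self.slots) else self.free
        return bytes(self.data[start:end])
```

The key property for the next slide: all attributes of one record are adjacent in the byte range, so a scan over a single attribute still walks the whole record area.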

Page 8: Weaving Relations for Cache Performance

select name
from R
where age > 50

[Diagram: NSM page in main memory; evaluating the predicate pulls whole cache blocks into the cache (block 1: 30 Jane RH..., block 2: 45 RH3 1563, block 3: Jim 20 RH4, block 4: 52 2534 Leon), non-referenced fields included]

NSM pushes non-referenced data to the cache

Predicate Evaluation using NSM
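A rough back-of-the-envelope model of this effect (field sizes and the cache block size are made up for illustration): reading one field per record in row-major layout drags in roughly one cache block per record, whereas contiguously stored values pack many per block.

```python
# Rough model of the slide's point: with NSM, evaluating "age > 50" reads
# one field per record, but each read fetches the cache block holding the
# surrounding record bytes. Field sizes and block size are illustrative.

FIELDS = [("ssn", 4), ("name", 20), ("age", 4)]   # fixed-length layout
REC_SIZE = sum(size for _, size in FIELDS)
AGE_OFF = 4 + 20          # byte offset of age within a record
CACHE_BLOCK = 32

def nsm_blocks_touched(n_records: int) -> int:
    # distinct cache blocks fetched just to read every record's age
    return len({(r * REC_SIZE + AGE_OFF) // CACHE_BLOCK
                for r in range(n_records)})

def contiguous_blocks_touched(n_records: int) -> int:
    # if ages were stored back to back, 4-byte ages pack 8 per 32-byte block
    age_size = 4
    return (n_records * age_size + CACHE_BLOCK - 1) // CACHE_BLOCK
```

Under these illustrative sizes, scanning 100 ages touches ~88 blocks with NSM but only 13 if the ages are contiguous, which is the motivation for the layout on the next slides.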

Page 9: Weaving Relations for Cache Performance

Need New Data Page Layout

- Eliminates unnecessary memory accesses
- Improves inter-record locality
- Keeps a record's fields together
- Does not affect I/O performance

and, most importantly, is…

low-implementation-cost, high-impact

Page 10: Weaving Relations for Cache Performance

[NSM PAGE: PAGE HEADER | RH1 1237 Jane 30 | RH2 4322 John 45 | RH3 1563 Jim 20 | RH4 7658 Susan 52]

[PAX PAGE: PAGE HEADER | 1237 4322 1563 7658 | Jane John Jim Susan | 30 45 20 52]

Partition data within the page for spatial locality

Partition Attributes Across (PAX)
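The per-page partitioning can be sketched minimally in Python (fixed-length attributes only; the page header, presence bits, and v-offsets shown later in the talk are omitted here):

```python
# Minimal sketch of the PAX idea: within one page, values of each attribute
# are grouped into their own "minipage". A scan over one attribute reads
# contiguous values, while all fields of a record stay on the same page.

class PaxPage:
    def __init__(self, attrs):
        self.attrs = list(attrs)
        self.minipages = {a: [] for a in self.attrs}

    def insert(self, record):
        # same records an NSM page would hold; only the placement differs
        for attr, value in zip(self.attrs, record):
            self.minipages[attr].append(value)

    def scan(self, attr):
        # touches only the requested attribute's minipage
        return self.minipages[attr]

    def reconstruct(self, i):
        # cheap record reconstruction: every field of record i is on-page
        return tuple(self.minipages[a][i] for a in self.attrs)
```

For example, `PaxPage(["ssn", "name", "age"])` filled with the rows of table R lets `scan("age")` walk only the age minipage, while `reconstruct(i)` needs no cross-table join, unlike vertical partitioning.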

Page 11: Weaving Relations for Cache Performance

select name
from R
where age > 50

[Diagram: PAX page in main memory; evaluating the predicate pulls only the age minipage block (30 45 20 52) into the cache]

Fewer cache misses, low reconstruction cost

Predicate Evaluation using PAX

Page 12: Weaving Relations for Cache Performance

[NSM record layout: HEADER | FIXED-LENGTH VALUES | VARIABLE-LENGTH VALUES, where the header holds offsets to variable-length fields, a null bitmap, record length, etc.]

NSM: all fields of a record stored together, plus slots

A Real NSM Record

Page 13: Weaving Relations for Cache Performance

[PAX page diagram: Page Header (pid, # attributes, attribute sizes, free space, # records), followed by one minipage per attribute: F-minipages for fixed-length attributes (values + presence bits), a V-minipage for variable-length attributes (values + v-offsets)]

PAX: Detailed Design

PAX: groups fields of the same attribute together and amortizes record header overhead across records
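The two minipage kinds named on this slide can be sketched as follows; data types are simplified (a real page packs everything as bytes within a fixed-size page), so treat this as an illustration of the bookkeeping, not the actual Shore implementation.

```python
# Sketch of PAX's two minipage kinds. F-minipages hold fixed-length values
# plus presence bits (for nullable values); V-minipages hold variable-length
# values back to back plus v-offsets marking where each value ends.

class FMinipage:
    def __init__(self):
        self.values = []
        self.presence = []          # 1 if the value is present, 0 if null

    def append(self, v):
        self.presence.append(0 if v is None else 1)
        self.values.append(v)

class VMinipage:
    def __init__(self):
        self.data = bytearray()
        self.v_offsets = []         # offset just past the end of each value

    def append(self, v: bytes):
        self.data += v
        self.v_offsets.append(len(self.data))

    def get(self, i: int) -> bytes:
        start = self.v_offsets[i - 1] if i > 0 else 0
        return bytes(self.data[start:self.v_offsets[i]])
```

The v-offsets play the role that per-record field offsets play in an NSM record header, but stored once per value in the minipage, which is the amortization the caption refers to.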

Page 14: Weaving Relations for Cache Performance

Outline
- The memory/processor speed gap
- What's wrong with slotted pages?
- Partition Attributes Across (PAX)
- Performance results
- Summary

Page 15: Weaving Relations for Cache Performance

Main-memory resident R, numeric fields

Query:

select avg(ai)
from R
where aj >= Lo and aj <= Hi

- PII Xeon running Windows NT 4
- 16KB L1-I, 16KB L1-D, 512 KB L2, 512 MB RAM
- Used processor counters
- Implemented schemes on Shore Storage Manager

Similar behavior to commercial Database Systems

Sanity Check: Basic Evaluation

Page 16: Weaving Relations for Cache Performance

Execution time breakdown
[Chart: % execution time for DBMSs A, B, C, D and Shore, broken into Computation, Memory, Branch mispredictions, and Resource stalls]

Memory stall time breakdown
[Chart: memory stall time (%) for DBMSs A, B, C, D and Shore, broken into L1 Data, L2 Data, L1 Instruction, and L2 Instruction stalls]

We can use Shore to evaluate DSS workload behavior

Compare Shore query behavior with commercial DBMSs:
execution time & memory delays (range selection)

Why Use Shore?

Page 17: Weaving Relations for Cache Performance

Sensitivity to Selectivity
[Chart: L2 data stall cycles per record for NSM vs. PAX at selectivities 1%, 5%, 10%, 20%, 50%, 100%]

- PAX saves 70% of NSM's data cache penalty
- PAX reduces cache misses at both L1 and L2
- Selectivity doesn't matter for PAX data stalls

Effect on Accessing Cache Data

Cache data stalls
[Chart: stall cycles per record for NSM vs. PAX page layouts, split into L1 Data stalls and L2 Data stalls]

Page 18: Weaving Relations for Cache Performance

- PAX: 75% less memory penalty than NSM (10% of time)
- Execution times converge as the number of attributes increases

Execution time breakdown
[Chart: clock cycles per record for NSM vs. PAX page layouts, broken into Computation, Memory, Branch Mispredict, and Hardware Resource stalls]

Sensitivity to # of attributes
[Chart: elapsed time (sec) for NSM vs. PAX as the # of attributes in the record grows: 2, 4, 8, 16, 32, 64]

Time and Sensitivity Analysis

Page 19: Weaving Relations for Cache Performance

100M, 200M, and 500M TPC-H DBs

Queries:
1. Range selections w/ variable parameters (RS)
2. TPC-H Q1 and Q6
   - sequential scans
   - lots of aggregates (sum, avg, count)
   - grouping/ordering of results
3. TPC-H Q12 and Q14
   - (Adaptive Hybrid) Hash Join
   - complex 'where' clause, conditional aggregates

128MB buffer pool

Evaluation Using a DSS Benchmark

Page 20: Weaving Relations for Cache Performance

PAX/NSM Speedup on PII/NT
[Chart: PAX/NSM speedup for queries RS, Q1, Q6, Q12, Q14 at 100 MB, 200 MB, and 500 MB database sizes]

- PAX improves performance even with I/O
- Speedup differs across DB sizes

TPC-H Queries: Speedup

Page 21: Weaving Relations for Cache Performance

Updates

- Policy: update in place
- Variable-length values: shift when needed
- PAX only needs to shift minipage data

Update statement:

update R
set ap = ap + b
where aq > Lo and aq < Hi
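The in-place update policy can be sketched on a PAX-style page: the predicate is evaluated against one minipage and the target attribute is modified in place in its own minipage, leaving every other minipage untouched. The dict-of-lists page model and attribute names are illustrative, covering only the fixed-length case (no shifting).

```python
# Sketch of an in-place PAX update (fixed-length case): only the predicate
# attribute's minipage is scanned and only the target attribute's minipage
# is written; all other minipages stay untouched.

def pax_update(minipages, target, pred, lo, hi, b):
    """update R set target = target + b where lo < pred < hi"""
    for i, q in enumerate(minipages[pred]):
        if lo < q < hi:
            minipages[target][i] += b   # update in place
    return minipages
```

Variable-length values would additionally need the within-minipage shift mentioned in the bullets above.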

Page 22: Weaving Relations for Cache Performance

PAX/NSM Speedup on PII/NT
[Chart: PAX/NSM speedup vs. number of updated attributes (1-7), at selectivities 2%, 10%, 20%, 50%, 100%]

- PAX always speeds queries up (7-17%)
- Lower selectivity => reads dominate speedup
- High selectivity => speedup dominated by write-backs

Updates: Speedup

Page 23: Weaving Relations for Cache Performance

PAX: a low-cost, high-impact data placement technique

Performance
- Eliminates unnecessary memory references
- High utilization of cache space/bandwidth
- Faster than NSM (does not affect I/O)

Usability
- Orthogonal to other storage decisions
- "Easy" to implement in large existing DBMSs

Summary