Improving Index Performance through Prefetching


Page 1: Improving Index Performance  through Prefetching


Improving Index Performance through Prefetching

Shimin Chen, Phillip B. Gibbons† and Todd C. Mowry

School of Computer Science, Carnegie Mellon University

†Information Sciences Research Center, Bell Laboratories

Page 2: Improving Index Performance  through Prefetching


Databases and the Memory Hierarchy

Traditional Focus: buffer pool management (DRAM as a cache for disk)

Important Focus Today: processor cache performance (SRAM as a cache for DRAM), e.g., [Ailamaki et al, VLDB '99], etc.

[Figure: memory hierarchy (CPU, L1 cache, L2/L3 cache, main memory, disk); levels get larger, slower, and cheaper moving away from the CPU]

Page 3: Improving Index Performance  through Prefetching


Index Structures

Used extensively in databases to accelerate selections, joins, etc.

Common Implementation: B+-Trees

[Figure: B+-Tree structure, showing non-leaf nodes and leaf nodes]

Page 4: Improving Index Performance  through Prefetching


B+-Tree Indices: Common Access Patterns

Search: locate a single tuple

Range Scan: locate a collection of tuples within a range

Page 5: Improving Index Performance  through Prefetching


Cache Performance of B+-Tree Indices

A main-memory B+-Tree containing 10M keys:
Search: 100K random searches
Scan: 100 range scans of 1M keys each, starting at random keys
Detailed simulations based on a Compaq ES40 system

Most of the execution time is wasted on data cache misses: 65% for searches, 84% for range scans

[Chart: execution time breakdown into Data Cache Stalls, Other Stalls, Busy Time]

Page 6: Improving Index Performance  through Prefetching


B+-Trees: Optimizing Search for Cache vs. Disk

To minimize the number of data transfers (I/O or cache misses):
Optimal Node Width = Natural Data Transfer Size
for disk: disk page size (~8 Kbytes)
for cache: cache line size (~64 bytes)

Optimizing for cache means much narrower nodes and taller trees; search performance is more sensitive to changes in branching factors

[Figure: a wide, shallow tree optimized for disk vs. a narrow, tall tree optimized for cache]

Page 7: Improving Index Performance  through Prefetching


Previous Work: "Cache-Sensitive B+-Trees", Rao and Ross [SIGMOD 2000]

Key insight:
nearly all child pointers can be eliminated by restricting the data layout
this doubles the branching factor of cache-line-sized non-leaf nodes

[Figure: node layouts of B+-Trees vs. CSB+-Trees]

Page 8: Improving Index Performance  through Prefetching


Impact of CSB+-Trees on Search Performance

Search is 15% faster due to the reduction in tree height

However:
update performance is worse [Rao & Ross, SIGMOD '00]
range scan performance does not improve

There is still significant room for improvement

[Chart: execution time breakdown (Data Cache Stalls, Other Stalls, Busy Time) for B+-Tree vs. CSB+-Tree]

Page 9: Improving Index Performance  through Prefetching


Latency Tolerance in Modern Memory Hierarchies

[Figure: CPU, L1 cache, L2/L3 cache, main memory, with several prefetches in flight: pref 0(r2), pref 4(r7), pref 0(r3), pref 8(r9)]

Modern processors overlap multiple simultaneous cache misses, e.g., the Compaq ES40 supports 8 off-chip misses per processor

Prefetch instructions allow software to fully exploit this parallelism

What dictates performance: NOT simply the number of cache misses, but rather the amount of exposed miss latency
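
As a concrete illustration (not from the slides; GCC's __builtin_prefetch is just one way to emit such instructions), software can issue several independent prefetches back to back so their misses overlap:

#include <stddef.h>

/* Issue prefetches for several independent addresses in a row; on a
   machine like the ES40, up to 8 of the resulting off-chip misses can
   be serviced in parallel instead of one after another. */
void prefetch_all(void *const addrs[], size_t n)
{
    for (size_t i = 0; i < n; i++)
        __builtin_prefetch(addrs[i], 0 /* read */, 3 /* keep in cache */);
}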

Page 10: Improving Index Performance  through Prefetching


Our Approach

New Proposal: "Prefetching B+-Trees" (pB+-Trees): use prefetching to reduce the amount of exposed miss latency

Key Challenge: data dependences caused by chasing pointers

Benefits: significant performance gains for:
searches
range scans
updates (!)

complementary to CSB+-Trees

Page 11: Improving Index Performance  through Prefetching


Overview

Prefetching Searches

Prefetching Range Scans

Experimental Results

Conclusions

Page 12: Improving Index Performance  through Prefetching


Example: Search where Node Width = 1 Line

1000 keys, 64B lines, 4B keys, ptrs & tupleIDs: 4 levels in the B+-Tree (cold cache)

[Timeline: four serialized 150-cycle cache misses, one per level, completing at 600 cycles]

We suffer one full cache miss at each level of the tree.

Page 13: Improving Index Performance  through Prefetching


Same Example where Node Width = 2 Lines

3 levels in the tree, but two misses per node

[Timelines: 1-line nodes incur 4 serialized misses (600 cycles); 2-line nodes incur 6 serialized misses (900 cycles)]

Additional misses per node dominate the reduction in the number of levels.

Page 14: Improving Index Performance  through Prefetching


How Things Change with Prefetching

fetch all lines within a node in parallel: the # of misses goes up, but the exposed miss latency goes down

[Timelines: 1-line nodes, 4 serialized misses (600 cycles); 2-line nodes without prefetching, 6 serialized misses (900 cycles); 2-line nodes with prefetching, the two misses per node overlap, 3 levels complete in 480 cycles]

Page 15: Improving Index Performance  through Prefetching


pB+-Trees: Using Prefetching to Improve Search

Basic Idea: make nodes wider than the natural data transfer size, e.g., 8 cache lines wide; prefetch all lines of a node before searching in the node

Improved Search Performance:
larger branching factors, shallower trees
the cost to access each node increases only slightly

Reduced Space Overhead: primarily due to fewer non-leaf nodes

Update Performance: ???
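
A minimal sketch of this idea in C (hypothetical node layout and field names, not the authors' code; __builtin_prefetch stands in for the simulator's prefetch instruction):

#define LINE_SIZE      64   /* cache line size assumed in the talk */
#define LINES_PER_NODE  8   /* the node width found to be optimal  */

typedef struct pnode {      /* hypothetical wide-node layout; the  */
    int nkeys;              /* paper packs keys and pointers to    */
    int keys[63];           /* fill exactly w cache lines          */
    struct pnode *child[64];
} pnode_t;

static pnode_t *search_node(pnode_t *n, int key)
{
    /* Fetch all lines of the node in parallel before touching it,
       so the extra lines cost pipelined (not full) miss latencies. */
    for (int i = 0; i < LINES_PER_NODE; i++)
        __builtin_prefetch((const char *)n + i * LINE_SIZE, 0, 3);

    /* Then search within the node as usual. */
    int lo = 0, hi = n->nkeys;
    while (lo < hi) {
        int mid = (lo + hi) / 2;
        if (key < n->keys[mid]) hi = mid;
        else                    lo = mid + 1;
    }
    return n->child[lo];
}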

Page 16: Improving Index Performance  through Prefetching


Overview

Prefetching Searches

Prefetching Range Scans

Experimental Results

Conclusions

Page 17: Improving Index Performance  through Prefetching


Range Scan Cache Behavior: Normal B+-Trees

Steps in a Range Scan:
• search for the starting leaf node
• traverse the leaves until the end is found

[Timeline: one full 150-cycle cache miss per leaf node, fully serialized]

We suffer a full cache miss for each leaf node!

Page 18: Improving Index Performance  through Prefetching


If Prefetching Wider Nodes

e.g., node width = 2 lines

[Timelines: normal scan, one serialized miss per leaf (900 cycles); prefetching wider nodes, the misses within each node overlap (480 cycles)]

• Exposed miss latency is reduced by up to a factor of the node width.

A definite improvement, but can we still do better?

Page 19: Improving Index Performance  through Prefetching


The Ideal Case

Overlap misses until:
• all latency is hidden, or
• we run out of bandwidth

How can we achieve this?

[Timelines: serialized misses (900 cycles); per-node overlap (480 cycles); ideal full overlap (about 200 cycles)]

Page 20: Improving Index Performance  through Prefetching


The Pointer Chasing Problem

[Figure: currently visiting one leaf while wanting to prefetch a leaf several nodes ahead]

If we prefetch by chasing pointers, we still experience the full latency at each node

[Figure: directly prefetching the target leaf matches the ideal case]

Page 21: Improving Index Performance  through Prefetching


Our Solution: Jump Pointer Arrays

Put the leaf addresses in an array

Directly prefetch leaves by using the jump pointers

Back pointers are needed to initialize prefetching
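
A minimal sketch of a jump-pointer-driven scan (a flat array is shown for clarity; the chunked structure on the next slides is what is actually maintained; the prefetch distance K and all names are illustrative):

typedef struct leaf leaf_t;          /* opaque leaf node type         */
void consume_leaf(const leaf_t *l);  /* assumed per-leaf scan routine */

enum { K = 3 };   /* prefetch distance: # of leaves fetched ahead */

void range_scan(leaf_t *const jump[], size_t first, size_t last)
{
    for (size_t i = first; i <= last; i++) {
        /* The jump pointers let us prefetch leaf i+K directly,
           with no pointer chasing through intervening leaves.
           (A wide leaf would have each of its lines prefetched.) */
        if (i + K <= last)
            __builtin_prefetch(jump[i + K], 0, 3);
        consume_leaf(jump[i]);
    }
}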

Page 22: Improving Index Performance  through Prefetching


Our Solution: Jump Pointer Arrays

[Timeline: with jump pointers, the leaf misses are overlapped rather than serialized]

Page 23: Improving Index Performance  through Prefetching


External Jump Pointer Arrays: Efficient Updates

The impact of an insertion is limited to its chunk

Deletions leave empty slots

Actively interleave empty slots during bulkload and chunk splits

The back pointer to a position in the jump-pointer array is now a hint: it points to the correct chunk, but may require a local search within the chunk to initialize prefetching

[Figure: leaves with hint pointers into a chunked linked list]
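
One hypothetical way to lay out this structure in C (names and the chunk size are illustrative, not taken from the paper):

#define CHUNK_SLOTS 32   /* leaf addresses per chunk (illustrative) */

typedef struct leaf leaf_t;

/* One chunk of the external jump-pointer array: a fixed-size array of
   leaf addresses with NULL holes, interleaved at bulkload/split time
   so that insertions only shift pointers up to the nearest hole. */
typedef struct chunk {
    struct chunk *next;
    leaf_t *slot[CHUNK_SLOTS];
} chunk_t;

/* A leaf's back pointer is only a hint: it names the correct chunk,
   but the exact slot may have drifted, so starting a prefetch may
   need a short local search within that one chunk. */
typedef struct hint {
    chunk_t *chunk;
    int      pos;
} hint_t;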

Page 24: Improving Index Performance  through Prefetching


Alternative Design: Internal Jump-Pointer Arrays

B+-Trees already contain structures that point to the leaf nodes: the parents of the leaf nodes (the "bottom non-leaf nodes")

By linking them together, we can use them as a jump-pointer array

Tradeoff:
no need for back pointers, and simpler to maintain
consumes less space, though the external array's overhead is <1%
but less flexible: the chunk size is fixed by the B+-Tree structure

Page 25: Improving Index Performance  through Prefetching


Overview

Prefetching Searches

Prefetching Range Scans

Experimental Results:
search performance
range scan performance
update performance

Conclusions

Page 26: Improving Index Performance  through Prefetching


Experimental Framework

Results are for a main-memory database environment (we are extending this work to disk-based environments)

Executables: we added prefetch instructions to C source code by hand, then used gcc to generate optimized MIPS executables with prefetch instructions

Performance Measurement: detailed, cycle-by-cycle simulations

Machine Model: based on the Compaq ES40 system, with slightly updated parameters

Page 27: Improving Index Performance  through Prefetching


Simulation Parameters

Pipeline Parameters:
Clock Rate: 1 GHz
Issue Width: 4 insts/cycle
Functional Units: 2 Int, 2 FP, 2 Mem, 1 Branch
Reorder Buffer Size: 64 insts
Integer Multiply/Divide: 12/76 cycles
All Other Integer: 1 cycle
FP Divide/Square Root: 15/20 cycles
All Other FP: 2 cycles
Branch Prediction Scheme: gshare

Memory Parameters:
Line Size: 64 bytes
Primary Data Cache: 64 KB, 2-way set-assoc.
Primary Instruction Cache: 64 KB, 2-way set-assoc.
Miss Handlers: 32 for data, 2 for inst
Unified Secondary Cache: 2 MB, direct-mapped
Primary-to-Secondary Miss Latency: 15 cycles (plus contention)
Primary-to-Memory Miss Latency: 150 cycles (plus contention)
Main Memory Bandwidth: 1 access per 10 cycles

Models all the gory details, including memory system contention

Page 28: Improving Index Performance  through Prefetching


Index Search Performance

100K random searches after bulkload; trees 100% full (except the root); warm caches.

[Plot: time (M cycles) vs. # of tupleIDs in the trees (10^4 to 10^7), for B+tree, CSB+, p2B+tree, p4B+tree, p8B+tree, p16B+tree, p8CSB+]

pB+-Trees achieve 27-47% speedup vs. B+-Trees, 14-34% vs. CSB+-Trees
the optimal node width is 8 cache lines
pB+-Trees and CSB+-Trees are complementary: p8CSB+-Trees are best

Page 29: Improving Index Performance  through Prefetching


Same Search Experiments with Cold Caches

Large discrete steps within each curve. What is happening here?

[Plot: time (M cycles) vs. # of tupleIDs in the trees (10^4 to 10^7), same seven tree types]

100K random searches after bulkload; trees 100% full (except the root); cold caches (i.e., caches cleared after each search).

Page 30: Improving Index Performance  through Prefetching


Analysis of Cold Cache Search Behavior

The height of the tree dominates performance; this effect is blurred in the warm-cache case

For trees of the same height, the smaller the node size, the better

[Plot: same cold-cache search times as the previous slide]

# of Levels in the Trees (columns: number of keys):

Tree Type | 10K | 30K | 100K | 300K | 1M | 3M | 10M
B+        |  5  |  6  |  6   |  7   |  7 |  8 |  8
CSB+      |  4  |  5  |  5   |  5   |  6 |  6 |  7
p2B+      |  4  |  4  |  5   |  5   |  6 |  6 |  6
p4B+      |  3  |  3  |  4   |  4   |  4 |  5 |  5
p8B+      |  3  |  3  |  3   |  4   |  4 |  4 |  4
p16B+     |  2  |  3  |  3   |  3   |  3 |  4 |  4
p8CSB+    |  3  |  3  |  3   |  3   |  3 |  4 |  4

Page 31: Improving Index Performance  through Prefetching


Overview

Prefetching Searches

Prefetching Range Scans

Experimental Results:
search performance
range scan performance
update performance

Conclusions

Page 32: Improving Index Performance  through Prefetching


Index Range Scan Performance

Scans of 1K-1M keys: 6.5x-8.7x speedup over B+-Trees
a factor of 3.5-3.7 from prefetching wider nodes
an additional factor of ~2 from jump-pointer arrays

100 scans starting at random locations on an index bulkloaded with 3M keys (100% full)

[Plot, log scale: time (cycles) vs. # of tupleIDs scanned in a single call, for B+tree, p8B+tree, p8eB+tree, p8iB+tree]

Page 33: Improving Index Performance  through Prefetching


Index Range Scan Performance

Small scans (<1K keys): the overshooting cost is noticeable; exploit jump pointers only if the scan is expected to be large (e.g., search for the end first)

100 scans starting at random locations on an index bulkloaded with 3M keys (100% full)

[Plot, log scale: same four tree types, zoomed to small scan sizes]

Page 34: Improving Index Performance  through Prefetching


Overview

Prefetching Searches

Prefetching Range Scans

Experimental Results:
search performance
range scan performance
update performance

Conclusions

Page 35: Improving Index Performance  through Prefetching


Update Performance

pB+-Trees achieve at least a 1.24x speedup in all cases. Why?

100K random insertions/deletions on a 3M-key bulkloaded index; warm caches

[Plots: time (M cycles) vs. percentage of entries used in leaf nodes (50-100%), for insertions and deletions; B+tree, p8B+tree, p8eB+tree, p8iB+tree]

Page 36: Improving Index Performance  through Prefetching


Update Performance

Reason #1: faster search times

Reason #2: less frequent node splits with wider nodes

[Plots: same insertion/deletion experiments as the previous slide]

Page 37: Improving Index Performance  through Prefetching


pB+-Trees: Other Results

Similar results for:
varying bulkload factors of trees
large segmented range scans
mature trees
varying jump-pointer array parameters: prefetch distance, chunk size

Optimal node width: increases as memory bandwidth increases (matches the width predicted by our model in the paper)

Page 38: Improving Index Performance  through Prefetching


Cache Performance Revisited

Search: eliminated 45% of the original data cache stalls; 1.47x speedup

Scan: eliminated 97% of the original data cache stalls; 8-fold speedup

[Chart: execution time breakdown into Data Cache Stalls, Other Stalls, Busy Time]

Page 39: Improving Index Performance  through Prefetching


Conclusions

Impact of Prefetching B+-Trees on performance:

Search: 1.27x-1.55x speedup over B+-Trees
wider nodes reduce the height of the tree and the # of expensive misses
outperform, and are complementary to, CSB+-Trees

Updates: 1.24x-1.52x speedup over B+-Trees
faster search and less frequent node splits
in contrast with significant slowdowns for CSB+-Trees

Range Scan: 6.5x-8.7x speedup over B+-Trees
wider nodes: factor of ~3.5 speedup
jump-pointer arrays: additional factor of ~2 speedup

Prefetching B+-Trees also reduce space overhead.

These benefits are likely to increase with future memory systems.

Applicable to other levels of the memory hierarchy (e.g., disks).

Page 40: Improving Index Performance  through Prefetching


Backup Slides

Page 41: Improving Index Performance  through Prefetching


Revisiting the Optimal Node Width for Searches

Total cache misses for a search: $w \cdot \lceil \log_{wm} N \rceil$ (misses per level $\times$ # of levels in the tree), which is minimized when $w = 1$, where

w = # of cache lines per node
m = # of child pointers per one-cache-line-wide node
N = # of tupleIDs in the index
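
Filling in the optimization step behind that claim (standard calculus; the slide states only the result):

\[
  \mathrm{misses}(w) \;=\; w\,\bigl\lceil \log_{wm} N \bigr\rceil
  \;\approx\; \frac{w \ln N}{\ln(wm)},
  \qquad
  \frac{d}{dw}\!\left[\frac{w}{\ln(wm)}\right]
  \;=\; \frac{\ln(wm)-1}{\ln^{2}(wm)} \;=\; 0
  \;\Longrightarrow\; wm = e .
\]

Since even a one-line node holds m > e child pointers, misses(w) is increasing for every integer w >= 1, so w = 1 minimizes the total.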

Page 42: Improving Index Performance  through Prefetching


Scheduling Prefetches Early Enough

currently visiting n_i; want to prefetch n_{i+3}

p = &n0;
while (p) {
    work(p->data);
    p = p->next;
}

(L = time to load a node, W = time for work(), P = prefetch)

Our Goal: fully hide latency, thus achieving the fastest possible computation rate of 1/W

e.g., if L = 3W, we must prefetch 3 nodes ahead to achieve this.

Page 43: Improving Index Performance  through Prefetching


Performance without Prefetching

while (p) {
    work(p->data);
    p = p->next;
}

[Timeline: L_i (loading n_i) and W_i (work(n_i)) alternate serially for n_i, n_{i+1}, n_{i+2}, n_{i+3}]

Computation rate = 1/(L+W)

Page 44: Improving Index Performance  through Prefetching


Prefetching One Node Ahead

while (p) {
    pf(p->next);        /* prefetch n_{i+1} while visiting n_i */
    work(p->data);
    p = p->next;
}

[Timeline: L_{i+1} overlaps W_i, but each load still waits on the previous node's pointer (data dependence)]

• Computation is overlapped with memory accesses.

Computation rate = 1/L

Page 45: Improving Index Performance  through Prefetching


Prefetching Three Nodes Ahead

while (p) {
    pf(p->next->next->next);   /* prefetch n_{i+3} while visiting n_i */
    work(p->data);
    p = p->next;
}

[Timeline: the chained dereferences serialize the loads, so L_{i+1}, L_{i+2}, L_{i+3} still complete only one per L cycles]

Computation rate does not improve (still = 1/L)!

Pointer-Chasing Problem [Luk & Mowry, ASPLOS '96]:
• any scheme that follows the pointer chain is limited to a rate of 1/L

Page 46: Improving Index Performance  through Prefetching


Our Goal: Fully Hide Latency

while (p) {
    pf(&n_{i+3});       /* jump directly to n_{i+3}'s address */
    work(p->data);
    p = p->next;
}

[Timeline: loads L_{i+1}, L_{i+2}, L_{i+3} fully overlap the work W_i, W_{i+1}, ...]

Achieves the fastest possible computation rate of 1/W.

Page 47: Improving Index Performance  through Prefetching


Challenges in Supporting Efficient Updates

Conceptual view of the jump-pointer array: a flat array of leaf addresses, with back pointers from the leaves

What if we really implemented it this way?

• Insertion: could incur significant overheads
  • copying data within the array to create a new hole
  • updating back pointers
• Deletion: OK; just leave a hole

Page 48: Improving Index Performance  through Prefetching


Summary: Why We Expect Updates to Perform Well

Insertions: only a small number of jump pointers move (between the insertion point and the nearest hole in the chunk)
normally we only update the hint pointer for the inserted node, which does not require any significant overhead
significant overheads occur only on chunk splits, which are rare

Deletions: no data is moved (just leave an empty hole); no need to update any hints

In general, the jump-pointer array requires little concurrency control.

Page 49: Improving Index Performance  through Prefetching


B+-Trees Modeled and their Notations

B+-Trees: regular B+-Trees

CSB+-Trees: cache-sensitive B+-Trees [Rao & Ross, SIGMOD 2000]

pwB+-Trees: prefetching B+-Trees with node size = w cache lines and no jump-pointer arrays (we consider w = 2, 4, 8, and 16)

p8eB+-Trees: prefetching B+-Trees with node size = 8 cache lines and external jump-pointer arrays

p8iB+-Trees: prefetching B+-Trees with node size = 8 cache lines and internal jump-pointer arrays

p8CSB+-Trees: prefetching cache-sensitive B+-Trees with node size = 8 cache lines (and no jump-pointer arrays)

(Gory implementation details are in the paper.)

Page 50: Improving Index Performance  through Prefetching


Searches with Varying Bulkload Factors

Similar trends with smaller bulkload factors as when 100% full

Performance of pB+-Trees is somewhat less sensitive to the bulkload factor

[Plots: time (M cycles) vs. percentage of entries used in leaf nodes (50-100%), warm and cold caches, for B+tree, CSB+, p2B+tree, p4B+tree, p8B+tree, p16B+tree, p8CSB+]

Page 51: Improving Index Performance  through Prefetching


Range Scans with Varying Bulkload Factors

Prefetching B+-Trees offer:
larger speedups with smaller bulkload factors (more nodes to fetch)
less sensitivity of performance to the bulkload factor

[Plot: time (cycles, log scale) vs. percentage of entries used in leaf nodes, for B+tree, p8B+tree, p8eB+tree, p8iB+tree]

Page 52: Improving Index Performance  through Prefetching


Large Segmented Range Scans

1M keys, scanned in 1000-key segments

Similar performance gains as unsegmented scans

[Plot: time (cycles, log scale) vs. percentage of entries used in leaf nodes, same four tree types]

Page 53: Improving Index Performance  through Prefetching


Insertions with Cold Caches

[Plot: time (M cycles) vs. percentage of entries used in leaf nodes (50-100%), for B+tree, p8B+tree, p8eB+tree, p8iB+tree]

Page 54: Improving Index Performance  through Prefetching


Deletions with Cold Caches

[Plot: time (M cycles) vs. percentage of entries used in leaf nodes (50-100%), for B+tree, p8B+tree, p8eB+tree, p8iB+tree]

Page 55: Improving Index Performance  through Prefetching


Analysis of Node Splits upon Insertions

Bulkload factor = 60-90%: far fewer node splits
Bulkload factor = 100%: fewer node splits, and fewer non-leaf node splits (bars broken into "at least 2 splits", "one split", "no splits")

[Plot: insertions with node splits vs. percentage of entries used in leaf nodes (55-90%), for B+tree, p8B+tree, p8eB+tree, p8iB+tree]

Page 56: Improving Index Performance  through Prefetching


Mature Trees: Searches (Warm Caches)

[Plot: time (M cycles) vs. number of searches (x 1000, 40-200), for B+tree, p8B+tree, p8eB+tree, p8iB+tree]

Page 57: Improving Index Performance  through Prefetching


Mature Trees: Insertions (Warm Caches)

[Plot: time (M cycles) vs. number of insertions (x 1000, 40-200), for B+tree, p8B+tree, p8eB+tree, p8iB+tree]

• A CSB+-Tree could be 25% worse than a B+-Tree under the same mature-tree experiments (on a different hardware configuration)

• pB+-Trees are significantly faster than B+-Trees

Page 58: Improving Index Performance  through Prefetching


Mature Trees: Deletions (Warm Caches)

[Plot: time (M cycles) vs. number of deletions (x 1000, 40-200), for B+tree, p8B+tree, p8eB+tree, p8iB+tree]

Page 59: Improving Index Performance  through Prefetching


Mature Trees: Searches (Cold Caches)

[Plot: time (M cycles) vs. number of searches (x 1000, 40-200), for B+tree, p8B+tree, p8eB+tree, p8iB+tree]

Page 60: Improving Index Performance  through Prefetching


Mature Trees: Insertions (Cold Caches)

[Plot: time (M cycles) vs. number of insertions (x 1000, 40-200), for B+tree, p8B+tree, p8eB+tree, p8iB+tree]

Page 61: Improving Index Performance  through Prefetching


Mature Trees: Deletions (Cold Caches)

[Plot: time (M cycles) vs. number of deletions (x 1000, 40-200), for B+tree, p8B+tree, p8eB+tree, p8iB+tree]

Page 62: Improving Index Performance  through Prefetching


Mature Trees: Large Segmented Range Scans

[Bar chart of total scan time: B+tree 3537, p8B+ 825, p8eB+ 479, p8iB+ 452]

Page 63: Improving Index Performance  through Prefetching


Search varying memory bandwidth (warm cache)

[Plot: normalized execution time vs. normalized bandwidth B (5-30), for p2B+tree, p4B+tree, p8B+tree, p16B+tree, p19B+tree]

Even with a pessimistic B = 5, p8B+-Trees still achieve a significant speedup: 1.2x with warm caches

Page 64: Improving Index Performance  through Prefetching


Search varying memory bandwidth (cold cache)

[Plot: normalized execution time vs. normalized bandwidth B (5-30), same five tree types]

• Even when B = 5, a 1.3x speedup with cold caches

• The optimal value of w increases as B gets larger

Page 65: Improving Index Performance  through Prefetching


Scan varying prefetching distance (p8eB+-Tree)

[Plot: time (cycles) vs. entries scanned in a single call, for prefetch distances k = 2, 3, 4, 8, 16, 32]

• Not sensitive to moderate increases in the prefetching distance

• Though the overshooting cost shows up when the # of entries to scan is small

Page 66: Improving Index Performance  through Prefetching


Scan varying chunk size (p8eB+-Tree)

[Plot: time (cycles) vs. entries scanned in a single call, for chunk sizes c = 2, 4, 8, 16, 32]

Not sensitive to varying chunk size

Page 67: Improving Index Performance  through Prefetching


Table 1: Terminology

Variable: Definition
w: # of cache lines in an index node
m: # of child pointers in a one-line-wide node
N: # of <key, tupleID> pairs in an index
d: # of child pointers in a non-leaf node (= wm)
T1: full latency of a cache miss
Tnext: latency of an additional pipelined cache miss
B: normalized memory bandwidth (B = T1/Tnext)
K: # of nodes to prefetch ahead
C: # of cache lines in a jump-pointer array chunk
pwB+-Tree: plain pB+-Tree with w-line-wide nodes
pweB+-Tree: pwB+-Tree with external jump-pointer arrays
pwiB+-Tree: pwB+-Tree with internal jump-pointer arrays

Page 68: Improving Index Performance  through Prefetching


Search w/ & w/o Jump-Pointer Arrays: Cold Cache

[Plot: time (M cycles) vs. # of entries in leaf nodes (10^4 to 10^7), for p8B+tree, p8eB+tree, p8iB+tree; the steps correspond to different # of levels in the tree]

Page 69: Improving Index Performance  through Prefetching


Cache Performance Revisited

Search: eliminated 45% of the original data cache stalls; 1.47x speedup

Scan: eliminated 97% of the original data cache stalls; 8-fold speedup

[Chart: execution time breakdown into Data Cache Stalls, Other Stalls, Busy Time]

Page 70: Improving Index Performance  through Prefetching


Can We Do Even Better on Searches?

Hiding latency across levels is difficult given:
the data dependence through the child pointer
the relatively large branching factor of tree nodes
the equal likelihood of following any child (assuming uniformly distributed random search keys)

What if we prefetch a node's children in parallel with accessing it? There is a duality between this and creating wider nodes. BUT, this approach has the following relative disadvantages:
storage overhead for the child (or grandchild) pointers
the effective node size can only grow by multiples of the branching factor
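
A sketch of that alternative for contrast (hypothetical types and names; this is the approach the slide weighs against wider nodes, not the pB+-Tree design):

typedef struct node1 {      /* one-cache-line-wide node (illustrative) */
    int nkeys;
    int keys[7];
    struct node1 *child[8];
} node1_t;

node1_t *search_step(node1_t *n, int key)
{
    /* Prefetch every child while still searching n, so the next
       level's miss overlaps this one. The costs named above: the
       child pointers cannot be elided (unlike CSB+-Trees), and the
       fanout covered per step grows only in multiples of the
       branching factor. */
    for (int i = 0; i <= n->nkeys; i++)
        __builtin_prefetch(n->child[i], 0, 3);

    int s = 0;
    while (s < n->nkeys && key >= n->keys[s])  /* linear search */
        s++;
    return n->child[s];
}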