Improving Index Performance through Prefetching


Page 1: Improving Index Performance  through Prefetching


Improving Index Performance through Prefetching

Shimin Chen, Phillip B. Gibbons† and Todd C. Mowry

School of Computer Science, Carnegie Mellon University

†Information Sciences Research Center, Bell Laboratories

Page 2: Improving Index Performance  through Prefetching


Databases and the Memory Hierarchy

Traditional Focus: buffer pool management (DRAM as a cache for disk)

Important Focus Today: processor cache performance (SRAM as a cache for DRAM), e.g., [Ailamaki et al, VLDB '99], etc.

[Figure: memory hierarchy (CPU, L1 cache, L2/L3 cache, main memory, disk); levels get larger, slower, and cheaper moving away from the CPU]

Page 3: Improving Index Performance  through Prefetching


Index Structures

Used extensively in databases to accelerate selections, joins, etc.

Common Implementation: B+-Trees

[Figure: B+-Tree structure, showing non-leaf nodes and leaf nodes]

Page 4: Improving Index Performance  through Prefetching


B+-Tree Indices: Common Access Patterns

Search: locate a single tuple

Range Scan: locate a collection of tuples within a range

Page 5: Improving Index Performance  through Prefetching


Cache Performance of B+-Tree Indices

A main-memory B+-Tree containing 10M keys:
Search: 100K random searches
Scan: 100 range scans of 1M keys each, starting at random keys
Detailed simulations based on a Compaq ES40 system

Most of the execution time is wasted on data cache misses: 65% for searches, 84% for range scans

[Chart: execution time breakdown into Data Cache Stalls, Other Stalls, Busy Time]

Page 6: Improving Index Performance  through Prefetching


B+-Trees: Optimizing Search for Cache vs. Disk

To minimize the number of data transfers (I/O or cache misses):
Optimal Node Width = Natural Data Transfer Size
for disk: disk page size (~8 Kbytes)
for cache: cache line size (~64 bytes)

Optimizing for cache means much narrower nodes and taller trees; search performance is more sensitive to changes in branching factors

[Figure: a wide, shallow tree optimized for disk vs. a narrow, tall tree optimized for cache]

Page 7: Improving Index Performance  through Prefetching


Previous Work: "Cache-Sensitive B+-Trees", Rao and Ross [SIGMOD 2000]

Key insight:
nearly all child pointers can be eliminated by restricting the data layout
this doubles the branching factor of cache-line-sized non-leaf nodes

[Figure: node layouts of B+-Trees vs. CSB+-Trees]

Page 8: Improving Index Performance  through Prefetching


Impact of CSB+-Trees on Search Performance

Search is 15% faster due to the reduction in tree height

However:
update performance is worse [Rao & Ross, SIGMOD '00]
range scan performance does not improve

There is still significant room for improvement

[Chart: execution time breakdown (Data Cache Stalls, Other Stalls, Busy Time) for B+-Tree vs. CSB+-Tree]

Page 9: Improving Index Performance  through Prefetching


Latency Tolerance in Modern Memory Hierarchies

[Figure: CPU, L1 cache, L2/L3 cache, main memory, with several prefetches in flight: pref 0(r2), pref 4(r7), pref 0(r3), pref 8(r9)]

Modern processors overlap multiple simultaneous cache misses, e.g., the Compaq ES40 supports 8 off-chip misses per processor

Prefetch instructions allow software to fully exploit this parallelism

What dictates performance: NOT simply the number of cache misses, but rather the amount of exposed miss latency
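
As a concrete illustration (not from the slides; GCC's __builtin_prefetch is just one way to emit such instructions), software can issue several independent prefetches back to back so their misses overlap:

#include <stddef.h>

/* Issue prefetches for several independent addresses in a row; on a
   machine like the ES40, up to 8 of the resulting off-chip misses can
   be serviced in parallel instead of one after another. */
void prefetch_all(void *const addrs[], size_t n)
{
    for (size_t i = 0; i < n; i++)
        __builtin_prefetch(addrs[i], 0 /* read */, 3 /* keep in cache */);
}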

Page 10: Improving Index Performance  through Prefetching


Our Approach

New Proposal: "Prefetching B+-Trees" (pB+-Trees): use prefetching to reduce the amount of exposed miss latency

Key Challenge: data dependences caused by chasing pointers

Benefits: significant performance gains for:
searches
range scans
updates (!)

complementary to CSB+-Trees

Page 11: Improving Index Performance  through Prefetching


Overview

Prefetching Searches

Prefetching Range Scans

Experimental Results

Conclusions

Page 12: Improving Index Performance  through Prefetching


Example: Search where Node Width = 1 Line

1000 keys, 64B lines, 4B keys, ptrs & tupleIDs: 4 levels in the B+-Tree (cold cache)

[Timeline: four serialized 150-cycle cache misses, one per level, completing at 600 cycles]

We suffer one full cache miss at each level of the tree.

Page 13: Improving Index Performance  through Prefetching


Same Example where Node Width = 2 Lines

3 levels in the tree, but two misses per node

[Timelines: 1-line nodes incur 4 serialized misses (600 cycles); 2-line nodes incur 6 serialized misses (900 cycles)]

Additional misses per node dominate the reduction in the number of levels.

Page 14: Improving Index Performance  through Prefetching


How Things Change with Prefetching

fetch all lines within a node in parallel: the # of misses goes up, but the exposed miss latency goes down

[Timelines: 1-line nodes, 4 serialized misses (600 cycles); 2-line nodes without prefetching, 6 serialized misses (900 cycles); 2-line nodes with prefetching, the two misses per node overlap, 3 levels complete in 480 cycles]

Page 15: Improving Index Performance  through Prefetching


pB+-Trees: Using Prefetching to Improve Search

Basic Idea: make nodes wider than the natural data transfer size, e.g., 8 cache lines wide; prefetch all lines of a node before searching in the node

Improved Search Performance:
larger branching factors, shallower trees
the cost to access each node increases only slightly

Reduced Space Overhead: primarily due to fewer non-leaf nodes

Update Performance: ???
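
A minimal sketch of this idea in C (hypothetical node layout and field names, not the authors' code; __builtin_prefetch stands in for the simulator's prefetch instruction):

#define LINE_SIZE      64   /* cache line size assumed in the talk */
#define LINES_PER_NODE  8   /* the node width found to be optimal  */

typedef struct pnode {      /* hypothetical wide-node layout; the  */
    int nkeys;              /* paper packs keys and pointers to    */
    int keys[63];           /* fill exactly w cache lines          */
    struct pnode *child[64];
} pnode_t;

static pnode_t *search_node(pnode_t *n, int key)
{
    /* Fetch all lines of the node in parallel before touching it,
       so the extra lines cost pipelined (not full) miss latencies. */
    for (int i = 0; i < LINES_PER_NODE; i++)
        __builtin_prefetch((const char *)n + i * LINE_SIZE, 0, 3);

    /* Then search within the node as usual. */
    int lo = 0, hi = n->nkeys;
    while (lo < hi) {
        int mid = (lo + hi) / 2;
        if (key < n->keys[mid]) hi = mid;
        else                    lo = mid + 1;
    }
    return n->child[lo];
}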

Page 16: Improving Index Performance  through Prefetching


Overview

Prefetching Searches

Prefetching Range Scans

Experimental Results

Conclusions

Page 17: Improving Index Performance  through Prefetching


Range Scan Cache Behavior: Normal B+-Trees

Steps in a Range Scan:
• search for the starting leaf node
• traverse the leaves until the end is found

[Timeline: one full 150-cycle cache miss per leaf node, fully serialized]

We suffer a full cache miss for each leaf node!

Page 18: Improving Index Performance  through Prefetching


If Prefetching Wider Nodes

e.g., node width = 2 lines

[Timelines: normal scan, one serialized miss per leaf (900 cycles); prefetching wider nodes, the misses within each node overlap (480 cycles)]

• Exposed miss latency is reduced by up to a factor of the node width.

A definite improvement, but can we still do better?

Page 19: Improving Index Performance  through Prefetching


The Ideal Case

Overlap misses until:
• all latency is hidden, or
• we run out of bandwidth

How can we achieve this?

[Timelines: serialized misses (900 cycles); per-node overlap (480 cycles); ideal full overlap (about 200 cycles)]

Page 20: Improving Index Performance  through Prefetching


The Pointer Chasing Problem

[Figure: currently visiting one leaf while wanting to prefetch a leaf several nodes ahead]

If we prefetch by chasing pointers, we still experience the full latency at each node

[Figure: directly prefetching the target leaf matches the ideal case]

Page 21: Improving Index Performance  through Prefetching


Our Solution: Jump Pointer Arrays

Put the leaf addresses in an array

Directly prefetch leaves by using the jump pointers

Back pointers are needed to initialize prefetching
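
A minimal sketch of a jump-pointer-driven scan (a flat array is shown for clarity; the chunked structure on the next slides is what is actually maintained; the prefetch distance K and all names are illustrative):

typedef struct leaf leaf_t;          /* opaque leaf node type         */
void consume_leaf(const leaf_t *l);  /* assumed per-leaf scan routine */

enum { K = 3 };   /* prefetch distance: # of leaves fetched ahead */

void range_scan(leaf_t *const jump[], size_t first, size_t last)
{
    for (size_t i = first; i <= last; i++) {
        /* The jump pointers let us prefetch leaf i+K directly,
           with no pointer chasing through intervening leaves.
           (A wide leaf would have each of its lines prefetched.) */
        if (i + K <= last)
            __builtin_prefetch(jump[i + K], 0, 3);
        consume_leaf(jump[i]);
    }
}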

Page 22: Improving Index Performance  through Prefetching


Our Solution: Jump Pointer Arrays

[Timeline: with jump pointers, the leaf misses are overlapped rather than serialized]

Page 23: Improving Index Performance  through Prefetching


External Jump Pointer Arrays: Efficient Updates

The impact of an insertion is limited to its chunk

Deletions leave empty slots

Actively interleave empty slots during bulkload and chunk splits

The back pointer to a position in the jump-pointer array is now a hint: it points to the correct chunk, but may require a local search within the chunk to initialize prefetching

[Figure: leaves with hint pointers into a chunked linked list]
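
One hypothetical way to lay out this structure in C (names and the chunk size are illustrative, not taken from the paper):

#define CHUNK_SLOTS 32   /* leaf addresses per chunk (illustrative) */

typedef struct leaf leaf_t;

/* One chunk of the external jump-pointer array: a fixed-size array of
   leaf addresses with NULL holes, interleaved at bulkload/split time
   so that insertions only shift pointers up to the nearest hole. */
typedef struct chunk {
    struct chunk *next;
    leaf_t *slot[CHUNK_SLOTS];
} chunk_t;

/* A leaf's back pointer is only a hint: it names the correct chunk,
   but the exact slot may have drifted, so starting a prefetch may
   need a short local search within that one chunk. */
typedef struct hint {
    chunk_t *chunk;
    int      pos;
} hint_t;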

Page 24: Improving Index Performance  through Prefetching


Alternative Design: Internal Jump-Pointer Arrays

B+-Trees already contain structures that point to the leaf nodes: the parents of the leaf nodes (the "bottom non-leaf nodes")

By linking them together, we can use them as a jump-pointer array

Tradeoff:
no need for back pointers, and simpler to maintain
consumes less space, though the external array's overhead is <1%
but less flexible: the chunk size is fixed by the B+-Tree structure

Page 25: Improving Index Performance  through Prefetching


Overview

Prefetching Searches

Prefetching Range Scans

Experimental Results:
search performance
range scan performance
update performance

Conclusions

Page 26: Improving Index Performance  through Prefetching


Experimental Framework

Results are for a main-memory database environment (we are extending this work to disk-based environments)

Executables: we added prefetch instructions to C source code by hand, then used gcc to generate optimized MIPS executables with prefetch instructions

Performance Measurement: detailed, cycle-by-cycle simulations

Machine Model: based on the Compaq ES40 system, with slightly updated parameters

Page 27: Improving Index Performance  through Prefetching


Simulation Parameters

Pipeline Parameters:
Clock Rate: 1 GHz
Issue Width: 4 insts/cycle
Functional Units: 2 Int, 2 FP, 2 Mem, 1 Branch
Reorder Buffer Size: 64 insts
Integer Multiply/Divide: 12/76 cycles
All Other Integer: 1 cycle
FP Divide/Square Root: 15/20 cycles
All Other FP: 2 cycles
Branch Prediction Scheme: gshare

Memory Parameters:
Line Size: 64 bytes
Primary Data Cache: 64 KB, 2-way set-assoc.
Primary Instruction Cache: 64 KB, 2-way set-assoc.
Miss Handlers: 32 for data, 2 for inst
Unified Secondary Cache: 2 MB, direct-mapped
Primary-to-Secondary Miss Latency: 15 cycles (plus contention)
Primary-to-Memory Miss Latency: 150 cycles (plus contention)
Main Memory Bandwidth: 1 access per 10 cycles

Models all the gory details, including memory system contention

Page 28: Improving Index Performance  through Prefetching


Index Search Performance

100K random searches after bulkload; trees 100% full (except the root); warm caches.

[Plot: time (M cycles) vs. # of tupleIDs in the trees (10^4 to 10^7), for B+tree, CSB+, p2B+tree, p4B+tree, p8B+tree, p16B+tree, p8CSB+]

pB+-Trees achieve 27-47% speedup vs. B+-Trees, 14-34% vs. CSB+-Trees
the optimal node width is 8 cache lines
pB+-Trees and CSB+-Trees are complementary: p8CSB+-Trees are best

Page 29: Improving Index Performance  through Prefetching


Same Search Experiments with Cold Caches

Large discrete steps within each curve. What is happening here?

[Plot: time (M cycles) vs. # of tupleIDs in the trees (10^4 to 10^7), same seven tree types]

100K random searches after bulkload; trees 100% full (except the root); cold caches (i.e., caches cleared after each search).

Page 30: Improving Index Performance  through Prefetching


Analysis of Cold Cache Search Behavior

The height of the tree dominates performance; this effect is blurred in the warm-cache case

For trees of the same height, the smaller the node size, the better

[Plot: same cold-cache search times as the previous slide]

# of Levels in the Trees (columns: number of keys):

Tree Type | 10K | 30K | 100K | 300K | 1M | 3M | 10M
B+        |  5  |  6  |  6   |  7   |  7 |  8 |  8
CSB+      |  4  |  5  |  5   |  5   |  6 |  6 |  7
p2B+      |  4  |  4  |  5   |  5   |  6 |  6 |  6
p4B+      |  3  |  3  |  4   |  4   |  4 |  5 |  5
p8B+      |  3  |  3  |  3   |  4   |  4 |  4 |  4
p16B+     |  2  |  3  |  3   |  3   |  3 |  4 |  4
p8CSB+    |  3  |  3  |  3   |  3   |  3 |  4 |  4

Page 31: Improving Index Performance  through Prefetching


Overview

Prefetching Searches

Prefetching Range Scans

Experimental Results:
search performance
range scan performance
update performance

Conclusions

Page 32: Improving Index Performance  through Prefetching


Index Range Scan Performance

Scans of 1K-1M keys: 6.5x-8.7x speedup over B+-Trees
a factor of 3.5-3.7 from prefetching wider nodes
an additional factor of ~2 from jump-pointer arrays

100 scans starting at random locations on an index bulkloaded with 3M keys (100% full)

[Plot, log scale: time (cycles) vs. # of tupleIDs scanned in a single call, for B+tree, p8B+tree, p8eB+tree, p8iB+tree]

Page 33: Improving Index Performance  through Prefetching


Index Range Scan Performance

Small scans (<1K keys): the overshooting cost is noticeable; exploit jump pointers only if the scan is expected to be large (e.g., search for the end first)

100 scans starting at random locations on an index bulkloaded with 3M keys (100% full)

[Plot, log scale: same four tree types, zoomed to small scan sizes]

Page 34: Improving Index Performance  through Prefetching


Overview

Prefetching Searches

Prefetching Range Scans

Experimental Results:
search performance
range scan performance
update performance

Conclusions

Page 35: Improving Index Performance  through Prefetching


Update Performance

pB+-Trees achieve at least a 1.24x speedup in all cases. Why?

100K random insertions/deletions on a 3M-key bulkloaded index; warm caches

[Plots: time (M cycles) vs. percentage of entries used in leaf nodes (50-100%), for insertions and deletions; B+tree, p8B+tree, p8eB+tree, p8iB+tree]

Page 36: Improving Index Performance  through Prefetching


Update Performance

Reason #1: faster search times

Reason #2: less frequent node splits with wider nodes

[Plots: same insertion/deletion experiments as the previous slide]

Page 37: Improving Index Performance  through Prefetching


pB+-Trees: Other Results

Similar results for:
varying bulkload factors of trees
large segmented range scans
mature trees
varying jump-pointer array parameters: prefetch distance, chunk size

Optimal node width: increases as memory bandwidth increases (matches the width predicted by our model in the paper)

Page 38: Improving Index Performance  through Prefetching


Cache Performance Revisited

Search: eliminated 45% of the original data cache stalls; 1.47x speedup

Scan: eliminated 97% of the original data cache stalls; 8-fold speedup

[Chart: execution time breakdown into Data Cache Stalls, Other Stalls, Busy Time]

Page 39: Improving Index Performance  through Prefetching


Conclusions

Impact of Prefetching B+-Trees on performance:

Search: 1.27x-1.55x speedup over B+-Trees
wider nodes reduce the height of the tree and the # of expensive misses
outperform, and are complementary to, CSB+-Trees

Updates: 1.24x-1.52x speedup over B+-Trees
faster search and less frequent node splits
in contrast with significant slowdowns for CSB+-Trees

Range Scan: 6.5x-8.7x speedup over B+-Trees
wider nodes: factor of ~3.5 speedup
jump-pointer arrays: additional factor of ~2 speedup

Prefetching B+-Trees also reduce space overhead.

These benefits are likely to increase with future memory systems.

Applicable to other levels of the memory hierarchy (e.g., disks).

Page 40: Improving Index Performance  through Prefetching


Backup Slides

Page 41: Improving Index Performance  through Prefetching


Revisiting the Optimal Node Width for Searches

Total cache misses for a search: $w \cdot \lceil \log_{wm} N \rceil$ (misses per level $\times$ # of levels in the tree), which is minimized when $w = 1$, where

w = # of cache lines per node
m = # of child pointers per one-cache-line-wide node
N = # of tupleIDs in the index
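
Filling in the optimization step behind that claim (standard calculus; the slide states only the result):

\[
  \mathrm{misses}(w) \;=\; w\,\bigl\lceil \log_{wm} N \bigr\rceil
  \;\approx\; \frac{w \ln N}{\ln(wm)},
  \qquad
  \frac{d}{dw}\!\left[\frac{w}{\ln(wm)}\right]
  \;=\; \frac{\ln(wm)-1}{\ln^{2}(wm)} \;=\; 0
  \;\Longrightarrow\; wm = e .
\]

Since even a one-line node holds m > e child pointers, misses(w) is increasing for every integer w >= 1, so w = 1 minimizes the total.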

Page 42: Improving Index Performance  through Prefetching


Scheduling Prefetches Early Enough

currently visiting n_i; want to prefetch n_{i+3}

p = &n0;
while (p) {
    work(p->data);
    p = p->next;
}

(L = time to load a node, W = time for work(), P = prefetch)

Our Goal: fully hide latency, thus achieving the fastest possible computation rate of 1/W

e.g., if L = 3W, we must prefetch 3 nodes ahead to achieve this.

Page 43: Improving Index Performance  through Prefetching


Performance without Prefetching

while (p) {
    work(p->data);
    p = p->next;
}

[Timeline: L_i (loading n_i) and W_i (work(n_i)) alternate serially for n_i, n_{i+1}, n_{i+2}, n_{i+3}]

Computation rate = 1/(L+W)

Page 44: Improving Index Performance  through Prefetching


Prefetching One Node Ahead

while (p) {
    pf(p->next);        /* prefetch n_{i+1} while visiting n_i */
    work(p->data);
    p = p->next;
}

[Timeline: L_{i+1} overlaps W_i, but each load still waits on the previous node's pointer (data dependence)]

• Computation is overlapped with memory accesses.

Computation rate = 1/L

Page 45: Improving Index Performance  through Prefetching


Prefetching Three Nodes Ahead

while (p) {
    pf(p->next->next->next);   /* prefetch n_{i+3} while visiting n_i */
    work(p->data);
    p = p->next;
}

[Timeline: the chained dereferences serialize the loads, so L_{i+1}, L_{i+2}, L_{i+3} still complete only one per L cycles]

Computation rate does not improve (still = 1/L)!

Pointer-Chasing Problem [Luk & Mowry, ASPLOS '96]:
• any scheme that follows the pointer chain is limited to a rate of 1/L

Page 46: Improving Index Performance  through Prefetching


Our Goal: Fully Hide Latency

while (p) {
    pf(&n_{i+3});       /* jump directly to n_{i+3}'s address */
    work(p->data);
    p = p->next;
}

[Timeline: loads L_{i+1}, L_{i+2}, L_{i+3} fully overlap the work W_i, W_{i+1}, ...]

Achieves the fastest possible computation rate of 1/W.

Page 47: Improving Index Performance  through Prefetching


Challenges in Supporting Efficient Updates

Conceptual view of the jump-pointer array: a flat array of leaf addresses, with back pointers from the leaves

What if we really implemented it this way?

• Insertion: could incur significant overheads
  • copying data within the array to create a new hole
  • updating back pointers
• Deletion: OK; just leave a hole

Page 48: Improving Index Performance  through Prefetching


Summary: Why We Expect Updates to Perform Well

Insertions: only a small number of jump pointers move (between the insertion point and the nearest hole in the chunk)
normally we only update the hint pointer for the inserted node, which does not require any significant overhead
significant overheads occur only on chunk splits, which are rare

Deletions: no data is moved (just leave an empty hole); no need to update any hints

In general, the jump-pointer array requires little concurrency control.

Page 49: Improving Index Performance  through Prefetching


B+-Trees Modeled and their Notations

B+-Trees: regular B+-Trees

CSB+-Trees: cache-sensitive B+-Trees [Rao & Ross, SIGMOD 2000]

pwB+-Trees: prefetching B+-Trees with node size = w cache lines and no jump-pointer arrays (we consider w = 2, 4, 8, and 16)

p8eB+-Trees: prefetching B+-Trees with node size = 8 cache lines and external jump-pointer arrays

p8iB+-Trees: prefetching B+-Trees with node size = 8 cache lines and internal jump-pointer arrays

p8CSB+-Trees: prefetching cache-sensitive B+-Trees with node size = 8 cache lines (and no jump-pointer arrays)

(Gory implementation details are in the paper.)

Page 50: Improving Index Performance  through Prefetching


Searches with Varying Bulkload Factors

Similar trends with smaller bulkload factors as when 100% full

Performance of pB+-Trees is somewhat less sensitive to the bulkload factor

[Plots: time (M cycles) vs. percentage of entries used in leaf nodes (50-100%), warm and cold caches, for B+tree, CSB+, p2B+tree, p4B+tree, p8B+tree, p16B+tree, p8CSB+]

Page 51: Improving Index Performance  through Prefetching


Range Scans with Varying Bulkload Factors

Prefetching B+-Trees offer:
larger speedups with smaller bulkload factors (more nodes to fetch)
less sensitivity of performance to the bulkload factor

[Plot: time (cycles, log scale) vs. percentage of entries used in leaf nodes, for B+tree, p8B+tree, p8eB+tree, p8iB+tree]

Page 52: Improving Index Performance  through Prefetching


Large Segmented Range Scans

1M keys, scanned in 1000-key segments

Similar performance gains as unsegmented scans

[Plot: time (cycles, log scale) vs. percentage of entries used in leaf nodes, same four tree types]

Page 53: Improving Index Performance  through Prefetching


Insertions with Cold Caches

[Plot: time (M cycles) vs. percentage of entries used in leaf nodes (50-100%), for B+tree, p8B+tree, p8eB+tree, p8iB+tree]

Page 54: Improving Index Performance  through Prefetching


Deletions with Cold Caches

[Plot: time (M cycles) vs. percentage of entries used in leaf nodes (50-100%), for B+tree, p8B+tree, p8eB+tree, p8iB+tree]

Page 55: Improving Index Performance  through Prefetching


Analysis of Node Splits upon Insertions

Bulkload factor = 60-90%: far fewer node splits
Bulkload factor = 100%: fewer node splits, and fewer non-leaf node splits (bars broken into "at least 2 splits", "one split", "no splits")

[Plot: insertions with node splits vs. percentage of entries used in leaf nodes (55-90%), for B+tree, p8B+tree, p8eB+tree, p8iB+tree]

Page 56: Improving Index Performance  through Prefetching


Mature Trees: Searches (Warm Caches)

[Plot: time (M cycles) vs. number of searches (x 1000, 40-200), for B+tree, p8B+tree, p8eB+tree, p8iB+tree]

Page 57: Improving Index Performance  through Prefetching


Mature Trees: Insertions (Warm Caches)

[Plot: time (M cycles) vs. number of insertions (x 1000, 40-200), for B+tree, p8B+tree, p8eB+tree, p8iB+tree]

• A CSB+-Tree could be 25% worse than a B+-Tree under the same mature-tree experiments (on a different hardware configuration)

• pB+-Trees are significantly faster than B+-Trees

Page 58: Improving Index Performance  through Prefetching


Mature Trees: Deletions (Warm Caches)

[Plot: time (M cycles) vs. number of deletions (x 1000, 40-200), for B+tree, p8B+tree, p8eB+tree, p8iB+tree]

Page 59: Improving Index Performance  through Prefetching


Mature Trees: Searches (Cold Caches)

[Plot: time (M cycles) vs. number of searches (x 1000, 40-200), for B+tree, p8B+tree, p8eB+tree, p8iB+tree]

Page 60: Improving Index Performance  through Prefetching


Mature Trees: Insertions (Cold Caches)

[Plot: time (M cycles) vs. number of insertions (x 1000, 40-200), for B+tree, p8B+tree, p8eB+tree, p8iB+tree]

Page 61: Improving Index Performance  through Prefetching


Mature Trees: Deletions (Cold Caches)

[Plot: time (M cycles) vs. number of deletions (x 1000, 40-200), for B+tree, p8B+tree, p8eB+tree, p8iB+tree]

Page 62: Improving Index Performance  through Prefetching


Mature Trees: Large Segmented Range Scans

[Bar chart of total scan time: B+tree 3537, p8B+ 825, p8eB+ 479, p8iB+ 452]

Page 63: Improving Index Performance  through Prefetching


Search varying memory bandwidth (warm cache)

[Plot: normalized execution time vs. normalized bandwidth B (5-30), for p2B+tree, p4B+tree, p8B+tree, p16B+tree, p19B+tree]

Even with a pessimistic B = 5, p8B+-Trees still achieve a significant speedup: 1.2x with warm caches

Page 64: Improving Index Performance  through Prefetching


Search varying memory bandwidth (cold cache)

[Plot: normalized execution time vs. normalized bandwidth B (5-30), same five tree types]

• Even when B = 5, a 1.3x speedup with cold caches

• The optimal value of w increases as B gets larger

Page 65: Improving Index Performance  through Prefetching


Scan varying prefetching distance (p8eB+-Tree)

[Plot: time (cycles) vs. entries scanned in a single call, for prefetch distances k = 2, 3, 4, 8, 16, 32]

• Not sensitive to moderate increases in the prefetching distance

• Though the overshooting cost shows up when the # of entries to scan is small

Page 66: Improving Index Performance  through Prefetching


Scan varying chunk size (p8eB+-Tree)

[Plot: time (cycles) vs. entries scanned in a single call, for chunk sizes c = 2, 4, 8, 16, 32]

Not sensitive to varying chunk size

Page 67: Improving Index Performance  through Prefetching


Table 1: Terminology

Variable: Definition
w: # of cache lines in an index node
m: # of child pointers in a one-line-wide node
N: # of <key, tupleID> pairs in an index
d: # of child pointers in a non-leaf node (= wm)
T1: full latency of a cache miss
Tnext: latency of an additional pipelined cache miss
B: normalized memory bandwidth (B = T1/Tnext)
K: # of nodes to prefetch ahead
C: # of cache lines in a jump-pointer array chunk
pwB+-Tree: plain pB+-Tree with w-line-wide nodes
pweB+-Tree: pwB+-Tree with external jump-pointer arrays
pwiB+-Tree: pwB+-Tree with internal jump-pointer arrays

Page 68: Improving Index Performance  through Prefetching


Search w/ & w/o Jump-Pointer Arrays: Cold Cache

[Plot: time (M cycles) vs. # of entries in leaf nodes (10^4 to 10^7), for p8B+tree, p8eB+tree, p8iB+tree; the steps correspond to different # of levels in the tree]

Page 69: Improving Index Performance  through Prefetching


Cache Performance Revisited

Search: eliminated 45% of the original data cache stalls; 1.47x speedup

Scan: eliminated 97% of the original data cache stalls; 8-fold speedup

[Chart: execution time breakdown into Data Cache Stalls, Other Stalls, Busy Time]

Page 70: Improving Index Performance  through Prefetching


Can We Do Even Better on Searches?

Hiding latency across levels is difficult given:
the data dependence through the child pointer
the relatively large branching factor of tree nodes
the equal likelihood of following any child (assuming uniformly distributed random search keys)

What if we prefetch a node's children in parallel with accessing it? There is a duality between this and creating wider nodes. BUT, this approach has the following relative disadvantages:
storage overhead for the child (or grandchild) pointers
the effective node size can only grow by multiples of the branching factor
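
A sketch of that alternative for contrast (hypothetical types and names; this is the approach the slide weighs against wider nodes, not the pB+-Tree design):

typedef struct node1 {      /* one-cache-line-wide node (illustrative) */
    int nkeys;
    int keys[7];
    struct node1 *child[8];
} node1_t;

node1_t *search_step(node1_t *n, int key)
{
    /* Prefetch every child while still searching n, so the next
       level's miss overlaps this one. The costs named above: the
       child pointers cannot be elided (unlike CSB+-Trees), and the
       fanout covered per step grows only in multiples of the
       branching factor. */
    for (int i = 0; i <= n->nkeys; i++)
        __builtin_prefetch(n->child[i], 0, 3);

    int s = 0;
    while (s < n->nkeys && key >= n->keys[s])  /* linear search */
        s++;
    return n->child[s];
}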