Effect of Node Size on the Performance of Cache-Conscious B+ Trees



Written by: R. Hankins and J. Patel

Presented by: Ori Calvo

Introduction

Who cares about cache improvement? Traditional databases are designed to reduce I/O accesses. But… chips are cheap, and chips are big. Why not store the whole database in memory? Then reducing main-memory accesses becomes the next challenge.

Objectives

Introduce cache-conscious B+Trees. Provide a model to analyze the effect of node size. Examine "real-life" results against our model's conclusions.

B+Tree Refresher

A B+Tree of order d has between d and 2d keys in each node. The root has between 1 and 2d keys, so every node is at least half full. For a tree of height h holding N keys: 2*(d+1)^(h-1) <= N <= (2d+1)^h. The fill percentage is usually ln 2 ≈ 69%.
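As a quick check of these bounds (my numbers, not from the slides): for order d = 7 and height h = 4, the tree holds at least 2*(7+1)^3 = 1,024 keys and at most (2*7+1)^4 = 50,625 keys.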

B+Tree Refresher (Cont…)

Good search performance. Good incremental (insert/delete) performance. Better cache behavior than a T-Tree. But what is the optimal node size?

Improving B+Tree

Question: Assuming node size = cache line size, how can we make the B+Tree algorithm utilize the cache better?

Hint: locality!

Pointer Elimination

With node size = cache line size, only half of a node is used for storing keys. Get rid of the child pointers and store more keys instead: rather than pointers to child nodes, use offsets.

Introducing CSB+Tree

A balanced search tree. Each node contains m keys, where d <= m <= 2d and d is the order of the tree. All children of a node are put into a node group, and nodes within a node group are stored contiguously. Each node holds:

pFirstChild - pointer to the first child
nKeys - number of keys
arrKeys[2d] - array of keys
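A minimal C sketch of this node layout (my own illustration, not the authors' code; the order D = 7 is just an example):

#include <stddef.h>

#define D 7  /* tree order: each non-root node holds D..2D keys */

typedef struct Node {
    struct Node *pFirstChild;   /* all children live contiguously in one node group */
    int          nKeys;         /* number of keys currently stored */
    int          arrKeys[2*D];  /* sorted array of keys */
} Node;

/* Child i is reached by arithmetic on pFirstChild instead of a stored pointer. */
static Node *Child(Node *pNode, int i) { return pNode->pFirstChild + i; }

This is exactly the pointer elimination described above: one pointer per node group replaces 2d+1 child pointers.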

[Figure: CSB+Tree layout. Each node is drawn as "P N K1 K2" (child-group pointer, key count, keys); all children of a node are stored contiguously in one node group.]

CSB+Tree vs. B+Tree

Assuming node size = 64B with 4-byte keys and pointers:
B+Tree node: 7 keys + 8 pointers + 1 counter (7*4 + 8*4 + 4 = 64 bytes)
CSB+Tree node: 1 pointer + 1 counter + 14 keys (4 + 4 + 14*4 = 64 bytes)

Results: a cache line can satisfy almost one more level of comparisons, the fan-out is larger, and less space is used.

CSS Tree

Can we do even more pointer elimination? CSS-Trees do: they store the whole tree in one array and locate children purely by arithmetic.

Shaking Our Foundations

Should node size really be equal to cache line size? What about the instruction count? How can we measure the effect of node size on overall performance?

Building an Execution Time Model

We need to take into account: instructions executed, data cache misses, instruction cache misses (only 0.5%, so ignored), and mis-predicted branches. We model these during an equality search. The model should be independent of implementation and platform details, but…

Execution Time Model

T = I*cpi + M*miss_latency + B*pred_penalty

Variable      | Description                                               | Value     | Depends upon
cpi           | Processor clock cycles per instruction executed           | 0.63 (P3) | Platform
miss_latency  | Processor clock cycles per L2 cache miss                  | 78 (P3)   | Platform
pred_penalty  | Processor clock cycles to correct a mis-predicted branch  | 15 (P3)   | Platform
I             | Instruction count                                         | -         | Implementation
M             | Data cache miss count                                     | -         | Implementation
B             | Mis-predicted branch count                                | -         | Implementation
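Transcribed directly into C as a sketch (not from the presentation; the constants are the Pentium 3 values from the table above):

/* Execution-time model: T = I*cpi + M*miss_latency + B*pred_penalty. */
static const double CPI          = 0.63;  /* cycles per instruction   */
static const double MISS_LATENCY = 78.0;  /* cycles per L2 cache miss */
static const double PRED_PENALTY = 15.0;  /* cycles per mispredict    */

double model_cycles(double I, double M, double B)
{
    return I * CPI + M * MISS_LATENCY + B * PRED_PENALTY;
}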

CPI = 0.63?

CPI can be extracted from a processor's design manual, but modern processors are very complex, and some instructions take longer to retire than others. On the Pentium 3, CPI ranges from 0.33 to 14.

Other PSVs (platform-specific values): where do they come from?

miss_latency: same problems as CPI. pred_penalty: the manual provides tight upper and lower bounds.

PSV Experiment

A micro-benchmark that forces a cache miss and then runs a controlled number of instructions per iteration:

for (i = 0; i < Queries; i++) {
    address = origin + random_offset();   /* touch a random location: likely a cache miss */
    val = *address;
    for (j = 0; j < Instructions; j++) {
        /* computation involving "val" */
    }
}

[Figure: PSV Results (measured platform-specific values)]

Calculate I

I depends upon the actual implementation of the CSB+Tree. It has two main components: I_search, the instructions per comparison step while searching inside a node, and I_trav, the instructions per node traversal. Analyzing the code leads to the following conclusions: I_search ≈ 5 and I_trav ≈ 30.

Calculate I_search

The inner loop of the binary search (schematic pseudo-assembly, about 5 instructions per iteration):

BinarySearch:
    middle = (p1 + p2) / 2;
    comp *middle, key;
    jle less;
    p1 = middle;
    jump BinarySearch;
less:
    p2 = middle;
    jump BinarySearch;

Calculate I_trav

The per-node traversal code, with instruction counts per statement in parentheses:

Node *Find(Node *pNode, int key) {
    int *pKeysBegin = pNode->Keys;                                (1)
    int *pKeysEnd   = pNode->Keys + pNode->nKeys;                 (3)
    int *pFoundKey, foundKey;

    pFoundKey = BinarySearch(pKeysBegin, pKeysEnd, key);          (8?)

    if (pFoundKey < pKeysEnd) { foundKey = *pFoundKey; }          (3,1)
    else                      { foundKey = INFINITE; }            (1)

    int offset = (int)(pFoundKey - pKeysBegin);                   (2)
    Node *pChild = NULL;
    if (key < foundKey) { pChild = pNode->pChilds + offset; }     (4,1)
    else                { pChild = pNode->pChilds + offset + 1; } (3)

    return pChild;
}                                                                 (total: 23-25)

Calculate I (Finishing)

I = h * log2(f*e + 1) * I_search + h * I_trav

h - height of the tree
f - fill percentage
e - max number of keys in a node
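Plugging in illustrative numbers (mine, for intuition only): with 64-byte nodes (e = 14), f = 0.67, and a height of h = 7, a search costs about 7 * (log2(14*0.67 + 1) * 5 + 30) ≈ 7 * (3.4*5 + 30) ≈ 330 instructions.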

Calculate M

M_node is the number of cache misses incurred while searching inside a single node. For a binary search it lies between log2(L) and log2(nKeys), where L is the number of cache lines inside a node.

Calculate M (Cont…)

The number of cache misses per tree traversal is bounded by TreeHeight * M_node. But what about q traversals?

Calculate M for q traversals

Let's assume there are no cache conflicts and no capacity misses. On the first traversal there are M_node cache misses per node access. On subsequent traversals, nodes near the root have a high probability of being found in the cache, while leaf nodes have a substantially lower probability.

Calculate M for q traversals (Cont…)

Suppose q is the number of queries and b is the number of blocks. Then the expected number of unique blocks visited is:

UB(b, q) = b * (1 - (1 - 1/b)^q)
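Transcribed directly into C (a sketch of the formula above, not the authors' code):

#include <math.h>

/* Expected number of unique blocks touched by q uniformly random
   accesses to a set of b blocks: UB(b, q) = b * (1 - (1 - 1/b)^q). */
double UB(double b, double q)
{
    return b * (1.0 - pow(1.0 - 1.0 / b, q));
}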

Calculate M for q traversals (Finishing)

Each traversal performs M_node block accesses per level, so q traversals make q*M_node accesses at each level. Summing the unique blocks visited over the levels of the tree and dividing by q gives the misses per query:

M = (1/q) * sum over i = 1..h of UB(b_i, q*M_node)

where b_i is the number of blocks at level i.
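Continuing the sketch with the UB function above (blocks[] and M_node are inputs the caller must supply; my illustration of the summation, not the authors' code):

/* Per-query data cache misses: sum UB over the h levels of the tree,
   where blocks[i] is the number of blocks at level i, then divide by q. */
double model_M(const double *blocks, int h, double q, double M_node)
{
    double total = 0.0;
    for (int i = 0; i < h; i++)
        total += UB(blocks[i], q * M_node);
    return total / q;
}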

Calculate B

Each probe of the in-node binary search ends in a hard-to-predict branch, and about half of those branches are mis-predicted, giving:

B = h * log2(f*e + 1) / 2

h - height of the tree
f - fill percentage
e - max number of keys in a node
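With the same illustrative numbers as before (e = 14, f = 0.67, h = 7): B ≈ 7 * 3.4 / 2 ≈ 12 mis-predicted branches per search.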

Mid-Year Evaluation

We built a simple model:

T = I*cpi + M*miss_latency + B*pred_penalty

Now, we want to use it

Our Model's Predictions

We look at the performance behavior that our model predicts on the Pentium 3. The following parameters are used: 10,000,000 items, 10,000 queries, fill percentage = 67%, cache line size = 32 bytes.

[Figures: model-predicted effect of node size on cache miss count, on instruction count, and on execution time]

Numbers

The best cache utilization occurs at small node sizes, 64-256 bytes. At larger node sizes fewer instructions are executed; the minimum is reached at 1632 bytes. The optimal node size is 1632 bytes, performing 26% faster than a node size of 32 bytes.

Our Model's Conclusions

Conventional wisdom suggests: node size = cache line size. We show: using a larger node size can result in better search performance.

Experimental Setup

Pentium 3 with 768MB of main memory, 16KB of L1 data cache, and 512KB of L2 data/instruction cache (4-way set-associative, 32-byte cache lines). Linux, kernel version 2.4.13. 10,000,000 entries in the database, queried 10,000 times.

[Figures: measured effect of node size on cache miss count, on instruction count, and on execution time]

Final Conclusions

We investigated the performance of the CSB+Tree. We introduced first-order analytical models. We showed that cache misses and instruction count must be balanced: a node size of 512 bytes performs well, while larger node sizes suffer from poor insert performance.