
High Performance Self Organizing Dynamic Dictionaries

Thesis submitted in partial fulfillment of the requirements for the degree of

MS by Research
in
Computer Science and Engineering

by

Ziaul Choudhury
201107546
[email protected]

Software Engineering Research Center
International Institute of Information Technology
Hyderabad - 500 032, INDIA
July 2016


Copyright © Ziaul Choudhury, 2016

All Rights Reserved


International Institute of Information Technology
Hyderabad, India

CERTIFICATE

It is certified that the work contained in this thesis, titled “High Performance Self Organizing Dynamic Dictionaries” by Ziaul Choudhury, has been carried out under my supervision and is not submitted elsewhere for a degree.

Date Adviser: Dr. Suresh Purini


To My Family


Acknowledgements

First, I would like to thank Dr. Suresh Purini for all the support, guidance, suggestions and encouragement. He is a wonderful guide and supported me with his patience and knowledge. He patiently listened to whatever I said and helped me think in the right direction whenever I was stuck. I am really thankful to him for being available to me even during odd hours.

I am grateful to SERC for providing ample resources and a nice atmosphere for doing research. I would like to thank my lab mates, Ronak, Lakshya and Apaar, for their encouragement.

I am really thankful to my friends Srinath, Bhuvanan, Divyesh, Harsha and Ajay for their love and support when I was having a tough time. I will cherish all the memories for the rest of my life. I would also like to thank Siva for his guidance, suggestions and motivation while writing this thesis. Finally and most importantly, I would like to thank my parents and family for their unconditional love and support throughout the thesis.


Abstract

A dictionary data structure is a key-value store which supports insert, search and delete operations. Conventional data structures such as balanced binary search trees and hash tables are good both in theory and in practice. However, these dictionaries are not adaptive: they do not exploit the temporal locality in key access patterns over time. On the other hand, dictionaries like splay trees and working-set trees adapt to changing key access patterns by reorganizing themselves with every key access. Consequently, the set of most frequently accessed keys, i.e. the working set, can be retrieved quickly in these dictionaries.

With the growing popularity of GPU based accelerator platforms for general purpose computing, several data structures such as B-trees and hash tables have been successfully ported to these platforms. In this thesis, we propose heterogeneous (CPU+GPU) hash tables that optimize operations on the frequently accessed keys. The basic idea is to maintain a dynamic set of the most frequently accessed keys in the GPU memory and the rest of the keys in the CPU main memory. Further, queries are processed in batches of fixed size. We measured the query throughput of our hash tables using Millions of Queries Processed per Second (MQPS) as a metric, on different key access distributions. On distributions where some keys are queried more frequently than others, we achieved on average 10x higher MQPS when compared to a highly tuned serial hash table in the C++ Boost library, and 5x higher MQPS against a state of the art concurrent lock free hash table. The maximum load factor on the hash tables was set to 0.9. On uniform random query distributions, as expected, our hash tables do not outperform concurrent lock free hash tables, but they nevertheless match their performance.

The second main contribution of this thesis is the design and implementation of self organizing Cache Oblivious B-trees, which use the memory hierarchy effectively to reduce the number of cache misses incurred while processing a query. A cache oblivious B-tree (CO B-tree) incurs an optimal number of cache block transfers across any two levels in a multi level memory hierarchy. In this work, we propose two variants of the dynamic CO B-tree: a working set CO B-tree and a hybrid CO B-tree. They reorganize themselves with every key access, facilitating faster retrieval of the frequently accessed keys with a minimum number of cache misses. Theoretically, we prove that a self organizing CO B-tree can be designed with a small additive factor overhead over the already proven bounds for the data structure. In empirical experiments we compare both our data structures to a B-tree, an AVL tree, a top down splay tree and a static/dynamic CO B-tree. Our CO B-trees, on average, incur 1.5 to 2 times fewer cache misses compared to the B-tree and the top down splay tree respectively. For higher degrees of temporal locality in access patterns, the hybrid CO B-tree has better query throughput (queries processed per unit time) and fewer cache misses compared to the rest of the data structures, including a static CO B-tree.


Contents

Chapter                                                                    Page

1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
  1.1 Accessing the Working set . . . . . . . . . . . . . . . . . . . . . . 2
  1.2 Thesis Contributions and Layout . . . . . . . . . . . . . . . . . . . 3

2 Computational Models . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
  2.1 Heterogeneous computing . . . . . . . . . . . . . . . . . . . . . . . 4
  2.2 Cache Oblivious Model . . . . . . . . . . . . . . . . . . . . . . . . 6

3 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
  3.1 Splay trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
  3.2 Working Set Trees . . . . . . . . . . . . . . . . . . . . . . . . . . 10
  3.3 Layered Working Set tree . . . . . . . . . . . . . . . . . . . . . . . 11
  3.4 Static Cache Oblivious B-trees . . . . . . . . . . . . . . . . . . . . 12

4 Heterogeneous CPU+GPU Self Organizing Hash Tables . . . . . . . . . . . . 14
  4.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
  4.2 Overview of Heterogeneous Hash Tables . . . . . . . . . . . . . . . . 15
  4.3 A Spliced Hash Table . . . . . . . . . . . . . . . . . . . . . . . . . 16
      4.3.1 Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
      4.3.2 Reorganize . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
  4.4 A Simplified S-Hash Table . . . . . . . . . . . . . . . . . . . . . . 18
      4.4.1 MRU List Processing . . . . . . . . . . . . . . . . . . . . . . 18
  4.5 A Cache Partitioned SS-hash Table . . . . . . . . . . . . . . . . . . 19
      4.5.1 Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
  4.6 Performance Evaluation . . . . . . . . . . . . . . . . . . . . . . . . 21
      4.6.1 Query performance . . . . . . . . . . . . . . . . . . . . . . . 21
      4.6.2 Structural Analysis . . . . . . . . . . . . . . . . . . . . . . 22

5 Self Organizing Dynamic Cache Oblivious B-trees . . . . . . . . . . . . . 26
  5.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
      5.1.1 Dynamic vEB Layout . . . . . . . . . . . . . . . . . . . . . . . 27
  5.2 Working set CO B-tree . . . . . . . . . . . . . . . . . . . . . . . . 28
      5.2.1 Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
  5.3 Hybrid CO B-tree . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
      5.3.1 Structure Split and Join Operations . . . . . . . . . . . . . . 31
      5.3.2 Search Operation . . . . . . . . . . . . . . . . . . . . . . . . 35
      5.3.3 Update Operations . . . . . . . . . . . . . . . . . . . . . . . 37
  5.4 A Working Example . . . . . . . . . . . . . . . . . . . . . . . . . . 39
  5.5 Performance Evaluation . . . . . . . . . . . . . . . . . . . . . . . . 39
      5.5.1 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44

Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45


List of Figures

Figure Page

1.1 Call sequences in (a) Mozilla 1.0, (b) VMware GSX Server 2.0.1, (c) squid running under User-Mode Linux 2.4.18.48 . . . . . 2

2.1 A schematic of the GPU hardware . . . . . 5
2.2 A typical memory hierarchy in modern systems. . . . . . 6

3.1 A splay operation on a splay tree with the key 14. . . . . . 10
3.2 An example of a search for x in the working set structure. After finding x, it is removed from T4 and inserted into T1. Finally, a shift from 1 to 4 is performed in which an element is removed from Ti and inserted into Ti+1 for 1 ≤ i < 4. . . . . . 11
3.3 The recursive vEB layout of a complete binary search tree. . . . . . 13
3.4 Example of the vEB transformation for the complete binary search tree on the left. . . . . . 13


List of Tables

Table Page

4.1 The left figure shows the structure of a query element. The figure on the right shows the overall structure of the hash tables. The value part in the query vector is omitted for simplification. . . . . . 16

4.2 The figure shows a reorganize operation on a simple MRU list with the overflow and the query sections. The value part is omitted for simplification. . . . . . 17

4.3 The overall design of the S-hash and SS-hash tables. . . . . . 20

4.4 The query throughput of the heterogeneous hash tables on different key access distributions and query mixes. . . . . . 22

4.5 The cache misses/query comparison for the hash tables. . . . . . . . . . . . . . . . . . 23

4.6 The two graphs show the time split of the CPU, GPU and memory transfers for processing different numbers of queries under different key access distributions. . . . . . 24

4.7 The two graphs show the variation of the query throughput with the size of the MRU list and the query vector respectively. . . . . . 25

5.1 The figure on the left represents the logical and physical layout of a hybrid CO B-tree in memory. Lightly shaded regions are vEB trees (layout). The unshaded regions are red-black trees (pointer layout). Dotted lines are subtree boundaries. The first figure on the right hand side shows two sequential search operations being performed on a hybrid CO B-tree placed initially as a static vEB tree. Notice the evolution of new root blocks with different layouts during the searches. The bottom figure shows a generic search operation on a hybrid CO B-tree. In both the figures |RT| = 1. Every search out of RT results in the displacement of a key (keys a and y). . . . . . 30

5.2 The figure on the left demonstrates a path-split operation using the key E on a tree with a root block placed as a red-black tree. The operation is bottom up and at first splits the lower root block in the vEB layout using the path nodes D and E. Finally the modified red-black tree on top is split with a red-black split operation using the key E. The figure on the right shows a join operation, also depicting the different cases arising while executing it. . . . . . 32

5.3 The figure shows the transformation of a hybrid CO B-tree with successive search operations. The colour coding of the different sections of the tree is consistent with the rest of the thesis. The primary root block is at its maximum allowable size at the start. Notice how additional levels of root blocks are added to the subtree j during the process. . . . . . 34


5.4 A working example of a hybrid CO B-tree on 15 nodes. The nodes of the tree are color coded according to the layout they belong to. The darker nodes belong to the primary root block in the vEB layout. The lightly shaded and the unshaded nodes are static vEB tree nodes and red-black tree nodes respectively. The primary root block can contain a maximum of 3 nodes. The size threshold for conversion of a red-black tree to a static vEB tree is 2 nodes. The memory layout of the trees is shown below each sub figure. . . . . . 40

5.5 The two graphs compare the query throughput of the trees against two query string sizes. Each point in the graph corresponds to a specific search string generated with a different degree of skewness controlled by the skew factor β. The y axis is throughput in million queries processed per second. The x axis is the various values of β. . . . . . 41

5.6 The cache misses per query incurred by the respective trees to process the search string for various values of β and across two machine architectures. The y axis is cache misses per query. The x axis is different values of β. . . . . . 42

5.7 The query throughput comparison for the trees relative to two kinds of query mixes. The y and x axes correspond to the query throughput (million queries processed per second) and the query string size in millions respectively. . . . . . 43

5.8 The cache misses per query comparison for the trees relative to two kinds of query mixes. The y axis is cache misses per query and the x axis is the size of the query string in millions. The results are collected on the Intel machine. . . . . . 43


Chapter 1

Introduction

A dictionary is a key-value store which maps a set of unique keys to values. Examples are a symbol table in a compiler, a database of student academic records, and the shortest distances maintained by a graph algorithm. A dictionary can be part of a system program, used directly for practical purposes, or be a component of another algorithm. The keys must be drawn from a linearly ordered set; that is, any two keys can be compared based on their key values. Binary search trees, B-trees and hash tables are classic examples of dictionaries. For algorithmic simplicity, the data part associated with a key is often omitted when describing dictionaries.

Anecdotal evidence suggests that in most applications dictionaries are subjected to the following distribution of operations: 90% search, 9% insert and 1% delete operations. Typical examples are database applications, where most of the time the database is used to search through the records. The accesses to the keys stored in a dictionary often tend to exhibit temporal locality, i.e. up to a point in time some keys are accessed more frequently than the rest. This behavior gives rise to working sets for a dictionary. A good example is a dictionary maintained for a network router. A network router receives packets at a high rate from incoming connections and must quickly decide on which outgoing wire to send each packet, based on the IP address in the packet. The router needs a big table (a map) that can be used to look up an IP address and find out which outgoing connection to use. If an IP address has been used once, it is likely to be used again, perhaps many times. Another example of temporal locality can be seen in mmap calls related to virtual memory accesses in operating system kernels [11]. Each process in a Unix-like kernel has a number of virtual memory areas (VMAs). At a minimum, a statically linked binary has one VMA for each of its code, data, and stack segments. Processes can create an arbitrary number of VMAs, up to the limit imposed by the operating system or machine architecture, by mapping disk files into memory with the mmap system call. Figure 1.1 shows the mmap call sequences in some widely used applications. Notice that temporal locality is exhibited by each program in the range of VMAs it refers to during its lifetime.

A self organizing dictionary reorganizes itself based on its key access pattern. It is hard to predict the set of most frequently accessed keys, i.e. the working set, for an online key access sequence. Various dictionaries employ operations or heuristics to approximate this working set and optimize their operations around it.


Figure 1.1 Call sequences in (a) Mozilla 1.0, (b) VMware GSX Server 2.0.1, (c) squid running under User-Mode Linux 2.4.18.48

The move to front heuristic in the self organizing list keeps the working set near the head of the list using a sequence of swap operations. A splay tree [53] is a self organizing version of the binary search tree. The just accessed key is made the root of a splay tree using the technique of tree rotations. These tree rotations make the access paths to the working set shorter while maintaining the binary search order on the keys, and at the same time keep the tree nearly balanced. Theoretically, splay trees have been shown to satisfy the working set property on access sequences in an amortized sense. The working set number ω(x) of a key x is defined as the number of distinct keys accessed since the last access of x [10], [26]. Intuitively, the working set property assumes that a recently accessed key has a high probability of being accessed again in the future and thus should be quicker to access again. This behavior is also typical of caching systems, where the just accessed memory address is cached in a faster memory for speedy access in the future.

Unlike splay trees, there exist other dictionaries, namely working set trees [10], [16], that maintain the working set explicitly for each key x and guarantee access to it within O(log ω(x)) time in the worst case. Note that in all the above mentioned dictionaries, the working set is intrinsic to the layout of the keys in the dictionary.

1.1 Accessing the Working set

With the recent advances in accelerator based computing, heterogeneous (CPU+GPU) systems are gaining acceptance in a wide variety of applications. A GPU can be used for massively data parallel computations. It is a device providing high computational throughput at low power. The works in [51], [33] have accelerated database operations like scan and join using the GPU. Data structures like B+ trees, hash tables and quad-trees have also been successfully ported to these exotic platforms. The GPU is not a standalone device; it is connected to a host CPU, which offloads computations to it. In this heterogeneous (CPU+GPU) model, after offloading work to the GPU, the CPU mostly sits idle. Heterogeneous data structures that keep both devices simultaneously active have also been reported in the literature [57], [22], [38].


It is a novel idea to split a data structure across both devices such that the working set is accumulated on the faster device, i.e. the GPU. A key belonging to the working set can then be retrieved very quickly by searching through all the keys in the working set at once using the GPU's parallel framework.

For non GPU based homogeneous systems, other avenues for faster retrieval of the working set must be sought. With the increasing latency divide between the CPU and memory, memory based optimizations have proven to greatly improve the running time and energy consumption of an application. A B-tree optimizes the number of block transfers between the disk and the main memory but can only be tuned for a specific memory hierarchy. A cache oblivious B-tree (CO B-tree) incurs an optimal number of cache block misses across any two levels in a multi level memory hierarchy. CO B-trees exist in both theory [13] and practice [56]. Unlike a splay tree, a CO B-tree is unaware of the presence of any working set. A self organizing version of the CO B-tree would allow the keys in the working set to be accessed faster with a minimum number of cache misses. However, engineering a rotation like operation in a CO B-tree, as in a splay tree, is a non trivial task, since the keys need to be stored in a single permutation (the vEB order) to make the operations on the tree cache efficient. A rotation operation would change the vEB order of the keys, thereby defeating the purpose of the layout.

1.2 Thesis Contributions and Layout

Inspired by the ideas in the previous section, this thesis proposes two types of self organizing dictionaries: a self organizing hash table and a self organizing CO B-tree. The developed data structures are experimentally evaluated by comparing them against their state of the art counterparts. The specific contributions in this direction are as follows.

• We propose a set of hash tables in the heterogeneous (CPU+GPU) model that exploit temporal locality for faster key access by maintaining the working set in the GPU memory and the rest of the keys in the CPU memory. We compare our hash tables against highly optimized serial and parallel hash tables on the CPU and report a 5-10x improvement in query throughput.

• We propose two variants of the dynamic CO B-tree that reorganize themselves with every key access, facilitating faster retrieval of the frequently accessed keys with a minimum number of cache misses. Experimentally, the developed CO B-trees incur 1.5 to 2 times fewer cache misses compared to the existing alternatives.

The layout of this thesis is as follows. In chapter 2, we describe the computational models, i.e. the cache oblivious memory model and the heterogeneous (CPU+GPU) model, that we use in designing our data structures. In chapter 3, we briefly describe the data structures that have inspired the work in this thesis. We then describe our hash table and tree designs in chapters 4 and 5. Finally, we conclude in chapter 6 with some references to future work.


Chapter 2

Computational Models

In this chapter, we broadly outline the two models, namely the heterogeneous (CPU+GPU) model and the cache oblivious model, using which the data structures proposed in this thesis are derived. Knowledge of these models is necessary for a better understanding of the design techniques employed in the data structures. First, we describe the accelerator based CPU+GPU framework using which the hash tables have been designed. Due to the lack of theoretical underpinnings, the discussion of this model mostly involves a description of the underlying hardware stack realizing it.

For the binary search trees, the cache oblivious model is used. The strength of this model comes from the fact that, without knowledge of any details of the memory hierarchy, e.g. the block size, any data structure designed in this model always incurs an optimal number of cache block misses while processing a query.

2.1 Heterogeneous computing

Graphics Processing Units (GPUs) are finding a place in a wide spectrum of devices ranging from mobile phones to supercomputers due to their remarkable performance-price ratio and, in some GPU variants, competitive performance-per-watt. They can also be used for general purpose computation (GPGPU) through programming frameworks such as CUDA [1] and OpenCL [54]. Among the many applications ported onto GPUs, database engines are the most relevant to the work discussed in this thesis. GPUs have been used to accelerate database operations like select and join by storing the keys on them using indexing structures like hash tables or B+ trees [5], [27].

GPUs are composed of multiple streaming multiprocessors (SMs). Each SM in turn contains a number of lightweight primitive cores (refer to Figure 2.1). Whereas the SMs execute in parallel independently, the cores within an SM consume the same instruction stream, generated by the threads running on the SM, in a lock step fashion. This is the Single Instruction Multiple Thread (SIMT) model, which is similar to the SIMD model. A high end Kepler GPU from NVIDIA, for example, has an overall core count of 2880, distributed equally among multiple SMs. The memory subsystem is composed of a global DRAM and an L2 cache shared by all the SMs. There is also a small software managed data cache called shared memory attached to each SM which, as the name suggests, is shared by all the cores within the SM.


Figure 2.1 A schematic of the GPU hardware

The access time of the shared memory is close to that of registers.

A compute kernel on a GPU is organized as a collection of thread blocks. Each thread block can be 1D, 2D or 3D, and every thread within the block is given a unique coordinate whose dimension depends on the block organization. Thread blocks are in turn organized into a grid of 1D, 2D or 3D dimension. Every thread is uniquely determined by its grid and block coordinates. All the threads within the same block are scheduled for execution on the same SM; however, different thread blocks can be scheduled on different SMs. A group of 32 threads within a block, called a warp, runs in lock step. A warp of threads is the basic unit of execution in a GPU. Decomposing a problem on the GPU involves mapping each input point of the program to grids, blocks and warps. The GPU is embedded in a system as an accelerator device connected to the CPU through a low bandwidth PCI Express (PCIe) bus. Hybrid computing using the CPU and GPU traditionally involves the GPU handling the data parallel part of the computation by taking advantage of its massive number of cores, while the CPU handles the sequential code and data transfer management. Unfortunately, a large fraction of the time in a CPU+GPU code is spent in transferring data across the slow PCIe bus. This problem can be mitigated by carefully placing the data in the GPU so that fetching of new data from the CPU is minimized. Another solution is to overlap the data transfer operations with computation using the asynchronous API in the CUDA framework. The CPU, after transferring the data and launching the kernel, mostly sits idle during the computation.
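As a concrete illustration of this decomposition, the toy kernel below (a minimal sketch, not code from the thesis; the kernel name and the scaling operation are illustrative) shows how each thread of a 1D grid derives a unique global index from its block and thread coordinates and processes one element:

    // A toy CUDA kernel: each thread computes a unique global id from its
    // block and thread coordinates and scales one element of the input.
    #include <cuda_runtime.h>

    __global__ void scale(float* data, int n, float factor) {
        // 1D grid of 1D blocks; consecutive groups of 32 threads in this
        // block form warps that execute in lock step.
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        if (tid < n) data[tid] *= factor;
    }

    int main() {
        const int n = 1 << 20;
        float* d;
        cudaMalloc(&d, n * sizeof(float));
        // Decompose the problem: 256-thread blocks (8 warps each) and
        // enough blocks to cover all n elements.
        int threads = 256, blocks = (n + threads - 1) / threads;
        scale<<<blocks, threads>>>(d, n, 2.0f);
        cudaDeviceSynchronize();
        cudaFree(d);
        return 0;
    }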


Figure 2.2 A typical memory hierarchy in modern systems.

The main objective behind the hybrid computational model is to make both the CPU and the GPU contribute to the computation.

Data structures that use both the CPU and GPU simultaneously have been reported in the literature. Kelly and Breslow [38] proposed a heterogeneous approach to construct quad-trees by building the first few levels on the CPU and the rest of the levels on the GPU. The work load division strategy has also proven its worth in cases where the costly or frequent operations were accelerated on the GPU while the rest of the operations were handled by the CPU. Daga and Nutter [24] proposed a B+ tree implementation on an Accelerated Processing Unit (APU). They eliminated the need to copy the entire tree to the GPU memory, thus freeing the implementation from the limited GPU memory. The work in [28] proposed a hybrid CPU+GPU ray-tracing implementation based on an optimal KD-tree as the acceleration structure. The construction and traversal of this KD-tree benefit from both the CPU and the GPU to achieve high-performance ray-tracing. The work in [27] investigated the possible speedups from traversing B+ trees in parallel on the GPU, avoiding the overhead of selecting the entire table to transform it into row-column format and leveraging the logarithmic nature of tree searches.

2.2 Cache Oblivious Model

Modern computers are characterized by a memory system consisting of a hierarchy of several levels of memory, where each level acts as a cache for the next. The typical memory levels of current machines are registers, level 1 cache, level 2 cache, level 3 cache, main memory, and disk. While the sizes of the levels increase with the distance from the CPU, the access times also get larger, most dramatically when going from main memory to disk (Figure 2.2). To circumvent dramatic performance loss, data is moved between the memory levels in blocks (cache lines or disk blocks). As a consequence of this organization of the memory, the memory access pattern of an algorithm has a major influence on its practical running time.


A basic rule commonly stated in the literature for achieving good running times is to ensure locality of reference in the developed algorithms.

The cache oblivious model is inspired by the two level external memory model. The central aspect of the external memory model is that transfers between the main memory and the disk involve blocks of data. Specifically, the disk is partitioned into blocks of B elements each, and accessing one element on disk copies its entire block to main memory. The cache can store up to M/B blocks, for a total size of M elements, with M ≥ B². Before fetching a block from disk when the memory is already full, the algorithm must decide which block to evict from memory.

The cache-oblivious model was introduced by Frigo, Leiserson, Prokop, and Ramachandran in 1999 [30]. Unlike the external memory model, this model targets multi level memories. Its principal idea is to design external memory algorithms without knowing B and M. This simple idea has several surprisingly powerful consequences. One consequence is that, if a cache oblivious algorithm performs well between two levels of the memory hierarchy (nominally called cache and disk), then it must automatically work well between any two adjacent levels of the memory hierarchy. This consequence follows almost immediately, though it relies on every two adjacent levels being modelled as an external memory, each presumably with different values for the parameters B and M, in such a way that blocks in memory levels nearer the CPU store subsets of memory levels farther from the CPU (the inclusion property). A further consequence is that, if the number of memory transfers is optimal up to a constant factor between any two adjacent memory levels, then any weighted combination of these counts (with weights corresponding to the relative speeds of the memory levels) is also within a constant factor of optimal. In this way, we can design and analyse algorithms in a two level memory model and obtain results for an arbitrary many level memory hierarchy, provided we can make the algorithms cache oblivious. There are two main assumptions in this model regarding caches: optimal page replacement and full associativity. The first assumes that the replacement strategy knows the future and always evicts the page that will be accessed farthest in the future; the second assumes that any block can be stored anywhere in the cache. The complexity of an algorithm measured in this model is the number of memory block transfers that happen between any two levels in a generic memory hierarchy.

Frigo et al. [30] described cache-oblivious algorithms with an optimal Θ(Sort_{M,B}(N)) number of memory transfers for matrix transposition, fast Fourier transform, and sorting; Prokop also described a static search tree obtaining the optimal O(log_B N) transfer search bound [31]. Subsequently, Bender et al. [14] described a cache-oblivious dynamic search tree with the same search cost, and simpler and improved cache-oblivious dynamic search trees were then developed by several authors [12], [15], [19], [49]. Cache-oblivious algorithms have also been developed for, e.g., problems in computational geometry [3], [18], for scanning dynamic sets, for layout of static trees [13], for partial persistence, and for a number of fundamental graph problems [7] using cache-oblivious priority queues [7], [17]. Most of these results make the so-called tall cache assumption, that is, they assume that M ≥ B²; we make the same assumption throughout this thesis.


Empirical investigations of the practical efficiency of cache-oblivious algorithms for sorting [20], searching [41] and matrix problems have also been performed. The overall conclusion of these investigations is that cache-oblivious methods often outperform RAM algorithms, but not always by as much as algorithms tuned to the specific memory hierarchy and problem size. On the other hand, cache-oblivious algorithms perform well on all levels of the memory hierarchy, and seem to be more robust to changing problem sizes than cache-aware algorithms.


Chapter 3

Background

This chapter gives the background on the data structures that inspired the work in this thesis. We first describe a set of tree data structures that exploit the temporal locality in key access patterns for faster retrieval of the frequently accessed keys. We then describe a cache oblivious layout of a complete binary search tree that ensures key accesses with a minimum number of cache misses in a generic memory hierarchy.

3.1 Splay trees

A splay tree is a self adjusting type of binary search tree. Structurally, it is identical to an ordinary binary search tree; the only difference is in the algorithms for finding, inserting, and deleting entries. All splay tree operations run in O(log n) amortized time, where n is the number of entries in the tree. Any single operation can take Θ(n) time in the worst case, but any sequence of k splay tree operations, with the tree initially empty and never exceeding n items, takes O(k log n) worst-case time. Splay trees are simpler and easier to program than most balanced search trees. Because of their simplicity, splay tree insertions and deletions are typically faster in practice (sometimes by a constant factor, sometimes asymptotically). Find operations can be faster or slower, depending on circumstances. Splay trees are designed to give especially fast access to entries that have been accessed recently, so they really excel in applications where a small fraction of the entries are the targets of most of the find operations. When an element is accessed in a splay tree, tree rotations are used to move it to the top of the tree. This simple algorithm can result in extremely good performance in practice. Notice that the algorithm requires that we be able to update the tree in place, but the abstract view of the set of elements represented by the tree does not change and the representation invariant is maintained. This is an example of a benign side effect, because it does not change the value represented by the data structure. There are three kinds of restructuring steps, zig, zig-zig and zig-zag, that are used to move elements upward in the tree (Figure 3.1). These rotations have two important effects: they move the node being splayed upward in the tree, and they also shorten the path to any node along the path to the splayed node. This latter effect means that splaying operations tend to make the tree more balanced. There are two variants of a splay tree, top down and bottom up. A top down splay tree performs the tree restructuring as it proceeds down the access path.


Figure 3.1 A splay operation on a splay tree with the key 14.

A bottom up splay tree performs the splay operation after it locates a key in the tree. In this work we use bottom up splay trees for their simplicity.
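The sketch below shows the bottom up variant, assuming nodes with parent pointers; the zig, zig-zig and zig-zag cases appear as the three branches of the splay loop. The node layout and function names are illustrative, not the splay tree implementation evaluated later in this thesis.

    // A sketch of bottom up splaying with parent pointers.
    struct Node {
        int key;
        Node *left = nullptr, *right = nullptr, *parent = nullptr;
    };

    // Rotate x above its parent, preserving the binary search order.
    static void rotate(Node*& root, Node* x) {
        Node* p = x->parent;
        Node* g = p->parent;
        if (p->left == x) {                    // right rotation
            p->left = x->right;
            if (x->right) x->right->parent = p;
            x->right = p;
        } else {                               // left rotation
            p->right = x->left;
            if (x->left) x->left->parent = p;
            x->left = p;
        }
        p->parent = x;
        x->parent = g;
        if (!g) root = x;
        else if (g->left == p) g->left = x;
        else g->right = x;
    }

    // Repeatedly apply zig (parent is the root), zig-zig (x and its parent
    // are same-side children) or zig-zag steps until x becomes the root.
    static void splay(Node*& root, Node* x) {
        while (x->parent) {
            Node* p = x->parent;
            Node* g = p->parent;
            if (!g) {
                rotate(root, x);                              // zig
            } else if ((g->left == p) == (p->left == x)) {
                rotate(root, p); rotate(root, x);             // zig-zig
            } else {
                rotate(root, x); rotate(root, x);             // zig-zag
            }
        }
    }

Each rotation preserves the binary search order, so repeated splaying brings the accessed key to the root while roughly halving the depth of the other nodes on its access path.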

3.2 Working Set Trees

A Working Set tree (WS tree) on N elements is a dictionary satisfying the working set bound [10]. It consists of log log N balanced binary search trees (Figure 3.2), and the elements within each tree are connected through a doubly linked list. Let T1, · · · , Tl be the trees and L1, · · · , Ll be the corresponding doubly linked lists, where l = log log N. For 1 ≤ i ≤ l − 1, |Ti| = 2^(2^i), and 0 < |Tl| ≤ 2^(2^l). Elements in a tree occur in the most recently used (MRU) order in the corresponding linked list. The element at the head of the linked list is the most recently used element. By concatenating the linked lists from L1 to Ll, we get a global MRU list of elements. All the operations on the working set structure are engineered so that this property is satisfied. By maintaining this property, we can find an element x in O(log ω(x)) time, where ω(x) indicates the position of the element x in the global MRU list. An element is searched in the trees T1 to Tl, in that order. If the element is found in tree Ti, then it is deleted from Ti and reinserted into the tree T1. Before reinsertion, an element has to be moved from tree Tj to Tj+1, for 1 ≤ j ≤ i − 1, in order to maintain the tree size constraints. Whenever an element has to be moved from tree Tj to Tj+1, the element at the tail of the list Lj is chosen for deletion, and it occupies the head of the list Lj+1 after insertion into the tree Tj+1. During an insert operation, an element is shifted from tree Tj to Tj+1, for 1 ≤ j ≤ l − 1. Then the new element is inserted into tree T1 and is placed at the head of the list L1.


Figure 3.2 An example of a search for x in the working set structure. After finding x, it is removed from T4 and inserted into T1. Finally, a shift from 1 to 4 is performed in which an element is removed from Ti and inserted into Ti+1 for 1 ≤ i < 4.

If the tree Tl reaches its full size during this process, then a new tree will be created. During a delete operation of an element present in tree Ti, an element will be moved from Tj to Tj−1, for i < j ≤ l. The most recently used element in Tj moves to the position of the least recently used element in tree Tj−1. If the tree Tl becomes empty during this process, it will be destroyed. The working set structure supports search operations in O(log ω(x)) time, whereas insert and delete operations require O(log N) worst case running time [10].
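The search procedure just described can be sketched as follows, with std::set standing in for the balanced binary search trees Ti and std::list for the MRU lists Li; the names and layout are illustrative, and this is not the implementation from [10].

    #include <cstdint>
    #include <list>
    #include <set>
    #include <vector>

    struct WorkingSetStructure {
        struct Level { std::set<int> tree; std::list<int> mru; };  // Ti and Li
        std::vector<Level> levels;                                 // levels[0] is T1

        // |Ti| = 2^(2^i) with 1-based i, guarded against shift overflow.
        static std::size_t capacity(std::size_t i) {
            std::size_t e = std::size_t(1) << (i + 1);             // 2^(i+1)
            return e >= 64 ? SIZE_MAX : std::size_t(1) << e;
        }

        bool search(int x) {
            for (std::size_t i = 0; i < levels.size(); ++i) {
                if (!levels[i].tree.count(x)) continue;
                // Found in Ti: delete it there, reinsert at the head of T1/L1.
                levels[i].tree.erase(x);
                levels[i].mru.remove(x);   // O(|Li|) here; O(1) with real links
                levels[0].tree.insert(x);
                levels[0].mru.push_front(x);
                // Shift: while Tj overflows, its oldest element moves to Tj+1.
                for (std::size_t j = 0; j + 1 <= i; ++j) {
                    if (levels[j].tree.size() <= capacity(j)) break;
                    int oldest = levels[j].mru.back();
                    levels[j].mru.pop_back();
                    levels[j].tree.erase(oldest);
                    levels[j + 1].tree.insert(oldest);
                    levels[j + 1].mru.push_front(oldest);
                }
                return true;
            }
            return false;
        }
    };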

3.3 Layered Working Set tree

The structure discussed above is in the pointer machine model. The structure described in [16] achieves the same worst case search and update bounds in the binary search tree model; it is described briefly below. The Layered Working Set (LWS) tree is obtained by cleverly composing the individual trees T0, · · · , Tl from the working set structure into a single binary search tree T. Each node x ∈ T has a label field label[x] ∈ {0, · · · , l}. The label of a node is always greater than or equal to that of an ancestor. All the nodes with the same label j belong to the layer tree Lj, which is analogous to the tree Tj in the working set structure. The size of the layer tree Lj is 2^(2^j) as before. However, the size of the last layer subtree Ll, which contains the left over elements, can be less than 2^(2^l). A layer Lj could be spread out in T as independent subtrees. Each such independent subtree is maintained using a balanced binary tree data structure such as a Red-Black tree. This ensures that the depth of a key x in Lj in the tree T is bounded by O(2^j). An MRU list is maintained by keeping the fields younger[x], older[x] and nextlayer[x] in every node x. The younger and older fields contain the keys of the elements inserted just after and before the element x. If x is the oldest element in a layer Lj, then older[x] = nil and nextlayer[x] is set to the oldest element in layer Lj+1. Similarly, if x is the youngest element in a layer Lj, then younger[x] = nil and nextlayer[x] is set to the youngest element in layer Lj+1.


This is how the MRU list is maintained without explicit use of pointers. The root of the tree contains the total number of layers in the tree. There are two types of internal tree operations, intra-layer and inter-layer, using which the search and update operations on the tree T are performed. The intra-layer operations correspond to the regular operations supported on a Red-Black tree, i.e. Insert-FixUp(x), Delete-FixUp(x), Split(x) and Join(x) [44]. All these operations take O(2^j) worst case time on a layer subtree corresponding to Lj. The inter-layer operations involve search or key shift operations across the layers. The following is a brief summary of the intra-layer/inter-layer operations supported. For details refer to [16].

• YoungestInLayer(Lj)/OldestInLayer(Lj): These operations return the node corresponding to the youngest/oldest element in layer Lj.

• MoveUp(x)/MoveDown(x): These operations move x from its current layer Lj to the layer Lj−1/Lj+1 which is above/below it respectively. They make use of the Split(x)/Join(x) intra-layer operations in their implementation; the Split and Join operations are described later in the chapter. After the movement, the fields older[x], younger[x] and nextlayer[x] are adjusted accordingly.

Each of the inter-layer operations has O(2^j) worst case time complexity. The actual search, delete and insert operations on the tree T are independent of the specific implementation details of the inter-layer and intra-layer operations, as long as the time complexity bounds are satisfied. The operation Search(x) on the tree T follows a normal binary search strategy. If the element x is found in the layer subtree Lj, then a MoveUp(x) operation is repeatedly performed till the element x moves into the layer subtree L1. Then the sizes of the trees T1, · · · , Tj are fixed by moving the oldest element y in layer subtree Li to Li+1, for 1 ≤ i ≤ j − 1, using the OldestInLayer(Li) and MoveDown(y) operations repeatedly. The Insert(x) operation inserts the element x in the last layer tree Lt by using a regular binary search to find its position in the tree T. If |Lt| > 2^(2^t), then we initialize a new tree Lt+1 with the element x in it. Then a Search(x) is performed to bring the element x to layer tree L1. The Delete(x) operation locates the layer tree Lj in which the element x is present and moves it down to the tree Lt by using the MoveDown(x) operation repeatedly. Then, by using one more MoveDown(x) operation, we extract the element x from the tree. Then the sizes of the layer trees Li are fixed by moving the youngest element y in each layer tree Li to Li−1, repeatedly using the YoungestInLayer(Li) and MoveUp(y) operations. If the layer tree Lt becomes empty during this process, then we delete it from T and decrement the value of t that is maintained in the root of T.

3.4 Static Cache Oblivious B-trees

The B-tree is a classic external memory dictionary data structure that incurs O(log_B N + 1) block transfers per operation, where B is the block size and N is the number of elements in the B-tree. The B keys in a node are tightly packed together in contiguous memory locations. The value of B is chosen to minimize the number of disk accesses required per operation.


Figure 3.3 The recursive vEB layout of a complete binary search tree.

Figure 3.4 Example of the vEB transformation for the complete binary search tree on the left.

A B-tree structure optimized for disk accesses is not necessarily optimal at higher levels of the memory hierarchy, due to the sequential layout of elements within a B-tree node. This problem can be addressed by using a recursive B-tree structure [40] wherein the elements within a B-tree node are organized using another B-tree, and so on. However, engineering such a structure is cumbersome and error prone; further, the optimal block sizes and the depth of the recursive layout are dependent on the memory hierarchy. The solution to this problem is to place the keys of a binary search tree in a cache oblivious layout. This is also referred to as a static cache oblivious B-tree structure.

A static Cache Oblivious B-tree (CO B-tree) is a balanced binary search tree which allows only search operations. The nodes of the tree are stored in a linear array, where the node to array index mapping is decided by what is called the van Emde Boas (vEB) layout (Figure 3.4). The name comes from the van Emde Boas trees [55], due to the similarity in the recursion. If T is a binary search tree of height h, then let T0 be the subtree consisting of the root node and the rest of the nodes in the top ⌊h/2⌋ levels of T. Let T1, · · · , Tk be the subtrees hanging off the leaf nodes of T0. Since T is a balanced search tree, k = O(√N), and each of the subtrees T1, · · · , Tk contains O(√N) nodes. The vEB layout of T consists of the vEB layout of T0 followed by the vEB layouts of the trees T1, · · · , Tk in that order. During the recursive layout of a tree, once we reach a subtree of size at most B, the nodes in the subtree straddle at most two blocks. While searching for a key, at most O(log N / log B) such subtrees will be touched, incurring an overall block transfer complexity of O(log_B N + 1). This structure is cache oblivious as it incurs the same number of block transfers as a cache aware implementation, without knowing the memory hierarchy details.
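The recursive layout is short to compute. The sketch below (illustrative, not the thesis code) assigns a vEB array slot to every node of a complete binary search tree stored implicitly with 1-based BFS numbering; each height-h subtree is split into a top tree of ⌊h/2⌋ levels followed by its bottom subtrees, exactly as described above.

    #include <cstdint>
    #include <cstdio>
    #include <vector>

    // Lay out the height-h subtree rooted at BFS index `root`, assigning
    // consecutive slots: T0 first, then T1, ..., Tk left to right.
    static void veb(uint64_t root, int h, uint64_t& next,
                    std::vector<uint64_t>& slot) {
        if (h == 1) { slot[root] = next++; return; }
        int top = h / 2, bot = h - top;      // T0 keeps the top floor(h/2) levels
        veb(root, top, next, slot);
        uint64_t first = root << top;        // leftmost BFS node at depth `top`
        for (uint64_t i = 0; i < (uint64_t(1) << top); ++i)
            veb(first + i, bot, next, slot);
    }

    int main() {
        const int h = 4;                     // complete tree on 2^4 - 1 = 15 nodes
        std::vector<uint64_t> slot(uint64_t(1) << h);
        uint64_t next = 0;
        veb(1, h, next, slot);               // slot[i] = vEB position of BFS node i
        for (uint64_t i = 1; i < (uint64_t(1) << h); ++i)
            std::printf("BFS %2llu -> vEB slot %2llu\n",
                        (unsigned long long)i, (unsigned long long)slot[i]);
        return 0;
    }

For h = 4 this places the root block {1, 2, 3} in the first three slots, followed by the four bottom subtrees in left to right order.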


Chapter 4

Heterogeneous CPU+GPU Self Organizing Hash Tables

A hash table is a key-value store which supports constant time insert, delete and search operations. There is no relative ordering on the placement of keys inside a hash table, and hash tables place no special emphasis on the key access patterns over time. Therefore, hash tables are unable to exploit the temporal locality in key access patterns.

In this chapter, inspired by the working set structure, we propose a set of two level heterogeneous (CPU+GPU) hash tables. In all the designs, the first level of the hash table is smaller in size and resides in the GPU memory. It essentially caches the most recently accessed keys, in other words the hot data. The second level hash table resides in the CPU memory and contains the rest of the keys. We overlay an MRU list on the keys residing in the GPU hash table. The queries are batched and are processed first on the GPU and then on the CPU. Our overall hash tables can be viewed as a heterogeneous two level working set structure. To the best of our knowledge, this is the first attempt towards designing heterogeneous (CPU+GPU) hash tables, wherein we use the GPU accelerator to improve the query throughput by exploiting the key access patterns of the hash table.

4.1 Background

General hash table designs include the linear hash table, chained hash table, cuckoo hash table and hopscotch hash table [35]. Among these, the cuckoo hashing [48] technique can achieve good performance for lookup operations. Cuckoo hashing is an open address hashing scheme which uses a fixed number of hash tables with one hash function per table. On a collision, the new key replaces the key already present in the slot; the now "slotless" key is hashed into a different table by the hash function of that table, and the process continues until every key has a slot. There have been efforts directed towards designing high performance hash tables for multi-core systems. Lea's hash table from the Java Concurrency Package [43] is a closed address lock based hash table based on chaining. Hopscotch hashing [35] guarantees constant time lookup operations. It is a lock based open address technique which combines linear probing with the cuckoo hashing technique. Initial work on lock free hashing was done in [46], which used chaining. A lock free version of cuckoo hashing was designed in [47]. The algorithm allowed mutating operations to run concurrently with query operations and required only single word compare-and-swap primitives.


They used a two-round query protocol enhanced with a logical clock technique to ensure correctness. Pioneering work on parallel hashing in the GPU was done by Alcantara et al. [5]. They used cuckoo hashing on the GPU for faster lookup and update operations. Each thread handled a separate query and used GPU atomic operations to prevent race conditions while probing for hash table slots. This design is a part of the CUDPP [6] library, a data parallel library of common algorithms on the GPU. The work in [39] presented the Stadium Hashing (Stash) technique, which is a cuckoo hash design scalable to large hash tables. It removes the restriction of maintaining the hash table wholly in the limited GPU memory by storing container buckets in the host memory as well. It uses a compact data structure named the ticket-board, separate from the hash table buckets and maintained in the GPU memory, which guides all the operations on the hash tables.
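For concreteness, here is a minimal serial sketch of the cuckoo eviction loop with two tables; the hash functions, constants and structure are illustrative and do not correspond to the CUDPP, Stash or [47] implementations.

    #include <cstdint>
    #include <optional>
    #include <utility>
    #include <vector>

    struct CuckooTable {
        static constexpr int kMaxKicks = 32;   // bail out on a probable cycle
        std::vector<std::optional<uint32_t>> t0, t1;
        explicit CuckooTable(std::size_t n) : t0(n), t1(n) {}

        std::size_t h0(uint32_t k) const { return (k * 2654435761u) % t0.size(); }
        std::size_t h1(uint32_t k) const { return ((k ^ (k >> 16)) * 40503u) % t1.size(); }

        // On a collision the incoming key evicts the resident key, and the
        // evicted key is rehashed into the other table, until a slot is free.
        bool insert(uint32_t key) {
            uint32_t cur = key;
            for (int kick = 0; kick < kMaxKicks; ++kick) {
                auto& s0 = t0[h0(cur)];
                if (!s0) { s0 = cur; return true; }
                std::swap(cur, *s0);
                auto& s1 = t1[h1(cur)];
                if (!s1) { s1 = cur; return true; }
                std::swap(cur, *s1);
            }
            return false;  // caller would rebuild with new hash functions
        }

        // A lookup probes exactly one slot per table.
        bool find(uint32_t key) const {
            return (t0[h0(key)] && *t0[h0(key)] == key) ||
                   (t1[h1(key)] && *t1[h1(key)] == key);
        }
    };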

4.2 Overview of Heterogeneous Hash Tables

In this section, we give an overview of the basic design of our hash tables and their memory layout across both devices. The primary goal of our hash tables is to support faster operations on recently accessed keys, similar to the working set structure. Unlike previous works, the size and scalability of our hash tables are not restricted by the limited GPU memory. The GPU is used to cache the most frequently accessed keys (hot data). These key-value pairs are processed in parallel by all the GPU threads. The search and update queries are bundled into batches of size B before processing. We intuitively expect that every key k with ω(k) ≤ cM, where M is the size of the GPU global memory and 0 < c ≤ 1 is some constant, is available in the GPU. The value of c depends on the key-value pair record size.

All the key-value pairs in our heterogeneous hash tables are partitioned between a list and a CPU based hash table. The list is implemented using an array residing in unified memory. Unified memory is an abstracted form of memory that can be accessed by both devices without any explicit data transfers [1]; support for it is provided from CUDA 6.0 onwards. This memory is allocated on the GPU initially, and when the CPU accesses a chunk of this memory, that chunk is transferred to the CPU implicitly by the underlying CUDA framework. The key-value pairs in the list are arranged from the most recently to the least recently accessed pair, in left to right order. The list has size O(M) and three sections: an active middle area which contains all the key-value pairs belonging to the list, and empty left and right areas of size B each. The left area stores the most recent query vector after it has been processed by both devices. The query vector is added before the MRU list is reorganized; this reorganization is explained later in the thesis. The right area stores keys overflowing from the list. These overflow keys are the oldest keys in the GPU memory and will be accommodated in the CPU memory during successive operations on the hash tables. The rest of the key-value pairs, which are too old to be accommodated in the MRU list, are maintained in the CPU hash table. The design of this hash table is different in each of our designs and will be described in the later sections. Each element in the query vector, called a query element, contains a key-value pair.


The rightmost three bits in the key are reserved. The first two bits identify the operation being carried out with the key, i.e. a search, insert or delete. The last bit is set if the key-value pair is present in the GPU and unset otherwise (Figure 4.1). The next three sections describe the hash table designs in detail.
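One possible encoding of such a query element is sketched below; the exact bit positions and helper names are assumptions for illustration only and need not match the thesis layout.

    #include <cstdint>

    // Hypothetical packing: the key occupies the high bits and the three
    // reserved bits sit at the low-order end of a 64-bit word.
    enum Op : uint64_t { kSearch = 0, kInsert = 1, kDelete = 2 };

    constexpr int      kReserved = 3;
    constexpr uint64_t kOpMask   = 0x3;  // bits 0-1: operation type
    constexpr uint64_t kWsBit    = 0x4;  // bit 2: pair currently in the GPU

    constexpr uint64_t makeQuery(uint64_t key, Op op) {
        return (key << kReserved) | op;  // working set bit starts cleared
    }
    constexpr Op       opOf(uint64_t q)  { return static_cast<Op>(q & kOpMask); }
    constexpr bool     inGpu(uint64_t q) { return (q & kWsBit) != 0; }
    constexpr uint64_t setWs(uint64_t q) { return q | kWsBit; }
    constexpr uint64_t keyOf(uint64_t q) { return q >> kReserved; }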

Figure 4.1 The left figure shows the structure of a query element. The figure on the right shows the overall structure of the hash tables. The value part in the query vector is omitted for simplification.

4.3 A Spliced Hash Table

A spliced hash table (S-hash) is the simplest of our designs, where a standard GPU hash table from the CUDPP library is fused together with a serial CPU hash table from the C++ Boost [52] library, within the framework described in the previous section. The GPU hash table (CUDPP hash) is separate from the MRU list and is maintained as a separate data structure in the GPU memory. The CUDPP hash processes search and update queries in batches. The CUDPP hash and the Boost hash communicate through the MRU list in unified memory. Recall that the MRU list contains O(M) slots. To identify the position of a key in the MRU list, O(log M) bits are stored along with the value part. These position bits link a key-value pair in the CUDPP hash to its location in a specific slot of the MRU list.

4.3.1 Operations

The operations are first bundled in a query vector and sent to the CUDPP hash for processing. The working set bit is set for each insert(key) operation in a query element. The CUDPP hash cannot handle mixed operations; hence the query elements with search operations are separated from the delete operations before processing. Each GPU thread handles an individual query element. For a search query, if the key is found in the CUDPP hash, the position bits located in the value field are read, the working set bit corresponding to the location of the key in the MRU list is set to 0, and the working set bit in the query element is set to 1. Delete queries are handled similarly, without setting the working set bit in the query element after a successful removal. The search and delete queries that could not be serviced by the GPU are sent to the CPU. The CPU takes one query at a time and executes the corresponding operation on the Boost hash; the working set bit in the query element is set as before. The query


vector is now copied to the leftmost section of the MRU list. This copying can be avoided if the query vector is placed in this section of the MRU list at the outset. To prevent duplicate keys in the hash tables, the query vector is scanned for repeated keys with the working set bit set; if duplicates are found, the working set bit is kept set for one key and unset for the rest. The keys in the query vector whose working set bit is set and which are not available in the CUDPP hash are added to the hash table in a separate batch. Finally, a Reorganize operation is executed on the list, which arranges the keys in MRU order (Figure 4.2).

4.3.2 Reorganize

This operation shrinks the MRU list by removing all the keys in the list whose working set bits are set to 0. Figure 4.2 shows an instance of this operation: the MRU list with the associated left and right sections is shown along with an example query vector containing an insertion of the key 98 and searches for the other keys. After the addition of the query elements to the leftmost section of the MRU list, an exclusive prefix scan is carried out on the list using the working set bit values. The overflow area, where the working set bits are set to X, is not included in the prefix scan. The keys are then packed using the indices returned by the scan; the index starts at B, where B is the size of the query vector. If a key overflows to the overflow section due to the addition of a new key to the MRU list (key 98 in Figure 4.2), it is added by the CPU exactly when the CUDPP hash starts processing the next batch of queries. Since this overflow section belongs to the MRU list in unified memory, the CPU can read these keys without any explicit data transfers. The prefix scan is carried out in-place using a scan operation from the Thrust high performance GPU library [36].
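The compaction step can be sketched with Thrust as follows (a minimal sketch, assuming keys spans the query section plus the active middle area and flags holds the working set bits; unlike the thesis' in-place scan, this version allocates a scratch array):

#include <cstdint>
#include <thrust/device_vector.h>
#include <thrust/scan.h>
#include <thrust/scatter.h>

void reorganize(thrust::device_vector<uint64_t>& keys,
                const thrust::device_vector<int>& flags, int B)
{
    thrust::device_vector<int> idx(flags.size());
    // Exclusive scan of the working-set bits; the initial value B reserves
    // the leftmost B slots of the new list for the next query vector.
    thrust::exclusive_scan(flags.begin(), flags.end(), idx.begin(), B);

    thrust::device_vector<uint64_t> packed(keys.size() + B);
    // Keep only the keys whose working-set bit is set (stencil != 0);
    // everything else is dropped from the list.
    thrust::scatter_if(keys.begin(), keys.end(),
                       idx.begin(), flags.begin(), packed.begin());
    keys.swap(packed);
}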

Figure 4.2 The figure shows a reorganize operation on a simple MRU list with the overflow and query sections. The value part is omitted for simplicity.


4.4 A Simplified S-Hash Table

The simplified S-hash table (SS-hash) eliminates the CUDPP hash and operates on the MRU list directly. The step separating queries by operation type is no longer necessary, as the GPU now handles mixed operations together in one batch. The algorithm for maintaining the MRU list on the GPU remains the same; the only difference is the replacement of the CUDPP hash with our MRU list processing logic described below (Figure 4.3).

4.4.1 MRU List Processing

After the query vector is filled with query elements, a GPU kernel is launched. The thread configuration of the kernel is adjusted to launch W warps, where W equals the size of the active middle section of the MRU list; a warp is assigned a single MRU list element. Each block in the kernel loads a copy of the query vector into its shared memory. The ith warp processes the (j × 32 + i)th key in the MRU list, where j is the index of the block containing the warp and each block has a maximum capacity of 32 warps. The warp reads the assigned key in the list, and all the threads in the warp linearly scan the query vector in shared memory for the key. If a thread in the warp finds a match, it first reads the operation bits to identify whether it is a search or a delete operation. For a successful search operation, a thread sets the working set bit of the key in the query vector and unsets the corresponding bit in the MRU list; the thread uses the copy of the query vector in global memory for this step. This bit manipulation is done using the bit-wise OR/AND primitives. For a successful delete, the key-value pair along with the working set bit in the MRU list is set to 0, and the working set bit in the query vector is left unset. The success of a concurrent search for the same key that is being deleted is determined by whether the search read the key before the delete started modifying the bits in the key-value pair. Insert operations need not be processed, as they are taken care of by the reorganize step described before. This basic scheme is straightforward and unoptimized. Listed below are some optimizations which are intrinsic to our algorithm and some others which are incorporated with minor modifications; a kernel sketch incorporating these follows the list.

• Memory bank conflicts: Bank conflicts within a block occur when threads within a warp access different addresses falling in the same shared memory bank. In our algorithm all the warp threads read adjacent locations of the shared memory, thereby preventing bank conflicts.

• Warp serialization: If threads from two different warps read the same shared memory location, the two warps are scheduled one after another on the respective SM. There is a high probability of this happening, as all the warps within a block scan the query vector linearly from the beginning. To reduce this probability, each warp chooses a random location in the query vector to start its scan from, and wraps around in case it overflows the end of the query vector.

• Launch configuration: The number of warps, and thereby blocks, launched can be reduced if more work is assigned to a single warp. Instead of processing a single key from the MRU list, each warp can pick up a constant number of keys to look for inside the query vector.


• Redundant work: There might be scenarios where all the query elements are serviced by a very small fraction of the warps, while the majority of the warps do the redundant work of simply reading the query vector and the keys before expiring. To combat this, each warp, on successfully processing a query, decrements a global atomic counter initialized to B. A warp only starts its execution cycle if the value of this counter is greater than 0.
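Putting the scheme together, the following is a minimal CUDA sketch of the warp-per-key scan with the credit counter from the last bullet (assuming the query-element bit encoding given earlier; the names and launch shape are illustrative, not the thesis code):

// One warp per active MRU slot; bits 0-1 of a record encode the operation
// (0 = search, 2 = delete) and bit 2 is the working set bit.
__global__ void mruScan(unsigned long long* mru, unsigned long long* query,
                        int W, int B, int* credit)
{
    extern __shared__ unsigned long long q[];   // block-local query copy
    for (int i = threadIdx.x; i < B; i += blockDim.x) q[i] = query[i];
    __syncthreads();

    int warp = (blockIdx.x * blockDim.x + threadIdx.x) / 32;
    int lane = threadIdx.x & 31;
    if (warp >= W || *credit <= 0) return;      // expired or out of range
    unsigned long long key = mru[warp] & ~7ull; // strip the three flag bits

    for (int i = lane; i < B; i += 32) {        // lanes stride over queries
        if ((q[i] & ~7ull) != key) continue;
        unsigned long long op = q[i] & 3ull;
        if (op == 0) {                          // search hit
            atomicOr(&query[i], 4ull);          // set bit in query element
            atomicAnd(&mru[warp], ~4ull);       // unset bit in the MRU slot
        } else if (op == 2) {
            mru[warp] = 0;                      // delete: zero the record
        }
        atomicSub(credit, 1);                   // one query serviced
    }
}
// Launch: mruScan<<<(32 * W + 1023) / 1024, 1024,
//                   B * sizeof(unsigned long long)>>>(mru, query, W, B, credit);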

In this design, the Boost hash is replaced by a lock free cuckoo hash from [47]. The overflow keys are now added in parallel to the CPU hash table by individual CPU threads.

Figure 4.3 The overall design of the S-hash and SS-hash tables.

4.5 A Cache Partitioned SS-hash Table

In this section, the focus shifts to the CPU hash table design. The lock free cuckoo hash in the SS-hash table is replaced by our own implementation. The developed hash table is optimized for multi-core


caches using the technique of partitioning, hence we call it the CPSS hash table. The work in [45] designed a shared memory hash table for multi-core machines (CPHASH) by partitioning the table across the caches of cores and using message passing to transfer search/insert/delete operations to a partition; each partition was handled by a separate thread on a core. We design our hash table along similar lines. The hash table processes queries in batches and operates in a rippling fashion during query processing.

Our hash table is implemented using a circular array. The array housing the hash table is partitioned into P different partitions, where P is the number of CPU threads launched. In each partition P[i], a fixed number of buffer slots, R[i], are reserved; the rest of the slots in each partition are used for hashing the keys. Within a partition, collisions are resolved using open addressing: a mixed form of cuckoo hashing and linear probing is used. Each partition uses two hash functions h1 and h2, each operating on half the slots reserved for hashing. Each partition is serviced by a thread and handles Q/P queries, where Q is the total number of queries batched for processing on the CPU.

4.5.1 Operations

A query element m in the batch is assigned to the partition k = H(m) % P, where H is a hash function and H ≠ h1, h2. The assignment is completed by writing the contents of the query element to a slot in R[k]. After this, the threads execute a barrier operation and come out of the barrier only when there are no more entries in the buffer slots of each thread's partition. Each thread i reads a key from its buffer and hashes it to one of the hashing slots using h1. If the slot returned is already full, the thread searches for an empty slot in the next constant number of slots using linear probing. If this scheme fails, the thread replaces the last read key from its slot and inserts its key into this slot. The "slotless" key is hashed into the other half of the partition using h2, and the same process is repeated there. If the thread is unsuccessful in assigning a slot to the removed key, it simply replaces the key from the last read slot and inserts the just removed key into R[(i + 1) % P]. The insertion into the buffer slots of an adjacent partition is done using lock free techniques: all the insertions and deletions happen at the starting slot of these buffers using the atomic compare-and-swap primitive, the same mechanism used by a lock free stack [34]. For search and delete operations, each thread probes a constant number of slots within its partition; unsuccessful queries are added to the adjacent partition's buffer slots for the adjacent thread to process. There is a major issue with concurrent cuckoo hashing in general: a search for a specific key might be in progress while that key is in movement due to insertions happening in parallel, so the search returns false for a key that is present in the hash table. Note that in our case the overall algorithm is designed such that the CPU side insertions always happen in a separate batch before the searches and deletes.
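The lock free push into a neighboring partition's buffer can be sketched with the standard compare-and-swap loop of a Treiber stack, the mechanism [34] describes (a minimal sketch; the Node and Buffer types are illustrative assumptions, not the thesis' types):

#include <atomic>
#include <cstdint>

struct Node {
    uint64_t query;     // packed query element
    Node*    next;
};

struct Buffer {
    std::atomic<Node*> head{nullptr};

    void push(Node* n) {                       // lock-free LIFO insert
        Node* old = head.load(std::memory_order_relaxed);
        do {
            n->next = old;                     // link on top of current head
        } while (!head.compare_exchange_weak(old, n,
                     std::memory_order_release,
                     std::memory_order_relaxed));
    }

    Node* popAll() {                           // owner thread drains buffer
        return head.exchange(nullptr, std::memory_order_acquire);
    }
};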


4.6 Performance Evaluation

This section compares the performance of our heterogeneous hash tables to the most effective prior hash tables in both sequential and concurrent (multi-core) environments. The metric used for performance comparison is query throughput, measured in Millions of Queries Processed per Second (MQPS).

The experimental setup consists of an NVIDIA Tesla K20 GPU and an Intel Xeon E5-1680 CPU, connected through a PCIe bus with 8 GB/s peak bandwidth. The GPU has 5 GB of global memory with 14 SMs and 192 cores per SM, and runs the CUDA 7.5 framework. The host is an 8 core CPU running at 3.2 GHz with 32 GB RAM. The CPU code is implemented using the C++11 standard. All results are averaged over 100 runs. Our hash tables are compared against a concurrent lock free cuckoo hash implementation (LF-Cuckoo) from [47] and a serial hash table from the Boost library. For completeness we also compare the results with Lea's concurrent locked (LL) hash table.

4.6.1 Query Performance

We use micro-benchmarks similar to the works in [35], [47]. Each experiment uses the same data set of 64 bit key-value pairs for all the hash tables. The results were collected by setting the hash table densities (load factors) close to 90% (0.9). Figure 4.4 compares the query throughput of the hash tables on 10M queries; all the hash tables are filled with 64M key-value pairs initially. The results are shown for two types of query mixes: one with a higher percentage of search queries and the other with more update operations. Two types of key access patterns are simulated for the search and delete queries: a uniform distribution generates the queries at random from the data set, while the Gaussian distribution generates queries where a fraction of the keys are queried more often than the others. The standard deviation of the distribution is set such that 20% of the keys have a higher access frequency. As each warp in the GPU processes a single MRU list element, it is treated as a single GPU thread in the plots. The number of {GPU, CPU} threads is varied from {32, 4} to {1024, 128}. The size of the MRU list is fixed at 1M key-value pairs and each query vector has 8K entries. The Boost hash, being a serial structure, always operates with a single CPU thread.

As can be seen in Figure 4.4, for search dominated uniform key access patterns the heterogeneous hash tables outperform the Boost hash and Lea's hash, and the throughput scales with the increasing number of threads; they match the query throughput of the lock free cuckoo hash. For the insert dominated case, the heterogeneous hash tables outperform all the other hash tables. The reason is the simplified insert operation: the reorganize operation inserts the keys into the MRU list and thereby into the hash tables, and the CPU only handles the overflow inserts, which have a low probability of occurrence. For the Gaussian distribution, our hash tables outperform the others by a significant margin: they process 10 times more queries than the Boost hash and 5 times more than the lock free cuckoo hash. The frequently accessed keys are always processed on the GPU; the CPU only processes the unserviced queries in the query vector and the overflow keys. The


probability of the CPU doing work is low, as most of the queries are satisfied by the GPU without generating any overflow. Figure 4.5 shows the cache misses per query for the uniform and the Gaussian distributions. The CPSS hash has fewer cache misses than the Boost hash and the lock free cuckoo hash. As the GPU has a primitive cache hierarchy and most of the memory optimizations have already been taken care of, only the CPU cache misses are reported. In the Gaussian distribution case the CPSS hash performs much better than in the uniform case, as most of the queries are resolved by the GPU itself and the CPU has less work to do.

Figure 4.4 The query throughput of the heterogeneous hash tables for different key access distributions and query mixes.

4.6.2 Structural Analysis

The experiments in this section are carried out on the CPSS hash to find the reasons for the speed up reported earlier. In Figure 4.6 the number of queries is varied with {1024, 128} threads in total; the other parameters are as before. As can be seen, in the uniform scenarios half the time is spent on memory copies. These cover both DeviceToHost and HostToDevice implicit memory transfers, captured with the help of the CUDA profiler. In the Gaussian distribution case the GPU performs most of the work with minimal memory transfer overhead, and hence the expected speed up is achieved.

As can be seen in Figure 4.7, the maximum throughput is achieved when our hash tables are configured with MRU list and query vector sizes of 1M and 8K respectively. With increasing size of the MRU list and the query vector, the time spent by the GPU in the Reorganize operation and the time for the


Figure 4.5 The cache misses per query comparison for the hash tables.

DeviceToHost memory transfers increase. This is the reason for the diminishing query throughput at larger values of these parameters.


Figure 4.6 The two graphs show the time split among the CPU, the GPU and memory transfers for processing different numbers of queries under different key access distributions.


Figure 4.7 The two graphs show the variation of the query throughput with the size of the MRU list and the query vector respectively.


Chapter 5

Self Organizing Dynamic Cache Oblivious B-trees

The B-tree is an external memory dictionary data structure that supports insert, delete and search operations. Each operation incurs O(log_B N + 1) block transfers, where B is the block size (see footnote 1) and N is the number of elements in the B-tree. The block size B is chosen so as to minimize the number of disk accesses required per operation [29].

A B-tree structure optimized for disk accesses is not necessarily optimal at the higher levels of the memory hierarchy, due to the sequential layout of elements within a B-tree node. This problem can be addressed by using a recursive B-tree structure [40], wherein the elements within a B-tree node are organized using another B-tree and so on. However, engineering such a structure is cumbersome and error prone, and the optimal block sizes and the depth of the recursive layout depend on the memory hierarchy. A cache oblivious B-tree structure [14], [32] organizes elements in a memory hierarchy independent fashion while incurring an optimal number of block transfers between any two levels of the hierarchy. Laying out the elements of a static balanced binary search tree in a specific order, called the van Emde Boas (vEB) [25] order, makes this possible.
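As an illustration, the static vEB order can be produced by the usual recursion that halves the tree height (a minimal sketch, assuming a perfect tree whose keys sit in a 1-based BFS array; the function name is ours):

#include <vector>

// Recursive van Emde Boas ordering of a perfect binary search tree stored
// by 1-based BFS indices: lay out the top half of the tree first, then the
// 2^top bottom subtrees one after another.
void veb(const std::vector<int>& bfs, int root, int h,
         std::vector<int>& out, int& pos)
{
    if (h == 1) { out[pos++] = bfs[root]; return; }
    int top = h / 2, bot = h - top;
    veb(bfs, root, top, out, pos);          // top recursive subtree first
    int roots = 1 << top;                   // bottom subtrees follow
    for (int i = 0; i < roots; ++i)         // node at depth `top` below root
        veb(bfs, root * roots + i, bot, out, pos);
}

// Usage: for a tree of height h with keys in bfs[1 .. 2^h - 1],
//   std::vector<int> out(1 << h); int pos = 0; veb(bfs, 1, h, out, pos);
// out[0 .. 2^h - 2] now holds the keys in vEB order.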

On the other hand, a cache oblivious B-tree is not sensitive to its key access patterns over time. These key access patterns in general exhibit temporal locality, i.e. until a point in time some keys are accessed more frequently than the rest. There exist data structures, like splay trees and working set trees, that reorganize themselves with every key access such that the set of most frequently accessed keys, i.e. the working set, is cheap to access again in the future. In a splay tree the accessed key is made the root using tree rotations; these rotations shorten the access paths to the working set while maintaining the binary search order on the keys. Engineering a similar rotation like operation in a cache oblivious B-tree is a non trivial task, as the keys need to be present in a single permutation (the vEB order) to make the operations on the tree cache efficient. A rotation operation would change the vEB order of the keys, thereby defeating the purpose of the layout.

Inspired by this problem of combining the properties of a splay tree and a CO B-tree into a single data structure, in this thesis we propose two alternative designs for the CO B-tree (see footnote 2) that make it sensitive to

1 In this thesis the parameter B denotes the cache block transfer size between two levels in a generic memory hierarchy.
2 In this thesis we use the term static/dynamic vEB layout interchangeably with a static/dynamic CO B-tree.


changing key access patterns. The nodes in our CO B-trees reorganize themselves with every key access, as in splay trees, such that the working set can be accessed faster with a minimum number of cache misses. The first CO B-tree satisfies the working set property for every key access, with a small additive overhead. For the second design, we present a novel technique of maintaining subtrees of a search tree in different memory layouts rather than a single uniform layout. We also propose methods to split and join static vEB trees. Instead of restricting ourselves to theoretical results for cache oblivious B-trees, we implement our data structures and compare their performance in practice with existing alternatives.

5.1 Background

To measure the togetherness of a set of data accesses, the work in [59] defined reference affinity using a parameter called stack distance. Stack distance is the number of distinct memory accesses between two accesses to the same memory location in an execution trace of a program. Similarly, in the context of a dictionary, the equivalent of reference affinity is termed the working set.

Splay trees have an amortized access bound of O(log ω(x)) for every key x; they do not account for the working set explicitly. The number of costly splay operations was reduced in [4] by monitoring information on the current working set and its change: the modified algorithm "splayed" a node only when necessary under a heuristic measure. Similar work in [42] modified the access algorithm of a splay tree using counters attached to nodes to keep track of the number of accesses and account for the current working set explicitly, splaying conditionally when a node has been accessed a threshold number of times. The working set trees [16], [10] maintain the working set value ω(x) of a key x explicitly and guarantee access to it within O(log ω(x)) time steps in the worst case. These data structures, although achieving sub-logarithmic access times, are not cache efficient, i.e. the search operations do not use the fetched keys inside a cache block efficiently before evicting it. The vEB layout [14] is a cache efficient recursive layout of a binary search tree that places a node along with its child subtrees at contiguous memory locations. Having discussed the static form earlier, the following section describes the dynamic vEB layout in detail.

5.1.1 Dynamic vEB Layout

The work in [14] showed how to maintain a dynamic CO B-tree within an amortized O(log_B N + log^2 N / B + 1) cache block transfers per update operation. They used a strongly weight balanced B-tree stored in a vEB layout along with a packed memory array data structure [9], [8]. In this work we use the CO B-trees in [19]. Their basic idea is to maintain a dynamic binary tree of height log N + O(1) embedded in a static complete binary search tree, which in turn is embedded in an array in a cache oblivious fashion using the vEB layout. The entire structure is implemented just as an array of data elements, without the use of any pointers. The left and right child nodes for each node in this implicit vEB layout are calculated using metadata stored for each complete binary search tree of a particular size.


Let T denote the dynamic binary search tree and H the height of the static tree it is embedded in; H is an upper bound on the height of T. For a node v in T let s(v) = 2^{H−d(v)+1} − 1, where d(v) is the depth of v. The density of a node v is defined as ρ(v) = |T_v| / s(v), where |T_v| is the size of the subtree of T hanging from v. An upper and a lower density threshold are maintained for each node in T, and these thresholds follow a linear progression with depth. An insert(x) starts by locating x within T. If d(v) = H + 1 then T is rebalanced by walking up to the nearest ancestor w of v such that ρ(w) is less than its upper density threshold; then all the nodes in the subtree rooted at w are evenly redistributed such that all the descendants of w are within their density thresholds. Similarly, a delete proceeds by moving the node to be deleted to the leaf level by a sequence of swaps, deleting it there, and performing an analogous rebalancing operation to maintain the density thresholds.
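The insert-time rebalancing walk can be sketched as follows (a minimal sketch, assuming 1-based BFS indexing of the embedding tree, a size[] array caching subtree sizes, and illustrative linear threshold values):

#include <vector>

// Depth of a node given its 1-based BFS index in the embedding tree.
int depth(int v) { int d = 0; while (v > 1) { v /= 2; ++d; } return d; }

// Upper density threshold at depth d: linearly interpolated, stricter
// towards the root (the 1/2 and 1 endpoints are illustrative values).
double tau(int d, int H) { return 0.5 + 0.5 * (double)d / H; }

// After an insertion overflows a leaf v, walk towards the root and return
// the first ancestor w with rho(w) below its upper threshold; the caller
// then evenly redistributes the subtree rooted at w.
int rebalanceRoot(int v, int H, const std::vector<long>& size)
{
    int w = v;
    while (w > 1) {
        w /= 2;                                    // parent in BFS indexing
        long sw = (1L << (H - depth(w) + 1)) - 1;  // s(w) = 2^{H-d(w)+1} - 1
        if ((double)size[w] / sw < tau(depth(w), H))
            return w;                              // redistribute below w
    }
    return 1;                                      // rebuild the whole tree
}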

By using an extra level of indirection [14], the amortized cost of update operations can be further reduced to O(log_B N + 1) in the CO B-trees. The N elements are divided into N / log N chunks, where each chunk is at least a constant fraction full. The minimum element from each chunk is fed into the top tree structure of the CO B-tree. A chunk is split if it overflows, and can be merged with another chunk during a deletion if it is not sufficiently full. Thus only an O(1 / log N) fraction of the updates to the data structure cause a modification to the structure above. Therefore the amortized cost of updates in the structure is O((log_B N + log^2 N / B) / log N + log N / B) = O(log N / B). There is an initial cost of searching for the key, which is O(log_B N). Thus we get an amortized O(log_B N + log N / B + 1) = O(log_B N + 1) number of cache block transfers for updates using the method of indirection.

Empirical investigations (see footnote 3) into the practical efficiency of dynamic CO B-trees have shown that, unlike searches, update operations are slower in a CO B-tree than in conventional data structures like a red-black tree or a B-tree [56], [58], [41], [50].

Theoretical work on CO B-trees with the working set property exists in the literature. Brodal, Rasmussen and Truelsen [21] described a theoretical implicit cache oblivious dictionary with the working set property, laying out the keys inside a single array and moving keys within the array to sustain the mentioned properties. An idea of a self organizing vEB layout was proposed in [37]: splay rotations were performed inside the layout after a key access, and a noise factor determined how much the vEB order had been distorted by rotations. When the noise reached a certain threshold value, a redistribute function completely rearranged the tree in the vEB order.

5.2 Working set CO B-tree

In this section we present a version of the dynamic CO B-tree that satisfies the working set bound on key access sequences. The data structure is an array of dynamic CO B-trees.

3 http://www.cphstl.dk/Report/Cache-oblivious-search-trees/cphstl-report-2003-3.pdf


The design is similar to the working set structure [10]. It consists of log log N CO B-trees, where N is the total number of keys. Let T1, ..., Tl be the trees, where l = log log N. For 1 ≤ i ≤ l − 1, |Ti| = 2^{2^i}, and 0 < |Tl| ≤ 2^{2^l}. An access order is maintained among the keys inside a tree using an implicit linked list, realized by keeping extra fields in a tree node along with the key, similar to the working set tree proposed in [16]. The linked list is encoded as follows. Each node x ∈ Tj stores the keys of the nodes inserted into Tj directly before and after it, in the fields older[x] and younger[x] respectively. If x is the node with the youngest/oldest key in Tj, then no node was inserted after/before it and thus younger[x] = nil / older[x] = nil. Two fields, oldest and youngest, are maintained per tree Tj, storing the values of the oldest and youngest keys in Tj; using them, the node containing the oldest/youngest key in Tj can be retrieved with a single tree traversal. The youngest field is updated after every insertion into a tree. The oldest field is updated to older[oldest] when the oldest node is deleted from the respective tree.
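The bookkeeping fields described above can be sketched as follows (a minimal C++ sketch; the names are illustrative and the surrounding CO B-tree machinery is omitted):

#include <cstdint>

struct WSNode {
    uint64_t key;
    uint64_t older;     // key inserted into this tree directly before this one
    uint64_t younger;   // key inserted directly after; nil for the youngest
};

struct WSTree {
    uint64_t oldest;    // first candidate to shift into T_{i+1} on overflow
    uint64_t youngest;  // updated after every insertion into this tree
    // ... the vEB-embedded array of WSNodes, of capacity 2^{2^i}, goes here
};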

5.2.1 Operations

Here we present the algorithms for the search, insert and delete operations in the working set CO B-tree. The search and update operations are engineered so that a key x with working set value ω(x) lies in the tree Tj, where j = log log ω(x).

The search for a key x starts at T1 and terminates at a node n in Ti. After this, n is deleted from Ti and reinserted into the tree T1. Before the reinsertion, a key has to be moved from tree Tj to Tj+1, for 1 ≤ j ≤ i − 1, in order to maintain the tree size constraints. Whenever a key has to be moved from tree Tj to Tj+1, the oldest node is deleted from Tj and inserted into Tj+1.
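The search-and-promote loop can be sketched as follows (a minimal sketch reusing the WSTree type from the previous sketch; find, erase, insertKey and oldestOf stand for the corresponding operations of the component CO B-trees and are our assumptions, not the thesis API):

#include <cstdint>
#include <vector>

bool     find(WSTree& t, uint64_t x);      // assumed component operations
void     erase(WSTree& t, uint64_t x);
void     insertKey(WSTree& t, uint64_t x);
uint64_t oldestOf(WSTree& t);

bool wsSearch(std::vector<WSTree>& T, uint64_t x)
{
    int i = 0;                                        // T[i] is tree T_{i+1}
    while (i < (int)T.size() && !find(T[i], x)) ++i;  // search terminates here
    if (i == (int)T.size()) return false;             // x is not present
    erase(T[i], x);                                   // remove x from its tree
    for (int j = 0; j < i; ++j) {                     // shift T_1 .. T_{i-1}
        uint64_t y = oldestOf(T[j]);                  // demote the oldest key
        erase(T[j], y);
        insertKey(T[j + 1], y);
    }
    insertKey(T[0], x);                               // x is now youngest in T_1
    return true;
}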

Insertions happen in the last tree Tl. If Tl reaches its maximum size then a new tree Tl+1 with a bigger size threshold is created. During the deletion of a key present in tree Ti, a key has to be moved from Tj to Tj−1, for i < j ≤ l: the youngest node in Tj is deleted from Tj and inserted into Tj−1. The fields of the oldest node in Tj−1 are adjusted so that the new node is now the oldest node in Tj−1 and its older field is nil.

Theorem 1 The search for a key x in a working set CO B-tree costs an amortized O(log_B ω(x) + log log ω(x)) number of cache block transfers.

Proof: The search for key x involves a series of insert, delete and tree traversal operations in the CO B-trees from T1 through Tj, where j = log log ω(x); the order of these operations is given in the description of the search operation above. The amortized number of block transfers for an update operation on a tree Ti is O(log_B 2^{2^i}) [14], and a traversal incurs O(log_B 2^{2^i}) cache block transfers in the worst case. The total number of block transfers for these operations on the first j trees is thus O(Σ_{i=0}^{j} log_B 2^{2^i}), which equals O(log_B ω(x) + log log ω(x)) after algebraic simplification.


Figure 5.1 The figure on the left represents the logical and physical layout of a hybrid CO B-tree in memory. Lightly shaded regions are vEB trees (layout); the unshaded regions are red-black trees (pointer layout); dotted lines are subtree boundaries. The first figure on the right hand side shows two sequential search operations being performed on a hybrid CO B-tree placed initially as a static vEB tree; notice the evolution of new root blocks with different layouts during the searches. The bottom figure shows a generic search operation on a hybrid CO B-tree. In both figures |RT| = 1, and every search outside RT results in the displacement of a key (keys a and y).

5.3 Hybrid CO B-tree

Accessing keys with a larger working set value inside a working set CO B-tree results in operations involving multiple dynamic vEB trees. As mentioned earlier, dynamic vEB trees are slow for update operations; consequently, the search related tree rearrangements in a working set CO B-tree lower its overall throughput. Employing heuristic measures that ensure fast, cache effective retrieval of the frequently accessed keys while sacrificing the per-access working set guarantee can lead to faster and more cache efficient adaptive CO B-trees.

In this section we propose a hybrid CO B-tree that maintains a fixed size cache of the most recently accessed keys (see figure 5.1). Unlike the previous section, the data structure proposed here is a single search tree where each tree node maintains a key, a constant number of relationship pointers (child and parent pointers) and some meta data. The data structure is hybrid in the sense that its nodes are not arranged in one uniform memory layout; instead, different sections of the tree may have different memory layouts, and the layout of a section may change over time. The layouts alternate between a static vEB layout, a dynamic vEB layout and a pointer based layout. The nodes in the pointer layout are arranged in binary search order and placed at arbitrary memory locations, connected through the left and right child pointers located inside each node. A red-black tree is used for


realizing the pointer layout. The presence of the pointer layout makes our data structure more flexible and efficient in reorganizing towards changing access patterns. The static vEB layout is used for the cache efficiency of the structure. A single section placed in the dynamic vEB layout acts as a key cache; the search related insertion and deletion operations are localized to this single dynamic vEB section. Before proceeding further, we define the following terms, used henceforth in the description of our data structure.

• Bulk node: A bulk node is a higher degree node made up of multiple tree nodes arranged in sorted order inside a contiguous memory area. A bulk node of size n contains n tree nodes and at most n + 1 non null child pointers (the left and right children of the constituent tree nodes). A bulk node supports multi-way binary search, similar to a B-tree node of order n.

• Root block: A root block RX, |RX| = s, of a search tree X is a set of s tree nodes containing the root and the top s − 1 descendants of the root; tree nodes at the same level are added in left to right order. The nodes in RX may be arranged in a different layout than the rest of the tree, i.e. the s + 1 lower hanging subtrees.

• Inorder predecessor subtree: An inorder predecessor subtree of a node n belonging to a root block RX of a tree X contains all those nodes that lie between p and n, where p is the inorder predecessor node of n within RX. The inorder successor subtree is defined analogously.

A hybrid CO B-tree T on N nodes contains a root block RT, also called the primary root block, of size r (see footnote 4), r = O(N), and r + 1 subtrees T1 through Tr+1 hanging from the leaf nodes in RT (figure 5.1). T is arranged in memory as RT followed by T1 through Tr+1. The layout is recursive, as each of these subtrees has its own root block, and so on. Similar to the trees in the working set CO B-tree, access ordering information is maintained for the keys in RT by keeping the corresponding fields inside each node of RT; the oldest/youngest node information is likewise maintained for RT. The following invariants hold for T.

• The nodes in RT are arranged in a dynamic vEB layout and placed inside a contiguous memory area. The layout of the nodes in the other subtrees alternates between a static vEB layout and a pointer layout.

• If ω(x) ≤ r, then x belongs to RT, i.e. the r most recently accessed keys lie in RT.

• A bulk node is treated as a single root block in the pointer layout containing one node (the bulknode itself).

5.3.1 Structure Split and Join Operations

A hybrid CO B-tree needs to reorganize itself with changing access patterns. This demands operations that move keys between the layouts. Similar to the tree shift operation in a working set CO

4 In this thesis we set r = n/100, i.e. one percent of the total number of nodes n in the tree, for the experiments.


B-tree and the splay operation in a splay tree, the hybrid CO B-tree supports key shuffle operations. In this section we describe two operations that rearrange the nodes of the hybrid CO B-tree, specifically across the different layouts, with every search operation, so as to move the most recently accessed key inside RT.

Figure 5.2 The figure on the left demonstrates a path-split operation using the key E on a tree with a root block placed as a red-black tree. The operation is bottom up: it first splits the lower root block in the vEB layout using the path nodes D and E, and finally the modified red-black tree on top is split with a red-black split operation using the key E. The figure on the right shows a join operation, depicting the different cases arising while executing it.

Path-Split(A, P, x, Al, Ar): The path-split operation splits a search tree A into two subtrees Al and Ar such that all the keys in Al/Ar are less/greater than the key x respectively. This is done using a root to node path P of A and a key x such that x ∈ P and x is the last node in P (Algorithm 1 and figure 5.2). First, the left/right child pointers that were followed to create P are marked (see footnote 5). Now the left and right child nodes of x (these nodes do not belong to P) are added to two newly created bulk nodes Pl and Pr respectively (steps 3 through 12). In case either child of x is a bulk node, it is first split at the middle into two bulk nodes connected by a single node (the Node-Split operation), which is then inserted into the respective newly created bulk node. The nodes in P lie inside a sequence of root blocks. A type field maintained inside each node determines the layout of its containing root block. The division of A is done by dividing its constituent root blocks. The root blocks containing P are retraced backwards by visiting the nodes in P from the end. Each root block R is divided into two root blocks RA and RB using the section s of P lying inside it, such that all the nodes in RA/RB are less/greater than x respectively. A root block in the vEB layout is divided using the newly created bulk nodes (these nodes act as parent nodes for the segmented root blocks). The nodes in s that are less/greater than x are

5 A 64 bit pointer can be marked by setting/unsetting its most significant bit.


Algorithm 1 Splitting a tree A using path P and key x

1:  procedure PATH-SPLIT(A, P, x, Al, Ar)
2:    for j = |P| downto 0 do
3:      Pl ← new bulkNode
4:      Pr ← new bulkNode
5:      if x.Left.Type = bulkNode then
6:        Pl ← Node-Split(x.Left)
7:      else
8:        Pl ← x.Left
9:      if x.Right.Type = bulkNode then
10:       Pr ← Node-Split(x.Right)
11:     else
12:       Pr ← x.Right
13:     n ← P[j]
14:     if n.Type ≠ Free then            ▷ tag Free denotes the pointer layout
15:       x.Left ← Pl, x.Right ← Pr
16:       while n.Type ≠ Free do
17:         if n.Key < x then
18:           Pl ← Pl ∪ n
19:         else
20:           Pr ← Pr ∪ n
21:         n ← P[−−j]
22:       if n.Type = bulkNode then
23:         Pl ← Pl ∪ y  ∀ y ∈ n, y < x
24:         Pr ← Pr ∪ z  ∀ z ∈ n, z > x
25:         if x ∈ n then
26:           x.Left ← Pl, x.Right ← Pr
27:     if n.Type = Free then
28:       if n.Left.Marked then
29:         n.Left ← x
30:       else
31:         n.Right ← x
32:       Height ← 0
33:       while n.Type = Free do
34:         Height ← Height + 1
35:         n ← P[−−j]
36:       Red-Black-Split(n, Height, x)
37:    return Al ← x.Left, Ar ← x.Right

added to Pl/Pr (steps 14 to 20), and the marked child pointers inside each node are made null. Note that the child pointers lying inside the respective nodes in Pl/Pr point to the locations of subtrees within the vEB layout of R. If the parent of R was a bulk node, its nodes are distributed among Pl and Pr in a similar manner (steps 22 to 24). Finally, Pl/Pr are made the left/right children of x. Now x is made a child of the first non bulk node in P encountered before discovering any node in the vEB layout again (steps 27 to 31). A root block arranged as a red-black tree is divided with the key x using a red-black split operation, by finding the root, i.e. the root of the root block in s, and the height of the tree (steps 32 to 36). The above steps are repeated until all the nodes in P are exhausted.

Join(A, B, x): The join operation takes two trees A and B and merges them into a single tree using the key x (figure 5.2). The trees are such that all the keys in A/B are less/greater than x respectively. The respective root blocks of A and B, i.e. RA and RB, are used for merging and have to be present in the pointer layout. For a tree with a root block in the vEB layout, a transformation operation splits the tree into two by creating a new root block of size one, containing a single tree node pointing to the left and right child subtrees in the vEB layout; the merge is carried out using this new root block. Recall that a red-black tree is used for implementing the pointer layout, hence both root blocks are now red-black trees. Finally, the root blocks along with x are merged using a red-black join operation [23]. If RA and RB are both bulk nodes, they are simply made the left and right children of x.


In the process of joining, we run the risk of generating red-black trees (root blocks) containing O(N) keys. Consequently, a search would incur O(log N) cache misses, violating the bounds of a CO B-tree. In such a scenario, whenever after a join the number of nodes in a red-black tree becomes greater than N^{1/ε}, ε > 1, the nodes are collected inside an array and transformed into a static vEB tree. The value of ε is adjusted such that the size of the root blocks formed is always O(polylog(N)) = O(log^k N), with 1 < k < log N^{1/ε} / log log N.

Figure 5.3 The figure shows the transformation of a hybrid CO B-tree with successive search operations. The colour coding of the different sections of the tree is consistent with the rest of the thesis. The primary root block is at its maximum allowable size at the start. Notice how additional levels of root blocks are added to the subtree j during the process.

Lemma 1 The number of root blocks lying on a search path in a hybrid CO B-tree is always constant.

Proof: A hybrid CO B-tree T on N nodes is initially arranged as a single vEB tree. During successive search operations T gets fragmented into smaller static vEB trees hanging from the nodes in RT. While RT is within its threshold, |RT| ≤ r, each of these subtrees has a root block of size one containing a single bulk node (figure 5.3). Once RT starts to overflow, larger root blocks in the pointer layout start to emerge as a consequence of the join operation. Unlike the vEB layout, a root block in the pointer layout does not create a new root block during division in the path-split operation. Once these larger root blocks (red-black trees) cross the threshold on the number of nodes, they are converted into static vEB trees. Now if a search runs through these newly formed vEB trees, a new level of root blocks is formed. The number of additional levels formed is equal to the number of times red-black trees get transformed into vEB trees, i.e. log_{N^{1/ε}} N = ε, where N^{1/ε} is the threshold value; this is a constant.

Lemma 2 A pointer layout can be transformed into a static vEB layout in O(1) amortized cache block transfers.

Proof: The nodes in a hybrid CO B-tree can be divided into two sets: nodes in the pointer layout and nodes in the vEB layout. Every path-split operation adds a node to the pointer layout; this node might be a single node or a bulk node. Once N^{1/ε} single nodes accumulate inside a red-black tree, they are read and arranged into a vEB layout, which incurs N^{1/ε} cache block transfers. Using an accounting argument, where every path-split operation contributes one dollar, it can be seen that after N^{1/ε} path-split operations enough credit has accumulated to pay for the transformation. Therefore the transformation operation incurs O(1) amortized cache block transfers.

Lemma 3 Moving through a search path in a hybrid CO B-tree on N nodes incurs O(log_B N + log log N) cache block transfers.

Proof: A search through a hybrid CO B-tree on N nodes encounters a constant number of root blocks (lemma 1). A root block is arranged as a vEB tree, a red-black tree or a bulk node. The vEB tree and the red-black tree can have a maximum of O(N) and O(polylog(N)) nodes respectively. The maximum size of a single bulk node is O(log N), which can be seen as follows. A bulk node is formed while splitting a subtree in the vEB layout: as part of the split operation, the nodes belonging to a path inside this vEB layout are divided across two bulk nodes, and since the vEB tree is a balanced search tree, the path length is O(log N). Hence the size of a bulk node is always O(log N). Note that, once formed, a bulk node never grows by combining with other nodes; it can only split further, thereby decreasing in size. Therefore, moving through a vEB tree, a red-black tree and a bulk node during a search incurs O(log_B N), O(log log N) and O(log N / B) cache block transfers respectively. The stated bound follows by summing these individual costs.

5.3.2 Search Operation

Algorithm 2 Search for a key x

1:  procedure SEARCH(x)
2:    A ← Root-Search(RT, x)
3:    if A.Key = x then return true
4:    Detach(A)
5:    Path ← Binary-Search(A, x)
6:    if Path = NIL then return false
7:    Path-Split(A, Path, x, Al, Ar)
8:    Root-Insert(x)
9:    if x.Left = NIL then
10:     x.Left ← Al
11:   else
12:     Pred ← InoPred(RT, x)
13:     Pred.Right ← Al
14:   if x.Right = NIL then
15:     x.Right ← Ar
16:   else
17:     Succ ← InoSucc(RT, x)
18:     Succ.Left ← Ar
19:   if |RT| > threshold then
20:     y ← Oldest-Node(RT)
21:     PTree ← InoPred-tree(RT, y)
22:     STree ← InoSucc-tree(RT, y)
23:     Detach(PTree)
24:     Detach(STree)
25:     NewTree ← Join(PTree, STree, y)
26:     m ← InoPred(RT, y)
27:     n ← InoSucc(RT, y)
28:     Root-Delete(y)
29:     if m.Right = NIL then
30:       m.Right ← NewTree
31:     else
32:       n.Left ← NewTree


The search for a key x in a hybrid CO B-tree T locates the key either in RT or in one of its child subtrees, and moves x inside RT on completion. Conditionally, a node might be pushed out of RT to maintain the size invariant. The operation starts with a search for x inside RT (Algorithm 2 and figure 5.1). Unlike earlier works where the search was carried out implicitly inside the vEB array [19], the binary search inside the vEB layouts of our tree is carried out using the left/right child pointers located within each node. Storing parent and child pointers along with a node in a vEB layout may seem redundant at first, but the linkage is necessary when nodes move out of the vEB layout and become part of a red-black tree: the child pointers help to locate the respective child subtrees of the node which are still inside the vEB layout and not part of the red-black tree. Failing to find x inside RT, the search gets routed to one of the subtrees, A, hanging from a leaf node in RT. Before the search, A is detached from RT by setting the child pointer leading to A to null. If x is present in T, a binary search in A returns a non empty search path P. Using P, the subtree A is split into two subtrees Al and Ar, after which x is inserted into RT using a vEB insert operation (steps 5 to 8). The subtrees Al and Ar are made the children of either x or of the predecessor and successor nodes of x within RT respectively (steps 9 to 18). This completes a generic search operation. In case the size invariant of RT is violated, i.e. |RT| > r, a few more adjustments are made. The oldest node y in RT is retrieved and a new subtree is created using a join operation with the inorder predecessor and successor subtrees of y; before the join, these inorder subtrees are detached from RT. Now y is deleted from RT using a vEB delete procedure, and finally the new subtree is made a child of the predecessor or successor node of y in RT (steps 19 to 32).

Theorem 2 A search operation in a hybrid CO B-tree containing N nodes incurs an amortized O(log_B N + log log N) number of cache block transfers.

Proof: Once a search path of length at most O(log N) is discovered, incurring O(log_B N + log log N) cache block transfers (lemma 3), a vEB insert (the insertion into RT) followed by a path-split and a join operation occur in the worst case. The number of cache block transfers incurred in a path-split while dividing a red-black tree is O(height) = O(log log N). For dividing a root block in the vEB layout, a path inside the layout is traced and then written inside an array (a bulk node), resulting in O(log_B N + log N / B) more cache block transfers. As the number of root blocks on the path is constant (lemma 1), the cache block transfers incurred during these key rearrangements total O(log_B N + log log N). The join operation incurs O(h1 + h2) = O(log log N) cache block transfers, where h1/h2 are the heights of the participating red-black trees. As |RT| = O(N), the single vEB insert into RT incurs an amortized O(log_B N) number of cache block transfers. The stated bound follows by summing the costs of the search, the vEB insert and the key rearrangement operations.


5.3.3 Update Operations

Here we describe the algorithms for the update operations in a hybrid CO B-tree T. Initially T is arranged as a single vEB tree. Successive search operations on T lead to the formation of RT and other root blocks in various layouts, as described earlier. For handling corner cases, two values, −∞ and +∞, are added to RT prior to the first insertion. Once RT is formed, T gets locked in as a static vEB tree.

Algorithm 3 Insertion of a key x into T

1:  procedure INSERT(x)
2:    if |RT| = 0 then
3:      vEB-Insert(T, x); return
4:    Root-Insert(x)
5:    if |RT| > threshold then
6:      y ← Oldest-Node(RT)
7:      PTree ← InoPred-tree(RT, y)
8:      STree ← InoSucc-tree(RT, y)
9:      Detach(PTree)
10:     Detach(STree)
11:     NewTree ← Join(PTree, STree, y)
12:     m ← InoPred(RT, y)
13:     n ← InoSucc(RT, y)
14:     Root-Delete(y)
15:     if m ≠ NIL then
16:       m.Right ← NewTree
17:     else if n.Left = NIL then
18:       n.Left ← NewTree

Insertions happen in T directly if |RT| = 0, i.e. no search operations have been executed. Otherwise, the new key is inserted into RT. After repeated insertions, if the size invariant of RT is violated, the oldest key y in RT is moved out of RT: y is deleted from RT and joined with its inorder predecessor and successor subtrees, exactly as in the search operation described earlier.

A delete operation removes the key x directly from T if |RT| = 0. The other cases are handled as follows. If x ∈ RT, it is moved out of RT by joining it with its inorder subtrees, as in the search and insert operations. Now x is a node of a red-black tree. Following this, x is replaced by its inorder predecessor or successor node m in the respective root block (m is not a bulk node). After this, x is swapped with the biggest or the smallest node in its inorder predecessor or successor subtree; note that these subtrees are relative to the new location of x, i.e. the old position of m. Now x belongs to either a red-black or a vEB tree. The former case is handled by directly deleting it from the respective red-black tree (x is either a leaf or a node with a single child). In the latter case, x cannot be deleted directly from the vEB layout, as the layout is static: any change in it would alter the positions of the subtrees located within, which are pointed to by nodes in different sections of T. The only way to remove x is to bring it down to a leaf and then mark its entry as a free space in the array housing the vEB layout. This is done by swapping x with the available inorder predecessor or successor node inside the layout. In case x belongs to a subtree hanging from RT, the steps are the same except for the join operation.

Theorem 3 An insert/delete operation in a hybrid CO B-tree on N nodes incurs an amortized O(log_B N + log log N + 1) number of cache block transfers.


Algorithm 4 Deletion of a key x in T

1:  procedure DELETE(x)
2:    Search(x)
3:    if |RT| = 0 then
4:      vEB-Delete(T, x); return
5:    if x ∈ RT then
6:      PTree ← InoPred-tree(RT, x)
7:      STree ← InoSucc-tree(RT, x)
8:      Detach(PTree)
9:      Detach(STree)
10:     NewTree ← Join(PTree, STree, x)
11:     m ← InoPred(RT, x)
12:     n ← InoSucc(RT, x)
13:     Root-Delete(x)
14:     if m.Right = NIL then
15:       m.Right ← NewTree
16:     else
17:       n.Left ← NewTree
18:     a ← InoPred(NewTree, x)
19:     b ← InoSucc(NewTree, x)
20:   else
21:     a ← InoPred(A, x)            ▷ x ∈ A
22:     b ← InoSucc(A, x)
23:   if a ≠ NIL then
24:     Swap(a, x)
25:     if x.Right ≠ NIL then
26:       c ← Smallest(x.Right)
27:       Swap(c, x)
28:     else if x.Left ≠ NIL then
29:       c ← Largest(x.Left)
30:       Swap(c, x)
31:     else
32:       Remove(x)                  ▷ x is a leaf node
33:   else if b ≠ NIL then
34:     Swap(b, x)
35:     if x.Right ≠ NIL then
36:       c ← Smallest(x.Right)
37:       Swap(c, x)
38:     else if x.Left ≠ NIL then
39:       c ← Largest(x.Left)
40:       Swap(c, x)
41:     else
42:       Remove(x)                  ▷ x is a leaf node
43:   if x.Type = Free then
44:     Red-Black-Delete(x)
45:   else if x.Type = bulkNode then
46:     Remove(x)                    ▷ x is the rightmost/leftmost node in the bulk node
47:   else
48:     Make-Leaf(x)
49:     Remove(x)                    ▷ x is a leaf and is replaced by a space in the vEB array


Proof: An insertion in the worst case involves a vEB insert into RT followed by a join operation, thereby incurring an amortized O(log_B N + log log N) number of cache block transfers. For deletes, a search is followed by tree traversal and red-black delete operations. A red-black delete operation incurs O(log log N) cache block transfers, and the tree traversals involve moving through either a red-black tree or a static vEB tree. Summing up these bounds, the delete operation also incurs an amortized O(log_B N + log log N) number of cache block transfers.

5.4 A Working Example

This section describes a working example of a hybrid CO B-tree on 15 nodes (figure 5.4). A sequence of search operations is shown in the example. The example also demonstrates the rearrangement (rearranging RT on exceeding the threshold) and transformation (conversion of a red-black tree to a static vEB layout) operations happening in the data structure while processing the search queries. The first three searches (for the keys f, l, h) simply divide the original tree in the static vEB layout into smaller subtrees, each with a root block in the pointer layout (the root block is a single bulk node), and insert the search keys into RT. The search for the key b results in RT exceeding its size threshold. In order to maintain the size threshold of RT, the oldest key f in RT is removed by joining it with its inorder subtrees (sub figure 6). After this, every search results in a rearrangement operation (sub figures 10 and 11). During successive rearrangements, bigger red-black trees are formed. The search for j results in the formation of a red-black tree with two nodes, which meets the threshold value for the maximum number of nodes in a red-black tree. The two nodes of the red-black tree are read and arranged as a static vEB layout (sub figure 12).

5.5 Performance Evaluation

Experimental Framework: We compare the performance of our cache oblivious B-trees with an AVL tree, a B-tree (order 16), a top down splay tree and a static/dynamic CO B-tree. All the data structures are initially populated with the same 64 bit keys (without duplicates) chosen randomly from a uniformly distributed key space. As the CO B-trees developed here aim to adapt to changing key access patterns, the experiments mainly focus on the performance of search queries. Towards this, we generate query strings containing search queries from the data set with various degrees of temporal locality and measure the time a tree takes to process each query string; this gives us the throughput, i.e. queries processed per second, for each tree. For comparing the cache behavior of the trees, the last level cache (LLC) misses were counted using the Valgrind tool. The data structures were implemented in C++ running on an Ubuntu system with gcc version 4.8. Two machine architectures were used for experimentation. The first is an Intel Xeon E5-1680 CPU containing 8 cores running at 3.2 GHz; it has 32 GB DDR3 primary memory with a maximum bandwidth of 59.7 GB/s and 3 levels of cache, where the size of the last level cache is 20 MB. The second is a low-


Figure 5.4 A working example of a hybrid CO B-tree on 15 nodes. The nodes of the tree are color coded according to the layout they belong to. The darker nodes belong to the primary root block in the vEB layout. The lightly shaded and the unshaded nodes are static vEB tree nodes and red-black tree nodes respectively. The primary root block can contain a maximum of 3 nodes. The size threshold for conversion of a red-black tree to a static vEB tree is 2 nodes. The memory layout of the trees is shown below each sub figure.

The second is a low end ARM Cortex-A15 CPU containing 4 cores running at 2 GHz. It has 4 GB DDR2 primary memory running with a maximum bandwidth of 10 GB/Sec and 2 levels of cache, where the LLC size is 4 MB.
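The throughput measurement itself is straightforward wall-clock timing over a pre-generated query string. The following is a minimal sketch of such a harness, assuming an illustrative Tree interface (the names here are ours, not the thesis code):

#include <chrono>
#include <cstdint>
#include <vector>

// Illustrative dictionary interface; each tree under test implements it.
struct Tree {
    virtual bool search(uint64_t key) = 0;
    virtual ~Tree() = default;
};

// Process a pre-generated query string and report throughput in
// million queries processed per second.
double measureThroughput(Tree& tree, const std::vector<uint64_t>& queries) {
    volatile bool sink = false;  // keep the compiler from eliding the searches
    auto start = std::chrono::steady_clock::now();
    for (uint64_t key : queries)
        sink = tree.search(key);
    auto stop = std::chrono::steady_clock::now();
    double seconds = std::chrono::duration<double>(stop - start).count();
    (void)sink;
    return queries.size() / seconds / 1e6;
}

Cache behavior is measured separately by running the same binary under Valgrind's cachegrind tool and reading off the last level cache miss counts.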

Query String Generation: The keys inside a search query string are generated with various degrees of skewness, simulating temporal locality in key access patterns. The locality of reference is often modelled with Zipf's distribution (ZD) [2], where the i-th most commonly accessed item is accessed with a probability p_i inversely proportional to i. To control the degree of skewness using Zipf's distribution we use an extra parameter α ≥ 0 [4] such that p_i = 1/(i^α C), where C = ∑_{j=1}^{n} (1/j^α) and n is the number of nodes in the tree. Further, Bell and Gupta [11] defined another parameter β, called the skew factor, β = ∑_{j=1}^{n/100} p_j, which gives the probability of accessing the most frequently accessed 1 percent of the keys. The higher the value of β, the greater the number of occurrences of the frequent 1 percent of keys in the query string. For α = 0, β = 0.01, which indicates that the most frequently accessed 1 percent of the keys are accessed 1 percent of the time (i.e. the distribution is uniform). In our evaluation, we have used β values of 0.01, 0.10, 0.20, 0.40, 0.50, 0.60, 0.80 and 0.90.

Each key in the data set was given a unique randomly-selected position in a table of size n. This is the access probability table, where the i-th key has an access probability of p_i defined using the modified


Figure 5.5 The two graphs compare the query throughput of the trees against two query string sizes. Each point in the graph corresponds to a specific search string generated with a different degree of skewness controlled by the skew factor β. The y axis is throughput in million queries processed per second. The x axis shows the various values of β.

Zipf's distribution described above. The first key has the highest access probability, and so on. The key to be searched is obtained by selecting a random number between 0 and 1 and then finding the index of the corresponding key in the access-probability table using the cumulative probability distribution curve.
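Putting the two paragraphs above together, query generation reduces to precomputing the cumulative distribution once and inverting it with a binary search per query. A minimal sketch under these definitions (the class and its interface are ours, not the thesis implementation):

#include <algorithm>
#include <cmath>
#include <cstdint>
#include <random>
#include <vector>

// Draws search keys according to the modified Zipf distribution
// p_i = 1 / (i^alpha * C) with C = sum_{j=1..n} 1/j^alpha, by inverting
// the precomputed cumulative distribution with a uniform sample.
class ZipfQueryGenerator {
public:
    // keys[0] is the most frequently accessed key, keys[1] the next, ...
    ZipfQueryGenerator(std::vector<uint64_t> keys, double alpha, uint64_t seed)
        : keys_(std::move(keys)), rng_(seed), uniform_(0.0, 1.0) {
        const std::size_t n = keys_.size();
        double c = 0.0;                       // normalizing constant C
        for (std::size_t j = 1; j <= n; ++j)
            c += 1.0 / std::pow(double(j), alpha);
        cdf_.resize(n);
        double cumulative = 0.0;
        for (std::size_t i = 1; i <= n; ++i) {
            cumulative += 1.0 / (std::pow(double(i), alpha) * c);
            cdf_[i - 1] = cumulative;
        }
    }

    // Skew factor beta: total probability mass of the top 1 percent of keys.
    double beta() const {
        return cdf_[std::max<std::size_t>(1, cdf_.size() / 100) - 1];
    }

    // Sample one search key via the cumulative probability distribution curve.
    uint64_t next() {
        double u = uniform_(rng_);
        std::size_t idx =
            std::lower_bound(cdf_.begin(), cdf_.end(), u) - cdf_.begin();
        return keys_[std::min(idx, keys_.size() - 1)];
    }

private:
    std::vector<uint64_t> keys_;
    std::vector<double> cdf_;
    std::mt19937_64 rng_;
    std::uniform_real_distribution<double> uniform_;
};

For α = 0 the generator reproduces β = 0.01 (uniform access); increasing α concentrates probability mass on the first table positions, raising β exactly as described above.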

5.5.1 Results

For a near uniform distribution of search queries, the hybrid CO B-tree has query throughput comparable to the rest of the trees. For skewed distributions it outperforms all the others, including the static CO B-tree. As can be seen in figure 5.5, for β = 0.01, i.e. for a uniform key access distribution, the B-tree and the static CO B-tree have the highest throughput. The working set CO B-tree has the lowest throughput across all values of β. For skewed distributions with β > 0.5, the hybrid CO B-tree outperforms the other trees, including the static CO B-tree. It has 2 to 3 times the throughput of the static CO B-tree and the splay tree respectively. Also, notice that the throughput increases rapidly for the hybrid CO B-tree with changing β values. This shows that our tree adapts more rapidly to changing access patterns.

With respect to cache misses per lookup (figure 5.6), the CO B-trees outperform the rest. For skewed distributions the hybrid CO B-tree has the lowest cache misses. The static CO B-tree has the lowest cache misses per query for low values of β. The working set CO B-tree has almost the same value as the B-tree. On average, the hybrid CO B-tree has 1.5 to 2 times fewer cache misses compared to the B-tree and the splay tree respectively. For β > 0.6, the hybrid CO B-tree has the lowest cache misses per query. On the ARM machine, we get a comparatively larger number of cache misses per query, but the trend exhibited by the data structures is nevertheless similar. The hybrid CO B-tree has 2 to 4 times fewer cache misses compared to a B-tree and a splay tree respectively.

We next focus on query mixes, where every tree processes query strings containing search, insert and delete queries in various proportions. Having studied the search performance in the previous experiments, here we generate a higher percentage of update queries (inserts are more than deletes) compared to searches.


Figure 5.6 The cache misses per query incurred by the respective trees while processing the search string, for various values of β and across two machine architectures. The y axis is cache misses per query. The x axis is different values of β.

The search queries in the query mix are generated uniformly at random. As can be seen in figure 5.7, the CO B-tree, followed by the working set CO B-tree, has the lowest throughput, while the B-tree has the highest. The hybrid CO B-tree has throughput comparable to a splay tree and an AVL tree. With respect to caches (figure 5.8), the cache oblivious trees have consistently lower cache misses, among which the hybrid CO B-tree has the lowest cache misses per query.


Figure 5.7 The query throughput comparison for the trees relative to two kinds of query mixes. The y and x axes correspond to the query throughput (million queries processed per second) and the query string size in millions respectively.

Figure 5.8 The cache misses per query comparison for the trees relative to two kinds of query mixes. The y axis is cache misses per query and the x axis is the size of the query string in millions. The results are collected on the Intel machine.


Chapter 6

Conclusion

In this work, we proposed a set of heterogeneous working-set hash tables whose layout spans GPU and CPU memories, where the GPU handles the most frequently accessed keys. The hash tables operate without any explicit data transfers between the devices. This concept can be extended to any set of interconnected devices with varying computational powers, where the most frequently accessed keys lie on the fastest device and so on. For non-uniform key access distributions, our hash tables outperformed all the others in query throughput. In our future work, we plan to investigate the challenges involved in using multiple accelerators, including GPUs and FPGAs. We envisage that maintaining a global MRU list spanning all the devices could be computationally expensive, so suitable approximations that give the right trade-off have to be made.

We also explored the idea of designing a cache oblivious B-tree that adapts to changing key access patterns. The developed data structures led to faster and more cache efficient retrieval of the frequently accessed keys. The gathered experimental results show that our data structures adapt faster than the other search trees with increasing skewness in the key references. To the best of our knowledge, this is the first practical work on access pattern aware cache oblivious data structures. Although our data structures look promising, as a future plan we intend to perform a more thorough investigation with real databases and machines with more complex memory hierarchies, e.g. a networked system fetching data from a remote server with exponential differences in memory latencies across the levels, to further test their effectiveness in real world scenarios.

Both ideas are general and have scope for expansion. For example, the design philosophy of the hash tables can be extended to any chain of interconnected devices with varied computational powers, such that the most recently accessed keys are present in the fastest device and so on. Similarly, the B-tree designs are general enough to adapt to any memory hierarchy. Except for the working set CO B-tree, the size of the working-set was fixed statically in most of our designs. This is an unrealistic assumption, as the working-set tends to change over time. As future work we plan to employ more suitable heuristics to approximate the size of this working-set.


Bibliography

[1] NVIDIA CUDA Compute Unified Device Architecture - Programming Guide, 2007.

[2] L. A. Adamic and B. A. Huberman. Zipf's law and the internet. Glottometrics, 3(1):143–150, 2002.

[3] P. K. Agarwal, L. Arge, A. Danner, and B. Holland-Minkley. Cache-oblivious data structures for orthogonal range searching. In Proceedings of the Nineteenth Annual Symposium on Computational Geometry, SCG '03, pages 237–245, New York, NY, USA, 2003. ACM.

[4] T. Aho, T. Elomaa, and J. Kujala. Reducing splaying by taking advantage of working sets. In C. C. McGeoch, editor, WEA, volume 5038 of Lecture Notes in Computer Science, pages 1–13. Springer, 2008.

[5] D. A. F. Alcantara. Efficient Hash Tables on the GPU. PhD thesis, Davis, CA, USA, 2011. AAI3482095.

[6] Y. Allusse, P. Horain, A. Agarwal, and C. Saipriyadarshan. GpuCV: A GPU-accelerated framework for image processing and computer vision. In Advances in Visual Computing, volume 5359 of Lecture Notes in Computer Science, pages 430–439. Springer, Dec. 2008.

[7] L. Arge, M. A. Bender, E. D. Demaine, B. Holland-Minkley, and J. I. Munro. Cache-oblivious priority queue and graph algorithm applications. In Proceedings of the Thirty-Fourth Annual ACM Symposium on Theory of Computing, pages 268–276. ACM, 2002.

[8] R. Bayer and E. McCreight. Software Pioneers, chapter: Organization and maintenance of large ordered indexes, pages 245–262. 2002.

[9] R. Bayer and E. M. McCreight. Organization and maintenance of large ordered indices. Acta Inf., 1:173–189, 1972.

[10] M. Bădoiu, R. Cole, E. D. Demaine, and J. Iacono. A unified access bound on comparison-based dynamic dictionaries. Theor. Comput. Sci., 382(2):86–96, Aug. 2007.

[11] J. Bell and G. Gupta. An evaluation of self-adjusting binary search tree techniques. Software: Practice and Experience, 23(4):369–382, 1993.

[12] M. A. Bender, R. Cole, and R. Raman. Exponential structures for efficient cache-oblivious algorithms. In Automata, Languages and Programming, pages 195–207. Springer, 2002.

[13] M. A. Bender, E. D. Demaine, and M. Farach-Colton. Efficient tree layout in a multilevel memory hierarchy. In Algorithms – ESA 2002, pages 165–173. Springer, 2002.

[14] M. A. Bender, E. D. Demaine, and M. Farach-Colton. Cache-oblivious B-trees. SIAM J. Comput., 35(2):341–358, Aug. 2005.

[15] M. A. Bender, Z. Duan, J. Iacono, and J. Wu. A locality-preserving cache-oblivious dynamic dictionary. In Proceedings of the Thirteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 29–38. Society for Industrial and Applied Mathematics, 2002.

[16] P. Bose, K. Douïeb, V. Dujmović, and J. Howat. Layered working-set trees. CoRR, abs/0907.2071, 2009.

[17] G. S. Brodal and R. Fagerberg. Funnel heap – a cache oblivious priority queue. In Algorithms and Computation, pages 219–228. Springer, 2002.

[18] G. S. Brodal and R. Fagerberg. Cache oblivious distribution sweeping. In Automata, Languages and Programming, pages 426–438. Springer, 2002.

[19] G. S. Brodal, R. Fagerberg, and R. Jacob. Cache oblivious search trees via binary trees of small height. In Proceedings of the Thirteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA '02, pages 39–48, Philadelphia, PA, USA, 2002. Society for Industrial and Applied Mathematics.

[20] G. S. Brodal, R. Fagerberg, and K. Vinther. Engineering a cache-oblivious sorting algorithm. Journal of Experimental Algorithmics (JEA), 12:2–2, 2008.

[21] G. S. Brodal, C. Kejlberg-Rasmussen, and J. Truelsen. A cache-oblivious implicit dictionary with the working set property. In Algorithms and Computation, pages 37–48. Springer, 2010.

[22] J. Choi, A. Chandramowlishwaran, K. Madduri, and R. Vuduc. A CPU-GPU hybrid implementation and model-driven scheduling of the fast multipole method. In Proceedings of Workshop on General Purpose Processing Using GPUs, page 64. ACM, 2014.

[23] T. H. Cormen. Introduction to Algorithms. MIT Press, 2009.

[24] M. Daga and M. Nutter. Exploiting coarse-grained parallelism in B+ tree searches on an APU. In Proceedings of the 2012 SC Companion: High Performance Computing, Networking Storage and Analysis, SCC '12, pages 240–247, Washington, DC, USA, 2012. IEEE Computer Society.

[25] E. D. Demaine. Cache-oblivious algorithms and data structures. 2002.

[26] P. Denning. Working sets past and present. Software Engineering, IEEE Transactions on, SE-6(1):64–84, Jan. 1980.

[27] J. Fix, A. Wilkes, and K. Skadron. Accelerating braided B+ tree searches on a GPU with CUDA.

[28] T. Foley and J. Sugerman. KD-tree acceleration structures for a GPU raytracer. In Proceedings of the ACM SIGGRAPH/EUROGRAPHICS Conference on Graphics Hardware, pages 15–22. ACM, 2005.

[29] M. J. Folk, B. Zoellick, and G. Riccardi. File Structures - An Object-Oriented Approach with C++. Addison-Wesley-Longman, 1998.

[30] M. Frigo, C. E. Leiserson, H. Prokop, and S. Ramachandran. Cache-oblivious algorithms. In Proceedings Symposium on Foundations of Computer Science (FOCS '99), New York, NY, 1999.

[31] M. Frigo, C. E. Leiserson, H. Prokop, and S. Ramachandran. Cache-oblivious algorithms. In 40th Annual Symposium on Foundations of Computer Science, pages 285–297, New York, New York, Oct. 17–19, 1999.

[32] M. Frigo, C. E. Leiserson, H. Prokop, and S. Ramachandran. Cache-oblivious algorithms. In Proceedings of the 40th Annual Symposium on Foundations of Computer Science, FOCS '99, pages 285–297, Washington, DC, USA, 1999. IEEE Computer Society.

[33] B. He, K. Yang, R. Fang, M. Lu, N. Govindaraju, Q. Luo, and P. Sander. Relational joins on graphics processors. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, pages 511–524. ACM, 2008.

[34] D. Hendler, N. Shavit, and L. Yerushalmi. A scalable lock-free stack algorithm. In Proceedings of the Sixteenth Annual ACM Symposium on Parallelism in Algorithms and Architectures, SPAA '04, pages 206–215, New York, NY, USA, 2004. ACM.

[35] M. Herlihy, N. Shavit, and M. Tzafrir. Hopscotch hashing. In 22nd Intl. Symp. on Distributed Computing, 2008.

[36] J. Hoberock and N. Bell. Thrust: A parallel template library, 2010. Version 1.7.0.

[37] W. Jiang, C. Ding, and R. Cheng. Memory access analysis and optimization approaches on splay trees. In Proceedings of the 7th Workshop on Languages, Compilers, and Run-time Support for Scalable Systems, pages 1–6. ACM, 2004.

[38] M. Kelly and A. Breslow. Quad-tree construction on the GPU: A hybrid CPU-GPU approach. Retrieved June 13, 2011.

[39] F. Khorasani, M. E. Belviranli, R. Gupta, and L. N. Bhuyan. Stadium hashing: Scalable and flexible hashing on GPUs. 2015.

[40] C. Kim, J. Chhugani, N. Satish, E. Sedlar, A. D. Nguyen, T. Kaldewey, V. W. Lee, S. A. Brandt, and P. Dubey. FAST: Fast architecture sensitive tree search on modern CPUs and GPUs. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, SIGMOD '10, pages 339–350, New York, NY, USA, 2010. ACM.

[41] R. E. Ladner, R. Fortna, and B.-H. Nguyen. A comparison of cache aware and cache oblivious static search trees using program instrumentation. In Experimental Algorithmics, pages 78–92. Springer, 2002.

[42] T. W. Lai and D. Wood. Adaptive heuristics for binary search trees and constant linkage cost. In SODA, pages 72–77, 1991.

[43] D. Lea. Hash table util.concurrent.ConcurrentHashMap, revision 1.3, in JSR-166, the proposed Java Concurrency Package.

[44] D. P. Mehta and S. Sahni. Handbook of Data Structures and Applications (Chapman & Hall/CRC Computer and Information Science Series). Chapman & Hall/CRC, 2004.

[45] Z. Metreveli, N. Zeldovich, and M. F. Kaashoek. CPHash: A cache-partitioned hash table. SIGPLAN Not., 47(8):319–320, Feb. 2012.

[46] M. M. Michael. High performance dynamic lock-free hash tables and list-based sets. In Proceedings of the Fourteenth Annual ACM Symposium on Parallel Algorithms and Architectures, SPAA '02, pages 73–82, New York, NY, USA, 2002. ACM.

[47] N. Nguyen and P. Tsigas. Lock-free cuckoo hashing. In Distributed Computing Systems (ICDCS), 2014 IEEE 34th International Conference on, pages 627–636. IEEE, 2014.

[48] R. Pagh and F. F. Rodler. Cuckoo hashing. J. Algorithms, 51(2):122–144, May 2004.

[49] N. Rahman, R. Cole, and R. Raman. Optimised predecessor data structures for internal memory. In G. Brodal, D. Frigioni, and A. Marchetti-Spaccamela, editors, Algorithm Engineering, volume 2141 of Lecture Notes in Computer Science, pages 67–78. Springer Berlin Heidelberg, 2001.

[50] N. Rahman, R. Cole, and R. Raman. Optimised predecessor data structures for internal memory. In Algorithm Engineering, pages 67–78. Springer, 2001.

[51] N. Satish, C. Kim, J. Chhugani, A. D. Nguyen, V. W. Lee, D. Kim, and P. Dubey. Fast sort on CPUs, GPUs and Intel MIC architectures. Intel Labs, pages 77–80, 2010.

[52] B. Schling. The Boost C++ Libraries. XML Press, 2011.

[53] D. D. Sleator and R. E. Tarjan. Self-adjusting binary search trees. J. ACM, 32(3):652–686, July 1985.

[54] J. E. Stone, D. Gohara, and G. Shi. OpenCL: A parallel programming standard for heterogeneous computing systems. IEEE Des. Test, 12(3):66–73, May 2010.

[55] P. van Emde Boas. Preserving order in a forest in less than logarithmic time. In Proceedings of the 16th Annual Symposium on Foundations of Computer Science, SFCS '75, pages 75–84, 1975.

[56] M. Verver. Evaluation of a cache-oblivious data structure. 2008.

[57] Z. Wei and J. JaJa. A fast algorithm for constructing inverted files on heterogeneous platforms. Journal of Parallel and Distributed Computing, 72(5):728–738, 2012.

[58] K. Yotov, T. Roeder, K. Pingali, J. Gunnels, and F. Gustavson. An experimental comparison of cache-oblivious and cache-conscious programs. In Proceedings of the Nineteenth Annual ACM Symposium on Parallel Algorithms and Architectures, pages 93–104. ACM, 2007.

[59] Y. Zhong, M. Orlovich, X. Shen, and C. Ding. Array regrouping and structure splitting using whole-program reference affinity. In Proceedings of the ACM SIGPLAN 2004 Conference on Programming Language Design and Implementation, PLDI '04, pages 255–266, New York, NY, USA, 2004. ACM.
