
SDC: A Software Defined Cache for Efficient Data Indexing

Fan Ni, Song Jiang, Hong Jiang
University of Texas at Arlington
Arlington, TX
[email protected], {song.jiang,hong.jiang}@uta.edu

Jian Huang
University of Illinois at Urbana-Champaign
Urbana, IL
[email protected]

Xingbo Wu
University of Illinois at Chicago
Chicago, IL
[email protected]

ABSTRACT

CPU cache has been used to bridge the processor-memory performance gap to enable high-performance computing. As the cache is of limited capacity, for its maximum efficacy it should (1) avoid caching data that are less likely to be accessed and (2) identify and cache data that would otherwise cost a program multiple memory accesses to reach. Unfortunately, existing cache architectures are inadequate on these two efforts. First, to cost-effectively exploit the spatial locality, they adopt a relatively large and fixed-size cache line as the caching unit. Thus, much of the space in a cache line can be wasted when the data locality is weak. Second, for easy use, the cache is designed to be transparent to programs, which hinders programs from fully exploiting its performance potentials.

To address these problems, we propose a high-performance Software Defined Cache (SDC) architecture providing a simple and generic key-value abstraction that allows (1) caching data at a granularity smaller than a cache line, and (2) enabling programs to explicitly insert, retrieve, and invalidate data in the cache with new instructions. By providing a program with the ability of explicitly using the cache as a lookaside key-value buffer, SDC enables a much more efficient cache without disruptively changing the existing cache organization and without substantially increasing hardware cost. We have prototyped SDC on the gem5 simulator and evaluated it with various data index structures and workloads. Experiment results show that SDC can improve the cache performance for the workloads by up to 5.3× over current cache design.

CCS CONCEPTS

• Computer systems organization → Processors and memory architectures.

KEYWORDS

software-defined cache, key value, data indexing

ACM Reference Format:

Fan Ni, Song Jiang, Hong Jiang, Jian Huang, and Xingbo Wu. 2019. SDC: A Software Defined Cache for Efficient Data Indexing. In 2019 International Conference on Supercomputing (ICS '19), June 26–28, 2019, Phoenix, AZ, USA. ACM, New York, NY, USA, 12 pages. https://doi.org/10.1145/3330345.3330353


1 INTRODUCTION

Computer systems have been using caches to bridge the well-known performance gap between fast CPU and slow memory. Data accesses are accelerated by keeping the most frequently accessed data in the caches to avoid memory accesses that can be an order of magnitude slower [22]. However, caches are much more expensive and smaller in capacity than memory, and must be used efficiently. To this end, the cache should make efforts to (1) cache only frequently accessed data to maximize its hit ratio, and (2) remove metadata accesses by enabling direct cache access of data. Existing cache architectures are inadequate in these two aspects by employing a relatively large cache access unit and a transparent caching strategy, respectively.

1.1 The Issue with Large Cache Lines

The cache hit ratio can be seriously compromised by weak spatial access locality. For highly efficient memory access, low cache space overhead, and effective prefetching, the cache usually adopts a relatively large access unit, such as a 64-byte cache line. However, many important access patterns exhibit weak intra-line spatial locality, leading to the fetching and caching of unused data and compromised cache performance. A representative of such access patterns is pointer chasing, where a sequence of data items is accessed by following their pointers. Pointer chasing is common, especially in pointer-linked index structures, such as B+ trees, skip lists, and hash tables, that are widely used in data-intensive applications [21, 26, 57].

To illustrate the intra-line spatial locality exhibited in the access of a B+ tree and a hash table, we use each of them to build an index structure by inserting 8,000 key-value items, each with a distinct 8-byte key and an 8-byte value. The B+ tree has a fanout of 20, or at most 20 pointers in an internal node to its child nodes. The size of the hash table is 8,000. That is, each of its linked-list buckets has one item on average. Nodes in the data structures are individually allocated with glibc's malloc(). We issue 80,000 lookup requests to each of the data structures. To reflect a normal cache-use scenario, we choose a skewed key distribution (zipfian). We name every contiguous eight bytes in a cache line a cache slot. We track accesses to the slots in the gem5 simulator [18] with a 32KB 8-way cache.¹ Figures 1a and 1c show the access count at each slot of a cache line after serving the lookups on the B+ tree and hash table data structures, respectively. As shown, for both data structures the frequency of access to different slots in a cache line is highly skewed. For example, in the 84th cache line only three slots (Slots 2, 3, and 5) have references (62852, 62852, and 125704 times, respectively) and the other slots do not have any references.

¹ We use a small processor cache to match the relatively small data set size used in our experiments.


Figure 1: Number of accesses at each 8-byte slot (on the Y axis) of every 64-byte cache line (on the X axis) in a 32KB 8-way cache for serving 80,000 key lookups in each of the data structures (B+ tree and hash table) indexing 8,000 distinct keys. Keys in the lookups follow the zipfian distribution. Panels: (a) B+ Tree; (b) B+ Tree with SDC; (c) Hash Table; (d) Hash Table with SDC. Panels (b) and (d) show results with the proposed SDC technique.

This uneven use of slots occurs mostly at nodes that are at or close to the leaves of the B+ tree, which dominate the cache space held by the data structure. The issue can be serious even for the hash table, whose average bucket size is only one, as shown in Figure 1c.
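To make the slot-level waste concrete, the following minimal C sketch (our own illustrative types, not the evaluated code) shows why chasing a hash chain touches only the first two 8-byte slots of each 64-byte node:

    /* Each bucket node is heap-allocated individually, so a lookup
     * dereferences only the key and next fields of each node visited. */
    #include <stdint.h>
    #include <stddef.h>

    struct hnode {
        uint64_t      key;       /* slot 0: compared on every probe          */
        struct hnode *next;      /* slot 1: followed on a chain miss         */
        char          value[48]; /* slots 2-7: untouched unless key matches  */
    };

    struct hnode *chain_lookup(struct hnode *head, uint64_t key)
    {
        /* Each iteration pulls in a whole 64-byte line but reads 16 bytes. */
        for (struct hnode *n = head; n != NULL; n = n->next)
            if (n->key == key)
                return n;
        return NULL;
    }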

Admittedly, this weak-spatial-locality issue is ameliorated when most lookups are successful and a large piece of data is accessed after a successful key lookup. However, referencing a small piece of data after an index lookup is common. For example, Facebook reported that in its Memcached key-value (KV) pools, 90% of the KV items have values smaller than 500 bytes [3]. In a KV pool (USR) dedicated to storing user-account statuses, all values are two bytes. In a general-purpose pool (ETC), 40% of requests to the store have values of 2, 3, or 11 bytes. In a pool for frequently accessed data, 99% of KV items are smaller than 68 bytes [35]. In Twitter's KV workloads, after compression each tweet has only 46 bytes of text [36]. In contrast, the space and time costs of pointer-linked indexes are significant. In Facebook's Memcached servers, the hash table, including the pointers in the buckets and in the LRU lists, accounts for about 20∼40% of the memory space [3]. A recent study on modern in-memory databases shows that hash index accesses are the most significant single source of run-time overhead, constituting 14∼94% of total query execution time [26]. Therefore, the under-utilization of cache space due to the weak spatial locality exhibited in pointer-chasing accesses can be a serious performance issue.

1.2 The Issue with Transparent Caching

To minimize software burdens, the cache is designed to automatically cache data at any memory address referenced by the instructions, and then access them via the addresses.

In a program, there are two ways in which data are addressed. One is that their addresses are directly coded in the program, and the data can be referenced with the addresses without any preceding metadata accesses. Example data include variables and data elements in an array. The other is that the data have been named by the program's users, such as with user-defined keys. To access the data in memory, the program has to first use index structure(s) to translate the keys into actual memory addresses. Example data addressed in this way include values in key-value (KV) caches (such as Memcached) or in-memory KV stores, data accessed by keys in in-memory databases, and in-memory inodes accessed via file paths (as the keys) in file systems.
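A minimal sketch (hypothetical node layout, not the paper's code) of the translation cost in the second case: each level of a B+-tree descent is a dependent memory access, and every node touched is also pulled into the cache before the data itself can be reached.

    #include <stdint.h>
    #include <stddef.h>

    #define FANOUT 20

    struct btnode {
        int      nkeys;
        int      is_leaf;
        uint64_t keys[FANOUT - 1];
        void    *child[FANOUT];   /* at a leaf, child[i] points to the value */
    };

    void *bt_lookup(struct btnode *root, uint64_t key)
    {
        struct btnode *n = root;
        for (;;) {                /* one dependent memory access per level   */
            int i = 0;
            while (i < n->nkeys && key >= n->keys[i])
                i++;
            if (n->is_leaf)
                return (i > 0 && n->keys[i - 1] == key) ? n->child[i - 1] : NULL;
            n = (struct btnode *)n->child[i];
        }
    }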

In existing cache designs, two levels of address translation are performed to load and access a data item named by a user-defined key in the cache, as illustrated in Figure 2(a). The first level of translation (from a key to a memory address) is conducted by software with a number of memory accesses on an index structure. Note that all accessed metadata on the index will also be loaded into the cache, increasing demand on the cache space. The second level of translation (from the memory address to a cache address) is conducted by hardware. If we could access data in the cache directly via their keys, the accesses and caching of the metadata would be avoided. This is one of our design objectives.

Interestingly, the two levels of address translation also occur in a physical-memory-address-indexed cache with every memory address, as shown in Figure 2(b). The first-level translation is from a process's virtual address to the physical address, and the second-level one is from the physical address to the cache address. The first-level translation requires an expensive traversal of the process's page table (a prefix tree).


Figure 2: Various scenarios of address translation into the cache space. (a) Translate a key to a cache address (key → index structure → memory address → cache address). (b) From virtual addresses to cache addresses with the TLB (virtual address → page table/TLB → physical address → cache address). (c) Translate a key to a cache address with SDC (key → hardware hashing unit → shadow memory address → cache address).

Fortunately, the TLB was invented to accelerate the translation by avoiding access of the page table (the metadata). In this work, we would like to generalize the technique and make a similar benefit available to user-defined index structures.

1.3 Our Solution: Software-Defined Cache

In this paper we propose a new cache design, Software-Defined Cache (SDC), as a supplement to the existing cache architecture that allows the last-level cache (LLC) to serve as a look-aside cache in which programmers can explicitly insert, retrieve, or invalidate data using a simple and generic key-value abstraction. As a look-aside cache, SDC is presented to programmers much like a scratchpad, whose cached data are not automatically associated with, or written back to, the memory upon eviction, except that it does not burden programmers as a scratchpad would.

First, SDC is accessed with user-defined keys, instead of memory addresses. Second, the cache space is still managed by the hardware for functions such as data replacement according to access locality, relieving programmers of this complex and often expensive task. In addition, SDC uses a cache slot, which is much smaller than a cache line, as its access unit, without substantially increasing cache space overhead. These advantages of SDC are achieved by seamlessly integrating it into the existing cache architecture. As illustrated in Figure 2(c), SDC maps a key to a memory address in the shadow address space, which is a (physical) address range that has not been mapped to any physical memory, using a hardware hash unit. It then leverages the existing cache address mapping mechanism to map the shadow address to a cache address for caching user-supplied data. It also leverages the existing cache's replacement strategy to manage its space. In the meantime, regular transparent caching is still available, making the LLC a hybrid cache. As part of the LLC, the size of SDC's space is dynamically determined by its data access locality. Cache utilization with weak spatial locality can thus be significantly improved, as illustrated in Figures 1b and 1d.

This paper makes three contributions. 1) We show two fundamental sources of caching inefficiency (large cache lines and transparent caching) that can seriously compromise the performance of data indexing in data-intensive applications. 2) We propose SDC, which uses part of the LLC as a look-aside cache accessed via user-defined keys at a fine granularity. SDC enables a KV cache at the LLC that programs access with a simple and generic key-value interface. 3) We extensively evaluate the SDC design with full-system simulation on gem5 with various workloads. Experiment results show that for its targeted workloads SDC significantly improves application performance. As an anecdotal example, SDC improves the throughput of a real-world in-memory database by up to 3.3×.

2 THE DESIGN OF SDC

The design goal of SDC is to provide programs with explicit fine-grained LLC access via user-defined keys. This KV-style use of the cache gives programmers critical architectural support to overcome weak spatial locality, and TLB-like capability to avoid index traversal. To achieve this goal, a number of challenges must be addressed to make it truly functional and efficient. First, currently data are loaded into caches as a side effect of memory accesses, and a program cannot insert data directly into the cache without issuing a memory access. We need to carefully define a new programming interface that is friendly to programmers and compilers and least intrusive to the existing cache architecture. Second, in the mapping from a potentially large key space to a cache space, key collision is unavoidable. We must develop effective mapping and cache replacement strategies to minimize mapping conflicts. Third, allowing smaller data items in the cache means a higher space cost for additional metadata. We need to carefully make trade-offs among this space overhead, the API's usability, and the cache's hit ratio to build a cost-effective software-defined cache.

2.1 Extending ISA to Enable SDC

To enable use of the CPU cache as a look-aside KV cache, we introduce three instructions (and variants of them) to insert, look up, and invalidate a KV item in the cache. Their formats are shown in Table 1. A common operand of the instructions is the key, provided in a 64-bit register. Admittedly, user-defined keys can be of any format and any length, such as a character string of variable length. To enforce a consistent representation, programmers may be required to first convert a key into a 64-bit number (e.g., by taking the last 8 characters of the key string). Apparently, the conversion may compromise the uniqueness of the original keys. A similar issue arises in SDC's implementation, where to reduce metadata space cost we reduce the size of the tags (for matching cache slots) at a (small) risk of returning wrong values upon a lookup (a false hit). To address these issues, programmers are expected to verify the truthfulness of the returned value.² A commonly used technique for the verification is to store the original full key together with its value, and use the address of the KV pair in memory as the value in sdc_insert and sdc_lookup, or Mem(value addr) in Table 1.

² Security or privacy is not a concern here, as SDC guarantees that a KV item inserted by one process will not be returned to another process.


Table 1: SDC Instructions. Operands 2 and 3 specify the memory address of the value to be inserted in sdc_insert (and that for receiving the return value in sdc_lookup) and the value size, respectively. For the case where the value size is 8 bytes (one cache slot), these two instructions have variants directly storing the value in a register specified by Operand 2. Operation status is a bit in the status register indicating success of the operation. To update a value in the cache, one needs to invalidate it and then insert the new value.

Name             Operand 1   Operand 2         Operand 3         Operation status
sdc_insert       Reg(key)    Mem(value addr)   Reg(value size)   1 or 0
sdc_invalidate   Reg(key)    -                 -                 1 or 0
sdc_lookup       Reg(key)    Mem(value addr)   Reg(value size)   1 or 0

    struct kv_entry buffer, *p;
    uint64_t key;
    ...
    while (receive_req(command, &buffer)) {
        key = buffer.key;
        switch (command) {
        case LOOKUP:
            if (sdc_lookup(key, &p, sizeof(p))) {
                if (p->key == key) { /* original key comparison may be needed */
                    processing_data(p);
                }
            } else { /* miss, do regular search */
                p = index_lookup(key);
                if (p != NULL) {
                    processing_data(p);
                    sdc_insert(key, &p, sizeof(p));
                }
            }
            break;
        case DELETE:
            sdc_invalidate(key);
            del_from_the_kv_store(key);
            break;
        case INSERT:
            p = ins_into_the_kv_store(key, &buffer);
            if (p != NULL)
                sdc_insert(key, &p, sizeof(p));
            break;
        }
    }

Figure 3: Pseudo code for a hypothetical KV store handling various requests with the support of SDC functions.

After receiving the address returned by a lookup, the program willcompare the key at the address with the original key [2, 7, 28].
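A sketch of this conversion-plus-verification pattern (the types, the fold_key() policy, and the sdc_lookup() wrapper signature are our assumptions, following the usage in Figure 3):

    #include <stdint.h>
    #include <string.h>

    extern int sdc_lookup(uint64_t key, void *value_out, uint64_t size);

    struct kv_pair { char full_key[32]; char value[64]; };

    /* Fold a string key into the 64-bit SDC key, e.g. its last 8 bytes. */
    uint64_t fold_key(const char *k, size_t len)
    {
        uint64_t x = 0;
        size_t n = len < 8 ? len : 8;
        memcpy(&x, k + len - n, n);
        return x;
    }

    struct kv_pair *verified_lookup(const char *k, size_t len)
    {
        struct kv_pair *p;
        if (sdc_lookup(fold_key(k, len), &p, sizeof(p))
            && strncmp(p->full_key, k, sizeof(p->full_key)) == 0)
            return p;   /* true hit: the stored full key matches          */
        return NULL;    /* miss or false hit: fall back to the index walk */
    }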

The return value (in Operands 2 and 3) is up to the programmer's interpretation. It can be either the cached value itself or a pointer to the actual value. The value size can be 1, 2, 4, or 8 slots (8, 16, 32, or 64 bytes, respectively). The sdc_invalidate instruction explicitly removes identified KV items from the cache. Note that, like the TLB, SDC uses the cache as a look-aside buffer: when items are evicted or invalidated, there are no write-back operations.
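Since there is no in-place update instruction, an update is the invalidate-then-insert sequence noted in Table 1's caption. A trivial wrapper (ours, assuming the same C wrappers as in Figure 3) might look like:

    #include <stdint.h>

    extern int sdc_invalidate(uint64_t key);
    extern int sdc_insert(uint64_t key, void *value_addr, uint64_t size);

    int sdc_update(uint64_t key, void *new_value_addr, uint64_t size)
    {
        sdc_invalidate(key);                      /* drop any stale copies */
        return sdc_insert(key, new_value_addr, size); /* cache new value   */
    }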

With these instructions, a library of functions can be made available for programmers to use the SDC cache. Figure 3 shows an example use of the cache in pseudo code, whose execution is facilitated by the architecture illustrated in Figure 2(c).

2.2 Two Levels of Address Mapping

As mentioned, one of the SDC design principles is to seamlessly integrate it into the existing cache architecture. In addition to the benefits of a simple design and reuse of the cache circuitry for reduced cost, the integration allows efficient use of the cache space. To achieve high cache-space efficiency, we must keep the most frequently accessed data in the cache, regardless of their sources (either loaded via regular memory accesses or inserted by SDC instructions). If these two types of data were cached or managed separately, it would be hard to consistently compare their access locality, and it would also be difficult to determine the space allocation between them so as to dynamically and accurately reflect their respective space demands. To this end, SDC maps its key space to a range of physical address space not mapped to any physical memory, named the shadow memory address space, from which the existing cache scheme kicks in to further map it into the cache address space.

Accordingly, there are two levels of address mapping in SDC. The first-level mapping (from the user key space to the shadow memory address space) is exclusively managed by SDC. The second-level mapping (from the memory address space, including regular and shadow memory addresses, to the cache space) is managed by the existing cache scheme at the cache-line granularity and by SDC at the slot granularity. Figure 4 illustrates these two levels of address mapping and serves as an overview of the SDC architecture, with details explained in the following two sections.

2.2.1 Mapping from User Key to Memory. For the first-level mapping, we need to decide where the shadow memory fits into the address space. In principle, it can be placed at any location not used in current addressing. As the memory address space is rarely fully occupied (by mapping to DRAM), we choose to double the current physical memory space, making SDC compatible with any of today's and future systems' configurations, by adding an extension bit before the MSB (most significant bit) of a memory address. Accordingly, this bit will be the first bit of a cache line tag, with a "0" indicating a regular cache line and a "1" an SDC cache line. In this way, a physical address in the shadow memory will be of the format '100...0 b_{k-1} b_{k-2} ... b_0' for a shadow memory size of 2^k bytes. Note that for SDC cache lines the sequence of '0's before b_{k-1} does not need to be stored in their tags, reducing cache metadata.

A critical issue in the SDC design is to guarantee the security and privacy of each process's data in the SDC cache. A key is unique only within a process. To isolate the keys of one process from those of other processes, and to make a key unique in an SDC cache, we evenly partition the shadow memory space into shards. Keys of a process are mapped exclusively to a dedicated shard. Suppose an SDC supports up to eight processes simultaneously: the three bits b_{k-1} b_{k-2} b_{k-3} in the aforementioned address format then represent the shard identifier, one per process. As will be further explained, sdc_lookup() only returns values inserted by the same caller process.

An important parameter in the SDC design is the size of the shadow memory, or alternatively the size of a shard for a given number of shards.


Figure 4: Two-level address mapping in SDC. Bit segments and their respective lengths are shown as "segment name : segment length in bits". A 32MB 16-way associative cache is illustrated.
(a) An example of the first-level address mapping. A 64-bit key ("0xe1ed3ef0") in an SDC instruction issued by a process is translated into a 49-bit shadow memory address. The hash unit ("Hash") translates the key into a 22-bit offset in a shard. A 3-bit process id ("001") is placed before the offset to indicate the shard ("Shard 1") to which the key is mapped. Note that the "sub-tag" is used in the second-level mapping.
(b) An example of the second-level address mapping. As in a regular cache, the 15-bit "set id" determines into which cache set the shadow memory address is mapped. As detailed in Section 2.2.2 and illustrated in Figure 5, some other bits, such as "pid", "family bits", and "family segment", are used to determine a cache line in the set. The 3-bit "slot id" determines a slot in the cache line.

An ill-chosen size may compromise SDC's performance advantage. Too large a shard may cause the space to be sparsely filled and many SDC cache lines to be under-utilized; with too small a shard, many keys can be mapped to a single address, increasing conflict misses. We set it to the LLC cache size, ensuring the cache space can be fully utilized.

SDC uses a hash function to map a key to an address in a shard. Though multi-choice hashing could help reduce mapping conflicts, we do not adopt it, to avoid extra access latency and hardware cost. Instead, SDC achieves the multi-choice capability in the second-level mapping by reusing the hardware of the set-associative cache. At the first level, the hash function translates a 64-bit key into an offset in a shard and precedes it with a process id indicating the process to which the key belongs, as illustrated in Figure 4a.
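A sketch of how the first-level mapping could compose a shadow address (the bit widths follow the example in Figure 4a; the xorshift mixer stands in for the hardware hash unit [37], and the exact MSB position is an assumption):

    #include <stdint.h>

    #define OFFSET_BITS 22u   /* shard size = LLC size in the Fig. 4a example */
    #define EXT_BIT     48u   /* assumed extension-bit position               */

    static uint64_t hash1(uint64_t x)  /* stand-in xorshift-style mixer       */
    {
        x ^= x << 13; x ^= x >> 7; x ^= x << 17;
        return x;
    }

    uint64_t shadow_address(uint64_t key, uint64_t pid)
    {
        uint64_t off = hash1(key) & ((1ull << OFFSET_BITS) - 1);
        return (1ull << EXT_BIT)        /* '1' marks the shadow space   */
             | (pid << OFFSET_BITS)     /* shard id = process id        */
             | off;                     /* hashed offset within a shard */
    }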

2.2.2 Mapping from Shadow Memory to Cache. The SDC cache is hosted only in the LLC, as it is designed for memory-intensive applications with large working sets, whose caching in a small cache is unlikely to yield many hits [48]. Though the LLC is relatively slow, a miss on a KV item in the SDC cache has a high penalty (multiple memory accesses for the index search); therefore, increasing the cache hit ratio outweighs achieving short hit time. Further, as the LLC is shared among the cores in a processor, keeping SDC data only in the LLC avoids the cost of maintaining cache coherence in a single-socket system. For multi-socket systems, consistency of the SDC data can be maintained either by hardware, using techniques such as Intel QPI [58] or AMD HyperTransport interconnects [12], or by relying on the OS, using techniques similar to those for the TLB [1].

In SDC, in addition to the mapping conflicts from memory addresses to cache lines, as in regular caches, there is another kind of conflict: in the mapping from user keys to shadow memory addresses. To reduce the resulting conflict misses, we introduce the concept of a cache-line family, which includes all SDC cache lines in a set whose tags differ only in the last m bits, as shown in Figure 4b. A family in a cache set has 2^m member cache lines. A shadow memory address can be mapped to any cache line in the family of a set.

Figure 5: Status bits for an SDC cache line and its slots. The extension bit, pid, and family bits form the SDC cache line tag; each slot carries a sub-tag (a 2-bit family segment plus a 2-bit collision segment) along with a valid bit, a sequence bit, and a reference bit of per-slot metadata.

Keys mapped to the shadow memory address can then be stored in any cache line of the family, reducing the impact of key collisions. The use of cache-line families can also prevent one process from using the cache space too aggressively in a multi-process context: should the family id field (see Figure 5) not be introduced, a value would likely be placed in any line of a cache set.

SDC maintains status bits for each slot in an SDC cache line, as listed below and illustrated in Figure 5.

• Sub-tag: bits for identifying a value matching a given user key. A sub-tag of a slot consists of two bit segments. The first is the family segment field shown in Figure 4b, which is the m bits (assuming the family size is 2^m) to the left of the set id field. The other is a set of bits generated by applying another hash function (Hash2 in Figure 4a) to the original user key, aiming to further reduce the probability of returning a value not matching a given user key (a false hit). If a value occupies multiple cache-line slots, only the first slot needs a sub-tag created with the above rules; the second hash function can then produce more bits to fill the space originally reserved for the other sub-tags, further reducing the probability of false hits.


Figure 6: Two-level replacement strategy: a hypothetical LRU replacement for cache lines in a cache set (ordered from the MRU cache line to the LRU cache line) and a clock replacement for SDC slots in a cache-line family.

• Valid Bit: a bit indicating whether the slot contains a valid value.

• Sequence Bit: a bit indicating the first slot storing a user value. Note that a value can occupy 1, 2, 4, or 8 contiguous slots in a cache line, and is placed at a slot offset that is a multiple of its size. For a value, the sequence bit of its last occupied slot is '0' and the bits of its other slots are all '1's.

• Reference Bit: a bit indicating whether the slot has been recently referenced. It is used to facilitate the clock replacement among slots in a cache-line family. The bit is set to '1' once the slot is accessed.

For an SDC operation, there can be multiple sub-tag-matching slots in different cache lines of a family. For sdc_lookup, SDC randomly selects one of the slots and returns it. For sdc_invalidate and sdc_insert, all the matching slots are invalidated.

When there is no free space available for inserting a new KV item, SDC employs a two-level replacement strategy. The objective is to keep hot (frequently accessed) cache lines, whether regular or SDC cache lines, from being replaced, and to keep hot slots in a cache-line family from being replaced. When a KV item at a shadow memory address is mapped into a family of SDC cache lines in a cache set (all at the same cache-line offset, or slot address), the item can be placed in any of the cache lines whose corresponding slot (or multiple slots, for a larger value) is available (valid bit '0'). If no slot in the family is available and the current family size has not yet reached its maximum, SDC first asks the existing cache-line-level replacement scheme, such as the LRU illustrated in Figure 6, to identify a cold cache line in the set for potential replacement. If the selected cache line does not belong to the family, it is replaced and the value is stored in the corresponding slot(s) of the cache line; the other unused slots in the cache line are marked as invalid (valid bits '0'). Otherwise, instead of replacing the entire SDC cache line selected by the replacement scheme, we proceed with the slot-level CLOCK replacement within the family, as described below.

Assume first that the value is one slot long. As shown in Figure 6, a value can be stored in a slot of any cache line of the family (at the same offset). The replacement policy checks the reference bits of these slots, one at a time, in a fixed order. Whenever it encounters a '1', it resets the bit to '0' and proceeds to the next slot. It stops at a slot whose reference bit is '0' and replaces that slot's current value with the new one. Note that if the displaced value occupies more than one slot, which can be determined by examining the sequence bits of this slot and its neighboring slots, the other slots occupied by the value are marked as invalid. If the new value is more than one slot long, the replacement procedure is similar; the only difference is that the reference bits of multiple slots starting at the offset in each cache line are checked, and the slots are replaced together if the bits are all '0's. Otherwise, they are all reset to '0'. The two-level replacement policy dynamically balances cache space allocation between the regular data cache and SDC.
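A behavioral sketch of the slot-level clock scan (our pseudo-structures, one-slot values only; the hardware would additionally consult sequence bits to invalidate the companion slots of a multi-slot victim):

    #include <stdbool.h>

    #define FAMILY_SIZE 4

    struct slot_meta { bool valid, seq, ref; };

    /* cand[i] is the candidate slot (same offset) in family member i.
     * Terminates: each full pass clears every reference bit it sees,
     * so the second pass is guaranteed to find a '0'.                */
    int pick_victim(struct slot_meta *cand[FAMILY_SIZE], int *hand)
    {
        for (;;) {
            int i = *hand;
            *hand = (*hand + 1) % FAMILY_SIZE;
            if (!cand[i]->ref)
                return i;            /* victim slot found           */
            cand[i]->ref = false;    /* give it a second chance     */
        }
    }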

2.3 Metadata Storage Cost of SDC

While SDC supports cache space management at a finer granularity (the slot), some metadata have to be associated with individual slots. For an empirical estimate of the space cost, we choose a popular processor (Intel E5-2683 v4 [23]) for SDC to be implemented in its LLC (L3). Its LLC is a 20-way set-associative 40MB cache with 64-byte cache lines. The size of a shard is the same as the cache size (40MB). There are eight 8-byte slots in a cache line. Assume a sub-tag of 4 bits (a 2-bit family segment from the regular tag and a 2-bit collision segment from key hashing). Adding the 3 bits for the valid, sequence, and reference bits, each SDC cache line requires 7 × 8 = 56 bits of per-slot metadata. Assuming the SDC supports up to 8 processes, an SDC tag has 7 bits, which is 20 bits less than the regular tag size (27 bits). Therefore, to accommodate SDC, each cache line needs an extra 56 − 20 = 36 bits, or 36/(64 × 8) ≈ 7% of the cache size. To further reduce the space cost, we can simplify the design by fixing the slot size at 16 bytes, which saves the reference bit and accommodates 4 slots in a cache line. This reduces the space cost to less than 1% ((6 × 4 − 20)/(64 × 8) ≈ 0.78%).
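Restating the arithmetic above as one display (the paper's numbers):

    \text{extra bits per line}
      = \underbrace{8 \times (4+3)}_{\text{slot metadata}}
      - \underbrace{(27-7)}_{\text{tag bits saved}}
      = 36,
    \qquad \frac{36}{64 \times 8} \approx 7\%,
    \qquad \text{16-byte slots: } \frac{4 \times 6 - 20}{64 \times 8} \approx 0.78\%.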

3 EVALUATION

To evaluate the performance of SDC, we prototype it in gem5 [6] and conduct extensive experiments with micro-benchmarks, a real-world in-memory database, and key-value cache traces from a production system to reveal its performance insights.

3.1 Evaluation Methodology

We modify gem5's memory system model [18] to enable SDC in the LLC. We implement the three SDC instructions (sdc_insert, sdc_invalidate, and sdc_lookup) as pseudo-instructions in the X86 ISA. The processor's pipeline is drained before the simulated execution of a gem5 pseudo-instruction; this operation and its performance penalty would not occur in a real hardware implementation. To estimate the penalty of pseudo-instruction execution, we tentatively modified the SDC instructions to remove any cache/memory accesses. In this case, these gem5 pseudo-instructions have an execution time of about 20ns (50 cycles), much longer than the expected real execution time (about 1-2 cycles). To compensate for this effect, we conservatively deduct 4ns from each SDC instruction's execution time. Our experiments are carried out in the full-system simulation mode, and the simulation configurations are summarized in Table 2. We choose a small LLC (3MB) for the sake of moderate simulation time; the workloads' data set sizes are determined accordingly. For parameters not listed here, the default values specified in the gem5 release are used.

We conduct experiments on the simulator by running an in-memory key-value store serving insertion, lookup, and deletion requests. The core index structure can be a B+ tree or a hash table. The B+ tree used in the evaluation has a fanout of 20, and its source code is taken from the open-source database LMDB [46]. The hash table has 2^21 linked-list buckets, and its code is taken from the open-source Memcached KV cache [17]. Both indexes are initially populated with 1.5 million KV items. For the hash table, the items' keys are uniformly distributed in the buckets after hashing.


Table 2: Simulation Configurations

Processor Configuration
  ISA                                    X86 (64-bit)
  CPU                                    1 core, 2.5GHz
  Separate L1 I/D caches                 32KB, 8-way, 3-cycle
  Unified L2 cache (LLC)                 3MB, 12-way, 22-cycle
  Cache line size                        64-byte
  Cache replacement policy               LRU

DRAM Configuration
  Memory                                 DDR3_1600_8x8
  DRAM bus bandwidth                     12.8 GB/s

SDC-related Configuration
  Key                                    64-bit, zipfian distribution
  Shadow memory size (one shard)         3MB unless noted
  Sub-tag bits                           3-bit unless noted
  Cache-line family size                 4 unless noted
  Hash function for generating address   xorshift [37]
  Hash function for generating sub-tag   hash64shift [47]
  Hardware hashing unit latency          5-cycle
  SDC access latency                     29-cycle
  Data slot length                       8-byte

Simulation Software Configuration
  OS                                     Linux 3.4.112, x86_64
  Compiler                               gcc-6
  Compile option                         O3

Between the two data structures, the B+ tree, which has five levels and stores all KV items at the leaf nodes, represents index structures with a high penalty for a key lookup miss in the cache. In contrast, the hash table is configured so that each bucket has less than one KV item on average, representing a relatively low miss penalty.

For a key-value item in the stores, the key and the value are 8 bytes and 512 bytes long, respectively. For SDC operations, the 8-byte key and the 8-byte memory address of the KV item serve as the key and value, respectively, in sdc_insert for insertion. The value returned by an sdc_lookup instruction is interpreted as the address for locating the corresponding key-value item. Unless noted otherwise, only the first 64 bytes of a value are accessed after a successful key lookup. After a KV store is populated, we continuously issue lookup requests and measure the store's throughput. The lookup keys are pre-generated using Yahoo's cloud serving benchmark tool (YCSB) [14]. The keys follow a zipfian distribution with a skewness of 0.99 to simulate a workload of strong temporal locality. The data set involved in the lookups in an experiment can be a subset of the KV items pre-inserted in the stores. To calculate the data set's size, for values we include only the data that are actually accessed after a lookup (64 bytes by default). In each experiment we first replay the lookup sequence three times to warm up the cache before actual measurements are conducted.

For performance comparison, we use a system with a regular LLC configuration (see Table 2) as the baseline. Furthermore, we compare SDC with a software solution for accelerating index structure lookups, named SLB (Search Lookaside Buffer) [52]. SLB is a carefully-tuned application-managed look-aside buffer, implemented as a hash table, that caches frequently accessed data in an index structure. In the experiments, the SLB buffer is set to the LLC cache size (3MB), at which it achieves its best performance [52]. In addition to the synthetic workloads for micro-benchmarking, we also evaluate SDC in real systems in Section 3.5.

Figure 7: Throughput (million ops/sec) of lookup requests in the B+-tree and hash-table based stores (Regular, SLB, and SDC) with various data set sizes (0.2× to 25.6× the cache size). (a) B+-tree. (b) Hash Table.

Figure 8: Contributions to memory access reduction (increase of cache hits vs. avoidance of index walks) for the B+-Tree and hash-table based stores with various data set sizes.

3.2 Microbenchmark Performance

In the evaluation with the micro-benchmarks, we are interested in SDC's quantitative performance advantages, the sources of its performance benefits, and the impact of SDC's parameters.

Figure 7 shows the lookup throughput of the two KV stores on the three caches with varying data set sizes. As shown, SDC produces the highest throughput in all testing scenarios. For the B+-tree-based store, the improvement can be as high as 5.3× over Regular at smaller data sets. The improvements for the hash-table-based store are lower (about 9.6%∼23%). SDC improves the performance mainly for two reasons. First, with SDC a lookup request can be served by the SDC cache if the KV item has been inserted and stays in the cache, removing the need for an index walk, which can be very expensive as it usually entails a number of memory accesses. Second, SDC makes efficient use of cache line space by reducing idle data items in a cache line even under weak spatial access locality, which improves the cache hit ratio.

While both reasons lead to a reduction of memory accesses, we measure their relative contributions to the reduction in each scenario; the results are shown in Figure 8. As shown, the two reasons contribute differently in the two stores. The B+-tree store benefits mostly from the first reason (avoidance of index walks), as an SDC lookup hit may remove multiple (likely five or more) memory accesses. In contrast, an SDC hit saves only one or two memory accesses in the hash-table index due to the small bucket size.


Figure 9: Number of memory accesses served at the DRAM per lookup in the B+-Tree and hash-table based stores (Regular, SLB, and SDC) with various data set sizes. (a) B+-Tree. (b) Hash Table.

The second reason (increase of cache hits due to higher cache line utilization) contributes more to the hash table's throughput improvement. Figure 9 shows the average number of memory accesses per lookup. SDC significantly reduces memory accesses compared to the regular cache, especially for the B+-tree store (by up to 85%).

When the data set grows beyond a certain size (about 3.2 times the cache size; see Figure 7), the working set starts to exceed the cache size and capacity misses keep increasing. Figure 10 shows the hit ratio of the SDC cache as the data set size increases. When the working set grows very large and the SDC cache hit ratio drops, SDC's performance advantage shrinks. However, as SDC is a user-defined cache, the user program may track the hit ratio and selectively insert and look up only a subset of its working set in the cache, maintaining a high hit ratio at least for this smaller set of data.
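One way a program could implement this selectivity (entirely our sketch; the admission policy and names are illustrative, reusing the wrappers from Figure 3):

    #include <stdint.h>

    extern int sdc_insert(uint64_t key, void *value_addr, uint64_t size);

    static uint64_t lookups, hits;   /* updated at each sdc_lookup call site */

    static uint64_t mix(uint64_t x)  /* any cheap user-level mixer           */
    {
        x ^= x >> 33; x *= 0xff51afd7ed558ccdULL; x ^= x >> 33;
        return x;
    }

    /* Admit every key while the observed hit ratio is healthy; once it
     * drops, admit only ~1/4 of the key space so the admitted subset is
     * small enough to stay resident in the cache.                        */
    void maybe_insert(uint64_t key, void *value_addr, uint64_t size)
    {
        int pct = lookups ? (int)(100 * hits / lookups) : 100;
        if (pct >= 50 || (mix(key) & 3) == 0)
            sdc_insert(key, value_addr, size);
    }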

Compared to SLB, SDC provides up to 54% performance improvement. By maintaining an in-memory buffer, SLB requires an extra memory copy to move a data item into the buffer. Furthermore, tracking access frequency to effectively perform its replacement policy requires extra space and time for recording, updating, and searching a set of metadata about the data items in the buffer. Accordingly, SLB increases the demand on cache space, and it takes more CPU cycles and memory accesses to insert data items into the buffer. Therefore, its performance advantage is smaller. Note that by design SLB cannot be shared by multiple processes. If each process has its own SLB buffer, the software-managed buffers may compete with each other, leading to many cache misses and SLB's inefficacy.

3.3 Performance Impact of Family Size

In the SDC design, the tag of an SDC cache line is actually a family id used to identify a family of SDC cache lines in a cache set, and a shadow address can be mapped to any of them. The sub-tag of a cache-line slot is used for address matching. A larger family provides more candidate slots to serve a data insertion request, which helps reduce conflict misses and improve the cache's performance. In the experiments, we measure the stores' performance with different family sizes to study this impact. Figures 11 and 12 show SDC's hit ratio and the stores' throughput with different family sizes, respectively. The throughput results are normalized to their respective counterparts for the baseline store, whose family size is 1 and whose sub-tag is 3 bits long. When the family size doubles, one more bit is added to the sub-tag. As shown, the improvement with a larger family is substantial, especially with a larger data set and consequently more serious collisions among multiple keys mapped to the same shadow address.

Figure 10: Hit ratio of SDC lookups for the B+-Tree and hash-table based stores with various data set sizes.

Figure 11: SDC access hit ratio for lookups in the B+-Tree and hash-table based stores with different family sizes (1, 2, and 4). (a) B+-Tree. (b) Hash Table.

With a larger family size, the throughput can be up to 2.3× higher (for the B+-tree store). The improvement shrinks if we keep increasing the data set size after the improvement peaks at a data set of about 3.2× the cache size. This is due to an increased ratio of false hits. With a large family size, a KV item can stay longer in the cache and likely produce more hits. However, because the key space of a process can be much larger than a shard's memory space, key collisions can still become increasingly serious with a large data set. Therefore, using a larger family may also produce more false hits. Fortunately, our evaluation shows that most of the increased hits are true ones (the false hit ratio is very low, below 1.5%), and SDC almost always obtains a performance improvement with a larger family size. The only exceptions are with the hash table when the data sets are small and the family size is 2; in this situation using a larger family does not improve the hit ratio but produces more false hits. By introducing the concept of the cache-line family and making its size a design parameter, one can limit interference among processes competing for cache lines in a cache set, which helps reduce false hits.

3.4 Performance Impact of Sub-tag Length

As discussed, false hits can be a threat to the cache's performance, as their penalty can be higher than that of a miss: it takes an extra memory access for verification to reveal a false hit. False hits can be minimized by increasing the sub-tag length. Candidate data items in slots of different cache lines in a family are more strictly screened when longer sub-tags are compared before a matched one is returned to the program. With a short sub-tag, it can be easy to find one or even multiple matched data items.


Figure 12: Normalized throughput of lookups in the B+-Tree and hash-table based stores with different family sizes (1, 2, and 4). (a) B+-Tree. (b) Hash Table.

Figure 13: SDC false hit ratio for lookups in the B+-tree and hash-table based stores with different sub-tag lengths (0 to 3 bits). (a) B+-Tree. (b) Hash Table.

Figure 14: Throughput of lookups in the B+-tree and hash-table based stores with different sub-tag lengths (0 to 3 bits), compared with Regular. (a) B+-Tree. (b) Hash Table.

In the case of multiple matches, SDC randomly picks one and returns it. This may turn many misses (as they would be with long sub-tags) into (likely false) hits and artificially boost the hit ratio. Figure 13 shows the false hit ratio with different sub-tag lengths, and Figure 14 shows the corresponding throughput. In these experiments the family size is fixed at 4. As expected, using a very short sub-tag (0 or 1 bit) can dramatically increase the false hit ratio and reduce the throughput. Note that the throughput reduction is moderate, as most of the false hits with short sub-tags are simply misses with long sub-tags. As long as SDC uses a moderately sized sub-tag (3 to 5 bits), the false hit ratio can be kept reasonably small.
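A back-of-the-envelope bound (our assumption of uniformly distributed sub-tag bits, not a result from the paper) shows why a few bits suffice. Conditioned on c non-matching values residing at the examined slot positions, an s-bit collision segment passes each wrong value with probability 2^{-s}, so

    P(\text{false hit} \mid c\ \text{colliding residents})
      \approx 1 - \left(1 - 2^{-s}\right)^{c}
      \le c \cdot 2^{-s},

and each added bit roughly halves the bound. The unconditional false-hit rate is far lower still, since most lookups never encounter a colliding resident.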

3.5 Benefits for In-Memory Database

To evaluate the efficiency of SDC in real systems, we run experiments with Silo, an open-source in-memory database [50, 51]. In Silo, a Masstree-inspired tree is used for data indexing. To enable SDC accesses, we added about 100 lines of code to Silo.

Figure 15: Throughput of the Silo database with YCSB workloads (A, B, C, and D) at different value sizes (8 to 512 bytes). The throughput is normalized to that of the regular system without SDC.

We use 8-byte strings as keys and vary the value size from 8 to 512 bytes. To simulate the cost of data access, when a lookup returns a KV item, we read the first 8 bytes of every 64 bytes of data in the value; accessing 512 bytes of data thus requires 8 memory accesses.

We first fill the database with 1 million KV items, and then use the YCSB benchmark to generate four workloads, each issuing 128,000 requests with a zipfian distribution. The four workloads follow the same access patterns as YCSB's A, B, C, and D workloads [14]. Workload A has 50% reads and 50% updates, representing a session store. Workload B has 95% reads and 5% updates, representing photo tagging applications. Workload C has 100% reads, and Workload D has 95% reads and 5% inserts with a bias towards recently created records. We use the first 32,000 requests to warm up the system and measure the time to serve the remaining 96,000 requests with one worker thread.

As shown in Figure 15, with a small value size SDC can significantly improve the throughput, by up to 3.3× (e.g., for the read-only Workload C). For workloads with a larger percentage of write requests, the improvement is less significant, as SDC cannot accelerate write operations, and for each write operation an sdc_invalidate request must be issued to SDC to invalidate a possibly cached key. With a larger value size, more time is spent on accessing in-memory data and the overhead introduced by the index search becomes less significant; when the value size is 512 bytes, the throughput is improved by 1.9× for Workload C. As existing studies have shown that small data items are common in today's data processing systems [3], these results with a real-world in-memory database show that SDC is effective in improving data indexing performance.

4 RELATED WORK

Effective use of the CPU cache is critical to supporting high-performance memory-intensive applications. There have been many works on leveraging the cache to reduce or accelerate memory accesses.

Speeding up pointer chasing. Pointer chasing is a memory access pattern that is notoriously inefficient due to its access irregularity [16, 24, 25] and inherent serialization [24, 29, 31]. Meanwhile, it is performance-critical and extensively used in software such as databases, file systems, graph processing, dynamic routing tables, and garbage collectors. A major technique to address the issue is to predict and prefetch the next node and pointer. The prediction can be hardware-based [10, 13, 42], software-based [29, 31], or pre-execution-based [11, 30]. This approach can be expensive, consuming too much memory bandwidth [16, 24, 41], and cannot be consistently effective across different data structures and access patterns. Concerned with the limited CPU cache capacity for holding the working set accessed during pointer chasing, Hsieh et al. proposed to perform pointer chasing inside memory with PIM (processing-in-memory) hardware support in 3D-stacked memory [21]. In fact, one does not have to cache the entire working set, or all the data in the pointer-linked nodes. With SDC, programmers can selectively insert only the pointers to the next node into the cache. With a much smaller data set to cache, the pointers can stay in the cache to enable efficient in-cache pointer chasing.
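As a concrete illustration (ours, not code from the paper), caching only next-pointers might look like this, reusing the assumed sdc_lookup/sdc_insert operations from the earlier sketch and assuming sdc_insert copies the value into the cache:

#include <stddef.h>

extern const void *sdc_lookup(const char *key, size_t klen);
extern void sdc_insert(const char *key, size_t klen,
                       const void *val, size_t vlen);

struct node { struct node *next; /* payload omitted */ };

struct node *chase(struct node *cur, int hops)
{
    while (cur && hops-- > 0) {
        /* The key is the current node's 8-byte address itself. */
        const void *hit = sdc_lookup((const char *)&cur, sizeof cur);
        if (hit) {
            cur = *(struct node *const *)hit;  /* step without touching
                                                  the node in memory */
        } else {
            struct node *next = cur->next;     /* one memory access */
            sdc_insert((const char *)&cur, sizeof cur,
                       &next, sizeof next);    /* cache only the pointer */
            cur = next;
        }
    }
    return cur;
}

Only 8 bytes per node need to stay cached, rather than whole nodes, so long chains can be chased largely within the cache.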

Improving index search performance. Traversal of index structures can be a frequent and expensive operation. There have been many efforts on optimizing index data structures, such as cuckoo hashing [43, 44], Masstree [32], Wormhole [53], CSS-Trees (Cache-Sensitive Search Trees) [38], and CSB+-Trees (Cache-Sensitive B+-Trees) [39]. These efforts are limited to specific indexes. In contrast, SDC can help avoid many traversals on any index with the expected temporal access locality. As mentioned, SLB is a software-based technique that collects pairs of a key and its corresponding index search result into a memory buffer; with strong spatial and temporal locality, data in the buffer will also stay in the CPU cache [52]. Compared with SDC, SLB may not be aware of the cache size or of the demands on cache space from other data structures and/or processes. Therefore, it does not know whether a data item is cached and has no control over it. Further, the cost of buffer management, including data admission and replacement, can be high. SDC addresses these inadequacies.

Index search can also be accelerated with hardware support, by designing new specialized hardware components [20, 26, 33] or by leveraging newly available hardware features [8, 19, 54, 56]. Widx, an on-chip accelerator for database hash index lookups, is one such example. To use Widx, programmers must disclose how keys are hashed into hash buckets and how the walk over the nodes in a bucket is conducted. In addition to its applicability being limited to hash indexes, Widx increases programmers' burden, in sharp contrast with SDC, which does not require any knowledge of how the search is actually conducted. There are also proposals on using new vector instructions for hash table search [20, 33]. However, their usage is usually limited to Vectorwise database systems.

Memoization technique. Memoization has been used in memory systems to cache the results of repetitive computations and allow a program to skip them. It can be implemented in software or hardware. Software memoization [15, 45] suffers from high runtime cost and is only beneficial for long memoization tasks [9, 49]. Hardware implementations [9, 49] incur significant hardware overheads as they need to introduce dedicated memoization caches. Since not all applications can benefit from these caches, they can waste a lot of chip area without benefit and compromise power efficiency. MCache [55] stores memoization tables in memory and allows them to share cache capacity with conventional data. Different from existing solutions, SDC reuses the regular cache space on demand for caching frequently accessed index entries and needs only minor changes to existing cache structures. SDC can achieve the same goal as memoization techniques with low hardware cost while being easy to use.
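To illustrate the point (our sketch, using the same assumed sdc_* operations as above), memoizing a pure function through SDC requires no dedicated memoization cache:

#include <stddef.h>

extern const void *sdc_lookup(const char *key, size_t klen);
extern void sdc_insert(const char *key, size_t klen,
                       const void *val, size_t vlen);
extern double expensive(double x);  /* any pure, repetitive computation */

double memo_expensive(double x)
{
    const void *hit = sdc_lookup((const char *)&x, sizeof x);
    if (hit)
        return *(const double *)hit;  /* reuse result, skip computation */
    double r = expensive(x);
    sdc_insert((const char *)&x, sizeof x, &r, sizeof r);
    return r;
}

The cached results simply share regular cache capacity with other data and are evicted under the normal replacement policy.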

Software-controlled use of CPU cache. There have been studies on incorporating user knowledge of data access into cache management. Various techniques have been proposed, from cache control instructions [40] to scratchpad memory [4]. User programs can use the instructions to disclose memory access patterns, such as data items that will be used only once. There are also instructions for prefetching a cache line, resetting the data in a cache line to zero, and invalidating or evicting a cache line. Compared to SDC, these instructions add only limited capability for making the cache more effective. In contrast, a scratchpad is a technique that removes an important cache feature – the transparent mapping of memory addresses to cache addresses – to allow greater control over its use by providing instructions to move data to and from main memory. A major disadvantage of scratchpads is that they introduce a new address space (on the scratchpad) that differs from the memory address space assumed in user programs, which makes it hard to adapt programs to the platform. Thus, scratchpads work poorly with irregular or input-dependent data sets [27, 34]. They are rarely used in mainstream processors and are mostly found in embedded systems and special-purpose processors. In contrast, instead of removing the memory address space, SDC enables an address space that is closer and friendlier to user programs' semantics (the user-defined keys) for customizing cache use. Further, SDC does not require a separate fast memory: it dynamically allocates space from the existing cache for user-defined use and enables efficient space sharing.

There have been works in the literature named software-defined caching [5] or software-defined cache hierarchies [48]. They allow programs to define shares, which are collections of cache banks, and accordingly to control data placement and capacity allocation either at one cache level (Jigsaw [5]) or across a cache hierarchy (Jenga [48]). They function at a coarser grain, fitting working sets into the shares, while SDC allows user programs to control the caching of individual data items.

5 CONCLUSION

We propose SDC, a user-programmable CPU cache architecture that enables key-value based data caching, allowing programs to place their selected data items in the cache for quick access. Our design minimizes the programming effort of using the user-controllable lookaside cache and enables rich performance optimization opportunities through customized use of the CPU cache. SDC integrates seamlessly into the existing cache architecture, and its implementation requires only moderate changes to current cache circuitry. With a prototype implementation in gem5, we conducted extensive experiments with various workloads. The results show that SDC improves the performance of index-based data management programs by up to 5.3x over the existing cache design.

ACKNOWLEDGMENTS

We are grateful to the anonymous reviewers for their valuable comments and feedback. This work was supported in part by NSF grants CNS-1527076 and CCF-1815303.


REFERENCES

[1] Nadav Amit. 2017. Optimizing the TLB Shootdown Algorithm with Page Access Tracking. In Proceedings of the 2017 USENIX Conference on Usenix Annual Technical Conference (USENIX ATC '17). USENIX Association, Berkeley, CA, USA, 27-39. http://dl.acm.org/citation.cfm?id=3154690.3154694
[2] David G. Andersen, Jason Franklin, Michael Kaminsky, Amar Phanishayee, Lawrence Tan, and Vijay Vasudevan. 2009. FAWN: A Fast Array of Wimpy Nodes. In Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles (SOSP '09). ACM, New York, NY, USA, 1-14. https://doi.org/10.1145/1629575.1629577
[3] Berk Atikoglu, Yuehai Xu, Eitan Frachtenberg, Song Jiang, and Mike Paleczny. 2012. Workload Analysis of a Large-scale Key-value Store. In Proceedings of the 12th ACM SIGMETRICS/PERFORMANCE Joint International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS '12). ACM, New York, NY, USA, 53-64. https://doi.org/10.1145/2254756.2254766
[4] Rajeshwari Banakar, Stefan Steinke, Bo-Sik Lee, Mahesh Balakrishnan, and Peter Marwedel. 2002. Scratchpad Memory: A Design Alternative for Cache On-chip Memory in Embedded Systems. In Proceedings of the Tenth International Symposium on Hardware/Software Codesign (CODES 2002). IEEE, Estes Park, CO, USA, 73-78.
[5] Nathan Beckmann and Daniel Sanchez. 2013. Jigsaw: Scalable Software-defined Caches. In Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques (PACT '13). IEEE Press, Piscataway, NJ, USA, 213-224. http://dl.acm.org/citation.cfm?id=2523721.2523752
[6] Nathan Binkert, Bradford Beckmann, Gabriel Black, Steven K. Reinhardt, Ali Saidi, Arkaprava Basu, Joel Hestness, Derek R. Hower, Tushar Krishna, Somayeh Sardashti, Rathijit Sen, Korey Sewell, Muhammad Shoaib, Nilay Vaish, Mark D. Hill, and David A. Wood. 2011. The Gem5 Simulator. SIGARCH Comput. Archit. News 39, 2 (Aug. 2011), 1-7. https://doi.org/10.1145/2024716.2024718
[7] W. A. Burkhard. 1976. Hashing and Trie Algorithms for Partial Match Retrieval. ACM Trans. Database Syst. 1, 2 (June 1976), 175-187. https://doi.org/10.1145/320455.320469
[8] Eric S. Chung, John D. Davis, and Jaewon Lee. 2013. LINQits: Big Data on Little Clients. In Proceedings of the 40th Annual International Symposium on Computer Architecture (ISCA '13). ACM, New York, NY, USA, 261-272. https://doi.org/10.1145/2485922.2485945
[9] Daniel Citron and Dror G. Feitelson. 2000. Hardware Memoization of Mathematical and Trigonometric Functions. Technical Report. The Hebrew University of Jerusalem.
[10] Jamison Collins, Suleyman Sair, Brad Calder, and Dean M. Tullsen. 2002. Pointer Cache Assisted Prefetching. In Proceedings of the 35th Annual ACM/IEEE International Symposium on Microarchitecture (MICRO 35). IEEE Computer Society Press, Los Alamitos, CA, USA, 62-73. http://dl.acm.org/citation.cfm?id=774861.774869
[11] Jamison D. Collins, Hong Wang, Dean M. Tullsen, Christopher Hughes, Yong-Fong Lee, Dan Lavery, and John P. Shen. 2001. Speculative Precomputation: Long-range Prefetching of Delinquent Loads. In Proceedings of the 28th Annual International Symposium on Computer Architecture (ISCA '01). ACM, New York, NY, USA, 14-25. https://doi.org/10.1145/379240.379248
[12] HyperTransport Technology Consortium et al. 2008. HyperTransport I/O Link Specification. Revision 1 (2008), 111-118.
[13] Robert Cooksey, Stephan Jourdan, and Dirk Grunwald. 2002. A Stateless, Content-directed Data Prefetching Mechanism. In Proceedings of the 10th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS X). ACM, New York, NY, USA, 279-290. https://doi.org/10.1145/605397.605427
[14] Brian F. Cooper, Adam Silberstein, Erwin Tam, Raghu Ramakrishnan, and Russell Sears. 2010. Benchmarking Cloud Serving Systems with YCSB. In Proceedings of the 1st ACM Symposium on Cloud Computing (SoCC '10). ACM, New York, NY, USA, 143-154. https://doi.org/10.1145/1807128.1807152
[15] Yonghua Ding and Zhiyuan Li. 2004. A Compiler Scheme for Reusing Intermediate Computation Results. In Proceedings of the International Symposium on Code Generation and Optimization: Feedback-directed and Runtime Optimization (CGO '04). IEEE Computer Society, Washington, DC, USA, 279-. http://dl.acm.org/citation.cfm?id=977395.977679
[16] E. Ebrahimi, O. Mutlu, and Y. N. Patt. 2009. Techniques for Bandwidth-efficient Prefetching of Linked Data Structures in Hybrid Prefetching Systems. In 2009 IEEE 15th International Symposium on High Performance Computer Architecture. IEEE, Raleigh, NC, USA, 7-17. https://doi.org/10.1109/HPCA.2009.4798232
[17] Brad Fitzpatrick. 2004. Distributed Caching with Memcached. Linux J. 2004, 124 (Aug. 2004), 5-. http://dl.acm.org/citation.cfm?id=1012889.1012894
[18] gem5. 2014. Gem5-Classic Memory System. http://www.gem5.org/Classic_Memory_System.
[19] Brian Gold, Anastassia Ailamaki, Larry Huston, and Babak Falsafi. 2005. Accelerating Database Operators Using a Network Processor. In Proceedings of the 1st International Workshop on Data Management on New Hardware (DaMoN '05). ACM, New York, NY, USA, Article 1, 6 pages. https://doi.org/10.1145/1114252.1114260
[20] Timothy Hayes, Oscar Palomar, Osman Unsal, Adrian Cristal, and Mateo Valero. 2012. Vector Extensions for Decision Support DBMS Acceleration. In Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-45). IEEE Computer Society, Washington, DC, USA, 166-176. https://doi.org/10.1109/MICRO.2012.24
[21] Kevin Hsieh, Samira Khan, Nandita Vijaykumar, Kevin K. Chang, Amirali Boroumand, Saugata Ghose, and Onur Mutlu. 2016. Accelerating Pointer Chasing in 3D-stacked Memory: Challenges, Mechanisms, Evaluation. In 2016 IEEE 34th International Conference on Computer Design (ICCD). IEEE, Phoenix, AZ, USA, 25-32. https://doi.org/10.1109/ICCD.2016.7753257
[22] Intel. 2013. Intel Haswell Processors. http://www.7-cpu.com/cpu/Haswell.html.
[23] Intel. 2016. Intel Xeon Processor E5-2683 v4. https://ark.intel.com/products/91766/Intel-Xeon-Processor-E5-2683-v4-40M-Cache-2-10-GHz-.
[24] Doug Joseph and Dirk Grunwald. 1997. Prefetching Using Markov Predictors. In Proceedings of the 24th Annual International Symposium on Computer Architecture (ISCA '97). ACM, New York, NY, USA, 252-263. https://doi.org/10.1145/264107.264207
[25] M. Karlsson, F. Dahlgren, and P. Stenstrom. 2000. A Prefetching Technique for Irregular Accesses to Linked Data Structures. In Proceedings of the Sixth International Symposium on High-Performance Computer Architecture (HPCA-6). IEEE, Toulouse, France, 206-217. https://doi.org/10.1109/HPCA.2000.824351
[26] Onur Kocberber, Boris Grot, Javier Picorel, Babak Falsafi, Kevin Lim, and Parthasarathy Ranganathan. 2013. Meet the Walkers: Accelerating Index Traversals for In-memory Databases. In Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-46). ACM, New York, NY, USA, 468-479. https://doi.org/10.1145/2540708.2540748
[27] Rakesh Komuravelli, Matthew D. Sinclair, Johnathan Alsop, Muhammad Huzaifa, Maria Kotsifakou, Prakalp Srivastava, Sarita V. Adve, and Vikram S. Adve. 2015. Stash: Have Your Scratchpad and Cache It Too. In Proceedings of the 42nd Annual International Symposium on Computer Architecture (ISCA '15). ACM, New York, NY, USA, 707-719. https://doi.org/10.1145/2749469.2750374
[28] Hyeontaek Lim, Bin Fan, David G. Andersen, and Michael Kaminsky. 2011. SILT: A Memory-efficient, High-performance Key-value Store. In Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles (SOSP '11). ACM, New York, NY, USA, 1-13. https://doi.org/10.1145/2043556.2043558
[29] Mikko H. Lipasti, William J. Schmidt, Steven R. Kunkel, and Robert R. Roediger. 1995. SPAID: Software Prefetching in Pointer- and Call-intensive Environments. In Proceedings of the 28th Annual International Symposium on Microarchitecture. IEEE, Ann Arbor, MI, USA, 231-236.
[30] Chi-Keung Luk. 2001. Tolerating Memory Latency Through Software-controlled Pre-execution in Simultaneous Multithreading Processors. In Proceedings of the 28th Annual International Symposium on Computer Architecture (ISCA '01). ACM, New York, NY, USA, 40-51. https://doi.org/10.1145/379240.379250
[31] Chi-Keung Luk and Todd C. Mowry. 1996. Compiler-based Prefetching for Recursive Data Structures. In Proceedings of the Seventh International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS VII). ACM, New York, NY, USA, 222-233. https://doi.org/10.1145/237090.237190
[32] Yandong Mao, Eddie Kohler, and Robert Tappan Morris. 2012. Cache Craftiness for Fast Multicore Key-value Storage. In Proceedings of the 7th ACM European Conference on Computer Systems (EuroSys '12). ACM, New York, NY, USA, 183-196. https://doi.org/10.1145/2168836.2168855
[33] Rich Martin. 1996. A Vectorized Hash-Join. Technical Report. University of California at Berkeley, CA, USA.
[34] Anurag Mukkara, Nathan Beckmann, and Daniel Sanchez. 2016. Whirlpool: Improving Dynamic Cache Management with Static Data Classification. In Proceedings of the Twenty-First International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '16). ACM, New York, NY, USA, 113-127. https://doi.org/10.1145/2872362.2872363
[35] Rajesh Nishtala, Hans Fugal, Steven Grimm, Marc Kwiatkowski, Herman Lee, Harry C. Li, Ryan McElroy, Mike Paleczny, Daniel Peek, Paul Saab, David Stafford, Tony Tung, and Venkateshwaran Venkataramani. 2013. Scaling Memcache at Facebook. In 10th USENIX Symposium on Networked Systems Design and Implementation (NSDI 13). USENIX, Lombard, IL, 385-398. https://www.usenix.org/conference/nsdi13/technical-sessions/presentation/nishtala
[36] Brendan O'Connor. 2011. How Much Text Versus Metadata Is in a Tweet. http://goo.gl/EBFIFs.
[37] François Panneton and Pierre L'Ecuyer. 2005. On the Xorshift Random Number Generators. ACM Trans. Model. Comput. Simul. 15, 4 (Oct. 2005), 346-361. https://doi.org/10.1145/1113316.1113319
[38] Jun Rao and Kenneth A. Ross. 1999. Cache Conscious Indexing for Decision-Support in Main Memory. In Proceedings of the 25th International Conference on Very Large Data Bases (VLDB '99). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 78-89. http://dl.acm.org/citation.cfm?id=645925.671362
[39] Jun Rao and Kenneth A. Ross. 2000. Making B+-Trees Cache Conscious in Main Memory. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data (SIGMOD '00). ACM, New York, NY, USA, 475-486. https://doi.org/10.1145/342009.335449
[40] Freescale Semiconductor. 2005. PowerPC e500 Core Family Reference Manual. https://goo.gl/Jjs38u
[41] Santhosh Srinath, Onur Mutlu, Hyesoon Kim, and Yale N. Patt. 2007. Feedback Directed Prefetching: Improving the Performance and Bandwidth-Efficiency of Hardware Prefetchers. In Proceedings of the 2007 IEEE 13th International Symposium on High Performance Computer Architecture (HPCA '07). IEEE Computer Society, Washington, DC, USA, 63-74. https://doi.org/10.1109/HPCA.2007.346185
[42] Nitish Kumar Srivastava and Akshay Dilip Navalakha. 2018. Pointer-Chase Prefetcher for Linked Data Structures. CoRR abs/1801.08088 (2018), 12. arXiv:1801.08088 http://arxiv.org/abs/1801.08088
[43] Y. Sun, Y. Hua, D. Feng, L. Yang, P. Zuo, and S. Cao. 2015. MinCounter: An Efficient Cuckoo Hashing Scheme for Cloud Storage Systems. In 2015 31st Symposium on Mass Storage Systems and Technologies (MSST). IEEE, Santa Clara, CA, USA, 1-7. https://doi.org/10.1109/MSST.2015.7208292
[44] Yuanyuan Sun, Yu Hua, Song Jiang, Qiuyu Li, Shunde Cao, and Pengfei Zuo. 2017. SmartCuckoo: A Fast and Cost-Efficient Hashing Index Scheme for Cloud Storage Systems. In 2017 USENIX Annual Technical Conference (USENIX ATC 17). USENIX Association, Santa Clara, CA, 553-565. https://www.usenix.org/conference/atc17/technical-sessions/presentation/sun
[45] Arjun Suresh, Erven Rohou, and André Seznec. 2017. Compile-time Function Memoization. In Proceedings of the 26th International Conference on Compiler Construction (CC 2017). ACM, New York, NY, USA, 45-54. https://doi.org/10.1145/3033019.3033024
[46] Symas. 2016. LMDB: Lightning Memory-Mapped Database Manager. http://www.lmdb.tech/doc/index.html.
[47] Thomas Wang. 2007. Integer Hash Function. http://web.archive.org/web/20071223173210/http://www.concentric.net/~Ttwang/tech/inthash.htm.
[48] Po-An Tsai, Nathan Beckmann, and Daniel Sanchez. 2017. Jenga: Software-defined Cache Hierarchies. In 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA). IEEE/ACM, Toronto, ON, Canada, 652-665. https://doi.org/10.1145/3079856.3080214
[49] Tomoaki Tsumura, Ikuma Suzuki, Yasuki Ikeuchi, Hiroshi Matsuo, Hiroshi Nakashima, and Yasuhiko Nakashima. 2007. Design and Evaluation of an Auto-memoization Processor. In Proceedings of the 25th IASTED International Multi-Conference: Parallel and Distributed Computing and Networks (PDCN '07). ACTA Press, Anaheim, CA, USA, 245-250. http://dl.acm.org/citation.cfm?id=1295581.1295621
[50] Stephen Tu. 2013. Silo Source Code on GitHub. https://github.com/stephentu/silo.
[51] Stephen Tu, Wenting Zheng, Eddie Kohler, Barbara Liskov, and Samuel Madden. 2013. Speedy Transactions in Multicore In-memory Databases. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles (SOSP '13). ACM, New York, NY, USA, 18-32. https://doi.org/10.1145/2517349.2522713
[52] Xingbo Wu, Fan Ni, and Song Jiang. 2017. Search Lookaside Buffer: Efficient Caching for Index Data Structures. In Proceedings of the 2017 Symposium on Cloud Computing (SoCC '17). ACM, New York, NY, USA, 27-39. https://doi.org/10.1145/3127479.3127483
[53] Xingbo Wu, Fan Ni, and Song Jiang. 2019. Wormhole: A Fast Ordered Index for In-memory Data Management. In Proceedings of the Fourteenth EuroSys Conference 2019 (EuroSys '19). ACM, New York, NY, USA, Article 18, 16 pages. https://doi.org/10.1145/3302424.3303955
[54] Yuan Yuan, Rubao Lee, and Xiaodong Zhang. 2013. The Yin and Yang of Processing Data Warehousing Queries on GPU Devices. Proc. VLDB Endow. 6, 10 (Aug. 2013), 817-828. https://doi.org/10.14778/2536206.2536210
[55] Guowei Zhang and Daniel Sanchez. 2018. Leveraging Hardware Caches for Memoization. IEEE Comput. Archit. Lett. 17, 1 (Jan. 2018), 59-63. https://doi.org/10.1109/LCA.2017.2762308
[56] Kai Zhang, Kaibo Wang, Yuan Yuan, Lei Guo, Rubao Lee, and Xiaodong Zhang. 2015. Mega-KV: A Case for GPUs to Maximize the Throughput of In-memory Key-value Stores. Proc. VLDB Endow. 8, 11 (July 2015), 1226-1237. https://doi.org/10.14778/2809974.2809984
[57] Xiao Zhang, Sandhya Dwarkadas, and Kai Shen. 2009. Towards Practical Page Coloring-based Multicore Cache Management. In Proceedings of the 4th ACM European Conference on Computer Systems (EuroSys '09). ACM, New York, NY, USA, 89-102. https://doi.org/10.1145/1519065.1519076
[58] Dimitrios Ziakas, Allen Baum, Robert A. Maddox, and Robert J. Safranek. 2010. Intel QuickPath Interconnect Architectural Features Supporting Scalable System Architectures. In 2010 IEEE 18th Annual Symposium on High Performance Interconnects (HOTI). IEEE, Mountain View, CA, USA, 1-6.
