
Memory System Characterization of Big Data Workloads

Martin Dimitrov*, Karthik Kumar*, Patrick Lu**, Vish Viswanathan*, Thomas Willhalm*
*Software and Services Group, **Datacenter and Connected Systems Group,

Intel Corporation

Abstract—Two recent trends have emerged: (1) rapid growth in big data technologies with new types of computing models to handle unstructured data, such as map-reduce and NoSQL; (2) a growing focus on the memory subsystem for performance and power optimizations, particularly with emerging memory technologies offering characteristics different from conventional DRAM (bandwidths, read/write asymmetries).

This paper examines how these trends may intersect by characterizing the memory access patterns of various Hadoop and NoSQL big data workloads. Using memory DIMM traces collected with special hardware, we analyze the spatial and temporal reference patterns to bring out several insights related to memory and platform usage, such as memory footprints, read-write ratios, bandwidths, and latencies. We develop an analysis methodology to understand how conventional optimizations such as caching, prediction, and prefetching may apply to these workloads, and discuss the implications for software and system design.

Keywords-big data, memory characterization

I. INTRODUCTION

The massive information explosion over the past decade has resulted in zettabytes of data [1] being created each year, with most of this data being in unstructured formats: files, videos, logs, documents, images, and so on. Data continues to grow at an exponential pace; for example, in 2010 the world created half as much data as it had in all previous years combined [2]. With such growth, it becomes challenging for conventional computing models to handle these large volumes. Big data analytics [3] [4] [5] [6] has emerged as the solution to parse, analyze, and extract meaningful information from these large volumes of unstructured data. Extracting this information provides opportunities and insights on a variety of fronts, from making more intelligent business decisions to understanding trends in usage and markets to detecting fraud and anomalies, with many of these analyses possible in real time. As a result, big data processing is becoming increasingly popular. Two primary big data computing models have emerged, (1) Hadoop-based computing and (2) NoSQL-based computing, and the two are among the fastest growing software segments in modern computer systems.

On the other hand, computing systems themselves have seen a shift in optimizations; unlike a decade earlier, when most optimizations were primarily on the processor, there has been more focus on the overall platform, and particularly the memory subsystem, for performance and power improvements. Recent studies have shown memory to be a first-order consideration for performance, and sometimes even the dominant consumer of power in a computer system.

This emphasis becomes even more important with recent trends in emerging memory technologies [7] [8], which are expected to offer characteristics different from conventional DRAM, such as higher latencies, differing capacities, and persistence. In order for software to run efficiently on such technologies, it becomes critical to characterize and understand memory usage. While various studies have been performed on memory characterization of workloads [9] [10] [11] [12] over the past decade, unfortunately, most of them focus on the SPEC benchmark suites and traditional computing models. Very few studies have examined the memory behavior of big data workloads, and these are mostly specific to one optimization, such as TLB improvements [13] [14].

This paper addresses this gap by providing a detailed characterization of the spatial and temporal memory references of various big data workloads. We analyze various building blocks of big data operations, such as sort, wordcount, aggregations, and joins on Hadoop, and building indexes to process data on a NoSQL data store. We characterize the memory behavior by monitoring various metrics such as memory latencies; first, second, and last level processor cache miss rates; code and data TLB miss rates; and peak memory bandwidths. We examine the impact of Hadoop compression on both performance and the spatial patterns. Using specially designed hardware, we are able to observe and trace the memory reference patterns of all the workloads at the DIMM level with precise timing information. These traces provide us the unique ability to obtain insights based on the spatial and temporal references, such as memory footprints and the spatial histograms of the memory references over those footprints.

This paper also examines the potential for big data workloads to be tolerant of the higher latencies expected in emerging memory technologies. The classic mechanisms for hiding latency are caching in a faster memory tier and predicting future memory references in order to prefetch sections of memory. We examine the cacheability of big data workloads by running the memory traces through a cache simulator with different cache sizes. An interesting insight is that many of the workloads operate on only a small subset of their spatial footprint at a time. As a result, we find that a cache less than 0.1% the size of the footprint can capture as much as 40% of all the memory references.



For prefetchability, we observe that using existing prefetcher schemes to predict the precise next memory reference is a hard problem at the memory DIMM level, due to the mixing of different streams from different processor cores and to this mixed stream getting further interleaved across different memory ranks for performance. Clearly, more sophisticated algorithms and techniques are required if prefetching is to be transparent to the application at the lower levels of the memory hierarchy. In order to examine this potential, we use signal processing techniques, namely entropy and trend analysis (correlation with known signals), to bring out insights related to the memory patterns.

We believe this is the first paper to examine the design space for memory architectures running big data workloads by analyzing spatial patterns using DIMM traces and providing a detailed memory characterization. The study brings out a wealth of insights for system and software design. The experiments are performed on a 4-node cluster of 2-socket servers with Intel Xeon E5 processors, with each node configured with 128GB of DDR3 memory and 2TB of SSD storage. (We intentionally selected fast storage and large memory capacities: with the price of flash and non-volatile media continuing to drop, we chose this configuration to understand forward-looking usages.) The remainder of this paper is organized as follows: Section II describes the related work. Section III describes the various workloads used in this paper. Section IV describes the experimental methodology, and Section V presents the results and observations, with Section VI concluding the paper and discussing future work.

II. RELATED WORK

A. Memory system characterization

There have been several papers discussing memory system characterization of enterprise workloads over the past decade. Barroso et al. [9] characterize the memory references of various commercial workloads. Domain-specific characterization studies include memory characterization of parallel data mining workloads [15], of the ECperf benchmark [16], of memcached [17], and of the SPEC CPU2000 and CPU2006 benchmark suites [10] [11] [12]. Particularly noteworthy is the work of Shao et al. [12], where statistical measures are used for memory characterization. The common focus of all these works is using instrumentation techniques and platform monitoring to understand how specific workloads use memory. With emerging memory technologies [8] [18] [19] [20] having different properties from the conventional DRAM that has been used for the past decade, this type of characterization focus becomes particularly important.

B. Big Data workloads

Various researchers have proposed benchmarks and workloads representative of big data usage; the common focus of all these benchmarks is that they deal with processing unstructured data, typically using Hadoop or NoSQL. The HiBench suite developed by Intel [21] consists of several Hadoop workloads, such as sort, wordcount, and Hive aggregation, that are proxies for big data usages in the real world. In this paper, we use several Hadoop workloads from the HiBench suite. Another class of data stores to handle unstructured data is NoSQL databases [22], which are specialized for query and search operations. They differ from conventional databases in that they typically do not offer transactional guarantees, a trade-off made in exchange for very fast retrieval.

Recent studies have also proposed characterizing and understanding these big data usage cases. These can be classified as follows:

• Implications on system design and architecture: A study from IBM Research [23] examines how big data workloads may be suited for the IBM POWER architecture. Chang et al. [24] examine the implications of big data analytics on system design.

• Modeling big data workloads: Yang et al. [25] propose using statistics-based techniques for modeling map-reduce. Atikoglu et al. [26] model and analyze the behavior of a key-value store (memcached).

• Performance characterization of big data workloads: Ren et al. characterize the behavior of a production Hadoop cluster, using a specific case study [27]. Issa et al. [28] present power and performance characterization of Hadoop with memcached.

Very few studies focus on understanding the memory characteristics of big data workloads. Noteworthy among these is the work of Basu et al. [14], which focuses on page-table and virtual-memory-related optimizations for big data workloads. Jia et al. [13] present a characterization of the L1, L2, and LLC cache misses observed for a Hadoop workload cluster. Both of these studies focus on characterization at the virtual memory and cache hierarchy levels, as opposed to the DRAM level.

C. Contributions

The following are the unique contributions of this paper:

• We believe this is the first study analyzing the memory reference patterns of big data workloads. Using hardware memory traces at the DIMM level, we are able to analyze references to physical memory.

• We introduce various metrics to qualitatively and quantitatively characterize the memory reference patterns, and we discuss the implications of these metrics for future system design.

III. WORKLOADS

We use several Hadoop workloads from the HiBench workload suite [21] and a NoSQL datastore that builds indexes from text documents. We use a performance-optimized Hadoop configuration in our experiments. Since Hadoop has a compression codec for both input and output data, all the Hadoop workloads are examined with and without the use of compression. The following is a brief description of the various workloads used:

A. Sort

Sort is a good proxy for a common type of big data operation that requires transforming data from one representation to another. In this case, the workload sorts its text input data, which is generated using the Hadoop RandomTextWriter example. In our setup, we sort a 96GB dataset in HDFS using the 4-node cluster, with a 24GB dataset per node.

B. WordCount

Word count also represents a common big data operation: extracting a small amount of interesting data from a large dataset, or a “needle in a haystack” search. In this case, the workload counts the number of occurrences of each word in the input data set. The data set is generated using the Hadoop RandomTextWriter example. In our setup, we perform wordcount on a 128GB dataset in HDFS, distributed between the 4 nodes as 32GB per node.
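For intuition, the map-reduce formulation of word count can be sketched as follows: the map phase emits (word, 1) pairs and the reduce phase sums the counts per word. This is only a conceptual sketch in plain Python, not the Hadoop code used in the experiments, and the function names are illustrative.

# Schematic map-reduce word count: map emits (word, 1) pairs, reduce sums them per word.
# This mirrors the Hadoop WordCount example only conceptually.
from collections import defaultdict

def map_phase(document):
    return [(word, 1) for word in document.split()]

def reduce_phase(pairs):
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

documents = ["big data memory", "memory system characterization", "big memory"]
pairs = [p for doc in documents for p in map_phase(doc)]
print(reduce_phase(pairs))   # {'big': 2, 'data': 1, 'memory': 3, 'system': 1, 'characterization': 1}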

C. Hive Join

The Hive join workload approximates a complex analytic query, representative of typical OLAP workloads. It computes the average and the sum for each group by joining two different tables. The join task consists of two sub-tasks that perform a complex calculation on two data sets. In the first part of the task, each system must find the sourceIP that generated the most revenue within a particular date range. Once these intermediate records are generated, the system must then calculate the average pageRank of all the pages visited during this interval. The generated data set approximates web-server logs, with hyperlinks following a Zipfian distribution. In this case, we simulated nearly 130 million user visits to nearly 18 million pages.

D. Hive Aggregation

Hive aggregation approximates a complex analytic query, representative of typical OLAP workloads, by computing the inlink count for each document in the dataset, a task that is often used as a component of PageRank calculations. The first step is to read each document and search for all the URLs that appear in its contents. The second step is then, for each unique URL, to count the number of unique pages that reference that particular URL across the entire set of documents. It is this type of task that map-reduce is believed to be commonly used for.

E. NoSQL indexing

The NoSQL workload uses a NoSQL data store to build indexes from 240GB of text files distributed across the 4 nodes. This type of computation is heavy in regular expression comparisons and is a very common big data use case.

IV. EXPERIMENTAL METHODOLOGY

In this section, we discuss the experimental methodology used in the paper. The methodology is focused on the following objectives: (1) providing insights about how big data applications use the memory subsystem (Section IV-A), and (2) examining the latency tolerance of big data workloads, since emerging memory technologies have higher latencies than DRAM. The latency tolerance is examined by understanding the potential of classic techniques to hide latency: cacheability in a faster tier (Sections IV-B, IV-C, and IV-D) and prefetching into a faster tier (Sections IV-D and IV-E).

A. General characterization

Performance counter monitoring allows us to analyze various characteristics of the workload and its memory references. Some of the metrics of interest include the following (a small sketch of how they are derived from raw counter values follows this list):

• Memory footprint: the memory footprint is defined as the span of memory that is touched at least once by the workload. It can also be viewed as the minimum amount of memory required to keep the workload “in memory”.

• CPI: cycles per instruction. This is a measure of the average number of hardware cycles required to execute one instruction. A lower value indicates that the instructions are running efficiently on the hardware, with fewer stalls, dependencies, waits, and bottlenecks.

• L1, L2, and Last Level Cache (LLC) misses per instruction (MPI): the processor has 3 levels of cache hierarchy. The L1 and L2 caches are smaller (kB-sized), fast caches that are exclusive to each core. The LLC is the last level cache that is shared amongst all the cores in a socket. Since the LLC is the last level in the hierarchy, the penalty for an LLC miss is a reference to DRAM, which requires tens of nanoseconds of wait time. Hence the LLC MPI is often a good indicator of the memory intensiveness of a workload.

• Memory bandwidth: the data rate measured for references to the memory subsystem. Intel Xeon E5 two-socket server platforms can easily support bandwidths greater than 60,000 MB/s.

• Instruction and data Translation Lookaside Buffer (TLB) MPI: the TLBs are a cache for the page table entries. A higher miss rate for the data TLBs indicates that the memory references are more widespread in distribution, since a TLB miss occurs whenever a reference crosses a 4kB page boundary into a page whose entry is not cached in the TLB.
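As a rough illustration of how these metrics are derived from raw counter totals, the sketch below computes CPI, misses per 1000 instructions, read-write ratio, and memory bandwidth. The function and the sample values are hypothetical and do not correspond to the specific counter events used in this study; the bandwidth calculation assumes 64 bytes transferred per DRAM CAS operation.

# Sketch: deriving the metrics above from raw counter totals collected over a run.
# Field names and sample numbers are illustrative, not the actual events used in the paper.

def derived_metrics(cycles, instructions, llc_misses, dtlb_misses, cas_reads, cas_writes, seconds):
    """Return CPI, misses per 1000 instructions, read-write ratio, and bandwidth in MB/s."""
    cpi = cycles / instructions
    llc_mpki = llc_misses / (instructions / 1000.0)
    dtlb_mpki = dtlb_misses / (instructions / 1000.0)
    read_write_ratio = cas_reads / cas_writes
    # Each DRAM CAS transaction moves one 64-byte cache line.
    bandwidth_mb_s = (cas_reads + cas_writes) * 64 / (seconds * 1e6)
    return cpi, llc_mpki, dtlb_mpki, read_write_ratio, bandwidth_mb_s

# Example with made-up totals for a 1000-second run:
print(derived_metrics(3.0e12, 2.5e12, 5.0e9, 1.2e9, 8.0e10, 3.0e10, 1000.0))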

B. Cache line working set characterization

The spatial nature of the memory references of a workload can be identified by a characterization of the cache line references. For example, the memory referenced by a workload may span 5GB; however, it may be the case that most of the references were concentrated in a smaller 100MB region within the 5GB. In order to understand such behavior, we employ the following methodology: (1) we create a histogram of the cache lines and their memory references: against each cache line, we record the number of times it is referenced; (2) we sort the cache lines by their number of references, with the most frequently referenced cache line occurring first; (3) we select a footprint size (for example, 100MB) and compute the percentage of references contained within this footprint size. In this example, 1638400 cache lines (= 100MB / 64 bytes per cache line) are required to make up 100MB; we compute the total number of references against the first 1638400 cache lines in the list from step (2), and divide by the total number of references in the trace. This gives us information about the spatial distribution of the references within the hottest 100MB, relative to the overall memory footprint. Intuitively, if one had a fixed cache and had to pin the cache lines, with no replacements or additions possible, then the cache lines to be selected would be the ones highlighted in this analysis.
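A minimal sketch of this three-step methodology is shown below, assuming the trace is available as a sequence of byte addresses (a hypothetical input format; the actual trace format is not described at this level of detail).

# Working-set characterization of Section IV-B: fraction of all references that fall
# within the hottest N bytes of cache lines. The input format is assumed, not the paper's.
from collections import Counter

CACHE_LINE = 64

def coverage_of_hottest(trace_addresses, footprint_bytes=100 * 1024 * 1024):
    refs = Counter(addr // CACHE_LINE for addr in trace_addresses)   # step (1): histogram
    counts = sorted(refs.values(), reverse=True)                     # step (2): hottest lines first
    lines_in_footprint = footprint_bytes // CACHE_LINE               # 1638400 lines for 100MB
    covered = sum(counts[:lines_in_footprint])                       # step (3): references in hottest lines
    return covered / sum(counts)

# Example: fraction of references contained in the hottest 100MB of cache lines.
# print(coverage_of_hottest(addresses))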

C. Cache simulation

The previous section considers the spatial distribution of the memory references; however, it does not consider their temporal nature. For example, the workload could be repeatedly streaming through memory in very large, repeated scans. If the working set spanned by the scan is much larger than the size of the cache, this will result in very poor cacheability. Moreover, there could be certain cache eviction and replacement patterns that result in poor cacheability that is not apparent from inspecting the spatial distribution. On the other hand, it is also possible that the workload focuses on small regions of memory at a time, resulting in very good cacheability; again, this may not be apparent from the spatial distribution. In order to account for such cases, we run the memory reference traces through a cache simulator with different cache sizes and observe the corresponding hit rates. High hit rates indicate that a tiered memory architecture, in which a first tier of DRAM caches a good portion of the memory references to a second, larger tier based on a non-volatile technology, could be effective.
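The paper does not describe the simulator's configuration in detail, so the following is only an illustrative sketch: a fully associative cache with LRU replacement, driven by the cache-line trace, reporting the hit rate for a given cache size.

# Minimal fully associative LRU cache simulator over a cache-line address trace.
# The associativity and replacement policy are assumptions made for this sketch.
from collections import OrderedDict

CACHE_LINE = 64

def lru_hit_rate(trace_addresses, cache_bytes):
    capacity = cache_bytes // CACHE_LINE
    cache = OrderedDict()              # keys: cache-line numbers, ordered by recency
    hits = total = 0
    for addr in trace_addresses:
        line = addr // CACHE_LINE
        total += 1
        if line in cache:
            hits += 1
            cache.move_to_end(line)            # mark as most recently used
        else:
            cache[line] = True
            if len(cache) > capacity:
                cache.popitem(last=False)      # evict the least recently used line
    return hits / total if total else 0.0

# Example: compare hit rates for 100MB and 1GB cache sizes, as in Section V-D.
# for size in (100 * 2**20, 2**30):
#     print(size, lru_hit_rate(addresses, size))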

D. Entropy

The previous metrics provide information about the spatial distribution and about how the temporal pattern impacts cacheability. Another important consideration for memory-level optimizations is predictability and compressibility. This is related to the information content, based on the observation that a signal with a high amount of information content is harder to compress and potentially difficult to predict. In order to quantify and compare this feature for the different traces, we use the entropy of the memory references, as it has been used in [12] for understanding the memory reference patterns of SPEC workloads.

The entropy is a measure of the information content in the trace, and therefore gives a good indication of its cacheability and predictability. For a set of cache lines K, the entropy is defined as

H = −∑_{c∈K} p(c) · log(p(c))

where p(c) denotes the probability of a reference to cache line c.

The following illustrative example demonstrates how the entropy can be used to characterize memory references. Consider a total of 10 distinct cache lines, {a, b, c, d, e, f, g, h, i, j}, that are referenced in a trace, and the following three scenarios, each consisting of 100 references to these cache lines:

(1) Each of the 10 cache lines is referenced 10 times.

(2) Cache lines a, b, c, d, e are referenced 19 times each, and cache lines f, g, h, i, j are referenced 1 time each.

(3) Cache line a is referenced 91 times, and cache lines b, c, d, e, f, g, h, i, j are referenced 1 time each.

All three access patterns use all 10 cache lines and have 100 cache line references. Metrics like footprint and reference counts are therefore identical in all 3 cases. However, in the last case a single cache line contains 91% of the references, compared with only 19% of the references in case (2) and 10% of the references in case (1). Similarly, a set of 3 cache lines contains 93% of the references in case (3), 57% of the references in case (2), and 30% of the references in case (1). Therefore, from a footprint or working-set point of view, (3) is preferable over (2), which in turn is preferable over (1). This is nicely reflected in the entropy, which is 1 for scenario (1), 0.785 for scenario (2), and 0.217 for scenario (3).

The number of references to each cache line in a trace is converted to a probability distribution over the cache lines, so the entropy is relative to the distribution within the trace. In particular, the entropy is independent of the length of the trace.
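A short sketch of this computation is shown below. Using a base-10 logarithm reproduces the entropy values quoted for the three illustrative scenarios above (1, 0.785, and 0.217); the helper name and the list-of-labels input format are illustrative only.

# Entropy of a cache-line reference histogram (Section IV-D), using log base 10,
# which matches the scenario values quoted in the text.
import math
from collections import Counter

def reference_entropy(trace_lines):
    counts = Counter(trace_lines)
    total = sum(counts.values())
    return -sum((n / total) * math.log10(n / total) for n in counts.values())

# The three 100-reference scenarios from the example:
uniform = [c for c in "abcdefghij" for _ in range(10)]              # scenario (1)
skewed  = [c for c in "abcde" for _ in range(19)] + list("fghij")   # scenario (2)
single  = ["a"] * 91 + list("bcdefghij")                            # scenario (3)

for trace in (uniform, skewed, single):
    print(round(reference_entropy(trace), 3))                       # 1.0, 0.785, 0.217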

E. Correlation and trend analysis

To further understand the predictability of memory references, we examine the traces for “trends”. For example, if we knew that a trace had a trend of increasing physical address ranges in its references, aggressively prefetching large sections of the upper physical address ranges into an upper-level cache would result in a fair percentage of cache hits. In order to quantify this, we use correlation analysis with known signals. The computation can be mathematically expressed as follows:

c(n) = (f ⊗ g)(n) = ∑_{m=−∞}^{∞} f[m] g[m + n]    (1)

Here g is the trace, and f is a known signal. For our analysis, we use a single sawtooth function:

f_{s,l}(n) = s · n if 0 ≤ n ≤ l, and 0 otherwise    (2)

The correlation output then examines the trace g, looking for the known signal f.

With a slope of s = 64 and a length of l = 1000, the test function f mimics an ascending stride through memory of 1000 cache lines. Please note that the infinite sum in (1) collapses to

(f_{s,l} ⊗ g)(n) = ∑_{m=0}^{l} f_{s,l}[m] g[m + n] = ∑_{m=0}^{l} s · m · g[m + n]

Furthermore, it is worth noting that the test function for a descending stride, with a negative slope −s, simply results in a negated correlation:

f_{−s,l} ⊗ g = (−f_{s,l}) ⊗ g = −(f_{s,l} ⊗ g)
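The sketch below illustrates this trend analysis: correlating a trace of physical addresses with the ascending sawtooth f_{s,l}. Z-normalizing the trace is one plausible reading of the mean/standard-deviation normalization mentioned in Section V-F; the exact normalization used by the authors is not specified, so treat this as an assumption.

# Correlation of an address trace with the ascending sawtooth f_{s,l} (Section IV-E).
# Assumes the trace is a sequence of physical addresses longer than l samples.
import numpy as np

def sawtooth_correlation(trace_addresses, s=64, l=1000):
    g = np.asarray(trace_addresses, dtype=np.float64)
    g = (g - g.mean()) / g.std()                 # normalize for cross-workload comparison
    f = s * np.arange(l + 1, dtype=np.float64)   # f[m] = s*m for 0 <= m <= l
    # c(n) = sum_m f[m] * g[m + n]; numpy's 'valid' cross-correlation computes this sum.
    return np.correlate(g, f, mode="valid")

# A strong positive peak suggests ascending strides of about l cache lines;
# a negative peak suggests the corresponding descending stride.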

V. RESULTS

A. Experimental Setup

We perform our experiments on a 4-node cluster, with each node being a two-socket Intel Xeon E5 2.9GHz server platform with 128GB of 1600MHz DDR3 memory and 2TB of SSD storage. One of the nodes is fitted with the special hardware required to collect memory traces of the workloads at the DIMM level. The memory tracer interposes between the DIMM and the motherboard and is completely unobtrusive, while at the same time capable of recording all signals arriving at the pins of the memory DIMM; it thereby captures the physical addresses referenced at the DIMM level without any overhead. We keep the cluster configuration identical across nodes, and we verified during our experiments that Hadoop and the NoSQL workload distribute tasks equally among the nodes; hence, observations made at one node can be generalized.

B. General characterization

In this section, we describe the memory characteristics of all the workloads.

Memory footprints: Figure 1 shows the memory footprints of the workloads, in GB. It is observed that most workloads have footprints of 10GB or greater, with the NoSQL workload and the uncompressed sort workload having the largest footprints. It can also be observed that compression reduces the memory footprints and helps reduce the execution time, as seen in Figure 3. The access patterns are also mostly read-intensive, with the NoSQL workload and the wordcount map phase having read-write ratios of 2 or greater. Among the Hadoop workloads, Hive join and Hive aggregation were found to have more writes when compared to the sort and wordcount workloads. In most cases, we observed that enabling Hadoop compression reduced the read-write ratio.

Figure 1. Memory footprints of the workloads

Figure 2. Cycles per Instruction to execute the workloads

Figure 3. Execution times for the workloads

Figure 4. First level data cache misses per 1000 instructions


Figure 5. Second level cache misses per 1000 instructions

Figure 6. Last level cache misses per 1000 instructions

Figure 7. Peak memory bandwidths recorded for the workloads

Figure 8. Data TLB misses per 1000 instructions

Figure 9. Instruction TLB misses per 1000 instructions


CPI: The CPI of the workloads is shown in Figure 2. It is observed that most of the workloads have a CPI close to 1, with uncompressed sort having the largest CPI. The difference with compression for the sort workload is also apparent in the execution times, shown in Figure 3. From Figures 1, 2, and 3, it can be observed that the sort and wordcount workloads benefit most from using compression, followed by the Hive aggregation workload.

L1, L2, LLC MPI: Figures 4, 5, and 6 show the misses per thousand instructions for the three levels of the cache hierarchy, respectively. The sort workload is observed to have the highest cache miss rates. Intuitively, this makes sense because it transforms a large volume of data from one representation to another. The benefit of compression is also apparent in the last level cache miss rates of all the workloads.

Memory bandwidth: The peak memory bandwidths of the workloads are shown in Figure 7. It is observed that all the workloads have peak bandwidths of several tens of GB/s, all within the platform capability of 70 GB/s. Wordcount is observed to be the most bandwidth-intensive of the workloads. We note that while some workloads have higher bandwidth with compression enabled, the total data traffic to memory (the product of execution time and bandwidth) is lowered in all cases when compression is enabled.

Instruction and data TLB MPI: It is interesting to note that although the sort workload has almost an order of magnitude larger footprint than the wordcount workload, wordcount has much higher data TLB miss rates. This indicates that the memory references of the wordcount workload are not well contained within page granularities and are more widespread. In terms of instruction TLBs, the NoSQL workload is observed to have the highest miss rates.

C. Cache line working set characterization

Figure 11 shows the working set characterization described in the earlier section.


Figure 10. Entropy of cache line references

Figure 11. Percentage references contained in memory footprints

Figure 12. Cache miss rates for different cache sizes

It is observed that the hottest 100MB of cache lines contains 20% of the memory references for all workloads, except for the NoSQL workload and the map phase of the uncompressed word count workload. The NoSQL workload stands out from the other workloads in this characterization across the various footprint sizes. A 1GB footprint is observed to contain 60% of the memory references of all but this workload. It is also interesting to note that even though the sort workload has a footprint of more than 100GB, more than 60% of its memory references are contained in 1GB, i.e., less than 1% of its footprint.

D. Cache simulation

Figure 12 shows the cacheability while accounting for the temporal nature of the reference patterns. It is interesting to compare Figure 11 and Figure 12 and observe that the percentage of cache hits for the references is higher in Figure 12. This indicates that the big data workloads do not operate on their entire footprint at once; rather, they operate on spatial subsets of the total footprint, which makes the hit rates higher in a cache that allows for evictions and replacements.

Figure 13. Correlation of traces with known signal; suffixes are as follows: p: prefetch, no: no prefetch, c: compression, nc: no compression

It is observed that a 100MB cache has hit rates of 40-50% for most of the workloads, and a 1GB cache has hit rates of 80% for most workloads, indicating that these workloads are extremely cache friendly. Observing the trends, it is interesting that the NoSQL workload appears to have the lowest hit rates and slopes for both the working set analysis and the cache simulations.

E. Entropy

Figure 10 shows the entropies of the cache line references for all the workloads. It is observed that most of the big data workloads have entropies ranging from 13 to 16, with the NoSQL and sort workloads having the highest entropies, indicating large information content, harder predictability, and poorer compressibility. A common feature of these workloads is that they operate on entire datasets: the inputs and outputs are of comparable size, with large transforms being performed. A noteworthy comparison for Figure 10 is with the entropies for the SPEC workloads in [12]; most of the SPEC workloads have entropies in the range of 8-13 (lower than the big data workloads), with equake and mcf being the only workloads to have entropies close to 16.

F. Correlation and trend analysis

Figure 13 shows the correlation analysis described in Section IV-E, with a known signal that has an increasing slope of 64, for some of the workloads. We normalize the correlation using the mean and standard deviation to ensure fair comparisons can be made between the different workloads. A higher magnitude for the correlation indicates that “the trend” (the known signal) is observed strongly in the trace, with a positive value denoting that successive physical addresses are likely to increase (by 64 bytes) and a negative value indicating that successive physical addresses are likely to decrease. It is observed that the Hive aggregation workload has high correlation magnitudes overall, indicating it may be beneficial to predict and prefetch memory references in the higher address ranges (when compression is disabled) and lower address ranges (when compression is enabled).


In most cases (other than the NoSQL workload), enabling prefetchers during trace collection results in higher correlation, as would be expected due to the prefetcher hitting adjacent addresses. In the case of the NoSQL workload, on further examination of the trace, we observed several local phases of increasing and decreasing trends that alternated over the duration of the trace.

VI. CONCLUSION AND OUTLOOK

We examine the design space for memory architectures running big data workloads by analyzing spatial and temporal patterns from DIMM traces, providing a detailed memory characterization, and highlighting various observations and insights for system design. Our study shows that several big data workloads can potentially hide latencies by caching references in a faster tier of memory. Moreover, there are trends (increasing address ranges) observable in these workloads, indicating potential for aggressively prefetching large sections of the dataset into a faster tier. For future work, we plan to expand the measurements to include more big data workloads, as well as to explore further ways to characterize the workloads, with variations in dataset size and other parameters.

REFERENCES

[1] Wikibon, “Big data statistics,” http://wikibon.org/blog/big-data-statistics/.

[2] SeekingAlpha, “Opportunities to play the explosive growth of big data,” http://seekingalpha.com/article/910101-opportunities-to-play-the-explosive-growth-of-big-data.

[3] D. Boyd and K. Crawford, “Six provocations for big data,” Oxford Internet Institute: A Decade in Internet Time: Symposium on the Dynamics of the Internet and Society, 2011.

[4] B. Brown, M. Chui, and J. Manyika, “Are you ready for the era of big data?” McKinsey Quarterly, vol. 4, pp. 24–35, 2011.

[5] K. Bakshi, “Considerations for big data: Architecture and approach,” in IEEE Aerospace Conference, 2012, pp. 1–7.

[6] J. Manyika, M. Chui, B. Brown, J. Bughin, R. Dobbs, C. Roxburgh, and A. H. Byers, “Big data: The next frontier for innovation, competition and productivity,” McKinsey Global Institute, Tech. Rep., 2011.

[7] Y. Xie, “Modeling, architecture, and applications for emerging memory technologies,” IEEE Design & Test of Computers, vol. 28, no. 1, pp. 44–51, 2011.

[8] A. Makarov, V. Sverdlov, and S. Selberherr, “Emerging memory technologies: Trends, challenges, and modeling methods,” Microelectronics Reliability, vol. 52, no. 4, pp. 628–634, 2012.

[9] L. A. Barroso, K. Gharachorloo, and E. Bugnion, “Memory system characterization of commercial workloads,” ACM SIGARCH Computer Architecture News, vol. 26, no. 3, pp. 3–14, 1998.

[10] A. Jaleel, “Memory characterization of workloads using instrumentation-driven simulation: a Pin-based memory characterization of the SPEC CPU2000 and SPEC CPU2006 benchmark suites,” Intel Corporation, VSSAD, 2007.

[11] F. Zeng, L. Qiao, M. Liu, and Z. Tang, “Memory performance characterization of SPEC CPU2006 benchmarks using TSIM,” Physics Procedia, vol. 33, pp. 1029–1035, 2012.

[12] Y. S. Shao and D. Brooks, “ISA-independent workload characterization and its implications for specialized architectures,” in IEEE International Symposium on Performance Analysis of Systems and Software, 2013, pp. 245–255.

[13] Z. Jia, L. Wang, J. Zhan, L. Zhang, and C. Luo, “Characterizing data analysis workloads in data centers,” arXiv preprint arXiv:1307.8013, 2013.

[14] A. Basu, J. Gandhi, J. Chang, M. D. Hill, and M. M. Swift, “Efficient virtual memory for big memory servers,” in International Symposium on Computer Architecture, 2013, pp. 237–248.

[15] J.-S. Kim, X. Qin, and Y. Hsu, “Memory characterization of a parallel data mining workload,” in Workload Characterization: Methodology and Case Studies, IEEE, 1999, pp. 60–68.

[16] M. Karlsson, K. Moore, E. Hagersten, and D. Wood, “Memory characterization of the ECperf benchmark,” in Workshop on Memory Performance Issues, 2002.

[17] Y. Xu, E. Frachtenberg, S. Jiang, and M. Paleczny, “Characterizing Facebook's memcached workload,” IEEE Internet Computing, vol. 99, p. 1, 2013.

[18] Y. Xie, “Future memory and interconnect technologies,” in Design, Automation & Test in Europe Conference & Exhibition (DATE), IEEE, 2013, pp. 964–969.

[19] J. Chen, R. C. Chiang, H. H. Huang, and G. Venkataramani, “Energy-aware writes to non-volatile main memory,” ACM SIGOPS Operating Systems Review, vol. 45, no. 3, pp. 48–52, 2012.

[20] E. Chen et al., “Advances and future prospects of spin-transfer torque random access memory,” IEEE Transactions on Magnetics, vol. 46, no. 6, pp. 1873–1878, 2010.

[21] S. Huang, J. Huang, J. Dai, T. Xie, and B. Huang, “The HiBench benchmark suite: Characterization of the MapReduce-based data analysis,” in International Conference on Data Engineering Workshops, IEEE, 2010, pp. 41–51.

[22] M. Stonebraker, “SQL databases v. NoSQL databases,” Commun. ACM, vol. 53, no. 4, pp. 10–11, Apr. 2010.

[23] A. E. Gattiker, F. H. Gebara, A. Gheith, H. P. Hofstee, D. A. Jamsek, J. Li, E. Speight, J. W. Shi, G. C. Chen, and P. W. Wong, “Understanding system and architecture for big data,” IBM Research, 2012.

[24] J. Chang, K. T. Lim, J. Byrne, L. Ramirez, and P. Ranganathan, “Workload diversity and dynamics in big data analytics: Implications to system designers,” in Workshop on Architectures and Systems for Big Data, 2012, pp. 21–26.

[25] H. Yang, Z. Luan, W. Li, D. Qian, and G. Guan, “Statistics-based workload modeling for MapReduce,” in Parallel and Distributed Processing Symposium Workshops & PhD Forum, 2012, pp. 2043–2051.

[26] B. Atikoglu, Y. Xu, E. Frachtenberg, S. Jiang, and M. Paleczny, “Workload analysis of a large-scale key-value store,” in ACM International Conference on Measurement and Modeling of Computer Systems, 2012, pp. 53–64.

[27] Z. Ren, X. Xu, J. Wan, W. Shi, and M. Zhou, “Workload characterization on a production Hadoop cluster: A case study on Taobao,” in IEEE International Symposium on Workload Characterization, 2012, pp. 3–13.

[28] “Hadoop and memcached: Performance and power characterization and analysis,” Journal of Cloud Computing, vol. 1, no. 1, 2012.
