
HabaneroHadoop: An Optimized MapReduce Runtime for Multi-core Systems

Yunming Zhang ∗

MIT CSAIL
[email protected]

Alan Cox    Vivek Sarkar
Rice University

{alc,vsarkar}@rice.edu

Abstract

This paper proposes a novel runtime system, Habanero Hadoop, to address the inefficient utilization of memory on multi-core machines by the Hadoop MapReduce runtime system. The current Hadoop MapReduce runtime implementation utilizes multiple cores in a machine by exploiting parallelism among map tasks and among reduce tasks. Each task is executed in a separate Java Virtual Machine (JVM). Unfortunately, in some applications, this design leads to poor memory utilization because some data structures used by the application are duplicated in their entirety across multiple JVMs running on the same machine. In addition, running a large number of map tasks on each node sometimes incurs significant memory overhead. This memory inefficiency leads to a scalability bottleneck in problems such as data clustering and classification. As memory becomes a valuable resource for data analytics applications running in data centers, improving the memory efficiency of the MapReduce runtime can potentially result in a dramatic decrease in a company's spending on cloud computing resources.

The Habanero Hadoop system integrates a shared memory model into the fully distributed memory model of the Hadoop MapReduce system, avoiding duplication of read-only in-memory data structures within a node. Additionally, Habanero Hadoop implements an alternative way to utilize multi-core systems by exploiting parallelism within each map task. Previous work optimizing multi-core performance for MapReduce runtimes focused on maximizing CPU utilization rather than memory efficiency. The resulting Habanero Hadoop runtime can reduce memory usage by as much as 6x and improve throughput for large inputs by 2x on an 8-core machine running widely used data analytics applications, including HashJoin, KMeans and K Nearest Neighbors.

* This work was done while the author was at Rice University.

1. Introduction

Data analytics applications are widely used by industry and the scientific research community. Hadoop MapReduce [15] is a popular platform for executing MapReduce data analytics applications in clusters. However, the current Hadoop MapReduce runtime does not efficiently utilize the memory resources in multi-core machines, leading to performance and scalability bottlenecks for popular data analytics applications.

The current Hadoop MapReduce runtime treats every core as an independent machine. The runtime implementation uses multi-core systems by decomposing a MapReduce job into multiple map/reduce tasks that can execute in parallel. Each map/reduce task is executed sequentially using a single core in a separate JVM instance. The runtime system adopts a fully distributed memory model, as each task on the same machine executes in its own memory space.

This distributed memory model hurts the memory efficiency of many popular data analytics applications that have large in-memory read-only data structures, including KMeans, K Nearest Neighbors (KNN) and HashJoin. KMeans uses the old cluster centroid data in memory to calculate the new cluster centroids at each iteration [3, 17]. A typical HashJoin application requires each map task to keep a copy of the lookup table in memory [1, 5].

Additionally, Hadoop's design of creating a large number of map tasks to utilize each multi-core node sometimes results in significant memory overhead. Certain applications, such as KMeans and KNN, maintain large in-memory data structures for the partial results of the map tasks. For example, KMeans stores newly computed cluster centroid data in memory during the execution of the map tasks.

Hadoop MapReduce's design for using multi-core systems leads to significant memory inefficiency for a few reasons. First, since there is no shared memory space between map/reduce tasks on the same node, the in-memory, read-only data structures needed by map/reduce tasks are duplicated across JVMs on the same machine, as shown in Figure 1. Second, it takes extra time for every map/reduce task to load and compute the same read-only in-memory data structures. Finally, the approach of running a large number of map tasks on each compute node creates a large number of in-memory partial result data structures, leading to a large memory footprint.


Figure 1: Hadoop MapReduce on a four-core system

Figure 2: Memory Wall for HashJoin Application

As data analytics applications try to solve larger problems, the size of the in-memory data structures increases. When the memory usage of map/reduce tasks starts approaching the memory limit allocated per JVM, the frequency of garbage collection calls increases significantly, leading to a decrease in the system's throughput. We call this sudden drop in performance the "memory wall". This memory wall effect for HashJoin is shown in Figure 2. The throughput drops as more memory is used up to store the increasingly larger lookup table.

Memory is an expensive resource and consumes a lot of power [9]. Previous analysis of Amazon EC2 instance pricing shows that memory is more expensive per unit than CPU and storage space [2]. Furthermore, the memory resource will become even more precious as the number of cores is increasing faster than the available memory [10]. As a result, focusing on improving the memory efficiency of the Hadoop MapReduce runtime can significantly reduce companies' spending on cloud computing resources.

The Habanero Hadoop system reduces the memory footprint of applications in two ways. First, the system creates a shared memory space between different map tasks running on the same compute node. This shared memory space avoids the duplication of in-memory read-only data structures needed by the applications. The shared memory space also enables reuse of the in-memory data structures across multiple map tasks. Second, the system parallelizes the execution of each map task, reducing the total number of map tasks needed to take advantage of the multi-core CPU resources on each node. As a result, the runtime reduces the number of in-memory partial result data structures stored on each compute node without incurring any penalty to CPU utilization. Users can configure the maximum number of map tasks running simultaneously on each node and choose between sequential and parallel map tasks to suit the needs of I/O-intensive and compute-intensive applications.

Previous work has focused more on reducing the running time rather than the memory footprint of distributed data analytics applications [11, 12, 16]. Recently, some researchers have looked at the memory footprint of various equi-join algorithms [4] and of the group-by aggregate used in the shuffle phase of common MapReduce runtimes [2]. Our work focuses on optimizing the memory efficiency of the map phase of common MapReduce data analytics applications.

This paper makes three main contributions. First, it provides a detailed study of the memory wall performance bottleneck in popular memory-intensive data analytics applications, such as KMeans, KNN and HashJoin, running on the Hadoop MapReduce runtime. Second, it proposes an implementation of the optimized Habanero Hadoop runtime. Finally, it presents an evaluation of the performance benefits of using the Habanero Hadoop runtime system for KMeans, KNN and HashJoin.

The remainder of this paper is organized as follows. Section 2 describes the design and implementation of the HashJoin, KMeans and KNN applications on MapReduce systems. Section 3 presents the programming model of the Habanero Hadoop runtime. Section 4 provides implementation details of the Habanero Hadoop system. Section 5 evaluates the throughput, memory footprint and CPU utilization of three different applications running on the Habanero Hadoop runtime system. Section 6 discusses related work. Section 7 summarizes the paper and identifies opportunities for possible future work.

2. Data Analytics Applications

In this section, we examine three popular data analytics applications, HashJoin, KMeans and K Nearest Neighbors, that can benefit from the improved memory efficiency of the Hadoop MapReduce runtime. HashJoin is an important operation in many analytical query processing engines such as Pig [7] and Hive [14]. It is representative of I/O-intensive applications that use a large in-memory read-only reference object. Similar applications include Broadcast Join [5], Fragmented Replicated Join [1] and Map-Side Join [13]. KMeans [3, 17] is a popular iterative clustering algorithm. It is more compute intensive and keeps large in-memory cluster data and partial results in each iteration. Other popular clustering algorithms that share the same performance characteristics include FuzzyKMeans and Latent Dirichlet Allocation [3]. K Nearest Neighbors is representative of query-based learning algorithms leveraging distance metrics [8] that are compute intensive and keep a large query set and partial results in memory.

2.1 HashJoin

A MapReduce implementation of HashJoin is essential for analyzing large data sets. For example, HashJoin plays a crucial role in analyzing terabytes of unstructured logs [5]. Previous work has shown that HashJoin has the best response time and memory consumption for joining randomly ordered tables when compared to sort-based algorithms [4]. Reducing the memory footprint of HashJoin enables processing of larger data sets and increases the maximum number of concurrent queries that can be executed on each node.

HashJoin takes as input two tables, a data table S and a reference table R, and performs an equi-join on a single column of the two tables, as shown below:

S ⋈ R, with |S| ≫ |R|.

It is often the case that the size of one table S is much larger than the other data set R [1, 5, 13]. Since the larger data set S is too large to fit in one compute node, the HashJoin application has to divide up the data set S to process it across compute nodes in parallel. The smaller data set R is usually small enough that it can be loaded into the memory of a single compute node. The implementation only needs to use the map phase of the MapReduce job.

The Hadoop MapReduce runtime splits up S into small pieces and uses them as input to each map task. It broadcasts the smaller R to every map task and loads R into a hash table. Each map task reads in one key-value pair from the split of S at a time and queries the hash table containing R to see if there is a match. If there is a match, then the matching key-value pair is output. At the end of the map phase, the output key-value pairs define the output table.

The memory footprint of each map task is large because the hash table containing R takes up a lot of memory [13]. HashJoin is not compute intensive because the map function only queries the hash table once for each key-value pair. The overall time complexity of the HashJoin application is O(|S|).
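For concreteness, the following sketch shows what such a map-side HashJoin mapper might look like against the Hadoop 1.x Mapper API. It is a minimal illustration, not the paper's implementation: how the small table R reaches each node (for example, via the DistributedCache) and its tab-separated record format are assumptions made here.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class HashJoinMapper
    extends Mapper<LongWritable, Text, Text, Text> {

  // In-memory hash table holding the small reference table R.
  private final Map<String, String> lookup = new HashMap<String, String>();

  @Override
  protected void setup(Context context) throws IOException {
    // Assumed local copy of R, one "joinKey \t payload" record per line.
    BufferedReader reader =
        new BufferedReader(new FileReader("reference_table_R.txt"));
    String line;
    while ((line = reader.readLine()) != null) {
      String[] fields = line.split("\t", 2);
      lookup.put(fields[0], fields[1]);
    }
    reader.close();
  }

  @Override
  protected void map(LongWritable offset, Text record, Context context)
      throws IOException, InterruptedException {
    // Each record of the large table S is "joinKey \t payload".
    String[] fields = record.toString().split("\t", 2);
    if (fields.length < 2) {
      return;  // skip malformed records in this sketch
    }
    String match = lookup.get(fields[0]);
    if (match != null) {
      // Emit the joined row; no reduce phase is needed.
      context.write(new Text(fields[0]), new Text(fields[1] + "\t" + match));
    }
  }
}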

2.2 KMeans

KMeans is a representative clustering application that is often used to generate topics in databases of documents. Improving the memory efficiency of the runtime allows the application to process larger databases and find more topics. KMeans takes as input the parameter k and partitions a set of n sample objects into k clusters. The algorithm first chooses k random objects as centroids. Then, it assigns every sample object to the cluster that it is most similar to. Once all the sample objects have been assigned to a cluster, the algorithm recalculates the centroid location of each cluster. The process repeats until the centroid values stabilize.

In the MapReduce implementation of KMeans [3, 17], each of the n objects is represented as a sample vector. The cluster centroids are represented as vectors as well. In the map phase, each map task first reads in a file containing centroid data and creates an array containing all the cluster centroid vectors. Next, each map task reads in a subgroup of the n sample vectors. A map task calculates the similarity between each sample vector and all k centroids and assigns the sample vector to the most similar cluster. We use Euclidean distance for measuring the similarity of two vectors. At the end of each map task, the algorithm creates a partial sum of the sample vectors and a count of the sample vectors in each cluster. In the reduce phase, the algorithm adds up the partial sums of each cluster and divides by the number of samples in each cluster to calculate the new center coordinates of the cluster. Multiple MapReduce jobs are chained together to iteratively improve the quality of the cluster centroids.

The KMeans application is very memory intensive because each map task needs to keep the read-only cluster centroid data in memory. Additionally, partial results, such as the partial sums of the sample vectors, are also stored in memory and are output at the end of the map task. The partial results and the duplication of read-only cluster data reduce the amount of memory available to each map task, limiting the number of clusters KMeans can generate. The KMeans application is also compute intensive. For n sample objects and k clusters, the application performs O(n × k) similarity computations.
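As an illustration of the per-record work described above, the sketch below assigns one sample vector to its nearest centroid using squared Euclidean distance and updates the per-cluster partial sum and count. The Hadoop wiring and the centroid-loading step are omitted, and the dense-array representation is an assumption; the paper does not specify the vector layout.

public class KMeansAssignment {

  private final double[][] centroids;      // k centroid vectors
  private final double[][] partialSums;    // per-cluster vector sums
  private final long[] counts;             // per-cluster sample counts

  public KMeansAssignment(double[][] centroids) {
    this.centroids = centroids;
    this.partialSums = new double[centroids.length][centroids[0].length];
    this.counts = new long[centroids.length];
  }

  // Called once per sample vector read by the map task.
  public void accumulate(double[] sample) {
    int best = 0;
    double bestDist = Double.MAX_VALUE;
    for (int c = 0; c < centroids.length; c++) {
      double dist = 0.0;
      for (int d = 0; d < sample.length; d++) {
        double diff = sample[d] - centroids[c][d];
        dist += diff * diff;   // squared Euclidean distance
      }
      if (dist < bestDist) {
        bestDist = dist;
        best = c;
      }
    }
    for (int d = 0; d < sample.length; d++) {
      partialSums[best][d] += sample[d];
    }
    counts[best]++;
    // At the end of the map task, (partialSums, counts) are emitted so the
    // reducer can compute the new centroid of each cluster.
  }
}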

2.3 K Nearest Neighbors

The K Nearest Neighbors algorithm (KNN) is one of the most widely used classification algorithms. KNN uses two data sets, a query set Q and a training set T. It classifies the elements of Q into different categories by comparing every element with the already classified training set T and choosing the K closest elements in T.

The Hadoop MapReduce runtime splits up T into small pieces and uses them as inputs to the map tasks. In the map phase, the algorithm first loads the complete query set Q into memory. Next, for each element in Q, it calculates the distance to every element in the split of T. The application uses a priority queue to store a partial result containing only the top K most similar elements in T. In the reduce phase, the reduce tasks aggregate all the partial top-K elements to compute the overall top K elements in T for each element in Q.

KNN is a memory-intensive application because it keeps the query set Q and a partial result for each element of Q in memory. KNN is also a more compute-intensive application than HashJoin because it performs distance calculations between every point in Q and every point in T, leading to a time complexity of O(|Q| × |T|).
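The bounded top-K bookkeeping mentioned above can be kept in a small helper like the one sketched below, which maintains a max-heap of the K closest training elements seen so far for one query point. The class names and record representation are illustrative assumptions; the Hadoop plumbing is omitted.

import java.util.PriorityQueue;

public class TopKNeighbors {

  public static final class Neighbor {
    final double distance;
    final String label;
    Neighbor(double distance, String label) {
      this.distance = distance;
      this.label = label;
    }
  }

  private final int k;
  // Max-heap on distance: the head is the worst of the current top K,
  // so it can be evicted cheaply when a closer neighbor arrives.
  private final PriorityQueue<Neighbor> heap;

  public TopKNeighbors(int k) {
    this.k = k;
    this.heap = new PriorityQueue<Neighbor>(k,
        (a, b) -> Double.compare(b.distance, a.distance));
  }

  // Called for every element of the training split T.
  public void offer(double distance, String label) {
    if (heap.size() < k) {
      heap.add(new Neighbor(distance, label));
    } else if (distance < heap.peek().distance) {
      heap.poll();
      heap.add(new Neighbor(distance, label));
    }
  }

  // Emitted as the partial result of the map task; the reducer merges the
  // partial top-K lists from all splits into the global top K per query.
  public PriorityQueue<Neighbor> partialResult() {
    return heap;
  }
}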


public class MyMapper extends
    Mapper<LongWritable, Text, Text, LongWritable> {
  private DataStructure readOnlyData;
  private DataStructure partialResult;

  public void setup(Context context) {
    load the readOnlyData;
    initialize the partialResult;
  }

  public void map(LongWritable key, Text value, Context context) {
    read data;
    write to partialResult;
  }
}

Figure 3: Original programming interface for Hadoop Mapper

3. Programming Interface

This section describes the programming interface supported by the Habanero Hadoop runtime. In this model, the user extends a Mapper/Reducer library class and implements customized map/reduce and setup functions. Users need to make a few minor changes to existing Hadoop MapReduce applications to take advantage of the optimizations in Habanero Hadoop. The original programming model for common data analytics applications is shown in Figure 3. The changes in the Habanero Hadoop programming model for a typical data analytics application are shown in Figure 4.

First, the user needs to declare the shared read-only data structures as static to take advantage of the shared memory space, as shown in line 4 of Figure 4. Different map tasks running in the same JVM can share a single copy of the read-only data structure.

Furthermore, since the read-only data is shared by multiple map tasks, the user needs to make sure the read-only data is loaded and computed only once and that no map task starts execution until the data is loaded into memory. To achieve this, the user creates a critical section in the setup function using locks, as shown in lines 9 to 15 of Figure 4.

Lastly, the user extends a ParMapper library class to take advantage of multithreading within a single map task, as shown in line 2 of Figure 4. The multithreading is implicit for most MapReduce applications that only perform read operations on in-memory data structures; no change needs to be made to the map function. However, for applications that write to in-memory data structures, such as the partial result data structures used in KMeans and KNN, the user has to synchronize the write operations, as sketched below.
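A minimal sketch of one way to perform that synchronization: writes to a shared partial-sum structure are guarded by a single lock so that concurrent map invocations on different ParMapper worker threads do not race. The array-based partial sum is an illustrative assumption; the paper only requires that such writes be synchronized.

public class SynchronizedPartialSum {

  private final double[] sums;
  private final Object writeLock = new Object();

  public SynchronizedPartialSum(int width) {
    this.sums = new double[width];
  }

  // Called from map(); guarded so concurrent updates do not race.
  public void add(double[] contribution) {
    synchronized (writeLock) {
      for (int i = 0; i < contribution.length; i++) {
        sums[i] += contribution[i];
      }
    }
  }
}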

4. Implementation

The Habanero Hadoop system utilizes two optimizations, the Compute Server and the Parallel Mapper (ParMapper), to minimize memory footprint and maximize reuse of in-memory data structures in a single MapReduce job.

 1 public class MyMapper extends
 2     ParMapper<LongWritable, Text,
 3               Text, LongWritable> {
 4   private static DataStructure readOnlyData;
 5   private DataStructure partialResult;
 6   private static Lock lock;
 7   public void setup(Context context) {
 8     initialize the partialResult;
 9     lock.lock();
10     try {
11       if (readOnlyData not loaded in memory) {
12         load the readOnlyData;
13       }
14     } finally {
15       lock.unlock();
16     }
17   }
18   public void map(LongWritable key, Text value,
19                   Context context) {
20     read data;
21     synchronize writes to partialResult;
22   }
23 }

Figure 4: Habanero Hadoop programming interface

In this paper, we focus on optimizing the map phase of the applications because the map phase dominates the computation time for popular data analytics applications, including KMeans, KNN and HashJoin. These optimizations should be easily applicable to the reduce phase as well.

4.1 Compute Server

Habanero Hadoop runs a Compute Server to execute multiple map tasks in parallel within each node and creates a shared memory space between different tasks, as shown in Figure 5. The Compute Server also reuses the read-only data between subsequent map tasks on the same node.

To use the Compute Server effectively, the user first selects a value for the number of available map slots in the Compute Server in the runtime's configuration file. The specified map slots value represents the maximum number of map tasks that can be executed in parallel in the Compute Server on each compute node. The value is usually determined by the number of cores and the amount of memory available on the compute node.
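As a rough illustration, the map slots value could be supplied through Hadoop's standard Configuration mechanism, as sketched below. The property name used here is hypothetical; the paper states only that the value is set in the runtime's configuration file (in stock Hadoop 1.x, the analogous per-node setting is mapred.tasktracker.map.tasks.maximum).

import org.apache.hadoop.conf.Configuration;

public class ComputeServerConfig {
  public static Configuration withMapSlots(int slotsPerNode) {
    Configuration conf = new Configuration();
    // Hypothetical Habanero Hadoop property for Compute Server map slots.
    conf.setInt("habanero.compute.server.map.slots", slotsPerNode);
    return conf;
  }
}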

The Compute Server runs multiple map tasks in parallel to achieve good I/O and computation resource utilization. Each map task has an I/O thread that fetches input key-value pairs from an input file split. Multiple map tasks can read different input split files and deserialize the data in parallel without worrying about contention from reading the same input split file. Furthermore, running multiple map tasks at the same time can overlap I/O with computation. When one map task is loading input key-value pairs, another map task can be processing key-value pairs, keeping the CPU busy. The parallel I/O operations and the overlap between I/O and computation are critical for I/O-intensive applications, such as HashJoin.


Figure 5: Compute Server for multi-core systems

To avoid the duplication of read-only in-memory data structures, the Compute Server creates a shared memory space between different map tasks running on the same compute node. The need for an efficient shared memory space across tasks was also noted by Gillick et al. [8]. To achieve this, the map tasks need to run in the same Compute Server JVM.

The design of the Compute Server is shown in Figure 6. The tasktracker is responsible for scheduling and managing the map tasks running on the compute node. It receives a map task assigned to the current compute node and prepares to execute the map task using an available map slot. The first time a map task is assigned to the compute node, the tasktracker launches a Compute Server JVM and establishes a socket connection with the Compute Server. After the connection has been established, the tasktracker sends the task id information to the Compute Server. The Compute Server subsequently starts a Child thread in the Compute Server JVM to process the task. The Child thread acquires further information about the task from the tasktracker, using an existing Remote Procedure Call (RPC) interface in Hadoop with the task id. The Child thread then copies the required files of the map task, including the input split file for the task, from HDFS to a local directory and starts the execution of the map task.

When additional tasks are assigned to the tasktracker, it checks whether the Compute Server is already up and running. If so, the tasktracker sends the task id information through the socket connection. The Compute Server acts as a multithreaded server. It accepts multiple task execution requests from the tasktracker and creates new Child threads in the same JVM for the assigned map tasks. The Child threads again use the RPC interface to get all the information needed to set up the local directory and execute the tasks.
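The sketch below illustrates the shape of this accept loop: the Compute Server listens on a socket, reads a task id per request from the tasktracker, and starts a Child thread for each assigned map task. The class names and message format are assumptions, and the RPC lookup, HDFS localization and mapper execution are abbreviated to comments.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.ServerSocket;
import java.net.Socket;

public class ComputeServer {

  public static void main(String[] args) throws IOException {
    int port = Integer.parseInt(args[0]);
    ServerSocket server = new ServerSocket(port);
    while (true) {
      // One connection (or message) per task execution request.
      Socket conn = server.accept();
      BufferedReader in =
          new BufferedReader(new InputStreamReader(conn.getInputStream()));
      final String taskId = in.readLine();
      // Child thread: fetch task details via the tasktracker's RPC
      // interface, localize the input split from HDFS, then run the mapper.
      Thread child = new Thread(new Runnable() {
        public void run() {
          System.out.println("Launching map task " + taskId);
          // getTaskDetails(taskId); localizeFiles(taskId); runMapTask(taskId);
        }
      });
      child.start();
    }
  }
}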

Figure 6: Design and Implementation of the Compute Server

As mentioned earlier, the user creates static data structures to share the read-only data structure across different map tasks. Static data structures are shared across different instances of the mapper class running in the Compute Server JVM on each node. The next step is creating a synchronization scheme to make sure that the data structure required by the application is loaded into memory only once and that no map task starts until the data structure has been loaded into memory. Currently, we achieve this by setting up a critical section in the setup phase of each map task using locks, as shown in Section 3. Only the first thread to enter the critical section will load the data structure into memory.

The second feature of the Compute Server is the reuse of data structures between multiple map tasks. During the map phase, a large number of map tasks are assigned to each compute node. Every time a map task finishes execution, a map slot becomes available for the next map task. The Hadoop MapReduce model loads the read-only data structures into memory for every map task. HashJoin spends more than one third of the time of each map task loading the lookup table into memory. The data reuse optimization can achieve improved performance by loading the read-only in-memory data structure only once. To achieve this, we designed the Compute Server to be persistent. As a result, once a map task finishes execution in the Compute Server JVM, another map task will be launched in the same Compute Server. The new map task will again check whether the data structure is already loaded. If it is already loaded, then the map task can reuse the data structure. The persistent Compute Server JVM is shut down at the end of the MapReduce job.

4.2 Parallel Mapper

Habanero Hadoop also provides a multithreaded Parallel Mapper library (ParMapper). ParMapper can significantly reduce the memory footprint of compute-intensive applications with large in-memory partial result data structures, such as KMeans and KNN.


Figure 7: Parallel Mapper for multi-core systems

The ParMapper can process multiple input key-value pairs in parallel within a single map task. By default, Hadoop mappers sequentially generate key-value pairs from their input split. Every time a key-value pair is generated, the map task immediately processes it using the user-defined map function. This design is inherently sequential, as the map task has to finish processing one key-value pair before moving on to the next one. The ParMapper improves upon the original Hadoop mapper by subdividing the input key-value pairs of the map task into chunks and processing different chunks in parallel. The ParMapper includes a single I/O thread to prefetch key-value pairs into a buffer while other worker threads are executing map tasks. To do this, the runtime creates a clone of each key-value pair and allocates a new buffer to store the cloned key-value pairs for each chunk.

To generate dynamic task parallelism, the I/O thread starts an asynchronous task to process a buffer once it is full. A separate buffer is used for each worker thread to allow the JVM to free up buffers in completed tasks.
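The following sketch illustrates this scheme: an I/O thread clones incoming key-value pairs into a buffer and, when the buffer is full, submits the whole chunk to a worker pool as an asynchronous task, using a fresh buffer for the next chunk. A plain ExecutorService stands in for the runtime's own tasking layer, and records are simplified to string pairs; both are assumptions rather than details from the paper.

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ParMapperSketch {

  public interface MapFunction {
    void map(String key, String value);
  }

  public static void run(Iterator<String[]> input, MapFunction mapFn,
                         int bufferSize, int workerThreads)
      throws InterruptedException {
    ExecutorService workers = Executors.newFixedThreadPool(workerThreads);
    List<String[]> buffer = new ArrayList<String[]>(bufferSize);
    while (input.hasNext()) {
      String[] kv = input.next();
      // Clone the pair so the I/O thread can reuse its own record objects.
      buffer.add(new String[] { kv[0], kv[1] });
      if (buffer.size() == bufferSize) {
        submitChunk(workers, buffer, mapFn);
        buffer = new ArrayList<String[]>(bufferSize);  // fresh buffer per task
      }
    }
    if (!buffer.isEmpty()) {
      submitChunk(workers, buffer, mapFn);
    }
    workers.shutdown();
    workers.awaitTermination(Long.MAX_VALUE, TimeUnit.SECONDS);
  }

  private static void submitChunk(ExecutorService workers,
                                  final List<String[]> chunk,
                                  final MapFunction mapFn) {
    // Asynchronous task: process one chunk of cloned key-value pairs.
    workers.submit(new Runnable() {
      public void run() {
        for (String[] kv : chunk) {
          mapFn.map(kv[0], kv[1]);
        }
      }
    });
  }
}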

To load balance across multiple cores, the ParMapper automatically chooses a good task granularity for each worker thread. The granularity of each task is decided by the buffer size. Since the execution time of each call to the map function differs from application to application, there is no fixed buffer size that is good for all applications. ParMapper adaptively selects a buffer size that achieves good performance for different applications. The main thread first reads in a small number of input key-value pairs as a sample chunk and records the time needed to process the sample chunk. Based on an empirically chosen desired running time for each chunk, the runtime calculates a good buffer size. The implementation of ParMapper is shown in Figure 7.
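The buffer-size calculation itself reduces to simple arithmetic, as sketched below: scale the sample chunk size by the ratio of the desired chunk time to the measured sample time. The sample size and target time are the empirically chosen values the paper refers to; the helper here is only an illustration.

public final class ChunkSizing {
  // Scale the sample chunk so that one chunk takes roughly targetChunkMillis.
  public static int chooseBufferSize(int sampleChunkPairs,
                                     long sampleChunkMillis,
                                     long targetChunkMillis) {
    if (sampleChunkMillis <= 0) {
      return sampleChunkPairs;  // sample processed too quickly to measure
    }
    double pairsPerMilli = (double) sampleChunkPairs / sampleChunkMillis;
    return Math.max(1, (int) (pairsPerMilli * targetChunkMillis));
  }
}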

The ParMapper also improves CPU utilization for some compute-intensive tasks. The runtime dynamically subdivides each map task so that the granularity of tasks assigned to each core is smaller than the original map task. The improved granularity contributes to better load balance across cores. The effect is shown in Section 5 for KNN.

In cases where there are write operations performed in the map function on a local object, such as a partial result, the user needs to synchronize the write operations in the map function, as shown in Section 3.

5. Experimental Results

This section presents our experimental evaluation of the Habanero Hadoop runtime. We studied the three widely used applications, KMeans, KNN and HashJoin, described in Section 2. For each application, we demonstrate the improved memory efficiency and throughput of running it on the Habanero Hadoop system.

5.1 Experimental Setup

We ran our tests for HashJoin using a cluster of ten compute nodes. Each compute node has two quad-core 2.4 GHz Intel Xeon CPUs and 16 GB of memory. The experiments for KMeans and KNN were conducted on a smaller cluster with four compute nodes. Each of these compute nodes has two quad-core 2.4 GHz Intel Xeon CPUs and 8 GB of memory. The nodes are connected using an InfiniBand network switch.

We used Java 1.8.0 and Hadoop 1.0.3 to conduct the experiments. All of the experiments were conducted on top of the Hadoop Distributed File System (HDFS). The same 32 MB block size is used for all applications to rule out the impact of block size. We believe the scalability of the cluster is not a big issue here because the runtime optimizations improve the performance of every single compute node. As a result, the memory efficiency and throughput improvements for large-scale inputs should scale as more compute nodes are added to the cluster.

For the baseline, we used the unmodified Hadoop MapReduce system with eight map slots (the maximum number of map tasks running simultaneously per node) to utilize all eight cores on each node. In this configuration, the heap size limit was set to 2 GB for each task-executing JVM in the large cluster for HashJoin and 1 GB for each task-executing JVM in the smaller cluster for KMeans and KNN. For the Habanero Hadoop system, we evaluated different configurations using different numbers of sequential and parallel map slots. The heap size of the Compute Server JVM was set to 16 GB for the larger cluster and 8 GB for the smaller cluster.

For the applications, we first show a CPU utilization over time graph to demonstrate the compute and I/O intensity of the application and to select a good Habanero Hadoop configuration. Next, we show a graph of throughput over the in-memory data structure size to demonstrate the improved memory efficiency and throughput of the Habanero Hadoop system. Finally, we show a third graph with the aggregated heap size of each configuration for different runs with increasing in-memory data structure size.

Figure 8: CPU utilization for HashJoin on a single node with eight cores

5.2 HashJoin

HashJoin is an I/O-intensive application with a large in-memory read-only reference table. The performance of the application depends on parallel I/O and deserialization in the Compute Server. The memory footprint of HashJoin can be reduced significantly using the shared memory space in Habanero Hadoop's Compute Server optimization.

Figure 8 illustrates that HashJoin is an I/O-intensive application. We ran the HashJoin application with different numbers of sequential map slots in the Habanero Hadoop runtime. We performed a HashJoin operation on a 400 MB table with a smaller 200 MB table. The 200 MB table is loaded into the memory of each map task. Figure 8 shows the CPU utilization over time for HashJoin on a single compute node with eight cores. We chose to use sequential map tasks because HashJoin is not compute intensive enough to take advantage of parallel map tasks.

The line using one sequential map slot in Figure 8 shows periodic bursts of high CPU utilization. At first, the line shows a burst to almost 400% CPU utilization while computing the in-memory data structure. After the first peak, the CPU utilization peaks at close to 100% intermittently because the system is only processing the key-value pairs sequentially using one core. Almost half of the time of the HashJoin application is spent waiting for I/O operations with low CPU utilization. Habanero Hadoop with two sequential map task slots is capable of running two map tasks in parallel. This configuration of Habanero Hadoop utilizes I/O more effectively by overlapping communication and computation: the gaps between peaks in CPU utilization are much shorter and the running time is reduced by half. Using eight sequential map slots, the Habanero Hadoop runtime is able to achieve much better CPU utilization through improved I/O utilization. The peak utilization reaches around 300% even after the initial burst, and the running time is halved again. Since HashJoin is an I/O-intensive application, the Compute Server is designed to use multiple I/O threads in multiple map tasks to improve the performance of the application. As a result, the best configuration of Habanero Hadoop for HashJoin uses eight sequential map slots.

Figure 9: Breaking the memory wall for HashJoin using the Compute Server on a ten-node cluster

Figure 9 shows how Habanero Hadoop can break the memory wall. The x axis represents the size of the lookup table and the y axis represents the throughput of the system as the larger table is streamed through the map tasks. The MB/second measure represents how many megabytes of the larger table have completed the join operation with the smaller lookup table. The larger table that we use as the input is 800 MB.

The throughput of each configuration for HashJoin is relatively stable because the size of the larger table is fixed and HashJoin performs a constant-time lookup for each element in the big table, as explained in Section 2. The throughput drops because more garbage collection activity is triggered as we increase the size of the lookup table and it takes longer to load the lookup table.

Figure 9 shows that the throughput of Hadoop drops when processing large lookup tables. As the size of the lookup table increases, each map task consumes more heap space to keep the lookup table data structure in memory. When the heap memory available to each ChildJVM executing a map task is used up, garbage collection activity increases significantly. There are two types of garbage collections: a full garbage collection (full GC) and an incremental garbage collection. A full garbage collection usually takes much longer than a regular garbage collection call because a full GC often stops the execution of the map task. In this paper, we focus our study on the number of full garbage collection calls during the execution of the applications.

Table 1 shows that the throughput drop for Hadoop at around 300 MB is due to a significant increase in full garbage collection activity on each compute node. Full garbage collection activity on a compute node increases by three times for Hadoop with eight map slots processing a 300 MB lookup table. When garbage collection activity takes up more than 99% of the execution time, the JVM crashes and the task fails to finish.

Lookup Table Size (MB)    Habanero Hadoop with 8 sequential map slots    Hadoop with 8 map slots
100                       0                                              0
200                       0                                              80
300                       0                                              244
400                       0                                              376
500                       0                                              686

Table 1: Number of total full garbage collection calls on each compute node for HashJoin

Figure 10: Heap memory size on a compute node for HashJoin

On the other hand, the Habanero Hadoop system experiences no significant drop in throughput as the lookup table size increases. The Habanero Hadoop system with eight sequential map slots uses all 16 GB of memory without any duplicated copies of the data. As a result, Habanero Hadoop has much more memory available to each map task than Hadoop. Table 1 shows that the number of full garbage collection calls in Habanero Hadoop stays at zero. These data show that the throughput is stable because garbage collection activity is not a performance bottleneck for HashJoin using Habanero Hadoop for the lookup table sizes we tested. At 500 MB, the Habanero Hadoop system with eight map slots improves the throughput of HashJoin by two times.

Figure 10 shows the impact of duplicating the in-memory data structures and the memory savings of the Compute Server. The figure presents the aggregated heap memory size on a single compute node.

First, we notice that the aggregated heap size of all the configurations is linear in the lookup table size because we load the content of the lookup table text data into a hash table. Additionally, the heap size accounts for a large proportion of the overall memory usage. With a 500 MB lookup table, Hadoop uses 12 GB of heap memory.

Furthermore, the bar chart shows that duplicating the lookup table results in large heap memory usage. Hadoop with eight map slots uses six times more heap memory than the Habanero Hadoop system. The significant reduction in memory footprint comes from the fact that the lookup table data is duplicated eight times in Hadoop with eight map slots, while the Habanero Hadoop system only stores a single copy of the lookup table in the Compute Server.

We have shown that the Habanero Hadoop system can push back the memory wall, achieve better throughput for large-scale problems and improve the memory utilization on each compute node for HashJoin. The best configuration for HashJoin uses eight sequential map slots for two reasons. First, HashJoin is an I/O-intensive application and more I/O threads lead to improved performance. Second, HashJoin does not contain partial result data structures that create significant overhead when creating a large number of map tasks. HashJoin is also not compute intensive enough to justify the overhead of using parallel map tasks.

5.3 KMeans

KMeans is a compute-intensive application that stores large read-only in-memory cluster centroid data and a write-only partial result data structure. Habanero Hadoop minimizes the memory footprint by running Parallel Mappers in the Compute Server.

Figure 11: CPU utilization on a single compute node with eight cores for KMeans

Figure 11 shows that the execution time of KMeans is dominated by the compute-intensive map phase. The input document size is 55 MB and the cluster data size is 30 MB. The experiment is conducted on a single compute node with eight cores and 8 GB of memory.

The KMeans application has a map phase and a reduce phase. On a single compute node, the shuffle phase is insignificant because there is no transfer of data across different nodes. The map phase has consistently full CPU utilization and the reduce phase has fluctuating CPU utilization. The chart shows that the map phase dominates the execution time. As a result, Habanero Hadoop focuses its optimizations on the map phase of the data analytics applications.

Hadoop with one map slot consistently utilizes the CPU at 100% from 0 to 4700 seconds, because a single map slot can only use a single computation thread. The consistent 100% utilization shows that the application is compute intensive in the map phase and that a single I/O thread can keep a single core busy. Hadoop with two map slots keeps two cores busy by achieving a stable 200% CPU utilization. In the same way, the map phase performance almost scales to eight cores. The performance scales as more computation threads are used by the Hadoop MapReduce runtime.

Figure 12: Breaking the memory wall for KMeans on a four-node cluster

Figure 12 shows that the Habanero Hadoop runtime can break the memory wall for the KMeans application.

There are 200 MB of document vector data generated from the "20 newsgroups" data set, ensuring that the document vectors are representative of real news documents.

The throughput of KMeans almost follows an inverse curve. Since the application streams the documents through the cluster data, the y axis is the throughput of the system, represented by the number of documents processed per second. At each iteration, all of the documents in the database are compared against the topic centroids and assigned to different clusters. For each document processed by the map function, we compare it against all of the clusters to find the cluster that is most similar to the content of the document. As a result, the throughput decreases as we increase the number of clusters because it takes longer to calculate the similarities with all the clusters.

The memory wall can be observed in the Hadoop with eight map slots configuration. As shown in Figure 12, at 51200 clusters, the throughput drops off the inverse curve. The drop in throughput at 51200 clusters is due to increased garbage collection activity. In Table 2, we can see a 380-times increase in the number of full GCs at 51200 clusters.

Number of Clusters    Habanero Hadoop with 2 parallel map slots    Hadoop with 8 map slots
7680                  0                                            6
15360                 1                                            30
25600                 3                                            219
38400                 7                                            409
51200                 13                                           155584
64000                 15                                           N/A
76800                 21                                           N/A

Table 2: Number of total full garbage collection calls on each compute node for KMeans

In contrast, we can see that Habanero Hadoop with two parallel map slots breaks the memory wall and scales to 76800 clusters. The throughput of Habanero Hadoop with two parallel map slots is comparable to that of Hadoop with eight map slots when the number of clusters is small. This shows that Habanero Hadoop with two parallel map slots can fully utilize all eight cores on the machine.

Table 2 shows that Habanero Hadoop performs only 21 full garbage collection calls on each node even when generating 76800 clusters. As a result, the throughput of Habanero Hadoop is not hurt by full garbage collection activity and stays on the inverse curve when processing large numbers of clusters. At 51200 clusters, the Habanero Hadoop system avoids the memory wall. For cluster counts around 51200, the Habanero Hadoop runtime improves the throughput of the original Hadoop system by two times.

Finally, we show the heap memory size for Habanero Hadoop and Hadoop in Figure 13. The heap size data are collected during the test runs shown in Figure 12. The total memory is estimated by averaging the heap sizes at different times of the map phase.

We first note that Hadoop with eight map slots uses two times more heap memory than the Habanero Hadoop system. The large memory footprint arises because there are eight map tasks running in parallel, duplicating the cluster data eight times. In addition, there are other in-memory data structures associated with each map task, such as the partial result for calculating the new cluster data. Hadoop's design of running eight map tasks in parallel results in eight copies of the partial result data structures being stored in the memory of each node.

The Habanero Hadoop system uses significantly less heap memory for two reasons. First, it uses the Compute Server to eliminate the duplication of read-only cluster data. Second, the Habanero Hadoop system only needs to create two parallel map tasks to keep eight cores fully utilized. As a result, there are only two copies of the large in-memory partial result data structure.


Figure 13: Aggregated heap memory size on each compute node for KMeans

The memory footprint reduction in Habanero Hadoop for KMeans is less effective than that for HashJoin because parallel map tasks use more memory than sequential map tasks. The ParMapper creates clones of key-value pairs, and the key-value pairs are loaded into separate buffers to enable parallel execution, as described in Section 4. On the other hand, the original Hadoop mapper reuses the memory buffer for the current key-value pair to store the next key-value pair.

Figure 14 shows that Habanero Hadoop with a single parallel map slot cannot keep the CPU fully utilized. Habanero Hadoop with one parallel map slot has significantly lower throughput than Habanero Hadoop with two parallel map slots. The performance difference arises because a single parallel map slot can only use one I/O thread reading input key-value pairs on each compute node, which is not enough to keep the CPU resources fully utilized. Habanero Hadoop with two parallel map slots utilizes two parallel I/O threads to better overlap the computation of one map task with the I/O activities of the other map task, achieving higher CPU utilization. The number of I/O threads needed to fully utilize a multi-core machine for HashJoin (eight) is larger than that for KMeans (two) because HashJoin is more I/O intensive than KMeans.

Next, we analyze the impact on memory utilization of increasing the number of parallel and sequential map slots in Habanero Hadoop. Figure 15 shows that the aggregated heap memory usage increases as the number of parallel or sequential map slots increases. The increase in memory usage occurs because each parallel or sequential map task has an in-memory local partial sum data structure whose size is proportional to the number of clusters.

Habanero Hadoop with eight sequential map slots has the largest aggregated heap memory usage because it keeps eight copies of the large in-memory partial result data structures on each compute node. However, Habanero Hadoop with eight sequential map slots still has a smaller memory footprint than Hadoop with eight map slots because the Compute Server in Habanero Hadoop avoids the duplication of read-only data. Finally, the large memory footprint of Habanero Hadoop with eight sequential map slots demonstrates that the best Habanero Hadoop configuration for HashJoin is not a good configuration for KMeans, due to the memory overhead of creating a large number of map tasks.

Figure 14: ParMapper and Compute Server's impact on throughput on a four-node cluster for KMeans

Figure 15: Aggregated heap memory size on each compute node for KMeans

Habanero Hadoop with two parallel map slots, which retains two copies of the partial result data structures, has a much smaller memory footprint than Habanero Hadoop with eight sequential map slots. The reduction demonstrates that the Parallel Mapper (ParMapper) can reduce the memory footprint of compute-intensive applications without incurring any penalty to the performance of the application. Habanero Hadoop with one parallel map slot can further reduce the memory footprint at the cost of decreased throughput, as shown in Figure 14.


Figure 16: Breaking the memory wall for KNN on a four-node cluster

In summary, we show that the most efficient Habanero Hadoop configuration for KMeans uses two parallel map slots, which keeps all eight cores busy with a reduced memory footprint. The Habanero Hadoop system with two parallel map slots running on our experimental setup can reduce the heap memory usage by two times and improve the throughput by two times when generating a large number of clusters. This result also shows that the best configuration for Habanero Hadoop depends on the performance characteristics of the application.

5.4 K Nearest Neighbors

K Nearest Neighbors (KNN) is also a compute-intensive application with large in-memory read-only and partial result data structures, as described in Section 2. We skip the CPU utilization over time graph for KNN because it is very similar to that of KMeans.

Figure 16 shows that Habanero Hadoop can tackle larger problems on the multi-core cluster. The throughput of KNN also follows an inverse curve because it takes longer to process each document as the input size increases. Hadoop with eight map slots drops off the inverse curve at 89600 documents, showing that the memory wall for that configuration is around 89600 documents. On the other hand, Habanero Hadoop with two parallel map slots stays on the inverse curve even when processing 102400 documents. Moreover, the throughput of Habanero Hadoop is actually better than Hadoop even before the memory wall, because ParMapper can break map tasks into smaller subtasks, improving the CPU utilization with better load balance across multiple cores.

Table 3 shows that an increase in full GC activity is responsible for the drop in throughput for the Hadoop MapReduce runtime. At 89600 documents, Hadoop with eight map slots performs 783 times more full GCs. In contrast, the number of full GCs for Habanero Hadoop remains at most four, due to its improved memory efficiency.

Number of Documents    Habanero Hadoop with 2 parallel map slots    Hadoop with 8 map slots
7680                   0                                            0
15360                  0                                            23
25600                  0                                            21
38400                  1                                            178
51200                  0                                            327
64000                  2                                            360
76800                  2                                            483
89600                  4                                            378578

Table 3: Number of total full garbage collection calls on each compute node for KNN

Figure 17: Aggregated heap memory size on each compute node for KNN

Figure 17 shows that Habanero Hadoop reduces the memory footprint of the KNN application on each compute node. Hadoop with eight map slots has a larger memory footprint because it duplicates the data structures eight times. Habanero Hadoop with two parallel map slots keeps a single copy of the read-only data structure in memory and only creates two copies of the partial result data structures, achieving the best memory efficiency of all four configurations.

Using our experimental setup with eight cores per node, Habanero Hadoop with two parallel map slots reduces the heap memory usage of KNN by three times and improves the throughput for large numbers of documents by two times.

The performance of MapReduce runtimes for large-scale data analytics applications largely depends on efficient utilization of the memory resources in multi-core systems. In this section, we showed that the Habanero Hadoop system can reduce the memory footprint of popular MapReduce data analytics applications by as much as six times and achieve a two-times throughput improvement for large inputs. The results also show that choosing the number of map slots and the type of map task, parallel or sequential, has a significant impact on the performance of the Habanero Hadoop system.

6. Related Work

Phoenix [11] is a MapReduce runtime system that is optimized for a single shared-memory multi-core machine. The Phoenix system uses threads to spawn parallel map/reduce tasks on the same machine and utilizes shared-memory buffers for low-overhead communication. Habanero Hadoop implements several optimizations mentioned in the Phoenix paper, such as key-value pair prefetching and a dynamic framework that discovers the best input size for each task. However, the Phoenix system is not designed to scale performance across a large number of machines. Unlike the Phoenix system, the Habanero Hadoop system supports execution on a cluster of nodes. In addition, the Habanero Hadoop system focuses on reducing the memory footprint of the applications instead of improving the utilization of CPU resources.

Tiled MapReduce (TMR) [6] has a similar goal to this paper, as it also focuses on reducing the memory footprint of MapReduce applications. However, it takes a different approach to achieving improved memory efficiency by dividing a big MapReduce job into a number of independent sub-jobs and reusing the intermediate data structures among the sub-jobs. TMR optimizes MapReduce mainly at the programming model level by limiting the data to be processed in each MapReduce job. In contrast, this paper reduces the memory footprint through improved runtime system design.

Compressed Buffer Trees (CBT) [2] improve the memory efficiency of group-by operations used in MapReduce engines. The CBT compresses large in-memory append-and-aggregate datasets efficiently, leading to a reduced memory footprint for combiner operations in the map phase and for cluster-wide aggregation in the shuffle phase. In contrast, the Habanero Hadoop runtime reduces the memory footprint of the map phase of the applications.

Spark [16] uses Resilient Distributed Datasets (RDDs) to keep intermediate data structures in memory for successive iterations of MapReduce jobs. This approach avoids redundant disk I/O operations and achieves performance gains by reducing the latency between multiple iterations of MapReduce jobs. The Habanero Hadoop system focuses on improving the memory efficiency of a single MapReduce job so that more data can be kept in memory.

Main Memory Map Reduce (M3R) [12] is a high-performance implementation of the Hadoop MapReduce API. It reduces the latency of MapReduce jobs by sharing heap state between multiple MapReduce jobs. Similar to the Compute Server design, the M3R runtime can potentially run multiple mappers and reducers in the same JVM. However, M3R does not focus on the memory efficiency of a single MapReduce job. The Habanero Hadoop runtime allows sharing of heap state between different map tasks in a single MapReduce job. The optimizations implemented in Habanero Hadoop can be applied to the M3R system as well.

Blanas et al. [4] have characterized the memory footprint of various join algorithms. Their research suggests that hash-based join algorithms often achieve lower latency and a smaller memory footprint than sort-based join algorithms with unsorted input. Habanero Hadoop further reduces the memory footprint of hash-based join algorithms by creating a shared memory space between different map tasks in the same job.
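
As an illustration of how a shared lookup table removes per-task duplication in a map-side hash join, the sketch below shows a Hadoop mapper that builds the table once per JVM and lets every map task running in that JVM probe it. The class name, file name, and record layout are placeholders; this is an illustration of the idea rather than the actual Habanero Hadoop implementation.

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class SharedHashJoinMapper extends Mapper<LongWritable, Text, Text, Text> {
        // One copy of the lookup table per JVM. When map tasks share a JVM,
        // they all probe this single table instead of each loading its own copy.
        private static volatile Map<String, String> lookupTable;

        private static synchronized void loadTable() throws IOException {
            if (lookupTable != null) return;
            Map<String, String> table = new HashMap<>();
            // "lookup.tsv" is a placeholder for the file holding the small
            // relation of the join (e.g., shipped via the distributed cache).
            try (BufferedReader in = new BufferedReader(new FileReader("lookup.tsv"))) {
                String line;
                while ((line = in.readLine()) != null) {
                    String[] parts = line.split("\t", 2);
                    if (parts.length == 2) {
                        table.put(parts[0], parts[1]);
                    }
                }
            }
            lookupTable = table;
        }

        @Override
        protected void setup(Context context) throws IOException {
            loadTable();
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Probe the shared table with the join key of the large relation.
            String[] fields = value.toString().split("\t", 2);
            if (fields.length < 2) return;
            String match = lookupTable.get(fields[0]);
            if (match != null) {
                context.write(new Text(fields[0]), new Text(fields[1] + "\t" + match));
            }
        }
    }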

7. Conclusions and Future Work

In this paper, we first evaluated the performance characteristics of popular data analytics applications and identified the memory bottlenecks in the Hadoop MapReduce applications.

To tackle the memory bottleneck, we presented Habanero Hadoop, an optimized MapReduce runtime. The runtime is designed to significantly improve the memory efficiency of many popular MapReduce data analytics applications. We achieved improved memory efficiency by creating a shared memory space across different tasks running on the same compute node, enabling reuse of in-memory data structures and reducing the number of partial result data structures.

Our experimental results show that the Habanero Hadoop runtime can reduce the memory footprint of KMeans and KNN by as much as three times and improve the applications' throughput by two times for large cluster sizes on eight-core nodes. The runtime also reduces the memory footprint for HashJoin by six times and achieves a two-times throughput improvement for lookup tables greater than 400 MB with the setup we described in Section 5. Different setups with different numbers of cores might achieve different performance improvements.

We observed that, for different applications, the best approach to reducing memory footprint without incurring any penalty to CPU utilization is different. For HashJoin, the best configuration uses Habanero Hadoop with eight sequential map slots to fully utilize the I/O capability of the system. For KMeans and KNN, the best configuration uses Habanero Hadoop with two parallel map slots. This shows that the best configuration of Habanero Hadoop depends on the performance characteristics of the specific application.
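
For concreteness, the two configurations could be expressed roughly as follows. The first property is the standard Hadoop 1.x setting for map slots per node (normally placed in mapred-site.xml and read by the TaskTracker at start-up; it is shown here only to name the knob); the "habanero.hadoop.parallel.map" property is a hypothetical name we use for illustration and is not taken from the paper.

    import org.apache.hadoop.conf.Configuration;

    public class SlotConfigExample {
        /** HashJoin-style configuration: many sequential map slots. */
        public static Configuration hashJoinStyleConf() {
            Configuration conf = new Configuration();
            conf.setInt("mapred.tasktracker.map.tasks.maximum", 8);
            conf.setBoolean("habanero.hadoop.parallel.map", false); // hypothetical knob
            return conf;
        }

        /** KMeans/KNN-style configuration: few slots, parallel map tasks. */
        public static Configuration kmeansStyleConf() {
            Configuration conf = new Configuration();
            conf.setInt("mapred.tasktracker.map.tasks.maximum", 2);
            conf.setBoolean("habanero.hadoop.parallel.map", true);  // hypothetical knob
            return conf;
        }
    }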

An interesting direction for future research is to automate the process of selecting the best configuration. We would need to set up performance counters to measure the I/O, computation, and memory characteristics of the application at runtime. Using the information collected, the runtime could automatically choose the best configuration.
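
One way such a selection policy might look is sketched below, purely as an illustration: the metric names and thresholds are assumptions of ours, not tuned values from this paper, and the heuristic simply mirrors the observation that I/O-bound jobs favored many sequential slots while compute-bound jobs favored fewer parallel slots.

    public class ConfigSelector {
        /** Simple profile gathered from runtime performance counters. */
        public static class Profile {
            double ioBytesPerSec;     // measured I/O throughput
            double cpuUtilization;    // 0.0 - 1.0, averaged over map tasks
            long   readOnlyDataBytes; // size of shared read-only structures
        }

        /** Pick between many sequential map slots and few parallel ones. */
        public static String chooseConfiguration(Profile p) {
            boolean ioBound = p.cpuUtilization < 0.5 && p.ioBytesPerSec > 0;
            if (ioBound) {
                return "8 sequential map slots"; // HashJoin-like, I/O-bound jobs
            }
            return "2 parallel map slots";       // KMeans/KNN-like, compute-bound jobs
        }
    }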

References

[1] Fragment Replicated Join in Pig. http://wiki.apache.org/pig/PigFRJoin, 2009.

[2] H. Amur, W. Richter, D. G. Andersen, M. Kaminsky, K. Schwan, A. Balachandran, and E. Zawadzki. Memory-efficient Groupby-aggregate Using Compressed Buffer Trees. In Proceedings of the 4th Annual Symposium on Cloud Computing, SOCC '13, pages 18:1–18:16, New York, NY, USA, 2013. ACM. ISBN 978-1-4503-2428-1. URL http://doi.acm.org/10.1145/2523616.2523625.

[3] R. Anil, S. Owen, T. Dunning, and E. Friedman. Mahout in Action. Manning Publications Co., Greenwich, CT, 2010. URL http://manning.com/owen/.

[4] S. Blanas and J. M. Patel. Memory Footprint Matters: Efficient Equi-join Algorithms for Main Memory Data Processing. In Proceedings of the 4th Annual Symposium on Cloud Computing, SOCC '13, pages 19:1–19:16, New York, NY, USA, 2013. ACM. ISBN 978-1-4503-2428-1. URL http://doi.acm.org/10.1145/2523616.2523626.

[5] S. Blanas, J. M. Patel, V. Ercegovac, J. Rao, E. J. Shekita, and Y. Tian. A Comparison of Join Algorithms for Log Processing in MapReduce. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, SIGMOD '10, pages 975–986, New York, NY, USA, 2010. ACM. ISBN 978-1-4503-0032-2. URL http://doi.acm.org/10.1145/1807167.1807273.

[6] R. Chen, H. Chen, and B. Zang. Tiled-MapReduce: Optimizing Resource Usages of Data-parallel Applications on Multicore with Tiling. In Proceedings of the 19th International Conference on Parallel Architectures and Compilation Techniques, PACT '10, pages 523–534, New York, NY, USA, 2010. ACM. ISBN 978-1-4503-0178-7. URL http://doi.acm.org/10.1145/1854273.1854337.

[7] A. F. Gates, O. Natkovich, S. Chopra, P. Kamath, S. M. Narayanamurthy, C. Olston, B. Reed, S. Srinivasan, and U. Srivastava. Building a High-level Dataflow System on Top of MapReduce: The Pig Experience. Proc. VLDB Endow., 2(2):1414–1425, Aug. 2009. ISSN 2150-8097. URL http://dl.acm.org/citation.cfm?id=1687553.1687568.

[8] D. Gillick, A. Faria, and J. DeNero. MapReduce: Distributed Computing for Machine Learning. CS 262A class project report, Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, December 18, 2006. URL http://www1.icsi.berkeley.edu/~arlo/publications/gillick_cs262a_proj.pdf.

[9] U. Hoelzle and L. A. Barroso. The Datacenter As a Computer: An Introduction to the Design of Warehouse-Scale Machines. Morgan and Claypool Publishers, 1st edition, 2009. ISBN 159829556X, 9781598295566.

[10] K. Lim, J. Chang, T. Mudge, P. Ranganathan, S. K. Reinhardt, and T. F. Wenisch. Disaggregated Memory for Expansion and Sharing in Blade Servers. In Proceedings of the 36th Annual International Symposium on Computer Architecture, ISCA '09, pages 267–278, New York, NY, USA, 2009. ACM. ISBN 978-1-60558-526-0. URL http://doi.acm.org/10.1145/1555754.1555789.

[11] C. Ranger, R. Raghuraman, A. Penmetsa, G. Bradski, and C. Kozyrakis. Evaluating MapReduce for Multi-core and Multiprocessor Systems. In Proceedings of the 2007 IEEE 13th International Symposium on High Performance Computer Architecture, HPCA '07, pages 13–24, Washington, DC, USA, 2007. IEEE Computer Society. ISBN 1-4244-0804-0. URL http://dx.doi.org/10.1109/HPCA.2007.346181.

[12] A. Shinnar, D. Cunningham, V. Saraswat, and B. Herta. M3R: Increased Performance for In-memory Hadoop Jobs. Proc. VLDB Endow., 5(12):1736–1747, Aug. 2012. ISSN 2150-8097. URL http://dx.doi.org/10.14778/2367502.2367513.

[13] L. Tang. Hive Join Optimization. https://www.facebook.com/notes/facebook-engineering/join-optimization-in-apache-hive/470667928919, 2010.

[14] A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka, S. Anthony, H. Liu, P. Wyckoff, and R. Murthy. Hive: A Warehousing Solution over a MapReduce Framework. Proc. VLDB Endow., 2(2):1626–1629, Aug. 2009. ISSN 2150-8097. URL http://dx.doi.org/10.14778/1687553.1687609.

[15] T. White. Hadoop: The Definitive Guide. O'Reilly Media, Inc., 1st edition, 2009. ISBN 0596521979, 9780596521974.

[16] M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. J. Franklin, S. Shenker, and I. Stoica. Resilient Distributed Datasets: A Fault-tolerant Abstraction for In-memory Cluster Computing. In Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation, NSDI '12, pages 2–2, Berkeley, CA, USA, 2012. USENIX Association. URL http://dl.acm.org/citation.cfm?id=2228298.2228301.

[17] W. Zhao, H. Ma, and Q. He. Parallel K-Means Clustering Based on MapReduce. In Proceedings of the 1st International Conference on Cloud Computing, CloudCom '09, pages 674–679, Berlin, Heidelberg, 2009. Springer-Verlag. ISBN 978-3-642-10664-4. URL http://dx.doi.org/10.1007/978-3-642-10665-1_71.
