
Performance and Reliability Effects of Multi-tier Bidding on MapReduce in Auction-based Clouds

Moussa Taifi, Justin Y. Shi
Computer Science Department
Temple University
Philadelphia, USA

Email: {moussa.taifi, shi}@temple.edu

Abstract—Hadoop has become a central big data processing framework in today's cloud environments. Ensuring the good performance and cost effectiveness of Hadoop is crucial for the numerous applications that rely on it. In this paper we analyze Hadoop's performance in a multi-tier market-oriented cloud infrastructure known as Spot Instances. Amazon Spot Instances (SIs) are designed to deliver a cheap but transient alternative to fixed-cost On-Demand Instances (ODIs). Recently, AWS introduced SIs in their managed Elastic MapReduce offering. This managed framework lets users design a multi-tier Hadoop architecture using fine grained controls to define the instance types both in terms of capacity, i.e. compute/storage/network, and in terms of cost, i.e. ODI vs. SI. The performance effects of such fine grained configurations are not yet well understood. First, we analyze a set of cluster configurations that can lead to important performance effects on both the running time and the cost of such cloud Hadoop clusters. Second, we examine Hadoop's fault tolerance mechanisms and show the inadequacy of these mechanisms for multi-tier bidding architectures. Third, we discuss directions for making the Hadoop framework more market-aware without losing its focus on extreme scalability.

Keywords-Failures, Hadoop, Auction-based Clouds, Performance of Systems, Fault Tolerance.

I. INTRODUCTION

Hadoop has become a crucial element of today's cloud ecosystem [1]. Large companies need to deal with extremely large data [2]–[4] for applications such as log analysis [5], [6], intelligent e-commerce and business intelligence applications [7], [8]. Scientists have also found Hadoop's capabilities beneficial for large scale text processing [9], genome assembly [10], [11] and social network graph analysis [12], [13]. The computer research community has also contributed many studies and improvements: adapting Hadoop to opportunistic and heterogeneous environments [14], [15], improving the overall performance of jobs [16], running Hadoop as a back-end of higher level APIs [3], [17], as well as providing detailed analysis of Hadoop's failure model [18].

On the other hand, running large scale clusters dedicated to Hadoop can be a costly endeavor even using pay-as-you-go cloud resources such as AWS EC2. Spot instances are considered the cheapest purchasing model in the EC2 pricing structure. These spot instances provide market-based guarantees concerning the time of launch and termination of instances. Given this uncertainty about the availability of this type of instance, AWS Elastic MapReduce (EMR) introduced a multi-tiered option where instances are launched in one of three groups: Master, Core nodes and Task nodes. This fine grained configuration option gives more power to the cluster administrator, since it is possible to fine tune the EMR job flows to fit the budget and time requirements of various applications. The main issue that this configuration option introduces is a loss of data locality for tasks that are not part of the core group. While core nodes are part of the HDFS cluster and are also TaskTracker nodes, task nodes only run TaskTrackers and do not participate in the HDFS cluster. This loss of data locality can go against the "bringing computation to the data" philosophy that Hadoop has been built on and can affect the runtime and the cost of Hadoop jobs.

In this paper, we focus on Hadoop's behavior under a set of multi-tiered market-oriented configurations using the EMR framework. Despite the high usage of Hadoop in the cloud, little research work has been done on analyzing Hadoop's performance under multi-tiered market-oriented configurations, and we identify a lack of understanding concerning the efficiency of fine grained configurations in the context of Spot Instances on the EC2 platform.

Specifically, in this research work we analyze Hadoop's behavior under different core node to task node ratios. This ratio is important because the number of DataNodes is bounded by the number of core nodes; task nodes do not have access to local data as they do not run DataNodes and do not participate in HDFS data storage. Quantifying the performance effect of the resulting lack of data locality is hard, and in this work we analyze this effect to provide best practices for future EMR users.

Second, we analyze Hadoop's fault tolerance mechanisms in this kind of multi-tier bidding architecture. This is important because the main tradeoff of Spot Instances is cost vs. availability. Since SIs are prone to being shut down by the cloud provider, it is central to deal explicitly with failure for sustainable and cost effective usage of this type of


infrastructures. We perform a set of experiments to identify Hadoop's fault tolerance mechanisms that are the most/least beneficial when dealing with virtual machine failures in a market-based environment.

Hadoop’s popularity and ever shrinking research and oper-ating budgets make this research direction directly applicablein cloud environments. The problem of dealing with multi-tier bidding and cluster configurations is complex and weaim to contribute to the understanding of the effects theseconfigurations have on the runtime of Hadoop applications.

This paper is organized as follows. In section 2 we review related work on Spot Instances, the Hadoop framework, its fault tolerance mechanisms and current research in multi-tier bidding in cloud infrastructures. In section 3 we describe the experiments carried out to understand the effects of multi-tier bidding on performance and fault tolerance. In section 4 we provide the results of our experimentation. In section 5 we discuss the implications of our findings and propose new directions for adapting Hadoop-like frameworks to market-based and auction-based cloud infrastructures. Finally, we conclude in section 6.

II. BACKGROUND

A. Spot Instances Background

Spot Instances (SIs) are the cheapest instance types in the cloud pricing structure of the AWS EC2 infrastructure. These SIs expose directly to the user the tradeoff between cost and reliability using a market-based scheduling mechanism. The time of launch and termination of SIs is a function of the user's bid and the current market conditions.

If an SI is terminated by the cloud provider due to a rise in market prices, the user is not charged for any partial hour. On the other hand, if the SI is terminated by the user, the user is charged for the full hour. A third important characteristic is that the user pays the market price for an SI regardless of the initial bid on those instances. This aims to simulate a second-price sealed auction to encourage true value bids from customers. Amazon also provides persistent requests, which are unbounded: as long as they are active, the cloud provider will run the SI every time the market price is below the initial bid. Finally, the bid on an instance can currently only be set at request creation time and cannot be changed during the lifetime of the instance.
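To make these billing rules concrete, the following is a minimal sketch of our own of the resulting charge, under a simplified model (one instance, one market price per full hour); this is not an AWS API:

def spot_charge(hours_run, market_prices, terminated_by_provider):
    """Estimate the charge for a Spot Instance under the billing rules
    described above (simplified model, not the AWS billing system).
    hours_run: instance lifetime in hours (may be fractional)
    market_prices: market price in effect for each full hour
    terminated_by_provider: True if an out-of-bid event ended the instance
    """
    full_hours = int(hours_run)
    # The user always pays the market price, not the bid (second-price style).
    charge = sum(market_prices[:full_hours])
    partial = hours_run - full_hours
    if partial > 0 and not terminated_by_provider:
        # User-initiated termination: the partial hour is billed in full.
        charge += market_prices[min(full_hours, len(market_prices) - 1)]
    # Provider-initiated (out-of-bid) termination: the partial hour is free.
    return charge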

Since the introduction of the SI cost model, a set of research works have tried to address some of the challenges related to these opportunistic environments. Early efforts worked on simulating the availability of single instances under different bid conditions [19], [20]. This work presented a decision model and simulator to optimize the runtime and the costs of using spot instances. Further work in [21] addressed high availability for single business applications using a cost-reward model and SLA guarantees. The studies reported in [22], [23] provide resource allocation strategies that try to deal with potential unavailability periods using monetary and runtime estimation techniques.

Even if these research efforts have addressed many of the challenges of spot instances, only a few, such as [24] and [25], have addressed the applicability of spot instances to HPC environments. While developing strategies for single applications is important if they are to use spot instances, this knowledge is not fully applicable to HPC or data-intensive applications such as big data problems that require a large number of interconnected components to solve a single problem.

The closest research work related to the current work is [26], which deals directly with the usage of SIs in MapReduce applications. In that work, the authors studied the effect of adding SIs as "accelerators" to a core group of Hadoop nodes. The findings were preliminary, and the authors pointed to the large variation of performance when using SIs between failure-free runs (good performance) and runs with out-of-bid failures (worse performance than non-accelerated Hadoop clusters). Further work in [14] developed a version of Hadoop that deals with opportunistic environments such as volunteer computing. Both of these works pointed in the direction of multi-tier Hadoop clusters but did not examine the performance effect of multi-tier bidding on a Hadoop cluster.

We consider that this gap in data-intensive market-oriented computing research is important and should not be overlooked, given the potential cost and time savings of infrastructures such as SIs. In this work, we analyze the effect of multi-tier bidding on performance both in failure-free and out-of-bid failure scenarios.

B. Hadoop Background

We first briefly describe the Hadoop MapReduce framework. Also, since the main tradeoff when using SIs is the possibility of out-of-bid failures, we describe the fault tolerance mechanisms that Hadoop uses to mitigate node failures.

The Hadoop framework [27], [28] comprises two main types of tasks: Mappers and Reducers. Initially, Mappers read the input data from HDFS (the Hadoop distributed filesystem) in the form of splits. Each Mapper produces intermediary data composed of key-value pairs. These pairs are stored locally on the node running the Map function and are not written to HDFS storage. This intermediary output is the input of the Reducer tasks. Each Reducer is assigned a separate key range and tries to copy the parts of the Map outputs which have records in that key range. This phase is called the shuffle phase. Each Reducer then applies its Reduce function to each key-pair and stores its output to HDFS.
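As a concrete illustration of this model, here is a minimal word-count job in the Hadoop Streaming style (a sketch of our own, not code from this paper): the mapper emits key-value pairs, Hadoop shuffles and sorts them by key, and the reducer aggregates the values for each key.

#!/usr/bin/env python
# Minimal Hadoop Streaming word count: mapper and reducer in one file.
# Run the Map side as `wordcount.py map` and the Reduce side as
# `wordcount.py reduce`; Hadoop performs the shuffle/sort between them.
import sys

def mapper():
    # Each input line comes from a split; emit one (word, 1) pair per word.
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

def reducer():
    # Input arrives sorted by key, so counts for a word are contiguous.
    current, count = None, 0
    for line in sys.stdin:
        word, value = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                print(f"{current}\t{count}")
            current, count = word, 0
        count += int(value)
    if current is not None:
        print(f"{current}\t{count}")

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()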

To determine the progress of a MapReduce job during large scale processing, each task has a progress score. This progress score is an estimation of the completion percentage of


each task and is bounded between 0 and 1. For Reducer tasks this score is increased to 0.33 at the end of the shuffle phase, which means that all the Map intermediary outputs have been copied to the Reducers. The Reducer score is increased to 0.66 after the received key-pairs have been sorted. While the Reducer applies the Reduce function to its key-pairs, the score is between 0.66 and 1. For Mappers, the progress score is determined by the fraction of input data read. The progress rate of a task is then the ratio of this progress score over the running time of that task so far.
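This scoring scheme can be summarized in a few lines; a simplified sketch of our own (in Hadoop the score also rises gradually within each phase):

def reduce_progress(shuffle_done, sort_done, reduce_fraction):
    """Progress score of a Reduce task as described above:
    0.33 after the shuffle, 0.66 after the sort, then 0.66..1 during reduce."""
    if not shuffle_done:
        return 0.0          # still copying Map outputs
    if not sort_done:
        return 0.33
    return 0.66 + 0.34 * reduce_fraction

def progress_rate(progress_score, running_time_s):
    # Progress rate = score achieved per second of running time so far.
    return progress_score / running_time_s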

The Hadoop distributed filesystem, HDFS, is composed of a centralized metadata NameNode and of a set of distributed DataNodes. The NameNode handles the coordination of reads and writes and the filesystem metadata. Each block written to HDFS is written in a pipeline fashion and replicated across a set of DataNodes. A configurable replication factor defines the number of DataNodes that will hold each block of data. In the event of a DataNode failure, a Write or Read Time Out (WTO/RTO) occurs. Connection timeouts (CTOs) can also occur if a DataNode is not accessible. Mappers can suffer from RTOs while Reducers can suffer from WTOs, and CTOs can affect both kinds of tasks.

Each compute node can also run a TaskTracker (TT), a daemon responsible for managing local tasks. By default, TTs are initially configured with a pre-set number of Mapper and Reducer slots that depends on the compute node capacities, and all the TaskTrackers have the same number of slots. For large datasets the number of tasks to be scheduled can be far greater than the number of slots available, and a job proceeds in multiple waves. The JobTracker (JT) is a centralized component that manages users' jobs, communicates regularly with the TaskTrackers and coordinates the scheduling and location of the tasks of the currently submitted Hadoop jobs.
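The number of waves follows directly from the slot count; a small idealized sketch (ignoring skew and scheduling overhead; the 64MB split size below is an assumption for the example, not a value from this paper):

import math

def num_waves(num_tasks, num_tasktrackers, slots_per_tt):
    """Number of scheduling waves for a job: with more tasks than slots,
    tasks run in ceil(tasks / total_slots) waves (idealized model)."""
    total_slots = num_tasktrackers * slots_per_tt
    return math.ceil(num_tasks / total_slots)

# e.g. a 10GB sort with 64MB splits (160 Map tasks) on 10 TaskTrackers
# with 2 Map slots each runs in 160 / 20 = 8 Map waves.
print(num_waves(160, 10, 2))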

C. Failures in Hadoop

Due to the large number of components involved in a large scale Hadoop cluster, individual component failures are considered the norm in Hadoop and are dealt with explicitly by the framework with minimal intervention from the application user. The main components that need to be monitored are the TaskTrackers, the Map and Reduce tasks and the DataNodes. We do not discuss the fault tolerance mechanisms of logically centralized components such as the HDFS NameNode and the JobTracker; readers can find in [29] distributed coordination mechanisms to deal with these components. We use the notation initially proposed in [18], shown in Table I, to describe the fault tolerance mechanisms of Hadoop.

Table I
FAILURE TOLERANCE VARIABLES IN HADOOP

Symbol   Description
Rj       Number of Reduce tasks running for job j at time of check
Mj       Number of Map tasks for a job, equal to input data splits
ARj      Total number of shuffles attempted by Reducer R
KRj      Total number of failed shuffles attempted by Reducer R
DRj      Total number of Map outputs successfully copied by Reducer R
SRj      Total number of Maps Reducer R failed to shuffle from
FRj(M)   Total number of times Reduce task R failed to copy Map M's output
Nj(M)    Total number of notifications received by the JobTracker that Map M's output is unavailable
PRj      Time from Reducer R's start until it last made progress
TRj      Time since Reducer R last made progress
Qj       Maximum running time among completed Maps
Z(Ti)    Progress rate of task Ti
Tset     Set of tasks running or completed in job Jj

1) TaskTrackers: Concerning TaskTracker failures, the JobTracker checks every 200s whether any TaskTracker has failed to send heartbeats for the past 600s. If a TaskTracker is marked as dead, all the Map and Reduce tasks that were assigned to it are re-attempted on a different TaskTracker. In addition, any Map task that completed on that TaskTracker is also restarted if the corresponding job is still running and has at least 1 Reducer. This process is depicted in Figure 1.

2) Map Tasks: In addition to detecting dead TaskTrackers, the JobTracker also recomputes lost Map outputs early. This can happen even before declaring the corresponding TT dead, if enough Reducers notify the JobTracker that they are not able to copy a Map output. The JobTracker restarts a Map task if the number of notifications received is bigger than half the number of Reducers running for the current job and at least 3 notifications have been received for that Map task. This process is depicted in Figure 2.

3) Reduce Tasks: In the case of malfunctioning Reducers, it is the responsibility of each parent TaskTracker to restart a Reducer. If a Reducer is repeatedly not able to copy Map outputs, the TaskTracker considers that Reducer faulty. To make this decision, three conditions need to be met. First, half of the copy attempts of the Reducer need to have failed. This process is depicted in Figure 3. Second, either the total number of Maps Reducer R failed to shuffle from needs to be bigger than 5, or it must equal the difference between the total number of Map tasks and the number of Map outputs successfully copied by the Reduce task. Third, either the Reduce task has copied less than half of the total number of Map tasks, or the time since the Reducer last made progress is bigger than half the maximum of the time since the start of the Reduce task and the maximum running time of all the completed Maps. These checks determine whether the Reducer has made too little progress or, compared to other tasks, has been stalled for the majority of its predicted running time.
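Using the notation of Table I, these three conditions can be expressed compactly; a sketch of our own rendering of the predicate (not Hadoop source code):

def reducer_is_faulty(A, K, S, M, D, T, P, Q):
    """Decision predicate for a faulty Reducer, in Table I notation:
    A: ARj shuffles attempted, K: KRj failed shuffles,
    S: SRj Maps failed to shuffle from, M: Mj total Maps,
    D: DRj Map outputs copied, T: TRj time since last progress,
    P: PRj time from start until last progress, Q: Qj max completed-Map runtime."""
    cond1 = K >= 0.5 * A                          # half the copy attempts failed
    cond2 = S >= 5 or S == M - D                  # enough distinct Maps failed
    cond3 = D < 0.5 * M or T >= 0.5 * max(P, Q)   # too little progress, or stalled
    return cond1 and cond2 and cond3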


Figure 1. Detection and mitigation strategies for dead TaskTrackers (flowchart not reproduced)

Figure 2. Detection and mitigation strategies for lost Map outputs (flowchart not reproduced; Map Mi is restarted when Nj(Mi) > 3 and Nj(Mi) > 0.5 Rj)

4) Speculative execution: Due to the existence of transient slow tasks in large Hadoop clusters, called stragglers, Hadoop implements a preventive strategy by speculating under-performing tasks. The version that we are using for our tests is the stable Hadoop 0.20, which relies on the progress rate of tasks for making its scheduling decisions.

Figure 3. Detection and mitigation strategies for malfunctioning Reducers (flowchart not reproduced; Reducer Ri is restarted when KRj ≥ 0.5 ARj, SRj ≥ 5 ∨ SRj = Mj − DRj, and DRj < 0.5 Mj ∨ TRj ≥ 0.5 max(PRj, Qj))

This native approach uses individual and global task statistics to determine which task to select for speculation. A task is chosen if its progress rate is slower than the average progress rate of the other tasks in the same job by at least one standard deviation. We describe this process in Figure 4.

Figure 4. Speculation candidate selection procedure (native Hadoop; flowchart not reproduced: once a job has run for more than 60s and no task is currently speculated, a running task Tcur becomes a candidate when Z(Tcur) < avg(Z(Ti)) − std(Z(Ti)) over Ti ∈ Tset; the slowest candidate is then run on the first TaskTracker with an available slot)
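The selection rule of Figure 4 can be sketched as follows (our own simplified rendering, not Hadoop source; `progress_rates` stands in for the JobTracker's per-task statistics):

import statistics

def speculation_candidates(tasks, progress_rates):
    """Select speculation candidates as described above: a task qualifies
    if its progress rate is at least one standard deviation below the
    average progress rate of the tasks in the same job.
    tasks: list of task ids; progress_rates: dict task id -> Z(Ti)."""
    rates = [progress_rates[t] for t in tasks]
    mean, std = statistics.mean(rates), statistics.pstdev(rates)
    candidates = [t for t in tasks if progress_rates[t] < mean - std]
    # Hadoop 0.20 then launches the slowest candidate on the first
    # TaskTracker with a free slot (and caps concurrent speculative copies).
    return sorted(candidates, key=lambda t: progress_rates[t])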


D. Multi-tier Bidding Background

One of the most fundamental differences between current large scale scientific clusters and auction-based cloud clusters is that the bidding strategy chosen by the user also defines the failure rate of the cluster nodes. This is in contrast with the traditional cluster failure model, which only considers hardware, software and human errors to be part of the failure spectrum. This traditional model assumes that failures can affect up to 10 percent of the cluster at one time. Traditionally, users/admins have assumed that failures in a large scale cluster can affect only a single node or a subset of the processing nodes at one time.

Contrary to traditional clusters, in the case of spot instances it is central to recognize that the bidding strategy that the cluster user/admin employs affects the failure rate and the failure model of the cloud cluster. On one hand, the interaction between the market price fluctuation and the bidding price choice produces a matching expected failure frequency [24]. On the other hand, the granularity of the bidding strategy, which is related to how the bid decision spreads the risk over the individual instances, affects the failure model itself [25].

To illustrate this idea we differentiate between uniform and multi-tier bidding on the spot instance market. In the case of uniform bidding, the whole cluster behaves as a single "macro" application with a lifetime based on a single bid. This leads to an "all-or-nothing" behavior, as shown in Figure 5: when the market price rises above the user's bid, all the nodes are shut down. We call this effect bid-price coupling [25].

Figure 5. Example of an out-of-bid failure in a spot-based virtual cluster with uniform bidding leading to a total failure (the market price moves above the customer's single bid price)

On the other hand, multi-tier bidding is a logical direction for mitigating this price coupling. One illustration of this strategy is shown in Figure 6, where nodes are grouped in bid classes with bid prices decreasing from the on-demand price down to some minimum bid price. This strategy does not completely remove the possibility of complete cluster failure, but it has the potential to provide more flexibility to meet budget and deadline constraints, as well as a finer grain of control over which parts of the cluster can be hardened to deal with out-of-bid failures.
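To make the contrast with uniform bidding concrete, a small simulation sketch of our own (all bid and price values are hypothetical):

def surviving_nodes(bids, market_price):
    """Count the nodes whose bid still covers the market price;
    an out-of-bid node is terminated by the provider."""
    return sum(1 for bid in bids if bid >= market_price)

uniform = [0.05] * 8                                       # one bid, whole cluster
tiered = [0.12, 0.12, 0.08, 0.08, 0.05, 0.05, 0.03, 0.03]  # four bid classes

for price in (0.04, 0.06, 0.10):   # hypothetical market spikes
    print(price, surviving_nodes(uniform, price), surviving_nodes(tiered, price))
# A spike above 0.05 kills the entire uniform cluster ("all-or-nothing"),
# while the tiered cluster loses nodes gradually, class by class.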

Figure 6. Example of an out-of-bid failure in a spot-based virtual cluster with non-uniform bidding leading to a partial failure (bid classes range from the on-demand price through high, medium and low bid prices; the market price moves above the customer's lowest bid price)

These observations apply directly to decoupled programming models such as Hadoop/MapReduce. While this programming model considers failures as the norm, we argue that the fault tolerance mechanisms that Hadoop implements are only partially effective in market-oriented clusters such as SI clusters. The lack of information about the nature of the failures that affect the nodes, the conservative thresholds and the sometimes conflicting strategies [18] make it hard for Hadoop to detect and differentiate between traditional node failures and market-based failures. Furthermore, Hadoop, being a data-intensive programming model, has to keep large amounts of data distributed in the cluster using a distributed filesystem such as HDFS [30]. While HDFS assumes that the user takes responsibility for determining a suitable replication factor when it is used in a faulty environment, the default factor of 3 copies per block cannot be used directly on spot instances, since all the replicas could be randomly distributed on low bid instances and be lost all at once if the market outbids their individual prices.
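Under a simplified placement model of our own (replicas placed uniformly at random across all nodes, and a single out-of-bid event terminating every spot node at once), the fraction of blocks at risk can be estimated directly:

from math import comb

def prob_block_lost(total_nodes, spot_nodes, replication=3):
    """Probability that a block with `replication` randomly placed replicas
    lands entirely on spot nodes, and is therefore lost in one out-of-bid
    event (simplified placement model, not HDFS's actual placement policy)."""
    return comb(spot_nodes, replication) / comb(total_nodes, replication)

# Example: 10 DataNodes of which 8 are spot-backed, replication factor 3:
# comb(8,3)/comb(10,3) = 56/120, i.e. nearly half the blocks are at risk.
print(prob_block_lost(10, 8))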

One current solution, provided by the EMR (Elastic MapReduce) framework on AWS, proposes a mitigating strategy by giving the user the option to use a mixture of On-Demand and Spot Instances depending on the role allocated to each node. This includes allocating the centralized and sensitive components, such as the NameNode, the JobTracker and the HDFS DataNodes, on OD instances. On the other hand, TaskTrackers can be assigned to the OD instances as well as to "accelerator" nodes that run on SIs and do not participate in the HDFS cluster.

One of the potential problems with the multi-tier architecture of EMR when using SIs is the potential loss of data locality for certain tasks. Map tasks are usually scheduled on nodes that hold the input splits those Map tasks need. This can dramatically improve the performance of Map tasks, since they are able to read their input data directly from the local disks. However, if a Map task is scheduled on a Task node which does not have a DataNode process running locally, then all the reads for this Map task will be done over the network, which can slow down this Map task significantly. In the case of Maps that need to write their outputs, this architecture change should have minimal effect because Maps usually write their intermediate results to the local disk.


In the case of the Reduce tasks, the shuffle phase almost always happens over the network, since they need to collect all the key pairs to which they were assigned from all the Map tasks that generated these key pairs. On the other hand, the write phase is done to HDFS. Since some Reduce tasks can also be scheduled on Task nodes, their writes will go over the network, which can also slow these tasks considerably.

We are interested in quantifying the performance effect of this lack of data locality. We are also interested in the performance effect of out-of-bid failures in this multi-tier context, especially since Hadoop's default fault tolerance mechanisms might not be optimized for this kind of environment.

III. METHODOLOGY

We divided our evaluation into two parts, shown in Table II. First, we evaluated the effect of multi-tier bidding on the Elastic MapReduce (EMR) cloud platform [31]. Second, we simulated the multi-tier EMR platform on the Tcloud private HPC cloud [32], a local virtualized testbed. We evaluated the effect of out-of-bid failures on performance and how Hadoop's fault tolerance mechanisms react to such failures.

For the EMR experiments we used 11 EC2 nodes. One node ran the NameNode and the JobTracker; the rest were divided into core nodes and task nodes. Core nodes run both DataNodes and TaskTrackers, while Task nodes only run TaskTrackers. The master node is of type m1.xlarge, which has 4 virtual cores with 2 EC2 Compute Units each, 15GB of memory and 1.7TB of storage. The core and task nodes are of type m1.small, which has 1 virtual core with 1 EC2 Compute Unit, 1.7GB of memory and 160GB of storage. The network used is the default EC2 network, which consists of 1 GigE with 500 to 800 Mbps bandwidths [33], and all the instances were located in the same region (US-East).

For the Tcloud experiments we used 10 virtual machines with 2 virtual cores, 4GB of memory and 20GB of storage each. The network was a dedicated 1 GigE network. The 10 virtual machines ran on a 10 physical node cluster with 12 cores, 24GB of memory and 2TB of storage per node. We used the Hpcfy v2.2 virtual cluster management library [34] to build the Hadoop clusters out of the virtual machines.

For all the benchmarks we used Hadoop version 0.20. The EMR clusters had 2 Map and 1 Reduce slot per TaskTracker node. The Tcloud virtual clusters were configured with 2 Map and 2 Reduce slots per TaskTracker.

For the EMR cluster we measured the failure-free multi-tier cluster performance. These experiments aim to determine the effect of the data locality loss introduced by the multi-tier setup. We varied the ratio of core nodes to the total number of nodes (denoted the C/T ratio in the rest of this paper) from 0.2 to 1, which effectively modifies the ratio of on-demand instances to the total number of instances (SIs + ODIs).

Table II
ENVIRONMENTS USED IN EVALUATION

Environment      Scale (VMs per cluster)   Experiments
EC2 production   11 (x4)                   Measuring failure-free multi-tier cluster performance
Local testbed    10 (x4)                   Measuring failure-injection effect on multi-tier cluster performance

We evaluated 4 ratios: 0.2, 0.5, 0.8 and 1. We used 4 benchmarks: DFSIO-Read, DFSIO-Write, Terasort and Wordcount.

For the Tcloud cluster we specifically tried to reproduce multi-tier clusters similar to those of EMR when using Spot Instances. To achieve this, for each run we started the virtual machines and used the Hpcfy toolkit to contextualize them into one Hadoop cluster. Then we decommissioned a number of VMs from the HDFS cluster to match the desired ratios of core to task nodes. For this experiment we used the 0.8 and 0.5 ratios of core nodes to total nodes.

In this experiment we chose to study a set of relatively short Hadoop jobs, emphasizing response time, which have been shown to be popular in current public workload traces [15], [35], [36]. The latest SQL-like frameworks that run on top of MapReduce, such as HBase [37] and Pig [38], show the need for tools where the user wants an answer quickly, for example when accessing log data for monitoring or for decision making in data-driven business situations. We used the Terasort benchmark to sort a 10GB random dataset. In a failure-free and all-core environment the job takes 210 seconds to complete.

Our goal is to assess the efficiency of Hadoop's fault tolerance mechanisms in the case of task node failures. We do not aim to analyze Hadoop over many job types. We consider Terasort a suitable benchmark as it contains a set of desirable benchmark features [39]. Parallel sorting has a set of properties that make it a reliable benchmark: it is simply described and can be scaled easily. Additionally, for data-intensive architectures it tests the system's ability to move data efficiently, which effectively tests the bisection bandwidth of the underlying system. Finally, it provides measurable levels of stability, determinism and load balancing based on the input data and algorithm used, which can provide reasonable baselines for other applications.

On the other hand, we argue that the insights of our paper are applicable to longer jobs and other types of jobs. Hadoop's failure mechanisms, described above, remain the same at larger scale, since they use a mixture of fixed thresholds and proportions of IO/network failures.

We simulate the behavior of the spot market, where all the task nodes that are out-bid are shut down by the cloud provider. We inject a fail-stop failure by killing all the Task nodes, i.e. effectively terminating each Task VM,


at a specified time after the job is started and before the end of the failure-free job runtime. We introduce the Task node failures at 30s intervals across runs. After each run we restart all the VMs and rebuild the whole virtual cluster from scratch.

In the following section we start by describing the results on the EMR platform, covering the performance effect of the multi-tier bidding architecture. Then we describe the results of the fault injection experiments on the Tcloud platform.

IV. RESULTS

In this section we report the results of the experiments performed on the EMR and Tcloud platforms. First, we discuss the effect of multi-tier bidding on EMR performance using 4 benchmarks and 4 C/T ratios of core nodes to total nodes in a failure-free environment. Second, we discuss the performance effect of failure injections in a multi-tier bidding architecture using the Tcloud platform.

A. Multi-tier bidding effect on EMR failure-free performance

Concerning the EMR DFSIO-Read benchmark in Figure 7, we notice that increasing the C/T ratio consistently improves the read performance, but the improvement is not linear in C/T. For example, setting C/T to 0.2 produces a failure-free performance loss of at most 3.07x, while the expected loss should be in the 5x range since we are using only 1/5th of the virtual drives available to HDFS. Also, for a C/T ratio of 0.8 we experience at most a 1.11x performance loss compared to an expected 1.25x, since we are using only 4/5th of the available drives.
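The "expected" losses quoted in this section follow from a naive baseline in which aggregate HDFS disk bandwidth scales with the number of DataNodes; a one-line sketch of that model (our own framing of the comparison):

def expected_slowdown(ct_ratio):
    """Naive baseline: with only a C/T fraction of nodes hosting DataNodes,
    aggregate local-disk bandwidth shrinks by that fraction, so the expected
    slowdown is its inverse (ignores network reads absorbing the gap)."""
    return 1.0 / ct_ratio

# C/T = 0.2 -> 5.0x expected (3.07x observed for reads);
# C/T = 0.8 -> 1.25x expected (1.11x observed for reads).
print(expected_slowdown(0.2), expected_slowdown(0.8))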

Figure 7. Elastic MapReduce DFSIO-Read running times comparison with 4 different Core nodes to Total nodes ratios (x-axis: data size in GB; y-axis: speedup T_all-core / T_multi-tier; series: C/T = 0.2, 0.5, 0.8, 1)

Concerning the EMR DFSIO-Write benchmark in Figure 8, we notice that increasing the C/T ratio does not always improve the write performance. In the case of C/T = 0.8, the write performance is actually improved by at least 1.32x instead of suffering the expected 1.25x loss. For the other C/T ratios, we notice that a 0.5 ratio gives a loss of 1.17x instead of 2x, and 0.2 gives a loss of 1.31x instead of 5x.

We also notice that both the read and write benchmarks are sensitive to the data size: as the data size increases, performance slowly decreases for all three ratios.

Figure 8. Elastic MapReduce DFSIO-Write running times comparison with 4 different Core nodes to Total nodes ratios (x-axis: data size in GB; y-axis: speedup T_all-core / T_multi-tier; series: C/T = 0.2, 0.5, 0.8, 1)

Concerning the EMR Terasort benchmark in Figure 9, we notice that increasing the C/T ratio improves the sort performance, and that the improvement grows with the data size. Setting C/T to 0.8 gives minimal performance loss compared to the expected 1.25x loss. In the case of C/T = 0.5, the performance loss is at most 1.15x while the expected value is 2x, and it decreases with larger data sizes. For C/T = 0.2, the performance loss is at most 1.5x compared to the expected 5x performance loss.

Figure 9. Elastic MapReduce Terasort running times comparison with 4 different Core nodes to Total nodes ratios (x-axis: data size in GB; y-axis: speedup T_all-core / T_multi-tier; series: C/T = 0.2, 0.5, 0.8, 1)

Concerning the EMR Wordcount benchmark in Figure 10, we notice that increasing the C/T ratio has a less pronounced performance effect for all three ratios. The most striking result is the 0.2 ratio, which sees only a 1.047x performance loss compared to an expected 5x loss relative to a cluster using all the available virtual drives. The other C/T ratios, 0.5 and 0.8, see minimal losses and even running times comparable to those of the all-core cluster.

Figure 10. Elastic MapReduce Wordcount running times comparison with 4 different Core nodes to Total nodes ratios (x-axis: data size in GB; y-axis: speedup T_all-core / T_multi-tier; series: C/T = 0.2, 0.5, 0.8, 1)

In summary, we noticed that the C/T ratio in multi-tier Hadoop clusters has a smaller negative effect than expected. For all the workloads, increasing the C/T ratio improved performance, except for the write benchmark. However, increasing the C/T ratio is characterized by rapidly diminishing returns, which should be taken into account when choosing this ratio.

On the other hand, one of the main attractive features of spot instances is the cost savings that come with paying the cloud market price. In our case we noticed that, on average, the spot price for m1.small instances was 11x cheaper than the on-demand price. Effectively, a smaller C/T ratio decreases the cost of failure-free Hadoop jobs, and the previous benchmarks showed that smaller C/T ratios have a relatively small effect on performance and thus on running time.
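A back-of-the-envelope sketch of this cost effect (the prices are illustrative placeholders reflecting the 11x ratio above, not actual EC2 rates):

def cluster_hourly_cost(total_nodes, ct_ratio, od_price, spot_price):
    """Hourly cost of a multi-tier cluster: core nodes are billed at the
    on-demand price, task nodes at the spot price (master node omitted)."""
    core = round(ct_ratio * total_nodes)
    task = total_nodes - core
    return core * od_price + task * spot_price

# With spot ~11x cheaper (e.g. 0.044 vs 0.004 per hour, hypothetical):
for ct in (0.2, 0.5, 0.8, 1.0):
    print(ct, cluster_hourly_cost(10, ct, 0.044, 0.004))
# C/T = 0.2 costs 0.12/hour vs 0.44/hour for an all-core cluster, i.e.
# roughly a quarter of the cost for (per the benchmarks above) a modest slowdown.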

The challenge comes in choosing the correct ratio for each application. Related work in [25] discusses possible directions for this problem. We consider this problem outside the scope of this paper.

B. Effect of failure injection on multi-tier MapReduce performance

We also evaluated the effect of failure injection on multi-tier MapReduce performance. This experiment differs from previous experiments related to failure analysis in MapReduce clusters [18]: in our experiments we inject failures in the form of total VM termination. Furthermore, contrary to single node failure simulations, we experiment with the effect of terminating a whole Task node group in each run. This setup simulates spot instances that can all go down at once if the market price fluctuation provokes an out-of-bid failure in all the task nodes.

To evaluate the failure mechanisms of Hadoop 0.20, we only analyze the failures of distributed components and do not deal with the centralized components, mainly the NameNode and the JobTracker. We also restrict our study to failures affecting the task nodes that do not contribute to the HDFS cluster. We are thus left with the 4 mechanisms described in section II-C, dealing with the following components: TaskTrackers, Map tasks, Reduce tasks and speculative tasks. We are interested in the effect of multiple failures on these mechanisms, to see which ones get triggered and how they affect the application runtime.

We ran two sets of experiments that differ in the C/T ratio: 0.5 and 0.8. For each we obtained an average failure-free runtime. Then we ran the Terasort benchmark while varying the time of failure from the start to near the end of the failure-free job time. We ran this experiment with and without speculation to isolate the benefits of each mechanism.

For the cluster with a C/T ratio of 0.5, we notice in Figure 11 that the average failure-free runtime is 250s. On the other hand, no matter when the failure occurs, all the runs get delayed by at least 600s on average. This is the default timeout of Hadoop's TaskTracker death detection. We also notice minimal difference between runs with speculation enabled and disabled.

Figure 11. Tcloud MapReduce Terasort running times comparison with a Core nodes to Total nodes ratio of 0.5 (x-axis: time of out-of-bid failure relative to start of job (s); y-axis: time (s); series: average with no speculation, average with speculation, average failure-free)

For the cluster with a C/T ratio of 0.8, Figure 12, we notice that the average failure-free runtime is 280s. We see similar effects as in the 0.5 cluster: the runs are delayed by at least 600s for all the failure times, and enabling speculation does not bring any additional gains.

This consistent behavior is explained by two elements.


Figure 12. Tcloud MapReduce Terasort running times comparison with a Core nodes to Total nodes ratio of 0.8 (x-axis: time of out-of-bid failure relative to start of job (s); y-axis: time (s); series: average with no speculation, average with speculation, average failure-free)

First, the long wait for the 600s timeout shows that the main mechanism that gets triggered in our case is the TaskTracker death detection and recovery (Figure 1). The detection of lost Map outputs (Figure 2) does not get triggered because there are not enough notifications from Reducers to restart the lost Map tasks; less than half of the Reducers sent notifications to the JobTracker. Also, the detection of faulty Reducers (Figure 3) is the responsibility of the TaskTracker, and therefore if the VM is terminated, the TaskTracker is terminated too, which means that there is no recovery of faulty Reducers in our case.

The second point is that the default speculative execution could not fully take effect. Even if the JobTracker selects a task to be rerun on a different node, that node might itself be down because it is part of the Task node group. The JobTracker does not know that a node is unavailable to run speculative tasks until its TaskTracker times out. Another inefficiency of the speculative execution is that it caps the number of concurrently speculated tasks at 1. Since 20% or 50% of the nodes go down in an out-of-bid failure, only one of the tasks is rescheduled, leaving the others to wait for their respective TaskTrackers to time out before being retried.

V. DISCUSSION

Fine grain multi-tier bidding

The multi-tier bidding in the EMR platform constitutes a starting point for fine grained market-based clusters. Currently the platform only allows 2 types of nodes, core and task nodes. We think that more classes can give more flexibility to users. This can accomplish a wider spread of risk through wider bids and possibly wider instance types, without further loss of performance. Also, as we stated previously, the selection of risk-aware C/T ratios is still a new topic, sitting at the intersection of economics, computer science and systems research.

Adaptive Thresholds

Hadoop relies on static timeouts and thresholds which are set at startup time and cannot be tuned at runtime. This does not allow for flexibility when faced with new information. The conservative nature of this approach is inherited from best practices and assumptions in production clusters. However, in the case of novel market-based clusters, more adaptability is needed, especially when additional knowledge such as the current market prices is available. Future work will try to include a feedback loop that triggers faster detection and recovery responses using knowledge of market prices. One can imagine a market-aware JobTracker that would be able to identify dead trackers faster using the bid price and the current market price. Another direction is to increase the information sharing between the bidding platform and the centralized components such as the JobTracker, to determine the number of Task nodes that are susceptible to going down and to adapt the number of speculative tasks accordingly.
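As a sketch of what such a feedback loop could look like (entirely hypothetical; this is not part of Hadoop or EMR):

def tasktracker_timeout(default_timeout_s, market_price, bid_price):
    """Hypothetical market-aware heartbeat timeout: once the market price
    exceeds a node's bid, the node has almost certainly been terminated
    out-of-bid, so the JobTracker could shrink the default 600s
    dead-TaskTracker timeout drastically and recover tasks sooner."""
    if market_price > bid_price:
        return 30           # assume out-of-bid termination; fail fast
    return default_timeout_s

# Nodes bidding below the current market price are declared dead quickly:
print(tasktracker_timeout(600, market_price=0.06, bid_price=0.05))  # -> 30
print(tasktracker_timeout(600, market_price=0.04, bid_price=0.05))  # -> 600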

VI. CONCLUSION

In this paper we analyzed Hadoop's performance in a multi-tier market-oriented architecture. We examined several aspects of the Hadoop framework and how it reacts to various design decisions. We analyzed the performance effect of the ratio of core to task nodes and the effect of failures in such multi-tier Hadoop clusters. We found that the core to task nodes ratio in multi-tier Hadoop clusters may be used to decrease the cost of Hadoop jobs with minimal negative performance effects in failure-free situations. However, under out-of-bid failures, the current Hadoop fault tolerance methods are not fully aware of the market and can thus cause unnecessary performance losses due to various static timeouts and fixed thresholds that are market-oblivious. This trade-off between cost, performance and reliability may be reconciled by further studies into optimal bidding strategies in cloud markets and adequate market-aware fault tolerance for the Hadoop framework.

We also believe that our results can provide insights beyond the Hadoop/MapReduce framework and can contribute to the development of more flexible and tunable large scale computing frameworks that can achieve the larger goals of sustainable extreme scale HPC cloud computing.

VII. ACKNOWLEDGMENTS

This research is supported in part by National Science Foundation grant CNS 0958854 and by educational resource grants from Amazon.com. Views and conclusions contained in this document are those of the authors and should not be interpreted as representing the opinions of the funding agencies or sponsoring companies mentioned.

REFERENCES

[1] Hadoop Wiki. (2012, 11) Projects powered by Hadoop MapReduce. http://wiki.apache.org/hadoop/poweredby. Electronic. Apache Software Foundation.

[2] A. Thusoo, Z. Shao, S. Anthony, D. Borthakur, N. Jain, J. Sen Sarma, R. Murthy, and H. Liu, "Data warehousing and analytics infrastructure at Facebook," in Proceedings of the 2010 International Conference on Management of Data. ACM, 2010, pp. 1013–1020.

[3] A. Thusoo, J. Sarma, N. Jain, Z. Shao, P. Chakka, N. Zhang, S. Antony, H. Liu, and R. Murthy, "Hive - a petabyte scale data warehouse using Hadoop," in Data Engineering (ICDE), 2010 IEEE 26th International Conference on. IEEE, 2010, pp. 996–1005.

[4] J. Zawodny, "Yahoo! launches world's largest Hadoop production application," Yahoo! Developer Network Blog, 2008.

[5] D. Beaver, S. Kumar, H. Li, J. Sobel, and P. Vajgel, "Finding a needle in Haystack: Facebook's photo storage," Proc. 9th USENIX OSDI, 2010.

[6] How Rackspace now uses MapReduce and Hadoop to query terabytes of data. http://highscalability.com/how-rackspace-now-uses-mapreduce-and-hadoop-query-terabytes-data.

[7] D. Borthakur, J. Gray, J. Sarma, K. Muthukkaruppan, N. Spiegelberg, H. Kuang, K. Ranganathan, D. Molkov, A. Menon, S. Rash et al., "Apache Hadoop goes realtime at Facebook," in Proceedings of the 2011 International Conference on Management of Data, SIGMOD, vol. 11, 2011, pp. 1071–1080.

[8] T. Craig and M. Ludloff, Privacy and Big Data. O'Reilly Media, Inc., 2011.

[9] Cloud9. (2012, 5) MapReduce library for data-intensive text processing. http://lintool.github.com/cloud9/. Electronic.

[10] Contrail. (2012, 5) Contrail: Assembly of large genomes using cloud computing. http://sourceforge.net/apps/mediawiki/contrail-bio. Electronic.

[11] R. Taylor, "An overview of the Hadoop/MapReduce/HBase framework and its current applications in bioinformatics," BMC Bioinformatics, vol. 11, no. Suppl 12, p. S1, 2010.

[12] U. Kang, C. Tsourakakis, and C. Faloutsos, "PEGASUS: A peta-scale graph mining system - implementation and observations," in Data Mining, 2009. ICDM '09. Ninth IEEE International Conference on. IEEE, 2009, pp. 229–238.

[13] W. Xue, J. Shi, and B. Yang, "X-RIME: Cloud-based large scale social network analysis," in Services Computing (SCC), 2010 IEEE International Conference on. IEEE, 2010, pp. 506–513.

[14] H. Lin, X. Ma, J. Archuleta, W. Feng, M. Gardner, and Z. Zhang, "MOON: MapReduce on opportunistic environments," in Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing. ACM, 2010, pp. 95–106.

[15] M. Zaharia, A. Konwinski, A. Joseph, R. Katz, and I. Stoica, "Improving MapReduce performance in heterogeneous environments," in Proceedings of the 8th USENIX Conference on Operating Systems Design and Implementation. USENIX Association, 2008, pp. 29–42.

[16] J. Dittrich, J. Quiane-Ruiz, A. Jindal, Y. Kargin, V. Setty, and J. Schad, "Hadoop++: Making a yellow elephant run like a cheetah (without it even noticing)," Proceedings of the VLDB Endowment, vol. 3, no. 1-2, pp. 515–529, 2010.

[17] C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins, "Pig Latin: a not-so-foreign language for data processing," in Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data. ACM, 2008, pp. 1099–1110.

[18] F. Dinu and T. Ng, "Understanding the effects and implications of compute node related failures in Hadoop," in Proceedings of the 21st International Symposium on High-Performance Parallel and Distributed Computing. ACM, 2012, pp. 187–198.

[19] A. Andrzejak, D. Kondo, and S. Yi, "Decision model for cloud computing under SLA constraints," in Modeling, Analysis & Simulation of Computer and Telecommunication Systems (MASCOTS), 2010 IEEE International Symposium on. IEEE, 2010, pp. 257–266.

[20] S. Yi, D. Kondo, and A. Andrzejak, "Reducing costs of spot instances via checkpointing in the Amazon Elastic Compute Cloud," in Cloud Computing (CLOUD), 2010 IEEE 3rd International Conference on. IEEE, 2010, pp. 236–243.

[21] M. Mazzucco and M. Dumas, "Achieving performance and availability guarantees with spot instances," in High Performance Computing and Communications (HPCC), 2011 IEEE 13th International Conference on. IEEE, 2011, pp. 296–303.

[22] W. Voorsluys and R. Buyya, "Reliable provisioning of spot instances for compute-intensive applications," arXiv preprint arXiv:1110.5969, 2011.

[23] W. Voorsluys, S. Garg, and R. Buyya, "Provisioning spot market cloud resources to create cost-effective virtual clusters," in Proceedings of Algorithms and Architectures for Parallel Processing, pp. 395–408, 2011.

[24] M. Taifi, J. Shi, and A. Khreishah, "SpotMPI: a framework for auction-based HPC computing using Amazon spot instances," in Proceedings of Algorithms and Architectures for Parallel Processing, pp. 109–120, 2011.

[25] M. Taifi, "Banking on decoupling: Budget-driven sustainability for HPC applications on EC2 spot instances," in Proceedings of the 1st International Workshop on Dependability Issues in Cloud Computing, 2012.

[26] N. Chohan, C. Castillo, M. Spreitzer, M. Steinder, A. Tantawi, and C. Krintz, "See Spot run: using spot instances for MapReduce workflows," in Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing. USENIX Association, 2010, pp. 7–7.

[27] Hadoop Wiki. (2012, 11) Apache Hadoop project page. http://hadoop.apache.org/. Electronic. Apache Software Foundation.

[28] T. White, Hadoop: The Definitive Guide. Yahoo Press, 2010.

[29] P. Hunt, M. Konar, F. Junqueira, and B. Reed, "ZooKeeper: Wait-free coordination for internet-scale systems," in Proceedings of the 2010 USENIX Annual Technical Conference. USENIX Association, 2010, pp. 11–11.

[30] K. Shvachko, H. Kuang, S. Radia, and R. Chansler, "The Hadoop distributed file system," in Mass Storage Systems and Technologies (MSST), 2010 IEEE 26th Symposium on. IEEE, 2010, pp. 1–10.

[31] Amazon AWS. (2012, 10) Elastic MapReduce platform. http://aws.amazon.com/elasticmapreduce/. Electronic. Amazon AWS.

[32] M. Taifi and J. Shi. (2012) Tcloud, HPCaaS private cloud. [Online]. Available: http://tec.hpc.temple.edu

[33] K. Jackson, L. Ramakrishnan, K. Muriki, S. Canon, S. Cholia, J. Shalf, H. Wasserman, and N. Wright, "Performance analysis of high performance computing applications on the Amazon Web Services cloud," in Cloud Computing Technology and Science (CloudCom), 2010 IEEE Second International Conference on. IEEE, 2010, pp. 159–168.

[34] M. Taifi. (2012, 3) Hpcfy - virtual HPC cluster orchestration library. [Online]. Available: https://github.com/moutai/HPCFY

[35] J. Dean and S. Ghemawat, "MapReduce: Simplified data processing on large clusters," Communications of the ACM, vol. 51, no. 1, pp. 107–113, 2008.

[36] Y. Chen, A. Ganapathi, R. Griffith, and R. Katz, "The case for evaluating MapReduce performance using workload suites," in Modeling, Analysis & Simulation of Computer and Telecommunication Systems (MASCOTS), 2011 IEEE 19th International Symposium on. IEEE, 2011, pp. 390–399.

[37] L. George, HBase: The Definitive Guide. O'Reilly Media, 2011.

[38] A. Gates, O. Natkovich, S. Chopra, P. Kamath, S. Narayanamurthy, C. Olston, B. Reed, S. Srinivasan, and U. Srivastava, "Building a high-level dataflow system on top of Map-Reduce: the Pig experience," Proceedings of the VLDB Endowment, vol. 2, no. 2, pp. 1414–1425, 2009.

[39] G. Blelloch, L. Dagum, S. Smith, K. Thearling, and M. Zagha, "An evaluation of sorting as a supercomputer benchmark," International Journal of High Speed Computing, 1993.