gSched: A Resource Aware Hadoop Scheduler for Heterogeneous Cloud Computing Environments

Godwin Caruana1, Maozhen Li2,1, Man Qi3, Mukhtaj Khan4 and Omer Rana5

1Department of Electronic and Computer Engineering, Brunel University London, Uxbridge, UB8 3PH, UK2School of Computer Science and Communication Engineering, Jiangsu University, Zhenjiang, 212013, China

3Department of Computing, Canterbury Christ Church University, Canterbury, Kent, CT1 1QU, UK4Department of Computer Science, Abdul Wali Khan University Mardan, Pakistan

5School of Computer Science and Informatics, Cardiff University, Cardiff, CF24 3XF, UK

Abstract

MapReduce has become a major programming model for data intensive applications in cloud computing environments. Hadoop, an open source implementation of MapReduce, has been adopted by an increasingly wide user community. However, Hadoop suffers from task scheduling performance degradation in heterogeneous contexts due to its homogeneous design focus. This paper presents gSched, a resource aware Hadoop scheduler which takes into account both the heterogeneity of computing resources and provisioning charges in task allocation in cloud computing environments. gSched is initially evaluated in an experimental Hadoop cluster and demonstrates enhanced performance compared with the default Hadoop scheduler. Further evaluations are conducted on the Amazon EC2 Cloud which demonstrates the effectiveness of gSched in task allocation in heterogeneous cloud computing environments.

Keywords: MapReduce, Task Scheduling, Resource Awareness, Cloud Computing, Cost Effective Computing.

1. Introduction

Heterogeneous environments such as those provided via cloud infrastructures have become commonplace. Heterogeneous environments in the context of this work refer to computing nodes with different processing capabilities in terms of CPU, memory and IO. Financially, it is highly desirable that a cloud computing environment can scale up and down as well as significantly lower capital expenditure costs [1]. Cloud computing environments such as Amazon Web Services [2], Windows Azure [3] and the Google Cloud Platform [4] are heavily dependent on heterogeneous distributed computing models and resource virtualization technologies. Establishing a good balance between resource requirements and utilization in such a context is considered a challenge [5]. Despite continuous evolution [6][7], cloud services and cost models are still relatively immature [8].

MapReduce [9] has become a major programming model for data intensive applications in computing clouds. Popular implementations include Mars [10], Phoenix [11], Apache Hadoop [12] and Google's own implementation [9]. Inspired by functional programming, MapReduce is based on two primary operations, namely map and reduce. A map operation takes a (key/value) pair and emits an intermediate list of (key/value) pairs. A reduce operation takes all the values represented by the same key in the intermediate list and processes them accordingly. Multiple mappers and reducers can be employed in a MapReduce application.
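The map/reduce flow described above can be sketched in a few lines of Python. The word-count mapper and reducer below are the classic illustration; the function names and the in-memory shuffle are our own simplification, not part of Hadoop:

```python
from collections import defaultdict

def map_fn(key, line):
    # Map: emit an intermediate (word, 1) pair for every word in the line.
    for word in line.split():
        yield (word, 1)

def reduce_fn(word, counts):
    # Reduce: combine all values that share the same key.
    return (word, sum(counts))

def run_mapreduce(records, mapper, reducer):
    groups = defaultdict(list)
    for key, value in records:
        for k, v in mapper(key, value):   # map phase
            groups[k].append(v)           # shuffle: group values by key
    return dict(reducer(k, vs) for k, vs in groups.items())  # reduce phase

result = run_mapreduce([(0, "a b a"), (1, "b c")], map_fn, reduce_fn)
# result == {"a": 2, "b": 2, "c": 1}
```

In Hadoop the shuffle is distributed across nodes rather than held in one dictionary, but the key-grouping contract between map and reduce is the same.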

The effort to exploit the MapReduce programming model in cloud computing environments has become increasingly significant. Hadoop is a MapReduce implementation which has been adopted by an increasingly wide user community due to its open source nature. The default Hadoop scheduler employs a simple FCFS (first-come-first-served) strategy to allocate tasks. Hadoop allocates a task to a computing node using the following approach. Any failed tasks are given the highest priority. Subsequently, non-running tasks are considered. For map operations, tasks with data local to the computing node are chosen first. Finally, Hadoop looks for a task to execute speculatively. Speculative execution is primarily intended to execute long running tasks on more than one node. Hadoop also supports alternative fair and capacity scheduling schemes. However, Hadoop scheduling does not consider the heterogeneity of computing resources explicitly. Furthermore, because of the highly multiplexed, highly virtualized environments in context, even specific compute node performance has a tendency to vary over time. The differences and variations of the underlying resource capabilities
must be taken into account in order to optimize any scheduling strategy from the performance and resource effectiveness perspectives. This issue is further amplified in the context of Hadoop – it does not consider any differences or similarities in computing node features such as CPU and I/O capabilities. Achieving a good balance in this respect remains a challenge in a highly heterogeneous cloud computing environment [13][14][15].

In this paper, we present gSched, a resource aware Hadoop scheduler for improving MapReduce task allocation in heterogeneous cloud computing environments. The novelty of gSched lies in that it considers the heterogeneity of computing resources when scheduling tasks, with the aim of minimizing task execution times by matching tasks to appropriate computing nodes. As a result, the resources in a Hadoop cluster can be better utilized. gSched takes into account Hadoop's support for data locality [14], but this is considered in the context of resource heterogeneity for increased effectiveness. gSched exploits heterogeneity by utilizing the different characteristics of participating computing resources (i.e. CPU, RAM, I/O) and employs a machine learning approach to assess the Hadoop nodes to which tasks should be allocated.

The rest of the paper is organized as follows. Section 2 discusses related work on MapReduce scheduling in heterogeneous computing environments. Section 3 presents the design of gSched for resource awareness in task allocation. Section 4 initially evaluates the performance of gSched in an experimental Hadoop cluster and subsequently evaluates gSched in the Amazon EC2 Cloud. Section 5 concludes the paper and points out some future work.

2. Related Work

In a typical MapReduce scenario, there are no strong task precedence constraints beyond the requirement that reduce phases start after their associated map phases. Map tasks have no ordering among themselves and neither do reduce tasks. There is also no real intrinsic priority among the individual map tasks and reduce tasks within a particular job. The default Hadoop scheduling scheme supports simple scheduling approaches such as FCFS and fair scheduling. A number of studies have been conducted [16, 17, 18, 19, 49, 50, 51] to improve Hadoop performance from different aspects. However, scheduling Hadoop tasks in heterogeneous computing environments still remains a challenge.

It is imperative for job scheduling to establish a good balance between effectiveness in resource utilization and performance in job execution [20]. To achieve this, Sandholm and Lai [21] focused on dynamic priority (DP) based scheduling. Within specific processing windows, users are allocated slots for task processing on a proportional time-share basis, taking into account their priorities. This is somewhat similar to Hadoop's default capacity scheduler, which also supports scheduling of jobs based on priorities. However, the DP approach does not consider the heterogeneity of the underlying computing resources. Kc and Anyanwu [22] considered scheduling Hadoop tasks to meet deadlines based on the premise that the computation costs of task processing are uniform across the computing resources. However, computing clouds commonly consist of heterogeneous resources, and increasing evidence shows that heterogeneity problems must be tackled in MapReduce frameworks [23]. Hadoop is not specifically bound to any degree of homogeneity in terms of computing resources – i.e. it does not consider any differences or similarities in computing node capabilities such as CPU and I/O. Numerous approaches to tackling resource heterogeneity exist.
The consideration of workload types and queues [24] is a popular research avenue, alongside other techniques which focus on specific application areas [25][26]. In [27], Guo and Fox proposed an approach which predicts the benefit of running new speculative tasks. This technique steals unutilized slots and allocates them to specific running tasks. However, this can lead to contention, with subsequent tasks starved of processing slots: tasks allocated to the stolen slots prevent further tasks from being launched and processed on those slots, because the slots remain occupied by the tasks designated as unutilized and stolen. Furthermore, the process for terminating such tasks can lead to additional maintenance overhead and subsequent cost. To deal with the challenge of resource heterogeneity, Lee et al. [28] introduced progress share as a new fairness metric, defined in terms of computing rate. To improve overall performance, when a slot becomes available, a task with a high computing rate is selected for scheduling. This is done by exploiting the underlying heterogeneity without starving other jobs in execution. gSched deals with task throttling in such a way that it takes into account the characteristics of both the computing nodes and the tasks, instead of using the computing rate as a baseline influencer for task scheduling selection.

In contrast with [28], gSched does not need the end user to provide task characteristics a priori. Rather, it exploits the similarity of already executed tasks to establish the most appropriate node for execution of a specific task. The degree of compatibility is computed via distance vector metrics – the closer the similarity, the more comparable the capabilities of the nodes. This is done in combination with a machine learning scheme, which is employed to establish whether a task exhibits a CPU or IO bias. In turn, the classifier is used to influence the node selector for upcoming tasks. Zaharia et al. [29] proposed LATE (longest approximate time to end). The technique gives priority to the tasks which will impact response time most. Only the slowest tasks are re-launched speculatively, and these are scheduled immediately on the fastest node of the processing cluster. The proposed implementation depends on the assumption that the progress rate of tasks is relatively constant [29]; this assumption breaks down in highly multiplexed cloud based environments. In this paper, gSched establishes the estimated time to compute each task on each participating node, based on the performance characteristics of the nodes and the task characteristics. gSched employs a machine learning approach to allocate tasks to appropriate computing nodes based on their characteristics. The classifier output is used in conjunction with previously executed task characteristics from the same job to establish the most appropriate nodes for the tasks to be scheduled. gSched also ensures that nodes are not left idle. This enables gSched to establish a good task to node matching, better utilize Hadoop cluster resources and minimize job execution times.

3. The Design of gSched

The overall gSched architecture is represented in Figure 1. At Hadoop cluster startup, gSched profiles the performance characteristics of the participating nodes. For each node on which given tasks have not been executed, an estimated time for those tasks is computed based on the relative node characteristics. The inferred times are kept in an Estimated Completion Time (ECT) table. gSched attempts to assign tasks to the most appropriate node using the profiled node characteristics and inferred times. Tasks with a CPU bias are scheduled, where possible, on nodes with better CPU speeds. For I/O bound tasks, nodes with better I/O performance are preferred. The identification of task bias is based on a machine learning scheme.

[Figure 1 depicts the gSched architecture: a Hadoop task flows through the task to node mapping component, which draws on the machine learning scheme, the available nodes, the node performance profiles, the finished task profiles and the ECT table to produce a matched node.]

Figure 1: gSched architecture.

Implemented within a Hadoop construct, gSched is intended to exploit heterogeneous capabilities in a resource effective manner in task scheduling. Various elements influence MapReduce processing costs in a typical cloud context. From a high level perspective, the cost of the resources acquired, denoted R (which include storage, memory and processor), can typically be represented as shown in Eq. 1.

c = \sum_{t=1}^{T} \sum_{n=1}^{N} R_t(n) \times u    (1)


where: c is the operational expenditure. T is the time span in units for resource use. N is the total number of nodes. R is the resources acquired. n is a specific node. u is the unit cost.
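Eq. 1 reads as a simple double summation over time units and nodes. The helper below is an illustrative sketch; the function name and the matrix layout for R are our own, not part of gSched:

```python
def operational_cost(R, u):
    """Eq. 1: c = sum over time units t and nodes n of R_t(n) * u.

    R is a T x N matrix where R[t][n] is the quantity of resources
    acquired on node n during time unit t; u is the unit cost.
    Illustrative helper only.
    """
    return sum(r_tn * u for row in R for r_tn in row)

# 2 time units, 3 nodes, one resource unit each, at 0.10 per unit:
cost = operational_cost([[1, 1, 1], [1, 1, 1]], 0.10)
# cost == 0.6 (within floating point rounding)
```

gSched's objective, stated next, is to keep T small while making full use of the acquired R.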

gSched attempts to minimize the time span T for resource use whilst maximizing the heterogeneous resources R acquired. The actual task to node matching and allocation approach is based on a number of processes. A machine learning based technique is implemented to establish, a priori, an indicative processing time for executing a task T on a node with processing characteristics P.

3.1 Building ECT in gSched

Whilst data skew can introduce differences in task processing characteristics [30], the proposed approach is based on the assumption that, under most circumstances, the individual tasks of a specific Hadoop MapReduce job are reasonably similar in terms of their CPU, memory and I/O requirements. In gSched, each Hadoop job can have a number of map tasks and reduce tasks. A map task is typically CPU bound and the shuffle phase in a reduce task can be IO bound. Map tasks and reduce tasks are normally individual tasks without data dependencies. For this work, the default hash partitioning approach is considered adequate for even data distribution [30]. More specifically, a job is considered to be constituted of a number of logically 'equivalent', arbitrarily divisible tasks, e.g. the map tasks. Thus, if a job J is constituted of 4 map tasks, {t1….t4}, then t1 ≈ t2 ≈ t3 ≈ t4.

The objective of the ECT estimation model in gSched is to infer the estimated compute time of a task on a participating node. This is based on node characteristic similarity (S) and the actual processing time of a typical task from the same job on a specific node. The general characteristics of a node are a function of the capabilities of a number of core components, described by the signature further on. For the actual execution of a task, a number of additional factors influence general performance, including virtual memory for example. The focus on CPU and IO in gSched is primarily due to their impact in such a context. It is also assumed that the underlying network and its influence are relatively stable.

During Hadoop cluster start-up, gSched profiles the participating node capabilities and characteristics. In a cloud platform, the performance characteristics of resources vary over time due to the aggressive multiplexing frequently employed. Re-profiling is performed to maintain a running insight into any such changes and thus allows gSched to better exploit resource availability. The overall computation time to establish the best node to task match has to be kept low. In order to establish the estimated compute time of a node, the key characteristics are represented as a vector. The general vector space model is a simple yet effective technique to store representation information. Each node has a performance characteristics vector associated with it, created during the respective profiling process. Cosine similarity [31] is an effective scheme for establishing how similar two vectors are, as represented in Eq. 2. Whilst the vectors are non-sparse, their dimensionality is small and the overall complexity of the similarity computation is O(n).

In this approach, the node similarity S is based on the core characteristics of the underlying node(s) participating in the MapReduce cluster:

Sig = [cpuFreq, numProc, cpuSpeed, ioSpeed, diskSpace, physicalMem, virtualMem, cpuUsage]

Where:

cpuFreq is the processor(s) frequency in kHz.
numProc is the number of processor cores.
cpuSpeed is based on the time to compute the first 35 numbers in a Fibonacci sequence.
ioSpeed refers to the time (in seconds) to read and write a 100MB ASCII text based file.
diskSpace is the disk space available on the local node.
physicalMem is the total amount of RAM.
virtualMem is the available virtual memory.
cpuUsage is the average CPU utilization rate.
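The profiling step can be sketched as follows. This is a minimal illustration, not gSched's actual code: the benchmark bodies (an iterative Fibonacci timer and a small temporary file rather than the 100MB file the paper uses) and all function names are our assumptions:

```python
import os
import tempfile
import time

def fib(n):
    # Iterative Fibonacci; the paper's cpuSpeed benchmark times the
    # first 35 numbers of the sequence.
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

def profile_node(fib_n=35, file_mb=1):
    """Build a partial node signature (cpuSpeed/ioSpeed proxies only).

    Sketch under stated assumptions: field names follow the Sig vector
    in the text, benchmark bodies are illustrative stand-ins.
    """
    t0 = time.perf_counter()
    fib(fib_n)
    cpu_speed = time.perf_counter() - t0          # cpuSpeed proxy (s)

    data = b"x" * (file_mb * 1024 * 1024)
    fd, path = tempfile.mkstemp()
    os.close(fd)
    t0 = time.perf_counter()
    with open(path, "wb") as f:                    # write benchmark
        f.write(data)
        f.flush()
        os.fsync(f.fileno())
    with open(path, "rb") as f:                    # read benchmark
        f.read()
    io_speed = time.perf_counter() - t0           # ioSpeed proxy (s)
    os.remove(path)

    return {"numProc": os.cpu_count(), "cpuSpeed": cpu_speed,
            "ioSpeed": io_speed}

sig = profile_node()
```

Re-running `profile_node` during idle periods corresponds to the re-profiling step, keeping the signature current as underlying virtualized performance drifts.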

S = \frac{\sum_{i=1}^{n} Sig_{ai} \times Sig_{bi}}{\sqrt{\sum_{i=1}^{n} (Sig_{ai})^2} \times \sqrt{\sum_{i=1}^{n} (Sig_{bi})^2}}    (2)

Where:

Sig_{ai} and Sig_{bi} reflect the characteristics profile vectors of the two nodes being compared. n is the number of elements in a signature vector. S is the cosine similarity value between the two nodes being compared.
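Eq. 2 is plain cosine similarity and can be computed directly. The sketch below assumes the signature has already been collected into a numeric vector; the function name is ours:

```python
import math

def cosine_similarity(sig_a, sig_b):
    """Eq. 2: cosine similarity between two node signature vectors."""
    dot = sum(a * b for a, b in zip(sig_a, sig_b))
    norm_a = math.sqrt(sum(a * a for a in sig_a))
    norm_b = math.sqrt(sum(b * b for b in sig_b))
    return dot / (norm_a * norm_b)

# Identical signatures give S = 1.0; dissimilar nodes drift below 1.
s = cosine_similarity([2.0, 4.0, 1.5, 0.8], [2.0, 4.0, 1.5, 0.8])
```

With eight signature attributes the vectors are tiny, consistent with the O(n) similarity cost noted above.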

To determine the estimated time of a task on a node other than the one it was actually executed on, gSched adopts the following approach. For simplicity, we denote the performance benchmark of the node on which a task from a particular job has actually been executed as Pa, and that of the target node for which the time is to be estimated as Pb. In gSched, we consider Hadoop tasks to be either CPU bound (e.g. map tasks) or IO bound (e.g. reduce tasks). The node performance benchmarks Pa and Pb are further focused on the cpuSpeed and ioSpeed metrics and can be computed using Eq. 3:

P_a / P_b = (cpuSpeed_a \times ioSpeed_a) / (cpuSpeed_b \times ioSpeed_b)    (3)

The original task time (oTT) is the actual time of a task running on a node. The estimated time that the same task would take on another node can be computed using Eq. 4.

ECT = \frac{P_b}{P_a} \times \frac{oTT}{S}    (4)

When S is 1, the performance between the nodes will still vary slightly because Pa and Pb are taken into consideration outside S as well. This allows gSched to have a degree of sensitivity to changing performance characteristics in multiplexed environments where key node performance characteristics (cpuSpeed and ioSpeed) change over time.
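A minimal sketch of the ECT inference of Eqs. 3 and 4, including the comparison against the pre-defined similarity threshold described in the text. Function and parameter names are ours, and the exact grouping of terms in Eq. 4 follows our reading of the original layout:

```python
def estimate_ect(cpu_a, io_a, cpu_b, io_b, oTT, S, threshold=0.99):
    """Estimate the completion time of a task on node b, given its
    observed time oTT on node a and the similarity S of the two node
    signatures. Returns None when S is below the similarity threshold,
    in which case the ECT entry is flagged rather than inferred.
    Illustrative sketch only.
    """
    if S < threshold:
        return None
    pa_over_pb = (cpu_a * io_a) / (cpu_b * io_b)   # Eq. 3
    return (1.0 / pa_over_pb) * oTT / S            # Eq. 4: (Pb/Pa) * oTT / S

# Node b benchmarks identically to node a and S = 1:
# the estimate equals the observed time.
ect = estimate_ect(1.0, 1.0, 1.0, 1.0, oTT=42.0, S=1.0)
```

Because Pa and Pb enter Eq. 4 directly (not only through S), two nodes with S = 1 but drifting benchmark values still produce distinct estimates, as the paragraph above notes.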

If the similarity (S) is less than a pre-defined threshold, i.e., the nodes are not similar enough, the ECT is not deduced and the ECT table entry for the target node is updated to reflect this. If gSched does not find an inferred ECT time for an already executed task on participating nodes, it will forcefully schedule tasks on the node after a pre-defined number of attempts.

3.2 Task to Node Matching and Allocation

As already indicated, for this work it is assumed that the underlying network and its influence are relatively stable. For future work, it is intended that this consideration be explicitly included as part of the matching and allocation scheme. The estimated completion time matrix is updated with the actual processing times and task characteristics on the nodes on which tasks are executed. The estimated time that the same task would take on the nodes on which it has not actually been executed is then inferred, based on the variation in node characteristics, using a distance vector scheme.

For the remaining tasks, rather than allocating them to nodes with empty slots immediately, gSched initially throttles back these tasks for a maximum number of attempts, which is in part similar to [29]. gSched establishes whether the task types pertaining to the specific job exhibit a CPU bias or otherwise using the naive Bayesian classification scheme. Machine learning has been applied in various contexts, including scheduling strategies [32][33]. Based on whether the tasks exhibit a CPU or IO bias, gSched selects an available node which is more congruent with the characteristics of the tasks to be scheduled. If a node is not allocated a task for a number of attempts, the task is subsequently scheduled forcefully (parameter Ntaskrefusal in Table 2). If, after a number of allocation cycles, a node is not selected, it will automatically be assigned the next task. This is applied in order to ensure that there are no idle resources for uncontrolled periods simply because the matching scheme did not identify a good match between a node and a task. Furthermore, beyond the well documented MapReduce scalability characteristics [34][35], gSched adopts a number of additional strategies to achieve its objectives, including:

Classifier training. The machine learning classifier loads new training instances and is re-trained only when there are no active jobs. This is to avoid using computing resources for non-job related processing. A maximum retraining delay period ensures currency.

Dynamic scheduler reconfiguration without cluster restart, which is not possible with the out of the box scheduler. This provides an opportunity to dynamically fine tune the scheduler according to changing workloads, environment, etc.

Node re-profiling performed during idle periods, with a determined maximum latency period.

Reduced speculative execution via appropriate node use, reducing unnecessary processing cost.

Assigning tasks to nodes which exhibit properties reflective of task characteristics maximizes the utilization of Hadoop cluster resources.

3.3 gSched Implementation

At Hadoop cluster startup and during any re-profiling, gSched establishes the key characteristics of the participating nodes. The key characteristics of participating nodes such as the disk space, CPU processing capabilities (MIPS), number of processors and memory are identified and recorded. The associated cluster profile functionality allows gSched to gain insight and track changing runtime capabilities which occur in highly multiplexed, virtual resources based environments. The ability to re-profile participating nodes, as well as changing its configuration at runtime allows gSched to dynamically adapt to varying underlying cluster capabilities over time.

gSched’s core design and operational characteristics are described with the following definitions:

ArbitraryTasksn describes a number of tasks from a Job J.

Sigai/Sigbi the signature of node a/b respectively.

N a processing node.

Pa / Pb the performance of node a/b respectively.

oTT the original task time.

fslots the number of node free slots.

nshape describes the node characteristics.

NodeWithFreeSlots[] list of nodes with fslots > 0.

Jt an M/R task within the job J.

ECT_Jttime time to execute the task Jt.

Sig(n) Signature of Node (n).

Jtshape the task characteristics.

ECT the Estimated Completion Time matrix.

TClass Task Class based on the Machine Learning Scheme (CPU or IO bound).

Algorithm 1 presents the core elements of the task allocation scheme. When the number of tasks processed from a job J is less than a configured parameter, i.e. there is not yet visibility of the task type in terms of characteristics, an average time is established and the node with average performance is selected. If there is no information in the ECT, the node is selected randomly. On task completion, gSched updates the ECT table accordingly.


Input: A Hadoop map task or reduce task to be scheduled;
Output: A matched computing node;
1: select ArbitraryTasksn
2: select N with average relative time from ECT
     if no ECT info is available, select randomly
     allocate the 1st set of ArbitraryTasksn depending on NodeWithFreeSlots[] on n where fslots > 0
3: generate/update the ECT matrix
4: ∀ task: vector = Jtshape
5: classify task TClass using the learning scheme
6: schedule the remaining tasks via:
7: assign task Jt to node
8: where node fslots > 0 and
     if TClass = {0} (CPU bound) select n = min+1(Sig1 - Sign) for ECT_Jtshape
     else select n = minimize(Sig1 - Sign) for ECT_Jtshape
     mark Sign in use
   end if

Algorithm 1: Task allocation.
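The selection step of Algorithm 1 might be sketched as below. This is our illustrative reading of the min+1/minimize rule; the data structures and names are assumptions, not gSched's implementation:

```python
def pick_node(task_class, nodes, ect, roffset=1):
    """Sketch of the node-selection step of Algorithm 1.

    `nodes` maps node id -> free slot count (fslots), `ect` maps
    node id -> estimated completion time for this task type.
    Illustrative only.
    """
    candidates = [n for n, slots in nodes.items() if slots > 0 and n in ect]
    if not candidates:
        return None                      # caller falls back to forced scheduling
    ranked = sorted(candidates, key=lambda n: ect[n])
    if task_class == "CPU":
        # CPU-bound tasks take a rank-offset node (min+1 in Algorithm 1,
        # Roffset in Table 2) when one is available.
        return ranked[min(roffset, len(ranked) - 1)]
    return ranked[0]                     # otherwise: minimize estimated time

nodes = {"n1": 2, "n2": 1, "n3": 0}
ect = {"n1": 12.0, "n2": 9.5, "n3": 7.0}
# n3 has no free slots; the IO-bound choice is the fastest free node, n2.
choice = pick_node("IO", nodes, ect)
```

In gSched proper this choice is additionally gated by the throttling and Ntaskrefusal logic described in Section 3.2, which forces an assignment after a bounded number of refusals so no node idles indefinitely.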

Where visibility of the task type is already available for a particular job, the respective task is classified as either CPU or I/O bound via the machine learning scheme. The relevant and inferred node processing times for the tasks already processed within the same job are then acquired. If the task is CPU bound, such as a map task, gSched allocates the task to the available node with the minimum processing time, taking into consideration the available task slots. The same approach is applied to I/O bound tasks, such as the shuffle tasks in Hadoop; this time, however, the scheme establishes which available node has better I/O performance characteristics for task assignment. This approach allows task assignment throttling to establish a good combination of task and node based on the current cluster profile. Based on the actual time of a finished task, the task characteristics and the node characteristics, gSched estimates the execution time that the task would take on the other nodes. The machine learning scheme is Naïve Bayes. gSched takes the training instances added by the process which records finished tasks and re-trains the model. The core training instance data is constituted of the following attributes:

task time (ms)

physical memory (mb)

virtual memory (mb)

map input records (mb)

total m/r time (s)

class {CPU, IO}

The class {CPU, IO} is established from the relationship between task time and map input records – essentially the ratio between task time and CPU time, compared against a threshold set in the gSched configuration. We employ Naïve Bayes for training and classification of task bias. Parameter estimation in this construct is based on maximum likelihood. The current implementation is based on the default Weka [36] Naïve Bayes Updateable scheme. The model is only re-trained when the scheduler is idle.
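The labelling rule can be sketched as follows. Since the exact comparison gSched applies is not spelled out in the text, the direction of the threshold test is an assumption, as are the function and parameter names (the default threshold mirrors the `ratio` parameter in Table 2):

```python
def label_task(task_time_ms, cpu_time_ms, ratio_threshold=0.10):
    """Derive the {CPU, IO} training label from the ratio of CPU time
    to total task time, compared against a configured threshold.

    Sketch under stated assumptions: gSched's actual rule may differ
    in the direction or form of the comparison.
    """
    return "CPU" if cpu_time_ms / task_time_ms >= ratio_threshold else "IO"

label = label_task(task_time_ms=1000, cpu_time_ms=600)   # CPU-heavy task
```

Each finished task would contribute one labelled instance of the attributes listed above, which the updateable Naïve Bayes model then absorbs during idle retraining.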

3.4 gSched Computation Complexity

The core functions within gSched and their respective computational complexities are as follows:

updateMlScheme O(1) – updates the training instances within the machine learning model.
classifyTaks O(n) – classifies the specific tasks to establish CPU or IO bias.
checkFinishedTaskOnNode O(n) – determines finished tasks on a node and extracts training instance information for the ML scheme as well as ECT metrics.
assignTasks O(n²) – core task allocation function.
isMostAppropriateNode O(n²) – establishes, based on the heuristic, whether the current node is the most relevant.
genECT O(1) – generates the Estimated Completion Time table.

MapReduce normally generates a linear increase in computation complexity with an increasing number of computing nodes. On the other hand, Hadoop cluster tuning is a key element of effectiveness for any scheduling scheme. In this work, all critical function upper bounds can be capped to O(n) via configuration. With such a general upper bound complexity, spreading the workload over n nodes would generate an arbitrary merge complexity of less than O(n).

4. Experimental Results

For this work, a specifically configured Amazon Machine Image (AMI) was created and packaged. Multiple instances of the image were launched to shape the Hadoop MapReduce cluster, fully configured with both the basic and the proposed (gSched) task to node matching and allocation schemes. The baseline configuration of the virtual appliance is described in Table 1. gSched has a number of default parameters, which are summarized in Table 2.

Table 1: AMI Virtual Appliance software baseline.

Software   Environment
SVM        Weka 3.6.0 (SMO)
O/S        Ubuntu 12.10
Hadoop     Hadoop 0.20.205
Java       JDK 1.7

Table 2: gSched parameters.

Parameter     Description                                                                     Value
ratio         The ratio between CPU and IO time.                                              0.10
similarity    Distance threshold – the difference between source and target vectors used
              to establish task and node similarity in terms of capabilities.                 0.99
Ntasks        Number of tasks processed before firing gSched.                                 2
Tconfig       Time span for scheduler configuration reload.                                   60
Tupdate       Time span before the machine learning scheme is re-loaded with new instances.   1000
Ntaskrefusal  Maximum number of times a node can refuse to allocate a task before being
              forced to.                                                                      1
ECTSize       The maximum size of the ECT table containing the estimated completion time
              for each node with respect to tasks.                                            100
Roffset       Rank offset for ECT node selectors.                                             1

4.1 Base-lining gSched

gSched and the standard Hadoop FIFO scheduler were initially compared locally using the TestDFSIO and MRBench benchmarks [12] on an experimental Hadoop cluster comprising the nodes described in Table 3.

Table 3: TestDFSIO and MRBench – Experimental Hadoop cluster.

Type       Number/Specifications                     Role
Physical   1 – 1024 MB RAM, 2 core, 500 GB HDD       Master
Virtual    1 – 768 MB RAM, 2 core, 256 GB HDD        Slave
Virtual    1 – 512 MB RAM, 1 core, 256 GB HDD        Slave
Virtual    1 – 384 MB RAM, 1 core, 120 GB HDD        Slave

A number of benchmarking tools are available with the standard Hadoop distribution, including TeraSort, NNBench, MRBench and TestDFSIO. For the evaluations carried out here, the latter two were employed. TestDFSIO benchmarks read and write performance of the distributed file system, HDFS. MRBench, on the other hand, is a job-oriented benchmark; its dependency on HDFS is not considered significant in this evaluation construct. This combination provides a representative performance evaluation of the two key characteristics focused on in this work, namely the CPU and IO perspectives. Figure 2 and Figure 3 show that gSched outperforms the default Hadoop scheduler in both benchmarks. For the TestDFSIO test, files of about 10 MB were employed. For the MRBench test, the number of DataLines was varied from 1 to 12 and the respective number of Maps was varied from 2 to 16. gSched shows a marked improvement in both tests. In order to evaluate the performance of gSched in a representative cloud-based environment, a number of additional experiments were carried out using Amazon's EC2 service [2] [37].

Figure 2: The performance of gSched on TestDFSIO benchmark tests on an experimental Hadoop cluster.

Figure 3: The performance of gSched on MRBench benchmark tests on an experimental Hadoop cluster.

Amazon provides various instance types for the provisioning of computing resources with different stated performance, specification attributes and characteristics [38]. First generation (M1 series) Amazon EC2 compute and S3 storage were employed to carry out these evaluations. An initial base-lining experiment with 4 't1.micro' EC2 nodes using the TestDFSIO and MRBench tests was performed. This time, files of about 50 MB were employed for the TestDFSIO test. The number of DataLines for MRBench was varied from 1 to 12 and the respective Maps from 2 to 16. The expected outcome for this experiment was that, given similar node characteristics (4 't1.micro' EC2 nodes), the performance of gSched and the standard scheduler should not vary considerably. Figure 4 presents the result of the experiment, showing the difference between the standard Hadoop and gSched scheduler performance for the TestDFSIO experiment.


Figure 4: gSched performance on TestDFSIO benchmark tests on Amazon EC2.

The discrepancy was traced to a large number of 'TaskRunner Child errors' during the standard scheduler execution on the tested configuration. This is due to the standard configuration failing to perform adequately with the resources allocated to the nodes. The respective (on-node) TaskTracker becomes blacklisted, in turn effectively downscaling the cluster processing capability. The MRBench performance shown in Figure 5, on the other hand, is visibly very similar. This is due to the absence of any opportunity for gSched to exploit heterogeneous performance differences: the 4 nodes in this experiment are identical t1.micro EC2 instances.

Figure 5: Performance of gSched on MRBench benchmark tests on Amazon EC2.

4.2 Distributed SMO using gSched

In the context of machine learning, support vector machine (SVM) based techniques have been proven to be effective in a variety of applications. SVM training is however a computationally intensive process. In the work presented in [39] we discussed a MapReduce based distributed SVM algorithm for scalable spam filter training. By distributing, processing and optimizing the subsets of the training data across multiple participating computer nodes, the distributed SVM reduces the training time significantly. Ontology semantics are subsequently employed to minimize the impact of accuracy degradation when distributing the training data among a number of SVM classifiers [40].

However, the overall training process is still computationally expensive. The possibility of further improving its performance makes this setting a compelling one for evaluating gSched. Using the Adult data set [41] in conjunction with the distributed SVM MRSMO [42], we performed an experiment to establish gSched's performance in comparison with the standard Hadoop scheduler in a cloud-based scenario. The Adult data set, containing 48,000 instances and 14 attributes, is commonly used for similar experiments [43]. The simple Hadoop cluster setup described in Table 4 was employed. Figure 6, Figure 7, Figure 8 and Figure 9 portray gSched's speedup (in seconds) and efficiency improvement (in percentage) over the standard scheduler for the Polynomial and Gaussian SVM kernel training tests originally performed in [42].
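The two reported metrics can be sketched as follows, assuming the conventional definitions (speedup as wall-clock seconds saved relative to the standard scheduler, efficiency improvement as a percentage of the standard scheduler's runtime); the function names are illustrative.

```python
def speedup_seconds(t_standard, t_gsched):
    """Speedup in seconds: wall-clock time saved by gSched relative to
    the standard Hadoop scheduler for the same job."""
    return t_standard - t_gsched

def efficiency_improvement(t_standard, t_gsched):
    """Efficiency improvement, as a percentage of the standard
    scheduler's runtime."""
    return 100.0 * (t_standard - t_gsched) / t_standard
```

For instance, a job taking 100 s under the standard scheduler and 82 s under gSched corresponds to an 18 s speedup and an 18% efficiency improvement.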

Table 4: MRSMO with gSched – Amazon EC2 cluster.

Type        Number   Hadoop Role
m1.medium   2        1 Master / 1 Slave
m1.small    2        Slave
t1.micro    2        Slave

For this experiment, the number of nodes was varied from 1 to 6 and the number of MapReduce tasks was varied from 4 to 16. The sequential SMO times for both the Polynomial and Gaussian SVM training were used as a reference baseline. Table 5 shows the combination of nodes that were employed.

Table 5: Node types.

Nodes   Instance types
1       1x m1.m
2       1x m1.m, 1x m1.s
3       2x m1.m, 1x m1.s
4       2x m1.m, 1x m1.s, 1x t1.m
5       2x m1.m, 2x m1.s, 1x t1.m
6       2x m1.m, 2x m1.s, 2x t1.m

where: m1.m = m1.medium, m1.s = m1.small, t1.m = t1.micro

Figure 6: The speedup of gSched on the Polynomial Kernel.

It can be observed from Figure 8 and Figure 9 that gSched introduces an average overhead of 0.5% in the Polynomial tests and 0.3% in the Gaussian tests when one node is used in the cluster. This initial negative performance is due to gSched's warm-up period, which is required to establish the underlying node characteristics and to initially populate the internal ECT table with a minimum set of baseline estimated compute times. The incremental machine learning classifier also needs to be trained, which requires training instances derived from successfully executed and finished tasks. During this time the matching and allocation scheme is ineffective, until it has gained the degree of insight required to start capitalizing on task-to-node matching.
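The ECT table populated during warm-up is bounded by the ECTSize parameter (100 by default, per Table 2). Below is a minimal sketch of such a capped table; the least-recently-updated eviction policy and the class layout are assumptions, as the actual policy is not described here.

```python
from collections import OrderedDict

class ECTTable:
    """Sketch of an Estimated Completion Time (ECT) table: per
    (node, task type) completion-time estimates, capped at ECTSize
    entries. Eviction of the least-recently-updated entry when full
    is an assumption."""

    def __init__(self, max_size=100):  # ECTSize default from Table 2
        self.max_size = max_size
        self.entries = OrderedDict()

    def update(self, node, task_type, seconds):
        """Record the latest observed/estimated completion time."""
        key = (node, task_type)
        if key in self.entries:
            self.entries.move_to_end(key)
        self.entries[key] = seconds
        if len(self.entries) > self.max_size:
            self.entries.popitem(last=False)  # evict the stalest entry

    def estimate(self, node, task_type):
        """Return the stored estimate, or None while still warming up."""
        return self.entries.get((node, task_type))
```

Until an estimate exists for a (node, task type) pair, a scheduler built this way has nothing to rank nodes by, which mirrors the warm-up ineffectiveness described above.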

However, it can be observed from Figure 6 and Figure 7 that the maximum speedup of gSched over the default scheduler on the Polynomial Kernel is 18 seconds, in the 4-node test scenario. The corresponding maximum speedup for the Gaussian Kernel is 15 seconds, in the 2-node experiment.


Figure 7: The speedup of gSched on the Gaussian Kernel.

Figure 8 shows a maximum efficiency improvement of 2.3% for gSched over the default scheduler on the Polynomial Kernel when 4 nodes and 16 tasks are used. Figure 9 shows that gSched achieves a maximum efficiency improvement of 3.8% on the Gaussian Kernel when 2 nodes and 16 tasks are used.

Figure 8: The efficiency improvement of gSched on the Polynomial Kernel.

Figure 9: The efficiency improvement of gSched on the Gaussian Kernel.


4.3 gSched Scalability and Resource Effectiveness

All of gSched's critical function upper bounds are, or can be, capped at O(n) via configuration.

When an M/R workload is spread over M nodes (constant) with an arbitrary merge complexity of at most O(n), the overall processing impact remains O(n).

A further set of experiments with 10 mixed EC2 nodes was performed to evaluate the scalability of gSched when dealing with concurrent jobs and resource effectiveness using the standard WordCount test [12]. The EC2 instance types employed for this experiment are shown in Table 6.

Table 6: Amazon EC2 instance types - initial scalability experiment.

Type        Number   Hadoop Role
m1.medium   1        Master
c1.medium   1        Slave
t1.micro    5        Slave
m1.small    3        Slave

The dataset employed was about 5 MB in size. A number of concurrent runs were executed using the same input set. Figure 10 shows the scalability of gSched compared with the default Hadoop scheduler using the baseline Hadoop configuration in this scenario. gSched performs similarly to the standard scheduler when dealing with up to 20 concurrent jobs; beyond that, gSched starts to outperform the default Hadoop scheduler, as reflected in the respective slope (Δy/Δx).

Figure 10: The scalability of gSched using 10 Amazon EC2 instances.

Reducing the number of speculative copies or failed tasks is important for controlling the processing and cost counters associated with the EC2 service. Another set of experiments was performed using 16 Amazon EC2 instances, re-employing the WordCount test. Table 7 summarizes the cluster setup and Table 8 the gSched configuration.

Table 7: Amazon EC2 instance types.

Type        Number   Hadoop Role
m1.medium   2        1 Master / 1 Slave
c1.medium   2        Slave
m1.small    6        Slave
t1.micro    6        Slave


Table 8: gSched configuration - second experiment set.

Parameter      Value
ratio          0.15
Ntasks         2
Tconfig        60
Tupdate        1000
Ntaskrefusal   2
ECTSize        50
Roffset        1

Figure 11 portrays the scalability of gSched when using 16 EC2 instances, and Figure 12 shows that the standard Hadoop scheduler incurs more speculative copies or task failures than gSched. Failures reflect scenarios where tasks are allocated to nodes which may not have the resource characteristics required to process them. As a result, gSched is more resource effective than the default Hadoop scheduler in job scheduling.

Figure 11: The scalability of gSched using 16 Amazon EC2 instances.

Figure 12: Task failures in gSched.


An approximate area-under-trapezoid approach can be employed to establish the differences under the curves of Figure 11 and Figure 12 respectively. The results presented in Table 9 show the improvements in terms of resource effectiveness and overall scalability.
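The trapezoid comparison can be sketched as follows: the composite trapezoid rule approximates the area under each curve, and the relative difference between the two areas yields improvement figures of the kind reported in Table 9. The sample inputs and function names are illustrative, not the paper's actual data.

```python
def trapezoid_area(xs, ys):
    """Approximate area under a curve sampled at points (xs, ys),
    using the composite trapezoid rule."""
    return sum((xs[i + 1] - xs[i]) * (ys[i] + ys[i + 1]) / 2.0
               for i in range(len(xs) - 1))

def relative_improvement(xs, ys_standard, ys_gsched):
    """Fractional improvement of gSched over the standard scheduler,
    measured as the relative reduction in area under the curve
    (e.g. job completion time or task failures vs. concurrency)."""
    a_std = trapezoid_area(xs, ys_standard)
    a_gs = trapezoid_area(xs, ys_gsched)
    return (a_std - a_gs) / a_std
```

For example, if gSched's curve lies uniformly at half the standard scheduler's values, the relative improvement evaluates to 0.5, i.e. 50%.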

The approach adopted by the default Hadoop scheduler is based on the progress rate of a task. In Hadoop, each map task has an associated progress score, which is 0 when the task starts and 1 when it is marked complete. The progress rate is simply the ratio of the progress score to the task's running time. A reduce task is assigned slightly different scores, but the principle is exactly the same. In a highly heterogeneous cloud environment this approach is not efficient in the long run, because the performance characteristics of participating nodes change over time. This impinges on Hadoop's performance significantly and, in a budget-sensitive construct, inflates the associated operating costs [43].
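The progress-rate heuristic described above reduces to a single ratio; a minimal sketch:

```python
def progress_rate(progress_score, running_time):
    """Hadoop-style progress rate: progress score (0.0 at task start,
    1.0 when marked complete) divided by the task's running time
    in seconds."""
    if running_time <= 0:
        raise ValueError("running time must be positive")
    return progress_score / running_time

# A task half done after 20 s progresses at 0.025/s, while a task half
# done after 100 s progresses at 0.005/s; the slower one is the likelier
# candidate for speculative re-execution.
```

On a heterogeneous cluster whose node performance drifts over time, a rate computed this way reflects past conditions rather than current ones, which is the inefficiency the text points out.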

Table 9: gSched improvements over the standard scheduler.

Experiment                                       ≈ Improvement
Standard vs. gSched (Scalability)                17%
Standard vs. gSched (Resource Effectiveness)     98%

In contrast with [29], gSched significantly limits the number of speculative and failed tasks by attempting to schedule tasks, including speculative ones, on nodes whose characteristics better reflect those of the tasks. The typical cost difference between the standard Hadoop scheduler and gSched in terms of maximizing the available resources can be established via the charge rate difference (crd). Using Eq. 1, the crd can be defined as a function of the resources employed and the time during which these resources are employed:

crd = c(standard) − c(gSched)        (5)

T_gSched = T_standard × (1 − P)

R_gSched = R_standard × (1 − E)

crd = Σ_{t=1..T_standard} Σ_{n=1..N_standard} R_t(standard)(n) × u  −  Σ_{t=1..T_gSched} Σ_{n=1..N_gSched} R_t(gSched)(n) × u

where crd is the charge rate difference between the standard Hadoop scheduler and gSched, E is the resource utilization improvement of gSched over the standard Hadoop scheduler (in percentage), and P is the performance improvement of gSched over the standard scheduler (in percentage).
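The crd relations can be sketched numerically as follows. The function names, the uniform per-node resource use, and the interpretation of u as the charge per resource unit per time step are simplifying assumptions for illustration; P and E are taken here as fractions rather than percentages.

```python
def charge(t_steps, n_nodes, r_per_node, u):
    """Total charge: double summation over time steps and nodes of the
    resource units consumed, each billed at rate u (the summations of
    Eq. 5, with uniform per-node resource use as a simplification)."""
    return sum(r_per_node * u
               for _t in range(t_steps)
               for _n in range(n_nodes))

def charge_rate_difference(t_std, n_std, r_std, u, p, e, n_gsched=None):
    """crd = c(standard) - c(gSched): gSched's runtime and resource use
    are derived from the standard scheduler's via the performance
    improvement p and the utilization improvement e (fractional)."""
    t_gs = int(round(t_std * (1.0 - p)))   # T_gSched = T_standard * (1 - P)
    r_gs = r_std * (1.0 - e)               # R_gSched = R_standard * (1 - E)
    n_gs = n_std if n_gsched is None else n_gsched
    return charge(t_std, n_std, r_std, u) - charge(t_gs, n_gs, r_gs, u)
```

With, say, a 10% performance improvement and no utilization change over 10 billing intervals on 4 nodes, the sketch yields a positive crd, i.e. a saving in favour of gSched.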

From a budgeting perspective, using a typical EC2 billing hour as an optimization constraint for the task allocation scheme, gSched costs less than the standard Hadoop scheduler to complete the jobs, as shown in Table 10. Taking into consideration job concurrency as well as the number of task failures the standard Hadoop scheduler exhibits, the performance difference becomes increasingly significant, even without explicitly considering the cost of storage and network I/O [44] [45].

Table 10: Indicative EC2 instance charges.

Type        Number   $/hour   Standard ($)   gSched ($)
m1.medium   2        0.186    0.372          0.267
c1.medium   2        0.170    0.340          0.244
m1.small    6        0.085    0.510          0.367
t1.micro    6        0.020    0.120          0.086
Total                         1.342          0.964


5. Conclusion

In this paper we have presented gSched for job scheduling in heterogeneous MapReduce Hadoop

environments, taking into account both task execution times and running costs. gSched was evaluated in both an experimental Hadoop cluster and Amazon EC2 cloud constructs. We have shown the improvements of gSched over the default job scheduler of Hadoop.

Further research will be performed to improve gSched from a number of perspectives, including consideration of the underlying network capabilities, data placement and sharding. Resource profiling and benchmarking has been found to be non-straightforward in a highly multiplexed virtual resource provisioning construct. We therefore intend to perform larger scale tests, in terms of instances and workloads, to further fine-tune and improve the performance of gSched, similar to the work carried out in [46]. Recently, Xu et al. [47, 48] presented a novel approach to task scheduling on heterogeneous computer systems using chemical reaction optimization. We will further research how this work could be extended to gSched for task scheduling in heterogeneous Hadoop systems.

References

[1] “Forrester Research - Sizing The Cloud.” [Online]. Available: http://www.forrester.com/Sizing+The+Cloud/fulltext/-/E-RES58161?objectid=RES58161. [Accessed: 14-Dec-2012].

[2] “What is AWS?” [Online]. Available: http://aws.amazon.com/what-is-aws/. [Accessed: 09-March-2015].

[3] “Features: An Introduction to Windows Azure.” [Online]. Available: http://www.windowsazure.com/en-us/home/features/overview/. [Accessed: 09-March-2015].

[4] “Google Cloud Platform.” [Online]. Available: https://cloud.google.com/. [Accessed: 09-March-2015].

[5] A. Wieder, P. Bhatotia, A. Post, and R. Rodrigues, “Brief announcement: modelling MapReduce for optimal execution in the cloud,” in Proceedings of the 29th ACM SIGACT-SIGOPS symposium on Principles of distributed computing, 2010, pp. 408–409.

[6] J. Strebel, “Cost Optimization Model for Business Applications in Virtualized Grid Environments,” in Grid Economics and Business Models, 2009, no. 5745, pp. 74–87.

[7] J. Strebel and A. Stage, “An economic decision model for business software application deployment on hybrid Cloud environments.,” in Multikonferenz Wirtschaftsinformatik 2010, 2010, pp. 195–206.

[8] B. Martens, M. Walterbusch, and F. Teuteberg, “Costing of Cloud Computing Services: A Total Cost of Ownership Approach,” Hawaii Int. Conf. Syst. Sci., vol. 0, pp. 1563–1572, 2012.

[9] J. Dean and S. Ghemawat, “MapReduce: simplified data processing on large clusters,” Commun. ACM, vol. 51, no. 1, pp. 107–113, 2008.

[10] B. He, W. Fang, Q. Luo, N. K. Govindaraju, and T. Wang, “Mars: a MapReduce framework on graphics processors,” in Proceedings of the 17th international conference on Parallel architectures and compilation techniques - PACT ’08, 2008, p. 260.

[11] K. Taura, T. Endo, K. Kaneda, and A. Yonezawa, “Phoenix: a parallel programming model for accommodating dynamically joining/leaving resources,” in SIGPLAN Not., 2003, vol. 38, no. 10, pp. 216–229.

[12] T. White, Hadoop: The Definitive Guide, 1st Edition, vol. 1. O’Reilly Media, Inc, 2009.

[13] B. T. Rao and L. S. S. Reddy, “Scheduling Data Intensive Workloads through Virtualization on MapReduce based Clouds,” Int. J. Distrib. Parallel Syst., vol. 3, no. 4, pp. 99–110, 2012.

[14] B. T. Rao, N. V. Sridevi, V. K. Reddy, and L. S. S. Reddy, “Performance Issues of Heterogeneous Hadoop Clusters in Cloud Computing,” Glob. J. Comput. Sci. Technol., vol. XI, no. VIII, p. 6, Jul. 2012.

[15] F. Ahmad, S. T. Chakradhar, A. Raghunathan, and T. N. Vijaykumar, “Tarazu: Optimizing MapReduce on Heterogeneous Clusters,” SIGARCH Comput. Arch. News, vol. 40, no. 1, pp. 61–74, Mar. 2012.

[16] S. Babu, “Towards automatic optimization of MapReduce programs,” Proc. 1st ACM Symp. Cloud Comput. - SoCC ’10, p. 137, Jun. 2010.

[17] Y. Kwon, M. Balazinska, B. Howe, and J. Rolia, “SkewTune,” in Proceedings of the 2012 international conference on Management of Data - SIGMOD ’12, 2012, p. 25.

[18] E. Bortnikov, A. Frank, E. Hillel, and S. Rao, “Predicting execution bottlenecks in map-reduce clusters,” in HotCloud’12: Proceedings of the 4th USENIX conference on Hot Topics in Cloud Computing, 2012, p. 18.

[19] J. Xiao and Z. Xiao, “High-integrity mapreduce computation in cloud with speculative execution,” in Theoretical and mathematical foundations of computer science. Second international conference, ICTMF 2011, Singapore, May 5–6, 2011. Selected papers, Berlin: Springer, 2011, pp. 397–404.

[20] B. Hamidzadeh, D. J. Lilja, and L. Y. Kit, “Dynamic task scheduling using online optimization,” IEEE Trans. Parallel Distrib. Syst., vol. 11, no. 11, pp. 1151–1163, 2000.

[21] T. Sandholm and K. Lai, “Dynamic proportional share scheduling in Hadoop,” in JSSPP’10 Proceedings of the 15th international conference on Job scheduling strategies for parallel processing, 2010, pp. 110–131.

[22] K. Kc and K. Anyanwu, “Scheduling Hadoop Jobs to Meet Deadlines,” in 2010 IEEE Second International Conference on Cloud Computing Technology and Science, 2010, pp. 388–392.

[23] M. M. Rafique, B. Rose, A. R. Butt, and D. S. Nikolopoulos, “Supporting MapReduce on large-scale asymmetric multi-core clusters.,” Oper. Syst. Rev., vol. 43, no. 2, pp. 25–34, 2009.


[24] C. Tian, H. Zhou, Y. He, and L. Zha, “A Dynamic MapReduce Scheduler for Heterogeneous Workloads,” in 2009 Eighth International Conference on Grid and Cooperative Computing, 2009, pp. 218–224.

[25] D. Gillick, A. Faria, and J. DeNero, “MapReduce: Distributed computing for machine learning,” 2006. [Online]. Available: http://cs.smith.edu/dftwiki/images/6/68/MapReduceDistributedComputingMachineLearning.pdf. [Accessed: 09-March-2015].

[26] Z. Ma and L. Gu, “The limitation of MapReduce: A probing case and a lightweight solution,” in Proc. of the 1st Intl. Conf. on Cloud Computing, GRIDs, and Virtualization, 2010, pp. 68–73.

[27] Z. Guo and G. Fox, “Improving MapReduce Performance in Heterogeneous Network Environments and Resource Utilization,” in Cluster, Cloud and Grid Computing (CCGrid), 2012 12th IEEE/ACM International Symposium on, pp. 714–716.

[28] G. Lee, B.-G. Chun, and H. Katz, “Heterogeneity-aware resource allocation and scheduling in the cloud,” in HotCloud’11 Proceedings of the 3rd USENIX conference on Hot topics in cloud computing, 2011, p. 4.

[29] M. Zaharia, A. Konwinski, A. D. Joseph, R. H. Katz, and I. Stoica, “Improving MapReduce Performance in Heterogeneous Environments.,” in OSDI, 2008, pp. 29–42.

[30] Y. Kwon, M. Balazinska, B. Howe, and J. Rolia, “A Study of Skew in MapReduce Applications,” in Proceedings of the 2012 international conference on Management of Data - SIGMOD ’12, 2012, p. 25.

[31] S. Cesare and Y. Xiang, Software Similarity and Classification, p. 66. Springer, 2012.

[32] J. Yang, H. Xu, L. Pan, P. Jia, F. Long, and M. Jie, “Task scheduling using Bayesian optimization algorithm for heterogeneous computing environments,” Appl. Soft Comput., vol. 11, no. 4, pp. 3297–3310, June 2011.

[33] M. I. Daoud and N. Kharma, “A hybrid heuristic–genetic algorithm for task scheduling in heterogeneous processor networks,” J. Parallel Distrib. Comput., vol. 71, no. 11, pp. 1518–1531, Nov. 2011.

[34] T. Aarnio, “Parallel Data Processing with Mapreduce,” TKK T-110.5190, Seminar on Internetworking, 2009. [Online]. Available: http://www.cse.tkk.fi/en/publications/B/5/papers/Aarnio_final.pdf [Accessed: 09-March-2015].

[35] J. Dean, “Experiences with MapReduce, an abstraction for large-scale computation,” PACT ’06 Proc. 15th Int. Conf. Parallel Archit. Compil. Tech., pp. 1–1, 2006.

[36] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten, “The WEKA data mining software,” ACM SIGKDD Explor. Newsl., vol. 11, no. 1, p. 10, Nov. 2009.

[37] Z. Ou, H. Zhuang, J. K. Nurminen, A. Ylä-Jääski, and P. Hui, “Exploiting hardware heterogeneity within the same instance type of Amazon EC2,” Proceedings of the 4th USENIX Workshop on Hot Topics in Cloud Computing, p. 4, Jun. 2012.

[38] “Amazon EC2 Instance Types.” [Online]. Available: http://aws.amazon.com/ec2/instance-types/ [Accessed: 09-March-2015].

[39] G. Caruana, M. Li, and H. Qi, “SpamCloud: A MapReduce based anti-spam architecture,” in 2010 Seventh International Conference on Fuzzy Systems and Knowledge Discovery, 2010, vol. 6, pp. 3003–3006.

[40] G. Caruana, M. Li, and Y. Liu, “An Ontology Enhanced Parallel SVM for Scalable Spam Filter Training,” Neurocomputing, no. 108, pp. 45–57, 2013.

[41] A. Frank and A. Asuncion, UCI Machine Learning Repository - Adult Data Set. University of California, Irvine, School of Information and Computer Sciences, 2010.

[42] G. Caruana, M. Li, and M. Qi, “A MapReduce based parallel SVM for large scale spam filtering,” in Fuzzy Systems and Knowledge Discovery (FSKD), 2011 Eighth International Conference on, 2011, vol. 4, pp. 2659–2662.

[43] L. J. Cao, S. S. Keerthi, C.-J. J. Ong, J. Q. Zhang, U. Periyathamby, X. J. J. Fu, and H. P. Lee, “Parallel sequential minimal optimization for the training of support vector machines.,” IEEE Trans. Neural Networks, vol. 17, no. 4, pp. 1039–1049, Jul. 2006.

[44] B. Palanisamy, A. Singh, L. Liu, and B. Jain, “Purlieus,” in Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis on - SC ’11, 2011, p. 1.

[45] S. Kavulya, J. Tan, R. Gandhi, and P. Narasimhan, “An Analysis of Traces from a Production MapReduce Cluster,” in 2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing, 2010, pp. 94–103.

[46] Z. Zhang, L. Cherkasova, and B. T. Loo, “Exploiting Cloud Heterogeneity to Optimize Performance and Cost of MapReduce Processing,” in Proceedings of the Fourth International Workshop on Cloud Data and Platforms , 2014, pp. 1-6.

[47] Y. Xu, K. Li and K. Li, “A hybrid chemical reaction optimization scheme for task scheduling on heterogeneous computing systems,” IEEE Transactions on Parallel and Distributed Systems, vol. 26, no. 12, pp. 3208–3222, 2015.

[48] Y. Xu and K. Li, “A DAG scheduling scheme on heterogeneous computing systems using double molecular structure-based chemical reaction optimization,” Journal of Parallel and Distributed Computing, vol. 73, no. 9, pp. 1306-1322, 2013.

[49] Z. Tang, L. Jiang, J. Zhou, K. Li, K. Li, “A self-adaptive scheduling algorithm for reduce start time,” Future Generation Computer Systems, vol. 43-44, pp. 51-60, 2015.

[50] Z. Tang, M. Liu, A. Ammar, K. Li, K. Li, “An optimized MapReduce workflow scheduling algorithm for heterogeneous computing,” The Journal of Supercomputing, DOI 10.1007/s11227-014-1335-2, 2014.

[51] K. Li, X. Tang, K. Li, “Scheduling precedence constrained stochastic tasks on heterogeneous cluster systems,” IEEE Transactions on Computers, DOI 10.1109/TC.2013.205, 2013.