Evaluating Task Scheduling in Hadoop-based Cloud...

EVALUATING TASK SCHEDULING IN HADOOP-BASED CLOUD SYSTEMS

SHENGYUAN LIU, JUNGANG XU, ZONGZHENG LIU, XU LIU
UNIVERSITY OF CHINESE ACADEMY OF SCIENCES & RICE UNIVERSITY

2013-9-30

OUTLINE

• Background & Motivation
• Hadoop Task Scheduler
• Benchmark & Methodology
• Evaluation
• Conclusions & Future Work

PRIVATE CLOUD

• The cloud infrastructure is provisioned for exclusive use by a single organization comprising multiple consumers (e.g., business units). It may be owned, managed, and operated by the organization, a third party, or some combination of them, and it may exist on or off premises.

"The NIST Definition of Cloud Computing", National Institute of Standards and Technology. Retrieved 24 July 2011.

MOTIVATION

• A private cloud serves multiple users.
• Different task priorities
• Different task types
• Different task data sizes
• Optimizing the performance of private cloud is necessary and urgent

• A challenge for task scheduling!


HADOOP OVERVIEW

Hadoop
• An open-source software framework for processing a large volume of data on a cluster

HADOOP TASK SCHEDULER

• FIFO
• Naïve Fair Sharing
• Fair Sharing with Delay Scheduling
• Capacity Scheduling
• HOD


CLOUDRANK-D

• A benchmark presented by ICT of CAS
• A benchmark suite for private cloud
• Helps researchers to simulate various multi-user applications in industrial scenarios
• The benchmark provides a set of 13 representative data analysis tools
  • Basic operations
  • Data mining operations
  • Data warehouse operations

DATA SOURCES OF EACH PROGRAM IN CLOUDRANK-D

Application                          Data sources
Sort                                 Automatically generated
Word count                           Automatically generated
Grep                                 Automatically generated
Naive Bayes                          News and Wikipedia
Support vector machine               Scientist search
K-means                              Sougou corpus
Item based collaborative filtering   Ratings on movies
Frequent pattern growth              Retail market basket data; click-stream data of an on-line news portal; traffic accident data; collection of web html document
Hidden Markov model                  Scientist search
Grep select                          Automatically generated table
Ranking select                       Automatically generated table
User visits aggregation              Automatically generated table
User visits-rankings join            Automatically generated table


WORKLOAD DESIGN

[Figure: two pie charts. One shows the percentage of application types in private clouds (image processing, text indexing, log processing, web crawling, data mining, machine learning, reporting, data storage). The other shows the percentage of applications in CloudRank-D (Basic Operations 31%, Hive 34%, and the data mining applications Naive Bayes, SVM, HMM, IBCF, FPG).]

WORKLOAD DESIGN

Category                    Application                          Jobs
Basic Operations            Sort                                 9
                            Word count                           11
                            Grep                                 11
Data Mining Operations      Naïve Bayes                          6
                            Support vector machine               6
                            K-means                              7
                            Item based collaborative filtering   3
                            Frequent pattern growth              7
                            Hidden Markov model                  6
Data Warehouse Operations   Grep select                          34 (all four combined)
                            Ranking select
                            User visits aggregation
                            User visits-rankings join
Total                                                            100 Jobs

JOB SUBMITTING

• Input data sizes follow the distribution of input data size in Taobao
• Submission intervals follow an exponential distribution with a mean of 14 seconds (Facebook)
• Jobs are submitted in a random order

Input data size   Percentage
<25MB             40.57%
25MB-625MB        39.33%
1.2GB-5GB         12.03%
>5GB              8.07%

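The submission rules above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' submission script: the function name `build_schedule`, the seed, and the choice to draw each job's size bucket independently of its application are assumptions made for the example. The job counts come from the workload design table and the size shares from the Taobao table.

```python
import random

# Job counts per application from the workload design table (100 jobs in total).
# The slides give only a combined count of 34 for the four data warehouse queries.
JOB_MIX = {
    "sort": 9, "word count": 11, "grep": 11,
    "naive bayes": 6, "support vector machine": 6, "k-means": 7,
    "item based collaborative filtering": 3,
    "frequent pattern growth": 7, "hidden markov model": 6,
    "data warehouse operations": 34,
}

# Input-size buckets and their share of jobs (percent), from the Taobao table.
SIZE_BUCKETS = [("<25MB", 40.57), ("25MB-625MB", 39.33),
                ("1.2GB-5GB", 12.03), (">5GB", 8.07)]

MEAN_INTERVAL_SEC = 14.0  # mean gap between submissions (Facebook)


def build_schedule(seed=0):
    """Return a list of (submit_time_sec, application, size_bucket) tuples."""
    rng = random.Random(seed)
    jobs = [app for app, count in JOB_MIX.items() for _ in range(count)]
    rng.shuffle(jobs)  # jobs are submitted in a random order
    labels = [label for label, _ in SIZE_BUCKETS]
    weights = [share for _, share in SIZE_BUCKETS]
    schedule, clock = [], 0.0
    for app in jobs:
        # Exponentially distributed gap with a mean of 14 seconds.
        clock += rng.expovariate(1.0 / MEAN_INTERVAL_SEC)
        # Assumption: the size bucket is drawn independently of the application.
        bucket = rng.choices(labels, weights=weights, k=1)[0]
        schedule.append((clock, app, bucket))
    return schedule
```

With a mean gap of 14 seconds, submitting all 100 jobs takes roughly 23 minutes on average, so many jobs overlap and the scheduler has to arbitrate between them.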
TESTBED

• Hadoop cluster with 5 nodes (1 NameNode, 4 DataNodes)

CPU Type            Intel CPU Core
Intel® Xeon E5645   6 cores @ 2.40 GHz

L1 D/I Cache   L2 Cache     L3 Cache   Memory   Disk
6 × 32 KB      6 × 256 KB   12MB       16GB     8TB

OS           Hadoop   Mahout   Hive
CentOS 5.5   1.0.2    0.6      0.11

HADOOP CONFIGURATION

Hadoop Parameter                            Value   Description
mapred.tasktracker.map.tasks.maximum        12      The maximum number of map tasks that will be executed simultaneously by a tasktracker.
mapred.tasktracker.reduce.tasks.maximum     12      The maximum number of reduce tasks that will be executed simultaneously by a tasktracker.
mapred.map.tasks                            48      The default number of map tasks per job.
mapred.reduce.tasks                         45      The default number of reduce tasks per job.
dfs.replication                             2       Default block replication; the actual number of replications can be specified when the file is created.
mapreduce.tasktracker.outofband.heartbeat   TRUE    Enables the out-of-band heartbeat.
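As a rough illustration of how these settings would be written down, the sketch below renders them as Hadoop `<configuration>` XML using Python's standard library. The split into a mapred-site.xml group and an hdfs-site.xml group follows the usual Hadoop 1.x convention and is not stated on the slides; the helper name `to_configuration_xml` is hypothetical, and the slot arithmetic assumes one tasktracker on each of the 4 DataNodes.

```python
import xml.etree.ElementTree as ET

# Values from the configuration table. Grouping by file is an assumption
# based on the usual Hadoop 1.x layout, not something the slides state.
MAPRED_SITE = {
    "mapred.tasktracker.map.tasks.maximum": "12",
    "mapred.tasktracker.reduce.tasks.maximum": "12",
    "mapred.map.tasks": "48",
    "mapred.reduce.tasks": "45",
    "mapreduce.tasktracker.outofband.heartbeat": "true",
}
HDFS_SITE = {"dfs.replication": "2"}

DATANODES = 4  # from the testbed slide; assumed to host one tasktracker each


def to_configuration_xml(properties):
    """Render a dict of settings as a Hadoop-style <configuration> document."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


# Cluster-wide map slots under the assumption above: 4 nodes x 12 slots each.
MAP_SLOTS = DATANODES * int(MAPRED_SITE["mapred.tasktracker.map.tasks.maximum"])

if __name__ == "__main__":
    print(to_configuration_xml(MAPRED_SITE))
    print(to_configuration_xml(HDFS_SITE))
    print("total map slots:", MAP_SLOTS)
```

Under that assumption the cluster offers 48 map slots, which matches the value chosen for mapred.map.tasks.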

HADOOP SCHEDULER EVALUATION

• Data Processed per Second
• Turnaround time
• Running time
• Waiting time
• Throughput
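The five metrics can be computed from per-job timestamps. The sketch below is a minimal illustration under assumed definitions, since the slides name the metrics without giving formulas: turnaround is finish minus submit, waiting is start minus submit, running is finish minus start, and DPS and throughput divide by the time from the first submission to the last completion. The `JobRecord` type and the `summarize` function are hypothetical names.

```python
from typing import NamedTuple


class JobRecord(NamedTuple):
    submit: float    # time the job was submitted, in seconds
    start: float     # time the job began running, in seconds
    finish: float    # time the job completed, in seconds
    input_mb: float  # input data size in megabytes


def summarize(jobs):
    """Compute the five scheduler metrics for a list of JobRecord entries."""
    # Assumed definition: total running time spans first submit to last finish.
    total_sec = max(j.finish for j in jobs) - min(j.submit for j in jobs)
    n = len(jobs)
    return {
        "total_running_time_sec": total_sec,
        # Megabytes processed per second over the whole workload.
        "dps_mb_per_sec": sum(j.input_mb for j in jobs) / total_sec,
        "avg_turnaround_sec": sum(j.finish - j.submit for j in jobs) / n,
        "avg_running_sec": sum(j.finish - j.start for j in jobs) / n,
        "avg_waiting_sec": sum(j.start - j.submit for j in jobs) / n,
        # Number of jobs processed in one minute.
        "throughput_jobs_per_min": n / (total_sec / 60.0),
    }
```

Under these definitions the turnaround time of a job is the sum of its waiting time and its running time, so a scheduler can lower average turnaround either by starting jobs sooner or by running them faster.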

DATA PROCESSED PER SECOND

[Figure: bar chart by task scheduler (Fair with DS, Naïve Fair, Capacity, FIFO, HOD)] The total running time (10^3 sec) of running full workload by using five schedulers respectively.

[Figure: bar chart by task scheduler (Fair with DS, Naïve Fair, Capacity, FIFO, HOD)] The Data Processed per Second (megabytes processed per second) of running full workload by using five schedulers respectively.

TURNAROUND TIME

[Figure: bar chart by task scheduler (Fair with DS, Naïve Fair, Capacity, FIFO, HOD)] The average job turnaround time (10^3 sec) of running full workload by using five schedulers respectively.

AVERAGE JOB RUNNING TIME & WAITING TIME

[Figure: bar chart by task scheduler (Fair with DS, Naïve Fair, Capacity, FIFO, HOD)] The average job running time (10^3 sec) of running full workload by using five schedulers respectively.

[Figure: bar chart by task scheduler] Average job waiting time (second) of running full workload by using five schedulers respectively.

THROUGHPUT

[Figure: bar chart by task scheduler (Fair with DS, Naïve Fair, Capacity, FIFO, HOD)] The throughput (number of jobs processed in one minute) of running full workload by using five schedulers respectively.

EVALUATION RESULT ANALYSIS

• The fair scheduler with delay scheduling is the most efficient scheduler
• Some large jobs take longer to finish than usual jobs
• Fair with delay scheduling, naïve fair, and capacity all perform better than the default FIFO scheduler
• The HOD scheduler did not perform well; it is affected by the extra cost of virtualization

CONCLUSIONS & FUTURE WORK

• Optimizing the performance of Hadoop clusters is necessary and significant
• The choice of task scheduler is critical for improving the system performance of a Hadoop cluster
• With fair sharing with delay scheduling, DPS improves by 20% over the FIFO scheduler
• Optimization and design of the scheduler need to take the characteristics of the workload into account
• In the future, we will use more complex workloads to study and evaluate more efficient task schedulers for Hadoop-based cloud systems

Q & A

THANKS!

E-MAIL: SOUNDER_LIU@163.COM, XUJG@UCAS.AC.CN
