EVALUATING TASK SCHEDULING IN HADOOP-BASED CLOUD SYSTEMS
SHENGYUAN LIU, JUNGANG XU, ZONGZHENG LIU, XU LIU
UNIVERSITY OF CHINESE ACADEMY OF SCIENCES & RICE UNIVERSITY
2013-9-30
OUTLINE
• Background & Motivation
• Hadoop Task Scheduler
• Benchmark & Methodology
• Evaluation
• Conclusions & Future Work
PRIVATE CLOUD
• The cloud infrastructure is provisioned for exclusive use by a single organization comprising multiple consumers (e.g., business units). It may be owned, managed, and operated by the organization, a third party, or some combination of them, and it may exist on or off premises.
Source: "The NIST Definition of Cloud Computing", National Institute of Standards and Technology. Retrieved 24 July 2011
MOTIVATION
• A private cloud serves multiple users
• Different task priorities
• Different task types
• Different task data sizes
• Optimizing the performance of a private cloud is necessary and urgent
• A challenge for task scheduling!
HADOOP OVERVIEW
• Hadoop: an open-source software framework for processing a large volume of data on a cluster
HADOOP TASK SCHEDULER
• FIFO
• Naïve Fair Sharing
• Fair Sharing with Delay Scheduling
• Capacity Scheduling
• HOD
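To make the difference between the first two policies concrete, here is a toy sketch of how free task slots could be handed out under FIFO and under naive fair sharing. It is not Hadoop's scheduler code: the `Job` class, the `pick_*` functions, and the two demo jobs are hypothetical. Delay scheduling, capacity queues, and HOD are not modeled.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    submit_time: int
    pending: int        # tasks not yet started
    running: int = 0    # tasks currently holding a slot


def pick_fifo(jobs):
    """FIFO: the earliest-submitted job with pending tasks gets the slot."""
    candidates = [j for j in jobs if j.pending > 0]
    return min(candidates, key=lambda j: j.submit_time, default=None)


def pick_fair(jobs):
    """Naive fair sharing (simplified): the job holding the fewest slots wins."""
    candidates = [j for j in jobs if j.pending > 0]
    return min(candidates, key=lambda j: (j.running, j.submit_time), default=None)


def assign_slots(jobs, free_slots, pick):
    """Hand out free slots one at a time; return the slots held per job."""
    for _ in range(free_slots):
        job = pick(jobs)
        if job is None:
            break
        job.pending -= 1
        job.running += 1
    return {j.name: j.running for j in jobs}


def demo_jobs():
    # Hypothetical jobs: a large one submitted first, a small one just after.
    return [Job("big", 0, pending=10), Job("small", 1, pending=2)]


fifo_result = assign_slots(demo_jobs(), 4, pick_fifo)
fair_result = assign_slots(demo_jobs(), 4, pick_fair)
print(fifo_result)
print(fair_result)
```

With four free slots, FIFO gives all of them to the earlier job, while the fair policy splits them evenly, so the small job does not wait behind the large one.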
CLOUDRANK-D
• A benchmark presented by ICT of CAS
• A benchmark suite for private cloud
• Helps researchers simulate various multi-user applications in industrial scenarios
• Provides a set of 13 representative data analysis tools
  • Basic operations
  • Data mining operations
  • Data warehouse operations
DATA SOURCES OF EACH PROGRAM IN CLOUDRANK-D

Application                           Data sources
Sort                                  Automatically generated
Word count                            Automatically generated
Grep                                  Automatically generated
Naive Bayes                           News and Wikipedia
Support vector machine                Scientist search
K-means                               Sougou corpus
Item based collaborative filtering    Ratings on movies
Frequent pattern growth               Retail market basket data
                                      Click-stream data of an on-line news portal
                                      Traffic accident data
                                      Collection of web HTML documents
Hidden Markov model                   Scientist search
Grep select                           Automatically generated table
Ranking select                        Automatically generated table
User visits aggregation               Automatically generated table
User visits-rankings join             Automatically generated table
WORKLOAD DESIGN
[Figure: pie chart of application categories in private clouds: image processing, text indexing, log processing, web crawling, data mining, machine learning, reporting, data storage]

Applications in private clouds                                     Applications in CloudRank-D        Percentage
Web crawling, data mining, machine learning, image processing     Naive Bayes, SVM, HMM, IBCF, FPG   35%
Text indexing, log processing                                      Basic operations                   31%
Reporting, data storage                                            Hive                               34%
WORKLOAD DESIGN

Category                     Application                           Jobs
Basic Operations             Sort                                  9
                             Word count                            11
                             Grep                                  11
Data Mining Operations       Naïve Bayes                           6
                             Support vector machine                6
                             K-means                               7
                             Item based collaborative filtering    3
                             Frequent pattern growth               7
                             Hidden Markov model                   6
Data Warehouse Operations    Grep select                           34 (all four combined)
                             Ranking select
                             User visits aggregation
                             User visits-rankings join
Total                                                              100
JOB SUBMITTING
• Input data size follows the distribution of input data size in Taobao
• Job submission interval follows an exponential distribution with a mean of 14 seconds (Facebook)
• Jobs are submitted in a random order

Input data size    Percentage
<25MB              40.57%
25MB-625MB         39.33%
1.2GB-5GB          12.03%
>5GB               8.07%
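The submission rules above can be sketched as a small schedule generator. This is a minimal illustration, not the authors' tooling: the function name, the job names, and the fixed seed are assumptions, and drawing a size bucket independently for each job is a simplification, since the slides do not say how sizes are attached to individual applications.

```python
import random

# Input-size buckets and shares copied from the table above.
SIZE_BUCKETS = [
    ("<25MB", 0.4057),
    ("25MB-625MB", 0.3933),
    ("1.2GB-5GB", 0.1203),
    (">5GB", 0.0807),
]
MEAN_INTERVAL_S = 14.0  # mean submission interval in seconds (from the slide)


def make_schedule(job_names, seed=0):
    """Return (submit_time_s, job_name, size_bucket) tuples in submission order."""
    rng = random.Random(seed)
    order = list(job_names)
    rng.shuffle(order)  # jobs are submitted in a random order
    labels = [label for label, _ in SIZE_BUCKETS]
    weights = [share for _, share in SIZE_BUCKETS]
    schedule, clock = [], 0.0
    for name in order:
        # expovariate takes the rate, i.e. 1 / mean
        clock += rng.expovariate(1.0 / MEAN_INTERVAL_S)
        bucket = rng.choices(labels, weights=weights, k=1)[0]
        schedule.append((clock, name, bucket))
    return schedule


# Hypothetical job names standing in for the 100-job workload.
schedule = make_schedule([f"job-{i:03d}" for i in range(100)])
for submit_s, name, bucket in schedule[:3]:
    print(f"{submit_s:8.1f}s  {name}  {bucket}")
```

With a 14-second mean interval, 100 jobs are submitted over roughly 23 minutes on average.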
TESTBED
• Hadoop cluster with 5 nodes (1 NameNode, 4 DataNodes)

CPU type             CPU cores
Intel® Xeon E5645    6 cores @ 2.40 GHz

L1 D/I cache    L2 cache      L3 cache    Memory    Disk
6 × 32 KB       6 × 256 KB    12 MB       16 GB     8 TB

OS            Hadoop    Mahout    Hive
CentOS 5.5    1.0.2     0.6       0.11
HADOOP CONFIGURATION

Hadoop parameter                             Value    Description
mapred.tasktracker.map.tasks.maximum         12       The maximum number of map tasks that will be executed simultaneously by a tasktracker.
mapred.tasktracker.reduce.tasks.maximum      12       The maximum number of reduce tasks that will be executed simultaneously by a tasktracker.
mapred.map.tasks                             48       Maximum number of concurrently running map tasks.
mapred.reduce.tasks                          45       Maximum number of concurrently running reduce tasks.
dfs.replication                              2        The actual number of replications specified when the file is created.
mapreduce.tasktracker.outofband.heartbeat    TRUE     Enables the out-of-band heartbeat.
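As an illustration of how these settings could be written down, the sketch below renders the table's values in Hadoop's `<configuration>`/`<property>` XML layout. The helper name is hypothetical, and emitting everything as one document is a simplification: on a real cluster `dfs.replication` normally lives in `hdfs-site.xml` and the MapReduce properties in `mapred-site.xml`. The slot check assumes one TaskTracker on each of the 4 DataNodes, which the slides do not state explicitly.

```python
import xml.etree.ElementTree as ET

# Values copied from the configuration table above.
SETTINGS = {
    "mapred.tasktracker.map.tasks.maximum": "12",
    "mapred.tasktracker.reduce.tasks.maximum": "12",
    "mapred.map.tasks": "48",
    "mapred.reduce.tasks": "45",
    "dfs.replication": "2",
    "mapreduce.tasktracker.outofband.heartbeat": "true",
}


def to_configuration_xml(settings):
    """Render name/value pairs as a Hadoop-style <configuration> document."""
    root = ET.Element("configuration")
    for name, value in settings.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


xml_text = to_configuration_xml(SETTINGS)
print(xml_text)

# Assumption: one TaskTracker per DataNode, so 4 nodes x 12 map slots each.
DATANODES = 4
total_map_slots = DATANODES * int(SETTINGS["mapred.tasktracker.map.tasks.maximum"])
print(total_map_slots)
```

Under that assumption the cluster has 48 map slots in total, which matches the `mapred.map.tasks` value in the table.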
HADOOP SCHEDULER EVALUATION
• Data Processed per Second
• Turnaround time
• Running time
• Waiting time
• Throughput
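The five metrics can be computed from per-job timestamps roughly as follows. The slides do not give exact formulas, so the definitions here are assumptions based on common usage: turnaround is finish minus submit, waiting is start minus submit, running is finish minus start, and DPS and throughput are taken over the whole workload's makespan. The `JobRecord` class and the three sample records are made up for illustration and are not measurements from the evaluation.

```python
from dataclasses import dataclass


@dataclass
class JobRecord:
    submit_s: float   # when the job was submitted
    start_s: float    # when the job started running
    finish_s: float   # when the job completed
    input_mb: float   # input data size in megabytes


def evaluate(records):
    """Compute the five evaluation metrics over one workload run."""
    n = len(records)
    makespan_s = max(r.finish_s for r in records) - min(r.submit_s for r in records)
    return {
        "dps_mb_per_s": sum(r.input_mb for r in records) / makespan_s,
        "avg_turnaround_s": sum(r.finish_s - r.submit_s for r in records) / n,
        "avg_running_s": sum(r.finish_s - r.start_s for r in records) / n,
        "avg_waiting_s": sum(r.start_s - r.submit_s for r in records) / n,
        "throughput_jobs_per_min": n / (makespan_s / 60.0),
    }


# Made-up sample records, for illustration only.
records = [
    JobRecord(submit_s=0, start_s=0, finish_s=60, input_mb=600),
    JobRecord(submit_s=10, start_s=30, finish_s=90, input_mb=300),
    JobRecord(submit_s=20, start_s=60, finish_s=120, input_mb=300),
]
metrics = evaluate(records)
for key, value in metrics.items():
    print(f"{key}: {value}")
```

Under these definitions, turnaround time is always the sum of waiting time and running time, so a scheduler can lower turnaround either by starting jobs sooner or by finishing them faster.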
DATA PROCESSED PER SECOND
[Figure: bar chart, total running time (10^3 s) by task scheduler (Fair with DS, Naïve Fair, Capacity, FIFO, HOD). Caption: The total running time (10^3 sec) of running the full workload with each of the five schedulers.]
[Figure: bar chart, DPS (MB/s) by task scheduler (Fair with DS, Naïve Fair, Capacity, FIFO, HOD). Caption: The Data Processed per Second (megabytes processed per second) of running the full workload with each of the five schedulers.]
TURNAROUND TIME
[Figure: bar chart, turnaround time (10^3 s) by task scheduler (Fair with DS, Naïve Fair, Capacity, FIFO, HOD). Caption: The average job turnaround time (10^3 sec) of running the full workload with each of the five schedulers.]
AVERAGE JOB RUNNING TIME & WAITING TIME
[Figure: bar chart, running time (10^3 s) by task scheduler (Fair with DS, Naïve Fair, Capacity, FIFO, HOD). Caption: The average job running time (10^3 sec) of running the full workload with each of the five schedulers.]
[Figure: bar chart, waiting time (sec.) by task scheduler (Fair with DS, Naïve Fair, Capacity, FIFO, HOD). Caption: Average job waiting time (seconds) of running the full workload with each of the five schedulers.]
THROUGHPUT
[Figure: bar chart, throughput (jobs/min) by task scheduler (Fair with DS, Naïve Fair, Capacity, FIFO, HOD). Caption: The throughput (number of jobs processed in one minute) of running the full workload with each of the five schedulers.]
EVALUATION RESULT ANALYSIS
• Fair sharing with delay scheduling is the most efficient scheduler
• Some jobs with large input sizes take longer to finish than usual jobs
• Fair with delay scheduling, naïve fair, and capacity all perform better than the default FIFO scheduler
• The HOD scheduler did not perform well, affected by the extra cost of virtualization
CONCLUSIONS & FUTURE WORK
• Optimizing the performance of Hadoop clusters is necessary and significant
• The choice of task scheduler is critical for improving the system performance of a Hadoop cluster
• With fair sharing with delay scheduling, DPS is 20% higher than with the FIFO scheduler
• Optimization and design of the scheduler need to take the characteristics of the workload into account
• In the future, we will use more complex workloads to study and evaluate more efficient task schedulers for Hadoop-based cloud systems
Q & A
THANKS!
E-MAIL: SOUNDER_LIU@163.COM, XUJG@UCAS.AC.CN