Meeting Service Level Objectives of Pig Programs
Zhuoyao Zhang, Ludmila Cherkasova,
Abhishek Verma, Boon Thau Loo
University of Pennsylvania, Hewlett-Packard Labs
Cloud Environment
•Advantages
▫Large amount of resources
▫Elasticity
▫Pay-as-you-go pricing model
•Challenges
▫Distributed resources
▫Error-prone
MapReduce and Pig
•MapReduce: Simple and fault-tolerant framework for data processing in the cloud
•Pig
▫Advanced MapReduce-based platform
▫Widely used: Yahoo!, Twitter, LinkedIn
▫PigLatin: A high-level declarative language for expressing data analysis tasks as Pig programs
[Figure: a Pig program shown as a workflow of MapReduce jobs j1 to j7]
Motivation
•Latency-sensitive applications
▫Personalized advertising
▫Spam and fraud detection
▫Real-time log analysis
•How many resources does an application need to meet its deadline?
Contributions
•Performance modeling for Pig programs
▫Given a Pig program, estimate its completion time as a function of the assigned resources
•Deadline-driven resource allocation estimates for Pig programs
▫Given a completion time target, determine the amount of resources a Pig program needs to achieve it
Outline
•Introduction
•Building block
▫Performance model for single MapReduce jobs
•Resource allocation for Pig programs
•Evaluation
•Conclusion and ongoing work
Theoretical Makespan Bounds
•Bounds-based makespan estimates
▫n tasks, k servers
▫avg: average duration of the n tasks
▫max: maximum duration of the n tasks
•Lower bound
$$T_{low} = \frac{n \cdot avg}{k}$$
•Upper bound
$$T_{up} = \frac{(n-1) \cdot avg}{k} + max$$
Illustration
Schedule 1: 1 4 3 2 3 1 2
Makespan = 4, Lower bound = 4
Schedule 2: 3 1 2 3 2 1 4
Makespan = 7, Upper bound = 8
[Figure: the two schedules laid out across the servers]
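Schedule 2 from the illustration can be replayed in a few lines. This is a minimal sketch, assuming a greedy assignment (each task goes to the server that frees up first), k = 4 servers (inferred from the lower bound of 4 for 16 units of work), and the bounds T_low = n·avg/k and T_up = (n-1)·avg/k + max. The function names are illustrative and do not come from the slides.

```python
import heapq


def greedy_makespan(durations, k):
    """Assign tasks, in the given order, to whichever of k servers frees up first."""
    finish_times = [0.0] * k          # min-heap of per-server finish times
    heapq.heapify(finish_times)
    for d in durations:
        earliest = heapq.heappop(finish_times)
        heapq.heappush(finish_times, earliest + d)
    return max(finish_times)


def makespan_bounds(durations, k):
    """Lower and upper makespan bounds from n, avg and max (assumed formulas)."""
    n = len(durations)
    total = sum(durations)
    avg = total / n
    low = total / k                   # same value as n * avg / k
    up = (n - 1) * avg / k + max(durations)
    return low, up


schedule_2 = [3, 1, 2, 3, 2, 1, 4]    # task durations in arrival order
k = 4                                 # assumption: four servers
makespan = greedy_makespan(schedule_2, k)
low, up = makespan_bounds(schedule_2, k)
print(makespan, low, up)
```

For this order the greedy makespan is 7 and the lower bound is 4. The upper bound evaluates to about 7.43, which is within the 8 quoted on the slide.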
•Estimate the bounds of the job completion time based on the job profile
▫Most production jobs are executed routinely on new data sets
▫Job profile built from previous runs
  Map stage: Mavg, Mmax, AvgInputSize, Selectivity
  Reduce stage: Shavg, Shmax, Ravg, Rmax, Selectivity
▫Predict the completion time of future runs from the profile
Estimate Completion Time for Single MR Job
•Estimating bounds on the duration of map and reduce stages
•Map stage duration depends on:
▫NM -- the number of map tasks
▫SM -- the number of map slots
•Reduce stage duration depends on:
▫NR -- the number of reduce tasks
▫SR -- the number of reduce slots
•Job duration $T_J^{low}$, $T_J^{up}$, $T_J^{avg}$
▫Sum of the map and reduce stage durations
$$T_M^{low} = \frac{N_M \cdot M_{avg}}{S_M}$$
$$T_M^{up} = \frac{(N_M - 1) \cdot M_{avg}}{S_M} + M_{max}$$
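The stage bounds can be combined into job-level bounds as follows. This is a rough sketch under two assumptions that the slides do not state: the reduce stage uses the same bound formulas as the map stage, and the shuffle phase (Shavg, Shmax) is left out for brevity. `StageProfile`, `stage_bounds`, `job_bounds` and all numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class StageProfile:
    avg: float  # average task duration in the stage (e.g. Mavg or Ravg)
    max: float  # maximum task duration in the stage (e.g. Mmax or Rmax)


def stage_bounds(n_tasks, slots, profile):
    """Lower/upper bound on the duration of one stage run on `slots` slots."""
    low = n_tasks * profile.avg / slots
    up = (n_tasks - 1) * profile.avg / slots + profile.max
    return low, up


def job_bounds(n_map, s_map, map_profile, n_red, s_red, red_profile):
    """Job bounds as the sum of the map and reduce stage bounds (shuffle omitted)."""
    m_low, m_up = stage_bounds(n_map, s_map, map_profile)
    r_low, r_up = stage_bounds(n_red, s_red, red_profile)
    return m_low + r_low, m_up + r_up


# Made-up profile: 100 map tasks on 20 slots, 10 reduce tasks on 5 slots.
low, up = job_bounds(100, 20, StageProfile(avg=10, max=20),
                     10, 5, StageProfile(avg=30, max=50))
print(low, up)
```

Giving the job more slots shrinks the avg-dependent terms of both bounds, while the max terms stay fixed.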
Estimate Completion Time for Single MR Job
•Given a deadline D and the job profile, find the minimal resources needed to complete the job within D
Resource Allocation for Single MR Job
Given: number of map/reduce tasks, statistics from the job profile
Find the values of $S_M^J$, $S_R^J$ with the minimum value of $S_M^J + S_R^J$ using Lagrange multipliers
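The Lagrange-multiplier step has a closed form if the job completion time is assumed to take the shape T = A/S_M + B/S_R + C, the same shape the per-job equations for Pig programs use in this deck. Here A, B and C stand for coefficients derived from the task counts and the job profile; how they are computed is not shown on the slides, so the values below are purely illustrative. Minimizing S_M + S_R subject to A/S_M + B/S_R = D - C gives S_M = sqrt(A)(sqrt(A) + sqrt(B))/(D - C) and S_R = sqrt(B)(sqrt(A) + sqrt(B))/(D - C).

```python
import math


def min_slots(a, b, c, deadline):
    """Map/reduce slots minimizing s_map + s_red subject to a/s_map + b/s_red + c = deadline.

    Closed form from the Lagrange conditions; returns fractional slot counts.
    """
    budget = deadline - c            # time left for the slot-dependent terms
    if budget <= 0:
        raise ValueError("deadline must exceed the constant term C")
    root_sum = math.sqrt(a) + math.sqrt(b)
    s_map = math.sqrt(a) * root_sum / budget
    s_red = math.sqrt(b) * root_sum / budget
    return s_map, s_red


# Hypothetical coefficients: A = 400, B = 100, C = 10, deadline D = 40.
s_map, s_red = min_slots(400.0, 100.0, 10.0, 40.0)
print(s_map, s_red)                  # check: 400/s_map + 100/s_red + 10 equals 40
```

In practice the fractional results would be rounded up to whole slots, which can only shorten the predicted completion time.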
Outline
•Introduction
•Building block
▫Performance model for single MapReduce jobs
•Resource allocation for Pig programs
•Evaluation
•Conclusion and ongoing work
Performance Model for Pig Programs
•Let P = {J1, J2, ..., JN}; extract the job profile of each job contained in P
▫Assign a unique name to each job within a program
•The program completion time is the sum of the completion times of all the jobs contained in P
$$T_P = \sum_{1 \le i \le N} T_i$$
Resource Allocation for Pig Programs
•Possible strategy: find an appropriate pair of map and reduce slots for each job in the program
$$\frac{A_1}{S_M^1} + \frac{B_1}{S_R^1} + C_1 = d_1$$
$$\frac{A_2}{S_M^2} + \frac{B_2}{S_R^2} + C_2 = d_2$$
$$\vdots$$
$$\frac{A_N}{S_M^N} + \frac{B_N}{S_R^N} + C_N = d_N$$
with
$$\sum_{1 \le i \le N} d_i = D$$
•Problem: difficult to implement and manage by the scheduler
Resource Allocation for Pig Programs
•A simpler and more elegant solution
▫Allocate the same set of resources to the entire program instead of to each job
•Rewrite the previous equations into
$$T_P = \frac{\sum_{1 \le i \le N} A_i}{S_M^P} + \frac{\sum_{1 \le i \le N} B_i}{S_R^P} + \sum_{1 \le i \le N} C_i = D$$
Find the minimum set of map and reduce slots ($S_M^P$, $S_R^P$) for the entire Pig program
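A program-level allocation can be sketched by summing the per-job coefficients and solving a single equation for the whole program. This assumes the same closed-form Lagrange solution as for a single job and rounds up to whole slots. The coefficient tuples and the name `program_allocation` are hypothetical, not taken from the slides.

```python
import math


def program_allocation(job_coeffs, deadline):
    """One (map, reduce) slot pair for a whole Pig program.

    job_coeffs: list of (A_i, B_i, C_i) tuples, one per MapReduce job.
    Returns whole slot counts and the completion time predicted for them.
    """
    a = sum(job[0] for job in job_coeffs)
    b = sum(job[1] for job in job_coeffs)
    c = sum(job[2] for job in job_coeffs)
    budget = deadline - c
    if budget <= 0:
        raise ValueError("deadline must exceed the summed constant terms")
    root_sum = math.sqrt(a) + math.sqrt(b)
    # Round up so the predicted time never exceeds the deadline; keep at least 1 slot.
    s_map = max(1, math.ceil(math.sqrt(a) * root_sum / budget))
    s_red = max(1, math.ceil(math.sqrt(b) * root_sum / budget))
    predicted = a / s_map + b / s_red + c
    return s_map, s_red, predicted


# Two hypothetical jobs and a deadline of 40 time units.
jobs = [(300.0, 60.0, 4.0), (100.0, 40.0, 6.0)]
s_map, s_red, predicted = program_allocation(jobs, deadline=40.0)
print(s_map, s_red, predicted)
```

The scheduler then handles one slot pair for the whole program rather than one pair per job, which is the simplification the slide argues for.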
Experiment Setup
•66-node cluster in 2 racks
▫4 AMD 2.39GHz cores
▫8 GB RAM
▫Two 160GB hard disks
•Configuration
▫1 jobtracker, 1 namenode, 64 worker nodes
▫2 map slots and 1 reduce slot for each node
Benchmark
•PigMix benchmark
▫17 programs
▫8 tables as the input data
•Dataset
▫Test dataset
  Generated with the PigMix data generator
  Total size around 1TB
▫Experimental dataset
  Same layout as the test dataset
  20% larger in size
Model Accuracy
•How well does our performance model capture Pig program completion time?
Normalized results for predicted and measured completion time
Meeting Deadlines
•Are we meeting deadlines with our resource allocation model?
PigMix executed on the experimental data set: do we meet deadlines?
Conclusion
•Conclusion
▫The performance model can accurately estimate the completion time of MapReduce workflows
▫Enables automatic resource provisioning for MapReduce workflows with deadlines
•Ongoing work
▫Refine the performance model for workflows with concurrent jobs
▫Incorporate failure scenarios into the current model
Thank you