Predictions for Parallel Applications and Systems
SERC Research Seminar Day, August 18, 2007
Predictions for Parallel Applications and Systems
Sathish Vadhiyar
Grid Applications Research Laboratory (GARL)
GARL Research
• Grid Applications
– Climate Modeling
– Gene Mutations
• Performance Modeling
• Rescheduling
• Others
– Prediction of queue wait times
Rescheduling
• The base is a parallel checkpointing library called SRS
• Checkpointing? – storing an application’s state so that it can resume from the saved state after an interruption
• Interruption either by a scheduler or system faults
• SRS allows processor reconfiguration
[Figure: Application Progress. Labels: System 1, Storage, System 2]
Optimal Checkpoint Interval
• Storing checkpoints periodically helps with fault tolerance
• How periodic?
• What is the optimal checkpoint interval?
– More frequent checkpointing leads to increased checkpoint overhead
– Less frequent checkpointing leads to longer recovery times after failures
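The trade-off above can be made concrete with a simple first-order cost model. This is only a sketch under assumptions that are not in the slides: a fixed cost per checkpoint, a known mean time between failures (MTBF), and a failure that on average loses half an interval of work. Under this model the wasted fraction of time is roughly C/T + T/(2M), which is smallest at T = sqrt(2·C·M), the classical first-order approximation usually attributed to Young. It is not necessarily the model used with SRS, and the function names are hypothetical.

```python
import math

def overhead_fraction(interval, checkpoint_cost, mtbf):
    """Approximate fraction of time wasted for a given checkpoint interval.

    Assumed model (not from the slides): checkpoint_cost / interval is the
    checkpointing overhead, interval / (2 * mtbf) is the expected rework
    after a failure.
    """
    return checkpoint_cost / interval + interval / (2.0 * mtbf)

def optimal_interval(checkpoint_cost, mtbf):
    """Interval that minimizes overhead_fraction under the assumed model."""
    return math.sqrt(2.0 * checkpoint_cost * mtbf)

if __name__ == "__main__":
    cost, mtbf = 50.0, 10000.0  # illustrative values in seconds
    best = optimal_interval(cost, mtbf)
    for t in (best / 2, best, best * 2):
        print(t, overhead_fraction(t, cost, mtbf))
```

Checkpointing twice as often or half as often as the computed interval both give a higher wasted fraction in this model, which is the shape of the trade-off the slide describes.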
Illustration
Dynamic Determination of Optimal Checkpointing Intervals
• Start the application on a set of resources
• Predict the next failure on the set of resources
• Checkpoint “just before” the next failure
• The prediction has to be really accurate
• But no prediction can be 100% accurate
Probability Distribution of Failures
• Use a probability distribution of failures on the resources
• Need to know: The next time of failure with x% certainty
• But more certainty is also not good
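As a hedged illustration of "the next time of failure with x% certainty", the sketch below assumes failures follow an exponential distribution with a known mean time between failures. The slides do not say which distribution is used, so both the distribution and the function name are assumptions. It returns the time t by which a failure has occurred with the given probability.

```python
import math

def time_of_failure_with_certainty(mtbf, certainty):
    """Time t such that P(failure occurs within t) equals `certainty`.

    Assumes exponentially distributed failure times with mean `mtbf`
    (an assumption made for illustration only).
    """
    if not 0.0 < certainty < 1.0:
        raise ValueError("certainty must be strictly between 0 and 1")
    return -mtbf * math.log(1.0 - certainty)

if __name__ == "__main__":
    for x in (0.5, 0.9, 0.99):
        print(x, time_of_failure_with_certainty(100.0, x))
```

Under this assumption the returned time grows without bound as the certainty approaches 100%, so the chosen certainty level strongly affects when a checkpoint would be scheduled.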
Markov Chains
For parallel M-M checkpointing
In SRS, there is almost no system down phase
For sequential applications
In SRS, transition from state 0 can lead to many states
GARL Research
• Grid Applications
– Climate Modeling
– Gene Mutations
• Performance Modeling
• Rescheduling
• Others
– Prediction of queue wait times
Motivation for Queue Wait Times
• A Grid consisting of a number of batch queues
• A meta system that will:
– Predict the wait times and execution times of jobs
– Decide which queue is “most suitable” for the job
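A minimal sketch of the selection step such a meta system might perform is shown below. It assumes that "most suitable" means the smallest predicted wait time plus predicted execution time. That criterion, the queue names, and the function name are hypothetical and are not taken from the slides.

```python
def most_suitable_queue(predictions):
    """Pick the queue with the smallest predicted turnaround time.

    `predictions` maps a queue name to a tuple
    (predicted_wait_time, predicted_execution_time), both in the same unit.
    Using wait + execution as the criterion is an assumption.
    """
    if not predictions:
        raise ValueError("no queues to choose from")
    return min(predictions, key=lambda name: sum(predictions[name]))

if __name__ == "__main__":
    # Hypothetical queues and predicted times in minutes.
    queues = {"queue_a": (100, 50), "queue_b": (10, 120), "queue_c": (30, 90)}
    print(most_suitable_queue(queues))
```

The quality of this decision depends entirely on the accuracy of the wait-time and execution-time predictions that feed it.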
What is a good predictor?
• There are a number of prediction strategies
• Evaluating a predictor’s goodness:
1. Mean Absolute Percentage Error (MAPE)
2. Upper bound for actual/predicted
3. Average of (actual-predicted) [absolute error]
4. Absolute error/actual wait time [relative error]
5. Average error/average queue wait time
6. Coefficient of correlation
• Each of these metrics has flaws
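The listed metrics can be sketched in a few lines. The interpretations here are assumptions: metric 2 is taken as the largest observed actual/predicted ratio, metric 3 as the mean of |actual - predicted|, and metric 6 as the Pearson correlation coefficient. The function name and the sample data are hypothetical, and the inputs are assumed to be positive and non-constant.

```python
import math

def evaluate(actual, predicted):
    """Compute the six listed metrics for paired wait times (assumed > 0)."""
    n = len(actual)
    abs_err = [abs(a - p) for a, p in zip(actual, predicted)]
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return {
        "mape": 100.0 * sum(e / a for e, a in zip(abs_err, actual)) / n,  # 1
        "max_ratio": max(a / p for a, p in zip(actual, predicted)),       # 2
        "mean_abs_error": sum(abs_err) / n,                                # 3
        "relative_errors": [e / a for e, a in zip(abs_err, actual)],      # 4
        "normalized_error": (sum(abs_err) / n) / mean_a,                   # 5
        "correlation": cov / math.sqrt(var_a * var_p),                     # 6
    }

if __name__ == "__main__":
    # Made-up wait times, chosen so that predicted = 0.6 * actual + 6.
    print(evaluate([10.0, 20.0, 40.0], [12.0, 18.0, 30.0]))
```

In this made-up sample the predictions are an exact linear function of the actual values, so the correlation coefficient is 1 even though the largest wait time is under-predicted by 25%. This is one simple way in which a single metric can mislead.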
Illustration
Method 1 Method 2
Metric 3 value of Method 1 < Metric 3 value of Method 2
i.e. Method 1 is better
Our goals
• To define useful metrics that can clearly say whether a method is “good” or “bad”
• Goodness of predictors
– In terms of absolute wait times
– In terms of execution times
– In terms of resource demand
Illustration: Prediction errors versus absolute wait times
[Figure: percentage error (A-P)/A % plotted against wait times, showing a curve f(x) and the points (x1, y1) and (x2, y2)]
Reality??
What we want to do…
• Define metrics that can evaluate a method in the “absolute” sense, not the “comparative” sense
– As much as possible, stare at a single graph and ask “Is this graph good?”
• In some cases, it may just not be possible
– Use comparisons
• Evaluate the existing methods on these sets of metrics
• Come up with a method that performs the best in terms of all of the defined metrics
GARL Research
• Grid Applications
– Climate Modeling
– Gene Mutations
• Performance Modeling
• Rescheduling
• Others
– Prediction of queue wait times
Motivation
• Certain large computational phases of climate modeling (CCSM) are done only by some processors
• Load balancing – offload work from these processors to other processors
– Increased processor utilization
– Decreased execution time
• How much offloading?
– Need to predict workload based on previous computations
What is happening…
Proc 0 Proc 1 Proc 2 Proc 3 Proc 4
Phase 1
Phase 2
What should happen…
Proc 0 Proc 1 Proc 2 Proc 3 Proc 4
Phase 1
Phase 2
For this, we need to know the workload in phase 1
We predict the workload based on previous time steps
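One simple way to predict the workload from previous time steps is a moving average, sketched below. The slides do not state which predictor is used, so the moving average, the equal-share offloading rule (which assumes identical processors and ignores communication cost), and all names are assumptions made for illustration.

```python
def predict_workload(history, window=3):
    """Predict the next phase-1 workload as the mean of the last `window` steps."""
    if not history:
        raise ValueError("need at least one previous time step")
    recent = history[-window:]
    return sum(recent) / len(recent)

def offload_per_busy_processor(predicted_work, busy_procs, total_procs):
    """Work each busy processor should hand off so that all processors
    end up with an equal share (assumes identical processors, no transfer cost)."""
    share_when_concentrated = predicted_work / busy_procs
    share_when_balanced = predicted_work / total_procs
    return share_when_concentrated - share_when_balanced

if __name__ == "__main__":
    history = [100.0, 110.0, 120.0, 130.0]  # made-up workloads of earlier steps
    work = predict_workload(history)
    print(work, offload_per_busy_processor(work, busy_procs=2, total_procs=5))
```

With these made-up numbers the predicted workload is 120 units, and each of the 2 busy processors would hand off 36 units so that all 5 processors hold 24 units each.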
Advantages
GARLians
• Yadnyesh Joshi (M.Sc)
• Karthikeyan Raman (M.Tech, jointly with Prof. Govindarajan)
• H.A. Sanjay (Ph.D, jointly with Prof. Ravi Nanjundiah, CAOS)
• Sivagama Sundari (Ph.D)
• Ashish Srivatsava (Project Assistant)
• Alumni
– 1 student intern from INSA, Lyon, France
– Summer interns
– Project assistants
– 2 M.Scs
Questions?
http://garl.serc.iisc.ernet.in