Predictions for Parallel Applications and Systems

SERC Research Seminar DayAugust 18, 2007

Predictions for Parallel Applications and Systems

Sathish VadhiyarGrid Applications Research Laboratory (GARL)


GARL Research• Grid Applications

– Climate Modeling– Gene Mutations

• Performance Modeling• Rescheduling• Others

– Prediction of queue wait times


Rescheduling

• The base is a parallel checkpointing library called SRS

• Checkpointing? – storing application’s state so as to continue from the previous state after interruption

• Interruption either by a scheduler or system faults

• SRS allows processor reconfiguration


Application Progress

System 1

Storage

System 2


Optimal Checkpoint Interval

• Storing checkpoints periodically will help in fault-tolerance

•How periodic?• What is the optimal checkpoint interval?

– More checkpointing will lead to increased checkpoint overhead

– Less checkpointing frequency will lead to increase times for recovery from failures


Illustration


Dynamic Determination of Optimal Checkpointing Intervals

• Start the application on a set of resources

• Predict the next failure on the set of resources

• Checkpoint “just before” the next failure• The prediction has to be really accurate• But no prediction can be 100% accurate


Probability Distribution of Failures

• Use a probability distribution of failures on the resources

• Need to know: The next time of failure with x% certainty

• But more certainty is also not good


Markov Chains

For parallel M-M checkpointing

In SRS, there is almost no system down phase

For sequential applications

In SRS, transition from state 0 can lead to many states


Motivation for Queue Wait Times

• A Grid consisting of number of batch queues

• A meta system that will:– predict the wait times and execution

times of jobs– Decide which queue is “most

suitable” for the job


What is a good predictor?

• There are number of prediction strategies• Evaluating a predictor’s goodness:

1. Mean Absolute Percentage Error (MAPE)2. Upper bound for actual/predicted3. Average of (actual-predicted) [absolute error]4. Absolute error/actual wait time [relative error]5. Average error/average queue wait time6. Coefficient of correlation

• Each of these metrics has flaws


Illustration

Method 1 Method 2

Metric 3 value of Method 1 < Metric 3 value of Method 2

i.e. Method 1 is better


Our goals

• To define useful metrics that can clearly say whether a method is “good” or “bad”

• Goodness of predictors– In terms of absolute wait times– In terms of execution times– In terms of resource demand


Illustration:Prediction errors versus absolute wait times

(A-P)/A%

Wait times

y1x1, y1

f(x)

x2, y2


Reality??


What we want to do…

• Define metrics that can evaluate a method in the “absolute” sense, not “comparative” sense– Stare at a single graph and ask “Is this graph good”

as much as possible

• In some cases, it may just not be possible– Use comparisons

• Evaluate the existing methods on these sets of metrics

• Come up with a method that performs the best in terms of all of the defined metrics


Motivation

• Certain large computational phases of climate modeling (CCSM) are done only by some processors

• Load balancing – offload work from these processors to other processors– Increased processor utilization– Decreased execution time

• How much offloading?– Need to predict workload based on previous

computations


What is happening…

Proc 0 Proc 1 Proc 2 Proc 3 Proc 4

Phase 1

Phase 2


What should happen…

Proc 0 Proc 1 Proc 2 Proc 3 Proc 4

Phase 1

Phase 2

For this, we need to know the workload in phase 1

We predict the workload based on previous time steps


Advantages


GARLians

• Yadnyesh Joshi (M.Sc)• Karthikeyan Raman (M.Tech, jointly with Prof.

Govindarajan)• H.A. Sanjay (Ph.D, jointly with Prof. Ravi Nanjundiah, CAOS)• Sivagama Sundari (Ph.D)• Ashish Srivatsava (Project Assistant)• Alumni

– 1 student intern from INSA, Lyon, France– Summer interns– Project assistants– 2 M.Scs


Questions ????

http://garl.serc.iisc.ernet.in

Predictions for Parallel Applications and Systems

Documents

Transcript of Predictions for Parallel Applications and Systems