Active Sampling for Accelerated Learning of Performance Models
Piyush Shivam, Shivnath Babu, Jeff Chase
Duke University
Networked Computing Utility

[Figure: a task scheduler maps a task workflow onto a network of clusters or grid sites (Site A, Site B, Site C; clusters C1, C2, C3).]

• A network of clusters or grid sites.
• Each site is a pool of heterogeneous resources (e.g., CPU, memory, storage, network) managed as a shared utility.
• Jobs are task/data workflows.
• Challenge: choose the 'best' resource mapping/schedule for the job mix.
• An instance of "utility resource planning".
• Solution under construction: NIMO.
Subproblem: Predict Job Completion Time
Sample | CPU speed | Memory size | Network latency | Disk spindles | Execution time
-------|-----------|-------------|-----------------|---------------|---------------
s1     | 2.4 GHz   | 2 GB        | 1 ms            | 10            | 2 hours
...    | ...       | ...         | ...             | ...           | ...
Premises (Limitations)
• Important batch applications are run repeatedly.
  – Most resources are consumed by applications we have seen in the past.
• Behavior is predictable across data sets...
  – ...given some attributes associated with the data set.
  – Stable behavior per unit of data processed (D).
  – D is predictable from data set attributes.
• Behavior depends only on resource attributes.
  – CPU type and clock, seek time, spindle count.
• Utility controls the resources assigned to each job.
  – Virtualization enables precise control.
• Your mileage may vary.
NIMO: NonInvasive Modeling for Optimization
• NIMO learns end-to-end performance models.
  – Models predict performance as a function of (a) the application profile, (b) the data set profile, and (c) the resource profile of a candidate resource assignment.
• NIMO is active.
  – NIMO collects training data for learning models by conducting proactive experiments on a 'workbench'.
• NIMO is noninvasive.
The Big Picture

[Figure: NIMO architecture. An application profiler and a resource profiler feed a training-set database; active learning fits a model relating app/data profiles and candidate resource profiles to (target) performance. The model answers "what if..." queries for the scheduler across sites A, B, and C (clusters C1, C2, C3). Jobs and benchmarks run under pervasive instrumentation, and their metrics are correlated with job logs.]
Generic End-to-End Model

[Figure: execution alternates between compute phases (compute resource busy) and stall phases (compute resource stalled on I/O).]

T = D * (Oa + Os),  where  Os = Od + On

• T: total completion time; D: total data processed.
• Oa: compute occupancy; Os: stall occupancy; Od: storage occupancy; On: network occupancy.
• Occupancy: average time consumed per unit of data; directly observable.
• Independent variables: the resource profile and the data profile. Dependent variables: the occupancies, learned via statistical learning.
• Complexity (e.g., latency hiding, concurrency, arm contention) is captured implicitly in the training data rather than in the structure of the model.
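To make the model concrete, here is a minimal sketch in Python of how such an end-to-end model could be fit and used. It assumes simple linear per-occupancy predictors, and all feature names and values are illustrative; this is not NIMO's actual implementation.

```python
# Sketch: fit one linear predictor per occupancy from training samples,
# then predict completion time as T = D * (Oa + Od + On).
import numpy as np

# Each training sample: resource-profile features -> observed occupancies.
# Columns: [cpu_ghz, mem_gb, net_latency_ms, disk_spindles] (assumed names).
X = np.array([
    [2.4, 2.0, 1.0, 10],
    [1.8, 4.0, 5.0,  4],
    [3.0, 1.0, 0.5,  8],
    [2.0, 2.0, 2.0,  6],
    [2.8, 8.0, 0.5, 12],
])
# Observed occupancies (seconds per unit of data), one column per resource.
Y = np.array([
    # Oa    Od    On
    [0.30, 0.10, 0.05],
    [0.45, 0.20, 0.12],
    [0.25, 0.12, 0.04],
    [0.38, 0.15, 0.08],
    [0.27, 0.09, 0.04],
])

# Fit an independent least-squares predictor for each occupancy.
X1 = np.hstack([X, np.ones((len(X), 1))])    # add intercept column
W, *_ = np.linalg.lstsq(X1, Y, rcond=None)   # one weight column per occupancy

def predict_completion_time(resource_profile, total_data):
    """T = D * (Oa + Od + On): occupancies predicted from the resource profile."""
    feats = np.append(resource_profile, 1.0)
    oa, od, on = feats @ W
    return total_data * (oa + od + on)

print(predict_completion_time(np.array([2.4, 2.0, 1.0, 10]), total_data=3600))
```

Because each occupancy gets its own predictor, the completion-time model improves whenever any single occupancy predictor improves; this is what the active-sampling steps below exploit.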
Sampling Challenges
• Full system operating range.
  – Samples must cover the space of candidate resource assignments.
• Cost of sample acquisition.
  – Acquiring a sample has a non-negligible cost, e.g., the time to acquire it, or the opportunity cost for the application.
• Curse of dimensionality.
  – Too many parameters! E.g., 10 dimensions × 10 values per dimension: at 5 minutes per sample, covering even 1% of the space takes 951 years!
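The 951-year figure follows directly from the numbers on this slide; a quick check:

```python
# Back-of-the-envelope check of the curse-of-dimensionality figure above:
# 10 dimensions x 10 values per dimension = 10**10 candidate assignments.
full_space = 10 ** 10
one_percent = full_space // 100                # 10**8 samples
minutes = one_percent * 5                      # 5 minutes per sample
years = minutes / (60 * 24 * 365)
print(f"{years:,.0f} years")                   # -> 951 years
```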
Active Learning in NIMO

[Figure: accuracy of the current model vs. number of training samples; active sampling approaches 100% accuracy with far fewer samples than passive sampling.]

• Passive sampling might not expose the system operating range.
• Active sampling using "design of experiments" collects the most relevant training data.
• Automatic and quick.
How to Learn Accurate Models Quickly? Sample Carefully

[Figure: accuracy of the current model vs. number of training samples for passive sampling, active sampling without acceleration, and active sampling with acceleration; accelerated active sampling reaches 100% accuracy fastest.]
Active Sampling Challenges
• How to expose the main factors and interactions in the shortest time?
  – Which dimensions/attributes to perturb?
  – What values to choose for the attributes?
• Where to conduct the experiment?
  – On a separate system ("workbench") or "live"?
Planning 'Active' Experiments
1. Choose a predictor function to refine.
   • Focus on the most significant/relevant predictors... or the least accurate.
   • Example: a CPU-intensive app needs an accurate compute-time predictor.
2. Choose an attribute (if any) to add to the predictor.
   • Example: CPU speed.
3. Choose the values of the attributes.
4. Conduct the experiment.
5. Compute the current prediction error; go to Step 1.
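The five steps above form a loop. A minimal Python sketch follows; every helper name here (predictors, workbench, run_experiment, and the methods on each predictor object) is a hypothetical placeholder, not NIMO's API.

```python
# Sketch of the five-step active-sampling loop described above.
def active_sampling_loop(predictors, workbench, error_target=0.10):
    training_set = []
    while True:
        # Step 1: pick the predictor to refine (here: maximum current error).
        target = max(predictors, key=lambda p: p.current_error)
        # Step 2: optionally add an attribute (e.g., CPU speed) once
        # further training with the current attribute set stops helping.
        if target.error_reduction_stalled():
            target.add_attribute(target.most_relevant_unused_attribute())
        # Step 3: choose attribute values for the next experiment.
        assignment = target.choose_next_values()
        # Step 4: run the experiment on the workbench and record the sample.
        sample = workbench.run_experiment(assignment)
        training_set.append(sample)
        target.refit(training_set)
        # Step 5: recompute error; stop once the model is accurate enough.
        if max(p.current_error for p in predictors) <= error_target:
            return training_set
```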
Choosing the Next Predictor
• Learn the most significant/relevant predictors first.
  – Static vs. dynamic ordering.
  – Static: define a total order, e.g., a priori or by pre-estimates of influence (Plackett-Burman).
    • Cycle through the order: round-robin vs. improvement threshold.
  – Dynamic: choose the predictor with the maximum current error.
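The two orderings differ in one line of selection logic; an illustrative sketch, where `predictors` is assumed to be a list of objects exposing a `.current_error` attribute (hypothetical names):

```python
def next_predictor_static(predictors, step):
    # Static: a fixed total order (e.g., ranked by Plackett-Burman
    # pre-estimates of influence), cycled round-robin.
    return predictors[step % len(predictors)]

def next_predictor_dynamic(predictors):
    # Dynamic: refine whichever predictor currently has the largest error.
    return max(predictors, key=lambda p: p.current_error)
```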
Choosing New Attributes
• Include the most significant/relevant attributes.
  – Choose attributes to expose main factors and interactions.
• Add an attribute when the error reduction from further training with the current set falls below a threshold.
• Choose the attribute with the maximum potential improvement in accuracy.
  – Establish a total order using a pre-estimate of relevance (Plackett-Burman).
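Plackett-Burman screening can produce such a relevance pre-estimate from very few runs. A minimal sketch using the standard 8-run design for up to 7 two-level factors; the response values here are made up for illustration:

```python
import numpy as np

def plackett_burman_8():
    # Standard 8-run Plackett-Burman design: cyclic shifts of the
    # generator row + + + - + - -, plus a final all-minus run.
    gen = np.array([1, 1, 1, -1, 1, -1, -1])
    rows = [np.roll(gen, i) for i in range(7)]
    rows.append(-np.ones(7, dtype=int))
    return np.array(rows)                      # shape (8, 7): runs x factors

def rank_attributes(responses, design):
    # Main effect of factor j: mean response at its high level minus
    # mean response at its low level (each level appears in 4 runs).
    effects = (design.T @ responses) / (len(design) / 2)
    return np.argsort(-np.abs(effects))        # most influential first

design = plackett_burman_8()
# `responses` would be measured execution times for the 8 screening runs.
responses = np.array([10.2, 7.5, 8.1, 12.0, 6.9, 11.4, 9.8, 13.1])
print(rank_attributes(responses, design))
```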
Choosing New Values
• Select a new value sample to train the selected predictor function with the chosen set of attributes.
• A range of approaches balances coverage vs. interactions:
  – Binary search/bracketing at one end; Plackett-Burman designs to identify interactions at the other.
  – The spectrum is characterized as La-Ib, where a = the number of levels per value and b = the degree of interactions.
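As one point on that spectrum, here is a hypothetical sketch of the binary search/bracket strategy: sample the midpoint of the interval where the current model predicts worst. The `error_at` helper is an assumed placeholder for some measure of model error within a bracket.

```python
def next_value_by_bracketing(sampled_values, error_at):
    """sampled_values: sorted attribute values already tried.
    error_at: maps a (lo, hi) bracket to the model's error inside it."""
    brackets = list(zip(sampled_values, sampled_values[1:]))
    lo, hi = max(brackets, key=error_at)       # worst-predicted region
    return (lo + hi) / 2                       # bisect it

# e.g., CPU speeds (GHz) tried so far:
# next_value_by_bracketing([1.0, 2.0, 3.0], error_at=my_error_fn)
```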
Experimental Results
• Biomedical applications: BLAST, fMRI, NAMD, CardioWave.
• Resources: 5 CPU speeds, 6 network latencies, 5 memory sizes.
  – 5 × 6 × 5 = 150 resource assignments.
• Goal: learn an execution time model with the least number of training assignments.
• A separate test set evaluates the accuracy of the current model.
BLAST Application
• Total time for all 150 assignments: 130 hrs.
• Active sampling: 5 hrs, i.e., 2% of the sample space.
• With an incorrect order of predictor refinement: 12 hrs, or 10% of the sample space.
BLAST Application
• Total time for all 150 assignments: 130 hrs.
• Active sampling: 5 hrs, i.e., 2% of the sample space.
• With an incorrect order of attribute refinement: 12 hrs, or 10% of the sample space.
Summary/Conclusions
• Current SLT (statistical learning): given the right data, learn the right model.
• Use active sampling to acquire the right data.
• Ongoing experiments demonstrate the importance/potential of guided active sampling.
  – 2% of the sample space yields >= 90% model accuracy.
• Upcoming VLDB paper...