
Active Sampling for Accelerated Learning of Performance Models

Piyush Shivam, Shivnath Babu, Jeffrey S. Chase
Duke University, Durham NC 27708
{shivam,shivnath,chase}@cs.duke.edu

Abstract

Models of system behavior are useful for prediction, diagnosis, and optimization in self-managing systems. Statistical learning approaches have an important role in part because they can infer system models automatically from instrumentation data collected as the system operates. Work in this domain often takes a "given the right data, we learn the right model" approach, and leaves the issue of acquiring the "right data" unaddressed. This gap is problematic for two reasons: (i) the accuracy of the models depends on adequate coverage of the system operating range in the observations; and (ii) it may take a long time to obtain adequate coverage with passive observations. This paper describes our approach to bridging this gap in the NIMO system, which incorporates active learning of performance models in a self-managing computing utility.

1 Introduction

Statistical learning techniques (SLT) have the potential to simplify a variety of system management tasks. Several recent systems have used SLT to infer quantitative models of application performance to enable effective resource allocation for repeated workloads. For example, [3] exposes potential causes of degraded transaction response time by identifying system-level metrics correlated with response time. In databases, [8] uses SLT to cost query-execution plans, enabling a query optimizer to adapt automatically to changes in workloads and system environments. NIMO [6] learns models to predict the performance of frequently used applications (initially batch compute tasks) on heterogeneous compute and storage resources.

The general approach taken in these systems and many other related works is to transform the specific system-management problem into a problem of learning a statistical model that fits a set of m sample data points collected as the system operates. Each sample xi (1 ≤ i ≤ m) is a point in a high-dimensional space with n dimensions (attributes) X1, X2, ..., Xn. For example, a sample s in NIMO represents a complete run of a task graph G on a set ~R of resources assigned to run G; s has the general form 〈ρ1, ρ2, ..., ρk, t〉, where each ρi is a hardware attribute of ~R (e.g., CPU speed or disk seek time), and t is the completion time of G on ~R.

Given the set of m samples x1, ..., xm, an appropriate model can be fitted to the data. For example, NIMO learns multivariate regression models that can predict the completion time t from the hardware attributes 〈ρ1, ρ2, ..., ρk〉 of the resources assigned to run G. Some challenges arise in this setting:

• Limitations of passive sampling. Samples collected only by passive observations of system behavior may not be representative of the full system operating range (e.g., system behavior on a flash crowd may never be observed), limiting the accuracy of models learned from such samples.

• Cost of sample acquisition. Acquiring a sample may have a nonnegligible cost. For example, a sample 〈ρ1, ρ2, ..., ρk, t〉 in NIMO "costs" time t to acquire. High costs, e.g., for long-running batch tasks, limit the rate at which samples can be acquired for learning.¹

• Curse of dimensionality. As the dimensionality (n) of the samples increases, the number (m) of samples needed to maintain the accuracy of the learned model can increase exponentially.

In NIMO, we seek to address these challenges through active sampling of candidate resource assignments chosen to accelerate convergence to an accurate model.

Active (or proactive) sampling acquires samples of system behavior by planning experiments that perturb the target system to expose the relevant range of behavior for each application G. In NIMO, each experiment deploys G on a sample candidate assignment ~R, either to serve a real request, or proactively to use idle or dedicated resources (a "workbench") to refine the model for G.

Active sampling with acceleration seeks to reduce the time before a reasonably accurate model is available, as depicted in Figure 1. The x-axis shows the progress of time for collecting samples and learning models, and the y-axis shows the accuracy of the best model learned so far. While passive sampling may never collect a fully representative set of samples, active sampling without acceleration may take longer to converge to accurate models.

This paper outlines NIMO's approach to active sampling for accelerated learning.

¹ Acquiring samples corresponding to a mere 1% of a 10-dimensional space with 10 distinct values per dimension, at an average sample-acquisition time of 5 minutes, takes around 951 years!
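The footnote's arithmetic can be checked directly; this quick sketch uses only the footnote's own figures:

```python
# Check the footnote's estimate: sampling 1% of a 10-dimensional space
# with 10 distinct values per dimension, at 5 minutes per sample.
total_points = 10 ** 10              # 10 values in each of 10 dimensions
samples = int(0.01 * total_points)   # a "mere 1%" of the space
minutes = samples * 5                # 5 minutes per sample acquisition
years = minutes / (60 * 24 * 365)
print(samples)                       # 100000000 samples
print(round(years))                  # 951 years
```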


Figure 1: Active and accelerated learning. [Figure: accuracy of the current best model (y-axis) vs. time (x-axis) for passive sampling, active sampling without acceleration, and active sampling with acceleration; the accelerated curve reaches a fairly accurate, usable model first and approaches the maximum prediction accuracy achievable with the learned model.]

2 Overview of NIMO

The NIMO (NonInvasive Modeling for Optimization) system generates effective resource assignments for batch workflows running on networked utilities. Figure 2 shows NIMO's overall architecture, consisting of: (i) a workbench for doing experiments and collecting samples; (ii) a modeling engine, composed of a resource profiler and an application profiler, that drives sample collection to learn performance models for batch workflows; and (iii) a scheduler that uses learned performance models to find good resource assignments for batch workflows. NIMO is active: it deploys and monitors applications on heterogeneous resource assignments so that it can collect sufficient training data to learn accurate models. NIMO is also noninvasive: it gathers the training data from passive instrumentation streams, with no changes to operating systems or application software.

A batch workflow G input to NIMO consists of one or more batch tasks linked in a directed acyclic graph representing task precedence and data flow (e.g., [1]). In this paper, we speak of G as an individual task, although the approach generalizes to build models of workflows composed of multiple tasks. NIMO builds a performance model M(G, ~R) that can predict G's completion time on a resource assignment ~R comprising the hardware resources (e.g., compute, network, and storage) assigned simultaneously to run G. Note that accurate prediction of completion time is a prerequisite for many current provisioning and scheduling systems, e.g., [2].

NIMO models the execution of a task G on a resource assignment ~R as a multi-stage system where each stage corresponds to a resource Ri ∈ ~R. We characterize the work done by G in the ith stage as G's "occupancy" oi of Ri, defined as the average time spent by G on Ri per unit of data. Intuitively, oi captures the average delay incurred on Ri by each unit of data flow in the multi-stage system. For a known volume of data flow D, G's completion time t for an n-stage system is given as:

t = D × (o1 + o2 + · · · + on)

G's application profile captures the application behavior resulting from G's interaction with its resources ~R, whose properties are characterized by ~R's resource profile.
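The cost model above can be sketched in a few lines; the resource names and occupancy values below are illustrative, not measurements from NIMO:

```python
# Completion-time model: t = D * sum(o_i) for an n-stage system, where
# o_i is G's average occupancy (time per unit of data) on stage i.
def completion_time(data_volume, occupancies):
    return data_volume * sum(occupancies)

# Hypothetical occupancies (seconds per MB) for compute, network, storage:
occ = {"cpu": 0.4, "net": 0.1, "disk": 0.25}
t = completion_time(1000, occ.values())   # 1000 MB of data flow
print(t)                                  # 750.0 seconds
```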

Figure 2: Architecture of NIMO

Resource Profile. The resource profile ~ρ is a vector of hardware attributes 〈ρ1, ρ2, ..., ρk〉 of ~R that are measurable independently of any application. For example, the resource profile of a compute resource may contain the processor speed and memory latency as attributes. NIMO generates resource profiles by running standard benchmark suites on the resources.

Application Profile. G's application profile is a vector of predictors 〈f1(~ρ), f2(~ρ), ..., fn(~ρ)〉, where fi(~ρ) predicts G's occupancy oi of resource Ri ∈ ~R as a function of the resource profile ~ρ of assignment ~R: oi = fi(~ρ). Section 3 explains how NIMO builds each predictor automatically through active and accelerated learning.

3 Active Learning of Predictors

NIMO's application profiler uses active sampling for accelerated learning of the predictors comprising a task G's application profile. The application profiler uses a workbench consisting of heterogeneous compute and storage resources and a network link emulator (nistnet). Learning a predictor fi(~ρ) from training data is nontrivial for several reasons: (i) the curse of dimensionality: the resource profile ~ρ = 〈ρ1, ..., ρk〉 may contain many attributes; (ii) the training data must expose the full system operating range; and (iii) the high cost of data acquisition: collecting a sample 〈ρ1, ..., ρk, o1, ..., on〉, where 〈o1, ..., on〉 represents G's occupancies on assignment ~R with profile 〈ρ1, ..., ρk〉, involves running G to completion on ~R. (Recall that fi predicts oi as a function of a subset of the attributes ρ1, ..., ρk.)

Algorithm 1 summarizes the main steps to learn the predictors. (Details of these steps are discussed in Sections 3.1–3.5.) The algorithm consists of an initialization step and a loop that continuously refines predictor accuracy. It acquires new samples by deploying and monitoring G on selected resource assignments instantiated in the workbench.

Algorithm 1: Active and accelerated learning of predictors f1(ρ1, ..., ρk), ..., fn(ρ1, ..., ρk) for task G

1) Initialize: f1 = f2 = ... = fn = 1;
2) Design the next experiment (Sections 3.1–3.3):
   2.1 Select a predictor for refinement, denoted f;
   2.2 Should an attribute from ρ1, ..., ρk be added to the set of attributes already used in f? If yes, then pick the attribute to be added;
   2.3 Select new assignment(s) to refine f using the set of attributes from Step 2.2;
3) Conduct the chosen experiment (Section 3.4):
   3.1 In the workbench, run G on the assignment(s) picked in Step 2.3;
   3.2 After each run, generate the corresponding sample 〈ρ1, ..., ρk, o1, ..., on〉, where o1, ..., on are the observed occupancies;
   3.3 Refine f based on the new sample set;
4) Compute current accuracy (Section 3.5): Compute the current accuracy of each predictor. If the overall accuracy of predicting completion time is above a threshold, and a minimum number of samples have been collected, then stop; else go to Step 2.

3.1 Sequence of Exploring Predictors

In each iteration of Algorithm 1, Step 2.1 picks a specific predictor to refine. NIMO chooses the predictor whose refinement contributes the most to the model's overall accuracy. We are exploring both static and dynamic schemes for guiding the choice of predictors.

• Static scheme: In a static scheme, NIMO first decides the order in which to refine the predictors, and then defines the traversal plan for choosing the predictor to refine in each iteration.

Order: NIMO obtains the refinement order either from a domain expert, or from a relevance-based design (e.g., Plackett-Burman [7]). The relevance-based techniques determine the relevance of independent variables (predictors) to a dependent variable (completion time).

Traversal: NIMO uses two traversal techniques: (i) round-robin, where it picks the predictor for refinement in round-robin fashion from the given total order; and (ii) improvement-based, where NIMO traverses the predictors sequentially in order, and refines each predictor until the improvement in accuracy for that predictor drops below a threshold. Section 3.5 explains how NIMO obtains the current accuracy of a predictor.

• Dynamic scheme: In a dynamic scheme, NIMO does not use any predefined order for refining the predictors. In each iteration, it picks the predictor with the maximum current prediction error.
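The refinement loop with the dynamic scheme can be sketched as follows. This is an illustrative skeleton, not NIMO's implementation: `run_experiment`, `refine`, and `current_error` are hypothetical stand-ins for the workbench run, model update, and accuracy evaluation described in Sections 3.3–3.5.

```python
# Sketch of Algorithm 1's outer loop with the *dynamic* scheme:
# each iteration refines the predictor with the largest current error.
def active_learning_loop(predictors, current_error, refine, run_experiment,
                         error_threshold=0.1, min_samples=10):
    samples = []
    while True:
        # Step 2.1 (dynamic scheme): pick the worst-performing predictor.
        f = max(predictors, key=current_error)
        # Steps 2.3 and 3: choose an assignment, run G, observe occupancies.
        samples.append(run_experiment(f))
        refine(f, samples)                       # Step 3.3
        # Step 4: stop once enough samples exist and all predictors
        # are accurate enough; otherwise loop back to Step 2.
        if (len(samples) >= min_samples and
                all(current_error(p) <= error_threshold for p in predictors)):
            return samples
```

With a toy error model in which each refinement halves a predictor's error, the loop alternates between the worst predictors and stops as soon as both the sample quota and the accuracy threshold are met.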

Figure 3: Techniques for selecting new sample assignments (PB = Plackett-Burman [7]). [Figure: techniques arranged along two axes, coverage of the operating range (low to high) and capturing interactions (low to high): an independence-assumption design (Lmax-I1), the basic PB design, the PB design with foldover (L2-I2), and the full factorial design (Lmax-Imax).]

3.2 Adding New Attributes to Predictors

Step 2.2 of Algorithm 1 decides when to add a new attribute in ~ρ to a predictor f(~ρ), and if so, which of the k attributes to add. As in Section 3.1, NIMO's twofold strategy is to first define a total order over the attributes ρ1, ..., ρk, and then to define a traversal plan based on this order to select attributes for inclusion in f(~ρ).

NIMO obtains the total ordering of attributes as described in Section 3.1. Once NIMO decides the order in which to add attributes, it considers two approaches for deciding when to add the next attribute:

1. Add the next attribute when the marginal improvement in a predictor's accuracy (with the current set of attributes) per newly added sample drops below a predetermined threshold.

2. At each point in time, maintain two competing attribute sets for each predictor. When the accuracy of one candidate dominates the other, add the next attribute (in the order) to the losing attribute set.
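The first rule above can be sketched as a simple plateau test. The function name, threshold value, and accuracy numbers here are all illustrative assumptions, not part of NIMO:

```python
# Sketch of attribute-addition rule 1: add the next attribute in the
# given order once the marginal accuracy improvement per new sample
# drops below a threshold.
def should_add_attribute(accuracy_history, threshold=0.01):
    """accuracy_history: predictor accuracy after each new sample,
    using the current attribute set."""
    if len(accuracy_history) < 2:
        return False                      # too few samples to judge
    marginal = accuracy_history[-1] - accuracy_history[-2]
    return marginal < threshold

print(should_add_attribute([0.60, 0.72, 0.80]))    # still improving: False
print(should_add_attribute([0.80, 0.805, 0.806]))  # plateau: True
```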

3.3 Selecting New Sample Assignment(s)

Step 2.3 of Algorithm 1 chooses new assignments on which to run task G and collect new samples for learning. For each new assignment ~R, we need to choose the values for each attribute ρi in ~R's resource profile, while accounting for:

1. The coverage of the operating range of each attribute, to improve accuracy over a wider range of attribute values. For example, limited variation of CPU speeds and network latencies may fail to expose latency-hiding behavior due to prefetching [6].

2. The important interactions among attributes. For example, changing CPU speed or network latency affects the occupancy of the storage resource [6].

Figure 3 shows some of the sample selection techniques in terms of their general performance and tradeoffs on the two metrics above. We use an Lα-Iβ naming format, where (i) α represents the number of significant distinct values, or levels, in the attribute's operating range covered by the technique, and (ii) β represents the largest degree of interactions among attributes captured by the technique.

• Lmax-I1: This technique systematically explores all levels of a newly added attribute using a binary-search-like approach. However, it assumes that the effects of attributes are independent of each other, so it chooses values for attributes independently.

• L2-I2: The classic Plackett-Burman design with foldover [7] captures two levels (low and high) per attribute and up to pair-wise interactions among attributes. This technique is an example of the common approach of fractional factorial design.
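One way to realize the binary-search-like level exploration of Lmax-I1 is to probe an attribute's endpoints first and then repeatedly bisect the widest unexplored gap, holding the other attributes fixed. This is a sketch of that idea under the independence assumption; the CPU-speed values are illustrative, not NIMO's actual probe order:

```python
# Lmax-I1-style exploration of one attribute's levels: endpoints first,
# then midpoints of the widest remaining gaps, up to max_probes probes.
def explore_levels(levels, max_probes):
    probes = [0, len(levels) - 1]            # low and high endpoints
    while len(probes) < max_probes:
        gaps = list(zip(probes, probes[1:]))
        lo, hi = max(gaps, key=lambda g: g[1] - g[0])
        if hi - lo < 2:
            break                            # every level already covered
        probes.append((lo + hi) // 2)        # bisect the widest gap
        probes.sort()
    return [levels[i] for i in probes]

cpu_mhz = [451, 600, 797, 996, 1396]         # hypothetical CPU-speed levels
print(explore_levels(cpu_mhz, max_probes=3)) # [451, 797, 1396]
```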

3.4 Performing the Selected Experiment

In Step 3 of Algorithm 1, NIMO instantiates the assignment selected in Step 2.3 in the workbench and runs the task. The noninvasive instrumentation data collected from the run, and its analysis to derive occupancy measures, are described in [6].

3.5 Evaluating Accuracy

NIMO uses two techniques for computing the current prediction error of a predictor.

• Cross-Validation: We use leave-one-out cross-validation to estimate the current accuracy of each predictor. For each sample s of the m samples, we learn predictor fi using all samples other than s. The learned predictors are used to predict the occupancies for s, and the prediction accuracy is noted. The individual accuracy of each predictor and the overall prediction accuracy are averaged over the m samples.

• Fixed test set: We designate a small subset of resource assignments in the workbench as a test set, e.g., a random subset of the possible assignments in the workbench. The current prediction error of f(~ρ) is computed as the average of f(~ρ)'s percentage error in predicting occupancy on each assignment in the test set. Note that the samples collected from the test set are never used as training samples for f(~ρ).
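The leave-one-out procedure can be sketched for a single-attribute linear predictor (NIMO's prototype fits multivariate linear regression; one attribute keeps the sketch short). The data here are synthetic: an occupancy that is exactly linear in a hypothetical CPU-speed attribute, so the cross-validated error should be essentially zero.

```python
# Minimal leave-one-out cross-validation for a linear occupancy
# predictor o = a*rho + b, fit by ordinary least squares.
def fit_line(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys)) /
             sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

def loocv_percent_error(xs, ys):
    errs = []
    for i in range(len(xs)):
        # Train on all samples except sample i, then predict sample i.
        a, b = fit_line(xs[:i] + xs[i+1:], ys[:i] + ys[i+1:])
        errs.append(abs((a * xs[i] + b) - ys[i]) / ys[i])
    return 100 * sum(errs) / len(errs)

speeds = [0.45, 0.8, 1.0, 1.2, 1.4]          # hypothetical CPU speeds (GHz)
occ = [2.0 - 1.0 * s for s in speeds]        # synthetic occupancy, o = 2 - s
print(round(loocv_percent_error(speeds, occ), 6))   # 0.0 (noise-free data)
```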

4 Discussion

4.1 Curse of Dimensionality

The exploration strategies discussed in Sections 3.1 and 3.2 help NIMO cope with the curse of dimensionality by considering predictors and attributes iteratively. For example, the relevance-based ordering guides NIMO to include only the predictors and attributes that are most relevant to the model's overall accuracy. For a batch task, NIMO may quickly identify that it is compute-intensive for the range of attribute values of interest, and guide sample acquisition to refine the predictor for the compute resource only.

4.2 Vertical vs. Horizontal Compression

Algorithm 1 attempts to reduce the overall learning time by acquiring only the relevant samples. We term this methodology vertical compression, since it reduces the total number of data points in the training-set table. An alternative is to reduce the run time for a task G to acquire a sample. For example, if G has very periodic behavior, then we can learn G's occupancies on a resource assignment based on a few of G's initial periods. (Recall that occupancies are averages per unit of data flow.) This methodology, termed horizontal compression, can be applied in conjunction with vertical compression to further reduce the overall learning time.

4.3 Workbench

The configuration of the workbench where NIMO conducts experiments poses some interesting questions:

• Does the workbench replicate actual resources from the target system, or emulate them? While replicating actual resources leads to accurate empirical data, it may restrict the range of resource assignments. One alternative is to emulate a variety of hardware resources over real hardware through virtualization.

• Should the system learn from production runs instead of (or in addition to) runs on a separate workbench? For example, NIMO can harness underutilized resources in the utility, or NIMO's scheduler can perform online experiments through controlled resource assignments to production runs of tasks. However, having a workbench as a separate entity allows NIMO to circumvent the traditional exploitation-versus-exploration tradeoff by allowing both to occur in parallel.

4.4 Extended Iterative Learning

When NIMO's scheduler actually uses a task G's performance model learned on the workbench, it may discover significant prediction errors for some resource assignment ~R. Such errors may arise because Algorithm 1 failed to capture (i) the effect of a resource attribute, (ii) an interaction among attributes, or (iii) the correct operating range of an attribute. This scenario is similar to one where Step 4 of Algorithm 1 detects low prediction accuracy on an assignment. In such an event, NIMO can bring G back to the workbench and update G's predictors by continuing Algorithm 1 where it left off at Step 4, with ~R now included in the set of samples.

5 Preliminary Experimental Results

We are currently exploring accelerated learning using active sampling for several production batch tasks. We provide preliminary results for one such task, fMRI, a statistical parametric application used in biomedical imaging. We consider the execution of this task on three resources (a compute node, a network, and a storage array), corresponding to learning the compute, network, and storage predictors, respectively. Currently, our NIMO prototype uses multivariate linear regression to learn the predictors. We are also exploring the potential of more sophisticated learning algorithms such as regression splines [5].

Figure 4: This graph compares a round-robin active sampling strategy to an improvement-based active sampling strategy for learning predictors for fMRI. [Figure: prediction accuracy (%) on the y-axis vs. learning time in minutes on the x-axis; mean and standard deviation curves for the improvement-based and round-robin strategies.] Prediction accuracy increases immediately under the improvement-based strategy.

The important resource-profile attributes in this experiment were found to be CPU speed, memory size, and network latency. NIMO's workbench comprises 5 CPUs with speeds varying from 451 MHz to 1396 MHz, 4 memory sizes ranging from 256 MB to 2 GB, and 10 different network latencies (0, 2, ..., 18 ms) between the compute and storage nodes. Therefore, the total sample space is 5 × 4 × 10 = 200 assignments.
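The size of this sample space follows directly from the grid; a quick sketch (only the endpoint values come from the text, the intermediate CPU and memory levels are illustrative):

```python
from itertools import product

# Workbench grid: 5 CPU speeds x 4 memory sizes x 10 network latencies.
cpu_mhz = [451, 700, 933, 1100, 1396]   # endpoints per the text
mem_mb = [256, 512, 1024, 2048]         # 256 MB to 2 GB
latency_ms = list(range(0, 20, 2))      # 0, 2, ..., 18 ms

assignments = list(product(cpu_mhz, mem_mb, latency_ms))
print(len(assignments))                 # 200 candidate assignments
```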

We compare two active sampling strategies to illustrate the potential of active sampling for accelerated learning. Exhaustive evaluation of all strategies is beyond the scope of this paper. We use round-robin traversal of predictors in one case, and improvement-based traversal in the other (Section 3.1). The latter strategy learns an accurate predictor for the network resource first, then the predictor for the CPU resource, and finally the predictor for the storage resource. In both strategies we use static ordering of predictors and attributes, and the Lmax-I1 technique for selecting samples (Section 3.3), where we vary the operating range of one resource at a time while keeping the others constant.

Figure 4 shows the accelerated learning in NIMO using active sampling. The x-axis shows the progress of time for collecting samples and learning predictors, and the y-axis shows the mean and standard deviation of fMRI's completion-time prediction accuracy on 15 test assignments chosen randomly from the 200 candidate assignments. The figure shows that the round-robin strategy learns an accurate function quickly, using less than 5% of the total sample assignments in less than 100 minutes. (The total time to run fMRI on all 200 assignments is around 5 days.) The round-robin strategy converges faster to accurate predictions because it refines the CPU and I/O predictors together, without waiting for one predictor to be learned properly. This result shows the potential of accelerated learning: active sampling is able to quickly expose the primary factors that affect performance, and hence learn accurate predictors.

6 Related Work

The theory of design of experiments (DOE) is a branch of statistics that studies the planned investigation of factors affecting system performance. Active learning [4] from machine learning deals with the issue of picking the next sample that provides the most information to maximize some objective function. NIMO uses both DOE and active learning in an iterative fashion to enable effective resource assignment in networked utilities. Recently, the computer architecture community has also looked at DOE for improving simulation methodology [7].

7 Conclusion

Current work that applies SLT to system management often takes a "given the right data, we learn the right model" approach, and leaves the issue of acquiring the "right data" unaddressed. NIMO bridges this gap by using active sampling strategies for accelerated model learning. Our preliminary experiments with NIMO indicate its potential for significant reduction in the training data and learning time required to generate fairly accurate models.

References

[1] J. Bent, D. Thain, A. C. Arpaci-Dusseau, R. H. Arpaci-Dusseau, and M. Livny. Explicit Control in a Batch-Aware Distributed File System. In Proceedings of the USENIX Symposium on Networked Systems Design and Implementation, Mar 2004.

[2] H. Casanova, G. Obertelli, F. Berman, and R. Wolski. The AppLeS Parameter Sweep Template: User-Level Middleware for the Grid. In Proceedings of the ACM/IEEE Conference on Supercomputing, Nov 2000.

[3] I. Cohen, J. S. Chase, M. Goldszmidt, T. Kelly, and J. Symons. Correlating Instrumentation Data to System States: A Building Block for Automated Diagnosis and Control. In Proceedings of the USENIX Symposium on Operating Systems Design and Implementation, Dec 2004.

[4] V. Fedorov. Theory of Optimal Experiments. Academic Press, 1972.

[5] J. Friedman. Multivariate Adaptive Regression Splines. Annals of Statistics, 19(1):1–141, 1991.

[6] P. Shivam, S. Babu, and J. Chase. Learning Application Models for Utility Resource Planning. In Proceedings of the International Conference on Autonomic Computing, Jun 2006.

[7] J. J. Yi, D. J. Lilja, and D. M. Hawkins. A Statistically Rigorous Approach for Improving Simulation Methodology. In Proceedings of the International Symposium on High-Performance Computer Architecture, Feb 2003.

[8] N. Zhang, P. Haas, V. Josifovski, G. Lohman, and C. Zhang. Statistical Learning Techniques for Costing XML Queries. In Proceedings of the International Conference on Very Large Data Bases, Aug–Sep 2005.