Predictive Performance Modeling of Virtualized Storage ... · Performance modeling and evaluation...

Predictive Performance Modeling ofVirtualized Storage Systems using

Optimized Statistical Regression Techniques

Qais Noorshams, Dominik Bruhn, Samuel Kounev, Ralf ReussnerChair for Software Design and Quality

Karlsruhe Institute of TechnologyKarlsruhe, Germany

{noorshams, kounev, reussner}@kit.edu,[email protected]

ABSTRACTModern virtualized environments are key for reducing theoperating costs of data centers. By enabling the sharing ofphysical resources, virtualization promises increased resourceefficiency with decreased administration costs. With theincreasing popularity of I/O-intensive applications, however,the virtualized storage used in such environments can quicklybecome a bottleneck and lead to performance and scalabili-ty issues. Performance modeling and evaluation techniquesapplied prior to system deployment help to avoid such is-sues. In current practice, however, virtualized storage andits performance-influencing factors are often neglected ortreated as a black-box. In this paper, we present a measure-ment-based performance prediction approach for virtualizedstorage systems based on optimized statistical regression tech-niques. We first propose a general heuristic search algorithmto optimize the parameters of regression techniques. Then,we apply our optimization approach and create performancemodels using four regression techniques. Finally, we presentan in-depth evaluation of our approach in a real-world rep-resentative environment based on IBM System z and IBMDS8700 server hardware. Using our optimized techniques,we effectively create performance models with less than 7%prediction error in the most typical scenario. Furthermore,our optimization approach reduces the prediction error byup to 74%.

Categories and Subject DescriptorsC.4 [Performance of Systems]: Modeling techniques, Per-formance attributes; G.1.6 [Numerical Analysis]: Opti-mization; I.2.8 [Artificial Intelligence]: Problem Solving,Control Methods, and Search—Heuristic methods

KeywordsI/O, Storage, Performance, Prediction, Virtualization

Permission to make digital or hard copies of all or part of this work forpersonal or classroom use is granted without fee provided that copies arenot made or distributed for profit or commercial advantage and that copiesbear this notice and the full citation on the first page. To copy otherwise, torepublish, to post on servers or to redistribute to lists, requires prior specificpermission and/or a fee.ICPE’13, March 21–24, 2013, Prague, Czech Republic.Copyright 2013 ACM 978-1-4503-1636-1/13/04 ...$15.00.

1. INTRODUCTIONThe high growth rate of modern IT systems and servicesdemands a powerful yet flexible and cost-efficient data centerlandscape. Server virtualization is a key technology used inmodern data centers to address this challenge. By enablingthe shared resource usage by multiple virtual systems, virtu-alization promises increased resource efficiency and flexibilitywith decreased administration costs. Furthermore, moderncloud environments enable new pay-per-use cost models cou-pled with flexible on-demand resource provisioning.

Modern cloud applications increasingly have I/O-intensiveworkload profiles (cf. [2]), e.g., mail or file server applica-tions are often deployed in virtualized environments. Withthe increasing popularity of such applications, however, thevirtualized storage of shared environments can quickly be-come a bottleneck and lead to unforeseen performance andscalability issues. Performance modeling and evaluation tech-niques applied prior to system deployment help to avoid suchissues and ensure that systems meet their quality-of-servicerequirements.

In current practice, however, virtualized storage and itsperformance-influencing factors are often neglected or treatedas a black-box due to their complexity. Several modelingapproaches considering I/O-intensive applications in virtual-ized environments have been proposed, e.g., [1, 17], however,without explicitly considering storage configuration aspectsand their influences on the overall system performance. More-over, the increasing complexity of modern virtualized storagesystems poses challenges for classical performance modelingapproaches.

In this paper, we present a measurement-based perfor-mance prediction approach for virtualized storage systems.We derive optimized performance models based on statisticalregression techniques capturing the complex behaviour ofthe virtualized storage system. Furthermore, the modelsexplicitly consider the influences of workload profiles andstorage configuration parameters. More specifically, we firstpropose a general heuristic search algorithm to optimize theparameters of regression techniques. This algorithm is notlimited to a certain domain and can be used as a generalregression optimization method. Then, we apply our opti-mization approach and create performance models based onsystematic measurements using four regression techniques.Finally, we evaluate the models in different scenarios to assesstheir prediction accuracy and the improvement achieved by

283

our optimization approach. The scenarios comprise interpo-lation and extrapolation scenarios as well as scenarios whenthe workload is distributed on multiple virtual machines. Weevaluate our approach in a real-world environment basedon IBM System z and IBM DS8700 server hardware. Usingour optimized techniques, we effectively create performancemodels with less than 7% prediction error in the most typ-ical interpolation scenario. Furthermore, our optimizationapproach reduces the prediction error by up to 74% withstatistical significance.

In summary, the contribution of this paper is multifold:i) We present a measurement-based performance modelingapproach for virtualized storage systems based on four statis-tical regression techniques. ii) We propose a general heuristicoptimization method to maximize the predictive power of sta-tistical regression techniques. To the best of our knowledge,this is the first automated regression optimization method ap-plied for performance modeling of virtualized storage systems.iii) We validate our approach in a real-world environmentbased on the state-of-the-art server technology of the IBMSystem z and IBM DS8700. iv) We comprehensively evaluateour approach in multiple aspects including interpolation andextrapolation scenarios, scenarios with multiple virtual ma-chines, and the improvement achieved by our optimizationapproach.

The remainder of this paper is organized as follows: Sec-tion 2 further motivates our approach. Section 3 discussesthe regression techniques we considered for performance mod-eling. In Section 4, we introduce our system environmentand methodology. Section 5 presents our heuristic optimiza-tion algorithm. In Section 6, we extensively evaluate ourperformance models and optimization approach. Finally,Section 7 presents related work, while Section 8 summarizesand concludes with the lessons learned.

2. MOTIVATIONThe use of virtualization technology has significant impli-cations on the landscape of modern data centers. By con-solidating multiple systems to increase resource efficiency,virtualization presents today’s systems with increasing chal-lenges to meet the non-functional requirements. Especiallyfor virtualized storage systems, this trend is evident. At thecost of highly increased complexity, storage systems haveevolved significantly from simple disks to sophisticated sys-tems with multiple caching and virtualization layers. Further-more, modern virtualized storage systems feature intricatescheduling and optimization strategies.

Without any doubt, there are a variety of performanceinfluences of virtualized storage systems (cf. [23]), e.g., thearchitecture of the virtualization platform and the implica-tions of workload consolidation. Moreover, the many logicallayers between the application and the physical storage leadto complex effects and interdependencies of influencing fac-tors, which are more and more difficult to quantify. Theeffect of a few simple influencing factors alone leads to a widerange of performance characteristics. Illustrated in Figure 1afor instance, the response time of an application spreads bymore than the factor of 35 when varying five factors, e.g.,the request size or the I/O scheduler. Moreover, Figure 1bexemplifies a shift in the influencing factors. The CFQ sched-uler, which is the standard I/O scheduler since many years,impairs performance drastically for certain configurations.Consequently, the performance traits of an application de-

0

20

40

60

Res

ponse

Tim

e(m

s)

(a) Distribution

0

20

40

60

Request Size,

I/O Scheduler

Small, CFQ

Small, NOOP

Large, CFQ

Large, NOOP

(b) Examples

Figure 1: Response Time Depending on Influencing Factors

ployed in such an environment are highly unclear and mightbe also subject to change.

This development is the motivation for our approach. Theincreasing complexity of modern virtualized storage systemsposes challenges for classical performance modeling approach-es to manually create, calibrate, and validate explicit perfor-mance models within a limited time period. The effects andinterdependencies of the influencing factors are more andmore complex, which needs to be accounted for and quanti-fied. Furthermore, the effects are subject to change and needto be captured flexibly as the systems keep evolving. There-fore, we target at an automated measurement-based approachto virtualized storage performance modeling based on regres-sion techniques. Our goal is to create practical performancemodels while lifting the burden for performance engineers tomanually develop them. Building a regression-based perfor-mance model, however, is also not straightforward due to themany techniques and their parameters. To briefly introducethe techniques we considered, the following section analyzesmultiple regression techniques and their parameters.

3. REGRESSION TECHNIQUESRegression techniques are used to model the effect of indepen-dent variables on dependent variables, e.g., the effect of I/Orequest sizes on the request response time. In this section, weintroduce and analyze the regression techniques we appliedfor modeling the performance of virtualized storage systemsand discuss their parameters. Furthermore, we highlight therelationships between the considered techniques.

3.1 Linear Regression ModelsOne of the most popular techniques is based on linear regres-sion models (LRM) [12] which assume a linear relationshipbetween the dependent variable and the independent vari-ables. Thus, the dependent variable is represented as a linearcombination of the independent variables with an addedconstant intercept. More formally, for a vector of indepen-dent variables ~x := (x1, . . . , xn) a model f with coefficientsβ0, . . . , βn is created of the form

f(~x) = β0 +∑ni=1 βixi.

The model f , however, does not necessarily have to belinear in the independent variables xi. The independentvariables can be derived, e.g., x2 = x2

1 or x3 = x1x2. Forthe sake of clarity, we explicitly exclude such definitions andspecifically consider two forms: The linear regression modelsare explicitly linear in the independent variables as well. Thelinear regression models with interaction can include terms

284

expressing the interaction between variables, e.g., x1x2. Thismodel is of maximum degree n in the independent variablesxi, however, the effects of the independent variables remainlinear in the model. When creating a model based on a setof training data, the coefficients of the model are determinedto minimize the error between the model and the trainingdata.

Model Derivation. The most popular approach to create amodel is the method of least squares. For a set of trainingdata {(~x1, y1), . . . , (~xN , yN )} where ∀i: ~xi := (xi1, . . . , xin),

this method finds the coefficients ~β := (β0, . . . , βn) such that

the residual sum of squares defined as RSS(~β) :=∑Ni=1(yi −

f(~xi))2 is minimized. To find the minimum, the derivative

of RSS is set to zero and solved for ~β. As RSS is a quadraticfunction, the minimum is guaranteed to always exist.

Parameters.− None.

3.2 Multivariate Adaptive Regression SplinesMultivariate Adaptive Regressions Splines (MARS) [9] con-sist of piecewise linear functions (so-called hinge functions).Formally, these functions with so-called knot t are defined as(x− t)+ := max{0, x− t} and (t− x)+ := max{0, t− x}. Let~x = (x1, . . . , xn) be the vector of independent variables asabove, H := {(xi − t)+, (t− xi)+}i=1,...,n and t ∈ T , whereT is the set of observed values of each independent variable.Then, MARS constructs a model f of the form

f(~x) = β0 +∑ni=1 βi hi(~x)

with hi ∈ H and coefficients β0, . . . , βn. Similar to the linearregression models, we explicitly distinguish between MARSand MARS with interactions. The latter also includes hingefunctions that are a product of one or more functions in H.

Model Derivation. The parameters βi are estimated by themethod of least squares as for the linear regression model.The model is constructed iteratively in a forward step. Start-ing from a constant function, the algorithm adds the hingefunctions in H that result in the maximum reduction ofthe residual squared error. If interactions are permitted,the algorithm can also choose a hinge function as a factor.However, each variable can only appear once in a term, i.e.,higher-order powers of a variable are excluded. This stepstops if a predefined number of maximum terms is reachedor if the residual error falls below a predefined threshold.Finally, the model is pruned to avoid overfitting. The termsthat produce the smallest increase in residual squared errorare removed iteratively until a predefined number of termsis reached. Frequently, the pruning step also considers themodel complexity combined with the residual squared error.This results in a trade-off between accuracy and complexityof the model.Parameters.− nnodes: Maximum number of terms in the forward step.− threshold: Maximum acceptable residual error.− nprune: Maximum number of terms after pruning.

3.3 Classification and Regression TreesClassification and Regression Trees (CART) [5] are a groupof algorithms that model data in a tree-based representa-tion. CART models are binary trees with conditions in theirnon-leaf nodes and constant values in their leaf nodes. Todetermine the value of the dependant variable correspondingto a set of values of the independent variables, the evaluation

starts at the root and the condition in this node is checked.If the condition is true, the left edge is followed, otherwisethe right edge. This is repeated until a leaf is reached.

Model Derivation. Similar to MARS, the algorithm com-prises a forward and a pruning step. In the forward step, aninitial tree with a single node is created. The leafs in thetree are split iteratively at a certain splitting variable and asplitting point. The splitting point is determined such thatthe residual squared error is minimized. The algorithm splitsa leaf until it contains less samples than a predefined value.The algorithm stops if no leaf can be split any further. Inthe pruning step, the tree is reduced using cost-complexitypruning (cf. [12]). Let m be the m-th leaf, Rm be the regionspecified by the conditions to the m-th leaf, and l(T ) bethe number of leaves in the tree T . Furthermore, for thetraining data {(~xi, yi)}i, let the number of observations ina region nm := |{~xi | ~xi ∈ Rm}|, the mean of the observa-tions values cm := n−1

m

∑~xi∈Rm

yi, and the mean squareddifference between each observed value and the mean of theobservations values qm(T ) := n−1

m

∑xi∈Rm

(yi − cm)2. Then,the cost complexity criterion with parameter α is defined as

cα(T ) :=∑l(T )m=1 nmqm(T ) + α l(T )

that is to be minimized. The parameter α is a trade-offbetween complexity and goodness-of-fit to the training data.It is determined according to the residual sum of squareswith cross-validation. This splits the training data into twopartitions consisting of model creation and model testingdata. For a given α, the tree is greedily pruned using weakestlink pruning. Iteratively, the node with the smallest per-node

increase in∑l(T )m=1 nmqm(T ) is collapsed.

Parameters.− minsplit: Minimum number of samples to split a leaf.− cp: Maximum acceptable tree complexity after pruning.

3.4 M5 TreesM5 trees [25] are – figuratively speaking – a combination oflinear regression models and CART trees. Similar to CARTtrees, M5 models are binary decision trees with conditions intheir non-leaf nodes. The leaf nodes, however, contain linearregression models instead of the constant values as in CARTtrees. For the prediction of a value, starting in the root node,the conditions are evaluated until a linear regression modelin a leaf is reached. Then, the model is used as the predictor.

Model Derivation. Similar to MARS and CART, the treeis first built in a forward step and pruned in the next step.To build a M5 tree, the initial model consists of a one-nodetree T . The tree is split iteratively into subtrees Ti untila predefined maximum number of splits is reached. M5splits the tree at the condition that maximizes the expectedreduction in error ∆e with

∆e := σ(T )−∑i|Ti||T | σ(Ti),

where σ is the standard deviation. For each leaf, a linearregression model is created. Finally, the M5 model f isgreedily simplified to decrease the complexity c with

c(T ) := n−1 ∑i |yi − f(~xi)| · n+pn−p ,

where {(~xi, yi)}i is the training data, n is the number oftraining data and p is the number of parameters in themodel. The term n+p

n−p allows an increase in error if thecomplexity is decreased. For each leaf node, parameters ofthe linear model are removed, whereas for every non-leafnode, the node is collapsed if this reduces c(T ).Parameters.− nsplits: Maximum number of splits in the forward step.

285

LRM

MARS

CART

M5

Cubist

Piecewiselinearity

Tree structureStep function

CombinationBoosting

Inst.-based

Figure 2: Relation between the Regression Techniques

3.5 Cubist ForestsCubist forests [26, 18] are an extension of M5 trees. Thus,Cubist forests are rule-based model trees with linear modelsin the leaves. Compared to M5, Cubist introduces twoextensions. First, it follows a boosting-like approach, i.e.,it creates a set of trees instead of a single tree. To obtaina single value, the tree predictions are aggregated usingtheir arithmetic mean. Second, it combines model-basedand instance-based learning (cf. [24]), i.e., it can adjust theprediction of unseen points by the values of their nearesttraining points.

Model Derivation. Initially, the maximum number of treesin Cubist is defined to construct a forest. The first tree iscreated using the M5 algorithm. The following trees arecreated to correct the predictions of the training points bythe previous tree ft(~x). Each value of a training point yi ismodified by y′i := 2 yi − ft(~x) to compensate for over- andunder-estimations. Then, the tree creation is repeated. Incontrast to, e.g., Random Forests [4] combining the predictiontrees with the mode operator, Cubist aggregates the valuespredicted by each tree using arithmetic mean. Finally, theprediction of unseen points can be adjusted by the values ofa possibly dynamically determined number of training points(so-called instance-based correction), cf. [24]. The predictionof a new point ~x is adjusted by the weighted mean of thenearest training points (so-called neighbors) with weightwn := 1/(m(~x, ~n) + 0.5) for every neighbor ~n, where m(~x, ~n)is the Manhatten distance of ~x and ~n. M5 is a special caseof Cubist with one tree and no instance-based correction.

Parameters.

− nsplits: Maximum number of splits in the forward step.

− ntrees : Number of trees.

− ninstances : Size of instance-based correction.

3.6 SummaryFigure 2 shows an overview of the considered regressiontechniques illustrating the relationships among them. Asdescribed above, the MARS algorithm can be seen as anextension of LRM by allowing piecewise linear models. How-ever, MARS regulates the number of linear terms. WhileMARS and CART seem relatively different, the forward stepof MARS can be transformed into the one of CART by usinga tree-based structure with step functions, cf. [12]. M5 inturn can be seen as a combination of LRM and CART. How-ever, M5 differs from CART in the complexity criterion andthe pruning procedure. Finally, Cubist extends M5 by intro-ducing a boosting like scheme creating several trees that areaggregated by their mean. Furthermore, Cubist introducesan instance-based correction to include the training data inthe prediction of unseen data.

IBM System z

IBM DS8700

CPU, RAM

Processors,Memory

PR/SM (Hypervisor)

z/VM (Hypervisor)

z/Linuxz/OS

z/Linux

LPAR1 LPAR2

RAID Arrays SSD/HDD

Storage ServerVC

NVC

FibreChannel

SwitchedFibre Channel

Figure 3: IBM System z and IBM DS8700

4. METHODOLOGYIn our approach, we apply statistical regression techniquesto create performance models based on systematic measure-ments. In this section, we present our experimental environ-ment and setup as well as our measurement methodologyand performance modeling approach.

4.1 System Under StudyA typical virtualized environment in a data center consistsof servers providing computational resources connected toa set of storage systems. Such storage systems typicallydiffer significantly from traditional hard disks and nativestorage systems due to the complexity of modern storagevirtualization platforms.

In this paper, we consider a representative virtualized en-vironment based on the IBM mainframe System z and thestorage system DS8700. They are state-of-the-art high-perfor-mance virtualized systems with redundant or hot swappableresources for high availability. The System z combined withthe DS8700 represent a typical virtualized environment thatcan be used as a building block of cloud computing infrastruc-tures. It supports on-demand elasticity of pooled resourceswith a pay-per-use accounting system (cf. [22]). The Sys-tem z provides processors and memory, whereas the DS8700provides storage space. The structure of this environment isillustrated in Figure 3.

The Processor Resource/System Manager (PR/SM) is ahypervisor managing logical partitions (LPARs) of the ma-chine and enabling CPU and storage virtualization. Formemory virtualization and administration purposes, IBMintroduces another hypervisor, z/VM. The System z supportsthe classical mainframe operating system z/OS and specialLinux ports for System z commonly denoted as z/Linux.The System z is connected to the DS8700 via fibre channel.Storage requests are handled by a storage server having avolatile cache (VC) and a non-volatile cache (NVC). Thestorage server is connected via switched fibre channel withSSD or HDD RAID arrays. As explained in [8], the storageserver applies several pre-fetching and destaging algorithmsfor optimal performance. When possible, read-requests areserved from the volatile cache, otherwise they are servedfrom the RAID arrays and stored in the volatile cache forfuture requests. Write-requests are written to the volatile aswell as the non-volatile cache, but they are destaged to theRAID arrays asynchronously.

In such a virtualized storage environment, a wide vari-ety of heterogeneous performance-influencing factors exists,cf. [23]. In many cases, the factors have a significantly differ-ent effect compared to traditional native storage systems. AsFigure 1b illustrates, e.g., the standard Linux I/O schedulerCFQ performs significantly worse than the NOOP scheduler

286

Storage-Performance-Influencing Factors

Workload

Requests

Size Mix Pattern

Locality

System

Operating System

Filesystem I/O Scheduler

Hardware

Figure 4: Performance Influences (derived from [23])

for certain configurations. Figure 4 gives an overview ofthe major storage-performance-influencing factors used asbasis for our analysis. In general, we distinguish betweenworkload and system factors. Workload factors comprise av-erage request size, read/write mix and random or sequentialaccess pattern. Furthermore, the locality of requests affectsthe storage cache effectiveness. System factors compriseoperating system and hardware configurations. The majorfactors of the operating system are the file system and theI/O scheduler. Hardware factors can be analyzed on differentabstraction levels, but are often system specific.

4.2 Experimental SetupIn our experimental environment, the DS8700 contains 2 GBNVC and 50 GB VC with a RAID5 array containing sevenHDDs. Measurements are obtained in a z/Linux virtualmachine (VM) with 2 IFLs (cores) and 4 GB of memory.Using the O_DIRECT POSIX flag, we isolate the effects ofoperating system caching in order to focus our measurementson the storage performance. However, we explicitly takeinto account the cache of the storage system by varying theoverall size of data accessed in our workloads. As a basis forour experimental analysis, we used the open source Flexi-ble File System Benchmark1 (FFSB) due to its fine-grainedconfiguration possibilities. FFSB runs at the applicationlayer and measures the end-to-end response time covering allsystem layers from the application all the way down to thephysical storage. The system and workload parameters forthe measurements match the factors described in Section 4.1and illustrated in Figure 4. The locality of the workload canbe deduced from the overall size of the set of files accessed inthe workload since FFSB requests are randomly distributedamong the files. The considered parameter values are sum-marized in Table 1. The parameter space is fully exploredleading to a total of 1120 measurement configurations.

4.3 Experimental MethodologyFor a given configuration of a benchmark run, the measure-ments run in three phases: First, a set of 16 MB files iscreated. Second, the target number of workload threads arestarted and they start reading from and writing into theinitial file set. After an initial warm up phase, the measure-ment phase starts and the benchmark begins obtaining readand write performance data. For each workload thread, theread and write operations consist of 256 sub-requests of thespecified size directed to a randomly chosen file from thefile set. If the access pattern is sequential, the sub-requestsaccess subsequent blocks within the file. Each thread issuesa request as soon as the previous one is completed.

For each of the 1120 configurations, we configured a oneminute warm up phase and a five minute measurement phase.

1http://github.com/FFSB-prime (extending http://ffsb.sf.net)

Table 1: Experimental Setup Configuration

System Parameters

File system ext4I/O scheduler CFQ, NOOP

Workload Parameters

Threads 100File set size 1 GB, 25 GB, 50 GB, 75 GB, 100 GBRequest size 4 KB, 8 KB, 12 KB, 16 KB, 20 KB, 24 KB,

28 KB, 32 KBAccess pattern random, sequentialRead percentage 0%, 25%, 30%, 50%, 70%, 75%, 100%

The measurement phase consists of five intervals of one mi-nute length each. In one minute, the benchmark obtainsbetween approximately 90 000 and 2 800 000 measurementsamples depending on the configuration and approximately575 000 measurement samples on average. Furthermore, weanalyzed the measurement intervals to ensure stable andmeaningful results. Illustrated in Figure 5, we evaluated therelative standard error (RSE) of the response time means foreach configuration over the five intervals. For read requests,the 90th percentile of the RSE is 8.45% and the mean RSEis 3.35%. For write requests, the 90th percentile of the RSEis 5.35% and the mean RSE is 2.10%. Thus, we concludethat the measurement samples are sufficiently large and themeasurements are sufficiently stable. This is important sinceour goal was to keep the measurement length limited.

Since a manual measurement and evaluation approach ishighly tedious and error prone, we automated the processas part of a new tool we developed called SPA (StoragePerformance Analyzer)2. Illustrated in Figure 6, our toolbasically consists of a benchmark harness that coordinatesand controls the execution of the FFSB benchmark and atailored analysis library used to process and evaluate thecollected measurements. The benchmark harness componentruns on a controller machine managing the measurementprocess. Using SSH connections, the benchmark controllerfirst configures the benchmark, then it executes the targetworkload, and it finally collects the results into an SQLitedatabase. Furthermore, the benchmark controller guaranteesa synchronized execution of experiments on multiple targets,i.e., on multiple VMs that can be deployed on the samesystem. The evaluation is automated using tailored analysisfunctions implemented using the open source statistics toolR [27]. The analysis library comprises the analysis, opti-mization and regression functions we created and appliedfor regression optimization and performance model creation.For more information on the technical realization, we referthe reader to [6].

4.4 Performance Modeling MethodologyWe employ the regression techniques presented in Section 3.As independent variables, we used the system and the work-load parameters listed in Table 1. As dependent variables,we used the system response time and throughput. For eachregression technique, we first optimize the parameters to cre-ate effective models as explained in detail in the next section.We then create dedicated models for read and write requestsas well as for response time and throughput predictions.Due to the structure of the independent variables, the LRMand MARS models without interaction perform significantlyworse than all the other techniques. Therefore, we explicitly

2http://sdqweb.ipd.kit.edu/wiki/SPA

287

0.00

0.25

0.50

0.75

1.00

0 10 20

Relative Standard Error (%)

Cum

ula

tive

Dis

trib

uti

on

Funct

ion

Operation

read

write

Figure 5: Distribution of the Relative Standard Error

Controller Machine

Benchmark Harness

BenchmarkController

Benchmark Driverfor FFSB

SQLite Database

R – Statistics Tool

AnalysisLibrary

Analysis, Optimiza-tion & RegressionFunctions

VM 2

FFSBBenchmark

VM 1

FFSBBenchmark

SSH

Figure 6: Overview of our Automated Methodology

exclude LRM and MARS models without interactions fromconsideration. Furthermore, Cubist stems from M5 bothalgorithmically and in terms of implementation and they areidentical for Cubist’s standard parameter values. Therefore,in the following, we do not explicitly distinguish between thetwo. Thus, having four models for each regression technique,we create a total of 16 performance models evaluated indetail in Section 6.

5. REGRESSION OPTIMIZATIONRegression techniques often have parameters, which influencethe effectiveness of the technique in a certain applicationscenario. Usually, the parameters are left to their standardvalues or chosen based on an educated guess. However, bothapproaches are insufficient to systematically create powerfulmodels.

To create effective performance models, we search for op-timal parameters for each of our considered techniques3, cf.Table 2. A full factorial search would take weeks due to thehigh computational costs of creating millions of regressionmodels. Therefore, we propose a heuristic iterative searchalgorithm shown in Algorithm 1 and illustrated in Figure 7.In a nutshell, we split the parameter space into multiple sub-spaces and iteratively refine the search in the most promisingregions.

3Note that there can be more input parameters than the ones consid-

ered here depending on the implementation of the technique. In thispaper, we use the implementation in the statistics tool R [27].

Algorithm 1 Iterative Parameter Optimization

Configuration:

k ← Number of splitsn← Number of explorationsS // Stopping criterion

5: Definitions:

~p := (p1, . . . , pl), L := {1, . . . , l} // l parameterspi ∈ [ai, bi], ∀i ∈ L // Ranges of parametersc(~p) := Mean RMSE in cross-validation of parameters

Init:

10: E ← {(~a := (a1, . . . , al), c(~a) )}// Evaluate first border parameters

M ← E // M : Set of best parameters

Algorithm:

while S does not apply do15: j-th Iteration:

for all (~ν j(h), ·) ∈M do // h-th pivot

for all i ∈ L doAi ← {pi | (pi, ·) ∈ E ∧ pi < ν j(h)i}if Ai 6= ∅ then a∗i ← maxAi

20: else a∗i ← aiBi ← {pi | (pi, ·) ∈ E ∧ pi > ν j(h)i}if Bi 6= ∅ then b∗i ← minBi

else b∗i ← bi

s∗i ←b∗i−a

∗i

k+1,∀i ∈ L // Step width

25: S∗i ← {a∗i , a∗i + s∗i , . . . , a∗i + k s∗i , b

∗i }

for all s ∈ S∗i doif s invalid then round s to next valid value

end forend for

30: Ej(h) ← {(~x, c(~x)) | ~x ∈ S∗1 × . . .× S∗l }// Evaluate parameters if not evaluated yet

end forE ← E ∪

⋃hE

j(h) // Save all evaluated parameters

M ← Find n best tuples in E (w.r.t. c)35: end while

Algorithm. The algorithm’s configuration parameters arethe number of splits (i.e., splitting points) in each iterationand the number of subspace explorations in the next iteration.The number of splits configures the sampling frequency ofthe search space and should be higher if multiple high butnarrow optima can be assumed. The number of explorationsconfigures how many local optima are analyzed in theiradjacent area and should be higher if many local optimacan be assumed. The algorithm can be configured with anarbitrary stopping criterion, e.g., when the improvementbetween iterations falls below a given threshold or when aspecified maximum number of iterations is reached. For eachof the given parameters ~p := (p1, ..., pl), a reasonable rangepi ∈ [ai, bi] needs to be defined to limit the search space.The lower bound usually exists for most parameters andthe upper bound can be derived based on the used trainingdata. For a given set of parameter values, we evaluate themodel using the mean root mean square error (RMSE) of a10-fold cross-validation, denoted as c(~p). This means thatthe measurement data is separated into 10 groups and eachgroup is used once as a prediction set while the rest of thedata is used for model building. This evaluation approach iskey to prevent overfitting of the prediction models.

288

~ν 1(1)

~ν 2(1) ~ν 2

(2)

~ν 3(1)

~ν 3(2)

Iteration 1

Iteration 2

Iteration 3

k = 3, n = 2

Figure 7: First Iterations of the Parameter Optimization

The algorithm begins with the evaluation of one cornerpoint in the parameter space, cf. Figure 7. As initially thisis the only evaluated point, it is the only point saved in a setM for exploration in the next iteration.

In each iteration, we explore every point ~x in the set Mcontaining the points corresponding to the best parametervalue assignments. The unevaluated space around ~x is splitinto (k+ 1)l subspaces. Initially, this is the whole parameterspace. Each corner point of the subspaces is then evaluatedand the best n found so far are saved in M for the nextiteration. To save computation time, we maintain a set E ofalready evaluated parameter values. The algorithm repeatseach iteration until the stopping criterion is fulfilled.

Example. In Figure 7, the first three iterations of thealgorithm with k = 3 and n = 2 are illustrated. Initially,one corner point is evaluated. This pivot ~ν 1

(1) is explored inthe first iteration. Since there are no other evaluated points,the whole parameter space is cut with k = 3 splits for eachparameter indicated by the grid lines (at the Iteration 1 level).Each of the intersection points depicted as small gray points(at the Iteration 2 level) is evaluated. After the best n = 2evaluated points are found, they become the next pivots ~ν 2

(1)

and ~ν 2(2). Now, when ~ν 2

(1) is explored, the space around itlimited by the neighbouring evaluated points is explored, cf.dashed lines between Iteration 2 and Iteration 3. The sameis done for the second pivot ~ν 2

(2). This process is repeatedin the next iterations and the best parameters are used formodel building. Note that the space does not necessarilyhave to be split equidistantly for all parameters, cf. ~ν 3

(1).That is why the complexity evaluation is independent fromthe exact values the algorithm explores.

Complexity. In our implementation of the algorithm, weused indexed data structures to realize the logic. Thus, theregression model building by means of the cross-validationevaluation represents by far the most computationally ex-pensive part of the algorithm. Therefore, we evaluate thecomplexity of the cross-validation evaluation. In the worstcase, in every iteration we evaluate every new set of parame-

Table 2: Parameters of the Regression Techniques

Parameter Standard Valid/Used Best ~p (RTr,Range RTw, TPr, TPw)

MARS4 nnodes 21 (der.)5 [3,∞)/[3, 800] 255, 353, 131, 523

threshold 0.001 [0, 1)/[0, 0.999] 5.01e−6, 3.07e−7,5.05e−5, 2.66e−5

nprune nnodes [2,∞)/[2, 800] 98, 134, 99, 141

CART minsplit 20 [2,∞)/[2, 400] 3, 2, 2, 2

cp 0.01 [0, 1)/[0, 0.999] 3.07e−7, 4.09e−7,0, 0

Cubist nsplits 100 [2,∞)/[2, 400] 173, 217, 173, 217ntrees 1 [1, 100] 77, 41, 11, 26ninstances 0 [0, 9] 2, 4, 2, 1

ter values except the initial corner one in the first iterationand the four corner ones in every other iteration. Thus,the number of evaluations in our search in the worst case islimited by

O((k+2)l−1+(max j−1) ·n ((k+2)l−4)) = O(max j ·n kl),where max j is the maximum number of iterations. By con-figuring max j, n and k, the overhead of the algorithm canbe adjusted and kept within bounds. As discussed in thenext section, our evaluation showed that even moderately lowvalues of these parameters lead to substantial improvementsin the model accuracy.

Optimality. Our main goal is to achieve such model accu-racy improvements efficiently. However, if we do not limitthe number of iterations and allow to search long enough,our approach is able to find locally optimal solutions. Thus,for convex problems our approach also guarantees globallyoptimal solutions. This is since, let k > 1 and n > 0, ∀i ∈ Lone has if j → ∞, then s∗i → 0 for continuous parametersand s∗i → 1 for discrete parameters. In other words, as weiteratively explore the current best solutions of the iteration,at least one solution will be fully explored.

Optimization Results. Based on the parameter optimiza-tion algorithm, for each regression technique we create fourdifferent models as described in Section 4: A read responsetime model (RTr), a write response time model (RTw), aread throughput model (TPr), and a write throughput model(TPw). We configured the number of splits k to four and thenumber of explorations n to five. We stop after 10 iterations.Table 2 summarizes the parameters of the regression tech-niques and lists their standard values as well as their validranges. For unbounded parameters, we chose a generousupper limit. For each model, the recommended parametervalue assignments for our performance models determined bythe optimization algorithm are shown. In the next section,we evaluate the accuracy of the resulting optimized regres-sion models and demonstrate the benefits of our modelingapproach.

6. EVALUATIONIn this section, we analyze and evaluate our approach inthree steps. First, we evaluate the performance models w.r.t.their goodness-of-fit as well as their predictive power forinterpolation and extrapolation scenarios. Then, we analyzescenarios where the workload is distributed on multiple VMs.Finally, we evaluate the improvements in model accuracyachieved through our heuristic optimization. In each of the

4The standard MARS algorithm does not include interactions. How-

ever, we only consider MARS with interactions, cf. Section 4.4.5Derived value (min(200,max(20, 2 ·#IndependentV ariables)) + 1).

289

Table 3: Goodness-of-Fit of Performance Models

Coefficient of determination R2

Model RTr RTw TPr TPw14

∑LRM 0.9720 0.9703 0.9699 0.9546 0.9667MARS 0.9955 0.9991 0.9951 0.9973 0.9968CART 0.9983 1 1 1 0.9996Cubist 0.9990 0.9994 0.9995 1 0.9995

three steps, we use the optimized models generated by ourSPA tool as a basis for the evaluation.

6.1 Model AccuracyGoodness-of-Fit. For each of the optimized performance mod-els, we evaluate how well it reflects the observed systembehavior. To evaluate this goodness-of-fit, we analyze thecoefficient of determination R2 defined as

R2 := 1− SSE

SST= 1−

∑i(yi − f(~xi))

2∑i(yi −

1N

∑j yj)

2≤ 1,

where SSE and SST are the residual and the total sum ofsquares, respectively, {(~xi, yi)}i=1,...,N represent the mea-surement data and f is the respective regression model. R2

values close to 1 indicate a good fit. The results of ourevaluation for the different models are shown in Table 3.Overall, most of the models fit very well to the measure-ments. Especially MARS performs particularly well evenwith a comparably low number of terms (cf. nprune param-eter values). CART, however, has an expectedly good fitdue to the low minsplit and cp parameters allowing for ahighly splitted tree. Furthermore, as expected Cubist alsofits very well due to the large number of trees and especiallythe instance-based correction. Finally, LRM exhibits a fairlywell fit, however, exhibiting the lowest fit of all the models.

Predictive Power for Interpolation Scenarios. We evaluate100 configuration scenarios with parameter values randomlychosen within the ranges shown in Table 4. For each scenario,we compare the model predictions with measurements on thereal system.

Figure 9 shows the mean relative error for the variousmodels. Overall, the models perform very well and especiallyMARS and Cubist exhibit excellent performance predictionaccuracy with less than 7% and 8% error, respectively. TheCART trees are highly splitted, yet with approximately 10%error their accuracy is acceptable. Finally, the LRM modelsexhibit the highest error with approximately 13%.

Figure 8 depicts the empirical cumulative distributionfunctions of the prediction errors. In all cases, the majorityof the error is less than 10%. For MARS and Cubist, the erroris less than 10% for more than 75% of the samples. Mostnotably, the 90th percentile of the error of both MARS andCubist is less than 25%. Moreover, even the 99th percentileof the error is less than 50% for almost all models.

Predictive Power for Extrapolation Scenarios. We now con-sider configuration scenarios with parameter values outsideof the ranges used to build the models. More specifically, weextrapolate the request size, the file set size, and the readpercentage in two steps in each direction. We evaluate 256scenarios by considering all combinations of the parametervalues shown in Table 5. We explicitly distinguish betweensmall (i.e., 1 KB and 2 KB) and large request sizes (i.e.,36 KB and 40 KB).

As shown in Figure 10, the predictive power of the modelsfor extrapolation scenarios differs significantly for smaller

Table 4: Interpolation Setup Configuration

Parameter Range

I/O scheduler {CFQ, NOOP}File set size [1 GB, 100 GB]

rounded to multiples of 16 MBRequest size [4 KB, 32 KB]

rounded to multiples of 512 bytesAccess pattern {random, sequential}Read percentage [25%, 100%] for read requests

[0%, 75%] for write requests

Table 5: Extrapolation Setup Configuration

Parameter Values

I/O scheduler {CFQ, NOOP}File set size {512 MB, 768 MB, 110 GB, 120 GB}Request size {1 KB, 2 KB, 36 KB, 40 KB}Access pattern {random, sequential}Read percentage {15%, 20%, 80%, 85%}

and larger request sizes especially for throughput predic-tions. Scenarios with smaller request sizes are predictedfairly well for response time. The error for the best perform-ing model Cubist is less than 15%. Throughput predictions,however, exhibit much higher errors. This is due to thelarge performance overhead of serving many small requestssimultaneously. The file system introduces significant over-head since the standard extent sizes (in which data is stored)are 4 KB. Furthermore, the optimization strategies in thevarious layers (especially the I/O scheduler) are very eagerto optimize the requests. The latter is difficult since mul-tiple threads issue requests simultaneously. This effect ismultiplied by the penalty introduced by frequent backendRAID array accesses if the file set size is larger than the NVCand the VC, respectively. On the other hand, scenarios withlarger request sizes are predicted very well. This is notablesince the models have to cope with unexplored regions in theparameter space. Especially, Cubist predicts response timeswith less than 9% error and throughput with approximately16% error. Also, MARS performs impressively well with ap-proximately 11% error for response time and approximately21% error for throughput.

Summary. Overall, the performance models exhibited goodaccuracy and predictive power. To highlight the key aspects,the models managed to effectively capture the performanceinfluences with very high goodness-of-fit characteristics. Ininterpolation scenarios, the models exhibited excellent per-formance prediction accuracy for arbitrarily chosen config-uration scenarios. Here, MARS and Cubist performed bestwith approximately 7%-8% prediction error. Also, in ex-trapolation scenarios, the models showed promising results.Except for throughput predictions for smaller request sizes,the best performing model Cubist had a prediction error ofonly 9%-16%.

6.2 Workload DistributionWe now evaluate scenarios where the workload is distributedon multiple co-located VMs as done for example in thecontext of server consolidation. We distribute the workloadevenly across two and three VMs with shared cores, i.e.,both the number of threads and the file set size are scaledinversely by the number of VMs. We explicitly distinguishbetween the different I/O scheduler implementations as wellas between random and sequential workloads. Thus, having100 randomly chosen configurations based on the parameterranges in Table 4, we conduct 25 measurements for each

290

LRM MARS CART Cubist

0.00

0.25

0.50

0.75

1.00

0.00

0.25

0.50

0.75

1.00

0.00

0.25

0.50

0.75

1.00

0.00

0.25

0.50

0.75

1.00

RTr

RTw

TPr

TPw

10−1 100 101 102 103 10−1 100 101 102 103 10−1 100 101 102 103 10−1 100 101 102 103

Relative Error (%)

Cum

ula

tive

Dis

trib

uti

on

Funct

ion

Model Optimized Standard

Figure 8: Cumulative Distribution Function of the Relative Error of the Performance Models (Note the Logarithmic x-Axis)


0

4

8

12

16

RTrRTwT

PrTPw RTr

RTwTPr

TPw RTr

RTwTPr

TPw RTr

RTwTPr

TPw

Mea

nR

elati

ve

Err

or

(%)

Mean Relative Error (%)Model RTr RTw TPr TPw

14

∑LRM 14.73 8.72 15.05 13.74 13.06MARS 7.80 4.90 9.52 5.39 6.90CART 10.47 7.98 12.41 10.26 10.28Cubist 8.63 7.23 7.75 6.44 7.51

Figure 9: Prediction Accuracy of Interpolation

CFQ & NOOP scheduler and random & sequential workloadcombination. These measurements are used to evaluate theprediction error for each model. Note that the models arestill extracted from measurements obtained in one VM toshow the effectiveness of our approach.

For two and three VMs, respectively, Figure 11 and Fig-ure 12 show the mean relative error for all models structuredaccording to the chosen I/O scheduler and workload pattern.Since CFQ performs significant optimizations that depend onthe current load, the model extracted from measurements inone VM does not fit well with the measurements conductedin multiple VMs for sequential workload. For random work-


0

100

200

300

400

0

100

200

300

400

1K

B,

2K

B36

KB

,40

KB

RTrRTwT

PrTPw RTr

RTwTPr

TPw RTr

RTwTPr

TPw RTr

RTwTPr

TPw

Mea

nR

elati

ve

Err

or

(%)


14

∑Request Size: 1 KB, 2 KB

LRM 22.36 33.65 393.19 315.81 191.25MARS 21.79 17.54 395.66 168.26 150.81CART 22.18 22.59 211.69 204.60 115.27Cubist 14.47 14.96 130.76 108.38 67.14

Request Size: 36 KB, 40 KB


Figure 10: Prediction Accuracy of Extrapolation

load, however, this effect is less significant. Response time ispredicted fairly well and especially throughput is predictedvery well. The NOOP scheduler performs only optimizationsbased on request splitting and merging. Thus, the modelextracted from measurements in one VM fits much better formultiple VMs in most cases. Only for write response times,

291


0

30

60

90

0

30

60

90

0

30

60

90

0

30

60

90

CF

QC

FQ

NO

OP

NO

OP

Random

Seq

uen

tial

Random

Seq

uen

tial

RTrRTwT

PrTPw RTr

RTwTPr

TPw RTr

RTwTPr

TPw RTr

RTwTPr

TPw

Mea

nR

elati

ve

Err

or

(%)


14

∑Scheduler: CFQ, Random Requests


Scheduler: CFQ, Sequential Requests


Scheduler: NOOP, Random Requests


Scheduler: NOOP, Sequential Requests


Figure 11: Prediction Accuracy of the One-VM-Models whenthe Workload is Distributed on Two VMs

the model exhibits a high error particularly for larger fileset sizes exceeding the storage cache. We conclude that thefrequent cache misses and write request optimizations in thestorage system lead to higher prediction errors. Nonetheless,in most cases the performance model still provides a validapproximation of the system performance.

Summary. Distributing the workload among multiple VMsleads to complex optimization and virtualization effects.While our models were not tuned for these effects, theystill provided promising prediction results in most scenarios.In some scenarios, the error was evident and expected as,e.g., the CFQ scheduler is known for very active and eagerrequest optimization. Still, the mean error remains less than10%-20% for the MARS and Cubist models in most cases.While further in-depth analysis of the virtualization andscheduler influences is out of scope of this paper, the resultsare very encouraging.


0

50

100

150

200

250

0

50

100

150

200

250

0

50

100

150

200

250

0

50

100

150

200

250

CF

QC

FQ

NO

OP

NO

OP

Random

Seq

uen

tial

Random

Seq

uen

tial

RTrRTwT

PrTPw RTr

RTwTPr

TPw RTr

RTwTPr

TPw RTr

RTwTPr

TPw

Mea

nR

elati

ve

Err

or

(%)


14

∑Scheduler: CFQ, Random Requests


Scheduler: CFQ, Sequential Requests


Scheduler: NOOP, Random Requests


Scheduler: NOOP, Sequential Requests


Figure 12: Prediction Accuracy of the One-VM-Models whenthe Workload is Distributed on Three VMs

6.3 Regression OptimizationTo evaluate the improvements in model accuracy achievedthrough our regression optimization algorithm, we comparethe accuracy of the models when using the optimized regres-sion parameters vs. the standard parameters, respectively.We evaluate the performance prediction error for each modelwith 100 random configurations based on the ranges shownin Table 4, cf. Section 6.1.

In Figure 13, we show the reduction in error for eachmodel. Overall, the results show significant improvements ofthe model accuracy, i.e., reduced modeling error. EspeciallyMARS and CART benefit from the parameter optimiza-tion exhibiting an error reduction of 66.30% and 74.08%,respectively. The improvement of Cubist is less, however,an error reduction of approximately 16% is still observed.Interestingly, the highest error reduction was achieved forread throughput predictions across all models. The MARS

292

models showed most improvements for throughput metrics,while the CART models were almost evenly improved acrossall metrics. Surprisingly, only a small error reduction wasachieved for Cubist’s read response time predictions, whereassignificant improvements were seen for write response timeand read throughput predictions.

In Figure 8, we evaluate the error distribution of the dif-ferent techniques in the form of empirical cumulative distri-bution functions. The results show that our optimizationespecially eliminated large prediction errors. Furthermore,the results confirm that the optimization especially improvedCART and MARS with a decisive reduction in large errors.For Cubist, the improvement was less than for the othertechniques, however, the error reduction is still evident. Thisis due to the fact that the technique performs competitivelywell even with standard parameters and the optimizationis primarily able to decrease the number of extreme errors.Still, the optimization process was key for achieving thehigh quality results of the MARS and the Cubist modelspresented in Section 6.1 and Section 6.2. These two modelsexhibited the highest accuracy in almost all of the consideredevaluation scenarios

Finally, we evaluate the statistical significance of the op-timization results. In a paired t-test, we evaluate the meanof each difference of the prediction error of the respectiveregression technique between the optimized and the standardparameter values. Thus, for each regression technique thenull hypothesis H0 is that the true mean of differences ofthe prediction error is equal to zero. For the t-test, thep-value of both MARS and CART is less than 2.2e−16 andthe p-value of Cubist is 0.0003394. Thus, H0 is rejectedconfirming that the optimization is statistically significantfor every considered regression technique.

Summary. Overall, the parameter optimization was crucialfor creating effective performance models. The mean errorreductions of approximately 16%, 66%, and 74% for Cubist,MARS, and CART, respectively, lead to performance modelswith high accuracy and predictive power. Especially theimprovements for MARS lead to accurate models that, to-gether with the Cubist models, exhibited the best predictionaccuracy in our scenarios. Furthermore, the improvementachieved by the parameter optimization was statisticallysignificant for every considered regression technique.

7. RELATED WORKMany general modeling techniques for storage systems exist,e.g., [7, 11, 20, 21, 28], but they are only shortly mentionedhere as our work is focused on virtualized environments.

The work closely related to the approach presented in thispaper can be classified into two groups. The first groupis focused on modeling storage performance in virtualizedenvironments. Here, Kraft et al. [17] present two approachesbased on queueing theory to predict the I/O performanceof consolidated virtual machines. Their first, trace-basedapproach simulates the consolidation of homogeneous work-loads. The environment is modeled as a single queue withmultiple servers having service times fitted to a MarkovianArrival Process (MAP). In their second approach, they pre-dict storage performance in consolidation of heterogeneousworkloads. They create linear estimators based on meanvalue analysis (MVA). Furthermore, they create a closedqueueing network model, also with service times fitted to aMAP. Both methods use monitored measurements on the

MARS CART Cubist

-100

-75

-50

-25

0

RTrRTw T

PrTPw RTr

RTw TPr

TPw RTr

RTw TPr

TPw

Re l

ati

ve

Err

or

Diff

eren

ce(%

)

Relative Error Difference (%)Model RTr RTw TPr TPw

14

∑MARS −58.64 −60.20 −77.59 −68.78 −66.30CART −73.14 −73.24 −80.78 −69.15 −74.08Cubist −0.25 −24.01 −31.52 −7.23 −15.75

Figure 13: Error Reduction by Parameter Optimization

block layer that is lower than typical applications run. In [1],Ahmad et al. analyze the I/O performance in VMware’s ESXServer virtualization. They compare virtual to native perfor-mance using benchmarks. They further create mathematicalmodels for the virtualization overhead. The models are usedfor I/O throughput degradation predictions. To analyze per-formance interference in a virtualized environment, Koh etal. [16] manually run CPU bound and I/O bound benchmarks.While they develop mathematical models for prediction, theyexplicitly focus on the consolidation of different types ofworkloads, i.e., CPU and I/O bound. By applying differentmachine learning techniques, Kundu et al. [19] use artificialneural networks and support vector machines for dynamic ca-pacity planning in virtualized environments. Further, Gulatiet al. [10] present a study on storage workload characteriza-tion in virtualized environments, but perform no performanceanalysis.

The second group of related work deals with benchmarkingand performance analysis of virtualized environments notspecifically targeted at storage systems. Hauck et al. [13]propose a goal-oriented measurement approach to determineperformance-relevant infrastructure properties. They exam-ine OS scheduler properties and CPU virtualization overhead.Huber et al. [14] examine performance overhead in VMwareESX and Citrix XenServer virtualized environments. Theycreate regression-based models for virtualized CPU and mem-ory performance. In [3], Barham et al. introduce the Xenhypervisor comparing it to a native system as well as othervirtualization plattforms. They use a variety of benchmarksfor their analysis to quantify the overall Xen hypervisoroverhead. Iyer et al. [15] analyze resource contention whensharing resources in virtualized environments. They focuson analyzing cache and core effects.

8. CONCLUSIONSummary. We presented a measurement-based performanceprediction approach for virtualized storage systems. Wecreated optimized performance models based on statisticalregression techniques capturing the complex behaviour of thevirtualized storage system. We proposed a general heuristicsearch algorithm to optimize the parameters of regression

293

techniques. This algorithm is not limited to a certain domainand can be used as a general regression optimization method.We applied our optimization approach and created perfor-mance models based on systematic measurements using fourregression techniques: LRM, MARS, CART, and Cubist. Weevaluated the models in different scenarios to assess theirprediction accuracy and the improvement achieved by ouroptimization approach. The scenarios comprised interpola-tion and extrapolation scenarios as well as scenarios whenthe workload is distributed on multiple virtual machines. Weevaluated the models in a real-world environment based onIBM System z and IBM DS8700 server hardware. In themost typical scenario, the interpolation ability of the modelswere excellent with our best models having less than 7%prediction error for MARS and less than 8% for Cubist. Theextrapolation accuracy was promising. Except for through-put predictions of 1 KB and 2 KB requests, the predictionerror of our best model Cubist was always less than 16%.Furthermore, using our models we were able to successfullyapproximate the performance if the workload is distributedon two and three virtual machines. The average predictionerror of both MARS and Cubist was less than 20% in mostscenarios. Finally, our optimization process showed to be cru-cial and reduced the prediction error by 16%, 66%, and 74%for Cubist, MARS, and CART, respectively, with statisticalsignificance.

Lessons Learned. i) Our performance modeling approachwas able to extract powerful prediction models. For inter-polation and extrapolation scenarios, the models showed tohave excellent prediction accuracy. ii) Even if the workloadis distributed on several virtual machines, the performancemodels are able to approximate the expected performancein most cases. iii) Our optimization process showed to bekey for the prediction accuracy of the models and achievedstatistically significant improvements for every consideredregression technique. iv) While LRM and CART showedto perform well, MARS and Cubist consistently showed thebest fit and the best prediction accuracy.

Application Scenarios. Generally, our approach is targetedat the analysis and evaluation of virtualized storage systems.It is especially beneficial for complex systems and in casesthat prohibit explicit fine-grained performance models dueto, e.g., time constraints for the manual performance modelcreation, calibration, and validation. Our automated perfor-mance modeling approach can aid (e.g., the system developer)to measure the system once and extract a performance modelthat represents the environment. This model can be used indifferent scenarios (e.g., by customers) to assess the systemcharacteristics and evaluate deployment decisions.

9. ACKNOWLEDGMENTSThis work was funded by the German Research Foundation(DFG) under grant No. RE 1674/5-1 and KO 3445/6-1. Weespecially thank the Informatics Innovation Center (IIC)6

for providing the system environment of the IBM System zand the IBM DS8700.

10. REFERENCES[1] I. Ahmad, J. Anderson, A. Holler, R. Kambo, and V. Makhija.

An analysis of disk performance in VMware ESX server virtualmachines. In WWC-6, 2003.

6http://www.iic.kit.edu/

[2] M. Armbrust, A. Fox, R. Griffith, A. D. Joseph, R. Katz,A. Konwinski, G. Lee, D. Patterson, A. Rabkin, I. Stoica, andM. Zaharia. A view of cloud computing. Commun. ACM,53(4):50–58, 2010.

[3] P. Barham, B. Dragovic, K. Fraser, S. Hand, T. Harris, A. Ho,R. Neugebauer, I. Pratt, and A. Warfield. Xen and the art ofvirtualization. SIGOPS Oper. Syst. Rev., 37:164–177, 2003.

[4] L. Breiman. Random Forests. Machine Learning, 45, 2001.[5] L. Breiman, J. Friedman, C. J. Stone, and R. Olshen.

Classification and Regression Trees. The Wadsworth andBrooks-Cole statistics-probability series. Chapman & Hall,1984.

[6] D. Bruhn. Modeling and Experimental Analysis of VirtualizedStorage Performance using IBM System z as Example. Master’sthesis, Karlsruhe Institute of Technology, Karlsruhe, Germany,2012.

[7] J. S. Bucy, J. Schindler, S. W. Schlosser, G. R. Ganger, andContributors. The DiskSim Simulation Environment -Version 4.0 Reference Manual. Carnegie Mellon University,Pittsburgh, PA, 2008.

[8] B. Dufrasne, W. Bauer, B. Careaga, J. Myyrrylainen,A. Rainero, and P. Usong. IBM System Storage DS8700Architecture and Implementation.http://www.redbooks.ibm.com/abstracts/sg248786.html, 2010.

[9] J. H. Friedman. Multivariate Adaptive Regression Splines.Annals of Statistics, 19(1):1–141, 1991.

[10] A. Gulati, C. Kumar, and I. Ahmad. Storage workloadcharacterization and consolidation in virtualized environments.In VPACT ’09.

[11] P. Harrison and S. Zertal. Queueing models of RAID systemswith maxima of waiting times. Performance Evaluation,64:664–689, 2007.

[12] T. Hastie, R. Tibshirani, and J. Friedman. The Elements ofStatistical Learning: Data Mining, Inference, and Prediction.Springer Series in Statistics. Springer, 2nd edition, 2011.

[13] M. Hauck, M. Kuperberg, N. Huber, and R. Reussner. Ginpex:deriving performance-relevant infrastructure properties throughgoal-oriented experiments. In QoSA-ISARCS ’11.

[14] N. Huber, M. von Quast, M. Hauck, and S. Kounev. Evaluatingand Modeling Virtualization Performance Overhead for CloudEnvironments. In CLOSER ’11.

[15] R. Iyer, R. Illikkal, O. Tickoo, L. Zhao, P. Apparao, andD. Newell. VM3: Measuring, modeling and managing VMshared resources. Computer Networks, 53:2873–2887, 2009.

[16] Y. Koh, R. Knauerhase, P. Brett, M. Bowman, Z. Wen, andC. Pu. An Analysis of Performance Interference Effects inVirtual Environments. In ISPASS ’07.

[17] S. Kraft, G. Casale, D. Krishnamurthy, D. Greer, andP. Kilpatrick. Performance Models of Storage Contention inCloud Environments. SoSyM, 2012.

[18] M. Kuhn, S. Witson, C. Keefer, and N. Coulter. Cubist Modelsfor Regression. http://cran.r-project.org/web/packages/Cubist/vignettes/cubist.pdf, 2012. Last accessed: Oct 2012.

[19] S. Kundu, R. Rangaswami, A. Gulati, M. Zhao, and K. Dutta.Modeling Virtualized Applications using Machine LearningTechniques. In VEE ’12.

[20] A. S. Lebrecht, N. J. Dingle, and W. J. Knottenbelt. Analyticaland Simulation Modelling of Zoned RAID Systems. TheComputer Journal, 54:691–707, 2011.

[21] E. K. Lee and R. H. Katz. An analytic performance model ofdisk arrays. SIGMETRICS Perform. Eval. Rev., 21(1), 1993.

[22] P. Mell and T. Grance. The NIST definition of cloudcomputing. National Institute of Standards and Technology,53(6):50, 2009.

[23] Q. Noorshams, S. Kounev, and R. Reussner. ExperimentalEvaluation of the Performance-Influencing Factors ofVirtualized Storage Systems. In EPEW ’12, volume 7587 ofLNCS. Springer, 2012.

[24] J. R. Quinlan. Combining Instance-Based and Model-BasedLearning. In ICML ’93.

[25] J. R. Quinlan. Learning with Continuous Classes. In AI ’92.World Scientific.

[26] RuleQuest Research Pty Ltd. Data Mining with Cubist.http://rulequest.com/cubist-info.html, 2012. Last accessed:Oct 2012.

[27] The R Project for Statistical Computing.http://www.r-project.org/, 2012. Last accessed: Oct 2012.

[28] M. Wang, K. Au, A. Ailamaki, A. Brockwell, C. Faloutsos, andG. R. Ganger. Storage Device Performance Prediction withCART Models. In MASCOTS ’04.

294

Predictive Performance Modeling of Virtualized Storage ... · Performance modeling and evaluation...

Documents

Transcript of Predictive Performance Modeling of Virtualized Storage ... · Performance modeling and evaluation...