
An automated forecasting method for workloads on web-based systems

- Employing an adaptive method using splines to forecast seasonal time series with outliers

Ali-Reza Rezaie

Department of Mathematics and Mathematical Statistics, Umea University

Supervisor: Sara Sjostedt-de Luna

Examiner: Konrad Abramowicz

Master Thesis - 30 ECTS

Fall 2014

Abstract

This thesis introduces an automated method for forecasting time series of the workload type seen on popular websites, such as the Wikimedia workload. These workloads are characterized by a pronounced, slowly changing seasonal pattern with occasional missing values and extreme outliers. The prediction method captures the seasonal pattern with cubic splines and predicts the residual with an autoregressive model. The parameters of the model are estimated from the most recently observed values, outliers excluded, since detection and prediction of outliers are handled separately. The method is evaluated on the Wikimedia data, which is recorded hourly. The Wikimedia data consists of the number of requests to the web pages owned by Wikimedia and the amount of data sent from those pages to the users' Internet browsers.

Summary (translated from the Swedish "Sammanfattning")

This thesis introduces an automated prediction method for time series based on the data load of popular websites, for example that of Wikimedia. These types of data loads are characterized by slow, distinct, shifting seasonal patterns with occasional missing values and extreme outliers. The prediction method uses cubic splines to capture the seasonal pattern and predicts its residual with an autoregressive model. The parameters of the model are estimated only from the most recently observed values, with the outliers that arise excluded, since the detection and prediction of outliers are handled separately. The method is then evaluated on the Wikimedia data, which is recorded hourly. The Wikimedia data consists of the number of requests made to the specific websites owned by Wikimedia and the data sent from the websites to the users' web browsers.


Preface

Acknowledgment

It has been one year since I started this project and began writing this thesis. The journey has been long but the experience has been great, and this is the end result. First, I would like to thank Ahmed Ali-Eldin, Amardeep Mehta, Johan Tordsson and Erik Elmroth from the Computer Science department at Umea University for letting me be involved in their project to publish a scientific article about this topic. Furthermore, I would like to thank my teachers, including Oleg Seleznjev, with whom I worked on the project to write the article, my examiner Konrad Abramowicz, and my fellow classmates at the Department of Mathematics and Mathematical Statistics for making statistics easier to understand and enjoy. I also want to thank Johan Svensson at the Department of Statistics for always taking the time to answer my statistical questions when I started studying the subject. Last but not least, I would like to thank my supervisor Sara Sjostedt-de Luna for introducing me to this project and for always making time, however little she had to spare, whenever I wanted to address an issue. I greatly enjoyed the discussions we had.

Umea, a husky cold day in the fall of 2014.

Ali-Reza Rezaie


Contents

1 Introduction
    1.1 Background
    1.2 Aim
    1.3 The Wikimedia data
    1.4 Past research
    1.5 Disposition

2 Theory
    2.1 Time series analysis
        2.1.1 Means, variances and autocovariances
        2.1.2 Stationarity
        2.1.3 Trend and seasonality
        2.1.4 White noise
    2.2 Time series models
        2.2.1 Autoregressive models
        2.2.2 Moving average models
        2.2.3 ARMA models
        2.2.4 ARIMA models
        2.2.5 SARIMA models
        2.2.6 Model selection
    2.3 Semi-parametric models
        2.3.1 Basis functions
        2.3.2 Splines
        2.3.3 B-Splines
        2.3.4 Creating basis functions
    2.4 Goodness-of-fit indicator

3 An automated prediction method
    3.1 One-step-ahead predictive model
        3.1.1 Outlier detection
        3.1.2 Level shifts
        3.1.3 One-step-ahead prediction algorithm
    3.2 Predictions 1+h hours ahead
        3.2.1 Outliers and missing values
        3.2.2 Overall algorithm for 1+h forecasting model

4 Forecasting the Wikimedia workload
    4.1 One-step-ahead prediction
        4.1.1 Adaptive splines with AR Model (Benchmark Model)
        4.1.2 Adaptive splines model
        4.1.3 Naive predictor
        4.1.4 Comparing different parameter settings
    4.2 Prediction 1+h hours ahead
        4.2.1 Naive 1+h steps ahead predictor
        4.2.2 Adaptive spline model
        4.2.3 Adaptive splines with AR model

5 Discussion and conclusions


1 Introduction

1.1 Background

Popular and large websites, such as Wikipedia, Amazon and Google, receive an enormous amount of daily Internet traffic on their servers. This traffic can be quantified and described as the workload of an Internet site, which means the number of users who are accessing a specific website at a certain time t, and the amount of data that is transferred from the website server to the users' Internet browsers at time t.

The Internet services needed to support these popular websites are run by big data centers, which are buildings with hundreds, up to thousands, of connected servers. These servers consume a great amount of energy to run, cool and maintain. To minimize costs, the data centers have an interest in planning ahead, for capacity planning and resource management, so that they can optimize the number of servers needed to run the websites. In turn, this requires a deeper understanding of the expected operating workload, [1] because the workload is continuously changing over time, often with pronounced weekly and daily patterns (see e.g. section 1.3).

Once the servers are operational along with the website, an excessive number of servers up and running makes the economic and environmental cost of energy and maintenance higher than necessary. Conversely, if an insufficient number of servers is available, the website will be slow to respond or will crash, since it is unable to handle the workload. Slow response times or crashes in turn cause users to leave or be unable to reach the website, which causes web stores such as Amazon to lose sales income. For these large websites to deliver high quality service, in terms of fast response times to their users, while using the least possible number of servers to optimize costs, there is a need for a method that predicts how many users are accessing a specific website and how much data is transferred from the website at a certain time point. The forecasts provide the knowledge needed to determine how many servers are required for the website to be fully operational, so that the workload is manageable at a certain time t.

A good and robust prediction method is therefore needed to forecast the workload, based on past observed workload data. The method should be fast and automatic, since the predictions might be needed on a minute or hourly basis.

1. Ali-Eldin, A. & Rezaie, A. et al. (2014), How will your workload look like in 6 years? Analyzing Wikimedia's workload, IEEE Computer Society, p. 349

1.2 Aim

The aim of this thesis is to develop an automatic and flexible prediction method that can be used to forecast the workloads of popular and large websites. The workload data for popular websites are often characterized by similar features: strong, slowly changing seasonal patterns with sudden abrupt level shifts, occasional extreme outliers and missing values. The prediction method is developed and estimated using the Wikimedia data.

1.3 The Wikimedia data

The Wikimedia Foundation servers [2] are known for operating Wikipedia, the sixth most popular site on the web. [3] The Wikimedia workload data provided for this thesis consists of two data sets covering the period from May 17, 2008 to October 16, 2013:

• The number of user requests per hour to web pages that the Wikimedia servers host.

• The amount of data (in bytes) distributed by the Wikimedia servers per hour.

The data analyzed is recorded hourly and represents only the English version of the Wikimedia websites. The two data sets are highly correlated with each other, with an estimated correlation coefficient of ρ = 0.774, because the amount of data transferred is closely associated with the number of requests to the servers.

The data sets can be seen in figure 1.3.1. When zooming in on the data, each of them has a characteristic daily seasonal pattern, as figure 1.3.2 suggests.

2. Wikimedia, http://www.wikimedia.org, (2014-08-10)
3. Alexa, http://www.alexa.com/topsites, (2014-08-10)

[Figure: two panels. (a) "Requests": requests per hour (0 to 90M) against date (2009 to 2014). (b) "Data sent": data sent per hour (0 to 1000G) against date (2009 to 2014).]

Figure 1.3.1: Hourly workload on the English Wikimedia servers. (a) The number of requests received per hour and (b) total amount of data (bytes) sent per hour.

[Figure: requests per hour (10M to 20M) against date (June 1 to June 14).]

Figure 1.3.2: The number of hourly requests for the Wikimedia servers zoomed in, taken from June 1, 2008, to June 14, 2008, illustrating the weekly seasonal and daily repetitive pattern.

June 1, 2008 was a Sunday, and each peak in figure 1.3.2 corresponds to one day. It can be seen that June 7 and 14, which are Saturdays, have the lowest number of requests in the week, and that Monday to Wednesday have the highest. The seasonality of the data consists of a distinct weekly pattern which varies slowly over time. Note that the number of requests varies between 10 and 20 million over this two-week period.

Modeling and forecasting the data is challenging from several perspectives. The forecasting method needs to be adaptive, since the seasonal dynamics of the data change constantly with time. The data has unforeseeable upward outliers, corresponding to pages that suddenly become popular, and downward outliers, which are associated with system errors caused by the monitoring software. The data has several level shifts, which induce a rapid increase or decrease in the data flow, followed by higher or lower variability in the daily seasonality of the data. Additionally, missing values occasionally arise, and all of these problems make prediction a challenge. Hence the automated prediction method needs to be flexible and adapt to the changes in the data, which means that the method needs to be estimated using only the most recent past values.

The predictions are made by applying cubic basis splines to the recent past values to capture the trend and seasonality, and autoregressive models to predict the remainder.

1.4 Past research

Using cubic splines to predict univariate time series has been an alternative to parametric models; for example, Hyndman et al. (2005) suggest using cubic smoothing splines for local linear forecasts. [4] Although smoothing splines and linear forecasts are not applied in this thesis, the principle of using cubic splines to produce local predictions is employed.

Applying cubic splines to predict workload data for web-based systems has been done in the past by Herbst et al. (2013) on the German Wikipedia data, which had similar characteristics to the data presented in this thesis. [5] However, the method in that article selected, among different forecasting approaches (cubic splines being one of them), the one best suited to the data at a certain time point, and used that specific method to predict the next time step. Moreover, the Wikipedia data used covered only 3 weeks, with no outliers or missing values.

Modeling different workloads is a major area in the computer science field; this thesis focuses on a certain type of workload that is characterized by a distinct, deterministic seasonal pattern. For the modeling of other types of workloads, see Ali-Eldin et al. (2012), Bodík et al. (2009) and Andreolini & Casolari (2006). [6] [7] [8]

4. Hyndman, R. J. et al. (2005), Local linear forecasts using cubic smoothing splines, Australian & New Zealand Journal of Statistics, Vol. 47(1), pp. 87-99

5. Herbst, N.R. et al. (2013), Self-adaptive workload classification and forecasting for proactive resource provisioning. Proceedings of the ACM/SPEC international conference on performance engineering (ICPE 2013), pp. 187-198

Moreover, some of the work and results from this master thesis were partially included in a published article by Ali-Eldin et al. (2014). [9]

1.5 Disposition

The rest of the thesis is organized as follows. Section 2 presents the theory on time series modeling and forecasting that is used when constructing the forecasting method outlined in section 3. Section 4 evaluates the forecasting method on the two Wikimedia data sets, whereas section 5 ends this thesis with a summary and discussion, as well as suggestions on future areas of research.

6. Ali-Eldin, A. et al. (2012), Efficient provisioning of bursty scientific workloads on the cloud using adaptive elasticity control. In Proceedings of the 3rd workshop on Scientific Cloud Computing, pp. 31-40

7. Bodík, P. et al. (2009), Statistical machine learning makes automatic control practical for internet datacenters. In Proceedings of the 2009 conference on hot topics in cloud computing, pp. 12-15

8. Andreolini, M. & Casolari, S. (2006), Load prediction models in web-based systems. In Proceedings of the 1st international conference on Performance evaluation methodologies and tools, pp. 27-36

9. Ali-Eldin & Rezaie et al., pp. 352-353

2 Theory

This section starts by introducing some commonly used notation and concepts in time series modeling and forecasting. The parametric models are then introduced, with a discussion on how to analyze and evaluate them. Finally, a semi-parametric method based on splines is introduced, together with a goodness-of-fit measure.

2.1 Time series analysis

Time series analysis is a term used to describe methods that model and forecast data observed over a grid of time points, where the observations are evenly spaced in time, e.g. every minute or hour.

2.1.1 Means, variances and autocovariances

Let $Y_t$ be a real-valued stochastic variable that is observed at a certain time point $t$. For a time series $\{Y_t : t = 0, \pm 1, \pm 2, \ldots\}$ the mean function is defined as

\[ \mu_t = E(Y_t), \quad t = 0, \pm 1, \pm 2, \ldots, \]

which is the expected value of the time series at time $t$.

The autocovariance function (ACVF) $\gamma_{t,s}$ measures the linear dependence between $Y_t$ and $Y_s$. The ACVF is defined as

\[ \gamma_{t,s} = \mathrm{Cov}(Y_t, Y_s), \quad t, s = 0, \pm 1, \pm 2, \ldots, \]

where $\mathrm{Cov}(Y_t, Y_s) = E[(Y_t - \mu_t)(Y_s - \mu_s)] = E(Y_t Y_s) - \mu_t \mu_s$ and $\mathrm{Cov}(Y_t, Y_t) = \mathrm{Var}(Y_t) = \sigma_t^2$. Since the covariance is symmetric in its arguments, the property $\gamma_{t,s} = \gamma_{s,t}$ holds for any $t$ and $s$. The autocorrelation function (ACF) is the standardized version of the ACVF, given by

\[ \rho_{t,s} = \mathrm{Corr}(Y_t, Y_s), \quad t, s = 0, \pm 1, \pm 2, \ldots, \]

where

\[ \mathrm{Corr}(Y_t, Y_s) = \frac{\mathrm{Cov}(Y_t, Y_s)}{\sqrt{\mathrm{Var}(Y_t)\,\mathrm{Var}(Y_s)}} = \frac{\gamma_{t,s}}{\sqrt{\gamma_{t,t}\,\gamma_{s,s}}}. \]

The ACF satisfies $-1 \le \rho_{t,s} \le 1$, and $\rho_{t,s} = 0$ means that there is no linear dependence between $Y_t$ and $Y_s$. [10]

10. Cryer, J.D. & Chan, K-S. (2008), Time Series Analysis with Applications in R, 2nd Ed., New York, NY: Springer Science Media, pp. 11-12

2.1.2 Stationarity

An important concept in time series analysis is stationarity. There exist two types of stationarity, strict stationarity and weak stationarity. Informally, stationarity means that the behavior of the time series does not change over time. The time series $\{Y_t\}$ is strictly stationary if the joint distribution of $(Y_{t_1}, Y_{t_2}, \ldots, Y_{t_n})$ is the same as the joint distribution of $(Y_{t_1 - k}, Y_{t_2 - k}, \ldots, Y_{t_n - k})$ for any time points $(t_1, \ldots, t_n)$ and any finite lag $k$. [11]

A time series $\{Y_t\}$ is weakly stationary if

1. The mean $\mu_t = \mu$ is constant over time.
2. The autocovariance satisfies $\gamma_{t,t-k} = \gamma_{0,k}$ for all $t$ and $k$. [12]

Whenever stationarity is mentioned in this thesis, it refers to weak stationarity unless explicitly indicated otherwise.

2.1.3 Trend and seasonality

In practice, many time series $\{Y_t\}$ are non-stationary. Non-stationary time series are often modeled by decomposing them into three components. The time series at any time $t$ is then described as $Y_t = M_t + S_t + Z_t$, where $M_t$ is the deterministic trend that captures the slowly varying changes, $S_t$ is the seasonal component that describes the repetitive patterns with a known periodicity $s$, and $Z_t$ is a weakly stationary process. [13]

An example of a non-stationary time series which has both a trend and a seasonal component can be seen in figure 2.1.1, where the monthly time series describes the carbon dioxide levels at Mauna Loa, Hawaii, from January 1965 to December 1980. [14]

11. Cryer & Chan, p. 16
12. ibid., pp. 16-17
13. Brockwell, P.J. & Davis, R.A. (2002), Introduction to Time Series and Forecasting, 2nd Ed., New York, NY: Springer-Verlag, p. 23
14. Datamarket, http://datamarket.com, (2014-08-11)

[Figure: CO2 levels (320 to 340) against date (1965 to 1980).]

Figure 2.1.1: Monthly carbon dioxide levels at Mauna Loa, Hawaii, from January 1965 to December 1980.

As seen in figure 2.1.1, there is a linearly increasing trend and an annual seasonal pattern of periodicity s = 12 months.

2.1.4 White noise

White noise $\{e_t : t = 0, \pm 1, \pm 2, \ldots\}$ is a time series with zero mean, $\mu_e = 0$, and a constant, finite variance $\sigma_e^2$. White noise is uncorrelated in time, so that $\mathrm{Cov}(e_t, e_s) = 0$ when $t \neq s$. Note that even though white noise is uncorrelated at all time lags, there can still be non-linear dependence between different time lags. [15] In this thesis white noise is denoted by $\mathrm{WN}(0, \sigma_e^2)$.

A special case of white noise arises when the variables are independent and identically distributed, [16] but it is not applied in this thesis.
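The distinction between uncorrelated and independent white noise can be illustrated with a small simulation. The sketch below is not taken from the thesis: it assumes NumPy, and the construction e_t = z_t z_{t-1} with independent standard normal z_t is a textbook-style example chosen here for illustration. The resulting series has zero mean, constant variance and no linear correlation over time, yet its squared values are clearly correlated.

```python
import numpy as np

def lag1_corr(x):
    """Sample correlation between consecutive values x[t-1] and x[t]."""
    return float(np.corrcoef(x[:-1], x[1:])[0, 1])

rng = np.random.default_rng(0)      # fixed seed, illustrative only
z = rng.standard_normal(20001)      # independent N(0, 1) draws

# e_t = z_t * z_{t-1}: zero mean, unit variance, uncorrelated in time,
# but consecutive values share a factor and are therefore dependent.
e = z[1:] * z[:-1]

corr_linear = lag1_corr(e)          # near 0: no linear dependence
corr_squares = lag1_corr(e ** 2)    # clearly positive: non-linear dependence

print(f"lag-1 correlation of e_t:   {corr_linear:.3f}")
print(f"lag-1 correlation of e_t^2: {corr_squares:.3f}")
```

For this particular construction the theoretical lag-1 correlation of the squares is 0.25, so the second number should be well away from zero while the first stays close to it.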

2.2 Time series models

One of the most common parametric time series models is the autoregressive integrated moving average (ARIMA) model. The model consists of three components, whose orders are given by the integer parameters p, d and q. A specific model is written as ARIMA(p, d, q), which identifies the structure of all the components. [17]

What these parameters are and how they are defined is explained in the coming subsections. For simplicity, the models described in this section are given without a constant term, that is with mean $\mu_t = 0$, whenever the process is stationary, unless indicated otherwise.

15. Brockwell & Davis, pp. 16-17
16. ibid., pp. 16-17

2.2.1 Autoregressive models

Since the autoregressive (AR) model is used frequently in this thesis, it gets a more detailed introduction than the moving average model (which is introduced in the next subsection).

As its name suggests, the AR model regresses on itself. The AR model of order p describes the current value as a linear function of the p past values plus a white noise term that the past p values cannot explain. The order p is a positive integer and the stationary AR(p) model is given by

\[ Y_t = \phi_1 Y_{t-1} + \phi_2 Y_{t-2} + \cdots + \phi_p Y_{t-p} + e_t, \tag{2.2.1} \]

where $\phi_i$, $i = 1, \ldots, p$, are the parameters of the AR model and $e_t$ is $\mathrm{WN}(0, \sigma_e^2)$.

By introducing the backshift operator $B$, where $B^d Y_t = Y_{t-d}$ for $d = 0, \pm 1, \pm 2, \ldots$, the model can be rewritten as

\[ \phi(B) Y_t = e_t, \]

where $\phi(\cdot)$ is the $p$th-order polynomial

\[ \phi(z) = 1 - \phi_1 z - \phi_2 z^2 - \cdots - \phi_p z^p. \]

The process is stationary if and only if [18]

\[ \phi(z) = 1 - \phi_1 z - \phi_2 z^2 - \cdots - \phi_p z^p \neq 0 \quad \text{for all } |z| = 1. \tag{2.2.2} \]
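Condition (2.2.2) can be checked numerically by computing the roots of the AR polynomial and verifying that none of them has modulus one. The following is a minimal sketch assuming NumPy; the function names and the tolerance are illustrative choices and not part of the thesis.

```python
import numpy as np

def ar_polynomial_roots(phi):
    """Roots of phi(z) = 1 - phi_1 z - ... - phi_p z^p."""
    # numpy.roots expects coefficients ordered from the highest power
    # down to the constant term.
    coeffs = [-c for c in reversed(phi)] + [1.0]
    return np.roots(coeffs)

def satisfies_condition(phi, tol=1e-8):
    """True if no root of phi(z) lies on the unit circle |z| = 1."""
    moduli = np.abs(ar_polynomial_roots(phi))
    return bool(np.all(np.abs(moduli - 1.0) > tol))

print(satisfies_condition([0.5]))        # AR(1) with phi_1 = 0.5, root at z = 2
print(satisfies_condition([1.0]))        # random walk, root at z = 1
print(satisfies_condition([0.5, 0.3]))   # AR(2), both roots off the unit circle
```

In practice one often additionally requires all roots to lie strictly outside the unit circle, which makes the process depend only on past noise terms; that stronger requirement is standard theory but is not stated in the text above.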

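Assuming stationarity, multiplying equation (2.2.1) by $Y_t$ and taking expectations gives the variance $\gamma_0$,

\[ \gamma_0 = \phi_1 \gamma_1 + \phi_2 \gamma_2 + \cdots + \phi_p \gamma_p + \sigma_e^2, \tag{2.2.3} \]

where $\sigma_e^2$ is the variance of the white noise and $\gamma_k = \mathrm{Cov}(Y_t, Y_{t-k})$. On the other hand, multiplying equation (2.2.1) by $Y_{t-k}$ and taking expectations yields the autocovariance at lag $k$,

\[ \gamma_k = \phi_1 \gamma_{k-1} + \phi_2 \gamma_{k-2} + \cdots + \phi_p \gamma_{k-p}, \quad k \ge 1. \tag{2.2.4} \]

Equations (2.2.3) and (2.2.4) are the Yule-Walker equations, which can be written in matrix form as [19]

\[ \gamma_0 = \boldsymbol{\phi}_p' \boldsymbol{\gamma}_p + \sigma_e^2, \]

and

\[ \boldsymbol{\gamma}_p = \Gamma_p \boldsymbol{\phi}_p, \tag{2.2.5} \]

where

\[ \boldsymbol{\gamma}_p = [\gamma_1, \ldots, \gamma_p]', \tag{2.2.6} \]

\[ \boldsymbol{\phi}_p = [\phi_1, \phi_2, \ldots, \phi_p]', \tag{2.2.7} \]

and

\[ \Gamma_p = \begin{pmatrix} \gamma_0 & \gamma_1 & \gamma_2 & \cdots & \gamma_{p-1} \\ \gamma_1 & \gamma_0 & \gamma_1 & \cdots & \gamma_{p-2} \\ \gamma_2 & \gamma_1 & \gamma_0 & \cdots & \gamma_{p-3} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ \gamma_{p-1} & \gamma_{p-2} & \gamma_{p-3} & \cdots & \gamma_0 \end{pmatrix}. \tag{2.2.8} \]

Noting that $\gamma_k = \gamma_{-k}$, the Yule-Walker equations can be formulated in terms of autocorrelations. Dividing both sides of equation (2.2.4) by $\gamma_0$ gives the autocorrelation at lag $k$,

\[ \rho_k = \phi_1 \rho_{k-1} + \phi_2 \rho_{k-2} + \cdots + \phi_p \rho_{k-p}, \quad k \ge 1. \]

Hence dividing (2.2.5) by $\gamma_0$, where $\rho_0 = 1$, yields [20]

\[ \begin{pmatrix} \rho_1 \\ \rho_2 \\ \rho_3 \\ \vdots \\ \rho_p \end{pmatrix} = \begin{pmatrix} \rho_0 & \rho_1 & \rho_2 & \cdots & \rho_{p-1} \\ \rho_1 & \rho_0 & \rho_1 & \cdots & \rho_{p-2} \\ \rho_2 & \rho_1 & \rho_0 & \cdots & \rho_{p-3} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ \rho_{p-1} & \rho_{p-2} & \rho_{p-3} & \cdots & \rho_0 \end{pmatrix} \begin{pmatrix} \phi_1 \\ \phi_2 \\ \phi_3 \\ \vdots \\ \phi_p \end{pmatrix}, \tag{2.2.9} \]

or, more compactly,

\[ \boldsymbol{\rho}_p = R_p \boldsymbol{\phi}_p. \]

17. Cryer & Chan, p. 92
18. Brockwell & Davis, pp. 83-86
19. Brockwell & Davis, p. 139
20. Cryer & Chan, p. 76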

The coefficients $\phi_1, \phi_2, \ldots, \phi_p$ can be estimated from the observed data $Y_1, \ldots, Y_n$ by the ordinary least squares (OLS) method, which minimizes the sum of squared errors (SSE)

\[ \mathrm{SSE}(\boldsymbol{\phi}) = \sum_{t=p+1}^{n} \Big( Y_t - \sum_{i=1}^{p} \phi_i Y_{t-i} \Big)^2, \]

with respect to $\phi_i$, $i = 1, 2, \ldots, p$. [21]

There exist other methods to estimate the coefficients of the AR model, such as the maximum likelihood estimator or the Yule-Walker method. The latter is a method-of-moments estimator that uses the Yule-Walker equations to derive the coefficients. [22] The method replaces $\gamma_k$ in equation (2.2.5) by the sample autocovariances $\hat{\gamma}_k$ (see subsection 2.2.6), or $\rho_k$ in equation (2.2.9) by the sample autocorrelations $\hat{\rho}_k$, for all $k$, and estimates the coefficients $\phi_i$ by solving the resulting linear equation system, so that

\[ \hat{\boldsymbol{\phi}}_p = \hat{\Gamma}_p^{-1} \hat{\boldsymbol{\gamma}}_p, \tag{2.2.10} \]

or

\[ \hat{\boldsymbol{\phi}}_p = \hat{R}_p^{-1} \hat{\boldsymbol{\rho}}_p. \]

The OLS estimator is applied in this thesis when estimating the AR coefficients.
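The two estimators can be compared on simulated data. The sketch below assumes NumPy; the function names, the AR(2) coefficients and the sample size are hypothetical choices made for illustration and do not come from the thesis. The OLS fit minimizes the SSE above, and the Yule-Walker fit solves the linear system with sample autocovariances (divisor n) plugged in.

```python
import numpy as np

def simulate_ar(phi, n, rng, burn_in=500):
    """Simulate a zero-mean AR(p) series driven by standard normal white noise."""
    p = len(phi)
    e = rng.standard_normal(n + burn_in)
    y = np.zeros(n + burn_in)
    for t in range(p, n + burn_in):
        # y[t - p:t][::-1] is (Y_{t-1}, ..., Y_{t-p})
        y[t] = np.dot(phi, y[t - p:t][::-1]) + e[t]
    return y[burn_in:]

def fit_ar_ols(y, p):
    """Minimise SSE(phi) = sum_{t=p+1}^{n} (Y_t - sum_i phi_i Y_{t-i})^2."""
    n = len(y)
    # Column i (0-based) holds Y_{t-i-1} for t = p+1, ..., n.
    X = np.column_stack([y[p - i - 1:n - i - 1] for i in range(p)])
    coef, *_ = np.linalg.lstsq(X, y[p:], rcond=None)
    return coef

def sample_acvf(y, k):
    """Sample autocovariance at lag k >= 0, using the divisor n."""
    n = len(y)
    d = y - y.mean()
    return float(np.dot(d[k:], d[:n - k]) / n)

def fit_ar_yule_walker(y, p):
    """Solve Gamma_p phi = gamma_p with sample autocovariances plugged in."""
    g = np.array([sample_acvf(y, k) for k in range(p + 1)])
    Gamma = np.array([[g[abs(i - j)] for j in range(p)] for i in range(p)])
    return np.linalg.solve(Gamma, g[1:])

rng = np.random.default_rng(1)                  # fixed seed, illustrative only
y = simulate_ar([0.6, -0.3], n=5000, rng=rng)   # hypothetical AR(2) coefficients
phi_ols = fit_ar_ols(y, p=2)
phi_yw = fit_ar_yule_walker(y, p=2)
print("OLS:        ", np.round(phi_ols, 3))
print("Yule-Walker:", np.round(phi_yw, 3))
```

With a few thousand observations both estimates should land close to the coefficients used in the simulation; on short series the two methods can differ more noticeably.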

2.2.2 Moving average models

The moving average (MA) model of order q is abbreviated as MA(q), where q is a positive integer. An MA(q) model is often used to smooth the original data, with the smoothed series overlaid on the original time series plot to reveal patterns in the data, or to describe a correlated time series. [23] The model is a linear combination of the white noise terms $e_t, e_{t-1}, \ldots, e_{t-q}$, weighted by the coefficients $\theta_1, \theta_2, \ldots, \theta_q$, which makes it a q-step correlated process. Note that an MA(q) process is always stationary; it is defined as [24]

\[ Y_t = e_t - \theta_1 e_{t-1} - \theta_2 e_{t-2} - \cdots - \theta_q e_{t-q}, \]

where $e_t$ is $\mathrm{WN}(0, \sigma_e^2)$. Equivalently, the model can be defined with the backshift operator as

\[ Y_t = \theta(B) e_t, \]

where $\theta(B) = 1 - \theta_1 B - \theta_2 B^2 - \cdots - \theta_q B^q$. The variance of the MA(q) model satisfies

\[ \gamma_0 = (1 + \theta_1^2 + \theta_2^2 + \cdots + \theta_q^2)\, \sigma_e^2, \tag{2.2.11} \]

where $\sigma_e^2$ is the variance of the white noise. The autocovariance at lag $k$ is

\[ \gamma_k = (-\theta_k + \theta_1 \theta_{k+1} + \theta_2 \theta_{k+2} + \cdots + \theta_{q-k} \theta_q)\, \sigma_e^2, \quad 1 \le k \le q. \tag{2.2.12} \]

The autocorrelation of the MA(q) process is obtained by dividing equation (2.2.12) by (2.2.11):

\[ \rho_k = \begin{cases} \dfrac{-\theta_k + \theta_1 \theta_{k+1} + \theta_2 \theta_{k+2} + \cdots + \theta_{q-k} \theta_q}{1 + \theta_1^2 + \theta_2^2 + \cdots + \theta_q^2}, & 1 \le k \le q, \\[2ex] 0, & k > q. \end{cases} \]

The coefficients of the MA(q) model are typically estimated by the maximum likelihood method, assuming that $e_t$ is normally distributed. [25]

21. Cryer & Chan, pp. 154-156
22. ibid., pp. 149-150, 158-160
23. Montgomery, D.C., Jennings, C.L. & Kulahci, M. (2008), Introduction to Time Series Analysis and Forecasting, New Jersey: Wiley series in probability and statistics, p. 22
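The autocorrelation formula for the MA(q) process translates directly into a few lines of code. The sketch below uses plain Python; the function name and the example coefficient are illustrative assumptions. It writes the process with weights 1, -θ_1, ..., -θ_q on the noise terms, so that the sum of products of weights k steps apart reproduces equation (2.2.12) up to the factor σ_e².

```python
def ma_acf(theta, max_lag):
    """Theoretical ACF rho_0, ..., rho_max_lag of
    Y_t = e_t - theta_1 e_{t-1} - ... - theta_q e_{t-q}."""
    q = len(theta)
    # Weights on e_t, e_{t-1}, ..., e_{t-q}
    psi = [1.0] + [-th for th in theta]
    # Equation (2.2.11) divided by sigma_e^2
    gamma0 = sum(w * w for w in psi)
    rho = []
    for k in range(max_lag + 1):
        if k > q:
            rho.append(0.0)          # the ACF cuts off after lag q
        else:
            # Equation (2.2.12) divided by sigma_e^2 (and gamma0 itself for k = 0)
            gamma_k = sum(psi[j] * psi[j + k] for j in range(q - k + 1))
            rho.append(gamma_k / gamma0)
    return rho

rho = ma_acf([0.5], max_lag=3)   # MA(1) with a hypothetical theta_1 = 0.5
print(rho)
```

For the MA(1) example the lag-1 autocorrelation is -θ_1 / (1 + θ_1²) = -0.4 and all higher lags are zero, which is the cut-off behavior used later for model identification.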

2.2.3 ARMA models

The AR and the MA models can be combined into one model, called the ARMA(p, q) model, given by

\[ Y_t = \phi_1 Y_{t-1} + \phi_2 Y_{t-2} + \cdots + \phi_p Y_{t-p} + e_t - \theta_1 e_{t-1} - \theta_2 e_{t-2} - \cdots - \theta_q e_{t-q}, \]

where $e_t$ is $\mathrm{WN}(0, \sigma_e^2)$. With the backshift operator it can be rewritten as

\[ \phi(B) Y_t = \theta(B) e_t. \]

The ARMA(p, q) model is stationary if (2.2.2) is satisfied, and the parameters are typically estimated from the data by the maximum likelihood method, assuming normally distributed $e_t$ for any $t$. [26]

24. Cryer & Chan, p. 65
25. Shumway, R.H. & Stoffer, D.S. (2011), Time series analysis and its applications, 3rd Ed., New York, NY: Springer Science Media, pp. 126-128
26. Brockwell & Davis, pp. 83-84

2.2.4 ARIMA models

The I in ARIMA stands for integrated, and it relates to how many times the process $Y_t$ needs to be differenced to become stationary. Differencing is commonly used to remove a polynomial trend, such as a linear or quadratic one, from the time series. A linear trend can be removed by differencing the time series once, forming

\[ \nabla Y_t = Y_t - Y_{t-1} = (1 - B) Y_t. \]

The stationary series $\nabla Y_t$ can then be modeled as an ARMA model. The general ARIMA(p, d, q) model is written as

\[ \nabla^d Y_t = \phi_1 \nabla^d Y_{t-1} + \cdots + \phi_p \nabla^d Y_{t-p} + e_t - \theta_1 e_{t-1} - \cdots - \theta_q e_{t-q}, \]

where $e_t$ is $\mathrm{WN}(0, \sigma_e^2)$ and $\nabla^d Y_t = \nabla(\nabla^{d-1} Y_t)$ for $d \ge 1$, with $\nabla^0 Y_t = Y_t$.

The order of differencing d depends on the type of trend in the data. For example, the data is differenced twice when a quadratic trend exists. The ARIMA models are typically estimated by maximum likelihood methods for given p, q and d. [27]
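The effect of the operator ∇ = (1 − B) on polynomial trends can be verified on noise-free toy series. The sketch below assumes NumPy, and the trend coefficients are arbitrary illustrative values: one difference reduces a linear trend to a constant, while a quadratic trend needs two.

```python
import numpy as np

t = np.arange(10, dtype=float)

linear = 3.0 + 2.0 * t                    # linear trend
quadratic = 1.0 + 0.5 * t ** 2            # quadratic trend

d1_linear = np.diff(linear, n=1)          # (1 - B) Y_t
d1_quadratic = np.diff(quadratic, n=1)    # still contains a linear trend
d2_quadratic = np.diff(quadratic, n=2)    # (1 - B)^2 Y_t

print(d1_linear)       # constant, equal to the slope 2.0
print(d1_quadratic)    # increases linearly with t
print(d2_quadratic)    # constant, equal to 2 * 0.5 = 1.0
```

Each application of the difference shortens the series by one observation, which is why the twice-differenced series has eight values here.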

2.2.5 SARIMA models

Time series may include different kinds of seasonal patterns of periodicity s, such as repetitive monthly or weekly patterns. For example, the monthly data in figure 2.1.1 has an annual seasonal pattern with s = 12. These types of seasonal time series can be modeled by seasonal ARIMA models, so-called SARIMA models. Characteristic of seasonal time series is that the autocorrelations at the seasonal lags s, 2s, 3s, ... are strong and weaker at the lags in between. SARIMA models make it possible to emphasize the seasonal lags in the model and to apply seasonal differencing, $\nabla_s Y_t = Y_t - Y_{t-s}$.

The SARIMA model has an additional seasonal component in its specification, defined by P for the seasonal AR part, Q for the seasonal MA part and D for the number of seasonal differences, where $\nabla_s^D Y_t = \nabla_s(\nabla_s^{D-1} Y_t)$ and all orders are non-negative integers. A SARIMA$(p, d, q)(P, D, Q)_s$ model is described as

\[ \phi(B)\, \Phi(B^s)\, (1 - B)^d (1 - B^s)^D\, Y_t = \theta(B)\, \Theta(B^s)\, e_t, \]

where s is the seasonal lag and $e_t$ is $\mathrm{WN}(0, \sigma_e^2)$. The polynomials $\Phi(\cdot)$ and $\Theta(\cdot)$ are the seasonal components, defined as [28]

\[ \Phi(z) = 1 - \Phi_1 z - \Phi_2 z^2 - \cdots - \Phi_P z^P, \]

\[ \Theta(z) = 1 - \Theta_1 z - \Theta_2 z^2 - \cdots - \Theta_Q z^Q. \]

For example, with $\theta(\cdot)$ and $\Theta(\cdot)$ as defined above, a SARIMA$(1, 0, 1)(1, 0, 1)_{12}$ model is defined as

\[ (1 - \phi_1 B)(1 - \Phi_1 B^{12})\, Y_t = (1 - \theta_1 B)(1 - \Theta_1 B^{12})\, e_t, \]

or equivalently by

\[ Y_t = \phi_1 Y_{t-1} + \Phi_1 Y_{t-12} - \phi_1 \Phi_1 Y_{t-13} + e_t - \theta_1 e_{t-1} - \Theta_1 e_{t-12} + \theta_1 \Theta_1 e_{t-13}. \]

The parameters of the SARIMA$(p, d, q)(P, D, Q)_s$ model are typically estimated by the maximum likelihood method, based on a set of observed values $Y_1, \ldots, Y_n$ of the time series, for given p, d, q, P, D and Q.

27. Brockwell & Davis, pp. 29, 180-182
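The expansion of the product of a non-seasonal and a seasonal operator polynomial can be checked by multiplying their coefficient vectors. The sketch below assumes NumPy and its `numpy.polynomial.polynomial.polymul` function, which takes coefficients in ascending powers; the helper name and the numerical values of the coefficients are hypothetical.

```python
import numpy as np
from numpy.polynomial import polynomial as P

def seasonal_product(a1, A1, s):
    """Coefficients (ascending powers of B) of (1 - a1 B)(1 - A1 B^s)."""
    nonseasonal = np.array([1.0, -a1])
    seasonal = np.zeros(s + 1)
    seasonal[0], seasonal[s] = 1.0, -A1
    return P.polymul(nonseasonal, seasonal)

phi1, Phi1 = 0.5, 0.8          # hypothetical values, for illustration only
ar_side = seasonal_product(phi1, Phi1, s=12)

# Only the powers 0, 1, 12 and 13 of B carry non-zero coefficients.
for lag in np.flatnonzero(ar_side):
    print(f"B^{lag}: {ar_side[lag]:+.2f}")
```

The cross term at lag 13 has coefficient φ_1 Φ_1 with a positive sign in the operator polynomial, which becomes the term −φ_1 Φ_1 Y_{t−13} once everything except Y_t is moved to the right-hand side.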

2.2.6 Model selection

The autocorrelation function (ACF) and the partial autocorrelation function (PACF) are often used as diagnostic tools to identify p, d, q, P, D and Q. The ACF of a stationary time series at lag $k$ is defined as

\[ \rho_k = \frac{\gamma_k}{\gamma_0}, \]

where $\gamma_k = \mathrm{Cov}(Y_t, Y_{t-k})$. The ACF at a given lag $k$ can be estimated from the observed time series by

\[ \hat{\rho}_k = \frac{\hat{\gamma}_k}{\hat{\gamma}_0}, \]

where $\hat{\gamma}_0$ is the sample variance and $\hat{\gamma}_k$ is the sample autocovariance, defined as

\[ \hat{\gamma}_k = \frac{1}{n} \sum_{t=1}^{n-|k|} (Y_{t+|k|} - \bar{Y})(Y_t - \bar{Y}), \quad k = 0, \pm 1, \pm 2, \ldots, \pm(n-1), \tag{2.2.13} \]

and $\bar{Y}$ is the mean value of $Y_1, \ldots, Y_n$. [29]
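Equation (2.2.13) and the resulting sample ACF can be computed in a few lines. The sketch below uses plain Python; the function name and the toy series with period 4 are illustrative assumptions, not data from the thesis.

```python
def sample_acf(y, max_lag):
    """Sample autocorrelations for lags 0, ..., max_lag, with the sample
    autocovariance computed with divisor n as in equation (2.2.13)."""
    n = len(y)
    mean = sum(y) / n
    d = [v - mean for v in y]

    def acvf(k):
        return sum(d[t + k] * d[t] for t in range(n - k)) / n

    g0 = acvf(0)
    return [acvf(k) / g0 for k in range(max_lag + 1)]

# Toy seasonal series with period s = 4, repeated six times.
seasonal = [0.0, 1.0, 0.0, -1.0] * 6
rho = sample_acf(seasonal, max_lag=8)
print([round(r, 3) for r in rho])
```

The sample ACF of the toy series is large and positive at lags 4 and 8 and strongly negative at lag 2, which is the kind of pattern at seasonal lags that the diagnostics in this subsection look for. Because the divisor is n rather than n − k, the estimates shrink slightly toward zero as the lag grows.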

The PACF at lag $k$ is defined as the autocorrelation between $Y_t$ and $Y_{t-k}$ after the linear dependence on the intervening observations $Y_{t-k+1}, \ldots, Y_{t-1}$ has been removed, for $k > 1$. The PACF corresponds to the correlation between the residuals of $Y_t$ and $Y_{t-k}$ after regression on $(Y_{t-k+1}, Y_{t-k+2}, \ldots, Y_{t-1})$, for $k > 1$. The PACF, $\alpha(\cdot)$, satisfies

\[ \alpha(0) = 1, \]

and

\[ \alpha(k) = \phi_{kk}, \quad k \ge 1, \]

where $\phi_{kk}$ is the last component of the k-dimensional vector given by

\[ \boldsymbol{\phi}_k = \Gamma_k^{-1} \boldsymbol{\gamma}_k, \]

where $\Gamma_k^{-1}$ is the inverse of the autocovariance matrix defined by equation (2.2.8), $\boldsymbol{\phi}_k$ is defined by (2.2.7) and $\boldsymbol{\gamma}_k$ by (2.2.6), each with p replaced by k. The estimates are given by

\[ \hat{\alpha}(0) = 1, \]

and

\[ \hat{\alpha}(k) = \hat{\phi}_{kk}, \quad k \ge 1, \]

where $\hat{\phi}_{kk}$ is the last component of the vector given by equation (2.2.10),

\[ \hat{\boldsymbol{\phi}}_k = \hat{\Gamma}_k^{-1} \hat{\boldsymbol{\gamma}}_k, \]

where the entries of $\hat{\Gamma}_k$ and $\hat{\boldsymbol{\gamma}}_k$ are estimated by equation (2.2.13). This means that, for any lag $k$, $\hat{\alpha}(k)$ equals the estimated coefficient $\hat{\phi}_{kk}$ of the last lag in a fitted AR(k) model. When plotting the sample ACF and PACF of a time series, the behavior summarized in table 2.2.1 is useful for identifying the specific model. [30]

Table 2.2.1: General behavior of the ACF and PACF for ARMA models.

          AR(p)                   MA(q)                   ARMA(p,q)
ACF       Tails off               Cuts off after lag q    Tails off
PACF      Cuts off after lag p    Tails off               Tails off

If the ACF and PACF have large (positive) values that decrease very slowly with the lag, this indicates that d or D is larger than zero and that differencing should be used. If the ACF and/or the PACF have larger values at lags s, 2s, 3s, ..., for some integer s, this indicates a periodicity at lag s. For further information regarding parametric time series analysis and forecasting, see Brockwell & Davis (2002).

28. Brockwell & Davis, pp. 203-206
29. ibid., pp. 46, 59
30. Cryer & Chan, p. 116
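The sample PACF described in subsection 2.2.6 can be obtained by solving the Yule-Walker system for each order k and keeping the last coefficient. The sketch below assumes NumPy; the function names and the simulated AR(1) series with coefficient 0.7 are illustrative assumptions. For an AR(1) process the PACF should be close to the AR coefficient at lag 1 and close to zero afterwards, matching the cut-off behavior for AR(p) models.

```python
import numpy as np

def sample_acvf(y, max_lag):
    """Sample autocovariances for lags 0, ..., max_lag (divisor n)."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    d = y - y.mean()
    return np.array([np.dot(d[k:], d[:n - k]) / n for k in range(max_lag + 1)])

def sample_pacf(y, max_lag):
    """Sample PACF: last coefficient of the Yule-Walker fit of each order k."""
    g = sample_acvf(y, max_lag)
    pacf = [1.0]
    for k in range(1, max_lag + 1):
        # k x k matrix with entries gamma_hat_{|i - j|}
        Gamma_k = np.array([[g[abs(i - j)] for j in range(k)] for i in range(k)])
        phi_k = np.linalg.solve(Gamma_k, g[1:k + 1])
        pacf.append(float(phi_k[-1]))
    return pacf

# Simulated AR(1) series with a hypothetical coefficient of 0.7.
rng = np.random.default_rng(2)
e = rng.standard_normal(5000)
y = np.zeros(5000)
for t in range(1, 5000):
    y[t] = 0.7 * y[t - 1] + e[t]

pacf = sample_pacf(y, max_lag=4)
print(np.round(pacf, 2))
```

Solving a fresh linear system for every lag is simple but not the most efficient approach; recursive schemes exist for long lag ranges, though for the handful of lags inspected in practice the direct solve is usually sufficient.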

2.3 Semi-parametric models

A discrete time series $Y_t$ is sometimes modeled by a deterministic underlying function $x(t)$ describing the trend and seasonal pattern of the time series, plus a stationary stochastic component. In that case $Y_t$ is modeled by

\[ Y_t = X_t + Z_t, \tag{2.3.1} \]

where $X_t = x(t) = M_t + S_t$ is the deterministic function describing the trend and seasonality. The remainder $Z_t$ is assumed to be stationary and can, for example, be described by an ARMA model. Figure 2.3.1 illustrates a continuous function $x(t)$ observed at the discrete time points 0, 1, 2, ..., 10 in the interval [0, 10].

[Figure: f(t) (-1.0 to 1.0) against t (0 to 10).]

Figure 2.3.1: The red curve denotes the continuous underlying function x(t), and the black points are the values of a time series Yt, for t = 0, 1, 2, ..., 10. The differences between the points and the red curve correspond to the remainder Zt.

The deterministic function x(t) is assumed to be continuous and differentiable over some time interval T, and x(t) can be described by a linear combination of g known basis functions (λ1(t), λ2(t), ..., λg(t)) of the time t. It is defined as

$$x(t) = \sum_{j=1}^{g} c_j \lambda_j(t), \tag{2.3.2}$$

where the coefficients c = (c1, c2, ..., cg) are unknown.31 Given the basis functions, the coefficients can be estimated by the OLS method by finding the c that minimizes the SSE32

$$SSE(c) = \sum_{t=1}^{n} \Bigl[ Y_t - \sum_{j=1}^{g} c_j \lambda_j(t) \Bigr]^2,$$

where {Yt, t = 1, 2, ..., n} is the observed time series.

31. Ramsey, J.O. & Silverman, B.W. (2006), Functional Data Analysis, 2nd Ed., New York, NY: Springer Science Media, pp. 38-40, 44
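As a minimal sketch of the least squares step, the coefficients c can be obtained by solving the normal equations of the basis expansion. The helper names `solve_linear_system` and `fit_basis_ols` are hypothetical, and a small polynomial basis is used here purely for illustration in place of the B-splines applied in the thesis.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x

def fit_basis_ols(times, values, basis):
    """Minimize SSE(c) = sum_t [Y_t - sum_j c_j * lambda_j(t)]^2."""
    g = len(basis)
    design = [[f(t) for f in basis] for t in times]
    # Normal equations: (L'L) c = L'y, with L the n x g design matrix.
    gram = [[sum(row[i] * row[j] for row in design) for j in range(g)]
            for i in range(g)]
    rhs = [sum(row[i] * y for row, y in zip(design, values))
           for i in range(g)]
    return solve_linear_system(gram, rhs)

# Illustrative basis (an assumption): 1, t, t^2.
basis = [lambda t: 1.0, lambda t: float(t), lambda t: float(t * t)]
times = [0, 1, 2, 3, 4]
values = [1 + 2 * t + 3 * t * t for t in times]
coef = fit_basis_ols(times, values, basis)
```

Because the toy data is generated exactly from the basis, the fitted coefficients recover (1, 2, 3) up to rounding error. For larger or ill-conditioned bases a QR-based solver would usually be preferred over the normal equations.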

2.3.1 Basis functions

There exist several types of basis functions and the most common ones are the Fourier, B-spline, wavelet and polynomial bases. They differ in appearance and in application, and the choice of a certain set of basis functions depends on the features of the data. The number of basis functions g should be chosen sufficiently large to be able to capture the important features of the data. In this thesis B-splines are applied.

2.3.2 Splines

The deterministic trend and seasonality x(t) is described by splines in this thesis. Splines are smooth piecewise polynomial functions of fixed degree k, defined over a time interval T = [ta, tb]. Given a set of m interior break points33

ta < τ1 < τ2 < ... < τm < tb,

a spline is constructed from polynomials of degree k on each interval (τi, τi+1), i = 0, 1, ..., m, where τ0 = ta and τm+1 = tb, while restricting them to connect smoothly at the interior break points. There is no restriction on the spacing of the break points τi. Hence the characteristics of the polynomial change whenever one passes a break point τi, but the spline is continuous and typically all its derivatives up to the order k − 1 match at all break points.34

2.3.3 B-Splines

A set of basis functions able to describe a spline of degree k, for a given set of break points, can be constructed in several ways. One of the most popular sets of basis functions is the B-splines, due to their computational efficiency. The B-spline basis functions are defined as35

$$B_p(t) = \sum_{j=p}^{p+k+1} \Biggl[ \prod_{i=p,\, i \neq j}^{p+k+1} \frac{1}{\tau_i - \tau_j} \Biggr] \max\bigl(0, (t - \tau_j)^k\bigr), \quad -\infty < t < \infty.$$

32. Ramsey & Silverman, p. 60
33. By interior, it is meant that the break points are not placed at the beginning or end of the domain of definition of the function.
34. Ramsey & Silverman, pp. 46-49

Note that Bp(t) is non-zero only if t is in the interval [τp, τp+k+1]. This implies that the design matrix created during calculations is almost orthogonal,36 which is an advantage in computations and the major reason why B-splines can be computed efficiently.37 For further reading on how B-splines work and are computed, see Giles (2010). Figure 2.3.2 gives an example of a B-spline basis system with 13 basis functions defined for a third degree spline on the closed interval [0, 10] with 9 equally spaced interior break points.

[Plot: basis function values on the vertical axis (0.0 to 1.0) against t on the horizontal axis (0 to 10).]

Figure 2.3.2: Thirteen cubic B-spline basis functions, where each basis function is represented by a unique color, with nine interior break points in the interval [0, 10].

The number of basis functions b depends on two factors: the number of interior break points m and the degree of the polynomial k. It is given by the equation

b = m + (k + 1).

35. Powell, M.J.D. (1981), Approximation theory and methods, New York, NY: Cambridge University Press, pp. 229-230
36. Giles, D. (2010), B-splines, Wiley Interdisciplinary Reviews: Computational Statistics, Vol. 2(2), pp. 239-240
37. Ramsey, J.O., Hooker, G. & Graves, S. (2009), Functional Data Analysis with R and MATLAB, New York, NY: Springer Science Media, p. 29
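The count b = m + (k + 1) and the local support of the basis can be checked numerically. The sketch below uses the Cox-de Boor recursion, a standard equivalent way of evaluating B-splines that is swapped in here for the divided-difference formula above. The function names and the clamped knot vector (each end point repeated k + 1 times) are assumptions made for illustration.

```python
def bspline_basis(t, knots, degree):
    """Evaluate all B-spline basis functions at t (Cox-de Boor recursion)."""
    n_spans = len(knots) - 1
    # Degree 0: indicator functions of the half-open knot spans.
    b = [1.0 if knots[i] <= t < knots[i + 1] else 0.0 for i in range(n_spans)]
    if t == knots[-1]:
        # Assign the right end point to the last non-empty span.
        last = max(i for i in range(n_spans) if knots[i] < knots[i + 1])
        b[last] = 1.0
    for d in range(1, degree + 1):
        nb = []
        for i in range(len(knots) - d - 1):
            left = right = 0.0
            den_l = knots[i + d] - knots[i]
            den_r = knots[i + d + 1] - knots[i + 1]
            if den_l > 0:
                left = (t - knots[i]) / den_l * b[i]
            if den_r > 0:
                right = (knots[i + d + 1] - t) / den_r * b[i + 1]
            nb.append(left + right)
        b = nb
    return b

def clamped_knots(ta, tb, interior, degree):
    """Knot vector with both end points repeated degree + 1 times."""
    return [ta] * (degree + 1) + list(interior) + [tb] * (degree + 1)

k = 3                                    # cubic spline
interior = [float(i) for i in range(1, 10)]   # m = 9 interior break points
knots = clamped_knots(0.0, 10.0, interior, k)
values = bspline_basis(4.3, knots, k)
end_values = bspline_basis(10.0, knots, k)
```

With m = 9 and k = 3 the sketch returns 13 values, of which only k + 1 = 4 are non-zero at an interior point, and the values sum to one. The local support is what makes the design matrix banded.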

2.3.4 Creating basis functions

Generally there are two ways to place the break points. The first is to space them evenly, which is more common when the future data is unknown and stochastic. The second, used when the data is known and a good spline fit is desired, is to place the break points more densely where the curvature is high, in order to fit the data well.

There exists a trade-off in the number of basis functions. If an excessive number of basis functions is created, there is a risk of capturing much of the noise variation in the data. This leads to overfitting, and an excessive number of basis functions is not favorable for computational reasons either. On the contrary, if an insufficient number of basis functions is created, aspects of the data might be missed and the estimates thus become oversmoothed. The best approach is therefore to seek the minimum number of basis functions that can fit the data well. The most practical way to decide the number of basis functions is to try different numbers and then see which number of functions fits the data best.38

2.4 Goodness-of-fit indicator

Once a time series model has been fitted to the data, it needs to be evaluated for its goodness of fit and its predictive performance, and there are several methods to do this. In this thesis the mean absolute percentage error (MAPE), which focuses on the prediction errors, will be used. MAPE is defined for forecasting on the horizon h as

$$MAPE(h) = \frac{1}{n-h} \sum_{t=1}^{n-h} \left| \frac{Y_{t+h} - F_{t+h}}{Y_{t+h}} \right|,$$

where Yt+h is the true value at time t + h and Ft+h is the predicted value of Yt+h based on the data up to time t. MAPE(h) will typically be a decimal number between 0 and 1, although it can exceed 1 when the prediction errors are larger than the observed values.39

38. Ramsey & Silverman, pp. 48, 67
39. Yaffee, R. & McGee, M. (2000), An Introduction to Time Series Analysis and Forecasting: With Applications of SAS and SPSS, San Diego, CA: Science Publishing Co. Inc, pp. 15-17

MAPE is chosen because it is a relative measure. MAPE avoids letting the larger values (which are present in particular in the later part of the time series in the data set) dominate the evaluation of the prediction method. Other measures could of course also be used, such as the mean absolute error or squared errors.
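A minimal sketch of the MAPE computation is given below. The function name `mape` and the convention of skipping missing (None) or zero true values are assumptions made for illustration; the thesis itself excludes downward outliers and missing values from the evaluation.

```python
def mape(actual, forecast):
    """Mean absolute percentage error over aligned (actual, forecast) pairs.

    Pairs whose true value is missing (None) or zero are skipped,
    since the relative error is undefined for them.
    """
    pairs = [(y, f) for y, f in zip(actual, forecast)
             if y is not None and y != 0]
    if not pairs:
        raise ValueError("no valid observations to evaluate")
    return sum(abs((y - f) / y) for y, f in pairs) / len(pairs)

# Hypothetical toy values: both forecasts are off by 10 percent.
score = mape([100.0, 200.0], [110.0, 180.0])
```

Because each error is divided by the true value, a deviation of 10 on a workload of 100 counts the same as a deviation of 20 on a workload of 200.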


3 An automated prediction method

This section applies the theory that was presented in the previous section to construct a stable, automatic prediction model for workloads with similar characteristics and challenges as the Wikimedia workload. In subsection 3.1 a one-step-ahead prediction model is presented, whereas subsection 3.2 introduces an adapted prediction model for 1 + h steps ahead forecasts.

3.1 One-step-ahead predictive model

As seen in figure 1.3.2 the Wikimedia workload (Yt) is characterized by a repetitive weekly pattern, with an almost locally constant trend that slowly changes with time. The occasional exception is shown in figure 3.1.1, where a level shift occurs in the workload with increased variance in the seasonality. Hence, when a prediction model is fitted to past data, it should only use a window of recent values, and successively update the parameter estimates of the model as time moves on, through a sliding window. In this way the prediction model is able to adapt to the slowly changing time dynamics. The window of the past values will have to be chosen for each specific workload.

[Plot: requests on the vertical axis (20M to 60M) against date on the horizontal axis.]

Figure 3.1.1: Example of a level shift with increased variance in the seasonality, on the Wikimedia request data, between October 17 and November 21, 2012.

The Wikimedia data contains occasional upward and downward outliers and these events often happen in groups, see e.g. figure 3.1.2. Typically there is no possibility to predict when an upward or downward outlier occurs. For upward outliers it is desirable to adapt as quickly as possible to the increased workload when it appears. For downward outliers, which are mainly caused by monitoring problems, the goal is to ignore them and predict them by the estimated pattern.

Outliers are usually rare events and should not be used when estimating the normal seasonal pattern. Therefore there is a need for an automatic way of identifying outliers and removing their effect when updating the parameters of the prediction model. This is discussed in subsection 3.1.1. Missing values can cause additional work when fitting prediction models such as the SARIMA model. The missing values need to be replaced by estimated values since the SARIMA model needs observed data at all time points in order to estimate its parameters by e.g. the maximum likelihood method.

Splines are flexible and computationally fast, and when fitting splines to a given data set it is not necessary to estimate missing values. The estimation procedure is also easy to automate, which is why this method is chosen to estimate the pattern. The weekly repetitive pattern at time t corresponds to Xt = St in equation (2.3.2) and is described by a smooth deterministic function x(z), z ∈ [0, s].

[Plot: requests on the vertical axis (0 to 40M) against date on the horizontal axis.]

Figure 3.1.2: Example of upward and downward outliers, with missing values within the dip, for the period September 25 to October 29, 2008, for the Wikipedia number of requests data.

The repetitive pattern P of periodicity s is estimated from a window of the sd most recent values up to time t. The component d is the number of sequences of periodicity s, where s and d are integers. For example, suppose that at time t one wants to predict the repetitive seasonal pattern P at time t + h, i.e. Pt+h, and assume that there are no outliers or missing values in the sd most recently observed values up to time t. Then Pt+h is predicted based on those values in the following way: for a given degree k, and a given number and placement of break points, a spline xt(z), z ∈ [0, s], is fitted by the least squares method to the data consisting of the d overlaid sequences of periodicity s. The pattern at time t + h is then predicted by Pt+h = xt(h). Thus at time t, forecasting the weekly pattern one step forward, at time t + 1, gives Pt+1 = xt(1). Figure 3.1.3 illustrates how x(t) is estimated on the Wikimedia hourly requests data with s = 168 hours (one week) and d = 2.

[Three-panel plot; panels (a), (b) and (c) are described in the caption.]

Figure 3.1.3: Procedure to estimate the weekly pattern P of the Wikimedia request data by splines from d = 2 weeks of data, between June 1 and June 14, 2008. (a) Two weeks of data corresponding to a window of size 2 × 168 hours. (b) The two weeks of data are overlaid on top of each other. (c) A cubic spline (red line) is fitted to the data, based on 83 equally spaced interior break points, which corresponds to the estimated pattern P.
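The overlay step of panels (a) and (b) can be sketched as follows. The function names are hypothetical, and the per-phase mean in `phase_means` is a deliberately simple stand-in for the least squares spline fit of panel (c); it is not the method of the thesis.

```python
from statistics import mean

def fold_window(window, s):
    """Overlay a window of s*d values (oldest first) by phase z = 1, ..., s.

    The oldest value Y_{t-sd+1} gets phase 1, which is also the phase
    of the next time point t + 1. Missing values (None) are skipped.
    """
    d, remainder = divmod(len(window), s)
    if d == 0 or remainder != 0:
        raise ValueError("window length must be a positive multiple of s")
    folded = {z: [] for z in range(1, s + 1)}
    for i, y in enumerate(window):
        if y is not None:
            folded[i % s + 1].append(y)
    return folded

def phase_means(folded):
    """Stand-in pattern estimate: the mean of the overlaid values per phase."""
    return {z: mean(vals) for z, vals in folded.items() if vals}

# Hypothetical toy window with s = 3 and d = 2.
window = [1.0, 2.0, 3.0, 3.0, 4.0, 5.0]
folded = fold_window(window, s=3)
pattern = phase_means(folded)
```

With this phase convention, `pattern[h]` plays the role of the predicted pattern at time t + h for h = 1, ..., s. A spline fitted to the folded points would additionally smooth across neighboring phases.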


Since the repetitive pattern P indicates the normally behaving pattern, it has to be estimated without the influence of outliers. If some of the last sd time points up to time t are outliers (to be defined) or missing values, they are replaced by estimated spline values. Thus the value Y′t used in the estimation is given by

Y′t = Yt, (3.1.1)

if Yt is not an outlier or a missing value, and otherwise by

Y′t = Pt, (3.1.2)

where Pt = xt−1(1) is the estimated pattern based on the vector Wt−1 = {Y′t−1, Y′t−2, ..., Y′t−sd}. Thus the one-step-ahead forecast can be formed as

Yt+1 = Pt+1. (3.1.3)

However, typically there is autocorrelation left in the residuals, εr = Yr − Pr, where r = t, t − 1, ..., t − sd + 1, that can be used to improve the prediction of Yt+1 by forecasting the next error εt+1. The residuals are proposed to be predicted by an AR(p) model when Yt is not an outlier or a missing value. This is because the coefficients of an AR model are simpler and faster to estimate than those of an ARMA model, and the AR model is often more robust.

The AR parameters φ0, φ1, ..., φp should be estimated based on the sd past residuals, but the residuals should be without the influence of outliers and missing values, see subsection 3.1.1 for more details. The residual εt+1 can then be predicted by

εt+1 = φ0 + φ1εt + φ2εt−1 + ... + φpεt−p+1, p ≤ sd. (3.1.4)

Hence, given the data up to time t, if Yt is not an outlier or a missing value,the one-step-ahead predictor of Yt+1 is given by

Yt+1 = Pt+1 + εt+1, (3.1.5)

where Pt+1 is the estimated spline model and εt+1 is the predicted error from the estimated AR(p) model (3.1.4). The values Pt+1 and εt+1 correspond to the estimated quantities Xt and Zt respectively at time t + 1, defined in subsection 2.3.
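The AR(p) residual prediction and its combination with the pattern can be sketched as below. The function names and the toy coefficients are assumptions for illustration; the coefficients are taken as given here rather than estimated.

```python
def predict_residual(phi, residuals):
    """One-step AR(p) prediction of the next residual.

    phi = [phi0, phi1, ..., phip]; residuals are ordered oldest first
    and end with the residual at time t.
    """
    p = len(phi) - 1
    if len(residuals) < p:
        raise ValueError("need at least p past residuals")
    recent = residuals[::-1][:p]   # eps_t, eps_{t-1}, ..., eps_{t-p+1}
    return phi[0] + sum(c * e for c, e in zip(phi[1:], recent))

def one_step_forecast(pattern_next, phi, residuals):
    """Pattern value at t + 1 plus the predicted residual."""
    return pattern_next + predict_residual(phi, residuals)

# Hypothetical AR(2) coefficients and residuals eps_{t-1} = 4, eps_t = 2.
phi = [0.0, 0.5, 0.25]
eps_next = predict_residual(phi, [4.0, 2.0])
y_next = one_step_forecast(100.0, phi, [4.0, 2.0])
```

In the toy example the predicted residual is 0.5 · 2 + 0.25 · 4 = 2, so a pattern value of 100 is corrected to a forecast of 102.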

Upward outliers usually come in groups. While it is not possible to predict when upward outliers occur, the goal of the prediction model is to adjust to the workload increase quickly and start mitigating the workload change for the subsequent outliers. Thus if Yt is an upward outlier, then Yt+1 is predicted by

Yt+1 = Pt+1 + (Yt − Pt), (3.1.6)

to catch up with the ”explosive” nature of upward outliers.

Downward outliers mainly occur because of monitoring problems and do not typically represent the true workload. They should therefore be ignored and the corresponding true unobserved values be forecast by the estimated trend Pt. Hence, if Yt is a downward outlier, Yt+1 is predicted by

Yt+1 = Pt+1. (3.1.7)

If Yt is a missing value, likely occurring due to the monitoring software, then Yt+1 is predicted by Yt+1 = Pt+1.

When Yt+1 has been observed and the workload at Yt+2 is to be predicted, the local window is shifted one hour ahead and the method above is repeated.
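The four cases above amount to a small dispatch on the status of the latest observation. The sketch below assumes that the classification of Yt into the hypothetical labels "normal", "upward", "downward" and "missing" has already been made; the function name and argument names are also illustrative.

```python
def forecast_next(label, y_t, p_t, p_next, eps_next):
    """One-step-ahead forecast of Y_{t+1} given the status of Y_t.

    label    -- "normal", "upward", "downward" or "missing"
    y_t      -- observed value at time t (None if missing)
    p_t      -- estimated pattern at time t
    p_next   -- estimated pattern at time t + 1
    eps_next -- AR(p) prediction of the residual at time t + 1
    """
    if label == "normal":
        return p_next + eps_next        # equation (3.1.5)
    if label == "upward":
        return p_next + (y_t - p_t)     # equation (3.1.6)
    if label in ("downward", "missing"):
        return p_next                   # equation (3.1.7)
    raise ValueError(f"unknown label: {label}")

# Hypothetical toy values.
normal = forecast_next("normal", 105.0, 100.0, 110.0, 3.0)
spike = forecast_next("upward", 180.0, 100.0, 110.0, 3.0)
dip = forecast_next("downward", 5.0, 100.0, 110.0, 3.0)
gap = forecast_next("missing", None, 100.0, 110.0, 3.0)
```

For an upward outlier the full excess over the pattern (here 80) is carried forward, whereas downward outliers and missing values fall back on the pattern alone.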

3.1.1 Outlier detection

In an automated procedure, there are different ways to detect observations that deviate from the normally behaving data. This subsection describes a way to detect outliers when a pronounced repetitive pattern P is present in the data.

An outlier is a value that deviates substantially from the repetitive pattern P. To detect outliers, the residual variability around the pattern P, excluding outliers, is studied. Let et be defined as

et = Yt − Pt,

if Yt is not an outlier (to be defined) or a missing value, and by

$$e_t = \phi_0 + \sum_{i=1}^{p} \phi_i e_{t-i},$$

otherwise. Here φ0, ..., φp are estimated from Ev = {er, r = v, v − 1, ..., v − sd + 1}, where v is the largest value smaller than t such that Yv is not an outlier or a missing value.

Let Sres be the standard deviation of Et−1 = {et−1, et−2, ..., et−sd}. Then Yt is defined as an upward outlier if

Yt > max(xt−1(1), xt−1(2), ..., xt−1(s)) + ηSres, (3.1.8)

where η is a positive value. Similarly, Yt is defined to be a downward outlierif

Yt < min(xt−1(1), xt−1(2), ..., xt−1(s))− ηSres. (3.1.9)

For instance, if η = 4 and the residuals are assumed to be normally distributed, then Yt will be an upward outlier if it exceeds max(xt−1(1), ..., xt−1(s)) plus the 99.997% quantile of the residual distribution. The value of the parameter η has to be chosen by the user, which will be discussed more thoroughly in section 4, where a sensitivity analysis of the parameter is presented.
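Rules (3.1.8) and (3.1.9) can be sketched as below. The function names are hypothetical, and the use of the sample standard deviation (divisor n − 1) for Sres is an assumption, since the text does not state which estimator is used.

```python
import math
import statistics

def classify(y, pattern, residuals, eta=4.0):
    """Classify Y_t against the estimated pattern values x_{t-1}(1..s)."""
    if y is None:
        return "missing"
    s_res = statistics.stdev(residuals)   # assumed estimator of S_res
    if y > max(pattern) + eta * s_res:
        return "upward"                   # rule (3.1.8)
    if y < min(pattern) - eta * s_res:
        return "downward"                 # rule (3.1.9)
    return "normal"

def normal_cdf(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Hypothetical toy pattern and residuals.
pattern = [10.0, 12.0, 14.0]
residuals = [-1.0, 1.0, -1.0, 1.0]
labels = [classify(y, pattern, residuals) for y in (20.0, 15.0, 5.0, None)]
coverage = normal_cdf(4.0)
```

Note that the thresholds are tied to the maximum and minimum of the whole pattern rather than to the pattern value at time t, so a value has to leave the entire weekly range by η standard deviations to be flagged. The value of `coverage` is approximately 0.99997, matching the quantile quoted above for η = 4.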

3.1.2 Level shifts

The automatic prediction method proposed up to now could be trapped in a mode where it defines all subsequent values as upward outliers due to a persistent, abrupt and large level shift, see e.g. figure 3.1.1. In order to handle such level shifts, a counter is added that tracks how many times in a row an upward outlier has occurred. If Yt is the (s + 1)th outlier in a row, then Y′u = Yu, u = t, t − 1, ..., t − s − 1, and the spline is re-fitted to the new Wt = {Y′r, r = t, t − 1, ..., t − sd + 1}. If there are any missing values among Yt, ..., Yt−s−1, those values are still replaced according to equation (3.1.2).
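A minimal sketch of the counter is given below. The function name is hypothetical, and two conventions are assumptions: the run length includes the current observation, so that a level shift is declared at the (s + 1)th consecutive upward outlier, and the counter is reset whenever an observation is not an upward outlier.

```python
def detect_level_shifts(upward_flags, s):
    """Return the indices at which a level shift is declared.

    upward_flags -- sequence of booleans, True if the observation at
                    that index was classified as an upward outlier
    s            -- periodicity; a shift is declared at the (s + 1)th
                    consecutive upward outlier
    """
    shifts = []
    run = 0
    for i, is_upward in enumerate(upward_flags):
        run = run + 1 if is_upward else 0
        if run == s + 1:
            shifts.append(i)
            run = 0   # the window is re-fitted, so counting starts over
    return shifts

# Hypothetical toy flags with s = 2.
flags = [True, True, False, True, True, True, True]
shifts = detect_level_shifts(flags, s=2)
```

In the toy example the first run of two outliers is too short to count as a level shift, whereas the third outlier of the second run triggers one.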

3.1.3 One-step-ahead prediction algorithm

To give an overview of how the one-step-ahead prediction model is built, pseudocode is presented in Algorithm 1 below. As input to the algorithm, one needs to pre-specify the degree of the spline k, the number and the placement of break points, the seasonality s, the number of sequences d of the repetitive pattern used in estimating the parameters, the order p of the autoregressive residual model and the outlier detection parameter η.

Algorithm 1 One-step-ahead forecast

1: j = 0 (counts the number of consecutive upward outliers).
2: Let t = t0 (the starting time of the algorithm).
3: Let v be the smallest positive integer such that Yt−v was not an outlier or a missing value.
4: Compute Sres from Et−1.
5: If Yt is a missing value then
   5a: Y′t = Pt, where Pt = xt−1(1) is the estimated spline based on Wt−1.
   5b: et = φ0 + φ1et−1 + φ2et−2 + ... + φpet−p, where φi, i = 0, 1, ..., p, are estimated from Et−v.
   5c: Re-estimate P by splines, using Wt.
   5d: Yt+1 = Pt+1.
6: If Yt is an upward outlier and j ≤ s then
   6d: Yt+1 = Pt+1 + (Yt − Pt).
   6e: j = j + 1 (counts the number of consecutive upward outliers).
7: If Yt is an upward outlier and j > s then
   7a: Y′u = Yu, u = t, ..., t − s − 1.
   7b: Re-estimate P by splines, using Wt.
   7c: Et = {er = Y′r − Pr, r = t, ..., t − sd + 1}.
   7d: Re-estimate φ0, φ1, ..., φp from the residuals Et using OLS.
   7e: εt+1 = φ0 + φ1et + φ2et−1 + ... + φpet−p+1.
   7f: Yt+1 = Pt+1 + εt+1.
   7g: j = 0.
8: If Yt is a downward outlier then
   8d: Yt+1 = Pt+1.
9: If Yt is not a missing value and not an outlier then
   9a: Y′t = Yt.
   9b: et = Yt − Pt, where Pt = xt−1(1) is the estimated spline from Wt−1.
   9c: Re-estimate φ0, φ1, ..., φp from Et using OLS.
   9d: εt+1 = φ0 + φ1et + φ2et−1 + ... + φpet−p+1.
   9e: Re-estimate P by splines, using Wt.
   9f: Yt+1 = Pt+1 + εt+1.
10: Set t = t + 1, go to 3.

3.2 Predictions 1+h hours ahead

To be able to predict 1 + h hours ahead, the same model can be used, slightly modified. The 1 + h predictor model is defined as

Yt+1+h = Pt+1+h + et+1+h,

where Pt+1+h is the seasonal pattern at time t + 1 + h, estimated from Wt, and et+1+h is the predicted residual at time t + 1 + h.

The prediction of et+1+h is done, given that Yt is not an outlier, by first using Et to predict et+1 as in the one-step case, and then continuing recursively until et+1+h is attained by the equation

$$e_{t+1+w} = \phi_0 + \sum_{r=1}^{p} \phi_r e_{t+1+w-r}, \quad w = 1, 2, \ldots, h, \tag{3.2.1}$$

for h > 1. In the recursion, residuals at time points later than t are replaced by their predicted values.
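The recursion (3.2.1) can be sketched as follows. The function name and the toy coefficients are assumptions for illustration; each predicted residual is appended to the history so that it can feed the next step.

```python
def forecast_residuals(phi, residuals, h):
    """Recursive AR(p) predictions [e_{t+1}, ..., e_{t+1+h}].

    phi = [phi0, phi1, ..., phip]; residuals are ordered oldest first
    and end with the residual at time t.
    """
    p = len(phi) - 1
    if len(residuals) < p:
        raise ValueError("need at least p past residuals")
    history = list(residuals)
    predictions = []
    for _ in range(h + 1):
        recent = history[::-1][:p]   # most recent value first
        nxt = phi[0] + sum(c * e for c, e in zip(phi[1:], recent))
        history.append(nxt)          # predicted values replace unknown ones
        predictions.append(nxt)
    return predictions

# Hypothetical AR(1) with phi1 = 0.5 and last residual e_t = 8.
ar1 = forecast_residuals([0.0, 0.5], [8.0], h=2)
# Hypothetical AR(2) example.
ar2 = forecast_residuals([1.0, 0.5, 0.25], [4.0, 2.0], h=1)
```

For a stationary AR model the predicted residuals decay towards the mean of the process as h grows, so for long horizons the forecast approaches the pattern plus a constant.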

Sometimes, if there is a fast dip in the data which is not classified as an outlier, the residual et will be large in absolute value. If at the same time h is rather large and one of the AR coefficients φi is negative with an absolute value greater than 1, then the prediction Yt+1+h = Pt+1+h + et+1+h will give a rather low value or sometimes even a negative one. If the predicted Yt+1+h turns out to be smaller than the threshold that defines a downward outlier, then the model will just predict Yt+1+h by Pt+1+h.

3.2.1 Outliers and missing values

The adaptation for upward outliers is not applied as in the one-hour model, because it is not possible to adjust quickly enough for upward outliers when predicting 1 + h hours ahead. The model will miss the upward outliers by h steps and will adjust to an upward outlier as soon as it is known at time t that Yt should be classified as an upward spike. If Yt is not defined as an outlier, et+1+h is predicted by using the estimated AR(p) model as in (3.2.1). However, if Yt is defined as an upward outlier, the errors entering the AR(p) model are denoted e∗t (where the AR coefficients stay constant until the next observation that is not classified as an outlier) and are defined as

e∗t = Yt − Pt,

for upward outlying values and

e∗t = et,

otherwise. Thus the model predicts upward outliers as

Yt+1+h = Pt+1+h + e∗t+1+h.

When Yt is a downward outlier or a missing value, the model predicts by

Yt+1+h = Pt+1+h.

The residuals in the vector Et are estimated as in the case of upward outliers when it is known at time t that the observation Yt is defined as a downward outlier or a missing value.

3.2.2 Overall algorithm for 1+h forecasting model

The pseudocode for the 1 + h steps ahead prediction model is presented in Algorithm 2.

Algorithm 2 Predicting 1+h hours ahead

1: j = 0 (counts the number of consecutive upward outliers).
2: Let t = t0 (the starting time of the algorithm).
3: Let v be the smallest positive integer such that Yt−v was not an outlier or a missing value.
4: Compute Sres from Et−1.
5: If Yt is a missing value then
   5c: e∗t = et.
   5d: Re-estimate P by splines, using Wt.
   5e: Yt+1+h = Pt+1+h.
6: If Yt is an upward outlier and j ≤ s then
   6c: e∗t = Yt − Pt.
   6d: e∗t+1 = φ0 + φ1e∗t + φ2e∗t−1 + ... + φpe∗t−p+1.
   6e: Continue recursively to predict e∗t+1+w, as e∗t+1+w = φ0 + Σ_{r=1}^{p} φr e∗t+1+w−r, until e∗t+1+h is attained, where w = 1, 2, ..., h.
   6f: Re-estimate P by splines, using Wt.
   6g: Yt+1+h = Pt+1+h + e∗t+1+h.
   6h: j = j + 1 (counts the number of consecutive upward outliers).
7: If Yt is an upward outlier and j > s then
   7a: Y′u = Yu, u = t, ..., t − s − 1.
   7b: Re-estimate P by splines, using Wt.
   7c: Et = {er = Y′r − Pr, r = t, ..., t − sd + 1}.
   7d: Re-estimate φ0, φ1, ..., φp from the residuals Et using OLS.
   7e: et+1 = φ0 + φ1et + φ2et−1 + ... + φpet−p+1.
   7f: e∗t = et.
   7g: Continue recursively to predict et+1+w, as et+1+w = φ0 + Σ_{r=1}^{p} φr et+1+w−r, until et+1+h is attained, where w = 1, 2, ..., h.
   7h: Yt+1+h = Pt+1+h + et+1+h.
   7i: j = 0.
8: If Yt is a downward outlier then
   8c: e∗t = et.
   8d: Re-estimate P by splines, using Wt.
   8e: Yt+1+h = Pt+1+h.
9: If Yt is not a missing value and not an outlier then
   9a: Y′t = Yt.
   9b: et = Yt − Pt, where Pt = xt−1(1) is the estimated spline based on Wt−1.
   9c: Re-estimate φ0, φ1, ..., φp from the residuals Et using OLS.
   9d: et+1 = φ0 + φ1et + φ2et−1 + ... + φpet−p+1.
   9e: e∗t = et.
   9f: Continue recursively to predict et+1+w, as et+1+w = φ0 + Σ_{r=1}^{p} φr et+1+w−r, until et+1+h is attained, where w = 1, 2, ..., h.
   9g: Re-estimate P by splines, using Wt.
   9h: If Yt+1+h < min(Pt) − ηSres, then Yt+1+h = Pt+1+h.
   9i: else Yt+1+h = Pt+1+h + et+1+h.
10: Set t = t + 1, go to 3.


4 Forecasting the Wikimedia workload

In this section the performance of the prediction models that were presented in section 3 is evaluated on the Wikimedia request and data sent data for different parameter settings. First the placement of break points is discussed, followed by a presentation of the results for the different methods. This is continued with a discussion of different parameter settings of the model. The parameters that are varied are the order p of the AR(p) model, the number of interior break points m and the outlier detection parameter η, which scales the standard deviation of the residuals.

Since the prediction performance of the different methods needs to be evaluated separately for upward outliers and normally behaving data, a clear definition of which time points are upward outliers is necessary. After some preliminary investigations on the data (see section 4.1.4), it was decided to use the benchmark model in section 4.1.1 to determine which values are upward outliers. The predictions are also compared to naive predictors in sections 4.1.3 and 4.2.1. Prediction performance for downward outliers and missing values cannot be evaluated, since the real values at those specific time points are unknown.

The Wikimedia data consists of daily cycles of 24 hours. Therefore the distances between the break points are set to be a divisor of 24. However, if there are too many break points, there is a risk of overfitting and the computational time also increases. Moreover, since the prediction method is automatic and therefore non-reactive to sudden changes in the curvature, it may be wise to choose evenly distributed break points, e.g. at every second hour.

A larger window, i.e. a bigger value of d, may result in too slow adaptation to a changing behavior, whereas a smaller window size, e.g. d = 1, may result in noisy or unstable predictions. The window for estimating P is thus here set to d = 2 weeks, noting that the Wikimedia data has a weekly seasonal pattern with periodicity s = 168 hours.

The autocorrelation function (ACF) and the partial autocorrelation function (PACF) of the residuals, defined as Yt − Pt, for the first 25 lags and taken from three different time periods can be seen in figure 4.0.1.

[Plots (a) to (f): ACF and partial ACF on the vertical axes against lag (0 to 25) on the horizontal axes.]

Figure 4.0.1: Diagnostics of the residuals of the cubic spline fit to the Wikimedia request data. The ACF (a) and the PACF (b) of the residuals for the first 25 lags, between May 17 and May 31, 2008. The ACF (c) and the PACF (d) of the residuals for the first 25 lags, between June 1 and June 14, 2008. The ACF (e) and the PACF (f) of the residuals for the first 25 lags, between July 6 and July 20, 2008.

Figure 4.0.1 shows a significant positive ACF decaying with the lag, but then increasing again. By inspection of the PACF, figure 4.0.1 indicates that these residuals seem to follow approximately a stationary AR(p)-process.40

40. See table 2.2.1.


The ACF plots in figure 4.0.1 also imply that there might be a seasonal AR component. However, in pilot studies where a seasonal AR component of time lag s was added to the regular AR model, the forecasts were significantly less accurate than with an AR model of small order p. Moreover, it is generally a better idea to use a simpler structure, and the simpler AR models are often more robust compared to the seasonal AR models. Thus seasonal AR components will not be used when performing the forecasts.

4.1 One-step-ahead prediction

In this subsection the prediction performance of the one-step-ahead prediction model is presented and evaluated. The model is based on splines with the AR model and is studied on the two Wikimedia data sets. Furthermore, the model's forecasting performance is compared with an adaptive spline model and a naive predictor.

4.1.1 Adaptive splines with AR Model (Benchmark Model)

As a benchmark, the parameter settings were set to d = 2 for the weekly window, break points every 2 hours, η = 4 to define outliers, p = 2 in the residual AR(p) model and cubic splines, i.e. k = 3. The mean absolute percentage error (MAPE) for the one-step-ahead forecasts for the Wikimedia requests and data sent is given in table 4.1.1. Note that the MAPE is presented separately for the values classified as upward outliers and the "normal" sized workloads. The number of normal values and upward outliers are also given in table 4.1.1.

Table 4.1.1: MAPE for the one-step-ahead forecasts of the Wikimedia requests and data sent data with the AR(p) residual model, when d = 2, η = 4, p = 2 and break points for the cubic splines at every 2 hours.

             Normal    Upward outliers    No. normal values    No. upward values
Requests     1.915%    4.715%             46 241               221
Data Sent    1.954%    8.984%             46 394               69

The MAPE for both data sets is almost equivalent on the normal data flow. However, the MAPE for the upward outliers is almost twice as high on the data sent data as on the request data (8.984% against 4.715%). This may be because the upward outliers of the data sent data have larger bursts, which in turn makes it harder to adjust to the true workload. Some predicted values (red) with their corresponding true values (blue) from both data sets are illustrated in figure 4.1.1.

[Plots (a) and (b): data sent (in G) against date. Plots (c) and (d): requests (in M) against date.]

Figure 4.1.1: True values (blue) and corresponding predicted values (red), using adaptive splines with an AR(p) residual model, for d = 2, η = 4, p = 2 and break points for the cubic splines at every second hour. Figures (a) and (b) represent the amount of Wikimedia data sent, between June 14 and July 17, 2013, and between September 2 and October 6, 2011. Figures (c) and (d) show the amount of Wikimedia user requests, between September 25 and October 29, 2008, and between October 17 and November 19, 2012.

Figure 4.1.1a shows a clear seasonal pattern where the predicted values coincide well with the true values, whereas figure 4.1.1b shows a downward level shift that the model adapts to. Figure 4.1.1c shows a period with an upward spike that the model adapts to, later followed by a downward outlier and missing values, where the forecasts mimic the normal seasonal pattern. Figure 4.1.1d illustrates an upward level shift with increased variance in its seasonality, where the model adapts fairly well to the level shift.

4.1.2 Adaptive splines model

To see the effect of the predicted residuals, this section evaluates forecasts made solely by splines, namely

Yt+1 = Pt+1.

The only exception is that when upward outliers occur, the model forecasts by equation (3.1.6) and the outliers are defined by η = 4. The MAPE for both data sets is presented in table 4.1.2.

Table 4.1.2: MAPE for the one-step-ahead forecasts of the Wikimedia requests and data sent data using adaptive splines solely, with d = 2, η = 4 and break points for the cubic splines at every 2 hours.

             Normal    Upward outliers
Requests     5.831%    6.935%
Data Sent    5.489%    10.940%

The prediction performance for the normal values is again very similar for the two data sets. However, the MAPE of the normal values is almost three times as high as for the previous model on both data sets, where the predicted residuals were included in the model. The upward outliers are also predicted slightly better by the former model. Illustrations of the predicted and the true values are given in figure 4.1.2.


Figure 4.1.2: True values (blue) and corresponding predicted values (red), using adaptive splines, for d = 2, η = 4 and break points for the cubic splines at every second hour. Figures (a) and (b) show the amount of Wikimedia data sent (vertical axis: Data Sent), between June 14 and July 17, 2013, and between September 2 and October 6, 2011. Figures (c) and (d) show the number of Wikimedia user requests (vertical axis: Requests), between September 25 and October 29, 2008, and between October 17 and November 19, 2012.

Figure 4.1.2 also illustrates the significantly less accurate predictions compared to figure 4.1.1. In figures 4.1.2b and 4.1.2d there is a two-week lag before the model adapts to the downward and upward level shifts, to which the residual AR model in the previous model adjusts much more rapidly.

4.1.3 Naive predictor

Since there is typically a strong autocorrelation between neighboring workload values, it is of interest to compare the prediction performance to naive predictors as well. A naive predictor forecasts the next time step using the current value and is defined by

$$Y_{t+1} = Y_t. \tag{4.1.1}$$

If $Y_t$ is a missing value or a downward outlier, $Y_{t+1}$ is here forecast by the most recently observed $Y_{t-v}$, v > 0, that was not classified as a downward outlier or a missing value.
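The naive rule with its fallback can be sketched as follows. This is a minimal illustration, assuming that missing values are stored as `None` and that a boolean list marks the downward outliers; the name `naive_forecast` and the toy series are hypothetical.

```python
def naive_forecast(y, t, is_downward_outlier):
    """Forecast Y[t+1] by the most recent usable observation at or before t."""
    v = t
    # Step back past missing values and downward outliers.
    while v >= 0 and (y[v] is None or is_downward_outlier[v]):
        v -= 1
    if v < 0:
        raise ValueError("no usable observation at or before time t")
    return y[v]


# Toy series (hypothetical): index 2 is missing, index 3 is a downward outlier.
y = [50.0, 52.0, None, 5.0, 54.0]
is_downward_outlier = [False, False, False, True, False]

forecasts = [naive_forecast(y, t, is_downward_outlier) for t in range(len(y))]
print(forecasts)
```

Upward outliers are deliberately not skipped here, in line with the rule above, which only excludes missing values and downward outliers.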

Another variant is the seasonal naive predictor, which can be of interest for seasonal data with periodicity s and is defined as

$$Y_{t+1} = Y_{t-s-1}. \tag{4.1.2}$$

If $Y_{t-s-1}$ is a missing value or a downward outlier, $Y_{t+1}$ is here predicted by $Y_{t-qs-1}$, where q is the smallest positive integer such that $Y_{t-qs-1}$ is not a missing value or a downward outlier. The idea is that the autocorrelation between $Y_{t+1}$ and $Y_{t-s-1}$ should be high, due to the distinct seasonal pattern, which should make it a good predictor. The prediction performance of the naive and seasonal naive predictors, where s = 168, can be seen in table 4.1.3.
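A small sketch of the seasonal naive rule with its fallback is given below. The index offset follows equation (4.1.2) as printed; the function name `seasonal_naive_forecast`, the use of `None` for missing values and the toy period s = 3 (instead of s = 168) are assumptions made only for illustration.

```python
def seasonal_naive_forecast(y, t, s, is_downward_outlier):
    """Forecast Y[t+1] from Y[t - q*s - 1] with the smallest usable q >= 1."""
    q = 1
    while True:
        idx = t - q * s - 1  # index as printed in equation (4.1.2)
        if idx < 0:
            raise ValueError("no usable seasonal observation available")
        if y[idx] is not None and not is_downward_outlier[idx]:
            return y[idx]
        q += 1  # go back one more full season


# Toy series (hypothetical) with period s = 3 and a missing value at index 4.
y = [10.0, 20.0, 30.0, 11.0, None, 31.0, 12.0, 22.0, 32.0, 13.0]
flags = [False] * len(y)

forecast_from_9 = seasonal_naive_forecast(y, 9, 3, flags)  # uses index 5
forecast_from_8 = seasonal_naive_forecast(y, 8, 3, flags)  # index 4 missing, falls back to index 1
print(forecast_from_9, forecast_from_8)
```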

Table 4.1.3: MAPE for the Wikimedia requests and data sent data for the naive and seasonal naive predictors.

            Naive                 Seasonal naive
            Normal    Outliers    Normal    Outliers
Requests    4.340%    4.810%      5.383%    33.491%
Data sent   3.948%    8.449%      5.341%    33.956%

The naive predictor works better than the seasonal naive predictor, especially for the outliers, and it also works better than the adaptive splines without the residual AR model. However, the adaptive splines with the predicted residuals still predict the normal values much better than the naive predictor, and have a comparable prediction error for the upward outliers. The naive predictions still look good, as figure 4.1.3 suggests, with the exception of the missing value in figure 4.1.3c. The seasonal naive predictor will, however, lag by one week when the distinct weekly pattern is interrupted by outliers or level shifts, as figure 4.1.4 illustrates.


Figure 4.1.3: True values (blue) and corresponding predicted values (red) for the naive predictor. Figures (a) and (b) show the amount of Wikimedia data sent (vertical axis: Data Sent), between June 14 and July 17, 2013, and between September 2 and October 6, 2011. Figures (c) and (d) show the number of Wikimedia user requests (vertical axis: Requests), between September 25 and October 29, 2008, and between October 17 and November 19, 2012.


Figure 4.1.4: True values (blue) and corresponding predicted values (red) for the seasonal naive predictor. Figures (a) and (b) show the amount of Wikimedia data sent (vertical axis: Data Sent), between June 14 and July 17, 2013, and between September 2 and October 6, 2011. Figures (c) and (d) show the number of Wikimedia user requests (vertical axis: Requests), between September 25 and October 29, 2008, and between October 17 and November 19, 2012.

4.1.4 Comparing different parameter settings

From the comparisons so far, the adaptive splines with the residual model have the best prediction performance. This subsection investigates the impact of the different parameter values on the prediction performance on the Wikimedia data. The following factors are varied:

• The number of evenly distributed interior break points, m, for the spline.

• The outlier detection parameter η.

• The number of lags p in the AR(p) residual model.

Three levels were chosen for the break points and for p, and four levels for η:

• Break points every 2, 3 or 4 hours (m = 83, 55, 41).

• η = 3.5, 3.75, 4.0, 4.25 standard deviations to detect outliers.

• p = 1, 2, 3 in the residual AR(p)-model.
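The parameter grid above can be enumerated as in the following sketch. It assumes that the break points are spaced evenly over one weekly period of 168 hourly observations, which reproduces the listed values of m; the helper name `interior_breakpoints` and the variable names are illustrative only.

```python
from itertools import product

HOURS_PER_WEEK = 168  # weekly period of the hourly data (s = 168)


def interior_breakpoints(spacing_hours, period=HOURS_PER_WEEK):
    """Number of evenly spaced interior break points for a given spacing."""
    return period // spacing_hours - 1


spacings = [2, 3, 4]            # break points every 2, 3 or 4 hours
etas = [3.5, 3.75, 4.0, 4.25]   # outlier thresholds in residual standard deviations
lags = [1, 2, 3]                # order p of the AR(p) residual model

m_values = [interior_breakpoints(h) for h in spacings]
grid = list(product(m_values, etas, lags))

print(m_values)   # number of interior break points per spacing
print(len(grid))  # number of parameter combinations evaluated
```

Each tuple in `grid` corresponds to one cell of the result tables, with one table per data set and value of p.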

The prediction performance is presented in tables 4.1.4 - 4.1.9. The lowest MAPEs of the same magnitude for normal values and outliers are shaded in grey, and those for the prediction model described in subsection 4.1.1 are marked in green.

Table 4.1.4: MAPEs for the Wikimedia amount of requests data predicted one-step-ahead by adaptive splines with an AR(1) residual model.

               Normal                        Outliers
η              m = 83    m = 55    m = 41    m = 83    m = 55    m = 41
3.50 S_res     1.929%    1.942%    1.988%    4.682%    4.639%    4.704%
3.75 S_res     1.945%    1.927%    1.985%    4.683%    4.689%    4.726%
4.00 S_res     1.920%    1.932%    1.987%    4.718%    4.711%    4.772%
4.25 S_res     1.926%    1.929%    1.982%    4.734%    4.785%    4.875%








Table 4.1.7: MAPEs for the Wikimedia amount of data sent data predicted one-step-ahead by adaptive splines with an AR(1) residual model.










There are only minor differences in the forecasting performance for normal values when changing the parameter values, which illustrates the robustness of the method for the normal data flow. The same holds for the outliers of the requests data, whereas the outliers of the data sent are more sensitive to the choice of parameter settings.

Having said that, the following tendencies can be seen:

• For both data sets, the prediction performance is best for p = 2 in the autoregressive model.

• A larger number of break points is generally better.

• There seems to be a trade-off in MAPE between the outliers and the normal data flow, depending on the value of η. If η is low (η = 3.5), the outliers are predicted better, but the normal data flow is predicted slightly worse. The opposite applies when η is high (η = 4): the normal data flow is predicted better and the outliers are forecast worse. This is of particular importance for the data sent.

4.2 Prediction 1+h hours ahead

In this subsection the prediction performance of the 1 + h hours ahead forecasting methods is studied for the two Wikimedia data sets. The forecasting methods consist of the naive method, adaptive splines and adaptive splines with an AR(2) residual model.

4.2.1 Naive 1+h steps ahead predictor

Given the information up to time t, the naive predictor forecasts $Y_{t+1+h}$ by

$$Y_{t+1+h} = Y_t.$$

If $Y_t$ is a missing value or a downward outlier, $Y_{t+1+h}$ is predicted by the most recently observed $Y_{t-v}$, v > 0, that was not classified as an outlier or a missing value. The MAPEs for both data sets are given in table 4.2.1.


Table 4.2.1: MAPEs for the naive 1 + h hours ahead predictor for the Wikimedia requests and data sent.

1+h                     1         2         3         4         5         6
Requests    Normal      4.301%    8.097%    11.447%   14.339%   17.171%   19.745%
            Outliers    4.810%    8.453%    11.427%   14.287%   17.149%   20.103%
Data sent   Normal      3.948%    7.224%    10.004%   12.380%   14.494%   16.465%
            Outliers    8.449%    12.527%   15.440%   17.887%   20.307%   23.029%

As expected, the forecasting errors rise as h increases.

4.2.2 Adaptive spline model

The adaptive spline model estimates the weekly pattern P by cubic splines (k = 3), with m = 83, η = 4 and d = 2, and predicts $Y_{t+1+h}$ by

$$Y_{t+1+h} = P_{t+1+h},$$

and when an upward outlier occurs at $Y_t$, the model forecasts by

$$Y_{t+1+h} = P_{t+1+h} + (Y_t - P_t).$$
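The two forecasting rules above can be combined in a single function, as in this minimal sketch. It assumes that the spline values for times t and t+1+h have already been estimated and that the outlier flag for time t is given; the function and argument names are hypothetical.

```python
def spline_forecast(p_future, y_t, p_t, upward_outlier_at_t):
    """Forecast Y[t+1+h] from the spline value P[t+1+h].

    When Y[t] is an upward outlier, its deviation from the spline at time t
    is carried over to the forecast; otherwise the spline value is used as is.
    """
    if upward_outlier_at_t:
        return p_future + (y_t - p_t)
    return p_future


# Toy numbers (hypothetical): the spline predicts 520 for time t+1+h.
normal_case = spline_forecast(520.0, y_t=505.0, p_t=500.0, upward_outlier_at_t=False)
outlier_case = spline_forecast(520.0, y_t=900.0, p_t=500.0, upward_outlier_at_t=True)
print(normal_case, outlier_case)
```

In the outlier case the forecast keeps the whole spike of 400 units on top of the seasonal pattern, regardless of h.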

The prediction performance for different h is given in table 4.2.2.

Table 4.2.2: MAPEs for the adaptive spline predictor 1 + h hours ahead for the Wikimedia requests and data sent.

1+h                     1         2         3         4         5         6
Data sent   Normal      5.489%    5.491%    5.496%    5.502%    5.490%    5.505%
            Outliers    10.940%   15.382%   19.370%   22.032%   25.005%   27.029%

Beyond the one-step horizon, this model clearly improves the prediction performance for the normal values compared to the naive predictor. The major reason is that the adaptive splines are able to pick up the pronounced seasonal pattern, which the naive predictor does not take into consideration. The MAPE for normal values increases only minimally as h increases, whereas the MAPE for outliers increases substantially as h grows. This is expected since it is not possible to predict a spike in the data several hours ahead. Note that, when forecasting outliers, the MAPEs are lower for the naive predictor than for the adaptive splines at all time steps h.


4.2.3 Adaptive splines with AR model

The adaptive splines with an AR(2) residual model, as described in section 3.2, are evaluated in this subsection. The parameters chosen were cubic splines (k = 3), m = 83, η = 4, p = 2 and d = 2. The prediction method can be seen in Algorithm 2 and its forecasting performance is presented in table 4.2.3.
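One common way to obtain forecasts several steps ahead from an AR(2) residual model is to iterate the recursion and feed earlier forecasts back in as inputs. The sketch below shows this textbook recursion added to given spline values; it is not a reproduction of Algorithm 2, and the zero-mean AR model, the coefficients, the spline values and all names are assumptions for illustration.

```python
def ar_forecast_path(last_residuals, phi, steps):
    """Iterate a zero-mean AR(p) recursion forward for a number of steps.

    last_residuals: observed residuals, oldest first, most recent last.
    phi: AR coefficients, phi[0] applies to the most recent residual.
    """
    history = list(last_residuals)
    path = []
    for _ in range(steps):
        nxt = sum(coef * history[-(k + 1)] for k, coef in enumerate(phi))
        path.append(nxt)
        history.append(nxt)  # forecasts are reused as inputs for later steps
    return path


# Hypothetical AR(2) coefficients and the two most recent residuals.
phi = [0.5, 0.2]
residuals = [4.0, 10.0]
seasonal = [500.0, 520.0, 540.0]  # assumed spline values for the next three hours

residual_path = ar_forecast_path(residuals, phi, steps=len(seasonal))
forecasts = [p + r for p, r in zip(seasonal, residual_path)]
print(residual_path)
print(forecasts)
```

For a stationary AR model the predicted residuals shrink towards zero as the horizon grows, so the combined forecast approaches the pure spline forecast for larger h.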

Table 4.2.3: MAPEs for the 1 + h adaptive spline with the AR(2) residual model predictor for the Wikimedia requests and data sent.

1+h                     1         2         3         4         5         6
Data sent   Normal      1.954%    2.798%    3.384%    3.792%    4.072%    4.333%
            Outliers    8.984%    16.593%   20.866%   24.314%   27.313%   29.543%

Adding prediction of the residual terms to the adaptive splines substantially improves the prediction performance for the normal values. Especially for smaller values of h, it performs much better than the naive and adaptive spline predictors. On the other hand, the naive predictor still forecasts outliers better than the other methods.
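The comparison can be checked directly against the data sent rows of tables 4.2.1 to 4.2.3. The short sketch below picks the method with the lowest MAPE at each horizon; the dictionary layout and the function name `best_method` are illustrative, while the numbers are copied from the tables.

```python
horizons = [1, 2, 3, 4, 5, 6]  # values of 1 + h

# MAPE in percent for the data sent series, from tables 4.2.1, 4.2.2 and 4.2.3.
data_sent_mape = {
    "naive": {
        "normal": [3.948, 7.224, 10.004, 12.380, 14.494, 16.465],
        "outliers": [8.449, 12.527, 15.440, 17.887, 20.307, 23.029],
    },
    "splines": {
        "normal": [5.489, 5.491, 5.496, 5.502, 5.490, 5.505],
        "outliers": [10.940, 15.382, 19.370, 22.032, 25.005, 27.029],
    },
    "splines+AR(2)": {
        "normal": [1.954, 2.798, 3.384, 3.792, 4.072, 4.333],
        "outliers": [8.984, 16.593, 20.866, 24.314, 27.313, 29.543],
    },
}


def best_method(kind):
    """Return the method with the lowest MAPE at each horizon."""
    winners = []
    for i in range(len(horizons)):
        winners.append(min(data_sent_mape, key=lambda name: data_sent_mape[name][kind][i]))
    return winners


best_normal = best_method("normal")
best_outliers = best_method("outliers")
print(best_normal)
print(best_outliers)
```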


5 Discussion and conclusions

In this thesis an automatic and adaptive prediction model is proposed for dynamic seasonal time series with upward and downward outliers along with missing values. These types of time series are typical for server loads, such as the number of requests and the amount of data sent to and from a website, for example the Wikimedia home pages studied in this thesis. Such prediction models are of interest, e.g., for planning the number of servers needed in the data centers to meet the demand from user requests and data sent, while at the same time not using more energy (servers) than needed.

The proposed model uses adaptive splines to capture the repetitive seasonal pattern, and an autoregressive model to predict the residuals, while at the same time handling and taking into account outliers and missing values.

The proposed model works well, and much better than the naive and adaptive spline predictors, on the normally behaving values of the Wikimedia number of requests and data sent. However, outliers tend to be predicted equally well or better by the naive predictor, especially for predictions several steps ahead. Thus there might be a need to change the 1 + h model, for h ≥ 1, such that it predicts the upward outliers as the naive predictor does.

The proposed model seems to be robust to changes in the parameter values that need to be set by the user, at least for normal values. The outlier predictions are more sensitive and less accurate, which is not surprising since it is very hard to design models that can predict an outlier several steps ahead. Still, it would be interesting to see if the outlier predictions could be improved for the several steps ahead forecasts.

The current Wikimedia data is the aggregated total of requests and data sent per hour. It would be of interest to study the performance of the method given in this thesis on a minute scale, where the servers can adapt more quickly to rapid increases, and decreases, in the number of server requests and the amount of data sent to and from users. The data will oscillate more, which raises the question of whether the method needs to be adjusted for minute-scale data. The method also has to be fast, since the model updates every minute, so the computational time must be minimal.


Bibliography

Alexa, http://www.alexa.com/topsites, (2014-08-31).

Ali-Eldin, A., Rezaie, A., Mehta, A., Razroev, S., Sjostedt-de Luna, S.,Seleznjev, O., Tordsson, J. & Elmroth, E. (2014), How will your workloadlook like in 6 years? Analyzing Wikimedia’s workload. In Proceedings ofthe 2014 IEEE International Conference on Cloud Engineering (IC2E 2014),IEEE Computer Society, pp. 349-354.

Ali-Eldin, A., Kihl, M., Tordsson, J. & Elmroth, E. (2012), Efficientprovisioning of bursty scientific workloads on the cloud using adaptiveelasticity control. In Proceedings of the 3rd workshop on Scientific CloudComputing, pp. 31-40.

Andreolini, M. & Casolari, S. (2006), Load prediction models in web-basedsystems. In Proceedings of the 1st international conference on Performanceevaluation methodologies and tools, pp. 27-36.

Bodık, P., Griffith, R., Sutton, C., Fox, A., Jordan, M. & Patterson, D.(2009), Statistical machine learning makes automatic control practical forinternet datacenters. In Proceedings of the 2009 conference on hot topics incloud computing, pp. 12-15.

Brockwell, P.J. & Davis, R.A. (2002), Introduction to Time Series andForecasting, 2nd ed., New York, NY: Springer-Verlag.

Cryer, J.D. & Chan, K-S. (2008), Time Series Analysis with Applicationsin R, 2nd ed., New York, NY: Springer Science Media.

Datamarket, http://datamarket.com, (2014-08-31).

Giles, D. (2010), B-splines, Wiley Interdisciplinary Reviews: ComputationalStatistics, Vol.2(2), pp. 237-242.

Herbst, N.R., Kounev, S., Huber, N. & Amrehn, E. (2014), Selfadaptiveworkload classification and forecasting for proactive resource provisioning.Proceedings of the ACM/SPEC international conference on Internationalconference on performance engineering (ICPE 2013), pp. 187-198.

Hyndman, R. J., King, M. L., Pitrun, I. & Billah, B. (2005), Local linearforecasts using cubic smoothing splines, Australian & New Zealand Journalof Statistics, 47(1), pp. 87-99.

Montgomery, D.C., Jennings, C.L. & Kulahci, M. (2008), Introduction toTime Series Analysis and Forcasting, New Jersey: Wiley series in probabilityand statistics.

Powell, M.J.D. (1981), Approximation theory and methods, New York, NY:Cambridge University Press.

Ramsey, J.O. & Silverman, B.W. (2006), Functional Data Analysis, 2nd Ed.,New York, NY: Springer Science Media.

Ramsey, J.O., Hooker, G. & Graves, S. (2009), Functional Data Analysiswith R and MATLAB, New York, NY: Springer Science Media.

Shumway, R.H. & Stoffer, D.S. (2011), Time series analysis and itsapplications, 3rd Ed., New York, NY: Springer Science Media.

Yaffee, R. & McGee, M. (2000), An Introduction to Time Series Analysis andForecasting: With Applications of SAS and SPSS, San Diego, CA: SciencePublishing Co. Inc.

Wikimedia, http://www.wikimedia.org, (2014-08-31).
