Statistical consulting project report

1

Statistical Consulting for TelCap
M2 Statistics and Econometrics
by Orestis Ampeliotis, Sayli Javadekar, Zhao-qiu Luo and Damien Quesada

ACKNOWLEDGEMENT

We would like to thank TelCap for providing us with the necessary logistic requirements to complete this assignment. We would also like to thank Prof Daouia and Prof Orozco for their guidance, without which we would not have been able to complete this assignment.

TABLE OF CONTENTS

ACKNOWLEDGEMENT

INTRODUCTION

THEORETICAL BACKGROUND

TIME SERIES 1 - TOULOUSE
    Overview
    Seasonality
    Differencing
    Estimation & Model Selection
    Prediction

TIME SERIES 2 – AFGHANISTAN
    Overview
    Differencing
    Estimation and Model Selection
        MODEL 1
        MODEL 2
        MODEL 3

CONCLUSION

ANNEXE
    Codes for Toulouse
    Codes for Kabul

Introduction

In this assignment we were given time series of mobile data traffic for two locations, Toulouse and Kabul. The aim was to provide a model that fits the data and can be used to predict the traffic. The data for both places were provided by TelCap. For Toulouse, we worked with cell 421 and found an ARMA(1,(7)) model for the differenced log series, with an error of 13.3% on a 365-day forecast. For Kabul, we worked with cell 279 and found an AR(6) model for the differenced series, with an error of 11.1% on a 20-day forecast. We checked our Toulouse model on four randomly selected cells: it was valid for three of them, and the fourth needed an adjustment of the lags. For Afghanistan, the data behave more erratically and the same model cannot be fitted to other cells. All the code for this project was written in SAS.

Theoretical Background

First we consider $y_t$ to be the time series. Now we define the models we have used in our project.

• Stationarity: If neither the mean $\mu_t$ nor the autocovariances $\gamma_{jt}$ depend on the date t, then the process for $y_t$ is said to be weakly stationary:

o $E(y_t) = \mu$ for all t
o $\gamma_{jt} = \mathrm{cov}(y_t, y_{t-j}) = E[(y_t - \mu)(y_{t-j} - \mu)] = \gamma(j)$ for all t and any j.

In practice, we consider a time series stationary based on:

1. The chronogram of the time series has a constant mean and constant variance over time
2. The ACF, PACF and IACF [1] plots decrease exponentially
3. The Augmented Dickey-Fuller unit-root test [2]

• Autocorrelation function (ACF):

We denote by $\rho(\tau)$ the autocorrelation function, defined as

$\rho(\tau) = \gamma(\tau) / \gamma(0)$

where $\gamma(0) = \mathrm{cov}(y_t, y_t) = \mathrm{var}(y_t)$ and $\gamma(\tau) = \mathrm{cov}(y_t, y_{t-\tau})$.
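As an illustration, the sample version of this definition can be sketched in a few lines of Python (the report itself relies on SAS for these plots). The function name `sample_acf` and the toy series are assumptions made for the example, and the divisor n in the autocovariance is one common convention, not necessarily the one SAS applies.

```python
def sample_acf(series, max_lag):
    """Return [rho(0), ..., rho(max_lag)] with rho(tau) = gamma(tau) / gamma(0)."""
    n = len(series)
    mean = sum(series) / n
    centered = [value - mean for value in series]

    def gamma(lag):
        # Sample autocovariance at the given lag (divisor n).
        return sum(centered[t] * centered[t - lag] for t in range(lag, n)) / n

    gamma0 = gamma(0)
    return [gamma(lag) / gamma0 for lag in range(max_lag + 1)]


# Toy series with an exact period of 7, standing in for daily traffic.
toy = [10, 12, 15, 14, 13, 6, 5] * 8
acf = sample_acf(toy, 14)
```

For a series that repeats every 7 days, the autocorrelation at lag 7 stands out from lags 1 to 6, which is the kind of pattern used later to spot weekly seasonality.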

• Partial autocorrelation function (PACF) and Inverse autocorrelation function (IACF)

The PACF and IACF are more involved to define. We refer the reader to textbooks [3] for their definitions (see references).

• Lag operator:

We define the lag operator B such that $B y_t = y_{t-1}$

• White noise:

A stationary time series $\varepsilon_t$ is said to be white noise if $\mathrm{cov}(\varepsilon_t, \varepsilon_s) = 0$ for all t ≠ s

[1] Brocklebank, J. and Dickey, D. (2003). SAS for Forecasting Time Series. United States, pp. 58-78.

[2] Cryer, J. and Chan, K.S. (2008). Time Series Analysis: With Applications in R. Springer, United States, p. 129.

[3] Aragon, Y. (2006). Séries temporelles appliquées.

In practice, this is verified using the Portmanteau Tests available for testing for autocorrelations in the residuals of a model: it tests whether any of a group of autocorrelations of the residual time series are different from zero.
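A minimal Python sketch of such a portmanteau statistic is given below, assuming the Ljung-Box form Q = n(n+2) Σ r_k² / (n − k) summed over lags k = 1..m. The function name and the toy residuals are illustrative assumptions; the tests reported in this document come from the SAS output. Under the null hypothesis, Q is compared with a chi-square quantile whose degrees of freedom are reduced by the number of estimated ARMA coefficients.

```python
def ljung_box_q(residuals, max_lag):
    """Ljung-Box statistic over lags 1..max_lag (assumed form, see lead-in)."""
    n = len(residuals)
    mean = sum(residuals) / n
    centered = [r - mean for r in residuals]
    denominator = sum(v * v for v in centered)
    total = 0.0
    for lag in range(1, max_lag + 1):
        # Sample autocorrelation of the residuals at this lag.
        r_k = sum(centered[t] * centered[t - lag] for t in range(lag, n)) / denominator
        total += r_k * r_k / (n - lag)
    return n * (n + 2) * total


# 5% critical value of a chi-square distribution with 6 degrees of freedom.
CHI2_95_DF6 = 12.592

# Toy residuals that alternate in sign: strongly autocorrelated, so not white noise.
alternating = [1.0 if t % 2 == 0 else -1.0 for t in range(50)]
q_alternating = ljung_box_q(alternating, 6)
reject_white_noise = q_alternating > CHI2_95_DF6
```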

• ARMA(p, q) Model

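If the time series is stationary and the ACF, PACF and IACF decrease rapidly, we can fit an ARMA(p, q) model:

$\Phi_p(B) y_t = \theta_0 + \Theta_q(B) \varepsilon_t$

that is,

$y_t - \phi_1 y_{t-1} - \dots - \phi_p y_{t-p} = \theta_0 + \varepsilon_t - \theta_1 \varepsilon_{t-1} - \dots - \theta_q \varepsilon_{t-q}$

where $\theta_0$ is a constant and the $\varepsilon_t$ are white noise, i.e. $\varepsilon_t \sim WN(0, \sigma^2)$. This model has two orders:

Ø the AR order p, with AR coefficients $\phi_1, \phi_2, \phi_3, \dots, \phi_p$
Ø the MA order q, with MA coefficients $\theta_1, \theta_2, \theta_3, \dots, \theta_q$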

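• SARMA(p, q)(P, Q)_s Model

If the time series presents a seasonality of period s, we use:

$\Phi_P(B^s) \Phi_p(B) y_t = c + \Theta_Q(B^s) \Theta_q(B) \varepsilon_t$

• ARIMA(p, d, q) Model

$\Phi_p(B) \Delta^d y_t = \Theta_q(B) \varepsilon_t$

where $\Delta^d y_t = (1 - B)^d y_t$ follows an ARMA model.

• SARIMA(p, d, q)(P, D, Q)_s Model

We define $\Theta_Q(B^s) = 1 - b_1 B^s - \dots - b_Q B^{sQ}$ and $\Phi_P(B^s) = 1 - a_1 B^s - \dots - a_P B^{sP}$. Then

$(1 - B)^d (1 - B^s)^D y_t = \frac{\Theta_Q(B^s) \Theta_q(B)}{\Phi_P(B^s) \Phi_p(B)} \varepsilon_t$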

Time Series 1 - Toulouse

Overview

We randomly chose one of the cells for Toulouse (cell 512), which gives daily data from 6 July 2010 to 31 December 2013, for a total of 1,212 observations. The variable of interest is Traffic_CS. Before building any model on the series, our first step is to check whether it is stationary, using the ACF and PACF plots. While doing so, we noticed that a few days are missing from the data, so we have to fill these gaps to have a value for each day of the interval. To do so, we used the SAS procedure proc expand with the spline method [4].
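The gap-filling idea can be sketched in Python as follows. Note the swap: this sketch uses plain linear interpolation between the two neighbouring observed days as a simpler stand-in for the cubic spline applied by proc expand, so its values will differ from the SAS result. The function name and the three toy observations are assumptions made for the example.

```python
from datetime import date, timedelta


def fill_daily_gaps(observations):
    """Return (day, value) pairs for every calendar day between the first and last
    observation, filling missing days by linear interpolation (not a spline)."""
    days = sorted(observations)
    filled = []
    for left, right in zip(days, days[1:]):
        span = (right - left).days
        for step in range(span):
            weight = step / span
            value = observations[left] * (1 - weight) + observations[right] * weight
            filled.append((left + timedelta(days=step), value))
    filled.append((days[-1], observations[days[-1]]))
    return filled


# Toy data: 8 and 9 July 2010 are missing.
raw = {date(2010, 7, 6): 100.0, date(2010, 7, 7): 110.0, date(2010, 7, 10): 140.0}
series = fill_daily_gaps(raw)
```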

So our raw data after filling the gaps looks like this:

Graphic1.1 : the original chronological traffic volume of Toulouse Saint-Rome during Jul.2010 to Jan.2014

[4] The spline method joins successive points with a piecewise function made of third-degree (cubic) polynomials, so that the whole curve and its first and second derivatives are continuous. The choice of method matters little for what follows; the real need is to fill every gap in the time series. For more explanations, see the book:

Bartels, R. H.; Beatty, J. C.; and Barsky, B. A. "Hermite and Cubic Spline Interpolation." Ch. 3 in An Introduction to Splines for Use in Computer Graphics and Geometric Modelling. San Francisco, CA: Morgan Kaufmann, pp. 9-17, 1998.

Or the website:

http://mathworld.wolfram.com/CubicSpline.html

From the figure above we see that a few observations of the variable Traffic_CS are equal to 16. These are atypical and can be considered potential outliers. Mr. Olivier Rostaing confirmed that they were probably caused by an equipment failure, so they have been deleted and the resulting missing values filled by the spline method in order to have a proper time series to work with. The series displays a non-constant mean and variance, so it is not stationary.

To stabilize the variance we use a log transformation. From now on, the variables of interest will be

$x_t = \text{Traffic\_CS}_t$

$y_t = \log(x_t)$

To convert it to a stationary series would be our first step, but before that we observe other patterns in the series.

Seasonality

From the figure below, we can see that the same pattern is repeated for all the three years. (see arrows) This tells us that our series displays yearly seasonality.

Graphic 1.2: Chronological traffic volume $x_t$ of Toulouse Saint-Rome, Jul. 2010 to Jan. 2014, after removing the outliers

Next, we checked the ACF and the PACF plots of this series and we obtain,

Graphic 1.3: Autocorrelation, partial autocorrelation and inverse autocorrelation plots of the Toulouse Saint-Rome traffic volume $x_t$

The plots tell us that there is a pattern repeated every 7 days as well (see arrows). Hence there is weekly seasonality along with yearly seasonality in our non stationary time series.

Differencing

If there is a non-stationary time series $y_t$ with a seasonality of period s, then to make $y_t$ stationary we difference at lag s:

$\Delta_s y_t = (1 - B^s) y_t = y_t - y_{t-s}$

If the ACF and PACF decrease rapidly to zero, the differenced series is stationary and we can fit an ARMA model to this 'new' time series.

We try 3 methods:

• Difference by 7: the seasonality is removed, but the resulting model is not valid.
• Difference by 365: it does not eliminate the seasonality.
• Difference by 365 and 7: in this case we eliminate the seasonality and we obtain white-noise residuals in the end, so we use this method.
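The differencing step can be sketched in Python as below (the report applies it in SAS through dif7 and dif365). The helper name and the synthetic log-traffic series, built with an exact weekly and an exact yearly pattern, are assumptions made for the example; real data would leave a noisy remainder rather than zeros.

```python
import math


def seasonal_difference(series, period):
    """Apply (1 - B^period): subtract from each value the one `period` steps earlier."""
    return [series[t] - series[t - period] for t in range(period, len(series))]


# Hypothetical log-traffic: constant level + weekly pattern + yearly pattern.
weekly = [0.00, 0.05, 0.08, 0.07, 0.06, -0.10, -0.16]
y = [5.0 + weekly[t % 7] + 0.3 * math.sin(2 * math.pi * (t % 365) / 365)
     for t in range(800)]

z7 = seasonal_difference(y, 7)     # removes the weekly pattern only
z = seasonal_difference(z7, 365)   # then removes the yearly pattern as well
```

Differencing at lag 7 alone leaves the yearly movement in `z7`, while the double difference `z` is flat for this idealised series. The result is 372 points shorter than `y`.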

After the log transformation and differencing by 365 and 7, the Augmented Dickey-Fuller test confirms that our series is likely to be stationary.

The time series becomes:

Graphic 1.4: Chronological traffic volume of Toulouse Saint-Rome after transformation ($z_t$), Jul. 2010 to Jan. 2014

Here we see that the series has a constant mean and variance. From now on, we are interested in

$z_t = (1 - B^7)(1 - B^{365}) y_t$

Because of the seasonality, we want to fit a SARMA model to 𝑧! and so a SARIMA to 𝑦!
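Expanding the two difference operators makes explicit which past values enter $z_t$ (a worked step added for clarity, using only the definition of the lag operator B):

```latex
\begin{aligned}
z_t &= (1 - B^{7})(1 - B^{365})\, y_t \\
    &= (1 - B^{7} - B^{365} + B^{372})\, y_t \\
    &= y_t - y_{t-7} - y_{t-365} + y_{t-372}
\end{aligned}
```

Each value of $z_t$ therefore needs 372 earlier observations, so the differenced series is 372 points shorter than $y_t$.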

Estimation & Model Selection

Next, to see which model would fit our data well, we look at the ACF and PACF plots. We choose an AR(p) model if the PACF is null after lag p, and an MA(q) model if the ACF is null after lag q. Below, we see that the ACF is null after lag 7. We want to make clear that the differenced time series does not look strictly stationary; however, since the period is large, we consider that the autocorrelations decay fast enough and we assume stationarity in what follows.

Graphic 1.5: Autocorrelation, partial autocorrelation and inverse autocorrelation plots of the Toulouse Saint-Rome traffic volume ($z_t$) after log transformation and differencing by 7 and 365

Our autocorrelation plots suggest the use of an MA(7) model:

$z_t = \mu + \varepsilon_t + \sum_{i=1}^{7} \theta_i \varepsilon_{t-i}$

where the $\varepsilon_t$ are white-noise error terms.

Next we fit a MA (7) on our series using proc arima in SAS. However we do not obtain a white noise which is seen from the Autocorrelation Checks of residuals in the SAS output (Portmanteau Test). Here, we test

H0 = ‘There is no autocorrelation’ against H1 = ‘There is autocorrelation’

The p-value (Pr > Khi-2 in the SAS output) is very small (we compare it to the usual 5% risk level), which allows us to reject H0. There is significant autocorrelation, so we reject the hypothesis of white noise (see table below). Therefore, MA(7) is not valid.

To help us find a good model, we next use proc arima with the minic option, which computes the 'optimal' model according to the AIC or BIC [5] criteria.

According to this method the optimal model is an AR(1):

$z_t = c + \varphi z_{t-1} + \varepsilon_t$,

but again, as we see from the table below, the p-values are below .05, hence we reject the null hypothesis that there is no autocorrelation.

Since neither model is valid, we try several combinations of them. The model ARMA(1, 7) is

$z_t = c + \varepsilon_t + \varphi z_{t-1} + \sum_{i=1}^{7} \theta_i \varepsilon_{t-i}$

[5] Cryer, J. and Chan, K.S. (2008). Time Series Analysis: With Applications in R. Springer, United States, pp. 130-131.

Here we get white noise, but we noticed that the $\theta_i$, i = 1, …, 6 are not significant (except i = 5, whose p-value is close to 5%, so its significance is not clear-cut), so we delete them.

Next, we noticed that the intercept (MU, ‘c’ in our formula) is not significant:

After deleting the constant,

Thus, we have a model with white noise and significant coefficients.

To ensure that this is the best model, we tried to fit a model ARMA(2,(7)) and model ARMA(1,(8)) to the data, but we do not get the white noise.

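Thus, we keep our model ARMA(1, (7)):

$z_t = \varepsilon_t + \varphi z_{t-1} + \theta \varepsilon_{t-7}$

And so, with $\Phi_1(B) = 1 - \varphi B$ and $\Theta_7(B) = 1 + \theta B^7$ (the sign of θ following the equation just above),

$(1 - B^7)(1 - B^{365}) y_t = \frac{\Theta_7(B)}{\Phi_1(B)} \varepsilon_t \;\Leftrightarrow\; (1 - B^7)(1 - B^{365}) y_t = \frac{1 + \theta B^7}{1 - \varphi B} \varepsilon_t$

We have the equation of a SARIMA model. However, the theory defines it with only one seasonality s. Here we have two different seasonalities, s1 = 7 and s2 = 365, so it is a non-standard SARIMA model:

$y_t \sim SARIMA(1, 0, (7))(1, 1, 0)_{7,365}$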
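To make the retained recursion concrete, here is a small Python sketch of the ARMA(1,(7)) equation z_t = ε_t + φ z_{t-1} + θ ε_{t-7}: it simulates a series, recovers the innovations from it, and computes a one-step forecast. The coefficient values, the function names and the zero pre-sample start are assumptions made for the example; they are not the estimates obtained in SAS.

```python
import random


def simulate_arma_1_7(phi, theta, n, seed=0):
    """Simulate z_t = eps_t + phi*z_{t-1} + theta*eps_{t-7}, with zero pre-sample values."""
    rng = random.Random(seed)
    eps = [rng.gauss(0.0, 1.0) for _ in range(n)]
    z = []
    for t in range(n):
        previous_z = z[t - 1] if t >= 1 else 0.0
        lagged_eps = eps[t - 7] if t >= 7 else 0.0
        z.append(eps[t] + phi * previous_z + theta * lagged_eps)
    return z, eps


def recover_innovations(z, phi, theta):
    """Invert the recursion: eps_t = z_t - phi*z_{t-1} - theta*eps_{t-7}."""
    eps = []
    for t in range(len(z)):
        previous_z = z[t - 1] if t >= 1 else 0.0
        lagged_eps = eps[t - 7] if t >= 7 else 0.0
        eps.append(z[t] - phi * previous_z - theta * lagged_eps)
    return eps


def one_step_forecast(z, eps, phi, theta):
    """Forecast the next value: the future innovation is replaced by its mean, zero."""
    return phi * z[-1] + theta * eps[-7]


PHI, THETA = 0.5, -0.4  # hypothetical coefficients, for illustration only
z, true_eps = simulate_arma_1_7(PHI, THETA, 200)
eps_hat = recover_innovations(z, PHI, THETA)
next_z = one_step_forecast(z, eps_hat, PHI, THETA)
```

When the model is adequate, the recovered innovations `eps_hat` are what the residual white-noise check is applied to.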

Prediction

From this model, we can compute some predictions.

We fit our non-standard SARIMA to the $y_t$ variable in order to get the forecast. However, this forecasts $y_t$ = log(Traffic_CS). Coming back to $x_t$ = Traffic_CS is easy but not trivial: simply exponentiating the forecast of the logarithm underestimates the mean of $x_t$.

We have to use the transformation:

$x_{t,forecast} = \exp(y_{t,forecast} + \sigma^2 / 2)$

where σ² is the variance of the forecast $y_{t,forecast}$.
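The back-transformation can be written as a one-line Python function, mirroring the expression exp(forecast + STD*STD/2) used in the annexe. It relies on the log-normal mean: if y is normal with mean m and variance σ², then E[exp(y)] = exp(m + σ²/2). The function name and the numerical values are assumptions made for the example.

```python
import math


def back_transform(log_forecast, log_forecast_std):
    """Convert a forecast of log(traffic) and its standard error into a traffic forecast."""
    return math.exp(log_forecast + log_forecast_std ** 2 / 2)


# Hypothetical forecast: log-traffic of log(100) with a standard error of 0.2.
naive = math.exp(math.log(100.0))                 # plain exponentiation
corrected = back_transform(math.log(100.0), 0.2)  # with the variance correction
```

The corrected value is always at least as large as the naive one, and the gap widens as the forecast standard error grows with the horizon.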

After the vertical line we see what our model predicts (one year prediction). We can see that the forecast (in red) has the same shape as the original data (in black).

Graphic1.6: One year forecasting of our whole data set

To check whether our prediction is good or not, we delete a part of data (here data deleted is from 1st January 2013 to 1st January 2014) and predict them by using our model :

Graphic1.7: Check of the forecast only on a part of the data

Then we compare our predicted values with the original ones, which are the true values, and calculate the error rate:

$\text{error rate} = \mathrm{Mean}\left( \frac{|x_t - x_{t,forecast}|}{x_t} \right)$

Our model has a 13.3% error for the one-year forecast.
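This error rate, the mean of the absolute relative errors over the held-out days, can be sketched in Python as follows (it corresponds to the variable Rerror averaged by proc means in the annexe). The function name and the three toy values are assumptions made for the example.

```python
def mean_relative_error(actual, predicted):
    """Mean of |actual - predicted| / actual over the held-out period."""
    ratios = [abs(a - p) / a for a, p in zip(actual, predicted)]
    return sum(ratios) / len(ratios)


# Toy held-out values and forecasts.
true_traffic = [100.0, 200.0, 400.0]
forecast_traffic = [110.0, 180.0, 400.0]
error_rate = mean_relative_error(true_traffic, forecast_traffic)
```

The measure is undefined when a true value is zero, which is one more reason to remove equipment-failure readings before evaluating a forecast.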

We fitted this model on 4 randomly selected cells from Toulouse. The model was valid for the 3 out of the 4 cells selected (cell512, cell521G1, cell451D3), the 4th cell (cell421D1) needed adjustments with the lag.

We tried the same methodology without the weekends to check if there were any improvements in the model. Here we succeeded in finding another model, but the final error was bigger than before so we keep the whole data and our model.

Time Series 2 – Afghanistan

Overview

We choose one of the cells for Afghanistan (cell279) from the data set that has been provided to us, which contains daily data from 25th May 2012 to 22nd August 2014, accounting for 839 observations. The variable of interest is Traffic_CS. The first step, as we discussed before, is to check if our time series is stationary. In order to plot the time series and the ACF and PACF graphs we have to “treat” the missing values that our data has. Using again the method that SAS provides (spline) we achieve that.

Graphic2.1 : the original chronological traffic volume of Afghanistan from May 2012 to Sept 2014

From Graphic 2.1 we see that there is no clear trend, and that there is a sudden fall in the traffic between 2/4/2013 and 26/6/2013, which is a main reason for the non-stationarity of the time series.

The usual method to forecast such a fall would be Markov chains; however, this is impossible here because we only have one occurrence of the fall. We can propose 3 methods:

• Model 1: Fit an ARIMA model to the whole data.
• Model 2: Forecast the whole data, but after lifting the fallen part up so that it continues the rest of the data.
• Model 3: Use only the data after the gap (we would have to know why this fall appeared; maybe it will not happen again).

As before, let the variable of interest be

$x_t = \text{Traffic\_CS}_t$

Differencing

When we lift the gap, we see a yearly seasonality. And since this gap occurs only once in our data, we can consider it unnatural.

Now, we check the ACF and the PACF plots of the original time series and we obtain the following results:

Graphic2.2: The autocorrelation plots, partial autocorrelation plots and inverse autocorrelation plots of Afghanistan time series traffic volume

As we can see, the ACF and PACF plots are not decreasing exponentially, which is evidence of a non-stationary time series. The Augmented Dickey-Fuller Single Mean test from the SAS output at lag 5 likewise indicates that we need to difference our time series.

Thus we take a first difference (lag 1) of our time series. We also take a seasonal difference at lag 365, which we will justify later. Hence we have the following time series:

Graphic2.3: the differenced chronological traffic volume of Afghanistan

And the corresponding Autocorrelation plots:

Graphic 2.4: Autocorrelation, partial autocorrelation and inverse autocorrelation plots of the Afghanistan traffic volume after differencing at lags 1 and 365

After the first simple difference we get

$\Delta_1 x_t = (1 - B) x_t = x_t - x_{t-1}$

The seasonal difference at lag 365 of the $\Delta_1 x_t$ series gives

$z_t = (1 - B^{s})^{D} \Delta_1 x_t = (1 - B^{365}) \Delta_1 x_t$

where s = 365 and D = 1.
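Since forecasts are produced for $z_t$ but needed for $x_t$, the differencing has to be undone: from z_t = x_t − x_{t−1} − x_{t−365} + x_{t−366} we get x_t = z_t + x_{t−1} + x_{t−365} − x_{t−366}. SAS does this internally when forecasting; the Python sketch below only illustrates the arithmetic. The function names are assumptions, and a short season of 5 is used so that the toy check stays small.

```python
def double_difference(x, season=365):
    """z_t = (1 - B^season)(1 - B) x_t = x_t - x_{t-1} - x_{t-season} + x_{t-season-1}."""
    return [x[t] - x[t - 1] - x[t - season] + x[t - season - 1]
            for t in range(season + 1, len(x))]


def undifference(history, z_forecasts, season=365):
    """Rebuild forecasts of x from forecasts of z, given at least season + 1 past values."""
    x = list(history)
    for z in z_forecasts:
        # x_t = z_t + x_{t-1} + x_{t-season} - x_{t-season-1}
        x.append(z + x[-1] + x[-season] - x[-season - 1])
    return x[len(history):]


# Toy integer series; pretend the last 8 values are unknown and rebuild them from z.
x = [3 * t + (t * t) % 17 for t in range(20)]
z = double_difference(x, season=5)           # defined for t = 6..19
rebuilt = undifference(x[:12], z[6:], season=5)
```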

As we can see now the ACF and PACF plots are rapidly decreasing so we can assume that our time series is stationary. We confirm it by the Augmented Dickey-Fuller test.

Estimation and Model Selection

MODEL 1

The next step is to fit a model to our time series in order to forecast. For this purpose we use the ACF and PACF plots. We see that the ACF is null after lag 1 and the PACF is null after lag 6. We first try to fit an MA(1) to this series, but we do not get white noise. Next we try an AR(6), which gives white noise and significant coefficients.

Here we see that for all the estimates the p-values are less than .05 thus they are significant.

The Autocorrelation Check for the residuals gives us,

This is in accordance with the Portmanteau tests for autocorrelation. As the p-values are all greater than .05, the null hypothesis that there is no autocorrelation is not rejected. Thus white noise is obtained.

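Our model finally is,

$z_t = \varepsilon_t + \varphi_1 z_{t-1} + \varphi_2 z_{t-2} + \varphi_3 z_{t-3} + \varphi_4 z_{t-4} + \varphi_5 z_{t-5} + \varphi_6 z_{t-6}$

which can be represented in terms of $x_t$ as

$z_t = (1 - B^{365})^1 (1 - B) x_t = \frac{\Theta_0(B)}{\Phi_6(B)} \varepsilon_t$

Thus $x_t$ (such that $z_t = (1 - B^{365}) \Delta_1 x_t$) is a SARIMA(6,1,0)(0,1,0)_365.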
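Forecasting with a pure autoregressive model is a simple recursion: each future innovation is replaced by its mean, zero, and earlier forecasts are fed back in as if they were observations. A minimal Python sketch follows, with a hypothetical function name and a toy AR(1) coefficient; the six coefficients of the fitted AR(6) would be passed in the same way, in lag order.

```python
def ar_forecast(z, coefficients, horizon):
    """Recursive forecasts of an AR(p) process without intercept.

    coefficients[0] multiplies the most recent value, coefficients[1] the one before,
    and so on; z must contain at least len(coefficients) values."""
    values = list(z)
    for _ in range(horizon):
        next_value = sum(phi * values[-lag]
                         for lag, phi in enumerate(coefficients, start=1))
        values.append(next_value)
    return values[len(z):]


# Toy AR(1) with coefficient 0.5: forecasts decay geometrically towards zero.
forecasts = ar_forecast([1.0], [0.5], 3)
```

For a stationary AR model the forecasts of $z_t$ shrink towards zero as the horizon grows, so long-range forecasts of $x_t$ are increasingly driven by the seasonal pattern of the previous year.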

Prediction

With this pure autoregressive AR(6) model, we first forecast one year of traffic after 22 September 2014 (see graph below).

Graphic2.5: Forecast of the original data of Afghanistan

Then we calculate the error rate as the mean absolute value of the difference between the predictions and the true values, divided by the true values. This amounts to 11.1%.

Now that we have forecasted, we have a 'bigger' time series to observe, and we clearly see the seasonality, with a maximum in July.

MODEL 2

In this model, we lift the gap by 130 and fit a model to this new series. Our new series looks as follows:

Graphic2.6: chronogram of traffic volume of Afghanistan after lifting the fall
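The lifting step amounts to adding a constant to every observation inside the fall. A short Python sketch is given below; the offset of 130 and the date bounds (2 April to 22 June 2013 inclusive) are read from the SAS code in the annexe, while the function name and the toy rows are assumptions made for the example.

```python
from datetime import date


def lift_fall(rows, start, end, offset):
    """Add `offset` to the values dated between `start` and `end` inclusive."""
    return [(day, value + offset if start <= day <= end else value)
            for day, value in rows]


# Toy rows: one day before the fall, one inside it, one after it.
rows = [(date(2013, 4, 1), 300.0), (date(2013, 5, 15), 170.0), (date(2013, 6, 23), 310.0)]
lifted = lift_fall(rows, date(2013, 4, 2), date(2013, 6, 22), 130)
```

A constant shift keeps the day-to-day variation inside the fall unchanged and only changes the differenced series at the two edges of the window.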

We difference this series at lag 1 and at lag 365 as before. According to the ACF and PACF, $x_t$ then fits the following SARIMA(10,1,0)(0,1,0)_365:

$(1 - B^{365})^1 (1 - B) x_t = \frac{\Theta_0(B)}{\Phi_{10}(B)} \varepsilon_t$

To verify this model we check the information below:

- The Autocorrelation check of residuals gives us,

- As mentioned earlier, we obtain white noise according to the Portmanteau tests. Also, all of the estimates are significant, as seen in the table below.

$x_t$ fits the model SARIMA(10,1,0)(0,1,0)_365:

$z_t = (1 - B^{365})^1 (1 - B) x_t = \frac{\Theta_0(B)}{\Phi_{10}(B)} \varepsilon_t$

Predictions

We use this model to make the predictions for 365 days. Below is the graph of the series

Further, we calculate the error as explained before and we get 14.6% for 30 days.

MODEL 3

Here we work on the series after the gap and delete the previous data.

Graphic 2.7: Chronogram of the selected data after the gap

To get a stationary series we difference it once at lag 1, and to account for the seasonality we difference it at lag 365 as before. With the same methodology, we try to fit a model.

To verify this model we check the information below.

- The Autocorrelation check of residuals verifies the white noise according to the test.

- We obtain the estimates to be significant.

The final model is

$(1 - B^{365})^1 (1 - B) x_t = \frac{\Theta_0(B)}{\Phi_{10}(B)} \varepsilon_t$

So $x_t$ follows a SARIMA(10,1,0)(0,1,0)_365.

Predictions :

We use this model to make the predictions for 365 days. Below is the graph of the series

Further we calculate the error as explained before and we get 10.98 percent for 20 days.

So, for the 3 models, we get quite similar SARIMA models.

                                      Original Data              Lifted Data                 Cut Data
Model of x_t                          SARIMA(6,1,0)(0,1,0)_365   SARIMA(10,1,0)(0,1,0)_365   SARIMA(10,1,0)(0,1,0)_365
Model of z_t = (1 - B^365) Δ_1 x_t    ARMA(6,0)                  ARMA(10,0)                  ARMA(10,0)
Error rate (%)                        11.1                       14.6                        10.98

Conclusion

In this assignment, we analysed the telecommunication traffic series in Toulouse and in Afghanistan. Since the traffic series in Toulouse behaves much better than that in Afghanistan, the prediction is effective over a longer horizon; hence we predicted one year of traffic volume for Toulouse but only twenty days for Afghanistan. Both traffic series were non-stationary because the demand patterns influencing them were not stable, thus requiring a transformation of the series, which is generally done by differencing (as we did at lags 365 and 7 for Toulouse, and at lags 1 and 365 for Afghanistan). From our study we can say that modern telecommunication traffic with strong correlation characteristics can be appropriately modelled by time series models, especially seasonal ARIMA. The seasonal ARIMA models finally chosen as the most appropriate in this study performed fairly well in terms of residuals, which showed no autocorrelation. To conclude, we recommend using ARIMA models with customised lags for each cell of Toulouse and Afghanistan.

Annexe:

Codes for Toulouse

/* importing the dataset */
PROC IMPORT OUT=tel
    DATAFILE="C:\Users\USER\Desktop\TSE M2 Eco STat\Telcap\Donnes\Tls data\Finaldata.xlsx"
    DBMS=xlsx REPLACE;
    SHEET="HistoricalTraffic";
    GETNAMES=YES;
RUN;

/* we keep from the dataset only the variables of interest (Traffic_CS and the date) */
data tel (keep=date Traffic_CS);
    set tel;
run;

/* we delete the potential outliers, i.e. values for which Traffic_CS is very small */
data tel1;
    set tel;
    if Traffic_CS<50 then delete;
run;

/* proc expand with the spline method fills the gaps in the data (for more information see the references) */
proc expand data=tel1 out=tel1 to=day method=spline plots=TRANSFORMIN;
    id date;
run;

/* ATTENTION: this is the final code. Initially we did the whole procedure with
   Traffic_CS and not with Traffic_CSbis in order to find the ARMA(1,(7)), as follows:
     data tel2; set telp; ltra_CS=log(Traffic_CS); ltra7=dif7(ltra_CS); ltra365=dif365(ltra7); run;
     proc arima data=tel2;
       i var=ltra_CS(7,365) minic perror=(1:11); run;
       e p=1 q=(7) noint plot; run;
*/

/* Prediction and calculating the error */
/* Having settled on our model, we delete the data after 31/12/2012 (kept aside in Traffic_CS,
   set to missing in Traffic_CSbis) in order to compare them with the prediction; the reason
   for deleting a part of the data is explained in the report. This step must come before
   the first use of Traffic_CSbis. */
data telp;
    set tel1;
    Traffic_CSbis = Traffic_CS;
    if date > '31DEC2012'd then Traffic_CSbis = .;
run;

/* we take the log of the traffic and difference seasonally twice,
   for the weekly and the yearly seasonality */
data tel2;
    set telp;
    ltra_CS=log(Traffic_CSbis);
    ltra7=dif7(ltra_CS);
    ltra365=dif365(ltra7);
run;

/* we fit the model ARMA(1,(7)) and predict the following 365 days */
proc arima data=tel2;
    i var=ltra_CS(7,365) minic perror=(1:11); run;   /* differencing for 7 and 365 */
    e p=1 q=(7) noint plot; run;                     /* estimation of ARMA(p, q) without the intercept */
    f out=previs lead=365 id=date interval=day noprint; run;   /* forecast for lead=365 days, stored in dataset previs */
quit;

/* we took the log transformation before, so now we apply the back-transformation mentioned in the report */
data previs;
    set previs;
    Traffic_forecast=exp(forecast + STD*STD/2);
run;

data previbis;
    merge previs tel2;
    by date;
run;

/* we plot the time series and the prediction */
proc gplot data=previbis;
    symbol1 v=plus i=join color=black;
    symbol2 v=star i=join color=red;
    plot Traffic_CS * date=1 Traffic_forecast * date=2 / overlay href='01JAN2013'd;
run;

/* then we calculate the error, on the held-out period only */
data difference;
    set previbis;
    if date > '31DEC2012'd then do;
        error = ABS(Traffic_CS - Traffic_forecast);
        Rerror = ABS(Traffic_CS - Traffic_forecast)/Traffic_CS;
        Qerror = (Traffic_CS - Traffic_forecast)**2;
    end;
run;

proc means data=difference;
    var error Rerror Qerror;
run;

Codes for Kabul

/* import the data */
proc import out=kblc279_4
    datafile="C:\Users\damien1991\Desktop\M2\Stat Consulting\kblc279_4"
    dbms=xlsx replace;
    sheet="Historical Traffic";
    getnames=yes;
run;

/* only keep date and Traffic_CS */
data tel279 (keep=date Traffic_CS);
    set kblc279_4;
run;

/* fill the gaps with spline method */
proc expand data=tel279 out=tel279 to=day method=spline plots=TRANSFORMIN;
    id date;
run;

proc arima data=tel279;
    i var=Traffic_CS(1,365) nlag=100 minic perror=(1:11); run;
    e p=6 noint; run;                                        /* estimate the model AR(6) */
    f out=previs_af279 lead=365 id=date interval=day; run;   /* forecast for leads = 365 days */

data previs_af;
    set previs_af279;
    Traffic_forecast=forecast;
run;

/* merge forecasted and real data */
data previbis_af;
    merge previs_af tel279;
    by date;
run;

/* plot forecast */
proc gplot data=previbis_af;
    symbol1 v=plus i=join color=black;
    symbol2 v=star i=join color=red;
    plot Traffic_CS * date=1 Traffic_forecast * date=2 / overlay href='12SEP2014'd;
run;

/********************** calculate the error *****************************/
data tel279_bis;
    set tel279;
    if Date>'02SEP2014'd then delete;
run;

proc arima data=tel279_bis;
    i var=Traffic_CS(1,365) nlag=100 minic perror=(1:11); run;
    e p=6 noint; run;
    f out=previs_af279 lead=20 id=date interval=day; run;

data previs_af;
    set previs_af279;
    Traffic_forecast=forecast;
run;

data previbis_af;
    merge previs_af tel279;
    by date;
run;

proc gplot data=previbis_af;
    symbol1 v=plus i=join color=black;
    symbol2 v=star i=join color=red;
    plot Traffic_CS * date=1 Traffic_forecast * date=2 / overlay href='02SEP2014'd;
run;

data difference;
    set previbis_af;
    if date>'02SEP2014'd then do;
        error = ABS(Traffic_CS - Traffic_forecast);
        Rerror = ABS(Traffic_CS - Traffic_forecast)/Traffic_CS;
        Qerror = (Traffic_CS - Traffic_forecast)**2;
    end;
run;

proc means data=difference;
    var error Rerror Qerror;
run;

Afghanistan Lift: we try to fix the fall by lifting it.

/* import the data */
proc import out=kblc279_4
    datafile="C:\Users\damien1991\Desktop\M2\Stat Consulting\kblc279_4"
    dbms=xlsx replace;
    sheet="Historical Traffic";
    getnames=yes;
run;

/* keep the variables of interest date and Traffic_CS */
data tel279 (keep=date Traffic_CS);
    set kblc279_4;
run;

/* fill the gaps with spline method */
proc expand data=tel279 out=tel279 to=day method=spline plots=TRANSFORMIN;
    id date;
run;

/* lift the fall */
data tel279lift;
    set tel279;
    if Date>'01APR2013'd and Date<'23JUN2013'd then Traffic_CS=Traffic_CS+130;
run;

proc arima data=tel279lift;
    i var=Traffic_CS(1,365) minic perror=(1:11); run;        /* identification of the model for Traffic_CS, simple diff and seasonal diff 365 */
    e p=10 noint; run;                                       /* estimation of the model AR(p=10) */
    f out=previs_af279 lead=365 id=date interval=day; run;   /* forecast for leads = 365 days */

data previs_af279;
    set previs_af279;
    Traffic_forecast=forecast;
run;

/* merge of forecasted and original data */
data previbis_af;
    merge previs_af279 tel279lift;
    by date;
run;

/* plot of the forecast */
proc gplot data=previbis_af;
    symbol1 v=plus i=join color=black;
    symbol2 v=star i=join color=red;
    plot Traffic_CS * date=1 Traffic_forecast * date=2 / overlay href='22SEP2014'd;
run;

/********************** calculate the error *****************************/
data tel279lift_bis;
    set tel279lift;
    if Date>'22AUG2014'd then delete;
run;

proc arima data=tel279lift_bis;
    i var=Traffic_CS(1,365) minic perror=(1:11); run;        /* identification of the model for Traffic_CS, simple diff and seasonal diff 365 */
    e p=10 noint; run;                                       /* estimation of the model AR(p=10) */
    f out=previs_af279 lead=365 id=date interval=day; run;   /* forecast for leads = 365 days */

data previs_af279;
    set previs_af279;
    Traffic_forecast=forecast;
run;

/* merge of forecasted and original data */
data previbis_af;
    merge previs_af279 tel279lift;
    by date;
run;

/* plot of the forecast */
proc gplot data=previbis_af;
    symbol1 v=plus i=join color=black;
    symbol2 v=star i=join color=red;
    plot Traffic_CS * date=1 Traffic_forecast * date=2 / overlay href='22AUG2014'd;
run;

/* computation of the error */
data difference;
    set previbis_af;
    if date>'22AUG2014'd then do;
        error = ABS(Traffic_CS - Traffic_forecast);
        Rerror = ABS(Traffic_CS - Traffic_forecast)/Traffic_CS;
        Qerror = (Traffic_CS - Traffic_forecast)**2;
    end;
run;

proc means data=difference;
    var error Rerror Qerror;
run;

Afghanistan cut: we keep the data after the fall

/* import the data */
proc import out=kblc279_4
    datafile="C:\Users\damien1991\Desktop\M2\Stat Consulting\kblc279_4"
    dbms=xlsx replace;
    sheet="Historical Traffic";
    getnames=yes;
run;

/* only keep date and Traffic_CS */
data tel279 (keep=date Traffic_CS);
    set kblc279_4;
run;

/* fill the gaps with spline method */
proc expand data=tel279 out=tel279 to=day method=spline plots=TRANSFORMIN;
    id date;
run;

/* select the data after the fall */
data tel279cut;
    set tel279;
    if Date<'26JUN2013'd then delete;
run;

proc arima data=tel279cut;
    i var=Traffic_CS(1,365) nlag=100 minic perror=(1:11); run;
    e p=10 noint; run;                                       /* estimate the model AR(10) */
    f out=previs_af279 lead=365 id=date interval=day; run;   /* forecast for leads = 365 days */

data previs_af;
    set previs_af279;
    Traffic_forecast=forecast;
run;

/* merge forecasted and real data */
data previbis_af;
    merge previs_af tel279cut;
    by date;
run;

/* plot forecast */
proc gplot data=previbis_af;
    symbol1 v=plus i=join color=black;
    symbol2 v=star i=join color=red;
    plot Traffic_CS * date=1 Traffic_forecast * date=2 / overlay href='12SEP2014'd;
run;

/********************** calculate the error *****************************/
data tel279cut_bis;
    set tel279cut;
    if Date>'02SEP2014'd then delete;
run;

proc arima data=tel279cut_bis;
    i var=Traffic_CS(1,365) nlag=100 minic perror=(1:11); run;
    e p=10 noint; run;
    f out=previs_af279 lead=20 id=date interval=day; run;

data previs_af;
    set previs_af279;
    Traffic_forecast=forecast;
run;

data previbis_af;
    merge previs_af tel279cut;
    by date;
run;

proc gplot data=previbis_af;
    symbol1 v=plus i=join color=black;
    symbol2 v=star i=join color=red;
    plot Traffic_CS * date=1 Traffic_forecast * date=2 / overlay href='02SEP2014'd;
run;

data difference;
    set previbis_af;
    if date>'02SEP2014'd then do;
        error = ABS(Traffic_CS - Traffic_forecast);
        Rerror = ABS(Traffic_CS - Traffic_forecast)/Traffic_CS;
        Qerror = (Traffic_CS - Traffic_forecast)**2;
    end;
run;

proc means data=difference;
    var error Rerror Qerror;
run;
