MGT 267 PROJECT
Forecasting the United States Retail Sales of the Pharmacies and Drug Stores
Done by: Shunwei Wang & Mohammad Zainal
Dec. 2002
ABSTRACT
The present study aims at forecasting pharmacy and drug store retail sales in the US.
Several forecasting techniques are examined, namely the moving average,
simple exponential smoothing, Holt’s exponential smoothing, Winters’ exponential smoothing,
simple regression, multiple regression, time series decomposition, and the ARIMA model. Quarterly
data are used to predict retail sales with each of these models. The forecasts
obtained by the ARIMA model are found to be the best among the models considered. The assessment
criteria are minimum RMSE, minimum MAPE, and maximum R-Square.
1. INTRODUCTION
The retail sales of pharmacies and drug stores in the US are essential economic data
for pharmaceutical companies. They have a significant impact on the market decisions made by
managers, who use them to predict future sales, inventory needs, personnel requirements, and other
economic or business quantities. However, many variables may affect the forecasting
of retail sales. We are therefore interested in forecasting the retail sales of pharmacies and drug
stores in the US and in building a suitable forecasting model.
Monthly and quarterly data of the real economic variable are obtained from the following
source: (http://www.economagic.com/em-cgi/data.exe/cenret/nrt28). The monthly data are
arranged in quarterly format in the present investigation. Forty quarterly data points from 1992 to
2001 are used.
[Figure: The retail sales of pharmacies and drug stores in the US. Horizontal axis: Time, Q1-92 to Q3-01; vertical axis: The retail sale (Million), 16500 to 41500.]
The time series plot reveals certain characteristics of the retail
sales of pharmacies and drug stores in the US from 1992 to 2001. These can be
summarized as follows:
1. There is a positive trend in the time series. The upward movement is attributed to
growth in the population and in health care standards, which increases the amount of
money spent. Moreover, recent advances in the field of pharmacy have led to more
effective and more expensive drugs compared with conventional ones.
2. A seasonal pattern occurs in the data. There is a significant increase in retail sales in
the fourth quarter. The expected reasons are the following:
   - Colds and flu become more common in this quarter.
   - The fourth quarter is the holiday season, so pharmaceutical and related products
     are purchased as gifts more than in the other quarters.
   - In general, due to globalization, companies nowadays are involved in many other
     types of business. One company may invest in a sister company, and the
     combined retail sales of the company take effect in the fourth quarter.
The data are separated into two groups. One is the historical data used to fit the forecasting
model, with 36 periods from Q1 1992 to Q4 2000; the other is the holdout used to test the
goodness of fit, with 4 periods from Q1 2001 to Q4 2001.
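As a rough illustration of this split and of the assessment criteria used in the sections that follow, the Python sketch below holds out the last four points and computes RMSE, MAPE, R-Square and RMSE/Mean. The function names are hypothetical, and the R-Square definition (one minus the ratio of residual to total sum of squares) is an assumption; ForecastX may compute these statistics slightly differently.

```python
import math

def split_holdout(series, holdout=4):
    """Return (historical, holdout): the last `holdout` points are kept aside."""
    return series[:-holdout], series[-holdout:]

def rmse(actual, forecast):
    """Root mean square error."""
    return math.sqrt(sum((a - f) ** 2 for a, f in zip(actual, forecast)) / len(actual))

def mape(actual, forecast):
    """Mean absolute percentage error, in percent."""
    return 100.0 * sum(abs(a - f) / abs(a) for a, f in zip(actual, forecast)) / len(actual)

def r_square(actual, fitted):
    """Assumed definition: 1 - SS_residual / SS_total."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - f) ** 2 for a, f in zip(actual, fitted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

def rmse_over_mean(actual, forecast):
    """RMSE expressed as a percentage of the mean of the actual values."""
    return 100.0 * rmse(actual, forecast) / (sum(actual) / len(actual))

# Placeholder series of 40 points (not the retail sales data).
historical, holdout = split_holdout(list(range(1, 41)))
```

With 40 quarterly points, `split_holdout` leaves 36 points for fitting and 4 for the holdout test, matching the split described above.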
2. FORECASTING TECHNIQUES AND THEIR RESULTS:
2.1 Moving Average
The moving average technique is used as a forecast model for the retail sales data. A four-quarter
moving average is used since the seasonal pattern repeats every four quarters.
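A minimal sketch of a four-quarter moving average forecast is given here, assuming the forecast for each period is the plain mean of the four preceding observations. The function name is hypothetical and the sample values are placeholders, not the retail sales data.

```python
def moving_average_forecast(series, window=4):
    """One-step-ahead forecasts.

    Entry t of the result is the forecast for period t, i.e. the mean of
    series[t-window:t]. The first `window` entries are None because no
    forecast can be formed yet; the final entry is the forecast for the
    period just after the sample.
    """
    forecasts = [None] * window
    for t in range(window, len(series) + 1):
        forecasts.append(sum(series[t - window:t]) / window)
    return forecasts

# Placeholder values only.
example = moving_average_forecast([1, 2, 3, 4, 5, 6])
```

Because each forecast averages one observation from every quarter, the seasonal pattern is smoothed out, but the forecast lags behind a trending series.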
[Figure: The US Retail Sales: Pharmacies and Drug Stores. Horizontal axis: Apr-92 to Oct-01; vertical axis: 16500 to 41500. Series plotted: Series 1, Forecast of Series 1, Fitted Values.]
Method                                            4-Quarter Moving Average
Mean Absolute Percentage Error (MAPE)             4.35%
R-Square                                          89.54%
Root Mean Square Error (historic, before 2001)    1,540.34
RMSE / Mean (holdout, Q1 2001 - Q4 2001)          1.93%
[Figure: ACF and PACF plots, lags 1 to 20, each with Upper Limit and Lower Limit.]
2.2 Simple Exponential Smoothing (SES)
Simple exponential smoothing (SES) is another approach used here to forecast pharmacy and
drug store retail sales in the US. The ForecastX output is shown below.
[Figure: Simple exponential smoothing fit. Horizontal axis: Apr-92 to Oct-01; vertical axis: 16500 to 41500. Series plotted: Y, Forecast of Y, Fitted Values.]
Method                                            Exponential Smoothing
Mean Absolute Percentage Error (MAPE)             4.17%
R-Square                                          89.54%
Root Mean Square Error (historic, before 2001)    1,540.52
RMSE / Mean (holdout, Q1 2001 - Q4 2001)          1.52%
Method Statistics                                 Value
Alpha                                             0.62
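The SES recursion can be sketched as follows, using the textbook form F(t+1) = alpha * Y(t) + (1 - alpha) * F(t). The function name is hypothetical, and starting the recursion from the first observation is an assumption; ForecastX may initialise it differently.

```python
def ses_forecast(series, alpha):
    """Simple exponential smoothing, one-step-ahead forecasts.

    Entry t of the result is the forecast for period t; the final entry is
    the forecast for the period just after the sample.
    """
    forecasts = [series[0]]  # assumption: first forecast equals first observation
    for y in series:
        forecasts.append(alpha * y + (1 - alpha) * forecasts[-1])
    return forecasts

# Placeholder values only; alpha = 0.62 is the value reported in the table above.
example = ses_forecast([10.0, 20.0, 20.0], alpha=0.62)
```

A larger alpha gives more weight to the most recent observation. SES has no trend or seasonal term, so beyond the sample its forecast is flat.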
2.3 Holt’s Exponential Smoothing
This method can be used to bring the forecast values closer to the observed values when
the data series exhibits a trend, which is true for our scenario. It does not, however, account
for seasonality.
[Figure: Holt’s exponential smoothing fit. Horizontal axis: Apr-92 to Oct-01; vertical axis: 16500 to 41500. Series plotted: Series 1, Forecast of Series 1, Fitted Values.]
Method                                            Exponential Smoothing
Mean Absolute Percentage Error (MAPE)             3.47%
R-Square                                          94.48%
Root Mean Square Error (historic, before 2001)    1,118.62
RMSE / Mean (holdout, Q1 2001 - Q4 2001)          0.89%
Method Statistics                                 Value
Alpha                                             0.10
Gamma                                             0.89
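A minimal sketch of Holt's two-parameter recursion is given below, taking Alpha as the level constant and Gamma as the trend constant to match the table. The function name and the initial values (level set to the first observation, trend set to the first difference) are assumptions and need not match ForecastX.

```python
def holt_forecast(series, alpha, gamma, horizon=4):
    """Holt's linear exponential smoothing; returns `horizon` forecasts."""
    level = series[0]              # assumed initial level
    trend = series[1] - series[0]  # assumed initial trend
    for y in series[1:]:
        prev_level = level
        # Update the level, then the trend (growth) estimate.
        level = alpha * y + (1 - alpha) * (prev_level + trend)
        trend = gamma * (level - prev_level) + (1 - gamma) * trend
    # Forecasts extend the last level along the last trend estimate.
    return [level + h * trend for h in range(1, horizon + 1)]

# Placeholder linear series, not the retail sales data.
example = holt_forecast([10 + 2 * t for t in range(6)], alpha=0.10, gamma=0.89)
```

On an exactly linear series the recursion reproduces the line, so the forecasts continue the same slope. There is no seasonal term, which is why a seasonal model is still needed for these data.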
2.4 Winters’ Exponential Smoothing
This method, like the previous one, is an extension of the basic smoothing
model. Whereas Holt’s method adjusts only for trend, Winters’ method is used for data that
exhibit both trend and seasonality.
[Figure: Winters’ exponential smoothing fit. Horizontal axis: Apr-92 to Oct-01; vertical axis: 16500 to 41500. Series plotted: Series 1, Forecast of Series 1, Fitted Values.]
Method                                            Exponential Smoothing
Mean Absolute Percentage Error (MAPE)             1.08%
R-Square                                          99.51%
Root Mean Square Error (historic, before 2001)    334.20
RMSE / Mean (holdout, Q1 2001 - Q4 2001)          0.50%
Method Statistics                                 Value
Alpha                                             0.80
Beta                                              0.82
Gamma                                             0.25
As stated previously, there is seasonality in the retail sales data. The seasonal index of
the fourth quarter is 1.07, which is significantly higher than those of the other three
quarters.
Season    Seasonal Indices
Q1        0.99
Q2        1.00
Q3        0.97
Q4        1.07
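A multiplicative Winters' recursion with seasonal indices of this kind can be sketched as follows. The function name is hypothetical, and several details are assumptions: Alpha is taken as the level constant, Gamma as the trend constant and Beta as the seasonal constant, and the starting values are taken from the first two years of data. ForecastX may use a different convention and initialisation.

```python
def winters_forecast(series, alpha, beta, gamma, period=4, horizon=4):
    """Multiplicative Winters' smoothing; needs at least two full years of data."""
    first = series[:period]
    second = series[period:2 * period]
    level = sum(first) / period                       # assumed initial level
    trend = (sum(second) / period - level) / period   # assumed initial trend
    seasonals = [y / level for y in first]            # assumed initial indices
    for t in range(period, len(series)):
        s = seasonals[t % period]
        prev_level = level
        # Deseasonalise the observation before updating the level.
        level = alpha * series[t] / s + (1 - alpha) * (prev_level + trend)
        trend = gamma * (level - prev_level) + (1 - gamma) * trend
        seasonals[t % period] = beta * series[t] / level + (1 - beta) * s
    n = len(series)
    # Each forecast is the trended level times the index of its season.
    return [(level + h * trend) * seasonals[(n - 1 + h) % period]
            for h in range(1, horizon + 1)]

# Placeholder series with a fixed seasonal pattern and no trend.
example = winters_forecast([99, 100, 97, 104] * 3, alpha=0.80, beta=0.82, gamma=0.25)
```

For a series that repeats the same four values every year, the forecasts repeat that pattern, with the fourth-quarter value scaled up by its seasonal index.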
[Figure: ACF and PACF plots, lags 1 to 20, each with Upper Limit and Lower Limit.]
2.5 Simple Regression
We hypothesize that Personal Consumption Expenditures in Medical care (X1) is
influential in determining US Retail Sales: Pharmacies and Drug Stores (Y). So we look at a
scatter plot of these two variables.
[Figure: Scatter plot of RS against PCE with the fitted linear regression model, RS = -7635.55 + 0.0113290 PCE; S = 1396.76, R-Sq = 94.3 %, R-Sq(adj) = 94.1 %. Horizontal axis: PCE, 2000000 to 4000000; vertical axis: RS, 20000 to 40000.]
The scatter plot shows a positive linear relationship between
these two variables, so the simple regression method can be used here.
[Figure: Simple regression fit. Horizontal axis: Apr-92 to Oct-01; vertical axis: 16500 to 41500. Series plotted: Y, Forecast of Y, Fitted Values.]
The regression equation is
Y = - 7170 + 0.0112 X1
Predictor         Coef          SE        T        P
Constant         -7170        1683    -4.26    0.000
X1           0.0111557   0.0005885    18.96    0.000
Analysis of Variance
Source            DF           SS           MS         F        P
Regression         1    746023507    746023507    359.39    0.000
Residual Error    34     70577628      2075813
Total             35    816601135
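For readers who want to reproduce a fit of this kind, a least-squares line can be computed with the closed-form formulas below. This is only a sketch: the function names are hypothetical, the sample points are placeholders rather than the PCE and retail sales data, and `reported_prediction` simply evaluates the equation printed above.

```python
def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

def reported_prediction(x1):
    """Evaluate the reported equation Y = -7170 + 0.0111557 * X1."""
    return -7170 + 0.0111557 * x1

# Placeholder points lying exactly on y = 3 + 2x.
intercept, slope = fit_line([1, 2, 3, 4], [5, 7, 9, 11])
```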
A diagnostic check based on residual analysis is carried out, as shown in the figure
below. The residuals show a clear pattern, which means that the simple regression model cannot
fit the data properly. To overcome this drawback, a nonlinear term may be added to the regression equation.
[Figure: Residual Analysis. Residuals (RESI1, -2000 to 3000) plotted against fitted values (FITS1, 15000 to 35000).]
[Figure: Scatter plot of RS against PCE with the fitted regression model including a quadratic term, RS = 41072.4 - 0.0232084 PCE + 0.0000000 PCE**2; S = 1068.70, R-Sq = 95.4 %, R-Sq(adj) = 95.1 %. Horizontal axis: PCE, 2000000 to 3500000; vertical axis: RS, 20000 to 35000.]
[Figure: Residual Analysis. Residuals (RESI3, -2000 to 3000) plotted against fitted values (FITS3, 20000 to 35000).]
The figures above illustrate that adding a quadratic term improved the model and
satisfied the assumption of patternless residuals.
Method                                            Linear Regression
Mean Absolute Percentage Error (MAPE)             4.99%
R-Square                                          95.4%
Root Mean Square Error (historic, before 2001)    24,401.03
RMSE / Mean (holdout, Q1 2001 - Q4 2001)          1.04%
The regression equation is
Y = 41072.4 - 2.32E-02 X1 + 5.99E-09 X1^2
R-Sq = 95.4 %
Analysis of Variance
SOURCE        DF          SS          MS          F       P
Regression     2    7.79E+08    3.89E+08    340.994    0.00
Error         33    37689909     1142118
Total         35    8.17E+08
SOURCE        DF      Seq SS          F           P
Linear         1    7.46E+08    359.389        0.00
Quadratic      1    32887719    28.7954    6.26E-06
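The entries of the two analysis of variance tables can be cross-checked with a few lines of arithmetic. The sketch below uses only the sums of squares and degrees of freedom reported for the linear and quadratic fits; the variable names are hypothetical.

```python
# Sums of squares and degrees of freedom as reported in the ANOVA tables.
ss_linear = 746023507        # regression SS, linear term
ss_error_linear = 70577628   # residual SS, linear model (34 df)
ss_quadratic = 32887719      # sequential SS added by the quadratic term
ss_error_quadratic = 37689909  # residual SS, quadratic model (33 df)

ss_total = ss_linear + ss_error_linear

# F statistic of the linear model: MS regression / MS residual.
f_linear = ss_linear / (ss_error_linear / 34)

# Sequential F for the quadratic term, using the quadratic model's error MS.
f_quadratic = ss_quadratic / (ss_error_quadratic / 33)

# R-Square of the quadratic model.
r2_quadratic = (ss_linear + ss_quadratic) / ss_total
```

The results agree with the tabulated F values (about 359.39 and 28.7954) and with the reported R-Sq of 95.4 % for the quadratic model.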
2.6 Multiple-Regression Model
Many variables may affect the forecasting of retail sales of pharmacies and drug
stores in the US, including the total population, gross domestic product (GDP), personal income,
personal consumption expenditures in health insurance, and the number of outpatient visits.
However, some of the proposed variables may be correlated with one another, which would result
in serious error in the regression forecast model. Three explanatory variables are chosen:
1. X1: Personal Consumption Expenditures in Medical Care
(http://www.economagic.com/em-cgi/data.exe/beana/m206u033)
There is a strong relationship between the retail sales of pharmacies and drug stores and
the personal consumption expenditures in medical care. This explanatory
variable implicitly carries the information resulting from growth in the population and in
personal income. A positive coefficient is expected for this variable.
2. X2: Unemployment rate
(http://www.economagic.com/em-cgi/data.exe/feddal/ru)
The unemployment rate is an index of economic conditions. The monthly data of the
unemployment rate are averaged to approximate the quarterly unemployment rate.
3. X3: Inflation in Consumer Price
(http://www.economagic.com/em-cgi/data.exe/var/inflation-ar-cpiu)
The amount of retail sales is affected by inflation in consumer prices. To forecast the
retail sales of pharmacies and drug stores, this explanatory variable is incorporated in our
model. In the same manner, the monthly data are averaged to estimate the quarterly
inflation in consumer prices.
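The averaging of monthly values into quarterly values described for X2 and X3 can be sketched as below. The function name is hypothetical and the sample values are placeholders, not the unemployment or inflation data.

```python
def quarterly_average(monthly):
    """Average consecutive groups of three monthly values into quarters."""
    if len(monthly) % 3 != 0:
        raise ValueError("expected a whole number of quarters")
    return [sum(monthly[i:i + 3]) / 3 for i in range(0, len(monthly), 3)]

# Placeholder monthly values covering two quarters.
example = quarterly_average([1, 2, 3, 4, 5, 6])
```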
The correlations among the three explanatory variables are:
Correlations: X1, X2, X3
          X1        X2
X2    -0.649
       0.000
X3    -0.210     0.037
       0.194     0.818
Cell Contents: Pearson correlation
               P-Value
The largest correlation in absolute value is -0.649 (X1 with X2), so there is no serious
multicollinearity among these three explanatory variables.
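The Pearson correlation coefficients in such a table can be computed as sketched below. The function name is hypothetical and the sample vectors are placeholders; the P-values shown in the table are not reproduced by this sketch.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

# Placeholder vectors: one rising with x, one falling with x.
r_up = pearson([1, 2, 3, 4], [2, 4, 6, 8])
r_down = pearson([1, 2, 3, 4], [8, 6, 4, 2])
```

Values close to +1 or -1 between two explanatory variables would signal multicollinearity; values near zero indicate that the variables carry largely separate information.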
Personal Consumption Expenditures in Medical care (X1), Unemployment rate (X2) and
Inflation in Consumer Price (X3) are used as the explanatory variables.
The regression equation is
Y = - 19104 + 0.0133 X1 + 954 X2 + 225 X3
Predictor         Coef      SE Coef        T        P
Constant        -19104         4510    -4.24    0.000
X1           0.0132927    0.0008566    15.52    0.000
X2               953.7        367.1     2.60    0.014
X3               224.5        184.9     1.21    0.233
Given that the other two variables are in the model, X3 is not significant (P = 0.233). The
regression is carried out again without X3, which gives the regression equation
Y = - 16854 + 0.0129 X1 + 833 X2
Predictor         Coef      SE Coef        T        P
Constant        -16854         4138    -4.07    0.000
X1           0.0129425    0.0008117    15.94    0.000
X2               832.5        355.6     2.34    0.025
Analysis of Variance
Source            DF            SS           MS         F        P
Regression         2    1226570711    613285355    351.44    0.000
Residual Error    37      64568180      1745086
Total             39    1291138890
Dummy variables are added to the model in order to capture the seasonality in the data.
Q2, Q3 and Q4 are coded as follows, with the first quarter as the base category:
Q2=1 for all second quarters and zero otherwise
Q3=1 for all third quarters and zero otherwise
Q4=1 for all fourth quarters and zero otherwise
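This coding can be sketched in a few lines. The function names are hypothetical, and `design_row` only illustrates how one row of regressors (a constant, X1, X2 and the three dummies) might be assembled; the values passed to it are placeholders.

```python
def quarter_dummies(quarter):
    """Return (Q2, Q3, Q4) for a quarter number 1..4; Q1 is the base category."""
    if quarter not in (1, 2, 3, 4):
        raise ValueError("quarter must be 1, 2, 3 or 4")
    return tuple(1 if quarter == q else 0 for q in (2, 3, 4))

def design_row(x1, x2, quarter):
    """One row of regressors: constant, X1, X2, Q2, Q3, Q4."""
    return [1.0, x1, x2, *quarter_dummies(quarter)]

# Placeholder regressor values for a third-quarter observation.
example_row = design_row(2.0, 5.0, 3)
```

Only three dummies are used for four quarters; a fourth dummy would be perfectly collinear with the constant term.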
The regression equation is
Y = - 16182 + 0.0127 X1 + 771 X2 + 207 Q2 - 659 Q3 + 1731 Q4
Predictor         Coef      SE Coef        T        P
Constant        -16182         3143    -5.15    0.000
X1           0.0127178    0.0006208    20.49    0.000
X2               770.6        270.7     2.85    0.007
Q2               207.1        448.4     0.46    0.647
Q3              -658.8        449.8    -1.46    0.152
Q4              1731.0        451.7     3.83    0.001
The variables Q2 and Q3 are not significant in the presence of the other parameters
(P = 0.647 and P = 0.152), so the regression is carried out again without them. This gives the equation
Y = - 15970 + 0.0126 X1 + 746 X2 + 1886 Q4
Predictor         Coef      SE Coef        T        P
Constant        -15970         3230    -4.94    0.000
X1           0.0126391    0.0006355    19.89    0.000
X2               745.8        277.6     2.69    0.011
Q4              1886.5        377.9     4.99    0.000
S = 1029 R-Sq = 97.0% R-Sq(adj) = 96.8%
This makes sense because only the fourth quarter has a significant seasonal impact on retail sales.
2.7 Time Series Decomposition
The trend cycle can be estimated by smoothing the series to reduce the random
variation.
[Figure: 2X4 Moving Average. Horizontal axis: periods 1 to 40; vertical axis: 16500 to 41500.]
The 2x4 MA plot above shows the trend in the RS (retail sales) data.
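A 2x4 moving average is a four-term moving average that is averaged again over two adjacent positions so that it is centred on a quarter. This is equivalent to a five-term weighted average with weights 1/8, 1/4, 1/4, 1/4, 1/8. The sketch below uses that equivalent form; the function name is hypothetical and the sample values are placeholders.

```python
def centered_ma_2x4(series):
    """Centred 2x4 moving average.

    Returns one value for each position that has two neighbours on each
    side, so the result is four points shorter than the input.
    """
    weights = [0.125, 0.25, 0.25, 0.25, 0.125]
    return [
        sum(w * y for w, y in zip(weights, series[i - 2:i + 3]))
        for i in range(2, len(series) - 2)
    ]

# Placeholder linear series: the centred average reproduces the middle values.
example = centered_ma_2x4([1, 2, 3, 4, 5, 6, 7, 8])
```

Each window covers every quarter with equal total weight, so a stable quarterly seasonal pattern averages out and the trend-cycle remains.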
[Figure: Weighted MA smoothing. Horizontal axis: Oct-97 to Dec-00; vertical axis: 16500 to 41500. Series plotted: Y-Q, Fitted Values.]
The plot above uses a weighted MA smoothing technique. At the right-hand end of the curve,
the fitted values are clearly smaller than the actual values, which indicates a rapid
increase in RS in the coming year.
After removing the trend and isolating the seasonal component, exponential smoothing is
used to fit the data.
[Figure: Exponential smoothing fit after decomposition. Horizontal axis: Apr-92 to Oct-01; vertical axis: 0 to 40000. Series plotted: Y, Forecast of Y, Fitted Values.]
Method                                            Exponential Smoothing
Mean Absolute Percentage Error (MAPE)             0.57%
R-Square                                          99.77%
RMSE / Mean (holdout, Q1 2001 - Q4 2001)          1.91%
2.8 ARIMA Model
A second-order difference is applied to remove non-stationarity from the time
series.
[Figure: Second-Order Differences. Horizontal axis: periods 1 to 37; vertical axis: -6000 to 6000.]
Seasonal differencing is also used to remove the seasonal factor.
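Both kinds of differencing can be sketched with one small helper: a regular difference uses a lag of 1, and a seasonal difference for quarterly data uses a lag of 4. The function name is hypothetical and the sample series are placeholders.

```python
def difference(series, lag=1):
    """Return series[t] - series[t - lag] for every t where both exist."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]

# Placeholder series with a quadratic trend: two regular differences
# leave a constant.
squares = [t * t for t in range(8)]
second_order = difference(difference(squares))

# Placeholder series with a fixed quarterly pattern: one seasonal
# difference (lag 4) removes it.
seasonal = [1, 2, 3, 4] * 3
seasonal_diff = difference(seasonal, lag=4)
```

Each regular difference shortens the series by one point and each seasonal difference by four, which is why the differenced plots have fewer points than the original series.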
[Figure: Second Seasonal Difference. Horizontal axis: periods 1 to 35; vertical axis: -1500 to 1500.]
An ARIMA (2,2,0)*(1,2,1) model is used to fit the data.
[Figure: ARIMA fit. Horizontal axis: Apr-92 to Oct-01; vertical axis: 16500 to 41500. Series plotted: Y, Forecast of Y, Fitted Values.]
Method                                            ARIMA (p,d,q)*(P,D,Q)
Mean Absolute Percentage Error (MAPE)             0.91%
R-Square                                          99.36%
RMSE / Mean (holdout, Q1 2001 - Q4 2001)          0.24%
This model forecasts the last year (the holdout period) well.
Method Statistics                                 Value
Method Selected                                   Box Jenkins
Model Selected                                    ARIMA(2,2,0) * (1,2,1)
[Figure: Error plot. Horizontal axis: periods 1 to 31; vertical axis: -1,000.00 to 1,000.00.]
3. DISCUSSION
The moving average method is suitable for stationary data. However, our situation
involves non-stationary data. The R-Square of this model is only about 89.54% and the holdout
RMSE/Mean is about 1.93%. From the ACF and PACF, some autocorrelations are
significantly different from zero (at lags 4, 8 and 12), which confirms the four-quarter
seasonal pattern.
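The check for significant autocorrelation at seasonal lags can be sketched as below. The function names are hypothetical, the limit of 1.96 divided by the square root of n is a common approximation and an assumption here (the limits drawn by ForecastX may differ), and the sample series is a placeholder with a period of four.

```python
import math

def acf(series, lag):
    """Sample autocorrelation of `series` at the given lag."""
    n = len(series)
    mean = sum(series) / n
    dev = [y - mean for y in series]
    denom = sum(d * d for d in dev)
    return sum(dev[t] * dev[t - lag] for t in range(lag, n)) / denom

def significance_limit(n):
    """Approximate 95% limit for a white-noise series of length n (assumption)."""
    return 1.96 / math.sqrt(n)

# Placeholder series that peaks every fourth observation.
example = [1, 2, 3, 10] * 5
lag4 = acf(example, 4)
lag2 = acf(example, 2)
limit = significance_limit(len(example))
```

For a series with a four-quarter pattern the autocorrelation at lag 4 lies well outside the limit, while the value at lag 2 does not point to seasonality.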
No significant difference is found between the SES technique and the moving average
technique. However, SES attained a smaller holdout RMSE/Mean (SES: 1.52%, MA(4): 1.93%).
Again, SES is designed for stationary data, which is not true in our case.
Since Holt’s exponential smoothing adds a growth factor (or trend factor) to the equation as
a way of adjusting for the trend, the model is better than the former ones. The holdout RMSE/Mean
in 2001 is reduced to 0.89% in this model. However, seasonality is still not
considered in this model, so there is still room to improve our model.
For Winters’ exponential smoothing method, the holdout RMSE/Mean
is 0.50% for the last year. MAPE is also significantly reduced, and R-Square is nearly 100%
(99.51%). In addition, no significant autocorrelation
is found for this forecasting technique.
The MAPE of the simple linear regression (4.99%) is larger than that of the previous
forecast models, and the R-Square values of both the simple and the multiple linear
regression are still not very high (95.4% and 97.0%).
The time series decomposition seems to fit the historic data well (R-Square is 99.77%).
However, the holdout RMSE/Mean for the last year is somewhat larger (1.91%).
Finally, the ARIMA model is evaluated using a 2nd-order difference to achieve stationarity in
the data. A 2nd-order seasonal difference is also applied to de-seasonalize the data. The ARIMA
model is found to have the minimum RMSE/Mean ratio (0.24%) compared to the other models. The error
pattern seems to follow a white noise model.
4. CONCLUSION
Different forecasting methods are used to predict retail sales in the US. The ARIMA
technique exhibits the best performance among the models. ARIMA (2,2,0)*(1,2,1) has the lowest
holdout RMSE/Mean of all models (0.24%), with a MAPE of 0.91%. The table below shows the ARIMA
forecasts for the next two years (2003 and 2004), along with the 2002 holdout period, for which
actual values are available.
Forecast -- Box Jenkins Selected
              Actual       Forecast
Date          Quarterly    Quarterly      Annual
Mar-2002 *    34346        33,984.84
Jun-2002 *    35358        34,980.57
Sep-2002 *    34932        33,868.30
Dec-2002 *    38412        37,141.17      139,974.88
Mar-2003                   34,119.04
Jun-2003                   34,624.10
Sep-2003                   33,437.70
Dec-2003                   35,973.18      138,154.01
Mar-2004                   33,306.69
Jun-2004                   33,326.83
Sep-2004                   31,281.33
Dec-2004                   33,003.27      130,918.14
Avg                        34,087.25      136,349.01
Max                        37,141.17      139,974.88
Min                        31,281.33      130,918.14
* Holdout period
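The 2002 rows of the table, where both actual and forecast values are available, allow a quick check of the annual total and of the forecast error. The sketch below uses only the numbers in the table; the variable and function names are hypothetical.

```python
# 2002 quarterly values copied from the table (Mar, Jun, Sep, Dec).
actual_2002 = [34346, 35358, 34932, 38412]
forecast_2002 = [33984.84, 34980.57, 33868.30, 37141.17]

def mape(actual, forecast):
    """Mean absolute percentage error, in percent."""
    return 100.0 * sum(abs(a - f) / abs(a) for a, f in zip(actual, forecast)) / len(actual)

# Annual forecast for 2002: sum of the four quarterly forecasts.
annual_forecast_2002 = sum(forecast_2002)

# Forecast error over the 2002 holdout period.
holdout_mape_2002 = mape(actual_2002, forecast_2002)

# Sign of each quarterly error (actual minus forecast).
errors_2002 = [a - f for a, f in zip(actual_2002, forecast_2002)]
```

The quarterly forecasts sum to the annual figure shown in the table, and every 2002 forecast lies below the corresponding actual value.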