
Journal of Forecasting, J. Forecast. 21, 1–26 (2002). DOI: 10.1002/for.812

Efficient Forecasting in Nearly Non-stationary Processes

ISMAEL SANCHEZ*
Universidad Carlos III de Madrid, Spain

ABSTRACT
This paper proposes a procedure to make efficient predictions in a nearly non-stationary process. The method is based on the adaptation of the theory of optimal combination of forecasts to nearly non-stationary processes. The proposed combination method is simple to apply and has a better performance than classical combination procedures. It also has better average performance than a differenced predictor, a fractional differenced predictor, or an optimal unit-root pretest predictor. In the case of a process that has a zero mean, only the non-differenced predictor is slightly better than the proposed combination method. In the general case of a non-zero mean, the proposed combination method has a better overall performance than all its competitors. Copyright © 2002 John Wiley & Sons, Ltd.

KEY WORDS forecast combination; fractional differencing; near non-stationarity; overdifferencing; pretest forecast; unit roots

INTRODUCTION

The efficient prediction of time series with a high degree of persistence plays an important role in time series analysis. This is especially relevant, for instance, in economic time series, where the presence of persistence in the form of large autoregressive roots is more the rule than the exception. However, these series are, by their very nature, difficult to forecast. Nevertheless, agents still require predictions to help them to make their decisions. Therefore, it is very important to find methods capable of offering efficient forecasts in this context.

This paper deals with the efficient prediction, in terms of mean squared prediction error, of a nearly non-stationary ARMA process. It shows that the adaptation of the theory of forecast combination to nearly non-stationary processes can offer an efficient and yet simple combination scheme that surpasses the classical combination procedures. The proposed method also has better overall performance than other common procedures, such as differencing, estimation of a fractional differenced predictor, pretest differencing (forecasts made with a differenced predictor if the null

* Correspondence to: Ismael Sanchez, Departamento de Estadística y Econometría, Butarque, 15. 28911 Leganés, Madrid, Spain. E-mail: [email protected]
Contract/grant sponsor: CICYT; Contract/grant number: PB96-0339.

Received May 1999
Revised April 2000

Accepted August 2000
Copyright © 2002 John Wiley & Sons, Ltd.


hypothesis of a unit root is not rejected, and with a non-differenced predictor otherwise), or the estimation of the properly specified non-differenced model.

Overdifferencing, when it is suspected that the process is nearly non-stationary, has been recommended by some researchers (Box and Jenkins, 1976, p. 192; Campbell and Perron, 1991), since it can produce better forecasts. The consequences in estimation and prediction of overdifferencing a nearly non-stationary autoregression have been studied theoretically by Sanchez and Peña (2001). These authors show that the misspecified overdifferenced predictor can have a lower mean squared prediction error (MSPE) than a properly specified one if the root is close enough to unity, due to its more parsimonious representation. However, if the root is not close enough to unity, the loss in MSPE can be important. Differencing is, therefore, a method that can improve our predictions, but it should be applied with caution, since we could incur high inefficiency.

Several alternatives can be considered to avoid a great loss when overdifferencing and still take advantage of its potential benefit. The first alternative would be to take differences only when a unit root test does not reject the existence of such a unit root (pretest differencing). Several authors (i.e. Diebold and Kilian, 2000; Stock, 1996) have shown, empirically, that pretest differencing in univariate time series can help to improve the performance of the overdifferenced predictor. The reason for this is that, in the region close to non-stationarity, unit root tests have very low power, inducing the use of the differenced predictor. It is precisely in this region where the differenced predictor can outperform the non-differenced one. Furthermore, when the process is far from the unit circle, tests have high power, preventing the use of the differenced predictor and avoiding high inefficiency. Doubts arise, however, about the performance of pretest differencing in intermediate regions, where unit root tests still have low power and overdifferencing may be inefficient.

A second alternative, as a compromise between differencing and non-differencing, would be to estimate a fractional differenced predictor, also denoted as an ARFIMA model (see, for instance, Beran, 1995). Due to the long-memory properties of these models, predictions generated by a fractional differenced predictor will be similar to those of an overdifferenced one. Furthermore, the estimation of the differencing parameter will allow the model to adapt to the characteristics of the real process. As the process moves off the unit circle, the estimated differencing parameter will tend to be smaller, allowing a quicker reversion to the mean.

A third and natural alternative would be to optimally combine the forecasts from both the differenced and the non-differenced predictors. The attention is focused on linear combinations. The literature on combining forecasts was initiated by Bates and Granger (1969), who show that a linear combination of two (stationary and unbiased) competing forecasts can achieve a lower MSPE than either of the original predictors. Let $f_{1,t}$, $f_{2,t}$ be two competing unbiased forecasts of the value $y_t$. Let $e_{1,t}$ and $e_{2,t}$ be the prediction errors of the forecasts and let us denote the variances of these prediction errors by $\sigma_1^2$ and $\sigma_2^2$. According to Bates and Granger (1969), the optimal combined forecast would be $f_{c,t} = \alpha f_{1,t} + (1 - \alpha) f_{2,t}$, where

$$\alpha = \frac{\sigma_2^2 - \mathrm{cov}(e_{1,t}, e_{2,t})}{\sigma_1^2 + \sigma_2^2 - 2\,\mathrm{cov}(e_{1,t}, e_{2,t})} \qquad (1)$$

It can be proved (e.g. R.F. Phillips, 1987) that the problem of finding the parameters of the optimal combination (in terms of minimizing the MSPE) is equivalent to the regression problem:

$$y_t = \alpha_0 + \alpha_1 f_{1,t} + \alpha_2 f_{2,t} + e_t \qquad (2)$$

restricted to

$$\alpha_0 = 0; \quad \alpha_1 + \alpha_2 = 1 \qquad (3)$$

Copyright 2002 John Wiley & Sons, Ltd. J. Forecast. 21, 1–26 (2002)


In fact, the estimation of the parameters in equation (1) by their sampling counterparts yields a result numerically identical to the ordinary least squares (OLS) estimation of regression (2) subject to (3). An extension of this combination procedure is provided by Granger and Ramanathan (1984), who show that the estimation of the unrestricted regression (2) provides a lower in-sample MSPE than the constrained regression. As shown in Diebold (1988), the estimation of the unrestricted regression (2) produces, in general, serial correlation in the residuals. He also shows that a combination scheme that includes a constant term, but imposes the restriction $\alpha_1 + \alpha_2 = 1$, avoids this source of autocorrelation and still produces an unbiased combined forecast. This model can be expressed as

$$y_t = \alpha_0 + \alpha_1 f_{1,t} + (1 - \alpha_1) f_{2,t} + e_t \qquad (4)$$
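The numerical identity between the sampling counterpart of (1) and the restricted OLS regression can be checked directly. The following Python sketch (the forecasts and data are simulated purely for illustration; the paper's own computations were done in Matlab) estimates $\alpha$ both ways, using uncentred second moments:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
y = rng.standard_normal(n)              # values to be forecast
f1 = y + 0.3 * rng.standard_normal(n)   # a relatively accurate unbiased forecast
f2 = y + 1.0 * rng.standard_normal(n)   # a noisier unbiased forecast
e1, e2 = y - f1, y - f2                 # prediction errors

# Sampling counterpart of equation (1), with uncentred second moments
s11, s22, s12 = e1 @ e1, e2 @ e2, e1 @ e2
alpha = (s22 - s12) / (s11 + s22 - 2.0 * s12)

# OLS on regression (2) under restrictions (3): regress y - f2 on
# f1 - f2 through the origin; the slope is alpha_1
d = f1 - f2
alpha_ols = (d @ (y - f2)) / (d @ d)

assert np.isclose(alpha, alpha_ols)     # numerically identical
```

As expected, almost all the weight falls on $f_{1,t}$ here, since its error variance is much smaller than that of $f_{2,t}$.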

This paper adapts the combination scheme (4) to a nearly non-stationary process following an ARMA model. In the next section the model is presented and the effect of the different roots, in the case of a nearly non-stationary process, is discussed. It demonstrates that the effect of the roots that are not close to the unit circle can be very small and, therefore, the analysis of the AR(1) case offers an appropriate framework on which to build the combination model. The third section studies the optimal h-steps-ahead linear combination of the AR(1) and the random walk predictors. This section can be considered an extension of the work of Clements and Hendry (1998) to a multi-horizon forecast and to the case where the mean of the process is unknown. This section also compares the proposed optimal combination with pretest differencing and shows that pretest predictors can be very inefficient. In the fourth section the proposed procedure is compared with its competitors, using a simulation exercise, both in the AR(1) case and in more general cases. The fifth section shows two empirical examples that illustrate the advantage of the proposed procedure. Conclusions are presented in the final section.

NEARLY NON-STATIONARY PROCESSES

Let $\{y_t\}$ be the following stationary and invertible ARMA($p+1$, $q$) process:

$$\phi(B)(1 - \rho B)y_t = c + \theta(B)a_t \qquad (5)$$

where $B$ is the backshift operator; $\phi(B) = 1 - \sum_{i=1}^{p}\phi_i B^i$ and $\theta(B) = 1 - \sum_{i=1}^{q}\theta_i B^i$ are polynomial operators on $B$ such that $\phi(B) = 0$ and $\theta(B) = 0$ have all their roots outside the unit circle. The parameter $\rho$ also holds $|\rho| < 1$. Let $a_t$ be a sequence of independent identically distributed random variables with zero mean and variance $\sigma^2$. If a difference is applied to $y_t$, the resulting series, $w_t = (1 - B)y_t$, can be represented as

$$\phi(B)(1 - \rho B)w_t = (1 - B)\theta(B)a_t \qquad (6)$$

A process will be said to be nearly non-stationary if it generates time series that can be easily confused with non-stationary ones. The main feature of this process will be that the autoregressive characteristic equation will have a root, $\rho^{-1}$, very close to unity. If $\rho$ is close enough to one, the term $(1 - \rho B)$ in (6) will be similar to $(1 - B)$. Therefore, although the overdifferenced process $w_t$ is a non-invertible ARMA($p+1$, $q+1$), it will be easily confused, in a real situation, with an ARMA($p$, $q$). Some authors have parameterized this root as $\rho = \exp(-c/T) = 1 - c/T + o(T^{-1})$ (i.e.


P.C.B. Phillips, 1987) for the AR(1) case. This parameterization is useful in representing that, as the sample size increases, $\rho$ has to be closer to one in order to be confused with an integrated process. In this article, however, such a parameterization will not be needed.

The similarity between $w_t$ and a true ARMA($p$, $q$) process does not depend only on $\rho$ but is also influenced by the remaining roots. To better illustrate this point, let us first suppose that the process is the autoregression AR($p+1$): $\phi(B)(1 - \rho B)y_t = a_t$. The overdifferenced process would then be: $\phi(B)(1 - \rho B)w_t = (1 - B)a_t$. If $\rho$ is close to unity, this process will resemble an AR($p$). Let $\psi_j$ be the coefficients of the polynomial $\psi(B) = 1 - \psi_1 B - \psi_2 B^2 - \cdots$, where $\phi(B)(1 - \rho B) = \psi(B)(1 - B)$. These coefficients follow

$$\psi_j = \begin{cases} \phi_j + (\rho - 1)\left(1 - \displaystyle\sum_{i=1}^{j-1}\phi_i\right) & \text{if } j \le p \\[2ex] (\rho - 1)\left(1 - \displaystyle\sum_{i=1}^{p}\phi_i\right) & \text{if } j > p \end{cases} \qquad (7)$$

where the sum in the first branch is taken as zero for $j = 1$. Let us denote by $r_i^{-1}$, $i = 1, \ldots, p$, the roots of the characteristic equation $\phi(B) = 0$. Then,

$$\left(1 - \sum_{i=1}^{p}\phi_i\right) = \prod_{i=1}^{p}(1 - r_i) \qquad (8)$$

Therefore, if in (7) $\rho$ is very close to unity and the remaining roots are not, the effect of these remaining roots will be very small and $\psi_j \approx 0$ for $j > p$. When $\rho$ is not close to unity, however, the remaining roots can have some influence. From (8) it can be seen that (1) negative values of $r_i$ increase the value of $\psi_j$, $j > p$, and reduce the similarity between $w_t$ and an AR($p$); and (2) positive values of $r_i$ will increase the probability of its confusion with an AR($p$).
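The closed form (7) can be verified against the defining polynomial identity $\phi(B)(1 - \rho B) = \psi(B)(1 - B)$. A short Python sketch (parameter values are arbitrary illustrations, not taken from the paper) computes the $\psi_j$ both ways:

```python
import numpy as np

def psi_closed_form(phi, rho, J):
    """psi_j, j = 1..J, from equation (7); phi = [phi_1, ..., phi_p]."""
    p = len(phi)
    out = []
    for j in range(1, J + 1):
        if j <= p:
            out.append(phi[j - 1] + (rho - 1.0) * (1.0 - sum(phi[:j - 1])))
        else:
            out.append((rho - 1.0) * (1.0 - sum(phi)))
    return np.array(out)

def psi_by_division(phi, rho, J):
    """psi_j obtained by expanding psi(B) = phi(B)(1 - rho*B)/(1 - B)."""
    lhs = np.convolve(np.r_[1.0, -np.asarray(phi)], [1.0, -rho])
    # 1/(1 - B) = 1 + B + B^2 + ..., so the quotient coefficients are the
    # cumulative sums of the lhs coefficients, truncated at order J
    full = np.convolve(lhs, np.ones(J + 1))[: J + 1]
    return -full[1:]          # psi(B) = 1 - psi_1*B - psi_2*B^2 - ...

phi, rho = [0.5, -0.2], 0.95
assert np.allclose(psi_closed_form(phi, rho, 8), psi_by_division(phi, rho, 8))
```

With these values the tail coefficients equal $(\rho - 1)(1 - \sum\phi_i) = -0.035$: small, as the discussion above anticipates for $\rho$ close to unity.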

Similarly, if the process follows the ARMA($1$, $q$): $(1 - \rho B)y_t = \theta(B)a_t$, the overdifferenced process would be $(1 - \rho B)w_t = (1 - B)\theta(B)a_t$. Again, if $\rho$ is very close to unity, this process will be very similar to an MA($q$). Let $\pi_j$ be the coefficients of the polynomial $\pi(B) = 1 - \pi_1 B - \pi_2 B^2 - \cdots$, where $\pi(B)(1 - \rho B) = (1 - B)\theta(B)$. These coefficients follow:

$$\pi_j = \begin{cases} \theta_j + (1 - \rho)\left(\rho^{j-1} - \displaystyle\sum_{i=1}^{j-1}\rho^{j-1-i}\theta_i\right) & \text{if } j \le q \\[2ex] \rho^{j-q-1}(1 - \rho)\left(\rho^q - \displaystyle\sum_{i=1}^{q}\rho^{q-i}\theta_i\right) & \text{if } j > q \end{cases} \qquad (9)$$

where the sum in the first branch is taken as zero for $j = 1$. If we denote by $s_i^{-1}$, $i = 1, \ldots, q$, the roots of the characteristic equation $\theta(B) = 0$ then,

$$\left(\rho^q - \sum_{i=1}^{q}\rho^{q-i}\theta_i\right) = \prod_{i=1}^{q}(\rho - s_i) \qquad (10)$$

Hence, if $\rho$ is very close to one and the remaining roots $s_i$ are not, their influence in (9) will be very small. In order to demonstrate the effect of the remaining roots more clearly, it is more convenient to analyse the ARMA(1,1) case: $(1 - \rho B)y_t = (1 - \theta B)a_t$. The coefficients (9) are:

$$\pi_j = \begin{cases} \theta + (1 - \rho) & \text{if } j = 1 \\ \rho^{j-2}(1 - \rho)(\rho - \theta) & \text{if } j > 1 \end{cases}$$


The effect of the moving average root $\theta$ is as follows: (1) as the root $\theta$ approaches the value $\rho$ (close to unity) the coefficients $\pi_j$ will be $\pi_1 \approx 1$ and $\pi_j \approx 0$, $j > 1$; the process $y_t$ will then resemble a white noise, instead of a random walk; (2) when the root $\theta$ approaches $-1$, the coefficients will be $\pi_1 \approx -\rho$ and $\pi_j \approx \rho^{j-2}(1 - \rho^2)$, $j > 1$, so that the process $y_t$ will then be close to an IMA(1,1). The moving average roots will then have the opposite effect to that of the autoregressive ones.
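These ARMA(1,1) coefficients can likewise be checked against the expansion of $\pi(B) = (1 - B)(1 - \theta B)/(1 - \rho B)$. A minimal Python sketch with illustrative parameter values:

```python
import numpy as np

def pi_closed_form(theta, rho, J):
    """pi_1 = theta + (1 - rho); pi_j = rho**(j-2)*(1-rho)*(rho-theta), j > 1."""
    return np.array([theta + (1.0 - rho) if j == 1
                     else rho ** (j - 2) * (1.0 - rho) * (rho - theta)
                     for j in range(1, J + 1)])

def pi_by_expansion(theta, rho, J):
    """Expand pi(B) = (1 - B)(1 - theta*B)/(1 - rho*B) up to order J."""
    num = np.convolve([1.0, -1.0], [1.0, -theta])   # (1 - B)(1 - theta*B)
    geo = rho ** np.arange(J + 1)                   # coefficients of 1/(1 - rho*B)
    full = np.convolve(num, geo)[: J + 1]
    return -full[1:]                                # pi(B) = 1 - sum_j pi_j B^j

assert np.allclose(pi_closed_form(0.4, 0.95, 10), pi_by_expansion(0.4, 0.95, 10))
# theta near rho: pi_1 is near 1 and the tail vanishes (white-noise case)
assert np.allclose(pi_closed_form(0.95, 0.95, 10)[0], 1.0)
```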

In summary, although all the roots can influence the similarity between a stationary and an integrated process, the influence of the roots that are different from $\rho$ will be very small, even negligible, if $\rho$ is close enough to unity. Hence, the adaptation of the theory of forecast combination to nearly non-stationary processes can be achieved by focusing on the AR(1) case. The expressions for the coefficients of the combination in the AR(1) case will be functions of the estimated autoregressive parameter $\rho$. In a general ARMA situation, and since the effects of the remaining roots are very small, the same expressions as in the AR(1) case can be used, employing the corresponding estimate of $\rho$. Only in extreme cases could the efficiency of the proposed combination be affected.

COMBINING FORECASTS IN NEARLY NON-STATIONARY AR(1) PROCESSES

The zero mean case
Let $y_t$ follow the process

$$y_t = \rho y_{t-1} + a_t \qquad (11)$$

where $|\rho| < 1$, and $a_t$ is a white noise. If the process is thought to have a unit root, the h-steps-ahead predictor, from observation $T$, will be $\tilde{y}_{T+h} = y_T$. The MSPE of this predictor will be denoted as $V_1$, and is

$$V_1 = E(y_{T+h} - y_T)^2 = 2\sigma^2\,\frac{1 - \rho^h}{1 - \rho^2} \qquad (12)$$

However, if the process is thought to be stationary, the predictor, estimated by OLS, would be $\hat{y}_{T+h} = \hat{\rho}^h y_T$. The MSPE of this predictor will be denoted as $V_\rho$, and is (see, for instance, Kunitomo and Yamamoto, 1985)

$$V_\rho = E(y_{T+h} - \hat{\rho}^h y_T)^2 = \sigma^2\left(\frac{1 - \rho^{2h}}{1 - \rho^2} + \frac{h^2\rho^{2h-2}}{T}\right) + O(T^{-3/2}) \qquad (13)$$
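Expressions (12) and (13) already make the trade-off concrete. A small Python sketch of the two approximate MSPEs (with $\sigma^2 = 1$ and the $O(T^{-3/2})$ remainder ignored) shows where each predictor wins:

```python
def V1(rho, h, sigma2=1.0):
    """MSPE of the random-walk predictor y_T, equation (12)."""
    return 2.0 * sigma2 * (1.0 - rho ** h) / (1.0 - rho ** 2)

def V_rho(rho, h, T, sigma2=1.0):
    """Approximate MSPE of the estimated AR(1) predictor, equation (13)."""
    return sigma2 * ((1.0 - rho ** (2 * h)) / (1.0 - rho ** 2)
                     + h ** 2 * rho ** (2 * h - 2) / T)

# For T = 50, h = 1: differencing wins very close to the unit root,
# but loses once the root moves away from unity
assert V1(0.99, 1) < V_rho(0.99, 1, 50)
assert V1(0.90, 1) > V_rho(0.90, 1, 50)
```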

If we consider a linear combination of both predictors, we have: $y^c_{T+h} = \alpha_0 + \alpha_1\tilde{y}_{T+h} + \alpha_2\hat{y}_{T+h}$. In order to avoid serial correlation in the forecast errors of the combination, the restriction $\alpha_1 + \alpha_2 = 1$ will be used (see Diebold, 1988). The unbiasedness is still guaranteed by the inclusion of the constant term $\alpha_0$. The proposed combined predictor, then, is:

$$y^c_{T+h} = \alpha_0 + \alpha_1\tilde{y}_{T+h} + (1 - \alpha_1)\hat{y}_{T+h} \qquad (14)$$

Let $\alpha_0^*$ and $\alpha_1^*$ denote the coefficients that minimize $E(y_{T+h} - y^c_{T+h})^2$. Then, it holds that:

$$\alpha_1^* = \frac{V_\rho - C_{1\rho} - B_\rho^2}{V_1 + V_\rho - 2C_{1\rho} - B_\rho^2}, \qquad \alpha_0^* = (1 - \alpha_1^*)B_\rho$$


where $C_{1\rho} = E[(y_{T+h} - \hat{\rho}^h y_T)(y_{T+h} - y_T)]$, and $B_\rho = E(y_{T+h} - \hat{\rho}^h y_T)$. The following propositions give an asymptotic approximation of $B_\rho$ and $C_{1\rho}$. All proofs are in the Appendix. These propositions make use of the results of Bhansali (1981). In order to apply Bhansali's results, the following assumptions should be made, where $\hat{\Gamma}_y = T^{-1}\sum_{t=2}^{T} y_{t-1}^2$ and $\|\cdot\|$ is the Euclidean norm:

A1: For some $v_0 > 2$, $E|a_t|^{v_0} < \infty$.

A2: $E\|\hat{\Gamma}_y^{-1}\|^{2k}$, $k = 1, 2, \ldots, k_0$, is bounded for all finite and sufficiently large $T$ and some $k_0$.

Proposition 1 Assume A1 for $v_0 = 4h$, where $h \ge 1$ is a prefixed integer, and A2. Let $y_t$ follow (11), then

$$B_\rho = E(y_{T+h} - \hat{\rho}^h y_T) = O(T^{-1})$$

Proposition 2 Assume A1 for $v_0 = 4h$, where $h \ge 1$ is a prefixed integer, and A2. Let $y_t$ follow (11). Then

$$C_{1\rho} = \sigma^2\left(\frac{1 - \rho^{2h}}{1 - \rho^2}\right) + O\{T^{-1}(\rho - 1)\} \qquad (15)$$

With these propositions, it can easily be seen that both $\alpha_1^*$ and $\alpha_0^*$ are $O(T^{-1})$. The MSPE of this optimal combination is

$$P^* = \alpha_1^{*2}V_1 + (1 - \alpha_1^*)^2 V_\rho + \alpha_0^{*2} + 2\alpha_1^*(1 - \alpha_1^*)C_{1\rho} - 2\alpha_0^*(1 - \alpha_1^*)B_\rho \qquad (16)$$

Since both $B_\rho$ and $\alpha_0^*$ are $O(T^{-1})$, it can be seen in (16) that the contribution of the constant term in the optimal combination is $O(T^{-2})$ and, therefore, negligible. Similarly, if $\rho$ is very close to unity, the remaining term in (15) can also be ignored. Then, the following forecast combination can finally be proposed for the near-non-stationary situation:

$$y^c_{T+h} = \alpha y_T + (1 - \alpha)\hat{\rho}^h y_T \qquad (17)$$

where

$$\alpha = \frac{h^2\rho^{2h-2}(1 - \rho^2)}{T(1 - \rho^h)^2 + h^2\rho^{2h-2}(1 - \rho^2)} + O\{T^{-1}(1 - \rho)\} \qquad (18)$$

Then, if $\rho$ is close to one, the remaining term in (18) will have a negligible effect. The combined predictor will be obtained replacing $\rho$ with the OLS estimate $\hat{\rho}$. Then,

$$\hat{y}^c_{T+h} = \hat{\alpha}y_T + (1 - \hat{\alpha})\hat{\rho}^h y_T \qquad (19)$$

where

$$\hat{\alpha} = \frac{h^2\hat{\rho}^{2h-2}(1 - \hat{\rho}^2)}{T(1 - \hat{\rho}^h)^2 + h^2\hat{\rho}^{2h-2}(1 - \hat{\rho}^2)} \qquad (20)$$

This coefficient $\hat{\alpha}$ has an intuitive interpretation. As seen in Figure 1, the values of $\hat{\alpha}$ as a function of $\hat{\rho}$ form a convex curve. Therefore, when $\hat{\rho}$ is very close to one, the weight of the overdifferenced predictor moves toward unity very quickly in order to benefit from its parsimonious representation. If $\hat{\rho}$ is not close enough to unity, the weight of the overdifferenced predictor is very low, to avoid inefficiency. Thus, the differenced predictor only has a significant weight when the process is very close to non-stationarity. For a given $\hat{\rho}$, the value of $\hat{\alpha}$ decreases with $T$, since the non-differenced predictor will be more accurate.
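The predictor (19)-(20) is straightforward to implement. The following is a minimal Python sketch for the zero-mean case (function and variable names are illustrative, not from the paper, which used Matlab):

```python
import numpy as np

def estimate_rho(y):
    """OLS estimate of rho in y_t = rho*y_{t-1} + a_t (no intercept)."""
    y = np.asarray(y, dtype=float)
    return (y[:-1] @ y[1:]) / (y[:-1] @ y[:-1])

def alpha_weight(rho_hat, T, h):
    """Combination weight alpha-hat of equation (20)."""
    num = h ** 2 * rho_hat ** (2 * h - 2) * (1.0 - rho_hat ** 2)
    return num / (T * (1.0 - rho_hat ** h) ** 2 + num)

def combined_forecast(y, h):
    """Equation (19): alpha*y_T + (1 - alpha)*rho_hat^h*y_T."""
    rho_hat = estimate_rho(y)
    a = alpha_weight(rho_hat, len(y), h)
    return a * y[-1] + (1.0 - a) * rho_hat ** h * y[-1]
```

The weight behaves as Figure 1 describes: for example, `alpha_weight(0.99, 50, 1)` is roughly 0.80 while `alpha_weight(0.90, 50, 1)` is roughly 0.28, and for a fixed $\hat{\rho}$ the weight falls as $T$ grows.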


[Figure 1. Values of $\hat{\alpha}$ (vertical axis) for given values of $\hat{\rho}$ (horizontal axis) and $h = 1$. Solid line: $T = 50$. Dashed line: $T = 100$]

The non-zero mean case
Let $y_t$ follow the process

$$y_t = c + \rho y_{t-1} + a_t \qquad (21)$$

where $|\rho| < 1$, and $a_t$ is a white noise. This process can also be written as $y_t = \mu(1 - \rho) + \rho y_{t-1} + a_t$, with $\mu = E(y_t)$. If the process is thought to have a unit root, the h-steps-ahead predictor, from observation $T$, will be $\tilde{y}_{T+h} = y_T$. It can be proved that the MSPE of this predictor is still $V_1$. If the process is thought to be stationary, the estimated predictor would be

$$\hat{y}^\mu_{T+h} = \hat{c}\,\frac{1 - \hat{\rho}^h}{1 - \hat{\rho}} + \hat{\rho}^h y_T$$

where $\hat{c}$ and $\hat{\rho}$ are the OLS estimators of $c$ and $\rho$ respectively. The MSPE of this predictor can be derived from the expressions in Kunitomo and Yamamoto (1985), yielding

$$V^\mu_\rho = \sigma^2\,\frac{1 - \rho^{2h}}{1 - \rho^2} + \frac{\sigma^2}{T}\left[h^2\rho^{2h-2} + \left(\frac{1 - \rho^h}{1 - \rho}\right)^2\right] + O(T^{-3/2})$$

The combined predictor would be:

$$y^c_{T+h} = \alpha^\mu_0 + \alpha^\mu_1\tilde{y}_{T+h} + (1 - \alpha^\mu_1)\hat{y}^\mu_{T+h}$$

The coefficients $\alpha^{\mu*}_0$ and $\alpha^{\mu*}_1$ that minimize $E(y_{T+h} - y^c_{T+h})^2$ are:

$$\alpha^{\mu*}_1 = \frac{V^\mu_\rho - C^\mu_{1\rho} - (B^\mu_\rho)^2}{V_1 + V^\mu_\rho - 2C^\mu_{1\rho} - (B^\mu_\rho)^2}, \qquad \alpha^{\mu*}_0 = (1 - \alpha^{\mu*}_1)B^\mu_\rho$$


where $C^\mu_{1\rho} = E[(y_{T+h} - \hat{y}^\mu_{T+h})(y_{T+h} - y_T)]$, and $B^\mu_\rho = E(y_{T+h} - \hat{y}^\mu_{T+h})$. Let us denote $y^*_{t-1} = y_{t-1} - \left(\sum_{t=2}^{T} y_{t-1}\right)/(T - 1)$, and $\hat{\Gamma}^*_y = \left(\sum_{t=2}^{T} y^{*2}_{t-1}\right)/T$, and let us make the following assumption:

A2′: $E\|\hat{\Gamma}^{*-1}_y\|^{2k}$, $k = 1, 2, \ldots, k_0$, is bounded for all finite and sufficiently large $T$ and some $k_0$.

Proposition 3 Assume A1 for $v_0 = 4h$, where $h \ge 1$ is a prefixed integer, and A2′. Let $y_t$ follow (21). Then

$$B^\mu_\rho = E(y_{T+h} - \hat{y}^\mu_{T+h}) = O(T^{-1})$$

When the distribution of $a_t$ is symmetric, some authors have proved that $B_\rho = 0$ and $B^\mu_\rho = 0$ (Malinvaud, 1970; Fuller and Hasza, 1980). Propositions 1 and 3 show, however, that symmetry is not necessary in order to obtain the required asymptotic simplifications.

Proposition 4 Assume A1 for $v_0 = 4h$, where $h \ge 1$ is a prefixed integer, and A2′. Let $y_t$ follow (21). Then

$$C^\mu_{1\rho} = \sigma^2\left(\frac{1 - \rho^{2h}}{1 - \rho^2}\right) + O\{T^{-1}(\rho - 1)\} \qquad (22)$$

With these propositions, it can be observed that the contribution of the constant term in the MSPE of the optimal combination is $O(T^{-2})$ and is therefore negligible. Furthermore, if $\rho$ is very close to unity, the remaining term in (22) can also be ignored. The following forecast combination can be proposed for the near-non-stationary situation: $\hat{y}^c_{T+h} = \hat{\alpha}^\mu y_T + (1 - \hat{\alpha}^\mu)\hat{y}^\mu_{T+h}$, with

$$\hat{\alpha}^\mu = \frac{\left\{h^2\hat{\rho}^{2h-2}(1 - \hat{\rho})^2 + (1 - \hat{\rho}^h)^2\right\}(1 + \hat{\rho})}{T(1 - \hat{\rho}^h)^2(1 - \hat{\rho}) + \left\{h^2\hat{\rho}^{2h-2}(1 - \hat{\rho})^2 + (1 - \hat{\rho}^h)^2\right\}(1 + \hat{\rho})} \qquad (23)$$

where $\hat{\rho}$ is obtained from the OLS estimation of model (21).
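The non-zero mean combination can be sketched in the same way (again the names are illustrative; $\hat{c}$ and $\hat{\rho}$ come from an OLS regression with an intercept):

```python
import numpy as np

def alpha_weight_mu(rho_hat, T, h):
    """Combination weight of equation (23) for the non-zero mean case."""
    g = (h ** 2 * rho_hat ** (2 * h - 2) * (1.0 - rho_hat) ** 2
         + (1.0 - rho_hat ** h) ** 2) * (1.0 + rho_hat)
    return g / (T * (1.0 - rho_hat ** h) ** 2 * (1.0 - rho_hat) + g)

def combined_forecast_mu(y, h):
    """alpha*y_T + (1 - alpha)*[c_hat*(1 - rho_hat^h)/(1 - rho_hat) + rho_hat^h*y_T]."""
    y = np.asarray(y, dtype=float)
    X = np.column_stack([np.ones(len(y) - 1), y[:-1]])
    c_hat, rho_hat = np.linalg.lstsq(X, y[1:], rcond=None)[0]
    ar = c_hat * (1.0 - rho_hat ** h) / (1.0 - rho_hat) + rho_hat ** h * y[-1]
    a = alpha_weight_mu(rho_hat, len(y), h)
    return a * y[-1] + (1.0 - a) * ar
```

As in the zero-mean case, the weight on the random-walk forecast rises steeply as $\hat{\rho}$ approaches one.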

Pretest differencing and optimal forecast combination
This section compares pretest differencing to the optimal combination predictor. The comparison is made by expressing the pretest predictor as a forecast combination where the weights depend on the result of the unit root test. By a comparison of the optimal weights and the average weights of pretest differencing, it can be seen that the pretest predictor can be very inefficient. For the sake of exposition, the AR(1) model with no mean will be assumed.

Let $x$ be a Bernoulli random variable with probability distribution $P(x = 1) = p$; $P(x = 0) = 1 - p$. Let $x = 1$ represent the non-rejection of a unit root in a realization of the process with a given non-stationarity test, and $x = 0$ the rejection of such a unit root. If the null hypothesis of a unit root is false, $1 - p = 1 - p(\rho)$ is the power function of the test. Given an observed series, the pretest differencing prediction $\hat{y}^p_{T+h}$ can easily be written as a forecast combination with weights $x$ and $1 - x$ since

$$\hat{y}^p_{T+h} = x y_T + (1 - x)\hat{\rho}^h y_T \qquad (24)$$

with equal weights at all horizons. It can be seen from (24) that the combination coefficients can be very different from the optimal ones in (18), since, given a time series, the pretest predictor puts the whole weight on only one predictor. Then, the pretest predictor will be optimal (1) if the time series contains a unit root and the test detects it, or (2) if it is white noise and the test rejects the unit root. The behaviour of the pretest predictor can also be evaluated with its average performance


for a given value of $\rho$. This can be seen using the power function of the unit root test, since the values of $p(\rho)$ and $1 - p(\rho)$ can be interpreted as the combination coefficients that, on average, the pretest differencing predictor uses. Then, for a given unit root test, the comparison of $p(\rho)$ with the optimal coefficient $\alpha$ in (18) will show the departure from optimality. Figure 2 shows the coefficients $\alpha$ in (18) (up to terms of small order) and the values of $p(\rho)$ of the point optimal invariant (POI) unit root test (Dufour and King, 1991; Hwang and Schmidt, 1996; Elliott, 1999) for $T = 100$. By the Neyman–Pearson lemma, the power function of the POI test is, under the assumption of normality, the highest attainable power by a unit root test. Any feasible unit root test, therefore, will have higher or similar values of $p(\rho)$. The values of $p(\rho)$ have been obtained by 100,000 Monte Carlo replications, with stationary initial conditions when $\rho < 1$. The critical values of the POI test, for each value of $\rho$, have also been obtained using a similar Monte Carlo experiment. As can be seen, the weights that the pretest differencing predictor assigns are, on average, clearly suboptimal, since it tends to give excessive weight to the differenced predictor.

COMPARING ACCURACY

In this section, a Monte Carlo experiment compares the accuracy of the proposed combination procedure with some other alternative methods. The competing forecasting methods, apart from the proposed optimal combination scheme, are: differencing, non-differencing, fractional differencing, pretest forecast, and other classical combination methods. The non-differenced predictor is the OLS estimation of the properly specified model without imposing any restriction on the parameter's value, whereas in the differenced predictor the unit root is imposed and there is no mean. The fractional

[Figure 2. Values of the optimal coefficient $\alpha$ (dashed line) and $p(\rho)$ (solid line) of the POI test at $h = 1$ for different values of $\rho$. $T = 100$. Model: $y_t = \rho y_{t-1} + a_t$]


differenced predictor has been built following the least squares procedure developed in Beran (1995). This procedure has the advantage of allowing for the estimation of values of the differencing parameter outside the stationary range. In the case of an AR(1) with zero mean, the estimated model is $(1 - B)^d y_t = a_t$. In the non-zero mean case, the estimated model is $(1 - B)^d (y_t - \mu) = a_t$. Predictions from this fractional differenced predictor have been generated by using an autoregressive approximation with all the available sample (Brockwell and Davis, 1991, p. 533).

The classical combination procedures included in this comparison are all based on the OLS estimation of model (2). At each lead time, the predictors to be combined are the differenced and the non-differenced one. The regression is estimated with the available sample and with different restrictions on the parameters. These restrictions are: (a) estimation of regression (2) with no restrictions; (b) estimation without constant term; (c) estimation with constant term but with the coefficients summing to unity ($\alpha_1 + \alpha_2 = 1$); and (d) estimation with all the restrictions (3). These classical combination procedures have behaved, in general, poorly in comparison to their competitors. This poor performance can be explained by the sampling variability of parameter estimation. Procedures (a), (b), and (c) were consistently much worse than procedure (d). Therefore, only the results of the fully restricted procedure (d) are reported. This combination method (d) can be seen as the sampling counterpart of the proposed optimal combination method, the only difference being the method of estimating the combination weight $\alpha$. The regression procedure (d) only obtains information from the sampling covariances of the data, whereas the proposed procedure also uses the information that the process is nearly non-stationary to improve the efficiency of the predictions.

The pretest predictor is based on the use of the POI test. This test has the maximum power attainable (under normality) by a unit root test. Therefore, although the POI test is infeasible (it needs a known fixed alternative), it can be used as an upper bound for pretest predictors. The power function depends on the deterministic components of the process under the stationary alternative. It also depends on the assumptions about the initial values under the alternative. In this work, the initial observations (when $\rho < 1$) will always be extracted from the unconditional distribution. These initial conditions will ensure that, under the stationary alternative, the process is covariance stationary. The power function has been obtained through a Monte Carlo experiment with 100,000 replications in order to obtain both the critical values at each value of $\rho$ and the empirical power. Feasible unit root tests that, under this assumption about the initial values, have a power function very close to the POI test (Gaussian power envelope) have been proposed by Pantula et al. (1994) and Elliott (1999). If the null hypothesis of a unit root is not rejected, the differenced predictor is applied. The non-differenced predictor is used if the unit root is rejected.

An important aspect of the experiment that should be highlighted here is the possibility of obtaining explosive estimated predictors ($|\hat{\rho}| > 1$). There are two reasons to recommend discarding these explosive situations in the simulations. First, their practical interest is rather limited, since situations in which practitioners have doubts about differencing deal mainly with estimated roots close to but lower than unity. Second, these explosive replications exert an excessive influence on the computations, because explosive estimated predictors are easily worse than most of their competitors. Very few explosive replications can have such a great influence that they can give an over-pessimistic view of the non-differenced predictor. In order to obtain a more realistic picture of what can be expected in a practical situation, only replications whose estimated roots were outside the unit circle have been considered.

The performance of all the predictors is measured as the empirical out-of-sample MSPE. In order to facilitate the comparisons, the non-differenced predictor has been used as a benchmark. The relative difference with this benchmark is then reported. For instance, in the case of the


differenced predictor, the expected gain (with respect to the non-differenced one) would be $G_1 = (V_\rho - V_1)/V_\rho$. Positive values of this relative difference will represent the expected gain (or loss, if it is negative) of differencing with respect to the non-differenced predictor. The expected gain of the fractional differenced predictor will be denoted by $G_f$. The expected gain for the proposed optimal combination will be $G_o$; and for the classical combination based on the fully restricted OLS estimation, the expected gain will be denoted by $G_c$. The expected gain for the pretest forecast will be denoted by $G_p$.

The following subsections show the empirical conditional expected gain (conditional on each value of $\rho$) in the AR(1) case and the average of the empirical expected gain over the values of $\rho$ considered. This average can be interpreted as an indicator of what can be expected in a real situation, where $\rho$ is unknown but is suspected to be near unity. Finally, results for the AR(2) and the ARMA(1,1) models are presented.

Conditional expected gain
This subsection reports the results of the Monte Carlo experiment in the AR(1) case for different values of the autoregressive parameter $\rho$. Sample sizes are $T = 50, 100$, and the values of $\rho$ are $\rho = 1, 0.98, 0.96, 0.94, 0.92, 0.90$. When $\rho < 1$, the model used to generate data is $y_t = \rho y_{t-1} + a_t$ in the zero mean case, and $y_t = 10(1 - \rho) + \rho y_{t-1} + a_t$ in the non-zero mean case. In each replication, a random sample of size $T + 15$ with random noise $a_t \sim N(0, 1)$ is generated. The initial value when $\rho < 1$ is $y_1 = a_1/\sqrt{1 - \rho^2}$. When $\rho = 1$, the starting value is $y_1 = a_1$. The model is estimated with the initial $T$ observations, and the last 15 values are used to evaluate the corresponding prediction errors. By averaging the squared prediction errors of 10,000 replications (with $|\hat{\rho}| < 1$), the sampling estimate of the MSPE of forecasting $y_{T+h}$ for the competing predictors is obtained. Then, the empirical expected gain of each procedure, with respect to the non-differenced one, is built. All computations have been done in Matlab.

Figures 3 and 4 show the empirical expected gain when $T = 50$, and Figures 5 and 6 when $T = 100$. The first conclusion that can be drawn from these figures is that there is no single method that outperforms all its competitors in all cases. As can be expected, the optimal predictor when $\rho = 1$ is the random walk. The fractional differenced predictor also has an excellent performance in the unit root case. The gains of these two predictors, however, decline rather quickly, finally becoming highly negative as $\rho$ moves away from unity. Both predictors are very inefficient if the process is not very close to the unit root. The classical combination predictor shows, in general, a very poor relative performance, especially at high values of $\rho$. The proposed combination method, however, maintains its good performance for all values of the root. The proposed predictor gives positive gains when $\rho$ is close to unity. When $\rho$ is far from the unit root, the proposed combination method still shows positive gains, or a negligible loss. It can be concluded that the estimated optimal weights of the proposed predictor adapt quite efficiently to the properties of the underlying process.

All procedures have better (relative) performance when there are deterministic components. This can be explained by the larger parsimony of the differenced predictor in that setting, since it has two parameters fewer than the non-differenced predictor. Similarly, the fractional differenced predictor has no mean if the estimated parameter is larger than $d = 0.5$. Therefore, it is also more parsimonious than the non-differenced predictor. The sample size affects both differenced and fractional differenced predictors negatively. Again, this fact can be explained by the effect of the estimation of the parameters: as the sample size increases, the non-differenced predictor is more accurate and more difficult for a misspecified model to outperform. The proposed combination

Copyright 2002 John Wiley & Sons, Ltd. J. Forecast. 21, 1–26 (2002)


Figure 3. Empirical expected gain of competing predictors, by forecast horizon (0 to 15); panels for $\rho = 1, 0.98, 0.96, 0.94, 0.92, 0.90$. Model: $y_t = \rho y_{t-1} + a_t$. Dotted line: $G_1$ (differenced pred.); solid line with '+' symbol: $G_p$ (pretest pred.); dash-dotted line: $G_f$ (fractional pred.); solid line: $G_o$ (proposed comb.); solid line with 'o' symbol: $G_c$ (classical comb.). Sample size $T = 50$


Figure 4. Empirical expected gain of competing predictors, by forecast horizon (0 to 15); panels for $\rho = 1, 0.98, 0.96, 0.94, 0.92, 0.90$. Model: $y_t = 10(1-\rho) + \rho y_{t-1} + a_t$. Dotted line: $G_1$ (differenced pred.); solid line with '+' symbol: $G_p$ (pretest pred.); dash-dotted line: $G_f$ (fractional pred.); solid line: $G_o$ (proposed comb.); solid line with 'o' symbol: $G_c$ (classical comb.). Sample size $T = 50$


Figure 5. Empirical expected gain of competing predictors, by forecast horizon (0 to 15); panels for $\rho = 1, 0.98, 0.96, 0.94, 0.92, 0.90$. Model: $y_t = \rho y_{t-1} + a_t$. Dotted line: $G_1$ (differenced pred.); solid line with '+' symbol: $G_p$ (pretest pred.); dash-dotted line: $G_f$ (fractional pred.); solid line: $G_o$ (proposed comb.); solid line with 'o' symbol: $G_c$ (classical comb.). Sample size $T = 100$


Figure 6. Empirical expected gain of competing predictors, by forecast horizon (0 to 15); panels for $\rho = 1, 0.98, 0.96, 0.94, 0.92, 0.90$. Model: $y_t = 10(1-\rho) + \rho y_{t-1} + a_t$. Dotted line: $G_1$ (differenced pred.); solid line with '+' symbol: $G_p$ (pretest pred.); dash-dotted line: $G_f$ (fractional pred.); solid line: $G_o$ (proposed comb.); solid line with 'o' symbol: $G_c$ (classical comb.). Sample size $T = 100$


predictor, however, incorporates the effect of the sample size in its definition, and its properties therefore do not change much with the sample size.

Regarding pretesting, it can be seen that it helps the differenced predictor to avoid high inefficiencies when the process is far from non-stationarity.

An important conclusion that can be drawn from Figures 3 to 6 is that, to better evaluate the performance of the competing predictors, it is necessary to balance the gains and losses of each procedure. This balance is carried out in the next subsection by averaging the expected gain over a range of values of $\rho$. This average then enables us to measure the overall performance.
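The balancing step is simply a pointwise average of the gain curves over the grid of $\rho$ values; a minimal sketch (the function name is ours):

```python
def average_gain(gains_by_rho):
    """Pointwise average, across the rho grid, of expected-gain curves.
    gains_by_rho: list of equal-length lists, one gain curve (indexed by
    forecast horizon) per value of rho, e.g. rho in {0.90, 0.92, ..., 1.00}."""
    n = len(gains_by_rho)
    horizons = len(gains_by_rho[0])
    return [sum(curve[h] for curve in gains_by_rho) / n for h in range(horizons)]
```

For example, `average_gain([[0.1, 0.0], [-0.1, 0.2]])` returns `[0.0, 0.1]`: a method with gains at some roots and losses at others is summarized by its net behaviour at each horizon.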

Average expected gain

Figure 7 shows the average expected gain over the values $\rho = 0.90, 0.92, 0.94, 0.96, 0.98$ and $1.00$. This average is a simple and transparent way of summarizing the expected performance of the predictors in a real situation, where the value of $\rho$ is unknown but is believed to be near unity. The range of values of $\rho$ has been selected because the probability of detecting a unit root in those

Figure 7. Average empirical expected gain of competing predictors, by forecast horizon (0 to 15). The average is over the values $\rho = 0.90, 0.92, 0.94, 0.96, 0.98, 1$. Panels: $T = 50$ non-zero mean; $T = 50$ zero mean; $T = 100$ non-zero mean; $T = 100$ zero mean. Dotted line: $G_1$ (differenced pred.); solid line with '+' symbol: $G_p$ (pretest pred.); dash-dotted line: $G_f$ (fractional pred.); solid line: $G_o$ (proposed comb.); solid line with 'o' symbol: $G_c$ (classical comb.)


cases is very high. According to the empirical power function of the POI test, the probability of detecting a unit root when $\rho = 0.90$ and there is a non-zero mean is around 50% if $T = 100$, and 80% if $T = 50$. A wider range of values would mainly penalize the differenced and fractional differenced predictors, since their inefficiency is very high when the process is far from the unit circle.

Conclusions about this averaged expected gain depend on the presence of deterministic components. In the zero-mean case, the best procedures are the non-differenced predictor and, closely behind it, the one proposed here. The remaining predictors perform much worse than these two. From a practical point of view, this means that, for forecasting purposes, when the process has zero mean, unless we are completely sure that the process is non-stationary or really close to non-stationarity ($\rho \approx 0.99$), we need not worry about the existence of a unit root and should estimate a properly specified model without imposing any restrictions. This situation applies when in doubt about a second difference, since the mean has already disappeared with the first difference. Therefore, for forecasting purposes, we should not care much about taking the second difference.

In the case of a non-zero mean, the best overall procedure is the proposed optimal combination method, except in the region of $T = 50$ and lead time $h < 5$, where the differenced predictor outperforms the one proposed. This advantage of the differenced predictor, however, is not very important, for two reasons. First, it decreases rather rapidly with larger horizons and larger sample sizes. Second, for $T = 50$ and $h < 5$, Figure 4 reveals that the average curve for the proposed predictor is the mean of a set of positive gains, whereas the differenced predictor averages positive and negative values. Therefore, although the average may be somewhat higher for the differenced predictor at short horizons and small sample sizes, there is a risk of falling into inefficiency, whereas with the combined predictor there is no such risk. With a non-zero mean, therefore, the proposed predictor is recommended.

The conclusion, therefore, is that if a process is nearly non-stationary with non-zero mean, the best predictor uses an efficient combination of the differenced and non-differenced predictors following the proposed procedure. Once again, in order to build an efficient predictor there is no need to worry about the result of a unit-root test.

Extension to AR(2) and ARMA(1,1)

In a general case, the predictors to combine will be based on the estimation of the corresponding ARMA model with and without the unit-root restriction. Once the predictions are obtained, they are combined using the same coefficients, (20) or (23), as in the AR(1) case. In this subsection, a simulation exercise is performed to empirically evaluate the robustness of this proposed combination. As previously shown, the effect of the remaining roots in a nearly non-stationary process may be marginal. Only in extreme cases where the remaining roots are also close to the unit circle can a significant effect be expected. From expressions (8) and (10) it can be seen that the potential effect of the remaining roots is mainly due to their size rather than to their number. The analysis of the AR(2) and the ARMA(1,1) can therefore illustrate the robustness of the optimal combined predictor. The models used in the experiment are $(1-\phi B)(1-\rho B)y_t = c + a_t$, with $\phi = -0.7, -0.5, 0.5, 0.7$ and $c = 10(1-\phi)(1-\rho)$; and $(1-\rho B)y_t = c + (1-\theta B)a_t$, with $\theta = -0.7, -0.5, 0.5, 0.7$ and $c = 10(1-\rho)$. In each model, the average expected gain is evaluated in the range $\rho = 0.90, 0.92, 0.94, 0.96, 0.98, 1.00$. Sample size is $T = 100$. Models are estimated by conditional least squares (LS). The number of replications, with estimated autoregressive roots outside the unit circle, is 10,000.

Copyright 2002 John Wiley & Sons, Ltd. J. Forecast. 21, 1–26 (2002)

Page 18: Efficient forecasting in nearly non-stationary processes

18 I. Sanchez

Two different procedures have been used to estimate the value of $\rho$ that enters the combining coefficient (23). The first procedure is the LS estimation of the coefficient $\rho$ from the non-differenced model. The second is the sample first-order autocorrelation. This second procedure is consistent with the idea of building the combined predictor ignoring the remaining roots. These procedures are compared with the differenced predictor and the non-differenced one. The performance of the predictors is evaluated, as in the previous subsection, as the average expected gain with respect to the non-differenced predictor. Figure 8 shows these average expected gains. The comparison of this figure with Figure 7 shows that the proposed combination method is very robust to these fairly unfavourable conditions. In general, the LS estimation of $\rho$ (solid line) has a better performance, and is more robust, than the sample first-order autocorrelation (dash-dotted line) and is, therefore, preferred.
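The two candidate estimates of $\rho$ can be sketched as follows. This is an illustration only: the function names are ours, and the LS fit is shown for a plain AR(1) with intercept rather than the full conditional-LS ARMA estimation used in the paper.

```python
def ls_rho(y):
    """LS slope of y_t on y_{t-1} with an intercept: the rho-hat taken from
    the non-differenced fit (AR(1) case shown for simplicity)."""
    x, z = y[:-1], y[1:]
    n = len(x)
    mx, mz = sum(x) / n, sum(z) / n
    num = sum((xi - mx) * (zi - mz) for xi, zi in zip(x, z))
    den = sum((xi - mx) ** 2 for xi in x)
    return num / den

def lag1_autocorr(y):
    """Sample first-order autocorrelation: one overall mean, and the sum of
    squared deviations over the whole sample in the denominator."""
    n = len(y)
    m = sum(y) / n
    num = sum((y[t] - m) * (y[t - 1] - m) for t in range(1, n))
    den = sum((yt - m) ** 2 for yt in y)
    return num / den
```

The two estimates can differ noticeably in small samples; for instance, for `y = [1, 2, 3, 4, 5]` the LS slope is exactly 1.0 while the lag-1 autocorrelation is 0.4, because the autocorrelation's denominator uses deviations over the full sample.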

REAL EXAMPLES

In this section, the proposed combination procedure is applied to two real examples. The first is the series of transportation of goods by train in Spain (millions of tons/km). The data are monthly and seasonally unadjusted, and range from 1982:01 to 1999:06. The series (with logarithmic transformation) is plotted in Figure 9. Three different models have been selected to fit the series:

$$(1-B)(1-B^{12})x_t = (1-\theta_1 B)(1-\theta_{12}B^{12})a_t \qquad (25)$$

$$(1-\rho B)\{(1-B^{12})x_t - \mu\} = (1-\theta_1 B)(1-\theta_{12}B^{12})a_t \qquad (26)$$

$$(1-B)^d\{(1-B^{12})x_t - \mu\} = (1-\theta_1 B)(1-\theta_{12}B^{12})a_t \qquad (27)$$

The main difference between these models is the approach adopted to parameterize the persistence. In model (25) a unit root is assumed. This model will be denoted as the differenced predictor. In model (26) the value of $\rho$ is estimated, and it is expected to be near unity. This model will be called the non-differenced predictor. In model (27) it is assumed that the process has long memory, and a parameter $d$ is estimated. This model will be called the fractional predictor. The prediction performance of these three models is also compared with the performance of the forecast combination model

$$x^c_{T+h} = \alpha\,\hat{x}^{(1)}_{T+h} + (1-\alpha)\,\hat{x}^{(0)}_{T+h} \qquad (28)$$

where $\hat{x}^{(1)}_{T+h}$ is the $h$-steps-ahead prediction from the differenced predictor (25) and $\hat{x}^{(0)}_{T+h}$ is the $h$-steps-ahead prediction generated by the non-differenced predictor (26). Two different procedures have been used to estimate the value of $\alpha$. The first procedure is the proposed optimal combination weight $\hat{\omega}$ shown in (23). This weight just uses the estimated parameter $\rho$ in equation (26). The second procedure is the OLS estimation of the model

$$x_{T+h} = \alpha\,\hat{x}^{(1)}_{T+h} + (1-\alpha)\,\hat{x}^{(0)}_{T+h} + e_t \qquad (29)$$
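The OLS weight in (29) reduces to a no-intercept regression of the forecast error of the non-differenced predictor on the difference between the two forecasts; a minimal sketch over stored past forecast pairs (function names and the zero-denominator fallback are ours):

```python
def ols_combination_weight(actuals, f_diff, f_nondiff):
    """Classical OLS estimate of alpha in eq. (29):
    x = alpha*f1 + (1-alpha)*f0 + e  <=>  (x - f0) = alpha*(f1 - f0) + e,
    a regression through the origin on past (actual, forecast, forecast) triples."""
    num = den = 0.0
    for x, f1, f0 in zip(actuals, f_diff, f_nondiff):
        d = f1 - f0
        num += d * (x - f0)
        den += d * d
    if den == 0.0:
        return 0.5  # forecasts coincide: any weight yields the same combination
    return num / den

def combine(alpha, f_diff, f_nondiff):
    """Combined forecast of eq. (28)."""
    return alpha * f_diff + (1.0 - alpha) * f_nondiff
```

As a sanity check, if the differenced forecasts exactly matched the past actuals while the non-differenced ones did not, the estimated weight is 1, putting all mass on the differenced predictor.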

In order to compare the forecasting performance of these five predictors, I have used the first 138 observations (approximately 65% of the series) for the estimation of the models, and the remaining 6 years to evaluate the empirical $h$-steps-ahead MSPE with all the available prediction errors. The estimation is, besides, recursive in the sense that all the models are re-estimated to include all past data prior to


Figure 8. Empirical expected gain, by forecast horizon (0 to 15). Panels: AR(2) with $\phi = 0.5, -0.5, 0.7, -0.7$; ARMA(1,1) with $\theta = 0.5, -0.5, 0.7, -0.7$. Dotted line: $\hat{G}_1$. Solid line: $\hat{G}_o$ estimated with LS estimation of $\rho$. Dash-dotted line: $\hat{G}_o$ estimated with the sample first-order autocorrelation. Sample size $T = 100$. Non-zero mean



Figure 9. Transportation of goods by train in Spain (with logarithmic transformation)

the forecast origin. Models (25) to (27) have been estimated using non-linear LS. The estimation of the fractional model (27) has been made following Beran (1995, p. 666). The series was previously analysed in SCA, where the models were identified and some outliers were also detected and taken into account in the models. All the remaining computations have been done in Matlab, where the lsqnonlin.m function was used to minimize the sum of squares. The estimation of model (25) using the whole sample, and taking into account the outliers, yielded the following results: $\hat{\theta}_1 = 0.56$, $\hat{\theta}_{12} = 0.66$. The estimated parameters of model (26) using the full sample are: $\hat{\rho} = 0.97$, $\hat{\theta}_1 = 0.54$, $\hat{\theta}_{12} = 0.65$. Similarly, the estimated parameters of model (27) with the whole sample are: $\hat{d} = 0.81$, $\hat{\theta}_1 = 0.39$, $\hat{\theta}_{12} = 0.62$.

Figure 10 shows the empirical out-of-sample MSPE for horizons $h = 1, \ldots, 12$. As in the previous section, the relative difference in MSPE with respect to the non-differenced predictor is reported, and it can be interpreted as the expected gain of each predictor with respect to the non-differenced one. From this figure, it can be concluded that the proposed combination method is the preferred predictor at horizons up to a year. The fractional predictor has a poor performance in the short term, but it improves its relative behaviour and becomes slightly better than the proposed method at $h = 12$. Although the classical combination method, obtained from equation (29), is based on the same combining scheme as the proposed method, its performance is worse because the proposed method uses the information more efficiently.

The second example is the chemical process concentration readings (series A) in Box and Jenkins (1976, p. 525). The series is plotted in Figure 11. Box and Jenkins (1976) suggested two alternative models for this series, namely, an ARMA(1,1) with non-zero mean (non-differenced predictor) and an IMA(1,1) with zero mean (differenced predictor). A fractional differenced model, ARFIMA(0, d, 1), will also be fitted to this series. The estimation of the ARMA(1,1) model with the full sample yielded the following estimates: $\hat{\rho} = 0.92$, $\hat{\theta} = 0.59$. For the IMA model,


Figure 10. Transportation of goods by train in Spain. Empirical expected gain, relative to the non-differenced predictor, of competing predictors (proposed combination, differenced predictor, classical combination, fractional predictor)


Figure 11. Chemical process concentration readings given in Box and Jenkins (series A)

the result is: $\hat{\theta} = 0.70$. Finally, the estimates of the fractional predictor with the whole sample are: $\hat{d} = 0.47$, $\hat{\theta} = 0.08$. This moving-average estimate is, however, not significant, and an ARFIMA(0, d, 0) has finally been used, with $\hat{d} = 0.41$. The proposed combination method and


the classical combination method, both based on expression (28), will also be used. The estimation of the out-of-sample MSPE of these five competing predictors has been made as in the previous series: the available data have been divided into estimation and prediction subsamples, and the estimation subsample increases recursively to include all past data prior to the forecast origin. The initial estimation subsample contains 80 observations. The empirical out-of-sample MSPE, relative to the non-differenced predictor, is shown in Figure 12. This figure reveals that the proposed combination procedure clearly outperforms both the non-differenced and the differenced predictor. It also surpasses the classical combination model at all horizons. At $h = 1$, the fractional differenced predictor has better performance than the proposed method. At longer horizons, however, the performance of the fractional predictor declines and the proposed method is preferred.

CONCLUSIONS

This paper illustrates a procedure for building an efficient predictor in a nearly non-stationary ARMA process. The method is based on the optimal linear combination of the differenced and the non-differenced predictor. Classical linear combination methods have a poor performance in this situation because of the sampling variability of the estimation. The proposed combination predictor, however, not only surpasses the classical combination procedures but also has a better overall performance than differencing, fractional differencing, or optimal pretest forecasts. The advantage of the proposed predictor comes from the efficient estimation of the optimal combination weights, since they efficiently incorporate the information that the process is nearly non-stationary. Specifically:

Figure 12. Chemical process concentration readings. Empirical expected gain, relative to the non-differenced predictor, of competing predictors (differenced predictor, proposed combination, classical combination, fractional predictor)


(1) The method uses only the information of the largest autoregressive root since, as stated in the paper, this is the root that mainly characterizes a nearly non-stationary process. This avoids the inclusion of nuisance parameters in the combination that would decrease the efficiency of the predictor.

(2) It uses the theoretical MSPE, at each lead time, of the competing predictors (differenced and non-differenced) in the definition of the optimal weights. In order to do this, several propositions have been proved to allow for asymptotic simplifications.

(3) It uses the same estimation of $\rho$ as the non-differenced predictor, so that no further estimation is needed, avoiding unnecessary sampling variability.

When the process has zero mean, only the non-differenced predictor has slightly better empirical performance than the proposed combination. In the more general non-zero-mean case, the proposed optimal combination is, on average, the more efficient procedure. Therefore, if the process has zero mean, the recommendation is to forecast with the non-differenced model. When there is a non-zero mean, the recommendation is to combine the differenced and non-differenced predictors with the proposed combination procedure.

It has been shown in this paper that pretest differencing is far from optimal and, therefore, is of little help in obtaining efficient predictions. As far as forecasting is concerned, we should not worry about the presence of a unit root (unless we are completely sure of its presence), since even the most powerful test does not outperform the proposed optimal combination method.

APPENDIX

Proof of Proposition 1
By Hölder's inequality, and applying that $E\|\hat{\rho}-\rho\|^{2k} = O(T^{-k})$ (Bhansali, 1981), it can be seen that $E\{(\rho-\hat{\rho})^2 y_T\} = O(T^{-1})$. Then, applying a Taylor expansion of $\hat{\rho}^h$ around $\rho^h$, it can be verified that $B(\rho) = E(y_{T+h} - \hat{\rho}^h y_T) = E\{(\rho^h-\hat{\rho}^h)y_T\} = O[E\{(\rho-\hat{\rho})y_T\}] + O(T^{-1})$, where $E\{(\hat{\rho}-\rho)y_T\} = E(y_T\hat{\vartheta}_0\hat{\vartheta}_y^{-1})$, with $\hat{\vartheta}_0 = T^{-1}\sum_{t=2}^T y_{t-1}a_t$ and $\hat{\vartheta}_y = T^{-1}\sum_{t=2}^T y_{t-1}^2$. Applying a Taylor expansion of $\hat{\vartheta}_y^{-1}$ around $\vartheta_y^{-1}$, and that $E\|\hat{\vartheta}_y-\vartheta_y\|^{2k} = O(T^{-k})$ (Bhansali, 1981), it can be obtained, using Hölder's inequality, that $E(y_T\hat{\vartheta}_0\hat{\vartheta}_y^{-1}) = \vartheta_y^{-1}E(y_T\hat{\vartheta}_0) - \vartheta_y^{-2}E\{y_T\hat{\vartheta}_0(\hat{\vartheta}_y-\vartheta_y)\} + O[E\{y_T\hat{\vartheta}_0(\hat{\vartheta}_y-\vartheta_y)^2\}]$, where it can be checked that $E(y_T\hat{\vartheta}_0) = 0$. The magnitude of $E\|\hat{\vartheta}_0\|^2$ satisfies

$$E\left\|T^{-1}\sum_{t=2}^T y_{t-1}a_t\right\|^2 = E\left(T^{-2}\sum_{t=2}^T y_{t-1}^2 a_t^2\right) = \frac{T-1}{T^2}\,\frac{\sigma^4}{1-\rho^2} = O(T^{-1}) \qquad (A1)$$

Then, $E\{y_T\hat{\vartheta}_0(\hat{\vartheta}_y-\vartheta_y)\} = O(T^{-1})$. Following the same arguments, $E\{y_T\hat{\vartheta}_0(\hat{\vartheta}_y-\vartheta_y)^2\} = O(T^{-3/2})$, and the proposition follows.
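One step used above, the claim that $E(y_T\hat{\vartheta}_0) = 0$, can be filled in as follows. This is our own filling-in of the omitted algebra for the zero-mean case, with $\hat{\vartheta}_0 = T^{-1}\sum_{t=2}^T y_{t-1}a_t$:

```latex
% Write y_T = \rho^{T-t} a_t + r_t, where r_t collects terms independent of a_t.
% Since a_t is also independent of y_{t-1}, and E(a_t) = 0,
E(y_T \hat{\vartheta}_0)
  = T^{-1}\sum_{t=2}^{T} E(y_T\, y_{t-1}\, a_t)
  = T^{-1}\sum_{t=2}^{T} \left\{ \rho^{T-t} E(a_t^2)\, E(y_{t-1})
      + E(r_t\, y_{t-1})\, E(a_t) \right\}
  = 0,
```

since $E(y_{t-1}) = 0$ in the zero-mean case and $E(a_t) = 0$.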

Proof of Proposition 2
After some algebra, it can be obtained that $C_1(\rho) = \sigma^2(1-\rho^2)^{-1}(1-\rho^{2h}) + (\rho^h-1)E\{(\rho^h-\hat{\rho}^h)y_T^2\}$. By the same arguments as in the proof of Proposition 1, $E\{(\rho^h-\hat{\rho}^h)y_T^2\} = O[E\{(\hat{\rho}-\rho)y_T^2\}] + O(T^{-1})$. Similarly, it can be written that $E(y_T^2\hat{\vartheta}_0\hat{\vartheta}_y^{-1}) = \vartheta_y^{-1}E(y_T^2\hat{\vartheta}_0) - \vartheta_y^{-2}E\{y_T^2\hat{\vartheta}_0(\hat{\vartheta}_y-\vartheta_y)\} + O[E\{y_T^2\hat{\vartheta}_0(\hat{\vartheta}_y-\vartheta_y)^2\}]$. The term $E(y_T^2\hat{\vartheta}_0)$ can be written as $E(y_T^2\hat{\vartheta}_0) = T^{-1}\sum_{t=2}^T E(y_{t-1}a_t y_T^2)$, where

$$E(y_{t-1}a_t y_T^2) = E\left\{\left(\sum_{j=0}^{\infty}\rho^j a_{t-1-j}\right)a_t\left(\sum_{i=0}^{\infty}\rho^i a_{T-i}\right)\left(\sum_{k=0}^{\infty}\rho^k a_{T-k}\right)\right\} = \frac{\sigma^4\rho^{2(T-t)+1}}{1-\rho^2}$$

Hence $E(y_T^2\hat{\vartheta}_0) = O(T^{-1})$. By (A1), $E\{y_T^2\hat{\vartheta}_0(\hat{\vartheta}_y-\vartheta_y)\} = O(T^{-1})$ and $E\{y_T^2\hat{\vartheta}_0(\hat{\vartheta}_y-\vartheta_y)^2\} = O(T^{-3/2})$, and the proposition follows.

Proof of Proposition 3
$B_\mu(\rho)$ can be written as $B_\mu(\rho) = E\left(c\sum_{j=0}^{h-1}\rho^j - \hat{c}\sum_{j=0}^{h-1}\hat{\rho}^j\right) + E\{(\rho^h-\hat{\rho}^h)y_T\}$. Using that $E\|\hat{c}-c\|^{2k} = O(T^{-k})$, $E\|\hat{\rho}-\rho\|^{2k} = O(T^{-k})$, $E(c-\hat{c}) = O(T^{-1})$ and $E(\rho-\hat{\rho}) = O(T^{-1})$, and using similar arguments to those used in Proposition 1, it can be verified that

$$E(c\rho^j - \hat{c}\hat{\rho}^j) = O(T^{-1}) \qquad (A2)$$

and, therefore, $E\left(c\sum_{j=0}^{h-1}\rho^j - \hat{c}\sum_{j=0}^{h-1}\hat{\rho}^j\right) = O(T^{-1})$. Similarly, $E\{(\rho^h-\hat{\rho}^h)y_T\} = O[E\{(\hat{\rho}-\rho)y_T\}] + O(T^{-1})$, where $E\{(\hat{\rho}-\rho)y_T\} = E(y_T\hat{\vartheta}_m\hat{\vartheta}_{\tilde{y}}^{-1})$, with $\tilde{y}_{t-1} = y_{t-1} - \hat{\mu}$; $\hat{\mu} = \left(\sum_{t=2}^T y_{t-1}\right)/(T-1)$; $\hat{\vartheta}_{\tilde{y}} = \left(\sum_{t=2}^T \tilde{y}_{t-1}^2\right)/T$; and $\hat{\vartheta}_m = \left(\sum_{t=2}^T \tilde{y}_{t-1}a_t\right)/T$. Then, it can be written that $E(y_T\hat{\vartheta}_m\hat{\vartheta}_{\tilde{y}}^{-1}) = \vartheta_{\tilde{y}}^{-1}E(y_T\hat{\vartheta}_m) - \vartheta_{\tilde{y}}^{-2}E\{y_T\hat{\vartheta}_m(\hat{\vartheta}_{\tilde{y}}-\vartheta_{\tilde{y}})\} + O[E\{y_T\hat{\vartheta}_m(\hat{\vartheta}_{\tilde{y}}-\vartheta_{\tilde{y}})^2\}]$. The term $E(y_T\hat{\vartheta}_m)$ can be decomposed as $E(y_T\hat{\vartheta}_m) = T^{-1}\sum_{t=2}^T E(a_t y_T\tilde{y}_{t-1}) = T^{-1}\sum_{t=2}^T\left\{E(a_t y_T y_{t-1}) - (T-1)^{-1}\sum_{j=2}^T E(a_t y_T y_{j-1})\right\}$. It can then be checked that $E(a_t y_T y_{t-1}) = E\left\{a_t\left(\mu+\sum_{i=0}^{\infty}\rho^i a_{T-i}\right)\left(\mu+\sum_{k=0}^{\infty}\rho^k a_{t-1-k}\right)\right\} = \mu\rho^{T-t}\sigma^2$, and then $T^{-1}\sum_{t=2}^T E(a_t y_T y_{t-1}) = O(T^{-1})$. Similarly,

$$\sum_{j=2}^T E(a_t y_T y_{j-1}) = \sum_{j=2}^t E(a_t y_T y_{j-1}) + \sum_{j=t+1}^T E(a_t y_T y_{j-1}) \qquad (A3)$$

The first term in (A3) verifies $\sum_{j=2}^t E(a_t y_T y_{j-1}) = \sum_{j=2}^t \mu\sigma^2\rho^{T-t} = (t-1)\mu\sigma^2\rho^{T-t}$. Then, $\{T(T-1)\}^{-1}\sum_{t=2}^T\sum_{j=2}^t E(a_t y_T y_{j-1}) = O(T^{-1})$. The second term in (A3) is $\sum_{j=t+1}^T E(a_t y_T y_{j-1}) = \mu\sigma^2 + (T-t)\mu\sigma^2\rho^{T-t} + \rho^{T-t}E(a_t^3)$. Then

$$\frac{\sum_{t=2}^T\sum_{j=t+1}^T E(a_t y_T y_{j-1})}{T(T-1)} = \frac{\mu\sigma^2}{T-1} + \mu\sigma^2\,\frac{\sum_{t=2}^T (T-t)\rho^{T-t}}{T(T-1)} + E(a_t^3)\,\frac{\sum_{t=2}^T \rho^{T-t}}{T(T-1)} = O(T^{-1})$$

Therefore $E(y_T\hat{\vartheta}_m) = O(T^{-1})$ and, by Hölder's inequality, $E\{y_T\hat{\vartheta}_m(\hat{\vartheta}_{\tilde{y}}-\vartheta_{\tilde{y}})\} = O(T^{-3/2})$. Similarly, $E\{y_T\hat{\vartheta}_m(\hat{\vartheta}_{\tilde{y}}-\vartheta_{\tilde{y}})^2\} = O(T^{-2})$, and then

$$E\{(\hat{\rho}-\rho)y_T\} = O(T^{-1}) \qquad (A4)$$

and, therefore, the proposition holds.


Proof of Proposition 4

$$C_{\mu 1}(\rho) = E\left[\left(c\sum_{j=0}^{h-1}\rho^j + \rho^h y_T + \sum_{j=0}^{h-1}\rho^j a_{T+h-j} - \hat{c}\sum_{j=0}^{h-1}\hat{\rho}^j - \hat{\rho}^h y_T\right)\left(c\sum_{j=0}^{h-1}\rho^j + \rho^h y_T + \sum_{j=0}^{h-1}\rho^j a_{T+h-j} - y_T\right)\right]$$

Let us denote $\beta = c\sum_{j=0}^{h-1}\rho^j$. Then

$$C_{\mu 1}(\rho) = \beta E(\beta-\hat{\beta}) + (\rho^h-1)E\{(\beta-\hat{\beta})y_T\} + \beta E\{(\rho^h-\hat{\rho}^h)y_T\} + (\rho^h-1)E\{(\rho^h-\hat{\rho}^h)y_T^2\} + \sigma^2\,\frac{1-\rho^{2h}}{1-\rho^2} \qquad (A5)$$

By (A2) it holds that $E(\beta-\hat{\beta}) = O(T^{-1})$. In addition, $c$ can be written as $c = \mu(1-\rho)$, where $E(y_t) = \mu$. Then, $\beta = \mu(1-\rho^h) = O(\rho-1)$. Therefore, $\beta E(\beta-\hat{\beta}) = O\{T^{-1}(\rho-1)\}$. To solve the second term of (A5), it can be verified that $O[E\{(\beta-\hat{\beta})y_T\}] = O[E\{(c\rho^j-\hat{c}\hat{\rho}^j)y_T\}]$, $j = 1, \ldots, h-1$. After some manipulations, it can be obtained that

$$E\{(c\rho^j-\hat{c}\hat{\rho}^j)y_T\} = \rho^j E\{(c-\hat{c})y_T\} - cE\{(\hat{\rho}^j-\rho^j)y_T\} - E\{(\hat{c}-c)(\hat{\rho}^j-\rho^j)y_T\} \qquad (A6)$$

The last term of (A6) verifies, by Hölder's inequality, that $E\{(\hat{c}-c)(\hat{\rho}^j-\rho^j)y_T\} = O(T^{-1})$. The term $E\{(c-\hat{c})y_T\}$ can be rearranged as $E\{(c-\hat{c})y_T\} = \mu^2(1-\rho) - E(\hat{c}y_T)$. Let us denote $\bar{y}_- = (T-1)^{-1}\left(\sum_{t=1}^{T-1}y_t\right)$ and $\bar{y}_+ = (T-1)^{-1}\left(\sum_{t=2}^{T}y_t\right)$. Then,

$$\hat{c} = \bar{y}_+ - \hat{\rho}\bar{y}_- = \bar{y}_-(1-\hat{\rho}) + (y_T-y_1)(T-1)^{-1}$$

Applying this decomposition, we obtain $E(\hat{c}y_T) = E\{(1-\hat{\rho})y_T\bar{y}_-\} + E\{y_T(y_T-y_1)(T-1)^{-1}\}$, where it can easily be verified that $E\{y_T(y_T-y_1)(T-1)^{-1}\} = O(T^{-1})$. Similarly, applying (A4), it holds that $E\{(1-\hat{\rho})y_T\bar{y}_-\} = (1-\rho)E(y_T\bar{y}_-) + O(T^{-1})$. Also, $E(y_T\bar{y}_-) = (T-1)^{-1}\sum_{t=1}^{T-1}\{\mu^2 + \rho^{T-t}\vartheta_y\} = \mu^2 + O(T^{-1})$ and, then, $E(\hat{c}y_T) = \mu^2(1-\rho) + O(T^{-1})$. Therefore, $E\{(c-\hat{c})y_T\} = O(T^{-1})$. To solve the second term in (A6), it can be applied that $O[E\{(\hat{\rho}^j-\rho^j)y_T\}] = O[E\{(\hat{\rho}-\rho)y_T\}]$. In the proof of Proposition 3 it has already been seen that $E\{(\hat{\rho}-\rho)y_T\} = O(T^{-1})$. Applying this result to (A6), it can be obtained that $(\rho^h-1)E\{(\beta-\hat{\beta})y_T\} = O\{T^{-1}(1-\rho)\}$. Similarly, the third term in (A5) verifies $\beta E\{(\rho^h-\hat{\rho}^h)y_T\} = O\{T^{-1}(1-\rho)\}$. To solve the fourth term in (A5), it can be applied, using (33) and Hölder's inequality, that $O[E\{(\rho^h-\hat{\rho}^h)y_T^2\}] = O(T^{-1})$. Then, $(\rho^h-1)E\{(\rho^h-\hat{\rho}^h)y_T^2\} = O\{T^{-1}(1-\rho)\}$, and the proposition follows.

ACKNOWLEDGEMENTS

The author would like to thank Laura Mayoral and Eva Senra for their help. He is also grateful to the participants of the NBER/NSF Time Series Seminar, Taiwan, 1999, and the XXV SEIO meeting


for helpful discussions and suggestions on this work. The author also thanks the referees and the editor for their valuable and constructive comments. Part of this research was conducted while the author was visiting the University of California, San Diego, and he is grateful to this institution. This research was supported in part by CICYT, grant PB96-0339. The usual disclaimer applies.

REFERENCES

Bates JM, Granger CWJ. 1969. The combination of forecasts. Operational Research Quarterly 20: 451–468.
Beran J. 1995. Maximum likelihood estimation of the differencing parameter for invertible short and long memory autoregressive integrated moving average models. Journal of the Royal Statistical Society Series B 57: 659–672.
Bhansali RJ. 1981. Effects of not knowing the order of an autoregressive process on the mean squared error of prediction-I. Journal of the American Statistical Association 76: 588–597.
Box GEP, Jenkins GM. 1976. Time Series Analysis, Forecasting and Control (2nd edn). Holden-Day: San Francisco, CA.
Brockwell PJ, Davis RA. 1991. Time Series: Theory and Methods (2nd edn). Springer-Verlag: New York.
Campbell JY, Perron P. 1991. Pitfalls and opportunities: what macroeconomists should know about unit roots. In NBER Macroeconomics Annual 1991. MIT Press: Cambridge, MA.
Clements MP, Hendry DF. 1998. Forecasting Economic Time Series. Cambridge University Press: New York.
Diebold FX. 1988. Serial correlation and the combination of forecasts. Journal of Business and Economic Statistics 6: 105–111.
Diebold FX, Kilian L. 2000. Unit-root tests are useful for selecting forecasting models. Journal of Business and Economic Statistics 18: 265–273.
Dufour JM, King ML. 1991. Optimal invariant tests for the autocorrelation coefficient in linear regressions with stationary or nonstationary AR(1) errors. Journal of Econometrics 47: 115–143.
Elliott G. 1999. Efficient tests for a unit root when the initial observation is drawn from its unconditional distribution. International Economic Review 40: 767–783.
Fuller WA, Hasza DP. 1980. Predictors for the first-order autoregressive process. Journal of Econometrics 13: 139–157.
Granger CWJ, Ramanathan R. 1984. Improved methods of combining forecasts. Journal of Forecasting 3: 197–204.
Hwang J, Schmidt P. 1996. Alternative methods of detrending and the power of unit root tests. Journal of Econometrics 71: 227–248.
Kunitomo N, Yamamoto T. 1985. Properties of predictors in misspecified autoregressive time series models. Journal of the American Statistical Association 80: 941–950.
Malinvaud E. 1970. Statistical Methods of Econometrics (2nd edn). North-Holland: Amsterdam.
Pantula SG, Gonzalez-Farias G, Fuller WA. 1994. A comparison of unit-root test criteria. Journal of Business and Economic Statistics 12: 449–459.
Phillips PCB. 1987. Towards a unified asymptotic theory for autoregression. Biometrika 74: 535–547.
Phillips RF. 1987. Composite forecasting. Journal of Business and Economic Statistics 5: 389–395.
Sanchez I, Pena D. 2001. Properties of predictors in overdifferenced nearly nonstationary autoregression. Journal of Time Series Analysis 22: 45–66.
Stock JH. 1996. VAR, error correction, and the pretest forecast at long horizons. Oxford Bulletin of Economics and Statistics 58: 685–701.

Author’s biography :Ismael Sanchez is Visiting Professor at Universidad Carlos III de Madrid from which he obtained his PhD.His main research interest areas are nearly non-stationary processes, unit roots, and diagnosis and modelselection in time series.

Author’s address :Ismael Sanchez, Departamento de Estadıstica y Econometrıa, Butarque, 15. 28911 Leganes. Madrid, Spain.
