

130

Reihe Ökonomie

Economics Series

Testing for Relative Predictive Accuracy:

A Critical Viewpoint

Robert M. Kunst

May 2003

Institut für Höhere Studien (IHS), Wien Institute for Advanced Studies, Vienna


Contact: Robert M. Kunst
Department of Economics and Finance
Institute for Advanced Studies
Stumpergasse 56
1060 Vienna, Austria
phone: +43/1/599 91-255
fax: +43/1/599 91-163
email: [email protected]

Founded in 1963 by two prominent Austrians living in exile – the sociologist Paul F. Lazarsfeld and the economist Oskar Morgenstern – with financial support from the Ford Foundation, the Austrian Federal Ministry of Education, and the City of Vienna, the Institute for Advanced Studies (IHS) is the first institution for postgraduate education and research in economics and the social sciences in Austria. The Economics Series presents research done at the Department of Economics and Finance and aims to share “work in progress” in a timely way before formal publication. As usual, authors bear full responsibility for the content of their contributions.


Abstract

Tests for relative predictive accuracy have become a widespread addendum to forecast comparisons. Many empirical research reports conclude that the difference between the entertained forecasting models is 'insignificant'. This paper collects arguments that cast doubt on the usefulness of relative predictive accuracy tests. The main point is not that test power is too low but that their application is conceptually mistaken. The features are highlighted by means of some Monte Carlo experiments for simple time-series decision problems.

Keywords: Information criteria, forecasting, hypothesis testing

JEL Classifications: C12, C32, C53


Contents

1 Introduction
2 The Diebold-Mariano test
3 Some basic properties of DM testing in action
3.1 The basic simulations
3.2 Some simulations with Diebold-Mariano testing
4 The dangers of double checks
5 Prediction evaluation as a model selection criterion
5.1 Some experiments with fixed coefficients
5.2 Some experiments with randomized coefficients
6 Summary and conclusion
References


1 Introduction

There has been too much formalism, tradition, and confusion that leads people to think that statistics and statistical science is mostly about testing uninteresting or trivial null hypotheses, whereas science is much more than this. We must move beyond the traditional testing-based thinking because it is so uninformative. [Burnham and Anderson, 2002, p. 42]

A decade ago, comparative studies of the predictive performance of time-series models were usually presented on the basis of lists of descriptive statistics such as mean squared errors or their ratios across models. The contribution of Diebold and Mariano (DM) has revolutionized this practice. In the late 1990s, hardly any forecasting study was published in a major academic journal without using their test or one of its later refinements. A typical view is the one expressed by Fildes and Stekler (2002, p. 439) in a recent survey paper: “Whatever benchmark is used in the evaluation of forecasts, the difference between the two sets of errors should be tested for statistical significance” [original italics]. The main aim of this paper is to express some caution regarding such “should be” prescriptions.

In short, the DM test is based on the following ideas. A forecaster, who does not know the data-generation mechanism of the given time-series data, entertains a set of models – using ‘model’ throughout in the sense of a parameterized collection of probability distributions – and compares their forecasting performance on a part of the sample, typically the most recent segment. A positive transform of the prediction errors serves as a moment or cost function. One of the entertained models is chosen as a baseline model. The forecaster considers the null hypothesis that a given model is unable to improve predictive performance relative to that baseline. Only if the DM statistic rejects this null hypothesis can the competing and more sophisticated model definitely be recommended.

Even at a first glance, some informal arguments can be raised against this testing strategy. Everyone who has worked in professional forecasting will know that the cost of using a more sophisticated model is small. Just to the contrary, administrative directors of forecasting institutions may actually prefer a sophisticated model over a simple one, as such a choice will improve the reputation of the institution. Therefore, the forecaster is not in the situation of classical hypothesis testing. There is no need to be conservative, and there is no coercive assignment of null models or null hypotheses. Rather, the forecaster is in a decision situation. The best workhorse among a group of models has to be selected. The appropriate statistical framework is not hypothesis testing but rather model selection. Appropriate methods for model selection can be found in information theory and AIC-type criteria, or in Bayesian posterior-odds analysis. These methods are tuned to make a specific selection from a finite set, while hypothesis testing implies an interval of ‘rejection failure’ within which some models cannot be ranked. Such a ‘demilitarized zone’ does not appear to be a useful innovation but rather constitutes a practical inconvenience. Particularly for DM testing, this interval appears to be rather wide in many applications, as will be demonstrated in this paper.

Moreover, while the DM test formally does not take refuge in the concept of a ‘true model’, its very null hypothesis reflects that concept. From a Bayesian viewpoint, classical hypothesis testing is justified only if the sharp or point null can be assigned a non-zero prior weight. However, the null hypothesis of the DM test is a priori improbable. Economic reality is usually seen as a complex, dynamically evolving structure whose hidden time-constant laws are almost impossible to retrieve, in the spirit of the ‘Haavelmo distribution’. Because none of the entertained forecasting models comes anywhere near this complexity, different models are just different approximations and imply different distributional properties of their prediction errors. Although certain moments may coincide across these distributions by pure chance, it is difficult to imagine assigning a non-zero prior weight to this event.

This paper first focuses on such theoretical aspects of significance tests for predictive accuracy. From a statistical viewpoint, it is quite difficult to justify their usage. However, it may still be convenient to apply the tests from an empirical perspective. The implicit strengthening of the prior weight on simple models could be favorable for model selection decisions in certain situations, even if it were untenable from a theoretical viewpoint. Monte Carlo simulations will serve to assess this argument. The simulations are based on simple time-series models and analyze the performance of model selection guided by DM tests in some nested and also non-nested situations. The truly empirically relevant case may be much more complex, as the typical forecaster uses a handful of simple ideas to model an economic reality far beyond the reach of parsimoniously parameterized structures. Nevertheless, the presented simulations point to some recurrent features. Firstly, the demilitarized zone generated by DM tests prevents any useful model choice even in situations where such a choice has quite clear support from the usual model selection statistics. Secondly, if the aim is selecting the true or even pseudo-true model structure, comparing measures of predictive accuracy is a poor substitute for using information criteria, whether DM tests are used or not. Thirdly, however, the situation is less clear if the aim of selecting the optimum prediction model replaces the quest for a true structure. Prediction criteria may be more likely to hit upon the optimum prediction model than information criteria, and, moreover, the simplicity bias that is characteristic of DM testing may even improve upon this choice.

As a bottom line, a final answer to the question of whether to use DM tests or not requires clearly stating the aims of modeling. It is indeed unfortunate that the econometric and the forecasting literature alike have done little to separate the targets of searching for ‘true’ models and of optimizing prediction, and have repeatedly tended to blur the distinction between these aims. This is evident from the exaggerated concern about ‘model misspecification’ in a forecasting situation – where misspecified models may yield excellent forecasts – and from statements like “it is our fervent belief that success in [the evaluation and improvement of forecasting performance] should also lead to an improved understanding of the economic process” [Fildes and Stekler, 2002, p. 462, original italics]. Taking into account that good forecasting models may be poor or incorrect descriptions of the data-generating mechanism – even in situations where such a data-generating mechanism exists – and vice versa suggests regarding such paradigms with the utmost caution.

The plan of the paper is as follows. Section 2 reviews the DM test statistic. Section 3 explores some basic properties of this statistic in small samples – throughout the paper, a sample size of n = 100 is used – against the backdrop of time-series model selection among AR(1), MA(1), and ARMA(1,1) structures when the true generating model is ARMA(1,1) with positive coefficients. Section 4 discusses some theoretical arguments against the usage of significance checks on model selection decisions. Section 5 explores the performance of predictive accuracy evaluations as model selection criteria, both from the viewpoint of finding true structures and from that of finding the best forecasting model. In Section 5.1, the standard ARMA(1,1) design is maintained, while Section 5.2 repeats the simulation experiments for designs with random coefficients. Section 6 concludes.

Here, an important note is in order. This paper does not aim at criticizing the literature for any lack of correctness, particularly not the work of DM and of Linhart. Neither does it focus on problems of the lack of test power in the considered procedures, particularly as various modifications of the original DM test have been suggested recently. The critique rather focuses on the methodological concept of the tests at a fundamental level. For a recent critique of DM test power, see Ashley (in press).

2 The Diebold-Mariano test

It is useful to review the original procedure as it was suggested by Diebold and Mariano (1995, DM). DM motivated their contribution in the following paragraph:

Given the obvious desirability of a formal statistical procedure for forecast-accuracy comparisons, one is struck by the casual manner in which such comparisons are typically carried out. The literature contains literally thousands of forecast-accuracy comparisons; almost without exception, point estimates of forecast accuracy are examined, with no attempt to assess their sampling uncertainty. On reflection, the reason for the casual approach is clear: Correlation of forecast errors across space and time, as well as several additional complications, makes formal comparison of forecast accuracy difficult.

While DM do not really specify the ‘additional complications’, some of these are outlined in the remainder of their paper, such as non-normality and small samples. In this paper, we contend that the ‘casual manner’ may be preferable to the ‘obviously desired’ testing approach, an argument that will be supported by some simulations.


Contrary to what is usually cited as ‘the DM test’, DM suggest various testing procedures with similar aims and subject them to some Monte Carlo comparisons. In its narrow sense, the DM statistic appears to be the statistic S1, which is introduced as an ‘asymptotic test’ for the null hypothesis that E(dt) = 0, where dt = g(ejt) − g(eit), with eit and ejt the forecast errors from using two different forecasting procedures indexed i and j. The function g(.) is a loss function. Although DM do not rigorously specify the properties of g(.), it is reasonable to assume that g(.) is non-negative and that g(x) = 0 only for x = 0. Typical loss functions are g(x) = |x| for mean absolute errors (MAE) and g(x) = x² for mean squared errors (MSE). It may be useful to also consider certain cases of asymmetric or bounded loss functions, though some monotonicity in the form of g(x) ≥ g(y) for x > y > 0 or x < y < 0 may be reasonable.

With these definitions, DM consider the statistic d̄, defined as the time average over a sample of dt, t = 1, ..., n, that is, d̄ = n⁻¹ Σ dt. It is easily shown that the standardized statistic

\[
S_1 = \frac{\bar d}{\sqrt{n^{-1}\, 2\pi \hat f_d(0)}} \qquad (1)
\]

converges to a standard normal distribution for n → ∞. The element in the denominator, f̂d(0), is a consistent estimator of the spectral density of dt at frequency 0, such as

\[
\hat f_d(0) = (2\pi)^{-1} \sum_{k=-(n-1)}^{n-1} w\!\left(\frac{k}{S(n)}\right) \hat\gamma_d(k), \qquad (2)
\]

with the lag window function w(.), the truncation lag S(n), and the estimated autocovariance function γ̂d(.). Typical choices for w(.) would be the Bartlett window or the rectangular window. While DM favor the rectangular window, the Bartlett window appears to be the better choice for small n, as it does not force the choice of a very small value for S(n).

Apart from the S1 statistic, DM also consider some non-parametric tests, in particular for the case where ‘only a few forecast-error observations are available’, which may be the empirically relevant case. Apart from rank tests and sign tests, they review a test following Meese and Rogoff (1988), which checks the correlation of eit − ejt and eit + ejt and whose idea may also be generalized to the loss-function framework of the S1 construction. In the remainder of this paper, the standard normal test based on the statistic S1 will be regarded as ‘the DM test’, with w(.) specified as the Bartlett window.
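To make the construction concrete, the following minimal Python sketch computes the S1 statistic of (1)–(2) from two series of forecast errors, using a Bartlett lag window. The function name, the default truncation lag, and the particular Bartlett weighting 1 − k/(S(n)+1) are illustrative assumptions made here; the paper itself refers to a GAUSS routine for its computations.

import numpy as np

def dm_statistic(e_i, e_j, loss=np.abs, trunc=2):
    # S1 from (1): d_t = g(e_jt) - g(e_it), standardized by an estimate of
    # 2*pi*f_d(0) built from sample autocovariances with Bartlett weights (2)
    d = loss(np.asarray(e_j, dtype=float)) - loss(np.asarray(e_i, dtype=float))
    n = d.size
    d_bar = d.mean()
    dc = d - d_bar
    gamma = np.array([np.dot(dc[k:], dc[:n - k]) / n for k in range(trunc + 1)])
    bartlett = 1.0 - np.arange(1, trunc + 1) / (trunc + 1.0)
    lrv = gamma[0] + 2.0 * np.dot(bartlett, gamma[1:])
    return d_bar / np.sqrt(lrv / n)

# usage: reject 'equal predictive accuracy' at the two-sided asymptotic 5% level
rng = np.random.default_rng(0)
e_i, e_j = rng.normal(size=10), rng.normal(size=10)  # stand-ins for forecast errors
s1 = dm_statistic(e_i, e_j, loss=np.abs)             # MAE-type loss
print(s1, abs(s1) > 1.96)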

3 Some basic properties of DM testing in action

3.1 The basic simulations

Some simulations may serve to highlight the main features at stake. The predictive performance of some rival models is evaluated, some of which are ‘misspecified’. For the baseline simulations, 1000 replications of time series of length 210 are generated. The data-generating process is a simple ARMA(1,1) process

\[
x_t = \phi x_{t-1} + \varepsilon_t + \theta \varepsilon_{t-1}, \qquad (3)
\]

with εt independently drawn from an N(0, 100) distribution. The coefficients φ and θ are varied over the positive unit interval [0, 1] in steps of 0.05. The first 100 observations of each trajectory are discarded. Simple time-series models, such as AR(1), MA(1), and the ‘true’ model ARMA(1,1), are fitted to the observations 101 to 200 in order to provide parameter estimates θ̂ and φ̂. These parameter estimates and the observations 101 to 200 are then used to generate a forecast for observation 201. The prediction error is stored. This out-of-sample forecasting is repeated for the estimation range 102 to 201 to predict observation 202, and another prediction error is stored. This rolling out-of-sample forecasting is repeated until the end of the generated sample is reached. Rolling out-of-sample forecasting is a widespread technique in the econometric literature (see, e.g., Tashman, 2000). Thus, each model generates 10,000 prediction errors for each fixed parameter pair (φ, θ), from which moment statistics are calculated, such as the mean absolute error (MAE) or the mean squared error (MSE). The reported results focus on the MAE. Note that all parameter estimates rely on exactly 100 observations.
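As a rough sketch of this design, the following Python fragment generates one ARMA(1,1) trajectory as in (3) and produces the ten rolling one-step forecast errors for a chosen model order. Using statsmodels' ARIMA as the maximum-likelihood fitting routine is an assumption made here for convenience; the paper itself refers to a GAUSS routine for the maximum-likelihood estimation.

import numpy as np
from statsmodels.tsa.arima.model import ARIMA  # assumed ML estimation routine

def simulate_arma11(phi, theta, n=210, burn=100, sd=10.0, rng=None):
    # x_t = phi*x_{t-1} + eps_t + theta*eps_{t-1}, eps_t ~ N(0, 100);
    # the first 'burn' observations are discarded
    rng = rng or np.random.default_rng()
    eps = rng.normal(0.0, sd, n + 1)
    x = np.zeros(n + 1)
    for t in range(1, n + 1):
        x[t] = phi * x[t - 1] + eps[t] + theta * eps[t - 1]
    return x[burn + 1:]  # 110 observations remain

def rolling_one_step_errors(x, order, n_est=100, n_fc=10):
    # re-estimate on a 100-observation window, predict the next observation,
    # store the error, then shift the window by one
    errors = np.empty(n_fc)
    for j in range(n_fc):
        fit = ARIMA(x[j:j + n_est], order=order, trend="n").fit()
        errors[j] = x[j + n_est] - fit.forecast(1)[0]
    return errors

x = simulate_arma11(0.5, 0.5, rng=np.random.default_rng(1))
e_ar = rolling_one_step_errors(x, order=(1, 0, 0))    # AR(1)
e_ma = rolling_one_step_errors(x, order=(0, 0, 1))    # MA(1)
e_arma = rolling_one_step_errors(x, order=(1, 0, 1))  # ARMA(1,1)

Repeating this over 1,000 replications per (φ, θ) grid point and averaging |e| or e² yields MAE and MSE surfaces of the kind shown in Figures 1–3; feeding the error pairs into the S1 statistic sketched in Section 2 yields the rejection frequencies discussed in Section 3.2.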

This experiment is not new and presumably has been reported by other authors. However, it will serve as an instructive background. It is an experiment in the classical statistics framework, as all simulations are conducted conditional on fixed and true parameter values. Figure 1 shows the difference MAE(MA) − MAE(AR) in the form of a contour plot. Unless φ = 0 or θ = 0, both forecasting models are ‘misspecified’. It is seen that the MA forecast is preferable because of its smaller MAE for the MA processes, but also for AR processes with φ < 0.15 and for some mixed processes with θ > φ. In the area to the right, AR models yield the more precise prediction. The figure allows crude recommendations to empirical forecasters who, for technical reasons, do not want to use mixed models for forecasting. For example, if the data support an ARMA(0.8, 0.8) model, autoregressive forecasts are preferable to moving-average forecasts.

A different picture emerges when the MA(1) forecast model is compared to an ARMA(1,1) model. The ARMA(1,1) is the true model for all trajectories and should dominate the MA(1) model asymptotically for all parameter values, excepting φ = 0. Figure 2 shows that for φ < 0.2, the MA(1) model yields better forecasts if these are assessed by the MAE. For larger φ, the precision of MA forecasts deteriorates quickly, which is indicated by the steepness of the slope. Contour curves are shown in the range [−1, 1]. Even in the area to the left of the MAE(MA) = MAE(ARMA) curve, the difference in accuracy never exceeds 0.1; hence the loss to the forecaster from using the ARMA model in all cases appears to be small. By contrast, the MA model yields relatively poor predictions for (φ, θ) = (0.6, 0.9), where the MA forecast still dominates the AR forecast, as can be seen from Figure 1.


Figure 1: MAE for the MA(1) forecast model minus MAE for the AR(1) forecast model when the true model is ARMA(φ, θ). Sample size is 100.


Figure 2: MAE for the MA(1) forecast model minus MAE for the ARMA(1,1) forecast model when the true model is ARMA(φ, θ). Sample size is 100.


The counterpart for the AR(1) versus the ARMA(1,1) model is shown in Figure 3. Its symmetry to Figure 2 is surprisingly exact, although the increase of the MAE of the AR(1) forecast as θ rises is markedly slower than that of the MA(1) forecast as φ rises. The values for (φ, θ) = (1, 1) should be ignored, as they may only reflect numerical instabilities of the maximum-likelihood estimation in the GAUSS routine. The three figures can be superimposed and demonstrate that the ARMA(1,1) model is to be preferred roughly whenever φ > 0.2 and θ > 0.2. In the band close to the axes, the simpler models AR(1) and MA(1) dominate, with an asymmetric preference toward MA(1) in the southwest corner. The collected evidence from the simulations allows useful empirical guidelines for the forecasting practitioner.

Figure 3: MAE for the ARMA(1,1) forecast model minus MAE for the AR(1) forecast model when the true model is ARMA(φ, θ). Sample size is 100.

3.2 Some simulations with Diebold-Mariano testing

For these simulations, the data-generating process is identical to that used in the previous subsection. After estimation and forecasting, i.e., after conducting the ten one-step predictions, a DM test statistic is calculated and compared to the two-sided 0.05 asymptotic significance point, i.e., the 0.975 fractile of the standard normal distribution. Like all significance levels in frequentist testing, this choice is debatable. The intention was to impose a relatively strict significance level in order to underscore the effects without biasing the results unduly, as would, for example, a 0.01 level. 1,000 replications allow calculating the frequency of rejections and non-rejections of the null hypothesis that E(|e1,t| − |e2,t|) = 0. Figure 4 shows that the frequency of the finding that autoregressive forecasts are ‘significantly’ better than moving-average forecasts exceeds 0.5 only if φ > 0.9, whereas moving-average forecasts only achieve a rejection frequency of 0.4 in the extreme north-west corner. In other words, for all ARMA processes except the nearly integrated ones, the researcher will find on average that there is no significant difference between the two forecasting models. Assuming a uniform prior distribution over the unit square in the Bayesian style, one may even conclude that the researcher will be unable to make a decision in 90% of the possible cases.

Figure 4: Frequency of a rejection of the DM test for MA(1) versus AR(1) predictions, if the true model is ARMA(1,1). 1,000 replications of series of length 100, with 10 single-step predictions for each trajectory.

Figure 5 shows a comparable experiment for the two rival models AR(1) and ARMA(1,1). Generally, the difference between the two forecasting models is ‘insignificant’, with rejection rates of the DM statistic ranging from 0.2 to 0.5.


The null hypothesis of equal forecasting precision is also rejected in at least 20% of the simulated cases even for autoregressive models with θ = 0, where the AR forecasts are definitely more efficient, as no additional parameter is estimated. On the other hand, the ARMA forecast is ‘significantly better’ only for φ > 0.3 and θ > 0.9. In other words, the DM test is unable to recommend the ARMA model for most parameter constellations, even when the generating model has a substantial moving-average component.

Figure 5: Frequency of a rejection of the DM test for AR(1) versus ARMA(1,1) predictions, if the true model is ARMA(1,1). 1,000 replications of series of length 100, with 10 single-step predictions for each trajectory.

Finally, Figure 6 shows the experiment for the rival models MA(1) and ARMA(1,1). This is a nested situation, the nesting model being the true one for all cases excepting φ = 0. Only for φ > 0.9 does the DM test imply a decision in favor of the ARMA(1,1) model with a probability of more than 0.5. For φ < 0.9, the incorrect parsimonious model and the true structure yield forecasts of comparable quality, at least if one believes the results of the DM test. Consulting Figure 2, it is seen that the true model indeed gives better forecasts even for φ > 0.25.


Figure 6: Frequency of a rejection of the DM test for MA(1) versus ARMA(1,1) predictions, if the true model is ARMA(1,1). 1,000 replications of series of length 100, with 10 single-step predictions for each trajectory.


The experiments of this section demonstrate that the DM test can be a quite useless instrument if a decision concerning the forecasting model is sought. This conclusion is valid for nested situations as well as for the more realistic non-nested cases. Such experiments could be extended in several directions. Firstly, the training interval could be extended and the estimation interval shortened; a certain ratio of these lengths will optimize the accuracy of the decision even if it is based on the DM test only. Secondly, noting that the informative Figures 1–3 are obtained from decisions based on 1,000 replications while the non-informative Figures 4–6 reflect decisions based on single trajectories, it may be interesting to assess the relative merits of extending the sample length as compared to sampling additional trajectories. It may be conjectured that sampling additional trajectories is the more promising route. In practice, the relative merits of a certain forecasting model can only be assessed by considering a large number of comparable data sets. Such gathering of cases cannot be substituted by additional sophistication within one case. For example, the question whether model A or B is preferable for forecasting gross industrial output will remain undecided if DM tests are used, and it may lead to an erroneous conclusion if descriptive horse races are evaluated. Repeating such descriptive horse races for a large number of comparable economies and averaging or summarizing the results will enable a more accurate decision; this collection of parallel evidence, however, does not appear to be encouraged by a focus on DM statistics for single cases.

It is worth noting that the performance of the DM-based model selection procedure can be improved by changing the significance level from 0.05 to 0.1 or even 0.2. Because standard hypothesis testing gives no prescription on the way that significance levels should depend on the sample size, such amendments remain arbitrary. The original contribution of DM focuses exclusively on the significance level of 0.1, with prediction-sample sizes ranging from 8 to 512, thus containing the value of 10 that is used in this paper.

4 The dangers of double checks

The practical utility of hypothesis testing procedures is of limited value in model identification. [Akaike 1981, p. 722]

The contributions of Akaike (1974) and Schwarz (1978) started a revolution in time-series model selection. Rival methods, such as the visual inspection of sample moment functions recommended by Box and Jenkins (1976) and later by Tiao and Tsay (1984), and nested sequences of hypothesis tests, lost the market of core time-series analysis to the information-criterion (IC) approach. It is interesting that empirical economics has remained largely in a pre-revolutionary stage, with a preference for classical testing within the Neyman-Pearson framework. The hidden controversy can be seen from comparing current standard textbooks on time series, such as Brockwell and Davis (2002), the attempted synthesis in an introductory textbook on econometrics by Ramanathan (2002), the slightly misplaced comments on IC in an advanced econometrics textbook by Greene (1990), and the radical omission of IC in an economic time-series book by Hamilton (1994). On the battlefield of this undeclared warfare, it may well happen that an empirical paper gets rejected by a statistical journal due to the non-usage of IC, while adding them in a revision and submitting the manuscript to an economic journal will cause another rejection, as the information criteria ‘although intuitively appealing, ... have no firm basis in theory’ (Greene, p. 245). Although economists may not be aware of this background, the warfare on IC versus significance testing has a long history in the statistical discipline. It is a new version of the fight between the Bayesian and frequentist paradigms.

In the frequentist paradigm, methods to tackle the finite-action problem directly have never been developed. Rather, model selection was embedded in the framework of testing a null hypothesis against an alternative. Because hypothesis testing is geared to binary and nested decisions, it can be applied to finite action only by means of sequences of test decisions and of artificial nesting. A single element in a frequentist testing chain mainly relies on the calculation of a measure of divergence between likelihoods, which are maximized for the two rival hypotheses, and on the evaluation of the tail properties of the distribution of the divergence statistic, assuming the ‘restricted’ model to be correct. As in many frequentist tests, the significance level, i.e., the quantile of the distribution, remains arbitrary. As in all sequential tests, the arbitrariness of this significance level plays a crucial role and severely affects the final outcome of the test chain. The frequentist reluctance to elicit prior distributions on parameter spaces prevents any systematic optimization of significance levels, beyond problematic aims such as ‘keeping the overall significance level at 5%’.

Even in classical texts on regression analysis, purely descriptive statistics are reported side by side with test statistics for significance assessment. For example, the ‘regression F’ is viewed as a test statistic for checking the null hypothesis that all regression coefficients are zero, while the ‘coefficient of determination’ R² is viewed as a descriptive statistic, even though the two statistics are linked by a one-one transform. In its descriptive interpretation, the R² commonly serves to informally assess the descriptive quality of a model and to compare rival linear regression models. Note that comparative statistics, such as ratios of MSE or MAE across models, perform a similar function in forecasting experiments, or at least did so before DM tests were introduced.
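For concreteness, the one-one link mentioned above is, for a linear regression with an intercept, k slope coefficients, and n observations,

\[
F = \frac{R^2 / k}{(1 - R^2)/(n - k - 1)},
\]

so that, for fixed k and n, ranking models by the ‘descriptive’ R² and ranking them by the ‘inferential’ regression F amount to the same operation.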

The large-sample properties of model selection on the basis of the R² and even of its adjusted version due to Wherry-Theil are well known and generally discourage its usage for this purpose. By contrast, the optimization of IC variants was shown to lead to true structures (BIC) or otherwise preferable models (AIC, FPE). Like the R², IC statistics can be used for all model comparisons, including the situation of non-nested rival models. Contrary to a hypothesis-testing decision, the IC decision is discrete and apparently non-stochastic, as preference corresponds to a simple operation of binary mathematical operators such as < and >. Similarly, the implied MSE can be compared across forecasting models by a simple and exact comparison. This correspondence is not a mere coincidence. Roughly, optimization of the out-of-sample MSE approximates the optimization of the AIC, as can be seen from the original derivation of the AIC criterion.

As a statistic calculated from data that are interpreted as realizations of random variables, the AIC has a statistical distribution that depends on the data-generating process. Therefore, the ‘significance’ of an AIC decision can be derived in principle. This approach was considered by Linhart (1988). This and similar contributions were generally ignored by the time-series literature, and with good reason. By construction, the AIC decision removes the frequentist emphasis on a significance level and thus achieves an automatic and implicit adjustment of that significance level.

In a sense, the contribution by Diebold and Mariano (1995) is the equivalent of the one by Linhart (1988). Where Linhart implicitly imposes a secondary significance test on a primary decision according to the AIC, Diebold and Mariano impose a secondary significance test on a primary decision according to a summary measure of prediction errors. Unlike Linhart’s idea, the suggestion by Diebold and Mariano was welcomed immediately by the forecasting research community and was widely applied. This remarkable divergence in reception by the academic community can tentatively be explained by two main arguments: firstly, the evaluation of prediction errors was not recognized as a decision procedure; secondly, the dominance of frequentist over Bayesian statistics is more pronounced in econometrics than in other fields of statistical application. The analogy between predictive evaluation and AIC evaluation will be demonstrated by some more simulation experiments in the next section.

A perhaps too simplistic example of a possible ‘double check’ is the following. Suppose somebody wishes to test the null hypothesis H0: µ = 0 against the alternative HA: µ ≠ 0 in a sample of Gaussian random observations with known unit variance. At a significance level of 0.05, the decision is usually based on the statistic S0(n) = n^{1/2} x̄, where x̄ denotes the sample mean and n is the sample size. For |S0(n)| > 1.96 the null hypothesis is rejected; otherwise H0 is retained. This decision is based on the property that S0(n) converges in distribution to the N(0, 1) law. However, for a given infinite random realization, S0(n) converges to a limit point that can be depicted as the integral over a Brownian motion trajectory. One may then be interested in whether this limit point S0(∞) = lim_{n→∞} n^{1/2} x̄ = lim_{n→∞} S0(n) justifies a rejection. In this sense, the null hypothesis of interest will be H̃0: |S0(∞)| < 1.96 or, in words, that ‘µ is insignificantly different from 0’. Because S0(n) shows some sampling variation around its limit S0(∞) for finite n, a much higher value than 1.96 will be needed to reject H̃0. The testing hierarchy can even be continued. Of course, the results of the secondary test on H̃0 are altogether uninteresting and contradict basic statistical principles. However, that secondary test is a simple analogue of Linhart’s test on the AIC and of the DM test on the significance of relative predictive accuracy.

To justify double-checking, it is argued in the literature that the researcher must safeguard against unnecessary complexity. A quote from Linhart (1988) is revealing: “Model 7 is less complex and one could be of the opinion that it should be preferred to model 2 unless the hypothesis that model 2 is worse can be rejected at a small significance level.” Indeed, this is exactly what information criteria are designed for, as they penalize the more complex model 2 such that it has to beat model 7 convincingly in order to be chosen. Linhart’s test is unable to provide any new evidence in favor of either model 2 or model 7. It can only implicitly increase the penalty for complexity. In other words, the Linhart test counteracts the tendency of the AIC to prefer complex models too often. It is not known whether the combined AIC-Linhart procedure achieves the full consistency of Schwarz’ BIC for chains of tests in pure autoregressions. In small samples, it appears that AIC-Linhart is too conservative and chooses overly parsimonious models.

A similar observation may hold with respect to the DM test and model selection by forecasting accuracy. Like the AIC, out-of-sample forecasting accuracy criteria contain a penalty for spurious complexity that is typically so high that the true model class is ruled out due to sampling variation in parameter estimation, as was demonstrated in Figures 1–6. The only effect that can be achieved by a double-check test is to bias selection even more, and probably unduly, in the direction of the simple structures. This undue bias implies just what the forecaster wants to avoid: models that improve forecasting accuracy are systematically discarded.

5 Prediction evaluation as a model selection criterion

5.1 Some experiments with fixed coefficients

Two maps are generated from a simulation design that is similar to the one of Section 3. Data are generated from ARMA models with 100+100+10 observations. 1000 replications are conducted. Contrary to Figures 1–3, the relative evaluation will be based on single trajectories. In MSE evaluation, the prediction errors from 10 moving forecasts are squared and averaged. The model with the lowest MSE is then selected as the most appropriate prediction tool. In AIC evaluation, a single AIC is calculated from observations #101–#200 for the three models, i.e., the autoregressive, the moving-average, and the ARMA model. The model with the lowest AIC is then selected as the optimum time-series description.
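The two selection rules can be sketched compactly as follows, again leaning on statsmodels' ARIMA as an assumed stand-in for the GAUSS-based estimation mentioned in Section 3.1.

import numpy as np
from statsmodels.tsa.arima.model import ARIMA  # assumed estimation routine

ORDERS = {"AR(1)": (1, 0, 0), "MA(1)": (0, 0, 1), "ARMA(1,1)": (1, 0, 1)}

def select_by_aic(x_est):
    # one AIC per model on the estimation sample (observations #101-#200)
    aic = {name: ARIMA(x_est, order=o, trend="n").fit().aic
           for name, o in ORDERS.items()}
    return min(aic, key=aic.get)

def select_by_mse(x, n_est=100, n_fc=10):
    # rolling out-of-sample MSE from ten one-step forecasts per model
    mse = {}
    for name, o in ORDERS.items():
        errs = [x[j + n_est]
                - ARIMA(x[j:j + n_est], order=o, trend="n").fit().forecast(1)[0]
                for j in range(n_fc)]
        mse[name] = float(np.mean(np.square(errs)))
    return min(mse, key=mse.get)

Tabulating the winning model over 1,000 replications per (φ, θ) grid point gives selection-frequency surfaces of the kind shown in Figures 7–12.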

Figure 7 shows the relative selection frequency of the pure autoregressive model on the basis of the smallest RMSE as calculated from only ten out-of-sample single-step forecasts. For both θ and φ small, the frequency is around 0.2 and gradually increases as φ ↑ 1. For θ = 1, the selection frequency reaches a minimum, excepting θ = φ = 1. It is interesting to note that the maximum frequency of selecting the autoregressive structure is not obtained for θ = 0 but on a skew ‘ridge’ that runs from (φ, θ) = (0, 0.5) to (φ, θ) = (1, 0). The autoregressive model also dominates for θ = φ = 1, probably due to the unsatisfactory performance of the maximum-likelihood estimator for the ARMA(1,1) model in this region.

Figure 7: Prediction MSE as a model selection criterion: relative selection frequency of the AR(1) model when the data are generated by ARMA(1,1) models.

Figure 8 gives the relative selection frequency of the pure moving-average model. The moving-average model is unlikely to be selected for φ = 1, where its frequency drops to zero, while it is selected in more than 40% of the cases with φ = 0. As for the autoregressive model, however, the selection frequency does not depend entirely monotonically on φ.

Figure 9 gives the relative selection frequency of the mixed ARMA(1,1) model. The most sophisticated model has a low probability of being selected for φ = 0 or for θ = 0, albeit a slightly higher one for near white noise. This probability rises as φ and θ increase, excepting the area around φ = θ = 1, with the already mentioned problems of the estimation procedure.

The true information criteria permit a more precise model selection than the RMSE evaluation, as can be seen from Figures 10–12. There is a 100% preference for the mixed model when both φ and θ are large enough, and there is a strong preference for the autoregressive and moving-average structures close to their natural habitats. The poor performance of the ML estimator close to the point φ = θ = 1 is confirmed by the peak in that corner in Figure 10 and the corresponding trough in Figure 12.


Figure 8: Prediction MSE as a model selection criterion: relative selection frequency of the MA(1) model when the data are generated by ARMA(1,1) models.


Figure 9: Prediction MSE as a model selection criterion: relative selection frequency of the ARMA(1,1) model when the data are generated by ARMA(1,1) models.


The experiment was repeated with the Schwarz criterion in place of the AIC. The figures of selection frequencies turned out very similar to Figures 10–12 and are therefore not shown. It appears that the recent statistical literature generally prefers the AIC and its variants to the BIC, which, despite its asymptotic consistency property, does not select satisfactorily in small samples. This paper focuses exclusively on the AIC.
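For reference, the two criteria compared here take their usual forms

\[
\mathrm{AIC} = -2 \log L + 2k, \qquad \mathrm{BIC} = -2 \log L + k \log n,
\]

with L the maximized likelihood, k the number of estimated parameters, and n the sample size; the heavier penalty k log n is what delivers the BIC's consistency as well as its stronger preference for parsimonious structures in small samples.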

Figure 10: AIC as a model selection criterion: relative selection frequency of the AR(1) model when the data are generated by ARMA(1,1) models.

The conclusions from these experiments are twofold.

Firstly, comparative MSE or other forecast-precision evaluations are poor substitutes for traditional information criteria when it comes to model selection. While the MSE graphs were based on trajectories of length 110, the AIC graphs used a length of 100 throughout. Also, the computing time for the AIC graphs was approximately 5% of that for the MSE graphs. Nonetheless, the visual impression clearly supports model selection by AIC, provided one is interested in finding the ‘true’ model. If the aim is choosing a prediction model, the true model is not necessarily the better choice, and the MSE-selected model, although incorrect, may dominate its AIC-selected counterpart. Also, one should keep in mind that, in practical applications, the data-generating process may be much more complex than the entertained model classes.


Figure 11: AIC as a model selection criterion: relative selection frequency of the MA(1) model when the data are generated by ARMA(1,1) models.


Figure 12: AIC as a model selection criterion: relative selection frequency of the ARMA(1,1) model when the data are generated by ARMA(1,1) models.


The main relative weakness of the MSE criterion is its exclusive reliance on a few observations at the end of the sample. This weakness may turn into a virtue if all entertained models are poor approximations at best and, for example, an ARMA(1,1) structure with time-changing coefficients approximates better than a linear one. This point is taken up in Section 5.2.

Secondly, however, the figures demonstrate that the recommendations by MSE comparisons are altogether similar and comparable to the AIC recommendations. In other words, an MSE comparison is nothing other than an evaluation of model selection statistics. If researchers are reluctant to study the significance of AIC decisions, there is no reason to demand significance tests on MSE or MAE decisions.

Finally, one should not overlook an important argument that may favor MSE/MAE comparisons over IC. While forecasting evaluations automatically penalize model complexity in such a way that the optimum prediction workhorse is supported, IC evaluations rely solely on the number of parameters as a sometimes poor measure of complexity. The traditional IC approach may work for choosing among simple linear structures, but it may fail for nonlinear models with high inherent complexity and comparatively few parameters. Forecasting comparisons will continue to give the appropriate implicit penalty to the inconvenient nonlinear workhorse.

An important argument is certainly that model selection may not be the ultimate aim, particularly when forecasting criteria are considered. The graphs shown only demonstrate that the AIC outperforms the MSE with regard to the search for the ‘true’ structure. They do not show the relative prediction performance of the selected models. Therefore, another set of simulations was conducted, using the same ARMA models to generate trajectories as before, although now of length 211 instead of 210. Three model selection procedures are considered:

1. Models are selected on the basis of the AIC. Here, 1000 replications are generated. For each replication, the selection based on observations #100–#200, on #101–#201, etc. is evaluated. The model with the minimum AIC is used as a prediction model for the next observation, that is, for #201, #202, ... For each of the 10,000 cases, the MSE and MAE are collected. Finally, the MSE (and MAE) are averaged over the 10,000 cases.

2. Models are selected on the basis of the MSE. For each replication, the observation #200 is first forecast using #100–#199. The model with the minimum MSE is then used to predict observation #201 from #100–#200. Then the end point of the sample is shifted, until ten forecasts are available. The implied MSE is averaged over the 10,000 cases.

3. First, each of the three models generates forecasts over a training period #201–#210. From these ten forecasts, the DM statistic is calculated to compare the basic AR(1) prediction with the more ‘sophisticated’ MA(1) and ARMA(1,1) rivals. If a rival achieves a ‘significantly’ better performance at the one-sided 2.5% risk level, it is used to predict #211; otherwise, the prediction relies on the AR(1) model (see the sketch following this list). The resulting MSE is averaged over the 1000 cases.
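A minimal sketch of the gating rule in strategy #3 is given below. How a tie between two ‘significant’ rivals is broken is not spelled out above, so the rule used here (keep the rival with the larger statistic) is an assumption, as is the squared-error loss.

import numpy as np

def dm_statistic(e_base, e_rival, loss=np.square, trunc=2):
    # S1 for d_t = g(e_base,t) - g(e_rival,t); positive values favour the rival
    d = loss(np.asarray(e_base, dtype=float)) - loss(np.asarray(e_rival, dtype=float))
    n = d.size
    dc = d - d.mean()
    gamma = np.array([np.dot(dc[k:], dc[:n - k]) / n for k in range(trunc + 1)])
    bartlett = 1.0 - np.arange(1, trunc + 1) / (trunc + 1.0)
    lrv = gamma[0] + 2.0 * np.dot(bartlett, gamma[1:])
    return d.mean() / np.sqrt(lrv / n)

def dm_gated_choice(training_errors, default="AR(1)", crit=1.96):
    # keep the AR(1) default unless a rival is 'significantly' better on the
    # training forecasts (one-sided 2.5% level, critical value 1.96)
    best, best_s1 = default, -np.inf
    for name, errs in training_errors.items():
        if name == default:
            continue
        s1 = dm_statistic(training_errors[default], errs)
        if s1 > crit and s1 > best_s1:
            best, best_s1 = name, s1
    return best

# usage, with the errors from the ten training forecasts #201-#210 per model:
# choice = dm_gated_choice({"AR(1)": e_ar, "MA(1)": e_ma, "ARMA(1,1)": e_arma})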

While the resulting model selection frequencies for the two former cases have already been reported, those for the latter case are shown in Figures 13–15. The default AR(1) model is selected with a large probability over the whole parameter space, with a minimum for generating models with a large MA component. The default model is rarely overruled by the pure MA(1) model, even when θ is large. The mixed ARMA(1,1) models increase in their selection frequency as φ and θ increase, although they never become as dominant as under the other model selection techniques.

Figure 13: MSE plus DM significance as a model selection criterion: relative selection frequency of the AR(1) model when the data are generated by ARMA(1,1) models.

Again, it appears that model selection guided by the MSE and an additional DM step does a poor job. Still, it may do a better job if forecasting is the ultimate aim of the exercise. Figure 16 shows that strategy #2, i.e., model selection based on the MSE, indeed achieves better forecasts than strategy #1 in a portion of the parameter space.


Figure 14: MSE plus DM significance as a model selection criterion: relative selection frequency of the MA(1) model when the data are generated by ARMA(1,1) models.


Figure 15: MSE plus DM significance as a model selection criterion: relative selection frequency of the ARMA(1,1) model when the data are generated by ARMA(1,1) models.


That portion is characterized by small φ and large θ values, where the AIC correctly chooses ARMA models but the bias in φ estimation tends to outweigh the benefits from estimating it instead of restricting it to zero. For other (φ, θ) constellations, strategy #2 is worse, and it becomes definitely inferior as φ → 1.

Figure 16: Contour plot of the difference of the MSE achieved by MSE model selection minus the MSE achieved by AIC model selection. 756 replications of n = 100 trajectories from ARMA processes.

Using the DM test on top of MSE selection generally succeeds in improving predictive accuracy relative to pure MSE selection. The reason is that the AR(1) forecasts are the most ‘robust’ predictions, while ARMA(1,1) forecasts bring in more bias and sampling variation. Figure 17 shows, however, that the MSE-DM search strategy is unable to beat AIC selection for those parameter values where the pure MSE search has the most severe deficiencies, that is, for large φ. Standard AIC model selection achieves the best forecasting performance for φ > 0.5, ignoring the north-east corner (φ, θ) = (1, 1), a severely unstable model where the ARMA(1,1) estimator faces problems and can be improved upon by replacing it with the AR(1) estimator.

We note that strategy #1 can be improved by replacing the asymptotic AIC criterion by the AICc suggested by Hurvich and Tsai, which indeed shrinks the area of MSE dominance considerably.
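In its standard form for a model with k estimated parameters and n observations, the small-sample correction referred to here is

\[
\mathrm{AIC}_c = \mathrm{AIC} + \frac{2k(k+1)}{n - k - 1},
\]

which strengthens the complexity penalty precisely where the plain AIC is most prone to overfitting, namely when k is not small relative to n.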


Figure 17: Contour plot of the difference of the MSE achieved by MSE-DM model selection minus the MSE achieved by AIC model selection. 756 replications of n = 100 trajectories from ARMA processes.


Also, one may consider improving on strategy #2 by replacing the selection based on a single forecast with the average over a training period, as in strategy #3. In order to separate the effect of the DM test proper from that of using a training set of 10 observations, the experiment was re-run with this modified strategy #2’. It turned out that the performance of strategy #2’ does not differ substantially from that of strategy #2 and does not even dominate it over the whole parameter space. The slight improvement of strategy #3 relative to #2 mainly reflects its implicit preference for using autoregressive models in forecasting.

The bottom line is that DM testing, and thus artificially biasing the decision toward a simple structure, may help in cases where AIC selection chooses a correct model with unfavorable biases in parameter estimation. In those cases where AIC selection is to be preferred to MSE selection, DM testing does not help. All these remarks are only valid if the aim is prediction. If the aim is a search for the true model structure, out-of-sample prediction measures are much less likely to find that true structure than standard information criteria, and even this poor performance deteriorates if DM testing is applied.

Figure 18: Contour plot of the difference of the MSE achieved by MSE-DM model selection with an MA default model minus the MSE achieved by MSE model selection based on a 10-observation training set. 756 replications of n = 100 trajectories from ARMA processes.


A variant of this experiment, which is reported in less detail here, replaces the default AR model for DM testing by a default MA model. In this case, the AIC yields better predictions for θ > 0.5, while MSE-DM is preferable for θ < 0.4. Figure 18 compares this moving-average DM strategy to strategy #2’. DM testing incurs a deterioration for models with a dominant autoregressive component, i.e., for φ much larger than θ, while it incurs an improvement for other parameter values. The reason is similar to the one for the autoregressive version of the experiment. DM testing biases model selection more strongly toward choosing MA structures, which is beneficial for MA-dominated generating mechanisms, although inappropriate for AR-dominated ones. When the true parameter value is unknown, the effects of the additional testing step are uncertain.

5.2 Some experiments with randomized coefficients

Data-generating models of the ARMA type may bias this research unduly toward model selection by IC. The sample size of 100 is large enough to allow reliable discrimination among AR, MA, and ARMA structures, accounting for the whole sample optimally. By contrast, model selection by forecasting criteria in the simulation design relies on the local behavior over a relatively short time span, whether the DM test is used or not. This observation motivates investigating a different design that, conversely, may be biased in favor of forecasting criteria. Among economists, it is a common criticism of time-series models that the economic world is evolving and changing over time, hence any simple structure with constant coefficients may not be trustworthy. While it is not possible to generate models that match this idea of an evolving world in every aspect, ARMA processes with time-changing coefficients may be a good approximation. If the true structure is subject to changes with some persistence, forecasting based on a moving time window may capture the locally optimal prediction workhorse even when the assumed forecaster is unaware of the true data-generating process. What is otherwise an inefficiency may be an asset of flexibility in such a design. This design also meets the requirement that, in order to achieve a realistic design, the generating model should be more general than the structures in the set of entertained models.

Currently, arguments of coefficient variation often support the application of unobserved-components structures. However, most of these models are in fact fixed-coefficients structures, many of them assuming additional unit roots and very special short-run autocorrelation. By contrast, a model with a truly changing structure is the ARMA model with stochastic coefficients, of the form

\[
y_t = \phi_t y_{t-1} + \varepsilon_t + \theta_t \varepsilon_{t-1}. \qquad (4)
\]

In line with the time-series literature, such a model is called a RC—ARMA model(random coefficients ARMA). RC—ARMA models are simple generalizations ofthe RCA models studied by Nicholls and Quinn (1982). While the conditionsfor finite second moments are somehow stricter than in the constant-coefficients


While the conditions for finite second moments are somewhat stricter than in the constant-coefficients ARMA model, strict stationarity even holds for Eφ_t > 1, as long as the φ_t process is stable and its variation is not too large. Typical trajectories for Eφ_t equal or close to unity display short periods of very high or low values before they turn back to longer periods of relatively small values. It should be noted that trajectories for Eφ_t = 0.9, say, are quite different in appearance from ARMA(1,1) or AR(1) processes with φ = 0.9. RCA models have been taken up in the literature on conditional heteroskedasticity in time series, and we refer to these publications for detailed conditions (see Tsay, 1987, Bougerol and Picard, 1992, Kunst, 1997). In order to simplify this quite general structure somewhat, we assume that θ_t ≡ θ but that

φ_t − φ = ζ (φ_{t−1} − φ) + η_t,    (5)

with |ζ| < 1 given and η_t n.i.d. The condition |ζ| < 1 is necessary to guarantee the stability of the process (φ_t), as E(φ_t − φ)^2 = (1 − ζ^2)^{−1} E η_t^2. For the reported simulation, ζ = 0.9 is imposed; otherwise, the standard design of the previous sections is used. E η_t^2 is set at 0.1 times E ε_t^2, which results in a sufficiently small variance for (φ_t) and in trajectories that are sufficiently different from constant-coefficients ARMA processes.
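To make the design concrete, the following is a minimal sketch (in Python, using only NumPy) of how a single RC—ARMA(1,1) trajectory obeying (4) and (5) can be generated with ζ = 0.9 and E η_t^2 = 0.1 E ε_t^2. The function name, the burn-in length, and the unit innovation variance are illustrative choices, not taken from the paper's own simulation code.

```python
import numpy as np

def simulate_rc_arma(n, phi, theta, zeta=0.9, eta_var=0.1, burn=200, seed=None):
    """One RC-ARMA(1,1) trajectory:
        y_t = phi_t * y_{t-1} + eps_t + theta * eps_{t-1},        (4)
        phi_t - phi = zeta * (phi_{t-1} - phi) + eta_t,           (5)
    with eps_t ~ n.i.d.(0, 1) and eta_t ~ n.i.d.(0, eta_var), so that
    E eta_t^2 = 0.1 * E eps_t^2 under the default settings."""
    rng = np.random.default_rng(seed)
    total = n + burn
    eps = rng.standard_normal(total)
    eta = rng.normal(scale=np.sqrt(eta_var), size=total)
    y = np.zeros(total)
    phi_t = phi                       # start the coefficient process at its mean
    for t in range(1, total):
        phi_t = phi + zeta * (phi_t - phi) + eta[t]
        y[t] = phi_t * y[t - 1] + eps[t] + theta * eps[t - 1]
    return y[burn:]                   # drop the burn-in segment

# Example: a trajectory long enough for forecast origins #200 to #209.
y = simulate_rc_arma(211, phi=0.5, theta=0.3, seed=1)
```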

For this generating model, Figures 19—21 show the selection frequencies as surfaces when the MSE is chosen as the relevant model selection criterion. The graphs are not too different from the constant-coefficients case.

Using AIC as a model selector instead of the MSE, this time based only on the starting samples from observations #100—#200, results in the selection frequencies represented in Figures 22—24. Although the global maximum of selected AR models around (φ, θ) = (0.6, 0) is to be noted, the general impression still favors the AIC selection. This impression changes, however, if the focus is on optimizing the forecasting properties rather than on selecting the true model structure, which is not exactly contained in the selection set anyway, as the generating models are not traditional ARMA models.

For the narrow-sense prediction evaluation experiment, the model selection decision was allowed to be revised as the sample end moved from #200 to #209. For each sample, a one-step prediction was calculated according to the AIC—selected model and according to the MSE—selected model. The difference MSE(MSE) − MSE(AIC) is shown as a contour plot in Figure 25. For the MAE instead of the MSE, a very similar plot was obtained. The area {(φ, θ) | φ > 0.7, θ ∈ [0, 1]} has been omitted, as these models show episodes of local instability that drive up measures such as the MAE and the MSE unduly. Generally, in these areas the differences are convincingly positive, which indicates that the AIC—selected model yields better predictions than the one determined by MSE optimization. Even within the shown area, dominance of prediction-gauged MSE selection is not as widespread as could have been expected. MSE selection performs better for structures that are close to white noise. Figure 21 shows that such structures are classified as ARMA(1,1) with positive probability, while AIC classifies them as simple AR(1) or MA(1).
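For illustration, a sketch of this evaluation is given below. It assumes the ARIMA interface of statsmodels and fills in details that the text leaves open (the AIC selection is based on observations from #100 onward, the MSE selection re-estimates every candidate at each origin of a 10-observation window), so it should be read as one possible rendering of the design, not as the code behind the reported figures.

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

# Candidate models: AR(1), MA(1), ARMA(1,1).
ORDERS = {"AR(1)": (1, 0, 0), "MA(1)": (0, 0, 1), "ARMA(1,1)": (1, 0, 1)}

def one_step(y_past, order):
    """Fit the candidate on the available sample and forecast one step ahead."""
    return ARIMA(y_past, order=order).fit().forecast(1)[0]

def select_by_aic(y_past):
    """Candidate with the smallest AIC on the given sample."""
    return min(ORDERS, key=lambda m: ARIMA(y_past, order=ORDERS[m]).fit().aic)

def select_by_mse(y_past, window=10):
    """Candidate with the smallest out-of-sample one-step MSE over the last
    `window` observations, re-estimated at every forecast origin."""
    sse = {m: 0.0 for m in ORDERS}
    for t in range(len(y_past) - window, len(y_past)):
        for m, order in ORDERS.items():
            sse[m] += (y_past[t] - one_step(y_past[:t], order)) ** 2
    return min(sse, key=sse.get)

def mse_difference(y):
    """MSE of the MSE-selected forecasts minus MSE of the AIC-selected
    forecasts, with the sample end moving over ten consecutive origins."""
    se_aic, se_mse = [], []
    for end in range(200, 210):
        past, actual = y[:end], y[end]
        aic_model = select_by_aic(past[100:])    # AIC based on obs #100 onward
        mse_model = select_by_mse(past)
        se_aic.append((actual - one_step(past, ORDERS[aic_model])) ** 2)
        se_mse.append((actual - one_step(past, ORDERS[mse_model])) ** 2)
    return np.mean(se_mse) - np.mean(se_aic)     # MSE(MSE) - MSE(AIC)
```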


Figure 19: Prediction MSE as a model selection criterion: relative selection frequency of the AR(1) model when the data are generated by ARMA(1,1) models with random coefficients.


Figure 20: Prediction MSE as a model selection criterion: relative selection frequency of the MA(1) model when the data are generated by ARMA(1,1) models with random coefficients.


Figure 21: Prediction MSE as a model selection criterion: relative selection frequency of the ARMA(1,1) model when the data are generated by ARMA(1,1) models with random coefficients.


Figure 22: AIC as a model selection criterion: relative selection frequency of the AR(1) model when the data are generated by ARMA(1,1) models with random coefficients.


Figure 23: AIC as a model selection criterion: relative selection frequency of the MA(1) model when the data are generated by ARMA(1,1) models with random coefficients.


Figure 24: AIC as a model selection criterion: relative selection frequency of the ARMA(1,1) model when the data are generated by ARMA(1,1) models with random coefficients.


RC—ARMA models with small φ and θ are predicted better by ARMA(1,1) forecasts than by constant-coefficients AR(1) or MA(1) forecasts, which can also be corroborated by other, unreported simulation exercises. MSE selection also performs better for RC—ARMA models with small φ and large θ. AIC sees such trajectories as MA, while the MSE search picks up the evidence on a non-zero φ. For these structures, the criterion of parsimony may be a disadvantage. For larger φ, the MSE—selected models fall behind. The probability of achieving a smaller forecast error with an AR or MA model over a short window is substantial, even when the trajectory at hand points to a full mixed ARMA model. The MSE search relies on this local evidence and inefficiently prefers the parsimonious structure, thus missing the chance to improve upon the prediction.

As a consequence of the experiments shown in Figure 25, ‘selecting a forecasting model by forecasting’ can be recommended for cases with low time correlation or comparatively short memory. Then, forecasting criteria may be able to exploit the advantage of adapting flexibly to time-changing coefficient structures. For larger time correlation, AIC dominates, as it gives a larger weight to the long-run information.

Figure 25: Difference of mean-squared prediction error of MSE—based model selection minus AIC—based model selection. Generating model is RC—ARMA(φ, θ).

Here, let us return to the main topic of this research.


Suppose that the forecaster optimizes the MSE but prefers the simplest model, as long as the rival models do not achieve a ‘significant’ improvement. Again, it appears natural to assume that the simplest forecasting model is the AR(1) model, which is typically preferred by most empirical researchers to the MA(1) and ARMA(1,1) models, as it can be estimated by least-squares regression and conveniently fits into the framework of distributed lags.

This experiment is comparable to the one reported above for the fixed-coefficients ARMA design. Applying the DM test to the very same experiment as before turned out to impose an excessive computational burden. Therefore, single-step predictions were evaluated for the range #201-#210, from which the DM statistic was calculated. Then, the AR(1) model acted as the default model, from which a forecast for #211 was used, unless either the MA(1) or the ARMA(1,1) model yielded a ‘significantly’ better prediction for the training range. If both models came out significantly better than AR(1), the model with the larger DM statistic was selected. Figures 26—28 show the selection frequencies for the three competing models. By construction, AR models are selected with a sizeable frequency over the whole parameter region, while MA models are rarely ‘significantly’ better than the AR baseline. As expected, mixed ARMA models attain their maximum frequency for generating structures with large θ and φ, though they are not as dominant for these values as under the other model selection strategies.
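The decision rule just described can be summarized in a few lines. In the sketch below, the plain DM statistic for squared one-step losses (no autocorrelation correction, since the loss differential of one-step forecasts is treated as serially uncorrelated) and the one-sided 5% normal critical value of 1.645 are assumptions made for illustration; the exact test implementation used in the simulations is not reproduced here.

```python
import numpy as np

def dm_statistic(e_default, e_rival):
    """Diebold-Mariano statistic for squared one-step forecast errors.
    Positive values mean the rival model forecasts better than the default."""
    d = np.asarray(e_default) ** 2 - np.asarray(e_rival) ** 2   # loss differential
    return np.mean(d) / np.sqrt(np.var(d, ddof=1) / len(d))

def select_with_dm(errors, default="AR(1)", crit=1.645):
    """`errors` maps each candidate to its one-step forecast errors over the
    training range (#201-#210).  Keep the default AR(1) model unless a rival
    is 'significantly' better; if both rivals are, take the larger statistic."""
    stats = {m: dm_statistic(errors[default], e)
             for m, e in errors.items() if m != default}
    better = {m: s for m, s in stats.items() if s > crit}
    return default if not better else max(better, key=better.get)
```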

The results of this model selection procedure, which is based upon MSE and DM test significance, are reported as contour plots. Figure 29 gives the differences of the mean-absolute prediction errors for this complex procedure versus the AIC—based selection. AIC is preferable for larger φ, while MSE—DM is better for small φ and particularly for large θ. While this outcome is comparable to the relative performance of the selection algorithm without DM testing (Figure 25), the impression changes for mean-squared instead of mean-absolute errors, when the area of preference for the MSE—DM selection increases considerably. The reason is that MSE—DM selection avoids occasional large prediction failures for some extremely unstable trajectories with large φ, while AIC selection yields better prediction on average.

Of special interest is the comparison between the DM—based procedure and pure MSE optimization. In analogy to the results for the fixed-coefficients ARMA model, it turns out that the DM-testing step yields an improved performance for the largest part of the parameter space. DM testing implies a larger MSE, i.e., a deterioration, in a region with small θ and φ around 0.5. For large φ, pure MSE optimization and MSE—DM yield very similar results and produce a somewhat confusing contour plot, which is therefore not shown. The improvement achieved by the DM step is strongest for large θ and small φ, where AIC identifies MA models, while the MSE prefers AR or ARMA structures and DM testing gives an additional boost to the AR model with its most ‘robust’ forecasting performance.

Again, an intermediate experiment was conducted, with model selection determined by the average MSE over a training period of 10 observations, in order to find out how much of the change from MSE to MSE—DM is to be attributed to DM testing and how much to the use of a training period.


Figure 26: Selection frequency for the AR(1) model according to a prediction evaluation using the MSE and the DM test over a training set of 10 observations, if the true structure is an RC—ARMA(1,1) model.


Figure 27: Selection frequency for the MA(1) model according to a prediction evaluation using the MSE and the DM test over a training set of 10 observations, if the true structure is an RC—ARMA(1,1) model.


Figure 28: Selection frequency for the ARMA(1,1) model according to a prediction evaluation using the MSE and the DM test over a training set of 10 observations, if the true structure is an RC—ARMA(1,1) model.


Figure 29: Difference of mean-absolute prediction error of MSE—DM—based model selection minus AIC—based model selection. Generating model is RC—ARMA(φ, θ).


For large φ, the intermediate design improves upon the single-observation MSE selection. For φ ≥ 0.7, averaging the MSE over a training phase even clearly beats MSE—DM. However, the change relative to single-observation MSE selection remains small; therefore, the largest part of the displayed features is due to the DM-testing step and not to the use of a training phase. In order to focus on the main points, the results of the intermediate experiment are not shown.

The general conclusion to be drawn from the experiment is that the application of the DM significance test offers little improvement for RC—ARMA data. The main drawback of MSE—based selection is that mixed ARMA structures are not recognized for large φ, due to the masking effects of the time variation in the coefficients, and that other models are selected instead, which results in poor prediction performance. This effect is even enhanced by the DM significance test, which prevents discarding the AR model because the relative merit of the full ARMA model does not attain ‘significance’ in the training period.

6 Summary and conclusion

Tests for the significance of differences in predictive accuracy have become a popular tool in empirical studies of predictive evaluation. Here, it is argued that the merit of such significance tests is doubtful. Theoretical arguments and simulation evidence are provided.

From a theoretical point of view, such tests are equivalent to testing the significance of a model selection decision. Like testing for significance on top of a model selection decision by AIC or by BIC, the application of such a significance test is not justified and is usually motivated by a misunderstanding of model selection and a futile attempt to view information-theoretic procedures in the framework of classical hypothesis testing. Such double checks implicitly give additional prior weight to the formal null hypothesis and, in practice, cause a bias in favor of simplicity, even though a penalty for complexity has already been included at the model selection stage.

From an empirical point of view, two simple, though empirically not implausible, model selection designs are evaluated by simulation. In the ARMA(1,1) design with fixed coefficients, choosing the AR(1) model as the ‘null model’ indeed points to a possible improvement of forecasting performance due to the secondary significance check. Searching for the optimum model by minimizing the MSE over a relatively short training period indeed outperforms the AIC search for sizeable parts of the parameter space. Both features are somewhat surprising and may motivate further research by analytical and simulation methods. A possible reason for the second effect may be that AIC gives an insufficient penalty to the full ARMA(1,1) class in small samples. The effect can be mitigated by replacing the AIC by the BIC or by the small-sample AICc criterion, as suggested by Hurvich and Tsai (1989). Another reason may be that, despite the known advantageous forecasting properties of AIC—selected models, information criteria are still insufficiently tuned to the prediction task.
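For reference, these criteria take their standard forms, with maximized log-likelihood ℓ̂, k estimated parameters, and sample size n:

AIC = −2 ℓ̂ + 2k,    BIC = −2 ℓ̂ + k log n,    AICc = AIC + 2k(k + 1) / (n − k − 1),

so that the AICc correction grows as k approaches n and penalizes heavily parameterized models more strongly in small samples.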


With regard to finding the true structure, AIC clearly dominates the rival procedures.

The other simulation example is an RC—ARMA(1,1) generating model with randomized coefficients. This more realistic design assumes that all models in the selection set are ‘misspecified’ and, intuitively, may favor the MSE—based model search. However, the area of MSE dominance over AIC increases only slightly relative to the fixed-coefficients design. Subjecting this decision to a secondary significance test with an AR(1) null model causes an improvement in forecasting performance for random-coefficient MA models but a further deterioration for generating models with a large autoregressive coefficient.

It is obvious that it is fairly easy to find examples of data-generating mechanisms that support any of the three considered selection procedures, hence this research is at too early a stage to allow specific recommendations. It is also obvious that each of the competing selection procedures can be modified and improved. Modified information criteria and modified versions of the original DM test have already been mentioned. However, the impression remains that significance testing for relative prediction accuracy is far from a panacea for the difficult task of improving forecasting accuracy in practice. The simulation evidence also points to the importance of separating the task of finding the optimum model as such, in a Kullback-Leibler sense of approximating the truth, from the task of optimizing predictive performance.

The author wishes to thank Jesus Crespo-Cuaresma for helpful comments. The usual proviso applies.

References

[1] Akaike, H. (1974) ‘A New Look at the Statistical Model Identification’ IEEE Transactions on Automatic Control AC—19, 716—723.

[2] Akaike, H. (1981) ‘Modern development of statistical methods’ Pages 169—184 in P. Eykhoff (ed.) Trends and Progress in System Identification. Pergamon Press.

[3] Ashley, R. (in press) ‘Statistically significant forecasting improvements: how much out-of-sample data is likely necessary?’ International Journal of Forecasting.

[4] Bougerol, P., and Picard, N. (1992) ‘Strict stationarity of generalized autoregressive processes’ Annals of Probability 20, 1714—1730.

[5] Box, G.E.P., and Jenkins, G. (1976) Time Series Analysis, Forecasting and Control. Revised edition, Holden-Day.

[6] Brockwell, P.J., and Davis, R.A. (2002) Introduction to Time Series and Forecasting. 2nd edition, Springer-Verlag.

[7] Burnham, K.P., and Anderson, D.R. (2002) Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach. 2nd edition, Springer-Verlag.


[8] Diebold, F.X., and Mariano, R.S. (1995) ‘Comparing Predictive Accuracy’ Journal of Business and Economic Statistics 13, 253—263.

[9] Fildes, R., and Stekler, H. (2002) ‘The state of macroeconomic forecasting’ Journal of Macroeconomics 24, 435—468.

[10] Greene, W.H. (1990) Econometric Analysis. Macmillan.

[11] Hamilton, J.D. (1994) Time Series Analysis. Princeton University Press.

[12] Hurvich, C.M., and Tsai, C.L. (1989) ‘Regression and time series model selection in small samples’ Biometrika 76, 297—307.

[13] Kunst, R.M. (1997) ‘Augmented ARCH models for financial time series: stability conditions and empirical evidence’ Applied Financial Economics 7, 575—586.

[14] Linhart, H. (1988) ‘A test whether two AIC’s differ significantly’ South African Statistical Journal 22, 153—161.

[15] Nicholls, D.F., and Quinn, B.G. (1982) Random coefficient autoregressive models: an introduction. Lecture Notes in Statistics Vol. 11, Springer.

[16] Ramanathan, R. (2002) Introductory Econometrics with Applications. 5th edition, South-Western.

[17] Schwarz, G. (1978) ‘Estimating the dimension of a model’ Annals of Statistics 6, 461—464.

[18] Tashman, L.J. (2000) ‘Out-of-sample tests of forecasting accuracy: an analysis and review’ International Journal of Forecasting 16, 437—450.

[19] Tiao, G.C., and Tsay, R.S. (1984) ‘Consistent Estimates of Autoregressive Parameters and Extended Sample Autocorrelation Function for Stationary and Nonstationary ARMA Models’ Journal of the American Statistical Association 79, 84—96.

[20] Tsay, R.S. (1987) ‘Conditional heteroscedastic time series models’ Journal of the American Statistical Association 82, 590—604.

