
    Econometric Modelling

    David F. Hendry

    Nuffield College, Oxford University.

    July 18, 2000

    Abstract

The theory of reduction explains the origins of empirical models by delineating all the steps involved in mapping from the actual data generation process (DGP) in the economy (far too complicated and high dimensional ever to be completely modelled) to an empirical model thereof. Each reduction step involves a potential loss of information from: aggregating, marginalizing, conditioning, approximating, and truncating, leading to a local DGP which is the actual generating process in the space of variables under analysis. Tests of losses from many of the reduction steps are feasible. Models that show no losses are deemed congruent; those that explain rival models are called encompassing. The main reductions correspond to well-established econometric concepts (causality, exogeneity, invariance, innovations, etc.) which are the null hypotheses of the mis-specification tests, so the theory has considerable excess content.

General-to-specific (Gets) modelling seeks to mimic reduction by commencing from a general congruent specification that is simplified to a minimal representation consistent with the desired criteria and the data evidence (essentially represented by the local DGP). However, in small data samples, model selection is difficult. We reconsider model selection from a computer-automation perspective, focusing on general-to-specific reductions, embodied in PcGets, an Ox package for implementing this modelling strategy for linear, dynamic regression models. We present an econometric theory that explains the remarkable properties of PcGets. Starting from a general congruent model, standard testing procedures eliminate statistically-insignificant variables, with diagnostic tests checking the validity of reductions, ensuring a congruent final selection. Path searches in PcGets terminate when no variable meets the pre-set criteria, or any diagnostic test becomes significant. Non-rejected models are tested by encompassing: if several are acceptable, the reduction recommences from their union; if that re-appears, the search is terminated using the Schwarz criterion.

Since model selection with diagnostic testing has eluded theoretical analysis, we study modelling strategies by simulation. The Monte Carlo experiments show that PcGets recovers the DGP specification from a general model with size and power close to commencing from the DGP itself, so model selection can be relatively non-distortionary even when the mechanism is unknown. Empirical illustrations for consumers' expenditure and money demand will be shown live.

Next, we discuss sample-selection effects on forecast failure, with a Monte Carlo study of their impact. This leads to a discussion of the role of selection when testing theories, and the problems inherent in conventional approaches. Finally, we show that selecting policy-analysis models by forecast accuracy is not generally appropriate. We anticipate that Gets will perform well in selecting models for policy.

Financial support from the UK Economic and Social Research Council under grants Modelling Non-stationary Economic Time Series (R000237500) and Forecasting and Policy in the Evolving Macro-economy (L138251009) is gratefully acknowledged. The research is based on joint work with Hans-Martin Krolzig of Oxford University.


Contents

1 Introduction
2 Theory of reduction
   2.1 Empirical models
   2.2 DGP
   2.3 Data transformations and aggregation
   2.4 Parameters of interest
   2.5 Data partition
   2.6 Marginalization
   2.7 Sequential factorization
      2.7.1 Sequential factorization of W_1^T
      2.7.2 Marginalizing with respect to V_1^T
   2.8 Mapping to I(0)
   2.9 Conditional factorization
   2.10 Constancy
   2.11 Lag truncation
   2.12 Functional form
   2.13 The derived model
   2.14 Dominance
   2.15 Econometric concepts as measures of no information loss
   2.16 Implicit model design
   2.17 Explicit model design
   2.18 A taxonomy of evaluation information
3 General-to-specific modelling
   3.1 Pre-search reductions
   3.2 Additional paths
   3.3 Encompassing
   3.4 Information criteria
   3.5 Sub-sample reliability
   3.6 Significant mis-specification tests
4 The econometrics of model selection
   4.1 Search costs
   4.2 Selection probabilities
   4.3 Deletion probabilities
   4.4 Path selection probabilities
   4.5 Improved inference procedures
5 PcGets
   5.1 The multi-path reduction process of PcGets
   5.2 Settings in PcGets
   5.3 Limits to PcGets
      5.3.1 Collinearity
   5.4 Integrated variables
6 Some Monte Carlo results
   6.1 Aim of the Monte Carlo
   6.2 Design of the Monte Carlo
   6.3 Evaluation of the Monte Carlo
   6.4 Diagnostic tests
   6.5 Size and power of variable selection
   6.6 Test size analysis
7 Empirical Illustrations
   7.1 DHSY
   7.2 UK Money Demand
8 Model selection in forecasting, testing, and policy analysis
   8.1 Model selection for forecasting
      8.1.1 Sources of forecast errors
      8.1.2 Sample selection experiments
   8.2 Model selection for theory testing
   8.3 Model selection for policy analysis
      8.3.1 Congruent modelling
9 Conclusions
10 Appendix: encompassing
References


    1 Introduction

    The economy is a complicated, dynamic, non-linear, simultaneous, high-dimensional, and evolving en-

    tity; social systems alter over time; laws change; and technological innovations occur. Time-series

    data samples are short, highly aggregated, heterogeneous, non-stationary, time-dependent and inter-

dependent. Economic magnitudes are inaccurately measured, subject to revision, and important variables are not observable. Economic theories are highly abstract and simplified, rest on suspect aggregation assumptions, change over time, and rival, conflicting explanations often co-exist. In the face of this

    welter of problems, econometric modelling of economic time series seeks to discover sustainable and

    interpretable relationships between observed economic variables.

    However, the situation is not as bleak as it may seem, provided some general scientific notions are

    understood. The first key is that knowledge accumulation is progressive: one does not need to know all

    the answers at the start (otherwise, no science could have advanced). Although the best empirical model

    at any point will later be supplanted, it can provide a springboard for further discovery. Thus, model

    selection problems (e.g., data mining) are not a serious concern: this is established below, by the actual

    behaviour of model-selection algorithms.

    The second key is that determining inconsistencies between the implications of any conjectured

    model and the observed data is easy. Indeed, the ease of rejection worries some economists about eco-

    nometric models, yet is a powerful advantage. Conversely, constructive progress is difficult, because we

do not know what we don't know, so cannot know how to find out. The dichotomy between construction

    and destruction is an old one in the philosophy of science: critically evaluating empirical evidence is a

    destructive use of econometrics, but can establish a legitimate basis for models.

    To understand modelling, one must begin by assuming a probability structure and conjecturing

the data generation process. However, the relevant probability basis is unclear, since the economic

    mechanism is unknown. Consequently, one must proceed iteratively: conjecture the process, develop

    the associated probability theory, use that for modelling, and revise the starting point when the results do

    not match consistently. This can be seen in the gradual progress from stationarity assumptions, through

    integrated-cointegrated systems, to general non-stationary, mixing processes: further developments will

undoubtedly occur, leading to a more useful probability basis for empirical modelling. These notes first review the theory of reduction in section 2 to explain the origins of empirical models, then discuss some methodological issues that concern many economists.

    Despite the controversy surrounding econometric methodology, the LSE approach (see Hendry,

    1993, for an overview) has emerged as a leading approach to empirical modelling. One of its main

tenets is the concept of general-to-specific modelling (Gets): starting from a gen-

    eral dynamic statistical model, which captures the essential characteristics of the underlying data set,

    standard testing procedures are used to reduce its complexity by eliminating statistically-insignificant

    variables, checking the validity of the reductions at every stage to ensure the congruence of the selected

    model. Section 3 discusses Gets, and relates it to the empirical analogue of reduction.

Recently, econometric model selection has been automated in a program called PcGets, which is an Ox package (see Doornik, 1999, and Hendry and Krolzig, 1999a) designed for Gets modelling, currently

    focusing on reduction approaches for linear, dynamic, regression models. The development ofPcGets

    has been stimulated by Hoover and Perez (1999), who sought to evaluate the performance of Gets. To

    implement a general-to-specific approach in a computer algorithm, all decisions must be mechan-

    ized. In doing so, Hoover and Perez made some important advances in practical modelling, and our

    approach builds on these by introducing further improvements. Given an initial general model, many

    reduction paths could be considered, and different selection strategies adopted for each path. Some of


    these searches may lead to different terminal specifications, between which a choice must be made.

    Consequently, the reduction process is inherently iterative. Should multiple congruent contenders even-

tuate after a reduction round, encompassing can be used to test between them, with only the surviving (usually non-nested) specifications retained. If multiple models still remain after this 'testimation' pro-

    cess, a new general model is formed from their union, and the simplification process re-applied. Should

    that union repeat, a final selection is made using information criteria, otherwise a unique congruent and

    encompassing reduction has been located.

Automating Gets throws further light on several methodological issues, and prompts some new ideas, which are discussed in section 4. While the joint issue of variable selection and diagnostic testing

    using multiple criteria has eluded most attempts at theoretical analysis, computer automation of the

    model-selection process allows us to evaluate econometric model-selection strategies by simulation.

    Section 6 presents the results of some Monte Carlo experiments to investigate if the model-selection

    process works well or fails badly; their implications for the calibration of PcGets are also analyzed.

    The empirical illustrations presented in section 7 demonstrate the usefulness of PcGets for applied

    econometric research.

    Section 8 then investigates model selection in forecasting, testing, and policy analysis and shows the

    drawbacks of some widely-used approaches.

    2 Theory of reduction

First we define the notion of an empirical model, then explain the origins of such models by the

    theory of reduction.

    2.1 Empirical models

    In an experiment, the output is caused by the inputs and can be treated as if it were a mechanism:

    y_t = f(z_t) + ε_t    (1)
    [output]   [input]   [perturbation]

where y_t is the observed outcome of the experiment when z_t is the experimental input, f(·) is the mapping from input to output, and ε_t is a small, random perturbation which varies between experiments conducted at the same values of z. Given the same inputs {z_t}, repeating the experiment generates essentially the same outputs.

    In an econometric model, however:

    y_t = g(z_t) + ε_t    (2)
    [observed]   [explanation]   [remainder]

y_t can always be decomposed into two components, namely g(z_t) (the part explained) and ε_t (unexplained). Such a partition is feasible even when y_t does not depend on g(z_t). In econometrics:

    ε_t = y_t − g(z_t).    (3)

Thus, models can be designed by selection of z_t. Design criteria must be analyzed, and lead to the notion of a congruent model: one that matches the data evidence on the measured attributes. Successive

    congruent models should be able to explain previous ones, which is the concept of encompassing, and

    thereby progress can be achieved.
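As a quick illustration of the point that the decomposition in (2) and (3) is always available, the following sketch (my own illustration, not part of the text; Python with NumPy assumed) fits a least-squares g(z_t) to a y_t generated with no dependence on z_t:

```python
# Illustrative sketch: any observed y_t can be split into a fitted part g(z_t)
# and a remainder, even when y_t was generated with no dependence on z_t.
import numpy as np

rng = np.random.default_rng(0)
T = 100
z = rng.normal(size=T)
y = rng.normal(size=T)                     # y is unrelated to z by construction

X = np.column_stack([np.ones(T), z])       # g(z_t) = a + b*z_t chosen by OLS
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
explained = X @ coef
remainder = y - explained                  # eq. (3): e_t = y_t - g(z_t)

print(np.allclose(y, explained + remainder))   # True: the partition always exists
print(round(np.var(explained) / np.var(y), 3)) # small, but never exactly zero
```

The design question is therefore which z_t to select so that the remainder has the properties required of a congruent model's error, which is what the reduction sequence below formalizes.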


    2.2 DGP

Let {u_t} denote a stochastic process where u_t is a vector of n random variables. Consider the sample U_1^T = (u_1 ... u_T), where U_1^{t-1} = (u_1 ... u_{t-1}). Denote the initial conditions by U_0 = (... u_{-r} ... u_{-1} u_0), and let U_{t-1} = (U_0 : U_1^{t-1}). The density function of U_1^T conditional on U_0 is given by D_U(U_1^T | U_0, ξ), where D_U(·) is represented parametrically by a k-dimensional vector of parameters ξ = (ξ_1 ... ξ_k) with parameter space Ξ ⊆ R^k. All elements of ξ need not be the same at each t, and some of the {ξ_i} may reflect transient effects or regime shifts. The data generation process (DGP) of {u_t} is written as:

    D_U(U_1^T | U_0, ξ)  with  ξ ∈ Ξ ⊆ R^k.    (4)

The complete sample {u_t, t = 1, ..., T} is generated from D_U(·) by a population parameter value ξ_p. The sample joint data density D_U(U_1^T | U_0, ξ) is called the Haavelmo distribution (see e.g., Spanos, 1989).

The complete set of random variables relevant to the economy under investigation over t = 1, ..., T is denoted {u_t*}, where * denotes a perfectly measured variable, and U_1^{*T} = (u_1*, ..., u_T*), defined on the probability space (Ω, F, P). The DGP induces U_1^T = (u_1, ..., u_T), but U_1^T is unmanageably large. Operational models are defined by a sequence of data reductions, organized into eleven stages.

    2.3 Data transformations and aggregation

One-one mapping of U_1^T to a new data set W_1^T: U_1^T ↔ W_1^T. The DGP of U_1^T, and so of W_1^T, is characterized by the joint density:

    D_U(U_1^T | U_0, ξ_1^T) = D_W(W_1^T | W_0, φ_1^T)    (5)

where ξ_1^T ∈ Ξ and φ_1^T ∈ Φ, making the parameter change explicit. The transformation from U to W affects the parameter space, so Ξ is transformed into Φ.

    2.4 Parameters of interest

μ ∈ M. Identifiable, and invariant to an interesting class of interventions.

    2.5 Data partition

Partition W_1^T into the two sets:

    W_1^T = (X_1^T : V_1^T)    (6)

where the X_1^T matrix is T × n. Everything about μ must be learnt from analyzing X_1^T alone, so that V_1^T is not essential to inference about μ.

    2.6 Marginalization

    D_W(W_1^T | W_0, φ_1^T) = D_{V|X}(V_1^T | X_1^T, W_0, λ_{a,1}^T) · D_X(X_1^T | W_0, λ_{b,1}^T).    (7)

Eliminate V_1^T by discarding the conditional density D_{V|X}(V_1^T | X_1^T, W_0, λ_{a,1}^T) in (7), while retaining the marginal density D_X(X_1^T | W_0, λ_{b,1}^T). μ must be a function of λ_{b,1}^T alone, given by μ = f(λ_{b,1}^T). A cut is required, so that (λ_{a,1}^T : λ_{b,1}^T) ∈ Λ_a × Λ_b.


    2.7 Sequential factorization

To create the innovation process, sequentially factorize X_1^T as:

    D_X(X_1^T | W_0, λ_{b,1}^T) = ∏_{t=1}^{T} D_x(x_t | X_1^{t-1}, W_0, λ_{b,t}).    (8)

Mean innovation error process: ε_t = x_t − E[x_t | X_1^{t-1}].

2.7.1 Sequential factorization of W_1^T

Alternatively:

    D_W(W_1^T | W_0, φ_1^T) = ∏_{t=1}^{T} D_w(w_t | W_{t-1}, φ_t).    (9)

The RHS innovation process is ν_t = w_t − E[w_t | W_1^{t-1}].

2.7.2 Marginalizing with respect to V_1^T

    D_w(w_t | W_{t-1}, φ_t) = D_{v|x}(v_t | x_t, W_{t-1}, φ_{a,t}) · D_x(x_t | V_1^{t-1}, X_1^{t-1}, W_0, φ_{b,t}),    (10)

as W_{t-1} = (V_1^{t-1} : X_1^{t-1} : W_0). μ must be obtained from {φ_{b,t}} alone. Marginalize with respect to V_1^{t-1}:

    D_x(x_t | V_1^{t-1}, X_1^{t-1}, W_0, φ_{b,t}) = D_x(x_t | X_1^{t-1}, W_0, λ_{b,t}).    (11)

No loss of information if and only if λ_{b,t} = φ_{b,t} ∀t, so the conditional, sequential distribution of {x_t} does not depend on V_1^{t-1} (Granger non-causality).

    2.8 Mapping to I(0)

    Needed to ensure conventional inference is valid, though many inferences will be valid even if this

    reduction is not enforced. Cointegration would need to be treated in a separate set of lectures.

    2.9 Conditional factorization

Factorize the density of x_t into sets of n_1 and n_2 variables where n_1 + n_2 = n:

    x_t = (y_t′ : z_t′)′,    (12)

where the y_t are endogenous and the z_t are non-modelled.

    D_x(x_t | X_1^{t-1}, W_0, λ_{b,t}) = D_{y|z}(y_t | z_t, X_1^{t-1}, W_0, θ_{a,t}) · D_z(z_t | X_1^{t-1}, W_0, θ_{b,t})    (13)

z_t is weakly exogenous for μ if (i) μ = f(θ_{a,t}) alone; and (ii) (θ_{a,t} : θ_{b,t}) ∈ Θ_a × Θ_b.

    2.10 Constancy

Complete parameter constancy is:

    θ_{a,t} = θ_a  ∀t,    (14)

where θ_a ∈ Θ_a, so that μ is a function of θ_a: μ = f(θ_a). Then:

    ∏_{t=1}^{T} D_{y|z}(y_t | z_t, X_1^{t-1}, W_0, θ_a)    (15)

with θ_a ∈ Θ_a.


    2.11 Lag truncation

Fix the extent of the history of X_1^{t-1} in (15) at s earlier periods:

    D_{y|z}(y_t | z_t, X_1^{t-1}, W_0, θ_a) = D_{y|z}(y_t | z_t, X_{t-s}^{t-1}, W_0, θ_a).    (16)

    2.12 Functional form

Map y_t into y_t† = h(y_t) and z_t into z_t† = g(z_t), and denote the resulting data by X†. Assume that y_t† and z_t† simultaneously make D_{y|z}(·) approximately normal and homoscedastic, denoted N_{n1}[η_t, Σ]:

    D_{y|z}(y_t | z_t, X_{t-s}^{t-1}, W_0, θ_a) = D_{y†|z†}(y_t† | z_t†, X_{t-s}^{†t-1}, W_0, θ_a)    (17)

    2.13 The derived model

    A(L)h(y_t) = B(L)g(z_t) + ε_t    (18)

where ε_t ∼app N_{n1}[0, Σ], and A(L) and B(L) are polynomial matrices (i.e., matrices whose elements are polynomials) of order s in the lag operator L. ε_t is a derived, and not an autonomous, process defined by:

    ε_t = A(L)h(y_t) − B(L)g(z_t).    (19)

The reduction to the generic econometric equation involves all the stages of aggregation, marginalization, conditioning etc., transforming the parameters from ξ, which determines the stochastic features of the data, to the coefficients of the empirical model.

    2.14 Dominance

Consider two distinct scalar empirical models denoted M1 and M2 with mean-innovation processes (MIPs) {ε_t} and {ν_t} relative to their own information sets, where ε_t and ν_t have constant, finite variances σ_ε² and σ_ν² respectively. Then M1 variance dominates M2 if σ_ε² < σ_ν², denoted by M1 ≻ M2. Variance dominance is transitive, since if M1 ≻ M2 and M2 ≻ M3 then M1 ≻ M3, and anti-symmetric, since if M1 ≻ M2 then it cannot be true that M2 ≻ M1. A model without a MIP error can be variance dominated by a model with a MIP on a common data set. The DGP cannot be variance dominated in the population by any models thereof (see e.g. Theil, 1971, p. 543). Let U_{t-1} denote the universe of information for the DGP and let X_{t-1} be the subset, with associated innovation sequences {ε_{u,t}} and {ε_{x,t}}. Then as {X_{t-1}} ⊆ {U_{t-1}}, E[ε_{u,t} | X_{t-1}] = 0, whereas E[ε_{x,t} | U_{t-1}] need not be zero. A model with an innovation error cannot be variance dominated by a model which uses only a subset of the same information.

If ε_t = x_t − E[x_t | X_{t-1}], then σ_ε² is no larger than the variance of any other empirical model error defined by e_t = x_t − G[x_t | X_{t-1}], whatever the choice of G[·]. The conditional expectation is the minimum mean-square error predictor. These implications favour general rather than simple empirical models, given any choice of information set, and suggest modelling the conditional expectation. A model which nests all contending explanations as special cases must variance dominate in its class. Let model M_j be characterized by parameter vector θ_j with k_j elements; then, as in Hendry and Richard (1982):

    M1 is parsimoniously undominated in the class {M_i} if ∀i, k_1 ≤ k_i and no M_i ≻ M1.

Model selection procedures (such as AIC or the Schwarz criterion: see Judge, Griffiths, Hill, Lütkepohl and Lee, 1985) seek parsimoniously undominated models, but do not check for congruence.
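A tiny numerical illustration of variance dominance (my own, not from the text; linear projections stand in for conditional expectations, and the data-generating coefficients are made up):

```python
# The innovation variance from a larger information set cannot exceed that
# obtained from a subset of the same information.
import numpy as np

rng = np.random.default_rng(5)
T = 5000
x1 = rng.normal(size=T)
x2 = rng.normal(size=T)
y = x1 + 0.5 * x2 + rng.normal(size=T)

def resid_var(Z, y):
    b, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return (y - Z @ b).var()

small_info = resid_var(np.column_stack([np.ones(T), x1]), y)        # subset
large_info = resid_var(np.column_stack([np.ones(T), x1, x2]), y)    # full set
print(round(small_info, 2), round(large_info, 2))   # the second is smaller
```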


    2.15 Econometric concepts as measures of no information loss

    [1] Aggregation entails no loss of information on marginalizing with respect to disaggregates when the

retained information comprises a set of sufficient statistics for the parameters of interest μ.

    [2] Transformations per se do not entail any associated reduction but directly introduce the concept of

    parameters of interest, and indirectly the notions that parameters should be invariant and identifiable.

    [3] Data partition is a preliminary although the decision about which variables to include and which to

    omit is perhaps the most fundamental determinant of the success or otherwise of empirical modelling.

[4] Marginalizing with respect to v_t is without loss providing the remaining data are sufficient for μ, whereas marginalizing without loss with respect to V_1^{t-1} entails both Granger non-causality for x_t and

    a cut in the parameters.

    [5] Sequential factorization involves no loss if the derived error process is an innovation relative to the

    history of the random variables, and via the notion of common factors, reveals that autoregressive errors

    are a restriction and not a generalization.

    [6] Integrated data systems can be reduced to I(0) by suitable combinations of cointegration and differ-

    encing, allowing conventional inference procedures to be applied to more parsimonious relationships.

    [7] Conditional factorization reductions, which eliminate marginal processes, lead to no loss of in-

    formation relative to the joint analysis when the conditioning variables are weakly exogenous for the

parameters of interest.
[8] Parameter constancy implicitly relates to invariance, as constancy across interventions which affect

    the marginal processes.

    [9] Lag truncation involves no loss if the error process remains an innovation despite excluding some of

    the past of relevant variables.

    [10] Functional form approximations need involve no reduction (logs of log-normally distributed vari-

    ables): e.g. when the two densities in (17) are equal.

    [11] The derived model, as a reduction of the DGP, is nested within that DGP and its properties are

    explained by the reduction process: knowledge of the DGP entails knowledge of all reductions thereof.

    When knowledge of one model entails knowledge of another, the first is said to encompass the second.

    2.16 Implicit model design

This corresponds to the symptomatology approach in econometrics, testing for problems (autocorrela-

    tion, heteroscedasticity, omitted variables, multicollinearity, non-constant parameters etc.), and correct-

    ing these.

    2.17 Explicit model design

    Mimic reduction theory in practical research to minimize the losses due to the reductions selected: leads

    to Gets modelling.

    2.18 A taxonomy of evaluation information

Partition the data X_1^T used in modelling into the three information sets:

[a] past data;

[b] present data;

[c] future data:

    X_1^T = (X_1^{t-1} : x_t : X_{t+1}^T).    (20)


    [d] theory information, which often is the source of parameters of interest, and is a creative stimulus in

    economics;

    [e] measurement information, including price index theory, constructed identities such as consumption

    equals income minus savings, data accuracy and so on; and:

    [f] data of rival models, which could be analyzed into past, present and future in turn.

    The six main criteria which result for selecting an empirical model are:

    [a] homoscedastic innovation errors;

[b] weakly exogenous conditioning variables for the parameters of interest;
[c] constant, invariant parameters of interest;

    [d] theory consistent, identifiable structures;

    [e] data admissible formulations on accurate observations; and

    [f] encompass rival models.

    Models which satisfy the first five information sets are said to be congruent: an encompassing

    congruent model satisfies all six criteria.

    3 General-to-specific modelling

The practical embodiment of reduction is general-to-specific (Gets) modelling. The DGP is replaced by the concept of the local DGP (LDGP), namely the joint distribution of the subset of variables under

    analysis. Then a general unrestricted model (GUM) is formulated to provide a congruent approxim-

    ation to the LDGP, given the theoretical and previous empirical background. The empirical analysis

    commences from this general specification, after testing for mis-specifications, and if none are appar-

    ent, is simplified to a parsimonious, congruent representation, each simplification step being checked

    by diagnostic testing. Simplification can be done in many ways: and although the goodness of a model

    is intrinsic to it, and not a property of the selection route, poor routes seem unlikely to deliver useful

    models. Even so, some economists worry about the impact of selection rules on the properties of the

resulting models, and insist on the use of a priori specifications: but these need knowledge of the answer before we start, so deny empirical modelling any useful role; and in practice, it has rarely contributed.

    Few studies have investigated how well general-to-specific modelling does. However, Hoover and

    Perez (1999) offer important evidence in a major Monte Carlo, reconsidering the Lovell (1983) experi-

ments. They place 20 macro variables in a databank; generate one (y) as a function of 0 to 5 others; regress

    y on all 20 plus all lags thereof, then let their algorithm simplify that GUM till it finds a congruent

(encompassing) irreducible result. They check up to 10 different paths, testing for mis-specification, collect the results from each, then select one choice from the remainder. By following many paths,

    the algorithm is protected against chance false routes, and delivers an undominated congruent model.

    Nevertheless, Hendry and Krolzig (1999b) improve on their algorithm in several important respects and

    this section now describes these.

    3.1 Pre-search reductions

    First, groups of variables are tested in the order of their absolute t-values, commencing with a block

    where all the p-values exceed 0.9, and continuing down towards the pre-assigned selection criterion,

    when deletion must become inadmissible. A less-stringent significance level is used at this step, usually

    10%, since the insignificant variables are deleted permanently. If no test is significant, the F-test on all

    variables in the GUM has been calculated, establishing that there is nothing to model.
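A minimal sketch of this kind of block deletion test follows (my own simplified construction in Python, not the PcGets implementation; the design, sample size and 10% level are illustrative assumptions):

```python
# Pre-search block deletion: order regressors by |t| in the GUM, then F-test
# the block with the smallest |t|-values against zero at a loose 10% level.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
T, k, block_size = 100, 10, 5
X = rng.normal(size=(T, k))
y = 0.5 * X[:, 0] + rng.normal(size=T)           # only the first regressor matters

XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y
resid = y - X @ beta
s2 = resid @ resid / (T - k)
t_vals = beta / np.sqrt(s2 * np.diag(XtX_inv))

block = np.argsort(np.abs(t_vals))[:block_size]  # least-significant block
keep = np.setdiff1d(np.arange(k), block)
resid_r = y - X[:, keep] @ np.linalg.lstsq(X[:, keep], y, rcond=None)[0]
F = ((resid_r @ resid_r - resid @ resid) / block_size) / s2
p_value = stats.f.sf(F, block_size, T - k)
print(round(p_value, 3))   # typically well above 0.10, so the block is deleted
```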


    3.2 Additional paths

Blocks of variables constitute feasible search paths, in addition to individual coefficients, like the block

    F-tests in the preceding sub-section but along search paths. All paths that also commence with an

    insignificant t-deletion are explored.

    3.3 Encompassing

Encompassing tests select between the candidate congruent models at the end of path searches. Each contender is tested against their union, dropping those which are dominated by, and do not dominate, another contender. If a unique model results, select that; otherwise, if some are rejected, form the union

    of the remaining models, and repeat this round till no encompassing reductions result. That union

    then constitutes a new starting point, and the complete path-search algorithm repeats till the union is

    unchanged between successive rounds.

    3.4 Information criteria

    When a union coincides with the original GUM, or with a previous union, so no further feasible reduc-

    tions can be found, PcGets selects a model by an information criterion. The preferred final-selection

    rule presently is the Schwarz criterion, or BIC, defined as:

    SC = −2 log L/T + p log(T)/T,

    where L is the maximized likelihood, p is the number of parameters and T is the sample size. For

    T = 140 and p = 40, minimum SC corresponds approximately to the marginal regressor satisfying

|t| ≥ 1.9.
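As a check on that threshold (my own arithmetic, assuming a Gaussian linear regression so that −2 log L/T equals log(RSS/T) up to a constant), SC keeps a marginal regressor when log(1 + t²/(T − p)) exceeds log(T)/T:

```python
# Solve for the |t|-value at which the Schwarz criterion is indifferent to a
# marginal regressor, for T = 140 observations and p = 40 parameters.
import math

T, p = 140, 40
t_threshold = math.sqrt((T - p) * (T ** (1.0 / T) - 1.0))
print(round(t_threshold, 2))   # ~1.90, matching the value quoted above
```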

    3.5 Sub-sample reliability

For that finally-selected model, sub-sample reliability is evaluated by the Hoover-Perez overlapping split-sample test. PcGets concludes that some variables are definitely excluded; some definitely included; and some have an uncertain role, varying from a reliability of 25% (included in the final model, but insignificant overall and in both sub-samples), through to 75% (significant overall and in one

    sub-sample, or in both sub-samples).

    3.6 Significant mis-specification tests

    If the initial mis-specification tests are significant at the pre-specified level, we raise the required signi-

    ficance level, terminating search paths only when that higher level is violated. Empirical investigators

    would re-specify the GUM on rejection.

    To see why Gets does well, we develop the analytics for several of its procedures.

    4 The econometrics of model selection

    The key issue for any model-selection procedure is the cost of search, since there are always bound to

    be mistakes in statistical inference: specifically, how bad is it to search across many alternatives? The

    conventional statistical analysis of repeated testing provides a pessimistic background: every test has a

    non-zero null rejection frequency (or size, if independent of nuisance parameters), and so type I errors


    accumulate. Setting a small size for every test can induce low power to detect the influences that really

    matter.

    Critics of general-to-specific methods have pointed to a number of potential difficulties, including

    the problems of lack of identification, measurement without theory, data mining, pre-test biases,

    ignoring selection effects, repeated testing, and the potential path dependence of any selection:

    see inter alia, Faust and Whiteman (1997), Koopmans (1947), Lovell (1983), Judge and Bock (1978),

    Leamer (1978), Hendry, Leamer and Poirier (1990), and Pagan (1987). The following discussion draws

on Hendry (2000a).
Koopmans' critique followed up the earlier attack by Keynes (1939, 1940) on Tinbergen (1940a,

    1940b), and set the scene for doubting all econometric analyses that failed to commence from pre-

specified models. Lovell's study of trying to select a small relation (zero to five regressors) hidden in

    a large database (40 variables) found a low success rate, thereby suggesting that search procedures had

    high costs, and supporting an adverse view of data-based model selection. The third criticism concerned

    applying significance tests to select variables, arguing that the resulting estimator was biased in general

    by being a weighted average of zero (when the variable was excluded) and an unbiased coefficient (on

    inclusion). The fourth concerned biases in reported coefficient standard errors from treating the selected

    model as if there was no uncertainty in the choice. The next argued that the probability of retaining

    variables that should not enter a relationship would be high because a multitude of tests on irrelevant

    variables must deliver some significant outcomes. The sixth suggested that how a model was selected

    affected its credibility: at its extreme, we find the claim in Leamer (1983) that the mapping is the

    message, emphasizing the selection process over the properties of the final choice. In the face of this

    barrage of criticism, many economists came to doubt the value of empirical evidence, even to the extent

    of referring to it as a scientific illusion (Summers, 1991).

    The upshot of these attacks on empirical research was that almost all econometric studies had to

    commence from pre-specified models (or pretend they did). Summers (1991) failed to notice that this

    was the source of his claimed scientific illusion: econometric evidence had become theory dependent,

    with little value added, and a strong propensity to be discarded when fashions in theory changed. Much

    empirical evidence only depends on low-level theories which are part of the background knowledge base

(not subject to scrutiny in the current analysis), so a data-based approach to studying the economy is

    feasible. Since theory dependence has at least as many drawbacks as sample dependence, data modelling

    procedures are essential: see Hendry (1995a). Indeed, all of these criticisms are refutable, as we now

    show.

    First, identification has three attributes, as discussed in Hendry (1997), namely uniqueness, sat-

    isfying the required interpretation, and correspondence to the desired entity. A non-unique result is

    clearly not identified, so the first attribute is necessary, but insufficient, since uniqueness can be achieved

    by arbitrary restrictions (criticized by Sims, 1980, inter alia). There can exist a unique combination of

    several relationships which is incorrectly interpreted as one of those equations: e.g., a reduced form

    that has a positive price effect, wrongly interpreted as a supply relation. Finally, a unique, interpretable

model of (say) a money-demand relation may in fact correspond to a Central Bank's supply schedule, and this too is sometimes called a failure to identify the demand relation. Because economies are

    highly interdependent, simultaneity was long believed to be a serious problem, but higher frequencies of

    observation have attenuated this problem. Anyway, simultaneity is not invariant under linear transform-

ations (although linear systems are), so can be avoided by eschewing contemporaneous regressors until

    weak exogeneity is established. Conditioning ensures a unique outcome, although it cannot guarantee

    that the resulting model corresponds to the underlying reality.

    Next, Keynes appears to have believed that statistical work in economics is impossible without


    knowledge of everything in advance. But if partial explanations are devoid of use, and empirically we

    could discover nothing not already known, then no science could have progressed. That is clearly refuted

by the historical record. The fallacy in Keynes's argument is that since theoretical models are incomplete

    and incorrect, an econometrics that is forced to use such theories as the only permissible starting point

    for data analysis can contribute little useful knowledge, except perhaps rejecting the theories. When

    invariant features of reality exist, progressive research can discover them in part without prior knowledge

    of the whole: see Hendry (1995b). A similar analysis applies to the attack in Koopmans on the study

by Burns and Mitchell: he relies on the (unstated) assumption that only one sort of economic theory is applicable, that it is correct, and that it is immutable (see Hendry and Morgan, 1995).

    Data mining is revealed when conflicting evidence exists or when rival models cannot be encom-

passed; and if they can, then an undominated model results despite the inappropriate procedure. Thus,

    stringent critical evaluation renders the data mining criticism otiose. Gilbert (1986) suggests separat-

    ing output into two groups: the first contains only redundant results (those parsimoniously encompassed

    by the finally-selected model), and the second contains all other findings. If the second group is not null,

    then there has been data mining. On such a characterization, Gets cannot involve data mining, despite

    depending heavily on data basing.

Even when the LDGP is known a priori from economic theory, if an investigator did not know that the resulting model was in fact true, and so sought to test conventional null hypotheses on its coefficients, then

    inferential mistakes will occur in general. These will vary as a function of the characteristics of the

    LDGP, and of the particular data sample drawn, but for many parameter values, the selected model will

    differ from the LDGP, and hence have biased coefficients. This is the pre-test problem, and is quite

    distinct from the costs of searching across a general set of specifications for a congruent representation

    of the LDGP.

    If a wide variety of models would be reported when applying any given selection procedure to

    different samples from a common DGP, then the results using a single sample apparently understate

    the true uncertainty. Coefficient standard errors only reflect sampling variation conditional on a fixed

    specification, with no additional terms from changes in that specification (see e.g., Chatfield, 1995).

    Thus, reported empirical estimates must be judged conditional on the resulting equation being a good

    approximation to the LDGP. Undominated (i.e., encompassing) congruent models have a strong claim

    to provide such an approximation, and conditional on that, their reported uncertainty is a good measure

    of the uncertainty inherent in such a specification for the relevant LDGP.

The theory of repeated testing is easily understood: the probability p_α that none of n independent tests rejects at significance level 100α% is:

    p_α = (1 − α)^n.

When 40 tests of correct null hypotheses are conducted at α = 0.05, p_0.05 ≈ 0.13, whereas p_0.005 ≈ 0.82. However, it is difficult to obtain spurious t-test values much in excess of three despite repeated

    testing: as Sargan (1981) pointed out, the t-distribution is thin tailed, so even the 0.5% critical value is

    less than three for 50 degrees of freedom. Unfortunately, stringent criteria for avoiding rejections when

    the null is true lower the power of rejection when it is false. The logic of repeated testing is accurate

    as a description of the statistical properties of mis-specification testing: conducting four independent

    diagnostic tests at 5% will lead to about 19% false rejections. Nevertheless, even in that context, there

    are possible solutions such as using a single combined test which can substantially lower the size

    without too great a power loss (see e.g., Godfrey and Veale, 1999). It is less clear that the analysis is

a valid characterization of selection procedures in general: when no more than one path is searched, there is

    no error correction for wrong reductions. In fact, the serious practical difficulty is not one of avoiding


    spuriously significant regressors because of repeated testing when many hypotheses are tested, it is

    retaining all the variables that genuinely matter.
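The arithmetic behind the repeated-testing figures above is quickly verified (my own check, assuming independent tests):

```python
# Probability that none of n independent tests at level alpha rejects, and the
# chance of at least one false rejection from four diagnostics at 5%.
n = 40
print(round(0.95 ** n, 2))          # ~0.13 at alpha = 0.05
print(round(0.995 ** n, 2))         # ~0.82 at alpha = 0.005
print(round(1 - 0.95 ** 4, 2))      # ~0.19 for four diagnostic tests at 5%
```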

    Path dependence is when the results obtained in a modelling exercise depend on the simplification

    sequence adopted. Since the quality of a model is intrinsic to it, and progressive research induces

    a sequence of mutually-encompassing congruent models, proponents of Gets consider that the path

    adopted is unlikely to matter. As Hendry and Mizon (1990) expressed the matter: the model is the

    message. Nevertheless, it must be true that some simplifications lead to poorer representations than

others. One aspect of the value-added of the approach discussed below is that it ensures a unique outcome, so the path does not matter.

    We conclude that each of these criticisms of Gets can be refuted. Indeed, White (1990) showed that

    with sufficiently-rigorous testing, the selected model will converge to the DGP. Thus, any overfitting

    and mis-specification problems are primarily finite sample. Moreover, Mayo (1981) emphasized the

    importance of diagnostic test information being effectively independent of the sufficient statistics from

    which parameter estimates are derived. Hoover and Perez (1999) show how much better Gets is than any

    method Lovell considered, suggesting that modelling per se need not be bad. Indeed, overall, the size

    of their selection procedure is close to that expected, and the power is reasonable. Moreover, re-running

    their experiments using our version (PcGets) delivered substantively better outcomes (see Hendry and

    Krolzig, 1999b). Thus, the case against model selection is far from proved.

    4.1 Search costs

Let p_i^dgp denote the probability of retaining the i-th variable out of k when commencing from the DGP specification and applying the relevant selection test at the same significance level as the search procedure. Then 1 − p_i^dgp is the expected cost of inference. For irrelevant variables, p_i^dgp ≃ 0, so the whole cost for those is attributed to search. Let p_i^gum denote the probability of retaining the i-th variable when commencing from the GUM, and applying the same selection test and significance level. Then, the search costs are p_i^dgp − p_i^gum. False rejection frequencies of the null can be lowered by increasing

    the required significance levels of selection tests, but only at the cost of also reducing power. However,

it is feasible to lower the former and raise the latter simultaneously by an improved search algorithm, subject to the bound of attaining the same performance as knowing the DGP from the outset.

    To keep search costs low, any model-selection process must satisfy a number of requirements. First,

    it must start from a congruent statistical model to ensure that selection inferences are reliable: con-

    sequently, it must test for model mis-specification initially, and such tests must be well calibrated (nom-

    inal size close to actual). Secondly, it must avoid getting stuck in search paths that initially inadvertently

    delete relevant variables, thereby retaining many other variables as proxies: consequently, it must search

    many paths. Thirdly, it must check that eliminating variables does not induce diagnostic tests to become

    significant during searches: consequently, model mis-specification tests must be computed at every

    stage. Fourthly, it must ensure that any candidate model parsimoniously encompasses the GUM, so no

    loss of information has occurred. Fifthly, it must have a high probability of retaining relevant variables:

    consequently, a loose significance level and powerful selection tests are required. Sixthly, it must have

    a low probability of retaining variables that are actually irrelevant: consequently, this clashes with the

    fifth objective in part, but requires an alternative use of the available information. Finally, it must have

    powerful procedures to select between the candidate models, and any models derived from them, to end

    with a good model choice, namely one for which:

    L = Σ_{i=1}^{k} |p_i^dgp − p_i^gum|


    is close to zero.

    4.2 Selection probabilities

    When searching a large database for that DGP, an investigator could well retain the relevant regressors

    much less often than when the correct specification is known, in addition to retaining irrelevant variables

    in the finally-selected model. We first examine the problem of retaining significant variables commen-

    cing from the DGP, then turn to any additional power losses resulting from search.

For a regression coefficient β_i, hypothesis tests of the null H0: β_i = 0 will reject with a probability dependent on the non-centrality parameter of the test. We consider the slightly more general setting where t-tests are used to check an hypothesis, denoted t(n, ψ) for n degrees of freedom, where ψ is the non-centrality parameter, equal to zero under the null. For a critical value c_α, P(|t| ≥ c_α | H0) = α, where H0 implies ψ = 0. The following table records some approximate power calculations when one coefficient null hypothesis is tested and when four are tested, in each case precisely once.

    t-test powers
    ψ     n     α     P(|t| ≥ c_α)    P(|t| ≥ c_α)^4
    1    100   0.05       0.16            0.001
    2     50   0.05       0.50            0.063
    2    100   0.01       0.26            0.005
    3     50   0.01       0.64            0.168
    4     50   0.05       0.98            0.902
    4     50   0.01       0.91            0.686
    6     50   0.01       1.00            0.997

Thus, there is little hope of retaining variables with ψ = 1, and only a 50:50 chance of retaining a single variable with a theoretical |t| of 2 when the critical value is also 2, falling to 30:70 for a critical value of 2.6. When ψ = 3, the power of detection is sharply higher, but still leads to more than 35% mis-classifications. Finally, when ψ = 4, one such variable will almost always be retained.

However, the final column shows that the probability of retaining all four relevant variables with the given non-centrality is essentially negligible even when they are independent, except in the last few cases. Mixed cases (with different values of ψ) can be calculated by multiplying the probabilities in the fourth column (e.g., for ψ = 2, 3, 4, 6 the joint P(·) ≈ 0.15 at α = 0.01). Such combined probabilities are highly non-linear in ψ, since one is almost certain to retain all four when ψ = 6, even

    at a 1% significance level. The important conclusion is that, despite knowing the DGP, low signal-

    noise variables will rarely be retained using t-tests when there is any need to test the null; and if there

    are many relevant variables, all of them are unlikely to be retained even when they have quite large

    non-centralities.
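The table above can be reproduced, to rounding, from the non-central t-distribution; the following sketch is my own check using SciPy, with the columns matching (ψ, n, α, single-test power, power to the fourth):

```python
# Power of a two-sided t-test with n degrees of freedom, non-centrality psi and
# level alpha; the final column raises it to the fourth power, as in the table.
from scipy import stats

def power(psi, n, alpha):
    c = stats.t.ppf(1 - alpha / 2, n)
    d = stats.nct(n, psi)
    return d.sf(c) + d.cdf(-c)

for psi, n, alpha in [(1, 100, 0.05), (2, 50, 0.05), (2, 100, 0.01),
                      (3, 50, 0.01), (4, 50, 0.05), (4, 50, 0.01), (6, 50, 0.01)]:
    p1 = power(psi, n, alpha)
    print(psi, n, alpha, round(p1, 2), round(p1 ** 4, 3))
```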

    4.3 Deletion probabilities

    The most extreme case where low deletion probabilities might entail high search costs is when many

    variables are included but none actually matters. PcGets systematically checks the reducibility of the

GUM by testing simplifications up to the empty model. A one-off F-test F_G of the GUM against the null model using critical value c_γ would have size P(F_G ≥ c_γ) = γ under the null if it were the only test implemented. Consequently, path searches would only commence 100γ% of the time, and some of these

    could also terminate at the null model. Let there be k regressors in the GUM, of which n are retained


    when t-test selection is used should the null model be rejected. In general, when there are no relevant

variables, the probability of retaining no variables using t-tests with critical value c_α is:

    P(|t_i| < c_α, ∀i = 1, ..., k) = (1 − α)^k.    (21)

Combining (21) with the F_G-test, the null model will be selected with approximate probability:

    p_G = (1 − γ) + γ(1 − α)^k,    (22)

where the second term approximates the probability of F_G rejecting yet no regressors being retained (conditioning on F_G ≥ c_γ cannot decrease the probability of at least one rejection). Since γ is set at quite a high value, such as 0.20, whereas α = 0.05 is more usual, F_G ≥ c_0.20 can occur without any |t_i| ≥ c_0.05. Evaluating (22) for γ = 0.20, α = 0.05 and k = 20 yields p_G ≈ 0.87; whereas the re-run of the Hoover-Perez experiments with k = 40 reported by Hendry and Krolzig (1999b) using α = 0.01 yielded 97.2% in the Monte Carlo as against a theory prediction from (22) of 99%. Alternatively, when γ = 0.1 and α = 0.01, (22) has an upper bound of 96.7%, falling to 91.3% for α = 0.05. Thus, it is relatively easy

    to obtain a high probability of locating the null model, even when 40 irrelevant variables are included,

    using relatively tight significance levels, or a reasonable probability for looser significance levels.
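Equation (22) is easy to evaluate directly; the sketch below (my own check) reproduces the figures quoted in this sub-section:

```python
# p_G: approximate probability of selecting the null model when no regressor
# matters, combining the one-off F-test (size gamma) with k t-tests (size alpha).
def p_null(gamma, alpha, k):
    return (1 - gamma) + gamma * (1 - alpha) ** k

print(round(p_null(0.20, 0.05, 20), 2))   # ~0.87
print(round(p_null(0.10, 0.01, 40), 3))   # ~0.967
print(round(p_null(0.10, 0.05, 40), 3))   # ~0.913
```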

    4.4 Path selection probabilities

    We now calculate how many spurious regressors will be retained in path searches. The probability

distribution of one or more null coefficients being significant in pure t-test selection at significance level α is given by the k + 1 terms of the binomial expansion of:

    (α + (1 − α))^k.

The following table illustrates by enumeration for k = 3:

    event                                       probability       number retained
    P(|t_i| < c_α, i = 1, ..., 3)               (1 − α)^3                0
    P(|t_i| ≥ c_α, |t_j| < c_α, j ≠ i)          3α(1 − α)^2              1
    P(|t_i| < c_α, |t_j| ≥ c_α, j ≠ i)          3α^2(1 − α)              2
    P(|t_i| ≥ c_α, i = 1, ..., 3)               α^3                      3

Thus, for k = 3, the average number of variables retained is:

    n̄ = 3α^3 + 2 × 3α^2(1 − α) + 3α(1 − α)^2 = 3α = kα.

The result n̄ = kα is general. When α = 0.05 and k = 40, n̄ equals 2, falling to 0.4 for α = 0.01: so

    even if only t-tests are used, few spurious variables will be retained.

Combining the probability of a non-null model with the number of variables selected when the GUM F-test rejects:

    p = γα,

(where p is the probability any given variable will be retained), which does not depend on k. For γ = 0.1, α = 0.01, we have p = 0.001. Even for γ = 0.25 and α = 0.05, p = 0.0125 before search paths and diagnostic testing are included in the algorithm. The actual behaviour of PcGets is much more complicated than this, but can deliver a small overall size. Following the event F_G ≥ c_γ when γ = 0.1 (so the null is incorrectly rejected 10% of the time), and approximating by 0.5 variables retained when


    that occurs, then the average non-deletion probability (i.e., the probability any given variable will be

retained) is p_r = γn̄/k = 0.125%, as against the reported value of 0.19% found by Hendry and Krolzig

    (1999b). These are very small retention rates of spuriously-significant variables.

    Thus, in contrast to the relatively high costs of inference discussed in the previous section, those

    of search arising from retaining additional irrelevant variables are almost negligible. For a reasonable

    GUM with (say) 40 variables where 25 are irrelevant, even without the pre-selection and multiple path

    searches of PcGets, and using just t-tests at 5%, roughly one spuriously significant variable will be

    retained by chance. Against that, from the previous section, there is at most a 50% chance of retainingeach of the variables that have non-centralites around 2, and little chance of keeping them all: the

    difficult problem is retention of relevance, not elimination of irrelevance. The only two solutions are

    better inference procedures, or looser critical values; we will consider them both.
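A small Monte Carlo sketch (my own illustrative setup, not one of the experiments reported later) confirms the n̄ = kα claim for pure t-test selection on irrelevant regressors:

```python
# With k irrelevant orthogonal regressors, pure t-test selection at level alpha
# retains about k*alpha variables on average.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
T, k, alpha, reps = 100, 40, 0.05, 500
crit = stats.t.ppf(1 - alpha / 2, T - k)
retained = []
for _ in range(reps):
    X = rng.normal(size=(T, k))
    y = rng.normal(size=T)                     # no regressor is relevant
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    s2 = resid @ resid / (T - k)
    t_vals = beta / np.sqrt(s2 * np.diag(XtX_inv))
    retained.append(int(np.sum(np.abs(t_vals) >= crit)))
print(np.mean(retained))                       # close to k*alpha = 2
```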

    4.5 Improved inference procedures

An inference procedure involves a sequence of steps. As a simple example, consider a procedure comprising two F-tests: the first is conducted at the γ = 50% level, the second at α = 5%. The variables to be tested are first ordered by their t-values in the GUM, such that t_1^2 ≤ t_2^2 ≤ ... ≤ t_k^2, and the first F-test adds in variables from the smallest observed t-values till a rejection would occur, with either F_1 > c_γ or an individual |t| > c̄ (say). All those variables except the last are then deleted from the model, and a second F-test conducted of the joint null that the remaining variables are all irrelevant. If that rejects, so F_2 > c_α, all the remaining variables are retained; otherwise, all are eliminated. We will now analyze the probability properties of this 2-step test when all k regressors are orthogonal for a regression model estimated from T observations.

    Once m variables are included in the first step, non-rejection requires that (a) the diagnostics are

insignificant; (b) m − 1 variables did not induce rejection; (c) |t_m| < c̄; and (d):

    F_1(m, T − k) ≃ (1/m) Σ_{i=1}^{m} t_i^2 ≤ c_γ.    (23)

Clearly, any t_i^2 ≤ 1 reduces the mean F_1 statistic, and since P(|t_i| < 1) ≈ 0.68, when k = 40 approximately 28 variables fall in that group; and P(|t_i| ≥ 1.65) ≈ 0.1, so only 4 variables should chance to have a larger |t_i| value on average. In the conventional setting where α = 0.05 with P(|t_i| < 2) ≈ 0.95, only 2 variables will chance to have larger t-values, whereas slightly more than half will have t_i^2 < 0.5 or smaller. Since P(F_1(20, 100) < 1 | H0) ≈ 0.53, a first step with γ = 0.5 should eliminate all variables with t_i^2 ≤ 1, and some larger t-values as well (hence the need to check that |t_m| < c̄); below we explain why collinearity between variables that matter and those that do not should not jeopardize this step.

    A crude approximation to the likely value of (23) under H0 is to treat all t-values within blocks as

having a value equal to the mid-point. We use the five ranges t_i^2 < 0.5, 1, 1.65^2, 4, and greater than 4, using the expected numbers falling in each of the first four blocks, which yields:

    F_1(38, 100) ≈ (1/38)[0.25 × 20 + 0.75 × 8 + 1.33^2 × 8 + 1.82^2 × 2] = 31.8/38 ≈ 0.84,

noting P(F_1(38, 100) ≥ 0.84 | H0) ≈ 0.72 (setting all t's equal to the upper bound of each block yields an illustrative upper bound of about 1.3 for F_1). Thus, surprisingly large values of γ, such as 0.75, can be selected for this step yet have a high probability of eliminating almost all the irrelevant variables. Indeed, using γ = 0.75 entails c_γ ≈ 0.75 when m = 20, since:

    P(F_1(20, 100) ≥ 0.75 | H0) ≈ 0.75,


or c_γ ≈ 0.8 for m = 30.

When the second F-test is invoked for a null model, it will falsely reject more than α% of the
time, since all small t-values have already been eliminated, but the resulting model will still be tiny
in comparison to the GUM. Conversely, this procedure has a much higher probability of retaining a
block of relevant variables. For example, commencing with 40 regressors of which m = 35 (say) were
correctly eliminated, should the 5 remaining variables all have expected t-values of two (the really
difficult case in section 4.2) then:

    E[F2(5, 100)] ≈ (1/5) Σ_{i=36}^{40} E[t²_i] ≈ 4.    (24)

When α = 0.05, c_α ≈ 2.3 and:

    P( F2(5, 100) ≥ 2.3 | E[F2] = 4 ) > 0.99

(using a non-central χ² approximation to F2), thereby almost always retaining all five relevant
variables. This is obviously a dramatic improvement over the near-zero probability of retaining all five

    variables using t-tests on the DGP in section 4.2. Practical usage of PcGets suggests its operational

    characteristics are well described by this analysis.
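To illustrate the analysis, the two-step procedure can also be simulated. The sketch below is a simplified stand-in rather than the PcGets algorithm itself: it assumes T = 100 and k = 40 orthogonal regressors of which the first five have population t-values of about 2, uses γ = 0.75 and α = 0.05, and omits the diagnostic checks; the unconditional retention frequencies it produces will therefore differ from the conditional calculations above.

    # Monte Carlo sketch of the two-step F-test procedure: step 1 deletes the block
    # of smallest squared t-values at a loose level gamma (stopping early if an
    # individual |t| exceeds c_alpha), step 2 keeps or drops the survivors as a block.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    T, k, reps = 100, 40, 2000
    gamma, alpha = 0.75, 0.05
    beta = np.zeros(k)
    beta[:5] = 2 / np.sqrt(T)                     # population |t| of about 2

    kept_all_five, irrelevant_kept = 0, 0
    for _ in range(reps):
        X = rng.standard_normal((T, k))
        y = X @ beta + rng.standard_normal(T)
        b, rss = np.linalg.lstsq(X, y, rcond=None)[:2]
        dof = T - k
        s2 = rss[0] / dof
        se = np.sqrt(s2 * np.diag(np.linalg.inv(X.T @ X)))
        t2 = (b / se) ** 2
        c_ind = stats.t.ppf(1 - alpha / 2, dof)   # individual |t| safeguard
        order = np.argsort(t2)                    # smallest squared t-values first
        m = 0                                     # number of variables deleted in step 1
        while m < k:
            c_gamma = stats.f.ppf(1 - gamma, m + 1, dof)
            if t2[order[:m + 1]].mean() > c_gamma or t2[order[m]] > c_ind ** 2:
                break
            m += 1
        survivors = order[m:]
        keep = False
        if survivors.size:                        # step 2: block F-test at level alpha
            c_alpha = stats.f.ppf(1 - alpha, survivors.size, dof)
            keep = t2[survivors].mean() > c_alpha
        kept = set(survivors.tolist()) if keep else set()
        kept_all_five += {0, 1, 2, 3, 4} <= kept
        irrelevant_kept += len(kept - {0, 1, 2, 3, 4})

    print("P(retain all five relevant variables):", kept_all_five / reps)
    print("average irrelevant variables retained:", irrelevant_kept / reps)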

    5 PcGets

    PcGets attempts to meet all of the criteria in section 4. First, it always starts from a congruent general lin-

    ear, dynamic statistical model, using a battery of mis-specification tests to ensure congruence. Secondly,

    it is recommended that the GUM have near orthogonal, non-integrated regressors so test outcomes are

relatively orthogonal. Then PcGets conducts pre-selection tests at a loose significance level (25% or
10%, say) to remove variables that are highly irrelevant, and simplifies the model to be searched
accordingly by eliminating those variables. It then explores multiple selection paths, each of which
begins by eliminating one or more statistically-insignificant variables, with diagnostic tests checking
the validity of all reductions, thereby ensuring a congruent final model. Path searches continue till no
further reductions are feasible, or a diagnostic test rejects. All the viable terminal selections resulting from

    these search paths are stored. If there is more than one terminal model, parsimonious encompassing

    tests are conducted of each against their union to eliminate models that are dominated and do not dom-

inate any others. If a unique outcome does not result, the search procedure is repeated from the

    new union. Finally, if mutually-encompassing contenders remain, information criteria are used to select

    between these terminal reductions. Additionally, sub-sample significance is used to assess the reliability

    of the resulting model choice. For further details, see e.g., Hendry and Krolzig (1999b).

    There is little research on how to design model-search algorithms in econometrics. The search

    procedure must have a high probability of retaining variables that do matter in the LDGP, and eliminating

    those that do not. To achieve that goal, PcGets uses encompassing tests between alternative reductions.

    Balancing the objectives of small size and high power still involves a trade-off, but one that is dependent

    on the algorithm: the upper bound is probably determined by the famous lemma in Neyman and Pearson

    (1928). Nevertheless, to tilt the size-power balance favourably, sub-sample information is also exploited,

building on the further development in Hoover and Perez (1999) of investigating split samples for significance

    (as against constancy). Since non-central t-values diverge with increasing sample size, whereas central

t-values fluctuate around zero, the latter have a low probability of exceeding any given critical value in
two sub-samples, even when those samples overlap. Thus, adventitiously-significant variables may be

    revealed by their insignificance in one or both of the sub-samples.
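The effect is easy to quantify. The sketch below assumes, purely for illustration, two non-overlapping half-samples of a T = 100 sample, a 5% two-sided criterion, and a full-sample population |t| of 4 for the relevant variable (so roughly 4/√2 per half-sample).

    # Sketch: probability of being significant in both half-samples for a central
    # t-statistic versus a non-central one (independent halves assumed here).
    from scipy import stats

    T, alpha = 100, 0.05
    T_half = T // 2
    c = stats.t.ppf(1 - alpha / 2, T_half)        # half-sample critical value

    def p_significant(noncentrality):
        # two-sided rejection probability for a (non-)central t with T_half dof
        return (1 - stats.nct.cdf(c, T_half, noncentrality)
                + stats.nct.cdf(-c, T_half, noncentrality))

    psi_half = 4.0 / 2 ** 0.5                     # population |t| shrinks with sqrt(sample size)
    print("irrelevant variable:", p_significant(0.0) ** 2)      # about 0.0025
    print("relevant variable  :", p_significant(psi_half) ** 2)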


    PcGets embodies some further developments. First, PcGets undertakes pre-search simplification

    F-tests to exclude variables from the general unrestricted model (GUM), after which the GUM is re-

    formulated. Since variables found to be irrelevant on such tests are excluded from later analyses, this

    step uses a loose significance level (such as 10%). Next, many possible paths from that GUM are in-

vestigated: reduction paths considered include both multiple and single deletions, so t and/or F

    test statistics are used as simplification criteria. The third development concerns the encompassing step:

    all distinct contending valid reductions are collected, and encompassing is used to test between these

(usually non-nested) specifications. Models which survive encompassing are retained; all encompassed equations are rejected. If multiple models survive this testimation process, their union forms a new

    general model, and selection path searches recommence. Such a process repeats till a unique contender

    emerges, or the previous union is reproduced, then stops. Fourthly, the diagnostic tests require careful

    choice to ensure they characterize the salient attributes of congruency, are correctly sized, and do not

    overly restrict reductions. A further improvement concerns model choice when mutually-encompassing

    distinct models survive the encompassing step. A minimum standard error rule, as used by Hoover and

Perez (1999), will probably over-select as it corresponds to retaining all variables with |t| > 1. Instead, we employ information criteria which penalize the likelihood function for the number of parameters. Fi-

    nally, sub-sample information is used to accord a reliability score to variables, which investigators may

    use to guide their model choice. In Monte Carlo experiments, a progressive research strategy (PRS)

    can be formulated in which decisions on the final model choice are based on the outcomes of such

reliability measures.

    5.1 The multi-path reduction process of PcGets

    The starting point for Gets model-selection is the general unrestricted model, so the key issues concern

    its specification and congruence. The larger the initial regressor set, the more likely adventitious effects

    will be retained; but the smaller the GUM, the more likely key variables will be omitted. Further, the

    less orthogonality between variables, the more confusion the algorithm faces, leading to a proliferation

of mutually-encompassing models, where final choices may only differ marginally (e.g., lag 2 versus 1).1

Finally, the initial specification must be congruent, with no mis-specification tests failed at the outset. Empirically, the GUM would be revised if such tests rejected, and little is known about the consequences

    of doing so (although PcGets will enable such studies in the near future). In Monte Carlo experiments,

    the program automatically changes the significance levels of such tests.

    The reduction path relies on a classical, sequential-testing approach. The number of paths is in-

    creased to try all single-variable deletions, as well as various block deletions from the GUM. Different

critical values can be set for multiple and single selection tests, and for diagnostic tests. Denote by δ
the significance level for the mis-specification tests (diagnostics) and by α the significance level for the
selection t-tests (we ignore F tests for the moment). The corresponding p-values of these are denoted
p̂_δ and p̂_α, respectively. During the specification search, the current specification is simplified only if
no diagnostic test rejects its null. This corresponds to a likelihood-based model evaluation, where the
likelihood function of model M is given by the density:

    L_M(θ_M) = f_M(Y; θ_M)   if   min p̂_δ(Y; θ_M) ≥ δ.

A relevant variable should remain significant in both sub-samples of size T1, whereas the opposite holds
for the jth if there is not too much sample overlap. Consequently, a progressive research strategy (shown
as PRS below) can gradually eliminate

    adventitiously-significant variables. Hoover and Perez (1999) found that by adopting a progressive

search procedure (as in Stage III), the number of spurious regressors can be lowered (inducing a lower

    overall size), without losing much power. Details of the resulting algorithm are shown in Table 2.

    5.2 Settings in PcGets

The testimation process of PcGets depends on the choice of:

  • the n diagnostic checks in the test battery;
  • the parameters of these diagnostic tests;
  • the significance levels of the n diagnostics;
  • pre-search F-test simplification;
  • the significance levels of such tests;
  • the simplification tests (t and/or F);
  • the significance levels of the simplification tests;
  • the significance levels of the encompassing tests;
  • the sub-sample split;
  • the significance levels of the sub-sample tests;
  • the weights accorded to measure reliability.

    The choice of mis-specification alternatives determines the number and form of the diagnostic tests.

    Their individual significance levels in turn determine the overall significance level of the test battery.

Since significant diagnostic-test values terminate search paths, they act as constraints on moving away from the GUM.

3 The information criteria are defined as follows:

    AIC = -2 log L/T + 2n/T,
    SC  = -2 log L/T + n log(T)/T,
    HQ  = -2 log L/T + 2n log(log(T))/T,

where L is the maximized likelihood, n is the number of parameters and T is the sample size: see Akaike (1985), Schwarz
(1978), and Hannan and Quinn (1979).
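These formulae translate directly into code; a minimal sketch (the log-likelihood, parameter count, and sample size in the example call are placeholders):

    # Information criteria from footnote 3: each penalizes -2 log L / T by a term
    # in the number of parameters n and the sample size T.
    import math

    def info_criteria(loglik, n, T):
        base = -2.0 * loglik / T
        return {"AIC": base + 2.0 * n / T,
                "SC":  base + n * math.log(T) / T,
                "HQ":  base + 2.0 * n * math.log(math.log(T)) / T}

    print(info_criteria(loglik=-140.2, n=6, T=100))   # illustrative values only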


Table 2   Additions to the basic PcGets algorithm.

Stage 0

(1) Pre-simplification and testing of the GUM
    (a) If a diagnostic test fails for the GUM, the significance level of that test is adjusted, or the
        test is excluded from the test battery during simplifications of the GUM;
    (b) if all variables are significant, the GUM is the final model, and the algorithm stops;
    (c) otherwise, F-tests of sets of individually-insignificant variables are conducted:
        (i) if one or more diagnostic tests fails, that F-test reduction is cancelled, and the algorithm
            returns to the previous step;
        (ii) if all diagnostic tests are passed, the blocks of variables that are insignificant are re-
            moved and a simpler GUM specified;
        (iii) if all diagnostic tests are passed, and all blocks of variables are insignificant, the null
            model is the final model.

Stage III

(1) Post-selection sub-sample evaluation
    (a) Test the significance of every variable in the final model from Stage II in two overlapping
        sub-samples (e.g., the first and last r%):
        (i) if a variable is significant overall and in both sub-samples, accord it 100% reliable;
        (ii) if a variable is significant overall and in only one sub-sample, accord it 75% reliable;
        (iii) if a variable is significant overall but in neither sub-sample, accord it 50% reliable;
        (iv) if a variable is insignificant overall but significant in both sub-samples, accord it 50% reliable;
        (v) if a variable is insignificant overall and significant in only one sub-sample, accord it 25% reliable;
        (vi) if a variable is insignificant overall and significant in neither sub-sample, accord it 0% reliable.
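The Stage III scoring rule amounts to a small lookup over the three significance outcomes; a minimal sketch of the rule as stated in Table 2:

    # Reliability score from Table 2, Stage III: full-sample significance plus the
    # number of sub-samples (0, 1 or 2) in which the variable is significant.
    def reliability(sig_full: bool, sig_sub1: bool, sig_sub2: bool) -> float:
        n_sub = int(sig_sub1) + int(sig_sub2)
        if sig_full:
            return {2: 1.00, 1: 0.75, 0: 0.50}[n_sub]
        return {2: 0.50, 1: 0.25, 0: 0.00}[n_sub]

    print(reliability(True, True, False))   # significant overall and in one sub-sample: 0.75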

Thus, if a search is to progress towards an appropriate simplification, such tests must
be well focused and have the correct size. The pre-search tests were analyzed above, as were the path
searches. The choices of critical values for pre-selection, selection and encompassing tests are important
for the success of PcGets: the tighter the size, the fewer the spurious inclusions of irrelevant variables,
but the more the false exclusions of relevant variables. In the final analysis, the calibration of PcGets
depends on the characteristics valued by the user: if PcGets is employed as a first pre-selection step in
a user's research agenda, the optimal values of α, γ, and δ may be higher than when the focus is on
controlling the overall size of the selection process. The non-expert user settings reflect this.

    In section 6, we will use simulation techniques to investigate the calibration of PcGets for the op-

    erational characteristics of the diagnostic tests, the selection probabilities of DGP variables, and the

    deletion probabilities of non-DGP variables. However, little research has been undertaken to date to

    optimize any of the choices, or to investigate the impact on model selection of their interactions.

    5.3 Limits to PcGets

    Davidson and Hendry (1981, p.257) mentioned four main problems in the general-to-specific method-

    ology: (i) the chosen general model can be inadequate, comprising a very special case of the DGP;

    (ii) data limitations may preclude specifying the desired relation; (iii) the non-existence of an optimal

    sequence for simplification leaves open the choice of reduction path; and (iv) potentially-large type-II


    error probabilities of the individual tests may be needed to avoid a high type-I error of the overall se-

    quence. By adopting the multiple path development of Hoover and Perez (1999), and implementing a

range of important improvements, PcGets overcomes many of the problems associated with points (iii) and

    (iv). However, the empirical success of PcGets must depend crucially on the creativity of the researcher

in specifying the general model and the feasibility of estimating it from the available data, aspects bey-

    ond the capabilities of the program, other than the diagnostic tests serving their usual role of revealing

    model mis-specification.

There is a central role for economic theory in the modelling process: in prior specification, prior simplification, and suggesting admissible data transforms. The first of these relates to the inclusion of

    potentially-relevant variables, the second to the exclusion of irrelevant effects, and the third to the ap-

    propriate formulations in which the influences to be included are entered, such as log or ratio transforms

    etc., differences and cointegration vectors, and any likely linear transformations that might enhance

    orthogonality between regressors. The LSE approach argued for a close link of theory and model,

    and explicitly opposed running regressions on every variable on the database as in Lovell (1983) (see

    e.g., Hendry and Ericsson, 1991a). PcGets currently focuses on general-to-simple reductions for linear,

    dynamic, regression models, and economic theory often provides little evidence for specifying the lag

    lengths in empirical macro-models. Even when the theoretical model is dynamic, the lags are usually

    chosen either for analytical convenience (e.g., first-order differential equation systems), or to allow for

    certain desirable features (as in the choice of a linear second-order single-equation model to replicate

    cycles). Therefore, we adopt the approach of starting with an unrestricted rational-lag model with a

    maximal lag length set according to available evidence (e.g., as 4 or 5 for quarterly time series, to allow

    for seasonal dynamics). Prior analysis remains essential for appropriate parameterizations; functional

    forms; choice of variables; lag lengths; and indicator variables (including seasonals, special events, etc.).

    Orthogonalization helps notably in selecting a unique representation; as does validly reducing the initial

GUM. The present performance of PcGets on previously-studied empirical problems is impressive, even
when the GUM is specified in highly inter-correlated, and probably non-stationary, levels. Hopefully,
PcGets' support in automating the reduction process will enable researchers to concentrate their efforts

    on designing the GUM: that could again significantly improve the empirical success of the algorithm.

    5.3.1 Collinearity

    Perfect collinearity denotes an exact linear dependence between variables; perfect orthogonality denotes

    no linear dependencies. However, any state in between these is both harder to define and to measure

    as it depends on which version of a model is inspected. Most econometric models contain subsets of

    variables that are invariant to linear transformations, whereas measures of collinearity are not invariant:

    if two standardized variables x and z are nearly perfectly correlated, each can act as a close proxy for

the other, yet x + z and x - z are almost uncorrelated. Moreover, observed correlation matrices are not reliable indicators of potential problems in determining if either or both variables should enter a model:

    the source of their correlation matters. For example, inter-variable correlations of 0.9999 easily arise

    in systems with unit roots and drift, but there is no difficulty determining the relevance of variables.

Conversely, in the simple bivariate normal:

    (x_t, z_t)′ ∼ IN_2[ (0, 0)′, Ω ]   with   Ω = (1, ρ; ρ, 1),    (25)

where we are interested in the DGP:

    y_t = x_t + z_t + ε_t    (26)


(for a well-behaved ε_t, say), when ρ = 0.9999 there would be almost no hope of determining which

    variables mattered in (26), even if the DGP formulation were known. In economic time series, however,

    the former case is common, whereas (25) is almost irrelevant (although it might occur when trying to

    let estimation determine which of several possible measures of a variable is best). Transforming the

    variables to a near orthogonal representation before modelling would substantially resolve this prob-

    lem, but otherwise, eliminating one of the two variables seems inevitable. Of course, which is dropped

    depends on the vagaries of sampling, and that might be thought to induce considerable unmeasured

    uncertainty, as the chosen model oscillates between retaining xt or zt. However, either variable indi-vidually is a near-perfect proxy for the dependence of yt on xt + zt, and so long as the entire system

    remains constant, selecting either, or the appropriate sum, does not actually increase the uncertainty

    greatly. That remains true even when one of the variables is irrelevant, although then the multiple-path

    search is highly likely to select the correct equation. And if the system is not constant, the collinearity

    will be broken.

Nevertheless, the outcome of a Monte Carlo model-selection study of (26) given (25) when ρ =

    0.9999 might suggest that model uncertainty was large and coefficient estimates badly biased

    simply because different variables were retained in different replications. The appropriate metric is to

    see how well xt + zt is captured. In some cases, models are estimated to facilitate economic policy,

and in such a collinear setting, changing only one variable will not have the anticipated outcome,

    although it will end the collinearity and so allow precise estimates of the separate effects. Transforming

    the variables to a near orthogonal representation before modelling is assumed to have occurred in the

    remainder of the chapter. By having a high probability of selecting the LDGP in such an orthogonal

    setting, the reported uncertainties (such as estimated coefficient standard errors) in PcGets are not much

    distorted by selection effects.
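The argument can be checked by simulation. The sketch below mimics (25)-(26) with ρ = 0.9999 and, as a deliberately crude stand-in for model selection, keeps whichever single regressor fits best: the variable retained flips across replications, yet the fitted values track x_t + z_t almost perfectly.

    # Monte Carlo sketch of the collinearity example: with rho near one, either x or
    # z alone is a near-perfect proxy for x + z, so which one is "selected" hardly
    # affects how well the combination is captured.
    import numpy as np

    rng = np.random.default_rng(1)
    T, rho, reps = 100, 0.9999, 500
    chol = np.linalg.cholesky(np.array([[1.0, rho], [rho, 1.0]]))

    picked_x, fit_corr = 0, []
    for _ in range(reps):
        x, z = (rng.standard_normal((T, 2)) @ chol.T).T
        y = x + z + rng.standard_normal(T)        # DGP (26)
        bx = (x @ y) / (x @ x)                    # fit on x alone
        bz = (z @ y) / (z @ z)                    # fit on z alone
        rss_x = np.sum((y - bx * x) ** 2)
        rss_z = np.sum((y - bz * z) ** 2)
        yhat = bx * x if rss_x < rss_z else bz * z
        picked_x += rss_x < rss_z
        fit_corr.append(np.corrcoef(yhat, x + z)[0, 1])

    print("share of replications retaining x :", picked_x / reps)    # near one half
    print("mean correlation of fit with x + z:", np.mean(fit_corr))  # near one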

    5.4 Integrated variables

    To date, PcGets conducts all inferences as I(0). Most selection tests will in fact be valid even when the

    data are I(1), given the results in, say, Sims, Stock and Watson (1990). Only t- or F-tests for an effect

    that corresponds to a unit root require non-standard critical values. The empirical examples on I(1) dataprovided below do not reveal problems, but in principle it would be useful to implement cointegration

    tests and appropriate transformations after stage 0, and prior to stage I reductions.

    Similarly, Wooldridge (1999) shows that diagnostic tests on the GUM (and presumably simplifica-

    tions thereof) remain valid even for integrated time series.

    6 Some Monte Carlo results

    6.1 Aim of the Monte Carlo

    Although the sequential nature of PcGets and its combination of variable-selection and diagnostic test-

    ing has eluded most attempts at theoretical analysis, the properties of the PcGets model-selection process

    can be evaluated in Monte Carlo (MC) experiments. In the MC considered here, we aim to measure the

    size and power of the PcGets model-selection process, namely the probability of inclusion in the

    final model of variables that do not (do) enter the DGP.

    First, the properties of the diagnostic tests under the potential influence of nuisance regressors are

    investigated. Based on these results, a decision can be made as to which diagnostics to include in the test

battery. Then the size and power of PcGets are compared to the empirical and theoretical properties


of a classical t-test. Finally, we analyze how the success and failure of PcGets are affected by the

    choice of: (i) the significance levels of the diagnostic tests; and (ii) the significance levels of the

    specification tests.

    6.2 Design of the Monte Carlo

    The Monte Carlo simulation study of Hoover and Perez (1999) considered the Lovell database, which

    embodies many dozens of relations between variables as in real economies, and is of the scale and com-

    plexity that can occur in macro-econometrics: the rerun of those experiments using PcGets is discussed

in Hendry and Krolzig (1999b). In this paper, we consider a simpler experiment which, however, allows

    an analytical assessment of the simulation findings. The Monte Carlo reported here uses only stages I

    and II in table 1: Hendry and Krolzig (1999b) show the additional improvements that can result from

    adding stages 0 and III to the study in Hoover and Perez (1999).

    The DGP is a Gaussian regression model, where the strongly-exogenous variables are Gaussian

    white-noise processes:

    y_t = Σ_{k=1}^{5} β_{k,0} x_{k,t} + ε_t,    ε_t ∼ IN[0, 1],    (27)

    x_t = v_t,    v_t ∼ IN_10[0, I_10]    for t = 1, . . . , T,

where β_{1,0} = 2/√T, β_{2,0} = 3/√T, β_{3,0} = 4/√T, β_{4,0} = 6/√T, β_{5,0} = 8/√T.

    The GUM is an ADL(1, 1) model which includes as non-DGP variables the lagged endogenous

variable y_{t-1}, the strongly-exogenous variables x_{6,t}, . . . , x_{10,t} and the first lags of all regressors:

    y_t = β_{0,1} y_{t-1} + Σ_{k=1}^{10} Σ_{i=0}^{1} β_{k,i} x_{k,t-i} + β_{0,0} + u_t,    u_t ∼ IN[0, σ²].    (28)

    The sample size T is 100 or 1000 and the number of replications M is 1000.

The orthogonality of the regressors allows an easier analysis. Recall that the t-test of the null β_k = 0
versus the alternative β_k ≠ 0 is given by:

    t_k = β̂_k / σ̂(β̂_k) = β̂_k / √( σ̂² (X′X)⁻¹_kk ).

The population value of the t-statistic is:

    t*_k = β_k / σ(β̂_k) = β_k / √( T⁻¹ σ² Q⁻¹_kk ),

where the moment matrix Q = lim_{T→∞} T⁻¹ X′X is assumed to exist. Since the regressors are
orthogonal, we have that β̂_k = x_k′y / (x_k′x_k) and σ²(β̂_k) = σ²/(T σ²_k), so:

    t*_k = β_k / σ(β̂_k) = √T β_k σ_k / σ.

Thus the non-zero population t-values are 2, 3, 4, 6, 8. In (28), 17 of 22 regressors are nuisance.
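For concreteness, the data for this design can be generated as follows; the sketch covers only the simulation of (27) and the construction of the GUM regressors in (28), not the selection algorithm itself.

    # Sketch of the Monte Carlo design: DGP (27) with population t-values 2, 3, 4,
    # 6, 8, and the GUM (28) with 22 regressors of which 17 are nuisance.
    import numpy as np

    def simulate_dgp(T, rng):
        betas = np.array([2.0, 3.0, 4.0, 6.0, 8.0]) / np.sqrt(T)   # beta_{k,0}
        x = rng.standard_normal((T, 10))                           # x_t = v_t ~ IN_10[0, I_10]
        y = x[:, :5] @ betas + rng.standard_normal(T)              # epsilon_t ~ IN[0, 1]
        return y, x

    def build_gum(y, x):
        # Regressors of (28): y_{t-1}, x_{k,t} and x_{k,t-1} for k = 1..10, intercept.
        T = len(y)
        X = np.column_stack([y[:-1], x[1:], x[:-1], np.ones(T - 1)])
        return y[1:], X

    rng = np.random.default_rng(42)
    y, x = simulate_dgp(T=100, rng=rng)
    y_gum, X_gum = build_gum(y, x)
    print(X_gum.shape)   # (99, 22)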


    6.3 Evaluation of the Monte Carlo

    The evaluation of Monte Carlo experiments always involves measurement problems: see Hendry (1984).

    A serious problem here is that, with some positive probability, the GUM and the truth will get re-

    jected ab initio on diagnostic tests. Tests are constructed to have non-zero nominal size under their null,

so sometimes the truth will be rejected: and the more often, the more tests are used. Three possible

    strategies suggest themselves: one rejects that data sample, and randomly re-draws; one changes the

    rejection level of the offending test; or one specifies a more general GUM which is congruent. We

    consider these alternatives in turn.

Hoover and Perez (1999) use a criterion of two significant test rejections to discard a sample and re-
draw, which probably slightly favours the performance of Gets. In our Monte Carlo with PcGets, the

    problem is solved by endogenously adjusting the significance levels of tests that reject the GUM (e.g.,

    1% to 0.1%). Such a solution is feasible in a Monte Carlo, but metaphysical in reality, as one could

    never know that a sample from an economy was unrepresentative, since time series are not repeatable.

    Thus, an investigator could never know that the DGP was simpler empirically than the data suggest

    (although such a finding might gradually emerge in a PRS), and so would probably generalize the initial

    GUM. We do not adopt that solution here, partly because of the difficulties inherent in the constructive

use of diagnostic-test rejections, and partly because it is moot whether the PcGets algorithm fails by
overfitting on such aberrant samples when, in a non-replicable world, one would conclude that such

    such alternatives, would also lead to rejection in this setting, unless the investigator knew the truth, and

    knew that she knew it, so no tests were needed. While more research is needed on cases where the DGP

    would be rejected against the GUM, here we allow PcGets to adjust significance levels endogenously.

    Another major decision concerns the basis of comparison: the truth seems to be a natural choice,

    and both Lovell (1983) and Hoover and Perez (1999) measure how often the search finds the DGP

    exactly or nearly. Nevertheless, we believe that finding the DGP exactly is not a good choice of

    comparator, because it implicitly entails a basis where the truth is known, and one is certain that it is

    the truth. Rather, to isolate the costs of selection per se, we seek to match probabilities with the same

    procedures applied to testing the DGP. In each replication, the correct DGP equation is fitted, and thesame selection criteria applied: we then compare the retention rates for DGP variables from PcGets

    with those that occur when no search is needed, namely when inference is conducted once for each

    DGP variable, and additional (non-DGP) variables are never retained.
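Operationally, the benchmark is obtained by fitting the DGP equation itself in every replication and applying the same retention rule to its coefficients; a minimal sketch using simple two-sided t-tests at level α (a simplification of the full PcGets selection criteria):

    # Benchmark comparator: no search, the DGP equation is estimated directly and
    # each DGP variable is retained when |t| exceeds the two-sided critical value.
    import numpy as np
    from scipy import stats

    def dgp_retention_rates(T=100, reps=1000, alpha=0.05, seed=0):
        rng = np.random.default_rng(seed)
        betas = np.array([2.0, 3.0, 4.0, 6.0, 8.0]) / np.sqrt(T)
        crit = stats.t.ppf(1 - alpha / 2, T - 5)
        kept = np.zeros(5)
        for _ in range(reps):
            X = rng.standard_normal((T, 5))
            y = X @ betas + rng.standard_normal(T)
            b, rss = np.linalg.lstsq(X, y, rcond=None)[:2]
            se = np.sqrt(rss[0] / (T - 5) * np.diag(np.linalg.inv(X.T @ X)))
            kept += np.abs(b / se) > crit
        return kept / reps            # retention rate for each DGP variable

    print(dgp_retention_rates())      # roughly 0.5, 0.85, 0.98, 1.0, 1.0 at 5%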

    6.4 Diagnostic tests

    PcGets records the rejection frequencies of both specification and mis-specification tests for the DGP,

    the initial GUM, and the various simplifications thereof based on the selection rules. Figure 1 displays

quantile-quantile (QQ) plots of the empirical distributions of seven potential mis-specification tests for

    the estimated correct specification, the general model, and the finally-selected model. Some strong

deviations from the theoretical distributions (diagonal) are evident: the portmanteau statistic (see Box and Pierce, 1970) rejects serial independence of the errors too often in the correct specification, never in

    the general, and too rarely in the final model. The hetero-x test (see White, 1980) was faced with degrees

    of freedom problems for the GUM, but anyway does not look good for the true and final model either.

    Since this incorrect finite-sample size of the diagnostic tests induces an excessively-early termination of

    any search path, resulting in an increased overall size for variable selection, we decided to exclude the

    portmanteau and the hetero-x diagnostics from the test battery of statistics. Thus, the following results

    use the five remaining diagnostic tests in table 3.
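The size evidence summarized in Figure 1 can be reproduced in outline by simulating a null model, collecting a diagnostic's p-values, and comparing their empirical quantiles with the uniform distribution a correctly-sized test would deliver. A minimal sketch for a normality diagnostic, using the Jarque-Bera statistic as a stand-in for the test actually used in PcGets:

    # Sketch: empirical p-value quantiles of a diagnostic under the null, to be set
    # against the 45-degree line of a QQ plot versus the uniform distribution.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(7)
    T, reps = 100, 1000
    pvals = []
    for _ in range(reps):
        X = np.column_stack([np.ones(T), rng.standard_normal((T, 3))])
        y = X @ np.array([1.0, 0.5, -0.3, 0.2]) + rng.standard_normal(T)
        resid = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
        pvals.append(stats.jarque_bera(resid).pvalue)

    probs = [0.05, 0.25, 0.5, 0.75, 0.95]
    print(np.round(np.quantile(pvals, probs), 2))   # roughly equal to probs if correctly sized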


[Figure 1 panels: QQ plots of the Chow1, Chow2, portmanteau, normality, AR, hetero and hetero-X diagnostics, plus the distribution of the number of failed tests, for the correct model, the general model, and the final (specific) model.]

Figure 1   Selecting diagnostics: QQ plots for M = 1000 and T = 100.

[Further QQ plots of the diagnostics for the correct model with M = 1000 and T = 1000: Chow1, Chow2, ...]