
3. Score-driven models for location

Andrew Harvey

May 2017

Andrew Harvey () 3. Score-driven models for location May 2017 1 / 42

Introduction to dynamic conditional score (DCS) models

1. A unified and comprehensive theory for a class of nonlinear time series models in which the dynamics of a changing parameter, such as location or scale, is driven by the score of the conditional distribution.
2. Dynamics are driven by the score of the conditional distribution.
3. For EGARCH, analytic expressions may be derived for (unconditional) moments, autocorrelations and moments of multi-step forecasts.
4. An asymptotic distributional theory for ML estimators can be obtained, sometimes with analytic expressions for the asymptotic covariance matrix.
5. Extensions to multivariate time series. Correlation or association may change over time. Time-varying copulas.
6. Changing coefficients in ARs and changing signal-noise ratios.

***

Harvey, A.C. Dynamic Models for Volatility and Heavy Tails. CUP, 2013. http://www.econ.cam.ac.uk/DCS
Creal et al. (2011, JBES; 2013, JAE): Generalized Autoregressive Score (GAS).


Introduction to dynamic conditional score (DCS) models

A guiding principle is signal extraction. When combined with basic ideas of maximum likelihood estimation, the signal extraction approach leads to models which, in contrast to many in the literature, are relatively simple in their form and yield analytic expressions for their principal features.

For estimating location, DCS models are closely related to the unobserved components (UC) models described in Harvey (1989). Such models can be handled using state space methods and they are easily accessible using the STAMP package of Koopman et al (2008).

For estimating scale, the models are close to stochastic volatility (SV) models, where the variance is treated as an unobserved component.


Unobserved component models

A simple Gaussian signal plus noise model is

yt = µt + εt,   εt ∼ NID(0, σ²ε),   t = 1, ..., T,
µt+1 = φµt + ηt,   ηt ∼ NID(0, σ²η),

where the irregular and level disturbances, εt and ηt, are mutually independent. The AR parameter is φ, while the signal-noise ratio, q = σ²η/σ²ε, plays the key role in determining how observations should be weighted for prediction and signal extraction.

The reduced form (RF) is an ARMA(1,1) process

yt = φyt−1 + ξt − θξt−1,   ξt ∼ NID(0, σ²),

but with restrictions on θ. For example, when φ = 1, 0 ≤ θ ≤ 1. The forecasts from the UC model and the RF are the same.


Unobserved component models

The UC model is effectively in state space form (SSF) and, as such, it may be handled by the Kalman filter (KF). The parameters φ and q can be estimated by ML, with the likelihood function constructed from the one-step ahead prediction errors.

The KF can be expressed as a single equation. Writing this equation together with an equation for the one-step ahead prediction error, vt, gives the innovations form (IF) of the KF:

yt = µt|t−1 + vt
µt+1|t = φµt|t−1 + kt vt

The Kalman gain, kt, depends on φ and q. In the steady state, kt is constant. Setting it equal to κ and re-arranging gives the ARMA(1,1) model with ξt = vt and φ − κ = θ.
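The steady-state gain can be found by iterating the Riccati recursion for the prediction variance. A minimal sketch, with variances measured in units of σ²ε; the function name and starting value are illustrative, not from the slides:

```python
def steady_state_gain(phi, q, tol=1e-12, max_iter=100000):
    """Iterate the Riccati recursion for y_t = mu_t + eps_t,
    mu_{t+1} = phi*mu_t + eta_t, with q = var(eta)/var(eps).
    Variances are measured in units of var(eps) = 1."""
    p = q if q > 0 else 1.0   # starting prediction variance (arbitrary positive)
    k = 0.0
    for _ in range(max_iter):
        f = p + 1.0                           # innovation variance
        p_next = phi * phi * p * (1.0 - p / f) + q
        k = phi * p_next / (p_next + 1.0)     # steady-state gain candidate
        if abs(p_next - p) < tol:
            break
        p = p_next
    return k
```

For example, with φ = 1 and q = 0.5 the recursion converges to κ = 0.5, so the implied MA parameter θ = φ − κ = 0.5 lies inside the [0, 1] band stated above.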


Normal and Student t distributions

Normal (Gaussian): Y ∼ N(µ, σ²). The pdf is

f(y) = (1/(σ√(2π))) exp(−(y − µ)²/(2σ²)),   −∞ < y < ∞.

The standard normal is N(0, 1). Kurtosis is 3.

Student's tν distribution, where ν is the degrees of freedom. The pdf is

f(y) = [Γ((ν + 1)/2) / (Γ(ν/2) ϕ √(πν))] (1 + (y − µ)²/(νϕ²))^(−(ν+1)/2),

where ϕ is scale and µ is the median. The variance is νϕ²/(ν − 2) and the excess kurtosis is 6/(ν − 4). Moments only exist up to and including ν − 1.

The Cauchy distribution is t1:

f(y) = 1/(ϕπ(1 + (y − µ)²/ϕ²)).


Figure: Normal and Cauchy distributions (f(y) plotted against y).


Outliers

Suppose the noise comes from a heavy-tailed distribution, such as Student's t, so that the observations are subject to outliers.

The RF is still an ARMA(1,1), but allowing the ξt's to have a heavy-tailed distribution does not deal with the problem, as a large observation becomes incorporated into the level and takes time to work through the system. An ARMA model with a heavy-tailed distribution is designed to handle innovations outliers, as opposed to additive outliers; see the robustness literature.

A model-based approach is not only simpler than the usual robust methods, but is also more amenable to diagnostic checking and generalization.


Unobserved component models for non-Gaussian noise

Simulation methods, such as MCMC, provide the basis for a direct attack on models that are nonlinear and/or non-Gaussian. The aim is to extend the Kalman filtering and smoothing algorithms that have proved so effective in handling linear Gaussian models. Considerable progress has been made in recent years; see Durbin and Koopman (2012).

But simulation-based estimation can be time-consuming and subject to a degree of uncertainty. Also, the statistical properties of the estimators are not easy to establish.


Observation driven model based on the score

The DCS approach begins by writing down the distribution of the t-th observation, conditional on past observations. Time-varying parameters are then updated by a suitably defined filter. Such a model is observation driven, as opposed to a UC model, which is parameter driven. In a linear Gaussian UC model, the KF is driven by the one-step ahead prediction error, vt. The DCS filter replaces vt in the KF equation by a variable, ut, that is proportional to the score of the conditional distribution. The innovations form becomes

yt = µt|t−1 + vt,   t = 1, ..., T,
µt+1|t = φµt|t−1 + κut,

where κ is an unknown parameter.


Why the score ?

Suppose that at time t − 1 we have θ̂t−1, the ML estimate of a parameter θ. Then the score is zero, that is,

∂ ln Lt−1(θ)/∂θ = ∑_{j=1}^{t−1} ∂ ln ℓj(θ)/∂θ = 0 at θ = θ̂t−1,   (1)

where ℓj(θ; yj) = f(yj; θ). When a new observation becomes available, a single iteration of the method of scoring gives

θ̂t = θ̂t−1 + [1/It(θ̂t−1)] ∂ ln Lt(θ)/∂θ = θ̂t−1 + [1/It(θ̂t−1)] ∂ ln ℓt(θ)/∂θ,

where It(θ̂t−1) = t·I(θ̂t−1) is the information matrix for t observations and the second equality follows because of (1). For a Gaussian distribution a single update goes straight to the ML estimate at time t (recursive least squares).
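In the Gaussian-mean case the scoring step reduces to θ̂t = θ̂t−1 + (yt − θ̂t−1)/t, so repeated application reproduces the running sample mean. A minimal sketch with illustrative data (names are mine, not from the slides):

```python
def scoring_update(theta_prev, y_t, t, sigma2=1.0):
    """One method-of-scoring step for a Gaussian mean with known variance:
    score_t = (y_t - theta_prev) / sigma2 and I_t = t / sigma2, so the
    update is theta_prev + (y_t - theta_prev) / t."""
    score = (y_t - theta_prev) / sigma2
    return theta_prev + (sigma2 / t) * score

ys = [2.0, 4.0, 9.0, 1.0]          # illustrative data
theta = ys[0]                      # ML estimate after one observation
for t, y in enumerate(ys[1:], start=2):
    theta = scoring_update(theta, y, t)
# theta is now the sample mean of ys (recursive least squares)
```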


Why the score ?

As t → ∞, It(θ̂t−1) → ∞, so the recursion becomes closed to new information. If it is thought that θ changes over time, the filter needs to be opened up. This may be done by replacing 1/t by a constant, denoted κ. Thus

θ̂t = θ̂t−1 + κ [1/I(θ̂t−1)] ∂ ln ℓt(θ)/∂θ.

With no information about how θ might evolve, the above equation might be converted to the predictive form by letting θ̂t+1|t = θ̂t. Thus

θ̂t+1|t = θ̂t|t−1 + κ [1/I(θ̂t|t−1)] ∂ ln ℓt(θ)/∂θ.

For a Gaussian distribution in which θ is the mean and the variance is known to be σ², the recursion is an EWMA because

[1/I(θ̂t|t−1)] ∂ ln ℓt(θ)/∂θ = yt − θ̂t|t−1.
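With a constant gain, the Gaussian-mean recursion is θ̂t+1|t = (1 − κ)θ̂t|t−1 + κyt. A small sketch (hypothetical function names and values) confirming that the recursion reproduces the explicit exponentially weighted average:

```python
def score_filter(ys, kappa, theta0=0.0):
    """Constant-gain score recursion for a Gaussian mean:
    theta_{t+1|t} = theta_{t|t-1} + kappa * (y_t - theta_{t|t-1})."""
    theta = theta0
    for y in ys:
        theta += kappa * (y - theta)
    return theta

def ewma(ys, kappa, theta0=0.0):
    """The same filter written as an explicit exponentially weighted average:
    weight kappa*(1 - kappa)^(t-1-j) on y_j, plus (1 - kappa)^t on the start value."""
    t = len(ys)
    out = (1 - kappa) ** t * theta0
    for j, y in enumerate(ys):
        out += kappa * (1 - kappa) ** (t - 1 - j) * y
    return out
```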


Why the score ?

If there is reason to think that the parameter tends to revert to an underlying level, ω, the updating scheme might become

θ̂t+1|t = ω(1 − φ) + φθ̂t|t−1 + κ [1/I(θ̂t|t−1)] ∂ ln ℓt(θ)/∂θ,

where |φ| < 1. This scheme corresponds to a first-order autoregressive process. More generally, we might introduce lags so as to smooth out the changes or allow for periodic effects. This leads to the formulation of a QARMA filter.

The score can also be motivated by a conditional mode argument based on smoothed estimates, θt|T. See Durbin and Koopman (2012, pp. 252-3) and Harvey (2013, pp. 87-9).


Why the score ?

(1) The attraction of treating the score-driven filter as a model in its own right is that it becomes possible to derive the asymptotic distribution of the ML estimator and to generalize in various directions.
(2) The same approach can then be used to model scale, using an exponential link function.
(3) Seen in this way, the justification for the class of DCS models is not that they approximate corresponding UC models, but rather that their statistical properties are both comprehensive and straightforward.
(4) An immediate practical advantage is seen from the response of the score to an outlier.


Dynamic location model

yt = ω + µt|t−1 + vt = ω + µt|t−1 + exp(λ)εt,
µt+1|t = φµt|t−1 + κut,

where εt is a serially independent, standard t-variate and

ut = (1 + (yt − µt|t−1)²/(ν exp(2λ)))⁻¹ vt,

where vt = yt − µt|t−1 is the prediction error and ϕ = exp(λ) is the (time-invariant) scale.

ut → 0 as |y| → ∞. In the robustness literature this is called a redescending M-estimator. It is a gentle form of trimming.
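A minimal sketch of this filter (function name and parameter values are illustrative, not estimates): the score-based innovation ut shrinks towards zero for large prediction errors, so an outlier barely moves the filtered location.

```python
import math

def dcs_t_location(ys, omega=0.0, phi=0.9, kappa=0.5, lam=0.0, nu=3.0, mu0=0.0):
    """DCS-t location filter:
    y_t = omega + mu_{t|t-1} + v_t,  mu_{t+1|t} = phi*mu_{t|t-1} + kappa*u_t,
    with u_t = v_t / (1 + v_t^2 / (nu * exp(2*lam))).
    Returns the one-step-ahead locations mu_{t|t-1}."""
    mu, mus = mu0, []
    for y in ys:
        mus.append(mu)
        v = y - omega - mu                                  # prediction error
        u = v / (1.0 + v * v / (nu * math.exp(2.0 * lam)))  # score-based innovation
        mu = phi * mu + kappa * u
    return mus
```

With ν = 3 and unit scale, a prediction error of 100 produces u ≈ 0.03, so the filtered level hardly moves; as ν → ∞, u → v and the Gaussian filter is recovered.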


Figure: Impact of ut for tν (with a scale of one) for ν = 3 (thick), ν = 10 (thin) and ν = ∞ (dashed).


Robust estimation

The location-dispersion model for a variable, −∞ < yt < ∞, is

yt = µ + ϕεt,   t = 1, ..., T,

where εt is a standardized variable with PDF fε and the scale, ϕ > 0, is called the dispersion for yt. The density for yt is

f(yt; µ, ϕ, ξ) = ϕ⁻¹ fε(εt; ξ),

where ξ denotes one or more shape parameters, and the scores for µ and ϕ are given by differentiating ln f(yt) = ln fε((yt − µ)/ϕ) − ln ϕ. The score for scale is

∂ ln ft/∂ϕ = εt ∂ ln ft/∂µ − 1/ϕ.

When the scale is parameterized with an exponential link function, ϕ = exp λ, the score becomes

∂ ln ft/∂λ = ϕεt ∂ ln ft/∂µ − 1 = (yt − µ) ∂ ln ft/∂µ − 1.
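Written in terms of the raw prediction error yt − µ = ϕεt, the λ-score identity reads ∂ ln ft/∂λ = (yt − µ) ∂ ln ft/∂µ − 1. A finite-difference check for the Student-t case (the evaluation point is arbitrary):

```python
import math

def t_logpdf(y, mu, lam, nu):
    """Log density of Student's t with location mu, scale exp(lam), dof nu."""
    phi = math.exp(lam)
    c = (math.lgamma((nu + 1.0) / 2.0) - math.lgamma(nu / 2.0)
         - math.log(phi * math.sqrt(math.pi * nu)))
    return c - (nu + 1.0) / 2.0 * math.log(1.0 + (y - mu) ** 2 / (nu * phi ** 2))

# arbitrary evaluation point
y, mu, lam, nu, h = 1.7, 0.4, 0.3, 5.0, 1e-6
score_mu = (t_logpdf(y, mu + h, lam, nu) - t_logpdf(y, mu - h, lam, nu)) / (2 * h)
score_lam = (t_logpdf(y, mu, lam + h, nu) - t_logpdf(y, mu, lam - h, nu)) / (2 * h)
# identity: d ln f / d lam = (y - mu) * d ln f / d mu - 1
```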


Robust estimation

Consider the location-dispersion model, but with f replaced by ϕ⁻¹ exp(ρ(εt)); see Maronna, Martin and Yohai (2006, pp. 37-8). Differentiating the criterion function ρ((yt − µ)/ϕ) − ln ϕ in order to obtain a minimum gives

∂ρ/∂µ = ψL(εt)

for location and, for scale with the exponential link function,

∂ρ/∂λ − 1 = ψS(εt) − 1 = εt ψL(εt) − 1.


Robust estimation

The ML estimators are asymptotically efficient, assuming certain regularity conditions hold. More generally, ρ(.) may be any function deemed to yield estimators with good statistical properties. In particular, the estimators should be robust to observations which would be considered to be outliers for a normal distribution. The estimators are obtained by solving ∑ψL(εt) = 0 and ∑ψS(εt) = T. When normality is assumed, the ML estimators of the mean and variance are just the corresponding sample moments. However, these estimators can be subject to considerable distortion when outliers are present. Robust estimators, on the other hand, are resistant to additive outliers while retaining relatively high efficiency when the data are from a normal distribution.

Maronna, Martin and Yohai (2006, sect. 8.6 and 8.8) give a robust algorithm for AR and ARMA models with additive outliers. For a first-order model their filter is essentially the same as for a DCS model, except that their dynamic equation is driven by a robust response function and they regard the model as an approximation to a UC model.


Robust estimation and DCS

Robustness is achieved because either the response function, ψ(.), converges to a positive (negative) constant for observations tending to plus (or minus) infinity, or it goes to zero. These two approaches are usually classified as Winsorizing or as trimming.

The score for a t-distribution converges to zero and so is soft trimming, as opposed to hard trimming, which has a response that is linear (as for a Gaussian distribution) up to a certain point and then suddenly becomes zero. Hard Winsorizing has a constant response after a certain point.

Is there a parametric form of Winsorizing that can be obtained from the score of a distribution?


Robust estimation and DCS

The exponential generalized beta distribution of the second kind (EGB2)

f(y; µ, υ, ξ, ς) = υ exp{ξ(y − µ)υ} / [B(ξ, ς)(1 + exp{(y − µ)υ})^(ξ+ς)].

All moments exist. It is obtained as the logarithm of a GB2 variable. Although υ is a scale parameter, it is the inverse of what would be considered a more conventional measure of scale. Thus scale is better defined as 1/υ or as the standard deviation

σ = √(ψ′(ξ) + ψ′(ς))/υ = h(ξ, ς)/υ = h/υ,

where ψ′(.) is the trigamma function.


EGB2

When ξ = ς, the distribution is symmetric; for ξ = ς = 1 it is a logistic distribution, and as ξ = ς → ∞ it tends to a normal distribution. For ξ = ς = 0, the distribution is double exponential, or Laplace.

A plot of the (symmetric) EGB2, GED and Student's t with the same excess kurtosis shows them to be very similar. It is difficult to see the heavier tails of the t distribution from the graph, and the only discernible difference among the three distributions is in the peak, which is higher and more pointed for the GED. The EGB2 in turn is more peaked than the t. As the excess kurtosis increases, the differences between the peaks become more marked.


EGB2

The score function with respect to location in the EGB2 distribution is bounded:

∂ ln ft/∂µt|t−1 = υ(ξ + ς)bt(ξ, ς) − υξ,

where

bt(ξ, ς) = e^((yt−µt|t−1)υ) / (e^((yt−µt|t−1)υ) + 1).

Because 0 ≤ bt(ξ, ς) ≤ 1, it follows that as y → ∞ the score approaches an upper bound of υς, whereas y → −∞ gives a lower bound of −υξ.

The variable bt(ξ, ς) is IID with a beta(ξ, ς) distribution. Asymptotic theory can be developed as in the t-distribution case.
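A direct numerical check of the bounds: since bt is a logistic function of (yt − µt|t−1)υ, the score stays between −υξ and υς. The function name and parameter values below are illustrative:

```python
import math

def egb2_location_score(y, mu, upsilon, xi, sig):
    """EGB2 score with respect to location: upsilon*(xi + sig)*b_t - upsilon*xi,
    where b_t is the logistic function of (y - mu)*upsilon and sig plays the
    role of the slide's varsigma."""
    b = 1.0 / (1.0 + math.exp(-(y - mu) * upsilon))
    return upsilon * (xi + sig) * b - upsilon * xi
```

With υ = 2, ξ = 0.5, ς = 1.5 the score approaches υς = 3 for large positive y and −υξ = −1 for large negative y.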


EGB2

It will prove more convenient to replace υ by h/σ and to define ut as

ut = σ² ∂ ln ft/∂µt|t−1 = σh[(ξ + ς)bt(ξ, ς) − ξ].

We note that the upper and lower bounds are σ√2 and −σ√2 respectively when ς = ξ = 0. On the other hand, there is no upper (lower) bound as ς (or ξ) → ∞ because hς → ∞ (as does hξ). As ς = ξ → ∞, the distribution becomes normal and so, for large ς and ξ, ut ≃ yt − µt|t−1.


Figure: the score response, u, plotted against y.


Dynamic location models were fitted to the growth rate of US GDP and industrial production using EGB2, Student's t and normal distributions. GDP is quarterly, ranging from 1947q1 to 2012q4. Industrial production data are monthly and range from January 1960 to February 2013. All data are seasonally adjusted and taken from the Federal Reserve Economic Data (FRED) database of the Federal Reserve of St. Louis.

When first-order Gaussian models are fitted, there is little indication of residual serial correlation. There is excess kurtosis in all cases, but no evidence of asymmetry. For example, with GDP the Bowman-Shenton statistic is 30.04, which is clearly significant because the distribution under the null hypothesis of Gaussianity is χ² with two degrees of freedom. The non-normality clearly comes from excess kurtosis, which is 1.9, rather than from skewness, which is only 0.18. Comparing the residuals with a fitted normal shows them to have a higher peak at the mean, as well as heavier tails.


Figure: residual density compared with a fitted normal, N(s = 0.999).


Tables 1 and 2 report the estimation results and Table 3 compares goodness of fit. The Student-t and EGB2 models outperform the Gaussian model, with the shape parameter, ν or ξ, confirming the excess kurtosis. The λ parameter is the logarithm of the scale, ϕ, but the estimates of σ are shown because these are comparable across different distributions. For GDP the EGB2 model gives a slightly better fit, whereas the t-distribution is better for industrial production. However, the differences between the two are small compared with the Gaussian model.


Table 1: US GDP (quarterly)

              κ        φ        ω        λ        ξ (or ν)   σ
EGB2          0.30     0.50     0.008    −5.40    0.88       0.0091
  Num SE     (0.063)  (0.103)  (0.001)  (0.324)  (0.394)
  Asy SE     (0.054)  (0.143)  (0.001)  (0.415)  (0.502)
t             0.50     0.50     0.008    −4.88    6.49       0.0091
  Num SE     (0.094)  (0.103)  (0.001)  (0.071)  (2.364)
  Asy SE     (0.089)  (0.141)  (0.001)  (0.056)  (1.887)
Gaussian      0.35     0.49     0.008    −4.70    −          0.0091
  Num SE     (0.058)  (0.112)  (0.001)  (0.044)
  Asy SE     (0.061)  (0.141)  (0.001)  (0.044)


Table 2: US Industrial production (monthly)

              κ        φ        ω        λ        ξ (or ν)   σ
EGB2          0.20     0.85     0.003    −6.05    0.55       0.0069
  Num SE     (0.033)  (0.036)  (0.001)  (0.214)  (0.147)
  Asy SE     (0.027)  (0.040)  (0.001)  (0.287)  (0.196)
t             0.40     0.85     0.002    −5.25    4.49       0.0071
  Num SE     (0.060)  (0.036)  (0.001)  (0.046)  (0.743)
  Asy SE     (0.055)  (0.040)  (0.001)  (0.038)  (0.634)
Gaussian      0.25     0.83     0.002    −4.95    −          0.0071
  Num SE     (0.032)  (0.041)  (0.001)  (0.028)
  Asy SE     (0.035)  (0.046)  (0.001)  (0.028)


Table 3: US macroeconomic series, model comparison

                         Log-likelihood   AIC       BIC
GDP
  EGB2                   868.376          −6.566    −6.498
  t                      868.242          −6.565    −6.497
  Gaussian               862.212          −6.526    −6.472
Industrial production
  EGB2                   2291.66          −7.168    −7.133
  t                      2293.56          −7.174    −7.139
  Gaussian               2255.21          −7.057    −7.029


Structural breaks

It might be thought that the EGB2 and t filters will be less responsive to a permanent change in the level than the linear Gaussian filter. However, for moderate size shifts, the score functions suggest that this might not be the case, because only for large observations is the Gaussian response bigger than the response of the robust filters. For example, for the logistic (EGB2 with unit shape parameters), the score is only smaller than the observation (and hence the linear filter) when it is more than (approximately) 1.6 standard deviations from the mean. The behaviour of the t-filter is similar.

In order to investigate the issue of adapting to a permanent change in level, an upward shift was added to the US industrial production data at the beginning of 2010. The size of the shift was calibrated so as to be proportional to the sample standard deviation of the series.

For a one standard deviation shift the paths of all three filters after the break are similar. For two standard deviations the Gaussian filter adapts more quickly, but it is still well below the new level initially, and after 9 or 10 months it is virtually indistinguishable from the other two filters.


Figure: one standard deviation level shift, 2009-2011. US Industrial production (level shift and actual data) with the EGB2, t and Gaussian location filters.


Figure: two standard deviations level shift, 2009-2011. US Industrial production (level shift and actual data) with the EGB2, t and Gaussian location filters.


Trend and seasonality

The Gaussian local level model is a random walk plus noise. The Kalman filter is an exponentially weighted moving average (EWMA) of past observations. For the DCS-t model,

yt = µt|t−1 + vt,
µt+1|t = µt|t−1 + κut,

and the initial value, µ1|0, is treated as an unknown parameter that needs to be estimated along with κ and ν. Since ut = (1 − bt)(yt − µt|t−1), re-arranging the dynamic equation gives

µt+1|t = (1 − κ(1 − bt))µt|t−1 + κ(1 − bt)yt,

which can be regarded as an EWMA modified to deal with outliers: when y²t is large, bt is close to one.
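A one-step sketch of this modified EWMA (function name and values are illustrative): an ordinary observation receives close to the full gain κ, while an outlier is almost entirely discounted.

```python
import math

def robust_ewma_step(mu, y, kappa, lam, nu):
    """One step of the DCS-t local level filter as a modified EWMA:
    mu_new = (1 - kappa*(1 - b))*mu + kappa*(1 - b)*y,
    where b = v^2 / (nu*exp(2*lam) + v^2) approaches one for outliers."""
    v = y - mu
    b = v * v / (nu * math.exp(2.0 * lam) + v * v)
    w = kappa * (1.0 - b)          # effective, data-dependent gain
    return (1.0 - w) * mu + w * y
```

With κ = 0.5, λ = 0 and ν = 5, an observation of 1 moves the level from 0 to 5/12 ≈ 0.42, whereas an observation of 100 moves it by only about 0.02.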


Local linear trend

The DCS local linear trend filter is

yt = µt|t−1 + vt,   t = 1, ..., T,
µt+1|t = µt|t−1 + βt|t−1 + κ1 ut,
βt+1|t = βt|t−1 + κ2 ut.

The initializations β3|2 = y2 − y1 and µ3|2 = y2 can be used but, as in the local level model, initializing in this way is vulnerable to outliers at the beginning. Estimating the fixed starting values, µ1|0 and β1|0, is a better option.

An IRW trend in the UC local linear trend model implies the constraint κ2 = κ1²/(2 − κ1), 0 < κ1 < 1, which may be found from Harvey (1989, p. 177). The restriction can be imposed on the DCS-t model by treating κ1 = κ as the unknown parameter, but without unity imposed as an upper bound.


Stochastic seasonal in the DCS model

In the DCS model for the seasonal

yt = z′t αt|t−1 + vt,   αt+1|t = αt|t−1 + κt ut,

where zt picks out the current season, γt|t−1, that is, γt|t−1 = z′t αt|t−1.

The question is how to parameterize κt .


Stochastic seasonal

We need i′κt = 0, as in the Kalman filter for the UC model. If κjt, j = 1, ..., s, denotes the j-th element of κt, then in season j we set κjt = κs, where κs is a non-negative unknown parameter, while

κit = −κs/(s − 1),   i ≠ j.

The amounts by which the seasonal effects change therefore sum to zero.
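A sketch of the construction (0-based season index; the helper is hypothetical):

```python
def seasonal_gain(j, s, kappa_s):
    """Gain vector kappa_t when the current season is j (0-based), with s seasons:
    the current season gets kappa_s and every other season gets -kappa_s/(s - 1),
    so the changes to the seasonal effects sum to zero."""
    return [kappa_s if i == j else -kappa_s / (s - 1) for i in range(s)]
```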


Basic structural model

The seasonal recursions can be combined with the trend filtering equations to give a structure similar to the innovations form of the Kalman filter for the BSM. Thus

yt = µt|t−1 + γt|t−1 + vt.

The initial conditions at time t = 0 are estimated by treating them as parameters; there are s − 1 seasonal parameters because the remaining initial seasonal state is minus the sum of the others.


Application to rail travel

An unobserved components model was fitted to the rail series: quarterly observations on kilometres travelled by train passengers in the UK, using the STAMP 8 package of Koopman et al (2009). Trend, seasonal and irregular components were included, but the model was augmented with intervention variables to take out the effects of observations that are known to be unrepresentative. The intervention dummies were:
(i) the train drivers' strikes in 1982(1, 3);
(ii) the Hatfield crash and its aftermath, 2000(4) and 2001(1); and
(iii) the signallers' strike in 1994(3).


Application to rail travel

Fitting a DCS model with trend and seasonal components avoids the need to deal explicitly with the outliers. The ML estimates for the parameters in

yt = µt|t−1 + γt|t−1 + exp(λ)εt,
µt+1|t = µt|t−1 + β + κut,

are

κ̃ = 1.421 (0.161),  κ̃s = 0.539 (0.070),  λ̃ = −3.787 (0.053),  ν̃ = 2.564 (0.319),  β̃ = 0.003 (0.001),

with initial values µ̃ = 2.066 (0.009), γ̃1 = −0.094 (0.007), γ̃2 = −0.010 (0.006) and γ̃3 = 0.086 (0.006). The figures in parentheses are numerical standard errors. The last seasonal is γ̃4 = 0.018; it has no SE as it was constructed from the others.


Figure: LNATkm and the DCS-t trend, 1980-2010, with annotations for the 1982(1) train drivers' strike, the 1994(3) signallers' strike and the 2000(4) Hatfield crash.
