Nonparametric Modelling of High Dimensional Time Series · Outline 1 Survey some work on ⁄exible...

Nonparametric Modelling of High Dimensional TimeSeries

Oliver LintonUniversity of Cambridge

Yale University, 2016

1 / 51

We are concerned with the nonparametric prediction/estimation problem

E(Yt+k |Ft ),

where the information set Ft may be infinite dimensional and theconditional expectation is not restricted to be linear or even parametric.There are two leading cases:

Autoregression - Ft = σ (Ys ; s ≤ t) is the sigma algebra generated bythe sequence (Ys )s≤tStatic regression - Ft = σ (Xjt , j = 1, 2, . . .) .In practice there is usually a combination of both types ofconditioning information.

Data could be high or low frequency

Most econometric predictions are within parametric and even linearmodels.

2 / 51

Outline

1 Survey some work on flexible nonparametric models for time serieswith estimation and forecasting applications

2 Some themes are compromise between curse of dimensionality versusgenerality

3 Discuss in detail proposal to use marginal regression functions in amodel averaging approach for estimation, forecasting, and portfoliochoice.

3 / 51

Let (Yt ,Xt ) be a stationary time series process, where Xt ∈ Rd .Interested in the regression/forecasting function

E(Yt |Xt = x) ; E (Yt |Yt−1, . . . ,Yt−d )

Nonparametric methods are consistent

m can be well estimated when the dimension d is small, but verypoorly if the dimension d is high (say larger than 3) owing to the"curse of dimensionality": Stone (1980, 1982) and Ibragimov &Hasminskii (1980). Silverman (1986). Smoothness r is equallyimportant for rates of convergence.

Main methods include Kernel methods, sieve/series methods, splines(penalization) etc.

4 / 51

For standard kernel estimators with bandwidth h and kernel K

MSE (md (x)) ≈

bias terms︷︸︸︷µr (K )h

2r

(∑|u|=r

Dum(x)

)2+

variance terms︷︸︸︷υr (K )

1Thd

σ2(x)f (x)

.

The optimal bandwidth is h ∼ T−1/(2r+d ) which delivers the pointwiserate T−2r/(2r+d ) where r is smoothness. CLT, LFCLT, etc.

Optimality in a limited sense. Tibshirani (1984) Local Cramer-Raobound. Fan (1993). Local linear (kernel) is Best Linear Minimax.Extends to a number of nonparametric "models" like additive.

We are particularly interested in the case where d is large, which isnot covered by this theory. My interest goes back to Pagan and Ullah(1988) and Hong and Pagan (1991) who discuss the case whered → ∞

5 / 51

Functional regression. Masry (2005), Ferraty and Vieu(2002,2006,2012), Ramsay and Silverman (2005),

Y ∈ R ; X ∈ B = L2

Y ,X ∈ BInfill framework, e.g., X : [0, 1]→ R with observationsX (t1), . . . ,X (tT ) with t1, . . . , tT becoming dense in [0, 1].Functional autoregression, linear and nonlinear. Bosq (2000). Karginand Onatski (2008).

Kernel methods with metric $(X ,X ′); Assume infill type ofasymptotics with functional smoothness, Frechet/Hadamarddifferentiability. No covariate joint density. Consistency at slow rate.

6 / 51

We consider the case where X = (X1,X2, . . .) ∈ `2, i.e., the set ofsequences for which say ∑∞

k=1 x2k may not be well defined.

We will weight the Xi to get back to normed space.

7 / 51

Linton and Sancetta (2009) considered estimation of the predictor

E(Y0|Y−1,Y−2 . . .).

They established (strong) consistency of kernel estimator under weakconditions (stationarity and ergodicity and moment).Hong and Linton (2016) consider the problem of estimating

E (Yt |Xt = x),

where Xt could be lagged Y ′s or (Xt ,Xt−1, . . .). In either case there is anordering of the covariate which is exploited in estimation.

8 / 51

Kernel estimator aggregates covariate (lags) through metric betweensequences inside the kernel.

E(Y0|Y−1,Y−2, . . .) =

T∑s=1

Y−sKh

$(Y −s−1−T ,Y −1−(T−s))︷︸︸︷

T−s∑j=1

φ−1j ρ(Y−j ,Y−s−j )

T∑s=1

Kh(

∑T−sj=1 φ−1j ρ(Y−j ,Y−s−j )

)where K (·) is a kernel function, Kh(·/h)/h and h is a bandwidth.Downweights the past through parameter φj → ∞. Looks for sequences(Y−s ,Y−s−1, . . .) close to the most recent history (Y−1,Y−2, . . . , )

9 / 51

Can show that in the special case where K (∑ ui ) = ∏K (ui ) (e.g.,exponential) and ρ(t, s) = (t − s)2

E(Y0|Y−1,Y−2 . . .) =

T∑s=1

Y−sT−s∏j=1

Khj (Y−j − Y−s−j )

T∑s=1

T−s∏j=1

Khj (Y−j − Y−s−j ), hj = φjh,

where h→ 0 as T → ∞ and φj → ∞ as j → ∞.Method is multivariate kernel regression with bandwidths expanding withlag order. Achieves shrinkage indirectly. No explicit truncation.Application to Risk return relationship.

10 / 51

Some Heuristics for high dimensional np regression

Consider np regression with d(T )→ ∞. Then for standard kernelestimators with bandwidth h

MSE (md (x)) ≈

bias terms︷︸︸︷µr (K )h

2r

(∑|u|=r

Dum(x)

)2+

variance terms︷︸︸︷υr (K )

1Thd

σ2(x)f (x)

Suppose that d = d(T )→ ∞ then log of the optimal MSE is

logO(T−2r/(2r+d (T )) =−2r

2r + d(T )logT → −∞

provided logT/d(T )→ ∞ (rules out d(T ) = c logT ). Heuristics notquite right.

11 / 51

Hong and Linton (2016)

Yt = m(Xt ,Xt−1, . . .) + εt ,

where E (εt |Xt ,Xt−1, . . .) = 0, and:1 The sample observations Yt ,XtTt=1 are stationary and (jointly) nearepoch dependent with some rate not specified here andE (|Yt |2+δ) < ∞.

2 The real-valued stochastic process Xs∞s=1 admits a moving average

representation:

Xs =∞

∑j=−∞

ajεs−j ,

where aj is a square summable sequence, and εjj is an independentand identically distributed standard Gaussian sequence.

3 We have for m : R∞ → R,∣∣m(x)−m(x ′)∣∣ ≤ ∞

∑j=1cj∣∣xj − x ′j ∣∣β

where cj is a sequence with ∑∞j=1 cj j

pβ < ∞ and β ∈ (0, 1].12 / 51

A major issue concerns the small ball (or deviation) probability

ϕz (δ) := Pr (‖z − Z‖ ≤ δ) as δ→ 0

When Z is a d-dimensional continuous random vector with Lebesguedensity p(·) > 0, it can be readily shown that the shifted small ball(with respect to the usual Euclidean norm) is given by

ϕz (h) = Vdhdp(z) ∝ hd ,

where Vd = πd/2/Γ(d/2+ 1) is the volume of d-dimensional unitsphere. However when Z takes values in an infinite-dimensionalspace, it is generally diffi cult to specify the exact form of the smallball probability, and its behaviour varies depending heavily on thenature of the associated space and its topological structure.

Hong, Lifshits and Nazarov (2016, Electron Comm. Probab) establishresults for weakly dependent Gaussian sequences.

13 / 51

Suppose that K is supported on (and bounded away form zero on) [0,K ],that hj = jph, where p > 1/2 and h = h(T )→ 0

C ∗G =(2π)(1−p)(2p − 1)

2 · (2p)3p−12p−1

·[

π2(1−2p)/2p

sin(π/2p)

] −p2p−1

,

C ∗∗G =2p − 12

(π

2p sin π2p

) 2p2p−1

C` = limh→0

[`−1/2

(h−

4p2p−1)]=

√2π

CA =

[12π

∫ 2π

0

∣∣∣∣ ∞

∑j=0aj exp(iju)

∣∣∣∣1/p

du

]pV2(x) =

C ∗GC`

K1−p2p−1

σ2(x)

e−12 ‖Γ−1/2z‖22CA

1−p2p−1

= v(x ; p,K , a),

where σ2(·) is the conditional variance, and z = (zj ) = (j−pxj ).(1−pX1, 2−pX2, ...)

ᵀ=: Z is (`2, ‖.‖2)-valued

14 / 51

Theorem Suppose A1-A8 and B1-B4 hold. Then,

∆1/2T (x) (m(x)−m(x)−BT (x)) =⇒ N (0, 1) ,

where BT (x) = O(hβ)→ 0 is the bias component and

∆T (x) = Th1−p2p−1 exp

−C ∗∗G (CAK )−

22p−1 h−

22p−1V2(x)−1.

Limiting variance depends on dependence structure of covariates

15 / 51

Bias Variance Trade-offSuppose that

h = (logT )a

Then the following bandwidth sequence constant balances squared biasagainst variance

aopt =ϑ · W(χ

ϑTχ/ϑ)− χ logT

ϑ× χ× log logT ∈(−2p − 1

2, 0)↓ 12− p,

where W(y) is the lambert W function, which returns the solution x ofy = x · ex , while ϑ = [2β+ (1− p)/(2p − 1)] and χ = 2/(2p − 1).The upper bound on p is obtained from the bias condition pmax(β, c),hence optimal mse is bounded by

(logT )β( 12−pmax).

16 / 51

InferenceSelf normalized CLT allows usual inference. Let

V (x) =

(T

∑t=1Kt

)−2 T

∑t=1

Kt (Yt − m(x))−

1T

T

∑t=1Kt (Yt − m(x))

2

V (x)−1/2 (m(x)−m(x)−BT (x)) =⇒ N (0, 1)

Kt = K(‖H−1(x − Xt )‖

); H = (h1, h2, . . .).

17 / 51

How to get faster rates of convergence?

Change the function class to impose additional structure implicitly,e.g., Neural networks.

Assume that m is infinitely smooth or the smoothness increases withsample size. For example, if smoothness of order r(T ) = γ logT andd(T ) = c logT , we get optimal MSE of order

O(T−2γ log T /(2γ log T+c log T )) = O(T−2γ/(2γ+c ))

Hard to justify.

Impose some structure

18 / 51

Estimates of√E (Y 2t+j |Yt = y), j = 1, . . . , 50. y ∈ [0.01%, 0.99%]

19 / 51

Additive Nonparametric Models

Generalized Additive models (Hastie and Tibshirani, 1990)). For known G

G (m(x)) =d

∑j=1mj (xj )

Stone (1985, 1986), free from curse of dimensionality.Estimation by Marginal Integration (averaging out of high dimensionalestimator) or Backfitting (Iterative one dimensional smooths of partialresiduals) G = I

mj (Xjt )←− E

partial residuals︷︸︸︷

Yt −d

∑k 6=j

mk (Xkt )

|Xjt

Hastie and Tibshirani (1990). Mammen, Linton, and Nielsen (1999).Smooth backfittingIdentification issue. Rule out concurvity (i.e., singularity in jointdistribution of (X1, . . . ,Xd )). 20 / 51

In the time series case the conditioning information may consist of aninfinite number of lags of Y , i.e., d = ∞.Sparse additive models (SpAM). Meier, van de Geer, and Buhlman(2009). Potential components d >> T but true model d0 finite. iiddata

We now consider semiparametric models that use all the lags but insome special structure

21 / 51

Semiparametric Models

Linton and Mammen (2005) considered the Engle and Ng (1993)semiparametric (volatility) regression model

E(Y 2t |Yt−1,Yt−2 . . .) =∞

∑j=1

ψj (θ)m(Yt−j ),

where m(·) is an unknown function and the parametric family ψj (θ),θ ∈ Θ, j = 1, . . . ,∞ are parametric weights that decline suffi cientlyfast. This model includes the GARCH(1,1) as a special case andincludes an infinite set of lags.Linton and Mammen (2008) generalized this class of models to allowfor exogenous regressors and more complicated dynamics

B(L)Yt = A(L)m(Xt ) + εt

including some unit roots (in Y )Chen and Ghyssels (2010, RFS) model. Application of these methodsto volatility forecasting and news impact curve estimation using highfrequency data.

22 / 51

Estimation method is based on a characterization of the function m as thesolution of a linear integral equation

m = m∗ +Hm

with operator H and intercept of the form

m∗θ (x) =∞

∑j=1

ψj (θ)mj (x), mj (x) = E(Y2t |Yt−j = x)

They proposed an estimation strategy for the unknown quantities, whichrequires as input the estimation of mj (x) for j = 1, 2, . . . , d(T ), whered(T ) = c logT for some c > 0. Solve the inverse problem numerically.Essentially, iterative one dimensional smoothing.

23 / 51

This general approach to modelling is promising but quitecomputationally demanding.

In addition, the models considered thus far all have a finite number ofunknown functions (for example, in Linton and Mammen (2005) onlyone unknown function was allowed), and so appear to be heavily overidentified. Not flexible enough for forecasting.

Other approaches (Fan and Yao 2003) :I varying coeffi cient modelsI Partial linearI Single index models, for example

E(Y 2t |Yt−1,Yt−2 . . .) = m

(∞

∑j=1

ψjυ(Yt−j ; θ)

)

24 / 51

The infinite order additive model with decay may have some advantages

Yt =∞

∑j=1mj (Yt−j ) + εt , mj ∈ S(rj , cj ),

where smoothness rj increases with lag j (i.e., number of derivativesincreases or global bound on second derivatives shrinks)Maybe backfitting with increasing order can be plausible

mj (Yt−j )←− E

partial residuals︷︸︸︷

Yt −d (T )

∑k 6=j

mk (Yt−k )

|Yt−j

Take bandwidth hj = cjh(T ) such that

hj (T )→ ∞ as j → ∞ and hj → 0 as T → ∞

25 / 51

MAMA: Model Averaging MArginal Regression

Li, Linton, and Lu (2014). Suppose that Yt , t = 1, . . . ,T , are Tobservations collected from a stationary time series process andXt = (Z

ᵀt ,Y

ᵀt−1)

ᵀwith Zt =

(Zt1,Zt2, . . . ,ZtpT

)ᵀand

Yt−1 = (Yt−1,Yt−2, . . . ,Yt−dT )ᵀ, where Zt are exogenous regressors.

Here, pT , dT are large i.e., pT , dT → ∞.The main interest is to study the multivariate regression function:

m(x) = E(Yt∣∣Xt = x).

We propose methods to approximate this by one dimensional smoothingoperations

26 / 51

We model or approximate m(x) = E(Y |X = x) by

mw (x) = w0 +J

∑j=1wjE(Y |X(j) = x(j))

for some weights wj , j = 0, 1, . . . , J, where X(j) = (Xi1 , . . . ,Xikj )ᵀis a

subset of X and x(j) = (xi1 , . . . , xikj )ᵀ.

In general, X(j) and X(k ) could have different dimensions and of courseoverlapping members. The union of X(j) may exhaust one or more of S`(the set of all subsets of S = 1, 2, . . . , d of ` components, and this hascardinality J` = (

d`)) or it may not. In a time series setting we may be

combining

E (Yt |Yt−1), . . . ,E (Yt |Yt−d ),E (Yt |Yt−1,Yt−2), . . . ,E (Yt |Yt−2, . . . ,Yt−d )

A simple special case that we focus on is where J = d and X(j) = Xj isjust the j th component and the covariates are non overlapping.

27 / 51

KSIS+PMAMARChen, Li, Linton, Lu (2016). We assume that the dimension the potentialexplanatory variables Xt can be diverging at an exponential rate, i.e.,

pT + dT = O(expT δ0)

for some positive constant δ0.

We introduce screening method to reduce the number of possiblevariables to less than TWe then apply penalization technique to select down even further tok(T ) with k(T )/T → 0 and apply the aggregationA practically important refinement involves iterating between thescreening step and the PMAMAr step.We examine the nonlinear forecasting performance after dimensionreduction.

Linear models: Fan and Lv (2008); Additive models: Fan, Feng and Song(2011); Varying coeffi cient models: Fan, Ma and Dai (2014) and Liu, Liand Wu (2014); Linear SIS+model averaging: Ando and Li (2014).

28 / 51

For notational simplicity, we let

Xtj =Ztj , j = 1, 2, . . . , pT ,Yt−(j−pT ), j = pT + 1, pT + 2, . . . , pT + dT .

For j = 1, . . . , pT + dT , the kernel smoother of marginal regression

mj (xj ) := E(Yt |Xtj = xj )

is

mj (xj ) =∑Tt=1 YtKtj (xj )

∑Tt=1 Ktj (xj )

, Ktj (xj ) = K(Xtj − xj

h1

).

29 / 51

Screening Step

We consider ranking the importance of the covariates by calculating thecorrelation between the response variable and marginal regression:

cor(j) =cov(j)√v(Y ) · v(j)

=[ v(j)v(Y )

]1/2,

where v(Y ) = var(Yt ), v(j) = var(mj (Xtj )) and

cov(j) = cov(Yt ,mj (Xtj )) = var(mj (Xtj )) = v(j).

The value of cor(j) is non-negative for all j and the ranking of cor(j) isequivalent to the ranking of v(j) as v(Y ) is positive and invariant across j .

30 / 51

The sample version of cor(j) can be constructed as

ˆcor(j) =ˆcov(j)√

v(Y ) · v(j)=[ v(j)v(Y )

]1/2,

where

v(Y ) =1T

T

∑t=1Y 2t −

( 1T

T

∑t=1Yt)2,

ˆcov(j) = v(j) =1T

T

∑t=1m2j (Xtj )−

[ 1T

T

∑t=1mj (Xtj )

]2,

The screened sub-model can be determined by,

S =j = 1, 2, . . . , pT + dT : v(j) ≥ ρT

,

where ρT is a pre-determined positive number.

31 / 51

The above criterion is equivalent to

S =j = 1, 2, . . . , pT + dT : ˆcor(j) ≥ ρT

,

where ρT = ρ1/2T /

√v(Y ).

Define the index set of “true" candidate models as

S = j = 1, 2, . . . , pT + dT : v(j) 6= 0 .

Theorem Suppose that the conditions A1—A5 in the paper are satisfied.For any small δ1 > 0, there exists a positive constant δ2 such that

Pr(

max1≤j≤pT+dT

|v(j)− v(j)| > δ1T−2(1−θ1)/5)

= O(M(T ) exp

−δ2T (1−θ1)/5

),

where M(T ) = (pT + dT )T (17+18θ1)/10 and 1/6 < θ1 < 1 is defined byh1 = T−θ1 .

32 / 51

Theorem If we choose the pre-determined tuning parameterρT = δ1T−2(1−θ1)/5 and assume

minj∈S

v(j) ≥ 2δ1T−2(1−θ1)/5,

then we have

Pr(S ⊂ S

)≥ 1−O

(MS (T ) exp

−δ2T (1−θ1)/5

),

where MS (T ) = |S|T (17+18θ1)/10 with |S| being the cardinality of S .As pT + dT = O(expT δ0), in order to ensure the validity of Theorem1(i), we need to impose the restriction δ0 < (1− θ1)/5, which reduces toδ0 < 4/25 if the order of the optimal bandwidth in kernel smoothing (i.e.,θ1 = 1/5) is used.

33 / 51

PMAMA step

We denote the chosen covariates (after KSIS in the first step) by

X∗t =(X ∗t1,X

∗t2, . . . ,X ∗tqT

)ᵀwhich may include both exogenous variables

and lags of Yt , where qT might be divergent but is smaller than thesample size T .We approximate the conditional regression functionm∗(x) = E(Yt |X∗t = x) by an affi ne combination of one-dimensionalconditional component regressions

m∗j (xj ) = E(Yt |X ∗tj = xj ), j = 1, . . . , qT .

Each marginal regression m∗j (·) can be treated as a “nonlinear candidatemodel". That is,

m∗(x) ≈ w0 +qT

∑j=1wjm∗j (xj ),

where wj , j = 0, 1, . . . , qT , are to be determined later and can be seen asthe weights for different “candidate models".

34 / 51

For j = 1, . . . , qT , we estimate the marginal regression functions m∗j (·) bythe kernel smoothing method:

m∗j (xj ) =∑Tt=1 YtK tj (xj )

∑Tt=1 K tj (xj )

, K tj (xj ) = K(X ∗tj − xj

h2

).

Then, for j = 1, . . . , qT , we let

M(j) =[m∗j (X

∗1j ), . . . , m∗j (X

∗Tj )]ᵀ=: ST (j)YT

be the estimated values of

M(j) =[m∗j (X

∗1j ), . . . ,mj (X ∗Tj )

]ᵀ,

where ST (j) is the T × T smoothing matrix whose (k, l)-component isK lj (X ∗kj )/

[∑Tt=1 K tj (X

∗kj )], and YT = (Y1, . . . ,YT )

ᵀ.

35 / 51

We define the objective function by

QT (wT ) =[YT − M(wT )

]ᵀ[YT − M(wT )

]+ T

qT

∑j=1pλ(|wj |),

where

M(wT ) =[w1ST (1) + . . .+ wqT ST (qT )

]YT = ST (Y)wT ,

ST (Y) =[ST (1)YT , . . . ,ST (qT )YT

], and pλ(·) is a penalty function

with a tuning parameter λ.Our semiparametric estimator of the optimal weights wo can be obtainedthrough minimising the objective function QT (wT ):

wT = argminwTQT (wT ).

36 / 51

AIC and BIC: pλ(|z |) = 0.5λ2I (|z | 6= 0) with different values of λ;LASSO: pλ(|z |) = λ|z |;SCAD: p′λ(z) = λ

[I (z ≤ λ) + a0λ−z

(a0−1)λ I (z > λ)]with pλ(0) = 0, where

a0 > 2, λ > 0 and I (·) is the indicator function.In our numerical studies, we use the SCAD penalty in PMAMAR.

37 / 51

Theorem Suppose that the conditions A1—A8 are satisfied. There exists alocal minimizer wT of the objective function QT (·) such that

‖wT −wo‖ = OP(√

qT (T−1/2 + aT )

),

where ‖ · ‖ denotes the Euclidean norm and

aT = max1≤j≤qT

|p′λ(|woj |)|, |woj | 6= 0

.

Theorem Let wT (2) be the estimator of wo (2) which is composed of allthe zero weights and further assume that

λ→ 0,

√Tλ√qT→ ∞, lim inf

T→∞lim infϑ→0+

p′λ(ϑ)λ

> 0.

Then, the local minimizer wT of the objective function QT (·) satisfieswT (2) = 0 with probability approaching one.

38 / 51

Theorem If we further assume that the eigenvalues of ΛT 1 are boundedaway from zero and infinity,

√TAT Σ−1/2

T

(ΛT 1+ΩT

)[wT (1)−wo (1)−

(ΛT 1+ΩT

)−1ωT

]d−→ N

(0,A0

),

where 0 is a null vector whose dimension may change from line to line,AT is an s × sT matrix such that ATA

ᵀT → A0 and A0 is an s × s

symmetric and non-negative definite matrix, s is a fixed positive integer.The definitions of ΣT , ΛT 1, ΩT and ωT are given in the paper.

39 / 51

SimulationsThe sample size is set to be T = 100, and the numbers of candidateexogenous covariates and lagged terms are (pT , dT ) = (30, 10) and(pT , dT ) = (150, 50). The model is defined by

Yt = m1(Zt1) +m2(Zt2) +m3(Zt3) +m4(Zt4) +m5(Yt−1)

+m6(Yt−2) +m7(Yt−3) + εt

for t ≥ 1, where, following Meier, van de Geer and Bühlmann (2009), weset

mi (x) = sin(0.5πx), i = 1, 2, . . . , 7.

The exogenous covariates

Zt = (Zt1,Zt2, . . . ,ZtpT )ᵀ

are independently drawn from pT -dimensional Gaussian distribution withzero mean and covariance matrix cov(Z) = IpT or CZ, whose themain-diagonal entries are 1 and off-diagonal entries are 1/2. The errorterm εt are independently generated from the N(0, 0.72) distribution. Thereal size of exogenous regressors is 4 and the real lag length is 3.

40 / 51

We generate 100+ T observations from the process with initial statesY−2 = Y−1 = Y0 = 0 and discard the first 100− dT observations.The iterative version of KSIS+PMAMAR performs better in bothestimation and prediction than the KSIS+PMAMAR.The penGAM is the most conservative in variable selection and on averageselects the least number of variables.The ISIS suffers from the model misspecification problem.When the correlation among the exogenous variables increases, theperformance of all approaches worsens.

41 / 51

PCA+PMAMAR

Impose an approximate factor modelling structure on the ultra-highdimensional exogenous regressors and use the well-known principalcomponent analysis to estimate the latent common factors;

Apply the PMAMAR method to select the estimated common factorsand lags of the response variable which are significant.

42 / 51

LettingB0T = (b

01, . . . ,b0pT )

ᵀand Ut = (ut1, . . . , utpT )

ᵀ,

we assume the approximate factor model:

Zt = B0T f0t +Ut ,

where b0k is an r -dimensional vector of factor loadings, f0t is an

r -dimensional vector of common factors, and utk is called an idiosyncraticerror.Denote ZT = (Z1, . . . ,ZT )

ᵀ, the T × pT matrix of the observations of

the exogenous variables. We then construct

FT =(f1, . . . , fT

)ᵀas the T × r matrix consisting of the r eigenvectors (multiplied by

√T )

associated with the r largest eigenvalues of the T × T matrix

ZTZᵀT /(TpT ).

43 / 51

Define

H = V−1(F ᵀTF 0T /T

) [(B0T )

ᵀB0T /pT

], F 0T =

(f01 , . . . , f0T

)ᵀ,

and V is the r × r diagonal matrix of the first r largest eigenvalues ofZTZ

ᵀT /(TpT ) arranged in descending order.

Theorem. Suppose that the conditions B1—B4 are satisfied, and

T = o(p2T ), pT = O(expT δ∗

), 0 ≤ δ∗ < 1/3.

For the PCA estimation ft , we have

maxt

∥∥ft −Hf0t ∥∥ = OP (T−1/2 + T 1/4p−1/2T

).

44 / 51

Consider the following multivariate regression function with rotated latentfactors and lags of response:

m∗f (x1, x2) = E(Yt |Hf0t = x1,Yt−1 = x2

).

Apply the PMAMAR with

X∗t ,f =(fᵀt ,Y

ᵀt−1

)ᵀ=(ft1, . . . , ftr ,Y

ᵀt−1

)ᵀ.

For k = 1, . . . , r , define

m∗k ,f (zk ) = E[Yt |f 0tk = zk

], f 0tk = e

ᵀr (k)Hf

0t ,

where er (k) is an r -dimensional column vector with the k-th elementbeing one and zeros elsewhere, k = 1, . . . , r . We estimate m∗k ,f (zk ) by thekernel smoothing method:

m∗k ,f (zk ) =∑Tt=1 Yt Ktk (zk )

∑Tt=1 Ktk (zk )

, Ktk (zk ) = K( ftk − zk

h3

), j = 1, . . . r ,

where h3 is a bandwidth and ftk is the k-th element of ft .45 / 51

Theorem Suppose that the conditions A5 and B1—B5 are satisfied, andthe latent factor f0t has a compact support. Then we have

max1≤k≤r

supzk∈F ∗k

∣∣m∗k ,f (zk )− m∗k ,f (zk )∣∣ = oP (T−1/2),

where F ∗k is the compact support of f 0tk , m∗k ,f (zk ) is the infeasible kernelestimation defined as m∗k ,f (zk ) but with ftk replaced by f

0tk .

Factor-augmented linear regression and autoregression: Stock and Watson(2002), Bernanke, Boivin and Eliasz, (2005) Bai and Ng (2006), Pesaran,Pick and Timmermann (2011) and Cheng and Hansen (2015).Additive regression on principal components: Härdle and Tsybakov (1995).

46 / 51

Set a maximum number, say rmax (which is usually not too large), for thefactors. Since the factors extracted from the eigenanalysis are orthogonalto each other, the over-extracted insignificant factors will be discarded inthe PMAMAR step.Select the first few eigenvectors (corresponding to the first few largesteigenvalues) of ZTZ

ᵀT /(TpT ) so that a pre-determined amount, say 95%,

of the total variation is accounted for.Other commonly-used selection criteria such as BIC can be found in Baiand Ng (2002) and Fan, Liao and Mincheva (2013).

47 / 51

The exogenous variables Zt are generated via an approximate factor model:

Zt = Bft + zt ,

where the rows of the pT × r loadings matrix B and the common factorsft , t = 1, . . . ,T , are independently generated from the multivariateN(0, Ir ) distribution, and the pT -dimensional error terms zt , t = 1, . . . ,T ,are independently drawn from 0.1N(0, IpT ).We set pT = 30 or 150, r = 3, and generate the response variable via

Yt = m1(ft1) +m2(ft2) +m3(ft3) +m4(Yt−1)

+m5(Yt−2) +m6(Yt−3) + εt ,

where fti is the i-th component of ft , mi (·), i = 1, . . . , 6, are the same asin Example 1, and εt , t = 1, . . . ,T , are independently drawn from theN(0, 0.72) distribution.In this example, we choose the number of candidate lags of Y as dT = 10.

48 / 51

When pT = 30, the KSIS+PMAMAR outperforms all the other approaches(except the Oracle) in terms of estimation and prediction accuracy.When pT becomes larger than T , the PCA based approaches show theiradvantage in effective dimension reduction of the exogenous variables,which results in their lower EE and PE.The PCA+PMAMAR has a lower EE but higher PE than thePCA+KSIS+PMAMAR. This is due to the fact that without the KSISstep the PCA+PMAMAR selects more false lags of Y .

49 / 51

Empirical applicationWe next apply the proposed semiparametric model averaging methods toforecast inflation in the UK. The data were collected from the Offi ce forNational Statistics (ONS) and the Bank of England (BoE) websites andincluded quarterly observations on CPI and some other economics variablesover the period Q1 1997 to Q4 2013.All the variables are seasonally adjusted. We use 53 predictor seriesmeasuring aggregate real activity and other economic indicators toforecast CPI. Given the possible time persistence of CPI, we also add its 4lags as predictors.

50 / 51

Data from Q1 1997 to Q4 2012 are used as the training set and thosebetween Q1 2013 and Q4 2013 are used for forecasting.

Method IKSIS+PMAMAR KSIS+PMAMAR PCA+PMAMARPE 0.0360 0.1130 0.0787

Method penGAM ISIS Phillips curvePE 0.0865 0.3275 1.1900

The Phillips curve specification is:

It+1 − It = α+ β(L)Ut + γ(L)∆It + εt+1,

where It is the CPI in the t-th quarter, β(L) = β0 + β1L+ β2L2 + β3L

3

and γ(L) = γ0 + γ1L+ γ2L2 + γ3L

3 are lag polynomials with L being thelag operator, Ut is the unemployment rate, and ∆ is the first differenceoperator.

51 / 51

90100

110

120

time

CPI

1997 2001 2005 2009 2013

-2-1

01

23

time

norm

alis

ed lo

g(C

PI/C

PI(-

1))

1997 2001 2005 2009 2013

-20

24

time

Y

1998 2000 2002 2004 2006 2008 2010 2012

ObservedIKSIS+PMAMARKSIS+PMAMARPCA+PMAMARpenGAMISISPhillips curveARVARSTAR

-1.0

-0.5

0.0

0.5

1.0

time

Y

Q1 2013 Q2 2013 Q3 2013 Q4 2013

ObservedIKSIS+PMAMARKSIS+PMAMARPCA+PMAMARpenGAMISISPhillips curveARVARSTAR

Nonparametric Modelling of High Dimensional Time Series · Outline 1 Survey some work on ⁄exible...

Documents

Transcript of Nonparametric Modelling of High Dimensional Time Series · Outline 1 Survey some work on ⁄exible...