Joint estimation of the conditional mean and the conditional variance in high dimensions
Joint work with M. Hebiri, K. Meziani, J. Salmon
SAPS IX, Le Mans, FRANCE
Arnak S. Dalalyan
ENSAE / CREST / GENES
I. Problem presentation
Continuous-time autoregression
Observations: (x_t, y_t) ∈ R^d × R obeying

    y_t = b*(x_t) + s*(x_t) W_t,   t ∈ T = [0, T],

where b*: R^d → R and s*: R^d → R_+ are such that

    Drift: b*(x_t) = E[y_t | x_t].
    Volatility: s*²(x_t) = Var[y_t | x_t].

(W_t)_{t≥0} is a standard Wiener process (w.r.t. a filtration (F_t)_{t≥0}).
The goal is to estimate the function b∗.
© Dalalyan, A.S. Dec. 19, 2012
Discrete time (auto-)regression
Observations: a finite collection (x_t, y_t) ∈ R^d × R obeying

    y_t = b*(x_t) + s*(x_t) ξ_t,   t ∈ T = {1, ..., T},

where b*: R^d → R and s*: R^d → R_+ are such that

    Conditional mean: E[y_t | x_t] = b*(x_t).
    Conditional variance: Var[y_t | x_t] = s*²(x_t).

Therefore, the ξ_t's satisfy E[ξ_t | x_t] = 0 and Var[ξ_t | x_t] = 1. They are often assumed Gaussian N(0, 1) for simplicity.
The goal is to estimate the functions b∗ and s∗.
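The discrete-time model above is easy to simulate. Here is a minimal NumPy sketch; the particular choices of b* and s* are hypothetical illustrations, not functions from the talk.

```python
import numpy as np

# Simulate y_t = b*(x_t) + s*(x_t) xi_t with illustrative (hypothetical)
# choices of the conditional mean b* and conditional volatility s*.
rng = np.random.default_rng(0)
T, d = 1000, 2

def b_star(x):
    # example conditional mean
    return np.sin(x[:, 0]) + 0.5 * x[:, 1]

def s_star(x):
    # example conditional standard deviation (strictly positive)
    return 0.5 + 0.5 * np.abs(x[:, 0])

x = rng.standard_normal((T, d))
xi = rng.standard_normal(T)          # E[xi_t | x_t] = 0, Var[xi_t | x_t] = 1
y = b_star(x) + s_star(x) * xi
print(y.shape)
```

The pair (x, y) then plays the role of the observations (x_t, y_t) of the model.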
Sparsity Assumption
• In these settings, estimating b* and s* under no further assumption is an ill-posed problem.
• Sparsity scenario: b* and s* belong to some low-dimensional spaces.

Example: Homoscedastic regression

    ∀x,  b*(x) = ∑_{j=1}^p f_j(x) β*_j = [f_1(x), ..., f_p(x)] β*,   and   s*(x) ≡ σ*

↪→ Dictionary {f_1, ..., f_p} of functions from R^d to R
↪→ Unknown vector (β*, σ*) ∈ R^p × R, sparse vector β*
↪→ Sparsity index: p* = |β*|_0 := ∑_{j=1}^p 1(β*_j ≠ 0), with p* ≪ p
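The sparsity index p* = |β*|_0 is simply a count of nonzero coordinates. A tiny NumPy illustration with a hypothetical sparse β*:

```python
import numpy as np

# Sparsity index p* = |beta*|_0 for a hypothetical sparse vector beta*
# in ambient dimension p, with p* << p.
p = 2000
beta_star = np.zeros(p)
beta_star[[3, 17, 256]] = [1.5, -2.0, 0.7]   # three active coordinates
p_star = np.count_nonzero(beta_star)
print(p_star, p)  # 3 2000
```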
Main assumptions on b∗ and s∗
Re-parametrize by the inverse of the conditional volatility s*:

    r*(x) = 1 / s*(x)   and   f*(x) = b*(x) / s*(x).

Group Sparsity Assumption
For p given functions f_1, ..., f_p mapping R^d into R, there is a vector φ* ∈ R^p such that f*(x) = ∑_{j=1}^p φ*_j f_j(x). Furthermore, for a given partition G_1, ..., G_K of {1, ..., p}, the vector φ* is group-sparse, that is, Card({k : |φ*_{G_k}|_2 ≠ 0}) ≪ K.

Low-dimensional volatility assumption
For q given functions r_1, ..., r_q mapping R^d into R_+, there is a vector α* ∈ R^q such that r*(x) = ∑_{ℓ=1}^q α*_ℓ r_ℓ(x) for almost every x ∈ R^d.
c© Dalalyan, A.S. Dec. 19, 2012 6
7
Motivations for these assumptions
■ The group sparsity assumption is relevant in a sparse additive model, that is, when
↪→ f*(x) = ψ_1(x_1) + ... + ψ_d(x_d) s.t. ψ_j ≡ 0 for most j,
↪→ projection on a basis: ψ_j(x_j) ≈ ∑_{ℓ=1}^{K_j} φ*_{ℓ,j} f_ℓ(x_j),
↪→ group sparsity of φ = (φ_{ℓ,j}).
■ Low dimensionality of the volatility occurs, for instance, when the noise is block-wise homoscedastic or periodic.
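The group-sparsity condition Card({k : |φ*_{G_k}|_2 ≠ 0}) ≪ K can be checked numerically. A short NumPy sketch with a hypothetical partition into equal-sized groups:

```python
import numpy as np

# Count the active groups of a hypothetical group-sparse vector phi*,
# i.e. Card({k : |phi*_{G_k}|_2 != 0}), for a partition G_1,...,G_K.
K, group_size = 50, 4
phi_star = np.zeros(K * group_size)
phi_star[0:4] = [1.0, -0.5, 0.0, 2.0]     # group 0 is active
phi_star[20:24] = 0.3                     # group 5 is active
groups = [np.arange(k * group_size, (k + 1) * group_size) for k in range(K)]
active = sum(np.linalg.norm(phi_star[G]) > 0 for G in groups)
print(active, K)  # 2 50
```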
II. Relation to previous work
Homoscedastic regression
The model is

    Y = Xβ* + σ* ξ

Observations: Y = [y_1, ..., y_T]^⊤ ∈ R^T
Noise: ξ = [ξ_1, ..., ξ_T]^⊤ ∈ R^T
Design matrix: X = (x_{t,j}) ∈ R^{T×p} with x_{t,j} = f_j(x_t)
Coefficients: β* = [β*_1, ..., β*_p]^⊤ ∈ R^p
Standard deviation: s*(x_t) ≡ σ* ∈ R_+*

Recall that the sparsity assumption postulates that |β*|_0 = p* ≪ p.
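The design matrix X with entries x_{t,j} = f_j(x_t) can be assembled directly from a dictionary. A minimal sketch, where the dictionary functions are illustrative choices, not those of the talk:

```python
import numpy as np

# Build the design matrix X with x_{t,j} = f_j(x_t) from a hypothetical
# dictionary {f_1, ..., f_p} of functions from R^d to R (here d = 1).
rng = np.random.default_rng(1)
T, d = 200, 1
x = rng.standard_normal((T, d))
dictionary = [lambda u: np.ones(len(u)),      # f_1: constant
              lambda u: u[:, 0],              # f_2: linear
              lambda u: u[:, 0] ** 2,         # f_3: quadratic
              lambda u: np.cos(u[:, 0])]      # f_4: cosine
X = np.column_stack([f(x) for f in dictionary])   # shape (T, p)
print(X.shape)  # (200, 4)
```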
Most popular methods : Lasso and Dantzig selector
■ The Lasso of Tibshirani (1996) is defined as

    β̂^Lasso = argmin_{β ∈ R^p} ( |Y − Xβ|_2² / (2σ*²) + λ ∑_{j=1}^p |X_j|_2 |β_j| )

■ The Dantzig selector of Candès and Tao (2007) is

    β̂^DS = argmin_{β ∈ R^p} { ∑_{j=1}^p |X_j|_2 |β_j| : max_{j=1,...,p} |X_j^⊤(Y − Xβ)| / |X_j|_2 ≤ λ }

For a tuning parameter satisfying λ ∝ 1/σ*, sharp oracle inequalities are available, e.g., Bickel et al. (2009):

    E( (1/T) |X(β̂ − β*)|_2² ) ≤ C p* log(p) / T,

where β̂ denotes either of the two estimators.

To correctly tune the parameter λ, the knowledge of σ* is necessary.
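For illustration only (this is not a solver from the talk), the weighted Lasso objective above can be minimized by proximal gradient descent (ISTA). A minimal NumPy sketch, taking σ* = 1 and a hypothetical sparse design:

```python
import numpy as np

# Minimal ISTA (proximal gradient) sketch for the weighted Lasso
# objective  |Y - X b|_2^2 / 2 + lam * sum_j |X_j|_2 |b_j|  (sigma* = 1).
def lasso_ista(X, Y, lam, n_iter=500):
    w = np.linalg.norm(X, axis=0)              # column weights |X_j|_2
    L = np.linalg.norm(X, ord=2) ** 2          # Lipschitz constant of the gradient
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        g = X.T @ (X @ beta - Y)               # gradient of the smooth part
        z = beta - g / L
        thr = lam * w / L
        beta = np.sign(z) * np.maximum(np.abs(z) - thr, 0.0)  # soft-thresholding
    return beta

# Hypothetical sparse instance: p* = 3 active coordinates out of p = 20.
rng = np.random.default_rng(2)
T, p = 100, 20
beta_true = np.zeros(p)
beta_true[:3] = [2.0, -1.5, 1.0]
X = rng.standard_normal((T, p))
Y = X @ beta_true + 0.1 * rng.standard_normal(T)
beta_hat = lasso_ista(X, Y, lam=0.5)
print(np.count_nonzero(np.abs(beta_hat) > 1e-6))
```

With this tuning the estimate typically recovers the three planted coefficients (slightly shrunk) and keeps the remaining coordinates at or near zero.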
Joint estimation of β∗ and σ∗
■ Scaled Lasso, Städler et al. (2010):

    (β̂^ScL, σ̂^ScL) = argmin_{β,σ} ( T log(σ) + |Y − Xβ|_2² / (2σ²) + (λ/σ) ∑_{j=1}^p |X_j|_2 |β_j| ).

This can be recast as a convex problem (set ρ := 1/σ and φ := β/σ):

    argmin_{φ,ρ} ( −T log(ρ) + |ρY − Xφ|_2² / 2 + λ ∑_{j=1}^p |X_j|_2 |φ_j| ).

■ Square-root Lasso, Antoniadis (2010), Belloni et al. (2011), Sun & Zhang (2012), Gautier & Tsybakov (2011):

    β̂^SqR-Lasso = argmin_β ( |Y − Xβ|_2 + λ ∑_{j=1}^p |X_j|_2 |β_j| ),
    σ̂^SqR-Lasso = T^{−1/2} |Y − Xβ̂^SqR-Lasso|_2.

■ A scaled DS version was proposed by Dalalyan & Chen (2012): sharp analysis and computational advantages.
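The equivalence behind the recast problem is easy to verify numerically: the Scaled Lasso objective at (β, σ) equals the convex objective at (φ, ρ) = (β/σ, 1/σ). A small NumPy check with arbitrary (hypothetical) inputs:

```python
import numpy as np

# Check: Scaled Lasso objective at (beta, sigma) equals the recast
# convex objective at (phi, rho) = (beta/sigma, 1/sigma).
rng = np.random.default_rng(3)
T, p, lam = 50, 8, 1.3
X = rng.standard_normal((T, p))
Y = rng.standard_normal(T)
beta = rng.standard_normal(p)
sigma = 0.7
w = np.linalg.norm(X, axis=0)                 # column norms |X_j|_2

def scaled_lasso_obj(beta, sigma):
    return (T * np.log(sigma)
            + np.sum((Y - X @ beta) ** 2) / (2 * sigma ** 2)
            + (lam / sigma) * np.sum(w * np.abs(beta)))

def recast_obj(phi, rho):
    return (-T * np.log(rho)
            + np.sum((rho * Y - X @ phi) ** 2) / 2
            + lam * np.sum(w * np.abs(phi)))

rho, phi = 1 / sigma, beta / sigma
print(np.isclose(scaled_lasso_obj(beta, sigma), recast_obj(phi, rho)))  # True
```

Term by term: −T log ρ = T log σ, |ρY − Xφ|²/2 = |Y − Xβ|²/(2σ²), and λ∑w_j|φ_j| = (λ/σ)∑w_j|β_j|, so the two problems have the same minimizers up to the change of variables.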
III. Main results
Estimation for heteroscedastic regression: continuous time

Observations: (x_t, y_t)_{t≤T} obeying y_t = b*(x_t) + s*(x_t) W_t, thus

    y_t / s*(x_t) = ( ∑_{j=1}^p φ*_j f_j(x_t) ) + W_t = X(t)φ* + W_t,   t ∈ T = [0, T].

Scaled Heteroscedastic Lasso:

    φ̂^ScHeL = argmin_{φ ∈ R^p} { (1/2) ∫_0^T ( y_t / s*(x_t) − X(t)φ )² dt + λ ∑_{k=1}^K |X_{G_k} φ_{G_k}|_2 },

with

    |X_G φ_G|_2² = ∫_0^T ( ∑_{j∈G} X_j(t) φ_j )² dt = ∫_0^T ( ∑_{j∈G} f_j(x_t) φ_j )² dt.

Computation: the estimator φ̂^ScHeL can be efficiently computed even for very large dimensions p by solving a second-order cone program.
Estimation for heteroscedastic regression: discrete time

Observations: (x_t, y_t)_{t=1,...,T} obeying

    y_t = b*(x_t) + s*(x_t) ξ_t = r*(x_t)^{−1} ( f*(x_t) + ξ_t ).

Under our assumptions,

    f*(x_t) = ∑_{j=1}^p φ*_j f_j(x_t) = X(t)φ*,   r*(x_t) = ∑_{ℓ=1}^q α*_ℓ r_ℓ(x_t) = R(t)α*.

Thus, [f*(x_1), ..., f*(x_T)]^⊤ = Xφ* and [r*(x_1), ..., r*(x_T)]^⊤ = Rα*. This leads to

    D_Y Rα* = Xφ* + ξ,   D_Y = diag(y_t ; t = 1, ..., T).

Scaled Heteroscedastic Lasso: (φ̂^ScHeL, α̂^ScHeL) is the solution to

    min_{φ ∈ R^p, α ∈ R^q} { −∑_{t=1}^T log(R(t)α) + (1/2) |D_Y Rα − Xφ|_2² + λ ∑_{k=1}^K |X_{G_k} φ_{G_k}|_2 },

where the first two terms form the (negative) Gaussian log-likelihood and the last term is a sparsity-promoting penalty.
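The ScHeL objective above can be evaluated directly for a candidate pair (φ, α). A minimal NumPy sketch with hypothetical dimensions and inputs (α is taken positive so that R(t)α > 0 and the log term is defined); the talk's procedures minimize this objective, while the function here only evaluates it:

```python
import numpy as np

# Evaluate the ScHeL objective: negative Gaussian log-likelihood
# plus the group-sparsity penalty, for a given (phi, alpha).
rng = np.random.default_rng(4)
T, p, q, K = 100, 12, 3, 4
X = rng.standard_normal((T, p))             # rows X(t) = [f_1(x_t),...,f_p(x_t)]
R = np.abs(rng.standard_normal((T, q)))     # rows R(t) = [r_1(x_t),...,r_q(x_t)] >= 0
Y = rng.standard_normal(T)
groups = np.array_split(np.arange(p), K)    # partition G_1, ..., G_K
lam = 2.0

def schel_obj(phi, alpha):
    Ralpha = R @ alpha                      # r(x_t) = R(t) alpha, must be > 0
    nll = -np.sum(np.log(Ralpha)) + 0.5 * np.sum((Y * Ralpha - X @ phi) ** 2)
    pen = lam * sum(np.linalg.norm(X[:, G] @ phi[G]) for G in groups)
    return nll + pen

phi0 = rng.standard_normal(p)
alpha0 = np.abs(rng.standard_normal(q)) + 0.1   # keeps R @ alpha positive
print(np.isfinite(schel_obj(phi0, alpha0)))  # True
```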
Finite sample risk bounds for the ScHeL
Theorem Consider either continuous-time model or discrete-time withsub-Gaussian errors ξ. Let K ∗ (resp. p∗) be the number of relevantgroups (resp. corrdinates of φ∗). Let ε ∈ (0,1) be a tolerance level andset
λk = 4(√|Gk |+
√log(K/ε)
).
Under some assumptions, with probability at least 1− 2ε,∣∣X(φ− φ∗)∣∣2 ≤ D3/2
T ,ε
√q log(2q/ε) + DT ,ε
√p∗ + K ∗ log(K/ε).∣∣R(α−α∗)
∣∣2 ≤ D3/2
T ,ε
√q log(2q/ε) + DT ,ε
√p∗ + K ∗ log(K/ε),
where DT ,ε ∝ (maxt |f∗(x t)|+ log(2T/ε)).
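The group-wise tuning parameters λ_k of the theorem depend only on the group sizes, K, and ε; a quick computation for example (hypothetical) group sizes:

```python
import numpy as np

# lambda_k = 4 (sqrt(|G_k|) + sqrt(log(K / eps))) for example group sizes.
eps = 0.05
group_sizes = np.array([4, 4, 8, 16])
K = len(group_sizes)
lam = 4 * (np.sqrt(group_sizes) + np.sqrt(np.log(K / eps)))
print(np.round(lam, 2))
```

Larger groups get a larger λ_k, and equal-sized groups share the same value.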
Summary
New procedures named ScHeL and ScHeDs:

■ Suitable for fitting the heteroscedastic regression model.
■ Simultaneous estimation of the mean and the variance functions.
■ Take group sparsity into account.
■ Implemented using two different solvers:
↪→ a primal-dual interior-point method (highly accurate),
↪→ an optimal first-order method (moderately accurate but with cheap iterations).
■ Competitive with state-of-the-art algorithms,
↪→ applicable in a much more general framework.

Manuscript, proofs and codes available on request.
FIGURE: Comparing the interior-point (IP) and optimal first-order (OFO) methods as the dimension p grows from 0 to 2500. Top panel: mean error of β̂^ScHeDs on the nonzero coefficients. Middle panel: mean error on the zero coefficients. Bottom panel: mean running times (sec).
References I
A. Antoniadis, Comments on: ℓ1-penalization for mixture regression models, TEST 19 (2010), no. 2, 257–258.
A. Belloni, V. Chernozhukov, and L. Wang, Square-root Lasso: pivotal recovery of sparse signals via conic programming, Biometrika 98 (2011), no. 4, 791–806.
P. J. Bickel, Y. Ritov, and A. B. Tsybakov, Simultaneous analysis of Lasso and Dantzig selector, Ann. Statist. 37 (2009), no. 4, 1705–1732.
E. J. Candès and T. Tao, The Dantzig selector: statistical estimation when p is much larger than n, Ann. Statist. 35 (2007), no. 6, 2392–2404.
A. S. Dalalyan and Y. Chen, Fused sparsity and robust estimation for linear models with unknown variance, Advances in Neural Information Processing Systems 25: NIPS, 2012, pp. 1268–1276.
E. Gautier and A. Tsybakov, High-dimensional instrumental variables regression and confidence sets, September 2011.
N. Städler, P. Bühlmann, and S. van de Geer, ℓ1-penalization for mixture regression models, TEST 19 (2010), no. 2, 209–256.
T. Sun and C.-H. Zhang, Scaled sparse linear regression, Biometrika 99 (2012), no. 4, 879–898.
R. Tibshirani, Regression shrinkage and selection via the Lasso, J. Roy. Statist. Soc. Ser. B 58 (1996), no. 1, 267–288.