Transcript of lmm.univ-lemans.fr/files/saps9/12/Dalalyan.pdf

Page 1:

Joint estimation of the conditional mean and the conditional variance in high-dimensions

Joint work with M. Hebiri, K. Meziani, J. Salmon

SAPS IX, Le Mans, FRANCE

Arnak S. Dalalyan

ENSAE / CREST / GENES

Page 2:

I. Problem presentation

Page 3:

Continuous-time autoregression

Observations: (x_t, y_t) ∈ R^d × R obeying

y_t = b*(x_t) + s*(x_t) Ẇ_t,   t ∈ T = [0, T],

where b*: R^d → R and s*: R^d → R_+ are such that

Drift: b*(x_t) = E[y_t | x_t].

Volatility: s*²(x_t) = Var[y_t | x_t].

(W_t)_{t≥0} is a standard Wiener process (w.r.t. a filtration (F_t)_{t≥0}).

The goal is to estimate the function b*.

© Dalalyan, A.S. Dec. 19, 2012

Page 4:

Discrete time (auto-)regression

Observations: a finite collection (x_t, y_t) ∈ R^d × R obeying

y_t = b*(x_t) + s*(x_t) ξ_t,   t ∈ T = {1, ..., T},

where b*: R^d → R and s*: R^d → R_+ are such that

Conditional mean: E[y_t | x_t] = b*(x_t).

Conditional variance: Var[y_t | x_t] = s*²(x_t).

Therefore, the ξ_t's satisfy E[ξ_t | x_t] = 0 and Var[ξ_t | x_t] = 1. They are often assumed Gaussian N(0, 1) for simplicity.

The goal is to estimate the functions b* and s*.
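These conditional-moment identities can be checked by simulation. The sketch below uses purely hypothetical choices of b* and s* (none of them come from the talk): draw (x_t, y_t) as in the model and compare the empirical mean and standard deviation of y on a thin slab of x-values with b* and s* at that point.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 1_000_000, 2

# Hypothetical b* and s*, chosen only for this illustration
def b_star(x):
    return np.sin(x[:, 0]) + 0.5 * x[:, 1]

def s_star(x):
    return 0.5 + 0.4 * np.abs(x[:, 0])     # maps into R_+

x = rng.normal(size=(T, d))
xi = rng.normal(size=T)                    # E[xi_t | x_t] = 0, Var[xi_t | x_t] = 1
y = b_star(x) + s_star(x) * xi

# Condition on a thin slab around the point (1, 0)
mask = (np.abs(x[:, 0] - 1.0) < 0.05) & (np.abs(x[:, 1]) < 0.05)
cond_mean = y[mask].mean()                 # close to b*(1, 0) = sin(1)
cond_std = y[mask].std()                   # close to s*(1, 0) = 0.9
```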


Page 5:

Sparsity Assumption

• In these settings, estimating b* and s* under no further assumption is an ill-posed problem.

• Sparsity scenario: b* and s* belong to some low-dimensional spaces.

Example: Homoscedastic regression

∀x,  b*(x) = Σ_{j=1}^p f_j(x) β*_j = [f_1(x), ..., f_p(x)] β*,   and   s*(x) ≡ σ*

↪→ Dictionary {f_1, ..., f_p} of functions from R^d to R

↪→ Unknown vector (β*, σ*) ∈ R^p × R, with a sparse vector β*

↪→ Sparsity index: p* = |β*|_0 := Σ_{j=1}^p 1(β*_j ≠ 0), with p* ≪ p


Page 6:

Main assumptions on b∗ and s∗

Re-parametrize by the inverse of the conditional volatility s*:

r*(x) = 1 / s*(x)   and   f*(x) = b*(x) / s*(x).

Group Sparsity Assumption

For p given functions f_1, ..., f_p mapping R^d into R, there is a vector φ* ∈ R^p such that f*(x) = Σ_{j=1}^p φ*_j f_j(x). Furthermore, for a given partition G_1, ..., G_K of {1, ..., p}, the vector φ* is group-sparse, that is, Card({k : |φ*_{G_k}|_2 ≠ 0}) ≪ K.

Low-dimensional volatility assumption

For q given functions r_1, ..., r_q mapping R^d into R_+, there is a vector α* ∈ R^q such that r*(x) = Σ_{ℓ=1}^q α*_ℓ r_ℓ(x) for almost every x ∈ R^d.


Page 7:

Motivations for these assumptions

• The group sparsity assumption is relevant in a sparse additive model, that is, when

↪→ f*(x) = ψ_1(x_1) + ... + ψ_d(x_d) s.t. ψ_j ≡ 0 for most j,
↪→ each ψ_j is projected on a basis: ψ_j(x_j) ≈ Σ_{ℓ=1}^{K_j} φ*_{ℓ,j} f_ℓ(x_j),
↪→ group sparsity of φ = (φ_{ℓ,j}).

• Low dimensionality of the volatility occurs, for instance, when the noise is block-wise homoscedastic or periodic.


Page 8:

II. Relation to previous work

Page 9:

Homoscedastic regression

The model is

Y = Xβ* + σ* ξ

Observations: Y = [y_1, ..., y_T]^⊤ ∈ R^T

Noise: ξ = [ξ_1, ..., ξ_T]^⊤ ∈ R^T

Design matrix: X = [x_{t,j}] ∈ R^{T×p} with x_{t,j} = f_j(x_t)

Coefficients: β* = [β*_1, ..., β*_p]^⊤ ∈ R^p

Standard deviation: s*(x_t) ≡ σ* ∈ R_+*

Recall that the sparsity assumption postulates that |β*|_0 = p* ≪ p.


Page 10:

Most popular methods: Lasso and Dantzig selector

• The Lasso of Tibshirani (1996) is defined as

β̂^Lasso = argmin_{β ∈ R^p} ( |Y − Xβ|_2² / (2σ*²) + λ Σ_{j=1}^p |X_j|_2 |β_j| )

• The Dantzig selector of Candès and Tao (2007) is

β̂^DS = argmin_{β ∈ R^p} { Σ_{j=1}^p |X_j|_2 |β_j| : max_{j=1,...,p} |X_j^⊤(Y − Xβ)| / |X_j|_2 ≤ λ }

For a tuning parameter satisfying λ ∝ 1/σ*, sharp oracle inequalities are available, e.g., Bickel et al. (2009):

E( (1/T) |X(β̂ − β*)|_2² ) ≤ C p* log(p) / T.

To correctly tune the parameter λ, the knowledge of σ* is necessary.
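The weighted Lasso above can be sketched numerically with plain proximal gradient (ISTA) rather than any particular solver from the literature; the data, dimensions and tuning below are hypothetical, with λ taken proportional to 1/σ* as discussed.

```python
import numpy as np

def lasso_ista(X, Y, lam, sigma, n_iter=2000):
    """Proximal-gradient (ISTA) sketch of the weighted Lasso
    min_beta |Y - X beta|_2^2 / (2 sigma^2) + lam * sum_j |X_j|_2 |beta_j|."""
    w = np.linalg.norm(X, axis=0)                 # column norms |X_j|_2
    L = np.linalg.norm(X, 2) ** 2 / sigma ** 2    # Lipschitz constant of the gradient
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        z = beta - X.T @ (X @ beta - Y) / (sigma ** 2 * L)
        beta = np.sign(z) * np.maximum(np.abs(z) - lam * w / L, 0.0)  # soft-threshold
    return beta

rng = np.random.default_rng(1)
T, p = 100, 20
beta_true = np.zeros(p)
beta_true[:3] = [2.0, -1.5, 1.0]                  # sparse: p* = 3 << p
sigma = 0.5
X = rng.normal(size=(T, p))
Y = X @ beta_true + sigma * rng.normal(size=T)

# tuning proportional to 1/sigma, as stated above
beta_hat = lasso_ista(X, Y, lam=np.sqrt(2 * np.log(p)) / sigma, sigma=sigma)
```

The irrelevant coordinates are driven to (near) zero by the soft-thresholding step, while the active ones are recovered up to a small shrinkage bias.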


Page 11:

Joint estimation of β* and σ*

• Scaled Lasso, Städler et al. (2010):

(β̂^ScL, σ̂^ScL) = argmin_{β,σ} ( T log(σ) + |Y − Xβ|_2² / (2σ²) + (λ/σ) Σ_{j=1}^p |X_j|_2 |β_j| ).

This can be recast as a convex problem (set ρ := 1/σ and φ := β/σ):

argmin_{φ,ρ} ( −T log(ρ) + |ρY − Xφ|_2² / 2 + λ Σ_{j=1}^p |X_j|_2 |φ_j| ).

• Square-root Lasso, Antoniadis (2010), Belloni et al. (2011), Sun & Zhang (2012), Gautier & Tsybakov (2011):

β̂^SqR-Lasso = argmin_β ( |Y − Xβ|_2 + λ Σ_{j=1}^p |X_j|_2 |β_j| ),
σ̂^SqR-Lasso = T^{−1/2} |Y − Xβ̂^SqR-Lasso|_2.

• A scaled DS version was proposed by Dalalyan & Chen (2012): sharp analysis and computational advantages.
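One way to get intuition for these joint estimators is an alternating scheme in the spirit of the scaled Lasso: fit a Lasso at the current noise level σ̂, then refit σ̂ from the residuals, exactly as in the square-root Lasso formula σ̂ = T^{−1/2}|Y − Xβ̂|_2. This is a simplified sketch with a plain ISTA inner solver and hypothetical data, not the authors' implementation:

```python
import numpy as np

def lasso(X, Y, lam, n_iter=1500):
    # plain ISTA for min_b 0.5 |Y - Xb|_2^2 + lam * sum_j |X_j|_2 |b_j|
    w = np.linalg.norm(X, axis=0)
    L = np.linalg.norm(X, 2) ** 2
    b = np.zeros(X.shape[1])
    for _ in range(n_iter):
        z = b - X.T @ (X @ b - Y) / L
        b = np.sign(z) * np.maximum(np.abs(z) - lam * w / L, 0.0)
    return b

def scaled_lasso(X, Y, lam0, n_outer=10):
    """Alternate between a Lasso fit at noise level sigma_hat and the
    residual-based update sigma_hat = |Y - X beta_hat|_2 / sqrt(T)."""
    T = len(Y)
    sigma = Y.std()                        # crude initial guess
    beta = np.zeros(X.shape[1])
    for _ in range(n_outer):
        beta = lasso(X, Y, lam0 * sigma)
        sigma = np.linalg.norm(Y - X @ beta) / np.sqrt(T)
    return beta, sigma

rng = np.random.default_rng(2)
T, p = 200, 30
beta_true = np.zeros(p)
beta_true[:4] = [1.5, -1.0, 2.0, 0.8]
sigma_true = 0.7
X = rng.normal(size=(T, p))
Y = X @ beta_true + sigma_true * rng.normal(size=T)
beta_hat, sigma_hat = scaled_lasso(X, Y, lam0=np.sqrt(2 * np.log(p)))
```

Unlike the plain Lasso, no knowledge of σ* is required: the tuning λ_0 is scale-free and σ̂ converges to a value close to σ* together with β̂.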


Page 12:

III. Main results

Page 13:

Estimation for heteroscedastic regression: continuous time

Observations: (x_t, y_t)_{t≤T} obeying y_t = b*(x_t) + s*(x_t) Ẇ_t, thus

y_t / s*(x_t) = ( Σ_{j=1}^p φ*_j f_j(x_t) ) + Ẇ_t = X(t)φ* + Ẇ_t,   t ∈ T = [0, T].

Scaled Heteroscedastic Lasso:

φ̂^ScHeL = argmin_{φ ∈ R^p} { (1/2) ∫_0^T ( y_t / s*(x_t) − X(t)φ )² dt + λ Σ_{k=1}^K |X_{G_k} φ_{G_k}|_2 },

with |X_G φ_G|_2² = ∫_0^T ( Σ_{j∈G} X_j(t) φ_j )² dt = ∫_0^T ( Σ_{j∈G} f_j(x_t) φ_j )² dt.

Computation: the estimator φ̂^ScHeL can be efficiently computed even for very large dimensions p by solving a second-order cone program.


Page 14:

Estimation for heteroscedastic regression: discrete time

Observations: (x_t, y_t)_{t=1,...,T} obeying

y_t = b*(x_t) + s*(x_t) ξ_t = r*(x_t)^{−1} (f*(x_t) + ξ_t).

Under our assumptions,

f*(x_t) = Σ_{j=1}^p φ*_j f_j(x_t) = X(t)φ*,   r*(x_t) = Σ_{ℓ=1}^q α*_ℓ r_ℓ(x_t) = R(t)α*.

Thus, [f*(x_1), ..., f*(x_T)]^⊤ = Xφ* and [r*(x_1), ..., r*(x_T)]^⊤ = Rα*. This leads to

D_Y Rα* = Xφ* + ξ,   where D_Y = diag(y_t; t = 1, ..., T).

Scaled Heteroscedastic Lasso: (φ̂^ScHeL, α̂^ScHeL) is a solution to

min_{φ ∈ R^p, α ∈ R^q} { −Σ_{t=1}^T log(R(t)α) + (1/2) |D_Y Rα − Xφ|_2² + λ Σ_{k=1}^K |X_{G_k} φ_{G_k}|_2 },

where the first two terms form the Gaussian negative log-likelihood and the last one is a sparsity-promoting penalty.
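The penalized criterion above is straightforward to write down as a function. The sketch below merely evaluates it on made-up data (all sizes and inputs are hypothetical; the SOCP solvers used for the actual optimization are not reproduced here):

```python
import numpy as np

def schel_objective(phi, alpha, X, R, Y, groups, lam):
    """ScHeL criterion: -sum_t log(R(t) alpha)
    + 0.5 * |D_Y R alpha - X phi|_2^2
    + lam * sum_k |X_{G_k} phi_{G_k}|_2."""
    r = R @ alpha                      # candidate values r*(x_t); must be positive
    if np.any(r <= 0):
        return np.inf                  # -log undefined outside the feasible cone
    resid = Y * r - X @ phi            # D_Y R alpha - X phi, since D_Y = diag(y_t)
    nll = -np.log(r).sum() + 0.5 * resid @ resid
    pen = sum(np.linalg.norm(X[:, g] @ phi[g]) for g in groups)
    return nll + lam * pen

rng = np.random.default_rng(3)
T, p, q = 50, 6, 2                                  # toy sizes
X = rng.normal(size=(T, p))
R = np.abs(rng.normal(size=(T, q))) + 0.1           # nonnegative volatility dictionary
Y = rng.normal(size=T)
groups = [np.arange(0, 3), np.arange(3, 6)]         # partition G_1, G_2 of {1, ..., p}
phi0, alpha0 = rng.normal(size=p), np.array([1.0, 0.5])
val = schel_objective(phi0, alpha0, X, R, Y, groups, lam=2.0)
val_hi = schel_objective(phi0, alpha0, X, R, Y, groups, lam=4.0)
```

The criterion is jointly convex in (φ, α) on the region where Rα > 0, which is what makes the second-order cone formulation possible.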


Page 15:

Finite-sample risk bounds for the ScHeL

Theorem. Consider either the continuous-time model or the discrete-time model with sub-Gaussian errors ξ. Let K* (resp. p*) be the number of relevant groups (resp. coordinates of φ*). Let ε ∈ (0, 1) be a tolerance level and set

λ_k = 4( √|G_k| + √log(K/ε) ).

Under some assumptions, with probability at least 1 − 2ε,

|X(φ̂ − φ*)|_2 ≤ D_{T,ε}^{3/2} √(q log(2q/ε)) + D_{T,ε} √(p* + K* log(K/ε)),

|R(α̂ − α*)|_2 ≤ D_{T,ε}^{3/2} √(q log(2q/ε)) + D_{T,ε} √(p* + K* log(K/ε)),

where D_{T,ε} ∝ (max_t |f*(x_t)| + log(2T/ε)).


Page 16:

Summary

New procedures named ScHeL and ScHeDs:

• Suitable for fitting the heteroscedastic regression model.

• Simultaneous estimation of the mean and the variance functions.

• Take group sparsity into account.

• Implemented using two different solvers:
↪→ a primal-dual interior-point method (highly accurate),
↪→ an optimal first-order method (moderately accurate but with cheap iterations).

• Competitive with state-of-the-art algorithms
↪→ applicable in a much more general framework.

Manuscript, proofs and codes available on request.


Page 17:

[Figure: three panels plotted against the dimension p (0 to 2500): "Precisions on the nonzero coefficients" (mean error), "Precisions on the coefficients = 0" (mean error), and "Running times" (mean running time, sec), each comparing IP and OFO.]

FIGURE: Comparing the interior-point (IP) vs. optimal first-order (OFO) method. Top & middle: MSE of β̂^ScHeDs. Bottom: running times.


Page 18:

References I

A. Antoniadis, Comments on: ℓ1-penalization for mixture regression models, TEST 19 (2010), no. 2, 257–258.

A. Belloni, V. Chernozhukov, and L. Wang, Square-root Lasso: pivotal recovery of sparse signals via conic programming, Biometrika 98 (2011), no. 4, 791–806.

P. J. Bickel, Y. Ritov, and A. B. Tsybakov, Simultaneous analysis of Lasso and Dantzig selector, Ann. Statist. 37 (2009), no. 4, 1705–1732.

E. J. Candès and T. Tao, The Dantzig selector: statistical estimation when p is much larger than n, Ann. Statist. 35 (2007), no. 6, 2392–2404.

A. S. Dalalyan and Y. Chen, Fused sparsity and robust estimation for linear models with unknown variance, Advances in Neural Information Processing Systems 25 (NIPS), 2012, pp. 1268–1276.

E. Gautier and A. Tsybakov, High-dimensional instrumental variables regression and confidence sets, September 2011.

N. Städler, P. Bühlmann, and S. van de Geer, ℓ1-penalization for mixture regression models, TEST 19 (2010), no. 2, 209–256.

T. Sun and C.-H. Zhang, Scaled sparse linear regression, Biometrika 99 (2012), no. 4, 879–898.

R. Tibshirani, Regression shrinkage and selection via the Lasso, J. Roy. Statist. Soc. Ser. B 58 (1996), no. 1, 267–288.

