Joint estimation of the conditional mean and the conditional variance in high dimensions
Joint work with M. Hebiri, K. Meziani, J. Salmon
SAPS IX, Le Mans, FRANCE
Arnak S. Dalalyan
ENSAE / CREST / GENES
I. Problem presentation
Continuous-time autoregression
Observations: (x_t, y_t) ∈ R^d × R obeying

    y_t = b*(x_t) + s*(x_t) W_t,   t ∈ T = [0, T],

where b*: R^d → R and s*: R^d → R_+ are such that

    Drift: b*(x_t) = E[y_t | x_t].
    Volatility: s*²(x_t) = Var[y_t | x_t].

(W_t)_{t≥0} is a standard Wiener process (w.r.t. a filtration (F_t)_{t≥0}).
The goal is to estimate the function b∗.
© Dalalyan, A.S. Dec. 19, 2012
Discrete time (auto-)regression
Observations: a finite collection (x_t, y_t) ∈ R^d × R obeying

    y_t = b*(x_t) + s*(x_t) ξ_t,   t ∈ T = {1, ..., T},

where b*: R^d → R and s*: R^d → R_+ are such that

    Conditional mean: E[y_t | x_t] = b*(x_t).
    Conditional variance: Var[y_t | x_t] = s*²(x_t).

Therefore, the ξ_t's satisfy E[ξ_t | x_t] = 0 and Var[ξ_t | x_t] = 1. They are often assumed Gaussian N(0, 1) for simplicity.
The goal is to estimate the functions b∗ and s∗.
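The discrete-time model above is easy to simulate. Here is a minimal NumPy sketch; the particular choices of b* and s* are hypothetical illustrations, not functions from the talk.

```python
import numpy as np

# Simulate y_t = b*(x_t) + s*(x_t) xi_t with illustrative (hypothetical)
# choices of the conditional mean b* and conditional volatility s*.
rng = np.random.default_rng(0)
T, d = 1000, 2

def b_star(x):
    # example conditional mean
    return np.sin(x[:, 0]) + 0.5 * x[:, 1]

def s_star(x):
    # example conditional standard deviation (strictly positive)
    return 0.5 + 0.5 * np.abs(x[:, 0])

x = rng.standard_normal((T, d))
xi = rng.standard_normal(T)          # E[xi_t | x_t] = 0, Var[xi_t | x_t] = 1
y = b_star(x) + s_star(x) * xi
print(y.shape)
```

The pair (x, y) then plays the role of the observations (x_t, y_t) of the model.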
Sparsity Assumption
• In these settings, estimating b* and s* under no further assumption is an ill-posed problem.
• Sparsity scenario: b* and s* belong to some low-dimensional spaces.

Example: Homoscedastic regression

    ∀x,  b*(x) = ∑_{j=1}^p f_j(x) β*_j = [f_1(x), ..., f_p(x)] β*,   and   s*(x) ≡ σ*

↪→ Dictionary {f_1, ..., f_p} of functions from R^d to R
↪→ Unknown vector (β*, σ*) ∈ R^p × R, sparse vector β*
↪→ Sparsity index: p* = |β*|_0 := ∑_{j=1}^p 1(β*_j ≠ 0), with p* ≪ p
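The sparsity index p* = |β*|_0 is simply a count of nonzero coordinates. A tiny NumPy illustration with a hypothetical sparse β*:

```python
import numpy as np

# Sparsity index p* = |beta*|_0 for a hypothetical sparse vector beta*
# in ambient dimension p, with p* << p.
p = 2000
beta_star = np.zeros(p)
beta_star[[3, 17, 256]] = [1.5, -2.0, 0.7]   # three active coordinates
p_star = np.count_nonzero(beta_star)
print(p_star, p)  # 3 2000
```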
Main assumptions on b∗ and s∗
Re-parametrize by the inverse of the conditional volatility s*:

    r*(x) = 1 / s*(x)   and   f*(x) = b*(x) / s*(x).

Group Sparsity Assumption
For p given functions f_1, ..., f_p mapping R^d into R, there is a vector φ* ∈ R^p such that f*(x) = ∑_{j=1}^p φ*_j f_j(x). Furthermore, for a given partition G_1, ..., G_K of {1, ..., p}, the vector φ* is group-sparse, that is, Card({k : |φ*_{G_k}|_2 ≠ 0}) ≪ K.

Low-dimensional volatility assumption
For q given functions r_1, ..., r_q mapping R^d into R_+, there is a vector α* ∈ R^q such that r*(x) = ∑_{ℓ=1}^q α*_ℓ r_ℓ(x) for almost every x ∈ R^d.
c© Dalalyan, A.S. Dec. 19, 2012 6
7
Motivations for these assumptions
■ The group sparsity assumption is relevant in a sparse additive model, that is, when
↪→ f*(x) = ψ_1(x_1) + ... + ψ_d(x_d) s.t. ψ_j ≡ 0 for most j,
↪→ projection on a basis: ψ_j(x_j) ≈ ∑_{ℓ=1}^{K_j} φ*_{ℓ,j} f_ℓ(x_j),
↪→ group sparsity of φ = (φ_{ℓ,j}).
■ Low dimensionality of the volatility occurs, for instance, when the noise is block-wise homoscedastic or periodic.
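The group-sparsity condition Card({k : |φ*_{G_k}|_2 ≠ 0}) ≪ K can be checked numerically. A short NumPy sketch with a hypothetical partition into equal-sized groups:

```python
import numpy as np

# Count the active groups of a hypothetical group-sparse vector phi*,
# i.e. Card({k : |phi*_{G_k}|_2 != 0}), for a partition G_1,...,G_K.
K, group_size = 50, 4
phi_star = np.zeros(K * group_size)
phi_star[0:4] = [1.0, -0.5, 0.0, 2.0]     # group 0 is active
phi_star[20:24] = 0.3                     # group 5 is active
groups = [np.arange(k * group_size, (k + 1) * group_size) for k in range(K)]
active = sum(np.linalg.norm(phi_star[G]) > 0 for G in groups)
print(active, K)  # 2 50
```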
II. Relation to previous work
Homoscedastic regression
The model is

    Y = Xβ* + σ* ξ

Observations: Y = [y_1, ..., y_T]^⊤ ∈ R^T
Noise: ξ = [ξ_1, ..., ξ_T]^⊤ ∈ R^T
Design matrix: X = (x_{t,j}) ∈ R^{T×p} with x_{t,j} = f_j(x_t)
Coefficients: β* = [β*_1, ..., β*_p]^⊤ ∈ R^p
Standard deviation: s*(x_t) ≡ σ* ∈ R_+*

Recall that the sparsity assumption postulates that |β*|_0 = p* ≪ p.
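The design matrix X with entries x_{t,j} = f_j(x_t) can be assembled directly from a dictionary. A minimal sketch, where the dictionary functions are illustrative choices, not those of the talk:

```python
import numpy as np

# Build the design matrix X with x_{t,j} = f_j(x_t) from a hypothetical
# dictionary {f_1, ..., f_p} of functions from R^d to R (here d = 1).
rng = np.random.default_rng(1)
T, d = 200, 1
x = rng.standard_normal((T, d))
dictionary = [lambda u: np.ones(len(u)),      # f_1: constant
              lambda u: u[:, 0],              # f_2: linear
              lambda u: u[:, 0] ** 2,         # f_3: quadratic
              lambda u: np.cos(u[:, 0])]      # f_4: cosine
X = np.column_stack([f(x) for f in dictionary])   # shape (T, p)
print(X.shape)  # (200, 4)
```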
Most popular methods : Lasso and Dantzig selector
■ The Lasso of Tibshirani (1996) is defined as

    β̂^Lasso = argmin_{β ∈ R^p} ( |Y − Xβ|_2² / (2σ*²) + λ ∑_{j=1}^p |X_j|_2 |β_j| )

■ The Dantzig selector of Candès and Tao (2007) is

    β̂^DS = argmin_{β ∈ R^p} { ∑_{j=1}^p |X_j|_2 |β_j| : max_{j=1,...,p} |X_j^⊤(Y − Xβ)| / |X_j|_2 ≤ λ }

For a tuning parameter satisfying λ ∝ 1/σ*, sharp oracle inequalities are available, e.g., Bickel et al. (2009):

    E( (1/T) |X(β̂ − β*)|_2² ) ≤ C p* log(p) / T,

where β̂ denotes either of the two estimators.

To correctly tune the parameter λ, the knowledge of σ* is necessary.
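For illustration only (this is not a solver from the talk), the weighted Lasso objective above can be minimized by proximal gradient descent (ISTA). A minimal NumPy sketch, taking σ* = 1 and a hypothetical sparse design:

```python
import numpy as np

# Minimal ISTA (proximal gradient) sketch for the weighted Lasso
# objective  |Y - X b|_2^2 / 2 + lam * sum_j |X_j|_2 |b_j|  (sigma* = 1).
def lasso_ista(X, Y, lam, n_iter=500):
    w = np.linalg.norm(X, axis=0)              # column weights |X_j|_2
    L = np.linalg.norm(X, ord=2) ** 2          # Lipschitz constant of the gradient
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        g = X.T @ (X @ beta - Y)               # gradient of the smooth part
        z = beta - g / L
        thr = lam * w / L
        beta = np.sign(z) * np.maximum(np.abs(z) - thr, 0.0)  # soft-thresholding
    return beta

# Hypothetical sparse instance: p* = 3 active coordinates out of p = 20.
rng = np.random.default_rng(2)
T, p = 100, 20
beta_true = np.zeros(p)
beta_true[:3] = [2.0, -1.5, 1.0]
X = rng.standard_normal((T, p))
Y = X @ beta_true + 0.1 * rng.standard_normal(T)
beta_hat = lasso_ista(X, Y, lam=0.5)
print(np.count_nonzero(np.abs(beta_hat) > 1e-6))
```

With this tuning the estimate typically recovers the three planted coefficients (slightly shrunk) and keeps the remaining coordinates at or near zero.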
Joint estimation of β∗ and σ∗
■ Scaled Lasso, Städler et al. (2010):

    (β̂^ScL, σ̂^ScL) = argmin_{β,σ} ( T log(σ) + |Y − Xβ|_2² / (2σ²) + (λ/σ) ∑_{j=1}^p |X_j|_2 |β_j| ).

This can be recast as a convex problem (set ρ := 1/σ and φ := β/σ):

    argmin_{φ,ρ} ( −T log(ρ) + |ρY − Xφ|_2² / 2 + λ ∑_{j=1}^p |X_j|_2 |φ_j| ).

■ Square-root Lasso, Antoniadis (2010), Belloni et al. (2011), Sun & Zhang (2012), Gautier & Tsybakov (2011):

    β̂^SqR-Lasso = argmin_β ( |Y − Xβ|_2 + λ ∑_{j=1}^p |X_j|_2 |β_j| ),
    σ̂^SqR-Lasso = T^{−1/2} |Y − Xβ̂^SqR-Lasso|_2.

■ A scaled DS version was proposed by Dalalyan & Chen (2012): sharp analysis and computational advantages.
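The equivalence behind the recast problem is easy to verify numerically: the Scaled Lasso objective at (β, σ) equals the convex objective at (φ, ρ) = (β/σ, 1/σ). A small NumPy check with arbitrary (hypothetical) inputs:

```python
import numpy as np

# Check: Scaled Lasso objective at (beta, sigma) equals the recast
# convex objective at (phi, rho) = (beta/sigma, 1/sigma).
rng = np.random.default_rng(3)
T, p, lam = 50, 8, 1.3
X = rng.standard_normal((T, p))
Y = rng.standard_normal(T)
beta = rng.standard_normal(p)
sigma = 0.7
w = np.linalg.norm(X, axis=0)                 # column norms |X_j|_2

def scaled_lasso_obj(beta, sigma):
    return (T * np.log(sigma)
            + np.sum((Y - X @ beta) ** 2) / (2 * sigma ** 2)
            + (lam / sigma) * np.sum(w * np.abs(beta)))

def recast_obj(phi, rho):
    return (-T * np.log(rho)
            + np.sum((rho * Y - X @ phi) ** 2) / 2
            + lam * np.sum(w * np.abs(phi)))

rho, phi = 1 / sigma, beta / sigma
print(np.isclose(scaled_lasso_obj(beta, sigma), recast_obj(phi, rho)))  # True
```

Term by term: −T log ρ = T log σ, |ρY − Xφ|²/2 = |Y − Xβ|²/(2σ²), and λ∑w_j|φ_j| = (λ/σ)∑w_j|β_j|, so the two problems have the same minimizers up to the change of variables.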
III. Main results
Estimation for heteroscedastic regression: continuous time

Observations: (x_t, y_t)_{t≤T} obeying y_t = b*(x_t) + s*(x_t) W_t, thus

    y_t / s*(x_t) = ( ∑_{j=1}^p φ*_j f_j(x_t) ) + W_t = X(t)φ* + W_t,   t ∈ T = [0, T].

Scaled Heteroscedastic Lasso:

    φ̂^ScHeL = argmin_{φ ∈ R^p} { (1/2) ∫_0^T ( y_t / s*(x_t) − X(t)φ )² dt + λ ∑_{k=1}^K |X_{G_k} φ_{G_k}|_2 },

with

    |X_G φ_G|_2² = ∫_0^T ( ∑_{j∈G} X_j(t) φ_j )² dt = ∫_0^T ( ∑_{j∈G} f_j(x_t) φ_j )² dt.

Computation: the estimator φ̂^ScHeL can be efficiently computed even for very large dimensions p by solving a second-order cone program.
Estimation for heteroscedastic regression: discrete time

Observations: (x_t, y_t)_{t=1,...,T} obeying

    y_t = b*(x_t) + s*(x_t) ξ_t = r*(x_t)^{−1} ( f*(x_t) + ξ_t ).

Under our assumptions,

    f*(x_t) = ∑_{j=1}^p φ*_j f_j(x_t) = X(t)φ*,   r*(x_t) = ∑_{ℓ=1}^q α*_ℓ r_ℓ(x_t) = R(t)α*.

Thus, [f*(x_1), ..., f*(x_T)]^⊤ = Xφ* and [r*(x_1), ..., r*(x_T)]^⊤ = Rα*. This leads to

    D_Y Rα* = Xφ* + ξ,   D_Y = diag(y_t ; t = 1, ..., T).

Scaled Heteroscedastic Lasso: (φ̂^ScHeL, α̂^ScHeL) is the solution to

    min_{φ ∈ R^p, α ∈ R^q} { −∑_{t=1}^T log(R(t)α) + (1/2) |D_Y Rα − Xφ|_2² + λ ∑_{k=1}^K |X_{G_k} φ_{G_k}|_2 },

where the first two terms form the (negative) Gaussian log-likelihood and the last term is a sparsity-promoting penalty.
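The ScHeL objective above can be evaluated directly for a candidate pair (φ, α). A minimal NumPy sketch with hypothetical dimensions and inputs (α is taken positive so that R(t)α > 0 and the log term is defined); the talk's procedures minimize this objective, while the function here only evaluates it:

```python
import numpy as np

# Evaluate the ScHeL objective: negative Gaussian log-likelihood
# plus the group-sparsity penalty, for a given (phi, alpha).
rng = np.random.default_rng(4)
T, p, q, K = 100, 12, 3, 4
X = rng.standard_normal((T, p))             # rows X(t) = [f_1(x_t),...,f_p(x_t)]
R = np.abs(rng.standard_normal((T, q)))     # rows R(t) = [r_1(x_t),...,r_q(x_t)] >= 0
Y = rng.standard_normal(T)
groups = np.array_split(np.arange(p), K)    # partition G_1, ..., G_K
lam = 2.0

def schel_obj(phi, alpha):
    Ralpha = R @ alpha                      # r(x_t) = R(t) alpha, must be > 0
    nll = -np.sum(np.log(Ralpha)) + 0.5 * np.sum((Y * Ralpha - X @ phi) ** 2)
    pen = lam * sum(np.linalg.norm(X[:, G] @ phi[G]) for G in groups)
    return nll + pen

phi0 = rng.standard_normal(p)
alpha0 = np.abs(rng.standard_normal(q)) + 0.1   # keeps R @ alpha positive
print(np.isfinite(schel_obj(phi0, alpha0)))  # True
```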
Finite sample risk bounds for the ScHeL
Theorem Consider either continuous-time model or discrete-time withsub-Gaussian errors ξ. Let K ∗ (resp. p∗) be the number of relevantgroups (resp. corrdinates of φ∗). Let ε ∈ (0,1) be a tolerance level andset
λk = 4(√|Gk |+
√log(K/ε)
).
Under some assumptions, with probability at least 1− 2ε,∣∣X(φ− φ∗)∣∣2 ≤ D3/2
T ,ε
√q log(2q/ε) + DT ,ε
√p∗ + K ∗ log(K/ε).∣∣R(α−α∗)
∣∣2 ≤ D3/2
T ,ε
√q log(2q/ε) + DT ,ε
√p∗ + K ∗ log(K/ε),
where DT ,ε ∝ (maxt |f∗(x t)|+ log(2T/ε)).
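The group-wise tuning parameters λ_k of the theorem depend only on the group sizes, K, and ε; a quick computation for example (hypothetical) group sizes:

```python
import numpy as np

# lambda_k = 4 (sqrt(|G_k|) + sqrt(log(K / eps))) for example group sizes.
eps = 0.05
group_sizes = np.array([4, 4, 8, 16])
K = len(group_sizes)
lam = 4 * (np.sqrt(group_sizes) + np.sqrt(np.log(K / eps)))
print(np.round(lam, 2))
```

Larger groups get a larger λ_k, and equal-sized groups share the same value.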
Summary
New procedures named ScHeL and ScHeDs:

■ Suitable for fitting the heteroscedastic regression model.
■ Simultaneous estimation of the mean and the variance functions.
■ Take group sparsity into account.
■ Implemented using two different solvers:
↪→ a primal-dual interior-point method (highly accurate),
↪→ an optimal first-order method (moderately accurate but with cheap iterations).
■ Competitive with state-of-the-art algorithms,
↪→ applicable in a much more general framework.

Manuscript, proofs and codes available on request.
FIGURE: Comparing the interior-point (IP) and optimal first-order (OFO) methods as the dimension p grows from 0 to 2500. Top panel: mean error of β̂^ScHeDs on the nonzero coefficients. Middle panel: mean error on the zero coefficients. Bottom panel: mean running times (sec).
References I
A. Antoniadis, Comments on: ℓ1-penalization for mixture regression models, TEST 19 (2010), no. 2, 257–258.
A. Belloni, V. Chernozhukov, and L. Wang, Square-root Lasso: pivotal recovery of sparse signals via conic programming, Biometrika 98 (2011), no. 4, 791–806.
P. J. Bickel, Y. Ritov, and A. B. Tsybakov, Simultaneous analysis of Lasso and Dantzig selector, Ann. Statist. 37 (2009), no. 4, 1705–1732.
E. J. Candès and T. Tao, The Dantzig selector: statistical estimation when p is much larger than n, Ann. Statist. 35 (2007), no. 6, 2392–2404.
A. S. Dalalyan and Y. Chen, Fused sparsity and robust estimation for linear models with unknown variance, Advances in Neural Information Processing Systems 25: NIPS, 2012, pp. 1268–1276.
E. Gautier and A. Tsybakov, High-dimensional instrumental variables regression and confidence sets, September 2011.
N. Städler, P. Bühlmann, and S. van de Geer, ℓ1-penalization for mixture regression models, TEST 19 (2010), no. 2, 209–256.
T. Sun and C.-H. Zhang, Scaled sparse linear regression, Biometrika 99 (2012), no. 4, 879–898.
R. Tibshirani, Regression shrinkage and selection via the Lasso, J. Roy. Statist. Soc. Ser. B 58 (1996), no. 1, 267–288.