Estimator Selection in High-Dimensional Settings
Christophe Giraud, Université Paris Sud
Journées MAS, Toulouse, 2014

Transcript of the slides.

Page 1:

Estimator Selection

in High-Dimensional Settings

Christophe Giraud

Université Paris Sud

Journées MAS, Toulouse, 2014


Page 2:

High-Dimensional Statistics

Big Data?



Page 4:

High-Dimensional Statistics

“Wide” Data


Page 5:

High-dimensional data

• Biotech data : biotech devices can sense up to tens of thousands of "features" ≫ n = number of "individuals".

• Images : medical images, massive astrophysics images, video-surveillance images, etc. Each image is made of thousands up to millions of pixels or voxels.

• Consumer preference data : websites and loyalty programs collect huge amounts of information on the preferences and the behaviors of customers. Ex: recommendation systems for movies, books, music.

• Business data : optimal exploitation of internal and external data (logistics and transportation, insurance, finance, etc.)

• Crowdsourcing data : massive online participative data sets (recorded by volunteers). Ex: eBird collects online millions of bird counts across North America.


Page 6:

Whole Human Genome Microarray covering over 41,000 human genes and transcripts on a standard 1" × 3" glass slide format. © Agilent Technologies, Inc. 2004. Reproduced with permission, courtesy of Agilent Technologies, Inc.


Page 7:

Blessing?

• We can sense thousands of variables on each "individual" : potentially, we will be able to scan every variable that may influence the phenomenon under study.

• The curse of dimensionality : separating the signal from the noise is in general almost impossible in high-dimensional data, and computations can rapidly exceed the available resources.


Page 8:

Curse 1 : fluctuations cumulate

Example : linear regression Y = Xβ* + ε with cov(ε) = σ² Iₙ. The least-squares estimator β̂ ∈ argmin_{β ∈ R^p} ‖Y − Xβ‖² has risk

E[‖β̂ − β*‖²] = Tr((XᵀX)⁻¹) σ².

Illustration :

Yᵢ = ∑_{j=1}^p β*_j cos(πji/n) + εᵢ = f_{β*}(i/n) + εᵢ , for i = 1, …, n,

with

• ε₁, …, εₙ i.i.d. with N(0, 1) distribution

• β*_j independent with N(0, j⁻⁴) distribution
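This illustration is easy to run. Below is a minimal Python sketch (function and variable names are mine, not from the slides; σ = 1 as above): it draws data from the cosine model, fits least squares, and compares the Monte-Carlo risk with the closed form Tr((XᵀX)⁻¹)σ², which grows with p.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100

def ls_risk(p, n_rep=200):
    """Monte-Carlo estimate of E||beta_hat - beta*||^2 in the cosine design."""
    i = np.arange(1, n + 1)
    j = np.arange(1, p + 1).astype(float)
    X = np.cos(np.pi * np.outer(i, j) / n)        # X[i-1, j-1] = cos(pi*j*i/n)
    errs = []
    for _ in range(n_rep):
        beta_star = rng.normal(0.0, j ** -2)      # sd j^-2, i.e. variance j^-4
        Y = X @ beta_star + rng.normal(size=n)    # noise with sigma = 1
        beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
        errs.append(np.sum((beta_hat - beta_star) ** 2))
    theory = np.trace(np.linalg.inv(X.T @ X))     # Tr((X'X)^-1) * sigma^2
    return np.mean(errs), theory

for p in (10, 20, 50):
    mc, th = ls_risk(p)
    print(f"p={p:3d}  Monte-Carlo risk={mc:8.3f}  theory={th:8.3f}")
```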


Page 9:

Curse 1 : fluctuations cumulate

Figure: the signal f_{β*} and its least-squares (LS) fit on [0, 1], for p = 10, 20, 50 and 100.


Page 10:

Curse 2 : locality is lost

Observations: (Yᵢ, X⁽ⁱ⁾) ∈ R × [0, 1]^p for i = 1, …, n.

Model: Yᵢ = f(X⁽ⁱ⁾) + εᵢ with f smooth.

Local averaging: f̂(x) = average of {Yᵢ : X⁽ⁱ⁾ close to x}.
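A local-averaging estimator is a one-liner in numpy; here is a sketch (the function name and the fixed bandwidth h are mine, not from the slides):

```python
import numpy as np

def local_average(x, X, Y, h):
    """Estimate f(x) by averaging the Y_i whose X_i fall in the ball B(x, h)."""
    mask = np.linalg.norm(X - x, axis=1) <= h
    return Y[mask].mean() if mask.any() else np.nan   # empty neighbourhood
```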


Page 11:

Curse 2 : locality is lost


Figure: Histograms of the pairwise distances between n = 100 points sampled uniformly in the hypercube [0, 1]^p, for p = 2, 10, 100 and 1000.
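The figure is easy to reproduce numerically; a small sketch (names are mine): sample n = 100 uniform points in [0, 1]^p and look at the spread of the pairwise distances. As p grows they concentrate around a common value, so no point has genuinely "close" neighbours.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
iu = np.triu_indices(n, k=1)                  # indices of the n(n-1)/2 pairs
for p in (2, 10, 100, 1000):
    X = rng.random((n, p))                    # n points uniform in [0,1]^p
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    d = D[iu]                                 # pairwise Euclidean distances
    print(f"p={p:4d}  min={d.min():6.3f}  median={np.median(d):7.3f}  max={d.max():7.3f}")
```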


Page 12:

Curse 2 : locality is lost

Number n of points x₁, …, xₙ required for covering [0, 1]^p by the balls B(xᵢ, 1):

n ≥ Γ(p/2 + 1) / π^{p/2} ∼ (p/(2πe))^{p/2} √(pπ) as p → ∞.

p : 20 | 30 | 50 | 100 | 200
n : 39 | 45630 | 5.7·10¹² | 4.2·10³⁹ | larger than the estimated number of particles in the observable universe
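The table follows from the volume bound above; a quick sketch to reproduce it (the function name is mine):

```python
from math import gamma, pi

def covering_lower_bound(p):
    """Volume bound: covering [0,1]^p by unit balls needs n >= Gamma(p/2+1) / pi^(p/2)."""
    return gamma(p / 2 + 1) / pi ** (p / 2)

for p in (20, 30, 50, 100, 200):
    # prints roughly 38.7, 4.57e4, 5.78e12, 4.22e39, 1.8e108;
    # the last exceeds usual ~1e80 estimates of the particle count of the universe
    print(p, f"{covering_lower_bound(p):.3g}")
```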


Page 13:

Some other curses

• Curse 3 : an accumulation of rare events may not be rare (false discoveries, etc.)

• Curse 4 : algorithmic complexity must remain low


Page 14:

Low-dimensional structures in high-dimensional data

Hopeless?

Low-dimensional structures : high-dimensional data are usually concentrated around low-dimensional structures, reflecting the (relatively) small complexity of the systems producing the data:

• geometrical structures in an image,

• regulation network of a "biological system",

• social structures in marketing data,

• human technologies have limited complexity, etc.

Dimension reduction :

• "unsupervised" (PCA)

• "estimation-oriented"

Figure: a 3-d scatter plot of the data (axes X[,1], X[,2], X[,3]).


Page 15:

PCA in action

MNIST: 1100 scans of each digit; each scan is a 16 × 16 image, encoded by a vector in R²⁵⁶. The original images are displayed in the first row, their projections onto the 10 first principal axes in the second row.
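As an illustration of what "projection onto the first principal axes" computes, here is a sketch (names are mine; the data below is a random stand-in, not the actual MNIST scans):

```python
import numpy as np

def pca_project(X, k=10):
    """Project each row of X onto the affine span of the first k principal axes."""
    mu = X.mean(axis=0)
    Xc = X - mu
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)  # rows of Vt = principal axes
    P = Vt[:k]                                         # (k, d)
    return mu + (Xc @ P.T) @ P                         # back in original coordinates

# random stand-in for the 1100 x 256 matrix of 16x16 scans of one digit
X = np.random.default_rng(0).random((1100, 256))
X10 = pca_project(X, k=10)
print(X10.shape)   # (1100, 256): same space, rank-10 variation around the mean
```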


Page 16:

”Estimation-oriented” dimension reduction

Figure: 55 chemical measurements of 162 strains of E. coli (classes Com, InPEC, ExPEC). Left: the data projected on the plane given by a PCA (axes 1 and 2). Right: the data projected on the plane given by an LDA (axes LD1 and LD2).


Page 17:

A paradigm shift

Classical statistics:

• a small number p of parameters

• a large number n of observations.

High-dimensional data:

• a huge number p of parameters

• a sample size n with n ≪ p or n ≈ p.

⇒ classical asymptotic analyses do not fit!

Statistical setting:

• either n, p → ∞ with p ∼ g(n) (yet sensitive to g)

• or treat n and p as they are (yet the analysis is more involved)

Central issue: identify (at least approximately) the low-dimensional structures.


Page 18:

Estimator selection

Unknown structures

• the low-dimensional structures are unknown

• the "class" of hidden structures is possibly unknown:
  – various sparsity patterns
  – smoothness
  – low-rank
  – etc.

Strong effort

For each "class" of structures, many computationally efficient estimators adapting to the hidden structures have been developed.


Page 19:

Estimator selection

Difficulties

• No procedure is universally better than the others.

• A sensible choice of the tuning parameters usually depends on some unknown characteristics of the data (sparsity, smoothness, variance, etc.).

Estimator selection / aggregation is needed for:

• adapting to the possibly unknown "classes" of structures

• choosing the best among several estimators

• tuning parameters

Ideal objective

• Select the "best" estimator among a collection {f̂_λ, λ ∈ Λ}.

Page 20:

1- a simple setting


Page 21:

Regression framework

Regression setting

• Yᵢ = F(xᵢ) + εᵢ, for i = 1, …, n, with ε₁, …, εₙ i.i.d.

• F : X → R and σ² = var(εᵢ) are unknown

• we want to estimate F or f* = (F(x₁), …, F(xₙ))

Ex 1 : sparse linear regression

• F(x) ≈ ⟨β*, φ(x)⟩ with β* "sparse" in some sense and φ(x) ∈ R^p, with possibly p > n

Ex 2 : non-parametric regression

• F : X → R is smooth


Page 22:

A plethora of estimators

For each "class" of structures, there is a strong effort for developing computationally efficient estimators which adapt to the hidden structures.

Sparse linear regression

• Coordinate sparsity: Lasso, Dantzig, Elastic-Net, Exponential-Weighting, Random Forest, etc.

• Structured sparsity: Group-Lasso, Fused-Lasso, Hierarchical-Group-Lasso, Bayesian estimators, etc.

Non-parametric regression

• Spline smoothing, Nadaraya kernel smoothing, kernel ridge estimators, nearest neighbors, L2-basis projection, Sparse Additive Models, etc.


Page 23:

Important practical issues

Which class of structures?

• coordinate-sparse linear regression?

• group-sparse linear regression?

• smoothing?

Which estimator should be used?

• Sparse regression: Lasso? Exponential-Weighting? Random Forest?

• Non-parametric regression: kernel regression? (which kernel?) Spline smoothing?

Which "tuning" parameter?

• which penalty level for the Lasso?

• which bandwidth for kernel regression?

• etc.

Page 24:

A simple example

Model: Y = Xβ* + ε.

Scale invariance: an estimator (Y, X) → β̂(Y, X) is scale-invariant if

β̂(sY, X) = s β̂(Y, X).

The Lasso estimator

β̂_λ = argmin_β { ‖Y − Xβ‖² + 2λ‖β‖₁ },  λ > 0,

is not scale-invariant:

β̂_λ(sY, X) = s β̂_{λ/s}(Y, X).
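The rescaling identity above can be checked numerically; a sketch assuming scikit-learn is available (note that sklearn's Lasso divides the quadratic term by 2n, which only reparametrizes λ and does not affect the identity):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p, s = 50, 20, 10.0
X = rng.normal(size=(n, p))
Y = X[:, :3] @ np.array([1.0, -2.0, 3.0]) + rng.normal(size=n)

alpha = 0.1
opts = dict(fit_intercept=False, tol=1e-12, max_iter=100_000)
b = Lasso(alpha=alpha, **opts).fit(X, Y).coef_
# beta_hat_lambda(sY, X) = s * beta_hat_{lambda/s}(Y, X): scaling the response
# by s is only undone by also scaling the penalty level by s
b_scaled = Lasso(alpha=s * alpha, **opts).fit(X, s * Y).coef_
print(np.max(np.abs(b_scaled - s * b)))   # ~ 0: identical after matching penalties
```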


Page 25:

A simple example

The compatibility constant

κ[S] = min_{u ∈ C(S)} { |S|^{1/2} ‖Xu‖₂ / ‖u_S‖₁ },

where C(S) = {u : ‖u_{Sᶜ}‖₁ < 4‖u_S‖₁}.

Theorem (Koltchinskii, Lounici, Tsybakov)

Assumptions:

• ε₁, …, εₙ i.i.d. ∼ N(0, σ²)

• columns of X with norm 1

Then, for λ = 3σ√(2 log(p) + 2L), with probability at least 1 − e⁻ᴸ,

‖X(β̂_λ − β*)‖² ≤ inf_{β ≠ 0} { ‖X(β − β*)‖² + 18σ²(L + log(p)) |β|₀ / κ[supp(β)]² }.


Page 26:

Selection strategies

Resampling strategies

• V-fold cross-validation

• and many others

Complexity penalization

• Penalized log-likelihood (AIC, BIC, etc.)

• LinSelect

• pairwise comparison: Goldenshluger-Lepski's method, Birgé-Baraud's method

Some specific strategies

• Square-root / scaled (group-)Lasso


Page 27:

Principle of complexity penalization

Model

Y = f* + ε

Typical selection criterion

λ̂ ∈ argmin_λ { ‖Y − f̂_λ‖² + pen(λ) }

Main issue: shape λ → pen(λ) in order to have

E[‖f* − f̂_{λ̂}‖²] ≤ C min_λ E[‖f* − f̂_λ‖²] + something small.
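Once the fitted vectors f̂_λ are available, the criterion itself is a one-liner; a generic sketch (names are mine, and the penalty is left abstract):

```python
import numpy as np

def select_lambda(Y, fits, pen):
    """Pick the lambda minimizing ||Y - f_hat_lambda||^2 + pen(lambda).
    fits: dict mapping lambda to the fitted vector f_hat_lambda;
    pen: callable giving the penalty of lambda."""
    crit = {lam: np.sum((Y - f) ** 2) + pen(lam) for lam, f in fits.items()}
    return min(crit, key=crit.get)
```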


Page 28:

Principle of complexity penalization

If λ̂ ∈ argmin_λ { ‖Y − f̂_λ‖² + pen(λ) } then

‖f* − f̂_{λ̂}‖² ≤ ‖f* − f̂_λ‖² + 2⟨ε, f* − f̂_λ⟩ + pen(λ) + 2⟨ε, f̂_{λ̂} − f*⟩ − pen(λ̂).

Which penalty pen(λ)?

Find pen(λ) such that there exist some {Z_λ : λ ∈ Λ} fulfilling, for some a < 1 and c ≥ 0,

• E[sup_{λ∈Λ} Z_λ] ≤ cσ²

• 2⟨ε, f̂_λ − f*⟩ − pen(λ) ≤ a‖f̂_λ − f*‖² + Z_λ, for all λ ∈ Λ.

Hence

(1 − a) E[‖f* − f̂_{λ̂}‖²] ≤ E[‖f* − f̂_λ‖²] + pen(λ) + cσ².


Page 29:

Principle of complexity penalization

Since

2⟨ε, f̂_λ − f*⟩ ≤ a‖f̂_λ − f*‖² + a⁻¹ ⟨ε, (f̂_λ − f*)/‖f̂_λ − f*‖⟩²,

the penalty must control the fluctuations of ⟨ε, (f̂_λ − f*)/‖f̂_λ − f*‖⟩.

Concentration inequalities help!


Page 30:

Back to the simple example

Restricted eigenvalue

For k* = n/(3 log(p)) we set φ* = sup { ‖Xu‖²/‖u‖² : u is k*-sparse }.

Theorem (Y. Baraud, C.G., S. Huet, N. Verzelen + T. Sun, C.-H. Zhang)

If we assume that

• |β*|₀ ≤ C₁ κ²[supp(β*)] n / (φ* log(p)),

then there is a selection criterion such that, with high probability,

‖X(β̂ − β*)‖₂² ≤ C inf_{β ≠ 0} { ‖X(β* − β)‖₂² + C₂ φ* |β|₀ log(p) σ² / κ²[supp(β)] }.


Page 31:

Impact of the unknown variance?

Case of coordinate-sparse linear regression (N. Verzelen): k-sparse signal f* = Xβ*.

Figure: minimax prediction risk over k-sparse signals as a function of k, comparing the cases "σ unknown and k unknown" and "σ known or k known"; the gap appears in the ultra-high-dimensional regime 2k log(p/k) ≥ n.


Page 32:

2- a more complex setting

Joint work with François Roueff and Andrés Sánchez-Pérez.


Page 33:

Another strategy

Issue: the analysis of the fluctuations of the estimators can be intractable in some settings.

Example: non-linear estimators for Time-Varying AutoRegressive (TVAR) processes

X_t = ∑_{j=1}^d θ_j(t) X_{t−j} + ξ_t


Figure: A TVAR process
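For concreteness, a sketch simulating a TVAR(1) path (the coefficient function below is a hypothetical choice of mine; the figure on the slide uses d = 3):

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000

def theta1(u):
    """Hypothetical smooth time-varying AR(1) coefficient, kept inside (-1, 1)."""
    return 0.8 * np.cos(2 * np.pi * u)

X = np.zeros(T)
for t in range(1, T):
    X[t] = theta1(t / T) * X[t - 1] + rng.normal()   # innovation with sigma = 1
```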


Page 34:

Estimation for TVAR

Notations: X_{t−1} = (X_{t−1}, …, X_{t−d}) and θ_{t−1} = (θ₁(t), …, θ_d(t)), so that

X_t = ⟨θ_{t−1}, X_{t−1}⟩ + ξ_t.

Normalized Least Mean Squares (NLMS) estimators:

X̂_t^{(λ)} = ⟨θ̂_{t−1}^{(λ)}, X_{t−1}⟩, where for λ > 0

θ̂_t^{(λ)} = θ̂_{t−1}^{(λ)} + λ (X_t − ⟨θ̂_{t−1}^{(λ)}, X_{t−1}⟩) X_{t−1} / (1 + λ |X_{t−1}|₂²).

Optimality: Moulines, Priouret and Roueff (2005) have shown some optimality of θ̂^{(λ)} for a suitable λ depending on some unknown quantities.
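A direct transcription of the NLMS recursion above into Python (the function name is mine):

```python
import numpy as np

def nlms(X, d, lam):
    """NLMS recursion: theta_t = theta_{t-1}
       + lam * (X_t - <theta_{t-1}, X_prev>) * X_prev / (1 + lam * |X_prev|^2)."""
    T = len(X)
    theta = np.zeros(d)
    thetas = np.zeros((T, d))
    for t in range(d, T):
        x_prev = X[t - d:t][::-1]             # (X_{t-1}, ..., X_{t-d})
        err = X[t] - theta @ x_prev           # prediction error at time t
        theta = theta + lam * err * x_prev / (1 + lam * (x_prev @ x_prev))
        thetas[t] = theta
    return thetas
```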


Page 35:

Estimator selection

How to choose λ? It is hard to quantify precisely the fluctuations of a TVAR...

A strategy: use some techniques from individual sequence forecasting.

Individual sequence model: the observations x₁, x₂, … are deterministic!


Page 36:

A simple result (1)

Observations: x₁, x₂, … with values in [−B, B].

Predictors: x̂_t^{(λ)} for λ ∈ Λ, with values in [−B, B].

Aggregation:

x̂_t = ∑_{λ∈Λ} w_λ(t) x̂_t^{(λ)}  with  w_λ(t) ∝ π_λ exp(−η ∑_{s=1}^{t−1} (x_s − x̂_s^{(λ)})²),

where π is a probability distribution on Λ.
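This exponentially weighted aggregation is straightforward to implement; a sketch (names are mine; the log-domain shift only stabilizes the weights numerically):

```python
import numpy as np

def aggregate(preds, x, eta, prior=None):
    """x_hat_t = sum_lambda w_lambda(t) x_hat_t^(lambda), with w_lambda(t)
    proportional to pi_lambda * exp(-eta * cumulative squared loss)."""
    T, N = preds.shape                         # preds[t, k]: expert k at time t
    pi = np.full(N, 1.0 / N) if prior is None else np.asarray(prior, float)
    cum = np.zeros(N)                          # cumulative squared losses
    out = np.zeros(T)
    for t in range(T):
        logw = np.log(pi) - eta * cum
        w = np.exp(logw - logw.max())          # shift for numerical stability
        w /= w.sum()
        out[t] = w @ preds[t]
        cum += (x[t] - preds[t]) ** 2
    return out
```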


Page 37:

A simple result (2)

Theorem (Catoni 97)

For η = 1/(8B²) we have

(1/T) ∑_{t=1}^T (x_t − x̂_t)² ≤ min_{λ∈Λ} { (1/T) ∑_{t=1}^T (x_t − x̂_t^{(λ)})² + (8B²/T) log(π_λ⁻¹) }.

Proof: x → exp(−x²) is concave on [−2^{−1/2}, 2^{−1/2}].


Page 38:

Issues

• Some processes, like TVAR, are not bounded.

• Even if a process is bounded by some B, the choice η ≍ B⁻² can be suboptimal.


Page 39:

Extension to unbounded stochastic settings

Sublinear model: (X_t)_{t∈Z} satisfies

|X_t| ≤ ∑_{j∈Z} A_t(j) Z_{t−j},

where

• Z_t ≥ 0, independent,

• A_t(j) ≥ 0 and A* := sup_{t∈Z} ∑_{j∈Z} A_t(j) < ∞.   (1)


Page 40:

Examples of sublinear processes (1)

Linear processes with time-varying coefficients

X_t = ∑_{j∈Z} a_t(j) ξ_{t−j},

• (ξ_t)_{t∈Z} independent, standardized,

• (a_t(j))_{t,j} satisfies (1) with A_t(j) = |a_t(j)|.

Classical assumption:

sup_{T≥1} sup_{j∈Z} ∑_{t=1}^T |a_{t,T}(j) − a(t/T, j)| < ∞,

where u → a(u, j) is smooth.


Page 41:

Examples of sublinear processes (2)

TVAR model

θ = (θ₁, …, θ_d) : (−∞, 1] → R^d,  σ : (−∞, 1] → R₊,

(ξ_t)_{t∈Z} independent, E[ξ_t] = 0, E[ξ_t²] = 1.

1. For all t ≤ T,

X_{t,T} = ∑_{j=1}^d θ_j((t − 1)/T) X_{t−j,T} + σ(t/T) ξ_t.

2. lim_{M→∞} sup_{t≤T} P(|X_{t,T}| > M) = 0.


Page 42:

Examples of sublinear processes (2)

Regularity-stability condition

(θ, σ) ∈ C(β, R, δ, σ₋, σ₊) = { (θ, σ) : (−∞, 1] → R^d × [σ₋, σ₊] : θ ∈ Λ_d(β, R) ∩ s_d(δ) },

where

• Λ_d(β, R) = { f : (−∞, 1] → R^d : sup_{s ≠ s′} |f^{(⌈β⌉−1)}(s) − f^{(⌈β⌉−1)}(s′)| / |s − s′|^{β+1−⌈β⌉} ≤ R },

• s_d(δ) = { θ : (−∞, 1] → R^d : Θ(z; u) ≠ 0 for all |z| < δ⁻¹ and u ∈ [0, 1] },

where Θ(z; u) = 1 − ∑_{j=1}^d θ_j(u) z^j.


Page 43:

Examples of sublinear processes (2)

Proposition

If (θ, σ) ∈ C(β, R, δ, 0, σ₊), for δ ∈ (0, 1), then

X_{t,T} = ∑_{j=0}^∞ a_{t,T}(j) σ((t − j)/T) ξ_{t−j},

where for any ρ ∈ (δ, 1) there exists K = K(ρ, δ, β, R) > 0 such that, for all t ≤ T and j ≥ 0,

|a_{t,T}(j)| ≤ K ρ^j.

Hence

sup_t ∑_{j≥0} |a_{t,T}(j) σ((t − j)/T)| ≤ K σ₊ / (1 − ρ).


Page 44:

Sublinearity of the predictors

L-Lipschitz predictors

• L = (L_s)_{s≥1} non-negative with

L* = ∑_{j≥1} L_j < ∞.

• A predictor X̂_t of X_t from (X_s)_{s≤t−1} is L-Lipschitz if

|X̂_t| ≤ ∑_{s≥1} L_s |X_{t−s}|.


Page 45:

Risk bound with p-th moment

Theorem (C.G., F. Roueff, A. Sánchez-Pérez)

If m_p := sup_{t∈Z} E[Z_t^p] < ∞ with p ≥ 2, then

(1/T) ∑_{t=1}^T E[(X̂_t − X_t)²] ≤ min_{λ∈Λ} { (1/T) ∑_{t=1}^T E[(X̂_t^{(λ)} − X_t)²] + log(π_λ⁻¹)/(ηT) } + T (8η)^{p/2−1} A*^p (1 + L*)^p m_p.


Page 46:

Choice of η

For π_λ = |Λ|⁻¹, the optimal choice is

η = (log|Λ| / T²)^{2/p} / ( 8^{1−2/p} (1 + L*)² A*² m_p^{2/p} ),

which gives

(1/T) ∑_{t=1}^T E[(X̂_t − X_t)²] ≤ inf_{λ∈Λ} (1/T) ∑_{t=1}^T E[(X̂_t^{(λ)} − X_t)²] + C₁ log(|Λ|)^{1−2/p} / T^{1−4/p},

with C₁ = 2 · 8^{(p−2)/p} (1 + L*)² A*² m_p^{2/p}.
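For reference, a literal transcription of this choice of η (argument names are mine):

```python
from math import log

def eta_pth_moment(p, T, n_experts, A_star, L_star, m_p):
    """Literal transcription of the optimal eta displayed above."""
    scale = 8 ** (1 - 2 / p) * (1 + L_star) ** 2 * A_star ** 2 * m_p ** (2 / p)
    return (log(n_experts) / T ** 2) ** (2 / p) / scale
```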


Page 47:

Risk bound with exponential moment

Theorem (C.G., F. Roueff, A. Sánchez-Pérez)

If

1. φ(ζ) := sup_{t∈Z} E[e^{ζ Z_t}] < ∞ for some ζ > 0,

2. π_λ = |Λ|⁻¹,

3. η = ζ² / (8 A*² (L* + 1)²) · max{ 2, log( T²φ(ζ) / (8 log|Λ|) ) }⁻²,

then

(1/T) ∑_{t=1}^T E[(X̂_t − X_t)²] ≤ min_{λ∈Λ} (1/T) ∑_{t=1}^T E[(X̂_t^{(λ)} − X_t)²] + C₂ (log|Λ| / T) max{ 2, log( T²φ(ζ) / (8 log|Λ|) ) }²,

with C₂ = 16 A*² (1 + L*)² ζ⁻².


Page 48:

Conclusion

Corollary: the estimator X̂ built from the aggregation of the X̂^{(λ)} adapts to the regularity of θ.

Caveat: the choice of η depends on a moment of ξ_t (exponential moment or p-th moment). Yet, an upper bound on this moment is enough.


Page 49:

d = 3, T = 1000, σ = 1, δ = 0.5, N = 7

Figure: the three time-varying coefficients θ₁, θ₂ and θ₃ plotted against t/T.


Page 50:

Processes and estimation

Figure: A TVAR process.

Figure: NLMS estimators for θ₁ (the seven experts θ̂₁^{(1)}, …, θ̂₁^{(7)} and the aggregate θ̂₁, together with the true θ₁, plotted against t/T).


Page 52:

Estimation

Figure: NLMS estimators for θ₂ (the seven experts θ̂₂^{(1)}, …, θ̂₂^{(7)} and the aggregate θ̂₂, together with the true θ₂, plotted against t/T).

Figure: NLMS estimators for θ₃ (the seven experts θ̂₃^{(1)}, …, θ̂₃^{(7)} and the aggregate θ̂₃, together with the true θ₃, plotted against t/T).


Page 54:

Boxplot of the cumulative error

Figure: Empirical cumulative error; boxplots for experts 1 to 7 and for the aggregate.


Page 55:

Introduction to high-dimensional statistics

Available online:

http://www.cmap.polytechnique.fr/~giraud/MSV/LectureNotes.pdf

Contents

1. Curse of dimensionality

2. Model selection

3. Aggregation of estimators

4. Convex criteria

5. Estimator selection

6. Low rank regression

7. Graphical models

8. Multiple testing

9. Supervised classification
