Aggregation methods: optimality and fast...

101
Aggregation methods: optimality and fast rates. Guillaume Lecu´ e Universit´ e Paris 6 18 Mai 2007

Transcript of Aggregation methods: optimality and fast...

Page 1: Aggregation methods: optimality and fast rates.lecueguillaume.github.io/assets/soutenanceTheseVersionFinale.pdfAggregation methods: optimality and fast rates. Guillaume Lecu´e Universit´e

Aggregation methods: optimality and fast rates.

Guillaume Lecue

Universite Paris 6

18 Mai 2007

Page 2: Aggregation methods: optimality and fast rates.lecueguillaume.github.io/assets/soutenanceTheseVersionFinale.pdfAggregation methods: optimality and fast rates. Guillaume Lecu´e Universit´e

General Framework. Aggregations Procedures. Optimality in classification. Applications.

Motivation.

M prior estimators (’weak’ estimators) : f1, . . . , fM

n observations : Dn

Aim

Construction of a new estimator which is approximatively as good as thebest ’weak’ estimator :

Aggregation method or Aggregate

Aggregation methods: optimality and fast rates. Universite Paris 6

Page 3: Aggregation methods: optimality and fast rates.lecueguillaume.github.io/assets/soutenanceTheseVersionFinale.pdfAggregation methods: optimality and fast rates. Guillaume Lecu´e Universit´e

General Framework. Aggregations Procedures. Optimality in classification. Applications.

Motivation.

M prior estimators (’weak’ estimators) : f1, . . . , fM

n observations : Dn

Aim

Construction of a new estimator which is approximatively as good as thebest ’weak’ estimator :

Aggregation method or Aggregate

Aggregation methods: optimality and fast rates. Universite Paris 6

Page 4: Aggregation methods: optimality and fast rates.lecueguillaume.github.io/assets/soutenanceTheseVersionFinale.pdfAggregation methods: optimality and fast rates. Guillaume Lecu´e Universit´e

General Framework. Aggregations Procedures. Optimality in classification. Applications.

Examples.

Adaptation :

Observations : Dm+n

Estimation : Dm →non-adaptive estimators f1, . . . , fM .

learning : D(n) →aggregate fn (adaptive).

Estimation :

ε−net : f1, . . . , fM (functions)

learning : Dn →aggregate fn.

Aggregation methods: optimality and fast rates. Universite Paris 6

Page 5: Aggregation methods: optimality and fast rates.lecueguillaume.github.io/assets/soutenanceTheseVersionFinale.pdfAggregation methods: optimality and fast rates. Guillaume Lecu´e Universit´e

General Framework. Aggregations Procedures. Optimality in classification. Applications.

Examples.

Adaptation :

Observations : Dm+n

Estimation : Dm →non-adaptive estimators f1, . . . , fM .

learning : D(n) →aggregate fn (adaptive).

Estimation :

ε−net : f1, . . . , fM (functions)

learning : Dn →aggregate fn.

Aggregation methods: optimality and fast rates. Universite Paris 6

Page 6: Aggregation methods: optimality and fast rates.lecueguillaume.github.io/assets/soutenanceTheseVersionFinale.pdfAggregation methods: optimality and fast rates. Guillaume Lecu´e Universit´e

General Framework. Aggregations Procedures. Optimality in classification. Applications.

Model.

(Z, T ) a measurable space,

P the set of all probability measures on (Z, T ),

F : P 7−→ F (example : F (π) = dπ/dµ),

Z : random variable with values in Z,

π probability distribution of Z ,

Dn = (Z1, . . . ,Zn) : n i.i.d. observations of Z .

Problem : Estimation of F (π) from Dn.

Aggregation methods: optimality and fast rates. Universite Paris 6

Page 7: Aggregation methods: optimality and fast rates.lecueguillaume.github.io/assets/soutenanceTheseVersionFinale.pdfAggregation methods: optimality and fast rates. Guillaume Lecu´e Universit´e

General Framework. Aggregations Procedures. Optimality in classification. Applications.

Model.

(Z, T ) a measurable space,

P the set of all probability measures on (Z, T ),

F : P 7−→ F (example : F (π) = dπ/dµ),

Z : random variable with values in Z,

π probability distribution of Z ,

Dn = (Z1, . . . ,Zn) : n i.i.d. observations of Z .

Problem : Estimation of F (π) from Dn.

Aggregation methods: optimality and fast rates. Universite Paris 6

Page 8: Aggregation methods: optimality and fast rates.lecueguillaume.github.io/assets/soutenanceTheseVersionFinale.pdfAggregation methods: optimality and fast rates. Guillaume Lecu´e Universit´e

General Framework. Aggregations Procedures. Optimality in classification. Applications.

Model.

(Z, T ) a measurable space,

P the set of all probability measures on (Z, T ),

F : P 7−→ F (example : F (π) = dπ/dµ),

Z : random variable with values in Z,

π probability distribution of Z ,

Dn = (Z1, . . . ,Zn) : n i.i.d. observations of Z .

Problem : Estimation of F (π) from Dn.

Aggregation methods: optimality and fast rates. Universite Paris 6

Page 9: Aggregation methods: optimality and fast rates.lecueguillaume.github.io/assets/soutenanceTheseVersionFinale.pdfAggregation methods: optimality and fast rates. Guillaume Lecu´e Universit´e

General Framework. Aggregations Procedures. Optimality in classification. Applications.

Model.

loss : Q : Z × F 7−→ R,

risk : A(f ) = E[Q(Z , f )],

A∗def= inff∈F A(f ) = A(f ∗), (f ∗ = F (π))

excess risk : A(f )− A∗ ≥ 0

Empirical Risk :

An(f ) =1

n

n∑i=1

Q(Zi , f ).

Aggregation methods: optimality and fast rates. Universite Paris 6

Page 10: Aggregation methods: optimality and fast rates.lecueguillaume.github.io/assets/soutenanceTheseVersionFinale.pdfAggregation methods: optimality and fast rates. Guillaume Lecu´e Universit´e

General Framework. Aggregations Procedures. Optimality in classification. Applications.

Model.

loss : Q : Z × F 7−→ R,

risk : A(f ) = E[Q(Z , f )],

A∗def= inff∈F A(f ) = A(f ∗), (f ∗ = F (π))

excess risk : A(f )− A∗ ≥ 0

Empirical Risk :

An(f ) =1

n

n∑i=1

Q(Zi , f ).

Aggregation methods: optimality and fast rates. Universite Paris 6

Page 11: Aggregation methods: optimality and fast rates.lecueguillaume.github.io/assets/soutenanceTheseVersionFinale.pdfAggregation methods: optimality and fast rates. Guillaume Lecu´e Universit´e

General Framework. Aggregations Procedures. Optimality in classification. Applications.

Model.

loss : Q : Z × F 7−→ R,

risk : A(f ) = E[Q(Z , f )],

A∗def= inff∈F A(f ) = A(f ∗), (f ∗ = F (π))

excess risk : A(f )− A∗ ≥ 0

Empirical Risk :

An(f ) =1

n

n∑i=1

Q(Zi , f ).

Aggregation methods: optimality and fast rates. Universite Paris 6

Page 12: Aggregation methods: optimality and fast rates.lecueguillaume.github.io/assets/soutenanceTheseVersionFinale.pdfAggregation methods: optimality and fast rates. Guillaume Lecu´e Universit´e

General Framework. Aggregations Procedures. Optimality in classification. Applications.

Model.

loss : Q : Z × F 7−→ R,

risk : A(f ) = E[Q(Z , f )],

A∗def= inff∈F A(f ) = A(f ∗), (f ∗ = F (π))

excess risk : A(f )− A∗ ≥ 0

Empirical Risk :

An(f ) =1

n

n∑i=1

Q(Zi , f ).

Aggregation methods: optimality and fast rates. Universite Paris 6

Page 13: Aggregation methods: optimality and fast rates.lecueguillaume.github.io/assets/soutenanceTheseVersionFinale.pdfAggregation methods: optimality and fast rates. Guillaume Lecu´e Universit´e

General Framework. Aggregations Procedures. Optimality in classification. Applications.

Regression

loss : Q((x , y), f ) = (y − f (x))2,

excess risk : A(f )− A∗ = ||f ∗ − f ||2L2(PX ) where f ∗(x) = E [Y |X = x ] .

Density estimation (f ∗ = dπ/dµ)

KL loss : Q(z , f ) = − log f (z),

excess risk : A(f )− A∗ = K (f ∗|f ).

L2-loss : Q(z , f ) =∫Z f 2dµ− 2f (z),

excess risk : A(f )− A∗ = ||f ∗ − f ||2L2(µ).

Classification

loss : φ : R 7−→ R,Q((x , y), f ) = φ(yf (x)), y ∈ {−1, 1}φ−risk : Aφ(f ) = E[Q((X ,Y ), f )] = E[φ(Yf (X ))].

f ∗ = f φ∗ s.t. Aφ(f φ∗) = minf :X 7−→R Aφ(f ).

Aggregation methods: optimality and fast rates. Universite Paris 6

Page 14: Aggregation methods: optimality and fast rates.lecueguillaume.github.io/assets/soutenanceTheseVersionFinale.pdfAggregation methods: optimality and fast rates. Guillaume Lecu´e Universit´e

General Framework. Aggregations Procedures. Optimality in classification. Applications.

Regression

loss : Q((x , y), f ) = (y − f (x))2,

excess risk : A(f )− A∗ = ||f ∗ − f ||2L2(PX ) where f ∗(x) = E [Y |X = x ] .

Density estimation (f ∗ = dπ/dµ)

KL loss : Q(z , f ) = − log f (z),

excess risk : A(f )− A∗ = K (f ∗|f ).

L2-loss : Q(z , f ) =∫Z f 2dµ− 2f (z),

excess risk : A(f )− A∗ = ||f ∗ − f ||2L2(µ).

Classification

loss : φ : R 7−→ R,Q((x , y), f ) = φ(yf (x)), y ∈ {−1, 1}φ−risk : Aφ(f ) = E[Q((X ,Y ), f )] = E[φ(Yf (X ))].

f ∗ = f φ∗ s.t. Aφ(f φ∗) = minf :X 7−→R Aφ(f ).

Aggregation methods: optimality and fast rates. Universite Paris 6

Page 15: Aggregation methods: optimality and fast rates.lecueguillaume.github.io/assets/soutenanceTheseVersionFinale.pdfAggregation methods: optimality and fast rates. Guillaume Lecu´e Universit´e

General Framework. Aggregations Procedures. Optimality in classification. Applications.

Regression

loss : Q((x , y), f ) = (y − f (x))2,

excess risk : A(f )− A∗ = ||f ∗ − f ||2L2(PX ) where f ∗(x) = E [Y |X = x ] .

Density estimation (f ∗ = dπ/dµ)

KL loss : Q(z , f ) = − log f (z),

excess risk : A(f )− A∗ = K (f ∗|f ).

L2-loss : Q(z , f ) =∫Z f 2dµ− 2f (z),

excess risk : A(f )− A∗ = ||f ∗ − f ||2L2(µ).

Classification

loss : φ : R 7−→ R,Q((x , y), f ) = φ(yf (x)), y ∈ {−1, 1}φ−risk : Aφ(f ) = E[Q((X ,Y ), f )] = E[φ(Yf (X ))].

f ∗ = f φ∗ s.t. Aφ(f φ∗) = minf :X 7−→R Aφ(f ).

Aggregation methods: optimality and fast rates. Universite Paris 6

Page 16: Aggregation methods: optimality and fast rates.lecueguillaume.github.io/assets/soutenanceTheseVersionFinale.pdfAggregation methods: optimality and fast rates. Guillaume Lecu´e Universit´e

General Framework. Aggregations Procedures. Optimality in classification. Applications.

Selectors.

F0 = {f1, . . . , fM} ⊂ F a dictionary.

Empirical Risk Minimization (ERM) :(Vapnik, Chervonenkis...)

f ERMn ∈ Arg min

f∈F0

An(f ).

penalized Empirical Risk Minimization (pERM) :

f ERMn ∈ Arg min

f∈F0

[An(f ) + pen(f )],

where pen is a penalty function. (Barron, Bartlett, Birge,Boucheron, Koltchinski, Lugosi, Massart,...)

Aggregation methods: optimality and fast rates. Universite Paris 6

Page 17: Aggregation methods: optimality and fast rates.lecueguillaume.github.io/assets/soutenanceTheseVersionFinale.pdfAggregation methods: optimality and fast rates. Guillaume Lecu´e Universit´e

General Framework. Aggregations Procedures. Optimality in classification. Applications.

Selectors.

F0 = {f1, . . . , fM} ⊂ F a dictionary.

Empirical Risk Minimization (ERM) :(Vapnik, Chervonenkis...)

f ERMn ∈ Arg min

f∈F0

An(f ).

penalized Empirical Risk Minimization (pERM) :

f ERMn ∈ Arg min

f∈F0

[An(f ) + pen(f )],

where pen is a penalty function. (Barron, Bartlett, Birge,Boucheron, Koltchinski, Lugosi, Massart,...)

Aggregation methods: optimality and fast rates. Universite Paris 6

Page 18: Aggregation methods: optimality and fast rates.lecueguillaume.github.io/assets/soutenanceTheseVersionFinale.pdfAggregation methods: optimality and fast rates. Guillaume Lecu´e Universit´e

General Framework. Aggregations Procedures. Optimality in classification. Applications.

Aggregation methods with exponential weights.

F0 = {f1, . . . , fM} ⊂ F a dictionary.

Aggregate with Exponential weights (AEW) :

f AEWn,T =

∑f∈F0

w(n)T (f )f , where w

(n)T (f ) =

exp (−nTAn(f ))∑g∈F0

exp (−nTAn(g)),

T−1 : temperature parameter.

Cumulative Aggregate with Exponential Weights (CAEW) :(Catoni,Yang,...)

f CAEWn,T =

1

n

n∑k=1

f AEWk,T .

Aggregation methods: optimality and fast rates. Universite Paris 6

Page 19: Aggregation methods: optimality and fast rates.lecueguillaume.github.io/assets/soutenanceTheseVersionFinale.pdfAggregation methods: optimality and fast rates. Guillaume Lecu´e Universit´e

General Framework. Aggregations Procedures. Optimality in classification. Applications.

Aggregation methods with exponential weights.

F0 = {f1, . . . , fM} ⊂ F a dictionary.

Aggregate with Exponential weights (AEW) :

f AEWn,T =

∑f∈F0

w(n)T (f )f , where w

(n)T (f ) =

exp (−nTAn(f ))∑g∈F0

exp (−nTAn(g)),

T−1 : temperature parameter.

Cumulative Aggregate with Exponential Weights (CAEW) :(Catoni,Yang,...)

f CAEWn,T =

1

n

n∑k=1

f AEWk,T .

Aggregation methods: optimality and fast rates. Universite Paris 6

Page 20: Aggregation methods: optimality and fast rates.lecueguillaume.github.io/assets/soutenanceTheseVersionFinale.pdfAggregation methods: optimality and fast rates. Guillaume Lecu´e Universit´e

General Framework. Aggregations Procedures. Optimality in classification. Applications.

Aim of Aggregation(1) : Optimal rate of aggregation.

Definition

∀F0 = {f1, . . . , fM} ⊆ F , ∃fn such that ∀π ∈ P, ∀n ≥ 1

E[A(fn)− A∗

]≤ min

f∈F0

(A(f )− A∗) + C0γ(n,M).

∃F0 = {f1, . . . , fM} such that for any aggregate fn, ∃π ∈ P, ∀n ≥ 1

E[A(fn)− A∗

]≥ min

f∈F0

(A(f )− A∗) + C1γ(n,M).

γ(n,M) is an optimal rate of aggregation and fn is an optimalaggregation procedure.

Aggregation methods: optimality and fast rates. Universite Paris 6

Page 21: Aggregation methods: optimality and fast rates.lecueguillaume.github.io/assets/soutenanceTheseVersionFinale.pdfAggregation methods: optimality and fast rates. Guillaume Lecu´e Universit´e

General Framework. Aggregations Procedures. Optimality in classification. Applications.

Aim of Aggregation(1) : Optimal rate of aggregation.

Definition

∀F0 = {f1, . . . , fM} ⊆ F , ∃fn such that ∀π ∈ P, ∀n ≥ 1

E[A(fn)− A∗

]≤ min

f∈F0

(A(f )− A∗) + C0γ(n,M).

∃F0 = {f1, . . . , fM} such that for any aggregate fn, ∃π ∈ P, ∀n ≥ 1

E[A(fn)− A∗

]≥ min

f∈F0

(A(f )− A∗) + C1γ(n,M).

γ(n,M) is an optimal rate of aggregation and fn is an optimalaggregation procedure.

Aggregation methods: optimality and fast rates. Universite Paris 6

Page 22: Aggregation methods: optimality and fast rates.lecueguillaume.github.io/assets/soutenanceTheseVersionFinale.pdfAggregation methods: optimality and fast rates. Guillaume Lecu´e Universit´e

General Framework. Aggregations Procedures. Optimality in classification. Applications.

Aim of Aggregation(1) : Optimal rate of aggregation.

Definition

∀F0 = {f1, . . . , fM} ⊆ F , ∃fn such that ∀π ∈ P, ∀n ≥ 1

E[A(fn)− A∗

]≤ min

f∈F0

(A(f )− A∗) + C0γ(n,M).

∃F0 = {f1, . . . , fM} such that for any aggregate fn, ∃π ∈ P, ∀n ≥ 1

E[A(fn)− A∗

]≥ min

f∈F0

(A(f )− A∗) + C1γ(n,M).

γ(n,M) is an optimal rate of aggregation and fn is an optimalaggregation procedure.

Aggregation methods: optimality and fast rates. Universite Paris 6

Page 23: Aggregation methods: optimality and fast rates.lecueguillaume.github.io/assets/soutenanceTheseVersionFinale.pdfAggregation methods: optimality and fast rates. Guillaume Lecu´e Universit´e

General Framework. Aggregations Procedures. Optimality in classification. Applications.

Aim of Aggregation(2) : Adaptation.

Definition (Oracle Inequality)

∀F0 = {f1, . . . , fM} ⊆ F , ∃fn such that ∀π ∈ P, ∀n ≥ 1

E[A(fn)− A∗

]≤ C min

f∈F0

(A(f )− A∗) + C0γ(n,M),

where C ≥ 1.

Aggregation methods: optimality and fast rates. Universite Paris 6

Page 24: Aggregation methods: optimality and fast rates.lecueguillaume.github.io/assets/soutenanceTheseVersionFinale.pdfAggregation methods: optimality and fast rates. Guillaume Lecu´e Universit´e

General Framework. Aggregations Procedures. Optimality in classification. Applications.

Continuous scale of loss functions.

Classification problem : Aφ(f ) = E[φ(Yf (X ))],Y ∈ {−1, 1},X ∈ X .

φ(x) = φh(x) =

{(1− h)φ0(x) + hφ1(x) if 0 ≤ h ≤ 1(h − 1)x2 − x + 1 if h > 1,

∀x ∈ R

where φ0(z) = 1I(z≤0) is the 0− 1 loss and φ1(z) = max(0, 1− z) is theHinge loss.

h = 0

Aggregation methods: optimality and fast rates. Universite Paris 6

Page 25: Aggregation methods: optimality and fast rates.lecueguillaume.github.io/assets/soutenanceTheseVersionFinale.pdfAggregation methods: optimality and fast rates. Guillaume Lecu´e Universit´e

General Framework. Aggregations Procedures. Optimality in classification. Applications.

Continuous scale of loss functions.

Classification problem : Aφ(f ) = E[φ(Yf (X ))],Y ∈ {−1, 1},X ∈ X .

φ(x) = φh(x) =

{(1− h)φ0(x) + hφ1(x) if 0 ≤ h ≤ 1(h − 1)x2 − x + 1 if h > 1,

∀x ∈ R

where φ0(z) = 1I(z≤0) is the 0− 1 loss and φ1(z) = max(0, 1− z) is theHinge loss.

h = 1/3

Aggregation methods: optimality and fast rates. Universite Paris 6

Page 26: Aggregation methods: optimality and fast rates.lecueguillaume.github.io/assets/soutenanceTheseVersionFinale.pdfAggregation methods: optimality and fast rates. Guillaume Lecu´e Universit´e

General Framework. Aggregations Procedures. Optimality in classification. Applications.

Continuous scale of loss functions.

Classification problem : Aφ(f ) = E[φ(Yf (X ))],Y ∈ {−1, 1},X ∈ X .

φ(x) = φh(x) =

{(1− h)φ0(x) + hφ1(x) if 0 ≤ h ≤ 1(h − 1)x2 − x + 1 if h > 1,

∀x ∈ R

where φ0(z) = 1I(z≤0) is the 0− 1 loss and φ1(z) = max(0, 1− z) is theHinge loss.

h = 2/3

Aggregation methods: optimality and fast rates. Universite Paris 6

Page 27: Aggregation methods: optimality and fast rates.lecueguillaume.github.io/assets/soutenanceTheseVersionFinale.pdfAggregation methods: optimality and fast rates. Guillaume Lecu´e Universit´e

General Framework. Aggregations Procedures. Optimality in classification. Applications.

Continuous scale of loss functions.

Classification problem : Aφ(f ) = E[φ(Yf (X ))],Y ∈ {−1, 1},X ∈ X .

φ(x) = φh(x) =

{(1− h)φ0(x) + hφ1(x) if 0 ≤ h ≤ 1(h − 1)x2 − x + 1 if h > 1,

∀x ∈ R

where φ0(z) = 1I(z≤0) is the 0− 1 loss and φ1(z) = max(0, 1− z) is theHinge loss.

h = 1

Aggregation methods: optimality and fast rates. Universite Paris 6

Page 28: Aggregation methods: optimality and fast rates.lecueguillaume.github.io/assets/soutenanceTheseVersionFinale.pdfAggregation methods: optimality and fast rates. Guillaume Lecu´e Universit´e

General Framework. Aggregations Procedures. Optimality in classification. Applications.

Continuous scale of loss functions.

Classification problem : Aφ(f ) = E[φ(Yf (X ))],Y ∈ {−1, 1},X ∈ X .

φ(x) = φh(x) =

{(1− h)φ0(x) + hφ1(x) if 0 ≤ h ≤ 1(h − 1)x2 − x + 1 if h > 1,

∀x ∈ R

where φ0(z) = 1I(z≤0) is the 0− 1 loss and φ1(z) = max(0, 1− z) is theHinge loss.

h = 2

Aggregation methods: optimality and fast rates. Universite Paris 6

Page 29: Aggregation methods: optimality and fast rates.lecueguillaume.github.io/assets/soutenanceTheseVersionFinale.pdfAggregation methods: optimality and fast rates. Guillaume Lecu´e Universit´e

General Framework. Aggregations Procedures. Optimality in classification. Applications.

Continuous scale of loss functions.

Classification problem : Aφ(f ) = E[φ(Yf (X ))],Y ∈ {−1, 1},X ∈ X .

φ(x) = φh(x) =

{(1− h)φ0(x) + hφ1(x) if 0 ≤ h ≤ 1(h − 1)x2 − x + 1 if h > 1,

∀x ∈ R

where φ0(z) = 1I(z≤0) is the 0− 1 loss and φ1(z) = max(0, 1− z) is theHinge loss.

h = 3

Aggregation methods: optimality and fast rates. Universite Paris 6

Page 30: Aggregation methods: optimality and fast rates.lecueguillaume.github.io/assets/soutenanceTheseVersionFinale.pdfAggregation methods: optimality and fast rates. Guillaume Lecu´e Universit´e

General Framework. Aggregations Procedures. Optimality in classification. Applications.

ORA in classification

Loss function 0 ≤ h < 1 h = 1 h > 1Optimal rate of aggregation(ORA)

√log M

n

√log M

nlog M

n

Optimal aggregation proce-dure

ERM ERM, AEW, CAEW CAEW

Aggregation methods: optimality and fast rates. Universite Paris 6

Page 31: Aggregation methods: optimality and fast rates.lecueguillaume.github.io/assets/soutenanceTheseVersionFinale.pdfAggregation methods: optimality and fast rates. Guillaume Lecu´e Universit´e

General Framework. Aggregations Procedures. Optimality in classification. Applications.

ORA in classification

Loss function 0 ≤ h < 1 h = 1 h > 1Optimal rate of aggregation(ORA)

√log M

n

√log M

nlog M

n

Optimal aggregation proce-dure

ERM ERM, AEW, CAEW CAEW

Aggregation methods: optimality and fast rates. Universite Paris 6

Page 32: Aggregation methods: optimality and fast rates.lecueguillaume.github.io/assets/soutenanceTheseVersionFinale.pdfAggregation methods: optimality and fast rates. Guillaume Lecu´e Universit´e

General Framework. Aggregations Procedures. Optimality in classification. Applications.

ORA in classification

Loss function 0 ≤ h < 1 h = 1 h > 1Optimal rate of aggregation(ORA)

√log M

n

√log M

nlog M

n

Optimal aggregation proce-dure

ERM ERM, AEW, CAEW CAEW

Aggregation methods: optimality and fast rates. Universite Paris 6

Page 33: Aggregation methods: optimality and fast rates.lecueguillaume.github.io/assets/soutenanceTheseVersionFinale.pdfAggregation methods: optimality and fast rates. Guillaume Lecu´e Universit´e

General Framework. Aggregations Procedures. Optimality in classification. Applications.

ORA in classification

Loss function 0 ≤ h < 1 h = 1 h > 1Optimal rate of aggregation(ORA)

√log M

n

√log M

nlog M

n

Optimal aggregation proce-dure

ERM ERM, AEW, CAEW CAEW

Aggregation methods: optimality and fast rates. Universite Paris 6

Page 34: Aggregation methods: optimality and fast rates.lecueguillaume.github.io/assets/soutenanceTheseVersionFinale.pdfAggregation methods: optimality and fast rates. Guillaume Lecu´e Universit´e

General Framework. Aggregations Procedures. Optimality in classification. Applications.

ORA in classification

Loss function 0 ≤ h < 1 h = 1 h > 1Optimal rate of aggregation(ORA)

√log M

n

√log M

nlog M

n

Optimal aggregation proce-dure

ERM ERM, AEW, CAEW CAEW

Aggregation methods: optimality and fast rates. Universite Paris 6

Page 35: Aggregation methods: optimality and fast rates.lecueguillaume.github.io/assets/soutenanceTheseVersionFinale.pdfAggregation methods: optimality and fast rates. Guillaume Lecu´e Universit´e

General Framework. Aggregations Procedures. Optimality in classification. Applications.

ORA in classification

Loss function 0 ≤ h < 1 h = 1 h > 1Optimal rate of aggregation(ORA)

√log M

n

√log M

nlog M

n

Optimal aggregation proce-dure

ERM ERM, AEW, CAEW CAEW

Aggregation methods: optimality and fast rates. Universite Paris 6

Page 36: Aggregation methods: optimality and fast rates.lecueguillaume.github.io/assets/soutenanceTheseVersionFinale.pdfAggregation methods: optimality and fast rates. Guillaume Lecu´e Universit´e

General Framework. Aggregations Procedures. Optimality in classification. Applications.

2 Questions.

Question 1 : Why is there such a breakdown just after the Hinge loss ?

Question 2 : Do we really need aggregation procedures with exponentialweights to achieve the optimal rates of aggregation ?

Aggregation methods: optimality and fast rates. Universite Paris 6

Page 37: Aggregation methods: optimality and fast rates.lecueguillaume.github.io/assets/soutenanceTheseVersionFinale.pdfAggregation methods: optimality and fast rates. Guillaume Lecu´e Universit´e

General Framework. Aggregations Procedures. Optimality in classification. Applications.

2 Questions.

Question 1 : Why is there such a breakdown just after the Hinge loss ?

0 ≤ h ≤ 1,

√log M

n−→ log M

n, h > 1.

Question 2 : Do we really need aggregation procedures with exponentialweights to achieve the optimal rates of aggregation ?

Aggregation methods: optimality and fast rates. Universite Paris 6

Page 38: Aggregation methods: optimality and fast rates.lecueguillaume.github.io/assets/soutenanceTheseVersionFinale.pdfAggregation methods: optimality and fast rates. Guillaume Lecu´e Universit´e

General Framework. Aggregations Procedures. Optimality in classification. Applications.

2 Questions.

Question 1 : Why is there such a breakdown just after the Hinge loss ?

0 ≤ h ≤ 1,

√log M

n−→ log M

n, h > 1.

Question 2 : Do we really need aggregation procedures with exponentialweights to achieve the optimal rates of aggregation ?

Aggregation methods: optimality and fast rates. Universite Paris 6

Page 39: Aggregation methods: optimality and fast rates.lecueguillaume.github.io/assets/soutenanceTheseVersionFinale.pdfAggregation methods: optimality and fast rates. Guillaume Lecu´e Universit´e

General Framework. Aggregations Procedures. Optimality in classification. Applications.

2 Questions.

Question 1 : Why is there such a breakdown just after the Hinge loss ?

0 ≤ h ≤ 1,

√log M

n−→ log M

n, h > 1.

ERM −→ CAEW

Question 2 : Do we really need aggregation procedures with exponentialweights to achieve the optimal rates of aggregation ?

Aggregation methods: optimality and fast rates. Universite Paris 6

Page 40: Aggregation methods: optimality and fast rates.lecueguillaume.github.io/assets/soutenanceTheseVersionFinale.pdfAggregation methods: optimality and fast rates. Guillaume Lecu´e Universit´e

General Framework. Aggregations Procedures. Optimality in classification. Applications.

Question 1. Why there is a breakdown at h = 1?

Margin assumption for the loss function φ :

The probability measure π satisfies the φ−margin assumption φ−MA(κ),with margin parameter κ ≥ 1 if

E[(φ(Yf (X ))− φ(Yf φ∗(X )))2] ≤ cφ(Aφ(f )− Aφ∗)1/κ,

for any f : X 7−→ R.

cf. Mammen and Tsybakov 99 (discriminant analysis) and Tsybakov 04(classification).

φ0 −MA(κ) ⇐⇒ P[|2η(X )− 1| ≤ t] ≤ tα,∀0 < t < 1, α =1

κ− 1

η(x) = P[Y = 1|X = x ]

(κ = 1 ⇐⇒ ∃h > 0, |2η(X )− 1| ≥ h)

Aggregation methods: optimality and fast rates. Universite Paris 6

Page 41: Aggregation methods: optimality and fast rates.lecueguillaume.github.io/assets/soutenanceTheseVersionFinale.pdfAggregation methods: optimality and fast rates. Guillaume Lecu´e Universit´e

General Framework. Aggregations Procedures. Optimality in classification. Applications.

Question 1. Why there is a breakdown at h = 1?

Margin assumption for the loss function φ :

The probability measure π satisfies the φ−margin assumption φ−MA(κ),with margin parameter κ ≥ 1 if

E[(φ(Yf (X ))− φ(Yf φ∗(X )))2] ≤ cφ(Aφ(f )− Aφ∗)1/κ,

for any f : X 7−→ R.

cf. Mammen and Tsybakov 99 (discriminant analysis) and Tsybakov 04(classification).

φ0 −MA(κ) ⇐⇒ P[|2η(X )− 1| ≤ t] ≤ tα,∀0 < t < 1, α =1

κ− 1

η(x) = P[Y = 1|X = x ]

(κ = 1 ⇐⇒ ∃h > 0, |2η(X )− 1| ≥ h)

Aggregation methods: optimality and fast rates. Universite Paris 6

Page 42: Aggregation methods: optimality and fast rates.lecueguillaume.github.io/assets/soutenanceTheseVersionFinale.pdfAggregation methods: optimality and fast rates. Guillaume Lecu´e Universit´e

General Framework. Aggregations Procedures. Optimality in classification. Applications.

Question 1. Why there is a breakdown at h = 1?

Margin assumption for the loss function φ :

The probability measure π satisfies the φ−margin assumption φ−MA(κ),with margin parameter κ ≥ 1 if

E[(φ(Yf (X ))− φ(Yf φ∗(X )))2] ≤ cφ(Aφ(f )− Aφ∗)1/κ,

for any f : X 7−→ R.

cf. Mammen and Tsybakov 99 (discriminant analysis) and Tsybakov 04(classification).

φ0 −MA(κ) ⇐⇒ P[|2η(X )− 1| ≤ t] ≤ tα,∀0 < t < 1, α =1

κ− 1

η(x) = P[Y = 1|X = x ]

(κ = 1 ⇐⇒ ∃h > 0, |2η(X )− 1| ≥ h)

Aggregation methods: optimality and fast rates. Universite Paris 6

Page 43: Aggregation methods: optimality and fast rates.lecueguillaume.github.io/assets/soutenanceTheseVersionFinale.pdfAggregation methods: optimality and fast rates. Guillaume Lecu´e Universit´e

General Framework. Aggregations Procedures. Optimality in classification. Applications.

Question 1. Why there is a breakdown at h = 1?

κ = +∞ for any 0 ≤ h ≤ 1.

Aggregation methods: optimality and fast rates. Universite Paris 6

Page 44: Aggregation methods: optimality and fast rates.lecueguillaume.github.io/assets/soutenanceTheseVersionFinale.pdfAggregation methods: optimality and fast rates. Guillaume Lecu´e Universit´e

General Framework. Aggregations Procedures. Optimality in classification. Applications.

Question 1. Why there is a breakdown at h = 1?

κ = +∞ for any 0 ≤ h ≤ 1.

Aggregation methods: optimality and fast rates. Universite Paris 6

Page 45: Aggregation methods: optimality and fast rates.lecueguillaume.github.io/assets/soutenanceTheseVersionFinale.pdfAggregation methods: optimality and fast rates. Guillaume Lecu´e Universit´e

General Framework. Aggregations Procedures. Optimality in classification. Applications.

Question 1. Why there is a breakdown at h = 1?

κ = +∞ for any 0 ≤ h ≤ 1.

Aggregation methods: optimality and fast rates. Universite Paris 6

Page 46: Aggregation methods: optimality and fast rates.lecueguillaume.github.io/assets/soutenanceTheseVersionFinale.pdfAggregation methods: optimality and fast rates. Guillaume Lecu´e Universit´e

General Framework. Aggregations Procedures. Optimality in classification. Applications.

Question 1. Why there is a breakdown at h = 1?

κ = +∞ for any 0 ≤ h ≤ 1.

Aggregation methods: optimality and fast rates. Universite Paris 6

Page 47: Aggregation methods: optimality and fast rates.lecueguillaume.github.io/assets/soutenanceTheseVersionFinale.pdfAggregation methods: optimality and fast rates. Guillaume Lecu´e Universit´e

General Framework. Aggregations Procedures. Optimality in classification. Applications.

Question 1. Why there is a breakdown at h = 1?

κ = 1 for any h > 1.

Aggregation methods: optimality and fast rates. Universite Paris 6

Page 48: Aggregation methods: optimality and fast rates.lecueguillaume.github.io/assets/soutenanceTheseVersionFinale.pdfAggregation methods: optimality and fast rates. Guillaume Lecu´e Universit´e

General Framework. Aggregations Procedures. Optimality in classification. Applications.

Question 1. Why there is a breakdown at h = 1?

κ = 1 for any h > 1.

Aggregation methods: optimality and fast rates. Universite Paris 6

Page 49: Aggregation methods: optimality and fast rates.lecueguillaume.github.io/assets/soutenanceTheseVersionFinale.pdfAggregation methods: optimality and fast rates. Guillaume Lecu´e Universit´e

General Framework. Aggregations Procedures. Optimality in classification. Applications.

Question 2 : Do we really need agg. with exp. weights ?

Theorem (suboptimality of selectors)

For any M ≥ 2, φ : R 7−→ R s.t. φ(−1) 6= φ(1),∃f1, . . . , fM : X 7−→ {−1, 1} s.t. for any selector fn, ∃π s.t.

E[Aφ(fn)− Aφ∗

]≥ min

j=1,...,M

(Aφ(fj)− Aφ∗) + C

√log M

n.

Aggregation methods: optimality and fast rates. Universite Paris 6

Page 50: Aggregation methods: optimality and fast rates.lecueguillaume.github.io/assets/soutenanceTheseVersionFinale.pdfAggregation methods: optimality and fast rates. Guillaume Lecu´e Universit´e

General Framework. Aggregations Procedures. Optimality in classification. Applications.

Question 2 : Do we really need agg. with exp. weights ?

Theorem (suboptimality of selectors under the margin assumption)

For any M ≥ 2, κ ≥ 1, φ : R 7−→ R s.t. φ(−1) 6= φ(1),∃f1, . . . , fM : X 7−→ {−1, 1} s.t. for any selector fn, ∃π satisfying theφ0−MA(κ) s.t.

E[Aφ(fn)− Aφ∗

]≥ min

j=1,...,M

(Aφ(fj)− Aφ∗) + C

( log M

n

) κ2κ−1

.

√log M

n>>

( log M

n

) κ2κ−1

>>log M

n, 1 < κ < ∞.

Aggregation methods: optimality and fast rates. Universite Paris 6

Page 51: Aggregation methods: optimality and fast rates.lecueguillaume.github.io/assets/soutenanceTheseVersionFinale.pdfAggregation methods: optimality and fast rates. Guillaume Lecu´e Universit´e

General Framework. Aggregations Procedures. Optimality in classification. Applications.

Question 2 : Do we really need agg. with exp. weights ?

Suboptimality of Penalized ERM.

For any M ≥ 2, κ > 1 and φ : R 7−→ R s.t. φ(−1) 6= φ(1),∃f1, . . . , fM : X 7−→ {−1, 1}, ∃π satisfying the φ0−MA(κ) s.t. the pERMaggregate

f pERMn ∈ Arg min

j=1,...,M(Aφ

n (fj) + pen(fj)),

where |pen(f )| < 16

√log M

n , satisfies

E[Aφ(f pERM

n )− Aφ∗]≥ min

j=1,...,M

(Aφ(fj)− Aφ∗) + C

√log M

n

if√

M log M ≤√

n/(132e3), for any integer n ≥ 1.

Aggregation methods: optimality and fast rates. Universite Paris 6

Page 52: Aggregation methods: optimality and fast rates.lecueguillaume.github.io/assets/soutenanceTheseVersionFinale.pdfAggregation methods: optimality and fast rates. Guillaume Lecu´e Universit´e

General Framework. Aggregations Procedures. Optimality in classification. Applications.

Theorem (Exact Oracle Inequalities for the general framework)

F0 = {f1, . . . , fM} ⊆ F . Assume that π satisfies (MA)(κ, c,F0) MH and|Q(Z , f )− Q(Z , f ∗)| ≤ K ,∀f ∈ F0.

E[A(f ERMn )− A∗] ≤ min

f∈F0

(A(f )− A∗) + 4γ(n,M, κ,B),

where the residual γ(n,M, κ,B) equals to(B

1κ log Mβ1n

)1/2

if B ≥(

log Mβ1n

) κ2κ−1

(log Mβ1n

) κ2κ−1

otherwise,

where B = minf∈F0 (A(f )− A∗) and β1 > 0.

Aggregation methods: optimality and fast rates. Universite Paris 6

Page 53: Aggregation methods: optimality and fast rates.lecueguillaume.github.io/assets/soutenanceTheseVersionFinale.pdfAggregation methods: optimality and fast rates. Guillaume Lecu´e Universit´e

General Framework. Aggregations Procedures. Optimality in classification. Applications.

Theorem (Exact Oracle Inequalities for the general framework)

F0 = {f1, . . . , fM} ⊆ F . Assume that π satisfies (MA)(κ, c,F0) MH and|Q(Z , f )− Q(Z , f ∗)| ≤ K ,∀f ∈ F0.If Q(z , ·) is convex for any z ∈ Z, then

E[A(f AEWn )− A∗] ≤ min

f∈F0

(A(f )− A∗) + 4γ(n,M, κ,B),

where the residual γ(n,M, κ,B) equals to(B

1κ log Mβ1n

)1/2

if B ≥(

log Mβ1n

) κ2κ−1

(log Mβ1n

) κ2κ−1

otherwise,

where B = minf∈F0 (A(f )− A∗) and β1 > 0.

Aggregation methods: optimality and fast rates. Universite Paris 6

Page 54: Aggregation methods: optimality and fast rates.lecueguillaume.github.io/assets/soutenanceTheseVersionFinale.pdfAggregation methods: optimality and fast rates. Guillaume Lecu´e Universit´e

General Framework. Aggregations Procedures. Optimality in classification. Applications.

Oracle Inequality in density estimation

Corollary. In density estimation (Margin parameter κ = 1.)

Let f1, . . . , fM : X 7−→ [0,B]. Assume that f ∗ is bounded by B. For anyε > 0, we have

E[||f ∗ − f AEWn ||2L2(µ)] ≤ (1 + ε) min

j=1,...,M(||f ∗ − fj ||2L2(µ)) +

C

ε

log M

n.

Aggregation of wavelet thresholded estimators⇓

adaptive minimax procedure over all Besov Balls Bsp,∞ for s > 1/p.

(Chesneau, L. (2007))

Aggregation methods: optimality and fast rates. Universite Paris 6

Page 55: Aggregation methods: optimality and fast rates.lecueguillaume.github.io/assets/soutenanceTheseVersionFinale.pdfAggregation methods: optimality and fast rates. Guillaume Lecu´e Universit´e

General Framework. Aggregations Procedures. Optimality in classification. Applications.

Oracle Inequality in density estimation

Corollary. In density estimation (Margin parameter κ = 1.)

Let f1, . . . , fM : X 7−→ [0,B]. Assume that f ∗ is bounded by B. For anyε > 0, we have

E[||f ∗ − f AEWn ||2L2(µ)] ≤ (1 + ε) min

j=1,...,M(||f ∗ − fj ||2L2(µ)) +

C

ε

log M

n.

Aggregation of wavelet thresholded estimators⇓

adaptive minimax procedure over all Besov Balls Bsp,∞ for s > 1/p.

(Chesneau, L. (2007))

Aggregation methods: optimality and fast rates. Universite Paris 6

Page 56: Aggregation methods: optimality and fast rates.lecueguillaume.github.io/assets/soutenanceTheseVersionFinale.pdfAggregation methods: optimality and fast rates. Guillaume Lecu´e Universit´e

General Framework. Aggregations Procedures. Optimality in classification. Applications.

Oracle Inequality in the regression framework

Corollary. In bounded regression (Margin parameter κ = 1.)

Let f1, . . . , fM : X 7−→ [0, 1]. For any ε > 0, we have

E[||f ∗ − f AEWn ||2L2(PX )] ≤ (1 + ε) min

j=1,...,M(||f ∗ − fj ||2L2(PX )) +

C

ε

log M

n.

Aggregation of wavelet thresholded estimators⇓

adaptive minimax procedure over all Besov Balls Bsp,∞ for s > 1/p.

(Chesneau, L. (2007))

Aggregation methods: optimality and fast rates. Universite Paris 6

Page 57: Aggregation methods: optimality and fast rates.lecueguillaume.github.io/assets/soutenanceTheseVersionFinale.pdfAggregation methods: optimality and fast rates. Guillaume Lecu´e Universit´e

General Framework. Aggregations Procedures. Optimality in classification. Applications.

Oracle Inequality in the regression framework

Corollary. In bounded regression (Margin parameter κ = 1.)

Let f1, . . . , fM : X 7−→ [0, 1]. For any ε > 0, we have

E[||f ∗ − f AEWn ||2L2(PX )] ≤ (1 + ε) min

j=1,...,M(||f ∗ − fj ||2L2(PX )) +

C

ε

log M

n.

Aggregation of wavelet thresholded estimators⇓

adaptive minimax procedure over all Besov Balls Bsp,∞ for s > 1/p.

(Chesneau, L. (2007))

Aggregation methods: optimality and fast rates. Universite Paris 6

Page 58: Aggregation methods: optimality and fast rates.lecueguillaume.github.io/assets/soutenanceTheseVersionFinale.pdfAggregation methods: optimality and fast rates. Guillaume Lecu´e Universit´e

General Framework. Aggregations Procedures. Optimality in classification. Applications.

Oracle Inequality in classification

Corollary. In classification under the Margin Assumption.

Let f1, . . . , fM : X 7−→ [−1, 1] and κ ≥ 1. For any π satisfying the marginassumption φ0−MA(κ) and any ε > 0, we have

E[A1(fAEWn )− A∗1 ] ≤ (1 + ε) min

j=1,...,M(A1(fj)− A∗1) + C

( log M

εn

) κ2κ−1

.

Aggregation of SVM classifiers (or plug-in classifiers)⇓

procedure adaptive simultaneously to the complexity and to the marginparameters.

Aggregation methods: optimality and fast rates. Universite Paris 6

Page 59: Aggregation methods: optimality and fast rates.lecueguillaume.github.io/assets/soutenanceTheseVersionFinale.pdfAggregation methods: optimality and fast rates. Guillaume Lecu´e Universit´e

General Framework. Aggregations Procedures. Optimality in classification. Applications.

Oracle Inequality in classification

Corollary. In classification under the Margin Assumption.

Let f1, . . . , fM : X 7−→ [−1, 1] and κ ≥ 1. For any π satisfying the marginassumption φ0−MA(κ) and any ε > 0, we have

E[A1(fAEWn )− A∗1 ] ≤ (1 + ε) min

j=1,...,M(A1(fj)− A∗1) + C

( log M

εn

) κ2κ−1

.

Aggregation of SVM classifiers (or plug-in classifiers)⇓

procedure adaptive simultaneously to the complexity and to the marginparameters.

Aggregation methods: optimality and fast rates. Universite Paris 6

Page 60: Aggregation methods: optimality and fast rates.lecueguillaume.github.io/assets/soutenanceTheseVersionFinale.pdfAggregation methods: optimality and fast rates. Guillaume Lecu´e Universit´e

General Framework. Aggregations Procedures. Optimality in classification. Applications.

(Gaıffas, L. (2007))

Single-index model

(X ,Y ) ∈ Rd × R, Y = g(X ) + σ(X )ε

where ∃ϑ ∈ Sd−1+ =

{v ∈ Rd | ‖v‖2 = 1 et vd ≥ 0

}(index) and a

function f : R 7−→ R (link function) s.t.

g(x) = f (ϑ>x).

ε ⊥ X and ε ∼ N(0, 1) ;

σ0 ≤ σ(·) ≤ σ1 (σ1 known) ;

Aim : estimation of g from n observations Dn := [(Xi ,Yi ); 1 ≤ i ≤ n]

Aggregation methods: optimality and fast rates. Universite Paris 6

Page 61: Aggregation methods: optimality and fast rates.lecueguillaume.github.io/assets/soutenanceTheseVersionFinale.pdfAggregation methods: optimality and fast rates. Guillaume Lecu´e Universit´e

General Framework. Aggregations Procedures. Optimality in classification. Applications.

(Gaıffas, L. (2007))

Single-index model

(X ,Y ) ∈ Rd × R, Y = g(X ) + σ(X )ε

where ∃ϑ ∈ Sd−1+ =

{v ∈ Rd | ‖v‖2 = 1 et vd ≥ 0

}(index) and a

function f : R 7−→ R (link function) s.t.

g(x) = f (ϑ>x).

ε ⊥ X and ε ∼ N(0, 1) ;

σ0 ≤ σ(·) ≤ σ1 (σ1 known) ;

Aim : estimation of g from n observations Dn := [(Xi ,Yi ); 1 ≤ i ≤ n]

Aggregation methods: optimality and fast rates. Universite Paris 6

Page 62: Aggregation methods: optimality and fast rates.lecueguillaume.github.io/assets/soutenanceTheseVersionFinale.pdfAggregation methods: optimality and fast rates. Guillaume Lecu´e Universit´e

General Framework. Aggregations Procedures. Optimality in classification. Applications.

Reduction of dimension

Without the assumption of Single-Index and if g ∈ Hd(s) (Holder class)⇓

n−s/(2s+d) is the minimax rate of convergence.

“Open”question 2 of Stone (82)

is n−s/(2s+1) the minimax rate of estimation of the regression function inthe Single-Index model, if the link function f belongs to a s-Holder Class ?

Aggregation methods: optimality and fast rates. Universite Paris 6

Page 63: Aggregation methods: optimality and fast rates.lecueguillaume.github.io/assets/soutenanceTheseVersionFinale.pdfAggregation methods: optimality and fast rates. Guillaume Lecu´e Universit´e

General Framework. Aggregations Procedures. Optimality in classification. Applications.

Reduction of dimension

Without the assumption of Single-Index and if g ∈ Hd(s) (Holder class)⇓

n−s/(2s+d) is the minimax rate of convergence.

“Open”question 2 of Stone (82)

is n−s/(2s+1) the minimax rate of estimation of the regression function inthe Single-Index model, if the link function f belongs to a s-Holder Class ?

Aggregation methods: optimality and fast rates. Universite Paris 6

Page 64: Aggregation methods: optimality and fast rates.lecueguillaume.github.io/assets/soutenanceTheseVersionFinale.pdfAggregation methods: optimality and fast rates. Guillaume Lecu´e Universit´e

General Framework. Aggregations Procedures. Optimality in classification. Applications.

assumption on the Design PX

Assumption (D)

PX is compactly supported and PX << λd

For any v ∈ Sd−1+ , if µ := dPv>X/Leb :

µ is continuous ;Card{z ∈ Supp µ s.t. µ(z) = 0} < ∞ ;if µ(z) = 0, then µ is U-shaped around z ;

∃β ≥ 0, γ > 0 s.t. ∀v ∈ Sd−1+ and ∀ interval I ⊂ SuppPv>X ;

Pv>X [I ] ≥ γ|I |β+1.

Aggregation methods: optimality and fast rates. Universite Paris 6

Page 65: Aggregation methods: optimality and fast rates.lecueguillaume.github.io/assets/soutenanceTheseVersionFinale.pdfAggregation methods: optimality and fast rates. Guillaume Lecu´e Universit´e

General Framework. Aggregations Procedures. Optimality in classification. Applications.

Holder Balls

1-dimensional Holder Balls H(s, L) (regularity of the link function)

H(s, L) is made of all functions f : R 7−→ R bsc−times differentiablesatisfying ∀z1, z2 ∈ R

|f (bsc)(z1)− f (bsc)(z2)| ≤ L|z1 − z2|s−bsc.

For Q > 0, define

HQ(s, L) := H(s, L) ∩ {f | ‖f ‖∞ := supx|f (x)| ≤ Q}.

Aggregation methods: optimality and fast rates. Universite Paris 6

Page 66: Aggregation methods: optimality and fast rates.lecueguillaume.github.io/assets/soutenanceTheseVersionFinale.pdfAggregation methods: optimality and fast rates. Guillaume Lecu´e Universit´e

General Framework. Aggregations Procedures. Optimality in classification. Applications.

Upper bound

Theorem

If PX satisfies assumption (D), we can construct an estimator gsatisfying for any s ∈ [smin, smax] :

supϑ∈Sd−1

+

supf∈HQ(s,L)

E n‖g − g‖2L2(PX ) ≤ Cn−2s/(2s+1),

where g(·) = f (ϑ>·). The constant C > 0 depends on σ1, L, smin, smax

and PX .

g adapts both to the index and the regularity.

Aggregation methods: optimality and fast rates. Universite Paris 6

Page 67: Aggregation methods: optimality and fast rates.lecueguillaume.github.io/assets/soutenanceTheseVersionFinale.pdfAggregation methods: optimality and fast rates. Guillaume Lecu´e Universit´e

General Framework. Aggregations Procedures. Optimality in classification. Applications.

Lower bound

Theorem

Let s, L,Q > 0 and PX satisfying assumption (D). For any ϑ ∈ Sd−1+ :

infg

supf∈H(s,L)

E n‖g − g‖2L2(PX ) ≥ C ′n−2s/(2s+1)

where inf g denotes the infimum over all estimator g constructed on Dn.

n−2s/(2s+1) is the minimax rate of convergence in the single-indexmodel conjectured by Stone (1982).

Aggregation methods: optimality and fast rates. Universite Paris 6

Page 68: Aggregation methods: optimality and fast rates.lecueguillaume.github.io/assets/soutenanceTheseVersionFinale.pdfAggregation methods: optimality and fast rates. Guillaume Lecu´e Universit´e

General Framework. Aggregations Procedures. Optimality in classification. Applications.

Construction of the estimator

We split the sample in two subsamples

Training sample :

Dm := [(Xi ,Yi ); 1 ≤ i ≤ m] (for instance m = 3n/4)

Learning sample :

D(m) := [(Xi ,Yi );m + 1 ≤ i ≤ n].

Aggregation methods: optimality and fast rates. Universite Paris 6

Page 69: Aggregation methods: optimality and fast rates.lecueguillaume.github.io/assets/soutenanceTheseVersionFinale.pdfAggregation methods: optimality and fast rates. Guillaume Lecu´e Universit´e

General Framework. Aggregations Procedures. Optimality in classification. Applications.

weak estimators : local polynomial estimators

Construction of the weak estimators : We fixe a parameter

λ = (v , s) ∈ Λn = Sd−1+ (∆n)× [smin, smin + (log n)−1, . . . , smax ],

where ∆n = (n log n)−1/(2smin).

We work with the data projected in the direction v :

Dm(v) = [(v>Xi ,Yi ); 1 ≤ i ≤ m];

Construction of a 1-dimensional polynomial estimator f (λ)(·) withthe data Dm(v) for the regularity s LPE .

g (λ)(x) := τQ

(f (λ)(v>x)

)where τQ(g) := max(−Q,min(Q, g)).

We do this for any λ ∈ Λn ! ReCo

Aggregation methods: optimality and fast rates. Universite Paris 6

Page 70: Aggregation methods: optimality and fast rates.lecueguillaume.github.io/assets/soutenanceTheseVersionFinale.pdfAggregation methods: optimality and fast rates. Guillaume Lecu´e Universit´e

General Framework. Aggregations Procedures. Optimality in classification. Applications.

weak estimators : local polynomial estimators

Construction of the weak estimators : We fixe a parameter

λ = (v , s) ∈ Λn = Sd−1+ (∆n)× [smin, smin + (log n)−1, . . . , smax ],

where ∆n = (n log n)−1/(2smin).

We work with the data projected in the direction v :

Dm(v) = [(v>Xi ,Yi ); 1 ≤ i ≤ m];

Construction of a 1-dimensional polynomial estimator f (λ)(·) withthe data Dm(v) for the regularity s LPE .

g (λ)(x) := τQ

(f (λ)(v>x)

)where τQ(g) := max(−Q,min(Q, g)).

We do this for any λ ∈ Λn ! ReCo

Aggregation methods: optimality and fast rates. Universite Paris 6

Page 71: Aggregation methods: optimality and fast rates.lecueguillaume.github.io/assets/soutenanceTheseVersionFinale.pdfAggregation methods: optimality and fast rates. Guillaume Lecu´e Universit´e

General Framework. Aggregations Procedures. Optimality in classification. Applications.

weak estimators : local polynomial estimators

Construction of the weak estimators : We fixe a parameter

λ = (v , s) ∈ Λn = Sd−1+ (∆n)× [smin, smin + (log n)−1, . . . , smax ],

where ∆n = (n log n)−1/(2smin).

We work with the data projected in the direction v :

Dm(v) = [(v>Xi ,Yi ); 1 ≤ i ≤ m];

Construction of a 1-dimensional polynomial estimator f (λ)(·) withthe data Dm(v) for the regularity s LPE .

g (λ)(x) := τQ

(f (λ)(v>x)

)where τQ(g) := max(−Q,min(Q, g)).

We do this for any λ ∈ Λn ! ReCo

Aggregation methods: optimality and fast rates. Universite Paris 6

Page 72: Aggregation methods: optimality and fast rates.lecueguillaume.github.io/assets/soutenanceTheseVersionFinale.pdfAggregation methods: optimality and fast rates. Guillaume Lecu´e Universit´e

General Framework. Aggregations Procedures. Optimality in classification. Applications.

weak estimators : local polynomial estimators

Construction of the weak estimators : We fixe a parameter

λ = (v , s) ∈ Λn = Sd−1+ (∆n)× [smin, smin + (log n)−1, . . . , smax ],

where ∆n = (n log n)−1/(2smin).

We work with the data projected in the direction v :

Dm(v) = [(v>Xi ,Yi ); 1 ≤ i ≤ m];

Construction of a 1-dimensional polynomial estimator f (λ)(·) withthe data Dm(v) for the regularity s LPE .

g (λ)(x) := τQ

(f (λ)(v>x)

)where τQ(g) := max(−Q,min(Q, g)).

We do this for any λ ∈ Λn ! ReCo

Aggregation methods: optimality and fast rates. Universite Paris 6

Page 73: Aggregation methods: optimality and fast rates.lecueguillaume.github.io/assets/soutenanceTheseVersionFinale.pdfAggregation methods: optimality and fast rates. Guillaume Lecu´e Universit´e

General Framework. Aggregations Procedures. Optimality in classification. Applications.

Aggregation of the weak estimators

Adaptation to the regularity and the index by aggregation. Once wehave the dictionary {g (λ);λ ∈ Λn} of weak estimators.

Empirical risk of g :

A(m)(g) :=n∑

i=m+1

(Yi − g(Xi ))2. (1)

For a temperature T−1 > 0, we put a Gibbs measure Gibbs on{g (λ);λ ∈ Λn} :

w(g) :=exp

(− TA(m)(g)

)∑λ∈Λn

exp(− TA(m)(g (λ))

) . (2)

the final estimator is the AEW aggregate of local polynomialestimators :

g :=∑λ∈Λn

w(g (λ))g (λ).

Aggregation methods: optimality and fast rates. Universite Paris 6

Page 74: Aggregation methods: optimality and fast rates.lecueguillaume.github.io/assets/soutenanceTheseVersionFinale.pdfAggregation methods: optimality and fast rates. Guillaume Lecu´e Universit´e

General Framework. Aggregations Procedures. Optimality in classification. Applications.

Remarks on the temperature parameter

if T−1 large ⇒exponential weights are close to the uniform weights.

if T−1 small ⇒all the weights equal zero except one : g is the ERM.

T is a parameter of a trade-off between an aggregate with uniformweights and the ERM.

Quality of the aggregate depends on the choice of T .

Aggregation methods: optimality and fast rates. Universite Paris 6

Page 75: Aggregation methods: optimality and fast rates.lecueguillaume.github.io/assets/soutenanceTheseVersionFinale.pdfAggregation methods: optimality and fast rates. Guillaume Lecu´e Universit´e

General Framework. Aggregations Procedures. Optimality in classification. Applications.

Remarks on the temperature parameter

if T−1 large ⇒exponential weights are close to the uniform weights.

if T−1 small ⇒all the weights equal zero except one : g is the ERM.

T is a parameter of a trade-off between an aggregate with uniformweights and the ERM.

Quality of the aggregate depends on the choice of T .

Aggregation methods: optimality and fast rates. Universite Paris 6

Page 76: Aggregation methods: optimality and fast rates.lecueguillaume.github.io/assets/soutenanceTheseVersionFinale.pdfAggregation methods: optimality and fast rates. Guillaume Lecu´e Universit´e

General Framework. Aggregations Procedures. Optimality in classification. Applications.

Remarks on the temperature parameter

if T−1 large ⇒exponential weights are close to the uniform weights.

if T−1 small ⇒all the weights equal zero except one : g is the ERM.

T is a parameter of a trade-off between an aggregate with uniformweights and the ERM.

Quality of the aggregate depends on the choice of T .

Aggregation methods: optimality and fast rates. Universite Paris 6

Page 77: Aggregation methods: optimality and fast rates.lecueguillaume.github.io/assets/soutenanceTheseVersionFinale.pdfAggregation methods: optimality and fast rates. Guillaume Lecu´e Universit´e

General Framework. Aggregations Procedures. Optimality in classification. Applications.

Aggregation phenomenon depending on the temperature

Weights {w(g (λ)) : λ ∈ Λn} on S1+(∆n) for ϑ = (1/

√2, 1/

√2) and

T = 0.05, 0.2, 0.5, 10

●●●●●●●●●●●●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●●●●●●●●●●●●

−1.0 −0.5 0.0 0.5 1.0

0.0

0.5

1.0

1.5

Aggregation methods: optimality and fast rates. Universite Paris 6

Page 78: Aggregation methods: optimality and fast rates.lecueguillaume.github.io/assets/soutenanceTheseVersionFinale.pdfAggregation methods: optimality and fast rates. Guillaume Lecu´e Universit´e

General Framework. Aggregations Procedures. Optimality in classification. Applications.

Aggregation phenomenon depending on the temperature

Weights {w(g (λ)) : λ ∈ Λn} on S1+(∆n) for ϑ = (1/

√2, 1/

√2) and

T = 0.05, 0.2, 0.5, 10

●●●●●●●●●●●

●●●

●●

●●

●●

●●

●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●●●●●●●●●●●●

−1.0 −0.5 0.0 0.5 1.0

0.0

0.5

1.0

1.5

Aggregation methods: optimality and fast rates. Universite Paris 6

Page 79: Aggregation methods: optimality and fast rates.lecueguillaume.github.io/assets/soutenanceTheseVersionFinale.pdfAggregation methods: optimality and fast rates. Guillaume Lecu´e Universit´e

General Framework. Aggregations Procedures. Optimality in classification. Applications.

Aggregation phenomenon depending on the temperature

Weights {w(g (λ)) : λ ∈ Λn} on S1+(∆n) for ϑ = (1/

√2, 1/

√2) and

T = 0.05, 0.2, 0.5, 10

●●●●●●●●●●●●●●

●●

●●

●●

●●

●●

●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●●●●●●●●●●●

−1.0 −0.5 0.0 0.5 1.0

0.0

0.5

1.0

1.5

Aggregation methods: optimality and fast rates. Universite Paris 6

Page 80: Aggregation methods: optimality and fast rates.lecueguillaume.github.io/assets/soutenanceTheseVersionFinale.pdfAggregation methods: optimality and fast rates. Guillaume Lecu´e Universit´e

General Framework. Aggregations Procedures. Optimality in classification. Applications.

Aggregation phenomenon depending on the temperature

Weights {w(g (λ)) : λ ∈ Λn} on S1+(∆n) for ϑ = (1/

√2, 1/

√2) and

T = 0.05, 0.2, 0.5, 10

●●●●●●●●●●●●●●●●

●●

●●

●●

●●

●●

●●

●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●

●●

●●

●●

●●

●●

●●

●●●●●●●●●●●●●●●●●

−1.0 −0.5 0.0 0.5 1.0 1.5

0.0

0.5

1.0

1.5

Aggregation methods: optimality and fast rates. Universite Paris 6

Page 81: Aggregation methods: optimality and fast rates.lecueguillaume.github.io/assets/soutenanceTheseVersionFinale.pdfAggregation methods: optimality and fast rates. Guillaume Lecu´e Universit´e

General Framework. Aggregations Procedures. Optimality in classification. Applications.

Aggregation phenomenon depending on the temperature

Weights {w(g (λ)) : λ ∈ Λ} on S2+(∆n) for ϑ = (0, 0, 1) and T = 0.05, 0.3, 0.5, 10

●●●●

●●●●

●● ●

●●●

●●

●●

● ●●

●●●

● ●

●●

● ●●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

● ●

●●●

●●

●●

●●

●●

●●

●●

●●●●

●●

●●

●●

●●

● ●

●●

● ●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

● ●● ●

●●

●●

●●

●●

●●

● ●●

● ●●

●●

●●

● ●

●●

●●

● ●

●●

●●

● ●

●●

● ●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

● ●

●●

● ●●

●●

●●

●●●

●● ●

●●

●●

●●

●●

●●

●●

●● ●

●●

●●●

●●

●●

●●

●●

● ●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●●

●●

●●

●●

●●●●

●●

●●

●●

●●

● ●●

●●

●●

●●

●●

●●

● ●

●●

●●●

● ●

●●

●●

●●●

● ●

●●

●●

●●●

●●

●●

●●●

●●

● ●

●●

●●

●●●

●●

●●

●●

●● ●

● ●

●●

● ●

●●●

●●

●●

●●

●● ●

●●

●●

●●

●●●

●●●● ●

●●●

●●

●●

● ●

●●

●●

●●

●●

●●

● ●●

●● ●

●●

●●

●●

●●

●●

●●

● ●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

● ●●●

●●

● ●

● ●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●● ●

●●

●●

●●● ●

●●

● ●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●●

●●

●● ●

●●

●●

● ●●

●●●● ●●●● ●●

−1.0 −0.5 0.0 0.5 1.0−1.0−0.50.00.51.00.0

0.5

1.0

1.5

2.0

Aggregation methods: optimality and fast rates. Universite Paris 6

Page 82: Aggregation methods: optimality and fast rates.lecueguillaume.github.io/assets/soutenanceTheseVersionFinale.pdfAggregation methods: optimality and fast rates. Guillaume Lecu´e Universit´e

General Framework. Aggregations Procedures. Optimality in classification. Applications.

Aggregation phenomenon depending on the temperature

Weights {w(g (λ)) : λ ∈ Λ} on S2+(∆n) for ϑ = (0, 0, 1) and T = 0.05, 0.3, 0.5, 10

●●●●

●●●●

●● ●

●●●

●●

●●

● ●●

●●●

● ●

●●

● ●●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

● ●

●●●

●●

●●

●●

●●

●●

●●

●●●●

●●

●●

●●

● ●

●●

● ●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●● ●

●●

●●

●●

●●

●●

● ●●

● ●●

●●

●●

● ●

●●

●●

● ●

●●

●●

● ●

●●

● ●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●● ●

●●

● ●

●●

● ●●

●●

●●

●●●

● ●

●●

●●

●●

●●

●●

●●

●●

●● ●

●●

●●●

●●

●●

●●

●●

● ●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●●

●●

●●

●●

●●●●

●●

●●

●●

●●

● ●●

●●

●●

●●

●●

●●

● ●

●●

●●●

●●

●●

●●

●●●

● ●

●●

●●

●●

●●●

●●

●●

●●●

●●

● ●

●●

●●

●●●

●●

●●

●●

●● ●

●●

●●

● ●

●●●

●●

●●

●●

●● ●

●●

●●

●●

●●●

●●●● ●

●●

●●●

●●

●●

● ●

●●

●●

●●

●●

●●

● ●

●● ●

●●

●●

●●

●●

●●

● ●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

● ●●●

●●

● ●

● ●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●● ●

●●

●●

●●● ●

●●

● ●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●●

●●

●● ●

●●

●●

● ●●

●●●● ●●●● ●●

−1.0 −0.5 0.0 0.5 1.0−1.0−0.50.00.51.00.0

0.5

1.0

1.5

2.0

Aggregation methods: optimality and fast rates. Universite Paris 6

Page 83: Aggregation methods: optimality and fast rates.lecueguillaume.github.io/assets/soutenanceTheseVersionFinale.pdfAggregation methods: optimality and fast rates. Guillaume Lecu´e Universit´e

General Framework. Aggregations Procedures. Optimality in classification. Applications.

Aggregation phenomenon depending on the temperature

Weights {w(g (λ)) : λ ∈ Λ} on S2+(∆n) for ϑ = (0, 0, 1) and T = 0.05, 0.3, 0.5, 10

●●●●

●●●●

●● ●

●●●

●●

●●

● ●●

●●●

● ●

●●

● ●●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

● ●

●●●

●●

●●

●●

●●

●●

●●

●●●●

●●

●●

●●

●●

● ●

●●

● ●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

● ●● ●

●●

●●

●●

●●

●●

● ●●

● ●●

●●

●●

● ●

●●

●●

● ●

●●

●●

● ●

●●

● ●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

● ●

●●

● ●●

●●

●●

●●●

●● ●

●●

●●

●●

●●

●●

●●

●● ●

●●

●●●

●●

●●

●●

●●

● ●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●●

●●

●●

●●

●●●●

●●

●●

●●

●●

● ●●

●●

●●

●●

●●

●●

● ●

●●

●●●

●●

●●

●●●

●●

●●

●●●

●●

●●

●●●

●●

● ●

●●

●●

●●●

●●

●●

●●

●● ●

● ●

●●

● ●

●●●

●●

●●

●●

●● ●

●●

●●

●●

●●●

●●●● ●

●●●

●●

●●

● ●

●●

●●

●●

●●

●●

● ●●

●● ●

●●

●●

●●

●●

●●

●●

● ●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

● ●●●

●●

● ●

● ●

●●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●●●

●●

●●

●●●●

●●

●●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●●

●●

●● ●

●●

●●

● ●●

●●●● ●●●● ●●

−1.0 −0.5 0.0 0.5 1.0−1.0−0.50.00.51.00.0

0.5

1.0

1.5

2.0

Aggregation methods: optimality and fast rates. Universite Paris 6

Page 84: Aggregation methods: optimality and fast rates.lecueguillaume.github.io/assets/soutenanceTheseVersionFinale.pdfAggregation methods: optimality and fast rates. Guillaume Lecu´e Universit´e

General Framework. Aggregations Procedures. Optimality in classification. Applications.

Aggregation phenomenon depending on the temperature

Weights {w(g (λ)) : λ ∈ Λ} on S2+(∆n) for ϑ = (0, 0, 1) and T = 0.05, 0.3, 0.5, 10

●●●●

●●●●

●● ●

●●●

●●

●●

● ●●

●●●

● ●

●●

● ●●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

● ●

●●●

●●

●●

●●

●●

●●

●●

●●●●

●●

●●

●●

●●

● ●

●●

● ●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

● ●● ●

●●

●●

●●

●●

●●

● ●●

● ●●

●●

●●

● ●

●●

●●

● ●

●●

●●

● ●

●●

● ●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

● ●

●●

● ●●

●●

●●

●●●

●● ●

●●

●●

●●

●●

●●

●●

●●

●● ●

●●

●●●

●●

●●

●●

●●

● ●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●●

●●

●●●●

●●

●●

●●

●●

● ●●

●●

●●

●●

●●

●●

● ●

●●

●●

● ●

●●

●●

●●●

● ●

●●

●●

●●●

●●

●●

●●●

●●

● ●

●●

●●

●●●

●●

●●

●●

●● ●

● ●

●●

● ●

●●●

●●

●●

●●

●● ●

●●

●●

●●

●●●

●●●● ●

●●●

●●

●●

● ●

●●

●●

●●

●●

●●

● ●●

●● ●

●●

●●

●●

●●

●●

●●

● ●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

● ●●●

●●

● ●

● ●

●●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●●●

●●

●●

●●●●

●●

●●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●●

●●

●● ●

●●

●●

● ●●

●●●● ●●●● ●●

−1.0 −0.5 0.0 0.5 1.0−1.0−0.50.00.51.00.0

0.5

1.0

1.5

2.0

Aggregation methods: optimality and fast rates. Universite Paris 6

Page 85: Aggregation methods: optimality and fast rates.lecueguillaume.github.io/assets/soutenanceTheseVersionFinale.pdfAggregation methods: optimality and fast rates. Guillaume Lecu´e Universit´e

General Framework. Aggregations Procedures. Optimality in classification. Applications.

For 100 simulations

MISE against T for f = hardsine, ϑ = (2/√

14, 1/√

14, 3/√

14) and n = 200

0 5 10 15 20 25 30

0.01

00.

020

0.03

0

T

MIS

E

Aggregation methods: optimality and fast rates. Universite Paris 6

Page 86: Aggregation methods: optimality and fast rates.lecueguillaume.github.io/assets/soutenanceTheseVersionFinale.pdfAggregation methods: optimality and fast rates. Guillaume Lecu´e Universit´e

General Framework. Aggregations Procedures. Optimality in classification. Applications.

For 100 simulations

MISE against T for f = hardsine, ϑ = (2/√

14, 1/√

14, 3/√

14) and n = 400

0 5 10 15 20 25 30

0.00

50.

010

0.01

50.

020

T

MIS

E

Aggregation methods: optimality and fast rates. Universite Paris 6

Page 87: Aggregation methods: optimality and fast rates.lecueguillaume.github.io/assets/soutenanceTheseVersionFinale.pdfAggregation methods: optimality and fast rates. Guillaume Lecu´e Universit´e

General Framework. Aggregations Procedures. Optimality in classification. Applications.

For 100 simulations

MISE against T (f = hardsine, ϑ = (2/√

14, 1/√

14, 3/√

14))

n \ T 0.1 0.5 0.7 1.0 1.5 2.0 ERM aggCVT

100 0.029 0.021 0.019 0.018 0.017 0.018 0.037 0.020(.011) (.008) (.008) (.007) (.008) (.009) (.022) (.008)

200 0.016 0.010 0.010 0.009 0.009 0.010 0.026 0.010(.005) (.003) (.003) (.002) (.002) (.003) (0.008) (.003)

400 0.007 0.006 0.005 0.005 0.006 0.007 0.017 0.006(.002) (.001) (.001) (.001) (.001) (.002) (.003) (.001)

MISE against T (f = hardsine, ϑ = (1/√

21,−2/√

21, 0, 4/√

21))

n \ T 0.1 0.5 0.7 1.0 1.5 2.0 ERM aggCVT

100 0.038 0.027 0.021 0.019 0.017 0.017 0.038 0.020(.016) (.010) (.009) (.008) (.007) (.007) (.025) (.010)

200 0.019 0.013 0.012 0.012 0.013 0.014 0.031 0.013(.014) (.009) (.010) (.011) (.012) (.012) (.016) (.010)

400 0.009 0.006 0.005 0.005 0.006 0.007 0.017 0.006(.002) (.001) (.001) (.001) (.001) (.002) (.004) (.001)

Aggregation methods: optimality and fast rates. Universite Paris 6

Page 88: Aggregation methods: optimality and fast rates.lecueguillaume.github.io/assets/soutenanceTheseVersionFinale.pdfAggregation methods: optimality and fast rates. Guillaume Lecu´e Universit´e

General Framework. Aggregations Procedures. Optimality in classification. Applications.

Some perspectives.

To introduce a margin parameter in the problem of prediction ofindividual sequences.

To construct of aggregation methods in a framework withoutempirical risk (pointwise estimation, Lp−risk for p 6= 2, ...).

To explore the quality of some randomized aggregates fornon-convex losses :

f Randn = fj , with probability wT (fj).

To explore the quality of aggregation of the ERM over the convexclass of the dictionary :

fn ∈ Arg minf∈C

An(f ), where C = {M∑

j=1

λj fj ;λj ≥ 0,∑

λj = 1}.

To construct of sparse aggregate for models selection.

To find the optimal Temperature parameter.

Aggregation methods: optimality and fast rates. Universite Paris 6

Page 89: Aggregation methods: optimality and fast rates.lecueguillaume.github.io/assets/soutenanceTheseVersionFinale.pdfAggregation methods: optimality and fast rates. Guillaume Lecu´e Universit´e

General Framework. Aggregations Procedures. Optimality in classification. Applications.

Some perspectives.

To introduce a margin parameter in the problem of prediction ofindividual sequences.

To construct of aggregation methods in a framework withoutempirical risk (pointwise estimation, Lp−risk for p 6= 2, ...).

To explore the quality of some randomized aggregates fornon-convex losses :

f Randn = fj , with probability wT (fj).

To explore the quality of aggregation of the ERM over the convexclass of the dictionary :

fn ∈ Arg minf∈C

An(f ), where C = {M∑

j=1

λj fj ;λj ≥ 0,∑

λj = 1}.

To construct of sparse aggregate for models selection.

To find the optimal Temperature parameter.

Aggregation methods: optimality and fast rates. Universite Paris 6

Page 90: Aggregation methods: optimality and fast rates.lecueguillaume.github.io/assets/soutenanceTheseVersionFinale.pdfAggregation methods: optimality and fast rates. Guillaume Lecu´e Universit´e

General Framework. Aggregations Procedures. Optimality in classification. Applications.

Some perspectives.

To introduce a margin parameter in the problem of prediction ofindividual sequences.

To construct of aggregation methods in a framework withoutempirical risk (pointwise estimation, Lp−risk for p 6= 2, ...).

To explore the quality of some randomized aggregates fornon-convex losses :

f Randn = fj , with probability wT (fj).

To explore the quality of aggregation of the ERM over the convexclass of the dictionary :

fn ∈ Arg minf∈C

An(f ), where C = {M∑

j=1

λj fj ;λj ≥ 0,∑

λj = 1}.

To construct of sparse aggregate for models selection.

To find the optimal Temperature parameter.

Aggregation methods: optimality and fast rates. Universite Paris 6

Page 91: Aggregation methods: optimality and fast rates.lecueguillaume.github.io/assets/soutenanceTheseVersionFinale.pdfAggregation methods: optimality and fast rates. Guillaume Lecu´e Universit´e

General Framework. Aggregations Procedures. Optimality in classification. Applications.

Some perspectives.

To introduce a margin parameter in the problem of prediction ofindividual sequences.

To construct of aggregation methods in a framework withoutempirical risk (pointwise estimation, Lp−risk for p 6= 2, ...).

To explore the quality of some randomized aggregates fornon-convex losses :

f Randn = fj , with probability wT (fj).

To explore the quality of aggregation of the ERM over the convexclass of the dictionary :

fn ∈ Arg minf∈C

An(f ), where C = {M∑

j=1

λj fj ;λj ≥ 0,∑

λj = 1}.

To construct of sparse aggregate for models selection.

To find the optimal Temperature parameter.

Aggregation methods: optimality and fast rates. Universite Paris 6

Page 92: Aggregation methods: optimality and fast rates.lecueguillaume.github.io/assets/soutenanceTheseVersionFinale.pdfAggregation methods: optimality and fast rates. Guillaume Lecu´e Universit´e

General Framework. Aggregations Procedures. Optimality in classification. Applications.

Some perspectives.

To introduce a margin parameter in the problem of prediction ofindividual sequences.

To construct of aggregation methods in a framework withoutempirical risk (pointwise estimation, Lp−risk for p 6= 2, ...).

To explore the quality of some randomized aggregates fornon-convex losses :

f Randn = fj , with probability wT (fj).

To explore the quality of aggregation of the ERM over the convexclass of the dictionary :

fn ∈ Arg minf∈C

An(f ), where C = {M∑

j=1

λj fj ;λj ≥ 0,∑

λj = 1}.

To construct of sparse aggregate for models selection.

To find the optimal Temperature parameter.

Aggregation methods: optimality and fast rates. Universite Paris 6

Page 93: Aggregation methods: optimality and fast rates.lecueguillaume.github.io/assets/soutenanceTheseVersionFinale.pdfAggregation methods: optimality and fast rates. Guillaume Lecu´e Universit´e

General Framework. Aggregations Procedures. Optimality in classification. Applications.

Some perspectives.

To introduce a margin parameter in the problem of prediction ofindividual sequences.

To construct of aggregation methods in a framework withoutempirical risk (pointwise estimation, Lp−risk for p 6= 2, ...).

To explore the quality of some randomized aggregates fornon-convex losses :

f Randn = fj , with probability wT (fj).

To explore the quality of aggregation of the ERM over the convexclass of the dictionary :

fn ∈ Arg minf∈C

An(f ), where C = {M∑

j=1

λj fj ;λj ≥ 0,∑

λj = 1}.

To construct of sparse aggregate for models selection.

To find the optimal Temperature parameter.

Aggregation methods: optimality and fast rates. Universite Paris 6

Page 94: Aggregation methods: optimality and fast rates.lecueguillaume.github.io/assets/soutenanceTheseVersionFinale.pdfAggregation methods: optimality and fast rates. Guillaume Lecu´e Universit´e

General Framework. Aggregations Procedures. Optimality in classification. Applications.

Some perspectives.

To introduce a margin parameter in the problem of prediction ofindividual sequences.

To construct of aggregation methods in a framework withoutempirical risk (pointwise estimation, Lp−risk for p 6= 2, ...).

To explore the quality of some randomized aggregates fornon-convex losses :

f Randn = fj , with probability wT (fj).

To explore the quality of aggregation of the ERM over the convexclass of the dictionary :

fn ∈ Arg minf∈C

An(f ), where C = {M∑

j=1

λj fj ;λj ≥ 0,∑

λj = 1}.

To construct of sparse aggregate for models selection.

To find the optimal Temperature parameter.

Aggregation methods: optimality and fast rates. Universite Paris 6

Page 95: Aggregation methods: optimality and fast rates.lecueguillaume.github.io/assets/soutenanceTheseVersionFinale.pdfAggregation methods: optimality and fast rates. Guillaume Lecu´e Universit´e

General Framework. Aggregations Procedures. Optimality in classification. Applications.

Margin assumption in the general framework

Margin Assumption :

The probability measure π satisfies the margin assumption MA(κ, c,F0),where κ ≥ 1, c > 0 and F0 ⊂ F if

E[(Q(Z , f )− Q(Z , f ∗))2] ≤ c(A(f )− A∗)1/κ,

for any function f ∈ F0.

Aggregation methods: optimality and fast rates. Universite Paris 6

Page 96: Aggregation methods: optimality and fast rates.lecueguillaume.github.io/assets/soutenanceTheseVersionFinale.pdfAggregation methods: optimality and fast rates. Guillaume Lecu´e Universit´e

General Framework. Aggregations Procedures. Optimality in classification. Applications.

Where does this Gibbs measure come from ?

The weights w = (wλ)λ∈Λ := (w(g (λ)))λ∈Λ are solution of

min(R(m)(θ) +

1

T

∑λ∈Λ

θλ log θλ

∣∣ (θλ) ∈ C),

where Rm(θ) :=∑

λ∈Λ θλR(m)(g(λ)) and

C :={

(θλ)λ∈Λ s.t. θλ ≥ 0 and∑λ∈Λ

θλ = 1}

.

Aggregation methods: optimality and fast rates. Universite Paris 6

Page 97: Aggregation methods: optimality and fast rates.lecueguillaume.github.io/assets/soutenanceTheseVersionFinale.pdfAggregation methods: optimality and fast rates. Guillaume Lecu´e Universit´e

General Framework. Aggregations Procedures. Optimality in classification. Applications.

Reduction of complexity : “preselection”of the weak estimators.

iterative Construction of the dictionary : step 1, 2, 3, 4 (ϑ = (2/√

14, 1/√

14, 3/√

14))

●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●

−1.0 −0.5 0.0 0.5 1.0−1.0−0.50.00.51.00.0

0.5

1.0

1.5

2.0

Aggregation methods: optimality and fast rates. Universite Paris 6

Page 98: Aggregation methods: optimality and fast rates.lecueguillaume.github.io/assets/soutenanceTheseVersionFinale.pdfAggregation methods: optimality and fast rates. Guillaume Lecu´e Universit´e

General Framework. Aggregations Procedures. Optimality in classification. Applications.

Reduction of complexity : “preselection”of the weak estimators.

iterative Construction of the dictionary : step 1, 2, 3, 4 (ϑ = (2/√

14, 1/√

14, 3/√

14))

●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

−1.0 −0.5 0.0 0.5 1.0−1.0−0.50.00.51.00.0

0.5

1.0

1.5

2.0

Aggregation methods: optimality and fast rates. Universite Paris 6

Page 99: Aggregation methods: optimality and fast rates.lecueguillaume.github.io/assets/soutenanceTheseVersionFinale.pdfAggregation methods: optimality and fast rates. Guillaume Lecu´e Universit´e

General Framework. Aggregations Procedures. Optimality in classification. Applications.

Reduction of complexity : “preselection”of the weak estimators.

iterative Construction of the dictionary : step 1, 2, 3, 4 (ϑ = (2/√

14, 1/√

14, 3/√

14))

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●

−1.0 −0.5 0.0 0.5 1.0−1.0−0.50.00.51.00.0

0.5

1.0

1.5

2.0

Aggregation methods: optimality and fast rates. Universite Paris 6

Page 100: Aggregation methods: optimality and fast rates.lecueguillaume.github.io/assets/soutenanceTheseVersionFinale.pdfAggregation methods: optimality and fast rates. Guillaume Lecu´e Universit´e

General Framework. Aggregations Procedures. Optimality in classification. Applications.

Reduction of complexity : “preselection”of the weak estimators.

iterative Construction of the dictionary : step 1, 2, 3, 4 (ϑ = (2/√

14, 1/√

14, 3/√

14))

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

−1.0 −0.5 0.0 0.5 1.0−1.0−0.50.00.51.00.0

0.5

1.0

1.5

2.0

Aggregation methods: optimality and fast rates. Universite Paris 6

Page 101: Aggregation methods: optimality and fast rates.lecueguillaume.github.io/assets/soutenanceTheseVersionFinale.pdfAggregation methods: optimality and fast rates. Guillaume Lecu´e Universit´e

General Framework. Aggregations Procedures. Optimality in classification. Applications.

weak estimators : local polynomial estimators

Construction of the weak estimators : the parameterλ = (v , s) ∈ Λn = Sd−1

+ (∆n)× [smin, smin + (log n)−1, . . . , smax ] is fixedWe work with the data projected in the direction v :

Dm(v) = [(Zi ,Yi ); 1 ≤ i ≤ m] where Zi := v>Xi ;

If h > 0 (window), define P(z,h) ∈ Polr minimizing

m∑i=1

(Yi − P(Zi − z))21Zi∈[z−h,z+h],

with the window at the point z given by

Hm(z) := argminh>0

{Lhs ≥ σ1

(mPZ [z − h, z + h])1/2

}where PZ (A) := 1

m

∑mi=1 1Zi∈A

Set f (z) := P(z,Hm(z))(z), and for Q > 0 fixed and all x ∈ Rd ,

g (λ)(x) := τQ

(f (λ)(v>x)

)where τQ(g) := max(−Q,min(Q, g)).

We do this for any λ ∈ Λn ! ReCo

Aggregation methods: optimality and fast rates. Universite Paris 6