Segmentation of the mean of heteroscedastic data via cross...


Page 1: Segmentation of the mean of heteroscedastic data via cross ...labomath.univ-lille1.fr/~celisse/Talks/091020GDR.pdf · Segmentation of the mean of heteroscedastic data via cross-validation


Statistical framework Empirical Risk Minimization Cross-validation Results Conclusion

Segmentation of the mean of heteroscedastic data via cross-validation

Alain Celisse

UMR 8524 CNRS - Université Lille 1

SSB Group, Paris

joint work with Sylvain Arlot

GDR «Statistique et Santé»

Paris, October 21, 2009

Segmentation of the mean of heteroscedastic data via cross-validation Alain Celisse


Illustration: Original signal

[Figure: signal plotted against position t (0 to 100); vertical axis "Signal", from −0.8 to 1.2.]


Illustration: Observed signal (discretized)

[Figure: discretized signal (n=100 observations) plotted against position t (0 to 100); vertical axis "Signal", from −0.8 to 1.2.]


Illustration: Find breakpoints

[Figure: signal plotted against position t (0 to 100), annotated with question marks; vertical axis "Signal", from −0.8 to 1.2.]


Illustration: True regression function

[Figure: signal and regression function plotted against position t (0 to 100); legend: Signal, Reg. func.; annotated with question marks.]


Statistical framework: Change-point detection

$(t_1, Y_1), \ldots, (t_n, Y_n) \in [0, 1] \times \mathcal{Y}$ independent,

$Y_i = s(t_i) + \sigma_i \varepsilon_i \in \mathcal{Y} = \mathbb{R}$

Instants $t_i$: deterministic (e.g. $t_i = i/n$).

$s$: piecewise constant

Residuals $\varepsilon$: $E[\varepsilon_i] = 0$ and $E[\varepsilon_i^2] = 1$.

Noise level: $\sigma_i$ (heteroscedastic)
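As an illustration of this model, the following is a minimal simulation sketch. The breakpoint positions, levels, noise values and Gaussian residuals are all hypothetical choices made for the example; the slides only require zero-mean, unit-variance residuals.

```python
import numpy as np

def simulate(n=100, breakpoints=(0.3, 0.6), levels=(0.0, 1.0, 0.3),
             sigmas=(0.1, 0.5, 0.1), seed=0):
    """Draw Y_i = s(t_i) + sigma_i * eps_i on the grid t_i = i/n.

    All default values are illustrative assumptions, not taken from the talk.
    """
    rng = np.random.default_rng(seed)
    t = np.arange(1, n + 1) / n
    # Index of the piece containing each t_i (0, 1, ..., len(breakpoints)).
    piece = np.searchsorted(np.asarray(breakpoints), t, side="right")
    s = np.asarray(levels)[piece]        # piecewise-constant mean s(t_i)
    sigma = np.asarray(sigmas)[piece]    # noise level sigma_i, changes with position
    eps = rng.standard_normal(n)         # E[eps_i] = 0 and E[eps_i^2] = 1
    return t, s, sigma, s + sigma * eps

t, s, sigma, y = simulate()
```

Here the noise level changes along with the mean, which is one simple way to obtain heteroscedastic data; the two could equally be given separate partitions.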


Estimation versus Identification

Purpose:

Estimate s to recover most of the important jumps w.r.t. the noise level −→ Estimation purpose.

[Figure: zoom on positions 55 to 100; legend: Signal: Y, Reg. func. s.]

Strategy:

1 Use piecewise constant functions.

2 Adopt the model selection point of view.


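Model selection

Models:

$(I_\lambda)_{\lambda \in \Lambda_m}$: partition of $[0, 1]$

$S_m$: linear space of piecewise constant functions on $(I_\lambda)_{\lambda \in \Lambda_m}$

Strategy:

$(S_m)_{m \in \mathcal{M}_n} \longrightarrow (\hat s_m)_{m \in \mathcal{M}_n} \longrightarrow \hat s_{\hat m}$ ???

Goal:

Oracle inequality (in expectation, or with large probability):

$\| s - \hat s_{\hat m} \|^2 \le C \inf_{m \in \mathcal{M}_n} \left\{ \| s - \hat s_m \|^2 + R(m, n) \right\}$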


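Least-squares estimator

Empirical risk minimizer over $S_m$ (= model):

$\hat s_m \in \arg\min_{u \in S_m} P_n\gamma(u) = \arg\min_{u \in S_m} \frac{1}{n} \sum_{i=1}^{n} (u(t_i) - Y_i)^2$.

Regressogram:

$\hat s_m = \sum_{\lambda \in \Lambda_m} \hat\beta_\lambda \mathbb{1}_{I_\lambda}$, $\quad \hat\beta_\lambda = \frac{1}{\mathrm{Card}\{t_i \in I_\lambda\}} \sum_{t_i \in I_\lambda} Y_i$.

Oracle:

$m^* := \mathrm{Argmin}_{m \in \mathcal{M}_n} \| s - \hat s_m \|^2$.

−→ $\hat s_{m^*}$: best estimator among $\{\hat s_m \mid m \in \mathcal{M}_n\}$.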
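The regressogram and its empirical risk can be sketched in a few lines. The function names and the convention of describing a partition by integer breakpoints (indices where a new segment starts) are assumptions made for this example.

```python
import numpy as np

def regressogram(y, breakpoints):
    """Least-squares fit on the partition defined by integer breakpoints.

    Hypothetical convention: `breakpoints` lists the indices where a new
    segment starts, so segment k covers y[edges[k]:edges[k + 1]].
    """
    y = np.asarray(y, dtype=float)
    edges = [0, *sorted(breakpoints), len(y)]
    fit = np.empty_like(y)
    for a, b in zip(edges[:-1], edges[1:]):
        fit[a:b] = y[a:b].mean()   # beta_lambda: mean of the Y_i falling in I_lambda
    return fit

def empirical_risk(y, fit):
    """P_n gamma(u) = (1/n) * sum_i (u(t_i) - Y_i)^2."""
    return float(np.mean((np.asarray(fit) - np.asarray(y)) ** 2))

y = np.array([0.1, -0.1, 0.0, 1.1, 0.9, 1.0])
fit = regressogram(y, breakpoints=[3])
risk = empirical_risk(y, fit)
```

The oracle cannot be computed this way in practice, since it needs the unknown s; it only serves as a benchmark in simulations.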


Empirical Risk Minimization (ERM)

Assumption:

The number D − 1 of breakpoints is known.

Question:

Find the locations of the D − 1 breakpoints (D is given).

Strategy:

The "best" segmentation in D pieces is obtained by applying the ERM algorithm over $\bigcup_{m \mid D_m = D} S_m$:

ERM algorithm:

$\hat m_{\mathrm{ERM}}(D) = \mathrm{Argmin}_{m \mid D_m = D} P_n\gamma(\hat s_m)$.
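To make the argmin concrete, here is a deliberately naive sketch that enumerates every segmentation into D pieces and keeps the one with the smallest empirical risk. It mirrors the definition only; the function name is hypothetical, and the enumeration is usable only for small n because the number of candidate models grows combinatorially.

```python
from itertools import combinations
import numpy as np

def erm_exhaustive(y, D):
    """Argmin of the empirical risk over all segmentations into D pieces.

    Enumerates every choice of D - 1 breakpoints among positions 1..n-1.
    Meant to mirror the definition, not to be fast.
    """
    y = np.asarray(y, dtype=float)
    n = len(y)
    best_risk, best_bps = np.inf, None
    for bps in combinations(range(1, n), D - 1):
        edges = (0, *bps, n)
        # Residual sum of squares of the regressogram on this partition.
        rss = sum(((y[a:b] - y[a:b].mean()) ** 2).sum()
                  for a, b in zip(edges[:-1], edges[1:]))
        if rss / n < best_risk:
            best_risk, best_bps = rss / n, list(bps)
    return best_bps, float(best_risk)

bps, risk = erm_exhaustive([0, 0, 0, 1, 1, 1, 5, 5], D=3)
```

On this noiseless toy input the only zero-risk segmentation places breakpoints at indices 3 and 6.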


ERM segmentation: Homoscedastic

[Figure: Segmentation (Homoscedastic); signal plotted against position t (0 to 100); legend: Yi, Signal, Oracle, ERM.]

−→ ERM is close to the oracle


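Expectations

Homoscedastic:

$R(\hat s_m) = \mathrm{dist}(s, S_m) + \sigma^2 \frac{D_m}{n} + \mathrm{const}$,

$E[P_n\gamma(\hat s_m)] = \mathrm{dist}(s, S_m) - \sigma^2 \frac{D_m}{n} + \mathrm{const}$.

Conclusions:

1 Among models of the same dimension D, the variance term $\sigma^2 D_m / n$ is identical, so it does not matter,

2 The models $S_m$ are only distinguished according to $\mathrm{dist}(s, S_m)$.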
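The sign flip of the variance term can be checked numerically. The sketch below is a Monte Carlo check under assumptions chosen for convenience: s = 0 (so that dist(s, S_m) = 0), Gaussian noise, and a model with D equal-length pieces. With the loss taken as (1/n) Σ (ŝ_m(t_i) − s(t_i))², a direct computation gives the additive constants 0 for the loss and σ² for the empirical risk in this setting.

```python
import numpy as np

rng = np.random.default_rng(0)
n, D, sigma, trials = 60, 6, 1.0, 20000

# Each row is one sample of size n from the model with s = 0.
Y = sigma * rng.standard_normal((trials, n))
blocks = Y.reshape(trials, D, n // D)          # D equal-length segments
fit = blocks.mean(axis=2, keepdims=True)       # regressogram value on each segment

# Empirical risk P_n gamma(s_hat_m) and loss ||s - s_hat_m||^2 / n, per sample.
emp = ((blocks - fit) ** 2).sum(axis=(1, 2)) / n
loss = (fit ** 2).sum(axis=(1, 2)) * (n // D) / n

mean_emp, mean_loss = emp.mean(), loss.mean()
predicted_emp = sigma**2 - sigma**2 * D / n    # const - sigma^2 * D_m / n
predicted_loss = sigma**2 * D / n              # + sigma^2 * D_m / n
```

The two averages should land close to the predicted values, with the variance term entering with opposite signs.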


ERM segmentation: Heteroscedastic

[Figure: Segmentation (Heteroscedastic); signal plotted against position t (0 to 100); legend: Yi, Signal, Oracle, ERM.]

−→ ERM overfits in noisy regions


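ERM overfitting: Expectations

Heteroscedastic:

$R(\hat s_m) = \mathrm{dist}(s, S_m) + \frac{1}{n} \sum_{\lambda} (\sigma^r_\lambda)^2 + \mathrm{const}$,

$E[P_n\gamma(\hat s_m)] = \mathrm{dist}(s, S_m) - \frac{1}{n} \sum_{\lambda} (\sigma^r_\lambda)^2 + \mathrm{const}$,

with $(\sigma^r_\lambda)^2 := \frac{1}{n_\lambda} \sum_{i=1}^{n} \sigma_i^2 \mathbb{1}_{I_\lambda}(t_i)$, $\quad n_\lambda := \mathrm{Card}(\{i \mid t_i \in I_\lambda\})$.

Conclusions:

1 The variance term differs between models $S_m$ of the same dimension D,

2 ERM tends to put breakpoints in the noisy regions.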
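A small numeric sketch shows how the variance term depends on where the breakpoints sit. The noise profile (quiet first half, noisy second half) and the two candidate segmentations are hypothetical; the function name is also an assumption.

```python
import numpy as np

def variance_term(sigma, breakpoints):
    """(1/n) * sum over segments of (sigma^r_lambda)^2.

    (sigma^r_lambda)^2 is the average of sigma_i^2 over the points in I_lambda.
    """
    sigma2 = np.asarray(sigma, dtype=float) ** 2
    edges = [0, *sorted(breakpoints), len(sigma2)]
    return float(sum(sigma2[a:b].mean()
                     for a, b in zip(edges[:-1], edges[1:])) / len(sigma2))

n = 100
sigma = np.where(np.arange(n) < 50, 0.1, 1.0)    # quiet first half, noisy second half

quiet = variance_term(sigma, [10, 20, 30, 40])   # all 4 breakpoints in the quiet half
noisy = variance_term(sigma, [60, 70, 80, 90])   # all 4 breakpoints in the noisy half
```

Both segmentations have dimension D = 5, yet the second has the larger variance term. Because that term is subtracted in the expected empirical risk, ERM is biased toward the second kind of segmentation.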


Cross-validation principle

[Figure: four panels, each with horizontal axis from 0 to 1 and vertical axis from −3 to 3.]


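Cross-validation

Leave-p-out (Lpo): $\forall\, 1 \le p \le n - 1$,

$\hat R_p(\hat s_m) = \binom{n}{p}^{-1} \sum_{D^{(t)} \in \mathcal{E}_p} \frac{1}{p} \sum_{Z_i \in D^{(v)}} \left( \hat s_m^{D^{(t)}}(X_i) - Y_i \right)^2$,

where $\mathcal{E}_p = \left\{ D^{(t)} \subset \{Z_1, \ldots, Z_n\} \mid \mathrm{Card}(D^{(t)}) = n - p \right\}$.

Algorithmic complexity: exponential.

Theorem (C. Ph.D. (2008))

$\hat R_p(\hat s_m) = \sum_{\lambda \in \Lambda(m)} \left\{ S_{\lambda,2} A_\lambda + (S_{\lambda,1}^2 - S_{\lambda,2}) B_\lambda \right\}$,

where $S_{\lambda,1} := \sum_{i=1}^{n} Y_i \mathbb{1}_{I_\lambda}(t_i)$, $\quad S_{\lambda,2} := \sum_{i=1}^{n} Y_i^2 \mathbb{1}_{I_\lambda}(t_i)$,

$A_\lambda, B_\lambda$: known functions.

Algorithmic complexity: O(n).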
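The slides do not give the expressions of A_λ and B_λ. For the special case p = 1 they can be worked out by hand: the leave-one-out residual of a regressogram at a point of a segment with n_λ points is n_λ/(n_λ − 1) times the ordinary residual. The sketch below uses this elementary derivation, not the general formulas of the cited thesis, and assumes every segment holds at least two points. It compares a brute-force refit with the closed form written in the same shape as the theorem.

```python
import numpy as np

def loo_brute_force(y, edges):
    """Leave-one-out risk of the regressogram, refitting once per held-out point."""
    y = np.asarray(y, dtype=float)
    n, total = len(y), 0.0
    for a, b in zip(edges[:-1], edges[1:]):
        for i in range(a, b):
            others = np.delete(y[a:b], i - a)   # training points left in the segment
            total += (others.mean() - y[i]) ** 2
    return total / n

def loo_closed_form(y, edges):
    """Same quantity from the per-segment sums S1 = sum(Y_i), S2 = sum(Y_i^2)."""
    y = np.asarray(y, dtype=float)
    n, total = len(y), 0.0
    for a, b in zip(edges[:-1], edges[1:]):
        n_seg = b - a                           # must be >= 2
        s1, s2 = y[a:b].sum(), (y[a:b] ** 2).sum()
        A = n_seg / (n_seg - 1) / n             # p = 1 coefficients, derived by hand
        B = -n_seg / (n_seg - 1) ** 2 / n
        total += s2 * A + (s1 ** 2 - s2) * B
    return total

rng = np.random.default_rng(1)
y = rng.standard_normal(30)
edges = [0, 7, 18, 30]
brute, closed = loo_brute_force(y, edges), loo_closed_form(y, edges)
```

The closed form only needs the two sums per segment, which is what makes a linear-time evaluation possible once cumulative sums are available.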


Applicability of cross-validation

Lpo-based model selection procedure:

1 Lpo is computationally tractable

   C. and Robin (2008), CSDA: Density

   C. and Robin (2008), arXiv: Multiple Testing

   C. Ph.D. (2008), TEL: Density, regression

   C. (2009), arXiv: Density

2 As computationally expensive as ERM.

Lpo segmentation of dimension D:

For every $1 \le p \le n - 1$,

$\hat m_p(D) = \mathrm{Argmin}_{m \mid D_m = D} \hat R_p(\hat s_m)$.


Taking variance into account: Lpo expectation

Theorem (C. Ph.D. (2008))

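Homoscedastic:

$E[\hat R_p(\hat s_m)] \approx \mathrm{dist}(s, S_m) + \sigma^2 \frac{D_m}{n - p} + \sigma^2$,

Heteroscedastic:

$E[\hat R_p(\hat s_m)] \approx \mathrm{dist}(s, S_m) + \frac{1}{n - p} \sum_{\lambda} (\sigma^r_\lambda)^2 + \mathrm{const}$.

$R(\hat s_m) = \mathrm{dist}(s, S_m) + \frac{1}{n} \sum_{\lambda} (\sigma^r_\lambda)^2 + \mathrm{const}$.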
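The heteroscedastic expressions can be compared term by term. The following is a short worked step, assuming the additive constants do not depend on m, ignoring them, and treating the approximation as an equality:

```latex
\mathbb{E}\bigl[\hat R_p(\hat s_m)\bigr] - R(\hat s_m)
  \approx \Bigl(\frac{1}{n-p} - \frac{1}{n}\Bigr) \sum_{\lambda} (\sigma^r_\lambda)^2
  = \frac{p}{n(n-p)} \sum_{\lambda} (\sigma^r_\lambda)^2 ,
\qquad
R(\hat s_m) - \mathbb{E}\bigl[P_n\gamma(\hat s_m)\bigr]
  = \frac{2}{n} \sum_{\lambda} (\sigma^r_\lambda)^2 .
```

For p = 1 and n = 100, the first factor is 1/9900 while the second is 2/100, so the Lpo criterion tracks the model-dependent variance term far more closely than the empirical risk does.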


Leave-one-out (Lpo with p = 1): An alternative to ERM

Strategy:

Replace ERM by leave-one-out (Loo) to take variance into account.

Loo algorithm:

$\hat m_1(D) = \mathrm{Argmin}_{m \mid D_m = D} \hat R_1(\hat s_m)$.

Conclusion:

Loo prevents overfitting.

[Figure: two panels plotted against position (0 to 100), vertical axis from −1.5 to 2; first panel legend: Oracle, ERM; second panel legend: Oracle, Loo, ERM.]
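Both the ERM and the Loo criteria are sums of per-segment costs, so the argmin over segmentations with D pieces can be computed by dynamic programming. The slides do not say which algorithm is used; the sketch below is a standard approach under stated assumptions: the function name is hypothetical, segments are forced to contain at least two points so that the Loo cost is defined, and the Loo cost uses the hand-derived p = 1 rescaling of the residual sum of squares.

```python
import numpy as np

def segment(y, D, loo=False, min_len=2):
    """Best D-piece segmentation by dynamic programming, O(D * n^2) time.

    Segment cost: residual sum of squares (ERM) or, if `loo` is True,
    that sum rescaled by (m / (m - 1))^2 for a segment of m points.
    Returns the start indices of pieces 2..D.
    """
    y = np.asarray(y, dtype=float)
    n = len(y)
    c1 = np.concatenate(([0.0], np.cumsum(y)))
    c2 = np.concatenate(([0.0], np.cumsum(y ** 2)))

    def cost(a, b):
        m = b - a
        rss = c2[b] - c2[a] - (c1[b] - c1[a]) ** 2 / m
        return rss * (m / (m - 1)) ** 2 if loo else rss

    # best[d, b]: minimal cost of cutting y[:b] into d pieces.
    best = np.full((D + 1, n + 1), np.inf)
    arg = np.zeros((D + 1, n + 1), dtype=int)
    best[0, 0] = 0.0
    for d in range(1, D + 1):
        for b in range(d * min_len, n + 1):
            for a in range((d - 1) * min_len, b - min_len + 1):
                val = best[d - 1, a] + cost(a, b)
                if val < best[d, b]:
                    best[d, b], arg[d, b] = val, a
    # Backtrack the start index of each piece, from the last to the second.
    bps, b = [], n
    for d in range(D, 1, -1):
        b = arg[d, b]
        bps.append(int(b))
    return bps[::-1]

# Illustrative data: jumps in a quiet first half, pure noise in the second half.
rng = np.random.default_rng(0)
idx = np.arange(100)
s = np.where(idx < 25, 0.0, np.where(idx < 50, 1.0, 0.5))
sigma = np.where(idx < 50, 0.1, 1.0)
y = s + sigma * rng.standard_normal(100)
erm_bps = segment(y, D=5)
loo_bps = segment(y, D=5, loo=True)
```

Comparing `erm_bps` and `loo_bps` on such data gives a feel for where each criterion spends its breakpoints; the outcome depends on the random draw, so no particular result is claimed here.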


Quality of the segmentations w.r.t. D

[Figure, two panels: average loss value against the number of breakpoints, for ERM and Loo. First panel: Segmentation quality (Homosc.), N=300 trials. Second panel: Segmentation quality (heterosc.), N=300 trials.]


Quality of the best segmentation

s·   σ·     ERM            Loo
2    c      2.88 ± 0.01    2.93 ± 0.01
2    pc,1   1.31 ± 0.02    1.16 ± 0.02
2    pc,3   3.09 ± 0.03    2.52 ± 0.03
3    c      3.18 ± 0.01    3.25 ± 0.01
3    pc,1   3.00 ± 0.01    2.67 ± 0.02
3    pc,3   4.41 ± 0.02    3.97 ± 0.02

Table: Average of $E[\inf_D \| s - \hat s_{\hat m_A(D)} \|^2] \,/\, E[\inf_m \| s - \hat s_m \|^2]$ over 10 000 samples. A denotes either ERM or Loo.

−→ Same results when D is chosen by VFCV (V-fold cross-validation).


Summary

1 Lpo takes variance into account

   −→ outperforms ERM (heteroscedastic).

   −→ close to ERM (homoscedastic).

2 Lpo is fully tractable (closed-form expressions)

   −→ as computationally expensive as ERM.

3 Similar results when D is chosen by V-fold cross-validation.

Conclusion:

Cross-validation is a robust (to heteroscedasticity) and reliable alternative to ERM.

−→ Arlot and C. (2009), arXiv


The Bt474 Cell lines

These are epithelial cells

Obtained from human breast cancer tumors

A test genome is compared to a reference male genome

We only consider chromosomes 1 and 9


Results: Chromosome 9

Homoscedastic model (Picard et al. (05))

Heteroscedastic model (Picard et al. (05))

LOO+VFCV


Results: Chromosome 1

Homoscedastic model (Picard et al. (05))

Heteroscedastic model (Picard et al. (05))

LOO+VFCV


Prospects

1 Optimality results for segmentation procedures.

2 Other resampling schemes (Bootstrap, Rademacher penalties, ...)

3 Extension to the multivariate setting: Detect ANR project

   Biology: Multi-patient CGH profile segmentation.

   Computer vision: Video segmentation

Thank you.
