The Rate of Convergence of AdaBoost


Transcript of The Rate of Convergence of AdaBoost

Page 1: The Rate of Convergence of  AdaBoost

The Rate of Convergence of AdaBoost

Indraneel Mukherjee, Cynthia Rudin, Rob Schapire

Page 2: The Rate of Convergence of  AdaBoost

AdaBoost (Freund and Schapire 97)



Page 5: The Rate of Convergence of  AdaBoost

Basic properties of AdaBoost’s convergence are still not fully understood.

We address one of these basic properties: convergence rates with no assumptions.


Page 8: The Rate of Convergence of  AdaBoost

• AdaBoost is known for its ability to combine “weak classifiers” into a “strong” classifier

• AdaBoost iteratively minimizes “exponential loss” (Breiman, 1999; Frean and Downs, 1998; Friedman et al., 2000; Friedman, 2001; Mason et al., 2000; Onoda et al., 1998; Rätsch et al., 2001; Schapire and Singer, 1999)

Examples: {(x_i, y_i)}_{i=1,...,m}, with each (x_i, y_i) ∈ X × {−1, +1}

Hypotheses: H = {h_1, ..., h_N}, where h_j : X → [−1, 1]

Combination: F(x) = λ_1 h_1(x) + … + λ_N h_N(x)

Page 9: The Rate of Convergence of  AdaBoost


misclassification error ≤ exponential loss:

(1/m) Σ_{i=1,...,m} 1[y_i F(x_i) ≤ 0]  ≤  (1/m) Σ_{i=1,...,m} exp(−y_i F(x_i))

Combination: F(x) = λ_1 h_1(x) + … + λ_N h_N(x)
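A quick numerical sanity check of this bound (a made-up toy example, not from the talk): since 1[z ≤ 0] ≤ exp(−z) for every real z, the empirical misclassification error can never exceed the exponential loss, whatever λ is.

```python
import numpy as np

# Toy check of the bound above (data and coefficients are made up):
#   (1/m) Σ_i 1[y_i F(x_i) <= 0]  <=  (1/m) Σ_i exp(-y_i F(x_i)),
# which holds because 1[z <= 0] <= exp(-z) pointwise.
rng = np.random.default_rng(0)
m, N = 20, 5
H = rng.choice([-1.0, 1.0], size=(m, N))   # hypothesis outputs h_j(x_i), here ±1-valued
y = rng.choice([-1.0, 1.0], size=m)        # labels y_i
lam = rng.normal(size=N)                   # an arbitrary coefficient vector λ

F = H @ lam                                # F(x_i) = Σ_j λ_j h_j(x_i)
zero_one = np.mean(y * F <= 0)             # misclassification error
exp_loss = np.mean(np.exp(-y * F))         # exponential loss L(λ)
print(zero_one, exp_loss)                  # the first number is never larger than the second
```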

Page 10: The Rate of Convergence of  AdaBoost


Exponential loss:

L(λ) = (1/m) Σ_{i=1,...,m} exp( −Σ_{j=1,...,N} λ_j y_i h_j(x_i) )

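This is the objective AdaBoost minimizes greedily, one coordinate per round. Below is a minimal sketch of that coordinate-descent view (illustrative only; the function name is mine, and the hypotheses are assumed ±1-valued so the line search has the familiar closed form):

```python
import numpy as np

def adaboost(H, y, T):
    """Greedy coordinate descent on L(λ) = (1/m) Σ_i exp(-y_i Σ_j λ_j h_j(x_i)).

    H : (m, N) array of hypothesis outputs h_j(x_i), assumed ±1-valued with |edge| < 1.
    y : (m,) array of ±1 labels.
    Returns the coefficient vector λ after T rounds.
    """
    m, N = H.shape
    lam = np.zeros(N)
    w = np.ones(m) / m                        # weights proportional to exp(-y_i F(x_i))
    for _ in range(T):
        edges = (w * y) @ H / w.sum()         # edge of each hypothesis: Σ_i w_i y_i h_j(x_i) / Σ_i w_i
        j = int(np.argmax(np.abs(edges)))     # pick the coordinate with the largest |edge|
        delta = edges[j]
        alpha = 0.5 * np.log((1 + delta) / (1 - delta))   # exact line search for ±1 hypotheses
        lam[j] += alpha
        w = w * np.exp(-alpha * y * H[:, j])  # reweight the examples
    return lam
```

With this step size, each round multiplies the loss by √(1 − δ_t²), which is exactly the “old fact” used later in the talk.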

Page 11: The Rate of Convergence of  AdaBoost

[Figure: surface plot of the exponential loss L(λ) over two coordinates λ_1 and λ_2.]


Page 13: The Rate of Convergence of  AdaBoost

Known:
• AdaBoost converges asymptotically to the minimum of the exponential loss (Collins et al., 2002; Zhang and Yu, 2005).
• Convergence rates are known under strong assumptions:
  – the “weak learning” assumption holds, i.e., the hypotheses are better than random guessing (Freund and Schapire, 1997; Schapire and Singer, 1999);
  – a finite minimizer is assumed to exist (Rätsch et al., 2002, and many classic results).
• Schapire (2010) conjectured that fast convergence rates hold without any assumptions.
• The convergence rate is relevant for the consistency of AdaBoost (Bartlett and Traskin, 2007).


Page 17: The Rate of Convergence of  AdaBoost

Outline

• Convergence Rate 1: convergence to a target loss. “Can we get within ε of a ‘reference’ solution?”

• Convergence Rate 2: convergence to the optimal loss. “Can we get within ε of an optimal solution?”

Page 18: The Rate of Convergence of  AdaBoost

Main Messages
• Usual approaches assume a finite minimizer.
  – It is much more challenging not to assume this!
• We separate two different modes of analysis:
  – comparison to a reference solution vs. comparison to the optimal loss;
  – different rates of convergence are possible in each.
• Analyses of convergence rates often ignore the “constants”:
  – we show they can be extremely large in the worst case.

Page 19: The Rate of Convergence of  AdaBoost

• Convergence Rate 1: convergence to a target loss. “Can we get within ε of a ‘reference’ solution?”

• Convergence Rate 2: convergence to the optimal loss. “Can we get within ε of an optimal solution?”

Convergence Rate 1 is based on a conjecture that says...

Page 20: The Rate of Convergence of  AdaBoost

"At iteration t, L(λ t) wiλλ be at m ost Ú m ore than that of any param eter vector of λ1-norm bounded by B

in a num ber of rounds that is at m ost a poλynom iaλinλogN,m , B, and 1/Ú."


Page 25: The Rate of Convergence of  AdaBoost

[Figure: a ball of radius B containing the reference vector λ*; the AdaBoost iterate λ_t reaches loss L(λ_t) within ε of L(λ*).]

This happens at:

t ≤ poly(log N, m, B, 1/ε)


Page 28: The Rate of Convergence of  AdaBoost

Theorem 1: For any λ* ∈ ℝ^N, AdaBoost achieves loss at most L(λ*) + ε in at most 13 ‖λ*‖₁^6 ε^(−5) rounds.


Page 30: The Rate of Convergence of  AdaBoost

The bound 13 ‖λ*‖₁^6 ε^(−5) is poly(log N, m, B, 1/ε), where B = ‖λ*‖₁.

The best previously known result is that it takes at most order e^(1/ε²) rounds (Bickel et al.).
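To get a feel for the gap (my arithmetic, not from the slides), take ε = 0.01 and B = ‖λ*‖₁ = 10:

```latex
13\,B^{6}\varepsilon^{-5} \;=\; 13\cdot 10^{6}\cdot 10^{10} \;=\; 1.3\times 10^{17}
\qquad\text{versus}\qquad
e^{1/\varepsilon^{2}} \;=\; e^{10\,000} \;\approx\; 10^{4343}.
```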

Page 31: The Rate of Convergence of  AdaBoost

Intuition behind proof of Theorem 1

• Old fact: if AdaBoost takes a large step, it makes a lot of progress:

L(λ_t) ≤ L(λ_{t−1}) · √(1 − δ_t²)

δ_t is called the “edge.” It is related to the step size.

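A standard consequence of this fact, spelled out here for context (this compounding step is not on the slide): since √(1 − x) ≤ e^(−x/2) for x ∈ [0, 1] and L(λ_0) = L(0) = 1,

```latex
L(\lambda_T)\;\le\; L(\lambda_0)\prod_{t=1}^{T}\sqrt{1-\delta_t^{2}}
\;\le\;\exp\!\Big(-\tfrac{1}{2}\textstyle\sum_{t=1}^{T}\delta_t^{2}\Big).
```

Large edges therefore translate directly into geometric decrease of the loss; the difficulty, without the weak-learning assumption, is lower-bounding the edges.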

Page 32: The Rate of Convergence of  AdaBoost

[Figure: within the ball of radius B, the current iterate λ_t and the reference λ*, with losses L(λ_t) and L(λ*).]

R_t := ln L(λ_t) − ln L(λ*)        (measures progress)

S_t := inf_λ { ‖λ − λ_t‖₁ : L(λ) ≤ L(λ*) }        (measures distance)


Page 34: The Rate of Convergence of  AdaBoost

Intuition behind proof of Theorem 1
• Old fact: L(λ_t) ≤ L(λ_{t−1}) √(1 − δ_t²). If the δ_t's are large, we make progress.
• First lemma: if S_t is small, then δ_t is large.
• Second lemma: S_t remains small (unless R_t is already small).
• Combining: the δ_t's are large in each round t (unless R_t is already small); in fact δ_t ≥ R_{t−1}³ / B³ in each round t.
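A rough sketch of how these pieces fit together (my simplification; the careful argument and constants are in the paper): taking logs in the old fact, using ln(1 − x) ≤ −x, and plugging in δ_t ≥ R_{t−1}³/B³ gives

```latex
R_t \;\le\; R_{t-1}+\tfrac{1}{2}\ln\!\big(1-\delta_t^{2}\big)
\;\le\; R_{t-1}-\tfrac{1}{2}\delta_t^{2}
\;\le\; R_{t-1}-\frac{R_{t-1}^{6}}{2B^{6}}.
```

A recurrence of this form forces R_T ≲ (B^6/T)^(1/5), so R_T drops below ε after roughly B^6/ε^5 rounds; and R_T ≤ ε gives L(λ_T) ≤ e^ε L(λ*) ≤ L(λ*) + 2ε, since one may assume L(λ*) ≤ L(0) = 1. This matches the B^6 ε^(−5) shape of Theorem 1, up to constants.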

Page 35: The Rate of Convergence of  AdaBoost

Theorem 1: For any λ* ∈ ℝ^N, AdaBoost achieves loss at most L(λ*) + ε in at most 13 ‖λ*‖₁^6 ε^(−5) rounds.

• The dependence on ‖λ*‖₁ is necessary for many datasets.

Lemma: There are simple datasets for which the number of rounds required to achieve loss L* is at least (roughly) the norm of the smallest solution achieving loss L*

Page 36: The Rate of Convergence of  AdaBoost


Lemma: There are simple datasets for which the number of rounds required to achieve loss L* is at least inf{ ‖λ‖₁ : L(λ) ≤ L* } / (2 ln m).

Page 37: The Rate of Convergence of  AdaBoost


Lemma: There are simple datasets for which the norm of the smallest solution achieving loss L* is exponential in the number of examples.


Page 38: The Rate of Convergence of  AdaBoost


Lemma: There are simple datasets for which inf{ ‖λ‖₁ : L(λ) ≤ 2/m + ε } ≥ (2^(m−2) − 1) · ln(1/(3ε)).


Page 40: The Rate of Convergence of  AdaBoost


Conjecture: AdaBoost achieves loss at most L(λ*) + ε in at most O(B²/ε) rounds.

Page 41: The Rate of Convergence of  AdaBoost

[Figure: “Rate on a Simple Dataset” (log–log scale). x-axis: number of rounds, from 10 to 1e+05; y-axis: loss − (optimal loss), from 3e−06 to 3e−02.]

Page 42: The Rate of Convergence of  AdaBoost

Outline

• Convergence Rate 1: convergence to a target loss. “Can we get within ε of a ‘reference’ solution?”

• Convergence Rate 2: convergence to the optimal loss. “Can we get within ε of an optimal solution?”

Page 43: The Rate of Convergence of  AdaBoost

Theorem 2: AdaBoost reaches within ε of the optimal loss in at most C/ε rounds, where C depends only on the data.

Page 44: The Rate of Convergence of  AdaBoost

• Better dependence on ε than Theorem 1; in fact, optimal.

• Does not depend on the size of the best solution within a ball.

• Cannot be used to prove the conjecture, because in some cases C > 2^m. (Mostly it is much smaller.)

Page 45: The Rate of Convergence of  AdaBoost

• The main tool is the “decomposition lemma”:
  – it says the examples fall into two categories: the zero-loss set Z and the finite-margin set F;
  – a similar approach was taken independently by Telgarsky (2011).


Page 48: The Rate of Convergence of  AdaBoost

[Figure: a dataset of positive and negative examples, partitioned into the finite-margin set F and the zero-loss set Z.]

Page 49: The Rate of Convergence of  AdaBoost

Decomposition Lemma

For any dataset, there exists a partition of the training examples into Z and F such that the following hold simultaneously:

1) For some γ > 0, there exists a vector η⁺ with ‖η⁺‖₁ = 1 such that:
   ∀ i ∈ Z:  Σ_j η⁺_j y_i h_j(x_i) ≥ γ   (margins are at least γ on Z)
   ∀ i ∈ F:  Σ_j η⁺_j y_i h_j(x_i) = 0   (examples in F have zero margins)

2) The optimal loss considering only the examples in F is achieved by some finite η*.
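A hypothetical toy dataset (my construction, not from the paper) makes the two sets concrete: one example that some combination classifies with positive margin (the zero-loss set Z), and a pair of contradictory examples whose loss can never reach zero (the finite-margin set F).

```python
import numpy as np

# Three examples, two hypotheses (made-up values):
#   example 0: classified with margin by h_1 alone            -> zero-loss set Z
#   examples 1, 2: identical hypothesis outputs but opposite
#   labels, so their loss can never reach zero                 -> finite-margin set F
y = np.array([+1.0, +1.0, -1.0])
H = np.array([[+1.0, 0.0],      # (h_1(x_0), h_2(x_0))
              [0.0, +1.0],      # (h_1(x_1), h_2(x_1))
              [0.0, +1.0]])     # (h_1(x_2), h_2(x_2))

def L(lam):
    """Exponential loss L(λ) = (1/m) Σ_i exp(-y_i Σ_j λ_j h_j(x_i))."""
    return np.mean(np.exp(-y * (H @ lam)))

# Condition 1: η⁺ = (1, 0) has margin 1 on Z = {0} and margin 0 on F = {1, 2}.
# Moving along η⁺ drives the loss on Z to zero and leaves F untouched:
for c in [0.0, 5.0, 50.0]:
    print(f"λ = ({c}, 0):  L = {L(np.array([c, 0.0])):.4f}")   # 1.0000, 0.6689, 0.6667

# Condition 2: restricted to F, the loss is (exp(-λ_2) + exp(+λ_2)) / 2, which is
# minimized at the finite value λ_2 = 0, so no coefficient needs to diverge on F.
```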

Page 50: The Rate of Convergence of  AdaBoost

[Figure: the direction η⁺ attains margin at least γ on the examples in Z.]



Page 54: The Rate of Convergence of  AdaBoost

[Figure: the examples of the finite-margin set F alone; the optimal loss restricted to F is achieved by a finite η*.]


Page 56: The Rate of Convergence of  AdaBoost

• We provide a conjecture about dependence on m.

Lemma: There are simple datasets for which the constant C is doubly exponential, at least 2^Ω(2^m / m).

Conjecture: If the hypotheses are {−1, 0, +1}-valued, AdaBoost converges to within ε of the optimal loss within 2^O(m ln m) ε^(−1+o(1)) rounds.

• This would give optimal dependence on m and ε simultaneously.


Page 58: The Rate of Convergence of  AdaBoost

To summarize
• Two rate bounds. One depends on the size of the best solution within a ball and has dependence ε^(−5).
• The other, C/ε, depends only on ε, but C can be doubly exponential in m.
• Many lower bounds and conjectures in the paper.

Thank you

Page 59: The Rate of Convergence of  AdaBoost

Intuition behind proof of Theorem 2
• Old fact: L(λ_t) ≤ L(λ_{t−1}) √(1 − δ_t²). If the δ_t's are large, we make progress.
• First lemma: the δ_t's are large whenever the loss on Z is large.
• Second lemma: the δ_t's are large whenever the loss on F is large; this translates into the δ_t's being large whenever the loss on Z is small.

