
The Boosting Approach to Machine Learning

Maria-Florina Balcan

10/01/2018

Boosting

• Works by creating a series of challenge datasets such that even modest performance on these can be used to produce an overall high-accuracy predictor.

• Backed up by solid foundations.

• Works well in practice (Adaboost and its variations are among the top 10 algorithms).

• General method for improving the accuracy of any given learning algorithm.

Readings:

• The Boosting Approach to Machine Learning: An Overview. Rob Schapire, 2001.

• Theory and Applications of Boosting. NIPS tutorial.

http://www.cs.princeton.edu/~schapire/talks/nips-tutorial.pdf

Plan for today:

• Motivation.

• A bit of history.

• Adaboost: algorithm, guarantees, discussion.

• Focus on supervised classification.

An Example: Spam Detection

E.g., classify which emails are spam and which are important.

Key observation/motivation:

• Easy to find rules of thumb that are often correct.

• E.g., "If 'buy now' appears in the message, then predict spam."

• E.g., "If 'say good-bye to debt' appears in the message, then predict spam."

• Harder to find a single rule that is very highly accurate.

• Boosting: a meta-procedure that takes in an algorithm for finding rules of thumb (weak learner) and produces a highly accurate rule by calling the weak learner repeatedly on cleverly chosen datasets.

An Example: Spam Detection

• Apply the weak learner to a subset of emails, obtain a rule of thumb.

• Apply it to a 2nd subset of emails, obtain a 2nd rule of thumb.

• Apply it to a 3rd subset of emails, obtain a 3rd rule of thumb.

• Repeat T times; combine the weak rules into a single highly accurate rule.

[Figure: emails are fed to the weak learner round by round, producing rules of thumb $h_1, h_2, h_3, \dots, h_T$]

Boosting: Important Aspects

How to choose examples on each round?

• Typically, concentrate on the "hardest" examples (those most often misclassified by previous rules of thumb).

How to combine rules of thumb into a single prediction rule?

• Take a (weighted) majority vote of the rules of thumb (see the sketch below).
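To make "weighted majority vote" concrete, here is a tiny illustrative sketch. The rules of thumb and their weights below are made up for illustration; AdaBoost's actual choice of weights is given later in the lecture.

```python
# Illustrative only: combining rules of thumb by a weighted majority vote.
# Each rule maps an email to +1 ("spam") or -1 ("not spam"); alphas[t] says how much we trust rule t.

def weighted_majority_vote(rules, alphas, email):
    total = sum(a * h(email) for a, h in zip(alphas, rules))
    return 1 if total > 0 else -1

# Two toy rules of thumb and hypothetical weights:
contains = lambda phrase: (lambda email: 1 if phrase in email else -1)
rules = [contains("buy now"), contains("good-bye to debt")]
alphas = [0.7, 0.4]

print(weighted_majority_vote(rules, alphas, "buy now and say good-bye to debt"))  # 1  (spam)
print(weighted_majority_vote(rules, alphas, "meeting notes attached"))            # -1 (not spam)
```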

Historically…

Weak Learning vs Strong/PAC Learning

• [Kearns & Valiant '88]: defined weak learning: being able to predict better than random guessing (error $\le \frac{1}{2} - \gamma$), consistently.

• Posed an open problem: "Does there exist a boosting algorithm that turns a weak learner into a strong PAC learner (one that can produce arbitrarily accurate hypotheses)?"

• Informally, given a "weak" learning algorithm that can consistently find classifiers of error $\le \frac{1}{2} - \gamma$, a boosting algorithm would provably construct a single classifier with error $\le \epsilon$.

Weak Learning vs Strong/PAC Learning

Strong (PAC) Learning:

• ∃ algorithm A
• ∀ c ∈ H
• ∀ D
• ∀ ε > 0
• ∀ δ > 0
• A produces h s.t. $\Pr[err(h) \ge \epsilon] \le \delta$

Weak Learning:

• ∃ algorithm A
• ∃ γ > 0
• ∀ c ∈ H
• ∀ D
• ∀ ε > 1/2 − γ
• ∀ δ > 0
• A produces h s.t. $\Pr[err(h) \ge \epsilon] \le \delta$

• [Kearns & Valiant '88]: defined weak learning and posed the open problem of finding a boosting algorithm.

Surprisingly…

Weak Learning = Strong (PAC) Learning

Original Construction [Schapire '89]:

• Poly-time boosting algorithm; exploits the fact that we can learn a little bit on every distribution.

• A modest booster is obtained by calling the weak learning algorithm on 3 distributions:

  error $\beta < \frac{1}{2} - \gamma$  →  error $3\beta^2 - 2\beta^3$

  (e.g., $\beta = 0.4$ improves to $3(0.4)^2 - 2(0.4)^3 = 0.352$).

• The modest boost in accuracy is then amplified by applying this construction recursively.

• Cool conceptually and technically, but not very practical.

An explosion of subsequent work

Adaboost (Adaptive Boosting)

[Freund-Schapire, JCSS'97]

Gödel Prize winner, 2003

"A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting"

Informal Description of Adaboost

• Boosting: turns a weak algorithm into a strong (PAC) learner.

Input: $S = \{(x_1, y_1), \dots, (x_m, y_m)\}$; $x_i \in X$, $y_i \in Y = \{-1, 1\}$;
weak learning algorithm A (e.g., Naïve Bayes, decision stumps).

• For t = 1, 2, …, T

• Construct $D_t$ on $\{x_1, \dots, x_m\}$

• Run A on $D_t$, producing $h_t: X \to \{-1, 1\}$ (weak classifier)

$\epsilon_t = \Pr_{x_i \sim D_t}[h_t(x_i) \neq y_i]$ = error of $h_t$ over $D_t$

Roughly speaking, $D_{t+1}$ increases the weight on $x_i$ if $h_t$ is incorrect on $x_i$, and decreases it if $h_t$ is correct on $x_i$.

• Output $H_{final}(x) = \mathrm{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)$

[Figure: a weak classifier $h_t$ separating + examples from − examples]

Adaboost (Adaptive Boosting)

• Weak learning algorithm A (e.g., Naïve Bayes, decision stumps).

• For t = 1, 2, …, T

• Construct $D_t$ on $\{x_1, \dots, x_m\}$

• Run A on $D_t$, producing $h_t$

Constructing $D_t$:

• $D_1$ uniform on $\{x_1, \dots, x_m\}$ [i.e., $D_1(i) = \frac{1}{m}$]

• Given $D_t$ and $h_t$, set

$D_{t+1}(i) = \frac{D_t(i)}{Z_t} e^{-\alpha_t}$ if $y_i = h_t(x_i)$

$D_{t+1}(i) = \frac{D_t(i)}{Z_t} e^{\alpha_t}$ if $y_i \neq h_t(x_i)$

i.e., $D_{t+1}(i) = \frac{D_t(i)}{Z_t} e^{-\alpha_t y_i h_t(x_i)}$, where $\alpha_t = \frac{1}{2}\ln\left(\frac{1-\epsilon_t}{\epsilon_t}\right) > 0$ and $Z_t$ is the normalization factor.

$D_{t+1}$ puts half of its weight on examples $x_i$ where $h_t$ is incorrect and half on examples where $h_t$ is correct.

Final hypothesis: $H_{final}(x) = \mathrm{sign}\left(\sum_t \alpha_t h_t(x)\right)$

Adaboost: A toy example

Weak classifiers: vertical or horizontal half-planes (a.k.a. decision stumps)

[Figures: successive rounds of AdaBoost on the toy example, not reproduced in this transcript.]
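To make the mechanics concrete, here is a minimal runnable sketch of the algorithm above in Python, using axis-aligned decision stumps (as in the toy example) as the weak learner A. The brute-force stump search, the function names, and the clipping of $\epsilon_t$ are illustrative choices, not part of the lecture.

```python
import numpy as np

def stump_weak_learner(X, y, D):
    """Illustrative weak learner: exhaustively pick the single-feature
    threshold rule with the smallest weighted error under D."""
    m, n = X.shape
    best = None
    for j in range(n):
        for thresh in np.unique(X[:, j]):
            for sign in (+1, -1):
                pred = np.where(X[:, j] <= thresh, sign, -sign)
                err = np.sum(D * (pred != y))
                if best is None or err < best[0]:
                    best = (err, j, thresh, sign)
    _, j, thresh, sign = best
    return lambda Xq: np.where(Xq[:, j] <= thresh, sign, -sign)

def adaboost(X, y, T):
    """AdaBoost as on the slides; labels y must be in {-1, +1}."""
    m = X.shape[0]
    D = np.full(m, 1.0 / m)                      # D_1: uniform on the m examples
    hs, alphas = [], []
    for t in range(T):
        h = stump_weak_learner(X, y, D)          # run A on D_t, obtaining h_t
        pred = h(X)
        eps = np.sum(D * (pred != y))            # eps_t: error of h_t over D_t
        eps = np.clip(eps, 1e-12, 1 - 1e-12)     # guard against eps_t = 0 in this toy sketch
        alpha = 0.5 * np.log((1 - eps) / eps)    # alpha_t = (1/2) ln((1 - eps_t)/eps_t)
        D = D * np.exp(-alpha * y * pred)        # D_{t+1}(i) ~ D_t(i) * exp(-alpha_t y_i h_t(x_i))
        D = D / D.sum()                          # divide by the normalizer Z_t
        hs.append(h)
        alphas.append(alpha)
    # H_final(x) = sign( sum_t alpha_t h_t(x) )
    return lambda Xq: np.sign(sum(a * h(Xq) for a, h in zip(alphas, hs)))

# Example usage (assuming X is an (m, n) float array and y an array of +/-1 labels):
# H = adaboost(X, y, T=10); predictions = H(X)
```

If you track the weights inside the loop, you can verify numerically that after each update the total weight on the examples $h_t$ got wrong and on the examples it got right is 1/2 each, matching the claim above (and analyzed on the "Understanding the Updates & Normalization" slide below).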


Nice Features of Adaboost

• Very general: a meta-procedure that can use any weak learning algorithm!!!

• Very fast (a single pass through the data each round) & simple to code, no parameters to tune.

• Grounded in rich theory.

• Shift in mindset: the goal is now just to find classifiers a bit better than random guessing.

• Relevant for the big data age: quickly focuses on "core difficulties"; well-suited to distributed settings, where data must be communicated efficiently [Balcan-Blum-Fine-Mansour COLT'12].


Analyzing Training Error

Theorem: Let $\epsilon_t = 1/2 - \gamma_t$ ($\epsilon_t$ = error of $h_t$ over $D_t$). Then

$err_S(H_{final}) \le \exp\left(-2 \sum_t \gamma_t^2\right)$

So, if $\gamma_t \ge \gamma > 0$ for all $t$, then $err_S(H_{final}) \le \exp(-2\gamma^2 T)$.

The training error drops exponentially in T!!!

To get $err_S(H_{final}) \le \epsilon$, we need only $T = O\left(\frac{1}{\gamma^2}\log\frac{1}{\epsilon}\right)$ rounds (since $\exp(-2\gamma^2 T) \le \epsilon$ exactly when $T \ge \frac{1}{2\gamma^2}\ln\frac{1}{\epsilon}$).

Adaboost is adaptive:

• Does not need to know $\gamma$ or T a priori.

• Can exploit $\gamma_t \gg \gamma$.

Understanding the Updates & Normalization

Claim: $D_{t+1}$ puts half of the weight on the examples $x_i$ where $h_t$ was incorrect and half of the weight on the examples where $h_t$ was correct.

Recall $D_{t+1}(i) = \frac{D_t(i)}{Z_t} e^{-\alpha_t y_i h_t(x_i)}$, so $e^{-\alpha_t} = \sqrt{\frac{\epsilon_t}{1-\epsilon_t}}$ and $e^{\alpha_t} = \sqrt{\frac{1-\epsilon_t}{\epsilon_t}}$.

Weight on the correctly classified examples:

$\Pr_{D_{t+1}}[y_i = h_t(x_i)] = \sum_{i: y_i = h_t(x_i)} \frac{D_t(i)}{Z_t} e^{-\alpha_t} = \frac{1-\epsilon_t}{Z_t} e^{-\alpha_t} = \frac{1-\epsilon_t}{Z_t}\sqrt{\frac{\epsilon_t}{1-\epsilon_t}} = \frac{\sqrt{(1-\epsilon_t)\epsilon_t}}{Z_t}$

Weight on the incorrectly classified examples:

$\Pr_{D_{t+1}}[y_i \neq h_t(x_i)] = \sum_{i: y_i \neq h_t(x_i)} \frac{D_t(i)}{Z_t} e^{\alpha_t} = \frac{\epsilon_t}{Z_t} e^{\alpha_t} = \frac{\epsilon_t}{Z_t}\sqrt{\frac{1-\epsilon_t}{\epsilon_t}} = \frac{\sqrt{\epsilon_t(1-\epsilon_t)}}{Z_t}$

The two probabilities are equal!

Normalization:

$Z_t = \sum_i D_t(i) e^{-\alpha_t y_i h_t(x_i)} = \sum_{i: y_i = h_t(x_i)} D_t(i) e^{-\alpha_t} + \sum_{i: y_i \neq h_t(x_i)} D_t(i) e^{\alpha_t} = (1-\epsilon_t)e^{-\alpha_t} + \epsilon_t e^{\alpha_t} = 2\sqrt{\epsilon_t(1-\epsilon_t)}$

Analyzing Training Error: Proof Intuition

Theorem: $err_S(H_{final}) \le \exp\left(-2\sum_t \gamma_t^2\right)$, where $\epsilon_t = 1/2 - \gamma_t$ (error of $h_t$ over $D_t$).

• On round $t$, we increase the weight of the $x_i$ for which $h_t$ is wrong.

• If $H_{final}$ incorrectly classifies $x_i$,

- then $x_i$ was incorrectly classified by the (weighted) majority of the $h_t$'s,

- which implies the final probability weight of $x_i$ is large: can show it is $\ge \frac{1}{m} \cdot \frac{1}{\prod_t Z_t}$.

• Since the probabilities sum to 1, we can't have too many examples of high weight: can show # incorrectly classified $\le m \prod_t Z_t$.

• And $\prod_t Z_t \to 0$.

Analyzing Training Error: Proof Math

Step 1: unwrapping the recurrence: $D_{T+1}(i) = \frac{1}{m} \cdot \frac{\exp(-y_i f(x_i))}{\prod_t Z_t}$,

where $f(x_i) = \sum_t \alpha_t h_t(x_i)$ [the unthresholded weighted vote of the $h_t$'s on $x_i$].

Step 2: $err_S(H_{final}) \le \prod_t Z_t$.

Step 3: $\prod_t Z_t = \prod_t 2\sqrt{\epsilon_t(1-\epsilon_t)} = \prod_t \sqrt{1 - 4\gamma_t^2} \le e^{-2\sum_t \gamma_t^2}$.
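The slides that follow expand Steps 1 and 2 in detail. For Step 3, the algebra behind the two identities is only sketched, so here it is written out; this is a standard calculation (using $1 + x \le e^x$), not reproduced from the original slides.

```latex
% Z_t(alpha) = (1 - eps_t) e^{-alpha} + eps_t e^{alpha} is minimized where its derivative vanishes:
-(1-\epsilon_t)e^{-\alpha} + \epsilon_t e^{\alpha} = 0
\;\Longrightarrow\; \alpha_t = \tfrac{1}{2}\ln\tfrac{1-\epsilon_t}{\epsilon_t},
\qquad Z_t = (1-\epsilon_t)e^{-\alpha_t} + \epsilon_t e^{\alpha_t} = 2\sqrt{\epsilon_t(1-\epsilon_t)}.

% With eps_t = 1/2 - gamma_t:
2\sqrt{\epsilon_t(1-\epsilon_t)} = \sqrt{4\epsilon_t(1-\epsilon_t)} = \sqrt{1-4\gamma_t^2}
\le \sqrt{e^{-4\gamma_t^2}} = e^{-2\gamma_t^2},
\qquad\text{so}\qquad \prod_t Z_t \le e^{-2\sum_t \gamma_t^2}.
```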

Analyzing Training Error: Proof Math

Step 1: unwrapping the recurrence: $D_{T+1}(i) = \frac{1}{m} \cdot \frac{\exp(-y_i f(x_i))}{\prod_t Z_t}$, where $f(x_i) = \sum_t \alpha_t h_t(x_i)$.

Recall $D_1(i) = \frac{1}{m}$ and $D_{t+1}(i) = D_t(i) \cdot \frac{\exp(-y_i \alpha_t h_t(x_i))}{Z_t}$.

$D_{T+1}(i) = \frac{\exp(-y_i \alpha_T h_T(x_i))}{Z_T} \cdot D_T(i)$

$= \frac{\exp(-y_i \alpha_T h_T(x_i))}{Z_T} \cdot \frac{\exp(-y_i \alpha_{T-1} h_{T-1}(x_i))}{Z_{T-1}} \cdot D_{T-1}(i)$

$= \cdots$

$= \frac{\exp(-y_i \alpha_T h_T(x_i))}{Z_T} \times \cdots \times \frac{\exp(-y_i \alpha_1 h_1(x_i))}{Z_1} \cdot \frac{1}{m}$

$= \frac{1}{m} \cdot \frac{\exp(-y_i(\alpha_1 h_1(x_i) + \cdots + \alpha_T h_T(x_i)))}{Z_1 \cdots Z_T} = \frac{1}{m} \cdot \frac{\exp(-y_i f(x_i))}{\prod_t Z_t}$

Analyzing Training Error: Proof Math

Step 1 (recall): $D_{T+1}(i) = \frac{1}{m} \cdot \frac{\exp(-y_i f(x_i))}{\prod_t Z_t}$, where $f(x_i) = \sum_t \alpha_t h_t(x_i)$.

Step 2: $err_S(H_{final}) \le \prod_t Z_t$.

$err_S(H_{final}) = \frac{1}{m} \sum_i \mathbf{1}_{y_i \neq H_{final}(x_i)} = \frac{1}{m} \sum_i \mathbf{1}_{y_i f(x_i) \le 0}$

$\le \frac{1}{m} \sum_i \exp(-y_i f(x_i))$   [the 0/1 loss is upper-bounded by the exp loss]

$= \sum_i D_{T+1}(i) \prod_t Z_t = \prod_t Z_t$.

Analyzing Training Error: Proof Math

Step 1 (recall): $D_{T+1}(i) = \frac{1}{m} \cdot \frac{\exp(-y_i f(x_i))}{\prod_t Z_t}$, where $f(x_i) = \sum_t \alpha_t h_t(x_i)$.

Step 2 (recall): $err_S(H_{final}) \le \prod_t Z_t$.

Step 3: $\prod_t Z_t = \prod_t 2\sqrt{\epsilon_t(1-\epsilon_t)} = \prod_t \sqrt{1 - 4\gamma_t^2} \le e^{-2\sum_t \gamma_t^2}$.

Note: recall $Z_t = (1-\epsilon_t)e^{-\alpha_t} + \epsilon_t e^{\alpha_t} = 2\sqrt{\epsilon_t(1-\epsilon_t)}$; indeed, $\alpha_t$ is the minimizer of $\alpha \mapsto (1-\epsilon_t)e^{-\alpha} + \epsilon_t e^{\alpha}$.


Generalization Guarantees

How about generalization guarantees? Original analysis: [Freund & Schapire '97].

Recall: $err_S(H_{final}) \le \exp\left(-2\sum_t \gamma_t^2\right)$, where $\epsilon_t = 1/2 - \gamma_t$.

$H_{final}$ is a weighted vote, so the hypothesis class is:

$G = \{\text{all functions of the form } \mathrm{sign}\left(\sum_{t=1}^T \alpha_t h_t(x)\right)\}$

• H = space of weak hypotheses; d = VCdim(H); T = # of rounds.

Theorem [Freund & Schapire '97]: $\forall g \in G$, $err(g) \le err_S(g) + \tilde{O}\left(\sqrt{\frac{Td}{m}}\right)$.

Key reason: $\mathrm{VCdim}(G) = \tilde{O}(dT)$, plus typical VC bounds.

Generalization Guarantees

Theorem [Freund & Schapire '97]: $\forall g \in co(H)$, $err(g) \le err_S(g) + \tilde{O}\left(\sqrt{\frac{Td}{m}}\right)$, where d = VCdim(H) and T = # of rounds.

generalization error ≤ train error + complexity

[Figure: plot of the bound as a function of T; the complexity term grows with the number of rounds T.]

Generalization Guarantees

• Experiments with boosting showed that the test error of the generated classifier usually does not increase as its size becomes very large.

• Experiments showed that continuing to add new weak learners after correct classification of the training set had been achieved could further improve test set performance!!!

• These results seem to contradict the [Freund & Schapire '97] bound and Occam's razor (in order to achieve good test error, the classifier should be as simple as possible)!

How can we explain the experiments?

Key Idea:

R. Schapire, Y. Freund, P. Bartlett, and W. S. Lee give a nice theoretical explanation in "Boosting the margin: A new explanation for the effectiveness of voting methods."

Training error does not tell the whole story: we also need to consider the classification confidence!!
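As a pointer to that explanation, the notion of confidence used there is the normalized margin of the weighted vote; this definition follows the cited paper and is not spelled out on this slide.

```latex
% Normalized margin of the weighted vote on example (x_i, y_i):
\mathrm{margin}(x_i, y_i) \;=\; \frac{y_i \sum_t \alpha_t h_t(x_i)}{\sum_t |\alpha_t|} \;\in\; [-1, 1]
% It is positive iff H_final classifies x_i correctly, and values near 1
% indicate a confident (large-margin) correct vote.
```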