
Page 1:

Machine Learning

CSE6740/CS7641/ISYE6740, Fall 2012

Boosting from Weak Learners

Le Song

Lecture 9, February 13, 2008

Slides based on Eric Xing, CMU

Reading: Chap. 14.3, C. Bishop's book

Page 2:

Rationale: Combination of methods

There is no algorithm that is always the most accurate

We can select simple “weak” classification or regression

methods and combine them into a single “strong” method

Different learners use different

Algorithms

Hyperparameters

Representations (Modalities)

Training sets

Subproblems

The problem: how to combine them

Page 3:

Some early algorithms

Boosting by filtering (Schapire 1990)

Run weak learner on differently filtered example sets

Combine weak hypotheses

Requires knowledge of the performance of the weak learner

Boosting by majority (Freund 1995)

Run weak learner on weighted example set

Combine weak hypotheses linearly

Requires knowledge of the performance of the weak learner

Bagging (Breiman 1996)

Run weak learner on bootstrap replicates of the training set

Average weak hypotheses

Reduces variance

Page 4:

Combination of classifiers

Suppose we have a family of component classifiers (generating ±1 labels), such as decision stumps:

$h(x;\theta) = \mathrm{sign}(w x_k + b)$, where $\theta = \{k, w, b\}$

Each decision stump pays attention to only a single component of the input vector.
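To make the stump concrete, here is a minimal sketch in Python (the function name `decision_stump` and the toy input are illustrative, not taken from the slides):

```python
import numpy as np

def decision_stump(x, k, w, b):
    """h(x; theta) = sign(w * x_k + b), with theta = {k, w, b}.

    The stump looks only at component k of the input vector;
    w and b set the orientation and the threshold. Returns +1 or -1.
    """
    return 1 if w * x[k] + b >= 0 else -1

# Example: threshold the second component at 0.5
x = np.array([0.2, 0.9, -1.3])
print(decision_stump(x, k=1, w=1.0, b=-0.5))   # +1, since 0.9 - 0.5 >= 0
```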

Page 5:

Combination of classifiers cont'd

We'd like to combine the simple classifiers additively so that the final classifier is the sign of

$\hat{h}(x) = \alpha_1 h(x;\theta_1) + \cdots + \alpha_m h(x;\theta_m)$

where the "votes" $\{\alpha_i\}$ emphasize component classifiers that make more reliable predictions than others.

Important issues:

what is the criterion that we are optimizing? (measure of loss)

we would like to estimate each new component classifier in the same manner (modularity)

Page 6:

Measurement of error

Loss function: $\ell(y, h(x))$, e.g. $I(y \neq h(x))$

Generalization error: $L(h) = E\left[\ell(y, h(x))\right]$

Objective: find h with minimum generalization error

Main boosting idea: minimize the empirical error:

$\hat{L}(h) = \frac{1}{N}\sum_{i=1}^{N} \ell(y_i, h(x_i))$
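As a quick illustration of the empirical 0/1 error above, a short sketch (the helper name `empirical_error` and the toy data are made up for this example):

```python
import numpy as np

def empirical_error(h, X, y):
    """Empirical 0/1 error: (1/N) * sum_i I(y_i != h(x_i))."""
    preds = np.array([h(x) for x in X])
    return np.mean(preds != y)

# Toy check with a classifier that always predicts +1
X = np.array([[0.1], [0.7], [1.5]])
y = np.array([1, -1, 1])
print(empirical_error(lambda x: 1, X, y))   # 0.333..., one of three examples is wrong
```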

Page 7:

Exponential Loss

Empirical loss: $\hat{L}(h) = \frac{1}{N}\sum_{i=1}^{N} \ell(y_i, h(x_i))$

Another possible measure of empirical loss is the exponential loss:

$\hat{L}(h) = \sum_{i=1}^{n} \exp\!\left(-y_i \hat{h}_m(x_i)\right)$

Page 8:

Exponential Loss

One possible measure of empirical loss is the exponential loss.

The combined classifier based on m − 1 iterations defines a weighted loss criterion for the next simple classifier to add

each training sample is weighted by its "classifiability" (or difficulty) seen by the classifier we have built so far

$\hat{L}(h) = \sum_{i=1}^{n} \exp\!\left(-y_i \hat{h}_m(x_i)\right) = \sum_{i=1}^{n} \exp\!\left(-y_i \hat{h}_{m-1}(x_i) - y_i \alpha_m h(x_i;\theta_m)\right) = \sum_{i=1}^{n} W_i^{m-1} \exp\!\left(-y_i \alpha_m h(x_i;\theta_m)\right)$

where $W_i^{m-1} = \exp\!\left(-y_i \hat{h}_{m-1}(x_i)\right)$.

Recall that: $\hat{h}_m(x) = \alpha_1 h(x;\theta_1) + \cdots + \alpha_m h(x;\theta_m)$
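The factorization above can be checked numerically; the sketch below uses made-up scores and predictions purely to illustrate that pulling out $W_i^{m-1}$ leaves the loss unchanged:

```python
import numpy as np

# Toy values: labels, previous combined scores \hat{h}_{m-1}(x_i),
# and the new component's +/-1 predictions h(x_i; theta_m)
y       = np.array([ 1, -1,  1,  1])
h_prev  = np.array([ 0.8, -0.3, -0.2,  1.1])
h_new   = np.array([ 1,  1, -1,  1])
alpha_m = 0.4

# Exponential loss of the updated combination, computed directly
loss_direct = np.sum(np.exp(-y * (h_prev + alpha_m * h_new)))

# The same loss written with weights W_i^{m-1} = exp(-y_i * \hat{h}_{m-1}(x_i))
W = np.exp(-y * h_prev)
loss_weighted = np.sum(W * np.exp(-y * alpha_m * h_new))

print(np.allclose(loss_direct, loss_weighted))   # True
```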

Page 9:

Linearization of loss function

We can simplify the estimation criterion for the new component classifier a bit (assuming $\alpha$ is small):

$\exp\!\left(-y_i \alpha_m h(x_i;\theta_m)\right) \approx 1 - y_i \alpha_m h(x_i;\theta_m)$

Now our empirical loss criterion reduces to

$\sum_{i=1}^{n} \exp\!\left(-y_i \hat{h}_m(x_i)\right) \approx \sum_{i=1}^{n} W_i^{m-1}\left(1 - y_i \alpha_m h(x_i;\theta_m)\right) = \sum_{i=1}^{n} W_i^{m-1} - \alpha_m \sum_{i=1}^{n} W_i^{m-1} y_i h(x_i;\theta_m)$

where $W_i^{m-1} = \exp\!\left(-y_i \hat{h}_{m-1}(x_i)\right)$.

We could choose a new component classifier to optimize this weighted agreement.
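A tiny numeric check of the linearization (purely illustrative): for small $\alpha$ the first-order approximation is close, and it degrades as $\alpha$ grows.

```python
import numpy as np

# exp(-y * alpha * h) vs. its first-order approximation 1 - y * alpha * h
y, h = 1, -1                      # a misclassified example, so y * h = -1
for alpha in (0.01, 0.1, 0.5):
    exact  = np.exp(-y * alpha * h)
    approx = 1 - y * alpha * h
    print(f"alpha={alpha}: exact={exact:.4f}, approx={approx:.4f}")
```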

Page 10:

A possible algorithm

At stage m we find $\theta^*$ that maximizes (or at least gives a sufficiently high value of) the weighted agreement:

$\sum_{i=1}^{n} W_i^{m-1}\, y_i\, h(x_i;\theta^*)$

each sample is weighted by its "difficulty" under the previously combined m − 1 classifiers,

more "difficult" samples receive heavier attention, as they dominate the total loss

Then we go back and find the "votes" $\alpha_m^*$ associated with the new classifier by minimizing the original weighted (exponential) loss:

$\hat{L}(\hat{h}) = \sum_{i=1}^{n} W_i^{m-1} \exp\!\left(-y_i \alpha_m h(x_i;\theta_m^*)\right)$
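A sketch of this two-step stage-wise update, assuming decision stumps as the component family (the exhaustive stump search and the closed-form $\alpha$ below are standard choices used here for illustration, not prescribed by the slide):

```python
import numpy as np

def best_stump(X, y, W):
    """Step 1: pick the stump maximizing the weighted agreement
    sum_i W_i * y_i * h(x_i; theta). Stumps are parametrized here as
    sign(s * (x_k - b)), an equivalent form of sign(w * x_k + b)."""
    best, best_score = None, -np.inf
    n, d = X.shape
    for k in range(d):
        for b in np.unique(X[:, k]):
            for s in (+1, -1):
                pred = np.where(s * (X[:, k] - b) >= 0, 1, -1)
                score = np.sum(W * y * pred)
                if score > best_score:
                    best, best_score = (k, s, b, pred), score
    return best

def vote_weight(y, pred, W):
    """Step 2: alpha minimizing sum_i W_i exp(-y_i * alpha * h_i) for +/-1
    predictions; the minimizer is 0.5 * log((1 - eps) / eps), with eps the
    normalized weighted error (assumes 0 < eps < 1)."""
    eps = np.sum((W / W.sum()) * (pred != y))
    return 0.5 * np.log((1 - eps) / eps)
```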

Page 11:

Boosting

We have basically derived a Boosting algorithm that

sequentially adds new component classifiers, each trained on

reweighted training examples

each component classifier is presented with a slightly different problem

AdaBoost preliminaries:

we work with normalized weights $W_i$ on the training examples, initially uniform ($W_i = 1/n$)

the weights reflect the "degree of difficulty" of each datum for the latest classifier

Page 12:

The AdaBoost algorithm

At the k-th iteration we find (any) classifier $h(x;\theta_k^*)$ for which the weighted classification error

$\varepsilon_k = 0.5 - \frac{1}{2}\sum_{i=1}^{n} W_i^{k-1}\, y_i\, h(x_i;\theta_k^*)$

is better than chance.

This is meant to be "easy" --- a weak classifier.

Determine how many "votes" to assign to the new component classifier:

$\alpha_k = 0.5 \log\!\left((1-\varepsilon_k)/\varepsilon_k\right)$

a stronger classifier gets more votes

Update the weights on the training examples:

$W_i^{k} \propto W_i^{k-1} \exp\!\left(-y_i \alpha_k h(x_i;\theta_k)\right)$

Recall that: $W_i^{m-1} = \exp\!\left(-y_i \hat{h}_{m-1}(x_i)\right)$
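A minimal sketch of one AdaBoost iteration following the three formulas above (assuming normalized weights and ±1 predictions; the function name is illustrative):

```python
import numpy as np

def adaboost_iteration(W, y, h_pred):
    """One boosting round.  W: current normalized weights;
    y, h_pred: +/-1 labels and weak-classifier predictions."""
    # Weighted classification error: eps_k = 0.5 - 0.5 * sum_i W_i y_i h(x_i)
    eps = 0.5 - 0.5 * np.sum(W * y * h_pred)
    # Votes: alpha_k = 0.5 * log((1 - eps) / eps) -- a stronger classifier gets more votes
    alpha = 0.5 * np.log((1 - eps) / eps)
    # Reweight: W_i^k proportional to W_i^{k-1} * exp(-y_i * alpha_k * h(x_i))
    W_new = W * np.exp(-alpha * y * h_pred)
    return alpha, eps, W_new / W_new.sum()   # renormalize so the weights stay a distribution
```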

Page 13:

The AdaBoost algorithm cont’d

The final classifier after m boosting iterations is given by the sign of

$\hat{h}(x) = \frac{\alpha_1 h(x;\theta_1) + \cdots + \alpha_m h(x;\theta_m)}{\alpha_1 + \cdots + \alpha_m}$

the votes here are normalized for convenience
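For completeness, a small sketch of the normalized final vote (normalizing by the sum of the $\alpha$'s rescales the score but does not change its sign):

```python
import numpy as np

def combined_predict(x, components, alphas):
    """Sign of the alpha-weighted, normalized vote of the component classifiers."""
    votes = np.array([h(x) for h in components])     # each h returns +1 or -1
    alphas = np.array(alphas)
    score = alphas @ votes / alphas.sum()            # normalization is for convenience only
    return 1 if score >= 0 else -1
```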

Page 14:

AdaBoost: summary

Input:

N examples $S_N = \{(x_1, y_1), \ldots, (x_N, y_N)\}$

a weak base learner $h = h(x, \theta)$

Initialize: equal example weights $w_i = 1/N$ for all $i = 1, \ldots, N$

Iterate for $t = 1, \ldots, T$:

1. train the base learner on the weighted example set $(w^t, x)$ and obtain hypothesis $h_t = h(x, \theta_t)$

2. compute the hypothesis error $\varepsilon_t$

3. compute the hypothesis weight $\alpha_t$

4. update the example weights for the next iteration, $w^{t+1}$

Output: final hypothesis as a linear combination of the $h_t$
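Putting the boxed steps together, a self-contained sketch of the full loop with decision stumps as the base learner (the helper names and the tiny sanity check are illustrative; in practice a library implementation such as scikit-learn's AdaBoostClassifier would typically be used):

```python
import numpy as np

def adaboost_train(X, y, T):
    """T rounds of boosting with decision stumps; y must be in {-1, +1}."""
    n, d = X.shape
    W = np.full(n, 1.0 / n)                               # equal example weights
    ensemble = []
    for _ in range(T):
        best = None
        for k in range(d):                                # 1. fit base learner on weighted set
            for b in np.unique(X[:, k]):
                for s in (+1, -1):
                    pred = np.where(s * (X[:, k] - b) >= 0, 1, -1)
                    eps = np.sum(W * (pred != y))         # 2. weighted hypothesis error
                    if best is None or eps < best[0]:
                        best = (eps, k, s, b, pred)
        eps, k, s, b, pred = best
        alpha = 0.5 * np.log((1 - eps) / (eps + 1e-12))   # 3. hypothesis weight
        W = W * np.exp(-alpha * y * pred)                 # 4. update example weights
        W /= W.sum()
        ensemble.append((alpha, k, s, b))
    return ensemble

def adaboost_predict(ensemble, X):
    """Final hypothesis: sign of the alpha-weighted combination of stumps."""
    score = sum(a * np.where(s * (X[:, k] - b) >= 0, 1, -1) for a, k, s, b in ensemble)
    return np.sign(score)

# Tiny sanity check on separable 1-D data: the ensemble recovers the labels
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([-1, -1, 1, 1])
print(adaboost_predict(adaboost_train(X, y, T=5), X))
```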

Page 15:

AdaBoost


Page 16:

AdaBoost flow chart

[Flow chart: the original training set is reweighted into Data set 1, Data set 2, …, Data set T; Learner 1, Learner 2, …, Learner T are trained in sequence; training instances that are wrongly predicted by Learner 1 play more important roles in the training of Learner 2; the learners' outputs are combined by a weighted combination.]

Page 17:

Toy Example

Weak classifier (rule of thumb): vertical or horizontal half-

planes

Uniform weights on all examples


Page 18:

Boosting round 1

Choose a rule of thumb (weak classifier)

Some data points obtain higher weights because they are

classified incorrectly


Page 19:

Boosting round 2

Choose a new rule of thumb

Reweight again; the weights of incorrectly classified examples are increased


Page 20:

Boosting round 3

Repeat the same process

Now we have 3 classifiers


Page 21:

Boosting aggregate classifier

The final classifier is a weighted combination of the weak classifiers


Page 22:

Base Learners

Weak learners used in practice:

Decision stumps (axis parallel splits)

Decision trees (e.g. C4.5 by Quinlan 1996)

Multi-layer neural networks

Radial basis function networks

Can base learners operate on weighted examples?

In many cases they can be modified to accept weights along with the

examples

In general, we can sample the examples (with replacement) according to

the distribution defined by the weights
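A sketch of the resampling option mentioned above (the helper name and seed are illustrative): drawing with replacement according to the weight distribution lets an unweighted base learner see roughly the weighted problem.

```python
import numpy as np

def resample_by_weight(X, y, W, seed=0):
    """Draw n examples with replacement, example i chosen with
    probability proportional to its weight W_i."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(y), size=len(y), replace=True, p=W / W.sum())
    return X[idx], y[idx]
```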

Page 23:

Theoretical properties

Y. Freund and R. Schapire [JCSS97] have proved that the training error of AdaBoost is bounded by:

$\frac{1}{N}\sum_{i=1}^{N} I\!\left(\hat{h}(x_i) \neq y_i\right) \leq \prod_{t=1}^{T} 2\sqrt{\varepsilon_t(1-\varepsilon_t)} = \prod_{t=1}^{T}\sqrt{1 - 4\gamma_t^2}$

where $\varepsilon_t$ is the weighted error of the t-th weak classifier and $\gamma_t = \frac{1}{2} - \varepsilon_t$ is its edge over random guessing.

Thus, if each base classifier is slightly better than random, so that $\gamma_t \geq \gamma$ for some $\gamma > 0$, then the training error drops exponentially fast in T, since the above bound is at most

$\exp\!\left(-2\sum_{t=1}^{T}\gamma_t^2\right) \leq \exp\!\left(-2\gamma^2 T\right)$
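A small numeric illustration of the bound (assuming, as above, a constant edge $\gamma_t = \gamma$ in every round):

```python
import numpy as np

# With edge gamma in every round (eps_t = 0.5 - gamma), the per-round factor
# 2*sqrt(eps_t*(1-eps_t)) = sqrt(1 - 4*gamma^2) is < 1, so the product bound
# decays exponentially in T and sits below exp(-2*gamma^2*T).
gamma, T = 0.1, 100
eps = 0.5 - gamma
per_round = 2 * np.sqrt(eps * (1 - eps))
print(per_round ** T, np.exp(-2 * gamma**2 * T))   # ~0.130 <= ~0.135
```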

Page 24:

How will train/test error behave?

Expect: training error to continue to drop (or reach zero)

First guess: test error to increase when the combined classifier becomes too complex

Overfitting: hard to know when to stop training


Page 25:

Actual experimental observation

Test error does not increase, even after 1000 rounds

Test error continues to drop even after the training error is zero!


Page 26:

Theoretical properties (cont'd)

Y. Freund and R. Schapire [JCSS97] have tried to bound the generalization error as:

$\Pr\!\left[f(x) \neq y\right] \leq \widehat{\Pr}\!\left[f(x) \neq y\right] + \tilde{O}\!\left(\sqrt{\frac{Td}{m}}\right)$

where $m$ is the number of samples and $d$ is the VC-dimension of the weak learner.

It suggests AdaBoost will overfit if T is large. However, empirical studies show that AdaBoost often does not overfit.

Classification error only measures whether a classification is right or wrong; we also need to consider the confidence of the classifier.

Margin-based bound: for any $\theta > 0$, with high probability

$\Pr\!\left[f(x) \neq y\right] \leq \widehat{\Pr}\!\left[\mathrm{margin}_f(x, y) \leq \theta\right] + \tilde{O}\!\left(\sqrt{\frac{d}{m\theta^2}}\right)$

Page 27:

Summary

Boosting takes a weak learner and converts it to a strong

one

Works by asymptotically minimizing the empirical error

Effectively maximizes the margin of the combined hypothesis