Advanced Machine Learning & Perception

Columbia University. Instructor: Tony Jebara.

Transcript of Advanced Machine Learning & Perception

Page 1: Advanced Machine Learning & Perception

Advanced Machine Learning & Perception

Instructor: Tony Jebara

Page 2: Advanced Machine Learning & Perception

Boosting
• Combining Multiple Classifiers
• Voting
• Boosting
• Adaboost
• Based on material by Y. Freund, P. Long & R. Schapire

Page 3: Advanced Machine Learning & Perception

Combining Multiple Learners
• Have many simple learners
• Also called base learners or weak learners, which have a classification error of < 0.5
• Combine or vote them to get a higher accuracy
• No free lunch: there is no guaranteed best approach here
• Different approaches:
  Voting: combine learners with fixed weights
  Mixture of Experts: adapt the learners and a variable weight/gate function
  Boosting: actively search for the next base learners and vote
  Cascading, Stacking, Bagging, etc.

Page 4: Advanced Machine Learning & Perception

Voting
• Have T classifiers
• Average their predictions with weights
• Like mixture of experts, but the weight is constant with respect to the input

The weak learners are h_t(x) and the combined prediction is

f(x) = \sum_{t=1}^{T} \alpha_t h_t(x)   where   \alpha_t \ge 0   and   \sum_{t=1}^{T} \alpha_t = 1

(Figure: each expert h_1, ..., h_T receives the input x = (x_1, x_2, x_3) and their outputs are combined with fixed weights into the prediction y.)

Page 5: Advanced Machine Learning & Perception

Mixture of Experts
• Have T classifiers or experts and a gating function
• Average their predictions with variable weights
• But, adapt the parameters of the gating function and the experts (fixed total number T of experts)

The experts are h_t(x) and the combined prediction is

f(x) = \sum_{t=1}^{T} a_t(x) h_t(x)   where   a_t(x) \ge 0   and   \sum_{t=1}^{T} a_t(x) = 1

(Figure: a gating network takes the input x = (x_1, x_2, x_3) and produces the weights a_t(x); each expert h_t also receives x, and the weighted combination of the expert outputs is the final output.)
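
A minimal sketch of this combination rule in Python (with NumPy); the softmax gating network and the linear experts are illustrative assumptions, not something specified on the slide:

import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def mixture_of_experts_predict(x, expert_weights, gate_weights):
    """f(x) = sum_t a_t(x) h_t(x), with a_t(x) >= 0 and sum_t a_t(x) = 1."""
    # Each expert here is a simple linear model h_t(x) = w_t . x (an assumption).
    expert_outputs = np.array([w @ x for w in expert_weights])
    # The gate produces input-dependent weights a_t(x) via a softmax (an assumption).
    a = softmax(np.array([v @ x for v in gate_weights]))
    return float(a @ expert_outputs)

# Tiny usage example with T = 3 experts and 3-dimensional inputs.
x = np.array([1.0, 2.0, 3.0])
experts = [np.random.randn(3) for _ in range(3)]
gates = [np.random.randn(3) for _ in range(3)]
print(mixture_of_experts_predict(x, experts, gates))

Unlike plain voting, the gate makes the weights a_t(x) depend on the input, so different experts dominate in different regions of the input space.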

Page 6: Advanced Machine Learning & Perception

Boosting
• Actively find complementary or synergistic weak learners
• Train the next learner based on the mistakes of the previous ones
• Average predictions with fixed weights
• Find the next learner by training on weighted versions of the data

training data: \{(x_1, y_1), \ldots, (x_N, y_N)\}
weights: \{w_1, \ldots, w_N\}
weak learner: h_t(x)
weak rule: \frac{\sum_{i=1}^{N} w_i \, \mathrm{step}\left(-h_t(x_i) y_i\right)}{\sum_{i=1}^{N} w_i} < \frac{1}{2} - \gamma
ensemble of learners: \{(\alpha_1, h_1), \ldots, (\alpha_T, h_T)\}
prediction: f(x) = \sum_{t=1}^{T} \alpha_t h_t(x)

Page 7: Advanced Machine Learning & Perception

AdaBoost
• Most popular weighting scheme
• Define the margin for point i as:

  y_i \sum_{t=1}^{T} \alpha_t h_t(x_i)

• Find an h_t and a weight \alpha_t to minimize the cost function, the sum of exponentiated negative margins:

  \sum_{i=1}^{N} \exp\left(-y_i \sum_{t=1}^{T} \alpha_t h_t(x_i)\right)

(Setup as before: training data \{(x_1, y_1), \ldots, (x_N, y_N)\}, weights \{w_1, \ldots, w_N\}, weak learner h_t(x), weak rule \frac{\sum_i w_i \, \mathrm{step}(-h_t(x_i) y_i)}{\sum_i w_i} < \frac{1}{2} - \gamma, ensemble of learners \{(\alpha_1, h_1), \ldots, (\alpha_T, h_T)\}, prediction f(x) = \sum_{t=1}^{T} \alpha_t h_t(x).)

Page 8: Advanced Machine Learning & Perception

AdaBoost
• Choose the base learner h_t and its weight \alpha_t by minimizing:

  \min_{\alpha_t, h_t} \sum_{i=1}^{N} \exp\left(-y_i \sum_{t=1}^{T} \alpha_t h_t(x_i)\right)

• Recall the error of base classifier h_t must satisfy:

  \epsilon_t = \frac{\sum_{i=1}^{N} w_i \, \mathrm{step}\left(-h_t(x_i) y_i\right)}{\sum_{i=1}^{N} w_i} < \frac{1}{2} - \gamma

• For binary h, Adaboost puts this weight on weak learners (instead of the more general rule):

  \alpha_t = \frac{1}{2} \ln\left(\frac{1 - \epsilon_t}{\epsilon_t}\right)

• Adaboost picks the following weights on the data for the next round (here Z_t is the normalizer so the weights sum to 1):

  w_i^{t+1} = \frac{w_i^t \exp\left(-\alpha_t y_i h_t(x_i)\right)}{Z_t}
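
A minimal sketch of the full AdaBoost loop implied by these formulas, in Python with NumPy; the decision-stump weak learner and the toy data (labels given by the X>3, Y>5 tree on the next slide) are illustrative assumptions:

import numpy as np

def best_stump(X, y, w):
    """Exhaustively pick the threshold stump with the lowest weighted error."""
    best = None
    for j in range(X.shape[1]):                       # feature index
        for thr in np.unique(X[:, j]):                # candidate threshold
            for sign in (+1, -1):                     # stump polarity
                pred = sign * np.where(X[:, j] > thr, 1, -1)
                err = np.sum(w * (pred != y))
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    return best

def adaboost(X, y, T=20):
    N = X.shape[0]
    w = np.full(N, 1.0 / N)                           # initial weights w_i = 1/N
    ensemble = []
    for t in range(T):
        eps, j, thr, sign = best_stump(X, y, w)
        if eps >= 0.5:                                # weak rule violated, stop
            break
        eps = max(eps, 1e-12)
        alpha = 0.5 * np.log((1 - eps) / eps)         # alpha_t = 1/2 ln((1-eps_t)/eps_t)
        pred = sign * np.where(X[:, j] > thr, 1, -1)
        w = w * np.exp(-alpha * y * pred)             # w_i^{t+1} proportional to w_i^t exp(-alpha_t y_i h_t(x_i))
        w = w / w.sum()                               # divide by the normalizer Z_t
        ensemble.append((alpha, j, thr, sign))
    return ensemble

def predict(ensemble, X):
    """f(x) = sum_t alpha_t h_t(x); the class is sign(f(x))."""
    f = np.zeros(X.shape[0])
    for alpha, j, thr, sign in ensemble:
        f += alpha * sign * np.where(X[:, j] > thr, 1, -1)
    return np.sign(f)

# Toy usage: 2-D points labeled +1 when X>3 and Y>5, else -1.
X = np.random.rand(200, 2) * 10
y = np.where((X[:, 0] > 3) & (X[:, 1] > 5), 1, -1)
model = adaboost(X, y, T=25)
print("training accuracy:", np.mean(predict(model, X) == y))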

Page 9: Advanced Machine Learning & Perception

Decision Trees

(Figure: a small decision tree. The root tests X>3: if no, predict -1; if yes, test Y>5: if yes, predict +1, if no, predict -1. Equivalently, the (X, Y) plane is split at X=3 and Y=5 into regions labeled +1, -1 and -1.)

Page 10: Advanced Machine Learning & Perception

Decision tree as a sum

(Figure: the same tree rewritten as a sum of prediction nodes: a root value of -0.2, a node X>3 contributing +0.1 (yes) or -0.1 (no), and a node Y>5 contributing +0.2 (yes) or -0.3 (no). A point's score is the sum of the values along its path, and the class is the sign of the score, giving the same +1/-1 regions in the (X, Y) plane.)

Page 11: Advanced Machine Learning & Perception

An alternating decision tree

(Figure: an alternating decision tree: the sum above plus an extra decision node Y<1 contributing +0.7 (yes) or 0.0 (no). A point's score is the root value -0.2 plus the contributions of the decision nodes it reaches (X>3: +0.1/-0.1, Y>5: +0.2/-0.3, Y<1: +0.7/0.0), and the class is the sign of the score; the resulting +1/-1 regions in the (X, Y) plane are shown.)
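
A minimal sketch of how such an alternating decision tree scores a point, using the root value and node contributions read off the figure; treating every decision node as reachable by every point (a flat structure) is a simplifying assumption:

def adt_score(x, y):
    """Score = root prediction plus the contribution of every decision node reached."""
    score = -0.2                       # root prediction node
    score += 0.1 if x > 3 else -0.1    # decision node X>3
    score += 0.2 if y > 5 else -0.3    # decision node Y>5
    score += 0.7 if y < 1 else 0.0     # decision node Y<1
    return score

def adt_classify(x, y):
    return +1 if adt_score(x, y) >= 0 else -1

print(adt_classify(4, 6), adt_classify(1, 2))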

Page 12: Advanced Machine Learning & Perception

Example: Medical Diagnostics

•Cleve dataset from UC Irvine database.

•Heart disease diagnostics (+1=healthy,-1=sick)

•13 features from tests (real valued and discrete).

•303 instances.

Page 13: Advanced Machine Learning & Perception

Ad-Tree Example

Page 14: Advanced Machine Learning & Perception

Cross-validated accuracy

Learning algorithm   Number of splits   Average test error   Test error variance
ADtree               6                  17.0%                0.6%
C5.0                 27                 27.2%                0.5%
C5.0 + boosting      446                20.2%                0.5%
Boost Stumps         16                 16.5%                0.8%

Page 15: Advanced Machine Learning & Perception

AdaBoost Convergence
• Rationale?
• Consider a bound on the training error:

  R_{emp} = \frac{1}{N} \sum_{i=1}^{N} \mathrm{step}\left(-y_i f(x_i)\right)                       (0-1 loss)
          \le \frac{1}{N} \sum_{i=1}^{N} \exp\left(-y_i f(x_i)\right)                              (exp bound on step)
          = \frac{1}{N} \sum_{i=1}^{N} \exp\left(-y_i \sum_{t=1}^{T} \alpha_t h_t(x_i)\right)      (definition of f(x))
          = \prod_{t=1}^{T} Z_t                                                                    (recursive use of Z)

• Adaboost is essentially doing gradient descent on this.
• Convergence?

(Figure: loss as a function of the margin, comparing the 0-1 loss with the losses used by Brownboost and Logitboost; mistakes lie at negative margins, correct predictions at positive margins.)

Page 16: Advanced Machine Learning & Perception

AdaBoost Convergence
• Convergence? Consider the binary h_t case, with

  \alpha_t = \frac{1}{2} \ln\left(\frac{1 - \epsilon_t}{\epsilon_t}\right)   and   \epsilon_t = \frac{1}{2} - \gamma_t \le \frac{1}{2} - \gamma

  R_{emp} \le \prod_{t=1}^{T} Z_t = \prod_{t=1}^{T} \sum_{i=1}^{N} w_i^t \exp\left(-\alpha_t y_i h_t(x_i)\right)
          = \prod_{t=1}^{T} \sum_{i=1}^{N} w_i^t \exp\left(\frac{1}{2} y_i h_t(x_i) \ln\frac{\epsilon_t}{1 - \epsilon_t}\right)
          = \prod_{t=1}^{T} \sum_{i=1}^{N} w_i^t \left(\sqrt{\frac{\epsilon_t}{1 - \epsilon_t}}\right)^{y_i h_t(x_i)}
          = \prod_{t=1}^{T} \left( \sum_{\mathrm{correct}} w_i^t \sqrt{\frac{\epsilon_t}{1 - \epsilon_t}} + \sum_{\mathrm{incorrect}} w_i^t \sqrt{\frac{1 - \epsilon_t}{\epsilon_t}} \right)
          = \prod_{t=1}^{T} 2 \sqrt{\epsilon_t \left(1 - \epsilon_t\right)} = \prod_{t=1}^{T} \sqrt{1 - 4 \gamma_t^2} \le \exp\left(-2 \sum_t \gamma_t^2\right)

  R_{emp} \le \exp\left(-2 \sum_t \gamma_t^2\right) \le \exp\left(-2 T \gamma^2\right)

So the final learner converges exponentially fast in T if each weak learner is at least better than random by gamma!
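
As a worked instance of the final bound (the numbers are illustrative assumptions, not from the slides): if every weak learner has edge \gamma = 0.1 over random guessing, then after T = 500 rounds

  R_{emp} \le \exp\left(-2 T \gamma^2\right) = \exp\left(-2 \cdot 500 \cdot 0.01\right) = \exp(-10) \approx 4.5 \times 10^{-5},

so the training error is driven to zero after a modest number of rounds.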

Page 17: Advanced Machine Learning & Perception

Curious phenomenon: boosting decision trees

Using <10,000 training examples we fit >2,000,000 parameters

Page 18: Advanced Machine Learning & Perception

Explanation using margins

(Figure: the 0-1 loss plotted against the margin.)

Page 19: Advanced Machine Learning & Perception

Explanation using margins

(Figure: the same plot of the 0-1 loss against the margin.)

No examples with small margins!!

Page 20: Advanced Machine Learning & Perception

Experimental Evidence

Page 21: Advanced Machine Learning & Perception

AdaBoost Generalization Bound
• Also, a VC analysis gives a generalization bound (where d is the VC dimension of the base classifier):

  R \le R_{emp} + O\left(\sqrt{\frac{T d}{N}}\right)

• But, more iterations ⇒ overfitting!
• A margin analysis is possible; redefine the margin as:

  \mathrm{mar}_f\left(x, y\right) = \frac{y \sum_t \alpha_t h_t(x)}{\sum_t \alpha_t}

• Then we have:

  R \le \frac{1}{N} \sum_{i=1}^{N} \mathrm{step}\left(\theta - \mathrm{mar}_f\left(x_i, y_i\right)\right) + O\left(\sqrt{\frac{d}{N \theta^2}}\right)

Page 22: Advanced Machine Learning & Perception

AdaBoost Generalization Bound

(Figure: margin illustration.)

•Suggests this optimization problem:

Page 23: Advanced Machine Learning & Perception

AdaBoost Generalization Bound
• Proof Sketch

Page 24: Advanced Machine Learning & Perception

UCI Results

Database    Other              Boosting   Error reduction
Cleveland   27.2 (DT)          16.5       39%
Promoters   22.0 (DT)          11.8       46%
Letter      13.8 (DT)          3.5        74%
Reuters 4   5.8, 6.0, 9.8      2.95       ~60%
Reuters 8   11.3, 12.1, 13.4   7.4        ~40%

(% test error rates)

Page 25: Advanced Machine Learning & Perception

Boosted Cascade of Stumps
• Consider classifying an image as face/not-face
• Use as weak learner a stump based on averages of pixel intensity
• Easy to calculate: white areas are subtracted from black ones
• A special representation of the sample, called the integral image, makes feature extraction faster.

Page 26: Advanced Machine Learning & Perception

Boosted Cascade of Stumps
• Summed area tables
• A representation with which the sum over any rectangle can be calculated in four accesses of the integral image.
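
A minimal sketch of the integral image and the four-access rectangle sum in Python (NumPy); the function names are illustrative:

import numpy as np

def integral_image(img):
    """ii(y, x) = sum of img over all pixels above and to the left, inclusive."""
    return img.cumsum(axis=0).cumsum(axis=1)

def rect_sum(ii, top, left, height, width):
    """Sum of the pixels in a rectangle using four accesses of the integral image."""
    def at(y, x):
        return ii[y, x] if y >= 0 and x >= 0 else 0.0
    y0, x0 = top - 1, left - 1
    y1, x1 = top + height - 1, left + width - 1
    return at(y1, x1) - at(y0, x1) - at(y1, x0) + at(y0, x0)

# Usage: the difference of two adjacent rectangle sums gives a Haar-like stump feature
# (white area subtracted from black area) on a 24 x 24 sub-window.
img = np.random.rand(24, 24)
ii = integral_image(img)
feature = rect_sum(ii, 0, 0, 12, 24) - rect_sum(ii, 12, 0, 12, 24)
print(feature)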

Page 27: Advanced Machine Learning & Perception

Boosted Cascade of Stumps
• Summed area tables

Page 28: Advanced Machine Learning & Perception

Boosted Cascade of Stumps
• The base size for a sub-window is 24 by 24 pixels.
• Each of the four feature types is scaled and shifted across all possible combinations.
• In a 24 pixel by 24 pixel sub-window there are ~160,000 possible features to be calculated.

Page 29: Advanced Machine Learning & Perception

Boosted Cascade of Stumps
• Viola-Jones algorithm: with K attributes (e.g., K = 160,000) we have K different decision stumps to choose from

At each stage of boosting:
• given reweighted data from the previous stage
• train all K (160,000) single-feature perceptrons
• select the single best classifier at this stage
• combine it with the other previously selected classifiers
• reweight the data
• learn all K classifiers again, select the best, combine, reweight
• repeat until you have T classifiers selected
• very computationally intensive!

Page 30: Advanced Machine Learning & Perception

Boosted Cascade of Stumps
• Reduction in Error as Boosting adds Classifiers

Page 31: Advanced Machine Learning & Perception

Boosted Cascade of Stumps
• First (i.e., best) two features learned by boosting

Page 32: Advanced Machine Learning & Perception

Boosted Cascade of Stumps
• Example training data

Page 33: Advanced Machine Learning & Perception

Boosted Cascade of Stumps
• To find faces, scan all squares at different scales; this is slow
• Boosting finds an ordering on the weak learners (best ones first)
• Idea: cascade the stumps to avoid too much computation! (see the sketch below)
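
A minimal sketch of the cascade idea in Python: each stage is a small boosted classifier, and a sub-window is rejected as soon as any stage says non-face, so most windows only pay for the first, cheapest stages. The stage score functions and thresholds here are illustrative assumptions:

def cascade_classify(window, stages):
    """stages: list of (score_fn, threshold), ordered with the cheapest stages first."""
    for score_fn, threshold in stages:
        if score_fn(window) < threshold:
            return -1          # rejected early: no later stage is evaluated
    return +1                  # survived every stage: report a face

# Illustrative stages: each score_fn stands in for a small boosted sum of stumps
# (e.g., a couple of features in the first stage, more in later ones), with its own threshold.
stages = [
    (lambda w: sum(w[:2]),  0.5),   # hypothetical cheap first stage
    (lambda w: sum(w[:10]), 2.0),   # hypothetical larger second stage
]
print(cascade_classify([1.0] * 24, stages))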

Page 34: Advanced Machine Learning & Perception

Boosted Cascade of Stumps
• Training time = weeks (with 5k faces and 9.5k non-faces)
• Final detector has 38 layers in the cascade, 6060 features
• On a 700 MHz processor, it can process a 384 x 288 image in 0.067 seconds (in 2003, when the paper was written)

Page 35: Advanced Machine Learning & Perception

Boosted Cascade of Stumps
• Results