Boosting and other Expert Fusion Strategies

Page 1

Boosting and other Expert Fusion Strategies

Page 2

References

• http://www.boosting.org
• Chapter 9.5, Duda, Hart & Stork
• Leo Breiman: Boosting, Bagging, Arcing
• Presentation adapted from:

Rishi Sinha, Robin Dhamankar

Page 3

Types of Multiple Experts

• Single expert on full observation space

• Single expert for sub-regions of observation space (Trees)

• Multiple experts on full observation space

• Multiple experts on sub-regions of observation space

Page 4

Types of Multiple Experts Training

• Use full observation space for each expert

• Use different observation features for each expert

• Use different observations for each expert

• Combine the above

Page 5

Online Experts Selection

• N strategies (experts)

• At time t:
  – Learner A chooses a distribution over the N experts
  – Let $p_t(i)$ be the probability of the i-th expert, with $\sum_i p_t(i) = 1$
  – For a loss vector $l_t$, the loss at time t is $\sum_i p_t(i)\, l_t(i)$
• Assume bounded loss: $l_t(i) \in [0,1]$

Page 6

Experts Algorithm: Greedy

• For each expert define its cumulative loss:
  $L_t^i = \sum_{j=1}^{t} l_j(i)$
• Greedy: at time t choose the expert with minimum cumulative loss, namely $\arg\min_i L_t^i$
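As a concrete illustration, here is a minimal Python sketch of the Greedy strategy over a fixed loss matrix; the `run_greedy` name and the (T, N) loss-matrix layout are our own assumptions, not part of the original notes.

```python
import numpy as np

def run_greedy(losses):
    """Run the Greedy expert-selection strategy on a (T, N) loss matrix.

    At each time t the algorithm follows the expert whose cumulative loss
    L_t^i is currently smallest, then suffers that expert's loss at time t.
    """
    T, N = losses.shape
    cum = np.zeros(N)            # cumulative losses L_t^i
    total = 0.0
    for t in range(T):
        i = int(np.argmin(cum))  # arg min_i L_t^i
        total += losses[t, i]    # Greedy suffers the chosen expert's loss
        cum += losses[t]         # update every expert's cumulative loss
    return total
```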

Page 7

Greedy Analysis

• Theorem: Let $L_G^T$ be the loss of Greedy at time T. Then
  $L_G^T \le N\,\big(\min_i L_i^T + 1\big)$
• Proof in notes.
• Weakness: relies on a single expert for every observation.

Page 8

Better Multiple Experts Algorithms

• Would like to bound $L_A^T - \min_i L_i^T$
• Better bound: the Hedge algorithm, which utilizes all experts for each observation.

Page 9

Multiple Experts Algorithm: Hedge

• Maintain a weight vector at time t: $w_t$
• Probabilities: $p_t(k) = w_t(k) / \sum_j w_t(j)$
• Initialization: $w_1(i) = 1/N$
• Updates: $w_{t+1}(k) = w_t(k)\, U_b(l_t(k))$, where $b \in [0,1]$ and $b^r \le U_b(r) \le 1-(1-b)\,r$
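A minimal Python sketch of these updates, assuming the per-round losses arrive as a (T, N) NumPy array and taking the simplest admissible update $U_b(r) = b^r$; the function name `hedge` is ours.

```python
import numpy as np

def hedge(losses, b=0.5):
    """Hedge sketch: losses is a (T, N) array with entries l_t(i) in [0, 1].

    Returns the total (expected) loss of the algorithm and the final weights.
    """
    T, N = losses.shape
    w = np.ones(N) / N               # w_1(i) = 1/N
    total_loss = 0.0
    for t in range(T):
        p = w / w.sum()              # p_t(k) = w_t(k) / sum_j w_t(j)
        total_loss += p @ losses[t]  # loss at time t: sum_i p_t(i) l_t(i)
        w = w * b ** losses[t]       # w_{t+1}(k) = w_t(k) * U_b(l_t(k)), with U_b(r) = b^r
    return total_loss, w
```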

Page 10

Hedge Analysis

• Lemma: For any sequence of losses,
  $\ln\!\left(\sum_{j=1}^{N} w_{T+1}(j)\right) \le -(1-b)\, L_H$
• Proof: see Mansour's scribe notes.
• Corollary:
  $L_H \le \dfrac{-\ln\!\left(\sum_{j=1}^{N} w_{T+1}(j)\right)}{1-b}$

Page 11

Hedge: Properties

• Bounding the weights:
  $w_{T+1}(i) = w_1(i) \prod_{t=1}^{T} U_b(l_t(i)) \;\ge\; w_1(i) \prod_{t=1}^{T} b^{\,l_t(i)} \;=\; w_1(i)\, b^{\,L_i^T}$
• Similarly for any subset of experts.

Page 12

Hedge: Performance

• Let k be the expert with minimal loss. Then
  $\sum_{j=1}^{N} w_{T+1}(j) \;\ge\; w_{T+1}(k) \;\ge\; w_1(k)\, b^{\,L_k^T} \;=\; \frac{1}{N}\, b^{\,L_k^T}$
• Therefore
  $L_H \;\le\; \dfrac{\ln N + L_k^T \ln(1/b)}{1-b}$

Page 13

Hedge: Optimizing b

• For b = 1/2 we have
  $L_H \;\le\; 2\ln N + 2\, L_k^T \ln 2$
• With a better selection of b:
  $L_H \;\le\; \min_i L_i + \sqrt{2\,(\min_i L_i)\,\ln N} + \ln N$

Page 14

Occam Razor

• Finding the shortest consistent hypothesis.
• Definition: $(\alpha,\beta)$-Occam algorithm
  – $\alpha > 0$ and $\beta < 1$
  – Input: a sample S of size m
  – Output: hypothesis h
  – for every (x,b) in S: h(x) = b
  – $\mathrm{size}(h) < \mathrm{size}(c_t)^{\alpha}\, m^{\beta}$
• Efficiency.

Page 15

Occam Razor Theorem

• A: an $(\alpha,\beta)$-Occam algorithm for C using H
• D: a distribution over inputs X
• $c_t \in C$: the target function
• Sample size:
  $m \;=\; O\!\left(\dfrac{1}{\epsilon}\ln\dfrac{1}{\delta} + \left(\dfrac{n^{\alpha}}{\epsilon}\right)^{\frac{1}{1-\beta}}\right)$
• With probability $1-\delta$, A(S) = h has $\mathrm{error}(h) < \epsilon$

Page 16

Occam Razor Theorem

• Use the bound for a finite hypothesis class.
• Effective hypothesis class size: $2^{\mathrm{size}(h)}$
• $\mathrm{size}(h) < n^{\alpha}\, m^{\beta}$
• Sample size:
  $m \;\ge\; \dfrac{1}{\epsilon}\left(n^{\alpha} m^{\beta} \ln 2 + \ln\dfrac{1}{\delta}\right)$

Page 17

Weak and Strong Learning

Page 18

PAC Learning model (Strong Learning)

• There exists a distribution D over the domain X
• Examples: <x, c(x)>
  – use c for the target function (rather than $c_t$)
• Goal:
  – with high probability ($1-\delta$)
  – find h in H such that
  – $\mathrm{error}(h,c) < \epsilon$
  – $\epsilon$ arbitrarily small, thus STRONG learning

Page 19

Weak Learning Model

• Goal: $\mathrm{error}(h,c) < \tfrac{1}{2} - \gamma$ (slightly better than chance)
• The parameter $\gamma$ is small
  – $\gamma$ constant
  – intuitively: a much easier task
• Question:
  – assume C is weakly learnable,
  – is C then PAC (strong) learnable?

Page 20

Majority Algorithm

• Hypothesis: hM(x)= MAJ[ h1(x), ... , hT(x) ]

• size(hM) < T size(ht)

• Using Occam Razor

Page 21

Majority: outline

• Sample m examples
• Start with a distribution of 1/m per example
• Modify the distribution and get $h_t$
• The hypothesis is the majority vote
• Terminate upon perfect classification
  – of the sample

Page 22

Majority: Algorithm

• Use the Hedge algorithm.
• The "experts" are associated with sample points.
• The loss is incurred on a correct classification:
  – $l_t(i) = 1 - |h_t(x_i) - c(x_i)|$
• Set $b = 1 - \gamma$
• $h_M(x) = \mathrm{MAJORITY}(h_1(x), \ldots, h_T(x))$
• Q: How do we set T?

Page 23

Majority: Analysis

• Consider the set of errors $S = \{\, i \mid h_M(x_i) \ne c(x_i) \,\}$
• For every i in S: $L_i / T < 1/2$ (Proof!)
• From the Hedge properties:
  $\ln\!\left(\sum_{i\in S} D(x_i)\right) \;\le\; -\gamma^2\, T / 2$

Page 24

MAJORITY: Correctness

• Error probability:
  $\sum_{i\in S} D(x_i) \;\le\; e^{-\gamma^2 T/2}$
• Number of rounds:
  $T \;\ge\; \dfrac{2 \ln m}{\gamma^2}$
• Terminate when the error is less than 1/m: $e^{-\gamma^2 T/2} < 1/m$ holds once $T > \frac{2\ln m}{\gamma^2}$.

Page 25

Bagging

• Generate a random sample from the training set by selecting elements with replacement.

• Repeat this sampling procedure, getting a sequence of k “independent” training sets

• A corresponding sequence of classifiers C1,C2,…,Ck is constructed for each of these training sets, by using the same classification algorithm

• To classify an unknown sample X, let each classifier predict.

• The Bagged Classifier C* then combines the predictions of the individual classifiers to generate the final outcome. (sometimes combination is simple voting)

Taken from Lecture slides for Data Mining Concepts and Techniques by Jiawei Han and M Kamber
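A minimal Python sketch of this procedure; the `fit` helper (which trains one classifier on a bootstrap sample and returns a prediction function) is hypothetical, labels are assumed to be in {-1, +1}, and simple unweighted voting is used as the combination rule.

```python
import numpy as np

def bagging(X, y, fit, k=25, seed=0):
    """Bagging sketch: train k classifiers on bootstrap samples, combine by voting.

    fit(X, y) is a hypothetical helper that returns a classifier C(Xq) -> {-1, +1}.
    """
    rng = np.random.default_rng(seed)
    n = len(y)
    classifiers = []
    for _ in range(k):
        idx = rng.integers(0, n, size=n)         # sample n elements with replacement
        classifiers.append(fit(X[idx], y[idx]))  # same algorithm on each bootstrap set
    def C_star(Xq):
        votes = sum(C(Xq) for C in classifiers)  # simple voting
        return np.sign(votes)
    return C_star
```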

Page 26

Boosting

• Also an ensemble method: the final prediction is a combination of the predictions of several predictors.
• What is different?
  – It is iterative.
  – Boosting: successive classifiers depend on their predecessors; in the previous methods the individual classifiers were "independent".
  – Training examples may have unequal weights.
  – Errors from the previous classifier decide how to focus the next iteration over the data.
  – Weights are set to focus more on 'hard' examples (the ones misclassified in previous iterations).

Page 27

Boosting

• W(x) is the distribution of weights over the N training observations, with ∑ W(xi) = 1
• Initially assign uniform weights W0(x) = 1/N for all x; step k = 0
• At each iteration k:
  – Find the best weak classifier Ck(x) using weights Wk(x)
  – With error rate εk, and based on a loss function:
    • compute αk, the weight of classifier Ck in the final hypothesis
    • for each xi, update the weights based on εk to get Wk+1(xi)
• CFINAL(x) = sign[ ∑ αi Ci(x) ]

Page 28

Boosting (Algorithm)

Page 29

Boosting As Additive Model

• The final prediction in boosting, f(x), can be expressed as an additive expansion of individual classifiers:
  $f(x) = \sum_{m=1}^{M} \beta_m\, b(x; \gamma_m)$
• The process is iterative and can be expressed as follows:
  $f_m(x) = f_{m-1}(x) + \beta_m\, b(x; \gamma_m)$
• Typically we would try to minimize a loss function on the training examples:
  $\min_{\{\beta_m, \gamma_m\}_1^M} \; \sum_{i=1}^{N} L\!\left(y_i,\; \sum_{m=1}^{M} \beta_m\, b(x_i; \gamma_m)\right)$

Page 30

Boosting As Additive Model

• Simple case: squared-error loss
  $L(y, f(x)) = \tfrac{1}{2}\,(y - f(x))^2$
• Forward stage-wise modeling then amounts to fitting the residuals from the previous iteration:
  $L\big(y_i,\, f_{m-1}(x_i) + \beta\, b(x_i;\gamma)\big) = \tfrac{1}{2}\,\big(y_i - f_{m-1}(x_i) - \beta\, b(x_i;\gamma)\big)^2 = \tfrac{1}{2}\,\big(r_{im} - \beta\, b(x_i;\gamma)\big)^2$
  where $r_{im} = y_i - f_{m-1}(x_i)$ is the residual.
• Squared-error loss is not robust for classification.

Page 31

Boosting As Additive Model

• AdaBoost for classification uses the exponential loss function: $L(y, f(x)) = \exp(-y \cdot f(x))$

  $\hat{f} = \arg\min_f \sum_{i=1}^{N} L(y_i, f(x_i))$

  $(\beta_m, G_m) = \arg\min_{\beta, G} \sum_{i=1}^{N} \exp\!\big({-y_i\,[\,f_{m-1}(x_i) + \beta\, G(x_i)\,]}\big)$

  $= \arg\min_{\beta, G} \sum_{i=1}^{N} w_i^{(m)} \exp\!\big({-\beta\, y_i\, G(x_i)}\big), \qquad w_i^{(m)} = \exp\!\big({-y_i\, f_{m-1}(x_i)}\big)$

Page 32

Boosting As Additive Model

First assume that β is constant, and minimize w.r.t. G:

$\arg\min_G \sum_{i=1}^{N} w_i^{(m)} \exp(-\beta\, y_i\, G(x_i)) \;=\; \arg\min_G \Big[\, e^{-\beta}\!\!\sum_{y_i = G(x_i)}\!\! w_i^{(m)} \;+\; e^{\beta}\!\!\sum_{y_i \ne G(x_i)}\!\! w_i^{(m)} \,\Big]$

$=\; \arg\min_G \Big[\, (e^{\beta} - e^{-\beta}) \sum_{i=1}^{N} w_i^{(m)}\, I(y_i \ne G(x_i)) \;+\; e^{-\beta} \sum_{i=1}^{N} w_i^{(m)} \,\Big]$

where $w_i^{(m)} = \exp(-y_i\, f_{m-1}(x_i))$.

Page 33

Boosting As Additive Model

$G_m \;=\; \arg\min_G \Big[\, (e^{\beta} - e^{-\beta}) \sum_{i=1}^{N} w_i^{(m)}\, I(y_i \ne G(x_i)) \;+\; e^{-\beta} \sum_{i=1}^{N} w_i^{(m)} \,\Big] \;=\; \arg\min_G\; \mathrm{err}_m$

$H(\beta) \;=\; (e^{\beta} - e^{-\beta})\,\mathrm{err}_m \;+\; e^{-\beta}$

$\mathrm{err}_m$ is the training error on the weighted samples: $\mathrm{err}_m = \sum_{i=1}^{N} w_i^{(m)} I(y_i \ne G(x_i)) \big/ \sum_{i=1}^{N} w_i^{(m)}$.

The last equation tells us that in each iteration we must find a classifier that minimizes the training error on the weighted samples.

Page 34

Boosting As Additive Model

Now that we have found G, we minimize w.r.t. β:

$H(\beta) \;=\; (e^{\beta} - e^{-\beta})\,\mathrm{err}_m \;+\; e^{-\beta}$

$\dfrac{\partial H(\beta)}{\partial \beta} \;=\; (e^{\beta} + e^{-\beta})\,\mathrm{err}_m \;-\; e^{-\beta} \;=\; 0$

$\Rightarrow\; e^{2\beta}\,\mathrm{err}_m \;-\; (1 - \mathrm{err}_m) \;=\; 0 \;\Rightarrow\; \beta_m \;=\; \dfrac{1}{2}\log\!\left(\dfrac{1 - \mathrm{err}_m}{\mathrm{err}_m}\right)$

Page 35

Boosting (Recall)

• W(x) is the distribution of weights over the N training observations, with ∑ W(xi) = 1
• Initially assign uniform weights W0(x) = 1/N for all x; step k = 0
• At each iteration k:
  – Find the best weak classifier Ck(x) using weights Wk(x)
  – With error rate εk, and based on a loss function:
    • compute αk, the weight of classifier Ck in the final hypothesis
    • for each xi, update the weights based on εk to get Wk+1(xi)
• CFINAL(x) = sign[ ∑ αi Ci(x) ]

Page 36

AdaBoost

• W(x) is the distribution of weights over the N training points, with ∑ W(xi) = 1
• Initially assign uniform weights W0(x) = 1/N for all x.
• At each iteration k:
  – Find the best weak classifier Ck(x) using weights Wk(x)
  – Compute the error rate εk as
    εk = [ ∑ W(xi) ∙ I(yi ≠ Ck(xi)) ] / [ ∑ W(xi) ]
  – Set the weight αk of classifier Ck in the final hypothesis: αk = log((1 – εk)/εk)
  – For each xi: Wk+1(xi) = Wk(xi) ∙ exp[ αk ∙ I(yi ≠ Ck(xi)) ]
• CFINAL(x) = sign[ ∑ αi Ci(x) ]
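A minimal Python sketch of the loop above; the `fit_weak_learner` helper (returning a classifier trained on weighted data) is hypothetical, labels are assumed to be in {-1, +1}, and 0 < εk < 1 is assumed so that αk is well defined.

```python
import numpy as np

def adaboost(X, y, fit_weak_learner, K=50):
    """AdaBoost sketch following the slide above; y holds labels in {-1, +1}."""
    N = len(y)
    w = np.full(N, 1.0 / N)                 # W_0(x) = 1/N
    classifiers, alphas = [], []
    for k in range(K):
        C = fit_weak_learner(X, y, w)       # best weak classifier C_k under W_k
        miss = (C(X) != y).astype(float)    # I(y_i != C_k(x_i))
        eps = np.sum(w * miss) / np.sum(w)  # weighted error rate eps_k
        alpha = np.log((1.0 - eps) / eps)   # alpha_k = log((1 - eps_k) / eps_k)
        w = w * np.exp(alpha * miss)        # up-weight the misclassified points
        classifiers.append(C)
        alphas.append(alpha)
    def C_final(Xq):                        # C_FINAL(x) = sign[ sum_k alpha_k C_k(x) ]
        return np.sign(sum(a * C(Xq) for a, C in zip(alphas, classifiers)))
    return C_final
```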

Page 37

AdaBoost(Example)

Original Training set : Equal Weights to all training samples

Taken from “A Tutorial on Boosting” by Yoav Freund and Rob Schapire

Page 38

AdaBoost (Example)

ROUND 1

Page 39

AdaBoost (Example)

ROUND 2

Page 40

AdaBoost (Example)

ROUND 3

Page 41

AdaBoost (Example)

Page 42

AdaBoost (Characteristics)

• Why the exponential loss function?
  – Computational
    • Simple modular re-weighting
    • The derivative is easy, so determining the optimal parameters is relatively easy
  – Statistical
    • In the two-label case the minimizer is one half the log-odds of P(Y=1|x), so we can use its sign as the classification rule
• Accuracy depends on the number of iterations (how sensitive? we will see soon).

Page 43

Boosting performance

Decision stumps are very simple rules of thumb that test a condition on a single attribute.

Decision stumps formed the individual classifiers whose predictions were combined to generate the final prediction.

The misclassification rate of the Boosting algorithm was plotted against the number of iterations performed.
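For concreteness, a weighted decision stump could be fit by brute force as in the sketch below; the `fit_stump` name and the exhaustive search are our own illustration. It matches the `fit_weak_learner(X, y, w)` signature assumed in the AdaBoost sketch above, so the two can be plugged together.

```python
import numpy as np

def fit_stump(X, y, w):
    """Weighted decision stump: pick the feature, threshold, and polarity
    with the lowest weighted misclassification error; y in {-1, +1}."""
    best_err, best_params = np.inf, None
    for j in range(X.shape[1]):                   # one attribute per stump
        for thr in np.unique(X[:, j]):
            for sign in (1, -1):
                pred = sign * np.where(X[:, j] <= thr, 1, -1)
                err = np.sum(w * (pred != y))     # weighted error of this stump
                if err < best_err:
                    best_err, best_params = err, (j, thr, sign)
    j, thr, sign = best_params
    return lambda Xq: sign * np.where(Xq[:, j] <= thr, 1, -1)
```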

Page 44

Boosting performance

Steep decrease in error

Page 45

Boosting performance

• Pondering how many iterations would be sufficient...
• Observations:
  – The first few (about 50) iterations increase the accuracy substantially, seen in the steep decrease of the misclassification rate.
  – As iterations increase, does the training error keep decreasing? Does the generalization error keep decreasing?

Page 46

Can Boosting do well if?

• Limited training data?
  – Probably not...
• Many missing values?
• Noise in the data?
• Individual classifiers not very accurate?
  – It could, if the individual classifiers have considerable mutual disagreement.

Page 47

Adaboost

• “Probably one of the three most influential ideas in machine learning in the last decade, along with Kernel methods and Variational approximations.”

• The original idea came from Valiant
• Motivation: we want to improve the performance of a weak learning algorithm

Page 48

Adaboost

• Algorithm:

Page 49

Boosting Trees Outline

• Basics of boosting trees
• A numerical optimization problem
• Controlling model complexity and generalization
  – size of trees
  – number of iterations
  – regularization
• Interpreting the final model
  – single variables
  – correlation of variables

Page 50

Boosting Trees : Basics

• Formally, a tree is
  $T(x; \Theta) = \sum_{j=1}^{J} \gamma_j\, I(x \in R_j)$, with parameters $\Theta = \{R_j, \gamma_j\}_1^J$
• The parameters are found by minimizing the empirical risk:
  $\hat{\Theta} = \arg\min_{\Theta} \sum_{j=1}^{J} \sum_{x_i \in R_j} L(y_i, \gamma_j)$
• Finding $\gamma_j$ given $R_j$: typically the mean of the $y_i$ in $R_j$.
  – Finding the $R_j$ is tough, but solutions exist.

Page 51

Basics Continued …

• Approximate criterion for optimizing $\Theta$:
  $\tilde{\Theta} = \arg\min_{\Theta} \sum_{i=1}^{N} \tilde{L}\big(y_i,\, T(x_i; \Theta)\big)$
• The boosted tree model is a sum of such trees, induced in a forward stage-wise manner.
• In the case of binary classification with the exponential loss function, this reduces to AdaBoost.

Page 52

Numerical Optimization

• The loss function is
  $L(f) = \sum_{i=1}^{N} L(y_i, f(x_i))$
• So the problem boils down to finding
  $\hat{\mathbf{f}} = \arg\min_{\mathbf{f}} L(\mathbf{f}), \qquad \mathbf{f} = \{f(x_1), \ldots, f(x_N)\}$
• Numerical optimization procedures solve this as a sum of component vectors (steps).

Page 53

Numerical Optimization Methods

• Steepest descent: compute the gradient of the loss at the current fit,
  $g_{im} = \left[\dfrac{\partial L(y_i, f(x_i))}{\partial f(x_i)}\right]_{f(x_i) = f_{m-1}(x_i)}, \qquad g_m = \{g_{1m}, g_{2m}, \ldots, g_{Nm}\}$
  choose the step size by line search,
  $\rho_m = \arg\min_{\rho}\, L(f_{m-1} - \rho\, g_m)$
  and update
  $f_m = f_{m-1} - \rho_m\, g_m, \qquad f_m = \{f_m(x_1), f_m(x_2), \ldots, f_m(x_N)\}$
• The loss on the training data converges to 0: in the limit $f_m \to \{y_1, y_2, \ldots, y_N\}$.

Page 54

Generalization

• Gradient Boosting
  – We want the algorithm to generalize.
  – The gradient, on the other hand, is defined only at the training data points.
  – So fit a tree T to the negative gradient values by least squares:
    $\tilde{\Theta}_m = \arg\min_{\Theta} \sum_{i=1}^{N} \big({-g_{im}} - T(x_i; \Theta)\big)^2$
• MART: Multiple Additive Regression Trees

Page 55

Algorithm

1. Initialize $f_0(x)$ to a single terminal-node tree.
2. For m = 1 to M:
   a) Compute pseudo-residuals $r_{im}$ based on the loss function:
      $r_{im} = -\left[\dfrac{\partial L(y_i, f(x_i))}{\partial f(x_i)}\right]_{f = f_{m-1}}$
   b) Fit a regression tree to the $r_{im}$, giving terminal regions $R_{jm}$, $j = 1, 2, \ldots, J$.
   c) Find the optimal value of the coefficient within each region:
      $\gamma_{jm} = \arg\min_{\gamma} \sum_{x_i \in R_{jm}} L\big(y_i,\, f_{m-1}(x_i) + \gamma\big)$
   d) Update: $f_m(x) = f_{m-1}(x) + \sum_{j=1}^{J} \gamma_{jm}\, I(x \in R_{jm})$
   End For
3. Output: $\hat{f}(x) = f_M(x)$
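A minimal sketch of this algorithm for the squared-error case, where the pseudo-residuals are simply $y_i - f_{m-1}(x_i)$ and the optimal leaf coefficients are the leaf means already produced by a regression tree. It uses scikit-learn's DecisionTreeRegressor as the tree fitter and already includes the shrinkage factor $\nu$ discussed a few slides below; all names and defaults are our own.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost_trees(X, y, M=100, J=6, nu=0.1):
    """Gradient tree boosting (MART) sketch for squared-error loss."""
    f0 = float(np.mean(y))                  # step 1: single terminal-node tree
    pred = np.full(len(y), f0)
    trees = []
    for m in range(M):
        r = y - pred                        # 2a: pseudo-residuals for squared loss
        tree = DecisionTreeRegressor(max_leaf_nodes=J).fit(X, r)   # 2b + 2c
        pred = pred + nu * tree.predict(X)  # 2d: f_m = f_{m-1} + nu * sum_j gamma_jm I(x in R_jm)
        trees.append(tree)
    def predict(Xq):                        # output: f_M(x)
        return f0 + nu * sum(t.predict(Xq) for t in trees)
    return predict
```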

Page 56

Tuning the Parameters

• The parameters that can be tuned are:
  – the size of the constituent trees, J
  – the number of boosting iterations, M
  – shrinkage
  – penalized regression

Page 57

Right-sized trees

• The optimal tree for one step might not be optimal for the whole algorithm.
  – Using a very large tree (such as C4.5) as the weak learner to fit the residuals assumes each tree is the last one in the expansion; this usually degrades performance and increases computation.
• Solution: restrict the value of J to be the same for all trees.

Page 58

Right sized trees.

• The higher-order interaction effects captured by large trees tend to suffer from inaccuracies.
• J is the factor that controls the order of interactions, so we would like to keep J low.
• In practice, values of $4 \le J \le 8$ are seen to work best.

Page 59

Page 60

Controlling M (Regularization)

• After each iteration the training risk $L(f_M)$ decreases.
• As $M \to \infty$, $L(f_M) \to 0$.
• But this would risk overfitting the training data.
• To avoid this, monitor the prediction risk on a validation sample.
• Other methods in the next chapter.

Page 61

Shrinkage

• Scale the contribution of each tree by a factor $0 < \nu < 1$ to control the learning rate:
  $f_m(x) \;=\; f_{m-1}(x) \;+\; \nu \sum_{j=1}^{J} \gamma_{jm}\, I(x \in R_{jm})$
• $\nu$ and M together control the prediction risk on the training data.
• They are not independent of each other.

Page 62

Page 63

Page 64

Experts: Motivation

• Given a set of experts:
  – no prior information
  – no consistent behavior
  – goal: predict as well as the best expert
• Model:
  – online model
  – input: historical results

Page 65

Expert: Goal

• Match the loss of the best expert.
• Loss:
  – $L_A$ : the loss of the algorithm
  – $L_i$ : the loss of expert i

• Can we hope to do better?

Page 66

Example: Guessing letters

• Setting:
  – an alphabet of k letters
• Loss:
  – 1 for an incorrect guess
  – 0 for a correct guess
• Experts:
  – each expert always guesses a certain (fixed) letter

• Game: guess the most popular letter online.

Page 67

Example 2: Rock-Paper-Scissors

• Two-player game.
• Each player chooses: Rock, Paper, or Scissors.
• Loss matrix:

              Rock   Paper  Scissors
  Rock        1/2    1      0
  Paper       0      1/2    1
  Scissors    1      0      1/2

• Goal: play as well as we can given the opponent.
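As a usage example, the `hedge` sketch from the Hedge slide earlier can be run against this loss matrix; the always-Paper opponent and the variable names below are our own illustration.

```python
import numpy as np

LOSS = np.array([[0.5, 1.0, 0.0],     # our Rock     vs their Rock / Paper / Scissors
                 [0.0, 0.5, 1.0],     # our Paper
                 [1.0, 0.0, 0.5]])    # our Scissors
T = 100
opponent = np.full(T, 1)              # opponent plays Paper (column 1) every round
losses = LOSS[:, opponent].T          # (T, 3) losses of our three pure-strategy "experts"
total_loss, w = hedge(losses, b=0.5)  # hedge() as sketched on the Hedge slide
print(int(w.argmax()))                # weight concentrates on Scissors (index 2)
```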

Page 68

Example 3: Placing a point

• Action: choosing a point d.
• Loss (given the true location y): ||d - y||.
• Experts: one for each point.
• Important: the loss is convex:
  $\|(\lambda d_1 + (1-\lambda) d_2) - y\| \;\le\; \lambda\,\|d_1 - y\| + (1-\lambda)\,\|d_2 - y\|$
• Goal: find a "center".

Page 69

Adaboost

• Line 1: Given training examples x1,…,xm from input space X, with label space Y = {-1, 1}
• Line 2: Initialize a distribution D1 to 1/m, where m is the number of training instances.
• Line 3: for (int t = 0; t < T; t++)
• Line 4: Train the weak learning algorithm using Dt
• Line 5: Get a weak hypothesis ht mapping the input space to the label space; its error is εt
• Line 6: αt = (1/2) ln((1 - εt)/εt)
• Line 7: Dt+1(instancei) = (1/Zt) ∙ Dt(instancei) ∙ e^(-αt) if the hypothesis correctly matched the instance to its label, and ∙ e^(αt) otherwise

Page 70

Adaboost

• Final hypothesis: H(x) = sign(∑ αt ht(x))
• Main ideas:
  – AdaBoost forces the weak learner to focus on incorrectly classified instances
  – Training error decreases exponentially
  – Does boosting overfit?
    • Baum showed generalization error = O(sqrt(Td/m))
    • Schapire showed error = O(sqrt(d/mθ))
    • Does the generalization error depend on T or not? The jury is still out.
  – No overfitting mechanism

Page 71

AdaBoost: Dynamic Boosting

• Better bounds on the error
• No need to "know" γ
• Each round uses a different b
  – as a function of the error

Page 72

AdaBoost: Input

• A sample of size m: <xi, c(xi)>
• A distribution D over examples
  – we will use D(xi) = 1/m
• A weak learning algorithm
• A constant T (the number of iterations)

Page 73

AdaBoost: Algorithm

• Initialization: $w_1(i) = D(x_i)$
• For t = 1 to T do:
  – $p_t(i) = w_t(i) / \sum_j w_t(j)$
  – Call the weak learner with $p_t$
  – Receive $h_t$
  – Compute the error $\epsilon_t$ of $h_t$ on $p_t$
  – Set $b_t = \epsilon_t / (1 - \epsilon_t)$
  – $w_{t+1}(i) = w_t(i)\, (b_t)^{e}$, where $e = 1 - |h_t(x_i) - c(x_i)|$
• Output:
  $h_A(x) = I\!\left[\ \sum_{t=1}^{T} \left(\log\frac{1}{b_t}\right) h_t(x) \ \ge\ \frac{1}{2}\sum_{t=1}^{T} \log\frac{1}{b_t}\ \right]$

Page 74

AdaBoost: Analysis

• Theorem:
  – Given $\epsilon_1, \ldots, \epsilon_T$,
  – the error $\epsilon$ of $h_A$ is bounded by
    $\epsilon \;\le\; 2^T \prod_{t=1}^{T} \sqrt{\epsilon_t (1 - \epsilon_t)}$

Page 75

AdaBoost: Proof

• Let $l_t(i) = 1 - |h_t(x_i) - c(x_i)|$
• By definition: $p_t \cdot l_t = 1 - \epsilon_t$
• Upper bound the sum of weights
  – from the Hedge analysis.
• An error occurs only if
  $\sum_{t=1}^{T} \left(\log\frac{1}{b_t}\right) |h_t(x) - c(x)| \;\ge\; \frac{1}{2}\sum_{t=1}^{T} \log\frac{1}{b_t}$

Page 76

AdaBoost Analysis (cont.)

• Bounding the weight of a point
• Bounding the sum of weights
• Final bound as a function of $b_t$
• Optimizing $b_t$:
  – $b_t = \epsilon_t / (1 - \epsilon_t)$

Page 77

AdaBoost: Fixed bias

• Assume $\epsilon_t = 1/2 - \gamma$
• We bound (using $1 - x \le e^{-x}$):
  $\epsilon \;\le\; (1 - 4\gamma^2)^{T/2} \;\le\; e^{-2\gamma^2 T}$

Page 78

Learning OR with few attributes

• Target function: OR of k literals
• Goal: learn in time
  – polynomial in k and log n
  – with ε and δ constant
• ELIM makes "slow" progress:
  – disqualifies one literal per round
  – may remain with O(n) literals

Page 79

Set Cover - Definition

• Input: $S_1, \ldots, S_t$ with $S_i \subseteq U$
• Output: $S_{i_1}, \ldots, S_{i_k}$ with $\bigcup_j S_{i_j} = U$
• Question: are there k sets that cover U?
• NP-complete

Page 80

Set Cover Greedy algorithm

• j = 0; $U_j = U$; $C = \emptyset$
• While $U_j \ne \emptyset$:
  – let $S_i$ be $\arg\max_i |S_i \cap U_j|$
  – add $S_i$ to C
  – let $U_{j+1} = U_j - S_i$
  – j = j + 1
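A minimal Python sketch of this greedy loop; the `greedy_set_cover` name is ours, and the input sets are assumed to jointly cover U so the loop terminates.

```python
def greedy_set_cover(U, sets):
    """Greedy set cover: U is a set of elements, sets is a list of subsets of U."""
    uncovered = set(U)
    cover = []
    while uncovered:
        # pick the set covering the most still-uncovered elements: arg max |S_i & U_j|
        best = max(sets, key=lambda S: len(S & uncovered))
        cover.append(best)
        uncovered -= best            # U_{j+1} = U_j - S_i
    return cover
```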

Page 81

Set Cover: Greedy Analysis

• At termination, C is a cover.
• Assume there is a cover C' of size k.
• C' is a cover for every $U_j$.
• Some S in C' covers at least $|U_j|/k$ elements of $U_j$.
• Analysis of $U_j$: $|U_{j+1}| \le |U_j| - |U_j|/k$
• Solving the recursion:
  – the number of sets is $j < k \ln |U|$

Page 82

Building an Occam algorithm

• Given a sample S of size m:
  – run ELIM on S
  – let LIT be the resulting set of literals
  – there exist k literals in LIT that classify all of S correctly
• Negative examples:
  – any subset of LIT classifies them correctly

Page 83

Building an Occam algorithm

• Positive examples:
  – search for a small subset of LIT that classifies S+ correctly
  – for a literal z build $T_z = \{x \mid z \text{ satisfies } x\}$
  – there are k sets that cover S+
  – find k ln m sets that cover S+
• Output h = the OR of the k ln m literals
• size(h) < k ln m log 2n
• Sample size: m = O(k log n · log(k log n))
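Putting the pieces together, here is a small sketch of the Occam algorithm on this slide, assuming examples are (x, label) pairs with x a tuple of n booleans, labels in {0, 1}, and a target that really is an OR of literals (so the greedy cover terminates); all names are our own.

```python
def learn_or_of_literals(sample, n):
    """ELIM + greedy set cover; a literal (i, v) is satisfied by x when x[i] == v."""
    literals = {(i, v) for i in range(n) for v in (True, False)}
    # ELIM: a negative example satisfies no literal of the target,
    # so drop every literal it satisfies
    for x, label in sample:
        if not label:
            literals -= {(i, x[i]) for i in range(n)}
    # For each surviving literal z, T_z = positive examples that z satisfies
    positives = [x for x, label in sample if label]
    covers = {z: {k for k, x in enumerate(positives) if x[z[0]] == z[1]}
              for z in literals}
    # Greedily cover the positive examples with about k ln m literals
    uncovered, chosen = set(range(len(positives))), []
    while uncovered:
        z = max(covers, key=lambda z: len(covers[z] & uncovered))
        chosen.append(z)
        uncovered -= covers[z]
    return chosen                    # hypothesis h = OR of the chosen literals
```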

Page 84

Application : Data mining

• Challenges in real-world data mining problems:
  – data have a large number of observations and a large number of variables per observation
  – inputs are a mixture of different kinds of variables
  – missing values, outliers, and variables with skewed distributions
  – results should be obtained fast and should be interpretable
• So off-the-shelf techniques are difficult to come up with.
• Boosted decision trees (AdaBoost or MART) come close to an off-the-shelf technique for data mining.

Page 85

Boosting Trees

Presented by Rishi Sinha

Page 86

Occam Razor

Page 87

Occam algorithm and compression

[Diagram: party A holds the labeled sample S = (x_i, b_i); party B holds only the points x_1, … , x_m.]

Page 88

Compression

• Option 1:
  – A sends B the values b1, … , bm
  – m bits of information
• Option 2:
  – A sends B the hypothesis h
  – Occam: for large enough m, size(h) < m
• Option 3 (MDL):
  – A sends B a hypothesis h and "corrections"
  – complexity: size(h) + size(errors)

Page 89

Independent Component Analysis (ICA)

• This is the first ICA paper
  – My source for this explanation of ICA is "Independent Component Analysis: A Tutorial" by Aapo Hyvarinen and the ICA chapter of "Variational Methods for Bayesian Independent Component Analysis" by Rizwan A. Choudrey