Learning with AdaBoost


Page 1: Learning with AdaBoost

Learning with AdaBoost

Fall 2007

Page 2: Learning with AdaBoost

Outline

Introduction and background of Boosting and AdaBoost
AdaBoost algorithm example
AdaBoost algorithm in the current project
Experiment results
Discussion and conclusion


Page 4: Learning with AdaBoost

Boosting Algorithm

Definition of Boosting [1]:

Boosting refers to a general method of producing a very accurate prediction rule by combining rough and moderately inaccurate rules of thumb.

Boosting procedure [2]: Given a set of labeled training examples $\{(x_i, y_i)\}_{i=1}^{N}$, where $y_i$ is the label associated with instance $x_i$. On each round $t = 1, \ldots, T$:

• The booster devises a distribution (importance) $D_t$ over the example set.

• The booster requests a weak hypothesis (rule of thumb) $h_t$ with low error $\epsilon_t$.

After $T$ rounds, the booster combines the weak hypotheses into a single prediction rule.

Page 5: Learning with AdaBoost

Boosting Algorithm (cont'd)

The intuitive idea:

Alter the distribution over the domain in a way that increases the probability of the "harder" parts of the space, thus forcing the weak learner to generate new hypotheses that make fewer mistakes on these parts.

Disadvantages:

• Needs prior knowledge of the accuracies of the weak hypotheses.
• The performance bound depends only on the accuracy of the least accurate weak hypothesis.

Page 6: Learning with AdaBoost

Background of AdaBoost [2]

Page 7: Learning with AdaBoost

AdaBoost Algorithm [2]
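The algorithm itself appeared as a figure on this slide. As a stand-in, here is a minimal MATLAB sketch of binary AdaBoost in the notation above, assuming labels y in {-1, +1} and using a decision stump as WeakLearn (the stump choice is an assumption; the slides do not say which weak learner is used, and all function names here are illustrative):

    function model = adaboost_train(X, y, T)
    % Binary AdaBoost: X is N-by-d, y is N-by-1 in {-1,+1}, T rounds.
        N = size(X, 1);
        D = ones(N, 1) / N;                        % D_1: uniform distribution
        model = struct('stump', {}, 'alpha', {});
        for t = 1:T
            stump = train_stump(X, y, D);          % weak hypothesis h_t
            pred  = stump_predict(stump, X);
            eps_t = sum(D .* (pred ~= y));         % weighted error of h_t
            alpha = 0.5 * log((1 - eps_t) / max(eps_t, 1e-10));
            D = D .* exp(-alpha * y .* pred);      % up-weight the mistakes
            D = D / sum(D);                        % normalize (Z_t)
            model(t).stump = stump;
            model(t).alpha = alpha;
        end
    end

    function yhat = adaboost_predict(model, X)
    % Final rule: weighted majority vote of the weak hypotheses.
        f = zeros(size(X, 1), 1);
        for t = 1:numel(model)
            f = f + model(t).alpha * stump_predict(model(t).stump, X);
        end
        yhat = sign(f);
    end

    function stump = train_stump(X, y, D)
    % Exhaustive search for the stump with minimum weighted error.
        best = inf;
        for j = 1:size(X, 2)
            for th = unique(X(:, j))'
                for s = [-1, 1]
                    pred = stump_predict(struct('j', j, 'th', th, 's', s), X);
                    err  = sum(D .* (pred ~= y));
                    if err < best
                        best  = err;
                        stump = struct('j', j, 'th', th, 's', s);
                    end
                end
            end
        end
    end

    function pred = stump_predict(stump, X)
        pred = stump.s * sign(X(:, stump.j) - stump.th);
        pred(pred == 0) = stump.s;                 % break ties consistently
    end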

Page 8: Learning with AdaBoost

Advantages of AdaBoost

• AdaBoost adjusts adaptively to the errors of the weak hypotheses returned by WeakLearn.

• Unlike conventional boosting algorithms, the prior error need not be known ahead of time.

• The update rule reduces the probability assigned to those examples on which the hypothesis makes a good prediction and increases the probability of the examples on which the prediction is poor (the update is written out below).
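Written out, a standard form of this update (assuming the binary {-1, +1} setting; the slide states it only in words):

$$D_{t+1}(i) = \frac{D_t(i)\,\exp\!\big(-\alpha_t\, y_i\, h_t(x_i)\big)}{Z_t}, \qquad \alpha_t = \tfrac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t},$$

where $Z_t$ normalizes $D_{t+1}$ to a distribution. Since $\alpha_t > 0$ whenever $\epsilon_t < 1/2$, the exponent is negative exactly when $h_t(x_i) = y_i$, so correctly classified examples are down-weighted and misclassified ones are up-weighted.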

Page 9: Learning with AdaBoost

The error bound [3]

Suppose the weak learning algorithm WeakLearn, when called by AdaBoost, generates hypotheses with errors $\epsilon_1, \ldots, \epsilon_T$. Then the error $\epsilon = \Pr_{i \sim D}[h_f(x_i) \neq y_i]$ of the final hypothesis $h_f$ output by AdaBoost is bounded above by

$$\epsilon \le 2^T \prod_{t=1}^{T} \sqrt{\epsilon_t (1 - \epsilon_t)}.$$

Note that the errors generated by WeakLearn are not uniform, and the final error depends on the errors of all of the weak hypotheses. Recall that the errors of the previous boosting algorithms depend only on the maximal error of the weakest hypothesis and ignore the advantage that can be gained from hypotheses whose errors are smaller.

Page 10: Learning with AdaBoost

Outline

Introduction and background of Boosting and AdaBoost
AdaBoost algorithm example
AdaBoost algorithm in the current project
Experiment results
Discussion and conclusion

Page 11: Learning with AdaBoost

A toy example [2]

Training set: 10 points (represented by plus or minus).

Initial status: equal weights for all training samples.

Page 12: Learning with AdaBoost

A toy example (cont'd)

Round 1: Three "plus" points are not correctly classified; they are given higher weights.

Page 13: Learning with AdaBoost

A toy example (cont'd)

Round 2: Three "minus" points are not correctly classified; they are given higher weights.

Page 14: Learning with AdaBoost

A toy example (cont'd)

Round 3: One "minus" and two "plus" points are not correctly classified; they are given higher weights.

Page 15: Learning with AdaBoost

A toy example (cont'd)

Final classifier: combine the three "weak" classifiers to obtain a final strong classifier.
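For concreteness, in Schapire's version of this toy example the three rounds have errors ε1 = 0.30, ε2 = 0.21, ε3 = 0.14 (numbers from the cited tutorial [2]; treat them as indicative), giving hypothesis weights via α_t = ½ ln((1 - ε_t)/ε_t):

    α1 = ½ ln(0.70/0.30) ≈ 0.42
    α2 = ½ ln(0.79/0.21) ≈ 0.65
    α3 = ½ ln(0.86/0.14) ≈ 0.92

so the final rule is H(x) = sign(0.42 h1(x) + 0.65 h2(x) + 0.92 h3(x)).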

Page 16: Learning with AdaBoost

Outline

Introduction and background of Boosting and AdaBoost
AdaBoost algorithm example
AdaBoost algorithm in the current project
Experiment results
Discussion and conclusion

Page 17: Learning with AdaBoost

Look at AdaBoost [3] again

Page 18: Learning with AdaBoost

AdaBoost (cont'd): Multi-class Extensions

The previous discussion is restricted to binary classification problems. The label set Y could have any number of labels, giving a multi-class problem.

The multi-class case (AdaBoost.M1) requires the accuracy of each weak hypothesis to be greater than ½. This condition is stronger in the multi-class case than in the binary case: with k classes, random guessing achieves only 1/k accuracy, so exceeding ½ demands much more of the weak learner.

Page 19: Learning with AdaBoost

AdaBoost.M1
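The M1 pseudocode was shown here as a figure. A minimal MATLAB sketch of AdaBoost.M1 following the description in [3] (weak_train and weak_predict are hypothetical placeholders for whatever WeakLearn is plugged in; labels are assumed to be integers 1..K):

    function model = adaboost_m1_train(X, y, T)
    % AdaBoost.M1: X is N-by-d, y is N-by-1 with integer labels in 1..K.
        N = size(X, 1);
        D = ones(N, 1) / N;                   % initial uniform distribution
        model = struct('h', {}, 'beta', {});
        for t = 1:T
            h     = weak_train(X, y, D);      % weak hypothesis h_t
            miss  = (weak_predict(h, X) ~= y);
            eps_t = sum(D .* miss);           % weighted error
            if eps_t >= 0.5                   % M1 requires error < 1/2
                break;
            end
            beta = eps_t / (1 - eps_t);
            D(~miss) = D(~miss) * beta;       % shrink correct examples
            D = D / sum(D);                   % renormalize
            model(t).h = h;
            model(t).beta = beta;
        end
    end

    function yhat = adaboost_m1_predict(model, X, K)
    % Final hypothesis: weighted plurality vote over the K classes.
        votes = zeros(size(X, 1), K);
        for t = 1:numel(model)
            p = weak_predict(model(t).h, X);  % N-by-1 class predictions
            w = log(1 / model(t).beta);       % vote weight of round t
            idx = sub2ind(size(votes), (1:size(X, 1))', p);
            votes(idx) = votes(idx) + w;
        end
        [~, yhat] = max(votes, [], 2);
    end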

Page 20: Learning with AdaBoost

Error Upper Bound of AdaBoost.M1 [3]

As in the binary classification case, the error of the final hypothesis is bounded:

$$\epsilon \le 2^T \prod_{t=1}^{T} \sqrt{\epsilon_t (1 - \epsilon_t)}.$$

Page 21: Learning with AdaBoost

How does AdaBoost.M1 work? [4]

Page 22: Learning with AdaBoost

AdaBoost in our project

Page 23: Learning with AdaBoost

AdaBoost in our project

1) The initialization sets the total weight of the target class equal to that of all the other stuff (see the sketch after this list):

bird[1,…,10] = ½ × 1/10;
otherstaff[1,…,690] = ½ × 1/690;

2) A history record is preserved to strengthen the weight-updating process.

3) The unified model obtained from CPM alignment is used for the training process.
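A minimal MATLAB sketch of that asymmetric initialization (variable names are illustrative, not from the project code):

    n_bird  = 10;    % target-class examples
    n_other = 690;   % everything else
    D = [ones(n_bird, 1)  * 0.5 / n_bird; ...
         ones(n_other, 1) * 0.5 / n_other];
    % each class gets half the mass: 10*(0.5/10) + 690*(0.5/690) = 1
    assert(abs(sum(D) - 1) < 1e-12);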

Page 24: Learning with AdaBoost

AdaBoost in our project

2) The history record.

[Figure: weight histograms with and without the history record.]

Page 25: Learning with AdaBoost

AdaBoost in our project

3) The unified model obtained from CPM alignment is used for the training process. This has reduced the overfitting problem.

3.1) Overfitting problem.

3.2) CPM model.

Page 26: Learning with AdaBoost

AdaBoost in our project

3.1) Overfitting problem.

Why does the trained AdaBoost not work for birds 11~20? I have compared:

I) the rank of the alpha value for each of the 60 classifiers;
II) how each classifier actually detected birds in the training process;
III) how each classifier actually detected birds in the test process.

The covariance is also computed for comparison:

K>> cov(c(:,1), c(:,2))
ans =
  305.0000    6.4746
    6.4746  305.0000
K>> cov(c(:,1), c(:,3))
ans =
  305.0000   92.8644
   92.8644  305.0000
K>> cov(c(:,2), c(:,3))
ans =
  305.0000  -46.1186
  -46.1186  305.0000

Overfitted! The training data differ from the test data; this is very common.

Page 27: Learning with AdaBoost

AdaBoost in our project

Train result (covariance: 6.4746)

Page 28: Learning with AdaBoost

AdaBoost in our project

Comparison: train & test result (covariance: 92.8644)

Page 29: Learning with AdaBoost

AdaBoost in our project

3.2) CPM: the continuous profile model, put forward by Jennifer Listgarten. It is very useful for data alignment.

Page 30: Learning with AdaBoost

AdaBoost in our project

The alignment results from the CPM model:

[Figure: two panels: "Unaligned and Unscaled Data" plotted against experimental time, and "Aligned and Scaled Data" plotted against latent time.]

Page 31: Learning with AdaBoost

AdaBoost in our project

The unified model from CPM alignment:

[Figure: two panels: the model without resampling, and after upsampling and downsampling.]

Page 32: Learning with AdaBoost

AdaBoost in our project

The influence of CPM on the history record:

[Figure: two panels: "History Record (using CPM alignment)" and "History Record (without CPM alignment)".]

Page 33: Learning with AdaBoost

Outline

Introduction and background of Boosting and AdaBoost
AdaBoost algorithm example
AdaBoost algorithm in the current project
Experiment results
Discussion and conclusion

Page 34: Learning with AdaBoost

Browse all birds

Page 35: Learning with AdaBoost

Curvature Descriptor

Page 36: Learning with AdaBoost

Distance Descriptor

Page 37: Learning with AdaBoost

AdaBoost without CPM

Page 38: Learning with AdaBoost

AdaBoost without CPM (cont'd)

Page 39: Learning with AdaBoost

Good_Part_Selected (AdaBoost without CPM, cont'd)

Page 40: Learning with AdaBoost

AdaBoost without CPM (cont'd): The Alpha Values

Other statistical data: zero rate: 0.5333; covariance: 0.0074; median: 0.0874

0.075527 0 0.080877 0.168358 0 0

0 0 0.146951 0.007721 0.218146 0

0.081063 0 0 0.060681 0 0

0.197824 0 0.08873 0 0.080742 0.015646

0 0.080659 0.269843 0 0.028159 0

0 0.19772 0.086019 0.217678 0 0.21836

0 0.080554 0 0 0 0.190074

0 0.21237 0 0 0 0

0 0.060744 0 0 0 0

0.179449 0.338801 0.080667 0.080895 0 0.267993

Page 41: Learning with AdaBoost

AdaBoost with CPM

Page 42: Learning with AdaBoost

AdaBoost with CPM (cont'd)

Page 43: Learning with AdaBoost

AdaBoost with CPM (cont'd)

Page 44: Learning with AdaBoost

Good_Part_Selected (AdaBoost with CPM, cont'd)

Page 45: Learning with AdaBoost

AdaBoost with CPM (cont'd): The Alpha Values

Other statistical data: zero rate: 0.6167; covariance: 0.9488; median: 1.6468

2.521895 0 2.510827 0.714297 0 0

1.646754 0 0 0 0 0

2.134926 0 2.167948 0 2.526712 0

0.279277 0 0 0 0.0635 2.322823

0 0 2.516785 0 0 0

0 0.04174 0 0.207436 0 0

0 0 1.30396 0 0 0.951666

0 2.513161 2.530245 0 0 0

0 0 0 0.041627 2.522551 0

0.72565 0 2.506505 1.303823 0 1.611553

Page 46: Learning with AdaBoost

Outline

Introduction and background of Boosting and AdaBoost
AdaBoost algorithm example
AdaBoost algorithm in the current project
Experiment results
Discussion and conclusion

Page 47: Learning with AdaBoost

Conclusion and discussion

1) AdaBoost works with the CPM unified model; this model has smoothed the training data set and decreased the influence of overfitting.

2) The influence of the history record is very interesting: it suppresses noise and strengthens the WeakLearn boosting direction.

3) The step length of the KNN selected by AdaBoost is not discussed here; it is also useful for suppressing noise.

Page 48: Learning with AdaBoost

Conclusion and discussion (cont'd)

4) AdaBoost does not rely on the training order: the obtained alpha values have a very similar distribution for all the classifiers. Two examples follow (a similarity check is sketched after the listing).

Example 1: four different training orders gave the following alpha values (6 birds each):

Alpha_All1 = 0.4480  0.1387  0.2074  0.5949  0.5868  0.3947  0.3874  0.5634  0.6694  0.7447
Alpha_All2 = 0.3998  0.0635  0.2479  0.6873  0.5868  0.2998  0.4320  0.5581  0.6946  0.7652
Alpha_All3 = 0.4191  0.1301  0.2513  0.5988  0.5868  0.2920  0.4286  0.5503  0.6968  0.7134
Alpha_All4 = 0.4506  0.0618  0.2750  0.5777  0.5701  0.3289  0.5948  0.5857  0.7016  0.6212
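One way to quantify the "very similar distribution" claim (a hedged sketch; the slides do not say how similarity was judged):

    % Stack the four runs and compute pairwise correlations.
    A = [0.4480 0.1387 0.2074 0.5949 0.5868 0.3947 0.3874 0.5634 0.6694 0.7447;
         0.3998 0.0635 0.2479 0.6873 0.5868 0.2998 0.4320 0.5581 0.6946 0.7652;
         0.4191 0.1301 0.2513 0.5988 0.5868 0.2920 0.4286 0.5503 0.6968 0.7134;
         0.4506 0.0618 0.2750 0.5777 0.5701 0.3289 0.5948 0.5857 0.7016 0.6212];
    R = corrcoef(A')   % 4x4; off-diagonal entries near 1 => order-insensitive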

Page 49: Learning with AdaBoost

Conclusion and discussion (cont'd)

Page 50: Learning with AdaBoost

Conclusion and discussion (cont'd)

Example 2: 60 parts from the Curvature Descriptor, 60 from the Distance Descriptor.

1) They are first trained independently;

2) then they are combined and trained together.

The results are as follows:

Page 51: Learning with AdaBoost

Conclusion and discussion (cont'd)

Page 52: Learning with AdaBoost

Conclusion and discussion (cont'd)

5) How to combine the curvature and distance descriptors is another important problem. Currently I can obtain nice results by combining them: all 10 birds are found.

Are they stable for all other classes? How should the improved AdaBoost be integrated to combine the two descriptors? Maybe AdaBoost will improve even further (for general stuff, for example, elephant or camel).

Page 53: Learning with AdaBoost

Conclusion and discussion (cont'd)

Current results without AdaBoost:

Page 54: Learning with AdaBoost

Conclusion and discussion (cont'd)

6) What is the influence of the search order? Could we try reversing the search order? My current result has improved by one more bird, but not by much.

7) How many models could we obtain from the CPM model? Currently I am using only one unified model.

8) Why does the rescaled model not work? (I do not think curvature is that sensitive to rescaling.)

9) Could we try boosting a neural network?

Page 55: Learning with AdaBoost

Conclusion and discussion (cont'd)

10) Could we change the boosting function? Currently I am using the logistic-regression projection function to transmit the error information to the alpha value; there are many other methods for this, for example C4.5, decision stump, decision table, naïve Bayes, voted perceptron, ZeroR, etc.

11) How could a decision tree replace AdaBoost? I think this would impede the search speed, but I am not sure about the quality.

Page 56: Learning with AdaBoost

Conclusion and discussion (cont'd)

12) How about fuzzy SVM, or SVM, to address this good-parts selection problem?

13) How should we understand the difference between good parts selected by computer and by human? (Do the parts from the computer program have similar semantic meaning?)

14) How about the stability of the Curvature and Distance Descriptors?

Page 57: Learning with AdaBoost

Thanks!

Page 58: Learning with AdaBoost

References

[1] Y. Freund and R. Schapire, "A Short Introduction to Boosting."

[2] R. Schapire, "The Boosting Approach to Machine Learning," Princeton University.

[3] Y. Freund and R. Schapire, "A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting."

[4] R. Polikar, "Ensemble Based Systems in Decision Making," IEEE Circuits and Systems Magazine, vol. 6, no. 3, pp. 21-45, 2006.