Transcript of Lecture 20: Bagging and Boosting
Lecture 20: Bagging and Boosting
Hao Helen Zhang
Fall, 2016
Outline
Bagging Methods
  Bagging Trees
  Random Forest
AdaBoost
  Strong Learners vs. Weak Learners
  Motivations
  Algorithms
Additive Models
  AdaBoost and Additive Logistic Regression
Ensemble Methods (Model Averaging)
A machine learning ensemble meta-algorithm designed to improve the stability and accuracy of machine learning algorithms
widely used in statistical classification and regression.
can reduce variance and help to avoid overfitting.
usually applied to decision tree methods, but it can be used with any type of method.
Bootstrap Aggregation (Bagging)
Bagging is a special case of the model averaging approach.
Bootstrap aggregation = Bagging
Bagging leads to "improvements for unstable procedures" (Breiman, 1996), e.g. neural nets, classification and regression trees, and subset selection in linear regression (Breiman, 1994).
On the other hand, it can mildly degrade the performance of stable methods such as K-nearest neighbors (Breiman, 1996).
Basic Idea
Given a standard training set D of size n, bagging
generates B new training sets D_i, each of size n′, by sampling from D uniformly and with replacement. This kind of sample is known as a bootstrap sample.
By sampling with replacement, some observations may be repeated in each D_i.
If n′ = n, then for large n, the set D_i is expected to contain the fraction (1 − 1/e) ≈ 63.2% of the unique examples of D, the rest being duplicates (a small numerical check is sketched below).
The B models are fitted using the above B bootstrap samples and combined by averaging the output (for regression) or by voting (for classification).
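As a quick numerical check of the 63.2% figure (not part of the original slides), a minimal numpy sketch that draws bootstrap index sets and counts the fraction of distinct original observations:

```python
import numpy as np

rng = np.random.default_rng(0)
n, B = 1000, 200

# For each bootstrap sample, draw n indices with replacement and count
# how many distinct original observations it contains.
fractions = [len(np.unique(rng.integers(0, n, size=n))) / n for _ in range(B)]
print(np.mean(fractions))   # close to 1 - 1/e ≈ 0.632
```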
Bagging Procedures
Bagging uses the bootstrap to improve the estimate or prediction of a fit.
Given data $Z = \{(x_1, y_1), \ldots, (x_n, y_n)\}$, we generate $B$ bootstrap samples $Z^{*b}$:
Empirical distribution $\hat{P}$: put equal probability $1/n$ on each $(x_i, y_i)$ (discrete).
Generate $Z^{*b} = \{(x_1^*, y_1^*), \ldots, (x_n^*, y_n^*)\} \sim \hat{P}$, $b = 1, \ldots, B$.
Obtain $\hat{f}^{*b}(x)$, $b = 1, \ldots, B$.
The Monte Carlo estimate of the bagging estimate is
$$\hat{f}_{bag}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}^{*b}(x).$$
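A minimal sketch of this bagging procedure for regression, assuming scikit-learn's DecisionTreeRegressor as the base learner and hypothetical arrays X, y:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagging_fit(X, y, B=100, seed=0):
    """Fit B regression trees, one per bootstrap sample Z*b."""
    rng = np.random.default_rng(seed)
    n = len(y)
    trees = []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)      # bootstrap sample drawn from P-hat
        trees.append(DecisionTreeRegressor().fit(X[idx], y[idx]))
    return trees

def bagging_predict(trees, X):
    """Monte Carlo bagging estimate: average of the B fitted trees."""
    return np.mean([t.predict(X) for t in trees], axis=0)
```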
Properties of Bagging Estimates
Advantages:
Note $\hat{f}_{bag}(x) \longrightarrow E_{\hat{P}} \hat{f}^*(x)$ as $B \to \infty$.
$\hat{f}_{bag}(x)$ typically has smaller variance than $\hat{f}(x)$.
$\hat{f}_{bag}(x)$ differs from $\hat{f}(x)$ only when the latter is a nonlinear or adaptive function of the data.
Bagging Classification Trees
In multiclass (K ≥ 3) problems, there are two scenarios:
(1) $\hat{f}^b(x)$ is an indicator vector, with a single 1 and $K-1$ 0's (hard classification)
(2) $\hat{f}^b(x) = (\hat{p}_1, \ldots, \hat{p}_K)$, the estimates of the class probabilities (soft classification)
The bagged estimate is the average prediction at $x$ from the $B$ trees,
$$\hat{f}_{bag}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}^{*b}(x).$$
Bagging Classification Trees (cont.)
There are two types of averaging:
(1) use the majority vote;
(2) use the averaged probabilities.
The bagged classifier is
$$\hat{G}_{bag}(x) = \arg\max_{k=1,\cdots,K} \hat{f}_{bag,k}(x),$$
where $\hat{f}_{bag,k}$ is the $k$th component of $\hat{f}_{bag}$. A small sketch comparing the two types of averaging is given below.
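A sketch of the two types of averaging (majority vote vs. averaged probabilities), again assuming scikit-learn classification trees and integer class labels 0, ..., K−1; it also assumes every bootstrap sample contains all K classes:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagged_trees(X, y, B=100, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y)
    trees = []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)
        trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return trees

def predict_vote(trees, X):
    # (1) majority vote over the hard class predictions of the B trees
    votes = np.stack([t.predict(X) for t in trees]).astype(int)    # (B, n_test)
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)

def predict_avg_prob(trees, X):
    # (2) average the estimated class probabilities, then take the argmax
    probs = np.mean([t.predict_proba(X) for t in trees], axis=0)   # (n_test, K)
    return probs.argmax(axis=1)
```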
Example
Sample size n = 30, two classes
p = 5 features, each having a standard Gaussian distribution with pairwise correlation Corr(X_j, X_k) = 0.95.
The response Y was generated according to
Pr(Y = 1 | x_1 ≤ 0.5) = 0.2, Pr(Y = 1 | x_1 > 0.5) = 0.8.
The Bayes error is 0.2.
A test sample of size 2,000 was generated from the same population.
We fit
classification trees to the training sample
classification trees to each of 200 bootstrap samples
(A sketch of this simulation setup is given below.)
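The exact simulation code is not given in the lecture; a sketch of one way to generate such data, assuming an equicorrelated Gaussian design:

```python
import numpy as np

rng = np.random.default_rng(1)
p, rho = 5, 0.95
Sigma = rho * np.ones((p, p)) + (1 - rho) * np.eye(p)    # Corr(X_j, X_k) = 0.95

def simulate(n):
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    prob1 = np.where(X[:, 0] <= 0.5, 0.2, 0.8)            # Pr(Y = 1 | x1)
    Y = (rng.uniform(size=n) < prob1).astype(int)
    return X, Y

X_train, Y_train = simulate(30)       # training sample, n = 30
X_test, Y_test = simulate(2000)       # test sample of size 2,000
```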
[Figure 8.9 (Hastie, Tibshirani & Friedman, 2001, Chapter 8): Bagging trees on a simulated dataset. The top left panel shows the original tree (first split at x.2 < 0.39); five trees grown on bootstrap samples are also shown, each splitting on different features (x.1, x.2, x.3, x.4) and at different cutpoints.]
About Bagging Trees
The original tree and five bootstrap trees are all different:
with different splitting features
with different splitting cutpoints
The trees have high variance due to the correlation in the predictors.
Under squared-error loss, averaging reduces variance and leaves bias unchanged.
Therefore, bagging will often decrease MSE.
[Figure 8.10 (Hastie, Tibshirani & Friedman, 2001, Chapter 8): Error curves for the bagging example of Figure 8.9. Shown is the test error of the original tree and of the bagged trees as a function of the number of bootstrap samples, together with the Bayes error. The green points correspond to majority vote, while the purple points average the probabilities.]
About Bagging
Bagging can dramatically reduce the variance of unstable procedures like trees, leading to improved prediction.
Bagging smooths out this variance and hence reduces the test error.
Bagging can stabilize unstable procedures.
The simple structure in the model can be lost due to bagging:
a bagged tree is no longer a tree;
the bagged estimate is not easy to interpret.
Under 0-1 loss for classification, bagging may not help, due to the nonadditivity of bias and variance.
Random Forest
Random forest is an ensemble classifier that consists of many decision trees and outputs the class that is the mode of the classes output by the individual trees.
The algorithm was developed by Breiman (2001) and Cutler.
The method combines Breiman's "bagging" idea and the random selection of features, in order to construct a collection of decision trees with controlled variation.
The selection of a random subset of features is an example of the random subspace method, a way to implement stochastic discrimination.
Learning Algorithm for Building A Tree
Denote the training size by n and the number of variables by p. Assume m < p is the number of input variables to be used to determine the decision at a node of the tree.
Randomly choose n samples with replacement (i.e. take a bootstrap sample).
Use the rest of the samples to estimate the error of the tree, by predicting their classes.
For each node of the tree, randomly choose m variables on which to base the decision at that node. Calculate the best split based on these m variables in the training set.
Each tree is fully grown and not pruned.
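This tree-growing recipe is what off-the-shelf implementations do; a minimal sketch using scikit-learn's RandomForestClassifier, where max_features plays the role of m and the out-of-bag (OOB) samples estimate the error (hypothetical arrays X, y):

```python
from sklearn.ensemble import RandomForestClassifier

# 500 fully grown trees; at each node only a random subset of m = sqrt(p)
# features is considered for the split; each tree sees a bootstrap sample.
rf = RandomForestClassifier(
    n_estimators=500,
    max_features="sqrt",   # m < p variables per split
    bootstrap=True,
    oob_score=True,        # estimate error from the held-out (OOB) samples
    random_state=0,
)
rf.fit(X, y)
print(rf.oob_score_)       # OOB accuracy, i.e. 1 - OOB error estimate
```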
Advantages of Random Forest
highly accurate in many real examples; fast; handles a very large number of input variables
ability to estimate the importance of variables for classification through permutation
generates an internal unbiased estimate of the generalization error as the forest building progresses
imputes missing data and maintains accuracy when a large proportion of the data are missing
provides an experimental way to detect variable interactions
can balance error in unbalanced data sets
computes proximities between cases, useful for clustering, detecting outliers, and (by scaling) visualizing the data
can be extended to unlabeled data, leading to unsupervised clustering, outlier detection and data views
Disadvantages of Random Forest
Random forests are prone to overfitting for some data sets. This is even more pronounced in noisy classification/regression tasks.
Random forests do not handle large numbers of irrelevant features well.
Motivation: Model Averaging
Boosting methods are methods for improving the performance of weak learners.
strong learners: given a large enough dataset, the classifier can learn the target function arbitrarily accurately with probability 1 − τ (where τ > 0 can be arbitrarily small)
weak learners: given a large enough dataset, the classifier can barely learn the target function, with probability 1/2 + τ
The error rate is only slightly better than random guessing.
Can we construct a strong learner from weak learners, and how?
Boosting
Motivation: combines the outputs of many weak classifiers to produce a powerful "committee".
Similar to bagging and other committee-based approaches.
Originally designed for classification problems, but can also be extended to regression problems.
Consider the two-class problem with $Y \in \{-1, 1\}$; the classifier $G(x)$ has training error
$$err = \frac{1}{n} \sum_{i=1}^{n} I(y_i \neq G(x_i)).$$
The expected error rate on future predictions is $E_{X,Y}\, I(Y \neq G(X))$.
Classification Trees
Classification trees can be simple, but often produce noisy or weak classifiers.
Bagging (Breiman 1996): Fit many large trees to bootstrap-resampled versions of the training data, and classify by majority vote.
Boosting (Freund & Schapire 1996): Fit many large or small trees to re-weighted versions of the training data. Classify by a weighted majority vote.
In general, Boosting > Bagging > Single Tree.
Breiman's comment: "AdaBoost ... best off-the-shelf classifier in the world" (1996, NIPS workshop).
AdaBoost (Discrete Boost)
Adaptively resampling the data (Freund & Schapire 1997; winners of the 2003 Gödel Prize):
1. sequentially apply the weak classification algorithm to repeatedly modified versions of the data (re-weighted data)
2. produces a sequence of weak classifiers
$$G_m(x), \quad m = 1, 2, \ldots, M.$$
3. The predictions from the $G_m$'s are then combined through a weighted majority vote to produce the final prediction
$$G(x) = \mathrm{sign}\left(\sum_{m=1}^{M} \alpha_m G_m(x)\right).$$
Here $\alpha_1, \ldots, \alpha_M \geq 0$ are computed by the boosting algorithm.
AdaBoost Algorithm
1. Initialize the observation weights $w_i = 1/n$, $i = 1, \ldots, n$.
2. For $m = 1$ to $M$:
(a) Fit a classifier $G_m(x)$ to the training data using weights $w_i$.
(b) Compute the weighted error
$$err_m = \frac{\sum_{i=1}^{n} w_i I(y_i \neq G_m(x_i))}{\sum_{i=1}^{n} w_i}.$$
(c) Compute the importance of $G_m$ as
$$\alpha_m = \log\left(\frac{1 - err_m}{err_m}\right).$$
(d) Update $w_i \leftarrow w_i \cdot \exp[\alpha_m \cdot I(y_i \neq G_m(x_i))]$, $i = 1, \ldots, n$.
3. Output $G(x) = \mathrm{sign}\left(\sum_{m=1}^{M} \alpha_m G_m(x)\right)$.
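A minimal sketch of this algorithm with decision stumps as the weak learner, assuming scikit-learn's DecisionTreeClassifier and labels y coded as −1/+1:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, M=200):
    """Discrete AdaBoost with stumps; y must take values in {-1, +1}."""
    n = len(y)
    w = np.full(n, 1.0 / n)                  # step 1: w_i = 1/n
    learners, alphas = [], []
    for _ in range(M):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)     # (a) fit G_m using weights w_i
        miss = (stump.predict(X) != y)
        err = np.sum(w * miss) / np.sum(w)   # (b) weighted error err_m (assumes 0 < err < 1)
        alpha = np.log((1 - err) / err)      # (c) importance alpha_m
        w = w * np.exp(alpha * miss)         # (d) up-weight misclassified observations
        learners.append(stump)
        alphas.append(alpha)
    return learners, alphas

def adaboost_predict(learners, alphas, X):
    # step 3: weighted majority vote, G(x) = sign(sum_m alpha_m G_m(x))
    agg = sum(a * g.predict(X) for a, g in zip(alphas, learners))
    return np.sign(agg)
```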
Weights of Individual Weak Learners
In the final rule, the weight of $G_m$ is
$$\alpha_m = \log\left(\frac{1 - err_m}{err_m}\right),$$
where $err_m$ is the weighted error of $G_m$.
The weights $\alpha_m$ weigh the contribution of each $G_m$.
The higher (lower) $err_m$, the smaller (larger) $\alpha_m$.
The principle is to give higher influence (larger weights) to the more accurate classifiers in the sequence.
Data Re-weighting (Modification) Scheme
At each boosting step, we apply the updated weights $w_1, \ldots, w_n$ to the samples $(x_i, y_i)$, $i = 1, \ldots, n$.
Initially, all weights are set to $w_i = 1/n$, so the first step fits the classifier in the usual way.
At step $m = 2, \ldots, M$, we modify the weights for observations individually: increasing the weights of observations misclassified by $G_{m-1}(x)$ and decreasing the weights of observations classified correctly by $G_{m-1}(x)$.
Samples that are difficult to classify correctly receive ever-increasing influence.
Each successive classifier is thereby forced to concentrate on those training observations that are missed by previous ones in the sequence.
The classification algorithm is re-applied to the weighted observations.
[Figure 10.1 (Hastie, Tibshirani & Friedman, 2001, Chapter 10): Schematic of AdaBoost. Classifiers $G_1(x), G_2(x), \ldots, G_M(x)$ are trained on successively re-weighted versions of the training sample and then combined to produce the final classifier $G(x) = \mathrm{sign}\left[\sum_{m=1}^{M} \alpha_m G_m(x)\right]$.]
Power of Boosting
Ten features X1, ...,X10 ∼ N(0, 1)
two classes: $Y = 2 \cdot I\left(\sum_{j=1}^{10} X_j^2 > \chi^2_{10}(0.5)\right) - 1$, where $\chi^2_{10}(0.5)$ is the median of the $\chi^2_{10}$ distribution
sample size n = 2000, test size 10,000
weak classifier: stump (a two-terminal node classification tree)
performance of boosting and comparison with other methods (a simulation sketch is given below):
a stump has a 46% misclassification rate
a 400-node tree has a 26% misclassification rate
boosting has a 12.2% error rate
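A sketch reproducing the setup of this experiment (assumes scipy and scikit-learn; the booster could be the adaboost sketch above or sklearn's AdaBoostClassifier):

```python
import numpy as np
from scipy.stats import chi2
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(2)
threshold = chi2.ppf(0.5, df=10)          # chi^2_10(0.5), the median

def nested_spheres(n):
    X = rng.standard_normal((n, 10))      # X_1, ..., X_10 ~ N(0, 1)
    y = np.where((X ** 2).sum(axis=1) > threshold, 1, -1)
    return X, y

X_train, y_train = nested_spheres(2000)
X_test, y_test = nested_spheres(10000)

stump = DecisionTreeClassifier(max_depth=1).fit(X_train, y_train)
print(np.mean(stump.predict(X_test) != y_test))   # roughly the 46% quoted above
```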
[Figure 10.2 (Hastie, Tibshirani & Friedman, 2001, Chapter 10): Simulated data: test error rate for boosting with stumps, as a function of the number of iterations. Also shown are the test error rates for a single stump and for a 400-node classification tree.]
Boosting And Additive Models
The success of boosting is not very mysterious. The key lies in
$$G(x) = \mathrm{sign}\left(\sum_{m=1}^{M} \alpha_m G_m(x)\right).$$
AdaBoost is equivalent to fitting an additive model using the exponential loss function (a very recent discovery by Friedman et al. (2000)).
AdaBoost was originally motivated from a very different perspective.
Introduction on Additive Models
An additive model typically assumes a functional form
$$f(x) = \sum_{m=1}^{M} \beta_m b(x; \gamma_m),$$
where the $\beta_m$'s are coefficients and the $b(x; \gamma_m)$'s are basis functions of $x$ characterized by $\gamma_m$.
The model $f$ is obtained by minimizing a loss averaged over the training data,
$$\min_{\{\beta_m, \gamma_m\}_1^M} \sum_{i=1}^{n} L\left(y_i, \sum_{m=1}^{M} \beta_m b(x_i; \gamma_m)\right). \qquad (1)$$
It is feasible to rapidly solve the sub-problem of fitting just a single basis function.
Forward Stagewise Fitting for Additive Models
Forward stagewise modeling approximates the solution to (1) by
sequentially adding new basis functions to the expansion without adjusting the parameters and coefficients of those that have already been added.
At iteration m, one solves for the optimal basis function $b(x; \gamma_m)$ and the corresponding coefficient $\beta_m$, which is added to the current expansion $f_{m-1}(x)$.
Previously added terms are not modified.
This process is repeated.
Squared-Error Loss Example
Consider the squared-error loss
$$L(y, f(x)) = [y - f(x)]^2.$$
At the $m$th step, given the current fit $f_{m-1}(x)$, we solve
$$\min_{\beta, \gamma} \sum_i L(y_i, f_{m-1}(x_i) + \beta b(x_i; \gamma)) \Longleftrightarrow \min_{\beta, \gamma} \sum_i [y_i - f_{m-1}(x_i) - \beta b(x_i; \gamma)]^2 = \min_{\beta, \gamma} \sum_i [r_{im} - \beta b(x_i; \gamma)]^2,$$
where $r_{im} = y_i - f_{m-1}(x_i)$ is the current residual.
The term $\beta_m b(x; \gamma_m)$ is the best fit to the current residuals. This produces the updated fit
$$f_m(x) = f_{m-1}(x) + \beta_m b(x; \gamma_m).$$
Forward Stagewise Additive Modeling
1. Initialize $f_0(x) = 0$.
2. For $m = 1$ to $M$:
(a) Compute
$$(\beta_m, \gamma_m) = \arg\min_{\beta, \gamma} \sum_{i=1}^{n} L(y_i, f_{m-1}(x_i) + \beta b(x_i; \gamma)).$$
(b) Set $f_m(x) = f_{m-1}(x) + \beta_m b(x; \gamma_m)$.
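A minimal sketch of forward stagewise additive modeling under squared-error loss, using depth-1 regression trees (stumps) as the basis functions b(x; γ); here β_m is absorbed into the fitted leaf values of each stump (hypothetical arrays X, y):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def forward_stagewise(X, y, M=100):
    """Forward stagewise additive modeling with squared-error loss."""
    f = np.zeros(len(y))                    # 1. initialize f_0(x) = 0
    basis = []
    for _ in range(M):
        r = y - f                           # current residuals r_im
        g = DecisionTreeRegressor(max_depth=1).fit(X, r)   # (a) best single-basis fit to r
        f = f + g.predict(X)                # (b) f_m = f_{m-1} + beta_m * b(x; gamma_m)
        basis.append(g)
    return basis

def stagewise_predict(basis, X):
    return sum(g.predict(X) for g in basis)
```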
Additive Logistic Models and AdaBoost
Friedman et al. (2001) showed that AdaBoost is equivalent to forward stagewise additive modeling
using the exponential loss function
$$L(y, f(x)) = \exp\{-y f(x)\},$$
using individual classifiers $G_m(x) \in \{-1, 1\}$ as basis functions.
The (population) minimizer of the exponential loss function is
$$f^*(x) = \arg\min_f E_{Y|x}\left[e^{-Y f(x)}\right] = \frac{1}{2} \log \frac{\Pr(Y = 1 | x)}{\Pr(Y = -1 | x)},$$
which is equal to one half of the log-odds. So AdaBoost can be regarded as an additive logistic regression model.
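The derivation is one step of calculus: writing p = Pr(Y = 1 | x) and minimizing the conditional expected loss pointwise in f(x),

```latex
\begin{aligned}
E_{Y|x}\big[e^{-Y f(x)}\big] &= p\, e^{-f(x)} + (1-p)\, e^{f(x)},
  \qquad p = \Pr(Y = 1 \mid x),\\
\frac{\partial}{\partial f(x)} E_{Y|x}\big[e^{-Y f(x)}\big]
  &= -p\, e^{-f(x)} + (1-p)\, e^{f(x)} = 0
  \;\Longrightarrow\; e^{2 f(x)} = \frac{p}{1-p},\\
f^*(x) &= \frac{1}{2}\,\log\frac{\Pr(Y = 1 \mid x)}{\Pr(Y = -1 \mid x)}.
\end{aligned}
```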
Forward Stagewise Additive Modeling with Exponential Loss
For exponential loss, the minimization at the $m$th step in forward stagewise modeling becomes
$$(\beta_m, \gamma_m) = \arg\min_{\beta, \gamma} \sum_{i=1}^{n} \exp\{-y_i [f_{m-1}(x_i) + \beta b(x_i; \gamma)]\}.$$
In the context of a weak learner $G$, this is
$$(\beta_m, G_m) = \arg\min_{\beta, G} \sum_{i=1}^{n} \exp\{-y_i [f_{m-1}(x_i) + \beta G(x_i)]\},$$
or equivalently,
$$(\beta_m, G_m) = \arg\min_{\beta, G} \sum_{i=1}^{n} w_i^{(m)} \exp\{-y_i \beta G(x_i)\},$$
where $w_i^{(m)} = \exp\{-y_i f_{m-1}(x_i)\}$.
Iterative Optimization
In order to solve
$$\min_{\beta, G} \sum_{i=1}^{n} w_i^{(m)} \exp\{-y_i \beta G(x_i)\},$$
we take the two-step (profile) approach:
first, fix $\beta > 0$ and solve for $G$, obtaining the minimizer $\hat{G}$;
second, solve for $\beta$ with $G = \hat{G}$.
Recall that both $Y$ and $G(x)$ take only the two values $+1$ and $-1$. So
$$y_i G(x_i) = +1 \Longleftrightarrow y_i = G(x_i), \qquad y_i G(x_i) = -1 \Longleftrightarrow y_i \neq G(x_i).$$
Solving for Gm
$$\begin{aligned}
(\beta_m, G_m) &= \arg\min_{\beta, G} \sum_{i=1}^{n} w_i^{(m)} \exp\{-y_i \beta G(x_i)\} \\
&= \arg\min_{\beta, G}\ e^{\beta} \sum_{y_i \neq G(x_i)} w_i^{(m)} + e^{-\beta} \sum_{y_i = G(x_i)} w_i^{(m)} \\
&= \arg\min_{\beta, G}\ (e^{\beta} - e^{-\beta}) \sum_{y_i \neq G(x_i)} w_i^{(m)} + e^{-\beta} \sum_{i=1}^{n} w_i^{(m)}
\end{aligned}$$
For any $\beta > 0$, the minimizer $G_m$ is the $\{-1, 1\}$-valued function
$$G_m = \arg\min_G \sum_{i=1}^{n} w_i^{(m)} I[y_i \neq G(x_i)],$$
the classifier that minimizes the training error for the weighted data.
Solving for βm
Define
$$err_m = \frac{\sum_{i=1}^{n} w_i^{(m)} I(y_i \neq G_m(x_i))}{\sum_{i=1}^{n} w_i^{(m)}}.$$
Plugging $G_m$ into the objective gives
$$\beta_m = \arg\min_{\beta} \left\{ e^{-\beta} + (e^{\beta} - e^{-\beta})\, err_m \right\} \sum_{i=1}^{n} w_i^{(m)}.$$
The solution is
$$\beta_m = \frac{1}{2} \log\left(\frac{1 - err_m}{err_m}\right).$$
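The solution follows by differentiating the bracketed objective with respect to β and setting the derivative to zero:

```latex
\frac{d}{d\beta}\Big[e^{-\beta} + (e^{\beta} - e^{-\beta})\, err_m\Big]
  = -e^{-\beta}(1 - err_m) + e^{\beta}\, err_m = 0
  \;\Longrightarrow\;
  e^{2\beta} = \frac{1 - err_m}{err_m}
  \;\Longrightarrow\;
  \beta_m = \frac{1}{2}\log\Big(\frac{1 - err_m}{err_m}\Big).
```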
Connection to Adaboost
Since
$$-y\, G_m(x) = 2\, I(y \neq G_m(x)) - 1,$$
we have
$$w_i^{(m+1)} = w_i^{(m)} \exp\{-\beta_m y_i G_m(x_i)\} = w_i^{(m)} \exp\{\alpha_m I(y_i \neq G_m(x_i))\} \exp\{-\beta_m\},$$
where $\alpha_m = 2\beta_m$ and $\exp\{-\beta_m\}$ is constant across the data points. Therefore
the weight update is equivalent to line 2(d) of the AdaBoost algorithm;
line 2(a) of the AdaBoost algorithm is equivalent to solving the weighted minimization problem above.
Weights and Their Update
At the $m$th step, the weight for the $i$th observation is
$$w_i^{(m)} = \exp\{-y_i f_{m-1}(x_i)\},$$
which depends only on $f_{m-1}(x_i)$ but not on $\beta$ or $G(x)$.
At the $(m+1)$th step, we update the weight using the fact that
$$f_m(x) = f_{m-1}(x) + \beta_m G_m(x).$$
This leads to the following update formula:
$$w_i^{(m+1)} = \exp\{-y_i f_m(x_i)\} = \exp\{-y_i (f_{m-1}(x_i) + \beta_m G_m(x_i))\} = w_i^{(m)} \exp\{-\beta_m y_i G_m(x_i)\}.$$
Why Exponential Loss?
Its principal virtue is computational.
Exponential loss concentrates much more influence on observations with large negative margins $y f(x)$. It is especially sensitive to misspecification of class labels.
For more robust losses, the computation is not as easy (use gradient boosting).
Exponential Loss and Cross Entropy
Define $Y' = (Y + 1)/2 \in \{0, 1\}$. The binomial negative log-likelihood loss function is
$$-l(Y, p(x)) = -\left[Y' \log p(x) + (1 - Y') \log(1 - p(x))\right] = \log\left(1 + \exp\{-2 Y f(x)\}\right),$$
where $p(x) = [1 + e^{-2 f(x)}]^{-1}$.
The population minimizers of the deviance $E_{Y|x}[-l(Y, f(x))]$ and of $E_{Y|x}[e^{-Y f(x)}]$ are the same.
Loss Functions for Classification
Choice of loss function matters for finite data sets.
The margin of $f(x)$ is defined as $y f(x)$.
The classification rule is $G(x) = \mathrm{sign}[f(x)]$.
The decision boundary is $f(x) = 0$.
Observations with positive margin $y_i f(x_i) > 0$ are correctly classified.
Observations with negative margin $y_i f(x_i) < 0$ are incorrectly classified.
The goal of a classification algorithm is to produce positive margins as frequently as possible.
Any loss function should penalize negative margins more heavily than positive margins, since observations with positive margins are already correctly classified.
Various Loss Functions
Monotone decreasing loss functions of the margin
Misclassification loss: $I(\mathrm{sign}(f(x)) \neq y)$
Exponential loss: $\exp(-y f)$ (not robust against mislabeled samples)
Binomial deviance: $\log\{1 + \exp(-2 y f)\}$ (more robust against influential points)
SVM loss: $[1 - y f]_+$
Other loss functions
Squared error: $(y - f)^2 = (1 - y f)^2$ (penalizes large positive margins $y f > 1$, i.e. points that are classified correctly with high confidence)
A short plotting sketch of these losses is given below.
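The sketch (assumes matplotlib) plots each loss as a function of the margin yf, with the binomial deviance rescaled to pass through (0, 1) as in the figure that follows:

```python
import numpy as np
import matplotlib.pyplot as plt

m = np.linspace(-2, 2, 400)                      # margin y*f
losses = {
    "misclassification": (m < 0).astype(float),
    "exponential": np.exp(-m),
    "binomial deviance": np.log(1 + np.exp(-2 * m)) / np.log(2),  # scaled through (0, 1)
    "squared error": (1 - m) ** 2,
    "support vector (hinge)": np.maximum(0.0, 1 - m),
}
for name, loss in losses.items():
    plt.plot(m, loss, label=name)
plt.xlabel("margin y f(x)")
plt.ylabel("loss")
plt.ylim(0, 3)
plt.legend()
plt.show()
```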
[Figure 10.4 (Hastie, Tibshirani & Friedman, 2001, Chapter 10): Loss functions for two-class classification. The response is $y = \pm 1$; the prediction is $f$, with class prediction $\mathrm{sign}(f)$. The losses are misclassification $I(\mathrm{sign}(f) \neq y)$, exponential $\exp(-yf)$, binomial deviance $\log(1 + \exp(-2yf))$, squared error $(y - f)^2$, and support vector $[1 - yf]_+$ (see Section 12.3). Each function has been scaled so that it passes through the point $(0, 1)$.]
Brief Summary on AdaBoost
AdaBoost fits an additive model, where the basis functions $G_m(x)$ stagewise optimize the exponential loss.
The population minimizer of the exponential loss is one half of the log-odds.
There are loss functions more robust than squared-error loss (for regression problems) or exponential loss (for classification problems).