Learning Ensembles
293S T. Yang. UCSB, 2017.
Outline

• Learning Ensembles
• Random Forests
• AdaBoost
Training data: Restaurant example
• Examples described by attribute values (Boolean, discrete, continuous)
• E.g., situations where I will/won't wait for a table.
• Classification of examples is positive (T) or negative (F)
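For concreteness, one such example could be encoded as attribute/value pairs plus a label; the attribute names below follow the classic restaurant domain, and the particular values are made up for illustration:

```python
# One training example from the restaurant domain: attribute values plus the T/F label.
example = {
    "Alternate": True,      # Boolean attribute
    "Patrons": "Full",      # discrete attribute: None / Some / Full
    "WaitEstimate": 30,     # continuous (or binned) attribute, in minutes
    "Hungry": True,
    "Raining": False,
}
label = False               # classification: will NOT wait for a table
```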
A decision tree to decide whether to wait
• Imagine someone making a sequence of decisions.
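One way to picture such a decision sequence is as nested attribute tests; a toy sketch (the particular tests below are illustrative, not the tree from the slide):

```python
def will_wait(patrons, wait_estimate, hungry):
    """Walk a small decision tree: each test narrows the decision until a leaf answers."""
    if patrons == "None":
        return False
    if patrons == "Some":
        return True
    # Patrons == "Full": fall through to the wait estimate.
    if wait_estimate > 60:
        return False
    return hungry            # illustrative leaf: wait only if actually hungry

print(will_wait("Full", wait_estimate=30, hungry=True))   # True
```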
Learning Ensembles
• Learn multiple classifiers separately
• Combine their decisions (e.g., using weighted voting)
• When combining multiple decisions, random errors tend to cancel each other out, while correct decisions are reinforced.

(diagram: the training data is split into Data1, Data2, …, Data m; these are fed to Learner1, Learner2, …, Learner m, producing Model1, Model2, …, Model m, which a model combiner merges into the final model)
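A minimal sketch of the weighted-voting combiner described above; the array shapes and the example weights are illustrative, not from the slides:

```python
import numpy as np

def weighted_vote(predictions, weights):
    """Combine {-1, +1} predictions from m models into one decision per example.

    predictions: array of shape (m, n) -- one row of predictions per model
    weights:     array of shape (m,)   -- voting weight of each model
    """
    predictions = np.asarray(predictions, dtype=float)
    weights = np.asarray(weights, dtype=float)
    scores = weights @ predictions           # weighted sum of votes, shape (n,)
    return np.where(scores >= 0, 1, -1)      # final decision per example

# Three models voting on four examples; the third model carries the most weight.
preds = [[ 1, -1,  1,  1],
         [ 1,  1, -1,  1],
         [-1,  1,  1, -1]]
print(weighted_vote(preds, weights=[0.5, 0.8, 1.0]))
```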
Homogeneous Ensembles

• Use a single, arbitrary learning algorithm, but manipulate the training data so that it learns multiple models.
  § Data1 ≠ Data2 ≠ … ≠ Data m
  § Learner1 = Learner2 = … = Learner m
• Methods for changing the training data:
  § Bagging: resample the training data
  § Boosting: reweight the training data
  § DECORATE: add additional artificial training data
Bagging
• Create ensembles by repeatedly, randomly resampling the training data with replacement (Breiman, 1996).
• Given a training set of size n, create m bootstrap sample sets.
  § Each bootstrap sample set will on average contain 63.2% of the unique training examples; the rest are replicates.
• Combine the m resulting models using majority vote.
• Decreases error by decreasing the variance caused by unstable learners: algorithms (like decision trees) whose output can change dramatically when the training data is changed slightly.
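A minimal bagging sketch along these lines, assuming binary -1/+1 labels and using scikit-learn decision trees as the unstable base learner (the function names are illustrative):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, n_models=25, random_state=0):
    """Train one decision tree per bootstrap sample of the training data (X, y are NumPy arrays)."""
    rng = np.random.default_rng(random_state)
    models = []
    n = len(X)
    for _ in range(n_models):
        idx = rng.integers(0, n, size=n)          # sample n indices with replacement
        tree = DecisionTreeClassifier().fit(X[idx], y[idx])
        models.append(tree)
    return models

def bagging_predict(models, X):
    """Majority vote over the ensemble (labels assumed to be -1/+1)."""
    votes = np.sum([m.predict(X) for m in models], axis=0)
    return np.where(votes >= 0, 1, -1)
```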
Random Forests
• Introduce two sources of randomness: "bagging" and "random input vectors."
  § Each tree is grown using a bootstrap sample of the training data.
  § At each node, the best split is chosen from a random sample of m variables instead of all M variables.
• m is held constant while the forest is grown.
• Each tree is grown to the largest extent possible.
• Bagging with decision trees is the special case of random forests where m = M.
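As a concrete illustration (not from the slides), scikit-learn's RandomForestClassifier exposes both sources of randomness: n_estimators sets the number of bootstrapped trees, and max_features plays the role of m, the random subset of variables tried at each split:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# 100 trees; at each node only sqrt(M) randomly chosen features are considered.
# n_jobs=-1 grows the trees in parallel on all available cores.
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                                n_jobs=-1, random_state=0)
forest.fit(X, y)
print(forest.score(X, y))   # training accuracy

# With max_features=None every split sees all M features,
# which reduces the forest to plain bagging of decision trees.
bagging_like = RandomForestClassifier(n_estimators=100, max_features=None, random_state=0)
```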
Random Forest Algorithm
• Good accuracy without over-fitting
• Fast algorithm (can be faster than growing/pruning a single tree); easily parallelized
• Handles high-dimensional data without much difficulty
Boosting: AdaBoost
Yoav Freund and Robert E. Schapire. "A decision-theoretic generalization of on-line learning and an application to boosting." Journal of Computer and System Sciences, 55(1):119–139, August 1997.

§ Simple, with a theoretical foundation
AdaBoost - Adaptive Boosting

• Uses training-set re-weighting
  § Each training sample carries a weight that determines its probability of being selected for a training set.
• AdaBoost is an algorithm for constructing a "strong" classifier as a linear combination of simple "weak" classifiers.
• The final classification is based on a weighted sum of the weak classifiers.
AdaBoost: An Easy Flow
(diagram: the original training set is re-weighted into Data set 1, Data set 2, …, Data set T; each is fed to Learner1, Learner2, …, LearnerT; training instances that are wrongly predicted by Learner1 are weighted more heavily for Learner2, and so on; the learners' outputs are merged by a weighted combination)
AdaBoost Terminology

• ht(x) … "weak" or basis classifier
• H(x) = sign(Σ_{t=1}^T αt ht(x)) … "strong" or final classifier
• Weak classifier: < 50% error over any distribution
• Strong classifier: thresholded linear combination of the weak classifiers' outputs
And in a Picture
(figure annotations: a training case correctly classified; a training case that has a large weight in this round; a decision tree that has a strong vote)
AdaBoost.M1

• Given a training set X = {(x1, y1), …, (xm, ym)}
• yi ∈ {-1, +1} is the correct label of instance xi ∈ X
• Initialize the distribution D1(i) = 1/m (the weight of each training case)
• For t = 1, …, T:
  § Find a weak classifier ("rule of thumb") ht : X → {-1, +1} with small error εt on Dt.
  § Set αt = ½ log((1 − εt)/εt).
  § Update the distribution Dt on {1, …, m}: Dt+1(i) = Dt(i) exp(−αt yi ht(xi)) / Zt, where Zt normalizes Dt+1 to sum to 1.
• Output the final hypothesis H(x) = sign(Σ_{t=1}^T αt ht(x)).

Note that yi · ht(xi) > 0 if ht classifies xi correctly, and yi · ht(xi) < 0 if it is wrong.
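A minimal sketch of this loop in Python, using one-dimensional decision stumps as the weak learners; the stump search, the helper names, and the epsilon guard are illustrative choices, not from the slides:

```python
import numpy as np

def fit_stump(x, y, D):
    """Return the (weighted error, threshold, polarity) stump with smallest error on D."""
    best = None
    for thresh in np.unique(x):
        for polarity in (1, -1):
            pred = np.where(x < thresh, polarity, -polarity)
            err = D[pred != y].sum()
            if best is None or err < best[0]:
                best = (err, thresh, polarity)
    return best

def adaboost(x, y, T=10):
    """Train T weighted stumps on a 1-D feature x with labels y in {-1, +1}."""
    m = len(x)
    D = np.full(m, 1.0 / m)                       # D1(i) = 1/m
    ensemble = []
    for _ in range(T):
        err, thresh, pol = fit_stump(x, y, D)
        err = max(err, 1e-12)                     # guard against a perfect stump
        alpha = 0.5 * np.log((1 - err) / err)     # alpha_t = 1/2 log((1 - eps_t)/eps_t)
        pred = np.where(x < thresh, pol, -pol)
        D = D * np.exp(-alpha * y * pred)         # reweight: up if wrong, down if right
        D = D / D.sum()                           # divide by Z_t so the weights sum to 1
        ensemble.append((alpha, thresh, pol))
    return ensemble

def predict(ensemble, x):
    """H(x) = sign(sum_t alpha_t h_t(x)), with ties broken toward +1."""
    score = sum(a * np.where(x < t, p, -p) for a, t, p in ensemble)
    return np.where(score >= 0, 1, -1)

# Tiny usage example with made-up data.
x = np.array([0.05, 0.1, 0.3, 0.45, 0.6, 0.8, 0.9])
y = np.array([   1,   1,  -1,   -1,   1,  -1,  -1])
print(predict(adaboost(x, y, T=5), x))
```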
Reweighting
Dt+1(i) ∝ Dt(i) exp(−αt yi ht(xi)):
• y · h(x) = −1 (misclassified): the weight is multiplied by exp(αt) and increases.
• y · h(x) = 1 (correctly classified): the weight is multiplied by exp(−αt) and decreases.
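A small numeric illustration of this update; the α value and the yi · ht(xi) pattern below are made up:

```python
import numpy as np

alpha = 0.42                                  # illustrative round weight
y_h = np.array([ 1,  1, -1,  1, -1])          # yi * ht(xi) per training case
D   = np.full(5, 0.2)                         # uniform weights before the round

D_new = D * np.exp(-alpha * y_h)              # misclassified cases grow, correct ones shrink
D_new /= D_new.sum()                          # renormalize so the weights sum to 1
print(D_new.round(3))
```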
Toy Example
Round 1
Weak classifier: if h1 < 0.2 → +1, else -1
Round 2
Weak classifier: if h2 < 0.8 → +1, else -1
Round 3
Weak classifier: if h3 > 0.7 → +1, else -1
Final Combination

The final classifier is the weighted combination of the three rounds' rules:
• if h1 < 0.2 → +1, else -1
• if h2 < 0.8 → +1, else -1
• if h3 > 0.7 → +1, else -1
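Reading each rule as a threshold test on a single input value x (an assumption about the toy example) and picking made-up α weights, the combination could be evaluated like this:

```python
import numpy as np

h1 = lambda x: np.where(x < 0.2, 1, -1)
h2 = lambda x: np.where(x < 0.8, 1, -1)
h3 = lambda x: np.where(x > 0.7, 1, -1)

alphas = [0.7, 0.5, 0.9]          # hypothetical round weights, not from the slides
x = np.array([0.1, 0.4, 0.75, 0.95])

score = sum(a * h(x) for a, h in zip(alphas, (h1, h2, h3)))
H = np.where(score >= 0, 1, -1)   # final strong classifier H(x)
print(H)
```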
Pros and cons of AdaBoost
Advantages
§ Very simple to implement
§ Does feature selection, resulting in a relatively simple classifier
§ Fairly good generalization

Disadvantages
§ Suboptimal solution
§ Sensitive to noisy data and outliers
References
• Duda, Hart, et al. – Pattern Classification
• Freund – "An Adaptive Version of the Boost by Majority Algorithm"
• Freund – "Experiments with a New Boosting Algorithm"
• Freund, Schapire – "A Decision-Theoretic Generalization of On-line Learning and an Application to Boosting"
• Friedman, Hastie, et al. – "Additive Logistic Regression: A Statistical View of Boosting"
• Jin, Liu, et al. (CMU) – "A New Boosting Algorithm Using Input-Dependent Regularizer"
• Li, Zhang, et al. – "FloatBoost Learning for Classification"
• Opitz, Maclin – "Popular Ensemble Methods: An Empirical Study"
• Ratsch, Warmuth – "Efficient Margin Maximization with Boosting"
• Schapire, Freund, et al. – "Boosting the Margin: A New Explanation for the Effectiveness of Voting Methods"
• Schapire, Singer – "Improved Boosting Algorithms Using Confidence-Rated Predictions"
• Schapire – "The Boosting Approach to Machine Learning: An Overview"
• Zhang, Li, et al. – "Multi-view Face Detection with FloatBoost"
AdaBoost: Training Error Analysis

• Suppose D_{T+1}(i) = (1/m) exp(−yi f(xi)) / Π_t Zt, where f(x) = Σ_t αt ht(x) and H(x) = sign(f(x)).
• Therefore, the training error is (1/m) |{i : H(xi) ≠ yi}|.
• As [H(xi) ≠ yi] ≤ exp(−yi f(xi)) for every i, we have
    (1/m) |{i : H(xi) ≠ yi}| ≤ (1/m) Σ_i exp(−yi f(xi)).
• Considering that Σ_i D_{T+1}(i) = 1, finally:
    (1/m) |{i : H(xi) ≠ yi}| ≤ Π_{t=1}^{T} Zt.

Here {i : H(xi) ≠ yi} can be read as a vector whose i-th element is [H(xi) ≠ yi], and |{i : H(xi) ≠ yi}| is the sum of all the elements in that vector, i.e., the number of misclassified training cases.
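For reference, a compact LaTeX rendering of the same chain of steps (writing [·] for an indicator that is 1 when its argument is true and 0 otherwise):

```latex
\[
\frac{1}{m}\bigl|\{i : H(x_i) \neq y_i\}\bigr|
  = \frac{1}{m}\sum_{i=1}^{m} [\,y_i f(x_i) \le 0\,]
  \le \frac{1}{m}\sum_{i=1}^{m} e^{-y_i f(x_i)}
  = \sum_{i=1}^{m} D_{T+1}(i) \prod_{t=1}^{T} Z_t
  = \prod_{t=1}^{T} Z_t .
\]
```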
AdaBoost: How to Choose αt

• According to the bound (1/m) |{i : H(xi) ≠ yi}| ≤ Π_{t=1}^{T} Zt, minimizing the error bound can be done by greedily minimizing Zt in each round.
• Therefore, we choose
    αt* = argmin_{αt} Zt = argmin_{αt} Σ_i Dt(i) exp(−αt yi ht(xi)).
• Let ui = yi ht(xi) ∈ {−1, +1}. Treating ui as a binary-valued variable, the sum splits into the correctly and incorrectly classified cases:
    Zt = e^{−αt} Σ_{i: ht(xi)=yi} Dt(i) + e^{αt} Σ_{i: ht(xi)≠yi} Dt(i).
• Let εt = Σ_{i: ht(xi)≠yi} Dt(i); since Σ_i Dt(i) = 1, this becomes Zt = (1 − εt) e^{−αt} + εt e^{αt}.
• The right-hand side is minimized when αt = ½ ln((1 − εt)/εt). (Set dZt/dαt = 0 and use Σ_i Dt(i) = 1 to get this solution.)
• In this way, AdaBoost greedily minimizes (a bound on) the training error.
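A quick symbolic check of that minimization step (a sketch using sympy; here A and B stand for the total weight of correctly and incorrectly classified cases, i.e., 1 − εt and εt):

```python
import sympy as sp

alpha, A, B = sp.symbols('alpha A B', positive=True)   # A = 1 - eps_t, B = eps_t
Z = A * sp.exp(-alpha) + B * sp.exp(alpha)              # Z_t as a function of alpha_t

# Stationary point of Z: solving dZ/dalpha = 0 gives alpha = (1/2) log(A/B),
# i.e. alpha_t = (1/2) ln((1 - eps_t) / eps_t).
stationary = sp.solve(sp.Eq(sp.diff(Z, alpha), 0), alpha)
print(stationary)

# The second derivative is positive everywhere, so this stationary point is the minimum.
print(sp.diff(Z, alpha, 2).subs(alpha, stationary[0]))
```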