Combining Multiple Learners Ethem Chp. 15 Haykin Chp. 7, pp. 351-370.


Page 1: Combining Multiple Learners Ethem Chp. 15 Haykin Chp. 7, pp. 351-370.

Combining Multiple Learners

Ethem Chp. 15, Haykin Chp. 7, pp. 351-370

Page 2: Combining Multiple Learners Ethem Chp. 15 Haykin Chp. 7, pp. 351-370.

Lecture Notes for E. Alpaydın, Introduction to Machine Learning, © 2004 The MIT Press (V1.1)

Overview

Introduction
  Rationale
Combination Methods
  Static structures
    Ensemble averaging (Sum, Product, Min rule)
    Bagging
    Boosting
    Error Correcting Output Codes
  Dynamic structures
    Mixture of Experts
    Hierarchical Mixture of Experts

Page 3: Combining Multiple Learners Ethem Chp. 15 Haykin Chp. 7, pp. 351-370.


Motivation

When designing a learning machine, we generally make some choices: parameters of the machine, training data, representation, etc.

This implies some sort of variance in performance

Why not keep all machines and average?

...

Page 4: Combining Multiple Learners Ethem Chp. 15 Haykin Chp. 7, pp. 351-370.


Rationale

No Free Lunch theorem: "There is no algorithm that induces the most accurate learner in any domain, all the time." (http://www.no-free-lunch.org/)

Generate a group of base-learners which, when combined, has higher accuracy.

Different learners use different:
- Algorithms: making different assumptions
- Hyperparameters: e.g. the number of hidden nodes in a NN, k in k-NN
- Representations: different features, multiple sources of information
- Training sets: small variations in the sets, or different subproblems

Page 5: Combining Multiple Learners Ethem Chp. 15 Haykin Chp. 7, pp. 351-370.


Reasons to Combine Learning Machines

Lots of different combination methods: Most popular are averaging and majority voting.

Intuitively, it seems as though it should work. We have parliaments of people who vote, and that works … We average guesses of a quantity, and we’ll probably be closer…

[Figure: five base learners d1-d5 each receive the same input; their outputs are combined to produce the final output.]

Page 6: Combining Multiple Learners Ethem Chp. 15 Haykin Chp. 7, pp. 351-370.


Some theory > Reasons to Combine Learning Machines

f_com = vote(f_i, f_j, f_k, f_l, f_m) … why?

…but only if they are independent!

Binomial theorem says… if each of the N voters errs independently with probability p < 1/2, the majority vote is wrong with probability

P(\text{error}) = \sum_{k=\lfloor N/2 \rfloor + 1}^{N} \binom{N}{k}\, p^{k} (1-p)^{N-k}

which shrinks as N grows.

What is the implication?

Use many experts and take a vote
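A minimal numeric sketch of this point: it simply evaluates the majority-vote error formula above for independent voters. The function name majority_vote_error and the example values of p and N are illustrative, not from the slides.

```python
from math import comb

def majority_vote_error(p: float, n: int) -> float:
    """P(majority of n independent voters is wrong), each wrong with probability p."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n // 2 + 1, n + 1))

# With p = 0.3 the ensemble error drops quickly as n grows:
# "use many experts and take a vote".
for n in (1, 5, 15, 25):
    print(n, round(majority_vote_error(0.3, n), 4))
```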

A related theory paper…

Tumer & Ghosh 1996

“Error Correlation and Error Reduction in Ensemble Classifiers”

(makes some assumptions, like equal variances)

Page 7: Combining Multiple Learners Ethem Chp. 15 Haykin Chp. 7, pp. 351-370.


Bayesian perspective (if outputs are posterior probabilities):

P(C_i \mid x) = \sum_{\text{all models } \mathcal{M}_j} P(C_i \mid x, \mathcal{M}_j)\, P(\mathcal{M}_j)

Page 8: Combining Multiple Learners Ethem Chp. 15 Haykin Chp. 7, pp. 351-370.


We want the base learners to be:
- complementary (what if they are all the same or very similar?)
- reasonably accurate, but not necessarily very accurate

Page 9: Combining Multiple Learners Ethem Chp. 15 Haykin Chp. 7, pp. 351-370.


Types of Committee Machines

Static structures: the responses of several experts (individual networks) are combined in a way that does not involve the input signal.
- Ensemble averaging: the outputs of different experts are linearly combined to produce the output of the committee machine.
- Boosting: a "weak" learning algorithm is converted into one that achieves high accuracy.

Dynamic structures: the input signal actuates the mechanism that combines the responses of the experts.
- Mixture of experts: the outputs of different experts are non-linearly combined by means of a single gating network.
- Hierarchical mixture of experts: the outputs of different experts are non-linearly combined by means of several gating networks arranged in a hierarchical fashion.

Page 10: Combining Multiple Learners Ethem Chp. 15 Haykin Chp. 7, pp. 351-370.


Overview

Introduction
  Rationale
Combination Methods
  Static structures
    Ensemble averaging (Sum, Product, Min rule)
    Bagging
    Boosting
    Error Correcting Output Codes
  Dynamic structures
    Mixture of Experts
    Hierarchical Mixture of Experts

Page 11: Combining Multiple Learners Ethem Chp. 15 Haykin Chp. 7, pp. 351-370.


Ensemble Averaging > Voting

Regression:  y = \sum_{j=1}^{L} w_j d_j

Classification:  y_i = \sum_{j=1}^{L} w_j d_{ji}

with  w_j \ge 0  and  \sum_{j=1}^{L} w_j = 1

Page 12: Combining Multiple Learners Ethem Chp. 15 Haykin Chp. 7, pp. 351-370.


Ensemble Averaging > Voting

Regression:  y = \sum_{j=1}^{L} w_j d_j

Classification:  y_i = \sum_{j=1}^{L} w_j d_{ji}

with  w_j \ge 0  and  \sum_{j=1}^{L} w_j = 1

Choosing the weights:
- w_j = 1/L (simple voting):
  - plurality voting: with multiple classes, the class that takes the most votes wins (in regression, all learners affect the value equally)
  - majority voting: with two classes, the class that takes the majority of the votes wins
- w_j set according to the error rate of classifier j, learned over a validation set (more accurate classifiers receive larger weights); a sketch of the weighted vote follows below.
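A minimal sketch of the weighted vote above, assuming the base-learner outputs d_ji are per-class scores (e.g. posterior probabilities). The function name weighted_vote and the example numbers are illustrative.

```python
import numpy as np

def weighted_vote(d, w=None):
    """Combine L base-learner outputs d (shape L x K) with weights w; return the winning class."""
    L = d.shape[0]
    w = np.full(L, 1.0 / L) if w is None else np.asarray(w, dtype=float)
    assert np.all(w >= 0) and np.isclose(w.sum(), 1.0)   # w_j >= 0, sum_j w_j = 1
    y = w @ d                                            # y_i = sum_j w_j * d_ji
    return int(np.argmax(y))

# Three learners, two classes, uniform weights (w_j = 1/L):
d = np.array([[0.8, 0.2],
              [0.4, 0.6],
              [0.9, 0.1]])
print(weighted_vote(d))   # class 0 wins the weighted vote
```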

Page 13: Combining Multiple Learners Ethem Chp. 15 Haykin Chp. 7, pp. 351-370.


Page 14: Combining Multiple Learners Ethem Chp. 15 Haykin Chp. 7, pp. 351-370.


(Krogh & Vedelsby 1995)

Ensemble Averaging > Voting

If we use a committee machine

f_{com} = \frac{1}{M} \sum_{i=1}^{M} d_i

then the error of the combination is guaranteed to be no greater than the average error of its members:

(f_{com} - t)^2 = \frac{1}{M} \sum_{i} (d_i - t)^2 - \frac{1}{M} \sum_{i} (d_i - f_{com})^2 \;\le\; \frac{1}{M} \sum_{i} (d_i - t)^2
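A small numeric check of the decomposition above, with made-up predictions: the squared error of the average equals the average squared error minus the average spread around the ensemble output, so it can never exceed the average error.

```python
import numpy as np

t = 1.0                                   # target
d = np.array([0.6, 1.3, 0.9, 1.8])        # individual predictions d_i
f_com = d.mean()                          # committee output

lhs = (f_com - t) ** 2
rhs = np.mean((d - t) ** 2) - np.mean((d - f_com) ** 2)
print(lhs, rhs)                           # identical: ensemble error <= average error
```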

Page 15: Combining Multiple Learners Ethem Chp. 15 Haykin Chp. 7, pp. 351-370.


Similarly, we can show that if dj are iid:

Bias does not change, variance decreases by 1/L

=> Average over models with low bias and high variance

E[y] = E\Big[\frac{1}{L}\sum_{j} d_j\Big] = \frac{1}{L}\sum_{j} E[d_j] = E[d_j]

\mathrm{Var}(y) = \mathrm{Var}\Big(\frac{1}{L}\sum_{j} d_j\Big) = \frac{1}{L^2}\,\mathrm{Var}\Big(\sum_{j} d_j\Big) = \frac{1}{L^2}\, L\,\mathrm{Var}(d_j) = \frac{1}{L}\,\mathrm{Var}(d_j)

where  \text{Bias}^2: \big(E_D[d(x)] - f(x)\big)^2  and  \text{Variance}: E_D\big[(d(x) - E_D[d(x)])^2\big]

Page 16: Combining Multiple Learners Ethem Chp. 15 Haykin Chp. 7, pp. 351-370.


If we don’t have independent experts, then it has been shown that:

\mathrm{Var}(y) = \frac{1}{L^2}\Big[\sum_{j} \mathrm{Var}(d_j) + 2\sum_{j}\sum_{i<j} \mathrm{Cov}(d_i, d_j)\Big]

This means that Var(y) can be even lower than 1/L Var(dj)

(which is what is obtained in the previous slide)

if the individual experts are dependent, but negatively correlated!
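A small Monte Carlo sketch of this point. Under the extra assumptions of equal variances and a common pairwise correlation rho (the formula above does not require them), the variance of the average becomes (Var(d)/L)(1 + (L-1)rho), which drops below Var(d)/L exactly when rho is negative.

```python
import numpy as np

rng = np.random.default_rng(0)
L, var_d = 5, 1.0
for rho in (0.5, 0.0, -0.2):
    # equicorrelated covariance matrix: var_d on the diagonal, rho * var_d off it
    cov = var_d * (np.full((L, L), rho) + (1 - rho) * np.eye(L))
    d = rng.multivariate_normal(np.zeros(L), cov, size=200_000)
    y = d.mean(axis=1)                                   # equally weighted committee output
    print(rho, y.var(), (var_d / L) * (1 + (L - 1) * rho))   # empirical vs. predicted variance
```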

Page 17: Combining Multiple Learners Ethem Chp. 15 Haykin Chp. 7, pp. 351-370.


Ensemble Averaging

What we can exploit from this fact: combine multiple experts with the same bias and variance using ensemble averaging; then
- the bias of the ensemble-averaged system is the same as the bias of one of the individual experts,
- the variance of the ensemble-averaged system is less than the variance of one of the individual experts.

We can also purposefully overtrain the individual networks (lowering bias at the cost of variance); the extra variance is then reduced by the averaging.

Page 18: Combining Multiple Learners Ethem Chp. 15 Haykin Chp. 7, pp. 351-370.


Ensemble methods

Product rule: assumes that the representations used by different classifiers are conditionally independent.

Sum rule (voting with uniform weights): further assumes that the class posteriors are close to the class priors. Very successful in experiments, despite the very strong assumptions; the committee machine is less sensitive to individual errors.

Min rule, Max rule: can be derived as approximations to the product/sum rules.

The respective assumptions of these rules are analyzed in Kittler et al. 1998.

Page 19: Combining Multiple Learners Ethem Chp. 15 Haykin Chp. 7, pp. 351-370.


We have shown that ensemble methods have the same bias but lower variance, compared to individual experts.

Alternatively, we can analyze the expected error of the ensemble-averaging committee machine to show that it will be less than the average of the errors made by each individual network (see Bishop pp. 365-66 for the derivation, and the Haykin experiment on pp. 355-56 given in the next slides).

Page 20: Combining Multiple Learners Ethem Chp. 15 Haykin Chp. 7, pp. 351-370.


Computer Experiment (Haykin: pp. 187-198 and 355-356)
C1: N([0,0], 1)   C2: N([2,0], 4)

Bayes criterion for the optimum decision boundary: decide C1 if P(C1|x) > P(C2|x).

Bayes decision boundary: circular, centered at [-2/3, 0].

Probability of correct classification by the Bayes-optimal classifier ≈ 81%:
P_correct = 1 - P_error = 1 - (P(C1) P(e|C1) + P(C2) P(e|C2))

Simulation results with different networks (all with 2 hidden nodes): average of 79.4% correct (σ = 0.44) over 20 networks.

Page 21: Combining Multiple Learners Ethem Chp. 15 Haykin Chp. 7, pp. 351-370.


Page 22: Combining Multiple Learners Ethem Chp. 15 Haykin Chp. 7, pp. 351-370.


Page 23: Combining Multiple Learners Ethem Chp. 15 Haykin Chp. 7, pp. 351-370.


Page 24: Combining Multiple Learners Ethem Chp. 15 Haykin Chp. 7, pp. 351-370.


Combining the outputs of 10 networks, the ensemble average achieves an expected error (taken over the training sets D) less than the expected value of the average error of the individual networks, over many trials with different data sets: 80.3% versus 79.4% (average) correct classification, about a 1% difference.

Page 25: Combining Multiple Learners Ethem Chp. 15 Haykin Chp. 7, pp. 351-370.


Overview

Introduction
  Rationale
Combination Methods
  Static structures
    Ensemble averaging (Voting)
    Bagging
    Boosting
    Error Correcting Output Codes
  Dynamic structures
    Mixture of Experts
    Hierarchical Mixture of Experts

Page 26: Combining Multiple Learners Ethem Chp. 15 Haykin Chp. 7, pp. 351-370.


Voting method where base-learners are made different by training over slightly different training sets

Bagging (Bootstrap Aggregating) - Breiman, 1996 (a runnable Python sketch follows at the end of this slide):

take a training set D of size N
for each machine (network / tree / k-NN / etc.):
    build a new training set by sampling N examples, randomly with replacement, from D
    train the machine with the new dataset
end for
output is the average/vote from all machines trained

Resulting base-learners are similar because they are drawn from the same original sample

Resulting base-learners are slightly different due to chance

Ensemble Methods > Bagging
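A runnable sketch of the bagging loop above, assuming scikit-learn is available, decision trees as the (unstable) base learner, and nonnegative integer class labels; bagging_fit and bagging_predict are illustrative names.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, n_machines=25, seed=0):
    rng = np.random.default_rng(seed)
    N = len(X)
    machines = []
    for _ in range(n_machines):
        idx = rng.integers(0, N, size=N)                  # N samples, with replacement
        machines.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return machines

def bagging_predict(machines, X):
    votes = np.stack([m.predict(X) for m in machines])    # (n_machines, n_samples)
    # majority vote over the machines, per sample (labels must be nonnegative ints)
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
```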

Page 27: Combining Multiple Learners Ethem Chp. 15 Haykin Chp. 7, pp. 351-370.


Bagging

Not all data points will be used for training each machine; some of the training set is wasted (on average, a bootstrap sample of size N omits roughly 37% of the distinct examples).

Page 28: Combining Multiple Learners Ethem Chp. 15 Haykin Chp. 7, pp. 351-370.


Bagging is better in 2 out of 3 cases and equal in the third.

Improvements are clear over single experts and even better than a simple ensemble.

Bagging is suitable for unstable learning algorithms

Unstable algorithms change significantly due to small changes in the data, e.g. MLPs and decision trees.

Ensemble Methods > Bagging

Error rates on UCI datasets (10-fold cross-validation). Source: Opitz & Maclin, 1999.

                 Single net   Simple ensemble   Bagging
breast cancer        3.4            3.5            3.4
glass               38.6           35.2           33.1
diabetes            23.9           23.0           22.8

Page 29: Combining Multiple Learners Ethem Chp. 15 Haykin Chp. 7, pp. 351-370.


Overview

Introduction
  Rationale
Combination Methods
  Static structures
    Ensemble averaging (Voting…)
    Bagging
    Boosting
    Error Correcting Output Codes
  Dynamic structures
    Mixture of Experts
    Hierarchical Mixture of Experts

Page 30: Combining Multiple Learners Ethem Chp. 15 Haykin Chp. 7, pp. 351-370.


Ensemble Methods > Boosting

In bagging, generating complementary base-learners is left to chance and to the instability of the learning method.

Boosting (Schapire, 1990; Freund, 1991): try to generate complementary weak base-learners by training the next learner on the mistakes of the previous ones.
- Weak learner: required to perform only slightly better than random (error < 1/2).
- Strong learner: arbitrary accuracy with high probability (PAC).
- Boosting converts a weak learning model into a strong learning model by "boosting" it.
- Kearns and Valiant (1988) posed the question "are the notions of strong and weak learning equivalent?"; Schapire (1990) and Freund (1991) gave the first constructive proof.

Page 31: Combining Multiple Learners Ethem Chp. 15 Haykin Chp. 7, pp. 351-370.


The boosting model consists of component classifiers, which we call "experts", and trains each expert on data sets with different distributions. There are three methods for implementing boosting:
- Filtering: assumes a large source of examples; the examples are discarded or kept during training.
- Subsampling: works with a training sample of fixed size, which is "resampled" according to a probability distribution during training.
- Re-weighting: works with a fixed training sample whose examples are "weighted" by the weak learning algorithm.

Page 32: Combining Multiple Learners Ethem Chp. 15 Haykin Chp. 7, pp. 351-370.


Boosting by filtering – General Approach

Boosting:

take a training set D of size N
do M times:
    train a network on D
    find all examples in D that the network gets wrong
    emphasize those patterns and de-emphasize the others in a new dataset D2
    set D = D2
end loop
output is the average/vote from all machines trained

General method – different types in literature, by filtering, sub-sampling or re-weighting, see Haykin Ch.7 for details.

Page 33: Combining Multiple Learners Ethem Chp. 15 Haykin Chp. 7, pp. 351-370.


Boosting by Filtering

Original boosting algorithm (Schapire, 1990):

Training:
- Divide X into three sets: X1, X2 and X3.
- Use X1 to train c1.
- Feed X2 into c1 and get the estimated labels.
- Take an equal number of correctly and wrongly classified instances (of X2 by c1) to train classifier c2.
  (An online version is possible: flip a coin and, depending on heads/tails, wait for a correctly or a wrongly classified instance; Haykin p. 358.)
- Feed X3 into c1 and c2; add the instances where they disagree to a third training set, which is used to train c3.

Testing:
- Feed the instance to c1 and c2.
- If they agree, take that decision.
- If they don't agree, use c3's decision (= the majority decision).
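A rough sketch of the three-classifier scheme above, assuming scikit-learn and numpy arrays with integer labels. The balanced subsampling and the stump depth are simplifications of Schapire's procedure, and the code assumes c1 and c2 disagree on at least a few X3 instances.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def boost_by_filtering(X1, y1, X2, y2, X3, y3, depth=1):
    c1 = DecisionTreeClassifier(max_depth=depth).fit(X1, y1)

    # c2: equal numbers of X2 instances that c1 gets right and wrong
    right = np.flatnonzero(c1.predict(X2) == y2)
    wrong = np.flatnonzero(c1.predict(X2) != y2)
    k = min(len(right), len(wrong))
    idx = np.concatenate([right[:k], wrong[:k]])
    c2 = DecisionTreeClassifier(max_depth=depth).fit(X2[idx], y2[idx])

    # c3: trained on the X3 instances where c1 and c2 disagree
    dis = np.flatnonzero(c1.predict(X3) != c2.predict(X3))
    c3 = DecisionTreeClassifier(max_depth=depth).fit(X3[dis], y3[dis])
    return c1, c2, c3

def committee_predict(c1, c2, c3, X):
    p1, p2, p3 = c1.predict(X), c2.predict(X), c3.predict(X)
    return np.where(p1 == p2, p1, p3)      # if c1 and c2 agree take that label, else c3 decides
```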

Page 34: Combining Multiple Learners Ethem Chp. 15 Haykin Chp. 7, pp. 351-370.

Notice the effect of emphasizing the error zone of the 1st classifier.

Page 35: Combining Multiple Learners Ethem Chp. 15 Haykin Chp. 7, pp. 351-370.


Committee Machine: 91.79% correct

Page 36: Combining Multiple Learners Ethem Chp. 15 Haykin Chp. 7, pp. 351-370.


Boosting by Filtering – ctd.

Individual experts concentrate on hard-to-learn areas.

The training data for each network comes from a different distribution.

The outputs of the individual networks can be combined by voting or by addition (the latter was found to be better in one study).

Boosting by filtering requires a large amount of training data. Solution: AdaBoost (Freund & Schapire, 1996), a variant of boosting, short for "adaptive boosting".

Page 37: Combining Multiple Learners Ethem Chp. 15 Haykin Chp. 7, pp. 351-370.


AdaBoost (ADAptive BOOSTing)

Modify the probability of drawing an instance x^t for classifier j, based on the probability of error of c_j.

For the next classifier:
- if pattern x^t is correctly classified, its probability of being selected decreases;
- if pattern x^t is NOT correctly classified, its probability of being selected increases.

All learners must have error less than 1/2: use simple, weak learners; if the error is not below 1/2, stop training (the problem gets more difficult for the next classifier).

Page 38: Combining Multiple Learners Ethem Chp. 15 Haykin Chp. 7, pp. 351-370.


AdaBoost Algorithm

1. The initial distribution is uniform over the training sample.

2. The next distribution is computed by multiplying the weight of example i by some number β in (0, 1] if the weak hypothesis classifies the input vector correctly; otherwise, the weight is left unchanged.

3. The weights are normalized.

4. The final hypothesis is a weighted vote of the L weak classifiers
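A compact sketch of steps 1-4, assuming scikit-learn decision stumps as the weak learners and labels y in {-1, +1}. The multiplicative factor beta = eps/(1-eps) and the vote weight ln(1/beta) are the usual AdaBoost choices; the slide only says "some number in (0, 1]", so treat them as an assumption here.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, L=50):
    N = len(X)
    p = np.full(N, 1.0 / N)                     # 1. uniform initial distribution
    learners, alphas = [], []
    for _ in range(L):
        h = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=p)
        pred = h.predict(X)
        eps = max(p[pred != y].sum(), 1e-12)    # weighted error of this weak learner
        if eps >= 0.5:                          # must stay better than random
            break
        beta = eps / (1 - eps)
        p = np.where(pred == y, p * beta, p)    # 2. shrink weights of correct examples
        p /= p.sum()                            # 3. normalize
        learners.append(h)
        alphas.append(np.log(1.0 / beta))       # vote weight of this learner
    return learners, alphas

def adaboost_predict(learners, alphas, X):
    # 4. final hypothesis: weighted vote of the weak classifiers
    votes = sum(a * h.predict(X) for h, a in zip(learners, alphas))
    return np.sign(votes)
```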

Page 39: Combining Multiple Learners Ethem Chp. 15 Haykin Chp. 7, pp. 351-370.


AdaBoost

Generate a sequence of base-learners each focusing on previous one’s errors

(Freund and Schapire, 1996)

Page 40: Combining Multiple Learners Ethem Chp. 15 Haykin Chp. 7, pp. 351-370.


Ensemble Methods > Boosting by filtering

                 Single net   Simple ensemble   Bagging   AdaBoost
breast cancer        3.4            3.5            3.4       4.0
glass               38.6           35.2           33.1      31.1
diabetes            23.9           23.0           22.8      23.3

Error rates on UCI datasets (10-fold cross validation)

Source: Opitz & Maclin, 1999

Page 41: Combining Multiple Learners Ethem Chp. 15 Haykin Chp. 7, pp. 351-370.


Training error falls in each boosting iteration.

Generalization error also tends to fall: improved generalization performance over 22 benchmark problems, equal accuracy in one, worse accuracy in 4 problems [Schapire 1996].

Schapire et al. explain the success of AdaBoost by its property of increasing the margin, with the analysis involving the confidence of the individual classifiers [Schapire 1998].

Page 42: Combining Multiple Learners Ethem Chp. 15 Haykin Chp. 7, pp. 351-370.


Overview

Introduction
  Rationale
Combination Methods
  Static structures
    Ensemble averaging (Voting)
    Bagging
    Boosting
    Error Correcting Output Codes
  Dynamic structures
    Mixture of Experts
    Hierarchical Mixture of Experts

Page 43: Combining Multiple Learners Ethem Chp. 15 Haykin Chp. 7, pp. 351-370.


Error-Correcting Output Codes

K classes; L sub-problems (Dietterich and Bakiri, 1995). The code matrix W specifies each dichotomizer's task in its columns, where the rows are the classes.

One per class (L = K):

W = \begin{pmatrix} +1 & -1 & -1 & -1 \\ -1 & +1 & -1 & -1 \\ -1 & -1 & +1 & -1 \\ -1 & -1 & -1 & +1 \end{pmatrix}

Pairwise (L = K(K-1)/2; not feasible for large K):

W = \begin{pmatrix} +1 & +1 & +1 & 0 & 0 & 0 \\ -1 & 0 & 0 & +1 & +1 & 0 \\ 0 & -1 & 0 & -1 & 0 & +1 \\ 0 & 0 & -1 & 0 & -1 & -1 \end{pmatrix}

Page 44: Combining Multiple Learners Ethem Chp. 15 Haykin Chp. 7, pp. 351-370.


Full code (L = 2^{K-1} - 1):

W = \begin{pmatrix} -1 & -1 & -1 & -1 & -1 & -1 & -1 \\ -1 & -1 & -1 & +1 & +1 & +1 & +1 \\ -1 & +1 & +1 & -1 & -1 & +1 & +1 \\ +1 & -1 & +1 & -1 & +1 & -1 & +1 \end{pmatrix}

With a reasonable L, find W such that the Hamming distances between rows and between columns are maximized.

Voting scheme:  y_i = \sum_{j=1}^{L} w_{ij} d_j

There is no guarantee that the subtasks defined for the dichotomizers will be simple.

The code matrix and the dichotomizers are not optimized together.
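A minimal sketch of ECOC decoding with the pairwise code matrix from the previous slide: each column defines a dichotomizer trained with +1/-1 targets (0 meaning the class is not used), and a new input goes to the class with the largest weighted vote y_i = sum_j w_ij d_j. The example dichotomizer outputs are made up.

```python
import numpy as np

W = np.array([[+1, +1, +1,  0,  0,  0],
              [-1,  0,  0, +1, +1,  0],
              [ 0, -1,  0, -1,  0, +1],
              [ 0,  0, -1,  0, -1, -1]])          # K = 4 classes, L = 6 dichotomizers

def ecoc_decode(d, code=W):
    """d: L dichotomizer outputs in [-1, +1]; returns the predicted class index."""
    y = code @ d                                  # y_i = sum_j w_ij * d_j
    return int(np.argmax(y))

print(ecoc_decode(np.array([+1.0, +0.8, +1.0, -0.9, -1.0, +0.1])))   # -> class 0
```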

Page 45: Combining Multiple Learners Ethem Chp. 15 Haykin Chp. 7, pp. 351-370.


Overview

Introduction
  Rationale
Combination Methods
  Static structures
    Ensemble averaging (Voting…)
    Bagging
    Boosting
    Error Correcting Output Codes (end of the chapter)
  Dynamic structures
    Mixture of Experts
    Stacking
    Cascading

Page 46: Combining Multiple Learners Ethem Chp. 15 Haykin Chp. 7, pp. 351-370.


Dynamic Methods > Mixtures of Experts

Voting where the weights are input-dependent (gating), not constant (Jacobs et al., 1991):

y = \sum_{j=1}^{L} w_j(x)\, d_j

In general, the experts or the gating can be non-linear.

Base learners become experts in different parts of the input space.
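A minimal sketch of the input-dependent combination y = sum_j w_j(x) d_j. Linear experts and a linear softmax gating network are illustrative assumptions; in general both can be non-linear, as noted above.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def mixture_of_experts(x, expert_W, gate_V):
    """x: input (n,); expert_W: (L, n) linear experts; gate_V: (L, n) gating weights."""
    d = expert_W @ x                # expert outputs d_j(x)
    w = softmax(gate_V @ x)         # gating weights w_j(x): nonnegative, sum to 1
    return w @ d                    # input-dependent weighted combination

rng = np.random.default_rng(0)
x = rng.normal(size=4)
print(mixture_of_experts(x, rng.normal(size=(3, 4)), rng.normal(size=(3, 4))))
```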

Page 47: Combining Multiple Learners Ethem Chp. 15 Haykin Chp. 7, pp. 351-370.


Dynamic Methods > Mixtures of Experts

(Jacobs et al, 1991)

[Figure: experts f1-f5 each receive the input; their outputs are combined to produce the final output.]

• Has a nice probabilistic interpretation as a mixture model.

• Many variations in literature: Gaussian mixtures; training with the Expectation-Maximization algorithm, etc.

Treating each expert's output c_i(x) as the center of a Gaussian component (the mixture-model interpretation above):

f_i(x) = \frac{1}{(2\pi)^{n/2}} \exp\!\left(-\tfrac{1}{2}\,\lVert t - c_i(x) \rVert^2\right)

E(x) = -\ln \sum_{i=1}^{M} g_i(x)\, f_i(x)

where t is the desired response, g_i(x) the gating weight of expert i, and n the output dimension.

Page 48: Combining Multiple Learners Ethem Chp. 15 Haykin Chp. 7, pp. 351-370.


Dynamic Methods > Stacking

We cannot train f() on the training data; combiner should learn how the base-learners make errors.

Leave-one-out or k-fold cross validation

Wolpert 1992

Learners should be as different as possible so that they complement each other, ideally by using different learning algorithms.

f need not be linear, it can be a neural network
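A minimal sketch of stacking as described above, assuming scikit-learn and numeric class labels: the combiner f is trained on k-fold out-of-fold predictions of the base learners, so it sees how they err on data they were not trained on. The choice of base learners and of a logistic-regression combiner is illustrative.

```python
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

def stacking_fit(X, y, k=5):
    bases = [DecisionTreeClassifier(max_depth=3), KNeighborsClassifier(3)]
    # out-of-fold predictions become the combiner's training features
    Z = np.column_stack([cross_val_predict(b, X, y, cv=k) for b in bases])
    combiner = LogisticRegression().fit(Z, y)
    bases = [b.fit(X, y) for b in bases]          # refit base learners on all data
    return bases, combiner

def stacking_predict(bases, combiner, X):
    Z = np.column_stack([b.predict(X) for b in bases])
    return combiner.predict(Z)
```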

Page 49: Combining Multiple Learners Ethem Chp. 15 Haykin Chp. 7, pp. 351-370.


Dynamic Methods > Cascading

Cascade learners in order of complexity

Use dj only if preceding ones are not confident

Training must be done on samples for which the previous learner is not confident

Note the difference compared to boosting

Page 50: Combining Multiple Learners Ethem Chp. 15 Haykin Chp. 7, pp. 351-370.


Dynamic Methods > Cascading

Cascading assumes that the classes can be explained by a small number of "rules" of increasing complexity, plus a small set of exceptions not covered by the rules.

Page 51: Combining Multiple Learners Ethem Chp. 15 Haykin Chp. 7, pp. 351-370.


General Rules of Thumb

- Components should exhibit low correlation; this is well understood for regression, less so for classification. "Overproduce-and-choose" is a good strategy.
- Unstable estimators (e.g. neural networks, decision trees) benefit most from ensemble methods; stable estimators like k-NN tend not to benefit. Boosting tends to suffer on noisy data.
- Techniques manipulate either the training data, the architecture of the learner, the initial configuration, or the learning algorithm. Manipulating the training data is seen as the most successful route; the initial configuration is the least successful.
- Uniform weighting is almost never optimal. A good strategy is to set the weighting of a component based on its error on a validation set (more accurate components receiving larger weights).

Page 52: Combining Multiple Learners Ethem Chp. 15 Haykin Chp. 7, pp. 351-370.


References M. Perrone – review on ensemble averaging (1993)

Thomas G. Dietterich. Ensemble Methods in Machine Learning (2000). Proceedings of First International Workshop on Multiple Classifier Systems

David Opitz and Richard Maclin. Popular Ensemble Methods: An Empirical Study (1999). Journal of Artificial Intelligence Research, volume 11, pages 169-198

R. A. Jacobs and M. I. Jordan and S. J. Nowlan and G. E. Hinton. Adaptive Mixtures of Local Experts (1991). Neural Computation, volume 3, number 1, pages 79-87

Simon Haykin - Neural Networks: A Comprehensive Foundation (Chapter 7)

Ensemble bibliography: http://www.cs.bham.ac.uk/~gxb/ensemblebib.php

Boosting resources: http://www.boosting.org