
    Bayesian Averaging of Classifiers and the Overfitting Problem

    Pedro Domingos ([email protected])

    Department of Computer Science and Engineering, University of Washington, Seattle, WA 98195, U.S.A.

    Abstract

    Although Bayesian model averaging is theoretically the optimal method for combining learned models, it has seen very little use in machine learning. In this paper we study its application to combining rule sets, and compare it with bagging and partitioning, two popular but more ad hoc alternatives. Our experiments show that, surprisingly, Bayesian model averaging's error rates are consistently higher than the other methods'. Further investigation shows this to be due to a marked tendency to overfit on the part of Bayesian model averaging, contradicting previous beliefs that it solves (or avoids) the overfitting problem.

    1. Introduction

    A learner's error rate can often be much reduced by learning several models instead of one, and then combining them in some way to make predictions (Drucker et al., 1994; Freund & Schapire, 1996; Quinlan, 1996; Maclin & Opitz, 1997; Bauer & Kohavi, 1999). In recent years a large number of more or less ad hoc methods for this purpose have been proposed and successfully applied, including bagging (Breiman, 1996a), boosting (Freund & Schapire, 1996), stacking (Wolpert, 1992), error-correcting output codes (Kong & Dietterich, 1995), and others. Bayesian learning theory (Bernardo & Smith, 1994; Buntine, 1990) provides a potential explanation for their success, and an optimal method for combining models.

    In the Bayesian view, using a single model to make predictions ignores the uncertainty left by finite data as to which is the correct model; thus all possible models in the model space under consideration should be used when making predictions, with each model weighted by its probability of being the correct model. This posterior probability is the product of the model's prior probability, which reflects our domain knowledge (or assumptions) before collecting data, and the likelihood, which is the probability of the data given the model. This method of making predictions is called Bayesian model averaging.

    Given the correct model space and prior distribution, Bayesian model averaging is the optimal method for making predictions; in other words, no other approach can consistently achieve lower error rates than it does. Bayesian model averaging has been claimed to obviate the overfitting problem, by canceling out the effects of different overfitted models until only the true main effect remains (Buntine, 1990). In spite of this, it has not been widely used in machine learning (some exceptions are Buntine (1990); Oliver and Hand (1995); Ali and Pazzani (1996)). In this paper we apply it to the combination of rule models and find that, surprisingly, it produces consistently higher error rates than two ad hoc methods (bagging and partitioning). Several possible causes for this are empirically rejected, and we show that Bayesian model averaging performs poorly because, contrary to previous belief, it is highly sensitive to overfitting.

    The next section briefly introduces the basic notions of Bayesian theory, and their application to classification problems. We then describe the experiments conducted and discuss the results.

    2. Bayesian Model Averaging in Classification

    In classification learning problems, the goal is to correctly predict the classes of unseen examples given a training set of classified examples. Given a space of possible models, classical statistical inference selects the single model with highest likelihood given the training data and uses it to make predictions. Modern Bayesian learning differs from this approach in two main respects (Buntine, 1990; Bernardo & Smith, 1994; Chickering & Heckerman, 1997): the computation of posterior probabilities from prior probabilities and likelihoods, and their use in model averaging. In Bayesian theory, each candidate model in the model space is explicitly assigned a prior probability, reflecting our subjective degree of belief that it is the correct model, prior to seeing the data.

    Let n be the training set size, x the examples in the training set, c the corresponding class labels, and h a model (or hypothesis) in the model space H. Then, by Bayes' theorem, and assuming the examples are drawn independently, the posterior probability of h given (x, c) is given by:

        Pr(h|x, c) = [Pr(h) / Pr(x, c)] ∏_{i=1}^{n} Pr(xi, ci|h)    (1)

    where Pr(h) is the prior probability of h, and the product of Pr(xi, ci|h) terms is the likelihood. The data prior Pr(x, c) is the same for all models, and can thus be ignored. In order to compute the likelihood it is necessary to compute the probability of a class label ci given an unlabeled example xi and a hypothesis h, since Pr(xi, ci|h) = Pr(xi|h) Pr(ci|xi, h). This probability, Pr(ci|xi, h), can be called the noise model, and is distinct from the classification model h, which simply produces a class prediction with no probabilities attached.¹

    In the literature on computational learning theory, a uniform class noise model is often assumed (see, e.g., Kearns and Vazirani (1994)). In this model, each example's class is corrupted with probability ε, and thus Pr(ci|xi, h) = 1 − ε if h predicts the correct class ci for xi, and Pr(ci|xi, h) = ε if h predicts an incorrect class. Equation 1 then becomes:

        Pr(h|x, c) ∝ Pr(h) (1 − ε)^s ε^(n−s)    (2)

    where s is the number of examples correctly classified by h, and the noise level ε can be estimated by the model's average error rate. An alternative approach (Buntine, 1990) is to rely on the fact that, implicitly or explicitly, every classification model divides the instance space into regions, and labels each region with a class. For example, if the model is a decision tree (Quinlan, 1993), each leaf corresponds to a region. A class probability can then be estimated separately for each region, by making:

        Pr(ci|xi, h) = n_{r,ci} / n_r    (3)

    where r is the region xi is in, n_r is the total number of training examples in r, and n_{r,ci} is the number of examples of class ci in r.
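    The two noise models above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's code; the function names and the example numbers are my own:

```python
import math

def log_posterior_uniform_noise(log_prior, s, n, eps):
    """Log of Equation 2: log Pr(h) + s*log(1 - eps) + (n - s)*log(eps),
    where s is the number of training examples h classifies correctly."""
    return log_prior + s * math.log(1.0 - eps) + (n - s) * math.log(eps)

def region_class_probability(n_r_c, n_r):
    """Equation 3: the fraction of the training examples falling in
    region r that carry class c."""
    return n_r_c / n_r

# A model that classifies 960 of 1000 examples correctly, with the
# noise level eps estimated as its average error rate:
lp = log_posterior_uniform_noise(0.0, s=960, n=1000, eps=0.04)
```

    Working in log space avoids the underflow that the raw product in Equation 1 would cause for any realistic n.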

    Finally, an unseen example x is assigned to the class that maximizes:

        Pr(c|x, x, c, H) = ∑_{h∈H} Pr(c|x, h) Pr(h|x, c)    (4)

    ¹ Since a classification model h does not predict the example distribution Pr(xi|h), but only the class given xi, Pr(xi|h) = Pr(xi), and is therefore the same for all models and can be ignored.

    If a pure classification model is used, Pr(c|x, h) is 1 for the class predicted by h for x, and 0 for all others. Alternatively, a model supplying class probabilities such as those in Equation 3 can be used. Since there is typically no closed form for Equation 4, and the model space used typically contains far too many models to allow the full summation to be carried out, some procedure for approximating Equation 4 is necessary. Since Pr(h|x, c) is often very peaked, using only the model with highest posterior can be an acceptable approximation. Alternatively, a sampling scheme can be used. Two widely-used methods are importance sampling (Bernardo & Smith, 1994) and Markov chain Monte Carlo (Neal, 1993; Gilks et al., 1996).
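    Prediction from a finite sample of models (the pure-classification form of Equation 4) might look like the following sketch; the stand-in callable models and toy posteriors are assumptions for illustration:

```python
import math
from collections import defaultdict

def bma_predict(x, models, log_posteriors, classes):
    """Approximate Equation 4 over a sample of models: each model's
    0/1 class prediction is weighted by its normalized posterior."""
    m = max(log_posteriors)                      # normalize in log space
    weights = [math.exp(lp - m) for lp in log_posteriors]
    z = sum(weights)
    score = defaultdict(float)
    for h, w in zip(models, weights):
        score[h(x)] += w / z                     # Pr(c|x,h) is 0 or 1 here
    return max(classes, key=lambda c: score[c])

# Two toy models; the first has a much higher posterior and dominates:
pred = bma_predict(None, [lambda x: "+", lambda x: "-"], [0.0, -5.0], ["+", "-"])
```

    Note how a log-posterior gap of only 5 already gives the first model about 99% of the total weight, foreshadowing the skewness discussed later in the paper.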

    3. Bayesian Model Averaging of Bagged C4.5 Rule Sets

    Bagging (Breiman, 1996a) is a simple and effective way to reduce the error rate of many classification learning algorithms. For example, in the empirical study described below, it reduces the error of a rule learner in 19 of 26 databases, by 4% on average. In the bagging procedure, given a training set of size s, a bootstrap replicate of it is constructed by taking s samples with replacement from the training set. Thus a new training set of the same size is produced, where each of the original examples may appear once, more than once, or not at all. The learning algorithm is then applied to this sample. This procedure is repeated m times, and the resulting m models are aggregated by uniform voting (i.e., all models have equal weight).
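    The procedure just described is easy to state in code. Below is a minimal sketch; the learner interface (a function from a list of (x, class) pairs to a classifier) and the trivial stand-in learner are my own assumptions:

```python
import random

def bagging(train, learner, m=50, seed=0):
    """Bagging: learn m models from bootstrap replicates of the
    training set and combine them by uniform voting."""
    rng = random.Random(seed)
    models = []
    for _ in range(m):
        replicate = [rng.choice(train) for _ in train]   # sample with replacement
        models.append(learner(replicate))
    def vote(x):
        preds = [h(x) for h in models]
        return max(set(preds), key=preds.count)          # uniform voting
    return vote

def majority_learner(data):
    """Stand-in learner: always predicts the replicate's majority class."""
    labels = [y for _, y in data]
    c = max(set(labels), key=labels.count)
    return lambda x: c

ensemble = bagging([(i, "a") for i in range(7)] + [(i, "b") for i in range(3)],
                   majority_learner)
```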

    Bagging can be viewed as a form of importance sampling (Bernardo & Smith, 1994). Suppose we want to approximate the sum (or integral) of a function f(x)p(x) by sampling, where p(x) is a probability distribution. This is the form of Equation 4, with p(x) being the model posterior probabilities. Since, given a probability distribution q(x) (known as the importance sampling distribution),

        ∑ f(x) p(x) = ∑ [f(x) p(x) / q(x)] q(x)    (5)

    we can approximate the sum by sampling according to q(x), and computing the average of f(xi)p(xi)/q(xi) for the points xi sampled. Thus each sampled value f(xi) will have a weight equal to the ratio p(xi)/q(xi). In particular, if p(x) = q(x), all samples should be weighed equally. This is what bagging does. Thus it will be a good importance-sampling approximation of Bayesian model averaging to the extent that the learning algorithm used to generate samples does so according to the model posterior probability. Since most classification learners find a model by minimizing a function of the empirical error, and posterior probability decreases with it, this is a reasonable hypothesis.²
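    The identity in Equation 5 is easy to check numerically. In this sketch (the distributions and sample size are my own choices), p is uniform over {0, ..., 9}, q is a skewed sampling distribution, and the weighted average recovers ∑ x·p(x) = 4.5:

```python
import random

def importance_sample_mean(f, p, q, draw_q, k, seed=0):
    """Estimate sum_x f(x) p(x) by drawing x ~ q and averaging
    f(x) * p(x) / q(x), as in Equation 5."""
    rng = random.Random(seed)
    return sum(f(x) * p(x) / q(x) for x in (draw_q(rng) for _ in range(k))) / k

p = lambda x: 0.1                      # uniform target distribution on 0..9
q = lambda x: (x + 1) / 55.0           # skewed importance-sampling distribution
draw_q = lambda rng: rng.choices(range(10), weights=range(1, 11))[0]

est = importance_sample_mean(lambda x: x, p, q, draw_q, k=20000)   # close to 4.5
```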

    Given that bagging can be viewed as an approximation of Bayesian model averaging by importance sampling, and the latter is the optimal prediction procedure, modifying bagging to more closely approximate it should lead to further error reductions. One way to do this is by noticing that in practice the probability of the same model being sampled twice when sampling from a very large model space is negligible. (This was verified in the empirical study below, and has been noted by many authors (e.g., Breiman (1996a); Charniak (1993)).) In this case, weighting models by their posteriors leads to a better approximation of Bayesian model averaging than weighting them uniformly. This can be shown as follows. With either method, the models that were not sampled make the same contribution to the error in approximating Equation 4, so they can be ignored for purposes of comparing the two methods. When weighting models by their posteriors, the error in approximating the terms that were sampled is zero, because the exact values of these terms are obtained (modulo errors in the Pr(c|x, h) factors, which again are the same for both methods). With uniform weights the error cannot be zero for all sampled terms, unless they all have exactly the same posterior. Thus the approximation error when using posteriors as weights is less than or equal to the error for uniform weights (with equality occurring only in the very unlikely case of all equal posteriors). Note that, unlike importance sampling, using the posteriors as weights would not yield good approximations of Bayesian model averaging in the large-sample limit, since it would effectively weight models by the square of their posteriors. However, for the reasons above it is clearly preferable for realistic sample sizes (e.g., the 10 to 100 models typically used in bagging and other multiple model methods).

    ² Of course, this ignores a number of subtleties. One is that, given such a learner, only models at local maxima of the posterior will have a nonzero probability of being selected. But since most of the probability mass is concentrated around these maxima (in the large-sample limit, all of it), this is a reasonable approximation. Another subtlety is that in this regime the probability of each locally MAP model being selected depends on the size of its basin of attraction in model space, not the amplitude of its posterior. Thus we are implicitly assuming that higher peaks of the posterior will typically have larger basins of attraction than lower ones, which is also reasonable. In any case, none of what follows depends on these aspects of the approximation, since they will be equally present in all the alternatives compared. Also, in Sections 5 and 6 we report on experiments where the model space was exhaustively sampled, making these considerations irrelevant.

    Whether a better approximation of Bayesian model averaging leads to lower errors can be tested empirically by comparing bagging (i.e., uniform weights) with the better approximation (i.e., models weighted by their posteriors). In order to do this, a base learner is needed to produce the component models. The C4.5 release 8 learner with the C4.5RULES postprocessor (Quinlan, 1993) was used for this purpose. C4.5RULES transforms decision trees learned by C4.5 (Quinlan, 1993) into rule sets, and tends to be slightly more accurate. A non-informative, uniform prior was used. This is appropriate, since all rule sets are induced in a similar manner from randomly selected examples, and there is thus no a priori reason to suppose one will be more accurate than another.

    Twenty-six databases from the UCI repository were used (Blake & Merz, 2000). Bagging's error rate was compared with that obtained by weighting the models according to Equation 1, using both a uniform class noise model (Equation 2) and Equation 3. Equation 4 was used in both the pure classification and class probability forms described. Error was measured by ten-fold cross-validation. Every version of the closer approximation of Bayesian model averaging performed worse than pure bagging on a large majority of the data sets (e.g., 19 out of 26), and worse on average. The best-performing combination was that of uniform class noise and pure classification. Results for this version (labeled BMA), bagging, and the single model learned from the entire training set in each fold are shown in Table 1. For this version, the experiments were repeated with m = 10, 50, and 100, with similar results. Inspection of the posteriors showed them to be extremely skewed, with a single rule model typically dominating to the extent of dictating the outcome by itself. This occurred even though all models tended to have similar error rates. In other words, Bayesian model averaging effectively performed very little averaging, acting more like a model selection mechanism (i.e., selecting the most probable model from the m induced).

    4. Bayesian Model Averaging of Partitioned RISE Rule Sets

    The surprising results of the previous section might be specific to the bagging procedure, and/or to the use of C4.5 as the base learner. In order to test this, this section reports similar experiments using a different multiple model method (partitioning) and base learner (RISE).

    RISE (Domingos, 1996a) is a rule induction system that assigns each test example to the class of the nearest rule.


    Table 3. Percent five-year return on investment for four currencies: German mark (DM), British pound (BP), Swiss franc (SF), and Canadian dollar (CD).

        Currency   B&H    Best   Unif.   BMA    BMAP
        DM         29.5   47.2   52.4    47.2   47.2
        BP         −7.0    7.9   19.5     7.9    8.3
        SF         34.4   58.3   44.4    58.3   47.7
        CD         −8.3   −2.7    0.0    −5.0   −7.0

    The fact that there is clearly no single right rule of this type suggests that the use of Bayesian model averaging is appropriate. The choice of s and t, with t > s, can be made empirically. If a maximum value tmax is set for t (and, in practice, moving averages of more than a month or so are never considered), the total number of possible rules is tmax(tmax − 1)/2. It is thus possible to compare the return yielded by the single most accurate rule with that yielded by averaging all possible rules according to their posterior probabilities. These are computed assuming a uniform prior on rules/hypotheses and ignoring terms that are the same for all rules (see Equation 1):

        Pr(h|x, c) ∝ ∏_{i=1}^{n} Pr(ci|xi, h)    (6)

    Let the two classes be + (rise/buy) and − (fall/sell). For each rule h, Pr(ci|xi, h) can take only four values: Pr(+|+), Pr(−|+), Pr(+|−) and Pr(−|−). Let n_{+−} be the number of examples in the sample which are of class − but for which rule h predicts +, and similarly for the other combinations. Let n_+ be the total number of examples for which h predicts +, and similarly for n_−. Then, estimating probabilities from the sample as in Equation 3:

        Pr(h|x, c) ∝ (n_{++}/n_+)^{n_{++}} (n_{+−}/n_+)^{n_{+−}} (n_{−+}/n_−)^{n_{−+}} (n_{−−}/n_−)^{n_{−−}}    (7)
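    Equation 7 is straightforward to evaluate in log space from the four counts. A minimal sketch (the function name and count ordering are mine):

```python
import math

def log_rule_posterior(npp, npm, nmp, nmm):
    """Log of Equation 7 (uniform prior). The first subscript is the
    rule's prediction, the second the observed class, so npm is the
    count of examples predicted + whose class is -."""
    n_plus, n_minus = npp + npm, nmp + nmm
    lp = 0.0
    for k, total in ((npp, n_plus), (npm, n_plus), (nmp, n_minus), (nmm, n_minus)):
        if k:                                  # 0 * log(0) is taken as 0
            lp += k * math.log(k / total)
    return lp
```

    With perfectly balanced counts the posterior reduces to (1/2)^n, as expected for a rule performing exactly at chance.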

    Tests were conducted using daily data on five currencies for the years 1973–87, from the Chicago Mercantile Exchange (Weigend et al., 1992). The first ten years were used for training (2341 examples) and the remaining five for testing. A maximum t of two weeks was used. The results for four currencies, in terms of the five-year return on investment obtained, are shown in Table 3. In the fifth currency, the Japanese yen, all averaging methods led to zero return, due to the fact that downward movements were in the majority for all rules both when the rule held and when it did not, leading the program to hold U.S. dollars throughout. This reflects a limitation of making only binary predictions, and not of multiple model methods. The first column of Table 3 shows the result of buying the foreign currency on the first day and holding it throughout the five-year period. The second column corresponds to applying the single best rule, and is a clear improvement over the former. The remaining columns show the results of applying various model averaging methods.

    Uniform averaging (giving the same weight to all rules) produced further improvements over the single best rule in all but one currency. However, Bayesian model averaging (fourth column) produced results that were very similar to those of the single best rule. Inspection of the posteriors showed this to be due in each case to the presence of a dominant peak in the (s, t) plane. This occurs even though the rule error rates typically differ by very little, and are very close to chance (error rates below 45% are rare in this extremely noisy domain). Thus it is not the case that averaging over all models in the space will make this phenomenon disappear.

    A further aspect in which the applications described so far differ from the exact Bayesian procedure is that averaging was only performed over the classification models, and not over the parameters Pr(xi, ci|h) of the noise model. The maximum-likelihood values of these parameters were used in place of integration over the possible parameter values weighted by their posterior probabilities. For example, Pr(−|+) was estimated by n_{+−}/n_+, but in theory integration over the [0, 1] interval should be performed, with each probability weighted by its posterior probability given the observed frequencies. Although this is again a common approximation, it might have a negative impact on the performance of Bayesian model averaging. Bayesian averaging was thus reapplied with integration over the probability values, using uniform priors and binomial likelihoods. This led to no improvement (BMAP column). We attribute this to the sample being large enough to concentrate most of the posterior's volume around the maximum-likelihood peak. This was confirmed by examining the curves of the posterior distributions.
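    For a Bernoulli parameter with a uniform prior and a binomial likelihood, the integration over [0, 1] has a closed form: the posterior is Beta(k + 1, n − k + 1), whose mean is Laplace's rule of succession. A sketch (the counts echo those used elsewhere in this section, but are illustrative only):

```python
def posterior_mean_bernoulli(k, n):
    """Posterior mean of a Bernoulli parameter under a uniform prior
    and binomial likelihood: the mean of Beta(k + 1, n - k + 1)."""
    return (k + 1) / (n + 2)

# With samples this large, the averaged value is nearly indistinguishable
# from the maximum-likelihood estimate k/n, consistent with the text:
ml, averaged = 950 / 2000, posterior_mean_bernoulli(950, 2000)
```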

    6. Bayesian Model Averaging of Conjunctions

    Because the results of the previous section might be specific to the foreign exchange domain, the following experiment was carried out using artificially generated Boolean domains. Classes were assigned at random to examples described by a features. All conjunctions of 3 of those features were then generated (a total of a(a − 1)(a − 2)/6), and their posterior probabilities were estimated from a random sample composed of half the possible examples. The experiment was repeated ten times for each of a = 7, 8, 9, . . ., 13. Because the class was random, the error rate of both Bayesian model averaging and the best conjunction³ was always approximately 50%. However, even in this situation of pure noise and no possible right conjunction, the posterior distributions were still highly asymmetric (e.g., the average posterior excluding the maximum was on average 14% of the maximum for a = 7, and decreased to 6% for a = 13). As a result, Bayesian model averaging still made the same prediction as the best conjunction on average 83.9% of the time for a = 7, decreasing to 64.4% for a = 13.⁴
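    The experiment is small enough to reproduce in a few lines. The sketch below is my own reconstruction under stated assumptions (a fixed seed, a = 10, and Equation 3's region-based probabilities): it assigns random classes, scores every 3-feature conjunction by Equation 6, and exhibits the asymmetric posteriors:

```python
import itertools, math, random

rng = random.Random(1)
a = 10                                              # number of Boolean features
sample = rng.sample(list(itertools.product([0, 1], repeat=a)), 2 ** (a - 1))
labels = {x: rng.choice("+-") for x in sample}      # purely random classes

def log_likelihood(conj):
    """Equation 6 with Equation 3 class probabilities; the two regions
    are 'conjunction satisfied' and 'conjunction not satisfied'."""
    lp = 0.0
    for region in (True, False):
        xs = [x for x in sample if all(x[i] for i in conj) == region]
        pos = sum(labels[x] == "+" for x in xs)
        for k in (pos, len(xs) - pos):
            if k:
                lp += k * math.log(k / len(xs))
    return lp

lls = [log_likelihood(c) for c in itertools.combinations(range(a), 3)]
top = max(lls)
post = [math.exp(l - top) for l in lls]
z = sum(post)
post = [p / z for p in post]                        # normalized posteriors
skew = max(post) / ((1 - max(post)) / (len(post) - 1))   # max vs. mean of the rest
```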

    7. The Overfitting Problem in Bayesian Model Averaging

    The observations of the previous sections all point to the conclusion that Bayesian model averaging's disappointing results are not the effect of deviations from the Bayesian ideal (e.g., sampling terms from Equation 4, or using maximum likelihood estimates of the parameters), but rather stem from some deeper problem. In this section this problem is identified, and seen to be a form of overfitting that occurs when Bayesian averaging is applied.

    The reason Bayesian model averaging produces very skewed posteriors even when model errors are similar, and as a result effectively performs very little averaging, lies in the form of the likelihood's dependence on the sample. This is most easily seen in the case of a uniform noise model. In Equation 2, the likelihood of a model, (1 − ε)^s ε^(n−s), increases exponentially with the proportion of correctly classified examples s/n. As a result of this exponential dependence, even small random variations in the sample will cause some hypotheses to appear far more likely than others. Similar behavior occurs when Equation 3 is used as the noise model, with the difference that each example's contribution to the likelihood's exponential variation now depends on the region the example is in, instead of being the same for all examples. This behavior will occur even if the true values of the noise parameters are known exactly (ε in the uniform model, and the Pr(xi, ci|h) values in Equation 3). The posterior's exponential sensitivity to the sample is a direct consequence of Bayes' theorem and the assumption that the examples are drawn independently, which is usually a valid one in classification problems. Dependence between the examples will slow the exponential growth, but will only make it disappear in the limit of all examples being completely determined by the first.

    ³ Predicting the class with highest probability given that the conjunction is satisfied when it is (estimated from the sample), and similarly when it is not.

    ⁴ This decrease was not due to a flattening of the posteriors as the sample size increased (the opposite occurred), but to the class probabilities given the value of each conjunction converging to the 50% limit.
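    The magnitude of this effect is easy to quantify: under Equation 2, the likelihood ratio of two models whose counts of correctly classified examples differ by Δs is ((1 − ε)/ε)^Δs, independent of n. The numbers below are illustrative choices of mine:

```python
def likelihood_ratio(delta_s, eps):
    """Ratio of the Equation 2 likelihoods of two models whose numbers
    of correctly classified examples differ by delta_s."""
    return ((1.0 - eps) / eps) ** delta_s

# Two models that differ on only 10 out of, say, 1000 examples,
# at a 30% noise level, already yield a roughly 4800-fold preference:
r = likelihood_ratio(10, 0.30)
```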

    To see the impact of this exponential behavior, consider any two of the conjunctions in the previous section, h1 and h2. Using the notation of Section 5, for each conjunction Pr(+|+) = Pr(−|+) = Pr(+|−) = Pr(−|−) = 1/2, by design. By Equation 6, Pr(h1|x, c)/Pr(h2|x, c) = 1. In other words, given a sufficiently large sample, the two conjunctions should appear approximately equally likely. Now suppose that: n = 4000; for conjunction h1, n_{++} = n_{−−} = 1050 and n_{+−} = n_{−+} = 950; and for conjunction h2, n_{++} = n_{−−} = 1010 and n_{+−} = n_{−+} = 990. The resulting estimates of Pr(+|+), . . ., Pr(−|−) for both conjunctions are quite good; all are within 1% to 5% of the true values. However, the estimated ratio of conjunction posteriors is, by Equation 7:

        Pr(h1|x, c) / Pr(h2|x, c)
            = [(1050/2000)^{1050} (950/2000)^{950} (950/2000)^{950} (1050/2000)^{1050}]
              / [(1010/2000)^{1010} (990/2000)^{990} (990/2000)^{990} (1010/2000)^{1010}]
            ≈ 120
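    This ratio can be checked directly by evaluating Equation 7 in log space for both sets of counts (a sketch; the helper function is mine):

```python
import math

def log_post(npp, npm, nmp, nmm):
    """Log of Equation 7 from the four prediction/class counts."""
    return sum(k * math.log(k / total)
               for k, total in ((npp, npp + npm), (npm, npp + npm),
                                (nmp, nmp + nmm), (nmm, nmp + nmm)))

ratio = math.exp(log_post(1050, 950, 950, 1050) - log_post(1010, 990, 990, 1010))
# ratio comes out at about 120, matching the text
```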

    In other words, even though the two conjunctions should appear similarly likely and have similar weights in the averaging process, h1 actually has a far greater weight than h2; enough so, in fact, that Bayes-averaging between h1 and 100 conjunctions with observed frequencies similar to h2's is equivalent to always taking only h1 into account. If only a single parameter Pr(−|−) were being estimated, a deviation of 5% or more from a true value of 1/2, given a sample of size 2000, would have a probability of p5% = 0.013 (binomial distribution, with p = 1/2 and n = 2000).
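    The value p5% = 0.013 can be reproduced as a one-sided binomial tail, reading "a deviation of 5% or more from 1/2" as an estimate of at least 0.525, i.e. at least 1050 successes in 2000 trials; this reading is my own interpretation of the text's figure:

```python
import math

n = 2000
# Probability that a Binomial(2000, 1/2) count reaches 1050 or more,
# i.e. the estimated probability overshoots 1/2 by 5% of its value:
p5 = sum(math.comb(n, k) for k in range(1050, n + 1)) / 2 ** n
```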

    This is a reasonably small value, even if not necessarily negligible. However, two independent parameters are being estimated for each model in the space, and the probability of a deviation of 5% or more in any one parameter increases with the number of parameters being estimated (exponentially if they are independent). With a = 10, there are 120 conjunctions of 3 Boolean features (Section 6), and thus 240 parameters to estimate. Assuming independence to simplify, this raises the probability of a deviation of 5% or more to 1 − (1 − p5%)^240 ≈ 0.957. In other words, it is highly likely to occur. In practice this value will be smaller due to dependences between the conjunctions, but the pattern is clear. In most applications, the model space contains far more than 120 distinct models; for most modern machine learning methods (e.g., rule induction, decision-tree induction, neural networks, instance-based learning), it can be as high as doubly exponential in the number of attributes. Thus, even if only a very small fraction of the terms in Equation 4 is considered, the probability of one term being very large purely by chance is very high, and this outlier then dictates the behavior of Bayesian model averaging. Further, the more terms that are sampled in order to better approximate Equation 4, the more likely such an outlier is to appear, and the more likely (in this respect) Bayesian model averaging is to perform poorly.

    This is an example of overfitting: preferring a hypothesis that does not truly have the lowest error of any hypothesis considered, but that by chance has the lowest error on the training data (Mitchell, 1997). The observation that Bayesian model averaging is highly prone to overfitting, the more so the better Equation 4 is approximated, contradicts the common belief among Bayesians that it solves the overfitting problem, and that in the limit overfitting cannot occur if Equation 4 is computed exactly (see, e.g., Buntine (1990)). While uniform averaging as done in bagging can indeed reduce overfitting by canceling out spurious variations in the models (Rao & Potts, 1997), Bayesian model averaging in effect acts more like model selection than model averaging, and is equivalent to amplifying the search process that produces overfitting in the underlying learner in the first place. Although overfitting is often identified with inducing overly complex hypotheses, this is a superficial view: overfitting is the result of attempting too many hypotheses, and consequently finding a poor hypothesis that appears good (Jensen & Cohen, in press). In this light, Bayesian model averaging's potential to aggravate the overfitting problem relative to learning a single model becomes clear.

    The net effect of Bayesian model averaging will depend on which effect prevails: the increased overfitting (small if few models are considered), or the reduction in error potentially obtained by giving some weight to alternative models (typically a small effect, given Bayesian model averaging's highly skewed weights). While Buntine (1990) and Ali and Pazzani (1996) obtained error reductions with Bayesian model averaging in some domains, these were typically small compared with those obtainable using uniform weights (whether in bagging or using Buntine's option trees, as done by Kohavi and Kunz (1997)).⁵ Thus, the improvements produced by multiple models in these references would presumably have been greater if the model posteriors had been ignored. The fact that, in the studies with bagging reported in this article, Bayesian model averaging generally did not produce even those improvements may be attributable to the effect of a third factor: the fact that, when bootstrapping is used, each model is effectively learned from a smaller sample, while the multiple models in Buntine (1990) and Ali and Pazzani (1996) were learned on the entire sample, using variations in the algorithm.

    ⁵ Note also that only one of the many variations of Bayesian model averaging in Buntine (1990) consistently reduced error.

    Multiple model methods can be placed on a spectrum according to the asymmetry of the weights they produce. Bagging is at one extreme, with uniform weights, and never increases overfitting (Breiman, 1996b). Methods like boosting and stacking produce weights that are more variable, and can sometimes lead to overfitting (see, e.g., Margineantu and Dietterich (1997)). Bayesian model averaging is at the opposite end of the spectrum, producing highly asymmetric weights, and being correspondingly more prone to overfitting.

    8. Conclusions

    This paper found that, contrary to previous belief, Bayesian model averaging does not obviate the overfitting problem in classification, and may in fact aggravate it. Bayesian averaging's tendency to overfit derives from the likelihood's exponential sensitivity to random fluctuations in the sample, and increases with the number of models considered. The problem of successfully applying it in machine learning remains an open one.

    Acknowledgments

    This research was partly supported by PRAXIS XXI and NATO. The author is grateful to all those who provided the data sets used in the experiments.

    References

    Ali, K., & Pazzani, M. (1996). Classification using Bayes averaging of multiple, relational rule-based models. In D. Fisher and H.-J. Lenz (Eds.), Learning from data: Artificial intelligence and statistics V, 207–217. New York, NY: Springer.

    Bauer, E., & Kohavi, R. (1999). An empirical comparison of voting classification algorithms: Bagging, boosting and variants. Machine Learning, 36, 105–142.

    Bernardo, J. M., & Smith, A. F. M. (1994). Bayesian theory. New York, NY: Wiley.

    Blake, C., & Merz, C. J. (2000). UCI repository of machine learning databases. Department of Information and Computer Science, University of California at Irvine, Irvine, CA. http://www.ics.uci.edu/~mlearn/MLRepository.html.

    Breiman, L. (1996a). Bagging predictors. Machine Learning, 24, 123–140.

    Breiman, L. (1996b). Bias, variance and arcing classifiers (Technical Report 460). Statistics Department, University of California at Berkeley, Berkeley, CA.

    Buntine, W. (1990). A theory of learning classification rules. Doctoral dissertation, School of Computing Science, University of Technology, Sydney, Australia.

    Charniak, E. (1993). Statistical language learning. Cambridge, MA: MIT Press.

    Chickering, D. M., & Heckerman, D. (1997). Efficient approximations for the marginal likelihood of Bayesian networks with hidden variables. Machine Learning, 29, 181–212.

    Domingos, P. (1996a). Unifying instance-based and rule-based induction. Machine Learning, 24, 141–168.

    Domingos, P. (1996b). Using partitioning to speed up specific-to-general rule induction. Proceedings of the AAAI-96 Workshop on Integrating Multiple Learned Models for Improving and Scaling Machine Learning Algorithms (pp. 29–34). Portland, OR: AAAI Press.

    Drucker, H., Cortes, C., Jackel, L. D., LeCun, Y., & Vapnik, V. (1994). Boosting and other machine learning algorithms. Proceedings of the Eleventh International Conference on Machine Learning (pp. 53–61). New Brunswick, NJ: Morgan Kaufmann.

    Freund, Y., & Schapire, R. E. (1996). Experiments with a new boosting algorithm. Proceedings of the Thirteenth International Conference on Machine Learning (pp. 148–156). Bari, Italy: Morgan Kaufmann.

    Gilks, W. R., Richardson, S., & Spiegelhalter, D. J. (Eds.). (1996). Markov chain Monte Carlo in practice. London, UK: Chapman and Hall.

    Good, I. J. (1965). The estimation of probabilities: An essay on modern Bayesian methods. Cambridge, MA: MIT Press.

    Jensen, D., & Cohen, P. R. (in press). Multiple comparisons in induction algorithms. Machine Learning.

    Kearns, M. J., & Vazirani, U. V. (1994). An introduction to computational learning theory. Cambridge, MA: MIT Press.

    Kohavi, R., & Kunz, C. (1997). Option decision trees with majority votes. Proceedings of the Fourteenth International Conference on Machine Learning (pp. 161–169). Nashville, TN: Morgan Kaufmann.

    Kong, E. B., & Dietterich, T. G. (1995). Error-correcting output coding corrects bias and variance. Proceedings of the Twelfth International Conference on Machine Learning (pp. 313–321). Tahoe City, CA: Morgan Kaufmann.

    LeBaron, B. (1991). Technical trading rules and regime shifts in foreign exchange (Technical Report). Department of Economics, University of Wisconsin at Madison, Madison, WI.

    Maclin, R., & Opitz, D. (1997). An empirical evaluation of bagging and boosting. Proceedings of the Fourteenth National Conference on Artificial Intelligence. Providence, RI: AAAI Press.

    Margineantu, D., & Dietterich, T. (1997). Pruning adaptive boosting. Proceedings of the Fourteenth International Conference on Machine Learning (pp. 211–218). Nashville, TN: Morgan Kaufmann.

    Mitchell, T. M. (1997). Machine learning. New York, NY: McGraw-Hill.

    Neal, R. M. (1993). Probabilistic inference using Markov chain Monte Carlo methods (Technical Report CRG-TR-93-1). Department of Computer Science, University of Toronto, Toronto, Canada.

    Niblett, T. (1987). Constructing decision trees in noisy domains. Proceedings of the Second European Working Session on Learning (pp. 67–78). Bled, Yugoslavia: Sigma.

    Oliver, J. J., & Hand, D. J. (1995). On pruning and averaging decision trees. Proceedings of the Twelfth International Conference on Machine Learning (pp. 430–437). Tahoe City, CA: Morgan Kaufmann.

    Quinlan, J. R. (1993). C4.5: Programs for machine learning. San Mateo, CA: Morgan Kaufmann.

    Quinlan, J. R. (1996). Bagging, boosting, and C4.5. Proceedings of the Thirteenth National Conference on Artificial Intelligence (pp. 725–730). Portland, OR: AAAI Press.

    Rao, J. S., & Potts, W. J. E. (1997). Visualizing bagged decision trees. Proceedings of the Third International Conference on Knowledge Discovery and Data Mining (pp. 243–246). Newport Beach, CA: AAAI Press.

    Weigend, A. S., Huberman, B. A., & Rumelhart, D. E. (1992). Predicting sunspots and exchange rates with connectionist networks. In M. Casdagli and S. Eubank (Eds.), Nonlinear modeling and forecasting, 395–432. Redwood City, CA: Addison-Wesley.

    Wolpert, D. (1992). Stacked generalization. Neural Networks, 5, 241–259.