
832 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 20, NO. 8, AUGUST 1998

The Random Subspace Method for Constructing Decision Forests

Tin Kam Ho, Member, IEEE

Abstract—Much of previous attention on decision trees focuses on the splitting criteria and optimization of tree sizes. The dilemma between overfitting and achieving maximum accuracy is seldom resolved. A method to construct a decision tree based classifier is proposed that maintains highest accuracy on training data and improves on generalization accuracy as it grows in complexity. The classifier consists of multiple trees constructed systematically by pseudorandomly selecting subsets of components of the feature vector, that is, trees constructed in randomly chosen subspaces. The subspace method is compared to single-tree classifiers and other forest construction methods by experiments on publicly available datasets, where the method's superiority is demonstrated. We also discuss independence between trees in a forest and relate that to the combined classification accuracy.

Index Terms—Pattern recognition, decision tree, decision forest, stochastic discrimination, decision combination, classifier combination, multiple-classifier system, bootstrapping.


1 INTRODUCTION

A persistent problem in using decision trees for classification is how to avoid overfitting a set of training data while achieving maximum accuracy. Some previous studies attempt to prune back a fully-split tree even at the expense of the accuracy on the training data. Other probabilistic methods allow descent through multiple branches with different confidence measures. However, their effect on optimizing generalization accuracy is far from clear and consistent.

In [11], I first proposed that combining multiple trees constructed in randomly selected subspaces can achieve a nearly monotonic increase in generalization accuracy while preserving perfect accuracy on training data, provided that the features are sufficient to distinguish all samples belonging to different classes, or that there is no intrinsic ambiguity in the datasets. This offers a better way to overcome the apparent dilemma of accuracy optimization and over-adaptation. In [11], I showed with empirical results that the method works well with binary trees that at each internal node split the data into the two sides of an oblique hyperplane.

The validity of the method does not depend on the particulars of tree construction algorithms. Other types of splitting functions, including single-feature splits such as those in the C4.5 algorithm [26], supervised or unsupervised clustering, distribution map matching [13], and support vector machines [33], can be readily applied. In this paper, I will investigate a number of alternative splitting functions.

For simplicity, in this paper I assume that the classification problems are in real-valued feature spaces, though the samples may have only binary or integer feature values. Categorical variables are assumed to be numerically encoded.

2 METHODS FOR CONSTRUCTING A DECISION TREE

Many methods have been proposed for constructing a decision tree using a collection of training samples. The majority of tree construction methods use linear splits at each internal node. I will focus on linear splits throughout our discussions.

A typical method selects a hyperplane or multiple hyperplanes at each internal node, and samples are assigned to branches representing different regions of the feature space bounded by such hyperplanes. The methods differ by the number and orientations of the hyperplanes and how the hyperplanes are chosen. I categorize such methods by the types of splits they produce, which are determined by the number and orientation of the hyperplanes (Fig. 1):

1) axis-parallel linear splits: A threshold is chosen on the values at a particular feature dimension, and samples are assigned to the branches according to whether the corresponding feature values exceed the threshold. Multiple thresholds can be chosen for assignment to multiple branches. Another generalization is to use Boolean combinations of the comparisons against thresholds on multiple features. These trees can be very deep but their execution is extremely fast.

2) oblique linear splits: Samples are assigned to the branches according to which side of a hyperplane or which region bounded by multiple hyperplanes they fall in, but the hyperplanes are not necessarily parallel to any axis of the feature space. A generalization is to use hyperplanes in a transformed space, where each feature dimension can be an arbitrary function of selected input features. The decision regions of these trees can be finely tailored to the class distributions, and the trees can be small. The speed of execution depends on the complexity of the hyperplanes or the transformation functions.

3) piecewise linear splits: Branches represent a Voronoi tessellation of the feature space. Samples are assigned based on nearest-neighbor matching to chosen anchor points. The anchor points can be selected among training samples, class centroids, or derived cluster centers. These trees can have a large number of branches and can be very shallow.

Within each category, the splitting functions can be obtained in many ways. For instance, single-feature splits can be chosen by Sethi and Sarvarayudu's average mutual information [28], the Gini index proposed by Breiman et al. [3], Quinlan's information gain ratio [25], or Mingers's G statistic [20], [21], etc. Oblique hyperplanes can be obtained by Tomek links [23], simulated annealing [7], [8], or perceptron training [11]. Hyperplanes in transformed spaces can be chosen using the support vector machine method [33]. Piecewise linear or nearest-neighbor splits can be obtained by numerous ways of supervised or unsupervised clustering. There are also many variations of each popular method, and I do not intend to provide a complete taxonomy. Interested readers are referred to a survey by Dattatreya and Kanal [5].

An important issue often addressed in previous literature is the stopping criterion for tree construction. Certain types of splits can be produced until the feature space is partitioned into regions containing only samples of a single class (given that there are no intrinsic ambiguities among classes); other types of splits have inherent stopping rules that would not allow such a complete partitioning. A stopping criterion can also be introduced artificially on any type of splits, for instance, by limiting the number of levels of the tree. Another approach is to build a fully split tree and then prune back certain leaves that are considered overly specific. An alternative to these top-down construction methods is a recently proposed approach that determines the structure of the tree first, and then determines all the splits simultaneously by optimizing a global criterion [29].

Each splitting function defines a model for projecting classification from the training samples to unclassified points in the space. From this point of view, it is not surprising to see that no method could be universally best for an arbitrary, finite training set. The advantage of each splitting function depends on the distribution of available training samples and the difference of that from the true distributions. On the other hand, if the training samples are sufficiently dense, any one of the functions, if used to generate fully-split trees, yields similar partitions and classification accuracy. Their only differences will be in the sizes of the trees and training and execution speed. Therefore, I do not emphasize the advantages and disadvantages of each splitting function, and I will leave these to empirical judgment.

3 SYSTEMATIC CONSTRUCTION OF A FOREST

My method is another example of improving accuracy through combining the power of multiple classifiers [15]. Here, each classifier is a decision tree, and I call the combined classifier a decision forest.

Some previous attempts to construct multiple trees rely on certain heuristic procedures or manual intervention [19], [30], [31]. The number of different trees they can obtain is often severely limited. I emphasize that arbitrarily introduced differences between trees, such as using randomized initial conditions or perturbations in the search procedures [9], do not guarantee good combination performance. Here, good combination performance is defined as 100 percent correct rate on training data and a monotonic decrease in generalization error as the forest grows in the number of trees, provided that the data do not have any intrinsic ambiguity. Also, it is important to have a systematic procedure that can produce a large number of sufficiently different trees.

My method relies on an autonomous, pseudorandom procedure to select a small number of dimensions from a given feature space. In each pass, such a selection is made and a subspace is fixed where all points have a constant value (say, zero) in the unselected dimensions. All samples are projected to this subspace, and a decision tree is constructed using the projected training samples. In classification, a sample of an unknown class is projected to the same subspace and classified using the corresponding tree.

For a given feature space of n dimensions, there are 2^n such selections that can be made, and with each selection a decision tree can be constructed. More different trees can be constructed if the subspace changes within the trees, that is, if different feature dimensions are selected at each split. The use of randomization in selecting the dimensions is merely a convenient way to explore the possibilities.

The trees constructed in each selected subspace are fully split using all training data. They are, hence, perfectly correct on the training set by construction, assuming no intrinsic ambiguities in the samples. When two samples cannot be distinguished by the selected features, there will be ambiguities, but if no decision is forced in such cases, this does not introduce errors.

For each tree, the classification is invariant for points that differ from the training points only in the unselected dimensions. Thus, each tree generalizes its classification in a different way. The vast number of subspaces in high-dimensional feature spaces provides more choices than needed in practice. Hence, while most other classification methods suffer from the curse of dimensionality, this method can take advantage of high dimensionality. And contrary to Occam's Razor, the classifier improves on generalization accuracy as it grows in complexity.
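To make the construction procedure concrete, the following is a minimal sketch in Python of forest building by random subspace selection. It is not the author's original implementation: it uses scikit-learn's DecisionTreeClassifier as a stand-in for the tree-growing step, fixes one subspace per tree, and the function and parameter names are illustrative.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def build_random_subspace_forest(X, y, n_trees=100, subspace_fraction=0.5, seed=0):
    """Grow one fully split tree per pseudorandomly selected feature subset."""
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    k = max(1, int(subspace_fraction * n_features))
    forest = []
    for _ in range(n_trees):
        dims = rng.choice(n_features, size=k, replace=False)  # fix one subspace
        tree = DecisionTreeClassifier()  # left unpruned so it fully splits the data
        tree.fit(X[:, dims], y)          # train on the projected training samples
        forest.append((dims, tree))      # remember which dimensions were selected
    return forest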


Fig. 1. Types of linear splits. (a) Axis-parallel linear splits. (b) Oblique linear splits. (c) Piecewise linear splits.


This method of forest building leads to many interesting theoretical questions. For instance, how many of the subspaces must be used before a certain accuracy with the combined classification can be achieved? What will happen if we use all the possible subspaces? How do the results differ if we restrict ourselves to subspaces with certain properties?

Some of these questions are addressed in the theory of stochastic discrimination (SD), where the combination of various ways to partition the feature space is studied [17], [18]. In the SD theory, classifiers are constructed by combining many components that have weak discriminative power but can generalize very well. Classification accuracies are related to the statistical properties of the combination function, and it is shown that very high accuracies can be achieved far before all the possible weak classifiers are used. The ability to build classifiers of arbitrary complexity while increasing generalization accuracy is shared by all methods derived from the SD theory. The decision forest is one such method. While other SD methods start with highly projectable classifiers with minimum enrichment and seek optimization on uniformity [16], with decision forests, one starts with guaranteed enrichment and uniformity, and seeks optimization on projectability.

Another previously explored idea to build multiple classifiers originates from the method of cross-validation. There, random subsets are selected from the training set, and a classifier is trained using each subset. Such methods are also useful since overfitting can be avoided to some extent by withholding part of the training data. Two training set subsampling methods have been proposed: bootstrapping [4] and boosting [6]. In bootstrapping, subsets of the raw training samples are independently and randomly selected, with replacement, according to a uniform probability distribution. In boosting, the creation of each subset is dependent on previous classification results, and a probability distribution is introduced to prefer those samples on which previous classifiers are incorrect. In boosting, weights are also used in the final decision combination, and the weights are determined by the accuracies of individual classifiers. Both these methods have been known to produce forests superior to single C4.5 trees [27]. In the experiments reported here, I will compare accuracies of forests built by my subspace method to those built by such training set subsampling methods.

The random subspace method is a parallel learning algorithm, that is, the generation of each decision tree is independent. This makes it suitable for parallel implementation, allowing the fast learning that is desirable in some practical applications. Moreover, since there is no hill-climbing, there is no danger of being trapped in local optima.
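Since the trees are mutually independent, the construction loop above parallelizes directly; the following is a minimal sketch using Python's standard concurrent.futures, again with scikit-learn trees as a stand-in and with names of my choosing.

from concurrent.futures import ProcessPoolExecutor
from functools import partial

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def build_one_tree(seed, X, y, k):
    """Build a single tree on a pseudorandomly selected k-dimensional subspace."""
    rng = np.random.default_rng(seed)
    dims = rng.choice(X.shape[1], size=k, replace=False)
    return dims, DecisionTreeClassifier().fit(X[:, dims], y)

def build_forest_in_parallel(X, y, n_trees=100, k=None, n_workers=4):
    """Grow the trees concurrently; each worker only needs its own random seed."""
    k = k or max(1, X.shape[1] // 2)
    # On platforms that start worker processes by "spawn", call this under
    # `if __name__ == "__main__":` so the worker function can be imported.
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        return list(pool.map(partial(build_one_tree, X=X, y=y, k=k), range(n_trees)))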

4 THE COMBINATION FUNCTION

The decisions of n_t individual trees are combined by averaging the conditional probability of each class at the leaves. For a point x, let v_j(x) be the terminal node that x falls into when it descends down tree T_j (j = 1, 2, ..., n_t). Given this, let the probability that x belongs to class c (c = 1, 2, ..., n_c) be denoted by P(c | v_j(x)). The estimate

$$\hat{P}\bigl(c \mid v_j(x)\bigr) = \frac{P\bigl(c, v_j(x)\bigr)}{\sum_{k=1}^{n_c} P\bigl(k, v_j(x)\bigr)}$$

can be estimated by the fraction of class c points over all points that are assigned to v_j(x) (in the training set). Notice that in this context, since the trees are fully split, most terminal nodes contain only a single class (except for abnormal stops that may occur in some tree construction algorithms), and thus the value of the estimate \hat{P}(c | v_j(x)) is almost always one. The discriminant function is defined as

$$g_c(x) = \frac{1}{n_t} \sum_{j=1}^{n_t} \hat{P}\bigl(c \mid v_j(x)\bigr),$$

and the decision rule is to assign x to the class c for which g_c(x) is the maximum. For fully split trees, the decision obtained using this rule is equivalent to a plurality vote among the classes decided by each tree.
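A minimal sketch of this combination rule, assuming trees stored as (dims, tree) pairs as in the earlier sketch and integer class labels 0, ..., n_c − 1; predict_proba plays the role of the leaf estimate \hat{P}(c | v_j(x)).

import numpy as np

def forest_discriminant(forest, X, n_classes):
    """Average the per-tree estimates of P(c | v_j(x)) and take the argmax;
    for fully split trees this reduces to a plurality vote."""
    g = np.zeros((X.shape[0], n_classes))
    for dims, tree in forest:
        # predict_proba returns the class fractions at the leaf each sample
        # reaches, with columns ordered as in tree.classes_.
        g[:, tree.classes_] += tree.predict_proba(X[:, dims])
    g /= len(forest)                 # g_c(x) = (1/n_t) * sum_j of the estimates
    return g.argmax(axis=1)          # assign x to the class with maximum g_c(x)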

It is obvious that the discriminant preserves 100 percent accuracy on the training set, provided that there is no ambiguity in the chosen subspaces. However, it is possible that two samples that are distinguishable in the original feature space become ambiguous in a chosen subspace, especially if there is a large reduction in dimensionality. So, caution has to be taken in applying this method to very low-dimensional data, and the choice of the number of features to select could have a strong impact on accuracy. These points will be further illustrated in the experiments.

For an unseen point, g(x) averages over the posterior probabilities that are conditioned on reaching a particular terminal node. Geometrically, each terminal node defines a neighborhood around the points assigned to that node in the chosen subspace. By averaging over the posterior probabilities in these neighborhoods (decision regions), the discriminant approximates the posterior probability for a given x in the original feature space. This is similar to other kernel-based techniques for estimating posterior probabilities, except that here the kernels are of irregular shapes and sizes and do not necessarily nest.

Because of this, the discriminant is applicable to any algorithm that partitions the feature space into regions containing only or mostly points of one class, for instance, the method of learning vector quantization [10]. The analytical properties of the function and its several variants have been studied extensively by Berlind [2]. Essentially, accuracy of the function is shown to be asymptotically perfect (as the number of component classifiers increases) provided that a number of conditions on enrichment, uniformity, symmetry, and projectability are satisfied.

5 INDEPENDENCE BETWEEN TREES

For a forest to achieve better accuracy than individual trees, it is critical that there should be sufficient independence or dissimilarity between the trees. There have been few known measures for correlation between decision trees. The difference in combined accuracy of the forest from those of individual trees gives strong but indirect evidence on their mutual independence.

A simple measure of similarity between two decision trees can be the amount that their decision regions overlap [32]. On fully split trees, each leaf represents a region labeled with a particular class. On trees that are not fully split, the regions can be labeled with the dominating class (ties broken arbitrarily). Given that, regardless of the structure of the trees, we can consider trees yielding the same decision regions equivalent (Fig. 2). The similarity of two trees can then be measured by the total volume of the regions labeled with the same class by both trees.

Given two trees t_i and t_j, let their class decisions for a point x be c_i(x) and c_j(x), respectively. I define the tree agreement s_{i,j} to be

$$s_{i,j} = \int_R p(x)\, dx,$$

where R = {x | c_i(x) = c_j(x)}, and p(x) is the probability density function of x.

Given two decision trees, the volume of the overlapping decision regions is determined and, in theory, can be calculated exactly. However, depending on the form of the splitting function and the dimensionality of the space, the calculation could be difficult to carry out. Alternatively, one may get an approximation by Monte Carlo integration, i.e., generating pseudorandom points to a sufficient density in the feature space and measuring the agreement of their classification by the trees.

Assuming that the testing samples are representative of the given problems, I will use them to measure the tree agreement. This makes the estimation more computationally feasible and allows us to limit the concern to the neighborhood of the given samples.

It should be noted that severe bias could result if the samples for estimation are not chosen carefully. For instance, if the training samples are used in this context, since the trees are fully split and tailored to these samples by construction, the estimate of tree agreement will always be one, and this will most likely be an overestimate. On the contrary, since the decision boundaries can be arbitrary in regions where there is no representative sample, if one uses pseudorandomly generated points distributed over regions far outside those occupied by the given samples, one could risk severely underestimating the tree agreement that is relevant to the problem.

Using a set of n fixed samples and assuming equal weights, the estimate \hat{s}_{i,j} can be written as

$$\hat{s}_{i,j} = \frac{1}{n} \sum_{k=1}^{n} f(x_k),$$

where

$$f(x_k) = \begin{cases} 1 & \text{if } c_i(x_k) = c_j(x_k), \\ 0 & \text{otherwise.} \end{cases}$$
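This estimate translates directly into code; a minimal sketch, again assuming trees stored as (dims, tree) pairs with a scikit-learn-style predict.

import numpy as np

def tree_agreement(tree_a, tree_b, X):
    """Estimate s_{i,j}: the fraction of the fixed samples X on which the
    two trees make the same class decision."""
    dims_a, ta = tree_a
    dims_b, tb = tree_b
    return float(np.mean(ta.predict(X[:, dims_a]) == tb.predict(X[:, dims_b])))

def average_agreement(forest, X):
    """Average the estimate over all pairs of trees, as reported in Tables 2-4."""
    pairs = [(i, j) for i in range(len(forest)) for j in range(i + 1, len(forest))]
    return float(np.mean([tree_agreement(forest[i], forest[j], X) for i, j in pairs]))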

6 COMPARISON OF RESULTS OF TREE AND FOREST CLASSIFIERS

A series of experiments were carried out using publicly available data sets provided by the Project Statlog [24]. I used all four datasets that have separate training and testing data (“dna,” “letter,” “satimage,” and “shuttle”). In each of these datasets, both training and testing samples are represented by feature vectors with integer components. Though some of the feature components are not necessarily numerical by nature (say, the features in the “dna” set are binary codings of a four-element alphabet), the samples are still treated as points in real-valued spaces. There are no missing feature values in any of the four datasets. Table 1 lists the sizes of the training and testing sets provided in each collection.

I compared the accuracies of the following classifiers:

1) Decision forests constructed using the subspace method versus single decision trees.

2) Decision forests constructed using the subspace method versus those constructed using training set subsampling methods.

3) Decision forests constructed using the subspace method with different splitting functions.

4) Decision forests constructed using the subspace method with different numbers of randomly selected features.

6.1 Forest Built on Subspaces Versus a Single Decision Tree

I first compared the accuracies of the forests constructed using my method against the accuracy of using a single decision tree constructed using all features and all training samples. For easy repetition of results by others, I chose to use the C4.5 algorithm [26] (Release 8) to construct the trees in either case.

In the experiments, the C4.5 algorithm was essentially untouched. The forest construction procedure was implemented external to the C4.5 algorithm. That means the features were randomly selected before the data were input to the algorithm. Decision combination, in this case approximated by simple plurality voting, was done using the class decisions extracted from the output of the algorithm. Along with the use of publicly available data, this experiment can be easily duplicated.

Fig. 2. Two trees yielding the same decision regions in a 2-dim feature space (Ci(x) = j if tree i decides that x is in class j).

TABLE 1
SPECIFICATIONS OF DATA SETS USED IN THE EXPERIMENTS

name of data set | no. of classes | no. of feature dimensions | no. of training samples | no. of testing samples
dna              | 3              | 180                       | 2,000                   | 1,186
letter           | 26             | 16                        | 15,000                  | 5,000
satimage         | 6              | 36                        | 4,435                   | 2,000
shuttle          | 7              | 9                         | 43,500                  | 14,500


In each run, I first used the original C4.5 algorithm to build an unpruned tree using all the features, and applied the tree to the testing set. Then I repeated the procedure with a pruned tree (using default confidence levels). I then constructed a decision forest, using the same C4.5 algorithm, but for each tree using only a randomly selected half of the features. Each tree was added to the forest and the accuracy of the forest on the testing set was measured. I continued until 100 trees were obtained in each forest. Fig. 3 compares the testing set accuracies of the forests against those obtained using a single C4.5 tree with the full feature vectors. It is apparent that the forests yield superior testing set accuracies for these four datasets.

It should be noted that in some of these datasets the number of features is not very large (dna: 180; letter: 16; satimage: 36; shuttle: nine). But the method still works well with such relatively low-dimensional spaces. The only exception is that with the “shuttle” data, there are so few features to choose from that the forest does not display a great advantage over the single-tree classifier. To apply the method to very low-dimensional data, I suggest first expanding the feature vectors by using certain functions of the original features. In many cases, I found that the inclusion of pairwise sums, differences, products, or Boolean combinations (for binary and categorical features) of the raw features served as a good expansion. I will discuss this further in a later experiment with more data sets with very few feature dimensions.
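A minimal sketch of the kind of feature expansion suggested here, appending pairwise sums, differences, and products of the raw features; the exact set of combinations is an illustrative choice rather than a prescription from the paper.

import numpy as np
from itertools import combinations

def expand_features(X):
    """Append pairwise sums, differences, and products of the raw features,
    giving the subspace method more (redundant) dimensions to sample from."""
    cols = [X]
    for i, j in combinations(range(X.shape[1]), 2):
        cols.append((X[:, i] + X[:, j])[:, None])
        cols.append((X[:, i] - X[:, j])[:, None])
        cols.append((X[:, i] * X[:, j])[:, None])
    return np.hstack(cols)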

6.2 Comparison to Forests Constructed by Training Set Subsampling

In this experiment, I compared my method of forest building to other methods where each tree was constructed using a randomly selected subset of the training data with all the features. I experimented with both bootstrapping and boosting methods that are of this category. In bootstrapping, subsets of the same size as the original training set were created by sampling with replacement with uniform probability. A tree was constructed with each subset. Boosting was done following the AdaBoost.M1 procedure described in [6].
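For reference, a minimal sketch of the bootstrap subset creation described above (the boosting comparison follows AdaBoost.M1 [6], whose reweighting logic is omitted here); scikit-learn trees again stand in for C4.5.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def build_bootstrap_forest(X, y, n_trees=100, seed=0):
    """Each tree sees a bootstrap replicate: a sample of the same size as the
    original training set, drawn with replacement under a uniform distribution."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    forest = []
    for _ in range(n_trees):
        idx = rng.integers(0, n, size=n)   # n indices drawn with replacement
        forest.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return forest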

Fig. 3. Comparison of test set accuracies: a single C4.5 tree, forests built by the random subspace method, and forests built by bootstrapping and boosting (for the “shuttle” data, accuracies of bootstrapping and the single-tree methods are essentially the same).


Again, all trees were constructed by the C4.5 algorithm as in the previous experiment, and the sample selection procedures were implemented externally. Fig. 3 shows the accuracies of all three forest building methods. It can be seen that although the individual trees (say, the first tree in each forest) built by training set subsampling are sometimes more accurate, with more trees, the combined accuracies of the random subspace forests are superior. The results suggest that there is much redundancy in the feature components in these datasets, but the samples are relatively sparse. Interestingly, the results also indicate that the bootstrapping and boosting methods are very similar in performance.

6.2.1 Tree Agreement

The differences in tree agreement in forests built using the subspace method and those built using the other two methods can be shown by our measure \hat{s}_{i,j} estimated using the testing data. Table 2 lists the estimated tree agreement for each forest. Each estimate is averaged over all 4,950 (100 × 99/2) pairs among 100 trees. For the “dna” and “letter” data, the subspace method yielded very dissimilar trees compared to the other methods. For the other two datasets, the differences are not as obvious. Again, differences between the bootstrapping and boosting methods are insignificant.

6.3 Comparison Among Different Splitting Functions

In this experiment, I compared the effects of different splitting functions. Eight splitting functions were implemented. Some functions can produce multibranch splits, but for comparison across different functions, only binary splits were used. For an internal node to be split into n branches, the eight functions assign the points in the following ways:

1) single feature split with best gain ratio: Points are assigned to different branches by comparing the value of a single feature against a sequence of n − 1 thresholds. The feature and thresholds are chosen to maximize Quinlan's information gain ratio [25].

2) distribution mapping: Points are projected onto a line drawn between the centroids of the two largest classes. Their distances to one centroid are sorted and a set of thresholds is selected so that there are the same number of points between each pair of successive thresholds. Points between a pair of thresholds are assigned to a branch at the next level [14].

3) class centroids: Centroids of the n − 1 largest classes and that of the remaining points are computed. A point is assigned to a branch if it is closest to the corresponding centroid by Euclidean distance.

4) unsupervised clustering: n clusters are obtained by complete-linkage clustering using Euclidean distance. Points are matched by Euclidean distance to the centroids of each cluster and assigned to the branch corresponding to the closest. If there are more points than a preset limit, so that clustering is prohibitively expensive, points are sorted by the sum of feature values and divided evenly into n groups.

5) supervised clustering: n anchors are initialized using one point from each of the first n classes. The remaining points are matched to the closest anchor by Euclidean distance. The anchors are then updated as the centroids of matched points. The process is repeated for a number of passes and the resulting centroids are used to represent the branches. Points are assigned to the branch corresponding to the nearest centroid.

6) central axis projection: First, we find the two classes whose means are farthest apart by Euclidean distance. The other classes are matched to these two by proximity of the means. The centroids of the two groups thus obtained are then computed and a line (the central axis) is drawn passing through both centroids. All data points are then projected onto this line. Hyperplanes that are perpendicular to this line are evaluated, and the one that best divides the two groups (causing minimum error) is chosen. Points are then assigned to two sides of the chosen hyperplane as two branches. This method permits only binary splits.

7) perceptron: Again, the data points are divided into two groups by proximity of the class means. A hyperplane is then derived using the fixed-increment perceptron training algorithm [22]. Points are then assigned to two sides of the derived hyperplane. This method also permits only binary splits (see the sketch after this list).

8) support vector machine: The n − 1 largest classes are chosen and the rest are grouped as one class. Points are transformed to a higher-dimensional space by a polynomial kernel of a chosen degree. Hyperplanes that maximize margins in the transformed space are chosen. Each hyperplane divides one class from the others. Points are compared to all chosen hyperplanes and assigned to the branch corresponding to the one they match with the largest margin [33].
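To make one of these concrete, the following is a minimal sketch of the perceptron split (item 7): the classes are grouped into two super-classes by proximity of their means, and a separating hyperplane is then sought with fixed-increment perceptron updates. This is an illustrative reconstruction of the description above, not the author's code.

import numpy as np

def perceptron_split(X, y, n_passes=20):
    """Derive a (w, b) hyperplane for a binary split of the data at a node."""
    classes = np.unique(y)
    means = np.stack([X[y == c].mean(axis=0) for c in classes])
    # Find the two class means that are farthest apart ...
    d = np.linalg.norm(means[:, None, :] - means[None, :, :], axis=2)
    a, b = np.unravel_index(d.argmax(), d.shape)
    # ... and assign every class (hence every point) to the nearer of the two.
    to_a = np.linalg.norm(means - means[a], axis=1) <= np.linalg.norm(means - means[b], axis=1)
    target = np.where(to_a[np.searchsorted(classes, y)], 1.0, -1.0)
    # Fixed-increment perceptron training on the two groups.
    w, bias = np.zeros(X.shape[1]), 0.0
    for _ in range(n_passes):
        for x_k, t in zip(X, target):
            if t * (w @ x_k + bias) <= 0:      # misclassified: add/subtract sample
                w, bias = w + t * x_k, bias + t
    return w, bias                              # branch on the sign of w @ x + bias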

Fig. 4 shows the accuracies of the forests that use these eight functions. There are 50 trees in each forest. In each forest, the subspace changes at each split. As expected, the effects of splitting functions are data-dependent, and there is no universal optimum. Nevertheless, the improvements in the forest accuracy with increases in the number of trees follow a similar pattern across different functions and different datasets. This demonstrates the validity of the forest construction method and its independence of the splitting functions.

TABLE 2
AVERAGE TREE AGREEMENT OF FORESTS CONSTRUCTED BY EACH METHOD

construction method | dna    | letter | satimage | shuttle
random subspaces    | 0.7540 | 0.6595 | 0.8228   | 0.9928
bootstrapping       | 0.9081 | 0.8197 | 0.8294   | 0.9998
boosting            | 0.8969 | 0.7985 | 0.8205   | 0.9996


6.3.1 Tree Agreement

Table 3 shows the average tree agreement (averaged over all 1,225 (50 × 49/2) pairs of 50 trees) for the eight splitting functions and the four datasets, estimated using the testing samples. The absolute magnitude of the measure is data dependent. But the relative magnitudes for the same dataset reveal an interesting pattern. Namely, weaker agreement is observed for forests built with unsupervised clustering, and stronger agreement is observed for the forests built with maximum gain ratio (except for the “satimage” data). The forests built with support vector machines are found to have both the strongest and weakest agreement depending on the data set. There are significant data-dependent differences in the relative order among the estimates given by different splitting functions. Once again, this suggests there are no universally optimal splitting functions.

6.4 Comparison Among Different Numbers of Random Features

In using the subspace method, one important parameter to be determined is how many features should be selected in each split. When the splitting function uses a single feature, the evaluation of possible splits is constrained within only the selected features. In other cases, the hyperplanes are functions of the selected features, so that the number of random features used could affect the results significantly.

Fig. 4. Comparison of test set accuracies with different splitting functions.

TABLE 3
AVERAGE TREE AGREEMENT WITH DIFFERENT SPLITTING FUNCTIONS

splitting function  | dna    | letter | satimage | shuttle
gain ratio          | 0.8804 | 0.7378 | 0.7980   | 0.9975
dist. mapping       | 0.7728 | 0.5861 | 0.8188   | 0.9864
centroid            | 0.7143 | 0.6593 | 0.8310   | 0.9975
unsup. clustering   | 0.5337 | 0.5701 | 0.8110   | 0.9885
sup. clustering     | 0.6137 | 0.6320 | 0.8378   | 0.9812
c. axis projection  | 0.8368 | 0.6481 | 0.8440   | 0.9911
perceptron          | 0.7913 | 0.5758 | 0.8200   | 0.9950
support vectors     | 0.7907 | 0.4827 | 0.9205   | 0.8720


Fig. 5. Comparison of “dna” test set accuracies with different numbers of random features: 90 (1/2 of all), 45 (1/4), 23 (1/8), 12 (1/16), and 135 (3/4).


I compared the effects of different numbers of features using the “dna” dataset that has 180 features. Each of the same eight splitting functions was used to construct a forest of 50 trees. Again, in each forest the subspace changes at each split. The results show that the effects are stronger on some splitting functions than others (Fig. 5). In particular, the split by maximum gain ratio is less sensitive to the subspace dimensionality. Nevertheless, it appears that using half of the features resulted in the best or very close to the best combined accuracies with all splitting functions.

6.4.1 Tree Agreement

Table 4 shows the average tree agreement for the eight splitting functions and different numbers of random features, estimated using the testing set (“dna” data). Again, each estimate is averaged over all 1,225 (50 × 49/2) pairs of 50 trees. The agreement generally increases with the number of random features, and the relative ordering among different splitting functions is largely consistent. Trees built with maximum gain ratio are the most similar to each other, and those built with unsupervised clustering are the least similar.

It is important to note that forest accuracy is affected by both the individual tree accuracy and the agreement between the trees. This means optimizing on either factor alone does not necessarily result in the best forest accuracy. For instance, for the “dna” data, although the clustering methods give trees with the weakest agreement, their individual accuracies and thus the forest accuracies are not as good as those obtained by, say, splits by maximum gain ratio. But recall also that although the training set subsampling methods produce better individual trees, they are so similar to each other that the forest accuracies are not as good as those obtained by the subspace method. Ideally, one should look for the best individual trees with the lowest similarity. But exactly how this dual optimization can be done with an algorithm remains unclear.

The tree agreement measure can be used to order the trees in the forest, so that the most dissimilar trees are evaluated first. This could be useful in practice to obtain a better speed/accuracy tradeoff, that is, to use the least number of trees to achieve a certain accuracy. Fig. 6 compares the accuracy gains obtained when the trees are sorted by increasing agreement and when they are unsorted. The forest was constructed for the “dna” data using random halves of the features and the maximum gain ratio as the splitting function.
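A minimal sketch of one such ordering, given a precomputed pairwise agreement matrix (for example from the tree_agreement sketch in Section 5); the greedy strategy shown is an illustrative choice, since the paper does not spell out the ordering algorithm.

import numpy as np

def order_trees_by_dissimilarity(forest, agreement):
    """Greedy ordering: start from an arbitrary tree, then repeatedly append the
    tree with the lowest mean agreement with the trees already selected, so the
    most dissimilar trees are evaluated first."""
    agreement = np.asarray(agreement)
    order, remaining = [0], set(range(1, len(forest)))
    while remaining:
        nxt = min(remaining, key=lambda j: agreement[j, order].mean())
        order.append(nxt)
        remaining.remove(nxt)
    return [forest[i] for i in order]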

7 EXPLOITING REDUNDANCY IN DATA

The Statlog datasets I have used in previous experiments either have a large number of features or a large number of training samples. For some problems, this may not be true. In this section, I discuss the behavior of the subspace method on a collection of datasets that have a larger variety in size and feature dimensionality. I chose to use datasets from the University of California at Irvine machine learning database that have at least 500 samples and have no missing values. I again used the C4.5 package to construct all the trees.

Since there are no separate training and testing sets for each problem, I employed a two-fold cross-validation procedure. For each problem, I split the dataset randomly into two disjoint halves, constructed a forest with one half and tested it on the other half. Then the training and testing sets were swapped and the run repeated. I performed this procedure ten times for each dataset. In reporting the test set accuracies, I deleted the outliers, i.e., the runs with the highest and lowest accuracies, and reported the average of the remaining eight runs. This was done to avoid misleading results due to occasional bad sampling on some very small datasets, such as missing an entire class in training or testing.
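A minimal sketch of this evaluation protocol, assuming a caller-supplied train_and_test(X_train, y_train, X_test, y_test) function that returns a test set accuracy; averaging the two folds of a run into one accuracy before trimming is my reading of the description above.

import numpy as np

def repeated_twofold_accuracy(X, y, train_and_test, n_runs=10, seed=0):
    """Each run: split the data into two random halves, train on one and test on
    the other, then swap; drop the best and worst runs and average the rest."""
    rng = np.random.default_rng(seed)
    run_accs = []
    for _ in range(n_runs):
        perm = rng.permutation(len(y))
        half = len(y) // 2
        a, b = perm[:half], perm[half:]
        acc = (train_and_test(X[a], y[a], X[b], y[b]) +
               train_and_test(X[b], y[b], X[a], y[a])) / 2.0
        run_accs.append(acc)
    return float(np.mean(sorted(run_accs)[1:-1]))   # trim the two outlier runs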

TABLE 4
TREE AGREEMENT FOR DIFFERENT NUMBERS OF RANDOM FEATURES (“DNA” DATA)

splitting function  | 12     | 23     | 45     | 90     | 135
gain ratio          | 0.7070 | 0.7773 | 0.8498 | 0.8804 | 0.8849
dist. mapping       | 0.5855 | 0.6555 | 0.7216 | 0.7728 | 0.7530
centroid            | 0.6168 | 0.6385 | 0.6717 | 0.7143 | 0.6567
unsup. clustering   | 0.4618 | 0.4723 | 0.4862 | 0.5337 | 0.5451
sup. clustering     | 0.5089 | 0.5253 | 0.5507 | 0.6137 | 0.5571
c. axis projection  | 0.6412 | 0.7082 | 0.7714 | 0.8368 | 0.8224
perceptron          | 0.6174 | 0.6811 | 0.7313 | 0.7913 | 0.7968
support vectors     | 0.4985 | 0.5198 | 0.6281 | 0.7907 | 0.8144

Fig. 6. Comparison of test set accuracies between forests with sorted and unsorted trees.


Fig. 7. Comparison of test set accuracies of forests and single trees.


Among these datasets, the “letter” and “splice” data also belong to the Statlog collection, but instead of being split into fixed and separated training and testing sets, they were used in the same way as the others in the cross-validation procedure. For those datasets that have categorical variables, have very few samples, or have very few feature dimensions, I included the cross-products of all pairs of features in random subspace selection. Table 5 shows the number of samples, classes, and features as well as the data types for each dataset.

TABLE 5
CHARACTERISTICS OF DATASETS USED IN COMPARISON OF METHODS

data set     | no. of samples | no. of features | no. of classes | feature type(s)
letter       | 20,000         | 16              | 26             | numeric
splice       | 3,190          | 60              | 3              | categorical
vehicle      | 846            | 18              | 4              | numeric
car          | 1,728          | 6               | 4              | mixed
wdbc         | 569            | 30              | 2              | numeric
kr-vs-kp     | 3,196          | 36              | 2              | categorical
lrs          | 531            | 93              | 10             | numeric
german       | 1,000          | 24              | 2              | numeric
segmentation | 2,310          | 19              | 8              | numeric
nursery      | 12,961         | 8               | 5              | mixed
abalone      | 4,177          | 8               | 29             | mixed
pima         | 768            | 8               | 2              | numeric
tic-tac-toe  | 958            | 9               | 2              | categorical
yeast        | 1,484          | 8               | 10             | numeric

Fig. 7 shows the test set accuracies of decision forests constructed by the random subspace method, bootstrapping, and boosting, together with those of single C4.5 tree (both pruned and unpruned) classifiers. From these plots, a few observations can be made:

1) For all datasets, the decision forests are more accurate than both pruned and unpruned single-tree classifiers, and, in most cases, there are large differences.

2) While all forests are similar in their behavior, namely, accuracies increase with the number of trees, those built by bootstrapping or boosting tend to be in closer competition, while in some cases, those built by the random subspace method follow a different trend.

3) The subspace method is better in some cases, about the same in other cases, or worse in yet other cases when compared to the other two forest-building methods.

4) The effect of data types, i.e., whether the features are numeric, categorical, or mixed, is not apparent.



To obtain some hints on when the subspace method is better, in Table 5, I have arranged the entries in the same order as the plots in Fig. 7. For the datasets near the beginning of the table, the subspace method performs the best among the three forest-building methods. For those toward the end of the table, the subspace method does not have an advantage over the other two methods. From this arrangement and the plots, it should be apparent that the subspace method is best when the dataset has a large number of features and samples, and that it is not good when the dataset has very few features coupled with a very small number of samples (like the datasets “pima,” “tic-tac-toe,” or “yeast”) or a large number of classes (“abalone”).

For most other datasets, the three methods are in a close neighborhood of one another. Therefore, I expect that the subspace method is good when there is a certain redundancy in the dataset, especially in the collection of features. This makes the method especially valuable for tasks involving low-level features, such as in image recognition (e.g., [1]) and in other domains of signal processing. For the method to work on other tasks, redundancy needs to be introduced artificially using simple functions of the features.

8 CONCLUSIONS

I described a method for systematic construction of a decision forest. The method relies on a pseudorandom procedure to select components of a feature vector, and decision trees are generated using only the selected feature components. Each tree generalizes classification to unseen points in different ways by invariances in the unselected feature dimensions. Decisions of the trees are combined by averaging the estimates of posterior probabilities at the leaves.

Experiments were conducted using a collection of publicly available datasets. Accuracies were compared to those of single trees constructed using the same tree construction algorithm but with all the samples and full feature vectors. Significant improvements in accuracy were obtained using our method. Furthermore, it is clear that as the forests grow in complexity (measured in the number of trees), their generalization accuracy does not decrease, while maximum accuracies on the training sets are preserved. The forest construction method can be used with any splitting function. Eight splitting functions were implemented and tested. Though there are data dependent differences of accuracies, the improvements in accuracy with increases in the number of trees follow the same trend. The effects of the number of random features to be used were also investigated. It was shown that in the chosen example using half of the feature components yielded the best accuracy. Finally, when compared to two training set subsampling methods for forest building on fourteen datasets, the subspace method was shown to perform better when the dataset has a large number of features and not too few samples. The method is expected to be good for recognition tasks involving many redundant features.

ACKNOWLEDGMENTS

The author would like to thank Linda Kaufman, who provided the code for constructing support vector machines, and Don X. Sun for helpful discussions.

Parts of this paper have appeared in [11] and [12].

REFERENCES

[1] Y. Amit, D. Geman, and K. Wilder, "Joint Induction of Shape Features and Tree Classifiers," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 19, no. 11, pp. 1,300-1,305, Nov. 1997.

[2] R. Berlind, "An Alternative Method of Stochastic Discrimination With Applications to Pattern Recognition," doctoral dissertation, Dept. of Mathematics, State Univ. of New York at Buffalo, 1994.

[3] L. Breiman, J.H. Friedman, R.A. Olshen, and C.J. Stone, Classification and Regression Trees. Belmont, Calif.: Wadsworth, 1984.

[4] L. Breiman, "Bagging Predictors," Machine Learning, vol. 24, pp. 123-140, 1996.

[5] G.R. Dattatreya and L.N. Kanal, "Decision Trees in Pattern Recognition," L.N. Kanal and A. Rosenfeld, eds., Progress in Pattern Recognition 2. Amsterdam: Elsevier, 1985.

[6] Y. Freund and R.E. Schapire, "Experiments With a New Boosting Algorithm," Proc. 13th Int'l Conf. Machine Learning, pp. 148-156, Bari, Italy, 3-6 July 1996.

[7] D. Heath, S. Kasif, and S. Salzberg, "Induction of Oblique Decision Trees," Proc. 13th Int'l Joint Conf. Artificial Intelligence, vol. 2, pp. 1,002-1,007, Chambery, France, 28 Aug.-3 Sept. 1993.

[8] S. Murthy, S. Kasif, and S. Salzberg, "A System for Induction of Oblique Decision Trees," J. Artificial Intelligence Res., vol. 2, no. 1, pp. 1-32, 1994.

[9] D. Heath, S. Kasif, and S. Salzberg, "Committees of Decision Trees," B. Gorayska and J.L. Mey, eds., Cognitive Technology: In Search of a Humane Interface, pp. 305-317. New York: Elsevier Science, 1996.

[10] T.K. Ho, "Recognition of Handwritten Digits by Combining Independent Learning Vector Quantizations," Proc. Second Int'l Conf. Document Analysis and Recognition, pp. 818-821, Tsukuba Science City, Japan, 20-22 Oct. 1993.

[11] T.K. Ho, "Random Decision Forests," Proc. Third Int'l Conf. Document Analysis and Recognition, pp. 278-282, Montreal, Canada, 14-18 Aug. 1995.

[12] T.K. Ho, "C4.5 Decision Forests," Proc. 14th Int'l Conf. Pattern Recognition, Brisbane, Australia, 17-20 Aug. 1998.

[13] T.K. Ho and H.S. Baird, "Perfect Metrics," Proc. Second Int'l Conf. Document Analysis and Recognition, pp. 593-597, Tsukuba Science City, Japan, 20-22 Oct. 1993.


[14] T.K. Ho and H.S. Baird, "Pattern Classification With Compact Distribution Maps," Computer Vision and Image Understanding, vol. 70, no. 1, pp. 101-110, Apr. 1998.

[15] T.K. Ho, J.J. Hull, and S.N. Srihari, "Decision Combination in Multiple Classifier Systems," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 16, no. 1, pp. 66-75, Jan. 1994.

[16] T.K. Ho and E.M. Kleinberg, "Building Projectable Classifiers of Arbitrary Complexity," Proc. 13th Int'l Conf. Pattern Recognition, pp. 880-885, Vienna, 25-30 Aug. 1996.

[17] E.M. Kleinberg, "Stochastic Discrimination," Annals of Mathematics and Artificial Intelligence, vol. 1, pp. 207-239, 1990.

[18] E.M. Kleinberg, "An Overtraining-Resistant Stochastic Modeling Method for Pattern Recognition," Annals of Statistics, vol. 4, no. 6, pp. 2,319-2,349, Dec. 1996.

[19] S.W. Kwok and C. Carter, "Multiple Decision Trees," R.D. Shachter, T.S. Levitt, L.N. Kanal, and J.F. Lemmer, eds., Uncertainty in Artificial Intelligence, vol. 4, pp. 327-335. New York: Elsevier Science Publishers, 1990.

[20] J. Mingers, "Expert Systems—Rule Induction With Statistical Data," J. Operational Res. Soc., vol. 38, pp. 39-47, 1987.

[21] J. Mingers, "An Empirical Comparison of Selection Measures for Decision-Tree Induction," Machine Learning, vol. 3, pp. 319-342, 1989.

[22] N.J. Nilsson, Learning Machines: Foundations of Trainable Pattern-Classifying Systems. New York: McGraw-Hill, 1965.

[23] Y. Park and J. Sklansky, "Automated Design of Multiple-Class Piecewise Linear Classifiers," J. Classification, vol. 6, pp. 195-222, 1989.

[24] Project StatLog, LIACC, Univ. of Porto, internet address: ftp.ncc.up.pt (directory: pub/statlog/datasets).

[25] J.R. Quinlan, "Induction of Decision Trees," Machine Learning, vol. 1, pp. 81-106, 1986.

[26] J.R. Quinlan, C4.5: Programs for Machine Learning. San Mateo, Calif.: Morgan Kaufmann, 1993.

[27] J.R. Quinlan, "Bagging, Boosting, and C4.5," Proc. 13th Nat'l Conf. Artificial Intelligence, pp. 725-730, Portland, Ore., 4-8 Aug. 1996.

[28] I.K. Sethi and G.P.R. Sarvarayudu, "Hierarchical Classifier Design Using Mutual Information," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 4, no. 4, pp. 441-445, July 1982.

[29] I.K. Sethi and J.H. Yoo, "Structure-Driven Induction of Decision Tree Classifiers Through Neural Learning," Pattern Recognition, vol. 30, no. 11, pp. 1,893-1,904, 1997.

[30] S. Shlien, "Multiple Binary Decision Tree Classifiers," Pattern Recognition, vol. 23, no. 7, pp. 757-763, 1990.

[31] S. Shlien, "Nonparametric Classification Using Matched Binary Decision Trees," Pattern Recognition Letters, vol. 13, pp. 83-87, Feb. 1992.

[32] P. Turney, "Technical Note: Bias and the Quantification of Stability," Machine Learning, vol. 20, pp. 23-33, 1995.

[33] V. Vapnik, The Nature of Statistical Learning Theory. New York: Springer-Verlag, 1995.

Tin Kam Ho received her PhD from the State University of New York at Buffalo in 1992. Her thesis was a pioneering study on classifier combination and segmentation-free word recognition. Dr. Ho was with the Center of Excellence for Document Image Analysis and Recognition at SUNY at Buffalo from 1987 to 1992. Since 1992, she has been a member of technical staff in the Computing Sciences Research Center at Bell Laboratories. Her research interests are in practical algorithms for pattern recognition. She has published actively on decision forests, distribution maps, stochastic discrimination, classifier combination, and many topics related to OCR applications. She has served as referee for IEEE Transactions on Pattern Analysis and Machine Intelligence, CVIU, IJDAR, PRL, JMIV, JEI, ICPR, ICDAR, IWGR, SDAIR, DAS, ICCPOL, and Annals of Statistics. She is a member of IEEE and the Pattern Recognition Society.