
2007 IEEE International Symposium on Signal Processing and Information Technology, Giza, Egypt, December 15-18, 2007

Novel Multiclass SVM-Based Binary Decision Tree Classifier

Hossam Osman
Department of Computer and Systems Engineering, Ain Shams University, Abbassia, Cairo, Egypt 11517

Phone: +(202) 3534 5160   Fax: +(202) 3539 2134   E-mail: hosman@mcit.gov.eg

Abstract - This paper proposes a novel algorithm for constructing multiclass SVM-based binary decision tree classifiers. The basic strategy of the proposed algorithm is to set the target values for the training patterns such that linear separability is always achieved and thus a linear SVM can be constructed at each non-leaf node. It is argued that replacing complex, nonlinear SVMs by a larger number of linear SVMs remarkably reduces training and classification times as well as classifier size without compromising classification performance. This is experimentally demonstrated through a comparative analysis involving the most efficient existing multiclass SVM classifiers, namely the one-against-rest and the one-against-one.

I. INTRODUCTION

SVMs have demonstrated remarkable efficiency in pattern classification [1]. On the other hand, since the SVM was originally developed for binary classification, its multiclass extension is still an active research topic [2]-[7]. Generally speaking, the approaches proposed so far share the issues of expensive computation, large classifier size, and relatively long training and classification times [3] [7].

Here, this paper proposes a novel algorithm that is iterative in nature. The algorithm constructs a binary decision tree classifier with a linear SVM residing at each non-leaf node. Many SVM-based binary decision tree classifiers exist in the literature [3] [4] [6] [7]. The main contribution of the proposed algorithm is the replacement of the complex, nonlinear SVMs that are traditionally employed by existing algorithms, and that must handle issues such as nonlinear separability, imperfect separation, higher-dimensional transformation, and kernel function evaluation, by a larger number of the much simpler linear SVMs. This is done by structuring the classification problem in a way that preserves linear separability. It is expected that a group of linear SVMs has shorter training times, faster classification decisions, and smaller size compared to a group of a smaller number of nonlinear SVMs. It should be emphasized that the proposed algorithm is equally suitable for binary and multiclass classification problems.

The rest of this paper is organized as follows: Section II describes the use of SVMs in classification. Section III describes the new proposed algorithm. Section IV gives the experimental results. Finally, Section V states the conclusion of the work.

II. SVM CLASSIFIERS

Assume a binary classification problem with a training set {x_i, t_i}_{i=1}^{N}, where x_i \in R^d is the ith d-dimensional training vector, t_i is the corresponding target value with t_i = +1 if x_i \in Class 1 and t_i = -1 if x_i \in Class 2, and N is the training set size. Suppose the training patterns of Class 1 are linearly separable from those of Class 2. An SVM finds the optimal separating hyperplane classifier [8]

f(x) = w \cdot x - b,    (1)

where

f(x_i) > 0 \quad \forall x_i \in Class 1,
f(x_i) < 0 \quad \forall x_i \in Class 2.    (2)

The classifier in (2) is optimal in the sense that it has the maximum separating margin with respect to the training patterns of the two classes. The SVM finds (1) by first determining the Lagrangian multipliers \alpha = {\alpha_i}_{i=1}^{N} that globally maximize the objective function

Q(\alpha) = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j t_i t_j (x_i \cdot x_j)    (3)

subject to the constraints

\sum_{i=1}^{N} \alpha_i t_i = 0,
\alpha_i \ge 0, \quad i = 1, 2, ..., N.    (4)

Then, the weight vector w is computed using [8]

w = \sum_{nonzero \alpha_i} \alpha_i t_i x_i,    (5)

and for any \alpha_i \ne 0, the threshold b is computed using

b = (1 - t_i (w \cdot x_i)) / t_i.    (6)

Typically, it is numerically safer to compute b by taking the mean of a group of values resulting from using (6) for some nonzero \alpha_i.
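To make (5) and (6) concrete, the following is a minimal sketch (in Python with NumPy) of how the primal classifier can be recovered once the dual multipliers of (3)-(4) are available. The function and variable names are illustrative and not from the paper; the threshold is chosen so that the margin condition holds on the support vectors and is averaged over them, as suggested after (6).

```python
import numpy as np

def linear_svm_from_dual(alpha, t, X):
    """Recover f(x) = w.x - b of (1) from the dual multipliers of (3)-(4).
    alpha: (N,) Lagrangian multipliers, t: (N,) targets in {+1, -1},
    X: (N, d) training patterns.  Illustrative sketch only."""
    sv = alpha > 1e-8                                   # the support vectors
    # (5): w is a weighted sum of the support vectors
    w = ((alpha[sv] * t[sv])[:, None] * X[sv]).sum(axis=0)
    # threshold chosen so that t_i (w.x_i - b) = 1 on the support vectors,
    # averaged over them for numerical safety (cf. the remark after (6))
    b = float(np.mean(X[sv] @ w - t[sv]))
    return w, b

def linear_decision(x, w, b):
    # sign of f(x) = w.x - b: +1 for Class 1, -1 for Class 2, as in (2)
    return 1 if np.dot(w, x) - b > 0 else -1
```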

Given that the training patterns are not linearly separable, (3) is rewritten as



Q(\alpha) = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j t_i t_j K(x_i, x_j),    (7)

where K(x_i, x_j) is a kernel function that satisfies Mercer's condition [8] and thus is equivalent to the dot product of its arguments after being mapped to a higher-dimensional space in which the training patterns are hopefully linearly separable. Here, the utilization of the kernel trick avoids the need for going through the probably unsolvable problem of explicitly determining the required mapping. Clearly, handling the nonlinearly-separable classification problem necessitates first selecting a form for the kernel function and secondly estimating the function parameters. Typically, this is a difficult, time-consuming task. In principle, there exist many forms for kernel functions, but the most popular is the Gaussian kernel, where traditionally the one parameter that needs to be estimated is the variance σ². It should be emphasized that here the transformation to a higher-dimensional space is not explicitly determined and thus the weight vector w cannot be computed using (5). On the other hand, in view of (5), the dot product of w and x in (1) and (6) can be replaced by

\sum_{nonzero \alpha_i} \alpha_i t_i K(x_i, x).    (8)

Clearly, the classifier for the nonlinearly-separable case is much more computationally expensive, slower, and larger in size than that of (1). Let k denote the number of nonzero \alpha_i, that is, the number of support vectors [8]. It is straightforward to notice that the nonlinear classifier has (2+d) x k parameters in addition to those of the kernel function, whereas the linear classifier has only d+1 parameters.
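As an illustration of the cost of (8), the following sketch evaluates the kernelized decision value with a Gaussian kernel. It is a hedged example only: the function names are not from the paper, and it simply makes visible that every support vector (\alpha_i, t_i, x_i) must be stored and a kernel evaluated for it, which is where the (2+d) x k parameter count comes from.

```python
import numpy as np

def gaussian_kernel(x_i, x, sigma2):
    # Gaussian (RBF) kernel with variance parameter sigma2
    return np.exp(-np.sum((x_i - x) ** 2) / (2.0 * sigma2))

def kernel_decision_value(x, sv_x, sv_alpha, sv_t, b, sigma2):
    """Evaluate (8) minus the threshold b: one kernel evaluation per support
    vector, so (2 + d) * k stored parameters versus d + 1 for the linear case."""
    s = sum(a * t * gaussian_kernel(x_i, x, sigma2)
            for a, t, x_i in zip(sv_alpha, sv_t, sv_x))
    return s - b
```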

One more extension to classification using an SVM is to allow for imperfect separation. The training patterns may not be linearly separable even after transformation to a higher-dimensional space. In this case, a user-specified cost term C is introduced, and the Lagrangian multipliers are now upper bounded by C rather than by infinity as in (4) [8]. This indicates that one more parameter has to be selected, but on the other hand the form of the solution is kept the same.

Up to this point, SVMs only handle 2-class pattern classification problems. As they are originally designed for 2-class problems, several techniques exist in the literature to enable their utilization in multiclass situations. The most popular and effective ones are the one-against-one (1-a-1) and the one-against-rest (1-a-r) [3]. For c-class problems, the 1-a-1 trains c x (c-1)/2 SVMs using all the binary pairwise combinations of the training patterns. The final classification decision adopts the most-voted strategy: given an input pattern x, the class that is voted for most by all trained SVMs is selected as the class of x. The 1-a-r approach trains c SVMs. The jth SVM is trained using the whole available training set after setting t_i of all patterns belonging to Class j to +1, and t_i of all the other patterns to -1. The final classification decision adopts the max-value strategy: Class j is selected for input pattern x if the jth SVM yields the maximum output when given x. It follows that compared to the 1-a-1, the 1-a-r is much slower to train but faster to classify. It has been demonstrated that both techniques yield unclassified regions [9], and in general the 1-a-1 outperforms the other SVM-based multiclass classification techniques existing in the literature [3] [7].
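The two decision rules just described can be summarized in a short sketch. It assumes a set of already-trained binary SVMs exposing a scikit-learn-style decision_function; the dictionaries and names are illustrative, not from the paper.

```python
def predict_one_against_one(x, pairwise_svms):
    """pairwise_svms maps a class pair (i, j) to a binary SVM whose decision
    value is positive for class i.  The most-voted class is returned."""
    votes = {}
    for (i, j), svm in pairwise_svms.items():
        winner = i if svm.decision_function([x])[0] > 0 else j
        votes[winner] = votes.get(winner, 0) + 1
    return max(votes, key=votes.get)

def predict_one_against_rest(x, per_class_svms):
    """per_class_svms[j] was trained with Class j as +1 and all others as -1.
    The class whose SVM yields the maximum output is selected."""
    return max(per_class_svms, key=lambda j: per_class_svms[j].decision_function([x])[0])
```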

III. PROPOSED ALGORITHM

Let E = {x_i, l_i}_{i=1}^{N} denote the labeled training set, where l_i = j if x_i \in Class j. Let S_m = {s_i, r_i}_{i=1}^{N_m} \subseteq E denote the set of training patterns that is used at the mth algorithm iteration to construct SVM_m, the mth linear SVM, which splits S_m into two partitions at a non-leaf tree node. Let n_m denote the number of classes to which the patterns of S_m belong. Let T_m = {t_i}_{i=1}^{N_m} denote the set of corresponding target values with t_i = +1 or -1. The main principle of the proposed algorithm is to always set T_m such that a linear SVM can separate the training patterns of S_m, since linearity implies faster training and testing speeds as well as smaller classifier sizes. Based upon these notations, the algorithm steps can be detailed as follows (an illustrative sketch of the steps is given after the list):

Step 1: Let m = 1.
Step 2: Let S_1 = E and n_1 = c.
Step 3: Compute the n_m class centers using

c_j = \frac{1}{k_j} \sum_{s_i \in Class y_j} s_i, \quad j = 1, ..., n_m,    (9)

where k_j is the number of s_i that belong to Class y_j.
Step 4: Merge the nearest two centers by averaging until only two resulting centers are left. Denote them by U and V. Here, nearness is measured using the known Euclidean distance.
Step 5: Determine T_m by setting t_i to +1 if s_i is nearer to U than to V, and setting t_i to -1 otherwise. This strategy produces linearly-separable training patterns.
Step 6: Apply SMO [10] to construct a linear SVM, SVM_m, using S_m and T_m. This appends a non-leaf node to the decision tree.
Step 7: SVM_m generates a binary decision at the non-leaf node that splits S_m into two partitions. Let P \subset S_m denote the set of the training patterns in S_m with target value = +1, the positive partition. Also, let N \subset S_m denote the set of the training patterns in S_m with target value = -1, the negative partition.
Step 8: Let S_{m+1} = P. Compute n_{m+1}. If n_{m+1} > 1, let m = m+1 and go to Step 3. Otherwise, append a leaf node to the decision tree and label that node with the class of the patterns in P.
Step 9: Let S_{m+1} = N. Compute n_{m+1}. If n_{m+1} > 1, let m = m+1 and go to Step 3. Otherwise, append a leaf node to the decision tree and label that node with the class of the patterns in N.
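The following is a minimal sketch of Steps 1-9 in Python. It is not the author's implementation: scikit-learn's SVC with a linear kernel and a large C stands in for the SMO-trained, hard-margin linear SVM of Step 6, degenerate cases are ignored, and all names are illustrative.

```python
import numpy as np
from sklearn.svm import SVC  # SMO-based solver; linear kernel stands in for Step 6

def two_group_targets(X, y):
    """Steps 3-5: class centers as in (9), merge the nearest pair of centers by
    averaging until two centers U, V remain, then t_i = +1 if s_i is nearer to U."""
    centers = [X[y == c].mean(axis=0) for c in np.unique(y)]
    while len(centers) > 2:
        pairs = [(i, j) for i in range(len(centers)) for j in range(i + 1, len(centers))]
        i, j = min(pairs, key=lambda p: np.linalg.norm(centers[p[0]] - centers[p[1]]))
        merged = (centers[i] + centers[j]) / 2.0
        centers = [c for k, c in enumerate(centers) if k not in (i, j)] + [merged]
    U, V = centers
    return np.where(np.linalg.norm(X - U, axis=1) < np.linalg.norm(X - V, axis=1), 1, -1)

def build_tree(X, y):
    """Steps 1-9: recursively split with linear SVMs until a partition holds one class."""
    if len(np.unique(y)) == 1:
        return {"leaf": True, "label": y[0]}            # Steps 8/9: append a leaf node
    t = two_group_targets(X, y)                         # Steps 3-5: set the targets T_m
    svm = SVC(kernel="linear", C=1e6).fit(X, t)         # Step 6: large C ~ hard margin
    return {"leaf": False, "svm": svm,                  # Step 7: split into P and N
            "pos": build_tree(X[t == 1], y[t == 1]),    # Step 8: recurse on P
            "neg": build_tree(X[t == -1], y[t == -1])}  # Step 9: recurse on N
```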

Clearly, the completed decision tree classifier has m non-leaf nodes with a linear SVM residing at each one. The algorithm does not require any parameter selection. It avoids the issues of imperfect separation and nonlinear separability. It solves the classification problem in its original space rather than considering transformation to a higher-dimensional space.



Figure 1 gives a sample output of the application of the proposed algorithm. The binary tree branches indicate the labels of the classes to which the patterns of the positive and negative partitions belong. It should be emphasized that a class label may coexist on both sides. It has been demonstrated that this kind of overlap improves the classification performance [4]. It should also be noticed that more than one leaf node may share the same class.

Since the algorithm keeps iterating until zero classification error is reached on the training set, overfitting may occur. Techniques for post-pruning decision trees, such as error-based pruning (EBP), have demonstrated effectiveness in handling this issue [11]. It is worth remembering that the proposed algorithm shares the advantages of all SVM-based decision tree classifiers. Specifically, with more iterations, the number of training patterns involved in the construction of the SVMs decreases rapidly, and thus shorter training times can be expected. In the testing phase, not all SVMs need to be evaluated. This implies faster classification decisions. Moreover, the problem of unclassified regions is completely surmounted [7].
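The testing-phase advantage mentioned above can be made concrete with a small traversal sketch that reuses the illustrative node dictionaries from the sketch in Section III: only the SVMs on a single root-to-leaf path are ever evaluated.

```python
def classify(node, x):
    """Walk one root-to-leaf path of the tree built by build_tree; SVMs in the
    other branches are never evaluated.  Returns the label and the number of
    SVM evaluations actually performed (illustrative only)."""
    evaluated = 0
    while not node["leaf"]:
        evaluated += 1
        node = node["pos"] if node["svm"].decision_function([x])[0] > 0 else node["neg"]
    return node["label"], evaluated
```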


Figure 1. Sample SVM-based binary decision tree classifier constructed using the proposed algorithm.

Figure 2. Three 4-class uniformly-distributed datasets used in experimentation. The 4 classes only overlap in (c).

IV. EXPERIMENTAL RESULTS

The datasets first used in experimentation are drawn in Figure 2. These are three 2-dimensional, 4-class, uniformly-distributed datasets. Clearly, the difficulty of the classification problem increases as we move from Figure 2(a) down to Figure 2(c). The training and test sets had 1000 patterns each. However, for the 1-a-1 and 1-a-r classifiers, the classification performance on 30% of the training patterns was first utilized to select good values for the needed parameters σ² and C. Then, the two classifiers were re-trained on the whole training set using the selected parameter values. The proposed algorithm was implemented as described in Section III. It should be remembered that no parameter selection is needed for the proposed algorithm.
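The paper does not spell out the selection procedure, so the following is only a hedged sketch of the kind of hold-out search described above for σ² and C; the candidate grids and names are assumptions, and scikit-learn's SVC is used purely as a convenient RBF-kernel SVM.

```python
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

def select_sigma2_and_C(X, y, sigma2_grid=(0.1, 1.0, 10.0), C_grid=(1.0, 10.0, 100.0)):
    """Hold out 30% of the training patterns and keep the (sigma2, C) pair with
    the best hold-out accuracy.  Grids are illustrative, not from the paper."""
    X_fit, X_val, y_fit, y_val = train_test_split(X, y, test_size=0.3, random_state=0)
    best, best_acc = None, -1.0
    for s2 in sigma2_grid:
        for C in C_grid:
            # scikit-learn's RBF kernel is exp(-gamma * ||u - v||^2), so gamma = 1 / (2 * sigma^2)
            acc = SVC(kernel="rbf", gamma=1.0 / (2.0 * s2), C=C).fit(X_fit, y_fit).score(X_val, y_val)
            if acc > best_acc:
                best, best_acc = (s2, C), acc
    return best
```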

It is worth mentioning that for the dataset of Figure 2(a), the new algorithm constructed three SVMs, which is the minimum number needed to linearly solve that problem. This situation is depicted in Figure 3. For each constructed classifier, the classification performance on the test set was determined. Table I summarizes the obtained results. It demonstrates that all algorithms yielded classification performance of similar quality. As intended, the new algorithm replaced the complex, nonlinear SVMs by a larger number of linear SVMs, yet its classifier size is still much smaller than those of the classifiers of the other two algorithms. As expected, much shorter training times were needed by the new algorithm because of the SVM linearity, the rapid decrease in the number of training patterns as we moved down the tree, and the absence of any parameter selection. Similarly, the new classifier had much shorter classification times, again because of linearity, because no complex kernel evaluations are required, and because only a small percentage of the SVMs of the tree classifier needed to be evaluated for each testing pattern.

Figure 3. SVMs constructed using the proposed algorithm for the dataset of Figure 2(a).

To assure the observations noticed above, a second experimentation was implemented using three benchmark datasets from the UCI database [12].



Table II describes these datasets. Here, due to the limited number of patterns, the 10-fold cross-validation strategy was employed. Each dataset was equally divided into ten parts, nine of which were used for training, whereas the remaining part was used for testing. This was done such that all data took part in testing. The obtained results, averaged over the ten folds, are given in Table III. Again, it should be remembered that for the 1-a-1 and 1-a-r classifiers, three of the nine training parts were first used to select good σ² and C. In view of Table III, observations similar to those noticed in the first experimentation can be drawn.

To address overfitting, EBP [11] was applied to the constructed decision trees. Table IV summarizes the obtained results. Generally speaking, EBP improved the classification performance and reduced the classifier size.
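For completeness, a minimal sketch of the 10-fold protocol described above follows; it uses scikit-learn's KFold, and build_classifier is a placeholder for any of the three classifiers compared in Table III.

```python
import numpy as np
from sklearn.model_selection import KFold

def ten_fold_accuracy(X, y, build_classifier):
    """Ten folds: nine parts train, one part tests, so every pattern is tested
    exactly once; the reported figure is the average accuracy over the folds."""
    scores = []
    for train_idx, test_idx in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
        clf = build_classifier(X[train_idx], y[train_idx])
        scores.append(np.mean(clf.predict(X[test_idx]) == y[test_idx]))
    return float(np.mean(scores))
```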

TABLE I
EXPERIMENTAL RESULTS FOR THE DATASETS IN FIGURE 2
CP: CLASSIFICATION PERFORMANCE, NS: NUMBER OF SVMS, AND CS: CLASSIFIER SIZE MEASURED AS NUMBER OF CLASSIFIER PARAMETERS

Dataset        1-a-1    1-a-r    New algorithm
(a)   CP       100%     100%     100%
      NS       6        4        3
      CS       114      84       9
(b)   CP       99%      98%      99%
      NS       6        4        41
      CS       310      532      123
(c)   CP       90.4%    90.2%    91%
      NS       6        4        178
      CS       1114     1840     534

TABLE II
BENCHMARK DATASETS USED IN EXPERIMENTATION

Dataset   Number of classes   Number of patterns   Pattern dimension
iris      3                   150                  4
wine      3                   178                  13
glass     6                   214                  9

TABLE III
EXPERIMENTAL RESULTS FOR THE DATASETS IN TABLE II

Dataset        1-a-1    1-a-r    New algorithm
iris   CP      97.33%   95.2%    97.33%
       NS      3        3        10
       CS      111      117      50
wine   CP      98.88%   98.88%   97.33%
       NS      3        3        8
       CS      678      543      112
glass  CP      72%      71%      74%
       NS      15       6        75
       CS      1269     1480     750

TABLE IV
PROPOSED ALGORITHM WITH THE APPLICATION OF ERROR-BASED PRUNING (EBP) FOR THE DATASETS IN TABLE II

Dataset        New algorithm    New algorithm with EBP
iris   CP      97.33%           98.62%
       NS      10               9
       CS      50               45
wine   CP      97.33%           98.88%
       NS      8                7
       CS      112              98
glass  CP      74%              74%
       NS      75               65
       CS      750              650

V. CONCLUSION

Nonlinear SVMs are inherently computationally expensive. They typically utilize a large number of support vectors, especially for classification problems with large datasets. Their sole utilization to solve multiclass classification problems complicates these issues even further and negatively impacts training and classification speeds. This paper addresses this problem by introducing a novel algorithm that replaces nonlinear SVMs by a larger number of linear ones in the framework of constructing binary decision tree classifiers. The resulting classifier has shorter training and classification times and is much smaller in size. Experimental results have demonstrated that this is achieved without compromising the classification performance.

REFERENCES

[1] S. Abe, Support Vector Machines for Pattern Classification. London: Springer Verlag, 2005.
[2] E. Allwein, R. Schapire, and Y. Singer, "Reducing Multiclass to Binary: A Unifying Approach for Margin Classifiers," Journal of Machine Learning Research, vol. 1, pp. 113-141, 2000.
[3] C. W. Hsu and C. J. Lin, "A Comparison of Methods for Multiclass Support Vector Machines," IEEE Trans. on Neural Networks, vol. 13, no. 2, pp. 415-425, 2002.
[4] S. Cheong, S. H. Oh, and S. Y. Lee, "Support Vector Machines with Binary Tree Architecture for Multi-Class Classification," Neural Information Processing - Letters and Reviews, vol. 2, no. 3, pp. 47-51, 2004.
[5] F. Aiolli and A. Sperduti, "Multiclass Classification with Multi-Prototype Support Vector Machines," Journal of Machine Learning Research, vol. 6, pp. 817-850, 2005.
[6] J. Yang, X. Yang, and J. Zhang, "A Parallel Multi-Class Classification Support Vector Machine Based on Sequential Minimal Optimization," Proceedings of IMSCCS, vol. 1, pp. 443-446, 2006.
[7] Z. Lu, F. Lin, and H. Ying, "Design of Decision Tree via Kernelized Hierarchical Clustering for Multiclass Support Vector Machines," Cybernetics and Systems: An International Journal, vol. 38, pp. 187-202, 2007.
[8] C. Burges, "A Tutorial on Support Vector Machines for Pattern Recognition," Data Mining and Knowledge Discovery, vol. 2, pp. 121-167, 1998.
[9] S. Abe and T. Inoue, "Fuzzy Support Vector Machines for Multiclass Problems," Proceedings of ESANN, pp. 113-118, 2002.
[10] J. C. Platt, "Fast Training of Support Vector Machines using Sequential Minimal Optimization," in Advances in Kernel Methods: Support Vector Learning, B. Scholkopf, C. J. Burges, and A. J. Smola, Eds. MA, USA: MIT Press, pp. 185-208, 1999.
[11] F. Esposito, D. Malerba, and G. Semeraro, "A Comparative Analysis of Methods for Pruning Decision Trees," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 19, no. 5, pp. 476-491, 1997.
[12] UCI, "ML Repository," 2007; mlearn.ics.uci.edu/MLRepository.html
