[IEEE 2013 International Joint Conference on Neural Networks (IJCNN 2013 - Dallas) - Dallas, TX, USA...

Cluster Oriented Ensemble Classifiers using Multi–Objective

Evolutionary Algorithm

Ashfaqur Rahman, Brijesh Verma

Abstract— In this paper, we present an application of Multi–Objective Evolutionary Algorithm (MOEA) for generating cluster oriented ensemble classifier. In our recently developed Non–Uniform Layered Cluster Oriented Ensemble Classifier (NULCOEC), the data set is partitioned into a variable number of clusters at different layers. Base classifiers are then trained on the clusters at different layers. The performance of NULCOEC is a function of the vector (layers,clusters) and the research presented in this paper investigates the implication of applying MOEA to generate NULCOEC. Accuracy and diversity of the ensemble classifier is expressed as a function of layers and clusters. A MOEA then searches for the combination of layers and clusters to obtain the non–dominated set of (accuracy,diversity). We have also obtained the results of single objective optimization (i.e. optimizing either accuracy or diversity) and compared them with the results of MOEA. The results show that the MOEA can improve the performance of ensemble classifier.

Keywords: ensemble classifier, genetic algorithm, multi-objective optimization

I. INTRODUCTION A set of individual classifiers are trained in a cooperative fashion in an ensemble classifier. The individual classifiers are called base classifiers. In order for an ensemble classifier to perform better than its base counterparts it is necessary that the base classifiers are accurate and at the same time they are complementary in terms of errors they make on the patterns. The term diversity [1] is commonly used to describe the complementary nature of learning of the base classifiers. It is thus required to train the base classifiers to achieve high accuracy and high diversity leading to a multi–objective optimization problem.

Diversity is achieved among the base classifiers during the construction of the ensemble classifier [2]–[5]. A group of works [7][16][18] obtains diversity among the base classifiers by manipulating the training set. Bagging [9] is a commonly used ensemble classifier

generation method where the training subsets are randomly drawn (with replacement) from the training set. Homogeneous base classifiers are trained on the subsets. The class chosen by most base classifiers is the considered to be the final verdict of the ensemble classifier. In Boosting [10], an informed approach is taken to construct data subsets for successive base classifiers. Each of the training examples is assigned a weight that determines how well the instance was classified in the previous iteration. The training data that are wrongly classified is included in the training subset for the next iteration. AdaBoost [11] is a more generalized version of boosting.

Clustering is an approach to generate training subsets for base classifiers. The clustering process identifies these natural subgroups [6]. Divide–and–conquer approach towards ensemble classifier generation [7] produces training subsets for the base classifiers by clustering the data set. As reported in [8] the performance of this approach in some cases is even worse than unique classifiers. This is due to the fact that a pattern can belong to one cluster only and as a result the decision can be obtained from a single classifier. The concept of diversity thus does not apply on to this approach and only one classifier is in fact trained on a pattern leading to poor classification performance.

This issue of lack of diversity was addressed by NULCOEC in [18] using overlapping clustering. Utilizing the fact that the outcome of the k–means clustering algorithm depends on the initialization of clustering parameters, the data set was independently partitioned times into variable number of clusters in layers. Classifiers were trained on the clusters at each layer. The decisions provided by the base classifiers trained on the non–uniform clusters at layers were fused to obtain the final verdict on the pattern. Single objective optimization was done in [18] and this leaves space for improvement by applying multi–objective optimization.

In this paper we have adopted a Multi–Objective Evolutionary Algorithm based approach to optimize NULCOEC. The optimal results in MOEA depend on the objective function. We have expressed accuracy and diversity as a function of layers and clusters. The objective function is a vector (accuracy,diversity) rather than a scalar. We have adopted Pareto set [11] to obtain a set of solutions as outcome of MOEA. The optimization algorithm is detailed in the following

Manuscript received February 26, 2013. This work was

supported by Commonwealth Scientific and Industrial Research Organization (CSIRO) and Central Queensland University (CQUni).

Ashfaqur Rahman is with Intelligent Sensing and Systems Laboratory (ISSL) in CSIRO. (Phone: +61362325536; E-mail: ashfaqur.rahman@ csiro.au).

Brijesh Verma is with Centre for Intelligent and Networked Systems at Central Queensland University. (E-mail: b.verma@ cqu.edu.au).

sections. We also compared the results of MOEA with that of single objective optimization (either accuracy or diversity).

II. NON-UNIFORM LAYERED CLUSTER ORIENTED ENSEMBLE CLASSIFIER

The Non–Uniform Layered Cluster Oriented Ensemble Classifier [18] is centred on the concept of clustering. The architecture of the ensemble classifier with the added multi–objective optimization step is presented in Fig. 1. The clusters are identified at the training step and classifiers are trained on the clusters. The best set of layers and clusters are obtained by the multi–objective optimization step. During prediction the closest cluster for test pattern is found at each optimal layer. At a layer the closest cluster is obtained by computing Euclidian distance between the test pattern and cluster centers. The classifier corresponding to the nearest cluster classifies the test pattern. During clustering some clusters are generated that contains patterns of a unique type. We do not train any classifier for these clusters. The unique class label of the patterns in that cluster is predicted for a test pattern that belongs to that cluster.

As the numbers of clusters are different at different layers it is called Non–Uniform Layered Cluster Oriented Ensemble Classifier or NULCOEC. The performance of NULCOEC depends on a number of parameters including number of layers and number of clusters at layer where 1 . The technique used in this paper for optimizing and is presented next.

III. PARAMETER OPTIMIZATION USING SINGLE AND MULTIPLE OBJECTIVES

The main aim of designing an ensemble classifier is to obtain better classification performance. It is thus natural to search for and that maximizes the accuracy of the training set. Maximum accuracy of the training set not necessarily always guarantee that diversity is achieved among the base layer classifiers. It is thus worth investigating the and that optimizes diversity. Although it is argued by several researchers that diversity increases classification accuracy it is also worth investigating how the combination of both performs.

A. Search Space Generation Before applying any search algorithm it is required to setup a search space. We use hierarchical clustering algorithm for this purpose. The problem of applying k–means clustering algorithm in this regard is that the cluster contents can change for same k depending on the initialization of the cluster centers. In our case we build the search space by clustering the data set and training classifiers separately on the cluster. At the leaves each pattern is put in a single cluster. This represents the layer where there are a total of |Γ| clusters.

In the next level two nearest clusters are merged and this represents the layer with |Γ| 1 clusters. The cluster merging process repeats at each level until it reaches the level that represents the layer with only one cluster.

Fig. 1. Architecture of the Non–Uniform Layered Cluster Oriented Ensemble Classifier.

Note that here the search space means a set of trained base classifiers on non–uniform clusters at different layers. Each level in the tree represents a possible contender layer in the ensemble classifier. At level |Γ| each example in the training set is in a separate cluster whereas at level one all the examples in the training set are in a single cluster. The search algorithm looks for a set of levels from this tree so that their combination maximizes the optimization criteria.

B. Single Objective Optimization Now given |Γ| examples in the training set and the total number of clusters can vary from 1 to |Γ|. The search space is proportional to |Γ| where is the number of layers. We have used Evolutionary Algorithm (EA) to perform this search. EA performs the optimal search using two operators – crossover and mutation. The crossover operator has the effect of merging solutions whilst preserving the already successful solutions. The mutation operator makes sure that EA does not get stuck to local minima. EA encodes the solution in a string of binary digits. Each binary digit is known as a gene and the string of genes is known as a chromosome. The chromosomes used in the proposed technique have a length of |Γ| and each gene in the chromosome represents a possible level in the search tree. A value of one for the –th gene represents the selection of a level in the search space whereas a zero value indicates elimination.

In order to find the optimal level sets, EA starts with a large set (population) of randomly generated chromosomes. Each chromosome in the current population is decoded by selecting the levels corresponding to genes that have the value one. Classification of the training set is performed by combining the decisions from these levels on the training patterns. Depending on the use of objective function the fitness computation and the generation of new population set will vary. At this stage we introduce the single and multi objective functions.

The single objective optimization algorithm computes either accuracy or diversity. The objective is to maximise either of these measures. The fitness of a chromosome is set to accuracy or diversity. A higher value of a measure indicates higher fitness of a chromosome. A set of chromosomes, whose fitness is better than the remaining chromosomes in the current population, are used to fill in a mating pool for subsequent crossover and mutation operations. We have performed Elitism by keeping the best chromosome over the generations in the mating pool. A mutation operator is then applied on the new chromosomes where every gene (bit) in a chromosome is altered with a certain probability. These newly generated chromosomes constitute the GA population for the next generation. GA stops after a fixed number of generations.

C. Multi Objective Evolutionary Algorithm MOEA uses the same generic algorithm of EA. The difference is in multiple objectives [11][13] i.e. in MOEA the objective is a vector rather than a scalar. Thus comparison of solutions and fitness function in EA needs to be modified. We cannot apply operators like ‘greater than’ or ‘less than’ on two probable vector solutions. As a result we will get a set of possible solutions of equivalent quality. We can explain this situation with an example. Consider the outcome of an ensemble classifier represented by a vector of two entries. The first entry in the vector represents accuracy and the second represents diversity. A solution (90,0.1) represents the fact that the probable ensemble classifier achieves an accuracy of 90% with diversity 0.1 among the base classifiers. Now let’s compare two solutions: (90,0.1) and (88,0.12). The first solution achieves better accuracy whereas the second solution achieves better diversity. They are thus two possible solutions of equivalent quality.

It is evidenced from the above discussion that multi–dimensional search spaces are partially ordered whereas scalar spaces are fully ordered. At this stage we would like to introduce the concept of dominance: a way to relate two different multi–dimensional solutions. Let the following optimization problem be addressed:

, , , . (1)

Here represents accuracy, represents diversity, represents layers, and represents clusters. Let be the set of possible solutions. We develop the following preliminaries:

(i) A solution vector is said to dominate another solution vector if and only if:

. (2)

(ii) A solution vector is said to cover another solution vector if and only if:

. (3)

(iii) A solution vector is said to be non–dominated with respect to set of probable solution if and only if there exists no solution vector in that dominates .

Based on the above preliminaries we can define Pareto optimal solution:

“A solution vector is called Pareto optimal if and only if is non dominated with respect to the whole solution space . The set of possible solutions is called Pareto–front.”

Fig. 2 presents a genetic algorithm based search approach to find the Pareto–front. The chromosome structure is same as single objective optimization where each gene in the chromosome represents a possible level in the search tree. The chromosomes have a length of |Γ|. A value of one for the –th gene represents the

selection of a level in the search space whereas a zero value indicates elimination. Decoding a chromosome will thus reveal the active layers in the ensemble. The fitness function, the selection process for filling the mating pool and mutation process used in algorithm presented in Fig. 2 are described next.

The fitness of solution in is computed in two steps. In the first step a ranking is obtained for the members in

. In the second step the member in are assigned fitness values. Let be the size of and be the number of individuals in that are covered by . The strength of is set to ; The fitness of an individual in is calculated by summing the strengths of individuals in that covers . Let represents the summation of strengths. The fitness is set to 1 . The one is added to make sure that the individuals in has better fitness than those in . Note that fitness is to be minimized.

The mating pool is filled with individuals from the set . A pair of chromosomes is selected randomly at each instance. The one with better fitness is kept. This process is repeated until the entire mating pool is filled in. During mutation each bit in the chromosome is altered with certain probability to generate an offspring for the new population.

IV. EXPERIMENTAL PLATFORM We have conducted a number of experiments on benchmark data sets to verify the strength of the proposed ensemble classifier. We have compiled the datasets as used in contemporary research works from the UCI Machine Learning Repository [14]. A summary of the data sets is presented in Table 1. We used the hierarchical clustering algorithm in all the experiments.

We have computed accuracy as the percentage of correctly classified instances in a data set. We have computed diversity of the ensemble classifier using

Kohavi–Wolpert (KW) variance [16][18]. Given a set of |Γ| examples , , , , … , | |, | | in the training set, KW variance over L layers is computed as

| | ∑ ∑ ∑| | (4)

where is the number of layers, and is set as

1 0 (5)

KW variance increases as the diversity (i.e. disagreement) among the base classifiers increases. As shown in Eq. (4) the value of KW depends on the product of and . If all the classifiers agree on the same class equals and thus the product of and is zero. For a pattern the product reaches its maximum when half of classifiers correctly classify it.

Both single and multi–objective optimization algorithm was deployed with the following set of parameters: (i) population size = 50, and (ii) mutation probability 0.5. The classification results using bagging and boosting are compared with the proposed approach and the results are obtained using WEKA [15] on identical training and test sets. Note that we have used the same training and test sets for all the ensemble algorithms. Also the training and test sets were created randomly without any bias towards any algorithm.

We have used SVM as the base classifier in NULCOEC. We thus used SVM as the base classifier in all ensemble classifiers. We have used the radial basis kernel for SVM and the libsvm library [16] in all the experiments. The g in RBF was set to 1 _ _ for bagging, boosting and NULCEOC. All the experiments on the proposed method were conducted on MATLAB 7.5.0.

Function MultiObjectiveEvolutionaryAlgorithm

1. Generate random initial population ; 2. ′ // The empty external set of non–dominated solutions; 3. WHILE GenerationNo<MaximumGenerations 4. FOR each chromosome in 5. Decode each chromosome to identify the set of active layers; 6. Generate ensemble classifier based on the layers; 7. Compute (accuracy,diversity) of the ensemble classifier; 8. END 9. Copy non–dominated solution vectors in to ′;

10. Remove solutions in ′ that are covered by any other member of ′; 11. Calculate fitness of each individual in and ′; 12. Select individuals from ′, until the mating pool is filled; 13. Perform mutation on the members of the mating pool and create new population ; 14. GenerationNo= GenerationNo+1; 15. END

Fig. 2. Algorithm for finding the Pareto front

Table 1. Data sets used in the experiments. Dataset # instances # attributes # classes Glass 214 10 7 Haberman 306 3 2 Iris 150 4 3 Liver 345 6 2 Parkinsons 197 23 2 Sonar 208 60 2 Spect 267 23 2 Thyroid 215 5 3

V. RESULTS AND DISCUSSION Fig. 3 presents the Pareto front obtained for the datasets in Table 1. For the sake of easy visualization both accuracy and diversity are scaled in the range of 0 to 1. Only single solutions are obtained for Glass and Thyroid data set. A total of seven solutions are obtained for Liver and Spect data set. It is clearly evidenced from the graphs that no solution in the Pareto front is dominated by another.

We have compared the results of multi–objective optimization with that of single objective optimization. Table 2 presents the best set accuracies obtained using single objective (accuracy or diversity) and multi–objective optimization. Note the accuracy obtained using multi–objective optimization is better than its single objective counterpart. The multi–objective optimization performs 7.56% better than diversity optimization and 6.31% better than accuracy optimization. Table 3 presents best diversities obtained using different optimization criteria. In 6 out of 8 cases the multi–objective criteria perform better than the single objective criteria.

We also compared the results of the multi–objective optimization with that of bagging and boosting. Table 4 presents the Test set accuracies. Note that NULCOEC with multi–objective optimization beats bagging and boosting in all occasions. It performs 14.25% better than bagging and 11.81 % better than boosting.

Glass

Haberman

Liver

Parkinsons

Spect

Thyroid

Iris

Sonar

Fig. 3. Outcome of the multi–objective optimization algorithm. Both accuracy and diversity are scaled within the range of 0 to 1 with 1 representing maximum.

Table 2. Test set accuracy comparison between accuracy, diversity, and multi–objective optimality criteria. Boldface implies best performer.

Data Set Optimization Criteria Diversity Accuracy Multi–objective

Glass 90.00 90.91 92.73 Haberman 67.53 61.69 74.68 Iris 96.00 97.33 98.67 Liver 56.07 60.69 65.90 Parkinsons 89.69 91.75 92.78 Sonar 80.00 80.95 88.57 Spect 74.63 78.36 87.31 Thyroid 93.64 94.55 96.36

Table 3. Test set diversity comparison between accuracy, diversity, and multi–objective optimality criteria. Boldface implies best performer.

Data Set Optimization Criteria Diversity Accuracy Multi–objective


Table 4. A comparative analysis of test set accuracies between the proposed and existing ensemble classifier generation methods

Data Set Bagging Boosting (AdaBoost.M1)

NULCOEC (Multi-Objective)


V. CONCLUSIONS In this paper, we have presented application of multi–objective optimization method for generating cluster oriented ensemble classifiers. During ensemble generation, a data set is partitioned into variable number of clusters at different layers and base classifiers are trained on each cluster. Both accuracy and diversity depends on layers and clusters. A Pareto front based optimization approach is undertaken to keep a balance between accuracy and diversity by finding the best set of layers and clusters using genetic algorithm. Experimental results reveal multi–objective optimization performs 7.56% and 6.31% better than diversity optimization and accuracy optimization. While compared against existing techniques it performs 14.25% better than bagging and 11.81 % better than boosting. In future we aim to include additional

objectives like the true positive rate in the objective function to obtain better performance with imbalanced data sets.

REFERENCES [1] E. K. Tang, P. N. Suganthan and X. Yao, “An Analysis of

Diversity Measures,” Machine Learning, vol. 65, pp. 247–271, 2006.

[2] Z. H. Zhou, J. Wu, and W. Tang, “Ensembling neural networks: Many could be better than all,” Artificial Intelligence, vol. 137, pp. 239–263, 2002

[3] Y. Liu and X. Yao, “Simultaneous Training of Negatively Correlated Neural Networks in an Ensemble,” IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, 29(6):716–725, December 1999.

[4] K. Varshney, R. Prenger, T. Marlatt, B. Chen, and W. Hanley, “Practical Ensemble Classification Error Bounds for Different Operating Points,” IEEE Transactions on Knowledge and Data Engineering, 2012. [doi: 10.1109/TKDE.2012.219]

[5] H. Haibo and C. Yuan, “SSC: A Classifier Combination Method Based on Signal Strength,” IEEE Transactions on Neural Networks and Learning Systems, vol.23, no.7, pp. 1100–1117, July 2012.

[6] H. Parvin, H. Alizadeh, and B. Minaei–Bidgoli, “Using Clustering for Generating Diversity in Classifier Ensemble,” Int. Journal of Digital Content Technology and its Applications, vol. 3, no. 1, pp. 51–57, 2009.

[7] L. Rokach, O. Maimon, and I. Lavi, “Space Decomposition In Data Mining: A Clustering Approach”, International Symposium On Methodologies For Intelligent Systems, pp. 24–31, 2003.

[8] S. Eschrich, L. O. Hall “Soft partitions lead to better learned ensemble,” Annual Meeting of the North American Fuzzy Information Processing Society (NAFIPS), pp. 406–411, 2002.

[9] L. Breiman, “Bagging predictors,” Machine Learning, vol. 24, no. 2, pp. 123–140, 1996.

[10] R. E. Schapire, “The strength of weak learnability,” Machine Learning,” vol. 5, no. 2, pp. 197–227, 1990.

[11] Y. Freund and R. E. Schapire, “Decision-theoretic generalization of on-line learning and an application to boosting,” Journal of Computer and System Sciences, vol. 55, no. 1, pp. 119–139, 1997.

[12] I. S. Sbalzarini, S. Muller, and P. Koumoutsakos, “Multiobjective optimization using evolutionary algorithms,” in Proc. 2000 Summer Program. Stanford, CA, Nov. 2000.

[13] H. Chen and X. Yao, “Multiobjective Neural Network Ensembles Based on Regularized Negative Correlation Learning,” IEEE Transactions on Knowledge and Data Engineering, vol.22, no.12, pp.1738-1751, 2010

[14] UCI Machine Learning Database, http://archive.ics.uci.edu/ml/, accessed on 6th October 2009.

[15] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten, “The WEKA Data Mining Software: An Update,” SIGKDD Explorations, vol. 11, no. 1, 2009.

[16] LIBSVM, “A library for support vector machines,” http://www.csie.ntu.edu.tw/~cjlin/libsvm/, accessed on 10th February 2010.

[17] A. Rahman and B. Verma, “Novel Layered Clustering-Based Approach for Generating Ensemble of Classifiers,” IEEE Transactions on Neural Networks, vol. 22, no. 5, pp. 781–792, 2011.

[18] A. Rahman, B. Verma, and X. Yao, “Non–uniform Layered Clustering for Ensemble Classifier Generation and Optimality,” Neural Information Processing Theory and Algorithms, Lecture Notes in Computer Science, pp. 551–558, 2010.

[IEEE 2013 International Joint Conference on Neural Networks (IJCNN 2013 - Dallas) - Dallas, TX, USA...

Documents

Transcript of [IEEE 2013 International Joint Conference on Neural Networks (IJCNN 2013 - Dallas) - Dallas, TX, USA...