
A New Discrete PSO for Data Classification

Naveed Kazim Khan 1, A. Rauf Baig 2, Muhammad Amjad Iqbal 3
1, 2, 3 National University of Computer and Emerging Sciences, NU-FAST, Islamabad, Pakistan

[email protected], [email protected], [email protected]

Abstract: In this paper we present a new Discrete Particle Swarm Optimization approach to induce rules from discrete data. The proposed algorithm initializes its population by taking into account the discrete nature of the data. It assigns different fixed probabilities to the current, local best and global best positions, and based on these probabilities each member of the population updates its position iteratively. The performance of the proposed algorithm is evaluated on five different datasets and compared against nine different classification techniques. The algorithm produces promising results, creating highly accurate rules for each dataset.

Keywords— Particle Swarm Optimization (PSO), Discrete PSO, classification, rule list, probabilistic classifier

I. INTRODUCTION

Classification is considered one of the most critical decision-making tasks in a variety of application domains such as the human sciences, medical sciences and engineering. Classification schemes have been developed successfully for several applications such as medical diagnosis, credit scoring and speech recognition. In classification problems, a set of if-then rules is considered one of the most expressive and comprehensible representations of a learned hypothesis, and in many cases it is useful to learn the target function represented as a set of if-then rules that jointly define the function. When considering a rule induction algorithm and focusing on the encoding or representation of rules, there are two main approaches that researchers have applied: the Michigan approach and the Pittsburgh approach. If a Michigan-style encoding is applied, then in general either a rule precondition or a complete rule is evolved by the evolutionary algorithm. In the first case, a separate procedure is used to decide the rule postcondition before the rule evaluation, e.g. [1]. Alternatively, the class may already be specified for each rule precondition during the entire run of the algorithm: the evolutionary algorithm (EA) is run repeatedly, with each run focusing on evolving rule preconditions for a specific class, e.g. [3]. If a complete rule is encoded in an individual, then the rule postcondition may also be subject to evolution and change by the genetic operators [2]. In the Pittsburgh approach, rule preconditions are generally encoded along with their postconditions [4].

Another important issue is evaluating and comparing the quality of rule induction algorithms. The most widely used approach is batch mode, in which the set of examples is divided into a training set and a test set: the algorithm is required to generate rules from the training examples, and the validity of the resulting rule set is then measured on the test set with no further learning. The alternative to batch-mode testing is incremental mode, in which the algorithm is required to create a rule set from the examples seen so far and to use this rule set to classify the next incoming example. In this mode, learning never stops, and the predictive performance of the algorithm over time is measured in terms of learning curves.

The salient features of our work are as follows:

• We have proposed a new position update rule for the particles that assigns fixed probabilities to the current, local best and global best positions. As a result, particles share their information probabilistically.

• The proposed position update rule and encoding scheme allow each term associated with an attribute in the rule antecedent to be a disjunction of values of that attribute. For example, in AM+ and all earlier versions of Ant-Miner, the rule antecedent is simply a conjunction of values of the predicting attributes. In contrast, our technique, PDPSO, allows a complete rule antecedent to be a conjunction of disjunctions of values of the predicting attributes.

• The proposed technique works on both binary and multiclass datasets. Although it has been designed for datasets containing categorical or discrete attributes, it can also be used in continuous domains after the continuous attributes have been discretized.

This paper is organized as follows. Section 2 describes related work on rule induction using Computational Intelligence techniques. Section 3 provides a brief introduction to classic PSO and its discrete variant. Section 4 describes the proposed approach in detail. Section 5 discusses the experimental setup and evaluates the performance of the proposed algorithm on various datasets. Finally, Section 6 concludes with a summary and some ideas for future work.

II. RELATED WORK

Decision trees are one of the most popular choices for rule induction from a given set of training instances: first a decision tree is learnt from the data, and then it is translated into an equivalent set of rules [5]. Decision trees have been widely praised for their comprehensibility. Another method applied by researchers for rule induction from a given dataset is the Genetic Algorithm (GA); here rules are encoded as bit strings, and specialized genetic search operators are used to explore the hypothesis space, e.g. [6]. Using Ant Colony Optimization (ACO) for rule induction provides an efficient mechanism to accomplish a more global search of the hypothesis space. In [7], an ant algorithm called Ant-Miner was proposed for creating crisp rules describing the underlying dataset. In Ant-Miner, the nodes of the problem graph are the terms that may form a rule precondition, and an ant is not required to visit each and every node. In contrast, [8] treats the rule induction problem as an assignment task in which a fixed number of nodes is present in the rule precondition, and an ant explores the problem graph by visiting each and every node in turn.

In many classification tasks where fuzzy systems have been applied successfully, rules are mostly derived from human expert knowledge. As the feature space grows, however, this kind of manual approach becomes infeasible. With this requirement in mind, several methods have been proposed to generate fuzzy rules directly from numeric data [9]. Neural networks [10] and genetic algorithms [11] have been used by several researchers in combination with fuzzy logic for rule generation. In [12], Genetic Programming (GP) using a hybrid Michigan-Pittsburgh-style encoding has been applied to extract rules in an iterative learning fashion. In [13], an Evolution Strategy (ES) is iteratively invoked, and at each iteration it induces a fuzzy rule that gives the best classification over the current set of training instances.

III. OVERVIEW OF PSO

Particle Swarm Optimization (PSO) is a population-based evolutionary computation technique, originally designed for continuous optimization problems. The searching agents, called particles, are 'flown' through the n-dimensional search space. Each particle updates its position considering not only its own experience but also that of the other particles. The position and velocity of each particle are updated according to the following equations [14]:

X_i(t) = X_i(t-1) + V_i(t)

V_i(t) = w × V_i(t-1) + c_1 r_1(t) (X_i^{pb} - X_i(t-1)) + c_2 r_2(t) (X_i^{gb} - X_i(t-1))

where X_i(t) is the position of the particle in the ith dimension at time t and V_i(t) is the velocity of the particle in the ith dimension at time t; w is the inertia weight; c_1 and c_2 are the learning factors; and r_1(t) and r_2(t) are random numbers drawn uniformly between 0 and 1. X_i^{pb} is the ith dimension of the personal best position reached so far by the particle under consideration, and X_i^{gb} is the ith dimension of the global best position reached so far by the entire swarm. Each particle is evaluated using a fitness function: the closer the position of a particle to the optimal position, the fitter the particle. The optimization process is iterative.
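For readers unfamiliar with the continuous PSO update, the following minimal Python sketch illustrates the two equations above on a toy minimization problem. The parameter values (w, c1, c2, swarm size) and the sphere objective are illustrative assumptions, not taken from the paper.

```python
# Minimal sketch of the classic continuous PSO update (Kennedy & Eberhart style).
# Parameter values and the sphere objective are illustrative assumptions.
import random

def pso_sphere(dim=5, swarm_size=20, iterations=100, w=0.7, c1=1.5, c2=1.5):
    def fitness(x):                      # sphere function: minimum at the origin
        return sum(v * v for v in x)

    pos = [[random.uniform(-5, 5) for _ in range(dim)] for _ in range(swarm_size)]
    vel = [[0.0] * dim for _ in range(swarm_size)]
    pbest = [p[:] for p in pos]
    gbest = min(pbest, key=fitness)[:]

    for _ in range(iterations):
        for k in range(swarm_size):
            for i in range(dim):
                r1, r2 = random.random(), random.random()
                # V_i(t) = w*V_i(t-1) + c1*r1*(Xpb_i - X_i(t-1)) + c2*r2*(Xgb_i - X_i(t-1))
                vel[k][i] = (w * vel[k][i]
                             + c1 * r1 * (pbest[k][i] - pos[k][i])
                             + c2 * r2 * (gbest[i] - pos[k][i]))
                # X_i(t) = X_i(t-1) + V_i(t)
                pos[k][i] += vel[k][i]
            if fitness(pos[k]) < fitness(pbest[k]):
                pbest[k] = pos[k][:]
                if fitness(pbest[k]) < fitness(gbest):
                    gbest = pbest[k][:]
    return gbest
```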


IV. PROPOSED METHOD

The proposed algorithm is a class-dependent sequential covering algorithm. It learns rules for the target class sequentially and removes the covered examples of the target class until no example of the target class remains. In our proposed algorithm, fixed probabilities are assigned to the current, local best and global best positions so that information can be shared probabilistically among particles. The algorithm classifies N examples having M attributes each (excluding the class attribute). We use a fixed swarm size of P particles, where each particle represents the precondition of an individual rule. The algorithm is explained in the following sections.

A. Particle Encoding

The position vector of each particle is a binary bit string containing Σ_{i=1}^{M} m_i bits, where M is the total number of attributes in the dataset excluding the class attribute and m_i is the number of possible values for the ith attribute. Exactly m_i bits are used to encode a value for the ith attribute, and each bit position in the string corresponds to one of the possible values of that attribute. Placing a 1 at a certain bit indicates that the attribute is allowed to take on the associated value. A complete rule precondition is obtained by concatenating the encoded bit strings of all attributes. We have explicitly avoided the case where the position string contains all 0's (a Null value); alternatively, one could simply assign a very low fitness to particles having such Null values. The following sections present a detailed description of the proposed algorithm, which uses the above particle encoding scheme.

B. Swarm Initialization

A set of P particles is randomly generated by producing a bit string of length Σ_{i=1}^{M} m_i for each particle, such that for each attribute i a random value is chosen from its domain and encoded using exactly m_i bits. These encoded values are then concatenated to produce the complete bit string for a particular particle.
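As an illustration of the encoding and initialization just described, the following Python sketch builds one segment of m_i bits per attribute, sets a single randomly chosen bit at initialization, and concatenates the segments. The attribute domains ("outlook", "windy") and the decode helper are hypothetical examples; only the bit-string structure follows the paper.

```python
# Sketch of particle encoding (Sec. IV.A) and random initialization (Sec. IV.B).
# The attribute domains below are hypothetical examples.
import random

domains = {"outlook": ["sunny", "overcast", "rainy"], "windy": ["true", "false"]}

def init_particle(domains):
    """One random value per attribute, one-hot encoded and concatenated."""
    bits = []
    for values in domains.values():
        segment = [0] * len(values)              # m_i bits for attribute i
        segment[random.randrange(len(values))] = 1
        bits.extend(segment)
    return bits

def decode(particle, domains):
    """Translate a bit string back into a readable rule antecedent."""
    terms, start = [], 0
    for attr, values in domains.items():
        segment = particle[start:start + len(values)]
        allowed = [v for v, b in zip(values, segment) if b == 1]
        terms.append(f"{attr} in {{{', '.join(allowed)}}}")
        start += len(values)
    return " AND ".join(terms)

swarm = [init_particle(domains) for _ in range(30)]   # P = 30 as in Table 2
print(decode(swarm[0], domains))   # e.g. "outlook in {rainy} AND windy in {true}"
```

After position updates, more than one bit of a segment may be set, which is exactly how a term becomes a disjunction of attribute values.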

C. Position Update

The position of each particle is updated according to the following scheme. Three different probabilities, Pc, Pp and Pg, are assigned to each particle. These probabilities are assigned at the start of the evolutionary process and remain fixed afterwards. Pc is the probability that the particle keeps its current position as its new position, Pp is the probability that the particle takes its new position from its personal best position, and Pg is the probability that the particle takes its global best position as its new position.

Let Bci be the value of the ith bit of attribute B in the current position vector of a particle, Bpi be the value of the ith bit of attribute B in the vector containing the personal best position of the particle, and Bgi be the value of the ith bit of attribute B in the vector containing the global best position of the whole swarm. At each iteration, a random number rand is generated between 0 and 1 for every attribute. The three position vectors containing the current, personal best and global best positions are compared bitwise for each particle, and based on the values of rand, Pc, Pp and Pg, a new value is chosen for each bit of that attribute. Since all three position vectors contain binary values, Bci, Bpi and Bgi can never all be different from each other: either all three values of a given bit are the same, or at least two of them are equal. Accordingly, there are four cases to deal with:

Case 1 (Bci, Bpi and Bgi are all the same): If rand <= Pc+Pp+Pg, the common value of Bci, Bpi and Bgi is assigned to the ith bit in the position vector of the particle. Otherwise it is randomly set to 0 or 1.

Case 2 (Bci and Bpi are the same but Bgi is different): If rand <= Pc+Pp, the ith bit in the position vector of the particle takes the common value of Bci and Bpi. If Pc+Pp < rand <= Pc+Pp+Pg, Bgi is set as the new value of the ith bit. Otherwise it is randomly set to 0 or 1.

Case 3 (Bci and Bgi are the same but Bpi is different): If rand <= Pc+Pg, the ith bit in the position vector of the particle takes the common value of Bci and Bgi. If Pc+Pg < rand <= Pc+Pp+Pg, Bpi is set as the new value of the ith bit. Otherwise it is randomly set to 0 or 1.

Case 4 (Bpi and Bgi are the same but Bci is different): If rand <= Pp+Pg, the ith bit in the position vector of the particle takes the common value of Bpi and Bgi. If Pp+Pg < rand <= Pc+Pp+Pg, Bci is set as the new value of the ith bit. Otherwise it is randomly set to 0 or 1.
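The four cases can be summarized in a short sketch. The following Python function is one possible reading of the update, assuming the Pc, Pp and Pg values of Table 2 and one random draw per attribute as described in the text; the attr_lengths argument (the list of m_i values) is an assumption of this sketch.

```python
# One possible reading of the probabilistic bit-wise position update (Sec. IV.C).
import random

def update_position(current, pbest, gbest, attr_lengths, Pc=0.08, Pp=0.22, Pg=0.65):
    new_pos, start = [], 0
    for m_i in attr_lengths:                      # m_i bits belong to attribute i
        rand = random.random()                    # one draw per attribute
        for i in range(start, start + m_i):
            c, p, g = current[i], pbest[i], gbest[i]
            if c == p == g:                                       # Case 1
                bit = c if rand <= Pc + Pp + Pg else random.randint(0, 1)
            elif c == p:                                          # Case 2 (g differs)
                if rand <= Pc + Pp:
                    bit = c
                elif rand <= Pc + Pp + Pg:
                    bit = g
                else:
                    bit = random.randint(0, 1)
            elif c == g:                                          # Case 3 (p differs)
                if rand <= Pc + Pg:
                    bit = c
                elif rand <= Pc + Pp + Pg:
                    bit = p
                else:
                    bit = random.randint(0, 1)
            else:                                                 # Case 4 (p == g, c differs)
                if rand <= Pp + Pg:
                    bit = p
                elif rand <= Pc + Pp + Pg:
                    bit = c
                else:
                    bit = random.randint(0, 1)
            new_pos.append(bit)
        start += m_i
    return new_pos
```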

D. Quality Measure

We have used the following fitness function, or quality measure, to evaluate the performance of a rule [15]:

Quality = Sensitivity × Specificity

Sensitivity indicates, out of all positive examples, the ratio of examples actually classified as positive by the rule:

Sensitivity = tp / (tp + fn)

where tp is the total number of positive examples that are classified as positive by the rule and fn is the total number of positive examples that are not classified as positive by the rule. Specificity indicates, out of all negative examples, the ratio of examples that the rule avoids classifying as positive:

Specificity = tn / (tn + fp)

where tn is the total number of negative examples that are not classified as positive by the rule and fp is the total number of negative examples that are classified as positive by the rule.
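A minimal sketch of this quality measure is given below, assuming a rule_covers(rule, example) predicate and examples stored as dictionaries with a "class" field; both are assumptions made for illustration.

```python
# Sketch of the rule quality measure (Sec. IV.D): Quality = Sensitivity * Specificity.
# "Positive" means the example belongs to the rule's target class.
def rule_quality(rule, examples, target_class, rule_covers):
    tp = fp = tn = fn = 0
    for x in examples:
        covered = rule_covers(rule, x)            # does the antecedent match x?
        positive = (x["class"] == target_class)
        if covered and positive:
            tp += 1
        elif covered and not positive:
            fp += 1
        elif not covered and positive:
            fn += 1
        else:
            tn += 1
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return sensitivity * specificity
```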

E. Stopping Criteria

Since the algorithm learns incrementally, i.e. one rule at a time, there are two different types of stopping criteria. For the evolutionary process in which the algorithm learns a single rule, a maximum number of generations specified in advance can be used as the stopping criterion; alternatively, the evolutionary process can be stopped when there is no improvement in the global best fitness of the swarm. The second type of stopping criterion is for the overall rule learning scheme: the algorithm continues to extract best rules for the target class until there are no more examples of that class in the dataset, and this process is repeated for every class. The pseudocode of the proposed method is as follows:

1) For each class, do steps 2 and 3.
2) Reinitialize the training set.
3) While some examples of the class are still uncovered, do steps 4 to 6.
4) Run our discrete PSO to generate the best rule, using the position update method described in the Position Update section.
5) If the best rule's performance is greater than the threshold:
   a. add the best rule to the final rule set;
   b. remove the covered examples of the class;
   c. go to step 3.
6) Else decrease the threshold and go to step 4.
7) Output the final rule set.
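A sketch of the outer sequential-covering loop implied by this pseudocode is given below. The helpers run_discrete_pso, rule_quality and rule_covers are placeholders for the PSO run and quality measure described earlier; the initial threshold and scaling factor follow Table 2, and the per-class threshold reset is an assumption of this sketch.

```python
# Sketch of the sequential-covering outer loop from the pseudocode above.
def learn_rule_list(training_set, classes, run_discrete_pso, rule_quality,
                    rule_covers, init_threshold=0.25, scaling_factor=0.7):
    rule_list = []
    for target_class in classes:                                    # step 1
        examples = list(training_set)                               # step 2
        threshold = init_threshold
        while any(x["class"] == target_class for x in examples):    # step 3
            best_rule = run_discrete_pso(examples, target_class)    # step 4
            if rule_quality(best_rule, examples, target_class, rule_covers) > threshold:
                rule_list.append((best_rule, target_class))         # step 5a
                examples = [x for x in examples                     # step 5b
                            if not (rule_covers(best_rule, x)
                                    and x["class"] == target_class)]
            else:
                threshold *= scaling_factor                          # step 6
    return rule_list                                                 # step 7
```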

V. EXPERIMENTS AND RESULTS

Five datasets have been used in our experiments; their characteristics are summarized in Table 1. As shown in Table 1, two of them, ttt and bcw, are binary-class datasets, while the remaining three are multiclass. The attribute counts in Table 1 include only the attributes actually used for classification, thus excluding the class attribute; a unique identifier attribute, if present in a dataset, has also been excluded. Values for the three probability parameters have been chosen empirically. The experimental setup is summarized in Table 2. Continuous attributes, when present in a dataset, have been discretized in a pre-processing step using Weka's supervised discretization method.

Table 1: DATASETS USED IN THE EXPERIMENTS

Dataset   Dataset Size   No. of Attributes   No. of Classes
iris      150            4                   3
wine      178            13                  3
ttt       958            9                   2
car       1728           6                   4
bcw       699            9                   2


Table 2: PARAMETER SETTINGS

Parameter                      Value
Population                     30
Max No. of Iterations          1000
Pc                             0.08
Pp                             0.22
Pg                             0.65
Performance Threshold          0.25
Scaling Factor for Threshold   0.7

In each experiment, we have used a ten-fold testing scheme, and we define the error to be the sum of fp and fn. The median accuracy achieved by our proposed method has been compared with the results reported in [16]. Table 3 shows the results of our experiments. Overall, when the achieved accuracy is averaged over all the datasets, the proposed method achieves the highest accuracy, 95.95 percent; AM+, 1NN and RIPPER achieve 95.45, 94.80 and 94.20 percent, respectively. These results are shown in Figure 1. Table 3 also shows that, compared with the other techniques, our proposed method performs better on non-binary datasets: on the iris dataset it beats all nine other techniques, and on the car dataset it beats all the evolutionary algorithms, i.e. AM+, AM, AM2 and AM3. The high accuracy achieved by the proposed method is due to its probabilistic position update method, which allows the particles to share information by 'intersecting' the contents of their position vectors probabilistically while still leaving some chance of random search to explore the hypothesis space.
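For concreteness, a minimal sketch of a ten-fold evaluation with the error counted as misclassifications (fp + fn) per fold is shown below. The fold construction and the train_and_classify helper are assumptions made for illustration; reporting the median accuracy follows the text above.

```python
# Sketch of ten-fold evaluation; error per fold is the count of misclassified examples.
import random
import statistics

def ten_fold_accuracy(dataset, train_and_classify, k=10):
    data = dataset[:]
    random.shuffle(data)
    folds = [data[i::k] for i in range(k)]
    accuracies = []
    for i in range(k):
        test = folds[i]
        train = [x for j, f in enumerate(folds) if j != i for x in f]
        predictions = train_and_classify(train, test)      # learn rules, then predict
        errors = sum(1 for x, y in zip(test, predictions) if x["class"] != y)
        accuracies.append(100.0 * (len(test) - errors) / len(test))
    return statistics.median(accuracies)                    # median accuracy over folds
```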

VI. CONCLUSION

In this paper, we have proposed a discrete version of PSO for rule extraction. The aim of the proposed technique is to discover classification rules in a dataset. Although the algorithm has been designed for datasets containing categorical or discrete attributes, it can also be used in continuous domains after discretizing the continuous attributes. It discovers rules in which individual terms in the rule antecedent are permitted to be disjunctions of the possible values of the corresponding attributes; the rule antecedent itself is a conjunction of these disjunctions.

Table 3: PERCENTAGE ACCURACY COMPARISON OF OUR METHOD WITH OTHER TECHNIQUES

              Binary            Nonbinary
Algorithm     ttt      bcw      iris     wine     car
Our Method    97.92    94.29    100      94.45    93.07
AM+           99.75    96.40    94.51    94.59    92.01
AM            75.03    91.14    76.60    84.50    77.38
AM2           71.13    91.54    81.80    85.33    77.93
AM3           68.94    90.92    77.00    83.50    77.50
RIPPER        97.99    95.35    93.00    90.68    94.01
C4.5          83.79    94.69    93.80    89.83    96.61
1NN           98.50    96.40    91.00    95.43    92.69
logit         65.57    96.53    93.80    94.33    80.52
SVM           91.06    92.81    94.40    94.83    97.71

We have compared the accuracy of our proposed method with four other evolutionary techniques and five other state-of-the-art techniques on the five public domain datasets. As shown, when the accuracy results are averaged over all the datasets, our proposed method comes out on top. The proposed technique also produces the highest average accuracy on the non-binary datasets compared with the other nine techniques. Since we employ a different position update rule, there are no traditional parameters such as the inertia weight and the learning factors c1 and c2 in our proposed method, so there is no overhead of tuning those parameters. At the same time, however, we have introduced three probability parameters whose values are chosen empirically and remain fixed throughout the evolutionary process. In the future, we intend to make these values adaptive, so that they are chosen automatically during the evolutionary process and there is no overhead of tuning them either. We are also interested in finding a different encoding scheme for the particles.


Figure 1: Average accuracy on all datasets (percentage accuracy of Our Method, AM+, 1NN, RIPPER, SVM, C4.5, logit, AM2, AM, AM3).

VII. ACKNOWLEDGEMENT

The author, Naveed Kazim Khan (042-380093, Eg2-214), would like to acknowledge the Higher Education Commission (HEC) of Pakistan for providing the funding and resources required to complete this work. It would have been impossible to complete this work without their continuous support.

VIII. REFERENCES

[1] H. Ishibuchi, T. Nakashima and T. Murata, “Performance Evaluation of Fuzzy Classifier Systems for Multidimensional Pattern Classification Problems,” IEEE Transactions on Systems, Man and Cybernetics 29:601-618, 1999.

[2] Y.F. Yuan and H. Zhuang, “A Genetic Algorithm for Generating Fuzzy Classification Rules,” Fuzzy Sets and Systems 84:1-19, 1996.

[3] W. Romao, A.A. Freitas and R.C.S. Pacheco, “A Genetic Algorithm for Discovering Interesting Fuzzy Prediction Rules: Applications to Science and Technology Data,” Proceedings of the Genetic and Evolutionary Computation Conference (GECCO-2002) 343-350, 2002.

[4] S. F. Smith, “A learning system based on genetic adaptive algorithms,” Doctoral dissertation, Department of Computer Science. University of Pittsburgh, 1980.

[5] R. Gallion, D.C. St.Clair, C. Sabharwal & W.E. Bond, “Dynamic ID3: A Symbolic Learning Algorithm for Many-Valued Attribute Domains,” Proceedings of the 1993 Symposium on Applied Computing 14-20, 1993.

[6] C. Z. Janikow, “A knowledge-intensive GA for supervised learning,” Machine Learning 13:189-228, 1993.

[7] R.S. Parpinelli, H.S. Lopes, A.A. Freitas, “Data Mining with an Ant Colony Optimization Algorithm,” IEEE Transactions on Evolutionary Computation 6:321-332, 2002.

[8] J. Casillas, O. Cordon, F. Herrera, “Learning Fuzzy Rules using Ant Colony Optimization Algorithms,” Proceedings of the 2nd International Workshop on Ant Algorithms (ANTS 2000) 13-21, 2000.

[9] S. Abe, M. Lan, “Fuzzy rules extraction directly from numerical data for function approximation,” IEEE Transactions on Systems, Man and Cybernetics, 25(1):119-129, 1995.

[10] S. Yao, C. Wei, Z. He, “Evolving fuzzy neural networks for extracting rules,” Proc. 5th IEEE Int. Conf. Fuzzy Systems (FUZZ-IEEE ’96) 361–367, 1996.

[11] T. Pal, “Evolutionary approaches to rule extraction for fuzzy logic controllers,” Advances in Soft Computing, (Lecture Notes Series in Artificial Intelligence) 2275: 425–432, 2002.

[12] R.R.F. Mendes, F.d.B. Voznika, A.A. Freitas, J.C. Nievola, “Discovering Fuzzy Classification Rules with Genetic Programming and Co-Evolution,” Lecture Notes in Artificial Intelligence 2168:314-325, 2001.

[13] F. Hoffmann, “Combining Boosting and Evolutionary Algorithms for Learning of Fuzzy Classification Rules,” Fuzzy Sets and Systems, Article in Press - Uncorrected Proof, 2003.

[14] J. Kennedy, R. C. Eberhart, “Particle swarm optimization,” Proc. IEEE Int. Conf. Neural Networks 1942–1948, 1995.

[15] R.S. Parpinelli, H.S. Lopes, A.A. Freitas, “Data Mining with an Ant Colony Optimization Algorithm,” IEEE Trans. on Evolutionary Computation, special issue on Ant Colony algorithms 6(4):321-332, 2002.

[16] D. Martens, M. D. Baker, “Classification with ant colony optimization,” IEEE Transactions on Evolutionary Computation, 2007.