[IEEE 2006 Seventh International Conference on Web-Age Information Management Workshops, Hong Kong, China, 17 June 2006] Proceedings of the Seventh International Conference on Web-Age Information Management Workshops (WAIMW'06), 0-7695-2705-1/06 $20.00 © 2006 IEEE.

Genetic Algorithm-based Text Clustering Technique: Automatic Evolution of Clusters with High Efficiency

Wei Song and Soon Cheol Park

Division of Electronics and Information Engineering Chonbuk National University Korea

[email protected] [email protected]

Abstract

In this paper, we propose a modified variable string length genetic algorithm (MVGA) for text clustering. The algorithm automatically evolves the optimal number of clusters while producing a proper clustering of the data set. Each chromosome is encoded as a string of real numbers with special indices indicating the location of each gene. More effective versions of the selection, crossover, and mutation operators are introduced in the MVGA, which can also automatically balance population diversity against selective pressure over the generations. The superiority of the MVGA over the conventional variable string length genetic algorithm (VGA) is demonstrated on the Reuters text collection, both in the number of clusters evolved and in the quality of the resulting clustering.

1. Introduction

Clustering is a data mining technique defined as grouping n objects into m clusters without any prior knowledge. The number of partitions/clusters may or may not be known a priori. Several algorithms for clustering data when the number of clusters is known a priori are available in the literature. The K-means algorithm [1], one of the most widely used, attempts to solve the clustering problem for a fixed number of clusters K known in advance. It is an iterative hill-climbing algorithm whose solutions suffer from being suboptimal and are known to depend on the choice of the initial cluster distribution [2]. In [3], a branch and bound algorithm uses a tree search technique to search the entire solution space, employing a criterion for eliminating subtrees that do not contain the optimal result. In this scheme, the number of nodes to be searched becomes huge as the size of the data set grows.

In most real-life situations the number of clusters in a data set is not known in advance. The real challenge in this situation is to automatically evolve a proper number of clusters. Genetic algorithms (GAs) [4, 5, 6, 7], based on the principles of evolution and heredity, are able to search complex, large, and multimodal landscapes. We can apply the search capability of GAs to evolving a proper number of clusters and providing an appropriate clustering. GAs exhibit a large amount of implicit parallelism and provide near-optimal solutions to the clustering problem. An efficient genetic algorithm-based clustering technique [8] used a look-up table to store the distances between points and centroids, with the aim of reducing computation time; however, it only allows a cluster centroid to evolve from one document in the data set to another. In our algorithm, encoding chromosomes as strings of real numbers allows random centroids to be generated anywhere in the space. The variable string length genetic algorithm (VGA) introduced in [9, 10] employs simple and compact chromosomes to evolve the number of clusters. However, the compact chromosome encoding reduces the chance of obtaining the optimal combination of centers. We propose a modified variable string length GA (MVGA) that uses gene indices to encode chromosomes. The gene index indicates the relative location of each gene, which gives more chances to obtain particular center combinations and to find the optimal number of clusters. We also propose evolving a dynamic proportion of the population, considering the interplay between population diversity and selective pressure across generations. The details of the algorithm are described in Section 3. Experimental results are given in Section 4. Discussion and conclusions are given in Section 5.

2. Genetic Algorithms for Clustering

GAs belong to the class of search techniques that mimic the principle of natural selection. Clustering is a popular unsupervised pattern classification technique which partitions the input space into K regions based on some similarity/dissimilarity metric. The number of partitions/clusters may or may not be known a priori. GAs are able to search complex, large, and multimodal landscapes. We apply this capability of GAs to evolving the proper number of clusters and providing an appropriate clustering. The parameters in the search space are represented in the form of strings (chromosomes), encoded as combinations of cluster centroids. A collection of such chromosomes is called a population. Initially, a random population is created, representing different solutions in the search space. An objective/fitness function is associated with each chromosome and represents its degree of fitness. Based on the principle of survival of the fittest, a few chromosomes are selected and passed into the next generation. Biologically inspired operators such as crossover and mutation are applied to chromosomes to yield new child chromosomes. Selection, crossover, and mutation continue for several generations until the termination criterion is satisfied. The fittest chromosome seen up to the last generation provides the best solution to the clustering problem.

The VGA in [9, 10] employed simple and compact chromosomes to evolve the number of clusters. We modify the VGA chromosome using a gene index to indicate the location of each gene. Our aim in this article is to propose a more efficient clustering methodology for obtaining the optimal number of clusters as well as providing optimal clustering of text data. This method is described in the next section.

3. MVGA for Text Clustering

3.1. Chromosome Encoding

In MVGA clustering, the chromosome is encoded by a string of real numbers. The number of clusters, denoted by K, is assumed to lie in the range [Kmin, Kmax], where Kmin is chosen as 2 unless specified otherwise. The maximal length of a string is taken to be Kmax, where each individual gene represents a real-valued cluster centroid.

Each chromosome i in the population initially encodes Ki centers. To initialize these centers, Ki points are chosen randomly from the data set and placed at random positions in the chromosome. To keep the encoding compact, we use a gene index to denote the relative position of each point, which differs from the VGA chromosome encoding. Consider the following example. Example: let Kmin = 2 and Kmax = 10, and let the random numbers Ki and Kj be 4 and 6 for chromosome_i and chromosome_j respectively. The VGA chromosome encodings are:

chromosome_i : [ Ci1, Ci2, Ci3, Ci4 ]
chromosome_j : [ Cj1, Cj2, Cj3, Cj4, Cj5, Cj6 ]

A random single-point crossover under the VGA encoding makes the offspring retain the parents' lengths. For example, if the crossover point is 3, the offspring generated are:

chromosome_i' : [ Ci1, Ci2, Ci3, Cj4, Cj5, Cj6 ]
chromosome_j' : [ Cj1, Cj2, Cj3, Ci4 ]

chromosome_i' and chromosome_j' have the same lengths as their parents, 6 and 4 respectively, merely exchanged. Thus the VGA chromosome encoding limits the possible combinations of centers.

MVGA uses a gene index to indicate the random position of each gene. We use a random integer generator to produce a string of numbers for the gene locations. Suppose the random numbers are [0, 1, 5, 8] and [0, 2, 4, 6, 7, 9]. Then chromosome_i and chromosome_j are:

chromosome_i: [ Ci1, Ci2, Ci3, Ci4 ]   gene index: [ 0, 1, 5, 8 ]
chromosome_j: [ Cj1, Cj2, Cj3, Cj4, Cj5, Cj6 ]   gene index: [ 0, 2, 4, 6, 7, 9 ]

If the crossover point is 6, we only need to exchange the genes whose gene index is greater than 6. The offspring generated are:

chromosome_i': [ Ci1, Ci2, Ci3, Cj5, Cj6 ]   gene index: [ 0, 1, 5, 7, 9 ]
chromosome_j': [ Cj1, Cj2, Cj3, Cj4, Ci4 ]   gene index: [ 0, 2, 4, 6, 8 ]

We can see that after crossover both offspring have length 5, different from their parents' lengths of 4 and 6. Thus the MVGA chromosome encoding provides more chances to evolve the proper number of clusters.

We also do not need to keep the gene indices in ascending order; we only need them to denote relative positions, which is enough for convenient crossover.
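As an illustration, the index-based crossover described above can be sketched in Python. This is a hypothetical sketch; the function and variable names are ours, not from the paper:

```python
def index_crossover(genes_a, idx_a, genes_b, idx_b, point):
    """Exchange the genes whose gene index is greater than `point`.
    Offspring lengths may differ from both parents' lengths."""
    keep_a = [(g, i) for g, i in zip(genes_a, idx_a) if i <= point]
    move_a = [(g, i) for g, i in zip(genes_a, idx_a) if i > point]
    keep_b = [(g, i) for g, i in zip(genes_b, idx_b) if i <= point]
    move_b = [(g, i) for g, i in zip(genes_b, idx_b) if i > point]
    child_a, child_b = keep_a + move_b, keep_b + move_a
    return ([g for g, _ in child_a], [i for _, i in child_a],
            [g for g, _ in child_b], [i for _, i in child_b])

# Example from the text: crossover point 6
ga, ia, gb, ib = index_crossover(
    ["Ci1", "Ci2", "Ci3", "Ci4"], [0, 1, 5, 8],
    ["Cj1", "Cj2", "Cj3", "Cj4", "Cj5", "Cj6"], [0, 2, 4, 6, 7, 9], 6)
# ga == ["Ci1", "Ci2", "Ci3", "Cj5", "Cj6"], ia == [0, 1, 5, 7, 9]
```

Note that both offspring here have length 5, reproducing the example in the text.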

3.2. Evolution Principle of MVGA


The search capability of a GA satisfying a certain criterion is applied in this paper to text clustering. The dynamic steps of MVGA are defined in Figure 1. We choose a dynamic proportion of chromosomes to undergo the genetic operators, considering the interplay between population diversity and selective pressure across generations. M is the population size, which does not change during the generations. r is the proportion of chromosomes selected for crossover, and m is the proportion of chromosomes that undergo mutation. The termination criterion is that the best fitness value shows no improvement for more than Nmax consecutive iterations. Criterion I is triggered when there has been no improvement in the best fitness value for nmax consecutive iterations. Nmax and nmax (< Nmax) are two manually defined thresholds. rand1 and rand2 are two random numbers produced by a random number generator.

Figure 1. Dynamic steps of MVGA.

3.3. Text Encoding

We transform the set of text documents into a set of feature vectors in the vector space model (VSM). The text documents to be clustered are processed by word extraction, stop-word removal, and stemming. After stemming with Porter's stemming algorithm [11], we measure each term weight by the Okapi rule:

$$\mathrm{weight}_j = \frac{tf}{tf + 0.5 + 1.5 \cdot dl / avgdl} \cdot idf_j \qquad (1)$$

$$idf_j = \log \frac{N}{n} \qquad (2)$$

where tf is the frequency of the jth term in a document, dl is the length of the document, avgdl is the average document length, N is the number of documents in the document set, and n is the number of documents in which the jth term appears. The Okapi rule normalizes for document length, replacing the simple tf * idf weighting.
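As a sketch, the Okapi weighting of Eqs. (1)-(2) might be computed as follows. This assumes the reconstruction above; `okapi_weight` is our own name, not from the paper:

```python
import math

def okapi_weight(tf, dl, avgdl, N, n):
    """Okapi term weight: length-normalized tf times idf (Eqs. 1-2)."""
    idf = math.log(N / n)                        # Eq. (2)
    return tf / (tf + 0.5 + 1.5 * dl / avgdl) * idf  # Eq. (1)
```

For a term with tf = 3 in a document of average length, the normalized tf factor is 3 / (3 + 0.5 + 1.5) = 0.6, which is then scaled by the idf.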

A GA is a global optimization algorithm, but its main difficulty is evolving in a high-dimensional space. For a typical text collection, there are thousands or even tens of thousands of unique terms in the vocabulary, so large feature vectors are not suitable for GA search. We used dimension reduction techniques [12] to reduce the high-dimensional feature space to a much lower-dimensional space without adversely affecting clustering effectiveness: we sorted the terms by weight and kept the highest-weighted terms in place of all terms.

3.4. Population Initialization

The diversity of the population is an important factor in the success of a GA. Low diversity in the original population may lead the GA to premature convergence to a local optimum, or make it take a long time to find the global optimum. Hence we should ensure high diversity when initializing the population. In our algorithm, the original population contains 100 chromosomes. Initialization proceeds as follows: chromosome i is initialized with Ki points (text vectors) chosen randomly from the data set. These points are placed at random positions as the genes of the chromosome, and a gene index records each gene's relative position. This process is repeated for each of the P chromosomes in the population, where P is the population size.
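The initialization step could be sketched as follows. This is a minimal illustration under our reading of the paper; for simplicity the indices are kept sorted, which the paper explicitly does not require:

```python
import random

def init_population(data, p_size, k_min=2, k_max=10):
    """Create p_size chromosomes; each holds K_i randomly chosen data
    points (cluster centers) plus a list of random gene indices."""
    population = []
    for _ in range(p_size):
        k_i = random.randint(k_min, k_max)      # number of centers
        genes = random.sample(data, k_i)        # random centers from data
        idx = sorted(random.sample(range(k_max), k_i))  # relative positions
        population.append((genes, idx))
    return population
```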

3.5. Fitness Function

The initial population consists of possible solutions to the clustering problem. The fitness function measures the relative fitness of a given solution with respect to the problem we are trying to solve. The Davies-Bouldin index [13] is a function of the ratio of within-cluster scatter to between-cluster separation. The scatter within the ith cluster Ci is computed as


$$S_{i,q} = \left( \frac{1}{|C_i|} \sum_{x \in C_i} \| x - z_i \|^q \right)^{1/q} \qquad (3)$$

$$z_i = \frac{1}{n_i} \sum_{x \in C_i} x \qquad (4)$$

where $S_{i,q}$, the qth root of the qth moment of the $|C_i|$ points in cluster $C_i$ with respect to their mean $z_i$, is a measure of the dispersion of the points in the cluster. Specifically, $S_{i,1}$, used in this article, is the average Euclidean distance of the vectors in cluster i to the centroid of cluster i; $z_i$ is the centroid of $C_i$, and $n_i$ is the cardinality of $C_i$, i.e. the number of points in cluster $C_i$. The distance between $C_i$ and $C_j$ is defined as

$$d_{ij,t} = \| z_i - z_j \|_t \qquad (5)$$

where $d_{ij,t}$ is the Minkowski distance of order t between the centroids $z_i$ and $z_j$ that characterize clusters $C_i$ and $C_j$. Subsequently, we compute

$$R_{i,qt} = \max_{j,\, j \ne i} \left\{ \frac{S_{i,q} + S_{j,q}}{d_{ij,t}} \right\} \qquad (6)$$

The Davies-Bouldin (DB) index is then defined as

$$DB = \frac{1}{K} \sum_{i=1}^{K} R_{i,qt} \qquad (7)$$

The fitness function is defined as

$$F = 1 / DB \qquad (8)$$
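A minimal sketch of the DB-index computation with q = 1 and t = 2, the settings used in this article; the helper names are ours, not from the paper:

```python
import math

def db_index(clusters, centroids):
    """Davies-Bouldin index with q = 1 (mean Euclidean scatter)
    and t = 2 (Euclidean centroid distance), as in Eqs. (3)-(7)."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    k = len(clusters)
    # S_{i,1}: average distance of each point to its cluster centroid
    scatter = [sum(dist(x, z) for x in c) / len(c)
               for c, z in zip(clusters, centroids)]
    db = 0.0
    for i in range(k):
        # R_i: worst-case scatter-to-separation ratio against other clusters
        db += max((scatter[i] + scatter[j]) / dist(centroids[i], centroids[j])
                  for j in range(k) if j != i)
    return db / k

def fitness(clusters, centroids):
    return 1.0 / db_index(clusters, centroids)   # Eq. (8)
```

For two tight, well-separated clusters the index is small and the fitness correspondingly large, which is what the GA maximizes.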

The objective is to minimize the DB index to achieve a proper clustering. Therefore, the fitness of chromosome j is defined as 1/DB. Note that maximization of the fitness function ensures minimization of the DB index.

3.6. Evolutionary Operators

3.6.1. Selection. The selection operator selects chromosomes on a survival-of-the-fittest basis. We pass a dynamic number of the fittest chromosomes directly to the next generation.

3.6.2. Crossover. As mentioned in Section 2, we use the classical single-point crossover operator. The crossover site is selected as 1 + rand() mod Ki, where Ki is the length of chromosome i.

3.6.3. Mutation. The mutation operator adopted in this article is Gaussian mutation [15]. Each parent generates an offspring via Gaussian mutation that survives to the next generation. Gaussian mutation is described as follows.

For each individual x, its offspring x’ is created as

$$x' = x + \beta Y \qquad (9)$$

where β represents a possibly adaptive mutation scale and Y is a random variable that follows the Gaussian probability distribution. The samples of Y shown in Figure 2 simulate the Gaussian probability distribution; values of Y near 0 are generated with high probability.
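Gaussian mutation of a centroid can be sketched as follows; the scale `beta = 0.1` is an illustrative assumption of ours, not a value from the paper:

```python
import random

def gaussian_mutation(centroid, beta=0.1):
    """Perturb each coordinate of a centroid with Gaussian noise
    scaled by beta (Eq. 9)."""
    return [x + beta * random.gauss(0.0, 1.0) for x in centroid]
```

Because the noise is centered at 0, most offspring stay close to the parent, matching the distribution shown in Figure 2.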

Figure 2. The probability distribution of the random variable Ys.

3.7. Evolution Process

The diversity of the population influences the search for the optimal solution, while selective pressure influences the creation of the new population; the two factors interact strongly. The conventional GA in [14] chooses a fixed proportion of chromosomes in the population to evolve, without analyzing the current population distribution. The evolution of MVGA, in contrast, is a dynamic process. When there has been no improvement in the best fitness value for nmax consecutive iterations, we increase the diversity of the population so as to evolve chromosomes with suddenly improved fitness; otherwise, we increase the selective pressure and pass more high-fitness chromosomes to the next generation. This dynamic process, outlined in Figure 1, reduces the evolution time both in theory and in practice.
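Since Figure 1 gives the exact schedule, the following is only one plausible reading of the dynamic adjustment; the update rule and the `step` value are our assumptions, not from the paper:

```python
def adjust_proportions(stall_count, n_max, r, m, step=0.05):
    """Hypothetical dynamic schedule: if the best fitness has stalled
    for n_max generations, enlarge the crossover/mutation proportions
    (more diversity); otherwise shrink them (more selective pressure)."""
    if stall_count >= n_max:
        r = min(1.0, r + step)
        m = min(1.0, m + step)
    else:
        r = max(step, r - step)
        m = max(step, m - step)
    return r, m
```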


The process of selection, crossover, and mutation is executed for a number of iterations. The fittest chromosome seen up to the last generation provides the best solution to the clustering problem. The termination criterion is shown in Figure 1.

The next section provides the experimental clustering results of MVGA for text clustering, along with a comparison with the performance of the VGA on the Reuters data set.

4. Experiments

We chose texts from the Reuters-21578 text collection as the data set. Data_set1 has 100 texts belonging to three topics (acq, crude, trade). Data_set2 has 200 texts belonging to four topics (acq, earn, crude, trade). After word extraction, stop-word removal, and stemming, the vocabularies contain 1,960 and 3,570 terms respectively. The feature reduction technique [12] was then applied to reduce the dimensionality of the feature vectors from 1,960 to 200 and from 3,570 to 250, a reduction of about 90% and 93% respectively. We then applied MVGA and VGA for clustering. Table 1 shows the Data_set1 and Data_set2 used in our experiments. MVGA is run with the following parameters: r = 0.4, m = 0.1, Kmax = 7, Nmax = 20, nmax = 12, and a population size of 100. Figures 3 and 4 show the clustering results of applying MVGA and VGA to Data_set1; MVGA and VGA reach the same result in the end. Figures 5 and 6 show the clustering results on Data_set2; in the end MVGA finds 4 clusters, while VGA finds 3.

Table 1. The number of texts, topics and dimensions in Data_set1 and Data_set2.

              Data_set1                  Data_set2
#Texts        100                        200
#Topics       3                          4
#Dimensions   200                        250
Data set      acq(0-29), crude(30-59),   acq(0-49), earn(50-99),
              trade(60-99)               crude(100-149), trade(150-199)

Figure 3. MVGA is applied to Data_set1.

Figure 4. VGA is applied to Data_set1.

Figure 5. MVGA is applied to Data_set2.

Figure 6. VGA is applied to Data_set2.


Although comparing the number of clusters obtained with the number of text topics cannot by itself evaluate the efficiency of the clustering algorithms, it does give a glimpse of the data distribution and an approximate structure for each cluster. The real evaluation criterion adopted in this paper is 1/DB, as defined in (8), where DB is the Davies-Bouldin index of a chromosome. We take the fitness of the best chromosome in each generation to represent the fitness of that generation. The results are shown in Figure 7 and Figure 8. In Figure 7, MVGA and VGA evolve to the same fitness and produce the same clustering of Data_set1, but MVGA needed fewer than 100 generations to reach the optimal clustering, while VGA needed more than 110. In Figure 8, the final fitness value of 1.91 obtained by MVGA on Data_set2 is much better than the 1.78 obtained by VGA; moreover, MVGA needed fewer generations to reach its optimal clustering.

Figure 7. Fitness (1/DB) versus generations for MVGA and VGA on Data_set1.

Figure 8. Fitness (1/DB) versus generations for MVGA and VGA on Data_set2.

Although we cannot fully evaluate the clustering results by the number of clusters alone, we can still use the value of the fitness function (1/DB) to show that the performance of MVGA is better than that of VGA for text clustering. We further use clustering precision, recall, and F-measure to evaluate the algorithms. We define precision as the number of texts correctly assigned divided by the total number of texts in the result cluster, and recall as the number of texts correctly assigned divided by the total number of texts that should have been assigned. Let D be the set of texts belonging to topic i before clustering and A the set of texts in result cluster i; then |A ∩ D| is the number of texts correctly assigned. If we obtain K clusters in total, the clustering precision for cluster i is defined as

$$P_i = \frac{|A \cap D|}{|A|} \qquad (10)$$

The average clustering precision is defined as

$$P = \frac{1}{K} \sum_{i=1}^{K} P_i \qquad (11)$$

The clustering recall for cluster i is defined as

$$R_i = \frac{|A \cap D|}{|D|} \qquad (12)$$

The average clustering recall is defined as

$$R = \frac{1}{K} \sum_{i=1}^{K} R_i \qquad (13)$$

The F-measure is defined as

$$F = \frac{2RP}{R + P} \qquad (14)$$

F weights low values of precision and recall more heavily than high values; it is high only when both precision and recall are high.
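The evaluation measures of Eqs. (10)-(14) can be sketched as follows; `cluster_scores` is our own helper, with clusters given as sets of text ids:

```python
def cluster_scores(result, truth):
    """Per-cluster precision/recall (Eqs. 10, 12), their averages
    (Eqs. 11, 13), and the F-measure (Eq. 14). `result[i]` and
    `truth[i]` are sets of text ids for result cluster i / topic i."""
    k = len(result)
    precisions = [len(a & d) / len(a) for a, d in zip(result, truth)]
    recalls = [len(a & d) / len(d) for a, d in zip(result, truth)]
    p = sum(precisions) / k
    r = sum(recalls) / k
    f = 2 * r * p / (r + p)
    return p, r, f
```

Note this assumes each result cluster has already been matched to its corresponding true topic.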

Table 2 shows the precision, recall, and F-measure of MVGA and VGA on Data_set2. We can see from Table 2 that the scores of MVGA are better than those of VGA.


Table 2. The evaluations of MVGA and VGA on Data_set2.

         P(%)    R(%)    F(%)
MVGA     77.8    78.5    78.1
VGA      58.3    76.0    66.0

Note that GAs are comparatively stable algorithms which provide approximately the same clustering results regardless of how different the initial populations are. We ran both MVGA and VGA 100 times each and obtained nearly the same results every time, so the performance of MVGA in all runs is better than that of VGA for text clustering.

5. Discussion and Conclusions

In this paper we proposed a modified variable string length genetic algorithm, MVGA, for the text clustering problem. MVGA has been used to evolve the proper number of cluster centers as well as to provide an appropriate clustering of the data sets. We applied MVGA to the Reuters-21578 text collection and demonstrated the effectiveness of our clustering algorithm, which maximizes the clustering metric (8). The results show that MVGA is superior to the VGA, a traditional variable string length GA for clustering. Note that the chromosome encoding adopted in this paper provides more random combinations of cluster centers for the clustering problem. The Reuters text Data_set1 (100 texts, 3 topics) and Data_set2 (200 texts, 4 topics) were chosen for the tests. The results showed that the MVGA not only evolved a more proper number of clusters, but also provided a more appropriate clustering than the VGA clustering algorithm.

The process of evolving a dynamic proportion of the population has been adopted in this paper, since it conceptually resolves the interplay between selective pressure and population diversity during evolution. Notably, we used a dimension reduction technique to reduce the dimensionality of the text data; otherwise, the high-dimensional space would cause long computing times, especially for genetic algorithms. We adopted Gaussian mutation [15] (from the evolutionary programming domain) as the mutation operator. Another mutation method, Levy mutation [16], is more likely to generate offspring farther away from its parent; a corresponding comparison applied to GAs is currently being performed.

6. References

[1] J. T. Tou and R. C. Gonzalez, Pattern Recognition Principles, Addison-Wesley, Reading, Massachusetts, 1974.
[2] S. Z. Selim and M. A. Ismail, "K-means-type algorithms: generalized convergence theorem and characterization of local optimality," IEEE Trans. Pattern Anal. Mach. Intell., vol. 6, pp. 81-87, 1984.
[3] W. L. Koontz, P. M. Narendra, and K. Fukunaga, "A branch and bound clustering algorithm," IEEE Trans. Computers, vol. 9, pp. 908-915, 1975.
[4] Z. Michalewicz, Genetic Algorithms + Data Structures = Evolution Programs, 3rd edn., Springer-Verlag, Berlin Heidelberg New York, 1996.
[5] J. L. R. Filho, P. C. Treleaven, and C. Alippi, "Genetic algorithm programming environments," IEEE Computer, vol. 27, pp. 28-43, 1994.
[6] Gareth Jones, Alexander M. Robertson, Chawchat Santimevirul, and Peter Willett, "Non-hierarchic document clustering using a genetic algorithm," Information Research, vol. 1, no. 1, April 1995.
[7] Elena D. Cristofor, "Information-Theoretical Methods in Clustering," University of Massachusetts, 2002.
[8] Hwei-Jen Lin, Fu-Wen Yang, and Yang-Ta Kao, "An Efficient GA-based Clustering Technique," Tamkang Journal of Science and Engineering, vol. 8, no. 2, pp. 113-122, 2005.
[9] S. Bandyopadhyay and U. Maulik, "Nonparametric Genetic Clustering: Comparison of Validity Indices," IEEE Trans. Systems, Man, and Cybernetics, Part C: Applications and Reviews, vol. 31, no. 1, 2001.
[10] U. Maulik and S. Bandyopadhyay, "Performance Evaluation of Some Clustering Algorithms and Validity Indices," IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 12, 2002.
[11] M. F. Porter, "An algorithm for suffix stripping," Program, vol. 14, no. 3, pp. 130-137, 1980.
[12] Savio L. Y. Lam and Dik Lun Lee, "Feature Reduction for Neural Network Based Text Categorization," in Proc. 6th International Conference on Database Systems for Advanced Applications, 1999, p. 195.
[13] D. L. Davies and D. W. Bouldin, "A Cluster Separation Measure," IEEE Trans. Pattern Anal. Mach. Intell., vol. 1, pp. 224-227, 1979.
[14] U. Maulik and S. Bandyopadhyay, "Genetic Algorithm Based Clustering Technique," Pattern Recognition, vol. 33, no. 9, pp. 1455-1465, 2000.


[15] Xin Yao, Yong Liu, and Guangming Lin, "Evolutionary Programming Made Faster," IEEE Trans. Evolutionary Computation, vol. 3, no. 2, 1999.
[16] Chang-Yong Lee and Xin Yao, "Evolutionary Programming Using Mutations Based on the Levy Probability Distribution," IEEE Trans. Evolutionary Computation, vol. 8, no. 1, 2004.
