
IMPLEMENTATION AND EVALUATION OF K-MEANS, KOHONEN-SOM, AND HAC

DATA MINING ALGORITHMS BASED ON CLUSTERING

KAPIL SHARMA & RICHA DHIMAN

Computer Science and Engineering, Lovely Professional University, Jalandhar, Punjab, India

ABSTRACT

With the development of information technology and computer science, high-volume data have become part of our lives. To help people analyze such data and dig out useful information, the development and application of data mining technology is highly significant. Clustering is the most widely used data mining method; it can be used to describe and analyze data. In this research, we mine the data and extract valuable information by combining three clustering algorithms, which produces better results than the traditional algorithms applied individually, and we compare the approaches by applying them to a common dataset. After comparing these three methods, we can characterize the data and reveal its latent structure. This work presents new and improved results on a large-scale dataset.

KEYWORDS: Clustering, K-Means, HAC, Data Mining, Kohonen-SOM

INTRODUCTION

A self-organizing map (SOM), or self-organizing feature map (SOFM), is a kind of artificial neural network that is trained using unsupervised learning to produce a low-dimensional (typically two-dimensional), discretized representation of the input space of the training samples, called a map. Self-organizing maps differ from other artificial neural networks in that they use a neighborhood function to preserve the topological properties of the input space.

SOM is a clustering method. Indeed, it organizes the data into clusters (the cells of the map) such that instances in the same cell are similar and instances in different cells are different. From this point of view, SOM gives results comparable to those of state-of-the-art clustering algorithms such as K-Means.
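The clustering behaviour described above can be illustrated with a minimal NumPy sketch of SOM training. This is a simplified illustration under assumed schedules (linearly decaying learning rate and Gaussian neighborhood), not the actual code of any particular tool:

```python
import numpy as np

def train_som(X, rows=2, cols=3, epochs=20, lr0=0.5, sigma0=1.0, seed=0):
    """Train a tiny SOM on X (n_samples x n_features).
    Returns the grid of weight vectors, shape (rows, cols, n_features)."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(rows, cols, X.shape[1]))
    # Grid coordinates of each cell, used by the neighborhood function.
    grid = np.stack(np.meshgrid(np.arange(rows), np.arange(cols),
                                indexing="ij"), axis=-1)
    for epoch in range(epochs):
        lr = lr0 * (1 - epoch / epochs)              # decaying learning rate
        sigma = sigma0 * (1 - epoch / epochs) + 0.1  # decaying radius
        for x in rng.permutation(X):
            # Best matching unit: the cell whose weight vector is closest to x.
            d = np.linalg.norm(W - x, axis=-1)
            bmu = np.unravel_index(np.argmin(d), d.shape)
            # Gaussian neighborhood centred on the BMU preserves topology:
            # cells near the BMU on the grid are also pulled toward x.
            g = np.exp(-np.sum((grid - np.array(bmu)) ** 2, axis=-1)
                       / (2 * sigma ** 2))
            W += lr * g[..., None] * (x - W)
    return W

def assign(X, W):
    """Map each instance to its best matching cell (row, col)."""
    d = np.linalg.norm(W[None, ...] - X[:, None, None, :], axis=-1)
    flat = d.reshape(len(X), -1).argmin(axis=1)
    return np.stack(np.unravel_index(flat, W.shape[:2]), axis=1)
```

With a 2 x 3 grid, `assign` plays the role of the cluster membership column: similar instances share a cell, dissimilar instances land in different cells.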

SOM can also be viewed as a visualization technique: it allows us to visualize the original dataset in a low-dimensional (2D) representation space. Indeed, individuals located in adjacent cells are more similar than individuals located in distant cells. From this point of view, it is comparable to visualization techniques such as multidimensional scaling or PCA (Principal Component Analysis).

In this work, I show how to implement Kohonen's SOM algorithm with a particular tool, Tanagra. I try to assess the properties of this approach by comparing its results with those of the PCA algorithm. Then, I compare the results to those of K-Means, a clustering algorithm. Finally, I implement the two-step clustering process by combining the SOM algorithm with HAC (Hierarchical Agglomerative Clustering). This is a variant of the two-step clustering that classically combines K-Means and HAC.

DATASET

I analyze the WAVEFORM dataset (Breiman et al., 1984). This is an artificial dataset with 21 descriptors and 5,000 instances. In this research, I do not use the CLASS attribute, which classifies the instances into 3 pre-defined classes.

International Journal of Computer Science Engineering

& Information Technology Research (IJCSEITR)

ISSN 2249-6831

Vol. 3, Issue 1, Mar 2013, 165-174

© TJPRC Pvt. Ltd.


Kohonen’s SOM Approach with Tanagra

The easiest way to import an XLS file is to open the data file in an Excel spreadsheet. Then, using the TANAGRA.XLA add-in, I can send the dataset to Tanagra, which is launched automatically. I can check the range of selected cells in the worksheet.

Figure 1

Tanagra is launched, a new diagram is created, and the dataset is loaded: 5,000 instances and 21 attributes.

Descriptive Statistics and Outlier Detection

As a first step, I check the integrity of the dataset by computing some descriptive statistics. I insert the DEFINE STATUS component into the diagram using the shortcut in the toolbar. Then I set all the variables as INPUT.

Figure 2

I add the UNIVARIATE CONTINUOUS STAT component (STATISTICS tab) and click on the VIEW menu. I obtain the following report.

Figure 3


I note that there is no constant variable in our dataset (i.e., no variable with standard deviation = 0). I note also that all the variables seem to be defined on the same scale.

The KOHONEN-SOM Component

I now launch the analysis. I create a map with 2 rows and 3 columns, i.e., a classification of the instances into 6 groups (2 x 3 = 6 clusters). I add the KOHONEN-SOM component (CLUSTERING tab) to the diagram, click on the PARAMETERS menu, and set the following settings.

Figure 4

The number of rows of the map is 2 (Row Size) and the number of columns is 3 (Col Size). I standardize the data, i.e., I divide each variable by its standard deviation. This is recommended when the variables are not on the same scale; it is not necessary if they are defined on the same scale, or when I want to take the differences in scale explicitly into account. I do not modify the other settings. I validate, click on the VIEW menu, and obtain the following report.

Figure 5
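The standardization option described above amounts to a one-line transformation. A minimal sketch (assuming, as an illustration, that only the division by the standard deviation is performed; the exact formula used by the component may also centre the data):

```python
import numpy as np

def standardize(X):
    """Divide each variable (column) by its standard deviation, so that
    variables measured on different scales contribute comparably to the
    Euclidean distances used by the SOM."""
    sd = X.std(axis=0)
    sd[sd == 0] = 1.0  # guard against constant columns (none here)
    return X / sd
```

After this transformation every variable has unit standard deviation, which is why the option matters little on WAVEFORM, whose 21 descriptors are already on the same scale.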

The KOHONEN-SOM component adds a new column to the current dataset, which states the group membership of each instance. This new attribute is available in the subsequent parts of the diagram. I can visualize the current dataset with the VIEW DATASET component. I see, for instance, that the first example belongs to cell (1, 3), i.e., the first row and third column.

Note: I can classify an additional instance within the same framework, i.e., an example which was not involved in the learning process. This deployment phase is one of the most important steps of the data mining process.
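The deployment step mentioned in the note reduces to finding the best matching unit of the new instance. A sketch, assuming the trained map is available as an array `W` of codebook vectors:

```python
import numpy as np

def classify_new(x, W):
    """Assign an unseen instance x to the SOM cell whose weight (codebook)
    vector is nearest in Euclidean distance.
    W has shape (rows, cols, n_features); returns (row, col), 0-based."""
    d = np.linalg.norm(W - x, axis=-1)
    return np.unravel_index(np.argmin(d), d.shape)
```

No retraining is needed: the map learned on the 5,000 training instances is applied as-is to the new example.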

Individuals who are in adjacent cells are also close in the original representation space. This is one of the main

interests of this method. Let us check this assertion on the WAVEFORM dataset.


I cannot visualize the dataset in its original space, so I use a PCA to obtain a 2D representation, in which I try to visualize the relative positions of the groups (clusters) in a scatter plot. I add the PRINCIPAL COMPONENT ANALYSIS component (FACTORIAL ANALYSIS tab) after the KOHONEN-SOM 1 component and click on the VIEW menu.

Figure 6

The representation space of the first two factors accounts for 53.17% of the total variability. This seems low, but on this dataset it is enough to represent the instances properly. I add the SCATTERPLOT component (DATA VISUALIZATION tab) to the diagram, and set the first factor as the horizontal axis and the second one as the vertical axis.

Figure 7
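The 53.17% figure is the share of the total variance carried by the first two principal components. A minimal sketch of how such a share is computed, assuming the PCA is run on standardized variables:

```python
import numpy as np

def explained_by_first_two(X):
    """Share of total variance captured by the first two principal
    components: the two largest eigenvalues of the covariance matrix of
    the standardized data, divided by the eigenvalue total."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    eigvals = np.linalg.eigvalsh(np.cov(Z, rowvar=False))
    eigvals = np.sort(eigvals)[::-1]  # descending
    return eigvals[:2].sum() / eigvals.sum()
```

On WAVEFORM this ratio is about 0.53 according to the report above; the scatter plot then simply projects each instance onto those two factors.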

In a crucial step of this research, I colorize the points with the cluster membership supplied by the SOM algorithm (CLUSTER_SOM_1).

Figure 8


I note the correspondence between the proximities in the SOM map and the proximities in the first two factors of the PCA. This also means that the instances in adjacent cells are close in the original representation space (with 21 attributes).

Figure 9

A Comparison with the K-Means Clustering Process

K-Means is a state-of-the-art approach to clustering. I add the component to the diagram and request 6 clusters. There is no constraint on the relative positions of the clusters here.

Figure 10

I click on the VIEW menu in order to launch the calculations.

Figure 11

The share of the total sum of squares explained by the partitioning is 46.31%. This is comparable to the value obtained with SOM (45.16%). But recall that there is no constraint on the relative positions of the clusters for the K-Means algorithm.
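The 46.31% and 45.16% figures are both instances of the same quantity: the between-cluster sum of squares as a share of the total sum of squares. A sketch of its computation from a membership column:

```python
import numpy as np

def explained_ss_ratio(X, labels):
    """Between-cluster sum of squares over total sum of squares --
    the percentage both the K-Means and SOM reports express."""
    overall = X.mean(axis=0)
    total = ((X - overall) ** 2).sum()
    between = 0.0
    for k in np.unique(labels):
        Xk = X[labels == k]
        # Each cluster contributes its size times the squared distance
        # of its centroid from the overall mean.
        between += len(Xk) * ((Xk.mean(axis=0) - overall) ** 2).sum()
    return between / total
```

A ratio near 1 means the partition captures almost all the dispersion; 0 means the clusters are indistinguishable from a single group.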


Agreement between the Clusters

If the performances of these approaches seem similar, are the clusters comparable?

To check the correspondence, I create a cross-tabulation between the membership columns supplied by the two approaches. I insert a new DEFINE STATUS component into the diagram. I set CLUSTER_SOM_1, the cluster membership column supplied by the SOM algorithm, as TARGET; I set CLUSTER_KMEANS_1, supplied by the K-MEANS algorithm, as INPUT.

Figure 12

In the table below, I show the correspondence between the clusters.

Table 1
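The same contingency table can be built in one call once the two membership columns are exported. The labels below are hypothetical stand-ins for CLUSTER_SOM_1 and CLUSTER_KMEANS_1 values:

```python
import pandas as pd

# Hypothetical membership columns; in practice these would be the two
# cluster attributes exported from the diagram.
som = pd.Series(["c1", "c1", "c2", "c2", "c3", "c3"], name="CLUSTER_SOM_1")
km = pd.Series(["k1", "k1", "k1", "k2", "k2", "k2"], name="CLUSTER_KMEANS_1")

# Rows: SOM clusters; columns: K-Means clusters; cells: instance counts.
table = pd.crosstab(som, km)
print(table)
```

A strong diagonal-like pattern (each SOM cell dominated by one K-Means cluster) indicates that the two partitions largely agree.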

Two-Step Clustering

Two-step clustering first creates pre-clusters, and then clusters the pre-clusters using a hierarchical method (HAC). Two-step clustering handles very large datasets.

K-Means is usually used in the first phase, where the pre-clusters are created. In this work, instead of K-Means, we use the SOM results for this first phase.

This variant has a very interesting property: adjacent pre-clusters correspond to nearby areas in the original representation space. This strengthens the interpretation of the dendrogram created by the subsequent HAC algorithm.
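The two-step process described above can be sketched as follows. This is a simplified illustration: the pre-clusters are summarized by plain centroids and merged with Ward's HAC (production two-step implementations typically also weight by pre-cluster sizes):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def two_step(X, pre_labels, n_clusters=3):
    """Two-step clustering sketch: pre-cluster labels (SOM cells here;
    K-Means labels in the classical variant) are summarized by their
    centroids, the centroids are merged by Ward's HAC, and each instance
    inherits the final cluster of its pre-cluster."""
    cells = np.unique(pre_labels)
    centroids = np.array([X[pre_labels == c].mean(axis=0) for c in cells])
    Z = linkage(centroids, method="ward")          # HAC on the pre-clusters
    cell_cluster = fcluster(Z, t=n_clusters, criterion="maxclust")
    mapping = dict(zip(cells, cell_cluster))
    return np.array([mapping[c] for c in pre_labels])
```

Because the HAC runs on only a handful of centroids (6 here) rather than 5,000 instances, the combined procedure scales to very large datasets.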

We add the DEFINE STATUS component into the diagram. We set CLUSTER_SOM_1 as TARGET and the descriptive variables (V1…V21) as INPUT.


Figure 13

I add the HAC component (CLUSTERING tab).

Figure 14

The component automatically detects 3 groups. This choice relies on the gap between successive merging heights; there is no theoretical justification for it.

The DENDROGRAM tab of the visualization window is very important. By clicking on each node of the tree, we obtain the IDs of the pre-clusters supplied by the SOM algorithm.

Figure 15
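The "largest gap between merging heights" heuristic behind the automatic group detection can be sketched as follows. This is a plausible reading of the rule of thumb, not necessarily the tool's exact criterion:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

def suggest_n_clusters(X):
    """Cut the dendrogram at the largest jump between successive Ward
    merging heights and return the implied number of clusters."""
    Z = linkage(X, method="ward")
    heights = Z[:, 2]                      # merge heights, non-decreasing
    j = int(np.argmax(np.diff(heights)))   # biggest jump: after merge j
    # Merges 0..j are kept, the rest are cut away:
    # n_points - (j + 1) clusters remain.
    return len(X) - (j + 1)
```

As the text notes, this is only a rule of thumb: the biggest jump often, but not always, coincides with the "natural" number of groups.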


CONCLUSIONS

The white nodes of the tree mark the groups computed by the HAC algorithm. If we select the white node on the right, we obtain the IDs of the SOM's pre-clusters, i.e., the individuals in this group come from the pre-clusters (1; 1), (1; 2) and (2; 1).

In the table, we see the correspondence between the SOM pre-cluster IDs and the HAC cluster IDs.

Table 2

The results are strikingly consistent with the theoretical considerations underlying the SOM approach: the HAC above all merges adjacent cells of the Kohonen map.
