
Proceedings of the Ninth International Conference on Machine Learning and Cybernetics, Qingdao, 11-14 July 2010

Estimate Human Gene Family Number by Improved K-Means Clustering

Dongmei Pu, Yubo Yuan, Feilong Cao*

Institute of Metrology and Computational Science, China Jiliang University, Hangzhou, Zhejiang 310018, P. R. China

E-MAIL: [email protected]

Abstract: The number of gene families is a very important issue in genome sequencing studies. In this paper, the genome sequences are recoded: each sequence becomes a 64-dimensional vector, so the human genome database is recoded as a matrix with 64 columns. After applying the improved k-means clustering algorithm, the number of families can be determined and every member of each family can be identified.

Keywords: clustering; human genome; data mining

1 Introduction

It is well known that a human gene can be considered as a sequence consisting of four nucleotides, which are simply denoted by the four letters A, C, G, and T. Biologists have been interested in identifying human genes and determining their functions, because these can be used to diagnose human diseases and to design new drugs for them. A human gene can be identified through a series of time-consuming biological experiments, often with the help of computer programs. The DNA sequences announced in 2003 were only rough drafts for each human chromosome. While these drafts had already advanced medical research, more detail was needed. The draft genomic sequences can be compared to a cross-country road excavated by a bulldozer, one that leaves behind many gaps across difficult terrain requiring bridges and other refinements; so it is, too, with charting the landscape of the human genome. Researchers have now filled in the gaps and provided far more details for each chromosome. Much of this was accomplished by comparing particular DNA sequences across populations in genomic regions that may have contained anomalies in the initial samples. For example, some DNA segments have proven unstable during the process of copying them (cloning) for use in sequencing machines.

*Corresponding author.


Correcting minor errors (estimated at one error in every 10,000 DNA subunits) and cataloging mutations will continue for some time to come. The entire collection of human chromosome DNA sequences is freely available to the worldwide research community. DNA sequencing, the process of determining the exact order of the 3 billion chemical building blocks that make up the DNA of the 24 different human chromosomes, was the greatest technical challenge in the Human Genome Project. Achieving this goal has helped reveal the estimated 20,000-25,000 human genes within our DNA, as well as the regions controlling them. The resulting DNA sequence maps are being used by 21st-century scientists to explore human biology and other complex phenomena.

The human body performs many functions, such as sleep, sight, smell, hearing, jumping and running ability, and immunity, and each of these must be controlled by one family of genes. In this paper, we use the improved k-means clustering method to recognize the family members. First, the improved k-means clustering method is proposed; then, the gene sequence database is recoded as a numerical matrix, and the proposed algorithm is used to determine the number of gene families.

2 Improved K-Means Clustering Algorithm

Clustering analysis of a data set aims at discovering smaller, more homogeneous groups from a large heterogeneous collection of data points; it is an important unsupervised classification technique used to identify the inherent structure present in a set of objects. Mathematically speaking, clustering analysis groups a set of m patterns, usually denoted as vectors in n-dimensional real space, into clusters in such a way that patterns in one cluster are similar and patterns in different clusters are dissimilar in some sense. The clustering problem has been addressed in many contexts and by researchers in many disciplines; this reflects its broad appeal and usefulness as one of the steps in


exploratory data analysis. It is used in many fields, including data mining, statistical data analysis, extreme learning machines [1], image segmentation, pattern recognition, bioinformatics, financial series analysis, assisted diagnosis of diseases (such as cancer and heart disease), vector quantization, and various business applications.

In some circumstances the number of clusters, the parameter k, is known a priori, and clustering may be formulated as distributing m patterns in n-dimensional space among k sets such that the patterns in one set are more similar to each other than to patterns in different sets. This involves the minimization of some extrinsic optimization criterion. Agglomerative algorithms, the k-means algorithm, fuzzy algorithms, BIRCH, and CLARANS are a few of the existing clustering methods.

Among them, the k-means algorithm is the most basic and widely used one for clustering (in fact, the mathematical model of clustering is a double minimization problem; see formulation (3)). Random procedures are used to generate the starting clustering centers at the beginning of the k-means algorithm. However, it is known, and can also be observed in the experiments presented in this paper, that the efficiency of the k-means algorithm largely depends on the choice of the initial clustering centers. Boris Mirkin ([2]-[4]) has presented this opinion and proposed some heuristics for the selection of clustering centers, such as MaxMin for producing deviate centroids, deviate centroids with anomalous patterns, intelligent k-means, and so on. In 2004, Khan and Ahmad [5] also showed that the performance of iterative clustering algorithms, which converge to numerous local minima, depends highly on the initial clustering centers, and they proposed a cluster center initialization algorithm (named CCIA); their results showed that the proposed algorithm achieves better performance. Earlier, in 1998, Bradley and Fayyad [6] had shown that better initial starting points can indeed lead to improved solutions of clustering problems. In order to improve the performance of the k-means method for data clustering, a better initial center selection algorithm is proposed in this paper. The idea comes from a partition technique that follows the data distribution: before the k-means algorithm is run, some features of the data set to be clustered are analyzed, and from them the initial clusters for the k-means algorithm are obtained.

Here, we propose an improved k-means clustering algorithm for this purpose.

Clustering in the n-dimensional Euclidean space $R^n$ is the process of partitioning a given set of m points into a number of groups (or clusters) based on some similarity (or dissimilarity) measure. The similarity measure establishes a rule for assigning patterns (points) to the domain of a particular cluster center. Let the set of m points be $S = \{x_1, x_2, \ldots, x_m\}$, with each $x_i$ an n-dimensional vector, and let the k clusters be represented by $\{C_1, C_2, \ldots, C_k\}$. The basic model describing the clustering problem is given by (see [7])

$$C_i \neq \emptyset, \; i = 1, 2, \ldots, k; \quad C_i \cap C_j = \emptyset, \; i \neq j, \; i, j = 1, 2, \ldots, k; \quad \bigcup_{i=1}^{k} C_i = S. \qquad (1)$$

The procedure of finding the k optimal clusters $C_1, C_2, \ldots, C_k$ is equivalent to finding k clustering centers, denoted $\{z_1, z_2, \ldots, z_k\}$. For the sample set of m points $S = \{x_1, x_2, \ldots, x_m\}$, cluster $C_i$ is determined as follows:

$$C_i = \{x_j \mid \|x_j - z_i\| \le \|x_j - z_p\|, \; p \neq i, \; p = 1, 2, \ldots, k, \; x_j \in S\}, \qquad (2)$$

where $\|\cdot\|$ is some norm on $R^n$; that is, $C_i$ is the set of the points that are closest to the cluster center $z_i$.

Therefore, the clustering problem is to find k clustering centers $\{z_1, z_2, \ldots, z_k\}$ such that the sum of the distances from each point in the set S to its nearest center in $\{z_1, z_2, \ldots, z_k\}$ is minimized; that is, $\{z_1, z_2, \ldots, z_k\}$ is the solution of the following optimization problem:

$$\min_{z_1, \ldots, z_k} \sum_{j=1}^{m} \min_{p = 1, \ldots, k} \|x_j - z_p\|. \qquad (3)$$
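As a concrete illustration, the double minimization in (3) can be evaluated directly for any candidate set of centers. Below is a minimal sketch in Python with NumPy; the function name and array shapes are our own illustrative choices, not part of the paper.

```python
import numpy as np

def clustering_objective(X, Z):
    """Evaluate objective (3): the sum, over all m points, of the
    distance from each point x_j to its nearest clustering center.

    X : (m, n) array of the patterns x_1, ..., x_m
    Z : (k, n) array of the candidate centers z_1, ..., z_k
    """
    # Pairwise Euclidean distances, shape (m, k): entry (j, p) is ||x_j - z_p||.
    dists = np.linalg.norm(X[:, None, :] - Z[None, :, :], axis=2)
    # Inner minimization over the k centers, then the sum over the m points.
    return dists.min(axis=1).sum()
```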

The objective function in (3) is in general neither convex nor concave, and hence it can be difficult to find the solution by attacking the problem directly. However, based on Lemma 3.1 in [8], problem (3) can be reformulated as the following constrained optimization problem:

$$\min_{z, t} \; \sum_{j=1}^{m} \sum_{p=1}^{k} t_{jp} \|x_j - z_p\| \qquad (4)$$

$$\text{s.t.} \quad \sum_{p=1}^{k} t_{jp} = 1, \quad t_{jp} \ge 0, \quad j = 1, 2, \ldots, m, \; p = 1, 2, \ldots, k,$$

where $t_{jp} = 1$ if $z_p$ is the closest center to $x_j$, and $t_{jq} = 0$ for $q = 1, 2, \ldots, k$, $q \neq p$. If multiple centers attain the same minimum distance to $x_j$, then the corresponding entries $t_{jq}$ can be nonzero and form a convex combination of this minimum distance.

Usually, if we employ the $\ell_2$-norm in problem (4), the following optimization problem is obtained:

$$\min_{z, t} \; \sum_{j=1}^{m} \sum_{p=1}^{k} t_{jp} \|x_j - z_p\|_2^2 \qquad (5)$$

$$\text{s.t.} \quad \sum_{p=1}^{k} t_{jp} = 1, \quad t_{jp} \ge 0, \quad j = 1, 2, \ldots, m, \; p = 1, 2, \ldots, k,$$


and the k-means algorithm is one of the most widely used clustering techniques for (5). The k-means algorithm is an iterative descent method and can be described as follows.

The k-Means Algorithm.
Step 1: Generate k initial clustering centers $z_1, z_2, \ldots, z_k$.
Step 2 (cluster assignment): Assign each point $x_j$, $j = 1, 2, \ldots, m$, to the cluster $C_i$ whose center $z_i$, $i = 1, 2, \ldots, k$, is nearest.
Step 3: Update the clustering centers: $z_i' = \frac{1}{|C_i|} \sum_{x_j \in C_i} x_j$.
Step 4: If $z_i' = z_i$ for all $i = 1, 2, \ldots, k$, terminate; otherwise set $z_i = z_i'$ and go to Step 2.
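To make the iteration concrete, the following is a minimal sketch of these four steps in Python with NumPy. The random initialization in Step 1, the iteration cap, and the function names are our own illustrative choices, not prescribed by the paper.

```python
import numpy as np

def kmeans(X, k, init=None, max_iter=100, seed=0):
    """Basic k-means (Steps 1-4). X is the (m, n) pattern matrix; init,
    if given, is a (k, n) array of starting centers, otherwise Step 1
    draws k distinct points of S at random."""
    rng = np.random.default_rng(seed)
    # Step 1: initial clustering centers z_1, ..., z_k.
    Z = X[rng.choice(len(X), size=k, replace=False)] if init is None else init.copy()
    labels = np.zeros(len(X), dtype=int)
    for _ in range(max_iter):
        # Step 2: assign each x_j to its nearest center z_i.
        dists = np.linalg.norm(X[:, None, :] - Z[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each center to the mean of its cluster
        # (an empty cluster keeps its old center).
        Z_new = np.array([X[labels == i].mean(axis=0) if np.any(labels == i) else Z[i]
                          for i in range(k)])
        # Step 4: terminate once the centers stop changing.
        if np.allclose(Z_new, Z):
            break
        Z = Z_new
    return Z, labels
```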

The initial clustering centers $z_1, z_2, \ldots, z_k$ in Step 1 are generally randomly generated from the set $S = \{x_1, x_2, \ldots, x_m\}$, and in Step 2 a point $x_j$ satisfying

$$\|x_j - z_i\| = \min_{p = 1, \ldots, k} \|x_j - z_p\|$$

is assigned to the cluster $C_i$.

The k-means algorithm generally works well. However, in order to improve its performance, the following algorithm is proposed to generate the initial clustering centers for the k-means algorithm.

The Max-Min Segmentation Initial Centers Algorithm.
Step 1: Calculate $M = \max_{1 \le i, j \le m, \, i \neq j} \|x_i - x_j\|_2$, and set $d = M/k$ and $S_1 = S$.
Step 2: For $i = 1$ to $k$ do:
if $i < k$, then $C_i = \{x_j \mid \|x_j - z_i\|_2 \le d, \; x_j \in S_i\}$, where $z_i$ satisfies $\|z_i\|_2 = \max\{\|x_j\|_2 \mid x_j \in S_i\}$, and set $S_{i+1} = S_i \setminus C_i$; else $C_i = S_i$;
calculate $z_i = \frac{1}{|C_i|} \sum_{x_j \in C_i} x_j$.
Step 3 (cluster assignment): Assign each point $x_j$, $j = 1, 2, \ldots, m$, to the cluster $C_i$ whose center $z_i$, $i = 1, 2, \ldots, k$, is nearest.
Step 4: Update the clustering centers: $z_i' = \frac{1}{|C_i|} \sum_{x_j \in C_i} x_j$.
Step 5: If $z_i' = z_i$ for all $i = 1, 2, \ldots, k$, terminate; otherwise set $z_i = z_i'$ and go to Step 3.
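A minimal Python sketch of the initialization in Steps 1-2 is given below; Steps 3-5 coincide with the k-means iteration above, so the resulting centers can be passed as init to the kmeans sketch. The function name is our own, and the threshold d = M/k follows the segmentation step as reconstructed above.

```python
import numpy as np

def maxmin_segmentation_centers(X, k):
    """Max-Min segmentation initial centers (Steps 1-2): take the
    remaining point of largest norm as the seed z_i, gather the remaining
    points within distance d of it, average them to obtain the i-th
    initial center, and remove them before the next pass."""
    # Step 1: maximum pairwise distance M and threshold d = M/k
    # (d = M/k is our reading of the segmentation step in the text).
    pairwise = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    d = pairwise.max() / k
    remaining = X.copy()
    centers = []
    for i in range(k):
        if len(remaining) == 0:
            break  # fewer than k groups could be formed for this d
        if i < k - 1:
            # Seed: the remaining point with the largest norm ||x_j||_2.
            seed = remaining[np.linalg.norm(remaining, axis=1).argmax()]
            mask = np.linalg.norm(remaining - seed, axis=1) <= d
            group, remaining = remaining[mask], remaining[~mask]
        else:
            group = remaining  # the last cluster takes all leftover points
        centers.append(group.mean(axis=0))
    return np.array(centers)

# Usage: Z0 = maxmin_segmentation_centers(X, k); kmeans(X, k, init=Z0)
```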

3 Gene Sequence Matrix

Let us introduce the procedure for generating the gene sequence matrix. A gene is a sequence consisting of four nucleotides, which are simply denoted by the four letters A, C, G, and T. For example,

GGGCTACGTAAACGGGTCCGGAATTCGAT

is one gene sequence. We use an integer row vector to recode each sequence: its components count the occurrences of the 64 possible nucleotide triplets over A, C, G, and T, where the first component corresponds to AAA, the second to AAC, the third to AAG, and so on up to the sixty-fourth, TTT. Recoding every gene in this way, we obtain the gene sequence matrix, denoted $GR_{30000 \times 64}$, and we employ the improved k-means clustering algorithm to cluster these 30000 points in the 64-dimensional vector space.
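The recoding step can be sketched in a few lines of Python. The paper does not specify whether triplet occurrences are counted with overlap, so this sketch counts every overlapping triplet and assumes the sequences contain only the letters A, C, G, and T; the names are our own illustrative choices.

```python
from itertools import product

# The 64 triplets AAA, AAC, AAG, ..., TTT in lexicographic order over A, C, G, T.
TRIPLETS = [''.join(p) for p in product('ACGT', repeat=3)]
COLUMN = {t: i for i, t in enumerate(TRIPLETS)}

def recode_gene(seq):
    """Recode one gene sequence as a 64-dimensional triplet-count row vector."""
    counts = [0] * 64
    for i in range(len(seq) - 2):
        counts[COLUMN[seq[i:i + 3]]] += 1  # count each overlapping triplet
    return counts

# Example: the sequence from the text becomes one row of the 30000-by-64 matrix GR.
row = recode_gene("GGGCTACGTAAACGGGTCCGGAATTCGAT")
```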

4 Conclusion

Using Matlab 7.0 on a PC with a 3.0 GHz CPU and 2.0 GB of DDR memory, the computation took four weeks and eleven hours to finish. Finally, we obtain 4167 gene families.

Acknowledgment

This research has been supported by the National Natural Science Foundation of China under Grants Nos. 90818020 and 60873206, and by the Natural Science Foundation and Education Department of Zhejiang Province under Grants Nos. Y7080235 and Y200805339.

References

[1] G.-B. Huang, Q.-Y. Zhu, and C.-K. Siew, 2006, Extreme learning machine: theory and applications, Neurocomputing, 70(1-3), 489-501.

[2] Boris Mirkin, 1996, Chapter 3: Clustering algorithms: a review, in Mathematical Classification and Clustering, Kluwer Academic Publishers, 109-169.

[3] Boris Mirkin, 2005, Chapter 3: K-means clustering, in Clustering for Data Mining, Taylor & Francis Group, 75-110.

[4] Boris Mirkin, 1999, Concept learning and feature selection based on square-error clustering, Machine Learning, 35(1), 25-39.

[5] Shehroz S. Khan and Amir Ahmad, 2004, Cluster center initialization algorithm for K-means clustering, Pattern Recognition Letters, 25(11), 1293-1302.

[6] Paul S. Bradley and Usama M. Fayyad, 1998, Refining initial points for K-means clustering, Proc. 15th International Conference on Machine Learning, Morgan Kaufmann, San Francisco, CA, 91-99.

[7] Sanghamitra Bandyopadhyay and Ujjwal Maulik, 2002, An evolutionary technique based on K-means algorithm for optimal clustering in $R^N$, Information Sciences, 146(1), 221-237.

[8] P.S. Bradley, O.L. Mangasarian, and W.N. Street, 1996, Clustering via concave minimization, in Advances in Neural Information Processing Systems, M.C. Mozer, M.I. Jordan, and T. Petsche (eds.), MIT Press, Cambridge, MA, 368-374.


[9] S.Z. Selim and M.A. Ismail, 1984, K-means-type algorithms: a generalized convergence theorem and characterization of local optimality, IEEE Trans. Pattern Anal. Mach. Intell., 6(1), 81-87.

[10] A.K. Jain and R.C. Dubes, 1988, Algorithms for Clustering Data, Prentice-Hall, Englewood Cliffs, NJ.

[11] R.O. Duda, P.E. Hart, and D.G. Stork, 2001, Pattern Classification, second edition, Wiley.

[12] V.S. Ananthanarayana, M. Narasimha Murty, and D.K. Subramanian, 2001, Efficient clustering of large data sets, Pattern Recognition, 34, 2561-2563.

[13] D.J. Newman, S. Hettich, C.L. Blake, and C.J. Merz, 1998, UCI Repository of Machine Learning Databases [http://www.ics.uci.edu/~mlearn/MLRepository.html], University of California, Department of Information and Computer Science, Irvine, CA.

[14] Georg Peters, 2006, Some refinements of rough k-means clustering, Pattern Recognition, 39(8), 1481-1491.

[15] Makoto Otsubo, Katsushi Sato, and Atsushi Yamaji, 2006, Computerized identification of stress tensors determined from heterogeneous fault-slip data by combining the multiple inverse method and k-means clustering, Journal of Structural Geology, 28(6), 991-997.

[16] Bjarni Bodvarsson, M. Mørkebjerg, L.K. Hansen, G.M. Knudsen, and C. Svarer, 2006, Extraction of time activity curves from positron emission tomography: K-means clustering or non-negative matrix factorization, NeuroImage, 31(2), 185-186.

[17] R.J. Kuo, H.S. Wang, Tung-Lai Hu, and S.H. Chou, 2005, Application of ant K-means on clustering analysis, Computers & Mathematics with Applications, 50(10-12), 1709-1724.

[18] Youssef M. Marzouk and Ahmed F. Ghoniem, 2005, K-means clustering for optimal partitioning and dynamic load balancing of parallel hierarchical N-body simulations, Journal of Computational Physics, 207(2), 493-528.

[19] David J. Hand and Wojtek J. Krzanowski, 2005, Optimising k-means clustering results with standard software packages, Computational Statistics & Data Analysis, 49(4), 969-973.

[20] Tapas Kanungo, David M. Mount, Nathan S. Netanyahu, Christine D. Piatko, Ruth Silverman, and Angela Y. Wu, 2004, A local search approximation algorithm for k-means clustering, Computational Geometry, 28(2-3), 89-112.
