
Proceedings of the Ninth International Conference on Machine Learning and Cybernetics, Qingdao, 11-14 July 2010

Estimate Human Gene Family Number by Improved K-Means Clustering

Dongmei Pu, Yubo Yuan, Feilong Cao*

Institute of Metrology and Computational Science, China Jiliang University, Hangzhou, Zhejiang 310018, P. R. China

E-MAIL: [email protected]

Abstract: The number of gene families is a very important issue in genome sequencing studies. In this paper, the genome sequences are recoded: each sequence becomes a 64-dimensional vector, so the human genome database is recoded as a matrix with 64 columns. After applying the improved k-means clustering algorithm, the number of families can be determined and every member of each family can be identified.

Keywords: clustering; human genome; data mining

1 Introduction

It is well known that a human gene can be considered as a sequence consisting of four nucleotides, which are simply denoted by the four letters A, C, G, and T. Biologists have been interested in identifying human genes and determining their functions, because these can be used to diagnose human diseases and to design new drugs for them. A human gene can be identified through a series of time-consuming biological experiments, often with the help of computer programs. The DNA sequences announced in 2003 were only rough drafts for each human chromosome. While these drafts had already advanced medical research, more detail was needed. The draft genomic sequences can be compared to a cross-country road excavated by a bulldozer, one that leaves behind many gaps across difficult terrain requiring bridges and other refinements; so it is, too, with charting the landscape of the human genome. Researchers have now filled in the gaps and provided far more details for each chromosome. Much of this was accomplished by comparing particular DNA sequences across populations in genomic regions that may have contained anomalies in the initial samples. For example, some DNA segments have proven unstable during the process of copying them (cloning) for use in sequencing machines.

*Corresponding author.


Correcting minor errors (estimated at one error in every 10,000 DNA subunits) and cataloging mutations will continue for some time to come. The entire collection of human chromosome DNA sequences is freely available to the worldwide research community. DNA sequencing, the process of determining the exact order of the 3 billion chemical building blocks that make up the DNA of the 24 different human chromosomes, was the greatest technical challenge in the Human Genome Project. Achieving this goal has helped reveal the estimated 20,000-25,000 human genes within our DNA, as well as the regions controlling them. The resulting DNA sequence maps are being used by 21st-century scientists to explore human biology and other complex phenomena.

The human body performs many functions, such as sleep, sight, smell, hearing, jumping and running ability, and immunity, and each of these must be controlled by one family of genes. In this paper, we use the improved k-means clustering method to recognize the family members. First, the improved k-means clustering method is proposed; then, the gene sequence database is recoded as a numerical matrix, and the proposed algorithm is used to determine the number of gene families.

2 Improved K-Means Clustering Algorithm

Clustering analysis of a data set aims at discovering smaller, more homogeneous groups from a large heterogeneous collection of data points; it is an important unsupervised classification technique used to identify the inherent structure present in a set of objects. Mathematically speaking, clustering analysis groups a set of m patterns, usually denoted as vectors in n-dimensional real space, into clusters in such a way that patterns in one cluster are similar and patterns in different clusters are dissimilar in some sense. The clustering problem has been addressed in many contexts and by researchers in many disciplines; this reflects its broad appeal and usefulness as one of the steps in


exploratory data analysis. It is used in many fields, including data mining, statistical data analysis, extreme learning machines [1], image segmentation, pattern recognition, bioinformatics, financial series analysis, assisted diagnosis of diseases (such as cancer and heart disease), vector quantization, and various business applications.

In some circumstances the number of clusters, the parameter k, is known a priori, and clustering may be formulated as distributing m patterns in n-dimensional space among k sets such that the patterns in one set are more similar to each other than to patterns in different sets. This involves the minimization of some extrinsic optimization criterion. Agglomerative algorithms, the k-means algorithm, fuzzy algorithms, BIRCH, and CLARANS are a few of the existing clustering methods.

Among them, the k-means algorithm is the most basic and widely used one for clustering (in fact, the mathematical model of clustering is a double minimization problem; see formulation (3)). Random procedures are used to generate the starting clustering centers at the beginning of the k-means algorithm. However, it is known, and can also be observed in the experiments presented in this paper, that the efficiency of the k-means algorithm largely depends on the choice of the initial clustering centers. Boris Mirkin ([2]-[4]) has presented this opinion and proposed some heuristics for the selection of clustering centers, such as MaxMin for producing deviate centroids, deviate centroids with anomalous patterns, intelligent k-means, and so on. In 2004, Khan and Ahmad [5] also showed that the performance of iterative clustering algorithms, which converge to numerous local minima, depends highly on the initial clustering centers, and they proposed a cluster center initialization algorithm (named CCIA); their results showed that the proposed algorithm achieves better performance. Earlier, in 1998, Bradley and Fayyad [6] had shown that better initial starting points can indeed lead to improved solutions of clustering problems. In order to improve the performance of the k-means method for data clustering, a better initial center selection algorithm is proposed in this paper. The idea comes from a partition technique that follows the data distribution: before the k-means algorithm is run, some features of the data set to be clustered are analyzed, and from them the initial clusters for the k-means algorithm are obtained.

Here, we propose an improved k-means clustering algorithm for this purpose.

Clustering in the n-dimensional Euclidean space $R^n$ is the process of partitioning a given set of m points into a number of groups (or clusters) based on some similarity (or dissimilarity) measure. The similarity measure establishes a rule for assigning patterns (points) to the domain of a particular cluster center. Let the set of m points be $S = \{x_1, x_2, \ldots, x_m\}$, with each $x_i$ an n-dimensional vector, and let the k clusters be represented by $\{C_1, C_2, \ldots, C_k\}$. The basic model describing the clustering problem is given by (see [7])

$$C_i \neq \emptyset, \; i = 1, 2, \ldots, k; \quad C_i \cap C_j = \emptyset, \; i \neq j, \; i, j = 1, 2, \ldots, k; \quad \bigcup_{i=1}^{k} C_i = S. \qquad (1)$$

The procedure of finding the k optimal clusters $C_1, C_2, \ldots, C_k$ is equivalent to finding k clustering centers, denoted $\{z_1, z_2, \ldots, z_k\}$. For the sample set of m points $S = \{x_1, x_2, \ldots, x_m\}$, cluster $C_i$ is determined as follows:

$$C_i = \{x_j \mid \|x_j - z_i\| \le \|x_j - z_p\|, \; p \neq i, \; p = 1, 2, \ldots, k, \; x_j \in S\}, \qquad (2)$$

where $\|\cdot\|$ is some norm on $R^n$; that is, $C_i$ is the set of the points that are closest to the cluster center $z_i$.

Therefore, the clustering problem is to find k clustering centers $\{z_1, z_2, \ldots, z_k\}$ such that the sum of the distances from each point in the set S to its nearest center in $\{z_1, z_2, \ldots, z_k\}$ is minimized; that is, $\{z_1, z_2, \ldots, z_k\}$ is the solution of the following optimization problem:

$$\min_{z_1, \ldots, z_k} \sum_{j=1}^{m} \min_{p = 1, \ldots, k} \|x_j - z_p\|. \qquad (3)$$
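As a concrete illustration, the double minimization in (3) can be evaluated directly for any candidate set of centers. Below is a minimal sketch in Python with NumPy; the function name and array shapes are our own illustrative choices, not part of the paper.

```python
import numpy as np

def clustering_objective(X, Z):
    """Evaluate objective (3): the sum, over all m points, of the
    distance from each point x_j to its nearest clustering center.

    X : (m, n) array of the patterns x_1, ..., x_m
    Z : (k, n) array of the candidate centers z_1, ..., z_k
    """
    # Pairwise Euclidean distances, shape (m, k): entry (j, p) is ||x_j - z_p||.
    dists = np.linalg.norm(X[:, None, :] - Z[None, :, :], axis=2)
    # Inner minimization over the k centers, then the sum over the m points.
    return dists.min(axis=1).sum()
```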

The objective function in (3) is in general neither convex nor concave, and hence it can be difficult to find the solution by attacking the problem directly. However, based on Lemma 3.1 in [8], problem (3) can be reformulated as the following constrained optimization problem:

$$\min_{z, t} \; \sum_{j=1}^{m} \sum_{p=1}^{k} t_{jp} \|x_j - z_p\| \qquad (4)$$

$$\text{s.t.} \quad \sum_{p=1}^{k} t_{jp} = 1, \quad t_{jp} \ge 0, \quad j = 1, 2, \ldots, m, \; p = 1, 2, \ldots, k,$$

where $t_{jp} = 1$ if $z_p$ is the closest center to $x_j$, and $t_{jq} = 0$ for $q = 1, 2, \ldots, k$, $q \neq p$. If multiple centers attain the same minimum distance to $x_j$, then the corresponding entries $t_{jq}$ can be nonzero and form a convex combination of this minimum distance.

Usually, if we employ the $\ell_2$-norm in problem (4), the following optimization problem is obtained:

$$\min_{z, t} \; \sum_{j=1}^{m} \sum_{p=1}^{k} t_{jp} \|x_j - z_p\|_2^2 \qquad (5)$$

$$\text{s.t.} \quad \sum_{p=1}^{k} t_{jp} = 1, \quad t_{jp} \ge 0, \quad j = 1, 2, \ldots, m, \; p = 1, 2, \ldots, k,$$


and the k-means algorithm is one of the most widely used clustering techniques for (5). The k-means algorithm is an iterative descent method and can be described as follows.

The k-Means Algorithm.
Step 1: Generate k initial clustering centers $z_1, z_2, \ldots, z_k$.
Step 2 (cluster assignment): Assign each point $x_j$, $j = 1, 2, \ldots, m$, to the cluster $C_i$ whose center $z_i$, $i = 1, 2, \ldots, k$, is nearest.
Step 3: Update the clustering centers: $z_i' = \frac{1}{|C_i|} \sum_{x_j \in C_i} x_j$.
Step 4: If $z_i' = z_i$ for all $i = 1, 2, \ldots, k$, terminate; otherwise set $z_i = z_i'$ and go to Step 2.
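To make the iteration concrete, the following is a minimal sketch of these four steps in Python with NumPy. The random initialization in Step 1, the iteration cap, and the function names are our own illustrative choices, not prescribed by the paper.

```python
import numpy as np

def kmeans(X, k, init=None, max_iter=100, seed=0):
    """Basic k-means (Steps 1-4). X is the (m, n) pattern matrix; init,
    if given, is a (k, n) array of starting centers, otherwise Step 1
    draws k distinct points of S at random."""
    rng = np.random.default_rng(seed)
    # Step 1: initial clustering centers z_1, ..., z_k.
    Z = X[rng.choice(len(X), size=k, replace=False)] if init is None else init.copy()
    labels = np.zeros(len(X), dtype=int)
    for _ in range(max_iter):
        # Step 2: assign each x_j to its nearest center z_i.
        dists = np.linalg.norm(X[:, None, :] - Z[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each center to the mean of its cluster
        # (an empty cluster keeps its old center).
        Z_new = np.array([X[labels == i].mean(axis=0) if np.any(labels == i) else Z[i]
                          for i in range(k)])
        # Step 4: terminate once the centers stop changing.
        if np.allclose(Z_new, Z):
            break
        Z = Z_new
    return Z, labels
```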

The initial clustering centers $z_1, z_2, \ldots, z_k$ in Step 1 are generally randomly generated from the set $S = \{x_1, x_2, \ldots, x_m\}$, and in Step 2 a point $x_j$ satisfying

$$\|x_j - z_i\| = \min_{p = 1, \ldots, k} \|x_j - z_p\|$$

is assigned to the cluster $C_i$.

The k-means algorithm generally works well. However, in order to improve its performance, the following algorithm is proposed to generate the initial clustering centers for the k-means algorithm.

The Max-Min Segmentation Initial Centers Algorithm.
Step 1: Calculate $M = \max_{1 \le i, j \le m, \, i \neq j} \|x_i - x_j\|_2$, and set $d = M/k$ and $S_1 = S$.
Step 2: For $i = 1$ to $k$ do:
if $i < k$, then $C_i = \{x_j \mid \|x_j - z_i\|_2 \le d, \; x_j \in S_i\}$, where $z_i$ satisfies $\|z_i\|_2 = \max\{\|x_j\|_2 \mid x_j \in S_i\}$, and set $S_{i+1} = S_i \setminus C_i$; else $C_i = S_i$;
calculate $z_i = \frac{1}{|C_i|} \sum_{x_j \in C_i} x_j$.
Step 3 (cluster assignment): Assign each point $x_j$, $j = 1, 2, \ldots, m$, to the cluster $C_i$ whose center $z_i$, $i = 1, 2, \ldots, k$, is nearest.
Step 4: Update the clustering centers: $z_i' = \frac{1}{|C_i|} \sum_{x_j \in C_i} x_j$.
Step 5: If $z_i' = z_i$ for all $i = 1, 2, \ldots, k$, terminate; otherwise set $z_i = z_i'$ and go to Step 3.
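A minimal Python sketch of the initialization in Steps 1-2 is given below; Steps 3-5 coincide with the k-means iteration above, so the resulting centers can be passed as init to the kmeans sketch. The function name is our own, and the threshold d = M/k follows the segmentation step as reconstructed above.

```python
import numpy as np

def maxmin_segmentation_centers(X, k):
    """Max-Min segmentation initial centers (Steps 1-2): take the
    remaining point of largest norm as the seed z_i, gather the remaining
    points within distance d of it, average them to obtain the i-th
    initial center, and remove them before the next pass."""
    # Step 1: maximum pairwise distance M and threshold d = M/k
    # (d = M/k is our reading of the segmentation step in the text).
    pairwise = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    d = pairwise.max() / k
    remaining = X.copy()
    centers = []
    for i in range(k):
        if len(remaining) == 0:
            break  # fewer than k groups could be formed for this d
        if i < k - 1:
            # Seed: the remaining point with the largest norm ||x_j||_2.
            seed = remaining[np.linalg.norm(remaining, axis=1).argmax()]
            mask = np.linalg.norm(remaining - seed, axis=1) <= d
            group, remaining = remaining[mask], remaining[~mask]
        else:
            group = remaining  # the last cluster takes all leftover points
        centers.append(group.mean(axis=0))
    return np.array(centers)

# Usage: Z0 = maxmin_segmentation_centers(X, k); kmeans(X, k, init=Z0)
```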

3 Gene Sequence Matrix

Let us introduce the procedure for generating the gene sequence matrix. A gene is a sequence consisting of four nucleotides, which are simply denoted by the four letters A, C, G, and T. For example,

GGGCTACGTAAACGGGTCCGGAATTCGAT

is one gene sequence. We use an integer row vector to recode each sequence: its components count the occurrences of the 64 possible nucleotide triplets over A, C, G, and T, where the first component corresponds to AAA, the second to AAC, the third to AAG, and so on up to the sixty-fourth, TTT. Recoding every gene in this way, we obtain the gene sequence matrix, denoted $GR_{30000 \times 64}$, and we employ the improved k-means clustering algorithm to cluster these 30000 points in the 64-dimensional vector space.
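The recoding step can be sketched in a few lines of Python. The paper does not specify whether triplet occurrences are counted with overlap, so this sketch counts every overlapping triplet and assumes the sequences contain only the letters A, C, G, and T; the names are our own illustrative choices.

```python
from itertools import product

# The 64 triplets AAA, AAC, AAG, ..., TTT in lexicographic order over A, C, G, T.
TRIPLETS = [''.join(p) for p in product('ACGT', repeat=3)]
COLUMN = {t: i for i, t in enumerate(TRIPLETS)}

def recode_gene(seq):
    """Recode one gene sequence as a 64-dimensional triplet-count row vector."""
    counts = [0] * 64
    for i in range(len(seq) - 2):
        counts[COLUMN[seq[i:i + 3]]] += 1  # count each overlapping triplet
    return counts

# Example: the sequence from the text becomes one row of the 30000-by-64 matrix GR.
row = recode_gene("GGGCTACGTAAACGGGTCCGGAATTCGAT")
```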

4 Conclusion

Using Matlab 7.0 on a PC with a 3.0 GHz CPU and 2.0 GB of DDR memory, the computation took four weeks and eleven hours to finish. Finally, we obtain 4167 gene families.

Acknowledgment

This research has been supported by the National Natural Science Foundation of China under Grants Nos. 90818020 and 60873206, and by the Natural Science Foundation and Education Department of Zhejiang Province under Grants Nos. Y7080235 and Y200805339.

References

[1] G.-B. Huang, Q.-Y. Zhu, and C.-K. Siew, 2006, Extreme learning machine: theory and applications, Neurocomputing, 70(1-3), 489-501.

[2] Boris Mirkin, 1996, Chapter 3: Clustering algorithms: a review, in Mathematical Classification and Clustering, Kluwer Academic Publishers, 109-169.

[3] Boris Mirkin, 2005, Chapter 3: K-means clustering, in Clustering for Data Mining, Taylor & Francis Group, 75-110.

[4] Boris Mirkin, 1999, Concept learning and feature selection based on square-error clustering, Machine Learning, 35(1), 25-39.

[5] Shehroz S. Khan and Amir Ahmad, 2004, Cluster center initialization algorithm for K-means clustering, Pattern Recognition Letters, 25(11), 1293-1302.

[6] Paul S. Bradley and Usama M. Fayyad, 1998, Refining initial points for K-means clustering, Proc. 15th International Conference on Machine Learning, Morgan Kaufmann, San Francisco, CA, 91-99.

[7] Sanghamitra Bandyopadhyay and Ujjwal Maulik, 2002, An evolutionary technique based on K-means algorithm for optimal clustering in $R^N$, Information Sciences, 146(1), 221-237.

[8] P.S. Bradley, O.L. Mangasarian, and W.N. Street, 1996, Clustering via concave minimization, in Advances in Neural Information Processing Systems, M.C. Mozer, M.I. Jordan, and T. Petsche (eds.), MIT Press, Cambridge, MA, 368-374.


[9] S.Z. Selim and M.A. Ismail, 1984, K-means-type algorithms: a generalized convergence theorem and characterization of local optimality, IEEE Trans. Pattern Anal. Mach. Intell., 6(1), 81-87.

[10] A.K. Jain and R.C. Dubes, 1988, Algorithms for Clustering Data, Prentice-Hall, Englewood Cliffs, NJ.

[11] R.O. Duda, P.E. Hart, and D.G. Stork, 2001, Pattern Classification, second edition, Wiley.

[12] V.S. Ananthanarayana, M. Narasimha Murty, and D.K. Subramanian, 2001, Efficient clustering of large data sets, Pattern Recognition, 34, 2561-2563.

[13] D.J. Newman, S. Hettich, C.L. Blake, and C.J. Merz, 1998, UCI Repository of Machine Learning Databases [http://www.ics.uci.edu/~mlearn/MLRepository.html], University of California, Department of Information and Computer Science, Irvine, CA.

[14] Georg Peters, 2006, Some refinements of rough k-means clustering, Pattern Recognition, 39(8), 1481-1491.

[15] Makoto Otsubo, Katsushi Sato, and Atsushi Yamaji, 2006, Computerized identification of stress tensors determined from heterogeneous fault-slip data by combining the multiple inverse method and k-means clustering, Journal of Structural Geology, 28(6), 991-997.

[16] Bjarni Bodvarsson, M. Mørkebjerg, L.K. Hansen, G.M. Knudsen, and C. Svarer, 2006, Extraction of time activity curves from positron emission tomography: K-means clustering or non-negative matrix factorization, NeuroImage, 31(2), 185-186.

[17] R.J. Kuo, H.S. Wang, Tung-Lai Hu, and S.H. Chou, 2005, Application of ant K-means on clustering analysis, Computers & Mathematics with Applications, 50(10-12), 1709-1724.

[18] Youssef M. Marzouk and Ahmed F. Ghoniem, 2005, K-means clustering for optimal partitioning and dynamic load balancing of parallel hierarchical N-body simulations, Journal of Computational Physics, 207(2), 493-528.

[19] David J. Hand and Wojtek J. Krzanowski, 2005, Optimising k-means clustering results with standard software packages, Computational Statistics & Data Analysis, 49(4), 969-973.

[20] Tapas Kanungo, David M. Mount, Nathan S. Netanyahu, Christine D. Piatko, Ruth Silverman, and Angela Y. Wu, 2004, A local search approximation algorithm for k-means clustering, Computational Geometry, 28(2-3), 89-112.
