Proceedings of the Ninth International Conference on Machine Learning and Cybernetics, Qingdao, 11-14 July 2010
Estimate Human Gene Family Number by Improved K-Means Clustering
Dongmei Pu, Yubo Yuan, Feilong Cao*
Institute of Metrology and Computational Science, China Jiliang University, Hangzhou, Zhejiang 310018, P. R. China
E-MAIL: [email protected]
Abstract: The number of gene families is a very important issue in genome sequencing studies. In this paper, the genome sequences are recoded: each sequence becomes a 64-dimensional vector, so the human genome database is recoded as a matrix with 64 columns. After applying the improved k-means clustering algorithm, the number of families can be determined, as well as the members of each family.
Keywords: clustering; human genome; data mining
1 Introduction
It is well known that a human gene can be considered as a sequence consisting of four nucleotides, which are simply denoted by the four letters A, C, G, and T. Biologists have long been interested in identifying human genes and determining their functions, because these can be used to diagnose human diseases and to design new drugs for them. A human gene can be identified through a series of time-consuming biological experiments, often with the help of computer programs. The DNA sequences announced in 2003 were only rough drafts for each human chromosome. While this draft had already advanced medical research, more detail was needed. The draft genomic sequences can be compared broadly to a cross-country road excavated by a bulldozer, which leaves behind many gaps across difficult terrain that will require bridges and other refinements. So, too, with charting the landscape of the human genome. Researchers have now filled in the gaps and provided far more detail for each chromosome. Much of this was accomplished by comparing particular DNA sequences across populations in genomic areas that may have contained anomalies in the initial samples. For example, some DNA segments have proven unstable during the process of copying them (cloning) for
*Corresponding Author
978-1-4244-6527-9/10/$26.00 ©2010 IEEE
use in sequencing machines. Correcting minor errors (estimated at 1 error in every 10,000 DNA subunits) and cataloging mutations will continue for some time to come. The entire collection of human chromosome DNA sequences is freely available to the worldwide research community. DNA sequencing, the process of determining the exact order of the 3 billion chemical building blocks that make up the DNA of the 24 different human chromosomes, was the greatest technical challenge in the Human Genome Project. Achieving this goal has helped reveal the estimated 20,000-25,000 human genes within our DNA, as well as the regions controlling them. The resulting DNA sequence maps are being used by 21st-century scientists to explore human biology and other complex phenomena. The human body performs many functions, such as sleep, sight, smell, hearing, jumping, running, and immunity, and each of them must be controlled by a family of genes.
In this paper, an improved k-means clustering method is proposed and used to recognize family members. The gene sequence database is recoded as a numerical matrix, and the proposed algorithm is applied to determine the number of gene families.
2 Improved K-Means Clustering Algorithm
Clustering analysis aims at discovering smaller, more homogeneous groups within a large heterogeneous collection of data points; it is an important unsupervised classification technique for identifying inherent structure in a set of objects. Mathematically speaking, clustering analysis groups a set of m patterns, usually denoted as vectors in n-dimensional real space, into clusters in such a way that patterns in one cluster are similar and patterns in different clusters are dissimilar in some sense. The clustering problem has been addressed in many contexts and by researchers in many disciplines; this reflects its broad appeal and usefulness as one of the steps in
exploratory data analysis. It is used in many fields, including data mining, statistical data analysis, extreme learning machines [1], image segmentation, pattern recognition, bioinformatics, financial series analysis, assisted disease diagnosis (e.g., of cancer and heart disease), vector quantization, and various business applications.
In some circumstances the number of clusters, the parameter k, is known a priori, and clustering may be formulated as distributing m patterns in n-dimensional space among k sets such that the patterns in one set are more similar to each other than to patterns in different sets. This involves minimizing some extrinsic optimization criterion. Agglomerative algorithms, the k-means algorithm, fuzzy algorithms, BIRCH, and CLARANS are a few of the existing clustering methods.
Among them, the k-means algorithm is the most basic and widely used one for clustering (in fact, the mathematical model of clustering is a double minimization problem; see formulation (3)). Random procedures are used to generate starting clustering centers at the beginning of the k-means algorithm. However, it is known, and can also be seen from the experiments presented in this paper, that the efficiency of the k-means algorithm largely depends on the choice of the initial clustering centers. Boris Mirkin ([2]-[4]) has presented this opinion and proposed several heuristics for selecting clustering centers, such as MaxMin for producing deviate centroids, deviate centroids with anomalous patterns, and intelligent k-means. In 2004, Khan and Ahmad ([5]) also showed that the performance of iterative clustering algorithms, which converge to numerous local minima, depends highly on the initial clustering centers, and proposed a cluster center initialization algorithm (named CCIA); their results showed that the proposed algorithm could achieve better performance. Also, in 1998, Bradley and Fayyad ([6]) proved that better initial starting points can indeed lead to improved solutions for clustering problems. In order to improve the performance of the k-means method for data clustering, a better initial center selection algorithm is proposed in this paper. The idea comes from a partition technique based on the data distribution: before the k-means algorithm is run, some features of the data set are analyzed, and the initial clusters for the k-means algorithm are thereby obtained.
Here, we propose an improved k-means clustering algorithm for this purpose.
Clustering in n-dimensional Euclidean space R^n is the process of partitioning a given set of m points into a number of groups (or clusters) based on some similarity (or dissimilarity) measure. The similarity establishes a rule for assigning patterns (points) to the domain of a particular cluster center. Let the set of m points be S = {x_1, x_2, ..., x_m}, with each x_i an n-dimensional vector, and let the k clusters be {C_1, C_2, ..., C_k}. The basic model describing the clustering problem is given by (see [7])

C_i ≠ ∅, i = 1, 2, ..., k;
C_i ∩ C_j = ∅, i ≠ j, i, j = 1, 2, ..., k;        (1)
∪_{i=1}^{k} C_i = S.
The procedure of finding the k optimal clusters C_1, C_2, ..., C_k is equivalent to finding k clustering centers, denoted {z_1, z_2, ..., z_k}. For the sample set of m points S = {x_1, x_2, ..., x_m}, cluster C_i is determined as

C_i = {x_j : ||x_j − z_i|| ≤ ||x_j − z_p||, p ≠ i, p = 1, 2, ..., k, x_j ∈ S},        (2)

where ||·|| is some norm on R^n; that is, C_i is the set of points closest to the cluster center z_i.
Therefore, the clustering problem is to find k clustering centers {z_1, z_2, ..., z_k} such that the sum of the distances from each point in S to its nearest center is minimized; that is, {z_1, z_2, ..., z_k} is the solution of the following optimization problem:

min_{z}  Σ_{j=1}^{m}  min_{1≤p≤k}  ||x_j − z_p||.        (3)
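The double minimization in (3) can be evaluated directly for any candidate set of centers; the following sketch (with hypothetical data) shows the inner minimization over centers and the outer sum over points:

```python
import math

def clustering_objective(points, centers):
    """Objective (3): each point contributes its Euclidean distance
    to the nearest of the candidate centers."""
    return sum(
        min(math.dist(x, z) for z in centers)  # inner minimization over p
        for x in points                        # outer sum over j
    )

points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0)]
centers = [(0.0, 0.5), (10.0, 10.0)]
print(clustering_objective(points, centers))  # 1.0: two half-unit distances, one zero
```

Problem (3) asks for the centers that make this quantity as small as possible.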
The objective function in (3) is in general neither convex nor concave, and hence it may be difficult to solve the problem directly. However, based on Lemma 3.1 in [8], problem (3) can be reformulated as the following constrained optimization problem:

min_{z,t}  Σ_{j=1}^{m} Σ_{p=1}^{k} t_{jp} ||x_j − z_p||        (4)

s.t.  Σ_{p=1}^{k} t_{jp} = 1,  t_{jp} ≥ 0,  j = 1, 2, ..., m,  p = 1, 2, ..., k,

where t_{jp} = 1 if z_p is the closest center to x_j, and t_{jq} = 0 for every other q ≠ p. If multiple centers attain the same minimum distance to x_j, the corresponding t_{jp} may be nonzero for each of them, forming a convex combination that still realizes this minimum distance.
Usually, if the ℓ2-norm is employed in problem (4), the following optimization problem is obtained:

min_{z,t}  Σ_{j=1}^{m} Σ_{p=1}^{k} t_{jp} ||x_j − z_p||_2^2        (5)

s.t.  Σ_{p=1}^{k} t_{jp} = 1,  t_{jp} ≥ 0,  j = 1, 2, ..., m,  p = 1, 2, ..., k,
and the k-means algorithm is one of the widely used clustering techniques for (5). The k-means algorithm is an iterative descent method and can be described as follows:
The k-Means Algorithm
Step 1: Generate k initial clustering centers z_1, z_2, ..., z_k.
Step 2 (cluster assignment): Assign each point x_j, j = 1, 2, ..., m, to the cluster C_i whose center z_i is nearest.
Step 3 (center update): Update each clustering center as z'_i = (1/|C_i|) Σ_{x_j ∈ C_i} x_j.
Step 4: If z'_i = z_i for all i = 1, 2, ..., k, terminate; otherwise set z_i = z'_i and go to Step 2.

The initial clustering centers z_1, z_2, ..., z_k in Step 1 are generally generated randomly from the set S = {x_1, x_2, ..., x_m}, and in Step 2 a point x_j satisfying ||x_j − z_i|| ≤ ||x_j − z_p|| for all p = 1, 2, ..., k is assigned to cluster C_i.
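The four steps can be sketched in Python (a minimal illustration, not the authors' implementation; it draws the Step 1 centers at random from the data and uses squared Euclidean distance, as in (5)):

```python
import random

def dist2(x, z):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(x, z))

def mean(cluster):
    """Componentwise mean of a non-empty list of tuples."""
    return tuple(sum(c) / len(cluster) for c in zip(*cluster))

def k_means(points, k, max_iter=100, seed=0):
    # Step 1: draw k distinct initial centers from the data at random.
    centers = random.Random(seed).sample(points, k)
    for _ in range(max_iter):
        # Step 2: assign each point to the cluster with the nearest center.
        clusters = [[] for _ in range(k)]
        for x in points:
            clusters[min(range(k), key=lambda p: dist2(x, centers[p]))].append(x)
        # Step 3: update each center to the mean of its cluster
        # (an empty cluster keeps its previous center).
        new_centers = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
        # Step 4: stop once no center moves; otherwise iterate.
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters

# Two well-separated groups of three points each: k-means with k = 2
# recovers the group means (1/3, 1/3) and (28/3, 28/3).
data = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
centers, _ = k_means(data, k=2)
print(sorted(centers))
```

On this toy data every random start converges to the same split; on harder data the result depends on Step 1, which is exactly the sensitivity the next algorithm addresses.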
The k-means algorithm generally works well. However, in order to improve its performance, the following algorithm is proposed to generate the initial clustering centers for the k-means algorithm.
The Max-Min Segmentation Initial Centers Algorithm
Step 1: Calculate M = max_{1≤i,j≤m, i≠j} ||x_i − x_j||_2, and set d = M/k, S_1 = S.
Step 2: For i = 1 to k:
    if i < k, set C_i = {x_j : ||x_j − z_i||_2 ≤ d, x_j ∈ S_i}, where z_i is the point with ||z_i||_2 = max{||x_j||_2 : x_j ∈ S_i}, and set S_{i+1} = S_i \ C_i; else set C_i = S_i;
    calculate z_i = (1/|C_i|) Σ_{x_j ∈ C_i} x_j.
Step 3 (cluster assignment): Assign each point x_j, j = 1, 2, ..., m, to the cluster C_i whose center z_i is nearest.
Step 4 (center update): Update each clustering center as z'_i = (1/|C_i|) Σ_{x_j ∈ C_i} x_j.
Step 5: If z'_i = z_i for all i = 1, 2, ..., k, terminate; otherwise set z_i = z'_i and go to Step 3.
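The initialization (Steps 1-2) can be sketched in Python (a reconstruction, not the authors' code; the slice width d = M/k and the assumption that every slice is non-empty are our reading of the description):

```python
import math

def norm(x):
    return math.sqrt(sum(a * a for a in x))

def dist(x, z):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, z)))

def mean(cluster):
    return tuple(sum(c) / len(cluster) for c in zip(*cluster))

def max_min_initial_centers(points, k):
    """Max-Min segmentation initialization (sketch). Repeatedly seed a
    slice with the remaining point of largest norm, carve off everything
    within distance d of it, and use each slice's mean as a center."""
    # Step 1: M is the largest pairwise distance; d = M/k (an assumption).
    M = max(dist(a, b) for a in points for b in points)
    d = M / k
    remaining = list(points)
    centers = []
    for i in range(k):
        if i < k - 1:
            seed = max(remaining, key=norm)          # max-norm point z_i
            slice_ = [x for x in remaining if dist(x, seed) <= d]
            remaining = [x for x in remaining if dist(x, seed) > d]
        else:
            slice_ = remaining                       # last slice: what is left
        centers.append(mean(slice_))
    return centers

data = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
print(max_min_initial_centers(data, k=2))
```

On this data the first slice captures the far group around its max-norm point, so the two initial centers already sit near the two group means and Steps 3-5 converge immediately.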
3 Gene Sequence Matrix
Let us introduce the procedure for generating the gene sequence matrix. One gene is a sequence consisting of four nucleotides, simply denoted by the four letters A, C, G, and T. For example,

GGGCTACGTAAACGGGTCCGGAATTCGAT

is one gene sequence. We rewrite each sequence as an integer row vector whose components count the 64 nucleotide triplets over A, C, G, and T: the first component counts AAA, the second AAC, the third AAG, ..., and the sixty-fourth TTT. The gene above can thus be rewritten as a vector of the form (0, 1, 0, 0, 0, 0, 1, ..., 0). Using this method, we obtain the gene sequence matrix, denoted G ∈ R^{30000×64}. We employ the improved k-means clustering to cluster these 30000 points in 64-dimensional vector space.
4 Conclusion
Using Matlab 7.0, the computation took four weeks and eleven hours on a PC with a 3.0 GHz CPU and 2.0 GB of DDR memory. Finally, we find that the number of gene families is 4167.
Acknowledgment
This research has been supported by the National Natural Science Foundation of China under Grants Nos. 90818020 and 60873206, and by the Natural Science Foundation and Education Department of Zhejiang Province under Grants Nos. Y7080235 and Y200805339.
References
[1] Huang G.B., Zhu Q.Y., Siew C.K., 2006, Extreme learning machine: theory and applications, Neurocomputing, series 70(1-3), 489-501.
[2] Boris Mirkin, 1996, Chapter 3: Clustering Algorithms: A Review, Mathematical Classification and Clustering, Kluwer Academic Publishers, 109-169.
[3] Boris Mirkin, 2005, Chapter 3: K-Means Clustering, Clustering for Data Mining, Taylor & Francis Group, 75-110.
[4] Boris Mirkin, 1999, Concept Learning and Feature Selection Based on Square-Error Clustering, Machine Learning, series 35(1), 25-39.
[5] Shehroz S. Khan and Amir Ahmad, 2004, Cluster center initialization algorithm for K-means clustering, Pattern Recognition Letters, series 25(11), 1293-1302.
[6] Paul S. Bradley and Usama M. Fayyad, 1998, Refining initial points for K-means clustering, Proc. 15th International Conf. on Machine Learning, Morgan Kaufmann, San Francisco, CA, 91-99.
[7] Sanghamitra Bandyopadhyay and Ujjwal Maulik, 2002, An evolutionary technique based on K-Means algorithm for optimal clustering in R^N, Information Sciences, series 146(1), 221-237.
[8] P.S. Bradley, O.L. Mangasarian and W.N. Street, 1996, Clustering via concave minimization, in Advances in Neural Information Processing Systems, M.C. Mozer, M.I. Jordan and T. Petsche (eds.), MIT Press, Cambridge, MA, 368-374.
[9] S.Z. Selim, M.A. Ismail, 1984, K-means-type algorithms: a generalized convergence theorem and characterization of local optimality, IEEE Trans. Pattern Anal. Mach. Intell., series 6(1), 81-87.
[10] A. K. Jain and R. C. Dubes, 1988, Algorithms for Clustering Data, Prentice-Hall, Englewood Cliffs, NJ.
[11] R. O. Duda, P. E. Hart, and D. G. Stork, 2001, Pattern Classification, Wiley, second edition.
[12] V.S. Ananthanarayana, M. Narasimha Murty and D.K. Subramanian, 2001, Efficient clustering of large data sets, Pattern Recognition, series 34, 2561-2563.
[13] D.J. Newman, S. Hettich, C.L. Blake and C.J. Merz, 1998, UCI Repository of machine learning databases [http://www.ics.uci.edu/~mlearn/MLRepository.html], Irvine, CA: University of California, Department of Information and Computer Science.
[14] Georg Peters, 2006, Some refinements of rough k-means clustering, Pattern Recognition, series 39(8), 1481-1491.
[15] Makoto Otsubo, Katsushi Sato and Atsushi Yamaji, 2006, Computerized identification of stress tensors determined from heterogeneous fault-slip data by combining the multiple inverse method and k-means clustering, Journal of Structural Geology, series 28(6), 991-997.
[16] Bjarni Bodvarsson, M. Mørkebjerg, L.K. Hansen, G.M. Knudsen and C. Svarer, 2006, Extraction of time activity curves from positron emission tomography: K-means clustering or non-negative matrix factorization, NeuroImage, series 31(2), 185-186.
[17] R.J. Kuo, H.S. Wang, Tung-Lai Hu and S.H. Chou, 2005, Application of ant K-means on clustering analysis, Computers & Mathematics with Applications, series 50(10-12), 1709-1724.
[18] Youssef M. Marzouk and Ahmed F. Ghoniem, 2005, K-means clustering for optimal partitioning and dynamic load balancing of parallel hierarchical N-body simulations, Journal of Computational Physics, series 207(2), 493-528.
[19] David J. Hand and Wojtek J. Krzanowski, 2005, Optimising k-means clustering results with standard software packages, Computational Statistics & Data Analysis, series 49(4), 969-973.
[20] Tapas Kanungo, David M. Mount, Nathan S. Netanyahu, Christine D. Piatko, Ruth Silverman and Angela Y. Wu, 2004, A local search approximation algorithm for k-means clustering, Computational Geometry, series 28(2-3), 89-112.