Clustering
Sridhar S, Department of IST, Anna University


Clustering

Definition
Clustering is the process of organizing objects into groups whose members are similar in some way. A cluster is therefore a collection of objects that are similar to one another and dissimilar to objects belonging to other clusters. Clustering is a data mining technique also called class discovery.

Why Clustering?
Learning from data
Unsupervised learning

What it does:
Pattern detection
Simplification
Concept construction

Supervised vs. Unsupervised Classification
Supervised: the set of classes (clusters) is given; assign each new pattern (point) to the proper cluster and label it with the label of that cluster.
Examples: classifying bacteria to select the proper antibiotic, assigning a signature to a book and placing it on the proper shelf.

Unsupervised: for a given set of patterns, discover a set of clusters (training set) and assign additional patterns to the proper cluster.
Examples: buying behavior, stock groups, closed groups of researchers citing each other, and more.

Examples
Clustering of documents or research groups by citation index, evolution of papers. Problem: find a minimal set of papers describing the essential ideas.
Clustering of items from excavations according to cultural epochs.
Clustering of tooth fragments in anthropology.
Carl von Linné: Systema Naturae, 1735, botany, later zoology.
etc.

The Use of Clustering
Data Mining
Information Retrieval
Text Mining
Web Analysis
Marketing
Medical Diagnostics
Image Analysis
Bioinformatics

Image Analysis: MST Clustering

An image file before/after color clustering using HEMST (hierarchical EMST clustering algorithm) and SEMST (standard EMST clustering algorithm).

Overview of Clustering
Feature Selection: identifying the most effective subset of the original features to use in clustering.
Feature Extraction: transformations of the input features to produce new salient features.
Inter-pattern Similarity: measured by a distance function defined on pairs of patterns.
Grouping: methods to group similar patterns in the same cluster.

Features with respect to microarray data can be genes. We will not talk about feature selection/extraction.

Conducting Cluster Analysis
Formulate the problem
Select a distance measure
Select a clustering procedure
Decide on the number of clusters
Interpret and profile clusters
Assess the validity of clustering

Formulating the Problem
Most important is selecting the variables on which the clustering is based. Inclusion of even one or two irrelevant variables may distort a clustering solution.

Select a Similarity Measure
The similarity measure can be correlations or distances.

The most commonly used measure of similarity is the Euclidean distance. The city-block distance is also used.

If variables are measured in vastly different units, the data must be standardized. Outliers should also be eliminated.

Use of different similarity/distance measures may lead to different clustering results. Hence, it is advisable to use different measures and compare the results.

Distance Measures
The Euclidean distance between p = (x, y) and q = (s, t) is defined as:
De(p, q) = [(x - s)^2 + (y - t)^2]^(1/2)

D4 distance
The D4 distance (also called city-block distance) between p and q is defined as:
D4(p, q) = |x - s| + |y - t|

D8 distance
The D8 distance (also called chessboard distance) between p and q is defined as:
D8(p, q) = max(|x - s|, |y - t|)
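As an editorial illustration (not part of the original slides), the three distances can be written directly in Python; the function names here are ours:

import math

def euclidean(p, q):
    # De(p, q) = [(x - s)^2 + (y - t)^2]^(1/2)
    (x, y), (s, t) = p, q
    return math.sqrt((x - s) ** 2 + (y - t) ** 2)

def city_block(p, q):
    # D4(p, q) = |x - s| + |y - t|
    (x, y), (s, t) = p, q
    return abs(x - s) + abs(y - t)

def chessboard(p, q):
    # D8(p, q) = max(|x - s|, |y - t|)
    (x, y), (s, t) = p, q
    return max(abs(x - s), abs(y - t))

p, q = (4, 4), (8, 7)
print(euclidean(p, q), city_block(p, q), chessboard(p, q))  # 5.0 7 4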

Example

Metric Distances
What properties should a similarity distance have?

D(A,B) = D(B,A)              Symmetry
D(A,A) = 0                   Constancy of self-similarity
D(A,B) >= 0                  Positivity
D(A,B) <= D(A,C) + D(B,C)    Triangle inequality

Hierarchical Clustering
Produces a set of nested clusters organized as a hierarchical tree.
Can be visualized as a dendrogram: a tree-like diagram that records the sequences of merges or splits.

Hierarchical Clustering
Two main types of hierarchical clustering:

Agglomerative: start with the points as individual clusters; at each step, merge the closest pair of clusters until only one cluster (or k clusters) is left.

Divisive Algorithms
Divisive: start with one, all-inclusive cluster; at each step, split a cluster until each cluster contains a single point (or there are k clusters).

Single Linkage Algorithm
Distance between two clusters: the single-link distance between clusters Ci and Cj is the minimum distance between any object in Ci and any object in Cj.

The distance is defined by the two most similar objects

Initial Table
Pixel   X    Y
1       4    4
2       8    4
3      15    8
4      24    4
5      24   12

First Iteration
        1      2      3      4      5
1       -    4.0   11.7   20.0   21.5
2     4.0      -    8.1   16.0   17.9
3    11.7    8.1      -    9.8    9.8
4    20.0   16.0    9.8      -    8.0
5    21.5   17.9    9.8    8.0      -
Objects 1 and 2 should be clustered.

Second Iteration
        {1,2}      3      4      5
{1,2}       -    8.1   16.0   17.9
3         8.1      -    9.8    9.8
4        16.0    9.8      -    8.0
5        17.9    9.8    8.0      -
Objects 4 and 5 should be clustered.

Third Iteration
        {1,2}      3  {4,5}
{1,2}       -    8.1   16.0
3         8.1      -    9.8
{4,5}    16.0    9.8      -
Objects {1,2} and {3} are clustered.

Dendrogram
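The first-iteration distance matrix above can be reproduced with a short NumPy sketch (an editorial addition, not from the slides):

import numpy as np

# Pixels 1..5 from the initial table
pts = np.array([(4, 4), (8, 4), (15, 8), (24, 4), (24, 12)], dtype=float)

# Pairwise Euclidean distances, rounded to one decimal as in the table
diff = pts[:, None, :] - pts[None, :, :]
dist = np.sqrt((diff ** 2).sum(axis=-1))
print(np.round(dist, 1))
# Row 1 reads 0.0, 4.0, 11.7, 20.0, 21.5, matching the table above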

Complete Linkage Algorithm
The complete-link distance between clusters Ci and Cj is the maximum distance between any object in Ci and any object in Cj.

Initial Table
Pixel   X    Y
1       4    4
2       8    4
3      15    8
4      24    4
5      24   12

First Iteration
        1      2      3      4      5
1       -    4.0   11.7   20.0   21.5
2     4.0      -    8.1   16.0   17.9
3    11.7    8.1      -    9.8    9.8
4    20.0   16.0    9.8      -    8.0
5    21.5   17.9    9.8    8.0      -
Objects 1 and 2 should be clustered.

Second Iteration
        {1,2}      3      4      5
{1,2}       -   11.7   20.0   21.5
3        11.7      -    9.8    9.8
4        20.0    9.8      -    8.0
5        21.5    9.8    8.0      -
Objects 4 and 5 should be clustered.

Third Iteration
        {1,2}      3  {4,5}
{1,2}       -   11.7   21.5
3        11.7      -    9.8
{4,5}    21.5    9.8      -
Objects {3} and {4,5} are clustered.

Dendrogram

Average Linkage Algorithm
Distance between two clusters: the group-average distance between clusters Ci and Cj is the average distance between any object in Ci and any object in Cj.

Initial Table
Pixel   X    Y
1       4    4
2       8    4
3      15    8
4      24    4
5      24   12

First Iteration
        1      2      3      4      5
1       -    4.0   11.7   20.0   21.5
2     4.0      -    8.1   16.0   17.9
3    11.7    8.1      -    9.8    9.8
4    20.0   16.0    9.8      -    8.0
5    21.5   17.9    9.8    8.0      -
Objects 1 and 2 should be clustered.

Second Iteration
        {1,2}      3      4      5
{1,2}       -    9.9   18.0   19.7
3         9.9      -    9.8    9.8
4        18.0    9.8      -    8.0
5        19.7    9.8    8.0      -
Objects 4 and 5 should be clustered.

Third Iteration
        {1,2}      3  {4,5}
{1,2}       -    9.9   18.9
3         9.9      -    9.8
{4,5}    18.9    9.8      -
Objects {3} and {4,5} are clustered.

Dendrogram
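All three worked examples can be reproduced with SciPy's hierarchical-clustering routines; the sketch below is an editorial addition and assumes SciPy is available:

import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

pts = np.array([(4, 4), (8, 4), (15, 8), (24, 4), (24, 12)], dtype=float)
d = pdist(pts)  # condensed Euclidean distance matrix

for method in ("single", "complete", "average"):
    Z = linkage(d, method=method)
    # Each row of Z lists the two clusters merged and the merge distance
    print(method, np.round(Z[:, :3], 1))

The printed merge sequences match the merge orders shown in the three examples above: all methods merge {1,2} and {4,5} first, and then differ in how {3} is joined.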

Hierarchical Clustering: Group Average
Compromise between single and complete link

Strengths
Less susceptible to noise and outliers

Limitations
Biased towards globular clusters

Ward's Algorithm
Cluster similarity: Ward's method is based on minimum variance.

Find the mean vector, for example of (4,4) and (8,4): ((4+8)/2, (4+4)/2) = (6, 4)

The variance (squared error) of merging objects 1 and 2 is (4-6)^2 + (8-6)^2 + (4-4)^2 + (4-4)^2

= 8

First Iteration (4 clusters)
Clusters                 Squared Error
{1,2},{3},{4},{5}          8.0
{1,3},{2},{4},{5}         68.5
{1,4},{2},{3},{5}        200
{1,5},{2},{3},{4}        232
{2,3},{1},{4},{5}         32.5
{2,4},{1},{3},{5}        128
{2,5},{1},{3},{4}        160
{3,4},{1},{2},{5}         48.5
{3,5},{1},{2},{4}         48.5
{4,5},{1},{2},{3}         32.0

Second Iteration (3 clusters)
Clusters                 Squared Error
{1,2,3},{4},{5}           72.7
{1,2,4},{3},{5}          224
{1,2,5},{3},{4}          266.7
{1,2},{3,4},{5}           56.5
{1,2},{3,5},{4}           56.5
{1,2},{4,5},{3}           40

Third Iteration (2 clusters)
Clusters                 Squared Error
{1,2,3},{4,5}            104.7
{1,2,4,5},{3}            380.0
{1,2},{3,4,5}             94

Dendrogram
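For comparison with the squared-error tables above, SciPy's Ward linkage yields the same merge order ({1,2}, then {4,5}, then {3} with {4,5}). This is an added sketch; note that SciPy reports merge heights as distances rather than the total squared error used on the slides.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

pts = np.array([(4, 4), (8, 4), (15, 8), (24, 4), (24, 12)], dtype=float)

Z = linkage(pts, method="ward")                # Ward's minimum-variance criterion
print(Z)                                       # merge sequence: {1,2}, {4,5}, then {3} with {4,5}
print(fcluster(Z, t=2, criterion="maxclust"))  # labels for a 2-cluster cut, e.g. [1 1 2 2 2]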

Hierarchical Clustering: Time and Space Requirements
O(N^2) space, since it uses the proximity matrix; N is the number of points.

O(N^3) time in many cases: there are N steps, and at each step the proximity matrix of size N^2 must be updated and searched. Complexity can be reduced to O(N^2 log N) time for some approaches.

Overall Assessment
Once a decision is made to combine two clusters, it cannot be undone.

No objective function is directly minimized

Different schemes have problems with one or more of the following:
Sensitivity to noise and outliers
Difficulty handling different-sized clusters and convex shapes
Breaking large clusters

Divisive Hierarchical Algorithms

Spanning Tree
A spanning tree of a graph is a tree and a subgraph that contains all the vertices. A graph may have many spanning trees; for example, the complete graph on four vertices has sixteen spanning trees.

Minimum Spanning Trees
Suppose that the edges of the graph have weights or lengths. The weight of a tree is the sum of the weights of its edges. Different spanning trees therefore have different total lengths. The question is: how do we find the minimum-length spanning tree?

Minimum Spanning Trees
The question can be solved by many different algorithms; two classical minimum-spanning-tree algorithms are:
Kruskal's Algorithm
Prim's Algorithm

Kruskal's Algorithm
Joseph Bernard Kruskal, Jr.
Kruskal's approach: select the minimum-weight edge that does not form a cycle.

Kruskal's Algorithm:

Sort the edges of G in increasing order by length.
Keep a subgraph S of G, initially empty.
For each edge e in sorted order:
    if the endpoints of e are disconnected in S, add e to S.
Return S.
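A minimal runnable version of the pseudocode above (an editorial sketch; the example edge list is hypothetical), using a union-find structure to test whether the endpoints of an edge are already connected in S:

def kruskal(n, edges):
    # edges: list of (weight, u, v); vertices are 0..n-1. Returns MST edge list.
    parent = list(range(n))

    def find(x):                      # union-find with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    mst = []
    for w, u, v in sorted(edges):     # edges in increasing order of weight
        ru, rv = find(u), find(v)
        if ru != rv:                  # endpoints are in different components
            parent[ru] = rv           # add e to S (merge the two components)
            mst.append((u, v, w))
    return mst

# Hypothetical weighted graph on 4 vertices
edges = [(1, 0, 1), (4, 0, 2), (3, 1, 2), (2, 1, 3), (5, 2, 3)]
print(kruskal(4, edges))   # [(0, 1, 1), (1, 3, 2), (1, 2, 3)]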

Kruskal's Algorithm - Example

Prim's Algorithm
Robert Clay Prim
Prim's approach:
Choose an arbitrary start node v.
At any point in time, we have a connected component N containing v, and the remaining nodes V - N.
Choose the minimum-weight edge from N to V - N.

Prim's Algorithm:

Let T be a single vertex x.
While T has fewer than n vertices:
    find the smallest edge connecting T to G - T
    add it to T.

Prim's Algorithm - Example (sequence of figures showing the tree growing one minimum-weight edge at a time on a weighted graph)
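The same pseudocode in runnable form (an editorial sketch; the adjacency list is hypothetical and matches the graph used in the Kruskal sketch), growing T by one smallest crossing edge at a time:

import heapq

def prim(adj, start=0):
    # adj: dict vertex -> list of (weight, neighbor). Returns MST edge list.
    in_tree = {start}
    # heap entries: (weight, neighbor, tree_vertex) for edges leaving the tree
    heap = [(w, v, start) for w, v in adj[start]]
    heapq.heapify(heap)
    mst = []
    while heap and len(in_tree) < len(adj):
        w, v, u = heapq.heappop(heap)   # smallest edge connecting T to G - T
        if v in in_tree:
            continue                     # both endpoints already in T, skip
        in_tree.add(v)
        mst.append((u, v, w))
        for w2, x in adj[v]:
            if x not in in_tree:
                heapq.heappush(heap, (w2, x, v))
    return mst

# Hypothetical weighted graph (undirected), same as in the Kruskal sketch
adj = {
    0: [(1, 1), (4, 2)],
    1: [(1, 0), (3, 2), (2, 3)],
    2: [(4, 0), (3, 1), (5, 3)],
    3: [(2, 1), (5, 2)],
}
print(prim(adj))   # [(0, 1, 1), (1, 3, 2), (1, 2, 3)]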

MST: Divisive Hierarchical Clustering
Build the MST (Minimum Spanning Tree): start with a tree that consists of any point; in successive steps, look for the closest pair of points (p, q) such that one point (p) is in the current tree but the other (q) is not, and add q to the tree with an edge between p and q.
Use the MST for constructing a hierarchy of clusters.
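One way to realize this, sketched here as an editorial addition, is to build the MST over the pairwise distances of the five example pixels and then delete the k - 1 longest MST edges; the connected components that remain are the clusters.

import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components
from scipy.spatial.distance import pdist, squareform

pts = np.array([(4, 4), (8, 4), (15, 8), (24, 4), (24, 12)], dtype=float)
k = 2

dist = squareform(pdist(pts))                 # full pairwise distance matrix
mst = minimum_spanning_tree(dist).toarray()   # MST as a weighted adjacency matrix

# Remove the k-1 heaviest MST edges to split the tree into k clusters
for _ in range(k - 1):
    i, j = np.unravel_index(np.argmax(mst), mst.shape)
    mst[i, j] = 0.0

n_comp, labels = connected_components(mst, directed=False)
print(labels)   # e.g. [0 0 0 1 1]: clusters {1,2,3} and {4,5}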

Partitional Clustering
Introduction: partitional clustering algorithms
Forgy's algorithm
k-means algorithm
Isodata algorithm

Forgy's Algorithm
Uses k samples called seed points; k is the number of clusters to be constructed.
1. Initialize the cluster centroids to the seed points.
2. For each sample, find the cluster centroid nearest it. Put the sample in the cluster identified with this nearest cluster centroid.
3. If no sample changed clusters in step 2, stop.
4. Compute the centroids of the resulting clusters and go to step 2.

Initialization (seed points (4,4) and (8,4))
Sample     Nearest Cluster Centroid
(4,4)      (4,4)
(8,4)      (8,4)
(15,8)
(24,4)
(24,12)

First Iteration
Sample     Nearest Cluster Centroid
(4,4)      (4,4)
(8,4)      (8,4)
(15,8)     (8,4)
(24,4)     (8,4)
(24,12)    (8,4)

Second Iteration
Sample     Nearest Cluster Centroid
(4,4)      (4,4)
(8,4)      (4,4)
(15,8)     (17.75, 7)
(24,4)     (17.75, 7)
(24,12)    (17.75, 7)

Third Iteration
Sample     Nearest Cluster Centroid
(4,4)      (6, 4)
(8,4)      (6, 4)
(15,8)     (21, 8)
(24,4)     (21, 8)
(24,12)    (21, 8)

It has been proved that Forgy's algorithm terminates: eventually no sample changes clusters. If the number of samples is large, it may take the algorithm considerable time to produce stable clusters.

k-means Algorithm
Similar to Forgy's algorithm; k is the number of clusters to be constructed.
Differences:
The centroids of the clusters are recomputed as soon as a sample joins a cluster.
It makes only two passes through the data set, whereas Forgy's algorithm is iterative.

1. Begin with k clusters, each consisting of one of the first k samples. For each of the remaining n - k samples, find the centroid nearest it and put the sample in the cluster identified with this nearest centroid. After each sample is assigned, recompute the centroid of the altered cluster.
2. Go through the data a second time. For each sample, find the centroid nearest it and put the sample in the cluster identified with that centroid. (During this step, do not recompute any centroid.)

Example: set k = 2 and assume that the data are ordered so that the first two samples are (8,4) and (24,4).

Distances for use by the k-means algorithm
Sample     Distance to Centroid (9, 5.3)    Distance to Centroid (24, 8)
(8,4)       1.6                              16.5
(24,4)     15.1                               4.0
(15,8)      6.6                               9.0
(4,4)       5.2                              20.4
(24,12)    16.4                               4.0

Isodata Algorithm
An enhancement of the approach taken by Forgy's algorithm and the k-means algorithm.
k is allowed to range over an interval:
Discard clusters with too few elements.
Merge clusters that are too large or too close together.
Split clusters when there are too few clusters or a cluster contains very dissimilar samples.
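A compact NumPy rendering of Forgy's iteration on the five samples above (an editorial sketch; variable names are ours). It converges to the centroids (6, 4) and (21, 8) found in the third iteration; the k-means variant on the slides differs only in updating centroids immediately and stopping after two passes.

import numpy as np

samples = np.array([(4, 4), (8, 4), (15, 8), (24, 4), (24, 12)], dtype=float)
centroids = samples[:2].copy()        # seed points (4,4) and (8,4)

while True:
    # Step 2: assign each sample to its nearest centroid
    d = np.linalg.norm(samples[:, None, :] - centroids[None, :, :], axis=-1)
    labels = d.argmin(axis=1)
    # Step 4: recompute the centroids of the resulting clusters
    new_centroids = np.array([samples[labels == j].mean(axis=0)
                              for j in range(len(centroids))])
    if np.allclose(new_centroids, centroids):   # Step 3: stop when stable
        break
    centroids = new_centroids

print(centroids)   # [[ 6.  4.] [21.  8.]]
print(labels)      # [0 0 1 1 1]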

Model-Based Clustering
Assume the data are generated from k probability distributions.
Goal: find the distribution parameters.
Algorithm: Expectation Maximization (EM).
Output: distribution parameters and a soft assignment of points to clusters.

Assume k probability distributions with parameters (θ1, ..., θk). Given data X, compute (θ1, ..., θk) such that Pr(X | θ1, ..., θk) [the likelihood] or ln Pr(X | θ1, ..., θk) [the log-likelihood] is maximized. Every point x in X need not be generated by a single distribution; it can be generated by multiple distributions with some probability [soft clustering].

EM Algorithm
Initialize the k distribution parameters (θ1, ..., θk); each distribution parameter corresponds to a cluster center.

Iterate between two steps

Expectation step: (probabilistically) assign points to clusters

Maximization step: estimate model parameters that maximize the likelihood for the given assignment of points

EM Algorithm
Initialize k cluster centers.
Iterate between two steps:
Expectation step: assign points to clusters

Maximization step: estimate model parameters
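In practice the EM loop for Gaussian mixtures is available off the shelf; the sketch below is an editorial addition that assumes scikit-learn is installed, fitting k = 2 Gaussians to the five example samples and reporting the soft assignments:

import numpy as np
from sklearn.mixture import GaussianMixture

X = np.array([(4, 4), (8, 4), (15, 8), (24, 4), (24, 12)], dtype=float)

# EM under the hood: the E-step soft-assigns points to components,
# the M-step re-estimates the means, covariances, and mixing weights.
gm = GaussianMixture(n_components=2, random_state=0).fit(X)
print(gm.means_)            # estimated distribution parameters (cluster centers)
print(gm.predict_proba(X))  # soft assignment of each point to the 2 clusters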

Validity of Clusters
Why assess the validity of clusters? Given some data, any clustering algorithm generates clusters, even when the data are generated randomly. So we need to make sure the clustering results are valid and meaningful.
Measuring the validity of clustering results usually involves:
Optimality of clusters
Verification of the biological meaning of clusters

Optimality of Clusters
Optimal clusters should:
minimize distance within clusters (intracluster)
maximize distance between clusters (intercluster)

Example of an intracluster measure: the squared error
se = sum over clusters ci of sum over x in ci of ||x - mi||^2
where mi is the mean of all instances in cluster ci.
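The squared error can be computed directly; this small editorial sketch evaluates it for the 2-cluster partition {1,2}, {3,4,5} of the example data, giving 94 as in the Ward tables earlier.

import numpy as np

def squared_error(samples, labels):
    # se = sum over clusters ci of sum over x in ci of ||x - mi||^2,
    # where mi is the mean of all instances in cluster ci
    se = 0.0
    for j in np.unique(labels):
        members = samples[labels == j]
        m = members.mean(axis=0)              # cluster mean mi
        se += ((members - m) ** 2).sum()      # squared distances to mi
    return se

samples = np.array([(4, 4), (8, 4), (15, 8), (24, 4), (24, 12)], dtype=float)
print(squared_error(samples, np.array([0, 0, 1, 1, 1])))   # 94.0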
