By Timofey Shulepov Clustering Algorithms. Clustering - main features Clustering – a data mining...

by Timofey Shulepov

Clustering Algorithms

Clustering - main features

Clustering – a data mining technique

Def.: Classification of objects into sets such that objects in the same set share common traits, while objects in different sets do not.

Usage:
Statistical Data Analysis
Machine Learning
Data Mining
Pattern Recognition
Image Analysis
Bioinformatics

Types of clustering

Hierarchical: finding new clusters using previously found ones

Partitional: finding all clusters at once

Self-Organizing Maps

Hybrids (incremental)

Concept of distance measure

Distance measure – determines how the similarity of two elements is calculated.

Similarity is expressed in terms of a distance function.

Distance functions – vary significantly for interval-scaled, categorical, and other variables.

Examples of distance functions: Euclidean distance, Manhattan distance, etc.

Distance functions, in more detail.

Euclidean distance – aka “as the crow flies”, or 2-norm distance. The most commonly used one, the usually implied distance measurement (ruler, 2 dots).

Manhattan distance – aka “taxicab” or 1-norm distance. Going from A to B via intersections (sort of).

Maximum norm – aka Chebyshev or infinity-norm distance. The largest absolute difference along any single coordinate.

Mahalanobis distance – similar to Euclidean, but it takes the correlations within the data set into account, and is scale-invariant.

Garcia
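The first three distance functions above can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this example, points are assumed to be equal-length tuples of numbers, and the maximum norm is taken to be the largest absolute coordinate difference.

```python
import math

def euclidean(a, b):
    """2-norm distance: straight-line length between a and b."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """1-norm distance: sum of the absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def maximum_norm(a, b):
    """Maximum-norm distance: largest absolute coordinate difference."""
    return max(abs(x - y) for x, y in zip(a, b))

a, b = (0.0, 0.0), (3.0, 4.0)
print(euclidean(a, b))     # 5.0
print(manhattan(a, b))     # 7.0
print(maximum_norm(a, b))  # 4.0
```

For the same pair of points the three measures disagree (5, 7 and 4), which is why the choice of distance function affects which elements end up in the same cluster.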

Hierarchical Clustering

Hierarchical clustering

Result: Given the input set S, the goal is to produce a hierarchy (dendrogram) in which nodes represent subsets of S simulating the structure found in S.

Can be agglomerative or divisive

Agglomerative – “bottom-up”: begin with each element as a separate cluster, and merge clusters into successively larger ones.

Divisive – “top-down”: begin with one large set, and divide it into smaller sets.

Agglomerative Hierarchical Clustering

1. Place each instance of S in its own cluster (singleton), creating the list of clusters L (initially, the leaves of T): L = S1, S2, S3, ..., Sn.

2. Compute a merging cost function between every pair of elements in L to find the two closest clusters {Si, Sj}, which will be the cheapest couple to merge.

3. Remove Si & Sj from L.

4. Merge Si & Sj to create a new internal node Sij in T, which will be the parent of Si & Sj in the result tree, and add Sij to L.

5. Go to (2) until there is only one set remaining.
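The five steps above could look roughly like this in Python. The slides do not say which merging cost function to use, so this sketch assumes single linkage (the distance between the closest pair of members) with Euclidean distance. The names `agglomerate` and `single_link_cost` are hypothetical. Leaves of the tree are the indices of the input points, and internal nodes are pairs.

```python
import math
from itertools import combinations

def single_link_cost(ci, cj):
    """Assumed merging cost: distance between the closest pair of members."""
    return min(math.dist(p, q) for p in ci["members"] for q in cj["members"])

def agglomerate(points, cost=single_link_cost):
    # Step 1: every instance starts as a singleton cluster (a leaf of T).
    clusters = [{"members": [p], "tree": i} for i, p in enumerate(points)]
    while len(clusters) > 1:
        # Step 2: evaluate every pair in L and pick the cheapest one to merge.
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: cost(clusters[ij[0]], clusters[ij[1]]))
        # Step 3: remove Si and Sj from L (i < j, so pop j first).
        sj = clusters.pop(j)
        si = clusters.pop(i)
        # Step 4: the merged cluster becomes the parent of Si and Sj.
        clusters.append({"members": si["members"] + sj["members"],
                         "tree": (si["tree"], sj["tree"])})
    # Step 5: one set remains; its tree is the dendrogram.
    return clusters[0]["tree"]

points = [(0.0,), (1.0,), (5.0,), (6.0,), (20.0,)]
print(agglomerate(points))  # (4, ((0, 1), (2, 3)))
```

Recomputing every pairwise cost on each pass keeps the sketch close to the numbered steps, at the price of speed. It is meant for reading, not for large inputs.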

K-Clustering

K-clustering algorithm

Result: Given the input set S and a fixed integer k, a partition of S into k subsets must be returned.

K-means clustering is the most common partitioning algorithm.

K-clustering algo cont'd

1. Select k initial cluster centroids, c1, c2, c3, ..., ck.

2. Assign each instance x in S to the cluster whose centroid is the nearest to x.

3. For each cluster, re-compute its centroid based on the elements it contains.

4. Go to (2) until convergence is achieved.

Garcia
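A minimal Python sketch of the four steps above. The slides leave several details open, so the following are assumptions: the caller supplies the initial centroids, nearness is measured with Euclidean distance, each centroid is the coordinate-wise mean of its cluster, and convergence means that no instance changes cluster. The names `k_means` and `mean` are made up for this example.

```python
import math

def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def k_means(points, centroids, max_iter=100):
    centroids = list(centroids)  # Step 1: initial centroids c1..ck
    assignment = None
    for _ in range(max_iter):
        # Step 2: assign each instance to the cluster with the nearest centroid.
        new_assignment = [
            min(range(len(centroids)), key=lambda c: math.dist(x, centroids[c]))
            for x in points
        ]
        # Step 4: stop once no instance changes cluster (assumed convergence test).
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 3: recompute each centroid from the elements its cluster contains.
        for c in range(len(centroids)):
            members = [x for x, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = mean(members)
    return assignment, centroids

points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.0), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
assignment, centroids = k_means(points, centroids=[points[0], points[3]])
print(assignment)  # [0, 0, 0, 1, 1, 1]
print(centroids)
```

The result depends on the initial centroids. Picking them from the data, as in the usage lines, is one common choice among several.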

Self-Organizing Maps

Def.: A group of several connected nodes mapped into a k-dimensional space following some specific geometrical topology (grids, rings, lines, ...). The nodes are initially placed at random and iteratively adjusted according to the distribution of examples (input) along the k-dimensional space.

Garcia
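The definition above does not spell out how the nodes are adjusted. The sketch below assumes the usual Kohonen-style update on a line topology: for each example, the closest node (the best matching unit) and its neighbours along the line are pulled toward the example, with a learning rate and neighbourhood radius that shrink over time. The function name `train_som`, its parameters, and the decay schedule are all assumptions made for this illustration. It requires Python 3.8+ for `math.dist`.

```python
import math
import random

def train_som(samples, n_nodes, epochs=50, lr0=0.5, seed=0):
    rng = random.Random(seed)
    dim = len(samples[0])
    # Nodes are connected in a line; each one holds a position in input space,
    # initially random inside the bounding box of the samples.
    lo = [min(s[d] for s in samples) for d in range(dim)]
    hi = [max(s[d] for s in samples) for d in range(dim)]
    nodes = [[rng.uniform(lo[d], hi[d]) for d in range(dim)]
             for _ in range(n_nodes)]
    radius0 = n_nodes / 2
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1 - frac)                    # learning rate decays linearly
        radius = max(radius0 * (1 - frac), 0.5)  # neighbourhood shrinks, floor 0.5
        for s in rng.sample(samples, len(samples)):  # shuffled pass over the input
            # Best matching unit: the node closest to the sample in input space.
            bmu = min(range(n_nodes), key=lambda i: math.dist(nodes[i], s))
            for i in range(n_nodes):
                # Gaussian weight based on distance along the line to the BMU.
                h = math.exp(-((i - bmu) ** 2) / (2 * radius ** 2))
                for d in range(dim):
                    nodes[i][d] += lr * h * (s[d] - nodes[i][d])
    return nodes

samples = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
for node in train_som(samples, n_nodes=4):
    print([round(v, 2) for v in node])
```

After training, each example can be assigned to its nearest node, which turns the map into a clustering of the input.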

Annotated Bibliography

Wikipedia: http://en.wikipedia.org/wiki/Data_clustering#Types_of_clustering

Enrique Blanco Garcia: http://genome.imim.es/~eblanco/seminars/docs/clustering/index_types.html#hierarchy
