Machine Learning
Lecture # 11: Clustering
Cluster Analysis
What is Cluster Analysis?
• Cluster: a collection of data objects
– Similar to one another within the same cluster
– Dissimilar to the objects in other clusters
• Cluster analysis
– Finding similarities between data according to the characteristics found in the data and grouping similar data objects into clusters
What is Cluster Analysis?
• Cluster analysis is an important human activity
• Early in childhood, we learn how to distinguish between cats and dogs
• Unsupervised learning: no predefined classes
• Typical applications
– As a stand-alone tool to get insight into data distribution
– As a preprocessing step for other algorithms
Clustering
• Hard vs. Soft
– Hard: each object can belong to only a single cluster
– Soft: an object can belong to multiple clusters
Clustering
• Flat vs. Hierarchical
– Flat: clusters form a single, unnested partition of the data
– Hierarchical: clusters form a tree
• Agglomerative
• Divisive
Clustering: Rich Applications and Multidisciplinary Efforts
• Pattern Recognition
• Spatial Data Analysis
– Create thematic maps in GIS by clustering feature spaces
– Detect spatial clusters and use them in other spatial mining tasks
• Image Processing
• Economic Science (especially market research)
• WWW
– Document classification
– Cluster Weblog data to discover groups of similar access patterns
Quality: What Is Good Clustering?
• A good clustering method will produce high-quality clusters with
– high intra-class similarity (objects are similar to one another within the same cluster)
– low inter-class similarity (objects are dissimilar to the objects in other clusters)
What is Cluster Analysis?
• Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups
(Figure: two groups of points; inter-cluster distances are maximized while intra-cluster distances are minimized.)
Similarity and Dissimilarity Between Objects
• Distances are normally used to measure the similarity or dissimilarity between two data objects
• Some popular ones include the Minkowski distance:

$$d(i,j) = \left( |x_{i1}-x_{j1}|^q + |x_{i2}-x_{j2}|^q + \cdots + |x_{ip}-x_{jp}|^q \right)^{1/q}$$

where i = (x_{i1}, x_{i2}, …, x_{ip}) and j = (x_{j1}, x_{j2}, …, x_{jp}) are two p-dimensional data objects, and q is a positive integer
• If q = 1, d is the Manhattan distance:

$$d(i,j) = |x_{i1}-x_{j1}| + |x_{i2}-x_{j2}| + \cdots + |x_{ip}-x_{jp}|$$
Similarity and Dissimilarity Between Objects (Cont.)
• If q = 2, d is the Euclidean distance:

$$d(i,j) = \sqrt{ |x_{i1}-x_{j1}|^2 + |x_{i2}-x_{j2}|^2 + \cdots + |x_{ip}-x_{jp}|^2 }$$

• Also, one can use weighted distance, parametric Pearson correlation, or other dissimilarity measures
Major Clustering Approaches
• Partitioning approach:
– Construct various partitions and then evaluate them by some criterion, e.g., minimizing the sum of squared errors
– Typical methods: k-means, k-medoids, CLARANS
• Hierarchical approach:
– Create a hierarchical decomposition of the set of data (or objects) using some criterion
– Typical methods: DIANA, AGNES, BIRCH, ROCK, CHAMELEON
• Density-based approach:
– Based on connectivity and density functions
– Typical methods: DBSCAN, OPTICS, DenClue
Clustering Approaches
1. Partitioning Methods
2. Hierarchical Methods
3. Density-Based Methods
Partitioning Algorithms: Basic Concept
• Given k, find a partition of k clusters that optimizes the chosen partitioning criterion
• k-means and k-medoids algorithms
• k-means (MacQueen'67): each cluster is represented by the center of the cluster
• k-medoids or PAM (Partitioning Around Medoids) (Kaufman & Rousseeuw'87): each cluster is represented by one of the objects in the cluster
The K-Means Clustering Method
• Given k, the k-means algorithm is implemented in four steps (a minimal sketch follows below):
1. Partition the objects into k nonempty subsets
2. Compute seed points as the centroids of the clusters of the current partition (the centroid is the center, i.e., mean point, of the cluster)
3. Assign each object to the cluster with the nearest seed point
4. Go back to Step 2; stop when no new assignments are made
The K-Means Clustering Method
(Figure, K = 2: arbitrarily choose K objects as initial cluster centers; assign each object to the most similar center; update the cluster means; reassign and repeat until the assignments stabilize.)
Example
• Run k-means clustering with 3 clusters (initial centroids: 3, 16, 25) for at least 2 iterations on the data: 2, 3, 4, 7, 9, 10, 11, 12, 16, 18, 19, 23, 24, 25, 30
Example
• Centroids:
3 → cluster {2, 3, 4, 7, 9}, new centroid: 5
16 → cluster {10, 11, 12, 16, 18, 19}, new centroid: 14.33
25 → cluster {23, 24, 25, 30}, new centroid: 25.5
Example
• Centroids:
5 → cluster {2, 3, 4, 7, 9}, new centroid: 5
14.33 → cluster {10, 11, 12, 16, 18, 19}, new centroid: 14.33
25.5 → cluster {23, 24, 25, 30}, new centroid: 25.5
(The assignments no longer change, so the algorithm has converged.)
In class Practice
• Run K-means clustering with 3 clusters (initial centroids: 3, 12, 19) for at least 2 iterations
Hierarchical Clustering
• Two main types of hierarchical clustering:
– Agglomerative:
• Start with the points as individual clusters
• At each step, merge the closest pair of clusters until only one cluster (or k clusters) is left
(Matlab, Statistics Toolbox: clusterdata performs all these steps: pdist, linkage, cluster)
– Divisive:
• Start with one, all-inclusive cluster
• At each step, split a cluster until each cluster contains a single point (or there are k clusters)
• Traditional hierarchical algorithms use a similarity or distance matrix
– Merge or split one cluster at a time
– Image segmentation mostly uses simultaneous merge/split
Hierarchical clustering
• Agglomerative (Bottom-up):
– Compute all pairwise pattern-pattern similarity coefficients
– Place each of the n patterns into a class of its own
– Merge the two most similar clusters into one
• Replace the two clusters with the new cluster
• Recompute inter-cluster similarity scores w.r.t. the new cluster
– Repeat the above step until there are k clusters left (k can be 1)
Hierarchical clustering
• Agglomerative (Bottom-up)
(Figures: the merge sequence over successive iterations 1, 2, 3, 4, 5, … until k clusters remain.)
Hierarchical clustering
• Divisive (Top-down)
– Start at the top with all patterns in one cluster
– The cluster is split using a flat clustering algorithm
– This procedure is applied recursively until each pattern is in its own singleton cluster
(Figure: divisive clustering, splitting top-down from one all-inclusive cluster toward singletons.)
Hierarchical Clustering: The Algorithm
• Hierarchical clustering takes as input a set of points
• It creates a tree in which the points are leaves and the internal nodes reveal the similarity structure of the points.
– The tree is often called a "dendrogram."
• The method is summarized below (a SciPy sketch follows):

Place all points into their own clusters
While there is more than one cluster, do
    Merge the closest pair of clusters

The behavior of the algorithm depends on how "closest pair of clusters" is defined
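This loop is available off the shelf in SciPy; a sketch (`method='single'` corresponds to the single-link definition of "closest pair" discussed later):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.default_rng(0).random((6, 2))        # six example 2-D points
Z = linkage(X, method='single')                    # repeatedly merge the closest pair
print(Z)                                           # each row: [cluster_i, cluster_j, distance, size]
print(fcluster(Z, t=2, criterion='maxclust'))      # cut the dendrogram into k = 2 clusters
```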
Hierarchical Clustering: Example
(Figure: six points A–F and the dendrogram produced by clustering them.)
This example illustrates single-link clustering in Euclidean space on 6 points.
Hierarchical Clustering
• Produces a set of nested clusters organized as a hierarchical tree
• Can be visualized as a dendrogram
– A tree like diagram that records the sequences of merges or splits
(Figure: six points and the corresponding dendrogram; the vertical axis records the merge heights, roughly 0.05–0.2, in the order the merges occurred.)
Strengths of Hierarchical Clustering
• Do not have to assume any particular number of clusters
– Any desired number of clusters can be obtained by 'cutting' the dendrogram at the proper level
Hierarchical Clustering: Merging Clusters
Single Link: the distance between two clusters is the distance between their closest points. Also called "neighbor joining."
Average Link: the distance between two clusters is the average of the pairwise distances between their points; a closely related variant (centroid linkage) uses the distance between the cluster centroids.
Complete Link: the distance between two clusters is the distance between their farthest pair of points.
How to Define Inter-Cluster Similarity

(Figure: two groups of points p1–p5 and their proximity matrix.)

• MIN
• MAX
• Group Average
• Distance Between Centroids
• Other methods driven by an objective function
– Ward's Method uses squared error
An example
Let us consider a gene measured in a set of 5 experiments: A, B, C, D, and E. The values measured in the 5 experiments are:
A=100 B=200 C=500 D=900 E=1100
We will construct the hierarchical clustering of these values using Euclidean distance, centroid linkage, and an agglomerative approach.
An example
SOLUTION:
• The closest two values are 100 and 200 => the centroid of these two values is 150.
• Now we are clustering the values: 150, 500, 900, 1100
• The closest two values are 900 and 1100 => the centroid of these two values is 1000.
• The remaining values to be joined are: 150, 500, 1000.
• The closest two values are 150 and 500 => the centroid of these two values is 325.
• Finally, the two resulting subtrees are joined at the root of the tree. (A SciPy check follows below.)
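The same merge sequence can be checked in SciPy; note that the centroid-of-centroids arithmetic above (150 and 500 averaging to 325) corresponds to SciPy's 'median' (WPGMC) linkage, which replaces a merged pair by the midpoint of the two cluster centroids:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# expression values of the gene in experiments A..E
X = np.array([[100.0], [200.0], [500.0], [900.0], [1100.0]])
Z = linkage(X, method='median')
print(Z)   # merge order: (A,B), then (D,E), then ((A,B),C), then the root
```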
An example: two hierarchical clusterings of the expression values of a single gene measured in 5 experiments.

(Figure: two dendrograms over the leaves A=100, B=200, C=500, D=900, E=1100, drawn with different leaf orderings.)

The dendrograms are identical: both diagrams show that
• A is most similar to B
• C is most similar to the group (A, B)
• D is most similar to E

In the left dendrogram, A and E are plotted far from each other; in the right dendrogram, A and E are immediate neighbors.

THE PROXIMITY IN A HIERARCHICAL CLUSTERING DOES NOT NECESSARILY CORRESPOND TO SIMILARITY
What Is the Problem of the K-Means Method?
• The k-means algorithm is sensitive to outliers!
– An object with an extremely large value may substantially distort the distribution of the data.
• K-Medoids: instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be used, which is the most centrally located object in a cluster. (A tiny illustration follows below.)
(Figure: two 2-D point sets on a 0–10 grid, contrasting a mean-based center with a medoid-based center.)
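A tiny illustration of that sensitivity on made-up 1-D data — the mean is dragged toward the outlier while the medoid stays with the bulk of the data:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 100.0])     # one extreme outlier
mean = x.mean()                           # 26.5, far from most of the data
# the medoid is the data point with the smallest total distance to all others
medoid = x[np.argmin(np.abs(x[:, None] - x[None, :]).sum(axis=1))]
print(mean, medoid)                       # 26.5 vs. 2.0
```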
Limitations of K-means: Differing Sizes
Original Points K-means (3 Clusters)
Limitations of K-means: Differing Density
Original Points K-means (3 Clusters)
Limitations of K-means: Non-globular Shapes
Original Points K-means (2 Clusters)
The K-Medoids Clustering Method
• Find representative objects, called medoids, in clusters
• Medoids are located in the center of the clusters.
– Given data points, how do we find the medoid?
Categorical Values
• Handling categorical data: k-modes (Huang'98)
– Replaces the means of clusters with modes
• Mode of an attribute: its most frequent value
• Mode of instances: for an attribute A, mode(A) = the most frequent value of A
• k-modes is the k-means analogue for categorical data (a sketch follows below)
– Uses a frequency-based method to update the modes of clusters
– For a mixture of categorical and numerical data: the k-prototype method
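A small sketch of the mode-based update (the categorical records are hypothetical):

```python
from collections import Counter

cluster = [("red", "small"), ("red", "large"), ("blue", "small")]

# the k-modes "center": the most frequent value of each attribute
mode = tuple(Counter(col).most_common(1)[0][0] for col in zip(*cluster))
print(mode)                                # ('red', 'small')

# dissimilarity = number of mismatching attributes (simple matching)
def mismatch(a, b):
    return sum(x != y for x, y in zip(a, b))

print(mismatch(("blue", "large"), mode))   # 2
```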
K-medoids example

Data points:
X1 (2, 6)
X2 (3, 4)
X3 (3, 8)
X4 (4, 7)
X5 (6, 2)
X6 (6, 4)
X7 (7, 3)
X8 (7, 4)
X9 (8, 5)
X10 (7, 6)
• Initialize k medoids
• Let us assume c1 = (3, 4) and c2 = (7, 4)
• Calculate distances (Manhattan distance here) so as to associate each data object with its nearest medoid.
Costs (distances) to c1 = (3, 4):
X1 (2, 6): 3
X3 (3, 8): 4
X4 (4, 7): 4
X5 (6, 2): 5
X6 (6, 4): 3
X7 (7, 3): 5
X9 (8, 5): 6
X10 (7, 6): 6

Costs (distances) to c2 = (7, 4):
X1 (2, 6): 7
X3 (3, 8): 8
X4 (4, 7): 6
X5 (6, 2): 3
X6 (6, 4): 1
X7 (7, 3): 1
X9 (8, 5): 2
X10 (7, 6): 2

Cluster1 = {(3,4), (2,6), (3,8), (4,7)}
Cluster2 = {(7,4), (6,2), (6,4), (7,3), (8,5), (7,6)}
• Select one of the non-medoids O′. Let us assume O′ = (7, 3). Now the medoids are c1 (3, 4) and O′ (7, 3).

Costs (distances) to c1 = (3, 4):
X1 (2, 6): 3
X3 (3, 8): 4
X4 (4, 7): 4
X5 (6, 2): 5
X6 (6, 4): 3
X8 (7, 4): 4
X9 (8, 5): 6
X10 (7, 6): 6

Costs (distances) to O′ = (7, 3):
X1 (2, 6): 8
X3 (3, 8): 9
X4 (4, 7): 7
X5 (6, 2): 2
X6 (6, 4): 2
X8 (7, 4): 1
X9 (8, 5): 3
X10 (7, 6): 3

• Do not change the medoid, since S > 0: the total cost rises from 20 to 22, so S = 22 − 20 = 2 (see the sketch below).
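A short sketch that reproduces the swap test above (Manhattan distance, the two medoid configurations from the tables):

```python
import numpy as np

X = np.array([[2, 6], [3, 4], [3, 8], [4, 7], [6, 2],
              [6, 4], [7, 3], [7, 4], [8, 5], [7, 6]])

def total_cost(medoids):
    # assign every point to its nearest medoid under Manhattan distance
    d = np.abs(X[:, None, :] - np.array(medoids)[None, :, :]).sum(axis=2)
    return d.min(axis=1).sum()

before = total_cost([(3, 4), (7, 4)])     # 20
after = total_cost([(3, 4), (7, 3)])      # 22
print(after - before)                     # S = 2 > 0, so the swap is rejected
```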
Fuzzy C-means Clustering
• Fuzzy c-means (FCM) is a method of clustering which allows one piece of data to belong to two or more clusters.
• This method (developed by Dunn in 1973 and improved by Bezdek in 1981) is frequently used in pattern recognition.
Fuzzy C-means Clustering
Tutorial: http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/cmeans.html
Fuzzy C-means Clustering

• For example: we have initial centroids 3 & 11 (with m = 2). For a point x_i, the membership in the cluster with centroid c_j is

$$u_{ij} = \frac{1}{\sum_{k=1}^{C} \left( \frac{|x_i - c_j|}{|x_i - c_k|} \right)^{\frac{2}{m-1}}}$$

• For node 2 (1st element):

$$u_{11} = \frac{1}{\left(\frac{|2-3|}{|2-3|}\right)^2 + \left(\frac{|2-3|}{|2-11|}\right)^2} = \frac{1}{1 + \frac{1}{81}} = \frac{81}{82} \approx 98.78\%$$

the membership of the first node in the first cluster

$$u_{12} = \frac{1}{\left(\frac{|2-11|}{|2-3|}\right)^2 + \left(\frac{|2-11|}{|2-11|}\right)^2} = \frac{1}{81 + 1} = \frac{1}{82} \approx 1.22\%$$

the membership of the first node in the second cluster
Fuzzy C-means Clustering
• For example: we have initial centroids 3 & 11 (with m = 2)
• For node 3 (2nd element), which coincides with centroid 3:
u21 = 100% — the membership of the second node in the first cluster
u22 = 0% — the membership of the second node in the second cluster
Fuzzy C-means Clustering

• For example: we have initial centroids 3 & 11 (with m = 2)
• For node 4 (3rd element):

$$u_{31} = \frac{1}{\left(\frac{|4-3|}{|4-3|}\right)^2 + \left(\frac{|4-3|}{|4-11|}\right)^2} = \frac{1}{1 + \frac{1}{49}} = \frac{49}{50} = 98\%$$

the membership of the third node in the first cluster

$$u_{32} = \frac{1}{\left(\frac{|4-11|}{|4-3|}\right)^2 + \left(\frac{|4-11|}{|4-11|}\right)^2} = \frac{1}{49 + 1} = 2\%$$

the membership of the third node in the second cluster
Fuzzy C-means Clustering

• For example: we have initial centroids 3 & 11 (with m = 2)
• For node 7 (4th element):

$$u_{41} = \frac{1}{\left(\frac{|7-3|}{|7-3|}\right)^2 + \left(\frac{|7-3|}{|7-11|}\right)^2} = \frac{1}{1 + 1} = 50\%$$

the membership of the fourth node in the first cluster

$$u_{42} = \frac{1}{\left(\frac{|7-11|}{|7-3|}\right)^2 + \left(\frac{|7-11|}{|7-11|}\right)^2} = \frac{1}{1 + 1} = 50\%$$

the membership of the fourth node in the second cluster
Fuzzy C-means Clustering

• The new centroid is the membership-weighted mean of the points (weights u^m):

$$c_1 = \frac{(98.78\%)^2 \cdot 2 + (100\%)^2 \cdot 3 + (98\%)^2 \cdot 4 + (50\%)^2 \cdot 7 + \cdots}{(98.78\%)^2 + (100\%)^2 + (98\%)^2 + (50\%)^2 + \cdots}$$
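One full FCM iteration on these numbers, as a NumPy sketch (the fcm_step helper is ours, not a library function):

```python
import numpy as np

def fcm_step(x, c, m=2):
    """One fuzzy c-means iteration on 1-D data: update memberships, then centroids."""
    d = np.abs(x[:, None] - c[None, :])          # point-to-centroid distances
    d = np.fmax(d, 1e-12)                        # guard: a point sitting exactly on a centroid
    # u[i, j] = 1 / sum_k (d_ij / d_ik)^(2/(m-1))
    u = 1.0 / ((d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1))).sum(axis=2)
    c_new = (u ** m).T @ x / (u ** m).sum(axis=0)   # membership-weighted means
    return u, c_new

x = np.array([2.0, 3.0, 4.0, 7.0])               # the nodes worked above
u, c = fcm_step(x, np.array([3.0, 11.0]))
print(np.round(u * 100, 2))   # rows ≈ [98.78, 1.22], [100, 0], [98, 2], [50, 50]
print(c)                      # the updated centroids
```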
FCM Soft Clustering
Gaussian Mixture Model
(Figures: two Gaussian component densities p(x) over x ∈ [−5, 10], and the mixture-model density they combine into.)
The EM Algorithm
• Dempster, Laird, and Rubin
– A general framework for likelihood-based parameter estimation with missing data:
• start with initial guesses of the parameters
• E step: estimate memberships given the parameters
• M step: estimate the parameters given the memberships
• repeat until convergence
– converges to a (local) maximum of the likelihood
– the E step and M step are often computationally simple
– generalizes to maximum a posteriori estimation (with priors)
(A scikit-learn sketch follows below.)
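In practice the whole E/M loop is available off the shelf; a sketch with scikit-learn on synthetic two-component data (not the anemia dataset shown next):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (200, 2)),     # component 1
               rng.normal(4, 1, (200, 2))])    # component 2

gm = GaussianMixture(n_components=2, random_state=0).fit(X)  # EM under the hood
print(gm.means_)                  # estimated component means
print(gm.predict_proba(X[:3]))    # soft memberships from the E step
```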
(Figures: anemia patients and controls plotted by red blood cell volume vs. red blood cell hemoglobin concentration, with the Gaussian-mixture fit shown at EM iterations 1, 3, 5, 10, 15, and 25.)
Gaussian Mixture
• The Gaussian mixture architecture estimates probability density functions (PDFs) for each class, and then performs classification based on Bayes' rule:

$$P(C_i \mid X) = \frac{P(X \mid C_i)\, P(C_i)}{P(X)}$$

where P(X | Ci) is the PDF of class i evaluated at X, P(Ci) is the prior probability of class i, and P(X) is the overall PDF evaluated at X.
Gaussian Mixture
• Unlike the unimodal Gaussian architecture, which assumes P(X | Cj) to be in the form of a single Gaussian, the Gaussian mixture model estimates P(X | Cj) as a weighted average of multiple Gaussians:

$$P(X \mid C_j) = \sum_{k=1}^{N_c} w_k G_k$$

where w_k is the weight of the k-th Gaussian G_k and the weights sum to one. One such PDF model is produced for each class.
Gaussian Mixture
• Each Gaussian component is defined as:

$$G_k = \frac{1}{(2\pi)^{n/2}\, |V_k|^{1/2}}\; e^{-\frac{1}{2}(X - M_k)^T V_k^{-1} (X - M_k)}$$

where M_k is the mean of the Gaussian and V_k is the covariance matrix of the Gaussian.
Gaussian Mixture
• Free parameters of the Gaussian mixture model consist of the means and covariance matrices of the Gaussian components and the weights indicating the contribution of each Gaussian to the approximation of P(X | Cj).
Composition of Gaussian Mixture
(Figure: Class 1 modeled as a composition of five weighted Gaussians G1,w1 … G5,w5.)

$$P(C_j \mid X) = \frac{P(X \mid C_j)\, P(C_j)}{P(X)} \qquad P(X \mid C_j) = \sum_{k=1}^{N_c} w_k G_k$$

$$G_i = p(X \mid G_i) = \frac{1}{(2\pi)^{d/2}\, |V_i|^{1/2}}\; e^{-\frac{1}{2}(X - \mu_i)^T V_i^{-1} (X - \mu_i)}$$

Variables: μi, Vi, wk.
We use the EM (estimate-maximize) algorithm to approximate these variables.
Gaussian Mixture
• These parameters are tuned using a complex iterative procedure called the estimate-maximize (EM) algorithm, which aims at maximizing the likelihood of the training set under the estimated PDF.
Gaussian Mixture Training Flow Chart (1)
Initialize the initial Gaussian means μi, i = 1, …, G using the k-means clustering algorithm. Initialize the covariance matrices Vi to the distance to the nearest cluster. Initialize the weights πi = 1/G so that all Gaussians are equally likely.

Present each pattern X of the training set and model each of the classes K as a weighted sum of Gaussians:

$$p_s(X) = \sum_{i=1}^{G} \pi_i\, p(X \mid G_i)$$

where G is the number of Gaussians, the πi's are the weights, and

$$p(X \mid G_i) = \frac{1}{(2\pi)^{d/2}\, |V_i|^{1/2}}\; e^{-\frac{1}{2}(X - \mu_i)^T V_i^{-1} (X - \mu_i)}$$

where Vi is the covariance matrix.
Gaussian Mixture Training Flow Chart (2)
Compute (E-step for EM):

$$\tau_{ip} = P(G_i \mid X_p, C_k) = \frac{\pi_i\, p(X_p \mid G_i, C_k)}{p(X_p \mid C_k)} = \frac{\pi_i\, p(X_p \mid G_i, C_k)}{\sum_{j=1}^{G} \pi_j\, p(X_p \mid G_j, C_k)}$$

Iteratively update the weights, means and covariances (M-step for EM):

$$\pi_i(t+1) = \frac{1}{N_c} \sum_{p=1}^{N_c} \tau_{ip}(t)$$

$$\mu_i(t+1) = \frac{1}{N_c\, \pi_i(t+1)} \sum_{p=1}^{N_c} \tau_{ip}(t)\, X_p$$

$$V_i(t+1) = \frac{1}{N_c\, \pi_i(t+1)} \sum_{p=1}^{N_c} \tau_{ip}(t)\, \big(X_p - \mu_i(t)\big)\big(X_p - \mu_i(t)\big)^T$$

Note: here is the PE which we did in the class. (A NumPy sketch of one E/M step follows below.)
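A NumPy sketch of one E step and one M step matching the updates above (single class, full covariances; it uses the freshly updated means in the covariance update, a common variant):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_step(X, pi, mu, V):
    """One EM iteration for a G-component Gaussian mixture on (N, d) data X."""
    N, G = len(X), len(pi)
    # E step: tau[p, i] = pi_i p(X_p | G_i) / sum_j pi_j p(X_p | G_j)
    tau = np.column_stack([pi[i] * multivariate_normal.pdf(X, mu[i], V[i])
                           for i in range(G)])
    tau /= tau.sum(axis=1, keepdims=True)
    # M step: re-estimate weights, means and covariances
    Nk = tau.sum(axis=0)                       # = N_c * pi_i(t+1)
    pi_new = Nk / N
    mu_new = (tau.T @ X) / Nk[:, None]
    V_new = np.array([(tau[:, i, None] * (X - mu_new[i])).T @ (X - mu_new[i]) / Nk[i]
                      for i in range(G)])
    return pi_new, mu_new, V_new
```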
Gaussian Mixture Training Flow Chart (3)
Recompute τip using the new weights, means and covariances. Stop training if

$$\big|\tau_{ip}(t+1) - \tau_{ip}(t)\big| < \text{threshold}$$

or the number of epochs reaches the specified value. Otherwise, continue the iterative updates.
Gaussian Mixture Test Flow Chart
Present each input pattern X and compute the confidence for each class j:

$$\text{confidence}_j = P(X \mid C_j)\, P(C_j)$$

where

$$P(C_j) = \frac{N_{c_j}}{N}$$

is the prior probability of class Cj, estimated by counting the number of training patterns. Classify pattern X as the class with the highest confidence.

Gaussian Mixture End
Acknowledgements

Material in these slides has been taken from the following resources:
• Introduction to Machine Learning, Alpaydın
• "Pattern Classification" by Duda et al., John Wiley & Sons
• Read GMM from "Automated Detection of Exudates in Colored Retinal Images for Diagnosis of Diabetic Retinopathy", Applied Optics, Vol. 51, No. 20, 4858–4866, 2012.