Clustering I - Sharif
Machine Learning
Clustering I
Hamid R. Rabiee, Jafar Muhammadi, Nima Pourdamghani
Spring 2015
http://ce.sharif.edu/courses/93-94/2/ce717-1
Agenda
Unsupervised Learning
Quality Measurement
Similarity Measures
Major Clustering Approaches
Distance Measuring
Partitioning Methods
Hierarchical Methods
Density Based Methods
Spectral Clustering
Other Methods
Constraint Based Clustering
Clustering as Optimization
Unsupervised Learning
Clustering or unsupervised classification is aimed at discovering natural groupings in a set
of data.
Note: All samples in the training set are unlabeled.
Applications for clustering:
Spatial data analysis: Create thematic maps in GIS by clustering feature space
Image processing: Segmentation
Economic science: Discover distinct groups in customer bases
Internet: Document classification
Classifier design: gain insight into the structure of the data prior to designing a classifier
Quality Measurement
High quality clusters must have
high intra-class similarity
low inter-class similarity
Some other measures
Ability to discover hidden patterns
Judged by the user
Purity
Suppose we know the labels of the data; assign to each cluster its most frequent class
Purity is the number of correctly assigned points divided by the total number of data points (a small sketch follows)
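To make the definition concrete, here is a minimal sketch of the purity computation (a hypothetical helper, not course code; it assumes integer-coded cluster assignments and true labels):

```python
import numpy as np

def purity(clusters, labels):
    """Assign each cluster its most frequent true class, then count
    the fraction of correctly assigned points."""
    clusters = np.asarray(clusters)
    labels = np.asarray(labels)
    correct = 0
    for c in np.unique(clusters):
        members = labels[clusters == c]        # true labels inside cluster c
        correct += np.bincount(members).max()  # size of the majority class
    return correct / len(labels)

# Example: 6 points in 2 clusters; 5 of 6 points match their cluster's majority class
print(purity([0, 0, 0, 1, 1, 1], [0, 0, 1, 1, 1, 1]))  # 5/6 ~ 0.833
```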
Similarity Measures
Distances are normally used to measure the similarity or dissimilarity between two
data objects
Some popular distances are Minkowski and Mahalanobis.
Distance between binary strings (Hamming distance):
$$d(S_1, S_2) = |\{\, i : s_{1,i} \neq s_{2,i} \,\}|$$
Distance between vector objects (cosine similarity):
$$d(X, Y) = \frac{X^T Y}{\|X\|\,\|Y\|}$$
(A short sketch of these distances follows.)
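For concreteness, a short sketch of these distances via SciPy (standard scipy.spatial.distance calls; the inverse covariance for Mahalanobis is estimated here from illustrative random data):

```python
import numpy as np
from scipy.spatial import distance

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])

print(distance.minkowski(x, y, p=2))        # Minkowski with p=2 (Euclidean)
VI = np.linalg.inv(np.cov(np.random.randn(100, 3), rowvar=False))
print(distance.mahalanobis(x, y, VI))       # Mahalanobis; VI = inverse covariance

s1, s2 = [1, 0, 1, 1], [1, 1, 1, 0]
print(distance.hamming(s1, s2) * len(s1))   # count of differing positions
print(1 - distance.cosine(x, y))            # cosine similarity X.Y / (|X||Y|)
```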
Major Clustering Approaches
Partitioning approach
Construct various partitions and then evaluate them by some criterion (ex. k-means, c-means, k-medoids)
Hierarchical approach
Create a hierarchical decomposition of the set of data using some criterion (ex. AGNES)
Density-based approach
Based on connectivity and density functions (ex. DBSCAN, OPTICS)
Graph-based approach (Spectral Clustering)
approximately optimizing the normalized cut criterion
Grid-based approach
based on a multiple-level granularity structure (ex. STING, WaveCluster, CLIQUE)
Model-based
A model is hypothesized for each of the clusters, and the idea is to find the best fit of the data to the models (ex. EM, SOM)
Distance Measuring
Single link
smallest distance between an element in one cluster
and an element in the other
Complete link
largest distance between an element in one cluster
and an element in the other
Average
avg distance between an element in one cluster and an element in the other
Centroid
distance between the centroids of two clusters
Used in k-means
Medoid
distance between the medoids of two clusters
Medoid: A representative object whose average dissimilarity to all the objects in the cluster is minimal
Partitioning Methods
Construct a partition of n data points into a set of k clusters so as to minimize the sum of squared distances
$$\min \sum_{m=1}^{k} \sum_{x_j \in \text{Cluster}_m} \big(x_j - C_m\big)^2$$
where the C_m are the cluster representatives.
Given a k, find a partition of k clusters that optimizes the chosen partitioning criterion
Global optimal: exhaustively enumerate all partitions
Heuristic methods: k-means, c-means and k-medoids algorithms
k-means: Each cluster is represented by the center of the cluster
c-means: The fuzzy version of k-means
k-medoids: Each cluster is represented by one of the samples in the cluster
Partitioning Methods: k-means
k-means
Suppose we know there are K categories and each category is represented by its
sample mean
Given a set of unlabeled training samples, how to estimate the means?
Algorithm k-means(k)
1. Partition the samples into k non-empty subsets (random initialization)
2. Compute the mean points of the clusters of the current partition
3. Assign each sample to the cluster with the nearest mean point
4. Go back to Step 2; stop when the assignments no longer change
(A minimal NumPy sketch follows.)
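As an illustration, a minimal NumPy sketch of this loop (a hypothetical implementation, not the course's reference code; it initializes the means from k random samples, a common variant of step 1):

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1 (variant): initialize the means with k distinct random samples
    means = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    assign = np.full(len(X), -1)
    for _ in range(max_iter):
        # Step 3: assign each sample to the cluster with the nearest mean
        d2 = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        new_assign = d2.argmin(axis=1)
        if np.array_equal(new_assign, assign):  # Step 4: no new assignments -> stop
            break
        assign = new_assign
        # Step 2: recompute the mean point of each non-empty cluster
        for j in range(k):
            if np.any(assign == j):
                means[j] = X[assign == j].mean(axis=0)
    return means, assign
```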
Partitioning Methods: k-means
Some notes on k-means
Need to specify k, the number of clusters, in advance
Unable to handle noisy data and outliers (Why?)
Not suitable to discover clusters with non-convex shapes (Why?)
Algorithm is sensitive to
number of cluster centers,
choice of initial cluster centers
sequence in which data are processed (Why?)
Convergence to the global optimum is not guaranteed, but results are acceptable when the clusters are well separated
Partitioning Methods: c-means
The membership function μ_il expresses to what degree x_l belongs to class C_i.
Crisp clustering: x_l can belong to one class only:
$$\mu_{il} = \begin{cases} 1 & \text{if } x_l \in C_i \\ 0 & \text{if } x_l \notin C_i \end{cases}$$
Fuzzy clustering: x_l belongs to all classes simultaneously with varying degrees of membership:
$$\mu_{il} = \frac{\big(1/d(z_i^{(m)}, x_l)\big)^{1/(q-1)}}{\sum_{j=1}^{k} \big(1/d(z_j^{(m)}, x_l)\big)^{1/(q-1)}}$$
where the z^{(m)} are the cluster means and q is a fuzziness index with 1 < q < 2.
Fuzzy clustering becomes crisp clustering when q → 1.
Observe that
$$\sum_{i=1}^{k} \mu_{il} = 1, \qquad l = 1, 2, \ldots, N.$$
c-means minimizes
$$J_e^f = \sum_{i=1}^{k} J_i^f, \qquad J_i^f = \sum_{l=1}^{N} \mu_{il}^{\,q}\, \big(z_i^{(m)} - x_l\big)^2$$
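A sketch of these updates (assuming squared Euclidean distance for d; the eps guard against division by zero is an added safeguard, not part of the slide):

```python
import numpy as np

def fcm_memberships(X, Z, q=1.5, eps=1e-12):
    """mu[i, l]: degree to which sample x_l belongs to cluster i,
    given the current cluster means Z (k x d) and fuzziness index q."""
    d = ((Z[:, None, :] - X[None, :, :]) ** 2).sum(axis=2) + eps  # d(z_i, x_l)
    u = d ** (-1.0 / (q - 1.0))               # (1/d)^(1/(q-1))
    return u / u.sum(axis=0, keepdims=True)   # normalize: each column sums to 1

def fcm_means(X, mu, q=1.5):
    """Minimizing J_f with respect to the means gives a mu^q-weighted average."""
    w = mu ** q
    return (w @ X) / w.sum(axis=1, keepdims=True)
```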
Partitioning Methods: k-medoids
k-medoids
Instead of taking the mean value of the samples in a cluster as a reference point, medoids can be used
Note that choosing the new medoids is slightly different from choosing the new means in the k-means algorithm
Algorithm k-medoids(k)
1. Select k representative samples arbitrarily
2. Associate each data point with the closest medoid
3. For each medoid m and data point o: swap m and o and compute the total cost of the configuration
4. Select the configuration with the lowest cost
5. Repeat steps 2-4 until there is no change
(A compact sketch follows.)
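A compact sketch of this swap procedure (a hypothetical PAM-style implementation; the total cost is taken here as the sum of distances to the nearest medoid):

```python
import numpy as np

def k_medoids(X, k, max_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))  # pairwise distances
    medoids = rng.choice(len(X), size=k, replace=False)
    for _ in range(max_iter):
        cost = D[:, medoids].min(axis=1).sum()   # total cost of current configuration
        best = (cost, medoids)
        # Try swapping each medoid with each non-medoid data point
        for i in range(k):
            for o in range(len(X)):
                if o in medoids:
                    continue
                trial = medoids.copy()
                trial[i] = o
                c = D[:, trial].min(axis=1).sum()
                if c < best[0]:
                    best = (c, trial)
        if best[0] >= cost:      # no improving swap: converged
            break
        medoids = best[1]
    labels = D[:, medoids].argmin(axis=1)  # associate points with the closest medoid
    return medoids, labels
```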
Partitioning Methods: k-medoids
Some notes on k-medoids
k-medoids is more robust than k-means in the presence of noise and outliers (Why?)
Works effectively for small data sets, but does not scale well to large data sets
For large data sets we can use sampling-based methods (How?)
Hierarchical Methods
Clusters have sub-clusters and sub-clusters can have sub-sub-clusters, …
Use distance matrix as clustering criteria.
This method does not require the number of clusters k as an input, but needs a
termination condition
[Figure: five points a, b, c, d, e. Agglomerative clustering (AGNES) proceeds Step 0 → Step 3, merging {a, b} and {d, e}, then {c, d, e}, then {a, b, c, d, e}; divisive clustering (DIANA) runs the same steps in reverse order.]
Hierarchical Methods
Agglomerative Hierarchical Clustering
AGNES (Agglomerative Nesting)
Uses the Single-Link method
Merge nodes (clusters) that have the maximum similarity
Divisive Hierarchical Clustering
DIANA (Divisive Analysis)
Inverse order of AGNES
Eventually each node forms a cluster on its own
Hierarchical Methods
Dendrogram
Shows how the clusters are merged
Decomposes the samples into several levels of nested partitioning (a tree of clusters), called a dendrogram
A clustering of the samples is obtained by cutting the dendrogram at the desired level; each connected component then forms a cluster
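As a sketch, SciPy's hierarchical-clustering routines implement exactly this workflow: build the merge tree with a chosen linkage criterion, then cut the dendrogram at the desired level (the data below are illustrative):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

X = np.random.randn(20, 2)               # illustrative data

Z = linkage(X, method='single')          # AGNES-style merging with single link
labels = fcluster(Z, t=3, criterion='maxclust')  # cut the tree into 3 clusters
# dendrogram(Z) plots the merge tree (requires matplotlib)
```

The methods 'complete', 'average', and 'centroid' select the other inter-cluster distance criteria listed under Distance Measuring.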
Density Based Methods
Clustering based on density (local cluster criterion), such as density-connected
points
Major features:
Discover clusters of arbitrary shapes
Handle noise
Need density parameters as termination condition
Density Based Methods
Main Concepts:
parameters:
Eps: Maximum radius of the neighborhood
MinPts: Minimum number of points in an Eps-neighbourhood
of that point
Sample q is directly density-reachable from sample p, if
d(p,q)<=Eps and p has MinPts points in its neighborhood.
Sample q is density-reachable from a sample p if there is a chain
of points p1, …, pn, p1 = p, pn = q such that pi+1 is directly density-
reachable from pi
Sample p is density-connected to sample q if there is a sample o
such that both, p and q are density-reachable from o.
Density Based Methods: DBSCAN
DBSCAN (Density Based Spatial Clustering of Applications with Noise)
Relies on a density-based notion of cluster: A cluster is defined as a maximal set of
density-connected points
Discovers clusters of arbitrary shape in spatial data with noise
Algorithm DBSCAN(Eps, MinPts)
Arbitrarily select a sample p.
Retrieve all samples density-reachable from p w.r.t. Eps and MinPts.
If p is a core sample (at least MinPts samples lie in its Eps-neighborhood), a cluster is formed.
If p is a border sample (no samples are density-reachable from p), DBSCAN visits the next sample of the database.
Continue the process until all of the samples have been processed.
(A sketch with scikit-learn follows.)
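A sketch using scikit-learn's DBSCAN, whose eps and min_samples parameters play the roles of Eps and MinPts (illustrative data; the label -1 marks noise samples):

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.random.randn(200, 2)              # illustrative data

labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)      # eps ~ Eps, min_samples ~ MinPts
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)  # clusters found (-1 = noise)
```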
Graph-based Clustering
Represent data points as the vertices V of a graph G.
All pairs of vertices are connected by an edge E.
Edges have weights W.
Large weights mean that the adjacent vertices are very similar; small weights imply
dissimilarity.
Graph-based Clustering
Clustering on a graph is equivalent to partitioning the vertices of the graph.
A loss function for a partition of V into sets A and B:
$$\mathrm{cut}(A, B) = \sum_{u \in A,\, v \in B} W_{u,v}$$
In a good partition, vertices in different partitions will be dissimilar.
Mincut criterion: find a partition A, B that minimizes cut(A, B)
The mincut criterion ignores the size of the subgraphs formed
Graph-based Clustering
The normalized cut criterion favors balanced partitions:
$$\mathrm{Ncut}(A, B) = \frac{\mathrm{cut}(A, B)}{\sum_{u \in A,\, v \in V} W_{u,v}} + \frac{\mathrm{cut}(A, B)}{\sum_{u \in B,\, v \in V} W_{u,v}}$$
Minimizing the normalized cut criterion exactly is NP-hard.
One way of approximately optimizing the normalized cut criterion leads to spectral clustering.
Spectral Clustering
Spectral clustering
Looks for a new representation of the original data points such that
the edge weights are preserved, and
convex cluster shapes in the new space represent non-convex ones in the original space.
Cluster the points in the new space using any clustering scheme (say k-means).
We only describe the resulting algorithm here.
For more information about the derivations, refer to U. von Luxburg, "A Tutorial on Spectral Clustering".
Spectral Clustering
Inputs
A set of points S = {s_1, ..., s_n}, s_i ∈ R^l, and the number of clusters k
Algorithm
Form the edge weights matrix W ∈ R^{n×n}; for example
$$W_{ij} = \begin{cases} \exp\big(-\|s_i - s_j\|^2 / 2\sigma^2\big) & \text{if } i \neq j \\ 0 & \text{else} \end{cases}$$
with the scaling parameter σ chosen by the user
Define D, a diagonal matrix whose (i, i) element is the sum of W's row i
Form the matrix
$$L = D^{-1/2}\, W\, D^{-1/2}$$
Find the k largest eigenvectors of L to form the matrix X ∈ R^{n×k}
Spectral Clustering
Algorithm (cont.)
Normalize the rows of X_{n×k} to unit length to form the matrix Y_{n×k}:
$$Y_{ij} = X_{ij} \Big/ \Big( \sum_j X_{ij}^2 \Big)^{1/2}$$
Treat each row of Y as a point in R^k (data dimensionality reduction from n to k)
Cluster the new data into k clusters via k-means
(A NumPy transcription follows.)
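A direct NumPy transcription of the algorithm above (a sketch: sigma is the user-chosen scaling parameter, and the final step reuses scikit-learn's k-means):

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_clustering(S, k, sigma=1.0):
    # Edge-weight matrix W with Gaussian affinities and zero diagonal
    d2 = ((S[:, None, :] - S[None, :, :]) ** 2).sum(axis=2)
    W = np.exp(-d2 / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    # D: diagonal matrix of row sums; L = D^{-1/2} W D^{-1/2}
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    L = d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    # X: the k largest eigenvectors of L (eigh returns ascending eigenvalues)
    _, vecs = np.linalg.eigh(L)
    X = vecs[:, -k:]
    # Y: row-normalize X, then cluster the rows of Y with k-means
    Y = X / np.linalg.norm(X, axis=1, keepdims=True)
    return KMeans(n_clusters=k, n_init=10).fit_predict(Y)
```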
Spectral Clustering
Example
A simple edge weights matrix, where d(x_i, x_j) denotes the Euclidean distance between points x_i and x_j, and θ = 1:
$$W(i,j) = W(j,i) = \begin{cases} 1 & \text{if } d(x_i, x_j) \le \theta \\ 0 & \text{otherwise} \end{cases}$$
For four points a, b, c, d (with a near b and c near d):
$$W = \begin{pmatrix} 1 & 1 & 0 & 0 \\ 1 & 1 & 0 & 0 \\ 0 & 0 & 1 & 1 \\ 0 & 0 & 1 & 1 \end{pmatrix} \;(\text{order } a,b,c,d), \qquad \tilde{W} = \begin{pmatrix} 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \\ 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \end{pmatrix} \;(\text{order } a,c,b,d)$$
Leading eigenvectors: e_1 = (0.7, 0.7, 0, 0)^T and e_2 = (0, 0, 0.7, 0.7)^T for W; e_1 = (0.7, 0, 0.7, 0)^T and e_2 = (0, 0.7, 0, 0.7)^T for the reordered matrix.
Spectral Clustering
Another example
Other Methods
Grid based methods
Uses a multi-resolution grid data structure.
1. Create the grid structure, i.e., partition the data space into a finite number of cells
2. Calculate the cell density for each cell
3. Sort the cells according to their densities
4. Identify cluster centers
5. Traverse the neighbor cells
Other Methods
Model based methods
Attempt to optimize the fit between the given data and some mathematical
model
Based on the assumption that the data are generated by a mixture of underlying probability distributions
Typical methods
Statistical approach: EM (Expectation maximization) – will be discussed later
Neural network approach: SOM (Self-Organizing Feature Map)
Constraint Based Clustering
Why constraint based clustering?
Need user feedback: Users know their applications the best
Fewer parameters but more user-desired constraints, e.g., an ATM allocation problem with obstacles and desired clusters
Different constraints in cluster analysis:
Constraints on individual samples (do selection first)
Cluster on samples which …
Constraints on distance or similarity functions
Weighted functions, obstacles
Constraints on the selection of clustering parameters
Number of clusters, limitation of each cluster size
User-specified constraints
Some samples must be placed in the same cluster and some others must not!
Semi-supervised: small training sets are given as "constraints" or hints
Constraint Based Clustering
A sample data set and two answers (one taking the constraints into account, one not)
Constraints: the data on different sides of each "wall" should be in different clusters
Clustering as Optimization
Clustering can be posed as the optimization of a criterion function
The sum-of-squared-error criterion
Scatter criteria
The given criterion function is optimized through iterative optimization
Any Question?
End of Lecture 19
Thank you!
Spring 2015
http://ce.sharif.edu/courses/93-94/2/ce717-1
Machine Learning
Clustering II
Hamid R. Rabiee
[Slides are based on Bishop Book]
Spring 2015
http://ce.sharif.edu/courses/93-94/2/ce717-1
K-means Clustering
The problem of identifying groups, or clusters, of data points in a multidimensional space
Partition the data set into some number K of clusters
Cluster: a group of data points whose inter-point distances are small compared with the distances to points outside of the cluster
Goal: an assignment of data points to clusters such that the sum of the squares of the distances of each data point to its closest vector (the center of its cluster) is a minimum
Objective function, called the distortion measure:
$$J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk}\, \| x_n - \mu_k \|^2$$
where r_nk ∈ {0, 1} indicates whether point x_n is assigned to cluster k
K-means Clustering
Two-stage optimization
In the 1st stage: minimize J with respect to the r_nk, keeping the μ_k fixed (assign each point to its nearest center)
In the 2nd stage: minimize J with respect to the μ_k, keeping the r_nk fixed, which gives
$$\mu_k = \frac{\sum_n r_{nk}\, x_n}{\sum_n r_{nk}}$$
the mean of all of the data points assigned to cluster k
K-means Clustering
Mixtures of Gaussians
The Gaussian mixture distribution can be written as a linear superposition of Gaussians:
$$p(x) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x \mid \mu_k, \Sigma_k) \qquad (*)$$
An equivalent formulation of the Gaussian mixture involves an explicit latent variable (see the graphical representation of the mixture model)
A binary random variable z having a 1-of-K representation
The marginal distribution of x is a Gaussian mixture of the form (*) (for every observed data point x_n, there is a corresponding latent variable z_n)
Mixtures of Gaussians
γ(z_k) ≡ p(z_k = 1 | x) = π_k N(x | μ_k, Σ_k) / Σ_j π_j N(x | μ_j, Σ_j) can also be viewed as the responsibility that component k takes for explaining the observation x
Mixtures of Gaussians
Generating random samples distributed according to the Gaussian mixture
model
Generate a value ẑ for z from the marginal distribution p(z), and then generate a value for x from the conditional distribution p(x | ẑ)
(A short sampling sketch follows.)
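A short sketch of this ancestral sampling procedure (the two-component parameters below are illustrative, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative 2-component mixture in 2-D
pi   = np.array([0.3, 0.7])                       # mixing coefficients p(z)
mus  = np.array([[0.0, 0.0], [3.0, 3.0]])
covs = np.array([np.eye(2), [[1.0, 0.8], [0.8, 1.0]]])

def sample_gmm(n):
    z = rng.choice(len(pi), size=n, p=pi)         # z-hat ~ p(z)
    # x ~ p(x | z-hat): draw from the chosen component's Gaussian
    x = np.array([rng.multivariate_normal(mus[k], covs[k]) for k in z])
    return x, z

X, z = sample_gmm(500)
```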
Mixtures of Gaussians
a. The three states of z, corresponding to the three components of the mixture, depicted in red, green, and blue
b. The corresponding samples from the marginal distribution p(x)
c. The same samples, with colors representing the value of the responsibilities γ(z_nk) associated with each data point
The responsibilities are illustrated by evaluating the posterior probability of each component in the mixture distribution from which this data set was generated
Maximum likelihood
Graphical representation of a Gaussian mixture model for a set of N i.i.d. data points {x_n}, with corresponding latent points {z_n}
The log of the likelihood function:
$$\ln p(X \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \ln \left\{ \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \right\} \qquad (*1)$$
Maximum likelihood
For simplicity, consider a Gaussian mixture whose components have covariance matrices given by Σ_k = σ_k² I
Suppose that one of the components of the mixture model has its mean μ_j exactly equal to one of the data points, so that μ_j = x_n
This data point will contribute a term to the likelihood function of the form N(x_n | x_n, σ_j² I), which goes to infinity as σ_j → 0
Once there are at least two components in the mixture, one of the components can have a finite variance and therefore assign finite probability to all of the data points, while the other component can shrink onto one specific data point and thereby contribute an ever-increasing additive value to the log likelihood: the over-fitting problem
Maximum likelihood
Over-fitting problem
In applying maximum likelihood to Gaussian mixture models, there should be steps to avoid finding such pathological solutions and instead seek local maxima of the likelihood function that are well behaved
Identifiability problem
A K-component mixture will have a total of K! equivalent solutions corresponding to the K! ways of assigning K sets of parameters to K components
Difficulty of maximizing the log likelihood function: the presence of the summation over k inside the logarithm means there is no closed-form solution, unlike the single-Gaussian case
EM for Gaussian mixtures
I. Assign some initial values for the means, covariances, and mixing coefficients
II. Expectation or E step
• Using the current value for the parameters to evaluate the posterior probabilities
or responsibilities
III. Maximization or M step
• Using the result of II to re-estimate the means, covariances, and mixing coefficients
It is common to run the K-means algorithm first in order to find suitable initial values:
• the covariance matrices can be initialized to the sample covariances of the clusters found by the K-means algorithm
• the mixing coefficients can be set to the fractions of data points assigned to the respective clusters
EM for Gaussian mixtures
Given a Gaussian mixture model, the goal is to maximize the likelihood
function with respect to the parameters
1. Initialize the means μ_k, covariances Σ_k and mixing coefficients π_k
2. E step: evaluate the responsibilities using the current parameter values
$$\gamma(z_{nk}) = \frac{\pi_k\, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j\, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}$$
3. M step: re-estimate the parameters using the current responsibilities, with N_k = Σ_n γ(z_nk):
$$\mu_k^{new} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk})\, x_n, \qquad \Sigma_k^{new} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) (x_n - \mu_k^{new})(x_n - \mu_k^{new})^T, \qquad \pi_k^{new} = \frac{N_k}{N}$$
EM for Gaussian mixtures
Given a Gaussian mixture model, the goal is to maximize the likelihood
function with respect to the parameters
4. Evaluate the log likelihood
$$\ln p(X \mid \mu, \Sigma, \pi) = \sum_{n=1}^{N} \ln \left\{ \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \right\} \qquad (*2)$$
and check for convergence of either the parameters or the log likelihood; if not converged, return to step 2
(A compact sketch of these four steps follows.)
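A compact sketch of these four steps in NumPy/SciPy (a real implementation would also regularize the covariances and test the log likelihood for convergence):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iter=100, seed=0):
    N, D = X.shape
    rng = np.random.default_rng(seed)
    # 1. Initialize means, covariances, and mixing coefficients
    mu = X[rng.choice(N, K, replace=False)]
    Sigma = np.array([np.cov(X, rowvar=False)] * K)
    pi = np.full(K, 1.0 / K)
    for _ in range(n_iter):
        # 2. E step: responsibilities gamma(z_nk)
        dens = np.stack([pi[k] * multivariate_normal.pdf(X, mu[k], Sigma[k])
                         for k in range(K)], axis=1)          # shape (N, K)
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # 3. M step: re-estimate parameters with effective counts N_k
        Nk = gamma.sum(axis=0)
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            d = X - mu[k]
            Sigma[k] = (gamma[:, k, None] * d).T @ d / Nk[k]
        pi = Nk / N
        # 4. Log likelihood (monitor this for convergence)
        ll = np.log(dens.sum(axis=1)).sum()
    return mu, Sigma, pi, ll
```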
EM for Gaussian mixtures
Setting the derivatives of (*2) with respect to the means of the Gaussian components to zero gives
$$\mu_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk})\, x_n, \qquad N_k = \sum_{n=1}^{N} \gamma(z_{nk})$$
a weighted mean of all of the points in the data set: each data point is weighted by the corresponding posterior probability (the responsibility γ(z_nk)), and the denominator N_k is the effective number of points associated with the corresponding component
Setting the derivatives of (*2) with respect to the covariances of the Gaussian components to zero gives
$$\Sigma_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) (x_n - \mu_k)(x_n - \mu_k)^T$$
EM for Gaussian mixtures
An Alternative View of EM
In maximizing the log likelihood function
$$\ln p(X \mid \Theta) = \ln \Big\{ \sum_{Z} p(X, Z \mid \Theta) \Big\}$$
the summation over Z prevents the logarithm from acting directly on the joint distribution
Instead, the log likelihood function for the complete data set {X, Z}, ln p(X, Z | Θ), is straightforward to maximize
In practice, since we are not given the complete data set, we consider instead its expected value Q under the posterior distribution p(Z | X, Θ) of the latent variables
An Alternative View of EM
General EM
1. Choose an initial setting for the parameters Θ^old
2. E step: evaluate p(Z | X, Θ^old)
3. M step: evaluate Θ^new given by
Θ^new = argmax_Θ Q(Θ, Θ^old)
Q(Θ, Θ^old) = Σ_Z p(Z | X, Θ^old) ln p(X, Z | Θ)
4. If the convergence criterion is not satisfied, let Θ^old ← Θ^new and return to step 2
Gaussian mixtures revisited
Maximizing the likelihood for the complete data {X, Z}:
$$p(X, Z \mid \mu, \Sigma, \pi) = \prod_{n=1}^{N} \prod_{k=1}^{K} \big[ \pi_k\, \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \big]^{z_{nk}}$$
The logarithm now acts directly on the Gaussian distribution, giving a much simpler solution to the maximum likelihood problem: the maximization with respect to a mean or a covariance is exactly as for a single Gaussian (closed form)
Gaussian mixtures revisited
Unknown latent variables: consider the expectation of the complete-data log likelihood with respect to the posterior distribution of the latent variables
Posterior distribution (ref. eqs. (9.10), (9.11)):
$$p(Z \mid X, \mu, \Sigma, \pi) \propto p(X \mid Z, \mu, \Sigma)\, p(Z) = \prod_{n=1}^{N} \prod_{k=1}^{K} \big[ \pi_k\, \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \big]^{z_{nk}}$$
The expected value of the indicator variable under this posterior distribution:
$$E[z_{nk}] = p(z_{nk} = 1 \mid x_n) = \frac{p(z_{nk} = 1)\, p(x_n \mid z_{nk} = 1)}{p(x_n)} = \frac{\pi_k\, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j\, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)} = \gamma(z_{nk})$$
The expected value of the complete-data log likelihood function:
$$E_Z[\ln p(X, Z \mid \mu, \Sigma, \pi)] = \sum_{n=1}^{N} \sum_{k=1}^{K} \gamma(z_{nk}) \big\{ \ln \pi_k + \ln \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \big\} \qquad (*3)$$
Relation to K-means
K-means performs a hard assignment of data points to clusters (each data point is associated uniquely with one cluster)
EM makes a soft assignment based on the posterior probabilities
K-means can be derived as a particular limit of EM for Gaussian mixtures: consider a mixture whose components share the covariance εI. As ε gets smaller, the terms for which ||x_n − μ_j||² is largest go to zero most quickly; hence the responsibilities all go to zero except for the term k with the smallest ||x_n − μ_k||², whose responsibility goes to unity
Relation to K-means
With shared covariances Σ_k = εI,
$$p(x \mid \mu_k, \Sigma_k) = \frac{1}{(2\pi\varepsilon)^{D/2}} \exp\Big\{ -\frac{\|x - \mu_k\|^2}{2\varepsilon} \Big\}, \qquad \gamma(z_{nk}) = \frac{\pi_k \exp\{ -\|x_n - \mu_k\|^2 / 2\varepsilon \}}{\sum_j \pi_j \exp\{ -\|x_n - \mu_j\|^2 / 2\varepsilon \}}$$
In the limit ε → 0,
$$E_Z[\ln p(X, Z \mid \mu, \Sigma, \pi)] \to -\frac{1}{2} \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk}\, \|x_n - \mu_k\|^2 + \text{const.}$$
Thus maximizing the expected complete-data log likelihood is equivalent to minimizing the distortion measure J for K-means
(In elliptical K-means, the covariance is estimated as well.)
Mixtures of Bernoulli distributions
Mixtures of Bernoulli distributions
Single component:
$$p(x \mid \mu) = \prod_{i=1}^{D} \mu_i^{x_i} (1 - \mu_i)^{1 - x_i}$$
where x = (x_1, ..., x_D)^T, μ = (μ_1, ..., μ_D)^T, E[x] = μ, and Cov[x] = diag{μ_i(1 − μ_i)}.
Mixture:
$$p(x \mid \mu, \pi) = \sum_{k=1}^{K} \pi_k\, p(x \mid \mu_k), \qquad p(x \mid \mu_k) = \prod_{i=1}^{D} \mu_{ki}^{x_i} (1 - \mu_{ki})^{1 - x_i}$$
where μ = {μ_1, ..., μ_K} and π = {π_1, ..., π_K}.
Moments of the mixture:
$$E[x] = \sum_{k=1}^{K} \pi_k \mu_k, \qquad Cov[x] = \sum_{k=1}^{K} \pi_k \big( \Sigma_k + \mu_k \mu_k^T \big) - E[x]\, E[x]^T$$
where Σ_k = diag{μ_ki(1 − μ_ki)} (note that the individual variables x_i are independent, given μ_k).
(Worked example on binary vectors such as x = (1, 0, 0, 1, 1): a single component p(x | μ) has diagonal covariance Cov[x] = diag{μ_i(1 − μ_i)}, whereas a two-component mixture π_1 p(x | μ_1) + π_2 p(x | μ_2) has mean π_1 μ_1 + π_2 μ_2 and a non-diagonal covariance.)
* Because the covariance matrix Cov[x] is no longer diagonal, the mixture distribution can capture correlations between the variables, unlike a single Bernoulli distribution.
Mixtures of Bernoulli distributions
Log likelihood:
$$\ln p(X \mid \mu, \pi) = \sum_{n=1}^{N} \ln \Big\{ \sum_{k=1}^{K} \pi_k\, p(x_n \mid \mu_k) \Big\}$$
Latent-variable formulation (z = (z_1, ..., z_K)^T is a binary 1-of-K indicator variable):
$$p(x \mid z, \mu) = \prod_{k=1}^{K} p(x \mid \mu_k)^{z_k}, \qquad p(z \mid \pi) = \prod_{k=1}^{K} \pi_k^{z_k}$$
Complete-data log likelihood:
$$\ln p(X, Z \mid \mu, \pi) = \sum_{n=1}^{N} \sum_{k=1}^{K} z_{nk} \Big[ \ln \pi_k + \sum_{i=1}^{D} \big\{ x_{ni} \ln \mu_{ki} + (1 - x_{ni}) \ln(1 - \mu_{ki}) \big\} \Big]$$
Its expectation under the posterior of Z:
$$E_Z[\ln p(X, Z \mid \mu, \pi)] = \sum_{n=1}^{N} \sum_{k=1}^{K} \gamma(z_{nk}) \Big[ \ln \pi_k + \sum_{i=1}^{D} \big\{ x_{ni} \ln \mu_{ki} + (1 - x_{ni}) \ln(1 - \mu_{ki}) \big\} \Big]$$
(E-step):
$$\gamma(z_{nk}) = E[z_{nk}] = \frac{\pi_k\, p(x_n \mid \mu_k)}{\sum_{j=1}^{K} \pi_j\, p(x_n \mid \mu_j)}, \qquad N_k = \sum_{n=1}^{N} \gamma(z_{nk}), \qquad \bar{x}_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk})\, x_n$$
(M-step):
$$\mu_k = \bar{x}_k, \qquad \pi_k = \frac{N_k}{N}$$
* In contrast to the mixture of Gaussians, there are no singularities in which the likelihood goes to infinity
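A sketch of this EM loop for binary data (it assumes X is an (N, D) 0/1 array; the clipping that keeps mu away from 0 and 1 is an added numerical safeguard, not part of the derivation):

```python
import numpy as np

def em_bernoulli_mixture(X, K, n_iter=50, seed=0):
    N, D = X.shape
    rng = np.random.default_rng(seed)
    mu = rng.uniform(0.25, 0.75, size=(K, D))   # component means mu_ki
    pi = np.full(K, 1.0 / K)                    # mixing coefficients
    for _ in range(n_iter):
        # E step: gamma(z_nk) proportional to pi_k * prod_i mu_ki^x_ni (1-mu_ki)^(1-x_ni)
        log_p = X @ np.log(mu).T + (1 - X) @ np.log(1 - mu).T + np.log(pi)
        log_p -= log_p.max(axis=1, keepdims=True)   # numerical stability
        gamma = np.exp(log_p)
        gamma /= gamma.sum(axis=1, keepdims=True)
        # M step: mu_k = x_bar_k (responsibility-weighted mean), pi_k = N_k / N
        Nk = gamma.sum(axis=0)
        mu = np.clip((gamma.T @ X) / Nk[:, None], 1e-6, 1 - 1e-6)
        pi = Nk / N
    return mu, pi
```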
Mixtures of Bernoulli distributions
N = 600 digit images, K = 3 mixture components
A mixture of K = 3 Bernoulli distributions fitted by 10 EM iterations, showing the parameters for each of the three components compared with a single multivariate Bernoulli fit
The analysis of Bernoulli mixtures can be extended to discrete variables having M > 2 states (Ex. 9.19)
References
Slides for Chapter 9 of the Bishop book, adapted from the Biointelligence Laboratory, Seoul National University, http://bi.snu.ac.kr/
Any Question?
End of Lecture 20
Thank you!
Spring 2015
http://ce.sharif.edu/courses/93-94/2/ce717-1