Unsupervised Learning Clustering Algorithms - IT -...

1

1

Unsupervised Learning Clustering Algorithms

Unsupervised Learning -- Ana Fred

Outline

Part 1: Basic Concepts of data clustering

Non-Supervised Learning and Clustering

: Problem formulation – cluster analysis

: Taxonomies of Clustering Techniques

: Data types and Proximity Measures

: Difficulties and open problems

Part 2: Clustering Algorithms

Hierarchical methods

: Single-link

: Complete-link

: Clustering Based on Dissimilarity Increments Criteria

From Single Clustering to Ensemble Methods - April 2009

2



Hierarchical Clustering

Use proximity matrix: nxn

: D(i,j): proximity (similarity or distance) between patterns i and j

Step 0 Step 1 Step 2 Step 3 Step 4

b

d

c

e

aa b

d e

c d e

a b c d e

Step 4 Step 3 Step 2 Step 1 Step 0

agglomerative

divisive


2

3



Hierarchical Clustering: Agglomerative Methods

1. Start with n clusters containing one object

2. Find the most similar pair of clusters Ci e Cj from the proximity

matrix and merge them into a single cluster

3. Update the proximity matrix (reduce its order by one, by replacing

the individual clusters with the merged cluster)

4. Repeat steps (2) e (3) until a single cluster is obtained (i.e. N-1

times)


4




Similarity measures between clusters:

:Well known similarity measures can be written using the Lance-Williams

formula, expressing the distance between cluster k and cluster i+j,

obtained by the merging of clusters i and j:

Single-link

Complete-link

Centroid

Median

(Average link)

Ward’s Method

(minimum variance)

),(),,(min, -0.5c ; 0 ; 5.0 kjdkidkjid baa ji

),(),,(max, 0.5c ; 0 ; 5.0 kjdkidkjid baa ji

),(),( 0

2 kji

ji

ji

ji

j

j

ji

ii dkjidc

nn

nnb

nn

na

nn

na

baa ji 0c ; 25.0 ; 5.0

ji CbCaji

ji

ji

j

j

ji

ii bad

nnCCdcb

nn

na

nn

na

,

),(1

),( 0

0

c

nnn

nb

nnn

nna

nnn

nna

jik

k

jik

jk

j

jik

iki

),(),(),(),(),(),( kjdkidcjibdkjdakidakjid ji


3

5




Single-link

Complete-link

Centroid

Median

(Average link)

Ward’s Method

(minimum variance)



),(),( 0

2 kji

ji

ji

ji

j

j

ji

ii dkjidc

nn

nnb

nn

na

nn

na

baa ji 0c ; 25.0 ; 5.0

ji CbCaji

ji

ji

j

j

ji

ii bad

nnCCdcb

nn

na

nn

na

,

),(1

),( 0

0

c

nnn

nb

nnn

nna

nnn

nna

jik

k

jik

jk

j

jik

iki

Single Link: Distance between two clusters

is the distance between the closest points.

Also called “neighbor joining.”


6




Single-link

Complete-link

Centroide

Median

(Average link)

Ward’s Method

(minimum variance)



),(),( 0

2 kji

ji

ji

ji

j

j

ji

ii dkjidc

nn

nnb

nn

na

nn

na

baa ji 0c ; 25.0 ; 5.0

ji CbCaji

ji

ji

j

j

ji

ii bad

nnCCdcb

nn

na

nn

na

,

),(1

),( 0

0

c

nnn

nb

nnn

nna

nnn

nna

jik

k

jik

jk

j

jik

iki

Complete Link: Distance between clusters

is distance between farthest pair of points.


4

7




Single-link

Complete-link

Centroide

Median

(Average link)

Ward’s Method

(minimum variance)



),(),( 0

2 kji

ji

ji

ji

j

j

ji

ii dkjidc

nn

nnb

nn

na

nn

na

baa ji 0c ; 25.0 ; 5.0

ji CbCaji

ji

ji

j

j

ji

ii bad

nnCCdcb

nn

na

nn

na

,

),(1

),( 0

0

c

nnn

nb

nnn

nna

nnn

nna

jik

k

jik

jk

j

jik

iki

Centroid: Distance between clusters is

distance between centroids.


8




Single-link

Complete-link

Centroid

Median

(Average link)

Ward’s Method

(minimum variance)



),(),( 0

2 kji

ji

ji

ji

j

j

ji

ii dkjidc

nn

nnb

nn

na

nn

na

baa ji 0c ; 25.0 ; 5.0

ji CbCaji

ji

ji

j

j

ji

ii bad

nnCCdcb

nn

na

nn

na

,

),(1

),( 0

0

c

nnn

nb

nnn

nna

nnn

nna

jik

k

jik

jk

j

jik

iki

Average Link: Distance between clusters is

average distance between the cluster points.


5

9



Ward’s Link: Minimizes the sum-of-squares

criterion (measure of heterogenity)


Single-link

Complete-link

Centroid

Median

(Average link)

Ward’s Method

(minimum variance)



),(),( 0

2 kji

ji

ji

ji

j

j

ji

ii dkjidc

nn

nnb

nn

na

nn

na

baa ji 0c ; 25.0 ; 5.0

ji CbCaji

ji

ji

j

j

ji

ii bad

nnCCdcb

nn

na

nn

na

,

),(1

),( 0

0

c

nnn

nb

nnn

nna

nnn

nna

jik

k

jik

jk

j

jik

iki


10



Single Linkage: ),(min,

,badCCd

ji CbCaji

x y

1 4 4

2 8 4

3 15 8

4 24 4

5 24 121 2 3 4 5

1 - 4 11.7 20 21.5

2 4 - 8.1 16 17.9

3 11.7 8.1 - 9.8 9.8

4 20 16 9.8 - 8

5 21.5 17.9 9.8 8 -

1 2

35

4

1 2

35

4

1,2 3 4 5

1,2 - 8.1 16 17.9

3 8.1 - 9.8 9.8

4 16 9.8 - 8

5 17.9 9.8 8 -

1 2

35

4


6

11



Single Linkage:

1,2 3 4,5

1,2 - 8.1 16

3 8.1 - 9.8

4,5 16 9.8 -

1 2

35

4

1,2,3 4,5

1,2,3 - 9.8

4,5 9.8 -

1 2

35

4

Dendrogram

),(min,,

badCCdji CbCa

ji


12



Complete-Link: ),(max,,

badCCdji CbCa

ji

1 2 3 4 5

1 - 4 11.7 20 21.5

2 4 - 8.1 16 17.9

3 11.7 8.1 - 9.8 9.8

4 20 16 9.8 - 8

5 21.5 17.9 9.8 8 -

1 2

35

4

1 2

35

4

1,2 3 4 5

1,2 - 11.7 20 21.5

3 11.7 - 9.8 9.8

4 20 9.8 - 8

5 21.5 9.8 8 -

1 2

35

4

1,2 3 4,5

1,2 - 11.7 21.5

3 11.7 - 9.8

4,5 21.5 9.8 -


7

13



Complete-Link: ),(max,,

badCCdji CbCa

ji

1 2

35

4

1,2,3 4,5

1,2,3 - 21.5

4,5 21.5 -

Dendrogram


14



1 2

35

4

Single-link

1 2

35

4

Complete-link

Single-link and Complete-Link

: A clustering of the data objects is obtained by cutting the dendrogram at

the desired level, then each connected component forms a cluster.

Favours connectedness Favours compactness


8

15




SL algorithm:

: Favors connectedness

Equivalent to building a MST

And cutting at weak links

2 3 4 5 6 7 8 9 101

2

3

4

5

6

7

8

9

10

2 3 4 5 6 7 8 9 101

2

3

4

5

6

7

8

9

10

2 3 4 5 6 7 8 9 101

2

3

4

5

6

7

8

9

10

2 3 4 5 6 7 8 9 101

2

3

4

5

6

7

8

9

10


16




SL algorithm:


: Detects arbitrary-shaped clusters with even densities

: Cannot handle distinct density clusters

. Single-link method, th=0.49


9

17



-5 0 5 10-4

-2

0

2

4

-5 0 5 10-4

-2

0

2

4


SL algorithm:




: Is sensitive to in-between patterns

-5 0 5 10-4

-2

0

2

4


18




SL algorithm:





: Needs criteria to set the final number of clusters

CL algorithm:

: Favors compactness

-5 0 5 10-4

-2

0

2

4

-5 0 5 10-4

-2

0

2

4


10

19




SL algorithm:






CL algorithm:


: Imposes spherical-shaped clusters on data

-2 -1 0 1 2-2

-1.5

-1

-0.5

0

0.5

1

1.5

2

-2 -1 0 1 2-2

-1.5

-1

-0.5

0

0.5

1

1.5

2


20




SL algorithm:






CL algorithm:


: Imposes spherical-shaped clusters on data



11

21



Hierarchical Clustering

Weakness

: do not scale well: time complexity of O(n2), where n is the number

of total objects

: can never undo what was done previously

Integration of hierarchical with distance-based clustering

: BIRCH (Zhang, Ramakrishnan, Livny, 1996): uses a

ClusteringFeature-tree and incrementally adjusts the quality of sub-

clusters

: CURE (Guha, Rastogi & Shim, 1998): selects well-scattered points

from the cluster and then shrinks them towards the center of the

cluster by a specified fraction

: CHAMELEON (G. Karypis, E.H. Han and V. Kumar, 1999):

hierarchical clustering using dynamic modeling

1. Use a graph partitioning algorithm: cluster objects into a large

number of relatively small sub-clusters

2. Use an agglomerative hierarchical clustering algorithm: find the

genuine clusters by repeatedly combining these sub-clusters


22



Clustering Based on Dissimilarity Increments Criteria

Smoothness Hypothesis:

: A cluster is a set of patterns sharing important characteristics in a

given context

: A dissimilarity measure encapsulates the notion of pattern

resemblance

: Higher resemblance patterns are more likely to belong to the same

cluster and should be associated first

: Dissimilarity between neighboring patterns within a cluster should

not occur with abrupt changes

: The merging of well separated clusters results in abrupt changes

in dissimilarity values

A Fred, J Leitão, A New Cluster Isolation Criterion Based on Dissimilarity Increments, IEEE PAMI, 2003.


12

23




Dissimilarity Increments:

Dissimilarity increment:

, , - nearest neighbors

: argmin , ,

: argmin , ,

i j k

j l il

k l jl

x x x

x j d x x l i

x k d x x l i

, , , ,inc i j k i j j kd x x x d x x d x x

ixjx

kx

,i jd x x

,i jd x x

incd


24




Distribution of Dissimilarity Increments:

: Uniformly distributed data


13

25





: 2D Gaussian data


26





: Ring-shaped data


14

27





: Directional expanding data


28





: Exponential distribution: ( ) expx

p x


15

29




Exponential distribution:

: Higher density patterns -> higher

: Well separated clusters -> dinc on the tail of p(x)


30




Gap between clusters: ,i i j t igap d C C d C

12 3

45

6

7

8

9

10

11

12

13

14

15

16

17

18d

t(C

i) d(C

i,C

j)

dt(C

j)

gapj

dt(C

j)

dt(C

i)gap

i

Ci

Cj


16

31




Gap between clusters: ,i i j t igap d C C d C

Ci

Cj

gapi

gapj

dt(C

i )d

t(C

j ) d(C

i ,C

j )


32




Cluster Isolation criterion:

Let Ci, Ck be two clusters which are candidates for merging, and let i , k

be the respective mean values of the dissimilarity increments in each

cluster. Compute the increments for each cluster, gapi and gapk. If

gapi i (gapk k), isolate cluster Ci (Ck) and proceed the

clustering strategy with the remaining patterns. If neither cluster exceeds

the gap limit, merge them.


17

33




Setting the Isolation Criterion Parameter :

: Result: the crossings of the tangential line, at points which are multiple of

the distribution mean value, /, with the x axis, is given by (+1)/

: 3,5 cover the significant part of the distribution

exponential distribution, =.05

tangential lines at points multiple of

=.05


34




Hierarchical Clustering Algorithm:

: A statistic of the dissimilarity increments within a cluster is maintained and

updated during cluster merging

: Clusters are obtained by comparing dissimilarity increments with a dynamic

threshold, i , based on cluster statistics


18

35




Results:

:Ring-Shaped Clusters


36




Results:


. Single-link method, th=0.49


19

37




Results:


. Dissimilarity Increments-base method:

d1

d2

d1

d2d2

g1

g2

3/

3/1

3/3


38




Results:


. Dissimilarity Increments-base method:

d1

d2d2

g1

g2

3/

3/1

3/3


20

39




Results:

:2-D Patterns with Complex Structure

Dissimilarity Increments-base method Single-link

2 < < 9

Single Link

th=1.1


40



Clustering of Contour Images

: The data set is composed by 634 contour images of 15 types of hardware tools: t1 to t15.

: When counting each pose as a distinct sub-class in the object type, we obtain a total of 24 classes.


21

41




Contour extraction

: the object boundary is sampled at 50 equally spaced points

: the angle between consecutive segments is quantized in 8 levels.

Contour

Extraction

String Contour

Description

Based on a

thresholding

method

8 directional

differential

chain code


42





22

43




t1

t2

t3

t4

t5

t1

t2

t3

t4

t5

Single-link Dissimilarity Increments-based method

(string-edit distance)


44




Description:

: Hierarchical agglomerative algorithm adopting a cluster isolation

criterion based on dissimilarity increments

Strength

: Method is not conditioned by a particular dissimilarity measure

(examples used Euclidean and string-edit distances)

: Ability to identify arbitrary shaped and sized clusters

: The number of clusters is intrinsically found

Weakness

: Sensitive to in-between points connecting touching clusters


23

45



Outline

Partitional Methods

: K-Means

: Spectral Clustering

: EM-based Gaussian Mixture Decomposition

Part 3.: Validation of clustering solutions

Cluster Validity Measures

Part 4.: Ensemble Methods

Evidence Accumulation Clustering


46



Partitional Methods

K-Means: Minimizes the cost function:

: Algorithm:

. Input: k, the number of clusters; data set

1. Randomly select k seed points from the data set, and take them as initial centroids

2. Partition the data into k clusters by assigning each object to the cluster with the nearest centroid.

3. Compute centroids of the clusters of the current partition. The centroid is the center (mean point) of the cluster.

4. Go back to step 2 or stop when no more new assignment.


24

47



Partitional Methods: K-Means

Favors compactness

-5 0 5 10-4

-2

0

2

4

-5 0 5 10-4

-2

0

2

4


48




Favors compactness

Strength

: Fast algorithm (O(tkn) – t is the number of iterations; normally, k, t << n.)

: Scalability

: Often terminates at a local optimum.

Weakness

: Imposes spherical-shaped clusters

K-means clustering of uniform data (k=4)


25

49




Favors compactness

Strength


: Scalability


Weakness


: Is sensitive to the number of objects in clusters0 2 4 6 8 100

2

4

6

8

10

0 2 4 6 8 100

2

4

6

8

10


50




Favors compactness

Strength


: Scalability


Weakness


: Is sensitive to the number of objects in clusters

: Dependence on initialization


26

51



Favors compactness

Strength

: Fast

: Scalability

: Robust to noise

Weakness




0 2 4 6 8 100

2

4

6

8

10

0 2 4 6 8 100

2

4

6

8

10

0 2 4 6 8 100

2

4

6

8

10


52




Favors compactness

Strength


: Scalability


Weakness





: Applicable only when mean is defined (what about categorical

data?)


27

53



Variations of K-Means Method

K-Means(MacQueen’67): each cluster is represented by the center

of the cluster

A few variants of the k-means which differ in

: Selection of the initial k means

: Dissimilarity calculations (Mahalanobis distance -> elliptic clusters)

: Strategies to calculate cluster means

. Medoid - each cluster is represented by one of the objects in the cluster

: Fuzzy version: Fuzzy K-Means

Handling categorical data: k-modes (Huang’98)

: Replacing means of clusters with modes

: Using new dissimilarity measures to deal with categorical objects

: Using a frequency-based method to update modes of clusters


54



Spectral Clustering

For a given data set, X, Spectral Clustering finds a set of data clusters on the basis of spectral analysis of a similarity graph

The clustering problem is defined in terms of a complete graph, G, with vertices V={1, …, N}, corresponding to the data points in the data set, and each edge between two vertices is weighted by the similarity between them.

The weight matrix is also called the affinity matrix or the similarity matrix.

2

2

2

ji xx

ij eA

Gaussian Kernel


28

55



Spectral Clustering

Cutting edges of G we obtain disjoint subgraphs of G as the clusters of X

The goal of clustering is to organize the dataset into disjoint subsets with high intra-cluster similarity and low inter-cluster similarity

The resulting clusters should be as compact and isolated as possible


56



Spectral Clustering

The graph partitioning for data clustering can be interpreted as aminimization problem of an objective function, in which the compactnessand isolation are quantified by the subset sums of edge weights.

Common objective functions

Ratio cut (Rcut)

Normalised cut (Ncut)

Min-max cut (Mcut)

• cut(A, B) is the sum of the edge weights between ∀p ∈ A and ∀p ∈ B

• P\Cl is the complement of Cl ⊂ X,

• card Cl denotes the number of points in Cl

k

l ll

llk

k

l l

llk

k

l l

llk

CC

CXCCC

XC

CXCCC

C

CXCCC

1

1

1

1

1

1

),cut(

)\,cut(:),,Mcut(

),cut(

)\,cut(:),,Ncut(

card

)\,cut(:),,Rcut(


29

57



Spectral Clustering

The solution of the minimization problem of any of the previous objectivefunctions is obtained from the matrix of the first k eigenvectors of a matrixderived from the affinity matrix (Laplacian matrix)

The eigenvectors for Ncut and Mcut are identical, and obtained from thesymmetrical Laplacian

Another common choice is

Distinct algorithms differ on the way of producing and using the eigenvectors and how to derive clusters from them.: Some use each eigenvector one at a time

: Other, use top k eigenvectors simultaneously

Closely related with spectral graph partitioning, in which the second eigenvector of a graph’s Laplacian is used to define a semi-optimal cut; the second eigenvector solves a relaxation of an NP-hard discrete graph partitioning problem, giving an approximation to the optimal cut.

2/12/1: ADDLL sym

• D is a diagonal matrix whose ii-th entry is the sum of the i-th row of A

ADDLL rw 1:


58



Spectral Clustering (Ng et al, 2001)

Maps the feature space into a new space, Y, based on the

eigenvectors of a matrix derived from an affinity matrix associated

with the data set.

The data partition is obtained by applying the K-means algorithm

on the new space.

(NG. and Al. 2001)

2

2

2

ji xx

ij eA

Original feature

space

Eigen vector

Feature space

A. Y. Ng and M. I. Jordan and Y. Weiss, On Spectral Clustering: Analysis and an algorithm, NIPS 2001


30

59




Algorithm:


60





31

61




Results strongly depend on parameter values: k e σ

K=2, σ=0.1 K=2, σ=0.4


62





K=2, σ=0.1 K=2, σ=0.4


32

63





-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

-1.5

-1

-0.5

0

0.5

1

1.5

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

-1.5

-1

-0.5

0

0.5

1

1.5

K=3, σ=0.1 K=3, σ=0.45


64





-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

-1.5

-1

-0.5

0

0.5

1

1.5

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

-1.5

-1

-0.5

0

0.5

1

1.5

K=3, σ=0.1 K=3, σ=0.45


33

65



Spectral Clustering

Selection of parameter values:

MSE:

Eigengap

Rcut

k

i

n

j

i

j

iY

i

myN

MSE1 1

21

1

21)(

A

i j ij

K

k

K

kll Sj Sj jk

KA

ARcut k l1 ,1


66



Selection of Parameters: Global Results on selecting σ and K

None of the studied methods is suitable for the automatic selection of the spectral clustering parameters

A majority voting decision did not significantly improve the results

Percentage of correct classification:

σ


34

67



Spectral Clustering

Strength

: Detects arbitrary-shaped clusters.

: By using an adequate similarity measure between patterns, can be

applied to all types of data

Weakness

: Computationally heavy

: Needs criteria to set the final number of clusters and scaling factor


68



Model-Based Clustering: Finite Mixtures

k random sources, with probability density functions fi(x), i=1,…,k

f1(x)

f2(x)

fi(x)

fk(x)

Choose at random

X random variable

f (x|source i) = fi (x)Conditional:

f (x and source i) = fi (x) iJoint:

f(x) = Unconditional: f (x and source i) sourcesall

k

1i

ii )x(f


35

69



Model-Based Clustering: Finite Mixtures

each component models one cluster

clustering = mixture fitting

f1(x)

f2(x)

f3(x)

k

i i

i=1

f(x|Θ) = α f(x|θ )


70



Gaussian Mixture Decomposition

Mixture Model:

:.

.

k

i i

i=1

f(x|Θ) = α f(x|θ )Component densities

Mixing probabilities:

k

1i

i 1iα 0 and

f ( | )ix Gaussian

Arbitrary covariances: ),|x(N)|x(f iii Cμ

},...,,,...,,,,...,,{ 1k21k21k21 CCCμμμ

Common covariance: ),|x(N)|x(f ii Cμ

},...,,,,...,,{ 1k21k21 Cμμμ


36

71



Mixture Model Fitting

n independent observations

Mixture density model:

Estimate that maximizes (log)likelihood (ML estimate of ):

k

i i

i=1

f(x|Θ) = α f(x|θ )

}x,...,x,x{ )n()2()1(x

),x(Lmaxargˆ

n

1j

k

1i

i

)j(

i )|(flog x

mixture

n

1j

)j()|x(flog),x(L


72



Gaussian Mixture Model Fitting

Problem: the likelihood function is unbounded as

. There is no global maximum

. Unusual goal: a “good” local maximum

Example: a 2-component Gaussian mixture

some data points

0)det( i C

2

)x(

σ2

)x(

2

2

21

22

21

21

e2

1e

σ2),σ,μ,μ|x(f

}x,...,x,x{ n21

n

2j

2

)x(

2log(...)e

2

1

σ2log),x(L

221

0σas, 2

11 xμ


37

73




ML estimate has no closed-form solution

Standard alternative: expectation-maximization (EM) algorithm:

: Missing data problem:

. Observed data: }x,...,x,x{ )n()2()1(x


74




ML estimate has no closed-form solution

Standard alternative: expectation-maximization (EM) algorithm:

: Missing data problem:

. Observed data:

. Missing data:

Missing labels (“colors”)

“1” at position i x( j) generated by component i

. Complete log-likelihood function:

}x,...,x,x{ )n()2()1(x},...,,{ )n()2()1(

zzzz

T)j(

k

)j(

2

)j(

1

)j( 0...010...0,z...,,z,z z

n

1j

k

1i

i

)j(

ii

)j(

ic )|(flogz),,(L xzx

)|z,x(flog )j()j(


38

75



The EM Algorithm

The E-step: compute the expected value of

The M-step: update parameter estimates

),,(Lc zx

)ˆ,(Q]ˆ,x|),,(L[E )t()t(

c zx

)ˆ,(Qmaxargˆ )t()1t(


76



EM-Algorithm for Mixture of Gaussian

Iterative procedure:

The E-step:

The M-step:

,...ˆ,ˆ,...,ˆ,ˆ )1t()t()1()0(

( ) ( )( , )

( ) ( )

1

ˆˆ ( | )

ˆˆ ( | )

j tj t i i

i kj t

n n

n

f xw

f x

)t,j(

iw Estimate, at iteration t, of the probability that x( j) was

produced by component i

n

1j

)t,j(

i

)1t(

i wn

1ˆ

n

1j

)t,j(

i

n

1j

)j()t,j(

i

)1t(

i

w

xw

ˆ

n

1j

)t,j(

i

n

1j

T)1t(

i

)j()1t(

i

)j()t,j(

i

)1t(

i

w

)ˆx()ˆx(w

C


39

77



Mixture Gaussian Decomposition: Model Selection

How many components?

: The maximized likelihood never decreases when k increases

:

: Usually:

: Criteria in this cathegory:

. Minimum description length (MDL), Rissanen and Ristad, 1992.

. Akaike’s information criterion (AIC), Whindham and Cutler, 1992.

. Schwarz´s Bayesian inference criterion (BIC), Fraley and Raftery, 1998.

: Resampling-based techniques

. Bootstrap for clustering, Jain and Moreau, 1987.

. Bootstrap for Gaussian mixtures, McLachlan, 1987.

. Cross validation, Smyth, 1998.

maxminmin)k( k,...,1k,kk),ˆ(minargk C

)ˆ(P)ˆ,(L)ˆ( )k()k()k( xC


78




Given (k) , shortest code-length for x (Shannon’s):

MDL criterion: parameter code length

: Total code-length (two part code):

: MDL criterion:

)|(flog)|(L )k()k( xx

)(L)|(flog),(L )k()k()k( xx

Parameter

code-length

)(L)|(flogminargˆ)k()k()k(

)k(

x

L(each component of (k) ) = )'nlog(2

1

'n Amount of data from which the parameter is estimated


40

79




Classical MDL: n’ = n

Mixtures MDL (MMDL) (Figueiredo, 2002)

: Using EM and redefining the M-Step

)nlog(2

k)|(flogminargˆ

)k()k(

)k( x

k

1m

m

pp

)k()k( )log(2

N)nlog(

2

)1N(k)|(flogminargˆ

)k(

x

2

Nw

pn

1j

)t,j(

ii

y

y+

k

1m

m

i)1t(

iˆ

This M-step may annihilate components

M. Figueiredo and A. K. Jain, Unsupervised Learning of Finite Mixture Models, IEEE TPAMI, 2002

Np is the number of parameters

of each component.

Gaussian, arbitrary covariances

Np = d + d(d+1)/2

Gaussian, common covariance:

Np = d


80



Gaussian Mixture Decomposition – EM MMDL

Examples


41

81



Gaussian Mixture Decomposition – EM MMDL

Examples


82




Strength

: Model-based approach

: Good for Gaussian data

: Handles touching clusters

Weakness

: Unable to detect arbitrary shaped clusters



42

83




It is a local (greedy) algorithm (likelihood never decreases)

=> Initialization dependent

74 iterations 270 iterations


84



References

A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall, 1988.

A.K. Jain and M. N. Murty and P.J. Flynn, Data Clustering: A Review, ACM Computing Surveys, vol 31. No

3,pp 264-323, 1999.

Data Mining: Concepts and Techniques, J. Han and M. Kamber, Morgan Kaufmann Pusblishers, 2001.

A. Y. Ng and M. I. Jordan and Y. Weiss, On Spectral Clustering: Analysis and an algorithm, in Advances in

Neural Information Processing Systems 14, T. G. Dietterich and S. Becker and Z. Ghahramani, MIT

Press, 2002,

M. Figueiredo and A. K. Jain, Unsupervised Learning of Finite Mixture Models, IEEE TPAMI, 2002

A. L. N. Fred, J. M. N. Leitão, A New Cluster Isolation Criterion Based on Dissimilarity Increments, IEEE

Trans. On Pattern Analysis and Machine Intelligence, vol 25, NO. 8, pp 2003.


Unsupervised Learning Clustering Algorithms - IT -...

Documents

Transcript of Unsupervised Learning Clustering Algorithms - IT -...