Data clustering: Topics of Current Interest


Page 1: Data clustering: Topics of Current Interest

Data clustering: Topics of Current Interest

Boris Mirkin 1,2

1 National Research University Higher School of Economics, Moscow, RF

2 Birkbeck University of London, UK

Supported by:
- “Teacher-Student” grants from the Research Fund of NRU HSE Moscow (2011-2013)
- International Lab for Decision Analysis and Choice, NRU HSE Moscow (2008 – pres.)
- Laboratory of Algorithms and Technologies for Networks Analysis, NRU HSE Nizhniy Novgorod, Russia (2010 – pres.)

Page 2: Data clustering: Topics of Current Interest

Data clustering: Topics of Current Interest

1. K-Means clustering and two issues
   1.1 Finding the right number of clusters
       - before clustering (Anomalous clusters)
       - while clustering (divisive: no minima of the density function)
   1.2 Weighting features (3-step iterations)
2. K-Means at similarity clustering (kernel K-Means)
3. Semi-average similarity clustering
4. Consensus clustering
5. Spectral clustering, Threshold clustering and Modularity clustering
6. Laplacian pseudo-inverse transformation
7. Conclusion

Page 3: Data clustering: Topics of Current Interest

Batch K-Means: a generic clustering method

Entities are presented as multidimensional points (*).
0. Put K hypothetical centroids (seeds).
1. Assign points to the centroids according to the minimum-distance rule.
2. Put centroids at the gravity centres of the clusters thus obtained.
3. Iterate 1 and 2 until convergence.

[Figure: a scatter of points (*) with K = 3 hypothetical centroids (@)]


Page 6: Data clustering: Topics of Current Interest

K-Means: a generic clustering method

Entities are presented as multidimensional points (*).
0. Put K hypothetical centroids (seeds).
1. Assign points to the centroids according to the minimum-distance rule.
2. Put centroids at the gravity centres of the clusters thus obtained.
3. Iterate 1 and 2 until convergence.
4. Output final centroids and clusters.

[Figure: final centroids (@) placed within their clusters of points (*)]
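To make the loop above concrete, here is a minimal NumPy sketch of batch K-Means; the function name, the random seeding of centroids from K distinct entities, and the convergence test are illustrative choices, not taken from the slides.

```python
import numpy as np

def batch_kmeans(Y, K, max_iter=100, seed=0):
    """Batch K-Means on an N x M data matrix Y; returns final centroids and labels."""
    rng = np.random.default_rng(seed)
    # Step 0: K hypothetical centroids (seeds) -- here, K distinct entities at random.
    centroids = Y[rng.choice(len(Y), size=K, replace=False)].astype(float)
    labels = None
    for _ in range(max_iter):
        # Step 1: assign every point to its nearest centroid (minimum-distance rule).
        dists = ((Y[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        # Step 3: iterate until the assignment stops changing.
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 2: move each centroid to the gravity centre of its cluster.
        for k in range(K):
            members = Y[labels == k]
            if len(members) > 0:
                centroids[k] = members.mean(axis=0)
    # Step 4: output final centroids and clusters.
    return centroids, labels
```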

Page 7: Data clustering: Topics of Current Interest

K-Means criterion: Summary distance to cluster centroids

Minimize the summary distance of entities to their cluster centroids:

W(S, c) = \sum_{k=1}^{K} \sum_{i \in S_k} d(y_i, c_k) = \sum_{k=1}^{K} \sum_{i \in S_k} \sum_{v=1}^{M} (y_{iv} - c_{kv})^2,

where d is the squared Euclidean distance, y_i are the entity points and c_k the cluster centroids.

Page 8: Data clustering: Topics of Current Interest

Advantages of K-Means:
- Models typology building
- Simple “data recovery” criterion
- Computationally effective
- Can be utilised incrementally, ‘on-line’

Shortcomings of K-Means:
- Initialisation: no advice on K or on the initial centroids
- No guarantee of reaching a deep minimum, only a local one
- No defence against irrelevant features

Page 9: Data clustering: Topics of Current Interest


Issue: How should the number and location of initial centers be chosen? (Mirkin 1998; Chiang and Mirkin 2010)

Minimize W(S, c) over S and c.

The data scatter (the sum of squared data entries) decomposes as

T(Y) = \sum_{i \in I} \sum_{v=1}^{M} y_{iv}^2 = W(S, c) + B(S, c).

The data scatter is constant while partitioning, so an equivalent criterion is:

Maximize B(S, c) = \sum_{k=1}^{K} N_k \langle c_k, c_k \rangle,

where N_k is the number of entities in S_k and \langle c_k, c_k \rangle is the squared Euclidean distance between 0 and c_k.
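A quick numerical check of this decomposition, on hypothetical random data with an arbitrary partition (all names are mine): with c_k taken as within-cluster means, W(S,c) + B(S,c) reproduces the data scatter.

```python
import numpy as np

rng = np.random.default_rng(1)
Y = rng.normal(size=(100, 4))
Y = Y - Y.mean(axis=0)                     # center the data: 0 is the grand mean
labels = rng.integers(0, 3, size=100)      # an arbitrary partition into K = 3 clusters

T = (Y ** 2).sum()                         # data scatter
W = sum(((Y[labels == k] - Y[labels == k].mean(axis=0)) ** 2).sum() for k in range(3))
B = sum((labels == k).sum() * (Y[labels == k].mean(axis=0) ** 2).sum() for k in range(3))
print(np.isclose(T, W + B))                # True: data scatter = W(S,c) + B(S,c)
```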

Page 10: Data clustering: Topics of Current Interest


Issue: How should the number and location of initial centers be chosen? (2)

Maximize B(S, c) = \sum_{k=1}^{K} N_k \langle c_k, c_k \rangle, where N_k = |S_k|.

Preprocess the data by centering, so that 0 is the grand mean; \langle c_k, c_k \rangle is then the squared Euclidean distance between 0 and c_k.

Look for anomalous and populated clusters, i.e., large clusters whose centroids lie far away from the origin.

Page 11: Data clustering: Topics of Current Interest


Issue: How should the number and location of initial centers be chosen? (3)

Preprocess the data by centering to a reference point, typically the grand mean, so that 0 represents the grand mean from then on. Build just one Anomalous cluster.

Page 12: Data clustering: Topics of Current Interest


Issue: How should the number and location of initial centers be chosen? (4)

Preprocess the data by centering to a reference point, typically the grand mean, so that 0 represents the grand mean. Build an Anomalous cluster S:
1. The initial center c is the entity farthest away from 0.
2. Cluster update: if d(y_i, c) < d(y_i, 0), assign y_i to S.
3. Centroid update: compute the within-S mean c'; if c' ≠ c, set c ← c' and go to 2. Otherwise, halt.
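A minimal sketch of this Anomalous cluster routine, assuming the data matrix Y has already been centered so that 0 is the reference point (the function name is mine):

```python
import numpy as np

def anomalous_cluster(Y):
    """One Anomalous cluster over centered data Y (reference point = origin)."""
    # 1. Initial center c: the entity farthest away from 0.
    norms = (Y ** 2).sum(axis=1)           # squared distances to the origin
    c = Y[norms.argmax()].copy()
    while True:
        # 2. Cluster update: i goes to S if it is closer to c than to 0.
        S = ((Y - c) ** 2).sum(axis=1) < norms
        # 3. Centroid update: within-S mean; halt when it stops changing.
        c_new = Y[S].mean(axis=0)
        if np.allclose(c_new, c):
            return S, c
        c = c_new
```

In iK-Means (Chiang, Mirkin 2010), such clusters are extracted repeatedly from the remaining entities; the surviving centers then define K and initialize K-Means.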

Page 13: Data clustering: Topics of Current Interest


Issue: How should the number and location of initial centers be chosen? (5)

The Anomalous cluster procedure is (almost) K-Means, up to: (i) the number of clusters is K = 2, the “anomalous” cluster and the “main body” of entities around 0; (ii) the center of the “main body” cluster is forcibly kept at 0; (iii) the entity farthest away from 0 initializes the anomalous cluster.

Page 14: Data clustering: Topics of Current Interest


Issue: How should the number and location of initial centers be chosen? (6)

iK-Means, initialized with Anomalous clusters, is superior to the following methods for choosing the number of clusters (Chiang, Mirkin 2010):

Method Acronym

Calinski and Harabasz index CH

Hartigan rule HK

Gap statistic GS

Jump statistic JS

Silhouette width SW

Consensus distribution area CD

Average distance between partitions DD

Square error iK-Means LS

Absolute error iK-Means LM

Page 15: Data clustering: Topics of Current Interest

Issue: Weighting features according to relevance, and the Minkowski β-distance (Amorim, Mirkin 2012)

W(S, c, w) = \sum_{i \in I} \sum_{k=1}^{K} \sum_{v=1}^{M} s_{ik}\, w_v^{\beta} |y_{iv} - c_{kv}|^{\beta} = \sum_{k=1}^{K} \sum_{i \in S_k} d_{\beta}(y_i, c_k)

w: feature weights = scale factors

3-step K-Means:
- Given s, c, find w (weights)
- Given w, c, find s (clusters)
- Given s, w, find c (centroids)
- until convergence

Page 16: Data clustering: Topics of Current Interest

Issue: Weighting features according to relevance, and the Minkowski β-distance (2)

Minkowski centers

• For a feature v within a cluster S_k, minimize over c: d(c) = \sum_{i \in S_k} |y_{iv} - c|^{\beta}
• At β > 1, d(c) is convex
• Gradient method
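As an illustration, a one-dimensional Minkowski center can be found with any scalar minimizer once β > 1 makes d(c) convex; the sketch below uses SciPy's bounded minimizer as a stand-in for the gradient method (the function name and test values are mine):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def minkowski_center(values, beta):
    """Minimize d(c) = sum_i |values_i - c|**beta over c (convex for beta > 1)."""
    d = lambda c: np.sum(np.abs(values - c) ** beta)
    res = minimize_scalar(d, bounds=(values.min(), values.max()), method="bounded")
    return res.x

y = np.array([0.0, 0.2, 0.3, 0.9, 1.0])
print(minkowski_center(y, beta=1.0))   # ~ 0.3, the median
print(minkowski_center(y, beta=2.0))   # ~ 0.48, the mean
```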

Page 17: Data clustering: Topics of Current Interest

Issue: Weighting features according to relevance, and the Minkowski β-distance (3)

Minkowski metric effects

• The more uniform the distribution of the entities over a feature, the smaller its weight
• A uniform distribution gives w = 0
• The best Minkowski power β is data dependent
• The best β can be learnt from the data in a semi-supervised manner (with clustering of all objects)
• Example: on Fisher’s Iris data, iMWK-Means makes only 5 errors (a record)

Page 18: Data clustering: Topics of Current Interest


K-Means kernelized 1

• K-Means: given a quantitative data matrix, find centers c_k and clusters S_k to minimize W(S, c) = \sum_{k=1}^{K} \sum_{i \in S_k} d(x_i, c_k)
• Girolami 2002: W(S, c) can be expressed through the inner products A(i, j) = \langle x_i, x_j \rangle alone, so the kernel trick applies: \langle x_i, x_j \rangle → K(x_i, x_j)
• Mirkin 2012: W(S, c) = Const − \sum_{k=1}^{K} \frac{1}{|S_k|} \sum_{i, j \in S_k} A(i, j)

Page 19: Data clustering: Topics of Current Interest


K-Means kernelized 2

• K-Means equivalent criterion: find a partition {S_1, …, S_K} maximizing

G(S_1, …, S_K) = \sum_{k=1}^{K} \frac{1}{|S_k|} \sum_{i, j \in S_k} A(i, j) = \sum_{k=1}^{K} a(S_k)\,|S_k|,

where a(S_k) is the within-cluster mean similarity.

• Mirkin (1976, 1996, 2012): build the partition {S_1, …, S_K} by finding one cluster at a time.

Page 20: Data clustering: Topics of Current Interest


K-Means kernelized 3

• K-Means equivalent criterion, one cluster S at a time: maximize

g(S) = a(S)\,|S|,

where a(S) is the within-cluster mean similarity.

• AddRemAdd(i) algorithm: a local search that adds/removes one entity at a time, started from entity i.
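A rough local-search sketch in the spirit of AddRem(i); for simplicity it recomputes g(S) from scratch rather than updating it incrementally, and all names are mine:

```python
import numpy as np

def add_rem(A, i):
    """Greedy local search for the semi-average criterion g(S) = a(S)|S|,
    started from the singleton S = {i}; adds/removes one entity at a time."""
    n = len(A)
    S = {i}

    def g(S):
        idx = list(S)
        return A[np.ix_(idx, idx)].sum() / len(idx)   # = a(S)|S|

    best = g(S)
    improved = True
    while improved:
        improved = False
        for j in range(n):
            cand = S - {j} if j in S else S | {j}     # try removing or adding j
            if not cand:
                continue
            val = g(cand)
            if val > best + 1e-12:                    # accept only strict improvements
                S, best, improved = cand, val, True
    return S, best
```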

Page 21: Data clustering: Topics of Current Interest


K-Means kernelized 4

• Semi-average criterion: max g(S) = a(S)|S|, where a(S) is the within-cluster mean, with AddRemAdd(i).

(1) Spectral: over binary membership vectors s, g(S) = \frac{s^{T} A s}{s^{T} s}, a Rayleigh quotient to maximize.

(2) Tight: the average similarity of an entity j with S is greater than a(S)/2 if j ∈ S, and less than a(S)/2 if j ∉ S.

Page 22: Data clustering: Topics of Current Interest


Three extensions to entire data set

• Partitional: take the set of all entities I
  1. Compute S(i) = AddRem(i) for all i ∈ I;
  2. Take S = S(i*) for the i* maximizing f(S(i)) over all i ∈ I;
  3. Remove S from I; if I is not empty, go to 1; else halt.

• Additive: take the set of all entities I
  1. Compute S(i) = AddRem(i) for all i ∈ I;
  2. Take S = S(i*) for the i* maximizing f(S(i)) over all i ∈ I;
  3. Subtract a(S) s s^T from A; if the stop-condition does not hold, go to 1; else halt.

• Explorative: take the set of all entities I
  1. Compute S(i) = AddRem(i) for all i ∈ I;
  2. Leave those S(i) that do not overlap much.

Page 23: Data clustering: Topics of Current Interest


Consensus partition I: Given partitions R1,R2,…,Rn, find an “average” R

• Partition R = {R_1, R_2, …, R_K} → incidence matrix Z = (z_{ik}): z_{ik} = 1 if i ∈ R_k; z_{ik} = 0 otherwise

• Partition R = {R_1, R_2, …, R_K} → projector matrix P = Z(Z^T Z)^{-1} Z^T

• Criterion: minimize \phi(R) = \sum_{m=1}^{n} ||Z_m − P Z_m||^2, where Z_m is the incidence matrix of the given partition R_m (Mirkin, Muchnik 1981 in Russian; Mirkin 2012)

Page 24: Data clustering: Topics of Current Interest


Consensus partition 2: Given partitions R1,R2,…,Rn, find an “average” R

\phi(R) = \sum_{m=1}^{n} ||Z_m − P Z_m||^2 \to \min

This is equivalent to maximizing

G(R_1, …, R_K) = \sum_{k=1}^{K} \frac{1}{|R_k|} \sum_{i, j \in R_k} A(i, j) = \sum_{k=1}^{K} a(R_k)\,|R_k|,

where A(i, j) is the consensus similarity: the number of given partitions at which i and j fall in the same class.

Page 25: Data clustering: Topics of Current Interest


Consensus partition 3: Given partitions R1,R2,…,Rn, find an “average” R

The same criterion \sum_{m=1}^{n} ||Z_m − P Z_m||^2 \to \min and its equivalent maximization of G(R_1, …, R_K) = \sum_{k=1}^{K} a(R_k)\,|R_k|, as above.

Mirkin, Shestakov (2013): (1) this criterion is superior to a number of contemporary consensus clustering approaches; (2) consensus clustering of the results of multiple runs of K-Means recovers clusters better than the best individual K-Means run.
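Assuming A(i, j) is the co-association count described above (the number of given partitions in which i and j share a class), here is a small sketch of building A and scoring candidate consensus partitions by the equivalent criterion G (all names are mine):

```python
import numpy as np

def coassociation(partitions):
    """A(i,j) = number of given partitions in which i and j are in the same class."""
    n = len(partitions[0])
    A = np.zeros((n, n))
    for labels in partitions:
        labels = np.asarray(labels)
        A += (labels[:, None] == labels[None, :]).astype(float)
    return A

def consensus_score(A, labels):
    """G(R) = sum_k (1/|R_k|) * sum_{i,j in R_k} A(i,j) -- to be maximized."""
    labels = np.asarray(labels)
    return sum(A[np.ix_(labels == k, labels == k)].sum() / (labels == k).sum()
               for k in np.unique(labels))

# Example: three given partitions of 6 entities and two candidate consensus partitions.
parts = [[0, 0, 0, 1, 1, 1], [0, 0, 1, 1, 1, 1], [0, 0, 0, 0, 1, 1]]
A = coassociation(parts)
print(consensus_score(A, [0, 0, 0, 1, 1, 1]))  # higher: agrees with the ensemble
print(consensus_score(A, [0, 1, 0, 1, 0, 1]))  # lower
```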

Page 26: Data clustering: Topics of Current Interest


Additive clustering I. Given similarity A = (A(i, j)), find clusters
• u^1 = (u_i^1), u^2 = (u_i^2), …, u^K = (u_i^K), with u_i^k either 1 or 0 (crisp clusters) or 0 ≤ u_i^k ≤ 1 (fuzzy clusters)
• intensities λ_1, λ_2, …, λ_K

Additive model:
• A(i, j) = λ_1 u_i^1 u_j^1 + … + λ_K u_i^K u_j^K + E(i, j); minimize \sum_{i,j} E(i, j)^2

Shepard, Arabie 1979 (presented 1973); Mirkin 1987 (1976 in Russian)

Page 27: Data clustering: Topics of Current Interest


Additive clustering II. Given similarity A = (A(i, j)): iterative extraction. Mirkin 1987 (1976 in Russian): a double-greedy procedure.

• OUTER LOOP: one cluster at a time,

min L(A, λ, u) = \sum_{i, j} (A(i, j) − λ u_i u_j)^2

1. Find a real λ (intensity) and a 1/0 binary u (membership) to (locally) minimize L(A, λ, u).
2. Take cluster S = { i | u_i = 1 }.
3. Update A ← A − λ u u^T (subtraction of λ within S).
4. Reiterate till a stop-condition.
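A minimal sketch of one pass of this outer loop, assuming a cluster S has already been found (e.g., by an AddRem-type search as sketched earlier); the intensity and the residual update follow the slide, while the function name is mine:

```python
import numpy as np

def extract_cluster(A, S):
    """Given similarity A and a found cluster S (list of indices),
    fit the intensity and subtract the cluster from A."""
    u = np.zeros(len(A))
    u[S] = 1.0
    lam = A[np.ix_(S, S)].sum() / len(S) ** 2   # least-squares optimal intensity: the within-S mean a(S)
    A_residual = A - lam * np.outer(u, u)       # A <- A - lam * u u^T
    return lam, A_residual
```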

Page 28: Data clustering: Topics of Current Interest


Additive clustering III. Given similarity A = (A(i, j)): iterative extraction. Mirkin 1987 (1976 in Russian): a double-greedy procedure.

• The OUTER LOOP, one cluster at a time, leads to the decomposition

T(A) = λ_1^2 |S_1|^2 + λ_2^2 |S_2|^2 + … + λ_K^2 |S_K|^2 + L   (*)

where T(A) = \sum_{i,j} A(i, j)^2 is the similarity data scatter, λ_k^2 |S_k|^2 is the contribution of cluster k, and L is the unexplained residual.

Given S_k, the optimal intensity is λ_k = a(S_k), the within-cluster mean similarity, so the contribution is λ_k^2 |S_k|^2 = [a(S_k)|S_k|]^2, the squared value of the semi-average criterion.

An additive extension of AddRem is applicable. A similar double-greedy approach to fuzzy clustering: Mirkin, Nascimento 2012.

Page 29: Data clustering: Topics of Current Interest


Different criteria I

• Summary Uniform (Mirkin 1976 in Russian): maximize the within-S sum of the similarities A(i, j) − π, where π is a threshold; relates to the criteria considered above.

• Summary Modular(ity) (Newman 2004): maximize the within-S sum of A(i, j) − B(i, j), where B(i, j) = A(i, +) A(+, j) / A(+, +).
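A one-function sketch of the Summary Modular criterion for a candidate cluster S (0-based indices; the function name is mine):

```python
import numpy as np

def modularity_gain(A, S):
    """Summary Modular criterion for a cluster S: within-S sum of A(i,j) - B(i,j),
    where B(i,j) = A(i,+) A(+,j) / A(+,+) is the random-interaction ('null') model."""
    A = np.asarray(A, dtype=float)
    B = np.outer(A.sum(axis=1), A.sum(axis=0)) / A.sum()
    return (A - B)[np.ix_(S, S)].sum()
```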

Page 30: Data clustering: Topics of Current Interest


Different criteria II

• Normalized cut (Shi, Malik 2000): maximize A(S, S)/A(S, +) + A(S̄, S̄)/A(S̄, +), where S̄ is the complement of S, and A(S, S), A(S, +), etc. are summary similarities. Can be reformulated as minimizing a Rayleigh quotient, f(S) = z^T L(A) z / (z^T D z), where z is binary, D = diag(A·1), and L(A) = D − A is the Laplace transformation of the similarity matrix A = (A(i, j)).

Page 31: Data clustering: Topics of Current Interest


FADDIS: Fuzzy Additive Spectral Clustering
• Spectral: B = pseudo-inverse Laplacian of A; one cluster at a time
• Minimize ||B − ξ² u u^T||² (one cluster to find)
• Residual similarity: B ← B − ξ² u u^T
• Stopping conditions
• Equivalent: a Rayleigh quotient to maximize, max u^T B u / (u^T u) [follows from the model, in contrast to the very popular, yet purely heuristic, approach by Shi and Malik 2000]
• Experimentally demonstrated to be competitive over:
  – ordinary graphs for community detection
  – conventional (dis)similarity data
  – affinity data (kernel transformations of feature-space data)
  – in-house synthetic data generators

Page 32: Data clustering: Topics of Current Interest


Competitive at:
• Community detection in ordinary graphs
• Conventional similarity data
• Affinity similarity data
• Lapin-transformed similarity data: D = diag(B·1_N), L = I − D^{−1/2} B D^{−1/2}, L⁺ = pinv(L)
• There are examples at which Lapin doesn’t work
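A direct transcription of the Lapin transform into NumPy, assuming all row sums of the similarity matrix B are positive:

```python
import numpy as np

def lapin(B):
    """Lapin transform: pseudo-inverse of the normalized Laplacian of similarity B."""
    d = B.sum(axis=1)                                  # D = diag(B * 1_N)
    d_inv_sqrt = 1.0 / np.sqrt(d)
    # L = I - D^{-1/2} B D^{-1/2}
    L = np.eye(len(B)) - (B * d_inv_sqrt[:, None]) * d_inv_sqrt[None, :]
    return np.linalg.pinv(L)                           # L+ = pinv(L)
```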

Page 33: Data clustering: Topics of Current Interest


Example at which Lapin does work, but the square-error criterion does not

Page 34: Data clustering: Topics of Current Interest

Conclusion

• Clustering is still far from a mathematical theory; however, it is getting meatier:
  + Gaussian kernels bring in distributions
  + the Laplacian transformation brings in dynamics

• To make it into a theory, there is a way to go:
  – modeling dynamics
  – compatibility of multiple data and metadata
  – interpretation