The K-Means Method in Cluster Analysis and Its Intelligentization

B.G. Mirkin
Professor, Department of Data Analysis and Artificial Intelligence, National Research University Higher School of Economics, Moscow, Russia
Professor Emeritus, School of Computer Science & Information Systems, Birkbeck College, University of London, UK
Outline:
- Clustering as empirical classification
- K-Means and its issues: (1) determining K and initialization; (2) weighting variables
- Addressing (1): data recovery clustering and K-Means (Mirkin 1987, 1990); one-by-one clustering: Anomalous Patterns and iK-Means; other approaches; computational experiment
- Addressing (2): three-stage K-Means; Minkowski K-Means; computational experiment
- Conclusion
WHAT IS CLUSTERING; WHAT IS DATA
K-MEANS CLUSTERING: Conventional K-Means; Initialization of K-Means; Intelligent K-Means; Mixed Data; Interpretation Aids
WARD HIERARCHICAL CLUSTERING: Agglomeration; Divisive Clustering with Ward Criterion; Extensions of Ward Clustering
DATA RECOVERY MODELS: Statistics Modelling as Data Recovery; Data Recovery Model for K-Means; for Ward; Extensions to Other Data Types; One-by-One Clustering
DIFFERENT CLUSTERING APPROACHES: Extensions of K-Means; Graph-Theoretic Approaches; Conceptual Description of Clusters
GENERAL ISSUES: Feature Selection and Extraction; Similarity on Subsets and Partitions; Validity and Reliability
Referred recent work:
- B.G. Mirkin, M. Chiang (2010) Intelligent choice of the number of clusters in K-Means clustering: an experimental study with different cluster spreads, Journal of Classification, 27(1), 3-41
- B.G. Mirkin (2011) Choosing the number of clusters, WIREs Data Mining and Knowledge Discovery, 1(3), 252-260
- B.G. Mirkin, R. Amorim (2012) Minkowski metric, feature weighting and anomalous cluster initializing in K-Means clustering, Pattern Recognition, 45, 1061-1075
What is clustering?
Finding homogeneous fragments, mostly sets of entities, in datasets for further analysis
Example: W. Jevons (1857) planet clusters (updated by Mirkin 1996)
Pluto does not fit into either of the two planet clusters: it originated another cluster (September 2006)
Example: a few clusters. Clustering interface to Web search engines (Grouper). Query: Israel (after O. Zamir and O. Etzioni 2001)

| Cluster | # sites | Interpretation |
|---------|---------|----------------|
| 1 | 24 | Society, religion: Israel and Judaism; Judaica collection |
| 2 | 12 | Middle East, war, history: the state of Israel; Arabs and Palestinians |
| 3 | 31 | Economy, travel: Israel Hotel Association; electronics in Israel |
Clustering algorithms:
- Nearest neighbour
- Agglomerative clustering
- Divisive clustering
- Conceptual clustering
- K-Means
- Kohonen SOM
- Spectral clustering
- …
Batch K-Means: a generic clustering method

Entities are represented as multidimensional points (*).
0. Put K hypothetical centroids (seeds); here K = 3 centroids (@)
1. Assign points to the centroids according to the minimum-distance rule
2. Put the centroids at the gravity centres of the clusters thus obtained
3. Iterate steps 1 and 2 until convergence
4. Output the final centroids and clusters

[Figures: a point cloud (*) with K = 3 hypothetical centroids (@); successive slides show the assignments and centroid updates of steps 1-3 until the final centroids and clusters are output.]
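The steps above can be sketched in plain Python; this is a minimal illustration (hypothetical function names, squared Euclidean distance), not a reference implementation. It also evaluates the K-Means criterion W(S, c), the summary distance of entities to their cluster centroids:

```python
def dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def batch_kmeans(points, seeds, max_iter=100):
    """Batch K-Means: alternate the minimum-distance rule and centroid updates."""
    centroids = [list(c) for c in seeds]  # step 0: K hypothetical centroids
    clusters = [[] for _ in centroids]
    for _ in range(max_iter):
        # step 1: assign each point to its nearest centroid
        clusters = [[] for _ in centroids]
        for p in points:
            k = min(range(len(centroids)), key=lambda k: dist2(p, centroids[k]))
            clusters[k].append(p)
        # step 2: put each centroid at the gravity centre of its cluster
        new_centroids = [
            [sum(xs) / len(c) for xs in zip(*c)] if c else centroids[k]
            for k, c in enumerate(clusters)
        ]
        if new_centroids == centroids:  # step 3: stop at convergence
            break
        centroids = new_centroids
    return centroids, clusters  # step 4: final centroids and clusters

def criterion_W(clusters, centroids):
    """W(S, c): summary squared distance of entities to their cluster centroids."""
    return sum(dist2(p, centroids[k]) for k, c in enumerate(clusters) for p in c)
```

On two well-separated groups this converges in a couple of iterations; the final W(S, c) is what the number-of-clusters indices discussed later compare across K.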
K-Means criterion: summary distance to cluster centroids

Minimize

$$W(S, c) = \sum_{k=1}^{K} \sum_{i \in S_k} d(y_i, c_k) = \sum_{k=1}^{K} \sum_{i \in S_k} \sum_{v=1}^{M} (y_{iv} - c_{kv})^2$$
Advantages of K-Means:
- Models typology building
- Simple "data recovery" criterion
- Computationally effective
- Can be utilised incrementally, 'on-line'

Shortcomings of K-Means:
- Initialisation: no advice on K or on the initial centroids
- No deep minima: the algorithm may stop at a shallow local minimum
- No defence against irrelevant features
Initial Centroids: Correct (two-cluster case) — [figures: initial and final centroid positions leading to the right partition]

Different Initial Centroids: Wrong — [figures: initial and final centroid positions leading to a wrong partition]
(1) To address:
- The number of clusters (issue: the criterion cannot compare different K directly, since W(K) < W(K-1) always)
- The initial setting
- A deeper minimum

The two are interrelated: a good initial setting leads to a deeper minimum.
Number K: the conventional approach

- Take a range R(K) of K values, say K = 3, 4, …, 15
- For each K in R(K), run K-Means 100-200 times from randomly chosen initial centroids and take the best result, W(S, c) = W(K)
- Compare W(K) over all K in R(K) in a special way and choose the best, for example:
  - Gap statistic (2001)
  - Jump statistic (2003)
  - Hartigan (1975): in the ascending order of K, pick the first K at which H(K) = (W(K) / W(K+1) - 1)(N - K - 1) ≤ 10
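Hartigan's rule lends itself to a short sketch (the helper name is hypothetical; the W(K) values are assumed to be the best criterion values from many K-Means runs at each K):

```python
def hartigan_k(W, N, threshold=10):
    """Pick the first K, in ascending order, whose Hartigan index falls to `threshold` or below.

    W maps each K in the range to its best criterion value W_K; N is the number of entities."""
    for K in sorted(W):
        if K + 1 not in W:
            break
        H = (W[K] / W[K + 1] - 1) * (N - K - 1)
        if H <= threshold:
            return K
    return None  # no K in the range qualifies
```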
(1) Addressing
- the number of clusters
- the initial setting

with a PCA-like method in the data recovery approach
Representing a partition. Cluster k is represented by:
- its centroid c_kv (v: a feature)
- a binary 1/0 membership z_ik (i: an entity)
Basic equations (same as for PCA, but the score vectors z_k are constrained to be binary):

$$y_{iv} = \sum_{k=1}^{K} c_{kv} z_{ik} + e_{iv}$$

where y is a data entry, z a 1/0 membership (not a score), c a cluster centroid, and N the cardinality; i indexes entities, v features/categories, k clusters.
Quadratic data scatter decomposition (Pythagorean):

$$\sum_{i=1}^{N} \sum_{v=1}^{V} y_{iv}^2 = \sum_{k=1}^{K} \sum_{v=1}^{V} N_k c_{kv}^2 + \sum_{k=1}^{K} \sum_{i \in S_k} \sum_{v=1}^{V} (y_{iv} - c_{kv})^2$$

for the model $y_{iv} = \sum_{k=1}^{K} c_{kv} z_{ik} + e_{iv}$.

K-Means is an alternating least-squares minimisation of the residual term. Here y is a data entry, z a 1/0 membership, c a cluster centroid, and N_k the cardinality of cluster S_k; i indexes entities, v features/categories, k clusters.
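The decomposition is easy to verify numerically; a small sketch (plain Python, hypothetical names) computes the total scatter and its explained and residual parts, with centroids taken as cluster means:

```python
def scatter_decomposition(clusters):
    """Return (total, explained, residual): T = sum y^2, B = sum N_k c_kv^2, W = sum (y - c)^2."""
    total = sum(x * x for c in clusters for p in c for x in p)
    explained = 0.0
    residual = 0.0
    for c in clusters:
        centroid = [sum(xs) / len(c) for xs in zip(*c)]
        explained += len(c) * sum(v * v for v in centroid)
        residual += sum((x - v) ** 2 for p in c for x, v in zip(p, centroid))
    return total, explained, residual
```

For any partition the three parts satisfy T = B + W exactly, which is the Pythagorean identity above.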
Equivalent criteria (1)

A. Bilinear residuals squared, MIN — minimizing the difference between the data and the cluster structure:

$$\sum_{i=1}^{N} \sum_{v \in V} e_{iv}^2$$

B. Distance-to-centre squared, MIN — minimizing the summary distance to cluster centroids:

$$W = \sum_{k=1}^{K} \sum_{i \in S_k} d(i, c_k)$$

with d the squared Euclidean distance.
Equivalent criteria (2)

C. Within-group error squared, MIN — minimizing the difference between the data and the cluster structure:

$$\sum_{k=1}^{K} \sum_{v \in V} \sum_{i \in S_k} (c_{kv} - y_{iv})^2$$

D. Within-group variance, weighted, MIN — minimizing the within-cluster variance:

$$\sum_{k=1}^{K} |S_k| \, \sigma^2(S_k)$$
Equivalent criteria (3)

E. Semi-averaged within-cluster distance, MIN — minimizing dissimilarities within clusters:

$$\sum_{k=1}^{K} \frac{1}{2|S_k|} \sum_{i, j \in S_k} d(i, j)$$

F. Semi-averaged within-cluster similarity, MAX — maximizing similarities within clusters:

$$\sum_{k=1}^{K} \frac{1}{|S_k|} \sum_{i, j \in S_k} a(i, j), \quad \text{where } a(i, j) = \langle y_i, y_j \rangle$$
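The equivalence between the centroid-based criterion and the semi-averaged within-cluster distance can be checked on toy data (a sketch; squared Euclidean distance, centroids as cluster means, sums over ordered pairs):

```python
def dist2(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def W_centroids(clusters):
    """Criterion B: summary squared distance to cluster centroids."""
    W = 0.0
    for c in clusters:
        centroid = [sum(xs) / len(c) for xs in zip(*c)]
        W += sum(dist2(p, centroid) for p in c)
    return W

def W_semi_averaged(clusters):
    """Criterion E: within-cluster distances over ordered pairs, divided by 2|S_k|."""
    return sum(
        sum(dist2(p, q) for p in c for q in c) / (2 * len(c))
        for c in clusters
    )
```

The two functions agree on every partition, with no centroids needed in the second form.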
Equivalent criteria (4)

G. Distant centroids, MAX — finding anomalous types:

$$\sum_{k=1}^{K} |S_k| \sum_{v \in V} c_{kv}^2$$

H. Consensus partition, MAX — maximizing the correlation between the sought partition and the given variables:

$$\sum_{v=1}^{V} \eta^2(S, v)$$

with η²(S, v) the correlation ratio between partition S and variable v.
Equivalent criteria (5)

I. Spectral clusters, MAX — maximizing the summary Rayleigh quotient over binary vectors:

$$\sum_{k=1}^{K} \frac{z_k^T Y Y^T z_k}{z_k^T z_k}$$
PCA-inspired Anomalous Pattern clustering

$$y_{iv} = c_v z_i + e_{iv}, \quad \text{where } z_i = 1 \text{ if } i \in S \text{ and } z_i = 0 \text{ if } i \notin S$$

With the squared Euclidean distance, the data scatter decomposes as

$$\sum_{i=1}^{N} \sum_{v=1}^{V} y_{iv}^2 = N_S \sum_{v=1}^{V} c_{Sv}^2 + \sum_{i \in S} \sum_{v=1}^{V} (y_{iv} - c_{Sv})^2 + \sum_{i \notin S} \sum_{v=1}^{V} y_{iv}^2$$

or, in terms of distances to the reference point 0,

$$\sum_{i=1}^{N} d(i, 0) = N_S \, d(c_S, 0) + \sum_{i \in S} d(i, c_S) + \sum_{i \notin S} d(i, 0)$$

so c_S must be anomalous, that is, interesting.
Initial setting with an Anomalous Pattern cluster — [figure: the "Tom Sawyer" data cloud with the first Anomalous Pattern cluster]

Anomalous Pattern clusters: iterate — [figure: the "Tom Sawyer" cloud with reference point 0 and successively extracted clusters 1, 2, 3]
iK-Means: Anomalous clusters + K-Means. After extracting 2 clusters (how can one know that 2 is right?) — [figure: final partition]
iK-Means: defining K and the initial setting with iterative Anomalous Pattern clustering

1. Find all Anomalous Pattern clusters
2. Remove the smaller (e.g., singleton) clusters
3. Set K to the number of remaining clusters and initialise K-Means with their centres
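The procedure can be sketched as follows — a minimal illustration under stated assumptions: the grand mean serves as the reference point, distances are squared Euclidean, ties break by list order, and the function names are hypothetical, not the author's code:

```python
def dist2(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def anomalous_pattern(points, reference):
    """Extract one Anomalous Pattern cluster: seed at the entity farthest from the
    reference, then alternate membership (closer to the centroid than to the
    reference) and centroid updates until the centroid stabilises."""
    centroid = max(points, key=lambda p: dist2(p, reference))
    while True:
        cluster = [p for p in points if dist2(p, centroid) < dist2(p, reference)]
        new_centroid = tuple(sum(xs) / len(cluster) for xs in zip(*cluster))
        if new_centroid == tuple(centroid):
            return cluster, new_centroid
        centroid = new_centroid

def ik_means_init(points, min_size=2):
    """Iterated AP: extract clusters one by one, drop the small ones, return centres.
    K is the number of centres; they seed the subsequent K-Means run."""
    reference = tuple(sum(xs) / len(points) for xs in zip(*points))  # grand mean
    remaining = list(points)
    centres = []
    while remaining:
        cluster, centroid = anomalous_pattern(remaining, reference)
        if len(cluster) >= min_size:
            centres.append(centroid)
        remaining = [p for p in remaining if p not in cluster]
    return centres
```

On data with two far-away groups and a few entities near the grand mean, the far groups come out as AP clusters while near-reference singletons are dropped, so K is set to the number of substantial clusters.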
Study of eight number-of-clusters methods (joint work with Mark Chiang):

- Variance based: Hartigan (HK); Calinski & Harabasz (CH); Jump Statistic (JS)
- Structure based: Silhouette Width (SW)
- Consensus based: Consensus Distribution area (CD); Consensus Distribution mean (DD)
- Sequential extraction of APs (iK-Means): Least Squares (LS); Least Moduli (LM)
Experimental results at 9 Gaussian clusters (3 spread patterns), 1000 × 15 data size — [table: for each method (HK, CH, JS, SW, CD, DD, LS, LM), the estimated number of clusters and the Adjusted Rand Index under large and small spreads, with 1-, 2- and 3-time winners marked; two winners counted each time]
(2) Addressing: weighting features according to relevance

$$\sum_{k=1}^{K} \sum_{i \in I} \sum_{v=1}^{M} s_{ik} \, w_v^{\beta} \, |y_{iv} - c_{kv}|^{\beta} = \sum_{k=1}^{K} \sum_{i \in S_k} d(y_i, c_k)$$

w: feature weights = scale factors

3-step K-Means, repeated till convergence:
- given s and c, find w (the weights)
- given w and c, find s (the clusters)
- given s and w, find c (the centroids)
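The weight-finding step can be sketched as follows; this assumes the dispersion-based update w_v = 1 / Σ_u (D_v / D_u)^(1/(β−1)) used in weighted and Minkowski weighted K-Means, with D_v the within-cluster dispersion of feature v — an illustration with hypothetical names, not the author's code:

```python
def update_weights(clusters, centroids, beta):
    """Weight step of the 3-step procedure: given clusters s and centroids c, find w.

    Features with smaller within-cluster dispersion D_v receive larger weights."""
    V = len(centroids[0])
    # D[v]: within-cluster dispersion of feature v under the current partition
    D = [
        sum(abs(p[v] - centroids[k][v]) ** beta for k, c in enumerate(clusters) for p in c)
        for v in range(V)
    ]
    return [
        1.0 / sum((D[v] / D[u]) ** (1.0 / (beta - 1)) for u in range(V))
        for v in range(V)
    ]
```

Under this update a widely dispersed feature gets a large D_v and hence a weight near zero, matching the effect described on the Minkowski-metric slide.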
Minkowski centres: minimize over c

$$d(c) = \sum_{i \in S_k} |y_{iv} - c|^{\beta}$$

At β > 1, d(c) is convex, so a gradient method applies.
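Per feature, the Minkowski centre is the scalar c minimising d(c); since d is convex for β > 1, a simple one-dimensional search can stand in for the gradient method mentioned on the slide (a sketch with hypothetical names):

```python
def minkowski_centre(values, beta, tol=1e-9):
    """Minimise d(c) = sum(|y - c| ** beta) over c by ternary search.

    Valid because d(c) is convex for beta > 1; the minimiser lies in [min, max]."""
    def d(c):
        return sum(abs(y - c) ** beta for y in values)
    lo, hi = min(values), max(values)
    while hi - lo > tol:
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if d(m1) < d(m2):
            hi = m2  # the minimiser is left of m2
        else:
            lo = m1
    return (lo + hi) / 2
```

At β = 2 this recovers the arithmetic mean; as β approaches 1 it approaches the median.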
Minkowski metric effects

- The more uniform the distribution of the entities over a feature, the smaller its weight; for a uniform distribution, w = 0
- The best Minkowski power β is data dependent
- The best β can be learnt from the data in a semi-supervised manner (with clustering of all objects)
- Example: on Fisher's Iris data, iMWK-Means makes only 5 errors (a record)
Conclusion: the data recovery, K-Means-wise model of clustering is a tool that involves a wealth of interesting criteria for mathematical investigation and application projects.

Further work:
- extending the approach to other data types: text, sequence, image, web page
- upgrading K-Means to address the interpretation of the results

[Diagram: data recovery framework — Data → Coder (clustering) → Model (Clusters) → Decoder (data recovery)]
HEFCE survey of students' satisfaction

HEFCE method: ALL — 93 of the highest mark; STRATA — 43 best, ranging from 71.8 to 84.6