L2 and L1 Criteria for K-Means Bilinear Clustering
B. Mirkin, School of Computer Science, Birkbeck College, University of London
Advert of a Special Issue: The Computer Journal, "Profiling Expertise and Behaviour". Deadline: 15 Nov. 2006. To submit: http://www.dcs.bbk.ac.uk/~mark/cfp_cj_profiling.txt
Outline: more about properties than methods
• Clustering, K-Means and issues
• Data recovery PCA model and clustering
• Data scatter decompositions for L2 and L1
• Contributions of nominal features
• Explications of the quadratic criterion
• One-by-one cluster extraction: Anomalous Patterns and iK-Means
• The issue of the number of clusters
• Comments on optimisation problems
• Conclusion and future work
WHAT IS CLUSTERING; WHAT IS DATA
K-MEANS CLUSTERING: Conventional K-Means; Initialization of K-Means; Intelligent K-Means; Mixed Data; Interpretation Aids
WARD HIERARCHICAL CLUSTERING: Agglomeration; Divisive Clustering with Ward Criterion; Extensions of Ward Clustering
DATA RECOVERY MODELS: Statistics Modelling as Data Recovery; Data Recovery Model for K-Means; for Ward; Extensions to Other Data Types; One-by-One Clustering
DIFFERENT CLUSTERING APPROACHES: Extensions of K-Means; Graph-Theoretic Approaches; Conceptual Description of Clusters
GENERAL ISSUES: Feature Selection and Extraction; Similarity on Subsets and Partitions; Validity and Reliability
Outline recap. Next: clustering, K-Means and issues.
Example: W. Jevons (1857) planet clusters, updated (Mirkin, 1996)
Pluto does not fit into the two clusters of planets: it originated another cluster (2006).
Clustering algorithms:
• Nearest neighbour
• Ward's
• Conceptual clustering
• K-Means
• Kohonen SOM
• Spectral clustering
• …
K-Means: a generic clustering method

Entities are presented as multidimensional points (*).
0. Put K hypothetical centroids (seeds).
1. Assign points to the centroids according to the minimum-distance rule.
2. Put the centroids at the gravity centres of the clusters thus obtained.
3. Iterate 1 and 2 until convergence.
4. Output the final centroids and clusters.

[Figures: four animation frames showing points (*) and K = 3 hypothetical centroids (@) moving to the cluster centres over the iterations.]
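The procedure translates directly into code. Below is a minimal NumPy sketch (not the author's implementation; the function name, the convergence test and the handling of emptied clusters are mine), with data points as rows of Y and the seeds supplied by the caller:

```python
import numpy as np

def k_means(Y, centroids, max_iter=100):
    """Generic K-Means: alternate the minimum-distance rule (step 1)
    and the gravity-centre update (step 2) until convergence (step 3)."""
    centroids = centroids.astype(float).copy()
    for _ in range(max_iter):
        # Step 1: assign each point to its nearest centroid (squared Euclidean).
        dists = ((Y[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Step 2: move each centroid to the gravity centre of its cluster;
        # a centroid that lost all its points is left where it was.
        new_centroids = np.array([
            Y[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
            for k in range(len(centroids))])
        # Step 3: iterate until the centroids stop moving.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids  # Step 4: output final clusters and centroids
```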
Advantages of K-Means:
• Models typology building
• Computationally effective
• Can be utilised incrementally, 'on-line'

Shortcomings (?) of K-Means:
• Initialisation affects the results
• Convex cluster shapes
Initial centroids: a two-cluster case

[Figures: with correctly placed initial centroids, the initial and final stages recover the two clusters; with different, wrongly placed initial centroids, the final clusters are wrong.]
Issues: K-Means gives no advice on:
• The number of clusters
• The initial setting
• Data normalisation
• Mixed variable scales
• Multiple data sets

K-Means gives limited advice on:
• Interpretation of the results

These can all be addressed with the data recovery approach.
Outline recap. Next: the data recovery PCA model and clustering.
Data recovery for data mining (discovery of patterns in data)
Type of data: similarity; temporal; entity-to-feature.
Type of model: regression; principal components; clusters.

Model: Data = Model_Derived_Data + Residual
Pythagoras: |Data|^m = |Model_Derived_Data|^m + |Residual|^m, m = 1, 2.
The better the fit, the better the model: a natural source of optimisation problems.
K-Means as a data recovery method
Representing a partition:
• Cluster k
• Centroid c_kv (v = feature)
• Binary 1/0 membership z_ik (i = entity)

Basic equation (the same as for PCA, but the score vectors z_k are constrained to be binary):

$$y_{iv} = \sum_{k=1}^{K} c_{kv} z_{ik} + e_{iv}$$

Here y is a data entry, z a 1/0 membership (not a score), c a cluster centroid, N the cardinality; i indexes entities, v features/categories, k clusters.
Outline recap. Next: data scatter decompositions for L2 and L1.
Quadratic data scatter decomposition (classic)
K-Means: alternating least-squares minimisation. Here y is a data entry, z a 1/0 membership, c a cluster centroid, N_k the cardinality of cluster S_k; i indexes entities, v features/categories, k clusters. For the model

$$y_{iv} = \sum_{k=1}^{K} c_{kv} z_{ik} + e_{iv},$$

the data scatter decomposes as

$$\sum_{i=1}^{N}\sum_{v=1}^{V} y_{iv}^{2} \;=\; \sum_{k=1}^{K}\sum_{v=1}^{V} N_{k}\, c_{kv}^{2} \;+\; \sum_{k=1}^{K}\sum_{i\in S_{k}}\sum_{v=1}^{V} (y_{iv}-c_{kv})^{2}.$$
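The decomposition is easy to check numerically. In the sketch below (the data and partition are arbitrary, the names mine), T = B + W holds for any partition as long as the centroids are the within-cluster means:

```python
import numpy as np

rng = np.random.default_rng(0)
Y = rng.normal(size=(20, 3))            # 20 entities, 3 features
labels = rng.integers(0, 2, size=20)    # an arbitrary 2-cluster partition

T = (Y ** 2).sum()                                              # data scatter
B = sum((labels == k).sum() * (Y[labels == k].mean(axis=0) ** 2).sum()
        for k in range(2))                                      # sum_k N_k c_k^2
W = sum(((Y[labels == k] - Y[labels == k].mean(axis=0)) ** 2).sum()
        for k in range(2))                                      # within-cluster scatter
assert np.isclose(T, B + W)
```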
Absolute Data Scatter Decomposition (Mirkin 1997)
$$\sum_{i=1}^{N}\sum_{v=1}^{V} |y_{iv}| \;=\; \sum_{k=1}^{K}\sum_{v=1}^{V}\Big(2\sum_{i\in S_{kv}} |y_{iv}| \;-\; n_{kv}\,|c_{kv}|\Big) \;+\; \sum_{k=1}^{K}\sum_{i\in S_{k}}\sum_{v=1}^{V} |y_{iv}-c_{kv}|,$$

where $S_{kv} = \{\, i\in S_k : 0\le c_{kv}\le y_{iv} \ \text{or}\ y_{iv}\le c_{kv}\le 0 \,\}$, $n_{kv}=|S_{kv}|$, and the c_kv are within-cluster medians.
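Whatever form the explained part takes, the residual term of the L1 decomposition is the summary absolute deviation from the within-cluster medians; a minimal NumPy sketch (the function name is mine):

```python
import numpy as np

def l1_within_scatter(Y, labels, K):
    """L1 residual: the sum over clusters of absolute deviations of the
    entities from their cluster's feature-wise medians (the c_kv)."""
    total = 0.0
    for k in range(K):
        Yk = Y[labels == k]
        c_k = np.median(Yk, axis=0)   # medians minimise the L1 criterion
        total += np.abs(Yk - c_k).sum()
    return total
```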
Outline recap. Next: implications for data pre-processing.
Meaning of the data scatter

$$D_m \;=\; \sum_{i=1}^{N}\sum_{v=1}^{V} |y_{iv}|^{m}, \qquad m = 1, 2.$$

The data scatter is the sum of the contributions of the features, which is the basis for feature pre-processing (dividing by the range rather than the standard deviation). It is proportional to the summary variance (L2) or to the summary absolute deviation from the median (L1).
Standardisation of features
Y_ik = (X_ik - A_k)/B_k

X = original data, Y = standardised data; i indexes entities, k features. A_k is the shift of the origin, typically the average; B_k is the rescaling factor, traditionally the standard deviation, but the range may be better in clustering.
Normalising by the standard deviation decreases the effect of the more useful feature 2 whenever B1 = Std1 << B2 = Std2; normalising by the range keeps the effect of the distribution shape in T(Y). For categories, B = range * #categories (L2 case, under the equality-of-variables assumption).
Data standardisation
• Categories coded as one/zero variables
• Subtracting the average
• All features: normalising by range
• Categories: sometimes also by the number of them
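A minimal pandas sketch of this standardisation, assuming a data frame with quantitative columns and string-valued nominal columns (the function name and the dtype test are mine):

```python
import pandas as pd

def standardise(df):
    """Code categories as 0/1 variables, subtract the average from every
    column, and normalise by range; category columns are additionally
    divided by the number of categories of their feature."""
    parts = []
    for col in df.columns:
        if df[col].dtype == object:                      # nominal feature
            dummies = pd.get_dummies(df[col], prefix=col).astype(float)
            # A 0/1 variable has range 1; divide by the number of categories.
            parts.append((dummies - dummies.mean()) / dummies.shape[1])
        else:                                            # quantitative feature
            x = df[col].astype(float)
            parts.append((x - x.mean()) / (x.max() - x.min()))
    return pd.concat(parts, axis=1)
```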
Illustration of data pre-processing: a mixed-scale data table (the "Tom Sawyer" data) under conventional quantitative coding plus data standardisation.

[Figures: the "Tom Sawyer" data plotted with no normalisation, with z-scoring (scaling by std), and with normalising by range * #categories.]
Outline recap. Next: contributions of nominal features.
Contribution of a feature F to a partition (m = 2)

$$\mathrm{Contrib}(F) \;=\; \sum_{v\in F}\,\sum_{k=1}^{K} N_k\, c_{kv}^{2}$$

This is proportional to the correlation ratio η² if F is quantitative, and to a contingency coefficient between the cluster partition and F if F is nominal: Pearson's chi-squared (Poisson-normalised) or Goodman-Kruskal tau-b (range-normalised).
Contribution of a quantitative feature v to a partition (m = 2): proportional to the correlation ratio η²,

$$\mathrm{Contrib}(v) \;=\; \sum_{k=1}^{K} N_k\, c_{kv}^{2} \;=\; N\,\sigma_v^{2}\,\eta^{2},$$

where σ_v² is the variance of the (centred) feature v.
Contribution of a pair nominal feature - partition, L2 case: proportional to a contingency coefficient,

$$\mathrm{Contr} \;=\; N \sum_{k=1}^{K}\sum_{j} \frac{(p_{kj} - p_k\, p_j)^{2}}{p_k\, B_j^{2}},$$

where p_kj are the joint cluster-category frequencies and p_k, p_j the marginals. With $B_j = \sqrt{p_j}$ (Poisson normalisation) this is Pearson's chi-squared; with $B_j = 1$ (range normalisation) it is Goodman-Kruskal tau-b. It still needs to be normalised by the square root of the number of categories, to balance it against the contribution of a quantitative feature.
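A small NumPy sketch of this coefficient, computed from the cluster-by-category contingency table of counts (the function name and the poisson flag are mine):

```python
import numpy as np

def nominal_contribution(table, poisson=True):
    """Contribution of a nominal feature to a partition. poisson=True uses
    B_j = sqrt(p_j) (Pearson chi-squared version); poisson=False uses
    B_j = 1 (Goodman-Kruskal tau-b version)."""
    N = table.sum()
    p_kj = table / N                            # joint frequencies
    p_k = p_kj.sum(axis=1, keepdims=True)       # cluster marginals
    p_j = p_kj.sum(axis=0, keepdims=True)       # category marginals
    B_sq = p_j if poisson else 1.0              # the B_j^2 factor
    return N * ((p_kj - p_k * p_j) ** 2 / (p_k * B_sq)).sum()
```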
Contribution of a pair nominal feature - partition, L1 case: a highly original contingency coefficient,

$$\mathrm{Contr}(k,v) \;=\; N\,|2 p_{kv} - p_k|\,|c_{kv}|,$$

summed over clusters k and categories v, where p_kv is the joint frequency of cluster k and category v, p_k the cluster frequency, and c_kv the within-cluster median of the category's 0/1 variable. It still needs to be normalised by the square root of the number of categories, to balance it against the contribution of a quantitative feature.
Outline recap. Next: explications of the quadratic criterion.
Equivalent criteria (1)

A. Bilinear residuals squared, MIN: minimising the difference between the data and the cluster structure,

$$\sum_{i=1}^{N}\sum_{v\in V} e_{iv}^{2} \;\to\; \min$$

B. Distance-to-centre squared, MIN: minimising the difference between the data and the cluster structure,

$$W \;=\; \sum_{k=1}^{K}\sum_{i\in S_k} d(i, c_k) \;\to\; \min,$$

where d is the squared Euclidean distance.
Equivalent criteria (2)

C. Within-group errors squared, MIN: minimising the difference between the data and the cluster structure,

$$\sum_{k=1}^{K}\sum_{v\in V}\sum_{i\in S_k} (y_{iv} - c_{kv})^{2} \;\to\; \min$$

D. Weighted within-group variance, MIN: minimising the within-cluster variance,

$$\sum_{k=1}^{K} |S_k|\,\sigma^{2}(S_k) \;\to\; \min$$
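Forms C and D are the same number, which is easy to confirm numerically (a sketch with arbitrary data; note that NumPy's var uses the 1/N convention assumed here):

```python
import numpy as np

rng = np.random.default_rng(1)
Y = rng.normal(size=(30, 2))
labels = rng.integers(0, 3, size=30)

# C: within-group errors squared against the cluster means.
W_C = sum(((Y[labels == k] - Y[labels == k].mean(axis=0)) ** 2).sum()
          for k in range(3))
# D: cluster sizes times within-cluster variances, summed over features.
W_D = sum((labels == k).sum() * Y[labels == k].var(axis=0).sum()
          for k in range(3))
assert np.isclose(W_C, W_D)
```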
Equivalent criteria (3)

E. Semi-averaged within-cluster distances squared, MIN: minimising dissimilarities within clusters,

$$\sum_{k=1}^{K}\;\sum_{i,j\in S_k} d(i,j)\,/\,|S_k| \;\to\; \min$$

F. Semi-averaged within-cluster similarities, MAX: maximising similarities within clusters,

$$\sum_{k=1}^{K}\;\sum_{i,j\in S_k} a(i,j)\,/\,|S_k| \;\to\; \max, \qquad \text{where } a(i,j) = \langle y_i, y_j\rangle.$$
Equivalent criteria (4)

G. Distant centroids, MAX: finding anomalous types,

$$\sum_{k=1}^{K}\sum_{v\in V} c_{kv}^{2}\,|S_k| \;\to\; \max$$

H. Consensus partition, MAX: maximising the correlation between the sought partition S and the given variables,

$$\sum_{v=1}^{V} \eta^{2}(S, v) \;\to\; \max$$
Equivalent criteria (5)

I. Spectral clusters, MAX: maximising the summary Rayleigh quotient over binary vectors,

$$\sum_{k=1}^{K} \frac{z_k^{T} Y Y^{T} z_k}{z_k^{T} z_k} \;\to\; \max$$
Gower's controversy: 2N + 1 entities

[Figure: three groups of points located at c1, c2 and c3, of sizes N, N and 1.]

The two two-cluster possibilities give
W(c1, c2 / c3) = (N/2) d(c1, c2),
W(c1 / c2, c3) = N d(c2, c3)/(N + 1),
so that W(c1 / c2, c3) = o(W(c1, c2 / c3)) as N grows: however far away the singleton c3 is, the criterion prefers to split between c1 and c2. Hence: separation over the grand mean/median rather than just over distances (in the most general d setting).
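A numerical illustration of the controversy (a sketch; the concrete sizes and coordinates, N = 1000 points at c1 = 0 and at c2 = 1 plus a singleton at c3 = 10, are mine; d is the squared Euclidean distance):

```python
import numpy as np

def W(Y, labels):
    """K-Means criterion: within-cluster sums of squares to the centroids."""
    return sum(((Y[labels == k] - Y[labels == k].mean(axis=0)) ** 2).sum()
               for k in np.unique(labels))

N = 1000
Y = np.array([0.0] * N + [1.0] * N + [10.0]).reshape(-1, 1)   # c1, c2, c3
split_off_c3 = np.array([0] * N + [0] * N + [1])   # {c1, c2} vs {c3}
split_off_c1 = np.array([0] * N + [1] * N + [1])   # {c1} vs {c2, c3}
print(W(Y, split_off_c3))   # (N/2) d(c1, c2) = 500.0
print(W(Y, split_off_c1))   # N/(N+1) d(c2, c3) = 80.9; smaller, so the
# criterion lumps the far-away singleton c3 with c2 and splits c1 off instead.
```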
Outline recap. Next: one-by-one cluster extraction, Anomalous Patterns and iK-Means.
PCA-inspired Anomalous Pattern clustering

Model: $y_{iv} = c_v z_i + e_{iv}$, where $z_i = 1$ if $i \in S$ and $z_i = 0$ if $i \notin S$.

With the squared Euclidean distance d, the data scatter decomposes as

$$\sum_{i=1}^{N}\sum_{v=1}^{V} y_{iv}^{2} \;=\; N_S \sum_{v=1}^{V} c_{Sv}^{2} \;+\; \sum_{i\in S}\sum_{v=1}^{V} (y_{iv} - c_{Sv})^{2} \;+\; \sum_{i\notin S}\sum_{v=1}^{V} y_{iv}^{2},$$

or, in distance form,

$$\sum_{i=1}^{N} d(i, 0) \;=\; N_S\, d(c_S, 0) \;+\; \sum_{i\in S} d(i, c_S) \;+\; \sum_{i\notin S} d(i, 0).$$

Maximising the explained part $N_S\, d(c_S, 0)$ means that c_S must be anomalous, that is, interesting.
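A sketch of one Anomalous Pattern extraction under these assumptions (the data are centred, so the origin serves as the reference point; the function name and the convergence test are mine): start from the entity farthest from 0 and alternate assignments and centroid updates.

```python
import numpy as np

def anomalous_pattern(Y):
    """Extract one Anomalous Pattern cluster from centred data Y.
    The origin plays the role of the reference point."""
    c = Y[(Y ** 2).sum(axis=1).argmax()].copy()   # farthest entity from 0
    while True:
        # An entity joins S if it is closer to c than to the reference point.
        in_S = ((Y - c) ** 2).sum(axis=1) < (Y ** 2).sum(axis=1)
        if not in_S.any():
            return in_S, c
        new_c = Y[in_S].mean(axis=0)
        if np.allclose(new_c, c):
            return in_S, c
        c = new_c
```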
Spectral clustering can be non-optimal (1)

Spectral clustering (becoming popular, in a different setting): find the maximum eigenvector by maximising

$$\frac{x^{T} Y Y^{T} x}{x^{T} x}$$

over all possible x; then define $z_i = 1$ if $x_i > a$ and $z_i = 0$ if $x_i \le a$, for some threshold a.
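A minimal sketch of this recipe (the names are mine): power iteration for the leading eigenvector of Y Y^T, followed by thresholding at a.

```python
import numpy as np

def spectral_binary(Y, a=0.0, iters=200):
    """Maximise x^T Y Y^T x / x^T x by power iteration, then cut at a."""
    x = np.ones(Y.shape[0])
    for _ in range(iters):
        x = Y @ (Y.T @ x)        # apply Y Y^T without forming it explicitly
        x /= np.linalg.norm(x)   # keep the vector normalised
    return (x > a).astype(int), x
```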
Spectral clustering can be non-optimal (2)

Example (for similarity data):

[Figure: a similarity graph on 20 entities.]

The maximum eigenvector has components
i : 1 | 2 | 3-5 | 6-20
z : 0.681 | 0.260 | 0.126 | 0.168
Entities 3-5 score lower than entities 6-20, so no threshold a yields a cluster containing 3-5 but not 6-20. This cannot be typical…
Initial setting with Anomalous Pattern clusters

[Figures: the "Tom Sawyer" data; Anomalous Pattern clusters 1, 2 and 3 are extracted iteratively, one by one, from the reference point 0.]
iK-Means: Anomalous Pattern clusters + K-Means. After extracting 2 clusters (how can one know that 2 is right?).

[Figure: the final clusters.]
iK-Means: defining K and the initial setting with iterative Anomalous Pattern clustering (a code sketch follows the list):
• Find all Anomalous Pattern clusters
• Remove the smaller (e.g., singleton) clusters
• Take the number of remaining clusters as K and initialise K-Means with their centres
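A sketch combining the two earlier sketches, anomalous_pattern and k_means, which are assumed to be in scope (min_size stands in for the 'smaller clusters' threshold and is my parameter name):

```python
import numpy as np

def i_k_means(Y, min_size=2):
    """iK-Means: extract Anomalous Pattern clusters one by one, drop the
    small ones, and initialise K-Means with the surviving centres."""
    remaining = np.arange(len(Y))
    centres = []
    while len(remaining) > 0:
        in_S, c = anomalous_pattern(Y[remaining])   # earlier sketch
        if not in_S.any():
            break
        if in_S.sum() >= min_size:                  # keep only the larger APs
            centres.append(c)
        remaining = remaining[~in_S]
    return k_means(Y, np.array(centres))            # earlier sketch; K = len(centres)
```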
Outline recap. Next: the issue of the number of clusters.
Study of eight number-of-clusters methods (joint work with Mark Chiang):
• Variance based: Hartigan (HK); Calinski & Harabasz (CH); Jump Statistic (JS)
• Structure based: Silhouette Width (SW)
• Consensus based: Consensus Distribution area (CD); Consensus Distribution mean (DD)
• Sequential extraction of APs (iK-Means): Least Squares (LS); Least Moduli (LM)
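Two of these indices, CH and SW, are available off the shelf; a minimal scikit-learn sketch of scanning candidate values of K with them (higher is better for both; the function name is mine):

```python
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score, silhouette_score

def scan_k(Y, k_range=range(2, 10)):
    """Report the CH and SW indices over a range of K values."""
    for k in k_range:
        labels = KMeans(n_clusters=k, n_init=10).fit_predict(Y)
        print(k, calinski_harabasz_score(Y, labels), silhouette_score(Y, labels))
```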
Experimental results at 9 Gaussian clusters (3 size patterns), 1000 x 15 data:

[Table: the estimated number of clusters and the Adjusted Rand Index for each of the eight methods (HK, CH, JS, SW, CD, DD, LS, LM), under large-spread and small-spread conditions; entries mark 1-time, 2-times and 3-times winners, with the two best methods counted each time.]
Outline recap. Next: comments on optimisation problems.
Some other data recovery clustering models:
• Hierarchical clustering: Ward agglomerative and Ward-like divisive; relation to wavelets and the Haar basis (1997)
• Additive clustering: partition and one-by-one clustering (1987)
• Biclustering: box clustering (1995)
Hierarchical clustering for conventional and spatial data
Model: the same,

$$y_{iv} \;=\; \sum_{k=1}^{K} c_{kv} z_{ik} + e_{iv},$$

but the cluster structure is a 3-valued vector z representing a split. For a split S = S1 + S2 of a node S into children S1 and S2:

z_i = 0 if i ∉ S; z_i = a if i ∈ S1; z_i = -b if i ∈ S2.

If a and b are chosen so that z is centred, the node vectors of a hierarchy form an orthogonal basis (an analogue of the SVD).

[Figure: a node S split into children S1 and S2.]
Last:
• The data recovery, K-Means-wise model is an adequate tool that involves a wealth of interesting criteria for mathematical investigation
• The L1 criterion remains a mystery, even for the most popular method, PCA
• Greedy-wise approaches remain a vital element, both theoretically and practically
• Evolutionary approaches have started sneaking in: they should be given more attention
• Extending the approach to other data types is a promising direction