Spectral Clustering
by HU Pili
June 16, 2013
Outline
• Clustering Problem
• Spectral Clustering Demo
• Preliminaries
◦ Clustering: K-means Algorithm
◦ Dimensionality Reduction: PCA, KPCA.
• Spectral Clustering Framework
• Spectral Clustering Justification
• Other Spectral Embedding Techniques
Main reference: Hu 2012 [4]
2
Clustering Problem
Height
Weight
Figure 1. Abstract your target using a feature vector
3
Clustering Problem
Height
Weight
Figure 2. Cluster the data points into K (2 here) groups
4
Clustering Problem
Height
Weight
ThinFat
Figure 3. Gain insights into your data
5
Clustering Problem
Height
Weight
Figure 4. The center is representative (knowledge)
6
Clustering Problem
Height
Weight
Figure 5. Use the knowledge for prediction
7
Review: Clustering
We learned the general steps of data mining / knowledge discovery using a clustering example:
1. Abstract your data in the form of vectors.
2. Run learning algorithms.
3. Gain insights / extract knowledge / make predictions.
We focus on 2.
8
Spectral Clustering Demo
Figure 6. Data Scatter Plot
9
Spectral Clustering Demo
Figure 7. Standard K-Means
10
Spectral Clustering Demo
Figure 8. Our Sample Spectral Clustering
11
Spectral Clustering Demo
The algorithm is simple:
• K-Means:
    [idx, c] = kmeans(X, K);
• Spectral Clustering:
    epsilon = 0.7;
    D = dist(X');
    A = double(D < epsilon);
    [V, Lambda] = eigs(A, K);
    [idx, c] = kmeans(V, K);
12
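The MATLAB snippet above can be sketched in Python/NumPy. This is an illustrative translation, not the author's code: the function name `spectral_demo` and the farthest-point k-means initialization are our own choices, and `scipy.sparse.linalg.eigsh` stands in for MATLAB's `eigs`.

```python
import numpy as np
from scipy.sparse.linalg import eigsh

def spectral_demo(X, K, epsilon=0.7):
    """Naive spectral clustering: epsilon-ball graph, top-K eigenvectors, k-means.
    X has shape (N, n): N points in n dimensions."""
    # Pairwise Euclidean distances (MATLAB: D = dist(X'))
    diff = X[:, None, :] - X[None, :, :]
    D = np.sqrt((diff ** 2).sum(-1))
    # Adjacency of the epsilon-ball graph (MATLAB: A = double(D < epsilon))
    A = (D < epsilon).astype(float)
    # Largest K eigenvectors (MATLAB: [V, Lambda] = eigs(A, K))
    _, V = eigsh(A, k=K, which='LA')
    # Simple k-means on the embedding rows, farthest-point initialization
    centers = [V[0]]
    for _ in range(1, K):
        d = np.min([np.linalg.norm(V - c, axis=1) for c in centers], axis=0)
        centers.append(V[np.argmax(d)])
    centers = np.array(centers)
    for _ in range(20):
        labels = np.argmin(((V[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(K):
            if (labels == j).any():
                centers[j] = V[labels == j].mean(0)
    return labels
```

On two well-separated blobs this recovers the groups, because the ε-ball graph splits into connected components whose indicator-like top eigenvectors are trivially clusterable.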
Review: The Demo
The usual case in data mining:
• No weak algorithms
• Preprocessing is as important as algorithms
• The problem looks easier in another space (the secret coming soon)
Transformation to another space:
• High to low: dimensionality reduction, low-dimension embedding. e.g. spectral clustering.
• Low to high. e.g. Support Vector Machine (SVM)
13
Secrets of Preprocessing
Figure 9. The similarity graph: connect to ε-ball.
D = dist(X'); A = double(D < epsilon);
14
Secrets of Preprocessing
Figure 10. 2-D embedding using largest 2 eigenvectors
[V, Lambda] = eigs(A, K);
15
Secrets of Preprocessing
Figure 11. Even better after projecting to the unit circle (not used in our sample but more applicable, Brand 2003 [3])
Secrets of Preprocessing
Figure 12. Angle histograms (panels: Angles – All Points, Angles – Cluster 1, Angles – Cluster 2): the two clusters are concentrated and nearly perpendicular to each other.
Notations
• Data points: Xn×N = [x1, x2, …, xN]. N points, each of n dimensions, organized in columns.
• Feature extraction: φ(xi), n̂-dimensional.
• Eigenvalue decomposition: A = UΛUT. First d columns of U: Ud.
• Feature matrix: Φn̂×N = [φ(x1), φ(x2), …, φ(xN)]
• Low-dimension embedding: YN×d. Embed the N points into d-dimensional space.
• Number of clusters: K.
18
K-Means
• Initialize centers m1, …, mK
• Iterate until convergence:
◦ Cluster assignment: Ci = argminj ‖xi − mj‖ for all data points, 1 ⩽ i ⩽ N.
◦ Update clusters: Sj = {i : Ci = j}, 1 ⩽ j ⩽ K.
◦ Update centers: mj = (1/|Sj|) ∑i∈Sj xi
19
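The iteration above can be sketched in Python/NumPy (a minimal Lloyd's-algorithm sketch, not the MATLAB `kmeans` used in the demo; the random initialization from data points is one common choice):

```python
import numpy as np

def kmeans(X, K, iters=50, seed=0):
    """Lloyd's algorithm following the slide. X: (N, n) float array, one point per row."""
    rng = np.random.default_rng(seed)
    m = X[rng.choice(len(X), size=K, replace=False)].copy()   # initialize centers m1..mK
    for _ in range(iters):
        # Cluster assignment: C_i = argmin_j ||x_i - m_j||
        C = np.argmin(((X[:, None] - m[None]) ** 2).sum(-1), axis=1)
        # Update centers: m_j = (1/|S_j|) sum_{i in S_j} x_i (unchanged if S_j is empty)
        for j in range(K):
            if (C == j).any():
                m[j] = X[C == j].mean(0)
    return C, m
```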
Remarks: K-Means
1. A chicken-and-egg problem.
2. How to initialize the centers?
3. How to determine the hyperparameter K?
4. The decision boundary between two centers is a straight line (hyperplane):
• Ci = argminj ‖xi − mj‖
• ‖x − mj‖ = ‖x − mk‖
• mjTmj − mkTmk = 2xT(mj − mk)
We address the 4th point by transforming the data into a space where straight-line boundaries are enough.
20
Principal Component Analysis
Figure 13. Error minimization formulation
21
Principal Component Analysis
Assume xi is already centered (easy to preprocess). Project the points onto the space spanned by Un×d with minimum error:

min U∈Rn×d  J(U) = ∑i=1N ‖UUTxi − xi‖2
s.t. UTU = I
22
Principal Component Analysis
Transform to trace maximization:

max U∈Rn×d  Tr[UT(XXT)U]
s.t. UTU = I

A standard problem in matrix theory:
• The solution U is given by the largest d eigenvectors of XXT (those corresponding to the largest eigenvalues), i.e. XXTU = UΛ.
• We usually denote Σn×n = XXT, because XXT can be interpreted as the covariance of the n variables. (Again, note that X is centered.)
23
Principal Component Analysis
About UUTxi:
• xi: the data point in the original n-dimensional space.
• Un×d: the d-dimensional space (axes) expressed using the coordinates of the original n-D space. – Principal Axes.
• UTxi: the coordinates of xi in the d-dimensional space; the d-dimensional embedding; the projection of xi onto the d-D space, expressed in the d-D space. – Principal Components.
• UUTxi: the projection of xi onto the d-D space, expressed in the n-D space.
24
Principal Component Analysis
From principal axes to principal components:
• One data point: UTxi
• Matrix form: (Y T)d×N = UTX
Relation between covariance and similarity:

(XTX)Y = XTX(UTX)T
       = XTXXTU
       = XTUΛ
       = YΛ

Observation: the columns of Y are eigenvectors of XTX.
25
Principal Component Analysis
Two operational approaches:
• Decompose (XXT)n×n and let (Y T)d×N = UTX.
• Decompose (XTX)N×N and get Y directly.
Implications:
• Choose the smaller-sized matrix in practice.
• XTX hints that we can do more with the structure.
Remarks:
• Principal components, i.e. Y, the d-dimensional embedding, are what we want in most cases. e.g. we can do clustering on the coordinates given by Y.
26
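The equivalence of the two approaches can be checked numerically. A NumPy sketch (data and variable names are illustrative): decomposing XXT and projecting gives the same embedding, up to a per-component sign flip, as decomposing XTX and rescaling its eigenvectors by √λ.

```python
import numpy as np

rng = np.random.default_rng(0)
n, N, d = 5, 40, 2
X = rng.normal(size=(n, N))
X -= X.mean(axis=1, keepdims=True)         # center the data points (columns)

# Approach 1: decompose the n x n covariance XX^T, then Y^T = U_d^T X
evals, U = np.linalg.eigh(X @ X.T)
Ud = U[:, np.argsort(evals)[::-1][:d]]     # largest-d eigenvectors
Y1 = (Ud.T @ X).T                          # N x d embedding

# Approach 2: decompose the N x N similarity X^T X directly
evals2, V = np.linalg.eigh(X.T @ X)
idx = np.argsort(evals2)[::-1][:d]
Y2 = V[:, idx] * np.sqrt(evals2[idx])      # scale eigenvectors by sqrt(lambda)

# The two embeddings agree up to a sign flip per component
for j in range(d):
    s = np.sign(Y1[:, j] @ Y2[:, j])
    assert np.allclose(Y1[:, j], s * Y2[:, j], atol=1e-8)
```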
Kernel PCA
Settings:
• A feature extraction function:
φ: Rn → Rn̂
mapping the original n-dimensional features to n̂ dimensions.
• Matrix organization:
Φn̂×N = [φ(x1), φ(x2), …, φ(xN)]
• Now Φ plays the role of the "data points" in the PCA formulation. Try to embed them into d dimensions.
27
Kernel PCA
According to the analysis of PCA, we can operate on either:
• (ΦΦT)n̂×n̂: the covariance matrix of the features
• (ΦTΦ)N×N: the similarity/affinity matrix (in spectral clustering language); the Gram/kernel matrix (in kernel PCA language).
The observation:
• n̂ can be very large, e.g. φ: Rn → R∞.
• (ΦTΦ)i,j = φT(xi)φ(xj). We don't need an explicit φ; we only need k(xi, xj) = φT(xi)φ(xj).
28
Kernel PCA
k(xi, xj) = φT(xi)φ(xj) is the "kernel". One important property, by definition:
• k(xi, xj) is a positive semidefinite function.
• K is a positive semidefinite matrix.
Some example kernels:
• Linear: k(xi, xj) = xiTxj. Degrades to PCA.
• Polynomial: k(xi, xj) = (1 + xiTxj)p
• Gaussian: k(xi, xj) = e−‖xi−xj‖2/(2σ2)
29
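The PSD property of the Gaussian kernel matrix is easy to check numerically. A NumPy sketch (the helper name `gaussian_kernel_matrix` and the test data are illustrative):

```python
import numpy as np

def gaussian_kernel_matrix(X, sigma=1.0):
    """K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)); X holds points as columns (n x N)."""
    sq = ((X[:, :, None] - X[:, None, :]) ** 2).sum(axis=0)   # squared distances
    return np.exp(-sq / (2 * sigma ** 2))

X = np.random.default_rng(0).normal(size=(2, 30))
K = gaussian_kernel_matrix(X)
# By definition the kernel matrix is symmetric positive semidefinite
assert np.allclose(K, K.T)
assert np.linalg.eigvalsh(K).min() > -1e-8
```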
Remarks: KPCA
• Avoids explicit high-dimensional (maybe infinite) feature construction.
• Enables one research direction: kernel engineering.
• The above discussion assumes Φ is centered! See Bishop 2006 [2] for how to center this matrix using only the kernel function (or the "double centering" technique in [4]).
• Out-of-sample embedding is the real difficulty; see Bengio 2004 [1].
30
Review: PCA and KPCA
• Minimum error formulation of PCA
• Two equivalent implementation approaches:
◦ covariance matrix
◦ similarity matrix
• The similarity matrix is more convenient to manipulate and leads to KPCA.
• The kernel is positive semidefinite (PSD) by definition: K = ΦTΦ.
31
Spectral Clustering Framework
A bold guess:
• Decomposing K = ΦTΦ gives a good low-dimension embedding. The inner product measures similarity, i.e. k(xi, xj) = φT(xi)φ(xj); K is the similarity matrix.
• In the operation, we actually never look at Φ.
• We can specify K directly and perform EVD: Xn×N → KN×N
• What if we directly give a similarity measure K, without the PSD constraint?
That leads to general spectral clustering.
32
Spectral Clustering Framework
1. Get the similarity matrix AN×N from the data points X. (A: affinity matrix; adjacency matrix of a graph; similarity graph)
2. EVD: A = UΛUT. Use Ud (or a post-processed version, see [4]) as the d-dimension embedding.
3. Perform clustering on the d-D embedding.
33
Spectral Clustering Framework
Review our naive spectral clustering demo:
1. epsilon = 0.7;
   D = dist(X');
   A = double(D < epsilon);
2. [V, Lambda] = eigs(A, K);
3. [idx, c] = kmeans(V, K);
34
Remarks: SC Framework
• We start by relaxing A (K) in KPCA.
• Lose PSD == lose the KPCA justification? Not exactly: A′ = A + σI is PSD for large enough σ, and shifting by σI changes the eigenvalues but not the eigenvectors.
• Real tricks (see [4] section 2 for details):
◦ How to form A?
◦ Decompose A or another variant of A (e.g. L = D − A).
◦ Use the EVD result directly (e.g. U) or use a variant (e.g. UΛ1/2).
35
Similarity graph
Input is high-dimensional data (e.g. it comes in the form of X):
• k-nearest neighbour
• ε-ball
• mutual k-NN
• complete graph (with Gaussian kernel weight)
36
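The first two constructions above can be sketched in Python/NumPy (the helper name `similarity_graphs` is illustrative; the slide's mutual k-NN variant would use an AND instead of the OR symmetrization):

```python
import numpy as np

def similarity_graphs(X, epsilon=0.7, k=5):
    """Build epsilon-ball and k-NN adjacency matrices from points in the rows of X."""
    diff = X[:, None, :] - X[None, :, :]
    D = np.sqrt((diff ** 2).sum(-1))
    # epsilon-ball graph: connect pairs closer than epsilon (no self loops)
    A_eps = ((D < epsilon) & (D > 0)).astype(float)
    # k-NN graph: each point connects to its k nearest neighbours (skip self at column 0)
    nn = np.argsort(D, axis=1)[:, 1:k + 1]
    A_knn = np.zeros_like(D)
    rows = np.repeat(np.arange(len(X)), k)
    A_knn[rows, nn.ravel()] = 1.0
    A_knn = np.maximum(A_knn, A_knn.T)     # symmetrize (OR); use np.minimum for mutual k-NN
    return A_eps, A_knn
```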
Similarity graph
Input is a distance matrix: (D(2))i,j is the squared distance between i and j (it may not come from raw xi, xj).

c = [x1Tx1, …, xNTxN]T
D(2) = c1T + 1cT − 2XTX
J = I − (1/N) 11T
XTX = −(1/2) J D(2) J

Remarks:
• See MDS (multidimensional scaling).
37
Similarity graph
Input is a graph:
• Just use it.
• Or do some enhancements, e.g. geodesic distance. See [4] Section 2.1.3 for some possible methods.
After getting the graph (or input):
• Adjacency matrix A; Laplacian matrix L = D − A.
• Normalized versions:
◦ Aleft = D−1A, Asym = D−1/2AD−1/2
◦ Lleft = D−1L, Lsym = D−1/2LD−1/2
38
EVD of the Graph
Matrix types:
• Adjacency series: Use the largest EVs.
• Laplacian series: Use the smallest EVs.
39
Remarks: SC Framework
• There are many possibilities in the construction of the similarity matrix and the post-processing of the EVD.
• Not all of these combinations have justifications.
• Once a combination is shown to work, it may not be very hard to find a justification.
• Existing works actually start from very differently flavoured formulations.
• Only one common property: they all involve EVD, aka "spectral analysis"; hence the name.
40
Spectral Clustering Justification
• Cut based argument (main stream; origin)
• Random walk escaping probability
• Commute time: L−1 encodes the effective resistance. (This is where UΛ−1/2 comes from.)
• Low-rank approximation.
• Density estimation.
• Matrix perturbation.
• Polarization. (the demo)
See [4] for pointers.
41
Cut Justification
Normalized Cut (Shi 2000 [5]):
NCut = ∑i=1K cut(Ci, V − Ci) / vol(Ci)

Characteristic vector χi ∈ {0, 1}N for Ci:

NCut = ∑i=1K (χiTLχi) / (χiTDχi)
42
Relax χi to real values:

min vi∈RN  ∑i=1K viTLvi
s.t. viTDvi = 1
     viTDvj = 0, ∀ i ≠ j

This is the generalized eigenvalue problem:

Lv = λDv

Equivalent to EVD on:

Lleft = D−1L
43
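The relaxed problem above can be solved directly with a generalized symmetric eigensolver. A SciPy sketch on an assumed toy graph (two triangles joined by one bridge edge); the second-smallest generalized eigenvector plays the Fiedler-vector role and splits the graph at the sparse cut:

```python
import numpy as np
from scipy.linalg import eigh

# Toy graph: triangles {0,1,2} and {3,4,5} joined by the bridge edge (2,3)
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1.0
D = np.diag(A.sum(1))
L = D - A

# Generalized eigenproblem L v = lambda D v; eigh returns ascending eigenvalues
w, V = eigh(L, D)
# w[0] ~ 0 (constant vector); the next eigenvector encodes the normalized cut
fiedler = V[:, 1]
labels = (fiedler > np.median(fiedler)).astype(int)
```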
Matrix Perturbation Justification
• When the graph is ideally separable, i.e. consists of multiple connected components, A and L have characteristic (piecewise-constant, indicator-like) eigenvectors.
• When it is not ideally separable but a sparse cut exists, A can be viewed as an ideally separable matrix plus a small perturbation.
• A small perturbation of the matrix entries leads to a small perturbation of the eigenvectors.
• The eigenvectors are then not too far from piecewise constant: easy to separate with simple algorithms like K-Means.
44
Low Rank Approximation
The similarity matrix A is generated by inner products in some unknown space that we want to recover. We minimize the recovery error:

min Y∈RN×d  ‖A − Y Y T‖F2

This is the standard low-rank approximation problem, which leads to the EVD of A:

Y = UΛ1/2
45
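The solution Y = UΛ1/2 can be checked numerically: for a PSD similarity matrix of exact rank d, the top-d eigenpairs recover A exactly. A NumPy sketch with assumed random data:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 20, 3
B = rng.normal(size=(N, d))
A = B @ B.T                                # rank-d PSD similarity matrix

w, U = np.linalg.eigh(A)
idx = np.argsort(w)[::-1][:d]              # largest d eigenpairs
Y = U[:, idx] * np.sqrt(w[idx])            # Y = U_d Lambda_d^{1/2}
assert np.allclose(A, Y @ Y.T)             # exact recovery at rank d
```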
Spectral Embedding Techniques
See [4] for some pointers: MDS, Isomap, PCA, KPCA, LLE, LEmap, HEmap, SDE, MVE, SPE. The difference, as said, lies mostly in the construction of A.
46
Bibliography
[1] Y. Bengio, J. Paiement, P. Vincent, O. Delalleau, N. Le Roux, and M. Ouimet. Out-of-sample extensions for LLE, Isomap, MDS, eigenmaps, and spectral clustering. Advances in Neural Information Processing Systems, 16:177–184, 2004.
[2] C. Bishop. Pattern Recognition and Machine Learning, volume 4. Springer, New York, 2006.
[3] M. Brand and K. Huang. A unifying theorem for spectral embedding and clustering. In Proceedings of the Ninth International Workshop on Artificial Intelligence and Statistics, 2003.
[4] P. Hu. Spectral clustering survey, May 2012.
[5] J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888–905, 2000.
47
Thanks
Q/A
Some supplementary slides for details are attached.
48
SVD and EVD
Definition of Singular Value Decomposition (SVD):

Xn×N = Un×kΣk×kVN×kT

Definition of Eigenvalue Decomposition (EVD):

A = XTX
A = UΛUT

Relations:

XTX = V Σ2V T
XXT = UΣ2UT
49
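The relations above can be verified with NumPy's SVD (random data; this is a numerical check, not part of the original slides):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 7))                # n = 4 dimensions, N = 7 points

U, s, Vt = np.linalg.svd(X, full_matrices=False)
# X X^T = U Sigma^2 U^T  and  X^T X = V Sigma^2 V^T
assert np.allclose(X @ X.T, U @ np.diag(s ** 2) @ U.T)
assert np.allclose(X.T @ X, Vt.T @ np.diag(s ** 2) @ Vt)
```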
Remarks:
• SVD requires UTU = I, V TV = I and σi ⩾ 0 (Σ = diag(σ1, …, σk)). This guarantees uniqueness of the solution.
• EVD has no such constraints; any U and Λ satisfying AU = UΛ are OK. The requirement UTU = I is also there to guarantee uniqueness (e.g. in PCA). Another benefit is the numerical stability of the subspace spanned by U: an orthogonal layout is more error resilient.
• The computation of SVD is done via EVD.
• Watch out for the terms and the objects they refer to.
50
Out of Sample Embedding
• A new data point x ∈ Rn is not in X. How do we find its low-dimension embedding y ∈ Rd?
• In PCA, we have the principal axes U (XXT = UΛUT). Out-of-sample embedding is simple: y = UTx.
• Un×d is actually a compact representation of the knowledge.
• In KPCA and the different variants of SC, we operate on the similarity graph and have no such compact representation. It is thus hard to compute the out-of-sample embedding explicitly.
• See [1] for research on this.
51
Gaussian Kernel
The Gaussian kernel (let τ = 1/(2σ2)):

k(xi, xj) = e−‖xi−xj‖2/(2σ2) = e−τ‖xi−xj‖2
Use the Taylor expansion:

ex = ∑k=0∞ (1/k!) xk

Rewrite the kernel:

k(xi, xj) = e−τ(xi−xj)T(xi−xj)
          = e−τxiTxi · e−τxjTxj · e2τxiTxj
52
Focus on the last part:

e2τxiTxj = ∑k=0∞ (1/k!)(2τxiTxj)k

It is hard to write out the form when xi ∈ Rn, n > 1. We demo the case n = 1, where xi and xj are single variables:

e2τxixj = ∑k=0∞ (1/k!)(2τ)k xik xjk = ∑k=0∞ c(k) · xik xjk
53
The feature vector is (infinite-dimensional):

φ(x) = e−τx2 [√c(0), √c(1) x, √c(2) x2, …, √c(k) xk, …]

Verify that:

k(xi, xj) = φ(xi) · φ(xj)

This shows that the Gaussian kernel implicitly maps 1-D data to an infinite-dimensional feature space.
54
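The verification can be done numerically with a truncated feature map. A Python sketch for the 1-D case (the helper `phi` and the 30-term truncation are our own illustrative choices; the series converges fast for moderate τ):

```python
import numpy as np
from math import factorial

def phi(x, tau=0.5, terms=30):
    """Truncated explicit feature map for the 1-D Gaussian kernel, c(k) = (2 tau)^k / k!."""
    ks = np.arange(terms)
    c = (2 * tau) ** ks / np.array([factorial(k) for k in ks], dtype=float)
    return np.exp(-tau * x ** 2) * np.sqrt(c) * x ** ks

xi, xj, tau = 0.8, -0.3, 0.5
k_exact = np.exp(-tau * (xi - xj) ** 2)    # Gaussian kernel value
k_feat = phi(xi, tau) @ phi(xj, tau)       # inner product of explicit features
assert abs(k_exact - k_feat) < 1e-10
```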