Optimization and learning techniques for clustering problems

Soledad Villar
Center for Data Science, Courant Institute of Mathematical Sciences

Statistical Physics and Machine Learning back together, Cargèse, August 2018
Clustering
"Next AI revolution is unsupervised" (Yann LeCun)

Main task in unsupervised machine learning:
- Finding structure in unlabeled data.

k-means clustering
- Simple objective
- Useful for "generic data"

Clustering the stochastic block model
- Specialized objective
- Algorithms get more precise but less robust to model changes

Data-driven clustering methods.
Clustering
"Next AI revolution is unsupervised" (Yann LeCun)

Example: Cluster analysis and display of genome-wide expression patterns.
- Gene expression as a function of time for different conditions.
- Clustering gene expression data groups together genes of known similar functionality.

Eisen, Spellman, Brown, Botstein, PNAS, 1998
Summary

- k-means clustering
  - Landscape of manifold optimization
  - Lower bound certificates from SDPs (data-driven)
- Clustering the stochastic block model
  - AMP → Spectral methods → Graph neural networks
- Quadratic assignment
The k-means problem

Given a point cloud $\{x_i\}$, partition the points into clusters $C_1, \dots, C_k$.

k-means objective:

$$\min_{C_1, \dots, C_k} \sum_{t=1}^{k} \sum_{i \in C_t} \Big\| x_i - \frac{1}{|C_t|} \sum_{j \in C_t} x_j \Big\|^2$$

- NP-hard to minimize in general (even in the plane).
Lloyd's algorithm

1. Choose k centers at random.
2. Assign each point to its closest center.
3. Recompute the centers as the cluster centroids; repeat.

k-means++ is Lloyd's with smarter initialization.
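A minimal numpy sketch of Lloyd's iterations (assumes no cluster goes empty; k-means++ would only change the seeding step):

```python
import numpy as np

def lloyd(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Choose k centers at random (k-means++ replaces this with smarter seeding).
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # 2. Assign each point to its closest center.
        labels = ((X[:, None] - centers[None]) ** 2).sum(-1).argmin(1)
        # 3. Recompute the centers as the cluster centroids (assumes no empty clusters).
        centers = np.array([X[labels == t].mean(axis=0) for t in range(k)])
    return labels, centers
```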
Pros:
- Fast.
- Very easy to implement.
- Widely used and works for most applications.

Cons:
- No guarantee of finding the global minimum (it converges to local minima).
- In extreme cases it may take exponentially many steps.
- Solutions depend heavily on initialization.
- Its output doesn't say how good a solution it may be.
Optimization formulation

Taking $D_{ij} := \|x_i - x_j\|^2$, then

$$\sum_{t=1}^{k} \sum_{i \in C_t} \Big\| x_i - \frac{1}{|C_t|} \sum_{j \in C_t} x_j \Big\|^2 = \frac{1}{2} \operatorname{Tr}\Big(D \underbrace{\sum_{t=1}^{k} \frac{1}{|C_t|} \mathbf{1}_{C_t} \mathbf{1}_{C_t}^\top}_{=\, YY^\top}\Big)$$

$$\begin{array}{ll} \text{minimize} & \operatorname{Tr}(DYY^\top) \\ \text{subject to} & Y \in \mathbb{R}^{n \times k} \\ & Y^\top Y = I_k \\ & YY^\top \mathbf{1} = \mathbf{1} \\ & Y \geq 0 \end{array}$$

The set of feasible Y's is discrete.
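A quick numpy sanity check of the identity above (a sketch; Y has columns $\mathbf{1}_{C_t}/\sqrt{|C_t|}$ and the cluster labels are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))
labels = rng.integers(0, 3, size=30)            # an arbitrary clustering with k = 3
D = ((X[:, None] - X[None]) ** 2).sum(-1)       # D_ij = ||x_i - x_j||^2

# Y has columns 1_{C_t} / sqrt(|C_t|), so the underbraced matrix is Y Y^T
Y = np.stack([(labels == t) / np.sqrt((labels == t).sum()) for t in range(3)], axis=1)

kmeans_val = sum(((X[labels == t] - X[labels == t].mean(0)) ** 2).sum() for t in range(3))
assert np.isclose(0.5 * np.trace(D @ Y @ Y.T), kmeans_val)
```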
Manifold optimization implementation

$$\begin{array}{ll} \text{minimize} & \operatorname{Tr}(DYY^\top) + \lambda \|Y_-\|^2 \\ \text{subject to} & Y \in \mathcal{M} = \{Y \in \mathbb{R}^{n \times k} : Y^\top Y = I_k,\ YY^\top \mathbf{1} = \mathbf{1}\} \end{array}$$

To implement the manifold optimization one needs:

$$\operatorname{grad} f : \mathcal{M} \to T\mathcal{M}, \qquad \operatorname{retr}_Y : T_Y\mathcal{M} \to \mathcal{M},$$

where

$$\operatorname{retr}_Y(0) = Y, \qquad \frac{d}{dt}\Big|_{t=0} \operatorname{retr}_Y(tV) = V.$$
Gradient descent:

$$Y_{n+1} = \operatorname{retr}_{Y_n}\!\big(-\alpha_n \operatorname{grad} f(Y_n)\big).$$
Sepulchre, Absil, Mahony, 2008
Carson, Mixon, V., Ward, SampTA, 2017
Manifold optimization, retraction implementation

Homogeneous structure of $\mathcal{M}$. Let $O(\mathbf{1}_n)$ be the $n \times n$ orthogonal matrices that fix $\mathbf{1}$.

$$\mathcal{M} \times O(\mathbf{1}_n) \times O(k) \to \mathcal{M}, \quad (Y, Q, R) \mapsto QYR \quad \text{(transitive action)}$$

Multiplication of $Y$ on the right by $R$ is equivalent to multiplication on the left by $R' = YRY^\top \in O(\mathbf{1}_n)$.

$V \in T_Y\mathcal{M}$ can be written as $V = BY + YA$ where $A \in \mathfrak{so}(k)$, $B \in \mathfrak{so}(\mathbf{1}_n)$, and $BY$ and $YA$ are orthogonal. We compute $A, B$ explicitly:

$$A = Y^\top V, \qquad B = VY^\top - YV^\top - 2YAY^\top, \qquad \operatorname{retr}_Y(V) = \exp(B)\exp(A')\,Y \in \mathcal{M}$$
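A numpy/scipy sketch of this retraction, using that $\exp(A')Y = Y\exp(A)$ for $A' = YAY^\top$ (a sketch of the formulas above, not the authors' implementation):

```python
import numpy as np
from scipy.linalg import expm

def retract(Y, V):
    """Retraction on M = {Y : Y^T Y = I_k, YY^T 1 = 1}, following the decomposition above."""
    A = Y.T @ V                                  # so(k) component
    B = V @ Y.T - Y @ V.T - 2 * Y @ A @ Y.T      # so(1_n) component
    return expm(B) @ Y @ expm(A)                 # exp(B) exp(A') Y = exp(B) Y exp(A)
```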
Manifold optimization algorithm

$$\begin{array}{ll} \text{minimize}_Y & f_\lambda(Y) = \operatorname{Tr}(DYY^\top) + \lambda \|Y_-\|^2 \\ \text{subject to} & Y \in \mathcal{M} = \{Y \in \mathbb{R}^{n \times k} : Y^\top Y = I_k,\ YY^\top \mathbf{1} = \mathbf{1}\} \end{array}$$

λ_0 ← 0
repeat
    Y_{n+1} ← GradientDescent(f_{λ_n})   {initialized at Y_n}
    λ_{n+1} ← 2λ_n + 1
until ‖Y_-‖_F < ε

[Figure: iterates for λ = 0, 6300, 2.047e+05, 2.6214e+07]

The problem structure allows each iteration to run in linear time.
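A sketch of the outer penalty loop, assuming a hypothetical inner routine gradient_descent(f, Y0, retract) built from the retraction sketched above:

```python
import numpy as np

def cluster_manifold(D, Y0, eps=1e-6):
    lam, Y = 0.0, Y0
    # Increase the penalty on the negative part Y_- until Y is (numerically) nonnegative.
    while np.linalg.norm(np.minimum(Y, 0)) >= eps:
        f = lambda Y: np.trace(D @ Y @ Y.T) + lam * np.linalg.norm(np.minimum(Y, 0)) ** 2
        Y = gradient_descent(f, Y, retract)      # hypothetical inner solver, warm-started at Y
        lam = 2 * lam + 1
    return Y
```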
Carson, Mixon, V., Ward, SampTA, 2017
Claim: clustering is (maybe) not (that) hard
Optimization landscape

Points from three well-separated unit balls in small dimension.
Claim: clustering is (maybe) not (that) hard
No clustering structure

Points drawn from a single unit ball in small dimension, partitioned into three clusters.
And it’s optimal! How do I know that?
k-means optimization formulation

Recall $D_{ij} := \|x_i - x_j\|^2$; then

$$\sum_{t=1}^{k} \sum_{i \in C_t} \Big\| x_i - \frac{1}{|C_t|} \sum_{j \in C_t} x_j \Big\|^2 = \frac{1}{2} \operatorname{Tr}\Big(D \underbrace{\sum_{t=1}^{k} \frac{1}{|C_t|} \mathbf{1}_{C_t} \mathbf{1}_{C_t}^\top}_{=\, YY^\top}\Big)$$

$$\begin{array}{ll} \text{minimize} & \operatorname{Tr}(DYY^\top) \\ \text{subject to} & Y \in \mathbb{R}^{n \times k} \\ & Y^\top Y = I_k \\ & YY^\top \mathbf{1} = \mathbf{1} \\ & Y \geq 0 \end{array}$$

The set of feasible Y's is discrete.
A semidefinite programming relaxation

Recall $D_{ij} := \|x_i - x_j\|^2$; then

$$\sum_{t=1}^{k} \sum_{i \in C_t} \Big\| x_i - \frac{1}{|C_t|} \sum_{j \in C_t} x_j \Big\|^2 = \frac{1}{2} \operatorname{Tr}\Big(D \underbrace{\sum_{t=1}^{k} \frac{1}{|C_t|} \mathbf{1}_{C_t} \mathbf{1}_{C_t}^\top}_{=\, X}\Big)$$

Relax to an SDP:

$$\begin{array}{ll} \text{minimize} & \operatorname{Tr}(DX) \\ \text{subject to} & \operatorname{Tr}(X) = k \\ & X\mathbf{1} = \mathbf{1} \\ & X \geq 0, \quad X \succeq 0 \end{array}$$

Peng, Wei, SIAM J. Optim., 2007
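A minimal cvxpy sketch of this relaxation ($X \geq 0$ entrywise, $X \succeq 0$ in the PSD sense; the factor 1/2 makes the value comparable with the k-means objective):

```python
import cvxpy as cp
import numpy as np

def kmeans_sdp(points, k):
    n = len(points)
    D = ((points[:, None] - points[None]) ** 2).sum(-1)
    X = cp.Variable((n, n), PSD=True)                 # X >= 0 in the PSD sense
    constraints = [cp.trace(X) == k,                  # Tr(X) = k
                   X @ np.ones(n) == np.ones(n),      # X1 = 1
                   X >= 0]                            # entrywise nonnegativity
    prob = cp.Problem(cp.Minimize(0.5 * cp.trace(D @ X)), constraints)
    prob.solve()
    return X.value, prob.value
```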
Stochastic ball model

(D, γ, n)-stochastic ball model:
- $\mathcal{D}$ = rotation-invariant distribution over the unit ball in $\mathbb{R}^m$
- $\gamma_1, \dots, \gamma_k$ = ball centers in $\mathbb{R}^m$
- Draw $r_{t,1}, \dots, r_{t,n}$ i.i.d. from $\mathcal{D}$ for each $t \in \{1, \dots, k\}$
- $x_{t,i} = \gamma_t + r_{t,i}$ = $i$th point from cluster $t$

Nellore, Ward, Information and Computation, 2015
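A sketch of a sampler for this model, taking $\mathcal{D}$ to be the uniform distribution on the unit ball (one rotation-invariant choice):

```python
import numpy as np

def stochastic_ball(centers, n, seed=0):
    rng = np.random.default_rng(seed)
    k, m = centers.shape
    g = rng.normal(size=(k, n, m))
    g /= np.linalg.norm(g, axis=-1, keepdims=True)    # uniform direction on the sphere
    radii = rng.uniform(size=(k, n, 1)) ** (1 / m)    # radius for uniform mass in the ball
    return centers[:, None, :] + radii * g            # x_{t,i} = gamma_t + r_{t,i}
```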
k-means SDP. Stochastic ball model

Exact recovery: the SDP is tight for k unit balls in $\mathbb{R}^m$ with centers $\gamma_1, \dots, \gamma_k$ whenever

$$\min_{i \neq j} \|\gamma_i - \gamma_j\| \geq \min\Big\{ 2\sqrt{2}\Big(1 + \frac{1}{\sqrt{m}}\Big),\ 2 + \frac{k^2}{m},\ 2 + \sqrt{\frac{2k}{m+2}} \Big\}$$

Conjecture (Li, Li, Ling, Strohmer)
Phase transition for exact recovery (under the stochastic ball model) at

$$\min_{i \neq j} \|\gamma_i - \gamma_j\| \geq 2 + \frac{c}{m}$$

Awasthi, Bandeira, Charikar, Krishnaswamy, V., Ward, Proc. ITCS, 2015
Iguchi, Mixon, Peterson, V., Mathematical Programming, 2016
Li, Li, Ling, Strohmer, arXiv:1710.06008, 2017
k-means SDP. Subgaussian mixtures

Relax and "round" (weak recovery).

Theorem
True centers $\gamma_1, \dots, \gamma_k$; estimated centers $v_1, \dots, v_k$. Then

$$\frac{1}{k} \sum_{i=1}^{k} \|v_i - \gamma_i\|^2 \lesssim k^2 \sigma^2 \quad \text{w.h.p., provided} \quad \min_{i \neq j} \|\gamma_i - \gamma_j\| \gtrsim k\sigma.$$
Example: http://solevillar.github.io/2016/07/05/Clustering-MNIST-SDP.html
Mixon, V., Ward, Information and Inference, 2017
Proof technique: Guédon, Vershynin, Probability Theory and Related Fields, 2016
SDPs are slow! Fast lower bound certificates

Want: a lower bound on the k-means value. Given $\{x_i\}_{i \in T} \subseteq \mathbb{R}^m$, let $W$ = value of the k-means++ initialization. Then

$$\operatorname{val}_{\mathrm{kmeans}}(T) := \min \frac{1}{|T|} \sum_{t=1}^{k} \sum_{i \in C_t} \Big\| x_i - \frac{1}{|C_t|} \sum_{j \in C_t} x_j \Big\|^2 \geq \frac{1}{8(\log k + 2)}\, \mathbb{E} W$$

Running k-means++ on the MNIST training set with k = 10 gives:
- Upper bound (by running the algorithm): 39.22
- Lower bound (estimated from above): 2.15

Can we get a better lower bound?

Arthur, Vassilvitskii, 2007
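A sketch of this certificate using scikit-learn's kmeans_plusplus seeding; $\mathbb{E} W$ is estimated by averaging the per-point seeding cost over independent trials:

```python
import numpy as np
from sklearn.cluster import kmeans_plusplus

def kmeanspp_lower_bound(X, k, n_trials=50):
    costs = []
    for seed in range(n_trials):
        # W = per-point cost of the k-means++ initialization (no Lloyd steps).
        centers, _ = kmeans_plusplus(X, n_clusters=k, random_state=seed)
        costs.append(((X[:, None] - centers[None]) ** 2).sum(-1).min(1).mean())
    # E[W] <= 8(log k + 2) * OPT, so OPT >= mean(W) / (8(log k + 2)).
    return np.mean(costs) / (8 * (np.log(k) + 2))
```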
Idea

Given $\{x_i\}_{i \in T} \subseteq \mathbb{R}^m$, draw $S \sim \operatorname{Unif}\binom{T}{s}$. Then

$$\begin{aligned} \mathbb{E}\operatorname{val}_{\mathrm{SDP}}(S) &\leq \mathbb{E}\operatorname{val}_{\mathrm{kmeans}}(S) \\ &\leq \mathbb{E}\, \frac{1}{s} \sum_{t=1}^{k} \sum_{i \in C_t^* \cap S} \Big\| x_i - \frac{1}{|C_t^* \cap S|} \sum_{j \in C_t^* \cap S} x_j \Big\|^2 \\ &\leq \mathbb{E}\, \frac{1}{s} \sum_{t=1}^{k} \sum_{i \in C_t^* \cap S} \Big\| x_i - \frac{1}{|C_t^*|} \sum_{j \in C_t^*} x_j \Big\|^2 = \operatorname{val}_{\mathrm{kmeans}}(T) \end{aligned}$$

Goal: Establish $\mathbb{E}\operatorname{val}_{\mathrm{SDP}}(S) > B$ with small p-value.
Mixon, V., arXiv:1710.00956
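A Monte Carlo sketch of the idea, reusing the kmeans_sdp sketch from the SDP slide (a real certificate also needs the concentration argument behind the p-value):

```python
import numpy as np

def monte_carlo_lower_bound(X, k, s=100, n_trials=20, seed=0):
    rng = np.random.default_rng(seed)
    vals = []
    for _ in range(n_trials):
        S = X[rng.choice(len(X), size=s, replace=False)]  # S ~ Unif over size-s subsets
        _, val = kmeans_sdp(S, k)                         # val_SDP on the subsample
        vals.append(val / s)                              # normalize per point
    return np.mean(vals)                                  # estimates E val_SDP(S) <= val_kmeans(T)
```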
Better performance, and fast!

[Figure, left: val(S-SDP) as a function of the subsample size s, compared with the k-means++ value and the k-means++ lower bound. Right: running time (log scale) vs. number of points (log scale) for the k-means SDP, k-means++, and the Monte Carlo certifier.]

- Taking s = 100 ≪ 60,000 MNIST points already works.
- Constant runtime in N; faster than k-means++ for N ≳ 10^6.
- Theorem for mixtures of Gaussians (additive bound).
Mixon, V., arXiv:1710.00956
Generic algorithms vs. model-dependent algorithms
Clustering the stochastic block model

- Sparse graph clustering (non-geometric)
- Similar generic algorithms (SDP, manifold optimization)
- Model-tailored algorithms (AMP, spectral methods on non-backtracking matrices and the Bethe Hessian)
- Can we learn the algorithm from data?
  - Use graph neural networks
Decelle, Krzakala, Moore, Zdeborová, 2013
Krzakala, Moore, Mossel, Neeman, Sly, Zdeborová, Zhang, 2013
Saade, Krzakala, Zdeborová, 2014
Abbe, 2018
Spectral redemption

$A \sim \mathrm{SBM}(a/n, b/n, n, 2)$, sparse. The spectrum doesn't concentrate (high-degree vertices dominate it), so the Laplacian is not useful for clustering.

Consider the non-backtracking operator (from linearized BP):

$$B_{(i \to j),(i' \to j')} = \begin{cases} 1 & \text{if } j = i' \text{ and } j' \neq i \\ 0 & \text{otherwise} \end{cases}$$

The second eigenvector of B reveals the clustering structure.
Krzakala, Moore, Mossel, Neeman, Sly, Zdeborová, Zhang, 2013
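A direct (dense) numpy sketch of B for a small adjacency matrix; the operator acts on the space of directed edges:

```python
import numpy as np

def non_backtracking(A):
    n = len(A)
    edges = [(i, j) for i in range(n) for j in range(n) if A[i, j]]  # directed edges
    B = np.zeros((len(edges), len(edges)))
    for a, (i, j) in enumerate(edges):
        for b, (i2, j2) in enumerate(edges):
            B[a, b] = (j == i2) and (j2 != i)    # continue along j without backtracking
    return B, edges
```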
Bethe Hessian

$$\mathrm{BH}(r) = (r^2 - 1)I - rA + D$$

Fixed points of BP ←→ stationary points of the Bethe free energy.
The second eigenvector reveals the clustering structure.
Pitfall: highly dependent on the model.

Goal: Combine graph operators I, D, A, ... to generate robust "data-driven spectral methods" for problems on graphs.
Saade, Krzakala, Zdeborová, 2014
Javanmard, Montanari, Ricci-Tersenghi, 2015
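A short sketch of Bethe Hessian clustering for k = 2; the choice r = sqrt(average degree) is the standard one, but an assumption here:

```python
import numpy as np

def bethe_hessian_labels(A):
    D = np.diag(A.sum(axis=1))
    r = np.sqrt(A.sum() / len(A))                    # r ~ sqrt(average degree)
    BH = (r ** 2 - 1) * np.eye(len(A)) - r * A + D
    _, vecs = np.linalg.eigh(BH)                     # eigenvalues in ascending order
    return np.sign(vecs[:, 1])                       # sign of the second eigenvector
```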
Graph neural networks

Power method: $v^{t+1} = Mv^t$, $t = 1, \dots, T$.

Graph with adjacency matrix A. Set $\mathcal{M} = \{I_n, D, A, A^2, \dots, A^{2^J}\}$,

$$v^{t+1}_l = \rho\Big( \sum_{M \in \mathcal{M}} M v^t \theta^t_{M,l} \Big), \quad l = 1, \dots, d_{t+1}$$

with $v^t \in \mathbb{R}^{n \times d_t}$ and trainable parameters $\Theta = \{\theta^t_1, \dots, \theta^t_{|\mathcal{M}|}\}_t$, $\theta^t_M \in \mathbb{R}^{d_t \times d_{t+1}}$ (a layer sketch follows below).

- Extension to the line graph (GNN with non-backtracking operator).
- Covariant w.r.t. permutations: if $G \mapsto \phi(G)$, then $G^\Pi \mapsto \Pi\phi(G)$.
- Preliminary theoretical analysis of the energy landscape.
  - Strong assumptions ⇒ local minima have low energy.
Scarselli, Tsoi, Hagenbuchner, Monfardini, IEEE Transactions on Neural Networks, 2009
Chen, Li, Bruna, 2017
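A minimal numpy sketch of one such layer (ρ = ReLU; J = 1, so $\mathcal{M} = \{I, D, A, A^2\}$; the θ's would be trained, here they are random placeholders):

```python
import numpy as np

def gnn_layer(A, v, thetas, rho=lambda x: np.maximum(x, 0)):
    n = len(A)
    D = np.diag(A.sum(axis=1))
    ops = [np.eye(n), D, A, A @ A]                          # M = {I, D, A, A^2}
    return rho(sum(M @ v @ th for M, th in zip(ops, thetas)))

# Usage: random graph, d_t = 8 input features, d_{t+1} = 4 output features.
rng = np.random.default_rng(0)
A = np.triu((rng.random((10, 10)) < 0.3).astype(float), 1); A += A.T
v = rng.normal(size=(10, 8))
out = gnn_layer(A, v, [rng.normal(size=(8, 4)) for _ in range(4)])  # shape (10, 4)
```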
Numerical performance. SBM with k = 2

[Figure: overlap as a function of SNR]
Chen, Li, Bruna, 2017
Extension to quadratic assignment

Graph matching: $\min_{X \in \Pi} \|G_1 X - X G_2\|_F^2 = \|G_1 X\|^2 + \|X G_2\|^2 - 2\langle G_1 X, X G_2 \rangle$

Siamese neural network, trained on pairs $G_2 = \pi G_1 + N$ with noise $N \sim$ Erdős-Rényi, for $G_1 \sim$ Erdős-Rényi and $G_1 \sim$ random regular.
Nowak, V., Bandeira, Bruna, 2017
Numerical experiment

[Figure: recovery rate vs. noise (0.00 to 0.05) for SDP, LowRankAlign (k=4), and GNN, on the Erdős-Rényi and the random regular graph models]

Our model runs in O(n²), LowRankAlign in O(n³), and the SDP in O(n⁴).
Nowak, V., Bandeira, Bruna, 2017
Quadratic Assignment Problem (QAP)

[Figure: two weighted graphs, with weights w_ij, w_ik, w_jk among nodes i, j, k and distances d_rs, d_rt, d_st among nodes r, s, t]

$$\begin{array}{ll} \displaystyle\min_\pi & \displaystyle\sum_{i \in A} \sum_{j \in A} \sum_{r \in B} \sum_{s \in B} \underbrace{w_{ij} d_{rs}}_{E_{ijrs}}\, \pi_{ir} \pi_{js} \\[1em] \text{subject to} & \displaystyle\sum_r \pi_{ir} = 1 \quad \forall i \in A, \qquad \sum_i \pi_{ir} = 1 \quad \forall r \in B, \qquad \pi_{ir} \in \{0, 1\} \end{array}$$
Onaran, Villar, ongoing work
Max-Product Belief Propagation for QAP

- Posterior probability:

$$p(\pi) = \frac{1}{Z} \prod_i \mathbf{1}\Big\{\sum_t \pi_{it} = 1\Big\} \prod_r \mathbf{1}\Big\{\sum_k \pi_{kr} = 1\Big\} \prod_{j \neq i} \prod_{s \neq r} e^{-\beta \pi_{ir} \pi_{js} E_{ijrs}}$$

[Figure: factor graph with variable nodes i1, ..., ir, ..., in and check node *, showing messages ν̂_{*→ir}, ν̂_{ir→*}, ν̂_{ir→js}, ν̂_{js→ir}]
Max-Product Message Updates

- Variable to check nodes:

$$\nu_{ir \to *}(1) \cong \prod_{kt} e^{-\beta \hat{\nu}_{kt \to ir}(1) E_{irkt}}, \qquad \nu_{ir \to *}(0) \cong 1$$

- Check to variable nodes:

$$\nu_{* \to ir}(1) \cong \nu_{ir \to *}(1) \prod_{\ell \neq i} \nu_{\ell r \to *}(0) \prod_{t \neq r} \nu_{it \to *}(0)$$

$$\nu_{* \to ir}(0) \cong \nu_{ir \to *}(0) \max_{t \neq r,\, \ell \neq i} \nu_{it \to *}(1)\, \nu_{\ell r \to *}(1) \prod_{u \neq t,r} \nu_{iu \to *}(0) \prod_{m \neq \ell, i} \nu_{mr \to *}(0)$$

- Variable back to variable nodes:

$$\hat{\nu}_{ir \to js}(1) \cong \nu_{* \to ir}(1) \prod_{ku \neq js} e^{-\beta \hat{\nu}_{ku \to ir}(1) E_{irku}}, \qquad \hat{\nu}_{ir \to js}(0) \cong \nu_{* \to ir}(0)$$
Simplifications (Min-Sum Algorithm)

Introduce the log-likelihood ratio:

$$\hat{\ell}_{ir \to js} = \frac{1}{\beta} \log \frac{\hat{\nu}_{ir \to js}(1)}{\hat{\nu}_{ir \to js}(0)}$$

Then the algorithm simplifies to a single update step:

$$\hat{\ell}_{ir \to js} \leftarrow \min_{t \neq r,\, \ell \neq i} \sum_{ku} \left( \frac{e^{\beta \hat{\ell}_{ku \to it}}}{1 + e^{\beta \hat{\ell}_{ku \to it}}} E_{itku} + \frac{e^{\beta \hat{\ell}_{ku \to \ell r}}}{1 + e^{\beta \hat{\ell}_{ku \to \ell r}}} E_{\ell r ku} \right) - 2 \sum_{ku \neq js} \frac{e^{\beta \hat{\ell}_{ku \to ir}}}{1 + e^{\beta \hat{\ell}_{ku \to ir}}} E_{irku} + \frac{e^{\beta \hat{\ell}_{js \to ir}}}{1 + e^{\beta \hat{\ell}_{js \to ir}}} E_{irjs}$$
A special case: Graph Matching (Isomorphism)

[Figure: Graph-A on nodes i, j, k, ℓ and Graph-B on nodes r, s, t, u]

$$\begin{array}{ll} \displaystyle\min_P & \|A - PBP^\top\|_F \\ \text{subject to} & P \in \Pi \end{array}$$

Equivalent to QAP with $E_{ijrs} = A_{ij} \oplus B_{rs}$.

When $A \sim \mathrm{ER}(n, 0.5)$ and $B = PAP^\top \oplus Z$ for some $P \in \Pi$, where $Z \sim \mathrm{Ber}(\epsilon)$: find P given the unlabeled graphs.
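A sketch of generating a planted instance of this model (⊕ is entrywise XOR on adjacency matrices; ε is the Bernoulli noise level):

```python
import numpy as np

def planted_instance(n, eps, seed=0):
    rng = np.random.default_rng(seed)
    A = np.triu((rng.random((n, n)) < 0.5).astype(int), 1); A += A.T   # A ~ ER(n, 0.5)
    P = np.eye(n, dtype=int)[rng.permutation(n)]                       # planted permutation
    Z = np.triu((rng.random((n, n)) < eps).astype(int), 1); Z += Z.T   # Z ~ Ber(eps)
    B = (P @ A @ P.T) ^ Z                                              # B = P A P^T xor Z
    return A, B, P
```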
[Figure: BP success rate for graph matching on ER(n, 0.5), as a function of the noise ε (0.0 to 0.48) and n (15 to 75)]
Summary

- k-means clustering
  - Nice manifold optimization landscape
  - Lower bound certificates from SDPs (data-driven)
- Clustering the stochastic block model
  - AMP → Spectral methods → Graph neural networks
- Quadratic assignment
Thank you

Relax, no need to round: Integrality of clustering formulations
P. Awasthi, A. S. Bandeira, M. Charikar, R. Krishnaswamy, S. Villar, R. Ward
arXiv:1408.4045, Innovations in Theoretical Computer Science (ITCS), 2015

Probably certifiably correct k-means clustering
T. Iguchi, D. G. Mixon, J. Peterson, S. Villar
arXiv:1509.07983, Mathematical Programming, 2016

Clustering subgaussian mixtures by semidefinite programming
D. G. Mixon, S. Villar, R. Ward
arXiv:1602.06612, Information and Inference: A Journal of the IMA, 2017

Monte Carlo approximation certificates for k-means clustering
D. G. Mixon, S. Villar
arXiv:1710.00956

Supervised Community Detection with Hierarchical Graph Neural Networks
Z. Chen, L. Li, J. Bruna
arXiv:1705.08415

A note on learning algorithms for quadratic assignment with Graph Neural Networks
A. Nowak, S. Villar, A. S. Bandeira, J. Bruna
arXiv:1706.07450, ICML (PADL), 2017