-
1
Optimization and learning techniques for clustering problems
Soledad Villar
Center for Data Science, Courant Institute of Mathematical Sciences
Statistical Physics and Machine Learning back together, Cargèse, August 2018
-
2
Clustering
“Next AI revolution is unsupervised” (Yann LeCun)
Main task in unsupervised machine learning:
I Finding structure in unlabeled data.
k-means clustering
I Simple objective
I Useful for “generic data”
Clustering stochastic block model
I Specialized objective
I Algorithms get more precise but less robust to model changes
Data-driven clustering methods.
-
3
Clustering
“Next AI revolution is unsupervised” (Yann LeCun)
Main task in unsupervised machine learning:
I Finding structure in unlabeled data.
Example: Cluster analysis and display of genome-wide expression patterns.
I Gene expression as a function of time for different conditions.
I Clustering gene expression data groups together genes of known similar functionality.
Eisen, Spellman, Brown, Botstein, PNAS, 1998
-
4
Summary
I k-means clustering
  I Landscape of manifold optimization
  I Lower bound certificates from SDPs (data-driven)
I Clustering the stochastic block model
  I AMP → spectral methods → graph neural networks
I Quadratic assignment
-
5
The k-means problem
Given a point cloud {x_i}, partition the points into clusters C_1, . . . , C_k.
k-means objective:
$$\min_{C_1,\dots,C_k}\;\sum_{t=1}^{k}\sum_{i\in C_t}\left\|x_i-\frac{1}{|C_t|}\sum_{j\in C_t}x_j\right\|^2$$
I NP-hard to minimize in general (even in the plane).
-
6
Lloyd’s algorithm
1. Choose k centers at random.
2. Assign points to closest center.
3. Compute new centers as the cluster centroids.
k-means++ is Lloyd’s with smarter initialization.
Pros:
I Fast.
I Very easy to implement.
I Widely used and it works for most applications.
Cons:
I No guarantee of convergence to the global minimum (local minima).
I In extreme cases it may take exponentially many steps.
I Solutions depend heavily on initialization.
I Its output doesn't indicate how good a solution it is.
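The three steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the slide's implementation; the plain random initialization (rather than k-means++ seeding) and the toy data are assumptions:

```python
import numpy as np

def lloyd(X, k, n_iter=100, seed=0):
    """Lloyd's algorithm: random init, then alternate assignment/centroid steps."""
    rng = np.random.default_rng(seed)
    # 1. choose k centers at random (k-means++ would seed more carefully)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # 2. assign each point to its closest center
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d2.argmin(axis=1)
        # 3. compute new centers as the cluster centroids
        new = np.array([X[labels == t].mean(axis=0) if (labels == t).any()
                        else centers[t] for t in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return labels, centers

# two well-separated blobs: Lloyd's recovers them from a random start
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.1, (20, 2)), rng.normal(5, 0.1, (20, 2))])
labels, centers = lloyd(X, 2)
```

On hard instances the same code can stall at a poor local minimum, which is exactly the "depends heavily on initialization" caveat above.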
-
7
Optimization formulation
Taking D_ij := ‖x_i − x_j‖², then

$$\sum_{t=1}^{k}\sum_{i\in C_t}\left\|x_i-\frac{1}{|C_t|}\sum_{j\in C_t}x_j\right\|^2 = \frac{1}{2}\operatorname{Tr}\Big(D\underbrace{\sum_{t=1}^{k}\frac{1}{|C_t|}\mathbf{1}_{C_t}\mathbf{1}_{C_t}^{\top}}_{=\,YY^{\top}}\Big)$$

$$\begin{aligned}
\text{minimize}\quad & \operatorname{Tr}(DYY^{\top})\\
\text{subject to}\quad & Y\in\mathbb{R}^{n\times k}\\
& Y^{\top}Y = I_k\\
& YY^{\top}\mathbf{1} = \mathbf{1}\\
& Y\ge 0
\end{aligned}$$

The set of feasible Y's is discrete.
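The identity behind this reformulation can be checked numerically: the within-cluster sum of squared deviations equals half the trace of D against the normalized cluster-indicator matrix. A small check on random data (the toy dimensions and partition are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(12, 3))
clusters = [np.arange(0, 4), np.arange(4, 9), np.arange(9, 12)]

# left-hand side: within-cluster sum of squared deviations from the centroids
lhs = sum(((X[C] - X[C].mean(0)) ** 2).sum() for C in clusters)

# right-hand side: (1/2) Tr(D YY^T) with D_ij = ||x_i - x_j||^2
D = ((X[:, None] - X[None, :]) ** 2).sum(-1)
YYt = np.zeros((12, 12))
for C in clusters:
    YYt[np.ix_(C, C)] = 1.0 / len(C)   # block (1/|C_t|) 1_{C_t} 1_{C_t}^T
rhs = 0.5 * np.trace(D @ YYt)

assert np.isclose(lhs, rhs)
```

The underlying fact is the classical identity $\sum_{i\in C}\|x_i-\bar x\|^2 = \frac{1}{2|C|}\sum_{i,j\in C}\|x_i-x_j\|^2$ applied cluster by cluster.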
-
8
Manifold optimization implementation
$$\begin{aligned}
\text{minimize}\quad & \operatorname{Tr}(DYY^{\top}) + \lambda\|Y_-\|^2\\
\text{subject to}\quad & Y\in\mathcal{M}=\{Y\in\mathbb{R}^{n\times k} : Y^{\top}Y = I_k,\; YY^{\top}\mathbf{1}=\mathbf{1}\}
\end{aligned}$$

(Y_− denotes the entrywise negative part of Y, penalizing violations of Y ≥ 0.)
To implement the manifold optimization one needs:
$$\operatorname{grad} f : \mathcal{M}\to T\mathcal{M},\qquad \operatorname{retr}_Y : T_Y\mathcal{M}\to\mathcal{M},$$
where
$$\operatorname{retr}_Y(0) = Y,\qquad \frac{d}{dt}\Big|_{t=0}\operatorname{retr}_Y(tV) = V.$$
Gradient descent:
$$Y_{n+1} = \operatorname{retr}_{Y_n}\big(-\alpha_n \operatorname{grad} f(Y_n)\big).$$
Sepulchre, Absil, Mahony, 2008
Carson, Mixon, V., Ward, SampTA, 2017
-
9
Manifold optimization, retraction implementation
Homogeneous structure of M. Let O(1_n) be the n × n orthogonal matrices that fix 1.

$$\mathcal{M}\times O(\mathbf{1}_n)\times O(k)\to\mathcal{M},\qquad (Y,Q,R)\mapsto QYR \quad\text{(transitive action)}$$

Multiplication of Y on the right by R is equivalent to multiplication on the left by R′ = YRY⊤ ∈ O(1_n).

Any V ∈ T_Y M can be written as V = BY + YA, where A ∈ so(k), B ∈ so(1_n), and BY and YA are orthogonal. We compute A, B explicitly:

$$A = Y^{\top}V,\qquad B = VY^{\top} - YV^{\top} - 2YAY^{\top},\qquad \operatorname{retr}_Y(V) = \exp(B)\exp(A')Y \in\mathcal{M}$$

(with A′ = YAY⊤, following the R′ convention above).
-
10
Manifold optimization algorithm
λ_0 ← 0
repeat
  Y_{n+1} ← GradientDescent(f_λ)  {initialized at Y_n}
  λ_{n+1} ← 2λ_n + 1
until ‖Y_−‖_F < ε

$$\begin{aligned}
\text{minimize}_Y\quad & f_\lambda(Y) = \operatorname{Tr}(DYY^{\top}) + \lambda\|Y_-\|^2\\
\text{subject to}\quad & Y\in\mathcal{M}=\{Y\in\mathbb{R}^{n\times k} : Y^{\top}Y = I_k,\; YY^{\top}\mathbf{1}=\mathbf{1}\}
\end{aligned}$$
[Figure: iterates of the algorithm at λ = 0, 6300, 2.047e+05, 2.6214e+07.]
The problem structure allows each iteration to run in linear time.
Carson, Mixon, V., Ward, SampTA, 2017
-
11
Claim: clustering is (maybe) not (that) hard
Optimization landscape
Points from three well-separated unit balls in small dimension.
-
12
Claim: clustering is (maybe) not (that) hard
No clustering structure

Points drawn from a unit ball, partitioned into three clusters, in small dimension.
And it’s optimal! How do I know that?
-
13
k-means optimization formulation
Recall D_ij := ‖x_i − x_j‖², then

$$\sum_{t=1}^{k}\sum_{i\in C_t}\left\|x_i-\frac{1}{|C_t|}\sum_{j\in C_t}x_j\right\|^2 = \frac{1}{2}\operatorname{Tr}\Big(D\underbrace{\sum_{t=1}^{k}\frac{1}{|C_t|}\mathbf{1}_{C_t}\mathbf{1}_{C_t}^{\top}}_{=\,YY^{\top}}\Big)$$

$$\begin{aligned}
\text{minimize}\quad & \operatorname{Tr}(DYY^{\top})\\
\text{subject to}\quad & Y\in\mathbb{R}^{n\times k}\\
& Y^{\top}Y = I_k\\
& YY^{\top}\mathbf{1} = \mathbf{1}\\
& Y\ge 0
\end{aligned}$$

The set of feasible Y's is discrete.
-
14
A semidefinite programming relaxation
Recall D_ij := ‖x_i − x_j‖², then

$$\sum_{t=1}^{k}\sum_{i\in C_t}\left\|x_i-\frac{1}{|C_t|}\sum_{j\in C_t}x_j\right\|^2 = \frac{1}{2}\operatorname{Tr}\Big(D\underbrace{\sum_{t=1}^{k}\frac{1}{|C_t|}\mathbf{1}_{C_t}\mathbf{1}_{C_t}^{\top}}_{=\,YY^{\top}}\Big)$$

Relax to an SDP:

$$\begin{aligned}
\text{minimize}\quad & \operatorname{Tr}(DX)\\
\text{subject to}\quad & \operatorname{Tr}(X) = k\\
& X\mathbf{1} = \mathbf{1}\\
& X\ge 0 \text{ (entrywise)}\\
& X\succeq 0
\end{aligned}$$
Peng, Wei, SIAM J. Optim., 2007
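Any partition gives a feasible point of this SDP: taking X = ∑_t (1/|C_t|) 1_{C_t} 1_{C_t}⊤ satisfies all four constraints, so the SDP value lower-bounds the k-means value. The feasibility claim is easy to verify numerically (the toy partition below is an assumption for illustration):

```python
import numpy as np

n = 10
clusters = [np.arange(0, 3), np.arange(3, 7), np.arange(7, 10)]
X = np.zeros((n, n))
for C in clusters:
    X[np.ix_(C, C)] = 1.0 / len(C)   # block (1/|C_t|) 1_{C_t} 1_{C_t}^T

assert np.isclose(np.trace(X), 3)            # Tr(X) = k
assert np.allclose(X @ np.ones(n), 1)        # X1 = 1
assert (X >= 0).all()                        # entrywise nonnegative
assert np.linalg.eigvalsh(X).min() > -1e-9   # X is positive semidefinite
```

The relaxation consists precisely in forgetting that X must be block-structured like this; the SDP optimizes over all matrices satisfying the four constraints.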
-
15
Stochastic ball model
(D, γ, n)-stochastic ball model
I D = rotation-invariant distribution over unit ball in Rm
I γ1, . . . , γk = ball centers in Rm
I Draw r_{t,1}, . . . , r_{t,n} i.i.d. from D for each t ∈ {1, . . . , k}
I x_{t,i} = γ_t + r_{t,i} = ith point from cluster t
Nellore, Ward, Information and Computation, 2015
-
17
k-means SDP. Stochastic ball model
Exact recovery
The SDP is tight for k unit balls in ℝ^m with centers γ_1, . . . , γ_k whenever

$$\min_{i\neq j}\|\gamma_i-\gamma_j\| \ge \min\left\{2\sqrt{2}\left(1+\frac{1}{\sqrt{m}}\right),\; 2+\frac{k^2}{m},\; 2+\sqrt{\frac{2k}{m+2}}\right\}$$
Conjecture (Li, Li, Ling, Strohmer)
Phase transition for exact recovery (under stochastic ball model) at
$$\min_{i\neq j}\|\gamma_i-\gamma_j\| \ge 2 + \frac{c}{m}$$
Awasthi, Bandeira, Charikar, Krishnaswamy, V., Ward, Proc. ITCS, 2015
Iguchi, Mixon, Peterson, V., Mathematical Programming, 2016
Li, Li, Ling, Strohmer, arXiv:1710.06008 2017
-
18
k-means SDP. Subgaussian mixtures
Relax and “round” (weak recovery).
Theorem
Centers γ̂_1, . . . , γ̂_k; estimated centers v_1, . . . , v_k. Then

$$\frac{1}{k}\sum_{i=1}^{k}\|v_i-\hat\gamma_i\|^2 \lesssim k^2\sigma^2 \quad\text{w.h.p., provided}\quad \min_{i\neq j}\|\gamma_i-\gamma_j\| \gtrsim k\sigma.$$
Example: http://solevillar.github.io/2016/07/05/Clustering-MNIST-SDP.html
Mixon, V., Ward, Information and Inference, 2017
Proof technique: Guédon, Vershynin, Probability Theory and Related Fields, 2016
-
19
SDPs are slow!
Fast lower bound certificates
Want: a lower bound on the k-means value.

Given {x_i}_{i∈T} ⊆ ℝ^m, let W = value of the k-means++ initialization.
$$\operatorname{val}_{\text{kmeans}}(T) := \min\;\frac{1}{|T|}\sum_{t=1}^{k}\sum_{i\in C_t}\left\|x_i-\frac{1}{|C_t|}\sum_{j\in C_t}x_j\right\|^2 \;\ge\; \frac{1}{8(\log k+2)}\,\mathbb{E}\,W$$

Running k-means++ on the MNIST training set with k = 10 gives:
I Upper bound (by running the algorithm): 39.22
I Lower bound (estimated from above): 2.15
Can we get a better lower bound?
Arthur, Vassilvitskii, 2007
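A minimal sketch of the D²-seeding step and the resulting lower bound, using the Arthur–Vassilvitskii guarantee that the expected seeding cost is within a factor 8(log k + 2) of optimal; the Gaussian toy data and the number of repetitions are assumptions:

```python
import numpy as np

def kmeanspp_cost(X, k, rng):
    """Cost of the k-means++ initialization (D^2 seeding), before any Lloyd steps."""
    centers = [X[rng.integers(len(X))]]
    d2 = ((X - centers[0]) ** 2).sum(1)
    for _ in range(k - 1):
        p = d2 / d2.sum()                        # sample prop. to squared distance
        centers.append(X[rng.choice(len(X), p=p)])
        d2 = np.minimum(d2, ((X - centers[-1]) ** 2).sum(1))
    return d2.mean()                             # normalized by |T|, as in val_kmeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.3, (50, 2)) for c in [(0, 0), (4, 0), (0, 4)]])
k = 3
W = np.mean([kmeanspp_cost(X, k, rng) for _ in range(200)])
lower = W / (8 * (np.log(k) + 2))   # E[W] <= 8(log k + 2) * OPT, so OPT >= E[W]/(8(log k + 2))
print(f"kmeans++ value ~ {W:.3f}, crude lower bound ~ {lower:.3f}")
```

As on the MNIST numbers above, the gap between the upper bound (a k-means++ run) and this generic lower bound is large, which motivates the tighter SDP-based certificates on the next slides.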
-
20
Idea
Given {x_i}_{i∈T} ⊆ ℝ^m, draw S ∼ Unif$\binom{T}{s}$. Then

$$\begin{aligned}
\mathbb{E}\operatorname{val}_{\text{SDP}}(S) &\le \mathbb{E}\operatorname{val}_{\text{kmeans}}(S)\\
&\le \mathbb{E}\,\frac{1}{s}\sum_{t=1}^{k}\sum_{i\in C_t^*\cap S}\left\|x_i-\frac{1}{|C_t^*\cap S|}\sum_{j\in C_t^*\cap S}x_j\right\|^2\\
&\le \mathbb{E}\,\frac{1}{s}\sum_{t=1}^{k}\sum_{i\in C_t^*\cap S}\left\|x_i-\frac{1}{|C_t^*|}\sum_{j\in C_t^*}x_j\right\|^2 = \operatorname{val}_{\text{kmeans}}(T)
\end{aligned}$$
Goal: Establish E valSDP(S) > B with small p-value.
Mixon, V., arXiv:1710.00956
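The middle-to-last inequality in the chain holds sample by sample, because the centroid of C*_t ∩ S minimizes the sum of squared distances over that subsample, so replacing it by the full-data centroid can only increase the cost. A numeric sanity check with planted clusters (the toy data is an assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.5, (100, 2)) for c in [(0, 0), (5, 0)]])
labels = np.repeat([0, 1], 100)          # planted clusters C*_t

s_idx = rng.choice(len(X), size=40, replace=False)   # a uniform subsample S
cost_S_centroids = 0.0   # centroids recomputed on C*_t ∩ S
cost_T_centroids = 0.0   # centroids of the full clusters C*_t
for t in range(2):
    sel = X[s_idx][labels[s_idx] == t]
    if len(sel) == 0:
        continue
    cost_S_centroids += ((sel - sel.mean(0)) ** 2).sum()
    cost_T_centroids += ((sel - X[labels == t].mean(0)) ** 2).sum()

# the centroid is the minimizer of the sum of squared distances, hence:
assert cost_S_centroids <= cost_T_centroids + 1e-9
```

Averaging the right-hand side over random subsamples S is what connects the cheap subsampled SDP values to the full-data k-means value.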
-
21
Better performance, and fast!
[Figure, left: val(S-SDP) as a function of the subsample size s (100–450), compared with the kmeans++ value and the kmeans++ lower bound. Right, log–log axes: running time vs. number of points for the k-means SDP, k-means++, and the Monte Carlo certifier.]
I Taking s = 100 ≪ 60,000 MNIST points already works.
I Constant runtime in N; faster than k-means++ for N ≳ 10⁶.
I Theorem for mixtures of Gaussians (additive bound).
I Theorem for mixtures of Gaussians (additive bound)
Mixon, V., arXiv:1710.00956
-
22
Generic algorithms vs. model-dependent algorithms
Clustering the stochastic block model
I Sparse graph clustering (non-geometric)
I Similar generic algorithms apply (SDP, manifold optimization)
I Model-tailored algorithms (AMP, spectral methods on non-backtracking matrices and the Bethe Hessian)
I Can we learn the algorithm from data?
  I Use graph neural networks
Decelle, Krzakala, Moore, Zdeborová, 2013
Krzakala, Moore, Mossel, Neeman, Sly, Zdeborová, Zhang, 2013
Saade, Krzakala, Zdeborová, 2014
Abbe, 2018
-
23
Spectral redemption
Let A ∼ SBM(a/n, b/n, n, 2) be sparse. The spectrum doesn't concentrate (high-degree vertices dominate it), so the Laplacian is not useful for clustering.
Consider the non-backtracking operator (from linearized BP)
$$B_{(i\to j),(i'\to j')} = \begin{cases}1 & \text{if } j = i' \text{ and } j' \neq i\\ 0 & \text{otherwise}\end{cases}$$
Second eigenvector of B reveals clustering structure
Krzakala, Moore, Mossel, Neeman, Sly, Zdeborová, Zhang, 2013
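The operator acts on directed edges (darts) rather than vertices, so it has 2|E| rows and columns. A direct construction on a toy graph (two triangles joined by one edge; the graph is an assumption for illustration):

```python
import numpy as np

def nonbacktracking(edges):
    """Non-backtracking operator on directed edges:
    B[(i->j),(i'->j')] = 1 iff j == i' and j' != i (walk continues, no U-turn)."""
    darts = [(i, j) for i, j in edges] + [(j, i) for i, j in edges]
    m = len(darts)
    B = np.zeros((m, m))
    for a, (i, j) in enumerate(darts):
        for b, (i2, j2) in enumerate(darts):
            if j == i2 and j2 != i:
                B[a, b] = 1.0
    return B, darts

# toy "two communities": two triangles joined by the bridge (2, 3)
edges = [(0, 1), (1, 2), (2, 0), (3, 4), (4, 5), (5, 3), (2, 3)]
B, darts = nonbacktracking(edges)
```

A quick structural check: the row of a dart (i→j) sums to deg(j) − 1, the number of ways to continue the walk out of j without backtracking.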
-
24
Bethe Hessian
$$BH(r) = (r^2 - 1)I - rA + D$$
Fixed points of BP ←→ stationary points of the Bethe free energy.
The second eigenvector reveals the clustering structure.
Pitfall: highly dependent on the model.
Goal: combine graph operators I, D, A, . . . to generate robust “data-driven spectral methods” for problems on graphs.
Saade, Krzakala, Zdeborová, 2014
Javanmard, Montanari, Ricci-Tersenghi, 2015
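Forming BH(r) requires only the adjacency and degree matrices. A sketch on a toy two-clique graph, taking r = √(mean degree) as is common in this line of work (the toy graph and the choice of r are assumptions):

```python
import numpy as np

def bethe_hessian(A, r):
    """BH(r) = (r^2 - 1) I - r A + D, with D the diagonal degree matrix."""
    n = A.shape[0]
    return (r**2 - 1) * np.eye(n) - r * A + np.diag(A.sum(1))

# toy graph: two 4-cliques joined by a single edge
n = 8
A = np.zeros((n, n))
for block in (range(0, 4), range(4, 8)):
    for i in block:
        for j in block:
            if i != j:
                A[i, j] = 1.0
A[3, 4] = A[4, 3] = 1.0

r = np.sqrt(A.sum(1).mean())     # r = sqrt of the mean degree
H = bethe_hessian(A, r)
w, V = np.linalg.eigh(H)         # eigenvalues in ascending order
# the eigenvectors of the smallest (most negative) eigenvalues are the ones
# expected to carry the community structure
print(np.round(w[:2], 3))
```

Unlike the non-backtracking operator, BH(r) is symmetric and vertex-indexed, so standard symmetric eigensolvers apply directly.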
-
25
Graph neural networks
Power method: $v^{t+1} = Mv^t$, $t = 1,\dots,T$.

Graph with adjacency matrix A. Set $\mathcal{M} = \{I_n, D, A, A^2, \dots, A^{2J}\}$,

$$v^{t+1}_l = \rho\Big(\sum_{M\in\mathcal{M}} M v^t \theta^t_{M,l}\Big), \qquad l = 1,\dots,d_{t+1}$$

with $v^t \in \mathbb{R}^{n\times d_t}$ and trainable parameters $\Theta = \{\theta^t_1,\dots,\theta^t_{|\mathcal{M}|}\}_t$, $\theta^t_M \in \mathbb{R}^{d_t\times d_{t+1}}$.
I Extension to line graph (GNN with non-backtracking operator).
I Covariant w.r.t. permutations: G ↦ φ(G) implies GΠ ↦ Πφ(G).
I Preliminary theoretical analysis of energy landscape.
  I Strong assumptions ⇒ local minima have low energy.
Scarselli, Tsoi, Hagenbuchner, Monfardini, IEEE Transactions on Neural Networks, 2009
Chen, Li, Bruna, 2017
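One such layer can be sketched with a small fixed operator family {I, D, A, A²}, a truncation of the slide's family M assumed here for brevity; the permutation covariance from the bullet above is then checked numerically:

```python
import numpy as np

def gnn_layer(v, A, theta, rho=np.tanh):
    """One layer v^{t+1} = rho(sum_M M v theta_M) over a small operator family."""
    n = A.shape[0]
    ops = [np.eye(n), np.diag(A.sum(1)), A, A @ A]   # {I, D, A, A^2}
    return rho(sum(M @ v @ th for M, th in zip(ops, theta)))

rng = np.random.default_rng(0)
n, d_in, d_out = 6, 3, 4
A = np.triu(rng.integers(0, 2, (n, n)), 1).astype(float)
A = A + A.T                                  # random undirected toy graph
v = rng.normal(size=(n, d_in))               # node features
theta = [0.1 * rng.normal(size=(d_in, d_out)) for _ in range(4)]
out = gnn_layer(v, A, theta)

# covariance w.r.t. permutations: relabeling the nodes permutes the output rows,
# because each operator in the family conjugates by the permutation matrix
P = np.eye(n)[rng.permutation(n)]
assert np.allclose(gnn_layer(P @ v, P @ A @ P.T, theta), P @ out)
```

Training would fit the θ's by gradient descent on labeled community assignments; this sketch only exercises the forward pass and the symmetry.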
-
26
Numerical performance. SBM k = 2
Overlap as a function of SNR.
Chen, Li, Bruna, 2017
-
27
Extension to quadratic assignment
Graph matching: $\min_{X\in\Pi} \|G_1X - XG_2\|_F^2$, where $\|G_1X - XG_2\|_F^2 = \|G_1X\|^2 + \|XG_2\|^2 - 2\langle G_1X, XG_2\rangle$.
Siamese neural network:
G₂ = πG₁ + N, with N ∼ Erdős–Rényi, and G₁ ∼ Erdős–Rényi or G₁ ∼ random regular.
Nowak, V., Bandeira, Bruna, 2017
-
28
Numerical experiment
[Figure: recovery rate vs. noise (0.00–0.05) for the Erdős–Rényi and random regular graph models, comparing SDP, LowRankAlign (k = 4), and GNN.]
Our model runs in O(n²), LowRankAlign in O(n³), and the SDP in O(n⁴).
Nowak, V., Bandeira, Bruna, 2017
-
29
Quadratic Assignment Problem (QAP)
[Figure: two weighted graphs, with weights w_ij, w_ik, w_jk among vertices i, j, k ∈ A and distances d_rs, d_rt, d_st among r, s, t ∈ B.]

$$\begin{aligned}
\min_\pi\quad & \sum_{i\in A}\sum_{j\in A}\sum_{r\in B}\sum_{s\in B} \underbrace{w_{ij}\, d_{rs}}_{E_{ijrs}}\; \pi_{ir}\pi_{js}\\
\text{subject to}\quad & \sum_r \pi_{ir} = 1 \;\;\forall i\in A,\qquad \sum_i \pi_{ir} = 1 \;\;\forall r\in B,\qquad \pi_{ir}\in\{0,1\}
\end{aligned}$$
Onaran, Villar, ongoing work
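For tiny instances the QAP objective can be minimized by brute force over permutations, which makes the formulation concrete (the exhaustive search and the toy isomorphic-graphs instance are illustrations, not a practical method):

```python
import numpy as np
from itertools import permutations

def qap_bruteforce(W, D):
    """Minimize sum_{i,j} W_ij D_{pi(i)pi(j)} over all permutations (tiny n only)."""
    n = W.shape[0]
    best, best_pi = np.inf, None
    for pi in permutations(range(n)):
        cost = sum(W[i, j] * D[pi[i], pi[j]] for i in range(n) for j in range(n))
        if cost < best:
            best, best_pi = cost, pi
    return best, best_pi

# sanity check: matching a graph to an isomorphic copy attains zero mismatch cost
rng = np.random.default_rng(0)
n = 5
A = np.triu((rng.random((n, n)) < 0.5).astype(float), 1)
A = A + A.T
perm = rng.permutation(n)
P = np.eye(n)[perm]
B = P @ A @ P.T                  # isomorphic copy of A
# with W = A and D = 1 - B, the cost counts edges of A mapped to non-edges of B,
# so a true isomorphism drives it to zero
best, best_pi = qap_bruteforce(A, 1 - B)
```

The n! search is hopeless beyond toy sizes; that hardness is exactly what motivates the BP and GNN approaches on the surrounding slides.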
-
30
Max-Product Belief Propagation for QAP
I Posterior probability:

$$p(\pi) = \frac{1}{Z}\prod_i \mathbb{1}\Big\{\sum_t \pi_{it} = 1\Big\}\prod_r \mathbb{1}\Big\{\sum_k \pi_{kr} = 1\Big\}\prod_{j\neq i}\prod_{s\neq r} e^{-\beta\,\pi_{ir}\pi_{js} E_{ijrs}}$$
[Figure: factor graph with variable nodes π_{i1}, . . . , π_{ir}, . . . , π_{in}, check nodes ∗, and messages ν̂_{∗→ir}, ν̂_{ir→∗}, ν̂_{ir→js}, ν̂_{js→ir}.]
-
31
Max-Product Message Updates

I Variable to check nodes:

$$\nu_{ir\to *}(1) \cong \prod_{kt} e^{-\beta\,\hat\nu_{kt\to ir}(1)\,E_{irkt}}, \qquad \nu_{ir\to *}(0) \cong 1$$

I Check to variable nodes:

$$\nu_{*\to ir}(1) \cong \nu_{ir\to *}(1)\prod_{\ell\neq i}\nu_{\ell r\to *}(0)\prod_{t\neq r}\nu_{it\to *}(0)$$

$$\nu_{*\to ir}(0) \cong \nu_{ir\to *}(0)\max_{t\neq r,\;\ell\neq i}\;\nu_{it\to *}(1)\,\nu_{\ell r\to *}(1)\prod_{u\neq t,r}\nu_{iu\to *}(0)\prod_{m\neq \ell,i}\nu_{mr\to *}(0)$$

I Variable back to variable nodes:

$$\hat\nu_{ir\to js}(1) \cong \nu_{*\to ir}(1)\prod_{ku\neq js} e^{-\beta\,\hat\nu_{ku\to ir}(1)\,E_{irku}}, \qquad \hat\nu_{ir\to js}(0) \cong \nu_{*\to ir}(0)$$
-
32
Simplifications (Min-Sum Algorithm)
Introduce the log-likelihood ratio:

$$\hat\ell_{ir\to js} = \frac{1}{\beta}\log\frac{\hat\nu_{ir\to js}(1)}{\hat\nu_{ir\to js}(0)}$$

Then the algorithm simplifies to a single update step:

$$\hat\ell_{ir\to js} \leftarrow \min_{t\neq r,\;\ell\neq i}\sum_{ku}\left[\frac{e^{\beta\hat\ell_{ku\to it}}}{1+e^{\beta\hat\ell_{ku\to it}}}E_{itku}+\frac{e^{\beta\hat\ell_{ku\to \ell r}}}{1+e^{\beta\hat\ell_{ku\to \ell r}}}E_{\ell rku}\right] - 2\sum_{ku\neq js}\frac{e^{\beta\hat\ell_{ku\to ir}}}{1+e^{\beta\hat\ell_{ku\to ir}}}E_{irku} + \frac{e^{\beta\hat\ell_{js\to ir}}}{1+e^{\beta\hat\ell_{js\to ir}}}E_{irjs}$$
-
33
A special case: Graph Matching (Isomorphism)
[Figure: Graph-A on vertices i, j, k, ℓ and Graph-B on vertices r, s, t, u.]

$$\min_P \|A - PBP^{\top}\|_F \quad\text{subject to } P\in\Pi,$$

equivalent to QAP with $E_{ijrs} = A_{ij}\oplus B_{rs}$.

When A ∼ ER(n, 0.5) and B = PAP⊤ ⊕ Z for some P ∈ Π, where Z ∼ Ber(ε): find P given the unlabeled graphs.
[Figure: heat map of the BP success rate for graph matching on ER(n, 0.5); noise ε from 0.0 to 0.48 on the horizontal axis, n from 15 to 75 on the vertical axis, success rate from 0.0 to 1.0.]
-
34
Summary
I k-means clustering
  I Nice manifold optimization landscape
  I Lower bound certificates from SDPs (data-driven)
I Clustering the stochastic block model
  I AMP → spectral methods → graph neural networks
I Quadratic assignment
-
35
Thank you
Relax, no need to round: integrality of clustering formulations
P. Awasthi, A. S. Bandeira, M. Charikar, R. Krishnaswamy, S. Villar, R. Ward
arXiv:1408.4045, Innovations in Theoretical Computer Science (ITCS), 2015

Probably certifiably correct k-means clustering
T. Iguchi, D. G. Mixon, J. Peterson, S. Villar
arXiv:1509.07983, Mathematical Programming, 2016

Clustering subgaussian mixtures by semidefinite programming
D. G. Mixon, S. Villar, R. Ward
arXiv:1602.06612, Information and Inference: A Journal of the IMA, 2017

Monte Carlo approximation certificates for k-means clustering
D. G. Mixon, S. Villar
arXiv:1710.00956

Supervised Community Detection with Hierarchical Graph Neural Networks
Z. Chen, L. Li, J. Bruna
arXiv:1705.08415

A note on learning algorithms for quadratic assignment with Graph Neural Networks
A. Nowak, S. Villar, A. S. Bandeira, J. Bruna
arXiv:1706.07450, ICML (PADL), 2017