Optimization and learning techniques for clustering problems


  • 1

    Optimization and learning techniques for clustering problems

    Soledad Villar

    Center for Data Science, Courant Institute of Mathematical Sciences

    Statistical Physics and Machine Learning back together, Cargèse, August 2018

  • 2

    Clustering. “Next AI revolution is unsupervised” (Yann LeCun)

    Main task in unsupervised machine learning:

    - Finding structure in unlabeled data.

    k-means clustering

    - Simple objective
    - Useful for “generic data”

    Clustering the stochastic block model

    - Specialized objective
    - Algorithms get more precise but less robust to model changes

    Data-driven clustering methods.

  • 3

    Clustering. “Next AI revolution is unsupervised” (Yann LeCun)

    Main task in unsupervised machine learning:

    - Finding structure in unlabeled data.

    Example: Cluster analysis and display of genome-wide expression patterns.

    - Gene expression as a function of time for different conditions.
    - Clustering gene expression data groups together genes of known similar functionality.

    Eisen, Spellman, Brown, Botstein, PNAS, 1998

  • 4

    Summary

    - k-means clustering
      - Landscape of manifold optimization
      - Lower bound certificates from SDPs (data-driven)

    - Clustering the stochastic block model
      - AMP → Spectral methods → Graph neural networks

    - Quadratic assignment

  • 5

    The k-means problem

    Given a point cloud {x_i}, partition the points into clusters C_1, ..., C_k.

    k-means objective:

    $$\min_{C_1,\dots,C_k} \sum_{t=1}^{k} \sum_{i \in C_t} \left\| x_i - \frac{1}{|C_t|} \sum_{j \in C_t} x_j \right\|^2$$

    - NP-hard to minimize in general (even in the plane).

  • 6

    Lloyd’s algorithm

    1. Choose k centers at random.

    2. Assign points to closest center.

    3. Compute the new centers as the cluster centroids.

    k-means++ is Lloyd’s with smarter initialization.

    Pros:

    - Fast.
    - Very easy to implement.
    - Widely used and it works for most applications.

    Cons:

    - No guarantee of convergence to a global optimum (local minima).
    - In extreme cases it may take exponentially many steps.
    - Solutions depend heavily on initialization.
    - Its output doesn’t say how good of a solution it may be.
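    A minimal sketch of the steps above in plain numpy (illustrative code, not from the slides; the k-means++ variant would only change the initialization in step 1):

```python
import numpy as np

def lloyd(X, k, n_iter=100, seed=None):
    """Lloyd's algorithm: random init, assign to the closest center, recompute centroids."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # 1. Choose k centers at random (k-means++ would instead pick each new center
    #    with probability proportional to its squared distance to the centers chosen so far).
    centers = X[rng.choice(n, size=k, replace=False)]
    for _ in range(n_iter):
        # 2. Assign points to the closest center.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # 3. Compute the new centers as the cluster centroids.
        new_centers = np.array([X[labels == t].mean(axis=0) if (labels == t).any()
                                else centers[t] for t in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    objective = d2[np.arange(n), labels].sum()
    return labels, centers, objective
```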

  • 7

    Optimization formulation

    Taking D_ij := ‖x_i − x_j‖², then

    $$\sum_{t=1}^{k} \sum_{i \in C_t} \left\| x_i - \frac{1}{|C_t|} \sum_{j \in C_t} x_j \right\|^2 = \frac{1}{2} \operatorname{Tr}\Big(D \underbrace{\sum_{t=1}^{k} \frac{1}{|C_t|}\, \mathbf{1}_{C_t} \mathbf{1}_{C_t}^\top}_{=\,Y Y^\top}\Big)$$

    $$\begin{array}{ll} \text{minimize} & \operatorname{Tr}(D Y Y^\top) \\ \text{subject to} & Y \in \mathbb{R}^{n \times k} \\ & Y^\top Y = I_k \\ & Y Y^\top \mathbf{1} = \mathbf{1} \\ & Y \ge 0 \end{array}$$

    The set of Y's is discrete.
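    A quick numerical check of the trace identity above (illustrative code; the data and labeling are arbitrary): the k-means objective of a partition equals ½ Tr(D Y Yᵀ) when column t of Y is 1_{C_t}/√|C_t|.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, m = 30, 3, 2
X = rng.normal(size=(n, m))
labels = np.concatenate([np.arange(k), rng.integers(k, size=n - k)])  # every cluster nonempty

# Squared-distance matrix D_ij = ||x_i - x_j||^2.
D = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)

# Normalized indicator matrix Y: column t is 1_{C_t} / sqrt(|C_t|),
# so that Y^T Y = I_k and Y Y^T 1 = 1.
Y = np.zeros((n, k))
for t in range(k):
    idx = labels == t
    Y[idx, t] = 1.0 / np.sqrt(idx.sum())

# k-means objective computed directly from the centroids.
obj = sum(((X[labels == t] - X[labels == t].mean(axis=0)) ** 2).sum() for t in range(k))

assert np.isclose(obj, 0.5 * np.trace(D @ Y @ Y.T))
```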

  • 8

    Manifold optimization implementation

    $$\begin{array}{ll} \text{minimize} & \operatorname{Tr}(D Y Y^\top) + \lambda \|Y_-\|^2 \\ \text{subject to} & Y \in \mathcal{M} = \{Y \in \mathbb{R}^{n \times k} : Y^\top Y = I_k,\ Y Y^\top \mathbf{1} = \mathbf{1}\} \end{array}$$

    (Y_− denotes the entrywise negative part of Y.)

    To implement the manifold optimization one needs:

    $$\operatorname{grad} f : \mathcal{M} \to T\mathcal{M}, \qquad \operatorname{retr}_Y : T_Y\mathcal{M} \to \mathcal{M},$$

    where

    $$\operatorname{retr}_Y(0) = Y, \qquad \left.\frac{d}{dt}\right|_{t=0} \operatorname{retr}_Y(tV) = V.$$

    Gradient descent:

    $$Y_{n+1} = \operatorname{retr}_{Y_n}\!\big(-\alpha_n \operatorname{grad} f(Y_n)\big).$$

    Sepulchre, Absil, Mahony, 2008

    Carson, Mixon, V., Ward, SampTA, 2017

  • 9

    Manifold optimization, retraction implementation

    Homogeneous structure of M. Let O(1_n) be the n × n orthogonal matrices that fix 1.

    $$\mathcal{M} \times O(\mathbf{1}_n) \times O(k) \to \mathcal{M}, \qquad (Y, Q, R) \mapsto Q Y R \quad \text{(transitive action)}$$

    Multiplication of Y on the right by R is equivalent to multiplication on the left by R′ = Y R Y^⊤ ∈ O(1_n).

    V ∈ T_Y M can be written as V = BY + YA, where A ∈ so(k), B ∈ so(1_n), and BY and YA are orthogonal. We compute A, B explicitly:

    $$A = Y^\top V, \qquad B = V Y^\top - Y V^\top - 2\, Y A Y^\top$$

    $$\operatorname{retr}_Y(V) = \exp(B)\, \exp(A')\, Y \in \mathcal{M} \qquad (\text{with } A' = Y A Y^\top)$$
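    A small numpy sketch of this retraction (illustrative; it assumes V is already a tangent vector at Y, i.e. of the form V = BY + YA):

```python
import numpy as np
from scipy.linalg import expm

def retract(Y, V):
    """Retraction retr_Y(V) = exp(B) exp(A') Y from the slide above."""
    A = Y.T @ V                                   # k x k, skew-symmetric
    B = V @ Y.T - Y @ V.T - 2 * Y @ A @ Y.T       # n x n, skew-symmetric, fixes 1
    A_left = Y @ A @ Y.T                          # right-multiplication by exp(A), seen as a left action
    return expm(B) @ expm(A_left) @ Y
```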

  • 10

    Manifold optimization algorithm

    λ_0 ← 0
    repeat
        Y_{n+1} ← GradientDescent(f_{λ_n})   {initialized at Y_n}
        λ_{n+1} ← 2 λ_n + 1
    until ‖Y_−‖_F < ε

    $$\begin{array}{ll} \text{minimize}_Y & f_\lambda(Y) = \operatorname{Tr}(D Y Y^\top) + \lambda \|Y_-\|^2 \\ \text{subject to} & Y \in \mathcal{M} = \{Y \in \mathbb{R}^{n \times k} : Y^\top Y = I_k,\ Y Y^\top \mathbf{1} = \mathbf{1}\} \end{array}$$

    [Figure: iterates shown for λ = 0, 6300, 2.047e+05, and 2.6214e+07.]

    The problem structure allows each iteration to run in linear time.

    Carson, Mixon, V., Ward, SampTA, 2017
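    A minimal sketch of this outer loop (illustrative, reusing retract() from the previous sketch; the step size and the tangent-space projection below are placeholder choices, not the implementation from Carson, Mixon, V., Ward):

```python
import numpy as np

def tangent_project(Y, G):
    # Rough projection of a Euclidean gradient G onto the tangent space at Y:
    # remove the component along the symmetric part of Y^T G (a Stiefel-style
    # heuristic; the exact projection for M also accounts for Y Y^T 1 = 1).
    S = Y.T @ G
    return G - Y @ (S + S.T) / 2

def penalty_kmeans(D, Y0, step=1e-3, eps=1e-6, outer_iters=20, inner_iters=200):
    Y, lam = Y0.copy(), 0.0
    for _ in range(outer_iters):
        for _ in range(inner_iters):
            Yneg = np.minimum(Y, 0.0)
            G = 2 * D @ Y + 2 * lam * Yneg                 # Euclidean gradient of f_lambda
            Y = retract(Y, -step * tangent_project(Y, G))
        if np.linalg.norm(np.minimum(Y, 0.0)) < eps:       # ||Y_-||_F small: (near) nonnegative Y
            break
        lam = 2 * lam + 1
    return Y
```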

  • 11

    Claim: clustering is (maybe) not (that) hard. Optimization landscape

    Points from three well-separated unit balls in small dimension.

  • 12

    Claim: clustering is (maybe) not (that) hard. No clustering structure

    Points drawn from a single unit ball, partitioned into three clusters, in small dimension.

    And it’s optimal! How do I know that?

  • 13

    k-means optimization formulation

    Recall D_ij := ‖x_i − x_j‖², then

    $$\sum_{t=1}^{k} \sum_{i \in C_t} \left\| x_i - \frac{1}{|C_t|} \sum_{j \in C_t} x_j \right\|^2 = \frac{1}{2} \operatorname{Tr}\Big(D \underbrace{\sum_{t=1}^{k} \frac{1}{|C_t|}\, \mathbf{1}_{C_t} \mathbf{1}_{C_t}^\top}_{=\,Y Y^\top}\Big)$$

    $$\begin{array}{ll} \text{minimize} & \operatorname{Tr}(D Y Y^\top) \\ \text{subject to} & Y \in \mathbb{R}^{n \times k} \\ & Y^\top Y = I_k \\ & Y Y^\top \mathbf{1} = \mathbf{1} \\ & Y \ge 0 \end{array}$$

    The set of Y's is discrete.

  • 14

    A semidefinite programming relaxation

    Recall D_ij := ‖x_i − x_j‖², then

    $$\sum_{t=1}^{k} \sum_{i \in C_t} \left\| x_i - \frac{1}{|C_t|} \sum_{j \in C_t} x_j \right\|^2 = \frac{1}{2} \operatorname{Tr}\Big(D \underbrace{\sum_{t=1}^{k} \frac{1}{|C_t|}\, \mathbf{1}_{C_t} \mathbf{1}_{C_t}^\top}_{=:\,X}\Big)$$

    Relax to an SDP:

    $$\begin{array}{ll} \text{minimize} & \operatorname{Tr}(D X) \\ \text{subject to} & \operatorname{Tr}(X) = k \\ & X \mathbf{1} = \mathbf{1} \\ & X \ge 0 \\ & X \succeq 0 \end{array}$$

    Peng, Wei, SIAM J. Optim., 2007
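    A short cvxpy sketch of this relaxation (illustrative; cvxpy and its default solver are assumed available, and this only scales to modest n):

```python
import numpy as np
import cvxpy as cp

def kmeans_sdp(points, k):
    """Peng-Wei SDP relaxation; returns the relaxed k-means value (1/2) Tr(D X*) and X*."""
    n = points.shape[0]
    D = ((points[:, None, :] - points[None, :, :]) ** 2).sum(axis=2)
    X = cp.Variable((n, n), PSD=True)             # X symmetric, X >> 0
    constraints = [cp.trace(X) == k,
                   X @ np.ones(n) == np.ones(n),  # X 1 = 1
                   X >= 0]                        # entrywise nonnegativity
    prob = cp.Problem(cp.Minimize(cp.trace(D @ X)), constraints)
    prob.solve()
    return 0.5 * prob.value, X.value
```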

  • 15

    Stochastic ball model. The (D, γ, n)-stochastic ball model:

    - D = rotation-invariant distribution over the unit ball in R^m
    - γ_1, ..., γ_k = ball centers in R^m
    - Draw r_{t,1}, ..., r_{t,n} i.i.d. from D for each t ∈ {1, ..., k}
    - x_{t,i} = γ_t + r_{t,i} = the ith point from cluster t

    Nellore, Ward, Information and Computation, 2015


  • 17

    k-means SDP. Stochastic ball model

    Exact recovery

    The SDP is tight for k unit balls in R^m with centers γ_1, ..., γ_k whenever

    $$\min_{i \neq j} \|\gamma_i - \gamma_j\| \ \ge\ \min\left\{\, 2\sqrt{2}\Big(1 + \tfrac{1}{\sqrt{m}}\Big),\ \ 2 + \tfrac{k^2}{m},\ \ 2 + \sqrt{\tfrac{2k}{m+2}} \,\right\}$$

    Conjecture (Li, Li, Ling, Strohmer)

    Phase transition for exact recovery (under the stochastic ball model) at

    $$\min_{i \neq j} \|\gamma_i - \gamma_j\| \ \ge\ 2 + \frac{c}{m}$$

    Awasthi, Bandeira, Charikar, Krishnaswamy, V., Ward, Proc. ITCS, 2015

    Iguchi, Mixon, Peterson, V., Mathematical Programming, 2016

    Li, Li, Ling, Strohmer, arXiv:1710.06008 2017

  • 18

    k-means SDP. Subgaussian mixtures

    Relax and “round” (weak recovery).

    Theorem

    Centers γ̂_1, ..., γ̂_k; estimated centers v_1, ..., v_k. Then, with high probability,

    $$\frac{1}{k}\sum_{i=1}^{k} \|v_i - \hat\gamma_i\|^2 \ \lesssim\ k^2 \sigma^2 \qquad \text{provided} \qquad \min_{i \neq j}\|\gamma_i - \gamma_j\| \ \gtrsim\ k\sigma.$$

    Example: http://solevillar.github.io/2016/07/05/Clustering-MNIST-SDP.html

    Mixon, V., Ward, Information and Inference, 2017

    Proof technique: Guédon, Vershynin, Probability Theory and Related Fields, 2016


  • 19

    SDPs are slow! Fast lower bound certificates

    Want: Lower bound on the k-means value.

    Given {x_i}_{i∈T} ⊆ R^m, let W = value of the k-means++ initialization. Then

    $$\mathrm{val}_{\mathrm{kmeans}}(T) := \min_{C_1,\dots,C_k} \frac{1}{|T|} \sum_{t=1}^{k} \sum_{i \in C_t} \left\| x_i - \frac{1}{|C_t|} \sum_{j \in C_t} x_j \right\|^2 \ \ge\ \frac{1}{8(\log k + 2)}\,\mathbb{E}\,W$$

    Running k-means++ on the MNIST training set with k = 10 gives:

    - Upper bound (by running the algorithm): 39.22
    - Lower bound (estimated from above): 2.15

    Can we get a better lower bound?

    Arthur, Vassilvitskii, 2007
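    A sketch of the k-means++ lower bound described above (illustrative code, not the authors' implementation): run the k-means++ seeding, record its normalized cost W, and use E[W] ≤ 8(log k + 2)·OPT to bound the k-means value from below.

```python
import numpy as np

def kmeanspp_cost(X, k, seed=None):
    """Normalized cost W of one run of the k-means++ seeding."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    centers = [X[rng.integers(n)]]
    d2 = ((X - centers[0]) ** 2).sum(axis=1)
    for _ in range(k - 1):
        # Sample the next center with probability proportional to squared distance.
        centers.append(X[rng.choice(n, p=d2 / d2.sum())])
        d2 = np.minimum(d2, ((X - centers[-1]) ** 2).sum(axis=1))
    return d2.sum() / n

def kmeanspp_lower_bound(X, k, trials=50, seed=0):
    # Average several runs to estimate E[W]; the bound uses the expectation,
    # so a single run only gives a heuristic estimate.
    W = np.mean([kmeanspp_cost(X, k, seed=seed + t) for t in range(trials)])
    return W / (8 * (np.log(k) + 2))
```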

  • 20

    Idea

    Given {x_i}_{i∈T} ⊆ R^m, draw S uniformly from the size-s subsets of T. Then

    $$\mathbb{E}\,\mathrm{val}_{\mathrm{SDP}}(S) \ \le\ \mathbb{E}\,\mathrm{val}_{\mathrm{kmeans}}(S) \ \le\ \mathbb{E}\,\frac{1}{s} \sum_{t=1}^{k} \sum_{i \in C^*_t \cap S} \left\| x_i - \frac{1}{|C^*_t \cap S|} \sum_{j \in C^*_t \cap S} x_j \right\|^2 \ \le\ \mathbb{E}\,\frac{1}{s} \sum_{t=1}^{k} \sum_{i \in C^*_t \cap S} \left\| x_i - \frac{1}{|C^*_t|} \sum_{j \in C^*_t} x_j \right\|^2 \ =\ \mathrm{val}_{\mathrm{kmeans}}(T)$$

    Goal: Establish E valSDP(S) > B with small p-value.

    Mixon, V., arXiv:1710.00956
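    A sketch of the Monte Carlo certificate idea (illustrative, reusing kmeans_sdp() from the earlier sketch): solve the SDP on small random subsamples S, average the per-point values, and subtract a confidence margin to obtain a candidate lower bound B; the concentration argument that turns this into a p-value is in Mixon-Villar, and the margin below is a placeholder assumption.

```python
import numpy as np

def monte_carlo_certificate(X, k, s=100, trials=20, margin=0.0, seed=0):
    rng = np.random.default_rng(seed)
    vals = []
    for _ in range(trials):
        S = rng.choice(X.shape[0], size=s, replace=False)
        sdp_val, _ = kmeans_sdp(X[S], k)   # relaxed k-means value of the subsample
        vals.append(sdp_val / s)           # normalize per point, as in val_SDP(S)
    # E val_SDP(S) <= val_kmeans(T); the empirical mean minus a concentration
    # margin gives a lower bound B that holds with small p-value.
    return float(np.mean(vals)) - margin
```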

  • 21

    Better performance, and fast!

    [Figure, left: certified value val(S-SDP) as a function of the subsample size s (100–450), compared with the k-means++ value and the k-means++ lower bound. Right: running time (log scale) vs. number of points (log scale, 10^2–10^8) for the k-means SDP, k-means++, and the Monte Carlo certifier.]

    - Taking s = 100 ≪ 60,000 MNIST points already works.
    - Constant runtime in N; faster than k-means++ for N ≳ 10^6.
    - Theorem for mixtures of Gaussians (additive bound).

    Mixon, V., arXiv:1710.00956

  • 22

    Generic algorithms vs. model-dependent algorithms. Clustering the stochastic block model

    - Sparse graph clustering (non-geometric).
    - Similar generic algorithms (SDP, manifold optimization).
    - Model-tailored algorithms (AMP, spectral methods on non-backtracking matrices and the Bethe Hessian).
    - Can we learn the algorithm from data?
    - Use graph neural networks.

    Decelle, Krzakala, Moore, Zdeborová, 2013

    Krzakala, Moore, Mossel, Neeman, Sly, Zdeborová, Zhang, 2013

    Saade, Krzakala, Zdeborová, 2014

    Abbe, 2018

  • 23

    Spectral redemption

    A ∼ SBM(a/n, b/n, n, 2), sparse. The spectrum doesn’t concentrate (high-degree vertices dominate it), so the Laplacian is not useful for clustering.

    Consider the non-backtracking operator (from linearized BP):

    $$B_{(i \to j),(i' \to j')} = \begin{cases} 1 & \text{if } j = i' \text{ and } j' \neq i \\ 0 & \text{otherwise} \end{cases}$$

    Second eigenvector of B reveals clustering structure

    Krzakala, Moore, Mossel, Neeman, Sly, Zdeborová, Zhang, 2013
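    A small sketch of clustering with the non-backtracking operator defined above (illustrative; for large sparse graphs one would work with the 2n × 2n Ihara–Bass companion matrix rather than the dense edge-to-edge matrix built here):

```python
import numpy as np

def non_backtracking_labels(A):
    n = A.shape[0]
    # Directed edges of the graph.
    edges = [(i, j) for i in range(n) for j in range(n) if A[i, j]]
    idx = {e: t for t, e in enumerate(edges)}
    B = np.zeros((len(edges), len(edges)))
    for (i, j) in edges:
        for (i2, j2) in edges:
            if j == i2 and j2 != i:      # (i -> j) continues into (j -> j2) without backtracking
                B[idx[(i, j)], idx[(i2, j2)]] = 1.0
    # Eigenvector of the second-largest (real part) eigenvalue, aggregated onto vertices.
    w, V = np.linalg.eig(B)
    v2 = V[:, np.argsort(-w.real)[1]].real
    score = np.zeros(n)
    for (i, j), t in idx.items():
        score[j] += v2[t]                # sum the edge components flowing into each vertex
    return (score > 0).astype(int)
```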

  • 24

    Bethe Hessian

    BH(r) = (r² − 1) I − r A + D

    Fixed points of BP ←→ stationary points of the Bethe free energy. The second eigenvector reveals the clustering structure. Pitfall: highly dependent on the model.

    Goal: Combine graph operators I, D, A, ... to generate robust “data-driven spectral methods” for problems on graphs.

    Saade, Krzakala, Zdeborová, 2014

    Javanmard, Montanari, Ricci-Tersenghi, 2015
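    A minimal sketch of Bethe Hessian clustering as described above (illustrative; the choice r = √(average degree) follows the Saade–Krzakala–Zdeborová suggestion, and the sign split assumes two communities):

```python
import numpy as np

def bethe_hessian_labels(A):
    n = A.shape[0]
    degrees = A.sum(axis=1)
    D = np.diag(degrees)
    r = np.sqrt(degrees.mean())                  # assumed choice of r
    BH = (r ** 2 - 1) * np.eye(n) - r * A + D
    w, V = np.linalg.eigh(BH)                    # BH is symmetric
    v2 = V[:, 1]                                 # eigenvector of the 2nd smallest eigenvalue
    return (v2 > 0).astype(int)
```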

  • 25

    Graph neural networks

    Power method: v^{t+1} = M v^t, t = 1, ..., T.

    Graph with adjacency matrix A. Set M = {I_n, D, A, A², ..., A^{2^J}},

    $$v^{t+1}_{l} = \rho\Big( \sum_{M \in \mathcal{M}} M\, v^{t}\, \theta^{t}_{M,l} \Big), \qquad l = 1, \dots, d_{t+1}$$

    with v^t ∈ R^{n×d_t}, Θ = {θ^t_1, ..., θ^t_{|M|}}_t, and θ^t_M ∈ R^{d_t×d_{t+1}} trainable parameters.

    - Extension to the line graph (GNN with non-backtracking).
    - Covariant w.r.t. permutations: if G ↦ φ(G), then GΠ ↦ Πφ(G).
    - Preliminary theoretical analysis of the energy landscape.
    - Strong assumptions ⇒ local minima have low energy.

    Scarselli, Tsoi, Hagenbuchner, Monfardini, IEEE Transactions on Neural Networks, 2009

    Chen, Li, Bruna, 2017
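    A minimal numpy sketch of the layer above (illustrative, not the Chen–Li–Bruna implementation): one layer combines the operators in M with trainable matrices θ (random placeholders here) and a pointwise nonlinearity ρ = ReLU.

```python
import numpy as np

def gnn_layer(A, v, thetas, J=2):
    """One GNN layer: v_out = relu(sum_M M v theta_M) with M = {I, D, A, A^2, ..., A^(2^J)}."""
    n = A.shape[0]
    D = np.diag(A.sum(axis=1))
    ops = [np.eye(n), D] + [np.linalg.matrix_power(A, 2 ** j) for j in range(J + 1)]
    # len(thetas) must equal len(ops) = J + 3; each theta has shape (d_t, d_{t+1}).
    out = sum(M @ v @ theta for M, theta in zip(ops, thetas))
    return np.maximum(out, 0.0)                  # rho = ReLU

# Example usage with random placeholder parameters:
# n, d_in, d_out = 50, 8, 8
# A = (np.random.rand(n, n) < 0.1).astype(float); A = np.maximum(A, A.T)
# v = np.random.randn(n, d_in)
# thetas = [0.1 * np.random.randn(d_in, d_out) for _ in range(5)]
# v_next = gnn_layer(A, v, thetas, J=2)
```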

  • 26

    Numerical performance. SBM k = 2

    Overlap as a function of the SNR.

    Chen, Li, Bruna, 2017

  • 27

    Extension to quadratic assignment

    Graph matching: $\min_{X \in \Pi} \|G_1 X - X G_2\|_F^2 = \|G_1 X\|^2 + \|X G_2\|^2 - 2\langle G_1 X, X G_2 \rangle$

    Siamese neural network:

    G_2 = π G_1 + N, with N ∼ Erdős–Rényi noise; experiments with G_1 ∼ Erdős–Rényi and G_1 ∼ random regular.

    Nowak, V., Bandeira, Bruna, 2017

  • 28

    Numerical experiment

    [Figure: recovery rate vs. noise level (0.00–0.05) for the Erdős–Rényi graph model (left) and the random regular graph model (right), comparing SDP, LowRankAlign (k=4), and GNN.]

    Our model runs in O(n²), LowRankAlign in O(n³), the SDP in O(n⁴).

    Nowak, V., Bandeira, Bruna, 2017

  • 29

    Quadratic Assignment Problem (QAP)

    [Figure: vertices i, j, k with weights w_ij, w_ik, w_jk, and vertices r, s, t with distances d_rs, d_rt, d_st.]

    $$\min_{\pi} \ \sum_{i \in A} \sum_{j \in A} \sum_{r \in B} \sum_{s \in B} \underbrace{w_{ij}\, d_{rs}}_{E_{ijrs}} \, \pi_{ir}\, \pi_{js}$$

    $$\text{subject to} \quad \sum_{r} \pi_{ir} = 1 \ \ \forall i \in A, \qquad \sum_{i} \pi_{ir} = 1 \ \ \forall r \in B, \qquad \pi_{ir} \in \{0, 1\}$$

    Onaran, Villar, ongoing work
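    For tiny instances the QAP objective above can be minimized by brute force, which makes the formulation concrete (illustrative code; exact enumeration is only feasible for very small n):

```python
import itertools
import numpy as np

def qap_bruteforce(W, D):
    """Exhaustively minimize sum_{i,j} W_ij D_{pi(i) pi(j)} over permutations pi."""
    n = W.shape[0]
    best_cost, best_perm = np.inf, None
    for perm in itertools.permutations(range(n)):
        # pi_{i r} = 1 iff r = perm[i].
        cost = sum(W[i, j] * D[perm[i], perm[j]] for i in range(n) for j in range(n))
        if cost < best_cost:
            best_cost, best_perm = cost, perm
    return best_cost, best_perm
```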

  • 30

    Max-Product Belief Propagation for QAP

    - Posterior probability:

    $$p(\pi) = \frac{1}{Z} \prod_{i} \mathbb{1}\Big\{\sum_{t} \pi_{it} = 1\Big\} \prod_{r} \mathbb{1}\Big\{\sum_{k} \pi_{kr} = 1\Big\} \prod_{j \neq i} \prod_{s \neq r} e^{-\beta\, \pi_{ir} \pi_{js}\, E_{ijrs}}$$

    [Figure: factor graph with variable nodes π_ir and check nodes, showing the messages ν̂_{*→ir}, ν̂_{ir→*}, ν̂_{ir→js}, ν̂_{js→ir}.]

  • 31

    Max-Product Message Updates

    - Variable to check nodes:

    $$\nu_{ir \to *}(1) \cong \prod_{kt} e^{-\beta\, \hat\nu_{kt \to ir}(1)\, E_{irkt}}, \qquad \nu_{ir \to *}(0) \cong 1$$

    - Check to variable nodes:

    $$\nu_{* \to ir}(1) \cong \nu_{ir \to *}(1) \prod_{\ell \neq i} \nu_{\ell r \to *}(0) \prod_{t \neq r} \nu_{it \to *}(0)$$

    $$\nu_{* \to ir}(0) \cong \nu_{ir \to *}(0) \max_{t \neq r,\, \ell \neq i} \ \nu_{it \to *}(1)\, \nu_{\ell r \to *}(1) \prod_{u \neq t, r} \nu_{iu \to *}(0) \prod_{m \neq \ell, i} \nu_{mr \to *}(0)$$

    - Variable back to variable nodes:

    $$\hat\nu_{ir \to js}(1) \cong \nu_{* \to ir}(1) \prod_{ku \neq js} e^{-\beta\, \hat\nu_{ku \to ir}(1)\, E_{irku}}, \qquad \hat\nu_{ir \to js}(0) \cong \nu_{* \to ir}(0)$$

  • 32

    Simplifications (Min-Sum Algorithm)

    Introduce the log-likelihood ratio:

    $$\hat\ell_{ir \to js} = \frac{1}{\beta} \log \frac{\hat\nu_{ir \to js}(1)}{\hat\nu_{ir \to js}(0)}$$

    Then the algorithm simplifies to one update step, as follows:

    $$\hat\ell_{ir \to js} \leftarrow \min_{t \neq r,\, \ell \neq i} \sum_{ku} \left[ \frac{e^{\beta \hat\ell_{ku \to it}}}{1 + e^{\beta \hat\ell_{ku \to it}}} E_{itku} + \frac{e^{\beta \hat\ell_{ku \to \ell r}}}{1 + e^{\beta \hat\ell_{ku \to \ell r}}} E_{\ell r ku} \right] - 2 \sum_{ku \neq js} \frac{e^{\beta \hat\ell_{ku \to ir}}}{1 + e^{\beta \hat\ell_{ku \to ir}}} E_{irku} + \frac{e^{\beta \hat\ell_{js \to ir}}}{1 + e^{\beta \hat\ell_{js \to ir}}} E_{irjs}$$

  • 33

    A special case: Graph Matching (Isomorphism)

    [Figure: Graph-A with vertices i, j, k, ℓ and Graph-B with vertices r, s, t, u.]

    $$\min_{P} \ \|A - P B P^\top\|_F \qquad \text{subject to } P \in \Pi$$

    equivalent to the QAP with E_ijrs = A_ij ⊕ B_rs.

    When A ∼ ER(n, 0.5) and B = P A P^⊤ ⊕ Z for some P ∈ Π, where Z ∼ Ber(ε): find P given the unlabeled graphs.

    [Figure: heat map of the BP success rate for graph matching on ER(n, 0.5), as a function of the noise level (0.0–0.48) and the graph size n (15–75).]

  • 34

    Summary

    - k-means clustering
      - Nice manifold optimization landscape
      - Lower bound certificates from SDPs (data-driven)

    - Clustering the stochastic block model
      - AMP → Spectral methods → Graph neural networks

    - Quadratic assignment

  • 35

    Thank you

    Relax, no need to round: Integrality of clustering formulations. P. Awasthi, A. S. Bandeira, M. Charikar, R. Krishnaswamy, S. Villar, R. Ward. arXiv:1408.4045, Innovations in Theoretical Computer Science (ITCS), 2015.

    Probably certifiably correct k-means clustering. T. Iguchi, D. G. Mixon, J. Peterson, S. Villar. arXiv:1509.07983, Mathematical Programming, 2016.

    Clustering subgaussian mixtures by semidefinite programming. D. G. Mixon, S. Villar, R. Ward. arXiv:1602.06612, Information and Inference: A Journal of the IMA, 2017.

    Monte Carlo approximation certificates for k-means clustering. D. G. Mixon, S. Villar. arXiv:1710.00956.

    Supervised Community Detection with Hierarchical Graph Neural Networks. Z. Chen, L. Li, J. Bruna. arXiv:1705.08415.

    A note on learning algorithms for quadratic assignment with Graph Neural Networks. A. Nowak, S. Villar, A. S. Bandeira, J. Bruna. ICML (PADL) 2017, arXiv:1706.07450.