Optimization and learning techniques for clustering problems


  • 1

    Optimization and learning techniques for clustering problems

    Soledad Villar

    Center for Data Science, Courant Institute of Mathematical Sciences

    Statistical Physics and Machine Learning back together, Cargèse, August 2018

  • 2

    Clustering. “Next AI revolution is unsupervised” (Yann LeCun)

    Main task in unsupervised machine learning:

    - Finding structure in unlabeled data.

    k-means clustering

    - Simple objective
    - Useful for “generic data”

    Clustering the stochastic block model

    - Specialized objective
    - Algorithms get more precise but less robust to model changes

    Data-driven clustering methods.

  • 3

    Clustering. “Next AI revolution is unsupervised” (Yann LeCun)

    Main task in unsupervised machine learning:

    - Finding structure in unlabeled data.

    Example: Cluster analysis and display of genome-wide expression patterns.

    - Gene expression as a function of time for different conditions.
    - Clustering gene expression data groups together genes of known similar functionality.

    Eisen, Spellman, Brown, Botstein, PNAS, 1998

  • 4

    Summary

    - k-means clustering
      - Landscape of manifold optimization
      - Lower bound certificates from SDPs (data-driven)

    - Clustering the stochastic block model
      - AMP → Spectral methods → Graph neural networks

    - Quadratic assignment

  • 5

    The k-means problem

    Given a point cloud {x_i}, partition the points into clusters C_1, ..., C_k.

    k-means objective:

    $$\min_{C_1,\dots,C_k} \sum_{t=1}^{k} \sum_{i \in C_t} \left\| x_i - \frac{1}{|C_t|} \sum_{j \in C_t} x_j \right\|^2$$

    - NP-hard to minimize in general (even in the plane).

  • 6

    Lloyd’s algorithm

    1. Choose k centers at random.

    2. Assign points to closest center.

    3. Compute the new centers as the cluster centroids.

    k-means++ is Lloyd’s with smarter initialization.

    Pros:

    - Fast.
    - Very easy to implement.
    - Widely used and it works for most applications.

    Cons:

    - No guarantee of convergence to a global optimum (local minima).
    - In extreme cases it may take exponentially many steps.
    - Solutions depend heavily on initialization.
    - Its output doesn’t say how good of a solution it may be.
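    A minimal sketch of the steps above in plain numpy (illustrative code, not from the slides; the k-means++ variant would only change the initialization in step 1):

```python
import numpy as np

def lloyd(X, k, n_iter=100, seed=None):
    """Lloyd's algorithm: random init, assign to the closest center, recompute centroids."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # 1. Choose k centers at random (k-means++ would instead pick each new center
    #    with probability proportional to its squared distance to the centers chosen so far).
    centers = X[rng.choice(n, size=k, replace=False)]
    for _ in range(n_iter):
        # 2. Assign points to the closest center.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # 3. Compute the new centers as the cluster centroids.
        new_centers = np.array([X[labels == t].mean(axis=0) if (labels == t).any()
                                else centers[t] for t in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    objective = d2[np.arange(n), labels].sum()
    return labels, centers, objective
```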

  • 7

    Optimization formulation

    Taking D_ij := ‖x_i − x_j‖², then

    $$\sum_{t=1}^{k} \sum_{i \in C_t} \left\| x_i - \frac{1}{|C_t|} \sum_{j \in C_t} x_j \right\|^2 = \frac{1}{2} \operatorname{Tr}\Big(D \underbrace{\sum_{t=1}^{k} \frac{1}{|C_t|}\, \mathbf{1}_{C_t} \mathbf{1}_{C_t}^\top}_{=\,Y Y^\top}\Big)$$

    $$\begin{array}{ll} \text{minimize} & \operatorname{Tr}(D Y Y^\top) \\ \text{subject to} & Y \in \mathbb{R}^{n \times k} \\ & Y^\top Y = I_k \\ & Y Y^\top \mathbf{1} = \mathbf{1} \\ & Y \ge 0 \end{array}$$

    The set of Y's is discrete.
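    A quick numerical check of the trace identity above (illustrative code; the data and labeling are arbitrary): the k-means objective of a partition equals ½ Tr(D Y Yᵀ) when column t of Y is 1_{C_t}/√|C_t|.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, m = 30, 3, 2
X = rng.normal(size=(n, m))
labels = np.concatenate([np.arange(k), rng.integers(k, size=n - k)])  # every cluster nonempty

# Squared-distance matrix D_ij = ||x_i - x_j||^2.
D = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)

# Normalized indicator matrix Y: column t is 1_{C_t} / sqrt(|C_t|),
# so that Y^T Y = I_k and Y Y^T 1 = 1.
Y = np.zeros((n, k))
for t in range(k):
    idx = labels == t
    Y[idx, t] = 1.0 / np.sqrt(idx.sum())

# k-means objective computed directly from the centroids.
obj = sum(((X[labels == t] - X[labels == t].mean(axis=0)) ** 2).sum() for t in range(k))

assert np.isclose(obj, 0.5 * np.trace(D @ Y @ Y.T))
```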

  • 8

    Manifold optimization implementation

    $$\begin{array}{ll} \text{minimize} & \operatorname{Tr}(D Y Y^\top) + \lambda \|Y_-\|^2 \\ \text{subject to} & Y \in \mathcal{M} = \{Y \in \mathbb{R}^{n \times k} : Y^\top Y = I_k,\ Y Y^\top \mathbf{1} = \mathbf{1}\} \end{array}$$

    (Y_− denotes the entrywise negative part of Y.)

    To implement the manifold optimization one needs:

    $$\operatorname{grad} f : \mathcal{M} \to T\mathcal{M}, \qquad \operatorname{retr}_Y : T_Y\mathcal{M} \to \mathcal{M},$$

    where

    $$\operatorname{retr}_Y(0) = Y, \qquad \left.\frac{d}{dt}\right|_{t=0} \operatorname{retr}_Y(tV) = V.$$

    Gradient descent:

    $$Y_{n+1} = \operatorname{retr}_{Y_n}\!\big(-\alpha_n \operatorname{grad} f(Y_n)\big).$$

    Sepulchre, Absil, Mahony, 2008

    Carson, Mixon, V., Ward, SampTA, 2017

  • 9

    Manifold optimization, retraction implementation

    Homogeneous structure of M. Let O(1_n) be the n × n orthogonal matrices that fix 1.

    $$\mathcal{M} \times O(\mathbf{1}_n) \times O(k) \to \mathcal{M}, \qquad (Y, Q, R) \mapsto Q Y R \quad \text{(transitive action)}$$

    Multiplication of Y on the right by R is equivalent to multiplication on the left by R′ = Y R Y^⊤ ∈ O(1_n).

    V ∈ T_Y M can be written as V = BY + YA, where A ∈ so(k), B ∈ so(1_n), and BY and YA are orthogonal. We compute A, B explicitly:

    $$A = Y^\top V, \qquad B = V Y^\top - Y V^\top - 2\, Y A Y^\top$$

    $$\operatorname{retr}_Y(V) = \exp(B)\, \exp(A')\, Y \in \mathcal{M} \qquad (\text{with } A' = Y A Y^\top)$$
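    A small numpy sketch of this retraction (illustrative; it assumes V is already a tangent vector at Y, i.e. of the form V = BY + YA):

```python
import numpy as np
from scipy.linalg import expm

def retract(Y, V):
    """Retraction retr_Y(V) = exp(B) exp(A') Y from the slide above."""
    A = Y.T @ V                                   # k x k, skew-symmetric
    B = V @ Y.T - Y @ V.T - 2 * Y @ A @ Y.T       # n x n, skew-symmetric, fixes 1
    A_left = Y @ A @ Y.T                          # right-multiplication by exp(A), seen as a left action
    return expm(B) @ expm(A_left) @ Y
```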

  • 10

    Manifold optimization algorithm

    λ_0 ← 0
    repeat
        Y_{n+1} ← GradientDescent(f_{λ_n})   {initialized at Y_n}
        λ_{n+1} ← 2 λ_n + 1
    until ‖Y_−‖_F < ε

    $$\begin{array}{ll} \text{minimize}_Y & f_\lambda(Y) = \operatorname{Tr}(D Y Y^\top) + \lambda \|Y_-\|^2 \\ \text{subject to} & Y \in \mathcal{M} = \{Y \in \mathbb{R}^{n \times k} : Y^\top Y = I_k,\ Y Y^\top \mathbf{1} = \mathbf{1}\} \end{array}$$

    [Figure: iterates shown for λ = 0, 6300, 2.047e+05, and 2.6214e+07.]

    The problem structure allows each iteration to run in linear time.

    Carson, Mixon, V., Ward, SampTA, 2017
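    A minimal sketch of this outer loop (illustrative, reusing retract() from the previous sketch; the step size and the tangent-space projection below are placeholder choices, not the implementation from Carson, Mixon, V., Ward):

```python
import numpy as np

def tangent_project(Y, G):
    # Rough projection of a Euclidean gradient G onto the tangent space at Y:
    # remove the component along the symmetric part of Y^T G (a Stiefel-style
    # heuristic; the exact projection for M also accounts for Y Y^T 1 = 1).
    S = Y.T @ G
    return G - Y @ (S + S.T) / 2

def penalty_kmeans(D, Y0, step=1e-3, eps=1e-6, outer_iters=20, inner_iters=200):
    Y, lam = Y0.copy(), 0.0
    for _ in range(outer_iters):
        for _ in range(inner_iters):
            Yneg = np.minimum(Y, 0.0)
            G = 2 * D @ Y + 2 * lam * Yneg                 # Euclidean gradient of f_lambda
            Y = retract(Y, -step * tangent_project(Y, G))
        if np.linalg.norm(np.minimum(Y, 0.0)) < eps:       # ||Y_-||_F small: (near) nonnegative Y
            break
        lam = 2 * lam + 1
    return Y
```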

  • 11

    Claim: clustering is (maybe) not (that) hard. Optimization landscape

    Points from three well-separated unit balls in small dimension.

  • 12

    Claim: clustering is (maybe) not (that) hard. No clustering structure

    Points drawn from a single unit ball, partitioned into three clusters, in small dimension.

    And it’s optimal! How do I know that?

  • 13

    k-means optimization formulation

    Recall D_ij := ‖x_i − x_j‖², then

    $$\sum_{t=1}^{k} \sum_{i \in C_t} \left\| x_i - \frac{1}{|C_t|} \sum_{j \in C_t} x_j \right\|^2 = \frac{1}{2} \operatorname{Tr}\Big(D \underbrace{\sum_{t=1}^{k} \frac{1}{|C_t|}\, \mathbf{1}_{C_t} \mathbf{1}_{C_t}^\top}_{=\,Y Y^\top}\Big)$$

    $$\begin{array}{ll} \text{minimize} & \operatorname{Tr}(D Y Y^\top) \\ \text{subject to} & Y \in \mathbb{R}^{n \times k} \\ & Y^\top Y = I_k \\ & Y Y^\top \mathbf{1} = \mathbf{1} \\ & Y \ge 0 \end{array}$$

    The set of Y's is discrete.

  • 14

    A semidefinite programming relaxation

    Recall D_ij := ‖x_i − x_j‖², then

    $$\sum_{t=1}^{k} \sum_{i \in C_t} \left\| x_i - \frac{1}{|C_t|} \sum_{j \in C_t} x_j \right\|^2 = \frac{1}{2} \operatorname{Tr}\Big(D \underbrace{\sum_{t=1}^{k} \frac{1}{|C_t|}\, \mathbf{1}_{C_t} \mathbf{1}_{C_t}^\top}_{=:\,X}\Big)$$

    Relax to an SDP:

    $$\begin{array}{ll} \text{minimize} & \operatorname{Tr}(D X) \\ \text{subject to} & \operatorname{Tr}(X) = k \\ & X \mathbf{1} = \mathbf{1} \\ & X \ge 0 \\ & X \succeq 0 \end{array}$$

    Peng, Wei, SIAM J. Optim., 2007
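    A short cvxpy sketch of this relaxation (illustrative; cvxpy and its default solver are assumed available, and this only scales to modest n):

```python
import numpy as np
import cvxpy as cp

def kmeans_sdp(points, k):
    """Peng-Wei SDP relaxation; returns the relaxed k-means value (1/2) Tr(D X*) and X*."""
    n = points.shape[0]
    D = ((points[:, None, :] - points[None, :, :]) ** 2).sum(axis=2)
    X = cp.Variable((n, n), PSD=True)             # X symmetric, X >> 0
    constraints = [cp.trace(X) == k,
                   X @ np.ones(n) == np.ones(n),  # X 1 = 1
                   X >= 0]                        # entrywise nonnegativity
    prob = cp.Problem(cp.Minimize(cp.trace(D @ X)), constraints)
    prob.solve()
    return 0.5 * prob.value, X.value
```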

  • 15

    Stochastic ball model. The (D, γ, n)-stochastic ball model:

    - D = rotation-invariant distribution over the unit ball in R^m
    - γ_1, ..., γ_k = ball centers in R^m
    - Draw r_{t,1}, ..., r_{t,n} i.i.d. from D for each t ∈ {1, ..., k}
    - x_{t,i} = γ_t + r_{t,i} = the ith point from cluster t

    Nellore, Ward, Information and Computation, 2015


  • 17

    k-means SDP. Stochastic ball model

    Exact recovery

    The SDP is tight for k unit balls in R^m with centers γ_1, ..., γ_k whenever

    $$\min_{i \neq j} \|\gamma_i - \gamma_j\| \ \ge\ \min\left\{\, 2\sqrt{2}\Big(1 + \tfrac{1}{\sqrt{m}}\Big),\ \ 2 + \tfrac{k^2}{m},\ \ 2 + \sqrt{\tfrac{2k}{m+2}} \,\right\}$$

    Conjecture (Li, Li, Ling, Strohmer)

    Phase transition for exact recovery (under the stochastic ball model) at

    $$\min_{i \neq j} \|\gamma_i - \gamma_j\| \ \ge\ 2 + \frac{c}{m}$$

    Awasthi, Bandeira, Charikar, Krishnaswamy, V., Ward, Proc. ITCS, 2015

    Iguchi, Mixon, Peterson, V., Mathematical Programming, 2016

    Li, Li, Ling, Strohmer, arXiv:1710.06008 2017

  • 18

    k-means SDP. Subgaussian mixtures

    Relax and “round” (weak recovery).

    Theorem

    Centers γ̂_1, ..., γ̂_k; estimated centers v_1, ..., v_k. Then, with high probability,

    $$\frac{1}{k}\sum_{i=1}^{k} \|v_i - \hat\gamma_i\|^2 \ \lesssim\ k^2 \sigma^2 \qquad \text{provided} \qquad \min_{i \neq j}\|\gamma_i - \gamma_j\| \ \gtrsim\ k\sigma.$$

    Example: http://solevillar.github.io/2016/07/05/Clustering-MNIST-SDP.html

    Mixon, V., Ward, Information and Inference, 2017

    Proof technique: Guédon, Vershynin, Probability Theory and Related Fields, 2016


  • 19

    SDPs are slow! Fast lower bound certificates

    Want: Lower bound on the k-means value.

    Given {x_i}_{i∈T} ⊆ R^m, let W = value of the k-means++ initialization. Then

    $$\mathrm{val}_{\mathrm{kmeans}}(T) := \min_{C_1,\dots,C_k} \frac{1}{|T|} \sum_{t=1}^{k} \sum_{i \in C_t} \left\| x_i - \frac{1}{|C_t|} \sum_{j \in C_t} x_j \right\|^2 \ \ge\ \frac{1}{8(\log k + 2)}\,\mathbb{E}\,W$$

    Running k-means++ on the MNIST training set with k = 10 gives:

    - Upper bound (by running the algorithm): 39.22
    - Lower bound (estimated from above): 2.15

    Can we get a better lower bound?

    Arthur, Vassilvitskii, 2007
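    A sketch of the k-means++ lower bound described above (illustrative code, not the authors' implementation): run the k-means++ seeding, record its normalized cost W, and use E[W] ≤ 8(log k + 2)·OPT to bound the k-means value from below.

```python
import numpy as np

def kmeanspp_cost(X, k, seed=None):
    """Normalized cost W of one run of the k-means++ seeding."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    centers = [X[rng.integers(n)]]
    d2 = ((X - centers[0]) ** 2).sum(axis=1)
    for _ in range(k - 1):
        # Sample the next center with probability proportional to squared distance.
        centers.append(X[rng.choice(n, p=d2 / d2.sum())])
        d2 = np.minimum(d2, ((X - centers[-1]) ** 2).sum(axis=1))
    return d2.sum() / n

def kmeanspp_lower_bound(X, k, trials=50, seed=0):
    # Average several runs to estimate E[W]; the bound uses the expectation,
    # so a single run only gives a heuristic estimate.
    W = np.mean([kmeanspp_cost(X, k, seed=seed + t) for t in range(trials)])
    return W / (8 * (np.log(k) + 2))
```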

  • 20

    Idea

    Given {x_i}_{i∈T} ⊆ R^m, draw S uniformly from the size-s subsets of T. Then

    $$\mathbb{E}\,\mathrm{val}_{\mathrm{SDP}}(S) \ \le\ \mathbb{E}\,\mathrm{val}_{\mathrm{kmeans}}(S) \ \le\ \mathbb{E}\,\frac{1}{s} \sum_{t=1}^{k} \sum_{i \in C^*_t \cap S} \left\| x_i - \frac{1}{|C^*_t \cap S|} \sum_{j \in C^*_t \cap S} x_j \right\|^2 \ \le\ \mathbb{E}\,\frac{1}{s} \sum_{t=1}^{k} \sum_{i \in C^*_t \cap S} \left\| x_i - \frac{1}{|C^*_t|} \sum_{j \in C^*_t} x_j \right\|^2 \ =\ \mathrm{val}_{\mathrm{kmeans}}(T)$$

    Goal: Establish E valSDP(S) > B with small p-value.

    Mixon, V., arXiv:1710.00956
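    A sketch of the Monte Carlo certificate idea (illustrative, reusing kmeans_sdp() from the earlier sketch): solve the SDP on small random subsamples S, average the per-point values, and subtract a confidence margin to obtain a candidate lower bound B; the concentration argument that turns this into a p-value is in Mixon-Villar, and the margin below is a placeholder assumption.

```python
import numpy as np

def monte_carlo_certificate(X, k, s=100, trials=20, margin=0.0, seed=0):
    rng = np.random.default_rng(seed)
    vals = []
    for _ in range(trials):
        S = rng.choice(X.shape[0], size=s, replace=False)
        sdp_val, _ = kmeans_sdp(X[S], k)   # relaxed k-means value of the subsample
        vals.append(sdp_val / s)           # normalize per point, as in val_SDP(S)
    # E val_SDP(S) <= val_kmeans(T); the empirical mean minus a concentration
    # margin gives a lower bound B that holds with small p-value.
    return float(np.mean(vals)) - margin
```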

  • 21

    Better performance, and fast!

    [Figure, left: certified value val(S-SDP) as a function of the subsample size s (100–450), compared with the k-means++ value and the k-means++ lower bound. Right: running time (log scale) vs. number of points (log scale, 10^2–10^8) for the k-means SDP, k-means++, and the Monte Carlo certifier.]

    - Taking s = 100 ≪ 60,000 MNIST points already works.
    - Constant runtime in N; faster than k-means++ for N ≳ 10^6.
    - Theorem for mixtures of Gaussians (additive bound).

    Mixon, V., arXiv:1710.00956

  • 22

    Generic algorithms vs. model-dependent algorithms. Clustering the stochastic block model

    - Sparse graph clustering (non-geometric).
    - Similar generic algorithms (SDP, manifold optimization).
    - Model-tailored algorithms (AMP, spectral methods on non-backtracking matrices and the Bethe Hessian).
    - Can we learn the algorithm from data?
    - Use graph neural networks.

    Decelle, Krzakala, Moore, Zdeborová, 2013

    Krzakala, Moore, Mossel, Neeman, Sly, Zdeborová, Zhang, 2013

    Saade, Krzakala, Zdeborová, 2014

    Abbe, 2018

  • 23

    Spectral redemption

    A ∼ SBM(a/n, b/n, n, 2), sparse. The spectrum doesn’t concentrate (high-degree vertices dominate it), so the Laplacian is not useful for clustering.

    Consider the non-backtracking operator (from linearized BP):

    $$B_{(i \to j),(i' \to j')} = \begin{cases} 1 & \text{if } j = i' \text{ and } j' \neq i \\ 0 & \text{otherwise} \end{cases}$$

    Second eigenvector of B reveals clustering structure

    Krzakala, Moore, Mossel, Neeman, Sly, Zdeborová, Zhang, 2013
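    A small sketch of clustering with the non-backtracking operator defined above (illustrative; for large sparse graphs one would work with the 2n × 2n Ihara–Bass companion matrix rather than the dense edge-to-edge matrix built here):

```python
import numpy as np

def non_backtracking_labels(A):
    n = A.shape[0]
    # Directed edges of the graph.
    edges = [(i, j) for i in range(n) for j in range(n) if A[i, j]]
    idx = {e: t for t, e in enumerate(edges)}
    B = np.zeros((len(edges), len(edges)))
    for (i, j) in edges:
        for (i2, j2) in edges:
            if j == i2 and j2 != i:      # (i -> j) continues into (j -> j2) without backtracking
                B[idx[(i, j)], idx[(i2, j2)]] = 1.0
    # Eigenvector of the second-largest (real part) eigenvalue, aggregated onto vertices.
    w, V = np.linalg.eig(B)
    v2 = V[:, np.argsort(-w.real)[1]].real
    score = np.zeros(n)
    for (i, j), t in idx.items():
        score[j] += v2[t]                # sum the edge components flowing into each vertex
    return (score > 0).astype(int)
```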

  • 24

    Bethe Hessian

    BH(r) = (r² − 1) I − r A + D

    Fixed points of BP ←→ stationary points of the Bethe free energy. The second eigenvector reveals the clustering structure. Pitfall: highly dependent on the model.

    Goal: Combine graph operators I, D, A, ... to generate robust “data-driven spectral methods” for problems on graphs.

    Saade, Krzakala, Zdeborová, 2014

    Javanmard, Montanari, Ricci-Tersenghi, 2015
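    A minimal sketch of Bethe Hessian clustering as described above (illustrative; the choice r = √(average degree) follows the Saade–Krzakala–Zdeborová suggestion, and the sign split assumes two communities):

```python
import numpy as np

def bethe_hessian_labels(A):
    n = A.shape[0]
    degrees = A.sum(axis=1)
    D = np.diag(degrees)
    r = np.sqrt(degrees.mean())                  # assumed choice of r
    BH = (r ** 2 - 1) * np.eye(n) - r * A + D
    w, V = np.linalg.eigh(BH)                    # BH is symmetric
    v2 = V[:, 1]                                 # eigenvector of the 2nd smallest eigenvalue
    return (v2 > 0).astype(int)
```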

  • 25

    Graph neural networks

    Power method: v^{t+1} = M v^t, t = 1, ..., T.

    Graph with adjacency matrix A. Set M = {I_n, D, A, A², ..., A^{2^J}},

    $$v^{t+1}_{l} = \rho\Big( \sum_{M \in \mathcal{M}} M\, v^{t}\, \theta^{t}_{M,l} \Big), \qquad l = 1, \dots, d_{t+1}$$

    with v^t ∈ R^{n×d_t}, Θ = {θ^t_1, ..., θ^t_{|M|}}_t, and θ^t_M ∈ R^{d_t×d_{t+1}} trainable parameters.

    - Extension to the line graph (GNN with non-backtracking).
    - Covariant w.r.t. permutations: if G ↦ φ(G), then GΠ ↦ Πφ(G).
    - Preliminary theoretical analysis of the energy landscape.
    - Strong assumptions ⇒ local minima have low energy.

    Scarselli, Tsoi, Hagenbuchner, Monfardini, IEEE Transactions on Neural Networks, 2009

    Chen, Li, Bruna, 2017
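    A minimal numpy sketch of the layer above (illustrative, not the Chen–Li–Bruna implementation): one layer combines the operators in M with trainable matrices θ (random placeholders here) and a pointwise nonlinearity ρ = ReLU.

```python
import numpy as np

def gnn_layer(A, v, thetas, J=2):
    """One GNN layer: v_out = relu(sum_M M v theta_M) with M = {I, D, A, A^2, ..., A^(2^J)}."""
    n = A.shape[0]
    D = np.diag(A.sum(axis=1))
    ops = [np.eye(n), D] + [np.linalg.matrix_power(A, 2 ** j) for j in range(J + 1)]
    # len(thetas) must equal len(ops) = J + 3; each theta has shape (d_t, d_{t+1}).
    out = sum(M @ v @ theta for M, theta in zip(ops, thetas))
    return np.maximum(out, 0.0)                  # rho = ReLU

# Example usage with random placeholder parameters:
# n, d_in, d_out = 50, 8, 8
# A = (np.random.rand(n, n) < 0.1).astype(float); A = np.maximum(A, A.T)
# v = np.random.randn(n, d_in)
# thetas = [0.1 * np.random.randn(d_in, d_out) for _ in range(5)]
# v_next = gnn_layer(A, v, thetas, J=2)
```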

  • 26

    Numerical performance. SBM k = 2

    Overlap as a function of the SNR.

    Chen, Li, Bruna, 2017

  • 27

    Extension to quadratic assignment

    Graph matching: $\min_{X \in \Pi} \|G_1 X - X G_2\|_F^2 = \|G_1 X\|^2 + \|X G_2\|^2 - 2\langle G_1 X, X G_2 \rangle$

    Siamese neural network:

    G_2 = π G_1 + N, with N ∼ Erdős–Rényi noise; experiments with G_1 ∼ Erdős–Rényi and G_1 ∼ random regular.

    Nowak, V., Bandeira, Bruna, 2017

  • 28

    Numerical experiment

    [Figure: recovery rate vs. noise level (0.00–0.05) for the Erdős–Rényi graph model (left) and the random regular graph model (right), comparing SDP, LowRankAlign (k=4), and GNN.]

    Our model runs in O(n²), LowRankAlign in O(n³), the SDP in O(n⁴).

    Nowak, V., Bandeira, Bruna, 2017

  • 29

    Quadratic Assignment Problem (QAP)

    [Figure: vertices i, j, k with weights w_ij, w_ik, w_jk, and vertices r, s, t with distances d_rs, d_rt, d_st.]

    $$\min_{\pi} \ \sum_{i \in A} \sum_{j \in A} \sum_{r \in B} \sum_{s \in B} \underbrace{w_{ij}\, d_{rs}}_{E_{ijrs}} \, \pi_{ir}\, \pi_{js}$$

    $$\text{subject to} \quad \sum_{r} \pi_{ir} = 1 \ \ \forall i \in A, \qquad \sum_{i} \pi_{ir} = 1 \ \ \forall r \in B, \qquad \pi_{ir} \in \{0, 1\}$$

    Onaran, Villar, ongoing work
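    For tiny instances the QAP objective above can be minimized by brute force, which makes the formulation concrete (illustrative code; exact enumeration is only feasible for very small n):

```python
import itertools
import numpy as np

def qap_bruteforce(W, D):
    """Exhaustively minimize sum_{i,j} W_ij D_{pi(i) pi(j)} over permutations pi."""
    n = W.shape[0]
    best_cost, best_perm = np.inf, None
    for perm in itertools.permutations(range(n)):
        # pi_{i r} = 1 iff r = perm[i].
        cost = sum(W[i, j] * D[perm[i], perm[j]] for i in range(n) for j in range(n))
        if cost < best_cost:
            best_cost, best_perm = cost, perm
    return best_cost, best_perm
```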

  • 30

    Max-Product Belief Propagation for QAP

    - Posterior probability:

    $$p(\pi) = \frac{1}{Z} \prod_{i} \mathbb{1}\Big\{\sum_{t} \pi_{it} = 1\Big\} \prod_{r} \mathbb{1}\Big\{\sum_{k} \pi_{kr} = 1\Big\} \prod_{j \neq i} \prod_{s \neq r} e^{-\beta\, \pi_{ir} \pi_{js}\, E_{ijrs}}$$

    [Figure: factor graph with variable nodes π_ir and check nodes, showing the messages ν̂_{*→ir}, ν̂_{ir→*}, ν̂_{ir→js}, ν̂_{js→ir}.]

  • 31

    Max-Product Message Updates

    - Variable to check nodes:

    $$\nu_{ir \to *}(1) \cong \prod_{kt} e^{-\beta\, \hat\nu_{kt \to ir}(1)\, E_{irkt}}, \qquad \nu_{ir \to *}(0) \cong 1$$

    - Check to variable nodes:

    $$\nu_{* \to ir}(1) \cong \nu_{ir \to *}(1) \prod_{\ell \neq i} \nu_{\ell r \to *}(0) \prod_{t \neq r} \nu_{it \to *}(0)$$

    $$\nu_{* \to ir}(0) \cong \nu_{ir \to *}(0) \max_{t \neq r,\, \ell \neq i} \ \nu_{it \to *}(1)\, \nu_{\ell r \to *}(1) \prod_{u \neq t, r} \nu_{iu \to *}(0) \prod_{m \neq \ell, i} \nu_{mr \to *}(0)$$

    - Variable back to variable nodes:

    $$\hat\nu_{ir \to js}(1) \cong \nu_{* \to ir}(1) \prod_{ku \neq js} e^{-\beta\, \hat\nu_{ku \to ir}(1)\, E_{irku}}, \qquad \hat\nu_{ir \to js}(0) \cong \nu_{* \to ir}(0)$$

  • 32

    Simplifications (Min-Sum Algorithm)

    Introduce the log-likelihood ratio:

    $$\hat\ell_{ir \to js} = \frac{1}{\beta} \log \frac{\hat\nu_{ir \to js}(1)}{\hat\nu_{ir \to js}(0)}$$

    Then the algorithm simplifies to one update step, as follows:

    $$\hat\ell_{ir \to js} \leftarrow \min_{t \neq r,\, \ell \neq i} \sum_{ku} \left[ \frac{e^{\beta \hat\ell_{ku \to it}}}{1 + e^{\beta \hat\ell_{ku \to it}}} E_{itku} + \frac{e^{\beta \hat\ell_{ku \to \ell r}}}{1 + e^{\beta \hat\ell_{ku \to \ell r}}} E_{\ell r ku} \right] - 2 \sum_{ku \neq js} \frac{e^{\beta \hat\ell_{ku \to ir}}}{1 + e^{\beta \hat\ell_{ku \to ir}}} E_{irku} + \frac{e^{\beta \hat\ell_{js \to ir}}}{1 + e^{\beta \hat\ell_{js \to ir}}} E_{irjs}$$

  • 33

    A special case: Graph Matching (Isomorphism)

    [Figure: Graph-A with vertices i, j, k, ℓ and Graph-B with vertices r, s, t, u.]

    $$\min_{P} \ \|A - P B P^\top\|_F \qquad \text{subject to } P \in \Pi$$

    equivalent to the QAP with E_ijrs = A_ij ⊕ B_rs.

    When A ∼ ER(n, 0.5) and B = P A P^⊤ ⊕ Z for some P ∈ Π, where Z ∼ Ber(ε): find P given the unlabeled graphs.

    [Figure: heat map of the BP success rate for graph matching on ER(n, 0.5), as a function of the noise level (0.0–0.48) and the graph size n (15–75).]

  • 34

    Summary

    - k-means clustering
      - Nice manifold optimization landscape
      - Lower bound certificates from SDPs (data-driven)

    - Clustering the stochastic block model
      - AMP → Spectral methods → Graph neural networks

    - Quadratic assignment

  • 35

    Thank you

    Relax, no need to round: Integrality of clustering formulations. P. Awasthi, A. S. Bandeira, M. Charikar, R. Krishnaswamy, S. Villar, R. Ward. arXiv:1408.4045, Innovations in Theoretical Computer Science (ITCS), 2015.

    Probably certifiably correct k-means clustering. T. Iguchi, D. G. Mixon, J. Peterson, S. Villar. arXiv:1509.07983, Mathematical Programming, 2016.

    Clustering subgaussian mixtures by semidefinite programming. D. G. Mixon, S. Villar, R. Ward. arXiv:1602.06612, Information and Inference: A Journal of the IMA, 2017.

    Monte Carlo approximation certificates for k-means clustering. D. G. Mixon, S. Villar. arXiv:1710.00956.

    Supervised Community Detection with Hierarchical Graph Neural Networks. Z. Chen, L. Li, J. Bruna. arXiv:1705.08415.

    A note on learning algorithms for quadratic assignment with Graph Neural Networks. A. Nowak, S. Villar, A. S. Bandeira, J. Bruna. ICML (PADL) 2017, arXiv:1706.07450.