ISIT Tutorial: Information theory and machine learning, Part II


Emmanuel Abbe, Princeton University
Martin Wainwright, UC Berkeley

ISIT Tutorial: Information theory and machine learning, Part II

1

Inverse problems on graphs

A large variety of machine learning and data-mining problems are about inferring global properties on a collection of agents by observing local noisy interactions of these agents.

Examples:
- community detection in social networks
- image segmentation
- data classification and information retrieval
- object matching, synchronization
- page sorting
- protein-to-protein interactions
- haplotype assembly
- ...

5

Inverse problems on graphs

In each case: observe information on the edges of a network that has been generated from hidden attributes on the nodes, and try to infer back these attributes.

Dual to the graphical model learning problem (previous part).

6

What about graph-based codes?

[Diagram: channel coding; bits X1, ..., Xn are mapped by a code C to a codeword whose N symbols are sent through independent channels W, producing outputs Y1, ..., YN]

Different: in coding, the code is a design parameter and typically uses specific non-local interactions among the bits (e.g., random, LDPC, polar codes).

7

Outline of the talk

(1) What are the relevant types of "codes" and "channels" behind machine learning problems?
(2) What are the fundamental limits for these?

1. Community detection and clustering
2. Stochastic block models: fundamental limits and capacity-achieving algorithms
3. Open problems
4. Graphical channels and low-rank matrix recovery

8

Community detection and clustering

9

Networks provide local interactions among agents; one often wants to infer global similarity classes.

social networks: "friendship"
biological networks: "protein interactions"
genome Hi-C networks: "DNA contacts"
call graphs: "calls"

11

Tutorial motto: Can one establish a clear line-of-sight, as in communications with the Shannon capacity?

[Plot: peak data efficiency vs. SNR for WAN wireless technologies (EV-DO, HSDPA, 802.16), approaching the Shannon capacity; the region above capacity is infeasible]

The challenges of community detection

A long-studied and notoriously hard problem:
- what is a good clustering? assortative and disassortative relations
- how to get a good clustering? computationally hard; many heuristics...

-> work with models

12

The Stochastic Block Model

13

The stochastic block model: the DMC of clustering...?

p = (p1, . . . , pk) <- probability vector = relative sizes of the communities
P = diag(p)

[Diagram: k = 4 communities of sizes np1, np2, np3, np4]

14

SBM(n, p, W)

W <- symmetric k x k matrix with entries in [0,1] = probabilities of connecting among communities:

  W = ( W11 ... W1k )
      (  :   .   :  )
      ( Wk1 ... Wkk )

[Diagram: four communities with intra-community connection probabilities W11, W22, W33, W44 and inter-community probabilities W12, W13, W14, W23, W24, W34]
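To make the model concrete, here is a minimal sketch of how one could sample from SBM(n, p, W) (Python with numpy; illustrative only, not part of the tutorial material):

```python
import numpy as np

def sample_sbm(n, p, W, seed=None):
    """Sample (labels, adjacency) from SBM(n, p, W): each node draws a
    community from p; pair (i, j) is connected w.p. W[x_i, x_j]."""
    rng = np.random.default_rng(seed)
    x = rng.choice(len(p), size=n, p=p)       # hidden community labels
    A = np.zeros((n, n), dtype=int)
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < W[x[i], x[j]]:
                A[i, j] = A[j, i] = 1
    return x, A

# 2-SBM example: two balanced communities, intra-prob. 0.2, inter-prob. 0.05
x, A = sample_sbm(200, [0.5, 0.5], np.array([[0.20, 0.05], [0.05, 0.20]]), seed=0)
```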

The (exact) recovery problem

We will see weaker recovery requirements later

Starting point: progress in science often comes from understanding special cases...

15

Let Xn = [X1, . . . , Xn] represent the community variables of the nodes (drawn under p).

Definition. An algorithm X̂n(·) solves (exact) recovery in the SBM if, for a random graph G under the model, lim_{n->∞} P(Xn = X̂n(G)) = 1.

SBM with 2 symmetric communities: 2-SBM

16

2-SBM: p1 = p2 = 1/2, W11 = W22 = p, W12 = q

[Diagram: two communities of size n/2 each; edges within a community with probability p, across with probability q]

17

Some history for the 2-SBM

Recovery problem: P(X̂n = Xn) -> 1

[Timeline 1983-2014: Holland-Laskey-Leinhardt '83; Bui-Chaudhuri-Leighton-Sipser '84; Boppana '87; Dyer-Frieze '89; Snijders-Nowicki '97; Jerrum-Sorkin '98; Condon-Karp '99; Carson-Impagliazzo '01; McSherry '01; Bickel-Chen '09; Rohe-Chatterjee-Yu '11]

Bui, Chaudhuri, Leighton, Sipser '84 | maxflow-mincut | p = Ω(1/n), q = o(n^{-1-4/((p+q)n)})
Boppana '87 | spectral method | (p-q)/√(p+q) = Ω(√(log(n)/n))
Dyer, Frieze '89 | min-cut via degrees | p-q = Ω(1)
Snijders, Nowicki '97 | EM algorithm | p-q = Ω(1)
Jerrum, Sorkin '98 | Metropolis algorithm | p-q = Ω(n^{-1/6+ε})
Condon, Karp '99 | augmentation algorithm | p-q = Ω(n^{-1/2+ε})
Carson, Impagliazzo '01 | hill-climbing algorithm | p-q = Ω(n^{-1/2} log^4(n))
McSherry '01 | spectral method | (p-q)/√p ≥ Ω(√(log(n)/n))
Bickel, Chen '09 | N-G modularity | (p-q)/√(p+q) = Ω(log(n)/√n)
Rohe, Chatterjee, Yu '11 | spectral method | p-q = Ω(1)

algorithms driven... Instead of 'how': when can we recover the clusters (IT)?

18

Information-theoretic view of clustering (über alles)

Channel coding: X1, ..., Xn -> code C -> Y1, ..., YN through channels W
W = ( 1-ε  ε ; ε  1-ε ),  R = n/N:  reliable comm. iff R < 1 - H(ε)

2-SBM: an unorthodox code! X1, ..., Xn -> Y1, ..., YN, one output per node pair
W = ( 1-p  p ; 1-q  q ),  N = (n choose 2),  R = 2/n:  exact recovery iff ???

19

Some history for the 2-SBM

Recovery problem, P(X̂n = Xn) -> 1 (same timeline and table as above)

p = a log(n)/n, q = b log(n)/n

Recovery iff (a+b)/2 ≥ 1 + √(ab)   [Abbe-Bandeira-Hall]

[Phase diagram in the (a, b) plane: the region (a+b)/2 ≥ 1 + √(ab) is efficiently achievable, its complement infeasible]

20

Some history for the 2-SBM

Detection problem: ∃ε > 0 : P(d(X̂n, Xn)/n < 1/2 - ε) -> 1

[Timeline 2010-2014: Coja-Oghlan; Decelle-Krzakala-Moore-Zdeborova; Massoulié; Mossel-Neeman-Sly]

p = a/n, q = b/n: detection iff (a-b)² > 2(a+b)   [Mossel-Neeman-Sly]

p = a log(n)/n, q = b log(n)/n: recovery iff (a+b)/2 ≥ 1 + √(ab)   [Abbe-Bandeira-Hall]

What about multiple/asymmetric communities? Conjecture: detection changes with 5 or more.

21
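For reference, both closed-form thresholds above are one line each to evaluate; a quick Python sketch (not from the slides):

```python
import math

def exact_recovery_possible(a, b):
    """2-SBM, p = a log(n)/n, q = b log(n)/n: recovery iff (a+b)/2 >= 1 + sqrt(ab)."""
    return (a + b) / 2 >= 1 + math.sqrt(a * b)

def detection_possible(a, b):
    """2-SBM, p = a/n, q = b/n: detection iff (a-b)^2 > 2(a+b)."""
    return (a - b) ** 2 > 2 * (a + b)

print(exact_recovery_possible(6, 1))  # True:  3.5 >= 1 + sqrt(6) ~ 3.449
print(detection_possible(3, 2))       # False: 1 is not > 10
```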

Recovery in the 2-SBM: IT limit

p = a log(n)/n, q = b log(n)/n

What is ML? -> min-bisection. ML fails if two nodes can be swapped to reduce the cut.

[Diagram: two communities of size n/2; node i has Bi neighbors inside its own community (probability p each) and Ri neighbors across (probability q each)]

P(B1 ≤ R1) = n^{-((a+b)/2 - √(ab)) + o(1)}
P(∃i : Bi ≤ Ri) ≍ n · P(B1 ≤ R1)   (weak correlations)

Converse: if (a+b)/2 - √(ab) < 1, then ML fails w.h.p.; in fact
P(∃X̂n : d(X̂n, Xn)/n ∉ [1/2 - ε, 1/2 + ε]) -> 1

Abbe-Bandeira-Hall '14    22

Recovery in the 2-SBM: efficient algorithms

ML-decoding (NP-hard):
  max x^T A x   s.t.  xi = ±1,  1^T x = 0

spectral relaxation:
  max x^T A x   s.t.  ||x|| = 1,  1^T x = 0

lifting X = xx^T:
  max tr(AX)   s.t.  Xii = 1,  1^T X = 0,  X ⪰ 0,  rank(X) = 1

SDP: drop the rank-one constraint.

Abbe-Bandeira-Hall '14    23
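A minimal sketch of this SDP relaxation (Python with cvxpy as assumed tooling, not the authors' code); here the balancedness constraint is encoded via the entry-sum of X, and the two communities are read off the sign of the leading eigenvector of the optimal X:

```python
import cvxpy as cp
import numpy as np

def sdp_recover(A):
    """SDP relaxation: max tr(AX) s.t. X_ii = 1, sum of entries = 0, X PSD."""
    n = A.shape[0]
    X = cp.Variable((n, n), symmetric=True)
    constraints = [X >> 0, cp.diag(X) == 1, cp.sum(X) == 0]
    cp.Problem(cp.Maximize(cp.trace(A @ X)), constraints).solve()
    # Round the optimal X: sign of its leading eigenvector splits the nodes.
    _, V = np.linalg.eigh(X.value)
    return np.sign(V[:, -1])

# labels = sdp_recover(adjacency_matrix)   # entries in {-1, +1}
```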

Recovery in the 2-SBM: efficient algorithms

Theorem. The SDP solves recovery if 2 L_SBM + 11^T + I_n ⪰ 0, where L_SBM = D_{G+} - D_{G-} - A.

Note that the SDP can be expensive...
-> analyze the spectral norm of a random matrix:
  [Abbe-Bandeira-Hall '14] Bernstein: slightly loose
  [Hajek-Wu-Xu '15] Seginer bound
  [Bandeira, Bandeira-Van Handel '15] tight bound

Abbe-Bandeira-Hall '14    24

The general SBM

25

SBM(n, p, W)

Quiz: If a node is in community i, how many neighbors does it have in expectation in community j?

1. npj    2. npj Wij    3. npi Wij    4. 7i

-> 2: the "degree profile matrix" is nPW.

[Diagram: k = 4 communities of sizes np1, ..., np4 with connection probabilities Wij, as before]

26

Back to the information-theoretic view of clustering

[Diagram: channel coding X1, ..., Xn -> code C -> Y1, ..., YN vs. the SBM "code" X1, ..., Xn -> Y1, ..., YN, both through channels W]

Channel coding, R = n/N: reliable comm. iff R < max_p I(p, W)  <- a KL-divergence
BSC(ε): reliable comm. iff R < 1 - H(ε)
2-SBM, R = 2/n: "reliable comm." iff 1 < (a+b)/2 - √(ab)
General SBM: "reliable comm." iff 1 < J(p, W) ???

27

Main results

Theorem 1 [Abbe-Sandon '15]. Recovery is solvable in SBM(n, p, Q log(n)/n) if and only if

  J(p, Q) := min_{i<j} D_+((PQ)_i, (PQ)_j) ≥ 1,

where

  D_+(μ, ν) := max_{t in [0,1]} Σ_{l in [k]} ( t μ_l + (1-t) ν_l - μ_l^t ν_l^{1-t} )

and the sum at fixed t defines D_t(μ, ν). We call D_+ the CH-divergence.

• D_{1/2}(μ, ν) = (1/2) ||√μ - √ν||² is the Hellinger divergence (distance)
• max_t { -log Σ_i μ_i^t ν_i^{1-t} } is the Chernoff divergence
• D_t is an f-divergence Σ_i ν_i f(μ_i/ν_i) with f(x) = 1 - t + tx - x^t
• For the 2-SBM, the criterion reduces to (1/2)(√a - √b)² ≥ 1   [Abbe-Bandeira-Hall '14]

28
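Numerically, D_+ and the criterion J(p, Q) are straightforward to evaluate; a small Python sketch (grid search over t; illustrative only):

```python
import numpy as np

def D_t(mu, nu, t):
    """D_t(mu, nu) = sum_l [ t*mu_l + (1-t)*nu_l - mu_l^t * nu_l^(1-t) ]."""
    return np.sum(t * mu + (1 - t) * nu - mu**t * nu**(1 - t))

def D_plus(mu, nu, grid=1001):
    """CH-divergence: maximize D_t over t in [0, 1] (here on a grid)."""
    return max(D_t(mu, nu, t) for t in np.linspace(0.0, 1.0, grid))

def J(p, Q):
    """min over community pairs of D_+ between profiles; recovery iff J >= 1.
    Here (PQ)_i is taken as the i-th column of PQ (immaterial for symmetric Q)."""
    PQ = np.diag(p) @ Q
    k = len(p)
    return min(D_plus(PQ[:, i], PQ[:, j])
               for i in range(k) for j in range(i + 1, k))

# 2-SBM check: J should equal (sqrt(a) - sqrt(b))^2 / 2.
a, b = 6.0, 1.0
print(J([0.5, 0.5], np.array([[a, b], [b, a]])))   # ~1.051 = (sqrt(6)-1)^2 / 2
```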

Main results

Theorem 1 [Abbe-Sandon '15]. Recovery is solvable in SBM(n, p, Q log(n)/n) if and only if J(p, Q) ≥ 1 (as above).

Is recovery in the general SBM solvable efficiently down to the information-theoretic threshold? YES!

Theorem 2 [Abbe-Sandon '15]. The degree-profiling algorithm achieves the threshold and runs in quasi-linear time.

29

When can we extract a specific community?

Theorem [Abbe-Sandon '15]. If community i has a profile (PQ)_i at D_+-distance at least 1 from all other profiles (PQ)_j, j ≠ i, then it can be extracted w.h.p.

What if we do not know the parameters? We can learn them on the fly: [Abbe-Sandon '15] (second paper)

30

Proof techniques and algorithms

31

A key step

Classifying a node amounts to a hypothesis test between Poisson degree profiles:
Hypothesis 1: dv ~ P(log(n) (PQ)_1)    Hypothesis 2: dv ~ P(log(n) (PQ)_2)

Theorem. For any θ1, θ2 ∈ (R_+ \ {0})^k with θ1 ≠ θ2 and p1, p2 ∈ R_+ \ {0},

  Σ_{x ∈ Z_+^k} min( P_{ln(n)θ1}(x) p1, P_{ln(n)θ2}(x) p2 ) = Θ( n^{-D_+(θ1, θ2) - o(1)} ),

where D_+ is the CH-divergence.

[Abbe-Sandon '15]    32

How to use this?

Plan: put effort into recovering most of the nodes, and then finish greedily with local improvements.

[Abbe-Sandon '15]    33

The degree-profiling algorithm (capacity-achieving)

(1) Split G into two graphs: G' (log-degree) and G'' (loglog-degree)
(2) Run Sphere-comparison on G' -> classifies a fraction 1 - o(1) of the nodes (see next)
(3) Classify each node v from its degree profile dv in G'' with respect to the clustering of G', testing
    Hypothesis 1: dv ~ P(log(n) (PQ)_1)  vs.  Hypothesis 2: dv ~ P(log(n) (PQ)_2)
    with error probability Pe = n^{-D_+((PQ)_1, (PQ)_2) + o(1)}

[Abbe-Sandon '15]    34
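Step (3) is a Poisson likelihood test on the degree profile; a toy Python sketch (scipy assumed as tooling; the helper below is hypothetical, not the paper's implementation):

```python
import numpy as np
from scipy.stats import poisson

def classify_by_degree_profile(d_v, PQ, log_n):
    """Assign v to the community i maximizing the Poisson likelihood of its
    degree profile d_v under mean vector log(n) * (PQ)_i."""
    k = PQ.shape[1]
    loglik = [poisson.logpmf(d_v, log_n * PQ[:, i]).sum() for i in range(k)]
    return int(np.argmax(loglik))

# d_v[l] = number of neighbors of v (in G'') inside estimated community l
```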

How do we get most nodes correctly?

35

Other recovery requirements

Partial recovery: an algorithm solves partial recovery in the SBM with accuracy c if it produces a clustering which is correct on a fraction c of the nodes with high probability.

- Exact recovery: c = 1
- Almost exact recovery: c = 1 - o(1)
- Weak recovery or detection: c = 1/k + ε for some ε > 0 (for the symmetric k-SBM)

For all the above: what are the "efficient" vs. "information-theoretic" fundamental limits?

36

Partial recovery in SBM(n, p, Q/n)

What is a good notion of SNR? Note that the SNR should scale if Q scales!

Proposed notion of SNR: |λmin|² / λmax, where λmin, λmax are eigenvalues of PQ

  = (a-b)² / (2(a+b)) for 2 symmetric communities
  = (a-b)² / (k(a+(k-1)b)) for k symmetric communities

[Diagram: in-community degree Bin(n/2, a/n) vs. cross-community degree Bin(n/2, b/n)]

Theorem (informal) [Abbe-Sandon '15]. In the sparse SBM(n, p, Q/n), the Sphere-comparison algorithm recovers a fraction of nodes which approaches 1 when the SNR diverges.

37
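The proposed SNR is one line of linear algebra; a quick Python check against the 2-symmetric formula (illustrative):

```python
import numpy as np

def sbm_snr(p, Q):
    """SNR = |lambda_min|^2 / lambda_max over the eigenvalues of PQ."""
    eig = np.linalg.eigvals(np.diag(p) @ Q)
    return min(abs(eig)) ** 2 / max(eig.real)

a, b = 5.0, 1.0
Q = np.array([[a, b], [b, a]])
print(sbm_snr([0.5, 0.5], Q))          # 4/3, matching (a-b)^2 / (2(a+b))
```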

A node neighborhood in SBM(n, p, Q/n)

[Diagram: a node v with σv = i; its expected number of depth-1 neighbors lying in community l is p_l Q_{il}, i.e., the profile (PQ)_i; at depth r the neighborhood Nr(v) grows like ((PQ)^r)_i]

Sphere-comparison

[Abbe-Sandon '15]    38

Sphere-comparison. Take now two nodes v and v' with σv = i and σv' = j, and compare them from

  |Nr(v) ∩ Nr'(v')|

-> hard to analyze...

[Abbe-Sandon '15]    39

Sphere-comparison. Decorrelate: subsample G with probability c to get E, and compare v and v' through the number of E-edges crossing between the spheres grown in G\E:

  N_{r,r'}[E](v · v') = number of crossing edges ≈ Nr[G\E](v) · (cQ/n) · Nr'[G\E](v')

With Nr[G\E](v) ≈ ((1-c)PQ)^r e_{σv}, this gives

  N_{r,r'}[E](v · v') ≈ c(1-c)^{r+r'} e_{σv} · Q(PQ)^{r+r'} e_{σv'} / n

Additional steps:
1. look at several depths -> Vandermonde system
2. use "anchor nodes"

[Abbe-Sandon '15]    40
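For intuition, a toy Python sketch of the basic statistic (networkx assumed as tooling; this omits the several-depths/Vandermonde and anchor-node steps, so it is far from the full algorithm):

```python
import networkx as nx

def sphere(G, v, r):
    """N_r(v): nodes at distance exactly r from v in G."""
    dist = nx.single_source_shortest_path_length(G, v, cutoff=r)
    return {u for u, d in dist.items() if d == r}

def crossing_edges(E_edges, S1, S2):
    """Number of subsampled edges with one endpoint in each sphere."""
    return sum((u in S1 and w in S2) or (u in S2 and w in S1) for u, w in E_edges)

# Sketch: subsample the edges of G into E w.p. c, set H = G minus E, and compare
# node pairs via crossing_edges(E, sphere(H, v, r), sphere(H, v2, r2)).
```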

A real data example

41

The political blogs network [Adamic and Glance '05]

1222 blogs (left- and right-leaning); edge = hyperlink between blogs

Our algorithm misclassifies ~60 of the 1222 blogs, i.e., we recover 95% of the nodes correctly. The CH-divergence is close to 1.

42

Some open problems in community detection

43

Some open problems in community detection

I. The SBM:
a. Recovery
   [Abbe-Sandon '15] should extend to k = o(log(n))
   [Chen-Xu '14] k, p, q scale with n polynomially
   Growing number of communities? Sub-linear communities?

44

Some open problems in community detection

I. The SBM:
a. Recovery
b. Detection and broadcasting on trees

[Mossel-Neeman-Sly '13] Converse for detection in the 2-SBM, p = a/n, q = b/n: the local neighborhood of a node is a Galton-Watson tree with offspring Poisson((a+b)/2), and the root bit Xr is broadcast along the edges through BSC(b/(a+b)).

If (a+b)/2 <= 1 the tree dies w.p. 1; if (a+b)/2 > 1 the tree survives w.p. > 0.

Unorthodox broadcasting problem: when can we detect the root bit?
If and only if c > 1/(1-2ε)²   [Evans-Kenyon-Peres-Schulman '00]

With c = (a+b)/2 and ε = b/(a+b):   c(1-2ε)² > 1  <=>  (a-b)²/(2(a+b)) > 1

45
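The last equivalence is a one-line computation; a small Python check (illustrative; equality may flip from floating point exactly at the boundary):

```python
import random

def root_detectable(a, b):
    """Broadcasting condition c(1 - 2*eps)^2 > 1 with c = (a+b)/2, eps = b/(a+b)."""
    c, eps = (a + b) / 2, b / (a + b)
    return c * (1 - 2 * eps) ** 2 > 1

def sbm_detectable(a, b):
    return (a - b) ** 2 / (2 * (a + b)) > 1

# The two conditions agree:
for _ in range(1000):
    a, b = random.uniform(0.1, 10), random.uniform(0.1, 10)
    assert root_detectable(a, b) == sbm_detectable(a, b)
```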

Some open problems in community detection

I. The SBM:
a. Recovery
b. Detection and broadcasting on trees

For k clusters? SNR = (a-b)² / (k(a+(k-1)b))

[Decelle-Krzakala-Zdeborova-Moore '11]

[Diagram, SNR axis: impossible below ck, possible in (ck, 1), efficient above 1]

Conjecture. For the symmetric k-SBM(n, a, b), there exists ck s.t.
(1) if SNR < ck, then detection cannot be solved,
(2) if ck < SNR < 1, then detection can be solved information-theoretically but not efficiently,
(3) if SNR > 1, then detection can be solved efficiently.
Moreover ck = 1 for k ∈ {2, 3, 4} and ck < 1 for k ≥ 5.

46

Some open problems in community detection

I. The SBM:
a. Recovery
b. Detection and broadcasting on trees
c. Partial recovery and the SNR-distortion curve

Conjecture. For the symmetric k-SBM(n, a, b) and α ∈ (1/k, 1), there exist βk, γk s.t. partial recovery of accuracy α is solvable if and only if SNR > βk, and efficiently solvable iff SNR > γk.

[Plot: distortion δ = 1 - α against SNR, with marks at 1/2 and 1, sketched for k = 2; gaps expected for k ≥ 5]

47

Some open problems in community detection

II. Other block models

- Censored block models
- Labelled block models

2-CBM(n, p, ε)   [Abbe-Bandeira-Bracher-Singer '14]

[Diagram: two communities of size n/2; each pair is observed with probability p and carries a binary label (0 within, 1 across), flipped with probability ε]

Related:
"correlation clustering" [Bansal, Blum, Chawla '04]
"LDGM codes" [Kumar, Pakzad, Salavati, Shokrollahi '12]
"labelled block model" [Heimlicher, Lelarge, Massoulié '12]
"soft CSPs" [Abbe, Montanari '13]
"pairwise measurements" [Chen, Suh, Goldsmith '14 and '15]
"bounded-size correlation clustering" [Puleo, Milenkovic '14]

2-CBM: [Abbe-Bandeira-Bracher-Singer, Xu-Lelarge-Massoulie, Saad-Krzakala-Lelarge-Zdeborova, Chin-Rao-Vu, Hajek-Wu-Xu]

48

Some open problems in community detection

II. Other block models

- Censored block models
- Labelled block models
- Mixed-membership block models [Airoldi-Blei-Fienberg-Xing '08]
- Overlapping block models [Abbe-Sandon '15]

OSBM(n, p, f): Xu, Xv i.i.d. ~ p on {0,1}^s; f : {0,1}^s × {0,1}^s -> [0,1]; connect (u, v) with prob. f(Xu, Xv)

49

Some open problems in community detection

II. Other block models

- Censored block models
- Labelled block models
- Mixed-membership block models
- Overlapping block models
- Degree-corrected block models [Karrer-Newman '11]
- Planted community [Deshpande-Montanari '14, Montanari '15]
- Hypergraph models

For all the above: Is there a CH-divergence behind? A generalized notion of SNR? Detection gaps? Efficient algorithms?

52

Some open problems in community detection

III. Beyond block models:
a. Exchangeable random arrays and graphons
b. Graphical channels
c. Low-rank matrix recovery

A general information-theoretic model?

a. Graphons: w : [0,1]² -> [0,1], with P(Eij = 1 | Xi = xi, Xj = xj) = w(xi, xj)

[Excerpt shown on the slide, from the graphon-estimation paper of Airoldi-Costa-Chan:]

Figure 1: [Left] Given a graphon w : [0, 1]² -> [0, 1], we draw i.i.d. samples ui, uj from Uniform[0,1] and assign Gt[i, j] = 1 with probability w(ui, uj), for t = 1, . . . , 2T. [Middle] Heat map of a graphon w. [Right] A random graph generated by the graphon shown in the middle. Rows and columns of the graph are ordered by increasing ui, instead of i, for better visualization.

Wolfe [9] studied the consistency properties, but did not provide algorithms to estimate the graphon. To the best of our knowledge, the only method that estimates graphons consistently, besides ours, is USVT [8]. However, our algorithm has better complexity and outperforms USVT in our simulations. More recently, other groups have begun exploring approaches related to ours [28, 24].

The proposed approximation procedure requires w to be piecewise Lipschitz. The basic idea is to approximate w by a two-dimensional step function ŵ with diminishing intervals as n increases. The proposed method is called the stochastic blockmodel approximation (SBA) algorithm, as the idea of using a two-dimensional step function for approximation is equivalent to using the stochastic block models [10, 22, 13, 7, 25]. The SBA algorithm is defined up to permutations of the nodes, so the estimated graphon is not canonical. However, this does not affect the consistency properties of the SBA algorithm, as the consistency is measured w.r.t. the graphon that generates the graphs.

2 Stochastic blockmodel approximation: Procedure

In this section we present the proposed SBA algorithm and discuss its basic properties.

2.1 Assumptions on graphons

We assume that w is piecewise Lipschitz, i.e., there exists a sequence of non-overlapping intervals Ik = [αk−1, αk] defined by 0 = α0 < . . . < αK = 1, and a constant L > 0 such that, for any (x1, y1) and (x2, y2) ∈ Iij = Ii × Ij,

  |w(x1, y1) − w(x2, y2)| ≤ L (|x1 − x2| + |y1 − y2|).

For generality we assume w to be asymmetric, i.e., w(u, v) ≠ w(v, u), so that symmetric graphons can be considered as a special case. Consequently, a random graph G(n, w) generated by w is directed, i.e., G[i, j] ≠ G[j, i].

2.2 Similarity of graphon slices

The intuition of the proposed SBA algorithm is that if the graphon is smooth, neighboring cross-sections of the graphon should be similar. In other words, if two labels ui and uj are close, i.e., |ui − uj| ≈ 0, then the difference between the row slices |w(ui, ·) − w(uj, ·)| and the column slices |w(·, ui) − w(·, uj)| should also be small. To measure the similarity between two labels using the graphon slices, we define the following distance:

  dij = (1/2) [ ∫₀¹ (w(x, ui) − w(x, uj))² dx + ∫₀¹ (w(ui, y) − w(uj, y))² dy ].   (2)

[Lovasz] SBMs can approximate graphons [Choi-Wolfe, Airoldi-Costa-Chan]

53
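A minimal Python sketch of the sampling procedure described in Figure 1 (illustrative; the symmetrization assumes the undirected case):

```python
import numpy as np

def sample_graphon(n, w, seed=None):
    """Draw u_i ~ Uniform[0,1] i.i.d. and connect (i, j) w.p. w(u_i, u_j)."""
    rng = np.random.default_rng(seed)
    u = np.sort(rng.uniform(size=n))              # sorted only for visualization
    P = w(u[:, None], u[None, :])                 # edge-probability matrix
    upper = np.triu(rng.uniform(size=(n, n)) < P, 1).astype(int)
    return u, upper + upper.T                     # symmetric adjacency matrix

u, A = sample_graphon(300, lambda x, y: 0.5 * np.exp(-(x + y)), seed=0)
```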

Graphical channels [Abbe-Montanari '13]

A family of channels motivated by inference-on-graph problems:
• G = (V, E) a k-hypergraph
• Q : X^k -> Y a channel (kernel)

[Diagram: inputs X1, ..., Xn on the nodes of G; each hyperedge e outputs Ye through Q]

For x ∈ X^V, y ∈ Y^E:

  P(y | x) = Π_{e ∈ E(G)} Q(ye | x[e])

How much information do graphical channels carry?

54
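A small Python sketch of the graphical-channel likelihood, with the 2-SBM as an example kernel (hypothetical helper names, for illustration):

```python
import math

def log_likelihood(edges, x, y, Q):
    """Graphical channel: log P(y|x) = sum over hyperedges e of log Q(y_e | x[e]).

    edges : list of node-index tuples (the hyperedges of G)
    x     : sequence of input symbols, one per node
    y     : sequence of output symbols, one per edge (same order as edges)
    """
    return sum(math.log(Q(y[i], tuple(x[v] for v in e)))
               for i, e in enumerate(edges))

def sbm_kernel(p, q):
    """k = 2 example: an edge (y_e = 1) appears w.p. p if the two nodes'
    communities agree and w.p. q otherwise."""
    def Q(y_e, x_e):
        r = p if x_e[0] == x_e[1] else q
        return r if y_e == 1 else 1 - r
    return Q
```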

Theorem. lim_{n->∞} (1/n) I(Xn; G) exists for Erdős-Rényi graphs and some kernels.

-> Why not always? What is the limit?

Connection: sparse PCA and clustering

55

SBM and low-rank Gaussian model

Spiked Wigner model (Gaussian, symmetric):  Y_λ = √(λ/n) XX^T + Z,  i.e.,  Yij = √(λ/n) Xi Xj + Zij

• X = (X1, . . . , Xn) i.i.d. Bernoulli(ε) -> sparse PCA [Amini-Wainwright '09, Deshpande-Montanari '14]
• X = (X1, . . . , Xn) i.i.d. Rademacher(1/2) -> SBM???

If np_n, nq_n -> ∞ (large degrees) with λ_n = n(p_n − q_n)² / (2(p_n + q_n)) -> λ (finite SNR):

Theorem [Deshpande-Abbe-Montanari '15].

  lim_{n->∞} (1/n) I(X; G(n, p_n, q_n)) = lim_{n->∞} (1/n) I(X; Y_λ) = λ/4 + λ*²/(4λ) − λ*/2 + I(λ*)   (single-letter),

where λ* solves λ* = λ(1 − MMSE(λ*)) for the scalar channel Y1(λ) = √λ X1 + Z1. Proof via I-MMSE [Guo-Shamai-Verdú].

56
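The fixed-point equation is easy to solve numerically: for X1 Rademacher, the scalar channel Y1 = √γ X1 + Z1 has MMSE(γ) = 1 − E[tanh(γ + √γ Z)], Z ~ N(0,1). A Python sketch with damped iteration (illustrative only):

```python
import numpy as np

nodes, weights = np.polynomial.hermite_e.hermegauss(101)
weights /= weights.sum()                 # E[f(Z)] ~ sum w_i f(z_i), Z ~ N(0,1)

def mmse_rademacher(g):
    """MMSE of Y = sqrt(g) X + Z, X ~ Rademacher(1/2)."""
    return 1.0 - np.sum(weights * np.tanh(g + np.sqrt(g) * nodes))

def lambda_star(lam, iters=300):
    """Damped fixed-point iteration for l* = lam * (1 - MMSE(l*))."""
    ls = lam / 2
    for _ in range(iters):
        ls = 0.5 * ls + 0.5 * lam * (1.0 - mmse_rademacher(ls))
    return ls

print(lambda_star(2.0))   # nontrivial solution for lam > 1; ~0 for lam <= 1
```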

Conclusion

Community detection couples naturally with the channel view of information theory, and more specifically with unorthodox versions of:
- graph-based codes
- f-divergences
- broadcasting problems
- I-MMSE
- ...

More generally, the problem of inferring global similarity classes in data sets from noisy local interactions is at the center of many problems in ML, and an information-theoretic view of these problems seems needed and powerful.

57

Documents related to the tutorial: www.princeton.edu/~eabbe

Questions?

58