Communities, Spectral Clustering, and Random Walksbindel/present/2012-07-mmds.pdf · Random Walks...

45
Communities, Spectral Clustering, and Random Walks David Bindel Department of Computer Science Cornell University 13 Jul 2012

Transcript of Communities, Spectral Clustering, and Random Walksbindel/present/2012-07-mmds.pdf · Random Walks...

Page 1: Communities, Spectral Clustering, and Random Walksbindel/present/2012-07-mmds.pdf · Random Walks David Bindel Department of Computer Science Cornell University 13 Jul 2012. ... Strategy:

Communities, Spectral Clustering, andRandom Walks

David Bindel

Department of Computer ScienceCornell University

13 Jul 2012

Page 2: Communities, Spectral Clustering, and Random Walksbindel/present/2012-07-mmds.pdf · Random Walks David Bindel Department of Computer Science Cornell University 13 Jul 2012. ... Strategy:

Spectral clustering recipe

Ingredients:1. A subspace basis with useful information

(eigenvectors of L, L̄, B, etc; NMF factors)2. A way to extract clusters from the basis

(sign patterns; (scaled) latent coordinates)

What works for small communities? With overlap?

Page 3: Communities, Spectral Clustering, and Random Walksbindel/present/2012-07-mmds.pdf · Random Walks David Bindel Department of Computer Science Cornell University 13 Jul 2012. ... Strategy:

Classic spectral bisection

Goal: Split graph into equal parts, minimizing edges cut.

Strategy: turn into a constrained integer QPI Identify partition with label vector s̄ ∈ {±1}n.I Objective: s̄T Ls̄ = 4× |edges cut|I Constraint: eT s̄ = 0

This is NP hard, but...

Page 4: Communities, Spectral Clustering, and Random Walksbindel/present/2012-07-mmds.pdf · Random Walks David Bindel Department of Computer Science Cornell University 13 Jul 2012. ... Strategy:

Classical spectral bisection

2 4 6 8 10 12 14 16 18 20−1

−0.5

0

0.5

1

Hard: min s̄T Ls̄ s.t. eT s̄ = 0, s̄ ∈ {±1}n.Easy: min vT Lv s.t. eT v = 0, v ∈ Rn, ‖v‖2 = n.

v is the Fiedler vector (second eigenvector of L)

Page 5: Communities, Spectral Clustering, and Random Walksbindel/present/2012-07-mmds.pdf · Random Walks David Bindel Department of Computer Science Cornell University 13 Jul 2012. ... Strategy:

Three clusters?

Take three eigenvectors as latent coordinates + cluster:

V ≈

r1...r1r2...r2r3...r3

=

1...1

1...1

1...1

r1r2r3

= SR.

Equivalent to a matrix factorization!

Page 6: Communities, Spectral Clustering, and Random Walksbindel/present/2012-07-mmds.pdf · Random Walks David Bindel Department of Computer Science Cornell University 13 Jul 2012. ... Strategy:

Overlapping communities

0 20 40 60 80 100 120 140 160 180

0

20

40

60

80

100

120

140

160

180

nz = 3996

A ≈ S diag(β)ST , S ∈ {0,1}n×c

Page 7: Communities, Spectral Clustering, and Random Walksbindel/present/2012-07-mmds.pdf · Random Walks David Bindel Department of Computer Science Cornell University 13 Jul 2012. ... Strategy:

Spectrum for A with overlap

20 40 60 80 100 120 140 160 180

0

10

20

Index

Eig

enva

lue

Page 8: Communities, Spectral Clustering, and Random Walksbindel/present/2012-07-mmds.pdf · Random Walks David Bindel Department of Computer Science Cornell University 13 Jul 2012. ... Strategy:

Clustering and overlap

Dominant eigenvectors for A:

012

Alternate basis for the space:

0

0.5

1

How do we get the latter basis?

Page 9: Communities, Spectral Clustering, and Random Walksbindel/present/2012-07-mmds.pdf · Random Walks David Bindel Department of Computer Science Cornell University 13 Jul 2012. ... Strategy:

Recovering indicators

minimize ‖s̃‖1 (proxy for sparsity of s̃)s.t. s̃ = Uy (s̃ in the right space)

s̃i ≥ 1 (“seed” constraint)s̃ ≥ 0 (componentwise nonnegativity)

Given a basis U, want to extract a vector s̃ s.t.I s̃ lies close to the span of UI s̃ is almost an indicator for a communityI Some known “seeds” are in the community

Page 10: Communities, Spectral Clustering, and Random Walksbindel/present/2012-07-mmds.pdf · Random Walks David Bindel Department of Computer Science Cornell University 13 Jul 2012. ... Strategy:

Indicators from subspaces: LP edition

minimize ‖s̃‖1 (proxy for sparsity of s̃)s.t. s̃ = Uy (s̃ in the right space)

s̃i ≥ 1 (“seed” constraint)s̃ ≥ 0 (componentwise nonnegativity)

Recovers smallest set containing node i ifI U = SY−1 exactly.I Each set contains at least one element only in that set.

(Frequently works if there is not “too much” overlap.)

Got noise? Need thresholding!

Page 11: Communities, Spectral Clustering, and Random Walksbindel/present/2012-07-mmds.pdf · Random Walks David Bindel Department of Computer Science Cornell University 13 Jul 2012. ... Strategy:

Alas

The method is willing, but the space is weak:I Eigenvectors of A may not be ideal – what about L, L̄,

transition matrices for random walks?I Many small communities =⇒ many eigenvectors?

So consider:I Why eigenvectors?

I Optimization connectionI Dynamics connection

I What might serve better?

Page 12: Communities, Spectral Clustering, and Random Walksbindel/present/2012-07-mmds.pdf · Random Walks David Bindel Department of Computer Science Cornell University 13 Jul 2012. ... Strategy:

Optimization and quadratic forms

Indicate V ′ ⊆ V by s ∈ {0,1}n. Measure subgraph:

sT As = |E ′| = internal edges

sT Ds = edges incident on subgraph

sT Ls = edges between V ′ and V̄ ′

sT Bs = “surprising” internal edges

Page 13: Communities, Spectral Clustering, and Random Walksbindel/present/2012-07-mmds.pdf · Random Walks David Bindel Department of Computer Science Cornell University 13 Jul 2012. ... Strategy:

Rayleigh quotients

sT AssT s

= mean internal degree in subgraph

sT LssT s

= edges cut between V ′ and V̄ ′ relative to |V ′|

sT AssT Ds

= fraction of incident edges internal to V ′

sT LssT Ds

= fraction of incident edges cut

sT BssT s

= mean “surprising” internal degree in subgraph

sT BssT Ds

= mean fraction of internal degree that is surprising

Page 14: Communities, Spectral Clustering, and Random Walksbindel/present/2012-07-mmds.pdf · Random Walks David Bindel Department of Computer Science Cornell University 13 Jul 2012. ... Strategy:

Rayleigh quotients and eigenvalues

Basic connection (M spd):

xT KxxT Mx

stationary at x ⇐⇒ Kx = λMx

Easy despite lack of convexity.

Page 15: Communities, Spectral Clustering, and Random Walksbindel/present/2012-07-mmds.pdf · Random Walks David Bindel Department of Computer Science Cornell University 13 Jul 2012. ... Strategy:

Rayleigh quotients big and small

Decompose:

W T MW = I and W T KW = Λ = diag(λ1, . . . , λn).

For any x 6= 0, xT Mx = 1, have

x =∑

j

wjzj

xT Kx =∑

j

λjz2j

SosT KssT Ms

≈ λmax =⇒ s ≈∑

λj≈λmax

wjzj .

So look at invariant subspaces for extreme eigenvalues.

Page 16: Communities, Spectral Clustering, and Random Walksbindel/present/2012-07-mmds.pdf · Random Walks David Bindel Department of Computer Science Cornell University 13 Jul 2012. ... Strategy:

The dynamics perspective

Page 17: Communities, Spectral Clustering, and Random Walksbindel/present/2012-07-mmds.pdf · Random Walks David Bindel Department of Computer Science Cornell University 13 Jul 2012. ... Strategy:

The random walker

Basic idea: extract structure from random walk.

Start at seed and walk forwardDay 1: I came up with a funny joke!Day 2: I tell everyone in my familyDay 3: My mother tells a friend?

Or look at how quickly source is forgottenDay 1: David came up with a funny joke!Day 2: There’s a joke going around Cornell CS.Day 3: I read this bad joke on the web...

Page 18: Communities, Spectral Clustering, and Random Walksbindel/present/2012-07-mmds.pdf · Random Walks David Bindel Department of Computer Science Cornell University 13 Jul 2012. ... Strategy:

The random walker

Lazy random walk with transition matrix

S = (αI + AD−1)/(1 + α).

1. Start at p0, take k steps. Distribution:

pk = Skp0 (→ d/m as k →∞)

2. End at q0 after k steps. Conditional distribution on start:

qk ∝(

ST)k

q0 (→ e/n as k →∞)

Notes:I If the graph is undirected, ST = D−1SD.I If the graph is also regular, ST = S.

Page 19: Communities, Spectral Clustering, and Random Walksbindel/present/2012-07-mmds.pdf · Random Walks David Bindel Department of Computer Science Cornell University 13 Jul 2012. ... Strategy:

Simon-Ando theory

Markov chain with loosely-coupled subchains:I Rapid local mixing: after a few steps

pk ≈c∑

j=1

αj,kp(j)∞

where p(j)∞ is a local equilibrium for the j th subchain

I Slow equilibration: αj,k → αj,∞.

Alternately, rapid local mixing looks like:

qk ≈c∑

j=1

γj,ksj

where sj is an indicator for nodes in one subchain.

Page 20: Communities, Spectral Clustering, and Random Walksbindel/present/2012-07-mmds.pdf · Random Walks David Bindel Department of Computer Science Cornell University 13 Jul 2012. ... Strategy:

Spectral Simon-Ando picture

Exactly decoupled case (c decoupled chains):I Eigenvalue one has multiplicity c.I Eigenvectors of S are local equilibria.I Eigenvectors of ST are indicators for chains.I Rapid mixing =⇒ large gap to λc+1.

Weakly coupled case:I Cluster of c eigenvalues near 1.I Eigenvectors of S are combinations of local equilibria.I Eigenvectors of ST are combinations of indicators.I Large gap between λc and λc+1.

Page 21: Communities, Spectral Clustering, and Random Walksbindel/present/2012-07-mmds.pdf · Random Walks David Bindel Department of Computer Science Cornell University 13 Jul 2012. ... Strategy:

Beyond eigenvectors

Connections that suggest using dominant eigenvectors:I Optimization of quadratic formsI Asymptotic dynamics of random walks

Works well for a few big communities – what about local ones?

Page 22: Communities, Spectral Clustering, and Random Walksbindel/present/2012-07-mmds.pdf · Random Walks David Bindel Department of Computer Science Cornell University 13 Jul 2012. ... Strategy:

Ritz vectors

Variational characterization of eigenpairs:

xT KxxT Mx

stationary at x ⇐⇒ Kx = λMx

Rayleigh-Ritz approximation: x = Vy

(Vy)T K (Vy)

(Vy)T M(Vy)stationary at y ⇐⇒ (V T KV )y = λ̂(V T MV )y

Page 23: Communities, Spectral Clustering, and Random Walksbindel/present/2012-07-mmds.pdf · Random Walks David Bindel Department of Computer Science Cornell University 13 Jul 2012. ... Strategy:

Krylov + Rayleigh-Ritz

Usual approach to large-scale symmetric eigenproblems:1. Generate a basis for a Krylov subspace

Kk (A, x0) = span{x0,Ax0,A2x0, . . . ,Ak−1x0}

2. Use Rayleigh-Ritz on subpace3. Re-start if subspace gets too large

Page 24: Communities, Spectral Clustering, and Random Walksbindel/present/2012-07-mmds.pdf · Random Walks David Bindel Department of Computer Science Cornell University 13 Jul 2012. ... Strategy:

Krylov + Rayleigh-Ritz + dynamics

Eigendecomposition picture:

pk =N∑

j=1

αjλkj vj

Krylov + Rayleigh-Ritz:

{(µj ,wj)}rj=1 = Ritz pairs from Kr (S,p0)

pk =r∑

j=1

βjµkj wj , k < r

Page 25: Communities, Spectral Clustering, and Random Walksbindel/present/2012-07-mmds.pdf · Random Walks David Bindel Department of Computer Science Cornell University 13 Jul 2012. ... Strategy:

The idea: Ritz subspaces

Idea: Use unconverged subspace computations

Invariant subspace Ritz subspaceLong random walks Short random walksGlobal information Local informationPotentially expensive Cheap to computeMay need large space Small space okay

Page 26: Communities, Spectral Clustering, and Random Walksbindel/present/2012-07-mmds.pdf · Random Walks David Bindel Department of Computer Science Cornell University 13 Jul 2012. ... Strategy:

Current favorite method

1. Pick “seed” nodes j1, j2, . . .2. Take short random walks (length k ) from each seed3. Extract few Ritz vectors from span{q0,q1, . . . ,qk−1}.4. Approx indicators in span of all Ritz vectors.5. Possibly add more seeds and return to step 1.6. Convert raw “score” vector to a {0,1} indicator.

Page 27: Communities, Spectral Clustering, and Random Walksbindel/present/2012-07-mmds.pdf · Random Walks David Bindel Department of Computer Science Cornell University 13 Jul 2012. ... Strategy:

Wang test graph

1

2

3

4

5

6 7

8

9

10

11

12

13

14

15

16

17

1819

20

21

22

23

24

25

26

27

28

29

30

Page 28: Communities, Spectral Clustering, and Random Walksbindel/present/2012-07-mmds.pdf · Random Walks David Bindel Department of Computer Science Cornell University 13 Jul 2012. ... Strategy:

Spectrum for Wang test graph

5 10 15 20 25 30−0.5

0

0.5

1

Index

Eig

enva

lue

Page 29: Communities, Spectral Clustering, and Random Walksbindel/present/2012-07-mmds.pdf · Random Walks David Bindel Department of Computer Science Cornell University 13 Jul 2012. ... Strategy:

Zachary Karate graph

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

23

24

25

26

27

28

29

30

31

32

33

34

Page 30: Communities, Spectral Clustering, and Random Walksbindel/present/2012-07-mmds.pdf · Random Walks David Bindel Department of Computer Science Cornell University 13 Jul 2012. ... Strategy:

Spectrum for Karate

5 10 15 20 25 30

−0.5

0

0.5

1

Index

Eig

enva

lue

Page 31: Communities, Spectral Clustering, and Random Walksbindel/present/2012-07-mmds.pdf · Random Walks David Bindel Department of Computer Science Cornell University 13 Jul 2012. ... Strategy:

Football graph

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

23

24

25

26

27

28

29

3031

32

33

34

35

36

37

38

39

40

41

42

43

44

45

46

47

48

49

50

51

52

53

54

55

56

57

58

5960

61

62

63

64

65

66

67

68

69

70

71

72

73

74

75

76

77

78

79

80

81

82

83

84

85

86

87

88

89

90

91

92

93

94

95

96

97

98

99

100

101

102

103

104

105

106

107

108

109

110

111

112

113

114

115

Page 32: Communities, Spectral Clustering, and Random Walksbindel/present/2012-07-mmds.pdf · Random Walks David Bindel Department of Computer Science Cornell University 13 Jul 2012. ... Strategy:

Spectrum for Football

20 40 60 80 100

0

0.5

1

Index

Eig

enva

lue

Page 33: Communities, Spectral Clustering, and Random Walksbindel/present/2012-07-mmds.pdf · Random Walks David Bindel Department of Computer Science Cornell University 13 Jul 2012. ... Strategy:

Dolphin graph

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

23

24 25

26

27

28

29

30

31

32

33

34

35

36

37

38

39

40

41

42

43

44

45

46

47

48

49

50

51

52

53

54

55

56

57

58

59

60

61

62

Page 34: Communities, Spectral Clustering, and Random Walksbindel/present/2012-07-mmds.pdf · Random Walks David Bindel Department of Computer Science Cornell University 13 Jul 2012. ... Strategy:

Spectrum for Dolphin

10 20 30 40 50 60

−0.5

0

0.5

1

Index

Eig

enva

lue

Page 35: Communities, Spectral Clustering, and Random Walksbindel/present/2012-07-mmds.pdf · Random Walks David Bindel Department of Computer Science Cornell University 13 Jul 2012. ... Strategy:

Non-overlapping synthetic benchmark (µ = 0.5)

0 100 200 300 400 500 600 700 800 900 1000

0

100

200

300

400

500

600

700

800

900

1000

nz = 15746

Page 36: Communities, Spectral Clustering, and Random Walksbindel/present/2012-07-mmds.pdf · Random Walks David Bindel Department of Computer Science Cornell University 13 Jul 2012. ... Strategy:

Spectrum for synthetic benchmark

20 40 60 80 100

0.4

0.6

0.8

1

Index

Eig

enva

lue

Page 37: Communities, Spectral Clustering, and Random Walksbindel/present/2012-07-mmds.pdf · Random Walks David Bindel Department of Computer Science Cornell University 13 Jul 2012. ... Strategy:

Score vector

100 200 300 400 500 600 700 800 900 1,0000

0.2

0.4

0.6

0.8

1

Index

Sco

re

Score vector for the two-node seed of 492 and 513 in the firstLFR benchmark graph. Ten steps, three Ritz vectors.

Page 38: Communities, Spectral Clustering, and Random Walksbindel/present/2012-07-mmds.pdf · Random Walks David Bindel Department of Computer Science Cornell University 13 Jul 2012. ... Strategy:

Non-overlapping synthetic benchmark (µ = 0.6)

0 100 200 300 400 500 600 700 800 900 1000

0

100

200

300

400

500

600

700

800

900

1000

nz = 15316

Page 39: Communities, Spectral Clustering, and Random Walksbindel/present/2012-07-mmds.pdf · Random Walks David Bindel Department of Computer Science Cornell University 13 Jul 2012. ... Strategy:

Spectrum for synthetic benchmark

20 40 60 80 100

0.4

0.6

0.8

1

Index

Eig

enva

lue

Page 40: Communities, Spectral Clustering, and Random Walksbindel/present/2012-07-mmds.pdf · Random Walks David Bindel Department of Computer Science Cornell University 13 Jul 2012. ... Strategy:

Score vector

100 200 300 400 500 600 700 800 900 1,0000

0.5

1

Index

Sco

re

Score vector for the two-node seed of 492 and 513 in the firstLFR benchmark graph. Ten steps, three Ritz vectors.

Page 41: Communities, Spectral Clustering, and Random Walksbindel/present/2012-07-mmds.pdf · Random Walks David Bindel Department of Computer Science Cornell University 13 Jul 2012. ... Strategy:

Overlapping synthetic benchmark (µ = 0.3)

I 1000 nodesI 47 communitiesI 500 nodes belong to two communities

Page 42: Communities, Spectral Clustering, and Random Walksbindel/present/2012-07-mmds.pdf · Random Walks David Bindel Department of Computer Science Cornell University 13 Jul 2012. ... Strategy:

Spectrum for synthetic benchmark

20 40 60 80 100

0.4

0.6

0.8

1

Index

Eig

enva

lue

Page 43: Communities, Spectral Clustering, and Random Walksbindel/present/2012-07-mmds.pdf · Random Walks David Bindel Department of Computer Science Cornell University 13 Jul 2012. ... Strategy:

Score vector

100 200 300 400 500 600 700 800 900 1,0000

0.5

1

Index

Sco

re

Score vector for the two-node seed of 521 and 892.The desired indicator is in red.

Page 44: Communities, Spectral Clustering, and Random Walksbindel/present/2012-07-mmds.pdf · Random Walks David Bindel Department of Computer Science Cornell University 13 Jul 2012. ... Strategy:

Score vector

100 200 300 400 500 600 700 800 900 1,0000

0.5

1

1.5

Index

Sco

re

Score vector for the two-node seed of 521 and 892 +twelve reseeds. The desired indicator is in red.

Page 45: Communities, Spectral Clustering, and Random Walksbindel/present/2012-07-mmds.pdf · Random Walks David Bindel Department of Computer Science Cornell University 13 Jul 2012. ... Strategy:

Conclusions

Classic spectral methods use eigenvectors to cluster, but:I We don’t need to stop at partitioning!

I Overlap is okayI Key is how we mine the subspace

I We don’t need to stop at eigenvectors!I Can also use Ritz vectorsI Computation is cheap: short random walks

Still very much need:I Better theoretical groundingI Better ideas for thresholdingI Testing on large-scale examples