Spectral Clusteringwcohen/10-605/unsup-on-graphs.pdf · Spectral Clustering: Graph = Matrix W*v 1 =...

Spectral Clustering

[Figure: clustering results, spectral vs. k-means, with the data shown after transformation]

Spectral Clustering: Graph = Matrix

[Figure: graph with nodes A–J and its 10×10 adjacency matrix, with rows and columns labeled A–J and a 1 for each link]

Spectral Clustering: Graph = Matrix
Transitively Closed Components = “Blocks”

[Figure: graph with nodes A–J and its adjacency matrix, rows and columns labeled A–J, with _ on the diagonal and the 1 entries grouped into blocks]

Of course we can’t see the “blocks” unless the nodes are sorted by cluster…

Sometimes called a block-stochastic matrix:
• each node has a latent “block”
• fixed probability qi for links between elements of block i
• fixed probability q0 for links between elements of different blocks
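A minimal sketch of sampling such a block-stochastic matrix, to show why the blocks only become visible once the nodes are sorted by cluster. The function name `sample_block_matrix` and its parameters are hypothetical, and the graph is assumed to be undirected with no self-links.

```python
import numpy as np

def sample_block_matrix(block_sizes, q_within, q_between, seed=0):
    """Sample a symmetric 0/1 adjacency matrix with latent blocks.

    q_within[i] is the link probability inside block i (the slide's qi);
    q_between is the link probability across blocks (the slide's q0).
    """
    rng = np.random.default_rng(seed)
    blocks = np.repeat(np.arange(len(block_sizes)), block_sizes)
    n = blocks.size
    same_block = blocks[:, None] == blocks[None, :]
    q_row = np.asarray(q_within, dtype=float)[blocks][:, None]
    prob = np.where(same_block, q_row, q_between)
    # Sample each unordered pair once (upper triangle), then mirror it.
    upper = np.triu(rng.random((n, n)) < prob, k=1)
    A = (upper | upper.T).astype(int)
    return A, blocks

A, blocks = sample_block_matrix([4, 3, 3], [0.9, 0.8, 0.9], 0.05)

# Shuffle the node order: the blocks are still there but no longer visible.
perm = np.random.default_rng(1).permutation(len(blocks))
A_shuffled = A[np.ix_(perm, perm)]

# Sorting the nodes by block label makes the diagonal blocks reappear.
order = np.argsort(blocks[perm], kind="stable")
A_sorted = A_shuffled[np.ix_(order, order)]
print(A_shuffled)
print(A_sorted)
```

In real clustering problems the block labels are of course unknown; recovering an ordering like `order` is exactly what the clustering has to do.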

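Spectral Clustering: Graph = Matrix
Vector = Node → Weight

[Figure: graph with nodes A–J, its matrix M, and a vector v that assigns a weight to each node: A = 3, B = 2, C = 3, remaining entries D–J not shown]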

Spectral Clustering: Graph = Matrix
M*v1 = v2 “propagates weights from neighbors”

[Figure: graph with nodes A–J and its adjacency matrix M]

v1: A = 3, B = 2, C = 3 (entries for D–J not shown)

v2 = M v1:
A: 2*1 + 3*1 + 0*1
B: 3*1 + 3*1
C: 3*1 + 2*1
(entries for D–J not shown)

Spectral Clustering: Graph = Matrix
W*v1 = v2 “propagates weights from neighbors”

[Figure: graph with nodes A–J and its weight matrix W, rows and columns labeled A–J, with _ on the diagonal and entries of .5 and .3]

W: normalized so columns sum to 1

v1: A = 3, B = 2, C = 3 (entries for D–J not shown)

v2 = W v1:
A: 2*.5 + 3*.5 + 0*.3
B: 3*.3 + 3*.5
C: 3*.33 + 2*.5
(entries for D–J not shown)
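The two propagation steps can be reproduced on a small graph. This is a sketch under assumptions: the links among A, B, C and A's third neighbor follow the arithmetic shown on the slides, but the name of that third neighbor (D here) and its extra links D-E and D-F are made up so that it has degree 3, matching the .3 weight.

```python
import numpy as np

nodes = ["A", "B", "C", "D", "E", "F"]
# A-B, A-C, B-C follow the slide's arithmetic; A-D, D-E, D-F are hypothetical.
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("A", "D"), ("D", "E"), ("D", "F")]
index = {name: i for i, name in enumerate(nodes)}

M = np.zeros((len(nodes), len(nodes)))
for a, b in edges:
    M[index[a], index[b]] = M[index[b], index[a]] = 1.0

v1 = np.array([3.0, 2.0, 3.0, 0.0, 0.0, 0.0])  # A = 3, B = 2, C = 3, rest 0

# Unnormalized propagation: each node sums its neighbors' weights.
v2_raw = M @ v1            # A: 2+3+0 = 5, B: 3+3 = 6, C: 3+2 = 5

# Normalized propagation: divide column j by degree(j) so columns sum to 1.
degree = M.sum(axis=0)
W = M / degree             # broadcasting divides column j by degree[j]
v2 = W @ v1                # A: 2*.5 + 3*.5 + 0*.33 = 2.5

for name in nodes:
    print(name, v2_raw[index[name]], round(v2[index[name]], 3))
```

Because the columns of W sum to 1, the total weight is preserved: the entries of `v2` add up to the same total as `v1`.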


W·v = λv: v is an eigenvector with eigenvalue λ

Q: How do I pick v to be an eigenvector for a block-stochastic matrix?






[Figure (Shi & Meila, 2002): eigenvalues λ1, λ2, λ3, λ4, λ5,6,7,… and eigenvectors e1, e2, e3, with the “eigengap” marked]



[Figure (Shi & Meila, 2002): nodes from three clusters (x, y, z) plotted in eigenvector coordinates, with axes labeled e1, e2, e3 and ticks from -0.4 to 0.4]


W·v = λv: v is an eigenvector with eigenvalue λ

If W is connected but roughly block diagonal with k blocks then
• the “top” eigenvector is a constant vector
• the next k eigenvectors are roughly piecewise constant with “pieces” corresponding to blocks

Spectral clustering:
• Find the top k+1 eigenvectors v1,…,vk+1
• Discard the “top” one
• Replace every node a with k-dimensional vector xa = <v2(a),…,vk+1(a)>
• Cluster with k-means
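The four steps can be sketched as follows. Several choices here are assumptions rather than part of the slides: W is taken to be the row-normalized matrix D^-1 A (so that the top eigenvector is exactly constant), its eigenvectors are obtained from the symmetric matrix D^-1/2 A D^-1/2, and the k-means step is a plain Lloyd's iteration started from evenly spaced rows. All function names are hypothetical.

```python
import numpy as np

def top_eigenvectors(A, m):
    """Top-m eigenpairs of W = D^-1 A, computed via the symmetric matrix
    D^-1/2 A D^-1/2, which has the same eigenvalues."""
    d = A.sum(axis=1)
    S = A / np.sqrt(np.outer(d, d))
    vals, U = np.linalg.eigh(S)              # eigenvalues in ascending order
    top = np.argsort(vals)[::-1][:m]
    V = U[:, top] / np.sqrt(d)[:, None]      # map back to eigenvectors of D^-1 A
    return vals[top], V

def kmeans(X, k, n_iter=100):
    """Plain Lloyd's k-means, started from k evenly spaced rows of X."""
    centers = X[np.linspace(0, len(X) - 1, k).astype(int)].copy()
    labels = None
    for _ in range(n_iter):
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_labels = dist.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        for c in range(k):
            if np.any(labels == c):          # keep the old center if a cluster empties
                centers[c] = X[labels == c].mean(axis=0)
    return labels

def spectral_clustering(A, k):
    vals, V = top_eigenvectors(A, k + 1)     # v1, ..., v_{k+1}
    X = V[:, 1:]                             # discard the "top" one
    return kmeans(X, k), vals, V

# Two 4-node cliques (nodes 0-3 and 4-7) joined by the single edge 3-4.
A = np.zeros((8, 8))
A[:4, :4] = 1
A[4:, 4:] = 1
np.fill_diagonal(A, 0)
A[3, 4] = A[4, 3] = 1

labels, vals, V = spectral_clustering(A, k=2)
print("eigenvalues:", np.round(vals, 4))
print("top eigenvector:", np.round(V[:, 0], 4))
print("labels:", labels)
```

On this toy graph the first column of `V` is constant and the second changes sign between the two cliques, which is the piecewise-constant structure that k-means then picks up.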

Back to the 2-D examples

• Create a graph.
• Each 2-D point is a vertex
• Add edges connecting all points in the 2-D initial space to all other points
• Weight of edge is distance

[Figure: 2-D points in three groups labeled b, g, and r]

Spectral Clustering: Pros and Cons

• Elegant, and well-founded mathematically
• Tends to avoid local minima
  – Optimal solution to relaxed version of mincut problem (Normalized cut, aka NCut)
• Works quite well when relations are approximately transitive (like similarity, social connections)
• Expensive for very large datasets
  – Computing eigenvectors is the bottleneck
  – Approximate eigenvector computation not always useful
• Noisy datasets sometimes cause problems
  – Picking number of eigenvectors and k is tricky
  – “Informative” eigenvectors need not be in top few
  – Performance can drop suddenly from good to terrible

Experimental results: best-case assignment of class labels to clusters

Spectral clustering as Optimization

Spectral Clustering: Graph = Matrix
W*v1 = v2 “propagates weights from neighbors”

W·v = λv: v is an eigenvector with eigenvalue λ
• A = adjacency matrix
• D = diagonal normalizer for A (# outlinks)
• W = A D^-1

• smallest eigenvecs of D-A are largest eigenvecs of A
• smallest eigenvecs of I-W are largest eigenvecs of W

Suppose each y(i) = +1 or -1:
• Then y is a cluster indicator that splits the nodes into two
• what is y^T(D-A)y ?

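\[
\begin{aligned}
y^T (D - A)\, y &= y^T D y - y^T A y = \sum_i d_i y_i^2 - \sum_{i,j} a_{ij} y_i y_j \\
&= \frac{1}{2}\Big[\, 2\sum_i d_i y_i^2 - 2\sum_{i,j} a_{ij} y_i y_j \Big] \\
&= \frac{1}{2}\Big[\, \sum_i \Big(\sum_j a_{ij}\Big) y_i^2 + \sum_j \Big(\sum_i a_{ij}\Big) y_j^2 - 2\sum_{i,j} a_{ij} y_i y_j \Big] \\
&= \frac{1}{2}\Big[\, \sum_{i,j} a_{ij} y_i^2 + \sum_{i,j} a_{ij} y_j^2 - 2\sum_{i,j} a_{ij} y_i y_j \Big] \\
&= \frac{1}{2} \sum_{i,j} a_{ij} (y_i - y_j)^2
\end{aligned}
\]

= size of CUT(y), up to a constant factor

\[
y^T (I - W)\, y = \text{size of NCUT}(y)
\]

NCUT: roughly minimize ratio of transitions between classes vs transitions within classes

Same as smoothness criterion used in SSL but now there are no seeds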
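A quick numerical check of the identity y^T(D-A)y = ½ Σ a_ij (y_i - y_j)² on a toy graph, as a sketch only. The helper name `cut_quantities` is hypothetical and A is assumed symmetric. For a +1/-1 indicator each cut edge contributes (±2)² in both directions, so the quadratic form comes out as 4 times the total weight of the cut edges.

```python
import numpy as np

def cut_quantities(A, y):
    """Return y^T (D - A) y, the pairwise form, and the weight of cut edges."""
    A = np.asarray(A, dtype=float)
    y = np.asarray(y, dtype=float)
    D = np.diag(A.sum(axis=1))
    quadratic = y @ (D - A) @ y
    pairwise = 0.5 * np.sum(A * (y[:, None] - y[None, :]) ** 2)
    cut_weight = A[np.ix_(y > 0, y < 0)].sum()   # edges from the +1 side to the -1 side
    return quadratic, pairwise, cut_weight

# Two triangles (0-1-2 and 3-4-5) joined by the edge 2-3.
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1.0

y = np.array([1, 1, 1, -1, -1, -1])      # split along the bridge
quadratic, pairwise, cut_weight = cut_quantities(A, y)
# quadratic and pairwise are both 4; exactly one edge (2-3) is cut.
print(quadratic, pairwise, cut_weight)
```

A worse split, such as alternating labels, cuts more edges and raises the quadratic form accordingly.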

Spectral Clustering: Graph = Matrix
W*v1 = v2 “propagates weights from neighbors”

W·v = λv: v is an eigenvector with eigenvalue λ
• smallest eigenvecs of D-A are largest eigenvecs of A
• smallest eigenvecs of I-W are largest eigenvecs of W

Suppose each y(i) = +1 or -1:
• Then y is a cluster indicator that cuts the nodes into two
• what is y^T(D-A)y ? The cost of the graph cut defined by y
• what is y^T(I-W)y ? Also a cost of a graph cut defined by y
• How to minimize it?

• Turns out: to minimize y^T X y / (y^T y) find smallest eigenvector of X
• But: this will not be +1/-1, so it’s a “relaxed” solution
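A small sketch of the relaxation with X = D - A. One point here is an assumption about how the relaxed solution is used: for D - A the very smallest eigenvector is the constant vector (eigenvalue 0), which says nothing about clusters, so the sketch rounds the next-smallest eigenvector to +1/-1 instead. The helper name `rayleigh_quotient` is hypothetical.

```python
import numpy as np

def rayleigh_quotient(X, y):
    """The objective y^T X y / (y^T y)."""
    return (y @ X @ y) / (y @ y)

# Two triangles (0-1-2 and 3-4-5) joined by the edge 2-3.
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1.0

X = np.diag(A.sum(axis=1)) - A           # X = D - A
vals, vecs = np.linalg.eigh(X)           # eigenvalues in ascending order

constant = vecs[:, 0]                    # eigenvalue 0: the trivial minimizer
relaxed = vecs[:, 1]                     # real-valued "soft" cluster indicator
y = np.sign(relaxed)                     # round back to a +1/-1 indicator

print("eigenvalues:", np.round(vals, 4))
print("relaxed solution:", np.round(relaxed, 4))
print("rounded indicator:", y)
print("objective, relaxed vs rounded:",
      rayleigh_quotient(X, relaxed), rayleigh_quotient(X, y))
```

The rounded indicator can never score lower than the relaxed vector it came from, which is the sense in which the eigenvector solves an easier version of the cut problem.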



[Figure (Shi & Meila, 2002), shown again: eigenvalues λ1, λ2, λ3, λ4, λ5,6,7,… and eigenvectors e1, e2, e3, with the “eigengap” marked]

Another way to think about spectral clustering…

• Most normal people think about spectral clustering like that - as relaxed optimization
• …me, not so much
• I like the connection to “averaging”… because… it leads to a nice fast variation of spectral clustering called power iteration clustering

Unsupervised Learning on Graphs with Power Iteration Clustering

Recap: Spectral Clustering: Graph = Matrix

• Transitively closed components = “blocks”, visible only when the nodes are sorted by cluster
• A block-stochastic matrix: each node has a latent “block”, with fixed probability qi for links between elements of block i and fixed probability q0 for links between elements of different blocks
• Vector = Node → Weight
• M*v1 = v2 “propagates weights from neighbors”, and W*v1 = v2 does the same with the normalized matrix W

Repeated averaging with neighbors as a clustering method

• Pick a vector v0 (maybe at random)
• Compute v1 = Wv0
  – i.e., replace v0[x] with weighted average of v0[y] for the neighbors y of x
• Plot v1[x] for each x
• Repeat for v2, v3, …
• Variants widely used for semi-supervised learning
  – HF/CoEM/wvRN - average + clamping of labels for nodes with known labels
• Without clamping, will converge to constant vt
• What are the dynamics of this process?
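A minimal sketch of the averaging process on a toy graph, to make the dynamics concrete. It assumes W is the row-normalized matrix D^-1 A, so that each step replaces a node's value with the plain average of its neighbors' values; the function name `repeated_averaging` is hypothetical.

```python
import numpy as np

def repeated_averaging(A, v0, n_steps):
    """Return the sequence v0, v1 = W v0, v2 = W v1, ... as rows of an array."""
    W = A / A.sum(axis=1, keepdims=True)   # row i averages over i's neighbors
    history = [np.asarray(v0, dtype=float)]
    for _ in range(n_steps):
        history.append(W @ history[-1])
    return np.array(history)

# Two triangles (0-1-2 and 3-4-5) joined by the edge 2-3.
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1.0

v0 = np.random.default_rng(0).random(6)    # random start
history = repeated_averaging(A, v0, n_steps=500)

for t in (0, 1, 5, 20, 500):
    print(t, np.round(history[t], 4))
```

Printing a few intermediate steps shows the pattern the slides go on to exploit: values inside a triangle even out quickly, while the gap between the two triangles closes much more slowly before everything flattens to a constant.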

Repeated averaging with neighbors on a sample problem…

• Create a graph, connecting all points in the 2-D initial space to all other points
• Weighted by distance
• Run power iteration (averaging) for 10 steps
• Plot node id x vs v10(x)
• nodes are ordered and colored by actual cluster number

[Figure: 2-D points in three groups labeled b, g, and r, and a plot of node id x vs. v10(x) with nodes grouped as blue, green, red]


[Figures: a series of plots with nodes grouped as blue, green, red; annotations read “smaller” and “larger” on the first slide and “very small” on the last]

PIC: Power Iteration Clustering
run power iteration (repeated averaging w/ neighbors) with early stopping

– V0: random start, or “degree matrix” D, or …
– Easy to implement and efficient
– Very easily parallelized
– Experimentally, often better than traditional spectral methods
– Surprising since the embedded space is 1-dimensional!
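A rough sketch of power iteration with early stopping, under several assumptions that are not taken from the slides: W is the row-normalized D^-1 A, the vector is rescaled to unit L1 norm after each step, iteration stops once the change between successive steps stops changing (the threshold `eps` is an arbitrary choice), and the final 1-dimensional embedding is split at its widest gaps as a simple stand-in for a k-means step. All names are hypothetical.

```python
import numpy as np

def split_at_largest_gaps(values, k):
    """Stand-in for 1-D k-means: cut the sorted values at the k-1 widest gaps."""
    order = np.argsort(values)
    gaps = np.diff(values[order])
    sorted_labels = np.zeros(len(values), dtype=int)
    if k > 1:
        for pos in np.argsort(gaps)[-(k - 1):]:
            sorted_labels[pos + 1:] += 1
    labels = np.empty_like(sorted_labels)
    labels[order] = sorted_labels
    return labels

def power_iteration_clustering(A, v0, k, eps=1e-5, max_iter=1000):
    W = A / A.sum(axis=1, keepdims=True)       # repeated averaging w/ neighbors
    v = v0 / np.abs(v0).sum()
    prev_delta = np.inf
    for _ in range(max_iter):
        v_next = W @ v
        v_next /= np.abs(v_next).sum()         # keep the scale fixed
        delta = np.abs(v_next - v).max()
        v = v_next
        if abs(prev_delta - delta) < eps:      # early stopping
            break
        prev_delta = delta
    return split_at_largest_gaps(v, k), v

# Two 4-node cliques (nodes 0-3 and 4-7) joined by the single edge 3-4.
A = np.zeros((8, 8))
A[:4, :4] = 1
A[4:, 4:] = 1
np.fill_diagonal(A, 0)
A[3, 4] = A[4, 3] = 1

# eps=0 disables the stopping test, so this runs exactly 10 averaging steps.
labels, v = power_iteration_clustering(A, v0=np.arange(8.0), k=2,
                                       eps=0.0, max_iter=10)
print(np.round(v, 4))
print(labels)
```

The point of stopping early is visible in `v`: after a handful of steps the values are nearly equal inside each clique but still clearly different between the cliques, and running to full convergence would erase that difference.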

Experiments

• “Network” problems: natural graph structure
  – PolBooks: 105 political books, 3 classes, linked by copurchaser
  – UMBCBlog: 404 political blogs, 2 classes, blogroll links
  – AGBlog: 1222 political blogs, 2 classes, blogroll links
• “Manifold” problems: cosine distance between classification instances
  – Iris: 150 flowers, 3 classes
  – PenDigits01,17: 200 handwritten digits, 2 classes (0-1 or 1-7)
  – 20ngA: 200 docs, misc.forsale vs soc.religion.christian
  – 20ngB: 400 docs, misc.forsale vs soc.religion.christian
  – 20ngC: 20ngB + 200 docs from talk.politics.guns
  – 20ngD: 20ngC + 200 docs from rec.sport.baseball

Experimental results: best-case assignment of class labels to clusters

Experiments: run time and scalability (time in millisec)

More experiments

• Varying the number of clusters for PIC and PIC4 (starts with random 4-d point rather than a random 1-d point).
• More “real” network datasets from various domains

LEARNING ON GRAPHS FOR NON-GRAPH DATASETS

Why I’m talking about graphs

• Lots of large data is graphs
  – Facebook, Twitter, citation data, and other social networks
  – The web, the blogosphere, the semantic web, Freebase, Wikipedia, Twitter, and other information networks
  – Text corpora (like RCV1), large datasets with discrete feature values, and other bipartite networks
    • nodes = documents or words
    • links connect document → word or word → document
  – Computer networks, biological networks (proteins, ecosystems, brains, …), …
  – Heterogeneous networks with multiple types of nodes
    • people, groups, documents

Simplest Case: Bi-partite Graphs

[Figure: bipartite graph with nodes including “proposal”, “CMU”, “CALO”, “graph”, and “William”]

Motivation: Experimental Datasets Used for PIC are…

• “Network” problems: natural graph structure
  – PolBooks: 105 political books, 3 classes, linked by copurchaser
  – UMBCBlog: 404 political blogs, 2 classes, blogroll links
  – AGBlog: 1222 political blogs, 2 classes, blogroll links
  – Also: Zachary’s karate club, citation networks, ...
• “Manifold” problems: cosine distance between all pairs of classification instances
  – Iris: 150 flowers, 3 classes
  – PenDigits01,17: 200 handwritten digits, 2 classes (0-1 or 1-7)
  – 20ngA: 200 docs, misc.forsale vs soc.religion.christian
  – 20ngB: 400 docs, misc.forsale vs soc.religion.christian
  – …

Gets expensive fast

Recap: Spectral Clustering: Graph = Matrix

• A*v1 = v2 “propagates weights from neighbors”
• W*v1 = v2 does the same with the normalized matrix W

W = D^-1*A, where D^-1[i,i] = 1/degree(i)

Lazy computation of distances and normalizers

• Recall PIC’s update is
  – vt = W*vt-1 = D^-1 A*vt-1
  – …where D is the [diagonal] degree matrix: D = A*1
• My favorite distance metric for text is length-normalized TFIDF:
  – Def’n: A(i,j) = <vi,vj> / (||vi||*||vj||)
  – Let N(i,i) = ||vi|| … and N(i,j) = 0 for i != j
  – Let F(i,k) = TFIDF weight of word wk in document vi
  – Then: A = N^-1 F F^T N^-1

<u,v> = inner product; ||u|| is L2-norm; 1 is a column vector of 1’s

Lazy computation of distances and normalizers

• Recall PIC’s update is
  – vt = W*vt-1 = D^-1 A*vt-1
  – …where D is the [diagonal] degree matrix: D = A*1
  – Let F(i,k) = TFIDF weight of word wk in document vi
  – Compute N(i,i) = ||vi|| … and N(i,j) = 0 for i != j
  – Don’t compute A = N^-1 F F^T N^-1
  – Let D(i,i) = N^-1 F F^T N^-1*1 where 1 is an all-1’s vector
    • Computed as D = N^-1(F(F^T(N^-1*1))) for efficiency
  – New update:
    • vt = D^-1 A*vt-1 = D^-1 N^-1 F F^T N^-1*vt-1

Equivalent to using TFIDF/cosine on all pairs of examples but requires only sparse matrices
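A sketch of the lazy update, checked against the explicit all-pairs version. It assumes F is stored as documents × words, so the document-by-document product is F F^T in this layout. Dense NumPy arrays and random numbers stand in for a sparse TFIDF matrix, and both function names are hypothetical.

```python
import numpy as np

def lazy_pic_update(F, v):
    """One step v_t = D^-1 A v_{t-1} without ever forming A (documents x documents)."""
    norms = np.linalg.norm(F, axis=1)                 # N(i,i) = ||v_i||
    ones = np.ones(F.shape[0])
    # Degrees: D = N^-1 (F (F^T (N^-1 1))), evaluated right to left.
    degree = (F @ (F.T @ (ones / norms))) / norms
    # Update: D^-1 N^-1 F F^T N^-1 v, again as matrix-vector products only.
    return (F @ (F.T @ (v / norms))) / norms / degree

def explicit_pic_update(F, v):
    """Same step, but building the full cosine-similarity matrix A first."""
    Fn = F / np.linalg.norm(F, axis=1, keepdims=True)
    A = Fn @ Fn.T                                     # A(i,j) = cosine(v_i, v_j)
    return (A @ v) / A.sum(axis=1)

rng = np.random.default_rng(0)
F = rng.random((6, 9))        # 6 documents, 9 words (made-up weights)
v = rng.random(6)

lazy = lazy_pic_update(F, v)
explicit = explicit_pic_update(F, v)
print(np.round(lazy, 6))
print(np.allclose(lazy, explicit))
```

The lazy version only ever multiplies F or F^T by a vector, so its cost grows with the number of nonzero entries in F rather than with the square of the number of documents.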

Experimental results

• RCV1 text classification dataset
  – 800k+ newswire stories
  – Category labels from industry vocabulary
  – Took single-label documents and categories with at least 500 instances
  – Result: 193,844 documents, 103 categories
• Generated 100 random category pairs
  – Each is all documents from two categories
  – Range in size and difficulty
  – Pick category 1, with m1 examples
  – Pick category 2 such that 0.5m1 < m2 < 2m1

Results

• NCUTevd: Ncut with exact eigenvectors
• NCUTiram: Implicitly restarted Arnoldi method
• No stat. signif. diffs between NCUTevd and PIC

Results

• Linear run-time implies constant number of iterations
• Number of iterations to “acceleration-convergence” is hard to analyze:
  – Faster than a single complete run of power iteration to convergence
  – On our datasets
    • 10-20 iterations is typical
    • 30-35 is exceptional

PIC AND SSL

PIC: Power Iteration Clustering
run power iteration (repeated averaging w/ neighbors) with early stopping

Harmonic Functions/CoEM/wvRN
average with neighbors, then replace vt+1(i) with seed values +1/-1 for labeled data, for 5-10 iterations

Classify data using final values from v
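A minimal sketch of averaging with clamping on a toy graph. The assumptions are that W is the row-normalized D^-1 A, that unlabeled nodes start at 0, and that the sign of the final value is used as the predicted class; the function name `clamped_averaging` and the choice of seed nodes are hypothetical.

```python
import numpy as np

def clamped_averaging(A, seeds, n_iter=10):
    """Average with neighbors, then reset the seed nodes to their labels."""
    W = A / A.sum(axis=1, keepdims=True)
    v = np.zeros(A.shape[0])
    for node, value in seeds.items():
        v[node] = value
    for _ in range(n_iter):
        v = W @ v
        for node, value in seeds.items():   # clamp the labeled nodes
            v[node] = value
    return v

# Two 4-node cliques (nodes 0-3 and 4-7) joined by the single edge 3-4.
A = np.zeros((8, 8))
A[:4, :4] = 1
A[4:, 4:] = 1
np.fill_diagonal(A, 0)
A[3, 4] = A[4, 3] = 1

# One labeled node per clique: node 0 is +1, node 7 is -1.
v = clamped_averaging(A, seeds={0: +1.0, 7: -1.0}, n_iter=10)
predicted = np.sign(v)
print(np.round(v, 3))
print(predicted)
```

Unlike the unclamped process, this one does not flatten out to a constant: the seeds keep injecting their labels, and each unlabeled node settles near the label of the seed it is best connected to.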

Implicit Manifolds on the NELL datasets

[Figure: graph linking phrases (Paris, San Francisco, Austin, Pittsburgh, Seattle, anxiety, denial, selfishness, arrogance) to contexts (“live in arg1”, “mayor of arg1”, “arg1 is home of”, “traits such as arg1”), with nodes “near” seeds on one side and nodes “far from” seeds on the other]

Using the Manifold Trick for SSL

A smoothing trick:
