Spectral Clustering · wcohen/10-605/unsup-on-graphs.pdf
1
Spectral Clustering
2
spectral k-means
after transformation
3
4
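Spectral Clustering: Graph = Matrix
[Figure: an undirected graph on nodes A–J, shown next to its 10×10 adjacency matrix. Rows and columns are indexed by node, and a 1 in row i, column j marks an edge between nodes i and j.]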
5
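Spectral Clustering: Graph = Matrix
Transitively Closed Components = “Blocks”
[Figure: the graph on nodes A–J.]
Adjacency matrix with nodes sorted by cluster (_ marks the diagonal, 0 means no edge):
    A B C D E F G H I J
A   _ 1 1 0 0 1 0 0 0 0
B   1 _ 1 0 0 0 0 0 0 0
C   1 1 _ 0 0 0 0 0 0 0
D   0 0 0 _ 1 1 0 0 0 0
E   0 0 0 1 _ 1 0 0 0 0
F   1 0 0 1 1 _ 0 0 0 0
G   0 0 0 0 0 0 _ 0 1 1
H   0 0 0 0 0 0 0 _ 1 1
I   0 0 0 0 0 0 1 1 _ 1
J   0 0 0 0 0 0 1 1 1 _
Of course we can’t see the “blocks” unless the nodes are sorted by cluster…
Sometimes called a block-stochastic matrix:
• each node has a latent “block”
• fixed probability q_i for links between elements of block i
• fixed probability q_0 for links between elements of different blocks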
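To make the block-stochastic idea concrete, here is a minimal sketch in Python/NumPy that samples such a matrix and then shuffles the node order, which hides the blocks. The function name `sample_block_stochastic`, the block sizes, and the probabilities are all assumptions chosen for illustration, not values from the slides.

```python
import numpy as np

def sample_block_stochastic(block_sizes, q_in, q_out, rng):
    """Sample a symmetric 0/1 adjacency matrix with latent blocks (hypothetical helper)."""
    labels = np.repeat(np.arange(len(block_sizes)), block_sizes)
    same = labels[:, None] == labels[None, :]
    # Link probability: q_in[b] inside block b, q_out between different blocks.
    probs = np.where(same, np.asarray(q_in)[labels][:, None], q_out)
    # Sample only the upper triangle (no self-links), then mirror it.
    upper = np.triu(rng.random(probs.shape) < probs, k=1)
    adj = (upper | upper.T).astype(int)
    return adj, labels

rng = np.random.default_rng(0)
adj, labels = sample_block_stochastic([3, 3, 4], q_in=[0.9, 0.9, 0.9], q_out=0.05, rng=rng)

# With nodes sorted by block, the 1s concentrate near the diagonal.
print(adj)

# After a random relabeling of the nodes the same graph no longer shows visible blocks.
perm = rng.permutation(len(labels))
shuffled = adj[np.ix_(perm, perm)]
print(shuffled)
```

Printing both matrices side by side shows why the blocks are only visible once the nodes are sorted by cluster.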
6
Spectral Clustering: Graph = Matrix
Vector = Node → Weight
[Figure: the graph on nodes A–J, its adjacency matrix M, and a vector v that assigns a weight to each node. Only three entries of v are filled in: v(A) = 3, v(B) = 2, v(C) = 3.]
7
Spectral Clustering: Graph = Matrix
M*v1 = v2 “propagates weights from neighbors”
[Figure: the graph on nodes A–J and its adjacency matrix M.]
v1: A = 3, B = 2, C = 3 (entries for D–J not shown)
v2 = M * v1:
  A: 2*1 + 3*1 + 0*1
  B: 3*1 + 3*1
  C: 3*1 + 2*1
  (entries for D–J not shown)
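The propagation step can be checked numerically. The sketch below is in Python/NumPy; the edge list is an assumption read off the slide's matrices, and the entries of v1 for D–J, which the slide leaves blank, are taken to be 0.

```python
import numpy as np

nodes = "ABCDEFGHIJ"
idx = {name: i for i, name in enumerate(nodes)}
# Edge list inferred from the slide's matrices (an assumption of this sketch).
edges = ["AB", "AC", "BC", "AF", "DE", "DF", "EF", "GI", "GJ", "HI", "HJ", "IJ"]

M = np.zeros((10, 10))
for a, b in edges:
    M[idx[a], idx[b]] = M[idx[b], idx[a]] = 1.0

# v1 from the slide: A = 3, B = 2, C = 3; the blank entries are assumed to be 0.
v1 = np.zeros(10)
v1[[idx["A"], idx["B"], idx["C"]]] = [3.0, 2.0, 3.0]

# Each node receives the sum of its neighbors' weights.
v2 = M @ v1
print(dict(zip(nodes, v2)))
```

Under these assumptions node A receives 2 + 3 + 0 = 5 from its neighbors B, C, and F, matching the sum written on the slide.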
8
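Spectral Clustering: Graph = Matrix
W*v1 = v2 “propagates weights from neighbors”
W: normalized so columns sum to 1 (_ marks the diagonal, 0 means no edge)
    A   B   C   D   E   F   G   H   I   J
A   _   .5  .5  0   0   .3  0   0   0   0
B   .3  _   .5  0   0   0   0   0   0   0
C   .3  .5  _   0   0   0   0   0   0   0
D   0   0   0   _   .5  .3  0   0   0   0
E   0   0   0   .5  _   .3  0   0   0   0
F   .3  0   0   .5  .5  _   0   0   0   0
G   0   0   0   0   0   0   _   0   .3  .3
H   0   0   0   0   0   0   0   _   .3  .3
I   0   0   0   0   0   0   .5  .5  _   .3
J   0   0   0   0   0   0   .5  .5  .3  _
[Figure: the graph on nodes A–J.]
v1: A = 3, B = 2, C = 3 (entries for D–J not shown)
v2 = W * v1:
  A: 2*.5 + 3*.5 + 0*.3
  B: 3*.3 + 3*.5
  C: 3*.33 + 2*.5
  (entries for D–J not shown)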
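A short Python/NumPy sketch of the normalized version follows. As before, the edge list is an assumption read off the slide's matrix, the blank entries of v1 are taken to be 0, and the entries written as .3 on the slide are treated as 1/3.

```python
import numpy as np

nodes = "ABCDEFGHIJ"
idx = {name: i for i, name in enumerate(nodes)}
# Edge list inferred from the slide's matrix (an assumption of this sketch).
edges = ["AB", "AC", "BC", "AF", "DE", "DF", "EF", "GI", "GJ", "HI", "HJ", "IJ"]

A = np.zeros((10, 10))
for a, b in edges:
    A[idx[a], idx[b]] = A[idx[b], idx[a]] = 1.0

degree = A.sum(axis=0)
# Broadcasting divides column j by degree[j], so every column of W sums to 1.
W = A / degree

v1 = np.zeros(10)
v1[[idx["A"], idx["B"], idx["C"]]] = [3.0, 2.0, 3.0]

# Each node now receives a degree-weighted share of its neighbors' weights.
v2 = W @ v1
print(dict(zip(nodes, np.round(v2, 3))))
```

For node A this gives 2*(1/2) + 3*(1/2) + 0*(1/3) = 2.5, the same sum as on the slide.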
9
Spectral Clustering: Graph = Matrix
W*v1 = v2 “propagates weights from neighbors”
$W \cdot v = \lambda v$: $v$ is an eigenvector with eigenvalue $\lambda$
Q: How do I pick v to be an eigenvector for a block-stochastic matrix?
10
Spectral Clustering: Graph = Matrix
W*v1 = v2 “propagates weights from neighbors”
$W \cdot v = \lambda v$: $v$ is an eigenvector with eigenvalue $\lambda$
How do I pick v to be an eigenvector for a block-stochastic matrix?
11
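Spectral Clustering: Graph = Matrix
W*v1 = v2 “propagates weights from neighbors”
$W \cdot v = \lambda v$: $v$ is an eigenvector with eigenvalue $\lambda$
[Figure, from Shi & Meila, 2002: plot of the eigenvalues λ1, λ2, λ3, λ4, λ5,6,7,…, with the leading eigenvectors e1, e2, e3 and the “eigengap” marked.]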
12
Spectral Clustering: Graph = Matrix
W*v1 = v2 “propagates weights from neighbors”
$W \cdot v = \lambda v$: $v$ is an eigenvector with eigenvalue $\lambda$
[Figure, from Shi & Meila, 2002: nodes plotted by their eigenvector coordinates (axes labeled e1, e2, e3, with ticks from -0.4 to 0.4); the points are labeled x, y, and z.]
13
Spectral Clustering: Graph = Matrix
W*v1 = v2 “propagates weights from neighbors”
$W \cdot v = \lambda v$: $v$ is an eigenvector with eigenvalue $\lambda$
If W is connected but roughly block diagonal with k blocks then
• the top eigenvector is a constant vector
• the next k eigenvectors are roughly piecewise constant with “pieces” corresponding to blocks
14
Spectral Clustering: Graph = Matrix
W*v1 = v2 “propagates weights from neighbors”
$W \cdot v = \lambda v$: $v$ is an eigenvector with eigenvalue $\lambda$
If W is connected but roughly block diagonal with k blocks then
• the “top” eigenvector is a constant vector
• the next k eigenvectors are roughly piecewise constant with “pieces” corresponding to blocks
Spectral clustering:
• Find the top k+1 eigenvectors v1,…,vk+1
• Discard the “top” one
• Replace every node a with k-dimensional vector xa = <v2(a),…,vk+1(a)>
• Cluster with k-means
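The recipe above can be sketched in a few lines of Python/NumPy. This is an illustration under stated assumptions, not the lecture's implementation: W is taken to be the row-normalized matrix D^-1 A (so that its top eigenvector is exactly constant), the eigenvectors are obtained through the symmetric matrix D^-1/2 A D^-1/2, and `spectral_embedding` and `kmeans` are hypothetical helper names. The toy graph (two 5-cliques joined by one edge) is also made up for the example.

```python
import numpy as np

def spectral_embedding(A, k):
    """Return eigenvalues, the top eigenvector, and rows x_a = <v2(a), ..., v_{k+1}(a)>."""
    deg = A.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    # S = D^-1/2 A D^-1/2 is symmetric and has the same eigenvalues as W = D^-1 A.
    S = A * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    evals, U = np.linalg.eigh(S)              # eigenvalues in ascending order
    order = np.argsort(evals)[::-1]           # largest first
    V = U[:, order] * d_inv_sqrt[:, None]     # eigenvectors of W: v = D^-1/2 u
    return evals[order], V[:, 0], V[:, 1:k + 1]

def kmeans(X, k, n_iter=50, seed=0):
    """Plain Lloyd's algorithm; the result depends on the random initialization."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for c in range(k):
            if np.any(labels == c):
                centers[c] = X[labels == c].mean(axis=0)
    return labels

# Toy graph (an assumption): two 5-cliques joined by the single edge 4-5.
A = np.zeros((10, 10))
A[:5, :5] = 1.0
A[5:, 5:] = 1.0
np.fill_diagonal(A, 0.0)
A[4, 5] = A[5, 4] = 1.0

evals, top, X = spectral_embedding(A, k=2)
labels = kmeans(X, k=2)
print("top eigenvector:", np.round(top, 3))   # constant vector
print("v2:", np.round(X[:, 0], 3))            # roughly piecewise constant per clique
print("k-means labels:", labels)
```

On this graph the second eigenvector takes one sign on the first clique and the opposite sign on the second. The k-means step uses all k coordinates, as the slide prescribes, so its output can vary with the initialization.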
15
Back to the 2-D examples
• Create a graph.
• Each 2D point is a vertex
• Add edges connecting all points in the 2-D initial space to all other points
• Weight of edge is distance
[Figure: 2-D scatter of the points, labeled b (blue), g (green), and r (red).]
16
Spectral Clustering: Pros and Cons
• Elegant, and well-founded mathematically
• Tends to avoid local minima
– Optimal solution to relaxed version of mincut problem (Normalized cut, aka NCut)
• Works quite well when relations are approximately transitive (like similarity, social connections)
• Expensive for very large datasets
– Computing eigenvectors is the bottleneck
– Approximate eigenvector computation not always useful
• Noisy datasets sometimes cause problems
– Picking number of eigenvectors and k is tricky
– “Informative” eigenvectors need not be in top few
– Performance can drop suddenly from good to terrible
17
Experimental results: best-case assignment of class labels to clusters
18
Spectral clustering as Optimization
19
Spectral Clustering: Graph = Matrix
W*v1 = v2 “propagates weights from neighbors”
$W \cdot v = \lambda v$: $v$ is an eigenvector with eigenvalue $\lambda$
• A = adjacency matrix
• D = diagonal normalizer for A (# outlinks)
• W = A D^-1
• smallest eigenvecs of D-A are largest eigenvecs of A
• smallest eigenvecs of I-W are largest eigenvecs of W
Suppose each y(i) = +1 or -1:
• Then y is a cluster indicator that splits the nodes into two
• what is y^T (D-A) y ?
20
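\begin{align*}
y^T (D - A) y &= y^T D y - y^T A y \\
&= \sum_i d_i y_i^2 - \sum_{i,j} a_{ij} y_i y_j \\
&= \frac{1}{2} \Big[ 2 \sum_i d_i y_i^2 - 2 \sum_{i,j} a_{ij} y_i y_j \Big] \\
&= \frac{1}{2} \Big[ \sum_i \Big( \sum_j a_{ij} \Big) y_i^2 + \sum_j \Big( \sum_i a_{ij} \Big) y_j^2 - 2 \sum_{i,j} a_{ij} y_i y_j \Big] \\
&= \frac{1}{2} \Big[ \sum_{i,j} a_{ij} y_i^2 + \sum_{i,j} a_{ij} y_j^2 - 2 \sum_{i,j} a_{ij} y_i y_j \Big] \\
&= \frac{1}{2} \sum_{i,j} a_{ij} (y_i - y_j)^2
\end{align*}
= size of CUT(y), up to a constant factor (with y_i = +1 or -1, each edge that crosses the cut contributes 4 and every other edge contributes 0)
$y^T (I - W) y$ = size of NCUT(y)
NCUT: roughly minimize ratio of transitions between classes vs transitions within classes
Same as smoothness criterion used in SSL but now there are no seeds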
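The identity between y^T (D-A) y and the cut can be verified numerically. The sketch below uses Python/NumPy on a made-up 6-node graph (two triangles joined by one edge); the graph and the indicator vector y are assumptions chosen for illustration.

```python
import numpy as np

# Toy graph (an assumption): triangles {0,1,2} and {3,4,5} joined by the edge 0-5.
edges = [(0, 1), (0, 2), (1, 2), (0, 5), (3, 4), (3, 5), (4, 5)]
n = 6
A = np.zeros((n, n))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0
D = np.diag(A.sum(axis=1))

# Cluster indicator: +1 for the first triangle, -1 for the second.
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])

quad = y @ (D - A) @ y
pairwise = 0.5 * sum(A[i, j] * (y[i] - y[j]) ** 2 for i in range(n) for j in range(n))
cut_edges = sum(1 for i, j in edges if y[i] != y[j])

print(quad, pairwise, cut_edges)
```

Both expressions evaluate to 4 here, which is 4 times the single edge crossing the cut.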
21
Spectral Clustering: Graph = Matrix
W*v1 = v2 “propagates weights from neighbors”
$W \cdot v = \lambda v$: $v$ is an eigenvector with eigenvalue $\lambda$
• smallest eigenvecs of D-A are largest eigenvecs of A
• smallest eigenvecs of I-W are largest eigenvecs of W
Suppose each y(i) = +1 or -1:
• Then y is a cluster indicator that cuts the nodes into two
• what is y^T (D-A) y ? The cost of the graph cut defined by y
• what is y^T (I-W) y ? Also a cost of a graph cut defined by y
• How to minimize it?
• Turns out: to minimize y^T X y / (y^T y) find smallest eigenvector of X
• But: this will not be +1/-1, so it’s a “relaxed” solution
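A small Python/NumPy sketch of the relaxation, taking X = D - A on a made-up graph (two triangles joined by one edge). One detail the sketch makes explicit, as an observation about this example rather than a claim from the slides: for a connected graph the very smallest eigenvector of D - A is the constant vector with eigenvalue 0, so the split is read off the next one.

```python
import numpy as np

# Toy graph (an assumption): triangles {0,1,2} and {3,4,5} joined by the edge 0-5.
edges = [(0, 1), (0, 2), (1, 2), (0, 5), (3, 4), (3, 5), (4, 5)]
n = 6
A = np.zeros((n, n))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0
X = np.diag(A.sum(axis=1)) - A

def rayleigh(X, y):
    """The quotient y^T X y / (y^T y) from the slide."""
    return (y @ X @ y) / (y @ y)

evals, evecs = np.linalg.eigh(X)   # ascending eigenvalues, orthonormal eigenvectors

# No vector achieves a smaller quotient than the smallest eigenvalue.
rng = np.random.default_rng(0)
random_quotients = [rayleigh(X, rng.standard_normal(n)) for _ in range(100)]
print(min(random_quotients), ">=", evals[0])

# The second-smallest eigenvector is real-valued, not +1/-1; taking its sign
# turns the relaxed solution back into a cluster indicator.
relaxed = evecs[:, 1]
y = np.sign(relaxed)
print(np.round(relaxed, 3), y)
```

On this graph the signs separate the two triangles, while the relaxed values themselves differ in magnitude from node to node.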
22
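Spectral Clustering: Graph = Matrix
W*v1 = v2 “propagates weights from neighbors”
$W \cdot v = \lambda v$: $v$ is an eigenvector with eigenvalue $\lambda$
[Figure, from Shi & Meila, 2002: plot of the eigenvalues λ1, λ2, λ3, λ4, λ5,6,7,…, with the leading eigenvectors e1, e2, e3 and the “eigengap” marked.]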
23
Another way to think about spectral clustering….
• Most normal people think about spectral clustering like that - as relaxed optimization
• …me, not so much
• I like the connection to “averaging”… because… it leads to a nice fast variation of spectral clustering called power iteration clustering
24
Unsupervised Learning on Graphs with Power Iteration Clustering
25
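Spectral Clustering: Graph = Matrix
[Figure: an undirected graph on nodes A–J, shown next to its 10×10 adjacency matrix. Rows and columns are indexed by node, and a 1 in row i, column j marks an edge between nodes i and j.]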
26
Spectral Clustering: Graph = Matrix
Transitively Closed Components = “Blocks”
[Figure: the graph on nodes A–J and its adjacency matrix with nodes sorted by cluster, so that the 1 entries form blocks along the diagonal.]
Of course we can’t see the “blocks” unless the nodes are sorted by cluster…
Sometimes called a block-stochastic matrix:
• each node has a latent “block”
• fixed probability q_i for links between elements of block i
• fixed probability q_0 for links between elements of different blocks
27
Spectral Clustering: Graph = Matrix
Vector = Node → Weight
[Figure: the graph on nodes A–J, its adjacency matrix M, and a vector v that assigns a weight to each node. Only three entries of v are filled in: v(A) = 3, v(B) = 2, v(C) = 3.]
28
Spectral Clustering: Graph = Matrix
M*v1 = v2 “propagates weights from neighbors”
[Figure: the graph on nodes A–J and its adjacency matrix M.]
v1: A = 3, B = 2, C = 3 (entries for D–J not shown)
v2 = M * v1:
  A: 2*1 + 3*1 + 0*1
  B: 3*1 + 3*1
  C: 3*1 + 2*1
  (entries for D–J not shown)
29
Spectral Clustering: Graph = Matrix
W*v1 = v2 “propagates weights from neighbors”
[Figure: the graph on nodes A–J and the matrix W, which is the adjacency matrix normalized so columns sum to 1 (entries .5 and .3).]
v1: A = 3, B = 2, C = 3 (entries for D–J not shown)
v2 = W * v1:
  A: 2*.5 + 3*.5 + 0*.3
  B: 3*.3 + 3*.5
  C: 3*.33 + 2*.5
  (entries for D–J not shown)
30
Repeated averaging with neighbors as a clustering method
• Pick a vector v0 (maybe at random)
• Compute v1 = W v0
– i.e., replace v0[x] with weighted average of v0[y] for the neighbors y of x
• Plot v1[x] for each x
• Repeat for v2, v3, …
• Variants widely used for semi-supervised learning
– HF/CoEM/wvRN - average + clamping of labels for nodes with known labels
• Without clamping, will converge to constant vt
• What are the dynamics of this process?
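A minimal Python/NumPy sketch of repeated averaging without clamping. It assumes W is the row-normalized matrix D^-1 A (so each step is a true weighted average over neighbors) and uses a made-up 6-node graph; both are assumptions of the example.

```python
import numpy as np

# Toy graph (an assumption): triangles {0,1,2} and {3,4,5} joined by the edge 0-5.
edges = [(0, 1), (0, 2), (1, 2), (0, 5), (3, 4), (3, 5), (4, 5)]
n = 6
A = np.zeros((n, n))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0
deg = A.sum(axis=1)
W = A / deg[:, None]              # row-normalized: each row sums to 1

rng = np.random.default_rng(0)
v0 = rng.random(n)

history = [v0]
v = v0
for t in range(200):
    v = W @ v                     # replace v[x] with the average over x's neighbors
    history.append(v)

for t in (0, 1, 5, 10, 200):
    spread = history[t].max() - history[t].min()
    print(f"t={t:3d}  spread={spread:.2e}  v={np.round(history[t], 4)}")

# On a connected, non-bipartite graph the limit is the degree-weighted mean of v0.
limit = (deg @ v0) / deg.sum()
```

The printed spread shrinks toward zero: the values within each triangle agree with each other long before the two triangles agree, which is the intermediate structure that the clustering method exploits.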
31
Repeated averaging with neighbors on a sample problem…
• Create a graph, connecting all points in the 2-D initial space to all other points
• Weighted by distance
• Run power iteration (averaging) for 10 steps
• Plot node id x vs v10(x)
• nodes are ordered and colored by actual cluster number
[Figure: the 2-D points, labeled b (blue), g (green), and r (red), next to a plot of v10(x) against node id x, with the x-axis divided into blue, green, and red ranges.]
32
Repeated averaging with neighbors on a sample problem…
[Figure: three plots of vt(x) against node id x, each with the x-axis divided into blue, green, and red ranges; the vertical axis runs from smaller to larger values.]
33
Repeated averaging with neighbors on a sample problem…
[Figure: five further plots of vt(x) against node id x, each with the x-axis divided into blue, green, and red ranges.]
34
Repeated averaging with neighbors on a sample problem…
[Figure: plot of vt(x) against node id x, with the vertical range annotated “very small”.]
35
PIC: Power Iteration Clustering
Run power iteration (repeated averaging w/ neighbors) with early stopping
– V0: random start, or “degree matrix” D, or …
– Easy to implement and efficient
– Very easily parallelized
– Experimentally, often better than traditional spectral methods
– Surprising since the embedded space is 1-dimensional!
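A compact Python/NumPy sketch of the idea: power iteration with early stopping, followed by k-means on the resulting 1-dimensional embedding. Several details are assumptions of this sketch rather than statements from the slides: W is row-normalized (D^-1 A), v0 is a random start, the vector is rescaled to unit L1 norm each step, iteration stops when the change in the step size falls below a hypothetical threshold `tol`, and the function name `power_iteration_clustering` is made up.

```python
import numpy as np

def power_iteration_clustering(A, k, max_iter=1000, tol=1e-6, seed=0):
    """Sketch of PIC: early-stopped power iteration, then 1-D k-means (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    W = A / A.sum(axis=1, keepdims=True)          # row-normalized affinity matrix
    v = rng.random(A.shape[0])
    v /= v.sum()
    prev_delta = np.inf
    for t in range(1, max_iter + 1):
        new_v = W @ v
        new_v /= np.abs(new_v).sum()              # keep the vector at unit L1 norm
        delta = np.abs(new_v - v).max()           # size of this step
        v = new_v
        if abs(prev_delta - delta) < tol:         # stop when the step size stops changing
            break
        prev_delta = delta
    # 1-D k-means on the entries of v, with centers initialized evenly over the range.
    centers = np.linspace(v.min(), v.max(), k)
    for _ in range(100):
        labels = np.abs(v[:, None] - centers[None, :]).argmin(axis=1)
        for c in range(k):
            if np.any(labels == c):
                centers[c] = v[labels == c].mean()
    return labels, v, t

# Toy graph (an assumption): two 5-cliques joined by the single edge 4-5.
A = np.zeros((10, 10))
A[:5, :5] = 1.0
A[5:, 5:] = 1.0
np.fill_diagonal(A, 0.0)
A[4, 5] = A[5, 4] = 1.0

labels, v, n_steps = power_iteration_clustering(A, k=2)
print("stopped after", n_steps, "iterations")
print("embedding:", v)
print("labels:", labels)
```

At the stopping point the entries of v differ only slightly, but the differences that remain line up with the two cliques, so clustering the single vector is enough on this example.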
36
Experiments
• “Network” problems: natural graph structure
– PolBooks: 105 political books, 3 classes, linked by copurchaser
– UMBCBlog: 404 political blogs, 2 classes, blogroll links
– AGBlog: 1222 political blogs, 2 classes, blogroll links
• “Manifold” problems: cosine distance between classification instances
– Iris: 150 flowers, 3 classes
– PenDigits01,17: 200 handwritten digits, 2 classes (0-1 or 1-7)
– 20ngA: 200 docs, misc.forsale vs soc.religion.christian
– 20ngB: 400 docs, misc.forsale vs soc.religion.christian
– 20ngC: 20ngB + 200 docs from talk.politics.guns
– 20ngD: 20ngC + 200 docs from rec.sport.baseball
37
Experimental results: best-case assignment of class labels to clusters
38
39
Experiments: run time and scalability
Time in millisec
40
More experiments
Varying the number of clusters for PIC and PIC4 (starts with random 4-d point rather than a random 1-d point).
41
More experiments
More “real” network datasets from various domains
42
More experiments
43
44
LEARNING ON GRAPHS FOR NON-GRAPH DATASETS
45
Why I’m talking about graphs
• Lots of large data is graphs
– Facebook, Twitter, citation data, and other social networks
– The web, the blogosphere, the semantic web, Freebase, Wikipedia, Twitter, and other information networks
– Text corpora (like RCV1), large datasets with discrete feature values, and other bipartite networks
• nodes = documents or words
• links connect document → word or word → document
– Computer networks, biological networks (proteins, ecosystems, brains, …), …
– Heterogeneous networks with multiple types of nodes
• people, groups, documents
46
Simplest Case: Bi-partite Graphs
[Figure: a bipartite graph with nodes labeled proposal, CMU, CALO, graph, and William.]
47
Motivation: Experimental Datasets Used for PIC are…
• “Network” problems: natural graph structure
– PolBooks: 105 political books, 3 classes, linked by copurchaser
– UMBCBlog: 404 political blogs, 2 classes, blogroll links
– AGBlog: 1222 political blogs, 2 classes, blogroll links
– Also: Zachary’s karate club, citation networks, ...
• “Manifold” problems: cosine distance between all pairs of classification instances
– Iris: 150 flowers, 3 classes
– PenDigits01,17: 200 handwritten digits, 2 classes (0-1 or 1-7)
– 20ngA: 200 docs, misc.forsale vs soc.religion.christian
– 20ngB: 400 docs, misc.forsale vs soc.religion.christian
– …
Gets expensive fast
48
Spectral Clustering: Graph = Matrix
A*v1 = v2 “propagates weights from neighbors”
[Figure: the graph on nodes A–J and its adjacency matrix A.]
v1: A = 3, B = 2, C = 3 (entries for D–J not shown)
v2 = A * v1:
  A: 2*1 + 3*1 + 0*1
  B: 3*1 + 3*1
  C: 3*1 + 2*1
  (entries for D–J not shown)
49
Spectral Clustering: Graph = Matrix
W*v1 = v2 “propagates weights from neighbors”
[Figure: the graph on nodes A–J and the matrix W, which is the adjacency matrix normalized so columns sum to 1 (entries .5 and .3).]
v1: A = 3, B = 2, C = 3 (entries for D–J not shown)
v2 = W * v1:
  A: 2*.5 + 3*.5 + 0*.3
  B: 3*.3 + 3*.5
  C: 3*.33 + 2*.5
  (entries for D–J not shown)
W = D^-1 * A, where D^-1[i,i] = 1/degree(i)
50
Lazy computation of distances and normalizers
• Recall PIC’s update is
– vt = W * vt-1 = D^-1 A * vt-1
– …where D is the [diagonal] degree matrix: D = A*1
• My favorite distance metric for text is length-normalized TFIDF:
– Def’n: A(i,j) = <vi,vj> / (||vi|| * ||vj||)
– Let N(i,i) = ||vi|| … and N(i,j) = 0 for i != j
– Let F(i,k) = TFIDF weight of word wk in document vi
– Then: A = N^-1 F F^T N^-1 (F has one row per document, so F F^T holds the inner products <vi,vj>)
<u,v> = inner product; ||u|| is L2-norm; 1 is a column vector of 1’s
51
Lazy computation of distances and normalizers
• Recall PIC’s update is
– vt = W * vt-1 = D^-1 A * vt-1
– …where D is the [diagonal] degree matrix: D = A*1
– Let F(i,k) = TFIDF weight of word wk in document vi
– Compute N(i,i) = ||vi|| … and N(i,j) = 0 for i != j
– Don’t compute A = N^-1 F F^T N^-1
– Let D(i,i) = N^-1 F F^T N^-1 * 1 where 1 is an all-1’s vector
• Computed as D = N^-1 (F (F^T (N^-1 * 1))) for efficiency
– New update:
• vt = D^-1 A * vt-1 = D^-1 N^-1 F F^T N^-1 * vt-1
Equivalent to using TFIDF/cosine on all pairs of examples but requires only sparse matrices
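The lazy update can be sketched in Python/NumPy as a chain of matrix-vector products that never forms the all-pairs matrix A. The sketch assumes F is laid out as documents × words, so the similarity product is F F^T; the matrix F here is random stand-in data rather than real TFIDF weights, and it is dense only to keep the example self-contained (with real text F would be stored as a sparse matrix). The name `lazy_update` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_docs, n_words = 6, 8
F = rng.random((n_docs, n_words))          # stand-in for TFIDF weights (documents x words)

n_inv = 1.0 / np.linalg.norm(F, axis=1)    # diagonal of N^-1: one over each document's L2 norm
ones = np.ones(n_docs)

# Degrees without forming A: D = N^-1 (F (F^T (N^-1 1))), evaluated right to left.
d = n_inv * (F @ (F.T @ (n_inv * ones)))

def lazy_update(v):
    """One PIC step v_t = D^-1 N^-1 F F^T N^-1 v_{t-1}, using only matrix-vector products."""
    return (n_inv * (F @ (F.T @ (n_inv * v)))) / d

# Reference computation that does build the full cosine-similarity matrix A.
F_normalized = F * n_inv[:, None]
A = F_normalized @ F_normalized.T
v = rng.random(n_docs)
explicit = (A @ v) / A.sum(axis=1)

print(np.allclose(lazy_update(v), explicit))
```

Each step costs a few passes over the nonzeros of F, whereas the explicit route needs the n_docs × n_docs matrix A.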
52
Experimental results
• RCV1 text classification dataset
– 800k+ newswire stories
– Category labels from industry vocabulary
– Took single-label documents and categories with at least 500 instances
– Result: 193,844 documents, 103 categories
• Generated 100 random category pairs
– Each is all documents from two categories
– Range in size and difficulty
– Pick category 1, with m1 examples
– Pick category 2 such that 0.5m1 < m2 < 2m1
53
Results
• NCUTevd: Ncut with exact eigenvectors
• NCUTiram: Implicit restarted Arnoldi method
• No statistically significant differences between NCUTevd and PIC
54
Results
55
Results
56
Results
57
Results
• Linear run-time implies constant number of iterations
• Number of iterations to “acceleration-convergence” is hard to analyze:
– Faster than a single complete run of power iteration to convergence
– On our datasets
• 10-20 iterations is typical
• 30-35 is exceptional
58
59
PIC AND SSL
60
PIC: Power Iteration Clustering
Run power iteration (repeated averaging w/ neighbors) with early stopping
61
Harmonic Functions/CoEM/wvRN
then replace vt+1(i) with seed values +1/-1 for labeled data
for 5-10 iterations
Classify data using final values from v
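A minimal Python/NumPy sketch of the averaging-plus-clamping loop described above. The graph, the choice of seed nodes, the zero initialization of unlabeled nodes, and the use of a row-normalized W are all assumptions made for the example.

```python
import numpy as np

# Toy graph (an assumption): triangles {0,1,2} and {3,4,5} joined by the edge 0-5.
edges = [(0, 1), (0, 2), (1, 2), (0, 5), (3, 4), (3, 5), (4, 5)]
n = 6
A = np.zeros((n, n))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0
W = A / A.sum(axis=1, keepdims=True)   # row-normalized averaging matrix

seeds = {1: +1.0, 3: -1.0}             # hypothetical labeled nodes

v = np.zeros(n)                        # unlabeled nodes start at 0 (an assumption)
for node, label in seeds.items():
    v[node] = label

for _ in range(10):
    v = W @ v                          # average over neighbors
    for node, label in seeds.items():
        v[node] = label                # clamp the labeled nodes back to their seed values

predicted = np.sign(v)
print(np.round(v, 3), predicted)
```

After 10 iterations the unlabeled nodes in the first triangle carry positive values and those in the second carry negative values, so classifying by sign follows the seeds.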
62
Implicit Manifolds on the NELL datasets
[Figure: graph linking noun phrases (Paris, San Francisco, Austin, Pittsburgh, Seattle, anxiety, denial, selfishness, arrogance) to contexts (“live in arg1”, “mayor of arg1”, “arg1 is home of”, “traits such as arg1”), divided into nodes “near” seeds and nodes “far from” seeds.]
63
Using the Manifold Trick for SSL
64
Using the Manifold Trick for SSL
65
Using the Manifold Trick for SSL
66
Using the Manifold Trick for SSL
67
Using the Manifold Trick for SSL
68
Using the Manifold Trick for SSL
A smoothing trick:
69
Using the Manifold Trick for SSL
70