Aaron Clauset (@aaronclauset)
Assistant Professor of Computer Science, University of Colorado Boulder
External Faculty, Santa Fe Institute

Lecture 8: Generalized large-scale structure

© 2017 Aaron Clauset
hierarchical communities
most communities are not random graphs
• groups within groups / groups of groups
• finding communities at one "level" of a hierarchy can obscure structure above or below that level
can we automatically extract such hierarchies?
[figure: step 1: network data → ? → step 3: hierarchy]
hierarchical communities
hierarchical random graph model
hierarchical communities
Clauset, Moore, Newman, Nature 453, 98-101 (2008)
Clauset, Moore, Newman, ICML (2006)
assortative modules
the hierarchical random graph model is a pair (D, {p_r}):
• D is a dendrogram whose leaves are the network's nodes; each internal node r carries a probability p_r
• Pr(i, j connected) = p_r, where r = lowest common ancestor of i and j
• an "inhomogeneous" random graph

[figure: a dendrogram (the model) and a network sampled from it (an instance)]

Clauset, Moore, Newman, Nature 453, 98-101 (2008)
Clauset, Moore, Newman, ICML (2006)
hierarchical communities
L_r = number of nodes in the left subtree of r
R_r = number of nodes in the right subtree of r
E_r = number of edges with r as their lowest common ancestor

maximum-likelihood estimate: p_r = E_r / (L_r R_r)

Pr(A | D, {p_r}) = ∏_r p_r^{E_r} (1 − p_r)^{L_r R_r − E_r}   } the hierarchical random graph model
worked example: a small dendrogram with p_r values 1, 1, 1/3, 1, 1/4 at its internal nodes

L(D, {p_r}) = ∏_r p_r^{E_r} (1 − p_r)^{L_r R_r − E_r}
            = [(1/3)^1 (2/3)^2] · [(1/4)^2 (3/4)^6]
            ≈ 0.0016
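The likelihood above multiplies one factor per internal node. A minimal sketch of the computation; the tuple representation of the dendrogram's internal nodes is illustrative, not from the lecture:

```python
def hrg_likelihood(internal_nodes):
    """Likelihood of a hierarchical random graph.

    `internal_nodes` lists, for each internal node r of the dendrogram,
    the tuple (E_r, L_r, R_r): the number of edges whose lowest common
    ancestor is r, and the number of leaves in r's left and right
    subtrees.  Each p_r is set to its maximum-likelihood value
    E_r / (L_r * R_r), so this returns the maximized L(D, {p_r})."""
    L = 1.0
    for E, Lr, Rr in internal_nodes:
        p = E / (Lr * Rr)                   # MLE for p_r
        L *= p**E * (1 - p)**(Lr * Rr - E)  # 0**0 == 1 handles p_r in {0, 1}
    return L

# the worked example above: one node with p_r = 1/3 (E_r=1, L_r R_r=3),
# one with p_r = 1/4 (E_r=2, L_r R_r=8); nodes with p_r = 1 contribute 1
print(hrg_likelihood([(1, 1, 3), (2, 2, 4)]))  # ≈ 0.0016
```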
generalizing from a single example
• given graph A, estimate the model parameters D, {p_r}
• sample new graphs G from the posterior distribution Pr(G | D, {p_r})

checking the models

compare resampled graphs with the original data; check:
1. degree distribution
2. clustering coefficient
3. geodesic path lengths

hierarchical communities
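Resampling from a fitted model can be sketched as follows; the nested-tuple dendrogram representation is an assumption for illustration, not the authors' data structure:

```python
import random

def sample_hrg(dendrogram):
    """Draw one graph instance from a hierarchical random graph.

    A leaf is a node label; an internal node is a tuple
    (left_subtree, right_subtree, p_r).  Each cross-subtree pair
    (i, j) is linked independently with the p_r of its lowest
    common ancestor."""
    def leaves(t):
        return leaves(t[0]) + leaves(t[1]) if isinstance(t, tuple) else [t]

    edges = set()

    def place(t):
        if not isinstance(t, tuple):
            return
        left, right, p = t
        for i in leaves(left):
            for j in leaves(right):
                if random.random() < p:
                    edges.add((min(i, j), max(i, j)))
        place(left)
        place(right)

    place(dendrogram)
    return edges

# two tight pairs joined loosely: {0,1} and {2,3} always linked
# internally, cross-pairs linked with probability 0.1
D = ((0, 1, 1.0), (2, 3, 1.0), 0.1)
```

Drawing many such instances and comparing their summary statistics with the original network is exactly the model check in the plots that follow.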
hierarchical communities

degree distribution

[figure: fraction of vertices with degree k vs. degree k (log-log); resampled graphs track the original data]
hierarchical communities

density of triangles

[figure: fraction of graphs with clustering coefficient c vs. c; the resampled distribution compared with the original value]
hierarchical communities

geodesic distances

[figure: fraction of vertex pairs at distance d vs. distance d (semi-log); resampled graphs compared with the original]
inspecting the dendrograms
hierarchical communities
[figures: MAP dendrograms, with node labels, for two networks — Zachary's Karate Club and the NCAA Schedule 2000]
hierarchical communities
[figure: MAP dendrogram of Zachary's Karate Club (34 nodes)]

hierarchical communities

[figure: MAP dendrogram of the NCAA 2000 football schedule (115 teams, labeled by name; conferences emerge as subtrees)]
link prediction in networks
• many networks are sampled: social nets, food webs, protein interactions, etc.
• generative models provide an estimate of Pr(A_ij | θ), for either A_ij = 0 (missing links) or A_ij = 1 (spurious links)
• like cross-validation: hold out some adjacencies {A_ij}, measure the accuracy of the algorithm on these

now many approaches to link prediction:
• Liben-Nowell & Kleinberg (2003)
• Goldberg & Roth (2003)
• Szilágyi et al. (2005)
• Guimera & Sales-Pardo (2009)
• and many others

hierarchical communities
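The held-out evaluation described above can be sketched with the common-neighbors score, one of the simple baselines compared in the figures that follow; the data representation here is illustrative:

```python
import random

def common_neighbors_auc(adj, held_out, non_edges, trials=10_000):
    """AUC for link prediction: the probability that a randomly chosen
    held-out (true but unobserved) edge scores higher than a randomly
    chosen non-edge, with ties counted as 1/2.  `adj` maps each node
    to its set of neighbors in the observed (partial) graph."""
    def score(u, v):
        return len(adj[u] & adj[v])  # common-neighbors predictor

    wins = 0.0
    for _ in range(trials):
        m = random.choice(held_out)
        n = random.choice(non_edges)
        sm, sn = score(*m), score(*n)
        wins += 1.0 if sm > sn else (0.5 if sm == sn else 0.0)
    return wins / trials
```

An AUC of 0.5 corresponds to pure chance, the baseline curve in the following plots.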
[figure: link prediction on the Grassland species network — AUC vs. fraction of edges observed k/m, comparing pure chance, common neighbors, Jaccard coefficient, degree product, shortest paths, and hierarchical structure; the hierarchy-based predictor outperforms the simple predictors, which outperform pure chance]
hierarchical communities
[figure: AUC vs. fraction of edges observed for (a) a terrorist association network and (b) the T. pallidum metabolic network, comparing pure chance, common neighbors, Jaccard coefficient, degree product, shortest paths, and hierarchical structure]
hierarchical communities
other approaches
edge counts among blocks form another network; fit another SBM to these, and repeat
hierarchical communities
Hierarchical Block Structures and High-Resolution Model Selection in Large Networks
Tiago P. Peixoto*
Institut für Theoretische Physik, Universität Bremen, Hochschulring 18, D-28359 Bremen, Germany (Received 5 November 2013; published 24 March 2014)

Discovering and characterizing the large-scale topological features in empirical networks are crucial steps in understanding how complex systems function. However, most existing methods used to obtain the modular structure of networks suffer from serious problems, such as being oblivious to the statistical evidence supporting the discovered patterns, which results in the inability to separate actual structure from noise. In addition to this, one also observes a resolution limit on the size of communities, where smaller but well-defined clusters are not detectable when the network becomes large. This phenomenon occurs for the very popular approach of modularity optimization, which lacks built-in statistical validation, but also for more principled methods based on statistical inference and model selection, which do incorporate statistical validation in a formally correct way. Here, we construct a nested generative model that, through a complete description of the entire network hierarchy at multiple scales, is capable of avoiding this limitation and enables the detection of modular structure at levels far beyond those possible with current approaches. Even with this increased resolution, the method is based on the principle of parsimony, and is capable of separating signal from noise, and thus will not lead to the identification of spurious modules even on sparse networks. Furthermore, it fully generalizes other approaches in that it is not restricted to purely assortative mixing patterns, directed or undirected graphs, and ad hoc hierarchical structures such as binary trees. Despite its general character, the approach is tractable and can be combined with advanced techniques of community detection to yield an efficient algorithm that scales well for very large networks.

DOI: 10.1103/PhysRevX.4.011047 — Subject Areas: Complex Systems, Interdisciplinary Physics, Statistical Physics
[figure: the nested model — the observed network (N nodes, E edges) at level l = 0; successive levels l = 1, 2, 3 summarize it as multigraphs of B_0, B_1, B_2 block nodes, each carrying the same E edges]
Peixoto, Phys. Rev. X 4, 011047 (2014)
hierarchical communities
Peixoto, Phys. Rev. X 4, 011047 (2014)
other approaches (hierarchical SBM)
[figure: hierarchical SBM fit to the political blogs (2004) network]
limits of statistical inference
community structure in networks
• dozens of algorithms for finding it
• generative models among the most powerful
• how methods fail is as important as how they succeed
• even if communities exist in a network, they may not be detectable
limits of statistical inference
planted partition problem
• synthetic data with known communities
• 2 groups, equal sized
• mean degree c
• parameterized strength of communities: ε = c_out/c_in

[figure: 2×2 block matrix of connection probabilities — c_in/n on the diagonal, c_out/n off the diagonal]
Decelle, Krzakala, Moore, & Zdeborová, Phys. Rev. Lett. 107, 065701 (2011)
limits of statistical inference
planted partition problem
• synthetic data with known communities
• 2 groups, equal sized
• mean degree c
• ε = c_out/c_in
Decelle, Krzakala, Moore, & Zdeborová, Phys. Rev. Lett. 107, 065701 (2011)
[excerpt from Decelle, Krzakala, Moore & Zdeborová, p. 13:]

... obeying detailed balance with respect to the Hamiltonian (8), starting with a random initial group assignment {q_i}. We see that Q = 0 for c_out/c_in > ε_c. In other words, in this region both BP and MCMC converge to the factorized state, where the marginals contain no information about the original assignment. For c_out/c_in < ε_c, however, the overlap is positive and the factorized fixed point is not the one to which BP or MCMC converge.

In particular the right-hand side of Fig. 1 shows the case of q = 4 groups with average degree c = 16, corresponding to the benchmark of Newman and Girvan [9]. We show the large-N results and also the overlap computed with MCMC for size N = 128, which is the commonly used size for this benchmark. Again, up to symmetry breaking, marginalization achieves the best possible overlap that can be inferred from the graph by any algorithm. Therefore, when algorithms are tested for performance, their results should be compared to Fig. 1 instead of to the common but wrong expectation that the four groups are detectable for any ε < 1.

[Fig. 1: overlap vs. ε = c_out/c_in, each panel with an "undetectable" region marked — (left) q = 2, c = 3 (N = 500k BP; N = 70k MCMC); (right) q = 4, c = 16 (N = 100k BP; N = 70k MC; N = 128 MC and full BP)]

FIG. 1 (color online): The overlap (5) between the original assignment and its best estimate given the structure of the graph, computed by the marginalization (13). Graphs were generated using N nodes, q groups of the same size, average degree c, and different ratios ε = c_out/c_in. Thus ε = 1 gives an Erdos-Renyi random graph, and ε = 0 gives completely separated groups. Results from belief propagation (26) for large graphs (red line) are compared to Gibbs sampling, i.e., Monte Carlo Markov chain (MCMC) simulations (data points). The agreement is good, with differences in the low-overlap regime that we attribute to finite-size fluctuations. On the right we also compare to results from the full BP (22) and MCMC for smaller graphs with N = 128, averaged over 400 samples. The finite-size effects are not very strong in this case, and BP is reasonably close to the exact (MCMC) result even on small graphs that contain many short loops. For N → ∞ and ε > ε_c = (c − √c)/[c + √c(q − 1)] it is impossible to find an assignment correlated with the original one based purely on the structure of the graph. For two groups and average degree c = 3 this means that the density of connections must be ε_c^{−1}(q = 2, c = 3) = 3.73 greater within groups than between groups to obtain a positive overlap. For Newman and Girvan's benchmark networks with four groups (right), this ratio must exceed 2.33.

Let us now investigate the stability of the factorized fixed point under random perturbations to the messages when we iterate the BP equations. In the sparse case where c_ab = O(1), graphs generated by the block model are locally treelike in the sense that almost all nodes have a neighborhood which is a tree up to distance O(log N), where the constant hidden in the O depends on the matrix c_ab. Equivalently, for almost all nodes i, the shortest loop that i belongs to has length O(log N). Consider such a tree with d levels, in the limit d → ∞. Assume that on the leaves the factorized fixed point is perturbed as

    ψ^k_t = n_t + ε^k_t ,    (39)

and let us investigate the influence of this perturbation on the message on the root of the tree, which we denote k_0. There are, on average, c^d leaves in the tree where c is the average degree. The influence of each leaf is independent, so let us first investigate the influence of the perturbation of a single leaf k_d, which is connected to k_0 by a path k_d, k_{d−1}, ..., k_1, k_0. We define a kind of transfer matrix

    T^i_{ab} ≡ ∂ψ^{k_i}_a / ∂ψ^{k_{i+1}}_b |_{ψ_t = n_t}
             = [ ψ^{k_i}_a c_{ab} / (Σ_r c_{ar} ψ^{k_{i+1}}_r) − ψ^{k_i}_a Σ_s ψ^{k_i}_s c_{sb} / (Σ_r c_{ar} ψ^{k_{i+1}}_r) ] |_{ψ_t = n_t}
             = n_a (c_{ab}/c − 1) ,    (40)

where this expression was derived from (26) to leading order in N. The perturbation ε^{k_0}_{t_0} on the root due to the ...
[slide annotations on Fig. 1: y-axis = overlap (accuracy); ε → 0 = strong communities, easy to detect; ε → 1 = random graph; a hard-to-detect region near ε_c]
limits of statistical inference
planted partition problem
• synthetic data with known communities
• 2 groups, equal sized
• mean degree c
• ε = c_out/c_in
Decelle, Krzakala, Moore, & Zdeborová, Phys. Rev. Lett. 107, 065701 (2011)
[slide annotations on Fig. 1: overlap (accuracy) falls from the strong-communities limit toward the random-graph limit]

• 2nd-order phase transition in detectability
• overlap goes to 0 for ε ≥ (c − √c) / (c + √c(k − 1))
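The threshold is easy to evaluate numerically; a small sketch (k here is the number of groups, written q in the excerpted paper):

```python
from math import sqrt

def epsilon_c(c, k):
    """Detectability threshold eps_c = (c - sqrt(c)) / (c + sqrt(c)(k-1))
    for the planted partition model with k equal groups and mean degree c
    (Decelle et al. 2011).  For eps = c_out/c_in above this value, no
    algorithm can recover the planted groups better than chance."""
    return (c - sqrt(c)) / (c + sqrt(c) * (k - 1))

# the ratios quoted in the excerpt: within-group connections must be
# 1/eps_c times denser than between-group connections
print(round(1 / epsilon_c(3, 2), 2))   # 3.73 for k=2, c=3
print(round(1 / epsilon_c(16, 4), 2))  # 2.33 for k=4, c=16
```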
limits of statistical inference
planted partition problem
• for 2 groups, the phase transition is information-theoretic: no algorithm can exist that detects these communities (better than chance)
• when communities are strong, most algorithms succeed
• when networks & communities are very sparse = trouble
• recently generalized to dynamic networks (Ghasemian et al. 2015)
• hierarchical block models (Peixoto 2014) and node metadata (Newman & Clauset 2016) both improve detectability

Decelle, Krzakala, Moore, & Zdeborová, Phys. Rev. Lett. 107, 065701 (2011)
Ghasemian et al., arxiv:1506.0679 (2015)
Peixoto, Phys. Rev. X 4, 011047 (2014)
Newman & Clauset, Nature Communications, to appear (2016)
the trouble with community detection
many networks include metadata x on their nodes:

    social networks         age, sex, ethnicity or race, etc.
    food webs               feeding mode, species body mass, etc.
    Internet                data capacity, physical location, etc.
    protein interactions    molecular weight, association with cancer, etc.

metadata is often used to evaluate the accuracy of community detection algorithms: if a community detection method applied to A finds a partition P that correlates with x, then we say that P is good
the trouble with community detection
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25 26
27
28
29
30
31
32
33
34
Zachary karate club political blogs network
the trouble with community detection
0
1
2
3
4
5
6
78
9
10
11
1213
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
3940
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95 96
97
98
99
101
102
103
104
105
106
107
108
109 110
111
112
113
114
100
NCAA 2000 Schedule
political books (2004)
the trouble with community detection
often, groups found by community detection are meaningful
• allegiances or personal interests in social networks [1]
• biological function in metabolic networks [2]
but some recent studies claim these are the exception
• real networks either do not contain structural communities, or communities exist but do not correlate with metadata groups [3]

[1] see Fortunato (2010), and Adamic & Glance (2005)
[2] see Holme, Huss & Jeong (2003), and Guimera & Amaral (2005)
[3] see Leskovec et al. (2009), Yang & Leskovec (2012), and Hric, Darst & Fortunato (2014)
the trouble with community detection
Hric, Darst & Fortunato (2014)
• 115 networks with metadata and 12 community detection methods [1]
• compare the extracted partition P with the observed metadata x for each network A
Name          No. Nodes     No. Edges         No. Groups   Description of group nature
lfr           1000          9839              40           artificial network (lfr, 1000S, µ = 0.5)
karate        34            78                2            membership after the split
football      115           615               12           team scheduling groups
polbooks      105           441               2            political alignment
polblogs      1222          16782             3            political alignment
dpd           35029         161313            580          software package categories
as-caida      46676         262953            225          countries
fb100         762–41536     16651–1465654     2–2597       common students' traits
pgp           81036         190143            17824        email domains
anobii        136547        892377            25992        declared group membership
dblp          317080        1049866           13472        publication venues
amazon        366997        1231439           14–29432     product categories
flickr        1715255       22613981          101192       declared group membership
orkut         3072441       117185083         8730807      declared group membership
lj-backstrom  4843953       43362750          292222       declared group membership
lj-mislove    5189809       49151786          2183754      declared group membership

Table I: Basic properties of all datasets used in this analysis. fb100 consists of 100 unique networks of universities, so we show the ranges of the number of nodes and edges of the networks, as well as of the metadata groups of the various partitions. amazon consists of a hierarchical set of 11 group levels; we report the range of the number of groups. The number of groups is calculated after our indicated preprocessing (see text).
Ganxis [29] (formerly SLPA) is based on label propagation. GreedyCliqueExp [30] begins with small cliques as seeds and expands them optimizing a local fitness function.

III. STRUCTURAL PROPERTIES OF NODE GROUPS FROM METADATA

Here we show some basic topological features of the metadata groups of our datasets. Fig. 1 reports the distribution of the group sizes, which is skewed for all datasets. Power law fits of the tails deliver exponents around −2. This is in agreement with the behavior of the size distributions for the communities found by community detection algorithms on real networks [1].

The link density of a subgraph S is the ratio between the number of links joining pairs of nodes of S and the total maximum number of links that could be there, which is given by n_S(n_S − 1)/2, n_S being the number of nodes of S. In Fig. 2 we see the link density of the metadata groups versus their sizes. Clearly, the larger the size of the group, the lower the link density. This is because real graphs are typically sparse, so the total number of links scales linearly with the number of nodes. This holds for parts of the network too, modulo small variations, so the link density decreases approximately as a power of the number of links of the group (with exponent close to −1). Since the latter is proportional to the group size, we obtain that the link density decreases as the inverse of the group size, as we see in Fig. 2.

Finally, in Fig. 3 we report the relation between the group embeddedness and its size. The embeddedness of a group is the ratio between the internal degree of the group and the total degree. The internal degree of a group is given by the sum of the internal degrees of the group's nodes, i.e. twice the number of links inside the group. The total degree of the group is the sum of the degrees of its nodes. A group is "good" if it has high embeddedness, i.e. if it is well separated from (loosely connected to) the rest of the graph. We notice that some of the datasets of our collection have groups with fairly

Figure 1: (Color online) Distribution of sizes of metadata groups. Each curve corresponds to a specific dataset of our collection.
[1] fb100 is 100 networks
the trouble with community detection
Hric, Darst & Fortunato (2014)
• evaluate by normalized mutual information, NMI(P, x) [1]
Figure 5: (Color online) NMI scores between structural communities and metadata groups for different networks. Scores are grouped by datasets on the x-axis. The height of each column is the maximal NMI score between any partition layer of the metadata partitions and any layer returned by the community detection method, considering only those comparisons where the overlap of the partitions is larger than 10% of the total number of nodes.
methods do not align with partitions built from metadata, but what about specific groups? Can we detect any of the groups well? Are some groups reflected in the graph structure and detectable, but lost in the bulk noise of the graph? This is what we wish to investigate here.

The basis of our analysis is the Jaccard score between two groups. Let C_i represent (the set of nodes of) the known group i, and D_j represent (the set of nodes of) the detected community j. The Jaccard score between these two sets is defined as

    J(C_i, D_j) = |C_i ∩ D_j| / |C_i ∪ D_j| ,    (1)

with |···| set cardinality, ∩ set intersection, and ∪ set union. The Jaccard score ranges from one (perfect match) to zero and roughly indicates the fraction of nodes shared between the two sets: the match quality.

The recall score measures how well one known group is detected. The recall score of one known group C_i is defined as the maximal Jaccard score between it and every detected community D_j,

    R(C_i) = max_{D_j ∈ {D}} J(C_i, D_j).    (2)

It is near one if the group is well detected and low otherwise. We can study the distribution of these scores to see how many groups can be detected at any given quality level. Recall measures the detection of known groups, and to measure the significance of detected communities, we can reverse the measure to calculate a precision score

    P(D_j) = max_{C_i ∈ {C}} J(D_j, C_i).    (3)

The precision score tells us how well one detected community corresponds to any known group.

We can now directly quantify the two conditions for good community detection: every known group must correspond to some detected community, and every detected community must represent some known group. Both of these measures are still interesting independently: a high recall but low precision indicates that the known groups are reflected in the network structurally, but there are many structural communities that are not known. We visualize the scores by means of rank-Jaccard plots which give an overview of the network's detection quality. We compute the recall (precision) for every known (detected) group and sort the groups in order of ascending Jaccard score. We plot recall (precision) vs the group rank, sorted by recall (precision) score so that the horizontal scale is the relative group rank, i.e. the ratio between the rank of the group and the number of groups (yielding a value between 0 and 1). Similar to our treatment of the partition-level analysis, we only plot matchings whose intersection covers more than 10% of total nodes in the graph. In our final plots, the average value of the curve (proportional to the area under it) is the average recall or precision score over all groups. The shape of the curve can tell us if all groups are detected equally well (yielding a high plateau) or if there is a large inequality in detection (a high slope). Furthermore, this allows us to compactly represent multiple layers. Each independent layer of known (detected) groups can be plotted in the same figure. We would generally look for the highest curve to know if any layer has a high recall (precision). When computing recall (precision), unless otherwise specified, as detected communities we consider the communities of all partitions delivered by a method, whereas the metadata groups are those present in all metadata partitions (if more than one partition is available in either case). This will give us the maximum possible recall (precision), which might be far higher than ...
[1] maximum NMI between any partition layer of the metadata partitions and any layer returned by the community detection method
[figure annotation: the first few columns are the "classic" data sets]
a solution
idea: use metadata x to help select a partition P* ∈ {P} that correlates with x, from among the exponential number of plausible partitions
a solution

idea: use metadata x to help select a partition P* ∈ {P} that correlates with x, from among the exponential number of plausible partitions

use a generative model to guide the selection:
• define a parametric probability distribution over networks, Pr(G | θ)
• generation: given θ, draw G from this distribution
• inference: given G, choose θ that makes G likely
generation
inference
model
G=(V,E)
data
Pr(G | θ)
a metadata-aware stochastic block model

generation
given metadata x = {x_u} and degree d_u for each node u:
• each node u is assigned a community s_u with probability γ_{s_u, x_u}
• thus, the prior on community assignments is P(s | Γ, x) = ∏_u γ_{s_u, x_u}
• given assignments, place edges independently, each with probability p_uv = d_u d_v θ_{s_u, s_v}
• where the θ_st are the stochastic block matrix parameters
this is a degree-corrected stochastic block model (DC-SBM) with a metadata-based prior on community labels
[1] Γ is the k × K matrix of parameters γ_sx  [2] Karrer & Newman (2011)
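The generative process above can be sketched directly (an illustrative stdlib-only sketch, not the authors' code; the min() clamp on p_uv is a simplification for when the degree product exceeds 1):

```python
import random

def generate(metadata, degrees, gamma, theta):
    """Sketch of the metadata-aware DC-SBM generative process.
    metadata[u] is node u's metadata value x_u, degrees[u] its expected
    degree d_u, gamma[x] the list of community-membership probabilities
    for metadata value x, and theta[s][t] the block parameter."""
    n, k = len(metadata), len(theta)
    # step 1: assign each node u a community s_u with probability gamma[x_u][s_u]
    s = [random.choices(range(k), weights=gamma[metadata[u]])[0]
         for u in range(n)]
    # step 2: place edges independently with p_uv = d_u * d_v * theta[s_u][s_v]
    # (clamped to 1; the model treats this product as a probability,
    # which is valid when it is small)
    edges = [(u, v)
             for u in range(n) for v in range(u + 1, n)
             if random.random() < min(1.0, degrees[u] * degrees[v]
                                      * theta[s[u]][s[v]])]
    return s, edges
```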
a metadata-aware stochastic block model

inference
given observed network A (adjacency matrix):
• the model likelihood is
  P(A | Θ, Γ, x) = ∑_s P(A | Θ, s) P(s | Γ, x) = ∑_s ∏_{u<v} p_uv^{A_uv} (1 − p_uv)^{1 − A_uv} ∏_u γ_{s_u, x_u}
• where Θ is a k × k matrix of community interaction parameters θ_st, and the sum is over all possible assignments s
• we fit this model to data using expectation-maximization (EM) to maximize P(A | Θ, Γ, x) with respect to Θ and Γ
[1] technical details in Newman & Clauset (2015) arxiv:1507.04001
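To make the likelihood concrete, here is a brute-force evaluation that sums over all k^n assignments, which is feasible only for toy networks (exactly why EM is used in practice). This is an illustrative sketch, not the authors' code; for brevity it sets p_uv = θ_{s_u s_v}, dropping the degree factors d_u d_v:

```python
from itertools import product

def likelihood(A, theta, gamma, x):
    """Evaluate P(A | Theta, Gamma, x) by exhaustive summation over all
    k^n community assignments s (toy-scale only)."""
    n, k = len(A), len(theta)
    total = 0.0
    for s in product(range(k), repeat=n):
        term = 1.0
        for u in range(n):
            term *= gamma[x[u]][s[u]]          # prior on assignments
        for u in range(n):
            for v in range(u + 1, n):          # Bernoulli edge terms
                p = theta[s[u]][s[v]]
                term *= p if A[u][v] else (1.0 - p)
        total += term
    return total
```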
network metadata

networks with planted structure
does this method recover known structure in synthetic data?
• use the SBM to generate planted partition networks, with k = 2 equal-sized groups and mean degree c = (c_in + c_out)/2: within-group edges have probability c_in/n, between-group edges c_out/n
• assign metadata with variable correlation ρ ∈ [0.5, 0.9] to the true group labels
• vary the strength of the partition, c_in − c_out
• when c_in − c_out < √(2(c_in + c_out)), no structure-only algorithm can recover the planted communities better than chance (the detectability threshold, which is a phase transition)
[1] Decelle, Krzakala, Moore & Zdeborova (2011)
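The threshold condition can be checked directly (a small sketch; the function name is ours):

```python
import math

def detectable(c_in, c_out):
    """True when the planted partition lies above the detectability
    threshold c_in - c_out > sqrt(2 * (c_in + c_out)) (Decelle et al.
    2011), i.e. when structure-only algorithms can beat chance."""
    return (c_in - c_out) > math.sqrt(2 * (c_in + c_out))
```

With mean degree c = 8 (so c_in + c_out = 16), the threshold sits at √32 ≈ 5.66.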
[figure: fraction of correctly assigned nodes (0.5 to 1) vs c_in − c_out (0 to 15), one curve per ρ ∈ {0.5, 0.6, 0.7, 0.8, 0.9}; the undetectable regime lies left of the threshold, with the partition growing stronger to the right]

networks with planted structure
let mean degree c = 8 (n = 10 000)
• when ρ = 0.5, metadata isn't useful and we recover regular SBM behavior
[figure: fraction of correctly assigned nodes (0.5 to 1) vs c_in − c_out (0 to 15), one curve per ρ ∈ {0.5, 0.6, 0.7, 0.8, 0.9}; the undetectable regime lies left of the threshold]

networks with planted structure
let mean degree c = 8
• when ρ = 0.5, metadata isn't useful and we recover regular SBM behavior
• when the metadata correlate with the true groups (ρ > 0.5), accuracy is better than either the metadata or the SBM alone
metadata + SBM performs better than either
• any algorithm without metadata, or
• the metadata alone
real-world networks
1. high school social network: 795 students in a medium-sized American high school and its feeder middle school
2. marine food web: predator-prey interactions among 488 species in the Weddell Sea, Antarctica
3. Malaria gene recombinations: recombination events among 297 var genes
4. Facebook friendships: online friendships among 15,126 Harvard students and alumni
5. Internet graph: peering relations among 46,676 Autonomous Systems
real-world networks
1. high school social network: 795 students in a medium-sized American high school and its feeder middle school
• x = {grade 7-12, ethnicity, gender}
[1] Add Health network data, designed by Udry, Bearman & Harris
real-world networks
1. high school social network: 795 students in a medium-sized American high school and its feeder middle school
• x = {grade 7-12, ethnicity, gender}
• method finds a good partition between the high school and the middle school
• NMI = 0.881
• without metadata: NMI ∈ [0.105, 0.384]
[1] Add Health network data, designed by Udry, Bearman & Harris
real-world networks
1. high school social network: 795 students in a medium-sized American high school and its feeder middle school
• x = {grade 7-12, ethnicity, gender}
• method finds a good partition between black and white students (with others scattered among both groups)
• NMI = 0.820
• without metadata: NMI ∈ [0.120, 0.239]
[1] Add Health network data, designed by Udry, Bearman & Harris
real-world networks
1. high school social network: 795 students in a medium-sized American high school and its feeder middle school
• x = {grade 7-12, ethnicity, gender}
• method finds no good partition between males and females; instead, it chooses a mixture of grade/ethnicity partitions
• NMI = 0.003
• without metadata: NMI ∈ [0.000, 0.010]
[1] Add Health network data, designed by Udry, Bearman & Harris
real-world networks
2. marine food web: predator-prey interactions among 488 species in the Weddell Sea, Antarctica
• x = {species body mass, feeding mode, oceanic zone}
• partition recovers the known correlation between body mass, trophic level, and ecosystem role
[figure: three communities (1, 2, 3), with nodes colored by role: detritivore, carnivore, omnivore, herbivore, primary producer]
[1] here, we're using a continuous metadata model  [2] Brose et al. (2005)
(a) Without metadata  (b) With metadata

FIG. S3: Inferred communities, without metadata and with, for the HVR 5 gene recombination network of the human malaria parasite P. falciparum, where metadata values are the CP labels for the HVR 6 network.

predictive (96% probability) of that gene being in one group, while having four cysteines is modestly predictive (67% probability) of being in the other group. Thus the method has discovered by itself that the motif sequences that define the CP labels, along with their corresponding network communities, correlate with cysteine counts and their associated severe disease phenotypes [8, 11].

The communities in the HVR 6 network represent highly non-random patterns of recombination, which are thought to indicate functional constraints on protein structure. Previous work has conjectured that common constraints on recombination span distinct HVRs [9]. We can test this hypothesis using the methods described in this paper. There is no reason a priori to expect that the community structure of HVR 6 should correlate with that of HVR 5, because the Cys and CP labels are derived from outside the HVR 5 sequences: Cys labels reflect cysteine counts in HVR 6, while CP labels subdivide Cys labels based on sequence motifs adjacent to, but outside of, HVR 5. Applying our methods to HVR 5 without any metadata (Fig. S3a), we find mixing of the HVR 6 Cys labels across the HVR 5 communities. By contrast, using the CP labels as metadata for the HVR 5 network, our method finds a much cleaner partition (Fig. S3b), indicating that indeed the HVR 6 Cys labels correlate with the community structure of HVR 5.
[figure: learned prior probability of community membership (0 to 1) vs mean body mass (10^-12 g to 10^9 g)]

FIG. S4: Learned priors, as a function of body mass, for the three-community division of the Weddell Sea network shown in Fig. 4 of the main paper.

C. Weddell Sea food web

As discussed in the main text, the Weddell Sea food web provides an example of the "ordered" metadata type in the body mass of species. A three-way community division of the network with the log of species' average body mass as metadata produces the division shown in Fig. 4 of the paper. The prior probabilities as functions of body mass are of interest in their own right. They are shown in Fig. S4. Although, as described in Section IC of the paper, the log mass is rescaled in our calculations to the range [0, 1], the horizontal axis in the figure is calibrated to read in terms of the original mass in grams, so the prior probabilities of belonging to each of the three communities can be simply read from the figure. The blue, green, and red curves correspond respectively to the communities labeled 1, 2, and 3 in Fig. 4. Thus a species with a low mean mass of 10^-12 g has about an 80% probability of being in community 1, a 20% probability of being in community 2, and virtually no chance of being in community 3. Conversely, a species with mean body mass of 10^8 g (which could only be a whale) has about a 90% chance of being in community 3, 10% of being in community 2, and almost no chance of being in community 1.
D. School friendship network data use

Our use of the school friendship data described in the main text requires that we make the following statement:

This work uses data from Add Health, a program project designed by J. Richard Udry, Peter S. Bearman, and Kathleen Mullan Harris, and funded by a grant P01-HD31921 from the National Institute of Child Health
real-world networks
3. Malaria gene recombinations: recombination events among 297 var genes
• x = {Cys-PoLV labels for HVR6 region}
• with metadata, the partition discovers a correlation with Cys labels (which are associated with severe disease)
[figure: HVR6 communities, without metadata vs. with metadata]
• HVR6: NMI = 0.596; without metadata: NMI ∈ [0.077, 0.675]
[1] Larremore, Clauset & Buckee (2013)
real-world networks
3. Malaria gene recombinations: recombination events among 297 var genes
• x = {Cys-PoLV labels for HVR6 region}
• on the adjacent region of the gene (HVR5), we find the Cys-PoLV labels correlate with the recombinant structure here, too
[figure: HVR5 communities, without metadata vs. with metadata]
[1] Larremore, Clauset & Buckee (2013)
the ground truth about metadata
what is the goal of community detection?
network G + method f → communities C = f(G), compared against metadata M
the ground truth about metadata
what is the goal of community detection?
network G + method f → communities C = f(G), compared against metadata M
if C ≈ M: "this method works!"
the ground truth about metadata
what is the goal of community detection?
network G + method f → communities C = f(G), compared against metadata M
if C ≈ M: "this method works!"   if C ≠ M: "this method stinks!"
the ground truth about metadata
what is the goal of community detection?
there are 4 indistinguishable reasons why we might find f(G) = C ≠ M:
1. the metadata M are unrelated to the network structure G
the ground truth about metadata
what is the goal of community detection?
there are 4 indistinguishable reasons why we might find f(G) = C ≠ M:
1. the metadata M are unrelated to the network structure G
2. the metadata M and the communities C capture different aspects of the structure of G (e.g., social groups vs. leaders and followers)
the ground truth about metadata
what is the goal of community detection?
there are 4 indistinguishable reasons why we might find f(G) = C ≠ M:
1. the metadata M are unrelated to the network structure G
2. the metadata M and the communities C capture different aspects of the structure of G
3. the network G has no community structure
the ground truth about metadata
what is the goal of community detection?
there are 4 indistinguishable reasons why we might find f(G) = C ≠ M:
1. the metadata M are unrelated to the network structure G
2. the metadata M and the communities C capture different aspects of the structure of G
3. the network G has no community structure
4. the algorithm f is bad ("this method stinks!")
DON’T TRY TO FIND THE GROUND TRUTH
INSTEAD . . . TRY TO REALIZE THERE IS NO GROUND TRUTH
theorems for community detection

1. Theorem: there is no bijection between ground truth and communities
   g(T) → G and g′(T′) → G: 2 different processes, on 2 different ground truths, can create the same observed network

2. Theorem: No Free Lunch in community detection
   no algorithm f has better performance than any other algorithm f′, when averaged over all possible inputs {G}
   good performance comes from matching an algorithm f to its preferred subclass of networks {G′} ⊂ {G}

[1] performance defined as adjusted mutual information (AMI), which is like the normalized mutual information, but adjusted for expected values  [2] original NFL theorem: Wolpert, Neural Computation (1996)  [3] proofs of these theorems are in Peel, Larremore & Clauset (2016)
real-world networks
[1] Traud, Mucha & Porter (2012)
The number of communities is less than the number of metadata values, in some cases by a wide margin. Assuming the values of both to be reasonably broadly distributed, this implies that the entropy H(s) of the communities will be smaller than that of the metadata H(x) and hence, normally, min[H(s), H(x)] = H(s). Thus if we define

NMI = I(s; x) / min[H(s), H(x)],    (B4)

we ensure that the normalized mutual information lies between zero and one, that it has a symmetric definition with respect to s and x, and that it will achieve its maximum value of one when the metadata perfectly predict the community membership. Other definitions, normalized using the mean or maximum of the two entropies, satisfy the first two of these three conditions but not the third, giving values smaller than one by an unpredictable margin even when the metadata perfectly predict the communities. We use the definition (B4) in all the calculations presented in this paper.
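Equation (B4) can be computed directly from label counts (an illustrative stdlib-only sketch; it is undefined when either labeling has zero entropy):

```python
import math
from collections import Counter

def nmi_min(s, x):
    """NMI of eq. (B4): I(s; x) / min[H(s), H(x)].  Equals 1 exactly
    when the metadata x perfectly predict the communities s."""
    n = len(s)
    ps, px, pj = Counter(s), Counter(x), Counter(zip(s, x))
    H = lambda c: -sum(v / n * math.log(v / n) for v in c.values())
    I = sum(v / n * math.log((v / n) / (ps[a] / n * px[b] / n))
            for (a, b), v in pj.items())
    return I / min(H(ps), H(px))
```

Note that with min-normalization, metadata that perfectly predict the communities score 1 even when the metadata have more distinct values than there are communities.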
Appendix C: Further examples

In this appendix we present a number of additional applications of our methods as well as some additional details on examples described in the main text. Summary statistics on all the networks studied are given in Table I.

1. Facebook friendship network

The FB100 data set of Traud et al. [31] is a set of friendship networks among college students at US universities compiled from friend relations on the social networking website Facebook. The networks date from the early days of Facebook when its services were available only to college students and each university formed a separate and unconnected subgraph in the larger network. The nodes in these networks represent the students, the edges represent friend relations on Facebook, and in addition to the network structure there are metadata of several types, including gender, college year (i.e., year of college graduation), major (i.e., principal subject of study, if known), and a numerical code indicating which dorm they lived in.

The primary divisions in these networks appear to be by age, or more specifically by college year. For instance, we have looked in some detail at the data for Harvard University, which was the birthplace of Facebook and its biggest institutional participant at the time the data were gathered, with 15,126 students in the network, spanning college years 2003 to 2009. There are also a small number of Harvard alumni (i.e., former students) in the data set, primarily those recently graduated, with graduation years 2000-2002. The top panel in Fig. 4 shows results
FIG. 4: Learned prior probability of community membership for two five-way divisions of the Facebook friendship network of Harvard students described in the text. The horizontal axis is (top) year of graduation and (bottom) dorm, and the colors represent the prior probabilities of membership in each of the communities.

from a five-way division of the network using our algorithm with year as metadata. Year, for the purposes of this calculation, was treated as an unordered variable, placing no constraints on the value of the prior probabilities of community membership for adjacent years. One could have treated it as an ordered variable, which would have constrained adjacent years to have similar priors, but we did not do that here. Nonetheless, as we will see, the algorithm finds communities in which adjacent years tend to be grouped together.

This network provides a good example of the usefulness of the learned priors in shedding light on the structure of the network. The figure shows a visualization of the priors as a function of year, with the colors showing the relative probability of belonging to each of the communities. Each of the bars in the plot has the same height of 1 since the prior probabilities are required to sum to 1, while the balance of colors shows the distribution over communities. Examination of the top panel in the figure shows clearly a division of the network along age lines. Two groups, in orange and yellow at the right of the plot, correspond to the most recent two years of students at the time of the study (graduation years 2008 and 2009) and the next, in red, accounts for the two years before that (2006 and 2007). The purple community corresponds to the next three years, 2003-2005, while the
real-world networks
4. Facebook friendships: online friendships among 15,126 Harvard students and alumni (in Sept. 2005)
• x = {graduation year, dormitory}
• method finds a good partition between alumni, recent graduates, upperclassmen, sophomores, and freshmen
• NMI = 0.668
• without metadata: NMI ∈ [0.573, 0.641]
real-world networks
4. Facebook friendships: online friendships among 15,126 Harvard students and alumni (in Sept. 2005)
• x = {graduation year, dormitory}
• method finds a good partition among the dorms
[1] Traud, Mucha & Porter (2012)
• NMI = 0.255
• without metadata: NMI ∈ [0.074, 0.224]