Network Analysis & Modeling
Prof. Aaron Clauset Computer Science & BioFrontiers Institute @aaronclauset [email protected]
lecture 0: what are networks and how do we talk about them?
who are network scientists?
Physicists, Computer Scientists, Applied Mathematicians, Statisticians, Biologists, Ecologists, Sociologists, Political Scientists
it's a big community!
• different traditions
• different tools
• different questions
increasingly, not ONE community, but MANY, only loosely interacting communities
who are network scientists?
• Physicists: phase transitions, universality
• Computer Scientists: data / algorithm oriented, predictions
• Applied Mathematicians: dynamical systems, diff. eq.
• Statisticians: inference, consistency, covariates
• Biologists: experiments, causality, molecules
• Ecologists: observation, experiments, species
• Sociologists: individuals, differences, causality
• Political Scientists: rationality, influence, conflict
what are networks?
an approach: a mathematical representation that provides structure to complexity.
structure that exists above individuals / components
or: structure that exists below system / population
tools and resources
Software: R, Python, Matlab, NetworkX [python], graph-tool [python, c++], GraphLab [python, c++]
Standalone editors: UCINet, NodeXL, Gephi, Pajek, Network Workbench, Cytoscape, yEd graph editor, Graphviz
Network data sets
Colorado Index of Complex Networks
learning goals
1. develop a network intuition for reasoning about network phenomena
2. understand network representations, basic terminology, and concepts.
3. learn principles and methods for describing and clustering network data
4. learn to predict missing network information
5. understand how to conduct and interpret numerical network experiments, to explore and test hypotheses about networks
6. analyze and model real-world network data, using math and computation
course format
• course meets in-person in ECEE 283 + over Zoom
• lectures 2 times a week, some guest lectures and some class discussions
• biweekly problem sets (6 total)
• class project: proposal, presentation, final report
• all content via class Canvas (lecture notes, recordings, problem sets, submissions)
• see syllabus for all course policies
course schedule
building intuition · basic concepts, tools · practical tools · advanced tools
week by week
1. fundamentals of networks
2. representations and summary statistics
3. simple random graphs
4. better random graphs
5. predicting missing node attributes
6. predicting missing links
7. community structure and mixing patterns
8. community structure models
9. spreading processes and cascades
10. spreading processes with structure (epidemics)
11. data incompleteness and sampling
12. ranking in networks
13. ethics and networks
14. student project presentations
lessons learned: what's difficult
1. students need to know many different things:
• some probability: Erdos-Renyi, configuration model, calculations
• some mathematics: physics-style calculations, phase transitions
• some statistics: basic data analysis, correlations, distributions
• some machine learning: prediction, likelihoods, features, estimation algorithms
• some programming: data wrangling, coding up measures and algorithms
2. can't teach all of these things to all types of students!
• vast amounts of advanced material in each of these directions
• students have little experience / intuition of what makes good science
lessons learned: what works well
1. simple mathematical problems: build intuition & practice with concepts
[Diagram: a small network with two groups, A and B, of sizes n_A and n_B]
• calculate summary statistics
• clustering highly-structured networks
• derive mathematical relations
• spreading processes on networks
Biological Networks, CSCI 3352, Lecture 7 (Prof. Aaron Clauset, Spring 2021)
and the rate equation for the number of infected individuals

$$\frac{di}{dt} = \beta\, i (1 - i) - \gamma\, i \ . \qquad (3)$$

We can use this equation to answer a simple question: will a new epidemic spread?

In the beginning of an epidemic, i will be very small (few individuals infected, relative to N), and we want to know whether and how it will grow. Under this assumption, we make the approximation (1 − i) ≈ 1, which yields

$$\frac{di}{dt} \approx i\,(\beta - \gamma) \ . \qquad (4)$$

Equation (3) is an ordinary differential equation whose solution, given a fraction i_0 of initially infected individuals, is a logistic function. That means the initial growth of i is exponential, growing like (R_0)^t = (β/γ)^t, and when β > γ, an epidemic in the compartment model will tend to grow (exponentially) because the compartment I's rate of in-flow exceeds its rate of out-flow. In contrast, when β < γ, the I compartment will tend to empty out more quickly than it fills up, and the epidemic dies out. When β/γ = 1, a value we call the epidemic threshold, we see wild fluctuations in whether the epidemic takes off or dies out.
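The threshold behavior in the notes above can be made concrete with a short numerical sketch: a forward-Euler integration of the rate equation di/dt = βi(1 − i) − γi. The values of β, γ, i0, and the step size below are illustrative choices, not values from the notes.

```python
# Sketch: forward-Euler integration of the SIS rate equation
#   di/dt = beta * i * (1 - i) - gamma * i,
# to see the epidemic threshold at beta/gamma = 1.
# beta, gamma, i0, and dt are illustrative choices, not values from the notes.

def simulate_sis(beta, gamma, i0=0.01, T=200.0, dt=0.01):
    """Return the trajectory of the infected fraction i(t)."""
    i, traj = i0, [i0]
    for _ in range(int(T / dt)):
        i += dt * (beta * i * (1.0 - i) - gamma * i)
        traj.append(i)
    return traj

# above threshold (beta/gamma > 1): i(t) approaches the endemic level 1 - gamma/beta
above = simulate_sis(beta=0.5, gamma=0.2)
# below threshold (beta/gamma < 1): the epidemic dies out, i(t) -> 0
below = simulate_sis(beta=0.2, gamma=0.5)
```

Plotting `above` shows the logistic (S-shaped) curve the notes describe; `below` decays exponentially toward zero.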
1.2.2 SIS and SIR dynamics
The population's states S, I under the SIS model evolve like the figure above, on the left.² Because individuals can go back and forth between the two compartments, the relative proportion of individuals…

² Adapted from: https://en.wikiversity.org/wiki/File:Sissys.png
Biological Networks, CSCI 3352, Lecture 6 (Prof. Aaron Clauset, Spring 2021)
L_good = 0.043304…   L_bad = 0.000244…
ln L_good = −3.1395…   ln L_bad = −8.3178…

e_rs / n_rs, good partition:
        red    blue
red     3/3    1/9
blue    1/9    3/3

e_rs / n_rs, bad partition:
        red    blue
red     4/6    2/8
blue    2/8    1/1

We can measure how much better the "good" partition is than the "bad" one by computing e^{Δ ln L}, which is their likelihood ratio (do you see why?). Plugging in our results shows that the good partition is exp(ln L_good − ln L_bad) ≈ 177 times more likely to generate the observed data than the "bad" partition. In other words, the good partition is a much better model of the data.
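A quick numeric check of the likelihood-ratio computation, using the log-likelihood values reported in the notes:

```python
import math

# The two log-likelihoods reported above:
ln_L_good = -3.1395
ln_L_bad = -8.3178

# The likelihood ratio is e^(delta ln L), which comes out to about 177.
ratio = math.exp(ln_L_good - ln_L_bad)
```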
2.2 The likelihood of a DC-SBM
Recall that to specify a degree-corrected SBM, we need to choose the number of communities c, the partition of nodes into groups z, the expected degree sequence k, and the mixing matrix M. Under the DC-SBM, the value we assign to any particular adjacency A_ij is a Poisson-distributed random variable with mean θ_i θ_j M_{z_i z_j}, where each θ_i is a node-specific model parameter that quantifies the fraction of the group z_i's total degree that belongs to node i (see Lecture 5). Hence, the likelihood function is

$$
L(G \mid z, \theta, M) = \prod_{i,j} \mathrm{Poisson}\!\left(\theta_i \theta_j M_{z_i z_j}\right)
= \prod_{i<j} \frac{\left(\theta_i \theta_j M_{z_i z_j}\right)^{A_{ij}}}{A_{ij}!}\, e^{-\theta_i \theta_j M_{z_i z_j}}
\times \prod_{i} \frac{\left(\tfrac{1}{2}\theta_i^2 M_{z_i z_i}\right)^{A_{ii}/2}}{(A_{ii}/2)!}\, e^{-\tfrac{1}{2}\theta_i^2 M_{z_i z_i}} \qquad (6)
$$

where the two parts are the likelihoods of the between-group and within-group adjacencies, respectively. These two parts appear because we assume an undirected network, and in the DC-SBM, we count stubs within groups but edges between groups (see Lecture 5).
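A direct, if naive, evaluation of the i<j part of Eq. (6) for a small loopless network can be sketched as follows. The adjacency matrix, partition, θ values, and mixing matrix below are illustrative, not data from the notes.

```python
import math

# Sketch: evaluate the i<j terms of the DC-SBM log-likelihood, Eq. (6),
# for a loopless undirected toy network. All values below are illustrative.

def dcsbm_log_likelihood(A, z, theta, M):
    """log L(G | z, theta, M) for a loopless undirected adjacency matrix A."""
    n = len(A)
    logL = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            mu = theta[i] * theta[j] * M[z[i]][z[j]]   # Poisson mean for A_ij
            logL += A[i][j] * math.log(mu) - mu - math.lgamma(A[i][j] + 1)
    return logL

A = [[0, 1, 0, 0],      # a path 0-1-2-3, split into two groups of two
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [0, 0, 1, 0]]
z = [0, 0, 1, 1]
theta = [0.5, 0.5, 0.5, 0.5]
M = [[4.0, 1.0],        # denser expected mixing within groups
     [1.0, 4.0]]
logL = dcsbm_log_likelihood(A, z, theta, M)
```

Comparing `logL` across candidate partitions z is exactly the comparison made in the likelihood-ratio example earlier.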
Although it may seem rather complicated, we can substantially simplify Eq. (6) by first factoring…
lessons learned: what works well
2. analyze real networks: test understanding & practice with implementing methods
what patterns really occur?
[Figure: mean geodesic path length vs. network size n for four online social networks (USF, Haverford, Caltech, Penn)]
[Figure: harmonic centrality by vertex label for the karate club network vs. a configuration model, with the real-world network highlighted]
how much does randomness explain? (when is a pattern interesting?)
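One concrete way to ask whether a pattern is interesting: recompute the statistic on a degree-preserving null model and compare. A minimal stdlib sketch, using harmonic centrality and a stub-matching configuration-model draw (the toy graph is illustrative, not the karate club data; multi-edges and self-loops are simply discarded):

```python
import random
from collections import deque

# Sketch: observed harmonic centrality vs. its mean under a degree-preserving
# null model (configuration model via stub matching). Toy graph is illustrative.

def harmonic_centrality(adj):
    """H(v) = sum over reachable u != v of 1/d(u,v), via BFS from each node."""
    H = []
    for s in range(len(adj)):
        dist = {s: 0}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        H.append(sum(1.0 / d for v, d in dist.items() if v != s))
    return H

def configuration_draw(degrees, rng):
    """One stub-matching draw, returned as a simple-graph adjacency list."""
    stubs = [v for v, k in enumerate(degrees) for _ in range(k)]
    rng.shuffle(stubs)
    adj = [set() for _ in degrees]
    for a, b in zip(stubs[::2], stubs[1::2]):
        if a != b:                       # drop self-loops (and repeats collapse)
            adj[a].add(b)
            adj[b].add(a)
    return adj

# toy graph: two triangles joined by a bridge (2, 3)
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
adj = [set() for _ in range(6)]
for u, v in edges:
    adj[u].add(v)
    adj[v].add(u)

observed = harmonic_centrality(adj)
degrees = [len(nbrs) for nbrs in adj]
rng = random.Random(0)
null = [harmonic_centrality(configuration_draw(degrees, rng)) for _ in range(200)]
null_mean = [sum(h[v] for h in null) / len(null) for v in range(6)]
# a vertex's centrality is "interesting" when observed[v] deviates from null_mean[v]
```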
stochastic block models
…given i and r can thus be done in time O(K(K + ⟨k⟩)). Because these computations can be done quickly for a reasonable number of communities, local vertex switching algorithms, such as single-vertex Monte Carlo, can be implemented easily. Monte Carlo, however, is slow, and we have found competitive results using a local heuristic algorithm similar in spirit to the Kernighan–Lin algorithm used in minimum-cut graph partitioning [27].

Briefly, in this algorithm we divide the network into some initial set of K communities at random. Then we repeatedly move a vertex from one group to another, selecting at each step the move that will most increase the objective function (or least decrease it if no increase is possible), subject to the restriction that each vertex may be moved only once. When all vertices have been moved, we inspect the states through which the system passed from start to end of the procedure, select the one with the highest objective score, and use this state as the starting point for a new iteration of the same procedure. When a complete such iteration passes without any increase in the objective function, the algorithm ends. As with many deterministic algorithms, we have found it helpful to run the calculation with several different random initial conditions and take the best result over all runs.
IV. RESULTS
We have tested the performance of the degree-corrected and uncorrected blockmodels in applications both to real-world networks with known community assignments and to a range of synthetic (i.e., computer-generated) networks. We evaluate performance by quantitative comparison of the community assignments found by the algorithms and the known assignments. As a metric for comparison we use the normalized mutual information, which is defined as follows [7]. Let n_rs be the number of vertices in community r in the inferred group assignment and in community s in the true assignment. Then define p(X = r, Y = s) = n_rs/n to be the joint probability that a randomly selected vertex is in r in the inferred assignment and s in the true assignment. Using this joint probability over the random variables X and Y, the normalized mutual information is

$$\mathrm{NMI}(X, Y) = \frac{2\,\mathrm{MI}(X, Y)}{H(X) + H(Y)} \ , \qquad (26)$$

where MI(X, Y) is the mutual information and H(Z) is the entropy of random variable Z. The normalized mutual information measures the similarity of the two community assignments and takes a value of one if the assignments are identical and zero if they are uncorrelated. A discussion of this and other measures can be found in Ref. [28].
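Eq. (26) is straightforward to compute directly from two label vectors; a stdlib sketch (the example labelings are illustrative):

```python
import math
from collections import Counter

# Sketch: normalized mutual information between two community assignments,
# computed directly from Eq. (26). Example labelings are illustrative.

def nmi(X, Y):
    n = len(X)
    pxy = Counter(zip(X, Y))                 # joint counts n_rs
    px, py = Counter(X), Counter(Y)          # marginal counts
    mi = sum((c / n) * math.log((c / n) / ((px[r] / n) * (py[s] / n)))
             for (r, s), c in pxy.items())
    hx = -sum((c / n) * math.log(c / n) for c in px.values())
    hy = -sum((c / n) * math.log(c / n) for c in py.values())
    return 2 * mi / (hx + hy)

# identical assignments, up to relabeling of the groups, give NMI = 1
same = nmi([0, 0, 1, 1], [1, 1, 0, 0])
```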
FIG. 1: Divisions of the karate club network found using the (a) uncorrected and (b) degree-corrected blockmodels. The size of a vertex is proportional to its degree, and vertex color reflects inferred group membership. The dashed line indicates the split observed in real life.
A. Empirical networks
We have tested our algorithms on real-world networks ranging in size from tens to tens of thousands of vertices. In networks with highly homogeneous degree distributions we find little difference in performance between the degree-corrected and uncorrected blockmodels, which is expected since for networks with uniform degrees the two models have the same likelihood up to an additive constant. Our primary concern, therefore, is with networks that have heterogeneous degree distributions, and we here give two examples that show the effects of heterogeneity clearly.

The first example, widely studied in the field, is the "karate club" network of Zachary [29]. This is a social network representing friendship patterns between the 34 members of a karate club at a US university. The club in question is known to have split into two different factions as a result of an internal dispute, and the members of each faction are known. It has been demonstrated that the factions can be extracted from a knowledge of the complete network by many community detection methods.

Applying our inference algorithms to this network, using…
[Figure: best z under the SBM (c = 2) vs. best z under the DC-SBM (c = 2)]
We can, but doing so via statistical inference requires accounting for the fact that changing c changes the "complexity" or flexibility of the model, because M has Θ(c²) entries. Likelihoods for different choices of c cannot be compared fairly because the "larger" model, with more parameters, will naturally tend to yield higher-likelihood models of the network. In the extreme case of c = n, the model can simply "memorize" the adjacency matrix via the Θ(n²) parameters in the mixing matrix A = M, placing every node in a group by itself.

A popular approach to making different choices of c comparable is to penalize or regularize the likelihood by some function f(c) that grows with c. That is, we impose a cost to using additional parameters, and then search for the parameterization that balances this cost against the improved fit to the network. There are many choices for f(c), and these go by names like Bayesian marginalization, Bayes factors, various information criteria (BIC, AIC, etc.), minimum description length (MDL) approaches, and more. We will not cover any of these techniques here.
2.4 Finding good partitions
The preceding sections simplified the task of finding a good decomposition of a network into communities. First, we reduced the problem to estimating parameters via statistical inference. Then, we reduced it further to searching over all partitions {z} to find a partition z that maximizes the SBM or DC-SBM log-likelihood of a network.

There are many ways we could perform such a search over partitions. These include powerful methods like Markov chain Monte Carlo, expectation-maximization (the "EM" algorithm), and belief propagation (also called "BP"), among many others. Here, we will learn about a locally greedy heuristic, which is a kind of generalization of the Kernighan-Lin algorithm, developed in 1970 for solving the minimum-cut graph partitioning problem.¹²
¹² Kernighan & Lin, Bell System Technical Journal 49, 291 (1970), https://archive.org/details/bstj49-2-291.
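The locally greedy, Kernighan-Lin-style procedure described above can be sketched against a generic objective `score(z)` (e.g., an SBM or DC-SBM log-likelihood). This is a plain reading of the procedure, not a reference implementation; the demo objective at the bottom is an illustrative stand-in for a likelihood.

```python
import random

# Sketch: locally greedy partition search in the Kernighan-Lin style.
# Each pass moves every vertex exactly once, always taking the best available
# move (even if it lowers the score), then restarts from the best state seen.

def greedy_partition(n, c, score, rng=random.Random(0)):
    z = [rng.randrange(c) for _ in range(n)]         # random initial partition
    best_z, best_s = list(z), score(z)
    improved = True
    while improved:
        improved = False
        frozen = set()
        pass_best_z, pass_best_s = list(z), score(z)
        for _ in range(n):                           # move every vertex once
            v, g = max(
                ((u, h) for u in range(n) if u not in frozen
                 for h in range(c) if h != z[u]),
                key=lambda m: score(z[:m[0]] + [m[1]] + z[m[0] + 1:]),
            )
            z[v] = g
            frozen.add(v)
            s = score(z)
            if s > pass_best_s:
                pass_best_z, pass_best_s = list(z), s
        z = list(pass_best_z)                        # restart from the pass's best state
        if pass_best_s > best_s:
            best_z, best_s = pass_best_z, pass_best_s
            improved = True
    return best_z, best_s

# demo: 6 nodes, two triangles joined by one edge; reward within-group edges,
# penalize group size (an illustrative objective, not a likelihood)
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
def score(z):
    within = sum(z[u] == z[v] for u, v in edges)
    pairs = sum(z[i] == z[j] for i in range(6) for j in range(i + 1, 6))
    return within - 0.5 * pairs

z_best, s_best = greedy_partition(6, 2, score)
```

As the notes recommend, in practice one would run this from several random initial conditions and keep the best result.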
lessons learned: what works well
3. simple prediction tasks: test intuition & run numerical experiments
link prediction via heuristic
[Figure: label prediction; fraction of correct label predictions vs. fraction of labels observed f, for the malaria genes (HVR5) and Norwegian boards (net1m-2011-08-01) networks]
[Figure: link prediction; AUC vs. fraction of edges observed f on the HVR5 malaria genes network, comparing degree product, Jaccard coefficient, shortest path, and a guessing baseline]
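One of the link-prediction heuristics in the legend above, the Jaccard coefficient, takes only a few lines to implement (the toy adjacency structure is illustrative):

```python
# Sketch: score candidate missing links by the Jaccard coefficient of the
# endpoints' neighborhoods. The toy adjacency is illustrative.

def jaccard_scores(neighbors):
    """Return {(u, v): |N(u) & N(v)| / |N(u) | N(v)|} for non-adjacent pairs."""
    nodes = sorted(neighbors)
    scores = {}
    for i, u in enumerate(nodes):
        for v in nodes[i + 1:]:
            if v in neighbors[u]:
                continue                     # only score missing links
            union = neighbors[u] | neighbors[v]
            if union:
                scores[(u, v)] = len(neighbors[u] & neighbors[v]) / len(union)
    return scores

neighbors = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4}, 4: {3}}
ranked = sorted(jaccard_scores(neighbors).items(), key=lambda kv: -kv[1])
```

Ranking non-edges by score and sweeping a threshold over the ranking is what produces an AUC curve like the one in the figure.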
label prediction via homophily
lessons learned: what works well
4. simple simulations: explore dynamics vs structure & numerical experiments
simulate Price's model; simulate epidemics (SIR) on planted partitions
[Figure: complementary CDF Pr(K ≥ k_in) of in-degree under Price's model, for r = 1, r = 4, and no preferential attachment]
[Figure: SIR epidemic outcomes on planted-partition networks, as a function of c_in − c_out and p]
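A minimal simulation of Price's model, consistent with the in-degree figure above, could look like the sketch below. I'm assuming r plays the role of the model's attractiveness offset (each new node cites c existing nodes with probability proportional to in-degree plus r); c, r, and the network size are illustrative choices.

```python
import random

# Sketch: grow a citation network under Price's model. Each new node cites
# c existing nodes, chosen with probability proportional to (in-degree + r),
# where r is assumed to be the attractiveness offset. rng.choices may repeat
# a target; that is acceptable for a sketch.

def price_model(n, c=3, r=1.0, rng=random.Random(0)):
    """Return the in-degree sequence after growing the network to n nodes."""
    k_in = [0] * n
    for t in range(c, n):                  # nodes 0..c-1 form the seed
        weights = [k_in[i] + r for i in range(t)]
        for target in rng.choices(range(t), weights=weights, k=c):
            k_in[target] += 1
    return k_in

degs = price_model(500)
# plotting Pr(K >= k_in) on log-log axes shows the heavy tail, as in the figure
```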
lessons learned: what works well
5. team projects: teamwork & exploring their (your!) own ideas
key takeaways
• network intuition is hard to develop!
good intuition draws on many skills (probability, statistics, computation, causal dynamics, etc.)
• best results come from: 1. exercises to get practice with calculations, 2. practice analyzing diverse real-world networks, 3. conducting numerical experiments & simulations
• practical tasks are a pedagogical tool (e.g., link and label prediction)
• interpreting the results requires good intuition and thinking like a scientist
• null models are a key concept: is a pattern interesting? what could explain it?
• networks are fun!
[Figure: constructing highly variable region (HVR) networks from malaria gene sequences: calculate alignment scores, convert to alignment indicators, remove short aligned regions, extract highly variable regions; example amino-acid sequences and numbered HVRs shown]
Aaron Clauset Associate Professor Computer Science + BioFrontiers Institute
[Interactive visualization: the Computer Science faculty hiring network among 205 ranked institutions, from Stanford (1) down to North Texas, Denton; © Clauset, Arbesman & Larremore, 2015. Controls route placements 100% up the hierarchy, 2 up per 1 down, balanced, 2 down per 1 up, or 100% down the hierarchy.]
about me: aaronclauset.github.io
• inequality and the spread of ideas in science
• complex social and biological systems
• computational methods for network analysis
Morgan et al., EPJ Data Science (2018) 7:40
Of these events, 88 (37%) are due to transmissions of research ideas by way of hiring, and in 81% of these cases, transmissions move via faculty from higher-prestige universities to lower-prestige universities (past studies show that only 9 to 14% of faculty placements move faculty to a more prestigious university than their doctoral institution [9]). Figure 4 illustrates these patterns by showing spreading events over time, for three of the topics.
Crucially, if faculty hiring shapes the spread of ideas, then a significant share of departments that ever adopt a topic X will have adopted it through faculty hiring (scenario 2). We test this hypothesis by constructing a specialized permutation test to assess the statistical significance of the empirically observed fraction of departments that have adopted a research idea via scenario 2, denoted f_obs, against the expected fraction of such departments, f_exp. The test's null model is one in which the publication years for each faculty are fixed at their empirical values, but paper titles are drawn uniformly at random, without replacement, from the set of all titles. In this way, serial correlations in topics and temporal correlations with the hiring event are removed from each faculty. We then report empirical p-values [30] for the fraction of hiring-driven adoption events for each topic.
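The empirical p-value machinery described above can be sketched generically: compare an observed statistic against the same statistic computed on many label-shuffled replicates. The toy data and statistic below are illustrative stand-ins, not the paper's faculty-hiring data.

```python
import random

# Sketch of an empirical p-value from a permutation test: the fraction of
# shuffled replicates whose statistic is at least the observed value, with a
# +1 correction so the p-value is never exactly zero.

def empirical_p_value(observed, labels, statistic, n_perm=2000,
                      rng=random.Random(0)):
    shuffled = list(labels)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if statistic(shuffled) >= observed:
            count += 1
    return (count + 1) / (n_perm + 1)

# toy question: are the 1s concentrated in the first half more than chance predicts?
labels = [1] * 8 + [0] * 12
stat = lambda xs: sum(xs[:10]) / 10
p = empirical_p_value(stat(labels), labels, stat)   # small p: yes
```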
Figure 4: Adoption events for the three research topics over time. Purple dots denote institutions who adopted an idea by hiring someone who studies that topic, and white dots represent institutions whose existing faculty began working on the topic. Arrows denote, for each time period, new transmissions originating from the hired individual's doctoral location. All 205 institutions are arranged clockwise by prestige (descending), with the most prestigious department positioned at noon.
…structure of academia, shed new light on the factors that shape individual career trajectories, and identify a novel connection between faculty hiring and social inequality.
RESULTS

Across the sampled disciplines, we find that faculty production (number of faculty placed) is highly skewed, with only 25% of institutions producing 71 to 86% of all tenure-track faculty (table S2; this and subsequent ranges indicate the range of a given quantity across the three disciplines, unless otherwise noted). The number of faculty within an academic unit (number of faculty hired, that is, the unit's size) is also skewed, with some units being two to three times larger than others. Business schools are especially large, generally containing several internal departments, with a mean size of 70 faculty members who received their doctorates from other within-sample units, whereas computer science and history have mean sizes of 21 and 29, respectively (see Supplementary Materials).

The differences in size within a discipline, however, cannot explain the observed differences in placements. If placements were simply proportional to the size of a unit, then the placement and size distributions would be statistically indistinguishable. A simple test of this size-proportional placement hypothesis shows that it may be rejected out of hand [Kolmogorov-Smirnov (KS) test, P < 10⁻⁸; Fig. 2, B and C], indicating genuine differential success rates in faculty placement.

The Gini coefficient, a standard measure of social inequality, is defined as the mean relative difference between a uniformly random pair of observed values. Thus, G = 0 denotes strict equality, and G = 1 maximal inequality. We find G = 0.62 to 0.76 for faculty production (Fig. 2, A and B), indicating strong inequality across disciplines [cf., G = 0.45 for the income distribution of the United States (12)]. Strong inequality holds even among the top faculty producers: the top 10 units produce 1.6 to 3.0 times more faculty than the second 10, and 2.3 to 5.6 times more than the third 10. For such differences to reflect purely meritocratic outcomes, that is, utilitarian optimality of total scholarship (13), differences in placement rates must reflect inherent differences in the production of scholarship. Under a meritocracy, the observed placement rates would imply that faculty with doctorates from the top 10 units are inherently two to six times more productive than faculty with doctorates from the third 10 units. The magnitude of these differences makes a pure meritocracy seem implausible, suggesting the influence of nonmeritocratic factors like social status.

Fig. 1. Prestige hierarchies in faculty hiring networks. (Top) Placements for 267 computer science faculty among 10 universities (MIT, Stanford, UC Berkeley, Carnegie Mellon, Cornell, Washington, Caltech, Harvard, Yale, Princeton), with placements from one particular university highlighted. Each arc (u, v) has a width proportional to the number of current faculty at university v who received their doctorate at university u (≠ v). (Bottom) Prestige hierarchy on these institutions that minimizes the total weight of "upward" arcs, that is, arcs where v is more highly ranked than u.

Fig. 2. Inequality in faculty production. (A) Lorenz curves showing the fraction of all faculty produced as a function of producing institutions. (B and C) Complementary cumulative distributions for institution out-degree (faculty produced) and in-degree (faculty hired). The means of these distributions are 21 for computer science, 70 for business, and 29 for history.

Clauset et al., Sci. Adv. 2015;1:e1400005, 12 February 2015
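The Gini coefficient as defined in the excerpt above, the mean relative difference between a uniformly random pair of observed values, is easy to compute directly (the toy values are illustrative):

```python
# Sketch: Gini coefficient as the mean relative difference between a
# uniformly random pair of observed values. Toy values are illustrative.

def gini(values):
    n = len(values)
    mean = sum(values) / n
    total_diff = sum(abs(x - y) for x in values for y in values)
    return total_diff / (2 * n * n * mean)

print(gini([10, 10, 10, 10]))   # 0.0: strict equality
print(gini([0, 0, 0, 40]))      # 0.75: one unit holds everything
```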
Fig. 1. Institutional prestige predicts early-career productivity and prominence of computer science faculty. Shown are median publication (left axis) and log10 citation (right axis) counts per faculty per institution (minimum three faculty per institution), accumulated through their first 10 years posthire, adjusted for growth in publication rates over time (SI Appendix, section A). Shaded regions denote 95% confidence intervals for least squares regression.
Replacing prestige with the 2010 departmental rankings by U.S. News & World Report in our analysis produces similar results (see SI Appendix, section B and Fig. S1).
The annual faculty job market generates two kinds of quasi-natural experiments: It colocates at the same institution individuals who trained at more or less prestigious institutions than each other (Fig. 2, Top Left), and it separates individuals with similar training into faculty appointments at more or less prestigious institutions than each other (Fig. 2, Bottom Left). To isolate the effect of prestige differences on posthire productivity and prominence in each case, we combine exact and caliper matching techniques to mitigate the confounding effects of differences in age, gender, subfield productivity norms, and postdoctoral training (see SI Appendix, section B). If where an individual trained determines their early-career scholarly output, individuals with more-prestigious training should be, on average, more productive and more prominent than colocated peers with less prestigious training. On the other hand, if where an individual works determines their early-career scholarly output, individuals with appointments at more-prestigious institutions should be more productive and more prominent than similarly trained peers with appointments at less prestigious institutions.
Results

For matched pairs of faculty with appointments at similarly prestigious institutions, the individual with the more prestigious training was not more productive in the first 5 years posthire (N = 359 pairs; p = 0.59, t test) but received, on average, 301 more citations (N = 129 pairs; p < 0.05, t test) during this period (Fig. 2 A and B). Among the pairs, the individual with more-prestigious training was more productive in 52.1% (p = 0.23; one-tailed binomial test) of trials but more highly cited in 63.9% (p < 0.005; one-tailed binomial test).
In contrast, for matched pairs of faculty with similarly prestigious training and with similar prehire productivity and prominence (publications, N = 194 pairs; citations, N = 194; see Fig. 2 C and D), the individual with the more prestigious appointment produced, on average, 5.1 more papers in the first 5 years posthire (p < 0.005, t test), with 57.4% of trials exhibiting an advantage of any magnitude (p < 0.05, binomial test) and significant differences in years y ∈ {1, 2, 4, 5} (p < 0.05, t test). Similarly, individuals with the more prestigious appointment received, on average, 344 more citations in this period (p < 0.001, t test), although the median difference was a more modest 112 additional citations. For context, faculty at the top 20% of institutions by prestige produced, on average, 17 more publications in their first 5 years and received 824 more citations than faculty at the bottom 20% of institutions, and they produced 9 more publications and received 543 more citations than faculty at the middle 20% of institutions.
Hence, conditioned on an individual holding a faculty position somewhere, we find no evidence that training at a prestigious
[Fig. 2 panels: yearly differences in publications and citations, pre- and posthire, plotted against publishing year relative to initial placement (years −5 to +5). Panels A and B compare the person with more prestigious training vs. the person with less prestigious training, matched on work environment; panels C and D compare the person with more prestigious appointment vs. the person with less prestigious appointment, matched on training environment. *Faculty also matched on gender, subfield, and other features; see main text for full details.]
Fig. 2. Early-career productivity is driven by work environment prestige. For pairs of computer science faculty matched by (A and B) work environment prestige or (C and D) training environment prestige, (A) publication and (B) citation counts are statistically independent of differences in doctoral prestige but are driven higher by (C and D) placing into a more prestigious work environment. Shaded regions denote 95% confidence intervals for the mean. Similar results are obtained using U.S. News & World Report department rankings in place of prestige (see SI Appendix, Fig. S1).
10730 | www.pnas.org/cgi/doi/10.1073/pnas.1817431116 Way et al.
Fig. 1. Institutional prestige predicts early-career productivity and prominence of computer science faculty. Shown are median publication (left axis) and log10 citation (right axis) counts per faculty per institution (minimum three faculty per institution), accumulated through their first 10 years posthire, adjusted for growth in publication rates over time (SI Appendix, section A). Shaded regions denote 95% confidence intervals for least squares regression.
Replacing prestige with the 2010 departmental rankings by U.S. News & World Report in our analysis produces similar results (see SI Appendix, section B and Fig. S1).
reasons: (i) These particular metadata are irrelevant to the structure of the network, (ii) the detected communities and the metadata capture different aspects of the network's structure, (iii) the network contains no communities as in a simple random graph (7) or a network that is sufficiently sparse that its communities are not detectable (8), or (iv) the community detection algorithm performed poorly.
In the above, we refer to the observed network and metadata and note that noise in either could lead to one of the reasons above. For instance, measurement error of the network structure may make our observations unreliable and, in extreme cases, can obscure the community structure entirely, resulting in case (iii). It is also possible that human errors are introduced when handling the data, exemplified by the widely used American college football network (9) of teams that played each other in one season, whose associated metadata representing each team's conference assignment were collected during a different season (10). Large errors in the metadata can render them irrelevant to the network [case (i)].
Most work on community detection assumes that failure to find communities that correlate with metadata implies case (iv), algorithm failure, although some critical work has focused on case (iii), difficult or impossible to recover communities. The lack of consideration for cases (i) and (ii) suggests the possibility of selection bias in the published literature in this area [a point recently suggested by Hric et al. (11)]. Recent critiques of the general utility of community detection in networks (11–13) can be viewed as a side effect of confusion about the role of metadata in evaluating algorithm results. For these reasons, using metadata to assess the performance of community detection algorithms can lead to errors of interpretation, false comparisons between methods, and oversights of alternative patterns and explanations, including those that do not correlate with the known metadata.
For example, Zachary's Karate Club (14) is a small real-world network with compelling metadata frequently used to demonstrate community detection algorithms. The network represents the observed social interactions of 34 members of a karate club. At the time of the study, the club fell into a political dispute and split into two factions. These faction labels are the metadata commonly used as ground truth communities in evaluating community detection methods. However, it is worth noting at this point that Zachary's original network and metadata differ from those commonly used for community detection (9). Links in the original network were weighted by the different types of social interaction that Zachary observed. Zachary also recorded two metadata attributes: the political leaning of each of the members (strong, weak, or neutral support for one of the factions) and the faction they ultimately joined after the split. However, the community detection literature uses only the metadata representing the faction each node joined, often with one of the nodes mislabeled. This node ("Person number 9") supported the president during the dispute but joined the instructor's faction because joining the president's faction would have involved retraining as a novice when he was only 2 weeks away from taking his black belt exam.
The division of the Karate Club nodes into factions is not the only scientifically reasonable way to partition the network. Figure 1 shows the log-likelihood landscape for a large number of two-group partitions (embedded in two dimensions for visualization) of the Karate Club, under the stochastic blockmodel (SBM) for community detection (15, 16). Partitions that are similar to each other are embedded nearby in the horizontal coordinates, meaning that the two broad peaks in the landscape represent two distinct sets of high-likelihood partitions: one centered around the faction division and one that divides the network into leaders and followers. Other common approaches to community detection (9, 17) suggest that the best divisions of this network have more than two communities (10, 18). The multiplicity and diversity of good partitions illustrate the ambiguous status of the faction metadata as a desirable target.
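The score that Fig. 1 maps can be computed directly for any single bipartition; here is a minimal sketch of the log-likelihood under a plain (non-degree-corrected) Bernoulli SBM, using NetworkX's built-in copy of the Karate Club. The degree-10 threshold defining "leaders" is an arbitrary illustrative choice, not the paper's embedding method:

```python
import math
import networkx as nx

def sbm_log_likelihood(G, partition):
    """Maximized log-likelihood of a Bernoulli stochastic blockmodel for
    a fixed node partition: each block pair (r, s) has an independent
    edge probability, set to its MLE m_rs / n_rs."""
    groups = {}
    for node, g in partition.items():
        groups.setdefault(g, []).append(node)
    labels = sorted(groups)
    logL = 0.0
    for i, r in enumerate(labels):
        for s in labels[i:]:
            if r == s:
                nodes = groups[r]
                n_pairs = len(nodes) * (len(nodes) - 1) // 2
                m = G.subgraph(nodes).number_of_edges()
            else:
                n_pairs = len(groups[r]) * len(groups[s])
                m = sum(1 for u in groups[r] for v in groups[s]
                        if G.has_edge(u, v))
            if 0 < m < n_pairs:  # m = 0 or m = n_pairs contributes 0
                p = m / n_pairs
                logL += m * math.log(p) + (n_pairs - m) * math.log(1 - p)
    return logL

G = nx.karate_club_graph()
faction = {v: G.nodes[v]["club"] for v in G}       # metadata partition
leaders = {v: int(G.degree(v) >= 10) for v in G}   # crude degree split
print(sbm_log_likelihood(G, faction), sbm_log_likelihood(G, leaders))
```

Ranking candidate partitions by such scores is what the two peaks in Fig. 1 summarize, although the figure uses the paper's own SBM fit and embedding.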
The Karate Club network is among many examples for which standard community detection methods return communities that either subdivide the metadata partition (19) or do not correlate with the metadata at all (20, 21). More generally, most real-world networks have many good partitions, and there are many plausible ways to sort all partitions to find good ones, sometimes leading to a large number of reasonable results. Moreover, there is no consensus on which method to use on which type of network (21, 22).
In what follows, we explore both the theoretical origins of these problems and the practical means to address the confounding cases described above. To do so, we make use of a generative model perspective of community detection. In this perspective, we describe the relationship between community assignments C and graphs G via a joint distribution P(C, G) over all possible community assignments and graphs that we may observe. We take this perspective because it provides a precise and interpretable description of the relationship between communities and network structure. Although generative models, like the SBM, describe the relationship between networks and communities directly via a mathematically explicit expression for P(C, G), other methods for community detection nevertheless maintain an implicit relationship between network structure and community assignment. Hence, the theorems we present, as well as their implications, are more generally applicable across all methods of community detection.
In the next section, we present rigorous theoretical results with direct implications for cases (i) and (iv), whereas the remaining sections introduce two statistical methods for addressing cases (i) and (ii). These contributions do not address case (iii), when there is no structure to be found, which has been previously explored by other authors, for example, for the SBM (8, 23–27) and modularity (28, 29).
Fig. 1. The stochastic blockmodel log-likelihood surface for bipartitions of the Karate Club network (14). The high-dimensional space of all possible bipartitions of the network has been projected onto the x, y plane (using a method described in Supplementary Text D.4) such that points representing similar partitions are closer together. The surface shows two distinct peaks that represent scientifically reasonable partitions. The lower peak corresponds to the social group partition given by the metadata (often treated as ground truth), whereas the higher peak corresponds to a leader-follower partition.
SCIENCE ADVANCES | RESEARCH ARTICLE
Peel, Larremore, Clauset, Sci. Adv. 2017;3 : e1602548 3 May 2017 2 of 8
this division 65% of the time, a relatively weak level of correlation, not far above the 50% of completely uncorrelated data. Nonetheless, as shown in Fig. 1b, this is enough for the algorithm to reliably find the correct division of the network in almost every case (98% of the time in our tests). Without the metadata, by contrast, we succeed only 6% of the time. Some practical applications of this ability to select among competing divisions are given in the next section.
Real-world networks. In this section we describe applications of our method to a range of real-world networks, drawn from social, biological and technological domains.
For our first application we analyse a network of school students, drawn from the US National Longitudinal Study of Adolescent Health. The network represents patterns of friendship, established by survey, among the 795 students in a medium-sized American high school (US grades 9–12, ages 14–18 years) and its feeder middle school (grades 7 and 8, ages 12–14 years).
Given that this network combines middle and high schools, it comes as no surprise that there is a clear division (previously documented) into two network communities corresponding roughly to the two schools. Previous work, however, has also shown the presence of divisions by ethnicity31. Our method allows us to select between divisions by using metadata that correlate with the one we are interested in.
Figure 2 shows the results of applying our algorithm to the network three times. Each time, we asked the algorithm to divide the network into two communities. In Fig. 2a, we used the six school grades as metadata and the algorithm readily identifies a division into grades 7 and 8 on the one hand and grades 9–12 on the other; that is, the division into middle school and high school. In Fig. 2b, by contrast, we used the students' self-identified ethnicity as metadata, which in this data set takes one of four values: white, black, hispanic, or other (plus a small number of nodes with missing data). Now the algorithm finds a completely different division into two groups, one group consisting principally of black students and one of white. (The small number of remaining students are distributed roughly evenly between the groups.)
One might be concerned that in these examples the algorithm is mainly following the metadata to determine community membership, and ignoring the network structure. To test for this possibility, we performed a third analysis, using gender as metadata. When we do this, as shown in Fig. 2c, the algorithm does not find a division into male and female groups. Instead, it finds a new division that is a hybrid of the grade and ethnicity divisions (white high-school students in one group and everyone else in the other). That is, the algorithm has ignored the gender metadata, because there was no good network division that correlated with it, and instead found a division based on the network structure alone. The algorithm makes use of the metadata only when doing so improves the quality of the network division (in the sense of the maximum-likelihood fit described in the Methods section).
The extent to which the communities found by our algorithm match the metadata (or any other 'ground truth' variable) can be quantified by calculating a normalized mutual information (NMI)32,33, as described in the Methods section. NMI ranges in value from 0 when the metadata are uninformative about the communities to 1 when the metadata specify the communities completely. The divisions shown in Fig. 2a,b have NMI scores of 0.881 and 0.820, respectively, indicating that the metadata are strongly though not perfectly correlated with community membership. By contrast, the division in Fig. 2c, where gender was used as metadata, has an NMI score of 0.003, indicating that the metadata contain essentially zero information about the communities.
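NMI takes only a few lines to compute; the sketch below uses the arithmetic-mean normalization, NMI = 2 I(A;B) / (H(A) + H(B)), which is one of several variants (refs. 32, 33; the paper's exact normalization may differ), and the labelings are toy examples, not the school data:

```python
import math
from collections import Counter

def nmi(labels_a, labels_b):
    """Normalized mutual information between two labelings, with the
    arithmetic-mean normalization 2*I(A;B) / (H(A) + H(B))."""
    n = len(labels_a)
    pa, pb = Counter(labels_a), Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    info = sum((c / n) * math.log((c / n) / ((pa[a] / n) * (pb[b] / n)))
               for (a, b), c in joint.items())
    entropy = lambda p: -sum((c / n) * math.log(c / n) for c in p.values())
    denom = entropy(pa) + entropy(pb)
    return 2 * info / denom if denom > 0 else 1.0

metadata  = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
relabeled = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]   # identical partition, new labels
unrelated = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]
print(nmi(metadata, relabeled))  # → 1.0 (NMI is label-invariant)
print(nmi(metadata, unrelated))  # small: essentially uninformative
```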
Our next application is to an ecological network, a food web of predator–prey interactions between 488 marine species living in the Weddell Sea, a large bay off the coast of Antarctica34,35. A number of different metadata are available for these species, including feeding mode (deposit feeder, suspension feeder, scavenger and so on), zone within the ocean (benthic, pelagic and so on) and others. In our analysis, however, we focus on one in particular, the average adult body mass. Body masses of species in this ecosystem have a wide range, from microorganisms weighing nanograms or less to hundreds of tonnes for the largest whales.
Figure 2 | Communities found in a high school friendship network with various types of metadata. Three divisions of a school friendship network, using as metadata (a) school grade, (b) ethnicity and (c) gender.
ARTICLE NATURE COMMUNICATIONS | DOI: 10.1038/ncomms11863
4 NATURE COMMUNICATIONS | 7:11863 | DOI: 10.1038/ncomms11863 | www.nature.com/naturecommunications
Identifying a power law in the distribution of an empirical quantity can indicate the presence of exotic underlying mechanisms, including nonlinearities, feedback loops, and network effects (33, 34), although not always (36), and power laws are believed to occur broadly in complex social, technological, and biological systems (37). For instance, the intensities or sizes of many natural disasters, such as earthquakes, forest fires, and floods (34, 38, 39), as well as many social disasters, such as riots and terrorist attacks (35, 40), are well described by power laws.
However, it can be difficult to accurately characterize the shape of a distribution that follows a power-law pattern (37). Fluctuations in heavy-tailed data are greatest in the distribution's upper tail, which governs the frequency of the largest and rarest events. As a result, data tend to be sparsest precisely where the greatest precision in model estimates is desired.
Recent interest in heavy-tailed distributions has led to the development of more rigorous methods to identify and estimate power-law distributions in empirical data (37, 41, 42), to compare different models of the upper tail's shape (37), and to make principled statistical forecasts of future events (43). This branch of statistical methodology is related to but distinct from the task of estimating the distribution of maxima within a sample (44, 45) and is more closely related to the peaks-over-threshold literature in seismology, forestry, hydrology, insurance, and finance (41, 42, 45–48).
Although Poisson processes pose fewer statistical concerns than power-law distributions, a similar statistical approach is used in the analysis here of both war sizes and years between war onsets. In particular, an ensemble approach is used (43) on the basis of a standard nonparametric bootstrap procedure (49) that simulates the generative process of events to produce a series of synthetic data sets {Y} with similar statistical structure as the empirical data X. Fitting a semiparametric model Pr(y|θ) to each Y yields an ensemble of models {θ} that incorporate the empirical data's inherent variability into a distribution of estimated parameters. This distribution is then used to weight models by their likelihood under the bootstrap distribution and to numerically estimate the likelihood of specific historical or future patterns (43).
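The bootstrap-ensemble idea can be sketched with synthetic data; this toy version uses Pareto draws standing in for war sizes and a plain parametric refit, whereas the paper simulates the full event-generating process:

```python
import numpy as np

rng = np.random.default_rng(42)

def mle_alpha(x, xmin):
    """Continuous power-law MLE ("Hill estimator") for the tail x >= xmin."""
    tail = x[x >= xmin]
    return 1.0 + tail.size / np.log(tail / xmin).sum()

# Synthetic heavy-tailed "war sizes": 95 Pareto draws with alpha = 1.53
# above xmin = 7061 (illustrative, not the Correlates of War data).
xmin = 7061.0
data = xmin * (1.0 - rng.random(95)) ** (-1.0 / 0.53)

# Nonparametric bootstrap: resample events with replacement, refit, and
# collect an ensemble {alpha} reflecting the data's inherent variability.
ensemble = np.array([
    mle_alpha(rng.choice(data, size=data.size, replace=True), xmin)
    for _ in range(1000)
])
lo, hi = np.percentile(ensemble, [2.5, 97.5])
print(f"alpha = {mle_alpha(data, xmin):.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```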
Within the 1823–2003 time period, the end of the Second World War in 1945 is widely viewed as the most plausible change point in the underlying dynamics of the conflict-generating process for wars and marks the beginning of the subsequent long peace pattern (10). Determining whether 1945 marks a genuine shift in the observed statistics of wars and, hence, whether the long peace is plausibly a trend or a fluctuation represents a broad test of the stationary hypothesis of war (23). Evaluating other theoretically plausible change points in these data is left for future work.
Finally, some studies choose to limit or normalize war onset counts or war sizes (battle death counts) by a reference population. For instance, onset counts can be normalized by assuming that war is a dyadic event and that dyads independently generate conflicts (26), implying a normalization that grows quadratically with the number of nations. However, considerable evidence indicates that dyads do not independently generate conflicts (9, 16–21). Similarly, limiting the analysis to conflicts among "major powers" introduces subjectivity in defining such a scope, and there is not a clear consensus about the details, for example, when and whether to include China or the occupied European nations, or certain wars, such as the Korean War (26). War size can be normalized by assuming that individuals contribute independently to total violence, which implies a normalization that depends on either the population of the combatant nations (a variable sometimes called war "intensity") or of the world (3, 23). However, there is little evidence for this assumption (3, 50), although such a per capita variable may be useful for other reasons. In the analysis performed here, war variables are analyzed in their unnormalized forms, and all recorded interstate wars are considered. The analysis is thus at the level of the entire world, and results are about absolute counts.
RESULTS

The sizes of wars

Considering the sizes of wars alone necessarily ignores other characteristics of conflicts, including their relative timing, which may contain independent signals about trends. A pattern in war sizes alone thus says little about changes in declared reasons for conflicts, the way they are fought, their settlements, aftermaths, or relationships to other conflicts past or future, or the number of nations worldwide, among other factors. One benefit of ignoring these factors, at least at first, is that they may be irrelevant for identifying an overall trend in wars, and their relationship to a trend can be explored subsequently. Hence, focusing narrowly on war sizes simplifies the range of models to consider and may improve the ability to detect a subtle trend.
The Correlates of War data set includes 95 interstate wars, the absolute sizes of which range from 1000 (the minimum size by definition) to 16,634,907 (the recorded battle deaths of the Second World War) (Fig. 2). The estimated power-law model has two parameters: xmin, which represents the smallest value above which the power-law pattern holds, and α, the scaling parameter. Standard techniques are used to estimate model parameters and model plausibility (section S1) (37).
The maximum likelihood power-law parameter is α = 1.53 ± 0.07 for wars with severity x ≥ xmin = 7061 (Fig. 2, inset), and 95% of the bootstrap distribution of α falls within the interval [1.37, 1.76]. However, these estimates do not indicate that the observed data are a plausible independent and identically distributed (iid) draw from the fitted model. To quantitatively assess this aspect of the model, we used an appropriately defined statistical hypothesis test (37), which indicates that a power-law distribution cannot be rejected
[Fig. 2 axes: battle deaths x (10^3 to 10^8) vs. fraction of wars with at least x deaths (10^-2 to 10^0), with quartiles marked at x0.25, x0.50, x0.75; inset: density vs. power-law exponent α (1.2 to 1.8).]
Fig. 2. Interstate war sizes, 1823–2003. The maximum likelihood power-law model of the largest-severity wars (solid line, α = 1.53 ± 0.07 for x ≥ xmin = 7061) is a plausible data-generating process of the empirical severities (Monte Carlo, pKS = 0.78 ± 0.03). For reference, distribution quartiles are marked by vertical dashed lines. Inset: Bootstrap distribution of maximum likelihood parameters Pr(α), with the empirical value (black line).
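The Monte Carlo goodness-of-fit test used for pKS can be sketched as follows; this is a simplified toy version (xmin is held fixed and the data are synthetic, whereas the full procedure of ref. 37 re-estimates xmin for every synthetic sample):

```python
import numpy as np

rng = np.random.default_rng(1)

def fit_alpha(x, xmin):
    """Continuous power-law MLE for the tail x >= xmin."""
    tail = x[x >= xmin]
    return 1.0 + tail.size / np.log(tail / xmin).sum()

def ks_distance(x, xmin, alpha):
    """Simplified KS distance between the empirical tail CDF and the
    fitted power-law CDF, 1 - (x / xmin)^(1 - alpha)."""
    tail = np.sort(x[x >= xmin])
    fitted = 1.0 - (tail / xmin) ** (1.0 - alpha)
    empirical = np.arange(1, tail.size + 1) / tail.size
    return np.abs(empirical - fitted).max()

def plausibility(x, xmin, n_sims=200):
    """Fraction of synthetic power-law samples whose refitted KS distance
    exceeds the empirical one; p >= 0.1 is the usual threshold below
    which the power law would be rejected."""
    alpha = fit_alpha(x, xmin)
    d_obs = ks_distance(x, xmin, alpha)
    n = int((x >= xmin).sum())
    exceed = 0
    for _ in range(n_sims):
        synth = xmin * (1.0 - rng.random(n)) ** (-1.0 / (alpha - 1.0))
        exceed += ks_distance(synth, xmin, fit_alpha(synth, xmin)) >= d_obs
    return exceed / n_sims

x = 7061.0 * (1.0 - rng.random(50)) ** (-1.0 / 0.53)   # synthetic tail
print(f"p = {plausibility(x, 7061.0):.2f}")
```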
Clauset, Sci. Adv. 2018;4 : eaao3580 21 February 2018 3 of 9
SAFE LEADS AND LEAD CHANGES IN COMPETITIVE . . . PHYSICAL REVIEW E 91, 062815 (2015)
FIG. 11. (Color online) Probability that a lead is safe versus the dimensionless lead size z = L/√(4Dτ) for NBA games, showing the prediction from Eq. (15), the empirical data, and the mean prediction for Bill James' well-known "safe lead" heuristic.
probabilities only for dimensionless leads z > 2) and has the wrong qualitative dependence on z. In contrast, the random walk model gives a maximal overestimate of 6.2% for the safe lead probability over all z, and has the same qualitative z dependence as the empirical data.
For completeness, we extend the derivation for the safe lead probability to unequal-strength teams by including the effect of a bias velocity v in Eq. (14):

$$Q(L,\tau) = 1 - \int_0^{\tau} \frac{L}{\sqrt{4\pi D t^{3}}}\, e^{-(L+vt)^2/4Dt}\, dt = 1 - e^{-vL/2D} \int_0^{\tau} \frac{L}{\sqrt{4\pi D t^{3}}}\, e^{-L^2/4Dt - v^2 t/4D}\, dt, \qquad (16)$$

where the integrand in the first line is the first-passage probability for nonzero bias. Substituting $u = L/\sqrt{4Dt}$ and using again the Peclet number $\mathrm{Pe} = vL/2D$, the result is

$$Q(L,\tau) = 1 - \frac{2}{\sqrt{\pi}}\, e^{-\mathrm{Pe}} \int_z^{\infty} e^{-u^2 - \mathrm{Pe}^2/4u^2}\, du = 1 - \frac{1}{2}\left[ e^{-2\mathrm{Pe}}\, \mathrm{erfc}\!\left(z - \frac{\mathrm{Pe}}{2z}\right) + \mathrm{erfc}\!\left(z + \frac{\mathrm{Pe}}{2z}\right) \right]. \qquad (17)$$

When the stronger team is leading (Pe > 0), essentially any lead is safe for Pe ≳ 1, while for Pe < 1, the safety of a lead depends more sensitively on z [Fig. 12(a)]. Conversely, if the weaker team happens to be leading (Pe < 0), then the lead has to be substantial or the time remaining quite short for the lead to be safe [Fig. 12(b)]. In this regime, the asymptotics of the error function gives $Q(L,\tau) \sim e^{-\mathrm{Pe}^2/4z^2}$ for $z < |\mathrm{Pe}|/2$, which is vanishingly small. For values of z in this range, the lead is essentially never safe.
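Eq. (17) is straightforward to evaluate with the standard library, and the closed form can be cross-checked against direct numerical integration of the substituted integral; a minimal sketch:

```python
import math

def safe_lead_prob(z, pe):
    """Closed-form Eq. (17); setting Pe = 0 recovers Q = erf(z)."""
    return 1.0 - 0.5 * (math.exp(-2.0 * pe) * math.erfc(z - pe / (2.0 * z))
                        + math.erfc(z + pe / (2.0 * z)))

def safe_lead_prob_numeric(z, pe, n=200000):
    """Midpoint-rule cross-check of the integral form:
    Q = 1 - (2/sqrt(pi)) e^{-Pe} * integral_z^inf e^{-u^2 - Pe^2/(4u^2)} du."""
    upper = z + 12.0  # the integrand is negligible beyond this point
    du = (upper - z) / n
    total = sum(math.exp(-(z + (i + 0.5) * du) ** 2
                         - pe ** 2 / (4.0 * (z + (i + 0.5) * du) ** 2))
                for i in range(n))
    return 1.0 - (2.0 / math.sqrt(math.pi)) * math.exp(-pe) * total * du

for z, pe in [(1.0, 0.0), (1.0, 1.0), (0.5, -0.8)]:
    print(z, pe, round(safe_lead_prob(z, pe), 6),
          round(safe_lead_prob_numeric(z, pe), 6))
```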
VI. LEAD CHANGES IN OTHER SPORTS
We now consider whether our predictions for lead change statistics in basketball extend to other sports, such as college
FIG. 12. (Color online) Probability that a lead is safe versus z = L/√(4Dτ) for (a) the stronger team leading, with Pe = 1/5, 1/2, and 1 (progressively flatter curves), and (b) the weaker team leading, with Pe = −2/5, −4/5, and −6/5 (progressively shifting to the right). The case Pe = 0 is also shown for comparison.
American football (CFB), professional American football (NFL), and professional hockey (NHL) [43]. These sports have the following commonalities with basketball [19]:
(1) Two teams compete for a fixed time T, in which points are scored by moving a ball or puck into a special zone in the field.
(2) Each team accumulates points during the game and the team with the largest final score is the winner (with sport-specific tiebreaking rules).
(3) A roughly constant scoring rate throughout the game, except for small deviations at the start and end of each scoring period.
(4) Negligible temporal correlations between successive scoring events.
(5) Intrinsically different team strengths.
(6) Scoring antipersistence, except for hockey.
These similarities suggest that a random-walk model should also apply to lead change dynamics in these sports (Fig. 13).

However, there are also points of departure, the most important of which is that the scoring rate in these sports is between 10 and 25 times smaller than in basketball. Because of this much lower overall scoring rate, the diminished rate at the start of games is much more apparent than in basketball (Fig. 14). This longer low-activity initial period and other non-random-walk mechanisms cause the distributions L(t) and
typically been omitted from previous analyses), for ectothermic species, etc.

We resolve several of these questions by testing the tradeoff theory's ability to explain the observed body size distribution of cetaceans, the largest and most diverse marine mammal clade. Cetaceans are an ideal test case for the theory. First, Cetacea is a sufficiently speciose clade (77 extant species) to allow a quantitative comparison of predicted and observed distributions. Sirenia, the only other fully aquatic mammal clade, contains four extant species, which is too small for a productive comparison. Second, semiaquatic groups like Pinnipeds (seals and walruses) and Mustelids (otters) cannot be used to test the theory because they spend significant time on land, thus avoiding the hard thermoregulatory constraint assumed by the theory. Thus, by focusing on cetaceans, we provide a reasonable test of the theory. Third, fully aquatic mammals like cetaceans have typically been omitted in past studies because their marine habitat induces a different lower limit on mass than is seen in terrestrial mammals. As a result, it remains unknown whether the theory extends to all mammals, or only those in terrestrial environments. Finally, cetacean body masses do indeed exhibit the canonical right-skewed pattern (Fig. 1): the median size (356 kg, Tursiops truncatus) is close to the smallest (37.5 kg, Pontoporia blainvillei) but far from the largest (175,000 kg). This suggests that the theory may indeed hold for them.

Here, we test the strongest possible form of the macroevolutionary tradeoff theory for cetacean sizes. Instead of estimating model parameters from cetacean data, we combine parameters estimated from terrestrial mammals with a theoretically determined choice for the lower limit on cetacean species body mass. The resulting model has no tunable parameters by which to adjust its predicted distribution. In this way, we answer the question of how large a whale should be: if the predicted distribution agrees with the observed sizes, the same short-term versus long-term tradeoff that determines the sizes of terrestrial mammals also determines the sizes of whales.

We find that this zero-parameter model provides a highly accurate prediction of cetacean sizes. Thus, a single universal tradeoff mechanism appears to explain the body sizes of all mammal species, but this mechanism must obey the thermoregulatory limits imposed by the environment in which it unfolds. It is this one difference (thermoregulation in air for terrestrial mammals and in water for aquatic mammals) that explains the different locations of their respective body size distributions. Energetic constraints, while a popular historical explanation for sizes, seem to be only part of the puzzle for understanding the distribution of species sizes. Under this macroevolutionary mechanism, the size of the largest observed species is set by the tradeoff between the extinction probability at large sizes and the rate at which smaller species evolve to larger body masses, both of which may depend partly on energetic and ecological factors.
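The cladogenetic diffusion mechanism described above (and in Fig. 2B) can be sketched as a toy simulation; every rate and the multiplicative-step distribution below are illustrative assumptions, not the paper's fitted parameters:

```python
import math
import random

random.seed(7)

def simulate_masses(n_species=200, n_steps=10000, m_min=1.0):
    """Toy cladogenetic diffusion. Each step, one species goes extinct
    (risk grows slowly, here logarithmically, with mass) and is replaced
    by a descendant of a randomly chosen species; the descendant's mass
    is its ancestor's mass times a random multiplicative factor,
    reflected at the minimum viable size m_min."""
    masses = [2.0 * m_min] * n_species
    for _ in range(n_steps):
        weights = [1.0 + 0.3 * math.log(m / m_min) for m in masses]
        dead = random.choices(range(n_species), weights=weights, k=1)[0]
        parent = masses[random.randrange(n_species)]
        masses[dead] = max(m_min, parent * random.lognormvariate(0.05, 0.3))
    return sorted(masses)

masses = simulate_masses()
print(f"min = {masses[0]:.1f}, median = {masses[100]:.1f}, "
      f"max = {masses[-1]:.1f}")
```

With these toy rates the sampled masses show the qualitative shape of Fig. 2A (a reflecting floor at m_min and a long right tail), though the exact distribution depends entirely on the assumed rates.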
Figure 1. Terrestrial and fully aquatic mammal species mass distributions. Both show the canonical asymmetric pattern: the median size is flanked by a short left-tail down to a minimum viable size and a long right-tail out to a few extremely large species. doi:10.1371/journal.pone.0053967.g001
Figure 2. Characteristic species size pattern and cladogenetic diffusion model. (A) The characteristic distribution of species body sizes, observed in most major animal groups. Macroevolutionary tradeoffs between short-term selective advantages and long-term extinction risks, constrained by a minimum viable size Mmin, produce the distribution's long right-tail. (B) Schematic illustrating the cladogenetic diffusion model of species body-size evolution: a descendant species' mass is related to its ancestor's size M by a random multiplicative factor λ. Species become extinct with a probability that grows slowly with M. doi:10.1371/journal.pone.0053967.g002
How Large Should Whales Be?
PLOS ONE | www.plosone.org 2 January 2013 | Volume 8 | Issue 1 | e53967