Network Analysis & Modeling


Transcript of Network Analysis & Modeling

Page 1: Network Analysis & Modeling

Network Analysis & Modeling

Prof. Aaron Clauset Computer Science & BioFrontiers Institute @aaronclauset [email protected]

lecture 0: what are networks and how do we talk about them?

Page 2: Network Analysis & Modeling

who are network scientists?

• Physicists
• Computer Scientists
• Applied Mathematicians
• Statisticians
• Biologists
• Ecologists
• Sociologists
• Political Scientists

it's a big community!
• different traditions
• different tools
• different questions

increasingly, not ONE community, but MANY, only loosely interacting communities

Page 3: Network Analysis & Modeling

who are network scientists?

• Physicists: phase transitions, universality
• Computer Scientists: data / algorithm oriented, predictions
• Applied Mathematicians: dynamical systems, diff. eq.
• Statisticians: inference, consistency, covariates
• Biologists: experiments, causality, molecules
• Ecologists: observation, experiments, species
• Sociologists: individuals, differences, causality
• Political Scientists: rationality, influence, conflict

Page 4: Network Analysis & Modeling

what are networks?

an approach: a mathematical representation that provides structure to complexity.

structure that exists above individuals / components
or: structure that exists below the system / population

[Diagram: system / population at the top, individuals / components at the bottom]

Page 5: Network Analysis & Modeling

tools and resources

Software
• R
• Python
• Matlab
• NetworkX [python]
• graph-tool [python, c++]
• GraphLab [python, c++]

Standalone editors
• UCINet
• NodeXL
• Gephi
• Pajek
• Network Workbench
• Cytoscape
• yEd graph editor
• Graphviz

Network data sets
• Colorado Index of Complex Networks

Page 6: Network Analysis & Modeling

learning goals

1. develop a network intuition for reasoning about network phenomena
2. understand network representations, basic terminology, and concepts
3. learn principles and methods for describing and clustering network data
4. learn to predict missing network information
5. understand how to conduct and interpret numerical network experiments, to explore and test hypotheses about networks
6. analyze and model real-world network data, using math and computation

Page 7: Network Analysis & Modeling

course format

• course meets in person in ECEE 283 + over Zoom

• lectures 2 times a week, some guest lectures and some class discussions

• biweekly problem sets (6 total)

• class project: proposal, presentation, final report

• all content via class Canvas (lecture notes, recordings, problem sets, submissions)

• see syllabus for all course policies

Page 8: Network Analysis & Modeling

course schedule

building intuition | basic concepts, tools | practical tools | advanced tools

week by week
1. fundamentals of networks
2. representations and summary statistics
3. simple random graphs
4. better random graphs
5. predicting missing node attributes
6. predicting missing links
7. community structure and mixing patterns
8. community structure models
9. spreading processes and cascades
10. spreading processes with structure (epidemics)
11. data incompleteness and sampling
12. ranking in networks
13. ethics and networks
14. student project presentations

Page 9: Network Analysis & Modeling

lessons learnedwhat’s difficult

1. students need to know many different things:

• some probability: Erdős-Rényi, configuration model, calculations
• some mathematics: physics-style calculations, phase transitions
• some statistics: basic data analysis, correlations, distributions
• some machine learning: prediction, likelihoods, features, estimation algorithms
• some programming: data wrangling, coding up measures and algorithms

2. can't teach all of these things to all types of students!
• vast amounts of advanced material in each of these directions
• students have little experience / intuition of what makes good science

Page 10: Network Analysis & Modeling

lessons learned: what works well

1. simple mathematical problems—build intuition & practice with concepts

[Diagram: two groups A and B of sizes nA and nB]

• calculate summary statistics
• clustering highly-structured networks
• derive mathematical relations
• spreading process on networks

Biological Networks, CSCI 3352, Lecture 7
Prof. Aaron Clauset, Spring 2021

and the rate equation for the number of infected individuals

$$\frac{di}{dt} = \beta i(1-i) - \gamma i \,. \qquad (3)$$

We can use this equation to answer a simple question: will a new epidemic spread?

In the beginning of an epidemic, i will be very small (few individuals infected, relative to N), and we want to know whether and how it will grow. Under this assumption, we make the approximation (1 − i) ≈ 1, which yields

$$\frac{di}{dt} \approx i(\beta - \gamma) \,. \qquad (4)$$

This is an ordinary differential equation, whose solution is a logistic function, given a fraction i₀ of initially infected individuals. That means the initial growth of i is exponential, growing like (R₀)^t = (β/γ)^t, and when β > γ, an epidemic in the compartment model will tend to grow (exponentially) because the compartment I's rate of in-flow exceeds its rate of out-flow. In contrast, when β < γ, the I compartment will tend to empty out more quickly than it fills up, and the epidemic dies out. When β/γ = 1, a value we call the epidemic threshold, we see wild fluctuations in whether the epidemic takes off or dies out.
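The threshold behavior in Eqs. (3)-(4) is easy to check numerically. A minimal sketch using simple Euler integration of the SIS rate equation; the parameter values are illustrative, not from the lecture:

```python
def simulate_sis(beta, gamma, i0=0.01, dt=0.01, steps=5000):
    """Euler-integrate di/dt = beta*i*(1 - i) - gamma*i and return the
    infected fraction i(t) at time t = dt * steps."""
    i = i0
    for _ in range(steps):
        i += dt * (beta * i * (1 - i) - gamma * i)
    return i

# Above the epidemic threshold (beta/gamma > 1) the infection persists,
# settling near the endemic level 1 - gamma/beta; below it, it dies out.
print(simulate_sis(beta=0.6, gamma=0.2))  # approaches 1 - 1/3 = 0.666...
print(simulate_sis(beta=0.2, gamma=0.6))  # decays toward 0
```

Starting the first run just below and just above the threshold is a quick way to see the qualitative change in behavior the lecture describes.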

1.2.2 SIS and SIR dynamics

The population's states S, I under the SIS model evolve like the figure above, on the left.² Because individuals can go back and forth between the two compartments, the relative proportion of individuals…

²Adapted from: https://en.wikiversity.org/wiki/File:Sissys.png

Biological Networks, CSCI 3352, Lecture 6
Prof. Aaron Clauset, Spring 2021

L_good = 0.043304…    L_bad = 0.000244…
ln L_good = −3.1395…  ln L_bad = −8.3178…

good partition (e_rs / n_r n_s):

        red   blue
red     3/3   1/9
blue    1/9   3/3

bad partition (e_rs / n_r n_s):

        red   blue
red     4/6   2/8
blue    2/8   1/1

We can measure how much better the "good" partition is than the "bad" one by computing e^(Δ ln L), which is their likelihood ratio (do you see why?). Plugging in our results shows that the good partition is exp(ln L_good − ln L_bad) ≈ 177 times more likely to generate the observed data than the "bad" partition. In other words, the good partition is a much better model of the data.
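The ratio quoted above is just the exponential of the difference of log-likelihoods; a quick arithmetic check:

```python
import math

ln_L_good = -3.1395
ln_L_bad = -8.3178

# Likelihood ratio: how many times more likely the "good" partition is
# to generate the observed network than the "bad" one.
ratio = math.exp(ln_L_good - ln_L_bad)
print(round(ratio))  # 177
```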

2.2 The likelihood of a DC-SBM

Recall that to specify a degree-corrected SBM, we need to choose the number of communities c, the partition of nodes into groups z, the expected degree sequence k, and the mixing matrix M.

Under the DC-SBM, the value we assign to any particular adjacency A_ij is a Poisson-distributed random variable with mean θ_i θ_j M_{z_i z_j}, where each θ_i is a node-specific model parameter that quantifies the fraction of the group z_i's total degree that belongs to node i (see Lecture 5). Hence, the likelihood function is

$$
L(G \mid z, \theta, M) = \prod_{i,j} \mathrm{Poisson}\!\left(\theta_i \theta_j M_{z_i z_j}\right)
= \prod_{i<j} \frac{\left(\theta_i \theta_j M_{z_i z_j}\right)^{A_{ij}}}{A_{ij}!}\, \exp\!\left(-\theta_i \theta_j M_{z_i z_j}\right)
\times \prod_{i} \frac{\left(\tfrac{1}{2}\theta_i^2 M_{z_i z_i}\right)^{A_{ii}/2}}{(A_{ii}/2)!}\, \exp\!\left(-\tfrac{1}{2}\theta_i^2 M_{z_i z_i}\right) \qquad (6)
$$

where the two parts are the likelihoods of the between-group and within-group adjacencies, respectively. These two parts appear because we assume an undirected network, and in the DC-SBM, we count stubs within groups but edges between groups (see Lecture 5).
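Eq. (6) translates almost directly into code, which can make the two products concrete. A sketch assuming an undirected multigraph given as a dense adjacency matrix A (with A[i][i] counting each self-loop twice) and all Poisson means positive; the toy inputs are hypothetical:

```python
import math

def poisson_logpmf(k, mu):
    # log Poisson(k; mu), assuming mu > 0
    return k * math.log(mu) - mu - math.lgamma(k + 1)

def dcsbm_log_likelihood(A, z, theta, M):
    """Log of the DC-SBM likelihood, Eq. (6): a Poisson term for each
    node pair i < j plus a self-loop term for each node i."""
    n = len(A)
    logL = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            logL += poisson_logpmf(A[i][j], theta[i] * theta[j] * M[z[i]][z[j]])
        logL += poisson_logpmf(A[i][i] // 2, 0.5 * theta[i] ** 2 * M[z[i]][z[i]])
    return logL

# toy example: two nodes in different groups joined by a double edge
A = [[0, 2], [2, 0]]
print(dcsbm_log_likelihood(A, z=[0, 1], theta=[1.0, 1.0], M=[[1.0, 3.0], [3.0, 1.0]]))
```

Working in log space avoids underflow, which matters because the likelihoods of even small networks are tiny products of many factors.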

Although it may seem rather complicated, we can substantially simplify Eq. (6) by first factoring…


Page 11: Network Analysis & Modeling

lessons learned: what works well

2. analyze real networks—test understanding & practice with implementing methods

what patterns really occur?

[Figure: mean geodesic path length vs. network size n (10² to 10⁵) for four university Facebook networks: USF, Haverford, Caltech, Penn]

[Figure: harmonic centrality by vertex label (1 to 34) for the karate club, real-world network vs. configuration model]

how much does randomness explain? (when is a pattern interesting?)

stochastic block models

Biological Networks, CSCI 3352, Lecture 6
Prof. Aaron Clauset, Spring 2021

given i and r can thus be done in time O(K(K + ⟨k⟩)). Because these computations can be done quickly for a reasonable number of communities, local vertex switching algorithms, such as single-vertex Monte Carlo, can be implemented easily. Monte Carlo, however, is slow, and we have found competitive results using a local heuristic algorithm similar in spirit to the Kernighan-Lin algorithm used in minimum-cut graph partitioning [27].

Briefly, in this algorithm we divide the network into some initial set of K communities at random. Then we repeatedly move a vertex from one group to another, selecting at each step the move that will most increase the objective function (or least decrease it if no increase is possible), subject to the restriction that each vertex may be moved only once. When all vertices have been moved, we inspect the states through which the system passed from start to end of the procedure, select the one with the highest objective score, and use this state as the starting point for a new iteration of the same procedure. When a complete such iteration passes without any increase in the objective function, the algorithm ends. As with many deterministic algorithms, we have found it helpful to run the calculation with several different random initial conditions and take the best result over all runs.

IV. RESULTS

We have tested the performance of the degree-corrected and uncorrected blockmodels in applications both to real-world networks with known community assignments and to a range of synthetic (i.e., computer-generated) networks. We evaluate performance by quantitative comparison of the community assignments found by the algorithms and the known assignments. As a metric for comparison we use the normalized mutual information, which is defined as follows [7]. Let n_rs be the number of vertices in community r in the inferred group assignment and in community s in the true assignment. Then define p(X = r, Y = s) = n_rs / n to be the joint probability that a randomly selected vertex is in r in the inferred assignment and s in the true assignment. Using this joint probability over the random variables X and Y, the normalized mutual information is

$$\mathrm{NMI}(X, Y) = \frac{2\,\mathrm{MI}(X, Y)}{H(X) + H(Y)} \,, \qquad (26)$$

where MI(X, Y) is the mutual information and H(Z) is the entropy of random variable Z. The normalized mutual information measures the similarity of the two community assignments and takes a value of one if the assignments are identical and zero if they are uncorrelated. A discussion of this and other measures can be found in Ref. [28].

(a) Without degree correction  (b) With degree correction

FIG. 1: Divisions of the karate club network found using the (a) uncorrected and (b) corrected blockmodels. The size of a vertex is proportional to its degree and vertex color reflects inferred group membership. The dashed line indicates the split observed in real life.

A. Empirical networks

We have tested our algorithms on real-world networks ranging in size from tens to tens of thousands of vertices. In networks with highly homogeneous degree distributions we find little difference in performance between the degree-corrected and uncorrected blockmodels, which is expected since for networks with uniform degrees the two models have the same likelihood up to an additive constant. Our primary concern, therefore, is with networks that have heterogeneous degree distributions, and we here give two examples that show the effects of heterogeneity clearly.

The first example, widely studied in the field, is the "karate club" network of Zachary [29]. This is a social network representing friendship patterns between the 34 members of a karate club at a US university. The club in question is known to have split into two different factions as a result of an internal dispute, and the members of each faction are known. It has been demonstrated that the factions can be extracted from a knowledge of the complete network by many community detection methods.

Applying our inference algorithms to this network, using…


[Figure: best z under the SBM (c = 2) vs. best z under the DC-SBM (c = 2)]

We can, but doing so via statistical inference requires accounting for the fact that changing c changes the "complexity" or flexibility of the model, because M has Θ(c²) entries. Likelihoods for different choices of c cannot be compared fairly because the "larger" model, with more parameters, will naturally tend to yield higher-likelihood models of the network. In the extreme case of c = n, the model can simply "memorize" the adjacency matrix via the Θ(n²) parameters in the mixing matrix A = M, placing every node in a group by itself.

A popular approach to making different choices of c comparable is to penalize or regularize the likelihood by some function f(c) that grows with c. That is, we impose a cost to using additional parameters, and then search for the parameterization that balances this cost against the improved fit to the network. There are many choices for f(c), and these go by names like Bayesian marginalization, Bayes factors, various information criteria (BIC, AIC, etc.), minimum description length (MDL) approaches, and more. We will not cover any of these techniques here.

2.4 Finding good partitions

The preceding sections simplified the task of finding a good decomposition of a network into communities. First, we reduced the problem to estimating parameters via statistical inference. Then, we reduced it further to searching over all partitions {z} to find a partition z that maximizes the SBM or DC-SBM log-likelihood of a network.

There are many ways we could perform such a search over partitions. These include powerful methods like Markov chain Monte Carlo, expectation-maximization (the "EM" algorithm), and belief propagation (also called "BP"), among many others. Here, we will learn about a locally greedy heuristic, which is a kind of generalization of the Kernighan-Lin algorithm developed in 1970 for solving the minimum-cut graph partitioning problem.¹² Like Kernighan-Lin, the locally greedy…

12Kernighan & Lin, Bell System Technical Journal 49, 291 (1970), https://archive.org/details/bstj49-2-291.

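The Kernighan-Lin-style procedure described in the excerpt can be sketched generically. This is a minimal illustration, not the lecture's implementation: `score` stands for any objective on a partition (e.g., an SBM log-likelihood), and the example objective here is a toy one.

```python
import random

def greedy_partition(n, K, score, seed=0):
    """Locally greedy heuristic: repeat sweeps in which every node is
    moved exactly once (taking the best available move each step, even if
    it hurts), restart from the best state seen during the sweep, and stop
    when a full sweep yields no improvement."""
    rng = random.Random(seed)
    z = [rng.randrange(K) for _ in range(n)]
    best = score(z)
    improved = True
    while improved:
        improved = False
        moved, states = set(), [(best, list(z))]
        while len(moved) < n:
            best_move = None
            for i in range(n):            # best single move among unmoved nodes
                if i in moved:
                    continue
                for g in range(K):
                    if g != z[i]:
                        old, z[i] = z[i], g
                        s = score(z)
                        z[i] = old
                        if best_move is None or s > best_move[0]:
                            best_move = (s, i, g)
            s, i, g = best_move
            z[i] = g
            moved.add(i)
            states.append((s, list(z)))
        s, zbest = max(states, key=lambda t: t[0])   # best state of the sweep
        z = zbest
        if s > best:
            best, improved = s, True
    return best, z

# toy objective: agreement with a known target partition
target = [0, 0, 0, 1, 1, 1]
score = lambda z: sum(zi == ti for zi, ti in zip(z, target))
print(greedy_partition(6, 2, score))  # (6, [0, 0, 0, 1, 1, 1])
```

Forcing every node to move once per sweep is what lets the heuristic climb out of some local optima that a purely greedy single-move search would get stuck in.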

Page 12: Network Analysis & Modeling

lessons learned: what works well

3. simple prediction tasks—test intuition & run numerical experiments

link prediction via heuristic

[Figure: fraction of correct label predictions vs. fraction of labels observed f, for malaria genes (HVR5) and Norwegian boards (net1m-2011-08-01)]

[Figure: AUC vs. fraction of edges observed f for the HVR5 malaria genes network, comparing degree product, Jaccard coefficient, shortest path, and a guessing baseline]

label prediction via homophily
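The heuristics in the figure's legend reduce to simple neighborhood computations. A sketch of two of them on a toy adjacency structure; the edge list is illustrative, not a dataset from the lecture:

```python
# Score the non-edges of a small undirected graph with two
# link-prediction heuristics: Jaccard coefficient and degree product.
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
nbrs = {u: set() for u in range(4)}
for u, v in edges:
    nbrs[u].add(v)
    nbrs[v].add(u)

def jaccard(u, v):
    # |common neighbors| / |union of the two neighborhoods|
    union = nbrs[u] | nbrs[v]
    return len(nbrs[u] & nbrs[v]) / len(union) if union else 0.0

def degree_product(u, v):
    return len(nbrs[u]) * len(nbrs[v])

for u in range(4):
    for v in range(u + 1, 4):
        if v not in nbrs[u]:
            print((u, v), jaccard(u, v), degree_product(u, v))
```

In a numerical experiment like the one in the figure, one hides a fraction of edges, scores all non-edges this way, and asks how often the hidden edges outrank the true non-edges (the AUC).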

Page 13: Network Analysis & Modeling

lessons learned: what works well

4. simple simulations—explore dynamics vs structure & numerical experiments

simulate Price's model
simulate epidemics (SIR) on planted partitions

[Figure: complementary CDF Pr(K ≥ k_in) of in-degree k_in for Price's model with r = 1, r = 4, and no preferential attachment]

[Figure: SIR epidemic outcomes on planted-partition networks as a function of community structure c_in − c_out and spreading probability p]
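Price's model itself fits in a few lines: each new paper cites earlier papers chosen with probability proportional to their in-degree plus a constant attractiveness. A sketch for the simplest case, r = 1 citation per paper and attractiveness a = 1 (these parameter choices are for illustration):

```python
import random

def price_model(n, seed=0):
    """Price's model with r = 1 and a = 1: each new node cites one
    existing node chosen with probability proportional to (in-degree + 1).
    The list `pool` holds node i exactly (indeg[i] + 1) times, so a
    uniform draw from it is a preferential one."""
    rng = random.Random(seed)
    indeg, pool = [0], [0]
    for new in range(1, n):
        target = rng.choice(pool)
        indeg[target] += 1
        pool.append(target)  # preferential (rich-get-richer) weight
        indeg.append(0)
        pool.append(new)     # the "+1" uniform-attractiveness weight
    return indeg

indeg = price_model(20000)
print(sum(indeg), max(indeg))  # 19999 citations total; max is heavy-tailed
```

Comparing the in-degree CCDF of this output against a run where the target is chosen uniformly at random reproduces the contrast shown in the figure.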

Page 14: Network Analysis & Modeling

lessons learned: what works well

5. team projects: teamwork & exploring their (your!) own ideas

Page 15: Network Analysis & Modeling

key takeaways

• network intuition is hard to develop!
good intuition draws on many skills (probability, statistics, computation, causal dynamics, etc.)

• best results come from
1. exercises to get practice with calculations
2. practice analyzing diverse real-world networks
3. conducting numerical experiments & simulations

• practical tasks are a pedagogical tool (e.g., link and label prediction)
• interpreting the results requires a good intuition and thinking like a scientist
• null models are a key concept: is a pattern interesting? what could explain it?
• networks are fun!

[Figure: extracting highly variable regions from malaria var gene sequence alignments, panels A to D: calculate alignment scores, convert to alignment indicators, remove short aligned regions, extract highly variable regions]

Page 16: Network Analysis & Modeling

Aaron Clauset Associate Professor Computer Science + BioFrontiers Institute


[Interactive visualization: the computer science faculty hiring network among 205 universities, ranked by prestige (1 - Stanford, 2 - UC Berkeley, 3 - MIT, …, 204 - Alabama, Tuscaloosa, plus North Texas, Denton). © Clauset, Arbesman & Larremore, 2015]

inequality and the spread of ideas in science
complex social and biological systems
computational methods for network analysis

aaronclauset.github.io (about me)

Morgan et al., EPJ Data Science (2018) 7:40, Page 7 of 16

Of these events, 88 (37%) are due to transmissions of research ideas by way of hiring, and in 81% of these cases, transmissions move via faculty from higher prestige universities to lower prestige universities (past studies show that only 9 to 14% of faculty placements move faculty to a more prestigious university than their doctoral institution [9]). Figure 4 illustrates these patterns by showing spreading events over time, for three of the topics.

Crucially, if faculty hiring shapes the spread of ideas, then a significant share of departments that ever adopt a topic X will have adopted it through faculty hiring (scenario 2). We test this hypothesis by constructing a specialized permutation test to assess the statistical significance of the empirically observed fraction of departments that have adopted a research idea via scenario 2, denoted f_obs, and the expected fraction of such departments f_exp. The test's null model is one in which the publication years for each faculty are fixed with their empirical values, but paper titles are drawn uniformly at random, without replacement, from the set of all titles. In this way, serial correlations in topics and temporal correlations with the hiring event are removed from each faculty. We then report empirical p-values [30] for the fraction of hiring-driven adoption events for each topic.

Figure 4: Adoption events for the three research topics over time. Purple dots denote institutions who adopted an idea by hiring someone who studies that topic, and white dots represent institutions whose existing faculty began working on the topic. Arrows denote, for each time period, new transmissions, originating from the hired individual's doctoral location. All 205 institutions are arranged clockwise by prestige (descending), with the most prestigious department positioned at noon.

structure of academia, shed new light on the factors that shape individual career trajectories, and identify a novel connection between faculty hiring and social inequality.

RESULTS

Across the sampled disciplines, we find that faculty production (number of faculty placed) is highly skewed, with only 25% of institutions producing 71 to 86% of all tenure-track faculty (table S2; this and subsequent ranges indicate the range of a given quantity across the three disciplines, unless otherwise noted). The number of faculty within an academic unit (number of faculty hired, that is, the unit's size) is also skewed, with some units being two to three times larger than others. Business schools are especially large, generally containing several internal departments, with a mean size of 70 faculty members who received their doctorates from other within-sample units, whereas computer science and history have mean sizes of 21 and 29, respectively (see Supplementary Materials). The differences in size within a discipline, however, cannot explain the observed differences in placements. If placements were simply proportional to the size of a unit, then the placement and size distributions would be statistically indistinguishable. A simple test of this size-proportional placement hypothesis shows that it may be rejected out of hand [Kolmogorov-Smirnov (KS) test, P < 10⁻⁸; Fig. 2, B and C], indicating genuine differential success rates in faculty placement.

The Gini coefficient, a standard measure of social inequality, is defined as the mean relative difference between a uniformly random pair of observed values. Thus, G = 0 denotes strict equality, and G = 1 maximal inequality. We find G = 0.62 to 0.76 for faculty production (Fig. 2, A and B), indicating strong inequality across disciplines [cf., G = 0.45 for the income distribution of the United States (12)]. Strong inequality holds even among the top faculty producers: the top 10 units produce 1.6 to 3.0 times more faculty than the second 10, and 2.3 to 5.6 times more than the third 10. For such differences to reflect purely meritocratic outcomes, that is, utilitarian optimality of total scholarship (13), differences in placement rates must reflect inherent differences in the production of scholarship. Under a meritocracy, the observed placement rates would imply that faculty with doctorates from the top 10 units are inherently two to six times more productive than faculty with doctorates from the third 10 units. The magnitude of these differences makes a pure meritocracy seem implausible, suggesting the influence of nonmeritocratic factors like social status.

Fig. 1. Prestige hierarchies in faculty hiring networks. (Top) Placements for 267 computer science faculty among 10 universities (MIT, Stanford, UC Berkeley, Carnegie Mellon, Cornell, Washington, Caltech, Harvard, Yale, Princeton), with placements from one particular university highlighted. Each arc (u, v) has a width proportional to the number of current faculty at university v who received their doctorate at university u (≠ v). (Bottom) Prestige hierarchy on these institutions that minimizes the total weight of "upward" arcs, that is, arcs where v is more highly ranked than u.

Fig. 2. Inequality in faculty production. (A) Lorenz curves showing the fraction of all faculty produced as a function of producing institutions. (B and C) Complementary cumulative distributions for institution out-degree (faculty produced) and in-degree (faculty hired). The means of these distributions are 21 for computer science, 70 for business, and 29 for history.

RESEARCH ARTICLE: Clauset et al., Sci. Adv. 2015;1:e1400005, 12 February 2015
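The Gini coefficient as defined in the excerpt (mean relative difference between a uniformly random pair of observed values) can be computed directly; a small sketch with illustrative values:

```python
def gini(values):
    """Mean absolute difference over all pairs, normalized by twice the
    mean: G = 0 means strict equality, G -> 1 maximal inequality."""
    n = len(values)
    mean = sum(values) / n
    mean_abs_diff = sum(abs(x - y) for x in values for y in values) / (n * n)
    return mean_abs_diff / (2 * mean)

print(gini([1, 1, 1, 1]))    # 0.0: perfectly equal
print(gini([0, 0, 0, 100]))  # 0.75: one unit produces everything
```

Applied to the out-degrees of a faculty hiring network (faculty produced per institution), this is the quantity reported as G = 0.62 to 0.76 above.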

Fig. 1. Institutional prestige predicts early-career productivity and prominence of computer science faculty. Shown are median publication (left axis) and log₁₀ citation (right axis) counts per faculty per institution (minimum three faculty per institution), accumulated through their first 10 years posthire, adjusted for growth in publication rates over time (SI Appendix, section A). Shaded regions denote 95% confidence intervals for least squares regression.

Replacing prestige with the 2010 departmental rankings by U.S. News & World Report in our analysis produces similar results (see SI Appendix, section B and Fig. S1).

The annual faculty job market generates two kinds of quasi-natural experiments: It colocates at the same institution individuals who trained at more or less prestigious institutions than each other (Fig. 2, Top Left), and it separates individuals with similar training into faculty appointments at more or less prestigious institutions than each other (Fig. 2, Bottom Left). To isolate the effect of prestige differences on posthire productivity and prominence in each case, we combine exact and caliper matching techniques to mitigate the confounding effects of differences in the age, gender, subfield productivity norms, and postdoctoral training (see SI Appendix, section B). If where an individual trained determines their early-career scholarly output, individuals with more-prestigious training should be, on average, more productive and more prominent than colocated peers with less prestigious training. On the other hand, if where an individual works determines their early-career scholarly output, individuals with appointments at more-prestigious institutions should be more productive and more prominent than similarly trained peers with appointments at less prestigious institutions.

Results

For matched pairs of faculty with appointments at similarly prestigious institutions, the individual with the more prestigious training was not more productive in the first 5 years posthire (N = 359 pairs; p = 0.59, t test) but received, on average, 301 more citations (N = 129 pairs; p < 0.05, t test) during this period (Fig. 2 A and B). Among the pairs, the individual with more-prestigious training was more productive in 52.1% (p = 0.23; one-tailed binomial test) of trials but more highly cited in 63.9% (p < 0.005; one-tailed binomial test).

In contrast, for matched pairs of faculty with similarly prestigious training and with similar prehire productivity and prominence (publications, N = 194 pairs; citations, N = 194; see Fig. 2 C and D), the individual with the more prestigious appointment produced, on average, 5.1 more papers in the first 5 years posthire (p < 0.005, t test), with 57.4% of trials exhibiting an advantage of any magnitude (p < 0.05, binomial test) and significant differences in years y ∈ {1, 2, 4, 5} (p < 0.05, t test). Similarly, individuals with the more prestigious appointment received, on average, 344 more citations in this period (p < 0.001, t test), although the median difference was a more modest 112 additional citations. For context, faculty at the top 20% of institutions by prestige produced, on average, 17 more publications in their first 5 years and received 824 more citations than faculty at the bottom 20% of institutions, and they produced 9 more publications and received 543 more citations than faculty at the middle 20% of institutions.
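The one-tailed binomial (sign) tests quoted above ask whether the fraction of matched pairs showing an advantage exceeds the 50% expected by chance. A minimal sketch of that calculation; the example counts below are illustrative, not the paper's data:

```python
from math import comb

def one_tailed_binomial_p(k, n, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p): the chance of seeing k or more
    'advantaged' pairs out of n under the null of no prestige effect."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))
```

For example, observing an advantage in 8 of 10 matched pairs gives a one-tailed p-value of about 0.055, just above a conventional 0.05 threshold.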

Hence, conditioned on an individual holding a faculty position somewhere, we find no evidence that training at a prestigious

[Figure panels A–D: yearly differences in publications and citations, plotted for publishing years −5 to +5 relative to initial placement (pre-hire vs. post-hire), comparing the person with more prestigious training vs. less prestigious training (matched on work environment) and the person with more prestigious appointment vs. less prestigious appointment (matched on training environment). *Faculty also matched on gender, subfield, and other features; see main text for full details.]

Fig. 2. Early-career productivity is driven by work environment prestige. For pairs of computer science faculty matched by (A and B) work environment prestige or (C and D) training environment prestige, (A) publication and (B) citation counts are statistically independent of differences in doctoral prestige but are driven higher by (C and D) placing into a more prestigious work environment. Shaded regions denote 95% confidence intervals for the mean. Similar results are obtained using U.S. News & World Report department rankings in place of prestige (see SI Appendix, Fig. S1).

10730 | www.pnas.org/cgi/doi/10.1073/pnas.1817431116 Way et al.


reasons: (i) These particular metadata are irrelevant to the structure of the network, (ii) the detected communities and the metadata capture different aspects of the network's structure, (iii) the network contains no communities as in a simple random graph (7) or a network that is sufficiently sparse that its communities are not detectable (8), or (iv) the community detection algorithm performed poorly.

In the above, we refer to the observed network and metadata and note that noise in either could lead to one of the reasons above. For instance, measurement error of the network structure may make our observations unreliable and, in extreme cases, can obscure the community structure entirely, resulting in case (iii). It is also possible that human errors are introduced when handling the data, exemplified by the widely used American college football network (9) of teams that played each other in one season, whose associated metadata representing each team's conference assignment were collected during a different season (10). Large errors in the metadata can render them irrelevant to the network [case (i)].

Most work on community detection assumes that failure to find communities that correlate with metadata implies case (iv), algorithm failure, although some critical work has focused on case (iii), difficult or impossible to recover communities. The lack of consideration for cases (i) and (ii) suggests the possibility for selection bias in the published literature in this area [a point recently suggested by Hric et al. (11)]. Recent critiques of the general utility of community detection in networks (11–13) can be viewed as a side effect of confusion about the role of metadata in evaluating algorithm results. For these reasons, using metadata to assess the performance of community detection algorithms can lead to errors of interpretation, false comparisons between methods, and oversights of alternative patterns and explanations, including those that do not correlate with the known metadata.

For example, Zachary's Karate Club (14) is a small real-world network with compelling metadata frequently used to demonstrate community detection algorithms. The network represents the observed social interactions of 34 members of a karate club. At the time of study, the club fell into a political dispute and split into two factions. These faction labels are the metadata commonly used as ground truth communities in evaluating community detection methods. However, it is worth noting at this point that Zachary's original network and metadata differ from those commonly used for community detection (9). Links in the original network were defined by the different types of social interaction that Zachary observed. Zachary also recorded two metadata attributes: the political leaning of each of the members (strong, weak, or neutral support for one of the factions) and the faction they ultimately joined after the split. However, the community detection literature uses only the metadata representing the faction each node joined, often with one of the nodes mislabeled. This node ("Person number 9") supported the president during the dispute but joined the instructor's faction because joining the president's faction would have involved retraining as a novice when he was only 2 weeks away from taking his black belt exam.

The division of the Karate Club nodes into factions is not the only scientifically reasonable way to partition the network. Figure 1 shows the log-likelihood landscape for a large number of two-group partitions (embedded in two dimensions for visualization) of the Karate Club, under the stochastic blockmodel (SBM) for community detection (15, 16). Partitions that are similar to each other are embedded nearby in the horizontal coordinates, meaning that the two broad peaks in the landscape represent two distinct sets of high-likelihood partitions: one centered around the faction division and one that divides the network into leaders and followers. Other common approaches to community detection (9, 17) suggest that the best divisions of this network have more than two communities (10, 18). The multiplicity and diversity of good partitions illustrate the ambiguous status of the faction metadata as a desirable target.

The Karate Club network is among many examples for which standard community detection methods return communities that either subdivide the metadata partition (19) or do not correlate with the metadata at all (20, 21). More generally, most real-world networks have many good partitions, and there are many plausible ways to sort all partitions to find good ones, sometimes leading to a large number of reasonable results. Moreover, there is no consensus on which method to use on which type of network (21, 22).

In what follows, we explore both the theoretical origins of these problems and the practical means to address the confounding cases described above. To do so, we make use of a generative model perspective of community detection. In this perspective, we describe the relationship between community assignments C and graphs G via a joint distribution P(C, G) over all possible community assignments and graphs that we may observe. We take this perspective because it provides a precise and interpretable description of the relationship between communities and network structure. Although generative models, like the SBM, describe the relationship between networks and communities directly via a mathematically explicit expression for P(C, G), other methods for community detection nevertheless maintain an implicit relationship between network structure and community assignment. Hence, the theorems we present, as well as their implications, are more generally applicable across all methods of community detection.
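Although P(C, G) is stated abstractly here, the likelihood of a given partition is concrete to compute. A minimal sketch for a plain Bernoulli SBM with plug-in MLE block probabilities (the paper's likelihood surfaces, e.g. for the Karate Club, use more refined SBM variants; this only illustrates the idea):

```python
import math
from collections import Counter

def sbm_log_likelihood(labels, edges):
    """Log-likelihood of a partition under a simple Bernoulli SBM,
    plugging in the MLE block probabilities p_rs = m_rs / N_rs,
    where m_rs counts edges and N_rs counts possible edges between
    groups r and s."""
    size = Counter(labels)                      # nodes per group
    m = Counter()                               # edges per group pair
    for u, v in edges:
        m[tuple(sorted((labels[u], labels[v])))] += 1
    ll = 0.0
    groups = sorted(size)
    for i, r in enumerate(groups):
        for s in groups[i:]:
            # number of possible edges between groups r and s
            pairs = size[r] * (size[r] - 1) // 2 if r == s else size[r] * size[s]
            if pairs == 0:
                continue
            p = m[(r, s)] / pairs
            if 0 < p < 1:
                ll += m[(r, s)] * math.log(p) + (pairs - m[(r, s)]) * math.log(1 - p)
            # p = 0 or p = 1 contributes zero (0 * log 0 := 0)
    return ll
```

Comparing this quantity across many candidate partitions is exactly what produces a likelihood landscape like the one in Fig. 1.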

In the next section, we present rigorous theoretical results with direct implications for cases (i) and (iv), whereas the remaining sections introduce two statistical methods for addressing cases (i) and (ii). These contributions do not address case (iii), when there is no structure to be found, which has been previously explored by other authors, for example, for the SBM (8, 23–27) and modularity (28, 29).

[Figure: SBM log likelihood (−240 to −200) over partition space.]

Fig. 1. The stochastic blockmodel log-likelihood surface for bipartitions of the Karate Club network (14). The high-dimensional space of all possible bipartitions of the network has been projected onto the x, y plane (using a method described in Supplementary Text D.4) such that points representing similar partitions are closer together. The surface shows two distinct peaks that represent scientifically reasonable partitions. The lower peak corresponds to the social group partition given by the metadata, often treated as ground truth, whereas the higher peak corresponds to a leader-follower partition.

SCIENCE ADVANCES | RESEARCH ARTICLE. Peel, Larremore, Clauset, Sci. Adv. 2017;3:e1602548, 3 May 2017, 2 of 8.


this division 65% of the time, a relatively weak level of correlation, not far above the 50% of completely uncorrelated data. Nonetheless, as shown in Fig. 1b, this is enough for the algorithm to reliably find the correct division of the network in almost every case, 98% of the time in our tests. Without the metadata, by contrast, we succeed only 6% of the time. Some practical applications of this ability to select among competing divisions are given in the next section.

Real-world networks. In this section we describe applications of our method to a range of real-world networks, drawn from social, biological and technological domains.

For our first application we analyse a network of school students, drawn from the US National Longitudinal Study of Adolescent Health. The network represents patterns of friendship, established by survey, among the 795 students in a medium-sized American high school (US grades 9–12, ages 14–18 years) and its feeder middle school (grades 7 and 8, ages 12–14 years).

Given that this network combines middle and high schools, it comes as no surprise that there is a clear division (previously documented) into two network communities corresponding roughly to the two schools. Previous work, however, has also shown the presence of divisions by ethnicity [31]. Our method allows us to select between divisions by using metadata that correlate with the one we are interested in.

Figure 2 shows the results of applying our algorithm to the network three times. Each time, we asked the algorithm to divide the network into two communities. In Fig. 2a, we used the six school grades as metadata and the algorithm readily identifies a division into grades 7 and 8 on the one hand and grades 9–12 on the other, that is, the division into middle school and high school. In Fig. 2b, by contrast, we used the students' self-identified ethnicity as metadata, which in this data set takes one of four values: white, black, hispanic, or other (plus a small number of nodes with missing data). Now the algorithm finds a completely different division into two groups, one group consisting principally of black students and one of white. (The small number of remaining students are distributed roughly evenly between the groups.)

One might be concerned that in these examples the algorithm is mainly following the metadata to determine community membership, and ignoring the network structure. To test for this possibility, we performed a third analysis, using gender as metadata. When we do this, as shown in Fig. 2c, the algorithm does not find a division into male and female groups. Instead, it finds a new division that is a hybrid of the grade and ethnicity divisions (white high-school students in one group and everyone else in the other). That is, the algorithm has ignored the gender metadata, because there was no good network division that correlated with it, and instead found a division based on the network structure alone. The algorithm makes use of the metadata only when doing so improves the quality of the network division (in the sense of the maximum-likelihood fit described in the Methods section).

The extent to which the communities found by our algorithm match the metadata (or any other 'ground truth' variable) can be quantified by calculating a normalized mutual information (NMI) [32, 33], as described in the Methods section. NMI ranges in value from 0 when the metadata are uninformative about the communities to 1 when the metadata specify the communities completely. The divisions shown in Fig. 2a,b have NMI scores of 0.881 and 0.820, respectively, indicating that the metadata are strongly though not perfectly correlated with community membership. By contrast, the division in Fig. 2c, where gender was used as metadata, has an NMI score of 0.003, indicating that the metadata contain essentially zero information about the communities.
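NMI can be computed directly from the joint label counts of the two partitions. A minimal sketch, normalizing the mutual information by the mean entropy, 2I(A;B)/(H(A)+H(B)); this is one of several common normalizations, and refs 32 and 33 give the details of the variant used in the paper:

```python
import math
from collections import Counter

def nmi(labels_a, labels_b):
    """Normalized mutual information between two partitions of the
    same node set: NMI = 2 I(A;B) / (H(A) + H(B))."""
    n = len(labels_a)
    pa, pb = Counter(labels_a), Counter(labels_b)
    pab = Counter(zip(labels_a, labels_b))
    h_a = -sum((c / n) * math.log(c / n) for c in pa.values())
    h_b = -sum((c / n) * math.log(c / n) for c in pb.values())
    i_ab = sum((c / n) * math.log((c / n) / ((pa[a] / n) * (pb[b] / n)))
               for (a, b), c in pab.items())
    if h_a + h_b == 0:          # both partitions trivial
        return 1.0
    return 2 * i_ab / (h_a + h_b)
```

Identical partitions score 1 regardless of how the groups are labeled, and statistically independent partitions score 0, matching the interpretation of the scores quoted above.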

Our next application is to an ecological network, a food web of predator–prey interactions between 488 marine species living in the Weddell Sea, a large bay off the coast of Antarctica [34, 35]. A number of different metadata are available for these species, including feeding mode (deposit feeder, suspension feeder, scavenger and so on), zone within the ocean (benthic, pelagic and so on) and others. In our analysis, however, we focus on one in particular, the average adult body mass. Body masses of species in this ecosystem have a wide range, from microorganisms weighing nanograms or less to hundreds of tonnes for the largest whales.

[Figure legend: Middle, High; White, Black, Hispanic, Other, Missing; Male, Female; panels a, b, c.]

Figure 2 | Communities found in a high school friendship network with various types of metadata. Three divisions of a school friendship network, using as metadata (a) school grade, (b) ethnicity and (c) gender.

ARTICLE. NATURE COMMUNICATIONS | 7:11863 | DOI: 10.1038/ncomms11863 | www.nature.com/naturecommunications

Identifying a power law in the distribution of an empirical quantity can indicate the presence of exotic underlying mechanisms, including nonlinearities, feedback loops, and network effects (33, 34), although not always (36), and power laws are believed to occur broadly in complex social, technological, and biological systems (37). For instance, the intensities or sizes of many natural disasters, such as earthquakes, forest fires, and floods (34, 38, 39), as well as many social disasters, such as riots and terrorist attacks (35, 40), are well described by power laws.

However, it can be difficult to accurately characterize the shape of a distribution that follows a power-law pattern (37). Fluctuations in heavy-tailed data are greatest in the distribution's upper tail, which governs the frequency of the largest and rarest events. As a result, data tend to be sparsest precisely where the greatest precision in model estimates is desired.

Recent interest in heavy-tailed distributions has led to the development of more rigorous methods to identify and estimate power-law distributions in empirical data (37, 41, 42), to compare different models of the upper tail's shape (37), and to make principled statistical forecasts of future events (43). This branch of statistical methodology is related to but distinct from the task of estimating the distribution of maxima within a sample (44, 45) and is more closely related to the peaks-over-threshold literature in seismology, forestry, hydrology, insurance, and finance (41, 42, 45–48).

Although Poisson processes pose fewer statistical concerns than power-law distributions, a similar statistical approach is used in the analysis here of both war sizes and years between war onsets. In particular, an ensemble approach is used (43) on the basis of a standard nonparametric bootstrap procedure (49) that simulates the generative process of events to produce a series of synthetic data sets {Y} with similar statistical structure as the empirical data X. Fitting a semiparametric model Pr(y|θ) to each Y yields an ensemble of models {θ} that incorporate the empirical data's inherent variability into a distribution of estimated parameters. This distribution is then used to weight models by their likelihood under the bootstrap distribution and to numerically estimate the likelihood of specific historical or future patterns (43).
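The core of this ensemble construction is the nonparametric bootstrap: resample the observed events with replacement, refit the model to each synthetic data set Y, and collect the resulting parameter estimates. A minimal sketch, where `fit` stands in for the paper's semiparametric model fitting:

```python
import random
import statistics

def bootstrap_ensemble(data, fit, n_boot=1000, seed=1):
    """Nonparametric bootstrap: resample the events with replacement,
    refit the model to each synthetic data set, and return the
    ensemble of estimated parameters {theta}."""
    rng = random.Random(seed)
    n = len(data)
    return [fit([rng.choice(data) for _ in range(n)]) for _ in range(n_boot)]
```

For example, `bootstrap_ensemble(onset_gaps, statistics.mean)` (with `onset_gaps` a hypothetical list of years between war onsets) yields an ensemble whose spread reflects the data's inherent variability, which is then used to weight models.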

Within the 1823–2003 time period, the end of the Second World War in 1945 is widely viewed as the most plausible change point in the underlying dynamics of the conflict-generating process for wars and marks the beginning of the subsequent long peace pattern (10). Determining whether 1945 marks a genuine shift in the observed statistics of wars and, hence, whether the long peace is plausibly a trend or a fluctuation represents a broad test of the stationarity hypothesis of war (23). Evaluating other theoretically plausible change points in these data is left for future work.

Finally, some studies choose to limit or normalize war onset counts or war sizes (battle death counts) by a reference population. For instance, onset counts can be normalized by assuming that war is a dyadic event and that dyads independently generate conflicts (26), implying a normalization that grows quadratically with the number of nations. However, considerable evidence indicates that dyads do not independently generate conflicts (9, 16–21). Similarly, limiting the analysis to conflicts among "major powers" introduces subjectivity in defining such a scope, and there is not a clear consensus about the details, for example, when and whether to include China or the occupied European nations, or certain wars, such as the Korean War (26). War size can be normalized by assuming that individuals contribute independently to total violence, which implies a normalization that depends on either the population of the combatant nations (a variable sometimes called war "intensity") or of the world (3, 23). However, there is little evidence for this assumption (3, 50), although such a per capita variable may be useful for other reasons. In the analysis performed here, war variables are analyzed in their unnormalized forms, and all recorded interstate wars are considered. The analysis is thus at the level of the entire world, and results are about absolute counts.

RESULTS

The sizes of wars

Considering the sizes of wars alone necessarily ignores other characteristics of conflicts, including their relative timing, which may contain independent signals about trends. A pattern in war sizes alone thus says little about changes in declared reasons for conflicts, the way they are fought, their settlements, aftermaths, or relationships to other conflicts past or future, or the number of nations worldwide, among other factors. One benefit of ignoring these factors, at least at first, is that they may be irrelevant for identifying an overall trend in wars, and their relationship to a trend can be explored subsequently. Hence, focusing narrowly on war sizes simplifies the range of models to consider and may improve the ability to detect a subtle trend.

The Correlates of War data set includes 95 interstate wars, the absolute sizes of which range from 1000 (the minimum size by definition) to 16,634,907 (the recorded battle deaths of the Second World War) (Fig. 2). The estimated power-law model has two parameters: xmin, which represents the smallest value above which the power-law pattern holds, and α, the scaling parameter. Standard techniques are used to estimate model parameters and model plausibility (section S1) (37).
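Given a choice of xmin, the scaling parameter has a closed-form maximum likelihood estimate for continuous data, the standard estimator of ref. 37. A minimal sketch (choosing xmin itself is normally done by minimizing a Kolmogorov–Smirnov statistic, which is omitted here):

```python
import math

def fit_powerlaw(data, xmin):
    """Continuous power-law MLE for the tail x >= xmin:
    alpha_hat = 1 + n / sum(ln(x_i / xmin)),
    with approximate standard error (alpha_hat - 1) / sqrt(n)."""
    tail = [x for x in data if x >= xmin]
    n = len(tail)
    alpha = 1.0 + n / sum(math.log(x / xmin) for x in tail)
    return alpha, (alpha - 1.0) / math.sqrt(n)
```

As a sanity check, synthetic data drawn by inverse-transform sampling, x = xmin·(1 − u)^(−1/(α−1)), recovers the generating exponent to within the quoted standard error.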

The maximum likelihood power-law parameter is α = 1.53 ± 0.07 for wars with severity x ≥ xmin = 7061 (Fig. 2, inset), and 95% of the bootstrap distribution of α falls within the interval [1.37, 1.76]. However, these estimates do not indicate that the observed data are a plausible independent and identically distributed (iid) draw from the fitted model. To quantitatively assess this aspect of the model, we used an appropriately defined statistical hypothesis test (37), which indicates that a power-law distribution cannot be rejected

[Figure: fraction of wars with at least x deaths (10⁻² to 10⁰) versus battle deaths x (10³ to 10⁸), with the distribution quartiles x0.25, x0.50, x0.75 marked; inset: density of the bootstrap power-law exponent over 1.2–1.8.]

Fig. 2. Interstate war sizes, 1823–2003. The maximum likelihood power-law model of the largest-severity wars (solid line, α = 1.53 ± 0.07 for x ≥ xmin = 7061) is a plausible data-generating process of the empirical severities (Monte Carlo, pKS = 0.78 ± 0.03). For reference, distribution quartiles are marked by vertical dashed lines. Inset: Bootstrap distribution of maximum likelihood parameters Pr(α), with the empirical value (black line).

SCIENCE ADVANCES | RESEARCH ARTICLE. Clauset, Sci. Adv. 2018;4:eaao3580, 21 February 2018, 3 of 9.


SAFE LEADS AND LEAD CHANGES IN COMPETITIVE . . . PHYSICAL REVIEW E 91, 062815 (2015)

[Figure: probability that effective lead is safe, Q (0 to 1), versus effective lead z (0 to 2); curves for Eq. (15), NBA games, and Bill James' heuristic.]

FIG. 11. (Color online) Probability that a lead is safe versus the dimensionless lead size z = L/√(4Dτ) for NBA games, showing the prediction from Eq. (15), the empirical data, and the mean prediction for Bill James' well-known "safe lead" heuristic.

probabilities only for dimensionless leads z > 2) and has the wrong qualitative dependence on z. In contrast, the random walk model gives a maximal overestimate of 6.2% for the safe lead probability over all z, and has the same qualitative z dependence as the empirical data.

For completeness, we extend the derivation for the safe lead probability to unequal-strength teams by including the effect of a bias velocity v in Eq. (14):

$$Q(L,\tau) = 1 - \int_0^\tau \frac{L}{\sqrt{4\pi D t^3}}\, e^{-(L+vt)^2/4Dt}\, dt = 1 - e^{-vL/2D} \int_0^\tau \frac{L}{\sqrt{4\pi D t^3}}\, e^{-L^2/4Dt - v^2 t/4D}\, dt, \quad (16)$$

where the integrand in the first line is the first-passage probability for nonzero bias. Substituting u = L/√(4Dt) and using again the Peclet number Pe = vL/2D, the result is

$$Q(L,\tau) = 1 - \frac{2}{\sqrt{\pi}}\, e^{-\mathrm{Pe}} \int_z^\infty e^{-u^2 - \mathrm{Pe}^2/4u^2}\, du = 1 - \frac{1}{2}\left[ e^{-2\mathrm{Pe}}\, \mathrm{erfc}\!\left(z - \frac{\mathrm{Pe}}{2z}\right) + \mathrm{erfc}\!\left(z + \frac{\mathrm{Pe}}{2z}\right) \right]. \quad (17)$$

When the stronger team is leading (Pe > 0), essentially any lead is safe for Pe ≳ 1, while for Pe < 1, the safety of a lead depends more sensitively on z [Fig. 12(a)]. Conversely, if the weaker team happens to be leading (Pe < 0), then the lead has to be substantial or the time remaining quite short for the lead to be safe [Fig. 12(b)]. In this regime, the asymptotics of the error function gives Q(L,τ) ∼ e^{−Pe²/4z²} for z < |Pe|/2, which is vanishingly small. For values of z in this range, the lead is essentially never safe.

VI. LEAD CHANGES IN OTHER SPORTS

We now consider whether our predictions for lead change statistics in basketball extend to other sports, such as college

FIG. 12. (Color online) Probability that a lead is safe versus z = L/√(4Dτ) for (a) the stronger team is leading for Pe = 1/5, 1/2, and 1 (progressively flatter curves), and (b) the weaker team is leading for Pe = −2/5, −4/5, and −6/5 (progressively shifting to the right). The case Pe = 0 is also shown for comparison.

American football (CFB), professional American football (NFL), and professional hockey (NHL) [43]. These sports have the following commonalities with basketball [19]:

(1) Two teams compete for a fixed time T, in which points are scored by moving a ball or puck into a special zone in the field.

(2) Each team accumulates points during the game and the team with the largest final score is the winner (with sport-specific tiebreaking rules).

(3) A roughly constant scoring rate throughout the game, except for small deviations at the start and end of each scoring period.

(4) Negligible temporal correlations between successive scoring events.

(5) Intrinsically different team strengths.

(6) Scoring antipersistence, except for hockey.

These similarities suggest that a random-walk model should also apply to lead change dynamics in these sports (Fig. 13).

However, there are also points of departure, the most important of which is that the scoring rate in these sports is between 10 and 25 times smaller than in basketball. Because of this much lower overall scoring rate, the diminished rate at the start of games is much more apparent than in basketball (Fig. 14). This longer low-activity initial period and other non-random-walk mechanisms cause the distributions L(t) and


typically been omitted from previous analyses), for ectothermic species, etc.

We resolve several of these questions by testing the tradeoff theory's ability to explain the observed body size distribution of cetaceans, the largest and most diverse marine mammal clade. Cetaceans are an ideal test case for the theory. First, Cetacea is a sufficiently speciose clade (77 extant species) to allow a quantitative comparison of predicted and observed distributions. Sirenia, the only other fully aquatic mammal clade, contains four extant species, which is too small for a productive comparison. Second, semiaquatic groups like Pinnipeds (seals and walruses) and Mustelids (otters) cannot be used to test the theory because they spend significant time on land, thus avoiding the hard thermoregulatory constraint assumed by the theory. Thus, by focusing on cetaceans, we provide a reasonable test of the theory. Third, fully aquatic mammals like cetaceans have typically been omitted in past studies because their marine habitat induces a different lower limit on mass than is seen in terrestrial mammals. As a result, it remains unknown whether the theory extends to all mammals, or only those in terrestrial environments. Finally, cetacean body masses do indeed exhibit the canonical right-skewed pattern (Fig. 1): the median size (356 kg, Tursiops truncatus) is close to the smallest (37.5 kg, Pontoporia blainvillei) but far from the largest (175,000 kg). This suggests that the theory may indeed hold for them.

Here, we test the strongest possible form of the macroevolutionary tradeoff theory for cetacean sizes. Instead of estimating model parameters from cetacean data, we combine parameters estimated from terrestrial mammals with a theoretically determined choice for the lower limit on cetacean species body mass. The resulting model has no tunable parameters by which to adjust its predicted distribution. In this way, we answer the question of how large a whale should be: if the predicted distribution agrees with the observed sizes, the same short-term versus long-term tradeoff that determines the sizes of terrestrial mammals also determines the sizes of whales.

We find that this zero-parameter model provides a highly accurate prediction of cetacean sizes. Thus, a single universal tradeoff mechanism appears to explain the body sizes of all mammal species, but this mechanism must obey the thermoregulatory limits imposed by the environment in which it unfolds. It is this one difference (thermoregulation in air for terrestrial mammals and in water for aquatic mammals) that explains the different locations of their respective body size distributions. Energetic constraints, while a popular historical explanation for sizes, seem to be only part of the puzzle for understanding the distribution of species sizes. Under this macroevolutionary mechanism, the size of the largest observed species is set by the tradeoff between the extinction probability at large sizes and the rate at which smaller species evolve to larger body masses, both of which may depend partly on energetic and ecological factors.
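Agreement between a predicted and an observed distribution can be quantified by comparing their empirical CDFs. The sketch below implements the two-sample Kolmogorov-Smirnov statistic from scratch on synthetic data; it is illustrative only and is not the statistical test used in the paper:

```python
def ks_statistic(xs, ys):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum absolute
    difference between the two empirical CDFs. O(n^2) brute force,
    fine for illustration."""
    xs, ys = sorted(xs), sorted(ys)
    d = 0.0
    for v in xs + ys:
        fx = sum(1 for x in xs if x <= v) / len(xs)
        fy = sum(1 for y in ys if y <= v) / len(ys)
        d = max(d, abs(fx - fy))
    return d

# Identical samples agree perfectly; disjoint samples disagree maximally.
d_same = ks_statistic([1, 2, 3], [1, 2, 3])        # 0.0
d_disjoint = ks_statistic([1, 2, 3], [10, 20, 30])  # 1.0
```

A small statistic (relative to the sampling noise expected for 77 species) would indicate that the zero-parameter prediction and the observed cetacean masses are statistically indistinguishable.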

Figure 1. Terrestrial and fully aquatic mammal species mass distributions. Both show the canonical asymmetric pattern: the median size is flanked by a short left tail down to a minimum viable size and a long right tail out to a few extremely large species. doi:10.1371/journal.pone.0053967.g001

Figure 2. Characteristic species size pattern and cladogenetic diffusion model. (A) The characteristic distribution of species body sizes, observed in most major animal groups. Macroevolutionary tradeoffs between short-term selective advantages and long-term extinction risks, constrained by a minimum viable size Mmin, produce the distribution's long right tail. (B) Schematic illustrating the cladogenetic diffusion model of species body-size evolution: a descendant species' mass is related to its ancestor's size M by a random multiplicative factor λ. Species become extinct with a probability that grows slowly with M. doi:10.1371/journal.pone.0053967.g002
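The cladogenetic diffusion mechanism described in the caption can be sketched as a toy simulation. The functional forms and constants below are illustrative assumptions, not the paper's fitted model:

```python
import math
import random

def cladogenetic_diffusion(n_steps=20000, m_min=1.0, seed=7):
    """Toy cladogenetic diffusion of species body sizes: a descendant's
    mass is its ancestor's mass M times a random multiplicative factor
    lambda, masses are clamped at the minimum viable size m_min, and
    extinction probability grows slowly (logarithmically) with mass.
    All numerical constants are illustrative, not estimated values."""
    rng = random.Random(seed)
    species = [2.0 * m_min]                       # begin with one small species
    for _ in range(n_steps):
        parent = rng.choice(species)
        lam = math.exp(rng.gauss(0.05, 0.4))      # slight bias toward larger sizes
        species.append(max(parent * lam, m_min))  # hard lower limit on viable mass
        # extinction trial on one randomly chosen species; larger species
        # are slightly more likely to die (capped, slowly growing risk)
        i = rng.randrange(len(species))
        p_ext = min(0.95, 0.80 + 0.03 * math.log(species[i] / m_min))
        if len(species) > 1 and rng.random() < p_ext:
            species.pop(i)
    return species

sizes = cladogenetic_diffusion()
# The resulting distribution is right-skewed: a hard floor at m_min,
# a median near the small end, and a long right tail, as in Fig. 2A.
```

The multiplicative kicks spread masses out on a log scale, the reflecting lower limit produces the short left tail, and the size-dependent extinction risk truncates the right tail, reproducing the qualitative shape in Fig. 2A.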

How Large Should Whales Be?

PLOS ONE | www.plosone.org 2 January 2013 | Volume 8 | Issue 1 | e53967