Dynamic degree-corrected blockmodels for social networks: a … · 2018. 9. 4. · I Overlapping...
Introduction Static model Dynamic models Inference Applications Conclusion
Dynamic degree-corrected blockmodels for social networks: a nonparametric approach
Linda Tan
(Joint work with Maria De Iorio)
National University of Singapore
Bayesian Computation for High-Dimensional Statistical Models
Introduction
• Understand the community structure of social networks
  - Social networks often exhibit community structure (nodes are more densely connected within each group than across groups).
  - Community structure may be present due to similar interests, social stature, physical locations, etc.
• Challenges in detecting community structure
  - The number of communities is typically unknown, and the communities can vary in size and rate of interaction.
  - Nodes within a community can also have different activity levels, and community detection results can be distorted if degree heterogeneity is not taken into account (Karrer and Newman, 2011).
• Propose a nonparametric Bayesian approach
  - Use independent Dirichlet processes to (1) capture block structure in the social network and (2) induce clustering in the activity levels of nodes.
Stochastic blockmodel
• Assumes that the nodes in a network are partitioned into groups.
• Distribution of ties between nodes depends only on
  1. Group membership of the nodes.
  2. Probabilities of interactions between different groups.
[Figure: 4 × 4 probability block matrix, with groups 1 to 4 on both axes.]
• A variety of network structures (community, hierarchical, core-periphery) can be produced through different choices of the probability block matrix.
• Blockmodeling
  - a priori: exogenous actor attribute data are used to partition the nodes.
  - a posteriori: discover block structures from relational data.
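The generative idea of the stochastic blockmodel can be illustrated with a minimal Python sketch. It is not the authors' code: the function name `sample_sbm`, the 0-based group labels and the example block matrix are assumptions made for illustration. Each tie is drawn independently with a probability that depends only on the groups of the two nodes.

```python
import random

def sample_sbm(z, block_prob, seed=0):
    """Sample a symmetric 0/1 adjacency matrix from a stochastic blockmodel.

    z          : list of 0-based group labels, one per node (assumed encoding)
    block_prob : K x K symmetric matrix; block_prob[a][b] is the tie
                 probability between a node in group a and a node in group b
    """
    rng = random.Random(seed)
    n = len(z)
    y = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # The tie probability depends only on the two group memberships.
            p = block_prob[z[i]][z[j]]
            tie = 1 if rng.random() < p else 0
            y[i][j] = y[j][i] = tie  # undirected network, zero diagonal
    return y

# Hypothetical example: two assortative communities of three nodes each.
z = [0, 0, 0, 1, 1, 1]
block_prob = [[0.9, 0.1],
              [0.1, 0.9]]
y = sample_sbm(z, block_prob)
```

Swapping the large and small entries of `block_prob` would give a disassortative structure instead, which is the sense in which the block matrix controls the type of network produced.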
Extensions of stochastic blockmodel
• Relax the restriction that each node can belong to only one group
  - Mixed membership stochastic blockmodels (Airoldi et al., 2008)
  - Overlapping stochastic blockmodels (Latouche et al., 2011)
• Account for degree heterogeneity
  - Degree-corrected stochastic blockmodels (Karrer and Newman, 2011; Peng and Carvalho, 2016)
  - Assortative MMSB with node popularities (Gopalan et al., 2013)
• Determine the number of groups automatically (Chinese restaurant process)
  - Infinite relational model (Kemp et al., 2006)
  - Infinite degree-corrected stochastic block model (Herlau et al., 2014)
• Models for dynamic networks
  - State-space mixed membership blockmodel (Xing et al., 2010)
Contributions
• Develop degree-corrected stochastic blockmodels for community detection in social networks using a nonparametric Bayesian approach.
• Formulate the static model using probit regression. Use the Dirichlet process (DP) to
  1. detect communities in the network,
  2. induce clustering among the popularity parameters.
• Flexible approach: allows the number of communities and popularity clusters to be determined automatically by the data.
• Integrates the approaches of Kemp et al. (2006), who use the Chinese restaurant process to detect community structure, and Ghosh et al. (2010), who use the DP to induce clustering among the “popularity” parameters.
• Present a model for static networks and extensions to dynamic networks. Posterior inference is obtained using Gibbs samplers. The proposed models are illustrated using real social networks.
Static model
• $N = \{1, \dots, n\}$: set of actors.
• $y = [y_{ij}]$: $n \times n$ adjacency matrix, where $y_{ij}$ is an indicator of a link from $i$ to $j$. Consider an undirected network: $y$ is symmetric with zero diagonal.

$$y_{ij} \mid p_{ij} \overset{\text{indep}}{\sim} \text{Bernoulli}(p_{ij}),$$
$$\Phi^{-1}(p_{ij}) = \theta_i + \theta_j + \sum_{k=1}^{K} \beta^*_k \, 1\{z_i = z_j = k\}, \quad (1 \le i < j \le n).$$

  - $\theta_i$: popularity of actor $i$.
  - $\Phi(\cdot)$: cumulative distribution function of $N(0, 1)$.
  - $K \le n$: number of communities (unknown).
  - $\beta^*_k$: rate of interaction in the $k$th community (high $\beta^*_k$: close-knit group).
  - $z_i \in \{1, \dots, K\}$: group membership of actor $i$.
  - $1\{\cdot\}$: indicator function.
• The probability of interaction between $i$ and $j$ depends on
  - their individual popularities,
  - the interaction rate of their group (if they belong to the same group).
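To make the probit formulation concrete, the following Python sketch evaluates the link probability for a single pair of actors. The function name `link_probability`, the 0-based group labels and the numerical values are illustrative assumptions and do not come from the talk.

```python
from statistics import NormalDist

def link_probability(theta_i, theta_j, z_i, z_j, beta_star):
    """Probit link probability p_ij for one pair of actors.

    beta_star[k] is the interaction rate of community k (0-based labels,
    an assumption of this sketch). The community term enters only when
    both actors belong to the same community.
    """
    community_term = beta_star[z_i] if z_i == z_j else 0.0
    linear_predictor = theta_i + theta_j + community_term
    return NormalDist().cdf(linear_predictor)  # Phi(.) of N(0, 1)

# Hypothetical values: two low-popularity actors.
beta_star = [1.5, 0.5]
p_same = link_probability(-0.8, -0.8, 0, 0, beta_star)  # same community
p_diff = link_probability(-0.8, -0.8, 0, 1, beta_star)  # different communities
```

With these values `p_same` exceeds `p_diff`, since a positive community effect is added only in the first case. Ties across communities are driven by the popularities alone.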
Static model
• The last term represents a stochastic blockmodel in which the non-diagonal entries of the probability matrix are set to a common value (not necessarily zero).
• Interaction between actors from different groups is driven by their popularities. The popularity parameters $\{\theta_i\}$ and group assignments $\{z_i\}$ compete to explain the network.
• A DP is used to induce clustering among the popularity parameters $\{\theta_i\}$:

$$\theta_i \mid G \overset{\text{iid}}{\sim} G \quad (i = 1, \dots, n), \qquad G \sim \text{DP}(\alpha, G_0),$$

where $G_0$ is $N(0, \sigma_\theta^2)$ and $\alpha \sim \text{Gamma}(a_\alpha, b_\alpha)$.
• A second DP, with random measure $H$ (independent of $G$), is used to detect the communities in the network. Introduce a $\beta_i$ for each actor $i$, where $\beta_i = \beta^*_{z_i}$, and assume

$$\beta_i \mid H \overset{\text{iid}}{\sim} H \quad (i = 1, \dots, n), \qquad H \sim \text{DP}(\nu, H_0),$$

where $H_0$ is $N(0, \sigma_\beta^2)$ and $\nu \sim \text{Gamma}(a_\nu, b_\nu)$.
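The clustering induced by a DP prior can be seen by simulating from it. The sketch below uses the Chinese restaurant process representation with a Normal base measure. It is an illustration only: the function name `sample_dp_prior` and the chosen values of `alpha` and `sigma` are assumptions, and the concentration parameter is held fixed rather than given a Gamma prior.

```python
import random

def sample_dp_prior(n, alpha, sigma, seed=0):
    """Draw n values from DP(alpha, N(0, sigma^2)) via the Chinese
    restaurant process. Returns (labels, values)."""
    rng = random.Random(seed)
    labels = []   # cluster label of each actor
    atoms = []    # distinct values, one per cluster, drawn from the base measure
    counts = []   # current cluster sizes
    for _ in range(n):
        # Join existing cluster k with weight counts[k],
        # or open a new cluster with weight alpha.
        weights = counts + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(atoms):
            atoms.append(rng.gauss(0.0, sigma))
            counts.append(1)
        else:
            counts[k] += 1
        labels.append(k)
    values = [atoms[k] for k in labels]
    return labels, values

# Hypothetical settings: 34 actors, alpha = 1, unit-variance base measure.
labels, values = sample_dp_prior(n=34, alpha=1.0, sigma=1.0)
num_clusters = len(set(labels))
```

Actors that share a label share the same value, which is how ties among the $\theta_i$ (popularity clusters) or among the $\beta_i$ (communities) arise under the prior.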
Infer number of clusters
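• The number of communities $K$ and the number of popularity clusters $L$ among $\{\theta_i\}$ are inferred from the data.
• The prior distribution of $L$ depends on the concentration parameter $\alpha$ of the DP (larger $\alpha$ implies larger $L$).
• We specify a Gamma prior on $\alpha$, which also facilitates computation. If $\alpha \sim \text{Gamma}(a_0, b_0)$,

$$E(L) \approx \frac{a_0}{b_0} A, \qquad \text{Var}(L) \approx E(L) + \frac{a_0^2}{b_0^2} B + \left\{ \frac{a_0}{b_0} B + A \right\}^2 \frac{a_0}{b_0^2},$$

where $A = \psi_0\left(\frac{a_0 + n b_0}{b_0}\right) - \psi_0\left(\frac{a_0}{b_0}\right)$ and $B = \psi_1\left(\frac{a_0 + n b_0}{b_0}\right) - \psi_1\left(\frac{a_0}{b_0}\right)$ (Jara et al., 2007). Here $\psi_0(\cdot)$ is the digamma function and $\psi_1(\cdot)$ is the trigamma function. These results serve as a reference for setting $a_0$ and $b_0$.
• The relation between $K$ and $\nu$ is similar.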
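The approximations for the prior mean and variance of $L$ can be evaluated with a few lines of Python. This sketch avoids special-function libraries by using the recurrences $\psi_0(x+1) = \psi_0(x) + 1/x$ and $\psi_1(x+1) = \psi_1(x) - 1/x^2$, which turn the differences $A$ and $B$ into finite sums when $n$ is an integer. The function name `prior_cluster_moments` is an assumption of this sketch.

```python
def prior_cluster_moments(a0, b0, n):
    """Approximate prior mean and variance of the number of clusters L
    among n items when alpha ~ Gamma(a0, b0) (shape a0, rate b0)."""
    m = a0 / b0  # prior mean of alpha
    # A = psi0(m + n) - psi0(m) = sum_{i=0}^{n-1} 1 / (m + i)
    A = sum(1.0 / (m + i) for i in range(n))
    # B = psi1(m + n) - psi1(m) = -sum_{i=0}^{n-1} 1 / (m + i)^2
    B = -sum(1.0 / (m + i) ** 2 for i in range(n))
    mean_L = m * A
    var_L = mean_L + m ** 2 * B + (m * B + A) ** 2 * a0 / b0 ** 2
    return mean_L, var_L

# Prior settings a0 = b0 = 5 with n = 34 actors.
mean_L, var_L = prior_cluster_moments(5.0, 5.0, 34)
```

Evaluating the function over a grid of $(a_0, b_0)$ values is one way to choose hyperparameters that match a prior guess for the number of clusters.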
Dynamic model I
• Suppose we observe networks $y_t = [y_{t,ij}]$ for $t = 1, \dots, T$. Consider

$$y_{t,ij} \mid p_{t,ij} \overset{\text{indep}}{\sim} \text{Bernoulli}(p_{t,ij}), \quad (t = 1, \dots, T, \; 1 \le i < j \le n),$$
$$\Phi^{-1}(p_{t,ij}) = \theta_{it} + \theta_{jt} + \sum_{k=1}^{K} \beta^*_k \, 1\{z_i = z_j = k\}.$$

  - Assume community memberships remain unchanged but the popularities of the actors can vary with time.
  - This assumption is appropriate for data where communities arise due to factors that are unlikely to vary drastically over time (physical locations, job positions). Changes in ties are attributed to variations in the popularities of nodes.
  - As in the static model, we induce clustering among $\{\theta_{it}\}$ using a DP,

$$\theta_{it} \mid G \overset{\text{iid}}{\sim} G \text{ for } i = 1, \dots, n, \; t = 1, \dots, T, \qquad G \sim \text{DP}(\alpha, G_0),$$

    where $G_0$ is $N(0, \sigma_\theta^2)$ and $\alpha \sim \text{Gamma}(a_\alpha, b_\alpha)$.
  - The $\{\beta^*_k\}$ and $\{z_i\}$ are modeled using a DP as in the static model.
Dynamic model II
• Allows the tie between nodes $i$ and $j$ at time $t$ to depend on the existence of the tie at the previous time point. Assumes that the popularities and community memberships of the actors remain unchanged over time.

$$\Phi^{-1}(p_{t,ij}) = \eta \, y_{t-1,ij} \, 1\{t > 1\} + \theta_i + \theta_j + \sum_{k=1}^{K} \beta^*_k \, 1\{z_i = z_j = k\},$$

where $\eta \sim N(0, \sigma_\eta^2)$.
• $\eta$ measures the persistence of ties in the network.
  - $\eta > 0$: a tie is more likely to be present at time $t$ if it was present at $t - 1$.
  - $\eta < 0$: a tie is more likely to be present at time $t$ if it was absent at $t - 1$.
• $\{\theta_i\}$, $\{z_i\}$ and $\{\beta^*_k\}$ are modeled as in the static model.
• The popularities and communities inferred from this model smooth out the data and provide an overview of the behavior of actors over time.
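The role of the persistence parameter can be illustrated with a short Python sketch of the tie probability in Dynamic model II. The function name `tie_probability`, the 0-based group labels and all numerical values are assumptions for illustration only.

```python
from statistics import NormalDist

def tie_probability(t, y_prev, eta, theta_i, theta_j, z_i, z_j, beta_star):
    """Tie probability p_{t,ij} under Dynamic model II.

    y_prev is the tie indicator y_{t-1,ij}; it is ignored at t = 1.
    beta_star uses 0-based community labels (an assumption of this sketch).
    """
    persistence = eta * y_prev if t > 1 else 0.0
    community = beta_star[z_i] if z_i == z_j else 0.0
    return NormalDist().cdf(persistence + theta_i + theta_j + community)

# Hypothetical values with positive persistence.
beta_star = [1.3]
p_after_tie = tie_probability(2, 1, 0.6, -0.8, -0.8, 0, 0, beta_star)
p_after_no_tie = tie_probability(2, 0, 0.6, -0.8, -0.8, 0, 0, beta_star)
```

With a positive `eta`, the probability of a tie at the second time point is larger when the tie was present at the first time point, all else being equal. A negative `eta` would reverse the ordering.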
Posterior inference
• We use Gibbs samplers to obtain posterior inference for the proposed models.
• Sampling from the DP is performed using the methods in Neal (2000), while the concentration parameters α and ν are sampled using the method in Escobar and West (1995).
• The algorithms are coded in Julia. It is possible to use standard software, e.g. OpenBUGS, to obtain posterior inference by considering a truncated DP (Ishwaran and Zarepour, 2000). However, the runtime in OpenBUGS is significantly longer than in Julia when the number of nodes is large.
• For the applications, we initialize multiple MCMC chains from random starting points and use diagnostic plots to check for convergence.
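As an illustration of one Gibbs step, the sketch below shows the auxiliary-variable update of Escobar and West (1995) for a DP concentration parameter with a Gamma(a, b) prior (shape a, rate b), given the current number of clusters k among n items. It is written in Python rather than Julia and is not the authors' implementation. The function name `update_concentration` and the example settings are assumptions.

```python
import math
import random

def update_concentration(alpha, k, n, a, b, rng):
    """One Escobar-West update of a DP concentration parameter.

    alpha : current value of the concentration parameter
    k     : current number of clusters
    n     : number of items
    a, b  : shape and rate of the Gamma prior on alpha
    """
    # Step 1: auxiliary variable x | alpha ~ Beta(alpha + 1, n).
    x = rng.betavariate(alpha + 1.0, n)
    rate = b - math.log(x)
    # Step 2: mixture weight with odds (a + k - 1) / (n * (b - log x)).
    odds = (a + k - 1.0) / (n * rate)
    weight = odds / (1.0 + odds)
    shape = a + k if rng.random() < weight else a + k - 1.0
    # random.gammavariate takes (shape, scale), so scale = 1 / rate.
    return rng.gammavariate(shape, 1.0 / rate)

# Hypothetical run: n = 34 items, k = 3 clusters, Gamma(5, 5) prior.
rng = random.Random(1)
alpha = 1.0
draws = []
for _ in range(200):
    alpha = update_concentration(alpha, k=3, n=34, a=5.0, b=5.0, rng=rng)
    draws.append(alpha)
```

In a full sampler, k would change between iterations as cluster assignments are updated. It is held fixed here only to keep the sketch short.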
Cluster analysis
• From the MCMC output, we compute the $n \times n$ posterior similarity matrix $S$, where $S_{ij}$ is the posterior probability that actors $i$ and $j$ belong to the same cluster.
  - Estimate $S_{ij}$ by the proportion of MCMC iterations in which $i$ and $j$ are in the same cluster.
  - $S_{ij}$ is not affected by “label-switching” (cluster labels may change during MCMC runs) or by the number of clusters varying across iterations.
• Compute a (hard) clustering estimate using Binder’s loss function (total number of disagreements between estimated and true clustering among all pairs of actors).
• The function minbinder from the R package mcclust can be used to find the clustering $c^* = [c^*_1, \dots, c^*_n]$ that minimizes the posterior expectation of this loss. The posterior expected loss can be written as

$$\sum_{i<j} \left| 1\{c^*_i = c^*_j\} - S_{ij} \right|,$$

where the sum is taken over all possible pairs of actors.
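The two quantities above can be sketched in Python as follows. This is an illustration and not a reimplementation of minbinder: the function names are assumptions, and the search for a minimizer is restricted to the sampled clusterings, which is a simplification chosen for this sketch.

```python
def posterior_similarity(samples):
    """Estimate S[i][j] as the proportion of sampled clusterings in which
    actors i and j share a cluster. `samples` is a list of label lists."""
    n = len(samples[0])
    counts = [[0] * n for _ in range(n)]
    for labels in samples:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    m = len(samples)
    return [[counts[i][j] / m for j in range(n)] for i in range(n)]

def expected_binder_loss(c, S):
    """Posterior expected Binder loss of clustering c, summed over pairs i < j."""
    n = len(c)
    return sum(abs((1.0 if c[i] == c[j] else 0.0) - S[i][j])
               for i in range(n) for j in range(i + 1, n))

def best_sampled_clustering(samples, S):
    """Return the sampled clustering with the smallest expected Binder loss."""
    return min(samples, key=lambda c: expected_binder_loss(c, S))

# Toy MCMC output for three actors. The first two draws describe the same
# partition with switched labels.
samples = [[0, 0, 1], [1, 1, 0], [0, 0, 1], [0, 1, 1]]
S = posterior_similarity(samples)
c_hat = best_sampled_clustering(samples, S)
```

Because `S` only records whether two actors share a cluster, the first two draws contribute identically even though their labels differ, which is why label-switching does not affect the estimate.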
Karate club network
• This dataset contains 78 undirected friendship links among 34 members. Due to disputes over the price of lessons, the club was divided into two factions, led by the instructor Mr Hi (actor 1) and the president John A. (actor 34).
• The club eventually split into two separate clubs. All members joined the club of their own faction except actor 9.
• Static model (3 parallel chains, 40,000 iterations each, total time: 172 s). Set $a_\nu = b_\nu = a_\alpha = b_\alpha = 5$ and $\sigma_\theta^2 = \sigma_\beta^2 = 1$.
Figure: Posterior distributions of K, ν, L and α (Community panels: K and ν; Popularity panels: L and α). For ν and α, prior distributions are shown in dotted lines and posterior distributions in solid lines.
Posterior similarity matrices and popularities
[Figure: Posterior similarity matrices for Community and Popularity (scale 0.0 to 1.0), and a plot of the posterior mean of θi against each actor’s degree.]
Mr Hi (1), John A. (34) and a few other actors {2, 3, 33} have high popularity, but the rest have much lower activity levels.
Hard clustering estimates
[Figure: Two plots of the karate club network, one for Community and one for Popularity.]
Community legend: Gp 1: 1.8 (0.27), Gp 3: 1.74 (0.26).
Popularity legend: Gp 1: 0.56 (0.15), Gp 2: −0.24 (0.2), Gp 3: −1.51 (0.14).
• Plots of network: nodes are colored according to the hard clustering estimates. Singletons are not colored. The legends give the posterior means of β∗ and θ∗ and their standard deviations (in brackets), conditional on the clustering structure.
• For communities, Gp 3 is exactly the faction led by John A. (Zachary, 1977), while Gp 1 together with {3} is the faction led by Mr Hi (actor 3 has posterior probability 0.4 of being clustered with members of Gp 1 and 0.05 of being clustered with members of Gp 3).
Interpreting results
• If we drop {θi} from the static model and consider just the blockmodel, we obtain five clusters: {1}, {3}, {33}, {34} and all other members.
  - The network is split into high-degree nodes and low-degree nodes.
  - This shows the importance of accounting for degree variation in blockmodels (Karrer and Newman, 2011).
  Our static model tries to address these issues using a nonparametric approach via the automatic clustering structures induced by the DP.
• While hard partitions of the network are easier to interpret, the posterior similarity matrices reveal finer details regarding the degree of affiliation of actors towards the clusters that they are assigned to in the hard split.
  - E.g. actor 10 is assigned to the cluster led by John A., but he has a slightly lower posterior probability (≈ 0.5) of being with other members in this cluster than the rest, and also has a posterior probability (≈ 0.2) of being in the same cluster as members of Mr Hi’s faction.
Kapferer’s tailor shop network
• Data on interactions among 39 workers in a tailor shop in Southern Africa, from June 1965 to Feb 1966 (Kapferer, 1972).
• The workers’ duties can be classified into eight categories:
  - More prestigious: head tailor, cutter, line 1 tailor, button machiner
  - Less prestigious: line 3 tailor, ironer, cotton boy, line 2 tailor
• Focus on symmetric “sociational” networks recorded at two time points:
  1. before an aborted strike,
  2. after a successful strike for higher wages.
• The network at the second time point (223 edges) is denser than the first (158 edges) as the workers strive to be more united (thereby expanding their social relations) in their efforts to change the wage system.
Dynamic model I
• Assume communities remain unchanged and that the emergence or dissolution of ties is due to changes in the activity levels of actors.
• Set $a_\nu = b_\nu = a_\alpha = b_\alpha = 10$ and $\sigma_\theta^2 = \sigma_\beta^2 = 1$ (3 parallel chains, 15,000 iterations each, total time: 139 s).
Figure: Marginal posterior distributions of K, ν, L and α (Community panels: K and ν; Popularity panels: L and α). For ν and α, prior distributions are shown in dotted lines and posterior distributions in solid lines.
Hard clustering estimates
[Figure: Network plots at t=1 and t=2 for Community and for Popularity. Node legend by job: head tailor (19), cutter (16), line 1 tailor, button machiner, line 3 tailor, ironer, cotton boy, line 2 tailor.]
Community legend: Gp 1: 1.65 (0.13), Gp 2: 1.46 (0.31), Gp 3: 1.45 (0.29), Gp 9: 1.55 (0.14).
Popularity legend: Gp 1: −1.41 (0.09), Gp 2: −0.46 (0.03), Gp 3: 0.57 (0.08).
Dynamic model I results
• High degree of job homophily in the communities even though the networks are based on casual interactions. Groups 1 and 2: cutter, line 1 and line 2 tailors (more prestigious jobs); Gp 3: line 3 tailors; Gp 9: ironers and cotton boys.
[Figure: Count of workers with low, average and high popularity at t=1 and t=2, and proportion of workers with low, average and high popularity at t=1 and t=2 for each job (head tailor, cutter, line 1 tailor, button machiner, line 3 tailor, ironer, cotton boy, line 2 tailor).]
• Three popularity clusters, −1.41 (Gp 1), −0.46 (Gp 2) and 0.57 (Gp 3), representing “low”, “average” and “high” popularity.
• The number of workers with low popularity decreased from t = 1 to t = 2, while the number with average or high popularity increased (reflecting the efforts of workers in expanding social ties after the first unsuccessful strike).
Dynamic model I results
• The proportion of workers with low and average popularity remained unchanged over the two time points for the ironers, cotton boys and line 2 tailors (positions with lower prestige).
• Changes in popularity arise mainly from line 1 tailors, button machiners and line 3 tailors.
• These observations are consistent with the analysis of Kapferer (1972), who noted that line 1 tailors made a strong attempt to expand their links after the first unsuccessful strike as they stood to benefit the most from the change in the wage system.
Dynamic model II
• The probability that a tie is formed depends on whether a tie exists at the previous time point as well as the community membership of the nodes and their popularities.
• Consider 3 parallel chains, 15,000 iterations, total runtime: 106 s.
Figure: Marginal posterior distributions of K, ν, L, α and η. For ν and α, prior distributions are shown in dotted lines and posterior distributions in solid lines.
• The posterior mean of η is 0.58 and its posterior mass is concentrated on positive values. This indicates that a tie is likely to persist at the second time point given that it existed at the first time point.
Dynamic model II results
[Figure: Network plots at t=1 and t=2 for Community and for Popularity. Node legend by job: head tailor (19), cutter (16), line 1 tailor, button machiner, line 3 tailor, ironer, cotton boy, line 2 tailor.]
Community legend: Gp 1: 1.33 (0.13), Gp 2: 1.38 (0.39), Gp 4: 1.39 (0.28), Gp 6: 0.54 (0.8), Gp 10: 1.76 (0.15).
Popularity legend: Gp 1: −0.81 (0.04), Gp 2: −0.16 (0.05), Gp 3: −1.39 (0.11), Gp 4: 0.74 (0.11), Gp 5: −0.47 (0.1), Gp 6: 0.24 (0.16).
Dynamic model II results
• The communities detected are similar to those of Dynamic model I except for changes to individuals {14, 16, 19, 21}.
• The number of popularity clusters increased from three in Dynamic model I to six in Dynamic model II. In model II, the popularity of an actor summarizes his activity level across all time points.
Figure: Mean of θi against actor i’s degree at t = 1 (left) and t = 2 (right).
• The head tailor (19) and cutter (16) have significantly higher popularity than the rest, followed by actor 24 and the actors in popularity Gp 2 (which includes individuals who play significant roles in the factory’s social relationships).
Conclusion
• We propose a nonparametric Bayesian approach for detecting communities in social networks, using degree-corrected stochastic blockmodels.
• The numbers of communities and popularity clusters are inferred from the data through use of the Dirichlet process.
• The inferred popularity clusters summarize the popularities of the actors and help in identifying key players in the network.
• Extensions of the static model to dynamic networks.
  - Dynamic model I: studies changes in the activity levels of actors over time.
  - Dynamic model II: measures the persistence of links in the network.
• While Gibbs samplers are feasible for small networks, they do not scale well to large networks, and more efficient methods of estimation, such as variational approximation methods, can be developed.
Tan, L. S. L. and De Iorio, M. (2018). Dynamic degree-corrected blockmodels for social networks: a nonparametric approach. Statistical Modelling.