Community)Detec-on)for))...
Transcript of Community)Detec-on)for))...
Community Detec-on for Social Compu-ng
Lei Tang Yahoo! Labs
2011/09/16 @UC Berkeley
1
Most materials are adapted from Community Detec-on and Mining in Social Media. Lei Tang and Huan Liu, Morgan & Claypool, September, 2010.
hLp://dmml.asu.edu/cdm/
Community • Community: It is formed by individuals such that those within a
group interact with each other more frequently than with those outside the group – a.k.a. group, cluster, cohesive subgroup, module in different contexts
• Community detec-on: discovering groups in a network where individuals’ group memberships are not explicitly given
• Why communi-es in social media? – Human beings are social – Easy-‐to-‐use social media allows people to extend their social life in
unprecedented ways – Difficult to meet friends in the physical world, but much easier to find
friend online with similar interests – Interac-ons between nodes can help determine communi-es
2
Real-‐World Communi-es
3
Egypt Protest, 2011
Communi-es by Subscrip-on Communi-es from
Facebook Communi-es from
Flickr
Communi-es of Poli-cal Blogs
5 Lada Admic and Natalie Glance, The Poli-cal Blogosphere and the 2004 U.S. Elec-on: Divided They Blog, 2005
Liberal
Conserva-ve
Communi-es in TwiLer
6 hLp://ebiquity.umbc.edu/blogger/2007/04/19/twiLer-‐social-‐network-‐analysis/
Communi-es of Personal Social Network
7 Generated based on Lei Tang’s LinkedIn connec-ons on 2011/09/15
Graduate School
College
Communi-es in Protein-‐Protein Interac-on Networks
8
Communi-es in Social Media • Two types of groups in social media
– Explicit Groups: formed by user subscrip-ons – Implicit Groups: implicitly formed by social interac-ons
• Some social media sites allow people to join groups, is it necessary to extract groups based on network topology? – Not all sites provide community plagorm – Not all people want to make effort to join groups – Groups can change dynamically
• Network interac-on provides rich informa-on about the rela-onship between users – Can complement other kinds of informa-on – Provide basic informa-on for other social compu-ng tasks
9
Applica-ons of Community Detec-on
• Visualiza-on & network naviga-on • Data compression • Structural posi-on and role analysis • Topic detec-on in collabora-ve tagging systems • Tag disambigua-on • Iden-fying #unique users in networks • User profiling based on neighborhood smoothing • Recommenda-on and targe-ng • Event detec-on
10
Roadmap • Overview of Community Detec-on Methods
– Node-‐Centric – Group-‐Centric – Network-‐Centric – Hierarchy-‐Centric
• Communi-es in Social Media – Sta-s-cal Proper-es – Community Evolu-on – Heterogeneous Networks – Community Evalua-on – Scaling Community Detec-on
• Applica-on of Community Detec-on for Social Media Mining
11
COMMUNITY DETECTION
12
Subjec-vity of Community Defini-on Each component is a
community A densely-‐knit community
Defini-on of a community can be subjec-ve.
13
Craze about Communi-es?
• Subjec-vity of communi-es – challenging for evalua-on (covered later) – Promo-ng various defini-ons and models
• Boom of social media and its novel challenges
14
ArBcle/Book Year References
A classifica-on for Community Discovery Methods in Complex Networks
2011 168
Community Detec-on in Social Media 2011 28 ref/page * 4.5 pages = 126
Community Detec-on in Graphs 2010 58 ref/page * 7.5 pages = 435
Community Detec-on and Mining in Social Media 2010 Around 100
Taxonomy of Community Criteria • Criteria vary depending on the tasks • Roughly can be divided into 4 categories (not exclusive): • Node-‐Centric Community
– Each node in a group sa-sfies certain proper-es • Group-‐Centric Community
– Consider the connec-ons within a group as a whole. The group has to sa-sfy certain proper-es without zooming into node-‐level
• Network-‐Centric Community – Op-mize a quality func-on wrt. the whole network topology
• Hierarchy-‐Centric Community – Construct a hierarchical structure of communi-es
15
For simplicity, we focus on undirected network with boolean edge weights
Node-‐Centric Community Detec-on
• Nodes sa-sfy different proper-es – Complete Mutuality
• cliques – Reachability of members
• k-‐clique, k-‐clan, k-‐club – Nodal degrees
• k-‐plex, k-‐core – Rela-ve frequency of Within-‐Outside Ties
• LS sets, Lambda sets
• Commonly used in tradi-onal social network analysis • Open involves combinatorial op-miza-on • Here, we discuss some representa-ve ones
16
Complete Mutuality: Cliques
• Clique: a maximum complete subgraph in which all nodes are adjacent to each other
• NP-‐hard to find the maximum clique in a network • Straighgorward implementa-on to find cliques is very
expensive in -me complexity
Nodes 5, 6, 7 and 8 form a clique
17
Finding the Maximum Clique
• In a clique of size k, each node maintains degree >= k-‐1
• Nodes with degree < k-‐1 will not be included in the maximum clique
• Recursively apply the following pruning procedure – Sample a sub-‐network from the given network, and find a clique in the
sub-‐network, say, by a greedy approach
– Suppose the clique above is size k, in order to find out a larger clique, all nodes with degree <= k-‐1 should be removed.
• Repeat un-l the network is small enough
• Many nodes will be pruned as social media networks follow a power law distribu-on for node degrees
18
Maximum Clique Example
• Suppose we sample a sub-‐network with nodes {1-‐5} and find a clique {1, 2, 3} of size 3
• In order to find a clique >3, remove all nodes with degree <=3-‐1=2 – Remove nodes 2 and 9
– Remove nodes 1 and 3
– Remove node 4
19
Clique Percola-on Method (CPM)
• Clique is a very strict defini-on, unstable • Normally use cliques as a core or a seed to find larger
communi-es
• CPM is such a method to find overlapping communi-es – Input
• A parameter k, and a network – Procedure
• Find out all cliques of size k in a given network • Construct a clique graph. Two cliques are adjacent if they share k-‐1 nodes
• Each connected components in the clique graph form a community
20
CPM Example Cliques of size 3: {1, 2, 3}, {1, 3, 4}, {4, 5, 6}, {5, 6, 7}, {5, 6, 8}, {5, 7, 8}, {6, 7, 8}
Communi-es: {1, 2, 3, 4} {4, 5, 6, 7, 8}
21
Reachability : k-‐clique, k-‐club
• Any node in a group should be reachable in k hops • k-‐clique: a maximal subgraph in which the largest geodesic
distance between any nodes <= k • k-‐club: a substructure of diameter <= k
• A k-‐clique might have diameter larger than k in the subgraph • Commonly used in tradi-onal SNA • Open involves combinatorial op-miza-on
Cliques: {1, 2, 3} 2-‐cliques: {1, 2, 3, 4, 5}, {2, 3, 4, 5, 6} 2-‐clubs: {1,2,3,4}, {1, 2, 3, 5}, {2, 3, 4, 5, 6}
22
Group-‐Centric Community Detec-on: Density-‐Based Groups
• The group-‐centric criterion requires the whole group to sa-sfy a certain condi-on – E.g., the group density >= a given threshold
• A subgraph is a quasi-clique if
• A similar strategy to that of cliques can be used – Sample a subgraph, and find a maximal quasi-
clique (say, of size k) – Remove nodes with degree
Gs(Vs, Es) γ − dense
|Es||Vs|(|Vs|− 1)/2
≥ γ
γ − dense
< kγ
23
Network-‐Centric Community Detec-on
• Network-‐centric criterion needs to consider the connec-ons within a network globally
• Goal: par--on nodes of a network into disjoint sets • Approaches:
– Clustering based on vertex similarity – Latent space models
– Block model approxima-on
– Spectral clustering – Modularity maximiza-on
24
Clustering based on Vertex Similarity
• Apply k-‐means or similarity-‐based clustering to nodes
• Vertex similarity is defined in terms of the similarity of their neighborhood
• Structural equivalence: two nodes are structurally equivalent iff they are connec-ng to the same set of actors
• Structural equivalence is too restrict for prac-cal use.
Nodes 1 and 3 are structurally equivalent; So are nodes 5 and 6.
25
(1) Clustering based on vertex similarity
Vertex Similarity
• Jaccard Similarity
• Cosine similarity
Jaccard(4, 6) =|{5}|
|{1, 3, 4, 5, 6, 7, 8}| =1
7
cosine(4, 6) =1√4 · 4
=1
426
(1) Clustering based on vertex similarity
Jaccard(vi, vj) =|Ni ∩Nj ||Ni ∪Nj |
cosine(vi, vj) =
�k AikAjk��s A
2is
�t A
2jt
Latent Space Models
• Map nodes into a low-‐dimensional space such that the proximity between nodes based on network connec-vity is preserved in the new space, then apply k-‐means clustering
• Mul--‐dimensional scaling (MDS) – Given a network, construct a proximity matrix P represen-ng the
pairwise distance between nodes (e.g., geodesic distance)
– Let denote the coordinates of nodes in the low-‐dimensional space
– Objec-ve func-on: – Solu-on: – V is the top eigenvectors of , and is a diagonal matrix of top
eigenvalues
€
S∈Rn×
SST ≈ −1
2(I − 1
n11T )(P ◦ P )(I − 1
n11T ) = �P
min �SST − �P�2FS = V Λ
12
� �PΛ = diag(λ1,λ2, · · · ,λ�)
Λ
27
(2) Latent space models
MDS Example
P =
0 1 1 1 2 2 3 3 41 0 1 2 3 3 4 4 51 1 0 1 2 2 3 3 41 2 1 0 1 1 2 2 32 3 2 1 0 1 1 1 22 3 2 1 1 0 1 1 23 4 3 2 1 1 0 1 13 4 3 2 1 1 1 0 24 5 4 3 2 2 1 2 0
�P =
2.46 3.96 1.96 0.85 −0.65 −0.65 −2.21 −2.04 −3.653.96 6.46 3.96 1.35 −1.15 −1.15 −3.71 −3.54 −6.151.96 3.96 2.46 0.85 −0.65 −0.65 −2.21 −2.04 −3.650.85 1.35 0.85 0.23 −0.27 −0.27 −0.82 −0.65 −1.27
−0.65 −1.15 −0.65 −0.27 0.23 −0.27 0.68 0.85 1.23−0.65 −1.15 −0.65 −0.27 −0.27 0.23 0.68 0.85 1.23−2.21 −3.71 −2.21 −0.82 0.68 0.68 2.12 1.79 3.68−2.04 −3.54 −2.04 −0.65 0.85 0.85 1.79 2.46 2.35−3.65 −6.15 −3.65 −1.27 1.23 1.23 3.68 2.35 6.23
!3 !2 !1 0 1 2 3!1
!0.8
!0.6
!0.4
!0.2
0
0.2
0.4
0.6
0.8
1
2
34
5 6
7
8
9
S1
S2
Two communi-es: {1, 2, 3, 4} and {5, 6, 7, 8, 9}
geodesic distance
28
(2) Latent space models
Block Models
• S is the community indicator matrix
• Relax S to be numerical values, then the op-mal solu-on corresponds to the top eigenvectors of A
min ||A− SΣST ||2F
Two communi-es: {1, 2, 3, 4} and {5, 6, 7, 8, 9}
29
(3) Block model approxima-on
Cut
• Most interac-ons are within group whereas interac-ons between groups are few
• community detec-on minimum cut problem • Cut: A par--on of ver-ces of a graph into two disjoint sets • Minimum cut problem: find a graph par--on such that the
number of edges between the two sets is minimized
30
(4) Spectral clustering
Ra-o Cut & Normalized Cut
• Minimum cut open returns an imbalanced par--on, with one set being a singleton
• Change the objec-ve func-on to consider community size
Ratio Cut(π) =1
k
k�
i=1
cut(Ci, C̄i)
|Ci|,
Normalized Cut(π) =1
k
k�
i=1
cut(Ci, C̄i)
vol(Ci)
Ci,: a community |Ci|: number of nodes in Ci vol(Ci): sum of degrees in Ci
31
(4) Spectral clustering
Ra-o Cut & Normalized Cut Example
Ratio Cut(π1) =1
2
�1
1+
1
8
�= 9/16 = 0.56
Normalized Cut(π1) =1
2
�1
1+
1
27
�= 14/27 = 0.52
For parBBon in red:
For parBBon in green:
Normalized Cut(π2) =1
2
�2
12+
2
16
�= 7/48 = 0.15 < Normalized Cut(π1)
Ratio Cut(π2) =1
2
�2
4+
2
5
�= 9/20 = 0.45 < Ratio Cut(π1)
π1
π2
Both ra-o cut and normalized cut prefer a balanced par--on 32
(4) Spectral clustering
Spectral Clustering
• Both ra-o cut and normalized cut can be reformulated as
• Where
• Spectral relaxa-on: • Op-mal solu-on: top eigenvectors with the smallest
eigenvalues
�L =
�D −AI −D−1/2AD−1/2
graph Laplacian for ra-o cut
normalized graph Laplacian
D = diag(d1, d2, · · · , dn) A diagonal matrix of degrees
minS
Tr(ST �LS) s.t. STS = Ik
minS∈{0,1}n×k
Tr(ST �LS)
33
(4) Spectral clustering
Spectral Clustering Example
D = diag(3, 2, 3, 4, 4, 4, 4, 3, 1)
�L = D −A =
3 −1 −1 −1 0 0 0 0 0−1 2 −1 0 0 0 0 0 0−1 −1 3 −1 0 0 0 0 0−1 0 −1 4 −1 −1 0 0 00 0 0 −1 4 −1 −1 −1 00 0 0 −1 −1 4 −1 −1 00 0 0 0 −1 −1 4 −1 −10 0 0 0 −1 −1 −1 3 00 0 0 0 0 0 −1 0 1
S =
0.33 −0.380.33 −0.480.33 −0.380.33 −0.120.33 0.160.33 0.160.33 0.300.33 0.240.33 0.51
Two communi-es: {1, 2, 3, 4} and {5, 6, 7, 8, 9}
The 1st eigenvector means all nodes belong to the same cluster, no
use
k-‐means
34
(4) Spectral clustering
Modularity Maximiza-on
• Modularity measures the strength of a community par--on by taking into account the degree distribu-on
• Given a network with m edges, the expected number of edges between two nodes with di and dj is
• Strength of a community:
• Modularity: • A larger value indicates a good community structure
didj/2m
The expected number of edges between nodes 1 and 2 is
3*2/ (2*14) = 3/14 �
i∈C,j∈C
Aij − didj/2m
Q =1
2m
k�
�=1
�
i∈C�,j∈C�
(Aij − didj/2m)
35
(5) Modularity maximiza-on
Modularity Matrix
• Modularity matrix:
• Similar to spectral clustering, modularity maximiza-on can be reformulated as
• Op-mal solu-on: top eigenvectors of the modularity matrix • Apply k-‐means to S as a post-‐processing step to obtain
community par--on
B = A− ddT /2m (Bij = Aij − didj/2m)
maxQ =1
2mTr(STBS) s.t. STS = Ik
36
(5) Modularity maximiza-on
Modularity Maximiza-on Example
B =
−0.32 0.79 0.68 0.57 −0.43 −0.43 −0.43 −0.32 −0.110.79 −0.14 0.79 −0.29 −0.29 −0.29 −0.29 −0.21 −0.070.68 0.79 −0.32 0.57 −0.43 −0.43 −0.43 −0.32 −0.110.57 −0.29 0.57 −0.57 0.43 0.43 −0.57 −0.43 −0.14
−0.43 −0.29 −0.43 0.43 −0.57 0.43 0.43 0.57 −0.14−0.43 −0.29 −0.43 0.43 0.43 −0.57 0.43 0.57 −0.14−0.43 −0.29 −0.43 −0.57 0.43 0.43 −0.57 0.57 0.86−0.32 −0.21 −0.32 −0.43 0.57 0.57 0.57 −0.32 −0.11−0.11 −0.07 −0.11 −0.14 −0.14 −0.14 0.86 −0.11 −0.04
S =
0.44 −0.000.38 0.230.44 −0.000.17 −0.48
−0.29 −0.32−0.29 −0.32−0.38 0.34−0.34 −0.08−0.14 0.63
Modularity Matrix
k-‐means
Two Communi-es: {1, 2, 3, 4} and {5, 6, 7, 8, 9}
37
(5) Modularity maximiza-on
A Unified View for Community Par--on
Utility Matrix M =
modified proximity matrix �P if latent space modelsadjacency matrix A if block models
graph Laplacian �L if spectral clusteringmodularity maximization B if modularity maximization
• Latent space models, block models, spectral clustering, and modularity maximiza-on can be unified as
38
Hierarchy-‐Centric Community Detec-on
• Goal: build a hierarchical structure of communi-es based on network topology
• Allow the analysis of a network at different resolu-ons
• Representa-ve approaches: – Divisive Hierarchical Clustering – Agglomera-ve Hierarchical clustering
39
Divisive Hierarchical Clustering
• Divisive clustering – Par--on nodes into several sets – Each set is further divided into smaller ones – Network-‐centric par--on can be applied for the par--on
• One par-cular example: recursively remove the “weakest” -e – Find the edge with the least strength – Remove the edge and update the corresponding strength of each edge
• Recursively apply the above two steps un-l a network is discomposed into desired number of connected components.
• Each component forms a community
40
Edge Betweenness
• The strength of a -e can be measured by edge betweenness
• Edge betweenness: the number of shortest paths that pass along with the edge
• The edge with higher betweenness tends to be the bridge between two communi-es.
The edge betweenness of e(1, 2) is 4, as all the shortest paths from 2 to {4, 5, 6, 7, 8, 9} have to either pass e(1, 2) or e(2, 3), and e(1,2) is the shortest path between 1 and 2
edge-betweenness(e) = Σs<tσst(e)
σs,t
41
Divisive clustering based on edge betweenness
Aper remove e(4,5), the betweenness of e(4, 6) becomes 20, which is the highest;
Aper remove e(4,6), the edge e(7,9) has the highest betweenness value 4, and should be removed.
Ini-al betweenness value
42
Agglomera-ve Hierarchical Clustering
• Ini-alize each node as a community
• Merge communi-es successively into larger communi-es following a certain criterion – E.g., based on modularity increase
43
Summary of Community Detec-on
• Node-‐Centric: cliques, k-‐cliques, k-‐clubs • Group-‐Centric: quasi-‐cliques • Network-‐Centric:
– Clustering based on vertex similarity
– Latent space models, block models, spectral clustering, modularity maximiza-on
• Hierarchy-‐Centric: Divisive, Agglomera-ve
• Difficult to determine which suits the best
• Other dimensions to explore: – Directed network – Overlapping communi-es
– Sop clustering – Op-miza-on 44
COMMUNITIES IN SOCIAL MEDIA
45
Ques-ons & Challenges
• Sta-s-cal Proper-es of Communi-es – any structural paLerns for community size, density and growth?
• Evolu-on – Timeliness is emphasized in social media – Interac-ons are highly dynamic
• Heterogeneity – Various types of en--es and interac-ons are involved
• Evalua-on – Lack of ground truth, and complete informa-on due to privacy
• Scalability – Social networks are open in a scale of millions of nodes and connec-ons – Tradi-onal Network Analysis open deals with very limited number of of
subjects
46
Size Distribu-on of Explicit Communi-es (LiveJournal)
47 Lei Tang, Xufei Wang and Huan Liu, Group Profiling for Understanding Social Structures, TIST, 2011
16K bloggers, 132k links, 100k groups
Sta-s-cal property
Size Distribu-on of Explicit Communi-es (Flickr)
48
907K users, 43k groups
Sta-s-cal property
Densi-es of Flickr Groups
49
Average Network Density
Sta-s-cal property
Distribu-on Implicit Communi-es (consider each connected component as a community)
50
100 101 102 103 104 105 106 107 108 109100
101
102
103
104
105
106
107
size
frequency
The giant component
Y! Instant Messenger Contact Network: Hundreds of millions of nodes
Billions of connec-ons
Kun liu and Lei Tang, Large Scale Behavioral Targe-ng with a Social Twist, CIKM 2011
Sta-s-cal property
How does a network look like? Zooming into the giant component
51 J. Leskovec, K. Lang, A. Dasgupta, M. Mahoney. Sta-s-cal Proper-es of Community Structure in Large Social and Informa-on Networks, WWW 2008
Sta-s-cal property
Community Growth wrt. #friends
• Diminishing returns – Probability of joining increases with the number friends in the group
– But increases get smaller and smaller 52
Livejournal DBLP
Group Forma-on in Large Social Networks: Membership, Growth, and Evolu-on, KDD 2006
Sta-s-cal property
Community Growth: More subtle features
• Connectedness of friends: – x and y have three friends in the group – x’s friends are independent – y’s friends are all connected
• Who is more likely to join?
53 Adapted from Jure Leskovec’s lecture slides @Stanford
Sta-s-cal property
Community Growth wrt. Connectedness of Friends
54
Sta-s-cal property
Community Evolu-on
• Communi-es also expand, shrink , or dissolve in dynamic networks
• How to uncover latent community change behind dynamic network interac-ons?
55
Naïve Approach to Studying Community Evolu-on
• Take snapshots of a network • find communi-es at each snapshot • Clustering independently at each
snapshot
• Cons: – Most community detec-on methods
produce local op-mal solu-ons
– Hard to determine if the evolu-on is due to the evolu-on or algorithm randomness
56
Community Evolu-on
Naïve Approach Example
• There is a sharp change at T2 • This approach may report spurious structural changes
!" #$
%&
" #
'
$
()
!" #$
%&
" #
'
$
()
A(1) A(2)A(3)at T1 at T3 at T2
H(2) =
1 01 01 01 01 01 01 01 00 1
H(1) =
1 01 01 01 00 10 10 10 10 1
H(3) =
1 01 01 00 10 10 10 10 10 1
57
Community Evolu-on
Evolu-onary Clustering in Smoothly Evolving Networks
• Evolu-onary Clustering: find a smooth sequence of communi-es given a series of network snapshots
• Objec-ve func-on: snapshot cost (CS) + temporal cost (CT)
• Take spectral clustering as an example – Snapshot cost : – Temporal cost:
• Community Evolu-on: where
Cost = α · CS + (1− α) · CT
CSt = Tr(STt LtSt), s.t. ST
t St = Ik
CTt = �St − St−1�2 St is s-ll a valid solu-on aper an orthogonal transforma-on
CTt =1
2�StS
Tt − St−1S
Tt−1�2
Costt = Tr�STt�LtSt
�
�Lt = I − α ·D−1/2t A(t)D−1/2
t − (1− α) · St−1STt−1
58
Community Evolu-on
Evolu-onary Clustering Example !
" #$
%&
" #
'
$
()
!" #$
%&
" #
'
$
()
A(1) A(2)A(3)at T1 at T3 at T2
S1 =
0.33 −0.440.27 −0.430.33 −0.440.38 −0.160.38 0.240.38 0.240.38 0.380.33 0.300.19 0.23
�L2 =
0.91 −0.42 −0.33 −0.21 −0.01 −0.01 0.01 0.01 0.01−0.42 0.92 −0.08 −0.27 0.00 0.00 0.02 0.01 0.01−0.33 −0.08 0.91 −0.22 −0.25 −0.01 0.01 0.01 0.01−0.21 −0.27 −0.22 0.95 −0.18 −0.24 −0.02 −0.02 −0.01−0.01 0.00 −0.25 −0.18 0.94 −0.06 −0.37 −0.06 −0.04−0.01 0.00 −0.01 −0.24 −0.06 0.94 −0.07 −0.45 −0.040.01 0.02 0.01 −0.02 −0.37 −0.07 0.91 −0.44 −0.050.01 0.01 0.01 −0.02 −0.06 −0.45 −0.44 0.94 −0.040.01 0.01 0.01 −0.01 −0.04 −0.04 −0.05 −0.04 0.97
For T1 For T2
We obtain two communi-es based on spectral clustering with this modified graph Laplacian:
{1, 2, 3, 4} and {5, 6, 7, 8, 9} 59
Community Evolu-on
Segment-‐based Clustering with Evolving Networks
• Independent clustering at each snapshot – do not consider temporal informa-on
– Likely to output specious evalua-on paLerns • Evolu-onary clustering enforces smoothness
– may fail to capture dras-c change
• How to strike balance between gradual changes under normal circumstances and dras-c changes caused by major events?
• Segment-‐based clustering: – Community structure remains unchanged in a segment of -me – A change between consecu-ve segments
• Fundamental ques-on: how to detect the change points?
60
Community Evolu-on
Segment-‐based Clustering
• Segment-‐based Clustering assumes community structure remains unchanged in a segment of -me
• GraphScope is one segment-‐based clustering method
– If network connec-ons do not change much over -me, consecu-ve network snapshots should be grouped into one segment
– If a new network snapshot does not fit into an exis-ng segment (when current community structure induces a high cost on a new network snapshot), then introduce a change point and start a new segment
61
Community Evolu-on
GraphScope on Enron Data
62
Community Evolu-on
Heterogeneous Networks
Heterogeneous Network
Heterogeneous Interac-ons
Mul--‐Dimensional Network
Heterogeneous Nodes
Mul--‐Mode Network
63
Why Does Heterogeneity MaLer
• Social media introduces heterogeneity • It calls for solu-ons to community detec-on in heterogeneous networks – Interac-ons in social media are noisy – Interac-ons in one mode or one dimension might be too noisy to detect meaningful communi-es
– Not all users are ac-ve in all dimensions or with different modes
• Need integra-on of interac-ons at mul-ple dimensions or modes
• Details skipped due to -me limit (check out chapter 4 of the lecture book)
64
Evalua-ng Community Detec-on
• For groups with clear defini-ons – E.g., Cliques, k-‐cliques, k-‐clubs, quasi-‐cliques – Verify whether extracted communi-es sa-sfy the defini-on
• For networks with ground truth informa-on – Normalized mutual informa-on – Accuracy of pairwise community memberships
65
Measuring a Clustering Result
• The number of communi-es aper grouping can be different from the ground truth
• No clear community correspondence between clustering result and the ground truth
• Normalized Mutual Informa-on can be used
Ground Truth
1, 2, 3
4, 5, 6
1, 3 2 4, 5, 6
Clustering Result
How to measure the clustering quality?
66
Community Evalua-on
Normalized Mutual Informa-on • Entropy: the informa-on contained in a distribu-on
• Mutual Informa-on: the shared informa-on between two distribu-ons
• Normalized Mutual Informa-on (between 0 and 1)
• Consider a par--on as a distribu-on (probability of one node falling into one community), we can compute the matching between two clusterings
67
Community Evalua-on
NMI
68
Community Evalua-on
NMI-‐Example
• Par--on a: [1, 1, 1, 2, 2, 2] • Par--on b: [1, 2, 1, 3, 3, 3]
1, 2, 3 4, 5, 6
1, 3 2 4, 5,6
h=1 3
h=2 3
l=1 2
l=2 1
l=3 3
l=1 l=2 l=3
h=1 2 1 0
h=2 0 0 3
=0.8278
69
Community Evalua-on
Accuracy of Pairwise Community Memberships
• Consider all the possible pairs of nodes and check whether they reside in the same community
• An error occurs if
– Two nodes belonging to the same community are assigned to different communi-es aper clustering
– Two nodes belonging to different communi-es are assigned to the same community
• Construct a con-ngency table
accuracy =a+ d
a+ b+ c+ d=
a+ d
n(n− 1)/270
Community Evalua-on
Accuracy Example
Ground Truth
C(vi) = C(vj) C(vi) != C(vj)
Clustering Result
C(vi) = C(vj) 4 0
C(vi) != C(vj) 2 9
Ground Truth
1, 2, 3
4, 5, 6
1, 3 2 4, 5, 6
Clustering Result
Accuracy = (4+9)/ (4+2+9+0) = 13/15
71
Community Evalua-on
Evalua-on using Seman-cs
• For networks with seman-cs – Networks come with seman-c or aLribute informa-on of nodes or connec-ons
– Human subjects can verify whether the extracted communi-es are coherent
• Evalua-on is qualita-ve • It is also intui-ve and helps understand a community
An animal community
A health community
72
Community Evalua-on
Evalua-on without Ground Truth
• For networks without ground truth or seman-c informa-on
• This is the most common situa-on • An op-on is to resort to cross-‐valida-on
– Extract communi-es from a (training) network – Evaluate the quality of the community structure on a network constructed from a different date or based on a related type of interac-on
• Quan-ta-ve evalua-on func-ons – modularity
– block model approxima-on error
73
Community Evalua-on
Scaling Community Detec-on
• Social media networks are huge. How to scale community detec-on methods? – Approxima-on
• Sampling
• Local graph processing • Mul--‐level methods
– Exact • Streaming/Itera-ve schemes
• Distributed and Parallel processing
74
Sampling
• Downsample the network such that classical community detec-on methods can be applied – Sample nodes (or edges)
• Which nodes should we sample? – A simple heuris-c: keep those high-‐degree nodes
• Node degrees follows a power-‐law distribu-on • Communi-es might form around those popular ones
– Uncover the community membership for remaining nodes • Check if any of his friends has been assigned to any community
– This simple heuris-c works reasonably well
• Op-miza-on may achieve beLer approxima-on
75
Scaling community detec-on
Local graph processing
• Circumvent the memory boLleneck – Rather than compu-ng global quality, check local neighborhood (say, k-‐hop) for computa-on • Compute edge-‐betweenness
– Construct communi-es from seeded nodes
76
Scaling community detec-on
Mul--‐Level Approaches
• Find a rough par--on of the network into communi-es by use of a fast process (may at the expense of accuracy)
• Construct a meta network with nodes being communi-es and edges the connec-on between communi-es
• Meta-‐network is much smaller than the original network
77 hLp://sites.google.com/site/findcommuni-es/
Scaling community detec-on
Streaming/Itera-ve schemes
• An itera-ve process examines each vertex along with its neighbor in a given order and perform computa-on
• Repeat vertex itera-on un-l convergence
78
Scaling community detec-on
Distributed/Parallel Compu-ng
• Exploit the power of parallel compu-ng – Extend community detec-on methods to Hadoop MapReduce
– Typically requires mul-ple itera-ons as well – Move computa-on to data (split into N chunks)
79
Scaling community detec-on
COMMUNITY DETECTION FOR NETWORK-‐BASED CLASSIFICATION
80
Correla-ons in Network
• Individual behaviors are correlated in a network environment
homophily influence Confounding
81
Classifica-on with Network Data
• How to leverage this correla-on observed in networks to help predict user aLributes or interests?
Predict the labels for nodes in yellow
82
Collec-ve Classifica-on
• Labels of nodes are interdependent with each other • The label of one node cannot be determined independently;
Need Collec-ve Classifica-on
• Markov Assump-on: the label of one node depends on the label of his neighbors
• Collec-ve classifica-on involves 3 components:
Local Classifier
• Assign ini-al label
Rela-onal Classifier
• Capture correla-ons between nodes
Collec-ve Inference
• Propagate correla-ons through network
83
Collec-ve Classifica-on
• Local Classifier: used for ini-al label assignment – Predicts label based on node aLributes – Classical classifica-on learning – Does not employ network informa-on
• Rela-onal Classifier: capture correla-ons based on label info – Learn a classifier from the labels or/and aLributes of its neighbors to
the label of one node – Network informa-on is used
• Collec-ve Classifica-on: propagate the correla-on – Apply rela-onal classifier to each node itera-vely – Iterate un-l the inconsistency between neighboring labels is
minimized – Network structure substan-ally affects the final predic-on
84
Weighted-‐vote Rela-onal Neighborhood Classifier
• No local classifier • Rela-onal classifier
– predic-on of one node is the average of its neighbors
• Collec-ve Inference
85
Example of wvRN
• Ini-aliza-on for unlabeled nodes – p(yi=1|Ni)=0.5
• 1st Itera-on: – For node 3, N3={1,2,4}
• P(y1=1|N1) = 0 • P(y2=1|N2) = 0 • P(y4=1 |N4) = 0.5 • P(y3=1|N3) = 1/3 (0 + 0 + 0.5) = 0.17
– For node 4, N4={1,3, 5, 6} • P(y4=1|N4)= ¼(0+ 0.17+0.5+1) = 0.42
– For node 5, N5={4,6,7,8} • P(y5=1|N5) = ¼ (0.42+1+1+0.5) = 0.73
86
Itera-ve Result
• Stabilizes aper 5 itera-ons – Nodes 5, 8, 9 are + (Pi > 0.5) – Node 3 is – (Pi < 0.5) – Node 4 is in between (Pi =0.5)
87
Community-‐Based Learning
• People in social media engage in various rela-onships – Colleagues – Rela-ves – Friends – Co-‐travelers
• Different rela-onships have different correla-ons with user interests/behavior/profiles
88
Challenges
• Social media open comes with a network, but no rela-onship informa-on
• Or rela-onship informa-on is not complete or refined enough
• Challenges: – How to differen-ate these heterogeneous rela-onship in a network?
– How to determine whether the rela-onship is useful for predic-on?
89
Social Dimension
• Social Dimension: – Latent dimensions defined by social connec-ons
• Each dimension represents one type of rela-onship
90
A Learning Framework based on Social Dimensions
• People involved in the same social dimension are likely to connect to each other, thus forming a community
• Training: 1. Extract meaningful social dimensions based on network
connec-vity via community detec-on 2. Determine relevant social dimensions (plus node
aLributes) through supervised learning
• Predic-on – Apply the constructed model in Step 2 to social dimensions of unlabeled nodes
– No itera-ve inference
91
Underlying Assump-on
• Assump-on: – the label of one node is determined by its social dimension – P(yi|A) = P(yi|Si)
• Community membership serves as latent features
92
Example of SocioDim Framework
• One is likely to involve in mul-ple rela-onships, thus sop clustering is used to extract social dimensions
Spectral Clustering Support Vector Machine 93
Collec-ve Classifica-on vs. Community-‐Based Learning
CollecBve ClassificaBon
Community-‐Based Learning
Computa-onal Cost for Training low high
Computa-onal Cost for Predic-on
high low
Handling Heterogeneous Rela-ons
✔
Handling Incremental Network Change
✔
Integra-ng network informa-on and actor aLributes
✔ ✔
94
References
95
• Lei Tang and Huan Liu. Community Detec-on and Mining in Social Media, Morgan & Claypool Publishers, 2010. • Santo Fortunato, Community Detec-on in Graphs, Physics Reports, vol.
486, pages 75-‐174, 2010 • Lei Tang and Huan Liu. Leveraging Social Media Networks for
Classifica-on. Journal of Data Mining and Knowledge Discovery (DMKD), 2011.
• Papadopoulos, S. and Kompatsiaris, Y. and Vakali, A. and Spyridonos, P., Community Detec-on in Social Media, Journal of Data Mining and Knowledge Discovery (DMKD), Pages 1-‐40, 2011
• Michele Coscia, Fosca Gianno�, and Dino Pedreschi, A Classifica-on for Community Discovery Methods in Complex Networks, Sta-s-cal Analysis and Data Mining, 2011