Community)Detec-on)for))...

Community Detec-on for Social Compu-ng

Lei Tang Yahoo! Labs

2011/09/16 @UC Berkeley

1

Most materials are adapted from Community Detec-on and Mining in Social Media. Lei Tang and Huan Liu, Morgan & Claypool, September, 2010.

hLp://dmml.asu.edu/cdm/

Community •  Community: It is formed by individuals such that those within a

group interact with each other more frequently than with those outside the group –  a.k.a. group, cluster, cohesive subgroup, module in different contexts

•  Community detec-on: discovering groups in a network where individuals’ group memberships are not explicitly given

•  Why communi-es in social media? –  Human beings are social –  Easy-‐to-‐use social media allows people to extend their social life in

unprecedented ways –  Difficult to meet friends in the physical world, but much easier to find

friend online with similar interests –  Interac-ons between nodes can help determine communi-es

2

Real-‐World Communi-es

3

Egypt Protest, 2011

Communi-es by Subscrip-on Communi-es from

Facebook Communi-es from

Flickr

Communi-es of Poli-cal Blogs

5 Lada Admic and Natalie Glance, The Poli-cal Blogosphere and the 2004 U.S. Elec-on: Divided They Blog, 2005

Liberal

Conserva-ve

Communi-es in TwiLer

6 hLp://ebiquity.umbc.edu/blogger/2007/04/19/twiLer-‐social-‐network-‐analysis/

Communi-es of Personal Social Network

7 Generated based on Lei Tang’s LinkedIn connec-ons on 2011/09/15

Graduate School

College

Communi-es in Protein-‐Protein Interac-on Networks

8

Communi-es in Social Media •  Two types of groups in social media

–  Explicit Groups: formed by user subscrip-ons –  Implicit Groups: implicitly formed by social interac-ons

•  Some social media sites allow people to join groups, is it necessary to extract groups based on network topology? –  Not all sites provide community plagorm –  Not all people want to make effort to join groups –  Groups can change dynamically

•  Network interac-on provides rich informa-on about the rela-onship between users –  Can complement other kinds of informa-on –  Provide basic informa-on for other social compu-ng tasks

9

Applica-ons of Community Detec-on

•  Visualiza-on & network naviga-on •  Data compression •  Structural posi-on and role analysis •  Topic detec-on in collabora-ve tagging systems •  Tag disambigua-on •  Iden-fying #unique users in networks •  User profiling based on neighborhood smoothing •  Recommenda-on and targe-ng •  Event detec-on

10

Roadmap •  Overview of Community Detec-on Methods

–  Node-‐Centric –  Group-‐Centric –  Network-‐Centric –  Hierarchy-‐Centric

•  Communi-es in Social Media –  Sta-s-cal Proper-es –  Community Evolu-on –  Heterogeneous Networks –  Community Evalua-on –  Scaling Community Detec-on

•  Applica-on of Community Detec-on for Social Media Mining

11

COMMUNITY DETECTION

12

Subjec-vity of Community Defini-on Each component is a

community A densely-‐knit community

Defini-on of a community can be subjec-ve.

13

Craze about Communi-es?

•  Subjec-vity of communi-es –  challenging for evalua-on (covered later) –  Promo-ng various defini-ons and models

•  Boom of social media and its novel challenges

14

ArBcle/Book Year References

A classifica-on for Community Discovery Methods in Complex Networks

2011 168

Community Detec-on in Social Media 2011 28 ref/page * 4.5 pages = 126

Community Detec-on in Graphs 2010 58 ref/page * 7.5 pages = 435

Community Detec-on and Mining in Social Media 2010 Around 100

Taxonomy of Community Criteria •  Criteria vary depending on the tasks •  Roughly can be divided into 4 categories (not exclusive): •  Node-‐Centric Community

–  Each node in a group sa-sfies certain proper-es •  Group-‐Centric Community

–  Consider the connec-ons within a group as a whole. The group has to sa-sfy certain proper-es without zooming into node-‐level

•  Network-‐Centric Community –  Op-mize a quality func-on wrt. the whole network topology

•  Hierarchy-‐Centric Community –  Construct a hierarchical structure of communi-es

15

For simplicity, we focus on undirected network with boolean edge weights

Node-‐Centric Community Detec-on

•  Nodes sa-sfy different proper-es –  Complete Mutuality

•  cliques –  Reachability of members

•  k-‐clique, k-‐clan, k-‐club –  Nodal degrees

•  k-‐plex, k-‐core –  Rela-ve frequency of Within-‐Outside Ties

•  LS sets, Lambda sets

•  Commonly used in tradi-onal social network analysis •  Open involves combinatorial op-miza-on •  Here, we discuss some representa-ve ones

16

Complete Mutuality: Cliques

•  Clique: a maximum complete subgraph in which all nodes are adjacent to each other

•  NP-‐hard to find the maximum clique in a network •  Straighgorward implementa-on to find cliques is very

expensive in -me complexity

Nodes 5, 6, 7 and 8 form a clique

17

Finding the Maximum Clique

•  In a clique of size k, each node maintains degree >= k-‐1

•  Nodes with degree < k-‐1 will not be included in the maximum clique

•  Recursively apply the following pruning procedure –  Sample a sub-‐network from the given network, and find a clique in the

sub-‐network, say, by a greedy approach

–  Suppose the clique above is size k, in order to find out a larger clique, all nodes with degree <= k-‐1 should be removed.

•  Repeat un-l the network is small enough

•  Many nodes will be pruned as social media networks follow a power law distribu-on for node degrees

18

Maximum Clique Example

•  Suppose we sample a sub-‐network with nodes {1-‐5} and find a clique {1, 2, 3} of size 3

•  In order to find a clique >3, remove all nodes with degree <=3-‐1=2 –  Remove nodes 2 and 9

–  Remove nodes 1 and 3

–  Remove node 4

19

Clique Percola-on Method (CPM)

•  Clique is a very strict defini-on, unstable •  Normally use cliques as a core or a seed to find larger

communi-es

•  CPM is such a method to find overlapping communi-es –  Input

•  A parameter k, and a network –  Procedure

•  Find out all cliques of size k in a given network •  Construct a clique graph. Two cliques are adjacent if they share k-‐1 nodes

•  Each connected components in the clique graph form a community

20

CPM Example Cliques of size 3: {1, 2, 3}, {1, 3, 4}, {4, 5, 6}, {5, 6, 7}, {5, 6, 8}, {5, 7, 8}, {6, 7, 8}

Communi-es: {1, 2, 3, 4} {4, 5, 6, 7, 8}

21

Reachability : k-‐clique, k-‐club

•  Any node in a group should be reachable in k hops •  k-‐clique: a maximal subgraph in which the largest geodesic

distance between any nodes <= k •  k-‐club: a substructure of diameter <= k

•  A k-‐clique might have diameter larger than k in the subgraph •  Commonly used in tradi-onal SNA •  Open involves combinatorial op-miza-on

Cliques: {1, 2, 3} 2-‐cliques: {1, 2, 3, 4, 5}, {2, 3, 4, 5, 6} 2-‐clubs: {1,2,3,4}, {1, 2, 3, 5}, {2, 3, 4, 5, 6}

22

Group-‐Centric Community Detec-on: Density-‐Based Groups

•  The group-‐centric criterion requires the whole group to sa-sfy a certain condi-on –  E.g., the group density >= a given threshold

•  A subgraph is a quasi-clique if

•  A similar strategy to that of cliques can be used –  Sample a subgraph, and find a maximal quasi-

clique (say, of size k) –  Remove nodes with degree

Gs(Vs, Es) γ − dense

|Es||Vs|(|Vs|− 1)/2

≥ γ

γ − dense

< kγ

23

Network-‐Centric Community Detec-on

•  Network-‐centric criterion needs to consider the connec-ons within a network globally

•  Goal: par--on nodes of a network into disjoint sets •  Approaches:

–  Clustering based on vertex similarity –  Latent space models

–  Block model approxima-on

–  Spectral clustering – Modularity maximiza-on

24

Clustering based on Vertex Similarity

•  Apply k-‐means or similarity-‐based clustering to nodes

•  Vertex similarity is defined in terms of the similarity of their neighborhood

•  Structural equivalence: two nodes are structurally equivalent iff they are connec-ng to the same set of actors

•  Structural equivalence is too restrict for prac-cal use.

Nodes 1 and 3 are structurally equivalent; So are nodes 5 and 6.

25

(1) Clustering based on vertex similarity

Vertex Similarity

•  Jaccard Similarity

•  Cosine similarity

Jaccard(4, 6) =|{5}|

|{1, 3, 4, 5, 6, 7, 8}| =1

7

cosine(4, 6) =1√4 · 4

=1

426

(1) Clustering based on vertex similarity

Jaccard(vi, vj) =|Ni ∩Nj ||Ni ∪Nj |

cosine(vi, vj) =

�k AikAjk��s A

2is

�t A

2jt

Latent Space Models

•  Map nodes into a low-‐dimensional space such that the proximity between nodes based on network connec-vity is preserved in the new space, then apply k-‐means clustering

•  Mul--‐dimensional scaling (MDS) –  Given a network, construct a proximity matrix P represen-ng the

pairwise distance between nodes (e.g., geodesic distance)

–  Let denote the coordinates of nodes in the low-‐dimensional space

–  Objec-ve func-on: –  Solu-on: –  V is the top eigenvectors of , and is a diagonal matrix of top

eigenvalues

€

S∈Rn×

SST ≈ −1

2(I − 1

n11T )(P ◦ P )(I − 1

n11T ) = �P

min �SST − �P�2FS = V Λ

12

� �PΛ = diag(λ1,λ2, · · · ,λ�)

Λ

27

(2) Latent space models

MDS Example

P =

0 1 1 1 2 2 3 3 41 0 1 2 3 3 4 4 51 1 0 1 2 2 3 3 41 2 1 0 1 1 2 2 32 3 2 1 0 1 1 1 22 3 2 1 1 0 1 1 23 4 3 2 1 1 0 1 13 4 3 2 1 1 1 0 24 5 4 3 2 2 1 2 0

�P =

2.46 3.96 1.96 0.85 −0.65 −0.65 −2.21 −2.04 −3.653.96 6.46 3.96 1.35 −1.15 −1.15 −3.71 −3.54 −6.151.96 3.96 2.46 0.85 −0.65 −0.65 −2.21 −2.04 −3.650.85 1.35 0.85 0.23 −0.27 −0.27 −0.82 −0.65 −1.27

−0.65 −1.15 −0.65 −0.27 0.23 −0.27 0.68 0.85 1.23−0.65 −1.15 −0.65 −0.27 −0.27 0.23 0.68 0.85 1.23−2.21 −3.71 −2.21 −0.82 0.68 0.68 2.12 1.79 3.68−2.04 −3.54 −2.04 −0.65 0.85 0.85 1.79 2.46 2.35−3.65 −6.15 −3.65 −1.27 1.23 1.23 3.68 2.35 6.23

!3 !2 !1 0 1 2 3!1

!0.8

!0.6

!0.4

!0.2

0

0.2

0.4

0.6

0.8

1

2

34

5 6

7

8

9

S1

S2

Two communi-es: {1, 2, 3, 4} and {5, 6, 7, 8, 9}

geodesic distance

28

(2) Latent space models

Block Models

•  S is the community indicator matrix

•  Relax S to be numerical values, then the op-mal solu-on corresponds to the top eigenvectors of A

min ||A− SΣST ||2F


29

(3) Block model approxima-on

Cut

•  Most interac-ons are within group whereas interac-ons between groups are few

•  community detec-on minimum cut problem •  Cut: A par--on of ver-ces of a graph into two disjoint sets •  Minimum cut problem: find a graph par--on such that the

number of edges between the two sets is minimized

30

(4) Spectral clustering

Ra-o Cut & Normalized Cut

•  Minimum cut open returns an imbalanced par--on, with one set being a singleton

•  Change the objec-ve func-on to consider community size

Ratio Cut(π) =1

k

k�

i=1

cut(Ci, C̄i)

|Ci|,

Normalized Cut(π) =1

k

k�

i=1

cut(Ci, C̄i)

vol(Ci)

Ci,: a community |Ci|: number of nodes in Ci vol(Ci): sum of degrees in Ci

31


Ra-o Cut & Normalized Cut Example

Ratio Cut(π1) =1

2

�1

1+

1

8

�= 9/16 = 0.56

Normalized Cut(π1) =1

2

�1

1+

1

27

�= 14/27 = 0.52

For parBBon in red:

For parBBon in green:

Normalized Cut(π2) =1

2

�2

12+

2

16

�= 7/48 = 0.15 < Normalized Cut(π1)

Ratio Cut(π2) =1

2

�2

4+

2

5

�= 9/20 = 0.45 < Ratio Cut(π1)

π1

π2

Both ra-o cut and normalized cut prefer a balanced par--on 32


Spectral Clustering

•  Both ra-o cut and normalized cut can be reformulated as

•  Where

•  Spectral relaxa-on: •  Op-mal solu-on: top eigenvectors with the smallest

eigenvalues

�L =

�D −AI −D−1/2AD−1/2

graph Laplacian for ra-o cut

normalized graph Laplacian

D = diag(d1, d2, · · · , dn) A diagonal matrix of degrees

minS

Tr(ST �LS) s.t. STS = Ik

minS∈{0,1}n×k

Tr(ST �LS)

33


Spectral Clustering Example

D = diag(3, 2, 3, 4, 4, 4, 4, 3, 1)

�L = D −A =

3 −1 −1 −1 0 0 0 0 0−1 2 −1 0 0 0 0 0 0−1 −1 3 −1 0 0 0 0 0−1 0 −1 4 −1 −1 0 0 00 0 0 −1 4 −1 −1 −1 00 0 0 −1 −1 4 −1 −1 00 0 0 0 −1 −1 4 −1 −10 0 0 0 −1 −1 −1 3 00 0 0 0 0 0 −1 0 1

S =

0.33 −0.380.33 −0.480.33 −0.380.33 −0.120.33 0.160.33 0.160.33 0.300.33 0.240.33 0.51


The 1st eigenvector means all nodes belong to the same cluster, no

use

k-‐means

34


Modularity Maximiza-on

•  Modularity measures the strength of a community par--on by taking into account the degree distribu-on

•  Given a network with m edges, the expected number of edges between two nodes with di and dj is

•  Strength of a community:

•  Modularity: •  A larger value indicates a good community structure

didj/2m

The expected number of edges between nodes 1 and 2 is

3*2/ (2*14) = 3/14 �

i∈C,j∈C

Aij − didj/2m

Q =1

2m

k�

�=1

�

i∈C�,j∈C�

(Aij − didj/2m)

35

(5) Modularity maximiza-on

Modularity Matrix

•  Modularity matrix:

•  Similar to spectral clustering, modularity maximiza-on can be reformulated as

•  Op-mal solu-on: top eigenvectors of the modularity matrix •  Apply k-‐means to S as a post-‐processing step to obtain

community par--on

B = A− ddT /2m (Bij = Aij − didj/2m)

maxQ =1

2mTr(STBS) s.t. STS = Ik

36


Modularity Maximiza-on Example

B =

−0.32 0.79 0.68 0.57 −0.43 −0.43 −0.43 −0.32 −0.110.79 −0.14 0.79 −0.29 −0.29 −0.29 −0.29 −0.21 −0.070.68 0.79 −0.32 0.57 −0.43 −0.43 −0.43 −0.32 −0.110.57 −0.29 0.57 −0.57 0.43 0.43 −0.57 −0.43 −0.14

−0.43 −0.29 −0.43 0.43 −0.57 0.43 0.43 0.57 −0.14−0.43 −0.29 −0.43 0.43 0.43 −0.57 0.43 0.57 −0.14−0.43 −0.29 −0.43 −0.57 0.43 0.43 −0.57 0.57 0.86−0.32 −0.21 −0.32 −0.43 0.57 0.57 0.57 −0.32 −0.11−0.11 −0.07 −0.11 −0.14 −0.14 −0.14 0.86 −0.11 −0.04

S =

0.44 −0.000.38 0.230.44 −0.000.17 −0.48

−0.29 −0.32−0.29 −0.32−0.38 0.34−0.34 −0.08−0.14 0.63

Modularity Matrix

k-‐means

Two Communi-es: {1, 2, 3, 4} and {5, 6, 7, 8, 9}

37


A Unified View for Community Par--on

Utility Matrix M =

modified proximity matrix �P if latent space modelsadjacency matrix A if block models

graph Laplacian �L if spectral clusteringmodularity maximization B if modularity maximization

•  Latent space models, block models, spectral clustering, and modularity maximiza-on can be unified as

38

Hierarchy-‐Centric Community Detec-on

•  Goal: build a hierarchical structure of communi-es based on network topology

•  Allow the analysis of a network at different resolu-ons

•  Representa-ve approaches: – Divisive Hierarchical Clustering – Agglomera-ve Hierarchical clustering

39

Divisive Hierarchical Clustering

•  Divisive clustering –  Par--on nodes into several sets –  Each set is further divided into smaller ones –  Network-‐centric par--on can be applied for the par--on

•  One par-cular example: recursively remove the “weakest” -e –  Find the edge with the least strength –  Remove the edge and update the corresponding strength of each edge

•  Recursively apply the above two steps un-l a network is discomposed into desired number of connected components.

•  Each component forms a community

40

Edge Betweenness

•  The strength of a -e can be measured by edge betweenness

•  Edge betweenness: the number of shortest paths that pass along with the edge

•  The edge with higher betweenness tends to be the bridge between two communi-es.

The edge betweenness of e(1, 2) is 4, as all the shortest paths from 2 to {4, 5, 6, 7, 8, 9} have to either pass e(1, 2) or e(2, 3), and e(1,2) is the shortest path between 1 and 2

edge-betweenness(e) = Σs<tσst(e)

σs,t

41

Divisive clustering based on edge betweenness

Aper remove e(4,5), the betweenness of e(4, 6) becomes 20, which is the highest;

Aper remove e(4,6), the edge e(7,9) has the highest betweenness value 4, and should be removed.

Ini-al betweenness value

42

Agglomera-ve Hierarchical Clustering

•  Ini-alize each node as a community

•  Merge communi-es successively into larger communi-es following a certain criterion –  E.g., based on modularity increase

43

Summary of Community Detec-on

•  Node-‐Centric: cliques, k-‐cliques, k-‐clubs •  Group-‐Centric: quasi-‐cliques •  Network-‐Centric:

–  Clustering based on vertex similarity

–  Latent space models, block models, spectral clustering, modularity maximiza-on

•  Hierarchy-‐Centric: Divisive, Agglomera-ve

•  Difficult to determine which suits the best

•  Other dimensions to explore: –  Directed network –  Overlapping communi-es

–  Sop clustering –  Op-miza-on 44

COMMUNITIES IN SOCIAL MEDIA

45

Ques-ons & Challenges

•  Sta-s-cal Proper-es of Communi-es –  any structural paLerns for community size, density and growth?

•  Evolu-on –  Timeliness is emphasized in social media –  Interac-ons are highly dynamic

•  Heterogeneity –  Various types of en--es and interac-ons are involved

•  Evalua-on –  Lack of ground truth, and complete informa-on due to privacy

•  Scalability –  Social networks are open in a scale of millions of nodes and connec-ons –  Tradi-onal Network Analysis open deals with very limited number of of

subjects

46

Size Distribu-on of Explicit Communi-es (LiveJournal)

47 Lei Tang, Xufei Wang and Huan Liu, Group Profiling for Understanding Social Structures, TIST, 2011

16K bloggers, 132k links, 100k groups

Sta-s-cal property

Size Distribu-on of Explicit Communi-es (Flickr)

48

907K users, 43k groups

Sta-s-cal property

Densi-es of Flickr Groups

49

Average Network Density

Sta-s-cal property

Distribu-on Implicit Communi-es (consider each connected component as a community)

50

100 101 102 103 104 105 106 107 108 109100

101

102

103

104

105

106

107

size

frequency

The giant component

Y! Instant Messenger Contact Network: Hundreds of millions of nodes

Billions of connec-ons

Kun liu and Lei Tang, Large Scale Behavioral Targe-ng with a Social Twist, CIKM 2011

Sta-s-cal property

How does a network look like? Zooming into the giant component

51 J. Leskovec, K. Lang, A. Dasgupta, M. Mahoney. Sta-s-cal Proper-es of Community Structure in Large Social and Informa-on Networks, WWW 2008

Sta-s-cal property

Community Growth wrt. #friends

•  Diminishing returns – Probability of joining increases with the number friends in the group

– But increases get smaller and smaller 52

Livejournal DBLP

Group Forma-on in Large Social Networks: Membership, Growth, and Evolu-on, KDD 2006

Sta-s-cal property

Community Growth: More subtle features

•  Connectedness of friends: – x and y have three friends in the group – x’s friends are independent – y’s friends are all connected

•  Who is more likely to join?

53 Adapted from Jure Leskovec’s lecture slides @Stanford

Sta-s-cal property

Community Growth wrt. Connectedness of Friends

54

Sta-s-cal property

Community Evolu-on

•  Communi-es also expand, shrink , or dissolve in dynamic networks

•  How to uncover latent community change behind dynamic network interac-ons?

55

Naïve Approach to Studying Community Evolu-on

•  Take snapshots of a network •  find communi-es at each snapshot •  Clustering independently at each

snapshot

•  Cons: –  Most community detec-on methods

produce local op-mal solu-ons

–  Hard to determine if the evolu-on is due to the evolu-on or algorithm randomness

56

Community Evolu-on

Naïve Approach Example

•  There is a sharp change at T2 •  This approach may report spurious structural changes

!" #$

%&

" #

'

$

()

!" #$

%&

" #

'

$

()

A(1) A(2)A(3)at T1 at T3 at T2

H(2) =

1 01 01 01 01 01 01 01 00 1

H(1) =

1 01 01 01 00 10 10 10 10 1

H(3) =

1 01 01 00 10 10 10 10 10 1

57

Community Evolu-on

Evolu-onary Clustering in Smoothly Evolving Networks

•  Evolu-onary Clustering: find a smooth sequence of communi-es given a series of network snapshots

•  Objec-ve func-on: snapshot cost (CS) + temporal cost (CT)

•  Take spectral clustering as an example –  Snapshot cost : –  Temporal cost:

•  Community Evolu-on: where

Cost = α · CS + (1− α) · CT

CSt = Tr(STt LtSt), s.t. ST

t St = Ik

CTt = �St − St−1�2 St is s-ll a valid solu-on aper an orthogonal transforma-on

CTt =1

2�StS

Tt − St−1S

Tt−1�2

Costt = Tr�STt�LtSt

�

�Lt = I − α ·D−1/2t A(t)D−1/2

t − (1− α) · St−1STt−1

58

Community Evolu-on

Evolu-onary Clustering Example !

" #$

%&

" #

'

$

()

!" #$

%&

" #

'

$

()

A(1) A(2)A(3)at T1 at T3 at T2

S1 =

0.33 −0.440.27 −0.430.33 −0.440.38 −0.160.38 0.240.38 0.240.38 0.380.33 0.300.19 0.23

�L2 =

0.91 −0.42 −0.33 −0.21 −0.01 −0.01 0.01 0.01 0.01−0.42 0.92 −0.08 −0.27 0.00 0.00 0.02 0.01 0.01−0.33 −0.08 0.91 −0.22 −0.25 −0.01 0.01 0.01 0.01−0.21 −0.27 −0.22 0.95 −0.18 −0.24 −0.02 −0.02 −0.01−0.01 0.00 −0.25 −0.18 0.94 −0.06 −0.37 −0.06 −0.04−0.01 0.00 −0.01 −0.24 −0.06 0.94 −0.07 −0.45 −0.040.01 0.02 0.01 −0.02 −0.37 −0.07 0.91 −0.44 −0.050.01 0.01 0.01 −0.02 −0.06 −0.45 −0.44 0.94 −0.040.01 0.01 0.01 −0.01 −0.04 −0.04 −0.05 −0.04 0.97

For T1 For T2

We obtain two communi-es based on spectral clustering with this modified graph Laplacian:

{1, 2, 3, 4} and {5, 6, 7, 8, 9} 59

Community Evolu-on

Segment-‐based Clustering with Evolving Networks

•  Independent clustering at each snapshot –  do not consider temporal informa-on

–  Likely to output specious evalua-on paLerns •  Evolu-onary clustering enforces smoothness

–  may fail to capture dras-c change

•  How to strike balance between gradual changes under normal circumstances and dras-c changes caused by major events?

•  Segment-‐based clustering: –  Community structure remains unchanged in a segment of -me –  A change between consecu-ve segments

•  Fundamental ques-on: how to detect the change points?

60

Community Evolu-on

Segment-‐based Clustering

•  Segment-‐based Clustering assumes community structure remains unchanged in a segment of -me

•  GraphScope is one segment-‐based clustering method

–  If network connec-ons do not change much over -me, consecu-ve network snapshots should be grouped into one segment

–  If a new network snapshot does not fit into an exis-ng segment (when current community structure induces a high cost on a new network snapshot), then introduce a change point and start a new segment

61

Community Evolu-on

GraphScope on Enron Data

62

Community Evolu-on

Heterogeneous Networks

Heterogeneous Network

Heterogeneous Interac-ons

Mul--‐Dimensional Network

Heterogeneous Nodes

Mul--‐Mode Network

63

Why Does Heterogeneity MaLer

•  Social media introduces heterogeneity •  It calls for solu-ons to community detec-on in heterogeneous networks –  Interac-ons in social media are noisy –  Interac-ons in one mode or one dimension might be too noisy to detect meaningful communi-es

–  Not all users are ac-ve in all dimensions or with different modes

•  Need integra-on of interac-ons at mul-ple dimensions or modes

•  Details skipped due to -me limit (check out chapter 4 of the lecture book)

64

Evalua-ng Community Detec-on

•  For groups with clear defini-ons – E.g., Cliques, k-‐cliques, k-‐clubs, quasi-‐cliques – Verify whether extracted communi-es sa-sfy the defini-on

•  For networks with ground truth informa-on – Normalized mutual informa-on – Accuracy of pairwise community memberships

65

Measuring a Clustering Result

•  The number of communi-es aper grouping can be different from the ground truth

•  No clear community correspondence between clustering result and the ground truth

•  Normalized Mutual Informa-on can be used

Ground Truth

1, 2, 3

4, 5, 6

1, 3 2 4, 5, 6

Clustering Result

How to measure the clustering quality?

66

Community Evalua-on

Normalized Mutual Informa-on •  Entropy: the informa-on contained in a distribu-on

•  Mutual Informa-on: the shared informa-on between two distribu-ons

•  Normalized Mutual Informa-on (between 0 and 1)

•  Consider a par--on as a distribu-on (probability of one node falling into one community), we can compute the matching between two clusterings

67

Community Evalua-on

NMI

68

Community Evalua-on

NMI-‐Example

•  Par--on a: [1, 1, 1, 2, 2, 2] •  Par--on b: [1, 2, 1, 3, 3, 3]

1, 2, 3 4, 5, 6

1, 3 2 4, 5,6

h=1 3

h=2 3

l=1 2

l=2 1

l=3 3

l=1 l=2 l=3

h=1 2 1 0

h=2 0 0 3

=0.8278

69

Community Evalua-on

Accuracy of Pairwise Community Memberships

•  Consider all the possible pairs of nodes and check whether they reside in the same community

•  An error occurs if

–  Two nodes belonging to the same community are assigned to different communi-es aper clustering

–  Two nodes belonging to different communi-es are assigned to the same community

•  Construct a con-ngency table

accuracy =a+ d

a+ b+ c+ d=

a+ d

n(n− 1)/270

Community Evalua-on

Accuracy Example

Ground Truth

C(vi) = C(vj) C(vi) != C(vj)

Clustering Result

C(vi) = C(vj) 4 0

C(vi) != C(vj) 2 9

Ground Truth

1, 2, 3

4, 5, 6

1, 3 2 4, 5, 6

Clustering Result

Accuracy = (4+9)/ (4+2+9+0) = 13/15

71

Community Evalua-on

Evalua-on using Seman-cs

•  For networks with seman-cs –  Networks come with seman-c or aLribute informa-on of nodes or connec-ons

–  Human subjects can verify whether the extracted communi-es are coherent

•  Evalua-on is qualita-ve •  It is also intui-ve and helps understand a community

An animal community

A health community

72

Community Evalua-on

Evalua-on without Ground Truth

•  For networks without ground truth or seman-c informa-on

•  This is the most common situa-on •  An op-on is to resort to cross-‐valida-on

–  Extract communi-es from a (training) network –  Evaluate the quality of the community structure on a network constructed from a different date or based on a related type of interac-on

•  Quan-ta-ve evalua-on func-ons –  modularity

–  block model approxima-on error

73

Community Evalua-on

Scaling Community Detec-on

•  Social media networks are huge. How to scale community detec-on methods? – Approxima-on

•  Sampling

•  Local graph processing •  Mul--‐level methods

– Exact •  Streaming/Itera-ve schemes

•  Distributed and Parallel processing

74

Sampling

•  Downsample the network such that classical community detec-on methods can be applied –  Sample nodes (or edges)

•  Which nodes should we sample? –  A simple heuris-c: keep those high-‐degree nodes

•  Node degrees follows a power-‐law distribu-on •  Communi-es might form around those popular ones

–  Uncover the community membership for remaining nodes •  Check if any of his friends has been assigned to any community

–  This simple heuris-c works reasonably well

•  Op-miza-on may achieve beLer approxima-on

75

Scaling community detec-on

Local graph processing

•  Circumvent the memory boLleneck – Rather than compu-ng global quality, check local neighborhood (say, k-‐hop) for computa-on •  Compute edge-‐betweenness

– Construct communi-es from seeded nodes

76


Mul--‐Level Approaches

•  Find a rough par--on of the network into communi-es by use of a fast process (may at the expense of accuracy)

•  Construct a meta network with nodes being communi-es and edges the connec-on between communi-es

•  Meta-‐network is much smaller than the original network

77 hLp://sites.google.com/site/findcommuni-es/


Streaming/Itera-ve schemes

•  An itera-ve process examines each vertex along with its neighbor in a given order and perform computa-on

•  Repeat vertex itera-on un-l convergence

78


Distributed/Parallel Compu-ng

•  Exploit the power of parallel compu-ng –  Extend community detec-on methods to Hadoop MapReduce

–  Typically requires mul-ple itera-ons as well –  Move computa-on to data (split into N chunks)

79


COMMUNITY DETECTION FOR NETWORK-‐BASED CLASSIFICATION

80

Correla-ons in Network

•  Individual behaviors are correlated in a network environment

homophily influence Confounding

81

Classifica-on with Network Data

•  How to leverage this correla-on observed in networks to help predict user aLributes or interests?

Predict the labels for nodes in yellow

82

Collec-ve Classifica-on

•  Labels of nodes are interdependent with each other •  The label of one node cannot be determined independently;

Need Collec-ve Classifica-on

•  Markov Assump-on: the label of one node depends on the label of his neighbors

•  Collec-ve classifica-on involves 3 components:

Local Classifier

•  Assign ini-al label

Rela-onal Classifier

•  Capture correla-ons between nodes

Collec-ve Inference

•  Propagate correla-ons through network

83

Collec-ve Classifica-on

•  Local Classifier: used for ini-al label assignment –  Predicts label based on node aLributes –  Classical classifica-on learning –  Does not employ network informa-on

•  Rela-onal Classifier: capture correla-ons based on label info –  Learn a classifier from the labels or/and aLributes of its neighbors to

the label of one node –  Network informa-on is used

•  Collec-ve Classifica-on: propagate the correla-on –  Apply rela-onal classifier to each node itera-vely –  Iterate un-l the inconsistency between neighboring labels is

minimized –  Network structure substan-ally affects the final predic-on

84

Weighted-‐vote Rela-onal Neighborhood Classifier

•  No local classifier •  Rela-onal classifier

–  predic-on of one node is the average of its neighbors

•  Collec-ve Inference

85

Example of wvRN

•  Ini-aliza-on for unlabeled nodes –  p(yi=1|Ni)=0.5

•  1st Itera-on: –  For node 3, N3={1,2,4}

•  P(y1=1|N1) = 0 •  P(y2=1|N2) = 0 •  P(y4=1 |N4) = 0.5 •  P(y3=1|N3) = 1/3 (0 + 0 + 0.5) = 0.17

–  For node 4, N4={1,3, 5, 6} •  P(y4=1|N4)= ¼(0+ 0.17+0.5+1) = 0.42

–  For node 5, N5={4,6,7,8} •  P(y5=1|N5) = ¼ (0.42+1+1+0.5) = 0.73

86

Itera-ve Result

•  Stabilizes aper 5 itera-ons –  Nodes 5, 8, 9 are + (Pi > 0.5) –  Node 3 is – (Pi < 0.5) –  Node 4 is in between (Pi =0.5)

87

Community-‐Based Learning

•  People in social media engage in various rela-onships –  Colleagues –  Rela-ves –  Friends –  Co-‐travelers

•  Different rela-onships have different correla-ons with user interests/behavior/profiles

88

Challenges

•  Social media open comes with a network, but no rela-onship informa-on

•  Or rela-onship informa-on is not complete or refined enough

•  Challenges: –  How to differen-ate these heterogeneous rela-onship in a network?

–  How to determine whether the rela-onship is useful for predic-on?

89

Social Dimension

•  Social Dimension: –  Latent dimensions defined by social connec-ons

•  Each dimension represents one type of rela-onship

90

A Learning Framework based on Social Dimensions

•  People involved in the same social dimension are likely to connect to each other, thus forming a community

•  Training: 1.  Extract meaningful social dimensions based on network

connec-vity via community detec-on 2.  Determine relevant social dimensions (plus node

aLributes) through supervised learning

•  Predic-on –  Apply the constructed model in Step 2 to social dimensions of unlabeled nodes

–  No itera-ve inference

91

Underlying Assump-on

•  Assump-on: –  the label of one node is determined by its social dimension –  P(yi|A) = P(yi|Si)

•  Community membership serves as latent features

92

Example of SocioDim Framework

•  One is likely to involve in mul-ple rela-onships, thus sop clustering is used to extract social dimensions

Spectral Clustering Support Vector Machine 93

Collec-ve Classifica-on vs. Community-‐Based Learning

CollecBve ClassificaBon

Community-‐Based Learning

Computa-onal Cost for Training low high

Computa-onal Cost for Predic-on

high low

Handling Heterogeneous Rela-ons

✔

Handling Incremental Network Change

✔

Integra-ng network informa-on and actor aLributes

✔ ✔

94

References

95

•  Lei Tang and Huan Liu. Community Detec-on and Mining in Social Media, Morgan & Claypool Publishers, 2010. •  Santo Fortunato, Community Detec-on in Graphs, Physics Reports, vol.

486, pages 75-‐174, 2010 •  Lei Tang and Huan Liu. Leveraging Social Media Networks for

Classifica-on. Journal of Data Mining and Knowledge Discovery (DMKD), 2011.

•  Papadopoulos, S. and Kompatsiaris, Y. and Vakali, A. and Spyridonos, P., Community Detec-on in Social Media, Journal of Data Mining and Knowledge Discovery (DMKD), Pages 1-‐40, 2011

•  Michele Coscia, Fosca Gianno�, and Dino Pedreschi, A Classifica-on for Community Discovery Methods in Complex Networks, Sta-s-cal Analysis and Data Mining, 2011

Lei Tang scientist, Yahoo! Labs

[email protected]

96

Community)Detec-on)for))...

Documents

Transcript of Community)Detec-on)for))...