Community)Detec-on)for))...

96
Community Detec-on for Social Compu-ng Lei Tang Yahoo! Labs 2011/09/16 @UC Berkeley 1 Most materials are adapted from Community Detec-on and Mining in Social Media. Lei Tang and Huan Liu, Morgan & Claypool, September, 2010. hLp://dmml.asu.edu/cdm/

Transcript of Community)Detec-on)for))...

Page 1: Community)Detec-on)for)) Social)Compu-ng)leitang.net/presentation/community_detection_2011_berkeley.pdf · Community)Detec-on)for)) Social)Compu-ng) Lei)Tang) Yahoo!)Labs) 2011/09/16)@UC)Berkeley))

Community  Detec-on  for    Social  Compu-ng  

Lei  Tang  Yahoo!  Labs  

2011/09/16  @UC  Berkeley    

1  

Most  materials  are  adapted  from      Community  Detec-on  and  Mining  in  Social  Media.    Lei  Tang  and  Huan  Liu,  Morgan  &  Claypool,  September,  2010.  

hLp://dmml.asu.edu/cdm/    

Page 2: Community)Detec-on)for)) Social)Compu-ng)leitang.net/presentation/community_detection_2011_berkeley.pdf · Community)Detec-on)for)) Social)Compu-ng) Lei)Tang) Yahoo!)Labs) 2011/09/16)@UC)Berkeley))

Community  •  Community:  It  is  formed  by  individuals  such  that  those  within  a  

group  interact  with  each  other  more  frequently  than  with  those  outside  the  group  –  a.k.a.  group,  cluster,  cohesive  subgroup,  module  in  different  contexts  

•  Community  detec-on:  discovering  groups  in  a  network  where  individuals’  group  memberships  are  not  explicitly  given  

•  Why  communi-es  in  social  media?    –  Human  beings  are  social  –  Easy-­‐to-­‐use  social  media  allows  people  to  extend  their  social  life  in  

unprecedented  ways  –  Difficult  to  meet  friends  in  the  physical  world,  but  much  easier  to  find  

friend  online  with  similar  interests  –  Interac-ons  between  nodes  can  help  determine  communi-es  

2  

Page 3: Community)Detec-on)for)) Social)Compu-ng)leitang.net/presentation/community_detection_2011_berkeley.pdf · Community)Detec-on)for)) Social)Compu-ng) Lei)Tang) Yahoo!)Labs) 2011/09/16)@UC)Berkeley))

Real-­‐World  Communi-es  

3  

Egypt  Protest,  2011  

Page 4: Community)Detec-on)for)) Social)Compu-ng)leitang.net/presentation/community_detection_2011_berkeley.pdf · Community)Detec-on)for)) Social)Compu-ng) Lei)Tang) Yahoo!)Labs) 2011/09/16)@UC)Berkeley))

Communi-es  by  Subscrip-on  Communi-es  from    

Facebook  Communi-es  from    

Flickr  

Page 5: Community)Detec-on)for)) Social)Compu-ng)leitang.net/presentation/community_detection_2011_berkeley.pdf · Community)Detec-on)for)) Social)Compu-ng) Lei)Tang) Yahoo!)Labs) 2011/09/16)@UC)Berkeley))

Communi-es  of  Poli-cal  Blogs  

5  Lada  Admic  and  Natalie  Glance,  The  Poli-cal  Blogosphere  and  the  2004  U.S.  Elec-on:  Divided  They  Blog,  2005  

Liberal    

Conserva-ve  

Page 6: Community)Detec-on)for)) Social)Compu-ng)leitang.net/presentation/community_detection_2011_berkeley.pdf · Community)Detec-on)for)) Social)Compu-ng) Lei)Tang) Yahoo!)Labs) 2011/09/16)@UC)Berkeley))

Communi-es  in  TwiLer  

6  hLp://ebiquity.umbc.edu/blogger/2007/04/19/twiLer-­‐social-­‐network-­‐analysis/  

Page 7: Community)Detec-on)for)) Social)Compu-ng)leitang.net/presentation/community_detection_2011_berkeley.pdf · Community)Detec-on)for)) Social)Compu-ng) Lei)Tang) Yahoo!)Labs) 2011/09/16)@UC)Berkeley))

Communi-es  of  Personal  Social  Network  

7  Generated  based  on  Lei  Tang’s  LinkedIn  connec-ons    on  2011/09/15  

Graduate  School  

College  

Page 8: Community)Detec-on)for)) Social)Compu-ng)leitang.net/presentation/community_detection_2011_berkeley.pdf · Community)Detec-on)for)) Social)Compu-ng) Lei)Tang) Yahoo!)Labs) 2011/09/16)@UC)Berkeley))

Communi-es  in  Protein-­‐Protein  Interac-on  Networks  

8  

Page 9: Community)Detec-on)for)) Social)Compu-ng)leitang.net/presentation/community_detection_2011_berkeley.pdf · Community)Detec-on)for)) Social)Compu-ng) Lei)Tang) Yahoo!)Labs) 2011/09/16)@UC)Berkeley))

Communi-es  in  Social  Media  •  Two  types  of  groups  in  social  media  

–  Explicit  Groups:  formed  by  user  subscrip-ons  –  Implicit  Groups:  implicitly  formed  by  social  interac-ons  

•  Some  social  media  sites  allow  people  to  join  groups,  is  it  necessary  to  extract  groups  based  on  network  topology?  –  Not  all  sites  provide  community  plagorm  –  Not  all  people  want  to  make  effort  to  join  groups  –  Groups  can  change  dynamically    

•  Network  interac-on  provides  rich  informa-on  about  the  rela-onship  between  users  –  Can  complement  other  kinds  of  informa-on  –  Provide  basic  informa-on  for  other  social  compu-ng  tasks  

9  

Page 10: Community)Detec-on)for)) Social)Compu-ng)leitang.net/presentation/community_detection_2011_berkeley.pdf · Community)Detec-on)for)) Social)Compu-ng) Lei)Tang) Yahoo!)Labs) 2011/09/16)@UC)Berkeley))

Applica-ons  of  Community  Detec-on  

•  Visualiza-on  &  network  naviga-on  •  Data  compression  •  Structural  posi-on  and  role  analysis  •  Topic  detec-on  in  collabora-ve  tagging  systems  •  Tag  disambigua-on    •  Iden-fying  #unique  users  in  networks  •  User  profiling  based  on  neighborhood  smoothing  •  Recommenda-on  and  targe-ng  •  Event  detec-on  

10  

Page 11: Community)Detec-on)for)) Social)Compu-ng)leitang.net/presentation/community_detection_2011_berkeley.pdf · Community)Detec-on)for)) Social)Compu-ng) Lei)Tang) Yahoo!)Labs) 2011/09/16)@UC)Berkeley))

Roadmap  •  Overview  of  Community  Detec-on  Methods  

–  Node-­‐Centric  –  Group-­‐Centric  –  Network-­‐Centric  –  Hierarchy-­‐Centric  

•  Communi-es  in  Social  Media  –  Sta-s-cal  Proper-es  –  Community  Evolu-on  –  Heterogeneous  Networks  –  Community  Evalua-on  –  Scaling  Community  Detec-on  

•  Applica-on  of  Community  Detec-on  for  Social  Media  Mining  

11  

Page 12: Community)Detec-on)for)) Social)Compu-ng)leitang.net/presentation/community_detection_2011_berkeley.pdf · Community)Detec-on)for)) Social)Compu-ng) Lei)Tang) Yahoo!)Labs) 2011/09/16)@UC)Berkeley))

COMMUNITY  DETECTION  

12  

Page 13: Community)Detec-on)for)) Social)Compu-ng)leitang.net/presentation/community_detection_2011_berkeley.pdf · Community)Detec-on)for)) Social)Compu-ng) Lei)Tang) Yahoo!)Labs) 2011/09/16)@UC)Berkeley))

Subjec-vity  of  Community  Defini-on  Each  component  is  a  

community  A  densely-­‐knit    community    

Defini-on  of  a  community    can  be  subjec-ve.  

13  

Page 14: Community)Detec-on)for)) Social)Compu-ng)leitang.net/presentation/community_detection_2011_berkeley.pdf · Community)Detec-on)for)) Social)Compu-ng) Lei)Tang) Yahoo!)Labs) 2011/09/16)@UC)Berkeley))

Craze  about  Communi-es?    

•  Subjec-vity  of  communi-es    –  challenging  for  evalua-on  (covered  later)  –  Promo-ng  various  defini-ons  and  models  

•  Boom  of  social  media  and  its  novel  challenges  

14  

ArBcle/Book   Year   References  

A  classifica-on  for  Community  Discovery  Methods  in  Complex  Networks  

2011   168  

Community  Detec-on  in  Social  Media   2011   28  ref/page  *  4.5  pages  =  126    

Community  Detec-on  in  Graphs   2010   58  ref/page  *  7.5  pages  =  435  

Community  Detec-on  and  Mining  in  Social  Media   2010   Around  100    

Page 15: Community)Detec-on)for)) Social)Compu-ng)leitang.net/presentation/community_detection_2011_berkeley.pdf · Community)Detec-on)for)) Social)Compu-ng) Lei)Tang) Yahoo!)Labs) 2011/09/16)@UC)Berkeley))

Taxonomy  of  Community  Criteria    •  Criteria  vary  depending  on  the  tasks  •  Roughly  can  be  divided  into  4  categories  (not  exclusive):    •  Node-­‐Centric  Community  

–  Each  node  in  a  group  sa-sfies  certain  proper-es    •  Group-­‐Centric  Community  

–  Consider  the  connec-ons  within  a  group  as  a  whole.  The  group  has  to  sa-sfy  certain  proper-es  without  zooming  into  node-­‐level  

•  Network-­‐Centric  Community  –  Op-mize  a  quality  func-on  wrt.  the  whole  network  topology    

•  Hierarchy-­‐Centric  Community      –  Construct  a  hierarchical  structure  of  communi-es  

15  

For  simplicity,    we  focus  on  undirected  network  with  boolean  edge  weights    

Page 16: Community)Detec-on)for)) Social)Compu-ng)leitang.net/presentation/community_detection_2011_berkeley.pdf · Community)Detec-on)for)) Social)Compu-ng) Lei)Tang) Yahoo!)Labs) 2011/09/16)@UC)Berkeley))

Node-­‐Centric  Community  Detec-on  

•  Nodes  sa-sfy  different  proper-es  –  Complete  Mutuality    

•  cliques  –  Reachability  of  members  

•  k-­‐clique,  k-­‐clan,  k-­‐club  –  Nodal  degrees    

•  k-­‐plex,  k-­‐core  –  Rela-ve  frequency  of  Within-­‐Outside  Ties  

•  LS  sets,  Lambda  sets  

•  Commonly  used  in  tradi-onal  social  network  analysis  •  Open  involves  combinatorial  op-miza-on  •  Here,  we  discuss  some  representa-ve  ones  

16  

Page 17: Community)Detec-on)for)) Social)Compu-ng)leitang.net/presentation/community_detection_2011_berkeley.pdf · Community)Detec-on)for)) Social)Compu-ng) Lei)Tang) Yahoo!)Labs) 2011/09/16)@UC)Berkeley))

Complete  Mutuality:  Cliques  

•  Clique:  a  maximum  complete  subgraph  in  which  all  nodes  are  adjacent  to  each  other  

•  NP-­‐hard  to  find  the  maximum  clique  in  a  network  •  Straighgorward  implementa-on  to  find  cliques  is  very  

expensive  in  -me  complexity  

Nodes  5,  6,  7  and  8  form  a  clique  

17  

Page 18: Community)Detec-on)for)) Social)Compu-ng)leitang.net/presentation/community_detection_2011_berkeley.pdf · Community)Detec-on)for)) Social)Compu-ng) Lei)Tang) Yahoo!)Labs) 2011/09/16)@UC)Berkeley))

Finding  the  Maximum  Clique  

•  In  a  clique  of  size  k,  each  node  maintains  degree  >=  k-­‐1  

•  Nodes  with  degree  <  k-­‐1  will  not  be  included  in  the  maximum  clique  

•  Recursively  apply  the  following  pruning  procedure  –  Sample  a  sub-­‐network  from  the  given  network,  and  find  a  clique  in  the  

sub-­‐network,  say,  by  a  greedy  approach  

–  Suppose  the  clique  above  is  size  k,  in  order  to  find  out  a  larger  clique,  all  nodes  with  degree  <=  k-­‐1  should  be  removed.    

•  Repeat  un-l  the  network  is  small  enough  

•  Many  nodes  will  be  pruned  as  social  media  networks  follow  a  power  law  distribu-on  for  node  degrees  

18  

Page 19: Community)Detec-on)for)) Social)Compu-ng)leitang.net/presentation/community_detection_2011_berkeley.pdf · Community)Detec-on)for)) Social)Compu-ng) Lei)Tang) Yahoo!)Labs) 2011/09/16)@UC)Berkeley))

Maximum  Clique  Example  

•  Suppose  we  sample  a  sub-­‐network  with  nodes  {1-­‐5}  and  find  a  clique  {1,  2,  3}  of  size  3    

•  In  order  to  find  a  clique  >3,  remove  all  nodes  with  degree  <=3-­‐1=2  –  Remove  nodes  2  and  9  

–  Remove  nodes  1  and  3  

–  Remove  node  4  

19  

Page 20: Community)Detec-on)for)) Social)Compu-ng)leitang.net/presentation/community_detection_2011_berkeley.pdf · Community)Detec-on)for)) Social)Compu-ng) Lei)Tang) Yahoo!)Labs) 2011/09/16)@UC)Berkeley))

Clique  Percola-on  Method  (CPM)  

•  Clique  is  a  very  strict  defini-on,  unstable  •  Normally  use  cliques  as  a  core  or  a  seed  to  find  larger  

communi-es  

•  CPM  is  such  a  method  to  find  overlapping  communi-es  –  Input  

•  A  parameter  k,  and  a  network    –  Procedure  

•  Find  out  all  cliques  of  size  k  in  a  given  network  •  Construct  a  clique  graph.  Two  cliques  are  adjacent  if  they  share  k-­‐1  nodes  

•  Each  connected  components  in  the  clique  graph  form  a  community  

20  

Page 21: Community)Detec-on)for)) Social)Compu-ng)leitang.net/presentation/community_detection_2011_berkeley.pdf · Community)Detec-on)for)) Social)Compu-ng) Lei)Tang) Yahoo!)Labs) 2011/09/16)@UC)Berkeley))

CPM  Example  Cliques  of  size  3:  {1,  2,  3},  {1,  3,  4},  {4,  5,  6},  {5,  6,  7},  {5,  6,  8},  {5,  7,  8},    {6,  7,  8}  

Communi-es:    {1,  2,  3,  4}  {4,  5,  6,  7,  8}  

21  

Page 22: Community)Detec-on)for)) Social)Compu-ng)leitang.net/presentation/community_detection_2011_berkeley.pdf · Community)Detec-on)for)) Social)Compu-ng) Lei)Tang) Yahoo!)Labs) 2011/09/16)@UC)Berkeley))

Reachability  :  k-­‐clique,  k-­‐club    

•  Any  node  in  a  group  should  be  reachable  in  k  hops  •  k-­‐clique:  a  maximal  subgraph  in  which  the  largest  geodesic  

distance  between  any  nodes  <=  k    •  k-­‐club:  a  substructure  of  diameter  <=  k  

•  A  k-­‐clique  might  have  diameter  larger  than  k  in  the  subgraph  •  Commonly  used  in  tradi-onal  SNA  •  Open  involves  combinatorial  op-miza-on  

Cliques:  {1,  2,  3}  2-­‐cliques:  {1,  2,  3,  4,  5},  {2,  3,  4,  5,  6}  2-­‐clubs:  {1,2,3,4},  {1,  2,  3,  5},  {2,  3,  4,  5,  6}  

22  

Page 23: Community)Detec-on)for)) Social)Compu-ng)leitang.net/presentation/community_detection_2011_berkeley.pdf · Community)Detec-on)for)) Social)Compu-ng) Lei)Tang) Yahoo!)Labs) 2011/09/16)@UC)Berkeley))

Group-­‐Centric  Community  Detec-on:  Density-­‐Based  Groups  

•  The  group-­‐centric  criterion  requires  the  whole  group  to  sa-sfy  a  certain  condi-on  –  E.g.,  the  group  density  >=  a given threshold

•  A subgraph is a quasi-clique  if  

•  A similar strategy to that of cliques can be used –  Sample a subgraph, and find a maximal quasi-

clique (say, of size k) –  Remove nodes with degree

Gs(Vs, Es) γ − dense

|Es||Vs|(|Vs|− 1)/2

≥ γ

γ − dense

< kγ

23  

Page 24: Community)Detec-on)for)) Social)Compu-ng)leitang.net/presentation/community_detection_2011_berkeley.pdf · Community)Detec-on)for)) Social)Compu-ng) Lei)Tang) Yahoo!)Labs) 2011/09/16)@UC)Berkeley))

Network-­‐Centric  Community  Detec-on  

•  Network-­‐centric  criterion  needs  to  consider  the  connec-ons  within  a  network  globally  

•  Goal:  par--on  nodes  of  a  network  into  disjoint  sets  •  Approaches:  

–  Clustering  based  on  vertex  similarity  –  Latent  space  models  

–  Block  model  approxima-on  

–  Spectral  clustering  – Modularity  maximiza-on  

24  

Page 25: Community)Detec-on)for)) Social)Compu-ng)leitang.net/presentation/community_detection_2011_berkeley.pdf · Community)Detec-on)for)) Social)Compu-ng) Lei)Tang) Yahoo!)Labs) 2011/09/16)@UC)Berkeley))

Clustering  based  on  Vertex  Similarity  

•  Apply  k-­‐means  or  similarity-­‐based  clustering  to  nodes  

•  Vertex  similarity  is  defined  in  terms  of  the  similarity  of  their  neighborhood  

•  Structural  equivalence:  two  nodes  are  structurally  equivalent  iff  they  are  connec-ng  to  the  same  set  of  actors  

•  Structural  equivalence  is  too  restrict  for  prac-cal  use.    

Nodes  1  and  3  are  structurally  equivalent;      So  are  nodes  5  and  6.    

25  

(1)  Clustering  based  on  vertex  similarity  

Page 26: Community)Detec-on)for)) Social)Compu-ng)leitang.net/presentation/community_detection_2011_berkeley.pdf · Community)Detec-on)for)) Social)Compu-ng) Lei)Tang) Yahoo!)Labs) 2011/09/16)@UC)Berkeley))

Vertex  Similarity  

•  Jaccard  Similarity  

•  Cosine  similarity  

Jaccard(4, 6) =|{5}|

|{1, 3, 4, 5, 6, 7, 8}| =1

7

cosine(4, 6) =1√4 · 4

=1

426  

(1)  Clustering  based  on  vertex  similarity  

Jaccard(vi, vj) =|Ni ∩Nj ||Ni ∪Nj |

cosine(vi, vj) =

�k AikAjk��s A

2is

�t A

2jt

Page 27: Community)Detec-on)for)) Social)Compu-ng)leitang.net/presentation/community_detection_2011_berkeley.pdf · Community)Detec-on)for)) Social)Compu-ng) Lei)Tang) Yahoo!)Labs) 2011/09/16)@UC)Berkeley))

Latent  Space  Models  

•  Map  nodes  into  a  low-­‐dimensional  space  such  that  the  proximity  between  nodes  based  on  network  connec-vity  is  preserved  in  the  new  space,  then  apply  k-­‐means  clustering  

•  Mul--­‐dimensional  scaling  (MDS)  –  Given  a  network,  construct  a  proximity  matrix  P  represen-ng  the  

pairwise  distance  between  nodes  (e.g.,  geodesic  distance)  

–  Let                              denote  the  coordinates  of  nodes  in  the  low-­‐dimensional  space  

–  Objec-ve  func-on:        –  Solu-on:  –  V  is  the  top            eigenvectors  of            ,  and          is  a  diagonal  matrix  of  top  

eigenvalues                                                                                                

S∈Rn×

SST ≈ −1

2(I − 1

n11T )(P ◦ P )(I − 1

n11T ) = �P

min �SST − �P�2FS = V Λ

12

� �PΛ = diag(λ1,λ2, · · · ,λ�)

Λ

27  

(2)  Latent  space  models  

Page 28: Community)Detec-on)for)) Social)Compu-ng)leitang.net/presentation/community_detection_2011_berkeley.pdf · Community)Detec-on)for)) Social)Compu-ng) Lei)Tang) Yahoo!)Labs) 2011/09/16)@UC)Berkeley))

MDS  Example  

P =

0 1 1 1 2 2 3 3 41 0 1 2 3 3 4 4 51 1 0 1 2 2 3 3 41 2 1 0 1 1 2 2 32 3 2 1 0 1 1 1 22 3 2 1 1 0 1 1 23 4 3 2 1 1 0 1 13 4 3 2 1 1 1 0 24 5 4 3 2 2 1 2 0

�P =

2.46 3.96 1.96 0.85 −0.65 −0.65 −2.21 −2.04 −3.653.96 6.46 3.96 1.35 −1.15 −1.15 −3.71 −3.54 −6.151.96 3.96 2.46 0.85 −0.65 −0.65 −2.21 −2.04 −3.650.85 1.35 0.85 0.23 −0.27 −0.27 −0.82 −0.65 −1.27

−0.65 −1.15 −0.65 −0.27 0.23 −0.27 0.68 0.85 1.23−0.65 −1.15 −0.65 −0.27 −0.27 0.23 0.68 0.85 1.23−2.21 −3.71 −2.21 −0.82 0.68 0.68 2.12 1.79 3.68−2.04 −3.54 −2.04 −0.65 0.85 0.85 1.79 2.46 2.35−3.65 −6.15 −3.65 −1.27 1.23 1.23 3.68 2.35 6.23

!3 !2 !1 0 1 2 3!1

!0.8

!0.6

!0.4

!0.2

0

0.2

0.4

0.6

0.8

1

2

34

5 6

7

8

9

S1

S2

Two  communi-es:    {1,  2,  3,  4}  and  {5,  6,  7,  8,  9}  

geodesic  distance  

28  

(2)  Latent  space  models  

Page 29: Community)Detec-on)for)) Social)Compu-ng)leitang.net/presentation/community_detection_2011_berkeley.pdf · Community)Detec-on)for)) Social)Compu-ng) Lei)Tang) Yahoo!)Labs) 2011/09/16)@UC)Berkeley))

Block  Models  

•  S  is  the  community  indicator  matrix  

•  Relax  S  to  be  numerical  values,  then  the  op-mal  solu-on  corresponds  to  the  top  eigenvectors  of  A  

min ||A− SΣST ||2F

Two  communi-es:    {1,  2,  3,  4}  and  {5,  6,  7,  8,  9}  

29  

(3)  Block  model  approxima-on  

Page 30: Community)Detec-on)for)) Social)Compu-ng)leitang.net/presentation/community_detection_2011_berkeley.pdf · Community)Detec-on)for)) Social)Compu-ng) Lei)Tang) Yahoo!)Labs) 2011/09/16)@UC)Berkeley))

Cut  

•  Most  interac-ons  are  within  group  whereas  interac-ons  between  groups  are  few  

•  community  detec-on    minimum  cut  problem  •  Cut:  A  par--on  of  ver-ces  of  a  graph  into  two  disjoint  sets  •  Minimum  cut  problem:  find  a  graph  par--on  such  that  the  

number  of  edges  between  the  two  sets  is  minimized  

30  

(4)  Spectral  clustering  

Page 31: Community)Detec-on)for)) Social)Compu-ng)leitang.net/presentation/community_detection_2011_berkeley.pdf · Community)Detec-on)for)) Social)Compu-ng) Lei)Tang) Yahoo!)Labs) 2011/09/16)@UC)Berkeley))

Ra-o  Cut  &  Normalized  Cut  

•  Minimum  cut  open  returns  an  imbalanced  par--on,  with  one  set  being  a  singleton  

•  Change  the  objec-ve  func-on  to  consider  community  size  

Ratio Cut(π) =1

k

k�

i=1

cut(Ci, C̄i)

|Ci|,

Normalized Cut(π) =1

k

k�

i=1

cut(Ci, C̄i)

vol(Ci)

Ci,:  a  community  |Ci|:  number  of  nodes  in  Ci  vol(Ci):  sum  of  degrees  in  Ci  

31  

(4)  Spectral  clustering  

Page 32: Community)Detec-on)for)) Social)Compu-ng)leitang.net/presentation/community_detection_2011_berkeley.pdf · Community)Detec-on)for)) Social)Compu-ng) Lei)Tang) Yahoo!)Labs) 2011/09/16)@UC)Berkeley))

Ra-o  Cut  &  Normalized  Cut  Example  

Ratio Cut(π1) =1

2

�1

1+

1

8

�= 9/16 = 0.56

Normalized Cut(π1) =1

2

�1

1+

1

27

�= 14/27 = 0.52

For  parBBon  in  red:    

For  parBBon  in  green:    

Normalized Cut(π2) =1

2

�2

12+

2

16

�= 7/48 = 0.15 < Normalized Cut(π1)

Ratio Cut(π2) =1

2

�2

4+

2

5

�= 9/20 = 0.45 < Ratio Cut(π1)

π1

π2

Both  ra-o  cut  and  normalized  cut  prefer  a  balanced  par--on  32  

(4)  Spectral  clustering  

Page 33: Community)Detec-on)for)) Social)Compu-ng)leitang.net/presentation/community_detection_2011_berkeley.pdf · Community)Detec-on)for)) Social)Compu-ng) Lei)Tang) Yahoo!)Labs) 2011/09/16)@UC)Berkeley))

Spectral  Clustering  

•  Both  ra-o  cut  and  normalized  cut  can  be  reformulated  as  

•  Where  

•  Spectral  relaxa-on:  •  Op-mal  solu-on:    top  eigenvectors  with  the  smallest  

eigenvalues  

�L =

�D −AI −D−1/2AD−1/2

graph  Laplacian  for  ra-o  cut  

normalized  graph  Laplacian  

D = diag(d1, d2, · · · , dn) A  diagonal  matrix  of  degrees  

minS

Tr(ST �LS) s.t. STS = Ik

minS∈{0,1}n×k

Tr(ST �LS)

33  

(4)  Spectral  clustering  

Page 34: Community)Detec-on)for)) Social)Compu-ng)leitang.net/presentation/community_detection_2011_berkeley.pdf · Community)Detec-on)for)) Social)Compu-ng) Lei)Tang) Yahoo!)Labs) 2011/09/16)@UC)Berkeley))

Spectral  Clustering  Example  

D = diag(3, 2, 3, 4, 4, 4, 4, 3, 1)

�L = D −A =

3 −1 −1 −1 0 0 0 0 0−1 2 −1 0 0 0 0 0 0−1 −1 3 −1 0 0 0 0 0−1 0 −1 4 −1 −1 0 0 00 0 0 −1 4 −1 −1 −1 00 0 0 −1 −1 4 −1 −1 00 0 0 0 −1 −1 4 −1 −10 0 0 0 −1 −1 −1 3 00 0 0 0 0 0 −1 0 1

S =

0.33 −0.380.33 −0.480.33 −0.380.33 −0.120.33 0.160.33 0.160.33 0.300.33 0.240.33 0.51

Two  communi-es:    {1,  2,  3,  4}  and  {5,  6,  7,  8,  9}  

The  1st  eigenvector  means  all  nodes  belong  to  the  same  cluster,  no  

use  

k-­‐means  

34  

(4)  Spectral  clustering  

Page 35: Community)Detec-on)for)) Social)Compu-ng)leitang.net/presentation/community_detection_2011_berkeley.pdf · Community)Detec-on)for)) Social)Compu-ng) Lei)Tang) Yahoo!)Labs) 2011/09/16)@UC)Berkeley))

Modularity  Maximiza-on  

•  Modularity  measures  the  strength  of  a  community  par--on  by  taking  into  account  the  degree  distribu-on  

•  Given  a  network  with  m  edges,  the  expected  number  of  edges  between  two  nodes  with  di  and  dj    is        

•  Strength  of  a  community:    

•  Modularity:    •  A  larger  value  indicates  a  good  community  structure    

didj/2m

The  expected  number  of  edges  between  nodes  1  and  2  is  

3*2/  (2*14)  =  3/14  �

i∈C,j∈C

Aij − didj/2m

Q =1

2m

k�

�=1

i∈C�,j∈C�

(Aij − didj/2m)

35  

(5)  Modularity  maximiza-on  

Page 36: Community)Detec-on)for)) Social)Compu-ng)leitang.net/presentation/community_detection_2011_berkeley.pdf · Community)Detec-on)for)) Social)Compu-ng) Lei)Tang) Yahoo!)Labs) 2011/09/16)@UC)Berkeley))

Modularity  Matrix  

•  Modularity  matrix:  

•  Similar  to  spectral  clustering,  modularity  maximiza-on  can  be  reformulated  as  

•  Op-mal  solu-on:  top  eigenvectors  of  the  modularity  matrix    •  Apply  k-­‐means  to  S  as  a  post-­‐processing  step  to  obtain  

community  par--on  

B = A− ddT /2m (Bij = Aij − didj/2m)

maxQ =1

2mTr(STBS) s.t. STS = Ik

36  

(5)  Modularity  maximiza-on  

Page 37: Community)Detec-on)for)) Social)Compu-ng)leitang.net/presentation/community_detection_2011_berkeley.pdf · Community)Detec-on)for)) Social)Compu-ng) Lei)Tang) Yahoo!)Labs) 2011/09/16)@UC)Berkeley))

Modularity  Maximiza-on  Example  

B =

−0.32 0.79 0.68 0.57 −0.43 −0.43 −0.43 −0.32 −0.110.79 −0.14 0.79 −0.29 −0.29 −0.29 −0.29 −0.21 −0.070.68 0.79 −0.32 0.57 −0.43 −0.43 −0.43 −0.32 −0.110.57 −0.29 0.57 −0.57 0.43 0.43 −0.57 −0.43 −0.14

−0.43 −0.29 −0.43 0.43 −0.57 0.43 0.43 0.57 −0.14−0.43 −0.29 −0.43 0.43 0.43 −0.57 0.43 0.57 −0.14−0.43 −0.29 −0.43 −0.57 0.43 0.43 −0.57 0.57 0.86−0.32 −0.21 −0.32 −0.43 0.57 0.57 0.57 −0.32 −0.11−0.11 −0.07 −0.11 −0.14 −0.14 −0.14 0.86 −0.11 −0.04

S =

0.44 −0.000.38 0.230.44 −0.000.17 −0.48

−0.29 −0.32−0.29 −0.32−0.38 0.34−0.34 −0.08−0.14 0.63

Modularity  Matrix  

k-­‐means  

Two  Communi-es:  {1,  2,  3,  4}  and  {5,  6,  7,  8,  9}  

37  

(5)  Modularity  maximiza-on  

Page 38: Community)Detec-on)for)) Social)Compu-ng)leitang.net/presentation/community_detection_2011_berkeley.pdf · Community)Detec-on)for)) Social)Compu-ng) Lei)Tang) Yahoo!)Labs) 2011/09/16)@UC)Berkeley))

A  Unified  View  for  Community  Par--on  

Utility Matrix M =

modified proximity matrix �P if latent space modelsadjacency matrix A if block models

graph Laplacian �L if spectral clusteringmodularity maximization B if modularity maximization

•  Latent  space  models,  block  models,  spectral  clustering,  and  modularity  maximiza-on  can  be  unified  as      

38  

Page 39: Community)Detec-on)for)) Social)Compu-ng)leitang.net/presentation/community_detection_2011_berkeley.pdf · Community)Detec-on)for)) Social)Compu-ng) Lei)Tang) Yahoo!)Labs) 2011/09/16)@UC)Berkeley))

Hierarchy-­‐Centric  Community  Detec-on  

•  Goal:  build  a  hierarchical  structure  of  communi-es  based  on  network  topology  

•  Allow  the  analysis  of  a  network  at  different  resolu-ons  

•  Representa-ve  approaches:    – Divisive  Hierarchical  Clustering  – Agglomera-ve  Hierarchical  clustering  

39  

Page 40: Community)Detec-on)for)) Social)Compu-ng)leitang.net/presentation/community_detection_2011_berkeley.pdf · Community)Detec-on)for)) Social)Compu-ng) Lei)Tang) Yahoo!)Labs) 2011/09/16)@UC)Berkeley))

Divisive  Hierarchical  Clustering  

•  Divisive  clustering  –  Par--on  nodes  into  several  sets  –  Each  set  is  further  divided  into  smaller  ones  –  Network-­‐centric  par--on  can  be  applied  for  the  par--on  

•  One  par-cular  example:  recursively  remove  the  “weakest”  -e  –  Find  the  edge  with  the  least  strength  –  Remove  the  edge  and  update  the  corresponding  strength  of  each  edge  

•  Recursively  apply  the  above  two  steps  un-l  a  network  is  discomposed  into  desired  number  of  connected  components.  

•  Each  component  forms  a  community    

40  

Page 41: Community)Detec-on)for)) Social)Compu-ng)leitang.net/presentation/community_detection_2011_berkeley.pdf · Community)Detec-on)for)) Social)Compu-ng) Lei)Tang) Yahoo!)Labs) 2011/09/16)@UC)Berkeley))

Edge  Betweenness  

•  The  strength  of  a  -e  can  be  measured  by  edge  betweenness    

•  Edge  betweenness:  the  number  of  shortest  paths  that  pass  along  with  the  edge  

•  The  edge  with  higher  betweenness  tends  to  be  the  bridge  between  two  communi-es.    

The  edge  betweenness  of  e(1,  2)  is  4,  as  all  the  shortest  paths  from  2  to  {4,  5,  6,  7,  8,  9}  have  to  either  pass  e(1,  2)  or  e(2,  3),  and  e(1,2)  is  the  shortest  path  between  1  and  2  

edge-betweenness(e) = Σs<tσst(e)

σs,t

41  

Page 42: Community)Detec-on)for)) Social)Compu-ng)leitang.net/presentation/community_detection_2011_berkeley.pdf · Community)Detec-on)for)) Social)Compu-ng) Lei)Tang) Yahoo!)Labs) 2011/09/16)@UC)Berkeley))

Divisive  clustering  based  on  edge  betweenness  

Aper  remove  e(4,5),    the  betweenness    of  e(4,  6)  becomes  20,  which  is  the  highest;  

Aper  remove  e(4,6),      the  edge    e(7,9)  has  the  highest  betweenness  value  4,  and  should  be  removed.    

Ini-al  betweenness  value  

42  

Page 43: Community)Detec-on)for)) Social)Compu-ng)leitang.net/presentation/community_detection_2011_berkeley.pdf · Community)Detec-on)for)) Social)Compu-ng) Lei)Tang) Yahoo!)Labs) 2011/09/16)@UC)Berkeley))

Agglomera-ve  Hierarchical  Clustering  

•  Ini-alize  each  node  as  a  community  

•  Merge  communi-es  successively  into  larger  communi-es  following  a  certain  criterion  –  E.g.,  based  on  modularity  increase  

43  

Page 44: Community)Detec-on)for)) Social)Compu-ng)leitang.net/presentation/community_detection_2011_berkeley.pdf · Community)Detec-on)for)) Social)Compu-ng) Lei)Tang) Yahoo!)Labs) 2011/09/16)@UC)Berkeley))

Summary  of  Community  Detec-on  

•  Node-­‐Centric:  cliques,  k-­‐cliques,  k-­‐clubs  •  Group-­‐Centric:  quasi-­‐cliques  •  Network-­‐Centric:    

–  Clustering  based  on  vertex  similarity  

–  Latent  space  models,  block  models,  spectral  clustering,  modularity  maximiza-on  

•  Hierarchy-­‐Centric:  Divisive,  Agglomera-ve  

•  Difficult  to  determine  which  suits  the  best  

•  Other  dimensions  to  explore:  –  Directed  network  –  Overlapping  communi-es  

–  Sop  clustering  –  Op-miza-on   44  

Page 45: Community)Detec-on)for)) Social)Compu-ng)leitang.net/presentation/community_detection_2011_berkeley.pdf · Community)Detec-on)for)) Social)Compu-ng) Lei)Tang) Yahoo!)Labs) 2011/09/16)@UC)Berkeley))

COMMUNITIES  IN  SOCIAL  MEDIA  

45  

Page 46: Community)Detec-on)for)) Social)Compu-ng)leitang.net/presentation/community_detection_2011_berkeley.pdf · Community)Detec-on)for)) Social)Compu-ng) Lei)Tang) Yahoo!)Labs) 2011/09/16)@UC)Berkeley))

Ques-ons  &  Challenges  

•  Sta-s-cal  Proper-es  of  Communi-es  –  any  structural  paLerns  for  community  size,  density  and  growth?    

•  Evolu-on  –  Timeliness  is  emphasized  in  social  media  –  Interac-ons  are  highly  dynamic    

•  Heterogeneity  –  Various  types  of  en--es  and  interac-ons  are  involved  

•  Evalua-on  –  Lack  of  ground  truth,  and  complete  informa-on  due  to  privacy  

•  Scalability  –  Social  networks  are  open  in  a  scale  of  millions  of  nodes  and  connec-ons  –  Tradi-onal  Network  Analysis  open  deals  with  very  limited  number  of  of  

subjects    

46  

Page 47: Community)Detec-on)for)) Social)Compu-ng)leitang.net/presentation/community_detection_2011_berkeley.pdf · Community)Detec-on)for)) Social)Compu-ng) Lei)Tang) Yahoo!)Labs) 2011/09/16)@UC)Berkeley))

Size  Distribu-on  of  Explicit  Communi-es    (LiveJournal)  

47  Lei  Tang,  Xufei  Wang  and  Huan  Liu,  Group  Profiling  for  Understanding  Social  Structures,  TIST,  2011  

16K  bloggers,  132k  links,  100k  groups  

Sta-s-cal  property  

Page 48: Community)Detec-on)for)) Social)Compu-ng)leitang.net/presentation/community_detection_2011_berkeley.pdf · Community)Detec-on)for)) Social)Compu-ng) Lei)Tang) Yahoo!)Labs) 2011/09/16)@UC)Berkeley))

Size  Distribu-on  of  Explicit  Communi-es    (Flickr)    

48  

907K  users,  43k  groups  

Sta-s-cal  property  

Page 49: Community)Detec-on)for)) Social)Compu-ng)leitang.net/presentation/community_detection_2011_berkeley.pdf · Community)Detec-on)for)) Social)Compu-ng) Lei)Tang) Yahoo!)Labs) 2011/09/16)@UC)Berkeley))

Densi-es  of  Flickr  Groups  

49  

Average    Network  Density  

Sta-s-cal  property  

Page 50: Community)Detec-on)for)) Social)Compu-ng)leitang.net/presentation/community_detection_2011_berkeley.pdf · Community)Detec-on)for)) Social)Compu-ng) Lei)Tang) Yahoo!)Labs) 2011/09/16)@UC)Berkeley))

Distribu-on  Implicit  Communi-es  (consider  each  connected    component  as  a  community)  

50  

100 101 102 103 104 105 106 107 108 109100

101

102

103

104

105

106

107

size

frequency

The  giant  component  

Y!  Instant  Messenger  Contact  Network:  Hundreds  of  millions  of  nodes  

Billions  of  connec-ons  

Kun  liu  and  Lei  Tang,  Large  Scale  Behavioral  Targe-ng  with  a  Social  Twist,  CIKM  2011  

Sta-s-cal  property  

Page 51: Community)Detec-on)for)) Social)Compu-ng)leitang.net/presentation/community_detection_2011_berkeley.pdf · Community)Detec-on)for)) Social)Compu-ng) Lei)Tang) Yahoo!)Labs) 2011/09/16)@UC)Berkeley))

How  does  a  network  look  like?  Zooming  into  the  giant  component    

51  J.  Leskovec,  K.  Lang,  A.  Dasgupta,  M.  Mahoney.  Sta-s-cal  Proper-es  of    Community  Structure  in  Large  Social  and  Informa-on  Networks,  WWW  2008  

Sta-s-cal  property  

Page 52: Community)Detec-on)for)) Social)Compu-ng)leitang.net/presentation/community_detection_2011_berkeley.pdf · Community)Detec-on)for)) Social)Compu-ng) Lei)Tang) Yahoo!)Labs) 2011/09/16)@UC)Berkeley))

Community  Growth  wrt.  #friends  

•  Diminishing  returns  – Probability  of  joining  increases  with  the  number  friends  in  the  group  

– But  increases  get  smaller  and  smaller  52  

Livejournal     DBLP  

Group  Forma-on  in  Large  Social  Networks:  Membership,  Growth,  and  Evolu-on,  KDD  2006  

Sta-s-cal  property  

Page 53: Community)Detec-on)for)) Social)Compu-ng)leitang.net/presentation/community_detection_2011_berkeley.pdf · Community)Detec-on)for)) Social)Compu-ng) Lei)Tang) Yahoo!)Labs) 2011/09/16)@UC)Berkeley))

Community  Growth:  More  subtle  features    

•  Connectedness  of  friends:    – x  and  y  have  three  friends  in  the  group  – x’s  friends  are  independent  – y’s  friends  are  all  connected  

•  Who  is  more  likely  to  join?    

53  Adapted  from  Jure  Leskovec’s  lecture  slides  @Stanford    

Sta-s-cal  property  

Page 54: Community)Detec-on)for)) Social)Compu-ng)leitang.net/presentation/community_detection_2011_berkeley.pdf · Community)Detec-on)for)) Social)Compu-ng) Lei)Tang) Yahoo!)Labs) 2011/09/16)@UC)Berkeley))

Community  Growth  wrt.  Connectedness  of  Friends  

54  

Sta-s-cal  property  

Page 55: Community)Detec-on)for)) Social)Compu-ng)leitang.net/presentation/community_detection_2011_berkeley.pdf · Community)Detec-on)for)) Social)Compu-ng) Lei)Tang) Yahoo!)Labs) 2011/09/16)@UC)Berkeley))

Community  Evolu-on  

•  Communi-es  also  expand,  shrink  ,  or  dissolve  in  dynamic  networks  

•  How  to  uncover  latent  community  change  behind  dynamic  network  interac-ons?    

55  

Page 56: Community)Detec-on)for)) Social)Compu-ng)leitang.net/presentation/community_detection_2011_berkeley.pdf · Community)Detec-on)for)) Social)Compu-ng) Lei)Tang) Yahoo!)Labs) 2011/09/16)@UC)Berkeley))

Naïve  Approach  to  Studying  Community  Evolu-on  

•  Take  snapshots  of  a  network  •  find  communi-es  at  each  snapshot  •  Clustering  independently  at  each  

snapshot  

•  Cons:    –  Most  community  detec-on  methods  

produce  local  op-mal  solu-ons  

–  Hard  to  determine  if  the  evolu-on  is  due  to  the  evolu-on  or  algorithm  randomness  

56  

Community  Evolu-on  

Page 57: Community)Detec-on)for)) Social)Compu-ng)leitang.net/presentation/community_detection_2011_berkeley.pdf · Community)Detec-on)for)) Social)Compu-ng) Lei)Tang) Yahoo!)Labs) 2011/09/16)@UC)Berkeley))

Naïve  Approach  Example  

•  There  is  a  sharp  change  at  T2  •  This  approach  may  report  spurious  structural  changes  

!" #$

%&

" #

'

$

()

!" #$

%&

" #

'

$

()

A(1) A(2)A(3)at  T1   at  T3  at  T2  

H(2) =

1 01 01 01 01 01 01 01 00 1

H(1) =

1 01 01 01 00 10 10 10 10 1

H(3) =

1 01 01 00 10 10 10 10 10 1

57  

Community  Evolu-on  

Page 58: Community)Detec-on)for)) Social)Compu-ng)leitang.net/presentation/community_detection_2011_berkeley.pdf · Community)Detec-on)for)) Social)Compu-ng) Lei)Tang) Yahoo!)Labs) 2011/09/16)@UC)Berkeley))

Evolu-onary  Clustering  in  Smoothly  Evolving  Networks  

•  Evolu-onary  Clustering:  find  a  smooth  sequence  of  communi-es  given  a  series  of  network  snapshots  

•  Objec-ve  func-on:    snapshot  cost  (CS)  +  temporal  cost  (CT)  

•  Take  spectral  clustering  as  an  example  –  Snapshot  cost  :    –  Temporal  cost:    

•  Community  Evolu-on:                            where  

Cost = α · CS + (1− α) · CT

CSt = Tr(STt LtSt), s.t. ST

t St = Ik

CTt = �St − St−1�2 St  is  s-ll  a  valid  solu-on  aper  an  orthogonal  transforma-on    

CTt =1

2�StS

Tt − St−1S

Tt−1�2

Costt = Tr�STt�LtSt

�Lt = I − α ·D−1/2t A(t)D−1/2

t − (1− α) · St−1STt−1

58  

Community  Evolu-on  

Page 59: Community)Detec-on)for)) Social)Compu-ng)leitang.net/presentation/community_detection_2011_berkeley.pdf · Community)Detec-on)for)) Social)Compu-ng) Lei)Tang) Yahoo!)Labs) 2011/09/16)@UC)Berkeley))

Evolu-onary  Clustering  Example  !

" #$

%&

" #

'

$

()

!" #$

%&

" #

'

$

()

A(1) A(2)A(3)at  T1   at  T3  at  T2  

S1 =

0.33 −0.440.27 −0.430.33 −0.440.38 −0.160.38 0.240.38 0.240.38 0.380.33 0.300.19 0.23

�L2 =

0.91 −0.42 −0.33 −0.21 −0.01 −0.01 0.01 0.01 0.01−0.42 0.92 −0.08 −0.27 0.00 0.00 0.02 0.01 0.01−0.33 −0.08 0.91 −0.22 −0.25 −0.01 0.01 0.01 0.01−0.21 −0.27 −0.22 0.95 −0.18 −0.24 −0.02 −0.02 −0.01−0.01 0.00 −0.25 −0.18 0.94 −0.06 −0.37 −0.06 −0.04−0.01 0.00 −0.01 −0.24 −0.06 0.94 −0.07 −0.45 −0.040.01 0.02 0.01 −0.02 −0.37 −0.07 0.91 −0.44 −0.050.01 0.01 0.01 −0.02 −0.06 −0.45 −0.44 0.94 −0.040.01 0.01 0.01 −0.01 −0.04 −0.04 −0.05 −0.04 0.97

For  T1  For  T2  

We  obtain  two  communi-es  based  on  spectral  clustering  with  this  modified  graph  Laplacian:  

{1,  2,  3,  4}  and  {5,  6,  7,  8,  9}   59  

Community  Evolu-on  

Page 60: Community)Detec-on)for)) Social)Compu-ng)leitang.net/presentation/community_detection_2011_berkeley.pdf · Community)Detec-on)for)) Social)Compu-ng) Lei)Tang) Yahoo!)Labs) 2011/09/16)@UC)Berkeley))

Segment-­‐based  Clustering  with  Evolving  Networks  

•  Independent  clustering  at  each  snapshot  –  do  not  consider  temporal  informa-on  

–  Likely  to  output  specious  evalua-on  paLerns  •  Evolu-onary  clustering  enforces  smoothness  

–  may  fail  to  capture  dras-c  change  

•  How  to  strike  balance  between  gradual  changes  under  normal  circumstances  and  dras-c  changes  caused  by  major  events?  

•  Segment-­‐based  clustering:    –  Community  structure  remains  unchanged  in  a  segment  of  -me  –  A  change  between  consecu-ve  segments  

•  Fundamental  ques-on:  how  to  detect  the  change  points?    

60  

Community  Evolu-on  

Page 61: Community)Detec-on)for)) Social)Compu-ng)leitang.net/presentation/community_detection_2011_berkeley.pdf · Community)Detec-on)for)) Social)Compu-ng) Lei)Tang) Yahoo!)Labs) 2011/09/16)@UC)Berkeley))

Segment-­‐based  Clustering  

•  Segment-­‐based  Clustering  assumes  community  structure  remains  unchanged  in  a  segment  of  -me  

•  GraphScope  is  one  segment-­‐based  clustering  method  

–  If  network  connec-ons  do  not  change  much  over  -me,  consecu-ve  network  snapshots  should  be  grouped  into  one  segment  

–  If  a  new  network  snapshot  does  not  fit  into  an  exis-ng  segment  (when  current  community  structure  induces  a  high  cost  on  a  new  network  snapshot),  then  introduce  a  change  point  and  start  a  new  segment  

61  

Community  Evolu-on  

Page 62: Community)Detec-on)for)) Social)Compu-ng)leitang.net/presentation/community_detection_2011_berkeley.pdf · Community)Detec-on)for)) Social)Compu-ng) Lei)Tang) Yahoo!)Labs) 2011/09/16)@UC)Berkeley))

GraphScope  on  Enron  Data  

62  

Community  Evolu-on  

Page 63: Community)Detec-on)for)) Social)Compu-ng)leitang.net/presentation/community_detection_2011_berkeley.pdf · Community)Detec-on)for)) Social)Compu-ng) Lei)Tang) Yahoo!)Labs) 2011/09/16)@UC)Berkeley))

Heterogeneous  Networks  

Heterogeneous  Network  

Heterogeneous  Interac-ons  

Mul--­‐Dimensional  Network  

Heterogeneous  Nodes  

Mul--­‐Mode  Network  

63  

Page 64: Community)Detec-on)for)) Social)Compu-ng)leitang.net/presentation/community_detection_2011_berkeley.pdf · Community)Detec-on)for)) Social)Compu-ng) Lei)Tang) Yahoo!)Labs) 2011/09/16)@UC)Berkeley))

Why  Does  Heterogeneity  MaLer  

•  Social  media  introduces  heterogeneity  •  It  calls  for  solu-ons  to  community  detec-on  in  heterogeneous  networks  –  Interac-ons  in  social  media  are  noisy  –  Interac-ons  in  one  mode  or  one  dimension  might  be  too  noisy  to  detect  meaningful  communi-es    

–  Not  all  users  are  ac-ve  in  all  dimensions  or  with  different  modes  

•  Need  integra-on  of  interac-ons  at  mul-ple  dimensions  or  modes  

•  Details  skipped  due  to  -me  limit  (check  out  chapter  4  of  the  lecture  book)  

64  

Page 65: Community)Detec-on)for)) Social)Compu-ng)leitang.net/presentation/community_detection_2011_berkeley.pdf · Community)Detec-on)for)) Social)Compu-ng) Lei)Tang) Yahoo!)Labs) 2011/09/16)@UC)Berkeley))

Evalua-ng  Community  Detec-on  

•  For  groups  with  clear  defini-ons  – E.g.,  Cliques,  k-­‐cliques,  k-­‐clubs,  quasi-­‐cliques  – Verify  whether  extracted  communi-es  sa-sfy  the  defini-on  

•  For  networks  with  ground  truth  informa-on  – Normalized  mutual  informa-on  – Accuracy  of  pairwise  community  memberships  

65  

Page 66: Community)Detec-on)for)) Social)Compu-ng)leitang.net/presentation/community_detection_2011_berkeley.pdf · Community)Detec-on)for)) Social)Compu-ng) Lei)Tang) Yahoo!)Labs) 2011/09/16)@UC)Berkeley))

Measuring  a  Clustering  Result  

•  The  number  of  communi-es  aper  grouping  can  be  different  from  the  ground  truth  

•  No  clear  community  correspondence  between  clustering  result  and  the  ground  truth    

•  Normalized  Mutual  Informa-on  can  be  used  

Ground  Truth  

1,  2,  3  

4,  5,  6  

1,  3   2  4,  5,  6  

Clustering  Result  

How  to  measure  the    clustering  quality?  

66  

Community  Evalua-on  

Page 67: Community)Detec-on)for)) Social)Compu-ng)leitang.net/presentation/community_detection_2011_berkeley.pdf · Community)Detec-on)for)) Social)Compu-ng) Lei)Tang) Yahoo!)Labs) 2011/09/16)@UC)Berkeley))

Normalized  Mutual  Informa-on  •  Entropy:  the  informa-on  contained  in  a  distribu-on  

•  Mutual  Informa-on:  the  shared  informa-on  between  two  distribu-ons  

•  Normalized  Mutual  Informa-on  (between  0  and  1)  

•  Consider  a  par--on  as  a  distribu-on  (probability  of  one  node  falling  into  one  community),  we  can  compute  the  matching  between  two  clusterings  

67  

Community  Evalua-on  

Page 68: Community)Detec-on)for)) Social)Compu-ng)leitang.net/presentation/community_detection_2011_berkeley.pdf · Community)Detec-on)for)) Social)Compu-ng) Lei)Tang) Yahoo!)Labs) 2011/09/16)@UC)Berkeley))

NMI  

68  

Community  Evalua-on  

Page 69: Community)Detec-on)for)) Social)Compu-ng)leitang.net/presentation/community_detection_2011_berkeley.pdf · Community)Detec-on)for)) Social)Compu-ng) Lei)Tang) Yahoo!)Labs) 2011/09/16)@UC)Berkeley))

NMI-­‐Example  

•  Par--on  a:    [1,  1,  1,  2,  2,  2]    •  Par--on  b:    [1,  2,  1,  3,  3,  3]  

1,  2,  3   4,  5,  6  

1,  3   2   4,  5,6  

h=1   3  

h=2   3  

l=1   2  

l=2   1  

l=3   3  

l=1   l=2   l=3  

h=1   2   1   0  

h=2   0   0   3  

=0.8278  

69  

Community  Evalua-on  

Page 70: Community)Detec-on)for)) Social)Compu-ng)leitang.net/presentation/community_detection_2011_berkeley.pdf · Community)Detec-on)for)) Social)Compu-ng) Lei)Tang) Yahoo!)Labs) 2011/09/16)@UC)Berkeley))

Accuracy  of  Pairwise  Community  Memberships  

•  Consider  all  the  possible  pairs  of  nodes  and  check  whether  they  reside  in  the  same  community  

•  An  error  occurs  if  

–   Two  nodes  belonging  to  the  same  community  are  assigned  to  different  communi-es  aper  clustering  

–  Two  nodes  belonging  to  different  communi-es  are  assigned  to  the  same  community    

•  Construct  a  con-ngency  table    

accuracy =a+ d

a+ b+ c+ d=

a+ d

n(n− 1)/270  

Community  Evalua-on  

Page 71: Community)Detec-on)for)) Social)Compu-ng)leitang.net/presentation/community_detection_2011_berkeley.pdf · Community)Detec-on)for)) Social)Compu-ng) Lei)Tang) Yahoo!)Labs) 2011/09/16)@UC)Berkeley))

Accuracy  Example  

Ground  Truth  

C(vi)  =    C(vj)   C(vi)  !=  C(vj)  

Clustering  Result  

C(vi)  =    C(vj)   4   0  

C(vi)  !=  C(vj)   2   9  

Ground  Truth  

1,  2,  3  

4,  5,  6  

1,  3   2  4,  5,  6  

Clustering  Result  

Accuracy  =  (4+9)/  (4+2+9+0)  =  13/15  

71  

Community  Evalua-on  

Page 72: Community)Detec-on)for)) Social)Compu-ng)leitang.net/presentation/community_detection_2011_berkeley.pdf · Community)Detec-on)for)) Social)Compu-ng) Lei)Tang) Yahoo!)Labs) 2011/09/16)@UC)Berkeley))

Evalua-on  using  Seman-cs  

•  For  networks  with  seman-cs  –  Networks  come  with  seman-c  or  aLribute  informa-on  of  nodes  or  connec-ons  

–  Human  subjects  can  verify  whether  the  extracted  communi-es  are  coherent    

•  Evalua-on  is  qualita-ve  •  It  is  also  intui-ve  and  helps  understand  a  community  

An  animal  community  

A  health  community  

72  

Community  Evalua-on  

Page 73: Community)Detec-on)for)) Social)Compu-ng)leitang.net/presentation/community_detection_2011_berkeley.pdf · Community)Detec-on)for)) Social)Compu-ng) Lei)Tang) Yahoo!)Labs) 2011/09/16)@UC)Berkeley))

Evalua-on  without  Ground  Truth  

•  For  networks  without  ground  truth  or  seman-c  informa-on  

•  This  is  the  most  common  situa-on  •  An  op-on  is  to  resort  to  cross-­‐valida-on  

–  Extract  communi-es  from  a  (training)  network  –  Evaluate  the  quality  of  the  community  structure  on  a  network  constructed  from  a  different  date  or  based  on  a  related  type  of  interac-on  

•  Quan-ta-ve  evalua-on  func-ons  –  modularity  

–  block  model  approxima-on  error  

73  

Community  Evalua-on  

Page 74: Community)Detec-on)for)) Social)Compu-ng)leitang.net/presentation/community_detection_2011_berkeley.pdf · Community)Detec-on)for)) Social)Compu-ng) Lei)Tang) Yahoo!)Labs) 2011/09/16)@UC)Berkeley))

Scaling  Community  Detec-on  

•  Social  media  networks  are  huge.  How  to  scale  community  detec-on  methods?    – Approxima-on  

•  Sampling  

•  Local  graph  processing  •  Mul--­‐level  methods  

– Exact  •  Streaming/Itera-ve  schemes  

•  Distributed  and  Parallel  processing    

74  

Page 75: Community)Detec-on)for)) Social)Compu-ng)leitang.net/presentation/community_detection_2011_berkeley.pdf · Community)Detec-on)for)) Social)Compu-ng) Lei)Tang) Yahoo!)Labs) 2011/09/16)@UC)Berkeley))

Sampling  

•  Downsample  the  network  such  that  classical  community  detec-on  methods  can  be  applied  –  Sample  nodes  (or  edges)  

•  Which  nodes  should  we  sample?    –  A  simple  heuris-c:  keep  those  high-­‐degree  nodes  

•  Node  degrees  follows  a  power-­‐law  distribu-on  •  Communi-es  might  form  around  those  popular  ones  

–  Uncover  the  community  membership  for  remaining  nodes  •  Check  if  any  of  his  friends  has  been  assigned  to  any  community    

–  This  simple  heuris-c  works  reasonably  well  

•  Op-miza-on  may  achieve  beLer  approxima-on  

75  

Scaling  community  detec-on  

Page 76: Community)Detec-on)for)) Social)Compu-ng)leitang.net/presentation/community_detection_2011_berkeley.pdf · Community)Detec-on)for)) Social)Compu-ng) Lei)Tang) Yahoo!)Labs) 2011/09/16)@UC)Berkeley))

Local  graph  processing  

•  Circumvent  the  memory  boLleneck  – Rather  than  compu-ng  global  quality,  check  local  neighborhood  (say,  k-­‐hop)  for  computa-on  •  Compute  edge-­‐betweenness  

– Construct  communi-es  from  seeded  nodes  

76  

Scaling  community  detec-on  

Page 77: Community)Detec-on)for)) Social)Compu-ng)leitang.net/presentation/community_detection_2011_berkeley.pdf · Community)Detec-on)for)) Social)Compu-ng) Lei)Tang) Yahoo!)Labs) 2011/09/16)@UC)Berkeley))

Mul--­‐Level  Approaches  

•  Find  a  rough  par--on  of  the  network  into  communi-es  by  use  of  a  fast  process  (may  at  the  expense  of  accuracy)  

•  Construct  a  meta  network  with  nodes  being  communi-es  and  edges  the  connec-on  between  communi-es  

•  Meta-­‐network  is  much  smaller  than  the  original  network    

77  hLp://sites.google.com/site/findcommuni-es/  

Scaling  community  detec-on  

Page 78: Community)Detec-on)for)) Social)Compu-ng)leitang.net/presentation/community_detection_2011_berkeley.pdf · Community)Detec-on)for)) Social)Compu-ng) Lei)Tang) Yahoo!)Labs) 2011/09/16)@UC)Berkeley))

Streaming/Itera-ve  schemes  

•  An  itera-ve  process  examines  each  vertex  along  with  its  neighbor  in  a  given  order  and  perform  computa-on  

•  Repeat  vertex  itera-on  un-l  convergence  

78  

Scaling  community  detec-on  

Page 79: Community)Detec-on)for)) Social)Compu-ng)leitang.net/presentation/community_detection_2011_berkeley.pdf · Community)Detec-on)for)) Social)Compu-ng) Lei)Tang) Yahoo!)Labs) 2011/09/16)@UC)Berkeley))

Distributed/Parallel  Compu-ng  

•  Exploit  the  power  of  parallel  compu-ng  –  Extend  community  detec-on  methods  to  Hadoop  MapReduce  

–  Typically  requires  mul-ple  itera-ons  as  well  –  Move  computa-on  to  data  (split  into  N  chunks)  

79  

Scaling  community  detec-on  

Page 80: Community)Detec-on)for)) Social)Compu-ng)leitang.net/presentation/community_detection_2011_berkeley.pdf · Community)Detec-on)for)) Social)Compu-ng) Lei)Tang) Yahoo!)Labs) 2011/09/16)@UC)Berkeley))

COMMUNITY  DETECTION  FOR  NETWORK-­‐BASED  CLASSIFICATION  

80  

Page 81: Community)Detec-on)for)) Social)Compu-ng)leitang.net/presentation/community_detection_2011_berkeley.pdf · Community)Detec-on)for)) Social)Compu-ng) Lei)Tang) Yahoo!)Labs) 2011/09/16)@UC)Berkeley))

Correla-ons  in  Network  

•  Individual  behaviors  are  correlated  in  a  network  environment  

homophily   influence   Confounding  

81  

Page 82: Community)Detec-on)for)) Social)Compu-ng)leitang.net/presentation/community_detection_2011_berkeley.pdf · Community)Detec-on)for)) Social)Compu-ng) Lei)Tang) Yahoo!)Labs) 2011/09/16)@UC)Berkeley))

Classifica-on  with  Network  Data  

•  How  to  leverage  this  correla-on  observed  in  networks  to  help  predict  user  aLributes  or  interests?    

Predict  the  labels  for  nodes  in  yellow  

82  

Page 83: Community)Detec-on)for)) Social)Compu-ng)leitang.net/presentation/community_detection_2011_berkeley.pdf · Community)Detec-on)for)) Social)Compu-ng) Lei)Tang) Yahoo!)Labs) 2011/09/16)@UC)Berkeley))

Collec-ve  Classifica-on  

•  Labels  of  nodes  are  interdependent  with  each  other  •  The  label  of  one  node  cannot  be  determined  independently;  

Need  Collec-ve  Classifica-on  

•  Markov  Assump-on:  the  label  of  one  node  depends  on  the  label  of  his  neighbors  

•  Collec-ve  classifica-on  involves  3  components:  

Local  Classifier  

•  Assign  ini-al  label  

Rela-onal  Classifier  

•  Capture  correla-ons  between  nodes  

Collec-ve  Inference  

•  Propagate  correla-ons  through  network  

83  

Page 84: Community)Detec-on)for)) Social)Compu-ng)leitang.net/presentation/community_detection_2011_berkeley.pdf · Community)Detec-on)for)) Social)Compu-ng) Lei)Tang) Yahoo!)Labs) 2011/09/16)@UC)Berkeley))

Collec-ve  Classifica-on  

•  Local  Classifier:  used  for  ini-al  label  assignment  –  Predicts  label  based  on  node  aLributes  –  Classical  classifica-on  learning  –  Does  not  employ  network  informa-on  

•  Rela-onal  Classifier:  capture  correla-ons  based  on  label  info  –  Learn  a  classifier  from  the  labels  or/and  aLributes  of  its  neighbors  to  

the  label  of  one  node  –  Network  informa-on  is  used  

•  Collec-ve  Classifica-on:  propagate  the  correla-on  –  Apply  rela-onal  classifier  to  each  node  itera-vely  –  Iterate  un-l  the  inconsistency  between  neighboring  labels  is  

minimized  –  Network  structure  substan-ally  affects  the  final  predic-on  

84  

Page 85: Community)Detec-on)for)) Social)Compu-ng)leitang.net/presentation/community_detection_2011_berkeley.pdf · Community)Detec-on)for)) Social)Compu-ng) Lei)Tang) Yahoo!)Labs) 2011/09/16)@UC)Berkeley))

Weighted-­‐vote  Rela-onal  Neighborhood  Classifier  

•  No  local  classifier  •  Rela-onal  classifier  

–  predic-on  of  one  node  is  the  average  of  its  neighbors  

•  Collec-ve  Inference  

85  

Page 86: Community)Detec-on)for)) Social)Compu-ng)leitang.net/presentation/community_detection_2011_berkeley.pdf · Community)Detec-on)for)) Social)Compu-ng) Lei)Tang) Yahoo!)Labs) 2011/09/16)@UC)Berkeley))

Example  of  wvRN  

•  Ini-aliza-on  for  unlabeled  nodes    –   p(yi=1|Ni)=0.5  

•  1st  Itera-on:  –  For  node  3,  N3={1,2,4}  

•  P(y1=1|N1)  =  0  •  P(y2=1|N2)  =  0  •  P(y4=1  |N4)  =  0.5  •  P(y3=1|N3)  =  1/3  (0  +  0  +  0.5)  =  0.17  

–  For  node  4,  N4={1,3,  5,  6}  •  P(y4=1|N4)=  ¼(0+  0.17+0.5+1)  =  0.42  

–  For  node  5,  N5={4,6,7,8}    •  P(y5=1|N5)  =  ¼  (0.42+1+1+0.5)  =  0.73  

86  

Page 87: Community)Detec-on)for)) Social)Compu-ng)leitang.net/presentation/community_detection_2011_berkeley.pdf · Community)Detec-on)for)) Social)Compu-ng) Lei)Tang) Yahoo!)Labs) 2011/09/16)@UC)Berkeley))

Itera-ve  Result  

•  Stabilizes  aper  5  itera-ons  –  Nodes  5,  8,  9  are  +  (Pi  >  0.5)  –  Node  3  is  –    (Pi  <  0.5)  –  Node  4  is  in  between  (Pi  =0.5)  

87  

Page 88: Community)Detec-on)for)) Social)Compu-ng)leitang.net/presentation/community_detection_2011_berkeley.pdf · Community)Detec-on)for)) Social)Compu-ng) Lei)Tang) Yahoo!)Labs) 2011/09/16)@UC)Berkeley))

Community-­‐Based  Learning  

•  People  in  social  media  engage  in  various  rela-onships    –  Colleagues  –  Rela-ves  –  Friends  –  Co-­‐travelers  

•  Different  rela-onships  have  different  correla-ons  with  user  interests/behavior/profiles  

88  

Page 89: Community)Detec-on)for)) Social)Compu-ng)leitang.net/presentation/community_detection_2011_berkeley.pdf · Community)Detec-on)for)) Social)Compu-ng) Lei)Tang) Yahoo!)Labs) 2011/09/16)@UC)Berkeley))

Challenges  

•  Social  media  open  comes  with  a  network,  but  no  rela-onship  informa-on  

•  Or  rela-onship  informa-on  is  not  complete  or  refined  enough  

•  Challenges:  –  How  to  differen-ate  these  heterogeneous  rela-onship  in  a  network?  

–  How  to  determine  whether  the  rela-onship  is  useful  for  predic-on?  

89  

Page 90: Community)Detec-on)for)) Social)Compu-ng)leitang.net/presentation/community_detection_2011_berkeley.pdf · Community)Detec-on)for)) Social)Compu-ng) Lei)Tang) Yahoo!)Labs) 2011/09/16)@UC)Berkeley))

Social  Dimension  

•  Social  Dimension:    –  Latent  dimensions  defined  by  social  connec-ons  

•  Each  dimension  represents  one  type  of  rela-onship  

90  

Page 91: Community)Detec-on)for)) Social)Compu-ng)leitang.net/presentation/community_detection_2011_berkeley.pdf · Community)Detec-on)for)) Social)Compu-ng) Lei)Tang) Yahoo!)Labs) 2011/09/16)@UC)Berkeley))

A  Learning  Framework  based  on  Social  Dimensions  

•  People  involved  in  the  same  social  dimension  are  likely  to  connect  to  each  other,  thus  forming  a  community  

•  Training:    1.  Extract  meaningful  social  dimensions  based  on  network  

connec-vity  via  community  detec-on  2.  Determine  relevant  social  dimensions  (plus  node  

aLributes)  through  supervised  learning  

•  Predic-on  –  Apply  the  constructed  model  in  Step  2  to  social  dimensions  of  unlabeled  nodes  

–  No  itera-ve  inference  

91  

Page 92: Community)Detec-on)for)) Social)Compu-ng)leitang.net/presentation/community_detection_2011_berkeley.pdf · Community)Detec-on)for)) Social)Compu-ng) Lei)Tang) Yahoo!)Labs) 2011/09/16)@UC)Berkeley))

Underlying  Assump-on  

•  Assump-on:    –  the  label  of  one  node  is  determined  by  its  social  dimension    –  P(yi|A)  =  P(yi|Si)  

•  Community  membership  serves  as  latent  features  

92  

Page 93: Community)Detec-on)for)) Social)Compu-ng)leitang.net/presentation/community_detection_2011_berkeley.pdf · Community)Detec-on)for)) Social)Compu-ng) Lei)Tang) Yahoo!)Labs) 2011/09/16)@UC)Berkeley))

Example  of  SocioDim  Framework  

•  One  is  likely  to  involve  in  mul-ple  rela-onships,  thus  sop  clustering  is  used  to  extract  social  dimensions  

Spectral  Clustering   Support  Vector  Machine  93  

Page 94: Community)Detec-on)for)) Social)Compu-ng)leitang.net/presentation/community_detection_2011_berkeley.pdf · Community)Detec-on)for)) Social)Compu-ng) Lei)Tang) Yahoo!)Labs) 2011/09/16)@UC)Berkeley))

Collec-ve  Classifica-on  vs.  Community-­‐Based  Learning  

CollecBve  ClassificaBon  

Community-­‐Based  Learning  

Computa-onal  Cost  for  Training   low   high  

Computa-onal  Cost  for  Predic-on  

high   low  

Handling  Heterogeneous  Rela-ons  

✔  

Handling  Incremental  Network  Change  

✔  

Integra-ng  network  informa-on  and  actor  aLributes  

✔   ✔  

94  

Page 95: Community)Detec-on)for)) Social)Compu-ng)leitang.net/presentation/community_detection_2011_berkeley.pdf · Community)Detec-on)for)) Social)Compu-ng) Lei)Tang) Yahoo!)Labs) 2011/09/16)@UC)Berkeley))

References  

95  

•  Lei  Tang  and  Huan  Liu.  Community  Detec-on  and  Mining                in  Social  Media,  Morgan  &  Claypool  Publishers,  2010.  •  Santo  Fortunato,  Community  Detec-on  in  Graphs,  Physics  Reports,  vol.  

486,  pages  75-­‐174,  2010  •  Lei  Tang  and  Huan  Liu.  Leveraging  Social  Media  Networks  for  

Classifica-on.  Journal  of  Data  Mining  and  Knowledge  Discovery  (DMKD),  2011.  

•  Papadopoulos,  S.  and  Kompatsiaris,  Y.  and  Vakali,  A.  and  Spyridonos,  P.,  Community  Detec-on  in  Social  Media,  Journal  of  Data  Mining  and  Knowledge  Discovery  (DMKD),  Pages  1-­‐40,    2011  

•  Michele  Coscia,  Fosca  Gianno�,  and  Dino  Pedreschi,  A  Classifica-on  for  Community  Discovery  Methods  in  Complex  Networks,  Sta-s-cal  Analysis  and  Data  Mining,  2011  

Page 96: Community)Detec-on)for)) Social)Compu-ng)leitang.net/presentation/community_detection_2011_berkeley.pdf · Community)Detec-on)for)) Social)Compu-ng) Lei)Tang) Yahoo!)Labs) 2011/09/16)@UC)Berkeley))

Lei Tang scientist, Yahoo! Labs

[email protected]

96