Community Extraction for Social Networks - U-M Personal World Wide

29
Community Extraction for Social Networks Yunpeng Zhao Department of Statistics, University of Michigan, Ann Arbor, MI 48109, USA August 3, 2011 Advisor: Liza Levina and Ji Zhu

Transcript of Community Extraction for Social Networks - U-M Personal World Wide

Page 1: Community Extraction for Social Networks - U-M Personal World Wide

Community Extraction for Social Networks

Yunpeng Zhao

Department of Statistics, University of Michigan, Ann Arbor, MI 48109, USA

August 3, 2011

Advisor: Liza Levina and Ji Zhu

Page 2: Community Extraction for Social Networks - U-M Personal World Wide

Outline

Review of community detection

Community extraction

Asymptotic consistency

Simulation study

Real data analysis

Page 3: Community Extraction for Social Networks - U-M Personal World Wide

Network data

Network analysis has been a focus of attention in differentfields.

Social science: friendship networks

Internet: WWW, hyper-links

Biology: food webs, gene regulatory networks

Page 4: Community Extraction for Social Networks - U-M Personal World Wide

Community detection

Communities: Networks consist of communities, orclusters, with many connections within a community andfew connections between communities.

Community detection problem: For an undirected networkN = (V ,E), the community detection problem is typicallyformulated as finding a partition V = V1 ∪·· ·∪VK whichgives “tight” communities in some suitable sense.

Page 5: Community Extraction for Social Networks - U-M Personal World Wide

Community detection problem

Existing community detection methods: minimizing linksbetween communities while maximizing links withincommunities (see Newman (2004) for a review).

For simplicity, we consider the case of partitioning the networkinto two communities V1 and V2.

Page 6: Community Extraction for Social Networks - U-M Personal World Wide

Min-cut (Wu and Leahy, 1993)

To minimize

R = ∑i∈V1,j∈V2

Aij .

However, min-cut always yields a trivial solution of V1 = V orV2 = V .

Page 7: Community Extraction for Social Networks - U-M Personal World Wide

Ratio-cut (Wei and Cheng, 1989)

min R/(|V1| · |V2|),

where |V1| and |V2| represent the sizes of two groupsrespectively.

Ratio-cut can avoid trivial solutions because the maximizer of|V1| · |V2| is achieved at |V1| = |V2| = |V |/2.

Page 8: Community Extraction for Social Networks - U-M Personal World Wide

Normalized-cut (Shi and Malik, 2000)

minR

assoc(V1,V )+

Rassoc(V2,V )

,

where assoc(Vk ,V ) = ∑i∈Vk ,j∈V Aij for k = 1,2.

Normalized-cut can avoid trivial solutions because an extremelysmall group Vk may have a large ratio R/assoc(Vk ,V ).

Page 9: Community Extraction for Social Networks - U-M Personal World Wide

Modularity (Newman and Girvan, 2004)

To maximize

Q =2

∑k=1

[

Okk

L−

(

Dk

L

)2]

,

where Okk = ∑i∈Vk ,j∈VkAij ,Dk = ∑i∈Vk ,j∈V Aij ,L = ∑2

k=1 Dk .

Q represents the fraction of edges that fall within communities,minus the “average” value of the same quantity if edges fall atrandom given the degree of each node.

Page 10: Community Extraction for Social Networks - U-M Personal World Wide

Outline

X Review of community detection

Community extraction

Asymptotic consistency

Simulation study

Real data analysis

Page 11: Community Extraction for Social Networks - U-M Personal World Wide

Community extraction

Most networks consist of a number (not known a priori) ofcommunities, with relatively tight links within eachcommunity and sparse links to the outside, and“background” nodes that only have sparse links to othernodes.

We propose a method that extracts communitiessequentially: at each step, the tightest is extracted from thenetwork until no more meaningful communities exist.

Page 12: Community Extraction for Social Networks - U-M Personal World Wide

Criterion

Extract one community at a time by looking for a set ofnodes with a large number of links within itself and a smallnumber of links to the rest of the network.

The links within the complement of this set do not matter.

To maximize

W (S) =I(S)

k2 −B(S)

k(n−k),

where

I(S) = ∑i ,j∈S

Aij , B(S) = ∑i∈S,j∈Sc

Aij , k = |S| .

Page 13: Community Extraction for Social Networks - U-M Personal World Wide

Adjusted criterion

Empirically, the previous criterion performs well for densenetworks. However, it always finds very small communitiesfor sparse networks.

To avoid small communities, we also propose

To maximize

Wa(S) = k(n−k)

(

I(S)

k2 −B(S)

k(n−k)

)

.

The factor k(n−k) penalizes communities with k close to 1 orn and encourages more balanced solutions.

Page 14: Community Extraction for Social Networks - U-M Personal World Wide

Algorithm

Tabu Search (Glover, 1986; Glover and Laguna, 1997): alocal optimization technique based on label switching

Run the algorithm for many randomly ordered nodes

Page 15: Community Extraction for Social Networks - U-M Personal World Wide

Outline

X Review of community detection

X Community extraction

Asymptotic consistency

Simulation study

Real data analysis

Future work

Page 16: Community Extraction for Social Networks - U-M Personal World Wide

Block models

Asymptotic consistency can be established under theassumption of block models.

General block models1 Each node is assigned to a block independently of other

nodes, with probability πk for block k , 1 ≤ k ≤ K ,∑K

k=1 πk = 1.2 Given that node i belongs to block a and node j belongs to

block b, P[Aij = 1] = pab, and all edges are independent.

Block models for networks with background

We can define the last block as background, by assumingpaK < pbb for all a = 1, . . . ,K , and all b = 1, . . . ,K −1.

Page 17: Community Extraction for Social Networks - U-M Personal World Wide

Asymptotic consistency

For simplicity, assume there is only one community andbackground in the network (K = 2 with parametersp11,p12,p22,π and 1−π).

Let c denote the true community labels, c(n) denote theestimated labels, based on Bickel and Chen (2010), weproved

Theorem

For any 0 < π < 1, if p11 > p12, p11 > p22 and p11 +p22 > 2p12,the maximizer c(n) of both unadjusted and adjusted criteriasatisfies

P[c(n)= c] → 1 as n → ∞.

Page 18: Community Extraction for Social Networks - U-M Personal World Wide

Outline

X Review of community detection

X Community extraction

X Asymptotic consistency

Simulation study

Real data analysis

Page 19: Community Extraction for Social Networks - U-M Personal World Wide

Simulation I

Two communities with background (block model)

n = 1000

n1 = 100,200,n2 = 100

p12 = p23 = p13 = p33 = 0.05

p11 = 0.05i ,p22 = 0.04i , i = 3,4

Rand index

Page 20: Community Extraction for Social Networks - U-M Personal World Wide

Results for simulation I

M B E

0.0

0.2

0.4

0.6

0.8

1.0

n1=

100

n2=

200

p11=0.15 p22=0.12

M B E

0.0

0.2

0.4

0.6

0.8

1.0

p11=0.2 p22=0.16

M B E

0.0

0.2

0.4

0.6

0.8

1.0

n1=

200

n2=

200

M B E

0.0

0.2

0.4

0.6

0.8

1.0

Page 21: Community Extraction for Social Networks - U-M Personal World Wide

Simulation II

Two communities with background

n = 1000

n1 = 100,200,n2 = 100

p12 = p23 = p13 = p33 = 0.05

p11 = 0.05i ,p22 = 0.04i , i = 3,4

Doubling the degree for 10 highest degree nodes

Page 22: Community Extraction for Social Networks - U-M Personal World Wide

Results for simulation II

M B E

0.0

0.2

0.4

0.6

0.8

1.0

n1=

100

n2=

200

p11=0.15 p22=0.12

M B E

0.0

0.2

0.4

0.6

0.8

1.0

p11=0.2 p22=0.16

M B E

0.0

0.2

0.4

0.6

0.8

1.0

n1=

200

n2=

200

M B E

0.0

0.2

0.4

0.6

0.8

1.0

Page 23: Community Extraction for Social Networks - U-M Personal World Wide

Outline

X Review of community detection

X Community extraction

X Asymptotic consistency

X Simulation study

Real data analysis

Page 24: Community Extraction for Social Networks - U-M Personal World Wide

Karate club network

Friendships between 34 members of a karate club(Zachary, 1977).

This club has subsequently split into two parts following adisagreement between an instructor (node 0) and anadministrator (node 33).

Page 25: Community Extraction for Social Networks - U-M Personal World Wide

Karate club network

(a) Modularity (b) Block model (c) Extraction

01

2

3

4

5

6 7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

23

24

25

26

27

28

29

30

31

3233

01

2

3

4

5

6 7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

23

24

25

26

27

28

29

30

31

3233

01

2

3

4

5

6 7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

23

24

25

26

27

28

29

30

31

3233

Page 26: Community Extraction for Social Networks - U-M Personal World Wide

Political books network

Links in the political books network (Newman, 2006) representpairs of books frequently bought together on amazon.com.

Blue: liberalRed: conservative

Page 27: Community Extraction for Social Networks - U-M Personal World Wide

Political books network

(a) Modularity (b) Block model (c) Extraction

Page 28: Community Extraction for Social Networks - U-M Personal World Wide

Acknowledgment

Thank my advisors: Elizaveta Levina and Ji Zhu

Thank Professor Mark Newman for constructivesuggestion and sharing his code

Thank my friends: Xi Chen, Jian Guo and Yizao Wang

And also

Page 29: Community Extraction for Social Networks - U-M Personal World Wide

Thank you all very much!