A scalable multilevel algorithm for community structure detection Melih Onus Hristo Djidjev Arizona State University Los Alamos National Laboratory Models and Algorithms for the Web Graph (WAW 2006) November 29 – December 2, 2006
A scalable multilevel algorithm for community structure detection

Melih Onus Hristo Djidjev Arizona State University Los Alamos National Laboratory

Models and Algorithms for the Web Graph (WAW 2006) November 29 – December 2, 2006

Community Structure Detection Problem

The problem of identifying communities in a network is usually modeled as a graph clustering problem:

– Vertices correspond to individual items
– Edges describe relationships
– The communities correspond to subgraphs
  • Dense connections between vertices from the same subgraph
  • Fewer connections between vertices in different subgraphs

Motivation: Why detect communities?

Analyze and understand the information contained in the huge amount of data available on the WWW

Finding related commercial items

Recommendation systems

Important for

– Social networks

– Ad-hoc networks

– Protein interaction networks

– Genetic networks

Motivation: Why detect communities?

Predict how much someone is going to love a movie based on their movie preferences

Grand Prize: $1,000,000

Outline of the talk

Previous work
Graph partitioning problem
Our approach
Modularity
Reduction
Multilevel graph partitioning
Experimental results
Conclusions

Previous Work

Two main classes:
– Agglomerative methods (addition of edges)
– Divisive methods (removal of edges)

Algorithms based on:
– Laplacian matrix
– Centrality measures
– Flow models
– Random walks
– Resistor networks
– Optimization

Existing methods are either not fast enough or inaccurate

Graph Partitioning Problem

Given a graph G(V, E), find a partition such that:
– The partition is balanced (i.e., the numbers of vertices in all subsets are roughly equal)
– Cut size is minimized (i.e., the number of edges with endpoints in different subsets is minimized)

Previous work:
– Kernighan-Lin algorithm
– Spectral partitioning
– Multilevel algorithms
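
The two criteria in the problem statement above, cut size and balance, can be measured directly. The following is a minimal Python sketch; the function names and the example graph (two triangles joined by one edge) are illustrative assumptions, not part of the talk.

```python
from collections import Counter

def cut_size(edges, part):
    """Number of edges whose endpoints lie in different subsets."""
    return sum(1 for u, v in edges if part[u] != part[v])

def subset_sizes(part):
    """Number of vertices in each subset, used to judge balance."""
    return Counter(part.values())

# Hypothetical example: two triangles joined by the edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
part = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}

print(cut_size(edges, part), dict(subset_sizes(part)))
```

Splitting the example graph along the bridging edge gives a balanced partition whose cut consists of that single edge.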

Kernighan - Lin Algorithm

Find an initial random partition

Improve by a greedy procedure that swaps pairs of vertices from different partitions

Minimize the size of the cut set

[Figure: swapping a pair of vertices u and v between the two parts]
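
The swap-based improvement step can be illustrated with a simplified sketch. Note that this is not the full Kernighan-Lin algorithm: the original also locks swapped vertices and accepts temporarily worsening swaps within a pass, whereas the hypothetical `greedy_swaps` below only applies the best strictly improving swap until none remains. Vertices are assumed to be integers and the graph is a dictionary of neighbor sets.

```python
def cut_size(adj, side):
    """Number of edges with endpoints on different sides."""
    return sum(1 for u in adj for v in adj[u] if u < v and side[u] != side[v])

def swap_gain(adj, side, u, v):
    """Reduction in cut size if u and v (on different sides) trade places."""
    def d(x):
        # External minus internal degree of x.
        ext = sum(1 for y in adj[x] if side[y] != side[x])
        return ext - (len(adj[x]) - ext)
    return d(u) + d(v) - 2 * (1 if v in adj[u] else 0)

def greedy_swaps(adj, side):
    """Repeatedly apply the best improving swap (simplified KL-style pass)."""
    side = dict(side)
    while True:
        pairs = [(swap_gain(adj, side, u, v), u, v)
                 for u in adj for v in adj
                 if side[u] == 0 and side[v] == 1]
        gain, u, v = max(pairs)
        if gain <= 0:
            return side
        side[u], side[v] = side[v], side[u]

# Hypothetical example: two triangles joined by the edge (2, 3),
# starting from a poor partition that mixes the triangles.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
start = {0: 0, 1: 0, 3: 0, 2: 1, 4: 1, 5: 1}
result = greedy_swaps(adj, start)
print(cut_size(adj, start), cut_size(adj, result))
```

Because every step is a swap, the sizes of the two parts never change, which is how this style of refinement preserves balance.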

Graph Partitioning vs Graph Clustering

Graph partitioning:
– Minimize cut size
– Equal number of vertices in each subset
– Number of subsets is an input

Graph clustering:
– Find clusters
– Community sizes may differ
– Number of subsets varies

Algorithms for graph partitioning cannot be used directly to produce good-quality clusterings

Our approach

Convert the original graph G into a complete graph G'

Find a min-cut of G' using a modified graph partitioning method

This will produce a good-quality (high-modularity) clustering for G

Modularity

A useful measure of clustering quality

Introduced by Newman [6]

Modularity of a partitioning = (number of edges within communities) – (expected number of such edges)

We are trying to find a division of the graph with high modularity
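
As an illustration of the definition, the sketch below computes modularity in its commonly used normalized form, where both terms are divided by the number of edges m and the expected number of edges is derived from vertex degrees. Choosing this normalization and this degree-based expectation is an assumption made for the example; the slide itself states only the informal difference.

```python
def modularity(edges, community):
    """Normalized modularity: sum over communities of
    (fraction of edges inside) - (fraction of degree held)^2."""
    m = len(edges)
    inside = {}   # community -> number of internal edges
    degree = {}   # community -> sum of vertex degrees
    for u, v in edges:
        cu, cv = community[u], community[v]
        degree[cu] = degree.get(cu, 0) + 1
        degree[cv] = degree.get(cv, 0) + 1
        if cu == cv:
            inside[cu] = inside.get(cu, 0) + 1
    return sum(inside.get(c, 0) / m - (degree[c] / (2 * m)) ** 2
               for c in degree)

# Hypothetical example: two triangles joined by the edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
two_triangles = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
one_block = {v: "a" for v in range(6)}

print(modularity(edges, two_triangles), modularity(edges, one_block))
```

Putting every vertex into a single community yields modularity 0, while separating the two triangles yields a positive value, matching the intuition that the triangles are the communities.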

Reduction

Graph Clustering Problem: The problem of finding a clustering of maximum modularity in G

Min-Cut Problem: The problem of finding a minimum cut in a complete edge-weighted graph G'

Reduction

Graph Clustering Problem: Maximize modularity

Maximize modularity of a partitioning = (number of edges within communities) – (expected number of such edges)

Min-Cut Problem: Minimize cut size

Minimize (– modularity) = (cut size) – (expected cut size)

Edge weights of G':

$$
\mathrm{Weight}(i, j) =
\begin{cases}
1 - p_{ij}, & \text{if } (i, j) \in E(G) \\
-p_{ij}, & \text{if } (i, j) \notin E(G)
\end{cases}
$$
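
The reduction can be checked by brute force on a tiny graph. With weight 1 − p_ij on edges of G and −p_ij on non-edges, the weight of a cut in G' equals (cut size) − (expected cut size), and it differs from the negated, unnormalized modularity only by a constant that does not depend on the partition. The sketch below is an illustration under stated assumptions: a hypothetical six-vertex graph, bisections only, and a single probability p for every pair.

```python
from itertools import combinations, product

# Hypothetical example graph: two triangles joined by the edge (2, 3).
n = 6
edges = {(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)}
p = len(edges) / (n * (n - 1) / 2)  # assumed: same probability for every pair

def weight(i, j):
    """Weight of the pair (i, j), i < j, in the complete graph G'."""
    return (1 - p) if (i, j) in edges else -p

def cut_weight(side):
    """G' weight of the split pairs: cut size - expected cut size."""
    return sum(weight(i, j) for i, j in combinations(range(n), 2)
               if side[i] != side[j])

def within_weight(side):
    """Edges inside the parts minus their expected number."""
    return sum(weight(i, j) for i, j in combinations(range(n), 2)
               if side[i] == side[j])

# Constant that links the two objectives for every partition.
total = sum(weight(i, j) for i, j in combinations(range(n), 2))

# Enumerate every bisection with two non-empty parts.
bisections = [s for s in product((0, 1), repeat=n) if 0 < sum(s) < n]
best = min(bisections, key=cut_weight)
print(best, round(cut_weight(best), 3))
```

Since `cut_weight(s) + within_weight(s)` equals `total` for every bisection `s`, the bisection that minimizes the cut weight in G' is also the one that maximizes the within-community score.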

Random Graph Models

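pij: the probability that there is an edge between vertices i and j in a random graph from a given distribution

$$
\mathrm{Weight}(i, j) =
\begin{cases}
1 - p_{ij}, & \text{if } (i, j) \in E(G) \\
-p_{ij}, & \text{if } (i, j) \notin E(G)
\end{cases}
$$

Erdős-Rényi model:

$$
p_{ij} = \frac{m}{\binom{n}{2}}
$$

Chung-Lu model:

$$
p_{ij} = \frac{d_i d_j}{2m}
$$

Here n is the number of vertices, m is the number of edges, and d_i is the degree of vertex i.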
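
Reading the two models as p_ij = m / C(n, 2) for Erdős-Rényi and p_ij = d_i d_j / (2m) for Chung-Lu, with n vertices, m edges, and d_i the degree of vertex i, the probabilities can be computed as in the following sketch. The function names and the example degree sequence are assumptions made for illustration.

```python
def p_erdos_renyi(n, m):
    """Same probability for every pair: m edges spread over all n*(n-1)/2 pairs."""
    return m / (n * (n - 1) / 2)

def p_chung_lu(deg, i, j, m):
    """Pair probability proportional to the product of the two degrees."""
    return deg[i] * deg[j] / (2 * m)

# Hypothetical example: degrees of two triangles joined by the edge (2, 3).
deg = [2, 2, 3, 3, 2, 2]
n, m = len(deg), sum(deg) // 2

print(p_erdos_renyi(n, m), p_chung_lu(deg, 2, 3, m))
```

Under the Erdős-Rényi reading the expected number of edges over all pairs is m. Under the Chung-Lu reading, summing d_i d_j / (2m) over j (including j = i) gives d_i, so each vertex keeps its degree in expectation.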

Multilevel graph partitioning

A fast and accurate method for producing high-quality partitions

Consists of three phases:
– Coarsening phase
– Partitioning phase
– Uncoarsening and refinement phase

[Figure: the multilevel scheme, with coarsening, partitioning, and uncoarsening phases]

Coarsening Phase

Find a maximal matching and collapse each matched edge into a single vertex

Recursive coarsening: G = G1, G2, …, Gk
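
One coarsening level can be sketched as follows: build a maximal matching greedily, then merge each matched pair into one coarse vertex and add up the weights of parallel edges. This is a minimal illustration, not the METIS implementation; the function name `coarsen`, the dictionary-based graph layout, and the choice of the first unmatched neighbor are assumptions, and vertex weights are not tracked.

```python
def coarsen(adj):
    """One coarsening level.

    adj: dict vertex -> dict neighbor -> edge weight (symmetric, no self-loops).
    Returns (coarse_adj, cmap), where cmap maps each fine vertex to its
    coarse vertex.
    """
    cmap = {}
    next_id = 0
    for u in adj:
        if u in cmap:
            continue
        cmap[u] = next_id
        for v in adj[u]:
            if v not in cmap:
                cmap[v] = next_id  # match u with its first unmatched neighbor
                break
        next_id += 1

    coarse = {c: {} for c in range(next_id)}
    for u in adj:
        for v, w in adj[u].items():
            cu, cv = cmap[u], cmap[v]
            if cu != cv:  # edges inside a collapsed pair disappear
                coarse[cu][cv] = coarse[cu].get(cv, 0) + w
    return coarse, cmap

# Hypothetical example: the path 0 - 1 - 2 - 3 with unit edge weights.
adj = {0: {1: 1}, 1: {0: 1, 2: 1}, 2: {1: 1, 3: 1}, 3: {2: 1}}
coarse, cmap = coarsen(adj)
print(coarse, cmap)
```

Applying `coarsen` repeatedly to its own output produces the sequence of progressively smaller graphs used by the multilevel scheme.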

Partitioning Phase

Greedy graph growing partitioning

Partition Gk

Uncoarsening and Refinement Phase

Project the partitioning Pi of Gi to a partitioning Pi-1 of Gi-1

Gi-1 is finer than Gi, so it has more degrees of freedom

Improve Pi-1 using the KL algorithm
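
Projection assigns every vertex of the finer graph the part of the coarse vertex it was collapsed into. A minimal sketch, assuming the coarsening step recorded a mapping from fine to coarse vertices (called `cmap` here, a hypothetical name):

```python
def project(coarse_part, cmap):
    """Give each fine vertex the part of its coarse vertex."""
    return {v: coarse_part[c] for v, c in cmap.items()}

# Hypothetical example: fine vertices 0, 1 were merged into coarse vertex 0,
# and fine vertices 2, 3 into coarse vertex 1.
cmap = {0: 0, 1: 0, 2: 1, 3: 1}
coarse_part = {0: 0, 1: 1}
fine_part = project(coarse_part, cmap)
print(fine_part)
```

The projected partitioning is only a starting point: vertices that were locked together in the coarse graph can now be moved individually by the refinement step.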

Implementation

Our implementation is based on the graph partitioning package METIS [3] that employs a multilevel strategy

Convert the graph partitioning algorithm into a clustering one:
– The optimal clustering might not be balanced. We ignore the restrictions that control the sizes of the parts.
– The number of parts in the optimal clustering is not known. We employ a recursive bisection procedure.
– The original graph G might be sparse, while the transformed one G' is complete. Our algorithm does not explicitly generate G'.

Modularity: Erdos - Renyi Model

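For a bisection with n1 and n2 vertices in the two parts:

$$
-\mathrm{Modularity} = \text{cut size} - n_1 n_2\, p,
\qquad p = p_{ij} = \frac{m}{\binom{n}{2}} \quad \text{(Erdős-Rényi model)}
$$

After moving one vertex from part 2 to part 1:

$$
(-\mathrm{Modularity})' = \text{cut size}' - (n_1 + 1)(n_2 - 1)\, p
$$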

Modularity: Chung - Lu Model

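For a bisection with degree sums w1 and w2 in the two parts:

$$
-\mathrm{Modularity} = \text{cut size} - \frac{w_1 w_2}{2m}
$$

After moving a vertex v from part 2 to part 1:

$$
(-\mathrm{Modularity})' = \text{cut size}' - \frac{(w_1 + w(v))(w_2 - w(v))}{2m}
$$

wi: sum of degrees in part i; w(v): degree of the moved vertex v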
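
These update formulas mean the objective can be maintained during refinement without ever building the complete graph G': moving a vertex only requires looking at its neighbors and at the two degree sums. The sketch below illustrates this for the Chung-Lu form and checks the incremental value against a full recomputation. The function names and the example graph are assumptions, and the graph is unweighted with integer vertices.

```python
def objective(adj, side):
    """(-modularity) of a bisection: cut size - w1 * w2 / (2m)."""
    m = sum(len(nb) for nb in adj.values()) // 2
    cut = sum(1 for u in adj for v in adj[u] if u < v and side[u] != side[v])
    w = [0, 0]
    for u in adj:
        w[side[u]] += len(adj[u])
    return cut - w[0] * w[1] / (2 * m)

def move_delta(adj, side, v, w, m):
    """Change in the objective if v moves to the other part.

    w holds the current degree sums [w1, w2]; only v's neighbors are visited.
    """
    src, dst = side[v], 1 - side[v]
    d = len(adj[v])
    same = sum(1 for u in adj[v] if side[u] == src)
    cut_delta = same - (d - same)  # internal edges become cut, cut edges internal
    old_expected = w[src] * w[dst] / (2 * m)
    new_expected = (w[src] - d) * (w[dst] + d) / (2 * m)
    return cut_delta - (new_expected - old_expected)

# Hypothetical example: two triangles joined by the edge (2, 3).
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
side = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
m = sum(len(nb) for nb in adj.values()) // 2
w = [sum(len(adj[u]) for u in adj if side[u] == s) for s in (0, 1)]

moved = dict(side)
moved[2] = 1
print(move_delta(adj, side, 2, w, m), objective(adj, moved) - objective(adj, side))
```

The two printed numbers agree up to floating-point rounding, which is the property that lets a KL-style refinement evaluate candidate moves cheaply on the sparse graph G.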

Analysis

Time complexity: O(n + m)

Experiments:
– Random graphs
– k-community graphs
– nd.edu

Experiment I: Random Graphs

We generated random graphs with 128 vertices and 4 communities of size 32 each

The expected degree of any vertex is 16

The out-degree (the expected number of edges from a vertex to vertices in other communities) varies
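
A graph of this kind can be generated with a planted-partition sketch like the one below. The parameterization is an assumption, not taken from the slides: each vertex is given an expected `z_out` edges to other communities and `16 - z_out` edges inside its own, which fixes the two edge probabilities. The function name and the seed handling are likewise illustrative.

```python
import random

def planted_partition(z_out, seed=0, n=128, k=4, degree=16):
    """Random graph with k equal communities and a fixed expected degree.

    z_out: expected number of edges from a vertex to other communities
    (assumed parameterization).
    """
    rng = random.Random(seed)
    size = n // k
    p_in = (degree - z_out) / (size - 1)   # probability inside a community
    p_out = z_out / (n - size)             # probability across communities
    community = {v: v // size for v in range(n)}
    edges = [(u, v)
             for u in range(n) for v in range(u + 1, n)
             if rng.random() < (p_in if community[u] == community[v] else p_out)]
    return edges, community

edges, community = planted_partition(z_out=6)
avg_degree = 2 * len(edges) / len(community)
print(len(edges), avg_degree)
```

Raising `z_out` blurs the community structure, so sweeping it is a natural way to see where a clustering algorithm starts to fail.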

Experiment II: k-community graphs

We generated graphs with k communities

The size of each community is 100

The expected number of edges inside a community equals the expected number of edges leaving the community

The probability of an edge inside a community varies between 0.5 and 0.1

The results show that the graphs are clustered approximately 99% correctly

Experiment III: nd.edu

The data consists of the complete map of the nd.edu domain, which contains 325,729 documents and 1,090,108 links

Our algorithm clusters this graph into 280 clusters with modularity 0.925579

This high modularity indicates strong community structure in the graph

We show the dendrogram generated by our algorithm.

The sizes of the rectangles are proportional to the sizes of the communities.

Conclusions

Community structure detection problem
A scalable algorithm
Based on multilevel graph partitioning
Uses modularity as a quality measure