SpeakEasy: Algorithm for Robust Community Detection Prof. Boleslaw Szymanski Chris Gaiteri, Mingming...

20
SpeakEasy: Algorithm for Robust Community Detection Prof. Boleslaw Szymanski hris Gaiteri, Mingming Chen, Konstantin Kuzm Frontiers of Network Science

Transcript of SpeakEasy: Algorithm for Robust Community Detection Prof. Boleslaw Szymanski Chris Gaiteri, Mingming...

Page 1: SpeakEasy: Algorithm for Robust Community Detection Prof. Boleslaw Szymanski Chris Gaiteri, Mingming Chen, Konstantin Kuzmin Frontiers of Network Science.

SpeakEasy: Algorithm for Robust Community Detection

Prof. Boleslaw SzymanskiChris Gaiteri, Mingming Chen, Konstantin Kuzmin

Frontiers of Network Science

Page 2: SpeakEasy: Algorithm for Robust Community Detection Prof. Boleslaw Szymanski Chris Gaiteri, Mingming Chen, Konstantin Kuzmin Frontiers of Network Science.

2

Discovering Communities in Social & Bio-networksClustering implies modularityFunctional modularity imposes natural boundary lines between communities.

Discovering community structure uncovers functionality

Bio (left) and social (right) networks are driven by functionality

Page 3: SpeakEasy: Algorithm for Robust Community Detection Prof. Boleslaw Szymanski Chris Gaiteri, Mingming Chen, Konstantin Kuzmin Frontiers of Network Science.

Motivation for Specialized Algorithms

Biological and social networks have high level of noise and therefore have incorrect or missing links

Biological or social functions are accomplished by communities of interacting molecules/cells or people

Membership in these communities may overlap when humans or biological components are involved in multiple functions

Addition of noise &

unclustered links

Multi-community nodesRed dot = connection between nodes

3

Page 4: SpeakEasy: Algorithm for Robust Community Detection Prof. Boleslaw Szymanski Chris Gaiteri, Mingming Chen, Konstantin Kuzmin Frontiers of Network Science.

• An extension of the Label Propagation Algorithm (LPA) in which nodes send their labels to neighbors and most popular label is retained. All nodes left with the same label represent community

• SLPA mimics pairwise interactions between nodes (functional in bio-network, social in social networks)

• Each node broadcasts a label to its neighbors and at the same time receives a label from each of its neighbors

4

• Each node has a memory of received labels which are taken into account in the next round of broadcasting

• Linear time complexity O(m)) in the number of edges

SLPA/GANXiS Community Detection Algorithm

J. Xie, B. Szymanski, Proc. IEEE Network Science Workshop, pp. 138 - 143. (2013).

Page 5: SpeakEasy: Algorithm for Robust Community Detection Prof. Boleslaw Szymanski Chris Gaiteri, Mingming Chen, Konstantin Kuzmin Frontiers of Network Science.

5

J. Xie, S. Kelley, B. Szymanski, ACM Computing Surveys (2013).

In the end, a label distribution is derived for each node.

a b c0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

b,b

d,c,c

e,b

c

b

b

Select by density

a,enode i

Node i neighbors

a,e,b

Select the most popular

node i

b

SLPA Interaction Rules

Frontiers of Network Science

Page 6: SpeakEasy: Algorithm for Robust Community Detection Prof. Boleslaw Szymanski Chris Gaiteri, Mingming Chen, Konstantin Kuzmin Frontiers of Network Science.

SpeakEasy Algorithm Novelty: Identifies communities using top-down and bottom-up

approaches simultaneously. Specifically, nodes join communities based on their local connections and global information about the network structure.

Label propagation algorithm: each node updates its status to the label found among nodes connected to it which has the greatest specificity, i.e., the actual number of times this label is present in neighboring nodes minus its expected number based on its global frequency.

Consensus clustering: the partition with the highest average adjusted Rand Index among all other partitions is selected as the representative partition to get robust community structure.

Overlapping communities: overlapping communities can be obtained with co-occurrence matrix. Multi-community nodes are selected as nodes which co-occur with more than one of the final clusters with greater than a user-selected threshold. 015).

6Frontiers of Network Science

Page 7: SpeakEasy: Algorithm for Robust Community Detection Prof. Boleslaw Szymanski Chris Gaiteri, Mingming Chen, Konstantin Kuzmin Frontiers of Network Science.

Visual Example of SpeakEasy Clustering

Labels are represented by color tags

Multi-community nodes are tagged with multiple colors

A. Each node is assigned with random unique label (before clustering)

B. Nodes with the same labels belong to the same community (after clustering)

C. Gaiteri, M. Chen, B.K. Szymanski, et al. ,arXiv:1501.04709 (2015)

Frontiers of Network Science

Page 8: SpeakEasy: Algorithm for Robust Community Detection Prof. Boleslaw Szymanski Chris Gaiteri, Mingming Chen, Konstantin Kuzmin Frontiers of Network Science.

Color-coded community ID

Algorithm identifies communities though evolution of common labels.

After a certain number of iterations of label propagation or if none of the nodes updates its labels in the given iteration, nodes with the same label will be clustered into the same community.

However, because the clustering is fast and parameter-free, running the algorithm multiple times, we get an assessment of the robustness of the clusters and the identity of multi-community nodes.

Correlation matrix after clustering

Nod

es

Nodes

Clustering Workflow

C. Gaiteri, M. Chen, B.K. Szymanski, et al. ,arXiv:1501.04709 (2015)

Frontiers of Network Science

Page 9: SpeakEasy: Algorithm for Robust Community Detection Prof. Boleslaw Szymanski Chris Gaiteri, Mingming Chen, Konstantin Kuzmin Frontiers of Network Science.

Individual clustering results look pretty good (dense within-community clusters, and not many between-community links.)

However, how robust are these clusters?

One way to test cluster robustness is to resample the data, rebuild the clusters, and compare them to the original, or to other clusters built by resampling.

For example, how similar are the clusters from a resampled dataset?

The sample with the highest average adjusted Rand Index among all other samples is selected as the representative sample to get robust communities.

Clustered node ADJ #1

Nodes

Nodes

Nod

esN

odes

???Clustered node ADJ #2

Identifying Robust Clusters

9

C. Gaiteri, M. Chen, B.K. Szymanski, et al. ,arXiv:1501.04709 (2015)

Frontiers of Network Science

Page 10: SpeakEasy: Algorithm for Robust Community Detection Prof. Boleslaw Szymanski Chris Gaiteri, Mingming Chen, Konstantin Kuzmin Frontiers of Network Science.

Run SpeakEasy multiple times (e.g. 100x).

For all pairs of nodes (i, j) the “co-occurrence” matrix records number of times they land in same cluster.

This is useful for both identifying robust clusters and for finding nodes that link multiple communities together.

Co-occurrence matrix

Clusters in this matrix show nodes that cluster across many initial conditions

Strong non-clustered/ off-diagonal elements show multi-community nodes

frac

tion

of re

peat

co-

clus

terin

gs

Nod

es

Nodes

Identifying Multi-community Nodes

C. Gaiteri, M. Chen, B.K. Szymanski, et al. ,arXiv:1501.04709 (2015)

Frontiers of Network Science

Page 11: SpeakEasy: Algorithm for Robust Community Detection Prof. Boleslaw Szymanski Chris Gaiteri, Mingming Chen, Konstantin Kuzmin Frontiers of Network Science.

Performance on Real-world Networks SpeakEasy shows improved performance on 6/15 networks using

the modularity (Q) metric, over other algorithms.

SpeakEasy performs better than GANXiS on 14/15 of the networks with a mean percent difference of 28% over GANXiS.

Comparison of the quality of community structures detected with GANXiS and SpeakEasy on 15 real-world networks using modularity (Q) and modularity density (Qds).

11

Page 12: SpeakEasy: Algorithm for Robust Community Detection Prof. Boleslaw Szymanski Chris Gaiteri, Mingming Chen, Konstantin Kuzmin Frontiers of Network Science.

Performance on LFR Benchmark (Disjoint) SpeakEasy can accurately identify disjoint clusters on LFR

benchmarks, even when these clusters are obscured by cross-linking, which simulates the effect of noise in typical datasets.

SpeakEasy shows the highest-yet accuracy in community detection based on various community quality metrics, especially for highly cross-linked clusters.

The LFR benchmarks track cluster recovery as networks become increasingly cross-linked (as μ increases) .

Page 13: SpeakEasy: Algorithm for Robust Community Detection Prof. Boleslaw Szymanski Chris Gaiteri, Mingming Chen, Konstantin Kuzmin Frontiers of Network Science.

Robust clustering performance with various cluster size distributions and intra-cluster degree distributions. (A) Various disjoint cluster recovery metrics for networks from LFR benchmarks with n=1000, (cluster size distribution) =3, (within-cluster degree distribution) =2. (B) Disjoint cluster recovery metrics for networks from LFR benchmarks with n=1000, =3, =1 (C) Disjoint cluster recovery metrics for networks from LFR benchmarks with n=1000, =2, =2.

Performance on LFR Benchmark (Disjoint)

Frontiers of Network Science

Page 14: SpeakEasy: Algorithm for Robust Community Detection Prof. Boleslaw Szymanski Chris Gaiteri, Mingming Chen, Konstantin Kuzmin Frontiers of Network Science.

Performance on LFR Benchmark (Overlapping)

SpeakEasy shows excellent performance on identifying multi-community nodes tied to various number of communities (controlled by Om) on LFR benchmarks.

Recovery of true clusters quantified by NMI as a function of μ (cross-linking between clusters) and Om (number of communities associated with each multi-community node).

Frontiers of Network Science

Page 15: SpeakEasy: Algorithm for Robust Community Detection Prof. Boleslaw Szymanski Chris Gaiteri, Mingming Chen, Konstantin Kuzmin Frontiers of Network Science.

15

Performance on LFR Benchmark (Overlapping)

F(multi)-score is the standard F-score, but specifically applied for detection of correct community associations of multi-community nodes, calculated at various values of Om and different average connectivity levels (D=10,20).

Frontiers of Network Science

Page 16: SpeakEasy: Algorithm for Robust Community Detection Prof. Boleslaw Szymanski Chris Gaiteri, Mingming Chen, Konstantin Kuzmin Frontiers of Network Science.

Application to Protein-protein Interaction Datasets

A. The high throughput interaction dataset from Gavin et al. has nodes colored according

to protein complexes found in the Saccharomyces Genome Database (SGD).

B. The communities identified with SpeakEasy on the high throughput interaction dataset

from Gavin et al.

C. Gaiteri, M. Chen, B.K. Szymanski, et al., arXiv:1501.04709 (2015)

Page 17: SpeakEasy: Algorithm for Robust Community Detection Prof. Boleslaw Szymanski Chris Gaiteri, Mingming Chen, Konstantin Kuzmin Frontiers of Network Science.

Application to Cell-type Clustering

Primary and secondary biological classifications of immune cell types are reflected in primary and secondary clusters.

Frontiers of Network Science

Page 18: SpeakEasy: Algorithm for Robust Community Detection Prof. Boleslaw Szymanski Chris Gaiteri, Mingming Chen, Konstantin Kuzmin Frontiers of Network Science.

Application to Resting-state fMRI Data

A. Raw correlation matrices between resting state brain activity from control and Parkinson disease cohorts.

B. Co-occurrence matrices for controls and Parkinson disease cohorts.

Frontiers of Network Science

Page 19: SpeakEasy: Algorithm for Robust Community Detection Prof. Boleslaw Szymanski Chris Gaiteri, Mingming Chen, Konstantin Kuzmin Frontiers of Network Science.

Brain region communities detected from control subject resting-state fMRI. The order of communities 1-6 corresponds to the order of communities shown in the figure before.

Location of brain regions in each cluster was/were visualized with the BrainNet Viewer

Page 20: SpeakEasy: Algorithm for Robust Community Detection Prof. Boleslaw Szymanski Chris Gaiteri, Mingming Chen, Konstantin Kuzmin Frontiers of Network Science.

Thank You

Questions?

Frontiers of Network Science