Danai Koutra – CMU/Technicolor Researcher, Carnegie Mellon University at MLconf ATL

Carnegie Mellon University. Making Sense of Large Graphs: Summarization and Similarity. MLconf ’14, Atlanta, GA. Danai Koutra, Computer Science Department, Carnegie Mellon University. [email protected] http://www.cs.cmu.edu/~dkoutra

description

Networks naturally capture a host of real-world interactions, from friendships to brain activity. But given a massive graph, like the Facebook social graph, what can be said about its structure? Which are its most important structures? How does it compare to other networks, like Twitter? This talk will focus on my work developing scalable algorithms and models that help us make sense of large graphs via pattern discovery and similarity analysis. I will begin by presenting VoG, an approach that efficiently summarizes large graphs by finding their most interesting and semantically meaningful structures. Starting from a clutter of millions of nodes and edges, such as the Enron who-mails-whom graph, our Minimum Description Length-based algorithm disentangles the complex graph connectivity and spotlights the structures that ‘best’ describe the graph. Then, for similarity analysis at the graph level, I will introduce the problems of graph comparison and graph alignment. I will conclude by showing how to apply my methods to temporal anomaly detection, brain graph clustering, deanonymization of bipartite (e.g., user-group membership) and unipartite graphs, and more.

Transcript

Page 1

Making Sense of Large Graphs: Summarization and Similarity

MLconf ’14, Atlanta, GA

Danai Koutra, Computer Science Department, Carnegie Mellon University

[email protected] http://www.cs.cmu.edu/~dkoutra

Page 2

Making sense of large graphs

Scalable algorithms and models for understanding massive graphs.

[Figure labels: Human Connectome Project; >1.25B users!]

Page 3

Understanding Large Graphs

Part 1: Summarization

Page 4

Ever tried visualizing a large graph?

79,870 email accounts, 288,364 emails

Page 6

After this talk, you’ll know how to find…

VoG: Top-3 Stars

[email protected]

[email protected]

Page 7

Enron Summary

VoG: Top Near-Bipartite Core

Ski excursion: organizers, participants

“Affair”: commenters, CC’ed

Page 8

Problem Definition

Given: a graph

Find: a succinct summary with possibly overlapping subgraphs ≈ important graph structures.

[Figure label: Lady Gaga Fan Club]

[Koutra, Kang, Vreeken, Faloutsos. SDM’14]

Page 9

Main Ideas

Idea 1: Use well-known structures (vocabulary).

Idea 2: Best graph summary ⇒ optimal compression (MDL), i.e., the shortest lossless description.

Page 10

Minimum Description Length

BACKGROUND

$\min \; L(M) + L(D \mid M)$

$L(M)$: # bits for the model $M$

$L(D \mid M)$: # bits for the data using $M$ (the errors)

Example models: $a_1 x + a_0$ versus $a_{10} x^{10} + a_9 x^9 + \dots + a_0$

~Occam’s razor: simple & good explanations

Page 11

Formally: Minimum Graph Description

Given:
- a graph G
- a vocabulary Ω

Find: the model M that minimizes $L(G, M) = L(M) + L(E)$

[Figure labels: Adjacency A, Model M, Error E]
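
The two-part cost L(M) + L(E) can be illustrated with a toy encoding. The sketch below is not VoG's actual encoding: it assumes, purely for illustration, that every node ID costs log2(n) bits, that the model is a list of cliques (each a list of node IDs), and that every node pair on which model and adjacency disagree costs two node IDs. The function name `description_length` is hypothetical.

```python
import math
from itertools import combinations


def description_length(n, edges, cliques):
    """Toy two-part cost L(M) + L(E) in bits for a graph on nodes 0..n-1.

    Assumed encoding (illustrative only): a node ID costs log2(n) bits,
    a clique is stored as the list of its node IDs, and an error is an
    unordered node pair where the model and the real graph disagree.
    """
    bits_per_node = math.log2(n)
    edges = {frozenset(e) for e in edges}

    # Edges the model claims exist: every pair inside each clique.
    modeled = set()
    for clique in cliques:
        modeled.update(frozenset(p) for p in combinations(clique, 2))

    model_bits = sum(len(c) * bits_per_node for c in cliques)  # L(M)
    errors = modeled ^ edges                                   # disagreements
    error_bits = len(errors) * 2 * bits_per_node               # L(E)
    return model_bits + error_bits


# A 4-clique on nodes 0..3 plus one extra edge (3, 4), in a graph of 8 nodes.
edges = list(combinations(range(4), 2)) + [(3, 4)]
empty_model = description_length(8, edges, cliques=[])
clique_model = description_length(8, edges, cliques=[[0, 1, 2, 3]])
```

Under these assumptions, naming the clique once is cheaper than listing its six edges as errors, so the model with the clique has the shorter description.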

Page 12

VoG: Overview

[Figure labels: argmin; ≈?]

Page 13

VoG: Overview

Pick best (with some criterion)

Summary

Page 14

Q: Which structures to pick?

A: Those that minimize the description length of G.

With a candidate set S, there are $2^{|S|}$ combinations.
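
Because trying every subset of the candidates is exponential, some heuristic is needed. The sketch below shows one generic option, a greedy selection; it is an assumption for illustration, since the slides only say "pick best (with some criterion)". The names `greedy_select` and `toy_cost` and the bit costs are hypothetical.

```python
def greedy_select(candidates, cost):
    """Greedily build a summary: repeatedly add the candidate structure that
    lowers cost(summary) the most, and stop when no candidate helps.

    `cost` maps a list of structures to a description length in bits
    (a hypothetical stand-in for L(M) + L(E)).
    """
    summary = []
    remaining = list(candidates)
    best = cost(summary)
    while remaining:
        # Cost of the summary if each remaining candidate were added.
        trial_costs = [(cost(summary + [c]), i) for i, c in enumerate(remaining)]
        new_best, idx = min(trial_costs)
        if new_best >= best:
            break  # no candidate shortens the description any further
        best = new_best
        summary.append(remaining.pop(idx))
    return summary, best


# Toy cost: each structure costs 10 bits; each unexplained edge costs 4 bits.
all_edges = set(range(12))                # 12 edges, identified by integer IDs
candidates = [frozenset(range(0, 6)),     # explains edges 0..5
              frozenset(range(4, 8)),     # explains edges 4..7 (overlaps the first)
              frozenset({11})]            # explains a single edge


def toy_cost(summary):
    explained = set().union(*summary)
    return 10 * len(summary) + 4 * len(all_edges - explained)


summary, bits = greedy_select(candidates, toy_cost)
```

With these made-up costs, only the first structure pays for itself; the overlapping second structure explains too few new edges to justify its 10 bits.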

Page 15

Runtime

VoG is near-linear in the number of edges of the input graph.

1.25B users!

Page 16

Understanding a wiki graph

Nodes: wiki editors

Edges: co-edited

I don’t see anything! :(

Page 17

Wiki Controversial Article

Stars: admins, bots, heavy users

Bipartite cores: edit wars (Kiev vs. Kyiv; vandals vs. admins)

Page 18

VoG vs. other methods

Methods compared: VoG; Bounded-Error Summarization [Navlakha+’08]; Motif Simplification [Dunne+’13]; Clustering Methods; Cross-Associations [Chakrabarti+’03]

Property              | VoG | Bounded-Error Summarization | Motif Simplification | Clustering Methods | Cross-Associations
Variety of Structures | ✔   | ✗                           | ✗                    | ✗                  | ✗
Important Structures  | ✔   | ✗                           | ✗                    | ✗                  | ✗
Low Complexity        | ✔   | ✗                           | ✗                    | ✔(?)               | ✔
Visualization         | ✔   | ✔                           | ✔                    | ✗                  | ✗
Graph Summary         | ✔   | ✔                           | ✔                    | ✗                  | ✗

Slide annotation: stars, cliques, near-cliques

Page 19

VoG: summary

•  Focus on important,
•  possibly overlapping structures
•  with known graph-theoretic properties

www.cs.cmu.edu/~dkoutra/SRC/vog.tar

Page 20

Understanding Large Graphs

Part 2: Similarities

Page 21

friendship graph ≈ wall posts graph?

1. Behavioral Patterns

Friendship graph vs. wall posts graph: are the graphs / behaviors similar?

Page 22

Why graph similarity?

2. Classification

3. Temporal anomaly detection

Day 1, Day 2, Day 3, Day 4: consecutive days are compared, giving sim1, sim2, sim3

4. Intrusion detection

Page 23

Problem Definition: Graph Similarity

•  Given: (i) 2 graphs, $G_A$ and $G_B$, with the same nodes and different edge sets; (ii) node correspondence

•  Find: similarity score $s \in [0, 1]$

Page 24

Obvious solution?

Edge Overlap (EO): # of common edges (normalized or not)

[Figure: graphs $G_A$ and $G_B$]

Page 25

… but “barbell”…

EO(B10, mB10) == EO(B10, mmB10)

[Figure: $G_A$ vs. $G_B$, and $G_A$ vs. $G_B'$]
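
A small sketch of why edge overlap cannot separate the two barbell variants. Two things here are assumptions, not taken from the slides: the normalization (common edges divided by edges in the union) and the reading of the two variants as "one edge removed inside a clique" versus "the bridge edge removed". The helper names `edge_overlap` and `barbell` are hypothetical.

```python
from itertools import combinations


def edge_overlap(edges_a, edges_b, normalized=True):
    """Edge overlap between two undirected graphs on the same node set.

    Returns the number of common edges, or (if normalized) that count divided
    by the number of edges in the union, one possible normalization.
    """
    a = {frozenset(e) for e in edges_a}
    b = {frozenset(e) for e in edges_b}
    common = len(a & b)
    if not normalized:
        return common
    union = len(a | b)
    return common / union if union else 1.0


def barbell(k):
    """Two k-cliques (nodes 0..k-1 and k..2k-1) joined by the bridge (k-1, k)."""
    left = list(combinations(range(k), 2))
    right = list(combinations(range(k, 2 * k), 2))
    return left + right + [(k - 1, k)]


b10 = barbell(5)                                       # 10 nodes, 21 edges
minus_clique_edge = [e for e in b10 if e != (0, 1)]    # drop an edge inside a clique
minus_bridge = [e for e in b10 if e != (4, 5)]         # drop the bridge: graph disconnects

eo_clique = edge_overlap(b10, minus_clique_edge)
eo_bridge = edge_overlap(b10, minus_bridge)
```

Both comparisons share 20 of 21 edges, so the scores are identical even though removing the bridge changes the connectivity far more.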

Page 26

What makes a similarity function good?

•  Properties:
   - Intuitive: properties like “Edge-importance”

Page 27

What makes a similarity function good?

•  Properties:
   - Intuitive: properties like “Weight-awareness”
   - Scalable

Page 28

MAIN IDEA: DELTACON

DETAILS

①  Find the pairwise node influence, $S_A$ & $S_B$.
②  Find the similarity between $S_A$ & $S_B$.

[Figure: matrices $S_A$ and $S_B$]

Page 29

How? Using Belief Propagation

INTUITION

Attenuating Neighboring Influence for small ε:

$S = [I + \varepsilon^2 D - \varepsilon A]^{-1} \approx [I - \varepsilon A]^{-1} = I + \varepsilon A + \varepsilon^2 A^2 + \dots$

The term $\varepsilon A$ covers 1-hop neighbors, $\varepsilon^2 A^2$ covers 2-hop neighbors, and so on.

Note: $\varepsilon > \varepsilon^2 > \dots$, $0 < \varepsilon < 1$
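
A minimal sketch of the node-influence matrix from this slide, assuming A is a symmetric 0/1 adjacency matrix and D the diagonal matrix of node degrees (the slide does not define them). It inverts the matrix directly with Gauss-Jordan elimination, which is only sensible for tiny graphs; the function name `affinity_matrix` is hypothetical.

```python
def affinity_matrix(adj, eps):
    """Return S = [I + eps^2 * D - eps * A]^(-1) for an adjacency matrix
    given as a list of lists, using plain Gauss-Jordan elimination."""
    n = len(adj)
    deg = [sum(row) for row in adj]
    # Build the augmented matrix [M | I] with M = I + eps^2 D - eps A.
    aug = []
    for i in range(n):
        row = [-eps * adj[i][j] for j in range(n)]
        row[i] += 1.0 + eps * eps * deg[i]
        aug.append(row + [1.0 if j == i else 0.0 for j in range(n)])
    for col in range(n):
        # Partial pivoting for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0.0:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


# Path graph 0 - 1 - 2: node 1 is one hop from node 0, node 2 is two hops away.
path = [[0, 1, 0],
        [1, 0, 1],
        [0, 1, 0]]
eps = 0.1
S = affinity_matrix(path, eps)
# S[0][1] (1 hop) is close to eps, and S[0][2] (2 hops) is close to eps**2.
```

The influence of node 0 on its direct neighbor is roughly ε, and on the node two hops away roughly ε², matching the attenuation in the series expansion.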

Page 30

OUR SOLUTION: DELTACON

DETAILS

①  Find the pairwise node influence, $S_A$ & $S_B$.
②  Find the similarity between $S_A$ & $S_B$.

$\mathrm{sim}(S_A, S_B) = \dfrac{1}{1 + \sqrt{\sum_{i,j} \left( \sqrt{s_{A,ij}} - \sqrt{s_{B,ij}} \right)^2}}$

The distance in the denominator is the “Root” Euclidean Distance.

[Figure: matrices $S_A$ and $S_B$]
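
Step ② can be sketched as follows, assuming the two influence matrices are already available as equally sized lists of lists with non-negative entries. The function name `rooted_similarity` and the example matrices are made up for illustration.

```python
import math


def rooted_similarity(s_a, s_b):
    """sim = 1 / (1 + d), where d is the "root" Euclidean distance
    sqrt(sum_ij (sqrt(s_a[i][j]) - sqrt(s_b[i][j]))^2).

    Assumes both matrices have the same shape and non-negative entries.
    """
    total = 0.0
    for row_a, row_b in zip(s_a, s_b):
        for x, y in zip(row_a, row_b):
            total += (math.sqrt(x) - math.sqrt(y)) ** 2
    return 1.0 / (1.0 + math.sqrt(total))


# Made-up 2x2 influence matrices, only to exercise the formula.
S_A = [[1.0, 0.25],
       [0.25, 1.0]]
S_B = [[1.0, 0.0],
       [0.0, 1.0]]

same = rooted_similarity(S_A, S_A)   # identical matrices give distance 0
diff = rooted_similarity(S_A, S_B)
```

Identical matrices give a similarity of exactly 1, and the score decreases toward 0 as the distance grows, so the result stays in the [0, 1] range required by the problem definition.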

Page 31

… but $O(n^2)$ …

Faster? $O(m_1 + m_2)$ in the paper :)

Page 32

Temporal Anomaly Detection

•  Nodes: email accounts of employees
•  Edges: email exchange

Day 1, Day 2, Day 3, Day 4, Day 5: consecutive days are compared, giving sim1, sim2, sim3, sim4
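
Once the similarities between consecutive days are available, unusually low values can be flagged. The sketch below assumes a simple rule that is not from the talk: flag any similarity more than two standard deviations below the mean. The function name `flag_anomalies` and the similarity values are made up.

```python
import statistics


def flag_anomalies(similarities, num_std=2.0):
    """Return indices t where sim_t (similarity of day t and day t+1) falls
    more than num_std standard deviations below the mean similarity.

    The thresholding rule is an assumption made for this sketch.
    """
    mean = statistics.mean(similarities)
    std = statistics.pstdev(similarities)
    threshold = mean - num_std * std
    return [t for t, s in enumerate(similarities) if s < threshold]


# Made-up similarities between consecutive daily graphs.
sims = [0.90, 0.91, 0.89, 0.92, 0.40, 0.90, 0.91, 0.88]
anomalous = flag_anomalies(sims)
```

Here only the sharp drop at index 4 falls below the threshold, which points at the pair of days whose graphs differ the most.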

Page 33

Temporal Anomaly Detection

[Plot: similarity vs. consecutive days; annotation: Feb 4: Lay resigns]

Page 34

Brain-Connectivity Graph Clustering

•  114 brain graphs
   - Nodes: 70 cortical regions
   - Edges: connections

•  Attributes: gender, IQ, age…

Page 35

Brain-Connectivity Graph Clustering

t-test p-value = 0.0057

Page 36

Graph Understanding via …

•  … Summarization …
   - VoG: to spot the important graph structures

•  … Comparison …
   - DeltaCon: to find the similarity between aligned networks
   - BiG-Align, Uni-Align: to align bi/uni-partite graphs efficiently

Page 37

Thank you!

www.cs.cmu.edu/~dkoutra/pub.htm [email protected]

[Diagram labels: Understanding; summarization; similarities]