1 Tools for Large Graph Mining Thesis Committee: Christos Faloutsos Chris Olston Guy Blelloch Jon Kleinberg (Cornell) - Deepayan Chakrabarti


1

Tools for Large Graph Mining

Thesis Committee:

Christos Faloutsos

Chris Olston

Guy Blelloch

Jon Kleinberg (Cornell)

- Deepayan Chakrabarti

2

Introduction

Internet Map [lumeta.com]

Food Web [Martinez ’91]

Protein Interactions [genomebiology.com]

Friendship Network [Moody ’01]

► Graphs are ubiquitous

3

Introduction

What can we do with graphs?

• How quickly will a disease spread on this graph?

“Needle exchange” networks of drug users

[Weeks et al. 2002]

4

Introduction

What can we do with graphs?

• How quickly will a disease spread on this graph?

• Who are the “strange bedfellows”?

• Who are the key people?

• …

► Graph analysis can have great impact

Hijacker network [Krebs ‘01]

“Key” terrorist

5

Graph Mining: Two Paths

Specific applications

• Node grouping

• Viral propagation

• Frequent pattern mining

• Fast message routing

General issues

• Realistic graph generation

• Graph patterns and “laws”

• Graph evolution over time?

6

Our Work

General issues

• Realistic graph generation

• Graph patterns and “laws”

• Graph evolution over time?

Specific applications

• Node grouping

• Viral propagation

• Frequent pattern mining

• Fast message routing

7

Our Work

General issues

• Realistic graph generation

• Graph patterns and “laws”

• Graph evolution over time?

Specific applications

• Node grouping

• Viral propagation

• Frequent pattern mining

• Fast message routing

Node Grouping: Find “natural” partitions and outliers automatically.

Viral Propagation: Will a virus spread and become an epidemic?

Graph Generation: How can we mimic a given real-world graph?

8

Roadmap

Specific applications

• 1 Node grouping

• 2 Viral propagation

3 General issues

• Realistic graph generation

• Graph patterns and “laws”

4 Conclusions

Find “natural” partitions and outliers automatically

Focus of this talk

9

Node Grouping [KDD 04]

[Figure: a Customers × Products matrix, rearranged into Customer Groups × Product Groups]

Simultaneously group customers and products, or, documents and words, or, users and preferences …

10

Node Grouping [KDD 04]

Row and column groups

• need not be along a diagonal, and

• need not be equal in number

[Figure: two example matrices of Customer Groups × Product Groups. Both are fine.]

11

Motivation

Visualization

Summarization

Detection of outlier nodes and edges

Compression, and others…

12

Node Grouping

Desiderata:

1. Simultaneously discover row and column groups

2. Fully Automatic: No “magic numbers”

3. Scalable to large matrices

4. Online: New data should not require full recomputations

13

Closely Related Work

Information Theoretic Co-clustering [Dhillon+/2003]

• Number of row and column groups must be specified

Desiderata:

Simultaneously discover row and column groups

Fully Automatic: No “magic numbers”

Scalable to large graphs

Online

14

Other Related Work

K-means and variants [Pelleg+/2000, Hamerly+/2003]: Do not cluster rows and cols simultaneously

“Frequent itemsets” [Agrawal+/1994]: User must specify “support”

Information Retrieval [Deerwester+/1990, Hoffman/1999]: Choosing the number of “concepts”

Graph Partitioning [Karypis+/1998]: Number of partitions; measure of imbalance between clusters

15

What makes a cross-association “good”?

[Figure: one grouping versus another of the same matrix, each drawn as Row groups × Column groups]

Why is this better?

Good Clustering

1. Similar nodes are grouped together

2. As few groups as necessary

A few, homogeneous blocks

Good Compression implies Good Clustering

16

Main Idea

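Good Compression implies Good Clustering

[Figure: Binary Matrix arranged as Row groups × Column groups]

density $p_i^1$ = % of dots in block i

$\text{Total Encoding Cost} = \sum_i \text{size}_i \cdot H(p_i^1) + \sum_i \left(\text{cost of describing } n_i^1,\ n_i^0 \text{ and groups}\right)$

The first sum is the Code Cost; the second sum is the Description Cost.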

17

Examples

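One row group, one column group: code cost high, description cost low

m row groups, n column groups: code cost low, description cost high

$\text{Total Encoding Cost} = \sum_i \text{size}_i \cdot H(p_i^1) + \sum_i \left(\text{cost of describing } n_i^1,\ n_i^0 \text{ and groups}\right)$

The first sum is the Code Cost; the second sum is the Description Cost.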

18

What makes a cross-association “good”?

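Why is this better?

[Figure: one grouping versus another of the same matrix, each drawn as Row groups × Column groups]

$\text{Total Encoding Cost} = \sum_i \text{size}_i \cdot H(p_i^1) + \sum_i \left(\text{cost of describing } n_i^1,\ n_i^0 \text{ and groups}\right)$

The first sum is the Code Cost; the second sum is the Description Cost. For the better grouping, both the code cost and the description cost are low.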

19

Formal problem statement

Given a binary matrix,

Re-organize the rows and columns into groups, and

Choose the number of row and column groups, to

Minimize the total encoding cost.

20

Formal problem statement

Given a binary matrix,

Re-organize the rows and columns into groups, and

Choose the number of row and column groups, to

Minimize the total encoding cost.

Note: No Parameters
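
The code-cost part of the objective (the sum over blocks of size times entropy of the block density) can be illustrated with a minimal sketch. It assumes NumPy and a small dense 0/1 matrix; the names `binary_entropy`, `code_cost`, `row_groups` and `col_groups` are hypothetical and not taken from the talk, and the description cost is left out.

```python
import numpy as np

def binary_entropy(p):
    """Entropy in bits of a Bernoulli(p) source; H(0) = H(1) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return float(-p * np.log2(p) - (1.0 - p) * np.log2(1.0 - p))

def code_cost(matrix, row_groups, col_groups):
    """Sum over blocks of (block size) * H(block density), in bits."""
    matrix = np.asarray(matrix)
    row_groups = np.asarray(row_groups)
    col_groups = np.asarray(col_groups)
    total = 0.0
    for i in np.unique(row_groups):
        for j in np.unique(col_groups):
            # Sub-matrix of rows in group i and columns in group j.
            block = matrix[np.ix_(row_groups == i, col_groups == j)]
            if block.size == 0:
                continue
            density = block.sum() / block.size
            total += block.size * binary_entropy(density)
    return total

# Toy block-diagonal matrix (illustrative data, not from the talk).
A = np.array([[1, 1, 0, 0],
              [1, 1, 0, 0],
              [0, 0, 1, 1],
              [0, 0, 1, 1]])
one_group = code_cost(A, [0, 0, 0, 0], [0, 0, 0, 0])   # one mixed block
two_groups = code_cost(A, [0, 0, 1, 1], [0, 0, 1, 1])  # four homogeneous blocks
```

On this toy matrix the single mixed block is expensive to encode, while the four homogeneous blocks cost nothing; the description cost is what stops the search from simply using more and more groups.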

21

Algorithms

[Figure: the matrix at successive stages of the search]

k=1, l=2

k=2, l=2

k=2, l=3

k=3, l=3

k=3, l=4

k=4, l=4

k=4, l=5

k = 5 row groups, l = 5 col groups

22

Algorithms

k = 5, l = 5

Start with initial matrix

Find good groups for fixed k and l

Choose better values for k and l

Final cross-association

Lower the encoding cost

23

Fixed k and l

k = 5, l = 5

Start with initial matrix

Find good groups for fixed k and l

Choose better values for k and l

Final cross-association

Lower the encoding cost

24

Fixed k and l

Re-assign: for each row x, re-assign it to the row group which minimizes the code cost

1. Row re-assigns

2. Column re-assigns

3. and repeat …

[Figure: the matrix as Row groups × Column groups, before and after re-assignment]
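
One pass of the row re-assignment step can be sketched as follows. This is an illustration under stated assumptions, not the talk's implementation: it assumes NumPy and a dense 0/1 matrix, keeps the block densities fixed during the pass, and clips densities away from 0 and 1 with a hypothetical `eps` parameter so the logarithms stay finite. The function name `reassign_rows` is also hypothetical. Column re-assignment would be the same procedure applied to the transposed matrix.

```python
import numpy as np

def reassign_rows(matrix, row_groups, col_groups, k, l, eps=1e-9):
    """One pass of row re-assignment for fixed k and l (a sketch)."""
    matrix = np.asarray(matrix)
    row_groups = np.asarray(row_groups)
    col_groups = np.asarray(col_groups)

    # Density of ones in each (row group, column group) block.
    density = np.zeros((k, l))
    for i in range(k):
        for j in range(l):
            block = matrix[np.ix_(row_groups == i, col_groups == j)]
            density[i, j] = block.mean() if block.size else 0.5
    # Assumption of this sketch: clip so log2 never sees exactly 0 or 1.
    density = np.clip(density, eps, 1.0 - eps)

    col_sizes = np.array([(col_groups == j).sum() for j in range(l)])
    new_groups = row_groups.copy()
    for x in range(matrix.shape[0]):
        # Number of ones that row x has inside each column group.
        ones = np.array([matrix[x, col_groups == j].sum() for j in range(l)])
        zeros = col_sizes - ones
        # Bits needed to encode row x using each row group's densities.
        cost = -(ones * np.log2(density)
                 + zeros * np.log2(1.0 - density)).sum(axis=1)
        new_groups[x] = int(np.argmin(cost))
    return new_groups

# Toy example (illustrative data): row 2 starts in the wrong row group.
A = np.array([[1, 1, 0, 0],
              [1, 1, 0, 0],
              [1, 1, 0, 0],
              [0, 0, 1, 1],
              [0, 0, 1, 1],
              [0, 0, 1, 1]])
fixed = reassign_rows(A, [0, 0, 1, 1, 1, 1], [0, 0, 1, 1], k=2, l=2)
```

In the toy example, row 2 looks like rows 0 and 1, so encoding it with the densities of group 0 is cheaper and the pass moves it there.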

25

Choosing k and l

k = 5, l = 5

Start with initial matrix

Choose better values for k and l

Final cross-association

Lower the encoding cost

Find good groups for fixed k and l

26

Choosing k and l

Split:

1. Find the most “inhomogeneous” group.

2. Remove the rows/columns which make it inhomogeneous.

3. Create a new group for these rows/columns.

[Figure: the matrix as Row groups × Column groups, before and after the split]

27

Algorithms

k = 5, l = 5

Start with initial matrix

Find good groups for fixed k and l

Choose better values for k and l

Final cross-association

Lower the encoding cost

Re-assigns

Splits

28

Experiments

“Customer-Product” graph with Zipfian sizes, no noise

k = 5 row groups, l = 5 col groups

29

Experiments

“Quasi block-diagonal” graph with Zipfian sizes, noise=10%

k = 6 row groups, l = 8 col groups

30

Experiments

“White Noise” graph: we find the existing spurious patterns

k = 2 row groups, l = 3 col groups

31

Experiments

“CLASSIC”

• 3,893 documents

• 4,303 words

• 176,347 “dots”

Combination of 3 sources:

• MEDLINE (medical)

• CISI (info. retrieval)

• CRANFIELD (aerodynamics)

[Figure: Documents × Words matrix]

32

Experiments

“CLASSIC” graph of documents & words: k=15, l=19

[Figure: Documents × Words matrix after cross-association]

33

Experiments

“CLASSIC” graph of documents & words: k=15, l=19

MEDLINE(medical)

insipidus, alveolar, aortic, death, …

blood, disease, clinical, cell, …

34

Experiments

“CLASSIC” graph of documents & words: k=15, l=19

MEDLINE(medical)

CISI(Information Retrieval)

providing, studying, records, development, …

abstract, notation, works, construct, …

35

Experiments

“CLASSIC” graph of documents & words: k=15, l=19

MEDLINE(medical)

CRANFIELD (aerodynamics)

shape, nasa, leading, assumed, …

CISI(Information Retrieval)

36

Experiments

“CLASSIC” graph of documents & words: k=15, l=19

MEDLINE(medical)

CRANFIELD (aerodynamics)

paint, examination, fall, raise, leave, based, …

CISI(Information Retrieval)

37

Experiments

[Figure: matrix of NSF Grant Proposals × Words in abstract]

“GRANTS”

• 13,297 documents

• 5,298 words

• 805,063 “dots”

38

Experiments

“GRANTS” graph of documents & words: k=41, l=28

[Figure: matrix of NSF Grant Proposals × Words in abstract after cross-association]

39

Experiments

“GRANTS” graph of documents & words: k=41, l=28

The Cross-Associations refer to topics:

• Genetics

encoding, characters, bind, nucleus

40

Experiments

“GRANTS” graph of documents & words: k=41, l=28

The Cross-Associations refer to topics:

• Genetics

• Physics

coupling, deposition, plasma, beam

41

Experiments

“GRANTS” graph of documents & words: k=41, l=28

The Cross-Associations refer to topics:

• Genetics

• Physics

• Mathematics

• …

manifolds, operators, harmonic

42

Experiments

[Plot: Time (secs) vs. Number of “dots”, with curves for Splits and Re-assigns]

Linear in the number of “dots”: Scalable

43

Summary of Node Grouping

Desiderata:

Simultaneously discover row and column groups

Fully Automatic: No “magic numbers”

Scalable to large matrices

Online: New data does not need full recomputation

44

Extensions

We can use the same MDL-based framework for other problems:

1. Self-graphs

2. Detection of outlier edges

45

Extension #1 [PKDD 04]

Self-graphs, such as

• Co-authorship graphs

• Social networks

• The Internet, and the World-wide Web

[Figure: Bipartite graph (Customers, Products) next to a Self-graph (Authors)]

46

Extension #1 [PKDD 04]

Self-graphs

• Rows and columns represent the same nodes

• so row re-assigns affect column re-assigns…

[Figure: Bipartite graph (Customers, Products) next to a Self-graph (Authors)]

47

Experiments

DBLP dataset

• 6,090 authors in:

  • SIGMOD

  • ICDE

  • VLDB

  • PODS

  • ICDT

• 175,494 co-citation or co-authorship links

[Figure: Authors × Authors matrix]

48

Experiments

[Figure: Authors × Authors matrix, rearranged into Author groups × Author groups]

k=8 author groups found

Stonebraker, DeWitt, Carey

49

Extension #2 [PKDD 04]

Outlier edges

• Which links should not exist? (illegal contact/access?)

• Which links are missing? (missing data?)

50

Extension #2 [PKDD 04]

Outliers

• Deviations from “normality”

• Lower quality compression

• Find edges whose removal maximally reduces cost

[Figure: Nodes × Nodes matrix, rearranged into Node Groups × Node Groups, with outlier edges marked]

51

Roadmap

Specific applications

• 1 Node grouping

• 2 Viral propagation

3 General issues

• Realistic graph generation

• Graph patterns and “laws”

4 Conclusions

Will a virus spread and become an epidemic?

52

The SIS (or “flu”) model

(Virus) birth rate β: probability that an infected neighbor attacks

(Virus) death rate δ: probability that an infected node heals

Cured = Susceptible

[Figure: undirected network with nodes N, N1, N2, N3, each either Infected or Healthy; attacks along edges happen with prob. β, and healing with prob. δ]

53

The SIS (or “flu”) model

Competition between virus birth and death

Epidemic or extinction?

• depends on the ratio β/δ

• but also on the network topology

[Figure: example of the effect of network topology (Epidemic or Extinction)]

54

Epidemic threshold

The epidemic threshold τ is the value such that

if β/δ < τ, there is no epidemic,

where β = birth rate, and δ = death rate.

55

Previous models

Question: What is the epidemic threshold?

Answer #1: $1/\langle k \rangle$ [Kephart and White ’91, ’93]

BUT Homogeneity assumption: All nodes have the same degree (but most graphs have power laws)

Answer #2: $\langle k \rangle / \langle k^2 \rangle$ [Pastor-Satorras and Vespignani ’01]

BUT Mean-field assumption: All nodes of the same degree are equally affected (but susceptibility should depend on position in network too)

56

The full solution is intractable!

The full Markov Chain has $2^N$ states, which is intractable, so a simplification is needed.

Independence assumption: Probability that two neighbors are infected = Product of individual probabilities of infection

This is a point estimate of the full Markov Chain.

57

Our model

A non-linear dynamical system (NLDS) which makes no assumptions about the topology

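$1 - p_{i,t} = \left[1 - p_{i,t-1} + \delta\, p_{i,t-1}\right] \cdot \prod_{j=1}^{N} \left(1 - \beta\, A_{ji}\, p_{j,t-1}\right)$

• $p_{i,t}$: Probability of being infected

• $A$: Adjacency matrix

• $1 - p_{i,t}$: Healthy at time t

• $1 - p_{i,t-1}$: Healthy at time t-1

• $\delta\, p_{i,t-1}$: Infected but cured

• Product term: No infection received from another node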
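
The update rule of the dynamical system can be iterated directly. The following is a minimal sketch assuming NumPy and a dense adjacency matrix; the function names `nlds_step` and `simulate`, the 5-node star, and the parameter values are illustrative choices, not taken from the talk. For this star the largest adjacency eigenvalue is 2, so the two settings fall on either side of 1/2.

```python
import numpy as np

def nlds_step(p, A, beta, delta):
    """One update of the infection probabilities p (vector of length N)."""
    p = np.asarray(p, dtype=float)
    A = np.asarray(A, dtype=float)
    # no_attack[i] = product over j of (1 - beta * A[j, i] * p[j])
    no_attack = np.prod(1.0 - beta * A * p[:, None], axis=0)
    # Healthy at time t: (healthy before, or infected but cured)
    # and no infection received from another node.
    healthy = (1.0 - p + delta * p) * no_attack
    return 1.0 - healthy

def simulate(A, beta, delta, p0, steps):
    """Iterate nlds_step for a fixed number of steps."""
    p = np.asarray(p0, dtype=float)
    for _ in range(steps):
        p = nlds_step(p, A, beta, delta)
    return p

# 5-node star: node 0 is the hub, nodes 1..4 are leaves (illustrative).
A = np.zeros((5, 5))
A[0, 1:] = 1.0
A[1:, 0] = 1.0
p0 = np.full(5, 0.5)

below = simulate(A, beta=0.1, delta=0.5, p0=p0, steps=200)  # beta/delta = 0.2
above = simulate(A, beta=0.4, delta=0.5, p0=p0, steps=200)  # beta/delta = 0.8
```

With the smaller ratio the infection probabilities shrink towards zero, while with the larger ratio they settle at clearly positive values.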

58

Epidemic threshold

[Theorem 1] We have no epidemic if:

$\beta/\delta < \tau = 1/\lambda_{1,A}$

where β = (virus) birth rate, δ = (virus) death rate, τ = epidemic threshold, and $\lambda_{1,A}$ = largest eigenvalue of adj. matrix A

► $\lambda_{1,A}$ alone decides viral epidemics!
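
As a small illustration of the theorem, the threshold can be computed from the adjacency matrix with a standard eigenvalue routine. This sketch assumes NumPy and an undirected (symmetric) graph; `epidemic_threshold` and `star_graph` are hypothetical helper names. A star with one hub and n leaves has largest eigenvalue √n, which gives a quick sanity check.

```python
import numpy as np

def epidemic_threshold(A):
    """tau = 1 / (largest eigenvalue of the symmetric adjacency matrix A)."""
    A = np.asarray(A, dtype=float)
    largest = np.linalg.eigvalsh(A)[-1]  # eigvalsh returns ascending eigenvalues
    return 1.0 / largest

def star_graph(n_nodes):
    """Adjacency matrix of a star: node 0 is the hub, the rest are leaves."""
    A = np.zeros((n_nodes, n_nodes))
    A[0, 1:] = 1.0
    A[1:, 0] = 1.0
    return A

tau = epidemic_threshold(star_graph(100))  # hub plus 99 leaves

def no_epidemic(beta, delta, A):
    """True when beta/delta is below the threshold of Theorem 1."""
    return beta / delta < epidemic_threshold(A)
```

For the 100-node star, `tau` equals 1/√99, so a virus with β/δ below roughly 0.1 dies out under this model.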

59

Recall the definition of eigenvalues

$A X = \lambda_A X$, where $\lambda_A$ is an eigenvalue of A

$\lambda_{1,A}$ = largest eigenvalue ≈ size of the largest “blob”

60

Experiments (100-node Star)

β/δ = τ (close to the threshold)

β/δ < τ (below threshold)

β/δ > τ (above threshold)


61

Experiments (Oregon)

β/δ > τ (above threshold)

β/δ = τ (at the threshold)

β/δ < τ (below threshold)

10,900 nodes and 31,180 edges

62

Extensions

This dynamical-systems framework can be exploited further

1. The rate of decay of the infection

2. Information survival thresholds in sensor/P2P networks

63

Extension #1

Below the threshold: How quickly does an infection die out?

[Theorem 2] Exponentially quickly

64

Experiment (10K Star Graph)

“Score” $s = (\beta/\delta) \cdot \lambda_{1,A}$ = “fraction” of threshold

[Plot: Number of infected nodes (log-scale) vs. Time-steps (linear-scale)]

Linear on log-lin scale → exponential decay

65

Experiment (Oregon Graph)

“Score” $s = (\beta/\delta) \cdot \lambda_{1,A}$ = “fraction” of threshold

[Plot: Number of infected nodes (log-scale) vs. Time-steps (linear-scale)]

Linear on log-lin scale → exponential decay

66

Extension #2

• Sensors gain new information

Information survival in sensor networks [+ Leskovec, Faloutsos, Guestrin, Madden]

67

Extension #2

• Sensors gain new information

• but they may die due to harsh environment or battery failure

• so they occasionally try to transmit data to nearby sensors

• and failed sensors are occasionally replaced.

Information survival in sensor networks [+ Leskovec, Faloutsos, Guestrin, Madden]

68

Extension #2

• Sensors gain new information

• but they may die due to harsh environment or battery failure

• so they occasionally try to transmit data to nearby sensors

• and failed sensors are occasionally replaced.

• Under what conditions does the information survive?

Information survival in sensor networks [+ Leskovec, Faloutsos, Guestrin, Madden]

69

Extension #2

[Theorem 1] The information dies out exponentially quickly if

[Equation not recovered: a condition relating the quantities below]

• Retransmission rate

• Resurrection rate

• Failure rate of sensors

• Largest eigenvalue of the “link quality” matrix

70

Roadmap

Specific applications

• 1 Node grouping

• 2 Viral propagation

3 General issues

• Realistic graph generation

• Graph patterns and “laws”

4 Conclusions

How can we generate a “realistic” graph that mimics a given real-world graph?

Skip

71

Experiments (Clickstream bipartite graph)

[Plot: Count vs. In-degree for the Users to Websites graph; legend: Clickstream, R-MAT; annotations: “Some personal webpage” and “Yahoo, Google and others”]

72

Experiments (Clickstream bipartite graph)

[Plot: Count vs. Out-degree for the Users to Websites graph; legend: Clickstream, R-MAT; annotations: “Email-checking surfers” and “All-night” surfers]

73

Experiments (Clickstream bipartite graph)

[Plots: Count vs Out-degree; Count vs In-degree; Hop-plot; Left “Network value”; Right “Network value”; Singular value vs Rank]

► R-MAT can match real-world graphs

74

Roadmap

Specific applications

• 1 Node grouping

• 2 Viral propagation

3 General issues

• Realistic graph generation

• Graph patterns and “laws”

4 Conclusions

75

Conclusions

Two paths in graph mining:

Specific applications:

• Viral Propagation: non-linear dynamical system, epidemic depends on largest eigenvalue

• Node Grouping: MDL-based approach for automatic grouping

General issues:

• Graph Patterns: Marks of “realism” in a graph

• Graph Generators: R-MAT, a scalable generator matching many of the patterns

76

Software

http://www-2.cs.cmu.edu/~deepay/#Sw

CrossAssociations

• To find natural node groups.

• Used by “anonymous” large accounting firm.

• Used by Intel Research, Cambridge, UK.

• Used at UC, Riverside (net intrusion detection).

• Used at the University of Porto, Portugal

NetMine

• To extract graph patterns quickly + build realistic graphs.

• Used by Northrop Grumman corp.

F4

• A non-linear time series forecasting package.

77

===CROSS-ASSOCIATIONS===

• Why simultaneous grouping?

• Differences from co-clustering and others?

• Other parameter-fitting criteria?

• Cost surface

• Exact cost function

• Exact complexity, wall-clock times

• Soft clustering

• Different weights for code and description costs?

• Precision-recall for CLASSIC

• Inter-group “affinities”

• Collaborative filtering and recommendation systems?

• CA versus bipartite cores

• Extras

• General comments on CA communities

78

===Viral Propagation===

• Comparison with previous methods

• Accuracy of dynamical system

• Relationship with full Markov chain

• Experiments on information survival threshold

• Comparison with Infinite Particle Systems

• Intuition behind the largest eigenvalue

• Correlated failures

79

===R-MAT===

• Graph patterns

• Generator desiderata

• Description of R-MAT

• Experiments on a directed graph

• R-MAT communities via Cross-Associations?

• R-MAT versus tree-based generators

80

===Graphs in general===

• Relational learning

• Graph Kernels

81

Simultaneous grouping is useful

Sparse blocks, with little in common between rows

Grouping rows first would collapse these two into one!

Index

82

Index

Cross-Associations ≠ Co-clustering!

Information-theoretic co-clustering:

1. Lossy Compression.

2. Approximates the original matrix, while trying to minimize KL-divergence.

3. The number of row and column groups must be given by the user.

Cross-Associations:

1. Lossless Compression.

2. Always provides complete information about the matrix, for any number of row and column groups.

3. The number of row and column groups is chosen automatically using the MDL principle.

83

Index

Other parameter-fitting methods

The Gap statistic [Tibshirani+ ’01]

• Minimize the “gap” of log-likelihood of intra-cluster distances from the expected log-likelihood.

But

• Needs a distance function between graph nodes

• Needs a “reference” distribution

• Needs multiple MCMC runs to remove “variance due to sampling” → more time.

84

Other parameter-fitting methods

Stability-based method [Ben-Hur+ ’02, ’03]

• Run clustering multiple times on samples of data, for several values of “k”

• For low k, clustering is stable; for high k, unstable

• Choose this transition point.

But

• Needs many runs of the clustering algorithm

• Arguments possible over definition of transition point

Index

85

Precision-Recall for CLASSIC

Index

86

Cost surface (total cost)

[Plots: Surface plot and Contour plot of the total cost over k and l]

With increasing k and l: Total cost decays very rapidly initially, but then starts increasing slowly

Index

87

Cost surface (code cost only)

[Plots: Surface plot and Contour plot of the code cost over k and l]

With increasing k and l: Code cost decays very rapidly

Index

88

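Encoding Cost Function

\[
\begin{aligned}
\text{Total encoding cost} ={}& \log^*(k) + \log^*(l) && \text{(cluster number)} \\
{}+{}& N \log N + M \log M && \text{(row/col order)} \\
{}+{}& \sum_i \log a_i + \sum_j \log b_j && \text{(cluster sizes)} \\
{}+{}& \sum_i \sum_j \log(a_i b_j + 1) && \text{(block densities)} \\
{}+{}& \sum_i \sum_j a_i b_j \, H(p_{i,j}) && \text{(code cost)}
\end{aligned}
\]

Description cost: the first four lines. Code cost: the last line.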
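
The encoding cost on this slide can be sketched in code. This is a minimal sketch, not the thesis implementation: the function names are hypothetical, logarithms are taken base 2, log* is approximated by summing the positive terms of log2(n) + log2(log2(n)) + ..., and every group is assumed to be non-empty.

```python
import math


def log_star(n):
    """Approximate universal code length for a positive integer:
    log2(n) + log2(log2(n)) + ..., keeping only the positive terms."""
    total = 0.0
    x = math.log2(n)
    while x > 0:
        total += x
        x = math.log2(x)
    return total


def binary_entropy(p):
    """H(p) in bits; taken as 0 at p = 0 and p = 1."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def encoding_cost(matrix, row_group, col_group, k, l):
    """Return (description_cost, code_cost) in bits for a 0/1 matrix.

    row_group[r] in range(k) and col_group[c] in range(l) give the group of
    each row and column. Assumes every group is non-empty.
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    a = [row_group.count(i) for i in range(k)]  # row-group sizes a_i
    b = [col_group.count(j) for j in range(l)]  # column-group sizes b_j

    # Number of ones inside each (row group, column group) block.
    ones = [[0] * l for _ in range(k)]
    for r, row in enumerate(matrix):
        for c, value in enumerate(row):
            ones[row_group[r]][col_group[c]] += value

    blocks = [(i, j) for i in range(k) for j in range(l)]
    description = log_star(k) + log_star(l)                    # cluster number
    description += n_rows * math.log2(n_rows)                  # row order
    description += n_cols * math.log2(n_cols)                  # column order
    description += sum(math.log2(x) for x in a + b)            # cluster sizes
    description += sum(math.log2(a[i] * b[j] + 1) for i, j in blocks)  # block densities

    code = sum(
        a[i] * b[j] * binary_entropy(ones[i][j] / (a[i] * b[j]))
        for i, j in blocks
    )
    return description, code
```

On a perfectly block-diagonal matrix the code cost drops to zero once the groups match the blocks, while the description cost grows with k and l; the total cost is the sum of the two.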

Index

Page 89: 1 Tools for Large Graph Mining Thesis Committee:  Christos Faloutsos  Chris Olston  Guy Blelloch  Jon Kleinberg (Cornell) - Deepayan Chakrabarti.

89

Complexity of CA

$O(E \cdot (k^2 + l^2))$, ignoring the number of re-assign iterations, which is typically low.

Index

Page 90: 1 Tools for Large Graph Mining Thesis Committee:  Christos Faloutsos  Chris Olston  Guy Blelloch  Jon Kleinberg (Cornell) - Deepayan Chakrabarti.

90

Complexity of CA

[Figure: Time / Σ(k+l) plotted against number of edges]

Index

Page 91: 1 Tools for Large Graph Mining Thesis Committee:  Christos Faloutsos  Chris Olston  Guy Blelloch  Jon Kleinberg (Cornell) - Deepayan Chakrabarti.

91

Inter-group distances

[Figure: nodes × nodes matrix arranged into node groups Grp1, Grp2 and Grp3]

Two groups are “close” when merging them does not increase cost by much

distance(i,j) = relative increase in cost on merging i and j

Index

Page 92: 1 Tools for Large Graph Mining Thesis Committee:  Christos Faloutsos  Chris Olston  Guy Blelloch  Jon Kleinberg (Cornell) - Deepayan Chakrabarti.

92

Inter-group distances

[Figure: node groups × node groups matrix for Grp1, Grp2 and Grp3, with pairwise inter-group distances labeled 5.5, 4.5 and 5.1]

Two groups are “close” when merging them does not increase cost by much

distance(i,j) = relative increase in cost on merging i and j
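
The distance definition above can be sketched as follows. This is a minimal sketch under stated assumptions: the increase is taken relative to the cost before merging, the cost before merging is assumed to be positive, and `block_code_cost` is a hypothetical toy cost (code cost only, same grouping on rows and columns) standing in for the full total cost.

```python
import math


def block_code_cost(adj, group):
    """Toy cost: code cost (in bits) of a 0/1 adjacency matrix whose rows and
    columns share one node grouping. A stand-in for the full total cost."""
    labels = sorted(set(group))
    size = {g: group.count(g) for g in labels}
    ones = {(g, h): 0 for g in labels for h in labels}
    for u, row in enumerate(adj):
        for v, value in enumerate(row):
            ones[(group[u], group[v])] += value
    cost = 0.0
    for (g, h), count in ones.items():
        cells = size[g] * size[h]
        p = count / cells
        if 0.0 < p < 1.0:
            cost += cells * (-p * math.log2(p) - (1 - p) * math.log2(1 - p))
    return cost


def merge_distance(adj, group, i, j, cost=block_code_cost):
    """Relative increase in cost when groups i and j are merged.
    Assumes the cost before merging is positive."""
    before = cost(adj, group)
    merged = [i if g == j else g for g in group]
    after = cost(adj, merged)
    return (after - before) / before


# Hypothetical example: groups 0 and 1 link to the same nodes (with one
# missing edge as noise), while group 2 is a separate clique.
group = [0, 0, 1, 1, 2, 2]
adj = [
    [1 if group[u] < 2 and group[v] < 2 else int(group[u] == group[v])
     for v in range(6)]
    for u in range(6)
]
adj[0][3] = adj[3][0] = 0

close = merge_distance(adj, group, 0, 1)  # similar groups: small increase
far = merge_distance(adj, group, 0, 2)    # dissimilar groups: large increase
```

In this toy example merging groups 0 and 1 raises the cost far less than merging groups 0 and 2, so the first pair is the “closer” one.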

Index

Page 93: 1 Tools for Large Graph Mining Thesis Committee:  Christos Faloutsos  Chris Olston  Guy Blelloch  Jon Kleinberg (Cornell) - Deepayan Chakrabarti.

93

Experiments

[Figure: author groups × author groups matrix, with labels Grp1 and Grp8]

Inter-group distances can aid in visualization

Stonebraker, DeWitt, Carey

Index

Page 94: 1 Tools for Large Graph Mining Thesis Committee:  Christos Faloutsos  Chris Olston  Guy Blelloch  Jon Kleinberg (Cornell) - Deepayan Chakrabarti.

94

Collaborative filtering and recommendation systems

Q: If someone likes a product X, will (s)he like product Y?

A: Check if others who liked X also liked Y.

Focus on distances between people, typically cosine similarity, and not on clustering
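
Cosine similarity between two people can be sketched as below. This is a minimal sketch: each person is assumed to be a vector of 0/1 “liked” flags (or ratings) over the same list of products, and the function name is hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length preference vectors.
    Returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical users over four products.
alice = [1, 1, 0, 0]
bob = [1, 1, 1, 0]
carol = [0, 0, 0, 1]
```

Here `cosine_similarity(alice, bob)` is higher than `cosine_similarity(alice, carol)`, so Bob's likes would carry more weight than Carol's when predicting what Alice likes.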

Index

Page 95: 1 Tools for Large Graph Mining Thesis Committee:  Christos Faloutsos  Chris Olston  Guy Blelloch  Jon Kleinberg (Cornell) - Deepayan Chakrabarti.

95

CA and bipartite cores: related but different

A 3x2 bipartite core

Hubs Authorities

Kumar et al. [1999] say that bipartite cores correspond to communities.

Index

Page 96: 1 Tools for Large Graph Mining Thesis Committee:  Christos Faloutsos  Chris Olston  Guy Blelloch  Jon Kleinberg (Cornell) - Deepayan Chakrabarti.

96

CA and bipartite cores: related but different

CA finds two communities there: one for hubs, and one for authorities.

We gracefully handle cases where a few links are missing.

CA considers connections between all sets of clusters, and not just two sets.

Not each node need belong to a non-trivial bipartite core.

CA is (informally) a generalization

Index

Page 97: 1 Tools for Large Graph Mining Thesis Committee:  Christos Faloutsos  Chris Olston  Guy Blelloch  Jon Kleinberg (Cornell) - Deepayan Chakrabarti.

97

Comparison with soft clustering

• Soft clustering: each node belongs to each cluster with some probability
• Hard clustering: one cluster per node

Index

Page 98: 1 Tools for Large Graph Mining Thesis Committee:  Christos Faloutsos  Chris Olston  Guy Blelloch  Jon Kleinberg (Cornell) - Deepayan Chakrabarti.

98

Comparison with soft clustering

1. Soft clustering has far more degrees of freedom
   a. Parameter fitting is harder
   b. Algorithms can be costlier

2. Hard clustering is better for exploratory data analysis

3. Some real-world problems require hard clustering, e.g., fraud detection for accountants

Index

Page 99: 1 Tools for Large Graph Mining Thesis Committee:  Christos Faloutsos  Chris Olston  Guy Blelloch  Jon Kleinberg (Cornell) - Deepayan Chakrabarti.

99

Weights for code cost vs description cost

Total = 1·(code cost) + 1·(description cost)
Physical meaning: total number of bits

Total = α·(code cost) + β·(description cost)
Physical meaning: number of encoding bits under some prior

Index

Page 100: 1 Tools for Large Graph Mining Thesis Committee:  Christos Faloutsos  Chris Olston  Guy Blelloch  Jon Kleinberg (Cornell) - Deepayan Chakrabarti.

100

Formula for re-assigns

Re-assign: for each row x

[Figure: matrix partitioned into row groups and column groups]
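
The formula itself did not survive in this transcript. The sketch below shows one plausible form of a re-assign step and is an assumption, not the slide's formula: each row x is moved to the row group whose current block densities encode it in the fewest bits. The function names and the clamping constant `EPS` are hypothetical.

```python
import math

EPS = 1e-9  # hypothetical clamp to avoid log(0) for all-zero or all-one blocks


def row_cost(row, col_group, l, density):
    """Bits needed to encode one row using the block densities of a row group."""
    ones = [0] * l
    zeros = [0] * l
    for c, value in enumerate(row):
        if value:
            ones[col_group[c]] += 1
        else:
            zeros[col_group[c]] += 1
    cost = 0.0
    for j in range(l):
        p = min(max(density[j], EPS), 1.0 - EPS)
        cost += -ones[j] * math.log2(p) - zeros[j] * math.log2(1.0 - p)
    return cost


def reassign_rows(matrix, row_group, col_group, k, l):
    """Return a new row grouping: each row goes to its cheapest row group,
    with densities computed from the current grouping."""
    a = [row_group.count(i) for i in range(k)]
    b = [col_group.count(j) for j in range(l)]
    ones = [[0] * l for _ in range(k)]
    for r, row in enumerate(matrix):
        for c, value in enumerate(row):
            ones[row_group[r]][col_group[c]] += value
    density = [
        [ones[i][j] / (a[i] * b[j]) if a[i] * b[j] else 0.0 for j in range(l)]
        for i in range(k)
    ]
    return [
        min(range(k), key=lambda i: row_cost(row, col_group, l, density[i]))
        for row in matrix
    ]
```

For a block-diagonal 4×4 matrix with one row placed in the wrong group, a single pass of `reassign_rows` moves that row back to the group whose densities match it.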

Index

Page 101: 1 Tools for Large Graph Mining Thesis Committee:  Christos Faloutsos  Chris Olston  Guy Blelloch  Jon Kleinberg (Cornell) - Deepayan Chakrabarti.

101

Choosing k and l

[Figure: example with k = 5, l = 5]

Split:
1. Find the row group R with the maximum entropy per row

2. Choose the rows in R whose removal reduces the entropy per row in R

3. Send these rows to the new row group, and set k=k+1
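
The three split steps can be sketched as follows. This is a minimal sketch under assumptions: “entropy per row” is taken to be the code cost of a row group divided by its number of rows, each row is tested for removal independently, and the function names are hypothetical.

```python
import math


def _entropy(p):
    """Binary entropy in bits; 0 at p = 0 and p = 1."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def entropy_per_row(matrix, rows, col_group, l):
    """Code cost of `rows` treated as one row group, divided by len(rows)."""
    if not rows:
        return 0.0
    b = [col_group.count(j) for j in range(l)]
    ones = [0] * l
    for r in rows:
        for c, value in enumerate(matrix[r]):
            ones[col_group[c]] += value
    return sum(
        b[j] * _entropy(ones[j] / (len(rows) * b[j])) for j in range(l) if b[j]
    )


def split_row_group(matrix, row_group, col_group, k, l):
    """Return (new_row_group, new_k) after one split step."""
    members = {i: [r for r, g in enumerate(row_group) if g == i] for i in range(k)}
    # Step 1: the row group R with the maximum entropy per row.
    target = max(
        range(k), key=lambda i: entropy_per_row(matrix, members[i], col_group, l)
    )
    base = entropy_per_row(matrix, members[target], col_group, l)
    new_group = list(row_group)
    moved = False
    for r in members[target]:
        rest = [x for x in members[target] if x != r]
        # Step 2: does removing this row lower the entropy per row of R?
        if rest and entropy_per_row(matrix, rest, col_group, l) < base:
            new_group[r] = k  # Step 3: send it to the new row group.
            moved = True
    return new_group, (k + 1 if moved else k)
```

A fuller version would also guard against emptying R entirely and would follow the split with re-assign passes; both are left out of this sketch.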

Index

Page 102: 1 Tools for Large Graph Mining Thesis Committee:  Christos Faloutsos  Chris Olston  Guy Blelloch  Jon Kleinberg (Cornell) - Deepayan Chakrabarti.

102

Experiments

Epinions dataset

• 75,888 users

• 508,960 “dots”, one “dot” per “trust” relationship

[Figure: user groups × user groups matrix]

k=19 groups found

Small dense “core”

Index

Page 103: 1 Tools for Large Graph Mining Thesis Committee:  Christos Faloutsos  Chris Olston  Guy Blelloch  Jon Kleinberg (Cornell) - Deepayan Chakrabarti.

103

Comparison with previous methods

• Our threshold subsumes the homogeneous model (Proof)
• We are more accurate than the Mean-Field Assumption model.

Index

Page 104: 1 Tools for Large Graph Mining Thesis Committee:  Christos Faloutsos  Chris Olston  Guy Blelloch  Jon Kleinberg (Cornell) - Deepayan Chakrabarti.

104

Comparison with previous methods: 10K Star Graph

Index

Page 105: 1 Tools for Large Graph Mining Thesis Committee:  Christos Faloutsos  Chris Olston  Guy Blelloch  Jon Kleinberg (Cornell) - Deepayan Chakrabarti.

105

Comparison with previous methods: Oregon Graph

Index

Page 106: 1 Tools for Large Graph Mining Thesis Committee:  Christos Faloutsos  Chris Olston  Guy Blelloch  Jon Kleinberg (Cornell) - Deepayan Chakrabarti.

106

Accuracy of dynamical system: 10K Star Graph

Index

Page 107: 1 Tools for Large Graph Mining Thesis Committee:  Christos Faloutsos  Chris Olston  Guy Blelloch  Jon Kleinberg (Cornell) - Deepayan Chakrabarti.

107

Accuracy of dynamical system: Oregon Graph

Index

Page 108: 1 Tools for Large Graph Mining Thesis Committee:  Christos Faloutsos  Chris Olston  Guy Blelloch  Jon Kleinberg (Cornell) - Deepayan Chakrabarti.

108

Accuracy of dynamical system: 10K Star Graph

Index

Page 109: 1 Tools for Large Graph Mining Thesis Committee:  Christos Faloutsos  Chris Olston  Guy Blelloch  Jon Kleinberg (Cornell) - Deepayan Chakrabarti.

109

Accuracy of dynamical system: Oregon Graph

Index

Page 110: 1 Tools for Large Graph Mining Thesis Committee:  Christos Faloutsos  Chris Olston  Guy Blelloch  Jon Kleinberg (Cornell) - Deepayan Chakrabarti.

110

Relationship with full Markov Chain

The full Markov Chain is of the form:

Prob(infection at time t) = $X_{t-1} + Y_{t-1} - Z_{t-1}$

Non-linear component

The independence assumption leads to a point estimate for $Z_{t-1}$, which gives the non-linear dynamical system.

Still non-linear, but now tractable
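
The slide does not spell out the dynamical system, so the sketch below is an assumption rather than the thesis's exact equations. It shows one SIS-style update in which, under the independence assumption, node i stays uninfected with probability ζ_i = Π_j (1 − β·p_j) over its neighbors j, and an infected node is cured with probability δ. The names `nlds_step`, `beta` and `delta` are hypothetical.

```python
def nlds_step(adj, p, beta, delta):
    """One step of a non-linear dynamical system for SIS-style spreading.

    adj is a 0/1 adjacency matrix, p[i] is the probability that node i is
    infected, beta is the per-link infection probability and delta is the
    cure probability. Returns the updated list of probabilities.
    """
    n = len(p)
    new_p = []
    for i in range(n):
        # Probability that no neighbor infects node i (independence assumed).
        zeta = 1.0
        for j in range(n):
            if adj[i][j]:
                zeta *= 1.0 - beta * p[j]
        # Node i is healthy next step if it was healthy and is not infected,
        # or if it was infected, is cured, and is not re-infected.
        healthy = (1.0 - p[i]) * zeta + delta * p[i] * zeta
        new_p.append(1.0 - healthy)
    return new_p


def simulate(adj, p0, beta, delta, steps):
    """Iterate nlds_step and return the final infection probabilities."""
    p = list(p0)
    for _ in range(steps):
        p = nlds_step(adj, p, beta, delta)
    return p
```

Each step costs one pass over the adjacency matrix, which is what makes the system tractable compared with the full Markov chain, whose state space grows exponentially with the number of nodes.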

Index

Page 111: 1 Tools for Large Graph Mining Thesis Committee:  Christos Faloutsos  Chris Olston  Guy Blelloch  Jon Kleinberg (Cornell) - Deepayan Chakrabarti.

111

Experiments: Information survival

• INTEL sensor map (54 nodes)
• MIT sensor map (40 nodes)
• and others…

Index

Page 112: 1 Tools for Large Graph Mining Thesis Committee:  Christos Faloutsos  Chris Olston  Guy Blelloch  Jon Kleinberg (Cornell) - Deepayan Chakrabarti.

112

Experiments: Information survival

INTEL sensor map

Index

Page 113: 1 Tools for Large Graph Mining Thesis Committee:  Christos Faloutsos  Chris Olston  Guy Blelloch  Jon Kleinberg (Cornell) - Deepayan Chakrabarti.

113

Survival threshold on INTEL

Index

Page 114: 1 Tools for Large Graph Mining Thesis Committee:  Christos Faloutsos  Chris Olston  Guy Blelloch  Jon Kleinberg (Cornell) - Deepayan Chakrabarti.

114

Survival threshold on INTEL

Index

Page 115: 1 Tools for Large Graph Mining Thesis Committee:  Christos Faloutsos  Chris Olston  Guy Blelloch  Jon Kleinberg (Cornell) - Deepayan Chakrabarti.

115

Experiments: Information survival

MIT sensor map

Index

Page 116: 1 Tools for Large Graph Mining Thesis Committee:  Christos Faloutsos  Chris Olston  Guy Blelloch  Jon Kleinberg (Cornell) - Deepayan Chakrabarti.

116

Survival threshold on MIT

Index

Page 117: 1 Tools for Large Graph Mining Thesis Committee:  Christos Faloutsos  Chris Olston  Guy Blelloch  Jon Kleinberg (Cornell) - Deepayan Chakrabarti.

117

Survival threshold on MIT

Index

Page 118: 1 Tools for Large Graph Mining Thesis Committee:  Christos Faloutsos  Chris Olston  Guy Blelloch  Jon Kleinberg (Cornell) - Deepayan Chakrabarti.

118

Infinite Particle Systems

“Contact Process” ≈ SIS model

Differences:
• Infinite graphs only, so the questions asked are different
• Very specific topologies: lattices, trees
• Exact thresholds have not been found for these; proving existence of thresholds is important

Our results match those on the finite line graph [Durrett+ ’88]

Index

Page 119: 1 Tools for Large Graph Mining Thesis Committee:  Christos Faloutsos  Chris Olston  Guy Blelloch  Jon Kleinberg (Cornell) - Deepayan Chakrabarti.

119

Intuition behind the largest eigenvalue

• Approximately the size of the largest “blob”
• Consider the special case of a “caveman” graph

Largest eigenvalue = 4
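
The caveman case can be checked numerically. This is a minimal sketch assuming the caveman graph consists of disjoint 5-node cliques (the slide does not state the clique size); the function names are hypothetical. It uses power iteration on A + I, where the shift by the identity avoids oscillation on bipartite graphs.

```python
def caveman(num_caves, cave_size):
    """0/1 adjacency matrix of disjoint cliques with no self-loops."""
    n = num_caves * cave_size
    return [
        [1 if u != v and u // cave_size == v // cave_size else 0 for v in range(n)]
        for u in range(n)
    ]


def largest_eigenvalue(adj, iterations=200):
    """Estimate the largest eigenvalue of a symmetric 0/1 adjacency matrix
    by power iteration on (A + I), then subtract the shift of 1."""
    n = len(adj)
    x = [1.0] * n
    value = 1.0
    for _ in range(iterations):
        y = [x[i] + sum(adj[i][j] * x[j] for j in range(n)) for i in range(n)]
        value = max(abs(c) for c in y)  # infinity norm, used for scaling
        x = [c / value for c in y]
    return value - 1.0


lam = largest_eigenvalue(caveman(3, 5))  # each node has 4 neighbors
```

In a clique of 5 nodes every node has degree 4, so the largest eigenvalue is 4 regardless of how many caves there are, which fits the “size of the largest blob” intuition.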

Index

Page 120: 1 Tools for Large Graph Mining Thesis Committee:  Christos Faloutsos  Chris Olston  Guy Blelloch  Jon Kleinberg (Cornell) - Deepayan Chakrabarti.

120

Intuition behind the largest eigenvalue

• Approximately the size of the largest “blob”

Largest eigenvalue = 4.016

Index

Page 121: 1 Tools for Large Graph Mining Thesis Committee:  Christos Faloutsos  Chris Olston  Guy Blelloch  Jon Kleinberg (Cornell) - Deepayan Chakrabarti.

121

Graph Patterns

Power Laws

Count vs Outdegree

Count vs Indegree

The “epinions” graph with 75,888 nodes and 508,960 edges

Index

Page 122: 1 Tools for Large Graph Mining Thesis Committee:  Christos Faloutsos  Chris Olston  Guy Blelloch  Jon Kleinberg (Cornell) - Deepayan Chakrabarti.

122

Graph Patterns

Power Laws

Count vs Outdegree

Count vs Indegree

The “epinions” graph with 75,888 nodes and 508,960 edges

Index

Page 123: 1 Tools for Large Graph Mining Thesis Committee:  Christos Faloutsos  Chris Olston  Guy Blelloch  Jon Kleinberg (Cornell) - Deepayan Chakrabarti.

123

Graph Patterns

Power Laws and deviations (DGX/Lognormals [Bi+ ’01])

[Figure: Count vs Indegree, with axes Degree and Count]

Index

Page 124: 1 Tools for Large Graph Mining Thesis Committee:  Christos Faloutsos  Chris Olston  Guy Blelloch  Jon Kleinberg (Cornell) - Deepayan Chakrabarti.

124

Graph Patterns

• Power Laws and deviations
• Small-world
• “Community” effect
• …

[Figure: hop-plot of # reachable pairs vs hops, with the effective diameter marked]

Index

Page 125: 1 Tools for Large Graph Mining Thesis Committee:  Christos Faloutsos  Chris Olston  Guy Blelloch  Jon Kleinberg (Cornell) - Deepayan Chakrabarti.

125

Graph Generator Desiderata

• Power Laws and deviations
• Small-world
• “Community” effect
• …

Most current graph generators fail to match some of these.

Other desiderata:
• Few parameters
• Fast parameter-fitting
• Scalable graph generation
• Simple extension to undirected, bipartite and weighted graphs

Index

Page 126: 1 Tools for Large Graph Mining Thesis Committee:  Christos Faloutsos  Chris Olston  Guy Blelloch  Jon Kleinberg (Cornell) - Deepayan Chakrabarti.

126

The R-MAT generator

[SIAM DM’04]

Subdivide the $2^n \times 2^n$ adjacency matrix (axes labeled “From” and “To”)

and choose one quadrant with probability (a,b,c,d)

[Figure: adjacency matrix split into quadrants a (0.5), b (0.1), c (0.15) and d (0.25)]

Intuition: The “80-20 law”

Index

Page 127: 1 Tools for Large Graph Mining Thesis Committee:  Christos Faloutsos  Chris Olston  Guy Blelloch  Jon Kleinberg (Cornell) - Deepayan Chakrabarti.

127

The R-MAT generator

[SIAM DM’04]

Subdivide the $2^n \times 2^n$ adjacency matrix

and choose one quadrant with probability (a,b,c,d)

Recurse till we reach a 1*1 cell

where we place an edge and repeat for all edges.

[Figure: adjacency matrix split into quadrants a, b, c and d, with the chosen quadrant split again into a, b, c and d]

Intuition: The “80-20 law”
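
The recursive procedure can be sketched as below. This is a minimal sketch, not the authors' implementation: the function name and `seed` parameter are hypothetical, the quadrant layout is assumed to be a = top-left, b = top-right, c = bottom-left, d = bottom-right, and duplicate edges are kept rather than discarded.

```python
import random


def rmat_edges(n, num_edges, a, b, c, seed=None):
    """Generate directed edges (row, col) in a 2^n x 2^n adjacency matrix.

    At each of the n levels one quadrant is chosen with probabilities
    (a, b, c, d), where d = 1 - a - b - c; after n levels a single cell
    remains and an edge is placed there.
    """
    rng = random.Random(seed)
    edges = []
    for _ in range(num_edges):
        row = col = 0
        for _level in range(n):
            r = rng.random()
            row <<= 1
            col <<= 1
            if r < a:
                pass                 # top-left quadrant
            elif r < a + b:
                col |= 1             # top-right quadrant
            elif r < a + b + c:
                row |= 1             # bottom-left quadrant
            else:
                row |= 1             # bottom-right quadrant
                col |= 1
        edges.append((row, col))
    return edges


# Example with the probabilities shown on the earlier slide.
sample = rmat_edges(n=10, num_edges=1000, a=0.5, b=0.1, c=0.15, seed=0)
```

Each edge costs n random draws, so generating E edges takes O(E·n) time, with no need to materialize the full adjacency matrix.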

Index

Page 128: 1 Tools for Large Graph Mining Thesis Committee:  Christos Faloutsos  Chris Olston  Guy Blelloch  Jon Kleinberg (Cornell) - Deepayan Chakrabarti.

128

The R-MAT generator

[SIAM DM’04]

Only 3 parameters a, b and c (d = 1-a-b-c).

We have a fast parameter fitting algorithm.

[Figure: $2^n \times 2^n$ adjacency matrix split recursively into quadrants a, b, c and d]

Intuition: The “80-20 law”

Index

Page 129: 1 Tools for Large Graph Mining Thesis Committee:  Christos Faloutsos  Chris Olston  Guy Blelloch  Jon Kleinberg (Cornell) - Deepayan Chakrabarti.

129

Experiments (Epinions directed graph)

[Figure panels: Count vs Indegree, Count vs Outdegree, Hop-plot (effective diameter marked), Eigenvalue vs Rank, “Network value”, Count vs Stress]

►R-MAT matches directed graphs

Index

Page 130: 1 Tools for Large Graph Mining Thesis Committee:  Christos Faloutsos  Chris Olston  Guy Blelloch  Jon Kleinberg (Cornell) - Deepayan Chakrabarti.

130

R-MAT communities and Cross-Associations

R-MAT builds communities in graphs, and Cross-Associations finds them. Relationship?

• R-MAT builds a hierarchy of communities, while CA finds a flat set of communities
• Linkage in the sizes of communities found by CA: when the R-MAT parameters are very skewed, the community sizes for CA are skewed, and vice versa

Index

Page 131: 1 Tools for Large Graph Mining Thesis Committee:  Christos Faloutsos  Chris Olston  Guy Blelloch  Jon Kleinberg (Cornell) - Deepayan Chakrabarti.

131

R-MAT and tree-based generators

Recursive splitting in R-MAT ≈ following a tree from root to leaf.

Relationship with other tree-based generators [Kleinberg ’01, Watts+ ’02]?
• The R-MAT tree has edges as leaves, the others have nodes
• Tree-distance between nodes is used to connect nodes in other generators, but what does tree-distance between edges mean?

Index

Page 132: 1 Tools for Large Graph Mining Thesis Committee:  Christos Faloutsos  Chris Olston  Guy Blelloch  Jon Kleinberg (Cornell) - Deepayan Chakrabarti.

132

Comparison with relational learning

Relational Learning (typical)

1. Aims to find small structure/patterns at the local level

2. Labeled nodes and edges

3. Semantics of labels are important

4. Algorithms are typically costlier

Graph Mining (typical)

1. Emphasis on global aspects of large graphs

2. Unlabeled graphs

3. More focused on topological structure and properties

4. Scalability is more important

Index

Page 133: 1 Tools for Large Graph Mining Thesis Committee:  Christos Faloutsos  Chris Olston  Guy Blelloch  Jon Kleinberg (Cornell) - Deepayan Chakrabarti.

133

===OTHER WORK===

OTHER WORK

Page 134: 1 Tools for Large Graph Mining Thesis Committee:  Christos Faloutsos  Chris Olston  Guy Blelloch  Jon Kleinberg (Cornell) - Deepayan Chakrabarti.

134

Other Work

Time Series Prediction [CIKM 2002]

• We use the fractal dimension of the data
• This is related to chaos theory and Lyapunov exponents…

Page 135: 1 Tools for Large Graph Mining Thesis Committee:  Christos Faloutsos  Chris Olston  Guy Blelloch  Jon Kleinberg (Cornell) - Deepayan Chakrabarti.

135

Other Work

Time Series Prediction [CIKM 2002]

Logistic Parabola

Page 136: 1 Tools for Large Graph Mining Thesis Committee:  Christos Faloutsos  Chris Olston  Guy Blelloch  Jon Kleinberg (Cornell) - Deepayan Chakrabarti.

136

Other Work

Time Series Prediction [CIKM 2002]

Lorenz attractor

Page 137: 1 Tools for Large Graph Mining Thesis Committee:  Christos Faloutsos  Chris Olston  Guy Blelloch  Jon Kleinberg (Cornell) - Deepayan Chakrabarti.

137

Other Work

Time Series Prediction [CIKM 2002]

Laser fluctuations

Page 138: 1 Tools for Large Graph Mining Thesis Committee:  Christos Faloutsos  Chris Olston  Guy Blelloch  Jon Kleinberg (Cornell) - Deepayan Chakrabarti.

138

Other Work

Adaptive histograms with error guarantees [+ Ashraf Aboulnaga, Yufei Tao, Christos Faloutsos]

[Figure: histogram of Count vs Salary, with a per-bucket plot of Prob. vs Count, updated on insertions and deletions]

• Maintain count probabilities for buckets

• to give statistically correct query result-size estimation

• and query feedback

• + …

Page 139: 1 Tools for Large Graph Mining Thesis Committee:  Christos Faloutsos  Chris Olston  Guy Blelloch  Jon Kleinberg (Cornell) - Deepayan Chakrabarti.

139

Other Work

• User-personalization: Patent number 6,611,834 (IBM)

• Relevance feedback in multimedia image search: Filed for patent (IBM)

• Building 3D models using robot camera and rangefinder data [ICML 2001]

Page 140: 1 Tools for Large Graph Mining Thesis Committee:  Christos Faloutsos  Chris Olston  Guy Blelloch  Jon Kleinberg (Cornell) - Deepayan Chakrabarti.

140

===EXTRAS===

Page 141: 1 Tools for Large Graph Mining Thesis Committee:  Christos Faloutsos  Chris Olston  Guy Blelloch  Jon Kleinberg (Cornell) - Deepayan Chakrabarti.

141

Conclusions

Two paths in graph mining:

Specific applications:
• Viral Propagation: resilience testing, information dissemination, rumor spreading
• Node Grouping: automatically grouping nodes, AND finding the correct number of groups

References:
1. Fully automatic Cross-Associations, by Chakrabarti, Papadimitriou, Modha and Faloutsos, in KDD 2004
2. AutoPart: Parameter-free graph partitioning and Outlier detection, by Chakrabarti, in PKDD 2004
3. Epidemic spreading in real networks: An eigenvalue viewpoint, by Wang, Chakrabarti, Wang and Faloutsos, in SRDS 2003

Page 142: 1 Tools for Large Graph Mining Thesis Committee:  Christos Faloutsos  Chris Olston  Guy Blelloch  Jon Kleinberg (Cornell) - Deepayan Chakrabarti.

142

Conclusions

Two paths in graph mining:

Specific applications

General issues:
• Graph Patterns: marks of “realism” in a graph
• Graph Generators: R-MAT, a fast, scalable generator matching many of the patterns

References:
1. R-MAT: A recursive model for graph mining, by Chakrabarti, Zhan and Faloutsos, in SIAM Data Mining 2004
2. NetMine: New mining tools for large graphs, by Chakrabarti, Zhan, Blandford, Faloutsos and Blelloch, in the SIAM 2004 Workshop on Link analysis, counter-terrorism and privacy

Page 143: 1 Tools for Large Graph Mining Thesis Committee:  Christos Faloutsos  Chris Olston  Guy Blelloch  Jon Kleinberg (Cornell) - Deepayan Chakrabarti.

143

Other References

F4: Large Scale Automated Forecasting using Fractals, by D. Chakrabarti and C. Faloutsos, in CIKM 2002.

Using EM to Learn 3D Models of Indoor Environments with Mobile Robots, by Y. Liu, R. Emery, D. Chakrabarti, W. Burgard and S. Thrun, in ICML 2001

Graph Mining: Laws, Generators and Algorithms, by D. Chakrabarti and C. Faloutsos, under submission to ACM Computing Surveys

Page 144: 1 Tools for Large Graph Mining Thesis Committee:  Christos Faloutsos  Chris Olston  Guy Blelloch  Jon Kleinberg (Cornell) - Deepayan Chakrabarti.

144

References --- graphs

1. R-MAT: A recursive model for graph mining, by D. Chakrabarti, Y. Zhan, C. Faloutsos in SIAM Data Mining 2004.

2. Epidemic spreading in real networks: An eigenvalue viewpoint, by Y. Wang, D. Chakrabarti, C. Wang and C. Faloutsos, in SRDS 2003

3. Fully automatic Cross-Associations, by D. Chakrabarti, S. Papadimitriou, D. Modha and C. Faloutsos, in KDD 2004

4. AutoPart: Parameter-free graph partitioning and Outlier detection, by D. Chakrabarti, in PKDD 2004

5. NetMine: New mining tools for large graphs, by D. Chakrabarti, Y. Zhan, D. Blandford, C. Faloutsos and G. Blelloch, in the SIAM 2004 Workshop on Link analysis, counter-terrorism and privacy

Page 145: 1 Tools for Large Graph Mining Thesis Committee:  Christos Faloutsos  Chris Olston  Guy Blelloch  Jon Kleinberg (Cornell) - Deepayan Chakrabarti.

145

Roadmap

Specific applications

• Node grouping (1)

• Viral propagation (2)

General issues

• Realistic graph generation (3)

• Graph patterns and “laws”

4 Other Work

5 Conclusions

Page 146: 1 Tools for Large Graph Mining Thesis Committee:  Christos Faloutsos  Chris Olston  Guy Blelloch  Jon Kleinberg (Cornell) - Deepayan Chakrabarti.

146

Experiments (Clickstream bipartite graph)

[Figure: Count vs In-degree for the Users × Websites clickstream graph, legend entry “Clickstream”, with annotations “Some personal webpage” and “Yahoo, Google and others”]

Page 147: 1 Tools for Large Graph Mining Thesis Committee:  Christos Faloutsos  Chris Olston  Guy Blelloch  Jon Kleinberg (Cornell) - Deepayan Chakrabarti.

147

Experiments (Clickstream bipartite graph)

[Figure: Count vs Out-degree for the Users × Websites clickstream graph, legend entry “Clickstream”, with annotations “Email-checking surfers” and “‘All-night’ surfers”]

Page 148: 1 Tools for Large Graph Mining Thesis Committee:  Christos Faloutsos  Chris Olston  Guy Blelloch  Jon Kleinberg (Cornell) - Deepayan Chakrabarti.

148

Experiments (Clickstream bipartite graph)

[Figure: hop-plot of # reachable pairs vs hops for the Users × Websites graph, comparing Clickstream and R-MAT]

Page 149: 1 Tools for Large Graph Mining Thesis Committee:  Christos Faloutsos  Chris Olston  Guy Blelloch  Jon Kleinberg (Cornell) - Deepayan Chakrabarti.

149

Graph Generation

Important for:
• Simulations of new algorithms
• Compression using a good graph generation model
• Insight into the graph formation process

Our R-MAT (Recursive MATrix) generator can match many common graph patterns.

Page 150: 1 Tools for Large Graph Mining Thesis Committee:  Christos Faloutsos  Chris Olston  Guy Blelloch  Jon Kleinberg (Cornell) - Deepayan Chakrabarti.

150

Recall the definition of eigenvalues

$\beta/\delta < \tau = 1/\lambda_{1,A}$

$A X = \lambda_A X$

$\lambda_A$ = eigenvalue of A

$\lambda_{1,A}$ = largest eigenvalue

Page 151: 1 Tools for Large Graph Mining Thesis Committee:  Christos Faloutsos  Chris Olston  Guy Blelloch  Jon Kleinberg (Cornell) - Deepayan Chakrabarti.

151

Tools for Large Graph Mining

Deepayan Chakrabarti Carnegie Mellon University