Mining Large Graphs: Spectral Methods, Tensors and Influence propagation

Post on 22-Feb-2016


CMU SCS


Christos Faloutsos, CMU

Thanks
• Alex Smola
• Jia Yu (Tim) Pan

Google, June 2013 C. Faloutsos (CMU) 2

Roadmap
• Graph problems:
  – G1: Fraud detection – BP
  – G2: Botnet detection – spectral
  – G3: Beyond graphs: tensors and ‘NELL’
• Influence propagation and spike modeling
  – C1: spikeM model
• Conclusions

E-bay Fraud detection

w/ Polo Chau & Shashank Pandit, CMU [WWW’07]


E-bay Fraud detection - NetProbe

Compatibility matrix (heterophily), over states F, A, H. Entries shown, by row:
F: 99%
A: 99%
H: 49%, 49%

Background 1: Belief Propagation Equations

$m_{ij}(x_j) = \sum_{x_i} \phi_i(x_i) \cdot \psi_{ij}(x_i, x_j) \cdot \prod_{n \in N(i) \setminus j} m_{ni}(x_i)$

$b_i(x_i) = \eta \cdot \phi_i(x_i) \cdot \prod_{j \in N(i)} m_{ji}(x_i)$

[Pearl ‘82][Yedidia+ ‘02]…[Pandit+ ‘07][Gonzalez+ ‘09][Chechetka+ ‘10]
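The two update equations above can be sketched on a toy graph. This is only an illustration, not the NetProbe code: the function name `belief_propagation`, the 3-node chain, the priors, and the 2-state heterophily potential are all assumptions made for the example.

```python
import numpy as np

def belief_propagation(edges, phi, psi, iters=20):
    """Synchronous loopy BP on an undirected graph (illustrative sketch).

    edges: list of (i, j) pairs; phi: (n, k) node priors;
    psi: (k, k) edge potential psi[x_i, x_j], shared by all edges.
    """
    phi = np.asarray(phi, dtype=float)
    psi = np.asarray(psi, dtype=float)
    n, k = phi.shape
    nbrs = {i: set() for i in range(n)}
    for i, j in edges:
        nbrs[i].add(j)
        nbrs[j].add(i)
    # msgs[(i, j)] is the message from i to j, a vector over x_j.
    msgs = {(i, j): np.full(k, 1.0 / k) for i in nbrs for j in nbrs[i]}
    for _ in range(iters):
        new = {}
        for (i, j) in msgs:
            # Prior of i times all messages into i, except the one from j.
            prod = phi[i].copy()
            for nb in nbrs[i] - {j}:
                prod *= msgs[(nb, i)]
            out = psi.T @ prod          # sum over x_i
            new[(i, j)] = out / out.sum()
        msgs = new
    # Belief of i: prior times all incoming messages, then normalize (eta).
    beliefs = phi.copy()
    for i in nbrs:
        for j in nbrs[i]:
            beliefs[i] *= msgs[(j, i)]
    return beliefs / beliefs.sum(axis=1, keepdims=True)

# Hypothetical example: chain 0 - 1 - 2, two states, heterophily
# (neighbors tend to be in opposite states); only node 0 has an informative prior.
edges = [(0, 1), (1, 2)]
phi = [[0.9, 0.1], [0.5, 0.5], [0.5, 0.5]]
psi = [[0.1, 0.9], [0.9, 0.1]]
beliefs = belief_propagation(edges, phi, psi)
print(beliefs.round(3))
```

Under heterophily the most likely state alternates along the chain, and confidence weakens with distance from the node that carries the informative prior.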


Popular press

And less desirable attention:
• E-mail from ‘Belgium police’ (‘copy of your code?’)

Roadmap
• Graph problems:
  – G1: Fraud detection – BP
    • Ebay
    • Symantec
    • Unification
  – G2: Botnet detection – spectral
  – G3: Beyond graphs: tensors and ‘NELL’
• Influence propagation and spike modeling
• Conclusions

Polo Chau, Machine Learning Dept
Carey Nachenberg, Vice President & Fellow
Jeffrey Wilhelm, Principal Software Engineer
Adam Wright, Software Engineer
Prof. Christos Faloutsos, Computer Science Dept

Polonium: Tera-Scale Graph Mining and Inference for Malware Detection

PATENT PENDING

SDM 2011, Mesa, Arizona

Polonium: The Data

60+ terabytes of data anonymously contributed by participants of the worldwide Norton Community Watch program
• 50+ million machines
• 900+ million executable files

Constructed a machine-file bipartite graph (0.2 TB+)
• 1 billion nodes (machines and files)
• 37 billion edges

Polonium: Key Ideas
• Use “guilt-by-association” (i.e., homophily)
  – E.g., files that appear on machines with many bad files are more likely to be bad
• Scalability: handles 37 billion-edge graph

Polonium: One-Interaction Results

84.9% True Positive Rate at 1% False Positive Rate

[Figure: True Positive Rate (% of malware correctly identified) vs. False Positive Rate (% of non-malware wrongly labeled as malware); the ideal point is marked]
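The two rates on this slide follow directly from their definitions. The sketch below computes them from made-up labels; the function name `tpr_fpr` and the sample data are assumptions for illustration and have nothing to do with the Polonium data.

```python
def tpr_fpr(y_true, y_pred):
    """Return (true positive rate, false positive rate).

    y_true, y_pred: sequences of 0/1, where 1 means 'malware'.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    tpr = tp / (tp + fn)   # % of malware correctly identified
    fpr = fp / (fp + tn)   # % of non-malware wrongly labeled as malware
    return tpr, fpr

# Hypothetical labels: 4 malware files, 6 benign files.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
tpr, fpr = tpr_fpr(y_true, y_pred)
print(tpr, fpr)
```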


Unifying Guilt-by-Association Approaches: Theorems and Fast Algorithms

Danai Koutra, U Kang, Hsing-Kuo Kenneth Pao, Tai-You Ke, Duen Horng (Polo) Chau, Christos Faloutsos

ECML PKDD, 5-9 September 2011, Athens, Greece

Problem Definition: GBA techniques

Given: graph; & few labeled nodes
Find: labels of rest (assuming network effects)

Homophily and Heterophily

[Figure: label propagation in two steps (Step 1, Step 2), under homophily and under heterophily]

All methods handle homophily. NOT all methods handle heterophily, BUT the proposed method does!

Are they related?
• RWR (Random Walk with Restarts)
  – Google’s PageRank (‘if my friends are important, I’m important, too’)
• SSL (Semi-supervised learning)
  – minimize the differences among neighbors
• BP (Belief propagation)
  – send messages to neighbors, on what you believe about them

YES!


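Correspondence of Methods

Method | Matrix            | Unknown | Known
RWR    | [I – c A D^{-1}]  | × x     | = (1-c) y
SSL    | [I + a(D - A)]    | × x     | = y
FaBP   | [I + a D - c’A]   | × b_h   | = φ_h

Legend: x, b_h = final labels/beliefs; y, φ_h = prior labels/beliefs; A = adjacency matrix; D = diagonal degree matrix, diag(d1, d2, d3).

[Figure: example with adjacency matrix rows (0 1 0), (1 0 1), (0 1 0), and vectors labeled ‘?’ and ‘0 1 1’]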


We know when it converges!
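Since all three methods reduce to a linear system of the same shape, they can be solved with the same call. The sketch below uses the 3-node adjacency matrix from the slide; the parameter values (`c`, `a`, `c_prime`), the prior vectors, and the function names are assumptions chosen for illustration, not values from the paper.

```python
import numpy as np

# 3-node chain from the slide: 0 - 1 - 2
A = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
D = np.diag(A.sum(axis=1))          # degree matrix diag(d1, d2, d3)

def rwr(A, D, y, c=0.5):
    """Solve [I - c A D^-1] x = (1 - c) y."""
    I = np.eye(len(A))
    return np.linalg.solve(I - c * A @ np.linalg.inv(D), (1 - c) * y)

def ssl(A, D, y, a=0.5):
    """Solve [I + a (D - A)] x = y."""
    I = np.eye(len(A))
    return np.linalg.solve(I + a * (D - A), y)

def fabp(A, D, phi_h, a=0.04, c_prime=0.2):
    """Solve [I + a D - c' A] b_h = phi_h."""
    I = np.eye(len(A))
    return np.linalg.solve(I + a * D - c_prime * A, phi_h)

# Hypothetical priors: only node 0 carries a label.
y = np.array([1.0, 0.0, 0.0])
phi_h = np.array([0.1, 0.0, 0.0])

x_rwr = rwr(A, D, y)
x_ssl = ssl(A, D, y)
b_fabp = fabp(A, D, phi_h)
print(x_rwr, x_ssl, b_fabp)
```

With these homophily-style parameters, SSL and FaBP both assign scores that decay with distance from the labeled node, which is the guilt-by-association effect in its simplest form.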

Results: Scalability

FaBP is linear in the number of edges.

[Figure: runtime (min) vs. # of edges (Kronecker graphs)]

Results: Parallelism

FaBP is ~2x faster & wins/ties on accuracy.

[Figure: % accuracy vs. runtime (min)]

Conclusions for BP
• ‘NetProbe’, ‘Polonium’, and belief propagation: exploit network effects.
• FaBP: fast & accurate (and with known convergence conditions)


EigenSpokes

B. Aditya Prakash, Mukund Seshadri, Ashwin Sridharan, Sridhar Machiraju and Christos Faloutsos: EigenSpokes: Surprising Patterns and Scalable Community Chipping in Large Graphs, PAKDD 2010, Hyderabad, India, 21-24 June 2010.

EigenSpokes
• Eigenvectors of adjacency matrix
  – equivalent to singular vectors (symmetric, undirected graph)

[Figure: N × N adjacency matrix]

EigenSpokes
• EE plot: scatter plot of scores of u1 vs u2
• One would expect
  – Many points @ origin
  – A few scattered ~randomly

[Figure: EE plot with axes u1 (1st principal component) and u2 (2nd principal component); annotation: 90°]

EigenSpokes - pervasiveness
• Present in mobile social graph, across time and space
• Patent citation graph

EigenSpokes - explanation

Near-cliques, or near-bipartite-cores, loosely connected

So what?
• Extract nodes with high scores -> high connectivity -> good “communities”

spy plot of top 20 nodes
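The explanation above can be checked on a small synthetic graph. This is a sketch under stated assumptions: the graph (two exact cliques of 10 and 8 nodes attached by single edges to a sparse path) and the helper `add_clique` are invented for the example, and exact cliques stand in for the near-cliques of real data.

```python
import numpy as np

def add_clique(A, nodes):
    """Connect every pair of distinct nodes in `nodes` (hypothetical helper)."""
    nodes = list(nodes)
    for i in nodes:
        for j in nodes:
            if i != j:
                A[i, j] = 1.0

n = 40
A = np.zeros((n, n))
add_clique(A, range(0, 10))        # community 1
add_clique(A, range(10, 18))       # community 2
for i in range(18, n - 1):         # sparse background: a path 18 - 19 - ... - 39
    A[i, i + 1] = A[i + 1, i] = 1.0
A[9, 18] = A[18, 9] = 1.0          # loose connections to the background
A[17, n - 1] = A[n - 1, 17] = 1.0

# A is symmetric, so eigh applies; eigenvalues come back in ascending order.
vals, vecs = np.linalg.eigh(A)
u1, u2 = vecs[:, -1], vecs[:, -2]  # top two eigenvectors

# EE plot coordinates are (u1[i], u2[i]); nodes with high |score| on one
# eigenvector sit near zero on the other, i.e. along an axis ("spoke").
top_u1 = set(np.argsort(-np.abs(u1))[:10].tolist())
top_u2 = set(np.argsort(-np.abs(u2))[:8].tolist())
print(sorted(top_u1), sorted(top_u2))
```

Extracting the highest-scoring nodes of each eigenvector recovers one community per spoke, which is the "chipping" idea in its simplest form.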

Bipartite Communities!

[Figure: magnified bipartite community]
• patents from same inventor(s)
• ‘cut-and-paste’ bibliography!

(maybe, botnets?)
• Victim IPs?
• Botnet members?

Exploring it with Dr. Eric Mao (III-Taiwan)

Roadmap
• Graph problems:
  – G1: Fraud detection – BP
  – G2: Botnet detection – spectral
  – G3: Beyond graphs: tensors and ‘NELL’
• Influence propagation and spike modeling
• Conclusions

GigaTensor: Scaling Tensor Analysis Up By 100 Times – Algorithms and Discoveries

U Kang, Evangelos Papalexakis, Abhay Harpale, Christos Faloutsos

KDD’12

Background: Tensors
• Tensors (= multi-dimensional arrays) are everywhere
  – Hyperlinks & anchor text [Kolda+,05]

[Figure: 3-mode tensor with modes URL 1, URL 2 and Anchor Text (Java, C++, C#); nonzero entries shown as 1]

Background: Tensors
• Tensors (= multi-dimensional arrays) are everywhere
  – Sensor stream (time, location, type)
  – Predicates (subject, verb, object) in knowledge base, e.g., “Barack Obama is president of U.S.”, “Eric Clapton plays guitar”
  – Anomaly detection in computer networks (IP-source, IP-destination, time-stamp)

NELL (Never Ending Language Learner) data: (26M) × (26M) × (48M), nonzeros = 144M

Problem Definition
• How to decompose a billion-scale tensor?
  – Corresponds to SVD in 2D case
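To make the analogy with SVD concrete, the sketch below decomposes a small 3-mode tensor into a sum of rank-1 components with plain in-memory alternating least squares (a CP/PARAFAC fit). This standard technique is swapped in purely for illustration; it is not GigaTensor's scalable algorithm, and the names `cp_als` and `reconstruct` and the toy tensor are assumptions.

```python
import numpy as np

def cp_als(X, rank, iters=50, seed=0):
    """Fit X[i,j,k] ~ sum_r A[i,r] * B[j,r] * C[k,r] by alternating least squares."""
    rng = np.random.default_rng(seed)
    I, J, K = X.shape
    A = rng.random((I, rank))
    B = rng.random((J, rank))
    C = rng.random((K, rank))
    for _ in range(iters):
        # Each step solves a linear least-squares problem for one factor,
        # holding the other two fixed.
        A = np.einsum('ijk,jr,kr->ir', X, B, C) @ np.linalg.pinv((B.T @ B) * (C.T @ C))
        B = np.einsum('ijk,ir,kr->jr', X, A, C) @ np.linalg.pinv((A.T @ A) * (C.T @ C))
        C = np.einsum('ijk,ir,jr->kr', X, A, B) @ np.linalg.pinv((A.T @ A) * (B.T @ B))
    return A, B, C

def reconstruct(A, B, C):
    """Rebuild the tensor from its factor matrices."""
    return np.einsum('ir,jr,kr->ijk', A, B, C)

# Hypothetical toy tensor: an exact rank-1 (subject x verb x object) tensor.
a = np.array([1.0, 2.0, 0.5])
b = np.array([1.0, 3.0])
c = np.array([2.0, 1.0, 1.0, 4.0])
X = np.einsum('i,j,k->ijk', a, b, c)

A, B, C = cp_als(X, rank=1)
rel_err = np.linalg.norm(X - reconstruct(A, B, C)) / np.linalg.norm(X)
print(rel_err)
```

Each column triple (A[:, r], B[:, r], C[:, r]) plays the role that a singular-vector pair plays in SVD: one component, or "concept", of the data.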

[Figure: tensor decomposition with components labeled ‘Politicians’ and ‘Artists’]

Problem Definition

Q1: Dominant concepts/topics?
Q2: Find synonyms to a given noun phrase?
(and how to scale up: |data| > RAM)

Experiments
• GigaTensor solves 100x larger problem

[Figure: tensor with modes (I), (J), (K), number of nonzeros = I / 50; GigaTensor handles a problem 100x larger, at which Tensor Toolbox is out of memory]

A1: Concept Discovery
• Concept Discovery in Knowledge Base

A2: Synonym Discovery


Rise and Fall Patterns of Information Diffusion: Model and Implications

Yasuko Matsubara (Kyoto University), Yasushi Sakurai (NTT), B. Aditya Prakash (CMU), Lei Li (UCB), Christos Faloutsos (CMU)

KDD’12, Beijing, China

Rise and fall patterns in social media
• Meme (# of mentions in blogs)
  – short phrases sourced from U.S. politics in 2008
  – e.g., “you can put lipstick on a pig”, “yes we can”

Rise and fall patterns in social media
• four classes on YouTube [Crane et al. ’08]
• six classes on Meme [Yang et al. ’11]
• Can we find a unifying model, which includes these patterns?
• Answer: YES! We can represent all patterns by a single model

Main idea - SpikeM
1. Un-informed bloggers (uninformed about rumor)
2. External shock at time nb (e.g., breaking news)
3. Infection (word-of-mouth)

Infectiveness of a blog-post at age n:
- Strength of infection (quality of news): β
- Decay function

[Figure: snapshots at Time n=0, Time n=nb, Time n=nb+1]
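The ingredients listed above can be turned into a small simulation. The full SpikeM equation is not reproduced in this transcript, so the update rule below is an assumption assembled from those ingredients: a pool of uninformed bloggers, one external shock at time nb, and word-of-mouth infection with strength β and a power-law decay (using the -1.5 exponent the talk shows on a later slide). Periodicity is left out, and the function name `spike_sketch` and all parameter values are hypothetical.

```python
import numpy as np

def spike_sketch(N=1000.0, beta_N=1.0, n_b=5, S_b=10.0, T=100, eps=0.0):
    """Simulate a rise-and-fall spike (illustrative sketch, not the exact model).

    N: total bloggers; beta_N: infection strength times N; n_b: shock time;
    S_b: shock size; T: number of time ticks; eps: background noise.
    Returns delta, where delta[n] = bloggers newly informed at time n.
    """
    beta = beta_N / N
    delta = np.zeros(T)
    shock = np.zeros(T)
    shock[n_b] = S_b                     # external shock (e.g., breaking news)
    U = N                                # uninformed bloggers
    for n in range(n_b, T - 1):
        # Age of each earlier post, as seen at time n + 1.
        ages = (n + 1) - np.arange(n_b, n + 1, dtype=float)
        # Infectiveness of a post at a given age: beta * age^(-1.5).
        pressure = np.sum((delta[n_b:n + 1] + shock[n_b:n + 1]) * beta * ages ** -1.5)
        new = min(U, U * pressure + eps)  # cannot inform more than remain
        delta[n + 1] = new
        U -= new
    return delta

delta = spike_sketch()
print(int(np.argmax(delta)), round(float(delta.max()), 1))
```

Because the uninformed pool is finite, activity rises after the shock, peaks, and then falls as the pool is depleted.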


-1.5 slope

[Figure: Prob(RT > x) (log) vs. response time (log), with slope -1.5]

J. G. Oliveira & A.-L. Barabási. Human Dynamics: The Correspondence Patterns of Darwin and Einstein. Nature 437, 1251 (2005).

SpikeM - with periodicity
• Full equation of SpikeM
• Periodicity: bloggers change their activity over time (e.g., daily, weekly, yearly)

[Figure: activity vs. time n, with a peak at noon and a dip at 3am]

Details
• Analysis – exponential rise and power-law fall
  – Rise-part: SI -> exponential; SpikeM -> exponential
  – Fall-part: SI -> exponential; SpikeM -> power law

[Figure: lin-log and log-log plots of the rise part and the fall part]

Tail-part forecasts
• SpikeM can capture tail part

“What-if” forecasting

e.g., given
(1) first spike,
(2) release date of two sequel movies,
(3) access volume before the release date (two weeks before release)

SpikeM can forecast upcoming spikes

Conclusions for spikes
• Exp rise; PL decay
• ‘spikeM’ captures all patterns, with a few parms
  – And can do extrapolation
  – And forecasting

Roadmap
• Graph problems:
  – G1: Fraud detection – BP
  – G2: Botnet detection – spectral
  – G3: Beyond graphs: tensors and ‘NELL’
• Influence propagation and spike modeling
• Future research
• Conclusions

Challenge #1: Time evolving networks / tensors
• Periodicities? Burstiness?
• What is ‘typical’ behavior of a node, over time
• Heterogeneous graphs (= nodes w/ attributes)

Challenge #2: ‘Connectome’ – brain wiring
• Which neurons get activated by ‘bee’
• How wiring evolves
• Modeling epilepsy

N. Sidiropoulos, George Karypis, V. Papalexakis, Tom Mitchell

Thanks

Thanks to: NSF IIS-0705359, IIS-0534205, CTA-INARC; Yahoo (M45), LLNL, IBM, SPRINT, Google, INTEL, HP, iLab

Project info: PEGASUS

www.cs.cmu.edu/~pegasus
• Results on large graphs: with Pegasus + hadoop + M45
• Apache license
• Code, papers, manual, video

Prof. U Kang, Prof. Polo Chau

Cast
• Akoglu, Leman
• Chau, Polo
• Kang, U
• McGlohon, Mary
• Tong, Hanghang
• Prakash, Aditya
• Koutra, Danai
• Beutel, Alex
• Papalexakis, Vagelis

References
• Deepayan Chakrabarti, Christos Faloutsos: Graph mining: Laws, generators, and algorithms. ACM Comput. Surv. 38(1) (2006)
• Christos Faloutsos, Tamara G. Kolda, Jimeng Sun: Mining large graphs and streams using matrix and tensor tools. Tutorial, SIGMOD Conference 2007: 1174
• Yasuko Matsubara, Yasushi Sakurai, B. Aditya Prakash, Lei Li, Christos Faloutsos: Rise and Fall Patterns of Information Diffusion: Model and Implications. KDD’12, pp. 6-14, Beijing, China, August 2012
• Jimeng Sun, Dacheng Tao, Christos Faloutsos: Beyond streams and graphs: dynamic tensor analysis. KDD 2006: 374-383

Overall Conclusions
• G1: fraud detection
  – BP: powerful method
  – FaBP: faster; equally accurate; known convergence
• G2: botnets -> EigenSpokes
• G3: Subject-Verb-Object -> Tensors/GigaTensor
• Spikes: ‘spikeM’ (exp rise; PL drop)