School of Computer Science Carnegie Mellon Data Mining using Fractals and Power laws Christos...

118
School of Computer Science Carnegie Mellon Data Mining using Fractals and Power laws Christos Faloutsos Carnegie Mellon University
  • date post

    19-Dec-2015
  • Category

    Documents

  • view

    217
  • download

    0

Transcript of School of Computer Science Carnegie Mellon Data Mining using Fractals and Power laws Christos...

School of Computer ScienceCarnegie Mellon

Data Mining using Fractals and Power laws

Christos Faloutsos

Carnegie Mellon University

FIU 2007 C. Faloutsos 2

School of Computer ScienceCarnegie Mellon

THANK YOU!

• Tao Li

• Martha Soledad

FIU 2007 C. Faloutsos 3

School of Computer ScienceCarnegie Mellon

Thanks to• Deepayan Chakrabarti (CMU/Yahoo)

• Michalis Faloutsos (UCR)

• George Siganos (UCR)

FIU 2007 C. Faloutsos 4

School of Computer ScienceCarnegie Mellon

Overview

• Goals/ motivation: find patterns in large datasets:– (A) Sensor data– (B) network/graph data

• Solutions: self-similarity and power laws

• Discussion

FIU 2007 C. Faloutsos 5

School of Computer ScienceCarnegie Mellon

Applications of sensors/streams

• ‘Smart house’: monitoring temperature, humidity etc

• Financial, sales, economic series

FIU 2007 C. Faloutsos 6

School of Computer ScienceCarnegie Mellon

Applications of sensors/streams

• ‘Smart house’: monitoring temperature, humidity etc

• Financial, sales, economic series

FIU 2007 C. Faloutsos 7

School of Computer ScienceCarnegie Mellon

Motivation - Applications• Medical: ECGs +; blood

pressure etc monitoring

• Scientific data: seismological; astronomical; environment / anti-pollution; meteorological

Sunspot Data

0

50

100

150

200

250

300

FIU 2007 C. Faloutsos 8

School of Computer ScienceCarnegie Mellon

Motivation - Applications (cont’d)

• civil/automobile infrastructure

– bridge vibrations [Oppenheim+02]

– road conditions / traffic monitoring

Automobile traffic

0200400600800

100012001400160018002000

time

# cars

FIU 2007 C. Faloutsos 9

School of Computer ScienceCarnegie Mellon

Motivation - Applications (cont’d)

• Computer systems

– web servers (buffering, prefetching)

– network traffic monitoring

– ...

http://repository.cs.vt.edu/lbl-conn-7.tar.Z

FIU 2007 C. Faloutsos 10

School of Computer ScienceCarnegie Mellon

Web traffic

• [Crovella Bestavros, SIGMETRICS’96]

FIU 2007 C. Faloutsos 11

School of Computer ScienceCarnegie Mellon

...

survivable,self-managing storage

infrastructure

...

a storage brick(0.5–5 TB)~1 PB

“self-*” = self-managing, self-tuning, self-healing, … Goal: 1 petabyte (PB) for CMU researchers www.pdl.cmu.edu/SelfStar

Self-* Storage (Ganger+)

FIU 2007 C. Faloutsos 12

School of Computer ScienceCarnegie Mellon

...

survivable,self-managing storage

infrastructure

...

a storage brick(0.5–5 TB)~1 PB

“self-*” = self-managing, self-tuning, self-healing, …

Self-* Storage (Ganger+)

FIU 2007 C. Faloutsos 13

School of Computer ScienceCarnegie Mellon

Problem definition

• Given: one or more sequences x1 , x2 , … , xt , …; (y1, y2, … , yt, …)

• Find – patterns; clusters; outliers; forecasts;

FIU 2007 C. Faloutsos 14

School of Computer ScienceCarnegie Mellon

Problem #1

• Find patterns, in large datasets

time

# bytes

FIU 2007 C. Faloutsos 15

School of Computer ScienceCarnegie Mellon

Problem #1

• Find patterns, in large datasets

time

# bytes

Poisson indep., ident. distr

FIU 2007 C. Faloutsos 16

School of Computer ScienceCarnegie Mellon

Problem #1

• Find patterns, in large datasets

time

# bytes

Poisson indep., ident. distr

FIU 2007 C. Faloutsos 17

School of Computer ScienceCarnegie Mellon

Problem #1

• Find patterns, in large datasets

time

# bytes

Poisson indep., ident. distr

Q: Then, how to generatesuch bursty traffic?

FIU 2007 C. Faloutsos 18

School of Computer ScienceCarnegie Mellon

Overview

• Goals/ motivation: find patterns in large datasets:– (A) Sensor data

– (B) network/graph data

• Solutions: self-similarity and power laws• Discussion

FIU 2007 C. Faloutsos 19

School of Computer ScienceCarnegie Mellon

Problem #2 - network and graph mining• How does the Internet look like?• How does the web look like?• What constitutes a ‘normal’ social

network?• What is the ‘network value’ of a

customer? • which gene/species affects the others

the most?

FIU 2007 C. Faloutsos 20

School of Computer ScienceCarnegie Mellon

Network and graph mining

Food Web [Martinez ’91]

Protein Interactions [genomebiology.com]

Friendship Network [Moody ’01]

Graphs are everywhere!

FIU 2007 C. Faloutsos 21

School of Computer ScienceCarnegie Mellon

Problem#2Given a graph:

• which node to market-to / defend / immunize first?

• Are there un-natural sub-graphs? (eg., criminals’ rings)?

[from Lumeta: ISPs 6/1999]

FIU 2007 C. Faloutsos 22

School of Computer ScienceCarnegie Mellon

Solutions

• New tools: power laws, self-similarity and ‘fractals’ work, where traditional assumptions fail

• Let’s see the details:

FIU 2007 C. Faloutsos 23

School of Computer ScienceCarnegie Mellon

Overview

• Goals/ motivation: find patterns in large datasets:– (A) Sensor data– (B) network/graph data

• Solutions: self-similarity and power laws

• Discussion

FIU 2007 C. Faloutsos 24

School of Computer ScienceCarnegie Mellon

What is a fractal?

= self-similar point set, e.g., Sierpinski triangle:

...zero area: (3/4)^inf

infinite length!

(4/3)^inf

Q: What is its dimensionality??

FIU 2007 C. Faloutsos 25

School of Computer ScienceCarnegie Mellon

What is a fractal?

= self-similar point set, e.g., Sierpinski triangle:

...zero area: (3/4)^inf

infinite length!

(4/3)^inf

Q: What is its dimensionality??A: log3 / log2 = 1.58 (!?!)

FIU 2007 C. Faloutsos 26

School of Computer ScienceCarnegie Mellon

Intrinsic (‘fractal’) dimension

• Q: fractal dimension of a line?

• Q: fd of a plane?

FIU 2007 C. Faloutsos 27

School of Computer ScienceCarnegie Mellon

Intrinsic (‘fractal’) dimension

• Q: fractal dimension of a line?

• A: nn ( <= r ) ~ r^1(‘power law’: y=x^a)

• Q: fd of a plane?• A: nn ( <= r ) ~ r^2fd== slope of (log(nn) vs..

log(r) )

FIU 2007 C. Faloutsos 28

School of Computer ScienceCarnegie Mellon

Sierpinsky triangle

log( r )

log(#pairs within <=r )

1.58

== ‘correlation integral’

= CDF of pairwise distances

FIU 2007 C. Faloutsos 29

School of Computer ScienceCarnegie Mellon

Observations: Fractals <-> power laws

Closely related:

• fractals <=>

• self-similarity <=>

• scale-free <=>

• power laws ( y= xa ; F=K r-2)

• (vs y=e-ax or y=xa+b)log( r )

log(#pairs within <=r )

1.58

FIU 2007 C. Faloutsos 30

School of Computer ScienceCarnegie Mellon

Outline

• Problems

• Self-similarity and power laws

• Solutions to posed problems

• Discussion

FIU 2007 C. Faloutsos 31

School of Computer ScienceCarnegie Mellon

time

#bytes

Solution #1: traffic

• disk traces: self-similar: (also: [Leland+94])• How to generate such traffic?

FIU 2007 C. Faloutsos 32

School of Computer ScienceCarnegie Mellon

Solution #1: traffic

• disk traces (80-20 ‘law’) – ‘multifractals’

time

#bytes

20% 80%

FIU 2007 C. Faloutsos 33

School of Computer ScienceCarnegie Mellon

80-20 / multifractals20 80

FIU 2007 C. Faloutsos 34

School of Computer ScienceCarnegie Mellon

80-20 / multifractals20

• p ; (1-p) in general

• yes, there are dependencies

80

FIU 2007 C. Faloutsos 35

School of Computer ScienceCarnegie Mellon

More on 80/20: PQRS

• Part of ‘self-* storage’ project

time

cylinder#

FIU 2007 C. Faloutsos 36

School of Computer ScienceCarnegie Mellon

More on 80/20: PQRS

• Part of ‘self-* storage’ project

p q

r s

q

r s

FIU 2007 C. Faloutsos 37

School of Computer ScienceCarnegie Mellon

Overview

• Goals/ motivation: find patterns in large datasets:– (A) Sensor data

– (B) network/graph data

• Solutions: self-similarity and power laws– sensor/traffic data

– network/graph data

• Discussion

FIU 2007 C. Faloutsos 38

School of Computer ScienceCarnegie Mellon

Problem #2 - topology

How does the Internet look like? Any rules?

FIU 2007 C. Faloutsos 39

School of Computer ScienceCarnegie Mellon

Patterns?

• avg degree is, say 3.3• pick a node at random

– guess its degree, exactly (-> “mode”)

degree

count

avg: 3.3

FIU 2007 C. Faloutsos 40

School of Computer ScienceCarnegie Mellon

Patterns?

• avg degree is, say 3.3• pick a node at random

– guess its degree, exactly (-> “mode”)

• A: 1!!

degree

count

avg: 3.3

FIU 2007 C. Faloutsos 41

School of Computer ScienceCarnegie Mellon

Patterns?

• avg degree is, say 3.3• pick a node at random

- what is the degree you expect it to have?

• A: 1!!• A’: very skewed distr.• Corollary: the mean is

meaningless!• (and std -> infinity (!))

degree

count

avg: 3.3

FIU 2007 C. Faloutsos 42

School of Computer ScienceCarnegie Mellon

Solution#2: Rank exponent R• A1: Power law in the degree distribution

[SIGCOMM99]

internet domains

log(rank)

log(degree)

-0.82

att.com

ibm.com

FIU 2007 C. Faloutsos 43

School of Computer ScienceCarnegie Mellon

Solution#2’: Eigen Exponent E

• A2: power law in the eigenvalues of the adjacency matrix

E = -0.48

Exponent = slope

Eigenvalue

Rank of decreasing eigenvalue

May 2001

FIU 2007 C. Faloutsos 44

School of Computer ScienceCarnegie Mellon

Power laws - discussion

• do they hold, over time?

• do they hold on other graphs/domains?

FIU 2007 C. Faloutsos 45

School of Computer ScienceCarnegie Mellon

Power laws - discussion

• do they hold, over time?

• Yes! for multiple years [Siganos+]

• do they hold on other graphs/domains?

• Yes!– web sites and links [Tomkins+], [Barabasi+]– peer-to-peer graphs (gnutella-style)– who-trusts-whom (epinions.com)

FIU 2007 C. Faloutsos 46

School of Computer ScienceCarnegie Mellon

Time Evolution: rank R

-1

-0.9

-0.8

-0.7

-0.6

-0.50 200 400 600 800

Instances in time: Nov'97 and on

Ra

nk

ex

po

ne

nt

• The rank exponent has not changed! [Siganos+]

Domainlevel

log(rank)

log(degree)

-0.82

att.com

ibm.com

FIU 2007 C. Faloutsos 47

School of Computer ScienceCarnegie Mellon

The Peer-to-Peer Topology

• Number of immediate peers (= degree), follows a power-law

[Jovanovic+]

degree

count

FIU 2007 C. Faloutsos 48

School of Computer ScienceCarnegie Mellon

epinions.com

• who-trusts-whom [Richardson + Domingos, KDD 2001]

(out) degree

count

FIU 2007 C. Faloutsos 49

School of Computer ScienceCarnegie Mellon

Why care about these patterns?

• better graph generators [BRITE, INET]– for simulations– extrapolations

• ‘abnormal’ graph and subgraph detection

FIU 2007 C. Faloutsos 50

School of Computer ScienceCarnegie Mellon

Recent discoveries [KDD’05]

• How do graphs evolve?

• degree-exponent seems constant - anything else?

FIU 2007 C. Faloutsos 51

School of Computer ScienceCarnegie Mellon

Evolution of diameter?

• Prior analysis, on power-law-like graphs, hints that

diameter ~ O(log(N)) or

diameter ~ O( log(log(N)))

• i.e.., slowly increasing with network size

• Q: What is happening, in reality?

FIU 2007 C. Faloutsos 52

School of Computer ScienceCarnegie Mellon

Evolution of diameter?

• Prior analysis, on power-law-like graphs, hints that

diameter ~ O(log(N)) or

diameter ~ O( log(log(N)))

• i.e.., slowly increasing with network size

• Q: What is happening, in reality?

• A: It shrinks(!!), towards a constant value

FIU 2007 C. Faloutsos 53

School of Computer ScienceCarnegie Mellon

Shrinking diameter

[Leskovec+05a]• Citations among physics

papers • 11yrs; @ 2003:

– 29,555 papers– 352,807 citations

• For each month M, create a graph of all citations up to month M

time

diameter

FIU 2007 C. Faloutsos 54

School of Computer ScienceCarnegie Mellon

Shrinking diameter• Authors &

publications• 1992

– 318 nodes

– 272 edges

• 2002– 60,000 nodes

• 20,000 authors

• 38,000 papers

– 133,000 edges

FIU 2007 C. Faloutsos 55

School of Computer ScienceCarnegie Mellon

Shrinking diameter• Patents & citations• 1975

– 334,000 nodes

– 676,000 edges

• 1999– 2.9 million nodes

– 16.5 million edges

• Each year is a datapoint

FIU 2007 C. Faloutsos 56

School of Computer ScienceCarnegie Mellon

Shrinking diameter• Autonomous

systems• 1997

– 3,000 nodes

– 10,000 edges

• 2000– 6,000 nodes

– 26,000 edges

• One graph per day N

diameter

FIU 2007 C. Faloutsos 57

School of Computer ScienceCarnegie Mellon

Temporal evolution of graphs

• N(t) nodes; E(t) edges at time t

• suppose that

N(t+1) = 2 * N(t)

• Q: what is your guess for

E(t+1) =? 2 * E(t)

FIU 2007 C. Faloutsos 58

School of Computer ScienceCarnegie Mellon

Temporal evolution of graphs

• N(t) nodes; E(t) edges at time t

• suppose that

N(t+1) = 2 * N(t)

• Q: what is your guess for

E(t+1) =? 2 * E(t)

• A: over-doubled!

FIU 2007 C. Faloutsos 59

School of Computer ScienceCarnegie Mellon

Temporal evolution of graphs

• A: over-doubled - but obeying:

E(t) ~ N(t)a for all t

where 1<a<2

FIU 2007 C. Faloutsos 60

School of Computer ScienceCarnegie Mellon

Densification Power Law

ArXiv: Physics papers

and their citations

1.69

N(t)

E(t)

FIU 2007 C. Faloutsos 61

School of Computer ScienceCarnegie Mellon

Densification Power Law

ArXiv: Physics papers

and their citations

1.69

N(t)

E(t)

‘tree’

1

FIU 2007 C. Faloutsos 62

School of Computer ScienceCarnegie Mellon

Densification Power Law

ArXiv: Physics papers

and their citations

1.69

N(t)

E(t)‘clique’

2

FIU 2007 C. Faloutsos 63

School of Computer ScienceCarnegie Mellon

Densification Power Law

U.S. Patents, citing each other

1.66

N(t)

E(t)

FIU 2007 C. Faloutsos 64

School of Computer ScienceCarnegie Mellon

Densification Power Law

Autonomous Systems

1.18

N(t)

E(t)

FIU 2007 C. Faloutsos 65

School of Computer ScienceCarnegie Mellon

Densification Power Law

ArXiv: authors & papers

1.15

N(t)

E(t)

FIU 2007 C. Faloutsos 66

School of Computer ScienceCarnegie Mellon

Outline

• problems

• Fractals

• Solutions

• Discussion – what else can they solve? – how frequent are fractals?

FIU 2007 C. Faloutsos 67

School of Computer ScienceCarnegie Mellon

What else can they solve?

• separability [KDD’02]• forecasting [CIKM’02]• dimensionality reduction [SBBD’00]• non-linear axis scaling [KDD’02]• disk trace modeling [PEVA’02]• selectivity of spatial/multimedia queries

[PODS’94, VLDB’95, ICDE’00]• ...

FIU 2007 C. Faloutsos 68

School of Computer ScienceCarnegie Mellon

Problem #3 - spatial d.m.

Galaxies (Sloan Digital Sky Survey w/ B. Nichol) - ‘spiral’ and ‘elliptical’

galaxies

- patterns? (not Gaussian; not uniform)

-attraction/repulsion?

- separability??

FIU 2007 C. Faloutsos 69

School of Computer ScienceCarnegie Mellon

Solution#3: spatial d.m.

log(r)

log(#pairs within <=r )

spi-spi

spi-ell

ell-ell

- 1.8 slope

- plateau!

- repulsion!

CORRELATION INTEGRAL!

FIU 2007 C. Faloutsos 70

School of Computer ScienceCarnegie Mellon

Solution#3: spatial d.m.

log(r)

log(#pairs within <=r )

spi-spi

spi-ell

ell-ell

- 1.8 slope

- plateau!

- repulsion!

[w/ Seeger, Traina, Traina, SIGMOD00]

FIU 2007 C. Faloutsos 71

School of Computer ScienceCarnegie Mellon

Solution#3: spatial d.m.

r1r2

r1

r2

Heuristic on choosing # of clusters

FIU 2007 C. Faloutsos 72

School of Computer ScienceCarnegie Mellon

Solution#3: spatial d.m.

log(r)

log(#pairs within <=r )

spi-spi

spi-ell

ell-ell

- 1.8 slope

- plateau!

- repulsion!

FIU 2007 C. Faloutsos 76

School of Computer ScienceCarnegie Mellon

Outline

• problems

• Fractals

• Solutions

• Discussion – what else can they solve? – how frequent are fractals?

FIU 2007 C. Faloutsos 77

School of Computer ScienceCarnegie Mellon

Fractals & power laws:

appear in numerous settings:

• medical

• geographical / geological

• social

• computer-system related

• <and many-many more! see [Mandelbrot]>

FIU 2007 C. Faloutsos 78

School of Computer ScienceCarnegie Mellon

Fractals: Brain scans

• brain-scans

octree levels

Log(#octants)

2.63 = fd

FIU 2007 C. Faloutsos 79

School of Computer ScienceCarnegie Mellon

More fractals

• periphery of malignant tumors: ~1.5

• benign: ~1.3

• [Burdet+]

FIU 2007 C. Faloutsos 80

School of Computer ScienceCarnegie Mellon

More fractals:

• cardiovascular system: 3 (!) lungs: ~2.9

FIU 2007 C. Faloutsos 81

School of Computer ScienceCarnegie Mellon

Fractals & power laws:

appear in numerous settings:

• medical

• geographical / geological

• social

• computer-system related

FIU 2007 C. Faloutsos 82

School of Computer ScienceCarnegie Mellon

More fractals:

• Coastlines: 1.2-1.58

1 1.1

1.3

FIU 2007 C. Faloutsos 83

School of Computer ScienceCarnegie Mellon

FIU 2007 C. Faloutsos 84

School of Computer ScienceCarnegie Mellon

More fractals:

• the fractal dimension for the Amazon river is 1.85 (Nile: 1.4)

[ems.gphys.unc.edu/nonlinear/fractals/examples.html]

FIU 2007 C. Faloutsos 85

School of Computer ScienceCarnegie Mellon

More fractals:

• the fractal dimension for the Amazon river is 1.85 (Nile: 1.4)

[ems.gphys.unc.edu/nonlinear/fractals/examples.html]

FIU 2007 C. Faloutsos 86

School of Computer ScienceCarnegie Mellon

Cross-roads of Montgomery county:

•any rules?

GIS points

FIU 2007 C. Faloutsos 87

School of Computer ScienceCarnegie Mellon

GIS

A: self-similarity:• intrinsic dim. = 1.51

log( r )

log(#pairs(within <= r))

1.51

FIU 2007 C. Faloutsos 88

School of Computer ScienceCarnegie Mellon

Examples:LB county

• Long Beach county of CA (road end-points)

1.7

log(r)

log(#pairs)

FIU 2007 C. Faloutsos 89

School of Computer ScienceCarnegie Mellon

More power laws: areas – Korcak’s law

Scandinavian lakes

Any pattern?

FIU 2007 C. Faloutsos 90

School of Computer ScienceCarnegie Mellon

More power laws: areas – Korcak’s law

Scandinavian lakes area vs complementary cumulative count (log-log axes)

log(count( >= area))

log(area)

FIU 2007 C. Faloutsos 91

School of Computer ScienceCarnegie Mellon

More power laws: Korcak

Japan islands;

area vs cumulative count (log-log axes) log(area)

log(count( >= area))

FIU 2007 C. Faloutsos 92

School of Computer ScienceCarnegie Mellon

More power laws

• Energy of earthquakes (Gutenberg-Richter law) [simscience.org]

log(count)

Magnitude = log(energy)day

Energy released

FIU 2007 C. Faloutsos 93

School of Computer ScienceCarnegie Mellon

Fractals & power laws:

appear in numerous settings:

• medical

• geographical / geological

• social

• computer-system related

FIU 2007 C. Faloutsos 94

School of Computer ScienceCarnegie Mellon

A famous power law: Zipf’s law

• Bible - rank vs. frequency (log-log)

log(rank)

log(freq)

“a”

“the”

“Rank/frequency plot”

FIU 2007 C. Faloutsos 95

School of Computer ScienceCarnegie Mellon

TELCO data

# of service units

count ofcustomers

‘best customer’

FIU 2007 C. Faloutsos 96

School of Computer ScienceCarnegie Mellon

SALES data – store#96

# units sold

count of products

“aspirin”

FIU 2007 C. Faloutsos 97

School of Computer ScienceCarnegie Mellon

Olympic medals (Sidney’00, Athens’04):

log( rank)

log(#medals)

0

0.5

1

1.5

2

2.5

0 0.5 1 1.5 2

athens

sidney

FIU 2007 C. Faloutsos 98

School of Computer ScienceCarnegie Mellon

Olympic medals (Sidney’00, Athens’04):

log( rank)

log(#medals)

0

0.5

1

1.5

2

2.5

0 0.5 1 1.5 2

athens

sidney

FIU 2007 C. Faloutsos 99

School of Computer ScienceCarnegie Mellon

Even more power laws:

• Income distribution (Pareto’s law)• size of firms

• publication counts (Lotka’s law)

FIU 2007 C. Faloutsos 100

School of Computer ScienceCarnegie Mellon

Even more power laws:

library science (Lotka’s law of publication count); and citation counts: (citeseer.nj.nec.com 6/2001)

log(#citations)

log(count)

Ullman

FIU 2007 C. Faloutsos 101

School of Computer ScienceCarnegie Mellon

Even more power laws:

• web hit counts [w/ A. Montgomery]

Web Site Traffic

log(freq)

log(count)

Zipf“yahoo.com”

FIU 2007 C. Faloutsos 102

School of Computer ScienceCarnegie Mellon

Fractals & power laws:

appear in numerous settings:

• medical

• geographical / geological

• social

• computer-system related

FIU 2007 C. Faloutsos 103

School of Computer ScienceCarnegie Mellon

Power laws, cont’d

• In- and out-degree distribution of web sites [Barabasi], [IBM-CLEVER]

log indegree

- log(freq)

from [Ravi Kumar, Prabhakar Raghavan, Sridhar Rajagopalan, Andrew Tomkins ]

FIU 2007 C. Faloutsos 104

School of Computer ScienceCarnegie Mellon

Power laws, cont’d

• In- and out-degree distribution of web sites [Barabasi], [IBM-CLEVER]

log indegree

log(freq)

from [Ravi Kumar, Prabhakar Raghavan, Sridhar Rajagopalan, Andrew Tomkins ]

FIU 2007 C. Faloutsos 105

School of Computer ScienceCarnegie Mellon

Power laws, cont’d

• In- and out-degree distribution of web sites [Barabasi], [IBM-CLEVER]

log indegree

log(freq)

Q: ‘how can we usethese power laws?’

FIU 2007 C. Faloutsos 106

School of Computer ScienceCarnegie Mellon

“Foiled by power law”

• [Broder+, WWW’00]

(log) in-degree

(log) count

FIU 2007 C. Faloutsos 107

School of Computer ScienceCarnegie Mellon

“Foiled by power law”

• [Broder+, WWW’00]

“The anomalous bump at 120on the x-axis is due a large clique formed by a single spammer”

(log) in-degree

(log) count

FIU 2007 C. Faloutsos 108

School of Computer ScienceCarnegie Mellon

Power laws, cont’d

• In- and out-degree distribution of web sites [Barabasi], [IBM-CLEVER]

• length of file transfers [Crovella+Bestavros ‘96]

• duration of UNIX jobs

FIU 2007 C. Faloutsos 109

School of Computer ScienceCarnegie Mellon

Additional projects

• Find anomalies in traffic matrices [SDM’07]• Find correlations in sensor/stream data

[VLDB’05]– Chlorine measurements, with Civ. Eng.– temperature measurements (INTEL/MIT)

• Virus propagation (SIS, SIR) [Wang+, ’03]• Graph partitioning [Chakrabarti+, KDD’04]

FIU 2007 C. Faloutsos 110

School of Computer ScienceCarnegie Mellon

Conclusions

• Fascinating problems in Data Mining: find patterns in– sensors/streams – graphs/networks

FIU 2007 C. Faloutsos 111

School of Computer ScienceCarnegie Mellon

Conclusions - cont’d

New tools for Data Mining: self-similarity & power laws: appear in many cases

Bad news:

lead to skewed distributions

(no Gaussian, Poisson,

uniformity, independence,

mean, variance)

Good news:• ‘correlation integral’

for separability• rank/frequency plots• 80-20 (multifractals)• (Hurst exponent, • strange attractors,• renormalization theory, • ++)

FIU 2007 C. Faloutsos 112

School of Computer ScienceCarnegie Mellon

Resources

• Manfred Schroeder “Chaos, Fractals and Power Laws”, 1991

FIU 2007 C. Faloutsos 113

School of Computer ScienceCarnegie Mellon

References

• [vldb95] Alberto Belussi and Christos Faloutsos, Estimating the Selectivity of Spatial Queries Using the `Correlation' Fractal Dimension Proc. of VLDB, p. 299-310, 1995

• [Broder+’00] Andrei Broder, Ravi Kumar , Farzin Maghoul1, Prabhakar Raghavan , Sridhar Rajagopalan , Raymie Stata, Andrew Tomkins , Janet Wiener, Graph structure in the web , WWW’00

• M. Crovella and A. Bestavros, Self similarity in World wide web traffic: Evidence and possible causes , SIGMETRICS ’96.

FIU 2007 C. Faloutsos 114

School of Computer ScienceCarnegie Mellon

References

• J. Considine, F. Li, G. Kollios and J. Byers, Approximate Aggregation Techniques for Sensor Databases (ICDE’04, best paper award).

• [pods94] Christos Faloutsos and Ibrahim Kamel, Beyond Uniformity and Independence: Analysis of R-trees Using the Concept of Fractal Dimension, PODS, Minneapolis, MN, May 24-26, 1994, pp. 4-13

FIU 2007 C. Faloutsos 115

School of Computer ScienceCarnegie Mellon

References

• [vldb96] Christos Faloutsos, Yossi Matias and Avi Silberschatz, Modeling Skewed Distributions Using Multifractals and the `80-20 Law’ Conf. on Very Large Data Bases (VLDB), Bombay, India, Sept. 1996.

• [sigmod2000] Christos Faloutsos, Bernhard Seeger, Agma J. M. Traina and Caetano Traina Jr., Spatial Join Selectivity Using Power Laws, SIGMOD 2000

FIU 2007 C. Faloutsos 116

School of Computer ScienceCarnegie Mellon

References

• [vldb96] Christos Faloutsos and Volker Gaede Analysis of the Z-Ordering Method Using the Hausdorff Fractal Dimension VLD, Bombay, India, Sept. 1996

• [sigcomm99] Michalis Faloutsos, Petros Faloutsos and Christos Faloutsos, What does the Internet look like? Empirical Laws of the Internet Topology, SIGCOMM 1999

FIU 2007 C. Faloutsos 117

School of Computer ScienceCarnegie Mellon

References

• [Leskovec 05] Jure Leskovec, Jon M. Kleinberg, Christos Faloutsos: Graphs over time: densification laws, shrinking diameters and possible explanations. KDD 2005: 177-187

FIU 2007 C. Faloutsos 118

School of Computer ScienceCarnegie Mellon

References

• [ieeeTN94] W. E. Leland, M.S. Taqqu, W. Willinger, D.V. Wilson, On the Self-Similar Nature of Ethernet Traffic, IEEE Transactions on Networking, 2, 1, pp 1-15, Feb. 1994.

• [brite] Alberto Medina, Anukool Lakhina, Ibrahim Matta, and John Byers. BRITE: An Approach to Universal Topology Generation. MASCOTS '01

FIU 2007 C. Faloutsos 119

School of Computer ScienceCarnegie Mellon

References

• [icde99] Guido Proietti and Christos Faloutsos, I/O complexity for range queries on region data stored using an R-tree (ICDE’99)

• Stan Sclaroff, Leonid Taycher and Marco La Cascia , "ImageRover: A content-based image browser for the world wide web" Proc. IEEE Workshop on Content-based Access of Image and Video Libraries, pp 2-9, 1997.

FIU 2007 C. Faloutsos 120

School of Computer ScienceCarnegie Mellon

References

• [kdd2001] Agma J. M. Traina, Caetano Traina Jr., Spiros Papadimitriou and Christos Faloutsos: Tri-plots: Scalable Tools for Multidimensional Data Mining, KDD 2001, San Francisco, CA.

FIU 2007 C. Faloutsos 121

School of Computer ScienceCarnegie Mellon

Thank you!

Contact info:christos <at> cs.cmu.edu

www. cs.cmu.edu /~christos

(w/ papers, datasets, code for fractal dimension estimation, etc)