
CMU SCS

15-826: Multimedia Databases

and Data Mining

Fractals - introduction

C. Faloutsos

15-826 Copyright: C. Faloutsos (2007) 2


Outline

Goal: ‘Find similar / interesting things’

• Intro to DB

• Indexing - similarity search

• Data Mining


Indexing - Detailed outline

• primary key indexing

• secondary key / multi-key indexing

• spatial access methods

– z-ordering

– R-trees

– misc

• fractals

– intro

– applications

• text


Intro to fractals - outline

• Motivation – 3 problems / case studies

• Definition of fractals and power laws

• Solutions to posed problems

• More examples and tools

• Discussion - putting fractals to work!

• Conclusions – practitioner’s guide

• Appendix: gory details - boxcounting plots

Problem #1: GIS - points

Road end-points of Montgomery county:

• Q1: how many disk accesses (d.a.) for an R-tree?

• Q2: distribution?

– not uniform

– not Gaussian

– no rules??

Problem #2 - spatial d.m.

Galaxies (Sloan Digital Sky Survey w/ B. Nichol)

[Figure: scatter plot of "elliptical.cut.dat" and "spiral.cut.dat"; both axes run from -1 to 1]

- ‘spiral’ and ‘elliptical’ galaxies (stores and households ...)

- patterns?

- attraction/repulsion?

- how many ‘spi’ within r from an ‘ell’?

Problem #3: traffic

• disk trace (from HP - J. Wilkes); Web traffic - fit a model

• how many explosions to expect?

• queue length distr.?

[Figure: # bytes vs time]

Problem #3: traffic

[Figure: # bytes vs time; panel labelled Poisson (indep., ident. distr)]

Q: Then, how to generate such bursty traffic?


Common answer:

• Fractals / self-similarities / power laws

• Seminal works from Hilbert, Minkowski,

Cantor, Mandelbrot, (Hausdorff, Lyapunov,

Ken Wilson, …)


Road map

• Motivation – 3 problems / case studies

• Definition of fractals and power laws

• Solutions to posed problems

• More examples and tools

• Discussion - putting fractals to work!

• Conclusions – practitioner’s guide

• Appendix: gory details - boxcounting plots


What is a fractal?

= self-similar point set, e.g., Sierpinski triangle:

...zero area;

infinite length!

Definitions (cont’d)

• Paradox: infinite perimeter; zero area!

• ‘dimensionality’: between 1 and 2

• actually: Log(3)/Log(2) = 1.58...

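Dfn of fd (fractal dimension):

ONLY for a perfectly self-similar point set:

fd = log(n)/log(f) = log(3)/log(2) = 1.58

(n: number of miniature copies; f: scale-down factor of each copy. The Sierpinski triangle consists of n = 3 copies, each shrunk by f = 2.)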

...zero area;

infinite length!
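A minimal sketch of the formula above, for perfectly self-similar sets only. The function name `similarity_dimension` and its parameter names are illustrative choices, not taken from the slides.

```python
import math

def similarity_dimension(n_copies, scale_factor):
    """fd = log(n) / log(f) for a set made of n_copies miniatures,
    each shrunk by scale_factor (perfect self-similarity assumed)."""
    return math.log(n_copies) / math.log(scale_factor)

sierpinski = similarity_dimension(3, 2)  # 3 copies, each half the size
line = similarity_dimension(2, 2)        # a segment is 2 half-size segments
square = similarity_dimension(4, 2)      # a square is 4 half-size squares

print(round(sierpinski, 2), line, square)
```

The Euclidean cases (line, square) come out as integers, while the Sierpinski triangle lands between 1 and 2.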

Intrinsic (‘fractal’) dimension

• Q: fractal dimension of a line?

• A: 1 (= log(2)/log(2)!)

Intrinsic (‘fractal’) dimension

• Q: dfn for a given set of points?

x   y
5   1
4   2
3   3
2   4

Intrinsic (‘fractal’) dimension

(nn( <= r ): number of neighbors within distance r)

• Q: fractal dimension of a line?

• A: nn( <= r ) ~ r^1 (‘power law’: y = x^a)

• Q: fd of a plane?

• A: nn( <= r ) ~ r^2

fd == slope of ( log(nn) vs log(r) )

Intrinsic (‘fractal’) dimension

• Algorithm, to estimate it?

Notice

• avg nn( <= r ) is exactly tot#pairs( <= r ) / N, including ‘mirror’ pairs (each pair counted in both directions)

Sierpinski triangle

[Figure: log(#pairs within <= r) vs log(r); slope 1.58]

== ‘correlation integral’
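The pair-count estimate can be sketched as follows. This is an illustration under stated assumptions, not the course's implementation: the points come from the 'chaos game' (an assumed way to sample a Sierpinski triangle), the pair count is the naive quadratic one, and the radii, sample size and helper names are arbitrary choices.

```python
import math
import random

def sierpinski_points(n, seed=0):
    """Chaos game: repeatedly jump halfway toward a random corner."""
    rng = random.Random(seed)
    corners = [(0.0, 0.0), (1.0, 0.0), (0.5, math.sqrt(3) / 2)]
    x, y = 0.3, 0.3
    pts = []
    for i in range(n + 20):
        cx, cy = rng.choice(corners)
        x, y = (x + cx) / 2, (y + cy) / 2
        if i >= 20:  # skip the first iterations, which may lie off the set
            pts.append((x, y))
    return pts

def pair_counts(points, radii):
    """Number of unordered pairs at Euclidean distance <= r, for each r."""
    counts = [0] * len(radii)
    for i in range(len(points)):
        xi, yi = points[i]
        for j in range(i + 1, len(points)):
            d = math.hypot(xi - points[j][0], yi - points[j][1])
            for k, r in enumerate(radii):
                if d <= r:
                    counts[k] += 1
    return counts

def loglog_slope(xs, ys):
    """Least-squares slope of log(y) versus log(x)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx, my = sum(lx) / len(lx), sum(ly) / len(ly)
    num = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    return num / sum((a - mx) ** 2 for a in lx)

points = sierpinski_points(1000)
radii = [2.0 ** -k for k in range(2, 7)]  # 1/4 down to 1/64
fd = loglog_slope(radii, pair_counts(points, radii))
print(fd)  # should land near log(3)/log(2), up to sampling noise
```

Counting unordered pairs instead of 'mirror' pairs only shifts the log-log line by a constant, so the slope is unaffected.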


Observations:

• Euclidean objects have integer fractal

dimensions

– point: 0

– lines and smooth curves: 1

– smooth surfaces: 2

• fractal dimension -> roughness of the

periphery

Important properties

• fd = embedding dimension -> uniform pointset

• a point set may have several fd, depending on scale

[Figure: the same point set, labelled 2-d, 1-d and 0-d on successive slides]


Road map

• Motivation – 3 problems / case studies

• Definition of fractals and power laws

• Solutions to posed problems

• More examples and tools

• Discussion - putting fractals to work!

• Conclusions – practitioner’s guide

• Appendix: gory details - boxcounting plots

Problem #1: GIS points

Cross-roads of Montgomery county:

• any rules?

Solution #1

A: self-similarity ->

• <=> fractals

• <=> scale-free

• <=> power-laws (y = x^a, F = C*r^(-2))

• avg#neighbors( <= r ) ~ r^D, here ~ r^(1.51)

[Figure: log(#pairs(within <= r)) vs log(r); slope 1.51]

Examples: MG county

• Montgomery County of MD (road end-points)

Examples: LB county

• Long Beach county of CA (road end-points)

Solution#2: spatial d.m.

Galaxies (‘BOPS’ plot - [sigmod2000])

[Figure: log(#pairs) vs log(r) for "ell-ell.points.ns", "spi-spi.points.ns" and "spi.dat-ell.dat.points"; r from 1e-08 to 100, #pairs from 1 to 1e+10; shown with the scatter plot of "elliptical.cut.dat" and "spiral.cut.dat"]

Solution#2: spatial d.m.

[Figure: log(#pairs within <= r) vs log(r), with curves spi-spi, spi-ell and ell-ell]

- 1.8 slope

- plateau!

- repulsion!!

- duplicates

Heuristic on choosing # of clusters (scales r1 and r2 are marked on the plot)

Solution #3: traffic

• disk traces: self-similar

[Figure: #bytes vs time]

Solution #3: traffic

• disk traces (80-20 ‘law’ = ‘multifractal’)

[Figure: #bytes vs time, split into 20% and 80%]

80-20 / multifractals

[Figure: number of Kb vs time (in milliseconds), with the 20 / 80 split marked]

80-20 / multifractals

• p ; (1-p) in general

• yes, there are dependencies

[Figure: 20 / 80 split]
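One way to read the p ; (1-p) rule is as a recursive split of the total volume over time. The sketch below is a hedged illustration: the function name `split_traffic`, the random choice of which half receives the larger share, and the numbers used are all assumptions, not details given on the slides.

```python
import random

def split_traffic(total, levels, p=0.8, seed=0):
    """Recursively split `total` over 2**levels time slots.
    At every split one half gets fraction p and the other 1 - p;
    which half gets the larger share is picked at random (assumption)."""
    rng = random.Random(seed)
    volumes = [float(total)]
    for _ in range(levels):
        nxt = []
        for v in volumes:
            a, b = v * p, v * (1 - p)
            if rng.random() < 0.5:
                a, b = b, a
            nxt.extend((a, b))
        volumes = nxt
    return volumes

trace = split_traffic(total=1_000_000, levels=10)
print(len(trace), max(trace), min(trace))
```

With p = 0.5 every slot gets the same volume; as p moves toward 1 the trace becomes burstier, with the busiest slot holding total * p**levels.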

More on 80/20: PQRS

• Part of ‘self-* storage’ project

[Figure: cylinder# vs time, split into quadrants labelled p, q, r, s]


Solution#3: traffic

Clarification:

• fractal: a set of points that is self-similar

• multifractal: a probability density function

that is self-similar

Many other time-sequences are

bursty/clustered: (such as?)

Example:

• network traffic

[Figure: number of Kb vs time (in milliseconds)]

http://repository.cs.vt.edu/lbl-conn-7.tar.Z

Web traffic

• [Crovella Bestavros, SIGMETRICS’96]

[Figure: four panels of Bytes vs chronological time, in slots of 1000 sec, 100 sec, 10 sec and 1 sec]

[Excerpt from the paper, as shown on the slide:]

Figure 2: Traffic Bursts over Four Orders of Magnitude; Upper Left: 1000, Upper Right: 100, Lower Left: 10, and Lower Right: 1 Second Aggregations. (Actual Transfers)

this effect is by visually inspecting time series plots of traffic demand.

In Figure 2 we show four time series plots of the WWW traffic induced by our reference traces. The plots are produced by aggregating byte traffic into discrete bins of 1, 10, 100, or 1000 seconds.

The upper left plot is a complete presentation of the entire traffic time series using 1000 second (16.6 minute) bins. The diurnal cycle of network demand is clearly evident, and day to day activity shows noticeable bursts. However, even within the active portion of a single day there is significant burstiness; this is shown in the upper right plot, which uses a 100 second timescale and is taken from a typical day in the middle of the dataset. Finally, the lower left plot shows a portion of the 100 second plot, expanded to 10 second detail; and the lower right plot shows a portion of the lower left expanded to 1 second detail. These plots show significant bursts occurring at the second-to-second level.

4.2.2 Statistical Analysis

We used the four methods for assessing self-similarity described in Section 2: the variance-time plot, the rescaled range (or R/S) plot, the periodogram plot, and the Whittle estimator. We concentrated on individual hours from our traffic series, so as to provide as nearly a stationary dataset as possible.

To provide an example of these approaches, analysis of a single hour (4pm to 5pm, Thursday 5 Feb 1995) is shown in Figure 3. The figure shows plots for the three graphical methods: variance-time (upper left), rescaled range (upper right), and periodogram (lower center). The variance-time plot is linear and shows a slope that is distinctly different from -1 (which is shown for comparison); the slope is estimated using regression as -0.48, yielding an estimate for H of 0.76. The R/S plot shows an asymptotic slope that is different from 0.5 and from 1.0 (shown for comparison); it is estimated using regression as 0.75, which is also the corresponding estimate of H. The periodogram plot shows a slope of -0.66 (the regression line is shown), yielding an estimate of H as 0.83. Finally, the Whittle estimator for this dataset (not a graphical method) yields an estimate of H = 0.82 with a 95% confidence interval of (0.77, 0.87).

As discussed in Section 2.1, the Whittle estimator is the only method that yields confidence intervals on H, but short-range dependence in the timeseries can introduce inaccuracies in its results. These inaccuracies are minimized by m-aggregating the timeseries for successively large values of m, and looking for a value of H around which the Whittle estimator stabilizes.

The results of this method for four busy hours are shown in Figure 4. Each hour is shown in one plot, from the busiest hour in the upper left to the least busy hour in the lower right. In these figures the solid line is the value of the Whittle estimate of H as a function of the aggregation level m of the dataset. The upper and lower dotted lines are the limits of the 95% confidence interval on H. The three level lines represent the estimate of H for the unaggregated dataset as given by the variance-time, R-S, and periodogram methods.

The figure shows that for each dataset, the estimate of H stays relatively consistent as the aggregation level is increased, and that the estimates given by the three graphical methods

Tape accesses

[Figure: accesses over time, on Tape#1 ... Tape#N]

# tapes needed, to retrieve n records?

(# days down, due to failures / hurricanes / communication noise...)

Tape accesses

[Figure "LD16": avg # blocks (# tapes retrieved, 1 to 16) vs # qual. records (0 to 200); curves rdb50.out, rdb60.out, rdb75.out, rdb85.out, rdb86.out, rdb87.out, rdb90.out and LD16.avg; annotations: ‘50-50 = Poisson’ and ‘real’]


Road map

• Motivation – 3 problems / case studies

• Definition of fractals and power laws

• Solutions to posed problems

• More tools and examples

• Discussion - putting fractals to work!

• Conclusions – practitioner’s guide

• Appendix: gory details - boxcounting plots

A counter-intuitive example

• avg degree is, say 3.3

• pick a node at random - what is the degree you expect it to have? (guess its degree, exactly -> “mode”)

• A: 1!!

• A’: very skewed distr.

• Corollary: the mean is meaningless!

• (and std -> infinity (!))

[Figure: count vs degree, with avg: 3.3 marked]
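A small numeric illustration of the point above. The degree counts are hypothetical: they are generated from count ~ degree^(-2.2), an exponent borrowed from the Internet degree plot later in the deck, so the resulting mean will not equal the 3.3 quoted on the slide.

```python
# Hypothetical degree histogram: count ~ degree^(-2.2), degrees 1..100
counts = {d: round(10000 * d ** -2.2) for d in range(1, 101)}

n_nodes = sum(counts.values())
mean_degree = sum(d * c for d, c in counts.items()) / n_nodes
mode_degree = max(counts, key=counts.get)    # most frequent degree
share_of_mode = counts[mode_degree] / n_nodes

print(mean_degree, mode_degree, share_of_mode)
```

Most nodes sit at the mode (degree 1), while the mean is pulled upward by a few high-degree nodes, which is why the average is a poor guess for a randomly picked node.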

Rank exponent R

• Power law in the degree distribution [SIGCOMM99]

[Figure: internet domains, log(degree) vs log(rank); slope -0.82; att.com and ibm.com labelled]


More tools

• Zipf’s law

• Korcak’s law / “fat fractals”

A famous power law: Zipf’s law

• Q: vocabulary word frequency in a document - any pattern?

[Figure: freq. per vocabulary word, from ‘aaron’ to ‘zoo’]

• Bible - rank vs frequency (log-log)

• similarly, in many other languages; for customers and sales volume; city populations etc etc

[Figure: log(freq) vs log(rank); “the” and “a” labelled]

A famous power law: Zipf’s law

• Zipf distr: freq = 1 / rank

• generalized Zipf: freq = 1 / (rank)^a

[Figure: log(freq) vs log(rank)]
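The rank/frequency plot and the exponent a can be sketched as follows. This is an illustration only: the corpus is synthetic (word w_i occurs about 1000 / i times), the helper names are made up, and a plain least-squares fit on the log-log points is one simple way to estimate a, not necessarily the one used in the course.

```python
import math
from collections import Counter

def rank_frequency(tokens):
    """Frequencies in decreasing order; index 0 corresponds to rank 1."""
    return sorted(Counter(tokens).values(), reverse=True)

def zipf_exponent(freqs):
    """Fit log(freq) = c - a * log(rank) by least squares and return a."""
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return -slope

# Synthetic corpus (assumption): 50 distinct words, freq ~ 1000 / rank
tokens = [f"w{i}" for i in range(1, 51) for _ in range(round(1000 / i))]
freqs = rank_frequency(tokens)
a = zipf_exponent(freqs)
print(freqs[:5], a)
```

On real text the same two functions apply after splitting the document into words; the fitted a is then the slope magnitude of the log-log rank/frequency plot.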

Olympic medals (Sydney):

[Figure: log(#medals) vs rank; linear fit y = -0.9676x + 2.3054, R^2 = 0.9458]

Olympic medals (Sydney’00, Athens’04):

[Figure: log(#medals) vs log(rank), one series for Athens and one for Sydney]

TELCO data

[Figure: count of customers vs # of service units; ‘best customer’ labelled]

SALES data – store#96

[Figure: count of products vs # units sold; “aspirin” labelled]

More power laws: areas – Korcak’s law

Scandinavian lakes

Any pattern?

area vs complementary cumulative count (log-log axes)

[Figure: “Area distribution of regions (SL dataset)”: log(count( >= area)) vs log(area), with fitted line -0.85*x +10.8]
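The complementary cumulative count used in this plot can be computed as in the sketch below. The function name `ccdf` and the sample areas are hypothetical; plotting the returned pairs on log-log axes gives the kind of plot described on the slide.

```python
def ccdf(values):
    """Return (value, number of items >= value) for each distinct value,
    in ascending order of value."""
    ordered = sorted(values)
    n = len(ordered)
    out = []
    for i, v in enumerate(ordered):
        if i == 0 or v != ordered[i - 1]:
            out.append((v, n - i))  # items at positions i..n-1 are >= v
    return out

# Hypothetical lake areas, for illustration only
areas = [1, 1, 2, 5]
print(ccdf(areas))
```

Sorting once keeps the computation at O(n log n), and duplicates collapse into a single point whose count includes all of them.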

More power laws: Korcak

Japan islands; area vs cumulative count (log-log axes)

[Figure: “Area distribution of Japan Archipelago”: log(count( >= area)) vs log(area); number of islands from 1 to 1000, area from 1 to 100000]


(Korcak’s law: Aegean islands)

Korcak’s law & “fat fractals”

Q: How to generate such regions?

A: recursively, from a single region

so far we’ve seen:

• concepts:

– fractals, multifractals and fat fractals

• tools:

– correlation integral (= pair-count plot)

– rank/frequency plot (Zipf’s law)

– CCDF (Korcak’s law)

(the rank/frequency plot and the CCDF carry the same info)


Road map

• Motivation – 3 problems / case studies

• Definition of fractals and power laws

• Solutions to posed problems

• More tools and examples

• Discussion - putting fractals to work!

• Conclusions – practitioner’s guide

• Appendix: gory details - boxcounting plots

Other applications: Internet

• What does the Internet look like?

• Internet routers: how many neighbors within h hops?

[Figure: Internet graph, with CMU marked]


(reminder: our tool-box:)

• concepts:

– fractals, multifractals and fat fractals

• tools:

– correlation integral (= pair-count plot)

– rank/frequency plot (Zipf’s law)

– CCDF (Korcak’s law)

Internet topology

• Internet routers: how many neighbors within h hops?

Reachability function: number of neighbors within r hops, vs r (log-log).

Mbone routers, 1995

[Figure: log(#pairs) vs log(hops); slope 2.8]
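A sketch of how the pair counts behind such a plot could be obtained with breadth-first search. The function name `pairs_within_hops`, the adjacency-list representation and the toy graph are assumptions; the slides do not say how the counts were computed.

```python
from collections import deque

def pairs_within_hops(adj, max_hops):
    """For h = 1..max_hops, count ordered pairs (u, v), u != v,
    whose hop distance is at most h. `adj` maps node -> neighbor list."""
    at_distance = [0] * (max_hops + 1)
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            if dist[u] == max_hops:
                continue  # do not explore beyond the largest radius
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        for d in dist.values():
            if d > 0:
                at_distance[d] += 1
    totals, running = [], 0
    for h in range(1, max_hops + 1):
        running += at_distance[h]
        totals.append(running)
    return totals

# Toy graph (hypothetical): a path 0 - 1 - 2 - 3
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(pairs_within_hops(path, 3))
```

Plotting log(count) against log(h) and reading off the slope gives the exponent shown on the slide.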

More power laws on the Internet

degree vs rank, for Internet domains (log-log) [sigcomm99]

[Figure: log(degree) vs log(rank); slope -0.82]

More power laws - internet

• pdf of degrees: (slope: -2.2)

[Figure: Log(count) vs Log(degree) for "981205.out", with fitted line exp(8.11393) * x ** ( -2.20288 )]

Even more power laws on the Internet

Scree plot for Internet domains (log-log) [sigcomm99]

[Figure: log( i-th eigenvalue) vs log(i) for "971108.internet.svals", with fitted line exp(3.57926) * x ** ( -0.471327 ); annotated 0.47]


Fractals & power laws:

appear in numerous settings:

• medical

• geographical / geological

• social

• computer-system related

More apps: Brain scans

• Oct-trees; brain-scans

[Figure: log2(octree leaves) vs level k for "oct-count", with fitted line x*2.63871-2.9122; axes annotated Log(#octants) vs octree levels]

2.63 = fd
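The slope of log2(#occupied octants) against the octree level can be sketched with a plain grid count. This is an illustration under assumptions: regular grid cells of side 2^-level stand in for octree octants, coordinates are assumed to lie in [0, 1), and the test data is a synthetic diagonal line rather than a brain scan.

```python
import math

def occupied_cells(points, level):
    """Number of distinct grid cells of side 2**-level holding a point.
    Coordinates are assumed to lie in [0, 1)."""
    n = 2 ** level
    return len({tuple(min(int(c * n), n - 1) for c in p) for p in points})

def box_count_dimension(points, levels):
    """Least-squares slope of log2(#occupied cells) versus level."""
    xs = list(levels)
    ys = [math.log2(occupied_cells(points, k)) for k in xs]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return num / sum((x - mx) ** 2 for x in xs)

# Synthetic check: points on the main diagonal of the unit cube
diagonal = [(i / 1000,) * 3 for i in range(1000)]
fd_line = box_count_dimension(diagonal, range(2, 6))
print(fd_line)
```

A line embedded in 3-d still gives slope 1, which matches the earlier observation that the fractal dimension reflects the point set, not the embedding space.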


More apps: Medical images

[Burdett et al, SPIE ‘93]:

• benign tumors: fd ~ 2.37

• malignant: fd ~ 2.56


More fractals:

• cardiovascular system: 3 (!)

• lungs: 2.9


Fractals & power laws:

appear in numerous settings:

• medical

• geographical / geological

• social

• computer-system related

More fractals:

• Coastlines: 1.2-1.58

[Figure: example curves labelled 1, 1.1 and 1.3]

More fractals:

• the fractal dimension for the Amazon river

is 1.85 (Nile: 1.4)

[ems.gphys.unc.edu/nonlinear/fractals/examples.html]

More power laws

• Energy of earthquakes (Gutenberg-Richter law) [simscience.org]

[Figure: log(freq) vs magnitude; amplitude vs day]


Fractals & power laws:

appear in numerous settings:

• medical

• geographical / geological

• social

• computer-system related

More fractals:

stock prices (LYCOS) - random walks: 1.5

[Figure: price plots over 1 year and 2 years]


Even more power laws:

• Income distribution (Pareto’s law)

• size of firms

• publication counts (Lotka’s law)

Even more power laws:

library science (Lotka’s law of publication count); and citation counts: (citeseer.nj.nec.com 6/2001)

[Figure: ’cited.pdf’: log(count) vs log(#citations); ‘Ullman’ labelled]

Even more power laws:

• web hit counts [w/ A. Montgomery]

[Figure: “Web Site Traffic”: log(count) vs log(freq); Zipf; “yahoo.com” labelled]


Fractals & power laws:

appear in numerous settings:

• medical

• geographical / geological

• social

• computer-system related

Power laws, cont’d

• In- and out-degree distribution of web sites [Barabasi], [IBM-CLEVER]

[Figure: log(freq) vs log indegree, from [Ravi Kumar, Prabhakar Raghavan, Sridhar Rajagopalan, Andrew Tomkins]]

“Foiled by power law”

• [Broder+, WWW’00]

“The anomalous bump at 120 on the x-axis is due a large clique formed by a single spammer”

[Figure: (log) count vs (log) in-degree]

Power laws, cont’d

• In- and out-degree distribution of web sites

[Barabasi], [IBM-CLEVER]

• length of file transfers [Crovella+Bestavros ‘96]

• duration of UNIX jobs [Harchol-Balter]

Even more power laws:

• Distribution of UNIX file sizes

• web hit counts [Huberman]

15-826 Copyright: C. Faloutsos (2007) 101

CMU SCS

Road map

• Motivation – 3 problems / case studies

• Definition of fractals and power laws

• Solutions to posed problems

• More examples and tools

• Discussion - putting fractals to work!

• Conclusions – practitioner’s guide

• Appendix: gory details - boxcounting plots

15-826 Copyright: C. Faloutsos (2007) 102

CMU SCS

What else can they solve?

• separability [KDD’02]

• forecasting [CIKM’02]

• dimensionality reduction [SBBD’00]

• non-linear axis scaling [KDD’02]

• disk trace modeling [PEVA’02]

• selectivity of spatial/multimedia queries

[PODS’94, VLDB’95, ICDE’00]

• ...

15-826 Copyright: C. Faloutsos (2007) 103

CMU SCS

Settings for fractals:

Points; areas (-> fat fractals), eg:

15-826 Copyright: C. Faloutsos (2007) 104

CMU SCS

Settings for fractals:

Points; areas, eg:

• cities/stores/hospitals, over earth’s surface

• time-stamps of events (customer arrivals,

packet losses, criminal actions) over time

• regions (sales areas, islands, patches of

habitats) over space

15-826 Copyright: C. Faloutsos (2007) 105

CMU SCS

Settings for fractals:

• customer feature vectors (age, income,

frequency of visits, amount of sales per

visit)

‘good’ customers

‘bad’ customers

15-826 Copyright: C. Faloutsos (2007) 106

CMU SCS

Some uses of fractals:

• Detect non-existence of rules (if points are

uniform)

• Detect non-homogeneous regions (e.g., legal login time-stamps may have a different fractal dimension (fd) than intruders’)

• Estimate number of neighbors / customers /

competitors within a radius
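
The last use follows from the power law behind the correlation fractal dimension: the number of neighbors within radius r grows roughly as r raised to the fd. A minimal sketch, assuming the fd has already been estimated and one reference count is known; `estimate_neighbors`, its parameter names and the numbers are illustrative, not from the slides:

```python
def estimate_neighbors(known_count, known_radius, new_radius, fractal_dim):
    """Extrapolate a neighbor count, assuming count(r) is proportional to r**fractal_dim."""
    return known_count * (new_radius / known_radius) ** fractal_dim

# Hypothetical numbers: 10 competitors within 1 km, point set with fd = 1.5.
print(estimate_neighbors(10, 1.0, 4.0, 1.5))
```

Under a uniformity assumption in 2-d the exponent would be 2, so for a clustered set with fd below 2 the uniform estimate would overshoot when extrapolating to larger radii.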

15-826 Copyright: C. Faloutsos (2007) 107

CMU SCS

Multi-Fractals

Setting: points or objects, w/ some value, eg:

– cities w/ populations

– positions on earth and amount of gold/water/oil

underneath

– product ids and sales per product

– people and their salaries

– months and count of accidents

15-826 Copyright: C. Faloutsos (2007) 108

CMU SCS

Use of multifractals:

• Estimate tape/disk accesses

– how many of the 100 tapes contain my 50

phonecall records?

– how many days without an accident?

[Figure: time axis spanning Tape#1 ... Tape#N]
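
For comparison, the textbook answer to the tape question assumes that each record lands on a tape uniformly and independently. The sketch below computes only that baseline (it is not the multifractal estimate); `expected_units_hit` is a hypothetical helper name:

```python
def expected_units_hit(num_units, num_records):
    """Expected number of distinct units holding at least one record,
    ASSUMING each record lands on a unit uniformly and independently."""
    miss = (1.0 - 1.0 / num_units) ** num_records  # chance a given unit gets no record
    return num_units * (1.0 - miss)

# The slide's numbers: 50 phonecall records spread over 100 tapes.
print(round(expected_units_hit(100, 50), 1))
```

When records are bursty and concentrate on a few tapes, this baseline tends to overestimate the number of tapes touched, which is the motivation for a skew-aware model.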

15-826 Copyright: C. Faloutsos (2007) 109

CMU SCS

Use of multifractals

• how often do we exceed the threshold?

[Figure: #bytes vs time, compared with Poisson]

15-826 Copyright: C. Faloutsos (2007) 110

CMU SCS

Use of multifractals cont’d

• Extrapolations for/from samples

[Figure: #bytes vs time]

15-826 Copyright: C. Faloutsos (2007) 111

CMU SCS

Use of multifractals cont’d

• How many distinct products account for 90% of the sales?

[Figure: 20% / 80% split]
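
When per-product sales are available, the question can be answered by a direct count. A minimal sketch; `products_for_share` is a hypothetical helper and the sales figures are made up. It is a plain cumulative count, not the multifractal estimate of [vldb96]:

```python
def products_for_share(sales, share=0.9):
    """Smallest number of top-selling products whose sales reach `share` of the total."""
    total = sum(sales)
    running = 0.0
    for count, amount in enumerate(sorted(sales, reverse=True), start=1):
        running += amount
        if running >= share * total:
            return count
    return len(sales)

# Made-up sales per product.
print(products_for_share([60, 25, 10, 3, 2], share=0.9))
```

The multifractal model is useful when only a sample or a summary of the sales is available and the count has to be extrapolated.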

15-826 Copyright: C. Faloutsos (2007) 112

CMU SCS

Road map

• Motivation – 3 problems / case studies

• Definition of fractals and power laws

• Solutions to posed problems

• More examples and tools

• Discussion - putting fractals to work!

• Conclusions – practitioner’s guide

• Appendix: gory details - boxcounting plots

15-826 Copyright: C. Faloutsos (2007) 113

CMU SCS

Conclusions

• Real data often disobey textbook

assumptions (Gaussian, Poisson,

uniformity, independence)

15-826 Copyright: C. Faloutsos (2007) 114

CMU SCS

Conclusions - cont’d

Self-similarity & power laws: appear in many

cases

Bad news:

lead to skewed distributions

(no Gaussian, Poisson,

uniformity, independence,

mean, variance)

Good news:

• ‘correlation integral’ for separability

• rank/frequency plots

• 80-20 (multifractals)

• (Hurst exponent, strange attractors, renormalization theory, ++)

15-826 Copyright: C. Faloutsos (2007) 115

CMU SCS

Conclusions

• tool#1: (for points) ‘correlation integral’: (#pairs within distance <= r) vs (distance r)

• tool#2: (for categorical values) rank-frequency plot (à la Zipf)

• tool#3: (for numerical values) CCDF: complementary cumulative distribution function (# of elements with value >= a)

15-826 Copyright: C. Faloutsos (2007) 116

CMU SCS

Practitioner’s guide:

• tool#1: #pairs vs distance, for a set of objects, with a distance function (slope = intrinsic dimensionality)

[Figure, two panels: internet, log(#pairs) vs log(hops), slope 2.8; MGcounty, log(#pairs(within <= r)) vs log(r), slope 1.51]
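
Tool#1 can be sketched with a brute-force pair count. This is an O(N^2) illustration, assuming Euclidean distance and toy data; `pair_counts` and `loglog_slope` are hypothetical helper names, and this is not the fast box-counting method of the appendix:

```python
import math
from itertools import combinations

def pair_counts(points, radii):
    """For each radius r, count the pairs of points at Euclidean distance <= r."""
    dists = [math.dist(p, q) for p, q in combinations(points, 2)]
    return [sum(1 for d in dists if d <= r) for r in radii]

def loglog_slope(xs, ys):
    """Least-squares slope of log(y) versus log(x)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx, my = sum(lx) / len(lx), sum(ly) / len(ly)
    num = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    den = sum((a - mx) ** 2 for a in lx)
    return num / den

# Toy data: 200 points along a straight line embedded in 2-d.
points = [(float(i), 0.0) for i in range(200)]
radii = [2.5, 5.5, 10.5, 20.5, 40.5]
print(loglog_slope(radii, pair_counts(points, radii)))
```

For points on a line the slope comes out close to 1, the intrinsic dimensionality, even though the points live in 2-d. Any distance function can replace `math.dist`, which is what makes the tool usable on graphs (hops) as well as on point sets.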

15-826 Copyright: C. Faloutsos (2007) 117

CMU SCS

Practitioner’s guide:

• tool#2: rank-frequency plot (for categorical attributes)

[Figure, two panels: internet domains, log(degree) vs log(rank), slope -0.82; Bible, log(freq) vs log(rank)]
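
The data behind a rank-frequency plot can be built in a few lines. A minimal sketch on a toy word list; `rank_frequency` is a hypothetical helper name:

```python
from collections import Counter

def rank_frequency(items):
    """Return (rank, frequency) pairs, rank 1 being the most frequent value."""
    freqs = sorted(Counter(items).values(), reverse=True)
    return list(enumerate(freqs, start=1))

words = "the cat and the dog and the bird".split()
print(rank_frequency(words))
```

Plotting these pairs on log-log axes and checking for a straight line is the Zipf-style test the slide refers to.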

15-826 Copyright: C. Faloutsos (2007) 118

CMU SCS

Practitioner’s guide:

• tool#3: CCDF, for (skewed) numerical attributes, e.g. areas of islands/lakes, UNIX jobs...

[Figure: Area distribution of regions (SL dataset, scandinavian lakes); Number of regions vs Area, plotted as log(count( >= area)) vs log(area), with fitted line -0.85*x + 10.8]
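
The CCDF counts, for each value a, how many elements are at least a. A minimal sketch on made-up values; `ccdf` is a hypothetical helper name:

```python
from bisect import bisect_left

def ccdf(values):
    """Return (a, count of elements with value >= a) for each distinct value a."""
    ordered = sorted(values)
    n = len(ordered)
    # bisect_left gives the number of elements strictly smaller than a.
    return [(a, n - bisect_left(ordered, a)) for a in sorted(set(ordered))]

# Made-up areas.
print(ccdf([1, 2, 2, 5]))
```

On log-log axes a power-law tail shows up as a straight line, as in the lake-area plot above.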

15-826 Copyright: C. Faloutsos (2007) 119

CMU SCS

Resources:

• Software for fractal dimension

– http://www.cs.cmu.edu/~christos

[email protected]

15-826 Copyright: C. Faloutsos (2007) 120

CMU SCS

Books

• Strongly recommended intro book:

– Manfred Schroeder, Fractals, Chaos, Power Laws: Minutes from an Infinite Paradise, W.H. Freeman and Company, 1991

• Classic book on fractals:

– B. Mandelbrot Fractal Geometry of Nature,

W.H. Freeman, 1977

15-826 Copyright: C. Faloutsos (2007) 121

CMU SCS

References

• [vldb95] Alberto Belussi and Christos Faloutsos, Estimating the Selectivity of Spatial Queries Using the `Correlation' Fractal Dimension, Proc. of VLDB, pp. 299-310, 1995

• [Broder+’00] Andrei Broder, Ravi Kumar, Farzin Maghoul, Prabhakar Raghavan, Sridhar Rajagopalan, Raymie Stata, Andrew Tomkins, Janet Wiener, Graph structure in the web, WWW’00

• M. Crovella and A. Bestavros, Self-similarity in World Wide Web traffic: Evidence and possible causes, SIGMETRICS ’96.

15-826 Copyright: C. Faloutsos (2007) 122

CMU SCS

References

– [ieeeTN94] W. E. Leland, M.S. Taqqu, W. Willinger, D.V. Wilson, On the Self-Similar Nature of Ethernet Traffic, IEEE/ACM Transactions on Networking, 2, 1, pp. 1-15, Feb. 1994.

– [pods94] Christos Faloutsos and Ibrahim Kamel, Beyond Uniformity and Independence: Analysis of R-trees Using the Concept of Fractal Dimension, PODS, Minneapolis, MN, May 24-26, 1994, pp. 4-13

15-826 Copyright: C. Faloutsos (2007) 123

CMU SCS

References

– [vldb96] Christos Faloutsos, Yossi Matias and Avi

Silberschatz, Modeling Skewed Distributions Using

Multifractals and the `80-20 Law’ Conf. on Very Large

Data Bases (VLDB), Bombay, India, Sept. 1996.

15-826 Copyright: C. Faloutsos (2007) 124

CMU SCS

References

– [vldb96] Christos Faloutsos and Volker Gaede, Analysis of the Z-Ordering Method Using the Hausdorff Fractal Dimension, VLDB, Bombay, India, Sept. 1996

– [sigcomm99] Michalis Faloutsos, Petros Faloutsos and

Christos Faloutsos, What does the Internet look like?

Empirical Laws of the Internet Topology, SIGCOMM

1999

15-826 Copyright: C. Faloutsos (2007) 125

CMU SCS

References

– [icde99] Guido Proietti and Christos Faloutsos, I/O

complexity for range queries on region data stored

using an R-tree International Conference on Data

Engineering (ICDE), Sydney, Australia, March 23-26,

1999

– [sigmod2000] Christos Faloutsos, Bernhard Seeger,

Agma J. M. Traina and Caetano Traina Jr., Spatial Join

Selectivity Using Power Laws, SIGMOD 2000

15-826 Copyright: C. Faloutsos (2007) 126

CMU SCS

Appendix - Gory details

• Bad news: There is more than one fractal dimension

– Minkowski fd; Hausdorff fd; Correlation fd;

Information fd

• Great news:

– they can all be computed fast!

– they usually have nearby values

15-826 Copyright: C. Faloutsos (2007) 127

CMU SCS

Fast estimation of fd(s):

• How, for the (correlation) fractal dimension?

• A: Box-counting plot:

[Figure: grid of side r with cell occupancies pi; plot of log(sum(pi^2)) vs log(r)]

15-826 Copyright: C. Faloutsos (2007) 128

CMU SCS

Definitions

• pi: the percentage (or count) of points in the i-th grid cell

• r: the side length of each grid cell

15-826 Copyright: C. Faloutsos (2007) 129

CMU SCS

Fast estimation of fd(s):

• compute sum(pi^2) for another grid side, r’

[Figure: grid of side r’ with cell occupancies pi’; plot of log(sum(pi^2)) vs log(r)]

15-826 Copyright: C. Faloutsos (2007) 130

CMU SCS

Fast estimation of fd(s):

• etc; if the resulting plot has a linear part, its

slope is the correlation fractal dimension D2

[Figure: plot of log(sum(pi^2)) vs log(r)]
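
The box-counting steps can be sketched as follows: for each grid side r, bucket the points into cells, compute sum(pi^2), then fit the slope on log-log axes. This is a minimal sketch on toy data, assuming the whole range of r is the linear part; the function names are hypothetical and this is not the course's released code:

```python
import math
from collections import Counter

def sum_squared_occupancy(points, r):
    """Sum of p_i**2 over the non-empty cells of a grid with side r,
    where p_i is the fraction of points falling in cell i."""
    cells = Counter(tuple(math.floor(c / r) for c in p) for p in points)
    n = len(points)
    return sum((k / n) ** 2 for k in cells.values())

def correlation_fd(points, sides):
    """Least-squares slope of log(sum p_i^2) versus log(r)."""
    xs = [math.log(r) for r in sides]
    ys = [math.log(sum_squared_occupancy(points, r)) for r in sides]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

# Toy data: points on a diagonal line, and points filling a square.
line = [(i + 0.5, i + 0.5) for i in range(1024)]
square = [(i + 0.5, j + 0.5) for i in range(64) for j in range(64)]
print(correlation_fd(line, [2, 4, 8, 16, 32]))
print(correlation_fd(square, [2, 4, 8, 16]))
```

Each grid side needs a single pass over the points, so the cost is linear in N per side. On real data the plot is usually linear only over a range of r, so the fit should be restricted to that range.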

15-826 Copyright: C. Faloutsos (2007) 131

CMU SCS

Definitions (cont’d)

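• Many more fractal dimensions Dq (related to Rényi entropies):

$$D_q = \frac{1}{q-1}\,\frac{\partial \log \sum_i p_i^q}{\partial \log r} \quad (q \neq 1)$$

$$D_1 = \frac{\partial \sum_i p_i \log p_i}{\partial \log r} \quad (q = 1)$$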

15-826 Copyright: C. Faloutsos (2007) 132

CMU SCS

Hausdorff or box-counting fd:

• Box counting plot: Log( N ( r ) ) vs Log ( r)

• r: grid side

• N (r ): count of non-empty cells

• (Hausdorff) fractal dimension D0:

$$D_0 = -\frac{\partial \log N(r)}{\partial \log r}$$

15-826 Copyright: C. Faloutsos (2007) 133

CMU SCS

Definitions (cont’d)

• Hausdorff fd:

[Figure: grid of side r; plot of log(#non-empty cells) vs log(r), with slope labeled D0]

15-826 Copyright: C. Faloutsos (2007) 134

CMU SCS

Observations

• q=0: Hausdorff fractal dimension

• q=2: Correlation fractal dimension

(identical to the exponent of the number

of neighbors vs radius)

• q=1: Information fractal dimension

15-826 Copyright: C. Faloutsos (2007) 135

CMU SCS

Observations, cont’d

• in general, the Dq’s take similar, but not

identical, values.

• except for perfectly self-similar point-sets,

where Dq=Dq’ for any q, q’
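
The equality of the Dq's on a perfectly self-similar set can be checked numerically. The sketch below estimates Dq by box counting for q = 0, 1, 2 on a Sierpinski-triangle-like point set (integer cells whose coordinates share no binary digit). The function names and the test set are assumptions made for illustration:

```python
import math
from collections import Counter

def occupancies(points, r):
    """Fractions p_i of points in each non-empty cell of a grid with side r."""
    cells = Counter(tuple(math.floor(c / r) for c in p) for p in points)
    n = len(points)
    return [k / n for k in cells.values()]

def slope(xs, ys):
    """Least-squares slope of ys against xs."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return num / sum((x - mx) ** 2 for x in xs)

def generalized_fd(points, sides, q):
    """Estimate D_q from grids of the given sides (q = 0, 1, 2 give D0, D1, D2)."""
    xs = [math.log(r) for r in sides]
    if q == 1:
        # Information dimension: slope of sum(p_i * log(p_i)) vs log(r).
        ys = [sum(p * math.log(p) for p in occupancies(points, r)) for r in sides]
        return slope(xs, ys)
    ys = [math.log(sum(p ** q for p in occupancies(points, r))) for r in sides]
    return slope(xs, ys) / (q - 1)

# Sierpinski-like set: cells (x, y) whose binary digits never overlap.
sierpinski = [(x + 0.5, y + 0.5) for x in range(64) for y in range(64) if x & y == 0]
sides = [1, 2, 4, 8, 16]
for q in (0, 1, 2):
    print(q, generalized_fd(sierpinski, sides, q))
```

All three estimates should agree with log(3)/log(2), about 1.585. On real data such as road end-points the values would typically be close but not identical.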

15-826 Copyright: C. Faloutsos (2007) 136

CMU SCS

Examples: MG county

• Montgomery County of MD (road end-

points)

15-826 Copyright: C. Faloutsos (2007) 137

CMU SCS

Examples: LB county

• Long Beach county of CA (road end-points)

15-826 Copyright: C. Faloutsos (2007) 138

CMU SCS

Conclusions

• many fractal dimensions, with nearby

values

• can be computed quickly (O(N) or O(N log(N)))

• (code: on the web)