IR
Paolo Ferragina, Dipartimento di Informatica
Università di Pisa
Paradigm shift:
Web 2.0 is about the many
Do big DATA need big PCs ??
An Italian ad of the '80s about a BIG brush, or a brush BIG...
big DATA ➜ big PC ?
We have three types of algorithms: T1(n) = n, T2(n) = n^2, T3(n) = 2^n
... and assume that 1 step = 1 time unit
How many input data n may each algorithm process within t time units?
n1 = t, n2 = √t, n3 = log2 t
What about a k-times faster processor? ...or, what is n when the available time is k*t ?
n1 = k * t, n2 = √k * √t, n3 = log2 (kt) = log2 k + log2 t
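The takeaway (a faster machine barely helps a slow algorithm) can be checked numerically; a minimal sketch, where each `max_n_*` function inverts the corresponding running time:

```python
import math

# Largest input n each algorithm can process in t time units,
# assuming 1 step = 1 time unit.
def max_n_linear(t):       # T(n) = n    ->  n = t
    return t

def max_n_quadratic(t):    # T(n) = n^2  ->  n = sqrt(t)
    return math.isqrt(t)

def max_n_exponential(t):  # T(n) = 2^n  ->  n = log2(t)
    return int(math.log2(t))

t = 10**6
for k in (1, 100):  # a k-times faster processor = k*t time units
    budget = k * t
    print(k, max_n_linear(budget), max_n_quadratic(budget), max_n_exponential(budget))
```

A 100x faster processor multiplies n by 100 for the linear algorithm, by 10 for the quadratic one, and adds only about 7 to the exponential one.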
A new scenario for Algorithmics
Data are more available than ever before
n ➜ ∞ ... is more than a theoretical assumption
The RAM model is too simple
Step cost is Θ(1)
The memory hierarchy
[Figure: CPU with registers, L1/L2 caches, RAM, disk, network]
Cache: few MBs, some nanosecs, few words fetched
RAM: few GBs, tens of nanosecs, some words fetched
HD: few TBs, few millisecs, B = 32K page
Net: many TBs, even secs, packets
Does Virtual Memory help ?
M = memory size, N = problem size
p = prob. of memory access [0.3 ÷ 0.4 (Hennessy-Patterson)]
C = cost of an I/O [10^5 ÷ 10^6 (Hennessy-Patterson)]
If N ≤ M, then the cost per step is 1
If N = (1+ε)M, then the avg cost per step is:
1 + C * p * ε/(1+ε)
This is at least > 10^4 * ε/(1+ε)
If ε = 1/1000
( e.g. M = 1GB, N = 1GB + 1MB )
Avg step-cost is > 20
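The formula can be checked numerically; a quick sketch, taking C and p at the low end of the Hennessy-Patterson ranges quoted above:

```python
# Average cost per step once the problem exceeds memory by a fraction eps:
#   avg_cost = 1 + C * p * eps / (1 + eps)
def avg_step_cost(C, p, eps):
    return 1 + C * p * eps / (1 + eps)

# C = 10^5 (I/O cost), p = 0.3 (access probability), eps = 1/1000
# (e.g. M = 1GB, N = 1GB + 1MB)
cost = avg_step_cost(10**5, 0.3, 1 / 1000)
print(cost)  # already ~30 RAM steps per "unit" step, i.e. > 20
```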
The I/O-model
Spatial locality or Temporal locality
[Disk figure: track, magnetic surface, read/write arm, read/write head]
“The difference in speed between modern CPU and disk technologies is analogous to the difference in speed in sharpening a pencil using a sharpener on one’s desk or by taking an airplane to the other side of the world and using a sharpener on someone else’s desk.” (D. Comer)
Less and faster I/Os ➜ caching
[Figure: CPU ↔ RAM ↔ HD, data moved in pages of B items]
Count I/Os
Other issues ➜ other models
Random vs sequential I/Os
Scanning is better than jumping
Not just one CPU
Many PCs, Multi-cores CPUs or even GPUs
Parameter-free algorithms
Anywhere, anytime, anyway... Optimal !!
Streaming algorithms
Parallel or Distributed algorithms
Cache-oblivious algorithms
What about energy consumption ?
≈10 IO/s/W vs ≈6000 IO/s/W [Leventhal, CACM 2008]
Our topics, on an example: the Web
[Figure: search-engine architecture]
Crawler (which pages to visit next?) ➜ Page archive
Page analyzer (text + structure) ➜ Indexer (text + auxiliary indexes)
Query ➜ Query resolver ➜ Ranker
Tools involved: Hashing, Data Compression, Dictionaries, Sorting, Linear Algebra, Clustering, Classification
Warm up...
Take Wikipedia in Italian, and compute word freq:
Few GBs ➜ n ≈ 10^9 words
How do you proceed ??
Tokenize into a sequence of strings
Sort the strings
Create tuples < word, freq >
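The three steps can be sketched in a few lines; a toy in-memory version that sorts the tokens and then counts adjacent duplicates in one scan (an external-memory version would replace the in-memory sort with the disk-based merge-sort discussed next):

```python
import re

def word_freq(text):
    # 1) Tokenize into a sequence of strings
    tokens = re.findall(r"\w+", text.lower())
    # 2) Sort the strings: equal words become adjacent
    tokens.sort()
    # 3) Create tuples <word, freq> with a single scan
    pairs = []
    for w in tokens:
        if pairs and pairs[-1][0] == w:
            pairs[-1] = (w, pairs[-1][1] + 1)
        else:
            pairs.append((w, 1))
    return pairs

print(word_freq("a rose is a rose"))  # → [('a', 2), ('is', 1), ('rose', 2)]
```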
Binary Merge-Sort

Merge-Sort(A,i,j)
01 if (i < j) then
02   m = (i+j)/2
03   Merge-Sort(A,i,m)
04   Merge-Sort(A,m+1,j)
05   Merge(A,i,m,j)

Divide ➜ Conquer ➜ Combine
Example: merging the runs 1 2 8 10 and 7 9 13 19 starts emitting 1 2 7 ...
Merge is linear in the #items to be merged
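The pseudocode above translates directly into a runnable sketch (0-based, inclusive bounds):

```python
def merge_sort(A, i, j):
    # Sorts A[i..j] in place
    if i < j:
        m = (i + j) // 2          # Divide
        merge_sort(A, i, m)       # Conquer left half
        merge_sort(A, m + 1, j)   # Conquer right half
        merge(A, i, m, j)         # Combine

def merge(A, i, m, j):
    # Linear in the #items merged: one pass over both runs
    left, right = A[i:m + 1], A[m + 1:j + 1]
    a = b = 0
    for k in range(i, j + 1):
        if b >= len(right) or (a < len(left) and left[a] <= right[b]):
            A[k] = left[a]; a += 1
        else:
            A[k] = right[b]; b += 1

A = [10, 2, 5, 1, 13, 19, 9, 7]
merge_sort(A, 0, len(A) - 1)
print(A)  # → [1, 2, 5, 7, 9, 10, 13, 19]
```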
But...
Few key observations:
Items = (short) strings = atomic...
Θ(n log n) memory accesses (I/Os ??)
[5ms] * n log2 n ≈ 3 years
In practice it is faster, why?
Implicit Caching…
Input:  10 2 | 5 1 | 13 19 | 9 7 | 15 4 | 8 3 | 12 17 | 6 11
Pass 1: 2 10 | 1 5 | 13 19 | 7 9 | 4 15 | 3 8 | 12 17 | 6 11
Pass 2: 1 2 5 10 | 7 9 13 19 | 3 4 8 15 | 6 11 12 17
Pass 3: 1 2 5 7 9 10 13 19 | 3 4 6 8 11 12 15 17
Pass 4: 1 2 3 4 5 6 7 8 9 10 11 12 13 15 17 19
log2 N passes in total, but the first ones are free:
M ➜ N/M runs, each sorted in internal memory (no I/Os)
Each of the remaining log2 (N/M) merge levels costs 2 passes (one Read / one Write) = 2 * (N/B) I/Os
➜ I/O-cost for binary merge-sort is ≈ 2 (N/B) log2 (N/M)
A key inefficiency
Merging two runs, e.g. 1 2 4 7 9 10 13 19 and 3 5 6 8 11 12 15 17, uses one page of B items per input run, plus an output buffer of B items flushed to disk as the output run grows (1, 2, 3 written, then 4, ...)
After few steps, every run is longer than B !!!
We are using only 3 pages, but memory contains M/B pages ≈ 2^30/2^15 = 2^15
Multi-way Merge-Sort
Sort N items with main memory M and disk pages of size B:
Pass 1: Produce N/M sorted runs.
Pass i: merge X = M/B − 1 runs ➜ logX (N/M) passes
[Figure: main-memory buffers of B items — one page per run (run 1, run 2, ..., run X) plus one output page, reading from and writing to Disk]
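One merge step over X runs can be sketched with a min-heap that holds the smallest current item of each run (here the standard `heapq.merge` plays the role of the X-way merger; a real external version would read and write pages of B items instead of whole in-memory lists):

```python
import heapq

def multiway_merge(runs):
    # Merge X sorted runs in a single pass: a heap keeps the smallest
    # current item of each run, so each output item costs O(log X).
    return list(heapq.merge(*runs))

runs = [[1, 2, 5, 10], [7, 9, 13, 19], [3, 4, 8, 15], [6, 11, 12, 17]]
print(multiway_merge(runs))
```

With X = 4 runs, the four sorted runs of the binary trace above are merged in one pass instead of two.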
Cost of Multi-way Merge-Sort
Number of passes = logX (N/M) ≈ logM/B (N/M)
Total I/O-cost is Θ( (N/B) logM/B (N/M) ) I/Os
Large fan-out (M/B) decreases #passes
In practice M/B ≈ 10^5 ➜ #passes = 1 ➜ few mins
Tuning depends on disk features
Compression would decrease the cost of a pass!
Note: logM/B (N/B) and logM/B (N/M) differ by just 1, since logM/B M = logM/B [(M/B) * B] = (logM/B B) + 1
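Plugging in concrete values shows why one merge pass suffices in practice; a sketch where the M, B, N figures are illustrative, matching the orders of magnitude in these slides:

```python
import math

def passes(N, M, B):
    # Merge passes of multi-way merge-sort after Pass 1 builds
    # ceil(N/M) runs: ceil(log_X(#runs)) with fan-out X = M/B - 1.
    X = M // B - 1
    runs = math.ceil(N / M)
    return max(1, math.ceil(math.log(runs, X)))

# M = 1 GB of RAM, B = 32 KB pages, N = 1 TB of data (illustrative)
M, B, N = 2**30, 2**15, 2**40
print(M // B, passes(N, M, B))  # fan-out ≈ 32K ➜ a single merge pass
```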
I/O-lower bound for Sorting
Every I/O fetches B items into a memory of size M
Decision tree with fan-out (M choose B): the ways the B fetched items can interleave with the items in memory
There are N/B steps (the first read of each block) each contributing an extra B! comparison-outcomes
Find t ≥ N/B such that:
(M choose B)^t * (B!)^(N/B) ≥ N!
We get t = Ω( (N/B) logM/B (N/B) ) I/Os
Keep attention...
If sorting needs to manage arbitrarily long strings
Key observations:
Array A is an "array of pointers to objects"
For each object-to-object comparison A[i] vs A[j]: 2 random accesses to 2 memory locations A[i] and A[j]
➜ Θ(n log n) random memory accesses (I/Os ??)
[Figure: array A of pointers into the memory containing the strings]
Again caching helps, but it may be less effective than before
Indirect sort
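Indirect sorting can be sketched as sorting an array of indices (the "pointers") by the strings they refer to; each comparison dereferences two memory locations (the toy data below is hypothetical):

```python
# Indirect sort: A holds indices (pointers) into the string storage;
# we permute A, never the strings themselves.
strings = ["pisa", "algorithm", "disk", "cache"]
A = list(range(len(strings)))

# Every comparison A[i] vs A[j] dereferences two memory locations:
A.sort(key=lambda i: strings[i])
print(A)                        # → [1, 3, 2, 0]
print([strings[i] for i in A])  # strings in sorted order
```

Each dereference may land anywhere in the string storage, which is exactly why these Θ(n log n) accesses are random rather than sequential.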