IR
Paolo Ferragina, Dipartimento di Informatica
Università di Pisa
Paradigm shift:
Web 2.0 is about the many
Do big DATA need big PCs ??
An Italian ad of the '80s about a BIG brush, or a brush BIG...
big DATA ➜ big PC ?
We have three types of algorithms: T1(n) = n, T2(n) = n^2, T3(n) = 2^n
... and assume that 1 step = 1 time unit
How many input data n may each algorithm process within t time units?
n1 = t, n2 = √t, n3 = log2 t
What about a k-times faster processor? ...or, what is n when the available time is k*t ?
n1 = k * t, n2 = √k * √t, n3 = log2 (kt) = log2 k + log2 t
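The takeaway (a faster machine barely helps a slow algorithm) can be checked numerically; a minimal sketch, where each `max_n_*` function inverts the corresponding running time:

```python
import math

# Largest input n each algorithm can process in t time units,
# assuming 1 step = 1 time unit.
def max_n_linear(t):       # T(n) = n    ->  n = t
    return t

def max_n_quadratic(t):    # T(n) = n^2  ->  n = sqrt(t)
    return math.isqrt(t)

def max_n_exponential(t):  # T(n) = 2^n  ->  n = log2(t)
    return int(math.log2(t))

t = 10**6
for k in (1, 100):  # a k-times faster processor = k*t time units
    budget = k * t
    print(k, max_n_linear(budget), max_n_quadratic(budget), max_n_exponential(budget))
```

A 100x faster processor multiplies n by 100 for the linear algorithm, by 10 for the quadratic one, and adds only about 7 to the exponential one.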
A new scenario for Algorithmics
Data are more available than ever before
n ➜ ∞ ... is more than a theoretical assumption
The RAM model is too simple
Step cost is Θ(1)
The memory hierarchy
[Figure: CPU with registers, L1/L2 caches, RAM, disk, network]
Cache: few MBs, some nanosecs, few words fetched
RAM: few GBs, tens of nanosecs, some words fetched
HD: few TBs, few millisecs, B = 32K page
Net: many TBs, even secs, packets
Does Virtual Memory help ?
M = memory size, N = problem size
p = prob. of memory access [0.3 ÷ 0.4 (Hennessy-Patterson)]
C = cost of an I/O [10^5 ÷ 10^6 (Hennessy-Patterson)]
If N ≤ M, then the cost per step is 1
If N = (1+ε)M, then the avg cost per step is:
1 + C * p * ε/(1+ε)
This is at least > 10^4 * ε/(1+ε)
If ε = 1/1000
( e.g. M = 1GB, N = 1GB + 1MB )
Avg step-cost is > 20
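The formula can be checked numerically; a quick sketch, taking C and p at the low end of the Hennessy-Patterson ranges quoted above:

```python
# Average cost per step once the problem exceeds memory by a fraction eps:
#   avg_cost = 1 + C * p * eps / (1 + eps)
def avg_step_cost(C, p, eps):
    return 1 + C * p * eps / (1 + eps)

# C = 10^5 (I/O cost), p = 0.3 (access probability), eps = 1/1000
# (e.g. M = 1GB, N = 1GB + 1MB)
cost = avg_step_cost(10**5, 0.3, 1 / 1000)
print(cost)  # already ~30 RAM steps per "unit" step, i.e. > 20
```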
The I/O-model
Spatial locality or Temporal locality
[Disk figure: track, magnetic surface, read/write arm, read/write head]
“The difference in speed between modern CPU and disk technologies is analogous to the difference in speed in sharpening a pencil using a sharpener on one’s desk or by taking an airplane to the other side of the world and using a sharpener on someone else’s desk.” (D. Comer)
Less and faster I/Os ➜ caching
[Figure: CPU ↔ RAM ↔ HD, data moved in pages of B items]
Count I/Os
Other issues ➜ other models
Random vs sequential I/Os
Scanning is better than jumping
Not just one CPU
Many PCs, Multi-cores CPUs or even GPUs
Parameter-free algorithms
Anywhere, anytime, anyway... Optimal !!
Streaming algorithms
Parallel or Distributed algorithms
Cache-oblivious algorithms
What about energy consumption ?
≈10 IO/s/W vs ≈6000 IO/s/W [Leventhal, CACM 2008]
Our topics, on an example: the Web
[Figure: search-engine architecture]
Crawler (which pages to visit next?) ➜ Page archive
Page analyzer (text + structure) ➜ Indexer (text + auxiliary indexes)
Query ➜ Query resolver ➜ Ranker
Tools involved: Hashing, Data Compression, Dictionaries, Sorting, Linear Algebra, Clustering, Classification
Warm up...
Take Wikipedia in Italian, and compute word freq:
Few GBs ➜ n ≈ 10^9 words
How do you proceed ??
Tokenize into a sequence of strings
Sort the strings
Create tuples < word, freq >
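The three steps can be sketched in a few lines; a toy in-memory version that sorts the tokens and then counts adjacent duplicates in one scan (an external-memory version would replace the in-memory sort with the disk-based merge-sort discussed next):

```python
import re

def word_freq(text):
    # 1) Tokenize into a sequence of strings
    tokens = re.findall(r"\w+", text.lower())
    # 2) Sort the strings: equal words become adjacent
    tokens.sort()
    # 3) Create tuples <word, freq> with a single scan
    pairs = []
    for w in tokens:
        if pairs and pairs[-1][0] == w:
            pairs[-1] = (w, pairs[-1][1] + 1)
        else:
            pairs.append((w, 1))
    return pairs

print(word_freq("a rose is a rose"))  # → [('a', 2), ('is', 1), ('rose', 2)]
```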
Binary Merge-Sort

Merge-Sort(A,i,j)
01 if (i < j) then
02   m = (i+j)/2
03   Merge-Sort(A,i,m)
04   Merge-Sort(A,m+1,j)
05   Merge(A,i,m,j)

Divide ➜ Conquer ➜ Combine
Example: merging the runs 1 2 8 10 and 7 9 13 19 starts emitting 1 2 7 ...
Merge is linear in the #items to be merged
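The pseudocode above translates directly into a runnable sketch (0-based, inclusive bounds):

```python
def merge_sort(A, i, j):
    # Sorts A[i..j] in place
    if i < j:
        m = (i + j) // 2          # Divide
        merge_sort(A, i, m)       # Conquer left half
        merge_sort(A, m + 1, j)   # Conquer right half
        merge(A, i, m, j)         # Combine

def merge(A, i, m, j):
    # Linear in the #items merged: one pass over both runs
    left, right = A[i:m + 1], A[m + 1:j + 1]
    a = b = 0
    for k in range(i, j + 1):
        if b >= len(right) or (a < len(left) and left[a] <= right[b]):
            A[k] = left[a]; a += 1
        else:
            A[k] = right[b]; b += 1

A = [10, 2, 5, 1, 13, 19, 9, 7]
merge_sort(A, 0, len(A) - 1)
print(A)  # → [1, 2, 5, 7, 9, 10, 13, 19]
```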
But...
Few key observations:
Items = (short) strings = atomic...
Θ(n log n) memory accesses (I/Os ??)
[5ms] * n log2 n ≈ 3 years
In practice it is faster, why?
Implicit Caching…
Input:  10 2 | 5 1 | 13 19 | 9 7 | 15 4 | 8 3 | 12 17 | 6 11
Pass 1: 2 10 | 1 5 | 13 19 | 7 9 | 4 15 | 3 8 | 12 17 | 6 11
Pass 2: 1 2 5 10 | 7 9 13 19 | 3 4 8 15 | 6 11 12 17
Pass 3: 1 2 5 7 9 10 13 19 | 3 4 6 8 11 12 15 17
Pass 4: 1 2 3 4 5 6 7 8 9 10 11 12 13 15 17 19
log2 N passes in total, but the first ones are free:
M ➜ N/M runs, each sorted in internal memory (no I/Os)
Each of the remaining log2 (N/M) merge levels costs 2 passes (one Read / one Write) = 2 * (N/B) I/Os
➜ I/O-cost for binary merge-sort is ≈ 2 (N/B) log2 (N/M)
A key inefficiency
Merging two runs, e.g. 1 2 4 7 9 10 13 19 and 3 5 6 8 11 12 15 17, uses one page of B items per input run, plus an output buffer of B items flushed to disk as the output run grows (1, 2, 3 written, then 4, ...)
After few steps, every run is longer than B !!!
We are using only 3 pages, but memory contains M/B pages ≈ 2^30/2^15 = 2^15
Multi-way Merge-Sort
Sort N items with main memory M and disk pages of size B:
Pass 1: Produce N/M sorted runs.
Pass i: merge X = M/B − 1 runs ➜ logX (N/M) passes
[Figure: main-memory buffers of B items — one page per run (run 1, run 2, ..., run X) plus one output page, reading from and writing to Disk]
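One merge step over X runs can be sketched with a min-heap that holds the smallest current item of each run (here the standard `heapq.merge` plays the role of the X-way merger; a real external version would read and write pages of B items instead of whole in-memory lists):

```python
import heapq

def multiway_merge(runs):
    # Merge X sorted runs in a single pass: a heap keeps the smallest
    # current item of each run, so each output item costs O(log X).
    return list(heapq.merge(*runs))

runs = [[1, 2, 5, 10], [7, 9, 13, 19], [3, 4, 8, 15], [6, 11, 12, 17]]
print(multiway_merge(runs))
```

With X = 4 runs, the four sorted runs of the binary trace above are merged in one pass instead of two.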
Cost of Multi-way Merge-Sort
Number of passes = logX (N/M) ≈ logM/B (N/M)
Total I/O-cost is Θ( (N/B) logM/B (N/M) ) I/Os
Large fan-out (M/B) decreases #passes
In practice M/B ≈ 10^5 ➜ #passes = 1 ➜ few mins
Tuning depends on disk features
Compression would decrease the cost of a pass!
Note: logM/B (N/B) and logM/B (N/M) differ by just 1, since logM/B M = logM/B [(M/B) * B] = (logM/B B) + 1
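Plugging in concrete values shows why one merge pass suffices in practice; a sketch where the M, B, N figures are illustrative, matching the orders of magnitude in these slides:

```python
import math

def passes(N, M, B):
    # Merge passes of multi-way merge-sort after Pass 1 builds
    # ceil(N/M) runs: ceil(log_X(#runs)) with fan-out X = M/B - 1.
    X = M // B - 1
    runs = math.ceil(N / M)
    return max(1, math.ceil(math.log(runs, X)))

# M = 1 GB of RAM, B = 32 KB pages, N = 1 TB of data (illustrative)
M, B, N = 2**30, 2**15, 2**40
print(M // B, passes(N, M, B))  # fan-out ≈ 32K ➜ a single merge pass
```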
I/O-lower bound for Sorting
Every I/O fetches B items into a memory of size M
Decision tree with fan-out (M choose B): the ways the B fetched items can interleave with the items in memory
There are N/B steps (the first read of each block) each contributing an extra B! comparison-outcomes
Find t ≥ N/B such that:
(M choose B)^t * (B!)^(N/B) ≥ N!
We get t = Ω( (N/B) logM/B (N/B) ) I/Os
Keep attention...
If sorting needs to manage arbitrarily long strings
Key observations:
Array A is an "array of pointers to objects"
For each object-to-object comparison A[i] vs A[j]: 2 random accesses to 2 memory locations A[i] and A[j]
➜ Θ(n log n) random memory accesses (I/Os ??)
[Figure: array A of pointers into the memory containing the strings]
Again caching helps, but it may be less effective than before
Indirect sort
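Indirect sorting can be sketched as sorting an array of indices (the "pointers") by the strings they refer to; each comparison dereferences two memory locations (the toy data below is hypothetical):

```python
# Indirect sort: A holds indices (pointers) into the string storage;
# we permute A, never the strings themselves.
strings = ["pisa", "algorithm", "disk", "cache"]
A = list(range(len(strings)))

# Every comparison A[i] vs A[j] dereferences two memory locations:
A.sort(key=lambda i: strings[i])
print(A)                        # → [1, 3, 2, 0]
print([strings[i] for i in A])  # strings in sorted order
```

Each dereference may land anywhere in the string storage, which is exactly why these Θ(n log n) accesses are random rather than sequential.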