Data Streams - Duke University · 2018. 4. 4. · Several questions about streams • How do we...
Data Streams
Everything Data
CompSci 216 Spring 2018
How much data is generated every minute in the world?
https://fossbytes.com/how-much-data-is-generated-every-minute-in-the-world/
Data stream
• A potentially infinite sequence, where data arrives one record at a time
• We only get one look—can’t go back
• We want to answer a standing query that produces new/updated results as stream goes by
• We have limited space to remember whatever we deem necessary to answer the query
Several questions about streams
• How do we maintain a random sample of size n for all data we’ve seen so far?
– Sample can be used to answer queries
• How do we maintain a data structure to check if a new arrival has appeared before?
– E.g., a URL shortening service
• How do we count the number of unique records seen so far?
– E.g., # of unique visitors (by IP) to a website
Outline
• How do we maintain a random sample of size n for all data we’ve seen so far?
– Sample can be used to answer queries
• How do we maintain a data structure to check if a new arrival has appeared before?
– E.g., a URL shortening service
• How do we count the number of unique records seen so far?
– E.g., # of unique visitors (by IP) to a website
Sampling static data vs. stream
• With a static dataset
– We know the total data size N
– We can access an arbitrary record
• With a stream
– There is no N, just the number of records we have seen so far
– We only get one look at each record, in arrival order
Reservoir sampling
• Make one pass over data
• Maintain a reservoir of n records
→ After reading t records, the reservoir is a random sample of the first t records
• The algorithm tells us how to update the reservoir upon every new record arrival
Simple algorithm
• Initialize the reservoir to the first n records
• For the t-th (new) record, t > n
– Pick a random number x uniformly from {1, …, t}
– If x ≤ n, then replace the x-th record in the reservoir with the new record
• That’s it!
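The steps of the simple algorithm can be sketched in Python as follows. This is a minimal illustration, not code from the lecture: the function name `reservoir_sample` and the optional `rng` parameter are assumptions made for the example.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], n: int,
                     rng: Optional[random.Random] = None) -> List[T]:
    """Return a uniform random sample of up to n records from one pass over stream."""
    rng = rng or random.Random()
    reservoir: List[T] = []
    for t, record in enumerate(stream, start=1):
        if t <= n:
            # The first n records fill the reservoir directly.
            reservoir.append(record)
        else:
            # Pick x uniformly from {1, ..., t}; x <= n happens with prob. n/t.
            x = rng.randint(1, t)
            if x <= n:
                # Replace the x-th record (1-based) with the new record.
                reservoir[x - 1] = record
    return reservoir


sample = reservoir_sample(range(1000), n=6, rng=random.Random(216))
print(sample)
```

Only the reservoir and the counter t are kept in memory, so the space used does not grow with the length of the stream.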
[Figure: stream … 12 18 14 16 23 22 … with t records seen up to now; reservoir of n records: 14 11 9 20 8 18; the new record enters the reservoir with probability p = n/t]
But why?
• Analyze a simple case: reservoir size n = 1
• The t-th (new) record is included with p = 1/t
• For any earlier record r (the i-th record, i < t), the probability that r is the sample from the stream of length t is
– P[r is sampled on arrival] × P[r survives to the end]
  $= \frac{1}{i} \times \left(1 - \frac{1}{i+1}\right) \times \left(1 - \frac{1}{i+2}\right) \times \dots \times \left(1 - \frac{1}{t}\right) = \frac{1}{t}$
• How about reservoir size n > 1?
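For the n = 1 case, the product collapses to 1/t because each factor $1 - \frac{1}{j}$ equals $\frac{j-1}{j}$, so neighboring numerators and denominators cancel. Written out as a worked step:

```latex
\frac{1}{i} \times \frac{i}{i+1} \times \frac{i+1}{i+2} \times \dots \times \frac{t-1}{t}
  = \frac{1}{t}
```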
But why?
• Base case: If t = n, obviously the reservoir has a “random sample” of all records seen so far
• Induction: Suppose the reservoir is a random sample of the first t records for t = k – 1
– i.e., P[r in reservoir after k – 1 steps] = n/(k – 1)
• What happens when t = k?
– The new record is included with prob. n/k
– For any other record r, P[r in reservoir]
  = P[r in reservoir after k – 1 steps] × P[r is not replaced in step k]
  = [n/(k – 1)] × (1 – 1/k) = n/k
– r is replaced only if x lands on r’s slot, which happens with prob. 1/k
An improvement
• What if skipping new records is cheaper than accessing them one by one?
• After adding the t-th record to the reservoir
– Simulate forward until we need to add a new record to the reservoir; skip until then
– Or calculate the CDF of the skip size
  $P[\text{skip size} \le s] = 1 - \left(1 - \frac{n}{t+1}\right)\left(1 - \frac{n}{t+2}\right)\dots\left(1 - \frac{n}{t+s+1}\right)$
• Sample from this CDF, skip accordingly
• Pick one record in the reservoir (uniformly at random) to replace
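One way to sample from the skip-size CDF is inverse-transform sampling: draw u uniformly from [0, 1) and return the smallest s whose CDF value reaches u. The sketch below assumes that approach; the function name `skip_size` and passing u as an argument are choices made for this illustration only.

```python
import random


def skip_size(n: int, t: int, u: float) -> int:
    """Smallest s with P[skip size <= s] >= u, for reservoir size n after t records."""
    if not 0 < n <= t:
        raise ValueError("need 0 < n <= t")
    if not 0.0 <= u < 1.0:
        raise ValueError("u must be in [0, 1)")
    # Running product; after the update in iteration s it equals P[skip size > s],
    # i.e. the probability that records t+1, ..., t+s+1 are all skipped.
    survive = 1.0
    s = 0
    while True:
        survive *= 1.0 - n / (t + s + 1)
        if 1.0 - survive >= u:
            return s
        s += 1


rng = random.Random(216)
print(skip_size(n=6, t=100, u=rng.random()))
```

The loop walks the CDF one term at a time, so it can run for a long time when u is very close to 1, especially for small n. A practical implementation would likely need a faster way to invert the CDF.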
Summary of reservoir sampling
• Helps create a “running” random sample of fixed size over a stream
• Very useful when computing/accessing the whole dataset is expensive
Problem boils down to…
• Given a large set S (e.g., all values seen so far), check whether a given value x is in S
• Suppose we have n bits of storage (n ≪ |S|)
– Cannot afford to store S
Approximation comes to the rescue
• If x is in S, return true with prob. 1
– i.e., no false negatives
• If x is not in S, return false with high prob.
– i.e., possible false positives
Primitive: hash function
• h: S ⟶ {1, 2, …, n}
– Hashes values uniformly to integers in [1, n], i.e., P[h(x) = i] = 1/n
• “Compressing” a value down with one h loses too much information, so we use k independent hash functions h1, h2, …, hk
Bloom filter
• Initialization of Bloom filter B (n-bit array)
– Set all n bits of B to 0
• Add all x in S (all values seen so far)
– Compute h1(x), h2(x), …, hk(x)
– Set the corresponding k bits of B to 1
[Figure: e.g. k = 3; bit array B = 0 1 0 0 1 1 0 1 1 0 after adding x1 and x2, with arrows from h1(x1), h2(x1), h3(x1) and h1(x2), h2(x2), h3(x2) to the bits they set to 1]
Bloom filter
• Initialization of Bloom filter B
– Set all n bits of B to 0
• Add all x in S (all values seen so far)
– Compute h1(x), h2(x), …, hk(x)
– Set the corresponding k bits of B to 1
• Check if new arrival x is in S
– Compute h1(x), h2(x), …, hk(x)
– Return true iff the corresp. k bits of B are all 1
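The initialize / add / check steps can be sketched as a small Python class. The slides do not say how to obtain k independent hash functions; salting SHA-256 with the index j is an assumption made for this example, as are the names `BloomFilter`, `add`, and `might_contain`. Bit positions here run from 0 to n – 1 rather than 1 to n.

```python
import hashlib
from typing import Iterator


class BloomFilter:
    def __init__(self, n_bits: int, k: int) -> None:
        self.n = n_bits
        self.k = k
        # One byte per bit for readability; a real filter would pack the bits.
        self.bits = bytearray(n_bits)  # all bits start at 0

    def _positions(self, value: str) -> Iterator[int]:
        # Assumption: the j-th "hash function" is SHA-256 of the value salted with j.
        for j in range(self.k):
            digest = hashlib.sha256(f"{j}:{value}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.n

    def add(self, value: str) -> None:
        # Set the k bits chosen by the hash functions to 1.
        for pos in self._positions(value):
            self.bits[pos] = 1

    def might_contain(self, value: str) -> bool:
        # True iff all k bits are 1; false positives are possible.
        return all(self.bits[pos] for pos in self._positions(value))


bf = BloomFilter(n_bits=1000, k=3)
bf.add("https://example.com/a")
print(bf.might_contain("https://example.com/a"))
```

A value that was added always checks as present, since its k bits were set by `add` and bits are never cleared.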
No false negatives
• If x is really in S
– Then by construction we have set bits h1(x), h2(x), …, hk(x) to 1
– So check will surely return true
[Figure: bit array B = 0 1 0 0 1 1 0 1 1 0, with arrows from h1(x1), h2(x1), h3(x1) to bits that are all 1]
False positive probability
• If x is not in S
– Check returns true if each bit hj(x) is 1 due to some other value(s) in S
$P[\text{bit } i \text{ is 1}] = 1 - P[\text{bit } i \text{ was not set by } k|S| \text{ hashes}] = 1 - (1 - 1/n)^{k|S|}$
$P[k \text{ particular bits are 1}] = P[\text{bit } i \text{ is 1}]^k = \left(1 - (1 - 1/n)^{k|S|}\right)^k \approx \left(1 - e^{-k|S|/n}\right)^k$
Example
• Suppose there are |S| = $10^9$ elements
• Suppose we have 1 GB ($8 \times 10^9$ bits) memory
• If k = 1
– P[false positive] ≈ $(1 - e^{-k|S|/n})^k = 1 - e^{-1/8} \approx 0.1175$
• If k = 2
– P[false positive] ≈ $(1 - e^{-k|S|/n})^k = (1 - e^{-2/8})^2 \approx 0.0489$
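The arithmetic in this example can be reproduced with a few lines of Python. The function name `false_positive_prob` is made up for this sketch; it evaluates the approximation $(1 - e^{-k|S|/n})^k$ for several values of k.

```python
import math


def false_positive_prob(k: int, num_elements: int, n_bits: int) -> float:
    """Approximate Bloom filter false positive probability (1 - e^(-k|S|/n))^k."""
    return (1.0 - math.exp(-k * num_elements / n_bits)) ** k


S = 10**9        # |S|: number of elements
N = 8 * 10**9    # n: 1 GB of memory, in bits

for k in range(1, 11):
    print(k, round(false_positive_prob(k, S, N), 4))
```

Printing a range of k shows how the probability changes as more hash functions are used with a fixed number of bits.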
Example
• Suppose there are |S| = $10^9$ elements
• Suppose we have 1 GB ($8 \times 10^9$ bits) memory
[Figure: plot of false positive prob. (y-axis) versus k, the # of hash functions (x-axis)]
• As we increase the # of bits to set to 1 per element, it is more likely that a bunch of bits become 1 just by chance
Summary of Bloom filter
• Helps check membership in a large set that cannot be stored entirely
• No false negatives
– Good for applications like URL shortener
• False positive probability can be tweaked by the choice of n and k
Can you use a Bloom filter?
• Increment a counter whenever check returns false for an incoming value
• Because the false negative probability is 0 but the false positive probability is not, some new values are mistaken for repeats, so we will consistently underestimate the # of distinct values
• Also, the Bloom filter does more than we need. Can we use the n bits more efficiently?
FM (Flajolet-Martin) sketch
Let Tail0(h(x)) = # of trailing consecutive 0’s in h(x)
• Tail0(101001) = 0
• Tail0(101010) = 1
• Tail0(001100) = 2
• Tail0(101000) = 3
• Tail0(000000) = 6
FM sketch
• Maintain a value K (max 0-tail length)
• Initialize K to 0
• For each new value x
– Compute Tail0(h(x))
– Replace K with this value if it is greater than K
• F’ = $2^K$ is an estimate of F, the true number of distinct elements
• K requires very little space to store
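A single FM sketch can be illustrated in Python as below. The choice of hash is not given in the slides; taking the first 32 bits of SHA-256 is an assumption for this sketch, and the names `tail0`, `hash32`, and `FMSketch` are invented for the example.

```python
import hashlib

HASH_BITS = 32  # assumed hash width for this sketch


def tail0(value: int, width: int = HASH_BITS) -> int:
    """Number of trailing consecutive 0 bits; an all-zero value counts as width."""
    if value == 0:
        return width
    # value & -value isolates the lowest set bit; its index is the 0-tail length.
    return (value & -value).bit_length() - 1


def hash32(x: str) -> int:
    # Assumption: the first 4 bytes of SHA-256 stand in for a uniform 32-bit hash.
    return int.from_bytes(hashlib.sha256(x.encode("utf-8")).digest()[:4], "big")


class FMSketch:
    def __init__(self) -> None:
        self.K = 0  # max 0-tail length seen so far

    def add(self, x: str) -> None:
        self.K = max(self.K, tail0(hash32(x)))

    def estimate(self) -> int:
        return 2 ** self.K  # F' = 2^K


sketch = FMSketch()
for i in range(10_000):
    sketch.add(f"10.0.{i // 256}.{i % 256}")
print(sketch.K, sketch.estimate())
```

Adding the same value again never changes K, because it hashes to the same bits, which is why the sketch counts distinct values rather than arrivals.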
Rough intuition
If we have F distinct elements, we’d expect
• F/2 of them to have Tail0(h(x)) ≥ 1
• F/4 of them to have Tail0(h(x)) ≥ 2
• …
• $F/2^K$ of them to have Tail0(h(x)) ≥ K
→ P[Tail0(h(x)) ≥ K] = $1/2^K$, so the largest K we observe is roughly where $F/2^K \approx 1$
So F’ = $2^K$ is a pretty good guess of F
How good is the result?
• F: the true number of distinct elements
• F’: guess by FM sketch
• We can show that for all c > 3, P[F/c ≤ F’ ≤ cF] > 1 – 3/c
• But that’s not very accurate!
Use more sketches!
• Use the “median of means” trick
• Maintain a × b FM sketches
– Use independent hash functions!
– F11, F12, …, F1a
  F21, F22, …, F2a
  ...
  Fb1, Fb2, …, Fba
• Compute the mean over each group of a
– Gi = mean(Fi1, Fi2, …, Fia)
• Return the median of the b means as the answer
– Median(G1, G2, …, Gb)
– Very close to the true count F for sufficiently large a and b
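The median-of-means step itself is short. The sketch below assumes the a × b estimates are already available as b rows of a values each; the function name `median_of_means` and the sample numbers are made up for illustration.

```python
from statistics import mean, median
from typing import Sequence


def median_of_means(estimates: Sequence[Sequence[float]]) -> float:
    """estimates holds b groups, each with a individual FM estimates F_i1..F_ia."""
    group_means = [mean(group) for group in estimates]  # G_1, ..., G_b
    return median(group_means)


# Made-up estimates (powers of two, as FM sketches produce); b = 3 groups of a = 4.
rows = [
    [512, 1024, 2048, 1024],
    [1024, 1024, 256, 4096],
    [65536, 1024, 512, 2048],  # one wild outlier inflates this group's mean
]
print(median_of_means(rows))
```

Averaging within a group smooths the coarse power-of-two estimates, and taking the median across groups keeps a single group with an outlier from dominating the answer.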
Summary of FM sketch
• Helps estimate # of distinct elements in a large set that cannot be stored entirely
• Each FM sketch is very rough, but groups of them improve estimation
• Trick question: do FM sketches support membership check like Bloom filter?
– No: too much error on any particular check
– Specialization gives us better efficiency
Summary
• How do we maintain a random sample of size n for all data we’ve seen so far?
– Reservoir Sampling
• How do we maintain a data structure to check if a new arrival has appeared before?
– Bloom Filter
• How do we count the number of unique records seen so far?
– FM Sketch
Summary
Tricks for big data covered in class
• Parallel processing (e.g., MapReduce)
• Approximate processing
– Sampling (downsize data)
– Stream processing (linear time, limited space)
Image: http://bigdatapix.tumblr.com/