Scalable real-time processing techniques
How to almost count
Lars Albertsson, Schibsted
“We promised to count live...
...but since you can’t do that, we used historical numbers and this cool math to extrapolate.”
?!?
Stream counting is simple
You already have the building blocks
Yet many wait for batch execution
Or go through estimation hoops
Accurate counting
● Straightforward, with some plumbing.
● Heavier than you need.
[Diagram: Servers, Bus, Bucketisers, Aggregator]
Now or later? Exact or rough?
Approximation now >> accurate later
Basic scenarios
● How many distinct items in last x minutes?
● What are the top k items in last x minutes?
● How many Ys in last x minutes?
These base techniques are sufficient for implementing e.g. personalisation and recommendation algorithms.
Counting in context
● Look backward, different time windows, compare.
● Count for a small time quantum, keep history.
● Aggregate old windows.
● Monoid representations are desirable.
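A minimal sketch of the windowing idea above, assuming one-minute quanta and exact per-key counts. `WindowedCounter`, `QUANTUM_SECONDS` and the method names are hypothetical. `collections.Counter` stands in for the per-quantum summary because Counter addition is associative and has the empty Counter as identity, so it behaves as a monoid and old quanta can be merged in any grouping.

```python
from collections import Counter, defaultdict

QUANTUM_SECONDS = 60  # assumed quantum size; the slides do not give one


class WindowedCounter:
    """Keeps one Counter per time quantum (hypothetical helper)."""

    def __init__(self):
        self.buckets = defaultdict(Counter)  # quantum index -> counts

    def add(self, key, timestamp):
        """Count one event in the quantum that contains `timestamp`."""
        self.buckets[int(timestamp // QUANTUM_SECONDS)][key] += 1

    def window(self, now, quanta):
        """Merge the last `quanta` quanta, including the current one."""
        current = int(now // QUANTUM_SECONDS)
        total = Counter()  # monoid identity
        for q in range(current - quanta + 1, current + 1):
            total += self.buckets.get(q, Counter())  # monoid combine
        return total


wc = WindowedCounter()
wc.add("U2", 10)      # falls in quantum 0
wc.add("U2", 70)      # falls in quantum 1
wc.add("Gaga", 130)   # falls in quantum 2
print(wc.window(now=130, quanta=2))  # counts from quanta 1 and 2 only
```

Any summary with an associative merge (a bitmap, a sketch) could replace the Counter without changing the windowing code.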
Cardinality - distinct stream count
● Naive: Set of hashes. X bits per item.
● Naive 2: Set approximation with Bloom filter + counter.
● Naive 3: Hash to bitmap. Count bits.
● Attempt 4: Hash, bitmap, count + collision compensation. Linear Probabilistic Counter.
● Read papers… -> HyperLogLog counter
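A minimal sketch of "Attempt 4", the linear probabilistic counter: hash each item to one bit of a bitmap, then estimate the distinct count from the fraction of bits still zero. The class name, the SHA-1 based hash and the bitmap size are assumptions chosen for illustration, not taken from the slides. HyperLogLog is not shown here.

```python
import hashlib
import math


class LinearCounter:
    """Linear probabilistic counter (illustrative; sizes are assumed)."""

    def __init__(self, num_bits=8192):
        self.num_bits = num_bits
        self.bitmap = bytearray(num_bits)  # one byte per bit, for clarity

    def add(self, item):
        """Hash the item to a bit position and set that bit."""
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % self.num_bits
        self.bitmap[index] = 1

    def estimate(self):
        """Estimate the number of distinct items added so far."""
        zeros = self.num_bits - sum(self.bitmap)
        if zeros == 0:
            raise ValueError("bitmap saturated; use a larger num_bits")
        # Collision compensation: n ~= -m * ln(fraction of zero bits).
        return -self.num_bits * math.log(zeros / self.num_bits)

    def merge(self, other):
        """Union of two equally sized counters: bitwise OR of the bitmaps."""
        merged = LinearCounter(self.num_bits)
        merged.bitmap = bytearray(a | b for a, b in zip(self.bitmap, other.bitmap))
        return merged


counter = LinearCounter()
for i in range(1000):
    counter.add(f"user-{i}")
    counter.add(f"user-{i}")  # duplicates set the same bit again
print(round(counter.estimate()))  # close to 1000, not exact
```

Because `merge` is a bitwise OR, counters for separate time quanta can be combined into a longer window without rereading the stream.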
Cardinality - distinct stream count
Source: Shakespeare, highscalability.com
Top K counting
Step 1        Step 2        Step 3
U2     65     U2     65     U2     65
Gaga   46     Gaga   46     Gaga   46
Avicii 23     Avicii 23     Avicii 23
Eminem 21     Eminem 21     Eminem 21
Dolly  18     Peps   19     Dolly  20
● Keep k items, assume absentees have lowest value
● Accurate at top, overcounting in bottom
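The rule in the bullets above can be sketched as follows, assuming an in-memory dict as the table. The function name `top_k_update` is hypothetical. A newcomer takes over the slot of the lowest entry and inherits its count plus one, which is where the overcounting at the bottom comes from.

```python
def top_k_update(counts, key, k):
    """Count `key` in a table that holds at most `k` entries."""
    if key in counts:
        counts[key] += 1
    elif len(counts) < k:
        counts[key] = 1
    else:
        # Absentee: assume it had the lowest tracked count, evict that entry.
        victim = min(counts, key=counts.get)
        counts[key] = counts.pop(victim) + 1
    return counts


table = {"U2": 65, "Gaga": 46, "Avicii": 23, "Eminem": 21, "Dolly": 18}
top_k_update(table, "Peps", k=5)   # Peps replaces Dolly with 18 + 1 = 19
top_k_update(table, "Dolly", k=5)  # Dolly replaces Peps with 19 + 1 = 20
print(table)
```

The top entries are never evicted in this run, so their counts stay exact, while the last slot drifts upward with every swap.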
Approx counting - Count-Min Sketch
● Compute n hashes for key, one per row.
● Increment one cell in each row; column = hash mod row width.
● Retrieve by min() over rows.
3 7 20 3 11 6 3+1 4 1 1
3 8 6 2+1 17 13 1 0 4 5
12 7 6 14 2 0 2 3 6+1 7
3 2 12 8+1 10 2 7 2 11 2
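A minimal Count-Min Sketch following the three steps above, sized 4 x 10 like the grid. The class name and the way the per-row hashes are derived (SHA-1 salted with the row number) are assumptions for illustration.

```python
import hashlib


class CountMinSketch:
    """Count-Min Sketch with `depth` rows and `width` columns."""

    def __init__(self, depth=4, width=10):
        self.depth = depth
        self.width = width
        self.rows = [[0] * width for _ in range(depth)]

    def _cells(self, key):
        # One hash per row, derived by salting with the row number
        # (an assumed scheme; any family of independent hashes would do).
        for row in range(self.depth):
            digest = hashlib.sha1(f"{row}:{key}".encode("utf-8")).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.width

    def add(self, key, amount=1):
        """Increment exactly one cell in each row."""
        for row, col in self._cells(key):
            self.rows[row][col] += amount

    def estimate(self, key):
        """Smallest of the key's cells; never below the true count."""
        return min(self.rows[row][col] for row, col in self._cells(key))


cms = CountMinSketch()
for song in ["U2", "U2", "Gaga"]:
    cms.add(song)
print(cms.estimate("U2"))  # at least 2; collisions can only inflate it
```

Taking the minimum picks the row where the key suffered the fewest collisions, so the error is one-sided.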
Top K with Count-Min Sketch
Step 1        Step 2        Step 3
U2     65     U2     65     U2     65
Gaga   46     Gaga   46     Gaga   46
Avicii 23     Avicii 23     Avicii 23
Eminem 21     Eminem 21     Eminem 21
Dolly  18     Peps    2     Dolly  19
● Keep Heavy Hitters list.
● Lookup absentees in CMS.
● Risk of overcount is smaller and spread out.
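A rough sketch of the combination above: every event goes into a Count-Min Sketch, and a small heavy hitters list is refreshed from the sketch's estimates. All names here are hypothetical. The slides do not spell out the eviction rule, so the one used below (replace the smallest entry only when the newcomer's estimate is larger) is an assumption.

```python
import hashlib


class CountMinSketch:
    """Compact Count-Min Sketch; depth and width are assumed values."""

    def __init__(self, depth=4, width=1024):
        self.depth, self.width = depth, width
        self.rows = [[0] * width for _ in range(depth)]

    def _cells(self, key):
        for row in range(self.depth):
            digest = hashlib.sha1(f"{row}:{key}".encode("utf-8")).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.width

    def add(self, key):
        """Increment the key's cells and return its new estimate."""
        for row, col in self._cells(key):
            self.rows[row][col] += 1
        return min(self.rows[row][col] for row, col in self._cells(key))


class HeavyHitters:
    """Top-k list backed by a Count-Min Sketch (illustrative)."""

    def __init__(self, k, sketch=None):
        self.k = k
        self.sketch = sketch if sketch is not None else CountMinSketch()
        self.top = {}  # key -> estimated count, at most k entries

    def add(self, key):
        estimate = self.sketch.add(key)  # every event is counted in the CMS
        if key in self.top or len(self.top) < self.k:
            self.top[key] = estimate
            return
        # Absentee: its count comes from the CMS, not from the evicted entry.
        victim = min(self.top, key=self.top.get)
        if estimate > self.top[victim]:  # assumed eviction rule
            del self.top[victim]
            self.top[key] = estimate

    def ranking(self):
        return sorted(self.top.items(), key=lambda kv: kv[1], reverse=True)


hh = HeavyHitters(k=2)
for song in ["U2"] * 5 + ["Gaga"] * 3 + ["Peps"]:
    hh.add(song)
print(hh.ranking())
```

An evicted key keeps its history in the sketch, so it re-enters the list with roughly its real count instead of the evicted entry's count plus one.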
Cubic CMS
● Decorate song with geo, age, etc. Pour into CMS.
● Keep heavy hitters per geo, age group.
*:*:<U2>         +1
SE:*:<U2>        +1
*:31-40:<U2>     +1
SE:31-40:<U2>    +1
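The key decoration above can be sketched like this, assuming two dimensions (geo and age group) and the `geo:age:<song>` key layout shown in the example. `cube_keys` is a hypothetical helper, and a plain `Counter` stands in for the Count-Min Sketch.

```python
from collections import Counter
from itertools import product


def cube_keys(song, geo, age):
    """All wildcard combinations of the dimensions, e.g. 'SE:*:<U2>'."""
    return [f"{g}:{a}:<{song}>" for g, a in product(("*", geo), ("*", age))]


counts = Counter()  # exact stand-in for the Count-Min Sketch
for key in cube_keys("U2", "SE", "31-40"):
    counts[key] += 1  # one event increments every slice it belongs to

for key, value in counts.items():
    print(key, value)
```

Each extra dimension doubles the number of keys per event, which is presumably why a compact sketch is attractive here.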
Machinery
O(10^4) messages / s per machine.
You probably only need one. If not, use Storm.
Read and write to pub/sub channel, e.g. Kafka or ZeroMQ.
Brute force alternative
Dump every single message into ElasticSearch.
Suitable for high dimensionality cubes.
Recommendations, you said?
● Collaborative filtering - similarity matrix
(columns: Users, rows: Items)
2 4 1 1 5 2
0 1 7 1 0 6
5 2 9 0 3 0
3 8 0 6 0 7
Shave the matrix
(columns: Users, rows: Items)

2 4 1 1 5 2
0 1 7 1 0 6
5 2 9 0 3 0
3 8 0 6 0 7

Flip (to x,y value entries; y counted from the bottom row):
0,0 3
0,1 5
0,2 0
0,3 2
1,0 8
... ...

Sort:
2,1 9
1,0 8
2,2 7
5,0 7
5,2 6
... ...

Cut:
2,1 9
1,0 8
2,2 7
5,0 7
5,2 6

0 0 0 0 0 0
0 0 7 0 0 6
0 0 9 0 0 0
0 8 0 0 0 7

Noise removed - fine for recommendations
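The flip, sort and cut steps can be sketched on the example matrix as follows. The function name `shave` and the `keep` parameter are hypothetical. Entries are indexed (row, column) from the top-left here, and ties are broken by row-major order, which is an assumption that happens to reproduce the slide's result.

```python
def shave(matrix, keep):
    """Zero every entry except the `keep` largest ones."""
    # Flip: matrix -> list of ((row, col), value) entries.
    entries = [
        ((r, c), value)
        for r, row in enumerate(matrix)
        for c, value in enumerate(row)
    ]
    # Sort by value, largest first; the sort is stable, so ties keep
    # their row-major order.
    entries.sort(key=lambda entry: entry[1], reverse=True)
    # Cut: keep only the top entries and rebuild the matrix around them.
    kept = dict(entries[:keep])
    return [
        [kept.get((r, c), 0) for c in range(len(row))]
        for r, row in enumerate(matrix)
    ]


similarity = [
    [2, 4, 1, 1, 5, 2],
    [0, 1, 7, 1, 0, 6],
    [5, 2, 9, 0, 3, 0],
    [3, 8, 0, 6, 0, 7],
]
for row in shave(similarity, keep=5):
    print(row)
```

On a real stream the full sort would likely be replaced by a top-k structure, but the result is the same sparse matrix.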
Hungry for more?
Mikio Braun: http://www.berlinbuzzwords.de/session/real-time-personalization-and-recommendation-stream-mining
Ted Dunning on deep learning for real-time anomaly detection: http://www.berlinbuzzwords.de/session/deep-learning-high-performance-time-series-databases
Ted Dunning on Storm: http://www.youtube.com/watch?v=7PcmbI5aC20
Open source: stream-lib, Algebird