Big Data Infrastructure - GitHub Pageslintool.github.io/bigdata-2016w/slides/week12a.pdf · CS...
Transcript of Big Data Infrastructure - GitHub Pageslintool.github.io/bigdata-2016w/slides/week12a.pdf · CS...
![Page 1: Big Data Infrastructure - GitHub Pageslintool.github.io/bigdata-2016w/slides/week12a.pdf · CS 489/698 Big Data Infrastructure (Winter 2016) Jimmy Lin David R. Cheriton School of](https://reader033.fdocuments.in/reader033/viewer/2022042220/5ec5fe5f1ef0f36d12360f85/html5/thumbnails/1.jpg)
Big Data Infrastructure
Week 12: Real-Time Data Analytics (1/2)
This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States���See http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details
CS 489/698 Big Data Infrastructure (Winter 2016)
Jimmy LinDavid R. Cheriton School of Computer Science
University of Waterloo
March 29, 2016
These slides are available at http://lintool.github.io/bigdata-2016w/
![Page 2: Big Data Infrastructure - GitHub Pageslintool.github.io/bigdata-2016w/slides/week12a.pdf · CS 489/698 Big Data Infrastructure (Winter 2016) Jimmy Lin David R. Cheriton School of](https://reader033.fdocuments.in/reader033/viewer/2022042220/5ec5fe5f1ef0f36d12360f85/html5/thumbnails/2.jpg)
OLTP/OLAP Architecture
OLTP OLAPETL���
(Extract, Transform, and Load)
![Page 3: Big Data Infrastructure - GitHub Pageslintool.github.io/bigdata-2016w/slides/week12a.pdf · CS 489/698 Big Data Infrastructure (Winter 2016) Jimmy Lin David R. Cheriton School of](https://reader033.fdocuments.in/reader033/viewer/2022042220/5ec5fe5f1ef0f36d12360f85/html5/thumbnails/3.jpg)
Twitter’s data warehousing architectureWhat’s the issue?
![Page 4: Big Data Infrastructure - GitHub Pageslintool.github.io/bigdata-2016w/slides/week12a.pdf · CS 489/698 Big Data Infrastructure (Winter 2016) Jimmy Lin David R. Cheriton School of](https://reader033.fdocuments.in/reader033/viewer/2022042220/5ec5fe5f1ef0f36d12360f85/html5/thumbnails/4.jpg)
real-time vs.
online vs.
streaming
![Page 5: Big Data Infrastructure - GitHub Pageslintool.github.io/bigdata-2016w/slides/week12a.pdf · CS 489/698 Big Data Infrastructure (Winter 2016) Jimmy Lin David R. Cheriton School of](https://reader033.fdocuments.in/reader033/viewer/2022042220/5ec5fe5f1ef0f36d12360f85/html5/thumbnails/5.jpg)
What is a data stream?¢ Sequence of items:
l Structured (e.g., tuples)l Ordered (implicitly or timestamped)
l Arriving continuously at high volumes
l Sometimes not possible to store entirely
l Sometimes not possible to even examine all items
![Page 6: Big Data Infrastructure - GitHub Pageslintool.github.io/bigdata-2016w/slides/week12a.pdf · CS 489/698 Big Data Infrastructure (Winter 2016) Jimmy Lin David R. Cheriton School of](https://reader033.fdocuments.in/reader033/viewer/2022042220/5ec5fe5f1ef0f36d12360f85/html5/thumbnails/6.jpg)
What to do with data streams?¢ Network traffic monitoring
¢ Datacenter telemetry monitoring
¢ Sensor networks monitoring
¢ Credit card fraud detection
¢ Stock market analysis
¢ Online mining of click streams
¢ Monitoring social media streams
![Page 7: Big Data Infrastructure - GitHub Pageslintool.github.io/bigdata-2016w/slides/week12a.pdf · CS 489/698 Big Data Infrastructure (Winter 2016) Jimmy Lin David R. Cheriton School of](https://reader033.fdocuments.in/reader033/viewer/2022042220/5ec5fe5f1ef0f36d12360f85/html5/thumbnails/7.jpg)
What’s the scale? Packet data streams¢ Single 2 Gb/sec link; say avg. packet size is 50 bytes
l Number of packets/sec = 5 million l Time per packet = 0.2 microseconds
¢ If we only capture header information per packet: ���source/destination IP, time, no. of bytes, etc. – at least 10 bytesl 50 MB per second
l 4+ TB per day
l Per link!
What if you wanted to do deep-packet inspection?Source: Minos Garofalakis, Berkeley CS 286
![Page 8: Big Data Infrastructure - GitHub Pageslintool.github.io/bigdata-2016w/slides/week12a.pdf · CS 489/698 Big Data Infrastructure (Winter 2016) Jimmy Lin David R. Cheriton School of](https://reader033.fdocuments.in/reader033/viewer/2022042220/5ec5fe5f1ef0f36d12360f85/html5/thumbnails/8.jpg)
Converged IP/MPLS Network
PSTN
DSL/Cable Networks
Enterprise Networks
Network Operations Center (NOC)
BGP Peer
R1 R2 R3
What are the top (most frequent) 1000 (source, dest) pairs seen by R1 over the last month?
SELECT COUNT (R1.source, R1.dest) FROM R1, R2 WHERE R1.source = R2.source
SQL Join Query
How many distinct (source, dest) pairs have been seen by both R1 and R2 but not R3?
Set-Expression Query
Off-line analysis – Data access is slow, expensive
DBMS
Source: Minos Garofalakis, Berkeley CS 286
![Page 9: Big Data Infrastructure - GitHub Pageslintool.github.io/bigdata-2016w/slides/week12a.pdf · CS 489/698 Big Data Infrastructure (Winter 2016) Jimmy Lin David R. Cheriton School of](https://reader033.fdocuments.in/reader033/viewer/2022042220/5ec5fe5f1ef0f36d12360f85/html5/thumbnails/9.jpg)
Common Architecture
¢ Data stream management system (DSMS) at observation pointsl Voluminous streams-in, reduced streams-out
¢ Database management system (DBMS)l Outputs of DSMS can be treated as data feeds to databases
DSMS
DSMS
DBMS
data streamsqueries
queriesdata feeds
Source: Peter Bonz
![Page 10: Big Data Infrastructure - GitHub Pageslintool.github.io/bigdata-2016w/slides/week12a.pdf · CS 489/698 Big Data Infrastructure (Winter 2016) Jimmy Lin David R. Cheriton School of](https://reader033.fdocuments.in/reader033/viewer/2022042220/5ec5fe5f1ef0f36d12360f85/html5/thumbnails/10.jpg)
OLTP/OLAP Architecture
OLTP OLAPETL���
(Extract, Transform, and Load)
![Page 11: Big Data Infrastructure - GitHub Pageslintool.github.io/bigdata-2016w/slides/week12a.pdf · CS 489/698 Big Data Infrastructure (Winter 2016) Jimmy Lin David R. Cheriton School of](https://reader033.fdocuments.in/reader033/viewer/2022042220/5ec5fe5f1ef0f36d12360f85/html5/thumbnails/11.jpg)
DBMS vs. DSMS
DBMSl Model: persistent relations l Relation: tuple set/bag
l Data update: modifications
l Query: transient
l Query answer: exactl Query evaluation: arbitrary
l Query plan: fixed
DSMSl Model: (mostly) transient relations l Relation: tuple sequence
l Data update: appends
l Query: persistent
l Query answer: approximatel Query evaluation: one pass
l Query plan: adaptive
Source: Peter Bonz
![Page 12: Big Data Infrastructure - GitHub Pageslintool.github.io/bigdata-2016w/slides/week12a.pdf · CS 489/698 Big Data Infrastructure (Winter 2016) Jimmy Lin David R. Cheriton School of](https://reader033.fdocuments.in/reader033/viewer/2022042220/5ec5fe5f1ef0f36d12360f85/html5/thumbnails/12.jpg)
What makes it hard?¢ Intrinsic challenges:
l Volumel Velocity
l Limited storage
l Strict latency requirements
¢ System challenges:l Load balancing
l Unreliable and out-of-order message deliveryl Fault-tolerance
l Consistency semantics (at most once, exactly once, at least once)
![Page 13: Big Data Infrastructure - GitHub Pageslintool.github.io/bigdata-2016w/slides/week12a.pdf · CS 489/698 Big Data Infrastructure (Winter 2016) Jimmy Lin David R. Cheriton School of](https://reader033.fdocuments.in/reader033/viewer/2022042220/5ec5fe5f1ef0f36d12360f85/html5/thumbnails/13.jpg)
What exactly do you do?¢ “Standard” relational operations:
l Selectl Project
l Transform (i.e., apply custom UDF)
l Group by
l Joinl Aggregations
¢ What else do you need to make this “work”?
![Page 14: Big Data Infrastructure - GitHub Pageslintool.github.io/bigdata-2016w/slides/week12a.pdf · CS 489/698 Big Data Infrastructure (Winter 2016) Jimmy Lin David R. Cheriton School of](https://reader033.fdocuments.in/reader033/viewer/2022042220/5ec5fe5f1ef0f36d12360f85/html5/thumbnails/14.jpg)
Issues of Semantics¢ Group by… aggregate
l When do you stop grouping and start aggregating?
¢ Joining a stream and a static sourcel Simple lookup
¢ Joining two streamsl How long do you wait for the join key in the other stream?
¢ Joining two streams, group by and aggregationl When do you stop joining?
What’s the solution?
![Page 15: Big Data Infrastructure - GitHub Pageslintool.github.io/bigdata-2016w/slides/week12a.pdf · CS 489/698 Big Data Infrastructure (Winter 2016) Jimmy Lin David R. Cheriton School of](https://reader033.fdocuments.in/reader033/viewer/2022042220/5ec5fe5f1ef0f36d12360f85/html5/thumbnails/15.jpg)
Windows¢ Mechanism for extracting finite relations from an infinite stream
¢ Windows restrict processing scope:
l Windows based on ordering attributes (e.g., time) l Windows based on item (record) counts
l Windows based on explicit markers (e.g., punctuations)
l Variants (e.g., some semantic partitioning constraint)
![Page 16: Big Data Infrastructure - GitHub Pageslintool.github.io/bigdata-2016w/slides/week12a.pdf · CS 489/698 Big Data Infrastructure (Winter 2016) Jimmy Lin David R. Cheriton School of](https://reader033.fdocuments.in/reader033/viewer/2022042220/5ec5fe5f1ef0f36d12360f85/html5/thumbnails/16.jpg)
Windows on Ordering Attributes¢ Assumes the existence of an attribute that defines the order of
stream elements (e.g., time)
¢ Let T be the window size in units of the ordering attribute
t1 t2 t3 t4 t1' t2’ t3’ t4’
t1 t2 t3
sliding window
tumbling window
ti’ – ti = T
ti+1 – ti = T
Source: Peter Bonz
![Page 17: Big Data Infrastructure - GitHub Pageslintool.github.io/bigdata-2016w/slides/week12a.pdf · CS 489/698 Big Data Infrastructure (Winter 2016) Jimmy Lin David R. Cheriton School of](https://reader033.fdocuments.in/reader033/viewer/2022042220/5ec5fe5f1ef0f36d12360f85/html5/thumbnails/17.jpg)
Windows on Counts¢ Window of size N elements (sliding, tumbling) over the stream
¢ Challenges:
l Problematic with non-unique timestamps: non-deterministic outputl Unpredictable window size (and storage requirements)
t1 t2 t3 t1' t2’ t3’ t4’
Source: Peter Bonz
![Page 18: Big Data Infrastructure - GitHub Pageslintool.github.io/bigdata-2016w/slides/week12a.pdf · CS 489/698 Big Data Infrastructure (Winter 2016) Jimmy Lin David R. Cheriton School of](https://reader033.fdocuments.in/reader033/viewer/2022042220/5ec5fe5f1ef0f36d12360f85/html5/thumbnails/18.jpg)
Windows from “Punctuations”¢ Application-inserted “end-of-processing”
l Example: stream of actions… “end of user session”
¢ Propertiesl Advantage: application-controlled semantics
l Disadvantage: unpredictable window size (too large or too small)
![Page 19: Big Data Infrastructure - GitHub Pageslintool.github.io/bigdata-2016w/slides/week12a.pdf · CS 489/698 Big Data Infrastructure (Winter 2016) Jimmy Lin David R. Cheriton School of](https://reader033.fdocuments.in/reader033/viewer/2022042220/5ec5fe5f1ef0f36d12360f85/html5/thumbnails/19.jpg)
Common Techniques
Source: Wikipedia (Forge)
![Page 20: Big Data Infrastructure - GitHub Pageslintool.github.io/bigdata-2016w/slides/week12a.pdf · CS 489/698 Big Data Infrastructure (Winter 2016) Jimmy Lin David R. Cheriton School of](https://reader033.fdocuments.in/reader033/viewer/2022042220/5ec5fe5f1ef0f36d12360f85/html5/thumbnails/20.jpg)
“Hello World” Stream Processing¢ Problem:
l Count the frequency of items in the stream
¢ Why?l Take some action when frequency exceeds a threshold
l Data mining: raw counts → co-occurring counts → association rules
![Page 21: Big Data Infrastructure - GitHub Pageslintool.github.io/bigdata-2016w/slides/week12a.pdf · CS 489/698 Big Data Infrastructure (Winter 2016) Jimmy Lin David R. Cheriton School of](https://reader033.fdocuments.in/reader033/viewer/2022042220/5ec5fe5f1ef0f36d12360f85/html5/thumbnails/21.jpg)
The Raw Stream…
stream
items
Source: Peter Bonz
![Page 22: Big Data Infrastructure - GitHub Pageslintool.github.io/bigdata-2016w/slides/week12a.pdf · CS 489/698 Big Data Infrastructure (Winter 2016) Jimmy Lin David R. Cheriton School of](https://reader033.fdocuments.in/reader033/viewer/2022042220/5ec5fe5f1ef0f36d12360f85/html5/thumbnails/22.jpg)
Divide Into Windows…
window 1 window 2 window 3
Source: Peter Bonz
![Page 23: Big Data Infrastructure - GitHub Pageslintool.github.io/bigdata-2016w/slides/week12a.pdf · CS 489/698 Big Data Infrastructure (Winter 2016) Jimmy Lin David R. Cheriton School of](https://reader033.fdocuments.in/reader033/viewer/2022042220/5ec5fe5f1ef0f36d12360f85/html5/thumbnails/23.jpg)
First Window
Source: Peter Bonz
![Page 24: Big Data Infrastructure - GitHub Pageslintool.github.io/bigdata-2016w/slides/week12a.pdf · CS 489/698 Big Data Infrastructure (Winter 2016) Jimmy Lin David R. Cheriton School of](https://reader033.fdocuments.in/reader033/viewer/2022042220/5ec5fe5f1ef0f36d12360f85/html5/thumbnails/24.jpg)
Second Window
Next Window
+
FrequencyCounts
second window
frequency counts
FrequencyCounts
frequency counts
Source: Peter Bonz
![Page 25: Big Data Infrastructure - GitHub Pageslintool.github.io/bigdata-2016w/slides/week12a.pdf · CS 489/698 Big Data Infrastructure (Winter 2016) Jimmy Lin David R. Cheriton School of](https://reader033.fdocuments.in/reader033/viewer/2022042220/5ec5fe5f1ef0f36d12360f85/html5/thumbnails/25.jpg)
Window Counting¢ What’s the issue?
Lessons learned?Solutions are approximate (or lossy)
![Page 26: Big Data Infrastructure - GitHub Pageslintool.github.io/bigdata-2016w/slides/week12a.pdf · CS 489/698 Big Data Infrastructure (Winter 2016) Jimmy Lin David R. Cheriton School of](https://reader033.fdocuments.in/reader033/viewer/2022042220/5ec5fe5f1ef0f36d12360f85/html5/thumbnails/26.jpg)
General Strategies¢ Sampling
¢ Hashing
![Page 27: Big Data Infrastructure - GitHub Pageslintool.github.io/bigdata-2016w/slides/week12a.pdf · CS 489/698 Big Data Infrastructure (Winter 2016) Jimmy Lin David R. Cheriton School of](https://reader033.fdocuments.in/reader033/viewer/2022042220/5ec5fe5f1ef0f36d12360f85/html5/thumbnails/27.jpg)
Reservoir Sampling¢ Task: select s elements from a stream of size N with uniform
probabilityl N can be very very large
l We might not even know what N is! (infinite stream)
¢ Solution: Reservoir samplingl Store first s elements
l For the k-th element thereafter, keep with probability s/k ���(randomly discard an existing element)
¢ Example: s = 10l Keep first 10 elements
l 11th element: keep with 10/11l 12th element: keep with 10/12
l …
![Page 28: Big Data Infrastructure - GitHub Pageslintool.github.io/bigdata-2016w/slides/week12a.pdf · CS 489/698 Big Data Infrastructure (Winter 2016) Jimmy Lin David R. Cheriton School of](https://reader033.fdocuments.in/reader033/viewer/2022042220/5ec5fe5f1ef0f36d12360f85/html5/thumbnails/28.jpg)
Reservoir Sampling: How does it work?¢ Example: s = 10
l Keep first 10 elementsl 11th element: keep with 10/11
¢ General case: at the (k + 1)th elementl Probability of selecting each item up until now is s/k
l Probability existing item is discarded: s/(k+1) × 1/s = 1/(k + 1)
l Probability existing item survives: k/(k + 1)
l Probability each item survives to (k + 1)th round: ���(s/k) × k/(k + 1) = s/(k + 1)
If we decide to keep it: sampled uniformly by definitionprobability existing item is discarded: 10/11 × 1/10 = 1/11probability existing item survives: 10/11
![Page 29: Big Data Infrastructure - GitHub Pageslintool.github.io/bigdata-2016w/slides/week12a.pdf · CS 489/698 Big Data Infrastructure (Winter 2016) Jimmy Lin David R. Cheriton School of](https://reader033.fdocuments.in/reader033/viewer/2022042220/5ec5fe5f1ef0f36d12360f85/html5/thumbnails/29.jpg)
Hashing for Three Common Tasks¢ Cardinality estimation
l What’s the cardinality of set S?l How many unique visitors to this page?
¢ Set membershipl Is x a member of set S?
l Has this user seen this ad before?
¢ Frequency estimationl How many times have we observed x?
l How many queries has this user issued?
HashSet
HashSet
HashMap
HLL counter
Bloom Filter
CMS
![Page 30: Big Data Infrastructure - GitHub Pageslintool.github.io/bigdata-2016w/slides/week12a.pdf · CS 489/698 Big Data Infrastructure (Winter 2016) Jimmy Lin David R. Cheriton School of](https://reader033.fdocuments.in/reader033/viewer/2022042220/5ec5fe5f1ef0f36d12360f85/html5/thumbnails/30.jpg)
HyperLogLog Counter¢ Task: cardinality estimation of set
l size() → number of unique elements in the set
¢ Observation: hash each item and examine the hash codel On expectation, 1/2 of the hash codes will start with 1
l On expectation, 1/4 of the hash codes will start with 01
l On expectation, 1/8 of the hash codes will start with 001
l On expectation, 1/16 of the hash codes will start with 0001
l …
How do we take advantage of this observation?
![Page 31: Big Data Infrastructure - GitHub Pageslintool.github.io/bigdata-2016w/slides/week12a.pdf · CS 489/698 Big Data Infrastructure (Winter 2016) Jimmy Lin David R. Cheriton School of](https://reader033.fdocuments.in/reader033/viewer/2022042220/5ec5fe5f1ef0f36d12360f85/html5/thumbnails/31.jpg)
Bloom Filters¢ Task: keep track of set membership
l put(x) → insert x into the setl contains(x) → yes if x is a member of the set
¢ Componentsl m-bit bit vector
l k hash functions: h1 … hk
0 0 0 0 0 0 0 0 0 0 0 0
![Page 32: Big Data Infrastructure - GitHub Pageslintool.github.io/bigdata-2016w/slides/week12a.pdf · CS 489/698 Big Data Infrastructure (Winter 2016) Jimmy Lin David R. Cheriton School of](https://reader033.fdocuments.in/reader033/viewer/2022042220/5ec5fe5f1ef0f36d12360f85/html5/thumbnails/32.jpg)
Bloom Filters: put
0 0 0 0 0 0 0 0 0 0 0 0
xput h1(x) = 2h2(x) = 5h3(x) = 11
![Page 33: Big Data Infrastructure - GitHub Pageslintool.github.io/bigdata-2016w/slides/week12a.pdf · CS 489/698 Big Data Infrastructure (Winter 2016) Jimmy Lin David R. Cheriton School of](https://reader033.fdocuments.in/reader033/viewer/2022042220/5ec5fe5f1ef0f36d12360f85/html5/thumbnails/33.jpg)
Bloom Filters: put
0 1 0 0 1 0 0 0 0 0 1 0
xput
![Page 34: Big Data Infrastructure - GitHub Pageslintool.github.io/bigdata-2016w/slides/week12a.pdf · CS 489/698 Big Data Infrastructure (Winter 2016) Jimmy Lin David R. Cheriton School of](https://reader033.fdocuments.in/reader033/viewer/2022042220/5ec5fe5f1ef0f36d12360f85/html5/thumbnails/34.jpg)
Bloom Filters: contains
0 1 0 0 1 0 0 0 0 0 1 0
xcontains h1(x) = 2h2(x) = 5h3(x) = 11
![Page 35: Big Data Infrastructure - GitHub Pageslintool.github.io/bigdata-2016w/slides/week12a.pdf · CS 489/698 Big Data Infrastructure (Winter 2016) Jimmy Lin David R. Cheriton School of](https://reader033.fdocuments.in/reader033/viewer/2022042220/5ec5fe5f1ef0f36d12360f85/html5/thumbnails/35.jpg)
Bloom Filters: contains
0 1 0 0 1 0 0 0 0 0 1 0
xcontains h1(x) = 2h2(x) = 5h3(x) = 11
AND = YES A[h1(x)]A[h2(x)]A[h3(x)]
![Page 36: Big Data Infrastructure - GitHub Pageslintool.github.io/bigdata-2016w/slides/week12a.pdf · CS 489/698 Big Data Infrastructure (Winter 2016) Jimmy Lin David R. Cheriton School of](https://reader033.fdocuments.in/reader033/viewer/2022042220/5ec5fe5f1ef0f36d12360f85/html5/thumbnails/36.jpg)
Bloom Filters: contains
0 1 0 0 1 0 0 0 0 0 1 0
ycontains h1(y) = 2h2(y) = 6h3(y) = 9
![Page 37: Big Data Infrastructure - GitHub Pageslintool.github.io/bigdata-2016w/slides/week12a.pdf · CS 489/698 Big Data Infrastructure (Winter 2016) Jimmy Lin David R. Cheriton School of](https://reader033.fdocuments.in/reader033/viewer/2022042220/5ec5fe5f1ef0f36d12360f85/html5/thumbnails/37.jpg)
Bloom Filters: contains
0 1 0 0 1 0 0 0 0 0 1 0
ycontains h1(y) = 2h2(y) = 6h3(y) = 9
What’s going on here?
AND = NO A[h1(y)]A[h2(y)]A[h3(y)]
![Page 38: Big Data Infrastructure - GitHub Pageslintool.github.io/bigdata-2016w/slides/week12a.pdf · CS 489/698 Big Data Infrastructure (Winter 2016) Jimmy Lin David R. Cheriton School of](https://reader033.fdocuments.in/reader033/viewer/2022042220/5ec5fe5f1ef0f36d12360f85/html5/thumbnails/38.jpg)
Bloom Filters¢ Error properties: contains(x)
l False positives possiblel No false negatives
¢ Usage:l Constraints: capacity, error probability
l Tunable parameters: size of bit vector m, number of hash functions k
![Page 39: Big Data Infrastructure - GitHub Pageslintool.github.io/bigdata-2016w/slides/week12a.pdf · CS 489/698 Big Data Infrastructure (Winter 2016) Jimmy Lin David R. Cheriton School of](https://reader033.fdocuments.in/reader033/viewer/2022042220/5ec5fe5f1ef0f36d12360f85/html5/thumbnails/39.jpg)
Count-Min Sketches¢ Task: frequency estimation
l put(x) → increment count of x by onel get(x) → returns the frequency of x
¢ Componentsl k hash functions: h1 … hk
l m by k array of counters
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
m
k
![Page 40: Big Data Infrastructure - GitHub Pageslintool.github.io/bigdata-2016w/slides/week12a.pdf · CS 489/698 Big Data Infrastructure (Winter 2016) Jimmy Lin David R. Cheriton School of](https://reader033.fdocuments.in/reader033/viewer/2022042220/5ec5fe5f1ef0f36d12360f85/html5/thumbnails/40.jpg)
Count-Min Sketches: put
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
xput h1(x) = 2h2(x) = 5h3(x) = 11h4(x) = 4
![Page 41: Big Data Infrastructure - GitHub Pageslintool.github.io/bigdata-2016w/slides/week12a.pdf · CS 489/698 Big Data Infrastructure (Winter 2016) Jimmy Lin David R. Cheriton School of](https://reader033.fdocuments.in/reader033/viewer/2022042220/5ec5fe5f1ef0f36d12360f85/html5/thumbnails/41.jpg)
Count-Min Sketches: put
0 1 0 0 0 0 0 0 0 0 0 0
0 0 0 0 1 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 1 0
0 0 0 1 0 0 0 0 0 0 0 0
xput
![Page 42: Big Data Infrastructure - GitHub Pageslintool.github.io/bigdata-2016w/slides/week12a.pdf · CS 489/698 Big Data Infrastructure (Winter 2016) Jimmy Lin David R. Cheriton School of](https://reader033.fdocuments.in/reader033/viewer/2022042220/5ec5fe5f1ef0f36d12360f85/html5/thumbnails/42.jpg)
Count-Min Sketches: put
0 1 0 0 0 0 0 0 0 0 0 0
0 0 0 0 1 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 1 0
0 0 0 1 0 0 0 0 0 0 0 0
xput h1(x) = 2h2(x) = 5h3(x) = 11h4(x) = 4
![Page 43: Big Data Infrastructure - GitHub Pageslintool.github.io/bigdata-2016w/slides/week12a.pdf · CS 489/698 Big Data Infrastructure (Winter 2016) Jimmy Lin David R. Cheriton School of](https://reader033.fdocuments.in/reader033/viewer/2022042220/5ec5fe5f1ef0f36d12360f85/html5/thumbnails/43.jpg)
Count-Min Sketches: put
0 2 0 0 0 0 0 0 0 0 0 0
0 0 0 0 2 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 2 0
0 0 0 2 0 0 0 0 0 0 0 0
xput
![Page 44: Big Data Infrastructure - GitHub Pageslintool.github.io/bigdata-2016w/slides/week12a.pdf · CS 489/698 Big Data Infrastructure (Winter 2016) Jimmy Lin David R. Cheriton School of](https://reader033.fdocuments.in/reader033/viewer/2022042220/5ec5fe5f1ef0f36d12360f85/html5/thumbnails/44.jpg)
Count-Min Sketches: put
0 2 0 0 0 0 0 0 0 0 0 0
0 0 0 0 2 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 2 0
0 0 0 2 0 0 0 0 0 0 0 0
yput h1(y) = 6h2(y) = 5h3(y) = 12h4(y) = 2
![Page 45: Big Data Infrastructure - GitHub Pageslintool.github.io/bigdata-2016w/slides/week12a.pdf · CS 489/698 Big Data Infrastructure (Winter 2016) Jimmy Lin David R. Cheriton School of](https://reader033.fdocuments.in/reader033/viewer/2022042220/5ec5fe5f1ef0f36d12360f85/html5/thumbnails/45.jpg)
Count-Min Sketches: put
0 2 0 0 0 1 0 0 0 0 0 0
0 0 0 0 3 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 2 1
0 1 0 2 0 0 0 0 0 0 0 0
yput
![Page 46: Big Data Infrastructure - GitHub Pageslintool.github.io/bigdata-2016w/slides/week12a.pdf · CS 489/698 Big Data Infrastructure (Winter 2016) Jimmy Lin David R. Cheriton School of](https://reader033.fdocuments.in/reader033/viewer/2022042220/5ec5fe5f1ef0f36d12360f85/html5/thumbnails/46.jpg)
Count-Min Sketches: get
0 2 0 0 0 1 0 0 0 0 0 0
0 0 0 0 3 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 2 1
0 1 0 2 0 0 0 0 0 0 0 0
xget h1(x) = 2h2(x) = 5h3(x) = 11h4(x) = 4
![Page 47: Big Data Infrastructure - GitHub Pageslintool.github.io/bigdata-2016w/slides/week12a.pdf · CS 489/698 Big Data Infrastructure (Winter 2016) Jimmy Lin David R. Cheriton School of](https://reader033.fdocuments.in/reader033/viewer/2022042220/5ec5fe5f1ef0f36d12360f85/html5/thumbnails/47.jpg)
Count-Min Sketches: get
0 2 0 0 0 1 0 0 0 0 0 0
0 0 0 0 3 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 2 1
0 1 0 2 0 0 0 0 0 0 0 0
xget h1(x) = 2h2(x) = 5h3(x) = 11h4(x) = 4
A[h3(x)]MIN = 2
A[h1(x)]A[h2(x)]
A[h4(x)]
![Page 48: Big Data Infrastructure - GitHub Pageslintool.github.io/bigdata-2016w/slides/week12a.pdf · CS 489/698 Big Data Infrastructure (Winter 2016) Jimmy Lin David R. Cheriton School of](https://reader033.fdocuments.in/reader033/viewer/2022042220/5ec5fe5f1ef0f36d12360f85/html5/thumbnails/48.jpg)
Count-Min Sketches: get
0 2 0 0 0 1 0 0 0 0 0 0
0 0 0 0 3 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 2 1
0 1 0 2 0 0 0 0 0 0 0 0
yget h1(y) = 6h2(y) = 5h3(y) = 12h4(y) = 2
![Page 49: Big Data Infrastructure - GitHub Pageslintool.github.io/bigdata-2016w/slides/week12a.pdf · CS 489/698 Big Data Infrastructure (Winter 2016) Jimmy Lin David R. Cheriton School of](https://reader033.fdocuments.in/reader033/viewer/2022042220/5ec5fe5f1ef0f36d12360f85/html5/thumbnails/49.jpg)
Count-Min Sketches: get
0 2 0 0 0 1 0 0 0 0 0 0
0 0 0 0 3 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 2 1
0 1 0 2 0 0 0 0 0 0 0 0
yget h1(y) = 6h2(y) = 5h3(y) = 12h4(y) = 2
MIN = 1 A[h3(y)]
A[h1(y)]A[h2(y)]
A[h4(y)]
![Page 50: Big Data Infrastructure - GitHub Pageslintool.github.io/bigdata-2016w/slides/week12a.pdf · CS 489/698 Big Data Infrastructure (Winter 2016) Jimmy Lin David R. Cheriton School of](https://reader033.fdocuments.in/reader033/viewer/2022042220/5ec5fe5f1ef0f36d12360f85/html5/thumbnails/50.jpg)
Count-Min Sketches¢ Error properties:
l Reasonable estimation of heavy-hittersl Frequent over-estimation of tail
¢ Usage:l Constraints: number of distinct events, distribution of events, error
boundsl Tunable parameters: number of counters m, number of hash functions k,
size of counters
![Page 51: Big Data Infrastructure - GitHub Pageslintool.github.io/bigdata-2016w/slides/week12a.pdf · CS 489/698 Big Data Infrastructure (Winter 2016) Jimmy Lin David R. Cheriton School of](https://reader033.fdocuments.in/reader033/viewer/2022042220/5ec5fe5f1ef0f36d12360f85/html5/thumbnails/51.jpg)
Three Common Tasks¢ Cardinality estimation
l What’s the cardinality of set S?l How many unique visitors to this page?
¢ Set membershipl Is x a member of set S?
l Has this user seen this ad before?
¢ Frequency estimationl How many times have we observed x?
l How many queries has this user issued?
HashSet
HashSet
HashMap
HLL counter
Bloom Filter
CMS
![Page 52: Big Data Infrastructure - GitHub Pageslintool.github.io/bigdata-2016w/slides/week12a.pdf · CS 489/698 Big Data Infrastructure (Winter 2016) Jimmy Lin David R. Cheriton School of](https://reader033.fdocuments.in/reader033/viewer/2022042220/5ec5fe5f1ef0f36d12360f85/html5/thumbnails/52.jpg)
Source: Wikipedia (River)
Stream Processing ArchitecturesNext time:
![Page 53: Big Data Infrastructure - GitHub Pageslintool.github.io/bigdata-2016w/slides/week12a.pdf · CS 489/698 Big Data Infrastructure (Winter 2016) Jimmy Lin David R. Cheriton School of](https://reader033.fdocuments.in/reader033/viewer/2022042220/5ec5fe5f1ef0f36d12360f85/html5/thumbnails/53.jpg)
Source: Wikipedia (Japanese rock garden)
Questions?