Hokusai - Sketching streams in real time

Post on 14-Apr-2017


Hokusai: Sketching streams in real time

Sergiy Matusevych (1), Alexander J. Smola (2), Amr Ahmed (2)

(1) Yahoo! Research, Santa Clara, CA
(2) Google, Mountain View, CA

UAI 2012

Matusevych, Smola, Ahmed Hokusai — Sketching streams in real time UAI 2012 1

Thanks

Alex Smola, Google and CMU

Amr Ahmed, Google



Motivation

- Compute frequencies of elements in the data stream.
- Item frequencies change over time.
- Number of items unknown and variable.
- Example: logging query frequency over time.

- Applications
  - Flow counting for IP traffic (who sent what, when, and how much)
  - Spam detection and filtering (detect bursts immediately)
  - Website analytics (feedback to editors, trend detection)

- State of the art
  - CountMin sketch is instantaneous but does not log time.
  - Naive snapshotting costs linear memory.
  - MapReduce batch job provides exact counts but with long delays.

- Resource constraints
  - Fixed memory footprint for entire sketch regardless of duration
  - High query throughput
  - Real time aggregation and response


Strategy

1. Use CountMin sketch to store snapshots of data (this solves the real time logging problem).

2. Compress snapshots linearly as they age.
   - We care most about recent events.
   - Logarithmic storage since $\sum_{t=1}^{T} t^{-1} = O(\log T)$.

3. Exploit CountMin data structure for efficient compression.
   - Variant 1: reduce storage per snapshot
   - Variant 2: increase timespan per snapshot

4. Interpolate between both variants for improved accuracy.
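The logarithmic storage bound in item 2 can be checked numerically. The sketch below assumes, purely for illustration, that a snapshot of age t keeps a 1/t fraction of its original storage; the function name `total_storage` is hypothetical and not taken from the talk.

```python
import math

def total_storage(T):
    """Total storage, in units of one full snapshot, assuming the
    snapshot of age t keeps a 1/t fraction of its original size."""
    return sum(1.0 / t for t in range(1, T + 1))

for T in (10, 1_000, 100_000):
    # The harmonic sum stays within a constant of ln(T).
    print(T, round(total_storage(T), 3), round(math.log(T), 3))
```

The gap between the two printed columns approaches the Euler-Mascheroni constant (about 0.577), which is why the total grows as O(log T).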


CountMin Sketch (Cormode & Muthukrishnan)

$M \in \mathbb{R}^{d \times n}$ matrix with $d$ hash functions (rows) and $n$ bins (columns). Each item $x$ is hashed to one bin per row:

hash h1:  M11 M12 M13 M14 M15 M16 ... M1n
hash h2:  M21 M22 M23 M24 M25 M26 ... M2n
hash h3:  M31 M32 M33 M34 M35 M36 ... M3n

- In-memory data structure for instantaneous retrieval
- Aggregate statistic of observation interval (instantaneous retrieval)
- Intuition: Bloom filter with integers

Algorithm

insert(x):
    for i = 1 to d do
        M[i, h_i(x)] ← M[i, h_i(x)] + 1
    end for

query(x):
    n̂_x ← min over i ∈ {1, ..., d} of M[i, h_i(x)]
    return n̂_x
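The insert and query steps above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the class name `CountMinSketch` and the use of salted SHA-256 digests as the d hash functions are assumptions made here for self-containment.

```python
import hashlib

class CountMinSketch:
    """Minimal CountMin sketch with d rows (hash functions) and n bins."""

    def __init__(self, d, n):
        self.d = d
        self.n = n
        self.M = [[0] * n for _ in range(d)]

    def _bin(self, i, x):
        # Assumption: row i uses SHA-256 of "i:x" as its hash function h_i.
        digest = hashlib.sha256(f"{i}:{x}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.n

    def insert(self, x, count=1):
        # Increment one bin per row.
        for i in range(self.d):
            self.M[i][self._bin(i, x)] += count

    def query(self, x):
        # Collisions only ever add to a bin, so the minimum over rows
        # is the tightest available upper bound on the true count.
        return min(self.M[i][self._bin(i, x)] for i in range(self.d))

sketch = CountMinSketch(d=4, n=1024)
for term in ["wave", "wave", "fuji", "wave"]:
    sketch.insert(term)
print(sketch.query("wave"), sketch.query("fuji"))
```

The estimate never undercounts: `query` returns at least the true count and at most the total number of inserted items.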


Guarantees


- Approximation guarantee
  For a sketch with $d = \lceil \log \frac{1}{\delta} \rceil$ and $n = \lceil \frac{e}{\varepsilon} \rceil$ we have with probability $1 - \delta$ that the estimate $\hat{n}_x$ deviates from the count $n_x$ via

  $n_x \le \hat{n}_x \le n_x + \varepsilon \sum_{x'} n_{x'}$ for all $x$.

- Linear statistic of the data
- Power law distributions with exponent $z$ only use $O(N \varepsilon^{-1/z})$ space.
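As a worked example of the guarantee, the sketch dimensions can be computed from a target error and failure probability. The helper name `sketch_dimensions` is hypothetical, and the logarithm is assumed to be the natural logarithm.

```python
import math

def sketch_dimensions(epsilon, delta):
    """Return (d, n) with d = ceil(log(1/delta)) and n = ceil(e/epsilon).
    Assumes the natural logarithm."""
    d = math.ceil(math.log(1.0 / delta))
    n = math.ceil(math.e / epsilon)
    return d, n

# Overestimate by at most 1% of the total count, with probability 99%.
print(sketch_dimensions(epsilon=0.01, delta=0.01))  # (5, 272)
```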


Step 1: Combining time intervals


- $M_T$ and $M_{T'}$ are sketches for time intervals $T$ and $T'$ with $T \cap T' = \emptyset$.
- Combine sketches by adding them up: $M_{T \cup T'} = M_T + M_{T'}$.
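Because the sketch is a linear statistic of the data, adding two sketches bin by bin gives exactly the sketch of the combined stream. The following is a small illustration under assumed names (`build_sketch`, `add_sketches`, `bin_index`) and an assumed SHA-256 based hash; both sketches must share the same hash functions and dimensions.

```python
import hashlib

D, N = 3, 64  # illustrative sketch dimensions

def bin_index(i, x, n=N):
    # Assumption: salted SHA-256 stands in for hash function h_i.
    digest = hashlib.sha256(f"{i}:{x}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n

def build_sketch(items):
    M = [[0] * N for _ in range(D)]
    for x in items:
        for i in range(D):
            M[i][bin_index(i, x)] += 1
    return M

def add_sketches(a, b):
    # Elementwise sum of two sketches with identical shape and hashes.
    return [[u + v for u, v in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]

first = ["cat", "dog", "cat"]   # items observed in interval T
second = ["dog", "emu"]         # items observed in interval T'
combined = add_sketches(build_sketch(first), build_sketch(second))
print(combined == build_sketch(first + second))  # True
```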


Step 1: Efficient computation

- Keep aggregates for time intervals of length $\{1, 1, 2, 4, 8, \ldots, 2^m\}$.
- Insert into the leftmost aggregation interval.
- Aggregate as cumulative sum from the left using $1 + \sum_{i=0}^{n} 2^i = 2^{n+1}$.
- Computation is $\sum_{n=1}^{\infty} n \cdot 2^{-n} = O(1)$ amortized time, $O(\log t)$ space.

[Figure: aggregation intervals of lengths 1, 1, 2, 4 shifting as new unit intervals arrive.]
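One way to read the cumulative-sum update is as a binary-counter style carry. The sketch below is an interpretation of the slide rather than the authors' code: plain integers stand in for whole CountMin sketches (valid because sketches add linearly), and the name `time_aggregate` is hypothetical.

```python
def time_aggregate(unit_values):
    """Process unit intervals in order and return the aggregation levels.

    levels[j] holds the sum of the 2**j most recent unit intervals as of
    the last tick t that was divisible by 2**j.
    """
    levels = []
    for t, unit in enumerate(unit_values, start=1):
        carry = unit  # aggregate of the newest unit interval
        j = 0
        # Level j is refreshed only when 2**j divides t, so on average
        # a tick touches O(1) levels.
        while t % (2 ** j) == 0:
            if j == len(levels):
                levels.append(0)
            previous = levels[j]
            levels[j] = carry          # new aggregate of length 2**j
            carry = carry + previous   # cumulative sum of length 2**(j+1)
            j += 1
    return levels

# Unit interval t contributes the value t, for t = 1..8.
print(time_aggregate(range(1, 9)))  # [8, 15, 26, 36]
```

After eight ticks the levels cover the last 1, 2, 4 and 8 unit intervals (8, 7+8, 5+6+7+8, 1+...+8), using only a logarithmic number of aggregates.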


Step 2: Folding over


- $M^b$ is a sketch with $n = 2^b$ bins.
- $M^{b-1}$ can be obtained as

  $M^{b-1}[i, j] = M^b[i, j] + M^b[i, j + 2^{b-1}]$

  by "folding over" the sketch.
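The folding identity can be verified directly: folding a sketch with 2^b bins yields the same matrix as building a sketch with 2^(b-1) bins from scratch, provided bin indices are taken as the raw hash modulo the number of bins. The helper names and the SHA-256 based hash below are assumptions for illustration.

```python
import hashlib

def raw_hash(i, x):
    # Assumption: salted SHA-256 stands in for hash function h_i.
    digest = hashlib.sha256(f"{i}:{x}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def build_sketch(items, d, b):
    n = 2 ** b
    M = [[0] * n for _ in range(d)]
    for x in items:
        for i in range(d):
            M[i][raw_hash(i, x) % n] += 1
    return M

def fold(M):
    """Halve the number of bins: new[i][j] = old[i][j] + old[i][j + half]."""
    half = len(M[0]) // 2
    return [[row[j] + row[j + half] for j in range(half)] for row in M]

items = ["wave", "fuji", "wave", "crane", "boat"]
wide = build_sketch(items, d=3, b=4)    # 16 bins per row
narrow = build_sketch(items, d=3, b=3)  # 8 bins per row
print(fold(wide) == narrow)  # True
```

This works because, for powers of two, `h % 2**(b-1)` equals `(h % 2**b) % 2**(b-1)`, so an older snapshot can be shrunk in place without access to the original data.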


Step 2: Efficient computation

- Halve the size of the sketch every $2^t$ intervals.
- Computation costs $O(1)$ time and $O(\log t)$ space.

[Figure: storage per interval. Interval 1: 1 x 16 bins; intervals 2-3: 2 x 8 bins; intervals 4-7: 4 x 4 bins; ...]
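The halving schedule in the figure can be written as a small function. This sketch assumes the interval numbers in the figure are snapshot ages (1 = most recent) and that a snapshot starts with 2^b bins; `bins_for_age` is a hypothetical name.

```python
def bins_for_age(age, b):
    """Bins kept for a snapshot of the given age (age >= 1), assuming it
    starts with 2**b bins and is folded once each time its age doubles."""
    halvings = age.bit_length() - 1  # floor(log2(age))
    return 2 ** max(b - halvings, 0)

widths = [bins_for_age(age, b=4) for age in range(1, 8)]
print(widths)       # [16, 8, 8, 4, 4, 4, 4]
print(sum(widths))  # 48
```

Each group of ages (1, 2-3, 4-7) occupies 16 bins in total, so the storage grows by a constant amount every time the covered time span doubles.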


Step 3: Resolution Interpolation

- Time aggregation reports good estimate over long time interval.
- Item aggregation reports poor estimate over short time interval.
- Marginals of joint distribution: assume independence and interpolate.

[Figure: marginals $n(t)$ and $n(x)$ of the joint counts, with total $n$.]

- Torso and tail
  - Item aggregated estimate $\hat{n}_x$
  - Time aggregated estimate $\hat{n}_t$
  - Count interpolation

    $\hat{n}_{xt} = \frac{\hat{n}_x \cdot \hat{n}_t}{n}$ where $n = \sum_t n_t = \sum_x n_x$

- Head
  - Sketch accuracy decreases with $e \cdot t$
  - Use regular CountMin sketch whenever $n(x, t) > e \cdot t \cdot 2^{-b}$
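The interpolation rule and the head/tail switch can be combined in a few lines. This is a sketch under stated assumptions: the function names are hypothetical, the symbols `t` and `b` are taken as given by the slide, and the three input counts are assumed to come from the item-aggregated and time-aggregated sketches.

```python
import math

def interpolate_count(n_x, n_t, n):
    """Independence-based estimate n_x * n_t / n; requires n > 0."""
    return n_x * n_t / n

def estimate_count(n_xt_sketch, n_x, n_t, n, t, b):
    """Use the direct sketch estimate for head items, otherwise interpolate.

    n_xt_sketch: direct CountMin estimate of n(x, t)
    n_x, n_t:    marginal estimates for item x and for time t
    n:           total count
    """
    threshold = math.e * t * 2.0 ** (-b)
    if n_xt_sketch > threshold:
        return float(n_xt_sketch)  # head: the direct estimate is reliable
    return interpolate_count(n_x, n_t, n)  # torso and tail

print(estimate_count(500, n_x=50, n_t=200, n=1000, t=8, b=4))  # 500.0
print(estimate_count(1, n_x=50, n_t=200, n=1000, t=8, b=4))    # 10.0
```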


Setup and Throughput

[Figure: log-log plots of number of unique terms (y-axis) against term frequency (x-axis). Web query data, 5 days sample: 97.9M unique terms, 378.1M total. Wikipedia data: 4.5M unique terms, 1291.5M total.]

Configuration

- Platform
  - 64-bit Linux
  - 4-core 2GHz x86
  - 16GB RAM
  - Gigabit network

- Sketch setup
  - 4 hash functions
  - $2^{23}$ bins
  - $2^{11}$ aggregation intervals (7 days in 5 minute intervals)

- 3-gram interpolation: 12GB sketch with
  - 3 hash functions
  - $2^{30}$ bins
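The configuration figures are consistent with a quick back-of-the-envelope calculation. The counter width of 4 bytes is an assumption made here (the slides do not state it), and `sketch_bytes` is a hypothetical helper.

```python
def sketch_bytes(num_hashes, log2_bins, bytes_per_counter=4):
    """Memory of one sketch; 4-byte counters are an assumption."""
    return num_hashes * (2 ** log2_bins) * bytes_per_counter

trigram_gib = sketch_bytes(3, 30) / 2 ** 30   # 3 hashes, 2^30 bins
snapshot_mib = sketch_bytes(4, 23) / 2 ** 20  # 4 hashes, 2^23 bins
days_covered = (2 ** 11) * 5 / (60 * 24)      # 2^11 five-minute intervals

print(trigram_gib)             # 12.0
print(snapshot_mib)            # 128.0
print(round(days_covered, 2))  # 7.11
```

Under this assumption the 3-gram sketch matches the stated 12GB, and 2^11 five-minute intervals cover slightly more than 7 days.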


Speed

- Software
  - Client-server system
  - ICE middleware
  - 1 server, 10 clients

- Throughput/s
  - 50k inserts
  - 22k requests (time aggregation)
  - 8.5k requests (resolution interp.)

- Limiting factors
  - TCP/IP overhead: package query
  - Memory latency: random access


Accuracy (aggregate absolute error $\hat{n} - n$)


Accuracy (stratified absolute error $\hat{n} - n$)


Sketching for Graphical Models

- Goal
  - Observe stream of observations
  - Estimate joint probability in $O(1)$ time
  - CountMin is good for head but interpolation better for torso and tail

- General strategy
  - Markov network with junction tree: cliques $\mathcal{C}$ and separator sets $\mathcal{S}$.
  - Estimate counts for $x_C$ and $x_S$ with $C \in \mathcal{C}$ and $S \in \mathcal{S}$ to generate

    $p(x) = n^{|\mathcal{S}| - |\mathcal{C}|} \prod_{C \in \mathcal{C}} n_{x_C} \prod_{S \in \mathcal{S}} n_{x_S}^{-1}.$

  - Estimates are fast: only lookup in CountMin sketch. No need to solve convex program for graphical model inference.

- Markov chain

  Unigrams: $p(abc) \approx n^{-3} \cdot n_a \cdot n_b \cdot n_c$

  Bigrams: $p(abc) \approx n^{-1} \cdot \frac{n_{ab} \cdot n_{bc}}{n_b}$

  Backoff smoothing (e.g. Kneser-Ney) in practice.
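For a three-term chain a, b, c the general junction tree formula specializes as follows: with unigram cliques there are three cliques and no separators, giving a factor n^(-3); with bigram cliques {a,b} and {b,c} there are two cliques and one separator {b}, giving a factor n^(-1). The sketch below assumes the counts have already been looked up (for example from CountMin sketches) and uses hypothetical function names; it omits the backoff smoothing mentioned above.

```python
def unigram_estimate(n_a, n_b, n_c, n):
    """p(abc) under full independence: n^-3 * n_a * n_b * n_c."""
    return (n_a * n_b * n_c) / n ** 3

def bigram_estimate(n_ab, n_bc, n_b, n):
    """p(abc) under a first-order Markov chain: n^-1 * n_ab * n_bc / n_b."""
    if n_b == 0:
        return 0.0  # no smoothing in this sketch
    return (n_ab * n_bc) / (n_b * n)

# Illustrative counts, not data from the talk.
print(unigram_estimate(n_a=10, n_b=20, n_c=5, n=100))
print(bigram_estimate(n_ab=4, n_bc=2, n_b=20, n=100))
```

The bigram form equals p(ab) times p(c | b), that is (n_ab / n) * (n_bc / n_b).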


n-gram Interpolation

- Trigram approximation
- Wikipedia dataset (1291.5M terms, 405M unique trigrams)

                                Absolute error      Relative error
  Unigram approximation         2.50 · 10^7         0.266
  Bigram approximation          1.22 · 10^6         0.013
  Trigram sketching (CountMin)  8.35 · 10^6         0.089

- Sketching trigrams is not accurate enough on the tail.


Summary

- Fast and simple algorithm to aggregate statistics of data streams.
- Effective compressed representation of the temporal data.
- Works well for graphical models.
- High-performance scalable implementation with $O(1)$ time access.
- Can be distributed over many servers.

Hokusai Katsushika

Great Wave off Kanagawa
