Hierarchical Heavy Hitters with the Space Saving...
Transcript of Hierarchical Heavy Hitters with the Space Saving...
The Problem Previous Work Our Algorithm Results
Hierarchical Heavy Hitters with the Space SavingAlgorithm
Michael Mitzenmacher, Thomas Steinke, Justin ThalerHarvard SEAS
January 16, 2012
The Problem Previous Work Our Algorithm Results
Overview
The Problem
Previous Work
Our Algorithm
Results
The Problem Previous Work Our Algorithm Results
Motivating Problem
• Monitoring network traffic.
• Streaming problem. (Lots of data.)
• Want to find major patterns. (Identify big servers/serverfarms; connections; DDoS.)
The Problem Previous Work Our Algorithm Results
The Hierarchy
• Data has structure.
• Subnets correspond to organisations, ISPs, geographicallocations.
• Subnets are hierarchical: 123.234.*.* � 123.234.78.79
• Subnet masks form lattice.
• Level: 0: *.*.*.*, 1: 1.*.*.*, 2: 1.2.*.*, 3: 1.2.3.*, 4: 1.2.3.4
• Find frequent subnets or subnet pairs (Hierarchical HeavyHitters).
• Condition counts: If 123.234.132.199 is frequent there is noneed to output 123.234.132.*, 123.234.*.*, and 123.*.*.* aswell.
The Problem Previous Work Our Algorithm Results
HHH Definition
Definition(Exact HHHs) The set of exact Hierarchical Heavy Hitters are definedinductively.
1. HHHL, the hierarchical heavy hitters at level L, are the heavyhitters of S , that is the fully specified elements whose frequenciesexceed φN.
2. Given a prefix p from Level(l), 0 ≤ l < L, define HHHpl+1 to be the
set {h ∈ HHHl+1 ∧ h ≺ p} i.e. HHHpl+1 is the set of descendants
of p that have been identified as HHHs. Define the conditionedcount of p to be Fp =
∑(e∈S)∧(e�p)∧(e 6�HHHp
l+1)f (e). The set
HHHl is defined as
HHHl = HHHl+1 ∪ {p : (p ∈ Level(l) ∧ (Fp ≥ φN)}.
3. The set of exact Hierarchical Heavy Hitters HHH is defined as theset HHH0.
The Problem Previous Work Our Algorithm Results
Approximate HHH
Definition(Approximate HHHs) Given parameter ε, the Approximate HierarchicalHeavy Hitters problem with threshold φ is to output a set of items Pfrom the lattice, and lower and upper bounds fmin(p) and fmax(p), suchthat they satisfy two properties, as follows.
1. Accuracy. fmin(p) ≤ f (p) ≤ fmax(p), and fmax(p)− fmin(p) ≤ εN forall p ∈ P.
2. Coverage. For all prefixes p, define Pp to be the set{q ∈ P : q ≺ p}. Define the conditioned count of p with respect toP to be Fp =
∑(e∈S)∧(e�p)∧(e 6�Pp)
f (e). We require for all prefixes
p /∈ P, Fp < φN.
The Problem Previous Work Our Algorithm Results
Reduction to Counting
• When you receive 123.234.78.79, record 123.234.78.*,123.234.*.*, and 123.*.*.* as well.
• Keep track of these counts separately (using approximatecounting algorithm for Heavy Hitters problem).
• Use Space Saving (Metwally et al.).
• Recover hierarchy in postprocessing.
The Problem Previous Work Our Algorithm Results
Results
We implemented all algorithms in C (gcc 4.1.2) and tested onAMD Opteron 850 (4 single cores, 64-bit, 2.4GHz, 1MB cache,8GB memory) using real data (www.caida.org).Compare five aspects:
• Time
• Memory
• Output Size
• Accuracy
• Programmer Time
The Problem Previous Work Our Algorithm Results
Results: Time
0 50 100 150 200 250N (millions)
0
500
1000
1500
2000
2500
3000
3500
CPU
Tim
e (s
ecs)
Byte-granularity in two dimensions with ε=0.1 and φ=0.2
hhhunitarypartialfull
100 101 102 103 104 105
1/ε
50
100
150
200
250
300
350
400
450
500
CPU
Tim
e (s
ecs)
Byte-granularity in two dimensions with N=30 millionhhhunitarypartialfull
We have worst-case rather than amortized bounds on update time.
The Problem Previous Work Our Algorithm Results
Results: Memory
hhh unitary partial full0
50
100
150
200
250
300
350
400
Max
imum
Mem
ory
Usag
e (K
Byte
s)
Byte-granularity in one dimension with ε=0.001 and φ=0.002
hhh unitary partial full0
50
100
150
200
250
300
Max
imum
Mem
ory
Usag
e (K
Byte
s)
Byte-granularity in two dimensions with ε=0.01 and φ=0.02
Our algorithm uses fixed-size tables, rather than dynamic memory.
The Problem Previous Work Our Algorithm Results
Results: Output Size
hhh unitary partial full0
5
10
15
20
25
Out
put S
ize
Byte-granularity in one dimension with ε=0.01 and φ=0.02
hhh unitary partial full0
10
20
30
40
50
Out
put S
ize
Byte-granularity in two dimensions with ε=0.01 and φ=0.02
We have good analytical bounds on output size.
The Problem Previous Work Our Algorithm Results
Results: Accuracy
100 101 102 103
N (millions)
0.0
0.2
0.4
0.6
0.8
1.0
Rela
tive
Erro
r
Byte-granularity in one dimension with ε=0.001 and φ=0.002
hhhunitarypartialfull
100 101 102 103
N (millions)
0.0
0.2
0.4
0.6
0.8
1.0
Rela
tive
Erro
r
Byte-granularity in two dimensions with ε=0.001 and φ=0.002
hhhunitarypartialfull
We have improved worst-case and average-case analytical boundson accuracy.
The Problem Previous Work Our Algorithm Results
Results: Programmer Time
hhh/unitary 1D partial/full 1D hhh/unitary 2D partial/full 2D0
50
100
150
200
250
300
350
400
450
Line
s of
Sou
rce
Code
Comparison of Implementation Complexity (Ignoring Libraries)
Not scientific, but our algorithm is significanly simpler.
The Problem Previous Work Our Algorithm Results
Concluding remarks
• Our algorithm has better worst case and average case bounds.
• Our algorithm parallelizes well.
• Our algorithm suits TCAM implementations.
• Simplicity!