
Hierarchical Heavy Hitters with the Space Saving Algorithm

Michael Mitzenmacher, Thomas Steinke, Justin Thaler
Harvard SEAS

January 16, 2012


Overview

The Problem

Previous Work

Our Algorithm

Results


Motivating Problem

• Monitoring network traffic.

• Streaming problem. (Lots of data.)

• Want to find major patterns. (Identify big servers/server farms, connections, DDoS attacks.)


The Hierarchy

• Data has structure.

• Subnets correspond to organisations, ISPs, geographical locations.

• Subnets are hierarchical: 123.234.*.* ≻ 123.234.78.79

• Subnet masks form a lattice.

• Levels: 0: *.*.*.*; 1: 1.*.*.*; 2: 1.2.*.*; 3: 1.2.3.*; 4: 1.2.3.4 (see the sketch below).

• Find frequent subnets or subnet pairs (Hierarchical Heavy Hitters).

• Conditioned counts: if 123.234.132.199 is frequent, there is no need to output 123.234.132.*, 123.234.*.*, and 123.*.*.* as well.
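
To make the levels concrete, here is a minimal sketch (ours, not from the talk) that enumerates the five byte-granularity prefixes of an IPv4 address; the helper print_prefixes is hypothetical:

    #include <stdint.h>
    #include <stdio.h>

    /* Print the level-0 through level-4 prefixes of an IPv4 address,
       e.g. 123.234.78.79 -> *.*.*.*, 123.*.*.*, ..., 123.234.78.79. */
    void print_prefixes(uint32_t addr) {
        for (int l = 0; l <= 4; l++) {      /* level = number of fixed bytes */
            for (int b = 0; b < 4; b++) {
                if (b) putchar('.');
                if (b < l) printf("%u", (addr >> (8 * (3 - b))) & 0xFFu);
                else putchar('*');
            }
            putchar('\n');
        }
    }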


Multidimensional Lattice


HHH Definition

Definition (Exact HHHs). The set of exact Hierarchical Heavy Hitters is defined inductively.

1. HHH_L, the hierarchical heavy hitters at level L, are the heavy hitters of S, that is, the fully specified elements whose frequencies exceed φN.

2. Given a prefix p from Level(l), 0 ≤ l < L, define HHH^p_{l+1} to be the set {h ∈ HHH_{l+1} : h ≺ p}, i.e. HHH^p_{l+1} is the set of descendants of p that have been identified as HHHs. Define the conditioned count of p to be

   F_p = ∑_{e ∈ S : e ≺ p, e ⊀ HHH^p_{l+1}} f(e),

   where e ⊀ HHH^p_{l+1} means e is not a descendant of any element of HHH^p_{l+1}. The set HHH_l is defined as

   HHH_l = HHH_{l+1} ∪ {p : p ∈ Level(l) ∧ F_p ≥ φN}.

3. The set of exact Hierarchical Heavy Hitters HHH is defined as the set HHH_0.
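
A small worked example (our numbers, not from the talk): let φ = 0.1 and N = 100, with f(1.2.3.4) = 15, f(1.2.3.5) = 8, f(1.2.4.6) = 7, and the remaining 70 items falling outside 1.2.*.*. Then HHH_4 = {1.2.3.4}, since only 1.2.3.4 exceeds φN = 10. For p = 1.2.3.*, the descendant 1.2.3.4 is already an HHH, so F_p = f(1.2.3.5) = 8 < 10 and p is not output; likewise F_{1.2.4.*} = 7 < 10. But for p = 1.2.*.*, F_p = 8 + 7 = 15 ≥ 10, so 1.2.*.* joins the HHH set even though neither of its level-3 children did.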


Approximate HHH

Definition (Approximate HHHs). Given parameter ε, the Approximate Hierarchical Heavy Hitters problem with threshold φ is to output a set of items P from the lattice, together with lower and upper bounds f_min(p) and f_max(p), satisfying two properties:

1. Accuracy. f_min(p) ≤ f(p) ≤ f_max(p) and f_max(p) − f_min(p) ≤ εN for all p ∈ P.

2. Coverage. For all prefixes p, define P_p to be the set {q ∈ P : q ≺ p}, and define the conditioned count of p with respect to P to be

   F_p = ∑_{e ∈ S : e ≺ p, e ⊀ P_p} f(e).

   We require F_p < φN for all prefixes p ∉ P.
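
To make the guarantees concrete (our numbers, using the parameter values from the experiments below): with N = 10^6, ε = 0.001, and φ = 0.002, every reported prefix carries bounds at most εN = 1,000 apart, and any prefix not reported is guaranteed a conditioned count below φN = 2,000.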


Ancestry


Ancestry: 2D


Reduction to Counting

• When you receive 123.234.78.79, record 123.234.78.*, 123.234.*.*, and 123.*.*.* as well.

• Keep track of these counts separately (using an approximate counting algorithm for the Heavy Hitters problem).

• Use Space Saving (Metwally et al.); a sketch follows below.

• Recover the hierarchy in postprocessing.
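
Below is a minimal C sketch of the update path of this reduction, assuming byte-granularity IPv4 addresses and one Space Saving table per level. The names (ss_update, NUM_LEVELS, NUM_COUNTERS) and the linear-scan table are ours for illustration, not the authors' implementation, which would use the constant-time stream-summary structure of Metwally et al.:

    #include <stdint.h>

    #define NUM_LEVELS   5     /* *.*.*.* down to a.b.c.d (byte granularity) */
    #define NUM_COUNTERS 1024  /* roughly O(1/eps) counters per level */

    typedef struct { uint32_t key; int used; long count; long err; } Counter;
    static Counter table[NUM_LEVELS][NUM_COUNTERS];

    /* Space Saving: increment a monitored key, fill a free slot, or evict
       the minimum counter; the evicted count becomes the new key's error
       bound, so count - err <= f(key) <= count. */
    static void ss_update(Counter *t, uint32_t key) {
        int min_i = -1, free_i = -1;
        for (int i = 0; i < NUM_COUNTERS; i++) {
            if (t[i].used && t[i].key == key) { t[i].count++; return; }
            if (!t[i].used) { if (free_i < 0) free_i = i; }
            else if (min_i < 0 || t[i].count < t[min_i].count) min_i = i;
        }
        if (free_i >= 0) { t[free_i] = (Counter){ key, 1, 1, 0 }; return; }
        t[min_i].err = t[min_i].count;   /* new key inherits the evicted count */
        t[min_i].key = key;
        t[min_i].count++;
    }

    /* On each arrival, update every prefix of the address: level l keeps
       the top l bytes and wildcards the rest. */
    void update(uint32_t addr) {
        for (int l = 0; l < NUM_LEVELS; l++) {
            uint32_t mask = (l == 0) ? 0 : ~0u << (8 * (4 - l));
            ss_update(table[l], addr & mask);
        }
    }

Each counter overestimates its key's frequency by at most err, giving the f_min/f_max bounds required above; postprocessing then walks the levels bottom-up, computing conditioned counts from these tables and applying the coverage rule.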


Results

We implemented all algorithms in C (gcc 4.1.2) and tested on an AMD Opteron 850 (4 single cores, 64-bit, 2.4GHz, 1MB cache, 8GB memory) using real data (www.caida.org).

We compare five aspects:

• Time

• Memory

• Output Size

• Accuracy

• Programmer Time


Results: Time

[Figure: CPU time (secs) vs. N (millions); byte-granularity in two dimensions with ε=0.1 and φ=0.2; curves for hhh, unitary, partial, and full.]

[Figure: CPU time (secs) vs. 1/ε (log scale); byte-granularity in two dimensions with N=30 million; curves for hhh, unitary, partial, and full.]

We have worst-case rather than amortized bounds on update time.


Results: Memory

[Figure: maximum memory usage (KBytes) of hhh, unitary, partial, and full; byte-granularity in one dimension with ε=0.001 and φ=0.002.]

[Figure: maximum memory usage (KBytes) of hhh, unitary, partial, and full; byte-granularity in two dimensions with ε=0.01 and φ=0.02.]

Our algorithm uses fixed-size tables, rather than dynamic memory.


Results: Output Size

[Figure: output size of hhh, unitary, partial, and full; byte-granularity in one dimension with ε=0.01 and φ=0.02.]

[Figure: output size of hhh, unitary, partial, and full; byte-granularity in two dimensions with ε=0.01 and φ=0.02.]

We have good analytical bounds on output size.


Results: Accuracy

[Figure: relative error vs. N (millions) for hhh, unitary, partial, and full; byte-granularity in one dimension with ε=0.001 and φ=0.002.]

[Figure: relative error vs. N (millions) for hhh, unitary, partial, and full; byte-granularity in two dimensions with ε=0.001 and φ=0.002.]

We have improved worst-case and average-case analytical bounds on accuracy.


Results: Programmer Time

[Figure: lines of source code for hhh/unitary 1D, partial/full 1D, hhh/unitary 2D, and partial/full 2D; a comparison of implementation complexity (ignoring libraries).]

Not scientific, but our algorithm is significantly simpler.


Concluding Remarks

• Our algorithm has better worst-case and average-case bounds.

• Our algorithm parallelizes well.

• Our algorithm is well suited to TCAM (ternary content-addressable memory) implementations.

• Simplicity!
