
Hierarchical Heavy Hitters with the Space Saving Algorithm

Michael Mitzenmacher, Thomas Steinke, Justin Thaler
Harvard SEAS

January 16, 2012


Overview

The Problem

Previous Work

Our Algorithm

Results


Motivating Problem

• Monitoring network traffic.

• Streaming problem. (Lots of data.)

• Want to find major patterns. (Identify big servers/server farms, connections, DDoS attacks.)


The Hierarchy

• Data has structure.

• Subnets correspond to organisations, ISPs, geographical locations.

• Subnets are hierarchical: 123.234.*.* ≻ 123.234.78.79

• Subnet masks form a lattice.

• Levels: 0: *.*.*.*; 1: 1.*.*.*; 2: 1.2.*.*; 3: 1.2.3.*; 4: 1.2.3.4 (see the sketch below).

• Find frequent subnets or subnet pairs (Hierarchical Heavy Hitters).

• Conditioned counts: if 123.234.132.199 is frequent, there is no need to output 123.234.132.*, 123.234.*.*, and 123.*.*.* as well.
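
To make the levels concrete, here is a minimal sketch (ours, not from the talk) that enumerates the five byte-granularity prefixes of an IPv4 address; the helper print_prefixes is hypothetical:

    #include <stdint.h>
    #include <stdio.h>

    /* Print the level-0 through level-4 prefixes of an IPv4 address,
       e.g. 123.234.78.79 -> *.*.*.*, 123.*.*.*, ..., 123.234.78.79. */
    void print_prefixes(uint32_t addr) {
        for (int l = 0; l <= 4; l++) {      /* level = number of fixed bytes */
            for (int b = 0; b < 4; b++) {
                if (b) putchar('.');
                if (b < l) printf("%u", (addr >> (8 * (3 - b))) & 0xFFu);
                else putchar('*');
            }
            putchar('\n');
        }
    }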


Multidimensional Lattice


HHH Definition

Definition (Exact HHHs). The set of exact Hierarchical Heavy Hitters is defined inductively.

1. HHH_L, the hierarchical heavy hitters at level L, are the heavy hitters of S, that is, the fully specified elements whose frequencies exceed φN.

2. Given a prefix p from Level(l), 0 ≤ l < L, define HHH^p_{l+1} to be the set {h ∈ HHH_{l+1} : h ≺ p}, i.e. HHH^p_{l+1} is the set of descendants of p that have been identified as HHHs. Define the conditioned count of p to be

   F_p = ∑_{e ∈ S : e ≺ p, e ⊀ HHH^p_{l+1}} f(e),

   where e ⊀ HHH^p_{l+1} means e is not a descendant of any element of HHH^p_{l+1}. The set HHH_l is defined as

   HHH_l = HHH_{l+1} ∪ {p : p ∈ Level(l) ∧ F_p ≥ φN}.

3. The set of exact Hierarchical Heavy Hitters HHH is defined as the set HHH_0.
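
A small worked example (our numbers, not from the talk): let φ = 0.1 and N = 100, with f(1.2.3.4) = 15, f(1.2.3.5) = 8, f(1.2.4.6) = 7, and the remaining 70 items falling outside 1.2.*.*. Then HHH_4 = {1.2.3.4}, since only 1.2.3.4 exceeds φN = 10. For p = 1.2.3.*, the descendant 1.2.3.4 is already an HHH, so F_p = f(1.2.3.5) = 8 < 10 and p is not output; likewise F_{1.2.4.*} = 7 < 10. But for p = 1.2.*.*, F_p = 8 + 7 = 15 ≥ 10, so 1.2.*.* joins the HHH set even though neither of its level-3 children did.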


Approximate HHH

Definition (Approximate HHHs). Given parameter ε, the Approximate Hierarchical Heavy Hitters problem with threshold φ is to output a set of items P from the lattice, together with lower and upper bounds f_min(p) and f_max(p), satisfying two properties:

1. Accuracy. f_min(p) ≤ f(p) ≤ f_max(p) and f_max(p) − f_min(p) ≤ εN for all p ∈ P.

2. Coverage. For all prefixes p, define P_p to be the set {q ∈ P : q ≺ p}, and define the conditioned count of p with respect to P to be

   F_p = ∑_{e ∈ S : e ≺ p, e ⊀ P_p} f(e).

   We require F_p < φN for all prefixes p ∉ P.
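
To make the guarantees concrete (our numbers, using the parameter values from the experiments below): with N = 10^6, ε = 0.001, and φ = 0.002, every reported prefix carries bounds at most εN = 1,000 apart, and any prefix not reported is guaranteed a conditioned count below φN = 2,000.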


Ancestry


Ancestry: 2D


Reduction to Counting

• When you receive 123.234.78.79, record 123.234.78.*, 123.234.*.*, and 123.*.*.* as well.

• Keep track of these counts separately (using an approximate counting algorithm for the Heavy Hitters problem).

• Use Space Saving (Metwally et al.); a sketch follows below.

• Recover the hierarchy in postprocessing.
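
Below is a minimal C sketch of the update path of this reduction, assuming byte-granularity IPv4 addresses and one Space Saving table per level. The names (ss_update, NUM_LEVELS, NUM_COUNTERS) and the linear-scan table are ours for illustration, not the authors' implementation, which would use the constant-time stream-summary structure of Metwally et al.:

    #include <stdint.h>

    #define NUM_LEVELS   5     /* *.*.*.* down to a.b.c.d (byte granularity) */
    #define NUM_COUNTERS 1024  /* roughly O(1/eps) counters per level */

    typedef struct { uint32_t key; int used; long count; long err; } Counter;
    static Counter table[NUM_LEVELS][NUM_COUNTERS];

    /* Space Saving: increment a monitored key, fill a free slot, or evict
       the minimum counter; the evicted count becomes the new key's error
       bound, so count - err <= f(key) <= count. */
    static void ss_update(Counter *t, uint32_t key) {
        int min_i = -1, free_i = -1;
        for (int i = 0; i < NUM_COUNTERS; i++) {
            if (t[i].used && t[i].key == key) { t[i].count++; return; }
            if (!t[i].used) { if (free_i < 0) free_i = i; }
            else if (min_i < 0 || t[i].count < t[min_i].count) min_i = i;
        }
        if (free_i >= 0) { t[free_i] = (Counter){ key, 1, 1, 0 }; return; }
        t[min_i].err = t[min_i].count;   /* new key inherits the evicted count */
        t[min_i].key = key;
        t[min_i].count++;
    }

    /* On each arrival, update every prefix of the address: level l keeps
       the top l bytes and wildcards the rest. */
    void update(uint32_t addr) {
        for (int l = 0; l < NUM_LEVELS; l++) {
            uint32_t mask = (l == 0) ? 0 : ~0u << (8 * (4 - l));
            ss_update(table[l], addr & mask);
        }
    }

Each counter overestimates its key's frequency by at most err, giving the f_min/f_max bounds required above; postprocessing then walks the levels bottom-up, computing conditioned counts from these tables and applying the coverage rule.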


Results

We implemented all algorithms in C (gcc 4.1.2) and tested on an AMD Opteron 850 (4 single cores, 64-bit, 2.4GHz, 1MB cache, 8GB memory) using real data (www.caida.org).

We compare five aspects:

• Time

• Memory

• Output Size

• Accuracy

• Programmer Time


Results: Time

[Figure: CPU time (secs) vs. N (millions); byte-granularity in two dimensions with ε=0.1 and φ=0.2; curves for hhh, unitary, partial, and full.]

[Figure: CPU time (secs) vs. 1/ε (log scale); byte-granularity in two dimensions with N=30 million; curves for hhh, unitary, partial, and full.]

We have worst-case rather than amortized bounds on update time.


Results: Memory

[Figure: maximum memory usage (KBytes) of hhh, unitary, partial, and full; byte-granularity in one dimension with ε=0.001 and φ=0.002.]

[Figure: maximum memory usage (KBytes) of hhh, unitary, partial, and full; byte-granularity in two dimensions with ε=0.01 and φ=0.02.]

Our algorithm uses fixed-size tables, rather than dynamic memory.


Results: Output Size

[Figure: output size of hhh, unitary, partial, and full; byte-granularity in one dimension with ε=0.01 and φ=0.02.]

[Figure: output size of hhh, unitary, partial, and full; byte-granularity in two dimensions with ε=0.01 and φ=0.02.]

We have good analytical bounds on output size.


Results: Accuracy

[Figure: relative error vs. N (millions) for hhh, unitary, partial, and full; byte-granularity in one dimension with ε=0.001 and φ=0.002.]

[Figure: relative error vs. N (millions) for hhh, unitary, partial, and full; byte-granularity in two dimensions with ε=0.001 and φ=0.002.]

We have improved worst-case and average-case analytical bounds on accuracy.


Results: Programmer Time

[Figure: lines of source code for hhh/unitary 1D, partial/full 1D, hhh/unitary 2D, and partial/full 2D; a comparison of implementation complexity (ignoring libraries).]

Not scientific, but our algorithm is significantly simpler.


Concluding Remarks

• Our algorithm has better worst-case and average-case bounds.

• Our algorithm parallelizes well.

• Our algorithm is well suited to TCAM (ternary content-addressable memory) implementations.

• Simplicity!
