Maintaining Stream Statistics Over Sliding Windows

Post on 25-Feb-2016



Maintaining Stream Statistics Over Sliding Windows

Paper by Mayur Datar, Aristides Gionis, Piotr Indyk, Rajeev Motwani

Presentation by Adam Morrison.

Sliding Window Intro

Infinite stream.

Only last N elements relevant.

– Packet streams.

N is huge.

– Stronger model…

Model

Count memory bits.

Online algorithm.

[Figure: stream elements labeled with arrival times 1, 2, …, 7 and with timestamps …, 3, 2, 1.]

Plan

Basic Counting

– Given a bit stream, maintain at every time instant the count of 1s in the last N elements.

Sum

– Given an integer stream, maintain the sum of the last N elements.

Everything else

Basic Counting

Exact solution? (Counter?)

[Figure: a counter over a bit stream as bits enter and leave the window.]

Exact solution requires $\Theta(N)$ bits.
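To make the cost of the exact approach concrete, here is a minimal Python sketch (the class name `ExactWindowCounter` is hypothetical, not from the slides). It assumes the simplest exact scheme: remember every bit of the window so that the expiring bit is known, which is where the N bits of memory go.

```python
from collections import deque


class ExactWindowCounter:
    """Exact count of 1s among the last n bits; stores the whole window."""

    def __init__(self, n):
        self.n = n
        self.window = deque()  # at most n bits, oldest on the left
        self.ones = 0

    def add(self, bit):
        self.window.append(bit)
        self.ones += bit
        if len(self.window) > self.n:
            # The expiring bit must be known, so all n bits are kept.
            self.ones -= self.window.popleft()

    def count(self):
        return self.ones


counter = ExactWindowCounter(4)
exact_counts = []
for b in [1, 1, 0, 1, 1, 0, 0, 0]:
    counter.add(b)
    exact_counts.append(counter.count())
print(exact_counts)
```

A plain counter without the stored window would not work: it could be incremented on arrival but would not know when to decrement.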

Approximate Basic Counting

Solution: approximate the answer and bound the relative error:

$$\frac{\text{abs error}}{\text{answer}} = \frac{|\text{answer} - \text{approx}|}{\text{answer}}$$

Example: for answer 100, the approximations 95 and 105 both have relative error 0.05.

The idea

Dynamic histogram of active 1s.

New 1s go into the rightmost bucket.

For each bucket, keep the timestamp of the most recent 1 and the bucket's size.

When the timestamp expires, free the bucket.

Bucket sizes?

Policy for creating new buckets?

What is it good for?

Example (N=4)

[Figure: a histogram evolving over a bit stream, showing each bucket's timestamp and size.]

(Timestamps are easy)

[Figure: clock values 9, 10, …, 14 with ages 5, 4, …, 0; one step later the clock wraps to 0 and the ages become 6, 5, …, 0 (N=15).]

Cyclic counter mod N.

What does the histogram buy us?

Active bucket: contains an active 1.

Only the last bucket might contain expired 1s.

Estimating number of 1s

T – sum of all bucket sizes but last.

– So there are at least T 1s.

C – size of last bucket.

– Actual # of active 1s in it can be anything from 1 to C.

Conclusion:

Estimate: $T + \frac{1}{2}C$

Absolute error $\le \frac{1}{2}C$
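The estimate can be sketched in a few lines of Python. The function name `estimate_ones` and the list layout (sizes ordered from newest to oldest, so the last entry is the "last bucket") are assumptions made for illustration.

```python
def estimate_ones(bucket_sizes):
    """Estimate the number of active 1s from bucket sizes.

    bucket_sizes is ordered newest to oldest; only the final (oldest)
    bucket may contain expired 1s.
    """
    if not bucket_sizes:
        return 0.0
    t = sum(bucket_sizes[:-1])  # T: every 1 in these buckets is active
    c = bucket_sizes[-1]        # C: between 1 and C of these are active
    return t + c / 2


sizes = [1, 1, 2, 4]
estimate = estimate_ones(sizes)
print(estimate)
```

With sizes 1, 1, 2, 4 we get T = 4 and C = 4, so the true count lies between 5 and 8 and the estimate of 6 is off by at most C/2 = 2.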

Absolute → Relative

Bucket sizes: $C_1, \ldots, C_m$

True count $\ge C_1 + \cdots + C_{m-1}$ $(= T)$

$$\text{rel err} \le \frac{(C_m - 1)/2}{\text{true count}} \le \frac{(C_m - 1)/2}{\sum_{i=1}^{m-1} C_i}$$

Bounding the error

Goal: relative error at most $\epsilon = 1/k$.

If at all times we'd have that for all $j$,

$$\frac{(C_j - 1)/2}{\sum_{i=1}^{j-1} C_i} \le \frac{1}{k}$$

How can we do that? (With as few buckets as possible?)

Non-decreasing bucket sizes.

Bucket sizes constrained to $1, 2, 4, \ldots, 2^{m'}$, with $m' \le m$ and $m' \le \log\frac{2N}{k} + 1$.

At most $\frac{k}{2} + 1$ buckets of each size.

For all sizes but that of the last bucket, at least $\frac{k}{2}$ buckets of each size.

Exponential Histogram

[Figure: step-by-step animation of bucket timestamps and sizes as new 1s arrive and buckets are merged.]

New 1 – create bucket.

Too many buckets – merge.

Check if invariant violated.
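The create/check/merge cycle can be sketched in Python as follows. This is an illustrative sketch, not the authors' code: the class name `ExponentialHistogram` is hypothetical, k is assumed even, and timestamps are plain arrival times rather than a cyclic counter. When a size has more than k/2 + 1 buckets, the two oldest of that size are merged, which may cascade to the next size.

```python
class ExponentialHistogram:
    """Approximate count of 1s in the last `window` bits (sketch)."""

    def __init__(self, window, k):
        self.window = window        # N
        self.max_same = k // 2 + 1  # at most k/2 + 1 buckets per size
        self.buckets = []           # [timestamp of most recent 1, size], newest first
        self.total = 0              # sum of all bucket sizes
        self.now = 0

    def add(self, bit):
        self.now += 1
        # Free the oldest bucket once its most recent 1 has left the window.
        if self.buckets and self.buckets[-1][0] <= self.now - self.window:
            self.total -= self.buckets.pop()[1]
        if bit != 1:
            return
        # New 1: create a bucket of size 1 (list insert is O(m); fine for a sketch).
        self.buckets.insert(0, [self.now, 1])
        self.total += 1
        # Check the invariant for each size in turn, merging while violated.
        i = 0
        while True:
            size = self.buckets[i][1]
            j = i
            while j < len(self.buckets) and self.buckets[j][1] == size:
                j += 1
            if j - i <= self.max_same:
                break
            # Merge the two oldest buckets of this size; keep the newer timestamp.
            self.buckets[j - 2][1] = 2 * size
            del self.buckets[j - 1]
            i = j - 2

    def estimate(self):
        if not self.buckets:
            return 0.0
        # T + C/2, where C is the oldest bucket and T = total - C.
        return self.total - self.buckets[-1][1] / 2


eh = ExponentialHistogram(window=8, k=2)
for bit in [1, 1, 0, 1, 1, 1, 0, 1]:
    eh.add(bit)
print([size for _, size in eh.buckets], eh.estimate())
```

Keeping the running `total` makes the estimate a constant-time operation; only insertion pays for the cascade.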

Why it works (correctness)

Let $C_j = 2^r$.

If there are at least $\frac{k}{2}$ buckets of sizes $1, 2, \ldots, 2^{r-1}$:

$$\sum_{i=1}^{j-1} C_i \ge \frac{k}{2}\left(1 + 2 + \cdots + 2^{r-1}\right) = \frac{k}{2}(C_j - 1)$$

$$\frac{(C_j - 1)/2}{\sum_{i=1}^{j-1} C_i} \le \frac{1}{k}$$

Why it works (space)

Can account for all 1s with just

$$m \le \left(\frac{k}{2} + 1\right)\left(\log\left(\frac{2N}{k} + 1\right) + 1\right) \text{ buckets.}$$

$$\sum_{i=1}^{m} C_i \ge \frac{k}{2}\sum_{r=0}^{m'} 2^r \ge \frac{k}{2}\cdot\frac{2N}{k} = N$$
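To get a feel for the numbers, the following Python sketch evaluates the bucket bound, assuming it reads $m \le (k/2 + 1)(\log_2(2N/k + 1) + 1)$. The function name `max_buckets` is hypothetical.

```python
import math


def max_buckets(n, k):
    """Evaluate (k/2 + 1) * (log2(2n/k + 1) + 1), rounded down."""
    return math.floor((k / 2 + 1) * (math.log2(2 * n / k + 1) + 1))


# A window of a million bits with k = 20 (relative error 1/20).
print(max_buckets(10**6, 20))
```

The bound grows logarithmically in N and linearly in k, in contrast to the N bits needed for exact counting.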

Space usage

Bucket size: $\log N + \log\log\frac{2N}{k}$ bits (timestamp plus the exponent of the size)

# of buckets: $O\left(k \log\frac{2N}{k}\right)$

T counter for estimation: $O(\log N)$

Total: $O(k \log^2 N)$

Operations

Estimation: O(1)

Insertion: cascading makes it $O\left(\log\frac{2N}{k}\right)$ worst case.

But only O(1) amortized!

Bucket of size B accounts for all operations related to it: B inserts, B−1 merges (& maybe a delete).

Sum of the sizes of all buckets over their lifetime (including deleted ones) is the number of all insertions.

Plan

Basic Counting

– Given a bit stream, maintain at every time instant the count of 1s in the last N elements.

– Solved using $O(k \log^2 N)$ space, $O(1)$ query time, $O(1)$ amortized insert time [$O(\log N)$ worst case].

Sum

– Given an integer stream, maintain the sum of the last N elements.

Everything else

Extending to Sum

Integers in range [0, R].

On value V, insert V 1s.

Timestamps: still $\log N$.

Bucket counter: $\log(\log N + \log R)$

# of buckets: $O\left(k \log\frac{2NR}{k}\right)$

Total space: $O(k(\log N + \log R)\log N)$

Insertion takes time linear in R!

Reducing insertion time

If we had a way to rebuild the entire histogram…

We could buffer new values…

And rebuild the histogram when the buffer reaches size B.

If it takes $O(B + k(\log N + \log R))$, amortized is $O\left(1 + \frac{k(\log N + \log R)}{B}\right)$.

Picking $B = \Theta(k \log N)$ gives $O\left(\frac{\log R}{\log N}\right)$ amortized time (for $R \ge N$).

k/2 canonical representation

The k/2 canonical representation of S:

$$S = \sum_{i=0}^{j} k_i 2^i, \qquad k_i \le \frac{k}{2} + 1, \qquad k_i \ge \frac{k}{2} \text{ for } i < j$$

If S is the total size of the buckets, computing its k/2 canonical representation would help us rebuild the histogram.

Would it really? Is this representation unique?

$$S = \sum_{i=0}^{j} k_i 2^i \ge \sum_{i=0}^{j-1} \frac{k}{2} 2^i = \frac{k}{2}(2^j - 1)$$

Find the largest $j$ for which $2^j \le \frac{S}{k/2} + 1$.

$S' = S - \frac{k}{2}(2^j - 1)$

(Example with $\frac{k}{2} = 2$: $j = 2$, $S' = 5$.)

If $S' \ge 2^j$, find $m$ such that $m 2^j \le S' < (m+1) 2^j$, and set $k_j = m$.

$S'' = S' - k_j 2^j$

Let $b_0, \ldots, b_{j-1}$ be the binary representation of $S''$.

(Example: $b_1 b_0 = 01$.)

$k_i = \frac{k}{2} + b_i$

Total time required is O(log S).
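The steps for computing the representation can be sketched in Python. The function name `canonical_representation` and the parameter `l` (standing for k/2, assumed to be at least 1) are hypothetical. When $S' < 2^j$ the sketch treats $k_j$ as 0 and omits it from the result.

```python
def canonical_representation(s, l):
    """Return [k_0, ..., k_j] with s == sum(k_i * 2**i), where l = k/2 >= 1.

    Every k_i below the top index is l or l + 1.
    """
    # Largest j with l * (2**j - 1) <= s.
    j = 0
    while l * (2 ** (j + 1) - 1) <= s:
        j += 1
    rest = s - l * (2 ** j - 1)        # S'
    top, rest = divmod(rest, 2 ** j)   # k_j = m, then S''
    # k_i = l + b_i, with b_i the bits of S''.
    counts = [l + ((rest >> i) & 1) for i in range(j)]
    if top > 0:
        counts.append(top)
    return counts


rep = canonical_representation(11, 2)
print(rep)
```

For S = 11 and k/2 = 2 this follows the slide's example: j = 2, S' = 5, k_2 = 1, S'' = 1 with bits 01, giving three buckets of size 1, two of size 2 and one of size 4. The loop and the bit extraction each take O(log S) steps.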

[Figure: worked example in which bucket timestamps age as elements arrive (8 6 4 3 2 1, then 9 7 5 4 3 2, then 10 8 6 5 4 3), with histogram total $S_1$ and buffered total $S_2$.]

Calculate the representation of $S_1 + S_2$.

If a value gets "unindexed", it will never be indexed in the future.

Plan

Basic Counting

– Given a bit stream, maintain at every time instant the count of 1s in the last N elements.

Sum

– Given an integer stream, maintain the sum of the last N elements.

– Solved using $O(k(\log N + \log R)\log N)$ space, $O(1)$ query time, $O\left(\frac{\log R}{\log N}\right)$ amortized insert time [$O(\log N + \log R)$ worst case].

Everything else

• Lower Bounds

• More about timestamps.

• Applications.

• More problems


Lower bounds

Basic Counting and Sum algorithms are optimal.

Similar techniques will show that lots of other problems are intractable. (Later.)

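Basic Counting bound

[Figure: the window of N elements is partitioned, from right to left, into blocks of sizes $B, 2B, 4B, \ldots, 2^i B, \ldots$; the block of size $2^i B$ consists of $B$ sub-blocks of size $2^i$, and $\frac{k}{4}$ of them are filled with 1s.]

Number of blocks: $L = \log\frac{N}{B}$.

Number of arrangements: $\binom{B}{k/4}^{L}$. With $B = \sqrt{Nk}$, the space needed is at least

$$\log\binom{B}{k/4}^{L} \ge \frac{k}{16}\log^2\frac{N}{k} \text{ bits.}$$

Take two arrangements that differ in big block $d$. The blocks to its right contain $\frac{k}{4}(2^d - 1)$ 1s. At the leftmost sub-block where they differ, block $d$ contributes $c\,2^d$ 1s in one arrangement and $(c-1)2^d$ in the other, with $1 \le c \le \frac{k}{4}$.

$$\text{abs err} \ge 2^{d-1}$$

$$\text{rel err} \ge \frac{2^{d-1}}{\frac{k}{4}2^d + \frac{k}{4}(2^d - 1)} \ge \frac{1}{k}$$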

Same idea works for Sum.

Randomized bound

Yao minimax principle: the expected space complexity of the optimal deterministic algorithm for an input distribution is a lower bound on the expected space complexity of a randomized algorithm.

Lower bound applies to randomized algorithms.

Timestamps

Define window based on real time – equate timestamp with clock.

No work needs to be done when items don't arrive, so deletions can be deferred.

If far fewer than N items can arrive during the window, memory usage is reduced.

Applications

Adapting algorithms to the sliding window model using EH to replace counters.

Counters require $\log N$ bits, EH takes $k \log^2 N$.

Also a factor $(1 + \epsilon)$ loss in accuracy.


More Problems

Min/Max

– Storing subsequence of (say) mins is optimal.

Distinct values

– Basic Counting reduces to it.
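One way to read "storing a subsequence of mins" is the following Python sketch (the class name `WindowMin` is hypothetical, and this interpretation is an assumption): keep only those window elements that are smaller than everything that arrived after them, since no other element can ever become the minimum.

```python
from collections import deque


class WindowMin:
    """Minimum of the last n values, keeping only candidate minima."""

    def __init__(self, n):
        self.n = n
        self.now = 0
        # (timestamp, value) pairs; values strictly increase from oldest to newest.
        self.candidates = deque()

    def add(self, value):
        self.now += 1
        # Drop the oldest candidate once it leaves the window.
        if self.candidates and self.candidates[0][0] <= self.now - self.n:
            self.candidates.popleft()
        # Older values that are not smaller can never be the minimum again.
        while self.candidates and self.candidates[-1][1] >= value:
            self.candidates.pop()
        self.candidates.append((self.now, value))

    def minimum(self):
        return self.candidates[0][1]


wm = WindowMin(3)
mins = []
for v in [4, 2, 5, 6, 7, 1]:
    wm.add(v)
    mins.append(wm.minimum())
print(mins)
```

On an increasing stream every element stays a candidate, so the stored subsequence can be as long as the window itself.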


Other Problems

Distinct values with deletions.

– Factor 2 estimation requires $\Omega(N)$ space.

– Map 1s in a bit string to distinct values. Pad with zeros to infer value of last bit, then use deletion to cancel that bit.

– Repeat.

Other Problems

Sum with negative integers.

– Factor 2 estimation requires $\Omega(N)$ space.

– Map 1s in bit string to (-1,1) and 0s to (1,-1).

– Pad with 0s and query at odd time instants.
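The reduction can be illustrated with a small Python sketch. The details are assumptions filled in for illustration: the window holds 2N values, the padding consists of encoded 0 bits, and exact window sums stand in for the estimator. At each odd time instant the window sum is 2 if the oldest remaining bit was 1 and 0 otherwise, so any factor-2 estimate would reveal the bit. The names `encode` and `recover_bits` are hypothetical.

```python
def encode(bits):
    """Map each 1 to the pair (-1, 1) and each 0 to the pair (1, -1)."""
    out = []
    for b in bits:
        out.extend((-1, 1) if b else (1, -1))
    return out


def recover_bits(bits):
    """Recover every bit from window sums taken at odd time instants."""
    n = len(bits)
    window = 2 * n
    stream = encode(bits) + encode([0] * n)  # pad with 0s
    recovered = []
    for j in range(n):
        t = 2 * n + 2 * j + 1                # odd time instant
        window_sum = sum(stream[t - window:t])
        # The window holds the second half of pair j, complete pairs
        # (each summing to 0), and the first half of a padding pair (+1).
        recovered.append(1 if window_sum > 0 else 0)
    return recovered


original = [1, 0, 0, 1, 1, 0, 1]
print(recover_bits(original) == original)
```

Since all N bits can be read back this way, an algorithm answering such queries within a factor of 2 must effectively store the whole bit string.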