Maintaining Stream Statistics Over Sliding Windows
Paper by Mayur Datar, Aristides Gionis, Piotr Indyk, Rajeev Motwani
Presentation by Adam Morrison.
Sliding Window Intro
Infinite stream.
Only the last N elements are relevant.
– Packet streams.
N is huge.
– Stronger model…
Model
Count memory bits.
Online algorithm.
[Figure: a stream whose elements are labeled with arrival times 1 to 7 and with timestamps 3, 2, 1 counted back from the most recent element.]
Plan
Basic Counting
– Given a bit stream, maintain at every time instant the count of 1s in the last N elements.
Sum
– Given an integer stream, maintain the sum of the last N elements.
Everything else
Basic Counting
Exact Solution? (Counter?)
[Figure: a counter tracking the number of 1s as bits enter and leave the window.]
Exact solution requires $\Theta(N)$ bits.
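To make the cost of the exact approach concrete, here is a minimal sketch of an exact counter. It has to remember every bit in the window so that it knows what to subtract when a bit expires, which is where the linear space goes. The class name `ExactWindowCounter` is an illustrative choice, not something from the paper or the slides.

```python
from collections import deque


class ExactWindowCounter:
    """Exact count of 1s among the last n bits (hypothetical helper).

    Stores all n bits of the window, so it uses space linear in n.
    """

    def __init__(self, n):
        self.window = deque(maxlen=n)  # the last n bits, oldest at the left
        self.ones = 0

    def add(self, bit):
        # If the window is full, the oldest bit is about to be evicted.
        if len(self.window) == self.window.maxlen:
            self.ones -= self.window[0]
        self.window.append(bit)
        self.ones += bit
        return self.ones


counter = ExactWindowCounter(4)
counts = [counter.add(b) for b in [1, 1, 0, 1, 1, 0, 0]]
```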
Approximate Basic Counting
Solution: approximate the answer and bound the relative error.
$\text{relative error} = \frac{|\text{answer} - \text{approx}|}{\text{answer}}$
Example: if the answer is 100, an estimate of 95 or 105 has relative error 0.05.
The idea
Dynamic histogram of active 1s.
New 1s go into the rightmost bucket.
For each bucket keep the timestamp of the most recent 1 and the bucket's size.
When the timestamp expires, free the bucket.
Bucket sizes?
Policy for creating new buckets?
What is it good for?
Example (N=4)
[Figure: buckets labeled with their timestamp and size as the stream advances.]
(Timestamps are easy)
Cyclic counter mod N.
[Figure: with N=15, arrival times 9 to 14 are followed by 0 as the counter wraps around.]
What does the histogram buy us?
Active bucket: contains an active 1.
Only the last bucket might contain expired 1s.
Estimating number of 1s
T – sum of all bucket sizes but last.
– So there are at least T 1s.
C – size of last bucket.
– Actual # of active 1s in it can be anything from 1 to C.
Conclusion:
Estimate: $T + \frac{1}{2}C$
Absolute error $\le \frac{1}{2}C$
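The estimate can be written down directly from the bucket sizes. The sketch below assumes the sizes are given newest first, so the last entry is the oldest bucket (the only one that may contain expired 1s); the function name `estimate_ones` is hypothetical.

```python
def estimate_ones(bucket_sizes):
    """Estimate the number of active 1s from bucket sizes (newest first).

    Every bucket except the last is counted in full (T); the last bucket
    contributes half of its size (C / 2).
    """
    if not bucket_sizes:
        return 0.0
    t = sum(bucket_sizes[:-1])  # T: all buckets but the last
    c = bucket_sizes[-1]        # C: size of the last bucket
    return t + c / 2


# Buckets of sizes 1, 1, 2 and an oldest bucket of size 4:
# T = 4, C = 4, so the estimate is 6 while the true count is between 5 and 8.
example = estimate_ones([1, 1, 2, 4])
```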
Absolute → Relative
Bucket sizes: $C_1, \ldots, C_m$
True count $\ge 1 + T$, where $T = \sum_{i=1}^{m-1} C_i$
$\text{relerr} \le \frac{(C_m - 1)/2}{\text{true count}} \le \frac{(C_m - 1)/2}{1 + \sum_{i=1}^{m-1} C_i}$
Bounding the error
Goal: relative error at most $\epsilon = 1/k$.
This holds if at all times we have, for all j,
$\frac{(C_j - 1)/2}{1 + \sum_{i=1}^{j-1} C_i} \le \frac{1}{k}$
How can we do that? (With as few buckets as possible?)
Non-decreasing bucket sizes.
Bucket sizes constrained to $\{1, 2, 4, \ldots, 2^{m'}\}$, with $m' \le m$ and $m' \le \log\frac{2N}{k} + 1$.
At most $\frac{k}{2} + 1$ buckets of each size.
For all sizes but that of the last bucket, at least $\frac{k}{2}$ buckets of each size.
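The size constraints above are easy to check mechanically. The following sketch assumes k is even and that sizes are listed from the newest bucket to the oldest; `satisfies_invariant` is an illustrative name, not part of the paper.

```python
from collections import Counter


def satisfies_invariant(sizes, k):
    """Check the bucket-size constraints for sizes listed newest to oldest.

    Assumes k is even. Returns True when sizes are non-decreasing powers
    of two, no size occurs more than k/2 + 1 times, and every size smaller
    than the last bucket's size occurs at least k/2 times.
    """
    if not sizes:
        return True
    if any(a > b for a, b in zip(sizes, sizes[1:])):
        return False  # sizes must be non-decreasing
    if any(s < 1 or s & (s - 1) for s in sizes):
        return False  # every size must be a power of two
    counts = Counter(sizes)
    largest = sizes[-1]
    size = 1
    while size <= largest:
        n = counts.get(size, 0)
        if n > k // 2 + 1:
            return False
        if size < largest and n < k // 2:
            return False
        size *= 2
    return True
```

With k = 2, the list `[1, 1, 2, 4]` passes, while `[1, 1, 1, 2]` has too many buckets of size 1 and `[1, 4]` is missing a bucket of size 2.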
Exponential Histogram
[Animation: bucket timestamps and sizes as new elements arrive and buckets are merged.]
New 1 – create bucket.
Check if invariant violated.
Too many buckets – merge.
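The three steps (create a bucket for a new 1, check the invariant, merge when there are too many buckets of one size) can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: k is assumed even, timestamps are plain arrival times rather than a cyclic counter, the merge threshold of k/2 + 2 buckets of one size follows from the "at most k/2 + 1" rule, and all names are hypothetical.

```python
class ExponentialHistogram:
    """Approximate count of 1s among the last n bits (illustrative sketch)."""

    def __init__(self, n, k):
        self.n = n
        self.max_per_size = k // 2 + 1  # merge once a size exceeds this count
        self.buckets = []               # [arrival time of newest 1, size], newest first
        self.time = 0
        self.total = 0                  # sum of all bucket sizes

    def add(self, bit):
        self.time += 1
        # Free the oldest bucket once its timestamp leaves the window.
        if self.buckets and self.buckets[-1][0] <= self.time - self.n:
            self.total -= self.buckets.pop()[1]
        if bit:
            self.buckets.insert(0, [self.time, 1])
            self.total += 1
            self._merge()

    def _merge(self):
        start, size = 0, 1
        while True:
            end = start
            while end < len(self.buckets) and self.buckets[end][1] == size:
                end += 1
            if end - start <= self.max_per_size:
                return
            # Merge the two oldest buckets of this size; the merged bucket
            # keeps the newer timestamp. The merge may cascade upward.
            self.buckets[end - 2][1] = 2 * size
            del self.buckets[end - 1]
            start, size = end - 2, 2 * size

    def estimate(self):
        if not self.buckets:
            return 0.0
        return self.total - self.buckets[-1][1] / 2
```

In this sketch the true count always lies between `total - last + 1` and `total`, where `last` is the size of the oldest bucket, so the estimate is off by at most half of that size.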
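Why it works (correctness)
If $C_j = 2^r$, there are at least $\frac{k}{2}$ buckets of each of the sizes $1, 2, \ldots, 2^{r-1}$.
$\sum_{i=1}^{j-1} C_i \ge \frac{k}{2}\left(1 + 2 + \cdots + 2^{r-1}\right) = \frac{k}{2}(2^r - 1) = \frac{k}{2}(C_j - 1)$
$\frac{(C_j - 1)/2}{1 + \sum_{i=1}^{j-1} C_i} \le \frac{1}{k}$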
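Why it works (space)
At least $\frac{k}{2}$ buckets of each size below that of the last bucket, and they hold only active 1s:
$\frac{k}{2}\sum_{r=0}^{m'-1} 2^r \le \sum_{i=1}^{m-1} C_i \le N$, so $2^{m'} \le \frac{2N}{k} + 1$.
Can account for all 1s with just
$m \le \left(\frac{k}{2} + 1\right)\left(\log\left(\frac{2N}{k} + 1\right) + 1\right)$ buckets.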
Space usage
# of buckets: $O(k \log\frac{2N}{k})$
Bucket size: $\log N + \log\log\frac{2N}{k}$ bits
T counter for estimation: $O(\log N)$
Total: $O(k \log^2 N)$
Operations
Estimation: O(1)
Insertion: cascading makes it $O(\log\frac{2N}{k})$ worst case.
But only O(1) amortized!
A bucket of size B accounts for all operations related to it: B inserts, B−1 merges (& maybe a delete).
The sum of all buckets over the lifetime (including deleted ones) is all past insertions.
Plan
Basic Counting
– Given a bit stream, maintain at every time instant the count of 1s in the last N elements.
– Solved using $O(k \log^2 N)$ space, $O(1)$ query time, $O(1)$ amortized insert time [$O(\log N)$ worst case].
Sum
– Given an integer stream, maintain the sum of the last N elements.
Everything else
Extending to Sum
Integers in range [0, R].
On value V, insert V 1s.
Timestamps: still $\log N$.
Bucket counter: $\log(\log N + \log R)$
# of buckets: $O(k \log\frac{2NR}{k})$
Total space: $O(k(\log N + \log R)\log N)$
Insertion takes $\Omega(R)$!
Reducing insertion time
If we had a way to rebuild the entire histogram…
We could buffer new values…
And rebuild the histogram when the buffer reaches size B.
If a rebuild takes $O(B + k(\log N + \log R))$, the amortized cost per insertion is $O\left(1 + \frac{k(\log N + \log R)}{B}\right)$.
Picking $B = k \log N$ gives $O\left(\frac{\log R}{\log N}\right)$ amortized time.
k/2 canonical representation
If S is the total size of the buckets, computing its k/2 canonical representation would help us rebuild the histogram. (Would it really?)
The k/2 canonical representation of S:
$S = \sum_{i=0}^{j} k_i 2^i$, with $\frac{k}{2} \le k_i \le \frac{k}{2} + 1$ for $i < j$ and $1 \le k_j \le \frac{k}{2} + 1$.
Is this representation unique?
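$S = \sum_{i=0}^{j} k_i 2^i \ge \frac{k}{2}\sum_{i=0}^{j-1} 2^i = \frac{k}{2}(2^j - 1)$
Find the largest j for which $2^j \le \frac{2S}{k} + 1$.
$S' = S - \frac{k}{2}(2^j - 1)$
If $S' \ge 2^j$, find m with $m 2^j \le S' < (m+1) 2^j$ and set $k_j = m$.
$S'' = S' - k_j 2^j$
Let $b_0, \ldots, b_{j-1}$ be the binary representation of $S''$ and set $k_i = \frac{k}{2} + b_i$.
Example: $k/2 = 2$, $j = 2$, $S' = 5$, $S'' = 01$ (binary).
Total time required is O(log S).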
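The procedure can be sketched in a few lines. The version below assumes k is even and at least 2, returns the coefficients $k_0, \ldots, k_j$ as a list, and drops the top coefficient when it would be 0; the function name `canonical_representation` is an illustrative choice.

```python
def canonical_representation(s, k):
    """Return [k_0, ..., k_j] such that s == sum(k_i * 2**i).

    Assumes k is even, k >= 2 and s >= 0. All coefficients except the
    last are k/2 or k/2 + 1; the last one is between 1 and k/2 + 1.
    """
    half = k // 2
    # Largest j with (k/2) * (2**j - 1) <= s.
    j = 0
    while half * (2 ** (j + 1) - 1) <= s:
        j += 1
    rest = s - half * (2 ** j - 1)  # S'
    m = rest // 2 ** j              # candidate for k_j
    rest -= m * 2 ** j              # S'', which is below 2**j
    # k_i = k/2 + b_i, where b_i is bit i of S''.
    coeffs = [half + ((rest >> i) & 1) for i in range(j)]
    if m > 0:
        coeffs.append(m)
    return coeffs


# With k/2 = 2 and S = 11: j = 2, S' = 5, S'' = 1, giving 3*1 + 2*2 + 1*4.
example = canonical_representation(11, 4)
```

Reading the result as "k_i buckets of size 2^i" gives the bucket sizes of the rebuilt histogram.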
[Figure: bucket timestamps before and after new arrivals, with the running totals S.]
Calculate the $S_1 + S_2$ representation.
If a value gets "unindexed", it will never be indexed in the future.
Plan
Basic Counting
– Given a bit stream, maintain at every time instant the count of 1s in the last N elements.
Sum
– Given an integer stream, maintain the sum of the last N elements.
– Solved using $O(k \log N \log(NR))$ space, $O(1)$ query time, $O\left(\frac{\log R}{\log N}\right)$ amortized insert time [$O(\log(NR))$ worst case].
Everything else
• Lower Bounds
• More about timestamps.
• Applications.
• More problems
Lower bounds
Basic Counting and Sum algorithms are optimal.
Similar techniques will show that lots of other problems are intractable. (Later.)
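Basic Counting bound
[Figure: a window of size N divided into blocks of size $B, 2B, 4B, \ldots, 2^i B$; the block of size $2^i B$ consists of B subblocks of size $2^i$, of which $k/4$ are filled with 1s.]
Number of arrangements: $L = \binom{B}{k/4}^{\log(N/B)}$
With $B = \sqrt{Nk}$: $\log L \ge \frac{k}{16}\log^2\frac{N}{k}$
Take two arrangements that differ in big block d, and the leftmost subblock (of size $2^d$) that is filled with 1s in one of them but not in the other.
When that subblock is the oldest one left in the window, block d holds $c\,2^d$ 1s in one arrangement and $(c-1)2^d$ in the other, with $1 \le c \le k/4$; the newer blocks hold $\frac{k}{4}(2^d - 1)$ 1s in both.
abs err $\ge 2^{d-1}$
rel err $\ge \frac{2^{d-1}}{\frac{k}{4}2^d + \frac{k}{4}(2^d - 1)} > \frac{1}{k}$
Same idea works for Sum.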
Randomized bound
Yao minimax principle: the expected space complexity of the optimal algorithm for an input distribution is a lower bound on the expected space complexity of a randomized algorithm.
Lower bound applies to randomized algorithms.
Timestamps
Define the window based on real time: equate timestamp with clock.
No work needs to be done when items don't arrive, so deletions can be deferred.
If far fewer than N items can arrive during the window, memory usage is reduced.
Applications
Adapting algorithms to the sliding window model using EH to replace counters.
Counters require $\log N$ bits, EH takes $k \log^2 N$.
Also a $(1 + \epsilon)$ factor loss in accuracy.
More Problems
Min/Max
– Storing a subsequence of (say) mins is optimal.
Distinct values
– Basic Counting reduces to it.
Other Problems
Distinct values with deletions.
– Factor 2 estimation requires $\Omega(N)$ space.
– Map 1s in a bit string to distinct values. Pad with zeros to infer the value of the last bit, then use deletion to cancel that bit.
– Repeat.
Other Problems
Sum with negative integers.
– Factor 2 estimation requires $\Omega(N)$ space.
– Map 1s in a bit string to (-1,1) and 0s to (1,-1).
– Pad with 0s and query at odd time instants.