Transcript of "Maintaining Variance and k-Medians over Data Stream Windows", a paper by Brian Babcock, Mayur Datar, Rajeev Motwani and Liadan O'Callaghan.
**Page 1**
Maintaining Variance and k-Medians over Data Stream Windows

Paper by Brian Babcock, Mayur Datar, Rajeev Motwani and Liadan O'Callaghan.
Presentation by Anat Rapoport, December 2003.
**Page 2**

Characteristics of the data stream

- Data elements arrive continually
- Only the most recent N elements are used when answering queries
- Single linear scan algorithm (we can only have one look)
- Store only a summary of the data seen thus far
**Page 3**

Introduction

Two important and related problems:
- Variance
- k-median clustering
**Page 4**

Problem 1 (Variance)

Given a stream of numbers, maintain at every instant the variance of the last N values:

$$\mathrm{VAR} = \sum_{i=1}^{N} (x_i - \mu)^2$$

where $\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$ denotes the mean of the last N values.
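For concreteness, here is a naive exact baseline in Python (an illustration of the quantity being maintained, not the paper's algorithm; the function name `sliding_variance` is my own). It buffers the whole window, which is precisely the O(N) cost the streaming algorithm is designed to avoid:

```python
from collections import deque

def sliding_variance(stream, N):
    """Naive exact baseline: buffer the last N elements and recompute.

    VAR is the sum of squared deviations from the window mean, as in
    the definition above. Cost: O(N) memory and O(N) time per element.
    """
    window = deque(maxlen=N)   # keeps only the N most recent elements
    out = []
    for x in stream:
        window.append(x)
        n = len(window)
        mu = sum(window) / n
        out.append(sum((v - mu) ** 2 for v in window))
    return out
```

The streaming algorithm described next obtains (approximately) the same sequence of answers without storing the window.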
**Page 5**

Problem 1 (Variance)

We cannot buffer the entire sliding window in memory, so we cannot compute the variance exactly at every instant.

We will solve this problem approximately: using $O(\frac{1}{\varepsilon^2}\log N)$ memory, we provide an estimate with relative error of at most ε. The time required per new element is amortized O(1).
**Page 6**

Extend to k-median

Given a multiset X of objects in a metric space M with distance function l, the k-median problem is to pick k points c1,…,ck ∈ M so as to minimize

$$\sum_{x \in X} l(x, C(x))$$

where C(x) is the closest of c1,…,ck to x. If C(x) = ci then x is said to be assigned to ci, and l(x, ci) is called the assignment distance of x. The objective function is the sum of the assignment distances.
**Page 7**

Problem 2 (SWKM)

Given a stream of points from a metric space M with distance function l, window size N, and parameter k, maintain at every instant t a median set c1,…,ck ∈ M minimizing

$$\sum_{x \in X_t} l(x, C(x))$$

where Xt is the multiset of the N most recent points at time t.
**Page 8**

Exponential Histogram

From last week: maintaining simple statistics over sliding windows. The exponential histogram (EH) estimates a class of aggregate functions over sliding windows. The result applies to any function f satisfying the following properties for all multisets X, Y:
**Page 9**

Where EH goes wrong

EH can estimate any function f defined over windows which satisfies, for all multisets X, Y:

- Positive: $f(X) \ge 0$
- Polynomially bounded: $f(X) \le \mathrm{poly}(|X|)$
- Composable: $f(X \cup Y)$ can be computed from the synopses of X and Y
- Weakly additive: $f(X) + f(Y) \le f(X \cup Y) \le C_f\,(f(X) + f(Y))$, where $C_f \ge 1$ is a constant

The "weakly additive" condition is not valid for variance or k-medians.
**Page 10**

Failure of "Weak Additivity"

We cannot afford to neglect the contribution of the last bucket.

[Figure: values plotted against time. The variance of each individual bucket is small, but the variance of the combined bucket is large.]
**Page 11**

The idea

- Summarize intervals of the data stream using composable synopses
- For efficient memory use, adjacent intervals are combined when doing so does not increase the error significantly
- The synopsis of the last interval in the sliding window is inaccurate, since some of its points have expired
- However, we will find a way to estimate this interval
**Page 12**

Timestamp

A timestamp corresponds to the position of an active data element in the current window. We do not make explicit updates; instead we use a wraparound counter of log N bits, and a timestamp can be extracted by comparison with the counter value of the current arrival.
**Page 13**

Model

We store the data elements in the buckets of the histogram. Every bucket stores the synopsis structure for a contiguous set of elements; the partition is based on arrival time. Each bucket also carries the timestamp of its most recent data element. When the timestamp reaches N+1 we drop the bucket.
**Page 14**

Model

- Buckets are numbered B1,…,Bm
  - B1 is the most recent
  - Bm is the oldest
- t1,…,tm denote the bucket timestamps
- All buckets but Bm contain only active data elements
**Page 15**

Maintaining variance over sliding windows: the algorithm
**Page 16**

Details

We would like to estimate the variance with relative error of at most ε. Maintain for each bucket Bi, besides its timestamp ti:

- the number of elements ni
- the mean μi
- the variance Vi
**Page 17**

Details

Define another set of buckets B1*,…,Bm* that represent suffixes of the data stream. The bucket Bm* represents all the points that arrived after the oldest non-expired bucket. The statistics for these buckets are computed temporarily rather than stored.
**Page 18**

data structure: exponential histogram

[Figure: the window of size N over elements X1,…,XN, partitioned into buckets B1,…,Bm, with B1 carrying the most recent timestamp and Bm the oldest; the suffix bucket Bm* covers all buckets more recent than Bm.]
**Page 19**

Combination rule

In the algorithm we will need to combine adjacent buckets. Consider two buckets Bi and Bj that get combined to form a new bucket Bi,j. The statistics for Bi,j are:

$$n_{i,j} = n_i + n_j, \qquad
\mu_{i,j} = \frac{n_i \mu_i + n_j \mu_j}{n_{i,j}}, \qquad
V_{i,j} = V_i + V_j + \frac{n_i n_j}{n_{i,j}}(\mu_i - \mu_j)^2$$
**Page 20**

Lemma 1

The bucket combination procedure correctly computes ni,j, μi,j and Vi,j for the new bucket.

Proof. Note that ni,j and μi,j are correctly computed from the definitions of count and average. Define δi = μi − μi,j and δj = μj − μi,j.
**Page 21**

$$
\begin{aligned}
V_{i,j} &= \sum_{x_l \in B_i \cup B_j} (x_l - \mu_{i,j})^2
        = \sum_{x_l \in B_i} (x_l - \mu_{i,j})^2 + \sum_{x_l \in B_j} (x_l - \mu_{i,j})^2 \\
        &= \sum_{x_l \in B_i} \big((x_l - \mu_i) + \delta_i\big)^2
         + \sum_{x_l \in B_j} \big((x_l - \mu_j) + \delta_j\big)^2 \\
        &= V_i + n_i\delta_i^2 + 2\delta_i \sum_{x_l \in B_i}(x_l - \mu_i)
         + V_j + n_j\delta_j^2 + 2\delta_j \sum_{x_l \in B_j}(x_l - \mu_j) \\
        &= V_i + V_j + n_i\delta_i^2 + n_j\delta_j^2
        \qquad\text{(each sum of deviations from its own mean is 0)} \\
        &= V_i + V_j
         + n_i\Big(\frac{n_j(\mu_i - \mu_j)}{n_{i,j}}\Big)^2
         + n_j\Big(\frac{n_i(\mu_j - \mu_i)}{n_{i,j}}\Big)^2
        = V_i + V_j + \frac{n_i n_j}{n_{i,j}}(\mu_i - \mu_j)^2
\end{aligned}
$$
**Page 22**

Main Solution Idea

More careful estimation of the last bucket's contribution. Decompose the variance into two parts:

- "Internal" variance: within a bucket
- "External" variance: between buckets

$$V_{i,j} = \underbrace{V_i + V_j}_{\text{internal variances of buckets } i \text{ and } j}
 + \underbrace{\frac{n_i n_j}{n_i + n_j}\,(\mu_i - \mu_j)^2}_{\text{external variance}}$$
**Page 23**

Estimation of the variance over the current active window

Let Bm' refer to the non-expired portion of the bucket Bm (the set of active elements). The estimates for nm', μm', Vm' are:

- nm'EST = N + 1 − tm (exact)
- μm'EST = μm
- Vm'EST = Vm / 2

The statistics for Bm' and Bm* are sufficient for computing the variance at time t.
**Page 24**

Estimation of the variance over the current active window

The estimate for Bm' can be found in O(1) time if we keep statistics for Bm. The error is due to the error in the estimated statistics for Bm'.

$$\mathrm{VAR}^{EST}(t) = V_{m'} + V_{m^*}
 + \frac{n_{m'}\, n_{m^*}}{n_{m'} + n_{m^*}}\,(\mu_{m'} - \mu_{m^*})^2$$

Theorem: the relative error is at most ε, provided Vm ≤ (ε²/9)Vm*.

Aim: maintain Vm ≤ (ε²/9)Vm* using as few buckets as possible.
**Page 25**

Algorithm sketch

For every new element:
- insert the new element into an existing bucket or a new bucket
- if Bm's timestamp exceeds N, delete the bucket
- if there are two adjacent buckets with small combined variance, combine them into one bucket
**Page 26**

Algorithm 1 (insert xt)

1. If xt = μ1, insert xt into B1 by incrementing n1 by 1. Otherwise, create a new bucket for xt; the new bucket becomes B1 with V1 = 0, μ1 = xt, n1 = 1, and each old bucket Bi becomes Bi+1.

2. If Bm's timestamp exceeds N, delete the bucket; bucket Bm-1 becomes the new oldest bucket. Maintain the statistics of Bm-1* (instead of Bm*), which can be computed from the previously maintained statistics for Bm* and Bm-1 ("deletion" of buckets also works).
**Page 27**

Algorithm 1 (insert xt)

3. Let k = 9/ε², and let Vi,i-1 be the variance of the combination of buckets Bi and Bi-1. While there exists an index i > 2 such that kVi,i-1 ≤ Vi-1*, find the smallest such i and combine the buckets according to the combination rule. The statistics for Bi* can be computed incrementally from the statistics for Bi-1 and Bi-1*.

4. Output the estimated variance at time t according to the estimation procedure, using Vm',m*.
**Page 28**

Invariant 1. For every bucket Bi, (9/ε²)Vi ≤ Vi*
- Ensures that the relative error is at most ε

Invariant 2. For every bucket Bi with i > 1, (9/ε²)Vi,i-1 > Vi-1*
- Ensures that the total number of buckets is small: O((1/ε²) log NR²)
- Each bucket requires constant space
**Page 29**

Lemma 2

The number of buckets maintained at any point in time by an algorithm that preserves Invariant 2 is O((1/ε²) log NR²), where R is an upper bound on the absolute value of the data elements.
**Page 30**

Proof sketch

From the combination rule, the variance of the union of two buckets is no less than the sum of the individual variances. For an algorithm that preserves Invariant 2, the variance of the suffix bucket Bi* doubles after every O(1/ε²) buckets. The total number of buckets is therefore no more than O((1/ε²) log V), where V is the variance of the last N points. Since V is at most NR², this gives O((1/ε²) log NR²).
**Page 31**

Running time improvement

The algorithm requires O((1/ε²) log NR²) time per new element. Most of the time is spent in step 3, where we sweep to combine buckets; this time is proportional to the size of the histogram, O((1/ε²) log NR²).

The trick: skip step 3 until we have seen Θ((1/ε²) log NR²) new elements. This ensures that the time of the algorithm is amortized O(1). We may violate Invariant 2 temporarily, but we restore it every Θ((1/ε²) log NR²) data points, when we execute step 3.
**Page 32**

Variance algorithm summary

- Amortized O(1) time per new element (O((1/ε²) log NR²) per element without the running time improvement)
- O((1/ε²) log NR²) memory, with relative error of at most ε
**Page 33**
Clustering on sliding windows
**Page 34**

Clustering Data Streams

Based on the k-median problem:
- Data stream points come from a metric space
- Find k clusters in the stream such that the sum of distances from data points to their closest center is minimized
**Page 35**

Clustering Data Streams

Constant-factor approximation algorithms. A simple two-step algorithm:

- Step 1: for each set Si of M = n^τ points, find O(k) centers (local clustering: assign each point in Si to its closest center)
- Step 2: let S' be the centers for S1,…,SM, with each center weighted by the number of points assigned to it; cluster S' to find k centers

The solution cost is < 2 × the optimal solution cost. τ < 0.5 is a parameter that trades off the space bound against the approximation factor of 2^{O(1/τ)}.
**Page 36**

One-pass algorithm: first phase

[Figure: original data points 1–5 with k = 1 and M = 3. The points are split into chunks S1 = {1, 2, 3} and S2 = {4, 5}, and each chunk is clustered locally.]
**Page 37**

One-pass algorithm: second phase

[Figure: with k = 1 and M = 3, the local centers (point 1 with weight w = 3 and point 5 with weight w = 2) form the weighted set S', which is then clustered to produce the final center.]
**Page 38**

Restate the algorithm…

[Diagram: the input data stream arrives as level-0 medians. For every N^τ points, find O(k) medians, store them with their weights, and discard the N^τ points; these are level-1 medians. When N^τ weighted level-1 medians accumulate, find O(k) level-2 medians, and so on, repeating up to 1/τ times.]
**Page 39**

The idea

In general, whenever there are N^τ medians at level i, they are clustered to form level-(i+1) medians.

[Diagram: data points at the leaves, level-i medians above them, level-(i+1) medians above those.]
**Page 40**

data structure: exponential histogram

[Figure: as before, the window of size N over X1,…,XN is partitioned into buckets B1 (most recent timestamp) through Bm (oldest). Here each bucket consists of a collection of data points or intermediate medians.]
**Page 41**

Point representation

Each point x is represented by a triple (p(x), w(x), c(x)):
- p(x): identifier of x (its coordinates)
- w(x): weight of x, the number of points it represents
- c(x): cost of x, an estimate of the sum of the costs l(x, y) over all leaves y in the tree of which x is the root

w(x) = w(y1) + w(y2) + … + w(yi)
c(x) = Σ (c(y) + w(y)·l(x, y)), over all y assigned to x

If x is a level-0 median: w(x) = 1, c(x) = 0. Thus, c(x) is an overestimate of the "true" cost of x.
**Page 42**

Bucket cost function

We maintain medians at intermediate levels. Whenever there are N^τ medians at the same level, we cluster them into O(k) medians at the next higher level.

Each bucket Bi can be split into 1/τ groups Ri^0, …, Ri^{1/τ−1}, where group Ri^j contains the medians at level j. Each group contains at most N^τ medians.
**Page 43**

Bucket cost function

The bucket's cost function is an estimate of the cost of clustering the points represented by the bucket. Consider bucket Bi, and let

$$Y_i = \bigcup_{j=0}^{1/\tau - 1} R_i^j$$

be the set of medians in the bucket. The cost function for Bi is

$$f(B_i) = \sum_{x \in Y_i} \big( c(x) + w(x)\, l(x, C(x)) \big)$$

where C(x) ∈ {c1,…,ck} is the median closest to x.
**Page 44**

Combination

Let Bi and Bj be two adjacent buckets that need to be combined to form Bi,j. Let Ri^0,…,Ri^{1/τ−1} and Rj^0,…,Rj^{1/τ−1} be the groups of medians from the two buckets. Set

$$R_{i,j}^0 = R_i^0 \cup R_j^0$$

If $|R_{i,j}^0| \ge N^{\tau}$, then cluster the points of Ri,j^0 into a set C^0 of O(k) medians and set Ri,j^0 to be empty, so that

$$R_{i,j}^1 = R_i^1 \cup R_j^1 \cup C^0$$

and so on. After at most 1/τ unions we get Bi,j. Finally, we compute the new bucket's cost.
**Page 45**

Answer a query

Consider the buckets B1,…,Bm-1. Each contains at most (1/τ)·N^τ medians, so together they contain at most (m/τ)·N^τ medians. Cluster them all to produce k medians, then cluster bucket Bm to get k additional medians, and present the 2k medians as the answer.
**Page 46**

Algorithm (insert xt)

- If the number of level-0 medians in B1 is less than k, add the point xt as a level-0 median in bucket B1. Otherwise, create a new bucket B1 to contain xt and renumber the existing buckets accordingly.
- If bucket Bm's timestamp exceeds N, delete it; Bm-1 becomes the last bucket.
- Make a sweep over the buckets from most recent to least recent: while there exists an index i > 2 such that f(Bi,i-1) ≤ 2f(Bi-1*), find the smallest such i and combine buckets Bi and Bi-1 using the combination procedure described above.
**Page 47**

Invariant 3. For every bucket Bi, f(Bi) ≤ 2f(Bi*)
- Ensures a solution with 2k medians whose cost is within a multiplicative factor of 2^{O(1/τ)} of the cost of the optimal k-median solution.

Invariant 4. For every bucket Bi (i > 1), f(Bi,i-1) > 2f(Bi-1*)
- Ensures that the number of buckets never exceeds O(1/τ + log N), assuming the cost is bounded by poly(N) (stated as O((1/τ) log N) in the article).
**Page 48**

Running time improvement

After each element arrives we check whether Invariant 3 holds. To reduce the running time, we can execute bucket combination only after some number of points have accumulated in bucket B1; only once it fills do we check the invariant. We assume the algorithm is not called after each new entry; instead, it maintains enough statistics to produce an answer when a query arrives.
**Page 49**

Producing exactly k clusters

With each median, we estimate within a constant factor the number of active data points assigned to it. We do not cluster Bm* and Bm' separately; instead we cluster the medians from all the buckets together. However, the weights of the medians from Bm are adjusted so that they reflect only the active data points.
**Page 50**

Conclusions

The goal of such algorithms is to maintain statistics or information over the last N entries of a stream that grows in real time.

The variance algorithm uses O((1/ε²) log NR²) memory, maintains an estimate of the variance with relative error of at most ε, and takes amortized O(1) time per new element.

The k-median algorithm provides a 2^{O(1/τ)} approximation for τ < 0.5. It uses O(1/τ + log N) memory and requires O(1) amortized time per new element.