Algorithms for Massive Data Sets, Lecture 3 (March 2, 2003): Synopses, Samples & Sketches
Synopses
• Synopsis (from Webster): a condensed statement or outline (as of a narrative or treatise)
• Synopsis (here): a succinct data structure that lets us answer queries efficiently
Typical Queries
• Statistics (count, median, variance, aggregates)
• Patterns (clustering, associations, classification)
• Nearest neighbors (L1, L2, Hamming norm)
• Property testing (skewness, independence)
• etc.
Why use Synopses?
• Can’t store the whole data set: e.g., Web data
• Resides in main memory, giving fast query response: e.g., OLAP data
• Remote transmission at minimal cost
• Minimal effect on storage cost
Classification of Synopses
• Are they useful for more than one kind of query?
  – General purpose: e.g., samples
  – Specific purpose: e.g., a distinct-values estimator
• What granularity?
  – One per database: e.g., a sample of the whole relation
  – One per distinct value of an attribute: e.g., profiles for customers in a call database
Some Numbers
• AQUA project (Bell Labs):
  – DB size: 420 MB
  – Synopsis size: 420 KB (0.1%) to 12.5 MB (3%)
  – Accuracy: within 10% for a synopsis 0.1% of the DB size
  – Running time: less than 0.3% of the time for the full query
• Quantile summary (Khanna et al.):
  – DB size: 10^9 tuples
  – Synopsis size: 1249 tuples
  – Accuracy: 1%
Synopses need not be fancy!
• Maintaining the mean (μ) of numbers: keep a running count and a running sum Σ xᵢ
• What about the variance? Also keep Σ xᵢ²; the variance can be computed from the count, Σ xᵢ, and Σ xᵢ²
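A minimal sketch of such a synopsis (an illustration, not code from the lecture): keep only the count, the sum, and the sum of squares.

```python
class MeanVarianceSynopsis:
    """Constant-size synopsis: count, sum, and sum of squares suffice
    to answer mean and variance queries over all inserted numbers."""

    def __init__(self):
        self.count = 0
        self.total = 0.0
        self.total_sq = 0.0

    def update(self, x):
        self.count += 1
        self.total += x
        self.total_sq += x * x

    def mean(self):
        return self.total / self.count

    def variance(self):
        mu = self.mean()
        return self.total_sq / self.count - mu * mu
```

Each update is O(1), and the synopsis is three numbers regardless of how many values stream by.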
Objectives
• Small size
• Fast update and query
• Provable error guarantees (need not give exact answers)
• Composable: useful for distributed scenarios
A coarse classification
• Sampling based : This lecture
• Sketches
• Histograms
Sampling
• Where and how samples are used
• How samples are maintained (single relation)
• Types of samples:
  – Oblivious
  – Value based
• Limitations of oblivious samples
Samples in DSS
• Exact answers are NOT always required
  – DSS applications are usually exploratory: early feedback helps identify “interesting” regions
  – Aggregate queries: precision to the “last decimal” is not needed, e.g., “What percentage of US sales are in NJ?” (displayed as a bar graph)
  – Base data can be remote or unavailable: approximate processing using locally cached data synopses is the only option
(Figure: SQL query sent to Decision Support Systems (DSS), which return an exact answer, with long response times!)
Sampling: Basics
• Idea: a small random sample S of the data often represents all the data well
– For a fast approximate answer, apply the query to S and “scale” the result
– E.g., R.a is {0,1} and S is a 20% sample:
    select count(*) from R where R.a = 0
  becomes
    select 5 * count(*) from S where S.a = 0
(Figure: the R.a column of 0/1 values, with the sampled tuples marked in red; est. count = 5 × 2 = 10, exact count = 10)
• Leverage the extensive literature on confidence intervals for sampling: the actual answer is within an interval [a, b] with a given probability
E.g., 54,000 ± 600 with prob 90%
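The scale-up idea can be demonstrated in a few lines (a toy illustration with synthetic data, not an example from the slides):

```python
import random

random.seed(0)

# Toy R.a column of 0/1 values.
r_a = [random.randint(0, 1) for _ in range(10_000)]

# S: a 20% uniform sample without replacement.
s = random.sample(r_a, len(r_a) // 5)

exact = sum(1 for v in r_a if v == 0)        # select count(*) from R where R.a = 0
estimate = 5 * sum(1 for v in s if v == 0)   # select 5 * count(*) from S where S.a = 0
```

The scaled count is an unbiased estimator of the exact count, and its deviation shrinks as the sample grows.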
The Aqua Architecture
(Figure: a browser or Excel client sends SQL query Q over the network to the data warehouse, e.g., Oracle, which returns the result as HTML/XML; the warehouse receives periodic data updates)
Picture without Aqua:
• User poses a query Q
• Data Warehouse executes Q and returns result
• Warehouse is periodically updated with new data
The Aqua Architecture
Picture with Aqua:
• Aqua is middleware, between the user and the warehouse
• Aqua Synopses are stored in the warehouse
• Aqua intercepts the user query and rewrites it to be a query Q’ on the synopses. Data warehouse returns approximate answer
(Figure: with Aqua, the client’s SQL query Q goes to a Rewriter, which produces Q’ over the AQUA synopses stored in the warehouse, e.g., Oracle; an AQUA Tracker processes the warehouse data updates, and the result is returned, with error bounds, as HTML/XML)
Example rewrite:
  Q : select count(*) from R where R.a = 0
  Q’: select 5 * count(*) from S where S.a = 0
Schema & Queries
• Most queries involve foreign key joins between tables followed by (grouping and) aggregation.
(Figure: a schema DAG of tables L, O, PS, P, S, C, N, R, with foreign-key edges labeled order, part/supp, cust, and nation)
Example Query
What samples are right?
• Naïve approach : maintain samples of each relation in the schema
• Problem : sample of the join is not a join of the samples, even for foreign key joins
• Example:
(Figure: relations A and B with tuples a1, a2 and b1; joining a sample of A with a sample of B can miss join tuples entirely)
Foreign Key Joins
• Foreign key join: effectively, a central “fact” table is appended with columns from the dimension tables.
• Sampling from the join is the same as sampling from the “fact” table itself.
• Synopsis: for every table that may be a “fact” table for a certain join, sample from that table and join the sample with the dimension tables.
Synopsis
• For every node in the DAG:
  – Maintain a sample corresponding to that table.
  – Join the sample with the tables corresponding to all its descendants in the graph.
  – This is the maximal join for which the table is a “fact” table.
(Figure: the same schema DAG of L, O, PS, P, S, C, N, R, with edges order, part/supp, cust, nation)
Bells and whistles!
• How to allocate memory across samples of different “fact” tables?
• Group-by queries: are uniform samples best, or can we do better?
• The aggregate attribute may be skewed: are uniform samples best, or can we do better?
• We may revisit these issues later; we have not seen equations for a while!
How to sample?
• Consider a single table with only insertions
• Want to maintain a sample of this table
• Three semantics of sampling:
  – Coin flip
  – Fixed size without replacement
  – Fixed size with replacement
• The first (coin flip) is easy to maintain under insertions
• Exercise: can we switch between the different kinds of samples? If so, how?
Reservoir Sampling
• Given: a stream of elements (tuples), viewed as insertions into a relation
• Aim: at every instant, maintain a uniform random sample of size n without replacement
• Method: accept the first n elements; then
  – Let t be the number of elements seen so far
  – On seeing the (t+1)st element, include it with probability n/(t+1)
  – If included, evict one of the current sample elements uniformly at random
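The method above fits in a few lines of Python (a straightforward sketch, not the lecture’s own code):

```python
import random

def reservoir_sample(stream, n):
    """Uniform random sample of size n without replacement, in one pass."""
    reservoir = []
    for t, x in enumerate(stream, start=1):   # t = number of elements seen
        if t <= n:
            reservoir.append(x)               # accept the first n elements
        elif random.random() < n / t:         # include the t-th with prob. n/t
            # evict a current sample element uniformly at random
            reservoir[random.randrange(n)] = x
    return reservoir
```

Note that neither the stream length nor the relation size needs to be known in advance.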
Proof of Correctness
• Easy to see that at every instant the size of the sample is exactly n
• Claim: after seeing t elements, every element belongs to the sample with probability n/t
• Exercise: prove the claim by induction
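One way to carry out the induction:

```latex
\textbf{Base case} ($t = n$): the first $n$ elements are all accepted,
so each is in the sample with probability $1 = n/n$.

\textbf{Inductive step}: assume each of the first $t$ elements is in the
sample with probability $n/t$. Element $t+1$ is included with probability
$n/(t+1)$. An element currently in the sample survives unless element
$t+1$ is included \emph{and} evicts it:
\[
  \Pr[\text{survives}] = 1 - \frac{n}{t+1}\cdot\frac{1}{n} = \frac{t}{t+1},
\]
so for every earlier element
\[
  \Pr[\text{in sample after } t+1 \text{ elements}]
  = \frac{n}{t}\cdot\frac{t}{t+1} = \frac{n}{t+1}. \qquad \blacksquare
\]
```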
Efficiency
• Let N be the number of records seen
• Each record (beyond the first n records) is added to the reservoir with probability n/t
• The expected number of records added is

  n + \sum_{t=n+1}^{N} n/t = n(1 + H_N - H_n) \approx n(1 + \ln(N/n))

• Consider any reservoir sample: the t-th element has to be part of the sample with probability no less than n/t. Thus, the quantity above is also a lower bound on the number of additions to the reservoir (time spent)
Efficiency
• The naïve algorithm makes N calls to RANDOM() and takes time O(N)
• Consider the following random variable: let S(n,t) denote the number of elements skipped, where n is the reservoir size and t is the number of elements processed so far
• Aim: study this random variable and sample from its distribution using O(1) operations
• Idea: generate S(n,t) and skip that many records, doing nothing
Observations
• S(n,t) is non-negative
• Let F(s) denote Prob{S(n,t) ≤ s}, for s ≥ 0. Then

  F(s) = 1 - \frac{(t+1-n)^{\overline{s+1}}}{(t+1)^{\overline{s+1}}}

where a^{\underline{b}} denotes the falling power a(a-1)(a-2)…(a-b+1) and a^{\overline{b}} denotes the rising power a(a+1)(a+2)…(a+b-1)
Observations
• Subtracting the two terms corresponding to s and s-1, we get the probability distribution function f(s):

  f(s) = F(s) - F(s-1) = \frac{n}{t+s+1} \cdot \frac{(t+1-n)^{\overline{s}}}{(t+1)^{\overline{s}}}

• The expected value is E[S(n,t)] = (t-n+1)/(n-1)
• A simple way to sample from the distribution of S(n,t): we already calculated its CDF F(s). Generate a uniform random number U between 0 and 1 and find the smallest s such that U ≤ F(s), i.e.

  \frac{(t+1-n)^{\overline{s+1}}}{(t+1)^{\overline{s+1}}} \le 1 - U
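The resulting skip-based algorithm can be sketched as follows (an illustrative implementation of the linear-scan variant, assuming the CDF above; not the lecture’s code):

```python
import random

def skip_reservoir_sample(stream, n):
    """Reservoir sampling that generates S(n,t) and jumps over the skipped
    records, so RANDOM() is called O(1) times per reservoir insertion."""
    it = iter(stream)
    reservoir = []
    t = 0
    for x in it:                              # accept the first n elements
        reservoir.append(x)
        t += 1
        if t == n:
            break
    if t < n:
        return reservoir
    while True:
        # Find the smallest s with U <= F(s) by a linear scan over the CDF:
        # 1 - F(s) = prod_{j=1}^{s+1} (t + j - n) / (t + j).
        u = random.random()
        s = 0
        tail = (t + 1 - n) / (t + 1)          # tail = 1 - F(0)
        while tail > 1.0 - u:
            s += 1
            tail *= (t + s + 1 - n) / (t + s + 1)
        for _ in range(s):                    # skip s records, doing nothing
            if next(it, None) is None:
                return reservoir
            t += 1
        x = next(it, None)                    # the next record enters the sample
        if x is None:
            return reservoir
        t += 1
        reservoir[random.randrange(n)] = x
```

(The `None` sentinel means this sketch assumes the stream itself contains no `None` values.)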
Observations
• This reduces the number of calls to RANDOM() to the optimum: one per insertion into the reservoir
• There are two ways to find the smallest s satisfying the previous inequality:
  – Linear scan: gives an O(N)-time algorithm
  – Binary search / Newton’s interpolation method: gives a running time of O(n²(1 + log(N/n) log log(N/n)))
• Note: this is still not optimal. Read the paper for an algorithm that is optimal up to constants.
What have we seen so far?
• How to sample efficiently (reservoir sampling)
  – A method to sample without replacement in a single scan
  – Optimized the calls to RANDOM()
  – The overall processing time can also be optimized
• How samples are used in DSS, and which samples should be kept in order to answer queries
• What next?
  – Queries in DSS are not simple counts over the entire relation
  – Typically they have grouping followed by aggregation of an attribute that may have high variance
Error using sampling
R = {y_1, y_2, …, y_N}, sample size n. Variance in the data values:

  S^2 = \frac{1}{N-1} \sum_{i=1}^{N} (y_i - \bar{Y})^2

Error = Std Dev = \sqrt{E(\mu - \mu^*)^2} = \frac{S}{\sqrt{n}} \sqrt{1 - \frac{n}{N}}
Group-By Queries
• SELECT avg(salary) FROM census GROUP BY state
• Some states have very few tuples compared to others; e.g., CA has 70 times more people than WY
• If we sample uniformly from the entire relation, there will be very few tuples corresponding to WY, and hence a large error in its avg(salary) estimate
Error Metric (Group-By)
• Let c*_i be the true answer (aggregate) corresponding to group i
• Let c_i be the estimate obtained from the sample
• The error is e_i = |c*_i - c_i| / |c_i|
• The cumulative error is the L1, L2, or L∞ norm of the error vector {e_i}
Optimal sampling strategy
• For every group, the error is inversely proportional to √n, where n is the number of sampled tuples from that group
• To reduce the maximum error across all groups, we should take an equal number of samples from each group (Senate)
• But this strategy is not optimal if the query has no group-by and ranges over the entire relation; in that case a uniform sample of the entire relation is optimal (House)
Basic-Congress Sampling
• Unfortunately, unlike the U.S. Congress, we don’t have room to seat both Senators and House Representatives!
• Hence we do the following:
  – Let X be the total number of seats allotted to Congress
  – For a state, say CA, let CA_S (resp. CA_H) be the seats allotted to it if Congress consisted only of the Senate (resp. the House)
  – The final seat allocation to each state is proportional to max(CA_S, CA_H), subject to the total number of seats being X
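The allocation rule above can be sketched in a few lines (the function name, the dict interface, and the final renormalization step are illustrative choices, not taken from the slides):

```python
def basic_congress_allocation(group_sizes, total_seats):
    """Allocate total_seats across groups in proportion to
    max(senate share, house share) for each group."""
    g = len(group_sizes)
    n = sum(group_sizes.values())
    raw = {}
    for k, size in group_sizes.items():
        senate = total_seats / g             # equal share per group
        house = total_seats * size / n       # share proportional to group size
        raw[k] = max(senate, house)
    scale = total_seats / sum(raw.values())  # renormalize so shares sum to X
    return {k: v * scale for k, v in raw.items()}
```

Small groups end up near a Senate-style share and large groups near a House-style share, with everyone shrunk slightly so the totals still add up to X.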
Comments
• No error guarantees; only a best-effort solution
• Cannot use reservoir sampling any more; the full paper discusses one-pass algorithms, but admits that they don’t work in all cases
• What if the variance in the values (S) is large? Outlier indexing
Error using sampling (recap)
R = {y_1, y_2, …, y_N}, sample size n. Variance in the data values:

  S^2 = \frac{1}{N-1} \sum_{i=1}^{N} (y_i - \bar{Y})^2

Error = Std Dev = \frac{S}{\sqrt{n}} \sqrt{1 - \frac{n}{N}}
Presence of Data Skew
Outliers (deviant tuples): suppose 9,950 tuples have value 1 and 50 tuples have value 1,000, so the exact sum is 59,950. Take a uniform sample of size 100 and scale:
• Case 1: no outlier is sampled, so the sum estimate is 10,000
• Case 2: at least one outlier is sampled, so the sum estimate is at least 109,900
Either way, the error exceeds 83%.
Outlier Indexing Scheme
(Figure: preprocessing splits relation R into the outliers R_O, which are stored in full, and the non-outliers R_NO, which are sampled. A query Q is answered exactly over R_O, giving A1, and over the sample of R_NO with extrapolation, giving A2; the final answer is A = A1 + A2)
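A sketch of the scheme for SUM (working over plain values rather than tuples, with a value-based outlier test, purely for illustration):

```python
import random

def outlier_indexed_sum(values, outlier_values, sample_size):
    """Exact sum over the outlier index R_O, plus an extrapolated sum
    over a uniform sample of the non-outliers R_NO."""
    r_o = [v for v in values if v in outlier_values]      # outliers, kept in full
    r_no = [v for v in values if v not in outlier_values]
    a1 = sum(r_o)                                         # exact answer on R_O
    sample = random.sample(r_no, min(sample_size, len(r_no)))
    a2 = sum(sample) * len(r_no) / len(sample)            # extrapolate on R_NO
    return a1 + a2
```

On the skewed data from the previous slide (9,950 ones and 50 thousands), this returns the exact sum: the outliers are handled exactly, and the residual values have zero variance.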
Selection of Outlier Index
Objective: remove at most n outliers such that the non-outliers have the least variance.
Theorem: for a sorted (multi)set of values v_1 ≤ v_2 ≤ … ≤ v_N, the optimal outlier set consists of a prefix and a suffix of the sorted order: {v_1, …, v_k} ∪ {v_m, …, v_N}.
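The theorem suggests a simple selection procedure: sort, then try every way of splitting the removals between a prefix and a suffix (a sketch; for simplicity it removes exactly n_out values rather than “at most”):

```python
def select_outliers(values, n_out):
    """Pick n_out outliers as a prefix plus a suffix of the sorted order,
    minimizing the variance of the remaining values."""
    v = sorted(values)
    N = len(v)
    # Prefix sums give O(1) variance for any contiguous slice v[i:j].
    ps, ps2 = [0.0], [0.0]
    for x in v:
        ps.append(ps[-1] + x)
        ps2.append(ps2[-1] + x * x)

    def var(i, j):
        m = j - i
        s, s2 = ps[j] - ps[i], ps2[j] - ps2[i]
        return s2 / m - (s / m) ** 2

    # k values removed from the front, n_out - k from the back.
    k = min(range(n_out + 1), key=lambda k: var(k, N - (n_out - k)))
    return v[:k] + v[N - (n_out - k):]
```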
Comments
• Cannot do reservoir sampling
• One-pass algorithm for selection of the outliers
Types of Samples
• Oblivious samples: we do not look at the attribute values while sampling
• Value-based sampling: the distinct sampling of Gibbons et al.
• Limitations of oblivious sampling: see “Sampling Algorithms: Lower Bounds and Applications”, Z. Bar-Yossef, S. Ravi Kumar, and D. Sivakumar. STOC 2001.
Summary
• The most obvious type of synopsis: samples
• The use of samples in databases, in particular DSS; the idea of maintaining samples of “fact” tables
• How to sample without replacement in a single pass, without knowing the size of the relation a priori: reservoir sampling, and tricks to make it efficient
• Shortcomings of sampling in databases:
  – Group-by queries: congressional samples
  – High skew in the data: outlier indexing, stratified sampling
References
• Join Synopses for Approximate Query Answering, S. Acharya, P. Gibbons, V. Poosala, and S. Ramaswamy. SIGMOD 1999.
• Congressional Samples for Approximate Answering of Group-By Queries, S. Acharya, P. Gibbons, and V. Poosala. SIGMOD 2000.
• Overcoming Limitations of Sampling for Aggregation Queries, S. Chaudhuri, G. Das, M. Datar, R. Motwani, and V. Narasayya. ICDE 2001.
• A Robust Optimization-Based Approach for Approximate Answering of Aggregate Queries, S. Chaudhuri, G. Das, and V. Narasayya. SIGMOD 2001.
• Random Sampling with a Reservoir, J. S. Vitter. ACM Trans. on Mathematical Software 11(1):37-57 (1985).
Sampling over Sliding Windows
• Samples of streaming data
• Need to account for staleness of the data
• A data element is fresh if it belongs to the last n elements
• Problem statement: given a stream of elements, maintain a uniform random sample of size k over the window of the last n elements
A Simple, Unsatisfying Approach
• Choose a random subset X = {x1, …, xk}, X ⊆ {0, 1, …, n-1}
• The sample always consists of the non-expired elements whose indexes are equal to x1, …,xk (modulo n)
• Only uses O(k) memory
• Technically produces a uniform random sample of each window, but unsatisfying because the sample is highly periodic
• Unsuitable for many real applications, particularly those with periodicity in the data
Reservoir Sampling: Why It Doesn’t Work
• Suppose an element in the reservoir expires
• Need to replace it with a randomly-chosen element from the current window
• However, in the data stream model we have no access to past data
• Could store the entire window but this would require O(n) memory
Chain-Sample
• Include each new element (the i-th) in the sample with probability 1/min(i, n)
• As each element is added to the sample, choose the index of the element that will replace it when it expires
• When the ith element expires, the window will be (i+1…i+n), so choose the index from this range
• Once the element with that index arrives, store it and choose the index that will replace it in turn, building a “chain” of potential replacements
• When an element is chosen to be discarded from the sample, discard its “chain” as well
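The bullets above can be sketched for sample size 1 as follows (an illustrative implementation; the index ranges follow the description above, while the data structures and bookkeeping are assumptions of this sketch):

```python
import random

def chain_sample(stream, n):
    """Chain-sample of size 1 over a sliding window of the last n elements.
    Returns the final (index, value) sample."""
    sample = None        # (index, value) currently in the sample
    chain = []           # stored replacements, in increasing index order
    next_idx = None      # index we are waiting to store in the chain
    for i, x in enumerate(stream):
        # Store x if it is the awaited replacement; then pick the index
        # that will replace *it* when it expires (drawn from i+1 .. i+n).
        if next_idx == i:
            chain.append((i, x))
            next_idx = random.randrange(i + 1, i + n + 1)
        # If the sampled element fell out of the window, promote its
        # replacement from the chain (it has already arrived by now).
        if sample is not None and sample[0] <= i - n:
            sample = chain.pop(0)
        # Include x as the new sample with probability 1/min(i+1, n),
        # discarding the old chain.
        if random.random() < 1.0 / min(i + 1, n):
            sample = (i, x)
            chain = []
            next_idx = random.randrange(i + 1, i + n + 1)
    return sample
```

A replacement index drawn from (i+1 … i+n) is guaranteed to arrive before the element it replaces expires, which is why the promotion step never finds an empty chain.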
Example
(Figure: four snapshots of the stream 3 5 1 4 6 2 8 5 2 3 5 4 2 2 5 0 9 8 4 6 7 3, illustrating how the sliding window, the current sample, and its chain of replacements evolve at successive steps)
Memory Usage of Chain-Sample
• Let T(x) denote the expected length of the chain from the element with index i when the most recent index is i+x
• T(x) = 0 for x < 0, and T(x) = 1 + (1/n) Σ_{0 ≤ j < x} T(j) for x ≥ 0, which solves to T(x) = (1 + 1/n)^x
• The expected length of each chain is at most T(n) = (1 + 1/n)^n < e ≈ 2.718
• Expected memory usage is O(k)
Memory Usage of Chain-Sample
• A chain consists of “hops” with lengths 1…n
• A chain of length j can be represented by a partition of n into j ordered integer parts: j-1 hops with sum less than n, plus a remainder
• Each such partition has probability n^{-j}
• The number of such partitions is \binom{n}{j} < (ne/j)^j
• The probability of any such partition is small, O(n^{-c}), when j = O(k log n)
• Hence chain-sample uses O(k log n) memory whp
Comparison of Algorithms
• Chain-sample is preferable to oversampling:– Better expected memory usage: O(k) vs. O(k log n)
– Same high-probability memory bound of O(k log n)
– No chance of failure due to sample size shrinking below k
| Algorithm    | Expected   | High-Probability |
|--------------|------------|------------------|
| Periodic     | O(k)       | O(k)             |
| Oversample   | O(k log n) | O(k log n)       |
| Chain-Sample | O(k)       | O(k log n)       |