Approximating Data Stream using histogram for Query Evaluation Huiping Cao Jan. 03, 2003.
Approximating Data Stream using histogram for Query Evaluation
Huiping Cao
Jan. 03, 2003
2
Outline
- Introduction
- Background
- Histogramming a data stream
  - V-Optimal Histogram
  - Optimal Histogram Construction
  - Agglomerative Histogram Algorithm
  - Fixed-Window Histogram Algorithm
- Experiments
- Conclusion
3
Introduction
- A data stream is a sequence of data elements in a fixed order that arrive continuously and at a variable rate.
- Many applications generate streaming data, such as network monitoring records and data generated by sensors.
- Algorithms that handle data streams need new features: single-pass, fast (possibly), limited memory, online (unbounded input).
- Data stream operations: approximate querying, similarity searching, data mining.
- Such operations rely on a good approximation of the data stream; a histogram is a popular way to approximate a data stream.
4
Background
Histogram
- A histogram approximates the data distribution of a data set or data stream by partitioning the underlying data into subsets called buckets.
- A good histogram construction algorithm approximates the data as accurately and as quickly as possible.
- The accuracy of the approximation depends on:
  (1) the partitioning technique used to group values into buckets, i.e., how to partition the data into subsets while inducing little error;
  (2) the approximation technique employed within each bucket, i.e., how to summarize the values in one bucket, e.g., by their mean (average).
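The two factors above can be illustrated with a minimal sketch, assuming each bucket is summarized by its mean and the error is measured as the sum of squared differences. The function names `approximate` and `sum_squared_error` and the 0-based boundary convention are illustrative assumptions, not part of the talk.

```python
def approximate(values, boundaries):
    """Replace every value by the mean of its bucket.

    boundaries is a list of (start, end) pairs, 0-based and inclusive
    (a convention assumed for this sketch).
    """
    approx = []
    for start, end in boundaries:
        bucket = values[start:end + 1]
        mean = sum(bucket) / len(bucket)
        approx.extend([mean] * len(bucket))
    return approx


def sum_squared_error(values, approx):
    """Error induced by the approximation."""
    return sum((v - a) ** 2 for v, a in zip(values, approx))


values = [1, 3, 10, 12]
good = approximate(values, [(0, 1), (2, 3)])  # groups close values together
bad = approximate(values, [(0, 2), (3, 3)])   # mixes 1, 3 and 10 in one bucket
print(sum_squared_error(values, good), sum_squared_error(values, bad))
```

With the same number of buckets and the same summary (the mean), the choice of partition alone changes the error, which is why the rest of the talk concentrates on partitioning.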
5
Background (cont.)
Data stream model
- Agglomerative (landmark) model: takes into account every element seen so far; see Figure 1(a).
- Fixed-window (sliding-window) model: considers only the last n data elements seen, or the elements observed within t time units before the current time; see Figure 1(b).
[Fig. 1(a): the sketch covers the whole stream from t0 = 0 to t_current]
[Fig. 1(b): the sketch covers only the last n elements before t_current]
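The difference between the two models can be sketched with a running sum, a deliberately simple stand-in for a real synopsis. The generator names below are illustrative assumptions.

```python
from collections import deque


def landmark_sums(stream):
    """Landmark model: the summary covers every element seen so far."""
    total = 0
    for x in stream:
        total += x
        yield total


def sliding_window_sums(stream, n):
    """Sliding-window model: the summary covers only the last n elements."""
    window = deque()
    total = 0
    for x in stream:
        window.append(x)
        total += x
        if len(window) > n:
            total -= window.popleft()  # the oldest element expires
        yield total


print(list(landmark_sums([1, 2, 3, 4])))
print(list(sliding_window_sums([1, 2, 3, 4], 2)))
```

In the landmark model nothing ever expires, while in the sliding-window model every arrival beyond the first n also removes one old element, which is what makes maintaining a histogram harder in that setting.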
6
Background (cont.)
Related work
- Approximating specific queries: distinct values ([Gib01]), frequency counts ([MM02]), quantiles ([GK01]), general aggregation ([DGG+02]), joins ([KNV03]).
- Approximation methods: sampling, histograms, wavelets, and other common synopses (Section 6 in [BBD+02]).
- Focus of this talk: query-independent histogram construction methods, concentrating specifically on the partitioning into buckets.
7
Histogramming a data stream
- Optimal histogram ([GK02, IP95])
- Optimal histogram construction ([GK02, JKM+98])
- Agglomerative algorithm ([GK02, GKS01])
- Fixed window algorithm ([GK02])
8
V-Optimal Histogram
- Optimal histogram problem: given a sequence of length n, a number of buckets B, and an error function $E_n()$, find the histogram $H_B$ that minimizes $E_n(H_B)$.
- The problem is independent of the queries.
- [IP95] showed that the V-Optimal histogram is the well-known optimal histogram.
- Basic idea: attribute values are grouped into buckets based on the proximity of their frequencies, not of their actual values.
  $E_n(H_B) = \sum_{i=1}^{B} \sum_{v \in b_i} (f_v - C_{b_i}/V_{b_i})^2$
  - B: maximum number of buckets
  - $b_i$: the i-th bucket
  - $f_v$: the frequency of value v in a bucket
  - $C_{b_i}$, $V_{b_i}$: the sum and the number of the frequencies in bucket $b_i$
9
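Optimal Histogram Construction
- Problem: constructing an optimal histogram amounts to partitioning the index set 1...n into B intervals (buckets) so that E() is minimized.
- Main idea [JKM+98]: the algorithm computes OPT[n,B] and obtains the bucket boundaries at the same time.
  - OPT[i,k] denotes the minimum error of representing [1,...,i] by a histogram with k buckets, where $i \le n$ and $k \le B$.
  - $\mathrm{OPT}[n,B] = \min_{i<n} \{\mathrm{OPT}[i,B-1] + \mathrm{SSE}[i+1,n]\}$
  - $E() = \mathrm{OPT}[n,B] = \sum_{k=1}^{B} \mathrm{SSE}_k$, where $\mathrm{SSE}_k$ is the error of the k-th bucket.
- SSE is the common error metric, the Sum Squared Error:
  $\mathrm{SSE}[a,b] = \sum_{i=a}^{b} (v_i - \mathrm{avg}(v))^2 = \sum_{i=a}^{b} v_i^2 - \frac{1}{b-a+1} \left(\sum_{i=a}^{b} v_i\right)^2$
  $= \mathrm{SQSUM}[1,b] - \mathrm{SQSUM}[1,a-1] - \frac{1}{b-a+1} (\mathrm{SUM}[1,b] - \mathrm{SUM}[1,a-1])^2$
  where $\mathrm{SUM}[1,i] = \sum_{j=1}^{i} v_j$ and $\mathrm{SQSUM}[1,i] = \sum_{j=1}^{i} v_j^2$, with $\mathrm{SUM}[1,0] = \mathrm{SQSUM}[1,0] = 0$.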
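The SSE formula above means that, once the prefix arrays are built, the error of any bucket can be read off in constant time. A minimal sketch, assuming 1-based inclusive positions as on the slide; the function names are illustrative.

```python
from itertools import accumulate


def build_prefix_sums(values):
    """Return (SUM, SQSUM), where index i holds the sum over positions 1..i."""
    sums = [0] + list(accumulate(values))
    sq_sums = [0] + list(accumulate(v * v for v in values))
    return sums, sq_sums


def sse(sums, sq_sums, a, b):
    """Sum squared error of the bucket [a, b], 1-based and inclusive."""
    total = sums[b] - sums[a - 1]
    squares = sq_sums[b] - sq_sums[a - 1]
    return squares - total * total / (b - a + 1)


sums, sq_sums = build_prefix_sums([0, 0, 0, 1, 1, 1, 1, 1])
print(sse(sums, sq_sums, 1, 4), sse(sums, sq_sums, 4, 8))
```

Note that the last term squares the difference of the SUM entries; without the square the expression would not equal the sum of squared deviations.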
10
Optimal Histogram Construction (Cont.)
Algorithm OptimalHistogram()
  Compute SUM[1,i], SQSUM[1,i] for all 1 <= i <= n
  Initialize OPT[j,1] = SSE[1,j], 1 <= j <= n
  For j = 1 to n do
    For k = 2 to B do
      For i = 1 to j-1 do
        OPT[j,k] = min(OPT[j,k], OPT[i,k-1] + SSE[i+1,j])   // OPT[j,k] starts at infinity
Explanation
- For each newly seen element $v_j$, the algorithm computes OPT[j,k] as the minimum cost over all possible positions of the last bucket.
- E.g., $\mathrm{OPT}[n,B] = \min_{i<n} \{\mathrm{OPT}[i,B-1] + \mathrm{SSE}[i+1,n]\}$ is the minimum of:
  OPT[1,B-1] + SSE[2,n]
  OPT[2,B-1] + SSE[3,n]
  ...
  OPT[n-1,B-1] + SSE[n,n]
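The dynamic program above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes 1 <= B <= n, uses exactly B non-empty buckets, and keeps the full OPT table plus a back-pointer table (both hypothetical helpers here) so that the bucket boundaries can be recovered.

```python
def optimal_histogram(values, B):
    """Return (minimum SSE, buckets), buckets as 1-based inclusive (start, end)."""
    n = len(values)
    if not 1 <= B <= n:
        raise ValueError("this sketch assumes 1 <= B <= n")
    # Prefix sums: sums[i] = v_1 + ... + v_i, sq[i] = v_1^2 + ... + v_i^2.
    sums, sq = [0], [0]
    for v in values:
        sums.append(sums[-1] + v)
        sq.append(sq[-1] + v * v)

    def sse(a, b):
        total = sums[b] - sums[a - 1]
        return (sq[b] - sq[a - 1]) - total * total / (b - a + 1)

    inf = float("inf")
    opt = [[inf] * (B + 1) for _ in range(n + 1)]
    back = [[0] * (B + 1) for _ in range(n + 1)]
    for j in range(1, n + 1):
        opt[j][1] = sse(1, j)
        for k in range(2, B + 1):
            for i in range(1, j):  # i is the end of the first k-1 buckets
                cand = opt[i][k - 1] + sse(i + 1, j)
                if cand < opt[j][k]:
                    opt[j][k], back[j][k] = cand, i
    # Walk the back pointers from (n, B) to recover the boundaries.
    buckets, end = [], n
    for k in range(B, 1, -1):
        i = back[end][k]
        buckets.append((i + 1, end))
        end = i
    buckets.append((1, end))
    return opt[n][B], buckets[::-1]


print(optimal_histogram([0, 0, 0, 1, 1, 1, 1, 1], 2))
```

The three nested loops give the O(n^2 B) running time; the sketch trades extra memory for the ability to print the partition.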
11
Optimal Histogram Construction (Cont.)
Example: data sequence {x1, x2, x3, ..., x10}, n = 10, B = 3
  j=1: best partition [1,1]
  j=2: best partition [1,2]
  ...
  j=5, k=B-1: best partition [1,2][3,5]
  j=6, k=B-1: best partition [1,3][4,6]
  ...
  j=9, k=B: OPT[9,B] = OPT[5,B-1] + SSE[6,9], so the best partition is [1,2][3,5][6,9]
  j=10, k=B: OPT[10,B] = OPT[6,B-1] + SSE[7,10], so the best partition is [1,3][4,6][7,10]
Time complexity: $O(n^2 B)$; space complexity: $O(n)$
12
Agglomerative algorithm
- ε-approximation algorithm: given a sequence of length n, a number of buckets B, an error function $E_n()$ and a precision $\epsilon > 0$, find $H_B$ with $E_n(H_B)$ at most $(1+\epsilon) \min_H E_n(H)$.
- If the data sequence is a data stream, then n is the fixed memory space used to store a portion (n data points) of the stream.
- The agglomerative algorithm aims to construct an ε-approximate histogram.
- Can we turn the optimal construction algorithm into an ε-approximation algorithm in the data stream setting?
- The cost of searching for the minimum approximation error is high [GKS01].
13
Agglomerative algorithm (cont.)
Improvement over the OptimalHistogram algorithm: it reduces the cost of computing OPT[j,k].
- OptimalHistogram: $\mathrm{OPT}[j,k] = \min_i (\mathrm{OPT}[i,k-1] + \mathrm{SSE}[i+1,j])$, over all i < j.
- Agg. algorithm: $\mathrm{OPT}[j,k] = \min_i (\mathrm{OPT}[b_i,k-1] + \mathrm{SSE}[b_i+1,j])$, where the $b_i$ are the end points of the intervals used for approximating j data points with k-1 buckets.
- E.g.: if $\{v_i\} = \{v_1, v_2, v_3, \ldots, v_9\}$ and $\{b_i\} = \{v_3, v_5, v_9\}$, the OptimalHistogram algorithm needs to compare 9 values, but the Agg. algorithm needs to compare only 3.
- Reason: $\mathrm{OPT}[b,k-1] + \mathrm{SSE}[b+1,j] \le (1+\delta)(\mathrm{OPT}[i,k-1] + \mathrm{SSE}[i+1,j])$ for $a \le i \le b$, where each interval [a,b] is chosen so that $\mathrm{OPT}[b,k-1] \le (1+\delta)\,\mathrm{OPT}[a,k-1]$, because:
  - SSE[i+1,j] is a non-negative, non-increasing function of i when j is fixed;
  - OPT[i,k-1] is a non-negative, non-decreasing function of i.
14
Agglomerative algorithm (cont.)
Main idea:
- For each $1 \le k \le B$, the algorithm maintains intervals $(a_1^k, b_1^k), \ldots, (a_l^k, b_l^k)$ such that
  $a_1^k = 1$, $b_l^k = n$, $b_j^k + 1 = a_{j+1}^k$ for $j < l$,
  $\mathrm{OPT}[b_j^k, k] \le (1+\delta)\,\mathrm{OPT}[a_j^k, k]$, with $(1+\delta)^B \le 1+\epsilon$.
- Store $\mathrm{OPT}[a_j^k, k]$ and $\mathrm{OPT}[b_j^k, k]$ for all j and k; also store SUM[1,r] and SQSUM[1,r], where $r \in \bigcup_{k,j} (\{a_j^k\} \cup \{b_j^k\})$.
- B-1 queues store the intervals and the related SUMs and SQSUMs.
15
Agglomerative algorithm (cont.)
On seeing the (n+1)-st value $v_{n+1}$, the algorithm:
- Computes OPT[n+1,k] for all $1 \le k \le B$:
  - for k = 1, OPT[n+1,1] = SSE[1,n+1];
  - for $k \ge 2$, $\mathrm{OPT}[n+1,k] = \min_i (\mathrm{OPT}[b_i^{k-1}, k-1] + \mathrm{SSE}[b_i^{k-1}+1, n+1])$.
- Updates the intervals $(a_1^k, b_1^k), \ldots, (a_l^k, b_l^k)$: only the last interval $(a_l^k, b_l^k)$ needs to be updated, either by setting $b_l^k = n+1$ or by creating a new interval l+1 with $a_{l+1}^k = b_{l+1}^k = n+1$.
Time complexity: $O((nB^2/\epsilon) \log n)$; space complexity: $O((B^2/\epsilon) \log n)$
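The update step can be sketched as a one-pass procedure. This is a simplified illustration under explicit assumptions, not the algorithm of [GKS01] as published: it takes the per-level precision `delta` directly, and it keeps the full prefix-sum arrays instead of only the entries at interval end points, so it shows the reduced candidate set but not the space bound. All names are hypothetical.

```python
def agglomerative_histogram(stream, B, delta):
    """Return (approximate OPT[n, B], interval end points of the (B-1)-bucket queue)."""
    sums, sq = [0.0], [0.0]  # assumption: stored in full for simplicity

    def sse(a, b):
        total = sums[b] - sums[a - 1]
        return max(0.0, (sq[b] - sq[a - 1]) - total * total / (b - a + 1))

    # queues[k] holds intervals [a, b, apx_at_a, apx_at_b] for k buckets.
    queues = [[] for _ in range(B)]
    apx = [0.0] * (B + 1)
    for j, v in enumerate(stream, start=1):
        sums.append(sums[-1] + v)
        sq.append(sq[-1] + v * v)
        apx[1] = sse(1, j)
        for k in range(2, B + 1):
            # Only interval end points of level k-1 are tried as split positions.
            apx[k] = min((iv[3] + sse(iv[1] + 1, j) for iv in queues[k - 1]),
                         default=0.0)
        for k in range(1, B):
            q = queues[k]
            if q and apx[k] <= (1 + delta) * q[-1][2]:
                q[-1][1], q[-1][3] = j, apx[k]    # extend the last interval
            else:
                q.append([j, j, apx[k], apx[k]])  # open a new interval
    return apx[B], [iv[1] for iv in queues[B - 1]]


print(agglomerative_histogram([0, 0, 0, 1, 1, 1, 1, 1], 2, 1.0))
```

Each new element only touches the last interval of every queue, and the minimum for level k scans the end points of level k-1 rather than every earlier position.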
16
Fixed Window Algorithm
- The agglomerative algorithm is not very useful for constructing a fixed window histogram. Reason: the computation of a histogram on [1,...,n] does not provide any information about the histogram on [2,...,n].
- Main idea: maintain $\sum_{j=l}^{i} x_j$ and $\sum_{j=l}^{i} x_j^2$ using two arrays SUM' and SQSUM' on [0,n], which are circular buffers. Here $\{x_l, \ldots, x_i\}$ are the observations of interest.
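One way to realize the two circular buffers is sketched below, assuming each buffer has n+1 slots holding cumulative sums so that any range inside the current window is a difference of two slots. The class name and methods are illustrative assumptions; the cumulative values grow without bound here, which a production version would need to rebase.

```python
class WindowPrefixSums:
    """Circular SUM' and SQSUM' buffers over the last n stream elements."""

    def __init__(self, n):
        self.n = n
        self.sums = [0.0] * (n + 1)  # n + 1 slots: positions 0..n of the window
        self.sqs = [0.0] * (n + 1)
        self.count = 0               # total number of elements seen

    def push(self, x):
        prev = self.count % (self.n + 1)
        self.count += 1
        cur = self.count % (self.n + 1)  # overwrites the slot that just expired
        self.sums[cur] = self.sums[prev] + x
        self.sqs[cur] = self.sqs[prev] + x * x

    def size(self):
        return min(self.count, self.n)

    def _at(self, buf, i):
        """Cumulative value at window position i (0 = just before the window)."""
        start = self.count - self.size()
        return buf[(start + i) % (self.n + 1)]

    def sse(self, a, b):
        """SSE of window positions a..b, 1-based and inclusive."""
        total = self._at(self.sums, b) - self._at(self.sums, a - 1)
        squares = self._at(self.sqs, b) - self._at(self.sqs, a - 1)
        return squares - total * total / (b - a + 1)


w = WindowPrefixSums(3)
for x in [1, 2, 3, 4, 5]:
    w.push(x)
print(w.sse(1, 3), w.sse(2, 3))  # the window now holds 3, 4, 5
```

Position 1 always refers to the oldest element still in the window, matching the convention of treating 1 as the first point in the circular buffer.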
17
Fixed Window Algorithm (Cont.)
FixedWindowHistogram()
  Compute SUM' and SQSUM'
  Assume 1 to be the first point in the circular buffer
  For k = 1 to B-1 {
    Initialize the k-th queue to empty
    CreateList[1,n,k]   // time complexity: O((1/δ)^2 log^3 n), where ε = δB
      // creates intervals of [1...n] using k buckets
      // each interval [a,b] satisfies OPT[b,k] <= (1+δ) OPT[a,k], with b maximized
  }
  // let b_l1, b_l2, ... be the end points in Queue B-1
  OPT[n,B] = min_i {OPT[b_li,B-1] + SSE[b_li+1,n]}
Time complexity: $O((B^3/\epsilon^2) \log^3 n)$; space complexity: $O(n)$
18
Fixed Window Algorithm (Cont.)
Example: data sequence {0,0,0,1,1,1,1,1}, δ = 1, B = 2
SUM' = SQSUM' = {0,0,0,1,2,3,4,5}
CreateList[1,8,1] (a=1, b=8, k=1), running steps:
- a=1, OPT[1,1] = 0. Find the index c such that OPT[c,1] <= 0 = (1+δ) OPT[1,1] and c is maximized: c=3, Queue1 = {3}.
- Call CreateList[4,8,1] // CreateList(c+1,b,k). OPT[4,1] = 0.75. Find the index c such that OPT[c,1] <= 1.5 = (1+δ) OPT[4,1] and c is maximized: c=6, Queue1 = {3,6}.
- Call CreateList[7,8,1], which gives Queue1 = {3,6,8}.
19
Fixed Window Algorithm (Cont.)
OPT[n=8,B=2] is the minimum of the following 3 values:
  OPT[3,1] + SSE[4,8] = 0 + 0 = 0
  OPT[6,1] + SSE[7,8] = 1.5 + 0 = 1.5
  OPT[8,1] = 15/8
The minimum is 0, so the best partition is {(1,3),(4,8)}.
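The numbers in this example can be reproduced with a short script. It is a sketch for the k = 1 case only, and it finds each end point by a linear scan rather than by whatever search CreateList uses on the slides (an assumption made for brevity); exact fractions avoid rounding at the boundary 1.5 <= 1.5. The function names are hypothetical.

```python
from fractions import Fraction


def prefix_sums(values):
    sums, sq = [Fraction(0)], [Fraction(0)]
    for v in values:
        sums.append(sums[-1] + v)
        sq.append(sq[-1] + v * v)
    return sums, sq


def sse(sums, sq, a, b):
    """SSE of positions a..b (1-based, inclusive); an empty range costs 0."""
    if a > b:
        return Fraction(0)
    total = sums[b] - sums[a - 1]
    return (sq[b] - sq[a - 1]) - total * total / (b - a + 1)


def create_list_k1(sums, sq, n, delta):
    """End points c with OPT[c,1] <= (1+delta) * OPT[a,1], c maximized."""
    queue, a = [], 1
    while a <= n:
        limit = (1 + delta) * sse(sums, sq, 1, a)
        c = a
        while c + 1 <= n and sse(sums, sq, 1, c + 1) <= limit:
            c += 1
        queue.append(c)
        a = c + 1
    return queue


data = [0, 0, 0, 1, 1, 1, 1, 1]
sums, sq = prefix_sums(data)
queue1 = create_list_k1(sums, sq, len(data), 1)
# OPT[8,2]: try only the end points stored in Queue1 as the split position.
best = min((sse(sums, sq, 1, b) + sse(sums, sq, b + 1, len(data)), b) for b in queue1)
print(queue1, best)
```

The scan is valid because OPT[c,1] never decreases as c grows, so the first position that exceeds the limit marks the end of the interval.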
20
Experimental Evaluation
- Tests:
  - construction performance
  - accuracy of the fixed window algorithm when evaluating range sum queries
- Measures:
  - construction performance: time
  - accuracy: average results
- Data: real data sets extracted from AT&T data warehouses
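For context, a range sum query can be answered from a histogram as sketched below, assuming each bucket stores its position range and its mean. This representation and the function name are assumptions made for illustration; the talk does not specify how the experiments evaluated queries.

```python
def estimate_range_sum(buckets, lo, hi):
    """Estimate the sum of positions lo..hi (1-based, inclusive).

    buckets is a list of (start, end, mean) triples covering the sequence.
    """
    total = 0.0
    for start, end, mean in buckets:
        overlap = min(end, hi) - max(start, lo) + 1
        if overlap > 0:
            total += overlap * mean  # every position in a bucket counts as its mean
    return total


# Histogram of {0,0,0,1,1,1,1,1} with the partition {(1,3),(4,8)}.
buckets = [(1, 3, 0.0), (4, 8, 1.0)]
print(estimate_range_sum(buckets, 2, 5))
```

The estimate is exact whenever each bucket is constant, and otherwise its error depends on how well the partition separates dissimilar values.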
21
Accuracy test for various ε and B
Conclusion:
- For the fixed window histogram, accuracy improves with smaller ε and larger B.
- The fixed window histogram outperforms the wavelet-based histogram.
[Figure: accuracy curves for Exact, Histogram, and Wavelets]
22
Construction time for various ε and B
Conclusion:
- The wavelet-based method is much worse than the fixed window histogram (so its results are not given here).
- Construction time grows as B increases or ε decreases.
23
Conclusion
- Background knowledge on data streams
- Three algorithms for constructing optimal (or ε-approximate) histograms in different scenarios
- Other related work: new operators over a data stream, operations over multiple data streams, sketch techniques, query optimization, etc.
24
Reference 1
[GK02] Sudipto Guha and Nick Koudas. Approximating a data stream for querying and estimation: algorithms and performance evaluation. In ICDE'02.
[GKS01] Sudipto Guha, Nick Koudas and Kyuseok Shim. Data-Streams and Histograms. In STOC'01, pages 471-475.
[IP95] Yannis E. Ioannidis and Viswanath Poosala. Balancing Histogram Optimality and Practicality for Query Result Size Estimation. In SIGMOD'95, pages 233-244.
[JKM+98] H. V. Jagadish, Nick Koudas, S. Muthukrishnan, Viswanath Poosala, Ken Sevcik and Torsten Suel. Optimal Histograms with Quality Guarantees. In VLDB'98, pages 275-286.
25
Reference 2
[BBD+02] B. Babcock, S. Babu, M. Datar, R. Motwani and J. Widom. Models and Issues in Data Stream Systems. In PODS'02, pages 1-16.
[DGG+02] A. Dobra, M. Garofalakis, J. Gehrke and R. Rastogi. Processing complex aggregate queries over data streams. In SIGMOD'02, pages 61-72.
[Gib01] P. B. Gibbons. Distinct Sampling for highly-accurate answers to distinct values queries and event reports. In VLDB'01, pages 541-550.
[GK01] M. Greenwald and S. Khanna. Space-efficient online computation of quantile summaries. In SIGMOD'01, pages 58-66.
[MM02] G. S. Manku and R. Motwani. Approximate frequency counts over data streams. In VLDB'02, pages 346-357.
[KNV03] Jaewoo Kang, J. F. Naughton and Stratis D. Viglas. Evaluating window joins over unbounded streams. In ICDE'03.
26
Thank you!