Approximating Data Stream using histogram for Query Evaluation Huiping Cao Jan. 03, 2003.



Outline

Introduction
Background
Histogramming a data stream: V-Optimal Histogram; Optimal Histogram Construction; Agglomerative Histogram Algorithm; Fixed-Window Histogram Algorithm
Experiments
Conclusion


Introduction

A data stream is a sequence of data elements in a fixed order that arrive continuously and at a variable rate.

Many applications generate streaming data, such as network monitoring records and data generated by sensors.

Algorithms that handle data streams have new requirements: single pass, high speed (possibly), limited memory, and online processing of unbounded input.

Data stream operations: approximate querying, similarity searching, data mining.

Such operations rely on a good approximation of the data stream; a histogram is a popular way to approximate a data stream.


Background

Histogram

A histogram approximates the data distribution of a data set or data stream by partitioning the underlying data into subsets called buckets.

A good histogram construction algorithm approximates the data as accurately and as quickly as possible.

The accuracy of the approximation depends on:
(1) the partitioning technique used to group values into buckets, i.e., how to partition the data into subsets while inducing as little error as possible;
(2) the approximation technique employed within each bucket, i.e., how to summarize the values in one bucket, e.g., by their mean (average).


Background (cont.)

Data stream model

Agglomerative (landmark) model: takes into account every element seen so far; see Figure 1(a).

Fixed-window (sliding-window) model: considers only the last n data elements seen, or the elements observed within t time units before the current time; see Figure 1(b).

[Fig. 1(a): sketch covering the stream from t0 = 0 to tcurrent]

[Fig. 1(b): sketch covering the last n elements up to tcurrent; the stream starts at t0 = 0]
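The difference between the two stream models can be illustrated with a running sum. This is only a minimal sketch: the class names `LandmarkSum` and `SlidingWindowSum` are hypothetical and do not come from the talk or the cited papers.

```python
from collections import deque


class LandmarkSum:
    """Agglomerative (landmark) model: every element since t0 counts."""

    def __init__(self):
        self.total = 0.0
        self.count = 0

    def add(self, x):
        self.total += x
        self.count += 1


class SlidingWindowSum:
    """Fixed-window model: only the last n elements count."""

    def __init__(self, n):
        # A bounded deque evicts the oldest element automatically.
        self.window = deque(maxlen=n)

    def add(self, x):
        self.window.append(x)

    @property
    def total(self):
        return sum(self.window)


landmark, window = LandmarkSum(), SlidingWindowSum(n=3)
for x in [1, 2, 3, 4, 5]:
    landmark.add(x)
    window.add(x)
```

After the loop, `landmark` summarizes all five elements while `window` only reflects 3, 4, and 5.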


Background (cont.)

Related work

Approximating specific queries: distinct values ([Gib01]), frequency counts ([MM02]), quantiles ([GK01]), general aggregation ([DGG+02]), joins ([KNV03]).

Approximation methods: sampling, histograms, wavelets, and other common synopses (Section 6 of [BBD+02]).

Focus of this talk: query-independent histogram construction methods, concentrating specifically on the partitioning of buckets.


Histogramming a data stream

Optimal histogram ([GK02, IP95])
Optimal histogram construction ([GK02, JKM+98])
Agglomerative algorithm ([GK02, GKS01])
Fixed-window algorithm ([GK02])


V-Optimal Histogram

Optimal histogram problem: given a sequence of length n, a number of buckets B, and an error function $E_n()$, find a histogram $H_B$ that minimizes $E_n(H_B)$.

The construction is independent of queries.

[IP95] showed that the V-Optimal histogram is the well-known optimal histogram.

Basic idea: attribute values are grouped into buckets based on proximity in their frequencies, not in their actual values.

$E_n() = \sum_{b_i,\ i \in [1,\ldots,B]} \ \sum_{v \in b_i} \left(f_v - C_{b_i}/V_{b_i}\right)^2$

B: maximum number of buckets
$b_i$: the i-th bucket
$f_v$: the frequency of v in one bucket
$C_{b_i}$, $V_{b_i}$: the sum and the number of frequencies in bucket $b_i$
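As an illustration of the error function, the following sketch evaluates it for a given bucketing. The function name `v_optimal_error`, the example frequencies, and the representation of buckets as inclusive index pairs are assumptions made for this example only.

```python
def v_optimal_error(freqs, boundaries):
    """Sum, over all buckets, of squared deviations from the bucket mean.

    freqs      -- list of frequencies f_v
    boundaries -- list of (start, end) index pairs, 0-based and inclusive
    """
    error = 0.0
    for start, end in boundaries:
        bucket = freqs[start:end + 1]
        mean = sum(bucket) / len(bucket)  # C_bi / V_bi
        error += sum((f - mean) ** 2 for f in bucket)
    return error


freqs = [10, 12, 11, 50, 52, 3]
# Buckets that group similar frequencies together ...
good = v_optimal_error(freqs, [(0, 2), (3, 4), (5, 5)])
# ... versus buckets that mix very different frequencies.
bad = v_optimal_error(freqs, [(0, 1), (2, 3), (4, 5)])
```

The first bucketing has a far smaller error than the second, which is what the optimal histogram problem tries to achieve over all possible bucketings.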


Optimal Histogram Construction

Problem: constructing an optimal histogram amounts to partitioning the index set 1...n into B intervals (buckets) so as to minimize E().

Main idea [JKM+98]: the algorithm computes OPT[n,B] and obtains the bucket boundaries at the same time.

OPT[i,k] denotes the minimum error of representing [1,...,i] by a histogram with k buckets, where $i \le n$ and $k \le B$.

$OPT[n,B] = \min_{i<n} \{ OPT[i,B-1] + SSE[i+1,n] \}$

$E() = OPT[n,B] = \sum_{k \in [1 \ldots B]} SSE_k$, where $SSE_k$ is the error of the k-th bucket.

SSE is the common error metric, the sum squared error:

$SSE([a,b]) = \sum_{i \in [a,b]} (v_i - avg(v))^2 = \sum_{i \in [a,b]} v_i^2 - \frac{1}{b-a+1} \left( \sum_{i \in [a,b]} v_i \right)^2$

$= SQSUM[1,b] - SQSUM[1,a-1] - \frac{1}{b-a+1} \left( SUM[1,b] - SUM[1,a-1] \right)^2$

where $SUM[1,i] = \sum_{j} v_j$ and $SQSUM[1,i] = \sum_{j} v_j^2$, with $j \in [1,\ldots,i]$.
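The prefix-sum form of SSE can be sketched as follows. The helper names `build_prefix_sums` and `sse` are hypothetical, and the arrays carry a leading zero so that position i holds SUM[1,i].

```python
from itertools import accumulate


def build_prefix_sums(values):
    """Return (SUM, SQSUM) with a leading 0, so SUM[i] covers v_1..v_i."""
    prefix_sum = [0.0] + list(accumulate(values))
    prefix_sqsum = [0.0] + list(accumulate(v * v for v in values))
    return prefix_sum, prefix_sqsum


def sse(prefix_sum, prefix_sqsum, a, b):
    """SSE of v_a..v_b (1-based, inclusive), computed in O(1)."""
    s = prefix_sum[b] - prefix_sum[a - 1]
    sq = prefix_sqsum[b] - prefix_sqsum[a - 1]
    return sq - s * s / (b - a + 1)


values = [0, 0, 0, 1, 1, 1, 1, 1]
SUM, SQSUM = build_prefix_sums(values)
first_four = sse(SUM, SQSUM, 1, 4)   # SSE of {0, 0, 0, 1}
all_eight = sse(SUM, SQSUM, 1, 8)    # SSE of the whole sequence
```

Once the two arrays are built in one pass, the SSE of any interval costs a constant number of operations, which is what makes the dynamic program practical.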


Optimal Histogram Construction (cont.)

Algorithm OptimalHistogram()
    Compute SUM[1,i] and SQSUM[1,i] for all 1 ≤ i ≤ n
    Initialize OPT[j,1] = SSE[1,j] for 1 ≤ j ≤ n
    For j = 1 to n do
        For k = 2 to B do
            For i = 1 to j-1 do
                OPT[j,k] = min_i (OPT[i,k-1] + SSE[i+1,j])

Explanation: for each newly seen element $v_j$, the algorithm computes OPT[j,k] as the minimum cost over all possible choices of the last bucket. E.g., $OPT[n,B] = \min_{i<n} \{ OPT[i,B-1] + SSE[i+1,n] \}$ means taking the minimum of

    OPT[1,B-1] + SSE[2,n]
    OPT[2,B-1] + SSE[3,n]
    ...
    OPT[n-1,B-1] + SSE[n,n]

This minimum is OPT[n,B].
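A minimal Python sketch of this dynamic program is given below. The function name `optimal_histogram` is hypothetical, and the sketch assumes the sequence has at least as many elements as buckets. Unlike a space-optimized version, it keeps the full table so that the bucket boundaries can be recovered, which costs O(nB) space.

```python
from itertools import accumulate


def optimal_histogram(values, num_buckets):
    """Return (minimum total SSE, list of 1-based inclusive bucket ranges)."""
    n = len(values)
    ps = [0.0] + list(accumulate(values))
    pq = [0.0] + list(accumulate(v * v for v in values))

    def sse(a, b):
        s = ps[b] - ps[a - 1]
        return pq[b] - pq[a - 1] - s * s / (b - a + 1)

    inf = float("inf")
    # opt[k][j]: minimum error of covering v_1..v_j with exactly k buckets.
    opt = [[inf] * (n + 1) for _ in range(num_buckets + 1)]
    cut = [[0] * (n + 1) for _ in range(num_buckets + 1)]
    for j in range(1, n + 1):
        opt[1][j] = sse(1, j)
    for k in range(2, num_buckets + 1):
        for j in range(k, n + 1):
            for i in range(k - 1, j):  # the last bucket is [i+1, j]
                cost = opt[k - 1][i] + sse(i + 1, j)
                if cost < opt[k][j]:
                    opt[k][j], cut[k][j] = cost, i

    # Walk the stored cut points backwards to recover the boundaries.
    buckets, j = [], n
    for k in range(num_buckets, 0, -1):
        i = cut[k][j] if k > 1 else 0
        buckets.append((i + 1, j))
        j = i
    return opt[num_buckets][n], buckets[::-1]


error, buckets = optimal_histogram([0, 0, 0, 1, 1, 1, 1, 1], 2)
```

For this small input the three nested loops mirror the O(n²B) running time of the algorithm.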


Optimal Histogram Construction (cont.)

Example: data sequence {x1, x2, x3, ..., x10}, n = 10, B = 3.

j=1: best partition [1,1]
j=2: best partition [1,2]
...
j=5, k=B-1: best partition [1,2][3,5]
j=6, k=B-1: best partition [1,3][4,6]
...
j=9, k=B: OPT[9,B] = OPT[5,B-1] + SSE[6,9], so the best partition is [1,2][3,5][6,9]
j=10, k=B: OPT[10,B] = OPT[6,B-1] + SSE[7,10], so the best partition is [1,3][4,6][7,10]

Time complexity: $O(n^2 B)$. Space complexity: $O(n)$.


Agglomerative algorithm: an ε-approximation algorithm

Problem: given a sequence of length n, a number of buckets B, an error function $E_n()$, and a precision $\varepsilon > 0$, find $H_B$ with $E_n(H_B)$ less than $(1+\varepsilon) \min_H E_n(H)$.

If the data sequence is a data stream, then n is the fixed memory space used to store a portion (n data points) of the stream.

The agglomerative algorithm aims to construct an ε-approximate histogram.

Can the optimal construction algorithm be turned into an ε-approximation algorithm in the data stream setting?

The cost of searching for the minimum approximation error is high [GKS01].


Agglomerative algorithm (cont.)

Improvement over the OptimalHistogram algorithm: it reduces the cost of computing OPT[j,k].

OptimalHistogram: $OPT[j,k] = \min_i (OPT[i,k-1] + SSE[i+1,j])$

Agglomerative algorithm: $OPT[j,k] = \min_{b_i} (OPT[b_i,k-1] + SSE[b_i+1,j])$, where the $b_i$ are the end points of the intervals kept for approximating the j data points using k-1 buckets.

E.g., if $\{v_i\} = \{v_1, v_2, v_3, \ldots, v_9\}$ and $\{b_i\} = \{v_3, v_5, v_9\}$, then the OptimalHistogram algorithm needs to compare 9 values, but the agglomerative algorithm needs to compare only 3.

Reason: $OPT[b,k-1] + SSE[b+1,j] \le (1+\delta)(OPT[i,k-1] + SSE[i+1,j])$ for $a \le i \le b$, where [a,b] is one of the maintained intervals and $\delta$ is the precision allowed within an interval. This holds because:
SSE[i+1,j] is a non-negative, non-increasing function of i when j is fixed;
OPT[i,k-1] is a non-negative, non-decreasing function of i.
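The reasoning behind this bound can be spelled out step by step. The derivation below is a sketch under the assumption that each maintained interval [a,b] satisfies $OPT[b,k-1] \le (1+\delta)\,OPT[a,k-1]$, with $\delta$ denoting the per-interval precision (the symbol name is our choice).

```latex
\begin{align*}
OPT[b,k-1] + SSE[b+1,j]
  &\le (1+\delta)\,OPT[a,k-1] + SSE[b+1,j]
     && \text{interval condition on } [a,b] \\
  &\le (1+\delta)\,OPT[i,k-1] + SSE[i+1,j]
     && OPT \text{ non-decreasing, } SSE \text{ non-increasing in } i \\
  &\le (1+\delta)\,\bigl(OPT[i,k-1] + SSE[i+1,j]\bigr)
     && SSE \ge 0
\end{align*}
```

So replacing any candidate i inside an interval by the interval's end point b loses at most a factor of $(1+\delta)$.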


Agglomerative algorithm (cont.)

Main idea: for each $1 \le k \le B$, the algorithm maintains intervals $(a_1^k, b_1^k), \ldots, (a_l^k, b_l^k)$ such that

$a_1^k = 1$, $b_l^k = n$, and $b_j^k + 1 = a_{j+1}^k$ for $j < l$;

$OPT[b_j^k, k] \le (1+\delta)\, OPT[a_j^k, k]$, where $(1+\delta)^B \le 1+\varepsilon$.

Store $OPT[a_j^k, k]$ and $OPT[b_j^k, k]$ for all j and k; also store SUM[1,r] and SQSUM[1,r], where $r \in \bigcup_{k,j} \left( \{a_j^k\} \cup \{b_j^k\} \right)$.

B-1 queues store the intervals and the related SUMs and SQSUMs.


Agglomerative algorithm (cont.)

On seeing the (n+1)-st value $v_{n+1}$, the algorithm:

Computes OPT[n+1,k] for all $1 \le k \le B$:
    for k = 1, OPT[n+1,1] = SSE[1,n+1];
    for $k \ge 2$, $OPT[n+1,k] = \min_i \left( OPT[b_i^{k-1}, k-1] + SSE[b_i^{k-1}+1, n+1] \right)$.

Updates the intervals $(a_1^k, b_1^k), \ldots, (a_l^k, b_l^k)$: only the last interval $(a_l^k, b_l^k)$ needs updating, either by setting $b_l^k = n+1$ or by creating a new interval l+1 with $a_{l+1}^k = b_{l+1}^k = n+1$.

Time complexity: $O((nB^2/\varepsilon) \log n)$. Space complexity: $O((B^2/\varepsilon) \log n)$.
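The one-pass update can be sketched in Python as follows. This is a simplified illustration, not the implementation from [GKS01]: the function name `agglomerative_error`, the dictionary layout of the intervals, the choice delta = eps / (2B), and the rule that a histogram may use fewer than k buckets are all assumptions made here. It returns only the approximate error, not the bucket boundaries.

```python
def agglomerative_error(stream, num_buckets, eps):
    """Approximate OPT[n, B] in one pass, keeping only interval end points."""
    delta = eps / (2 * num_buckets)  # assumed per-level precision
    # queues[k] holds the intervals for k buckets (k = 1..B-1). Each interval
    # stores OPT at its start, plus the position, OPT, SUM and SQSUM at its end.
    queues = [[] for _ in range(num_buckets)]
    n = 0
    total = sq_total = 0.0
    opt = [0.0] * (num_buckets + 1)
    for x in stream:
        n += 1
        total += x
        sq_total += x * x
        # OPT[n, 1] is the SSE of the whole prefix.
        opt[1] = max(0.0, sq_total - total * total / n)
        for k in range(2, num_buckets + 1):
            best = opt[k - 1]  # assumption: fewer than k buckets is allowed
            for iv in queues[k - 1]:
                s = total - iv["sum"]
                tail = (sq_total - iv["sq"]) - s * s / (n - iv["end"])
                best = min(best, iv["end_opt"] + max(0.0, tail))
            opt[k] = best
        # Extend the last interval of each queue, or open a new one at n.
        for k in range(1, num_buckets):
            q = queues[k]
            if q and opt[k] <= (1 + delta) * q[-1]["start_opt"]:
                q[-1].update(end=n, end_opt=opt[k], sum=total, sq=sq_total)
            else:
                q.append({"start_opt": opt[k], "end": n, "end_opt": opt[k],
                          "sum": total, "sq": sq_total})
    return opt[num_buckets]


approx = agglomerative_error([0, 0, 0, 1, 1, 1, 1, 1], num_buckets=2, eps=0.5)
```

Only the end points of the intervals are consulted when a new value arrives, so the work per element depends on the number of intervals rather than on n.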


Fixed Window Algorithm

The agglomerative algorithm is not very useful for constructing a fixed-window histogram. Reason: computing a histogram on [1,...,n] does not provide any information about the histogram on [2,...,n].

Main idea: maintain $\sum_{l \le j \le i} x_j$ and $\sum_{l \le j \le i} x_j^2$ using two arrays SUM' and SQSUM' on [0,n], which are circular buffers. Here $\{x_l, \ldots, x_i\}$ are the observations of interest.
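One possible way to keep such circular buffers is sketched below. The class name `WindowPrefixSums` and its methods are hypothetical; the sketch stores running totals in n+1 slots and, as a simplification, lets those totals grow without bound.

```python
class WindowPrefixSums:
    """Circular buffers of running sums over the last n stream elements."""

    def __init__(self, n):
        self.n = n
        self.size = n + 1            # slots [0, n]: one extra for the empty prefix
        self.sums = [0.0] * self.size
        self.sqsums = [0.0] * self.size
        self.head = 0                # slot of the running total at the newest element
        self.count = 0               # number of elements currently in the window

    def add(self, x):
        nxt = (self.head + 1) % self.size
        self.sums[nxt] = self.sums[self.head] + x
        self.sqsums[nxt] = self.sqsums[self.head] + x * x
        self.head = nxt
        self.count = min(self.count + 1, self.n)

    def _slot(self, i):
        # i = 0 is the prefix just before the window, i = count the newest element.
        return (self.head - self.count + i) % self.size

    def range_sum(self, l, i):
        """Sum of x_l..x_i, with 1-based positions inside the current window."""
        return self.sums[self._slot(i)] - self.sums[self._slot(l - 1)]

    def range_sqsum(self, l, i):
        """Sum of squares of x_l..x_i, same indexing as range_sum."""
        return self.sqsums[self._slot(i)] - self.sqsums[self._slot(l - 1)]


w = WindowPrefixSums(3)
for x in [1, 2, 3, 4, 5]:
    w.add(x)
```

After the loop the window holds 3, 4, 5, and any range sum or range sum of squares inside it is a difference of two buffer slots.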


Fixed Window Algorithm (cont.)

FixedWindowHistogram()
    Compute SUM' and SQSUM'
    Assume 1 to be the first point in the circular buffer
    For k = 1 to B-1 {
        Initialize the k-th queue to empty
        CreateList[1,n,k]   // time complexity: O((1/δ)² log³ n), δ = ε/B
        // creates intervals of [1...n] using k buckets
        // each interval [a,b] satisfies OPT[b,k] ≤ (1+δ) OPT[a,k], with b maximized
    }
    Let $b_{l_1}, b_{l_2}, \ldots$ be the end points in Queue B-1
    $OPT[n,B] = \min_i \{ OPT[b_{l_i}, B-1] + SSE[b_{l_i}+1, n] \}$

Time complexity: $O((B^3/\varepsilon^2) \log^3 n)$. Space complexity: $O(n)$.


Fixed Window Algorithm (cont.)

Example: data sequence {0,0,0,1,1,1,1,1}, δ = 1, B = 2.

SUM' = SQSUM' = {0,0,0,1,2,3,4,5}

CreateList[1,8,1] (a=1, b=8, k=1), running steps:

a=1, OPT[1,1] = 0. Find the index c such that $OPT[c,1] \le 0 = (1+\delta)\,OPT[1,1]$ and c is maximized: c=3, so Queue1 = {3}.

Call CreateList[4,8,1], i.e., CreateList(c+1,b,k). OPT[4,1] = 0.75. Find the index c such that $OPT[c,1] \le 1.5 = (1+\delta)\,OPT[4,1]$ and c is maximized: c=6, so Queue1 = {3,6}.

Call CreateList[7,8,1] to get Queue1 = {3,6,8}.


Fixed Window Algorithm (cont.)

OPT[n=8,B=2] is the minimum of the following 3 values:

OPT[3,1] + SSE[4,8] = 0 + 0 = 0
OPT[6,1] + SSE[7,8] = 1.5 + 0 = 1.5
OPT[8,1] = 15/8

The minimum is 0, so the best partition is {(1,3),(4,8)}.
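The worked example can be reproduced with a short sketch. The function `create_list` below is a hypothetical, iterative rendering of CreateList for a single level: it takes the error OPT[c,k] as a callable that is assumed to be non-decreasing in c, uses a binary search to find each maximal end point, and interprets the example's precision as a per-interval factor of 1.

```python
from itertools import accumulate


def create_list(a, b, opt, delta):
    """Return the end points of intervals covering [a, b].

    opt   -- callable c -> OPT[c, k], assumed non-decreasing in c
    delta -- per-interval precision
    """
    endpoints = []
    while a <= b:
        limit = (1 + delta) * opt(a)
        lo, hi = a, b
        while lo < hi:  # binary search for the largest c with opt(c) <= limit
            mid = (lo + hi + 1) // 2
            if opt(mid) <= limit:
                lo = mid
            else:
                hi = mid - 1
        endpoints.append(lo)
        a = lo + 1
    return endpoints


values = [0, 0, 0, 1, 1, 1, 1, 1]
ps = [0] + list(accumulate(values))
pq = [0] + list(accumulate(v * v for v in values))


def sse(a, b):
    s = ps[b] - ps[a - 1]
    return pq[b] - pq[a - 1] - s * s / (b - a + 1)


queue1 = create_list(1, 8, lambda c: sse(1, c), delta=1)
# Candidates for OPT[8, 2]: one per end point, as in the slide.
candidates = [sse(1, c) + (sse(c + 1, 8) if c < 8 else 0.0) for c in queue1]
best = min(candidates)
```

Here `queue1` matches Queue1 from the example, and the three candidate values correspond to the three sums listed above.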


Experimental Evaluation

Tests: the construction performance, and the accuracy of the fixed-window algorithm when evaluating range-sum queries.

Measures: construction performance is measured by time; accuracy is measured by average results.

Data: real data sets extracted from AT&T data warehouses.
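To make the accuracy test concrete, the sketch below shows one way a range-sum query could be answered from a histogram that stores a mean per bucket. The function name `range_sum_estimate` and the (start, end, mean) bucket layout are assumptions for illustration; the experimental setup of [GK02] may differ.

```python
def range_sum_estimate(buckets, lo, hi):
    """Estimate the sum of v_lo..v_hi from (start, end, mean) buckets.

    All positions are 1-based and inclusive.
    """
    total = 0.0
    for start, end, mean in buckets:
        overlap = min(end, hi) - max(start, lo) + 1
        if overlap > 0:
            # Every position in the bucket is approximated by the bucket mean.
            total += overlap * mean
    return total


values = [0, 0, 0, 1, 1, 1, 1, 1]
buckets = [(1, 3, 0.0), (4, 8, 1.0)]   # the partition {(1,3),(4,8)} with bucket means
estimate = range_sum_estimate(buckets, 2, 5)
exact = sum(values[1:5])               # positions 2..5
```

Comparing `estimate` with `exact` over many random ranges gives an accuracy measure of the kind reported in the experiments.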


Accuracy test for various ε and B

Conclusion:
For the fixed-window histogram, accuracy improves as ε decreases and B increases.
The fixed-window histogram outperforms the wavelet-based histogram.

[Figure: accuracy plot with series Exact, Histogram, and Wavelets]


Construction time for various ε and B

Conclusion:
The wavelet-based method is much worse than the fixed-window histogram (so it is not shown here).
Construction time grows as B increases or ε decreases.


Conclusion

Background knowledge on data streams.

Three algorithms used to construct optimal (ε-approximate) histograms in different scenarios.

Other related work: new operators over a data stream, operations over multiple data streams, sketch techniques, query optimization, etc.


Reference 1

[GK02] Sudipto Guha and Nick Koudas. Approximating a data stream for querying and estimation: algorithms and performance evaluation. In ICDE’02.

[GKS01] Sudipto Guha, Nick Koudas and Kyuseok Shim. Data-Streams and Histograms. In STOC’01, pages 471-475.

[IP95] Yannis E. Ioannidis and Viswanath Poosala. Balancing Histogram Optimality and Practicality for Query Result Size Estimation. In SIGMOD’95, pages 233-244.

[JKM+98] H. V. Jagadish, Nick Koudas, S. Muthukrishnan, Viswanath Poosala, Ken Sevcik and Torsten Suel. Optimal Histograms with Quality Guarantees. In VLDB’98, pages 275-286.


Reference 2

[BBD+02] B. Babcock, S. Babu, M. Datar, R. Motwani, J. Widom. Models and Issues in Data Stream Systems. In PODS’02, pages 1-16.

[DGG+02] A. Dobra, M. Garofalakis, J. Gehrke and R. Rastogi. Processing complex aggregate queries over data streams. In SIGMOD’02, pages 61-72.

[Gib01] P. B. Gibbons. Distinct Sampling for highly-accurate answers to distinct values queries and event reports. In VLDB’01, pages 541-550.

[GK01] M. Greenwald, S. Khanna. Space-efficient online computation of quantile summaries. In SIGMOD’01, pages 58-66.

[MM02] G. S. Manku, R. Motwani. Approximate frequency counts over data streams. In VLDB’02, pages 346-357.

[KNV03] Jaewoo Kang, J. F. Naughton and Stratis D. Viglas. Evaluating window joins over unbounded streams. In ICDE’03.


Thank you!