Data-Streams and Histograms


Sudipto Guha, Nick Koudas & Kyuseok Shim

Page 2: Data-Streams and Histograms

Background

• Histogram
– Captures distribution statistics in an efficient manner
– Applications
  • Query optimization
  • Approximate query answering
  • Data mining (time series in particular)
  • Piecewise transmission of data

• EquiWidth, EquiDepth, MHIST, MaxDiff, V-OPT

Page 3: Data-Streams and Histograms

Background

• Data Stream
– An ordered sequence of points that can be read only once or a small number of times
– Applications
  • Mission critical network components
  • Dynamic traffic configuration, fault identification, troubleshooting
– Performance of an algorithm is measured by the number of passes it must make over the stream

Page 4: Data-Streams and Histograms

Motivation

• Since the end use of a histogram is to approximate a data distribution, why not use a near-optimal approximation of the best histogram if it means linear time computation?

Page 5: Data-Streams and Histograms

Motivation

• Approximate V-OPT histograms by improving the dynamic programming solution from quadratic to linear time

• Revised algorithm uses little space, hence suitable for data stream model

• Assumes cost of interval is monotonic under inclusion

Page 6: Data-Streams and Histograms

Problem Statement

• Given:
– non-negative integers $v_1, \ldots, v_n$
– k intervals or buckets to partition the index 1..n

• Objective:
– Minimize $\sum_{j=1}^{k} \mathrm{VAR}_j$, where $\mathrm{VAR}_j$ is the variance of values in the jth bucket

• Dynamic Programming solution:
– $\mathrm{OPT}[k, n] = \min_{x<n} \{\mathrm{OPT}[k-1, x] + \mathrm{VAR}[(x+1)..n]\}$
– Runs in $O(n^2 k)$ time with $O(n)$ space
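The recurrence above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the names `sse_factory` and `vopt_exact` are hypothetical, VAR[a..b] is assumed to be the sum of squared deviations from the bucket mean (variance times bucket size), and "k buckets" is read as "at most k buckets". Only two rows of the table are kept, which is how the O(n) space figure can be met when just the optimal cost is needed.

```python
from itertools import accumulate


def sse_factory(values):
    """Return sse(a, b) for 1-indexed, inclusive ranges of `values`.

    Assumption: the bucket cost VAR[a..b] is the sum of squared deviations
    from the bucket mean, computed from prefix sums in O(1).
    """
    prefix_sum = [0] + list(accumulate(values))
    prefix_sq = [0] + list(accumulate(v * v for v in values))

    def sse(a, b):
        if a > b:  # empty range costs nothing
            return 0.0
        total = prefix_sum[b] - prefix_sum[a - 1]
        squares = prefix_sq[b] - prefix_sq[a - 1]
        # Clamp tiny negative values caused by floating-point rounding.
        return max(0.0, squares - total * total / (b - a + 1))

    return sse


def vopt_exact(values, k):
    """Cost of the best histogram with at most k buckets, O(n^2 k) time."""
    n = len(values)
    sse = sse_factory(values)
    # prev[i] holds OPT[p-1, i]; start with p = 1 (a single bucket).
    prev = [sse(1, i) for i in range(n + 1)]
    for _ in range(2, k + 1):
        cur = [0.0] * (n + 1)
        for i in range(1, n + 1):
            # OPT[p, i] = min over x < i of OPT[p-1, x] + VAR[(x+1)..i]
            cur[i] = min(prev[x] + sse(x + 1, i) for x in range(i))
        prev = cur
    return prev[n]
```

For example, `vopt_exact([1, 1, 1, 5, 5, 5], 2)` splits the input into two constant runs and so has cost 0.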

Page 7: Data-Streams and Histograms

Intuition of Improvement

• For $a \le x \le b$:
– $\mathrm{VAR}[a..n] \ge \mathrm{VAR}[x..n] \ge \mathrm{VAR}[b..n]$ (1)
– $\mathrm{OPT}[a..n] \ge \mathrm{OPT}[x..n] \ge \mathrm{OPT}[b..n]$ (2)

• Use this monotonicity property to reduce the search space by settling for an approximation

• Instead of storing the whole OPT function, approximate it by a histogram!
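The monotonicity that this slide relies on can be checked numerically. The sketch below assumes the interval cost is the sum of squared deviations from the interval mean; that cost can only shrink when the interval shrinks, whereas a variance normalized by interval length does not have this property in general. The function name `suffix_costs` is hypothetical.

```python
def suffix_costs(values):
    """Cost of values[a..n] for every start a, in order of increasing a.

    Assumption: cost = sum of squared deviations from the interval mean.
    """
    costs = []
    for start in range(len(values)):
        segment = values[start:]
        mean = sum(segment) / len(segment)
        costs.append(sum((v - mean) ** 2 for v in segment))
    return costs


costs = suffix_costs([3, 1, 4, 1, 5, 9, 2, 6])
# Moving the left endpoint to the right never increases the cost.
is_monotone = all(costs[i] >= costs[i + 1] - 1e-9 for i in range(len(costs) - 1))
```

Because the cost moves in one direction as the endpoint moves, nearby endpoints have nearby costs, which is what makes it reasonable to keep only a few representative endpoints.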

Page 8: Data-Streams and Histograms

Intuition of Improvement

• For all $1 \le p \le k$, maintain intervals $(a_1, b_1), \ldots, (a_l, b_l)$

• The OPT value at $b_i$ is at most $(1+\epsilon)$ times the OPT value at $a_i$

• The number of intervals l depends on p

• The value for each interval substitutes for each value in the interval, reducing space and time complexity
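The interval idea can be sketched as a one-pass procedure. This is an illustrative reading of the slides, not the authors' implementation: the name `approx_vopt_stream`, the tuple layout, and the per-level slack `delta = eps / (2 * k)` are all assumptions (the slack is chosen so that errors compounding over k levels stay within roughly 1 + eps for eps at most 1). The bucket cost is again assumed to be the sum of squared deviations from the bucket mean, and "k buckets" is read as "at most k buckets".

```python
def approx_vopt_stream(stream, k, eps):
    """One-pass approximate cost of the best histogram with at most k buckets."""
    delta = eps / (2 * k)  # assumed per-level slack, not taken from the slides
    # levels[p] (1 <= p < k) is a list of intervals. Each interval is stored as
    # (cost_at_start, end_index, prefix_sum_at_end, prefix_sq_at_end, cost_at_end).
    levels = [[] for _ in range(k)]
    n = total = squares = 0
    best = 0.0
    for v in stream:
        n += 1
        total += v
        squares += v * v
        val = [0.0] * (k + 1)  # val[p] approximates OPT[p, n]
        val[1] = max(0.0, squares - total * total / n)
        for p in range(2, k + 1):
            best_p = val[p - 1]  # using fewer buckets is always allowed
            for _, end, s_end, q_end, cost_end in levels[p - 1]:
                seg_sum = total - s_end
                seg_cost = max(0.0, (squares - q_end) - seg_sum * seg_sum / (n - end))
                best_p = min(best_p, cost_end + seg_cost)
            val[p] = best_p
        # Extend the last interval while the cost stays within (1 + delta)
        # of the cost at its start; otherwise open a new interval at n.
        for p in range(1, k):
            intervals = levels[p]
            if intervals and val[p] <= (1 + delta) * intervals[-1][0]:
                intervals[-1] = (intervals[-1][0], n, total, squares, val[p])
            else:
                intervals.append((val[p], n, total, squares, val[p]))
        best = val[k]
    return best
```

Each stream value is read once and then discarded; only the running sums and the interval endpoints are kept. The returned value is the cost of an actual histogram, so it is never below the exact optimum. For instance, `approx_vopt_stream([1, 1, 1, 5, 5, 5], k=2, eps=0.1)` returns 0, matching the exact answer.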

Page 9: Data-Streams and Histograms

Results

• Theorem: A $(1+\epsilon)$ approximation for V-OPT runs in $O((k^2/\epsilon)\log n)$ space and $O((nk^2/\epsilon)\log n)$ time in the data stream model

Page 10: Data-Streams and Histograms

Advantages and Disadvantages

• Accuracy/runtime tradeoff can be controlled by the parameter $\epsilon$

• For data-stream model, alternatives abound:
– Random sampling (simple, assumption of distribution)
– Other histogram techniques (faster, less optimal)
– Wavelet (flexibility)
– Sliding Windows (later paper)

Page 11: Data-Streams and Histograms

Improvements

Page 12: Data-Streams and Histograms

Conclusion

• The authors provided an algorithm for approximating a distribution that runs reasonably fast and with small space requirements

• Proposed solution can be applied to data-stream model because values are not referred to unless they are stored