Presented By: Ashu Raj 09/21/2010
CSE 6339 DATA EXPLORATION AND ANALYSIS IN RELATIONAL DATABASES FALL 2010
Problem Statement
Application Examples
Bayesian Method: the Learning Phase, the Characterization Phase, the Inference Phase
Experiments and Results
Conclusion
A ubiquitous problem in data management:
Given a finite set of real values, can we take a sample from the set and use the sample to predict the kth largest value in the entire set?
Min/max online aggregation
Top-k query processing
Outlier detection
Probabilistic query optimization
Propose a natural estimator.
Characterize the estimator's error distribution (Bayesian):
Learn a prior model from the past query workload.
Update the prior model using a sample.
Sample an error distribution from the posterior model.
With the estimator and its error distribution, we can confidence-bound the kth largest value.
Data set size N, sample size n.
The estimator is the (k')th largest value in the sample, chosen so that k'/n ≈ k/N.
So k' = ⌈(n/N) · k⌉.
How accurate is this method?
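This natural estimator can be sketched in a few lines of Python (a minimal illustration; the function name and the uniform-random sampling are assumptions, not the paper's implementation):

```python
import math
import random

def estimate_kth_largest(data, k, n, seed=0):
    """Estimate the kth largest of the N values in `data` from a
    uniform random sample of size n: report the (k')th largest value
    of the sample, where k' = ceil(n * k / N)."""
    N = len(data)
    k_prime = math.ceil(n * k / N)
    rng = random.Random(seed)
    sample = rng.sample(data, n)
    return sorted(sample, reverse=True)[k_prime - 1]
```

For example, with N = 10,000 and n = 100, looking for the maximum (k = 1) gives k' = 1: the estimate is simply the sample maximum.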
How do we determine the estimator's error?
Study the relationship between the estimator and the true answer:
take the ratio of the kth and (k')th largest values and find the distribution of this ratio.
With no prior knowledge of D, it is impossible to predict the ratio by looking at the sample alone.
With domain knowledge and a sample, we can guess the behavior of D.
What domain knowledge should be modeled to help solve this problem?
Setup: four data sets with different histogram shapes; each has 10,000 values; we are looking for the largest value.
Experiment: take a 100-element sample and record the obtained ratio kth/(k')th. Repeat this 500 times.
The results demonstrate the importance of the query's shape.
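The experiment can be reproduced with a short simulation (a sketch under stated assumptions: `ratio_experiment` and the synthetic data set in the test are illustrative, not the paper's four data sets):

```python
import math
import random

def ratio_experiment(data, k=1, n=100, trials=500, seed=0):
    """Record the ratio (true kth largest) / (sample-based estimate)
    over repeated sampling experiments."""
    N = len(data)
    k_prime = math.ceil(n * k / N)
    true_kth = sorted(data, reverse=True)[k - 1]
    rng = random.Random(seed)
    ratios = []
    for _ in range(trials):
        sample = rng.sample(data, n)
        estimate = sorted(sample, reverse=True)[k_prime - 1]
        ratios.append(true_kth / estimate)
    return ratios
```

The spread of the 500 recorded ratios depends strongly on the histogram shape of `data`, which is exactly the point of the experiment.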
Proposed Bayesian Inference Framework
The Learning Phase
The Characterization Phase
The Inference Phase
The Generative Model: assume the existence of a set of possible shape patterns.
Each shape has a weight specifying how likely it is to match a new data set's histogram shape.
First, a biased die is rolled to determine which shape pattern will generate the query result set (in the previous figure, suppose we select shape 3).
Next, an arbitrary scale for the query is randomly generated.
Finally, a parametric distribution f(x | shape, scale) is instantiated; this distribution is repeatedly sampled to generate the new data set.
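These three steps can be sketched as follows (a hedged illustration: the shape values, weights, and the uniform scale range in the test are assumptions; the Gamma family anticipates the parametric choice made later in the talk):

```python
import random

def generate_dataset(shapes, weights, N, seed=None):
    """Generative model sketch: (1) roll a biased die to pick a shape
    pattern, (2) draw an arbitrary scale, (3) repeatedly sample from the
    parametric distribution f(x | shape, scale) to produce the data set."""
    rng = random.Random(seed)
    alpha = rng.choices(shapes, weights=weights, k=1)[0]  # biased die roll
    scale = rng.uniform(1.0, 100.0)  # arbitrary scale (illustrative range)
    return [rng.gammavariate(alpha, scale) for _ in range(N)]
```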
Next, formalize and learn the model from the domain data (workload).
Probability density function (PDF): the Gamma distribution can produce data with arbitrary right-leaning skew.
Deriving the PDF: the Gamma distribution's PDF is:
where α > 0 is known as the shape parameter and β > 0 as the inverse scale parameter.
Since scale does not matter, we treat β as an unknown random variable.
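For reference, the standard Gamma pdf with shape α and inverse scale (rate) β is:

```latex
f(x \mid \alpha, \beta) = \frac{\beta^{\alpha}}{\Gamma(\alpha)}\, x^{\alpha - 1} e^{-\beta x},
\qquad x > 0,\ \alpha > 0,\ \beta > 0.
```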
Deriving the likelihood model: the resulting likelihood of a given data set D has the form L(D | α).
This model assumes a set of c weighted shapes, so the complete likelihood of observing D is:
where the wj are non-negative weights and ∑(j=1..c) wj = 1.
The complete set of model parameters is Φ = {Φ1, Φ2, …, Φc}, where Φj = {wj, αj}.
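Written out, the weighted-shape mixture described above presumably takes the form:

```latex
L(D \mid \Phi) = \sum_{j=1}^{c} w_j \, L(D \mid \alpha_j),
\qquad w_j \ge 0, \quad \sum_{j=1}^{c} w_j = 1.
```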
Learning the parameters: Φ is unknown and must be learned from the historical workload.
Given a set of independent domain data sets D = {D1, …, Dr}, the likelihood of observing them is:
We use the EM algorithm to learn the most likely Φ, i.e., the Φ that maximizes this likelihood.
At this point, we have learned a prior shape model.
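Since the domain data sets are independent, the likelihood factorizes, and EM searches for the maximizing parameters:

```latex
L(\mathcal{D} \mid \Phi) = \prod_{i=1}^{r} L(D_i \mid \Phi),
\qquad \Phi^{*} = \arg\max_{\Phi} L(\mathcal{D} \mid \Phi).
```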
Now we apply the EM algorithm to obtain this prior model. Next, we take a sample from the data set.
Use the sample to update the prior weight of each shape pattern.
EM Algorithm:
Let S be our sample. Applying Bayes' rule, the posterior weight of shape pattern j is:
The resulting posterior shape model:
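Spelled out, Bayes' rule gives the posterior weight of shape pattern j as (a reconstruction consistent with the surrounding text):

```latex
w_j^{\text{post}} = \frac{w_j \, L(S \mid \alpha_j)}{\sum_{l=1}^{c} w_l \, L(S \mid \alpha_l)},
```

and the posterior shape model is the same mixture with each prior weight wj replaced by wj^post.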
Derive the error distribution associated with each shape. Each shape characterizes an error distribution of kth/(k')th.
To find the error distribution for a shape α, pick a scale β, then:
1. The query result is produced by drawing a sample of size N from the Gamma distribution f(x | α, β); the kth largest value in this sample is f(k).
2. To estimate f(k), a subsample of size n is drawn from the sample obtained in step 1; the (k')th largest value in the subsample is the estimator f(k)'.
Monte Carlo Sampling
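The two-step procedure above can be sketched as a naive Monte Carlo loop (an illustration, not the paper's optimized method; the scale is fixed at 1 because the ratio is scale-invariant):

```python
import math
import random

def monte_carlo_error(alpha, N, n, k, trials, seed=0):
    """Sample the error ratio f(k) / f(k)' for a Gamma shape alpha:
    draw N values from Gamma(alpha, scale=1) and take the kth largest
    as f(k); subsample n of them and take the (k')th largest as f(k)'."""
    rng = random.Random(seed)
    k_prime = math.ceil(n * k / N)
    ratios = []
    for _ in range(trials):
        data = [rng.gammavariate(alpha, 1.0) for _ in range(N)]
        f_k = sorted(data, reverse=True)[k - 1]
        sub = rng.sample(data, n)
        f_k_est = sorted(sub, reverse=True)[k_prime - 1]
        ratios.append(f_k / f_k_est)
    return ratios
```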
The TKD (top-k dependent) method efficiently produces f(k)' given f(k).
It first determines whether or not the subsample includes f(k) by means of a Bernoulli trial.
Depending on the result, the TKD method then determines, in a randomized fashion, the composition of the k' largest values in the subsample with the help of the hypergeometric distribution, and returns the (k')th largest.
The input parameters are the same as for the Monte Carlo method, with the addition of the sampled f(k).
This process assumes that we have an efficient method to sample f(k).
Each shape characterizes an error distribution of kth/(k')th.
To get the posterior error distribution, attach each shape's posterior weight to its error distribution.
The final mixture error distribution:
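Consistent with the description above, the final error distribution is presumably the posterior-weighted mixture of the per-shape error distributions:

```latex
p(e) = \sum_{j=1}^{c} w_j^{\text{post}} \, p_j(e),
\qquad e = \frac{k\text{th largest}}{(k')\text{th largest}}.
```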
Given the distribution of kth/(k')th, we can confidence-bound the answer:
Choose a pair of bounds (LB, UB) such that p% of the probability is covered.
Then kth is bounded by [(k')th × LB, (k')th × UB] with p% probability.
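A minimal sketch of this bounding step, assuming the error distribution is available as a list of sampled ratios (the symmetric-quantile choice of (LB, UB) is one simple option, not necessarily the paper's):

```python
def confidence_bound(ratios, estimator, p=0.95):
    """Bound the true kth largest with probability p: take (LB, UB) as
    symmetric quantiles of the sampled ratio distribution, then scale
    the estimator (the (k')th largest in the sample) by them."""
    srt = sorted(ratios)
    m = len(srt)
    lo_idx = int(((1 - p) / 2) * m)
    hi_idx = min(m - 1, int((1 - (1 - p) / 2) * m))
    return estimator * srt[lo_idx], estimator * srt[hi_idx]
```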
Learn a prior shape model from historical queries:
Devised a closed-form model, a variant of the Gamma mixture model.
Employed an EM algorithm to learn the model from historical data.
Update the prior shape model with a sample:
Applied Bayes' rule to update each shape pattern's weight.
Produce an error distribution from the posterior model:
Attached the posterior weight to each shape's error distribution.
With our estimator and its error distribution, we can bound the answer.
Distance-Based Outlier Detection
It improves the performance of the state-of-the-art algorithm by an average factor of 4 over seven large data sets.
Defined the problem of estimating the kth largest value in a real data set.
Proposed an estimator.
Characterized the ratio error distribution with a Bayesian framework.
Successfully applied the proposed method to research problems.
Mingxi Wu, Chris Jermaine. A Bayesian Method for Guessing the Extreme Values in a Data Set. VLDB 2007.
http://www.cise.ufl.edu/~mwu/research/extremeTalk.pdf