Fast Algorithms For Hierarchical Range Histogram Constructions Authors Sudipto Guha, Nick Koudas,...

Post on 14-Jan-2016


Fast Algorithms For Hierarchical Range Histogram Constructions

Authors: Sudipto Guha, Nick Koudas, Divesh Srivastava. ACM PODS 2002

Layout
• Introduction
• Related Works
• Problem Definition
• Problem Solution
  – A Sparse Interval Set System
  – The Dynamic Programming algorithm
• Experimental Evaluation
• Conclusions

Introduction
• Data Warehousing and OLAP applications
  – OLAP – Online analytical processing
• Data has multiple logical dimensions with natural hierarchies defined on it
• OLAP queries
  – usually involve hierarchical selections on some of the dimensions
  – often aggregate measure attributes

Introduction – Cont.

Histograms
• Numeric attribute value domain
• Space-efficient
• Conditions on a given dimension - hierarchical ranges
• Range estimation depends on a good solution to the histogram construction problem

The Main Idea
• Proposes fast, practical algorithms for the problem of constructing hierarchical range histograms

The Main Contributions
• A novel notion of sparse intervals
• A proposed algorithm that effectively trades space for construction time without compromising accuracy
• First practical approach to the problem

Previous Works
• V-Optimal histograms
  – Minimize error for equality queries
  – But… constructed by taking only equality queries into account
• Koudas et al. - a polynomial-time algorithm
  – For special and general cases
  – But… the polynomial has a high degree
• Gilbert et al. – pseudo-polynomial time, optimal for arbitrary ranges
  – But… the polynomial has a high degree

Problem Definition
• An array A[1,n] of non-negative real numbers
• The average of items A[a],…,A[b]: $A[a,b] = \frac{A[a] + \dots + A[b]}{b - a + 1}$

Histogram Definition
• A histogram of array A[1,n] using B buckets is specified by B+1 integers $b_1, \dots, b_{B+1}$ with $0 = b_1 < b_2 < \dots < b_{B+1} = n$
• Each interval $[b_i + 1, b_{i+1}]$ is a bucket
• Each $b_i$ is a bucket boundary

Histogram Definition – Cont.
• Stored as
  – a series of bucket boundaries
  – the average of the array values in each bucket, $A[b_i + 1, b_{i+1}]$
  – bucket sum can be obtained

Histograms – Cont.
• Mostly support equality queries
  – “give me A[i]”
• Hierarchical range queries

Hierarchical Range Queries Definition
• A range query $R_{ij}$ asks for the sum $s_{ij} = A[i] + \dots + A[j]$
• A set S of range queries is hierarchical if for any two queries $R_{ij}$ and $R_{kl}$ in S, the ranges [i,j] and [k,l] are
  – disjoint
  – or contained one in the other
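The definition above can be checked directly. The following is a minimal sketch, assuming each query is given as a closed pair `(i, j)`; the function name `is_hierarchical` is illustrative and does not come from the slides.

```python
from itertools import combinations


def is_hierarchical(ranges):
    """Return True if every pair of closed ranges (i, j) is disjoint or nested."""
    for (i, j), (k, l) in combinations(ranges, 2):
        disjoint = j < k or l < i
        nested = (i <= k and l <= j) or (k <= i and j <= l)
        if not (disjoint or nested):
            return False
    return True
```

For instance, `[(1, 8), (1, 4), (5, 8), (2, 3)]` passes the check, while `[(1, 4), (3, 6)]` fails because the two ranges overlap without one containing the other.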

Hierarchical Range Queries – Cont.
• Generalize equality queries: $R_{ii} = A[i]$
• Can be displayed as a tree
  – Each node u has an associated range $r_u$
  – Node v is a child of node u iff $r_v \subset r_u$ and there is no w such that $r_v \subset r_w \subset r_u$
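The parent rule for the tree can be sketched as follows, assuming the ranges are distinct closed pairs `(i, j)` that already form a hierarchical set. The name `build_hierarchy` is hypothetical. The parent of a range is taken to be the smallest other range that contains it, which is the same as saying no range w sits strictly between them.

```python
def build_hierarchy(ranges):
    """Map each range to its parent range (None for a root).

    Assumes `ranges` is hierarchical and has no duplicates, so the ranges
    containing a given range form a nested chain with a unique smallest one.
    """
    parent = {}
    for r in ranges:
        containers = [
            q for q in ranges
            if q != r and q[0] <= r[0] and r[1] <= q[1]
        ]
        # The tightest container is the direct parent in the tree.
        parent[r] = min(containers, key=lambda q: q[1] - q[0], default=None)
    return parent
```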

Workload Definition
• A workload W consists of
  – A set S of hierarchical range queries
  – A probability $p_{ij}$ for each query $R_{ij}$ in S; this probability can be obtained by monitoring and logging
• Simple probabilities model

How The Histogram Works
1. A histogram H of array A[1,n]
2. Query $R_{ij}$
3. An expected answer $s_{ij} = A[i] + \dots + A[j]$
4. Left bucket $[b_l + 1, b_{l+1}]$ such that $b_l + 1 \le i \le b_{l+1}$
5. Right bucket $[b_r + 1, b_{r+1}]$ such that $b_r + 1 \le j \le b_{r+1}$
6. Calculate the precise total of the values in the buckets between the left and right buckets
7. Estimate the sums for the portions within the left and right buckets

How The Histogram Works – Cont.
8. The sum of A in the interval $[i,j] \cap [b_l + 1, b_{l+1}]$ is estimated by $\left|[i,j] \cap [b_l + 1, b_{l+1}]\right| \cdot A[b_l + 1, b_{l+1}]$
  – Uniformity assumption
9. The right bucket likewise

The Total Estimate
• The total estimate $\hat{s}_{ij}$
• $\hat{s}_{ij}$ = left bucket estimation + right bucket estimation + exact sum for buckets in between
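The estimation steps can be sketched in a few lines. This is an illustrative sketch, not the authors' code: `build_histogram` and `estimate` are hypothetical names, the array is passed as a 0-based Python list holding A[1..n], and the boundaries are assumed to satisfy 0 = b_1 < ... < b_{B+1} = n. A bucket that lies entirely inside [i, j] contributes width times average, which equals its exact sum, so one loop covers the left bucket, the right bucket and the buckets in between.

```python
from itertools import accumulate


def build_histogram(A, boundaries):
    """Store bucket boundaries and the average of A inside each bucket."""
    prefix = [0] + list(accumulate(A))  # prefix[t] = A[1] + ... + A[t]
    averages = [
        (prefix[hi] - prefix[lo]) / (hi - lo)
        for lo, hi in zip(boundaries, boundaries[1:])
    ]
    return boundaries, averages


def estimate(hist, i, j):
    """Estimate s_ij = A[i] + ... + A[j] (1-based, inclusive)."""
    boundaries, averages = hist
    total = 0.0
    for lo, hi, avg in zip(boundaries, boundaries[1:], averages):
        # Bucket covers positions lo+1 .. hi; count those inside [i, j].
        overlap = min(hi, j) - max(lo + 1, i) + 1
        if overlap > 0:
            total += overlap * avg  # uniformity assumption inside the bucket
    return total
```

With `A = [1, 3, 5, 7, 2, 2]` and boundaries `[0, 2, 4, 6]`, a query aligned with the bucket boundaries is answered exactly, while a query that cuts through the first and last bucket is only approximated.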

Determining the average
• Construct a prefix sum array $\sum_{j=1}^{b_i} A[j]$ for all $b_i$
• Given $b_i$ and $b_{i+1}$, return the average in constant time

Optimal histogram definition
• The error of the range query $R_{ij}$ is $e_{ij} = (s_{ij} - \hat{s}_{ij})^2$
• Given a histogram H and workload W, the total expected error for estimating W is $\sum_{R_{ij}} p_{ij} e_{ij}$ over all queries in W
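The error measure translates directly into code. This sketch assumes the workload is a list of `((i, j), p_ij)` pairs and that the estimate comes from some caller-supplied function `estimator(i, j)`. Both the name `total_expected_error` and this interface are illustrative assumptions.

```python
def total_expected_error(A, workload, estimator):
    """Sum of p_ij * (s_ij - s_hat_ij)**2 over all queries in the workload.

    A is a 0-based list holding A[1..n]; queries use 1-based inclusive ranges.
    """
    total = 0.0
    for (i, j), p in workload:
        exact = sum(A[i - 1:j])  # s_ij
        total += p * (exact - estimator(i, j)) ** 2
    return total
```

As a toy usage, a one-bucket histogram over `A = [1, 3, 5, 7]` estimates every range as 4 times its length, so the two half-range queries each miss by 4.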

Optimal Histogram Definition – Cont.

• Given W, an optimal histogram with B buckets of array A[1,n] is the histogram with at most B buckets that has the minimum total expected error for estimating workload W among all histograms with at most B buckets

Fast Histogram (FH) Construction for Hierarchical Range Queries
• Given an array A[1,n], B buckets and workload W
• E denotes the total expected error of the optimal histogram
• Find algorithms that construct hierarchical range (HR) histograms with an error at most E, trading space for construction time


FH construction
• Constructing a set of “sparse intervals”
  – Increases the number of buckets
  – Any arbitrary interval can be represented
• Dynamic programming algorithm

A Sparse Interval System
• Given an integer $r \ge 1$, set $l = n^{1/r}$
• Level 1 points: $0, 1, \dots, n$
• Level 2 points: $0, l, 2l, 3l, \dots$
• Level j+1 points: $0, l^j, 2l^j, 3l^j, \dots$
• Last r+1 level points: $0, n$

A Sparse Interval System – Cont.
• The interval [0,n] is in the sparse system S
• Any pair of level j points between adjacent level j+1 points defines an interval in S

A Sparse Interval System Example

n=8 ; r=3 ; l=2

Level 1 points: 0 1 2 3 4 5 6 7 8
Level 2 points: 0 2 4 6 8
Level 3 points: 0 4 8
Level 4 points: 0 8
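The example can be reproduced programmatically. The sketch below assumes that n = l^r exactly (as in n=8, r=3, l=2) and that "between level j+1 points" includes the two bounding points themselves. The function names `level_points` and `sparse_intervals` are illustrative.

```python
def level_points(n, l, level):
    """Points of the given level (1-based): multiples of l**(level-1) in [0, n]."""
    return list(range(0, n + 1, l ** (level - 1)))


def sparse_intervals(n, r, l):
    """Return the sparse system as a set of (a, b) pairs with a < b.

    For each level j, take every pair of level-j points that lie inside one
    block bounded by adjacent level-(j+1) points. [0, n] is always included.
    """
    S = {(0, n)}
    for j in range(1, r + 1):
        step, block = l ** (j - 1), l ** j
        for start in range(0, n, block):
            pts = range(start, start + block + 1, step)
            S.update((a, b) for a in pts for b in pts if a < b)
    return S
```

For n=8, r=3, l=2 the system holds 15 intervals out of the 36 possible ones; for example (0, 4) belongs to it while (1, 3) does not.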

Sparse Interval System Properties
• Any interval over [0,n] can be written as a disjoint union of at most 2r intervals in the sparse system

Claim
• Any interval [0,x] can be expressed as a partition of at most r intervals from the sparse system

Claim Proof
• By induction
• Induction hypothesis: any interval $[0, x]$ where $x \le l^j$ can be expressed as j intervals
• Base case: true for j=1

Claim Proof – Cont.
• Step for j+1
• Consider $l^j < x \le l^{j+1}$
• We can write the interval as $[0, t l^j]$ and $[t l^j, x]$ where t is maximal $(t \le l)$
• $[0, t l^j]$ is a valid interval in the sparse system (in level j+1 - 0 and $l^{j+1}$ are adjacent)

Claim Proof – Cont.
• $[t l^j, x]$ is essentially similar to $[0, x - t l^j]$
• $x - t l^j \le l^j$ since t is maximal. Therefore by induction it can be expressed by j intervals
• Total j+1
• Since $x \le l^r$, any interval $[0, x]$ can be expressed as a union of r intervals

Observation
• Any interval $[a, b]$ can be expressed as $2r$ intervals
• By cutting it at a point of the form $a \le \alpha l^j \le b$ with maximum j
• By symmetry $[a, \alpha l^j]$ and $[\alpha l^j, b]$ can each be expressed as a disjoint union
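The cut-and-walk argument can be sketched as a greedy decomposition. This is an illustration under the assumption n = l^r; the names `in_sparse_system` and `decompose` are hypothetical. The code cuts [a, b] at the coarsest available point, then walks right and left from the cut, using the largest step that still fits at each level.

```python
def in_sparse_system(a, b, n, r, l):
    """True if [a, b] is [0, n] itself, or if both endpoints are multiples of
    l**k that lie inside one block of width l**(k+1), for some k < r."""
    if (a, b) == (0, n):
        return True
    for k in range(r):
        step, block = l ** k, l ** (k + 1)
        if a % step == 0 and b % step == 0 and b <= (a // block) * block + block:
            return True
    return False


def decompose(a, b, r, l):
    """Split [a, b] (0 <= a < b <= l**r) into consecutive sparse intervals."""
    # Find the cut point c = alpha * l**j inside [a, b] with the largest j.
    for j in range(r, -1, -1):
        step = l ** j
        c = -(-a // step) * step  # smallest multiple of l**j that is >= a
        if c <= b:
            break
    pieces = []
    pos = c  # walk right from the cut, coarse steps first
    for k in range(j, -1, -1):
        step = l ** k
        t = (b - pos) // step
        if t > 0:
            pieces.append((pos, pos + t * step))
            pos += t * step
    pos = c  # walk left from the cut, coarse steps first
    for k in range(j, -1, -1):
        step = l ** k
        t = (pos - a) // step
        if t > 0:
            pieces.append((pos - t * step, pos))
            pos -= t * step
    return sorted(pieces)
```

For n=8, r=3, l=2 the interval [1, 7] is cut at 4 and comes out as four pieces, within the bound of 2r = 6.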

Lemma
• In a sparse set system with parameter r, the number of intervals containing a point is at most $O(r\,n^{2/r})$

Lemma Proof
• Consider the level 1 intervals
• There are at most $O(l^2)$ such intervals that contain a specific point
  – There are l points between adjacent points of level 2
  – l points can create at most $O(l^2)$ intervals
• Level j intervals behave on level j points the same as level 1 intervals on the original points
• Extend to r levels… (the r+1’th level adds one more interval)
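Written out, the counting argument adds one $O(l^2)$ term per level. The worked equation below is a sketch that assumes $l = n^{1/r}$, which is what the example n=8, r=3, l=2 suggests; $C(S)$ denotes the maximum number of intervals of the system that contain a single point.

```latex
\begin{align*}
C(S) &\le \underbrace{O(l^2) + \dots + O(l^2)}_{r \text{ levels}} + 1 \\
     &= O\!\left(r\,l^2\right)
      = O\!\left(r\,n^{2/r}\right)
      \qquad \text{since } l = n^{1/r}.
\end{align*}
```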


Hierarchy Representation By a Tree
• Ranges define a hierarchy based on the inclusion relationship
• T is a hierarchy representation by a tree
  – Each node v of T is associated with a range $R_{ij} = [v_L, v_R]$
  – The weight is $w_v = p_{ij}$
  – The error is $e_v = e_{ij}$

Representation By a Binary Tree
• We allow $w_u = 0$
• If a node had children $u_1, \dots, u_t$, transform it into a node with two children
  – $u_1$
  – a new node with weight 0 and children $u_2, \dots, u_t$
• The size of a tree increases only by factor 2
• So assume that the tree is binary
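The transformation can be sketched as a short recursive rewrite. The `Node` class and the function `binarize` are illustrative names. The sketch assumes children are stored in left-to-right order and, as a further assumption not stated on the slide, gives each new zero-weight node the range spanned by the children it adopts.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    rng: tuple      # (v_L, v_R)
    weight: float   # w_v, the query probability (0 for helper nodes)
    children: list = field(default_factory=list)


def binarize(node):
    """Rewrite the subtree in place so that no node has more than two children."""
    if len(node.children) > 2:
        first, rest = node.children[0], node.children[1:]
        # Helper node with weight 0 that adopts u_2, ..., u_t.
        helper = Node((rest[0].rng[0], rest[-1].rng[1]), 0.0, rest)
        node.children = [first, helper]
    for child in node.children:
        binarize(child)  # the helper is split again if it still has > 2 children
    return node
```

A node with t children gains t-2 helper nodes, so the tree at most doubles in size, in line with the factor 2 stated above.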

Dynamic Programming Algorithm - FH
• Best(v,left,right,p) denotes the smallest error of the range $[v_L, v_R]$
• v – tree node associated with $[v_L, v_R]$
• left – overlapping interval on the left
• right – overlapping interval on the right
• $[v_L, v_R]$ contains p intervals completely
• Formally, left contains $v_L$ and right contains $v_R$

FH stages
• Let the children of v be y and z with ranges $[y_L, y_R]$ and $[z_L, z_R]$
• Cases (a) + (b)

Cases (a)+(b)
• For all possible intervals I that contain $y_R$ and $z_L$, compute

$$cost(I) = \min_{k_1} \Bigl( w_y e_y + w_z e_z + Best(y, left, I, p - k_1 - 1) + Best(z, I, right, k_1) \Bigr)$$

• In the case that I finishes on $y_R$

$$cost(I) = \min_{k_1} \Bigl( w_y e_y + w_z e_z + Best(y, left, I, p - k_1) + Best(z, \emptyset, right, k_1) \Bigr)$$

Cases (a)+(b)

FH stages – Cont.
• Return $\min_I Cost(I)$
• When interval I is fixed, $e_y$ and $e_z$ are automatically defined and can be computed in O(1) time.

Time complexity
• Time spent evaluating cost(I) is O(p)
• The running time depends on the number of choices of interval I
• Let C(S) be the maximum number of intervals in an interval system S that contain a particular element ($y_R$)
• If all intervals are allowed then $C(S) = O(n^2)$

Time complexity – Cont.
• The running time of the algorithm FH for a single entry is $O(p\,C(S))$
• The number of entries for each tree node v is $O(B\,C(S)^2)$
  – Since there are C(S)+1 choices of left (all intervals that contain $v_L$, and $\emptyset$). Similarly for right
• Work for every tree node: $O(p\,C(S) \cdot B\,C(S)^2) = O(B^2 C(S)^3)$

Time complexity – Cont.
• Total work including preprocessing is $O(n + T B^2 C(S)^3)$
• When S is the set of all possible intervals: $O(n^6 B^2 T)$
• The result matches the time complexity of the previous algorithm (for arbitrary intervals)

Time Complexity For a Sparse Set System
• S – a sparse set system with parameter r
• Run FH with 2r(B-1) buckets
• Error - less than or equal to that of the original B bucket histogram
  – A histogram with B buckets can be expressed as a histogram with 2r(B-1) buckets in the sparse system

Time Complexity – Cont.
• $C(S) = O(r\,n^{2/r})$
• Set $r = 6/\epsilon$
  – In time $O(n + \epsilon^{-5} B^2 n^{\epsilon} T)$ we can construct a solution with $O(B/\epsilon)$ buckets whose error is at most the error of any solution of the original problem with B buckets
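The substitution behind this bound can be written out as a worked equation. It is a sketch under three assumptions taken from the surrounding slides: the total work is $O(n + T B'^2 C(S)^3)$ for a run with $B'$ buckets, the sparse system is run with $B' = 2r(B-1)$ buckets, and $C(S) = O(r\,n^{2/r})$.

```latex
\begin{align*}
T\,B'^2\,C(S)^3
  &= T\,\bigl(2r(B-1)\bigr)^2 \bigl(r\,n^{2/r}\bigr)^3
   = O\!\left(T\,r^5 B^2 n^{6/r}\right) \\
  &= O\!\left(T\,\epsilon^{-5} B^2 n^{\epsilon}\right)
   \qquad \text{for } r = 6/\epsilon,
\end{align*}
```

With the same choice of r, the bucket count $2r(B-1)$ becomes $O(B/\epsilon)$.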

Some Notes
• Get alternate tradeoffs by constructing a different sparse set system
  – Complete binary tree on [0,n]
  – Allow intervals such that one end point is an ancestor of the other
  – Any arbitrary interval can be expressed as a disjoint union of two intervals from the sparse set
  – C(S) = O(n)
  – Solution with 2B buckets in time $O(n^3 B^2 T)$

Experimental Evaluation
• FH was implemented with r=6
• Compared to an algorithm A0 presented by Gilbert et al.
  – Optimized for arbitrary range queries
  – For a data series of length n to be approximated with B buckets, constructs a histogram consisting of 2B buckets in time $O(n^2 B)$
  – The only known algorithm with reasonable complexity

Description of Data Sets

• A: A real data set of length 1000 extracted from an AT&T operational warehouse

• B: A synthetic data set of length 2000, distributed Zipf with skew parameter 0.5

• C: A synthetic data set of varying length, representing samples from a Gaussian distribution with mean and variance 250

Workload Description
• A normal distribution is used to assign the probabilities to a full hierarchy
• Then normalization to obtain a probability distribution
• W1 – generated by sampling N(10,10)
• W2 – generated by sampling N(10,50)
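One way such a workload could be generated is sketched below. The slides do not spell out the procedure, so several things here are assumptions: the second parameter of N(·,·) is read as a variance, negative samples are folded to positive with `abs`, and the function name `make_workload` and the fixed seed are illustrative.

```python
import random


def make_workload(queries, mean, variance, seed=0):
    """Assign each query a weight sampled from N(mean, variance), then
    normalize the weights so that they sum to 1."""
    rng = random.Random(seed)
    sigma = variance ** 0.5
    # abs() keeps the weights non-negative (an assumption of this sketch).
    raw = [abs(rng.gauss(mean, sigma)) for _ in queries]
    total = sum(raw)
    return {q: w / total for q, w in zip(queries, raw)}
```

Calling `make_workload(queries, 10, 10)` and `make_workload(queries, 10, 50)` would then stand in for W1 and W2 respectively.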

Performance Evaluation
• Accuracy and construction time
• Parameters
  – Total space allowed for histogram
  – Total size of the data set

Computing Accuracy
• Ask 1000 queries
• Report the total expected sum squared error of the workload execution on the histogram

Results for Data Set A

Results for Set A – Cont.
• The accuracy of FH is superior to A0
• FH is more accurate for smaller variance (W1)
• As the variance increases, the workload gets closer to uniform (the case A0 is optimized for)
• A0 construction time is linear in the space
• FH is better in construction time for the same range of space

Results for Data Set B

Results for Set B – Cont.
• Similar to A
• Accuracy improves much faster with space
  – since the distribution is Zipf
• The savings in construction time for FH are dramatic
  – since data set B is twice the size of A

Results for Data Set C

Results for Set C – Cont.
• Data set size increases (x axis) with total space fixed at 20
• A0 has a plateau
  – Due to the way the data is generated in the experiment (Gaussian tail)
• Quadratic trend in construction time for A0
• FH – near-linear increase in construction time

Conclusions
• The first practical approach to the problem of constructing hierarchical range histograms
• The dynamic programming algorithms effectively trade space for construction time without compromising histogram accuracy
• A novel notion of sparse intervals

Future plans
• A formal study of the dynamic properties of hierarchical range histograms
• How should one modify these histograms under data or workload modifications?

The END

Thanks for listening