The History of Histograms - University of Cretehy460/pdf/histogramsvldb03.pdf · The History of...


The History of Histograms

Yannis Ioannidis
University of Athens, Hellas

Outline

• Prehistory
• Definitions and Framework
• The Early Past
• 10 Years Ago
• The Recent Past
• Industry
• Competitors
• The Future


Prehistory

• Word ‘histogram’ of Greek origin
  ◦ ‘histo-s’ = ‘mast’
  ◦ ‘gram-ma’ = ‘something written’
• Not used originally in the Greek language!
• Introduced by Karl Pearson in 1892 for a “common form of graphical representation”


Prehistory

• 1662: Concept exists at least since then in mortality tables of J. Graunt
• 1786: Bar charts introduced by W. Playfair to capture Scottish imports/exports
• 1833: Histograms introduced by A. M. Guerry as discrete approximations to distribution functions
• 1859: Florence Nightingale used them to compare mortality of soldiers and civilians


Prehistory

[Figure: Playfair’s bar chart]

Definitions: Data Distributions

• One-dimensional data distribution = set of (attribute value, frequency) pairs
• Large and non-uniform ⇒ need compression and approximation
• Concentrate on numeric attributes
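A minimal Python sketch of a one-dimensional data distribution, together with the spread and area quantities that the framework slides use. The function names and the sample column are illustrative and not from the talk. Giving the largest value a spread of 1 is an assumed convention.

```python
from collections import Counter

def data_distribution(column):
    """Return the distribution as (value, frequency) pairs sorted by value."""
    return sorted(Counter(column).items())

def spreads(dist):
    """Spread of a value = gap to the next larger value in the distribution.

    Assumption: the largest value, which has no successor, gets spread 1.
    """
    values = [v for v, _ in dist]
    return [nxt - cur for cur, nxt in zip(values, values[1:])] + [1]

def areas(dist):
    """Area of a value = spread x frequency."""
    return [s * f for s, (_, f) in zip(spreads(dist), dist)]

column = [10, 10, 16, 16, 16, 20]   # hypothetical attribute column
dist = data_distribution(column)
print(dist)           # (value, frequency) pairs
print(spreads(dist))  # one spread per distinct value
print(areas(dist))    # one area per distinct value
```

Even this tiny column shows why compression is needed: the distribution has one entry per distinct value, which for a real attribute can approach the size of the table.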

Definitions: Data Distributions

[Figure: Freq vs. Value plot illustrating spread and area]

Definitions: Data Distributions

• Combinations of multiple attribute values
• Joint frequency
• Multidimensional data distributions = set of (value combination, joint frequency) pairs

Definitions: Multidimensional Data Distributions

[Figure: two-dimensional data distribution over Value1 and Value2 with joint frequencies 5, 45, 17, 2]

Motivation

• Selectivity estimation
• Approximate query answering

within

• Query optimization
• Query profiling for user feedback
• Load balancing for parallel join execution
• Partition-based temporal join execution


Definitions: Histograms

• Partition data distribution into β disjoint buckets
• Approximate values (value combinations) and frequencies within each bucket

Definitions: Histograms

[Figures: Freq vs. Value plot of a data distribution, then the same plot partitioned into bucket 1 and bucket 2]

Framework: Histogram Parameters

• Partition rule: 4 orthogonal parameters
  ◦ Partition class
  ◦ Sort parameter
  ◦ Partition constraint
  ◦ Source parameter
• Construction algorithm


Framework: Histogram Parameters

• Value approximation within bucket
• Frequency approximation within bucket
• Error guarantees

Framework: Partition Class

• Indicates restrictions on partitioning
  ◦ Serial: non-overlapping ranges of sort parameter values
  ◦ End-biased: at most one non-singleton bucket


Framework: Sort Parameter

• Derivative of data distribution element (its value and/or frequency)
  ◦ Attribute values (V)
  ◦ Frequencies (F)
  ◦ Areas (A) = spread × frequency
• Serial: buckets must contain contiguous sort parameter values

Framework: Partition Class and Sort Parameter

FREQUENCY   VALUE
   10         50
   20         40
   40          7
   24         20
   90          1
   16         32
   36          5
   30         15

Framework: Partition Class and Sort Parameter

SORT PAR   FREQUENCY   VALUE
   90         90          1
   40         40          7
   36         36          5
   30         30         15
   24         24         20
   20         20         40
   16         16         32
   10         10         50

[Figure: the sorted rows grouped into buckets B1 to B4]

Framework: Partition Class and Sort Parameter

[Figures: the same SORT PAR / FREQUENCY / VALUE table shown with two further groupings of its rows into buckets B1 to B4]

Framework: Source Parameter

• Derivative of data distribution element (its value and/or frequency)
  ◦ Spreads (S)
  ◦ Frequencies (F)
  ◦ Cumulative frequencies (C)
  ◦ Areas (A)
• Partition constraint applied on source parameter


Framework: Partition Constraint

• Mathematical constraint on the source parameter that partitioning must satisfy
• General direction: Avoid grouping vastly different source parameter values

Framework: Partition Constraint

• Equi-sum: equalize sums
• V-optimal: minimize variance
• Maxdiff: minimize maximum difference of adjacent source values within a bucket
• Compressed: preserve high source values and equalize sums of the rest
• Spline-based: minimize square root of error

Framework: Partition Constraint and Source Parameter

SOURCE PAR   SORT PAR   FREQ   VALUE
   360          90       90       1
   320          40       40       7
    72          36       36       5
   150          30       30      15
   288          24       24      20
   200          20       20      40
   128          16       16      32
    10          10       10      50

[Figure: the rows grouped into buckets B1 to B4; the marked differences between adjacent source values are 248, 138, and 118]

Framework: Histogram Parameters

• Notation: class : constraint (sort, source)
• Special notation for serial partition class: constraint (sort, source)


Framework: Histogram Parameters

• Same parameters for multidimensional histograms
• Partition rule more intricate: not always analyzable into 4 orthogonal parameters
  ◦ Often no sort parameter

The Early Past: Dark Ages

• Essentially, use of 1-bucket histograms
• Large errors


The Early Past: First Appearance

• Kooi’s PhD Thesis
• equi-width histograms
  ◦ equi-width = equi-sum (V, S)
• Adopted by INGRES

The Early Past: First Appearance

[Figures: two Freq vs. Value plots]

The Early Past: First Alternative

• Don’t equalize ranges of values but number of tuples in bucket
• equi-depth histograms
  ◦ equi-depth = equi-sum (V, F)
• Source is only difference
• Adopted by several commercial systems
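The point that equi-width and equi-depth differ only in the source parameter can be sketched with one greedy equi-sum routine. This is an illustration under assumptions, not the construction used by any particular system: the function name, the greedy cut rule, and the sample distribution are all hypothetical. The spreads are passed in explicitly, with the last value given spread 1.

```python
def equi_sum(dist, source, num_buckets):
    """Greedy equi-sum partition of a value-sorted distribution.

    dist:   (value, frequency) pairs sorted by value (sort parameter V)
    source: one number per pair; spreads give equi-width (V, S),
            frequencies give equi-depth (V, F)
    A bucket is closed once the running source sum reaches the next
    multiple of total / num_buckets.
    """
    target = sum(source) / num_buckets
    buckets, current, running = [], [], 0
    for element, s in zip(dist, source):
        current.append(element)
        running += s
        if len(buckets) < num_buckets - 1 and running >= target * (len(buckets) + 1):
            buckets.append(current)
            current = []
    if current:
        buckets.append(current)
    return buckets

dist = [(0, 50), (1, 30), (2, 5), (6, 5), (7, 10)]   # hypothetical data
spread_source = [1, 1, 4, 1, 1]                      # gaps between values
freq_source = [f for _, f in dist]

print(equi_sum(dist, spread_source, 2))  # equi-width: equal value ranges
print(equi_sum(dist, freq_source, 2))    # equi-depth: equal tuple counts
```

With the same data and bucket count, the two calls cut at different places: the equi-width split follows the value range, while the equi-depth split isolates the single heavy value.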

The Early Past: First Alternative

[Figure: Freq vs. Value plot]

The Early Past: Optimal Sort Parameter

• Theorem: For single join queries and accurate knowledge of values, serial histograms with frequency as sort parameter are optimal.
• Generalization of practice to keep high-frequency values accurately.
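The practice of keeping high-frequency values accurately can be sketched as an end-biased histogram: the most frequent values get singleton buckets and everything else shares one bucket. This is a simplified illustration, assuming only the highest frequencies are singled out and that the set of values is known exactly. The function names and data are hypothetical.

```python
def end_biased(dist, num_buckets):
    """Keep the (num_buckets - 1) most frequent values exactly.

    All remaining values share a single bucket that stores only their
    average frequency, so at most one bucket is non-singleton.
    """
    by_freq = sorted(dist, key=lambda vf: vf[1], reverse=True)
    singletons = dict(by_freq[:num_buckets - 1])
    rest = by_freq[num_buckets - 1:]
    avg = sum(f for _, f in rest) / len(rest) if rest else 0.0
    return singletons, avg

def estimate_frequency(hist, value):
    """Exact for singleton values, average frequency for all others.

    Assumption: `value` is known to occur in the data.
    """
    singletons, avg = hist
    return singletons.get(value, avg)

dist = [(1, 100), (2, 3), (3, 5), (4, 80), (5, 4)]   # hypothetical data
hist = end_biased(dist, 3)
print(estimate_frequency(hist, 4))   # a singleton value
print(estimate_frequency(hist, 3))   # a value in the shared bucket
```

Because the buckets group contiguous frequencies, this is also a serial histogram with frequency as sort parameter, the class the theorem refers to.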

The Early Past: Optimal Sort Parameters

[Figure: Freq vs. Value plot]

10 Years Ago

• Theorem: For single join queries and accurate knowledge of values, serial histograms with frequency as sort parameter are optimal.


The Recent Past

• Optimal partition constraints and source parameters?
• Optimality when values are not known accurately?
• Optimal values of other histogram characteristics?

The Recent Past: Optimal Constraint and Source

• Theorem: For the average join query and accurate knowledge of values, v-optimal histograms with frequency as source parameter are optimal.
  ◦ v-optimal (F, F)
• v-optimal: minimize variance of source values
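One way to make "minimize variance of source values" concrete is a dynamic program over the source values in sort order. It minimizes the total squared deviation from the bucket mean, which is the size-weighted sum of the per-bucket variances. This is a textbook-style sketch under that assumption, not necessarily the algorithm behind the theorem. For v-optimal (F, F) the input would be the frequencies sorted by frequency.

```python
def v_optimal(source, num_buckets):
    """Return (total squared error, bucket index ranges) for the best
    split of `source` (already in sort order) into contiguous buckets.
    Assumes 1 <= num_buckets <= len(source)."""
    n = len(source)
    prefix, prefix_sq = [0.0], [0.0]
    for x in source:
        prefix.append(prefix[-1] + x)
        prefix_sq.append(prefix_sq[-1] + x * x)

    def sse(i, j):
        """Squared error of source[i:j] around its own mean."""
        s = prefix[j] - prefix[i]
        return (prefix_sq[j] - prefix_sq[i]) - s * s / (j - i)

    inf = float("inf")
    cost = [[inf] * (n + 1) for _ in range(num_buckets + 1)]
    cut = [[0] * (n + 1) for _ in range(num_buckets + 1)]
    cost[0][0] = 0.0
    for b in range(1, num_buckets + 1):
        for j in range(b, n + 1):
            for i in range(b - 1, j):      # last bucket is source[i:j]
                c = cost[b - 1][i] + sse(i, j)
                if c < cost[b][j]:
                    cost[b][j], cut[b][j] = c, i

    bounds, j = [], n                      # walk the cuts backwards
    for b in range(num_buckets, 0, -1):
        i = cut[b][j]
        bounds.append((i, j))
        j = i
    bounds.reverse()
    return cost[num_buckets][n], bounds

frequencies = sorted([10, 1, 50, 1, 10, 1])   # hypothetical, sorted by F
print(v_optimal(frequencies, 3))
```

The triple loop costs O(n²β) time for n source values and β buckets, which is acceptable for a sketch but would need care on large distributions.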

The Recent Past: Optimal Constraint and Source

[Figure: Freq vs. Value plot]

The Recent Past

• If values are not known accurately, no optimality result on any histogram characteristic
• Several experimental results identify key choices


The Recent Past: New Partition Constraints

• All try to group similar source values
• max-diff: bucket borders at highest differences of adjacent source values
• compressed: preserve high values of source and equalize sums of the rest
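The max-diff rule can be sketched in a few lines: with β buckets, place the β − 1 borders at the largest differences between adjacent source values. The function name is illustrative. The sample source values are the eight SOURCE PAR entries as read from the framework example earlier in the deck, so treat that reading as an assumption.

```python
def max_diff(source, num_buckets):
    """Split `source` (in sort-parameter order) into contiguous buckets,
    cutting at the (num_buckets - 1) largest adjacent differences.

    Ties are broken in favour of the later position (an arbitrary choice).
    """
    gaps = [(abs(source[i + 1] - source[i]), i) for i in range(len(source) - 1)]
    largest = sorted(gaps, reverse=True)[:num_buckets - 1]
    cut_after = sorted(i for _, i in largest)
    buckets, start = [], 0
    for i in cut_after:
        buckets.append(source[start:i + 1])
        start = i + 1
    buckets.append(source[start:])
    return buckets

source = [360, 320, 72, 150, 288, 200, 128, 10]
print(max_diff(source, 4))   # borders fall at the gaps 248, 138 and 118
```

Unlike v-optimal, this needs only one pass to compute the gaps plus a sort, which is why it is attractive when construction cost matters.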

The Recent Past: maxdiff

[Figure: Freq vs. Value plot]

The Recent Past: compressed

[Figure: Freq vs. Value plot]

The Recent Past: Alternative Partition Constraints

• Variations on the optimal knot placement problem
  ◦ Linear splines only
  ◦ Discontinuous across bucket boundaries


The Recent Past: New Sort and Source Parameters

• Choices
  ◦ Attribute values (V)
  ◦ Spreads (S)
  ◦ Frequencies (F)
  ◦ Areas (A)
  ◦ Cumulative frequencies (C)
• value is best sort parameter overall
• area and frequency are best source parameters overall

The Recent Past: Multidimensional Partition Rules

• Multidimensional value domain cannot be sorted to serve as sort parameter
• Many alternatives to partition the space of values into buckets
• Although possible, frequency has not been used as sort parameter


The Recent Past: Multidimensional Partition Class

• A la Grid File
• A la K-D-B-Tree (MHIST)
• GENHIST
• STHoles

The Recent Past: Multidimensional Data Distributions

[Figure: two-dimensional data distribution over Value1 and Value2 with joint frequencies 5, 45, 17, 2]

The Recent Past: M-D Partition Class: Grid File

[Figure: partitioning of the Value1 × Value2 space]

The Recent Past: M-D Partition Class: MHIST

[Figure: partitioning of the Value1 × Value2 space]

The Recent Past: M-D Partition Class: GENHIST

[Figures: three successive views of the Value1 × Value2 space]

The Recent Past: M-D Partition Class: STHoles

[Figure: partitioning of the Value1 × Value2 space]

The Recent Past: Histogram Framework

• Partition rule
  ◦ Partition class
  ◦ Sort parameter
  ◦ Partition constraint
  ◦ Source parameter
• Construction algorithm
• Value and frequency approximation
• Error guarantees

The Recent Past: Value Approximation

• Continuous value assumption: (min and) max value
• Uniform spread assumption: above + number of unique values
• Popularity-based spread: above with “fake” number of unique values
• Kernel estimation
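The first two assumptions above can be compared on a range predicate inside one bucket. In this sketch a bucket is a hypothetical tuple (min, max, number of unique values, total frequency), and frequencies inside the bucket are taken as uniform. The function names and the sample bucket are illustrative.

```python
def uniform_spread_values(lo, hi, num_distinct):
    """Uniform spread assumption: the bucket's distinct values are
    placed at equal distances between lo and hi."""
    if num_distinct == 1:
        return [lo]
    step = (hi - lo) / (num_distinct - 1)
    return [lo + i * step for i in range(num_distinct)]

def estimate_uniform_spread(bucket, a, b):
    """Estimated tuples with a <= value <= b: count the assumed values
    in range and give each the bucket's average frequency."""
    lo, hi, num_distinct, total = bucket
    values = uniform_spread_values(lo, hi, num_distinct)
    return (total / num_distinct) * sum(1 for v in values if a <= v <= b)

def estimate_continuous(bucket, a, b):
    """Continuous value assumption: every point of [lo, hi] is treated
    as present, so the estimate is proportional to the overlap length.
    Assumes hi > lo."""
    lo, hi, _, total = bucket
    overlap = max(0.0, min(b, hi) - max(a, lo))
    return total * overlap / (hi - lo)

bucket = (3, 9, 4, 120)    # hypothetical: 4 unique values, 120 tuples
print(uniform_spread_values(3, 9, 4))
print(estimate_uniform_spread(bucket, 4, 8))
print(estimate_continuous(bucket, 4, 8))
```

The two estimates differ for the same bucket and predicate. The uniform spread version needs one extra number per bucket, the count of unique values.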

The Recent Past: Value Approximation

[Figures: two Freq vs. Value plots of a bucket with its min and max marked, labeled 7 and 24 respectively]

The Recent Past: Value Approximation

• All generalized to multidimensional case
• Tradeoff between number of buckets and information kept within each bucket

The Recent Past: Frequency Approximation

• Uniform distribution assumption: average frequency
• Linear spline approximation: above + spline’s angle

The Recent Past: Frequency Approximation

[Figure: Freq vs. Value plot]

Industrial Presence

• Only 1-dimensional histograms
• 1970’s: trivial histograms (1 bucket)
• 1980’s: equi-width histograms
• 1990’s: equi-depth histograms
• 2000’s: a


Industrial Presence: DB2

compressed (V, F)

• Default of 10 singleton and 20 non-singleton buckets
• Store cumulative frequencies
• Construction based on reservoir sample
• Indices used to quantify dependencies
• LEO learning is key

Industrial Presence: ORACLE

equi-depth = equi-sum (V, F)

• Indices used to quantify dependencies
• On-the-fly dependence estimation
• Past selectivities stored for future use


Industrial Presence: SQL Server

max-diff (V, F)

• Up to 199 buckets
• Store cumulative frequencies
• Store frequency of max accurately
• Construction based on sample
• Indices used to quantify dependencies

Histogram Competitors

• Wavelets
• Sampling (usually complementary)
• Specialized techniques


The Future

• Histograms and clustering
• Bucket recognition and representation
• Histograms and tree indices
• Value approximation
• Comprehensive technique comparison
• Other data types

The Future: Histograms and Clustering

• Clustering is “identical” problem!
  ◦ Grouping of similar elements into buckets (bucket = cluster = pattern)
  ◦ Small approximation within bucket
• Multidimensional elements are
  ◦ attribute value combinations
  ◦ above + frequency

The Future: Histograms and Clustering

[Figures: three views of a Freq vs. Value plot with values 3, 4, 6, 9, 23, 27 and frequency labels 43, 30, 32, 15, 80, 50]

The Future: Histograms and Clustering

• Very different techniques
• Apply on one problem techniques developed for the other
  ◦ Partition rules
  ◦ Construction algorithms
  ◦ Approximate representations within bucket


The Future: Bucket Recognition and Representation

• Essence of histograms or clustering
  ◦ Identify groups of similar elements
  ◦ Similarity on few characteristics (source)
  ◦ Store approximation of these characteristics
• Which are the similar characteristics? [Pattern Recognition]

The Future: Bucket Recognition and Representation

• Maybe not original element dimensions
• Maybe not the same for all groups

The Future: Bucket Recognition and Representation

[Figures: Freq vs. Value plots alternating between the original distribution (values 3, 4, 6, 9, 23, 27 with frequency labels 43, 30, 32, 15, 80, 50) and a two-bucket approximation (values 3, 5, 7, 9, 23, 27 with frequency labels 30 and 65)]

The Future: Bucket Recognition and Representation

• Not clustering in the value-frequency space, but the spread-frequency space
• Why the difference in treatment?
• Is this always better?
• How can we recognize winner?

The Future: Bucket Recognition and Representation

[Figures: four Freq vs. Value plots]

The Future: Histograms and Tree Indices

• Root of the B+ tree partitions space of values into non-overlapping buckets
• Each bucket further subdivided into smaller buckets
• Appropriate info next to each bucket turns each node into a histogram
• Entire B+ tree becomes Hierarchical Histogram
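A rough sketch of the hierarchical histogram idea: each tree node covers a value range and carries a tuple count, and a range estimate can stop at any depth and then be refined by descending further. The `Node` class, the half-open ranges, and the uniform interpolation inside a node are assumptions made for illustration. This is not a real B+ tree implementation.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    lo: float                 # bucket covers the half-open range [lo, hi)
    hi: float
    count: int                # tuples stored under this node
    children: list = field(default_factory=list)

def estimate(node, a, b, depth):
    """Estimate the number of tuples in [a, b), descending at most
    `depth` levels below `node`."""
    if b <= node.lo or a >= node.hi:
        return 0.0                          # no overlap
    if a <= node.lo and node.hi <= b:
        return float(node.count)            # fully covered: count is exact
    if depth == 0 or not node.children:
        overlap = min(b, node.hi) - max(a, node.lo)
        return node.count * overlap / (node.hi - node.lo)
    return sum(estimate(c, a, b, depth - 1) for c in node.children)

root = Node(0, 100, 100, [Node(0, 50, 90), Node(50, 100, 10)])  # hypothetical
print(estimate(root, 0, 25, depth=0))   # coarse estimate from the root only
print(estimate(root, 0, 25, depth=1))   # refined using the children
```

Each extra level visited uses finer buckets, which is the sense in which estimation becomes incremental.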

The Future: Histograms and Tree Indices

[Figure: tree index example; surviving labels: 43512077 71 83]

The Future: Histograms and Tree Indices

- Index fanout decreases
+ Indexing and estimation in one
+ Incremental estimation with increasing estimate accuracy


The Future: Histograms and Tree Indices

• B+ tree node “is” equi-depth histogram
• What kind of trees with other constraints?
  ◦ V-optimal
  ◦ Max-diff
  ◦ Compressed
• Unbalanced trees: exact search slower
• Unbalanced trees: approximate answers more accurate

The Future: Histograms and Tree Indices

• Take into account query frequency
• Represent popular values more accurately (higher in the tree)
• New hierarchical histograms/indices may be faster than traditional ones


Conclusions

• Histograms very successful in databases
• Possibly best tradeoff between
  ◦ Simplicity
  ◦ Efficiency
  ◦ Effectiveness
  ◦ Applicability

The Future

• New approaches to some characteristics
• Untouched foundational problems

The next 10 years: even more exciting!