Transcript of Data Compression by Quantization, Edward J. Wegman, Center for Computational Statistics, George Mason University

Page 1

Data Compression by Quantization

Edward J. Wegman
Center for Computational Statistics

George Mason University

Page 2

Outline

Acknowledgements
Complexity
Sampling Versus Binning
Some Quantization Theory
Recommendations for Quantization

Page 3

Acknowledgements

This is joint work with Nkem-Amin (Martin) Khumbah

This work was funded by the Army Research Office

Page 4

Complexity

Descriptor       Data Set Size in Bytes   Storage Mode
Tiny             10^2                     Piece of Paper
Small            10^4                     A Few Pieces of Paper
Medium           10^6                     A Floppy Disk
Large            10^8                     Hard Disk
Huge             10^10                    Multiple Hard Disks, e.g. RAID Storage
Massive          10^12                    Robotic Magnetic Tape Storage Silos
Super Massive    10^15                    Distributed Archives

The Huber/Wegman Taxonomy of Data Set Sizes

Page 5

Complexity

O(r)          Plot a scatterplot
O(n)          Calculate means, variances, kernel density estimates
O(n log(n))   Calculate fast Fourier transforms
O(nc)         Calculate the singular value decomposition of an r x c matrix; solve a multiple linear regression
O(n^2)        Solve most clustering algorithms
O(a^n)        Detect multivariate outliers

Algorithmic Complexity

Page 6

Complexity

Table 7: Computational Feasibility on a Teraflop Grand Challenge Computer (1000 gigaflop performance assumed)

n         n^(1/2)           n                 n log(n)            n^(3/2)           n^2
tiny      10^-11 seconds    10^-10 seconds    2x10^-10 seconds    10^-9 seconds     10^-8 seconds
small     10^-10 seconds    10^-8 seconds     4x10^-8 seconds     10^-6 seconds     10^-4 seconds
medium    10^-9 seconds     10^-6 seconds     6x10^-6 seconds     0.001 seconds     1 second
large     10^-8 seconds     10^-4 seconds     8x10^-4 seconds     1 second          2.8 hours
huge      10^-7 seconds     0.01 seconds      0.1 seconds         16.7 minutes      3.2 years
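
As a rough check of these entries, each time is just the operation count divided by 10^12 operations per second; the n log(n) column matches the table when the logarithm is taken base 10. A minimal sketch in Python (the variable names and print format are mine, not from the slides):

```python
import math

FLOPS = 1e12  # teraflop machine: 10^12 floating point operations per second
sizes = {"tiny": 1e2, "small": 1e4, "medium": 1e6, "large": 1e8, "huge": 1e10}

for name, n in sizes.items():
    ops = {
        "n^(1/2)":  n ** 0.5,
        "n":        n,
        "n log(n)": n * math.log10(n),  # base-10 log reproduces the table entries
        "n^(3/2)":  n ** 1.5,
        "n^2":      n ** 2,
    }
    print(name, {col: f"{count / FLOPS:.1e} s" for col, count in ops.items()})
```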

Page 7

Motivation

Massive data sets can make many algorithms computationally infeasible, e.g. O(n^2) and higher.

Must reduce the effective number of cases
Reduce computational complexity
Reduce data transfer requirements
Enhance visualization capabilities

Page 8

Data Sampling / Database Sampling

Exhaustive search may not be practically feasible because of the size of the databases.

KDD systems must be able to assist in the selection of appropriate parts of the databases to be examined.

For sampling to work, the data must satisfy certain conditions (not ordered, no systematic biases)

Sampling can be a very expensive operation, especially when the sample is taken from data stored in a DBMS. Sampling 5% of the database can be more expensive than a sequential full scan of the data.

Page 9

Data Compression

Squishing, Squashing, Thinning, Binning

Squishing = # of cases reduced
  Sampling = Thinning
  Quantization = Binning

Squashing = # of dimensions (variables) reduced

Depending on the goal, one of sampling or quantization may be preferable.

Page 10

Data Quantization

Thinning vs Binning

People’s first thought about massive data is usually statistical subsampling.

Quantization is engineering’s success story

Binning is the statistician’s quantization.

Page 11

Data Quantization

Images are quantized in 8 to 24 bits, i.e. 256 to 16 million levels.

Signals (audio on CDs) are quantized in 16 bits, i.e. 65,536 levels

Ask a statistician how many bins to use and the likely response is a few hundred; ask a CS data miner and the likely response is 3.

For a terabyte data set, 10^6 bins.

Page 12

Data Quantization

Binning, but at microresolution.

Conventions:
  d = dimension
  k = # of bins
  n = sample size
  Typically k << n

Page 13

Data Quantization

Choose E[W | Q = y_j] = mean of the observations in the jth bin = y_j.

In other words, E[W | Q] = Q. The quantizer is self-consistent.

Page 14

Data Quantization

E[W] = E[Q]

If θ̂ is a linear unbiased estimator, then so is E[θ̂ | Q].

If h is a convex function, then E[h(Q)] ≤ E[h(W)]. In particular, E[Q^2] ≤ E[W^2] and var(Q) ≤ var(W).

E[Q(Q - W)] = 0

cov(W - Q) = cov(W) - cov(Q)

E[W - P]^2 ≥ E[W - Q]^2, where P is any other quantizer.
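
These identities are easy to check numerically. A minimal sketch in Python under assumptions of my own (equal-width bins with the bin mean as representor; none of the specific numbers come from the slides):

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(size=100_000)                       # observations
k = 100
edges = np.linspace(W.min(), W.max(), k + 1)       # equal-width bins
j = np.clip(np.digitize(W, edges) - 1, 0, k - 1)   # bin index for each observation

# Self-consistent quantizer: each bin's representor is the bin mean, so E[W | Q] = Q.
bin_means = np.array([W[j == b].mean() if np.any(j == b) else 0.0 for b in range(k)])
Q = bin_means[j]

print(np.isclose(W.mean(), Q.mean()))                      # E[W] = E[Q]
print(Q.var() <= W.var())                                  # var(Q) <= var(W)
print(np.isclose(np.mean(Q * (Q - W)), 0.0, atol=1e-8))    # E[Q(Q - W)] = 0
```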

Page 15

Data Quantization

Page 16

Distortion due to Quantization

Distortion is the error due to quantization.

In simple terms, E[W - Q]^2.

Distortion is minimized when the quantization regions, S_j, are most like a (hyper-)sphere.
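
Written out in full (a standard form consistent with the slides' notation; the density f of W is an assumption supplied here, not something named on the slides), the distortion over bins S_j with representors y_j is

$$
D = E\|W - Q\|^2 = \sum_{j=1}^{k} \int_{S_j} \|w - y_j\|^2 \, f(w)\, dw .
$$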

Page 17

Geometry-based Quantization

Need space-filling tessellations
Need congruent tiles
Need tiles as spherical as possible

Page 18

Geometry-based Quantization

In one dimension
  The only polytope is a straight line segment (also bounded by a one-dimensional sphere).

In two dimensions
  The only polytopes are equilateral triangles, squares, and hexagons.

Page 19

Geometry-based Quantization

In 3 dimensions
  Tetrahedron (3-simplex), cube, hexagonal prism, rhombic dodecahedron, truncated octahedron.

In 4 dimensions
  4-simplex, hypercube, 24-cell.

[Figure: truncated octahedron tessellation]

Page 20

Geometry-based Quantization

Tetrahedron*              0.1040042…
Cube*                     0.0833333…
Octahedron                0.0825482…
Hexagonal Prism*          0.0812227…
Rhombic Dodecahedron*     0.0787451…
Truncated Octahedron*     0.0785433…
Dodecahedron              0.0781285…
Icosahedron               0.0778185…
Sphere                    0.0769670

Dimensionless Second Moment for 3-D Polytopes (* marks the space-filling polytopes)
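
For context (this formula is not on the slides, but it is the standard definition behind these numbers): the dimensionless second moment of a solid S with centroid c and volume V(S) in d dimensions is

$$
G(S) = \frac{1}{d\,V(S)^{1+2/d}} \int_{S} \|x - c\|^2 \, dx, \qquad d = 3 \ \text{here.}
$$

For the unit cube this gives 1/12 ≈ 0.0833333, matching the table.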

Page 21

Geometry-based Quantization

[Figures: tetrahedron, cube, octahedron, icosahedron, dodecahedron, truncated octahedron]

Page 23

Geometry-based Quantization

[Figures: hexagonal prism; 24-cell with cuboctahedron envelope]

Page 24

Geometry-based Quantization

Using 10^6 bins is computationally and visually feasible.

Fast binning, for data in the range [a, b] and for k bins,

    j = fixed[ k (x_i - a) / (b - a) ]

gives the index of the bin for x_i in one dimension. Computational complexity is 4n + 1 = O(n).

Memory requirements drop to 3k: location of the bin + # of items in the bin + representor of the bin, i.e. storage complexity is 3k.
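
A minimal sketch of this one-dimensional fast binning in Python/NumPy, under my own reading of the slide: fixed[·] is taken as truncation (floor, for the nonnegative quantities that arise here), x_i = b is clamped into the last bin, and the bin mean is used as the representor. The function names are mine, not from the slides.

```python
import numpy as np

def bin_index_1d(x, a, b, k):
    """Fast binning: map each x_i in [a, b] to a bin index j in {0, ..., k-1}."""
    j = np.floor(k * (x - a) / (b - a)).astype(int)   # j = fixed[k(x_i - a)/(b - a)]
    return np.minimum(j, k - 1)                       # x_i == b would otherwise fall in bin k

def quantize_1d(x, a, b, k):
    """Per bin, keep only the count of items and a representor (here, the bin mean)."""
    j = bin_index_1d(x, a, b, k)
    counts = np.bincount(j, minlength=k)
    sums = np.bincount(j, weights=x, minlength=k)
    reps = np.divide(sums, counts, out=np.zeros(k), where=counts > 0)
    return counts, reps

# Example: compress 10^6 uniform samples into 10^3 bins.
x = np.random.default_rng(0).uniform(0.0, 1.0, size=1_000_000)
counts, reps = quantize_1d(x, 0.0, 1.0, 1_000)
```

Only the k counts and k representors (plus the bin locations implied by a, b, and k) need to be kept, which matches the 3k storage figure above.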

Page 25

Geometry-based Quantization

In two dimensions
  Each hexagon is indexed by 3 parameters.
  Computational complexity is 3 times the 1-D complexity, i.e. 12n + 3 = O(n).
  Complexity for squares is 2 times the 1-D complexity; the ratio is 3/2.
  Storage complexity is still 3k.

Page 26

Geometry-based Quantization

In 3 dimensions
  For the truncated octahedron, there are 3 pairs of square sides and 4 pairs of hexagonal sides.
  Computational complexity is 28n + 7 = O(n). Computational complexity for a cube is 12n + 3; the ratio is 7/3.
  Storage complexity is still 3k.

Page 27

Quantization Strategies

Optimally, for purposes of minimizing distortion, use the roundest polytope in d dimensions.
  Complexity is always O(n).
  Storage complexity is 3k.
  The # of tiles grows exponentially with dimension, the so-called curse of dimensionality.
  Higher-dimensional geometry is poorly known.
  Computational complexity grows faster than for the hypercube.

Page 28

Quantization Strategies

For purposes of simplicity, always use the hypercube or d-dimensional simplices.
  Computational complexity is always O(n).
  Methods for data-adaptive tiling are available.
  Storage complexity is 3k.
  The # of tiles grows exponentially with dimension.
  Both polytopes depart from spherical shape rapidly as d increases.
  The hypercube approach is known as the datacube in the computer science literature and is closely related to multivariate histograms in the statistical literature.

Page 29

Quantization Strategies

Conclusions on Geometric Quantization
  The geometric approach is good to 4 or 5 dimensions.
  Adaptive tilings may improve the rate at which the # of tiles grows, but probably destroy the spherical structure.
  Good for large n, but weaker for large d.

Page 30

Quantization Strategies

Alternate strategy: form bins via clustering.
  Known in the electrical engineering literature as vector quantization.
  Distance-based clustering is O(n^2), which implies poor performance for large n.
  Not terribly dependent on the dimension, d.
  Clusters may be very out of round, not even convex.

Conclusion
  The cluster approach may work for large d, but fails for large n.
  Not particularly applicable to “massive” data mining.

Page 31

Quantization Strategies

Third strategy: density-based clustering.
  Density estimation with kernel estimators is O(n).
  Uses the modes m_j to form clusters.
  Put x_i in cluster j if it is closest to mode m_j.
  This procedure is distance based, but with complexity O(kn), not O(n^2).
  Normal mixture densities may be an alternative approach.
  Roundness may be a problem.

But quantization based on density-based clustering offers promise for both large d and large n.
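
A minimal sketch of the assignment step, assuming the k modes have already been located (e.g. from a kernel density estimate): each point is simply attached to its nearest mode, which costs O(kn) distance evaluations rather than O(n^2). Function and variable names are mine, not from the slides.

```python
import numpy as np

def assign_to_modes(X, modes):
    """Density-based binning: attach each point to the nearest mode.

    X     : (n, d) array of observations
    modes : (k, d) array of mode locations (assumed found beforehand,
            e.g. from a kernel density estimate)
    Cost is O(kn) distance evaluations, not O(n^2).
    """
    # Squared distances from every point to every mode: shape (n, k)
    d2 = ((X[:, None, :] - modes[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)   # cluster label per point

# Example with two assumed modes in 2-D
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (500, 2)), rng.normal(5, 1, (500, 2))])
modes = np.array([[0.0, 0.0], [5.0, 5.0]])
labels = assign_to_modes(X, modes)
```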

Page 32

Data Quantization

Binning does not lose fine structure in the tails as sampling might.

Roundoff analysis applies: at this scale of binning, discretization is not likely to be much less accurate than the accuracy of the recorded data.

Discretization: the finite number of bins implies discrete variables, which are more compatible with categorical data.

Page 33

Data Quantization

Analysis on a finite subset of the integers has theoretical advantages.
  Analysis is less delicate: different forms of convergence are equivalent.
  Analysis is often more natural, since the data are already quantized or categorical.
  Graphical analysis of numerical data is not much changed, since 10^6 pixels is at the limit of the human visual system (HVS).