Transcript of Data Compression by Quantization, Edward J. Wegman, Center for Computational Statistics, George Mason University

Page 1

Data Compression by Quantization

Edward J. Wegman
Center for Computational Statistics

George Mason University

Page 2

Outline

Acknowledgements
Complexity
Sampling Versus Binning
Some Quantization Theory
Recommendations for Quantization

Page 3

Acknowledgements

This is joint work with Nkem-Amin (Martin) Khumbah

This work was funded by the Army Research Office

Page 4

Complexity

Descriptor       Data Set Size in Bytes   Storage Mode
Tiny             10^2                     Piece of Paper
Small            10^4                     A Few Pieces of Paper
Medium           10^6                     A Floppy Disk
Large            10^8                     Hard Disk
Huge             10^10                    Multiple Hard Disks, e.g. RAID Storage
Massive          10^12                    Robotic Magnetic Tape Storage Silos
Super Massive    10^15                    Distributed Archives

The Huber/Wegman Taxonomy of Data Set Sizes

Page 5

Complexity

O(r)          Plot a scatterplot
O(n)          Calculate means, variances, kernel density estimates
O(n log(n))   Calculate fast Fourier transforms
O(nc)         Calculate the singular value decomposition of an r x c matrix; solve a multiple linear regression
O(n^2)        Solve most clustering algorithms
O(a^n)        Detect multivariate outliers

Algorithmic Complexity

Page 6

Complexity

Table 7: Computational Feasibility on a Teraflop Grand Challenge Computer (1000 gigaflop performance assumed)

n         n^(1/2)           n                 n log(n)            n^(3/2)           n^2
tiny      10^-11 seconds    10^-10 seconds    2x10^-10 seconds    10^-9 seconds     10^-8 seconds
small     10^-10 seconds    10^-8 seconds     4x10^-8 seconds     10^-6 seconds     10^-4 seconds
medium    10^-9 seconds     10^-6 seconds     6x10^-6 seconds     0.001 seconds     1 second
large     10^-8 seconds     10^-4 seconds     8x10^-4 seconds     1 second          2.8 hours
huge      10^-7 seconds     0.01 seconds      0.1 seconds         16.7 minutes      3.2 years
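
As a rough check of these entries, each time is just the operation count divided by 10^12 operations per second; the n log(n) column matches the table when the logarithm is taken base 10. A minimal sketch in Python (the variable names and print format are mine, not from the slides):

```python
import math

FLOPS = 1e12  # teraflop machine: 10^12 floating point operations per second
sizes = {"tiny": 1e2, "small": 1e4, "medium": 1e6, "large": 1e8, "huge": 1e10}

for name, n in sizes.items():
    ops = {
        "n^(1/2)":  n ** 0.5,
        "n":        n,
        "n log(n)": n * math.log10(n),  # base-10 log reproduces the table entries
        "n^(3/2)":  n ** 1.5,
        "n^2":      n ** 2,
    }
    print(name, {col: f"{count / FLOPS:.1e} s" for col, count in ops.items()})
```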

Page 7

Motivation

Massive data sets can make many algorithms computationally infeasible, e.g. O(n^2) and higher.

Must reduce the effective number of cases
Reduce computational complexity
Reduce data transfer requirements
Enhance visualization capabilities

Page 8

Data Sampling / Database Sampling

Exhaustive search may not be practically feasible because of the size of the databases.

KDD systems must be able to assist in the selection of appropriate parts of the databases to be examined.

For sampling to work, the data must satisfy certain conditions (not ordered, no systematic biases)

Sampling can be a very expensive operation, especially when the sample is taken from data stored in a DBMS. Sampling 5% of the database can be more expensive than a sequential full scan of the data.

Page 9

Data Compression

Squishing, Squashing, Thinning, Binning

Squishing = # of cases reduced
  Sampling = Thinning
  Quantization = Binning

Squashing = # of dimensions (variables) reduced

Depending on the goal, one of sampling or quantization may be preferable.

Page 10

Data Quantization

Thinning vs Binning

People’s first thought about massive data is usually statistical subsampling.

Quantization is engineering’s success story

Binning is the statistician’s quantization.

Page 11

Data Quantization

Images are quantized in 8 to 24 bits, i.e. 256 to 16 million levels.

Signals (audio on CDs) are quantized in 16 bits, i.e. 65,536 levels

Ask a statistician how many bins to use and the likely response is a few hundred; ask a CS data miner and the likely response is 3.

For a terabyte data set, 10^6 bins.

Page 12

Data Quantization

Binning, but at microresolution.

Conventions:
  d = dimension
  k = # of bins
  n = sample size
  Typically k << n

Page 13

Data Quantization

Choose E[W | Q = y_j] = mean of the observations in the jth bin = y_j.

In other words, E[W | Q] = Q. The quantizer is self-consistent.

Page 14

Data Quantization

E[W] = E[Q]

If θ̂ is a linear unbiased estimator, then so is E[θ̂ | Q].

If h is a convex function, then E[h(Q)] ≤ E[h(W)]. In particular, E[Q^2] ≤ E[W^2] and var(Q) ≤ var(W).

E[Q(Q - W)] = 0

cov(W - Q) = cov(W) - cov(Q)

E[W - P]^2 ≥ E[W - Q]^2, where P is any other quantizer.
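
These identities are easy to check numerically. A minimal sketch in Python under assumptions of my own (equal-width bins with the bin mean as representor; none of the specific numbers come from the slides):

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(size=100_000)                       # observations
k = 100
edges = np.linspace(W.min(), W.max(), k + 1)       # equal-width bins
j = np.clip(np.digitize(W, edges) - 1, 0, k - 1)   # bin index for each observation

# Self-consistent quantizer: each bin's representor is the bin mean, so E[W | Q] = Q.
bin_means = np.array([W[j == b].mean() if np.any(j == b) else 0.0 for b in range(k)])
Q = bin_means[j]

print(np.isclose(W.mean(), Q.mean()))                      # E[W] = E[Q]
print(Q.var() <= W.var())                                  # var(Q) <= var(W)
print(np.isclose(np.mean(Q * (Q - W)), 0.0, atol=1e-8))    # E[Q(Q - W)] = 0
```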

Page 15

Data Quantization

Page 16

Distortion due to Quantization

Distortion is the error due to quantization.

In simple terms, E[W - Q]^2.

Distortion is minimized when the quantization regions, S_j, are most like a (hyper-)sphere.
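
Written out in full (a standard form consistent with the slides' notation; the density f of W is an assumption supplied here, not something named on the slides), the distortion over bins S_j with representors y_j is

$$
D = E\|W - Q\|^2 = \sum_{j=1}^{k} \int_{S_j} \|w - y_j\|^2 \, f(w)\, dw .
$$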

Page 17

Geometry-based Quantization

Need space-filling tessellations
Need congruent tiles
Need tiles as spherical as possible

Page 18

Geometry-based Quantization

In one dimension
  The only polytope is a straight line segment (also bounded by a one-dimensional sphere).

In two dimensions
  The only polytopes are equilateral triangles, squares, and hexagons.

Page 19

Geometry-based Quantization

In 3 dimensions
  Tetrahedron (3-simplex), cube, hexagonal prism, rhombic dodecahedron, truncated octahedron.

In 4 dimensions
  4-simplex, hypercube, 24-cell.

[Figure: truncated octahedron tessellation]

Page 20

Geometry-based Quantization

Tetrahedron*              0.1040042…
Cube*                     0.0833333…
Octahedron                0.0825482…
Hexagonal Prism*          0.0812227…
Rhombic Dodecahedron*     0.0787451…
Truncated Octahedron*     0.0785433…
Dodecahedron              0.0781285…
Icosahedron               0.0778185…
Sphere                    0.0769670

Dimensionless Second Moment for 3-D Polytopes (* marks the space-filling polytopes)
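
For context (this formula is not on the slides, but it is the standard definition behind these numbers): the dimensionless second moment of a solid S with centroid c and volume V(S) in d dimensions is

$$
G(S) = \frac{1}{d\,V(S)^{1+2/d}} \int_{S} \|x - c\|^2 \, dx, \qquad d = 3 \ \text{here.}
$$

For the unit cube this gives 1/12 ≈ 0.0833333, matching the table.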

Page 21

Geometry-based Quantization

[Figures: tetrahedron, cube, octahedron, icosahedron, dodecahedron, truncated octahedron]

Page 23

Geometry-based Quantization

[Figures: hexagonal prism; 24-cell with cuboctahedron envelope]

Page 24

Geometry-based Quantization

Using 10^6 bins is computationally and visually feasible.

Fast binning, for data in the range [a, b] and for k bins,

    j = fixed[ k (x_i - a) / (b - a) ]

gives the index of the bin for x_i in one dimension. Computational complexity is 4n + 1 = O(n).

Memory requirements drop to 3k: location of the bin + # of items in the bin + representor of the bin, i.e. storage complexity is 3k.
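
A minimal sketch of this one-dimensional fast binning in Python/NumPy, under my own reading of the slide: fixed[·] is taken as truncation (floor, for the nonnegative quantities that arise here), x_i = b is clamped into the last bin, and the bin mean is used as the representor. The function names are mine, not from the slides.

```python
import numpy as np

def bin_index_1d(x, a, b, k):
    """Fast binning: map each x_i in [a, b] to a bin index j in {0, ..., k-1}."""
    j = np.floor(k * (x - a) / (b - a)).astype(int)   # j = fixed[k(x_i - a)/(b - a)]
    return np.minimum(j, k - 1)                       # x_i == b would otherwise fall in bin k

def quantize_1d(x, a, b, k):
    """Per bin, keep only the count of items and a representor (here, the bin mean)."""
    j = bin_index_1d(x, a, b, k)
    counts = np.bincount(j, minlength=k)
    sums = np.bincount(j, weights=x, minlength=k)
    reps = np.divide(sums, counts, out=np.zeros(k), where=counts > 0)
    return counts, reps

# Example: compress 10^6 uniform samples into 10^3 bins.
x = np.random.default_rng(0).uniform(0.0, 1.0, size=1_000_000)
counts, reps = quantize_1d(x, 0.0, 1.0, 1_000)
```

Only the k counts and k representors (plus the bin locations implied by a, b, and k) need to be kept, which matches the 3k storage figure above.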

Page 25

Geometry-based Quantization

In two dimensions
  Each hexagon is indexed by 3 parameters.
  Computational complexity is 3 times the 1-D complexity, i.e. 12n + 3 = O(n).
  Complexity for squares is 2 times the 1-D complexity; the ratio is 3/2.
  Storage complexity is still 3k.

Page 26

Geometry-based Quantization

In 3 dimensions
  For the truncated octahedron, there are 3 pairs of square sides and 4 pairs of hexagonal sides.
  Computational complexity is 28n + 7 = O(n). Computational complexity for a cube is 12n + 3; the ratio is 7/3.
  Storage complexity is still 3k.

Page 27

Quantization Strategies

Optimally, for purposes of minimizing distortion, use the roundest polytope in d dimensions.
  Complexity is always O(n).
  Storage complexity is 3k.
  The # of tiles grows exponentially with dimension, the so-called curse of dimensionality.
  Higher-dimensional geometry is poorly known.
  Computational complexity grows faster than for the hypercube.

Page 28

Quantization Strategies

For purposes of simplicity, always use the hypercube or d-dimensional simplices.
  Computational complexity is always O(n).
  Methods for data-adaptive tiling are available.
  Storage complexity is 3k.
  The # of tiles grows exponentially with dimension.
  Both polytopes depart from spherical shape rapidly as d increases.
  The hypercube approach is known as the datacube in the computer science literature and is closely related to multivariate histograms in the statistical literature.

Page 29

Quantization Strategies

Conclusions on Geometric Quantization
  The geometric approach is good to 4 or 5 dimensions.
  Adaptive tilings may improve the rate at which the # of tiles grows, but probably destroy the spherical structure.
  Good for large n, but weaker for large d.

Page 30

Quantization Strategies

Alternate strategy: form bins via clustering.
  Known in the electrical engineering literature as vector quantization.
  Distance-based clustering is O(n^2), which implies poor performance for large n.
  Not terribly dependent on the dimension, d.
  Clusters may be very out of round, not even convex.

Conclusion
  The cluster approach may work for large d, but fails for large n.
  Not particularly applicable to “massive” data mining.

Page 31

Quantization Strategies

Third strategy: density-based clustering.
  Density estimation with kernel estimators is O(n).
  Uses the modes m_j to form clusters.
  Put x_i in cluster j if it is closest to mode m_j.
  This procedure is distance based, but with complexity O(kn), not O(n^2).
  Normal mixture densities may be an alternative approach.
  Roundness may be a problem.

But quantization based on density-based clustering offers promise for both large d and large n.
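
A minimal sketch of the assignment step, assuming the k modes have already been located (e.g. from a kernel density estimate): each point is simply attached to its nearest mode, which costs O(kn) distance evaluations rather than O(n^2). Function and variable names are mine, not from the slides.

```python
import numpy as np

def assign_to_modes(X, modes):
    """Density-based binning: attach each point to the nearest mode.

    X     : (n, d) array of observations
    modes : (k, d) array of mode locations (assumed found beforehand,
            e.g. from a kernel density estimate)
    Cost is O(kn) distance evaluations, not O(n^2).
    """
    # Squared distances from every point to every mode: shape (n, k)
    d2 = ((X[:, None, :] - modes[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)   # cluster label per point

# Example with two assumed modes in 2-D
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (500, 2)), rng.normal(5, 1, (500, 2))])
modes = np.array([[0.0, 0.0], [5.0, 5.0]])
labels = assign_to_modes(X, modes)
```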

Page 32

Data Quantization

Binning does not lose fine structure in the tails as sampling might.

Roundoff analysis applies: at this scale of binning, discretization is not likely to be much less accurate than the accuracy of the recorded data.

Discretization: the finite number of bins implies discrete variables, which are more compatible with categorical data.

Page 33

Data Quantization

Analysis on a finite subset of the integers has theoretical advantages.
  Analysis is less delicate: different forms of convergence are equivalent.
  Analysis is often more natural, since the data are already quantized or categorical.
  Graphical analysis of numerical data is not much changed, since 10^6 pixels is at the limit of the human visual system (HVS).