Statistical Data Mining
Transcript of Statistical Data Mining
Statistical Data Mining, Lecture 2
Edward J. Wegman, George Mason University
Data Preparation
Data Preparation

[Figure: bar chart of effort (%) by project phase: Objectives Determination, Data Preparation, Data Mining, Analysis & Assimilation; vertical axis 0–60%]
Data Preparation
• Data Cleaning and Quality
• Types of Data
• Categorical versus Continuous Data
• Problem of Missing Data
  – Imputation
  – Missing Data Plots
• Problem of Outliers
• Dimension Reduction, Quantization, Sampling
Data Preparation
• Quality
  – Data may not have any statistically significant patterns or relationships
  – Results may be inconsistent with other data sets
  – Data often of uneven quality, e.g. made up by respondent
  – Opportunistically collected data may have biases or errors
  – Discovered patterns may be too specific or too general to be useful
Data Preparation
• Noise - Incorrect Values
  – Faulty data collection instruments, e.g. sensors
  – Transmission errors, e.g. intermittent errors from satellite or Internet transmissions
  – Data entry problems
  – Technology limitations
  – Naming conventions misused
Data Preparation
• Noise - Incorrect Classification
  – Human judgment
  – Time varying
  – Uncertainty/Probabilistic nature of data
Data Preparation
• Redundant/Stale data
  – Variables have different names in different databases
  – Raw variable in one database is a derived variable in another
  – Irrelevant variables destroy speed (dimension reduction needed)
  – Changes in a variable over time not reflected in the database
Data Preparation
• Data cleaning
• Selecting an appropriate data set and/or sampling strategy
• Transformations
Data Preparation
• Data Cleaning
  – Duplicate removal (tool based)
  – Missing value imputation (manual, statistical)
  – Identify and remove data inconsistencies
  – Identify and refresh stale data
  – Create unique record (case) ID
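Two of the cleaning steps above can be sketched in a few lines; this is a minimal pure-Python illustration, with hypothetical field names ("name", "city"), not a tool-based pipeline.

```python
# Minimal sketch of two cleaning steps: duplicate removal and creation of
# a unique record (case) ID. The records and field names are hypothetical.
records = [
    {"name": "Ann", "city": "Fairfax"},
    {"name": "Bob", "city": "Reston"},
    {"name": "Ann", "city": "Fairfax"},  # exact duplicate
]

seen = set()
cleaned = []
for rec in records:
    key = tuple(sorted(rec.items()))  # hashable signature of the record
    if key not in seen:               # keep only the first occurrence
        seen.add(key)
        cleaned.append(rec)

# Assign a unique case ID to each surviving record.
for case_id, rec in enumerate(cleaned, start=1):
    rec["case_id"] = case_id

print(cleaned)
```

In practice the duplicate key would be restricted to identifying fields and fuzzy matching would be needed; the exact-match version just shows the shape of the operation.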
Data Preparation
• Categorical versus Continuous Data
  – Most statistical theory and many graphics tools were developed for continuous data
  – Much, if not most, of the data in databases is categorical
  – The computer science view often converts continuous data to categorical, e.g. salaries categorized as low, medium, high, because categories are better suited to Boolean operations
Data Preparation
• Problem of Missing Values
  – Missing values in massive data sets may or may not be a problem
    • Missing data may be irrelevant to the desired result, e.g. cases with missing demographic data may not help if I am trying to create a selection mechanism for good customers based on demographics
    • Massive data sets, if acquired by instrumentation, may have few missing values anyway
    • Imputation has model assumptions
  – Suggest making a Missing Value Plot
Data Preparation
• Missing Value Plot
  – A plot of variables by cases
  – Missing values colored red
  – Special case of “color histogram” with binary data
  – “Color histogram” also known as “data image”
  – This example is 67 dimensions by 1000 cases
  – This example is also fake
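The data structure behind a missing value plot is just a binary variables-by-cases matrix; a minimal sketch with synthetic data (the slide's 67 × 1000 example was itself fake) follows, with the actual rendering left as a comment.

```python
import numpy as np

# Sketch of the matrix behind a missing value plot: rows = variables,
# columns = cases, entry 1 where a value is missing. Synthetic stand-in data.
rng = np.random.default_rng(0)
n_vars, n_cases = 67, 1000
data = rng.normal(size=(n_vars, n_cases))
data[rng.random(data.shape) < 0.05] = np.nan  # knock out ~5% at random

missing_mask = np.isnan(data).astype(int)  # the binary "data image"

# Rendering is a single image call, e.g. with matplotlib:
#   plt.imshow(missing_mask, cmap=ListedColormap(["white", "red"]), aspect="auto")
print(missing_mask.shape, missing_mask.sum())
```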
Data Preparation
• Problem of Outliers
  – Outliers are easy to detect in low dimensions
  – A high-dimensional outlier may not show up in low-dimensional projections
  – MVE or MCD algorithms are exponentially computationally complex
    • Visualization tools may help
  – Fisher Information Matrix and Convex Hull Peeling are more feasible, but still too complex for massive datasets
    • Some angle-based methods are promising
Data Preparation
• Database Sampling
  – Exhaustive search may not be practically feasible because of database size
  – The KDD systems must be able to assist in the selection of appropriate parts of the databases to be examined
  – For sampling to work, the data must satisfy certain conditions (not ordered, no systematic biases)
  – Sampling can be a very expensive operation, especially when the sample is taken from data stored in a DBMS. Sampling 5% of the database can be more expensive than a sequential full scan of the data.
Data Compression
• Often data preparation involves data compression
  – Sampling
  – Quantization
Data Quantization
Thinning vs Binning
• People’s first thought about massive data is usually statistical subsampling
• Quantization is engineering’s success story
• Binning is the statistician’s quantization
Data Quantization
• Images are quantized in 8 to 24 bits, i.e. 256 to 16 million levels
• Signals (audio on CDs) are quantized in 16 bits, i.e. 65,536 levels
• Ask a statistician how many bins to use and the likely response is a few hundred; ask a CS data miner and the likely response is 3
• For a terabyte data set, use 10^6 bins
Data Quantization
• Binning, but at microresolution
• Conventions
  – d = dimension
  – k = # of bins
  – n = sample size
  – Typically k << n
Data Quantization
• Choose the representative y_j of the jth bin to be the mean of the observations in that bin, so that E[W | Q = y_j] = y_j
• In other words, E[W | Q] = Q
• Such a quantizer is self-consistent
Data Quantization
• E[W] = E[Q]
• If θ is a linear unbiased estimator, then so is E[θ | Q]
• If h is a convex function, then E[h(Q)] ≤ E[h(W)]
  – In particular, E[Q²] ≤ E[W²] and var(Q) ≤ var(W)
• E[Q(Q − W)] = 0
• cov(W − Q) = cov(W) − cov(Q)
• E[(W − P)²] ≥ E[(W − Q)²], where P is any other quantizer
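These identities are easy to check numerically for a bin-mean quantizer; a minimal 1-D sketch (illustrative data, 100 equal-width bins) follows.

```python
import numpy as np

# Numeric sketch of the self-consistent quantizer's moment identities.
# Q replaces each observation by the mean of its bin (bin means as the
# representatives y_j); the data and bin count are illustrative choices.
rng = np.random.default_rng(1)
w = rng.normal(size=10_000)
k = 100
edges = np.linspace(w.min(), w.max(), k + 1)
j = np.clip(np.searchsorted(edges, w, side="right") - 1, 0, k - 1)

q = np.empty_like(w)
for b in np.unique(j):
    q[j == b] = w[j == b].mean()  # E[W | Q] = Q by construction

print(w.mean(), q.mean())          # E[W] = E[Q]
print(w.var(), q.var())            # var(Q) <= var(W)
print(np.mean(q * (q - w)))        # E[Q(Q - W)] = 0
```

The first and third identities hold exactly (up to floating-point roundoff) because each bin's contribution to E[Q(Q − W)] is y_j · (n_j·y_j − n_j·y_j) = 0.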
Distortion due to Quantization
• Distortion is the error due to quantization
• In simple terms, E[(W − Q)²]
• Distortion is minimized when the quantization regions, S_j, are most like a (hyper-)sphere
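Distortion is easy to quantify in one dimension. A quick numeric illustration, assuming uniform data on [0, 1) with k equal bins and midpoint representatives, recovers the classical roundoff result that distortion equals Δ²/12 for bin width Δ:

```python
import numpy as np

# Distortion E[(W - Q)^2] for a uniform 1-D quantizer. For W uniform on
# [0, 1) with k equal bins of width delta and midpoint representatives,
# the classical result is distortion = delta^2 / 12. Sizes are illustrative.
rng = np.random.default_rng(2)
w = rng.random(100_000)
k = 64
delta = 1.0 / k

j = np.minimum((w / delta).astype(int), k - 1)  # bin index
q = (j + 0.5) * delta                           # midpoint representative

distortion = np.mean((w - q) ** 2)
print(distortion, delta**2 / 12)
```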
Geometry-based Quantization
• Need space-filling tessellations
• Need congruent tiles
• Need as spherical as possible
Geometry-based Quantization
• In one dimension
  – The only polytope is a straight line segment (also bounded by a one-dimensional sphere)
• In two dimensions
  – The only polytopes are equilateral triangles, squares and hexagons
Geometry-based Quantization
• In 3 dimensions
  – Tetrahedron (3-simplex), cube, hexagonal prism, rhombic dodecahedron, truncated octahedron
• In 4 dimensions
  – 4-simplex, hypercube, 24-cell

[Figure: truncated octahedron tessellation]
Geometry-based Quantization
Dimensionless Second Moment for 3-D Polytopes

| Polytope | Second Moment |
|---|---|
| Tetrahedron* | 0.1040042… |
| Cube* | 0.0833333… |
| Octahedron | 0.0825482… |
| Hexagonal Prism* | 0.0812227… |
| Rhombic Dodecahedron* | 0.0787451… |
| Truncated Octahedron* | 0.0785433… |
| Dodecahedron | 0.0781285… |
| Icosahedron | 0.0778185… |
| Sphere | 0.0769670 |
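One entry in the table is simple to verify by Monte Carlo. Taking the dimensionless second moment of a tile S as G = (1/d) · ∫_S ||x − c||² dx / V^(1+2/d) (c the centroid, V the volume), the unit cube gives exactly 1/12 = 0.0833333…, the tabulated value:

```python
import numpy as np

# Monte Carlo check of the cube's dimensionless second moment
# G = (1/d) * integral_S ||x - c||^2 dx / V^(1 + 2/d).
# For the centered unit cube, V = 1 and the integral is 3 * (1/12) = 1/4,
# so G = 1/12 exactly. Sample size is an illustrative choice.
rng = np.random.default_rng(3)
d = 3
x = rng.random((500_000, d)) - 0.5             # uniform in centered unit cube
second_moment = np.mean(np.sum(x**2, axis=1))  # estimates the integral (V = 1)
G = second_moment / d                          # V^(1 + 2/d) = 1 here

print(G)  # close to 1/12 = 0.0833333...
```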
Geometry-based Quantization
[Figure: tetrahedron, cube, octahedron, truncated octahedron, dodecahedron, icosahedron]
Geometry-based Quantization
[Figure: rhombic dodecahedron, from http://www.jcrystal.com/steffenweber/POLYHEDRA/p_07.html]
Geometry-based Quantization
[Figures: hexagonal prism; 24-cell with cuboctahedron envelope]
Geometry-based Quantization
• Using 10^6 bins is computationally and visually feasible
• Fast binning: for data in the range [a,b] and for k bins,
    j = ⌊k(x_i − a)/(b − a)⌋
  gives the index of the bin for x_i in one dimension
• Computational complexity is 4n + 1 = O(n)
• Memory requirements drop to 3k (location of bin + # items in bin + representative of bin), i.e. storage complexity is 3k
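The fast-binning formula and the 3k summary can be sketched directly; the data, bin count, and choice of bin means as representatives are illustrative.

```python
import numpy as np

# One-dimensional fast binning, j = floor(k * (x_i - a) / (b - a)), plus
# the 3k summary described above: per-bin location, # of items in the bin,
# and a representative (here the bin mean). Data and k are illustrative.
rng = np.random.default_rng(4)
x = rng.normal(size=1_000_000)
a, b, k = x.min(), x.max(), 1000

j = np.minimum((k * (x - a) / (b - a)).astype(int), k - 1)  # x = b -> last bin

counts = np.bincount(j, minlength=k)                  # items per bin
sums = np.bincount(j, weights=x, minlength=k)
occupied = counts > 0
left_edges = a + np.arange(k) * (b - a) / k           # bin locations
reps = np.where(occupied, sums / np.maximum(counts, 1), np.nan)

# Total storage is 3k numbers (edges, counts, representatives) vs n raw values.
print(counts.sum(), occupied.sum())
```

A single pass over the data (one multiply, subtract, divide, and truncation per point) matches the slide's 4n + 1 operation count.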
Geometry-based Quantization
• In two dimensions
  – Each hexagon is indexed by 3 parameters
  – Computational complexity is 3 times the 1-D complexity, i.e. 12n + 3 = O(n)
  – Complexity for squares is 2 times the 1-D complexity
  – Ratio is 3/2
  – Storage complexity is still 3k
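The slide does not spell out its 3-parameter hexagon indexing, but constant-work-per-point hexagonal binning can be sketched with a common equivalent trick: a hexagonal lattice is the union of two offset rectangular lattices, so round to the nearest center on each and keep the closer one.

```python
import math

# O(1)-per-point hexagonal bin assignment via two offset rectangular
# lattices (a standard equivalent of the slide's unstated indexing).
# r is the hexagon circumradius, an illustrative choice.
def hex_bin(px, py, r=1.0):
    w, h = math.sqrt(3) * r, 1.5 * r              # lattice spacings
    best = None
    for ox, oy in ((0.0, 0.0), (w / 2, h / 2)):   # the two sub-lattices
        cx = round((px - ox) / w) * w + ox        # nearest center on this one
        cy = round((py - oy) / h) * h + oy
        d2 = (px - cx) ** 2 + (py - cy) ** 2
        if best is None or d2 < best[0]:
            best = (d2, cx, cy)
    return best[1], best[2]                       # center of the point's hexagon

print(hex_bin(0.1, 0.1))  # the hexagon centered at the origin: (0.0, 0.0)
```

The per-point cost is a small constant multiple of 1-D binning, consistent with the slide's ratio argument.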
Geometry-based Quantization
• In 3 dimensions
  – For the truncated octahedron, there are 3 pairs of square sides and 4 pairs of hexagonal sides
  – Computational complexity is 28n + 7 = O(n)
  – Computational complexity for a cube is 12n + 3
  – Ratio is 7/3
  – Storage complexity is still 3k
Quantization Strategies
• Optimally, for purposes of minimizing distortion, use the roundest polytope in d dimensions
  – Complexity is always O(n)
  – Storage complexity is 3k
  – # tiles grows exponentially with dimension, the so-called curse of dimensionality
  – Higher-dimensional geometry is poorly known
  – Computational complexity grows faster than for the hypercube
Quantization Strategies
• For purposes of simplicity, always use the hypercube or d-dimensional simplices
  – Computational complexity is always O(n)
  – Methods for data-adaptive tiling are available
  – Storage complexity is 3k
  – # tiles grows exponentially with dimension
  – Both polytopes depart from spherical shape rapidly as d increases
  – The hypercube approach is known as the datacube in the computer science literature and is closely related to multivariate histograms in the statistical literature
Quantization Strategies
• Conclusions on Geometric Quantization
  – Geometric approach good to 4 or 5 dimensions
  – Adaptive tilings may improve the rate at which # tiles grows, but probably destroy spherical structure
  – Good for large n, but weaker for large d
Quantization Strategies
• Alternate Strategy
  – Form bins via clustering
    • Known in the electrical engineering literature as vector quantization
    • Distance-based clustering is O(n²), which implies poor performance for large n
    • Not terribly dependent on dimension, d
    • Clusters may be very out of round, not even convex
  – Conclusion
    • Cluster approach may work for large d, but fails for large n
    • Not particularly applicable to “massive” data mining
Quantization Strategies
• Third strategy
  – Density-based clustering
    • Density estimation with kernel estimators is O(n)
    • Uses modes m_α to form clusters
    • Put x_i in cluster α if it is closest to mode m_α
    • This procedure is distance based, but with complexity O(kn), not O(n²)
    • Normal mixture densities may be an alternative approach
    • Roundness may be a problem
  – But quantization based on density-based clustering offers promise for both large d and large n
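The mode-based scheme above can be sketched in one dimension: estimate the density with a Gaussian kernel on a grid, take the local maxima as the modes m_α, and assign each x_i to its nearest mode. The data, grid, and bandwidth are illustrative choices, not a data-driven procedure.

```python
import numpy as np

# 1-D sketch of density-based clustering: kernel density on a grid, local
# maxima as the modes m_alpha, then nearest-mode assignment, which is O(kn)
# for k modes. Two well-separated synthetic clusters for illustration.
rng = np.random.default_rng(5)
x = np.concatenate([rng.normal(-3, 0.5, 500), rng.normal(3, 0.5, 500)])

grid = np.linspace(x.min(), x.max(), 400)
h = 0.3  # kernel bandwidth (an assumed, not data-driven, choice)
dens = np.exp(-0.5 * ((grid[:, None] - x[None, :]) / h) ** 2).sum(axis=1)

interior = (dens[1:-1] > dens[:-2]) & (dens[1:-1] > dens[2:])
modes = grid[1:-1][interior]                      # the m_alpha

labels = np.argmin(np.abs(x[:, None] - modes[None, :]), axis=1)  # O(kn)
print(len(modes), np.bincount(labels))
```

With two clearly separated components the grid search finds two modes and the nearest-mode rule splits the data at the midpoint between them.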
Data Quantization
• Binning does not lose fine structure in the tails, as sampling might
• Roundoff analysis applies
• At this scale of binning, discretization is not likely to be much less accurate than the accuracy of the recorded data
• Discretization (a finite number of bins) implies discrete variables, more compatible with categorical data
Data Quantization
• Analysis on a finite subset of the integers has theoretical advantages
  – Analysis is less delicate
    • different forms of convergence are equivalent
  – Analysis is often more natural, since the data are already quantized or categorical
  – Graphical analysis of numerical data is not much changed, since 10^6 pixels is at the limit of the human visual system (HVS)