Nonlinear Dimensionality Reduction, John A. Lee, Michel Verleysen, Chapter 3
Estimation of the Intrinsic Dimension
Amirkabir University of Technology (Tehran Polytechnic)
Overview
Introduce the concept of intrinsic dimension along with several techniques that can estimate it
Estimators based on fractal geometry
Estimators related to PCA
Trial-and-error approach
q-dimension
The support of μ is covered with a (multidimensional) grid of cubes with edge length ε.
Let N(ε) be the number of cubes that intersect the support of μ.
Let the natural measures of these cubes be p1, p2, . . . , pN(ε).
Each pi may be seen as the probability that the corresponding cube is populated.
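With these definitions, the q-dimension can be stated compactly (the standard Rényi form, written here with the pi and ε defined above):

```latex
D_q = \frac{1}{q-1} \lim_{\varepsilon \to 0}
      \frac{\log \sum_{i=1}^{N(\varepsilon)} p_i^q}{\log \varepsilon}
```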
q-dimension (Cont.)
For q ≥ 0, q ≠ 1, these limits do not depend on the choice of the ε-grid and give the same values.
Capacity dimension
Setting q equal to zero
In this definition, dcap does not depend on the natural measures pi
dcap is also known as the ‘box-counting’ dimension
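Setting q = 0 in the q-dimension and using Σi pi^0 = N(ε) gives the capacity dimension, which indeed involves only the box count N(ε):

```latex
d_{\mathrm{cap}} = D_0
  = -\lim_{\varepsilon \to 0} \frac{\log N(\varepsilon)}{\log \varepsilon}
  = \lim_{\varepsilon \to 0} \frac{\log N(\varepsilon)}{\log (1/\varepsilon)}
```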
Capacity dimension (Cont.)
When the manifold is not known analytically and only a few data points are available, the capacity dimension is quite easy to estimate:
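A minimal box-counting sketch in Python (the grid placement, the choice of scale range, and the use of numpy are assumptions made here for illustration): count the occupied grid cells N(ε) at several edge lengths and fit the slope of log N(ε) against log(1/ε).

```python
import numpy as np

def box_counting_dimension(points, epsilons):
    """Estimate the capacity (box-counting) dimension of a point cloud.

    For each edge length eps, count the grid cells N(eps) occupied by
    at least one point, then fit the slope of log N(eps) versus
    log(1/eps), which approximates d_cap.
    """
    points = np.asarray(points, dtype=float)
    counts = []
    for eps in epsilons:
        # Assign every point to a grid cell; the occupied cells are the
        # cubes that intersect the (sampled) support.
        cells = np.floor(points / eps).astype(int)
        counts.append(len({tuple(c) for c in cells}))
    slope, _ = np.polyfit(np.log(1.0 / np.asarray(epsilons)), np.log(counts), 1)
    return slope
```

Fitting a slope over a finite range of ε sidesteps the limit ε → 0, which cannot be reached with finitely many points.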
Intuitive interpretation of the capacity dimension
Assume a three-dimensional space divided into small cubic boxes with a fixed edge length ε.
The number of occupied boxes grows:
for a growing one-dimensional object, proportionally to the object length;
for a growing two-dimensional object, proportionally to the object surface;
for a growing three-dimensional object, proportionally to the object volume.
Generalizing to a P-dimensional object like a P-manifold embedded in RD
Correlation dimension
Setting q equal to two.
The term 'correlation' refers to the fact that the probabilities or natural measures pi are squared.
Correlation dimension (Cont.)
C2(ε) counts the pairs of points lying closer to each other than a certain threshold ε.
This number grows as a length for a 1D object, as a surface for a 2D object, as a volume for a 3D object, and so forth.
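In the standard (Grassberger–Procaccia) form, with H the Heaviside step function, the correlation sum and the resulting dimension read:

```latex
\hat{C}_2(\varepsilon) = \frac{2}{N(N-1)} \sum_{i<j}
    H\!\left(\varepsilon - \|\mathbf{y}(i) - \mathbf{y}(j)\|\right),
\qquad
d_{\mathrm{cor}} = \lim_{\varepsilon \to 0}
    \frac{\log C_2(\varepsilon)}{\log \varepsilon}
```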
Correlation dimension (Cont.)
When the manifold or fractal object is only known through a countable set of points, C2(ε) is estimated from those points.
Practical estimation
When the available knowledge is a finite number of points, the capacity and correlation dimensions must be estimated from the sample.
However, computing the limit ε → 0 is impossible in practice.
Practical estimation (Cont.)
The slope of the curve is almost constant between ε1 ≈ exp(−6) ≈ 0.0025 and ε2 ≈ exp(0) = 1.
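The practical recipe above (evaluate C2(ε) on a range of scales and read the slope of the log-log curve where it is almost constant) can be sketched as a brute-force O(N²) computation; numpy is assumed:

```python
import numpy as np

def correlation_dimension(points, eps_values):
    """Estimate d_cor as the slope of log C2(eps) versus log eps.

    C2(eps) is the fraction of point pairs closer than eps; on the
    range of scales where the log-log slope is constant, that slope
    approximates the correlation dimension.
    """
    points = np.asarray(points, dtype=float)
    n = len(points)
    # All pairwise Euclidean distances with i < j.
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(-1))[np.triu_indices(n, k=1)]
    c2 = np.array([(dists < eps).mean() for eps in eps_values])
    slope, _ = np.polyfit(np.log(eps_values), np.log(c2), 1)
    return slope
```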
Dimension estimators based on PCA
The model of PCA is linear
The estimator works only for manifolds containing linear dependencies (linear subspaces).
For more complex manifolds, PCA gives at best an estimate of the global dimensionality of an object (e.g., 2D for a spiral manifold: a macroscopic effect).
Local Methods
Decomposing the space into small patches, or “space windows”
Example: a nonlinear generalization of PCA:
1. Windows are determined by clustering the data (vector quantization).
2. PCA is carried out locally, on each space window.
3. A weighted average is computed over the localities.
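The three steps above can be sketched as follows (a basic Lloyd/k-means loop stands in for the vector quantization step, and the 0.98 variance fraction is one of the fractions used in the experiments; numpy assumed):

```python
import numpy as np

def local_pca_dimension(points, n_windows, var_fraction=0.98, seed=0):
    """Sketch: intrinsic dimension estimated by local PCA.

    1. Split the data into space windows with a basic k-means loop
       (a stand-in for vector quantization).
    2. Run PCA in each window; count the components needed to reach
       var_fraction of the local variance.
    3. Average the local counts, weighted by window size.
    """
    rng = np.random.default_rng(seed)
    points = np.asarray(points, dtype=float)
    centers = points[rng.choice(len(points), n_windows, replace=False)]
    for _ in range(20):                    # a few Lloyd iterations
        labels = ((points[:, None] - centers) ** 2).sum(-1).argmin(axis=1)
        for c in range(n_windows):
            if np.any(labels == c):
                centers[c] = points[labels == c].mean(axis=0)
    labels = ((points[:, None] - centers) ** 2).sum(-1).argmin(axis=1)
    dims, weights = [], []
    for c in range(n_windows):
        window = points[labels == c]
        if len(window) <= points.shape[1]:  # too few samples for a local PCA
            continue
        eig = np.sort(np.linalg.eigvalsh(np.cov(window.T)))[::-1]
        cum = np.cumsum(eig) / eig.sum()
        dims.append(np.searchsorted(cum, var_fraction) + 1)
        weights.append(len(window))
    return np.average(dims, weights=weights)
```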
The fraction of the total variance spanned by the first principal component of each cluster or space window, and the corresponding dimensionality (computed by piecewise linear interpolation) for three variance fractions (0.97, 0.98, and 0.99).
Properties
The dimension given by local PCA is scale-dependent, like the correlation dimension.
A low number of space windows -> large windows -> macroscopic structure of the spiral (2D).
An optimal number of windows -> small pieces of the spiral (1D).
A high number of space windows -> too small windows -> noise scale (2D).
Properties (Cont.)
Local PCA requires more data samples to yield an accurate estimate (it divides the manifold into non-overlapping patches).
If PCA is repeated for many different numbers of space windows, the computation time grows.
Trial and error
1. For a manifold embedded in a D-dimensional space, reduce the dimensionality successively to P = 1, 2, ..., D.
2. Plot Ecodec as a function of P.
3. Choose a threshold, and determine the lowest value of P such that Ecodec goes below it (an elbow).
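The procedure can be sketched with PCA standing in as the codec (an assumed choice for illustration; any DR method providing a reconstruction error Ecodec fits the same loop; numpy assumed):

```python
import numpy as np

def elbow_dimension(Y, threshold=0.05):
    """Sketch of the trial-and-error estimator.

    PCA is used here as a stand-in codec: for each target dimension P,
    project onto the top P principal components and take the normalized
    reconstruction error as E_codec. The estimate is the smallest P
    whose error falls below the threshold.
    """
    Y = np.asarray(Y, dtype=float)
    Yc = Y - Y.mean(axis=0)
    U, s, Vt = np.linalg.svd(Yc, full_matrices=False)
    total = (s ** 2).sum()
    # Energy outside the top P components = reconstruction error.
    errors = [(s[P:] ** 2).sum() / total for P in range(1, Y.shape[1] + 1)]
    for P, e in enumerate(errors, start=1):
        if e < threshold:
            return P, errors
    return Y.shape[1], errors
```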
Additional refinement
Using statistical estimation methods like cross validation or bootstrapping:
Ecodec is computed by dimensionality reduction on several subsets that are randomly drawn from the available data.
This results in a better estimation of the reconstruction errors, and therefore in a more faithful estimation of the dimensionality at the elbow.
Huge computational requirements.
Comparisons: Data Set
10D data set; intrinsic dimension: 3.
100, 1000, and 10,000 observations.
White Gaussian noise with standard deviation 0.01.
PCA estimator
The number of observations does not greatly influence the results.
Nonlinear dependencies remain hidden in the data sets.
Correlation Dimension
Much more sensitive to the number of available observations.
The correlation dimension is much slower than PCA but yields higher-quality results.
Edge effects appear: the dimensionality is slightly underestimated
The noise dimensionality appears more clearly as the number of observations grows.
Local PCA estimator
For large windows, the estimate reflects the nonlinear shape of the underlying manifold.
Local PCA estimator (Cont.)
When the windows are too small, samples become sparse:
PCA is no longer reliable, because the windows do not contain enough points.
Local PCA estimator (Cont.)
Local PCA yields the right dimensionality.
The largest three normalized eigenvalues remain high for any number of windows, while the fourth and subsequent ones are negligible.
It is noteworthy that for a single window the result of local PCA is trivially the same as for PCA applied globally, but as the number of windows increases, the fourth normalized eigenvalue slowly decreases.
Local PCA is obviously much slower than global PCA, but still faster than the correlation dimension.
Trial and error
The number of points does not play an important role.
The DR method slightly overestimates the dimensionality.
Although the method relies on a nonlinear model, the manifold may still be too curved to achieve a perfect embedding in a space having the same dimension as the exact manifold dimensionality.
The overestimation observed for PCA does not disappear but is only attenuated when switching to an NLDR method.
Concluding remarks
PCA applied globally on the whole data set remains the simplest and fastest one.
Its results are not very convincing: the dimension is almost always overestimated if data do not perfectly fit the PCA model.
The method relying on a nonlinear model (trial and error) is very slow.
The overestimation that was observed with PCA does not disappear totally.
Concluding remarks
Local PCA runs fast if the number of windows does not sweep a wide interval.
local PCA has given the right dimensionality for the studied data sets.
The correlation dimension clearly appears as the best method to estimate the intrinsic dimensionality.
It is not the fastest of the four methods, but its results are the best and most detailed ones, giving the dimension on all scales.
Nonlinear Dimensionality Reduction, John A. Lee, Michel Verleysen, Chapter 4
Distance Preservation
The motivation behind distance preservation is that any manifold can be fully described by pairwise distances.
Preserving geometrical structure
Outline
Metric space & most common distance measures
Metric multidimensional scaling
Geodesic and graph distances
Nonlinear DR methods
Spatial distances: Metric space
A space Y with a distance function d(a, b) between two points a, b ∈ Y is said to be a metric space if the distance function respects the following axioms:
Nondegeneracy: d(a, b) = 0 if and only if a = b.
Triangular inequality: d(a, b) ≤ d(c, a) + d(c, b).
Nonnegativity: d(a, b) ≥ 0.
Symmetry: d(a, b) = d(b, a).
In the usual Cartesian vector space RD, the most-used distance functions are derived from the Minkowski norm
Manhattan distance (p = 1), Euclidean distance (p = 2), dominance distance (p = ∞).
Mahalanobis distance: a direct generalization of the Euclidean distance.
d_p(a, b) = ‖a − b‖_p = (Σi |ai − bi|^p)^(1/p)
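The three special cases of the Minkowski distance can be checked in a few lines (numpy assumed):

```python
import numpy as np

def minkowski(a, b, p):
    """Minkowski distance d_p(a, b): p = 1 gives the Manhattan distance,
    p = 2 the Euclidean distance, p = inf the dominance distance."""
    diff = np.abs(np.asarray(a, dtype=float) - np.asarray(b, dtype=float))
    return diff.max() if np.isinf(p) else (diff ** p).sum() ** (1.0 / p)
```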
Metric multidimensional scaling
Classical metric MDS is not a true distance-preserving method: it preserves pairwise scalar products instead of pairwise distances (both are closely related).
It is not a nonlinear DR method.
Instead of pairwise distances, pairwise "similarities" can be used.
When the distances are Euclidean, metric MDS is equivalent to PCA.
Metric MDS
Generative model: y = Wx, where the components of x are independent or uncorrelated.
W is a D-by-P matrix such that W^T W = I_P (orthonormal columns, the usual MDS/PCA convention).
Scalar product between observations: s(i, j) = y(i)^T y(j).
Both Y and X are unknown; only the matrix S = Y^T Y of pairwise scalar products, the Gram matrix, is given.
Metric MDS (Cont.)
Eigenvalue decomposition of the Gram matrix: S = U Λ U^T.
P-dimensional latent variables: X̂ = I_{P×N} Λ^{1/2} U^T.
Criterion of metric MDS.
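These two steps (eigendecompose S, then scale the leading eigenvectors) can be sketched as follows; numpy is assumed, and the columns of the returned matrix are the latent coordinates, matching X̂ = I_{P×N} Λ^{1/2} U^T:

```python
import numpy as np

def metric_mds(S, p):
    """Metric MDS from the N-by-N Gram matrix S of pairwise scalar
    products: eigendecompose S = U Lambda U^T and keep the top p
    eigenvectors scaled by the square roots of their eigenvalues."""
    vals, vecs = np.linalg.eigh(S)      # eigenvalues in ascending order
    idx = np.argsort(vals)[::-1][:p]    # top p eigenpairs
    scale = np.sqrt(np.maximum(vals[idx], 0.0))
    # Columns of the returned p-by-N matrix are the latent coordinates.
    return (vecs[:, idx] * scale).T
```

When S has rank at most p, the reconstructed scalar products X̂^T X̂ reproduce S exactly.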
Metric MDS (Cont.)
Metric MDS and PCA give the same solution.
When the data consist of distances or similarities that prevent applying PCA -> metric MDS.
When the coordinates are known, PCA spends fewer memory resources than MDS.
Experiments
Geodesic distance
Assuming that very short Euclidean distances are preserved
Longer Euclidean distances are considerably stretched.
Measuring the distance along the manifold and not through the embedding space
Geodesic distance (Cont.)
Distance along a manifold: in the case of a one-dimensional manifold M, which depends on a single latent variable x.
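For such a one-dimensional manifold, the distance along M between two points m(x_i) and m(x_j) is the arc length (the standard formula, written here with m the parametric equation of M):

```latex
\delta\big(\mathbf{m}(x_i), \mathbf{m}(x_j)\big)
  = \int_{x_i}^{x_j} \left\| \frac{d\mathbf{m}(x)}{dx} \right\| dx
```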
Geodesic distance: multidimensional manifold
Geodesic distance (Cont.)
The integral then has to be minimized over all possible paths that connect the starting and ending points.
Such a minimization is intractable since it is a functional minimization.
In any case, the parametric equations of M (and P) are unknown; only some (noisy) points of M are available.
Graph distance
Lack of analytical information -> reformulation of the problem.
Instead of minimizing an arc length between two points on a manifold, minimize the length of a path (i.e., a broken line).
The path should be constrained to follow the underlying manifold.
In order to obtain a good approximation of the true arc length, a fine discretization of the manifold is needed.
Only the smallest jumps will be permitted (K-rule, ε-rule).
Graph distance (Cont.)
How to compute the shortest paths in a weighted graph? Dijkstra's algorithm.
It is proved that the graph distance approximates the true geodesic distance in an appropriate way.
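A compact sketch of the K-rule graph construction and of Dijkstra's algorithm (standard-library heapq for the priority queue; numpy assumed for the distance matrix):

```python
import heapq
import numpy as np

def knn_graph(points, k):
    """Adjacency lists of the K-rule graph: each point is linked to its
    k nearest Euclidean neighbors, edges weighted by Euclidean length."""
    points = np.asarray(points, dtype=float)
    d = np.sqrt(((points[:, None] - points[None, :]) ** 2).sum(-1))
    graph = [[] for _ in range(len(points))]
    for i in range(len(points)):
        for j in np.argsort(d[i])[1:k + 1]:   # skip the point itself
            graph[i].append((j, d[i, j]))
            graph[j].append((i, d[i, j]))     # keep the graph symmetric
    return graph

def dijkstra(graph, source):
    """Single-source shortest paths; the resulting graph distances
    approximate geodesic distances along the sampled manifold."""
    dist = [float("inf")] * len(graph)
    dist[source] = 0.0
    heap = [(0.0, source)]
    while heap:
        du, u = heapq.heappop(heap)
        if du > dist[u]:
            continue                          # stale queue entry
        for v, w in graph[u]:
            if du + w < dist[v]:
                dist[v] = du + w
                heapq.heappush(heap, (dist[v], v))
    return dist
```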
Isomap
Isomap is a NLDR method that uses the graph distance as an approximation of the geodesic distance.
Isomap inherits one of the major shortcomings of MDS: a very rigid model.
If the distances in D are not Euclidean, it is implicitly assumed that the replacement metric yields distances equal to Euclidean distances measured in some transformed hyperplane.
Isomap Algorithm
Intrinsic dimensionality
x̂(i) is the ith column of X̂ = I_{P×N} Λ^{1/2} U^T.
An elbow indicates the right dimensionality
Experiments
The first two eigenvalues clearly dominate the others.
Experiments
The open box is not a developable manifold and Isomap does not embed it in a satisfying way.
The first three eigenvalues dominate the others.
Like MDS, Isomap does not succeed in detecting that the intrinsic dimensionality of the box is two.
Kernel PCA
The first idea of KPCA consists of reformulating the PCA into its metric MDS equivalent.
KPCA works as metric MDS, i.e., with the matrix of pairwise scalar products S = YTY.
The second idea of KPCA is to “linearize” the underlying manifold M.
Kernel PCA (Cont.)
As a unique hypothesis, KPCA assumes that the mapping φ is such that the mapped data span a linear subspace of the Q-dimensional space, with Q > D.
KPCA thus starts by increasing the data dimensionality!
Kernel PCA (Cont.)
Choose the mapping φ.
Compute pairwise scalar products for the mapped data and store them in the N-by-N matrix Φ.
The symmetric matrix Φ has to be decomposed in eigenvalues and eigenvectors.
This operation will not yield the expected result unless Φ is positive semidefinite.
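These steps can be sketched with a Gaussian kernel (the kernel choice and its width are assumptions made here for illustration; the double-centering step makes Φ behave like a matrix of centered scalar products; numpy assumed):

```python
import numpy as np

def kernel_pca(Y, p, width=1.0):
    """Sketch of KPCA with a Gaussian kernel of the given width.

    Phi[i, j] = exp(-||y(i) - y(j)||^2 / (2 width^2)) plays the role of
    the scalar products of the mapped data; it is double-centered and
    then eigendecomposed exactly as in metric MDS.
    """
    Y = np.asarray(Y, dtype=float)
    sq = ((Y[:, None] - Y[None, :]) ** 2).sum(-1)
    Phi = np.exp(-sq / (2.0 * width ** 2))
    n = len(Y)
    J = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    Phic = J @ Phi @ J                    # center the mapped data implicitly
    vals, vecs = np.linalg.eigh(Phic)
    idx = np.argsort(vals)[::-1][:p]      # top p eigenpairs
    return vecs[:, idx] * np.sqrt(np.maximum(vals[idx], 0.0))
```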
Experiments: kernel PCA
It aims at embedding the manifold into a space where an MDS-based projection would be more successful than in the initial space.
No guarantee is provided that this goal can be reached.
Experiments: kernel PCA (Cont.)
Tuning the parameters of the kernel is tedious
In other methods using an EVD, like metric MDS and Isomap, the variance remains concentrated within the first three eigenvalues, whereas KPCA spreads it out in most cases.
In order to concentrate the variance within a minimal number of eigenvalues, the width of the Gaussian kernel may be increased, but then the benefit of using a kernel is lost and KPCA tends to yield the same result as metric MDS: a linear projection.
Advantages and drawbacks
KPCA can deal with nonlinear manifolds. And actually, the theory hidden behind KPCA is a beautiful and powerful work of art.
KPCA is not used much in dimensionality reduction. The reasons are that the method is not motivated by
geometrical arguments and the geometrical interpretation of the various kernels remains difficult.
Gaussian kernels
The main difficulty in KPCA, as highlighted in the example, is the choice of an appropriate kernel along with the right values for its parameters.
LLE
LLE Step 1
Suppose the data consist of N real-valued vectors y(i), each of dimensionality D, sampled from some underlying manifold.
We expect each data point and its neighbors to lie on or close to a locally linear patch of the manifold.
The idea of LLE is to replace each point y(i) with a linear combination of its neighbors.
LLE Step 2
Characterize the local geometry of these patches by linear coefficients that reconstruct each data point from its neighbors.
Reconstruction errors are measured by the cost function E(W) = Σ_i ‖y(i) − Σ_j W_ij y(j)‖².
LLE Step 2 (Cont.)
Minimize the cost function subject to two constraints
For any particular data point y(i), they are invariant to rotations, rescalings, and translations of that data point and its neighbors.
The reconstruction weights reflect intrinsic geometric properties of the data that are invariant to exactly such transformations.
The two constraints: W_ij = 0 whenever y(j) is not among the neighbors of y(i), and Σ_j W_ij = 1.
LLE Step 3
Each high-dimensional observation y(i) is mapped to a low-dimensional vector representing global internal coordinates on the manifold.
This is done by choosing p-dimensional coordinate to minimize the embedding cost function
This cost function, like the previous one, is based on locally linear reconstruction errors, but here we fix the weights while optimizing the coordinates.
Φ(X̂) = Σ_i ‖x̂(i) − Σ_j W_ij x̂(j)‖²
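The three LLE steps can be sketched end to end (a plain dense implementation; the regularization constant is an assumed tuning parameter; numpy assumed):

```python
import numpy as np

def lle(Y, k, p, reg=1e-3):
    """Sketch of LLE with k neighbors and target dimension p.

    Step 2: solve for the weights that reconstruct each point from its
    k nearest neighbors (regularized least squares, weights sum to 1).
    Step 3: embed by taking the bottom eigenvectors of (I-W)^T (I-W),
    discarding the constant one.
    """
    Y = np.asarray(Y, dtype=float)
    n = len(Y)
    d = np.sqrt(((Y[:, None] - Y[None, :]) ** 2).sum(-1))
    W = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(d[i])[1:k + 1]           # k nearest neighbors
        G = (Y[nbrs] - Y[i]) @ (Y[nbrs] - Y[i]).T  # local Gram matrix
        G += reg * np.trace(G) * np.eye(k)         # regularization (assumed)
        w = np.linalg.solve(G, np.ones(k))
        W[i, nbrs] = w / w.sum()                   # enforce sum-to-one
    M = (np.eye(n) - W).T @ (np.eye(n) - W)
    vals, vecs = np.linalg.eigh(M)                 # ascending eigenvalues
    return vecs[:, 1:p + 1]                        # skip the constant eigenvector
```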
Experiments
Once the parameters are correctly set, the embedding looks rather good: there are no tears, and the box is deformed smoothly, without superpositions.
The only problem for the open box is that at least one lateral face is completely crushed.
Any questions?