Nonlinear Dimensionality Reduction, John A. Lee, Michel Verleysen, Chapter 3
Estimation of the Intrinsic Dimension
Amirkabir University of Technology (Tehran Polytechnic)
Overview
Introduce the concept of intrinsic dimension along with several techniques that can estimate it
Estimators based on fractal geometry
Estimators related to PCA
Trial-and-error approach
q-dimension
The support of μ is covered with a (multidimensional) grid of cubes with edge length ε.
Let N(ε) be the number of cubes that intersect the support of μ.
Let the natural measures of these cubes be p1, p2, . . . , pN(ε).
Each pi may be seen as the probability that the corresponding cube is populated.
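With these definitions, the q-dimension can be stated compactly (the standard Rényi form, written here with the pi and ε defined above):

```latex
D_q = \frac{1}{q-1} \lim_{\varepsilon \to 0}
      \frac{\log \sum_{i=1}^{N(\varepsilon)} p_i^q}{\log \varepsilon}
```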
q-dimension (Cont.)
For q ≥ 0, q ≠ 1, these limits do not depend on the choice of the ε-grid and give the same values.
Capacity dimension
Setting q equal to zero
In this definition, dcap does not depend on the natural measures pi
dcap is also known as the ‘box-counting’ dimension
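Setting q = 0 in the q-dimension and using Σi pi^0 = N(ε) gives the capacity dimension, which indeed involves only the box count N(ε):

```latex
d_{\mathrm{cap}} = D_0
  = -\lim_{\varepsilon \to 0} \frac{\log N(\varepsilon)}{\log \varepsilon}
  = \lim_{\varepsilon \to 0} \frac{\log N(\varepsilon)}{\log (1/\varepsilon)}
```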
Capacity dimension (Cont.)
When the manifold is not known analytically and only a few data points are available, the capacity dimension is quite easy to estimate:
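A minimal box-counting sketch in Python (the grid placement, the choice of scale range, and the use of numpy are assumptions made here for illustration): count the occupied grid cells N(ε) at several edge lengths and fit the slope of log N(ε) against log(1/ε).

```python
import numpy as np

def box_counting_dimension(points, epsilons):
    """Estimate the capacity (box-counting) dimension of a point cloud.

    For each edge length eps, count the grid cells N(eps) occupied by
    at least one point, then fit the slope of log N(eps) versus
    log(1/eps), which approximates d_cap.
    """
    points = np.asarray(points, dtype=float)
    counts = []
    for eps in epsilons:
        # Assign every point to a grid cell; the occupied cells are the
        # cubes that intersect the (sampled) support.
        cells = np.floor(points / eps).astype(int)
        counts.append(len({tuple(c) for c in cells}))
    slope, _ = np.polyfit(np.log(1.0 / np.asarray(epsilons)), np.log(counts), 1)
    return slope
```

Fitting a slope over a finite range of ε sidesteps the limit ε → 0, which cannot be reached with finitely many points.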
Intuitive interpretation of the capacity dimension
Assume a three-dimensional space divided into small cubic boxes with a fixed edge length ε.
The number of occupied boxes grows:
for a growing one-dimensional object, proportionally to the object length;
for a growing two-dimensional object, proportionally to the object surface;
for a growing three-dimensional object, proportionally to the object volume.
Generalizing to a P-dimensional object like a P-manifold embedded in RD
Correlation dimension
Setting q equal to two.
The term 'correlation' refers to the fact that the probabilities or natural measures pi are squared.
Correlation dimension (Cont.)
C2(ε) counts the pairs of points lying closer to each other than a certain threshold ε.
This number grows as a length for a 1D object, as a surface for a 2D object, as a volume for a 3D object, and so forth.
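In the standard (Grassberger–Procaccia) form, with H the Heaviside step function, the correlation sum and the resulting dimension read:

```latex
\hat{C}_2(\varepsilon) = \frac{2}{N(N-1)} \sum_{i<j}
    H\!\left(\varepsilon - \|\mathbf{y}(i) - \mathbf{y}(j)\|\right),
\qquad
d_{\mathrm{cor}} = \lim_{\varepsilon \to 0}
    \frac{\log C_2(\varepsilon)}{\log \varepsilon}
```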
Correlation dimension (Cont.)
When the manifold or fractal object is only known through a countable set of points, C2(ε) is estimated from those points.
Practical estimation
When the available knowledge is a finite number of points, the capacity and correlation dimensions must be estimated from the sample.
However, computing the limit ε → 0 is impossible in practice.
Practical estimation (Cont.)
The slope of the curve is almost constant between ε1 ≈ exp(−6) ≈ 0.0025 and ε2 ≈ exp(0) = 1.
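The practical recipe above (evaluate C2(ε) on a range of scales and read the slope of the log-log curve where it is almost constant) can be sketched as a brute-force O(N²) computation; numpy is assumed:

```python
import numpy as np

def correlation_dimension(points, eps_values):
    """Estimate d_cor as the slope of log C2(eps) versus log eps.

    C2(eps) is the fraction of point pairs closer than eps; on the
    range of scales where the log-log slope is constant, that slope
    approximates the correlation dimension.
    """
    points = np.asarray(points, dtype=float)
    n = len(points)
    # All pairwise Euclidean distances with i < j.
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(-1))[np.triu_indices(n, k=1)]
    c2 = np.array([(dists < eps).mean() for eps in eps_values])
    slope, _ = np.polyfit(np.log(eps_values), np.log(c2), 1)
    return slope
```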
Dimension estimators based on PCA
The model of PCA is linear
The estimator works only for manifolds containing linear dependencies (linear subspaces).
For more complex manifolds, PCA gives at best an estimate of the global dimensionality of an object (e.g., 2D for a spiral manifold: a macroscopic effect).
Local Methods
Decomposing the space into small patches, or “space windows”
Example: a nonlinear generalization of PCA:
1. Windows are determined by clustering the data (vector quantization).
2. PCA is carried out locally, on each space window.
3. A weighted average is computed over the localities.
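The three steps above can be sketched as follows (a basic Lloyd/k-means loop stands in for the vector quantization step, and the 0.98 variance fraction is one of the fractions used in the experiments; numpy assumed):

```python
import numpy as np

def local_pca_dimension(points, n_windows, var_fraction=0.98, seed=0):
    """Sketch: intrinsic dimension estimated by local PCA.

    1. Split the data into space windows with a basic k-means loop
       (a stand-in for vector quantization).
    2. Run PCA in each window; count the components needed to reach
       var_fraction of the local variance.
    3. Average the local counts, weighted by window size.
    """
    rng = np.random.default_rng(seed)
    points = np.asarray(points, dtype=float)
    centers = points[rng.choice(len(points), n_windows, replace=False)]
    for _ in range(20):                    # a few Lloyd iterations
        labels = ((points[:, None] - centers) ** 2).sum(-1).argmin(axis=1)
        for c in range(n_windows):
            if np.any(labels == c):
                centers[c] = points[labels == c].mean(axis=0)
    labels = ((points[:, None] - centers) ** 2).sum(-1).argmin(axis=1)
    dims, weights = [], []
    for c in range(n_windows):
        window = points[labels == c]
        if len(window) <= points.shape[1]:  # too few samples for a local PCA
            continue
        eig = np.sort(np.linalg.eigvalsh(np.cov(window.T)))[::-1]
        cum = np.cumsum(eig) / eig.sum()
        dims.append(np.searchsorted(cum, var_fraction) + 1)
        weights.append(len(window))
    return np.average(dims, weights=weights)
```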
The fraction of the total variance spanned by the first principal component of each cluster or space window, and the corresponding dimensionality (computed by piecewise linear interpolation) for three variance fractions (0.97, 0.98, and 0.99).
Properties
The dimension given by local PCA is scale-dependent, like the correlation dimension.
A low number of space windows -> large windows -> macroscopic structure of the spiral (2D).
An optimal number of windows -> small pieces of the spiral (1D).
A high number of space windows -> too small windows -> noise scale (2D).
Properties (Cont.)
Local PCA requires more data samples to yield an accurate estimate (it divides the manifold into non-overlapping patches).
If PCA is repeated for many different numbers of space windows, the computation time grows.
Trial and error
1. For a manifold embedded in a D-dimensional space, reduce the dimensionality successively to P = 1, 2, ..., D.
2. Plot Ecodec as a function of P.
3. Choose a threshold, and determine the lowest value of P such that Ecodec goes below it (an elbow).
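The procedure can be sketched with PCA standing in as the codec (an assumed choice for illustration; any DR method providing a reconstruction error Ecodec fits the same loop; numpy assumed):

```python
import numpy as np

def elbow_dimension(Y, threshold=0.05):
    """Sketch of the trial-and-error estimator.

    PCA is used here as a stand-in codec: for each target dimension P,
    project onto the top P principal components and take the normalized
    reconstruction error as E_codec. The estimate is the smallest P
    whose error falls below the threshold.
    """
    Y = np.asarray(Y, dtype=float)
    Yc = Y - Y.mean(axis=0)
    U, s, Vt = np.linalg.svd(Yc, full_matrices=False)
    total = (s ** 2).sum()
    # Energy outside the top P components = reconstruction error.
    errors = [(s[P:] ** 2).sum() / total for P in range(1, Y.shape[1] + 1)]
    for P, e in enumerate(errors, start=1):
        if e < threshold:
            return P, errors
    return Y.shape[1], errors
```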
Additional refinement
Using statistical estimation methods like cross validation or bootstrapping:
Ecodec is computed by dimensionality reduction on several subsets that are randomly drawn from the available data.
This results in a better estimation of the reconstruction errors, and therefore in a more faithful estimation of the dimensionality at the elbow.
Huge computational requirements.
Comparisons: Data Set
10D data set; intrinsic dimension: 3.
100, 1000, and 10,000 observations.
White Gaussian noise with standard deviation 0.01.
PCA estimator
The number of observations does not greatly influence the results.
Nonlinear dependencies remain hidden in the data sets.
Correlation Dimension
Much more sensitive to the number of available observations.
The correlation dimension is much slower than PCA but yields higher-quality results.
Edge effects appear: the dimensionality is slightly underestimated
The noise dimensionality appears more clearly as the number of observations grows.
Local PCA estimator
For large windows, the estimate reflects the nonlinear shape of the underlying manifold.
Local PCA estimator (Cont.)
When the windows are too small, samples become sparse:
PCA is no longer reliable, because the windows do not contain enough points.
Local PCA estimator (Cont.)
Local PCA yields the right dimensionality.
The largest three normalized eigenvalues remain high for any number of windows, while the fourth and subsequent ones are negligible.
It is noteworthy that for a single window the result of local PCA is trivially the same as for PCA applied globally, but as the number of windows increases, the fourth normalized eigenvalue slowly decreases.
Local PCA is obviously much slower than global PCA, but still faster than the correlation dimension.
Trial and error
The number of points does not play an important role.
The DR method slightly overestimates the dimensionality.
Although the method relies on a nonlinear model, the manifold may still be too curved to achieve a perfect embedding in a space having the same dimension as the exact manifold dimensionality.
The overestimation observed for PCA does not disappear but is only attenuated when switching to an NLDR method.
Concluding remarks
PCA applied globally on the whole data set remains the simplest and fastest one.
Its results are not very convincing: the dimension is almost always overestimated if data do not perfectly fit the PCA model.
The method relying on a nonlinear model (trial and error) is very slow.
The overestimation that was observed with PCA does not disappear totally.
Concluding remarks
Local PCA runs fast if the number of windows does not sweep a wide interval.
local PCA has given the right dimensionality for the studied data sets.
The correlation dimension clearly appears as the best method to estimate the intrinsic dimensionality.
It is not the fastest of the four methods, but its results are the best and most detailed ones, giving the dimension on all scales.
Nonlinear Dimensionality Reduction, John A. Lee, Michel Verleysen, Chapter 4
Distance Preservation
The motivation behind distance preservation is that any manifold can be fully described by pairwise distances.
Preserving geometrical structure
Outline
Metric space & most common distance measures
Metric multidimensional scaling
Geodesic and graph distances
Nonlinear DR methods
Spatial distances: Metric space
A space Y with a distance function d(a, b) between two points a, b ∈ Y is said to be a metric space if the distance function respects the following axioms:
Nondegeneracy: d(a, b) = 0 if and only if a = b.
Triangular inequality: d(a, b) ≤ d(c, a) + d(c, b).
Nonnegativity: d(a, b) ≥ 0.
Symmetry: d(a, b) = d(b, a).
In the usual Cartesian vector space RD, the most-used distance functions are derived from the Minkowski norm
Manhattan distance (p = 1), Euclidean distance (p = 2), dominance distance (p = ∞).
Mahalanobis distance: a direct generalization of the Euclidean distance.
d_p(a, b) = ‖a − b‖_p = (Σi |ai − bi|^p)^(1/p)
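The three special cases of the Minkowski distance can be checked in a few lines (numpy assumed):

```python
import numpy as np

def minkowski(a, b, p):
    """Minkowski distance d_p(a, b): p = 1 gives the Manhattan distance,
    p = 2 the Euclidean distance, p = inf the dominance distance."""
    diff = np.abs(np.asarray(a, dtype=float) - np.asarray(b, dtype=float))
    return diff.max() if np.isinf(p) else (diff ** p).sum() ** (1.0 / p)
```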
Metric multidimensional scaling
Classical metric MDS is not a true distance-preserving method: it preserves pairwise scalar products instead of pairwise distances (both are closely related).
It is not a nonlinear DR method.
Instead of pairwise distances, pairwise "similarities" can be used.
When the distances are Euclidean, metric MDS is equivalent to PCA.
Metric MDS
Generative model: y = Wx, where the components of x are independent or uncorrelated.
W is a D-by-P matrix such that W^T W = I_P (orthonormal columns, the usual MDS/PCA convention).
Scalar product between observations: s(i, j) = y(i)^T y(j).
Both Y and X are unknown; only the matrix S = Y^T Y of pairwise scalar products, the Gram matrix, is given.
Metric MDS (Cont.)
Eigenvalue decomposition of the Gram matrix: S = U Λ U^T.
P-dimensional latent variables: X̂ = I_{P×N} Λ^{1/2} U^T.
Criterion of metric MDS.
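These two steps (eigendecompose S, then scale the leading eigenvectors) can be sketched as follows; numpy is assumed, and the columns of the returned matrix are the latent coordinates, matching X̂ = I_{P×N} Λ^{1/2} U^T:

```python
import numpy as np

def metric_mds(S, p):
    """Metric MDS from the N-by-N Gram matrix S of pairwise scalar
    products: eigendecompose S = U Lambda U^T and keep the top p
    eigenvectors scaled by the square roots of their eigenvalues."""
    vals, vecs = np.linalg.eigh(S)      # eigenvalues in ascending order
    idx = np.argsort(vals)[::-1][:p]    # top p eigenpairs
    scale = np.sqrt(np.maximum(vals[idx], 0.0))
    # Columns of the returned p-by-N matrix are the latent coordinates.
    return (vecs[:, idx] * scale).T
```

When S has rank at most p, the reconstructed scalar products X̂^T X̂ reproduce S exactly.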
Metric MDS (Cont.)
Metric MDS and PCA give the same solution.
When the data consist of distances or similarities that prevent applying PCA -> metric MDS.
When the coordinates are known, PCA spends fewer memory resources than MDS.
Experiments
Geodesic distance
Assuming that very short Euclidean distances are preserved
Longer Euclidean distances are considerably stretched.
Measuring the distance along the manifold and not through the embedding space
Geodesic distance (Cont.)
Distance along a manifold: in the case of a one-dimensional manifold M, which depends on a single latent variable x.
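For such a one-dimensional manifold, the distance along M between two points m(x_i) and m(x_j) is the arc length (the standard formula, written here with m the parametric equation of M):

```latex
\delta\big(\mathbf{m}(x_i), \mathbf{m}(x_j)\big)
  = \int_{x_i}^{x_j} \left\| \frac{d\mathbf{m}(x)}{dx} \right\| dx
```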
Geodesic distance: multidimensional manifold
Geodesic distance (Cont.)
The integral then has to be minimized over all possible paths that connect the starting and ending points.
Such a minimization is intractable since it is a functional minimization.
In any case, the parametric equations of M (and P) are unknown; only some (noisy) points of M are available.
Graph distance
Lack of analytical information -> reformulation of the problem.
Instead of minimizing an arc length between two points on a manifold, minimize the length of a path (i.e., a broken line).
The path should be constrained to follow the underlying manifold.
In order to obtain a good approximation of the true arc length, a fine discretization of the manifold is needed.
Only the smallest jumps will be permitted (K-rule, ε-rule).
Graph distance (Cont.)
How to compute the shortest paths in a weighted graph? Dijkstra's algorithm.
It is proved that the graph distance approximates the true geodesic distance in an appropriate way.
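A compact sketch of the K-rule graph construction and of Dijkstra's algorithm (standard-library heapq for the priority queue; numpy assumed for the distance matrix):

```python
import heapq
import numpy as np

def knn_graph(points, k):
    """Adjacency lists of the K-rule graph: each point is linked to its
    k nearest Euclidean neighbors, edges weighted by Euclidean length."""
    points = np.asarray(points, dtype=float)
    d = np.sqrt(((points[:, None] - points[None, :]) ** 2).sum(-1))
    graph = [[] for _ in range(len(points))]
    for i in range(len(points)):
        for j in np.argsort(d[i])[1:k + 1]:   # skip the point itself
            graph[i].append((j, d[i, j]))
            graph[j].append((i, d[i, j]))     # keep the graph symmetric
    return graph

def dijkstra(graph, source):
    """Single-source shortest paths; the resulting graph distances
    approximate geodesic distances along the sampled manifold."""
    dist = [float("inf")] * len(graph)
    dist[source] = 0.0
    heap = [(0.0, source)]
    while heap:
        du, u = heapq.heappop(heap)
        if du > dist[u]:
            continue                          # stale queue entry
        for v, w in graph[u]:
            if du + w < dist[v]:
                dist[v] = du + w
                heapq.heappush(heap, (dist[v], v))
    return dist
```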
Isomap
Isomap is a NLDR method that uses the graph distance as an approximation of the geodesic distance.
Isomap inherits one of the major shortcomings of MDS: a very rigid model.
If the distances in D are not Euclidean, it is implicitly assumed that the replacement metric yields distances equal to Euclidean distances measured in some transformed hyperplane.
Isomap Algorithm
Intrinsic dimensionality
x̂(i) is the ith column of X̂ = I_{P×N} Λ^{1/2} U^T.
An elbow indicates the right dimensionality
Experiments
The first two eigenvalues clearly dominate the others.
Experiments
The open box is not a developable manifold and Isomap does not embed it in a satisfying way.
The first three eigenvalues dominate the others.
Like MDS, Isomap does not succeed in detecting that the intrinsic dimensionality of the box is two.
Kernel PCA
The first idea of KPCA consists of reformulating the PCA into its metric MDS equivalent.
KPCA works as metric MDS, i.e., with the matrix of pairwise scalar products S = YTY.
The second idea of KPCA is to “linearize” the underlying manifold M.
Kernel PCA (Cont.)
As a unique hypothesis, KPCA assumes that the mapping φ is such that the mapped data span a linear subspace of the Q-dimensional space, with Q > D.
KPCA thus starts by increasing the data dimensionality!
Kernel PCA (Cont.)
Choose the mapping φ.
Compute pairwise scalar products for the mapped data and store them in the N-by-N matrix Φ.
The symmetric matrix Φ has to be decomposed in eigenvalues and eigenvectors.
This operation will not yield the expected result unless Φ is positive semidefinite.
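These steps can be sketched with a Gaussian kernel (the kernel choice and its width are assumptions made here for illustration; the double-centering step makes Φ behave like a matrix of centered scalar products; numpy assumed):

```python
import numpy as np

def kernel_pca(Y, p, width=1.0):
    """Sketch of KPCA with a Gaussian kernel of the given width.

    Phi[i, j] = exp(-||y(i) - y(j)||^2 / (2 width^2)) plays the role of
    the scalar products of the mapped data; it is double-centered and
    then eigendecomposed exactly as in metric MDS.
    """
    Y = np.asarray(Y, dtype=float)
    sq = ((Y[:, None] - Y[None, :]) ** 2).sum(-1)
    Phi = np.exp(-sq / (2.0 * width ** 2))
    n = len(Y)
    J = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    Phic = J @ Phi @ J                    # center the mapped data implicitly
    vals, vecs = np.linalg.eigh(Phic)
    idx = np.argsort(vals)[::-1][:p]      # top p eigenpairs
    return vecs[:, idx] * np.sqrt(np.maximum(vals[idx], 0.0))
```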
Experiments: kernel PCA
It aims at embedding the manifold into a space where an MDS-based projection would be more successful than in the initial space.
No guarantee is provided that this goal can be reached.
Experiments: kernel PCA (Cont.)
Tuning the parameters of the kernel is tedious
In other methods using an EVD, like metric MDS and Isomap, the variance remains concentrated within the first three eigenvalues, whereas KPCA spreads it out in most cases.
In order to concentrate the variance within a minimal number of eigenvalues, the width of the Gaussian kernel may be increased, but then the benefit of using a kernel is lost and KPCA tends to yield the same result as metric MDS: a linear projection.
Advantages and drawbacks
KPCA can deal with nonlinear manifolds. And actually, the theory hidden behind KPCA is a beautiful and powerful work of art.
KPCA is not used much in dimensionality reduction. The reasons are that the method is not motivated by
geometrical arguments and the geometrical interpretation of the various kernels remains difficult.
Gaussian kernels
The main difficulty in KPCA, as highlighted in the example, is the choice of an appropriate kernel along with the right values for its parameters.
LLE
LLE Step 1
Suppose the data consist of N real-valued vectors y(i), each of dimensionality D, sampled from some underlying manifold.
We expect each data point and its neighbors to lie on or close to a locally linear patch of the manifold.
The idea of LLE is to replace each point y(i) with a linear combination of its neighbors.
LLE Step 2
Characterize the local geometry of these patches by linear coefficients that reconstruct each data point from its neighbors.
Reconstruction errors are measured by the cost function E(W) = Σ_i ‖y(i) − Σ_j W_ij y(j)‖².
LLE Step 2 (Cont.)
Minimize the cost function subject to two constraints
For any particular data point y(i), they are invariant to rotations, rescalings, and translations of that data point and its neighbors.
The reconstruction weights reflect intrinsic geometric properties of the data that are invariant to exactly such transformations.
The two constraints: W_ij = 0 whenever y(j) is not among the neighbors of y(i), and Σ_j W_ij = 1.
LLE Step 3
Each high-dimensional observation y(i) is mapped to a low-dimensional vector representing global internal coordinates on the manifold.
This is done by choosing p-dimensional coordinate to minimize the embedding cost function
This cost function, like the previous one, is based on locally linear reconstruction errors, but here we fix the weights while optimizing the coordinates.
Φ(X̂) = Σ_i ‖x̂(i) − Σ_j W_ij x̂(j)‖²
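The three LLE steps can be sketched end to end (a plain dense implementation; the regularization constant is an assumed tuning parameter; numpy assumed):

```python
import numpy as np

def lle(Y, k, p, reg=1e-3):
    """Sketch of LLE with k neighbors and target dimension p.

    Step 2: solve for the weights that reconstruct each point from its
    k nearest neighbors (regularized least squares, weights sum to 1).
    Step 3: embed by taking the bottom eigenvectors of (I-W)^T (I-W),
    discarding the constant one.
    """
    Y = np.asarray(Y, dtype=float)
    n = len(Y)
    d = np.sqrt(((Y[:, None] - Y[None, :]) ** 2).sum(-1))
    W = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(d[i])[1:k + 1]           # k nearest neighbors
        G = (Y[nbrs] - Y[i]) @ (Y[nbrs] - Y[i]).T  # local Gram matrix
        G += reg * np.trace(G) * np.eye(k)         # regularization (assumed)
        w = np.linalg.solve(G, np.ones(k))
        W[i, nbrs] = w / w.sum()                   # enforce sum-to-one
    M = (np.eye(n) - W).T @ (np.eye(n) - W)
    vals, vecs = np.linalg.eigh(M)                 # ascending eigenvalues
    return vecs[:, 1:p + 1]                        # skip the constant eigenvector
```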
Experiments
Once the parameters are correctly set, the embedding looks rather good: there are no tears, and the box is deformed smoothly, without superpositions.
The only problem for the open box is that at least one lateral face is completely crushed.
Any questions?