An Introduction to Dimensionality Reduction Using...

An Introduction to Dimensionality

Reduction Using Matlab

L.J.P. van der Maaten

Report MICC 07-07

Universiteit Maastricht MICC/IKAT P.O. Box 616 6200 MD Maastricht The Netherlands Maastricht, July 2007

Copyright © 2007 by Faculty of Humanities & Sciences, MICC/IKAT, Maastricht, The Netherlands. No part of this Report may be reproduced in any form, by print, photo print, microfilm or any other means without written permission from MICC/IKAT, Universiteit Maastricht, Maastricht, The Netherlands.

An Introduction to Dimensionality Reduction Using Matlab

Laurens van der Maaten

MICC, Maastricht University

P.O. Box 616, NL-6200 MD Maastricht, The Netherlands

E-mail: [email protected]

Abstract

Dimensionality reduction is an important task in machine learning, for it facilitates classification, compression, and

visualization of high-dimensional data by mitigating undesired properties of high-dimensional spaces. Over the

last decade, a large number of new (nonlinear) techniques for dimensionality reduction have been proposed. Most

of these techniques are based on the intuition that data lies on or near a complex low-dimensional manifold that

is embedded in the high-dimensional space. New techniques for dimensionality reduction aim at identifying and

extracting the manifold from the high-dimensional space. A systematic empirical evaluation of a large number

of dimensionality reduction techniques has been presented in [86]. This work has led to the development of the

Matlab Toolbox for Dimensionality Reduction, which contains implementations of 27 techniques for dimensionality

reduction. In addition, the toolbox contains implementation of 6 intrinsic dimensionality estimators and functions

for out-of-sample extension, data generation, and data prewhitening. The report describes the techniques that are

implemented in the toolbox in detail. Furthermore, it presents a number of examples that illustrate the functionality

of the toolbox, and provide insight in the capabilities of state-of-the-art techniques for dimensionality reduction.

Contents

1 Introduction 3

2 Dimensionality reduction 4

3 Techniques for dimensionality reduction 5

3.1 Traditional linear techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

3.1.1 PCA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

3.1.2 LDA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

3.2 Global nonlinear techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

3.2.1 MDS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

3.2.2 SPE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

3.2.3 Isomap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

3.2.4 FastMVU . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

3.2.5 Kernel PCA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

3.2.6 GDA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

3.2.7 Diffusion maps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

3.2.8 SNE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

3.2.9 Multilayer autoencoders . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

3.3 Local nonlinear techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

3.3.1 LLE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

3.3.2 Laplacian Eigenmaps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

3.3.3 Hessian LLE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

3.3.4 LTSA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

3.4 Extensions and variants of local nonlinear techniques . . . . . . . . . . . . . . . . . . . . . . . . 15

3.4.1 Conformal Eigenmaps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

3.4.2 MVU . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

3.4.3 LPP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

3.4.4 NPE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

3.4.5 LLTSA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

3.5 Global alignment of linear models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

3.5.1 LLC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

3.5.2 Manifold charting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

3.5.3 CFA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

4 Techniques for intrinsic dimensionality estimation 20

4.1 Local estimators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

4.1.1 Correlation dimension estimator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

4.1.2 Nearest neighbor estimator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

4.1.3 Maximum likelihood estimator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

4.2 Global estimators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

4.2.1 Eigenvalue-based estimator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

4.2.2 Packing numbers estimator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

4.2.3 GMST estimator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

5 Examples 24

5.1 Intrinsic dimensionality estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

5.2 Dimensionality reduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

5.3 Additional functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

5.3.1 Data generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

5.3.2 Prewhitening . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

5.3.3 Out-of-sample extension . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

6 Conclusion 32

A Prerequisites 38

B Installation instructions 38

2

1 Introduction

Dimensionality reduction is the transformation of high-dimensional data into a meaningful representation of reduced

dimensionality. Ideally, the reduced representation has a dimensionality that corresponds to the intrinsic dimension-

ality of the data. The intrinsic dimensionality of data is the minimum number of parameters needed to account for

the observed properties of the data [31]. Dimensionality reduction is important in many domains, since it facilitates

classification, visualization, and compression of high-dimensional data, by mitigating the curse of dimensionality and

other undesired properties of high-dimensional spaces [46].

In recent years, a large number of new techniques for dimensionality reduction have been proposed [2, 7, 8, 14, 24,

36, 39, 40, 54, 70, 73, 72, 80, 78, 89, 90, 93, 92]. Many of these techniques have the ability to deal with complex

nonlinear data manifolds and therefore constitute traditional techniques for dimensionality reduction, such as Prin-

cipal Components Analysis (PCA) and Linear Discriminant Analysis (LDA) [25]. In particular for real-world data,

dimensionality reduction techniques that are capable of dealing with complex nonlinear data manifolds may offer an

advantage. Various studies have shown that nonlinear techniques outperform their linear counterparts on artificial

tasks that are highly nonlinear. In order to make the recent advances in dimensionality reduction available to a broader

community, we developed the Matlab Toolbox for Dimensionality Reduction, which is described in this report.

The Matlab Toolbox for Dimensionality Reduction is a toolbox with Matlab implementations of 27 techniques for

dimensionality reduction, 6 techniques for intrinsic dimensionality reduction estimation, and additional functions for

out-of-sample extension, data generation, and data prewhitening. The report provides a description of all techniques

that are implemented in the toolbox, as well as a number of examples on how to use the toolbox. Note that this report

is not intended to provide a comparative review of techniques for dimensionality reduction. A comparative review on

state-of-the-art techniques for dimensionality reduction can be found in [86].

The Matlab Toolbox for Dimensionality Reduction includes all main techniques for dimensionality reduction, ex-

cept self-organizing maps [51] and their probabilistic extension GTM [12], because we consider these techniques to

be clustering techniques1. Neither does the toolbox include techniques for blind-source separation such as ICA [9].

Techniques for dimensionality reduction that are missing in the toolbox (but might be added later) include principal

curves [17], kernel maps [76], FastMap [27], Geodesic Nullspace Analysis [14], and the technique for global align-

ment of linear models presented in [69].

The outline of the remainder of this report is as follows. In section 2, we give an introduction to dimensionality reduc-

tion. Section 3 discusses the techniques for dimensionality reduction that are implemented in the toolbox. Section 4

describes 6 techniques for intrinsic dimensionality estimation that are implemented in the toolbox. In Section 5, we

provide examples of the use of the toolbox, along with some results on artificial datasets. Section 6 concludes the

report. For installation instructions and information on possible pitfalls we refer to the appendices.

1A theoretical discussion of the link between clustering and dimensionality reduction can be found in [10].

3

2 Dimensionality reduction

The problem of dimensionality reduction can be defined as follows. Suppose we have a n × D matrix X consisting

of n datavectors xi with dimensionality D, and that this dataset has the intrinsic dimensionality d (where d < D, and

often d ≪ D). In geometric terms, intrinsic dimensionality means that the points in dataset X are lying on or near a

manifold with dimensionality d that is embedded in the D-dimensional space. A dimensionality reduction technique

transforms dataset X into a new dataset Y with dimensionality d, while retaining the geometry of the data as much as

possible. In general, neither the geometry of the embedded manifold, nor the intrinsic dimensionality d of the dataset

X are known. Therefore, dimensionality reduction is an ill-posed problem that can only be solved by assuming certain

properties of the data (such as its intrinsic dimensionality).

We illustrate dimensionality reduction by an example. Consider a dataset X that consists of n images of an object

that were depicted under n different orientations. The images have a size of 32 × 32 pixels, and thus form points in

a 1,024-dimensional space. Since there is only one intrinsic dimension (viz., the orientation of the depicted object),

the datapoints xi lie on or near a curved 1-dimensional manifold. The curvedness of the data manifold follows from

the observation that the image corresponding to an orientation of 0 degrees is similar to the image corresponding

to an orientation of 360 degrees. The aim of dimensionality reduction techniques is to identify the 1-dimensional

data manifold that is embedded in the 1,024-dimensional space and to embed it into a space of lower dimensionality,

thereby retaining the geometry of the data manifold.

4

3 Techniques for dimensionality reduction

In section 2, we defined the problem of dimensionality reduction. In this section, we discuss 23 techniques for dimen-

sionality reduction that are implemented in the toolbox2. The techniques presented in the section can be subdivided

into five main groups: (1) traditional linear techniques, (2) global nonlinear techniques, (3) local nonlinear techniques,

(4) extensions and variants of local nonlinear techniques, and (5) techniques based on the alignment of a collection

of linear models. Note that our subdivision of techniques for dimensionality reduction is not the only imagineable

subdivision, and that there exist many similarities between the various techniques.

The outline of the remainder of this section is as follows. In subsection 3.1, we discuss the two traditional linear

techniques for dimensionality reduction, viz., PCA and LDA. In subsection 3.2, nine global nonlinear techniques for

dimensionality reduction are discussed. Subsection 3.3 describes four local techniques for dimensionality reduction.

For local nonlinear techniques for dimensionality reduction, several extensions have been presented, as well as linear

versions of the local techniques. Two extensions for local nonlinear techniques and three linear variants are presented

in 3.4. Three techniques that construct a low-dimensional data representation based on the alignment of a number of

linear models are described in subsection 3.5.

Throughout the report, we denote a high-dimensional datapoint by xi, where xi is the ith row of the D-dimensional

data matrix X . The low-dimensional counterpart of xi is denoted by yi, where yi is the ith row of the d-dimensional

data matrix Y .

3.1 Traditional linear techniques

Two linear techniques for dimensionality reduction are already known in statistics since the beginning of the 20th

century. These techniques are Principal Components Analysis (originally known as the Karhunen-Loeve Transform)

and Linear Discriminant Analysis (originally known as the Fisher mapping). Because of their importance in the field

of statistics, we discuss these two techniques separately from novel linear techniques such as LPP, NPE, and LLTSA

(these techniques are discussed in subsection 3.4). PCA is described in subsubsection 3.1.1 and LDA is discussed in

subsubsection 3.1.2.

3.1.1 PCA

Principal Components Analysis (PCA) [43] constructs a low-dimensional representation of the data that describes as

much of the variance in the data as possible. This is done by finding a linear basis of reduced dimensionality for the

data, in which the amount of variance in the data is maximal.

In mathematical terms, PCA attempts to find a linear transformation T that maximizes TT covX−X T , where covX−X

is the covariance matrix of the zero mean data X . It can be shown that this linear mapping is formed by the d principal

eigenvectors (i.e., principal components) of the covariance matrix of the zero-mean data3. Hence, PCA solves the

eigenproblem

covX−X

v = λv (1)

The eigenproblem is solved for the d principal eigenvalues λ. The corresponding eigenvectors form the columns of

the linear transformation matrix T . The low-dimensional data representations yi of the datapoints xi are computed by

mapping them onto the linear basis T , i.e., Y = (X − X)T .

PCA has been successfully applied in a large number of domains such as face recognition [85], coin classification [44],

and seismic series analysis [66]. The main drawback of PCA is that the size of the covariance matrix is proportional

to the dimensionality of the datapoints. As a result, the computation of the eigenvectors might be infeasible for very

high-dimensional data (under the assumption that n > D). Approximation methods, such as Simple PCA [65], deal

with this problem by applying an iterative Hebbian approach in order to estimate the principal eigenvectors of the

2The reader should note that there are 27 techniques that are implemented in the toolbox. However, in the toolbox, there are 3 PCA implemen-

tations, 2 Isomap implementations, and 2 autoencoder implementations. As a result, only 23 techniques are discussed in the report.3PCA maximizes T T covX−X T with respect to T , under the constraint that |T | = 1. This constraint can be enforced by introducing a

Lagrange multiplier λ. Hence, an unconstrained maximization of T T covX−X T + λ(1−T T T ) is performed. A stationary point of this quantity

is to be found when covX−X T = λT .

5

covariance matrix. Alternatively, PCA can also be rewritten in a probabilistic framework, allowing for performing

PCA by means of an EM algorithm [83]. The reader should note that probabilistic PCA is closely related to factor

analysis [3]. In the toolbox, a traditional PCA implementation is available, as well as implementations of Simple PCA

and probabilistic PCA.

3.1.2 LDA

Linear Discriminant Analysis (LDA) [28] attempts to maximize the linear separability between datapoints belonging

to different classes. In contrast to most other dimensionality reduction techniques, LDA is a supervised technique.

LDA finds a linear mapping M that maximizes the linear class separability in the low-dimensional representation of

the data. The criteria that are used to formulate linear class separability in LDA are the within-class scatter Sw and the

between-class scatter Sb, which are defined as

Sw =∑

c

pc covXc−Xc

(2)

Sb =∑

c

covXc

= covX−X

−Sw (3)

where pc is the class prior of class label c, covXc−Xc is the covariance matrix of the zero mean datapoints xi assigned

to class c ∈ C (where C is the set of possible classes), covXc is the covariance matrix of the cluster means, and

covX−X is the covariance matrix of the zero mean data X . LDA optimizes the ratio between the within-class scatter

Sw and the between-class scatter Sb in the low-dimensional representation of the data, by finding a linear mapping Mthat maximizes the so-called Fisher criterion

TTSbT

TTSwT(4)

This maximization can be performed by solving the generalized eigenproblem

Sbv = λSwv (5)

for the d largest eigenvalues (under the requirement that d < |C|). The eigenvectors v form the columns of the linear

transformation matrix T . The low-dimensional data representation Y of the datapoints in X can be computed by

mapping them onto the linear basis T , i.e., Y = (X − X)T .

LDA has been successfully applied in a large number of classification tasks. Successful applications include speech

recognition [34], mammography [15], and document classification [84].

3.2 Global nonlinear techniques

Global nonlinear techniques for dimensionality reduction are techniques that attempt to preserve global properties

of the data (similar to PCA and LDA), but that are capable of constructing nonlinear transformations between the

high-dimensional data representation X and its low-dimensional counterpart Y . The subsection presents nine global

nonlinear techniques for dimensionality reduction: (1) MDS, (2) SPE, (3) Isomap, (4) FastMVU, (5) Kernel PCA,

(6) GDA, (7) diffusion maps, (8) SNE, and (9) multilayer autoencoders. The techniques are discussed in subsubsec-

tion 3.2.1 to 3.2.9.

3.2.1 MDS

Multidimensional scaling (MDS) [19, 52] represents a collection of nonlinear techniques that maps the high-dimensional

data representation to a low-dimensional representation while retaining the pairwise distances between the datapoints

as much as possible. The quality of the mapping is expressed in the stress function, a measure of the error between the

pairwise distances in the low-dimensional and high-dimensional representation of the data. Two important examples

of stress functions (for metric MDS) are the raw stress function and the Sammon cost function. The raw stress function

is defined by

φ(Y ) =∑

ij

(‖xi − xj‖ − ‖yi − yj‖)2 (6)

6

in which ‖xi − xj‖ is the Euclidean distance between the high-dimensional datapoints xi and xj and ‖yi − yj‖ is the

Euclidean distance between the low-dimensional datapoints yi and yj . The Sammon cost function is given by

φ(Y ) =1

∑

ij‖xi − xj‖∑

ij

(‖xi − xj‖ − ‖yi − yj‖)2‖xi − xj‖

(7)

The Sammon cost function differs from the raw stress function in that it puts more emphasis on retaining distances

that were originally small. The minimization of the stress function can be performed using various methods, such

as the eigendecomposition of a pairwise dissimilarity matrix, the conjugate gradient method, or a pseudo-Newton

method [19].

MDS is widely used for the visualization of data, e.g., in fMRI analysis [77] and in molecular modelling [88]. The

popularity of MDS has led to the proposal of variants such as SPE [2] and FastMap [27].

3.2.2 SPE

Stochastic Proximity Embedding (SPE) is an iterative algorithm that minimizes the MDS raw stress function. SPE

differs from MDS in the efficient rule it employs to update the current estimate of the low-dimensional data represen-

tation. In addition, SPE can readily be applied in order to retain only distances in a neighborhood graph defined on the

graph, leadin to behavior that is comparable to, e.g., Isomap (see subsubsection 3.2.3).

SPE minimizes the MDS raw stress function that was given in Equation 6. For convenience, we rewrite the raw stress

function as

φ(Y ) =∑

ij

(dij − rij)2 (8)

where rij is the proximity between the high-dimensional datapoints xi and xj (computed by rij =‖xi−xj‖max R ), and dij

is the Euclidean distance between their low-dimensional counterparts yi and yj in the current approximation of the

embedded space. SPE can readily be performed in order to retain solely distances in a neighborhood graph G defined

on the data, by setting dij and rij to 0 if (i, j) /∈ G. Using this setting, SPE behaves comparable to techniques based

on neighborhood graphs. Nonlinear dimensionality reduction techniques based on neighborhood graphs are discussed

in more detail later.

SPE performs an iterative algorithm in order to minimize the raw stress function defined above. The initial positions

of the points yi are selected randomly in [0, 1]. An update of the embedding coordinates yi is performed by randomly

selecting s pairs of points (yi, yj). For each pair of points, the Euclidean distance in the low-dimensional data repre-

sentation Y is computed. Subsequently, the coordinates of yi and yj are updated in order to decrease the difference

between the distance in the original space rij and the distance in the embedded space dij . The updating is performed

using the following update rules

yi = yi + λrij − dij

2dij + ǫ(yi − yj) (9)

yj = yj + λrij − dij

2dij + ǫ(yj − yi) (10)

where λ is a learning parameter that decreases with the number of iterations, and ǫ is a regularization parameter that

prevents divisions by zero. The updating of the embedded coordinates is performed for a large number of iterations

(e.g., 105 iterations). The high number of iterations is feasible because of the low computational costs of the update

procedure.

3.2.3 Isomap

Multidimensional scaling has proven to be successful in many applications, but it suffers from the fact that it is based

on Euclidean distances, and does not take into account the distribution of the neighboring datapoints. If the high-

dimensional data lies on or near a curved manifold, such as in the Swiss roll dataset [79], MDS might consider two

datapoints as near points, whereas their distance over the manifold is much larger than the typical interpoint distance.

Isomap [79] is a technique that resolves this problem by attempting to preserve pairwise geodesic (or curvilinear)

7

distances between datapoints. Geodesic distance is the distance between two points measured over the manifold.

In Isomap [79], the geodesic distances between the datapoints xi are computed by constructing a neighborhood graph

G, in which every datapoint xi is connected with its k nearest neighbors xijin the dataset X . The shortest path

between two points in the graph forms a good (over)estimate of the geodesic distance between these two points,

and can easily be computed using Dijkstra’s or Floyd’s shortest-path algorithm [23, 29]. The geodesic distances

between all datapoints in X are computed, thereby forming a pairwise geodesic distance matrix. The low-dimensional

representations yi of the datapoints xi in the low-dimensional space Y are computed by applying multidimensional

scaling (see subsection 3.2.1) on the resulting distance matrix.

An important weakness of the Isomap algorithm is its topological instability [6]. Isomap may construct erroneous

connections in the neighborhood graph G. Such short-circuiting [55] can severely impair the performance of Isomap.

Several approaches were proposed to overcome the problem of short-circuiting, e.g., by removing datapoints with

large total flows in the shortest path-algorithm [18] or by removing nearest neighbors that violate local linearity of the

neighborhood graph [71]. Furthermore, Isomap may suffer from holes in the manifold, a problem that can be dealt with

by tearing manifolds with holes [55]. A third weakness of Isomap is that it can fail if the manifold is nonconvex [80].

Despite these weaknesses, Isomap was successfully applied on tasks such as wood inspection [64], visualization of

biomedical data [57], and head pose estimation [68].

3.2.4 FastMVU

Similar to Isomap, Fast Maximum Variance Unfolding (FastMVU) defines a neighborhood graph on the data and

retains pairwise distances in the resulting graph. FastMVU differs from Isomap in that explicitly attempts to ’unfold’

data manifolds. It does so by maximizing the Euclidean distances between the datapoints, under the constraint that

the distances in the neighborhood graph are left unchanged (i.e., under the constraint that the local geometry of

the data manifold is not distorted). The resulting optimization problem can be solved efficiently using semidefinite

programming (which is why FastMVU is also known as Semidefinite Embedding or SDE).

FastMVU starts with the construction of a neighborhood graph G, in which each datapoint xi is connected to its knearest neighbors xij

. Subsequently, FastMVU attempts to maximize the sum of the squared Euclidean distances

between all datapoints, under the constraint that the distances inside the neighborhood graph G are preserved. In other

words, FastMVU performs the following optimization problem

Maximize∑

ij

‖ yi − yj ‖2 subject to:

1) ‖ yi − yj ‖2=‖ xi − xj ‖2 for all (i, j) ∈ G

FastMVU reformulates the optimization problem as a semidefinite programming problem (SDP) [87] by defining a

matrix K that is the inner product of the low-dimensional data representation Y . It can be shown that the above

optimization is similar to the SDP

Maximize trace(K) subject to:

1) kii − 2kij + kjj =‖ xi − xj ‖2 for all (i, j)

2)∑

ij

kij = 0

3) K ≥ 0

From the solutionK of the SDP, the low-dimensional data representation Y can be obtained by computing Y = K1/2.

3.2.5 Kernel PCA

Kernel PCA (KPCA) is the reformulation of traditional linear PCA in a high-dimensional space that is constructed

using a kernel function [72]. In recent years, the reformulation of linear techniques using the ’kernel trick’ has led to

the proposal of successful techniques such as kernel ridge regression and Support Vector Machines [74]. Kernel PCA

8

computes the principal eigenvectors of the kernel matrix, rather than those of the covariance matrix. The reformulation

of traditional PCA in kernel space is straightforward, since a kernel matrix is similar to the inproduct of the datapoints

in the high-dimensional space that is constructed using the kernel function. The application of PCA in kernel space

provides Kernel PCA the property of constructing nonlinear mappings.

Kernel PCA computes the kernel matrix K of the datapoints xi. The entries in the kernel matrix are defined by

kij = κ(xi, xj) (11)

where κ is a kernel function [74]. Subsequently, the kernel matrix K is centered using the following modification of

the entries

kij = kij −1

n

∑

l

kil −1

n

∑

l

kjl +1

n2

∑

lm

klm (12)

The centering operation corresponds to subtracting the mean of the features in traditional PCA. It makes sure that the

features in the high-dimensional space defined by the kernel function are zero-mean. Subsequently, the principal deigenvectors vi of the centered kernel matrix are computed. It can be shown that the eigenvectors of the covariance

matrix αi (in the high-dimensional space constructed by κ) are scaled versions of the eigenvectors of the kernel matrix

vi

αi =1√λi

vi (13)

In order to obtain the low-dimensional data representation, the data is projected onto the eigenvectors of the covariance

matrix. The result of the projection (i.e., the low-dimensional data representation Y ) is given by

Y =

∑

j

α1κ(xj , x),∑

j

α2κ(xj , x), ...,∑

j

αdκ(xj , x)

(14)

where κ is the kernel function that was also used in the computation of the kernel matrix. Since Kernel PCA is a

kernel-based method, the mapping performed by Kernel PCA highly relies on the choice of the kernel function κ.

Possible choices for the kernel function include the linear kernel (making Kernel PCA equal to traditional PCA), the

polynomial kernel, and the Gaussian kernel [74].

Kernel PCA has been successfully applied to, e.g., face recognition [49], speech recognition [58], and novelty detec-

tion [41]. An important drawback of Kernel PCA is that the size of the kernel matrix is square with the number of

instances in the dataset. An approach to resolve this drawback is proposed in [82].

3.2.6 GDA

In the same way as Kernel PCA, Generalized Discriminant Analysis (or Kernel LDA) is the reformulation of LDA

in the high-dimensional space constructed using a kernel function [7]. Similar to LDA, GDA attempts to maximize

the Fisher criterion. The difference between GDA and LDA is that GDA maximizes the Fisher criterion in the high-

dimensional space that is constructed using the kernel function. In this way, GDA allows for the construction of

nonlinear mappings that maximize the class separability in the data.

GDA starts by performing Kernel PCA on the data in order to remove noise from the data. Hence, GDA computes

the kernel matrix K of the datapoints xi using Equation 11. The resulting kernel matrix K is centered by modifying

its entries using Equation 12. The centering equation is the kernel equivalent of subtracting the mean of the features

(as is done in traditional LDA). Subsequently, an eigenanalysis of the centered kernel matrix K is performed. The neigenvectors of the kernel matrix K are stored in a matrix P that satisfies

K = PΛPT (15)

In general, many eigenvectors pi correspond to small eigenvalues λi. In order to remove noise from the data, the

eigenvectors pi that correspond to an eigenvalue λi < ǫ (for a small value of ǫ) are removed from matrix P . The

smoothed kernel K can be recomputed using Equation 15. Subsequently, GDA maximizes the Fisher criterion (see

9

Equation 4) in the high-dimensional space defined by the kernel function κ. It can be shown4 that this is equivalent to

maximizing the Rayleigh quotient

φ(W ) =αTKWKα

αTKKα(16)

where W is an n× n blockdiagonal matrix

W = (Wc)c=1,2,...,l (17)

In the equation, Wc is an nc × nc matrix of which the entries are 1nc

(where nc indicates the number of instances

of class c) and l indicates the number of classes. Hence, the solution that maximizes the between-class scatter and

minimizes the within-class scatter is given by the eigenvectors corresponding to the principal eigenvalues of KWK.

In fact, the smoothed kernel K does not have to be computed explicitly, since the eigenanalysis of KWK is equal to

the eigenanalysis of PTWP . Hence, GDA computes the d principal eigenvectors of the matrix PTWP and stores

these in a matrix V . The eigenvectors vi are normalized by computing

αi =vi

√

vTi Kvi

(18)

In order to obtain the low-dimensional representation of the data, the data is projected onto the normalized eigenvectors

αi in the high-dimensional space defined by the kernel function κ. Hence, the low-dimensional representation of the

data is given by Equation 14.

3.2.7 Diffusion maps

The diffusion maps (DM) framework [54, 61] originates from the field of dynamical systems. Diffusion maps are

based on defining a Markov random walk on the graph of the data. By performing the random walk for a number

of timesteps, a measure for the proximity of the datapoints is obtained. Using this measure, the so-called diffusion

distance is defined. In the low-dimensional representation of the data, the pairwise diffusion distances are retained as

well as possible.

In the diffusion maps framework, a graph of the data is constructed first. The weights of the edges in the graph are

computed using the Gaussian kernel function, leading to a matrix W with entries

wij = e−‖xi−xj‖2

2σ2 (19)

where σ indicates the variance of the Gaussian. Subsequently, normalization of the matrix W is performed in such a

way that its rows add up to 1. In this way, a matrix P (1) is formed with entries

p(1)ij =

wij∑

k wik(20)

Since diffusion maps originate from dynamical systems theory, the resulting matrix P (1) is considered a Markov matrix

that defines the forward transition probability matrix of a dynamical process. Hence, the matrix P (1) represents the

probability of a transition from one datapoint to another datapoint in a single timestep. The forward probability matrix

for t timesteps P (t) is given by (P (1))t. Using the random walk forward probabilities p(t)ij , the diffusion distance is

defined by

D(t)(xi, xj) =∑

k

(p(t)ik − p

(t)jk )2

ψ(xk)(0)(21)

In the equation, ψ(xi)(0) is a term that attributes more weight to parts of the graph with high density. It is defined by

ψ(xi)(0) = mi

P

jmj

, where mi is the degree of node xi defined by mi =∑

j pij . From Equation 21, it can be observed

that pairs of datapoints with a high forward transition probability have a small diffusion distance. The key idea behind

the diffusion distance is that it is based on many paths through the graph. This makes the diffusion distance more

4For the proof, we refer to [7].

10

robust to noise than, e.g., the geodesic distance. In the low-dimensional representation of the data Y , diffusion maps

attempt to retain the diffusion distances. Using spectral theory on the random walk, it can be shown5 that the low-

dimensional representation Y that retains the diffusion distances is formed by the d nontrivial principal eigenvectors

of the eigenproblem

P (t)Y = λY (22)

Because the graph is fully connected, the largest eigenvalue is trivial (viz. λ1 = 1), and its eigenvector v1 is thus

discarded. The low-dimensional representation Y is given by the next d principal eigenvectors. In the low-dimensional

representation, the eigenvectors are normalized by their corresponding eigenvalues. Hence, the low-dimensional data

representation is given by

Y = {λ2v2, λ3v3, ..., λd+1vd+1} (23)

3.2.8 SNE

Stochastic Neighbor Embedding (SNE) is an iterative technique that (similar to MDS) attempts to retain the pairwise

distances between the datapoints in the low-dimensional representation of the data [39]. SNE differs from MDS by

the distance measure that is used, and by the cost function that it minimizes. In SNE, similarities of nearby points

contribute more to the cost function. This leads to a low-dimensional data representation that preserves mainly local

properties of the manifold. The locality of its behavior strongly depends on the user-defined value for the free param-

eter σ.

In SNE, the probability pij that datapoints xi and xj are generated by the same Gaussian is computed for all combi-

nations of datapoints and stored in a matrix P . The probabilities pij are computed using the Gaussian kernel function

(see Equation 19). Subsequently, the coordinates of the low-dimensional representations yi are set to random values

(close to zero). The probabilities qij that datapoints yi and yj are generated by the same Gaussian are computed and

stored in a matrix Q using the Gaussian kernel as well. In a perfect low-dimensional representation of the data, Pand Q are equal. Hence, SNE minimizes the difference between the probability distributions P and Q. Kullback-

Leibler divergences are a natural distance measure to measure the difference between two probability distributions.

SNE attempts to minimize the following sum of Kullback-Leibler divergences

φ(Y ) =∑

ij

pij logpij

qij(24)

The minimization of φ(Y ) can be performed in various ways. In the original algorithm [39], minimization of φ(Y )is performed using a gradient descent method. To avoid problems with local minima, an amount of Gaussian jitter

(that is reduced over time) is added in every iteration. An alternative approach to SNE is based on a trust-region

algorithm [62].

3.2.9 Multilayer autoencoders

Multilayer encoders are feed-forward neural networks with an odd number of hidden layers [21, 40]. The middle

hidden layer has d nodes, and the input and the output layer have D nodes. An example of an autoencoder is shown

schematically in Figure 1. The network is trained to minimize the mean squared error between the input and the output

of the network (ideally, the input and the output are equal). Training the neural network on the datapoints xi leads to

a network in which the middle hidden layer gives a d-dimensional representation of the datapoints that preserves as

much information in X as possible. The low-dimensional representations yi can be obtained by extracting the node

values in the middle hidden layer, when datapoint xi is used as input. If linear activation functions are used in the

neural network, an autoencoder is very similar to PCA [53]. In order to allow the autoencoder to learn a nonlinear

mapping between the high-dimensional and low-dimensional data representation, sigmoid activation functions are

generally used.

Multilayer autoencoders usually have a high number of connections. Therefore, backpropagation approaches converge

slowly and are likely to get stuck in local minima. In [40], this drawback is overcome by performing a pretraining

5See [54] for the derivation.

11

using Restricted Boltzmann Machines (RBMs) [38]. RBMs are neural networks of which the units are binary and

stochastic, and in which connections between hidden units are not allowed. RBMs can successfully be used to train

neural networks with many hidden layers using a learning approach based on simulated annealing [50]. Once the

pretraining using RBMs is performed, finetuning of the network weights is performed using backpropagation. As an

alternative approach, genetic algorithms [42, 67] can be used to train an autoencoder.

Figure 1: Schematic structure of an autoencoder.

3.3 Local nonlinear techniques

Subsection 3.2 presented eight techniques for dimensionality reduction that attempt to retain global properties of the

data. In contrast, local nonlinear techniques for dimensionality reduction are based on solely preserving properties of

small neighborhoods around the datapoints. This subsection presents four local nonlinear techniques for dimensional-

ity reduction: (1) LLE, (2) Laplacian Eigenmaps, (3) Hessian LLE, and (4) LTSA in subsubsection 3.3.1 to 3.3.4. The

key idea behind these techniques is that by preservation of local properties of the data, the global layout of the data

manifold is retained as well.

It can be shown that local techniques for dimensionality reduction can be viewed upon as definitions of specific local

kernel functions for Kernel PCA [10, 35]. Therefore, these techniques can be rewritten within the Kernel PCA frame-

work. In this paper, we do not elaborate further on the theoretical connection between local methods for dimensionality

reduction and Kernel PCA.

3.3.1 LLE

Local Linear Embedding (LLE) [70] is a local technique for dimensionality reduction that is similar to Isomap in that

it constructs a graph representation of the datapoints. In contrast to Isomap, it attempts to preserve solely local prop-

erties of the data, making LLE less sensitive to short-circuiting than Isomap. Furthermore, the preservation of local

properties allows for successful embedding of nonconvex manifolds. In LLE, the local properties of the data manifold

are constructed by writing the datapoints as a linear combination of their nearest neighbors. In the low-dimensional

representation of the data, LLE attempts to retain the reconstruction weights in the linear combinations as well as

possible.

12

LLE describes the local properties of the manifold around a datapoint xi by writing the datapoint as a linear combi-

nation Wi (the so-called reconstruction weights) of its k nearest neighbors xij. Hence, LLE fits a hyperplane through

the datapoint xi and its nearest neighbors, thereby assuming that the manifold is locally linear. The local linearity

assumption implies that the reconstruction weights Wi of the datapoints xi are invariant to translation, rotation, and

rescaling. Because of the invariance to these transformations, any linear mapping of the hyperplane to a space of

lower dimensionality preserves the reconstruction weights in the space of lower dimensionality. In other words, if

the low-dimensional data representation preserves the local geometry of the manifold, the reconstruction weights Wi

that reconstruct datapoint xi from its neighbors in the high-dimensional data representation also reconstruct datapoint

yi from its neighbors in the low-dimensional data representation. As a consequence, finding the d-dimensional data

representation Y amounts to minimizing the cost function

φ(Y ) =∑

i

(yi −k

∑

j=1

wijyij)2 (25)

It can be shown6 that the coordinates of the low-dimensional representations yi that minimize this cost function can be

found by computing the eigenvectors corresponding to the smallest d nonzero eigenvalues of the inproduct of (I−W ).In this formula, I is the n× n identity matrix.

LLE has been successfully applied for, e.g., superresolution [16] and sound source localization [26]. However, there

also exist experimental studies that report weak performance of LLE. In [57], LLE was reported to fail in the visual-

ization of even simple synthetic biomedical datasets. In [45], it is claimed that LLE performs worse than Isomap in the

derivation of perceptual-motor actions. A possible explanation lies in the difficulties that LLE has when confronted

with manifolds that contains holes [70].

3.3.2 Laplacian Eigenmaps

Similar to LLE, Laplacian Eigenmaps find a low-dimensional data representation by preserving local properties of the

manifold [8]. In Laplacian Eigenmaps, the local properties are based on the pairwise distances between near neighbors.

Laplacian Eigenmaps compute a low-dimensional representation of the data in which the distances between a datapoint

and its k nearest neighbors are minimized. This is done in a weighted manner, i.e., the distance in the low-dimensional

data representation between a datapoint and its first nearest neighbor contributes more to the cost function than the

distance between the datapoint and its second nearest neighbor. Using spectral graph theory, the minimization of the

cost function is defined as an eigenproblem.

The Laplacian Eigenmap algorithm first constructs a neighborhood graph G in which every datapoint xi is connected

to its k nearest neighbors. For all points xi and xj in graph G that are connected by an edge, the weight of the edge

is computed using the Gaussian kernel function (see Equation 19), leading to a sparse adjacency matrix W . In the

computation of the low-dimensional representations yi, the cost function that is minimized is given by

φ(Y ) =∑

ij

(yi − yj)2wij (26)

In the cost function, large weights wij correspond to small distances between the datapoints xi and xj . Hence, the

difference between their low-dimensional representations yi and yj highly contributes to the cost function. As a conse-

quence, nearby points in the high-dimensional space are brought closer together in the low-dimensional representation.

The computation of the degree matrix M and the graph Laplacian L of the graph W allows for formulating the mini-

mization problem as an eigenproblem [4]. The degree matrix M of W is a diagonal matrix, whose entries are the row

sums of W (i.e., mii =∑

j wij). The graph Laplacian L is computed by L = M −W . It can be shown that the

following holds7

φ(Y ) =∑

ij

(yi − yj)2wij = 2Y TLY (27)

6φ(Y ) = (Y − WY )2 = Y T (I − W )T (I − W )Y is the function that has to be minimized. Hence, the eigenvectors of (I − W )T (I − W )corresponding to the smallest nonzero eigenvalues form the solution that minimizes φ(Y ).

7Note that φ(Y ) =P

ij(yi − yj)2wij =

P

ij(y2

i + y2

j − 2yiyj)wij =P

i y2

i mii +P

j y2

j mjj − 2P

ij yiyjwij = 2Y T MY −

2Y T WY = 2Y T LY

13

Hence, minimizing φ(Y ) is proportional to minimizing Y TLY . The low-dimensional data representation Y can thus

be found by solving the generalized eigenvector problem

Lv = λMv (28)

for the d smallest nonzero eigenvalues. The d eigenvectors vi corresponding to the smallest nonzero eigenvalues form

the low-dimensional data representation Y .

Laplacian Eigenmaps have been successfully applied to, e.g., clustering [63, 75, 91] and face recognition [37].

3.3.3 Hessian LLE

Hessian LLE (HLLE) [24] is a variant of LLE that minimizes the ’curviness’ of the high-dimensional manifold when

embedding it into a low-dimensional space, under the constraint that the low-dimensional data representation is locally

isometric. This is done by an eigenanalysis of a matrix H that describes the curviness of the manifold around the

datapoints. The curviness of the manifold is measured by means of the local Hessian at every datapoint. The local

Hessian is represented in the local tangent space at the datapoint, in order to obtain a representation of the local

Hessian that is invariant to differences in the positions of the datapoints. It can be shown8 that the coordinates of the

low-dimensional representation can be found by performing an eigenanalysis of H.

Hessian LLE starts with identifying the k nearest neighbors for each datapoint xi using Euclidean distance. In the

neighborhood, local linearity of the manifold is assumed. Hence, a basis for the local tangent space at point xi can

be found by applying PCA on its k nearest neighbors xij. In other words, for every datapoint xi, a basis for the local

tangent space at point xi is determined by computing the d principal eigenvectors M = {m1,m2, ...,md} of the

covariance matrix covxij−xij

. Note that the above requires that k ≥ d. Subsequently, an estimator for the Hessian of

the manifold at point xi in local tangent space coordinates is computed. In order to do this, a matrix Zi is formed that

contains (in the columns) all cross products of M up to the dth order (including a column with ones). The matrix Zi

is orthonormalized by applying Gram-Schmidt orthonormalization [1] on the matrix Zi. The estimation of the tangent

Hessian Hi is now given by the transpose of the lastd(d+1)

2 columns of the matrix Zi. Using the Hessian estimators

in local tangent coordinates, a matrix H is constructed with entries

Hlm =∑

i

∑

j

((Hi)jl × (Hi)jm) (29)

The matrix H represents information on the curviness of the high-dimensional data manifold. An eigenanalysis of His performed in order to find the low-dimensional data representation that minimizes the curviness of the manifold.

The eigenvectors corresponding to the d smallest nonzero eigenvalues of H are selected and form the matrix Y , which

contains the low-dimensional representation of the data.

3.3.4 LTSA

Similar to Hessian LLE, Local Tangent Space Analysis (LTSA) is a technique that describes local properties of the

high-dimensional data using the local tangent space of each datapoint [93]. LTSA is based on the observation that, if

local linearity of the manifold is assumed, there exists a linear mapping from a high-dimensional datapoint to its local

tangent space, and that there exists a linear mapping from the corresponding low-dimensional datapoint to the same

local tangent space [93]. LTSA attempts to align these linear mappings in such a way, that they construct the local

tangent space of the manifold from the low-dimensional representation. In other words, LTSA simultaneously searches

for the coordinates of the low-dimensional data representations, and for the linear mappings of the low-dimensional

datapoints to the local tangent space of the high-dimensional data.

Similar to Hessian LLE, LTSA starts with computing bases for the local tangent spaces at the datapoints xi. This is

done by applying PCA on the k datapoints xijthat are neighbors of datapoint xi. This results in a mapping Mi from

the neighborhood of xi to the local tangent space Θi. A property of the local tangent space Θi is that there exists a

8The derivation is too extensive for this paper, but can be found in [24].

14

linear mapping Li from the local tangent space coordinates θijto the low-dimensional representations yij

. Using this

property of the local tangent space, LTSA performs the following minimization

minYi,Li

∑

i

‖ YiJk − LiΘi ‖2 (30)

where Jk is the centering matrix of size k [74]. It can be shown9 that the solution of the minimization is formed by

the eigenvectors of an alignment matrix B, that correspond to the d smallest nonzero eigenvalues of B. The entries of

the alignment matrix B are obtained by iterative summation (for all matrices Vi and starting from bij = 0 for ∀ij)

BNiNi= BNiNi

+ Jk(I − ViVTi )Jk (31)

where Ni is a selection matrix that contains the indices of the nearest neighbors of datapoint xi. Subsequently, the

low-dimensional representation Y is obtained by computation of the eigenvectors corresponding to the d smallest

nonzero eigenvectors of the symmetric matrix 12 (B +BT ).

In [81], a successful application of LTSA to microarray data is reported.

3.4 Extensions and variants of local nonlinear techniques

In subsection 3.3, we discussed four local nonlinear techniques for dimensionality reduction. The capability of these

techniques to successfully identify complex data manifolds has led to the proposal of several extensions and variants.

In this subsection, we discuss two extensions of local nonlinear techniques for dimensionality reduction, as well as

three linear approximations to local nonlinear techniques. The two extensions are discussed in subsubsection 3.4.1

and 3.4.2. Subsection 3.4.3 to 3.4.5 discuss the three linear approximations to local nonlinear dimensionality reduction

techniques.

3.4.1 Conformal Eigenmaps

Conformal Eigenmaps (CCA) are based on the observation that local nonlinear techniques for dimensionality reduc-

tion do not employ information on the geometry of the data manifold that is contained in discarded eigenvectors that

correspond to relatively small eigenvalues [73]. Conformal Eigenmaps initially perform LLE (or alternatively, another

local nonlinear technique for dimensionality reduction) to reduce the high-dimensional data to a dataset of dimension-

ality dt, where d < dt < D. Conformal Eigenmaps use the resulting intermediate solution in order to construct a

d-dimensional embedding that is maximally angle-preserving (i.e., conformal).

A conformal mapping is a transformation that is preserves the angles between neighboring datapoints when reducing

the dimensionality of the data. Consider a triangle (xh, xi, xj) in the high-dimensional data representation X and its

low-dimensional counterpart (yh, yi, yj). The mapping between these two triangles is conformal iff

‖ yh − yi ‖2

‖ xh − xi ‖2=

‖ yi − yj ‖2

‖ xi − xj ‖2=

‖ yh − yj ‖2

‖ xh − xj ‖2(32)

Based on this observation, Conformal Eigenmaps define a measure Ch that evaluates the conformality of the triangles

at datapoint xh in the neighborhood graph by

Ch =∑

ij

ηhiηij(‖ yi − yj ‖2 −sh ‖ xi − xj ‖2)2 (33)

where ηij is a variable that is 1 if xi and xj are connected in the neighborhood graph, and sh is a variable that

corrects for scalings in the transformation from the high-dimensional data representation X to its low-dimensional

counterpart Y . Summing over Ch yields a measure C(Y ) that measures the conformality of a low-dimensional data

representation Y with respect to the original data X . Conformal Eigenmaps seek for a linear mapping M from the

dt-dimensional data representation Z obtained from LLE to the low-dimensional data representation Y that maximizes

9The proof is too extensive for this paper, but can be found in [93].

15

the conformality measure C(Y ). The maximization of the conformality measure C(Y ) can be performed by solving

the following semidefinite program10

Maximize t subject to:

1) P ≥ 0

2) traceP = 1

3)

(

I SP(SP )T t

)

≥ 0

where I represents the identity matrix, S is a matrix containing the scaling coefficients sh, and P represents a matrix

that is given by P = MTM . Using the solution P of the SDP, the linear transformation M can be obtained by

computing M = P 1/2. The low-dimensional data representation Y is obtained by mapping the intermediate solution

Z (that was contructed by LLE) onto M , i.e., by computing Y = ZM .

3.4.2 MVU

Maximum Variance Unfolding (MVU) is very similar to FastMVU in that both techniques solve the same optimiza-

tion problem. However, in contrast to FastMVU, MVU starts from an intermediate solution that is obtained from

LLE (similar to Conformal Eigenmaps). The idea behind MVU is that local nonlinear techniques for dimensionality

reduction aim to preserve local properties of the data, but not necessarily aim to completely ’unfold’ a data manifold.

MVU performs the unfolding based on the low-dimensional data representation computed by LLE, thereby improving

the low-dimensional representation.

MVU starts with performing LLE (although another local nonlinear method could be used as well) in order to reduce

the dimensionality of the data to dt, where d < dt < D. Subsequently, the FastMVU procedure is applied on the

obtained data representation Z of dimensionality dt in order to obtain the final low-dimensional representation Y . The

intuition behind this approach is that local nonlinear techniques for dimensionality reduction preserve local properties

of a data manifold, but not explicitly unfold the data manifold. The advantage of MVU over FastMVU is that the

semidefinite program that MVU solves is much smaller than the program that is solved by FastMVU.

3.4.3 LPP

In contrast to traditional linear techniques such as PCA, local nonlinear techniques for dimensionality reduction (see

subsection 3.3) are capable of the successful identification of complex data manifolds such as the Swiss roll. This

capability is due to the cost functions that are minimized by local nonlinear dimensionality reduction techniques,

which aim at preserving local properties of the data manifold. However, in many learning settings, the use of a linear

technique for dimensionality reduction is desired, e.g., when an accurate and fast out-of-sample extension is necessary,

when data has to be transformed back into its original space, or when one wants to visualize the transformation that

was constructed by the dimensionality reduction technique. Linearity Preserving Projection (LPP) is a technique that

aims at combining the benefits of linear techniques and local nonlinear techniques for dimensionality reduction by

finding a linear mapping that minimizes the cost function of Laplacian Eigenmaps [36].

Similar to Laplacian Eigenmaps, LPP starts with the construction of a nearest neighbor graph in which every datapoint

xi is connected to its k nearest neighbors xij. The weights of the edges in the graph are computed using Equation 19.

Subsequently, LPP solves the generalized eigenproblem

(X − X)TL(X − X)v = λ(X − X)TM(X − X)v (34)

in which L is the graph Laplacian, and M the degree matrix of the graph, as described in subsubsection 3.3.2. It

can be shown that the eigenvectors vi corresponding to the d smallest nonzero eigenvalues form the columns of the

linear mapping T that minimizes the Laplacian Eigenmap cost function (see Equation 26). The low-dimensional data

representation Y is thus given by Y = (X − X)T .

10For the derivation of the SDP, we refer to [73].

16

3.4.4 NPE

Similar to LPP, Neigborhood Preserving Embedding (NPE) minimizes the cost function of a local nonlinear technique

for dimensionality reduction under the constraint that the mapping from the high-dimensional to the low-dimensional

data representation is linear. NPE is the linear approximation to LLE (see subsubsection 3.3.1).

NPE defines a neighborhood graph on the dataset X , and subsequently computes the reconstruction weights Wi as in

LLE. The cost function of LLE (see Equation 25) is optimized by solving the following generalized eigenproblem for

the d smallest nonzero eigenvalues

(X − X)T (I −W )T (I −W )(X − X)v = λ(X − X)T (X − X)v (35)

where I represents the n × n identity matrix. The low-dimensional data representation is computed by mapping Xonto the obtained mapping T , i.e., by computing Y = (X − X)T .

3.4.5 LLTSA

Using a similar approach as LPP and NPE, LLTSA is a linear technique for dimensionality reduction that minimizes

the cost function of LTSA. Summarizing, LLTSA defines a neighborhood graph on the data and estimates the local

tangent space Θi at every datapoint xi. Subsequently, it forms the alignment matrix B by performing the summation

in Equation 31. The LTSA cost function (see Equation 30) is minimized in a linear manner by solving the generalized

eigenproblem

(X − X)TB(X − X)v = λ(X − X)T (X − X)v (36)

for the d smallest nonzero eigenvalues. The d eigenvectors vi form the columns of the linear mapping T , which can

be used to obtain the low-dimensional representation Y by computing Y = (X − X)T .

3.5 Global alignment of linear models

In the previous subsections, we discussed techniques that compute a low-dimensional data representation by preserving

global or local properties of the data. Techniques that perform global alignment of (local) linear models combine these

two types: they compute a collection of linear models and perform an alignment of these linear models. In this

subsection, three techniques are discussed that globally align a collection of linear models: (1) LLC, (2) manifold

charting, and (3) CFA. The techniques are discussed separately in subsubsection 3.5.1 to 3.5.3.

3.5.1 LLC

Locally Linear Coordination (LLC) [78] computes a number of locally linear models and subsequently performs a

global alignment of the linear models. This process consists of two steps: (1) computing a mixture of local linear

models on the data by means of an Expectation Maximization (EM) algorithm and (2) aligning the local linear models

in order to obtain the low-dimensional data representation by finding a linear mapping from the data models that

minimizes the LLE cost function.

LLC first constructs a mixture of m factor analyzers using the EM-algorithm [22, 33, 47]. Alternatively, a mixture of

probabilistic PCA models could be used [82]. The mixture of linear models outputs m data representations zij and

corresponding responsibilities rij (where j ∈ {1, ...,m}) for every datapoint xi. The responsibility rij describes to

what extent datapoint xi corresponds to the linear model zij , and satisfies∑

j rij = 1. Using the linear models and the

corresponding responsibilities, responsibility-weighted data representations uij = rij [zij 1] are computed. The 1 is

added to the data representations zij in order to allow translations of the data representations zij in the linear mapping

L that is computed later. The responsibility-weighted data representations uij are stored in a n×mD block matrix U .

The alignment of the local models is performed based onU and on a matrixM that is given byM = (I−W )T (I−W ).Herein, the matrix W contains the reconstruction weights computed by LLE (see subsubsection 3.3.1), and I denotes

the n× n identity matrix. LLC aligns the local models by solving the generalized eigenproblem

Av = λBv (37)

17

for the d smallest nonzero eigenvalues11. In the equation, A is the inproduct of MTU and B is the inproduct of U .

The d eigenvectors vi form a matrix L, that can be shown to define a linear mapping from the responsibility-weighted

data representation U to the underlying low-dimensional data representation Y that minimizes the LLE cost function.

The low-dimensional data representation is thus obtained by computing Y = UL.

3.5.2 Manifold charting

Similar to LLC, manifold charting constructs a low-dimensional data representation by aligning a mixture of factor

analyzers (MFA) model [13]. In constrast to LLC, manifold charting does not minimize a cost function that corre-

sponds to another dimensionality reduction technique (such as the LLE cost function). Manifold charting minimizes

a convex cost function that measures the amount of disagreement between the linear models on the global coordinates

of the datapoints. The minimization of this cost function can be performed by solving a generalized eigenproblem.

Manifold charting first performs the EM algorithm to learn a mixture of factor analyzers, in order to obtain m low-

dimensional data representations zij and corresponding responsibilities rij (where j ∈ {1, ...,m}) for all datapoints

xi. Manifold charting finds a linear mapping from the data representations zij to the global coordinates yi that mini-

mizes the cost function

φ(Y ) =n

∑

i=1

m∑

j=1

rij ‖ yi − yik ‖2 (38)

where yi =∑m

k=1 rikyik. The intuition behind the cost function is that whenever there are two linear models to which

a datapoint has a high responsibility, these linear models should agree on the final coordinate of the datapoint. The

cost function can be rewritten in the form

φ(Y ) =n

∑

i=1

m∑

j=1

m∑

k=1

rijrik ‖ yij − yik ‖2 (39)

which allows the cost function to be rewritten in the form of a Rayleigh quotient. The Rayleight quotient can be

constructed by the definition of a block-diagonal matrix D with m blocks by

D =

D1 . . . 0...

. . ....

0 . . . Dm

(40)

where Dj is the sum of the weighted covariances of the data representations zij . Hence, Dj is given by

Dj =n

∑

i=1

rij cov[zij1]

(41)

In the equation, the 1 is added to the linear models zij in order to facilitate translations in the construction of yi from

the data representations zij . Using the definition of the matrix D and the n × mD block-diagonal matrix U with

entries uij = rij [zij 1], the manifold charting cost function can be rewritten as

φ(Y ) = LT (D − UTU)L (42)

where L represents the linear mapping on the matrix Z that can be used to compute the final low-dimensional data

representation Y . The linear mapping L can thus be computed by solving the generalized eigenproblem

(D − UTU)v = λUTUv (43)

for the d smallest nonzero eigenvalues. The d eigenvectors vi form the columns of the linear combination L from [Z 1]to Y .

11The derivation of this eigenproblem can be found in [78].

18

3.5.3 CFA

Similar to LLC and manifold charting, Coordinated Factor Analysis (CFA) constructs a low-dimensional data rep-

resentation by globally aligning a mixture of factor analyzers [89]. However, CFA combines the construction of the

mixture model and the alignment into one stage, whereas LLC and manifold charting first construct the mixture model,

and subsequently align the separate linear models. The main advantage of the approach taken by CFA is that it allows

for changes in the linear models if in order to facilitate a better global alignment of the linear models. CFA is im-

plemented by means of an EM algorithm that maximizes the normal MFA log-likelihood function, minus a term that

measures to what extent the mixture p(y|xi) resembles a Gaussian. The penalty term takes the form of a Kullback-

Leibler divergence.

The penalized log-likelihood function that is maximized in CFA is given by

L =n

∑

i=1

(log p(xi) −D(qi(y) ‖ p(y|xi))) (44)

where log p(xi) represents the log-likelihood of the MFA model, D(· ‖ ·) represents a Kullback-Leibler divergence,

and qi represents a Gaussian qi = N (y; yi,Σi). The penalty term measures to what extent the mixture p(y|xi)resembles a Gaussian. In other words, the penalty term reflects the notion of agreement in the learning algorithm on

the low-dimensional data representation y. The EM algorithm that maximizes the penalized log-likelihood function

in Equation 44 is tedious, but all update rules are given by closed form equations. The update rules that implement

the EM algorithm can be found in [89]. Although CFA allows for changes in the MFA model in case this is beneficial

for the alignment of the linear model, its performance may suffer from the presence of many local maxima in the

penalized log-likelihood function.

19

4 Techniques for intrinsic dimensionality estimation

In section 3, we presented 23 techniques for dimensionality reduction. In the entire section, we assumed that the target

dimensionality of the low-dimensional data representation was specified by the user. Ideally, the target dimensionality

is set equal to the intrinsic dimensionality of the dataset. Intrinsic dimensionality is the minimum number of parameters

that is necessary in order to account for all information in the data. Several techniques have been proposed in order to

estimate the intrinsic dimensionality of a dataset X . Techniques for intrinsic dimensionality estimation can be used in

order to circumvent the problem of selecting a proper target dimensionality (e.g., using hold-out testing) to reduce the

dataset X to. In this section, we discuss six techniques for intrinsic dimensionality estimation. For convenience, we

denote the intrinsic dimensionality of the dataset X by d and its estimation by d (not to be confused with the notation

d for target dimensionality that was employed in the previous section).

Techniques for intrinsic dimensionality estimation can be subdivided into two main groups: (1) estimators based on

the analysis of local properties of the data and (2) estimators based on the analysis of global properties of the data.

In this section, we present six techniques for intrinsic dimensionality estimation. Three techniques based on the

analysis of local data properties are discussed in subsection 4.1. Subsection 4.2 presents three techniques for intrinsic

dimensionality estimation that are based on the analysis of global properties of the data.

4.1 Local estimators

Local intrinsic dimensionality estimators are based on the observation that the number of datapoints covered by a

hypersphere around a datapoint with radius r grows proportional to rd, where d is the intrinsic dimensionality of the

data manifold around that datapoint. As a result, the intrinsic dimensionality d can be estimated by measuring the

number of datapoints covered by a hypersphere with a growing radius r.

This subsection describes three local estimators for intrinsic dimensionality: (1) the correlation dimension estimator,

(2) the nearest neighbor dimension estimator, and (3) the maximum likelihood estimator. The estimators are discussed

separately in subsubsection 4.1.1 to 4.1.3.

4.1.1 Correlation dimension estimator

The correlation dimension estimator uses the the intuition that the number of datapoints in a hypersphere with radius

r is proportional to rd by computing the relative amount of datapoints that lie within a hypersphere with radius r. The

relative amount of datapoints that lie within a hypersphere with radius r is given by

C(r) =2

n(n− 1)

n∑

i=1

n∑

j=i+1

c where c =

{

1, if ‖ xi − xj ‖≤ r0, if ‖ xi − xj ‖> r

(45)

Because the value C(r) is proportional to rd, we can use C(r) to estimate the intrinsic dimensionality d of the data.

The intrinsic dimensionality d is given by the limit

d = limr→0

logC(r)

log r(46)

Since the limit in the above equation cannot be solved for explicitly, its value is estimated by computing C(r) for two

values of r. The estimation of the intrinsic dimensionality of the data is then given by the ratio

d =log(C(r2) − C(r1))

log(r2 − r1)(47)

4.1.2 Nearest neighbor estimator

Similar to the correlation dimension estimator, the nearest neighbor estimator is based on the the evaluation of the

number of neighboring datapoints that are covered by a hypersphere with radius r. In contrast to the correlation

20

dimension estimator, the nearest neighbor estimator does not explicitly count the number of datapoints inside a hy-

persphere with radius r, but it computes the minimum radius r of the hypersphere that is necessary to cover k nearest

neighbors. Mathematically, the nearest neighbor estimator computes

C(k) =1

n

∑

i

Tk(xi) (48)

where Tk(xi) represents the radius of the smallest hypersphere with center xi that covers k neighboring datapoints.

Similar to the correlation dimension estimator, the nearest neighbor estimator estimates the intrinsic dimensionality

by

d =log(C(k2) − C(k1))

log(k2 − k1)(49)

4.1.3 Maximum likelihood estimator

Similar to the correlation dimension and the nearest neighbor dimension estimator, the maximum likelihood estimator

for intrinsic dimensionality [56] estimates the number of datapoints covered by a hypersphere with a growing radius

r. In contrast to the former two techniques, the maximum likelihood estimator does so by modelling the number of

datapoints inside the hypersphere as a Poisson process. In the Poisson process, the rate of the process λ(t) at intrinsic

dimensionality d is expressed as

λ(t) =f(x)πd/2dtd−1

Γ(d/2 + 1)(50)

in which f(x) is the sampling density and Γ(·) is the gamma function. Based on the Poisson process, it can be shown

that the maximum likelihood estimation of the intrinsic dimensionality d around datapoint xi given k nearest neighbors

is given by12

dk(xi) =

1

k − 1

k−1∑

j=1

logTk(xi)

Tj(xi)

−1

(51)

in which Tk(xi) represents the radius of the smallest hypersphere with center xi that covers k neighboring datapoints.

In the original paper [56], the estimation of the intrinsic dimensionality of the entire datasetX is obtained by averaging

over the n local estimates dk(xi). However, there exist arguments indicating that a maximum likelihood estimator for

intrinsic dimensionality should not average over all estimations, but over their inverses [59].

4.2 Global estimators

Whereas local estimators for intrinsic dimensionality average over local estimates of intrinsic dimensionality, global

estimators consider the data as a whole when estimating the intrinsic dimensionality. This subsection describes three

global intrinsic dimensionality estimators: (1) the eigenvalue-based estimator, (2) the packing number estimator, and

(3) the geodesic minimum spanning tree estimator. The three global intrinsic dimensionality estimators are descibed

separately in subsubsection 4.2.1 to 4.2.3.

4.2.1 Eigenvalue-based estimator

The eigenvalue-based intrinsic dimensionality estimator performs PCA (see subsubsection 3.1.1) on the high-dimensional

dataset X and evaluates the eigenvalues corresponding to the principal components13 [32]. The values of the eigen-

values provide insight in the amount of variance that is described by their corresponding eigenvectors14. After nor-

malization, the eigenvalues can be plotted in order to provide insight in the intrinsic dimensionality of the data. In

12The derivation of the maximum likelihood estimator can be found in [56].13Please note that the eigenvalues obtained from Kernel PCA can be evaluated in exactly the same way.

14This follows directly from the theory of the Rayleigh quotient. A function of the form XT AX

XT Xis maximized by the eigenvector v1 of A that

corresponds to the largest eigenvector λ1 of A. The value of the function XT AX

XT Xat the maximum given by λ1 itself. Note that PCA maximizes a

Rayleigh quotient where A = covX−X .

21

Figure 2: Example of an eigenvalue plot.

an eigenvalue plot, the first d eigenvalues are typically high (where d is the intrinsic dimensionality of the dataset),

whereas the remaining eigenvalues are typically small (since they only model noise in the data). An example of a

typical eigenvalue plot is shown in Figure 2. The estimation of the intrinsic dimensionality d is performed by counting

the number of normalized eigenvalues that is higher than a small threshold value ǫ.

4.2.2 Packing numbers estimator

The packing numbers intrinsic dimensionality estimator [48] is based on the intuition that the r-covering numberN(r)is proportional to r−d. The r-covering number N(r) is the number of hyperspheres with radius r that is necessary to

cover all datapoints xi in the dataset X . Because N(r) is proportional to r−d, the intrinsic dimensionality of a dataset

X is given by

d = − limr→0

logN(r)

log r(52)

In general, finding the r-covering number N(r) of a dataset X is a problem that is computationally infeasible. The

packing numbers estimator circumvents this problem by using the r-packing number M(r) instead of the r-covering

number N(r). The r-packing number M(r) is defined as the maximum size of an r-separated subset of X . In other

words, the r-packing number M(r) is the maximum number of datapoints in X that can be covered by a single

hypersphere with radius r. For datasets of reasonable size, finding the r-packing number M(r) is computationally

feasible. The intrinsic dimensionality of a dataset X can be found by evaluating the limit

d = − limr→0

logM(r)

log r(53)

Because the limit cannot be evaluated explicitly, the intrinsic dimensionality of the data is estimated (similar to the

correlation dimension estimator) by

d = − log(M(r2) −M(r1))

log(r2 − r1)(54)

4.2.3 GMST estimator

The geodesic minimum spanning tree (GMST) estimator is based on the observation that the length function of a

geodesic minimum spanning tree is strongly dependent on the intrinsic dimensionality d. The GMST is the minimum

spanning tree of the neighborhood graph defined on the dataset X . The length function of the GMST is the sum of

the Euclidean distances corresponding to all edges in the geodesic minimum spanning tree. The intuition behind the

22

dependency between the length function of the GMST and the intrinsic dimensionality d is similar to the intuition

behind local intrinsic dimensionality estimators.

Similar to Isomap, the GMST estimator constructs a neighborhood graphG on the datasetX , in which every datapoint

xi is connected with its k nearest neighbors xij. The geodesic minimum spanning tree T is defined as the minimal

graph over X , which has length

L(X) = minT∈T

∑

e∈T

ge (55)

where T is the set of all subtrees of graph G, e is an edge in tree T , and ge is the Euclidean distance corresponding to

the edge e. In the GMST estimator, a number of subsets A ⊂ X of the dataset X are constructed with various sizes m,

and the lengths L(A) of the GMSTs of the subsets A are computed. Theoretically, the ratiolog L(A)log m is linear, and can

thus be estimated by a function of the form y = ax+ b. The variables a and b are estimated by means of least squares.

It can be shown that the estimated value of a provides an estimate for the intrinsic dimensionality through d = 11−a .

23

5 Examples

In the previous sections, we described 23 techniques for dimensionality reduction and 6 techniques for intrinsic di-

mensionality estimation. All these techniques are implemented in the Matlab Toolbox for Dimensionality Reduction.

In this section, we explain how the implementations in the toolbox can be used. Throughout this section, we assume

the dataset is contained in an n×D matrix X. If class labels are available for the datapoints in X, these are contained

in an n× 1 numeric vector labels.

If you encounter any errors while excuting the example code in this section, please check whether your system meets

the prerequisites for using the Matlab Toolbox for Dimensionality Reduction, and whether you installed the toolbox

properly. A list of prerequisites can be found in appendix A. Installation instructions are to be found in appendix B.

In subsection 5.1, examples of the use of intrinsic dimensionality estimators are provided. Subsection 5.2 describes

examples of performing dimensionality reduction on a dataset X. In subsection 5.3, we discuss additional features of

the toolbox.

5.1 Intrinsic dimensionality estimation

In the Matlab Toolbox for Dimensionality Reduction, the 6 intrinsic dimensionality estimators described in section 4

are available through the function intrinsic dim. Consider the following example:�[X, labels] = generate_data(’swiss’, 2000, 0.05);

scatter3(X(:,1), X(:,2), X(:,3), 5, labels); drawnow

d = intrinsic_dim(X, ’MLE’);

disp([’Intrinsic dimensionality of data: ’ num2str(d)]);� �

The code above constructs a Swiss roll dataset of 2,000 datapoints with a small amount of noise around the manifold.

The dataset is plotted in a 3D scatter plot. Subsequently, the intrinsic dimensionality of the constructed dataset is

estimated by means of the maximum likelihood estimator. The result of the code above should be similar to:

Intrinsic dimensionality of data: 1.978

Note that the true intrinsic dimensionality of the Swiss roll dataset is 2, and the maximum likelihood estimation of the

intrinsic dimensionality is thus fairly accurate.

As can be observed from the example above, the syntax of the intrinsic dim function is

d = intrinsic_dim(X, method)

where method can have the following values:

• ’CorrDim’

• ’NearNbDim’

• ’MLE’

• ’EigValue’

• ’PackingNumbers’

• ’GMST’

The default value for method is ’MLE’. Below, we present an example reveals differences between the estimators

for intrinsic dimensionality:�

[X, labels] = generate_data(’twinpeaks’, 2000, 0.05);


d1 = intrinsic_dim(X);

disp([’MLE estimation: ’ num2str(d1)]);

d2 = intrinsic_dim(X, ’CorrDim’);

disp([’Correlation dim. estimation: ’ num2str(d2)]);

d3 = intrinsic_dim(X, ’NearNbDim’);

disp([’NN dim. estimation: ’ num2str(d3)]);

d4 = intrinsic_dim(X, ’EigValue’);

disp([’Eigenvalue estimation: ’ num2str(d4)]);

24

d5 = intrinsic_dim(X, ’PackingNumbers’);

disp([’Packing numbers estimation: ’ num2str(d5)]);

d6 = intrinsic_dim(X, ’GMST’);

disp([’GMST estimation: ’ num2str(d6)]);

mu = mean([d1 d2 d3 d4 d5 d6]);

disp([’Mean estimation: ’ num2str(mu)]);� �

The code above will generate a twin peaks manifold of 2,000 datapoints with some noise around the manifold, and run

the six intrinsic dimensionality estimators on the resulting dataset. The result of the code above should be similar to:

MLE estimation: 2.1846

Correlation dim. estimation: 2.093

NN dim. estimation: 3.2165

Eigenvalue estimation: 2

Packing numbers estimation: 1.5292

GMST estimation: 2.098

Mean estimation: 2.1869

From the example above, we observe that there are differences in the estimations made by various intrinsic dimension-

ality estimators. For real-world datasets, these differences may be larger than for the artificial dataset we used in the

example.

5.2 Dimensionality reduction

In the previous subsection, we provided examples of intrinsic dimensionality estimation using the Matlab Toolbox

for Dimensionality Reduction. Based on the intrinsic dimensionality estimations (or based on knowledge on the

underlying distribution of the data), it is possible to select a proper value for the target dimensionality d. Once the

target dimensionality d is chosen, we can select a technique in order to reduce the dimensionality of the dataset X to ddimensions by means of the compute mapping function. Consider the following example:

�[X, labels] = generate_data(’twinpeaks’, 2000, 0.05);


d = round(intrinsic_dim(X));

Y = compute_mapping(X, ’PCA’, d);

figure, scatter(Y(:,1), Y(:,2), 5, labels), drawnow� �

The code above generates a twinpeaks manifold of 2,000 datapoints, estimates its intrinsic dimensionality d (which is

somewhere near 2, see subsection 5.1), and reduces the dataset to d dimensions by means of Principal Components

Analysis. The resulting data representation is shown in Figure 3.

The syntax of the compute mapping function is similar for all dimensionality reduction techniques, although the

number of parameters that can be specified differs per technique. The syntax of the compute mapping function is

[Y, M] = compute_mapping(X, method, d, pars, eig)

where method indicates the technique that is used, d represents the target dimensionality d, pars represents one or

more parameters, and eig indicates the type of iterative eigensolver that is used in spectral techniques that perform

an eigenanalysis of a sparse matrix. The compute mapping returns the low-dimensional data representation in Y.

Additional information on the computed mapping is contained in the M structure. The compute mapping supports

the following values for method:

• ’PCA’ or ’SimplePCA’ or ’ProbPCA’

• ’LDA’

• ’MDS’

• ’SPE’

• ’Isomap’ or ’LandmarkIsomap’

• ’FastMVU’

• ’KernelPCA’

25

Figure 3: PCA applied on the twinpeaks dataset.

• ’GDA’

• ’DiffusionMaps’

• ’AutoencoderRBM’ or ’AutoencoderEA’

• ’LLE’

• ’Laplacian’

• ’HessianLLE’

• ’LTSA’

• ’Conformal’

• ’MVU’

• ’LPP’

• ’NPE’

• ’LLTSA’

• ’LLC’

• ’ManifoldChart’

• ’CFA’

The default value for method is ’PCA’. The default value for the target dimensionality d is 2. Note that the di-

mensionality of Y is not necessarily equal to d. For various techniques, the rank of a matrix that is spectrally analyzed

may be lower than d, leading to a data representation Y of lower dimensionality than the original setting of d. If this

occurs, the toolbox will always show a warning. In addition, running the Isomap algorithm might reduce the num-

ber of datapoints in the low-dimensional data representation Y , because Isomap only embeds the largest connected

component in the neighborhood graph. The indices of the datapoints that are retained in the low-dimensional data

representation Y can be obtained from M.conn comp.

Some of the techniques in the toolbox are very closely related, which is why we briefly discuss their differences. The

’PCA’ setting performs the traditional PCA procedure (as described in subsubsection 3.1.1), whereas the ’SimplePCA’

applies a Hebbian learning technique [65] to estimate the eigenvectors of the covariance matrix. The ’ProbPCA’

performs probabilistic PCA by means of an EM algorithm [83]. The setting ’LandmarkIsomap’ selects a number

of landmark points from the dataset X , performs Isomap on these datapoints, and fits the remaining points into the

26

Technique Parameters

PCA none

SPCA none

PPCA i = 200LDA none

MDS none

SPE type = {[’Global’], ’Local’}Isomap k = 12Landmark Isomap k = 12FastMVU k = 12Kernel PCA κ(·, ·)GDA κ(·, ·)Diffusion maps σ = 1 t = 1SNE σ = 1Autoencoders none

LLE k = 12Laplacian Eigenmaps k = 12 σ = 1Hessian LLE k = 12LTSA k = 12Conformal Eigenmaps k = 12MVU k = 12LPP k = 12NPE k = 12 σ = 1LLTSA k = 12LLC k = 12 m = 20 i = 200Manifold charting m = 40 i = 200CFA m = 40 i = 200

Table 1: Parameters and their default values.

low-dimensional data representation [20]. In contrast, the setting ’Isomap’ performs the original Isomap algorithm

(as described in subsubsection 3.2.3), which might be computationally infeasible for large problems. The setting

’AutoencoderRBM’ trains an autoencoder by means of a pretraining with an RBM and backpropagation (see [40]

for details), whereas the setting ’AutoencoderEA’ trains an autoencoder by means of an evolutionary algorithm.

A large number of techniques for dimensionality reduction has one or more free parameters, which can be set when

applying the technique. A list of parameters and their default values is provided in Table 1. The parameters should be

provided in the same order as listed in the table. If there are two parameters that need to be set, it is possible to set

only the first one. However, if the user wants to specify the second parameter, the first parameter needs to be specified

as well. In all techniques, the parameter k may have a integer value representing the number of neighbors to use in the

neighborhood graph, but it may also be set to ’adaptive’. The ’adaptive’ setting constructs a neighborhood

graph using the technique for adaptive neighborhood selection described in [60].

For the kernel methods Kernel PCA and GDA, the kernel function κ(·, ·) can be defined through the compute mapping

function. An overview of the possible kernel functions is provided in Table 2. For datasets up to 3,000 datapoints,

Kernel PCA explicitly computes the kernel matrix K and stores it in to memory. Because of memory constraints, for

larger problems, the eigendecomposition of the kernel matrix is performed using an iterative eigensolver that does not

explicitly store the kernel matrix in memory. As a result, it is not possible to use your own kernel matrices in the

Kernel PCA implementation.

The last parameter of the compute mapping function is the eig parameter. The eig parameter applies only

to techniques for dimensionality reduction that perform a spectral analysis of a sparse matrix. Possible values are

’Matlab’ and ’JDQR’ (the default value is ’Matlab’). The ’Matlab’ setting performs the spectral analysis

using Arnoldi methods [5], whereas the setting ’JDQR’ employs Jacobi-Davidsson methods [30] for the spectral

analysis. For large datasets (where n > 10, 000), we advice the use of Jacobi-Davidsson methods for spectral analysis

of sparse matrices. The parameter eig should be the last parameter that is given to the compute mapping function.

27

Kernel Parameters Function

Linear ’linear’ kij = xTi xj

Polynomial ’poly’, a, b kij = (xTi xj + a)b

Gaussian ’gauss’, σ kij = exp(−‖xi−xj‖

2

2σ2 )

Table 2: Kernel functions in the toolbox.

Below, we provide a second example of the use of the toolbox:�

[X, labels] = generate_data(’swiss’, 2000, 0.05);

d = round(intrinsic_dim(X));

Y1 = compute_mapping(X, ’LLE’);

Y2 = compute_mapping(X, ’LLE’, d, 7);

Y3 = compute_mapping(X, ’Laplacian’, d, 7, ’JDQR’);

Y4 = compute_mapping(X, ’LTSA’, d, 7);

Y5 = compute_mapping(X, ’CCA’, d, ’Matlab’);

subplot(3, 2, 1), scatter3(X(:,1), X(:,2), X(:,3), 5, labels);

subplot(3, 2, 2), scatter(Y1(:,1), Y1(:,2), 5, labels);




subplot(3, 2, 6), scatter(Y5(:,1), Y5(:,2), 5, labels);� �

The code above generates a Swiss roll dataset of 2,000 datapoints with some noise around the data manifold, estimates

the intrinsic dimensionality of the resulting dataset, performs five techniques for dimensionality reduction on the

dataset using various settings, and plots the results. The resulting plot should look like the plot in Figure 4.

Figure 4: Dimensionality reduction on the Swiss roll dataset.

28

5.3 Additional functions

In addition to the functions for intrinsic dimensionality estimation and dimensionality reduction, the toolbox contains

functions for data generation, data prewhitening, and out-of-sample extension. The generation of a number of artificial

dataset can be performed using the generate data function, which is discussed in subsubsection 5.3.1. For data

prewhitening, the function prewhiten can be used (see subsubsection 5.3.2). Out-of-sample extension can be

performed by means of the out of sample and the out of sample est functions. The two functions for out-

of-sample extension are described in more detail in subsubsection 5.3.3.

5.3.1 Data generation

The generate data is a function that can be used to generate five different artificial datasets: (1) the Swiss roll

dataset, (2) the twinpeaks dataset, (3) the helix dataset, (4) the 3D clusters dataset, and (5) the intersecting dataset.

All artificial datasets have dimensionality 3, but their intrinsic dimensionalities vary from 1 to 3. The syntax of the

generate data function is

[X, labels] = generate_data(name, n, noise)

Herein, name is the name of the dataset. The possible values are ’swiss’, ’twinpeaks’, ’helix’, ’3d clusters’,

and ’intersect’. The default value is ’swiss’. The parameter n specifies the number of datapoints that is sam-

pled from the underlying manifold (the default value is 1,000). The parameter noise determines the amount of noise

around the data manifold (the default value is 0.05).

Below, we provide an example of the use of the generate data function:�[X, labels] = generate_data(’helix’, 2000, 0.05);


[X, labels] = generate_data(’intersect’, 2000, 0.5);


[X, labels] = generate_data(’3d_clusters’);

subplot(1, 3, 3), scatter3(X(:,1), X(:,2), X(:,3), 5, labels);� �

The plot resulting from the code above should look like Figure 5.

Figure 5: Examples of artificial datasets.

29

5.3.2 Prewhitening

In section 2, we argued that data lies on a manifold of low dimensionality that is embedded in a high-dimensional

space. In practice, the data is likely to contain noise, i.e., the data lies near a manifold of low dimensionality. The

performance of techniques for dimensionality reduction can often be improved by the removal of noise around the

data manifold. Prewhitening is a straightforward way to remove noise from the high-dimensional data. Prewhitening

performs PCA (see subsection 3.1.1) on the high-dimensional dataset, thereby retaining a large portion of the variance

in the data (typically around 95% of the variance in the data is retained).

Using the Matlab Toolbox for Dimensionality Reduction, prewhitening of data can be performed by means of the

prewhiten function. The syntax of the prewhiten function is

X = prewhiten(X)

The function is parameterless. Because of the simplicity of the prewhiten function, we dot not present example

code in which the function is used.

5.3.3 Out-of-sample extension

Until now, we have assumed that dimensionality reduction is applied in learning settings in which both training and

test data are available. In, e.g., classification systems, test data often becomes available after the training process is

completed. In order to avoid retraining every time new test data comes available, the construction of a method for

out-of-sample extension is desirable. Out-of-sample extension transforms new high-dimensional datapoints to the

low-dimensional space.

For linear techniques for dimensionality reduction, out-of-sample extension is straightforward, since the linear map-

ping M defines the transformation from the high-dimensional data representation to its low-dimensional counterpart.

For linear dimensionality reduction techniques, out-of-sample extension of a new datapoint xn+1 is thus performed by

computing

yn+1 = (xn+1 − X)M (56)

For Kernel PCA, a similar procedure can be applied, although the out-of-sample extension for Kernel PCA requires

some additional kernel computations. For autoencoders, out-of-sample extension can readily be performed, because

the trained neural network defines the transformation from the high-dimensional data representation to its counterpart

of lower dimensionality.

For other dimensionality reduction techniques, out-of-sample extension can only be performed using estimation meth-

ods [11, 81]. Because of its general applicability, we briefly discuss the out-of-sample extension method in [81]. The

method starts by idenitifying the nearest neigbor xi of the new datapoint xn+1. Subsequently, the affine transforma-

tion matrix T that maps the nearest neighbor xi to its low-dimensional counterpart yi is computed. The transformation

matrix is given by

T = (yi − yi)(xi − xi)+ (57)

where (·)+ represents the matrix pseudoinverse. Because xn+1 is near to xi, the transformation matrix T can be

applied to the new datapoint xn+1 as well. The coordinates of the new low-dimensional datapoint yn+1 are thus given

by

yn+1 = yi + T (xn+1 − xi) (58)

The advantage of the method described above is its general applicability. Its main disadvantage is that it suffers

from estimation errors, especially when the sampling rate of the data is low with respect to the curvature of the data

manifold. Using the Matlab Toolbox for Dimensionality Reduction, out-of-sample extension can be performed by

means of the out of sample and out of sample est functions.

30

The out of sample function performs an exact out-of-sample extension on new datapoints, and is there-

fore only applicable to the techniques ’PCA’, ’SimplePCA’, ’ProbPCA’, ’LDA’, ’LPP’, ’NPE’, ’LLTSA’,

’KernelPCA’, ’AutoencoderRBM’, and ’AutoencoderEA’. The syntax of the out of sample function

is

Ynew = out_of_sample(Xnew, mapping)

in which Xnew represents the set of new datapoints and mapping is a structure containing information on the per-

formed dimensionality reduction (and can be obtained from the compute mapping function). The low-dimensional

counterparts of Xnew are returned in Ynew.

Below, we provide an example of the use of the out of sample function:�[X1, lab1] = generate_data(’twinpeaks’, 1000, 0.05);

[X2, lab2] = generate_data(’twinpeaks’, 1000, 0.05);

[Y1, mapping] = compute_mapping(X1, ’PCA’);

Y2 = out_of_sample(X2, mapping);

Y = [Y1; Y2];

labels = [lab1; lab2];

scatter(Y(:,1), Y(:,2), 5, labels);� �

The code above constructs two twinpeaks datasets of 1,000 points, applies PCA on the first dataset, and performs exact

out-of-sample extension on the second dataset. The resulting plot looks similar to Figure 3.

The out of sample est fuction performs out-of-sample extension of new data by means of the general estimation

method that was described above. The syntax of the out of sample est function is

Ynew = out_of_sample_est(Xnew, X, Y)

in which Xnew represents the set of new datapoints, X is the original dataset, and Y is the low-dimensional counterpart

of X that was obtained from the compute mapping function. The low-dimensional counterpart of Xnew obtained

using the out-of-sample extimation procedure described above is returned in Ynew. Below, we provide an example of

the use of the out of sample est function:�

[X1, labels1] = generate_data(’helix’, 1000, 0.05);

[X2, labels2] = generate_data(’helix’, 1000, 0.05);

[Y1, mapping] = compute_mapping(X1, ’Laplacian’);

Y2 = out_of_sample_est(X2, X1, Y1);

Y = [Y1; Y2];

labels = [labels1; labels2];

scatter(Y(:,1), Y(:,2), 5, labels);� �

The code above constructs two helix datasets of 1,000 points, applies Laplacian Eigenmaps on the first dataset, and

performs out-of-sample extension on the second dataset using the estimation method described above. The result of

the code above is shown in Figure 6. In the plot, the estimation errors in the out-of-sample extension are clearly visible.

31

Figure 6: Out-of-sample extension using an estimation method.

6 Conclusion

The report presented a description of the techniques that are implemented in the Matlab Toolbox for Dimensionality

Reduction. In addition, we provided a number of code examples that illustrate the functionality of the toolbox. The

report did not present discuss properties of the techniques such as computational complexities, neither did it present a

theoretical or empirical comparison of the techniques. A study that presents such a comparative review can be found

in [86].

Acknowledgements

The work is supported by NWO/CATCH, project RICH, under grant 640.002.401. I would like to thank Eric Postma,

Stijn Vanderlooy, and Guido de Croon for their comments on early versions of parts of this report.

32

References

[1] G. Afken. Gram-Schmidt Orthogonalization. Academic Press, Orlando, FL, USA, 1985.

[2] D.K. Agrafiotis. Stochastic proximity embedding. Journal of Computational Chemistry, 24(10):1215–1221,

2003.

[3] T.W. Anderson. Asymptotic theory for principal component analysis. Annals of Mathematical Statistics, 34:122–

148, 1963.

[4] W.N. Anderson and T.D. Morley. Eigenvalues of the Laplacian of a graph. Linear and Multilinear Algebra,

18:141–145, 1985.

[5] W.E. Arnoldi. The principle of minimized iteration in the solution of the matrix eigenvalue problem. Quarterly

of Applied Mathematics, 9:17–25, 1951.

[6] M. Balasubramanian and E.L. Schwartz. The Isomap algorithm and topological stability. Science, 295(5552):7,

2002.

[7] G. Baudat and F. Anouar. Generalized discriminant analysis using a kernel approach. Neural Computation,

12(10):2385–2404, 2000.

[8] M. Belkin and P. Niyogi. Laplacian Eigenmaps and spectral techniques for embedding and clustering. In Ad-

vances in Neural Information Processing Systems, volume 14, pages 585–591, Cambridge, MA, USA, 2002. The

MIT Press.

[9] A.J. Bell and T.J. Sejnowski. An information maximization approach to blind separation and blind deconvolution.

Neural Computation, 7(6):1129–1159, 1995.

[10] Y. Bengio, O. Delalleau, N. Le Roux, J.-F. Paiement P. Vincent, and M. Ouimet. Learning eigenfunctions links

spectral embedding and Kernel PCA. Neural Computation, 16(10):2197–2219, 2004.

[11] Y. Bengio, J.-F. Paiement, P. Vincent, O. Delalleau, N. Le Roux, and M. Ouimet. Out-of-sample extensions for

LLE, Isomap, MDS, eigenmaps, and spectral clustering. In Advances in Neural Information Processing Systems,

volume 16, Cambridge, MA, USA, 2004. The MIT Press.

[12] C.M. Bishop, M. Svensen, and C. Williams. GTM: The generative topographic mapping. Neural Computation,

10(1):215–234, 1998.

[13] M. Brand. Charting a manifold. In Advances in Neural Information Processing Systems, volume 15, pages

985–992, Cambridge, MA, USA, 2002. The MIT Press.

[14] M. Brand. From subspaces to submanifolds. In Proceedings of the 15thBritish Machine Vision Conference,

London, UK, 2004. British Machine Vision Association.

[15] H.-P. Chan. Computer-aided classification of mammographic masses and normal tissue: linear discriminant

analysis in texture feature space. Phys. Med. Biol., 40:857–876, 1995.

[16] H. Chang, D.-Y. Yeung, and Y. Xiong. Super-resolution through neighbor embedding. In IEEE Computer Society

Conference on Computer Vision and Pattern Recognition, volume 1, pages 275–282, 2004.

[17] K.-Y. Chang and J. Ghosh. Principal curves for nonlinear feature extraction and classification. In Applications

of Artificial Neural Networks in Image Processing III, pages 120–129, Bellingham, WA, USA, 1998. SPIE.

[18] H. Choi and S. Choi. Robust kernel Isomap. Pattern Recognition, 40(3):853–862, 2007.

[19] T. Cox and M. Cox. Multidimensional scaling. Chapman & Hall, London, UK, 1994.

33

[20] V. de Silva and J.B. Tenenbaum. Global versus local methods in nonlinear dimensionality reduction. In Advances

in Neural Information Processing Systems, volume 15, pages 721–728, Cambridge, MA, USA, 2003. The MIT

Press.

[21] D. DeMers and G. Cottrell. Non-linear dimensionality reduction. In Advances in Neural Information Processing

Systems, volume 5, pages 580–587, San Mateo, CA, USA, 1993. Morgan Kaufmann.

[22] A. Dempster, N. Laird, and D. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal

of the Royal Statistical Society, Series B, 39(1):1–38, 1977.

[23] E.W. Dijkstra. A note on two problems in connexion with graphs. Numerische Mathematik, 1:269–271, 1959.

[24] D.L. Donoho and C. Grimes. Hessian eigenmaps: New locally linear embedding techniques for high-dimensional

data. Proceedings of the National Academy of Sciences, 102(21):7426–7431, 2005.

[25] R.O. Duda, P.E. Hart, and D.G. Stork. Pattern Classification. Wiley Interscience Inc., 2001.

[26] R. Duraiswami and V.C. Raykar. The manifolds of spatial hearing. In Proceedings of International Conference

on Acoustics, Speech and Signal Processing, volume 3, pages 285–288, 2005.

[27] C. Faloutsos and K.-I. Lin. FastMap: A fast algorithm for indexing, data-mining and visualization of traditional

and multimedia datasets. In Proceedings of the 1995 ACM SIGMOD International Conference on Management

of Data, pages 163–174, New York, NY, USA, 1995. ACM Press.

[28] R.A. Fisher. The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7:179–188, 1936.

[29] R.W. Floyd. Algorithm 97: Shortest path. Communications of the ACM, 5(6):345, 1962.

[30] D.R. Fokkema, G.L.G. Sleijpen, and H.A. van der Vorst. Jacobi–Davidson style QR and QZ algorithms for the

reduction of matrix pencils. SIAM Journal on Scientific Computing, 20(1):94–125, 1999.

[31] K. Fukunaga. Introduction to Statistical Pattern Recognition. Academic Press Professional, Inc., San Diego,

CA, USA, 1990.

[32] K. Fukunaga and D.R. Olsen. An algorithm for finding intrinsic dimensionality of data. IEEE Transactions on

Computers, C-20:176–183, 1971.

[33] Z. Ghahramani and G.E. Hinton. The EM algorithm for mixtures of factor analyzers. Technical Report CRG-

TR-96-1, Department of Computer Science, University of Toronto, 1996.

[34] R. Haeb-Umbach and H. Ney. Linear discriminant analysis for improved large vocabulary continuous speech

recognition. In IEEE International Conference on Acoustics, Speech, and Signal Processing, 1992, volume 1,

pages 13–16, 1992.

[35] J. Ham, D. Lee, S. Mika, and B. Scholkopf. A kernel view of the dimensionality reduction of manifolds. Tech-

nical Report TR-110, Max Planck Institute for Biological Cybernetics, Germany, 2003.

[36] X. He and P. Niyogi. Locality preserving projections. In Advances in Neural Information Processing Systems,

volume 16, page 37, Cambridge, MA, USA, 2004. The MIT Press.

[37] X. He, S. Yan, Y. Hu, P. Niyogi, and H.-J. Zhang. Face recognition using laplacianfaces. IEEE Transactions on

Pattern Analysis and Machine Intelligence, 27(3):328–340, 2005.

[38] G.E. Hinton, S. Osindero, and Y. Teh. A fast learning algorithm for deep belief nets. Neural Computation,

18(7):1527–1554, 2006.

[39] G.E. Hinton and S.T. Roweis. Stochastic Neighbor Embedding. In Advances in Neural Information Processing

Systems, volume 15, pages 833–840, Cambridge, MA, USA, 2002. The MIT Press.

34

[40] G.E. Hinton and R.R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science,

313(5786):504–507, 2006.

[41] H. Hoffmann. Kernel PCA for novelty detection. Pattern Recognition, 40(3):863–874, 2007.

[42] J.H. Holland. Adaptation in Natural and Artificial Systems. University of Michigan Press, Ann Arbor, MI, USA,

1975.

[43] H. Hotelling. Analysis of a complex of statistical variables into principal components. Journal of Educational

Psychology, 24:417–441, 1933.

[44] R. Huber, H. Ramoser, K. Mayer, H. Penz, and M. Rubik. Classification of coins using an eigenspace approach.

Pattern Recognition Letters, 26(1):61–75, 2005.

[45] O.C. Jenkins and M.J. Mataric. Deriving action and behavior primitives from human motion data. In Interna-

tional Conference on Intelligent Robots and Systems, volume 3, pages 2551–2556, 2002.

[46] L.O. Jimenez and D.A. Landgrebe. Supervised classification in high-dimensional space: geometrical,statistical,

and asymptotical properties of multivariate data. IEEE Transactions on Systems, Man and Cybernetics, 28(1):39–

54, 1997.

[47] N. Kambhatla and T.K. Leen. Dimension reduction by local principal component analysis. Neural Computation,

9(7):1493–1516, 1997.

[48] B. Kegl. Intrinsic dimension estimation based on packing numbers. In Advances in Neural Information Process-

ing Systems, volume 15, pages 833–840, Cambridge, MA, USA, 2002. The MIT Press.

[49] K.I. Kim, K. Jung, and H.J. Kim. Face recognition using kernel principal component analysis. IEEE Signal

Processing Letters, 9(2):40–42, 2002.

[50] S. Kirkpatrick, C.D. Gelatt, and M.P. Vecchi. Optimization by simulated annealing. Science, 220(4598):671–680,

1983.

[51] T. Kohonen. Self-organization and associative memory. Springer-Verlag, Berlin, Germany, 1988.

[52] J.B. Kruskal. Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis. Psychometrika,

29:1–27, 1964.

[53] S.Y. Kung, K.I. Diamantaras, and J.S. Taur. Adaptive Principal component EXtraction (APEX) and applications.

IEEE Transactions on Signal Processing, 42(5):1202–1217, 1994.

[54] S. Lafon and A.B. Lee. Diffusion maps and coarse-graining: A unified framework for dimensionality reduc-

tion, graph partitioning, and data set parameterization. IEEE Transactions on Pattern Analysis and Machine

Intelligence, 28(9):1393–1403, 2006.

[55] J.A. Lee and M. Verleysen. Nonlinear dimensionality reduction of data manifolds with essential loops. Neuro-

computing, 67:29–53, 2005.

[56] E. Levina and P.J. Bickel. Maximum likelihood estimation of intrinsic dimension. In Advances in Neural

Information Processing Systems, volume 17, Cambridge, MA, USA, 2004. The MIT Press.

[57] I.S. Lim, P.H. Ciechomski, S. Sarni, and D. Thalmann. Planar arrangement of high-dimensional biomedical data

sets by Isomap coordinates. In Proceedings of the 16th IEEE Symposium on Computer-Based Medical Systems,

pages 50–55, 2003.

[58] A. Lima, H. Zen, Y. Nankaku, C. Miyajima, K. Tokuda, and T. Kitamura. On the use of Kernel PCA for feature

extraction in speech recognition. IEICE Transactions on Information Systems, E87-D(12):2802–2811, 2004.

35

[59] D.J.C. MacKay and Z. Ghahramani. Comments on ’maximum likelihood estimation of intrinsic dimension’.

[60] N. Mekuz and J.K. Tsotsos. Parameterless Isomap with adaptive neighborhood selection. In Proceedings of the

28th DAGM Symposium, pages 364–373, Berlin, Germany, 2006. Springer.

[61] B. Nadler, S. Lafon, R.R. Coifman, and I.G. Kevrekidis. Diffusion maps, spectral clustering and the reaction

coordinates of dynamical systems. Applied and Computational Harmonic Analysis: Special Issue on Diffusion

Maps and Wavelets, 21:113–127, 2006.

[62] K. Nam, H. Je, and S. Choi. Fast Stochastic Neighbor Embedding: A trust-region algorithm. In Proceedings

of the IEEE International Joint Conference on Neural Networks 2004, volume 1, pages 123–128, Budapest,

Hungary, 2004.

[63] A. Ng, M. Jordan, and Y. Weiss. On spectral clustering: Analysis and an algorithm. In Advances in Neural

Information Processing Systems, volume 14, pages 849–856, Cambridge, MA, USA, 2001. The MIT Press.

[64] M. Niskanen and O. Silven. Comparison of dimensionality reduction methods for wood surface inspection.

In Proceedings of the 6th International Conference on Quality Control by Artificial Vision, pages 178–188,

Gatlinburg, TN, USA, 2003. International Society for Optical Engineering.

[65] M. Partridge and R. Calvo. Fast dimensionality reduction and Simple PCA. Intelligent Data Analysis, 2(3):292–

298, 1997.

[66] A.M. Posadas, F. Vidal, F. de Miguel, G. Alguacil, J. Pena, J.M. Ibanez, and J. Morales. Spatial-temporal analysis

of a seismic series using the principal components method. Journal of Geophysical Research, 98(B2):1923–1932,

1993.

[67] M.L. Raymer, W.F. Punch, E.D. Goodman, L.A. Kuhn, and A.K. Jain. Dimensionality reduction using genetic

algorithms. IEEE Transactions on Evolutionary Computation, 4:164–171, 2000.

[68] B. Raytchev, I. Yoda, and K. Sakaue. Head pose estimation by nonlinear manifold learning. In Proceedings of

the 17th ICPR, pages 462–466, 2004.

[69] S.T. Roweis, L. Saul, and G. Hinton. Global coordination of local linear models. In Advances in Neural Infor-

mation Processing Systems, volume 14, pages 889–896, Cambridge, MA, USA, 2001. The MIT Press.

[70] S.T. Roweis and L.K. Saul. Nonlinear dimensionality reduction by Locally Linear Embedding. Science,

290(5500):2323–2326, 2000.

[71] A. Saxena, A. Gupta, and A. Mukerjee. Non-linear dimensionality reduction by locally linear isomaps. Lecture

Notes in Computer Science, 3316:1038–1043, 2004.

[72] B. Scholkopf, A.J. Smola, and K.-R. Muller. Nonlinear component analysis as a kernel eigenvalue problem.

Neural Computation, 10(5):1299–1319, 1998.

[73] F. Sha and L.K. Saul. Analysis and extension of spectral methods for nonlinear dimensionality reduction. In

Proceedings of the 22nd International Conference on Machine Learning, pages 785–792, 2005.

[74] J. Shawe-Taylor and N. Christianini. Kernel Methods for Pattern Analysis. Cambridge University Press, Cam-

bridge, UK, 2004.

[75] J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Transaction on Pattern Analysis and

Machine Intelligence, 22(8):888–905, 2000.

[76] J.A.K. Suykens. Data visualization and dimensionality reduction using kernel maps with a reference point.

Technical Report 07-22, ESAT-SISTA, K.U. Leuven, 2007.

36

[77] G.A. Tagaris, W. Richter, S.G. Kim, G. Pellizzer, P. Andersen, K. Ugurbil, and A.P. Georgopoulos. Functional

magnetic resonance imaging of mental rotation and memory scanning: a multidimensional scaling analysis of

brain activation patterns. Brain Res., 26(2-3):106–12, 1998.

[78] Y.W. Teh and S.T. Roweis. Automatic alignment of hidden representations. In Advances in Neural Information

Processing Systems, volume 15, pages 841–848, Cambridge, MA, USA, 2002. The MIT Press.

[79] J.B. Tenenbaum. Mapping a manifold of perceptual observations. In Advances in Neural Information Processing


[80] J.B. Tenenbaum, V. de Silva, and J.C. Langford. A global geometric framework for nonlinear dimensionality

reduction. Science, 290(5500):2319–2323, 2000.

[81] L. Teng, H. Li, X. Fu, W. Chen, and I.-F. Shen. Dimension reduction of microarray data based on local tangent

space alignment. In Proceedings of the 4th IEEE International Conference on Cognitive Informatics, pages

154–159, 2005.

[82] M.E. Tipping. Sparse kernel principal component analysis. In Advances in Neural Information Processing


[83] M.E. Tipping and C.M. Bishop. Probabilistic prinicipal component analysis. Technical Report NCRG/97/010,

Neural Computing Research Group, Aston University, 1997.

[84] K. Torkkola. Linear discriminant analysis in document classification. In IEEE ICDM-2001 Workshop on Text

Mining, pages 800–806, 2001.

[85] M.A. Turk and A.P. Pentland. Face recognition using eigenfaces. In Proceedings of the Computer Vision and

Pattern Recognition 1991, pages 586–591, 1991.

[86] L.J.P. van der Maaten, E.O. Postma, and H.J. van den Herik. Dimensionality reduction: A comparative review.

IEEE Transactions on Pattern Analysis and Machine Intelligence (submitted), 2007.

[87] L. Vandenberghe and S. Boyd. Semidefinite programming. SIAM Review, 38(1):49–95, 1996.

[88] M.S. Venkatarajan and W. Braun. New quantitative descriptors of amino acids based on multidimensional scaling

of a large number of physicalchemical properties. Journal of Molecular Modeling, 7(12):445–453, 2004.

[89] J. Verbeek. Learning nonlinear image manifolds by global alignment of local linear models. IEEE Transactions

on Pattern Analysis and Machine Intelligence, 28(8):1236–1250, 2006.

[90] K.Q. Weinberger, B.D. Packer, and L.K. Saul. Nonlinear dimensionality reduction by semidefinite program-

ming and kernel matrix factorization. In Proceedings of the 10th International Workshop on AI and Statistics,

Barbados, WI, 2005. Society for Artificial Intelligence and Statistics.

[91] Y. Weiss. Segmentation using eigenvectors: a unifying view. In Proceedings of the IEEE International Confer-

ence on Computer Vision, volume 2, pages 975–982, Los Alamitos, CA, USA, 1999. IEEE Computer Society

Press.

[92] T. Zhang, J. Yang, D. Zhao, and X. Ge. Linear local tangent space alignment and application to face recognition.

Neurocomputing, 70:1547–1533, 2007.

[93] Z. Zhang and H. Zha. Principal manifolds and nonlinear dimensionality reduction via local tangent space align-

ment. SIAM Journal of Scientific Computing, 26(1):313–338, 2004.

37

A Prerequisites

The prerequisites for the installation of the Matlab Toolbox for Dimensionality Reduction are:

• A working installation of Matlab 7.0 (R14) or a newer version of Matlab.

• A working installation of the Statistics Toolbox.

• On platforms for which no binaries are provided (e.g., PowerPC Mac and Solaris): a proper setup of mex. Type

mex -setup in Matlab to perform the setup of mex.

B Installation instructions

Installation of the Matlab Toolbox for Dimensionality Reduction is fairly straightforward. A step-by-step installation

guide is provided below.

• Decompress the file drtoolbox.tar.gz. Under Windows, the decompression can be performed using a tool

such as WinRar. Under Linux and Mac OS X, the decompression can be performed by opening a new Terminal

window, navigating to the location of the drtoolbox.tar.gz file, and executing the command tar -xvf

drtoolbox.tar.gz.

• Decompression of the file drtoolbox.tar.gz results in a new directory with the name drtoolbox. Al-

though you can put this directory in any location you like, we advice you to move the drtoolbox directory to

the standard Matlab toolbox directory, which is located in $MATLAB ROOT/toolbox.

• In order to be able to use the Matlab Toolbox for Dimensionality Reduction successfully, Matlab should be

notified about its presence. You do this by starting Matlab, and navigating to File -> Set Path... Now

click the button Add with Subfolders... and navigate to the location of the drtoolbox directory.

Click OK and subsequently, save your changes to the Matlab path by clicking the Save button. The toolbox is

now ready for use.

• The toolbox employs several functions that were implemented in C or C++ in order to speed up the techniques.

Precompiled versions for all main platforms are included in the toolbox, but a compiled version for your platform

might be either lacking or not have been optimized for your platform. In order to recompile all C and C++ files

on your platform, please execute the following code�cd([matlabroot ’/toolbox/drtoolbox’]);

mexall

� �

During the run of the mexall function, Matlab might display one or more warnings, which can be ignored. If

no errors are displayed during the run of mexall, all C and C++ files in the toolbox were successfully compiled.

38

Technical Reports in Computer Science The following reports have appeared in this series: CS 89-01 S.T. Dekker Complexity Starts at Five (14 pages).

H.J. van den Herik I.S. Herschberg

CS 89-02 H.J. van den Herik A Six-Men-Endgame Database: KRP(a2)KbBP(a3) I.S. Herschberg (18 pages). N. Nakad

CS 89-03 I.S. Herschberg Verifying and Codifying Strategies in

H.J. van den Herik a Chess Endgame (11 pages).

P.N.A. Schoo CS 89-04 J.W.H.M. Uiterwijk A Knowledge-Based Approach to Connect-Four

H.J. van den Herik The Game is Over: White to Move Wins! (21 pages). L.V. Allis

CS 89-05 P.J. Braspenning De Modellering van Complexe Objecten de route-planning voor INtelligent CAse (11 pages).

CS 90-01 M. van der Meulen Conspiracy-Number Search (12 pages). CS 90-02 J. Henseler Training Complex Multi-Layer Neural Networks (5 pages).

P.J. Braspenning CS 90-03 L.V. Allis !" Conspiracy-Number Search (18 pages).

M. van der Meulen

H.J. van den Herik CS 90-04 L.V. Allis Which Games Will Survive? (9 pages).

H.J. van den Herik I.S. Herschberg

CS 90-05 M. van der Meulen Lithidion: an Awari-playing Program (23 pages). L.V. Allis H.J. van den Herik

CS 90-06 P.J. Braspenning The Development of an Object-Oriented Representation

J.W.H.M. Uiterwijk System for Complex Objects in Analysis and Design: the

H. Bakker INCA-project (61 pages). L.C.J. van Leeuwen

CS 91-01 L.V. Allis Proof-Number Search (33 pages). M. van der Meulen H.J. van den Herik

CS 91-02 J.H.J. Lenting Planning with Artificial Intelligence (20 pages).

P.J. Braspenning

CS 91-03 J.H.J. Lenting Issues of Representation in AI Planning (31 pages).

P.J. Braspenning

CS 91-04 J.H.J. Lenting Issues of Control in AI Planning (33 pages).

P.J. Braspenning

CS 91-05 J. Henseler Membrain: A Cellular Neural Network based on a

P.J. Braspenning Vibrating Membrance (16 pages).

CS 91-06 A. Weijters Spraaksynthese met behulp van Artificiële Neurale

G. Hoppenbrouwers Netwerken (41 pages).

J. Thole R. van Nieuwenhoven

CS 91-07 H. Velthuijsen A Formalized Concept of the Blackboard Architecture P.J. Braspenning (45 pages).

CS 92-01 J. Hage MASC User Interface Guidelines (29 pages).

M. van der Meulen

CS 92-02 M. van der Meulen The HUMF Library (18 pages).

J. Hage

CS 92-03 L.V. Allis Go-Moku Opgelost door Nieuwe Zoektechnieken

H.J. van den Herik (15 pages).

CS 92-04 L.V. Allis A Knowledge-based Approach of Connect-Four (93 pages). (second edition)

CS 92-05 L.V. Allis 64. No Threats (14 pages).

H.J. van den Herik

M.M. Wind CS 92-07 J.H.J. Lenting The Answer To The Frame Problem:

P.J. Braspenning Forty-Two (24 pages). CS 93-01 D.M. Breuker Syllabification Using Expert Rules

L.V. Allis (16 pages). H.J. van den Herik

CS 93-02 L.V. Allis Go-Moku and Threat-Space Search (11 pages). H.J. van den Herik M.P.H. Huntjens

CS 93-03 H. Iida Opponent-Model Search (10 pages).

J.W.H.M. Uiterwijk

H.J. van den Herik CS 94-01 G. van Liempd Predicting the behavior of search algorithms

H.J. van den Herik on CSP problems (24 pages). CS 94-02 P.J. Braspenning Multiple Scattering Theory in a Generalized Fashion

A. Lodder A new approach to Artificial Neural Networks (28 pages). J.P. Dekker

CS 94-03 G.A.W. Vreeswijk IACAS: an interactive argumentation system (25 pages).

CS 94-04 A.D.M. Wan Limits to Ground Control in Autonomous Spacecraft P.J. Braspenning (15 pages). G.A.W. Vreeswijk

CS 95-01 G.A.W. Vreeswijk Open Protocols in Multi-Agent Systems (31 pages).

CS 95-02 G.A.W. Vreeswijk Formalizing Nomic: working on a theory of communication with modifiable rules of procedure (15 pages).

CS 95-03 G.A.W. Vreeswijk Self-government in multi-agent systems: experiments and thought-experiments (12 pages).

CS 95-04 G. van Liempd Predicting the behavior of search algorithms on CSP H.J. van den Herik problems (continued) (33 pages).

CS 95-05 G.A.W. Vreeswijk Interpolation of Benchmark Problems in Defeasible Reasoning (19 pages).

CS 95-06 G.A.W. Vreeswijk Several experiments in elementary self- modifying protocol games, such as Nomic (47 pages).

CS 95-07 G.A.W. Vreeswijk Self-modifying Protocol Games: A Preparatory Case Study

(research notes) (40 pages). CS 95-08 G.A.W. Vreeswijk Emergent Political Structures in Abstract Forms of

Rule-making and Legislation (37 pages). CS 95-09 G.A.W. Vreeswijk Representation of Formal Dispute with a Standing Order

(22 pages). CS 95-10 A.J.M.M. Weijters The BP-SOM Architecture and Learning Rule (11 pages).

CS 95-11 E.O. Postma SCAN: A Scalable Model of Attentional Selection


P.T.W. Hudson CS 96-01 A.J.M.M. Weijters Applying Ockham's Razor to Back-propagation (18 pages).

A.P.J. van den Bosch H.J. van den Herik

CS 96-02 R.W. van der Pol A device for query composition (39 pages). CS 96-03 K.M. Sim A Framework of Characterizing Multi-agent Systems:

P.J. Braspenning Cognition, Communication and Social Theory (12 pages). CS 96-04 E.N. Smirnov Disjunctive Version Space Approach

P.J. Braspenning to Concept Learning (34 pages).

CS 97-01 A.D.M. Wan State Identification in Dynamic Goal-State Learning Tasks P.J. Braspenning (6 pages).

CS 97-02 D.M. Breuker A Solution to the GHI Problem for Best-First Search


L.V. Allis J.W.H.M. Uiterwijk

CS 97-03 H.H.L.M. Donkers Beslissen bij Onzekerheid (95 pages). CS 97-04 H.H.L.M. Donkers Markov Decision Networks (M.Sc. Thesis) (89 pages).

CS 97-05 I.D. Craig VESUVIUS: Modular Reflection and Messages (10 pages).

CS 98-01 P.J. Braspenning Characterization and Design of Multi-agent Systems: K.M. Sim Theories and Testbeds (19 pages).

CS 98-02 H.H.L.M. Donkers Introduction to Markov Decision Networks (15 pages). J.W.H.M. Uiterwijk H.J. van den Herik

CS 98-03 H.J. van den Herik Kennistechnologie en Kunstmatige Intelligentie (26 pages)

P.J. Braspenning

I.M.C. Lemmens CS 98-04 H.H.L.M. Donkers Time Complexity of the Incremental-Pruning Algorithm for

J.W.H.M. Uiterwijk Markov Decision Networks (13 pages). H.J. van den Herik

CS 98-05 D.M. Breuker Solving Domineering (16 pages). J.W.H.M. Uiterwijk H.J. van den Herik

CS 98-06 J.W.H.M. Uiterwijk The Advantage of the Initiative (12 pages).

H.J. van den Herik

CS 98-07 H.H.L.M. Donkers Techniques for decision making in the LWI-domain

H.J. van den Herik (INDEKS-project, phase 3) (14 pages).

J.W.H.M. Uiterwijk CS 99-01 I.M.C Lemmens A Formal Analysis of Smithsonian Computational Reflection (11 pages).

P.J. Braspenning CS 99-02 I.M.C Lemmens Computational Reflection: Different Theories compared (23 pages)

P.J. Braspenning CS 99-03 N. Roos On the relation between information and subjective probability

CS 99-04 D.M. Breuker The pn

2 –search algorithm (16 pages)

J.W.H.M Uiterwijk

H.J. van den Herik CS 99-05 T. Carati Effectiviteitsonderzoek naar de kennisoverdracht van I&E Milieu (80 G.I. Holle pages)

J. Hoonhout L. A. Plugge

CS 99-06 A. Bossi Properties of Input-Consuming Derivations (18 pages) S. Etalle S. Rossi

CS 99-07 S. Etalle (Ed.) Benelog99 – Proceedings of the Eleventh Benelux Workshop on Logic Programming

CS 99-08 S. Etalle (Ed.) Verification of Logic Programs – Proceedings of the 1999 Workshop on J.-G. Smaus (Ed.) Verification of Logic Programs

CS 99-09 A. Lukito Vertex Partitions of Hypercubes into Snakes (12 pages) A.J. van Zanten

CS 99-10 J. Lenting On the inadequacy of MAS notions of rationality for multi-agent technology (19

pages)

CS 99-11 J. Lenting Drawbacks of Pareto optimality, strategy proofness, and incentive compatibility in the context of MAS mechanism design (12 pages)

CS 00-01 A. Bossi, S. Etalle, S. Rossi Semantics of Input-Consuming Programs (19 pages)

CS 00-02 S. Etalle and J. Mountjoy The (Lazy) Functional Side of Logic Programming (22 pages)

CS 00-03 J.W.H.M. Uiterwijk The Fifth Computer Olympiad. Computer-Games Workshop Proceedings

CS 00-04 A.J. van Zanten, A. Lukito Covers of Hypercubes with Symmetric Snakes (15 pages) CS 00-05 H.H.L.M. Donkers, Beta-Pruning and Opponent-Model Search (14 pages)

J.W.H.M. Uiterwijk, H. Iida and H.J. van den Herik

CS 01-01 A.J. van Zanten Iterative Solutions for the Tower of Hanoi Problem with Four Pegs (21 pages)

CS 01-02 I.G. Sprinkhuizen-Kuyper Artificial Evolution of Box-pushing Behaviour (26 pages) CS 01-03 J.P.G.M. Verbeek Euro-Info (51 pages)

CS 01-04 J.W.H.M. Uiterwijk The CMG Sixth Computer Olympiad. Computer-Games Workshop Proceedings

CS 01-05 H.J.van den Herik, Games Solved: Now and in the Future (44 pages) J.W.H.M. Uiterwijk

J. van Rijswijck CS 02-05 N. Roos, ProANITA April 2005. Eindrapport (51 pages)

P.H.M. Spronck E.N. Smirnov

MICC 06-01 A.J. van Zanten Covers and Near-Covers of the Hypercube Q16 by Symmetric Snakes L. Haryanto (59 pages)

MICC 06-02 G. de Croon, I.G., Sprinkhuizen-Kuyper, Comparing Active Vision Models (28 pages) E.O. Postma

MICC 06-03 A.J. van Zanten A 2

m–1-cover of Snakes

for Qn, 2

m–1 < n !2

m (50 pages)

L. Haryanto

MICC 07-01 S. de Jong, K. Tuyls, N. Roos, A Descriptive Model of Priority-Based Fairness (14 pages) K. Verbeeck

MICC 07-02 S. Vanderlooy, An Overview of Algorithmic Randomness and its Application to I.G. Sprinkhuizen-Kuyper Reliable Instance Classification (21 pages)

MICC 07-03 S. Vanderlooy, Off-Line Learning with Transductive Confidence Machines: L. van der Maaten, an Empirical Evaluation (36 pages)

I.G. Sprinkhuizen-Kuyper MICC 07-04 S. de Jong, M. Ponsen, Proceedings ALAMAS 2007 Conference (183 pages)

K. Tuyls, K. Verbeeck MICC 07-05 N. Roos, C. Witeveen Diagnosis of Plan Structure Violations (15 pages)

MICC 07-06 H.J. van den Herik, J.W.H.M. Proceedings Computer Games Workshop 2007 (243 pages) Uiterwijk, M. Winands and M. Schadd

Copies of these reports may be obtained from the secretarial office of MICC/IKAT, Universiteit Maastricht, P.O. Box 616, 6200 MD

Maastricht, The Netherlands. E-mail: [email protected]

An Introduction to Dimensionality Reduction Using...

Documents

Transcript of An Introduction to Dimensionality Reduction Using...