
Principal Component Analysis

Machine Learning

Last Time

• Expectation Maximization in Graphical Models – Baum-Welch

Now

• Unsupervised Dimensionality Reduction

Curse of Dimensionality

• In (nearly) all modeling approaches, more features (dimensions) require (a lot) more data – Typically exponential in the number of features

• This is clearly seen when filling in a probability table.

• Topological arguments are also made – compare the volume of a hypersphere inscribed in a hypercube to the volume of the hypercube itself (see the sketch below).
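A rough numerical illustration of that argument (a minimal Python sketch, not part of the original slides): the fraction of the unit hypercube occupied by its inscribed hypersphere collapses toward zero as the number of dimensions grows, so uniformly distributed points increasingly sit in the "corners".

import math

# Volume of a d-dimensional ball of radius r: pi^(d/2) * r^d / Gamma(d/2 + 1).
# The unit hypercube has volume 1, so for r = 1/2 this is also the fraction
# of the cube occupied by the inscribed ball.
def inscribed_sphere_fraction(d, r=0.5):
    return math.pi ** (d / 2) * r ** d / math.gamma(d / 2 + 1)

for d in (1, 2, 3, 5, 10, 20):
    print(d, inscribed_sphere_fraction(d))
# Roughly: 1.0, 0.785, 0.524, 0.164, 0.0025, 2.5e-8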

Dimensionality Reduction

• We’ve already seen some of this.

• Regularization attempts to reduce the number of effective features used in linear and logistic regression classifiers

Linear Models

• When we regularize, we optimize a function that ignores as many features as possible.

• The “effective” number of dimensions is much smaller than D, as the sketch below illustrates.
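As a concrete illustration of that point – a minimal scikit-learn sketch with made-up data (not from the slides) – L1 regularization drives most coefficients to exactly zero, so the effective dimensionality is far below D.

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, D = 100, 50
X = rng.normal(size=(n, D))
# Only the first three of the D features actually influence the target.
y = 2 * X[:, 0] - 3 * X[:, 1] + 1.5 * X[:, 2] + 0.1 * rng.normal(size=n)

model = Lasso(alpha=0.1).fit(X, y)
print(int(np.sum(model.coef_ != 0)), "of", D, "coefficients are non-zero")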

Support Vector Machines

• In exemplar approaches (SVM, k-nn) each data point can be considered to describe a dimension.

• By keeping only the instances that define the maximum margin – and setting α to zero for all the others – SVMs use only a subset of the available dimensions in their decision making (see the sketch below).
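A small sketch of that idea (scikit-learn, synthetic data – assumptions for illustration only): after training, only the support vectors, i.e. the points whose α is non-zero, are retained in the decision function.

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

clf = SVC(kernel="linear", C=1.0).fit(X, y)
# Training points with alpha = 0 could be discarded without changing the model.
print(len(clf.support_vectors_), "support vectors out of", len(X), "training points")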

Decision Trees

• Decision Trees explicitly select split points based on features that improve Information Gain or Accuracy.

• Features that don’t contribute sufficiently to the classification are never used (see the example tree and the sketch below).

[Figure: example decision tree that splits first on weight < 165 and then on height < 68, with leaves containing 5M, 5F, and 1F / 1M.]
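A minimal sketch along the same lines (scikit-learn, with synthetic weight/height data plus an irrelevant noise feature – all assumed for illustration): the uninformative feature ends up with zero importance because it is never chosen for a split.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 500
weight = rng.normal(165, 20, size=n)
height = rng.normal(68, 4, size=n)
noise = rng.normal(size=n)                      # carries no signal
X = np.column_stack([weight, height, noise])
y = ((weight > 165) & (height > 68)).astype(int)

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(dict(zip(["weight", "height", "noise"],
               tree.feature_importances_.round(3))))
# The noise feature gets (approximately) zero importance.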

Feature Spaces

• Even though a data point is described in terms of D features, this may not be the most compact representation of the feature space.

• Even classifiers that try to use a smaller effective feature space can suffer from the curse-of-dimensionality

• If a feature has some discriminative power, the dimension may remain in the effective set.

1-d data in a 2-d world

[Scatter plot: data points lying along a single line even though they are plotted in two dimensions.]

Dimensions of high variance

Identifying dimensions of variance

• Assumption: directions that show high variance are the appropriate/useful dimensions for representing the feature set.

Aside: Normalization

• Assume 2 features: Percentile GPA and Height in cm.

• Which dimension shows greater variability?

[Scatter plot: percentile GPA on the x-axis (0–1) against height in cm on the y-axis; the height axis spans a much larger numeric range.]

Aside: Normalization

• Assume 2 features: Percentile GPA and Height in cm.

• Which dimension shows greater variability?

[Scatter plot: the same two features plotted with the x-axis spanning 0–30; the apparent variability of each dimension changes with the scale of the axes.]

Aside: Normalization

• Assume 2 features: Percentile GPA and Height in m.

• Which dimension shows greater variability?

[Scatter plot: percentile GPA against height in m, with both axes spanning 0–1; on comparable scales neither dimension dominates the variance.]
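A minimal numpy sketch of the point these three slides are making (the GPA and height values are made up): whichever feature happens to have the larger numeric scale dominates the raw variance, and standardizing (z-scoring) the features removes that artifact before looking for directions of variance.

import numpy as np

rng = np.random.default_rng(0)
gpa = rng.uniform(0.0, 1.0, size=200)         # percentile GPA in [0, 1]
height_cm = rng.normal(170.0, 8.0, size=200)  # height in centimetres
X = np.column_stack([gpa, height_cm])

print("raw variances:         ", X.var(axis=0).round(3))     # height dominates
X_std = (X - X.mean(axis=0)) / X.std(axis=0)                  # z-score each column
print("standardized variances:", X_std.var(axis=0).round(3))  # both 1.0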

Principal Component Analysis

• Principal Component Analysis (PCA) identifies the dimensions of greatest variance of a set of data.

Eigenvectors

• The eigenvectors of a symmetric matrix are orthogonal vectors that define a space, the eigenspace.

• Any data point can be described as a linear combination of eigenvectors.

• Eigenvectors of a square matrix A have the following property: Av = λv.

• The associated λ is the eigenvalue.
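A quick numerical check of that property (numpy, with an arbitrary symmetric matrix chosen for illustration):

import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])            # an arbitrary symmetric matrix
eigvals, eigvecs = np.linalg.eigh(A)  # columns of eigvecs are eigenvectors

v = eigvecs[:, 0]
lam = eigvals[0]
print(A @ v)                          # the same vector ...
print(lam * v)                        # ... scaled by the eigenvalue
print(np.allclose(A @ v, lam * v))    # True: A v = lambda v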

PCA

• Write each data point in this new space: x = μ + Σi ci ei, i.e. the mean plus a linear combination of the eigenvectors.

• To do the dimensionality reduction, keep C < D dimensions.

• Each data point is now represented by its vector of coefficients (c1, …, cC).

Identifying Eigenvectors

• PCA is easy once we have the eigenvectors and the mean.

• Identifying the mean is easy.

• Eigenvectors of the covariance matrix represent a set of directions of variance.

• Eigenvalues represent the degree of that variance.

Eigenvectors of the Covariance Matrix

• Eigenvectors are orthonormal.

• In the eigenspace, the Gaussian is diagonal – zero covariance between dimensions.

• All eigenvalues are non-negative.

• Eigenvalues are sorted (largest first).

• Larger eigenvalues correspond to higher variance.

Dimensionality reduction with PCA

• To convert an original data point x to its PCA representation: ci = ei^T (x – μ), for i = 1, …, C.

• To reconstruct a point: x ≈ μ + Σi ci ei.
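Putting the slides together, a minimal numpy sketch of both conversions (the function names and the synthetic 1-d-in-2-d data are assumptions for illustration; the eigenvectors are taken from the sample covariance as described above):

import numpy as np

def pca_fit(X, C):
    # Mean and the C eigenvectors of the sample covariance with largest eigenvalues.
    mu = X.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(X, rowvar=False))
    order = np.argsort(eigvals)[::-1][:C]
    return mu, eigvecs[:, order]

def pca_encode(x, mu, E):
    return E.T @ (x - mu)             # c_i = e_i^T (x - mu)

def pca_decode(c, mu, E):
    return mu + E @ c                 # x ~ mu + sum_i c_i e_i

# Example: the 1-d data in a 2-d world from the earlier slide.
rng = np.random.default_rng(0)
t = rng.normal(size=300)
X = np.column_stack([t, 3 * t + 0.05 * rng.normal(size=300)])

mu, E = pca_fit(X, C=1)
codes = np.array([pca_encode(x, mu, E) for x in X])
recon = np.array([pca_decode(c, mu, E) for c in codes])
print("mean absolute reconstruction error:", np.abs(X - recon).mean())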

Eigenfaces

[Figure: face images encoded into eigenface coefficients, then decoded.]

Efficiency can be evaluated with absolute or squared error (see the helper sketch below).
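A tiny helper in the same spirit (name and interface assumed, not from the slides) for computing either measure on an encoded-then-decoded batch:

import numpy as np

def reconstruction_error(X, X_hat, squared=False):
    # Mean per-element error between the originals and their reconstructions.
    diff = np.asarray(X) - np.asarray(X_hat)
    return float((diff ** 2).mean()) if squared else float(np.abs(diff).mean())

For instance, reconstruction_error(X, recon) and reconstruction_error(X, recon, squared=True) give the absolute and squared versions for the sketch above.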

Some other (unsupervised) dimensionality reduction techniques

• Kernel PCA
• Distance Preserving Dimension Reduction
• Maximum Variance Unfolding
• Multidimensional Scaling (MDS)
• Isomap

• Next Time – Model Adaptation and Semi-supervised Techniques

• Work on your projects.