
The Principal Components Analysis

Slava Vaisman

The University of Queensland

r.vaisman@uq.edu.au

November 30, 2016


Overview

1 PCA review

2 Understanding the PCA

3 Computing the PCA


A short overview of the PCA

PCA is a powerful feature reduction (feature extraction) mechanism that helps us handle high-dimensional data with too many features.

In particular, PCA is a method for compressing a lot of data into something smaller that captures the essence of the original data.

1 PCA looks for a related set of variables in our data that explain most of the variance, and combines them into the first principal component.

2 Next, it does the same with the next group of variables that explain most of the remaining variance, and constructs the second principal component. And so on...

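As a quick illustration of this idea in practice (not part of the original slides), here is a minimal sketch using scikit-learn's PCA on a synthetic data set; the array shapes and the choice of two components are assumptions made only for the example:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# 500 observations of 10 correlated features. Note: scikit-learn expects samples
# as rows, whereas the slides below store samples as columns.
X = rng.normal(size=(500, 3)) @ rng.normal(size=(3, 10))

pca = PCA(n_components=2)
Z = pca.fit_transform(X)                 # project onto the first two principal components
print(Z.shape)                           # (500, 2)
print(pca.explained_variance_ratio_)     # fraction of variance captured by each component
```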



Example


Rotation – a linear transformation of our data.



Understanding the PCA – the general setting

Our m × n data matrix X is given by

$$
X =
\begin{pmatrix}
X_{1,1} & \dots & \dots & X_{1,n} \\
X_{2,1} & \dots & \dots & X_{2,n} \\
\vdots  & \dots & \ddots & \vdots \\
X_{m,1} & \dots & \dots & X_{m,n}
\end{pmatrix},
$$

where

m — the number of measurement types,

n — the number of observations.

Note that each data sample is a column vector of X. Each sample lives in m-dimensional space.
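A small sketch of this setup in NumPy, with the n observations stored as the columns of an m × n matrix (the sizes are made up for illustration; the centring step anticipates the zero-mean assumption used below):

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 5, 200                        # m measurement types, n observations
X = rng.normal(size=(m, n))          # each column is one sample in m-dimensional space

# PCA works with centred data: subtract each row's (measurement's) mean.
Xc = X - X.mean(axis=1, keepdims=True)
print(Xc.shape, Xc.mean(axis=1))     # (5, 200), row means are now ~0
```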


Understanding the PCA – our objective

$$
X =
\begin{pmatrix}
X_{1,1} & \dots & \dots & X_{1,n} \\
X_{2,1} & \dots & \dots & X_{2,n} \\
\vdots  & \dots & \ddots & \vdots \\
X_{m,1} & \dots & \dots & X_{m,n}
\end{pmatrix}
$$

Redundancy: It might happen that our system has k << m degrees of freedom (the number of independent ways in which a dynamic system can move), but takes up the entire m-dimensional space in our original data set X.

Data Redundancy

1 Essentially, we would like to know if the rows of X are correlated.

2 If they are, we might be able to perform the desired dimensionality reduction. Namely, we would like to remove the redundancy!


A reminder: The Variance and the Covariance

First, let us formalize the concept of redundancy.

Suppose that we are given two data vectors (let us suppose that their means are zero)

$$\mathbf{x} = (x_1, \dots, x_n), \quad \text{and} \quad \mathbf{y} = (y_1, \dots, y_n).$$

Then, the variance is given by (an inner product):

$$
\sigma_x^2 = \frac{1}{n-1}\sum_{i=1}^{n} x_i \, x_i = \frac{1}{n-1}\,\mathbf{x}\,\mathbf{x}^T,
\qquad
\sigma_y^2 = \frac{1}{n-1}\,\mathbf{y}\,\mathbf{y}^T.
$$

The covariance measures the statistical relationship between x and y:

$$
\sigma_{xy}^2 = \frac{1}{n-1}\,\mathbf{x}\,\mathbf{y}^T = \frac{1}{n-1}\,\mathbf{y}\,\mathbf{x}^T = \sigma_{yx}^2.
$$

If I observe x, can I say something about y?
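These formulas translate directly into inner products; a small check in NumPy (the vectors are synthetic, and np.cov uses the same 1/(n-1) normalisation by default):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1000
x = rng.normal(size=n)
y = 0.8 * x + 0.2 * rng.normal(size=n)    # y shares information with x
x, y = x - x.mean(), y - y.mean()         # zero-mean, as assumed on the slide

var_x  = x @ x / (n - 1)
var_y  = y @ y / (n - 1)
cov_xy = x @ y / (n - 1)

print(var_x, var_y, cov_xy)
print(np.cov(x, y))                       # 2x2 matrix with the same entries
```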


Things to remember about the covariance

$\sigma_{xy}^2 \approx 0$ ⇒ x and y are (almost) statistically independent

$\sigma_{xy}^2 \neq 0$ ⇒ x and y share some information ⇒ REDUNDANCY

$\sigma_{xy}^2 = \sigma_{yx}^2$

Constructing a covariance matrix from our data

Recall that X is given by

$$
X =
\begin{pmatrix}
X_{1,1} & \dots & \dots & X_{1,n} \\
X_{2,1} & \dots & \dots & X_{2,n} \\
\vdots  & \dots & \ddots & \vdots \\
X_{m,1} & \dots & \dots & X_{m,n}
\end{pmatrix}
=
\begin{pmatrix}
\mathbf{X}_1 \\ \mathbf{X}_2 \\ \vdots \\ \mathbf{X}_m
\end{pmatrix}.
$$

The covariance matrix is given by $C_X = \frac{1}{n-1} X X^T$. In particular,

$$
C_X =
\begin{pmatrix}
\sigma_{X_1}^2     & \sigma_{X_1 X_2}^2 & \dots  & \sigma_{X_1 X_m}^2 \\
\sigma_{X_2 X_1}^2 & \sigma_{X_2}^2     & \dots  & \sigma_{X_2 X_m}^2 \\
\vdots             & \dots              & \ddots & \vdots \\
\sigma_{X_m X_1}^2 & \dots              & \dots  & \sigma_{X_m}^2
\end{pmatrix}.
$$
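The same construction in NumPy, on synthetic data with the rows (measurements) centred first; np.cov with its default rowvar=True computes exactly this matrix:

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 4, 500
X = rng.normal(size=(m, n))
Xc = X - X.mean(axis=1, keepdims=True)    # centre each measurement (row)

C_X = Xc @ Xc.T / (n - 1)                 # m x m covariance matrix
print(np.allclose(C_X, np.cov(X)))        # True: np.cov treats rows as variables
print(np.allclose(C_X, C_X.T))            # True: C_X is symmetric
```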


Properties of the covariance matrix

$$
C_X =
\begin{pmatrix}
\sigma_{X_1}^2     & \sigma_{X_1 X_2}^2 & \dots  & \sigma_{X_1 X_m}^2 \\
\sigma_{X_2 X_1}^2 & \sigma_{X_2}^2     & \dots  & \sigma_{X_2 X_m}^2 \\
\vdots             & \dots              & \ddots & \vdots \\
\sigma_{X_m X_1}^2 & \dots              & \dots  & \sigma_{X_m}^2
\end{pmatrix}.
$$

The diagonal elements are the variances of the rows of our data matrix X.

The off-diagonal elements are the corresponding covariances:

$$C_X(i,j) = \sigma_{X_i X_j}^2 = \sigma_{X_j X_i}^2 = C_X(j,i).$$

$C_X = \frac{1}{n-1} X X^T$ is symmetric.

Intuitively: small off-diagonal entries ⇒ statistical independence.

Intuitively: not-so-small off-diagonal entries ⇒ REDUNDANCY!


Intuition

Non-zero off-diagonal entries ⇒ REDUNDANCY!

So, what do we want to achieve?

We want the covariance matrix to look like this. Why?

$$
C_X =
\begin{pmatrix}
\sigma_{X_1}^2 & 0              & \dots  & 0 \\
0              & \sigma_{X_2}^2 & \dots  & 0 \\
\vdots         & \dots          & \ddots & \vdots \\
0              & \dots          & 0      & \sigma_{X_m}^2
\end{pmatrix}.
$$

Because, in this case we will have

NO CORRELATION = NO REDUNDANCY!

What is this? DIAGONALIZATION!

DIAGONALIZATION OF C_X = NO REDUNDANCY


More intuition

Basically, we want to find a new way to look at our system (a change of basis – a linear transformation), such that C_X becomes diagonal.

Suppose that we achieved the desired diagonalization:

$$
C_X =
\begin{pmatrix}
\sigma_{X_1}^2 & 0              & \dots  & 0 \\
0              & \sigma_{X_2}^2 & \dots  & 0 \\
\vdots         & \dots          & \ddots & \vdots \\
0              & \dots          & \dots  & \sigma_{X_m}^2
\end{pmatrix}.
$$

Now, we make an assumption.

An assumption

Larger values of $\sigma_{X_i}^2$ are much more interesting than the smaller ones. (Namely, most of the system dynamics happens where the variance is relatively big.)


Even more intuition

$$
C_X =
\begin{pmatrix}
\sigma_{X_1}^2 & 0              & \dots  & 0 \\
0              & \sigma_{X_2}^2 & \dots  & 0 \\
\vdots         & \dots          & \ddots & \vdots \\
0              & \dots          & \dots  & \sigma_{X_m}^2
\end{pmatrix}.
$$

Suppose that I order the variances such that

$$\sigma_{X_1}^2 > \sigma_{X_2}^2 > \dots > \sigma_{X_m}^2.$$

In this case $\sigma_{X_1}^2$ captures the strongest dynamics of the system — this is the first principal component.

$\sigma_{X_2}^2$ captures less of the system dynamics, and forms the second principal component.

And so on...


The diagonalization (1)

1 Recall that X is our data matrix.

2 Compute a non-normalized covariance via $X X^T$.

3 We saw that $X X^T$ is symmetric; that is, it has real eigenvalues and all of its eigenvectors are orthogonal to each other.

4 For such matrices, we can always perform the eigenvalue decomposition.

Eigenvalue Decomposition

$$X X^T = S \Lambda S^{-1},$$

where $\Lambda$ is a diagonal matrix, and $S$ is a matrix of eigenvectors of $X X^T$. ($S$'s columns are the normalized right eigenvectors of $X X^T$.)

5 Since the eigenvectors are orthonormal, $S^{-1} = S^T$!

6 $\Lambda$ is a diagonal matrix with the eigenvalues of $X X^T$!
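A sketch of this decomposition with NumPy's eigh, which is intended for symmetric matrices (the data are synthetic; eigh returns the eigenvalues in ascending order):

```python
import numpy as np

rng = np.random.default_rng(4)
m, n = 4, 500
X = rng.normal(size=(m, n))
Xc = X - X.mean(axis=1, keepdims=True)

A = Xc @ Xc.T                                    # non-normalised covariance X X^T (symmetric)
lam, S = np.linalg.eigh(A)                       # Lambda's diagonal (ascending) and eigenvectors

print(np.allclose(S @ np.diag(lam) @ S.T, A))    # A = S Lambda S^T
print(np.allclose(S.T @ S, np.eye(m)))           # eigenvectors are orthonormal: S^{-1} = S^T
```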


The diagonalization (2)

Recall that when we started to work with our data X, we were actually working in a somewhat arbitrary coordinate system.

We would like to figure out which basis we should use in order to obtain a diagonal covariance matrix instead of our original one (which has a bunch of correlated data measurements).

So, let us create a new set of measurements Y, related to the old set of measurements X as follows:

$$Y = S^T X.$$

(Note that this is just a linear transformation!) And we would like to work in this new basis from now on.


The diagonalization (3)

Let us calculate the covariance of Y now.

$$
C_Y = \frac{1}{n-1} Y Y^T
    = \frac{1}{n-1} (S^T X)(S^T X)^T
    = \frac{1}{n-1} S^T \underbrace{X X^T}_{S \Lambda S^T} S
    = \frac{1}{n-1} S^T S \, \Lambda \, S^T S
    = \frac{1}{n-1} \Lambda.
$$

What is $\Lambda$? — a diagonal matrix!

To conclude:

If we work in the $S^T$ basis, the covariance matrix $C_Y$ of $Y = S^T X$ is diagonal ⇒ NO REDUNDANCY!

Effectively, we have figured out the right way to look at our problem.
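Putting the pieces together: change the basis with Y = S^T X and check numerically that the covariance of Y is diagonal. An end-to-end sketch on synthetic data (the injected redundancy between two rows is an assumption for the example):

```python
import numpy as np

rng = np.random.default_rng(5)
m, n = 4, 2000
X = rng.normal(size=(m, n))
X[2] = 0.9 * X[0] + 0.1 * rng.normal(size=n)    # inject redundancy between rows 0 and 2
Xc = X - X.mean(axis=1, keepdims=True)

lam, S = np.linalg.eigh(Xc @ Xc.T)               # eigen-decomposition of X X^T
Y = S.T @ Xc                                     # new measurements in the S^T basis

C_Y = Y @ Y.T / (n - 1)
off_diag = C_Y - np.diag(np.diag(C_Y))
print(np.max(np.abs(off_diag)))                  # ~0: no correlation left
print(np.allclose(np.diag(C_Y), lam / (n - 1)))  # diagonal of C_Y is Lambda/(n-1)
```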


How many principal components should I use?

$$
C_Y =
\begin{pmatrix}
\sigma_{Y_1}^2 & 0              & \dots  & 0 \\
0              & \sigma_{Y_2}^2 & \dots  & 0 \\
\vdots         & \dots          & \ddots & \vdots \\
0              & \dots          & \dots  & \sigma_{Y_m}^2
\end{pmatrix}.
$$

Suppose that I order the variances such that

$$\sigma_{Y_1}^2 > \sigma_{Y_2}^2 > \dots > \sigma_{Y_m}^2.$$

The Percentage of Variance Explained (PVE) by principal component $i$ is defined by

$$\mathrm{PVE}_i = \frac{\sigma_{Y_i}^2}{\sum_{j=1}^{m} \sigma_{Y_j}^2}.$$

So, the first $k \leq m$ principal components explain

$$\frac{\sum_{i=1}^{k} \sigma_{Y_i}^2}{\sum_{j=1}^{m} \sigma_{Y_j}^2}$$

of the system variance.
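A small sketch of computing the PVE and picking k from the eigenvalues of the covariance matrix (the 95% threshold is only an example choice, not a rule from the slides):

```python
import numpy as np

rng = np.random.default_rng(6)
m, n = 6, 1000
# Rows with decreasing scales, so the leading components dominate.
X = rng.normal(size=(m, n)) * np.array([5.0, 3.0, 2.0, 1.0, 0.5, 0.1])[:, None]
Xc = X - X.mean(axis=1, keepdims=True)

lam, S = np.linalg.eigh(Xc @ Xc.T / (n - 1))
var = lam[::-1]                                  # sigma^2_{Y_1} > ... > sigma^2_{Y_m}

pve = var / var.sum()                            # percentage of variance explained per PC
cum = np.cumsum(pve)
k = int(np.searchsorted(cum, 0.95) + 1)          # smallest k explaining >= 95% of variance
print(pve, cum, k)
```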



Image compression

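The original slide shows an image-compression demo (the figure is not reproduced here). As a rough illustration of the idea, here is a hedged sketch that treats the rows of a grayscale image as observations and keeps only the top k principal components; the image and the value of k are made up for the example:

```python
import numpy as np

def pca_compress(img, k):
    """Rank-k PCA reconstruction of a 2-D grayscale image (rows = observations)."""
    mean = img.mean(axis=0, keepdims=True)
    Xc = img - mean                               # centre each column (pixel position)
    C = Xc.T @ Xc / (Xc.shape[0] - 1)             # width x width covariance matrix
    lam, S = np.linalg.eigh(C)                    # eigenvalues in ascending order
    S_k = S[:, ::-1][:, :k]                       # top-k principal directions
    return (Xc @ S_k) @ S_k.T + mean              # project, then reconstruct

# Hypothetical usage on a random "image":
img = np.random.default_rng(7).random((256, 256))
approx = pca_compress(img, k=32)
print(np.linalg.norm(img - approx) / np.linalg.norm(img))   # relative reconstruction error
```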

PCA Assumptions

1 Data linearity is assumed. (If the relationship between measurements is nonlinear, consider Kernel PCA.)

$$
X_A = \begin{pmatrix} X_1 \\ 3 \cdot X_1 + 8 \end{pmatrix},
\qquad
X_B = \begin{pmatrix} X_1 \\ (X_1)^2 \end{pmatrix}
$$

2 We assume that bigger variances correspond to more important dynamics.

3 The principal components are assumed to be orthogonal.

4 We assume that the data points come from a Gaussian distribution.
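For data like X_B, where the two measurements are nonlinearly related, plain PCA cannot remove the redundancy; a kernelised variant is one workaround. A minimal sketch with scikit-learn's KernelPCA (the RBF kernel, gamma value, and component count are only example choices):

```python
import numpy as np
from sklearn.decomposition import KernelPCA

rng = np.random.default_rng(8)
x1 = rng.uniform(-1, 1, size=500)
X_B = np.column_stack([x1, x1 ** 2])      # nonlinearly related measurements, samples as rows

kpca = KernelPCA(n_components=1, kernel="rbf", gamma=2.0)
Z = kpca.fit_transform(X_B)               # 1-D nonlinear summary of the two measurements
print(Z.shape)                            # (500, 1)
```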


PCA — conclusions

+
1 A simple method — no parameters to tweak and no coefficients to adjust.
2 A dramatic reduction in data size.
3 Easy to compute.
4 Very powerful for many practical applications.

−
1 How do we incorporate prior knowledge?
2 Too expensive for many applications — $O(n^3)$ complexity.
3 Problems with outliers.
4 Assumes linearity.


The End
