
The Principal Components Analysis

Slava Vaisman

The University of Queensland

r.vaisman@uq.edu.au

November 30, 2016


Overview

1. PCA review

2. Understanding the PCA

3. Computing the PCA


A short overview of the PCA

PCA is a powerful feature reduction (feature extraction) mechanism that helps us handle high-dimensional data with too many features.

In particular, PCA is a method for compressing a lot of data into something (smaller) that captures the essence of the original data.

1. PCA looks for a related set of variables in our data that explains most of the variance, and adds them to the first principal component.

2. Next, it does the same with the next group of variables that explains most of the remaining variance, and constructs the second principal component. And so on... (A minimal usage sketch follows this slide.)

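A minimal usage sketch (my own addition, not part of the original slides), assuming NumPy and scikit-learn are available. It compresses 10-dimensional samples down to the two strongest principal components; note that scikit-learn stores samples as rows, whereas the slides below place samples in columns.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))           # 500 samples, 10 features (rows = samples)
X[:, 1] = 3 * X[:, 0] + 0.1 * X[:, 1]    # inject redundancy between two features

pca = PCA(n_components=2)                # keep the two strongest components
Z = pca.fit_transform(X)                 # compressed data, shape (500, 2)
print(pca.explained_variance_ratio_)     # variance captured by each component
```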

Example


Rotation – a linear transformation of our data.

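A minimal sketch of this point (my own illustration): a rotation is multiplication by an orthogonal matrix, which is exactly the kind of change of basis PCA performs later in these slides.

```python
import numpy as np

theta = np.pi / 4                                   # rotate by 45 degrees
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])     # orthogonal: R @ R.T = I

X = np.random.default_rng(1).normal(size=(2, 100))  # 2 x n data, samples as columns
X_rotated = R @ X                                   # the rotated (linearly transformed) data
```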

Understanding the PCA – the general setting

Our m × n data matrix X is given by

$$
X =
\begin{pmatrix}
X_{1,1} & \dots & \dots & X_{1,n} \\
X_{2,1} & \dots & \dots & X_{2,n} \\
\vdots  &       & \ddots & \vdots \\
X_{m,1} & \dots & \dots & X_{m,n}
\end{pmatrix},
$$

where

m — the number of measurements (rows),

n — the number of observations (columns).

Note that each data sample is a column vector of X; each sample lives in m-dimensional space.


Understanding the PCA – our objective

$$
X =
\begin{pmatrix}
X_{1,1} & \dots & \dots & X_{1,n} \\
X_{2,1} & \dots & \dots & X_{2,n} \\
\vdots  &       & \ddots & \vdots \\
X_{m,1} & \dots & \dots & X_{m,n}
\end{pmatrix}
$$

Redundancy: It might happen that our system has $k \ll m$ degrees of freedom (the number of independent ways in which a dynamic system can move), yet takes up the entire m-dimensional space in our original data set X.

Data Redundancy

1. Essentially, we would like to know whether the rows of X are correlated.

2. If they are, we might be able to perform the desired dimensionality reduction. Namely, we would like to remove the redundancy!


A reminder: The Variance and the Covariance

First, let us formalize the concept of redundancy.

Suppose that we are given two data vectors (let us suppose that their mean is zero),

$$
\mathbf{x} = (x_1, \dots, x_n), \quad \text{and} \quad \mathbf{y} = (y_1, \dots, y_n).
$$

Then, the variance is given by (inner product):

$$
\sigma_x^2 = \frac{1}{n-1} \sum_{i=1}^{n} x_i \cdot x_i = \frac{1}{n-1}\,\mathbf{x}\mathbf{x}^T,
\qquad
\sigma_y^2 = \frac{1}{n-1}\,\mathbf{y}\mathbf{y}^T.
$$

The covariance measures the statistical relationship between $\mathbf{x}$ and $\mathbf{y}$:

$$
\sigma_{xy}^2 = \frac{1}{n-1}\,\mathbf{x}\mathbf{y}^T = \frac{1}{n-1}\,\mathbf{y}\mathbf{x}^T = \sigma_{yx}^2.
$$

If I observe x, can I say something about y?

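A small numerical check of these formulas (my own sketch, assuming NumPy), using zero-mean vectors and the 1/(n-1) normalisation above:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1000
x = rng.normal(size=n)
y = 2 * x + rng.normal(scale=0.1, size=n)        # y shares information with x

x -= x.mean()                                    # the slide assumes zero-mean vectors
y -= y.mean()

var_x = x @ x / (n - 1)                          # sigma_x^2  = (1/(n-1)) x x^T
var_y = y @ y / (n - 1)
cov_xy = x @ y / (n - 1)                         # sigma_xy^2 = (1/(n-1)) x y^T

print(var_x, var_y, cov_xy)                      # cov_xy is far from 0: redundancy
print(np.allclose(cov_xy, np.cov(x, y)[0, 1]))   # agrees with NumPy's estimate
```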

Things to remember about the covariance

$\sigma_{xy}^2 \approx 0$ ⇒ x and y are (almost) statistically independent

$\sigma_{xy}^2 \neq 0$ ⇒ x and y share some information ⇒ REDUNDANCY

$\sigma_{xy}^2 = \sigma_{yx}^2$


Constructing a covariance matrix from our data

Recall that X is given by

$$
X =
\begin{pmatrix}
X_{1,1} & \dots & \dots & X_{1,n} \\
X_{2,1} & \dots & \dots & X_{2,n} \\
\vdots  &       & \ddots & \vdots \\
X_{m,1} & \dots & \dots & X_{m,n}
\end{pmatrix}
=
\begin{pmatrix}
\mathbf{X}_1 \\
\mathbf{X}_2 \\
\vdots \\
\mathbf{X}_m
\end{pmatrix}.
$$

The covariance matrix is given by $C_X = \frac{1}{n-1} X X^T$. In particular,

$$
C_X =
\begin{pmatrix}
\sigma^2_{X_1}     & \sigma^2_{X_1 X_2} & \dots  & \sigma^2_{X_1 X_m} \\
\sigma^2_{X_2 X_1} & \sigma^2_{X_2}     & \dots  & \sigma^2_{X_2 X_m} \\
\vdots             & \dots              & \ddots & \vdots \\
\sigma^2_{X_m X_1} & \dots              & \dots  & \sigma^2_{X_m}
\end{pmatrix}.
$$

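A minimal sketch of this construction (my own, assuming NumPy and that the rows of X have been centred):

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 5, 200
X = rng.normal(size=(m, n))                    # m measurements x n observations
X[1] = 0.9 * X[0] + 0.1 * X[1]                 # make two rows redundant

X = X - X.mean(axis=1, keepdims=True)          # centre each row (measurement)
C_X = X @ X.T / (n - 1)                        # covariance matrix, m x m

print(np.allclose(C_X, np.cov(X)))             # matches NumPy's covariance estimate
print(C_X[0, 1])                               # large off-diagonal entry: redundancy
```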

Properties of covariance matrix

$$
C_X =
\begin{pmatrix}
\sigma^2_{X_1}     & \sigma^2_{X_1 X_2} & \dots  & \sigma^2_{X_1 X_m} \\
\sigma^2_{X_2 X_1} & \sigma^2_{X_2}     & \dots  & \sigma^2_{X_2 X_m} \\
\vdots             & \dots              & \ddots & \vdots \\
\sigma^2_{X_m X_1} & \dots              & \dots  & \sigma^2_{X_m}
\end{pmatrix}.
$$

The diagonal elements are the variances of the rows of our data matrix X.

The off-diagonal elements are the corresponding covariances:

$$
C_X(i,j) = \sigma^2_{X_i X_j} = \sigma^2_{X_j X_i} = C_X(j,i).
$$

$C_X = \frac{1}{n-1} X X^T$ is symmetric.

Intuitively: small off-diagonal entries ⇒ statistical independence.

Intuitively: not so small off-diagonal entries ⇒ REDUNDANCY!


Intuition

Non-zero off-diagonal entries ⇒ REDUNDANCY!

So, what do we want to achieve?

We want the covariance matrix to look like this. Why?

$$
C_X =
\begin{pmatrix}
\sigma^2_{X_1} & 0              & \dots  & 0 \\
0              & \sigma^2_{X_2} & \dots  & 0 \\
\vdots         & \dots          & \ddots & \vdots \\
0              & \dots          & 0      & \sigma^2_{X_m}
\end{pmatrix}.
$$

Because, in this case we will have

NO CORRELATION = NO REDUNDANCY!

What is this? DIAGONALIZATION!

DIAGONALIZATION OF CX = NO REDUNDANCY


More intuition

Basically, we want to find a new way to look at our system (a change of basis – a linear transformation), such that $C_X$ becomes diagonal.

Suppose that we achieved the desired diagonalization:

$$
C_X =
\begin{pmatrix}
\sigma^2_{X_1} & 0              & \dots  & 0 \\
0              & \sigma^2_{X_2} & \dots  & 0 \\
\vdots         & \dots          & \ddots & \vdots \\
0              & \dots          & \dots  & \sigma^2_{X_m}
\end{pmatrix}.
$$

Now, we make an assumption.

An assumption

Larger values of $\sigma^2_{X_i}$ are much more interesting than the smaller ones. (Namely, most of the system dynamics happens where the variance is relatively large.)


Even more intuition

$$
C_X =
\begin{pmatrix}
\sigma^2_{X_1} & 0              & \dots  & 0 \\
0              & \sigma^2_{X_2} & \dots  & 0 \\
\vdots         & \dots          & \ddots & \vdots \\
0              & \dots          & \dots  & \sigma^2_{X_m}
\end{pmatrix}.
$$

Suppose that I order the variances, such that

$$
\sigma^2_{X_1} > \sigma^2_{X_2} > \dots > \sigma^2_{X_m}.
$$

In this case $\sigma^2_{X_1}$ tells me the strongest dynamics of the system — this is the first principal component.

$\sigma^2_{X_2}$ captures less of the system dynamics, and forms the second principal component.

And so on...


The diagonalization (1)

1. Recall that X is our data matrix.

2. Compute a non-normalized covariance via $X X^T$.

3. We saw that $X X^T$ is symmetric; that is, it has real eigenvalues and all its eigenvectors are orthogonal to each other.

4. For such matrices, we can always perform the eigenvalue decomposition.

Eigenvalue Decomposition

$$
X X^T = S \Lambda S^{-1},
$$

where $\Lambda$ is a diagonal matrix and $S$ is a matrix of eigenvectors of $X X^T$. ($S$'s columns are normalized right eigenvectors of $X X^T$.)

5. Since the eigenvectors are orthonormal, $S^{-1} = S^T$!

6. $\Lambda$ is a diagonal matrix with the eigenvalues of $X X^T$!

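A minimal sketch of this decomposition (my own illustration), using NumPy's symmetric eigensolver:

```python
import numpy as np

rng = np.random.default_rng(4)
m, n = 5, 200
X = rng.normal(size=(m, n))
X = X - X.mean(axis=1, keepdims=True)                # centred data, samples as columns

A = X @ X.T                                          # non-normalised covariance, symmetric
eigvals, S = np.linalg.eigh(A)                       # diagonal of Lambda and eigenvector matrix S

print(np.allclose(S @ np.diag(eigvals) @ S.T, A))    # A = S Lambda S^T
print(np.allclose(S.T @ S, np.eye(m)))               # orthonormal columns: S^{-1} = S^T
```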

The diagonalization (2)

Recall that when we started to work with our data X, we actually worked in a sort of arbitrary coordinate system.

We would like to figure out which basis we should use in order to have a diagonal covariance matrix instead of our original one (which has a bunch of correlated data measures).

So, let us create a new set of measurements Y, related to the old set of measurements X as follows:

$$
Y = S^T X.
$$

(Note that this is just a linear transformation!) And we would like to work in this new basis from now on.


The diagonalization (3)

Let us calculate the covariance of Y now.

$$
C_Y = \frac{1}{n-1}\, Y Y^T
    = \frac{1}{n-1}\, (S^T X)(S^T X)^T
    = \frac{1}{n-1}\, S^T \underbrace{X X^T}_{S \Lambda S^T} S
    = \frac{1}{n-1}\, S^T S\, \Lambda\, S^T S
    = \frac{1}{n-1}\, \Lambda.
$$

What is $\Lambda$? — a diagonal matrix! To conclude:

If we work in the $S^T$ basis, the covariance matrix $C_Y$ of $Y = S^T X$ is diagonal ⇒ NO REDUNDANCY!

Effectively, we have figured out the right way to look at our problem.

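A small numerical check (my own sketch) that the transformed data $Y = S^T X$ indeed has a diagonal covariance matrix equal to $\Lambda / (n-1)$:

```python
import numpy as np

rng = np.random.default_rng(5)
m, n = 4, 500
X = rng.normal(size=(m, n))
X[1] = 0.8 * X[0] + 0.2 * X[1]                        # correlated rows: redundancy
X = X - X.mean(axis=1, keepdims=True)

eigvals, S = np.linalg.eigh(X @ X.T)                  # X X^T = S Lambda S^T
Y = S.T @ X                                           # change of basis
C_Y = Y @ Y.T / (n - 1)

print(np.allclose(C_Y, np.diag(eigvals) / (n - 1)))   # C_Y = Lambda / (n-1)
print(np.round(C_Y, 6))                               # (numerically) diagonal
```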

How many principal components should I use?

$$
C_Y =
\begin{pmatrix}
\sigma^2_{Y_1} & 0              & \dots  & 0 \\
0              & \sigma^2_{Y_2} & \dots  & 0 \\
\vdots         & \dots          & \ddots & \vdots \\
0              & \dots          & \dots  & \sigma^2_{Y_m}
\end{pmatrix}.
$$

Suppose that I order the variances, such that

$$
\sigma^2_{Y_1} > \sigma^2_{Y_2} > \dots > \sigma^2_{Y_m}.
$$

The Percentage of Variance Explained (PVE) by principal component $i$ is defined by

$$
\frac{\sigma^2_{Y_i}}{\sum_{j=1}^{m} \sigma^2_{Y_j}}.
$$

So, the first $k \leq m$ principal components explain

$$
\frac{\sum_{i=1}^{k} \sigma^2_{Y_i}}{\sum_{j=1}^{m} \sigma^2_{Y_j}}
$$

of the system variance.

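A minimal sketch (my own) of computing the PVE and choosing k: the eigenvalues of $C_Y$ are the variances $\sigma^2_{Y_i}$, so the PVE follows directly from them. The 95% threshold below is an arbitrary illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(6)
m, n = 6, 400
X = rng.normal(size=(m, n))
X[1] = X[0] + 0.05 * X[1]                        # redundancy: few real degrees of freedom
X[2] = X[0] - X[1] + 0.05 * X[2]
X = X - X.mean(axis=1, keepdims=True)

variances = np.linalg.eigvalsh(X @ X.T / (n - 1))[::-1]   # sigma_{Y_i}^2, descending

pve = variances / variances.sum()                # percentage of variance explained per PC
cumulative = np.cumsum(pve)
k = int(np.searchsorted(cumulative, 0.95) + 1)   # smallest k explaining >= 95% of variance
print(pve, k)
```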

Image compression

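The original slide shows an image-compression example as a figure. A hedged sketch of one way such a compression might be done with PCA (my own construction, not the slides' code), keeping only the first k principal components of a grayscale image:

```python
import numpy as np

img = np.random.default_rng(2).random((256, 256))   # stand-in for a grayscale image
k = 20

row_means = img.mean(axis=0, keepdims=True)
centred = img - row_means

# Eigen-decomposition of the (column) covariance matrix, as in the slides.
C = centred.T @ centred / (centred.shape[0] - 1)
eigvals, S = np.linalg.eigh(C)                       # ascending eigenvalue order
S_k = S[:, ::-1][:, :k]                              # top-k eigenvectors

scores = centred @ S_k                               # compressed representation
img_hat = scores @ S_k.T + row_means                 # reconstruction from k components

# Storage drops from 256*256 values to 256*k (scores) + 256*k (basis) + 256 (means).
```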

PCA Assumptions

1. Data linearity is assumed. (Kernel PCA relaxes this; see the sketch after this slide.)

$$
X_A =
\begin{pmatrix}
X_1 \\
3 \cdot X_1 + 8
\end{pmatrix},
\qquad
X_B =
\begin{pmatrix}
X_1 \\
(X_1)^2
\end{pmatrix}
$$

2. We assume that bigger variances correspond to more important dynamics.

3. The principal components are assumed to be orthogonal.

4. We assume that the data points come from a Gaussian distribution.

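A minimal sketch (my own illustration of the linearity assumption, using the $X_A$ and $X_B$ examples above): for the linear relation $X_A$ a single principal component captures essentially all of the variance, while for the quadratic relation $X_B$ it cannot.

```python
import numpy as np

rng = np.random.default_rng(3)
x1 = rng.normal(size=1000)

XA = np.vstack([x1, 3 * x1 + 8])        # linear relation (2 x n, samples as columns)
XB = np.vstack([x1, x1 ** 2])           # nonlinear (quadratic) relation

def pve_of_first_component(X):
    Xc = X - X.mean(axis=1, keepdims=True)       # centre each measurement (row)
    C = Xc @ Xc.T / (X.shape[1] - 1)             # covariance matrix C_X
    eigvals = np.linalg.eigvalsh(C)              # ascending eigenvalues
    return eigvals[-1] / eigvals.sum()

print(pve_of_first_component(XA))   # ~1.0: one component captures the linear structure
print(pve_of_first_component(XB))   # < 1.0: PCA cannot flatten the quadratic relation
```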

PCA — conclusions

Pros (+):

1. A simple method — no parameters to tweak and no coefficients to adjust.
2. A dramatic reduction in data size.
3. Easy to compute.
4. Very powerful for many practical applications.

Cons (−):

1. How do we incorporate prior knowledge?
2. Too expensive for many applications — $O(n^3)$ complexity.
3. Problems with outliers.
4. Assumes linearity.


The End
