WK9 – Principal Component Analysis

Contents

PCA

GHA

APEX

Kernel PCA

CS 476: Networks of Neural Computation, CSD, UOC, 2009

Conclusions

CS 476: Networks of Neural Computation

WK9 – Principal Component Analysis

Dr. Stathis Kasderidis

Dept. of Computer Science

University of Crete

Spring Semester, 2009

Contents

•Introduction to Principal Component Analysis

•Generalised Hebbian Algorithm

•Adaptive Principal Components Extraction

•Kernel Principal Components Analysis

•Conclusions

Principal Component Analysis

•The PCA method is a statistical method for Feature Selection and Dimensionality Reduction.

•Feature Selection is a process whereby a data space is transformed into a feature space. In principle, both spaces have the same dimensionality.

•However, in the PCA method the transformation is designed in such a way that the data set is represented by a reduced number of “effective” features and yet retains most of the intrinsic information contained in the data; in other words, the data set undergoes a dimensionality reduction.

•Suppose that we have a vector x of dimension m and we wish to transmit it using l numbers, where l<m. If we simply truncate the vector x, we will cause a mean-square error equal to the sum of the variances of the elements eliminated from x.

•So, we ask: Does there exist an invertible linear transformation T such that the truncation of Tx is optimum in the mean-square sense?

•Clearly, the transformation T should have the property that some of the components of Tx have low variance.

•Principal Component Analysis maximises the rate of decrease of variance and is the right choice.

•Before we present Hebbian-based neural network algorithms that do this, we first present the statistical analysis of the problem.

•Let X be an m-dimensional random vector representing the environment of interest. We assume that the vector X has zero mean:

$E[X] = 0$

where E is the statistical expectation operator. If X does not have zero mean, we first subtract the mean from X before we proceed with the rest of the analysis.

•Let q denote a unit vector, also of dimension m, onto which the vector X is to be projected. This projection is defined by the inner product of the vectors X and q:

$A = X^T q = q^T X$

subject to the constraint:

$\|q\| = (q^T q)^{1/2} = 1$

•The projection A is a random variable with a mean and variance related to the statistics of the vector X. Assuming that X has zero mean, we can calculate the mean value of the projection A:

$E[A] = q^T E[X] = 0$

•The variance of A is therefore the same as its mean-square value and so we can write:

$\sigma^2 = E[A^2] = E[(q^T X)(X^T q)] = q^T E[X X^T] q = q^T R q$

•The m-by-m matrix R is the correlation matrix of the random vector X, formally defined as the expectation of the outer product of the vector X with itself, as shown:

$R = E[X X^T]$

•We observe that the matrix R is symmetric, which means that:

$R^T = R$
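
The definitions so far can be checked numerically. The following is a minimal sketch, assuming NumPy and a synthetic data set; the mixing matrix, the sample size, and all variable names are illustrative assumptions, not part of the lecture. It centres the data, estimates R from samples, and compares the variance of the projection computed directly with the quadratic form $q^T R q$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: N samples of an m-dimensional random vector (one per row).
N, m = 5000, 3
mixing = np.array([[2.0, 0.0, 0.0],
                   [0.5, 1.0, 0.0],
                   [0.0, 0.3, 0.2]])
X = rng.normal(size=(N, m)) @ mixing.T + np.array([1.0, -2.0, 0.5])

# Subtract the sample mean so that the zero-mean assumption holds.
Xc = X - X.mean(axis=0)

# Sample estimate of the correlation matrix R = E[X X^T].
R = Xc.T @ Xc / N

# Project every sample onto an arbitrary unit vector q.
q = np.array([1.0, 1.0, 1.0])
q /= np.linalg.norm(q)
A = Xc @ q                      # A = q^T X for each sample

var_direct = np.mean(A ** 2)    # sigma^2 = E[A^2]
var_probe = q @ R @ q           # sigma^2 = q^T R q
print(var_direct, var_probe)
```

The two printed numbers agree up to rounding, because the sample average of $A^2$ is exactly the quadratic form built from the sample estimate of R.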

•From this property it follows that for any m-by-1 vectors a and b we have:

$a^T R b = b^T R a$

•From the above we see that the variance $\sigma^2$ of A is a function of the unit vector q; we can thus write:

$\psi(q) = \sigma^2 = q^T R q$

•From the above we can think of $\psi(q)$ as a variance probe.

•To find the directions along which the variance of A is extremal (a local maximum or minimum), we must find the vectors q which are the extremal points of $\psi(q)$, subject to the constraint of unit length.

•If q is a vector such that $\psi(q)$ has an extreme value, then for any small perturbation $\delta q$ of the unit vector q we find that, to first order in $\delta q$:

$\psi(q + \delta q) = \psi(q)$

•Now from the definition of the variance probe we have:

$\psi(q + \delta q) = (q + \delta q)^T R (q + \delta q) = q^T R q + 2(\delta q)^T R q + (\delta q)^T R \, \delta q$

where in the previous line we have made use of the symmetry of the matrix R.

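•Ignoring the second-order term $(\delta q)^T R \, \delta q$ and invoking the definition of $\psi(q)$ we may write:

$\psi(q + \delta q) = q^T R q + 2(\delta q)^T R q = \psi(q) + 2(\delta q)^T R q$

•Together with the first-order condition $\psi(q + \delta q) = \psi(q)$, the above relation implies that:

$(\delta q)^T R q = 0$

•Note that not every perturbation $\delta q$ of q is admissible; we restrict ourselves to those for which the Euclidean norm of the perturbed vector $q + \delta q$ remains equal to unity:

$\|q + \delta q\| = 1$

or, equivalently: $(q + \delta q)^T (q + \delta q) = 1$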

•Taking into account that q is already a vector of unit length, this means that, to first order in $\delta q$:

$(\delta q)^T q = 0$

•This means that the perturbation $\delta q$ must be orthogonal to q, and therefore only a small change in the direction of q is permitted.

•Combining the previous two equations we can now write:

$(\delta q)^T R q - \lambda (\delta q)^T q = 0 \;\Rightarrow\; (\delta q)^T (R q - \lambda q) = 0$

where $\lambda$ is a scaling factor with the same dimensions as the elements of R.

•For this condition to hold we can now write:

$R q = \lambda q$

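•This means that q is an eigenvector and $\lambda$ is an eigenvalue of R.

•The matrix R has real eigenvalues because it is symmetric, and they are non-negative because $q^T R q = E[A^2] \geq 0$ for every q. Let the eigenvalues of the matrix R be denoted by $\lambda_i$ and the corresponding eigenvectors by $q_i$, where the eigenvalues are assumed distinct and arranged in decreasing order:

$\lambda_1 > \lambda_2 > \dots > \lambda_m$

so that $\lambda_1 = \lambda_{\max}$.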

•We can then write matrix R as:

$R = \sum_{i=1}^{m} \lambda_i q_i q_i^T$

•Combining the previous results we can see that the variance probes are the same as the eigenvalues:

$\psi(q_j) = \lambda_j$, for $j = 1, 2, \dots, m$

•To summarise the previous analysis we have two important results:

•The eigenvectors of the correlation matrix R pertaining to the zero-mean random vector X define the unit vectors $q_j$, representing the principal directions along which the variance probes $\psi(q_j)$ have their extreme values;

•The associated eigenvalues define the extremal values of the variance probes.
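
These two results can be illustrated with a short sketch, assuming NumPy. The matrix R below is a randomly generated symmetric positive semi-definite stand-in for a correlation matrix, and all names are illustrative. The sketch checks that the variance probe evaluated at each eigenvector equals the matching eigenvalue, and that R is recovered from the sum of $\lambda_i q_i q_i^T$.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical correlation matrix: B B^T / m is symmetric and non-negative definite.
m = 4
B = rng.normal(size=(m, m))
R = B @ B.T / m

# eigh is meant for symmetric matrices and returns eigenvalues in ascending order;
# reorder so that lambda_1 is the largest.
eigvals, Q = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]
eigvals, Q = eigvals[order], Q[:, order]

# Variance probe psi(q_j) = q_j^T R q_j for every eigenvector (column of Q).
probes = np.array([Q[:, j] @ R @ Q[:, j] for j in range(m)])

# Spectral form R = sum_i lambda_i q_i q_i^T.
R_rebuilt = sum(eigvals[i] * np.outer(Q[:, i], Q[:, i]) for i in range(m))

print(np.allclose(probes, eigvals), np.allclose(R_rebuilt, R))
```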

•We now want to investigate the representation of a data vector x which is a realisation of the random vector X.

•With m eigenvectors $q_j$ we have m possible projection directions. The projections of x onto the eigenvectors are given by:

$a_j = q_j^T x = x^T q_j$, $j = 1, 2, \dots, m$

•The numbers $a_j$ are called the principal components. To reconstruct the original vector x from the projections we combine all projections into a single vector:

$a = [a_1, a_2, \dots, a_m]^T$

$= [x^T q_1, x^T q_2, \dots, x^T q_m]^T$

$= Q^T x$

where Q is the matrix whose columns are the eigenvectors of R.

•Since the eigenvectors are orthonormal, $Q Q^T = I$, and from the above we see that:

$x = Q a = \sum_{j=1}^{m} a_j q_j$

•This is nothing more than a coordinate transformation from the input space, of vector x, to the feature space of the vector a.

•From the perspective of pattern recognition, the usefulness of the PCA method is that it provides an effective technique for dimensionality reduction.

•In particular, we may reduce the number of features needed for effective data representation by discarding those linear combinations in the previous formula that have small variances and retaining only those terms that have large variances.

•Let $\lambda_1, \lambda_2, \dots, \lambda_l$ denote the largest l eigenvalues of R. We may then approximate the vector x by truncating the previous sum to the first l terms:

$\hat{x} = \sum_{j=1}^{l} a_j q_j = [q_1, q_2, \dots, q_l] \, [a_1, a_2, \dots, a_l]^T$
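
The encoding, the exact reconstruction, and the truncated approximation can be sketched as follows, assuming NumPy and synthetic zero-mean data (the scales, the rotation, and the variable names are illustrative assumptions). The sketch also evaluates the mean-square error of the truncated reconstruction, which for the sample correlation matrix equals the sum of the discarded eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical zero-mean data: N samples of dimension m, keep l components.
N, m, l = 2000, 5, 2
scales = np.array([3.0, 2.0, 0.5, 0.2, 0.1])
rot, _ = np.linalg.qr(rng.normal(size=(m, m)))   # random orthogonal matrix
X = (rng.normal(size=(N, m)) * scales) @ rot.T
X -= X.mean(axis=0)

# Eigenvectors of the sample correlation matrix, sorted by decreasing eigenvalue.
R = X.T @ X / N
eigvals, Q = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]
eigvals, Q = eigvals[order], Q[:, order]

A = X @ Q                        # row n holds a = Q^T x(n)
X_full = A @ Q.T                 # x = Q a, exact reconstruction
X_hat = A[:, :l] @ Q[:, :l].T    # x_hat = sum over the first l terms

mse = np.mean(np.sum((X - X_hat) ** 2, axis=1))
print(mse, eigvals[l:].sum())
```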

Generalised Hebbian Algorithm

•We now present a neural network method which solves the PCA problem. It belongs to the so-called re-estimation class of PCA algorithms.

•The network which solves the problem is shown below:

•For the feedforward network shown we make two structural assumptions:

•Each neuron in the output layer of the network is linear;

•The network has m inputs and l outputs, both of which are specified. Moreover, the network has fewer outputs than inputs (i.e. l < m).

•It can be shown that under these assumptions, and by using a special form of Hebbian learning, the network learns to calculate the principal components at its output nodes.

•The GHA can be summarised as follows:

1. Initialise the synaptic weights of the network, $w_{ji}$, to small random values at time n=1. Assign a small positive value to the learning rate parameter $\eta$;

2. For n=1, j=1,2,…,l and i=1,2,…,m, compute:

$y_j(n) = \sum_{i=1}^{m} w_{ji}(n) x_i(n)$

$\Delta w_{ji}(n) = \eta \left[ y_j(n) x_i(n) - y_j(n) \sum_{k=1}^{j} w_{ki}(n) y_k(n) \right]$

where $x_i(n)$ is the ith component of the m-by-1 input vector x(n) and l is the desired number of principal components;

3. Increment n by 1, go to step 2, and continue until the synaptic weights $w_{ji}$ reach their steady-state values. For large n, the weight $w_{ji}$ of neuron j converges to the ith component of the eigenvector associated with the jth eigenvalue of the correlation matrix of the input vector x(n). The output neurons are ordered by decreasing eigenvalue of the correlation matrix, from $\lambda_1$ down to $\lambda_l$.
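
The three steps can be sketched in a few lines, assuming NumPy. This is only an illustrative implementation: the function name `gha`, the fixed learning rate, the number of passes over the data, and the synthetic data set are all assumptions. The update is written in matrix form, where the lower-triangular part of $y y^T$ supplies the sum over $k \le j$.

```python
import numpy as np

def gha(X, l, eta=0.005, epochs=10, seed=0):
    """Generalised Hebbian Algorithm on zero-mean data X of shape (N, m)."""
    rng = np.random.default_rng(seed)
    W = 0.1 * rng.normal(size=(l, X.shape[1]))   # W[j, i] holds w_ji
    for _ in range(epochs):
        for x in X:
            y = W @ x                            # y_j = sum_i w_ji x_i
            # Delta w_ji = eta * (y_j x_i - y_j * sum_{k<=j} w_ki y_k)
            W += eta * (np.outer(y, x) - np.tril(np.outer(y, y)) @ W)
    return W

# Hypothetical zero-mean data with variances 4, 1 and 0.25 along rotated axes.
rng = np.random.default_rng(3)
rot, _ = np.linalg.qr(rng.normal(size=(3, 3)))
X = (rng.normal(size=(2000, 3)) * np.array([2.0, 1.0, 0.5])) @ rot.T
X -= X.mean(axis=0)

W = gha(X, l=2)

# Reference: eigenvectors of the sample correlation matrix, largest eigenvalue first.
eigvals, Q = np.linalg.eigh(X.T @ X / len(X))
Q = Q[:, ::-1]

# |cosine| of the angle between each learned weight vector and its eigenvector.
alignment = [abs(W[j] @ Q[:, j]) / np.linalg.norm(W[j]) for j in range(2)]
print(alignment)
```

With a constant learning rate the weights keep fluctuating slightly around the eigenvectors, so the alignment values are close to, but not exactly, 1; the sign of each learned vector is arbitrary.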

Adaptive Principal Components Extraction

•Another algorithm for extracting the principal components is the adaptive principal components extraction (APEX) algorithm. This network uses both feedforward and feedback connections.

•The algorithm is iterative in nature and if we are given the first (j-1) principal components the jth one can be easily computed.

•This algorithm belongs to the class of decorrelating algorithms.

•The network that implements the algorithm is shown next:

•The network structure is defined as follows:

•Each neuron in the output layer is assumed to be linear;

•Feedforward connections exist from the input nodes to each of the neurons 1,2,…,j, with j<m. The feedforward connections operate with a Hebbian rule. They are excitatory and therefore provide amplification. These connections are represented by the vector $w_j(n)$.

• Lateral connections exist from the individual outputs of neurons 1,2,…,j-1 to neuron j of the output layer, thereby applying feedback to the network. These connections are represented by the vector $a_j(n)$. The lateral connections operate with an anti-Hebbian learning rule, which has the effect of making them inhibitory.

• The algorithm is summarised as follows:

1. Initialise the feedforward weight vector $w_j$ and the feedback weight vector $a_j$ to small random numbers at time n=1, where j=1,2,…,m. Assign a small positive value to the learning rate parameter $\eta$;

2. Set j=1, and for n=1,2,…, compute:

$y_1(n) = w_1^T(n) x(n)$

$w_1(n+1) = w_1(n) + \eta \left[ y_1(n) x(n) - y_1^2(n) w_1(n) \right]$

where x(n) is the input vector. For large n, we have $w_1(n) \to q_1$, where $q_1$ is the eigenvector associated with the largest eigenvalue $\lambda_1$ of the correlation matrix of x(n);

3. Set j=2, and for n=1,2,…, compute:

$\mathbf{y}_{j-1}(n) = [y_1(n), y_2(n), \dots, y_{j-1}(n)]^T$

$y_j(n) = w_j^T(n) x(n) + a_j^T(n) \mathbf{y}_{j-1}(n)$

$w_j(n+1) = w_j(n) + \eta \left[ y_j(n) x(n) - y_j^2(n) w_j(n) \right]$

$a_j(n+1) = a_j(n) - \eta \left[ y_j(n) \mathbf{y}_{j-1}(n) + y_j^2(n) a_j(n) \right]$

4. Increment j by 1, go to step 3, and continue until j=m, where m is the desired number of principal components. (Note that j=1 corresponds to the eigenvector associated with the largest eigenvalue, which is taken care of in step 2.) For large n we have $w_j(n) \to q_j$ and $a_j(n) \to 0$, where $q_j$ is the eigenvector associated with the jth eigenvalue of the correlation matrix of x(n).
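
A minimal sketch of these steps, assuming NumPy, is given next. The function name `apex`, the parameter `n_components` (used instead of m for the number of extracted components, to avoid confusion with the input dimension), the fixed learning rate, and the synthetic data are illustrative assumptions. Neurons are trained one at a time; the first neuron has no lateral inputs, so its update reduces to the rule of step 2.

```python
import numpy as np

def apex(X, n_components, eta=0.002, epochs=5, seed=0):
    """APEX on zero-mean data X of shape (N, m); returns (W, A)."""
    rng = np.random.default_rng(seed)
    m = X.shape[1]
    W = 0.1 * rng.normal(size=(n_components, m))    # row j is the feedforward vector w_j
    # Lateral vector of neuron j has one weight per preceding neuron (none for the first).
    A = [0.1 * rng.normal(size=j) for j in range(n_components)]

    def outputs(x, upto):
        # Outputs of the first `upto` neurons, each including its lateral input.
        y = np.zeros(upto)
        for k in range(upto):
            y[k] = W[k] @ x + A[k] @ y[:k]
        return y

    for j in range(n_components):                   # train one neuron at a time
        for _ in range(epochs):
            for x in X:
                y_prev = outputs(x, j)
                y_j = W[j] @ x + A[j] @ y_prev
                W[j] += eta * (y_j * x - y_j ** 2 * W[j])        # Hebbian update
                A[j] -= eta * (y_j * y_prev + y_j ** 2 * A[j])   # anti-Hebbian update
    return W, A

# Hypothetical zero-mean data with variances 4, 1 and 0.25 along rotated axes.
rng = np.random.default_rng(3)
rot, _ = np.linalg.qr(rng.normal(size=(3, 3)))
X = (rng.normal(size=(2000, 3)) * np.array([2.0, 1.0, 0.5])) @ rot.T
X -= X.mean(axis=0)

W, A = apex(X, n_components=2)

eigvals, Q = np.linalg.eigh(X.T @ X / len(X))
Q = Q[:, ::-1]                                      # largest eigenvalue first
alignment = [abs(W[j] @ Q[:, j]) / np.linalg.norm(W[j]) for j in range(2)]
print(alignment, A[1])
```

Because the learning rate is held constant here, the lateral weights settle near zero rather than exactly at zero.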

Kernel Principal Components Analysis

• A last algorithm, which uses kernels (more on these in the SVM lecture), is given below. We simply summarise the algorithm.

• This algorithm can be considered a non-linear PCA method, as we first map the input space into a feature space using a non-linear transform $\varphi(x)$ and then perform a linear PCA in the feature space. This differs from the previous methods, which calculate a linear transformation between the input and the feature spaces.

• Summary of the kernel PCA method:

1. Given the training examples $\{x_i\}_{i=1}^{N}$, compute the N-by-N kernel matrix $K = \{K(x_i, x_j)\}$, where:

$K(x_i, x_j) = \varphi^T(x_i) \varphi(x_j)$

2. Solve the eigenvalue problem:

$K \mathbf{a} = \lambda \mathbf{a}$

where $\lambda$ is an eigenvalue of the kernel matrix K and $\mathbf{a}$ is the associated eigenvector;

3. Normalise the eigenvectors so computed by requiring that:

$\mathbf{a}_k^T \mathbf{a}_k = 1/\lambda_k$, $k = 1, 2, \dots, p$

where $\lambda_p$ is the smallest nonzero eigenvalue of the matrix K, assuming that the eigenvalues are arranged in decreasing order;

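4. For the extraction of the principal components of a test point x, compute the projections:

$a_k = \tilde{q}_k^T \varphi(x) = \sum_{j=1}^{N} a_{k,j} K(x_j, x)$, $k = 1, 2, \dots, p$

where $a_{k,j}$ is the jth element of the eigenvector $\mathbf{a}_k$, and the scalar $a_k$ on the left-hand side is the kth principal component of x.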
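
The four steps can be sketched as below, assuming NumPy. The Gaussian (RBF) kernel, its width `gamma`, the function names, and the synthetic training set are illustrative assumptions; the summary above does not prescribe a kernel. Like the summary, the sketch does not centre the data in the feature space.

```python
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    """Gaussian kernel matrix between the rows of X and the rows of Y."""
    sq = np.sum(X ** 2, axis=1)[:, None] + np.sum(Y ** 2, axis=1)[None, :] - 2 * X @ Y.T
    return np.exp(-gamma * sq)

def kernel_pca_fit(X, p, kernel):
    """Steps 1-3: kernel matrix, eigenproblem, normalisation a_k^T a_k = 1/lambda_k."""
    K = kernel(X, X)
    eigvals, vecs = np.linalg.eigh(K)            # ascending order, unit-norm columns
    order = np.argsort(eigvals)[::-1][:p]        # keep the p largest eigenvalues
    eigvals, vecs = eigvals[order], vecs[:, order]
    alphas = vecs / np.sqrt(eigvals)             # rescale each column to norm 1/sqrt(lambda_k)
    return eigvals, alphas

def kernel_pca_project(X_train, alphas, X_new, kernel):
    """Step 4: a_k = sum_j a_{k,j} K(x_j, x) for every row x of X_new."""
    return kernel(X_new, X_train) @ alphas

# Hypothetical training set and test point.
rng = np.random.default_rng(4)
X_train = rng.normal(size=(40, 2))
eigvals, alphas = kernel_pca_fit(X_train, p=2, kernel=rbf_kernel)

x_test = np.array([[0.5, -0.2]])
components = kernel_pca_project(X_train, alphas, x_test, rbf_kernel)
print(components)
```

Passing a linear kernel (the plain inner product) instead of `rbf_kernel` reproduces, up to sign, the projections of ordinary PCA on uncentred data, which is a convenient sanity check.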

Conclusions

•Typically we use PCA methods for dimension reduction as a pre-processing step before we apply other methods, for example in a pattern recognition problem.

•There are batch and adaptive numerical methods for the calculation of the PCA. An example of the first class is the Singular Value Decomposition (SVD) method, while the GHA is an example of an adaptive method.

•PCA is used mainly for finding clusters in high-dimensional spaces, as it is difficult to visualise these clusters otherwise.
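
As an illustration of the batch route, the sketch below (assuming NumPy and synthetic zero-mean data; all names are illustrative) computes the principal directions from the SVD of the data matrix and compares them with the eigen-decomposition of the sample correlation matrix. The right singular vectors play the role of the eigenvectors, and the squared singular values divided by N play the role of the eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical zero-mean data matrix with one sample per row.
X = rng.normal(size=(500, 4)) * np.array([3.0, 2.0, 1.0, 0.5])
X -= X.mean(axis=0)
N = len(X)

# Batch PCA via the SVD X = U diag(s) V^T; singular values come out in decreasing order.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
eigvals_svd = s ** 2 / N

# Reference: eigen-decomposition of R = X^T X / N, largest eigenvalue first.
eigvals, Q = np.linalg.eigh(X.T @ X / N)
eigvals, Q = eigvals[::-1], Q[:, ::-1]

# Directions agree up to sign.
print(np.allclose(eigvals_svd, eigvals), np.allclose(np.abs(Vt), np.abs(Q.T), atol=1e-6))
```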
