Lecture Topic: Principal Components

Jakub Marecek and Sean McGarraghy (UCD), Numerical Analysis and Software, November 5, 2015

Principal Component Analysis

Principal component analysis (PCA) is one of the most widely used tools in statistics (data science, machine learning).

It transforms a matrix of observations of possibly correlated variables into a matrix of values of linearly uncorrelated variables called principal components (PCs), where each PC is defined by a combination of the columns of the original matrix.

The first principal component accounts for as much of the variability in the data as possible, and each subsequent PC has the highest variance possible subject to being orthogonal to all preceding PCs.

When you are given a matrix too large to print on a sheet of A4, the first step in understanding it should involve PCA or its regularised variant.
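As a minimal sketch of that first step, assuming a NumPy environment and a hypothetical data matrix A (nothing below is prescribed by the lecture), the principal components can be read off the singular value decomposition:

import numpy as np

A = np.random.rand(100, 5)         # hypothetical data: 100 observations of 5 variables
A = A - A.mean(axis=0)             # centre each column to zero mean
U, s, Vt = np.linalg.svd(A, full_matrices=False)
scores = A @ Vt.T                  # values of the principal components
explained = s**2 / np.sum(s**2)    # fraction of variance explained per component
print(explained)                   # decreasing, by construction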


An Example

Let A ∈ Rn×n denote a data matrix, which encodes n ratings by each of n users, e.g., of movies or books.

Let us assume the ratings of each user, i.e., the values in each column, sum to 1.

The first 2 principal components could represent a new coordinate system with two axes, such as “likes horrors” and “likes romantic comedies”, depending on the actual ratings.

You could replace the ratings of one user in the original rating-per-movie form, i.e., one row, with a 2-vector, which would suggest how much the user likes horrors and how much the user likes romantic comedies.

For excellent interactive demonstrations, see: http://setosa.io/ev/principal-component-analysis/
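As a small illustrative sketch of that replacement, assuming a synthetic ratings matrix in NumPy (the genre interpretation of the two components is only hypothetical):

import numpy as np

np.random.seed(0)
A = np.random.rand(6, 6)           # hypothetical ratings of 6 items by 6 users
A = A / A.sum(axis=0)              # each user's (column's) ratings sum to 1
Ac = A - A.mean(axis=0)            # centre before extracting components
U, s, Vt = np.linalg.svd(Ac, full_matrices=False)
two_vectors = Ac @ Vt[:2].T        # one 2-vector per row, e.g., (likes horrors, likes romantic comedies)
print(two_vectors)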


Key Concepts

Principal Component Analysis is a linear transformation that maps the data into a new coordinate system such that the projection of the data onto the first coordinate explains more variance in the data than the projection onto the second coordinate, which in turn explains more variance than the projection onto the third coordinate, etc.

The Eigenvalue Problem: given A ∈ Rn×n, find the unknowns x ∈ Rn, x ≠ 0, and λ ∈ R such that Ax = λx.

Here, λ is called an eigenvalue of A and x is an eigenvector of A.

The Power Method computes the greatest eigenvalue by absolute value, based on the iteration x_{k+1} = A x_k / ‖A x_k‖.


Principal Component Analysis

Let A ∈ Rm×n denote a data matrix, which encodes m observations (samples) of dimension n each (e.g., n features), and which has been normalised such that each column has zero mean.

PCA extracts linear combinations of the columns of A while maximising the ℓ2 norm of Ax over coefficient vectors x with ‖x‖2 ≤ 1.

The first combination is extracted by solving:

max_{x∈Rn} ‖Ax‖2 such that ‖x‖2 ≤ 1. (2.1)

The optimum x∗ ∈ Rn is called the loading vector. Ax∗ is called the first principal component.
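A hedged numerical sketch of (2.1), assuming NumPy: the first right singular vector of A attains the maximum, so it can serve as the loading vector x∗.

import numpy as np

np.random.seed(1)
A = np.random.rand(8, 4)
A = A - A.mean(axis=0)             # zero-mean columns, as assumed above
U, s, Vt = np.linalg.svd(A, full_matrices=False)
x_star = Vt[0]                     # loading vector: maximises ‖Ax‖2 subject to ‖x‖2 <= 1
pc1 = A @ x_star                   # first principal component
print(np.linalg.norm(pc1), s[0])   # both equal the largest singular value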


Principal Component Analysis

Each row vector A_{i:} of A is mapped to a new coordinate t = A_{i:} · x with respect to the principal component defined by x.

One can consider further principal components, usually sorted in decreasing order of the amount of variance they explain.

These can be obtained by running the same method on an updated matrix A_{k+1} = A_k − x_k(x_k)ᵀ A_k x_k(x_k)ᵀ, which is known as Hotelling's deflation.

The combinations point in mutually orthogonal directions.
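A minimal sketch of Hotelling's deflation, assuming a symmetric (covariance-like) matrix A_k in NumPy and its leading unit eigenvector x_k:

import numpy as np

np.random.seed(2)
B = np.random.rand(4, 4)
A_k = B @ B.T                      # symmetric positive semidefinite matrix
vals, vecs = np.linalg.eigh(A_k)   # eigenvalues in increasing order
x_k = vecs[:, -1]                  # unit eigenvector of the largest eigenvalue
P = np.outer(x_k, x_k)
A_next = A_k - P @ A_k @ P         # Hotelling's deflation: A_k - x_k x_kᵀ A_k x_k x_kᵀ
print(np.linalg.eigvalsh(A_next).max(), vals[-2])   # the former second eigenvalue now dominates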


The Derivation

Proof sketch. Consider:

max_{x∈Rn} ‖Ax‖2 such that ‖x‖2 ≤ 1, (2.2)

where ‖ · ‖2 is the ℓ2 norm. Since ‖Ax‖2² = xᵀ(AᵀA)x, we may equivalently maximise xᵀAx for the symmetric matrix AᵀA; below, A stands for this symmetric matrix.

The Lagrangian is L(x) = xᵀAx − λ(xᵀx − 1), where λ ∈ R is a newly introduced Lagrange multiplier.

The stationary points of L(x) satisfy

dL(x)/dx = 0 (2.3)

2xᵀAᵀ − 2λxᵀ = 0 (2.4)

Ax = λx. (2.5)

Recalling that Ax = λx is the definition of an eigenvalue problem, each eigenvalue λ of A is hence the value of L(x) at a corresponding unit eigenvector of A.
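The stationarity condition can be checked numerically; a hedged sketch assuming NumPy, with AᵀA playing the role of the symmetric matrix above:

import numpy as np

np.random.seed(3)
A = np.random.rand(5, 3)
B = A.T @ A                        # the symmetric matrix from the proof sketch
vals, vecs = np.linalg.eigh(B)
x = vecs[:, -1]                    # unit eigenvector of the largest eigenvalue
print(np.allclose(B @ x, vals[-1] * x))    # Bx = lambda x at the stationary point
print(np.linalg.norm(A @ x)**2, vals[-1])  # the objective ‖Ax‖2² equals lambda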


Eigenvalues

Notice that we have seen eigenvalues a number of times before:

We have seen that positive definite matrices A, i.e., those with xᵀAx > 0 for all x ≠ 0, have xᵀAx = λxᵀx > 0 at any eigenvector x, and hence eigenvalues λ > 0.

We have seen the set σ(A) of all eigenvalues of A, called the spectrum.

The absolute value of the dominant eigenvalue is the spectral radius, which we have seen in the definition of a contraction mapping.

We have also seen the spectral condition number of symmetric A,

cond_rel(A) := max_{λ∈σ(A)} |λ| / min_{λ∈σ(A)} |λ|,

which was a measure of the distortion produced by A, i.e., the difference in expansion/contraction of eigenvectors that A can cause.
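As a quick hedged illustration in NumPy (assuming a symmetric matrix, so that the spectrum is real):

import numpy as np

np.random.seed(4)
B = np.random.rand(4, 4)
A = (B + B.T) / 2                                # symmetrise
lams = np.linalg.eigvalsh(A)
rho = np.abs(lams).max()                         # spectral radius
cond = np.abs(lams).max() / np.abs(lams).min()   # spectral condition number
print(rho, cond)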


Eigenvalues

Prior to explaining more about eigenvalues and eigenvectors, let us see an interactive demonstration: http://setosa.io/ev/eigenvectors-and-eigenvalues/


The Use in Python

Python libraries for computing eigenvalues and eigenvectors:

import numpy as np

A = np.random.rand(2, 2)
(vals, vecs) = np.linalg.eig(A)
# Check the defining property A x = lambda x for each eigenpair:
np.dot(A, vecs[:, 0]), vals[0] * vecs[:, 0]
np.dot(A, vecs[:, 1]), vals[1] * vecs[:, 1]


Eigenvalues: Bad News

The ith eigenvalue functional A ↦ λi(A) is not a linear functional beyond dimension one.

It is not convex (except for i = 1) or concave (except for i = n).

For any m ≥ 5, there is an m × m matrix with rational coefficients whose eigenvalues cannot be written using any expression involving rational numbers, addition, subtraction, multiplication, division, and taking kth roots.


Eigenvalues: Bad News Explained

Example (Trefethen and Bau, 25.1)

Consider the polynomial p(z) = z^m + a_{m−1}z^{m−1} + · · · + a_1z + a_0. The roots of p(z) are equal to the eigenvalues of the companion matrix

⎡ 0                −a_0     ⎤
⎢ 1  0             −a_1     ⎥
⎢    1  ⋱           ⋮       ⎥
⎢       ⋱  0       −a_{m−2} ⎥
⎣          1       −a_{m−1} ⎦
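A hedged check of the example in NumPy: build the companion matrix of a small cubic and compare its eigenvalues with the roots.

import numpy as np

a = np.array([6.0, -5.0, -2.0])            # a_0, a_1, a_2 for p(z) = z^3 - 2z^2 - 5z + 6
m = len(a)
C = np.zeros((m, m))
C[1:, :-1] = np.eye(m - 1)                 # ones on the subdiagonal
C[:, -1] = -a                              # last column holds -a_0, ..., -a_{m-1}
print(np.sort(np.linalg.eigvals(C)))       # eigenvalues of the companion matrix
print(np.sort(np.roots([1.0, -2.0, -5.0, 6.0])))   # roots of p: -2, 1, 3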


Eigenvalues: Good News

The ith eigenvalue functional A ↦ λi(A) is Lipschitz continuous for symmetric matrices, for fixed 1 ≤ i ≤ n.

There is a characterisation, resembling convexity:

Theorem (Courant-Fischer)

Let A ∈ Rn×n be symmetric. Then we have

λi(A) = sup_{dim(V)=i} inf_{v∈V: ‖v‖=1} vᵀAv

λi(A) = inf_{dim(V)=n−i+1} sup_{v∈V: ‖v‖=1} vᵀAv

for all 1 ≤ i ≤ n, where V ranges over all subspaces of Rn with the indicated dimension.


Eigenvalues: Perturbation Analysis

Let us consider symmetric matrices A, B ∈ Rn×n and view B as a perturbation of A. One can prove the so-called Weyl inequalities:

λ_{i+j−1}(A + B) ≤ λ_i(A) + λ_j(B), (3.1)

for all i, j ≥ 1 and i + j − 1 ≤ n, the so-called Ky Fan inequality:

λ_1(A + B) + . . . + λ_k(A + B) ≤ λ_1(A) + . . . + λ_k(A) + λ_1(B) + . . . + λ_k(B), (3.2)

and the Tao inequality:

|λ_i(A + B) − λ_i(A)| ≤ ‖B‖_op = max(|λ_1(B)|, |λ_n(B)|). (3.3)

These suggest that for symmetric matrices, the spectrum of A + B is close to that of A if B is small in operator norm.
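A hedged numerical check of the Tao inequality (3.3), assuming random symmetric A and a small symmetric perturbation B in NumPy:

import numpy as np

np.random.seed(5)
X = np.random.rand(4, 4); A = (X + X.T) / 2
Y = np.random.rand(4, 4); B = 0.01 * (Y + Y.T) / 2   # small symmetric perturbation
lA = np.sort(np.linalg.eigvalsh(A))[::-1]            # eigenvalues, decreasing
lAB = np.sort(np.linalg.eigvalsh(A + B))[::-1]
op = np.abs(np.linalg.eigvalsh(B)).max()             # operator norm of symmetric B
print(np.abs(lAB - lA).max() <= op + 1e-12)          # (3.3) holds for every i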


Another Example

Google started out with a patent application for PageRank. There, A ∈ Rn×n is an “adjacency” matrix, which encodes the links between pairs of websites; n is the number of websites.

The eigenvector corresponding to the dominant eigenvalue suggested a reasonable rating of the websites for use in search results.

There, however, you need a very efficient method of computing the eigenvector, since n is very large.
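A toy hedged sketch of that computation, assuming a tiny column-stochastic link matrix (real PageRank also adds a damping term, omitted here):

import numpy as np

# Hypothetical 3-website link structure; column j spreads its weight over the sites it links to.
A = np.array([[0.0, 0.5, 0.5],
              [0.5, 0.0, 0.5],
              [0.5, 0.5, 0.0]])
x = np.ones(3) / 3                 # start from the uniform distribution
for _ in range(100):               # power iteration towards the dominant eigenvector
    x = A @ x
    x = x / np.linalg.norm(x, 1)   # keep x a probability vector
print(x)                           # the rating of the websites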


Power Method

The power method is a simple iterative algorithm for computing the largest eigenvalue by absolute value.

It is based on the iteration

x_{k+1} = A x_k / ‖A x_k‖,

which is motivated by the fact that, given a matrix A and a basis of Rn comprising eigenvectors of A, vectors multiplied by A are expanded most in the direction of the eigenvector associated with the eigenvalue of largest magnitude.


An Example

Question: What happens as we repeatedly multiply a vector by a matrix?

To get some intuition, consider the following example.

Let A be a 2 × 2 matrix with eigenvalues 3 and 1/2 and corresponding eigenvectors y and z.

Recall that eigenvectors of distinct eigenvalues are always linearly independent. We can hence think of {y, z} as a basis of R2.

Let x0 be any linear combination of y and z, e.g., x0 = y + z.

(In general, we could study x0 = αy + βz for arbitrary scalars α, β, but here we are just letting α and β both equal 1, to simplify the discussion.)


An Example: Expansion/Contraction by Eigenvalue

Now let x1 = Ax0 and x_{k+1} = Ax_k, so that x_k = A^k x_0 for each k ∈ N.

Since matrix multiplication is distributive over addition,

x_1 = Ay + Az = 3y + (1/2)z.

Thus the y component of x0 is expanded while the z component is contracted.

Repeating this process k times, we get:

x_k = Ax_{k−1} = A^k x_0 = 3^k y + (1/2)^k z.

Thus, x_k expands in the y direction and contracts in the z direction.

Eventually, x_k will be almost completely made up of its y component, and the z component will be negligible.
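A hedged numerical illustration, assuming a concrete diagonal matrix with eigenvalues 3 and 1/2, so that y and z are the coordinate axes:

import numpy as np

A = np.diag([3.0, 0.5])            # eigenvalues 3 and 1/2; eigenvectors y = e_1, z = e_2
x = np.array([1.0, 1.0])           # x_0 = y + z
for k in range(1, 6):
    x = A @ x
    print(k, x)                    # x_k = (3^k, (1/2)^k): the y part grows, the z part vanishes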


An Example: The Dominant Eigenvalue

Note that how many iterations k are needed depends on how much the biggest eigenvalue exceeds the second biggest in magnitude.

In the example, the ratio is 3/(1/2) = 6, so expansion in the y direction is happening 6 times faster than contraction in the z direction.

In general, for any matrix A, the growth rate of its powers A^k as k → ∞ depends on the eigenvalue(s) of A whose absolute value is greatest: this largest absolute value is of course the spectral radius ρ(A).

These eigenvalue(s) are generally called the dominant eigenvalue(s) of A.


Power Method: In General

Let us generalise the example. Denote the approximation to the eigenvector at iteration k by x_k.

The initial x_1 may be chosen randomly, or set to an approximation to a dominant eigenvector x, if there is some knowledge of this.

The core of the method is the step

x_{k+1} := A x_k / ‖A x_k‖.

That is, at each iteration, x_k is left-multiplied by A and normalised (divided by its own norm), giving a unit vector in the direction of A x_k.

If x_k were an eigenvector of A, then x_{k+1} would be equal to x_k (since they are both unit vectors, both in the same direction). This suggests a Cauchy-type convergence criterion.


Power Method: In Python

Choose the starting x_1 so that ‖x_1‖ = 1.

    import numpy as np

    def Power(A, x0, tol=1e-5, limit=100):
        # x0 is the starting vector x_1, assumed to be a unit vector
        x = x0
        for iteration in range(limit):
            x_next = A @ x                             # left-multiply by A
            x_next = x_next / np.linalg.norm(x_next)   # normalise to a unit vector
            if np.allclose(x, x_next, atol=tol):       # Cauchy-type stopping criterion
                break
            x = x_next
        return x_next
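A minimal usage sketch (the small symmetric matrix and starting vector below are made up for illustration; symmetry guarantees real eigenvalues):

    A = np.array([[2.0, 1.0],
                  [1.0, 3.0]])      # symmetric, so its eigenvalues are real
    x0 = np.array([1.0, 0.0])       # a unit starting vector
    x = Power(A, x0)
    lam = x @ A @ x                 # since ‖x‖ = 1, this is the Rayleigh quotient estimate of λ1
    print(lam)                      # ≈ 3.618, the dominant eigenvalue of A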


Power Method: Convergence

The method will converge if:

- A has a dominant real eigenvalue λ1, that is, |λ1| is strictly larger than |λ| for any other eigenvalue λ of A, and
- the initial vector has a nonzero component in the direction of the eigenvector x corresponding to λ1.

Certain classes of matrices are guaranteed to have real eigenvalues, e.g., symmetric matrices and positive definite matrices.

Also, the adjacency matrix of a connected graph has a real eigenvalue equal to its spectral radius, by the Perron-Frobenius theorem (it is strictly dominant when the graph is, in addition, non-bipartite).

On the other hand, the power iteration method will fail if A's dominant eigenvalue(s) have non-zero imaginary parts.
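To see this failure mode concretely, here is a small sketch (the rotation matrix is chosen for illustration): a 90-degree rotation has eigenvalues +i and −i, equal in absolute value and purely imaginary, so the iterates cycle forever instead of settling.

    import numpy as np

    R = np.array([[0.0, -1.0],
                  [1.0,  0.0]])     # rotation by 90 degrees; eigenvalues are +i and -i
    x = np.array([1.0, 0.0])
    for k in range(4):
        x = R @ x
        x = x / np.linalg.norm(x)
        print(x)                    # cycles through (0,1), (-1,0), (0,-1), (1,0) for ever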


Power Method: Iteration Complexity

Theorem (Trefethen and Bau, 27.1)

Assuming |λ1| > |λ2| ≥ · · · ≥ |λm| ≥ 0 and that x_0 has a nonzero component in the direction of the dominant eigenvector x∗, the iterates of the power method satisfy

‖x_k − (±x∗)‖ = O(|λ2/λ1|^k),    (4.1)

|(x_k)^T A x_k − λ1| = O(|λ2/λ1|^{2k})    (4.2)

as k → ∞. The ± indicates that at each iteration k, the bound holds for one of the signs.

So convergence is fast, with the error shrinking by roughly a factor of |λ2/λ1| per iteration, except when the largest and second-largest eigenvalues of A are very close in absolute value.
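A quick numerical check of this rate, as a sketch (the diagonal matrix is chosen so that |λ2/λ1| = 0.5 is known exactly):

    import numpy as np

    A = np.diag([2.0, 1.0])              # λ1 = 2, λ2 = 1, so |λ2/λ1| = 0.5
    x_star = np.array([1.0, 0.0])        # the dominant eigenvector
    x = np.array([1.0, 1.0]) / np.sqrt(2.0)
    for k in range(1, 6):
        x = A @ x
        x = x / np.linalg.norm(x)
        err = np.linalg.norm(x - x_star)
        print(k, err, err / 0.5**k)      # the last column stays roughly constant, as (4.1) predicts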


Power Method: Per-Iteration Complexity

Assuming A is dense, the worst-case complexity per iteration is n^2 multiplications in the computation of A x_k.

Computing the vector norm ‖A x_k‖ = √((A x_k) · (A x_k)) also takes n multiplications.

This gives a total of O(n^2) operations per iteration.

Assuming A is sparse, such as in PageRank computations, the number of floating point operations in the computation of A x_k is twice the number of non-zeros in A, independent of n.
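As a sketch of the sparse case (the random sparse matrix below is purely illustrative), using scipy.sparse: one matrix-vector product touches each stored non-zero once, so its cost scales with nnz(A) rather than n^2.

    import numpy as np
    import scipy.sparse as sp

    n = 100_000
    A = sp.random(n, n, density=1e-4, format='csr')  # ~10^6 stored non-zeros, not 10^10 entries
    x = np.ones(n) / np.sqrt(n)
    y = A @ x                       # one power-method step: O(nnz(A)) flops
    y = y / np.linalg.norm(y)
    print(A.nnz)                    # number of stored non-zeros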


Power Method: Further Properties

Once the dominant eigenvalue is found, the deflation method can be used to find the second-largest eigenvalue, and so on. For finding all eigenvalues, though, decomposition methods may be preferable.

If the initial x_1 has a zero component in the x direction, we rely on rounding errors to introduce a component in the x direction at some stage of the run. Once we get a non-zero x component, it will grow relative to the other components, but we may need to wait for this component to appear and then grow.

The expression Ax could be replaced based on

Ax = λx ⇒ x^T A x = λ x^T x = λ (x · x) = λ‖x‖^2 ⇒ λ = (1/‖x‖^2) x^T A x,

which makes it possible to introduce additional checks to avoid under- and over-flows and division by very small numbers.

Otherwise, one only needs to store A and the two vectors x_k and A x_k. This is possible even for an n × n matrix A with n in the billions, such as for the Google PageRank.