Lecture Topic: Principal Components

Jakub Marecek and Sean McGarraghy (UCD), Numerical Analysis and Software, November 5, 2015

Principal Component Analysis

Principal component analysis (PCA) is one of the most widely used tools in statistics (data science, machine learning).

It transforms a matrix of observations of possibly correlated variables into a matrix of values of linearly uncorrelated variables called principal components (PCs), where each PC is defined by a combination of the columns of the original matrix.

The first principal component accounts for as much of the variability in the data as possible, and each subsequent PC has the highest variance possible subject to being orthogonal to all preceding PCs.

When you are given a matrix too large to print on a sheet of A4, the first step in understanding it should involve PCA or its regularised variant.
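As a minimal sketch of that first step, assuming a NumPy environment and a hypothetical data matrix A (nothing below is prescribed by the lecture), the principal components can be read off the singular value decomposition:

import numpy as np

A = np.random.rand(100, 5)         # hypothetical data: 100 observations of 5 variables
A = A - A.mean(axis=0)             # centre each column to zero mean
U, s, Vt = np.linalg.svd(A, full_matrices=False)
scores = A @ Vt.T                  # values of the principal components
explained = s**2 / np.sum(s**2)    # fraction of variance explained per component
print(explained)                   # decreasing, by construction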


An Example

Let A ∈ Rn×n denote a data matrix, which encodes n ratings by each of n users, e.g., of movies or books.

Let us assume the ratings of each user, i.e., the values in each column, sum to 1.

The first 2 principal components could represent a new coordinate system with two axes, such as “likes horrors” and “likes romantic comedies”, depending on the actual ratings.

You could replace the ratings of one user in the original rating-per-movie form, i.e., one row, with a 2-vector, which would suggest how much the user likes horrors and how much the user likes romantic comedies.

For excellent interactive demonstrations, see: http://setosa.io/ev/principal-component-analysis/
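As a small illustrative sketch of that replacement, assuming a synthetic ratings matrix in NumPy (the genre interpretation of the two components is only hypothetical):

import numpy as np

np.random.seed(0)
A = np.random.rand(6, 6)           # hypothetical ratings of 6 items by 6 users
A = A / A.sum(axis=0)              # each user's (column's) ratings sum to 1
Ac = A - A.mean(axis=0)            # centre before extracting components
U, s, Vt = np.linalg.svd(Ac, full_matrices=False)
two_vectors = Ac @ Vt[:2].T        # one 2-vector per row, e.g., (likes horrors, likes romantic comedies)
print(two_vectors)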


Key Concepts

Principal Component Analysis is a linear transformation that maps the data into a new coordinate system such that the projection of the data onto the first coordinate explains more variance in the data than the projection onto the second coordinate, which in turn explains more variance than the projection onto the third coordinate, etc.

The Eigenvalue Problem: given A ∈ Rn×n, find the unknowns x ∈ Rn, x ≠ 0, and λ ∈ R such that Ax = λx.

Here, λ is called an eigenvalue of A and x is an eigenvector of A.

The Power Method computes the greatest eigenvalue by absolute value, based on the iteration x_{k+1} = A x_k / ‖A x_k‖.


Principal Component Analysis

Let A ∈ Rm×n denote a data matrix, which encodes m observations (samples) of dimension n each (e.g., n features), and which has been normalised such that each column has zero mean.

PCA extracts linear combinations of the columns of A while maximising the ℓ2 norm of Ax over coefficient vectors x with ‖x‖2 ≤ 1.

The first combination is extracted by solving:

max_{x∈Rn} ‖Ax‖2 such that ‖x‖2 ≤ 1. (2.1)

The optimum x∗ ∈ Rn is called the loading vector. Ax∗ is called the first principal component.
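A hedged numerical sketch of (2.1), assuming NumPy: the first right singular vector of A attains the maximum, so it can serve as the loading vector x∗.

import numpy as np

np.random.seed(1)
A = np.random.rand(8, 4)
A = A - A.mean(axis=0)             # zero-mean columns, as assumed above
U, s, Vt = np.linalg.svd(A, full_matrices=False)
x_star = Vt[0]                     # loading vector: maximises ‖Ax‖2 subject to ‖x‖2 <= 1
pc1 = A @ x_star                   # first principal component
print(np.linalg.norm(pc1), s[0])   # both equal the largest singular value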


Principal Component Analysis

Each row vector A_{i:} of A is mapped to a new coordinate t = A_{i:} · x with respect to the principal component defined by x.

One can consider further principal components, usually sorted in decreasing order of the amount of variance they explain.

These can be obtained by running the same method on an updated matrix A_{k+1} = A_k − x_k(x_k)ᵀ A_k x_k(x_k)ᵀ, which is known as Hotelling's deflation.

The combinations point in mutually orthogonal directions.
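A minimal sketch of Hotelling's deflation, assuming a symmetric (covariance-like) matrix A_k in NumPy and its leading unit eigenvector x_k:

import numpy as np

np.random.seed(2)
B = np.random.rand(4, 4)
A_k = B @ B.T                      # symmetric positive semidefinite matrix
vals, vecs = np.linalg.eigh(A_k)   # eigenvalues in increasing order
x_k = vecs[:, -1]                  # unit eigenvector of the largest eigenvalue
P = np.outer(x_k, x_k)
A_next = A_k - P @ A_k @ P         # Hotelling's deflation: A_k - x_k x_kᵀ A_k x_k x_kᵀ
print(np.linalg.eigvalsh(A_next).max(), vals[-2])   # the former second eigenvalue now dominates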


The Derivation

Proof sketch. Consider:

max_{x∈Rn} ‖Ax‖2 such that ‖x‖2 ≤ 1, (2.2)

where ‖ · ‖2 is the ℓ2 norm. Since ‖Ax‖2² = xᵀ(AᵀA)x, we may equivalently maximise xᵀAx for the symmetric matrix AᵀA; below, A stands for this symmetric matrix.

The Lagrangian is L(x) = xᵀAx − λ(xᵀx − 1), where λ ∈ R is a newly introduced Lagrange multiplier.

The stationary points of L(x) satisfy

dL(x)/dx = 0 (2.3)

2xᵀAᵀ − 2λxᵀ = 0 (2.4)

Ax = λx. (2.5)

Recalling that Ax = λx is the definition of an eigenvalue problem, each eigenvalue λ of A is hence the value of L(x) at a corresponding unit eigenvector of A.
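The stationarity condition can be checked numerically; a hedged sketch assuming NumPy, with AᵀA playing the role of the symmetric matrix above:

import numpy as np

np.random.seed(3)
A = np.random.rand(5, 3)
B = A.T @ A                        # the symmetric matrix from the proof sketch
vals, vecs = np.linalg.eigh(B)
x = vecs[:, -1]                    # unit eigenvector of the largest eigenvalue
print(np.allclose(B @ x, vals[-1] * x))    # Bx = lambda x at the stationary point
print(np.linalg.norm(A @ x)**2, vals[-1])  # the objective ‖Ax‖2² equals lambda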


Eigenvalues

Notice that we have seen eigenvalues a number of times before:

We have seen that positive definite matrices A, i.e., those with xᵀAx > 0 for all x ≠ 0, have xᵀAx = λxᵀx > 0 at any eigenvector x, and hence eigenvalues λ > 0.

We have seen the set σ(A) of all eigenvalues of A, called the spectrum.

The absolute value of the dominant eigenvalue is the spectral radius, which we have seen in the definition of a contraction mapping.

We have also seen the spectral condition number of symmetric A,

cond_rel(A) := max_{λ∈σ(A)} |λ| / min_{λ∈σ(A)} |λ|,

which was a measure of the distortion produced by A, i.e., the difference in expansion/contraction of eigenvectors that A can cause.
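As a quick hedged illustration in NumPy (assuming a symmetric matrix, so that the spectrum is real):

import numpy as np

np.random.seed(4)
B = np.random.rand(4, 4)
A = (B + B.T) / 2                                # symmetrise
lams = np.linalg.eigvalsh(A)
rho = np.abs(lams).max()                         # spectral radius
cond = np.abs(lams).max() / np.abs(lams).min()   # spectral condition number
print(rho, cond)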


Eigenvalues

Prior to explaining more about eigenvalues and eigenvectors, let us see an interactive demonstration: http://setosa.io/ev/eigenvectors-and-eigenvalues/


The Use in Python

Python libraries for computing eigenvalues and eigenvectors:

import numpy as np

A = np.random.rand(2, 2)
(vals, vecs) = np.linalg.eig(A)
# Check the defining property A x = lambda x for each eigenpair:
np.dot(A, vecs[:, 0]), vals[0] * vecs[:, 0]
np.dot(A, vecs[:, 1]), vals[1] * vecs[:, 1]


Eigenvalues: Bad News

The ith eigenvalue functional A ↦ λi(A) is not a linear functional beyond dimension one.

It is not convex (except for i = 1) or concave (except for i = n).

For any m ≥ 5, there is an m × m matrix with rational coefficients whose eigenvalues cannot be written using any expression involving rational numbers, addition, subtraction, multiplication, division, and taking kth roots.


Eigenvalues: Bad News Explained

Example (Trefethen and Bau, 25.1)

Consider the polynomial p(z) = z^m + a_{m−1}z^{m−1} + · · · + a_1z + a_0. The roots of p(z) are equal to the eigenvalues of the companion matrix

⎡ 0                −a_0     ⎤
⎢ 1  0             −a_1     ⎥
⎢    1  ⋱           ⋮       ⎥
⎢       ⋱  0       −a_{m−2} ⎥
⎣          1       −a_{m−1} ⎦
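A hedged check of the example in NumPy: build the companion matrix of a small cubic and compare its eigenvalues with the roots.

import numpy as np

a = np.array([6.0, -5.0, -2.0])            # a_0, a_1, a_2 for p(z) = z^3 - 2z^2 - 5z + 6
m = len(a)
C = np.zeros((m, m))
C[1:, :-1] = np.eye(m - 1)                 # ones on the subdiagonal
C[:, -1] = -a                              # last column holds -a_0, ..., -a_{m-1}
print(np.sort(np.linalg.eigvals(C)))       # eigenvalues of the companion matrix
print(np.sort(np.roots([1.0, -2.0, -5.0, 6.0])))   # roots of p: -2, 1, 3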


Eigenvalues: Good News

The ith eigenvalue functional A ↦ λi(A) is Lipschitz continuous for symmetric matrices, for fixed 1 ≤ i ≤ n.

There is a characterisation, resembling convexity:

Theorem (Courant-Fischer)

Let A ∈ Rn×n be symmetric. Then we have

λi(A) = sup_{dim(V)=i} inf_{v∈V: ‖v‖=1} vᵀAv

λi(A) = inf_{dim(V)=n−i+1} sup_{v∈V: ‖v‖=1} vᵀAv

for all 1 ≤ i ≤ n, where V ranges over all subspaces of Rn with the indicated dimension.


Eigenvalues: Perturbation Analysis

Let us consider symmetric matrices A, B ∈ Rn×n and view B as a perturbation of A. One can prove the so-called Weyl inequalities:

λ_{i+j−1}(A + B) ≤ λ_i(A) + λ_j(B), (3.1)

for all i, j ≥ 1 and i + j − 1 ≤ n, the so-called Ky Fan inequality:

λ_1(A + B) + . . . + λ_k(A + B) ≤ λ_1(A) + . . . + λ_k(A) + λ_1(B) + . . . + λ_k(B), (3.2)

and the Tao inequality:

|λ_i(A + B) − λ_i(A)| ≤ ‖B‖_op = max(|λ_1(B)|, |λ_n(B)|). (3.3)

These suggest that for symmetric matrices, the spectrum of A + B is close to that of A if B is small in operator norm.
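A hedged numerical check of the Tao inequality (3.3), assuming random symmetric A and a small symmetric perturbation B in NumPy:

import numpy as np

np.random.seed(5)
X = np.random.rand(4, 4); A = (X + X.T) / 2
Y = np.random.rand(4, 4); B = 0.01 * (Y + Y.T) / 2   # small symmetric perturbation
lA = np.sort(np.linalg.eigvalsh(A))[::-1]            # eigenvalues, decreasing
lAB = np.sort(np.linalg.eigvalsh(A + B))[::-1]
op = np.abs(np.linalg.eigvalsh(B)).max()             # operator norm of symmetric B
print(np.abs(lAB - lA).max() <= op + 1e-12)          # (3.3) holds for every i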


Another Example

Google started out with a patent application for PageRank. There, A ∈ Rn×n is an “adjacency” matrix, which encodes the links between pairs of websites; n is the number of websites.

The eigenvector corresponding to the dominant eigenvalue suggested a reasonable rating of the websites for use in search results.

There, however, you need a very efficient method of computing the eigenvector, since n is very large.
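A toy hedged sketch of that computation, assuming a tiny column-stochastic link matrix (real PageRank also adds a damping term, omitted here):

import numpy as np

# Hypothetical 3-website link structure; column j spreads its weight over the sites it links to.
A = np.array([[0.0, 0.5, 0.5],
              [0.5, 0.0, 0.5],
              [0.5, 0.5, 0.0]])
x = np.ones(3) / 3                 # start from the uniform distribution
for _ in range(100):               # power iteration towards the dominant eigenvector
    x = A @ x
    x = x / np.linalg.norm(x, 1)   # keep x a probability vector
print(x)                           # the rating of the websites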


Power Method

The power method is a simple iterative algorithm for computing the largest eigenvalue by absolute value.

It is based on the iteration

x_{k+1} = A x_k / ‖A x_k‖,

which is motivated by the fact that, given a matrix A and a basis of Rn comprising eigenvectors of A, vectors multiplied by A are expanded most in the direction of the eigenvector associated with the eigenvalue of largest magnitude.


An Example

Question: What happens as we repeatedly multiply a vector by a matrix?

To get some intuition, consider the following example.

Let A be a 2 × 2 matrix with eigenvalues 3 and 1/2 and corresponding eigenvectors y and z.

Recall that eigenvectors of distinct eigenvalues are always linearly independent. We can hence think of {y, z} as a basis of R2.

Let x0 be any linear combination of y and z, e.g., x0 = y + z.

(In general, we could study x0 = αy + βz for arbitrary scalars α, β, but here we are just letting α and β both equal 1, to simplify the discussion.)


An Example: Expansion/Contraction by Eigenvalue

Now let x1 = Ax0 and x_{k+1} = Ax_k, so that x_k = A^k x_0 for each k ∈ N.

Since matrix multiplication is distributive over addition,

x_1 = Ay + Az = 3y + (1/2)z.

Thus the y component of x0 is expanded while the z component is contracted.

Repeating this process k times, we get:

x_k = Ax_{k−1} = A^k x_0 = 3^k y + (1/2)^k z.

Thus, x_k expands in the y direction and contracts in the z direction.

Eventually, x_k will be almost completely made up of its y component, and the z component will be negligible.
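A hedged numerical illustration, assuming a concrete diagonal matrix with eigenvalues 3 and 1/2, so that y and z are the coordinate axes:

import numpy as np

A = np.diag([3.0, 0.5])            # eigenvalues 3 and 1/2; eigenvectors y = e_1, z = e_2
x = np.array([1.0, 1.0])           # x_0 = y + z
for k in range(1, 6):
    x = A @ x
    print(k, x)                    # x_k = (3^k, (1/2)^k): the y part grows, the z part vanishes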


An Example: The Dominant Eigenvalue

Note that how many iterations k are needed depends on how much the biggest eigenvalue exceeds the second biggest in magnitude.

In the example, the ratio is 3/(1/2) = 6, so expansion in the y direction is happening 6 times faster than contraction in the z direction.

In general, for any matrix A, the growth rate of its powers A^k as k → ∞ depends on the eigenvalue(s) of A whose absolute value is greatest: this largest absolute value is of course the spectral radius ρ(A).

These eigenvalue(s) are generally called the dominant eigenvalue(s) of A.


Power Method: In General

Let us generalise the example. Denote the approximation to the eigenvector at iteration k by x_k.

The initial x_1 may be chosen randomly, or set to an approximation to a dominant eigenvector x, if there is some knowledge of this.

The core of the method is the step

x_{k+1} := A x_k / ‖A x_k‖.

That is, at each iteration, x_k is left-multiplied by A and normalised (divided by its own norm), giving a unit vector in the direction of A x_k.

If x_k were an eigenvector of A, then x_{k+1} would be equal to x_k (since they are both unit vectors, both in the same direction). This suggests a Cauchy-type convergence criterion.


Power Method: In Python

Choose the starting x_1 so that ‖x_1‖ = 1.

    import numpy as np

    def Power(A, x0, tol=1e-5, limit=100):
        # x0 is the starting vector x_1, assumed to be a unit vector
        x = x0
        for iteration in range(limit):
            x_next = A @ x                             # left-multiply by A
            x_next = x_next / np.linalg.norm(x_next)   # normalise to a unit vector
            if np.allclose(x, x_next, atol=tol):       # Cauchy-type stopping criterion
                break
            x = x_next
        return x_next
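A minimal usage sketch (the small symmetric matrix and starting vector below are made up for illustration; symmetry guarantees real eigenvalues):

    A = np.array([[2.0, 1.0],
                  [1.0, 3.0]])      # symmetric, so its eigenvalues are real
    x0 = np.array([1.0, 0.0])       # a unit starting vector
    x = Power(A, x0)
    lam = x @ A @ x                 # since ‖x‖ = 1, this is the Rayleigh quotient estimate of λ1
    print(lam)                      # ≈ 3.618, the dominant eigenvalue of A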


Power Method: Convergence

The method will converge if:

- A has a dominant real eigenvalue λ1, that is, |λ1| is strictly larger than |λ| for any other eigenvalue λ of A, and
- the initial vector has a nonzero component in the direction of the eigenvector x corresponding to λ1.

Certain classes of matrices are guaranteed to have real eigenvalues, e.g., symmetric matrices and positive definite matrices.

Also, the adjacency matrix of a connected graph has a real eigenvalue equal to its spectral radius, by the Perron-Frobenius theorem (it is strictly dominant when the graph is, in addition, non-bipartite).

On the other hand, the power iteration method will fail if A's dominant eigenvalue(s) have non-zero imaginary parts.
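To see this failure mode concretely, here is a small sketch (the rotation matrix is chosen for illustration): a 90-degree rotation has eigenvalues +i and −i, equal in absolute value and purely imaginary, so the iterates cycle forever instead of settling.

    import numpy as np

    R = np.array([[0.0, -1.0],
                  [1.0,  0.0]])     # rotation by 90 degrees; eigenvalues are +i and -i
    x = np.array([1.0, 0.0])
    for k in range(4):
        x = R @ x
        x = x / np.linalg.norm(x)
        print(x)                    # cycles through (0,1), (-1,0), (0,-1), (1,0) for ever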


Power Method: Iteration Complexity

Theorem (Trefethen and Bau, 27.1)

Assuming |λ1| > |λ2| ≥ · · · ≥ |λm| ≥ 0 and that x_0 has a nonzero component in the direction of the dominant eigenvector x∗, the iterates of the power method satisfy

‖x_k − (±x∗)‖ = O(|λ2/λ1|^k),    (4.1)

|(x_k)^T A x_k − λ1| = O(|λ2/λ1|^{2k})    (4.2)

as k → ∞. The ± indicates that at each iteration k, the bound holds for one of the signs.

So convergence is fast, with the error shrinking by roughly a factor of |λ2/λ1| per iteration, except when the largest and second-largest eigenvalues of A are very close in absolute value.
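A quick numerical check of this rate, as a sketch (the diagonal matrix is chosen so that |λ2/λ1| = 0.5 is known exactly):

    import numpy as np

    A = np.diag([2.0, 1.0])              # λ1 = 2, λ2 = 1, so |λ2/λ1| = 0.5
    x_star = np.array([1.0, 0.0])        # the dominant eigenvector
    x = np.array([1.0, 1.0]) / np.sqrt(2.0)
    for k in range(1, 6):
        x = A @ x
        x = x / np.linalg.norm(x)
        err = np.linalg.norm(x - x_star)
        print(k, err, err / 0.5**k)      # the last column stays roughly constant, as (4.1) predicts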


Power Method: Per-Iteration Complexity

Assuming A is dense, the worst-case complexity per iteration is n^2 multiplications in the computation of A x_k.

Computing the vector norm ‖A x_k‖ = √((A x_k) · (A x_k)) also takes n multiplications.

This gives a total of O(n^2) operations per iteration.

Assuming A is sparse, such as in PageRank computations, the number of floating point operations in the computation of A x_k is twice the number of non-zeros in A, independent of n.
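As a sketch of the sparse case (the random sparse matrix below is purely illustrative), using scipy.sparse: one matrix-vector product touches each stored non-zero once, so its cost scales with nnz(A) rather than n^2.

    import numpy as np
    import scipy.sparse as sp

    n = 100_000
    A = sp.random(n, n, density=1e-4, format='csr')  # ~10^6 stored non-zeros, not 10^10 entries
    x = np.ones(n) / np.sqrt(n)
    y = A @ x                       # one power-method step: O(nnz(A)) flops
    y = y / np.linalg.norm(y)
    print(A.nnz)                    # number of stored non-zeros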


Power Method: Further Properties

Once the dominant eigenvalue is found, the deflation method can be used to find the second-largest eigenvalue, and so on. For finding all eigenvalues, though, decomposition methods may be preferable.

If the initial x_1 has a zero component in the x direction, we rely on rounding errors to introduce a component in the x direction at some stage of the run. Once we get a non-zero x component, it will grow relative to the other components, but we may need to wait for this component to appear and then grow.

The expression Ax could be replaced based on

Ax = λx ⇒ x^T A x = λ x^T x = λ (x · x) = λ‖x‖^2 ⇒ λ = (1/‖x‖^2) x^T A x,

which makes it possible to introduce additional checks to avoid under- and over-flows and division by very small numbers.

Otherwise, one only needs to store A and the two vectors x_k and A x_k. This is possible even for an n × n matrix A with n in the billions, such as for the Google PageRank.