PCA and admixture models - UCLA. Source: web.cs.ucla.edu/~sriram/courses/cm226.fall-2016/slides/pca.1.pdf


PCA and admixture models
CM226: Machine Learning for Bioinformatics

Fall 2016

Sriram Sankararaman
Acknowledgments: Fei Sha, Ameet Talwalkar, Alkes Price

PCA and admixture models 1 / 57

Announcements

• HW1 solutions posted.

PCA and admixture models 2 / 57

Supervised versus Unsupervised Learning

Unsupervised Learning: learning from unlabeled observations

• Dimensionality Reduction. Last class.

• Other latent variable models. This class + review of PCA.

PCA and admixture models 3 / 57

Outline

Dimensionality reduction

Linear Algebra background

PCA
Practical issues
Probabilistic PCA

Admixture models

Population structure and GWAS

PCA and admixture models Dimensionality reduction 4 / 57

Raw data can be complex, high-dimensional

• If we knew what to measure, we could find simple relationships.

• Signals have redundancy.

• Genotypes measured at ≈ 500K SNPs.

• Genotypes at neighboring SNPs correlated.

PCA and admixture models Dimensionality reduction 5 / 57

Dimensionality reduction

Goal: Find a “more compact” representation of data. Why?

• Visualize and discover hidden patterns.

• Preprocessing for a supervised learning problem.

• Statistical: remove noise.

• Computational: reduce wasteful computation.

PCA and admixture models Dimensionality reduction 6 / 57


An example

• We measure parents’ and offspring’s heights.

• Two measurements.
• Points in R^2.

• How can we find a more “compact” representation?

• The two measurements are correlated, with some noise.

• Pick a direction and project.

PCA and admixture models Dimensionality reduction 7 / 57


Goal: Minimize reconstruction error

• Find the projection that minimizes the Euclidean distance between the original points and their projections.

• Principal Components Analysis solves this problem!

PCA and admixture models Dimensionality reduction 8 / 57
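A minimal sketch of this 2-D example in Python (the simulated heights and all numbers are illustrative, not from the slides): project the centered data onto the top eigenvector of its covariance and compare the reconstruction error against an arbitrary direction.

# Sketch of the 2-D example: simulated, centered parent/offspring heights,
# projected onto a single direction (data and numbers are illustrative only).
import numpy as np

rng = np.random.default_rng(0)
N = 500
parent = rng.normal(0.0, 1.0, size=N)
offspring = 0.8 * parent + rng.normal(0.0, 0.5, size=N)   # correlated + noise
X = np.column_stack([parent, offspring])
X -= X.mean(axis=0)                                       # center the data

C = X.T @ X / N                                           # 2x2 covariance
evals, evecs = np.linalg.eigh(C)                          # ascending eigenvalues
w = evecs[:, -1]                                          # top eigenvector

def recon_error(X, w):
    """Mean squared distance between points and their projections onto w."""
    z = X @ w                                             # 1-D scores
    return np.mean(np.sum((X - np.outer(z, w)) ** 2, axis=1))

print(recon_error(X, w))                                  # small: best direction
print(recon_error(X, np.array([1.0, 0.0])))               # larger: axis-aligned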

Principal Components Analysis

PCA: find a lower-dimensional representation of the data

• Choose K.

• X is the N × M raw data matrix.

• X ≈ Z W^T, where Z is the N × K reduced representation (PC scores).

• W is M × K (its columns are the principal components).

PCA and admixture models Dimensionality reduction 9 / 57
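To make the shapes concrete, a small sketch using scikit-learn (the data is random and the sizes are arbitrary choices): Z holds the N × K PC scores and the columns of W are the principal components.

# Shape bookkeeping for X ≈ Z W^T: a sketch with scikit-learn on random data.
import numpy as np
from sklearn.decomposition import PCA

N, M, K = 100, 1000, 5                 # samples, features (e.g. SNPs), components
X = np.random.default_rng(1).normal(size=(N, M))

pca = PCA(n_components=K)
Z = pca.fit_transform(X)               # N x K  (PC scores)
W = pca.components_.T                  # M x K  (principal components as columns)

X_hat = Z @ W.T + pca.mean_            # rank-K reconstruction of X
print(Z.shape, W.shape, X_hat.shape)   # (100, 5) (1000, 5) (100, 1000)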

Outline

Dimensionality reduction

Linear Algebra background

PCA
Practical issues
Probabilistic PCA

Admixture models

Population structure and GWAS

PCA and admixture models Linear Algebra background 10 / 57

Covariance matrix

C = \frac{1}{N} X^T X

• Generalizes to many features.

• C_{i,i}: variance of feature i.

• C_{i,j}: covariance of features i and j.

• Symmetric.

PCA and admixture models Linear Algebra background 11 / 57

Covariance matrix

C = \frac{1}{N} X^T X

• Positive semi-definite (PSD). Sometimes written C ⪰ 0.

(Positive semi-definite matrix) A matrix A ∈ R^{n×n} is positive semi-definite iff v^T A v ≥ 0 for all v ∈ R^n.

PCA and admixture models Linear Algebra background 11 / 57

Covariance matrix

C = \frac{1}{N} X^T X

• Positive semi-definite (PSD). Sometimes written C ⪰ 0.

v^T C v ∝ v^T X^T X v = (Xv)^T (Xv) = \sum_{i=1}^{N} (Xv)_i^2 ≥ 0

PCA and admixture models Linear Algebra background 11 / 57

Covariance matrix

C = \frac{1}{N} X^T X

• All covariance matrices (being symmetric and PSD) have an eigendecomposition.

PCA and admixture models Linear Algebra background 11 / 57
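A quick numerical sketch of the two facts above (random, centered data; illustrative only): C = X^T X / N is symmetric, and v^T C v = \|Xv\|^2 / N ≥ 0, so C is PSD.

# Sketch: build the covariance matrix C = X^T X / N for centered data and
# check symmetry and positive semi-definiteness numerically.
import numpy as np

rng = np.random.default_rng(2)
N, M = 200, 10
X = rng.normal(size=(N, M))
X -= X.mean(axis=0)                    # center so C is the covariance

C = X.T @ X / N
print(np.allclose(C, C.T))             # symmetric

# v^T C v >= 0 for any v (PSD), since v^T C v = ||Xv||^2 / N
for _ in range(5):
    v = rng.normal(size=M)
    print(v @ C @ v >= -1e-12)

print(np.all(np.linalg.eigvalsh(C) >= -1e-12))   # all eigenvalues nonnegative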

Eigenvector and eigenvalue

(Eigenvector and eigenvalue) A nonzero vector v is an eigenvector of A ∈ R^{n×n} if Av = λv; the scalar λ is the eigenvalue associated with v.

PCA and admixture models Linear Algebra background 12 / 57

Eigendecomposition of a covariance matrix

• C is symmetric ⇒ its eigenvectors {u_i}, i ∈ {1, . . . , M}, can be chosen to be orthonormal:

• u_i^T u_j = 0, i ≠ j

• u_i^T u_i = 1

• We can order the eigenvectors so that the eigenvalues are in decreasing order: λ_1 ≥ λ_2 ≥ . . . ≥ λ_M.

PCA and admixture models Linear Algebra background 13 / 57

Eigendecomposition of a covariance matrix

C u_i = λ_i u_i,   i ∈ {1, . . . , M}

Arrange U = [u_1 . . . u_M]. Then

CU = C [u_1 . . . u_M]
   = [C u_1 . . . C u_M]
   = [λ_1 u_1 . . . λ_M u_M]
   = [u_1 . . . u_M] diag(λ_1, . . . , λ_M)
   = U Λ

PCA and admixture models Linear Algebra background 13 / 57

Eigendecomposition of a covariance matrix

CU = UΛ

Now U is an orthogonal matrix, so U U^T = I_M.

C = C U U^T = (CU) U^T = U Λ U^T

PCA and admixture models Linear Algebra background 14 / 57

Eigendecomposition of a covariance matrix

C = U Λ U^T

• U is an M × M orthogonal matrix. Its columns are the eigenvectors, sorted by eigenvalue.

• Λ is a diagonal matrix of eigenvalues.

PCA and admixture models Linear Algebra background 14 / 57
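A sketch verifying the eigendecomposition numerically with numpy (any symmetric PSD matrix works; the one below is random): the columns of U are orthonormal and C = U Λ U^T, with the eigenvalues re-sorted into decreasing order as in the slides.

# Sketch: eigendecomposition C = U Λ U^T, eigenvalues in decreasing order.
import numpy as np

rng = np.random.default_rng(3)
A = rng.normal(size=(50, 8))
C = A.T @ A / 50                                # symmetric PSD matrix

evals, U = np.linalg.eigh(C)                    # ascending eigenvalues
order = np.argsort(evals)[::-1]                 # re-sort: λ1 >= λ2 >= ... >= λM
evals, U = evals[order], U[:, order]
Lam = np.diag(evals)

print(np.allclose(U.T @ U, np.eye(8)))          # columns are orthonormal
print(np.allclose(C, U @ Lam @ U.T))            # C = U Λ U^T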

Eigendecomposition: Example

Covariance matrix: Ψ

PCA and admixture models Linear Algebra background 15 / 57


Alternate characterization of eigenvectors

• Eigenvectors are orthonormal directions of maximum variance

• Eigenvalues are the variance in these directions.

• The first eigenvector is the direction of maximum variance, with variance λ_1.

PCA and admixture models Linear Algebra background 16 / 57

Alternate characterization of eigenvectors

Given a covariance matrix C ∈ R^{M×M}, solve

x^* = \arg\max_{x} x^T C x   subject to   \|x\|_2 = 1

Solution: x^* = u_1, the first eigenvector of C.

• Example of a constrained optimization problem.

• Why do we need the constraint? (See the derivation below.)

PCA and admixture models Linear Algebra background 16 / 57
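Why the constraint is needed, and why the solution is an eigenvector: a short Lagrange-multiplier derivation, filling in the step the slide leaves implicit. Without \|x\|_2 = 1 the objective x^T C x could be made arbitrarily large by scaling x; with it, introduce a multiplier λ:

\mathcal{L}(x, λ) = x^T C x − λ (x^T x − 1)

∇_x \mathcal{L} = 2 C x − 2 λ x = 0 ⇒ C x = λ x

So every stationary point is an eigenvector of C, with objective value x^T C x = λ x^T x = λ; the maximum is attained at the largest eigenvalue, i.e. x^* = u_1 with value λ_1.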

Outline

Dimensionality reduction

Linear Algebra background

PCA
Practical issues
Probabilistic PCA

Admixture models

Population structure and GWAS

PCA and admixture models PCA 17 / 57

Back to PCA

Given N data points x_n ∈ R^M, n ∈ {1, . . . , N}, find a linear transformation to a lower-dimensional space (K < M): W ∈ R^{M×K} and projections z_n ∈ R^K, so that we can reconstruct the original data from the lower-dimensional projection.

x_n ≈ w_1 z_{n,1} + . . . + w_K z_{n,K}
    = [w_1 . . . w_K] (z_{n,1}, . . . , z_{n,K})^T
    = W z_n,   z_n ∈ R^K

• We assume the data is centered: \sum_n x_{n,m} = 0 for every feature m.

Compression
• We go from storing N × M numbers to M × K + N × K. (For example, with N = 1000 individuals, M = 500K SNPs, and K = 10: from 5 × 10^8 numbers to roughly 5 × 10^6.)

How do we define quality of reconstruction?

PCA and admixture models PCA 18 / 57

PCA

• Find z_n ∈ R^K and W ∈ R^{M×K} to minimize the reconstruction error

J(W, Z) = \frac{1}{N} \sum_n \|x_n − W z_n\|_2^2,   Z = [z_1, . . . , z_N]^T

• Require the columns of W to be orthonormal.

• The optimal solution is obtained by setting W = U_K, where U_K contains the K eigenvectors associated with the K largest eigenvalues of the covariance matrix C of X.

• The low-dimensional projection is z_n = W^T x_n.

PCA and admixture models PCA 19 / 57
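Putting the recipe together, a from-scratch PCA sketch in Python (the function name pca and the simulated data are illustrative, not from the course): center X, form C, take the top-K eigenvectors as W, and project with z_n = W^T x_n.

# Sketch of PCA as described above: center X, form C = X^T X / N,
# take the top-K eigenvectors as W, and project with Z = X W.
import numpy as np

def pca(X, K):
    """Return (Z, W, mean): N x K scores, M x K components, feature means."""
    mean = X.mean(axis=0)
    Xc = X - mean                              # centered data, N x M
    C = Xc.T @ Xc / Xc.shape[0]                # covariance, M x M
    evals, evecs = np.linalg.eigh(C)           # ascending eigenvalues
    W = evecs[:, ::-1][:, :K]                  # top-K eigenvectors, M x K
    Z = Xc @ W                                 # PC scores, z_n = W^T x_n
    return Z, W, mean

# Illustrative data with low-dimensional structure plus noise.
rng = np.random.default_rng(4)
N, M, K = 300, 50, 2
Z_true = rng.normal(size=(N, K))
W_true = rng.normal(size=(M, K))
X = Z_true @ W_true.T + 0.1 * rng.normal(size=(N, M))

Z, W, mean = pca(X, K)
X_hat = Z @ W.T + mean
print(np.mean(np.sum((X - X_hat) ** 2, axis=1)))   # small reconstruction error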


PCA: K = 1

J(w_1, z_1) = \frac{1}{N} \sum_n \|x_n − w_1 z_{n,1}\|_2^2

            = \frac{1}{N} \sum_n (x_n − w_1 z_{n,1})^T (x_n − w_1 z_{n,1})

            = \frac{1}{N} \sum_n (x_n^T x_n − 2 w_1^T x_n z_{n,1} + z_{n,1}^2 w_1^T w_1)

            = const + \frac{1}{N} \sum_n (−2 w_1^T x_n z_{n,1} + z_{n,1}^2)

using w_1^T w_1 = 1. To minimize this function, set the derivative with respect to z_{n,1} to zero:

\frac{∂ J(w_1, z_1)}{∂ z_{n,1}} = 0 ⇒ z_{n,1} = w_1^T x_n

PCA and admixture models PCA 20 / 57

PCA: K = 1

Plugging back z_{n,1} = w_1^T x_n:

J(w_1) = const + \frac{1}{N} \sum_n (−2 w_1^T x_n z_{n,1} + z_{n,1}^2)

       = const + \frac{1}{N} \sum_n (−2 z_{n,1}^2 + z_{n,1}^2)

       = const − \frac{1}{N} \sum_n z_{n,1}^2

Now, because the data is centered,

E[z_1] = \frac{1}{N} \sum_n z_{n,1} = \frac{1}{N} \sum_n w_1^T x_n = w_1^T \left( \frac{1}{N} \sum_n x_n \right) = 0

PCA and admixture models PCA 20 / 57

PCA: K = 1

J(w_1) = const − \frac{1}{N} \sum_n z_{n,1}^2

Var[z_1] = E[z_1^2] − E[z_1]^2 = \frac{1}{N} \sum_n z_{n,1}^2 − 0 = \frac{1}{N} \sum_n z_{n,1}^2

PCA and admixture models PCA 20 / 57

PCA: K = 1

Putting it together:

J(w_1) = const − \frac{1}{N} \sum_n z_{n,1}^2,   Var[z_1] = \frac{1}{N} \sum_n z_{n,1}^2

We have

J(w_1) = const − Var[z_1]

Two views of PCA: finding the direction that minimizes the reconstruction error ≡ finding the direction that maximizes the variance of the projected data.

\arg\min_{w_1} J(w_1) = \arg\max_{w_1} Var[z_1]

PCA and admixture models PCA 20 / 57

PCA: K = 1

\arg\min_{w_1} J(w_1) = \arg\max_{w_1} Var[z_1]

Var[z_1] = \frac{1}{N} \sum_n z_{n,1}^2
         = \frac{1}{N} \sum_n (w_1^T x_n)(w_1^T x_n)
         = \frac{1}{N} \sum_n w_1^T x_n x_n^T w_1
         = w_1^T \left( \frac{1}{N} \sum_n x_n x_n^T \right) w_1
         = w_1^T C w_1

PCA and admixture models PCA 21 / 57

PCA: K = 1

\arg\min_{w_1} J(w_1) = \arg\max_{w_1} Var[z_1]

So we need to solve

\arg\max_{w_1} w_1^T C w_1

Since we required the columns of W to be orthonormal, we add the constraint \|w_1\|_2 = 1.

This objective is maximized when w_1 is the first eigenvector of C.

PCA and admixture models PCA 21 / 57
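A small numerical check of this claim (illustrative random data): the variance along the top eigenvector equals λ_1, and no random unit direction exceeds it.

# Sketch: the top eigenvector u1 attains the largest value of w^T C w among
# unit vectors; compare against random unit directions.
import numpy as np

rng = np.random.default_rng(5)
A = rng.normal(size=(500, 20))
C = A.T @ A / 500

evals, evecs = np.linalg.eigh(C)
u1, lam1 = evecs[:, -1], evals[-1]
print(u1 @ C @ u1, lam1)                       # equal: variance along u1 is λ1

best_random = 0.0
for _ in range(10000):
    w = rng.normal(size=20)
    w /= np.linalg.norm(w)                     # enforce the ||w||_2 = 1 constraint
    best_random = max(best_random, w @ C @ w)
print(best_random <= lam1 + 1e-12)             # never exceeds λ1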

PCA: K > 1

• We can repeat the argument for K > 1.

• Since we require the directions w_k to be orthonormal, we can repeat the argument, each time searching for the direction that maximizes the remaining variance and is orthogonal to the previously selected directions.

PCA and admixture models PCA 22 / 57

Computing eigendecompositions

• Numerical algorithms compute all eigenvalues and eigenvectors in O(M^3).

• Infeasible for genetic datasets.

• Computing only the largest eigenvalue/eigenvector: power iteration, O(M^2) per iteration (see the sketch below).

• Since we are interested in covariance matrices, we can instead compute the singular-value decomposition (SVD) of X: O(MN^2). (Will discuss later.)

PCA and admixture models PCA 23 / 57
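A sketch of power iteration in Python (the initialization, tolerance, and iteration count are arbitrary choices): each step is one matrix-vector product, O(M^2) for a dense M × M covariance matrix, and the Rayleigh quotient gives the leading eigenvalue.

# Sketch of power iteration for the leading eigenvector/eigenvalue of a PSD
# covariance matrix.
import numpy as np

def power_iteration(C, num_iters=1000, tol=1e-10):
    v = np.random.default_rng(6).normal(size=C.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(num_iters):
        v_new = C @ v                     # one O(M^2) matrix-vector product
        v_new /= np.linalg.norm(v_new)
        if np.linalg.norm(v_new - v) < tol:
            break
        v = v_new
    lam = v @ C @ v                       # Rayleigh quotient = leading eigenvalue
    return lam, v

A = np.random.default_rng(7).normal(size=(400, 30))
C = A.T @ A / 400
lam, v = power_iteration(C)
print(np.isclose(lam, np.linalg.eigvalsh(C)[-1]))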

Practical issues

Choosing K

• For visualization, K = 2 or K = 3.

• For other analyses, pick K so that most of the variance in the data is retained. The fraction of variance retained by the top K eigenvectors is \frac{\sum_{k=1}^{K} λ_k}{\sum_{m=1}^{M} λ_m} (see the sketch below).

PCA and admixture models PCA 24 / 57
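A sketch of this rule in Python (the 90% target and the helper name choose_k are arbitrary choices): sort the eigenvalues, take the cumulative fraction of variance, and return the smallest K reaching the target.

# Sketch: pick K as the smallest number of eigenvectors whose eigenvalues
# retain a target fraction of the total variance.
import numpy as np

def choose_k(eigenvalues, target=0.90):
    lam = np.sort(eigenvalues)[::-1]                 # λ1 >= λ2 >= ...
    frac = np.cumsum(lam) / np.sum(lam)              # fraction retained by top-k
    return int(np.searchsorted(frac, target) + 1)

A = np.random.default_rng(8).normal(size=(200, 40))
C = A.T @ A / 200
print(choose_k(np.linalg.eigvalsh(C), target=0.90))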

PCA: Example

PCA and admixture models PCA 25 / 57


PCA on HapMap

PCA and admixture models PCA 26 / 57

PCA on Human Genome Diversity Project

PCA and admixture models PCA 27 / 57


PCA on European genetic data

Novembre et al., Nature 2008.

PCA and admixture models PCA 28 / 57

Probabilistic interpretation of PCA

z_n \overset{iid}{\sim} N(0, I_K)

p(x_n | z_n) = N(W z_n, σ^2 I_M)

PCA and admixture models PCA 29 / 57

Probabilistic interpretation of PCA

z_n \overset{iid}{\sim} N(0, I_K)

p(x_n | z_n) = N(W z_n, σ^2 I_M)

E[x_n | z_n] = W z_n

E[x_n] = E[E[x_n | z_n]] = E[W z_n] = W E[z_n] = 0

PCA and admixture models PCA 29 / 57

Probabilistic interpretation of PCA

z_n \overset{iid}{\sim} N(0, I_K)

p(x_n | z_n) = N(W z_n, σ^2 I_M)

Equivalently, x_n = W z_n + ε_n with ε_n ~ N(0, σ^2 I_M) independent of z_n.

Cov[x_n] = E[x_n x_n^T] − E[x_n] E[x_n]^T
         = E[(W z_n + ε_n)(W z_n + ε_n)^T] − 0
         = E[W z_n z_n^T W^T + 2 W z_n ε_n^T + ε_n ε_n^T]
         = E[W z_n z_n^T W^T] + E[2 W z_n ε_n^T] + E[ε_n ε_n^T]
         = W E[z_n z_n^T] W^T + 2 W E[z_n] E[ε_n]^T + σ^2 I_M
         = W I_K W^T + 2 W · 0 + σ^2 I_M
         = W W^T + σ^2 I_M

PCA and admixture models PCA 29 / 57
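A simulation sketch of the model above (all sizes and parameter values are illustrative): sample z_n and ε_n, form x_n = W z_n + ε_n, and check that the sample covariance approaches W W^T + σ^2 I_M.

# Sketch: sample from the probabilistic PCA model and compare the sample
# covariance of x_n with W W^T + σ^2 I_M.
import numpy as np

rng = np.random.default_rng(9)
M, K, sigma2, N = 6, 2, 0.25, 200000
W = rng.normal(size=(M, K))

Z = rng.normal(size=(N, K))                    # z_n ~ N(0, I_K)
E = np.sqrt(sigma2) * rng.normal(size=(N, M))  # ε_n ~ N(0, σ^2 I_M)
X = Z @ W.T + E                                # x_n = W z_n + ε_n

C_model = W @ W.T + sigma2 * np.eye(M)
C_sample = X.T @ X / N
print(np.max(np.abs(C_sample - C_model)))      # small for large N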

Probabilistic PCA

Log likelihood

LL(W, σ^2) ≡ log P(D | W, σ^2)

Maximize over W and σ^2; to fix the rotation ambiguity, take the columns of W to be orthogonal. The maximum likelihood estimators are

W_{ML} = U_K (Λ_K − σ^2 I_K)^{1/2}

U_K = [u_1 . . . u_K],   Λ_K = diag(λ_1, . . . , λ_K)

σ^2_{ML} = \frac{1}{M − K} \sum_{j=K+1}^{M} λ_j

PCA and admixture models PCA 30 / 57
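A sketch of these closed-form estimates in Python (the helper name ppca_mle is an arbitrary choice): take the eigendecomposition of C, set σ^2_ML to the mean of the discarded eigenvalues, and scale the top-K eigenvectors by \sqrt{λ_k − σ^2_ML}.

# Sketch of the closed-form ML estimates: W_ML = U_K (Λ_K − σ^2 I_K)^{1/2}
# and σ^2_ML = mean of the M − K smallest eigenvalues of C.
import numpy as np

def ppca_mle(X, K):
    Xc = X - X.mean(axis=0)
    C = Xc.T @ Xc / Xc.shape[0]
    evals, evecs = np.linalg.eigh(C)
    evals, evecs = evals[::-1], evecs[:, ::-1]           # decreasing eigenvalues
    sigma2 = evals[K:].mean()                            # average discarded variance
    W = evecs[:, :K] * np.sqrt(np.maximum(evals[:K] - sigma2, 0.0))
    return W, sigma2

rng = np.random.default_rng(10)
X = rng.normal(size=(500, 8)) @ rng.normal(size=(8, 8))
W_ml, s2_ml = ppca_mle(X, K=3)
print(W_ml.shape, s2_ml)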


Probabilistic PCA

Computing the MLE

• Compute the eigenvalues and eigenvectors of C and plug them into the closed-form estimators above, or

• Treat it as a hidden/latent variable problem and use EM (see the sketch below).

PCA and admixture models PCA 31 / 57
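A sketch of the EM route, using one standard form of the updates for this model (posterior moments of z_n in the E-step, then W and σ^2 in the M-step). Initialization, the fixed iteration count, the absence of a convergence check, and the helper name ppca_em are all arbitrary choices here.

# EM sketch for probabilistic PCA.
import numpy as np

def ppca_em(X, K, num_iters=200, seed=0):
    Xc = X - X.mean(axis=0)
    N, M = Xc.shape
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(M, K))
    sigma2 = 1.0
    for _ in range(num_iters):
        # E-step: posterior moments of z_n given x_n
        G = W.T @ W + sigma2 * np.eye(K)          # K x K
        Ginv = np.linalg.inv(G)
        Ez = Xc @ W @ Ginv                        # N x K, E[z_n | x_n]
        Ezz = N * sigma2 * Ginv + Ez.T @ Ez       # Σ_n E[z_n z_n^T | x_n]
        # M-step: update W, then σ^2 using the new W
        W = (Xc.T @ Ez) @ np.linalg.inv(Ezz)
        sigma2 = (np.sum(Xc ** 2)
                  - 2.0 * np.sum(Ez * (Xc @ W))
                  + np.trace(Ezz @ W.T @ W)) / (N * M)
    return W, sigma2

X = np.random.default_rng(11).normal(size=(400, 10))
W_em, s2_em = ppca_em(X, K=2)
print(W_em.shape, s2_em)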


Other advantages of Probabilistic PCA

Can use model selection to infer K.

• Choose K to maximize the marginal likelihood P(D | K).

• Use cross-validation and pick the K that maximizes the likelihood on held-out data (see the sketch below).

• Other model selection criteria such as AIC or BIC (see lecture 6 on clustering).

PCA and admixture models PCA 32 / 57
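A sketch of the cross-validation idea (it reuses the ppca_mle helper from the earlier sketch; the train/test split and the range of K are arbitrary choices): fit on a training split, then score each K by the held-out log likelihood under x_n ~ N(μ, W W^T + σ^2 I_M).

# Sketch: choose K by held-out log-likelihood under the probabilistic PCA model.
import numpy as np
from scipy.stats import multivariate_normal

def heldout_loglik(X_train, X_test, K):
    mu = X_train.mean(axis=0)
    W, sigma2 = ppca_mle(X_train, K)                 # from the earlier sketch
    cov = W @ W.T + sigma2 * np.eye(X_train.shape[1])
    return multivariate_normal(mean=mu, cov=cov).logpdf(X_test).sum()

rng = np.random.default_rng(12)
X = rng.normal(size=(600, 2)) @ rng.normal(size=(2, 12)) + \
    0.3 * rng.normal(size=(600, 12))
X_train, X_test = X[:400], X[400:]
scores = {K: heldout_loglik(X_train, X_test, K) for K in range(1, 6)}
print(max(scores, key=scores.get))                   # K with highest held-out LL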

Mini-Summary

• Dimensionality reduction: linear methods.
  • Exploratory analysis and visualization.
  • Downstream inference: the low-dimensional features can be used for other tasks.

• Principal Components Analysis finds a linear subspace that minimizes the reconstruction error or, equivalently, maximizes the variance.
  • Eigenvalue problem.
  • The probabilistic interpretation also leads to an EM algorithm.

• Why may PCA not be appropriate for genetic data?

PCA and admixture models PCA 33 / 57