PCA and admixture models
CM226: Machine Learning for Bioinformatics
Fall 2016
Sriram Sankararaman
Acknowledgments: Fei Sha, Ameet Talwalkar, Alkes Price
PCA and admixture models 1 / 57
Announcements
• HW1 solutions posted.
PCA and admixture models 2 / 57
Supervised versus Unsupervised Learning
Unsupervised Learning from unlabeled observations
• Dimensionality Reduction. Last class.
• Other latent variable models. This class + review of PCA.
PCA and admixture models 3 / 57
Outline
Dimensionality reduction
Linear Algebra background
PCA
Practical issues
Probabilistic PCA
Admixture models
Population structure and GWAS
PCA and admixture models Dimensionality reduction 4 / 57
Raw data can be complex, high-dimensional
• If we knew what to measure, we could find simple relationships.
• Signals have redundancy.
• Genotype measured at ≈ 500K SNPs.
• Genotypes at neighboring SNPs correlated.
PCA and admixture models Dimensionality reduction 5 / 57
Dimensionality reduction
Goal: Find a “more compact” representation of data. Why?
• Visualize and discover hidden patterns.
• Preprocessing for a supervised learning problem.
• Statistical: remove noise.
• Computational: reduce wasteful computation.
PCA and admixture models Dimensionality reduction 6 / 57
An example
• We measure parents’ and offspring heights.
• Two measurements: points in R2.
• How can we find a more “compact” representation?
• Two measurements are correlated with some noise.
• Pick a direction and project.
PCA and admixture models Dimensionality reduction 7 / 57
Goal: Minimize reconstruction error
• Find the projection that minimizes the Euclidean distance between the original points and their projections.
• Principal Components Analysis solves this problem!
PCA and admixture models Dimensionality reduction 8 / 57
Principal Components Analysis
PCA: find lower dimensional representation of data
• Choose K.
• X is the N × M raw data matrix.
• X ≈ ZW^T, where Z is the N × K reduced representation (PC scores).
• W is the M × K matrix whose columns are the principal components.
PCA and admixture models Dimensionality reduction 9 / 57
Outline
Dimensionality reduction
Linear Algebra background
PCA
Practical issues
Probabilistic PCA
Admixture models
Population structure and GWAS
PCA and admixture models Linear Algebra background 10 / 57
Covariance matrix
C = (1/N) X^T X
• Generalizes to many features
• Ci,i: variance of feature i
• Ci,j : covariance of feature i and j
• Symmetric
PCA and admixture models Linear Algebra background 11 / 57
Covariance matrix
C = (1/N) X^T X
• Positive semi-definite (PSD). Sometimes written C ⪰ 0.
(Positive semi-definite matrix) A matrix A ∈ R^{n×n} is positive semi-definite iff v^T A v ≥ 0 for all v ∈ R^n.
PCA and admixture models Linear Algebra background 11 / 57
Covariance matrix
C = (1/N) X^T X
• Positive semi-definite (PSD). Sometimes written C ⪰ 0.
v^T C v ∝ v^T X^T X v = (Xv)^T (Xv) = ∑_{i=1}^{N} (Xv)_i² ≥ 0
PCA and admixture models Linear Algebra background 11 / 57
Covariance matrix
C = (1/N) X^T X
• All covariance matrices (being symmetric and PSD) have an eigendecomposition.
PCA and admixture models Linear Algebra background 11 / 57
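As a sanity check, the symmetry and PSD properties above can be verified numerically. A minimal NumPy sketch (the data and dimensions are illustrative, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))          # N = 50 samples, M = 4 features
X = X - X.mean(axis=0)                # center each feature
C = X.T @ X / X.shape[0]              # C = (1/N) X^T X, an M x M covariance matrix

assert np.allclose(C, C.T)            # symmetric
for _ in range(100):
    v = rng.normal(size=4)
    assert v @ C @ v >= -1e-12        # v^T C v >= 0, up to numerical error
```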
Eigenvector and eigenvalue
(Eigenvector and eigenvalue) A vector v is an eigenvector of A ∈ R^{n×n} if Av = λv for some scalar λ; λ is the eigenvalue associated with v.
PCA and admixture models Linear Algebra background 12 / 57
Eigendecomposition of a covariance matrix
• C is symmetric ⇒ its eigenvectors {u_i}, i ∈ {1, . . . , M}, can be chosen to be orthonormal:
• u_i^T u_j = 0, i ≠ j
• u_i^T u_i = 1
• We can choose the eigenvectors so that the eigenvalues are in decreasing order: λ_1 ≥ λ_2 ≥ . . . ≥ λ_M.
PCA and admixture models Linear Algebra background 13 / 57
Eigendecomposition of a covariance matrix
Cu_i = λ_i u_i, i ∈ {1, . . . , M}
Arrange U = [u_1 . . . u_M]. Then
CU = C[u_1 . . . u_M]
   = [Cu_1 . . . Cu_M]
   = [λ_1 u_1 . . . λ_M u_M]
   = [u_1 . . . u_M] diag(λ_1, . . . , λ_M)
   = UΛ
PCA and admixture models Linear Algebra background 13 / 57
Eigendecomposition of a covariance matrix
CU = UΛ
Since U is an orthogonal matrix, UU^T = I_M, so
C = CUU^T = UΛU^T
PCA and admixture models Linear Algebra background 14 / 57
Eigendecomposition of a covariance matrix
C = UΛUT
• U is an M × M orthogonal matrix whose columns are eigenvectors sorted by eigenvalue.
• Λ is a diagonal matrix of eigenvalues.
PCA and admixture models Linear Algebra background 14 / 57
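The decomposition C = UΛU^T can be computed directly with NumPy; a short sketch (random data as a stand-in for a real covariance matrix):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X = X - X.mean(axis=0)
C = X.T @ X / X.shape[0]                  # symmetric PSD covariance matrix

eigvals, U = np.linalg.eigh(C)            # eigh: for symmetric matrices; ascending order
order = np.argsort(eigvals)[::-1]         # re-sort in decreasing order
eigvals, U = eigvals[order], U[:, order]

assert np.allclose(U.T @ U, np.eye(3))                 # orthonormal eigenvectors
assert np.allclose(C, U @ np.diag(eigvals) @ U.T)      # C = U Lambda U^T
```

Note that `eigh` (for symmetric/Hermitian matrices) returns eigenvalues in ascending order, hence the re-sort.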
Eigendecomposition: Example
Covariance matrix : Ψ
PCA and admixture models Linear Algebra background 15 / 57
Alternate characterization of eigenvectors
• Eigenvectors are orthonormal directions of maximum variance
• Eigenvalues are the variance in these directions.
• The first eigenvector is the direction of maximum variance, with variance λ_1.
PCA and admixture models Linear Algebra background 16 / 57
Alternate characterization of eigenvectors
Given a covariance matrix C ∈ R^{M×M},
x* = argmax_x x^T C x subject to ‖x‖_2 = 1
Solution: x* = u_1, the first eigenvector of C.
• Example of a constrained optimization problem.
• Why do we need the constraint? Without it the objective is unbounded: scaling x by c scales x^T C x by c².
PCA and admixture models Linear Algebra background 16 / 57
Outline
Dimensionality reduction
Linear Algebra background
PCA
Practical issues
Probabilistic PCA
Admixture models
Population structure and GWAS
PCA and admixture models PCA 17 / 57
Back to PCA
Given N data points x_n ∈ R^M, n ∈ {1, . . . , N}, find a linear transformation to a lower dimensional space K < M: W ∈ R^{M×K} and a projection z_n ∈ R^K, so that we can reconstruct the original data from the lower dimensional projection.
x_n ≈ w_1 z_{n,1} + . . . + w_K z_{n,K}
    = [w_1 . . . w_K] [z_{n,1}, . . . , z_{n,K}]^T
    = W z_n,  z_n ∈ R^K
• We assume the data is centered: ∑_n x_{n,m} = 0 for each feature m.
• Compression: we go from storing N × M values to M × K + N × K.
How do we define the quality of the reconstruction?
PCA and admixture models PCA 18 / 57
PCA
• Find z_n ∈ R^K and W ∈ R^{M×K} to minimize the reconstruction error
  J(W, Z) = (1/N) ∑_n ‖x_n − W z_n‖²_2,  Z = [z_1, . . . , z_N]^T
• Require the columns of W to be orthonormal.
• The optimal solution is obtained by setting W = U_K, where U_K contains the K eigenvectors associated with the K largest eigenvalues of the covariance matrix C of X.
• The low-dimensional projection is z_n = W^T x_n.
PCA and admixture models PCA 19 / 57
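The recipe above (center, form the covariance, take the top-K eigenvectors, project) can be sketched in a few lines of NumPy; the function name and test data are illustrative:

```python
import numpy as np

def pca(X, K):
    """PCA via eigendecomposition of the covariance matrix.

    X : (N, M) data matrix; K : number of components.
    Returns W (M, K), columns = top-K principal components,
    and Z (N, K), the PC scores z_n = W^T x_n.
    """
    X = X - X.mean(axis=0)                 # center each feature
    C = X.T @ X / X.shape[0]               # M x M covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)   # ascending eigenvalues
    order = np.argsort(eigvals)[::-1]      # sort descending
    W = eigvecs[:, order[:K]]              # K eigenvectors with largest eigenvalues
    Z = X @ W                              # project: z_n = W^T x_n
    return W, Z
```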
PCA: K = 1
J(w_1, z_1) = (1/N) ∑_n ‖x_n − w_1 z_{n,1}‖²_2
            = (1/N) ∑_n (x_n − w_1 z_{n,1})^T (x_n − w_1 z_{n,1})
            = (1/N) ∑_n (x_n^T x_n − 2 w_1^T x_n z_{n,1} + z_{n,1}² w_1^T w_1)
            = const + (1/N) ∑_n (−2 w_1^T x_n z_{n,1} + z_{n,1}²)   (using w_1^T w_1 = 1)
To minimize this function, take derivatives with respect to z_{n,1}:
∂J(w_1, z_1) / ∂z_{n,1} = 0
⇒ z_{n,1} = w_1^T x_n
PCA and admixture models PCA 20 / 57
PCA: K = 1
Plugging back z_{n,1} = w_1^T x_n:
J(w_1) = const + (1/N) ∑_n (−2 w_1^T x_n z_{n,1} + z_{n,1}²)
       = const + (1/N) ∑_n (−2 z_{n,1} z_{n,1} + z_{n,1}²)
       = const − (1/N) ∑_n z_{n,1}²
Now, because the data is centered,
E[z_1] = (1/N) ∑_n z_{n,1}
       = (1/N) ∑_n w_1^T x_n
       = w_1^T (1/N) ∑_n x_n = 0
PCA and admixture models PCA 20 / 57
PCA: K = 1
J(w_1) = const − (1/N) ∑_n z_{n,1}²
Var[z_1] = E[z_1²] − E[z_1]²
         = (1/N) ∑_n z_{n,1}² − 0
         = (1/N) ∑_n z_{n,1}²
PCA and admixture models PCA 20 / 57
PCA: K = 1
Putting it together:
J(w_1) = const − (1/N) ∑_n z_{n,1}²
Var[z_1] = (1/N) ∑_n z_{n,1}²
We have
J(w_1) = const − Var[z_1]
Two views of PCA: finding the direction that minimizes the reconstruction error ≡ finding the direction that maximizes the variance of the projected data:
argmin_{w_1} J(w_1) = argmax_{w_1} Var[z_1]
PCA and admixture models PCA 20 / 57
PCA: K = 1
argmin_{w_1} J(w_1) = argmax_{w_1} Var[z_1]
Var[z_1] = (1/N) ∑_n z_{n,1}²
         = (1/N) ∑_n (w_1^T x_n)(w_1^T x_n)
         = (1/N) ∑_n w_1^T x_n x_n^T w_1
         = w_1^T ((1/N) ∑_n x_n x_n^T) w_1
         = w_1^T C w_1
PCA and admixture models PCA 21 / 57
PCA: K = 1
argmin_{w_1} J(w_1) = argmax_{w_1} Var[z_1]
So we need to solve
argmax_{w_1} w_1^T C w_1 subject to ‖w_1‖_2 = 1
(the constraint comes from requiring W to be orthonormal).
This objective function is maximized when w_1 is the first eigenvector of C.
PCA and admixture models PCA 21 / 57
PCA: K > 1
• We can repeat the argument for K > 1.
• Since we require the directions w_k to be orthonormal, we can repeat the argument, searching for the direction that maximizes the remaining variance and is orthogonal to the previously selected directions.
PCA and admixture models PCA 22 / 57
Computing eigendecompositions
• Numerical algorithms compute all eigenvalues and eigenvectors in O(M³).
• Infeasible for genetic datasets.
• Computing the largest eigenvalue and eigenvector: power iteration, O(M²) per iteration.
• Since we are interested in covariance matrices, we can use algorithms that compute the singular-value decomposition (SVD) in O(MN²). (Will discuss later.)
PCA and admixture models PCA 23 / 57
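Power iteration is simple enough to sketch directly; each step is one O(M²) matrix-vector product. A minimal version (function name and stopping rule are illustrative choices, not from the lecture):

```python
import numpy as np

def power_iteration(C, num_iters=1000, tol=1e-10):
    """Estimate the largest eigenvalue/eigenvector of a symmetric PSD matrix C."""
    v = np.random.default_rng(0).normal(size=C.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(num_iters):
        w = C @ v                          # one O(M^2) matrix-vector product
        w /= np.linalg.norm(w)             # renormalize
        if np.linalg.norm(w - v) < tol:    # stop when the direction stabilizes
            v = w
            break
        v = w
    lam = v @ C @ v                        # Rayleigh quotient = eigenvalue estimate
    return lam, v
```

Convergence is geometric with rate λ₂/λ₁, so it is fast when the top eigenvalue dominates.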
Practical issues
Choosing K
• For visualization, K = 2 or K = 3.
• For other analyses, pick K so that most of the variance in the data is retained. The fraction of variance retained in the top K eigenvectors is
  (∑_{k=1}^{K} λ_k) / (∑_{m=1}^{M} λ_m)
PCA and admixture models PCA 24 / 57
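The variance-retained rule can be implemented as a cumulative sum over the sorted eigenvalues; a small sketch (the function name and the 95% threshold are illustrative):

```python
import numpy as np

def choose_K(X, threshold=0.95):
    """Smallest K whose top-K eigenvalues retain `threshold` of total variance."""
    X = X - X.mean(axis=0)
    C = X.T @ X / X.shape[0]
    eigvals = np.linalg.eigvalsh(C)[::-1]        # eigenvalues, descending
    frac = np.cumsum(eigvals) / eigvals.sum()    # fraction retained by top K
    return int(np.searchsorted(frac, threshold) + 1)
```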
PCA: Example
PCA and admixture models PCA 25 / 57
PCA on HapMap
PCA and admixture models PCA 26 / 57
PCA on Human Genome Diversity Project
PCA and admixture models PCA 27 / 57
PCA on European genetic data
Novembre et al., Nature 2008.
PCA and admixture models PCA 28 / 57
Probabilistic interpretation of PCA
z_n ~ N(0, I_K), i.i.d.
p(x_n | z_n) = N(W z_n, σ² I_M)
PCA and admixture models PCA 29 / 57
Probabilistic interpretation of PCA
z_n ~ N(0, I_K), i.i.d.
p(x_n | z_n) = N(W z_n, σ² I_M)
E[x_n | z_n] = W z_n
E[x_n] = E[E[x_n | z_n]]
       = E[W z_n]
       = W E[z_n]
       = 0
PCA and admixture models PCA 29 / 57
Probabilistic interpretation of PCA
z_n ~ N(0, I_K), i.i.d.
p(x_n | z_n) = N(W z_n, σ² I_M)
Cov[x_n] = E[x_n x_n^T] − E[x_n] E[x_n]^T
         = E[(W z_n + ε_n)(W z_n + ε_n)^T] − 0
         = E[W z_n z_n^T W^T + 2 W z_n ε_n^T + ε_n ε_n^T]
         = E[W z_n z_n^T W^T] + E[2 W z_n ε_n^T] + E[ε_n ε_n^T]
         = W E[z_n z_n^T] W^T + 2 W E[z_n ε_n^T] + σ² I_M
         = W E[z_n z_n^T] W^T + 2 W E[z_n] E[ε_n]^T + σ² I_M
         = W I_K W^T + 2 W · 0 + σ² I_M
         = W W^T + σ² I_M
PCA and admixture models PCA 29 / 57
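The identity Cov[x_n] = WW^T + σ²I_M can be checked by simulating from the generative model; a sketch with illustrative dimensions (W is random here, not estimated):

```python
import numpy as np

# Simulate from the probabilistic PCA model and check that the empirical
# covariance of x_n approaches W W^T + sigma^2 I_M for large N.
rng = np.random.default_rng(0)
M, K, N, sigma = 4, 2, 200_000, 0.5
W = rng.normal(size=(M, K))
Z = rng.normal(size=(N, K))               # z_n ~ N(0, I_K)
E = sigma * rng.normal(size=(N, M))       # eps_n ~ N(0, sigma^2 I_M)
X = Z @ W.T + E                           # x_n | z_n ~ N(W z_n, sigma^2 I_M)

C_emp = X.T @ X / N                       # empirical covariance (E[x_n] = 0)
C_model = W @ W.T + sigma**2 * np.eye(M)  # model covariance from the derivation
```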
Probabilistic PCA
Log likelihood:
LL(W, σ²) ≡ log P(D | W, σ²)
Maximize over W subject to the constraint that the columns of W are orthonormal. The maximum likelihood estimator:
W_ML = U_K (Λ_K − σ² I_K)^{1/2}
U_K = [u_1 . . . u_K]
Λ_K = diag(λ_1, . . . , λ_K)
σ²_ML = (1/(M − K)) ∑_{j=K+1}^{M} λ_j
PCA and admixture models PCA 30 / 57
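The closed-form MLE above translates directly into code: eigendecompose C, average the discarded eigenvalues to get σ²_ML, and rescale the top-K eigenvectors. A sketch (function name illustrative):

```python
import numpy as np

def ppca_mle(X, K):
    """Closed-form maximum likelihood estimates for probabilistic PCA.

    Returns W_ML (M, K) and sigma2_ML, computed from the eigendecomposition
    of the covariance matrix of the centered data X (N, M).
    """
    X = X - X.mean(axis=0)
    M = X.shape[1]
    C = X.T @ X / X.shape[0]
    eigvals, eigvecs = np.linalg.eigh(C)
    order = np.argsort(eigvals)[::-1]
    lam, U = eigvals[order], eigvecs[:, order]       # descending eigenvalues
    sigma2 = lam[K:].sum() / (M - K)                 # average discarded eigenvalue
    W = U[:, :K] * np.sqrt(np.maximum(lam[:K] - sigma2, 0.0))
    return W, sigma2
```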
Probabilistic PCA
Computing the MLE
• Compute eigenvalues, eigenvectors
• Hidden/latent variable problem: Use EM
PCA and admixture models PCA 31 / 57
Other advantages of Probabilistic PCA
Can use model selection to infer K.
• Choose K to maximize the marginal likelihood P (D|K).
• Use cross-validation and pick the K that maximizes the likelihood on held-out data.
• Other model selection criteria such as AIC or BIC (see lecture 6 on clustering).
PCA and admixture models PCA 32 / 57
Mini-Summary
• Dimensionality reduction: linear methods
  • Exploratory analysis and visualization.
  • Downstream inference: can use the low-dimensional features for other tasks.
• Principal Components Analysis finds a linear subspace that minimizes reconstruction error or, equivalently, maximizes the variance.
  • Eigenvalue problem.
  • Probabilistic interpretation also leads to EM.
• Why might PCA not be appropriate for genetic data?
PCA and admixture models PCA 33 / 57