Principal Component Analysis in MD Simulation

Speaker: ZHOU Chen-Yang

Supervisor: Wu Yun-Dong

Methods to analyze MD trajectory

• Intuition-based coordinates
– RMSD with respect to the native state
– Fraction of native contacts
– Radius of gyration
– Other observables

• Advantages
– Easy to understand
– Convenient to compute

• Disadvantages
– Inaccurate
– Ineffective for non-native structures, or when no good reference structure is available
– Depend on prior knowledge

How to measure conformational change?

What we have to do:

• Reduce dimension
– The trajectory is too complicated
– A good projection should separate signal from noise

• Classification/Clustering
– Classify structures into different states

• Algorithms include:
– PCA: Principal Component Analysis
– MDS: Multi-Dimensional Scaling

If we already have an optimal reaction coordinate

Then we have: free energy landscape, transition pathway, transition rate ...

But usually we don't, and it doesn't emerge automatically
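Once some coordinate has been chosen, a free energy profile along it is commonly estimated from the sampled distribution as F(x) = -kB T ln P(x). The block below is a minimal sketch of that estimate. The sample data, the bin count and the kcal/mol units are assumptions made for illustration, not values from the talk.

```python
import numpy as np

KB = 0.0019872041  # Boltzmann constant in kcal/(mol*K); the unit choice is an assumption


def free_energy_profile(coord, temperature=300.0, bins=50):
    """Estimate F(x) = -kB*T*ln P(x) from samples of a 1D coordinate."""
    hist, edges = np.histogram(coord, bins=bins, density=True)
    centers = 0.5 * (edges[:-1] + edges[1:])
    with np.errstate(divide="ignore"):
        free_energy = -KB * temperature * np.log(hist)  # empty bins become +inf
    # Shift so that the most populated bin sits at zero.
    free_energy -= free_energy[np.isfinite(free_energy)].min()
    return centers, free_energy


# Hypothetical coordinate with a major state near -1.0 and a minor state near 1.5.
rng = np.random.default_rng(0)
coord = np.concatenate([rng.normal(-1.0, 0.3, 8000), rng.normal(1.5, 0.3, 2000)])
centers, free_energy = free_energy_profile(coord)
```

The same histogram approach extends to two coordinates (for example two PCs) with `np.histogram2d`.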

dPCA vs RMSD

The figure represents the free energy landscape of Trp-zip2 at 300K, using Amber force field 99sb*-ildn. Projected to 2nd principal component and RMSD.

General description of PCA

• The central idea of PCA is to:
– reduce the dimension
– retain the variation

• An example:
– (x,y) is a randomly generated dataset
• var(x) = 3.2, var(y) = 2.3
– The points are a mixture centered either at (0,0) or at (3,3)
– PCA generates new coordinates (x',y'), and x' captures most of the variation
• var(x') = 5.5, var(y') = 0.99
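A toy version of this example can be sketched with NumPy. The data below are an assumption (an equal mixture of unit-variance Gaussians around (0,0) and (3,3)), not the dataset behind the slide, so the numbers will differ from those quoted above. The rotation onto the eigenvectors of the covariance matrix concentrates the variance in the first new coordinate while leaving the total variance unchanged.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000
# Hypothetical data: each point is drawn around (0, 0) or around (3, 3).
centers = np.where(rng.random((n, 1)) < 0.5, 0.0, 3.0)
data = centers + rng.normal(0.0, 1.0, size=(n, 2))

# PCA: diagonalize the covariance matrix of the centered data.
centered = data - data.mean(axis=0)
cov = np.cov(centered, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)          # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]               # reorder from largest to smallest
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
projected = centered @ eigvecs                  # new coordinates (x', y')

var_original = data.var(axis=0, ddof=1)         # var(x), var(y)
var_projected = projected.var(axis=0, ddof=1)   # var(x'), var(y')
```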

Key question in understanding PCA

• In practice, the principal components (PCs) are linear combinations of the original coordinates.

• Suppose we have a data set with 2 columns, X1 and X2. We generate a new column Z=a1X1+a2X2. What is the variance of Z?

Variance and covariance

Example: Z=X1+X2

var(Z) = var(X1) + var(X2) + 2cov(X1,X2)

Why is it important? Because we are going to project the data set onto a new coordinate Z, and our aim is to choose α = (a1, a2) to maximize the variance of Z.

Z=a1X1+a2X2:

var(Z) = a1²var(X1) + a2²var(X2) + 2a1a2cov(X1,X2)

Represented with matrix multiplication:

Covariance matrix: Σ

Coefficients of the original coordinates in the PC: α

var(Z)=var(α'X)=α'Σα
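The identity var(Z) = α'Σα can be checked numerically. The sketch below uses made-up correlated columns X1 and X2 (an assumption for illustration) and α = (1, 1), i.e. Z = X1 + X2. It compares the variance computed directly from Z with the quadratic form and with the expanded sum of variances and covariance.

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical correlated columns X1 and X2.
x1 = rng.normal(size=5000)
x2 = 0.6 * x1 + rng.normal(scale=0.8, size=5000)
X = np.column_stack([x1, x2])

alpha = np.array([1.0, 1.0])        # coefficients (a1, a2), here Z = X1 + X2
sigma = np.cov(X, rowvar=False)     # 2x2 covariance matrix
z = X @ alpha                       # the new column Z

var_direct = z.var(ddof=1)                                   # variance of Z itself
var_quadratic = alpha @ sigma @ alpha                        # alpha' Sigma alpha
var_expanded = sigma[0, 0] + sigma[1, 1] + 2 * sigma[0, 1]   # var(X1)+var(X2)+2cov(X1,X2)
```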

Next step: vary α to search for the maximum of var(Z)

Z=X1+X2:

Maximize var(Z)

First, we have to normalize α:

α'α = 1

Then, maximizing var(Z) means maximizing

α'Σα − λ(α'α − 1)

Differentiate with respect to α and set the result to zero:

Σα − λα = 0

λ is the eigenvalue and α is the corresponding eigenvector of Σ
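The result of this derivation can be illustrated numerically in two dimensions: the unit vector that maximizes α'Σα should coincide with the eigenvector of the largest eigenvalue, and the maximum should equal that eigenvalue. The sketch below assumes a made-up covariance structure and compares a brute-force scan over unit vectors with `np.linalg.eigh`.

```python
import numpy as np

rng = np.random.default_rng(3)
# Hypothetical 2D data with correlated columns.
X = rng.multivariate_normal([0.0, 0.0], [[3.0, 1.5], [1.5, 2.0]], size=4000)
sigma = np.cov(X, rowvar=False)

# Eigen-decomposition: eigh returns eigenvalues in ascending order.
eigvals, eigvecs = np.linalg.eigh(sigma)
lam, alpha = eigvals[-1], eigvecs[:, -1]   # largest eigenvalue and its eigenvector

# Brute force: evaluate u' Sigma u for unit vectors u = (cos t, sin t).
angles = np.linspace(0.0, np.pi, 1801)
units = np.column_stack([np.cos(angles), np.sin(angles)])
variances = np.einsum("ij,jk,ik->i", units, sigma, units)
best = units[np.argmax(variances)]         # direction of maximum variance on the grid
```

Up to sign and grid resolution, `best` matches `alpha`, and `variances.max()` approaches `lam`.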

Eigenvalues plotted from largest to smallest

Pick the first several eigenvectors as PCs (strictly, as the coefficients of the PCs). Then project the data onto the PCs; the simplified data can be further analyzed with other techniques such as clustering.
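The selection and projection step can be sketched as follows. The helper `pca_project` and the synthetic 10-dimensional data are hypothetical, introduced only for illustration. The sorted eigenvalues are what one would plot to decide how many PCs to keep, and the fraction of variance explained is one common way to make that choice.

```python
import numpy as np


def pca_project(data, n_components):
    """Project data (n_samples, n_features) onto its first n_components PCs."""
    centered = data - data.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]            # largest eigenvalue first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    explained = eigvals / eigvals.sum()          # fraction of variance per PC
    return centered @ eigvecs[:, :n_components], eigvals, explained


rng = np.random.default_rng(4)
# Hypothetical data: two strong latent motions embedded in 10 noisy coordinates.
latent = rng.normal(size=(3000, 2)) * np.array([5.0, 3.0])
mixing = rng.normal(size=(2, 10))
data = latent @ mixing + rng.normal(scale=0.3, size=(3000, 10))

projected, eigvals, explained = pca_project(data, n_components=2)
```

The array `projected` is the simplified data that could then be passed to a clustering method.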

PCA in application: Cartesian coordinates

• Cartesian coordinates contain all the information

• But often noisy

cPCA: Cartesian PCA, uses Cartesian coordinates

Mu, Y., Nguyen, P. H., & Stock, G. (2005). Proteins, 58(1), 45–52.

Comparison of cPCA and dPCA in the analysis of an Ala7 MD simulation

Dashed blue line: Cartesian PCA (cPCA)

Full red line: PCA using dihedral angles (dPCA)

PCA in application: cPCA, dPCA and pPCA

Advantages of dihedral angles:
1. Reduction of dimensionality
2. Constraints are built into the coordinates

Problems with dihedral angles:
1. Dihedral angles are periodic
2. Dihedral angles are not linear

In practice, dihedral angles are transformed to their sin/cos values before PCA; this is called dPCA
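The sin/cos transformation can be sketched as below. The helper names `dihedral_features` and `dpca`, the column ordering (all cosines, then all sines) and the synthetic angles are assumptions for illustration; published dPCA implementations may order or weight the features differently. The small check at the end shows why the transformation handles periodicity: -179° and +179° are far apart as raw numbers but close together in (cos, sin) space.

```python
import numpy as np


def dihedral_features(angles_rad):
    """Map angles (n_frames, n_angles) to [cos, sin] features (n_frames, 2*n_angles)."""
    return np.concatenate([np.cos(angles_rad), np.sin(angles_rad)], axis=1)


def dpca(angles_rad, n_components=2):
    """PCA on the sin/cos-transformed dihedral angles."""
    features = dihedral_features(angles_rad)
    centered = features - features.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(centered, rowvar=False))
    order = np.argsort(eigvals)[::-1]            # largest eigenvalue first
    return centered @ eigvecs[:, order[:n_components]], eigvals[order]


# Periodicity check: -179 and +179 degrees differ by only 2 degrees on the circle.
a, b = np.deg2rad([[-179.0]]), np.deg2rad([[179.0]])
gap = np.linalg.norm(dihedral_features(a) - dihedral_features(b))

# Hypothetical trajectory: 1000 frames, 4 dihedrals fluctuating around 180 degrees.
rng = np.random.default_rng(5)
angles = rng.vonmises(mu=np.pi, kappa=20.0, size=(1000, 4))
projection, eigvals = dpca(angles)
```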

Application of dPCA: (Aβ16-22)6

Nguyen, P. H., Li, M. S., Stock, G., Straub, J. E., & Thirumalai, D. (2007). PNAS, 104(1), 111–6.

Free-energy diagram projected onto the first two principal components V1 and V2 of the dPCA for the hexamer.

dPCA in RNA analysis: flexible choice of internal coordinates

Riccardi, L., Nguyen, P. H., & Stock, G. (2009). JPCB, 113(52), 16660–8.

• REMD simulation of a short β-hairpin, Trp-zip2, using:
– ff99sb-ildn
– ff99sb*-ildn
– ff99sb-ildn-nmr
– ff99C, our modified version of ff99sb-ildn

Using dPCA to compare the Trp-zip2 potential energy surface in different force fields


Free energy landscape of Trp-zip2 at 300K, using Amber force field 99sb*-ildn. Projected to 1st and 2nd principal component, using dPCA of turn region. The energy surface is extended because a stable hairpin cannot form.

Native like turn

Helical structure


The figure represents the free energy landscape of Trp-zip2 at 300K, using Amber force field 99sb-ildn. Projected to 1st and 2nd principal component of 99sb*-ildn, using dPCA of turn region.

Native like turn

Helical structure


The figure represents the free energy landscape of Trp-zip2 at 300K, using Amber force field 99sb-ildn-nmr. Projected to 1st and 2nd principal component of 99sb*-ildn, using dPCA of turn region. 99sb-ildn-nmr cannot fold the Trp-zip2 hairpin.

Native like turn

Helical structure

The figure represents the free energy landscape of Trp-zip2 at 300K, using force field 99C. Projected to 1st and 2nd principal component of 99sb*-ildn, using dPCA of turn region. In our force field, Trp-zip2 forms a stable beta-turn, so it rarely samples other conformations.


Native like turn

Summary

• PCA is a linear transformation of old coordinates to capture maximum variance

• Instead of using Cartesian coordinates, dihedral angles could be a better choice in description of conformational change

• General coordinates or a subset of coordinates (for region of interest) can be used for PCA analysis

• The result of PCA can be used for further analysis such as clustering and transition rate calculation.

Thank you!
