UNC Chapel Hill David A. O’Brien Chain Growing Using Statistical Energy Functions David A. O'Brien...
UNC Chapel Hill David A. O’Brien
Chain Growing Using Statistical Energy Functions
David A. O'Brien, Balasubramanian Krishnamoorthy, Jack Snoeyink, Alex Tropsha, Andrew Leaver-Fey, Shuquan Zong
Overview
• Lattice Chain Growth Algorithm
• Statistical Energy Functions
  • 2-body Miyazawa-Jernigan Potential
  • 4-body Potential
  • Local Shape Potential
• Results
  • Chains
  • Identifying Good Decoys
• Current Work
  • New Scoring Functions
  • Incremental Tetrahedralization
• Future work
Chain Growing - Introduction
Lattice chain growing goals:
• Test measures of proteins.
  • Build protein chains that maximize a given measure.
  • If these chains appear native-like, this confirms that the measure is valid.
• Predict protein structures ab initio, from sequence information alone.
  • Develop an algorithm that builds 3D folded protein decoys from the sequence that are similar to the native structure.
  • Evaluate these decoys and determine which are native-like. In short, pick the most native-like structure from the large set of decoys generated.
Lattice Chain Growth Algo.
• Cubic (3,1,1) lattice with 24 possible moves: {(3,1,1), (3,1,-1), …, (-3,1,1)}
• Generate a chain configuration by sequentially adding links until the full chain length is reached.
• New links cannot be placed in the zone of exclusion of other links and must satisfy angle constraints.
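As a minimal sketch (not the authors' code), the 24 moves can be enumerated as all signed permutations of (3,1,1). The function names `lattice_moves_311` and `open_neighbors` are illustrative. The slides do not give the size of the exclusion zone or the angle limits, so this sketch only filters out nodes that are already occupied.

```python
from itertools import permutations, product

def lattice_moves_311():
    """Return the 24 move vectors: all signed permutations of (3, 1, 1)."""
    moves = set()
    for perm in set(permutations((3, 1, 1))):
        for signs in product((1, -1), repeat=3):
            moves.add(tuple(s * c for s, c in zip(signs, perm)))
    return sorted(moves)

def open_neighbors(chain):
    """List lattice nodes reachable from the chain end that are not occupied.

    Assumption: only exact occupancy is tested here. The exclusion zone and
    the angle constraints would add further filters.
    """
    occupied = set(chain)
    x, y, z = chain[-1]
    candidates = [(x + dx, y + dy, z + dz) for dx, dy, dz in lattice_moves_311()]
    return [c for c in candidates if c not in occupied]

moves = lattice_moves_311()
print(len(moves))
```

Calling `open_neighbors` on a partial chain gives the candidate set from which the next link is chosen.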
Lattice Chain Growth Algo.: Adding a new link
• Generate a set of possible open lattice nodes.
• For each, calculate a temperature-dependent transition probability.
• Choose one of these open lattice nodes with a Monte Carlo step.
• Variations: look two steps ahead, or build from the middle of the chain.
Temperature-Dependent Transition Probability
Probability at step i of picking configuration x' from the C candidates x_1, …, x_C:

$$P_i(x') = \frac{\exp\left[-\frac{1}{k_B T} E(x')\right]}{\sum_{j=1}^{C} \exp\left[-\frac{1}{k_B T} E(x_j)\right]}$$

where T = temperature, k_B = Boltzmann constant, and E = energy (lower is better).
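The selection step can be sketched as follows, assuming the candidate energies are already computed. The names `transition_probabilities` and `pick_candidate` and the combined parameter `kT` (standing for k_B times T) are illustrative, not taken from the slides.

```python
import math
import random

def transition_probabilities(energies, kT=1.0):
    """Boltzmann weights exp(-E/kT), normalized over the C candidates."""
    e_min = min(energies)  # shifting by the minimum avoids overflow; ratios are unchanged
    weights = [math.exp(-(e - e_min) / kT) for e in energies]
    total = sum(weights)
    return [w / total for w in weights]

def pick_candidate(energies, kT=1.0, rng=random):
    """Monte Carlo step: draw one candidate index with probability P_i."""
    probs = transition_probabilities(energies, kT)
    return rng.choices(range(len(energies)), weights=probs, k=1)[0]

probs = transition_probabilities([-2.0, -1.0, 0.0], kT=1.0)
print(probs)
```

Lower-energy candidates get higher probability. Raising `kT` flattens the distribution, and lowering it makes the choice closer to greedy.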
Statistical Energy Functions
Statistical energy functions assume that “contact” energies between amino acid residues in native proteins are related to their observed frequency in a representative structural database.
If a potential configuration (decoy) has a certain set of nearby residues that is common in nature, give this a good score.
Score for entire protein is sum of all contact energies.
We use three statistical energy functions:
• 2-body Miyazawa-Jernigan
• 4-body Potential
• Local Shape Potential
Statistical Energy Functions - Overview
• Global vs. local
  • Global: measures the entire protein (or a partial fragment)
  • Local: measures just a short run of consecutive residues
• 2-body Miyazawa-Jernigan
  • Easy to calculate
  • Can be global or local
• 4-body Potential
  • Expensive to calculate
  • Works better as a global measure
  • Good for determining native-like folded structures
• Local Shape Potential
  • Easy to calculate
  • Defined as a local measure
  • Global measure?
Two-body Statistical Energy Function
For two-body potentials:

$$Q_{ij} = -k_B T \ln\left[\frac{F_{ij}}{P_{ij}}\right]$$

where F_ij = observed contact frequency and P_ij = reference state.
The actual pair energies (the Q_ij values) are taken from the Miyazawa-Jernigan matrix as re-evaluated in 1996:
Miyazawa S, Jernigan RL. Residue-residue potentials with a favorable contact pair term and an unfavorable high packing density term, for simulation and threading. J Mol Biol 1996;256:623-644.
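To illustrate how such a pair term is used, here is a small sketch that sums pair energies over a list of contacts. The frequencies in the table are toy numbers chosen for the example, not entries of the Miyazawa-Jernigan matrix. The names `pair_energy` and `total_contact_energy` are hypothetical, and the negative sign follows the "lower is better" convention stated for the energy.

```python
import math

def pair_energy(f_ij, p_ij, kT=1.0):
    """Q_ij = -kT * ln(F_ij / P_ij); negative when a contact is over-represented."""
    return -kT * math.log(f_ij / p_ij)

def total_contact_energy(sequence, contacts, q):
    """Sum Q_ij over contacting residue pairs (index pairs into `sequence`)."""
    total = 0.0
    for a, b in contacts:
        key = tuple(sorted((sequence[a], sequence[b])))  # order-independent lookup
        total += q[key]
    return total

# Toy numbers for illustration only; not values from the Miyazawa-Jernigan matrix.
q = {
    ("L", "L"): pair_energy(0.04, 0.02),  # seen twice as often as expected
    ("K", "L"): pair_energy(0.01, 0.02),  # seen half as often as expected
}
energy = total_contact_energy("LKL", [(0, 2), (0, 1)], q)
print(energy)
```

In this toy case the favorable L-L contact and the unfavorable K-L contact cancel, so the total is zero up to rounding.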
Four-Body Statistical Energy Function
• Calculates the energy based on sets of four nearby residues (quads).
• Quads are calculated from the Delaunay tessellation. The four vertices of each tetrahedron define a quad.
• Each quad is given a statistical score.
[Figure: Delaunay tessellation of a protein. The tetrahedral edges form a convex hull; each tetrahedron corresponds to a cluster of four residues.]
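The definition of a quad can be made concrete with a brute-force sketch: a set of four points is a Delaunay tetrahedron when no other point lies inside its circumsphere. This is not the tessellation code used in the work. It checks every 4-subset, so it is only practical for a handful of points, and it assumes points in general position. All names and coordinates are illustrative.

```python
from itertools import combinations

def _det3(m):
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

def circumsphere(p0, p1, p2, p3):
    """Return (center, squared radius) of the sphere through four points,
    or None if the points are (nearly) coplanar."""
    rows, rhs = [], []
    for p in (p1, p2, p3):
        # |c - p|^2 = |c - p0|^2  <=>  2 c.(p - p0) = |p|^2 - |p0|^2
        rows.append([2.0 * (p[k] - p0[k]) for k in range(3)])
        rhs.append(sum(p[k] ** 2 - p0[k] ** 2 for k in range(3)))
    det = _det3(rows)
    if abs(det) < 1e-12:
        return None
    center = []
    for col in range(3):  # Cramer's rule, one coordinate at a time
        m = [row[:] for row in rows]
        for r in range(3):
            m[r][col] = rhs[r]
        center.append(_det3(m) / det)
    r2 = sum((center[k] - p0[k]) ** 2 for k in range(3))
    return center, r2

def delaunay_quads(points, eps=1e-9):
    """Index 4-tuples whose circumsphere contains no other point."""
    quads = []
    for quad in combinations(range(len(points)), 4):
        sphere = circumsphere(*(points[i] for i in quad))
        if sphere is None:
            continue
        center, r2 = sphere
        empty = all(
            sum((points[j][k] - center[k]) ** 2 for k in range(3)) >= r2 - eps
            for j in range(len(points)) if j not in quad
        )
        if empty:
            quads.append(quad)
    return quads

# Toy coordinates standing in for residue positions.
points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0),
          (0.0, 0.0, 1.0), (2.0, 2.0, 2.0)]
print(delaunay_quads(points))
```

For real chains a computational-geometry library would replace this exhaustive search; the empty-circumsphere test is the part that carries over.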
Four-Body Statistical Energy Function - Overview
• The four-body potential is written $Q^{\alpha}_{ijkl}$.
• A training set of 1166 proteins was tessellated.
• The frequency of each quad type is counted.
• Each quad is typed in two ways:
  • by the combination of the four residue types {i,j,k,l}
  • by the number of consecutively appearing residues (α)
[Figure: the five α-types, with observed shares of 25.5%, 35.6%, 11.4%, 22.1%, and 5.4%, in the order shown on the slide.]
Four-Body Statistical Energy Function - Classifying quadruplets
• Denote each quad by {i,j,k,l}.
• i, j, k, and l can be any of the 20 amino acids (L20).
  • e.g. AALV, TLKM, TTLK, YYYY, etc.
  • 8855 possible combinations
• Or the 20 amino acids can be grouped into just 6 types (L6).
  • Groups are defined by chemical properties of the amino acids.
  • 126 possible combinations
  • c = {cysteine}
  • f = {phenylalanine, tyrosine, tryptophan}
  • h = {histidine, arginine, lysine}
  • n = {asparagine, aspartic acid, glutamine, glutamic acid}
  • s = {serine, threonine, proline, alanine, glycine}
  • v = {methionine, isoleucine, leucine, valine}
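The counts 8855 and 126 are the numbers of unordered quads with repetition, C(n+3, 4) for an alphabet of n residue types. The sketch below checks both counts and shows one way to map a quad to an order-independent key. The one-letter amino acid codes and the helper names (`GROUPS`, `quad_key`, `n_quad_combinations`) are illustrative and not taken from the slides.

```python
from math import comb

# The six L6 groups listed above, written with standard one-letter codes.
GROUPS = {
    "c": "C",
    "f": "FYW",
    "h": "HRK",
    "n": "NDQE",
    "s": "STPAG",
    "v": "MILV",
}
TO_GROUP = {aa: g for g, members in GROUPS.items() for aa in members}

def quad_key(residues, reduced=False):
    """Order-independent key for a quad, e.g. 'TLKM' -> 'KLMT'.

    With reduced=True the residues are first mapped to their L6 group.
    """
    letters = [TO_GROUP[r] for r in residues] if reduced else list(residues)
    return "".join(sorted(letters))

def n_quad_combinations(alphabet_size):
    """Multisets of size 4 drawn from the alphabet: C(n + 3, 4)."""
    return comb(alphabet_size + 3, 4)

print(n_quad_combinations(20), n_quad_combinations(6))
print(quad_key("AALV", reduced=True))
```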
Four-Body Statistical Energy Function - Classifying quadruplets
• L20 case:
  • 5 α-types x 8855 combinations ==> 44,275 quad types
  • Not all quad types are observed in the training set.
  • The potential of unobserved types is set to some fraction of the lowest score for a represented quad type.
• L6 case:
  • 5 α-types x 126 combinations ==> 630 quad types
  • All but a few quad types are observed in the training set.
Four-Body Statistical Energy Function - Formulation
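The formulation is an extension of the previous 2-body formula:

$$Q^{\alpha}_{ijkl} = -k T \ln\left[\frac{f^{\alpha}_{ijkl}}{P^{\alpha}_{ijkl}}\right]$$

where

$$f^{\alpha}_{ijkl} = \frac{\text{observed occurrences of type } \alpha \text{ } (ijkl) \text{ neighbors}}{\text{total number for type } \alpha}$$

$$P^{\alpha}_{ijkl} = P^{\alpha} P_{ijkl}, \qquad P_{ijkl} = \frac{4!}{\prod_{i}^{N} (t_i!)} \, a_i a_j a_k a_l$$

$$a_i = \frac{\text{observed occurrences of amino acid type } i}{\text{total number of residues in data set}}$$

$$t_i = \text{number of residues of each type } i \text{ in the quad}$$

$$P^{\alpha} = \frac{\#\text{ of type } \alpha \text{ tetrahedra observed in training set}}{\text{total } \#\text{ of tetrahedra in training set}}$$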
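A hedged sketch of this calculation follows. The function names, the combined temperature parameter `kT`, and the toy frequencies in the usage lines are assumptions. The negative sign is chosen so that over-represented quads get lower energy, consistent with "lower is better".

```python
import math
from collections import Counter

def multiplicity(quad):
    """C = 4! / prod(t_i!), where t_i counts each residue type in the quad."""
    denom = 1
    for t in Counter(quad).values():
        denom *= math.factorial(t)
    return math.factorial(4) // denom

def expected_frequency(quad, residue_freq, type_freq):
    """Reference frequency: P^alpha * C * a_i * a_j * a_k * a_l."""
    p = type_freq * multiplicity(quad)
    for r in quad:
        p *= residue_freq[r]
    return p

def four_body_energy(observed_freq, quad, residue_freq, type_freq, kT=1.0):
    """Q = -kT * ln(f / P). Sign convention assumed: lower is better."""
    expected = expected_frequency(quad, residue_freq, type_freq)
    return -kT * math.log(observed_freq / expected)

# Toy frequencies for illustration only.
a = {"A": 0.1, "L": 0.1, "V": 0.1}
p_ref = expected_frequency("AALV", a, type_freq=0.5)
print(multiplicity("AALV"), p_ref)
print(four_body_energy(2 * p_ref, "AALV", a, type_freq=0.5))
```

A quad observed twice as often as the reference predicts gets an energy of -kT ln 2.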
Local Shape Statistical Energy Function
• Motivation:
  • Fragment libraries model protein structures accurately.
  • Use the frequency of common fragments to construct a statistical function that supplements the 2-body and 4-body energy functions to grow better decoys.
  • Good fragment libraries exist, but for lattice-chain building we need fragments that fit in the 311 lattice.
• Main idea:
  • For each possible consecutive sequence of four residues i, j, k, and l, calculate in which shape these residues most often occur.
  • If Shape A is found more often in nature than Shape B, try to build the chain accordingly.
[Figure: two example shapes, Shape A and Shape B.]
Local Shape Statistical Energy Function
• Create a set of canonical lattice shapes of length 4 (and 5).
  • Calculate the ways to embed a chain of length 4 (or 5) in the 311 lattice.
  • 155 canonical shapes for length 4 (2789 for length 5)
  • For L6, there are 6^4 = 1,296 sequences.
  • 155 x 1,296 = 200,880 combinations
• Parse a representative set of 971 proteins into segments.
  • For each length-4 segment, calculate the RMSD against each canonical shape.
[Figure: a sample protein segment compared against Shape 1, Shape 2, …, Shape 155.]
Local Shape Statistical Energy Function
• Turning RMSD values into frequencies:
  • If only the canonical shape with the best RMSD is counted, not all 200,880 combinations are found in the training set.
  • If two canonical shapes have low RMSD, give each some credit.
  • Normalize the 155 RMSD values.
• Each RMSD value contributes to the frequency as follows, where i,j,k,l are the residue types and n indexes the canonical shape:

$$\mathrm{Freq}_{i,j,k,l,n} = \frac{1}{\exp\left(\mathrm{RMSD}_{i,j,k,l,n}\right)}$$
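One way to read the soft-counting idea is sketched below: each training segment spreads one unit of credit over all canonical shapes, weighted by 1/exp(RMSD). The slides say the RMSD values are normalized but not how, so rescaling the weights to sum to one per segment is an assumption. The names `shape_credits` and `accumulate` are hypothetical.

```python
import math

def shape_credits(rmsds):
    """Weights 1/exp(RMSD), rescaled so one segment contributes a total of 1.

    Assumption: the per-segment rescaling stands in for the normalization
    step, whose exact form is not given.
    """
    weights = [1.0 / math.exp(r) for r in rmsds]
    total = sum(weights)
    return [w / total for w in weights]

def accumulate(freq, seq_key, rmsds):
    """Add one segment's credits to the frequency row of its residue-type key."""
    credits = shape_credits(rmsds)
    row = freq.setdefault(seq_key, [0.0] * len(rmsds))
    for n, c in enumerate(credits):
        row[n] += c
    return freq

# Three shapes instead of 155, with made-up RMSD values.
freq = {}
accumulate(freq, "ssvv", [0.2, 0.3, 2.5])
accumulate(freq, "ssvv", [0.1, 1.8, 2.0])
print(freq["ssvv"])
```

Two shapes with similarly low RMSD both receive substantial credit, which is the behavior the bullets above ask for.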
Results - Building Decoys
• Decoys produced by chain growing are still not good enough.
• Relatively good correlation between RMSD and 4-body energy.
[Figure: two plots for 2mhu, one for decoys built with the MJ potential and one for decoys built with the local shape potential; y-axis: four-body energy per residue; the native state is marked.]
Identifying Good Decoys
• 20L or 6L Non-bonded: sum only the contribution of α-type 0 tetrahedra.
Discriminating Native & Non-Native
• Non-bonded L20 scoring function applied to a set of folded and unfolded decoys.
[Figure: bar chart, "Non-bonded log-likelihoods for the Shakhnovich instances and the native structure (20L1T, SC)". x-axis: instances s6A1 to s6A6 (yellow, pre), GF01 to GF20 (blue, post), and 2CI2 (red, native); y-axis: log-likelihood score, 0 to 40.]
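Adjustments to Scoring Functions
• 20L or 6L Non-bonded: sum only the contribution of α-type 0 tetrahedra.
• 20L or 6L 5T: sum the contribution of all tetrahedra.
• 20L Ratio All: as above, but define:

$$P^{\alpha}_{test} = \frac{\#\text{ of type } \alpha \text{ tetrahedra in test protein}}{\text{total } \#\text{ of tetrahedra in test protein}}, \qquad r^{\alpha} = \frac{P^{\alpha}_{test}}{P^{\alpha}}, \qquad Q^{\alpha,\,RatioAll}_{ijkl} = r^{\alpha} \, Q^{\alpha}_{ijkl}$$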
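The three variants differ only in which tetrahedra are summed and how they are weighted, as this sketch shows. It assumes each tetrahedron of the test protein has already been reduced to a pair (type, quad key) and that scores are stored in a table keyed by that pair. The table, the training-set type frequencies, and the function names are all made up for illustration.

```python
from collections import Counter

def score_non_bonded(tetrahedra, q):
    """Non-bonded: sum Q over type-0 tetrahedra only."""
    return sum(q[(a, key)] for a, key in tetrahedra if a == 0)

def score_5t(tetrahedra, q):
    """5T: sum Q over tetrahedra of all five types."""
    return sum(q[(a, key)] for a, key in tetrahedra)

def score_ratio_all(tetrahedra, q, p_train):
    """Ratio All: weight each Q by r = P_test(type) / P_train(type)."""
    counts = Counter(a for a, _ in tetrahedra)
    n = len(tetrahedra)
    total = 0.0
    for a, key in tetrahedra:
        r = (counts[a] / n) / p_train[a]
        total += r * q[(a, key)]
    return total

# Toy inputs: (type, quad key) pairs and a made-up score table.
tetrahedra = [(0, "ssvv"), (0, "cfhn"), (4, "ssss"), (1, "ssvv")]
q = {(0, "ssvv"): 1.0, (0, "cfhn"): -0.5, (4, "ssss"): 2.0, (1, "ssvv"): 0.5}
p_train = {0: 0.25, 1: 0.25, 4: 0.25}

print(score_non_bonded(tetrahedra, q))
print(score_5t(tetrahedra, q))
print(score_ratio_all(tetrahedra, q, p_train))
```

In the toy data, type 0 makes up half of the test tetrahedra but a quarter of the training set, so its contributions are doubled under the ratio weighting.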
Incremental Tetrahedralization
• Maintain a constant tetrahedralization and only add and remove single vertices.
• When evaluating a new candidate, update the total energy by tagging new quadruplets as well as any that have been removed.
• Add the effect of the new quadruplets, and subtract the effect of those removed.
[Figure: sequence of steps. Add a candidate and evaluate; add the next candidate and re-evaluate; remove the candidate and reset the state.]
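The bookkeeping side of this idea can be sketched without the geometry. The class below assumes some incremental tessellation routine (not shown, and not specified in the slides) reports which quads appear and disappear when a candidate vertex is inserted. `IncrementalScore` and its methods are hypothetical names.

```python
class IncrementalScore:
    """Keep a running energy total while quads are added and removed."""

    def __init__(self, q):
        self.q = q            # maps quad key -> energy
        self.quads = set()    # quads in the current tessellation
        self.total = 0.0

    def apply(self, added, removed):
        """Insert a candidate: add the new quads' effect, subtract the removed ones'."""
        added, removed = set(added), set(removed)
        self.quads = (self.quads - removed) | added
        self.total += sum(self.q[k] for k in added) - sum(self.q[k] for k in removed)
        return added, removed  # token that lets the caller undo this step

    def undo(self, token):
        """Remove the candidate again and restore the previous state."""
        added, removed = token
        self.quads = (self.quads - added) | removed
        self.total -= sum(self.q[k] for k in added) - sum(self.q[k] for k in removed)

# Toy quad keys and energies.
score = IncrementalScore({"a": 1.0, "b": 2.0, "c": -0.5})
score.apply({"a", "b"}, set())
token = score.apply({"c"}, {"a"})   # candidate replaces quad "a" with quad "c"
with_candidate = score.total
score.undo(token)                   # reset before trying the next candidate
print(with_candidate, score.total)
```

Each candidate costs only the quads it changes, rather than a full re-tessellation and re-sum.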
References
H.H. Gan, A. Tropsha and T. Schlick. Generating folded protein structures with a lattice chain-growth algorithm. J. Chem. Phys. 113, 5511-5524 (2000).
H.H. Gan, A. Tropsha and T. Schlick. Lattice protein folding with two and four-body statistical potentials. Proteins: Structure, Function, and Genetics 43, 161-174 (2001).
S. Miyazawa and R.L. Jernigan. Residue-residue potentials with a favorable contact pair term and an unfavorable high packing density term, for simulation and threading. J. Mol. Biol. 256, 623-644 (1996).
A. Tropsha, R.K. Singh and I.I. Vaisman. Delaunay tessellation of proteins: Four body nearest neighbor propensities of amino acid residues. J. Comput. Biol. 3(2), 213-222 (1996).
R. Kolodny, P. Koehl, L. Guibas and M. Levitt. Small libraries of protein fragments model native protein structures accurately. J. Mol. Biol. 323, 297-307 (2002).