Computational Methods in Molecular Modelling
Uğur Sezerman, Biological Sciences and Bioengineering Program, Sabancı University, Istanbul
Motivation
Knowing the structure of a molecule enables us to understand its mechanism of function
Current experimental techniques: X-ray crystallography, NMR
PROTEIN FOLDING PROBLEM
STARTING FROM THE AMINO ACID SEQUENCE, FINDING THE STRUCTURE OF PROTEINS IS CALLED THE PROTEIN FOLDING PROBLEM
Forces driving protein folding
It is believed that hydrophobic collapse is a key driving force for protein folding
Hydrophobic core
Polar surface interacting with solvent
Minimum volume (no cavities)
Van der Waals interactions
Disulfide bond formation stabilizes the folded structure
Hydrogen bonds
Polar and electrostatic interactions
SECONDARY STRUCTURE PREDICTION
Intro. To Struc.(Tooze and Branden)
Secondary Structure Prediction
AGVGTVPMTAYGNDIQYYGQVT…

AGVGTVPMTAYGNDIQYYGQVT…
A-VGIVPM-AYGQDIQY-GQVT…
AG-GIIP--AYGNELQ--GQVT…
AGVCTVPMTA---ELQYYG--T…

AGVGTVPMTAYGNDIQYYGQVT…
----hhhHHHHHHhhh--eeEE…
Chou-Fasman Parameters
Name           Abbrv  P(a)  P(b)  P(turn)  f(i)   f(i+1)  f(i+2)  f(i+3)
Alanine        A      142    83     66     0.06   0.076   0.035   0.058
Arginine       R       98    93     95     0.07   0.106   0.099   0.085
Aspartic Acid  D      101    54    146     0.147  0.11    0.179   0.081
Asparagine     N       67    89    156     0.161  0.083   0.191   0.091
Cysteine       C       70   119    119     0.149  0.05    0.117   0.128
Glutamic Acid  E      151    37     74     0.056  0.06    0.077   0.064
Glutamine      Q      111   110     98     0.074  0.098   0.037   0.098
Glycine        G       57    75    156     0.102  0.085   0.19    0.152
Histidine      H      100    87     95     0.14   0.047   0.093   0.054
Isoleucine     I      108   160     47     0.043  0.034   0.013   0.056
Leucine        L      121   130     59     0.061  0.025   0.036   0.07
Lysine         K      114    74    101     0.055  0.115   0.072   0.095
Methionine     M      145   105     60     0.068  0.082   0.014   0.055
Phenylalanine  F      113   138     60     0.059  0.041   0.065   0.065
Proline        P       57    55    152     0.102  0.301   0.034   0.068
Serine         S       77    75    143     0.12   0.139   0.125   0.106
Threonine      T       83   119     96     0.086  0.108   0.065   0.079
Tryptophan     W      108   137     96     0.077  0.013   0.064   0.167
Tyrosine       Y       69   147    114     0.082  0.065   0.114   0.125
Valine         V      106   170     50     0.062  0.048   0.028   0.053
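As a rough illustration of how the P(a) column can be used, the sketch below scans a sequence for candidate helix nuclei. It is a simplified stand-in, not the full Chou-Fasman procedure: the window length of 6, the requirement of at least 4 helix formers, and the cutoff of 100 are assumptions made for this example, and the function name `helix_nuclei` is hypothetical.

```python
# Helix propensities P(a), copied from the Chou-Fasman parameter table.
P_ALPHA = {
    "A": 142, "R": 98, "D": 101, "N": 67, "C": 70, "E": 151, "Q": 111,
    "G": 57, "H": 100, "I": 108, "L": 121, "K": 114, "M": 145, "F": 113,
    "P": 57, "S": 77, "T": 83, "W": 108, "Y": 69, "V": 106,
}


def helix_nuclei(seq, window=6, min_formers=4, cutoff=100):
    """Return (start, mean P(a)) for windows with enough helix formers.

    Assumption: a residue counts as a helix former when P(a) > cutoff.
    """
    hits = []
    for start in range(len(seq) - window + 1):
        segment = seq[start:start + window]
        formers = sum(1 for aa in segment if P_ALPHA[aa] > cutoff)
        if formers >= min_formers:
            mean_p = sum(P_ALPHA[aa] for aa in segment) / window
            hits.append((start, mean_p))
    return hits


if __name__ == "__main__":
    # Sequence fragment taken from the secondary structure prediction slide.
    for start, mean_p in helix_nuclei("AGVGTVPMTAYGNDIQYYGQVT"):
        print(start, round(mean_p, 1))
```

A full predictor would also extend each nucleus in both directions and compare helix, strand, and turn propensities before assigning a state.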
Computational Approaches
Ab initio methods
Threading
Comparative Modelling
Fragment Assembly
Ab initio protein structure prediction as an optimization problem
[Figure: energy versus conformation]
1. Define a function that maps protein structures to some quality measure.
2. Solve the computational problem of finding an optimal structure.
3.
Chen Keasar, BGU
A dream function
Has a clear minimum in the native structure.
Has a clear path towards the minimum.
Global optimization algorithm should find the native structure.
An approximate function
Easier to design and compute.
Native structure not always the global minimum.
Global optimization methods do not converge.
Many alternative models (decoys) should be generated.
No clear way of choosing among them.
Decoy set
Fold Optimization
Simple lattice models (HP-models)
Two types of residues: hydrophobic and polar
2-D or 3-D lattice
The only force is hydrophobic collapse
Score = number of HH contacts
H/P model scoring:
Sometimes: penalize for buried polar or surface hydrophobic residues
Scoring Lattice Models
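The basic HP score (number of HH contacts) can be sketched as follows. This is a minimal illustration for a 2-D square lattice; the function name `hp_contacts` and the coordinate representation are assumptions for this example, and the optional penalties for buried polar or exposed hydrophobic residues are not included.

```python
def hp_contacts(sequence, coords):
    """Count H-H contacts on a 2-D square lattice.

    sequence: string over {"H", "P"}
    coords:   list of (x, y) lattice points, one per residue, in chain order
    A contact is a pair of H residues on neighbouring lattice points that
    are not consecutive in the chain.
    """
    if len(sequence) != len(coords):
        raise ValueError("sequence and coords must have the same length")
    if len(set(coords)) != len(coords):
        raise ValueError("conformation is not self-avoiding")

    index_at = {xy: i for i, xy in enumerate(coords)}
    contacts = 0
    for i, (x, y) in enumerate(coords):
        if sequence[i] != "H":
            continue
        # Looking only right and up counts every neighbouring pair once.
        for neighbour in ((x + 1, y), (x, y + 1)):
            j = index_at.get(neighbour)
            if j is not None and sequence[j] == "H" and abs(i - j) > 1:
                contacts += 1
    return contacts


if __name__ == "__main__":
    square = [(0, 0), (1, 0), (1, 1), (0, 1)]
    print(hp_contacts("HPPH", square))  # first and last residue touch
```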
Learning from Lattice Models
Ken Dill ~ 1997
Hydrophobic zipper effect
[Figure (Hinds & Levitt): model resolution. Conformational space: diamond lattice, fine square lattice, torsion angle lattice, fragments, continuous. Basic element: some residues, residue, half a residue, extended atom, atom, electrons & protons.]
What can we do with lattice models?
For smaller polypeptides, exhaustive search can be used
Looking at the “best” fold, even in such a simple model, can teach us interesting things about the protein folding process
For larger chains, other optimization and search methods must be used
Greedy, branch and bound
Evolutionary computing, simulated annealing
Graph theoretical methods
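For short chains, the exhaustive search mentioned above can be sketched as a depth-first enumeration of self-avoiding walks on a 2-D square lattice, scored by HH contacts. The function name `best_fold` and the choice to fix the first bond along the x-axis (to skip rotated copies) are assumptions for this example. The number of walks grows exponentially with chain length, so this is only practical for roughly a dozen residues or fewer.

```python
MOVES = ((1, 0), (-1, 0), (0, 1), (0, -1))


def best_fold(sequence):
    """Exhaustively search 2-D lattice folds; return (best score, path)."""
    n = len(sequence)
    if n == 0:
        return 0, []
    best = {"score": -1, "path": None}

    def score(path):
        # Number of H-H pairs on neighbouring points, not adjacent in chain.
        index_at = {xy: i for i, xy in enumerate(path)}
        total = 0
        for i, (x, y) in enumerate(path):
            if sequence[i] != "H":
                continue
            for neighbour in ((x + 1, y), (x, y + 1)):
                j = index_at.get(neighbour)
                if j is not None and sequence[j] == "H" and abs(i - j) > 1:
                    total += 1
        return total

    def extend(path, occupied):
        if len(path) == n:
            s = score(path)
            if s > best["score"]:
                best["score"], best["path"] = s, list(path)
            return
        x, y = path[-1]
        for dx, dy in MOVES:
            nxt = (x + dx, y + dy)
            if nxt not in occupied:  # keep the walk self-avoiding
                path.append(nxt)
                occupied.add(nxt)
                extend(path, occupied)
                path.pop()
                occupied.remove(nxt)

    start = [(0, 0)] if n == 1 else [(0, 0), (1, 0)]
    extend(start, set(start))
    return best["score"], best["path"]


if __name__ == "__main__":
    score, path = best_fold("HPPHPPH")
    print(score, path)
```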
Inverse Protein Folding Problem
Given a structure (or a functionality), identify an amino acid sequence whose fold will be that structure (exhibit that functionality).
Crucial problem in drug design.
NP-hard under most models.
PROTEIN THREADING
Thread the given sequence onto the different structural families that exist in structural databases
Choose the optimum structure based on the potential energy function used (e.g., contact potential, free energy)
Threading: Fold recognition
Given:
Sequence: IVACIVSTEYDVMKAAR…
A database of molecular coordinates
Map the sequence onto each fold
Evaluate
Objective 1: improve scoring function
Objective 2: folding
Protein Fold Families (CATH,SCOP)
CATH website www.cathdb.info
Genetic Algorithm used as a search tool
We are searching for the minima of our fitness function, which is composed of profile and contact energy terms.
In this problem, value encoding has been used. Parents are represented as strings of positions. Population size is 50.
A sample parent (string of positions) is shown below:
1 2 3 4 5 10 11 12 13 14 23 24 25 26 27 28 29 30 31 32 55 56 57 58
A branch and bound algorithm has been used to produce random initial parents.
Mutation:
The mutation operator shifts a structure’s position to the right or left by some units.
Crossover:
Two-point crossover is applied, where selected suitable structures are exchanged between two parents.
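The two operators can be sketched as below. This is an illustrative reading of the slide, not the authors' implementation: a parent is stored as a list of (start, length) segments rather than a flat string of positions, a child counts as "suitable" only if its segments stay ordered and non-overlapping, and the names `is_valid`, `mutate`, `two_point_crossover`, and the `max_shift` parameter are all hypothetical.

```python
import random


def is_valid(parent, seq_len):
    """Segments must be ordered, non-overlapping, and inside 1..seq_len."""
    previous_end = 0
    for start, length in parent:
        end = start + length - 1
        if start <= previous_end or end > seq_len:
            return False
        previous_end = end
    return True


def mutate(parent, seq_len, max_shift=3, rng=random):
    """Shift one randomly chosen segment left or right by some units."""
    child = list(parent)
    k = rng.randrange(len(child))
    shift = rng.choice([s for s in range(-max_shift, max_shift + 1) if s != 0])
    start, length = child[k]
    child[k] = (start + shift, length)
    # Reject the move if the shifted segment collides with its neighbours.
    return child if is_valid(child, seq_len) else list(parent)


def two_point_crossover(p1, p2, seq_len, rng=random):
    """Exchange the segments between two cut points of two parents."""
    a, b = sorted(rng.sample(range(len(p1) + 1), 2))
    c1 = p1[:a] + p2[a:b] + p1[b:]
    c2 = p2[:a] + p1[a:b] + p2[b:]
    # Keep a child only if the exchanged segments fit; else keep the parent.
    return (c1 if is_valid(c1, seq_len) else list(p1),
            c2 if is_valid(c2, seq_len) else list(p2))


if __name__ == "__main__":
    # The sample parent from the slide, written as (start, length) segments.
    parent = [(1, 5), (10, 5), (23, 10), (55, 4)]
    other = [(3, 5), (12, 5), (20, 10), (50, 4)]
    print(mutate(parent, seq_len=60))
    print(two_point_crossover(parent, other, seq_len=60))
```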
Our Aim
In this research, we have threaded a structurally unknown protein sequence to over 2200 SCOP family fold proteins and sought the best fitting structural family.
We also tried to find the optimum fit of the query sequence to a given fold.
Energy function is a combination of
The sequence profile energy
Contact potential energy (inter- and intra-structural residues are taken into account)
TotalEnergy = p1 (ProfileEnergy) + c1 (ContactEnergy)
The weights are chosen such that the contributions of the profile and contact energy terms are equal.
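One way to pick weights that equalise the two contributions is sketched below. The slides do not say how the balancing was done, so the rule used here (fix p1 = 1 and scale c1 by the ratio of mean absolute energies over a sample) is an assumption, and the function names are hypothetical.

```python
def balance_weights(profile_energies, contact_energies):
    """Return (p1, c1) so both terms have equal mean magnitude on a sample.

    Assumption: p1 is fixed at 1 and c1 rescales the contact term.
    """
    mean_profile = sum(abs(e) for e in profile_energies) / len(profile_energies)
    mean_contact = sum(abs(e) for e in contact_energies) / len(contact_energies)
    return 1.0, mean_profile / mean_contact


def total_energy(profile_energy, contact_energy, p1, c1):
    """TotalEnergy = p1 * ProfileEnergy + c1 * ContactEnergy."""
    return p1 * profile_energy + c1 * contact_energy


if __name__ == "__main__":
    # Made-up sample energies, for illustration only.
    p1, c1 = balance_weights([-10.0, -20.0], [-100.0, -200.0])
    print(p1, c1, total_energy(-10.0, -100.0, p1, c1))
```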
Fitness Function
Profile Energy
We do structural alignment on all selected secondary structural units of the sequences.
Secondary structural units with the same number are selected.
Length of the units may differ.
-- P  E  E  L  L  L  R  W  A  N  F  H  L  E  N   (1aoa)
-- S  E  K  I  L  L  K  W  V  R  Q  T  -- -- --  (1qag)
N  S  E  K  I  L  L  S  W  V  R  Q  S  T  R  --  (1dxx)
Sixth helices of the selected all-alpha sequences
Profile Matrix calculated from a structure group
Rows: positions. Columns: residue names (A to Y, and "-" for a gap). Entries: profile scores.
Pos      A      C      D      E      F      G      H      I      K      L      M      N      P      Q      R      S      T      V      W      Y      -
  1  -0.33  -0.67   0.68   0.01  -1.33   0.01   0.34     -1   0.01  -1.33  -0.67   2.34  -0.67   0.01  -0.33   0.34   0.01     -1  -1.33  -0.67   4.01
  2   0.34     -2  -0.33     -1  -3.33  -0.67  -1.33     -3  -0.33  -3.33  -2.33   0.01   2.68  -0.33  -1.67   3.01   1.01  -2.33     -4  -2.33   0.01
  3     -1     -3   2.01   6.01     -3     -3   0.01     -4   1.01     -3     -2   0.01     -1   2.01   0.01     -1     -1     -3     -3     -2   0.01
  4     -1     -3   0.01   2.68  -3.67  -2.33   0.01  -3.33   4.34     -3     -2   0.01     -1   2.01   2.01  -0.33     -1     -3     -3     -2   0.01
  5  -1.33     -2     -4  -3.67   0.34     -4  -3.67   4.01     -3   3.01   2.34  -3.33  -3.33  -2.67  -3.67     -3     -1   3.01  -2.67     -1   0.01
  6     -2     -2     -4     -3   1.01     -4     -3   2.01     -3   5.01   3.01     -4     -4     -2     -3     -3     -1   1.01     -2     -1   0.01
  7     -2     -2     -4     -3   1.01     -4     -3   2.01     -3   5.01   3.01     -4     -4     -2     -3     -3     -1   1.01     -2     -1   0.01
  8   0.01     -2  -0.67  -0.67     -3     -1  -0.67  -3.33   1.01     -3     -2   0.34  -1.67   0.34   1.68   3.01   1.01  -2.33  -3.67  -1.67   0.01
  9     -3     -5     -5     -3   1.01     -3     -3     -3     -3     -2     -1     -4     -4     -1     -3     -4     -3     -3  15.01   2.01   0.01
 10   1.68     -1  -3.33  -2.33  -1.67  -2.67  -3.33   2.34  -2.33   0.01   0.34  -2.33  -2.33  -2.33  -2.67     -1   0.01   3.34     -3  -1.33   0.01
 11  -1.67  -3.33  -0.67   0.01  -3.33     -2   0.34  -3.67   2.01  -3.33     -2   1.68  -2.67   0.68   4.34  -0.33  -0.67     -3  -3.33  -1.33   0.01
 12  -1.67  -2.67  -1.67   0.34   0.01  -2.67   0.34     -2   0.01     -1   0.01  -1.33     -2   3.34  -0.33     -1  -1.33  -2.33  -0.33   0.68   0.01
 13  -0.33  -1.67  -0.67  -0.67     -2  -1.33   2.34  -2.67  -0.33  -2.33  -1.33   0.68  -1.33   0.01  -0.67   2.01   1.68     -2  -3.33  -0.67   0.01
 14  -0.67     -1  -1.67  -1.33  -0.33     -2  -1.67   0.34  -1.33   1.34   0.68  -1.33  -1.67     -1  -1.33  -0.33   1.34   0.34  -1.67     -1   2.01
 15     -1  -2.33   0.01   2.01     -2     -2   0.01  -2.67   1.34     -2  -1.33  -0.33  -1.33   1.01   2.34  -0.67  -0.67     -2     -2     -1   2.01
 16  -0.33  -0.67   0.68   0.01  -1.33   0.01   0.34     -1   0.01  -1.33  -0.67   2.34  -0.67   0.01  -0.33   0.34   0.01     -1  -1.33  -0.67   4.01
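To show how such a profile can be used during threading, the sketch below scores a query segment against profile rows. Only a few entries of profile positions 2 and 3 are copied in to keep the example short. Treating the profile energy as the negated sum of profile scores is an assumption (the slides do not state the sign convention), and the function name `profile_energy` is hypothetical.

```python
# A few entries from profile positions 2 and 3 (residue -> profile score).
PROFILE_ROWS = [
    {"A": 0.34, "P": 2.68, "S": 3.01},  # position 2
    {"D": 2.01, "E": 6.01, "Q": 2.01},  # position 3
]


def profile_energy(segment, profile_rows):
    """Negated sum of profile scores for a segment placed on profile rows.

    Assumption: a higher profile score means a better fit, so the energy
    is the negative of the summed score.
    """
    if len(segment) != len(profile_rows):
        raise ValueError("segment must cover exactly the given profile rows")
    score = sum(row[aa] for aa, row in zip(segment, profile_rows))
    return -score


if __name__ == "__main__":
    print(profile_energy("SE", PROFILE_ROWS))  # well-conserved residues
    print(profile_energy("AD", PROFILE_ROWS))  # weaker match
```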
Contact Potential Energy
Based on contact frequencies counted in a database of known structures and converted into energy values.
In this study, the contact potential energy is the sum of the pair energies of residues that are closer than seven angstroms to each other.
Jernigan’s & Dill’s Contact Potential Energy Tables have been used.
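The seven-angstrom rule can be sketched as follows. The pair energies in `TOY_PAIR_ENERGY` are made-up placeholder values, not entries from the Jernigan or Dill tables, and the `min_separation` parameter (skipping chain neighbours, which are always within the cutoff) is an assumption added for this example.

```python
import math

# Placeholder pair energies for illustration only (not published values).
TOY_PAIR_ENERGY = {
    ("A", "L"): -0.5,
    ("A", "V"): -0.4,
    ("L", "V"): -1.0,
}


def contact_energy(sequence, coords, pair_energy, cutoff=7.0, min_separation=2):
    """Sum pair energies over residue pairs closer than `cutoff` angstroms.

    coords holds one (x, y, z) point per residue, in angstroms.
    """
    total = 0.0
    for i in range(len(sequence)):
        for j in range(i + min_separation, len(sequence)):
            if math.dist(coords[i], coords[j]) < cutoff:
                key = tuple(sorted((sequence[i], sequence[j])))
                total += pair_energy[key]
    return total


if __name__ == "__main__":
    coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0)]
    print(contact_energy("LAV", coords, TOY_PAIR_ENERGY))
```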
Selected Benchmark Set
All Alpha Set: 1aoa, 1dxx, 1qag
Fold: Calponin-homology domain, CH-domain; core: 4 helices; bundle
Superfamily: Calponin-homology domain, CH-domain
Family: Calponin-homology domain, CH-domain
All Beta Set: 1acx, 1hzk, 1noa, 2mcm
Fold: Immunoglobulin-like beta-sandwich; sandwich; 7 strands in 2 sheets
Superfamily: Actinoxanthin-like
Family: Actinoxanthin-like
Alpha+Beta Set: 1dwn, 1e6t, 1frs, 1qbe, 1una
Fold: RNA bacteriophage capsid protein; 6-stranded beta-sheet followed with 2 helices; meander
Superfamily: RNA bacteriophage capsid protein
Family: RNA bacteriophage capsid protein
Secondary structure prediction results of the family of all-alpha proteins
Eight helices of the following sequences are selected, each sequence is threaded onto each of the others, and the shifts from the real structures are shown below.
Rows: template sequences. Column groups: target sequences (one value per helix).
Template | Target 1aoa           | Target 1dxx           | Target 1qag
1aoa     | T T T T T T T 30      | T T T -6 -1 -1 T 27   | 1 T T T T 12 T T
1dxx     | T T T -4 1 5 4 9      | T -3 T -5 T T T T     | 3 T T T 1 T 41 37
1qag     | -1 T T -5 T 4 41 32   | 5 1 T -6 -1 T -13 -1  | T T T T T T T T
Secondary structure prediction results of the family of all-beta proteins
Nine beta strands of the following sequences are selected, each sequence is threaded onto each of the others, and the shifts from the real structures are shown below.
Rows: template sequences. Column groups: target sequences (one value per strand).
Template | Target 1acx          | Target 1hzk          | Target 1noa           | Target 2mcm
1acx     | T T T T T T T T T    | 1 T T T T -2 T T T   | T T T T T -3 -1 T T   | T T T 2 T T 1 2 4
1hzk     | T T T T T T T T T    | T T T T T T T T T    | T T T T T 1 4 T T     | T T T T -3 -3 T T T
1noa     | T T T T T T 1 T T    | T T T T T -1 T T T   | T T T T T T 5 T T     | T T T T -2 -2 T T T
2mcm     | T T T T T T T T T    | T T T T T T T T T    | T T T 1 T T T T T     | T T T 1 T -1 T T T
Secondary structure prediction results of the family of alpha+beta proteins
Rows: template sequences. Column groups: target sequences (one value per secondary structural unit).
Template | Target 1dwn        | Target 1e6t          | Target 1frs          | Target 1qbe        | Target 1una
1dwn     | T T 4 T T T T 4    | T T T T T T T 5      | T T T T T T T T      | T T T T T T T 5    | T T T -1 -1 T T 1
1e6t     | T T T T T T T T    | T T T T T T T T      | T T T T T T T T      | T T T T T T T T    | T T T T T T T T
1frs     | T T T T T T T T    | T T T T T T T T      | T T T T T T T T      | T T T T T T T -1   | T T T T T T 1 T
1qbe     | -1 T T T 1 T T 3   | T T -3 -11 T T T 4   | T T -3 -11 T T T 1   | T T T T T T T T    | -1 T T T T T T 1
1una     | T T T 1 1 T T T    | T T T -5 T T T T     | T T T -5 T T 1 T     | 1 T T T 2 T T T    | T T T T T T T T
Conclusion for fitting to a given fold
We obtained very good results for all-beta and alpha+beta proteins.
All-alpha proteins generally gave good results, but we observed some shifts for the all-alpha structures.
The main reason for the alpha shifts was that our all-alpha sequences had very different lengths and highly variable sequences, which lowered the contribution from the profile scores.
Fold Classification Results
1ubi Threading Results
[Figure: energy values (y-axis, -3000 to -1000) versus protein ID (x-axis, 0 to 600). Labelled points: 1e0q, 1f9j, 1ubi. Legend: other members of 1ubi's family.]
All Beta
1acx Threading Results
[Figure: energy values (y-axis, -3000 to -1000) versus protein ID (x-axis, 0 to 700). Labelled points: 1c01, 1zfo, 1klo, 1acx.]
All Alpha
1bhd Threading Results
[Figure: energy values (y-axis, -3000 to -1000) versus protein ID (x-axis, 0 to 700). Labelled points: 1hg6, 1dfu, 1qld, 2pcf, 1bhd.]
CONCLUSION
By optimising the fitting process with a genetic algorithm and using an appropriate target function, we obtained quite clear classifications at the family level.
It is also possible to use this method for superfamily classification by adjusting only the profile information and the weights.
We also applied the method to 6 CASP proteins and correctly classified their folds.
HOMOLOGY MODELLING
Using database search algorithms find the sequence with known structure that best matches the query sequence
Assign the structure of the core regions obtained from the structure database to the query sequence
Find the structure of the intervening loops using loop closure algorithms
Homology Modeling: How it works
o Find template
o Align target sequence with template
o Generate model:
  - add loops
  - add sidechains
o Refine model
Prediction of Protein Structures
Examples – a few good examples
[Figure: actual and predicted structures shown side by side for several proteins]
Prediction of Protein Structures
Not so good example
1esr
TURALIGN: Constrained Structural Alignment Tool For Structure Prediction
Motivation -1: Structure based Alignment
Most of the alignment algorithms are only sequence dependent (Needleman-Wunsch & Smith-Waterman )
Functional sites are usually mismatched
Fail to give the best alignment between highly divergent sequences having very similar structures
Motivation -2:Structure prediction of novel proteins
Using evolutionary information on sequence confirmation
Secondary structure predictions and possible locations of turns should be used for threading
Preservation of favorable contacts
Methods
Motif Alignment Based on Dynamic Programming Approach
Recursive Smith-Waterman Local Alignment Algorithm with Affine Gap Penalty
Secondary Structure Similarity Matrix
BLOSUM62
Position Specific Entropy Information
Filtering step using neighbourhood information
Jernigan Contact Potential Matrix
Motif Alignment Using Dynamic Programming
Functional sites and motifs in the template protein can either be given as input to the program or detected with the ScanProsite* tool.
*Gattiker, A. et al. Applied Bioinformatics 2002; 1(2): 107-108.
Recursive Smith-Waterman Local Alignment Algorithm with Affine Gap Penalty
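[Figure: recursive alignment scheme with a central score pc and flanking scores pL and pR, annotated with the conditions pL > 0.9 × pc and pR > 0.9 × pc. Numbers shown in the figure: 50, 47.]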
Recursive Smith-Waterman Local Alignment Algorithm with Affine Gap Penalty
Build 3 matrices: A for the matches; B for the gaps on template; C for gaps on target.
S(i,j): pairwise similarity score
go: gap opening penalty
ge: gap extension penalty
\[ A(i,j) = \max_{X \in \{A,B,C\}} X(i-1,j-1) + S(i,j) \]
\[ B(i,j) = \max\{\, A(i-1,j) - go - ge,\; B(i-1,j) - ge,\; C(i-1,j) - go - ge \,\} \]
\[ C(i,j) = \max\{\, A(i,j-1) - go - ge,\; B(i,j-1) - go - ge,\; C(i,j-1) - ge \,\} \]
Tracing back: include the paths that have score > 0.9 × Max
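The three-matrix scheme (A for matches, B and C for the two gap directions) can be sketched as below. This is a minimal score-only version under stated assumptions: the match matrix is floored at zero as in standard Smith-Waterman local alignment, the recursive re-alignment and the 0.9 × Max traceback are left out, and the function name `affine_local_score` and the toy `identity_similarity` scoring are hypothetical.

```python
NEG_INF = float("-inf")


def affine_local_score(target, template, similarity, go=2.0, ge=1.0):
    """Best local alignment score with affine gaps (opening go, extension ge).

    A[i][j]: best score ending with target[i-1] matched to template[j-1]
    B[i][j]: best score ending with a gap that consumes target[i-1]
    C[i][j]: best score ending with a gap that consumes template[j-1]
    """
    n, m = len(target), len(template)
    A = [[0.0] * (m + 1) for _ in range(n + 1)]
    B = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    C = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    best = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diagonal = max(A[i - 1][j - 1], B[i - 1][j - 1], C[i - 1][j - 1])
            s = similarity(target[i - 1], template[j - 1])
            # Assumption: zero floor so that an alignment can start anywhere.
            A[i][j] = max(0.0, diagonal + s)
            B[i][j] = max(A[i - 1][j] - go - ge,
                          B[i - 1][j] - ge,
                          C[i - 1][j] - go - ge)
            C[i][j] = max(A[i][j - 1] - go - ge,
                          B[i][j - 1] - go - ge,
                          C[i][j - 1] - ge)
            best = max(best, A[i][j])
    return best


def identity_similarity(a, b):
    """Toy similarity: +2 for identical residues, -1 otherwise."""
    return 2.0 if a == b else -1.0


if __name__ == "__main__":
    print(affine_local_score("AAAGGG", "AAATGGG", identity_similarity))
```

In the method described on the slides, the similarity would be the combined score S(i,j) rather than this toy identity function.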
Recursive Smith-Waterman Local Alignment Algorithm with Affine Gap Penalty
SSS(i,j): Secondary Structure Similarity
SS(i,j): Sequence Similarity
TS(i,j): Turn Similarity
sc: Secondary Structure Similarity Coefficient
ac: Sequence Similarity Coefficient
tc: Turn Similarity Coefficient
\[ S(i,j) = sc \cdot SSS(i,j) + ac \cdot SS(i,j) + tc \cdot TS(i,j) \]
Secondary Structure Similarity
\[ SSS(i,j) = sc \sum_{k=1}^{3} S(T(i),k)\, P(k,j) \]
Secondary Structure Similarity Matrix* S:
      H     E     L
H     2   -15    -4
E   -15     4    -4
L    -4    -4     2
Example target profile from Secondary Structure Prediction Servers (predicted states H H L at three positions):
H:  0.7  0.5  0.0
E:  0.2  0.4  0.3
L:  0.1  0.1  0.6
T(i): Secondary Structure of Template at position i
P(., j): Secondary Structure profile of Target at position j
sc: Secondary Structure Similarity Coefficient
*Wallqvist,A et al. Bioinformatics. 2000 Nov;16(11):988-1002.
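As a worked example of the secondary structure similarity term, the sketch below multiplies the H/E/L similarity matrix row for the template state by the target's predicted state probabilities. The matrix values are the ones shown on the slide; the function name `secondary_structure_similarity` and the dictionary layout are assumptions for this example.

```python
# Secondary structure similarity matrix from the slide (H, E, L states).
SS_MATRIX = {
    "H": {"H": 2, "E": -15, "L": -4},
    "E": {"H": -15, "E": 4, "L": -4},
    "L": {"H": -4, "E": -4, "L": 2},
}


def secondary_structure_similarity(template_state, target_profile, sc=1.0):
    """sc times the sum over states k of S(T(i), k) * P(k, j).

    template_state: observed state of the template residue ("H", "E" or "L")
    target_profile: predicted probabilities of H, E, L at the target position
    """
    row = SS_MATRIX[template_state]
    return sc * sum(row[k] * target_profile[k] for k in ("H", "E", "L"))


if __name__ == "__main__":
    # First column of the example profile: H 0.7, E 0.2, L 0.1.
    column = {"H": 0.7, "E": 0.2, "L": 0.1}
    for state in ("H", "E", "L"):
        print(state, secondary_structure_similarity(state, column))
```

With this matrix the strong helix/strand penalty of -15 dominates, so even a 0.2 strand probability pulls the score for a helical template residue below zero.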
Sequence Similarity
Multiple Sequence Alignment of Template Protein’s family*
...ALVKLI...
...A-IEII...
...AL-KLI...
\[ S(j) = -\sum_{i=1}^{20} P(i,j) \log P(i,j) \]
\[ C(i) = \frac{1}{1 + S(i)} \]
\[ SS(i,j) = ac \cdot C(i) \cdot \mathrm{BLOSUM62}(i,j) \]
S(j): Entropy at position j of template (in the sum, i runs over the 20 amino acid types)
P: Family Profile Matrix
C(i): Conservation score at position i of template
*Glaser,F. Et al. Bioinformatics 19:163-164(2003)
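The entropy-based conservation score can be sketched for a single alignment column as follows. The natural logarithm and the choice to ignore gap characters when estimating frequencies are assumptions (the slides specify neither), and the function names are hypothetical.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy of the residue frequencies in one alignment column.

    Assumptions: natural logarithm; gap characters ("-") are ignored.
    """
    residues = [aa for aa in column if aa != "-"]
    if not residues:
        return 0.0
    total = len(residues)
    counts = Counter(residues)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def conservation(column):
    """Conservation score C = 1 / (1 + S) for one alignment column."""
    return 1.0 / (1.0 + column_entropy(column))


if __name__ == "__main__":
    # Columns of the three-sequence alignment snippet from the slide.
    rows = ["ALVKLI", "A-IEII", "AL-KLI"]
    for j, column in enumerate(zip(*rows)):
        print(j, "".join(column), round(conservation(column), 3))
```

A fully conserved column has entropy 0 and conservation 1, so its substitution scores enter the sequence similarity term at full weight.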
Turn Similarity
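\[ TS(i,j) = tc \cdot 4 \cdot T(i) \cdot P(T,j) \]
Example target profile from Turn Prediction Servers (predicted states T T N at three positions):
T:  0.7  0.5  0.0
N:  0.3  0.5  1.0
T(i) = 1 if template position i is a turn (T); else 0
P(., j): Turn profile of Target at position j
tc: Turn Similarity Coefficient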
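Gap Penalties
Gap aligned with a loop residue (...L... over ...-...): go and ge are rescaled by a fractional factor (only the digits 2 and 3 survive from the slide)
Gap aligned with a helix or strand residue (...H/E... over ...-...): gapSec = 20
And vice versa...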
Filtering
For each of the motif alignments get the 25 best alignments
Build a connectivity map of template protein and thread onto target.
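\[ \Gamma(i,j) = \begin{cases} 1 & \text{if } i \neq j \text{ and } R_{ij} \le 3.7\ \text{Å} \\ 0 & \text{if } i \neq j \text{ and } R_{ij} > 3.7\ \text{Å} \\ -\sum_{k,\, k \neq i} \Gamma(i,k) & \text{if } i = j \end{cases} \]
\[ CS = cs \sum_{i} \sum_{j} \Gamma(i,j)\, J(i,j) \]
Γ: Kirchhoff Matrix
J: Jernigan Contact Potential Matrix*
Get the best 25 alignments according to the score:
\[ TS = S + CS \]
*Miyazawa S, Jernigan R L. (1985) Macromolecules; 18:534–552.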
RESULTS
To test our program we have chosen 3 families from the ASTRAL40* protein list.
Citrate Synthase: 1csh, 1iomA, 1k3pA
Methionine aminopeptidase: 1b6a, 1xgsA
Methyltransferase: 1fp2A, 1fp1D
As testing measure: RMSD between the predicted and actual structure of target.
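For reference, RMSD over matched atom positions can be computed as in the sketch below. It assumes the two coordinate sets are already superposed and listed in the same residue order; an optimal superposition step (for example the Kabsch algorithm) is not included, and the function name `rmsd` is chosen for this example.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally long coordinate lists.

    Each list holds (x, y, z) tuples; the structures are assumed to be
    superposed already.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))


if __name__ == "__main__":
    predicted = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    actual = [(0.0, 0.0, 3.0), (1.0, 0.0, 3.0)]
    print(rmsd(predicted, actual))
```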
RESULTS
For all the experiments done, our algorithm perfectly matched functional sites and motifs given as input to the program.
1csh vs 1iomA: RMSD = 2.50
1csh vs 1k3pA: RMSD = 2.12
1k3pA vs 1iomA: RMSD = 3.03
1b6a vs 1xgsA: RMSD = 2.23
1fp2A vs 1fp1D: RMSD = 2.98
On average, we got the best results for the 5 experiments (RMSD = 2.57) with ac: 0.4, sc: 0.4, tc: 0.2, cc: 0
User Interface of TURALIGN
DOMAIN INTERACTIONS