
Computational Methods in Molecular Modelling

Uğur Sezerman
Biological Sciences and Bioengineering Program
Sabancı University, Istanbul

Motivation

Knowing the structure of molecules enables us to understand their mechanisms of function

Current experimental techniques
  X-ray crystallography
  NMR

PROTEIN FOLDING PROBLEM

STARTING FROM AMINO ACID SEQUENCE

FINDING THE STRUCTURE OF PROTEINS IS CALLED THE PROTEIN FOLDING PROBLEM

Forces driving protein folding

It is believed that hydrophobic collapse is a key driving force for protein folding
  Hydrophobic core
  Polar surface interacting with solvent
Minimum volume (no cavities)
  van der Waals
Disulfide bond formation stabilizes
Hydrogen bonds
Polar and electrostatic interactions

SECONDARY STRUCTURE PREDICTION

Intro. To Struc.(Tooze and Branden)

Secondary Structure Prediction

AGVGTVPMTAYGNDIQYYGQVT…
A-VGIVPM-AYGQDIQY-GQVT…
AG-GIIP--AYGNELQ--GQVT…
AGVCTVPMTA---ELQYYG--T…

AGVGTVPMTAYGNDIQYYGQVT…
----hhhHHHHHHhhh--eeEE…

Chou-Fasman Parameters

Name            Abbrv  P(a)  P(b)  P(turn)  f(i)   f(i+1)  f(i+2)  f(i+3)
Alanine         A      142    83     66     0.06   0.076   0.035   0.058
Arginine        R       98    93     95     0.07   0.106   0.099   0.085
Aspartic Acid   D      101    54    146     0.147  0.11    0.179   0.081
Asparagine      N       67    89    156     0.161  0.083   0.191   0.091
Cysteine        C       70   119    119     0.149  0.05    0.117   0.128
Glutamic Acid   E      151    37     74     0.056  0.06    0.077   0.064
Glutamine       Q      111   110     98     0.074  0.098   0.037   0.098
Glycine         G       57    75    156     0.102  0.085   0.19    0.152
Histidine       H      100    87     95     0.14   0.047   0.093   0.054
Isoleucine      I      108   160     47     0.043  0.034   0.013   0.056
Leucine         L      121   130     59     0.061  0.025   0.036   0.07
Lysine          K      114    74    101     0.055  0.115   0.072   0.095
Methionine      M      145   105     60     0.068  0.082   0.014   0.055
Phenylalanine   F      113   138     60     0.059  0.041   0.065   0.065
Proline         P       57    55    152     0.102  0.301   0.034   0.068
Serine          S       77    75    143     0.12   0.139   0.125   0.106
Threonine       T       83   119     96     0.086  0.108   0.065   0.079
Tryptophan      W      108   137     96     0.077  0.013   0.064   0.167
Tyrosine        Y       69   147    114     0.082  0.065   0.114   0.125
Valine          V      106   170     50     0.062  0.048   0.028   0.053
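One way the P(a) and P(b) columns can be used is to average them over a sliding window and compare the two averages. The sketch below does only that; it is not the full Chou-Fasman procedure. The window length of 6, the neutral threshold of 100 and the function names are assumptions made for illustration.

```python
# P(a) and P(b) values copied from the Chou-Fasman parameter table.
P_ALPHA = {"A": 142, "R": 98, "D": 101, "N": 67, "C": 70, "E": 151, "Q": 111,
           "G": 57, "H": 100, "I": 108, "L": 121, "K": 114, "M": 145, "F": 113,
           "P": 57, "S": 77, "T": 83, "W": 108, "Y": 69, "V": 106}
P_BETA = {"A": 83, "R": 93, "D": 54, "N": 89, "C": 119, "E": 37, "Q": 110,
          "G": 75, "H": 87, "I": 160, "L": 130, "K": 74, "M": 105, "F": 138,
          "P": 55, "S": 75, "T": 119, "W": 137, "Y": 147, "V": 170}


def window_average(seq, table, window=6):
    """Mean propensity of every run of `window` consecutive residues."""
    return [sum(table[r] for r in seq[i:i + window]) / window
            for i in range(len(seq) - window + 1)]


def classify_windows(seq, window=6, threshold=100):
    """Label each window 'H', 'E' or '-' (assumed rule: the larger average wins
    if it is above the threshold)."""
    labels = []
    for a, b in zip(window_average(seq, P_ALPHA, window),
                    window_average(seq, P_BETA, window)):
        if a > threshold and a >= b:
            labels.append("H")
        elif b > threshold and b > a:
            labels.append("E")
        else:
            labels.append("-")
    return labels


if __name__ == "__main__":
    # Query sequence from the secondary structure prediction slide.
    print("".join(classify_windows("AGVGTVPMTAYGNDIQYYGQVT")))
```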

Computational Approaches

Ab initio methods
Threading
Comparative Modelling
Fragment Assembly

Ab-initio protein structure prediction as an optimization problem

1. Define a function that maps protein structures to some quality measure.
2. Solve the computational problem of finding an optimal structure.

[Figure: energy plotted against conformation]

Chen Keasar, BGU

A dream function
  Has a clear minimum in the native structure.
  Has a clear path towards the minimum.
  Global optimization algorithm should find the native structure.


An approximate function
  Easier to design and compute.
  Native structure not always the global minimum.
  Global optimization methods do not converge.
  Many alternative models (decoys) should be generated.
  No clear way of choosing among them.

Decoy set


Fold Optimization

Simple lattice models (HP-models)
  Two types of residues: hydrophobic and polar
  2-D or 3-D lattice
  The only force is hydrophobic collapse
  Score = number of HH contacts

Scoring Lattice Models

H/P model scoring:
  Sometimes: penalize for buried polar or surface hydrophobic residues
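The basic HP score (number of HH contacts) can be sketched as follows for a 2-D square lattice. The function name and the coordinate representation are assumptions; a contact is counted here only between H residues that are lattice neighbours but not consecutive in the chain, and the optional penalty terms are not included.

```python
def hp_score(sequence, coords):
    """Count H-H contacts for a chain placed on a 2-D square lattice.

    sequence: string of 'H' and 'P'; coords: one (x, y) lattice point per residue.
    """
    index_at = {xy: i for i, xy in enumerate(coords)}
    if len(index_at) != len(coords):
        raise ValueError("two residues occupy the same lattice point")
    contacts = 0
    for i, (x, y) in enumerate(coords):
        if sequence[i] != "H":
            continue
        # Look only right and up so that every neighbouring pair is seen once.
        for neighbour in ((x + 1, y), (x, y + 1)):
            j = index_at.get(neighbour)
            if j is not None and sequence[j] == "H" and abs(i - j) > 1:
                contacts += 1
    return contacts


if __name__ == "__main__":
    u_shape = [(0, 0), (1, 0), (1, 1), (0, 1)]
    print(hp_score("HPPH", u_shape))
```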

Learning from Lattice Models

Ken Dill ~ 1997

Hydrophobic zipper effect

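[Figure: spectrum of simplified protein models (Hinds & Levitt). Labels for the basic element: electrons & protons, atom, extended atom, half a residue, residue, some residues. Labels for the conformation space: continuous, torsion angle lattice, fine square lattice, diamond lattice, fragments.]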

What can we do with lattice models?

For smaller polypeptides, exhaustive search can be used
  Looking at the “best” fold, even in such a simple model, can teach us interesting things about the protein folding process
For larger chains, other optimization and search methods must be used
  Greedy, branch and bound
  Evolutionary computing, simulated annealing
  Graph theoretical methods
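For a short chain, the exhaustive search mentioned above can be sketched by enumerating every self-avoiding walk on a 2-D square lattice and keeping the fold with the most HH contacts. The function name is an assumption, symmetric copies of the same fold are not pruned, and the run time grows exponentially with chain length, so this is only practical for very small examples.

```python
def best_fold(sequence):
    """Return (max HH contacts, coordinates) over all self-avoiding walks."""
    if not sequence:
        return 0, []
    moves = ((1, 0), (-1, 0), (0, 1), (0, -1))
    best_score, best_coords = -1, None

    def extend(coords, occupied, score):
        nonlocal best_score, best_coords
        i = len(coords)                      # index of the residue to place next
        if i == len(sequence):
            if score > best_score:
                best_score, best_coords = score, list(coords)
            return
        x, y = coords[-1]
        for dx, dy in moves:
            pos = (x + dx, y + dy)
            if pos in occupied:              # keep the walk self-avoiding
                continue
            gained = 0
            if sequence[i] == "H":
                for ex, ey in moves:
                    j = occupied.get((pos[0] + ex, pos[1] + ey))
                    # Count contacts with earlier, non-consecutive H residues.
                    if j is not None and j < i - 1 and sequence[j] == "H":
                        gained += 1
            coords.append(pos)
            occupied[pos] = i
            extend(coords, occupied, score + gained)
            coords.pop()
            del occupied[pos]

    extend([(0, 0)], {(0, 0): 0}, 0)
    return best_score, best_coords


if __name__ == "__main__":
    score, fold = best_fold("HPHPPHHPHH")
    print(score, fold)
```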

Inverse Protein Folding Problem

Given a structure (or a functionality), identify an amino acid sequence whose fold will be that structure (exhibit that functionality).

Crucial problem in drug design.
NP-hard under most models.

PROTEIN THREADING

Thread the given sequence onto the different structural families that exist in structural databases

Choose the optimum structure based on the potential energy function used (e.g., contact potential, free energy)

Threading: Fold recognition

Given:
  Sequence: IVACIVSTEYDVMKAAR…
  A database of molecular coordinates
Map the sequence onto each fold
Evaluate
  Objective 1: improve scoring function
  Objective 2: folding

Protein Fold Families (CATH,SCOP)

CATH website www.cathdb.info

Genetic Algorithm used as a search tool

We are searching for the minima of our fitness function, which is composed of profile and contact energy terms.

In this problem, value encoding has been used. Parents are represented as strings of positions. The population size is 50.

A sample parent (string of positions) is shown below:

1 2 3 4 5 10 11 12 13 14 23 24 25 26 27 28 29 30 31 32 55 56 57 58

A branch and bound algorithm has been used to produce random initial parents.

Mutation:

The mutation operator shifts the position of a structure either to the right or to the left by some units.

Crossover:

Two-point crossover is applied, in which selected suitable structures are exchanged between two parents.
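The two operators can be sketched as below. This is an illustration under stated assumptions, not the authors' implementation: a parent is stored as a list of (start, length) blocks, one per structural unit, a shift or exchange is accepted only if the blocks stay ordered and non-overlapping, and all function names and the `max_shift` parameter are hypothetical.

```python
import random


def to_positions(blocks):
    """Expand (start, length) blocks into the flat string of positions."""
    return [p for start, length in blocks for p in range(start, start + length)]


def is_valid(blocks, seq_len):
    """Blocks must be ordered, non-overlapping and inside 1..seq_len."""
    prev_end = 0
    for start, length in blocks:
        end = start + length - 1
        if start <= prev_end or end > seq_len:
            return False
        prev_end = end
    return True


def mutate(blocks, seq_len, max_shift=3, rng=random):
    """Shift one randomly chosen block left or right by 1..max_shift units."""
    k = rng.randrange(len(blocks))
    shift = rng.choice((-1, 1)) * rng.randint(1, max_shift)
    child = list(blocks)
    child[k] = (blocks[k][0] + shift, blocks[k][1])
    return child if is_valid(child, seq_len) else list(blocks)


def two_point_crossover(a, b, seq_len, rng=random):
    """Exchange the blocks between two cut points; keep parents if invalid."""
    i, j = sorted(rng.sample(range(len(a) + 1), 2))
    c1 = a[:i] + b[i:j] + a[j:]
    c2 = b[:i] + a[i:j] + b[j:]
    if is_valid(c1, seq_len) and is_valid(c2, seq_len):
        return c1, c2
    return list(a), list(b)


if __name__ == "__main__":
    # The sample parent shown on the slide, written as blocks.
    parent = [(1, 5), (10, 5), (23, 10), (55, 4)]
    print(to_positions(parent))
    print(mutate(parent, seq_len=60))
```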

Our Aim

In this research, we have threaded a structurally unknown protein sequence to over 2200 SCOP family fold proteins and sought the best fitting structural family.

We also tried to find the optimum fit of the query sequence to a given fold.

Energy function is a combination of
  The sequence profile energy
  Contact potential energy (inter- and intra-structural residues are taken into account)

TotalEnergy = p1 (ProfileEnergy) + c1 (ContactEnergy)

The weights are chosen such that the contributing energy from profile and contact energy terms will be equal.
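The slides do not say how p1 and c1 are computed. One simple possibility, shown here purely as an assumption, is to fix p1 = 1 and scale c1 so that the mean absolute contribution of the two terms is equal over a sample of threadings. The function names are hypothetical.

```python
def balance_weights(profile_energies, contact_energies):
    """Return (p1, c1) so both terms contribute equally on average (assumed rule)."""
    mean_profile = sum(abs(e) for e in profile_energies) / len(profile_energies)
    mean_contact = sum(abs(e) for e in contact_energies) / len(contact_energies)
    p1 = 1.0
    c1 = mean_profile / mean_contact
    return p1, c1


def total_energy(profile_energy, contact_energy, p1, c1):
    """TotalEnergy = p1 * ProfileEnergy + c1 * ContactEnergy."""
    return p1 * profile_energy + c1 * contact_energy


if __name__ == "__main__":
    # Made-up sample energies, for illustration only.
    p1, c1 = balance_weights([-10.0, -20.0], [-100.0, -200.0])
    print(p1, c1, total_energy(-15.0, -150.0, p1, c1))
```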

Fitness Function

Profile Energy

We do structural alignment on all selected secondary structural units of the sequences.

Same numbered secondary structural units are selected.

Length of the units may differ.

--  P  E  E  L  L  L  R  W  A  N  F  H  L  E  N    (1aoa)
--  S  E  K  I  L  L  K  W  V  R  Q  T  -- -- --   (1qag)
N   S  E  K  I  L  L  S  W  V  R  Q  S  T  R  --   (1dxx)

Sixth helices of the selected all-alpha sequences

Profile Matrix calculated from a structure group

Rows: positions. Columns: residue names. Entries: profile scores.

Position      A      C      D      E      F      G      H      I      K      L      M      N      P      Q      R      S      T      V      W      Y      -
1         -0.33  -0.67   0.68   0.01  -1.33   0.01   0.34     -1   0.01  -1.33  -0.67   2.34  -0.67   0.01  -0.33   0.34   0.01     -1  -1.33  -0.67   4.01
2          0.34     -2  -0.33     -1  -3.33  -0.67  -1.33     -3  -0.33  -3.33  -2.33   0.01   2.68  -0.33  -1.67   3.01   1.01  -2.33     -4  -2.33   0.01
3            -1     -3   2.01   6.01     -3     -3   0.01     -4   1.01     -3     -2   0.01     -1   2.01   0.01     -1     -1     -3     -3     -2   0.01
4            -1     -3   0.01   2.68  -3.67  -2.33   0.01  -3.33   4.34     -3     -2   0.01     -1   2.01   2.01  -0.33     -1     -3     -3     -2   0.01
5         -1.33     -2     -4  -3.67   0.34     -4  -3.67   4.01     -3   3.01   2.34  -3.33  -3.33  -2.67  -3.67     -3     -1   3.01  -2.67     -1   0.01
6            -2     -2     -4     -3   1.01     -4     -3   2.01     -3   5.01   3.01     -4     -4     -2     -3     -3     -1   1.01     -2     -1   0.01
7            -2     -2     -4     -3   1.01     -4     -3   2.01     -3   5.01   3.01     -4     -4     -2     -3     -3     -1   1.01     -2     -1   0.01
8          0.01     -2  -0.67  -0.67     -3     -1  -0.67  -3.33   1.01     -3     -2   0.34  -1.67   0.34   1.68   3.01   1.01  -2.33  -3.67  -1.67   0.01
9            -3     -5     -5     -3   1.01     -3     -3     -3     -3     -2     -1     -4     -4     -1     -3     -4     -3     -3  15.01   2.01   0.01
10         1.68     -1  -3.33  -2.33  -1.67  -2.67  -3.33   2.34  -2.33   0.01   0.34  -2.33  -2.33  -2.33  -2.67     -1   0.01   3.34     -3  -1.33   0.01
11        -1.67  -3.33  -0.67   0.01  -3.33     -2   0.34  -3.67   2.01  -3.33     -2   1.68  -2.67   0.68   4.34  -0.33  -0.67     -3  -3.33  -1.33   0.01
12        -1.67  -2.67  -1.67   0.34   0.01  -2.67   0.34     -2   0.01     -1   0.01  -1.33     -2   3.34  -0.33     -1  -1.33  -2.33  -0.33   0.68   0.01
13        -0.33  -1.67  -0.67  -0.67     -2  -1.33   2.34  -2.67  -0.33  -2.33  -1.33   0.68  -1.33   0.01  -0.67   2.01   1.68     -2  -3.33  -0.67   0.01
14        -0.67     -1  -1.67  -1.33  -0.33     -2  -1.67   0.34  -1.33   1.34   0.68  -1.33  -1.67     -1  -1.33  -0.33   1.34   0.34  -1.67     -1   2.01
15           -1  -2.33   0.01   2.01     -2     -2   0.01  -2.67   1.34     -2  -1.33  -0.33  -1.33   1.01   2.34  -0.67  -0.67     -2     -2     -1   2.01
16        -0.33  -0.67   0.68   0.01  -1.33   0.01   0.34     -1   0.01  -1.33  -0.67   2.34  -0.67   0.01  -0.33   0.34   0.01     -1  -1.33  -0.67   4.01

Contact Potential Energy

Based on contact frequencies counted in a database of known structures and converted into energy values.

In this study, contact potential energy is the sum of energies of the residues that are closer than seven angstroms in distance to each other.

Jernigan’s & Dill’s Contact Potential Energy Tables have been used.
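The contact energy sum described above can be sketched as follows. The seven angstrom cutoff comes from the slide; everything else is an assumption made for illustration: one coordinate per residue, all residue pairs (including chain neighbours) being eligible, and a toy potential table whose values are made up and are not the Jernigan or Dill values.

```python
import math


def pair_energy(potential, a, b):
    """Look up a symmetric pair potential stored under either key order."""
    if (a, b) in potential:
        return potential[(a, b)]
    return potential[(b, a)]


def contact_energy(residues, coords, potential, cutoff=7.0):
    """Sum pair energies over all residue pairs closer than `cutoff` angstroms."""
    total = 0.0
    for i in range(len(residues)):
        for j in range(i + 1, len(residues)):
            if math.dist(coords[i], coords[j]) < cutoff:
                total += pair_energy(potential, residues[i], residues[j])
    return total


if __name__ == "__main__":
    # Toy values for illustration only; not taken from any published table.
    toy_potential = {("L", "L"): -1.0, ("L", "K"): -0.2, ("K", "K"): 0.1}
    coords = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
    print(contact_energy("LLK", coords, toy_potential))
```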

Selected Benchmark Set

All Alpha Set: 1aoa, 1dxx, 1qag
  Fold: Calponin-homology domain, CH-domain; core: 4 helices: bundle
  Superfamily: Calponin-homology domain, CH-domain
  Family: Calponin-homology domain, CH-domain

All Beta Set: 1acx, 1hzk, 1noa, 2mcm
  Fold: Immunoglobulin-like beta-sandwich; sandwich; 7 strands in 2 sheets
  Superfamily: Actinoxanthin-like
  Family: Actinoxanthin-like

Alpha+Beta Set: 1dwn, 1e6t, 1frs, 1qbe, 1una
  Fold: RNA bacteriophage capsid protein; 6-stranded beta-sheet followed with 2 helices; meander
  Superfamily: RNA bacteriophage capsid protein
  Family: RNA bacteriophage capsid protein


Secondary structure prediction results of the family of all alpha proteins

Eight helices are selected from each of the following sequences; each sequence is threaded onto the others, and the shifts from the real structures are shown below.

Rows: template sequences. Columns: target sequences, eight helices each.

Template |             1aoa               |             1dxx               |             1qag
1aoa     |   T   T   T   T   T   T   T  30 |   T   T   T  -6  -1  -1   T  27 |   1   T   T   T   T  12   T   T
1dxx     |   T   T   T  -4   1   5   4   9 |   T  -3   T  -5   T   T   T   T |   3   T   T   T   1   T  41  37
1qag     |  -1   T   T  -5   T   4  41  32 |   5   1   T  -6  -1   T -13  -1 |   T   T   T   T   T   T   T   T

Secondary structure prediction results of the family of all beta proteins

Nine beta sheets are selected from each of the following sequences; each sequence is threaded onto the others, and the shifts from the real structures are shown below.

Rows: template sequences. Columns: target sequences, nine units each.

Template |               1acx                 |               1hzk                 |               1noa                 |               2mcm
1acx     |   T   T   T   T   T   T   T   T   T |   1   T   T   T   T  -2   T   T   T |   T   T   T   T   T  -3  -1   T   T |   T   T   T   2   T   T   1   2   4
1hzk     |   T   T   T   T   T   T   T   T   T |   T   T   T   T   T   T   T   T   T |   T   T   T   T   T   1   4   T   T |   T   T   T   T  -3  -3   T   T   T
1noa     |   T   T   T   T   T   T   1   T   T |   T   T   T   T   T  -1   T   T   T |   T   T   T   T   T   T   5   T   T |   T   T   T   T  -2  -2   T   T   T
2mcm     |   T   T   T   T   T   T   T   T   T |   T   T   T   T   T   T   T   T   T |   T   T   T   1   T   T   T   T   T |   T   T   T   1   T  -1   T   T   T

Secondary structure prediction results of the family of alpha-beta proteins

Rows: template sequences. Columns: target sequences, eight units each.

Template |             1dwn               |             1e6t               |             1frs               |             1qbe               |             1una
1dwn     |   T   T   4   T   T   T   T   4 |   T   T   T   T   T   T   T   5 |   T   T   T   T   T   T   T   T |   T   T   T   T   T   T   T   5 |   T   T   T  -1  -1   T   T   1
1e6t     |   T   T   T   T   T   T   T   T |   T   T   T   T   T   T   T   T |   T   T   T   T   T   T   T   T |   T   T   T   T   T   T   T   T |   T   T   T   T   T   T   T   T
1frs     |   T   T   T   T   T   T   T   T |   T   T   T   T   T   T   T   T |   T   T   T   T   T   T   T   T |   T   T   T   T   T   T   T  -1 |   T   T   T   T   T   T   1   T
1qbe     |  -1   T   T   T   1   T   T   3 |   T   T  -3 -11   T   T   T   4 |   T   T  -3 -11   T   T   T   1 |   T   T   T   T   T   T   T   T |  -1   T   T   T   T   T   T   1
1una     |   T   T   T   1   1   T   T   T |   T   T   T  -5   T   T   T   T |   T   T   T  -5   T   T   1   T |   1   T   T   T   2   T   T   T |   T   T   T   T   T   T   T   T

Conclusion for fitting to a given fold

We obtained very good results for all-beta and alpha+beta proteins.

All-alpha proteins generally gave good results, but we had some shifts for the all-alpha structures.

The main reason for the alpha shifts was that our all-alpha sequences had very different lengths and highly variable sequences, which lowered the contribution from the profile scores.

Fold Classification Results

[Figure: 1ubi threading results. Energy values (axis from -3000 to -1000) plotted against protein ID (0 to 600). Labelled points: 1e0q, 1f9j, 1ubi. Legend: other members of 1ubi's family.]

All Beta

[Figure: 1acx threading results. Energy values (axis from -3000 to -1000) plotted against protein ID (0 to 700). Labelled points: 1c01, 1zfo, 1klo, 1acx.]

All Alpha

[Figure: 1bhd threading results. Energy values (axis from -3000 to -1000) plotted against protein ID (0 to 700). Labelled points: 1hg6, 1dfu, 1qld, 2pcf, 1bhd.]

CONCLUSION

By optimising the fitting process with a genetic algorithm and using a correct target function, we have obtained quite clear classifications at the family level.

It is also possible to use this method for superfamily classification by adjusting only the profile information and the weights.

We also applied the method to 6 CASP proteins and correctly classified their folds.

HOMOLOGY MODELLING

Using database search algorithms, find the sequence with known structure that best matches the query sequence

Assign the structure of the core regions, obtained from the structure database, to the query sequence

Find the structure of the intervening loops using loop closure algorithms

Homology Modeling: How it works

o Find template

o Align target sequence with template

o Generate model:
  - add loops
  - add sidechains

o Refine model

Prediction of Protein Structures

Examples: a few good examples

[Figure: pairs of actual and predicted structures]

Prediction of Protein Structures

Not so good example

TURALIGN: Constrained Structural Alignment Tool For Structure Prediction

Motivation -1: Structure based Alignment

Most of the alignment algorithms are only sequence dependent (Needleman-Wunsch & Smith-Waterman)
  Functional sites are usually mismatched
  Fail to give the best alignment between highly divergent sequences having very similar structures

Motivation -2:Structure prediction of novel proteins

Using evolutionary information on sequence confirmation

Secondary structure predictions and possible locations of turns should be used for threading

Preservation of favorable contacts

Methods

Motif Alignment Based on Dynamic Algorithm Approach

Recursive Smith-Waterman Local Alignment Algorithm with Affine Gap Penalty
  Secondary Structure Similarity Matrix
  BLOSUM62
  Position Specific Entropy Information

Filtering step using neighbourhood information
  Jernigan Contact Potential Matrix

Motif Alignment Using Dynamic Algorithm


Functional sites and motifs in the template protein can either be given as input to the program, or the PROSITE scan* tool is used to detect the motifs.

*Gattiker, A. et al. Bioinformatics 2002; 1(2): 107-108.

Recursive Smith-Waterman Local Alignment Algorithm with Affine Gap Penalty

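[Figure: recursion diagram for the local alignment. A central alignment with score pc is extended to its left and right; the branches are labelled pL > 0.9 × pc and pR > 0.9 × pc. The values 50 and 47 appear as labels.]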


Build 3 matrices: A for the matches; B for the gaps on template; C for gaps on target.

S(i,j): Pairwise Similarity Score
go: Gap opening penalty
ge: Gap extension penalty

$$A(i,j) = \max_{X \in \{A,B,C\}} \{ X(i-1,j-1) \} + S(i,j)$$

$$B(i,j) = \max \{ A(i-1,j) - go - ge,\; B(i-1,j) - ge,\; C(i-1,j) - go - ge \}$$

$$C(i,j) = \max \{ A(i,j-1) - go - ge,\; B(i,j-1) - go - ge,\; C(i,j-1) - ge \}$$

Tracing back: include the paths that have score > 0.9 × Max
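A three-matrix local alignment with affine gaps can be sketched as follows. The matrices A (matches), B and C (gaps) follow the slide; the floor at zero for A is the usual Smith-Waterman convention and is an assumption here, as are the function name and the toy scoring function. Only the best score is returned; the recursive step and the traceback of all paths above 0.9 × Max are not implemented.

```python
NEG_INF = float("-inf")


def local_affine_score(template, target, score, go, ge):
    """Best local alignment score with gap opening `go` and extension `ge`."""
    n, m = len(template), len(target)
    A = [[0.0] * (m + 1) for _ in range(n + 1)]      # residue aligned to residue
    B = [[NEG_INF] * (m + 1) for _ in range(n + 1)]  # step from (i-1, j)
    C = [[NEG_INF] * (m + 1) for _ in range(n + 1)]  # step from (i, j-1)
    best = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = max(A[i - 1][j - 1], B[i - 1][j - 1], C[i - 1][j - 1])
            # Assumed local-alignment floor: an alignment may restart at zero.
            A[i][j] = max(0.0, diag + score(template[i - 1], target[j - 1]))
            B[i][j] = max(A[i - 1][j] - go - ge,
                          B[i - 1][j] - ge,
                          C[i - 1][j] - go - ge)
            C[i][j] = max(A[i][j - 1] - go - ge,
                          B[i][j - 1] - go - ge,
                          C[i][j - 1] - ge)
            best = max(best, A[i][j])
    return best


def toy_score(a, b):
    """Hypothetical stand-in for S(i,j): +2 for identity, -1 otherwise."""
    return 2.0 if a == b else -1.0


if __name__ == "__main__":
    print(local_affine_score("AAAACCCC", "AAAAGCCCC", toy_score, go=1.0, ge=1.0))
```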


$$S(i,j) = sc \cdot SSS(i,j) + ac \cdot SS(i,j) + tc \cdot TS(i,j)$$

SSS(i,j): Secondary Structure Similarity
SS(i,j): Sequence Similarity
TS(i,j): Turn Similarity

sc: Secondary Structure Similarity Coefficient
ac: Sequence Similarity Coefficient
tc: Turn Similarity Coefficient

Secondary Structure Similarity

$$SSS(i,j) = sc \sum_{k=1}^{3} S(T(i),k) \, P(k,j)$$

sc: Secondary Structure Similarity Coefficient
T(i): Secondary Structure of Template at position i
P(.,j): Secondary Structure profile of Target at position j

Secondary Structure Similarity Matrix*

S      H     E     L
H      2   -15    -4
E    -15     4    -4
L     -4    -4     2

Target profile from Secondary Structure Prediction Servers (one column per target position)

       H     H     L
H:   0.7   0.5   0.0
E:   0.2   0.4   0.3
L:   0.1   0.1   0.6

*Wallqvist, A. et al. Bioinformatics. 2000 Nov;16(11):988-1002.
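As a worked example, the secondary structure similarity of one template position against one column of the target profile can be computed as a weighted sum over the three states. The matrix and the example profile columns are taken from the slide; the function name and the dictionary layout are assumptions.

```python
# Secondary structure similarity matrix from the slide (H, E, L).
SS_MATRIX = {
    "H": {"H": 2, "E": -15, "L": -4},
    "E": {"H": -15, "E": 4, "L": -4},
    "L": {"H": -4, "E": -4, "L": 2},
}


def secondary_structure_similarity(template_state, target_profile, sc=1.0):
    """sc * sum over k in {H, E, L} of S(T(i), k) * P(k, j)."""
    return sc * sum(SS_MATRIX[template_state][k] * target_profile[k]
                    for k in ("H", "E", "L"))


if __name__ == "__main__":
    # First column of the example profile on the slide.
    column = {"H": 0.7, "E": 0.2, "L": 0.1}
    for state in ("H", "E", "L"):
        print(state, secondary_structure_similarity(state, column))
```

A template helix scores poorly against this column despite the 0.7 helix probability, because the large helix-strand penalty multiplies the 0.2 strand probability.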

Sequence Similarity

Multiple Sequence Alignment of Template Protein's family*

...ALVKLI...
...A-IEII...
...AL-KLI...

$$S(i) = -\sum_{a=1}^{20} P(a,i) \, \log P(a,i)$$

$$C(i) = \frac{1}{1 + S(i)}$$

$$SS(i,j) = ac \cdot C(i) \cdot BLOSUM62(i,j)$$

P: Family Profile Matrix; P(a,i) is its entry for amino acid a at position i
S(i): Entropy at position i of template
C(i): Conservation score at position i of template
ac: Sequence Similarity Coefficient

*Glaser, F. et al. Bioinformatics 19:163-164 (2003)
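The entropy-based conservation weight can be sketched for a single alignment column as follows. Several details are assumptions because the slide does not state them: column frequencies are used as the family profile, gaps are ignored, the logarithm is natural, and the substitution score is passed in as a number instead of being looked up in BLOSUM62.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy of the residues in one alignment column (gaps skipped)."""
    residues = [r for r in column if r != "-"]
    counts = Counter(residues)
    n = len(residues)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


def conservation(column):
    """C = 1 / (1 + S): 1 for a fully conserved column, smaller otherwise."""
    return 1.0 / (1.0 + column_entropy(column))


def sequence_similarity(column, substitution_score, ac=1.0):
    """ac * C(i) * substitution score for the aligned residue pair."""
    return ac * conservation(column) * substitution_score


if __name__ == "__main__":
    # Columns of the example alignment ALVKLI / A-IEII / AL-KLI.
    for col in ("AAA", "L-L", "VI-", "KEK", "LIL", "III"):
        print(col, round(conservation(col), 3))
```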

Turn Similarity

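$$TS(i,j) = tc \cdot 4 \cdot T(i) \cdot P(T,j)$$

tc: Turn Similarity Coefficient
T(i): 1 if position i of the template is a turn (T); 0 else
P(.,j): Turn profile of Target at position j

Target profile from Turn Prediction Servers (one column per target position)

       T     T     N
T:   0.7   0.5   0.0
N:   0.3   0.5   1.0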
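The turn term is small enough to check by hand. The sketch below evaluates it for every template/target position pair; the function names and the list-based inputs are assumptions, and the example probabilities are the T row of the profile on the slide.

```python
def turn_similarity(template_is_turn, target_turn_prob, tc=1.0):
    """tc * 4 * T(i) * P(T, j), with T(i) = 1 for a template turn and 0 otherwise."""
    indicator = 1 if template_is_turn else 0
    return tc * 4 * indicator * target_turn_prob


def turn_similarity_matrix(template_turns, target_turn_probs, tc=1.0):
    """Rows: template positions; columns: target positions."""
    return [[turn_similarity(t, p, tc) for p in target_turn_probs]
            for t in template_turns]


if __name__ == "__main__":
    template_turns = [True, False]        # hypothetical template annotation
    target_turn_probs = [0.7, 0.5, 0.0]   # T row of the slide's example profile
    for row in turn_similarity_matrix(template_turns, target_turn_probs, tc=0.2):
        print(row)
```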

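Gap Penalties

[Figure: gap penalties that depend on the secondary structure of the residue opposite the gap. First panel: a gap opposite a loop residue (L), with go and ge each rescaled by a factor written with the numbers 2 and 3. Second panel: a gap opposite a helix or strand residue (H/E), labelled Secgap 20.]

And vice versa...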

Filtering

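For each of the motif alignments, get the 25 best alignments.

Build a connectivity map of the template protein and thread it onto the target.

$$\Gamma(i,j) = \begin{cases} 1 & \text{if } i \ne j \text{ and } R_{ij} \le 3.7\ \text{Å} \\ 0 & \text{if } i \ne j \text{ and } R_{ij} > 3.7\ \text{Å} \\ -\sum_{k,\, k \ne i} \Gamma(i,k) & \text{if } i = j \end{cases}$$

$$CS = cs \sum_{i} \sum_{j} \Gamma(i,j) \, J(i,j)$$

Γ: Kirchhoff Matrix
J: Jernigan Contact Potential Matrix*

Get the best 25 alignments according to the score:

$$TS = S + CS$$

*Miyazawa S, Jernigan R L. (1985) Macromolecules; 18:534–552.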

RESULTS

To test our program, we have chosen 3 families from the ASTRAL40* protein list.
  Citrate synthase: 1csh, 1iomA, 1k3pA
  Methionine aminopeptidase: 1b6a, 1xgsA
  Methyltransferase: 1fp2A, 1fp1D

Testing measure: RMSD between the predicted and actual structure of the target.
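For reference, RMSD over matched atoms can be computed as below. This is a minimal sketch that assumes the two coordinate sets are already superimposed and listed in the same order; it does not perform the optimal superposition that structure comparison tools normally apply first. The function name is an assumption.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally long lists of 3-D points."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum((a - b) ** 2
                  for p, q in zip(coords_a, coords_b)
                  for a, b in zip(p, q))
    return math.sqrt(squared / len(coords_a))


if __name__ == "__main__":
    actual = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
    predicted = [(0.0, 1.0, 0.0), (3.8, 1.0, 0.0), (7.6, 1.0, 0.0)]
    print(rmsd(actual, predicted))
```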

RESULTS

For all the experiments done, our algorithm perfectly matched the functional sites and motifs given as input to the program.
  1csh vs 1iomA: RMSD = 2.50
  1csh vs 1k3pA: RMSD = 2.12
  1k3pA vs 1iomA: RMSD = 3.03
  1b6a vs 1xgsA: RMSD = 2.23
  1fp2A vs 1fp1D: RMSD = 2.98

On average over the 5 experiments, the best result was RMSD = 2.57, obtained with ac: 0.4, sc: 0.4, tc: 0.2, cc: 0.

User Interface of TURALIGN

DOMAIN INTERACTIONS