2d-3D Structure Modelling


Page 1: 2d-3D Structure Modelling

2d-3D Structure Modelling

S. Shahriar Arab

Page 2: 2d-3D Structure Modelling

Flow of information

DNA

RNA

PROTEIN SEQ

PROTEIN STRUCT

PROTEIN FUNCTION

……….

Page 3: 2d-3D Structure Modelling

Prediction in bioinformatics

Important prediction problems:

Protein sequence from genomic DNA

Protein 3D structure from sequence

Protein function from structure

Protein function from sequence

Page 4: 2d-3D Structure Modelling

Why predict protein structure?

The sequence structure gap

Millions of known sequences, but only about 80 000 known structures

Structural knowledge brings understanding of function and mechanism of action

Can help in prediction of function

Page 5: 2d-3D Structure Modelling

Why predict protein structure?

Predicted structures can be used in structure based drug design

It can help us understand the effects of mutations on structure or function

It is a very interesting scientific problem

still unsolved in its most general form after more than 20 years of effort

Page 6: 2d-3D Structure Modelling

What is protein structure prediction?

In its most general form

a prediction of the (relative) spatial position of each atom in the tertiary structure generated from knowledge only of the primary structure (sequence)

Page 7: 2d-3D Structure Modelling

Methods of structure prediction

Ab initio protein folding approaches

Comparative (homology) modelling

Fold recognition/threading

Page 8: 2d-3D Structure Modelling

Prediction in one dimension

Secondary structure prediction

Surface accessibility prediction

Page 9: 2d-3D Structure Modelling

2D Structure Identification

• DSSP - Database of Secondary Structures for Proteins (http://swift.cmbi.kun.nl/gv/dssp/)

• VADAR - Volume Area Dihedral Angle Reporter (http://redpoll.pharmacy.ualberta.ca/vadar/)

• PDB - Protein Data Bank (www.rcsb.org)

Sequence:  QHTAWCLTSEQHTAAVIWDCETPGKQNGAYQEDCA
Structure: HHHHHHCCEEEEEEEEEEECCHHHHHHHCCCCCCC

Page 10: 2d-3D Structure Modelling


The DSSP code

•H = alpha helix

•B = residue in isolated beta-bridge

•E = extended strand, participates in beta ladder

•G = 3-helix (3/10 helix)

•I = 5 helix (pi helix)

•T = hydrogen bonded turn

•S = bend

•C= coil

Secondary Structure

Page 11: 2d-3D Structure Modelling

Simplifications

Identification of secondary structures focused on

α-helices

β -strands

others (turns, coils, other helices) are collectively called “coils”

Eight states from DSSP

H: α-helix
G: 3/10 helix
I: π-helix
E: β-strand
B: bridge
T: β-turn
S: bend
C: coil

CASP standard: H = (H, G, I), E = (E, B), C = (C, T, S)
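The reduction above can be sketched as a lookup table. This is a minimal illustration; the function name `reduce_dssp` and the choice to treat any unlisted symbol (such as a blank) as coil are assumptions, not part of the slides.

```python
# CASP-style reduction of the eight DSSP states to three classes.
DSSP_TO_3STATE = {
    "H": "H", "G": "H", "I": "H",   # helices
    "E": "E", "B": "E",             # strand and isolated bridge
    "C": "C", "T": "C", "S": "C",   # coil, turn, bend
}


def reduce_dssp(states: str) -> str:
    """Map an eight-state DSSP string to H/E/C.

    Assumption: symbols not in the table (e.g. a blank) are treated as coil.
    """
    return "".join(DSSP_TO_3STATE.get(s, "C") for s in states)
```

For example, `reduce_dssp("HGIEBTSC")` gives one three-state symbol per input residue.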

Page 12: 2d-3D Structure Modelling

What is Secondary structure prediction?

Given a protein sequence (primary structure)

GHWIATHWIATRGQLIREAYEDYGQLIREAYEDYRHFSSSSECPFIP

Predict its secondary structure content (C = coil, H = alpha helix, E = beta strand)

CEEEEEEEEEECHHHHHHHHHHHHHHHHHHHHHHCCCHHHHCCCCCC

Page 13: 2d-3D Structure Modelling

Why Secondary Structure Prediction?

Simply easier problem than 3D structure prediction

Accurate secondary structure prediction can provide important information for tertiary structure prediction

Improving alignment accuracy

Protein function prediction

Protein classification

Page 14: 2d-3D Structure Modelling

secondary structure prediction

less detailed results

– only predicts the H (helix), E (extended) or C (coil/loop) state of each residue, does not predict the full atomic structure

Accuracy of secondary structure prediction

– The best methods have an average accuracy of just about 73% (the percentage of residues predicted correctly)
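The accuracy quoted here (often called Q3) is the percentage of residues whose three-state label is predicted correctly. A minimal sketch, assuming both strings use the same H/E/C alphabet and have equal length; the function name `q3` is illustrative.

```python
def q3(predicted: str, observed: str) -> float:
    """Percentage of residues whose predicted state matches the observed one."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    if not observed:
        raise ValueError("empty input")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)
```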

Page 15: 2d-3D Structure Modelling

History of protein secondary structure prediction

First generation

How: single residue statistics

Example: Chou-Fasman method, LIM method, GOR I, etc

Accuracy: low

Second generation

How: segment statistics

Examples: ALB method, GOR III, etc

Accuracy: ~60%

Third generation

How: long-range interaction, homology based

Examples: PHD

Accuracy: ~70%

Page 16: 2d-3D Structure Modelling

Chou-Fasman Method

Developed by Chou & Fasman in 1974 & 1978

Based on frequencies of residues in α-helices, β-sheets and turns

Accuracy ~50 - 60% Q3

Page 17: 2d-3D Structure Modelling

Chou-Fasman statistics

R – amino acid, S – secondary structure

f(R,S) – number of occurrences of R in S

Ns – total number of amino acids in conformation S

N – total number of amino acids

P(R,S) – propensity of amino acid R to be in structure S

P(R,S) = (f(R,S)/f(R))/(Ns/N)

Page 18: 2d-3D Structure Modelling

Example

#residues = 20,000, #helix = 4,000, #Ala = 2,000, #Ala in helix = 500

f(Ala, α) = 500/20,000, f(Ala) = 2,000/20,000

p(α) = Nα/N = 4,000/20,000

P = (500/2,000) / (4,000/20,000) = 1.25
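The propensity arithmetic can be checked with a few lines of code. This is a sketch using raw counts; the function and parameter names are illustrative and not from the slides.

```python
def propensity(n_r_in_s: int, n_r: int, n_s: int, n_total: int) -> float:
    """P(R,S) = (f(R,S)/f(R)) / (Ns/N), computed from raw counts.

    n_r_in_s: occurrences of residue R in conformation S
    n_r:      occurrences of residue R overall
    n_s:      residues in conformation S
    n_total:  all residues
    """
    return (n_r_in_s / n_r) / (n_s / n_total)


# Counts from the alanine example on this slide.
ala_helix = propensity(500, 2_000, 4_000, 20_000)
```

A value above 1 means the residue occurs in that conformation more often than expected by chance.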

Page 19: 2d-3D Structure Modelling

Chou-Fasman Statistics

Page 20: 2d-3D Structure Modelling

Amino acid propensities

Page 21: 2d-3D Structure Modelling

Scan peptide for α−helix regions

2. Identify regions where 4 out of 6 contiguous residues have P(H) > 100: an “alpha-helix nucleus”

Page 22: 2d-3D Structure Modelling

Extend α-helix nucleus

3. Extend helix in both directions until a set of four residues have an average P(H) <100.

Repeat steps 1 – 3 for entire peptide

Page 23: 2d-3D Structure Modelling

Scan peptide for β-sheet regions

4. Identify regions where 3 out of 5 contiguous residues have P(E) > 100: a “β-sheet nucleus”

5. Extend the β-sheet until 4 contiguous residues have an average P(E) < 100

6. If the region average is > 105 and the average P(E) > average P(H), then call the region “β-sheet”
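The helix nucleation and extension steps can be sketched as follows. The propensity table is passed in by the caller, and the two-letter table in the usage lines holds made-up values for illustration, not the published Chou-Fasman parameters. The extension rule here is one reading of the slide: the probe window is the candidate residue plus the three residues next to it inside the helix.

```python
def find_helix_nuclei(seq, p_helix, window=6, min_hits=4, threshold=100):
    """Return (start, end) half-open windows where at least min_hits of
    window residues have a helix propensity above threshold."""
    nuclei = []
    for start in range(len(seq) - window + 1):
        hits = sum(1 for aa in seq[start:start + window] if p_helix[aa] > threshold)
        if hits >= min_hits:
            nuclei.append((start, start + window))
    return nuclei


def extend_helix(seq, p_helix, start, end, probe=4, threshold=100):
    """Grow a nucleus in both directions until the mean propensity of the
    4-residue probe window at the growing end drops below threshold."""
    def mean(i, j):
        return sum(p_helix[aa] for aa in seq[i:j]) / (j - i)

    while end < len(seq) and mean(end - probe + 1, end + 1) >= threshold:
        end += 1
    while start > 0 and mean(start - 1, start - 1 + probe) >= threshold:
        start -= 1
    return start, end


# Hypothetical two-letter propensity table, for illustration only.
toy_table = {"A": 140, "G": 50}
toy_seq = "GGGAAAAAAGGG"
nuclei = find_helix_nuclei(toy_seq, toy_table)
region = extend_helix(toy_seq, toy_table, 3, 9)
```

The β-sheet scan follows the same pattern with a window of 5, a minimum of 3 hits, and the P(E) table.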

Page 24: 2d-3D Structure Modelling

The GOR method

developed by Garnier, Osguthorpe & Robson

builds on Pij values based on information theory

evaluate each residue PLUS adjacent 8 N-terminal and 8 carboxyl-terminal residues

sliding window of 17

GOR III method accuracy ~64% Q3

Page 25: 2d-3D Structure Modelling

Second generation

Page 26: 2d-3D Structure Modelling

GOR idea: Statistics that take into account the whole window

Each residue carries two different types of information:

1. Intra-residue information – information about its own secondary structure

2. Inter-residue information – the influence of this residue on other residues

Page 27: 2d-3D Structure Modelling

GOR… continued

1. Individual propensity of amino acid R to be in secondary structure S – same idea as in Chou-Fasman

2. Contribution of 16 neighbours

- take the window of radius 8 around the residue in question (8 before and 8 after the residue)

- for each residue in the window, consider its contribution to the conformation of the middle residue and add its value to PH, PS, PC

- As in Chou-Fasman, the values of all contributions are based on statistics.
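A minimal sketch of the window idea, assuming the statistics are supplied as a nested table `info[state][offset][residue]`. That layout, the state letters H/E/C (the slide writes PH, PS, PC), and the toy values in the usage lines are assumptions for illustration, not the published GOR parameters.

```python
STATES = ("H", "E", "C")  # helix, strand, coil


def gor_predict(seq, info, radius=8):
    """For each residue, sum the contributions of every residue in the
    window of 2*radius + 1 positions and pick the highest-scoring state.

    info[state][offset][aa] is the contribution of amino acid aa, located
    offset positions from the centre, to the centre being in state.
    Missing entries count as 0; ties go to the first state in STATES.
    """
    prediction = []
    for i in range(len(seq)):
        scores = {}
        for state in STATES:
            total = 0.0
            for offset in range(-radius, radius + 1):
                j = i + offset
                if 0 <= j < len(seq):  # positions beyond the chain ends add nothing
                    total += info.get(state, {}).get(offset, {}).get(seq[j], 0.0)
            scores[state] = total
        prediction.append(max(STATES, key=lambda s: scores[s]))
    return "".join(prediction)


# Hypothetical table: only the centre position (offset 0) carries information.
toy_info = {
    "H": {0: {"A": 1.0}},
    "E": {0: {"V": 1.0}},
    "C": {0: {"G": 1.0}},
}
toy_prediction = gor_predict("AVG", toy_info)
```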

Page 28: 2d-3D Structure Modelling

Third generation

Page 29: 2d-3D Structure Modelling

Nearest Neighbour Method

• Idea: similar sequences are likely to have the same secondary structure.

• Take a window around amino acid the conformation of which is to be predicted

• Find several, say k, closest sequences (with respect to a similarity measure defined differently depending on the variant of the method) of known structure.

• Assign secondary structure based on conformation of the sequence neighbours.

• Use max (nα, nβ, nc) or max(sα, sβ, sc)

Key: Scoring measure of evolutionary similarity.

Salamov, Solovyev NNSSP (1995) accuracy above 70%

Page 30: 2d-3D Structure Modelling

Neighbours

1 - L H H H H H H L L - S1

2 - L L H H H H H L L - S2

3 - L E E E E E E L L - S3

4 - L E E E E E E L L - S4

n - L L L L E E E E E - Sn

n+1 - H H H L L L E E E - Sn+1

:

max (nα, nβ, nL) or max (Σsα, Σsβ, ΣsL) or something else…
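The two voting rules on this slide can be sketched as below. Each neighbour is a window of known states plus a similarity score; the windows are the ones shown in the table, while the numeric scores are hypothetical values chosen only to show that the two rules can disagree.

```python
from collections import defaultdict


def vote(neighbours, use_scores=False):
    """Predict the state of the central residue from its nearest neighbours.

    neighbours: list of (window_states, similarity_score) pairs, where each
    window has odd length. With use_scores=False every neighbour counts
    once (max of the n values); with use_scores=True the similarity scores
    are summed per state (max of the summed s values).
    """
    tally = defaultdict(float)
    for window, score in neighbours:
        centre = window[len(window) // 2]
        tally[centre] += score if use_scores else 1.0
    return max(tally, key=tally.get)


# Windows from the slide; the scores are made up for illustration.
toy_neighbours = [
    ("LHHHHHHLL", 0.9),
    ("LLHHHHHLL", 0.8),
    ("LEEEEEELL", 0.3),
    ("LEEEEEELL", 0.3),
    ("LLLLEEEEE", 0.2),
    ("HHHLLLEEE", 0.1),
]
by_count = vote(toy_neighbours)
by_score = vote(toy_neighbours, use_scores=True)
```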

Page 31: 2d-3D Structure Modelling

Advantages

Information from structural neighbours can be used to provide details to predicted secondary structure (phi,psi angles)

Much higher accuracy than previous methods.

Page 32: 2d-3D Structure Modelling

Neural network models

machine learning approach

provide training sets of structures (e.g. α-helices, non α -helices)

computers are trained to recognize patterns in known secondary structures

provide test set (proteins with known structures)

accuracy ~ 70 –75%

Page 33: 2d-3D Structure Modelling

Neural Network Method

Recall artificial neurone:

Page 34: 2d-3D Structure Modelling

How PHD works

Step 1. BLAST search with input sequence

Step 2. Perform multiple seq. alignment and calculate aa frequencies for each position
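Step 2 can be sketched as a per-column frequency count over the alignment. This is a simplified illustration, not PHD's actual profile code; it assumes equal-length aligned sequences with '-' as the gap character and ignores gaps when normalising.

```python
from collections import Counter


def column_frequencies(alignment):
    """Return one {amino_acid: frequency} dict per alignment column.

    alignment: list of equal-length aligned sequences, '-' marks a gap.
    Columns that contain only gaps yield an empty dict.
    """
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("aligned sequences must have equal length")
    profile = []
    for col in range(length):
        counts = Counter(seq[col] for seq in alignment if seq[col] != "-")
        total = sum(counts.values())
        profile.append({aa: n / total for aa, n in counts.items()} if total else {})
    return profile


toy_profile = column_frequencies(["AV-", "AL-", "GV-"])
```

Each window of 13 profile columns would then form the input to the first-level network.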

Page 35: 2d-3D Structure Modelling

How PHD works (cont.)

Step 3. Level 1: sequence to structure

Take window of 13 adjacent residues

Scores for helix, strand, loop in the output layer, for each residue

Page 36: 2d-3D Structure Modelling

Prediction tools that use NNs

MACMATCH

- (Presnell et al., 1993)

- for Macintosh

PHD

- (Rost & Sander, 1993)

http://www.predictprotein.org/

NNPREDICT

- (Kneller et al. 1990)

http://www.cmpharm.ucsf.edu/nomi/nnpredict.html

Page 37: 2d-3D Structure Modelling

PHD Prediction of rCD2

Page 38: 2d-3D Structure Modelling

Prediction Accuracy

Page 39: 2d-3D Structure Modelling

Best of the Best

PredictProtein-PHD (72%)

http://www.predictprotein.org/

Jpred (73-75%)

http://jura.ebi.ac.uk:8888/

PREDATOR (75%)

http://www.embl-heidelberg.de/cgi/predator_serv.pl

PSIpred (77%)

http://insulin.brunel.ac.uk/psipred

Page 40: 2d-3D Structure Modelling

Solvent Probe Accessible Surface

Van der Waals Surface

Reentrant Surface

Accessible Surface Area

Page 41: 2d-3D Structure Modelling

ASA Calculation

• DSSP - Database of Secondary Structures for Proteins (swift.embl-heidelberg.de/dssp)

• VADAR - Volume Area Dihedral Angle Reporter (http://redpoll.pharmacy.ualberta.ca/vadar/)

• GetArea - www.scsb.utmb.edu/getarea/area_form.html

QHTAWCLTSEQHTAAVIWDCETPGKQNGAYQEDCAMD
BBPPBEEEEEPBPBPBPBBPEEEPBPEPEEEEEEEEE
1056298799415251510478941496989999999

Page 42: 2d-3D Structure Modelling

Other ASA sites

• Connolly Molecular Surface Home Page

– http://www.biohedron.com/

• Naccess Home Page

– http://sjh.bi.umist.ac.uk/naccess.html

• ASA Parallelization

– http://cmag.cit.nih.gov/Asa.htm

• Protein Structure Database

– http://www.psc.edu/biomed/pages/research/PSdb/

Page 43: 2d-3D Structure Modelling

Accessibility

Accessibility = (Accessible Surface Area (ASA) in folded protein) / (Maximum ASA)

• Two state = b (buried), e (exposed)

e.g. b ≤ 16%, e > 16%

• Three state = b (buried), i (intermediate), e (exposed)

e.g. b ≤ 16%, 16% < i < 36%, e > 36%
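The definition and the example thresholds can be sketched as two small functions. The maximum ASA per residue type has to come from a reference table that the slides do not give, so it is a caller-supplied argument here. Assigning a value of exactly 36% to the exposed class is an assumption, since the slide leaves that boundary open.

```python
def relative_accessibility(asa: float, max_asa: float) -> float:
    """Accessibility as a percentage: ASA in the folded protein / maximum ASA."""
    return 100.0 * asa / max_asa


def three_state(rel: float, buried: float = 16.0, exposed: float = 36.0) -> str:
    """Classify a relative accessibility (in %) as buried, intermediate or exposed.

    Assumption: a value exactly equal to the upper threshold counts as exposed.
    """
    if rel <= buried:
        return "b"
    if rel < exposed:
        return "i"
    return "e"
```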

Page 44: 2d-3D Structure Modelling

Accessibility Prediction

• PredictProtein-PHDacc (58%)

http://cubic.bioc.columbia.edu/predictprotein

• PredAcc (70%?)

http://condor.urbb.jussieu.fr/PredAccCfg.html

Input: QHTAW...

QHTAWCLTSEQHTAAVIW
BBPPBEEEEEPBPBPBPB

Page 45: 2d-3D Structure Modelling

PHD Prediction of rCD2

Page 46: 2d-3D Structure Modelling

3D structure prediction

Page 47: 2d-3D Structure Modelling

3D structure prediction of proteins

[Figure: prediction method versus sequence similarity (%), axis from 0 to 100. Existing folds: threading, building by homology. New folds: ab initio prediction.]

Page 48: 2d-3D Structure Modelling

Choice of prediction methods

If you can find similar sequences of known structure then comparative modelling is the best way to predict structure

all other methods are less reliable

Of course, you can’t always find similar sequences of known structure.

Page 49: 2d-3D Structure Modelling

What if you can’t do comparative modelling?

• Secondary structure prediction

• Fold recognition/threading

• Ab initio protein folding approaches

Page 50: 2d-3D Structure Modelling

Divergent evolution

Different proteins in different organisms have diverged from a common ancestor protein

Each copy of this ancestor in various organisms has been subject to mutations, deletions, and insertions of amino acids in its sequence

In general, its 3-D fold and function have remained similar

Page 51: 2d-3D Structure Modelling

Homology Modelling of Proteins

Prediction of the three-dimensional structure of a target protein from its amino acid sequence (primary structure), using a homologous (template) protein for which an X-ray or NMR structure is available.

Page 52: 2d-3D Structure Modelling

Comparative modelling

Makes a prediction of tertiary structure based on

– sequences of known structure which are similar to the target sequence (called template structures)

– an alignment between these and the target sequence

• Remember: ~25% seq ID means two proteins have the same basic structure

Page 53: 2d-3D Structure Modelling

What homology modelling can and cannot do

Gives the best results relative to other methods

Unreliable in predicting the conformations of insertions or deletions

Comparative models are unlikely to be useful in modelling ligand docking (drug design) unless the sequence identity with the template is >70%, and even then, less reliable than an empirical crystallographic or NMR structure.

Page 54: 2d-3D Structure Modelling

What is “good” comparative model

Take the 3D alignment between the predicted structure A’ and the native structure A.

Let a1, …, an be the coordinates of the carbon atoms (typically Cα) in the native structure and a’1, …, a’n those in the predicted structure.

An RMSD (root mean square deviation) below 2 Å is good for homology modelling results.
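The RMSD between the two coordinate sets can be computed as below. This sketch assumes the structures are already superimposed and that atom i in one list corresponds to atom i in the other; it does not perform the optimal superposition itself.

```python
import math


def rmsd(native, predicted):
    """Root mean square deviation between two equal-length lists of
    (x, y, z) coordinates, assumed to be already superimposed."""
    if len(native) != len(predicted) or not native:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = 0.0
    for (x1, y1, z1), (x2, y2, z2) in zip(native, predicted):
        total += (x1 - x2) ** 2 + (y1 - y2) ** 2 + (z1 - z2) ** 2
    return math.sqrt(total / len(native))
```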

Page 55: 2d-3D Structure Modelling

Factors affecting accuracy

The accuracy of comparative modelling is controlled by the quality of the alignment between target sequence and template structures

Alignment is easier if the sequences are closely related (e.g. sequence identity > 80%).

Page 56: 2d-3D Structure Modelling

Homology model

Target sequence

Select templates from DB

Align target sequence with template structures

Build a model and evaluate

Page 57: 2d-3D Structure Modelling

Homology Modelling Assumptions

The overall 3-D structure of the target protein is not dissimilar to that of the related proteins.

Regions of homologous sequence have similar structure.

Residues homologous throughout a family of proteins are conserved structurally.

Residues involved in biological activity have similar topology throughout the protein family.

Loop regions (non-conserved residues) allow insertions and deletions without disrupting the overall structure of the protein.

Loop regions are flexible and therefore need not be constructed as strictly as the conserved regions, assuming that they play no role in biological activity.

Page 58: 2d-3D Structure Modelling

Homology Modelling of Proteins

• Steps in Molecular Modelling

– Identification of structures that will form the template for the target structure (model).

– Sequence alignment. The most important step. For proteins with low sequence homology to the query protein (roughly < 30% sequence identity), the model can be improved by using secondary structure prediction (i.e. align-model-realign-remodel).

– Transfer the coordinates of structurally conserved regions (SCRs) from the template(s) to the target

- many fragment method

- single structure

• Modelling variable regions

- Loops / insertions: search of a high resolution fragment database

- Deletions: local minimization often sufficient

• Modelling of side chains

- Rotamer database

• Minimization

- Local, especially loop-hinge regions

- Global

Page 59: 2d-3D Structure Modelling

Model Building from template

Multiple templates

Protein Fold

Core conserved regions

Variable Loop regions

Side chains

Calculate the framework from average of all template structures

Generate one model for each template and evaluate

Page 60: 2d-3D Structure Modelling

Modelling loops

Deletions: if it is a short deletion, local minimization is often sufficient.

Insertions:

a. Look for same length in another homologue

b. Search database of short High Resolution fragments

Lowest RMSD from Anchor points

Best Sequence Homology

Least interference with Core structure.

Anchor points (2 residues)

5 residue insertion

Database search for 5 residue fragments

annealing

Page 61: 2d-3D Structure Modelling

Same S.C. conformer taken from template.

substitution: build based on rotamer library & energetics.

Partial Similarity: Most S.C. build on template.

Side Chain modelling

Page 62: 2d-3D Structure Modelling

Core model with side chains

Page 63: 2d-3D Structure Modelling

Minimization

LOCAL: Minimize a fragment. Usually a loop and its anchor regions - as these often have bad geometries. First minimize without influence of surrounding structure then take surrounding structure into account.

GLOBAL: Minimize whole protein (& H2O). Mainly to relieve short contacts and to rectify bad geometry, like bond angles, peptide planarity etc.

Page 64: 2d-3D Structure Modelling

Errors in Models !!!

Incorrect template selection

Incorrect alignments

Errors in positioning of side-chains and loops

Page 65: 2d-3D Structure Modelling
Page 66: 2d-3D Structure Modelling

Fold recognition or threading

Aimed at detecting when the target sequence adopts a known fold, even if it has no significant similarity to sequences of known fold

Page 67: 2d-3D Structure Modelling

How many folds are there ?

Source: http://scop.mrc-lmb.cam.ac.uk/scop/count.html

SCOP: Structural Classification of Proteins, 1.75 release. 38221 PDB entries (23 Feb 2009). 110800 domains. 1 literature reference (excluding nucleic acids and theoretical models)

Page 68: 2d-3D Structure Modelling

Threading

Page 69: 2d-3D Structure Modelling

Definition

Threading - A protein fold recognition technique that involves replacing the sequence of a known protein structure with a query sequence of unknown structure. The new “model” structure is evaluated using a simple heuristic measure of protein fold quality. The process is repeated against all known 3D structures until an optimal fit is found.

Page 70: 2d-3D Structure Modelling

Why Threading?

Secondary structure is more conserved than primary structure

Tertiary structure is more conserved than secondary structure

Therefore very remote relationships can be better detected through 2D or 3D structural homology instead of sequence homology

Page 71: 2d-3D Structure Modelling

Threading idea

Choose a set of candidate structures - templates.

Align the sequence of the protein of unknown structure to each template structure.

Design a test that will evaluate which template is the most likely candidate for the correct fold for the given sequence. If none is reasonable, be able to recognize it as a possible new fold.

Page 72: 2d-3D Structure Modelling

Threading

Database of 3D structures and sequences

– Protein Data Bank (or non-redundant subset)

Query sequence

– Sequence < 25% identity to known structures

Alignment protocol

– Dynamic programming

Evaluation protocol

– Distance-based potential or secondary structure

Ranking protocol

Page 73: 2d-3D Structure Modelling

2 Kinds of Threading

• 2D Threading

• Prediction Based Methods (PBM)

– Predict secondary structure (SS) or ASA of query

– Evaluate on basis of SS and/or ASA matches

• 3D Threading

• Distance Based Methods (DBM)

– Create a 3D model of the structure

– Evaluate using a distance-based “hydrophobicity” or pseudo-thermodynamic potential

Page 74: 2d-3D Structure Modelling

2D Threading Algorithm (prediction based method)

Convert PDB to a database containing sequence, SS and ASA information

Predict the SS and ASA for the query sequence

Perform a dynamic programming alignment using the query against the database (include sequence, SS & ASA)

Rank the alignments and select the most probable fold

Page 75: 2d-3D Structure Modelling

Dynamic Programming

Initial match matrix:

     G   E   N   E   T   I   C   S
G   10   0   0   0   0   0   0   0
E    0  10   0  10   0   0   0   0
N    0   0  10   0   0   0   0   0
E    0   0   0  10   0  10   0   0
S    0   0   0   0   0   0   0  10
I    0   0   0   0   0  10   0   0
S    0   0   0   0   0   0   0  10

Summed matrix:

     G   E   N   E   T   I   C   S
G   60  40  30  20  20   0  10   0
E   40  50  30  30  20   0  10   0
N   30  30  40  20  20   0  10   0
E   20  20  20  30  20  10  10   0
S   20  20  20  20  20   0  10  10
I   10  10  10  10  10  20  10   0
S    0   0   0   0   0   0   0  10

Page 76: 2d-3D Structure Modelling

Sij (Identity Matrix)

A 20 × 20 matrix over the amino acids A C D E F G H I K L M N P Q R S T V W Y, with Sij = 1 on the diagonal (identical residues) and Sij = 0 everywhere else.

Page 77: 2d-3D Structure Modelling

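A Simple Example...

Filling the matrix for AATVD (columns) against AVVD (rows), one cell at a time:

    A A T V D
A   1
V
V
D

    A A T V D
A   1 1
V
V
D

    A A T V D
A   1 1 0 0 0
V
V
D

    A A T V D
A   1 1 0 0 0
V   0
V
D

    A A T V D
A   1 1 0 0 0
V   0 1 1
V
D

    A A T V D
A   1 1 0 0 0
V   0 1 1 2
V
D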

Page 78: 2d-3D Structure Modelling

A Simple Example...

    A A T V D
A   1 1 0 0 0
V   0 1 1 2 1
V
D

    A A T V D
A   1 1 0 0 0
V   0 1 1 2 1
V   0 1 1 2 2
D   0 1 1 1 3

Alignments read from the completed matrix:

A A T V D
|   | | |
A - V V D

A A T V D
  | | | |
  A V V D

A A T V D
| |   | |
A V - V D
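The fill rule behind these matrices can be sketched as follows: each cell is its own match score plus the best value found anywhere in the rows above and the columns to the left. This is an illustration that uses the identity matrix as the score and no gap penalty, which is what reproduces the numbers in this example; it is not a full alignment program and omits the traceback.

```python
def fill_matrix(cols: str, rows: str):
    """Cumulative score matrix for two sequences with identity scoring.

    h[i][j] = (1 if rows[i] == cols[j] else 0)
              + max of h over all cells strictly above and to the left
              (0 when there are none).
    """
    h = [[0] * len(cols) for _ in rows]
    for i in range(len(rows)):
        for j in range(len(cols)):
            best_prev = max(
                (h[a][b] for a in range(i) for b in range(j)), default=0
            )
            h[i][j] = (1 if rows[i] == cols[j] else 0) + best_prev
    return h


example = fill_matrix("AATVD", "AVVD")
```

The highest value, in the bottom-right cell, is the score of the best alignment; tracing back through the cells that produced it yields the alternative alignments.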

Page 79: 2d-3D Structure Modelling

Let’s Include 2D info & ASA

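Secondary structure score Sij(strc), states H, E, C:

    H E C
H   1 0 0
E   0 1 0
C   0 0 1

Accessibility score Sij(asa), states E, P, B:

    E P B
E   1 0 0
P   0 1 0
B   0 0 1

S_{ij}^{total} = k_1 S_{ij}^{seq} + k_2 S_{ij}^{strc} + k_3 S_{ij}^{asa}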

Page 80: 2d-3D Structure Modelling

A Simple Example...

Secondary structure labels: columns A A T V D = E E E C C, rows A V V D = E E C C.

    A A T V D
A   2
V
V
D

    A A T V D
A   2 2
V
V
D

    A A T V D
A   2 2 1 0 0
V
V
D

    A A T V D
A   2 2 1 0 0
V   1
V
D

    A A T V D
A   2 2 1 0 0
V   1 3 3
V
D

    A A T V D
A   2 2 1 0 0
V   1 3 3 3
V
D

Page 81: 2d-3D Structure Modelling

A Simple Example...

Secondary structure labels: columns A A T V D = E E E C C, rows A V V D = E E C C.

    A A T V D
A   2 2 1 0 0
V   1 3 3 3 2
V
D

    A A T V D
A   2 2 1 0 0
V   1 3 3 3 2
V   0 2 3 5 4
D   0 2 3 4 7

Alignments:

A A T V D
|   | | |
A - V V D

A A T V D
  | | | |
  A V V D

A A T V D
| |   | |
A V - V D
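The same fill rule with the combined score can be sketched as below. The weights k_seq and k_ss stand in for k1 and k2 and are set to 1, which matches the numbers in this example. The ASA term is left out because the slide gives no accessibility labels for these sequences; it would be a third term of the same form.

```python
def combined_fill(cols, rows, col_ss, row_ss, k_seq=1, k_ss=1):
    """Cumulative matrix where each cell scores sequence identity plus
    secondary structure identity, weighted by k_seq and k_ss."""
    h = [[0] * len(cols) for _ in rows]
    for i in range(len(rows)):
        for j in range(len(cols)):
            score = (k_seq * int(rows[i] == cols[j])
                     + k_ss * int(row_ss[i] == col_ss[j]))
            best_prev = max(
                (h[a][b] for a in range(i) for b in range(j)), default=0
            )
            h[i][j] = score + best_prev
    return h


with_ss = combined_fill("AATVD", "AVVD", "EEECC", "EECC")
```

Setting `k_ss=0` reduces this to the sequence-only matrix, which makes it easy to see how much the structural term changes the ranking.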

Page 82: 2d-3D Structure Modelling

2D Threading Performance

In test sets 2D threading methods can identify 30-40% of proteins having very remote homologues (i.e. not detected by BLAST) using “minimal” non-redundant databases (<700 proteins)

If the database is expanded ~4x the performance jumps to 70-75%

Page 83: 2d-3D Structure Modelling

2D Threading Advantages

Algorithm is easy to implement

Algorithm is very fast (10x faster than 3D threading approaches)

The 2D database is small (<500 kbytes) compared to 3D database (>2 Gbytes)

Appears to be just as accurate as DBM or other 3D threading approaches

Very amenable to web servers

Page 84: 2d-3D Structure Modelling

2D Threading Disadvantages

Reliability is not 100% making most threading predictions suspect unless experimental evidence can be used to support the conclusion

Does not produce a 3D model at the end of the process

Doesn’t include all aspects of 2D and 3D structure features in prediction process

Page 85: 2d-3D Structure Modelling

Servers - PredictProtein

Page 86: 2d-3D Structure Modelling

Servers - PSIPRED

Page 87: 2d-3D Structure Modelling

Servers - LIBRA I

Page 88: 2d-3D Structure Modelling

More Servers - www.bronco.ualberta.ca

Page 89: 2d-3D Structure Modelling

Force Fields

Molecular Mechanics

Statistical or Knowledge based

Page 90: 2d-3D Structure Modelling

Molecular Mechanic Force Field

E_{FF} = E_{str} + E_{bend} + E_{tors} + E_{oop}   (bonded terms)

+ E_{vdw} + E_{el} + E_{hb}   (non-bonded terms)

+ E_{str-str} + E_{str-bnd} + E_{str-tor} + E_{bnd-bnd} + E_{bnd-tor}   (cross terms)

E_{str} = \sum_i k_{b_i} (b_i - b_0)^2   (bond length)

E_{bend} = \sum_i k_{\theta_i} (\theta_i - \theta_0)^2   (bond angle)

E_{tors} = \sum_i k_{\varsigma_i} \cos(3\varsigma_i + \gamma_0)   (torsion angle)

E_{oop} = \sum_i k_{imp} (\chi - \chi_0)^2   (improper quadratic out of plane)

E_{vdw} = \sum_i \sum_j ( A_{ij} d_{ij}^{-6} + B_{ij} d_{ij}^{-12} )   (van der Waals interaction)

E_{el} = \sum_i \sum_j v_i v_j / (\varepsilon d_{ij})   (electrostatic interaction)

E_{hb} = \sum_i \sum_j \varepsilon [ 5 (R_0 / R_{ij})^{12} - 6 (R_0 / R_{ij})^{10} ]   (hydrogen bond)
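Two of these terms are simple enough to evaluate directly. The sketch below uses a single force constant and a single A, B pair for all bonds and atom pairs, which is a simplification; real force fields such as AMBER or CHARMM use parameters that depend on atom types, and the numbers in the usage lines are arbitrary.

```python
def bond_stretch_energy(lengths, k_b, b0):
    """E_str: harmonic penalty for each bond length b away from its ideal b0."""
    return sum(k_b * (b - b0) ** 2 for b in lengths)


def vdw_energy(distances, a, b):
    """E_vdw: sum of A*d^-6 + B*d^-12 over the given pair distances.

    Following the form on the slide, the sign of the attractive term is
    carried by the coefficient a (negative for attraction).
    """
    return sum(a * d ** -6 + b * d ** -12 for d in distances)


# Arbitrary illustrative parameters, not taken from any published force field.
e_str = bond_stretch_energy([1.5, 1.6], k_b=2.0, b0=1.5)
e_vdw = vdw_energy([1.0], a=-2.0, b=1.0)
```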

Page 91: 2d-3D Structure Modelling

Molecular Mechanic Force Field

AMBER

CHARMM

GROMACS

...

Differences

Terms of energy

Parameters

Cornell WD, Cieplak P, Bayly CI, Gould IR, Merz KM Jr, Ferguson DM, Spellmeyer DC, Fox T, Caldwell JW, Kollman PA: A Second Generation Force Field for the Simulation of Proteins, Nucleic Acids, and Organic Molecules. J Am Chem Soc 1995, 117:5179–5197.

Brooks BR, Bruccoleri RE, Olafson BD, States DJ, Swaminathan S, Karplus M: CHARMM: A program for macromolecular energy, minimization, and dynamics calculations. J Comput Chem 1983, 4:187-217.

Van Der Spoel D, Lindahl E, Hess B, Groenhof G, Mark AE, Berendsen HJ: GROMACS: fast, flexible, and free. J Comput Chem 2005, 26(16):1701–18.

Page 92: 2d-3D Structure Modelling

Statistical Force Field

Derived from an analysis of known structures in the Protein Data Bank

Schueler-Furman O, Wang C, Bradley P, Misura K, Baker D: Progress in modeling of protein structures and interactions. Science 2005, 310:638-642.

Bradley P, Misura KM, Baker D: Toward high-resolution de novo structure prediction for small proteins. Science 2005, 309:1868-1871.

Tanaka and Scheraga (1976): the idea of using the Boltzmann distribution to find a knowledge-based force field

Page 93: 2d-3D Structure Modelling

Statistical Force Field

Reduce protein structure

Sippl MJ: Knowledge-based potentials for proteins. Curr Opin Struct Biol 1995, 5:229-235.

Covell DG: Folding protein α-carbon chains into compact forms by Monte Carlo methods. Proteins Struct Funct Genet 1992, 14:409-420.

Sun S: Reduced representation model of protein structure prediction: statistical potential and genetic algorithms. Protein Sci 1993, 2:762-785.

Distribution of:

Distances

Angles

ASA

Lazaridis T, Karplus M: Effective energy functions for protein structure prediction. Curr Opin Struct Biol 2000, 10:139-145.

Bauer A, Beyer A: An improved pair potential to recognize native protein folds. Proteins Struct Funct Genet 1994, 18:254-261.

Jernigan RL, Bahar I: Structure-derived potentials and protein simulations. Curr Opin Struct Biol 1996, 6:195-209.

Melo F, Feytmans E: Assessing protein structures with a non-local atomic interaction energy. J Mol Biol 1998, 277:1141-1152.

Melo F, Sanchez R, Sali A: Statistical potentials for fold assessment. Protein Sci 2002, 11:430-448.

Tobi D, Elber R: Distance-dependent, pair potential for protein folding: Results from linear optimization. Proteins Struct Funct Genet 2000, 41:40-46.

Page 94: 2d-3D Structure Modelling

Statistical Force Field

P(c)≈ exp(−βE(c))

Page 95: 2d-3D Structure Modelling

Contact Potential Calculation - 1

Interaction energy between AAs

E(interaction) = -KT ln(frequency of interaction)

K: constant

T: temperature (in K, 273K = 0 ºC)

Frequency of interaction: measured in database of known struct.

More frequent ⇒ more favourable

Page 96: 2d-3D Structure Modelling

“energy” based on contact potentials (Jones)

Pairwise contact potentials:

ΔEab(s) = -kT ln (fab(s)/f(s))

s : separation length

fab(s): frequency of occurrence of a, b with separation s

f(s): frequency of the separation

Define energy of a structure as the sum over all pairwise contact potentials.
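The pairwise term can be evaluated directly from the two frequencies. This is a sketch of the formula as written on the slide; the frequencies would come from counts in a structure database, and kT is set to 1 here so that energies are expressed in units of kT (an assumption for illustration).

```python
import math


def contact_energy(f_ab_s: float, f_s: float, kT: float = 1.0) -> float:
    """Delta E_ab(s) = -kT * ln(f_ab(s) / f(s)).

    f_ab_s: frequency of residue types a, b occurring at separation s
    f_s:    frequency of separation s over all residue pairs
    """
    return -kT * math.log(f_ab_s / f_s)


def structure_energy(pair_terms, kT: float = 1.0) -> float:
    """Sum the contact energies over all (f_ab_s, f_s) pairs in a structure."""
    return sum(contact_energy(f_ab, f, kT) for f_ab, f in pair_terms)
```

A pair seen more often than the background gets a negative (favourable) energy, in line with "more frequent means more favourable".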

Page 97: 2d-3D Structure Modelling

Limitation of Contact Potential Method

The energy associated with an isolated AA pair is assumed to be similar to that found in known protein structures

Modification: the conformation energy of groups of AAs larger than 2 may provide a more reliable prediction

Page 98: 2d-3D Structure Modelling

Ab Initio Prediction

Predicting the 3D structure without any “prior knowledge”

Used when homology modelling or threading have failed (no homologues are evident)

Equivalent to solving the “Protein Folding Problem”

Still a research problem

Page 99: 2d-3D Structure Modelling

Ab initio protein folding

Aims to predict tertiary structure from basic physico-chemical principles

does not rely on any detection of similarity to sequences of known structure

An important scientific question

As yet very unreliable for practical predictions

Page 100: 2d-3D Structure Modelling

Some Ab Initio Methods

Molecular Dynamic Simulation

Using complex energy functions, simulate folding of the primary sequence until it reaches its native state (1D->3D)

Genetic Algorithm

Used in refining a given potential function so that it can best predict the native state of a protein

Simulated Annealing

Branch and Bound Methods (usually used in side-chain conformation)

Page 101: 2d-3D Structure Modelling

INPUT

1. Sequence of amino acids

2. The chemical structures of amino acids and peptide backbone

constituent atoms

bond lengths, angles

constraints on dihedral angles

3. The properties of the media (water molecules, anions, cations, other molecules…)

Page 102: 2d-3D Structure Modelling

OUTPUT

3D coordinates of atoms in the protein (or some equivalent representation)

We are also willing to accept partial information:

3D structure of active site only

Location (in sequence) of secondary structures

Prediction of the “class” or “family” of the protein

Page 103: 2d-3D Structure Modelling

Is problem hard?

YES.

Huge Search Space:

Assume each amino acid can adopt one of three conformations (alpha, beta, coil); then a chain of 100 amino acids has 3^100 ≈ 5 × 10^47 possible folds.

If sampling one fold takes 10^-13 seconds, trying them all would take about 10^27 years.

The universe is about 10^10 years old.
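The arithmetic behind this estimate is easy to reproduce. The function name and the 365.25-day year are choices made for this sketch.

```python
def exhaustive_search_years(n_residues=100, states=3, seconds_per_fold=1e-13):
    """Time in years to enumerate every conformation of the chain."""
    conformations = states ** n_residues          # exact integer
    seconds = conformations * seconds_per_fold
    return seconds / (365.25 * 24 * 3600)


conformations = 3 ** 100
years = exhaustive_search_years()
```

The result is on the order of 10^27 years, many orders of magnitude more than the age of the universe.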

Difficult criterion for “correct fold.”:

Interaction between thousands of atoms with each other, surrounding water,and surrounding molecules.

Page 104: 2d-3D Structure Modelling

Can it be done?

YES.

Nature does it all the time.

Real proteins fold in the range of seconds.

THUS

Nature must not sample all conformations.

Nature knows the correct criterion.

Page 105: 2d-3D Structure Modelling

Potential Energy Function

How do we know when a predicted structure is the native shape of the protein ?

In thermodynamics,

A molecule is most stable when its free energy is at a minimum

• The potential energy function is a simplification of the actual forces acting on a real protein molecule, and its formulation is based on the given simplified structural model

native shape is at a free energy minimum

Page 106: 2d-3D Structure Modelling

Polypeptides can be...

Represented by a range of approaches or approximations including:

all atom representations in cartesian space

all atom representations in dihedral space

simplified atomic versions in dihedral space

tube/cylinder/ribbon representations

lattice models

Page 107: 2d-3D Structure Modelling

Ab Initio Folding

• Two Central Problems

Sampling conformational space (10^100)

The energy minimum problem

• The Sampling Problem (Solutions)

Lattice models, off-lattice models, simplified chain methods

• The Energy Problem (Solutions)

Threading energies, simplified force fields, packing assessment, topology assessment

Page 108: 2d-3D Structure Modelling

A Simple 2D Lattice

3.5Å

Page 109: 2d-3D Structure Modelling

Lattice Folding

Page 110: 2d-3D Structure Modelling

Lattice Algorithm

1) Build an “n x m” matrix (a 2D array)

2) Choose an arbitrary point as your N terminal residue (start residue)

3) Add or subtract “1” from the x or y position of the start residue

4) Check to see if the new point (residue) is off the lattice or is already occupied

5) Evaluate the energy

6) Go to step 3) and repeat until done

Page 111: 2d-3D Structure Modelling

Lattice Energy Algorithm

• Red = hydrophobic, Blue = hydrophilic

• If Red is near empty space E = E+1

• If Blue is near empty space E = E-1

• If Red is near another Red E = E-1

• If Blue is near another Blue E = E+0

• If Blue is near Red E = E+0
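The lattice walk and the energy rules can be sketched together. Several details are assumptions made for this illustration: the lattice is unbounded (so the off-lattice check is not needed), "near" means the four nearest lattice sites, every residue inspects its own neighbours (so a red-red contact is counted once from each side), and chain neighbours count as contacts.

```python
MOVES = {"U": (0, 1), "D": (0, -1), "L": (-1, 0), "R": (1, 0)}


def build_chain(moves, start=(0, 0)):
    """Place residues one lattice step at a time.

    Returns the list of coordinates, or None if a step lands on a site
    that is already occupied.
    """
    coords = [start]
    for move in moves:
        dx, dy = MOVES[move]
        nxt = (coords[-1][0] + dx, coords[-1][1] + dy)
        if nxt in coords:
            return None
        coords.append(nxt)
    return coords


def lattice_energy(coords, kinds):
    """Apply the red/blue rules; kinds[i] is 'H' (hydrophobic, red) or
    'P' (hydrophilic, blue) for the residue at coords[i]."""
    occupied = dict(zip(coords, kinds))
    energy = 0
    for (x, y), kind in occupied.items():
        for dx, dy in MOVES.values():
            neighbour = occupied.get((x + dx, y + dy))
            if neighbour is None:
                # Exposed hydrophobic costs energy, exposed hydrophilic gains it.
                energy += 1 if kind == "H" else -1
            elif kind == "H" and neighbour == "H":
                energy -= 1
            # Blue-blue and blue-red contacts contribute 0.
    return energy


square = build_chain("RUL")            # four residues on a 2 x 2 square
square_energy = lattice_energy(square, "HHHH")
```

A search procedure would generate many such walks and keep the one with the lowest energy.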

Page 112: 2d-3D Structure Modelling

More Complex Lattices

Page 113: 2d-3D Structure Modelling

3D Lattices

Page 114: 2d-3D Structure Modelling

J. Skolnick

Really Complex 3D Lattices

Page 115: 2d-3D Structure Modelling

Lattice Methods

Advantages

• Easiest and quickest way to build a polypeptide

• More complex lattices allow reasonably accurate representation

Disadvantages

• At best, only an approximation to the real thing

• Does not allow accurate constructs

• Complex lattices are as “costly” as the real thing

Page 116: 2d-3D Structure Modelling

The CASP “contest”

CASP is a blind prediction contest. There is a set of structures that are crystallized but not published.

The predictors attempt to predict their structures.

The results are compared.

http://predictioncenter.org/casp[1,2,3,4,5,6,7,8,9]/