1
Protein structure Prediction
2
Copyright notice
• Many of the images in this power point presentation are from Bioinformatics and Functional Genomics by Jonathan Pevsner (ISBN 0-471-21004-8). Copyright © 2003 by John Wiley & Sons, Inc.
• Many slides of this power point presentation are from slides of Dr. Jonathan Pevsner and other people. The copyright belongs to the original authors. Thanks!
3
Levels of Protein Structure
4
Why is protein structure prediction needed?
• 3D structure determination is expensive, slow and difficult (by X-ray crystallography or NMR)
• Assists in the engineering of new proteins
5
Approaches to predicting protein structures
• Ab initio
– Use just first principles: energy, geometry, and kinematics
• Homology (comparative) modeling
– Find the best match to a database of sequences with known 3D structure
• Threading
• Combinations
6
Comparative Modeling
Database of templates, built from the Protein Data Bank (PDB, http://www.pdb.org):
• Separate into single chains
• Remove bad structures (models)
• Create BLAST database
Workflow: Target sequence + Known Structures (templates) → Template(s) selection → Sequence Alignment → Structure Modeling → Structure Evaluation → Final Structural Models
7
Comparative Modeling: Template(s) selection
• Sequence similarity / fold recognition
• Structure quality (resolution, experimental method)
• Experimental conditions (ligands and cofactors)
8
Comparative Modeling: Sequence Alignment
• Key step in homology modeling
• Global alignment is required
• Small error in alignment can lead to big error in model
• Multiple alignments are better than pairwise alignments
9
Comparative Modeling: Structure Modeling
• Template-based fragment assembly: SWISS-MODEL
• Satisfaction of spatial restraints: MODELLER
10
Comparative Modeling: Structure Evaluation
• Errors in template selection or alignment result in bad models
• Iterative cycles of alignment, modeling and evaluation
11
Measuring Protein Structure Similarity
• Need ways to determine if two protein structures are related and to compare predicted models to experimental structures
• A commonly used measure is the root mean square deviation (RMSD) of the Cartesian atom coordinates between two structures after optimal superposition (McLachlan, 1979):

$$\mathrm{RMSD} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(dx_i^2 + dy_i^2 + dz_i^2\right)}$$

where N is the number of atoms compared and dx_i, dy_i, dz_i are the coordinate differences for atom i.
• Usually use Cα atoms
• Other measures include contact maps and torsion angle RMSDs
[Figure: NK-lysin (1nkl), Bacteriocin T102/as48 (1e68), and the T102 best model; RMSD labels 3.6 Å and 2.9 Å]
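A minimal sketch of the RMSD measure described on this slide. It assumes the two structures are already optimally superposed (the superposition step itself is not shown), and the coordinates are hypothetical Cα positions chosen only for illustration.

```python
import math

def rmsd(coords_a, coords_b):
    """RMSD between two equal-length lists of (x, y, z) tuples.

    Assumes the two structures are already optimally superposed.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        # Squared distance between the paired atoms.
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))

# Hypothetical C-alpha coordinates (in Å) for a three-residue toy example.
model = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
native = [(0.0, 0.0, 1.0), (3.8, 0.0, 1.0), (7.6, 0.0, 1.0)]
print(rmsd(model, native))  # every atom is displaced by 1 Å, so the RMSD is 1.0
```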
12
Comparative modeling
• In general, accuracy of structure prediction depends on the percent amino acid identity shared between target and template.
• For >50% identity, RMSD is often only 1 Å.
13
Comparative modeling
Many web servers offer comparative modeling services. Examples are:
• SWISS-MODEL (ExPASy)
• PredictProtein server (Columbia)
• WHAT IF (CMBI, Netherlands)
14
Ab Initio Methods
• Ab initio: “From the beginning”.
• Assumption 1: All the information about the structure of a protein is contained in its sequence of amino acids.
• Assumption 2: The structure that a (globular) protein folds into is the structure with the lowest free energy.
• Finding native-like conformations requires:
– A scoring function (potential).
– A search strategy.
15
Ab initio protein structure prediction
Ab initio prediction can be performed when a protein has no detectable homologs.
Protein folding is modeled based on global free-energy minimum estimates.
16
Ab initio Prediction
• Sampling the global conformation space
– Lattice models / Discrete-state models
– Molecular Dynamics
• Picking native conformations with an energy function
– Solution model: how protein interacts with water
– Pair interactions between amino acids
• Predicting secondary structure
– Local homology
– Fragment libraries
17
ROSETTA
• ROSETTA is mainly an ab initio structure prediction algorithm, although various parts of it can be used for other purposes as well (such as homology modeling).
• Rationale
– Local structures often fold independently of full protein
– Can predict large areas of protein by matching sequence to I-Sites
• Developed by David Baker
18
Ab initio Prediction – ROSETTA
1. PSI-BLAST – homology search
Discard sequences with >25% homology
2. PHD
For each 3-long and each 9-long sequence fragment, get 25 structure fragments that match “well”
3. Markov-Chain Monte Carlo method
Insert and remove iteratively one short structure fragment at a time
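The fragment-insertion Monte Carlo step can be illustrated with a toy sketch. This is not ROSETTA's implementation: the fragment library, the energy function, the temperature and the function name `metropolis_fragment_search` are all hypothetical stand-ins, and the only part taken from the slide is the idea of repeatedly replacing one short fragment and accepting or rejecting the move.

```python
import math
import random

def metropolis_fragment_search(length, fragments, energy, steps=2000,
                               temperature=1.0, seed=0):
    """Toy Metropolis Monte Carlo search by fragment insertion.

    `fragments` is a list of equal-length tuples of angle-like values
    (a stand-in for a real fragment library); `energy` scores a full
    conformation, lower being better. Both are hypothetical inputs.
    """
    rng = random.Random(seed)
    frag_len = len(fragments[0])
    # Start from a conformation built from randomly chosen fragment values.
    current = [rng.choice(fragments)[0] for _ in range(length)]
    current_e = energy(current)
    best, best_e = list(current), current_e
    for _ in range(steps):
        # Propose: overwrite one window with a randomly chosen fragment.
        start = rng.randrange(length - frag_len + 1)
        trial = list(current)
        trial[start:start + frag_len] = rng.choice(fragments)
        trial_e = energy(trial)
        delta = trial_e - current_e
        # Metropolis criterion: always accept downhill moves, and accept
        # uphill moves with probability exp(-delta / temperature).
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current, current_e = trial, trial_e
            if current_e < best_e:
                best, best_e = list(current), current_e
    return best, best_e

# Hypothetical 3-residue fragments and a made-up energy that rewards
# neighbouring positions having similar values.
toy_fragments = [(-60, -60, -60), (-120, -120, -120), (-60, -120, -60)]

def toy_energy(conf):
    return sum(abs(a - b) for a, b in zip(conf, conf[1:])) / 60.0

best_conf, best_energy = metropolis_fragment_search(12, toy_fragments, toy_energy)
print(best_conf, best_energy)
```

Tracking the best conformation seen, rather than returning the last one, is a design choice of this sketch: the Metropolis rule can accept uphill moves, so the final state is not necessarily the lowest-energy one visited.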
19
Ab initio Prediction
20
Protein Threading
• The goal: find the “correct” sequence-structure alignment between a target sequence and its native-like fold in PDB
• Energy function: knowledge (or statistics) based rather than physics based
– Should be able to distinguish correct structural folds from incorrect structural folds
– Should be able to distinguish correct sequence-fold alignment from incorrect sequence-fold alignments
MTYKLILN …. NGVDGEWTYTE
21
Threading
• Threading is in-between homology-based prediction and molecular modeling
MTYKLILN …. NGVDGEWTYTE
Main difference between homology-based prediction and threading:
Threading uses the structure to compute energy function during alignment
22
Threading – Overview
• Build a structural template database
• Define a sequence–structure energy function
• Apply a threading algorithm to query sequence
• Perform local refinement of secondary structure
• Report best resulting structural model
23
Threading – Template Database
• FSSP, SCOP, CATH
• Remove pairs of proteins with highly similar structures
– Efficiency
– Statistical skew in favor of large families
24
Threading – Energy Function
MTYKLILNGKTKGETTTEAVDAATAEKVFQYANDNGVDGEWTYTE
• how often a residue mutates to the template residue: $E_m$
• how well a residue fits a structural environment: $E_s$
• how preferable to put two particular residues nearby: $E_p$
• alignment gap penalty: $E_g$
• compatibility with local secondary structure prediction: $E_{ss}$
• total energy: $E = w_m E_m + w_s E_s + w_p E_p + w_g E_g + w_{ss} E_{ss}$
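As a small worked illustration of the weighted sum on this slide, the sketch below combines the five terms for one candidate alignment. The term values and weights are hypothetical numbers, not taken from any threading program.

```python
def total_threading_energy(terms, weights):
    """Weighted sum E = sum_k w_k * E_k over the five named terms."""
    names = ("m", "s", "p", "g", "ss")
    return sum(weights[k] * terms[k] for k in names)

# Hypothetical term values and weights for one candidate alignment.
terms = {"m": -12.0, "s": -30.0, "p": -8.0, "g": 5.0, "ss": -4.0}
weights = {"m": 1.0, "s": 0.5, "p": 2.0, "g": 1.0, "ss": 0.25}
print(total_threading_energy(terms, weights))  # -12 - 15 - 16 + 5 - 1 = -39.0
```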
25
Protein Threading -- algorithm
• Threading algorithm: find a sequence-structure alignment with the minimum energy
– considering only singleton energy and gap penalty
– considering all three energy terms
[Figure labels: sequence, fold, links]
26
Protein Threading -- algorithm
• Iterative procedures, e.g. repeated 3D-profile alignment
• Double dynamic programming
• Integer programming
27
Assessing Prediction Reliability
MTYKLILNGKTKGETTTEAVDAATAEKVFQYANDNGVDGEWTYTE
Score = -1500, Score = -900, Score = -1120, Score = -720
Which one is the correct structural fold for the target sequence, if any?
The one with the highest score?
28
Assessing Prediction Reliability
Template #1: AATTAATACATTAATATAATAAAATTACTGA
Query sequence: AAAA
Template #2: CGGTAGTACGTAGTGTTTAGTAGCTATGAA
Better template?
Which of these two sequences will have a better chance of matching the query sequence well after being randomly reshuffled?
29
Assessing Prediction Reliability
• Different template structures may have different background scores, making direct comparison of threading scores against different templates invalid
• Comparison of threading results should be based on how much a score stands out from its background score distribution rather than on the threading scores directly
30
Assessing Prediction Reliability
Threading 100,000 sequences against a template structure provides the baseline information about the background scores of the template
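By locating where the threading score of a particular query sequence falls in this background distribution, one can decide how significant the score, and hence the threading result, is.
[Figure: background score distribution with regions labeled "Not significant" and "significant"; axis label: E-value]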
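The idea of judging a score against the template's background distribution can be sketched as follows. This uses a z-score and an empirical p-value rather than the E-value calculation of any particular threading program, and the background scores are simulated from a normal distribution purely for illustration.

```python
import random
import statistics

def background_significance(query_score, background_scores):
    """Return (z_score, empirical_p) of a threading score.

    Lower (more negative) scores are assumed to be better, as in the
    energy-style scores used in the slides.
    """
    mean = statistics.mean(background_scores)
    stdev = statistics.stdev(background_scores)
    z_score = (query_score - mean) / stdev
    # Fraction of background sequences scoring at least as well as the query,
    # with a +1 pseudo-count so the estimate is never exactly zero.
    as_good = sum(1 for s in background_scores if s <= query_score)
    empirical_p = (as_good + 1) / (len(background_scores) + 1)
    return z_score, empirical_p

# Hypothetical background: scores of 10,000 random sequences on one template.
rng = random.Random(1)
background = [rng.gauss(-700.0, 100.0) for _ in range(10000)]
z, p = background_significance(-1500.0, background)
print(round(z, 1), p)
```

The same raw score of -1500 would be unremarkable on a template whose background scores cluster around -1400, which is why scores against different templates are compared through their background distributions.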
31
Assessing Prediction Reliability
MTYKLILNGKTKGETTTEAVDAATAEKVFQYANDNGVDGEWTYTE
Score = -1500, E-value = e-1
Score = -900, E-value = e-21
Score = -1120, E-value = 0.5 e-1
Score = -720, E-value = e-2
If no predictions have significant e-values, a prediction program should indicate that it could not make a prediction!
32
Prediction of Protein Structures
• Threading against a template database
• Select the hits with good e-values, e.g., < e-10
• Place the backbone atoms of the template at the corresponding positions of the aligned residues
FMFTAIGEEVVQRSRKIL- - - DDLVELVK
AVLTRYGQRLIQLYDLLAQIQQKAFDVLS
Unaligned residues will not have 3D coordinates
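A minimal sketch of the coordinate-transfer step, assuming the alignment is given as two equal-length gapped strings and the template supplies one coordinate per residue. The alignment and coordinates below are made up for illustration, and `transfer_backbone` is a hypothetical helper name.

```python
def transfer_backbone(query_aln, template_aln, template_coords):
    """Map template coordinates onto aligned query residues.

    `query_aln` and `template_aln` are equal-length aligned strings with
    '-' for gaps; `template_coords` holds one entry per template residue.
    Returns a list with one entry per query residue, None where unaligned.
    """
    if len(query_aln) != len(template_aln):
        raise ValueError("aligned sequences must have equal length")
    query_coords = []
    t_index = 0
    for q_char, t_char in zip(query_aln, template_aln):
        if q_char != "-":
            # Query residue present: copy coordinates only if aligned.
            query_coords.append(template_coords[t_index] if t_char != "-" else None)
        if t_char != "-":
            t_index += 1
    return query_coords

# Toy alignment; coordinates are hypothetical placeholders, one per template residue.
query = "FMFTAIG--EEV"
template = "AVL-RYGQRLIQ"
coords = [(float(i), 0.0, 0.0) for i in range(len(template.replace("-", "")))]
result = transfer_backbone(query, template, coords)
print(result)
```

In this toy case the fourth query residue faces a gap in the template and therefore gets `None`, matching the note that unaligned residues have no 3D coordinates.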
33
Prediction of Protein Structures
• Protein threading can predict only the backbone structure of a protein (side-chains have to be predicted using other methods)
• Typically the lower the e-value, the higher the prediction accuracy
Blue: actual structure
Green: predicted structure
predicted actual
34
Prediction of Protein Structures
• Examples – a few good examples
actual predicted actual
actual actual
predicted
predicted predicted
35
Prediction of Protein Structures
• Not so good example
36
Prediction of Protein Structures
• State of the art: for ~50% of the soluble proteins in a microbial genome the fold could be predicted correctly, and perhaps 50% of these proteins have a good backbone structure prediction
• Functional inference could be made based on
– accurately predicted structures
– correctly identified structural folds
37
Prediction of Protein Structures
• All-atom structures could be predicted through
– prediction of backbone structure
– prediction of sidechain packing
• Backbone-dependent rotamers
• Ab initio prediction of sidechains
• State of the art: accurate prediction of side chains remains a challenging problem
38
Structure prediction using additional information
• Some structural information may be available before whole structure is solved
– disulfide bonds
– active sites
– residues identified as buried/exposed
– (partial) secondary structure
– partial NMR data
– inter-residue distances by cross-linking and mass spec
– overall shape derived from cryo-EM
– …
• These data can provide highly useful constraints on threading prediction
39
Structure prediction using additional information
• The basic idea
MTYKLILNGKTKGETTTEAVDAATAEKVFQYANDNGVDGEWTYTE
Distance or other types of constraints could be derived before the structure is solved, which could help make the structure prediction more accurate
40
Applications
• Many protein structures have been successfully predicted prior to the solution of their experimental structures (and were later verified by experimental structures)
• Structure predictions of all predicted genes in three microbial genomes: Synechococcus, Prochlorococcus MIT/MED
– ~60% of predicted genes have structural fold assignments
41
Existing Prediction Programs
• PROSPECT
– https://csbl.bmb.uga.edu/protein_pipeline
• FUGUE
– http://www-cryst.bioc.cam.ac.uk/~fugue/prfsearch.html
• THREADER
– http://bioinf.cs.ucl.ac.uk/threader/
42
CASP: Critical Assessment of Structure Prediction
• A decade of CASP: progress, bottlenecks and prognosis in protein structure prediction, John Moult
• First held in 1994, every 2 years afterwards
• Teams make structure predictions from sequences alone
43
CASP
• Two categories of predictors
– Automated
• Automatic servers, must complete analysis within 48 hours
• Shows what is possible through computer analysis alone
– Non-automated
• Groups spend considerable time and effort on each target
• Utilize computer techniques and human analysis techniques
44
CAFASP
GOAL
The goal of CAFASP is to evaluate the performance of fully automatic structure prediction servers available to the community. In contrast to the normal CASP procedure, CAFASP aims to answer the question of how well servers do without any intervention of experts, i.e. how well ANY user using only automated methods can predict protein structure. CAFASP assesses the performance of methods without the user intervention allowed in CASP.
45
Performance Evaluation in CAFASP3
Servers (54 in total)    Sum MaxSub Score    # correct (30 FR targets)
3ds5 robetta             5.17-5.25           15-17
pmod 3ds3 pmode3         4.21-4.36           13-14
RAPTOR                   3.98                13
shgu                     3.93                13
3dsn                     3.64-3.90           12-13
pcons3                   3.75                12
fugu3 orf_c              3.38-3.67           11-12
…                        …                   …
pdbblast                 0.00                0
(http://ww.cs.bgu.ac.il/~dfischer/CAFASP3, released in December, 2002.)
Servers with name in italic are meta servers
MaxSub score ranges from 0 to 1; therefore, maximum total score is 30
46
One structure where RAPTOR did best
Red: true structure
Blue: correct part of prediction
Green: wrong part of prediction
• Target size: 144
• Super-imposable size within 5 Å: 118
• RMSD: 1.9 Å
47
Some more results by other programs
48
Some more results by other programs
49
Some more results by other programs
50
Summary of current state of the art
51
Secondary Structure Prediction
• Given a protein sequence a1a2…aN, secondary structure prediction aims at defining the state of each amino acid ai as being either H (helix), E (extended=strand), or O (other) (Some methods have 4 states: H, E, T for turns, and O for other).
52
Measures used to evaluate secondary structure predictions
• Percentage of residues predicted ("PP"): the percentage of residues for which a secondary structure prediction was made (residues assigned a secondary structure with nonzero probability). This number is provided for reference.
53
Measures used to evaluate secondary structure predictions
• Qindex: Qindex (Qhelix, Qstrand, Qcoil, Q3) gives the percentage of residues predicted correctly as helix (H), strand (E), coil (C), or for all three conformational states.
– Qhelix ("Q_H")
– Qstrand ("Q_S")
– Qcoil ("Q_C")
– Q3 ("Q3")
54
Qindex
• For a single conformational state:
• where i is either helix, strand or coil.
• For all three states:
55
Limitations of Q3
Amino acid sequence:          ALHEASGPSVILFGSDVTVPPASNAEQAK
Actual secondary structure:   hhhhhooooeeeeoooeeeooooohhhhh
Prediction 1:                 ohhhooooeeeeoooooeeeooohhhhhh   Q3=22/29=76% (useful prediction)
Prediction 2:                 hhhhhoooohhhhooohhhooooohhhhh   Q3=22/29=76% (terrible prediction)
Q3 for random prediction is 33%
Secondary structure assignment in real proteins is uncertain to about 10%; Therefore, a “perfect” prediction would have Q3=90%.
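The Q3 comparison on this slide can be reproduced with a short sketch. The three state strings are the ones shown on the slide; the per-state helper `q_state` is an addition for illustration, showing what Q3 alone hides.

```python
def q3(actual, predicted):
    """Fraction of positions where the predicted state equals the actual state."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("need two non-empty state strings of equal length")
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / len(actual)

def q_state(actual, predicted, state):
    """Fraction of residues observed in `state` that are predicted as `state`."""
    positions = [i for i, a in enumerate(actual) if a == state]
    return sum(1 for i in positions if predicted[i] == state) / len(positions)

actual = "hhhhhooooeeeeoooeeeooooohhhhh"
useful = "ohhhooooeeeeoooooeeeooohhhhhh"
terrible = "hhhhhoooohhhhooohhhooooohhhhh"

# Both predictions match the actual structure at 22 of 29 positions.
print(round(100 * q3(actual, useful)), round(100 * q3(actual, terrible)))  # 76 76
# Per-state strand accuracy separates them: 5/7 versus 0/7.
print(q_state(actual, useful, "e"), q_state(actual, terrible, "e"))
```

The second prediction reaches the same Q3 while calling both strands helices, which is why a single Q3 value cannot distinguish a useful prediction from a misleading one.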
56
Early methods for Secondary Structure Prediction
• Chou and Fasman (Chou and Fasman. Prediction of protein conformation. Biochemistry, 13: 211-245, 1974)
• GOR (Garnier, Osguthorpe and Robson. Analysis of the accuracy and implications of simple methods for predicting the secondary structure of globular proteins. J. Mol. Biol., 120:97-120, 1978)
57
Chou and Fasman
• Start by computing amino acid propensities to belong to a given type of secondary structure:

$$P(\alpha) = \frac{P(i/\text{Helix})}{P(i)}, \qquad P(\beta) = \frac{P(i/\text{Beta})}{P(i)}, \qquad P(\text{turn}) = \frac{P(i/\text{Turn})}{P(i)}$$

Propensities > 1 mean that residue type i is likely to be found in the corresponding secondary structure type.
58
Chou and Fasman
Amino Acid   α-Helix   β-Sheet   Turn   Favors
Ala          1.29      0.90      0.78   α-helix
Cys          1.11      0.74      0.80   α-helix
Leu          1.30      1.02      0.59   α-helix
Met          1.47      0.97      0.39   α-helix
Glu          1.44      0.75      1.00   α-helix
Gln          1.27      0.80      0.97   α-helix
His          1.22      1.08      0.69   α-helix
Lys          1.23      0.77      0.96   α-helix
Val          0.91      1.49      0.47   β-strand
Ile          0.97      1.45      0.51   β-strand
Phe          1.07      1.32      0.58   β-strand
Tyr          0.72      1.25      1.05   β-strand
Trp          0.99      1.14      0.75   β-strand
Thr          0.82      1.21      1.03   β-strand
Gly          0.56      0.92      1.64   turn
Ser          0.82      0.95      1.33   turn
Asp          1.04      0.72      1.41   turn
Asn          0.90      0.76      1.23   turn
Pro          0.52      0.64      1.91   turn
Arg          0.96      0.99      0.88   -
59
Chou and Fasman
Predicting helices:
- find nucleation site: 4 out of 6 contiguous residues with P(α) > 1
- extension: extend helix in both directions until a set of 4 contiguous residues has an average P(α) < 1 (breaker)
- if average P(α) over whole region is > 1, it is predicted to be helical
Predicting strands:
- find nucleation site: 3 out of 5 contiguous residues with P(β) > 1
- extension: extend strand in both directions until a set of 4 contiguous residues has an average P(β) < 1 (breaker)
- if average P(β) over whole region is > 1, it is predicted to be a strand
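The helix nucleation rule can be sketched as below. Only the nucleation step is shown (extension and the final average check are omitted), the helix propensities are copied from the table in this transcript, and the function name and window parameters are illustrative choices.

```python
# Helix propensities P(alpha) copied from the table in this transcript.
P_ALPHA = {
    "A": 1.29, "C": 1.11, "L": 1.30, "M": 1.47, "E": 1.44, "Q": 1.27,
    "H": 1.22, "K": 1.23, "V": 0.91, "I": 0.97, "F": 1.07, "Y": 0.72,
    "W": 0.99, "T": 0.82, "G": 0.56, "S": 0.82, "D": 1.04, "N": 0.90,
    "P": 0.52, "R": 0.96,
}

def helix_nucleation_sites(sequence, window=6, min_formers=4):
    """Return start indices of windows with >= min_formers residues having P(alpha) > 1."""
    sites = []
    for start in range(len(sequence) - window + 1):
        segment = sequence[start:start + window]
        formers = sum(1 for residue in segment if P_ALPHA[residue] > 1.0)
        if formers >= min_formers:
            sites.append(start)
    return sites

# Sequence reused from the Q3 example elsewhere in this transcript.
print(helix_nucleation_sites("ALHEASGPSVILFGSDVTVPPASNAEQAK"))
```

For this sequence the nucleation windows fall at both ends, where the actual structure is helical; the same function with `window=5`, `min_formers=3` and a β-propensity table would cover the strand rule.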
60
Chou and Fasman
Position-specific parameters for turn: each position has distinct amino acid preferences.
Examples:
- At position 2, Pro is highly preferred; Trp is disfavored
- At position 3, Asp, Asn and Gly are preferred
- At position 4, Trp, Gly and Cys are preferred
f(i) f(i+1) f(i+2) f(i+3)
61
Chou and Fasman
Predicting turns:
- for each tetrapeptide starting at residue i, compute:
  - PTurn (average propensity over all 4 residues)
  - F = f(i)*f(i+1)*f(i+2)*f(i+3)
- if PTurn > P(α) and PTurn > P(β) and PTurn > 1 and F > 0.000075, the tetrapeptide is considered a turn.
Chou and Fasman prediction:
http://fasta.bioch.virginia.edu/fasta_www/chofas.htm
62
The GOR method
Position-dependent propensities for helix, sheet or turn are calculated for each amino acid. For each position j in the sequence, eight residues on either side are considered.
A helix propensity table contains information about the propensity for residues at 17 positions when the conformation of residue j is helical. The helix propensity table has 20 x 17 entries. Similar tables are built for strands and turns.
GOR simplification: for each state, the position-dependent propensities of all residues around AAj are summed, and AAj is assigned the state with the highest sum.
GOR can be used at: http://abs.cit.nih.gov/gor/ (current version is GOR IV)
63
Accuracy
• Both Chou and Fasman and GOR have been assessed and their accuracy is estimated to be Q3=60-65%.
(initially, higher scores were reported, but the experiments set to measure Q3 were flawed, as the test cases included proteins used to derive the propensities!)
64
Secondary Structure Prediction
Available servers:
- JPRED: http://www.compbio.dundee.ac.uk/~www-jpred/
- PHD: http://cubic.bioc.columbia.edu/predictprotein/
- PSIPRED: http://bioinf.cs.ucl.ac.uk/psipred/
- NNPREDICT: http://www.cmpharm.ucsf.edu/~nomi/nnpredict.html
- Chou and Fasman: http://fasta.bioch.virginia.edu/fasta_www/chofas.htm