Protein Fold recognition Morten Nielsen, Thomas Nordahl CBS, BioCentrum, DTU.
Date posted: 20-Dec-2015
Protein Fold recognition
Morten Nielsen,Thomas Nordahl
CBS, BioCentrum, DTU
Introduction: What is a protein fold
• Protein fold
• Protein sequence id
• Protein sequence/structure databases
• Alignment values
  – Scores, E-values & P-values
• Protein classifications
  – Fold, Superfamily, Family & protein
Introduction: What is a protein fold
• A protein fold is the scaffold that can be used as a template to model a query protein sequence.
• Fold recognition is a technique used to identify the scaffold to be used, from a known protein structure. The sequence similarity is low, and the fold is therefore difficult to recognize with simple sequence alignment tools (BLOSUM62 matrix).
Outline
• Many textbooks and experts state that %ID is the only determining factor for successful homology modeling
• This is WRONG!
• %ID is a very poor measure to determine if a protein can be modeled
• Many sequences with sequence identity of ~10-15% can be accurately modeled
Outline
• Why homology modeling
• How is it done
• How to decide when to use homology modeling
  – Why is %id such a terrible measure
• What are the best methods
Why protein modeling?
• Because it works!
  – Close to 50% of all new sequences can be homology modeled
• Experimental effort to determine protein structure is very large and costly
• The gap between the size of the protein sequence data and protein structure data is large and increasing
Homology modeling and the human genome
• Human genome: ~30,000 proteins
• Swiss-Prot database: ~200,000 sequences in Swiss-Prot, ~2,000,000 if TrEMBL is included
PDB New Fold Growth
[Figure: new PDB structures, split into new folds and old folds]
Identification of fold
Rajesh Nair & Burkhard Rost Protein Science, 2002, 11, 2836-47
Why %id is so bad!!
1200 models sharing 25-95% sequence identity with the submitted sequences
(www.expasy.ch/swissmod)
Identification of correct fold
• % ID is a poor measure
  – Many evolutionarily related proteins share low sequence identity
• Alignment score even worse
  – Many sequences will score high against everything (hydrophobic stretches)
• P-value or E-value more reliable
What are P and E values?
• E-value
  – Number of expected hits in database with score higher than match
  – Depends on database size
• P-value
  – Probability that a random hit will have score higher than match
  – Database size independent
[Figure: distribution P(Score) plotted against Score]
Example: the match has score 150. 10 hits have a higher score (E = 10). With 10000 hits in the database, P = 10/10000 = 0.001.
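The relation between the two values can be sketched in a few lines of Python. This is only an illustration of the counting definition used on the slide (E is the number of hits scoring above the match, P is that number divided by the number of hits in the database); the function name and the toy score list are made up for the example, and real search tools estimate these values from a fitted score distribution rather than by counting.

```python
def e_and_p_value(database_scores, match_score):
    """Empirical E- and P-value, following the counting definition on the slide.

    database_scores: scores of all (random) hits in the database (toy data here)
    match_score:     score of the match being evaluated
    """
    # E-value: number of hits with a score higher than the match
    n_higher = sum(1 for s in database_scores if s > match_score)
    e_value = n_higher
    # P-value: fraction of hits scoring higher, so database size drops out
    p_value = n_higher / len(database_scores)
    return e_value, p_value


# Toy database reproducing the slide's numbers:
# 10000 hits, 10 of which score above the match score of 150
toy_scores = [200] * 10 + [50] * 9990
e, p = e_and_p_value(toy_scores, 150)
print(e, p)
```

Doubling the toy database with the same score distribution doubles E but leaves P unchanged, which is the database size dependence stated above.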
Protein classifications
Protein structure classification
• Protein world
• Protein fold
• Protein superfamily
• Protein family
[Figure label: New Fold]
Superfamilies
Proteins which are (remotely) evolutionarily related
  – Sequence similarity low
  – Share function
  – Share special structural features
  – Same evolutionary ancestor
Relationships between members of a superfamily may not be readily recognizable from the sequence alone
[Figure: classification levels: Fold, Superfamily, Family, Proteins]
Template identification
Simple sequence based methods
  – Align (BLAST) sequence against sequence of proteins with known structure (PDB database)
Sequence profile based methods
  – Align sequence profile (Psi-BLAST) against sequence of proteins with known structure (PDB)
  – Align sequence profile against profile of proteins with known structure (FFAS)
Sequence and structure based methods
  – Align profile and predicted secondary structure against proteins with known structure (3D-PSSM)
Sequence profiles
In conventional alignment, a scoring matrix (BLOSUM62) gives the score for matching two amino acids
  – In reality not all positions in a protein are equally likely to mutate
  – Some amino acids (active sites) are highly conserved, and the penalty for a mismatch must be very high
  – Other amino acids can mutate almost for free, and the penalty for a mismatch is smaller than the BLOSUM penalty
Sequence profiles (just like an HMM) can capture these differences
Sequence profiles/blosum62 scores
Which alignment is most correct, a) or b)?
   TVNGQ--FPGPRLAGVAREGDQVLVKVVNHVAENITIHWHGVQLGTGWADGPAYVTQCPI
a) TKAVVLTFNTSVEICLVMQGTSIV----AAESHPLHLHGFNFPSNFNLVDPMERNTAGVP
b) TKAVVLTFNTSVEICLVMQ-GTSIVAAESHPLHLHGFNFPSNFNLVDPMERNTAGVP
Blosum62 scores: G-G: 6, H-H: 8
Blosum scoring matrix
   A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
A  4 -1 -2 -2  0 -1 -1  0 -2 -1 -1 -1 -1 -2 -1  1  0 -3 -2  0
R -1  5  0 -2 -3  1  0 -2  0 -3 -2  2 -1 -3 -2 -1 -1 -3 -2 -3
N -2  0  6  1 -3  0  0  0  1 -3 -3  0 -2 -3 -2  1  0 -4 -2 -3
D -2 -2  1  6 -3  0  2 -1 -1 -3 -4 -1 -3 -3 -1  0 -1 -4 -3 -3
C  0 -3 -3 -3  9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1
Q -1  1  0  0 -3  5  2 -2  0 -3 -2  1  0 -3 -1  0 -1 -2 -1 -2
E -1  0  0  2 -4  2  5 -2  0 -3 -3  1 -2 -3 -1  0 -1 -3 -2 -2
G  0 -2  0 -1 -3 -2 -2  6 -2 -4 -4 -2 -3 -3 -2  0 -2 -2 -3 -3
H -2  0  1 -1 -3  0  0 -2  8 -3 -3 -1 -2 -1 -2 -1 -2 -2  2 -3
I -1 -3 -3 -3 -1 -3 -3 -4 -3  4  2 -3  1  0 -3 -2 -1 -3 -1  3
L -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4 -2  2  0 -3 -2 -1 -2 -1  1
K -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5 -1 -3 -1  0 -1 -3 -2 -2
M -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5  0 -2 -1 -1 -1 -1  1
F -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6 -4 -2 -2  1  3 -1
P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7 -1 -1 -4 -3 -2
S  1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4  1 -3 -2 -2
T  0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5 -2 -2  0
W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11  2 -3
Y -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7 -1
V  0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4
Sequence profiles
ADDGSLAFVPSEF--SISPGEKIVFKNNAGFPHNIVFDEDSIPSGVDASKISMSEEDLLN
TVNGAI--PGPLIAERLKEGQNVRVTNTLDEDTSIHWHGLLVPFGMDGVPGVSFPG---I
-TSMAPAFGVQEFYRTVKQGDEVTVTIT-----NIDQIED-VSHGFVVVNHGVSME---I
IE--KMKYLTPEVFYTIKAGETVYWVNGEVMPHNVAFKKGIV--GEDAFRGEMMTKD---
-TSVAPSFSQPSF-LTVKEGDEVTVIVTNLDE------IDDLTHGFTMGNHGVAME---V
ASAETMVFEPDFLVLEIGPGDRVRFVPTHK-SHNAATIDGMVPEGVEGFKSRINDE----
TVNGQ--FPGPRLAGVAREGDQVLVKVVNHVAENITIHWHGVQLGTGWADGPAYVTQCPI

TKAVVLTFNTSVEICLVMQGTSIV----AAESHPLHLHGFNFPSNFNLVDPMERNTAGVP

• Conserved position: matching anything but G => large negative score
• Non-conserved position: anything can match
Sequence profiles
1. Align (BLAST) sequence against large sequence database (Swiss-Prot)
2. Select significant alignments and make profile (weight matrix) using techniques for sequence weighting and pseudo counts
3. Use weight matrix to align against sequence database to find new significant hits
4. Repeat 2 and 3 (normally 3 times!)
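The profile-building step can be sketched as follows. This is a minimal illustration, not the Psi-BLAST implementation: it assumes a uniform background frequency, uses a single flat pseudo count per amino acid, and leaves out sequence weighting entirely. The function name and the toy alignment are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"


def build_profile(alignment, pseudocount=1.0):
    """Build a position specific weight matrix from aligned sequences.

    Returns one dict per alignment column mapping amino acid -> log-odds score.
    Assumptions (not from the slides): uniform background frequencies,
    a flat pseudo count, and no sequence weighting.
    """
    background = 1.0 / len(AMINO_ACIDS)
    profile = []
    for column in zip(*alignment):
        # Count observed amino acids; gaps and unknown symbols are skipped
        counts = Counter(aa for aa in column if aa in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        scores = {}
        for aa in AMINO_ACIDS:
            freq = (counts[aa] + pseudocount) / total
            scores[aa] = math.log2(freq / background)
        profile.append(scores)
    return profile


# Toy alignment: column 0 is fully conserved (G), column 1 varies (A/V/L/I)
toy_alignment = ["GAK", "GVR", "GLE", "GIT"]
profile = build_profile(toy_alignment)
print(round(profile[0]["G"], 2), round(profile[0]["A"], 2), round(profile[1]["A"], 2))
```

In the conserved column the score for G is clearly positive and every other amino acid is negative, while in the variable column several amino acids get a modest positive score. With more sequences and smaller pseudo counts the contrast becomes sharper.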
Example. Sequence profiles
• Alignment of protein sequences 1PLC._ and 1GYC.A
  – E-value > 1000
• Profile alignment
  – Align 1PLC._ against Swiss-Prot
  – Make position specific weight matrix from alignment
  – Use this matrix to align 1PLC._ against 1GYC.A
• E-value < 10^-22. RMSD = 3.3 Å
Sequence profiles
Score = 97.1 bits (241), Expect = 9e-22
Identities = 13/107 (12%), Positives = 27/107 (25%), Gaps = 17/107 (15%)

1PLC._:  3 ADDGSLAFVPSEFSISPGEKI------VFKNNAGFPHNIVFDEDSIPSGVDASKIS 56
1GYC.A: 26 ------VFPSPLITGKKGDRFQLNVVDTLTNHTMLKSTSIHWHGFFQAGTNWADGP 79

1PLC._: 57 MSEEDLLNAKGETFEVAL---SNKGEYSFYCSP--HQGAGMVGKVTV 98
1GYC.A: 80 AFVNQCPIASGHSFLYDFHVPDQAGTFWYHSHLSTQYCDGLRGPFVV 126

[Figure: structural superposition, RMSD = 3.3 Å. Structure in red, template in blue]
Sequence logo / Sequence profile
[Figure: sequence logos after 0 iterations (Blosum62), 1 iteration, 2 iterations and 3 iterations]
Profile-profile alignment
[Figure: query profile aligned against template profile]
Compare amino acid preference for the two proteins and pair similar positions (HHpred)
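One way to compare the amino acid preference of two profile columns is a log-odds co-emission score: the probability that both columns emit the same amino acid, relative to background. The sketch below is illustrative only and is not claimed to be the exact scoring used by HHpred; the three-letter alphabet, the uniform background and all frequencies are toy values chosen for the example.

```python
import math


def column_score(query_col, template_col, background):
    """Log-odds score for pairing one query column with one template column.

    Each column is a dict amino acid -> frequency (summing to 1).
    The score is positive when the two columns prefer the same amino acids
    more often than expected from the background frequencies.
    """
    co_emission = sum(
        query_col[aa] * template_col[aa] / background[aa] for aa in background
    )
    return math.log2(co_emission)


# Toy three-letter alphabet with uniform background (assumption for illustration)
background = {"G": 1 / 3, "A": 1 / 3, "V": 1 / 3}
query_col = {"G": 0.8, "A": 0.1, "V": 0.1}
similar_col = {"G": 0.7, "A": 0.2, "V": 0.1}
different_col = {"G": 0.1, "A": 0.1, "V": 0.8}

score_similar = column_score(query_col, similar_col, background)
score_different = column_score(query_col, different_col, background)
print(round(score_similar, 2), round(score_different, 2))
```

Columns with similar preferences get a positive score and columns with different preferences a negative one, so a standard dynamic programming alignment over these column scores will pair similar positions.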
Including structure
• Sequences within a protein superfamily share only remote sequence similarity, but they share high structural similarity
• Structure is known for template
• Predict structural properties for query
  – Secondary structure
  – Surface exposure
• Position specific gap penalties derived from secondary structure and surface exposure
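A sketch of how position specific gap penalties might be derived is given below. The idea illustrated is that gaps should be expensive inside buried helices and strands and cheap in exposed coil regions, where insertions and deletions are more readily tolerated. The function name, the base penalty and the scaling factors are all hypothetical values chosen for illustration, not parameters of any published method.

```python
def gap_open_penalties(secondary_structure, exposure, base=10.0):
    """Per-position gap-open penalties from structural properties.

    secondary_structure: string of H (helix), E (strand) or C (coil)
    exposure:            relative surface exposure per position, 0 (buried) to 1 (exposed)
    base:                hypothetical base gap-open penalty
    """
    penalties = []
    for ss, exp in zip(secondary_structure, exposure):
        # Gaps inside helices and strands are penalized more than gaps in coil
        ss_factor = 1.5 if ss in "HE" else 0.5
        # Exposed positions tolerate gaps better than buried ones
        exposure_factor = 1.0 - 0.5 * exp
        penalties.append(base * ss_factor * exposure_factor)
    return penalties


# Two buried helix positions followed by two exposed coil positions
penalties = gap_open_penalties("HHCC", [0.0, 0.0, 1.0, 1.0])
print(penalties)
```

For the template these properties can be read from the known structure; for the query they would come from predictions.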
CASP. Which are the best methods
• Critical Assessment of Structure Predictions
• Every second year
• Sequences from about-to-be-solved structures are given to groups who submit their predictions before the structure is published
• Modelers make prediction
• Meeting in December where correct answers are revealed
CASP6 results
The top 4 homology modeling groups in CASP6
• All winners use consensus predictions
  – The wisdom of the crowd
• Same approach as in CASP5!
• Nothing has happened in 2 years!
The wisdom of the crowd!
– Why the many are smarter than the few
– A general method useful to improve prediction accuracy
– No single method or expert will always be the best
The wisdom of the crowd!
  – The highest scoring hit will often be wrong
    • Not one single prediction method is consistently best
  – Many prediction methods will have the correct fold among the top 10-20 hits
  – If many different prediction methods all have same fold among the top hits, this fold is probably correct
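The consensus idea can be sketched as a simple vote: each method contributes one vote for every fold among its top hits, and the fold with most votes wins. This is a minimal illustration and not the algorithm of any particular meta server; the method names, fold labels and the `top_n` cut-off are made up for the example.

```python
from collections import Counter


def consensus_fold(predictions, top_n=10):
    """Rank folds by how many methods list them among their top hits.

    predictions: dict method name -> list of fold identifiers, best hit first
    top_n:       number of top hits considered per method
    Returns a list of (fold, number of votes), most supported fold first.
    """
    votes = Counter()
    for ranked_folds in predictions.values():
        # Each method votes at most once per fold
        for fold in set(ranked_folds[:top_n]):
            votes[fold] += 1
    return votes.most_common()


# Toy example: every method has a different top hit,
# but fold_B appears among the top hits of all three
toy_predictions = {
    "method_a": ["fold_X", "fold_B", "fold_C"],
    "method_b": ["fold_Y", "fold_B"],
    "method_c": ["fold_Z", "fold_A", "fold_B"],
}
ranking = consensus_fold(toy_predictions)
print(ranking[0])
```

In the toy data no method ranks fold_B first, yet it is the only fold all three agree on, which is the situation described above.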
How to do it? Where is the crowd
• Meta prediction server
  – Web interface to a list of public protein structure prediction servers
  – Submit query sequence to all selected servers in one go
http://bioinfo.pl/meta/
Meta Server
From fold to structure
Flying to the moon has not made man conquer space
Finding the right fold does not allow you to make accurate protein models
  – Can allow prediction of protein function
Alignment is still a very hard problem
  – Most protein interactions are determined by the loops, and they are the least conserved parts of a protein structure
Modeling of new protein folds
• Only when everything else fails
• Challenge
• Close to impossible to model Nature's folding potential
Ab initio protein modeling
[Figure: fragments with correct local structure; Nature's potential compared with empirical potential]
A way to solution
• Glue structure piecewise from fragments.
• Guide process by empirical/statistical potential
Example (Rosetta web server)
[Figure: Rosetta prediction and structure]
www.bioinfo.rpi.edu/~bystrc/hmmstr/server.php
Take home message
• Identifying the correct fold is only a small step towards successful homology modeling
• Do not trust % ID or alignment score to identify the fold. Use p-values
• Use sequence profiles and local protein structure to align sequences
• Do not trust one single prediction method, use consensus methods (3D Jury)
• Only if everything else fails, use ab initio methods