Hidden Markov Models, HMM’s Morten Nielsen, CBS, Department of Systems Biology, DTU.
Protein Fold recognition Morten Nielsen, CBS, BioCentrum, DTU.
-
date post
19-Dec-2015 -
Category
Documents
-
view
221 -
download
1
Transcript of Protein Fold recognition Morten Nielsen, CBS, BioCentrum, DTU.
Objectives
• Understand the basic concepts of fold recognition
• Learn why even sequences with very low sequence similarity can be modeled– Understand why is %id such a terrible
measure for reliability
• See the beauty of sequence profiles– Position specific scoring matrices (PSSMs)
Protein Homology modeling?
• Identify template(s) – initial alignment• Can give you protein function
• Improve alignment• Can give you active site
• Backbone generation• Loop modeling
• Most difficult part• Side chains• Refinement• Validation
How to do it?
Identify fold (template) for modeling– Find the structure in
the PDB database that resembles your new protein the most
– Can be used to predict function
– And maybe active sites
Align protein sequence to template– Simple alignment
methods– Sequence profiles– Threading methods– Pseudo force fields
Model side chains and loops
Identification of fold
If sequence similarity is high proteins share structure (Safe zone)
If sequence similarity is low proteins may share structure (Twilight zone)
Most proteins do not have a high sequence homologous partner
Rajesh Nair & Burkhard Rost Protein Science, 2002, 11, 2836-47
Example.
>1K7C.A TTVYLAGDSTMAKNGGGSGTNGWGEYLASYLSATVVNDAVAGRSARSYTREGRFENIADVVTAGDYVIVEFGHNDGGSLSTDNGRTDCSGTGAEVCYSVYDGVNETILTFPAYLENAAKLFTAKGAKVILSSQTPNNPWETGTFVNSPTRFVEYAELAAEVAGVEYVDHWSYVDSIYETLGNATVNSYFPIDHTHTSPAGAEVVAEAFLKAVVCTGTSLKSVLTTTSFEGTCL
• What is the function• Where is the active site?
A post doc in our group did her PhD obtaining the structure of the sequence below
What would you do?
• Function• Run Blast against PDB
• No significant hits
• Run Blast against NR (Sequence database)• Function is Acetylesterase?
• Where is the active site?
Example. Where is the active site?
• Align sequence against structures of known acetylesterase, like• 1WAB, 1FXW, …
• Cannot be aligned. Too low sequence similarity
1K7C.A 1WAB._ RMSD 11.2397QAL 1K7C.A 71 GHNDGGSLSTDNGRTDCSGTGAEVCYSVYDGVNETILTFDAL 1WAB._ 160 GHPRAHFLDADPGFVHSDGTISH--HDMYDYLHLSRLGY
Is it really impossible?
Protein homology modeling is only possibleif %id greater than 30-50%
WRONG!!!
!!!!
Why %id is so bad!!
1200 models sharing 25-95% sequence identity with the submitted sequences (www.expasy.ch/swissmod)
Identification of correct fold
• % ID is a poor measure– Many evolutionary related proteins
share low sequence homology– A short alignment of 5 amino acids can
share 100% id, what does this mean?• Alignment score even worse
– Many sequences will score high against every thing (hydrophobic stretches)
• P-value or E-value more reliable
What are P and E values?
• E-value– Number of expected hits
in database with score higher than match
– Depends on database size
• P-value – Probability that a
random hit will have score higher than match
– Database size independent Score
P(S
core
)
Score 15010 hits with higher score (E=10)10000 hits in database => P=10/10000 = 0.001
What goes wrong when Blast fails?
• Conventional sequence alignment uses a (Blosum) scoring matrix to identify amino acids matches in the two protein sequences
Blosum scoring matrix
A R N D C Q E G H I L K M F P S T W Y VA 4 -1 -2 -2 0 -1 -1 0 -2 -1 -1 -1 -1 -2 -1 1 0 -3 -2 0R -1 5 0 -2 -3 1 0 -2 0 -3 -2 2 -1 -3 -2 -1 -1 -3 -2 -3N -2 0 6 1 -3 0 0 0 1 -3 -3 0 -2 -3 -2 1 0 -4 -2 -3D -2 -2 1 6 -3 0 2 -1 -1 -3 -4 -1 -3 -3 -1 0 -1 -4 -3 -3C 0 -3 -3 -3 9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1Q -1 1 0 0 -3 5 2 -2 0 -3 -2 1 0 -3 -1 0 -1 -2 -1 -2E -1 0 0 2 -4 2 5 -2 0 -3 -3 1 -2 -3 -1 0 -1 -3 -2 -2G 0 -2 0 -1 -3 -2 -2 6 -2 -4 -4 -2 -3 -3 -2 0 -2 -2 -3 -3H -2 0 1 -1 -3 0 0 -2 8 -3 -3 -1 -2 -1 -2 -1 -2 -2 2 -3I -1 -3 -3 -3 -1 -3 -3 -4 -3 4 2 -3 1 0 -3 -2 -1 -3 -1 3L -1 -2 -3 -4 -1 -2 -3 -4 -3 2 4 -2 2 0 -3 -2 -1 -2 -1 1K -1 2 0 -1 -3 1 1 -2 -1 -3 -2 5 -1 -3 -1 0 -1 -3 -2 -2M -1 -1 -2 -3 -1 0 -2 -3 -2 1 2 -1 5 0 -2 -1 -1 -1 -1 1F -2 -3 -3 -3 -2 -3 -3 -3 -1 0 0 -3 0 6 -4 -2 -2 1 3 -1P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4 7 -1 -1 -4 -3 -2S 1 -1 1 0 -1 0 0 0 -1 -2 -2 0 -1 -2 -1 4 1 -3 -2 -2T 0 -1 0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1 1 5 -2 -2 0W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1 1 -4 -3 -2 11 2 -3Y -2 -2 -2 -3 -2 -1 -2 -3 2 -1 -1 -2 -1 3 -3 -2 -2 2 7 -1V 0 -3 -3 -3 -1 -2 -2 -3 -3 3 1 -2 1 -1 -2 -2 0 -3 -1 4
What goes wrong when Blast fails?
• Conventional sequence alignment uses a (Blosum) scoring matrix to identify amino acids matches in the two protein sequences• This scoring matrix is identical at all positions in the protein sequence!
EVVFIGDSLVQLMHQC
X X X
X X X
AGDS.GGGDS
Alignment accuracy. Scoring functions
• Blosum62 score matrix. Fg=1. Ng=0?
• Score =2+6+6+4-1=17• Alignment
L A G D S D
F 0 -2 -3 -3 -2 -3
I 2 1 -4 -3 -2 -3
G -4 0 6 -1 0 -1
D -4 -2 -1 6 0 6
S -2 1 0 0 4 0
L 4 -1 -4 -4 -2 -4
LAGDSI-GDS
Sequence profiles
• In reality not all positions in a protein are equally likely to mutate
• Some amino acids (active cites) are highly conserved, and the score for mismatch must be very high
• Other amino acids can mutate almost for free, and the score for mismatch should be lower than the BLOSUM score
• Sequence profiles can capture these differences
Protein world
Protein fold
Protein structure classification
Protein superfamily
Protein familyNew Fold
ADDGSLAFVPSEF--SISPGEKIVFKNNAGFPHNIVFDEDSIPSGVDASKISMSEEDLLN TVNGAI--PGPLIAERLKEGQNVRVTNTLDEDTSIHWHGLLVPFGMDGVPGVSFPG---I-TSMAPAFGVQEFYRTVKQGDEVTVTIT-----NIDQIED-VSHGFVVVNHGVSME---IIE--KMKYLTPEVFYTIKAGETVYWVNGEVMPHNVAFKKGIV--GEDAFRGEMMTKD----TSVAPSFSQPSF-LTVKEGDEVTVIVTNLDE------IDDLTHGFTMGNHGVAME---VASAETMVFEPDFLVLEIGPGDRVRFVPTHK-SHNAATIDGMVPEGVEGFKSRINDE----TVNGQ--FPGPRLAGVAREGDQVLVKVVNHVAENITIHWHGVQLGTGWADGPAYVTQCPI
Sequence profiles
Conserved
Non-conserved
Matching any thing but G => large negative score
Any thing can match
TKAVVLTFNTSVEICLVMQGTSIV----AAESHPLHLHGFNFPSNFNLVDPMERNTAGVP
TVNGQ--FPGPRLAGVAREGDQVLVKVVNHVAENITIHWHGVQLGTGWADGPAYVTQCPI
How to make sequence profiles
• Align (BLAST) sequence against large sequence database (Swiss-Prot)
• Select significant alignments and make profile (weight matrix) using techniques for sequence weighting and pseudo counts
• Use weight matrix to align against sequence database to find new significant hits
• Repeat 2 and 3 (normally 3 times!)
Example. Where is the active site?
• Sequence profiles might show you where to look!• The active site could be around
• S9, G42, N74, and H195
Example. Where is the active site?
Align using sequence profiles
ALN 1K7C.A 1WAB._ RMSD = 5.29522. 14% ID1K7C.A TVYLAGDSTMAKNGGGSGTNGWGEYLASYLSATVVNDAVAGRSARSYTREGRFENIADVVTAGDYVIVEFGHNDGGSLSTDN S G N1WAB._ EVVFIGDSLVQLMHQCE---IWRELFS---PLHALNFGIGGDSTQHVLW--RLENGELEHIRPKIVVVWVGTNNHG------
1K7C.A GRTDCSGTGAEVCYSVYDGVNETILTFPAYLENAAKLFTAK--GAKVILSSQTPNNPWETGTFVNSPTRFVEYAEL-AAEVA1WAB._ ---------------------HTAEQVTGGIKAIVQLVNERQPQARVVVLGLLPRGQ-HPNPLREKNRRVNELVRAALAGHP
1K7C.A GVEYVDHWSYVDSIYETLGNATVNSYFPIDHTHTSPAGAEVVAEAFLKAVVCTGTSL H1WAB._ RAHFLDADPG---FVHSDG--TISHHDMYDYLHLSRLGYTPVCRALHSLLLRL---L
Alignment accuracy
Alignment performance
0.000
0.050
0.100
0.150
0.200
0.250
0.300
0.350
0.400
0.450
Fractional n4
Train 0.259 0.393 0.417
Test 0.212 0.348 0.386
Blast Profile Profile+SS
AUC performance measure
Query Templ Score Hit/nonhit1CJ0.A 1B78.A 0.170963 0 1CJ0.A 1B8A.A -0.040029 0 1CJ0.A 1B8B.A -0.012789 0 1CJ0.A 1B8G.A 12.342823 1 1CJ0.A 1B9H.A 13.394361 1 1CJ0.A 1BAR.A -1.281068 0 1CJ0.A 1BAV.C -1.091305 0
Query Templ Score Hit/nonhit1CJ0.A 1B8G.A 12.342823 1 1CJ0.A 1DTY.A 11.867786 1 1CJ0.A 1DGD._ 11.271914 1 1CJ0.A 1GTX.A 11.010288 1 1CJ0.A 2GSA.A 10.958170 1 1CJ0.A 1BW9.A 2.651775 0 1CJ0.A 1AUP._ 2.507336 1 1CJ0.A 1GTM.A 2.444512 0
AUC (area under the ROC curve)
Fold recognition performance
Fold recognition
0.700
0.750
0.800
0.850
0.900
0.950
1.000
AUC
Per Protein 0.971 0.888 0.809
Prof-Prof PDB-blast Blast
Outlook
• Include position dependent gap penalties• The conventional alignment methods use
equal gap penalties through out the scoring matrix
• In real proteins placement of insertions and deletions is highly structure dependent
• No gaps in secondary structure elements• Gaps most frequent in loops• Distance dependency
Take home message
• Identifying the correct fold is only a small step towards successful homology modeling
• Do not trust % ID or alignment score to identify the fold. Use P-values
• You can do reliable fold recognition AND homology modeling when for low sequence homology
• Use sequence profiles and local protein structure to align sequences
What are (some of) the different available methods?
• Simple sequence based methods– Align (BLAST) sequence against sequence of proteins with
known structure (PDB database)
• Sequence profile based methods– Align sequence profile (Psi-BLAST) against sequence of
proteins with known structure (PDB, FUGUE)– Align sequence profile against profile of proteins with
known structure (FFAS)
• Sequence and structure based methods– Align profile and predicted secondary structure against
proteins with known structure (3D-PSSM, Phyre)
• Sequence profiles and structure based methods– HHpred