Protein Fold recognition Morten Nielsen, Thomas Nordahl CBS, BioCentrum, DTU.

Date posted: 20-Dec-2015

Page 1:

Protein Fold recognition

Morten Nielsen, Thomas Nordahl

CBS, BioCentrum, DTU

Page 2:

Introduction
What is a protein fold

• Protein fold
• Protein sequence id
• Protein sequence/structure databases
• Alignment values
  – Scores, E-values & P-values
• Protein classifications
  – Fold, Superfamily, Family & protein

Page 3:

Introduction
What is a protein fold

• A protein fold is the scaffold that can be used as a template to model a query protein sequence.
• Fold recognition is a technique used to identify, among known protein structures, the scaffold to use. It is needed when the sequence similarity is low, so that the fold is difficult to recognize with simple sequence alignment tools (BLOSUM62 matrix).

Page 4:

Outline

• Many textbooks and experts state that %ID is the only determining factor for successful homology modeling
• This is WRONG!
• %ID is a very poor measure to determine if a protein can be modeled
• Many sequences with sequence identity of ~10-15% can be accurately modeled

Page 5:

Outline

• Why homology modeling
• How is it done
• How to decide when to use homology modeling
  – Why is %id such a terrible measure
• What are the best methods

Page 6:

Why protein modeling?

• Because it works!
  – Close to 50% of all new sequences can be homology modeled
• Experimental effort to determine protein structure is very large and costly
• The gap between the size of the protein sequence data and protein structure data is large and increasing

Page 7:

Homology modeling and the human genome

Human genome: ~30,000 proteins

Page 8:

Swiss-Prot database

~200,000 in Swiss-Prot
~2,000,000 if TrEMBL is included

Page 9:

PDB New Fold Growth

Figure: y-axis "New PDB structures"; legend "New folds" and "Old folds".

Page 10:

PDB New Fold Growth

Figure: y-axis "New PDB structures".

Page 11:

PDB New Fold Growth

Figure: y-axis "New PDB structures".

Page 12:

Identification of fold

Rajesh Nair & Burkhard Rost, Protein Science, 2002, 11, 2836-47

Page 13:

Why %id is so bad!!

1200 models sharing 25-95% sequence identity with the submitted sequences (www.expasy.ch/swissmod)

Page 14:

Identification of correct fold

• % ID is a poor measure
  – Many evolutionarily related proteins share low sequence identity
• Alignment score even worse
  – Many sequences will score high against everything (hydrophobic stretches)
• P-value or E-value more reliable

Page 15:

What are P and E values?

• E-value
  – Number of expected hits in database with score higher than match
  – Depends on database size
• P-value
  – Probability that a random hit will have score higher than match
  – Database size independent

Figure: P(Score) plotted against Score, with the match at Score 150.

Example: 10 hits with higher score (E=10); 10000 hits in database => P=10/10000 = 0.001
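The worked numbers on this slide can be reproduced with a short Python sketch. It assumes an empirical reading of the two definitions: E is counted directly as the number of database hits scoring above the match, and P is that count divided by the database size. The function name and the synthetic score list are illustrative only; real tools such as BLAST estimate these values from a fitted score distribution rather than by counting.

```python
def empirical_e_and_p(database_scores, match_score):
    """Return (E, P) for a match, counted directly from a list of scores.

    E: number of database hits scoring higher than the match.
    P: fraction of database hits scoring higher than the match.
    """
    higher = sum(1 for score in database_scores if score > match_score)
    return higher, higher / len(database_scores)


# Synthetic database mirroring the slide: 10000 hits, 10 of them above 150.
scores = [200] * 10 + [100] * 9990
e_value, p_value = empirical_e_and_p(scores, match_score=150)
print(e_value, p_value)
```

Doubling the database with the same score distribution (`scores * 2`) doubles E to 20 while P stays at 0.001, which is the database-size dependence the slide points out.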

Page 16:

Protein classifications

Page 17:

Protein structure classification

Figure labels: Protein world; Protein fold; Protein superfamily; Protein family; New Fold.

Page 18:

Superfamilies

Proteins which are (remotely) evolutionarily related
  – Sequence similarity low
  – Share function
  – Share special structural features
  – Same evolutionary ancestor

Relationships between members of a superfamily may not be readily recognizable from the sequence alone

Figure: classification hierarchy with levels Fold, Superfamily, Family, Proteins.

Page 19:

Template identification

Simple sequence based methods
  – Align (BLAST) sequence against sequence of proteins with known structure (PDB database)

Sequence profile based methods
  – Align sequence profile (PSI-BLAST) against sequence of proteins with known structure (PDB)
  – Align sequence profile against profile of proteins with known structure (FFAS)

Sequence and structure based methods
  – Align profile and predicted secondary structure against proteins with known structure (3D-PSSM)

Page 20:

Sequence profiles

In conventional alignment, a scoring matrix (BLOSUM62) gives the score for matching two amino acids
  – In reality not all positions in a protein are equally likely to mutate
  – Some amino acids (active sites) are highly conserved, and the penalty for a mismatch must be very high
  – Other amino acids can mutate almost for free, and the mismatch penalty should be smaller than the BLOSUM62 value

Sequence profiles (just like an HMM) can capture these differences

Page 21:

Sequence profiles/BLOSUM62 scores

Which alignment is most correct, a) or b)?

a)
TVNGQ--FPGPRLAGVAREGDQVLVKVVNHVAENITIHWHGVQLGTGWADGPAYVTQCPI
TKAVVLTFNTSVEICLVMQGTSIV----AAESHPLHLHGFNFPSNFNLVDPMERNTAGVP

b)
TKAVVLTFNTSVEICLVMQ-GTSIVAAESHPLHLHGFNFPSNFNLVDPMERNTAGVP
TVNGQ--FPGPRLAGVAREGDQVLVKVVNHVAENITIHWHGVQLGTGWADGPAYVTQCPI

BLOSUM62 scores:
G-G: 6
H-H: 8

Page 22:

BLOSUM scoring matrix

   A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
A  4 -1 -2 -2  0 -1 -1  0 -2 -1 -1 -1 -1 -2 -1  1  0 -3 -2  0
R -1  5  0 -2 -3  1  0 -2  0 -3 -2  2 -1 -3 -2 -1 -1 -3 -2 -3
N -2  0  6  1 -3  0  0  0  1 -3 -3  0 -2 -3 -2  1  0 -4 -2 -3
D -2 -2  1  6 -3  0  2 -1 -1 -3 -4 -1 -3 -3 -1  0 -1 -4 -3 -3
C  0 -3 -3 -3  9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1
Q -1  1  0  0 -3  5  2 -2  0 -3 -2  1  0 -3 -1  0 -1 -2 -1 -2
E -1  0  0  2 -4  2  5 -2  0 -3 -3  1 -2 -3 -1  0 -1 -3 -2 -2
G  0 -2  0 -1 -3 -2 -2  6 -2 -4 -4 -2 -3 -3 -2  0 -2 -2 -3 -3
H -2  0  1 -1 -3  0  0 -2  8 -3 -3 -1 -2 -1 -2 -1 -2 -2  2 -3
I -1 -3 -3 -3 -1 -3 -3 -4 -3  4  2 -3  1  0 -3 -2 -1 -3 -1  3
L -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4 -2  2  0 -3 -2 -1 -2 -1  1
K -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5 -1 -3 -1  0 -1 -3 -2 -2
M -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5  0 -2 -1 -1 -1 -1  1
F -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6 -4 -2 -2  1  3 -1
P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7 -1 -1 -4 -3 -2
S  1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4  1 -3 -2 -2
T  0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5 -2 -2  0
W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11  2 -3
Y -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7 -1
V  0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4
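To make the matrix concrete, the following Python sketch scores a gapped pairwise alignment column by column. Only the entries for A, G, H, I, L and V are copied from the table on this slide; the function name, the linear gap penalty of -4 and the toy sequences are assumptions made for illustration, not values from the slides.

```python
# Sub-matrix copied from the BLOSUM62 table (residues A, G, H, I, L, V only).
RESIDUES = "AGHILV"
ROWS = [
    [4, 0, -2, -1, -1, 0],    # A
    [0, 6, -2, -4, -4, -3],   # G
    [-2, -2, 8, -3, -3, -3],  # H
    [-1, -4, -3, 4, 2, 3],    # I
    [-1, -4, -3, 2, 4, 1],    # L
    [0, -3, -3, 3, 1, 4],     # V
]
BLOSUM62 = {
    (a, b): ROWS[i][j]
    for i, a in enumerate(RESIDUES)
    for j, b in enumerate(RESIDUES)
}


def alignment_score(seq_a, seq_b, gap_penalty=-4):
    """Sum substitution scores over aligned columns; '-' marks a gap.

    gap_penalty is a hypothetical linear penalty applied to every gap column.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    total = 0
    for a, b in zip(seq_a, seq_b):
        if a == "-" or b == "-":
            total += gap_penalty
        else:
            total += BLOSUM62[(a, b)]
    return total


print(alignment_score("GAH", "GAH"))    # 6 + 4 + 8
print(alignment_score("GA-H", "GLVH"))  # 6 + (-1) + gap penalty + 8
```

Because every position uses the same matrix, a G-G match is worth 6 wherever it occurs in the protein; this position independence is the limitation that sequence profiles address.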

Page 23:

Sequence profiles

TVNGQ--FPGPRLAGVAREGDQVLVKVVNHVAENITIHWHGVQLGTGWADGPAYVTQCPI

ADDGSLAFVPSEF--SISPGEKIVFKNNAGFPHNIVFDEDSIPSGVDASKISMSEEDLLN
TVNGAI--PGPLIAERLKEGQNVRVTNTLDEDTSIHWHGLLVPFGMDGVPGVSFPG---I
-TSMAPAFGVQEFYRTVKQGDEVTVTIT-----NIDQIED-VSHGFVVVNHGVSME---I
IE--KMKYLTPEVFYTIKAGETVYWVNGEVMPHNVAFKKGIV--GEDAFRGEMMTKD---
-TSVAPSFSQPSF-LTVKEGDEVTVIVTNLDE------IDDLTHGFTMGNHGVAME---V
ASAETMVFEPDFLVLEIGPGDRVRFVPTHK-SHNAATIDGMVPEGVEGFKSRINDE----

TVNGQ--FPGPRLAGVAREGDQVLVKVVNHVAENITIHWHGVQLGTGWADGPAYVTQCPI
TKAVVLTFNTSVEICLVMQGTSIV----AAESHPLHLHGFNFPSNFNLVDPMERNTAGVP

• Conserved: matching anything but G => large negative score
• Non-conserved: anything can match

Page 24:

Sequence profiles

1. Align (BLAST) sequence against large sequence database (Swiss-Prot)
2. Select significant alignments and make profile (weight matrix) using techniques for sequence weighting and pseudo counts
3. Use weight matrix to align against sequence database to find new significant hits
4. Repeat 2 and 3 (normally 3 times!)
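The profile-building step of this procedure can be sketched in Python as follows. This is a deliberately simplified version: it assumes a uniform background frequency, a fixed pseudocount per amino acid, and no sequence weighting, whereas PSI-BLAST uses more elaborate schemes. The function names and the four-sequence toy alignment are made up for illustration, and the database search and iteration steps are not shown.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(alignment, pseudocount=0.1):
    """Return one {amino_acid: log-odds score} dict per alignment column."""
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    profile = []
    for column in zip(*alignment):
        # Gap characters are simply ignored when counting.
        counts = Counter(aa for aa in column if aa in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        scores = {}
        for aa in AMINO_ACIDS:
            frequency = (counts[aa] + pseudocount) / total
            scores[aa] = math.log2(frequency / background)
        profile.append(scores)
    return profile


def score_sequence(profile, sequence):
    """Score an ungapped sequence against the profile, column by column."""
    return sum(column[aa] for column, aa in zip(profile, sequence))


# Toy alignment: column 0 is fully conserved (G), column 1 is variable.
profile = build_profile(["GA", "GV", "GL", "GI"])
print(round(profile[0]["G"], 2), round(profile[0]["A"], 2))
print(round(profile[1]["A"], 2), round(profile[1]["V"], 2))
```

In this toy profile G scores strongly positive at the conserved first column and every other residue scores negative there, while each of the four residues observed at the variable second column scores moderately positive.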

Page 25:

Example. Sequence profiles

• Alignment of protein sequences 1PLC._ and 1GYC.A
  – E-value > 1000
• Profile alignment
  – Align 1PLC._ against Swiss-Prot
  – Make position specific weight matrix from alignment
  – Use this matrix to align 1PLC._ against 1GYC.A
    • E-value < 10^-21. Rmsd = 3.3 Å

Page 26:

Sequence profiles

Score = 97.1 bits (241), Expect = 9e-22
Identities = 13/107 (12%), Positives = 27/107 (25%), Gaps = 17/107 (15%)

1PLC._:  3 ADDGSLAFVPSEFSISPGEKI------VFKNNAGFPHNIVFDEDSIPSGVDASKIS 56
           F + G++ N+ + +G + +
1GYC.A: 26 ------VFPSPLITGKKGDRFQLNVVDTLTNHTMLKSTSIHWHGFFQAGTNWADGP 79

1PLC._: 57 MSEEDLLNAKGETFEVAL---SNKGEYSFYCSP--HQGAGMVGKVTV 98
           A G +F G + ++ G+ G V
1GYC.A: 80 AFVNQCPIASGHSFLYDFHVPDQAGTFWYHSHLSTQYCDGLRGPFVV 126

Rmsd = 3.3 Å
Structure: red
Template: blue

Page 27:

Sequence logo / Sequence profile

Figure panels: 0 iterations (BLOSUM62); 1 iteration; 2 iterations; 3 iterations.

Page 28:

Profile-profile alignment

Figure: axes labeled Query and Template.

Compare amino acid preference for the two proteins and pair similar positions (HHpred)
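As a rough illustration of comparing amino acid preferences, the Python sketch below scores every query column against every template column with a dot product of their amino acid frequencies. This is a simplified stand-in chosen for brevity, not the scoring HHpred uses (HHpred compares profile hidden Markov models); the function names and the toy frequency columns are assumptions.

```python
def column_similarity(col_a, col_b):
    """Dot product of two {amino_acid: frequency} columns."""
    return sum(freq * col_b.get(aa, 0.0) for aa, freq in col_a.items())


def similarity_matrix(query_profile, template_profile):
    """Score every query column against every template column."""
    return [
        [column_similarity(q, t) for t in template_profile]
        for q in query_profile
    ]


# Hypothetical two-column profiles for a query and a template.
query = [{"G": 0.9, "A": 0.1}, {"L": 0.5, "I": 0.5}]
template = [{"G": 0.8, "S": 0.2}, {"I": 0.6, "V": 0.4}]

matrix = similarity_matrix(query, template)
for row in matrix:
    print(row)
```

Columns with similar preferences (both G-rich, or both favoring I) get the highest values. An alignment method such as dynamic programming would then look for the best-scoring path through this matrix.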

Page 29:

Including structure

• Sequences within a protein superfamily share only remote sequence homology, but they share high structural homology
• Structure is known for template
• Predict structural properties for query
  – Secondary structure
  – Surface exposure
• Position specific gap penalties derived from secondary structure and surface exposure
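One way to picture position specific gap penalties is the Python sketch below. Everything numeric in it is a hypothetical choice made for illustration (the base penalty, the doubling inside helices and strands, and the linear scaling with exposure); the slides do not give the actual scheme. The only idea taken from the slide is that the penalty at each position depends on secondary structure and surface exposure.

```python
def position_specific_gap_penalties(secondary_structure, exposure, base_penalty=10.0):
    """Return one gap penalty per residue.

    secondary_structure: string of 'H' (helix), 'E' (strand) or 'C' (coil).
    exposure: relative surface exposure per residue, 0.0 (buried) to 1.0 (exposed).
    """
    if len(secondary_structure) != len(exposure):
        raise ValueError("need one exposure value per residue")
    penalties = []
    for ss, exposed in zip(secondary_structure, exposure):
        penalty = base_penalty
        if ss in "HE":
            penalty *= 2.0        # assumed: gaps inside helices/strands cost double
        penalty *= 1.5 - exposed  # assumed: buried positions cost more than exposed ones
        penalties.append(penalty)
    return penalties


print(position_specific_gap_penalties("HHCC", [0.0, 0.5, 0.5, 1.0]))
```

With these assumed factors, a buried helix position is the most expensive place to open a gap and an exposed coil position the cheapest, which steers gaps toward surface loops.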

Page 30:

CASP. Which are the best methods

• Critical Assessment of Structure Prediction
• Every second year
• Sequences from about-to-be-solved structures are given to groups who submit their predictions before the structure is published
• Modelers make predictions
• Meeting in December where correct answers are revealed

Page 31:

CASP6 results

Page 32:

The top 4 homology modeling groups in CASP6

• All winners use consensus predictions
  – The wisdom of the crowd
• Same approach as in CASP5!
• Nothing has happened in 2 years!

Page 33:

The wisdom of the crowd!

– Why the many are smarter than the few
– A general method useful to improve prediction accuracy
– No single method or expert will always be the best

Page 34:

The wisdom of the crowd!

– The highest scoring hit will often be wrong
  • Not one single prediction method is consistently best
– Many prediction methods will have the correct fold among the top 10-20 hits
– If many different prediction methods all have the same fold among the top hits, this fold is probably correct
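The consensus idea in the last point can be sketched as a simple vote in Python. This is a minimal illustration, not the algorithm of any particular meta server; the method names, fold labels and the `top_n` cutoff are hypothetical.

```python
from collections import Counter


def consensus_folds(predictions, top_n=10):
    """Rank folds by how many methods list them among their top_n hits.

    predictions: {method_name: [fold labels, best hit first]}.
    Returns a list of (fold, number_of_methods), most supported first.
    """
    votes = Counter()
    for ranked_folds in predictions.values():
        # A set ensures each method votes at most once per fold.
        votes.update(set(ranked_folds[:top_n]))
    return votes.most_common()


# Hypothetical output of three prediction methods.
predictions = {
    "method_1": ["fold_A", "fold_B", "fold_C"],
    "method_2": ["fold_D", "fold_B", "fold_E"],
    "method_3": ["fold_B", "fold_F", "fold_A"],
}
ranking = consensus_folds(predictions)
print(ranking[0])
```

Here two of the three methods rank a different fold first, yet fold_B is the only fold all three include among their top hits, so it wins the vote.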

Page 35:

How to do it? Where is the crowd

• Meta prediction server
  – Web interface to a list of public protein structure prediction servers
  – Submit query sequence to all selected servers in one go

http://bioinfo.pl/meta/

Page 36:

Meta Server

Page 37:

From fold to structure

Flying to the moon has not made man conquer space

Finding the right fold does not allow you to make accurate protein models
  – Can allow prediction of protein function

Alignment is still a very hard problem
  – Most protein interactions are determined by the loops, and they are the least conserved parts of a protein structure

Page 38:

Modeling of new protein folds

Ab initio protein modeling

• Only when everything else fails
• Challenge
  – Close to impossible to model Nature's folding potential

Page 39:

A way to a solution

• Glue structure piecewise from fragments.
• Guide process by empirical/statistical potential

Figure labels: Fragments with correct local structure; Nature's potential; Empirical potential.

Page 40:

Example (Rosetta web server)

Figure labels: Rosetta prediction; Structure.

www.bioinfo.rpi.edu/~bystrc/hmmstr/server.php

Page 41:

Take home message

• Identifying the correct fold is only a small step towards successful homology modeling
• Do not trust % ID or alignment score to identify the fold. Use P-values
• Use sequence profiles and local protein structure to align sequences
• Do not trust one single prediction method; use consensus methods (3D Jury)
• Only if everything else fails, use ab initio methods