Burkhard Rost (Columbia New York) Evolution teaches to predict protein structure and function...

Page 1: Burkhard Rost (Columbia New York) Evolution teaches to predict protein structure and function Burkhard Rost CUBIC Columbia University rost@columbia.edu.

Evolution teaches to predict protein structure and function

Burkhard Rost

CUBIC Columbia University

rost@columbia.edu

http://www.columbia.edu/~rost

http://cubic.bioc.columbia.edu/

Page 2: Evolution teaches prediction

• Is Bioinformatics up to the data deluge?
• Sequence comparison: do we know what we do?
– conservation of structure and function
• Structure prediction: where are we today?
• How to learn from the evolutionary odyssey?
– secondary structure
– transmembrane proteins
– solvent accessibility
• Are 1D predictions useful?
– sub-cellular localisation
– whole genomes
– 3D structure: threading
– floppy regions

Page 3: http://cubic.bioc.columbia.edu/

• Volker Eyrich

• Rajesh Nair

• Jinfeng Liu

• Dariusz Przybylski

• Yanay Ofran

• Henry Bigelow

• Kazimierz Wrzeszczynski

• Sven Mika

• Chien Peter Chen

• Burkhard Rost

• http://cubic.bioc.columbia.edu/

• Miguel Andrade (EMBL)

• Sean O’Donoghue (LION)

• Andrej Sali, Marc Marti-Renom (Rockefeller)

• Alfonso Valencia, Florencio Pazos (Madrid)

• Michal Linial (Jerusalem)

• Claus Andersen (Copenhagen)

• Bastian Bruning (Nijmegen)

• Hepan Tan (Columbia)

• Trevor Siggers (Columbia)

Page 4: CUBIC http://cubic.bioc.columbia.edu

Dariusz Przybylski

Jinfeng Liu

Trevor Siggers

Murat Cokol

Hepan Tan

Volker Eyrich

Page 5: The Data Deluge

Conclusion: Bioinformatics will have a hell of a problem

[Figure: number of sequences in data base (log scale, 10^2 to 10^7) versus year (3-1982 to 6-1999). Curves: PDB, SWISS-PROT, EMBL, Computer.]

Date: 3-2001

Page 6: Data Deluge: what do we want?

Expressed?

• cellular function
• physiological function
• substrate binding sites
• protein-protein interfaces
• activity
• specificity
• docking
• localisation

DNA

ORF

Protein

Active protein

Domains = smallest functional / structural subunits

3D structure

Function

Page 7: Data Deluge: numbers

Expressed?

• cellular function
• physiological function
• substrate binding sites
• protein-protein interfaces
• activity
• specificity
• docking
• localisation

30 entire organisms

600.000 genes (GenBank)

1.500 domains (DALI)

10.000 structures (PDB) 600 'unique' (FSSP)

30.000 annotations(SWISS-PROT)

350.000 proteins (TrEMBL)

50

1.200.000

500.000

2000

17.000

800

35.000

Page 8: Data Deluge: what CAN we do?

Expressed?

• cellular function
• physiological function
• substrate binding sites
• protein-protein interfaces
• activity
• specificity
• docking
• localisation

Introns: 100% mycoplasma, 30-50% eukaryotes

ORF: 10% error in bacteria

Signal peptides: sometimes (ProSite / SignalP)

Domains: sometimes (Pfam / ProDom)

3D structure: sometimes

Function: ? motifs (ProSite / PRAM), alignment

Page 9: Data Deluge: what CAN we do?

Not much … yet

Page 10: Evolution teaches prediction

• Bioinformatics up to the data deluge? NO, but work in progress!
• Sequence comparison: do we know what we do?
– conservation of structure and function
• Structure prediction: where are we today?
• How to learn from the evolutionary odyssey?
– secondary structure
– transmembrane proteins
– solvent accessibility
• Are 1D predictions useful?
– sub-cellular localisation
– whole genomes
– 3D structure: threading
– floppy regions

Page 11: Dynamic programming: optimal alignment

Pair of protein sequences

U GGQLAKEEAL
T EGQPVEVL

  GGQLAKEEAL
E 0000001100
G 1100000110
Q 0120000011
P 0012000001
V 0001200000
E 0000121100
V 0000012110
L 0001001212

Optimal alignment (with gaps)

U GGQLAKEEAL
T EGQP.VE.VL

Optimal alignment (no gaps)

U  GGQLAKEEAL
T1        EVL
T2 EGQPVEVL
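The score matrix on this slide can be reproduced with a few lines of code. The sketch below is an illustration, not the program behind the slide: it assumes that each cell holds the number of identical residues accumulated along its diagonal, which is the rule that matches the numbers shown. The function name `diagonal_match_matrix` is hypothetical. A full dynamic-programming alignment would additionally allow gaps, that is, jumps between diagonals at a penalty.

```python
def diagonal_match_matrix(row_seq, col_seq):
    """Count identities accumulated along each diagonal (no gaps allowed)."""
    matrix = []
    for i, row_res in enumerate(row_seq):
        row = []
        for j, col_res in enumerate(col_seq):
            # Score carried over from the previous cell on the same diagonal.
            carried = matrix[i - 1][j - 1] if i > 0 and j > 0 else 0
            row.append(carried + (1 if row_res == col_res else 0))
        matrix.append(row)
    return matrix


U = "GGQLAKEEAL"  # sequence U from the slide (columns)
T = "EGQPVEVL"    # sequence T from the slide (rows)

if __name__ == "__main__":
    print("  " + U)
    for residue, row in zip(T, diagonal_match_matrix(T, U)):
        print(residue + " " + "".join(str(value) for value in row))
```

The two diagonals that reach a score of 2 correspond to the two ungapped placements of T against U; joining them is what requires gaps.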

Page 12: BLAST: fast matching of single ‘words’

T T Y K L I L N G K T L K G E T T T E A V D A A T A E K V F K Q Y A N D N G V D G E W T Y D D A T K T F T V T E K

T T Y K L I L L L L L L L L L L L L L L L L A W T V E K A F K T F A A A A A A A A A W T V E K A F K T F A A A A A

T T Y K L I L

T T Y K L I L

W T Y D D A T K T F

W T V E K A F K T F

A A T A E K V F K Q Y A

A W T V E K A F K T F A? ?
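The word-matching idea on this slide can be sketched as follows. This is a simplified illustration under stated assumptions: it finds only identical words of a fixed length (the parameter `word_size` and the function name `word_hits` are hypothetical), whereas BLAST itself also accepts similar words and then extends each hit. The two sequences are transcribed from the slide with the spaces removed.

```python
from collections import defaultdict


def word_hits(query, subject, word_size=7):
    """Return (query_pos, subject_pos) pairs where an identical word starts."""
    # Index every word of the query by its start position.
    index = defaultdict(list)
    for q in range(len(query) - word_size + 1):
        index[query[q:q + word_size]].append(q)
    # Scan the subject once and look each word up in the index.
    hits = []
    for s in range(len(subject) - word_size + 1):
        for q in index.get(subject[s:s + word_size], []):
            hits.append((q, s))
    return hits


QUERY = "TTYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDDATKTFTVTEK"
# Second sequence from the slide, written with repeats to keep it readable.
SUBJECT = "TTYKLI" + "L" * 16 + "AWTVEKAFKTF" + "A" * 9 + "WTVEKAFKTF" + "A" * 5

if __name__ == "__main__":
    for q, s in word_hits(QUERY, SUBJECT):
        print(q, s, QUERY[q:q + 7])
```

Indexing the query once and scanning the database sequence linearly is what makes word matching fast compared with filling a full alignment matrix for every pair.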

Page 13: Profile-based comparison

           1                                                   50
fyn_human  VTLFVALYDY EARTEDDLSF HKGEKFQILN SSEGDWWEAR SLTTGETGYI
yrk_chick  VTLFIALYDY EARTEDDLSF QKGEKFHIIN NTEGDWWEAR SLSSGATGYI
fgr_human  VTLFIALYDY EARTEDDLTF TKGEKFHILN NTEGDWWEAR SLSSGKTGCI
yes_chick  VTVFVALYDY EARTTDDLSF KKGERFQIIN NTEGDWWEAR SIATGKTGYI
src_avis2  VTTFVALYDY ESRTETDLSF KKGERLQIVN NTEGDWWLAH SLTTGQTGYI
src_aviss  VTTFVALYDY ESRTETDLSF KKGERLQIVN NTEGDWWLAH SLTTGQTGYI
src_avisr  VTTFVALYDY ESRTETDLSF KKGERLQIVN NTEGDWWLAH SLTTGQTGYI
src_chick  VTTFVALYDY ESRTETDLSF KKGERLQIVN NTEGDWWLAH SLTTGQTGYI
stk_hydat  VTIFVALYDY EARISEDLSF KKGERLQIIN TADGDWWYAR SLITNSEGYI
src_rsvpa  .......... ESRIETDLSF KKRERLQIVN NTEGTWWLAH SLTTGQTGYI
hck_human  ..IVVALYDY EAIHHEDLSF QKGDQMVVLE ES.GEWWKAR SLATRKEGYI
blk_mouse  ..FVVALFDY AAVNDRDLQV LKGEKLQVLR .STGDWWLAR SLVTGREGYV
hck_mouse  .TIVVALYDY EAIHREDLSF QKGDQMVVLE .EAGEWWKAR SLATKKEGYI
lyn_human  ..IVVALYPY DGIHPDDLSF KKGEKMKVLE .EHGEWWKAK SLLTKKEGFI
lck_human  ..LVIALHSY EPSHDGDLGF EKGEQLRILE QS.GEWWKAQ SLTTGQEGFI
ss81_yeast .....ALYPY DADDDdeISF EQNEILQVSD .IEGRWWKAR R.ANGETGII
abl_mouse  ..LFVALYDF VASGDNTLSI TKGEKLRVLG YnnGEWCEAQ ..TKNGQGWV
abl1_human ..LFVALYDF VASGDNTLSI TKGEKLRVLG YnnGEWCEAQ ..TKNGQGWV
src1_drome ..VVVSLYDY KSRDESDLSF MKGDRMEVID DTESDWWRVV NLTTRQEGLI
mysd_dicdi .....ALYDF DAESSMELSF KEGDILTVLD QSSGDWWDAE L..KGRRGKV
yfj4_yeast ....VALYSF AGEESGDLPF RKGDVITILK ksQNDWWTGR V..NGREGIF
abl2_human ..LFVALYDF VASGDNTLSI TKGEKLRVLG YNQNGEWSEV RSKNG.QGWV
tec_human  .EIVVAMYDF QAAEGHDLRL ERGQEYLILE KNDVHWWRAR D.KYGNEGYI
abl1_caeel ..LFVALYDF HGVGEEQLSL RKGDQVRILG YNKNNEWCEA RlrLGEIGWV
txk_human  .....ALYDF LPREPCNLAL RRAEEYLILE KYNPHWWKAR D.RLGNEGLI
yha2_yeast VRRVRALYDL TTNEPDELSF RKGDVITVLE QVYRDWWKGA L..RGNMGIF
abp1_sacex .....AEYDY EAGEDNELTF AENDKIINIE FVDDDWWLGE LETTGQKGLF
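A family alignment like the one on this slide can be turned into a position-specific profile. The sketch below is a minimal illustration, assuming plain relative residue frequencies per column; it is not the log-odds scoring that PSI-BLAST or other profile tools use, and the names `column_profile` and `profile_score` are hypothetical. The four rows are the first ten columns of fyn_human, yrk_chick, src_chick and hck_human from the slide.

```python
from collections import Counter


def column_profile(alignment):
    """Relative residue frequencies per alignment column; '.' (gap) is ignored."""
    profile = []
    for column in zip(*alignment):
        counts = Counter(res.upper() for res in column if res != ".")
        total = sum(counts.values())
        profile.append({res: n / total for res, n in counts.items()} if total else {})
    return profile


def profile_score(profile, segment):
    """Sum, over positions, the profile frequency of the residue observed there."""
    return sum(col.get(res.upper(), 0.0) for col, res in zip(profile, segment))


ROWS = ["VTLFVALYDY", "VTLFIALYDY", "VTTFVALYDY", "..IVVALYDY"]

if __name__ == "__main__":
    profile = column_profile(ROWS)
    print(profile_score(profile, "VTLFVALYDY"))  # a family member
    print(profile_score(profile, "AAAAAAAAAA"))  # an unrelated segment
```

Because conserved columns contribute a frequency close to 1 and variable columns contribute less, a profile weights each position by how much the family tolerates change there, which a single pairwise comparison cannot do.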

Page 14: Zones

Page 15: Sequence -> Structure

• Sequence folds into unique structure: S -> T

[Diagram: sequence space and structure space]

Page 16: Sequence -> Structure

• Sequence folds into unique structure: S -> T

• Similar sequences fold into similar structures: S + S’ -> T

[Diagram: sequence space and structure space]

Page 17: Sequence -> Structure

• Sequence folds into unique structure: S -> T

• Similar sequences fold into similar structures: S + S’ -> T

• Most sequences don’t fold at all: S -> no T

[Diagram: sequence space and structure space]

Page 18:

[Figure: number of protein pairs (log scale, 10^1 to 10^6) versus distance from HSSP threshold (-15 to 10; corresponding percentage sequence identity 10 to 35). Curve labels: 10%, 50%, 90%. Annotation: Twilight zone = false positives explode.]

B Rost 1999 Prot. Engin.:12, 85-94

Page 19: Significant sequence identity

B Rost 1999 Prot. Engin.:12, 85-94

$$\mathrm{HSSP\_PIDE}(\vartheta) = \vartheta + 480 \cdot L^{-0.32 \cdot \{1 + e^{-L/1000}\}}$$
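The threshold on this slide can be evaluated numerically. The sketch below assumes that L is the number of residues aligned and that ϑ is a constant offset from the curve; the function names `hssp_pide_threshold` and `distance_from_threshold` are hypothetical. It implements only the expression shown on the slide and does not clip the result, so for very short alignments the raw value exceeds 100%.

```python
import math


def hssp_pide_threshold(length, offset=0.0):
    """Identity threshold (in percent) for an alignment of `length` residues."""
    exponent = -0.32 * (1.0 + math.exp(-length / 1000.0))
    return offset + 480.0 * length ** exponent


def distance_from_threshold(percent_identity, length):
    """Positive values lie above the curve, negative values below it."""
    return percent_identity - hssp_pide_threshold(length)


if __name__ == "__main__":
    for length in (50, 100, 200):
        print(length, round(hssp_pide_threshold(length), 1))
```

The "distance from threshold" used on the horizontal axes of the following slides corresponds to `distance_from_threshold`: an observed identity minus the length-dependent curve.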

Page 20: Evolution did it!

[Figure: curve over number of residues aligned (0 to 250); vertical axis 0 to 100. Region labels: "Sequence identity implies structural similarity!" and "Don't know region".]

B Rost 1999 Prot. Engin.:12, 85-94

Page 21: Similar sequence -> similar structure?

[Figure: curves for identity and similarity versus number of residues aligned (0 to 250); vertical axis 0 to 100]

B Rost 1999 Prot. Engin.:12, 85-94

Page 22: Detecting true hits in Twilight zone

[Figure: vertical axis 0 to 100 versus distance from threshold (-10 to 10; corresponding scale 15 to 35). Curves: old HSSP, ide, sim. Annotations: 10%; similarity-larger-than-identity; they-dont-know-what-they-do; only sequence identity.]

B Rost 1999 Prot. Engin.:12, 85-94

Page 23: Finding similar structures in Twilight zone

[Figure: vertical axis (log scale, 4·10^3 to 5·10^4) versus distance from threshold (-10 to 10; corresponding scale 15 to 35). Curves: old HSSP, ide, sim. Annotations: 5%; similarity-larger-than-identity.]

B Rost 1999 Prot. Engin.:12, 85-94

Page 24: ‘Secure’ thresholds for BLAST

[Figure: legend coverage, accuracy, true, false; axes 0 to 100 and log scale 10^1 to 10^6, versus probability score of PSI-BLAST (0.1 to 10^5)]

B Rost 1999 Prot. Engin.:12, 85-94

Page 25: Accuracy vs. coverage

[Figure: accuracy (0 to 100) versus coverage (0 to 100)]

• Coverage: how many of the correct proteins were found?

• Accuracy: how many of the proteins found are correct?
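The two questions on this slide translate directly into two ratios. The sketch below is a minimal illustration using sets of protein identifiers; the function name `accuracy_and_coverage` and the example identifiers are hypothetical.

```python
def accuracy_and_coverage(found, correct):
    """Return (accuracy, coverage) in percent for a set of hits."""
    found, correct = set(found), set(correct)
    true_hits = found & correct
    # Accuracy: how many of the proteins found are correct?
    accuracy = 100.0 * len(true_hits) / len(found) if found else 0.0
    # Coverage: how many of the correct proteins were found?
    coverage = 100.0 * len(true_hits) / len(correct) if correct else 0.0
    return accuracy, coverage


if __name__ == "__main__":
    found = {"prot_a", "prot_b", "prot_c", "prot_d"}  # hypothetical search hits
    correct = {"prot_a", "prot_b", "prot_e"}          # hypothetical true relatives
    print(accuracy_and_coverage(found, correct))
```

Loosening a search threshold usually adds hits, which tends to raise coverage and lower accuracy; plotting one against the other shows that trade-off for every threshold at once.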

Page 26: BLAST is not enough ...

[Figure: accuracy (0 to 100) versus a horizontal axis from 0 to 100, with an inset (vertical 8 to 20, horizontal 60 to 100). Curves: ∆similarity, ∆identity, HSSP-curve, % identity, alignment score, blast2, psi-blast.]

B Rost 1999 Prot. Engin.:12, 85-94

Page 27: Sequence Space Hopping

[Diagram with nodes: protein A, sel_x, anl_z, protein B, unk_y, anb_x, unk_x, protein C, cal_y, cal_x, seq_x, seq_y]

B Rost 1999 Prot. Engin.:12, 85-94

Page 28: Success through sequence space hopping

[Figure: vertical axis 50 to 100 versus distance from threshold (-10 to 5; corresponding percentage sequence identity 15 to 30). Curves: old, ide. Additional scale: 0 to 200.]

B Rost 1999 Prot. Engin.:12, 85-94

Page 29: Zones

Page 30: Profile-based database search

Family

U

U

B Rost 2001 Structural Bioinformatics:in press

Page 31: Profile-based database search

Family

U

safe for pairwise

safe zone

Page 32: Profile-based database search

zone reached through position-specific family profile

Family

U

safe for pairwise

safe zone

U

Page 33: Profile-based database search

zone reached through position-specific family profile

Family

U

safe for pairwise

safe zone

U

lost after iteration

Page 34: Profile-based database search

zone reached through position-specific family profile

Family

U

safe for pairwise

safe zone

U

safe zones of close homologues

lost after iteration

Page 35: Profile-based database search

Page 36: Zones

Page 37: Hypothetical distribution of similar structures

Page 38:

[Figure: vertical axis 0 to 60 versus percentage of identical residues (0 to 100)]

Page 39: Midnight zone: real - random

[Figure: vertical axis 0 to 60 versus percentage identical residues (0 to 25)]

B Rost 1997 Folding & Design:2, S19-S24

AS Yang and B Honig 2000 J. Mol. Biol.:301, 679-689

Page 40: Evolution into the Midnight zone

[Figure: number of structure pairs (0 to 1600) versus percentage pairwise sequence identity (0 to 25), with a second horizontal scale from 0 to 100]

B Rost and S O'Donoghue 1998 EMBL preprint

Page 41: Protein structures evolved at random - almost

• average < 10%

– -> most pairs have ‘random’ identity levels

• 3 - 4% anchor residues

• 4 billion years of evolution reached equilibrium

– rate of creating new structures slower than drift towards mean

• averages for convergent and divergent evolution similar

• convergent evolution may have been a major event

Page 42: Structure space

B Rost 1998 Structure:6, 259-263

Page 43: Gold-mine out of reach!

[Figure: fraction of protein pairs. Percentage of pairs (0 to 10) versus percentage of identical residues (30 to 100). Curves: MJ, MG, YE, HI, PDB.]

Page 44: Conservation of function

Devos & Valencia 2000, Proteins, 41, pp. 98

Page 45: Conservation of EC number

[Figure: percentage of proteins (0 to 100) and number of proteins (log scale, 10^1 to 10^5) versus percentage pairwise sequence identity (0 to 100). Curves: first EC digit: accuracy; first EC digit: coverage; all EC digits: accuracy; all EC digits: coverage; number of proteins.]
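The curves on this slide distinguish agreement in the first EC digit from agreement in all EC digits. A minimal sketch of that comparison is given below; the function name `ec_digits_conserved` is hypothetical, and treating a "-" field as an incomplete annotation that never counts as conserved is an assumption, not something the slide states. The EC numbers in the usage lines are illustrative inputs only.

```python
def ec_digits_conserved(ec_a, ec_b, digits=4):
    """True if the first `digits` fields of two EC numbers are identical."""
    fields_a = ec_a.split(".")[:digits]
    fields_b = ec_b.split(".")[:digits]
    if len(fields_a) < digits or len(fields_b) < digits:
        return False
    if "-" in fields_a or "-" in fields_b:  # incomplete annotation (assumption)
        return False
    return fields_a == fields_b


if __name__ == "__main__":
    print(ec_digits_conserved("1.1.1.1", "1.1.1.2", digits=1))  # same class
    print(ec_digits_conserved("1.1.1.1", "1.1.1.2", digits=4))  # full number differs
```

Counting, for all protein pairs above a given sequence identity, how often this test succeeds gives the kind of accuracy value plotted for "first EC digit" (`digits=1`) and "all EC digits" (`digits=4`).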

Page 46: Conservation of EC number 2

[Figure, two panels, each showing percentage of proteins (0 to 100) and number of proteins (log scale, 10^1 to 10^5). Panel 1: versus distance from threshold (identity/length), -40 to 40, with corresponding percentage sequence identity 0 to 80. Panel 2: versus percentage pairwise sequence identity, 20 to 100. Curves: first EC digit: accuracy; first EC digit: coverage; all EC digits: accuracy; all EC digits: coverage; number of proteins.]

Page 47: Conservation of EC number: BLAST

[Figure, two panels versus log(BLAST E), one over the range -4 to 4 and one over -200 to 0. Axes: percentage of proteins (0 to 100) and number of proteins (log scale). Curves: first EC digit: accuracy; first EC digit: coverage; all EC digits: accuracy; all EC digits: coverage; number of proteins.]

Page 48: Conservation in detail

[Figure, two panels (Pairwise Blast, PSI-Blast): percentage of protein pairs (0 to 100) versus percentage pairwise sequence identity (20 to 100) for the FULL EC number. Curves labelled A and C for each enzyme class: Oxidoreductases, Transferases, Hydrolases, Lyases, Isomerases, Ligases.]

Page 49: Accuracy vs. coverage: EC number

[Figure, two panels (Pairwise, PSI-Blast): accuracy (0 to 100) versus coverage (0 to 100), each with an inset (0 to 80 versus 85 to 95). Curves: ONE pide, ALL pide, ONE dist, ALL dist, ONE prob, ALL prob.]

Page 50: Conservation of EC numbers

[Figure, panels for Pair and PSI: percentage of pairs (0 to 100) and number of pairs (log scale) versus distance from HSSP-threshold (-40 to 40; corresponding sequence identity 0 to 80) and versus log(BLAST E) (ranges -4 to 4 and -200 to 0). Curves: accuracy and coverage for first digit EC and for full EC; number of pairs.]

Page 51: Evolution teaches prediction

• Bioinformatics up to the data deluge? NO, but work in progress!
• Know what we do? Some do, 30% over 100 residues!
• Structure prediction: where are we today?
• How to learn from the evolutionary odyssey?
– secondary structure
– transmembrane proteins
– solvent accessibility
• Are 1D predictions useful?
– sub-cellular localisation
– whole genomes
– 3D structure: threading
– floppy regions

Page 52: Notation: protein structure 1D, 2D, 3D

PQITLWQRPLVTIKIGGQLKEALLDTGADDTVL

[Figure: 1D, 2D and 3D views of a protein structure. Recoverable labels: per-residue strings (E), a scale in kcal/mol from 0 to -5, and residue positions 1 to 90 on both axes.]

Page 53:

Christine Orengo (Structure, 1997, 5, 1093-1108)

Page 54:

Christine Orengo (Structure, 1997, 5, 1093-1108)

Page 55: Goal of structure prediction

• Epstein & Anfinsen, 1961: sequence uniquely determines structure

• INPUT: sequence

• OUTPUT: 3D structure and function

Page 56: Protein structure prediction in reality

FoRc

HoMo

3D

1D

Page 57:

EEEE B B B B EEEEEE EEEEEE EEEEEEEEHHHEEE

1shf 100% VTLFVALYDYEARTEDDLSFHKGEKFQILNSSEGDWWEARSLTTGETGYIPSNYVAPVD
1srm  78% VTTFVALYDYESRTETDLSFKKGERLQIVNNTEGDWWLAHSLTTGQTGYIPSNYVAPSD
1sem  39% ....VAEHDFQAGSPDELSFKRGNTLKVLNKDEDPHWYKAEL.DGNEGFIPSNYIRMTE

WHAT IF
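The percentages next to the three identifiers on this slide can be recomputed from the aligned rows. The sketch below assumes that identity is counted over the positions where neither row has a gap ("."); that denominator is an assumption, chosen because it reproduces the 78% and 39% shown. The function name `percent_identity` is hypothetical.

```python
def percent_identity(seq_a, seq_b, gap="."):
    """Percentage of identical residues over positions where neither row has a gap."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned rows must have the same length")
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != gap and b != gap]
    if not pairs:
        return 0.0
    return 100.0 * sum(a == b for a, b in pairs) / len(pairs)


# Aligned rows as given on the slide.
SHF = "VTLFVALYDYEARTEDDLSFHKGEKFQILNSSEGDWWEARSLTTGETGYIPSNYVAPVD"
SRM = "VTTFVALYDYESRTETDLSFKKGERLQIVNNTEGDWWLAHSLTTGQTGYIPSNYVAPSD"
SEM = "....VAEHDFQAGSPDELSFKRGNTLKVLNKDEDPHWYKAEL.DGNEGFIPSNYIRMTE"

if __name__ == "__main__":
    for name, row in (("1shf", SHF), ("1srm", SRM), ("1sem", SEM)):
        print(name, round(percent_identity(SHF, row)))
```

Dividing by the full row length instead of the number of aligned positions would give a lower value for 1sem, so the choice of denominator should always be stated alongside an identity figure.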

Page 58: Homology modelling/comparative modelling

• assumption: H and U have homologous 3D structures
• strategy: modelling of U based on H

U (sequence)

PDB

H

significant sequence identity

Page 59: Protein structure prediction in reality

FoRc

HoMo

3D

1D

Page 60: Protein structure prediction in reality

FoRc

HoMo

1D

….the art of being humble

SWISS-PROT view Genome view

Page 61: Structure prediction for protein universe

[Figure, two panels per organism: percentage of proteins in the proteome (0 to 50) and percentage of residues in the proteome (0 to 40). Group labels: Archae, Prokaryotes, Euka.]

Organisms: A pernix, A fulgidus, M jannaschii, M thermoautotrophicu, P abyssi, P horikoshii, A aeolicus, B subtilis, B burgdorferi, C jejuni, C pneumoniae, C trachomatis, D radiodurans, E coli, H influenzae, H pylori, M genitalium, M pneumoniae, M tuberculosis, N meningitidis, R prowazekii, S PCC6803, T maritima, T pallidum, U urealyticum, S cerevisiae, C elegans, D melanogaster, H sapiens (SP/TrEmbl), H sapiens (chr 22)

Page 62

Improving prediction by waiting it out …

[Figure panels: 1991, 1995, 1999]

Page 63

Evolution teaches prediction

• Bioinformatics up to the data deluge? NO, but work in progress!
• Know what we do? Some do, 30% over 100 residues!
• Where are we today? NO 3D prediction from sequence!
• How to learn from the evolutionary odyssey?
  – secondary structure
  – transmembrane proteins
  – solvent accessibility
• Are 1D predictions useful?
  – sub-cellular localisation
  – whole genomes
  – 3D structure: threading
  – floppy regions

Page 64

Evolution did it!

[Figure: plot against "Number of residues aligned" (0 to 250), vertical axis ticks 0 to 100; regions labelled "Sequence identity implies structural similarity!" and "Don't know region"]

B Rost 1999 Prot. Engin. 12, 85-94

Page 65

           1                                                   50
fyn_human  VTLFVALYDY EARTEDDLSF HKGEKFQILN SSEGDWWEAR SLTTGETGYI
yrk_chick  VTLFIALYDY EARTEDDLSF QKGEKFHIIN NTEGDWWEAR SLSSGATGYI
fgr_human  VTLFIALYDY EARTEDDLTF TKGEKFHILN NTEGDWWEAR SLSSGKTGCI
yes_chick  VTVFVALYDY EARTTDDLSF KKGERFQIIN NTEGDWWEAR SIATGKTGYI
src_avis2  VTTFVALYDY ESRTETDLSF KKGERLQIVN NTEGDWWLAH SLTTGQTGYI
src_aviss  VTTFVALYDY ESRTETDLSF KKGERLQIVN NTEGDWWLAH SLTTGQTGYI
src_avisr  VTTFVALYDY ESRTETDLSF KKGERLQIVN NTEGDWWLAH SLTTGQTGYI
src_chick  VTTFVALYDY ESRTETDLSF KKGERLQIVN NTEGDWWLAH SLTTGQTGYI
stk_hydat  VTIFVALYDY EARISEDLSF KKGERLQIIN TADGDWWYAR SLITNSEGYI
src_rsvpa  .......... ESRIETDLSF KKRERLQIVN NTEGTWWLAH SLTTGQTGYI
hck_human  ..IVVALYDY EAIHHEDLSF QKGDQMVVLE ES.GEWWKAR SLATRKEGYI
blk_mouse  ..FVVALFDY AAVNDRDLQV LKGEKLQVLR .STGDWWLAR SLVTGREGYV
hck_mouse  .TIVVALYDY EAIHREDLSF QKGDQMVVLE .EAGEWWKAR SLATKKEGYI
lyn_human  ..IVVALYPY DGIHPDDLSF KKGEKMKVLE .EHGEWWKAK SLLTKKEGFI
lck_human  ..LVIALHSY EPSHDGDLGF EKGEQLRILE QS.GEWWKAQ SLTTGQEGFI
ss81_yeast .....ALYPY DADDDdeISF EQNEILQVSD .IEGRWWKAR R.ANGETGII
abl_mouse  ..LFVALYDF VASGDNTLSI TKGEKLRVLG YnnGEWCEAQ ..TKNGQGWV
abl1_human ..LFVALYDF VASGDNTLSI TKGEKLRVLG YnnGEWCEAQ ..TKNGQGWV
src1_drome ..VVVSLYDY KSRDESDLSF MKGDRMEVID DTESDWWRVV NLTTRQEGLI
mysd_dicdi .....ALYDF DAESSMELSF KEGDILTVLD QSSGDWWDAE L..KGRRGKV
yfj4_yeast ....VALYSF AGEESGDLPF RKGDVITILK ksQNDWWTGR V..NGREGIF
abl2_human ..LFVALYDF VASGDNTLSI TKGEKLRVLG YNQNGEWSEV RSKNG.QGWV
tec_human  .EIVVAMYDF QAAEGHDLRL ERGQEYLILE KNDVHWWRAR D.KYGNEGYI
abl1_caeel ..LFVALYDF HGVGEEQLSL RKGDQVRILG YNKNNEWCEA RlrLGEIGWV
txk_human  .....ALYDF LPREPCNLAL RRAEEYLILE KYNPHWWKAR D.RLGNEGLI
yha2_yeast VRRVRALYDL TTNEPDELSF RKGDVITVLE QVYRDWWKGA L..RGNMGIF
abp1_sacex .....AEYDY EAGEDNELTF AENDKIINIE FVDDDWWLGE LETTGQKGLF


Page 67

[Figure: neural network for secondary structure prediction.
Protein: :GYIYDPEDGDPDDGVNPGTDF:
Alignments:
  :GYIYDPAVGDPDNGVEPGTEF:
  :GYIYDPEVGDPTQNIPPGTKF:
  :GYEYDPAEGDPDNGVKPGTSF:
  :GYEYDPAEGDPDNGVKPGTAF:
The alignment is converted into a profile table (columns GSAPD NTEKQ CVHIR LMYFW; one row per residue with the count of each amino acid, "." for zero). The profile feeds the input layer (units s0, s1, s2, ...; connections J1), followed by the first or hidden layer and the second or output layer (connections J2) with units H, E and L. Pick maximal unit => current prediction.]

Each input unit corresponds to the 21*3 bits coding for the profile of one residue.
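How a profile table of this kind can be derived from the alignment is sketched below. This is a minimal illustration that only counts residues per column, using the five sequences and the column order shown on the slide; the function name `profile_counts` is hypothetical, and the actual network input additionally encodes further information that is not reproduced here.

```python
from collections import Counter

# Column order of the profile table on the slide.
AMINO_ACIDS = "GSAPDNTEKQCVHIRLMYFW"


def profile_counts(aligned):
    """For every alignment column, count how often each amino acid occurs."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for column in zip(*aligned):
        counts = Counter(res for res in column if res in AMINO_ACIDS)
        profile.append([counts.get(aa, 0) for aa in AMINO_ACIDS])
    return profile


# Protein plus its four aligned homologues, copied from the slide.
sequences = [
    "GYIYDPEDGDPDDGVNPGTDF",
    "GYIYDPAVGDPDNGVEPGTEF",
    "GYIYDPEVGDPTQNIPPGTKF",
    "GYEYDPAEGDPDNGVKPGTSF",
    "GYEYDPAEGDPDNGVKPGTAF",
]

profile = profile_counts(sequences)
print(" ".join(AMINO_ACIDS))
for row in profile:
    # Print zeros as "." in the style of the slide.
    print(" ".join(str(n) if n else "." for n in row))
```

Each printed row is one residue position; a fully conserved position shows a single 5, a variable one spreads the five counts over several columns.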

Page 68

Evolution teaches prediction

• Bioinformatics up to the data deluge? NO, but work in progress!
• Know what we do? Some do, 30% over 100 residues!
• Where are we today? NO 3D prediction from sequence!
• Evolutionary odyssey applied:
  – secondary structure +15% -> 76% ± 10%
  – transmembrane proteins
  – solvent accessibility
• Are 1D predictions useful?
  – sub-cellular localisation
  – whole genomes
  – 3D structure: threading
  – floppy regions

Page 69

Membrane prediction

Page 70

HTM prediction waiting for database growth ...

[Figure panels: 1993, 1996, 1999]

Page 71

Topology for membrane helical proteins

[Figure: membrane separating the extra-cytoplasmic and intra-cytoplasmic sides; proteins A, B and C with their C-termini marked and orientations labelled "in" and "out"]

Page 72

PHDsec success on Poly-Valine

HEADER LIPOPROTEIN(SURFACE FILM)
COMPND PULMONARY SURFACTANT-ASSOCIATED POLYPEPTIDE C(SP-C)
SOURCE PIG (SUS SCROFA)
AUTHOR J.JOHANSSON,T.SZYPERSKI,T.CURSTEDT,K.WUTHRICH

AA LRIPCCPVNLKRLLVVVVVVVLVVVVTVGALLMGL
OBS sec HHHHHHHHHHHHHHHHHHHHHHHHH
PHD sec EEEEEEEEEEEEEEEEEEEEEEE

Page 73

PHDhtm

[Figure: two-level neural network with input layer, hidden layer and output layer (output units HTM and nonHTM).
first level, sequence-to-structure: 21+3 units per residue for 13 adjacent residues of the local alignment (input local in sequence), plus 20+4+4+4 units of global statistics over the whole protein (input global in sequence):
  – percentage of each amino acid in protein
  – length of protein (≤60, ≤120, ≤240, >240)
  – distance: centre, N-term (≤40, ≤30, ≤20, ≤10)
  – distance: centre, C-term (≤40, ≤30, ≤20, ≤10)
second level, structure-to-structure: 3+1 units per residue, plus the 20+4+4+4 global units.]

Alignment columns shown in the figure: AAA, AA., LLL, LII, AAG, CCS, GVV

  A    C    L    I    G    S    V  ins  del  cons
100    0    0    0    0    0    0    0    0  1.17
100    0    0    0    0    0    0   33    0  0.42
  0    0  100    0    0    0    0    0   33  0.92
  0    0   33   66    0    0    0    0    0  0.74
 66    0    0    0   33    0    0    0    0  1.17
  0   66    0    0    0   33    0    0    0  0.74
  0    0    0   33    0    0   66    0    0  0.48

Page 74

Refine by dynamic programming on NN ‘energy’

[Figure: two panels with vertical axis from 0 to 1 plotted against residue number; curves labelled T and N]

Page 75

PHDhtm refine topology prediction

[Figure: chain drawn from N-term to C-term across the lipid membrane bilayer, between the extra-cytoplasmic and intra-cytoplasmic sides.
Eight best HTM's: 0.92, 0.95, 0.93, 0.91, 0.90, 0.92, 0.87, 0.89
µ=0: 0 HTM; µ=1: 1 HTM; µ=2: 2 HTM; µ=3: 3 HTM
Loop lengths: 5, 30, 6, 5
Charge: number of R+K in loops 1-4: Σ(R+K) = 2, 5, 3, 1]

final prediction: ∆ = (5+1) - (2+3) > 0 => first loop out
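The charge rule in the final prediction above can be written out as a short sketch. It assumes that ∆ is the number of R+K in the even-numbered loops minus the number in the odd-numbered loops, as the slide's arithmetic (5+1) - (2+3) suggests, and that ∆ > 0 places the first loop outside. How a tie (∆ = 0) is resolved is not stated on the slide; returning "in" for ∆ ≤ 0 is an assumption, and the function names are hypothetical.

```python
def count_rk(loop_sequence: str) -> int:
    """Number of positively charged residues (R and K) in one loop."""
    return sum(loop_sequence.count(aa) for aa in "RK")


def first_loop_orientation(rk_per_loop):
    """Return (orientation of first loop, delta) from the R+K counts per loop.

    delta = (R+K in loops 2, 4, ...) - (R+K in loops 1, 3, ...).
    Assumption: delta <= 0 is reported as "in"; the slide only shows delta > 0.
    """
    odd_loops = sum(rk_per_loop[0::2])   # loops 1, 3, ...
    even_loops = sum(rk_per_loop[1::2])  # loops 2, 4, ...
    delta = even_loops - odd_loops
    return ("out" if delta > 0 else "in"), delta


# R+K counts for loops 1-4 as given on the slide.
orientation, delta = first_loop_orientation([2, 5, 3, 1])
print(orientation, delta)  # out 1
```

With the counts from the slide the even-numbered loops carry more positive charge, so they are assigned to the inside and the first loop to the outside.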

Page 76

PHDhtm on Poly-Valine

HEADER LIPOPROTEIN(SURFACE FILM)
COMPND PULMONARY SURFACTANT-ASSOCIATED POLYPEPTIDE C(SP-C)
SOURCE PIG (SUS SCROFA)
AUTHOR J.JOHANSSON,T.SZYPERSKI,T.CURSTEDT,K.WUTHRICH

AA LRIPCCPVNLKRLLVVVVVVVLVVVVTVGALLMGL
OBS htm TTTTTTTTTTTTTTTTTTTTTTTTT
PHD htm TTTTTTTTTTTTTTTTTTTTTTTT

Page 77

Example IS representative

Method/Subset        Nprot  Q     % correct segments  % correct topology
PHDhtm_fil           131    94.4  88.5 ± 3.1          82.4 ± 3.8
PHDhtm_ref           131    93.8  89.3 ± 3.1          86.3 ± 3.1
PHDhtm_ref           83     93.6  88.0 ± 3.6          85.5 ± 4.8
Jones et al., 1994   83           79.5 ± 3.7          77.1 ± 3.8
Eukaryotes           99     95.8  93.5 ± 3.2          90.3 ± 3.2
Prokaryotes          33     85.6  75.8 ± 9.1          72.7 ± 9.1

all HTM correct: 89.3 ± 3.1
topology correct: 86.3 ± 3.1

Page 78

To be or not to be (HTM)

[Figure: curve H with vertical axis from 0 to 1 plotted against residue number]

ϑ strict = 0.8, and ϑ loose = 0.7

Page 79

False positives: globular proteins

Method                   Nglob  Eglob
PHDhtm, ϑstrict = 0.8    435    1.6 % ± 0.7%
PHDhtm, ϑloose = 0.7     435    3.7 % ± 0.9%
PHDhtm_fil               435    5.7 % ± 1.1%
PHDhtm_fil a             278    4.3 % ± 1.4%
Jones et al., 1994 b     155    3.2 % ± 1.9%
Edelman, 1993 c          14     21.4 % ± 14.3%

ϑ=0.8: false 1.6 ± 0.7
ϑ=0.7: false 3.7 ± 0.9

Page 80

Details PHDsec: Wrong alignment

• single sequences => accuracy clearly lower
• sufficient information in multiple alignment
  – many sequences
  – diversity
• wrong alignment -> wrong prediction

ID          %IDE  %WSIM  IFIR  ILAS  JFIR  JLAS  LALI  NGAP  LGAP  LSEQ
ftsh_ecoli  1.00  1.00      1   644     1   644   644     0     0   644
ftsh_haein  0.76  0.84    256   635     1   380   380     0     0   381
ftsh_bacsu  0.50  0.62      3   630     6   637   623     6    14   637
ftsh_porpu  0.48  0.59      5   604     9   623   598     5    19   628
ftsh_lacla  0.46  0.57      1   638    12   695   635     7    52   695
ftsh_odosi  0.45  0.56      2   611     5   644   609     5    32   644

Page 81

Details PHDhtm: wrong for ‘safe’ alignment

....,....1....,....2....,....
AA |MAKNLILWLVIAVVLMSVFQSFGPSESNG|
OBS htm | HHHHHHHHHHHHHHHHHHHH |
PHD htm | |
Rel htm |99999999999888889999999999999|

Page 82

Details PHDhtm: correct for accurate alignment

....,....1....,....2....,....
AA |MAKNLILWLVIAVVLMSVFQSFGPSESNG|
OBS htm | HHHHHHHHHHHHHHHHHHHH |
PHD htm | HHHHHHHHHHH |
Rel htm |88877651000000000001357899999|
PHDRhtm | HHHHHHHHHHHHHHHHHH |
PHDThtm |iiiiTTTTTTTTTTTTTTTTTTooooooo|

Page 83

Evolution teaches prediction

• Bioinformatics up to the data deluge? NO, but work in progress!
• Know what we do? Some do, 30% over 100 residues!
• Where are we today? NO 3D prediction from sequence!
• Evolutionary odyssey applied:
  – secondary structure +15% -> 76% ± 10%
  – transmembrane proteins +10% -> 65% topo ok
  – solvent accessibility
• Are 1D predictions useful?
  – sub-cellular localisation
  – whole genomes
  – 3D structure: threading
  – floppy regions

Page 84

Defining residue solvent accessibility

Page 85

PHDacc

[Figure: neural network, first level only, with input layer, hidden layer and output layer (output units labelled 0, 1, 2, 9, 16, 25, 36, 49, 64, 81).
Input: 21+3 units per residue for 13 adjacent residues of the local alignment (input local in sequence), plus 20+4+4+4 units of global statistics over the whole protein (input global in sequence):
  – percentage of each amino acid in protein
  – length of protein (≤60, ≤120, ≤240, >240)
  – distance: centre, N-term (≤40, ≤30, ≤20, ≤10)
  – distance: centre, C-term (≤40, ≤30, ≤20, ≤10)]

Alignment columns shown in the figure: AAA, AA., LLL, LII, AAG, CCS, GVV

  A    C    L    I    G    S    V  ins  del  cons
100    0    0    0    0    0    0    0    0  1.17
100    0    0    0    0    0    0   33    0  0.42
  0    0  100    0    0    0    0    0   33  0.92
  0    0   33   66    0    0    0    0    0  0.74
 66    0    0    0   33    0    0    0    0  1.17
  0   66    0    0    0   33    0    0    0  0.74
  0    0    0   33    0    0   66    0    0  0.48

Page 86

Evolution for accessibility prediction

• Detailed prediction problematic
• Significant gain by evolutionary information: in/out with > 75% accuracy!

Page 87

PHDacc: the un-g(l)ory details

• accuracy > 75% (two states: buried, exposed)
• distribution with ≈ 10%
• stronger predictions more accurate
• WARNING: reliability index almost factor 2 too large for single sequences
• accuracy below average for intermediate state
• VERY dependent on alignment accuracy

Page 88

Evolution teaches prediction

• Bioinformatics up to the data deluge? NO, but work in progress!
• Know what we do? Some do, 30% over 100 residues!
• Where are we today? NO 3D prediction from sequence!
• Evolutionary odyssey applied:
  – secondary structure +15% -> 76% ± 10%
  – transmembrane proteins +10% -> 65% topo ok
  – solvent accessibility + 5% -> 75%
• Are 1D predictions useful?
  – sub-cellular localisation
  – whole genomes
  – 3D structure: threading
  – floppy regions

Page 89

Evolution teaches prediction

• Bioinformatics up to the data deluge? NO, but work in progress!
• Know what we do? Some do, 30% over 100 residues!
• Where are we today? NO 3D prediction from sequence!
• Evolutionary odyssey applied:
  – secondary structure +15% -> 76% ± 10%
  – transmembrane proteins +10% -> 65% topo ok
  – solvent accessibility + 5% -> 75%
• Are 1D predictions useful? Of course to experts
  – sub-cellular localisation
  – whole genomes
  – 3D structure: threading
  – floppy regions

Page 90

Simplistic perspective of sub-cellular location

[Figure labels: EXTRACELLULAR, NUCLEAR, CYTOPLASMIC]

Page 91

Residue composition projected onto first two principal components

[Figure: three scatter plots, panels "Surface residues", "Core residues" and "All residues"; points marked c, e and n; axis ticks from - 0.3 to 0.3. Axis labels shown: "- 0.73 C + 0.26 I + 0.55 L"; "- 0.6 G - 0.32 T - 0.21 N + 0.23 E + 0.4 K + 0.44 R"; "- 0.62 G - 0.26 V + 0.25 K + 0.32 E + 0.54 R"]

Page 92

Surface composition projected onto first two principal components

[Figure: two scatter plots with axis ticks from - 0.2 to 0.2 and axis label "- 0.4 G - 0.35 N - 0.25 T - 0.21 S + 0.21 E + 0.47 K + 0.52 R"; points marked g, i and o in the first panel and c, e and n in the second]

Page 93

Average surface composition

[Figure: three bar charts (extracellular, cytoplasmic, nuclear; axis ticks 5, 10, 15) with one bar per amino acid (A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y); bars are annotated with pairwise comparisons such as "E<C", "E<N", "N<C", "C<E", "N<E" and "C<N"]

Page 94

Electrostatic properties

               positive  negative  polar  apolar
extracellular     7%        9%      50%    34%
cytoplasmic      19%       19%      29%    33%
nuclear          26%       15%      27%    32%

Page 95

Shuttle into the nucleus

[Figure labels: CYTOPLASM; NUCLEUS; NLS; M9; Transport in; Import in; Nucleus; Cytoplasm]

Page 96

How many NLS motifs in databases?

• ONE in PROSITE: bi-partite motif

Set A   N NLS B   Nprot nuc C   Nfam nuc D   Accuracy E   Coverage F

PROSITE 1 96 31 90 % 3 %

SWISS-PROT 322 290 n.a. 9 %

NLS-lit cleaned 91 309 35 100 % 10 %

NLS-lit consensus 91 537 35 100 % 17 %

PredictNLS_DB 214 1354 186 100 % 43 %

Page 97

Experimental NLS: positive charges

NLS Protein Reference

RKRKK YstDNApolalpha Hsieh et al., 1998
RKRRR Amida Irie et al., 2000
KKKKRKREK LEF-1 Prieve et al., 1998
KKKRRSREK TCF-1 Prieve et al., 1998
RQARRNRRRRWR HIV-1 Rev Truant et al., 1999
RRMKWKK PDX-1 Moede et al., 1999
PKKKRKV SV40 LrgT Kalderon et al., 1984
PRRRK SRY Sudbeck and Scherer, 1997
GKKRSKA H2B Moreland et al., 1987
KAKRQR v-Rel Gilmore and Temin, 1988
RGRRRRQR Amida Irie et al., 2000
PPVKRERTS RanBP3 Welch et al., 1999
PYLNKRKGKP Pho4p Welch et al., 1999
KRx{7,9}PQPKKKP p53-NLS1 Liang and Clarke, 1999
KVTKRKHDNEGSGSKRPK Hum-Ku70 Koike et al., 1999
RLKKLKCSKx{19}KTKR GAL4 Chan et al., 1998
RKRIREDRKx{18}RKRKR TCPTP Chan et al., 1998
RRERx{4}RPRKIPR BDV-P Schwemmle et al., 1999
KKKKKEEEGEGKKK act/inh betaA Blauer et al., 1999
PRPRKIPR BDV-P Shoya et al., 1998
PPRIYPQLPSAPT BDV-P Shoya et al., 1998
KDCVINKHHRNRCQYCRLQR TR2 Yu et al., 1998
APKRKSGVSKC PolyomaVP1 Chang et al., 1992
RKKRRQRRR HIV-1 Tat Truant et al., 1999
MPKTRRRPRRSQRKRPPT Rex Palmeri and Malim, 1999
KRPMNAFIVWSRDQRRK SRY Sudbeck and Scherer, 1997
KRPMNAFMVWAQAARRK SOX9 Sudbeck and Scherer, 1997
PPRKKRTVV NS5A Ide et al., 1996
YKRPCKRSFIRFI DNAse EBV Liu et al., 1998
LKDVRKRKLGPGH DNAse EBV Lyons et al., 1987
KRPRP AdenovE1a Bouvier and Baldacci, 1995
RRSMKRK hVDR Vihinen-Ranta et al., 1997
PAKRARRGYK CPV capsid Kaneko et al., 1997
RKCLQAGMNLEARKTKK hGlu.cort. Kaneko et al., 1997
RRERNKMAAAKCRNRRR CFOS Kaneko et al., 1997
KRMRNRIAASKCRKRKL CJUN Kaneko et al., 1997
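Motifs written in this notation can be scanned against a sequence with ordinary regular expressions. The sketch below assumes that a lowercase `x` stands for any residue and that `x{7,9}` means seven to nine arbitrary residues; the helper names and the toy sequence are hypothetical and serve only to illustrate the matching.

```python
import re


def motif_to_regex(motif: str) -> "re.Pattern[str]":
    """Translate a motif such as 'KRx{7,9}PQPKKKP' into a compiled regex.

    Assumption: lowercase 'x' is a wildcard for any single residue.
    """
    return re.compile(motif.replace("x", "."))


def find_nls(sequence: str, motifs: dict) -> list:
    """Return (motif name, start position, matched segment) for every hit."""
    hits = []
    for name, motif in motifs.items():
        for match in motif_to_regex(motif).finditer(sequence):
            hits.append((name, match.start(), match.group()))
    return hits


# Three motifs taken from the table above.
MOTIFS = {
    "SV40 LrgT": "PKKKRKV",
    "p53-NLS1": "KRx{7,9}PQPKKKP",
    "GAL4": "RLKKLKCSKx{19}KTKR",
}

# Made-up sequence with two motifs planted for demonstration.
toy = "MSA" + "PKKKRKV" + "GG" + "KR" + "ALDNTSSS" + "PQPKKKP" + "LE"

hits = find_nls(toy, MOTIFS)
for name, start, segment in hits:
    print(name, start, segment)
```

A plain pattern match of this kind reports every occurrence of a motif, including occurrences in proteins that are not nuclear, which is why accuracy and coverage have to be assessed separately.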

Page 98

Experimental NLS: more complicated

NLS Protein Reference

CYGSKNTGAKKRKIDDA DNAhelicaseQ1 Miyamoto et al., 1997
[AKR]TPIQKHWRPTVLTEGPPV KIRIETGEWE[KA] ASVintegrase Kukolj G. 1998
GGGx{3}KNRRx{6}RGGRN Nab2 Truant et al., 1998
KRxxxxxxxxxKTKK THOV NP Weber et al., 1998
EYLSRKGKLEL VirD2-Nterm Tinland et al., 1992
KRPACTL KPECVQQLLVCSQEA KK HCDA Somasekaram et al., 1999
RVHPYQR QKI-5 Wu et al., 1999
HARNT Eguchi et al., 1997
YNNQSSNFGPMKGGN M9 Bonifaci et al., 1997
SxGTKRSYxxM InfluenzaNP Wang et al., 1997
TKRSxxxM InfluenzaNP Wang et al., 1997
VNEAFETLKRC MyoD Vandromme et al., 1995
MNKIPIKDLLNPG Mat-alpha Hall et al., 1984

Page 99

In silico mutagenesis

Page 100

Increasing accuracy and coverage

Set A   N NLS B   Nprot nuc C   Nfam nuc D   Accuracy E   Coverage F

PROSITE 1 96 31 90 % 3 %

SWISS-PROT 322 290 n.a. 9 %

NLS-lit cleaned 91 309 35 100 % 10 %

NLS-lit consensus 91 537 35 100 % 17 %

PredictNLS_DB 214 1354 186 100 % 43 %


Page 104: Burkhard Rost (Columbia New York) Evolution teaches to predict protein structure and function Burkhard Rost CUBIC Columbia University rost@columbia.edu.

Burkhard Rost (Columbia New York)

Increasing accuracy and coverageIncreasing accuracy and coverageIncreasing accuracy and coverageIncreasing accuracy and coverage

Set A N NLS B Nprot nuc C Nfam nuc D Accuracy E

Coverage F

PROSITE 1 96 31 90 % 3 %

SWISS-PROT 322 290 n.a. 9 %

NLS-lit cleaned 91 309 35 100 % 10 %

NLS-lit consensus 91 537 35 100 % 17 %

PredictNLS_DB 214 1354 186 100 % 43 %

Coverage

Page 105

Nuclear proteins in proteomes

Genome        No ORFs   No prot with NLS   Estimated % nuclear
Human         13933     1311               > 22 % F
Drosophila    14219     1256               > 21 %
C. elegans    16232     1141               > 17 %
Yeast         6307      479                > 18 %
E. coli       4286      54                 0 %
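The slide does not say how the "estimated % nuclear" column was derived. One plausible reading, offered only as an assumption, is that the observed fraction of ORFs carrying an NLS is scaled up by the 43 % coverage reported for PredictNLS_DB. The sketch below works through that arithmetic; the function names and the default `coverage=0.43` are illustrative choices, not the author's stated method.

```python
def percent_with_nls(n_orfs: int, n_with_nls: int) -> float:
    """Percentage of ORFs in a proteome that match a known NLS."""
    return 100.0 * n_with_nls / n_orfs

def estimated_percent_nuclear(n_orfs: int, n_with_nls: int, coverage: float = 0.43) -> float:
    """Scale the observed percentage by the assumed motif coverage.

    Assumption: if the motifs find only `coverage` of all nuclear
    proteins, the true nuclear fraction is about observed / coverage.
    """
    return percent_with_nls(n_orfs, n_with_nls) / coverage

if __name__ == "__main__":
    # Human row of the table: 13933 ORFs, 1311 with an NLS.
    print(round(percent_with_nls(13933, 1311), 1))
    print(round(estimated_percent_nuclear(13933, 1311), 1))
```

For the human row this gives about 9.4 % of ORFs with an NLS and an estimate close to 22 %, in line with the table; the other rows agree only approximately, so the scaling should be read as a guess at the reasoning rather than a reconstruction.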

Page 106

Un-annotated nuclear proteins with NLS

• ATAXIN-1 GERGHGGG

• Breast Cancer type2 (Brc2) RIKKKQR

• Fibroblast Growth factor (fgf) KKRRRRR

• Brg1 ERKRRQ

Page 107

Using NLS to bind DNA

Page 108

DNA-binding predictions in proteomes

Genome        Nprot    Nprot bind-DNA predicted   Nprot bind-DNA known
Human         13933    419                        141
Drosophila    14219    300                        37
C. elegans    16232    251                        10
Yeast         6307     67                         10
E. coli       4286     13                         3

Page 109

Rotation @ CUBIC.bioc.columbia.edu

• want all cell-cycle proteins
• search in SWISS-PROT, PROSITE
• search literature
• build ‘expert’ set of known

Page 110

Significant motifs

AFWKLMDDSEQ
GFWKLMDESNQ

AFWKLMDDSEQ
GFWRISAEPNN

Page 111

Rotation @ CUBIC.bioc.columbia.edu

• want all cell-cycle proteins
• search in SWISS-PROT, PROSITE
• search literature
• build ‘expert’ set of known
• choose unique subset

Page 112

Finding unique subsets of proteins
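The slide gives only the title, so the following is a generic sketch of one way a unique (non-redundant) subset can be chosen: a greedy pass that keeps a protein only if it is not too similar to any protein already kept. The toy `percent_identity` function assumes pre-aligned, equal-length sequences, and the 30 % threshold is a hypothetical default chosen for illustration; a real pipeline would use proper alignments and a length-dependent threshold.

```python
def percent_identity(a: str, b: str) -> float:
    """Percentage of identical positions in two pre-aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be pre-aligned to equal length")
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)

def unique_subset(proteins: list[tuple[str, str]], max_identity: float = 30.0) -> list[str]:
    """Greedily keep proteins that share at most `max_identity` % identity
    with every protein kept so far. Input order decides which member of a
    similar pair survives."""
    kept: list[tuple[str, str]] = []
    for name, seq in proteins:
        if all(percent_identity(seq, other) <= max_identity for _, other in kept):
            kept.append((name, seq))
    return [name for name, _ in kept]

if __name__ == "__main__":
    toy = [("a", "AFWKLMDDSEQ"), ("b", "GFWKLMDESNQ"), ("c", "GFWRISAEPNN")]
    print(unique_subset(toy))
```

Because the procedure is greedy, sorting the input (for example by annotation quality) before the pass controls which representatives are retained.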

Page 113

Similar sequence -> similar structure?

[Figure: curves for identity and similarity plotted against the number of residues aligned (x axis 0 to 250, y axis 0 to 100).]

B Rost 1999 Prot. Engin. 12, 85-94

Page 114

Rotation @ CUBIC.bioc.columbia.edu

• want all cell-cycle proteins
• search in SWISS-PROT, PROSITE
• search literature
• build ‘expert’ set of known
• choose unique subset
• find motifs

… sorry, time ran out here!

Page 115

Retention signals in ER and Golgi

Columns: Sequence motif (1); Total N; Eukaryotes N, %; Non-Eukaryotes N, %; ER/Golgi N, %; Non-ER/Non-Golgi N, %

Endoplasmic reticulum (ER) motifs: (2)
KDEL-C-term 61 55 90 6 10 56 92 5 8
KDEL 775 455 59 320 41 61 7 714 92
HDEL-C-term 49 49 100 0 0 45 92 4 8
HDEL 315 185 59 130 41 46 15 269 2
HDEF-C-term 4 3 75 1 25 2 50 2 50
HDEF 91 50 55 41 45 2 2 89 98
KKXX-C-term 907 492 52 415 48 53 6 854 94
KKXX 57848 32493 56 25355 44 810 1 57038 99
XXRR 51849 28043 56 23806 46 688 1 51161 99
KKFF-C-term 4 3 75 1 25 1 25 3 75
KKFF 261 168 64 93 36 5 2 256 98
KKAA-C-term 22 7 22 15 68 5 23 17 77
KKAA 995 600 60 395 40 24 3 964 97

Golgi apparatus motifs: (3)
YQRL 273 137 50 136 50 3 1 270 99
YKGL 447 237 54 210 46 5 1 442 99
YHPL 80 40 50 40 50 4 5 76 95
YXXZ 83589 44335 53 39234 47 477 1 83112 99
NPFKD 14 12 86 2 14 0 0 14 100
FXFXD 3200 1762 55 1438 45 31 1 3169 99
FQFND 4 1 25 3 75 1 25 3 75
PXPXP 8542 6043 71 2499 29 65 1 8477 99
[DE]X[DE] 80940 42436 53 38504 47 479 1 80461 99
GRIP-motif (5) 2 2 100 0 0 1 50 1 50
GRIP-motif (shortened) (6) 29 17 59 12 41 1 3 28 97

C-term variations: (4)
PROSITE Pattern (7) 173 151 88 22 12 134 77 39 23

Page 116

Evolution teaches prediction

• Bioinformatics up to the data deluge? NO, but work in progress!
• Know what we do? Some do, 30% over 100 residues!
• Where are we today? NO 3D prediction from sequence!
• Evolutionary odyssey applied:
– secondary structure +15% -> 76% ± 10%
– transmembrane proteins +10% -> 65% topo ok
– solvent accessibility + 5% -> 75%
• High-throughput success of predictions:
– localisation: accessibility useful, but not enough!
– whole genomes
– 3D structure: threading
– floppy regions

Page 117

[Figure: distribution of ORF lengths in three panels. Eukaryotes (caeel-P, human-P, yeast-P); Archaea (aerpe-P, arcfu-P, metja-P, mettm-P, pyrab-P, pyrho-P, thema-P); Prokaryotes (aquae-P, bacsu-P, borbu-P, chlpn-P, chltr-P, deira-P, ecoli-P, haein-P, helpy-P, mycge-P, mycpn-P, myctu-P, ricpr-P, syny3-P, trepa-P). x axis: length of ORF (0 to 600); y axis: percentage of ORFs in entire genome.]

[Figure: distribution of ORF lengths for Eukaryotes (caeel, human, yeast) for ORF lengths from 1000 to 2500.]

Page 118

Family size

[Figure: cumulative percentage of proteins against number of proteins in family (0 to 100), in three panels. Archaeans: aerpe, arcfu, metja, mettm, pyrab, pyrho. Prokaryotes: aquae, bacsu, borbu, camje, chlpn, chltr, deira, ecoli, haein, helpy, mycge, mycpn, myctu, neime, ricpr, syny3, thema, trepa, ureur. Eukaryotes: yeast, caeel, drome, human, hs22.]

Page 119

Structure prediction for protein universe

[Figure: two bar charts per proteome, percentage of proteins in the proteome (0 to 50) and percentage of residues in the proteome (0 to 40). Archaea: A pernix, A fulgidus, M jannaschii, M thermoautotrophicum, P abyssi, P horikoshii. Prokaryotes: A aeolicus, B subtilis, B burgdorferi, C jejuni, C pneumoniae, C trachomatis, D radiodurans, E coli, H influenzae, H pylori, M genitalium, M pneumoniae, M tuberculosis, N meningitidis, R prowazekii, S PCC6803, T maritima, T pallidum, U urealyticum. Eukaryotes: S cerevisiae, C elegans, D melanogaster, H sapiens (SP/TrEmbl), H sapiens (chr 22).]

Page 120

Do we aim at getting one structure per fold?

• Structural proteomics = hunt for new folds ?

Tough task for theory!

-> Practice: Shrink complexes: 14747 technicians!

• Can we avoid non-globular proteins?

• Can we prioritise aspects of function?

Page 121

Similar amino acid composition

[Figure: amino acid composition (y axis 5% to 20%) in six panels: Aeropyrum pernix K1, Archaeoglobus fulgidus, Escherichia coli, Bacillus subtilis, Yeast, Caenorhabditis elegans.]

Page 122

Inventory of life: membrane proteins

[Figure: bar chart of %mem (0 to 30) per proteome, from A pernix to H sapiens (chr 22), grouped into Archaea, Prokaryotes and Eukaryotes.]

Page 123

Number of membrane helices -> complexity?

[Figure: cumulative percentage of membrane proteins (0 to 100) against number of transmembrane helices (0 to 20), with curves for Archaea, Prokaryote and Eukaryote.]

Page 124

Membrane proteins: kingdoms invented different tricks

[Figure: twelve panels (aerpe, bacsu, yeast; arcfu, camje, caeel; metja, ecoli, drome; pyrho, haein, human), each with x axis 1 to 17 and y axis 0 to 40; legend: in, out.]

Page 125

The membrane LEGO

Page 126

Length of globular regions in membrane proteins

[Figure: percentage of globular regions (0 to 50) against length of globular regions in membrane proteins (100 to 700), for intracellular and extracellular regions, in six panels: Aeropyrum pernix K1, Archaeoglobus fulgidus, Bacillus subtilis, Escherichia coli, Caenorhabditis elegans, Drosophila melanogaster.]

Page 127

Inventory of life: coiled-coil proteins

[Figure: bar charts of %mem (0 to 30) and %coils (0 to 12) per proteome, from A pernix to H sapiens (chr 22), grouped into Archaeans, Prokaryotes and Eukaryotes.]

Page 128

Coiled-coil proteins: details

[Figure: percentage of coiled-coil proteins (0 to 80) against number of coiled-coil regions (1 to 7), in panels for aerpe, arcfu, bacsu, ecoli, caeel and human.]

[Figure: percentage of coiled-coil regions (0 to 80) against length of coiled-coil regions (28 to 252), in panels for aerpe, arcfu, bacsu, ecoli, caeel and human.]

Page 129

Inventory of life: compartments

[Figure: bar charts of % extra-cellular (5 to 25), % membrane (0 to 30) and % nuclear (5 to 20) per proteome, from A pernix to H sapiens (chr 22).]

Page 130

Protein structure universe

Systematic discovery of target structures

Page 131

Distribution of protein length

[Figure: cumulative distribution of ORF lengths for Eukaryotes (caeel-PC, human-PC, yeast-PC); x axis: length of ORF (0 to 1000); y axis: 0 to 100.]

Page 132

Bottleneck 5: money ...

• Goal 500 in 5 years
• money:

total of $ 25 M in 5 years

50,000,000,000 Lire

Page 133

What will we get?

• many new structures
• the machinery for structural genomics
• some weird structures ...

Page 134

Recipe to determine targets

• Is it a known structure?
• Is it similar to a known structure?
• Is it a membrane protein?
• Does it look like a known fold?
• Does it look like a globular protein?
• Is it a big family?
• Is it short (NMR)? Does it contain Met (MAD)?

Page 135

Alternative recipe to determine targets

• Do we have a crystal?
• Is it a known structure?
• Is it similar to a known structure?

Page 136

Reality check: the invaluable contribution of bioinformatics to target selection

[Flowchart: Protein expressed? Protein purified/well behaved? Crystal? Known structure? Each question branches into YES and NO; the outcomes shown are "Do structure" and "Do structure, anyways".]

Page 137

Target selection

[Diagram labels: Experimental feasibility; Function space; Structure space.]

Page 138

Priority classes

• Experimental feasibility

• Biophysical properties

– length

– presence of Methionine

• Bioinformatics criteria

– similarity to known structure

– family size

– functional annotation

• Functional genomics

Page 139

Target selection machinery

Page 140

Conclusions: Structural Genomics

• we get:
• most major functional elements
• most structural scaffolds
• evolutionary links
• structure-based comparison
• high-throughput techniques

• we won’t get:
• complexes
• interaction between them
• particular structures

• when?
• 70% of the human genome by 2010 2015
• remainder = HTMs?

Page 141

Evolution teaches prediction

• Bioinformatics up to the data deluge? NO, but work in progress!
• Know what we do? Some do, 30% over 100 residues!
• Where are we today? NO 3D prediction from sequence!
• Evolutionary odyssey applied:
– secondary structure +15% -> 76% ± 10%
– transmembrane proteins +10% -> 65% topo ok
– solvent accessibility + 5% -> 75%
• High-throughput success of predictions:
– localisation: accessibility useful, but not enough!
– whole genomes: kingdoms differ in some respects!
– 3D structure: threading
– floppy regions

Page 142

Midnight zone STRONGLY populated

[Figure: number of structure pairs (0 to 1600) against percentage pairwise sequence identity, shown for 0 to 25 and for 0 to 100.]

Page 143

What we are threading for

[Figure: y axis 0 to 100 against number of residues aligned, with two labelled regions: "Sequence identity implies structural similarity !" and "Don't know region".]

Page 144

Goals of fold recognition, threading, remote homology modelling

• Recognising similar fold(s) (entire proteins)

• Detecting remote homologies for fragments (part of protein)

• Align target and fold

• Remote homology modelling (prediction in 3D)

Page 145

Two paths to fold recognition

[Diagram labels: Seq (U); 1D PHD (PHD 1, PHD 2, PHD 3, ..., PHD n); 3D PDB (Str 1, Str 2, Str 3, ..., Str n; 1aap, 1tcp, 1btr); 1D projection (sec, acc); profile; example secondary structure strings of E and H.]

Page 146

TOPITS

input: sequence
generate sequence alignment
predict 1D structure
Predict 1D structure from sequence
Project known 3D structure onto 1D
align predicted and known structure(s)

good match to one of the known structures? =>
• predict fold of matching structure
• model 3D coordinates by homology

LWQRPLVTIKIGGQLKEALLDTGAD

LWQRPLVTIKIGGQLKEALLDTGAD
LWRRPVVTAHIEGQLVEVLLDTGAD
DRPLVRVILTNTGstALLDSGAD
LEKRPTTIVLINDTPLNVLLDTGAD
:

-----EEEEE----EEEEEE-----
oooo•o•o•o•ooooo•ooooo•oo

-----EEEEE-----EEHHHH----
o•oo•••••o•ooo•oo•••oo••o

note: exposed = o, buried = •

[Figure: panels with y axis 55 to 85 and one panel with y axis 0 to 200, each against percentage of pairwise sequence identity (0 to 30).]

Page 147

Prediction-based threading

[Diagram labels: SWISS-PROT; BLAST; PHDsec; PHDacc; DSSP; MaxHom.]

Page 148

Example of remote sequence identity

1tcp-3aapA identity = 16% ; AS = 68% ; ali% = 51%

Protease inhibitor domain of Alzheimer's Amyloid (1aap)

Blood coagulation inhibitor (1tcp)

EEEEEEE EEEEEEE HHHHHHHH
SEQ....AETGPCRAMISRWYFDVTEGKCAPFFYGGCGG.NRNNFDTEEYCMAVC
////////////////// ||||||||||||||///// ||||||||||
...SEQAETGPCRAMISRWYFDVT.EGKCAPFFYGGCGGNRNNF.DTEEYCMAVC
...RDWIDECDSNEGGERAYFRNG.KGGCDSFWICPEDHTGADYYSSYRDCFNAC
HHHH EEEEE EEEEE HHHHHHH

1aap
1tcp

Page 149

30% correct first, better if stronger

[Figure: y axis 0 to 100 against percentage of pairs predicted at given zscore (coverage: 10, 14, 22, 29, 68, 92, 100), with points labelled all, z > 2, z > 2.5, z > 3, z > 3.5, z > 4, z > 4.5.]

[Figure: y axis 10 to 70 against rank R of first correctly detected remote homologue (2 to 10), with curves for: µ=50, sequence (Blosum62) + 1D structure; µ=50, sequence (McLachlan) + 1D structure; µ=100, structure only; µ=0, sequence only (McLachlan).]

Page 150

Other threading methods

• TOPITS is not the best!
• CASP: PredictionCenter.llnl.gov/content.html
• CAFASP: www.cs.bgu.ac.il/~dfischer/CAFASP2/
• EVA: cubic.bioc.columbia.edu/eva/
• CUBIC links: cubic.bioc.columbia.edu/doc/links_index.html

Page 151

Evolution teaches prediction

• Bioinformatics up to the data deluge? NO, but work in progress!
• Know what we do? Some do, 30% over 100 residues!
• Where are we today? NO 3D prediction from sequence!
• Evolutionary odyssey applied:
– secondary structure +15% -> 76% ± 10%
– transmembrane proteins +10% -> 65% topo ok
– solvent accessibility + 5% -> 75%
• High-throughput success of predictions:
– localisation: accessibility useful, but not enough!
– whole genomes: kingdoms differ in some respects!
– threading: better than sequence alignment!
– floppy regions (NORS: no regular secondary structure)

Page 152

Long floppy regions

• less than 5% helix or strand over > 70 residues
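The criterion above can be sketched as a sliding-window scan over a per-residue secondary structure string. This is an illustrative implementation under stated assumptions, not the author's NORS code: the input uses `H` for helix and `E` for strand (any other character counts as non-regular), a window of exactly `min_len` residues is tested at every position, and overlapping qualifying windows are merged. The function name and parameters are hypothetical.

```python
def floppy_regions(sec_str: str, min_len: int = 70, max_regular: float = 0.05) -> list[tuple[int, int]]:
    """Return half-open (start, end) spans covered by windows of `min_len`
    residues in which fewer than `max_regular` are helix (H) or strand (E)."""
    # Prefix sums give the number of H/E residues in any window in O(1).
    prefix = [0]
    for state in sec_str:
        prefix.append(prefix[-1] + (state in "HE"))

    regions: list[tuple[int, int]] = []
    for start in range(len(sec_str) - min_len + 1):
        end = start + min_len
        regular = prefix[end] - prefix[start]
        if regular / min_len < max_regular:
            if regions and start <= regions[-1][1]:
                # Overlaps or touches the previous span: extend it.
                regions[-1] = (regions[-1][0], end)
            else:
                regions.append((start, end))
    return regions

if __name__ == "__main__":
    # 20 helix residues, 80 residues without regular structure, 20 strand residues.
    print(floppy_regions("H" * 20 + "-" * 80 + "E" * 20))
```

Each reported span is a union of qualifying windows, so its overall helix and strand content can slightly exceed the per-window threshold at the edges.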

Page 153

Floppy loops between domains

Formate Dehydrogenase H (1aa6.pdb)

phiX174 virion (1al0F.pdb)

Isoamylase (1bf2.pdb)

DNA-containing capsid of CPV (4dpv.pdb)

Page 154

Floppy ends

pyruvate:ferredoxin oxidoredisoamylase(1b0pA.pdb)

Capsid protein of CPV (1b35C.pdb)

Hexon from adenovirus type 2 (1dhx.pdb)

Myeloperoxidase (1mhlA.pdb)

Aspartate aminotransferase (2aat.pdb)

Prothrombin fragment 2 (2hppP.pdb)

SH3 domain of PLC-gamma (1hsq.pdb)

Hydroxylase component of MMOH (1mtyB.pdb)

Page 155

Floppy-wrap

SH3 and adjacent ligand site (1awj.pdb)

Erythrocyte catalase (7cat.pdb)

GmDNV capsid protein (1dnx.pdb)

Cellulase (1tf4A.pdb)

Phosphoglycerate mutase (3pgm.pdb)

Carboxypeptidase T (1obr.pdb)

Page 156: Burkhard Rost (Columbia New York) Evolution teaches to predict protein structure and function Burkhard Rost CUBIC Columbia University rost@columbia.edu.

Burkhard Rost (Columbia New York)

WeirdoesWeirdoesWeirdoesWeirdoes

Extracellular domain of T beta RI (1tbi.pdb)

HIVZ2 Tat protein (1tac.pdb)

Plasminogen Kringle 4 (1krn.pdb)

Gene 5 DNA binding protein (2gn5.pdb)

Recombinant Kringle 5 domain (5hpg.pdb)

Aspartate transcarbamoylase (9atc.pdb)


Weirdoes are not alone!

[Figure: bar chart, one bar per genome. Axis: percentage of proteins with non-structured regions (scale 0 to 35). Genomes shown: A pernix, A fulgidus, M jannaschii, M thermoautotrophicu, P abyssi, P horikoshii, A aeolicus, B subtilis, B burgdorferi, C jejuni, C pneumoniae, C trachomatis, D radiodurans, E coli, H influenzae, H pylori, M genitalium, M pneumoniae, M tuberculosis, N meningitidis, R prowazekii, S PCC6803, T maritima, T pallidum, N urealyticum, C elegans, D melanogaster, S cerevisiae, H sapiens, H sapiens chr.22]


10% of biomass weird!

[Figure: bar chart, one bar per genome. Axis: percentage of residues in non-structured regions (scale 0 to 15). Genomes shown: A pernix, A fulgidus, M jannaschii, M thermoautotrophicu, P abyssi, P horikoshii, A aeolicus, B subtilis, B burgdorferi, C jejuni, C pneumoniae, C trachomatis, D radiodurans, E coli, H influenzae, H pylori, M genitalium, M pneumoniae, M tuberculosis, N meningitidis, R prowazekii, S PCC6803, T maritima, T pallidum, N urealyticum, C elegans, D melanogaster, S cerevisiae, H sapiens, H sapiens chr.22]
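The two per-genome quantities plotted on these slides (percentage of proteins with a non-structured region, and percentage of residues inside such regions) are simple ratios. The sketch below illustrates the arithmetic only; the input layout, a list of `(length, regions)` pairs with non-overlapping half-open `(start, end)` spans, and the function name `genome_nsr_stats` are assumptions made for illustration, not the format used for the talk.

```python
def genome_nsr_stats(proteins):
    """Compute two summary percentages for one genome.

    proteins -- list of (length, regions) pairs; regions is a list of
                non-overlapping half-open (start, end) spans (assumed layout)
    Returns (pct_proteins_with_region, pct_residues_in_regions).
    """
    n_proteins = len(proteins)
    total_residues = sum(length for length, _ in proteins)
    if n_proteins == 0 or total_residues == 0:
        return 0.0, 0.0
    # Proteins that contain at least one non-structured region.
    with_region = sum(1 for _, regions in proteins if regions)
    # Residues covered by non-structured regions, summed over all proteins.
    in_regions = sum(end - start for _, regions in proteins for start, end in regions)
    return 100.0 * with_region / n_proteins, 100.0 * in_regions / total_residues


if __name__ == "__main__":
    toy_genome = [(200, [(10, 110)]), (300, []), (500, [(0, 80), (400, 500)])]
    print(genome_nsr_stats(toy_genome))
```

The toy genome in the usage lines is made up; it has no connection to the genomes in the figures.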


Length distribution of floppy regions

[Figure: length distribution of non-structured regions, three panels: A. pernix, E. coli, C. elegans. Axis labels: length of non-structured regions (ticks 70 to 190); percentage of non-structured regions (ticks 0 to 20 for A. pernix and C. elegans, 0 to 50 for E. coli)]


Weirdoes functional!

[Figure: three panels: A. pernix, E. coli, C. elegans. Axis labels: percentage of non-structured regions; difference in percentage of aligned proteins. Tick ranges: 0 to 80 and -100 to 200. Legend: left, right]


Yeast-2-hybrid interactions

[Figure: two series, non-NSR and NSR. Axis labels: cumulative percentage of proteins; number of interacting partners. Tick ranges: 0 to 35 and 0 to 10]


Evolution teaches prediction

• Bioinformatics up to the data deluge? NO, but work in progress!
• Know what we do? Some do, 30% over 100 residues!
• Where are we today? NO 3D prediction from sequence!
• Evolutionary odyssey applied:
– secondary structure +15% -> 76% ± 10%
– transmembrane proteins +10% -> 65% topo ok
– solvent accessibility +5% -> 75%
• High-throughput success of predictions:
– localisation: accessibility useful, but not enough!
– whole genomes: kingdoms differ in some respects!
– threading: better than sequence alignment!
– NORS: weirdoes not alone AND functional!


Conclusions

• no prediction of 3D structure

• no prediction of function

• but: quantum leap through using ‘frozen knowledge’ from evolution and protein structures

• the data deluge floods bioinformatics

• the unsolved urgent problems are legion

• but: it is still time to get it done:
– running BLAST is NOT all there is …
– the key is intelligent use of biological knowledge ...


Thanksgiving

• Volker Eyrich (Schrödinger, New York)
• Chris Sander (Whitehead, Boston)
• Reinhard Schneider (LION, Boston)
• Alfonso Valencia (CNB, Madrid)

• Miguel Andrade (EMBL, Heidelberg)
• Séan O’Donoghue (LION, Heidelberg)

• Amos Bairoch (SIB, Genève)
• Michael Braxenthaler (La Roche, New York)
• Søren Brunak (CBS, København)
• Rita Casadio (Univ. Bologna)
• Antoine De Daruvar (LION, Bordeaux)
• David Eisenberg (UCLA, Los Angeles)
• Piero Fariselli (Univ. Bologna)
• Barry Honig (Columbia, New York)
• Tim Hubbard (Sanger, Hinxton)
• Michael Levitt (Univ. Stanford)
• Marc Marti-Renom (Rockefeller, New York)
• Andrej Sali (Rockefeller, New York)
• Michael Scharf (Take 5, Heidelberg)
• Gerrit Vriend (Univ. Nijmegen)
• Manfred Sippl (Univ. Salzburg)

localisation

.. in general

• Jinfeng Liu (genomes, floppy, domains)
• Rajesh Nair (NLS, localisation)
• Yanay Ofran (protein interactions)
• Dariusz Przybylski (PSI-Blast, EVA, threading)
• Henry Bigelow (predict porins)

• Claus Andersen (continuous DSSP)
• Bastiaan Bruning (transcription factors)
• Sven Mika (nuclear matrix proteins)
• Chien Peter Chen (membrane proteins)
• Kazimierz Wrzeszczynski (cell-cycle/ER-Golgi)
• Hepan Tan (floppy regions)


Availability of methods

• email: [email protected]
– subject: HELP
– file:
Email address
options
# protein name
SEQWENCE

• WWW: http://cubic.bioc.columbia.edu/predictprotein/

• META: http://cubic.bioc.columbia.edu/predictprotein/submit_meta.html

• EVA: http://cubic.bioc.columbia.edu/eva

• CUBIC: http://cubic.bioc.columbia.edu/