Secondary structure prediction cb1 sec3 - Rostlab structure prediction short title: cb1_sec3...

47
/81 © Burkhard Rost 1 title: Computational Biology 1 - Protein structure: Secondary structure prediction short title: cb1_sec3 lecture: Protein Prediction 1 - Protein structure Computational Biology 1 - TUM Summer 2015

Transcript of Secondary structure prediction cb1 sec3 - Rostlab structure prediction short title: cb1_sec3...

/81© Burkhard Rost

1

title: Computational Biology 1 - Protein structure: Secondary structure predictionshort title: cb1_sec3

lecture: Protein Prediction 1 - Protein structure Computational Biology 1 - TUM Summer 2015

/81© Burkhard Rost

Videos: YouTube / www.rostlab.org THANKS : Tim Karl + Carlo Di Domenico Special lectures:

• TBA No lecture:

• 05/12 Student assembly (SVV) • 05/14 Ascension day • 05/26 Whitsun holiday • 06/04 Corpus Christi LAST lecture: Jul 7 Examen: Jul 9

• Makeup: Oct 13, 2015 - morning/noon

2

CONTACT: Inga Weise [email protected]

Tim Karl

Tatyana Goldberg

Edda Kloppmann

Andrea Schaffer-

hans

Jonas Reeb

Carlo Di [email protected]

© Burkhard Rost /81

We are looking for IDP students

3

Carry your Genes (CyG)IDP@iteratec

Optional: Präsentationstitel

■ Personalized medicine

■ Carry your genes / medical data and provide them securely to treating doctor

■ Mobile Smartphone App (android) development

■ PC client development

■ Combined with special hardware encrypted storage

■ Implement security features at highest possible level

■ Work with bioinformatics and medical departments: Rostlab & RI

■ Work with IT Consulting company iteratec GmbH

■ Work with IT security company certgate

Carry your Genes (CyG)

© iteratec | Datum 5

What? Paid Hiwi job (one month) in combination with IDP project

Where? iteratec GmbH, Inselkammerstrasse 4, 82008 München-Unterhaching

When? asap

Who? Informatics master students @TUM

How? To apply please contact:

Dr. Lothar Richter

[email protected]

Daniel Stahr / Dr. Marc Offman

[email protected]

Settings

© iteratec | 27.05.2015 6

/81© Burkhard Rost

Recap: secondary structure

prediction

7

/81© Burkhard Rost

8

Goal of structure prediction

Epstein & Anfinsen, 1961:sequence uniquely determines structure

• INPUT: sequence

3D structureand function

• OUTPUT:

/81© Burkhard Rost

9

Zones

Day

light

Zon

e

Twili

ght Z

one

Mid

nigh

t Zon

eprofile - profile

sequence - profilesequence - sequence

sequ

ence

sim

ilar

->

stru

ctur

e sim

ilar

B Rost (1997) Fold Des 2:S19-24B Rost (1999) Protein Eng 12:85-94

© Burkhard Rost /81

Good news: comparative modeling

reliably predicts structure for >25 million proteins

10

© Burkhard Rost /81

Bad news:For most proteins

comparative modeling cannot be applied

11

/81© Burkhard Rost

12

Notation: protein structure 1D, 2D, 3DPQITLWQRPLVTIKIGGQLKEALLDTGADDTVL

PP PQQQYFFQVISSIVRLLSTLWWQEDRKQAKRRRPQPPPPPVVTKFVVLIITTKEKAALIVHYKKFIILVIEENGGGGGTGQQKRRPPLWWVVFKVEESKKVVGLGLLILLLLLVVDDDDDTTTTTGGGGGAAAAADDDDDDDAKESSTTVIIVIVVVIVL

1281757077

120238169200247114740

904

466268

11831

1241

292449726217

102691

140

1109760691481976248590

690

730

415371597395000

5851300

79586900

EEEEE

EEEEEE

EEEEEEE

EE

EEEEE

EEEEEE

EE

kcal/mol0 -1 -2 -3 -4 -5

1 10 20 30 40 50 60 70 80 90

1

10

20

30

40

50

60

70

80

90

1D1D 2D2D 3D3D

/81© Burkhard Rost

1st-2rd generation secondary structure prediction methods

13

/81© Burkhard Rost

1st generation (1957-1978): single residue oddse.g. Chou-Fasman/GOR

2nd generation (1983-1992):e.g. GORIIIodds for windows

14

Sec struc pred: 1st gen

p1(SECi|AAi)=probability for observing secondary structure state SEC for amino acid AA at position i

p(SEC|AAi)=probability for observing secondary structure state SEC for amino acid AA at position i= SUM (j=i-w,i+w) p1(SECj,AAj)

Erabutoxin β (3ebx)

V32 V36 V51

w=3

/81© Burkhard Rost

1st generation (1957-1978): single residue oddse.g. Chou-Fasman/GOR

2nd generation (1983-1992):e.g. GORIIIodds for windows

15

Sec struc pred: 1st gen

p1(SECi|AAi)=probability for observing secondary structure state SEC for amino acid AA at position i

p(SEC|AAi)≠ p(SEC|AAj)

Erabutoxin β (3ebx)

V32 V36 V51

w=3

/81© Burkhard Rost

single residues (1. generation) • Chou-Fasman, GOR 1957-70/80

50-55% accuracy

segments (2. generation) • GORIII 1986-92

55-60% accuracy

problems • < 100% may be: 65% max • < 40% may be: strand non-local • short segments

16

Secondary structure prediction: 1.+2. Generation

Erabutoxin β (3ebx)

residuesiandi+3

/81© Burkhard Rost

17

ACDEFGHIKLMNPQRSTVWY.

H

E

L

D (L)

R (E)

Q (E)

G (E)

F (E)

V (E)

P (E)

A (H)

A (H)

Y (H)

V (E)

K (E)

K (E)

Neural Network for secondary structure

/81© Burkhard Rost

single residues (1. generation) • Chou-Fasman, GOR 1957-70/80

50-55% accuracy

segments (2. generation) • GORIII 1986-92

55-60% accuracy

problems • < 100% may be: 65% max • < 40% may be: strand non-local • short segments

18

Secondary structure prediction: 1.+2. Generation

Erabutoxin β (3ebx)

residuesiandi+3

/81© Burkhard Rost

helix strand otheroverallaccuracymethod

unbalanced 62%comparison:data bankdistribution

comparison:33:33:33balanced 60%

19

full pie: all correctly predicted residues

/81© Burkhard Rost

single residues (1. generation) • Chou-Fasman, GOR 1957-70/80

50-55% accuracy

segments (2. generation) • GORIII 1986-92

55-60% accuracy

problems • < 100% may be: 65% max

< 40% may be: strand non-local • short segments

20

Secondary structure prediction: 1.+2. Generation

Erabutoxin β (3ebx)

residuesiandi+3

/81© Burkhard Rost

H

E

L

V (E)

P (E)

A (H)

PHDsec:

structure-to-structure

21

PHDsec: structure-to-structure network

B Rost (1996) Methods Enzymol 266:525-39

/81© Burkhard Rost

single residues (1. generation) • Chou-Fasman, GOR 1957-70/80

50-55% accuracy

segments (2. generation) • GORIII 1986-92

55-60% accuracy

problems • < 100% may be: 65% max

< 40% may be: strand non-local short segments

22

Secondary structure prediction: 1.+2. Generation

Erabutoxin β (3ebx)

residuesiandi+3

/81© Burkhard Rost

STILL ONLY 60+ε% accuracy.

How to improve beyond that?

23

/81© Burkhard Rost

24

Evolution has it!

.

0

20

40

60

80

100

0 50 100 150 200 250

Perc

enta

ge se

quen

ce id

entit

y

Number of residues aligned

Sequence identityimplies structural

similarity !

Don't know region

C Sander & R Schneider 1991 Proteins 9:56-68B Rost 1999 Prot Engin 12:85-94

/81© Burkhard Rost

Η

Ε

L

>

>

>

pickmaximal

unit=>

currentprediction

J2

inputlayer

first orhidden layer

second oroutput layer

s0 s1 s2J1

:GYIY

DPAVGDPDNGVEP

GTEF:

:GYIY

DPEVGDPTQNIPP

GTKF:

:GYEY

DPAEGDPDNGVKP

GTSF:

:GYEY

DPAEGDPDNGVKP

GTAF:

Alignments

5 . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . 5 . .. . . . . . . 2 . . . . . 3 . . . . . .. . . . . . . . . . . . . . . . . 5 . .

. . . . 5 . . . . . . . . . . . . . . .

. . . 5 . . . . . . . . . . . . . . . .

. . 3 . . . . 2 . . . . . . . . . . . .

. . . . 1 . . 2 . . . 2 . . . . . . . .5 . . . . . . . . . . . . . . . . . . .. . . . 5 . . . . . . . . . . . . . . .. . . 5 . . . . . . . . . . . . . . . .. . . . 4 . 1 . . . . . . . . . . . . .. . . . 1 3 . . . 1 . . . . . . . . . .4 . . . . 1 . . . . . . . . . . . . . .. . . . . . . . . . . 4 . 1 . . . . . .. . . 1 . 1 . 1 2 . . . . . . . . . . .. . . 5 . . . . . . . . . . . . . . . .

5 . . . . . . . . . . . . . . . . . . .. . . . . . 5 . . . . . . . . . . . . .. 1 1 . 1 . . 1 1 . . . . . . . . . . .. . . . . . . . . . . . . . . . . . 5 .

GSAPD NTEKQ CVHIR LMYFW

profile table

:GYIY

DPEDGDPDDGVNP

GTDF:

Protein

corresponds to the the 21*3 bits coding for the profile of one residue

25

PHD: Neural network & evolutionary information

B Rost & C Sander (1993) PNAS 90:7558-62B Rost (1996) Methods Enzymol 266:525-39

/81© Burkhard Rost

26

PROFsec: Evolutionary information + more

B Rost (2001) J Struct Biol 134, 204-18

/81© Burkhard Rost

single residues (1. generation) • Chou-Fasman, GOR 1957-70/80

50-55% accuracy

segments (2. generation) • GORIII 1986-92

55-60% accuracy

problems < 100% may be: 65% max < 40% may be: strand non-local short segments

27

Secondary structure prediction: 1.+2. Generation

Erabutoxin β (3ebx)

residuesiandi+3

/81© Burkhard Rost

Proper comparison of methods

28

/81© Burkhard Rost

Method A=60% Method B=63%

B better?

29

/81© Burkhard Rost

same measure?e.g. both Q3?

30

Method A=60% B=63%, B better?

/81© Burkhard Rost

use same (meaningful) measure e.g. both Q3 same data set

31

Method A=60% B=63%, B better?

/81© Burkhard Rost

use same (meaningful) measure e.g. both Q3 same data set: note both used 100 proteins, and both used random splits to take one half for testing,ok?

32

Method A=60% B=63%, B better?

/81© Burkhard Rost

use same (meaningful) measure e.g. both Q3 same data set: must contain ALL available proteins!

33

Method A=60% B=63%, B better?

/81© Burkhard Rost

use same (meaningful) measure e.g. both Q3 same data set: must contain ALL available proteins! split training/testing: random ok?

34

Method A=60% B=63%, B better?

/81© Burkhard Rost

use same (meaningful) measure e.g. both Q3 same data set: must contain ALL available proteins! split training/testing: must ascertain that there was NO overlap between sets.Overlap defined as, e.g. comparative modeling cannot be applied

35

Method A=60% B=63%, B better?

B Rost 1999 Prot Engin 12, 85-94 C Sander & R Schneider 1991 Proteins 9:56-69

/81© Burkhard Rost

use same (meaningful) measure e.g. both Q3 same data set: must contain ALL available proteins! split training/testing: must ascertain that there was NO overlap between sets. 63-60=3significant?

36

Method A=60% B=63%, B better?

/81© Burkhard Rost

use same (meaningful) measure e.g. both Q3 same data set: must contain ALL available proteins! split training/testing: must ascertain that there was NO overlap between sets. 63-60=3, whether significant or not depends on distribution and number:

37

Method A=60% B=63%, B better?

/81© Burkhard Rost

38

DeltaQ3=3%, 100 proteins->signficant?

0

10

20

30

40

50

60

70

0 10 20 30 40 50 60 70 80 90 100

Num

ber o

f pro

tein

cha

ins

Per-residue accuracy (Q3)

<Q3>=72.3% ; sigma=10.5%

1spf

1bct

1stu

3ifm

1psm

/81© Burkhard Rost

39

DeltaQ3=3% for 100 proteins is significant!

0

10

20

30

40

50

60

70

0 10 20 30 40 50 60 70 80 90 100

Num

ber o

f pro

tein

cha

ins

Per-residue accuracy (Q3)

<Q3>=72.3% ; sigma=10.5%

1spf

1bct

1stu

3ifm

1psm

rule-of-thumb:

Stderror=sigma/sqrt(proteins)

here: StdErr=10.5/sqrt(100)=±1.05

> DeltaQ3=3

-> statistically significant

/81© Burkhard Rost

Method B 20 years older than A, still better?

40

/81© Burkhard Rost

Difference statistically signficant

-> age no difference!

41

/81© Burkhard Rost

Any other test to do? (mind you B is 20

years old)

42

/81© Burkhard Rost

pre-release test: ideally use data added after both

methods had been developed

43

/81© Burkhard Rost

44

Cross-validation: how to

150

0

150

TrainTest

Table 1Nhidden Q315 6230 6445 63

Conclusion:Q3=64%best method has 30 hidden units

/81© Burkhard Rost

45

Cross-validation: how to

150

0

150

TrainTest

Table 1Nhidden Q315 6230 6445 63

Conclusion:Q3=64%best method has 30 hidden units

OK?

/81© Burkhard Rost

46

Cross-validation: need 3 sets!

100

100

100TrainTest

Cross-train

Table 1

Nhid cross-train test

15 62 60

30 64 6145 63 62

Conclusion:Q3=61%best method has 30 hidden units

/00© Burkhard Rost

01: 04/14 Tue: no lecture 02: 04/16 Thu: no lecture 03: 04/21 Tue: Organization of lecture: intro into cells & biology 04: 04/23 Thu: Intro I - acids/structure 05: 04/28 Tue: Intro 2 - domains/3D comparisons 06: 04/30 Thu: Alignment 1 07: 05/05 Tue: Alignment 2 08: 05/07 Thu: Comparative modeling 1 09: 05/12 Tue: SKIP: student assembly (SVV) 10: 05/14 Thu: SKIP: Ascension Day 11: 05/19 Tue: Experimental structure determination/Secondary structure assignment 12: 05/21 Thu: 1D: Secondary structure prediction 1 13: 05/26 Tue: SKIP: Whitsun holiday (05/23-26) 14: 05/28 Thu: 1D: Secondary structure prediction 2 15: 06/02 Tue: 1D: Secondary structure prediction 3 / Transmembrane structure prediction 1 16: 06/04 Thu: SKIP: Corpus Christi 17: 06/09 Tue: 1D: Transmembrane structure prediction 2 18: 06/11 Thu: 1D: Transmembrane structure prediction 3 - Jonas 19: 06/16 Tue: 1D: Solvent accessibility prediction 20: 06/18 Thu: 1D: Disorder prediction 21: 06/23 Tue: 2D prediction 1 22: 06/25 Thu: 2D prediction 2 23: 06/30 Tue: 3D prediction 1 24: 07/02 Thu: Nobel prize symposium 25: 07/07 Tue: wrap up 26: 07/09 Thu: examen, no lecture 27: 07/14 Tue: examen, no lecture 28: 07/16 Thu: examen alternative, no lecture

47

Lecture plan (CB1 structure)