Secondary structure prediction 2 cb1 sec2 · 2018. 5. 29. · structure prediction: combination of...

© Burkhard Rost

title: Secondary structure prediction 2

short title: cb1_sec2

lecture: Computational Biology 1 - Protein structure (for Informatics) - TUM summer semester


Videos: YouTube / www.rostlab.org/talks

THANKS :.

EXERCISES:

Special lectures:
• Mikal Boden UQ Brisbane

No lecture:
• 04/26 Security check Rostlab (exercise WILL take place)
• 05/01 May Day (also no exercise)
• 05/08 Student representation (SVV) - exercise WILL happen
• 05/10 Ascension Day (also no exercise)
• 05/22 Whitsun holiday (also no exercise)
• 05/31 Corpus Christi (also no exercise)
• 06/21 no lecture (but exercise)

LAST lecture: before Jul 12

Exam: Jul 12 18-20:00 (room TBA)
• Makeup: no makeup (sorry due to overload)

Announcements

Dmitrij Nechaev

Your Name

Lothar Richter

Michael Heinzinger

next

CONTACT: [email protected]

© Michael Leunig

http://www.rostlab.org

Recap: protein prediction

Goal of structure prediction

Epstein & Anfinsen, 1961: sequence uniquely determines structure

• INPUT: sequence

• OUTPUT: 3D structure and function

Zones

Figure: Daylight Zone, Twilight Zone, Midnight Zone; comparison methods: sequence - sequence, sequence - profile, profile - profile; sequence similar -> structure similar

B Rost (1997) Fold Des 2:S19-24
B Rost (1999) Protein Eng 12:85-94

Experimental 3D structure for 1 protein: >$100K

PDB = database of proteins of known 3D structure: about 120k in May 2017

structure (PDB id 4lpk): JM Ostrem et al. & KM Shokat (2013) Nature 503:548-51

Comparative modeling predicts 3D structure in silico

pretein seqwence

priteen peqwinse

Query

PDB


Good news: comparative modeling reliably predicts structure for over 40 million proteins

at $100k/protein this translates to: $4 trillion, i.e. $4x10^12: more than the GDP of England and France!


Bad news: for most residues comparative modeling cannot be applied

Notation: protein structure 1D, 2D, 3D

Figure: example sequence PQITLWQRPLVTIKIGGQLKEALLDTGADDTVL shown in 1D (residues with strand segments E), 2D (residue-by-residue map, positions 1-90, scale 0 to -5 kcal/mol) and 3D


L Pauling & RB Corey (1953) PNAS 39:247-252
L Pauling, RB Corey & HR Branson (1951) PNAS 37:205-234
W Kabsch & C Sander (1983) Biopolymers 22:2577-2637

DSSP

Pauling’s H-bond pattern used in DSSP


Science is communication

questions are often the first step

1D: secondary structure prediction

Words

• Secondary structure prediction
• 2ndary structure prediction
• “2D prediction”?



DSSP secondary assignment has 8 “states”

• H = Helix
• G = 3₁₀ helix
• I = Pi helix
• E = Extended (strand)
• B = beta-bridge, single strand residue
• T = Turn, i.e. one turn of helix
• S = bend
• “ “ = loop
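The later slides work with three states (H, E, L) rather than the eight DSSP states. A minimal sketch of one possible reduction follows; the grouping (H, G, I -> H; E, B -> E; everything else -> L) is a common convention assumed here, not one stated on the slide, and `reduce_dssp` is a hypothetical helper name.

```python
# Assumed 8-to-3 state reduction (one common convention, not taken from the slide):
# helices H, G, I -> H; strand-like E, B -> E; turn, bend and loop -> L.
DSSP8_TO_3 = {
    "H": "H", "G": "H", "I": "H",
    "E": "E", "B": "E",
    "T": "L", "S": "L", " ": "L",
}


def reduce_dssp(states8: str) -> str:
    """Map a string of DSSP 8-state symbols to the three states H, E, L."""
    # Unknown symbols are treated as 'other' (L) in this sketch.
    return "".join(DSSP8_TO_3.get(symbol, "L") for symbol in states8)


print(reduce_dssp("HHHGG EEB TS"))
```

Other groupings are in use (for example keeping only H as helix), and the choice changes the measured accuracy, so it should be reported with any result.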


�17

Local sequence determines secondary structure

LEDKSPDHNPTGID

AKGKPMDRNFTGRNHPPKDSS

AAQVKDALTK

LEQWGTLAQLRAIWEQELTDFPEFLTMMARQETWLGWLTI

helix strand

loop

LAVIGVLMKW

FVFLMIEKIYHKLT

DIRVGLTYYIAQ

VNTFVGTFAAVAHAL

W Kabsch & C Sander (1985) Identical pentapeptides with different backbones. Nature 317:207


�18

How can identical penta-peptides occur in 2 states?

DSSP

Helix is local, sheet is not

HELIX (H): H-bond between residues i <-> i+4 (figure label: residues i and i+3)

SHEET (E), with 3 strands, Erabutoxin β (3ebx): H-bond between residues i <-> i±j, j ∈ [4, L-4]


Simple method to predict sec str

• take known structures
• find longest consecutive runs of motifs that occur ONLY in one of the three states: H (helix), E (strand), O (other)


First actual prediction method was much simpler

Simple prediction: frequency

First step (Szent-Györgyi)
• Proline breaks a helix (proline bends main chain)
• Helices span several turns, i.e. >4 residues
-> identify helices/non-helices

from Proline to odds for all ....


....,....1....,....2....
QEKSPREVTMKKGDILTLLNSTNK
E..E EEEEEE

AA D E G I K L M N P Q R S T V

E 1 1 3 1 1 1

L 1 1 1 4 1 1 1 1 2 1
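The count table above is the raw material for single-residue odds: count how often each amino acid is observed in each state, then normalise. The sketch below illustrates that idea under stated assumptions; the function names and the two toy sequence/state pairs are hypothetical and are not data from the lecture.

```python
from collections import Counter, defaultdict


def residue_state_probabilities(pairs):
    """Estimate p(state | amino acid) from (sequence, observed states) pairs."""
    counts = defaultdict(Counter)
    for seq, obs in pairs:
        if len(seq) != len(obs):
            raise ValueError("sequence and state string must have equal length")
        for aa, state in zip(seq, obs):
            counts[aa][state] += 1
    # Normalise the counts of every amino acid to probabilities.
    return {
        aa: {state: n / sum(c.values()) for state, n in c.items()}
        for aa, c in counts.items()
    }


def predict_single_residue(seq, probs, default="L"):
    """Predict each residue independently from its most probable state."""
    prediction = []
    for aa in seq:
        p = probs.get(aa)
        prediction.append(max(p, key=p.get) if p else default)
    return "".join(prediction)


# Toy training data (hypothetical, for illustration only).
toy = [("AAPGA", "HHLLH"), ("VVAPV", "EEHLE")]
probs = residue_state_probabilities(toy)
print(predict_single_residue("AVPG", probs))
```

Because every residue is scored on its own, neighbouring residues cannot influence each other, which is the limitation the second generation of methods addresses.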


Secondary structure prediction methods

single residues (1. generation)
• Chou-Fasman, GOR 1957-70/80

• Robson B & Pain RH (1971) Analysis of the Code Relating Sequence to Conformation in Proteins: Possible Implications for the Mechanism of Formation of Helical Regions. J. Mol. Biol. 58:237-259.
• Chou PY & Fasman GD (1974) Prediction of protein conformation. Biochemistry 13:211-215.
• Garnier J, Osguthorpe DJ and Robson B (1978) Analysis of the accuracy and Implications of simple methods for predicting the secondary structure of globular proteins. J. Mol. Biol. 120:97-120.

Sec struc pred: 1st gen

1st generation (1957-1978): single residue odds, e.g. Chou-Fasman / GOR

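$p(\mathrm{SEC} \mid \mathrm{AA}_i)$ = probability for observing secondary structure state SEC for amino acid AA at position i

$$p(\mathrm{SEC} \mid \mathrm{AA}_i) = p(\mathrm{SEC} \mid \mathrm{AA}_j) \quad \forall\, i \wedge j$$

Figure labels: V32 V36 V51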

How to assess performance? Problem 1: where to get secondary structure from?

DSSP

PDB = Protein Data Bank

• Resource with 3D-coordinates of proteins (RNA & DNA)
• www.rcsb.org, e.g. “Molecule of the Month”
• 2016/05: over 120,000 molecules

Figure: number of structures in PDB (log scale, 1 to 1,000,000) versus year (1975-2020)

http://www.rcsb.org


Prediction method

• find unique subset from proteins of known structure (PDB)
• convert 3D to 1D (secondary structure) with DSSP

How to assess performance? Problem 2: how to measure?

Assessing performance of secondary structure prediction

1.,.,.,.,.10,.,.,.,.20,.,.,.,.30,.,.,.,.40,.,.,.,.50
obs EEEE E EEEEEE EEEEEE EEEEEEEEEEE
prd EEHHH EEEE EE HHEE EEEHHH

obs=observed, prd=predicted; H: helix, E: strand, ‘ ‘: other

• Q3: three-state per-residue accuracy

$$Q_3 = \frac{\text{number of correctly predicted residues in states helix, strand, other}}{\text{number of residues in protein}}$$

Schulz GE & Schirmer RH (1979) Prediction of secondary structure from the amino acid sequence. In: Principles of protein structure. Berlin: Springer-Verlag, pp 108-130.
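The Q3 definition translates directly into a few lines of code. The sketch below reports Q3 as a percentage, matching how accuracies are quoted in this lecture; the two state strings are made-up examples (with L for "other"), not the ones shown on the slide.

```python
def q3(observed: str, predicted: str) -> float:
    """Three-state per-residue accuracy in percent."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("need two non-empty state strings of equal length")
    # Count residues whose predicted state equals the observed state.
    correct = sum(o == p for o, p in zip(observed, predicted))
    return 100.0 * correct / len(observed)


obs = "LEEEELLHHHHL"  # hypothetical observed states
prd = "LEEHHLLHHHLL"  # hypothetical predicted states
print(q3(obs, prd))
```

Q3 counts every residue equally, so it says nothing about whether whole segments are predicted with sensible lengths.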

�36

Secondary structure prediction accuracy



• published: 63% accuracy
• assessed in 1994: 50-55% accuracy

2nd generation: how would you improve?

Segments instead of isolated residues


V32 V36 V51


single residues (1. generation)
• Chou-Fasman, GOR 1957-70/80: 50-55% accuracy (Q3)

segments (2. generation)
• GORIII 1986-92: 55-60% accuracy (Q3)

• Gibrat J-F, Garnier J and Robson B (1987) Further developments of protein secondary structure prediction using information theory. New parameters and consideration of residue pairs. J. Mol. Biol. 198:425-443.

• Biou V, Gibrat JF, Levin JM, Robson B and Garnier J (1988) Secondary structure prediction: combination of three different methods. Prot. Engin. 2:185-191.

• Garnier J & Robson B (1989) The GOR method for predicting secondary structure in proteins. In: Fasman GD (ed). Prediction of protein structure and the principles of protein conformation. New York: Plenum Press, pp 417-465.

Secondary structure prediction: 1.+2. Generation


1st generation (1957-1978): single residue odds, e.g. Chou-Fasman/GOR

2nd generation (1983-1992): odds for windows, e.g. GORIII


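$p_1(\mathrm{SEC}_i \mid \mathrm{AA}_i)$ = probability for observing secondary structure state SEC for amino acid AA at position i

$p(\mathrm{SEC} \mid \mathrm{AA}_i)$ = probability for observing secondary structure state SEC for amino acid AA at position i, summed over a window of w residues on either side:

$$p(\mathrm{SEC} \mid \mathrm{AA}_i) = \sum_{j=i-w}^{i+w} p_1(\mathrm{SEC}_j \mid \mathrm{AA}_j), \qquad w = 3$$

Figure labels: V32 V36 V51

single residues (1. generation): 50-55% accuracy

segments (2. generation)
• GORIII 1986-92: 55-60% accuracy

problems
• < 100% they said: 65% max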
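The window sum with w=3 can be sketched as follows. This is a simplified illustration of "odds for windows", not the GORIII method itself; the table `p1` holds invented single-residue values and both function names are hypothetical.

```python
def window_scores(seq, p1, w=3, states="HEL"):
    """For each position i, sum the single-residue terms over j = i-w .. i+w."""
    scores = []
    for i in range(len(seq)):
        lo, hi = max(0, i - w), min(len(seq), i + w + 1)  # clip at the termini
        scores.append({
            s: sum(p1.get(seq[j], {}).get(s, 0.0) for j in range(lo, hi))
            for s in states
        })
    return scores


def predict_window(seq, p1, w=3, states="HEL"):
    """Pick the state with the largest window score at every position."""
    return "".join(max(states, key=sc.get) for sc in window_scores(seq, p1, w, states))


# Invented single-residue values for three amino acids (illustration only).
p1 = {
    "A": {"H": 0.6, "E": 0.1, "L": 0.3},
    "V": {"H": 0.2, "E": 0.6, "L": 0.2},
    "G": {"H": 0.1, "E": 0.2, "L": 0.7},
}
print(predict_window("AAAAGAAAA", p1, w=3))
```

In this toy case the isolated G, which on its own favours L, is out-voted by its neighbours, so the helix is not interrupted; that is the practical effect of moving from single residues to segments.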

�43



�44

Helix formation is local

Figure: residues i and i+3; THYROID hormone receptor (2nll)

β-sheet formation is NOT local

single residues (1. generation): 50-55% accuracy

segments (2. generation): 55-60% accuracy

problems
• < 100% may be: 65% max
• < 40% may be: strand non-local
• short segments

B Rost and C Sander (2000) Methods in Molecular Biology 143: 71-95

Problems of secondary structure predictions (before 1994)

SEQ KELVLALYDYQEKSPREVTMKKGDILTLLNSTNKDWWKVEVNDRQGFVPAAYVKKLD
OBS EEEE E E E EEEEEE EEEEEE EEEEEEHHHEEEE
TYP EHHHH EE EEEE EE HHHEE EEEHH

obs EEEE E E E EEEEEE EEEEEE EEEEEEEEEEE
prd EEHHH EE EEEE EE HHEE EEEHHH

INSERT: concept of neural networks

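Simple neural network

Figure: two input units (in1, in2) connected to one output unit by junctions J11 and J12

$$\mathrm{out}_0 = \mathrm{in}_1 J_{11} + \mathrm{in}_2 J_{12}$$

$$\mathrm{out} = \tanh(\mathrm{out}_0)$$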


�51

Training a neural network

$$\mathrm{Error} = (\mathrm{out}_{\mathrm{net}} - \mathrm{out}_{\mathrm{want}})^2$$

Figure: out versus in (axes from -2 to 2); error as a function of the junctions, shown for several example settings of inputs and junctions

Neural networks classify points


�56

Simple neural network with hidden layer

$$\mathrm{out}_i = f\left(\sum_j J^{2}_{ij} \cdot f\left(\sum_k J^{1}_{jk} \cdot \mathrm{in}_k\right)\right)$$
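The nested sums of a network with one hidden layer can be written out as a short forward pass. This is a minimal sketch: tanh is assumed as the trigger function f (as in the single-unit example earlier), and the junction values in `J1` and `J2` are arbitrary numbers chosen for illustration.

```python
import math


def forward(inputs, J1, J2, f=math.tanh):
    """out_i = f(sum_j J2[i][j] * f(sum_k J1[j][k] * in_k))."""
    # First layer: every hidden unit j sums input k times junction J1[j][k].
    hidden = [
        f(sum(J1[j][k] * inputs[k] for k in range(len(inputs))))
        for j in range(len(J1))
    ]
    # Second layer: every output unit i sums hidden unit j times junction J2[i][j].
    return [
        f(sum(J2[i][j] * hidden[j] for j in range(len(hidden))))
        for i in range(len(J2))
    ]


J1 = [[1.0, -1.0], [0.5, 0.5]]  # 2 hidden units x 2 inputs (arbitrary values)
J2 = [[1.0, 1.0]]               # 1 output unit x 2 hidden units
out = forward([1.0, 1.0], J1, J2)
print(out)
```

Plain lists are used instead of a matrix library to keep the index order identical to the formula.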


�57

Principles of networks: input -> output

two steps:
1. linear: sum over all input × connection, i.e. Σ_j connection_ij * input_j
2. non-linear: sigmoid trigger, i.e., project sum onto 0-1

Figure: input = 3 adjacent residues in protein sequence (e.g. from :ACACC:, units s1 s2 s3); output = secondary structure state of central residue (α or L); output from unit (0 to 1.0) versus input to unit (= sum), with decision line J; result: sum < decision line

Principles of neural networks: error

• output: $\mathrm{out}_i = \sum_{j=1}^{N_{\mathrm{in}}+1} J_{ij}\, \mathrm{in}_j$

in_j value of input unit j; out_i value of output unit i; J_ij connection between input unit j and output unit i

• error: $E = \sum_{i=1}^{N_{\mathrm{out}}} (\mathrm{out}_i - \mathrm{des}_i)^2$

out_i value of output unit i; des_i secondary structure state observed for central amino acid for output unit i (e.g. for a helix: des_1=1, des_2=0, des_3=0)

• free variables: connections { J }

• goal: representation of set of examples (training set) for which the mapping input->output is known, i.e., the secondary structure state of the central residue has been observed


Principles of neural networks: training

training = change of connections {J} such that E decreases

simplest procedure:
• gradient descent

$$\Delta J_{ij}(t+1) = -\varepsilon\, \frac{\partial E(t)}{\partial J_{ij}(t)} + \alpha\, \Delta J_{ij}(t-1)$$

where ∂E/∂J is the derivative of the error with respect to the network connection; t is the algorithmic time given by the presentation of one example; ε determines the step width of the change (learning strength, typically some 0.01); α gives the contribution of the momentum term (∆J(t-1), typically some 0.2), which permits uphill moves

Figure: error as a function of the connections { J }
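Gradient descent with a momentum term can be tried on the smallest possible case: one connection J, one input, and the squared error from the previous slide. The sketch below uses the typical values quoted above (ε = 0.01, α = 0.2); the toy examples and the function name `train` are assumptions for illustration, and the momentum term is taken to be the change applied in the preceding step.

```python
def train(examples, epsilon=0.01, alpha=0.2, steps=2000):
    """Fit out = J * inp by descending E = (out - des)**2, one example per step."""
    J, previous_change = 0.0, 0.0
    for t in range(steps):
        inp, des = examples[t % len(examples)]
        gradient = 2.0 * (J * inp - des) * inp  # dE/dJ for this example
        # New change: step against the gradient plus a share of the last change.
        change = -epsilon * gradient + alpha * previous_change
        J += change
        previous_change = change
    return J


# Toy examples (hypothetical): the desired output is always 2 * input.
examples = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
J = train(examples)
print(J)
```

Here the examples are cycled in a fixed order only to keep the run reproducible; in practice they are usually presented in random order.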


�60

Effect of over-training: theory

Figure: two curves (scale 0-100) versus training time, with the over-train region marked; toy problems

what were those two curves?
• training set
• cross-training / testing (validation set)

Sketch of simplified cross-validation

Figure: data split into TRAIN, cross-TRAIN and TEST


�67

Effect of over-training: practice

Figure: number of correct classifications per example (0.55-0.95) versus number of training cycles (0-25); curves: ratio for training set and ratio for testing set; shown next to the sketch for toy problems (0-100 versus training time)

© Burkhard Rost

RETURN: secondary structure prediction

Secondary structure predictions of 1. and 2. generation

• 1. generation: 50-55% accuracy
• 2. generation: 55-60% accuracy

problems
• < 40% they said: strand non-local
• short segments

Neural network for secondary structure

Figure: input = window of adjacent residues with their states, D (L), R (E), Q (E), G (E), F (E), V (E), P (E), A (H), A (H), Y (H), V (E), K (E), K (E); each residue is coded over the alphabet ACDEFGHIKLMNPQRSTVWY plus spacer (.); output units H, E, L

B Rost (1996) Methods in Enzymology 266: 525-39
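A window of residues has to be turned into numbers before it can be fed to a network. The sketch below assumes the simplest coding consistent with the alphabet shown (20 amino acids plus the spacer "."): one unit per symbol, set to 1 for the residue present. The helper name `encode_window` and the window half-width of 6 (13 residues, as in the figure) are assumptions.

```python
ALPHABET = "ACDEFGHIKLMNPQRSTVWY."  # 20 amino acids plus spacer


def encode_window(seq, centre, w=6):
    """Return a flat 0/1 list of (2*w + 1) * 21 input units for one window."""
    units = []
    for j in range(centre - w, centre + w + 1):
        # Positions beyond the termini are coded with the spacer symbol.
        aa = seq[j] if 0 <= j < len(seq) else "."
        units.extend(1 if aa == symbol else 0 for symbol in ALPHABET)
    return units


# The 13 residues from the figure; the central residue is P (index 6).
x = encode_window("DRQGFVPAAYVKK", centre=6, w=6)
print(len(x), sum(x))
```

Symbols outside the alphabet (for example X) would give 21 zeros in this sketch; a real implementation needs an explicit policy for them.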


NN predicts secondary structure

Table columns: method, overall accuracy, helix, strand, other

• neural network (unbalanced): overall accuracy 62%

... and developer believes that application of machine learning is all the intelligence he will ever need...


�74

NN sec str: training dynamics

Figure: performance (0-1) for Other, Strand, Helix versus time (1-10; 1 step = 20,000 training samples); neural network, unbalanced: 62%

$$E^{\mu} = \sum_i \left(o_i^{\mu} - d_i^{\mu}\right)^2$$

$$\Delta J^{\mu} \propto -\frac{\partial E^{\mu}\{J\}}{\partial J}$$

�75


Figure (full pie: all correctly predicted residues): neural network, unbalanced 62%; comparison: data bank distribution; comparison: 33:33:33

normal training: $E^{\mu} = \sum_i \left(o_i^{\mu} - d_i^{\mu}\right)^2$

�78

Balanced training

normal training: $E^{\mu} = \sum_i \left(o_i^{\mu} - d_i^{\mu}\right)^2$

balanced training: $E = \sum_{\mu=\alpha,\beta,L} \sum_i \left(o_i^{\mu} - d_i^{\mu}\right)^2$

Balanced training: dynamics

Figure: performance (0-1) for Other, Strand, Helix versus time (1-10), unbalanced versus balanced training

comparison: 33:33:33; balanced: 60%
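In the balanced error the sum runs over one example from each class (α, β, L) at every step, regardless of how common the classes are in the data bank. A minimal sketch of such a sampling scheme follows; the generator name, the toy examples and the fixed seed are assumptions made for illustration.

```python
import random


def balanced_batches(examples, n_batches, classes="HEL", seed=0):
    """Yield batches holding one randomly chosen example per class."""
    rng = random.Random(seed)
    # Group the (window, state) examples by their observed state.
    by_class = {c: [e for e in examples if e[1] == c] for c in classes}
    for c, pool in by_class.items():
        if not pool:
            raise ValueError("no training example for class " + c)
    for _ in range(n_batches):
        yield [rng.choice(by_class[c]) for c in classes]


# Toy data (hypothetical): 'other' is far more frequent than helix or strand.
examples = [("w%d" % i, "L") for i in range(6)]
examples += [("w6", "H"), ("w7", "H"), ("w8", "E")]
batches = list(balanced_batches(examples, n_batches=4))
print(batches[0])
```

Rare classes are presented repeatedly under this scheme, which is the price for seeing helix, strand and other equally often.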

�81



Neural networks DO improve if developer does something more than dream the machine learning dream...

β-sheet formation is NOT local

Conclusion: not all sound explanations are right!

Remaining problem of 1. generation (50-55% accuracy) and 2. generation (55-60% accuracy):
• short segments



�87

Bad segment prediction

HHHHHHHHHEEEEE

HHHHEEE

HHHHHHHEEEEE

1st level

2nd level

comparison:observed:

SEQ KELVLALYDYQEKSPREVTMKKGDILTLLNSTNKDWWKVEVNDRQGFVPAAYVKKLD
OBS EEEE E E E EEEEEE EEEEEE EEEEEEHHHEEEE
TYP EHHHH EE EEEE EE HHHEE EEEHH

Select samples at random

Training presents one example per algorithmic time step t; the examples are selected at random.
�89

Local correlations in reality

Figure: residues i and i+3

How to get those into the prediction?

PHDsec: structure-to-structure network

Figure: second network with the states of adjacent residues as input, e.g. V (E), P (E), A (H); output units H, E, L

B Rost (1996) Methods Enzymol 266:525-39


�92

Better segment prediction

HHHHHHHHHEEEEE

HHHHEEE

HHHHHHHEEEEE

1st level

2nd level

comparison:observed:


Better prediction of segment lengths

Figure A: number of segments (0-1200) versus segment length (0-50; inset for lengths 25-50) for DSSP and PHD

Figure B: difference in number of observed - predicted segments (-800 to 800) versus segment length (0-10) for helix, strand, loop
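Segment-length distributions like the ones compared here can be computed from any state string by collecting runs of identical states. The sketch below uses `itertools.groupby` for that; the two example strings are invented and only illustrate how a prediction with the right residues can still have segments that are too short.

```python
from collections import Counter
from itertools import groupby


def segment_lengths(states):
    """Map each state to a Counter {segment length: number of segments}."""
    histogram = {}
    for state, run in groupby(states):
        histogram.setdefault(state, Counter())[len(list(run))] += 1
    return histogram


observed = "LLHHHHHHHHHLLEEEEELL"   # hypothetical observed states
predicted = "LLHHHLHHHLLLLEELEELL"  # hypothetical prediction, broken segments
print(segment_lengths(observed)["H"], segment_lengths(predicted)["H"])
```

Subtracting the two histograms per state gives the kind of observed minus predicted difference plotted in panel B.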


N Qian & TJ Sejnowski (1988) Predicting the secondary structure of globular proteins using neural network models. J. Mol. Biol. 202:865-884.

�94

Structure-to-structure network: Invented?

PHDsec 1993: structure-to-structure network

Figure: input states of adjacent residues, e.g. V (E), P (E), A (H); output units H, E, L

Other ideas

More output units, e.g. instead of central residue: take central 3
1. 9 output units
2. average output -> 3 units

output back into neural networks:
• Gianluca Pollastri, Dariusz Przybylski, B Rost and Pierre Baldi (2002) Improving the prediction of protein secondary structure in three and eight classes using recurrent neural networks and profiles. Proteins: Structure, Function, and Bioinformatics 47:228-235 (Fig. 1).
• idea: P Frasconi & M Gori (1996) IEEE Trans Neural Netw 7:1521-5


STILL ONLY 60+ε% accuracy.

How to improve beyond that?

�97


�98

How to get more data into it?

?


�99

Evolution has it!

Figure: percentage sequence identity (0-100) versus number of residues aligned (0-250); regions: "Sequence identity implies structural similarity!" and "Don't know region"

C Sander & R Schneider 1991 Proteins 9:56-68
B Rost 1999 Prot Engin 12:85-94


SH3 (Src-homology 3) domain: one domain of proteins such as Src tyrosine kinase (STK)

           1                                                   50
fyn_human  VTLFVALYDY EARTEDDLSF HKGEKFQILN SSEGDWWEAR SLTTGETGYI
yrk_chick  VTLFIALYDY EARTEDDLSF QKGEKFHIIN NTEGDWWEAR SLSSGATGYI
fgr_human  VTLFIALYDY EARTEDDLTF TKGEKFHILN NTEGDWWEAR SLSSGKTGCI
yes_chick  VTVFVALYDY EARTTDDLSF KKGERFQIIN NTEGDWWEAR SIATGKTGYI
src_avis2  VTTFVALYDY ESRTETDLSF KKGERLQIVN NTEGDWWLAH SLTTGQTGYI
src_aviss  VTTFVALYDY ESRTETDLSF KKGERLQIVN NTEGDWWLAH SLTTGQTGYI
src_avisr  VTTFVALYDY ESRTETDLSF KKGERLQIVN NTEGDWWLAH SLTTGQTGYI
src_chick  VTTFVALYDY ESRTETDLSF KKGERLQIVN NTEGDWWLAH SLTTGQTGYI
stk_hydat  VTIFVALYDY EARISEDLSF KKGERLQIIN TADGDWWYAR SLITNSEGYI
src_rsvpa  .......... ESRIETDLSF KKRERLQIVN NTEGTWWLAH SLTTGQTGYI
hck_human  ..IVVALYDY EAIHHEDLSF QKGDQMVVLE ES.GEWWKAR SLATRKEGYI
blk_mouse  ..FVVALFDY AAVNDRDLQV LKGEKLQVLR .STGDWWLAR SLVTGREGYV
hck_mouse  .TIVVALYDY EAIHREDLSF QKGDQMVVLE .EAGEWWKAR SLATKKEGYI
lyn_human  ..IVVALYPY DGIHPDDLSF KKGEKMKVLE .EHGEWWKAK SLLTKKEGFI
lck_human  ..LVIALHSY EPSHDGDLGF EKGEQLRILE QS.GEWWKAQ SLTTGQEGFI
ss81_yeast .....ALYPY DADDDdeISF EQNEILQVSD .IEGRWWKAR R.ANGETGII
abl_mouse  ..LFVALYDF VASGDNTLSI TKGEKLRVLG YnnGEWCEAQ ..TKNGQGWV
abl1_human ..LFVALYDF VASGDNTLSI TKGEKLRVLG YnnGEWCEAQ ..TKNGQGWV
src1_drome ..VVVSLYDY KSRDESDLSF MKGDRMEVID DTESDWWRVV NLTTRQEGLI
mysd_dicdi .....ALYDF DAESSMELSF KEGDILTVLD QSSGDWWDAE L..KGRRGKV
yfj4_yeast ....VALYSF AGEESGDLPF RKGDVITILK ksQNDWWTGR V..NGREGIF
abl2_human ..LFVALYDF VASGDNTLSI TKGEKLRVLG YNQNGEWSEV RSKNG.QGWV
tec_human  .EIVVAMYDF QAAEGHDLRL ERGQEYLILE KNDVHWWRAR D.KYGNEGYI
abl1_caeel ..LFVALYDF HGVGEEQLSL RKGDQVRILG YNKNNEWCEA RlrLGEIGWV
txk_human  .....ALYDF LPREPCNLAL RRAEEYLILE KYNPHWWKAR D.RLGNEGLI
yha2_yeast VRRVRALYDL TTNEPDELSF RKGDVITVLE QVYRDWWKGA L..RGNMGIF
abp1_sacex .....AEYDY EAGEDNELTF AENDKIINIE FVDDDWWLGE LETTGQKGLF

B Rost (1996) Methods Enzymol 266:525-39
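An alignment like this one can be condensed into a profile by counting, column by column, how often each amino acid occurs. The sketch below does this for the first ten columns of five rows of the SH3 alignment; skipping gap symbols and ignoring letter case are assumptions of this sketch, and `column_profile` is a hypothetical helper name.

```python
from collections import Counter


def column_profile(alignment, gap_chars=".-"):
    """Return, per alignment column, the frequency of each amino acid."""
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for column in zip(*alignment):
        # Gaps are skipped; lower-case letters are counted as upper case.
        counts = Counter(aa.upper() for aa in column if aa not in gap_chars)
        total = sum(counts.values())
        profile.append({aa: n / total for aa, n in counts.items()} if total else {})
    return profile


alignment = [
    "VTLFVALYDY",  # fyn_human
    "VTLFIALYDY",  # yrk_chick
    "VTLFIALYDY",  # fgr_human
    "VTVFVALYDY",  # yes_chick
    "..IVVALYDY",  # hck_human
]
profile = column_profile(alignment)
print(profile[2], profile[4])
```

Fully conserved columns (such as the final ALYDY stretch) end up with a single entry of 1.0, while variable columns spread their weight over several amino acids.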

�101


�102

Evolution improves prediction

Evolutionary profile implicitly captures history of an individual protein!

Figure: fly, chicken, rat, mouse, human

PHD: Neural network & evolutionary information

Alignments (protein first, then aligned sequences):

:GYIY DPEDGDPDDGVNP GTDF:
:GYIY DPAVGDPDNGVEP GTEF:
:GYIY DPEVGDPTQNIPP GTKF:
:GYEY DPAEGDPDNGVKP GTSF:
:GYEY DPAEGDPDNGVKP GTAF:

profile table (counts per position; columns GSAPD NTEKQ CVHIR LMYFW):

pos res  G S A P D  N T E K Q  C V H I R  L M Y F W
 1  G    5 . . . .  . . . . .  . . . . .  . . . . .
 2  Y    . . . . .  . . . . .  . . . . .  . . 5 . .
 3  I    . . . . .  . . 2 . .  . . . 3 .  . . . . .
 4  Y    . . . . .  . . . . .  . . . . .  . . 5 . .
 5  D    . . . . 5  . . . . .  . . . . .  . . . . .
 6  P    . . . 5 .  . . . . .  . . . . .  . . . . .
 7  E    . . 3 . .  . . 2 . .  . . . . .  . . . . .
 8  D    . . . . 1  . . 2 . .  . 2 . . .  . . . . .
 9  G    5 . . . .  . . . . .  . . . . .  . . . . .
10  D    . . . . 5  . . . . .  . . . . .  . . . . .
11  P    . . . 5 .  . . . . .  . . . . .  . . . . .
12  D    . . . . 4  . 1 . . .  . . . . .  . . . . .
13  D    . . . . 1  3 . . . 1  . . . . .  . . . . .
14  G    4 . . . .  1 . . . .  . . . . .  . . . . .
15  V    . . . . .  . . . . .  . 4 . 1 .  . . . . .
16  N    . . . 1 .  1 . 1 2 .  . . . . .  . . . . .
17  P    . . . 5 .  . . . . .  . . . . .  . . . . .
18  G    5 . . . .  . . . . .  . . . . .  . . . . .
19  T    . . . . .  . 5 . . .  . . . . .  . . . . .
20  D    . 1 1 . 1  . . 1 1 .  . . . . .  . . . . .
21  F    . . . . .  . . . . .  . . . . .  . . . 5 .

Figure: profile table -> input layer (s0) -> J1 -> first or hidden layer (s1) -> J2 -> second or output layer (s2; units H, E, L) -> pick maximal unit => current prediction; one block of the input layer corresponds to the 21*3 bits coding for the profile of one residue

B Rost & C Sander (1993) PNAS 90:7558-62
B Rost (1996) Methods Enzymol 266:525-39

Same idea as for regular secondary structure

Figure: the network for a single sequence, D (L), R (E), Q (E), G (E), F (E), V (E), P (E), A (H), A (H), Y (H), V (E), K (E), K (E), next to the same network with the alignment (profile table) as input

From sequence to profile

Figure, steps 1-3: (1) BLAST search of the sequence data bank (protein A, protein B, ..., protein N); (2) MaxHom alignment and filter on sequence identity versus number of residues aligned (axis marks: 25%, 100%; 80 residues aligned), leaving protein A, protein C, ..., protein M; (3) extract alignment -> PHD



PHDsec: more details

first level: sequence-to-structure
second level: structure-to-structure

input local in sequence: local alignment, 13 adjacent residues, e.g. :::AAAAA.LLLLIIAAGCCSGVV:::

A    C    L    I    G    S    V    ins  del  cons
100  0    0    0    0    0    0    0    0    1.17
100  0    0    0    0    0    0    33   0    0.42
0    0    100  0    0    0    0    0    33   0.92
0    0    33   66   0    0    0    0    0    0.74
66   0    0    0    33   0    0    0    0    1.17
0    66   0    0    0    33   0    0    0    0.74
0    0    0    33   0    0    66   0    0    0.48

input global in sequence: global statistics of whole protein (20+4+4+4 units)
• %AA: percentage of each amino acid in protein
• Length: length of protein (≤60, ≤120, ≤240, >240)
• ∆ N-term: distance: centre, N-term (≤40, ≤30, ≤20, ≤10)
• ∆ C-term: distance: centre, C-term (≤40, ≤30, ≤20, ≤10)

Figure: input layer (21+3 units per residue at the first level, 4+1 at the second level) -> hidden layer -> output layer (H, L, E), e.g. H = 0.5, L = 0.1, E = 0.4



�107

Jury

Figure: single network vs. jury decision; architecture 1, architecture 2, architecture 3, architecture 4

centre of mass = jury over 1-4
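A jury decision over several networks can be sketched as an average over their output vectors followed by picking the maximal unit. Reading "centre of mass" as a plain arithmetic mean is an assumption of this sketch, and the four output vectors below are invented values for four hypothetical architectures.

```python
def jury(outputs, states="HEL"):
    """Average the output vectors of several networks and pick the maximal unit."""
    n = len(outputs)
    mean = [sum(o[i] for o in outputs) / n for i in range(len(states))]
    best = max(range(len(states)), key=mean.__getitem__)
    return states[best], mean


# Invented outputs (H, E, L) of four networks for one residue.
nets = [
    [0.5, 0.4, 0.1],
    [0.3, 0.6, 0.1],
    [0.2, 0.7, 0.1],
    [0.6, 0.3, 0.1],
]
state, mean = jury(nets)
print(state, mean)
```

In this example two networks vote for helix and two for strand; the average still resolves the tie because the strand votes are stronger.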


�108

PROFsec: Evolutionary information + more

B Rost (2001) J Struct Biol 134, 204-18


Spectrin homology domain (SH3)

HEADER CYTOSKELETON
COMPND ALPHA SPECTRIN (SH3 DOMAIN)
SOURCE CHICKEN (GALLUS GALLUS) BRAIN
AUTHOR M.NOBLE,R.PAUPTIT,A.MUSACCHIO,M.SARASTE

Figure labels: 59%, 65%, 72%


�110

Prediction accuracy varies!

Figure: number of protein chains (0-70) versus per-residue accuracy (Q3, 0-100); <Q3>=72.3%; sigma=10.5%; labelled chains: 1spf, 1bct, 1stu, 3ifm, 1psm


�111

Stronger predictions more accurate!

Figure: Q3 per protein (0-100) versus reliability index averaged over protein (3-9); fit: Q3fit = 21 + 8.7 * Q3

Example outputs of the network: H=0.5, E=0.4, L=0.1 versus H=0.8, E=0.1, L=0.1

Figure: number of protein chains versus per-residue accuracy (Q3); <Q3>=72.3%; sigma=10.5%; labelled chains: 1spf, 1bct, 1stu, 3ifm, 1psm
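The two example outputs (H=0.5, E=0.4, L=0.1 and H=0.8, E=0.1, L=0.1) predict the same state with very different strength. One way to turn that strength into a reliability index from 0 to 9 is to scale the difference between the largest and the second largest output unit; this particular formula is an assumption of the sketch below, since the slide does not spell it out.

```python
def reliability_index(outputs):
    """Assumed definition: RI = int(10 * (largest - second largest)), capped at 9."""
    top, second = sorted(outputs, reverse=True)[:2]
    # The small offset guards against floating-point results such as 0.9999999.
    return min(9, int(10 * (top - second) + 1e-9))


print(reliability_index([0.5, 0.4, 0.1]))  # weak prediction
print(reliability_index([0.8, 0.1, 0.1]))  # strong prediction
```

Averaging such an index over all residues of a protein gives a per-protein value of the kind plotted on the horizontal axis of the figure.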


�112

Correct prediction of correctly predicted residues.

Figure: overall per-residue accuracy (70-100) versus percentage of residues predicted (0-100) for PHDsec, PHDacc and PHDhtm; curves run from RI=9 to RI=0 (further marks: RI=4, RI=7)


�113

False prediction for engineered proteins!

GB1: IgG-binding domain of protein G (CHAMELEON) Kim & Berg, Nature, 366, 267-270, 1993

....,....1....,....2....,....3....,....4....,....5....,..
AA TTYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDDATKTFTVTEK
DSSP EEEEEEE EEEEEEEEE HHHHHHHHHHHHHHHHH EEEEEEE EEEEEEEE
PHD 30 EEEEEE E EEHHHHHHHHHHHHHHEEE EEEEEE EEEEE
PHD no EEEEEE EEEEEHHHHHHHHHHHHHHHH EEEEE EEEEEE

AATAEKVFKQY -> AWTVEKAFKTF
PHD 30 EEEEEE EEEEEEE HHHHHHHHHEEE EEEE EEEEEE
PHD no EEEEEE EEEEEEHHHHHHHHHHHHHHH EEEEE EEEEEE

EWTYDDATKTF -> AWTVEKAFKTF
PHD 30 EEEEEE EEE EHHHHHHHHHHHHHHHH EEEEE EEEEEE
PHD no EEEEEE E E EHHHHHHHHHHHHHHHH HHHHHHH EEEEE

AWTVEKAFKTF HHHHH


Method A=60%, Method B=63%: B better?

• use same (meaningful) measure, e.g. both Q3
• same data set: must contain ALL available proteins! (note: both used 100 proteins, and both used random splits to take one half for testing: ok?)
• split training/testing: random ok? Must ascertain that there was NO overlap between sets. No overlap defined as, e.g., comparative modeling cannot be applied between proteins of the two sets
• 63-60=3: significant? Whether significant or not depends on distribution and number

B Rost 1999 Prot Engin 12, 85-94
C Sander & R Schneider 1991 Proteins 9:56-69

DeltaQ3=3%, 100 proteins -> significant?

Figure: number of protein chains (0-70) versus per-residue accuracy (Q3, 0-100); <Q3>=72.3%; sigma=10.5%; labelled chains: 1spf, 1bct, 1stu, 3ifm, 1psm

DeltaQ3=3% for 100 proteins is significant!

rule-of-thumb: StdErr = sigma/sqrt(number of proteins)

here: StdErr = 10.5/sqrt(100) = ±1.05, and DeltaQ3 = 3 > StdErr

-> statistically significant
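The rule-of-thumb can be checked with a few lines of arithmetic. The sketch below only reproduces the comparison made on the slide (difference larger than the standard error); it is not a full statistical test, and the function name is hypothetical.

```python
import math


def standard_error(sigma, n_proteins):
    """Rule-of-thumb standard error of the mean per-protein accuracy."""
    return sigma / math.sqrt(n_proteins)


stderr = standard_error(10.5, 100)  # sigma = 10.5%, 100 proteins
delta_q3 = 63 - 60                  # Method B minus Method A
significant = delta_q3 > stderr
print(stderr, significant)
```

With only 10 proteins and the same sigma, the standard error would exceed 3, so the same difference would no longer pass this rule-of-thumb.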


Method B 20 years older than A, still better?

Difference statistically significant -> age makes no difference!

Any other test to do? (mind you B is 20 years old)

pre-release test: ideally use data added after both methods had been developed


�130

Cross-validation: how to

Figure: data split into Train (150) and Test (150)

Table 1
Nhidden  Q3
15       62
30       64
45       63

Conclusion: Q3=64%, best method has 30 hidden units

OK?

Cross-validation: need 3 sets!

Figure: data split into Train (100), Cross-train (100) and Test (100)

Table 1
Nhid  cross-train  test
15    62           60
30    64           61
45    63           62

Lecture plan (CB1 structure: INF)

01: 04/10 Tue: No lecture
02: 04/12 Thu: No lecture
03: 04/17 Tue: No lecture
04: 04/19 Thu: Intro 1: organization of lecture: intro into cells & biology
05: 04/24 Tue: Intro 2: amino acids, protein structure (comparison), domains
06: 04/26 Thu: No lecture
07: 05/01 Tue: SKIP: May Day
08: 05/03 Thu: Alignment 1
09: 05/08 Tue: SKIP: Student Representation (SVV)
10: 05/10 Thu: SKIP: Ascension Day
11: 05/15 Tue: Alignment 2
12: 05/17 Thu: Comparative modeling & exp structure determination & secondary structure assignment
13: 05/22 Tue: SKIP: Whitsun holiday
14: 05/24 Thu: Comparative modeling 2 & 1D: Secondary structure prediction 1
15: 05/29 Tue: 1D: Secondary structure prediction 2 (today)
16: 05/31 Thu: SKIP: Corpus Christi
17: 06/05 Tue: 1D: Secondary structure prediction 3 & Transmembrane structure prediction 1
18: 06/07 Thu: 1D: Transmembrane structure prediction 2 / Solvent accessibility prediction
19: 06/12 Tue: 1D: Transmembrane structure prediction 3 / Solvent accessibility prediction
20: 06/14 Thu: 1D: Disorder prediction
21: 06/19 Tue: 2D prediction / 3D prediction
22: 06/21 Thu: No lecture
23: 06/26 Tue: recap 1
24: 06/28 Thu: recap 2
25: 07/03 Tue: TBA
26: 07/05 Thu: TBA
27: 07/10 Tue: TBA
28: 07/12 Thu: TBA