Secondary structure prediction 2 cb1 sec2 · 2018. 5. 29. · structure prediction: combination of...
/135© Burkhard Rost
title: Secondary structure prediction 2
short title: cb1_sec2
lecture: Computational Biology 1 - Protein structure (for Informatics) - TUM summer semester
Announcements
Videos: YouTube / www.rostlab.org/talks
THANKS :.
EXERCISES:
Special lectures:
• Mikal Boden UQ Brisbane
No lecture:
• 04/26 Security check Rostlab (exercise WILL take place)
• 05/01 May Day (also no exercise)
• 05/08 Student representation (SVV) - exercise WILL happen
• 05/10 Ascension Day (also no exercise)
• 05/22 Whitsun holiday (also no exercise)
• 05/31 Corpus Christi (also no exercise)
• 06/21 no lecture (but exercise)
LAST lecture: before Jul 12
Exam: Jul 12 18-20:00 (room TBA)
• Makeup: no makeup (sorry due to overload)
Dmitrij Nechaev
Your Name
Lothar Richter
Michael Heinzinger
next
CONTACT: [email protected]© Michael Leunig
© Burkhard Rost
Recap: protein prediction
Goal of structure prediction
Epstein & Anfinsen, 1961: sequence uniquely determines structure
• INPUT: sequence
• OUTPUT: 3D structure and function
Zones
[Figure: sequence similar -> structure similar. Zones of sequence similarity (Daylight Zone, Twilight Zone, Midnight Zone) and the comparison types shown with them: sequence - sequence, sequence - profile, profile - profile]
B Rost (1997) Fold Des 2:S19-24
B Rost (1999) Protein Eng 12:85-94
Experimental 3D structure for 1 protein: >$100K
PDB = database of proteins of known 3D structure: about 120k in May 2017
structure (PDB id 4lpk): JM Ostrem et al. & KM Shokat (2013) Nature 503:548-51
Comparative modeling predicts 3D structure in silico
pretein seqwence
priteen peqwinse
Query
PDB
Good news: comparative modeling reliably predicts structure for over 40 million proteins
at $100k/protein this translates to: $4 trillion, i.e. $4 x 10^12: more than the GDP of England and France!
Bad news: for most residues comparative modeling cannot be applied
Notation: protein structure 1D, 2D, 3D
[Figure: the same protein in three notations. 1D: sequence (PQITLWQRPLVTIKIGGQLKEALLDTGADDTVL...) with per-residue strings such as the strand assignment (E); 2D: residue-by-residue map (axes: residue number 1 to 90; scale in kcal/mol from 0 to -5); 3D: structure]
DSSP
Pauling’s H-bond pattern used in DSSP
L Pauling & RB Corey (1953) PNAS 39:247-252
L Pauling, RB Corey & HR Branson (1951) PNAS 37:205-234
W Kabsch & C Sander (1983) Biopolymers 22:2577-2637
Science is communication
questions are often the first step
1D: secondary structure prediction
Words
• Secondary structure prediction
• 2ndary structure prediction
• “2D prediction”?
[Figure: 1D, 2D, 3D notation, repeated from the Notation slide]
DSSP secondary assignment has 8 “states”
Secondary structure prediction
H = Helix
G = 3-10 helix
I = Pi helix
E = Extended (strand)
B = beta-bridge, single strand residue
T = Turn, i.e. one turn of helix
S = bend
“ “ = loop
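The later slides work with three states (helix, strand, other) rather than the eight DSSP states. A minimal sketch of such a reduction follows. The mapping table `DSSP_TO_3` and the function name `reduce_dssp` are assumptions for illustration: the slides do not say which 8-to-3 mapping was used, and other conventions exist.

```python
# Hypothetical 8-to-3 reduction table (one common convention, not from the slides):
# H, G, I -> H (helix); E, B -> E (strand); T, S, ' ' -> O (other).
DSSP_TO_3 = {
    "H": "H", "G": "H", "I": "H",
    "E": "E", "B": "E",
    "T": "O", "S": "O", " ": "O",
}


def reduce_dssp(states: str) -> str:
    """Map a string of DSSP states to the three states H, E, O."""
    # Unknown symbols are treated as 'other' (an assumption of this sketch).
    return "".join(DSSP_TO_3.get(s, "O") for s in states)


print(reduce_dssp("HHGG EEB TS"))
```

Changing the table (for example, mapping B to O) changes the measured accuracy of any predictor, so two methods should be compared only under the same reduction.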
Local sequence determines secondary structure
LEDKSPDHNPTGID
AKGKPMDRNFTGRNHPPKDSS
AAQVKDALTK
LEQWGTLAQLRAIWEQELTDFPEFLTMMARQETWLGWLTI
helix strand
loop
LAVIGVLMKW
FVFLMIEKIYHKLT
DIRVGLTYYIAQ
VNTFVGTFAAVAHAL
W Kabsch & C Sander (1985) Identical pentapeptides with different backbones. Nature 317:207
How can penta-peptides occur in 2 states?
DSSP
Pauling’s H-bond pattern used in DSSP
Helix is local, sheet is not
HELIX (H): H-bond between residues i <-> i+4 (figure label: residues i and i+3)
SHEET (E) (with 3 strands): H-bond between residues i <-> i±j, j∈[4,L-4]
Erabutoxin β (3ebx)
Simple method to predict sec str
• take known structures
• find longest consecutive runs of motifs that occur ONLY in one of the three states: H (helix), E (strand), O (other)
First actual prediction method was much simpler
Simple prediction: frequency
First step (Szent-Györgyi)
• Proline breaks a helix (proline bends the main chain)
• Helices span several turns, i.e. >4 residues
-> identify helices/non-helices
from Proline to odds for all ....
....,....1....,....2....
QEKSPREVTMKKGDILTLLNSTNK
E..E EEEEEE
AA D E G I K L M N P Q R S T V
E 1 1 3 1 1 1
L 1 1 1 4 1 1 1 1 2 1
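The step "from Proline to odds for all" amounts to counting, for every amino acid, how often it is observed in each state. A minimal sketch follows; the function name `state_odds` and the toy examples are assumptions for illustration, not data from the slides.

```python
from collections import Counter, defaultdict


def state_odds(examples):
    """Estimate p(state | amino acid) from (sequence, state string) pairs."""
    counts = defaultdict(Counter)
    for seq, states in examples:
        if len(seq) != len(states):
            raise ValueError("sequence and state string must have equal length")
        for aa, state in zip(seq, states):
            counts[aa][state] += 1
    # Normalise the counts of each amino acid to relative frequencies.
    return {
        aa: {state: n / sum(c.values()) for state, n in c.items()}
        for aa, c in counts.items()
    }


# Toy data (hypothetical): H = helix, E = strand, O = other.
toy = [("PAAAL", "OHHHH"), ("VPVAV", "EOEOE")]
odds = state_odds(toy)
print(odds["A"])  # alanine: three of four observations in helix
print(odds["P"])  # proline: never in helix in this toy set
```

With real data the counts would come from a non-redundant set of PDB structures converted to states with DSSP, as described on the following slides.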
Secondary structure prediction methods
single residues (1. generation)
• Chou-Fasman, GOR 1957-70/80
Robson B & Pain RH (1971) Analysis of the Code Relating Sequence to Conformation in Proteins: Possible Implications for the Mechanism of Formation of Helical Regions. J. Mol. Biol. 58:237-259.
Chou PY & Fasman GD (1974) Prediction of protein conformation. Biochemistry 13:211-215.
Garnier J, Osguthorpe DJ and Robson B (1978) Analysis of the accuracy and Implications of simple methods for predicting the secondary structure of globular proteins. J. Mol. Biol. 120:97-120.
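Sec struc pred: 1st gen
1st generation (1957-1978): single residue odds, e.g. Chou-Fasman / GOR
$p(\mathrm{SEC} \mid \mathrm{AA}_i) = p(\mathrm{SEC} \mid \mathrm{AA}_j) \quad \forall\, i, j$
where $p(\mathrm{SEC} \mid \mathrm{AA}_i)$ = probability for observing secondary structure state SEC for amino acid AA at position i; i.e. the odds depend only on the amino acid type, not on its position
Erabutoxin β (3ebx): V32 V36 V51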
how to assess performance? problem 1: where to get secondary structure from?
DSSP
Pauling’s H-bond pattern used in DSSP
PDB = Protein Data Bank
Resource with 3D-coordinates of proteins (RNA & DNA)
www.rcsb.org, e.g. “Molecule of the Month”
2016/05: over 120,000 molecules
[Figure: number of structures in PDB (log scale, 1 to 1,000,000) vs. year (1975 to 2020)]
Prediction method
• find unique subset from proteins of known structure (PDB)
• convert 3D to 1D (secondary structure) with DSSP
Sec struc pred: 1st gen
1st generation (1957-1978): single residue odds, e.g. Chou-Fasman / GOR
$p(\mathrm{SEC} \mid \mathrm{AA}_i) = p(\mathrm{SEC} \mid \mathrm{AA}_j) \quad \forall\, i, j$
where $p(\mathrm{SEC} \mid \mathrm{AA}_i)$ = probability for observing secondary structure state SEC for amino acid AA at position i
Erabutoxin β (3ebx): V32 V36 V51
[Figure: number of structures in PDB (log scale, 1 to 1,000,000) vs. year (1975 to 2020)]
how to assess performance? problem 2: how to measure?
Assessing performance of secondary structure prediction
    1.,.,.,.,.10,.,.,.,.20,.,.,.,.30,.,.,.,.40,.,.,.,.50
obs EEEE E EEEEEE EEEEEE EEEEEEEEEEE
prd EEHHH EEEE EE HHEE EEEHHH
obs=observed, prd=predicted; H: helix, E: strand, ‘ ‘: other
Secondary structure prediction accuracy
• Q3: three-state per-residue accuracy
$Q_3 = \dfrac{\text{number of correctly predicted residues in states helix, strand, other}}{\text{number of residues in protein}}$
Schulz GE & Schirmer RH (1979) Prediction of secondary structure from the amino acid sequence. In: Principles of protein structure. Berlin: Springer-Verlag, pp 108-130.
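The Q3 definition can be sketched in a few lines. The function name `q3` and the toy strings are assumptions for illustration; the result is multiplied by 100 because the slides quote accuracies in percent.

```python
def q3(observed: str, predicted: str) -> float:
    """Three-state per-residue accuracy in percent."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have equal length")
    if not observed:
        raise ValueError("empty protein")
    # A residue counts as correct when the predicted state equals the observed one.
    correct = sum(o == p for o, p in zip(observed, predicted))
    return 100.0 * correct / len(observed)


# Toy example (hypothetical): H = helix, E = strand, ' ' = other.
obs = "EEEE  HHHH"
prd = "EEHH  HHHE"
print(q3(obs, prd))  # 7 of 10 residues agree
```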
Secondary structure prediction methods
single residues (1. generation)
• Chou-Fasman, GOR 1957-70/80
published: 63% accuracy
assessed in 1994: 50-55% accuracy
2nd Generation: how would you improve?
Segments instead of isolated residues
Erabutoxin β (3ebx)
V32 V36 V51
Secondary structure prediction: 1.+2. Generation
single residues (1. generation)
• Chou-Fasman, GOR 1957-70/80: 50-55% accuracy (Q3)
segments (2. generation)
• GORIII 1986-92: 55-60% accuracy (Q3)
• Gibrat J-F, Garnier J and Robson B (1987) Further developments of protein secondary structure prediction using information theory. New parameters and consideration of residue pairs. J. Mol. Biol. 198:425-443.
• Biou V, Gibrat JF, Levin JM, Robson B and Garnier J (1988) Secondary structure prediction: combination of three different methods. Prot. Engin. 2:185-191.
• Garnier J & Robson B (1989) The GOR method for predicting secondary structure in proteins. In: Fasman GD (ed). Prediction of protein structure and the principles of protein conformation. New York: Plenum Press, pp 417-465.
Sec struc pred: 1st and 2nd gen
1st generation (1957-1978): single residue odds, e.g. Chou-Fasman/GOR
$p_1(\mathrm{SEC}_i \mid \mathrm{AA}_i)$ = probability for observing secondary structure state SEC for amino acid AA at position i
2nd generation (1983-1992): odds for windows, e.g. GORIII
$p(\mathrm{SEC} \mid \mathrm{AA}_i) = \sum_{j=i-w}^{i+w} p_1(\mathrm{SEC}_j \mid \mathrm{AA}_j)$
i.e. the score for the state at position i sums the single-residue terms over a window of w residues on either side (w=3 in the example)
Erabutoxin β (3ebx): V32 V36 V51
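The window sum can be sketched as follows. This is a simplified illustration of "odds for windows", not the GOR III method itself: it reuses one position-independent table `p1` for every window position, and the table values, the function name `predict_window`, and the toy sequence are all assumptions.

```python
def predict_window(seq, p1, w=3, states="HEO"):
    """Predict one state per residue by summing single-residue terms over a window."""
    prediction = []
    for i in range(len(seq)):
        # The window is truncated at the ends of the sequence.
        lo, hi = max(0, i - w), min(len(seq), i + w + 1)
        score = {
            s: sum(p1.get(seq[j], {}).get(s, 0.0) for j in range(lo, hi))
            for s in states
        }
        prediction.append(max(states, key=lambda s: score[s]))
    return "".join(prediction)


# Hypothetical single-residue table p1(state | amino acid).
p1 = {
    "A": {"H": 0.6, "E": 0.1, "O": 0.3},
    "V": {"H": 0.2, "E": 0.6, "O": 0.2},
    "G": {"H": 0.1, "E": 0.2, "O": 0.7},
}
print(predict_window("AAAAVVVV", p1, w=1))
```

Because neighbours contribute to each position, isolated single-residue flips are smoothed out, which is the motivation for moving from single residues to segments.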
Secondary structure prediction: 1.+2. Generation
single residues (1. generation)
• Chou-Fasman, GOR 1957-70/80: 50-55% accuracy
segments (2. generation)
• GORIII 1986-92: 55-60% accuracy
problems
• < 100%; they said: 65% max
Helix formation is local
residues i and i+3
THYROID hormone receptor (2nll)
Secondary structure prediction: 1.+2. Generation
single residues (1. generation)
• Chou-Fasman, GOR 1957-70/80: 50-55% accuracy
segments (2. generation)
• GORIII 1986-92: 55-60% accuracy
problems
• < 100%; may be: 65% max
• < 40%; may be: strand non-local
• short segments
B Rost and C Sander (2000) Methods in Molecular Biology 143: 71-95
β-sheet formation is NOT local
Erabutoxin β (3ebx)
Problems of secondary structure predictions (before 1994)
SEQ KELVLALYDYQEKSPREVTMKKGDILTLLNSTNKDWWKVEVNDRQGFVPAAYVKKLD
OBS EEEE E E E EEEEEE EEEEEE EEEEEEHHHEEEE
TYP EHHHH EE EEEE EE HHHEE EEEHH
obs EEEE E E E EEEEEE EEEEEE EEEEEEEEEEE
prd EEHHH EE EEEE EE HHEE EEEHHH
INSERT: concept of neural networks
Simple neural network
[Figure: two input units connected by junctions J11 and J12 to one output unit]
$\mathrm{out}_0 = \mathrm{in}_1 J_{11} + \mathrm{in}_2 J_{12}$
$\mathrm{out} = \tanh(\mathrm{out}_0)$
Training a neural network 1
Training a neural network 2
$\mathrm{Error} = (\mathrm{out}_{net} - \mathrm{out}_{want})^2$
[Figure: output of the unit (out) plotted against its input (in)]
Training a neural network 3
[Figure: error as a function of the junctions]
Training a neural network 4
[Figure: example junction values changing during training; out plotted against in]
Neural networks classify points
Simple neural network with hidden layer
$\mathrm{out}_i = f\left(\sum_j J^{2}_{ij} \cdot f\left(\sum_k J^{1}_{jk} \cdot \mathrm{in}_k\right)\right)$
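A minimal sketch of the forward pass through a network with one hidden layer: each layer first forms a weighted sum over its inputs and then applies the non-linear function f. The function name `forward`, the choice f = tanh (taken from the single-unit slide), and the junction values are assumptions for illustration.

```python
import math


def forward(inputs, J1, J2, f=math.tanh):
    """Compute out_i = f(sum_j J2[i][j] * f(sum_k J1[j][k] * in_k))."""
    # First layer: one weighted sum per hidden unit, then the non-linearity.
    hidden = [f(sum(w * x for w, x in zip(row, inputs))) for row in J1]
    # Second layer: same two steps, applied to the hidden-unit values.
    return [f(sum(w * h for w, h in zip(row, hidden))) for row in J2]


# Hypothetical junctions: 2 input units, 2 hidden units, 1 output unit.
J1 = [[1.0, -1.0], [0.5, 0.5]]
J2 = [[1.0, 1.0]]
out = forward([1.0, 1.0], J1, J2)
print(out)
```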
Principles of networks: input -> output
two steps:
1. linear: sum over all input × connection
2. non-linear: sigmoid trigger, i.e., project sum onto 0-1
[Figure: input = 3 adjacent residues in protein sequence (e.g. :ACACC:), coded on units s1, s2, s3; step 1: sum Σ_j connection_ij * input_j; step 2: output from unit (0 to 1.0) as a function of input to unit (= sum), with decision line J; output = secondary structure state of central residue (α, L); result: sum < decision line]
Principles of neural networks: error
• output: $\mathrm{out}_i = \sum_{j=1}^{N_{in}+1} J_{ij}\, \mathrm{in}_j$
  in_j: value of input unit j; out_i: value of output unit i; J_ij: connection between input unit j and output unit i
• error: $E = \sum_{i=1}^{N_{out}} (\mathrm{out}_i - \mathrm{des}_i)^2$
  out_i: value of output unit i; des_i: secondary structure state observed for central amino acid for output unit i (e.g. for a helix: des_1=1, des_2=0, des_3=0)
• free variables: connections { J }
• goal: representation of set of examples (training set) for which the mapping input->output is known, i.e., the secondary structure state of the central residue has been observed
Principles of neural networks: training
training = change of connections {J} such that E decreases
simplest procedure:
• gradient descent
$\Delta J_{ij}(t+1) = -\epsilon\, \dfrac{\partial E(t)}{\partial J_{ij}(t)} + \alpha\, \Delta J_{ij}(t-1)$
where ∂E/∂J is the derivative of the error with respect to the network connection; t is the algorithmic time given by the presentation of one example; ε determines the step width of the change (learning strength, typically some 0.01); α gives the contribution of the momentum term (∆J(t-1), typically some 0.2), which permits uphill moves
[Figure: error as a function of the connections { J }]
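The update rule can be sketched for the simplest case, a single linear output unit with squared error, where the derivative is available in closed form. The function names `train` and `squared_error` and the toy examples are assumptions for illustration; ε = 0.01 and α = 0.2 are the typical values named on the slide.

```python
def squared_error(J, examples):
    """E summed over all examples for a linear unit out = sum_j J_j * in_j."""
    return sum(
        (sum(j * x for j, x in zip(J, inputs)) - desired) ** 2
        for inputs, desired in examples
    )


def train(examples, n_in, epsilon=0.01, alpha=0.2, cycles=200):
    """Gradient descent with momentum, one update per presented example."""
    J = [0.0] * n_in
    previous_change = [0.0] * n_in
    for _ in range(cycles):
        for inputs, desired in examples:
            out = sum(j * x for j, x in zip(J, inputs))
            # dE/dJ_j for E = (out - desired)^2 with a linear unit.
            gradient = [2.0 * (out - desired) * x for x in inputs]
            change = [
                -epsilon * g + alpha * c
                for g, c in zip(gradient, previous_change)
            ]
            J = [j + c for j, c in zip(J, change)]
            previous_change = change
    return J


# Toy mapping (hypothetical): the desired output equals the first input.
examples = [([1.0, 0.0], 1.0), ([0.0, 1.0], 0.0), ([1.0, 1.0], 1.0)]
J_trained = train(examples, n_in=2)
print(J_trained, squared_error(J_trained, examples))
```

For networks with hidden layers the same rule applies, with the derivative obtained by the chain rule through each layer.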
Effect of over-training: theory
[Figure: two curves (y-axis 0 to 100) vs. training time; the region where they diverge is marked "over-train"; annotation: toy problems]
what were those two curves?
Effect of over-training: theory
[Figure: the same plot (y-axis 0 to 100, x-axis training time) with the curves labelled: training set; cross-training / testing (validation set); region marked "over-train"]
Sketch of simplified cross-validation
[Figure: data split into TRAIN, cross-TRAIN, and TEST sets]
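A minimal sketch of the three-way split and of picking the stopping point from the cross-training curve. The function names, the split fractions, and the accuracy values are assumptions for illustration. The random shuffle is used only for simplicity; as the later slides on comparing methods point out, a random split is not enough when proteins in different sets are related.

```python
import random


def three_way_split(items, frac_train=0.6, frac_cross=0.2, seed=0):
    """Split items into TRAIN, cross-TRAIN and TEST sets."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = int(len(items) * frac_train)
    n_cross = int(len(items) * frac_cross)
    train = items[:n_train]
    cross_train = items[n_train:n_train + n_cross]
    test = items[n_train + n_cross:]
    return train, cross_train, test


def best_stopping_cycle(cross_train_accuracy):
    """Index of the training cycle with the highest cross-training accuracy."""
    return max(range(len(cross_train_accuracy)),
               key=cross_train_accuracy.__getitem__)


train_set, cross_set, test_set = three_way_split(range(10))
# Hypothetical accuracy per cycle on the cross-training set:
# it rises, then falls once the network over-trains.
stop_at = best_stopping_cycle([0.60, 0.70, 0.72, 0.71, 0.69])
print(len(train_set), len(cross_set), len(test_set), stop_at)
```

The TEST set is touched only once, after the stopping point has been fixed on the cross-training set.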
Effect of over-training: practice
[Figure: number of correct classifications per example (0.55 to 0.95) vs. number of training cycles (0 to 25); curves: ratio for training set, ratio for testing set]
Effect of over-training: practice
[Figure: number of correct classifications per example (0.55 to 0.95) vs. number of training cycles (0 to 25) for training and testing sets, shown next to the theoretical curve for toy problems (0 to 100 vs. training time)]
RETURN: secondary structure prediction
single residues (1. generation) • Chou-Fasman, GOR 1957-70/80
50-55% accuracy
segments (2. generation) • GORIII 1986-92
55-60% accuracy
problems • < 100% they said: 65% max
• < 40% they said: strand non-local
• short segments
�70
Secondary structure predictions of 1. and 2. generation
Neural Network for secondary structure
[Figure: input window of 13 adjacent residues with observed states D (L), R (E), Q (E), G (E), F (E), V (E), P (E), A (H), A (H), Y (H), V (E), K (E), K (E); each residue is coded over the alphabet ACDEFGHIKLMNPQRSTVWY plus '.'; output units H, E, L]
B Rost (1996) Methods in Enzymology 266:525-39
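The input coding in the figure can be sketched as a sparse (one-hot) encoding: every window position gets 21 units, one per symbol of the alphabet, with '.' used here for positions beyond the ends of the protein. The function name `encode_window` and the treatment of unknown letters are assumptions for illustration.

```python
ALPHABET = "ACDEFGHIKLMNPQRSTVWY."  # 20 amino acids plus '.'


def encode_window(seq, center, half_width=6):
    """One-hot code a window of 2*half_width+1 residues around `center`."""
    vector = []
    for pos in range(center - half_width, center + half_width + 1):
        # Positions outside the protein and unknown letters map to '.'.
        symbol = seq[pos] if 0 <= pos < len(seq) else "."
        if symbol not in ALPHABET:
            symbol = "."
        units = [0] * len(ALPHABET)
        units[ALPHABET.index(symbol)] = 1
        vector.extend(units)
    return vector


# Window from the figure: 13 residues, central residue P.
x = encode_window("DRQGFVPAAYVKK", center=6)
print(len(x), sum(x))  # 13 * 21 input units, one active unit per position
```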
NN predicts secondary structure
[Figure: pie charts of correctly predicted residues split into helix, strand, other]
method: neural network, unbalanced
overall accuracy: 62%
... and developer believes that application of machine learning is all the intelligence he will ever need...
NN sec str: training dynamics
[Figure: performance (0 to 1) for Other, Strand, Helix vs. time (1 to 10); time: 1 step = 20,000 training samples]
$E^{\mu} = \sum_i \left(o_i^{\mu} - d_i^{\mu}\right)^2$
$\Delta J^{\mu} \propto -\dfrac{\partial E^{\mu}\{J\}}{\partial J}$
NN predicts secondary structure
[Figure: pie charts; full pie: all correctly predicted residues, split into helix, strand, other]
method: neural network, unbalanced; overall accuracy: 62%
comparison: data bank distribution
comparison: 33:33:33
Balanced training
normal training:
$E^{\mu} = \sum_i \left(o_i^{\mu} - d_i^{\mu}\right)^2$
$\Delta J^{\mu} \propto -\dfrac{\partial E^{\mu}\{J\}}{\partial J}$
balanced training:
$E = \sum_{\mu=\alpha,\beta,L} \sum_i \left(o_i^{\mu} - d_i^{\mu}\right)^2$
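One way to read the balanced error is that every update sees one example from each class (helix, strand, loop), regardless of how common the class is in the data bank. A minimal sketch of drawing such triples follows; the function name `balanced_batches` and the toy examples are assumptions for illustration, and the slides do not specify how the examples were drawn.

```python
import random


def balanced_batches(examples, n_steps, seed=0):
    """Yield, per training step, one randomly chosen example from each class."""
    rng = random.Random(seed)
    by_state = {}
    for window, state in examples:
        by_state.setdefault(state, []).append((window, state))
    for _ in range(n_steps):
        # The error of one step is summed over this class-balanced batch.
        yield [rng.choice(by_state[state]) for state in sorted(by_state)]


# Toy data (hypothetical): loop examples are much more frequent.
examples = [("w%d" % i, "L") for i in range(8)] + [("h", "H"), ("e", "E")]
batches = list(balanced_batches(examples, n_steps=5))
print(batches[0])
```

In normal training the frequent class dominates the updates; with balanced batches each class contributes equally, which matches the trade-off on the following slides (slightly lower overall accuracy, more even accuracy per state).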
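Balanced training: dynamics
[Figure: performance (0 to 1) for Other, Strand, Helix vs. time steps (1 to 10); two panels: unbalanced and balanced]
unbalanced: $E^{\mu} = \sum_i \left(o_i^{\mu} - d_i^{\mu}\right)^2$, $\Delta J^{\mu} \propto -\dfrac{\partial E^{\mu}\{J\}}{\partial J}$
balanced (train): $E = \sum_{\mu=\alpha,\beta,L} \sum_i \left(o_i^{\mu} - d_i^{\mu}\right)^2$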
[Figure: pie charts; full pie: all correctly predicted residues, split into helix, strand, other]
method: overall accuracy
• unbalanced: 62%
• balanced: 60%
comparison: data bank distribution
comparison: 33:33:33
Neural networks DO improve if developer does something more than dream the machine learning dream...
single residues (1. generation) • Chou-Fasman, GOR 1957-70/80
50-55% accuracy
segments (2. generation) • GORIII 1986-92
55-60% accuracy
problems • < 100% they said: 65% max
• < 40% they said: strand non-local
• short segments
�83
Secondary structure predictions of 1. and 2. generation
/135© Burkhard Rost
�84
β-sheet formation is NOT local
Erabutoxin β (3ebx)
Conclusion: not all sound explanations are right!
single residues (1. generation) • Chou-Fasman, GOR 1957-70/80
50-55% accuracy
segments (2. generation) • GORIII 1986-92
55-60% accuracy
problems • < 100% they said: 65% max
• < 40% they said: strand non-local
• short segments
�86
Secondary structure predictions of 1. and 2. generation
/135© Burkhard Rost
�87
Bad segment prediction
HHHHHHHHHEEEEE
HHHHEEE
HHHHHHHEEEEE
1st level
2nd level
comparison: observed:
SEQ KELVLALYDYQEKSPREVTMKKGDILTLLNSTNKDWWKVEVNDRQGFVPAAYVKKLD
OBS EEEE E E E EEEEEE EEEEEE EEEEEEHHHEEEE
TYP EHHHH EE EEEE EE HHHEE EEEHH
Select samples at random
$\Delta J_{ij}(t+1) = -\epsilon\, \dfrac{\partial E(t)}{\partial J_{ij}(t)} + \alpha\, \Delta J_{ij}(t-1)$
where ∂E/∂J is the derivative of the error with respect to the network connection; t is the algorithmic time given by the presentation of one example; ε determines the step width of the change (learning strength, typically some 0.01); α gives the contribution of the momentum term (∆J(t-1), typically some 0.2), which permits uphill moves
[Figure: error as a function of the connections { J }]
Local correlations in reality
residues i and i+3
Erabutoxin β (3ebx)
How to get those into the prediction?
PHDsec: structure-to-structure network
[Figure: structure-to-structure network; input: predicted states of adjacent residues, e.g. V (E), P (E), A (H); output units H, E, L]
B Rost (1996) Methods Enzymol 266:525-39
Better segment prediction
HHHHHHHHHEEEEE
HHHHEEE
HHHHHHHEEEEE
1st level
2nd level
comparison:observed:
Better prediction of segment lengths
[Figure A: number of segments (0 to 1200) vs. segment length (0 to 50) for DSSP and PHD, with an inset for segment lengths 25 to 50. Figure B: difference in number of observed - predicted segments (-800 to 800) vs. segment length (0 to 10) for helix, strand, loop]
Structure-to-structure network: Invented?
N Qian & TJ Sejnowski (1988) Predicting the secondary structure of globular proteins using neural network models. J. Mol. Biol. 202:865-884.
[Figure: PHDsec structure-to-structure network, as on the PHDsec slide]
PHDsec 1993
Other ideas
More output units, e.g. instead of central residue: take central 3
1. 9 output units
2. average output -> 3 units
output back into neural networks:
Gianluca Pollastri, Dariusz Przybylski, B Rost and Pierre Baldi (2002) Improving the prediction of protein secondary structure in three and eight classes using recurrent neural networks and profiles. Proteins: Structure, Function, and Bioinformatics 47:228-235 (Fig. 1).
idea: P Frasconi & M Gori (1996) IEEE Trans Neural Netw 7:1521-5
STILL ONLY 60+ε% accuracy. How to improve beyond that?
How to get more data into it?
Evolution has it!
[Figure: percentage sequence identity (0 to 100) vs. number of residues aligned (0 to 250); above the threshold curve: "Sequence identity implies structural similarity!"; below it: "Don't know region"]
C Sander & R Schneider (1991) Proteins 9:56-68
B Rost (1999) Prot Engin 12:85-94
           1                                                   50
fyn_human  VTLFVALYDY EARTEDDLSF HKGEKFQILN SSEGDWWEAR SLTTGETGYI
yrk_chick  VTLFIALYDY EARTEDDLSF QKGEKFHIIN NTEGDWWEAR SLSSGATGYI
fgr_human  VTLFIALYDY EARTEDDLTF TKGEKFHILN NTEGDWWEAR SLSSGKTGCI
yes_chick  VTVFVALYDY EARTTDDLSF KKGERFQIIN NTEGDWWEAR SIATGKTGYI
src_avis2  VTTFVALYDY ESRTETDLSF KKGERLQIVN NTEGDWWLAH SLTTGQTGYI
src_aviss  VTTFVALYDY ESRTETDLSF KKGERLQIVN NTEGDWWLAH SLTTGQTGYI
src_avisr  VTTFVALYDY ESRTETDLSF KKGERLQIVN NTEGDWWLAH SLTTGQTGYI
src_chick  VTTFVALYDY ESRTETDLSF KKGERLQIVN NTEGDWWLAH SLTTGQTGYI
stk_hydat  VTIFVALYDY EARISEDLSF KKGERLQIIN TADGDWWYAR SLITNSEGYI
src_rsvpa  .......... ESRIETDLSF KKRERLQIVN NTEGTWWLAH SLTTGQTGYI
hck_human  ..IVVALYDY EAIHHEDLSF QKGDQMVVLE ES.GEWWKAR SLATRKEGYI
blk_mouse  ..FVVALFDY AAVNDRDLQV LKGEKLQVLR .STGDWWLAR SLVTGREGYV
hck_mouse  .TIVVALYDY EAIHREDLSF QKGDQMVVLE .EAGEWWKAR SLATKKEGYI
lyn_human  ..IVVALYPY DGIHPDDLSF KKGEKMKVLE .EHGEWWKAK SLLTKKEGFI
lck_human  ..LVIALHSY EPSHDGDLGF EKGEQLRILE QS.GEWWKAQ SLTTGQEGFI
ss81_yeast .....ALYPY DADDDdeISF EQNEILQVSD .IEGRWWKAR R.ANGETGII
abl_mouse  ..LFVALYDF VASGDNTLSI TKGEKLRVLG YnnGEWCEAQ ..TKNGQGWV
abl1_human ..LFVALYDF VASGDNTLSI TKGEKLRVLG YnnGEWCEAQ ..TKNGQGWV
src1_drome ..VVVSLYDY KSRDESDLSF MKGDRMEVID DTESDWWRVV NLTTRQEGLI
mysd_dicdi .....ALYDF DAESSMELSF KEGDILTVLD QSSGDWWDAE L..KGRRGKV
yfj4_yeast ....VALYSF AGEESGDLPF RKGDVITILK ksQNDWWTGR V..NGREGIF
abl2_human ..LFVALYDF VASGDNTLSI TKGEKLRVLG YNQNGEWSEV RSKNG.QGWV
tec_human  .EIVVAMYDF QAAEGHDLRL ERGQEYLILE KNDVHWWRAR D.KYGNEGYI
abl1_caeel ..LFVALYDF HGVGEEQLSL RKGDQVRILG YNKNNEWCEA RlrLGEIGWV
txk_human  .....ALYDF LPREPCNLAL RRAEEYLILE KYNPHWWKAR D.RLGNEGLI
yha2_yeast VRRVRALYDL TTNEPDELSF RKGDVITVLE QVYRDWWKGA L..RGNMGIF
abp1_sacex .....AEYDY EAGEDNELTF AENDKIINIE FVDDDWWLGE LETTGQKGLF
B Rost (1996) Methods Enzymol 266:525-39
SH3 (Src-homology 3) domain: one domain of proteins such as Src tyrosine kinase (STK)
Evolution improves prediction
Evolutionary profile implicitly captures the history of an individual protein!
fly
chicken
rat
mouse
human
PHD: Neural network & evolutionary information
[Figure: PHD network. The protein and its aligned homologues are compiled into a profile table; the profile of each residue in the input window is fed to the input layer (s0), connections J1 lead to the first or hidden layer (s1), connections J2 to the second or output layer (s2) with units H, E, L; pick maximal unit => current prediction. One input block corresponds to the 21*3 bits coding for the profile of one residue.]
Protein:
:GYIY DPEDGDPDDGVNP GTDF:
Alignments:
:GYIY DPAVGDPDNGVEP GTEF:
:GYIY DPEVGDPTQNIPP GTKF:
:GYEY DPAEGDPDNGVKP GTSF:
:GYEY DPAEGDPDNGVKP GTAF:
profile table (counts per alignment position over the 5 sequences):
pos  G S A P D  N T E K Q  C V H I R  L M Y F W
 1   5 . . . .  . . . . .  . . . . .  . . . . .
 2   . . . . .  . . . . .  . . . . .  . . 5 . .
 3   . . . . .  . . 2 . .  . . . 3 .  . . . . .
 4   . . . . .  . . . . .  . . . . .  . . 5 . .
 5   . . . . 5  . . . . .  . . . . .  . . . . .
 6   . . . 5 .  . . . . .  . . . . .  . . . . .
 7   . . 3 . .  . . 2 . .  . . . . .  . . . . .
 8   . . . . 1  . . 2 . .  . 2 . . .  . . . . .
 9   5 . . . .  . . . . .  . . . . .  . . . . .
10   . . . . 5  . . . . .  . . . . .  . . . . .
11   . . . 5 .  . . . . .  . . . . .  . . . . .
12   . . . . 4  . 1 . . .  . . . . .  . . . . .
13   . . . . 1  3 . . . 1  . . . . .  . . . . .
14   4 . . . .  1 . . . .  . . . . .  . . . . .
15   . . . . .  . . . . .  . 4 . 1 .  . . . . .
16   . . . 1 .  1 . 1 2 .  . . . . .  . . . . .
17   . . . 5 .  . . . . .  . . . . .  . . . . .
18   5 . . . .  . . . . .  . . . . .  . . . . .
19   . . . . .  . 5 . . .  . . . . .  . . . . .
20   . 1 1 . 1  . . 1 1 .  . . . . .  . . . . .
21   . . . . .  . . . . .  . . . . .  . . . 5 .
B Rost & C Sander (1993) PNAS 90:7558-62
B Rost (1996) Methods Enzymol 266:525-39
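The profile table is obtained by counting, for each alignment column, how often each amino acid occurs. A minimal sketch follows, using the five sequences shown on the slide with the window separators removed. The function name `profile` is an assumption for illustration, and PHD's actual input scaling is not reproduced here.

```python
from collections import Counter

# The protein and its four aligned homologues from the slide.
alignment = [
    "GYIYDPEDGDPDDGVNPGTDF",
    "GYIYDPAVGDPDNGVEPGTEF",
    "GYIYDPEVGDPTQNIPPGTKF",
    "GYEYDPAEGDPDNGVKPGTSF",
    "GYEYDPAEGDPDNGVKPGTAF",
]


def profile(alignment):
    """Count amino acid occurrences per alignment column."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have equal length")
    return [Counter(column) for column in zip(*alignment)]


prof = profile(alignment)
print(prof[2])   # third column: I in 3 sequences, E in 2
print(prof[15])  # variable column: K twice, N, E, P once each
```

Conserved columns carry a single non-zero count, while variable columns spread their counts over several amino acids; this is the evolutionary information that a single sequence cannot provide.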
Same idea as for regular secondary structure
[Figure: side by side, the single-sequence network (input: one sequence window coded over ACDEFGHIKLMNPQRSTVWY plus '.') and the alignment-based PHD network (input: profile table compiled from the alignment); both have output units H, E, L]
B Rost & C Sander (1993) PNAS 90:7558-62
B Rost (1996) Methods Enzymol 266:525-39
From sequence to profile
[Figure: pipeline with the elements: sequence data bank; BLAST; MaxHom; filter on sequence identity (25% to 100%) vs. number of residues aligned (80); lists of aligned proteins (protein A, protein B, ..., protein N; protein A, protein C, ..., protein M); extract alignment; PHD]
B Rost (1996) Methods Enzymol 266:525-39
PHDsec: more details
[Figure: two-level network with input layer, hidden layer, and output layer; first level: sequence-to-structure; second level: structure-to-structure; output units H, L, E with example values 0.5, 0.1, 0.4; unit labels in the figure: 21+3, 4+1, 20 4 4 4]
input local in sequence: local alignment, 13 adjacent residues
input global in sequence: global statistics over the whole protein
• %AA: percentage of each amino acid in protein
• Length: length of protein (≤60, ≤120, ≤240, >240)
• ∆ N-term: distance: centre, N-term (≤40, ≤30, ≤20, ≤10)
• ∆ C-term: distance: centre, C-term (≤40, ≤30, ≤20, ≤10)
example alignment :::AAAAA.LLLLIIAAGCCSGVV::: and its profile:
A    C    L    I    G    S    V    ins  del  cons
100  0    0    0    0    0    0    0    0    1.17
100  0    0    0    0    0    0    33   0    0.42
0    0    100  0    0    0    0    0    33   0.92
0    0    33   66   0    0    0    0    0    0.74
66   0    0    0    33   0    0    0    0    1.17
0    66   0    0    0    33   0    0    0    0.74
0    0    0    33   0    0    66   0    0    0.48
B Rost (1996) Methods Enzymol 266:525-39
Jury
[Figure: single network vs. jury decision; predictions of architecture 1, architecture 2, architecture 3, architecture 4]
centre of mass = jury over 1-4
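A jury decision over several networks can be sketched as averaging their output units and picking the largest mean. Reading "centre of mass" as the arithmetic mean is an assumption of this sketch, as are the function name `jury` and the output values.

```python
def jury(outputs, states="HEL"):
    """Average the output units of several networks and pick the maximal unit."""
    n = len(outputs)
    mean = [sum(o[k] for o in outputs) / n for k in range(len(states))]
    best = max(range(len(states)), key=mean.__getitem__)
    return states[best], mean


# Hypothetical outputs (H, E, L) of four architectures for one residue.
nets = [
    [0.5, 0.4, 0.1],
    [0.3, 0.6, 0.1],
    [0.2, 0.7, 0.1],
    [0.6, 0.3, 0.1],
]
state, mean = jury(nets)
print(state, mean)
```

In this toy case a simple majority vote would tie (two networks favour H, two favour E), while the averaged outputs favour E because the strand predictions are stronger.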
PROFsec: Evolutionary information + more
B Rost (2001) J Struct Biol 134, 204-18
Spectrin homology domain (SH3)
HEADER CYTOSKELETON
COMPND ALPHA SPECTRIN (SH3 DOMAIN)
SOURCE CHICKEN (GALLUS GALLUS) BRAIN
AUTHOR M.NOBLE,R.PAUPTIT,A.MUSACCHIO,M.SARASTE
[Figure labels: 59%, 65%, 72%]
Prediction accuracy varies!
[Figure: histogram of number of protein chains (0 to 70) vs. per-residue accuracy Q3 (0 to 100); <Q3>=72.3%; sigma=10.5%; labelled proteins: 1spf, 1bct, 1stu, 3ifm, 1psm]
Stronger predictions more accurate!
[Figure: Q3 per protein (0 to 100) vs. reliability index averaged over protein (3 to 9), with linear fit Q3fit = 21 + 8.7 * (averaged reliability index); network sketch with example outputs H=0.5, E=0.4, L=0.1 and H=0.8, E=0.1, L=0.1; inset: histogram of per-residue accuracy (Q3) over protein chains, <Q3>=72.3%, sigma=10.5%]
Correct prediction of correctly predicted residues.
[Figure: overall per-residue accuracy (70 to 100) vs. percentage of residues predicted (0 to 100) for PHDsec, PHDacc, PHDhtm; points along each curve are reliability index thresholds from RI=0 to RI=9]
False prediction for engineered proteins!
GB1: IgG-binding domain of protein G (CHAMELEON) Kim & Berg, Nature, 366, 267-270, 1993
....,....1....,....2....,....3....,....4....,....5....,..
AA TTYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDDATKTFTVTEK
DSSP EEEEEEE EEEEEEEEE HHHHHHHHHHHHHHHHH EEEEEEE EEEEEEEE
PHD 30 EEEEEE E EEHHHHHHHHHHHHHHEEE EEEEEE EEEEE
PHD no EEEEEE EEEEEHHHHHHHHHHHHHHHH EEEEE EEEEEE
AATAEKVFKQY -> AWTVEKAFKTF
PHD 30 EEEEEE EEEEEEE HHHHHHHHHEEE EEEE EEEEEE
PHD no EEEEEE EEEEEEHHHHHHHHHHHHHHH EEEEE EEEEEE
EWTYDDATKTF -> AWTVEKAFKTF
PHD 30 EEEEEE EEE EHHHHHHHHHHHHHHHH EEEEE EEEEEE
PHD no EEEEEE E E EHHHHHHHHHHHHHHHH HHHHHHH EEEEE
AWTVEKAFKTF HHHHH
Proper comparison of methods
Method A=60% Method B=63%
B better?
• Same measure? Use the same (meaningful) measure, e.g. both Q3.
• Same data set? Both methods using 100 proteins, each with a random split that takes one half for testing, is not enough: the data set must contain ALL available proteins!
• Split training/testing: is random ok? No: must ascertain that there was NO overlap between sets. Sets count as non-overlapping if, e.g., comparative modeling cannot be applied between them.
  B Rost 1999 Prot Engin 12, 85-94
  C Sander & R Schneider 1991 Proteins 9:56-69
• 63-60=3, significant? Whether significant or not depends on distribution and number.
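One way to rule out overlap between training and testing sets is to split by groups of related proteins rather than by single proteins. The sketch below assumes that such groups (clusters) have already been computed by some sequence-similarity criterion; the protein names, the cluster numbers and the function `split_by_cluster` are hypothetical.

```python
import random


def split_by_cluster(cluster_of, test_fraction=0.5, seed=0):
    """Assign whole clusters to train or test so no cluster spans both sets."""
    clusters = sorted(set(cluster_of.values()))
    rng = random.Random(seed)
    rng.shuffle(clusters)
    n_test = int(len(clusters) * test_fraction)
    test_ids = set(clusters[:n_test])
    train = sorted(p for p, c in cluster_of.items() if c not in test_ids)
    test = sorted(p for p, c in cluster_of.items() if c in test_ids)
    return train, test


# Hypothetical protein -> cluster assignment (e.g. from a similarity search).
cluster_of = {
    "protA": 1, "protB": 1,
    "protC": 2,
    "protD": 3, "protE": 3,
    "protF": 4,
}

train, test = split_by_cluster(cluster_of)
train_clusters = {cluster_of[p] for p in train}
test_clusters = {cluster_of[p] for p in test}

print("train:", train)
print("test:", test)
print("shared clusters:", train_clusters & test_clusters)
```

A purely random split over proteins could place protA in training and protB in testing; splitting over clusters avoids that by construction.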
DeltaQ3=3%, 100 proteins->significant?
[Figure: histogram of the number of protein chains vs. per-residue accuracy (Q3); <Q3>=72.3%; sigma=10.5%; labelled chains: 1spf, 1bct, 1stu, 3ifm, 1psm]
DeltaQ3=3% for 100 proteins is significant!
[Figure: histogram of the number of protein chains vs. per-residue accuracy (Q3); <Q3>=72.3%; sigma=10.5%; labelled chains: 1spf, 1bct, 1stu, 3ifm, 1psm]
rule-of-thumb:
StdError = sigma / sqrt(number of proteins)
here: StdErr = 10.5 / sqrt(100) = ±1.05
DeltaQ3 = 3 > StdErr = 1.05
-> statistically significant
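The rule of thumb can be checked with a few lines of arithmetic. The sketch uses the numbers from the slide (sigma = 10.5, 100 proteins, DeltaQ3 = 3). The second call with only 10 proteins is an added illustration of why the number of proteins matters and is not taken from the slide.

```python
import math


def standard_error(sigma, n_proteins):
    """Rule-of-thumb standard error of the mean: sigma / sqrt(n)."""
    return sigma / math.sqrt(n_proteins)


sigma = 10.5          # standard deviation of per-protein Q3 (from the slide)
delta_q3 = 63 - 60    # difference between method B and method A

stderr_100 = standard_error(sigma, 100)
stderr_10 = standard_error(sigma, 10)   # illustrative smaller data set

print(f"100 proteins: StdErr = {stderr_100:.2f}, significant: {delta_q3 > stderr_100}")
print(f" 10 proteins: StdErr = {stderr_10:.2f}, significant: {delta_q3 > stderr_10}")
```

With 100 proteins the difference exceeds the standard error; with only 10 proteins the same difference would not.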
Method B 20 years older than A, still better?
Difference statistically significant
-> age makes no difference!
Any other test to do?
(mind you B is 20 years old)
pre-release test: ideally use data added after both methods had been developed
Cross-validation: how to
Data split: Train 150, Test 150 (no cross-training set)
Table 1
Nhidden Q3
15 62
30 64
45 63
Conclusion: Q3=64%, best method has 30 hidden units
OK?
Cross-validation: need 3 sets!
Data split: Train 100, Cross-train 100, Test 100
Table 1
Nhid cross-train test
15 62 60
30 64 61
45 63 62
Conclusion: Q3=61%, best method has 30 hidden units
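The selection logic behind the three-set table can be written out explicitly: the number of hidden units is chosen on the cross-training set, and only the test value of that chosen model is reported. The sketch uses the numbers from the table above; the dictionary layout and variable names are assumptions for illustration.

```python
# Q3 values from the table: hidden units -> accuracy on each set.
results = {
    15: {"cross_train": 62, "test": 60},
    30: {"cross_train": 64, "test": 61},
    45: {"cross_train": 63, "test": 62},
}

# Correct protocol: choose the model on the cross-training set only.
best_hidden = max(results, key=lambda n: results[n]["cross_train"])
reported_q3 = results[best_hidden]["test"]

# Flawed protocol: choosing on the test set reports an optimistic value.
peeked_hidden = max(results, key=lambda n: results[n]["test"])
peeked_q3 = results[peeked_hidden]["test"]

print(f"selected on cross-train: {best_hidden} hidden units, Q3 = {reported_q3}%")
print(f"selected on test (not ok): {peeked_hidden} hidden units, Q3 = {peeked_q3}%")
```

The test set stays untouched until the very end, so the reported value is not biased by the choice of the number of hidden units.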
Lecture plan (CB1 structure: INF)
01: 04/10 Tue: No lecture
02: 04/12 Thu: No lecture
03: 04/17 Tue: No lecture
04: 04/19 Thu: Intro 1: organization of lecture: intro into cells & biology
05: 04/24 Tue: Intro 2: amino acids, protein structure (comparison), domains
06: 04/26 Thu: No lecture
07: 05/01 Tue: SKIP: May Day
08: 05/03 Thu: Alignment 1
09: 05/08 Tue: SKIP: Student Representation (SVV)
10: 05/10 Thu: SKIP: Ascension Day
11: 05/15 Tue: Alignment 2
12: 05/17 Thu: Comparative modeling & exp structure determination & secondary structure assignment
13: 05/22 Tue: SKIP: Whitsun holiday
14: 05/24 Thu: Comparative modeling 2 & 1D: Secondary structure prediction 1
15: 05/29 Tue: 1D: Secondary structure prediction 2 (today)
16: 05/31 Thu: SKIP: Corpus Christi
17: 06/05 Tue: 1D: Secondary structure prediction 3 & Transmembrane structure prediction 1
18: 06/07 Thu: 1D: Transmembrane structure prediction 2 / Solvent accessibility prediction
19: 06/12 Tue: 1D: Transmembrane structure prediction 3 / Solvent accessibility prediction
20: 06/14 Thu: 1D: Disorder prediction
21: 06/19 Tue: 2D prediction / 3D prediction
22: 06/21 Thu: No lecture
23: 06/26 Tue: recap 1
24: 06/28 Thu: recap 2
25: 07/03 Tue: TBA
26: 07/05 Thu: TBA
27: 07/10 Tue: TBA
28: 07/12 Thu: TBA