Secondary structure prediction cb1 sec3 - Rostlab structure prediction short title: cb1_sec3...
/81© Burkhard Rost
title: Computational Biology 1 - Protein structure: Secondary structure prediction
short title: cb1_sec3
lecture: Protein Prediction 1 - Protein structure Computational Biology 1 - TUM Summer 2015
Videos: YouTube / www.rostlab.org
THANKS: Tim Karl + Carlo Di Domenico
Special lectures:
• TBA
No lecture:
• 05/12 Student assembly (SVV)
• 05/14 Ascension day
• 05/26 Whitsun holiday
• 06/04 Corpus Christi
LAST lecture: Jul 7
Exam: Jul 9
• Makeup: Oct 13, 2015 - morning/noon
CONTACT:
Inga Weise [email protected]
Tim Karl
Tatyana Goldberg
Edda Kloppmann
Andrea Schafferhans
Jonas Reeb
Carlo Di Domenico [email protected]
Carry your Genes (CyG)
■ Personalized medicine
■ Carry your genes / medical data and provide them securely to the treating doctor
■ Mobile smartphone app (Android) development
■ PC client development
■ Combined with special hardware-encrypted storage
■ Implement security features at the highest possible level
■ Work with bioinformatics and medical departments: Rostlab & RI
■ Work with IT consulting company iteratec GmbH
■ Work with IT security company certgate
Settings
What? Paid Hiwi job (one month) in combination with IDP project
Where? iteratec GmbH, Inselkammerstrasse 4, 82008 München-Unterhaching
When? asap
Who? Informatics master students @TUM
How? To apply please contact:
Dr. Lothar Richter
Daniel Stahr / Dr. Marc Offman
© iteratec | 27.05.2015
Goal of structure prediction
Epstein & Anfinsen, 1961: sequence uniquely determines structure
• INPUT: sequence
• OUTPUT: 3D structure and function
Zones
[Figure: Daylight Zone, Twilight Zone, Midnight Zone; comparison types: sequence - sequence, sequence - profile, profile - profile; sequence similar -> structure similar]
B Rost (1997) Fold Des 2:S19-24
B Rost (1999) Protein Eng 12:85-94
Good news: comparative modeling reliably predicts structure for >25 million proteins
Notation: protein structure 1D, 2D, 3D
[Figure, three panels: 1D (sequence PQITLWQRPLVTIKIGGQLKEALLDTGADDTVL with per-residue values and strand segments marked E), 2D (residue-by-residue matrix, both axes numbered 1-90, scale 0 to -5 kcal/mol), 3D (structure)]
Sec struc pred: 1st gen
1st generation (1957-1978): single residue odds, e.g. Chou-Fasman/GOR
2nd generation (1983-1992): odds for windows, e.g. GORIII
$p_1(\mathrm{SEC}_i \mid \mathrm{AA}_i)$ = probability for observing secondary structure state SEC for amino acid AA at position i
$p(\mathrm{SEC} \mid \mathrm{AA}_i)$ = probability for observing secondary structure state SEC for amino acid AA at position i $= \sum_{j=i-w}^{i+w} p_1(\mathrm{SEC}_j \mid \mathrm{AA}_j)$
Erabutoxin β (3ebx): V32, V36, V51; w=3
$p(\mathrm{SEC} \mid \mathrm{AA}_i) \neq p(\mathrm{SEC} \mid \mathrm{AA}_j)$
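A minimal sketch of the single-residue versus window idea from the two slides above. It reads the window sum as adding, for the state at position i, the single-residue odds of every residue within i-w..i+w. The table `TOY_P1`, the fallback `UNIFORM`, and the function names are hypothetical illustrations; the numbers are NOT the published Chou-Fasman or GOR values.

```python
STATES = "HEL"  # helix, strand, other

# Hypothetical toy propensities for illustration only
# (NOT the published Chou-Fasman or GOR tables).
TOY_P1 = {
    "A": {"H": 0.6, "E": 0.1, "L": 0.3},
    "V": {"H": 0.2, "E": 0.6, "L": 0.2},
    "G": {"H": 0.1, "E": 0.2, "L": 0.7},
}
# Assumption: residues missing from the toy table contribute equally to all states.
UNIFORM = {"H": 1 / 3, "E": 1 / 3, "L": 1 / 3}


def window_scores(sequence, i, w, p1=TOY_P1):
    """Sum single-residue odds over positions i-w..i+w (clipped at the ends)."""
    scores = {state: 0.0 for state in STATES}
    for j in range(max(0, i - w), min(len(sequence), i + w + 1)):
        odds = p1.get(sequence[j], UNIFORM)
        for state in STATES:
            scores[state] += odds[state]
    return scores


def predict(sequence, w=3, p1=TOY_P1):
    """Pick the highest-scoring state per position; w=0 uses single residues only."""
    predicted = []
    for i in range(len(sequence)):
        scores = window_scores(sequence, i, w, p1)
        predicted.append(max(STATES, key=lambda state: scores[state]))
    return "".join(predicted)


if __name__ == "__main__":
    print(predict("AAVVVGG", w=0))  # single-residue odds (1st generation style)
    print(predict("AAVVVGG", w=3))  # window odds (2nd generation style)
```

With w=0 the same amino acid always gets the same state; with w>0 the prediction for a valine depends on its neighbours, which is the point of the inequality on the slide.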
Secondary structure prediction: 1.+2. Generation
single residues (1. generation)
• Chou-Fasman, GOR 1957-70/80: 50-55% accuracy
segments (2. generation)
• GORIII 1986-92: 55-60% accuracy
problems
• < 100% may be: 65% max
• < 40% may be: strand non-local
• short segments
Erabutoxin β (3ebx): residues i and i+3
Neural Network for secondary structure
[Figure: neural network. Input units per residue: ACDEFGHIKLMNPQRSTVWY. ; output units: H, E, L. Example window, residue (state): D (L), R (E), Q (E), G (E), F (E), V (E), P (E), A (H), A (H), Y (H), V (E), K (E), K (E)]
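The input coding suggested by the figure can be sketched as follows: each residue in a window becomes one block of 21 units, one per letter of the alphabet shown on the slide. The window of 13 residues is taken from the slide example; treating "." as a spacer for positions beyond the sequence ends is an assumption, and `encode_window` is a hypothetical helper name.

```python
ALPHABET = "ACDEFGHIKLMNPQRSTVWY."  # from the slide; '.' assumed to be a spacer


def encode_window(sequence, center, half_width=6):
    """One-hot encode the residues from center-half_width to center+half_width.

    Positions outside the sequence are coded as the spacer '.' (assumption).
    Returns a flat list of (2*half_width + 1) * 21 zeros and ones.
    """
    vector = []
    for pos in range(center - half_width, center + half_width + 1):
        letter = sequence[pos] if 0 <= pos < len(sequence) else "."
        one_hot = [0] * len(ALPHABET)
        one_hot[ALPHABET.index(letter)] = 1  # raises ValueError for unknown letters
        vector.extend(one_hot)
    return vector


if __name__ == "__main__":
    window = "DRQGFVPAAYVKK"  # the 13-residue window from the slide
    x = encode_window(window, center=6)
    print(len(x))  # number of input units for this window
```

A network would map such a vector to three output values (H, E, L) for the central residue; the network itself is not sketched here.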
[Figure: pie charts of correctly predicted residues by state (helix, strand, other); full pie: all correctly predicted residues]
method: unbalanced; overall accuracy: 62%; comparison: data bank distribution
method: balanced; overall accuracy: 60%; comparison: 33:33:33
PHDsec: structure-to-structure network
[Figure: structure-to-structure network with units H, E, L; example residues (state): V (E), P (E), A (H)]
B Rost (1996) Methods Enzymol 266:525-39
Evolution has it!
[Figure: percentage sequence identity (0-100) versus number of residues aligned (0-250); regions labelled "Sequence identity implies structural similarity!" and "Don't know region"]
C Sander & R Schneider 1991 Proteins 9:56-68
B Rost 1999 Prot Engin 12:85-94
PHD: Neural network & evolutionary information
[Figure: from alignment to prediction.
Protein: :GYIY DPEDGDPDDGVNP GTDF:
Alignments:
:GYIY DPAVGDPDNGVEP GTEF:
:GYIY DPEVGDPTQNIPP GTKF:
:GYEY DPAEGDPDNGVKP GTSF:
:GYEY DPAEGDPDNGVKP GTAF:
profile table: per-position counts over the columns GSAPD NTEKQ CVHIR LMYFW
network: input layer (s0) -> J1 -> first or hidden layer (s1) -> J2 -> second or output layer (s2, units H, E, L); pick maximal unit => current prediction]
Each input block corresponds to the 21*3 bits coding for the profile of one residue
B Rost & C Sander (1993) PNAS 90:7558-62
B Rost (1996) Methods Enzymol 266:525-39
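The profile table in the figure can be sketched as a per-column count of amino acids across the aligned sequences. The column order follows the slide (GSAPD NTEKQ CVHIR LMYFW). Counting the protein itself together with the four aligned sequences is an assumption, chosen because fully conserved columns show a count of 5 on the slide; `profile_counts` is a hypothetical helper name.

```python
from collections import Counter

ALPHABET = "GSAPDNTEKQCVHIRLMYFW"  # column order of the profile table on the slide


def profile_counts(alignment):
    """Count, for every alignment column, how often each amino acid occurs."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    rows = []
    for column in zip(*alignment):
        counts = Counter(column)
        rows.append([counts.get(letter, 0) for letter in ALPHABET])
    return rows


# Central 13-residue window from the slide: the protein plus four aligned sequences.
WINDOW_ALIGNMENT = [
    "DPEDGDPDDGVNP",  # protein
    "DPAVGDPDNGVEP",
    "DPEVGDPTQNIPP",
    "DPAEGDPDNGVKP",
    "DPAEGDPDNGVKP",
]

if __name__ == "__main__":
    for row in profile_counts(WINDOW_ALIGNMENT):
        print(" ".join(str(n) if n else "." for n in row))
```

In this sketch each row sums to the number of sequences; a network input would typically use such counts (or frequencies derived from them) instead of a single one-hot letter per position.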
PROFsec: Evolutionary information + more
B Rost (2001) J Struct Biol 134, 204-18
Method A=60% B=63%, B better?
• use same (meaningful) measure, e.g. both Q3
• same data set: note both used 100 proteins, and both used random splits to take one half for testing, ok?
  - must contain ALL available proteins!
• split training/testing: random ok?
  - must ascertain that there was NO overlap between sets
  - no overlap defined as, e.g.: comparative modeling cannot be applied between the sets
• 63-60=3, significant? Whether significant or not depends on distribution and number
B Rost 1999 Prot Engin 12:85-94
C Sander & R Schneider 1991 Proteins 9:56-68
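For reference, a minimal sketch of the Q3 measure named above: the percentage of residues whose predicted state (H, E, or L) matches the observed state. The per-state breakdown is added because an overall number can hide a weak state. The example strings are made up for illustration, and the function names are hypothetical.

```python
def q3(observed, predicted):
    """Percentage of residues predicted in the correct one of three states."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    if not observed:
        raise ValueError("empty input")
    correct = sum(o == p for o, p in zip(observed, predicted))
    return 100.0 * correct / len(observed)


def per_state_accuracy(observed, predicted, states="HEL"):
    """Percentage of observed residues of each state that were predicted correctly."""
    result = {}
    for state in states:
        positions = [i for i, o in enumerate(observed) if o == state]
        if positions:
            hits = sum(predicted[i] == state for i in positions)
            result[state] = 100.0 * hits / len(positions)
    return result


if __name__ == "__main__":
    observed = "HHHHEEEELL"   # made-up example
    predicted = "HHHLEEEELL"  # made-up example
    print(q3(observed, predicted))
    print(per_state_accuracy(observed, predicted))
```

Comparing two methods is only meaningful when both numbers come from the same function applied to the same test proteins.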
DeltaQ3=3%, 100 proteins -> significant?
[Figure: histogram of number of protein chains (0-70) versus per-residue accuracy Q3 (0-100); <Q3>=72.3%; sigma=10.5%; labelled chains: 1spf, 1bct, 1stu, 3ifm, 1psm]
DeltaQ3=3% for 100 proteins is significant!
[Figure: histogram of number of protein chains (0-70) versus per-residue accuracy Q3 (0-100); <Q3>=72.3%; sigma=10.5%; labelled chains: 1spf, 1bct, 1stu, 3ifm, 1psm]
rule-of-thumb:
StdErr = sigma / sqrt(number of proteins)
here: StdErr = 10.5 / sqrt(100) = ±1.05
DeltaQ3 = 3 > StdErr = 1.05
-> statistically significant
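The rule of thumb above can be checked with a few lines. This sketch only reproduces the slide's comparison of the difference against one standard error; it is not a full statistical test, and the function names are hypothetical.

```python
import math


def standard_error(sigma, n_proteins):
    """Rule of thumb from the slide: sigma divided by sqrt(number of proteins)."""
    if n_proteins <= 0:
        raise ValueError("need at least one protein")
    return sigma / math.sqrt(n_proteins)


def exceeds_standard_error(delta_q3, sigma, n_proteins):
    """True if the accuracy difference is larger than one standard error."""
    return abs(delta_q3) > standard_error(sigma, n_proteins)


if __name__ == "__main__":
    sigma, n = 10.5, 100  # values from the slide
    print(standard_error(sigma, n))
    print(exceeds_standard_error(3.0, sigma, n))
```

With the same sigma but only 10 proteins the standard error grows to about 3.3, so the same 3-point difference would no longer exceed it; this is why the answer depends on distribution and number.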
pre-release test: ideally use data added after both methods had been developed
Cross-validation: how to
Sets: Train 150, Test 150 (third set: 0)
Table 1
Nhidden  Q3
15       62
30       64
45       63
Conclusion: Q3=64%; best method has 30 hidden units
OK?
Cross-validation: need 3 sets!
Sets: Train 100, Cross-train 100, Test 100
Table 1
Nhid  cross-train  test
15    62           60
30    64           61
45    63           62
Conclusion: Q3=61%; best method has 30 hidden units
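The three-set procedure can be sketched with the numbers from Table 1: the number of hidden units is chosen on the cross-train set only, and the test-set value of that chosen model is what gets reported. The dictionary layout and the function name are hypothetical.

```python
# Q3 values from Table 1 on the slide, keyed by number of hidden units.
RESULTS = {
    15: {"cross_train": 62, "test": 60},
    30: {"cross_train": 64, "test": 61},
    45: {"cross_train": 63, "test": 62},
}


def select_and_report(results):
    """Choose the model on the cross-train set, then report its test-set Q3."""
    best = max(results, key=lambda n_hidden: results[n_hidden]["cross_train"])
    return best, results[best]["test"]


if __name__ == "__main__":
    n_hidden, reported_q3 = select_and_report(RESULTS)
    print(n_hidden, reported_q3)
```

Picking the row with the highest test value instead would use the test set for model selection, which is the mistake the two-set version of the slide makes.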
Lecture plan (CB1 structure)
01: 04/14 Tue: no lecture
02: 04/16 Thu: no lecture
03: 04/21 Tue: Organization of lecture: intro into cells & biology
04: 04/23 Thu: Intro 1 - acids/structure
05: 04/28 Tue: Intro 2 - domains/3D comparisons
06: 04/30 Thu: Alignment 1
07: 05/05 Tue: Alignment 2
08: 05/07 Thu: Comparative modeling 1
09: 05/12 Tue: SKIP: student assembly (SVV)
10: 05/14 Thu: SKIP: Ascension Day
11: 05/19 Tue: Experimental structure determination/Secondary structure assignment
12: 05/21 Thu: 1D: Secondary structure prediction 1
13: 05/26 Tue: SKIP: Whitsun holiday (05/23-26)
14: 05/28 Thu: 1D: Secondary structure prediction 2
15: 06/02 Tue: 1D: Secondary structure prediction 3 / Transmembrane structure prediction 1
16: 06/04 Thu: SKIP: Corpus Christi
17: 06/09 Tue: 1D: Transmembrane structure prediction 2
18: 06/11 Thu: 1D: Transmembrane structure prediction 3 - Jonas
19: 06/16 Tue: 1D: Solvent accessibility prediction
20: 06/18 Thu: 1D: Disorder prediction
21: 06/23 Tue: 2D prediction 1
22: 06/25 Thu: 2D prediction 2
23: 06/30 Tue: 3D prediction 1
24: 07/02 Thu: Nobel prize symposium
25: 07/07 Tue: wrap up
26: 07/09 Thu: exam, no lecture
27: 07/14 Tue: exam, no lecture
28: 07/16 Thu: exam alternative, no lecture