IPM-POLYTECHNIQUE-WPI Workshop on Bioinformatics and Biomathematics April 11-21, 2005 IPM School of...
IPM-POLYTECHNIQUE-WPI Workshop on
Bioinformatics and Biomathematics
April 11-21, 2005 IPM
School of Mathematics Tehran
Prediction of protein surface accessibility based on residue
pair types and accessibility state using dynamic
programming algorithm
R. Zarei (1), M. Sadeghi (2), and S. Arab (3)
(1, 2) NRCGEB, Tehran, Iran; (3) IBB, University of Tehran
Proteins & structure of proteins
Prediction of protein structure
Prediction of protein accessible surface area
Method
Conclusion
Flow of information
DNA → RNA → protein sequence → protein structure → protein function
Proteins are the Machinery of life
Proteins have structural & functional roles in cells
No other type of biological macromolecule could possibly assume all of the functions that proteins have amassed over billions of years of evolution.
Protein structure leads to protein function
Precise placement of chemical groups allows proteins to have:
Catalytic function
Structural role
Transport function
Regulatory function
Determining the 3-dimensional structure of proteins is therefore important.
4 levels of protein structures
The Primary structure of proteins (A string of 20 different Amino acids)
The secondary structure of proteins (Local 3-D structure)
The Tertiary structure of proteins (Global 3-D structure)
The Quaternary structure of proteins (Association of multiple polypeptide chains)
The Primary structure of proteins
The secondary structure of proteins
α-helices: α-helix, 3₁₀-helix, π-helix
β-sheets: parallel, antiparallel
Other secondary structures:
Loops: hairpin loops, Ω loops, extended loops
Coils: random coil
The Tertiary structure of proteins
There is a wide variety of ways in which the various helix, sheet & loop elements can combine to produce a complete structure.
At the level of tertiary structure, the side chains play a much more active role in creating the final structure.
Why predict protein structure?
Structural knowledge brings understanding of function and mechanism of action
Protein structure is determined experimentally by X-ray crystallography and NMR
The sequence-structure gap is rapidly increasing:
1,000,000 known sequences, 20,000 known structures
What is protein structure prediction?
In its most general form
A prediction of the (relative) spatial position of each atom in the tertiary structure generated from knowledge only of the primary structure (sequence)
Hypotheses of Prediction
No general prediction of 3D structure from sequence yet.
Sequence determines structure determines function
The 3D structure of a protein (the fold) is uniquely determined by the specificity of the sequence (Anfinsen, 1973)
Methods of structure prediction
Comparative (homology) modelling
Fold recognition/threading
Ab initio protein folding approaches
3D structure prediction of proteins
[Figure: 3D structure prediction methods along a sequence similarity (%) axis from 0 to 100. Existing folds: threading, building by homology. New folds: ab initio prediction.]
Levels of structure prediction
1D: secondary structure, accessibility, ...
2D: contact map of residues
3D: tertiary structure
Prediction in 1D
Structure prediction in 1D projects the 3D structure onto strings of structural assignments.
Secondary Structure prediction
Prediction of Accessible Surface Area
Prediction of Membrane Helices
What is prediction in 1D?
Given a protein sequence (primary structure)
GHWIATHWIATRGQLIREAYEDYGQLIREAYEDYRHFSSSSECPFIP
Assign a state to each residue
(C = coil, H = alpha helix, E = beta strand)
CEEEEEEEEEECHHHHHHHHHHHHHHHHHHHHHHCCCHHHHCCCCCC
Secondary structure prediction in 1D
Less detailed results: only predicts the H (helix), E (extended) or C (coil/loop) state of each residue; does not predict the full atomic structure
Accuracy of secondary structure prediction
The best methods have an average accuracy of about 73% (the percentage of residues predicted correctly)
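The accuracy measure quoted above (percentage of residues predicted correctly) can be sketched in a few lines. The function name and the two short state strings below are hypothetical and only serve as an illustration.

```python
def per_residue_accuracy(predicted: str, observed: str) -> float:
    """Percentage of residues whose predicted state equals the observed state."""
    if len(predicted) != len(observed):
        raise ValueError("state strings must have the same length")
    if not observed:
        raise ValueError("state strings must not be empty")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * matches / len(observed)


# Hypothetical 10-residue example (C = coil, H = helix, E = strand).
observed = "CEEEEHHHHC"
predicted = "CEEEHHHHHC"
accuracy = per_residue_accuracy(predicted, observed)
```

The same measure applies unchanged to accessibility states (B, I, E), since it only compares two strings position by position.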
History of 1D protein structure prediction methods
First generation
– How: single residue statistics
– Accuracy: low
Second generation
– How: segment statistics
– Accuracy: ~60%
Third generation
– How: long-range interaction, homology based
– Accuracy: ~70%
Protein surface
Accessible Surface Area
Solvent Probe
Accessible Surface
Van der Waals Surface
Reentrant Surface
The accessible surface is traced out by the center of the probe sphere as it rolls over the protein. It is a kind of expanded van der Waals surface.
Accessibility
Accessible Surface Area (ASA)
\[ \text{Accessibility} = \frac{\text{ASA in folded protein}}{\text{Maximum ASA}} \]
Two state = b (buried), e (exposed)
e.g. b ≤ 16%, e > 16%
Three state = b (buried), i (intermediate), e (exposed)
e.g. b ≤ 16%, 16% < i < 36%, e > 36%
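A minimal sketch of the definition above: the accessibility is the ratio of the ASA in the folded protein to the maximum ASA, and the two-state and three-state labels follow from the example thresholds (16% and 36%). The function names and the ASA values are hypothetical. Assigning a value of exactly 36% to the intermediate state is an assumption, since the example thresholds leave that boundary open.

```python
def relative_accessibility(asa_folded: float, asa_max: float) -> float:
    """Accessibility in percent: ASA in the folded protein over maximum ASA."""
    if asa_max <= 0:
        raise ValueError("maximum ASA must be positive")
    return 100.0 * asa_folded / asa_max


def two_state(accessibility: float, threshold: float = 16.0) -> str:
    """b (buried) if accessibility <= threshold, otherwise e (exposed)."""
    return "b" if accessibility <= threshold else "e"


def three_state(accessibility: float, lower: float = 16.0, upper: float = 36.0) -> str:
    """b (buried), i (intermediate) or e (exposed)."""
    if accessibility <= lower:
        return "b"
    if accessibility <= upper:  # assumption: the upper boundary counts as intermediate
        return "i"
    return "e"


# Hypothetical residue: 30 square angstroms exposed out of a maximum of 120.
acc = relative_accessibility(30.0, 120.0)
labels = (two_state(acc), three_state(acc))
```

Other thresholds, such as the 5%, 9% and 4%/9% settings used later in this talk, can be passed through the `threshold`, `lower` and `upper` parameters.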
Use of Solvent Accessibility
Studies of solvent accessibility in proteins have led to many insights into protein structure, such as:
Protein function
Sequence motifs
Domains
Formulating antigenic determinants & site-directed mutagenesis
Why Predict Solvent Accessibility?
Helpful for:
Predicting the arrangement of secondary structure segments in 3-D structure
Estimating the number of protein-protein & protein- solvent contacts of residues
Threading procedure to find putative remote homologues
Improving prediction of glycosylation sites
Predicting epitopes
Problems of predicting solvent Accessibility
Prediction of solvent accessibility is less accurate than that of secondary structure
Problem of approximation for residue accessibility (projecting surface area onto 2 states leads to a loss of information)
The problem of how to define the threshold
ASA Calculation
DSSP - Database of Secondary Structures for Proteins (swift.embl-heidelberg.de/dssp)
VADAR - Volume Area Dihedral Angle Reporter (http://redpoll.pharmacy.ualberta.ca/vadar/)
GetArea - www.scsb.utmb.edu/getarea/area_form.html
Other ASA sites
Connolly Molecular Surface Home Page http://www.biohedron.com/
Naccess Home Page http://sjh.bi.umist.ac.uk/naccess.html
ASA Parallelization http://cmag.cit.nih.gov/Asa.htm
Protein Structure Database http://www.psc.edu/biomed/pages/research/PSdb/
Methods of Accessibility prediction
No.  Method                                  CC     Accuracy  Year  Scientists
1    Decision tree (DT)                      0.43   71 ~ 72%  1998  Salzberg
2    Bayesian statistics (BS)                0.43   71 ~ 72%  1996  Thompson, Goldstein
3    Multiple linear regression (MLR)        0.43   71 ~ 72%  2001  Li, Pan
4    Support vector machine (SVM)            2~4%   79%       2002  Yuan et al.
5    Neural network                          2~4%   79%       1994  Rost, Sander
6    A method based on information theory                     2001  Sadeghi et al.
PHD Prediction of rCD2
Accessibility Prediction
PredictProtein-PHDacc (58%) http://cubic.bioc.columbia.edu/predictprotein
PredAcc (70%?) http://condor.urbb.jussieu.fr/PredAccCfg.html
QHTAW...
QHTAWCLTSEQHTAAVIW
BBPPBEEEEEPBPBPBPB
THEORY & METHOD
Data sets
A set of 230 nonredundant protein structures with mutual sequence similarity <25% was selected from PDBSELECT to construct the training and testing sets. The structures were determined by X-ray crystallography with 2.5 Å resolution and have no chain breaks.
ASA calculation
Surface area and accessibility for the dataset proteins were calculated by software developed in our group
Accessibility states were defined as two states and three states with different thresholds
Two states B and E (thresholds 5%, 9%, and 16%)
Three states B, I, E (threshold pairs 4%/9%, 9%/16%, and 4%/16%)
Conformation (state) of a residue is affected by:
Short range interactions (between near residues)
Long range interactions (between far residues)
Most efforts have focused on the analysis of near residues (local effects).
Our method is based on:
Residue type (R)
Residue conformation (state of neighbor residues S & S’)
Different neighbor residue types cause a residue to adopt different states.
[Figure: tree of state assignments. Each residue n1, n2, n3, ... can take state E, B or I, giving 3^n branches, where n = length of protein. The branch with maximum information is selected.]
[Figure: single residue prediction. States s1, s2, ... are assigned along residues n1 ... n10.]
[Figure: double residue prediction. Pairs of states S S are assigned along residues n1 ... n10.]
where P(SS’ = XX’) is the probability of occurrence of the event SS’ = XX’, and P(SS’ = XX’ | RiRj) is the conditional probability of SS’ = XX’ given that residues Ri and Rj have occurred.
The complementary event of
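As a rough sketch of how the two probabilities defined above could be estimated from counts and combined: the log ratio used here, log[P(SS’ = XX’ | RiRj) / P(SS’ = XX’)], is one common information-theoretic form and is an assumption, not necessarily the exact expression used in this method. The function name and the toy observations are hypothetical.

```python
import math


def pair_information(observations, residue_pair, state_pair):
    """Estimate log[P(state_pair | residue_pair) / P(state_pair)] from counts.

    observations: list of ((Ri, Rj), (S, S')) tuples taken from a training set.
    """
    total = len(observations)
    # Count of the state pair over all residue pairs: estimates P(SS' = XX').
    with_state = sum(1 for _, states in observations if states == state_pair)
    # States seen for this residue pair: estimates P(SS' = XX' | Ri Rj).
    states_for_pair = [states for residues, states in observations
                       if residues == residue_pair]
    joint = sum(1 for states in states_for_pair if states == state_pair)
    if not states_for_pair or joint == 0 or with_state == 0:
        raise ValueError("not enough observations to estimate the probabilities")
    p_conditional = joint / len(states_for_pair)
    p_prior = with_state / total
    return math.log(p_conditional / p_prior)


# Hypothetical training observations: (residue pair, state pair).
observations = [
    (("A", "L"), ("B", "B")), (("A", "L"), ("B", "B")), (("A", "L"), ("E", "B")),
    (("K", "E"), ("E", "E")), (("K", "E"), ("E", "E")), (("K", "E"), ("B", "B")),
    (("G", "S"), ("E", "E")), (("G", "S"), ("E", "B")),
]
info_al_bb = pair_information(observations, ("A", "L"), ("B", "B"))
info_ke_bb = pair_information(observations, ("K", "E"), ("B", "B"))
```

A positive value means the residue pair favors the state pair more than the background does; a negative value means it disfavors it.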
Complexity & problems of method
Considering pairwise residue types: 20*20 entries
Considering both pair residue types & pair residue states simultaneously:
For two states: 20*20*2 entries
For three states: 20*20*3 entries
Note:
Because of sample limitation we can’t analyze triplets or more.
The problem we encountered when considering pairwise residue types & states simultaneously: each residue in a window of length L is predicted L times.
For example, in a window of length two, each residue is predicted 2 times, and so on.
Result: ambiguity about which state should stand for each residue.
Solution: use of dynamic programming
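The overlap described above can be made concrete by counting how many sliding windows cover each residue. This is only an illustrative sketch; the function name is hypothetical, and it assumes that windows slide one residue at a time along the sequence.

```python
def predictions_per_residue(n_residues: int, window: int) -> list[int]:
    """Number of sliding windows of the given length that cover each residue."""
    if not 1 <= window <= n_residues:
        raise ValueError("window must be between 1 and the sequence length")
    counts = [0] * n_residues
    for start in range(n_residues - window + 1):
        for position in range(start, start + window):
            counts[position] += 1
    return counts


# Ten residues, as in the n1 ... n10 diagrams.
counts_window_2 = predictions_per_residue(10, 2)
counts_window_3 = predictions_per_residue(10, 3)
```

Interior residues are covered by L windows and therefore receive L possibly conflicting predictions; only residues near the two ends receive fewer.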
[Figure: double residue prediction with two-residue windows over residues n1 ... n10, and double residue prediction for long windows over residues n1 ... n9.]
The information content I of a sequence of length L, with amino acid types Ri and Ri+m and accessibility states S and S’ (E, I, B) in a window of size L, is calculated as follows:
Dynamic programming algorithm
Build an optimal solution from optimal solutions to sub problems
Decompose a large problem into a number of small problems. Solve the small problems and use these to solve the large problem.
Three basic components
The development of a dynamic programming algorithm has three basic components:
– The recurrence relation (for defining the value of an optimal solution);
– The tabular computation (for computing the value of an optimal solution);
– The trace back (for delivering an optimal solution).
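The three components above can be sketched for the simplest case, a window of two residues with three states per residue. This is a minimal illustration under stated assumptions, not the authors' implementation: `pair_scores`, `best_state_path`, `exhaustive_best_score` and `demo_scores` are hypothetical names, and the scores are made-up numbers standing in for the information values of adjacent residue pairs. The exhaustive function walks all 3^n branches and is included only to check the result on a short sequence.

```python
from itertools import product

STATES = "EBI"  # exposed, buried, intermediate


def best_state_path(pair_scores):
    """pair_scores[i][(s, t)]: score of residues i and i+1 being in states s and t."""
    # Tabular computation: best[i][t] is the best total score of any assignment
    # of residues 0..i that ends with residue i in state t.
    best = [{s: 0.0 for s in STATES}]
    back = []
    for scores in pair_scores:
        row, pointers = {}, {}
        for t in STATES:
            # Recurrence relation: extend the best path ending in some state s.
            prev = max(STATES, key=lambda s: best[-1][s] + scores[(s, t)])
            row[t] = best[-1][prev] + scores[(prev, t)]
            pointers[t] = prev
        best.append(row)
        back.append(pointers)
    # Trace back: start from the best final state and follow the pointers.
    state = max(STATES, key=lambda s: best[-1][s])
    total = best[-1][state]
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path)), total


def exhaustive_best_score(pair_scores):
    """Best score over all 3^n state assignments (feasible only for small n)."""
    n = len(pair_scores) + 1
    return max(
        sum(pair_scores[i][(p[i], p[i + 1])] for i in range(n - 1))
        for p in product(STATES, repeat=n)
    )


def demo_scores(n_pairs):
    """Made-up, deterministic pair scores used only for this illustration."""
    return [
        {(s, t): ((7 * i + 3 * STATES.index(s) + 5 * STATES.index(t)) % 11) - 5.0
         for s, t in product(STATES, repeat=2)}
        for i in range(n_pairs)
    ]


scores = demo_scores(3)          # four residues, three adjacent pairs
path, total = best_state_path(scores)
```

The table has 3 entries per residue and each entry looks at 3 predecessors, so the work grows linearly with sequence length instead of exponentially, while each residue still ends up with exactly one state.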
Dynamic programming algorithm
[Figure: three-state accessibility for a window of two residues. Adjacent residue pairs (n1n2, n2n3, n3n4, ...) are each scored over the nine pair states EE, EB, EI, BB, BI, BE, II, IB, IE.]
Results & discussion
Two states accuracy (%)
Window length   Threshold 5%   Threshold 9%   Threshold 16%
2               66.77          68.2           65.2
3               68.51          69.37          66.37
4               69.34          70.22          66.42
5               70.2           71.29          67.29
6               70.96          71.34          67.34
7               71.93          72.1           68.3
[Figure: two states accuracy (%) versus window length (2 to 7) for thresholds 5%, 9% and 16%.]
Three states accuracy (%)
Window length   Thresholds 4, 9%   Thresholds 9, 16%   Thresholds 4, 16%
2               63.81              64.79               62.79
3               64.21              65.54               63.54
4               64.56              66.74               63.74
5               65.3               67.36               64.26
6               65.8               68.15               64.85
7               66.18              69.3                65.1
[Figure: three states accuracy (%) versus window length (2 to 7) for thresholds 4, 9%; 9, 16%; and 4, 16%.]
Suggestions
• Taking longer windows is expected to increase prediction accuracy further: accuracy rose with every increase in window length from 2 to 7, for all thresholds tested
• Analysis and scoring of amino acid pairs by other statistical methods such as Markov chains
• Using larger data sets and analysis of amino acid triplets (8000 × 27 states)
Thank You