DIMACS 2005-06-13 ILYA SHINDYALOV, UCSD
Functional Annotation of Proteins with Known Structure
by Structure and Sequence Similarity, DNA-protein Interaction Patterns
and GO Framework
Ilya Shindyalov, UCSD/SDSC
PhD, Group Leader, Protein Science Research
DIMACS 2005-06-13
Essential Dataflow in Protein Science
Data: Protein Sequence, Structure, Function
Methods:
Sequence similarity:
(i) BLAST,
(ii) fold recognition,
(iii) homology modeling
…
Results:
Structure similarity:
(i) DALI,
(ii) VAST,
(iii) CE
…
COVERAGE RATIO FOR FUNCTIONAL ANNOTATION
                  Disease   Biological Process   Cell Component   Molecular Function
PDB STRUCTURES    0.758     0.396                0.371            0.335
SG TARGETS        0.355     0.315                0.452            0.259
PDB+SG            0.822     0.528                0.593            0.477
HOMOLOGY MODELS   0.984     0.792                0.839            0.821
Do we know the function, if we know the structure?
The Subjects of my Talk
3 Approaches of Using Structure Similarity to Infer Protein Function:
#1: Assigning function from known to unknown – CASE STUDY – Prediction of calcium binding in Acetylcholinesterase – Projection on SNP responsible for Autism.
#2: Classification of DNA-binding protein domains involving (in addition to structure similarity) DNA-protein interaction patterns and sequence similarity.
#3: Extending GO annotation using structure similarity – how reliable can it be?
#4 [BONUS]: Why is ontology so important for humans?
#1: Assigning function from known to unknown
CE: Protein structure comparison by Combinatorial Extension of the optimal path (Shindyalov and Bourne, 1998).
http://cl.sdsc.edu
CE Step 1. Heuristic search for initial path.
[Figure: distance between two fragments (Protein A, Protein B); AFP = Aligned Fragment Pair; AFP1 and AFP2 on Protein A vs. Protein B axes; alignment path.]
CE Step 2. Iterative dynamic programming on the starting superposition from Step 1.
CE vs. other Algorithms ???
Novotny, M., Madsen, D., and Kleywegt, G.J. 2004. Evaluation of protein fold comparison servers. Proteins 54: 260-270.
2ACE vs. 1TN4: RMSD = 4.6 Å, Z-score = 4.6, LALI = 86, LGAP = 8, Seq. Identity = 3.5%
Acetylcholinesterase vs. Troponin C
#2: Classification of DNA-binding protein domains
Data and algorithms used:
• PDB - Protein Data Bank of February 13, 2002, with 17,304 entries, was used as the source of original structural data. Selection criteria for DNA-protein complexes:
- The DNA fragment is at least 5 bp long.
- At least 5 different protein residues are involved in the interaction with DNA.
- The contact distance cutoff between interacting atoms is < 5 Å.
- The different types of DNA (A, B, Z) were not taken into account because this annotation is insufficient in the PDB.
• PDP – Protein Domain Parser (Alexandrov, Shindyalov, Bioinformatics, submitted)
• CE – Protein structure alignment by Combinatorial Extension (Shindyalov, Bourne, 1998)
• SCOP - Structural Classification of Proteins (Murzin et al., 1995)
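The contact-based selection criteria above can be sketched roughly as follows. This is a minimal illustration, not the authors' code: it assumes atoms have already been parsed into `(residue_id, (x, y, z))` tuples, that the DNA fragment length is supplied by the caller, and the function names `contacting_residues` and `is_dna_binding` are hypothetical.

```python
import math

def contacting_residues(protein_atoms, dna_atoms, cutoff=5.0):
    """Return ids of protein residues with any atom closer than `cutoff` (Å) to a DNA atom.

    protein_atoms: iterable of (residue_id, (x, y, z)) tuples (assumed input format)
    dna_atoms: iterable of (x, y, z) tuples
    """
    dna = list(dna_atoms)
    hits = set()
    for res_id, xyz in protein_atoms:
        if res_id in hits:
            continue  # residue already known to be in contact
        if any(math.dist(xyz, d) < cutoff for d in dna):
            hits.add(res_id)
    return hits

def is_dna_binding(protein_atoms, dna_atoms, dna_length_bp,
                   min_bp=5, min_residues=5, cutoff=5.0):
    """Apply the three listed criteria: DNA length, contact cutoff, residue count."""
    if dna_length_bp < min_bp:
        return False
    n_contacts = len(contacting_residues(protein_atoms, dna_atoms, cutoff))
    return n_contacts >= min_residues
```

A real pipeline would read the coordinates from PDB entries; the brute-force distance loop is kept only for readability.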
Building representative set of domains:
PDB
Selection of DNA-binding protein chains by analyzing DNA-protein contacts
Parsing of DNA-binding protein chains into domains using PDP
Selection of DNA-binding protein domains by analyzing DNA-protein contacts
All-against-all structural alignment of DNA-binding protein domains using CE
Selection of representative (non-redundant) set of DNA-binding protein domains
Calculating classification of DNA-binding protein domains
Parameters measuring structural similarity:
• Rmsd, root-mean-square deviation between two aligned and compared protein domains, > 2.0 Å;
• Z-score, statistical score obtained from CE, < 4.5;
• Rnar, ratio of the number of aligned residues to the smallest domain length, < 90%;
Note: sequence identity in the alignment < 90%.
  * *** ** ** ****
A YKLAAVGTE--FCCILLNIVKLPDGT
  | | || | ||
B ASQL--AVREERAFA---GGKAPDQQD
  ** * * ** ****
(1) Parameters measuring structural similarity: Rmsd, Z-score, Rnar;
(2) Parameter measuring the match between DNA-protein contact patterns, Rmat;
A and B - DNA-protein domain complexes;
Rmat = min{Rmat_A, Rmat_B}
Rmat_X - ratio of the number of matched residues to the total number of residues involved in contacts with DNA in the DNA-protein complex X.
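The Rmat definition can be illustrated with a short sketch. It assumes that a "matched" residue pair is an aligned pair in which both residues contact DNA; that reading, the input format, and the function name `rmat` are assumptions made for illustration.

```python
def rmat(alignment, contacts_a, contacts_b):
    """Contact-pattern match between two aligned DNA-protein domain complexes.

    alignment: list of (i, j) pairs, residue i of domain A aligned to residue j of B
    contacts_a, contacts_b: sets of residue indices in contact with DNA
    """
    if not contacts_a or not contacts_b:
        return 0.0  # no DNA contacts on one side, nothing can match
    # Assumed meaning of "matched": both aligned residues contact DNA.
    matched = sum(1 for i, j in alignment if i in contacts_a and j in contacts_b)
    rmat_a = matched / len(contacts_a)
    rmat_b = matched / len(contacts_b)
    return min(rmat_a, rmat_b)
```

Taking the minimum means a high Rmat requires the contact pattern to be shared from the point of view of both complexes.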
Realignment using a scoring function that takes into account the structural similarity between two protein domains and the protein-DNA contact pattern:
$S_{ij} = S_{ij}^{dist} + S_{ij}^{cont}$
Structure similarity term: $S_{ij}^{dist}$, a piecewise ("if ..., otherwise ...") function of the distance $d_{ij}$ with constants $C_1$ and $C_2$.
Protein-DNA contact pattern term:
$S_{ij}^{cont} = C_3 K_i^A K_j^B$
where
$K_m^X = \begin{cases} 1, & \text{if protein residue } m \text{ is involved in contact with DNA} \\ 0, & \text{otherwise} \end{cases}$
m denotes a protein residue, X a protein-DNA complex; $C_3$ is a scaling constant.
• If Rmsd > 5.0 Å or Rnar < 70% or Z-score < 3.5, then domains are not considered similar;
• If Rmsd ≤ 3.0 Å and Rnar ≥ 80%, then domains are considered similar;
• If Rmat ≥ Rmat_threshold and either 3.0 Å < Rmsd ≤ 5.0 Å and Rnar ≥ 70%, or 70% ≤ Rnar < 80% and Rmsd ≤ 5.0 Å, then domains are considered similar.
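The three rules above amount to a small decision function. The sketch below is one possible reading: the non-strict comparisons (≤, ≥) in the second and third rules are assumed as the complements of the strict ones in the first rule, the direction "Rmat at or above its threshold" is also an assumption, and the name `domains_similar` is hypothetical.

```python
def domains_similar(rmsd, rnar, z_score, rmat, rmat_threshold):
    """Classify a pair of domains as similar or not.

    rmsd in Å; rnar, rmat and rmat_threshold in percent.
    """
    # Rule 1: clearly dissimilar pairs.
    if rmsd > 5.0 or rnar < 70.0 or z_score < 3.5:
        return False
    # Rule 2: clearly similar pairs.
    if rmsd <= 3.0 and rnar >= 80.0:
        return True
    # Rule 3: intermediate zone, decided by the DNA-contact pattern match.
    in_zone = (3.0 < rmsd <= 5.0 and rnar >= 70.0) or (70.0 <= rnar < 80.0 and rmsd <= 5.0)
    return in_zone and rmat >= rmat_threshold
```

After the first two rules every remaining pair falls in the intermediate zone, so `in_zone` is spelled out only to mirror the wording of the third rule.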
Comparison of the classification for all 338 DNA-binding domain representatives with SCOP at various threshold parameters
Final classification of DNA-binding protein domains (fragment):
[Figure: decision diagram in the Rmsd (ticks 3, 5) and Rnar (ticks 70, 80) plane, with regions labeled “Similar”, “Not similar”, and “Similar if Rmat<80”.]
http://spdc.sdsc.edu
SPDC – Structural Protein Domain Classification
#3: Extending GO annotation using structure similarity
Why do we need the ontology?
• Quantitative data explosion (e.g. exponential growth of sequence data, doubling every 7 months).
• Qualitative data explosion (new experimental methods and new kinds of data appear, e.g. microarrays, interfering RNA).
• Lack of adequate means for information storage and exchange between:
- scientists, - computers, - scientists and computers
(what’s published in scientific journals is de facto not reaching the community).
GO can serve as a language which can be easily read by both humans and computers. By using GO we ultimately learn to talk in one universal language.
The goal of this work is to further realize the potential of GO.
What is GO?
• Controlled dictionaries for:
- Molecular Function
- Biological Process
- Cellular Component
• Acyclic graph
• “is-a”, “part-of” (“has-a”) relationships
[Figure: example graph: BMW “is-a” CAR; Wheel “part-of” CAR (CAR “has-a” Wheel).]
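The graph structure described above can be sketched as a small data structure. This toy example reuses the slide's CAR/BMW/Wheel terms; the extra "Vehicle" node, the edge layout, and the `ancestors` helper are hypothetical and are not part of GO or of any GO library.

```python
from collections import deque

# child term -> list of (relationship, parent term)
# "Vehicle" is a hypothetical extra node added to show a two-step path.
EDGES = {
    "BMW": [("is-a", "CAR")],
    "Wheel": [("part-of", "CAR")],
    "CAR": [("is-a", "Vehicle")],
}

def ancestors(term, edges=EDGES):
    """Collect every term reachable from `term` via is-a / part-of links."""
    seen = set()
    queue = deque([term])
    while queue:
        node = queue.popleft()
        for _relationship, parent in edges.get(node, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen
```

Because the graph is acyclic, the traversal always terminates, and a chain annotated with a specific term is implicitly annotated with all of that term's ancestors.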
The GO Annotation (GOA) resource provides annotation of gene products with GO terms.
The IEA evidence code (Inferred from Electronic Annotation) means no human involvement in the assignment.
Extending GO annotation of PDB chains using structural and sequence similarity
34,698 protein chains were taken from the PDB of February 2003, excluding theoretical models, short chains (fewer than 30 Cα atoms), and chains that do not form domains (no domains detected by the PDP algorithm).
EBI has assigned GO annotation to 25,835 of these 34,698 PDB protein chains.
Rseq: sequence identity calculated for the structurally aligned residues.
Rnar: ratio of the number of aligned residues to the length of the shortest polypeptide; it measures the overlap between aligned polypeptides.
Z-score: statistically founded score; it characterizes the significance of the alignment.
Rmsd: root-mean-square deviation between two structurally aligned polypeptides; it characterizes distances between the Cα, Cβ and main-chain O atoms of aligned residues.
For two polypeptides A and B with all calculated parameter values (Rmsd, Z-score, Rnar, Rseq) and given threshold values (Rmsd_threshold, Z-score_threshold, Rnar_threshold, Rseq_threshold) we define:
SSC_AB = (Rmsd < Rmsd_threshold) ∧ (Z-score > Z-score_threshold) ∧ (Rnar > Rnar_threshold) ∧ (Rseq > Rseq_threshold)
∧ denotes logical AND. SSC_AB can only take two values: true or false. If SSC_AB is true, then A and B are similar. If SSC_AB is false, then A and B are not similar.
The chains were clustered such that the SSC_AB condition holds true for every two chains in each cluster.
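The similarity condition and the "every two chains in a cluster are similar" requirement can be sketched as below. The talk does not say which clustering procedure was used, so the greedy first-fit scheme here is an assumption, as are the names `Thresholds`, `ssc`, and `cluster_chains`.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Thresholds:
    rmsd: float     # Å
    z_score: float
    rnar: float     # percent
    rseq: float     # percent

def ssc(rmsd, z_score, rnar, rseq, t):
    """SSC_AB: logical AND of the four threshold comparisons."""
    return (rmsd < t.rmsd and z_score > t.z_score
            and rnar > t.rnar and rseq > t.rseq)

def cluster_chains(chains, similar):
    """Greedy first-fit clustering (assumed procedure).

    A chain joins the first cluster in which it is similar to ALL current
    members, so every pair inside a cluster satisfies `similar`.
    """
    clusters = []
    for chain in chains:
        for members in clusters:
            if all(similar(chain, m) for m in members):
                members.append(chain)
                break
        else:
            clusters.append([chain])  # no compatible cluster: start a new one
    return clusters
```

In practice `similar(a, b)` would look up the precomputed CE alignment parameters for the pair and call `ssc`; the result of a greedy pass depends on the input order.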
Specificity Criteria:
For the clusters where GO terms were available for at least two chains we define:
“positive cluster” - where all chains have the same GO terms;
“negative cluster” - where chains have different GO terms (more specific definitions for the three criteria are given below);
TP (true positives) - the number of chains with GO terms in the positive clusters;
FP (false positives) - the number of chains with GO terms in the negative clusters;
ppv (positive predictive value), or specificity, is the ratio TP/(TP+FP).
Specificity Criteria (cont.):
{t_i1,..t_ik(i)} is the set of k(i) GO terms for the i-th chain.
Each specificity is defined for clusters with at least two annotated chains.
Specificity-1 (the most rigorous) - a “positive” cluster must have the same set of GO terms for every pair of chains (i, j): t_in = t_jn, n=1,…k(i), k(i)=k(j), for every (i, j), i ∈ {1,…N}, j ∈ {1,…N}.
Specificity-2 (less rigorous than specificity-1) - a “positive” cluster must satisfy the following for every pair of chains (i, j) with a different number of GO terms: all terms of the chain with the smaller number of terms must be present among the terms of the chain with the larger number of GO terms: {t_i1,..t_ik(i)} ⊆ {t_j1,..t_jk(j)}, if k(i) ≤ k(j); i ∈ {1,…N}, j ∈ {1,…N}.
Specificity-3 (less rigorous than specificity-2) - a “positive” cluster must have a common set of terms {t_1,..t_L} for all N chains within the cluster: {t_1,..t_L} ⊆ {t_i1,..t_ik(i)}, i=1,…N; {t_1,..t_L} ≠ ∅.
Further detailing of specificity (Specificity-4) should involve the semantic distance (e.g. Lord et al., 2003) between terms in judging a cluster to be “positive”.
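The three criteria and the ppv ratio can be expressed with set operations. The sketch below is an illustration under stated assumptions: each chain's annotation is a Python set of GO term ids, unannotated chains are empty sets, FP counts annotated chains in "negative" clusters, and all function names are hypothetical.

```python
from itertools import combinations

def positive_spec1(term_sets):
    """Specificity-1: every annotated chain carries exactly the same GO terms."""
    return all(a == b for a, b in combinations(term_sets, 2))

def positive_spec2(term_sets):
    """Specificity-2: in every pair, the smaller term set is contained in the larger."""
    return all(a <= b or b <= a for a, b in combinations(term_sets, 2))

def positive_spec3(term_sets):
    """Specificity-3: all annotated chains share at least one common GO term."""
    return bool(set(term_sets[0]).intersection(*term_sets[1:]))

def ppv(clusters, is_positive):
    """TP / (TP + FP) over clusters with at least two annotated chains."""
    tp = fp = 0
    for cluster in clusters:
        annotated = [set(terms) for terms in cluster if terms]
        if len(annotated) < 2:
            continue  # criteria are undefined for this cluster
        if is_positive(annotated):
            tp += len(annotated)
        else:
            fp += len(annotated)
    return tp / (tp + fp) if (tp + fp) else float("nan")
```

For example, `ppv(clusters, positive_spec3)` counts a cluster as positive as soon as its annotated chains share any term, which is why specificity-3 is the most permissive of the three.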
Clusterization of PDB chains and the accuracy of GO annotation at different threshold values of structural similarity parameters.
Column key:
Threshold values: Rseq; Rnar; Rmsd (Å); Z-score
Clusters and chains: CS = clusters and singletons; CL = clusters; CL2 = clusters with at least two chains with GO; CH2 = chains in clusters with at least two chains with GO
Performance of GO annotation: FP1, FP2, FP3 = FP chains (specificity-1, -2, -3); Sp1, Sp2, Sp3 = Specificity-1, -2, -3, %; Cov = coverage, %; NAC = newly annotated chains; NA = new chain-GO term associations; AA = new added chain-GO term associations; CA = chains with added GO terms

Rseq  Rnar  Rmsd  Z-score     CS    CL   CL2    CH2    FP1   Sp1    FP2   Sp2   FP3   Sp3   Cov   NAC      NA    AA    CA
  0%   90%   2.0      4.5   9940  2919  2435  20799   8255  60.3   4463  78.5   158  99.2  60.9  5397  170372  3531  1953
 25%   90%   2.0      4.5   9959  2923  2440  20797   8091  61.1   4410  78.8   155  99.3  60.6  5367  169864  3386  1893
 35%   90%   2.0      4.5  10069  2995  2490  20768   7534  63.7   3972  80.9   113  99.5  59.9  5310  167254  3281  1841
 50%   90%   2.0      4.5  10368  3180  2643  20719   5618  72.9   2686  87.0    64  99.7  57.4  5089  160523  2937  1376
 70%   90%   2.0      4.5  10867  3515  2886  20606   3759  81.8   1137  94.5    42  99.8  52.3  4639  153801  2062  1015
 90%   90%   2.0      4.5  11478  3834  3033  20493   1536  92.5    517  97.5    29  99.8  45.2  4002  147162   861   359
  0%   70%   5.0      3.8   3401  1962  1687  24805  17730  28.5  15163  38.9  5604  77.4  83.8  7426  266757  2858  1318
 25%   70%   5.0      3.8   4261  2533  2142  24610  13683  44.4   9608  61.0   734  97.0  78.5  6956  229366  3961  2153
 35%   70%   5.0      3.8   4778  2885  2431  24507  11606  52.6   7299  70.2   328  98.7  75.9  6728  215972  3962  2285
 50%   70%   5.0      3.8   5455  3330  2787  24357   8200  66.3   4063  83.3    85  99.7  71.7  6351  197765  4164  1960
 70%   70%   5.0      3.8   6199  3819  3152  24196   5042  79.2   1567  93.5    58  99.8  64.9  5749  187239  3187  1440
 90%   70%   5.0      3.8   7031  4269  3359  23984   2235  90.7    734  97.0    29  99.9  56.5  5007  178146  1566   588
Assignment of GO annotation with structural similarity thresholds Rmsd 5.0 Å, Z-score 3.8, Rnar 70%, Rseq 90%.
PDB chains (34,698):
- 25,247 annotated by EBI
- 588 chains annotated by EBI with added new GO terms (this work)
- 5,007 newly annotated chains (this work)
- 3,856 not annotated anywhere
"GO term - chain" associations (335,322):
- 154,986 annotated by EBI
- 1,661 new associations added for chains annotated by EBI (this work)
- 178,675 new associations for newly annotated chains (this work)
New added "GO term - chain" associations for previously annotated chains (1,661): Process 52%, Cellular component 6%, Function 42%
New "GO term - chain" associations (178,675): Process 33%, Cellular component 14%, Function 53%
Red dot denotes newly annotated chains, red arrow denotes new “GO term – chain” associations assigned for newly annotated chains. Purple line denotes new “GO term – chain” associations assigned for chains previously annotated (by EBI). Black arrow denotes existing “GO term – chain” associations assigned by EBI.
An example of a cluster that is “negative” by the definition of specificity-1 and “positive” by the definitions of specificity-2 and specificity-3. Seven GO terms could be assigned to chains 1h9dA and 1h9dC (thresholds: Rmsd 5.0 Å, Z-score 3.8, Rnar 70%, Rseq 90%).
All chains in the cluster are the same protein, Runt-related transcription factor 1 (synonyms: core-binding factor alpha subunit, acute myeloid leukemia 1 protein, etc.).
1e50A (7) 3677, 3700, 5524, 5634, 6355, 7275, 8151
1e50C (7) 3677, 3700, 5524, 5634, 6355, 7275, 8151
1e50E (7) 3677, 3700, 5524, 5634, 6355, 7275, 8151
1e50G (7) 3677, 3700, 5524, 5634, 6355, 7275, 8151
1e50Q (7) 3677, 3700, 5524, 5634, 6355, 7275, 8151
1e50R (7) 3677, 3700, 5524, 5634, 6355, 7275, 8151
1cmoA (7) 3677, 3700, 5524, 5634, 6355, 7275, 8151
1co1A (7) 3677, 3700, 5524, 5634, 6355, 7275, 8151
1ljmA (7) 3677, 3700, 5524, 5634, 6355, 7275, 8151
1ljmB (7) 3677, 3700, 5524, 5634, 6355, 7275, 8151
1hjbC (4) 3677, 5524, 5634, 6355
1hjbF (4) 3677, 5524, 5634, 6355
1hjcA (4) 3677, 5524, 5634, 6355
1hjcD (4) 3677, 5524, 5634, 6355
1io4C (4) 3677, 5524, 5634, 6355
1eanA (4) 3677, 5524, 5634, 6355
1eaoA (4) 3677, 5524, 5634, 6355
1eaoB (4) 3677, 5524, 5634, 6355
1eaqA (4) 3677, 5524, 5634, 6355
1eaqB (4) 3677, 5524, 5634, 6355
1h9dA no GO terms
1h9dC no GO terms
GO terms:
3677 (F) - DNA binding
3700 (F) - transcription factor activity
5524 (F) - ATP binding
5634 (C) - nucleus
6355 (P) - regulation of transcription, DNA-dependent
7275 (P) - development
8151 (P) - cell growth and/or maintenance
An example of a “positive” cluster by the definition of specificity-1: Phospholipase A2.
1cl5A (5) 4623, 5509, 15070, 16042, 16787
1cl5B (5) 4623, 5509, 15070, 16042, 16787
1fb2A (5) 4623, 5509, 15070, 16042, 16787
1fb2B (5) 4623, 5509, 15070, 16042, 16787
1fv0A (5) 4623, 5509, 15070, 16042, 16787
1fv0B (5) 4623, 5509, 15070, 16042, 16787
1jq8A (5) 4623, 5509, 15070, 16042, 16787
1jq8B (5) 4623, 5509, 15070, 16042, 16787
1jq9A (5) 4623, 5509, 15070, 16042, 16787
1jq9B (5) 4623, 5509, 15070, 16042, 16787
1kpmB no GO terms
GO terms:
4623 (F) - phospholipase A2 activity
5509 (F) - calcium ion binding
15070 (F) - toxin activity
16042 (P) - lipid catabolism
16787 (F) - hydrolase activity
Only four “negative” clusters occurred by the definition of specificity-3.
An example of missed GO terms for 2mtaC and other chains of Cytochrome c-L (cytochrome c551i):
2mtaC (3) 5489, 6118, 15945
1mg2D (2) 16021, 16032
1mg2H (2) 16021, 16032
1mg2L (2) 16021, 16032
1mg2P (2) 16021, 16032
1mg3D (2) 16021, 16032
1mg3H (2) 16021, 16032
1mg3L (2) 16021, 16032
1mg3P (2) 16021, 16032
GO terms:
5489 (F) - electron transporter activity
6118 (P) - electron transport
15945 (P) - methanol metabolism
16021 (C) - integral to membrane
16032 (P) - viral life cycle
#4 [BONUS]: Why is ontology so important for humans?
Evolution of complex systems:
Computers: complexity doubles every 18 months per $$$ (Moore’s Law)
Human Brain: very slow (complexity doubles in ~100,000 years)

System                   Short Term Storage   Long Term Storage   Speed       Cost
PC cluster (256 units)   65 GB                5 TB                256 GFLOP   $130K
Human Brain (Average)    57 TB                1137 TB             4.4 TFLOP   $130K

Complexity = Speed x Memory
Computer = 5 TB x 256 GFLOP = 10^24 memory FLOPs
Brain = 1137 TB x 4.4 TFLOP = 5x10^27 memory FLOPs
Brain/Computer = 5x10^3 or 3.7 log units
Moore’s Law: 3.5 years/log unit
Based on (Ramsey, 1997)
Human brain capacity for computers will be reached: 2000 + 3.7 x 3.5 = 2013
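The extrapolation above is plain arithmetic and can be rechecked in a few lines. The sketch uses the slide's own figures and its rounding of the computer's complexity to 10^24; the unit constants and variable names are illustrative assumptions.

```python
import math

TB = 1e12      # bytes per terabyte (decimal units assumed)
GFLOP = 1e9
TFLOP = 1e12

# Complexity = Speed x Memory (long-term storage is used, as on the slide).
computer = 5 * TB * 256 * GFLOP     # 1.28e24, rounded on the slide to 10^24
brain = 1137 * TB * 4.4 * TFLOP     # about 5.0e27

# The slide's rounded values: 5x10^27 / 10^24 = 5x10^3.
ratio_slide = 5e27 / 1e24
log_units = round(math.log10(ratio_slide), 1)   # 3.7 log units

# Without rounding the computer's complexity the gap is slightly smaller.
exact_log_units = math.log10(brain / computer)  # about 3.6

years_per_log_unit = 3.5
year_reached = 2000 + log_units * years_per_log_unit
print(round(year_reached))  # 2013
```

The estimate moves by about 3.5 years for every factor of ten in the complexity ratio, so it is sensitive to how the inputs are rounded.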
The accuracy of predicting the future for the next 2 years equals 10%
Credits:
Julia Ponomarenko (she did #2 and #3)
Phil Bourne (discussions, conceptualizations, logistics)
Lei Xie (PDB statistics)
NIH Grant GM63208
NSF Grants DBI 9808706, DBI 0111710
Gift from Ceres Inc.