Structural Genomics: Case studies in assigning function from structure
James D Watson
Structural Genomics Collaborators
MCSG – Midwest Center for Structural Genomics
SPINE – Structural Proteomics in Europe
SGC – Structural Genomics Consortium
Structural Genomics Aims
Pathogens and disease
Human Proteins
Coverage of Fold Space
Automation / High Throughput
~1.3m non-redundant protein sequences
[Figure: example sequence fragments, e.g. MRTKSPGDSKFHEITKTPPKNQVSNS…, MIVISGENVDIAELTDFLCAA…, PPRIPYSMVGPCCVFLMHH…]
Proteins: known sequences and 3D structures
5,500 non-redundant structures
~260,000 homology models
~10% unknown
Proteins: known sequences and 3D structures
5,500 non-redundant structures
Homology models
3D structures of ~16,000 carefully selected proteins
Protein Function
Protein function has many definitions:
• Biochemical Function - The biochemical role of the protein e.g. serine protease
• Biological Function - The role of the protein in the cell/organism e.g. digestion, blood clotting, fertilisation
Function through homology
Surface comparison
Sequence similarity
Motif searches
Active Site Templates
Structural Similarity
HTH motifs
Template Methodology
Use 3D templates to describe the active site of the enzyme - analogous to 1-D sequence motifs such as PROSITE, but in 3-D
(Wallace et al 1997)
• defines a functional site
• search a new structure for a functional site
• search a database of structures for similar clusters
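As a rough illustration of searching a structure for a functional site, the Python sketch below scans a query for residue triples that resemble a 3-residue template. It is a simplification and not the published method: rather than superposing atoms, it compares the pairwise distances within the template to those within each candidate triple (a distance rmsd). The coordinates, the one-point-per-residue representation, the cutoff and the names `distance_rmsd` and `search_template` are all assumptions made for the example.

```python
from itertools import permutations
from math import dist, sqrt

def distance_rmsd(a, b):
    """Root-mean-square difference between corresponding pairwise distances.

    Note: unlike a superposition rmsd, this cannot tell mirror images apart.
    """
    pairs = [(i, j) for i in range(len(a)) for j in range(i + 1, len(a))]
    total = sum((dist(a[i], a[j]) - dist(b[i], b[j])) ** 2 for i, j in pairs)
    return sqrt(total / len(pairs))

def search_template(template, residues, cutoff):
    """Return sorted (rmsd, residue ids) for residue tuples matching the template types."""
    types = [t for t, _ in template]
    coords = [c for _, c in template]
    hits = []
    for combo in permutations(residues, len(template)):
        if [r["type"] for r in combo] != types:
            continue  # residue types must match the template in order
        score = distance_rmsd(coords, [r["xyz"] for r in combo])
        if score <= cutoff:
            hits.append((score, tuple(r["id"] for r in combo)))
    return sorted(hits)

# Hypothetical template: one representative point per residue (made-up coordinates).
template = [("SER", (0.0, 0.0, 0.0)), ("ARG", (4.0, 0.0, 0.0)), ("GLU", (0.0, 3.0, 0.0))]

# Hypothetical query residues: the first three form a shifted copy of the template.
query_residues = [
    {"id": 10, "type": "SER", "xyz": (1.0, 1.0, 1.0)},
    {"id": 55, "type": "ARG", "xyz": (5.0, 1.0, 1.0)},
    {"id": 90, "type": "GLU", "xyz": (1.0, 4.0, 1.0)},
    {"id": 120, "type": "GLU", "xyz": (20.0, 20.0, 20.0)},
]

hits = search_template(template, query_residues, cutoff=1.0)
for score, ids in hits:
    print(f"residues {ids}: distance rmsd = {score:.2f}")
```

Running the same search over every structure in a database would give the third use listed above. The brute-force loop over residue tuples is only practical for small examples.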
SiteSeer’s “reverse” templates
[Figure: query structure divided into numbered 3-residue templates (1, 2, 3, 4, 5, 6, 7, 8, 9, …)]
Problems with template methods
• Too many hits (hundreds, thousands or even tens of thousands)
• Use of rmsd rarely discriminates true from false positives
• Local distortion in structure may give a large rmsd
• Top hit rarely the correct hit – even in “obvious” cases
An example
PDB code: 1hsk
UDP-N-acetylenolpyruvoylglucosamine reductase (MURB)
E.C.1.1.1.158
Contains the 3D template that characterises this enzyme class
Sequence identity to template’s representative structure (1mbb) is 28%
[Figure: template residues Ser, Arg, Glu]
Enzyme active site templates: hits for 1hsk

Hit    E.C. number      Rmsd     Enzyme
1.     E.C.1.3.99.2     0.76Å    Acyl-CoA dehydrogenase
2.     E.C.4.2.1.20     0.76Å    Tryptophan synthase α-subunit
3.     E.C.3.2.1.73     1.19Å    Glycosyl hydrolases, family 17
4.     E.C.3.2.1.73     1.21Å    Glycosyl hydrolases, family 16
5.     E.C.4.1.2.13     1.25Å    Fructose-bisphosphate aldolase (class I)
…      …                …        …
102.   E.C.1.1.1.158    2.19Å    UDP-N-acetylmuramate dehydrogenase
…      …                …        …
386.   …                3.94Å    …
Comparison of template environments
[Figure series: template structure (1mbb) compared with query structure (1hsk). Panels: match to template (Ser, Arg, Glu; rmsd=2.19Å); identical residues in neighbourhood; similar residues in neighbourhood]
Results for 1hsk

Hit   E.C. number      Rmsd    Score    Enzyme
1.    E.C.1.1.1.158    2.08    209.1    UDP-N-acetylmuramate dehydrogenase
2.    E.C.3.2.1.14     2.13    146.0    Chitinase A, chitodextrinase, 1,4-beta-poly-N-acetylglucosaminidase, poly-beta-glucosaminidase
3.    E.C.3.2.1.17     1.92    142.4    Turkey lysozyme
4.    E.C.3.2.1.17     1.89    138.7    Hen lysozyme
5.    E.C.3.5.1.26     1.47    132.3    Aspartylglucosylaminidase
6.    E.C.3.2.1.3      1.54    131.1    Glucan 1,4-alpha-glucosidase
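The effect of the scoring scheme can be checked directly on the six hits listed above. The short Python sketch below ranks them once by rmsd and once by score and reports where the correct enzyme class (E.C.1.1.1.158) lands. The numbers are copied from the results table; the comparison covers only these six listed hits, not the full hit list, and how the score itself is computed is not reproduced here.

```python
# (E.C. number, rmsd in Å, score, enzyme) copied from the results table for 1hsk.
hits = [
    ("1.1.1.158", 2.08, 209.1, "UDP-N-acetylmuramate dehydrogenase"),
    ("3.2.1.14", 2.13, 146.0, "Chitinase A"),
    ("3.2.1.17", 1.92, 142.4, "Turkey lysozyme"),
    ("3.2.1.17", 1.89, 138.7, "Hen lysozyme"),
    ("3.5.1.26", 1.47, 132.3, "Aspartylglucosylaminidase"),
    ("3.2.1.3", 1.54, 131.1, "Glucan 1,4-alpha-glucosidase"),
]
TRUE_EC = "1.1.1.158"

def rank_of(ec, ranked):
    """1-based position of the first hit with the given E.C. number."""
    return next(i for i, hit in enumerate(ranked, start=1) if hit[0] == ec)

by_rmsd = sorted(hits, key=lambda h: h[1])                 # lowest rmsd first
by_score = sorted(hits, key=lambda h: h[2], reverse=True)  # highest score first

rank_rmsd = rank_of(TRUE_EC, by_rmsd)
rank_score = rank_of(TRUE_EC, by_score)
print(f"rank by rmsd: {rank_rmsd}, rank by score: {rank_score}")
```

Among these six hits the correct class is fifth by rmsd but first by score, which mirrors the point that rmsd alone rarely separates true from false positives.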
ProFunc – function from 3D structure
Functional sequence motifs, e.g. Q-x(3)-[GE]-x-C-[YW]-x(2)-[STAGC]
HTH motifs, Electrostatics, Surface comparison, Nests, … etc
Homologous structures of known function
Homologous sequences of known function
Template based methods
Binding site identification and analysis
Residue conservation analysis
Function
Large scale analysis
• Created an edited version of the target database from the PDB – only those with status “In PDB”
• Extract all PDB codes for each Structural Genomics group
• Extract ‘prior’ knowledge (Header, Title, Jrnl, etc.)
• Find any associated GOA annotation
• Classify each structure by whether function is “known”, “unknown” or “limited info”
• Run ProFunc in a batch process on all codes (~560)
• Extract summary results from each analysis
• Compare to prior knowledge and estimate success
Number of deposits to the TargetDB by Structural Genomics group (Total of 577 unique entries)
CESG (6)
JCSG (59)
NESG (83)
NYSGC (73)
PSF (4)
S2F (37)
SECSG (19)
TB (26)
MCSG (117)
BCSG (35)
RIKEN (124)
March 2004
PDB Blast
[Pie chart: No Matches vs Significant Hits (> 30% Seq ID); segments of 63% and 37%]
• Run query sequences against the PDB using BLAST
• Filtered out those matches released AFTER the query sequence
• Any hits are ignored from subsequent analyses
• Still get significant matches – why?
Target selection criteria
Released within months of SG target
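The release-date filter described for the PDB BLAST step can be sketched in a few lines of Python. This is an illustration under assumptions, not the pipeline actually used: the hit records, their field names and the identifiers `hitA` to `hitC` are hypothetical, and the 30% threshold is taken from the chart legend.

```python
from datetime import date

def prior_hits(query_release, blast_hits, min_identity=30.0):
    """Keep significant hits that were released on or before the query structure."""
    return [
        h for h in blast_hits
        if h["identity"] > min_identity and h["released"] <= query_release
    ]

# Hypothetical query release date and BLAST hits (all values made up).
query_release = date(2003, 6, 1)
blast_hits = [
    {"id": "hitA", "identity": 45.0, "released": date(2003, 3, 15)},  # earlier and significant
    {"id": "hitB", "identity": 62.0, "released": date(2003, 9, 1)},   # released after the query
    {"id": "hitC", "identity": 22.0, "released": date(2002, 1, 10)},  # below the identity threshold
]

kept = [h["id"] for h in prior_hits(query_release, blast_hits)]
print(kept)
```

A hit such as `hitA`, released only a few months before the query, still passes the filter, which is consistent with the explanation that significant matches were often released within months of the structural genomics target.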
InterPro Scan
[Chart legend: No Matches, Significant Hits]
• InterPro scan on proteins of known function
• Cannot “backdate” the InterPro database
• Essentially picking up itself
Function of query structure “known”
[Bar chart: 0% to 100% per method (HTH motif, Enzyme, Ligand, DNA, Siteseer, SSM); categories: No Hits, Different Function, Same Function]
Limited Functional Info
[Bar chart: 0% to 100% per method (HTH motif, Enzyme, Ligand, DNA, Siteseer, SSM); categories: No Hits, Different Function, Same Function, New Function]
Unknown Function
[Bar chart: 0% to 100% per method (InterPro, HTH motif, Enzyme, Ligand, DNA, Siteseer, SSM); categories: No Hits, Hit Unknown Function, New Function]
The Good, the Not So Good and the Ugly
Three examples show the varying levels of information that can be retrieved from structures:
1. New functional assignment
2. Possible function identified
3. Function remains unknown
The Good: BioH structure (MCSG)
One very strong hit: Ser-His-Asp catalytic triad of the lipases with rmsd=0.28Å (template cut-off is 1.2Å)
Experimentally confirmed by hydrolase assays
Novel carboxylesterase acting on short acyl chain substrates
Function Discovered
Class A: [FY]-x-[LIVMFY]-x-S-[TV]-x-K-x(4)-[AGLM]-x(2)-[LC]
APC1040: 70 F-T-M-Q-S-I-S-K-V-I-S-F-I-A-A-C 85
The Not So Good: APC1040 (MCSG)
• Assigned as a probable glutaminase
• Most methods suggest β-lactamase activity
• No match to Prosite patterns
Function being assayed
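To see why the APC1040 fragment misses the Class A pattern shown above, the two can be compared position by position. The Python sketch below expands the pattern into one entry per position and lists the residues of the 70-85 fragment that fall outside the allowed set. The parser handles only the syntax that appears in this pattern (literals, x, bracketed sets and (n) repeats); the function names `expand_prosite` and `mismatches` are made up for the example.

```python
import re

def expand_prosite(pattern):
    """Expand a simple PROSITE-style pattern into one entry per position.

    Each entry is a set of allowed residues, or None for the wildcard x.
    """
    positions = []
    for token in pattern.replace(" ", "").split("-"):
        m = re.fullmatch(r"(x|[A-Z]|\[[A-Z]+\])(?:\((\d+)\))?", token)
        if m is None:
            raise ValueError(f"unsupported token: {token}")
        body, repeat = m.group(1), int(m.group(2) or 1)
        allowed = None if body == "x" else set(body.strip("[]"))
        positions.extend([allowed] * repeat)
    return positions

def mismatches(pattern, sequence, start=1):
    """Return (residue number, residue) for each position that violates the pattern."""
    positions = expand_prosite(pattern)
    if len(positions) != len(sequence):
        raise ValueError("pattern and sequence lengths differ")
    return [
        (start + i, res)
        for i, (allowed, res) in enumerate(zip(positions, sequence))
        if allowed is not None and res not in allowed
    ]

CLASS_A = "[FY]-x-[LIVMFY]-x-S-[TV]-x-K-x(4)-[AGLM]-x(2)-[LC]"
APC1040 = "F-T-M-Q-S-I-S-K-V-I-S-F-I-A-A-C".replace("-", "")  # residues 70-85

result = mismatches(CLASS_A, APC1040, start=70)
print(result)
```

The fragment differs from the pattern at two of the sixteen positions (Ile 75 where Thr or Val is expected, and Ile 82 where Ala, Gly, Leu or Met is expected), so a strict pattern scan reports no match even though the serine and lysine of the motif line up.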
The Ugly: MT0777 (MCSG)
• Fold associated with many functions (Rossmann fold)
• No sequence motifs
• Residue conservation is poor
• Template methods fail
Hypothetical protein from: Methanobacterium thermoautotrophicum
Function Unknown
Future Work
• Improvements to scoring system and additional templates
• Further utilisation of SOAP services as they become available (e.g. KEGG API service)
• Possible adaptation to use as part of a larger workflow or in LIMS systems (Taverna and MyGrid)
• More truly predictive analyses being developed (e.g. Electrostatics, ligand prediction, catalytic residue prediction)
Detection of DNA-binding proteins (with HTH motif) using structural motifs and electrostatics (Hugh Shanahan)
● Combine electrostatics with HTH structural templates.
● Can detect HTH DNA-binding proteins only.
● 1/3 of DNA-binding protein families have an HTH motif.
● Use linear predictor as discriminant.
● True positive rate (~80%) is comparable to that of more complicated methods.
● Very low (< 0.01%) false positive rate.
Ligand Prediction
Active Site & Ligand description/fingerprinting methods:
Can active site geometry, shape, physical-chemical properties etc. be used to predict the preferred ligand class?
• Spherical Harmonics
• Hybrid Ellipsoids
Spherical Harmonics (Richard Morris)
The computation of Legendre polynomials of high order requires a robust integration scheme
Spherical t-designs
Hybrid Ellipsoids (Rafael Najmanovich)
• Every shape can be modelled by a set of hybrid ellipsoids
• The parameters describe location and a,b,c of the ellipsoid and a smear factor
• Similar parameters mean similar active sites and ligands
Predicting Catalytic Residues
(Alex Gutteridge)
• Aims:
• To predict the location of the active site in an enzyme structure.
• To predict the catalytic residues of an enzyme.
• How?
• Train a neural network to identify catalytic residues.
• Cluster high scoring residues to find the active site.
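The clustering step can be illustrated with a minimal Python sketch. It assumes each residue already carries a score from some predictor (here hypothetical numbers stand in for the neural network output) and a single representative coordinate; residues above a score cutoff are grouped by single-linkage clustering on distance, and the cluster with the highest summed score is taken as the putative active site. The cutoffs, residue labels and coordinates are all made up for the example.

```python
from math import dist

def cluster_residues(residues, score_cutoff=0.5, distance_cutoff=8.0):
    """Single-linkage clustering of high-scoring residues, best cluster first."""
    selected = [r for r in residues if r["score"] >= score_cutoff]
    clusters = []
    for res in selected:
        near, far = [], []
        for c in clusters:
            close = any(dist(res["xyz"], o["xyz"]) <= distance_cutoff for o in c)
            (near if close else far).append(c)
        # Merge the new residue with every cluster it touches.
        merged = [res] + [o for c in near for o in c]
        clusters = far + [merged]
    return sorted(clusters, key=lambda c: sum(r["score"] for r in c), reverse=True)

# Hypothetical per-residue scores and coordinates (in Å).
residues = [
    {"id": "A", "score": 0.9, "xyz": (0.0, 0.0, 0.0)},
    {"id": "B", "score": 0.8, "xyz": (5.0, 0.0, 0.0)},
    {"id": "C", "score": 0.7, "xyz": (9.0, 0.0, 0.0)},   # linked to A through B
    {"id": "D", "score": 0.6, "xyz": (40.0, 0.0, 0.0)},  # isolated high scorer
    {"id": "E", "score": 0.1, "xyz": (2.0, 0.0, 0.0)},   # below the score cutoff
]

clusters = cluster_residues(residues)
active_site = sorted(r["id"] for r in clusters[0])
print(active_site)
```

Single linkage is only one possible choice; it lets a chain of neighbouring residues form one site, as with A, B and C here, at the cost of occasionally joining sites that are merely adjacent.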
Workflows and Taverna (Tom Oinn)
• Most procedures used now follow a workflow type scheme
• Taverna allows users to pick elements from services to create their own workflows for automation of complex sets of procedures.
• Removes the need to write complex scripts
Beta 9 release available at: http://taverna.sourceforge.net/
Acknowledgements
• Janet Thornton
• Christine Orengo
• Roman Laskowski – ProFunc
• Richard Morris – InterPro search, Spherical Harmonics
• Gail Bartlett, Craig Porter – Enzyme Templates
• Alex Gutteridge – Catalytic Residue Prediction
• Sue Jones – HTH motifs
• Hugh Shanahan – DNA binding, Electrostatics
• Jonathan Barker – JESS
• Hannes Ponstingl – PITA
• Rafael Najmanovich – Hybrid Ellipsoids
• Martin Senger, Siamak Sobhany – SOAP, Tom Oinn – Taverna
• Annabel Todd and Russell Marsden – UCL
• MCSG consortium for lots of structures, plus many more at EBI and UCL
• Work was supported by NIH grant (GM 62414) and by the US DoE under contract (W-31-109-Eng-38)