Using Ontology to Classify Members of a Protein Family
Robert Stevens
BioHealth Informatics Group
School of Computer Science
University of [email protected]
Introduction
• Developing an automated system for extracting and classifying proteins from newly sequenced genomes
• Building an OWL ontology that defines class membership
• Describing protein instances in OWL
• Classifying against the ontology
• Describing the protein family complement of a genome
• As good as human classification, but added value
• Only possible through inter-disciplinary research
Acknowledgements (it takes all sorts)
Katy Wolstencroft (Bioinformatics)
Daniele Turi (Instance Store)
Phil Lord (myGrid)
Lydia Tabernero (Protein Scientist)
Matt Horridge, Nick Drummond et al (Protégé OWL)
Andy Brass and Robert Stevens (Bioinformatics)
Protein Classification
• Proteins divided into broad functional classes: “Protein Families”
• Families sub-divided to give family classifications
• Class membership can be determined by “protein features”, such as domains, etc.
• Resources exist for feature detection via primary sequence – but not class membership
• Current limitation of automated tools
• Needs human knowledge to recognise class membership
Finding Domains on a Sequence
A search of the linear sequence of protein tyrosine phosphatase type K – identified 9 functional domains
>uniprot|Q15262|PTPK_HUMAN Receptor-type protein-tyrosine phosphatase kappa precursor (EC 3.1.3.48) (R-PTP-kappa).
MDTTAAAALPAFVALLLLSPWPLLGSAQGQFSAGGCTFDDGPGACDYHQDLYDDFEWVHVSAQEPHYLPPEMPQGSYMIVDSSDHDPGEKARLQLPTMKENDTHCIDFSYLLYSQKGLNPGTLNILVRVNKGPLANPIWNVTGFTGRDWLRAELAVSSFWPNEYQVIFEAEVSGGRSGYIAIDDIQVLSYPCDKSPHFLRLGDVEVNAGQNATFQCIATGRDAVHNKLWLQRRNGEDIPV…
……..
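The header line of the FASTA record above packs the database, accession and entry name into fields separated by "|". A minimal sketch of splitting those fields apart, assuming that three-field layout holds; the function name parse_fasta_header is hypothetical and not part of the system described in these slides:

```python
def parse_fasta_header(header: str) -> dict:
    """Split a '>db|accession|entry description' FASTA header into fields."""
    if not header.startswith(">"):
        raise ValueError("FASTA header must start with '>'")
    # Split on the first two '|' only, so the description may contain '|'.
    db, accession, rest = header[1:].split("|", 2)
    entry_name, _, description = rest.partition(" ")
    return {
        "db": db,
        "accession": accession,
        "entry_name": entry_name,
        "description": description,
    }


header = (
    ">uniprot|Q15262|PTPK_HUMAN Receptor-type protein-tyrosine "
    "phosphatase kappa precursor (EC 3.1.3.48) (R-PTP-kappa)."
)
record = parse_fasta_header(header)
```

The accession is the natural key for naming the protein instance later on.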
Why Classify?
• Classification and curation of a genome is the first step in understanding the processes and functions happening in an organism
• Classification enables comparative genomic studies - what is already known in other organisms
• The similarities and differences between processes and functions in related organisms often provide the greatest insight into the biology
• In silico characterisation is the current bottleneck
The Protein Phosphatases
• Large superfamily of proteins – involved in the removal of phosphate groups from molecules
• Important proteins in almost all cellular processes
• Involved in diseases – diabetes and cancer
• Human phosphatases well characterised
Phosphatase Classification
• Diagnostic phosphatase domains/motifs – sufficient for membership of the protein phosphatase superfamily
• Any protein having a phosphatase domain is a member of the phosphatase super-family
• Other motifs determine a protein’s place within the family
• Usually needs a human to recognise that the features detected imply class membership
• Can these be captured in an ontology?
Ontologies
• Describing and defining the classes of objects represented in information
• Defining the characteristics of objects
• The characteristics by which an object can be recognised as belonging to a class
• In a form understandable by a computer
• … and, of course, humans.
Web Ontology Language (OWL)
• W3C recommendation for ontologies for the Semantic Web
• OWL-DL mapped to a decidable fragment of first order logic
• Classes, properties and instances
• Boolean operators, plus existential and universal quantification
• Rich class expressions used in restrictions on properties – hasDomain some (ImmunoglobulinDomain or FibronectinDomain)
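The "some" restriction over a union of classes can be illustrated outside OWL. The sketch below is a closed-world approximation in Python: it looks only at the domains actually listed for a protein, whereas an OWL reasoner works under the open-world assumption. The helper has_some and the two example proteins are hypothetical:

```python
def has_some(domains, allowed):
    """True if at least one listed domain belongs to `allowed`.

    Mirrors 'hasDomain some (A or B)': one filler from the union is enough.
    """
    return any(domain in allowed for domain in domains)


# The union (ImmunoglobulinDomain or FibronectinDomain) from the slide.
ig_or_fibronectin = {"ImmunoglobulinDomain", "FibronectinDomain"}

protein_a = ["TransmembraneDomain", "FibronectinDomain"]
protein_b = ["TransmembraneDomain"]

a_matches = has_some(protein_a, ig_or_fibronectin)
b_matches = has_some(protein_b, ig_or_fibronectin)
```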
OWL represents classes of instances
[Figure: classes A, B and C]
Necessity and Sufficiency
• An R2A phosphatase must have a fibronectin domain
• Having a fibronectin domain does not a phosphatase make
• Necessity – what must a class instance have?
• Any protein that has a phosphatase catalytic domain is a phosphatase enzyme
• All phosphatase enzymes have a catalytic domain
• Sufficiency – how is an instance recognised to be a member of a class?
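The difference between the two kinds of condition can be made concrete with a small sketch. The domain labels and function names below are illustrative assumptions, not names taken from the ontology itself:

```python
def is_phosphatase(domains):
    # Sufficient condition: any protein with a phosphatase catalytic
    # domain is recognised as a phosphatase.
    return "PhosphataseCatalyticDomain" in domains


def could_be_r2a(domains):
    # Necessary condition only: an R2A phosphatase must have a fibronectin
    # domain, so its absence rules R2A out, but its presence proves nothing.
    return "FibronectinDomain" in domains


fibronectin_only = {"FibronectinDomain"}
catalytic_only = {"PhosphataseCatalyticDomain"}
```

A protein with only a fibronectin domain passes the necessary test for R2A yet is not recognised as a phosphatase at all, which is the point of the second bullet.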
Definition of Tyrosine Phosphatase
Class: TyrosineReceptorProteinPhosphatase
EquivalentTo: Protein That
- contains atLeast 1 ProteinTyrosinePhosphataseDomain and
- contains exactly 1 TransmembraneDomain
…there are known knowns; there are things we know we know. We also know there are
known unknowns; that is to say we know there are some things we do not know. But
there are also unknown unknowns -- the ones we don't know we don't know.
Definition of Tyrosine Phosphatase: What we Know we Know
Class: TyrosineReceptorProteinPhosphatase
EquivalentTo: Protein That
- contains atLeast 1 ProteinTyrosinePhosphataseDomain and
- contains exactly 1 TransmembraneDomain
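The two cardinality restrictions in this definition translate directly into a membership test over domain counts. A minimal sketch, assuming a protein is summarised as a Counter of domain names; the function name and the example proteins are hypothetical:

```python
from collections import Counter


def is_tyrosine_receptor_phosphatase(domains: Counter) -> bool:
    """At least one tyrosine phosphatase domain and exactly one
    transmembrane domain; other domains are not constrained."""
    return (
        domains["ProteinTyrosinePhosphataseDomain"] >= 1
        and domains["TransmembraneDomain"] == 1
    )


receptor_like = Counter(
    {"ProteinTyrosinePhosphataseDomain": 2, "TransmembraneDomain": 1}
)
# No transmembrane domain, so the 'exactly 1' restriction fails.
non_receptor = Counter({"ProteinTyrosinePhosphataseDomain": 1})
```

A Counter returns 0 for a domain that was never recorded, so absent domains need no special handling.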
Definition for R2A Phosphatase
Class: R2A
EquivalentTo: Protein That
- contains 2 ProteinTyrosinePhosphataseDomain and
- contains 1 TransmembraneDomain and
- contains 4 FibronectinDomain and
- contains 1 ImmunoglobulinDomain and
- contains 1 MAMDomain and
- contains 1 Cadherin-LikeDomain and
- contains only ProteinTyrosinePhosphataseDomain or TransmembraneDomain or FibronectinDomain or ImmunoglobulinDomain or Cadherin-LikeDomain or MAMDomain
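The final "contains only" clause is a closure axiom: it rules out any domain not named in the definition. The sketch below shows its effect as a closed-world check in Python. It assumes the six domain names used in the counted restrictions above; the function name and the extra ZincFingerDomain in the second example are illustrative:

```python
from collections import Counter

# Exact domain counts required by the R2A definition.
R2A_COUNTS = {
    "ProteinTyrosinePhosphataseDomain": 2,
    "TransmembraneDomain": 1,
    "FibronectinDomain": 4,
    "ImmunoglobulinDomain": 1,
    "MAMDomain": 1,
    "Cadherin-LikeDomain": 1,
}


def is_r2a(domains: Counter) -> bool:
    # Every counted restriction must hold exactly.
    exact_counts = all(domains[name] == n for name, n in R2A_COUNTS.items())
    # Closure: no domain outside the named set may be present.
    present = {name for name, n in domains.items() if n > 0}
    closed = present <= set(R2A_COUNTS)
    return exact_counts and closed


candidate = Counter(R2A_COUNTS)
with_extra = candidate + Counter({"ZincFingerDomain": 1})
```

Without the closure test, the second protein would still satisfy every counted restriction and be accepted.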
Automatic Reasoning
• An OWL-DL ontology is mapped to its DL form as a collection of axioms
• An automatic reasoner checks for satisfiability – throws out the inconsistent and infers subsumption
• Defined classes (where there are necessary and sufficient restrictions) enable a reasoner to infer subclass axioms
• Also infers to which class an object belongs
• Based on the facts we know about it
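Inferring subsumption between defined classes can be imitated for one very restricted case: classes defined purely as conjunctions of count ranges over disjoint domain types. The sketch below is not how a DL reasoner works internally; it only shows why a class with tighter restrictions ends up under one with looser restrictions. The class dictionaries are abridged, hypothetical encodings of the definitions in these slides:

```python
import math

# Each defined class maps a domain to an allowed (min, max) count.
# Domains not listed are unconstrained, i.e. (0, infinity).
TYROSINE_RECEPTOR = {
    "ProteinTyrosinePhosphataseDomain": (1, math.inf),
    "TransmembraneDomain": (1, 1),
}
R2A = {
    "ProteinTyrosinePhosphataseDomain": (2, 2),
    "TransmembraneDomain": (1, 1),
    "FibronectinDomain": (4, 4),
    "ImmunoglobulinDomain": (1, 1),
}


def subsumes(general, specific):
    """True if every protein satisfying `specific` also satisfies `general`."""
    for domain, (low, high) in general.items():
        s_low, s_high = specific.get(domain, (0, math.inf))
        # The specific class must stay inside the general class's range.
        if s_low < low or s_high > high:
            return False
    return True


r2a_is_receptor = subsumes(TYROSINE_RECEPTOR, R2A)
receptor_is_r2a = subsumes(R2A, TYROSINE_RECEPTOR)
```

Nobody asserted that R2A is a kind of tyrosine receptor phosphatase; the relationship falls out of the definitions, which is what the reasoner provides for the full ontology.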
Incremental Addition of Protein Functional Domains
[Figure: protein functional domains added incrementally: Phosphatase catalytic, Cadherin-like, Immunoglobulin, MAM domain, Cellular retinaldehyde, Adhesion recognition, Transmembrane, Fibronectin III, Glycosylation]
Building the Ontology
• Classifications already made by biologists – based on protein functionality;
• Protein domain composition and other details in the literature;
• Some 50 classes of phosphatase, 30 protein domains and one relationship;
• “Value partition” of protein domains (covering and disjoint);
• Defines range of contains property;
• Literature contains knowledge of how to recognise members of each class of phosphatase.
Classification of the Classical Tyrosine Phosphatases
What is the Ontology Telling Us?
• Each class of phosphatase defined in terms of domain composition
• We know the characteristics by which an individual protein can be recognised as a member of a particular class of phosphatase
• We have this knowledge in a computational form
• If we had protein instances described in terms of the ontology, we could classify those individual proteins
• A catalogue of phosphatases
Description of an Instance of a Protein
Instance: P21592
TypeOf: Protein That
Fact: hasDomain 2 ProteinTyrosinePhosphataseDomain and
Fact: hasDomain 1 TransmembraneDomain and
Fact: hasDomain 4 FibronectinDomain and
Fact: hasDomain 1 ImmunoglobulinDomain and
Fact: hasDomain 1 MAMDomain and
Fact: hasDomain 1 Cadherin-LikeDomain
Tyrosine Phosphatase
(containsDomain some TransmembraneDomain) and
(containsDomain at least 1 ProteinTyrosinePhosphataseDomain)

R2A Phosphatase
(containsDomain some MAMDomain) and
(containsDomain some ProteinTyrosineCatalyticDomain or ImmunoglobulinDomain) and
(containsDomain some FibronectinDomain or FibronectinTypeIIIFoldDomain) and
(containsDomain exactly 2 ProteinTyrosinePhosphataseDomain)
Classifying Proteins
>uniprot|Q15262|PTPK_HUMAN Receptor-type protein-tyrosine phosphatase kappa precursor (EC 3.1.3.48) (R-PTP-kappa).
MDTTAAAALPAFVALLLLSPWPLLGSAQGQFSAGGCTFDDGPGACDYHQDLYDDFEWVHVSAQEPHYLPPEMPQGSYMIVDSSDHDPGEKARLQLPTMKENDTHCIDFSYLLYSQKGLNPGTLNILVRVNKGPLANPIWNVTGFTGRDWLRAELAVSSFWPNEYQVIFEAEVSGGRSGYIAIDDIQVLSYPCDKSPHFLRLGDVEVNAGQNATFQCIATGRDAVHNKLWLQRRNGEDIPV………..
[Figure: classification pipeline with components InterPro, Instance Store and Reasoner, and steps Translate and Codify]
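The Translate and Codify steps turn detected domains into an instance description that can be stored and reasoned over. A rough sketch of that conversion, assuming the feature detector yields a flat list of domain names per protein; the function describe_instance, the accession EXAMPLE1 and the hit list are all made up for illustration, and the output merely imitates the layout of the instance description shown in these slides:

```python
from collections import Counter


def describe_instance(accession, hits):
    """Collapse a list of detected domain names into counted 'Fact' lines."""
    counts = Counter(hits)
    facts = [
        f"Fact: hasDomain {n} {domain}"
        for domain, n in sorted(counts.items())
    ]
    return "\n".join([f"Instance: {accession}", "TypeOf: Protein"] + facts)


# Hypothetical detector output: one entry per domain occurrence.
hits = [
    "ProteinTyrosinePhosphataseDomain",
    "TransmembraneDomain",
    "ProteinTyrosinePhosphataseDomain",
]
description = describe_instance("EXAMPLE1", hits)
```

Counting occurrences matters here because the class definitions use cardinality restrictions, not just presence or absence.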
So Far…
• Human phosphatases have been classified using the system
• The ontology classification performed as well as expert classification
• The ontology system refined classification
- DUSC contains zinc finger domain: characterised and conserved – but not in classification
- DUSA contains a disintegrin domain: previously uncharacterised – evolutionarily conserved
• A new kind of phosphatase?
Aspergillus fumigatus
• Phosphatase complement very different from human: >100 human, <50 A. fumigatus
• Whole subfamilies ‘missing’
- Different fungi-specific phosphorylation pathways?
- No requirement for tissue-specific variations?
• Novel serine/threonine phosphatase with homeobox
- Conserved in Aspergillus and closely related species, but not in any other
- Again, a new phosphatase?
Scaling
• Over 700 protein families
• Some 14,000 described sequence features
• Hundreds of thousands of types of protein
• Mass classification, then what?
Generic Technique
• Feature detection
• Categories defined in terms of those features
• Produce catalogue of what you currently know
• Highlight cases that don’t match current knowledge
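The four steps above can be sketched as a small loop: test each protein against every class definition, record the matches in a catalogue, and set aside anything that matches nothing. The names build_catalogue, protein_1 and protein_2 are hypothetical, and the single class test is a simplified stand-in for the ontology's definitions:

```python
from collections import Counter, defaultdict


def build_catalogue(proteins, definitions):
    """Group proteins under every class whose test they pass and collect
    the ones that match no class for human follow-up."""
    catalogue = defaultdict(list)
    unmatched = []
    for name, domains in proteins.items():
        matches = [cls for cls, test in definitions.items() if test(domains)]
        for cls in matches:
            catalogue[cls].append(name)
        if not matches:
            unmatched.append(name)
    return dict(catalogue), unmatched


# Categories defined in terms of detected features (domain counts).
definitions = {
    "TyrosineReceptorProteinPhosphatase": lambda d: (
        d["ProteinTyrosinePhosphataseDomain"] >= 1
        and d["TransmembraneDomain"] == 1
    ),
}

# Hypothetical feature-detection output for two proteins.
proteins = {
    "protein_1": Counter(
        {"ProteinTyrosinePhosphataseDomain": 2, "TransmembraneDomain": 1}
    ),
    "protein_2": Counter({"HomeoboxDomain": 1}),
}

catalogue, unmatched = build_catalogue(proteins, definitions)
```

The unmatched list is where cases that do not fit current knowledge surface for an expert to examine.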
Conclusions
• Using ontology allows automated classification to reach the standard of human expert annotation
• Reasoning capabilities allow interpretation of domain organisation
• Capturing human knowledge in computational form
• Systematic survey produces interesting biological questions
• Discovering the unexpected
• Allows fast, efficient comparative genomics studies
• A combination of CS and bioinformatics to do biology