Applying Semantic Technologies to the Glycoproteomics Domain
W. S. York, May 15, 2006
Some Goals of Glycoproteomics
• How do changes in the expression levels of specific genes alter the expression of specific glycans on the cell surface?
• Are changes in the expression of specific glycans at the cell surface related to cell function, cell development, and disease?
• What are the mechanisms by which specific glycans at the cell surface affect cell function, cell development, and the progression of disease?
Challenges of Glycoproteomics
• Vast amounts of data collected by high-throughput experiments – better methods for data archival, retrieval, and analysis are needed
• Complex structures of glycans and glycoproteins – better methods for representing branched structures and finding structural and functional homologies are needed
• Complex Biology and Biochemistry – better methods to find relationships between the glycoproteome and biological processes are needed
Glycoproteomics Solutions
• Brute-force analysis of flat data files
    • Too much data
    • Data is heterogeneous
    • What does the data represent?
• Relational databases
    • Data is well organized
    • Data organization is relatively rigid
    • What does the data represent?
• Semantic Technologies
    • Data is well organized
    • Data organization is flexible
    • Concepts represented by data are accessible
    • Relationships between concepts are accessible
What is Semantic Technology?
Semantics:
1. (Linguistics) The study or science of meaning in language.
2. (Linguistics) The study of relationships between signs and symbols and what they represent.
The American Heritage® Dictionary of the English Language, Fourth Edition
Semantic Technology: The use of formal representations of concepts and their relationships to enable efficient, intelligent software.
Ontology (Computer Science): A model that represents a domain and is used to reason about the objects in that domain and the relations between them.
http://en.wikipedia.org/wiki/Ontology_(computer_science)
The implication is that enabling computers to “understand” the meanings of and relationships between concepts will allow them to reason and communicate in a way that is analogous to the way humans do.
A Simple Ontology
[Figure: ontology diagram. Nodes: Organism; Animal, Plant; Lion, Deer, Cow, Hosta, Alfalfa; Elsa, Simba, Elsie, Bambi, My Hosta, Peter’s Alfalfa. Edge labels: is_a, ate.]
A Simple Ontology
[Figure: ontology diagram. Nodes: Organism; Animal, Plant; Carnivore, Herbivore; Lion, Deer, Cow, Hosta, Alfalfa; Elsa, Simba, Elsie, Bambi, My Hosta, Peter’s Alfalfa. Edge labels: is_a, ate, eats.]
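The idea behind the simple ontology can be sketched in a few lines of Python: is_a edges form a taxonomy, and following them transitively lets software answer questions that are only implicit in the stored facts. This is an illustrative sketch, not the author's implementation. The transcript preserves the node names and edge labels of the figure but not which nodes each edge connects, so every edge in IS_A and ATE below is an assumption, and the helper names (ancestors, is_a, eaters_of) are hypothetical.

```python
# ASSUMPTION: the node names come from the slide, but the specific edges
# below are guesses made only to have something to reason over.
IS_A = {
    "Animal": ["Organism"],
    "Plant": ["Organism"],
    "Lion": ["Animal"],
    "Deer": ["Animal"],
    "Cow": ["Animal"],
    "Hosta": ["Plant"],
    "Alfalfa": ["Plant"],
    "Elsa": ["Lion"],
    "Simba": ["Lion"],
    "Elsie": ["Cow"],
    "Bambi": ["Deer"],
    "My Hosta": ["Hosta"],
    "Peter's Alfalfa": ["Alfalfa"],
}

# ASSUMPTION: illustrative "ate" facts between individuals.
ATE = [
    ("Simba", "Bambi"),
    ("Bambi", "My Hosta"),
    ("Elsie", "Peter's Alfalfa"),
]


def ancestors(node):
    """Return every class reachable from node by following is_a edges."""
    seen = set()
    stack = [node]
    while stack:
        for parent in IS_A.get(stack.pop(), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def is_a(node, cls):
    """True if node is classified under cls, directly or transitively."""
    return cls in ancestors(node)


def eaters_of(cls):
    """Individuals that ate at least one thing classified under cls."""
    return sorted({eater for eater, food in ATE if is_a(food, cls)})


if __name__ == "__main__":
    print(is_a("Elsa", "Organism"))
    print(eaters_of("Plant"))
    print(eaters_of("Animal"))
```

Nothing in the stored facts says that Bambi ate a plant; that follows only from combining an ate fact with the is_a chain from My Hosta to Plant. Classes such as Carnivore and Herbivore could be defined on top of queries like eaters_of.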
The Structure of GlycO – Concept Taxonomy
[Figure: is_a taxonomy containing the classes chemical entity, molecule, molecular fragment, residue, amino acid residue, carbohydrate residue, carbohydrate moiety, monoglycosyl moiety, glycan moiety, N-glycan, and O-glycan.]
The Structure of GlycO – Concept Taxonomy
[Figure: is_a taxonomy reduced to the classes residue, amino acid residue, carbohydrate residue, glycan moiety, N-glycan, and O-glycan.]
The Structure of GlycO – Concept Taxonomy – Instances and Properties
[Figure: the classes residue, amino acid residue, carbohydrate residue, glycan moiety, N-glycan, and O-glycan (is_a edges), with additional nodes N-glycan core b-D-Manp, N-glycan_00020, and N-glycan a-D-Manp 4 connected by edges labeled is_instance_of, has_residue, and is_linked_to.]
The GlycO Ontology in Protégé
3 Top-Level Classes are Defined in GlycO
The GlycO Ontology in Protégé
Semantics Include Chemical Context
This Class Inherits from 2 Parents
The GlycO Ontology in Protégé
The -D-Manp residues in N-glycans are found in 8 different chemical environments
GlycoTree – A Canonical Representation of N-Glycans
N. Takahashi and K. Kato, Trends in Glycosciences and Glycotechnology, 15: 235-251
[Figure: GlycoTree, the canonical tree of N-glycan residue positions. Labels shown: core D-Manp-(1-4)-D-GlcpNAc-(1-4)-D-GlcpNAc; branch residues D-Manp-(1-6), D-Manp-(1-3), D-GlcpNAc-(1-2) (twice), D-GlcpNAc-(1-4), and D-GlcpNAc-(1-6).]
We give a residue in this position the same name, regardless of the specific structure it resides in
Semantics!
The GlycO Ontology in Protégé
Bisecting β-D-GlcpNAc
The GlycO Ontology in Protégé
The GlycO Ontology in Protégé
1,3-linked α-L-Fucp
The GlycO Ontology in Protégé
Ontology Population Workflow
[Flowchart. Source: Semagix Freedom knowledge extractor, producing Instance Data. Decision: Has CarbBank ID? (YES / NO). Steps: IUPAC to LINUCS; LINUCS to GLYDE; Compare to Knowledge Base. Decision: Already in KB? (YES: next Instance / NO). Final step: Insert into KB.]
Ontology Population Workflow – LINUCS Representation
[][Asn]{[(4+1)][b-D-GlcpNAc]{[(4+1)][b-D-GlcpNAc]{[(4+1)][b-D-Manp]{[(3+1)][a-D-Manp]{[(2+1)][b-D-GlcpNAc]{}[(4+1)][b-D-GlcpNAc]{}}[(6+1)][a-D-Manp]{[(2+1)][b-D-GlcpNAc]{}}}}}}
Ontology Population Workflow – GLYDE Representation
<Glycan>
  <aglycon name="Asn"/>
  <residue link="4" anomeric_carbon="1" anomer="b" chirality="D" monosaccharide="GlcNAc">
    <residue link="4" anomeric_carbon="1" anomer="b" chirality="D" monosaccharide="GlcNAc">
      <residue link="4" anomeric_carbon="1" anomer="b" chirality="D" monosaccharide="Man">
        <residue link="3" anomeric_carbon="1" anomer="a" chirality="D" monosaccharide="Man">
          <residue link="2" anomeric_carbon="1" anomer="b" chirality="D" monosaccharide="GlcNAc"></residue>
          <residue link="4" anomeric_carbon="1" anomer="b" chirality="D" monosaccharide="GlcNAc"></residue>
        </residue>
        <residue link="6" anomeric_carbon="1" anomer="a" chirality="D" monosaccharide="Man">
          <residue link="2" anomeric_carbon="1" anomer="b" chirality="D" monosaccharide="GlcNAc"></residue>
        </residue>
      </residue>
    </residue>
  </residue>
</Glycan>
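The LINUCS to GLYDE step of the workflow can be illustrated with a small Python sketch. It is not the converter used in the project. The grammar is inferred only from the single LINUCS string on the slide (each node is [link][name]{children}), the element and attribute names are copied from the GLYDE example on the slide, and the function names parse_linucs and to_glyde are hypothetical. Dropping the ring-form letter (p or f) from the residue name to obtain the monosaccharide attribute is also an assumption based on that one example.

```python
import re
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field

# The LINUCS string from the slide, with whitespace removed.
LINUCS = ("[][Asn]{[(4+1)][b-D-GlcpNAc]{[(4+1)][b-D-GlcpNAc]{[(4+1)][b-D-Manp]"
          "{[(3+1)][a-D-Manp]{[(2+1)][b-D-GlcpNAc]{}[(4+1)][b-D-GlcpNAc]{}}"
          "[(6+1)][a-D-Manp]{[(2+1)][b-D-GlcpNAc]{}}}}}}")

LINK_RE = re.compile(r"\((\d+)\+(\d+)\)")                      # e.g. (4+1)
NAME_RE = re.compile(r"([ab])-([DL])-([A-Z][a-z]+)[pf](\w*)")  # e.g. b-D-GlcpNAc


@dataclass
class Node:
    link: str
    name: str
    children: list = field(default_factory=list)


def parse_linucs(text):
    """Recursive-descent parse of [link][name]{children...} into a Node tree."""
    s = "".join(text.split())  # the slide's string contains stray spaces
    pos = 0

    def expect(ch):
        nonlocal pos
        if pos >= len(s) or s[pos] != ch:
            raise ValueError(f"expected {ch!r} at offset {pos}")
        pos += 1

    def bracketed():
        nonlocal pos
        expect("[")
        end = s.find("]", pos)
        if end < 0:
            raise ValueError("unterminated '['")
        token = s[pos:end]
        pos = end + 1
        return token

    def node():
        link, name = bracketed(), bracketed()
        expect("{")
        children = []
        while pos < len(s) and s[pos] != "}":
            children.append(node())
        expect("}")
        return Node(link, name, children)

    root = node()
    if pos != len(s):
        raise ValueError(f"trailing text at offset {pos}")
    return root


def to_glyde(root):
    """Build a <Glycan> element; the root node is treated as the aglycon."""
    glycan = ET.Element("Glycan")
    ET.SubElement(glycan, "aglycon", name=root.name)

    def add(parent, n):
        link = LINK_RE.fullmatch(n.link)
        name = NAME_RE.fullmatch(n.name)
        if not link or not name:
            raise ValueError(f"unrecognized residue: [{n.link}][{n.name}]")
        element = ET.SubElement(
            parent, "residue",
            link=link.group(1), anomeric_carbon=link.group(2),
            anomer=name.group(1), chirality=name.group(2),
            monosaccharide=name.group(3) + name.group(4),
        )
        for child in n.children:
            add(element, child)

    for child in root.children:
        add(glycan, child)
    return glycan


if __name__ == "__main__":
    print(ET.tostring(to_glyde(parse_linucs(LINUCS)), encoding="unicode"))
```

Because both notations are trees, the conversion is a direct walk: each LINUCS node becomes one residue element, and nesting in braces becomes nesting of elements.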
The ProPreO Ontology in Protégé
3 Top-Level Classes are Defined in ProPreO
The ProPreO Ontology in Protégé
This Class Inherits from 2 Parents
The ProPreO Ontology in Protégé
This Class Inherits from 2 Parents
Semantic Annotation of MS Data
ms/ms peaklist data
parent ion m/z      parent ion abundance      parent ion charge
830.9570            194.9604                  2
fragment ion m/z    fragment ion abundance
580.2985            0.3592
688.3214            0.2526
779.4759            38.4939
784.3607            21.7736
1543.7476           1.3822
1544.7595           2.9977
1562.8113           37.4790
1660.7776           476.5043
Semantically Annotated MS Data
<ms/ms_peak_list>
  <parameter instrument="micromass_QTOF_2_quadropole_time_of_flight_mass_spectrometer" mode="ms/ms"/>
  <parent_ion m/z="830.9570" abundance="194.9604" z="2"/>
  <fragment_ion m/z="580.2985" abundance="0.3592"/>
  <fragment_ion m/z="688.3214" abundance="0.2526"/>
  <fragment_ion m/z="779.4759" abundance="38.4939"/>
  <fragment_ion m/z="784.3607" abundance="21.7736"/>
  <fragment_ion m/z="1543.7476" abundance="1.3822"/>
  <fragment_ion m/z="1544.7595" abundance="2.9977"/>
  <fragment_ion m/z="1562.8113" abundance="37.4790"/>
  <fragment_ion m/z="1660.7776" abundance="476.5043"/>
</ms/ms_peak_list>
Ontological Concepts
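A short Python sketch shows how a raw peak list could be wrapped in ontology-derived tags like those on the slide. It is an illustration under stated assumptions, not the ProPreO tooling. The slide's names ms/ms_peak_list and m/z contain "/", which is not allowed in XML names, so the sketch substitutes ms_ms_peak_list and mz; those substitutions and the function name annotate_peak_list are hypothetical. The input layout (first line holds parent ion m/z, abundance, and charge; each later line holds fragment ion m/z and abundance) follows the peak list shown in the slides.

```python
import xml.etree.ElementTree as ET

# Peak list values copied from the slide.
PEAK_LIST = """\
830.9570 194.9604 2
580.2985 0.3592
688.3214 0.2526
779.4759 38.4939
784.3607 21.7736
1543.7476 1.3822
1544.7595 2.9977
1562.8113 37.4790
1660.7776 476.5043
"""

# Instrument concept name as spelled on the slide.
INSTRUMENT = "micromass_QTOF_2_quadropole_time_of_flight_mass_spectrometer"


def annotate_peak_list(text, instrument, mode="ms/ms"):
    """Wrap a whitespace-delimited peak list in concept-named XML elements."""
    rows = [line.split() for line in text.splitlines() if line.strip()]
    if not rows or len(rows[0]) != 3:
        raise ValueError("first line must hold parent m/z, abundance, charge")

    # ASSUMPTION: "ms_ms_peak_list" and "mz" stand in for the slide's
    # "ms/ms_peak_list" and "m/z", which are not valid XML names.
    root = ET.Element("ms_ms_peak_list")
    ET.SubElement(root, "parameter", instrument=instrument, mode=mode)

    mz, abundance, charge = rows[0]
    ET.SubElement(root, "parent_ion", mz=mz, abundance=abundance, z=charge)

    for row in rows[1:]:
        if len(row) != 2:
            raise ValueError(f"expected 'm/z abundance', got: {row}")
        ET.SubElement(root, "fragment_ion", mz=row[0], abundance=row[1])
    return root


if __name__ == "__main__":
    annotated = annotate_peak_list(PEAK_LIST, INSTRUMENT)
    print(ET.tostring(annotated, encoding="unicode"))
```

Values are kept as strings so that the annotated file reproduces the instrument output digit for digit.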
Web Services Based Workflow for Proteomics [1]
[Figure: workflow diagram. Processing stages: Biological Sample Analysis by MS/MS; Raw Data to Standard Format; Data Pre-process [2]; DB Search (Mascot/Sequest); Results Post-process (ProValt [3]). Data in Storage: Raw Data; Standard Format Data; Filtered Data; Search Results; Final Output. Other labels: Agent (four times), O and I markers between stages, and Biological Information.]
[1] Design and Implementation of Web Services based Workflow for proteomics. Journal of Proteome Research. Submitted.
[2] Computational tools for increasing confidence in protein identifications. Association of Biomolecular Resource Facilities Annual Meeting, Portland, OR, 2004.
[3] A Heuristic method for assigning a false-discovery rate for protein identifications from Mascot database search results. Mol. Cell. Proteomics. 4(6), 762-772.
An Integrated Semantic Information System
• Formalized domain knowledge is in ontologies
    • The schema defines the concepts
    • Instances represent individual objects
    • Relationships provide expressiveness
• Data is annotated using concepts from the ontologies
• The semantic annotations facilitate the identification and extraction of relevant information
• The semantic relationships allow knowledge that is implicit in the data to be discovered
Satya Sahoo, Christopher Thomas, Cory Henson, Ravi Pavagada
Amit Sheth, Krzysztof Kochut, John Miller
James Atwood, Lin Lin, Alison Nairn, Gerardo Alvarez-Manilla, Saeed Roushanzamir
Michael Pierce, Ron Orlando, Kelley Moremen, Parastoo Azadi
Alfred Merrill