Applying Semantic Technologies to the Glycoproteomics Domain

W. S. York
May 15, 2006


Page 2

Some Goals of Glycoproteomics

• How do changes in the expression levels of specific genes alter the expression of specific glycans on the cell surface?

• Are changes in the expression of specific glycans at the cell surface related to cell function, cell development, and disease?

• What are the mechanisms by which specific glycans at the cell surface affect cell function, cell development, and the progression of disease?

Page 3

Challenges of Glycoproteomics

• Vast amounts of data collected by high-throughput experiments – better methods for data archival, retrieval, and analysis are needed

• Complex structures of glycans and glycoproteins – better methods for representing branched structures and finding structural and functional homologies are needed

• Complex biology and biochemistry – better methods to find relationships between the glycoproteome and biological processes are needed

Page 4

Glycoproteomics Solutions

• Brute-force analysis of flat data files
  • Too much data
  • Data is heterogeneous
  • What does the data represent?

• Relational databases
  • Data is well organized
  • Data organization is relatively rigid
  • What does the data represent?

• Semantic Technologies
  • Data is well organized
  • Data organization is flexible
  • Concepts represented by data are accessible
  • Relationships between concepts are accessible

Page 5

What is Semantic Technology?

Semantics:
1. (Linguistics) The study or science of meaning in language.
2. (Linguistics) The study of relationships between signs and symbols and what they represent.
The American Heritage® Dictionary of the English Language, Fourth Edition

Semantic Technology:
The use of formal representations of concepts and their relationships to enable efficient, intelligent software.

Ontology (Computer Science):
A model that represents a domain and is used to reason about the objects in that domain and the relations between them.
http://en.wikipedia.org/wiki/Ontology_(computer_science)

The implication is that enabling computers to “understand” the meanings of and relationships between concepts will allow them to reason and communicate in a way that is analogous to the way humans do.

Page 6

A Simple Ontology

[Figure: ontology graph. Classes: Organism; Animal, Plant; Lion, Cow, Deer, Hosta, Alfalfa. Instances: Elsa, Simba, Elsie, Bambi, My Hosta, Peter’s Alfalfa. Edge labels: is_a, ate.]

Page 7

A Simple Ontology

[Figure: ontology graph. Classes: Organism; Animal, Plant; Carnivore, Herbivore; Lion, Cow, Deer, Hosta, Alfalfa. Instances: Elsa, Simba, Elsie, Bambi, My Hosta, Peter’s Alfalfa. Edge labels: is_a, ate, eats.]
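The kind of reasoning this toy ontology supports can be sketched in a few lines of Python. This is only an illustration: the is_a edges below are read off the layout of the diagram, the four "ate" facts are hypothetical (the transcript does not say who ate what), and the `ancestors` and `infer_diet` helpers are invented names, not part of any ontology tool.

```python
# is_a edges as suggested by the diagram layout (child -> parents).
IS_A = {
    "Animal": ["Organism"], "Plant": ["Organism"],
    "Lion": ["Animal"], "Cow": ["Animal"], "Deer": ["Animal"],
    "Hosta": ["Plant"], "Alfalfa": ["Plant"],
    "Elsa": ["Lion"], "Simba": ["Lion"], "Elsie": ["Cow"], "Bambi": ["Deer"],
    "My Hosta": ["Hosta"], "Peter's Alfalfa": ["Alfalfa"],
}

# Hypothetical "ate" facts (eater, food); the slide's actual edges are not recoverable.
ATE = [
    ("Elsie", "Peter's Alfalfa"),
    ("Bambi", "My Hosta"),
    ("Elsa", "Bambi"),
    ("Simba", "Elsie"),
]


def ancestors(node):
    """Return every concept reachable from node by following is_a edges."""
    seen = set()
    stack = list(IS_A.get(node, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(IS_A.get(current, []))
    return seen


def infer_diet(individual):
    """Guess Carnivore/Herbivore from what an individual ate (a deliberately naive rule)."""
    eaten = [food for eater, food in ATE if eater == individual]
    if not eaten:
        return None
    if all("Animal" in ancestors(food) for food in eaten):
        return "Carnivore"
    if all("Plant" in ancestors(food) for food in eaten):
        return "Herbivore"
    return None


if __name__ == "__main__":
    print(sorted(ancestors("Elsa")))  # concepts Elsa belongs to via is_a
    print(infer_diet("Elsa"), infer_diet("Elsie"))
```

Nothing in the data states that Elsa is an Organism or a Carnivore; both follow from the relationships, which is the sense in which knowledge implicit in the data becomes accessible.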

Page 8

The Structure of GlycO – Concept Taxonomy

[Figure: is_a taxonomy with the concepts chemical entity, molecule, molecular fragment, residue, amino acid residue, carbohydrate residue, carbohydrate moiety, monoglycosyl moiety, glycan moiety, N-glycan, and O-glycan.]

Page 9

The Structure of GlycO – Concept Taxonomy

[Figure: part of the is_a taxonomy, showing residue, amino acid residue, carbohydrate residue, glycan moiety, N-glycan, and O-glycan.]

Page 10

The Structure of GlycO
– Concept Taxonomy
– Instances and Properties

[Figure: the is_a taxonomy (residue, amino acid residue, carbohydrate residue, glycan moiety, N-glycan, O-glycan) extended with the instances N-glycan_00020, N-glycan core b-D-Manp, and N-glycan a-D-Manp 4. Edge labels: is_a, is_instance_of, has_residue, is_linked_to.]

Page 11

The GlycO Ontology in Protégé

3 Top-Level Classes are Defined in GlycO

Page 12

The GlycO Ontology in Protégé

Semantics Include Chemical Context

This Class Inherits from 2 Parents

Page 13

The GlycO Ontology in Protégé

The α-D-Manp residues in N-glycans are found in 8 different chemical environments

Page 14

GlycoTree – A Canonical Representation of N-Glycans

N. Takahashi and K. Kato, Trends in Glycosciences and Glycotechnology, 15: 235-251

[Figure: branched N-glycan drawn in GlycoTree form. Core: β-D-Manp-(1-4)-β-D-GlcpNAc-(1-4)-β-D-GlcpNAc. Branches: α-D-Manp-(1-6) and α-D-Manp-(1-3), carrying β-D-GlcpNAc residues in (1-2), (1-4), and (1-6) linkages.]

We give a residue in this position the same name, regardless of the specific structure it resides in

Semantics!
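One way to picture position-based naming is to derive each residue's name from its path back to the reducing end, so that a residue in the same position gets the same name in every structure that contains it. The sketch below is a hypothetical stand-in for that idea, not the GlycoTree or GlycO naming scheme itself; the `Residue` class, the `canonical_names` function, and the key format are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Residue:
    name: str                      # e.g. "b-D-Manp"
    link: str = ""                 # linkage to the parent, e.g. "1-4"; empty at the root
    children: List["Residue"] = field(default_factory=list)


def canonical_names(root: Residue) -> List[str]:
    """Name every residue by the chain of linkages leading back to the root."""
    names: List[str] = []

    def walk(node: Residue, parent_key: str) -> None:
        key = node.name if not parent_key else f"{parent_key}/({node.link}){node.name}"
        names.append(key)
        for child in node.children:
            walk(child, key)

    walk(root, "")
    return names


def core() -> Residue:
    """The N-glycan core: two GlcNAc residues, one beta-Man, two alpha-Man branches."""
    return Residue("b-D-GlcpNAc", children=[
        Residue("b-D-GlcpNAc", "1-4", [
            Residue("b-D-Manp", "1-4", [
                Residue("a-D-Manp", "1-3"),
                Residue("a-D-Manp", "1-6"),
            ]),
        ]),
    ])


def extended() -> Residue:
    """The core plus one GlcNAc on the (1-3)-linked mannose."""
    glycan = core()
    arm = glycan.children[0].children[0].children[0]
    arm.children.append(Residue("b-D-GlcpNAc", "1-2"))
    return glycan


if __name__ == "__main__":
    for name in canonical_names(extended()):
        print(name)
```

Every name produced for the core also appears among the names for the extended glycan, so annotations attached to a position carry over between structures.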

Page 15

The GlycO Ontology in Protégé

Bisecting β-D-GlcpNAc

Page 16

The GlycO Ontology in Protégé

Page 17

The GlycO Ontology in Protégé

1,3-linked α-L-Fucp

Page 18

The GlycO Ontology in Protégé

Page 19

Ontology Population Workflow

[Flowchart. Boxes: Semagix Freedom knowledge extractor; Instance Data; IUPAC to LINUCS; LINUCS to GLYDE; Compare to Knowledge Base; Insert into KB. Decisions: "Has CarbBank ID?" and "Already in KB?", with branches labeled YES, NO, and "YES: next Instance".]

Page 20

Ontology Population Workflow

[Flowchart repeated from the previous slide, shown with an example structure in LINUCS format:]

[][Asn]{[(4+1)][b-D-GlcpNAc] {[(4+1)][b-D-GlcpNAc] {[(4+1)][b-D-Manp] {[(3+1)][a-D-Manp] {[(2+1)][b-D-GlcpNAc] {}[(4+1)][b-D-GlcpNAc] {}}[(6+1)][a-D-Manp] {[(2+1)][b-D-GlcpNAc]{}}}}}}

Page 21

Ontology Population Workflow

[Flowchart repeated from the previous slide, shown with the same example structure in GLYDE format:]

<Glycan>
  <aglycon name="Asn"/>
  <residue link="4" anomeric_carbon="1" anomer="b" chirality="D" monosaccharide="GlcNAc">
    <residue link="4" anomeric_carbon="1" anomer="b" chirality="D" monosaccharide="GlcNAc">
      <residue link="4" anomeric_carbon="1" anomer="b" chirality="D" monosaccharide="Man">
        <residue link="3" anomeric_carbon="1" anomer="a" chirality="D" monosaccharide="Man">
          <residue link="2" anomeric_carbon="1" anomer="b" chirality="D" monosaccharide="GlcNAc"></residue>
          <residue link="4" anomeric_carbon="1" anomer="b" chirality="D" monosaccharide="GlcNAc"></residue>
        </residue>
        <residue link="6" anomeric_carbon="1" anomer="a" chirality="D" monosaccharide="Man">
          <residue link="2" anomeric_carbon="1" anomer="b" chirality="D" monosaccharide="GlcNAc"></residue>
        </residue>
      </residue>
    </residue>
  </residue>
</Glycan>
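The "LINUCS to GLYDE" step of the workflow can be illustrated with a small recursive parser. This is a minimal sketch, not the converter used in the project: it assumes LINUCS nodes have the form `[(parent position+anomeric carbon)][residue]{children}` as in the example string on the slide, it handles only residue names shaped like `b-D-GlcpNAc`, and the function names `linucs_to_glyde` and `parse_children` are invented here.

```python
import re
import xml.etree.ElementTree as ET

# Example structure copied from the slide.
LINUCS = (
    "[][Asn]{[(4+1)][b-D-GlcpNAc] {[(4+1)][b-D-GlcpNAc] {[(4+1)][b-D-Manp] "
    "{[(3+1)][a-D-Manp] {[(2+1)][b-D-GlcpNAc] {}[(4+1)][b-D-GlcpNAc] {}}"
    "[(6+1)][a-D-Manp] {[(2+1)][b-D-GlcpNAc]{}}}}}}"
)

# One node header: optional "(link+anomeric_carbon)" in brackets, a name in brackets, then "{".
NODE = re.compile(r"\[(?:\((\d+)\+(\d+)\))?\]\[([^\]]+)\]\{")
# Residue name: anomer, chirality, base sugar, ring form (p/f), optional suffix such as NAc.
RESIDUE = re.compile(r"([ab])-([DL])-([A-Z][a-z]+)([pf])(.*)")


def parse_children(text, pos, parent):
    """Parse sibling nodes starting at pos; return the index just past the closing brace."""
    while pos < len(text) and text[pos] != "}":
        node = NODE.match(text, pos)
        if node is None or node.group(1) is None:
            raise ValueError(f"unexpected LINUCS syntax at offset {pos}")
        link, anomeric_carbon, name = node.groups()
        parts = RESIDUE.fullmatch(name)
        if parts is None:
            raise ValueError(f"unsupported residue name: {name}")
        anomer, chirality, base, _ring, suffix = parts.groups()
        child = ET.SubElement(parent, "residue", {
            "link": link,
            "anomeric_carbon": anomeric_carbon,
            "anomer": anomer,
            "chirality": chirality,
            "monosaccharide": base + suffix,  # ring form is dropped, as on the slide
        })
        pos = parse_children(text, node.end(), child)
    if pos >= len(text):
        raise ValueError("unbalanced braces in LINUCS string")
    return pos + 1


def linucs_to_glyde(linucs):
    """Convert a LINUCS string into a GLYDE-like <Glycan> element tree."""
    text = re.sub(r"\s+", "", linucs)
    root = NODE.match(text)
    if root is None or root.group(1) is not None:
        raise ValueError("expected an aglycon node of the form [][name]{...}")
    glycan = ET.Element("Glycan")
    ET.SubElement(glycan, "aglycon", {"name": root.group(3)})
    end = parse_children(text, root.end(), glycan)
    if end != len(text):
        raise ValueError("trailing characters after the root node")
    return glycan


if __name__ == "__main__":
    print(ET.tostring(linucs_to_glyde(LINUCS), encoding="unicode"))
```

Because each brace pair in LINUCS maps directly to one nested `residue` element, the branching of the glycan is preserved without any extra bookkeeping.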

Page 22

The ProPreO Ontology in Protégé

3 Top-Level Classes are Defined in ProPreO

Page 23

The ProPreO Ontology in Protégé

This Class Inherits from 2 Parents

Page 24

The ProPreO Ontology in Protégé

This Class Inherits from 2 Parents

Page 25

Semantic Annotation of MS Data

ms/ms peaklist data:

830.9570   194.9604   2     (parent ion m/z, parent ion abundance, parent ion charge)
580.2985   0.3592           (fragment ion m/z, fragment ion abundance)
688.3214   0.2526
779.4759   38.4939
784.3607   21.7736
1543.7476  1.3822
1544.7595  2.9977
1562.8113  37.4790
1660.7776  476.5043

Page 26

Semantically Annotated MS Data

<ms-ms_peak_list>
  <parameter instrument="micromass_QTOF_2_quadropole_time_of_flight_mass_spectrometer" mode="ms-ms"/>
  <parent_ion m-z="830.9570" abundance="194.9604" z="2"/>
  <fragment_ion m-z="580.2985" abundance="0.3592"/>
  <fragment_ion m-z="688.3214" abundance="0.2526"/>
  <fragment_ion m-z="779.4759" abundance="38.4939"/>
  <fragment_ion m-z="784.3607" abundance="21.7736"/>
  <fragment_ion m-z="1543.7476" abundance="1.3822"/>
  <fragment_ion m-z="1544.7595" abundance="2.9977"/>
  <fragment_ion m-z="1562.8113" abundance="37.4790"/>
  <fragment_ion m-z="1660.7776" abundance="476.5043"/>
</ms-ms_peak_list>

[Callout on the slide: Ontological Concepts]
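The step from a bare peak list to annotated data can be sketched as follows. This is an illustration under stated assumptions rather than the project's annotation tool: the first line of the peak list is taken to be the parent ion (m/z, abundance, charge) and each later line a fragment ion (m/z, abundance); the function name `annotate_peaklist` is hypothetical; and the element and attribute names use hyphens (`ms-ms_peak_list`, `m-z`) because a slash is not allowed in an XML name.

```python
import xml.etree.ElementTree as ET

# Peak list values copied from the slide.
PEAKLIST = """\
830.9570 194.9604 2
580.2985 0.3592
688.3214 0.2526
779.4759 38.4939
784.3607 21.7736
1543.7476 1.3822
1544.7595 2.9977
1562.8113 37.4790
1660.7776 476.5043
"""

INSTRUMENT = "micromass_QTOF_2_quadropole_time_of_flight_mass_spectrometer"


def annotate_peaklist(text, instrument, mode="ms-ms"):
    """Wrap each number in the peak list in an element that says what it means."""
    rows = [line.split() for line in text.splitlines() if line.strip()]
    if not rows or len(rows[0]) != 3:
        raise ValueError("first line must hold parent ion m/z, abundance, and charge")
    root = ET.Element("ms-ms_peak_list")
    ET.SubElement(root, "parameter", {"instrument": instrument, "mode": mode})
    mz, abundance, charge = rows[0]
    ET.SubElement(root, "parent_ion", {"m-z": mz, "abundance": abundance, "z": charge})
    for row in rows[1:]:
        if len(row) != 2:
            raise ValueError(f"fragment line must hold m/z and abundance: {row}")
        mz, abundance = row
        ET.SubElement(root, "fragment_ion", {"m-z": mz, "abundance": abundance})
    return root


if __name__ == "__main__":
    print(ET.tostring(annotate_peaklist(PEAKLIST, INSTRUMENT), encoding="unicode"))
```

The values are kept as strings so the annotated file reproduces the instrument output digit for digit; a consumer that needs numbers can convert them after looking up what each attribute represents.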

Page 27

Web Services Based Workflow for Proteomics [1]

[Figure: pipeline of four agents around a Storage component. Steps: Biological Sample Analysis by MS/MS; Raw Data to Standard Format; Data Pre-process [2]; DB Search (Mascot/Sequest); Results Post-process (ProValt [3]); Biological Information. Stored data: Raw Data, Standard Format Data, Filtered Data, Search Results, Final Output. Connections between steps and storage are labeled I and O.]

[1] Design and Implementation of Web Services based Workflow for proteomics. Journal of Proteome Research. Submitted.
[2] Computational tools for increasing confidence in protein identifications. Association of Biomolecular Resource Facilities Annual Meeting, Portland, OR, 2004.
[3] A Heuristic method for assigning a false-discovery rate for protein identifications from Mascot database search results. Mol. Cell. Proteomics. 4(6), 762-772.

Page 28

An Integrated Semantic Information System

• Formalized domain knowledge is in ontologies
  • The schema defines the concepts
  • Instances represent individual objects
  • Relationships provide expressiveness

• Data is annotated using concepts from the ontologies

• The semantic annotations facilitate the identification and extraction of relevant information

• The semantic relationships allow knowledge that is implicit in the data to be discovered

Page 29

Satya Sahoo, Christopher Thomas, Cory Henson, Ravi Pavagada

Amit Sheth, Krzysztof Kochut, John Miller

James Atwood, Lin Lin, Alison Nairn, Gerardo Alvarez-Manilla, Saeed Roushanzamir

Michael Pierce, Ron Orlando, Kelley Moremen, Parastoo Azadi

Alfred Merrill