Semantic Mediation in myGrid
Chris Wroe
Manchester University
• UK e-Science Pilot Project.
• Oct 2001 – April 2005.
• £3.4 million.
• £0.4 million studentships.
Newcastle
Nottingham
Manchester
Southampton
Hinxton
Sheffield
Data-intensive bioinformatics
ID   MURA_BACSU STANDARD; PRT; 429 AA.
DE   PROBABLE UDP-N-ACETYLGLUCOSAMINE 1-CARBOXYVINYLTRANSFERASE
DE   (EC 2.5.1.7) (ENOYLPYRUVATE TRANSFERASE) (UDP-N-ACETYLGLUCOSAMINE
DE   ENOLPYRUVYL TRANSFERASE) (EPT).
GN   MURA OR MURZ.
OS   BACILLUS SUBTILIS.
OC   BACTERIA; FIRMICUTES; BACILLUS/CLOSTRIDIUM GROUP; BACILLACEAE;
OC   BACILLUS.
KW   PEPTIDOGLYCAN SYNTHESIS; CELL WALL; TRANSFERASE.
FT   ACT_SITE 116 116 BINDS PEP (BY SIMILARITY).
FT   CONFLICT 374 374 S -> A (IN REF. 3).
SQ   SEQUENCE 429 AA; 46016 MW; 02018C5C CRC32;
     MEKLNIAGGD SLNGTVHISG AKNSAVALIP ATILANSEVT IEGLPEISDI ETLRDLLKEI
     GGNVHFENGE MVVDPTSMIS MPLPNGKVKK LRASYYLMGA MLGRFKQAVI GLPGGCHLGP
     RPIDQHIKGF EALGAEVTNE QGAIYLRAER LRGARIYLDV VSVGATINIM LAAVLAEGKT
     IIENAAKEPE IIDVATLLTS MGAKIKGAGT NVIRIDGVKE LHGCKHTIIP DRIEAGTFMI
Service Stack
Web Service (Grid Service) communication fabric
AMBIT Text Extraction Service
Provenance
Personalisation
Event Notification
Gateway
Service and Workflow Discovery
myGrid Information Repository
Ontology Mgt
Metadata Mgt
Work bench, Taverna, Talisman
Native Web Services
SoapLab
GowLab
Web Portal
Legacy apps
Registries
Ontologies
Views
FreeFluo Workflow Enactment Engine
OGSA-DQP Distributed Query Processor
Users: Bioinformaticians, Tool Providers, Service Providers
Layers: Applications, Core services, External services
Workflow approach
Graves’ Disease
Workflow approach II
Issues
• Connecting web services together
  – Shim services
• Connecting data to web services
  – Data provenance delivered by LSIDs
• Connecting data to data
  – Distributed Query Processing
Technology
• Resource Description Framework (RDF)
  – Representing metadata about data and services
• Web Ontology Language (OWL)
  – Representing concepts and classifications
myGrid & Bioinformatics world
• Automating mainstream, well known tasks
• Well known mature data formats
• Often no formal description of formats
• Lots of code to manipulate formats already exists (BioPerl, BioJava …)
• Semantic mediation work in progress.
Williams-Beuren Syndrome Workflow
[Figure: Williams-Beuren Syndrome workflow. Labels: Explore gap regions within the W-B Critical Region; Main Bioinformatics Applications; Main Bioinformatics Services; SHIM Services]
Williams Example (simple)
GenBank retrieval service
Genscan gene prediction service
Semantic level: GenBank record has_part genomic sequence; genomic sequence in
Syntactic level: GenBank record; FASTA sequence
Sample GenBank Record
LOCUS AY214156 1065 bp mRNA linear VRT 07-MAY-2004
DEFINITION Oncorhynchus nerka RH1 opsin mRNA, complete cds.
ACCESSION AY214156
VERSION AY214156.1 GI:37787241
KEYWORDS .
SOURCE Oncorhynchus nerka (sockeye salmon)
ORGANISM Oncorhynchus nerka
Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
Actinopterygii; Neopterygii; Teleostei; Euteleostei;
Protacanthopterygii; Salmoniformes; Salmonidae; Oncorhynchus.
REFERENCE 1 (bases 1 to 1065)
AUTHORS Dann,S.G., Allison,W.T., Levin,D.B., Taylor,J.S. and Hawryshyn,C.W.
TITLE Salmonid opsin sequences undergo positive selection and indicate an
alternate evolutionary relationship in oncorhynchus
JOURNAL J. Mol. Evol. 58 (4), 400-412 (2004)
PUBMED 15114419
REFERENCE 2 (bases 1 to 1065)
AUTHORS Dann,S.G., William,A.E., David,L.B. and Craig,H.W.
TITLE Direct Submission
JOURNAL Submitted (08-JAN-2003) Biology, University of Victoria, PO Box
3020 Stn CSC, Victoria, British Columbia V8W 3N5, Canada
FEATURES Location/Qualifiers
source 1..1065
/organism="Oncorhynchus nerka"
/mol_type="mRNA"
/db_xref="taxon:8023"
CDS 1..1065
/codon_start=1
/product="RH1 opsin"
/protein_id="AAP58347.1"
/db_xref="GI:37787242"
/translation="MNGTEGPDFYVPMSNATGIVRNPYEYPQYYLVSPAAYSLMAAYM
FFLILTGFPINFLTLYVTIEHKKLRTALNYILLNLAVADLFMVIGGFTTTMYTSMHGY
FVFGRTGCNIEGFCATHGGEIALWSLVVLAIERWLVVCKPISNFRFSETHAIIGVAFT
WVMAAACSVPPLLGWSRYIPEGMQCSCGIDYYTRAPDINNESFVIHMFVVHFMIPLFI
ISFCYGNLLCAVKAAAAAQQESETTQRAEREVTRMVIMMVVSFLVCWVPYASVAWYIF
CNQGTEFGPVFMTIPAFFAKSSSLYNPLIYVLMNKQFRNCMITTLCCGKNPFEEEEGA
STTASKTEASSVSSSSVAPA"
ORIGIN
1 atgaacggca cagagggacc agatttctac gtccctatgt ccaatgctac tggcattgtt
61 aggaacccct atgaataccc ccagtactac cttgtcagcc cagcggcgta ctcactcatg
121 gctgcctaca tgttcttcct catcctcacc ggcttcccca tcaacttcct cacactctat
181 gtcaccatcg agcacaaaaa gctgaggacc gccctgaact acatcctgct gaacctggct
241 gtggccgatc tcttcatggt aatcggaggc ttcaccacta cgatgtacac ctccatgcat
301 ggctatttcg tctttggaag aacgggctgc aacatcgagg gattctgtgc tacccatggt
361 ggtgagattg ccctatggtc cctggttgtc ctggctattg agaggtggtt ggtcgtctgc
421 aaacctatta gcaacttccg cttcagtgag acccatgcca tcataggcgt ggcctttacc
481 tgggtcatgg ctgctgcttg ctccgtcccc cctctgcttg ggtggtcccg ctatatcccc
541 gaaggcatgc agtgctcatg tggaattgac tactacacgc gcgcccctga catcaacaat
601 gagtcctttg tcatccacat gttcgttgtc cactttatga ttcccctgtt catcatctcc
661 ttctgctacg gcaacctgct ctgcgctgtc aaggcagctg ccgccgccca gcaggagtct
721 gagaccaccc agagggctga gagggaagtg acccgcatgg tcatcatgat ggtcgtctcc
781 ttcctagtgt gctgggtgcc ctacgccagc gtggcctggt atatcttctg caaccaggga
841 acagagttcg gccccgtctt catgacaatt ccggcattct ttgccaagag ttcgtccctg
901 tacaaccctc tcatctacgt gttgatgaac aagcagttcc gcaactgcat gatcaccacc
961 ctgtgctgtg ggaagaaccc cttcgaggag gaggagggag cctccaccac tgcctccaag
1021 accgaggcct cctccgtgtc ctccagctcc gtggctcctg cataa
//
FASTA
>gi|37787241|gb|AY214156.1| Oncorhynchus nerka RH1 opsin mRNA, complete cds
ATGAACGGCACAGAGGGACCAGATTTCTACGTCCCTATGTCCAATGCTACTGGCATTGTTAGGAACCCCT
ATGAATACCCCCAGTACTACCTTGTCAGCCCAGCGGCGTACTCACTCATGGCTGCCTACATGTTCTTCCT
CATCCTCACCGGCTTCCCCATCAACTTCCTCACACTCTATGTCACCATCGAGCACAAAAAGCTGAGGACC
GCCCTGAACTACATCCTGCTGAACCTGGCTGTGGCCGATCTCTTCATGGTAATCGGAGGCTTCACCACTA
CGATGTACACCTCCATGCATGGCTATTTCGTCTTTGGAAGAACGGGCTGCAACATCGAGGGATTCTGTGC
TACCCATGGTGGTGAGATTGCCCTATGGTCCCTGGTTGTCCTGGCTATTGAGAGGTGGTTGGTCGTCTGC
AAACCTATTAGCAACTTCCGCTTCAGTGAGACCCATGCCATCATAGGCGTGGCCTTTACCTGGGTCATGG
CTGCTGCTTGCTCCGTCCCCCCTCTGCTTGGGTGGTCCCGCTATATCCCCGAAGGCATGCAGTGCTCATG
TGGAATTGACTACTACACGCGCGCCCCTGACATCAACAATGAGTCCTTTGTCATCCACATGTTCGTTGTC
CACTTTATGATTCCCCTGTTCATCATCTCCTTCTGCTACGGCAACCTGCTCTGCGCTGTCAAGGCAGCTG
CCGCCGCCCAGCAGGAGTCTGAGACCACCCAGAGGGCTGAGAGGGAAGTGACCCGCATGGTCATCATGAT
GGTCGTCTCCTTCCTAGTGTGCTGGGTGCCCTACGCCAGCGTGGCCTGGTATATCTTCTGCAACCAGGGA
ACAGAGTTCGGCCCCGTCTTCATGACAATTCCGGCATTCTTTGCCAAGAGTTCGTCCCTGTACAACCCTC
TCATCTACGTGTTGATGAACAAGCAGTTCCGCAACTGCATGATCACCACCCTGTGCTGTGGGAAGAACCC
CTTCGAGGAGGAGGAGGGAGCCTCCACCACTGCCTCCAAGACCGAGGCCTCCTCCGTGTCCTCCAGCTCC
GTGGCTCCTGCATAA
Williams Example (simple)
GenBank retrieval service
Genscan gene prediction service
Semantic level: GenBank record has_part genomic sequence; genomic sequence in
Syntactic level: GenBank record; FASTA sequence
EMBOSS seqret service
GenBank service
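The syntactic mismatch above (GenBank record out, FASTA sequence in) is the kind of gap a format-translating shim closes. A minimal sketch in Python follows, assuming a single well-formed record; the function name genbank_to_fasta and the simplified header (version plus definition, rather than the full gi|...|gb|...| form) are illustrative choices, not the behaviour of EMBOSS seqret.

```python
def genbank_to_fasta(record: str, line_width: int = 70) -> str:
    """Convert one GenBank flat-file record to a FASTA entry (illustrative shim)."""
    accession = ""
    definition_parts = []
    sequence_parts = []
    section = None  # which multi-line field we are currently inside

    for line in record.splitlines():
        if line.startswith("//"):            # end-of-record marker
            break
        if line.startswith("DEFINITION"):
            section = "definition"
            definition_parts.append(line[len("DEFINITION"):].strip())
        elif line.startswith("VERSION"):
            section = None
            accession = line.split()[1]      # e.g. AY214156.1
        elif line.startswith("ORIGIN"):
            section = "origin"
        elif section == "origin":
            # each sequence line is "<offset> block block ..."; drop the offset
            sequence_parts.extend(line.split()[1:])
        elif section == "definition" and line.startswith(" "):
            definition_parts.append(line.strip())   # wrapped DEFINITION text
        else:
            section = None                   # any other field ends the definition

    sequence = "".join(sequence_parts).upper()
    header = f">{accession} {' '.join(definition_parts)}".rstrip()
    body = [sequence[i:i + line_width] for i in range(0, len(sequence), line_width)]
    return "\n".join([header] + body)


SAMPLE = """LOCUS       AY214156  1065 bp  mRNA  linear  VRT 07-MAY-2004
DEFINITION  Oncorhynchus nerka RH1 opsin mRNA, complete cds.
ACCESSION   AY214156
VERSION     AY214156.1  GI:37787241
ORIGIN
        1 atgaacggca cagagggacc agatttctac gtccctatgt ccaatgctac tggcattgtt
       61 aggaacccct
//
"""

print(genbank_to_fasta(SAMPLE))
```

Note that the shim changes only the syntax: the semantic type (genomic sequence) is the same on both sides, which is what makes it experimentally neutral.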
Graves’ disease
ArrayExpress
Gene clustering service
Semantic level: Microarray expression data out; Microarray expression data in
Syntactic level: Affymetrix CEL file; Treeview format
Example data
CellHeader=X Y MEAN STDV NPIXELS
0 0 112.0 24.4 25
1 0 10699.0 1340.6 20
2 0 147.0 42.4 25
3 0 10602.0 2126.2 25
4 0 100.8 29.9 20
5 0 96.0 11.9 25
6 0 9829.0 1983.4 25
7 0 133.3 21.6 20
8 0 9092.0 1470.7 25
CEL format
Probe_Id Sample1236
1000_at 147
1001_at 96
1002_at -59
Treeview format
Template
Cell header Probe ID
2 0 1000_at
5 0 1001_at
2 3 1002_at
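The three fragments above suggest how the template bridges the two formats: it maps a cell position (X, Y) in the CEL file to a probe ID in the Treeview file. A rough Python sketch of that mapping follows. It assumes one cell per probe and copies the MEAN value straight across, which is a simplification for illustration; real probe sets span many cells and a real service would summarise them. All function names are hypothetical.

```python
def parse_cel_intensities(cel_text):
    """Map (x, y) -> MEAN for each data row under the CellHeader line."""
    means = {}
    in_rows = False
    for line in cel_text.splitlines():
        if line.startswith("CellHeader="):
            in_rows = True
            continue
        fields = line.split()
        if in_rows and len(fields) == 5:     # X Y MEAN STDV NPIXELS
            means[(int(fields[0]), int(fields[1]))] = float(fields[2])
    return means


def parse_template(template_text):
    """Map (x, y) -> probe ID from lines of the form 'x y probe_id'."""
    mapping = {}
    for line in template_text.splitlines():
        fields = line.split()
        if len(fields) == 3 and fields[0].isdigit() and fields[1].isdigit():
            mapping[(int(fields[0]), int(fields[1]))] = fields[2]
    return mapping


def cel_to_treeview(cel_text, template_text, sample_name):
    """Emit tab-separated 'Probe_Id <sample>' rows for cells named in the template."""
    means = parse_cel_intensities(cel_text)
    rows = [f"Probe_Id\t{sample_name}"]
    for cell, probe_id in parse_template(template_text).items():
        if cell in means:                    # skip probes with no measured cell
            rows.append(f"{probe_id}\t{means[cell]:g}")
    return "\n".join(rows)


CEL = """CellHeader=X Y MEAN STDV NPIXELS
0 0 112.0 24.4 25
1 0 10699.0 1340.6 20
2 0 147.0 42.4 25
5 0 96.0 11.9 25
"""
TEMPLATE = """Cell header Probe ID
2 0 1000_at
5 0 1001_at
2 3 1002_at
"""

print(cel_to_treeview(CEL, TEMPLATE, "Sample1236"))
```

With the sample rows shown on the slide, cell (2, 0) yields 147 for 1000_at and cell (5, 0) yields 96 for 1001_at; cell (2, 3) is not in the excerpt, so 1002_at is skipped here.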
Graves’ disease
ArrayExpress
Gene clustering service
Semantic level: Microarray expression data out; Microarray expression data in
Syntactic level: Affymetrix CEL file; Treeview format
AffyR service
Template file
Classification of shims
Defn: experimentally neutral service used to connect domain services that don’t quite fit
Shim service
• FILTER
• MAPPER
• DEREFERENCER
• TRANSLATOR
  – syntax (e.g. GenBank to EMBL)
  – data (e.g. DNA to protein)
• TRANSFORMER
• SIFTER (SQL SELECT type operation)
• PARSER (SQL PROJECT type operation), also known as SPLITTER or DECOMPOSER
• COMPARER
• SORTER
Providing more assistance
Taverna workbench
1. Register
2. Annotate (Pedro)
3. Query (Taverna workbench)
myGrid’s model of services
service: name, description, author, organisation
WSDL service
operation: name, description, input, output, task, method, resource, application
workflow
bioMoby service
WSDL operation
Soaplab service
parameter: name, description, semantic type, format, transport type, collection type, collection format
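One way to read the model above is as three nested record types. The Python sketch below mirrors the attribute lists on the slide; the class layout, the field defaults and the example service name are assumptions made for illustration and are not the myGrid schema itself.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Parameter:
    name: str
    description: str = ""
    semantic_type: Optional[str] = None      # ontology concept, e.g. a URI
    format: Optional[str] = None             # syntactic format, e.g. FASTA
    transport_type: Optional[str] = None
    collection_type: Optional[str] = None
    collection_format: Optional[str] = None


@dataclass
class Operation:
    name: str
    description: str = ""
    inputs: List[Parameter] = field(default_factory=list)
    outputs: List[Parameter] = field(default_factory=list)
    task: Optional[str] = None
    method: Optional[str] = None
    resource: Optional[str] = None
    application: Optional[str] = None


@dataclass
class Service:
    name: str
    description: str = ""
    author: Optional[str] = None
    organisation: Optional[str] = None
    operations: List[Operation] = field(default_factory=list)


# Hypothetical instance; only the organisation, operation name and task
# are taken from the XML fragment shown in the slides.
example = Service(
    name="example alignment service",
    organisation="http://genetics.man.ac.uk",
    operations=[
        Operation(
            name="execute",
            task="http://www.mygrid.org.uk/ontology#pairwise_local_aligning",
        )
    ],
)
print(example.operations[0].task)
```

Keeping semantic type and format as separate parameter fields is what lets the semantic and syntactic levels be matched independently.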
Service Description Flow
Discovery Client
Semantic Indexing Component
Registry
XML document describing service
Extract service descriptions to reason over
Pedro
Jena RDF repository
Instance Store
FACT DL reasoner
<serviceDescription>
  <organisation>http://genetics.man.ac.uk</organisation>
  <operation>
    <name>execute</name>
    <task>http://www.mygrid.org.uk/ontology#pairwise_local_aligning</task>
    …..
Pedro
XML
RDF
Queries possible within RDF repository:
• Find me an operation called “exec*”
• Find me a service provided by groups working on Williams disease
• Find me an operation which performs aligning?
RDF
[Figure: RDF graph of the service description]
Nodes: a1234, a2, a3, “execute”, http://genetics.man.ac.uk, #service, #operation, #local_pairwise_aligning, #aligning
Edge labels: hasOperation, published_by, type, subclass, name, task
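The graph can be approximated as a list of triples, which is enough to show how queries over the RDF repository work. In the Python sketch below, the way the edges connect the nodes is one plausible reading of the flattened figure and should be treated as an assumption; the helper names are hypothetical and no RDF library is used.

```python
from fnmatch import fnmatch

# (subject, predicate, object); the wiring is an assumed reading of the figure
TRIPLES = [
    ("a1234", "type", "#service"),
    ("a1234", "published_by", "http://genetics.man.ac.uk"),
    ("a1234", "hasOperation", "a2"),
    ("a2", "type", "#operation"),
    ("a2", "name", "execute"),
    ("a2", "task", "a3"),
    ("a3", "type", "#local_pairwise_aligning"),
    ("#local_pairwise_aligning", "subclass", "#aligning"),
]


def objects(subject, predicate):
    return [o for s, p, o in TRIPLES if s == subject and p == predicate]


def subjects(predicate, obj):
    return [s for s, p, o in TRIPLES if p == predicate and o == obj]


def is_a(cls, ancestor):
    """True if cls equals ancestor or reaches it by following 'subclass' edges."""
    if cls == ancestor:
        return True
    return any(is_a(parent, ancestor) for parent in objects(cls, "subclass"))


def operations_named(pattern):
    """'Find me an operation called exec*' using shell-style wildcards."""
    return [op for op in subjects("type", "#operation")
            if any(fnmatch(n, pattern) for n in objects(op, "name"))]


def operations_performing(task_class):
    """'Find me an operation which performs aligning', following asserted subclass links."""
    found = []
    for op in subjects("type", "#operation"):
        task_types = [t for task in objects(op, "task") for t in objects(task, "type")]
        if any(is_a(t, task_class) for t in task_types):
            found.append(op)
    return found


print(operations_named("exec*"))           # ['a2']
print(operations_performing("#aligning"))  # ['a2']
```

Everything here depends on links that are explicitly asserted in the graph; nothing is inferred from the definitions of the classes themselves.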
Queries not possible:
• Find me an operation which performs aligning which is local?
• Where does this service fit into a classification?
OWL classes
#service
#local_pairwise_aligning
#operation
OWL property restriction: hasOperation
OWL property restriction: performsTask
Most specific class expression extracted
Definition: Service which has an operation which performs the task local pairwise aligning
OWL classes
service
aligning service
local aligning service
pairwise local aligning service
Each service class has its own property based OWL definition
Instance store indexes our service instance (a1234) in the appropriate place
Classification calculated by the FACT reasoner using property based definitions
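To make the idea of indexing an instance under its most specific class concrete, here is a toy Python sketch. It is not a description logic reasoner and does not reproduce what FACT or the instance store do; it only checks a single "performs task" restriction against a small hand-written task hierarchy. The task names and the helper functions are assumptions modelled on the class names above.

```python
# Toy task hierarchy (child -> parent); names modelled on the slide, not the real ontology
TASK_PARENT = {
    "pairwise_local_aligning": "local_aligning",
    "local_aligning": "aligning",
    "aligning": None,
}

# Each service class is defined by the task one of its operations must perform
SERVICE_CLASSES = {
    "service": None,
    "aligning service": "aligning",
    "local aligning service": "local_aligning",
    "pairwise local aligning service": "pairwise_local_aligning",
}


def task_ancestors(task):
    """Return the task followed by all of its ancestors, most specific first."""
    chain = []
    while task is not None:
        chain.append(task)
        task = TASK_PARENT[task]
    return chain


def classify(operation_tasks):
    """Return every service class whose definition the instance satisfies."""
    performed = set()
    for task in operation_tasks:
        performed.update(task_ancestors(task))
    return [name for name, required in SERVICE_CLASSES.items()
            if required is None or required in performed]


def most_specific(operation_tasks):
    """Pick the satisfied class whose required task sits deepest in the hierarchy."""
    def depth(class_name):
        required = SERVICE_CLASSES[class_name]
        return 0 if required is None else len(task_ancestors(required))
    return max(classify(operation_tasks), key=depth)


# a1234 has one operation whose task is pairwise local aligning
print(classify(["pairwise_local_aligning"]))
print(most_specific(["pairwise_local_aligning"]))  # pairwise local aligning service
```

Because the instance is indexed under the most specific class, a query for any ancestor class (for example "aligning service") also retrieves it.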
Query by navigation
Service browser
Service classified by task
Use of ontologies
• Property based classification requires property based modelling
• Advantages
  – Explicit, machine interpretable, easier to maintain large ontologies with polyhierarchies
• Disadvantages
  – Complex definitions take time/skill to author, require expert domain knowledge
  – Difficult to present back to the user
Property based classification on steroids
[Figure, built up over several slides: parallel hierarchies linked by properties]
Data: RNA sequence data, DNA sequence data, nucleic acid sequence data
  encodes
Feature: RNA sequence, DNA sequence, nucleic acid sequence
  sequence_of
Biological Concept: RNA, DNA, nucleic acid
  polymer_of
ribonucleotide, deoxyribonucleotide, nucleotide
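The point of the build-up appears to be that only one small hierarchy needs to be asserted by hand and the others follow from the property based definitions. The Python sketch below illustrates that with a deliberately naive subsumption check (same property, filler subsumed); it is an assumption-laden toy, not OWL semantics, and the dictionary encoding is hypothetical.

```python
# The only hierarchy asserted by hand (child -> parent)
ASSERTED_PARENT = {
    "ribonucleotide": "nucleotide",
    "deoxyribonucleotide": "nucleotide",
}

# Every other class is defined as "something that <property> <filler>"
DEFINITIONS = {
    "RNA": ("polymer_of", "ribonucleotide"),
    "DNA": ("polymer_of", "deoxyribonucleotide"),
    "nucleic acid": ("polymer_of", "nucleotide"),
    "RNA sequence": ("sequence_of", "RNA"),
    "DNA sequence": ("sequence_of", "DNA"),
    "nucleic acid sequence": ("sequence_of", "nucleic acid"),
    "RNA sequence data": ("encodes", "RNA sequence"),
    "DNA sequence data": ("encodes", "DNA sequence"),
    "nucleic acid sequence data": ("encodes", "nucleic acid sequence"),
}


def subsumes(general, specific):
    """True if 'general' is the same as, or an inferred ancestor of, 'specific'."""
    if general == specific:
        return True
    if specific in ASSERTED_PARENT:
        return subsumes(general, ASSERTED_PARENT[specific])
    if general in DEFINITIONS and specific in DEFINITIONS:
        general_prop, general_filler = DEFINITIONS[general]
        specific_prop, specific_filler = DEFINITIONS[specific]
        # same property, and the general filler subsumes the specific filler
        return general_prop == specific_prop and subsumes(general_filler, specific_filler)
    return False


print(subsumes("nucleic acid sequence data", "RNA sequence data"))  # True
print(subsumes("DNA sequence data", "RNA sequence data"))           # False
```

The data level hierarchy is never written down: it falls out of the nucleotide hierarchy through polymer_of, sequence_of and encodes.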
Human readable ontologies
GROWL parser
OWL API
Reasoner
OWL API
GROWL renderer
Only data to hand
• Metadata associated with data items.
• Life science identifier (LSID) protocol used to retrieve metadata.
• Metadata model similar to service parameter
Data item: name, description, semantic type, format, collection type, collection format
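An LSID is a URN of the form urn:lsid:authority:namespace:object, with an optional trailing revision. A small Python sketch of splitting one into its parts follows; the LSID class and parse_lsid function are hypothetical helpers for illustration and do not implement the LSID resolution protocol used to fetch metadata.

```python
from typing import NamedTuple, Optional


class LSID(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str] = None


def parse_lsid(text: str) -> LSID:
    """Split 'urn:lsid:authority:namespace:object[:revision]' into its parts."""
    parts = text.split(":")
    if len(parts) not in (5, 6):
        raise ValueError(f"not an LSID: {text!r}")
    if parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"not an LSID: {text!r}")
    if any(not part for part in parts[2:5]):
        raise ValueError(f"empty LSID component in {text!r}")
    return LSID(*parts[2:])


print(parse_lsid("urn:lsid:taverna:datathing:15"))
```

The identifier itself carries no metadata; it is only the key under which data and metadata are retrieved.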
Provenance (1)
Workflow run
Workflow design
Experiment design
Project
Person
Organisation
Process
Service
Event
Data item
Relationships:
– data derivation, e.g. output data derived from input data
– knowledge statements, e.g. similar protein sequence to
– instanceOf
– partOf
– componentProcess, e.g. web service invocation of BLAST @ NCBI
– componentEvent, e.g. completion of a web service invocation at 12.04pm
– runBy, e.g. BLAST @ NCBI
– run for
Levels: Organisation level provenance; Process level provenance; Data/knowledge level provenance
User can add templates to each workflow process to determine links between data items.
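A template of this kind could be as simple as a rule saying which output port is related to which input port, and by what property. The Python sketch below shows that reading. The template shape, the port names and the direction of the similar_sequences_to link are assumptions made for illustration; they are not the Taverna template format.

```python
def apply_templates(templates, inputs, outputs):
    """Expand (output_port, predicate, input_port) templates into triples.

    'inputs' and 'outputs' map port names to the LSIDs of the data items
    seen on those ports during one invocation of the process.
    """
    triples = []
    for out_port, predicate, in_port in templates:
        if out_port in outputs and in_port in inputs:
            triples.append((outputs[out_port], predicate, inputs[in_port]))
    return triples


# Hypothetical template for a BLAST process
BLAST_TEMPLATES = [("report", "similar_sequences_to", "query_sequence")]

statements = apply_templates(
    BLAST_TEMPLATES,
    inputs={"query_sequence": "urn:lsid:taverna:datathing:13"},
    outputs={"report": "urn:lsid:taverna:datathing:15"},
)
for statement in statements:
    print(statement)
```

Each run of the workflow then adds knowledge-level statements between concrete data items without the user having to annotate every result by hand.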
19747251  AC005089.3   831   Homo sapiens BAC clone CTA-315H11 from 7, complete sequence
15145617  AC073846.6   815   Homo sapiens BAC clone RP11-622P13 from 7, complete sequence
15384807  AL365366.20  46.1  Human DNA sequence from clone RP11-553N16 on chromosome 1, complete sequence
7717376   AL163282.2   44.1  Homo sapiens chromosome 21 segment HS21C082
16304790  AL133523.5   44.1  Human chromosome 14 DNA sequence BAC R-775G15 of library RPCI-11 from chromosome 14 of Homo sapiens (Human), complete sequence
34367431  BX648272.1   44.1  Homo sapiens mRNA; cDNA DKFZp686G08119 (from clone DKFZp686G08119)
5629923   AC007298.17  44.1  Homo sapiens 12q22 BAC RPCI11-256L6 (Roswell Park Cancer Institute Human BAC Library) complete sequence
34533695  AK126986.1   44.1  Homo sapiens cDNA FLJ45040 fis, clone BRAWH3020486
20377057  AC069363.10  44.1  Homo sapiens chromosome 17, clone RP11-104J23, complete sequence
4191263   AL031674.1   44.1  Human DNA sequence from clone RP4-715N11 on chromosome 20q13.1-13.2 Contains two putative novel genes, ESTs, STSs and GSSs, complete sequence
17977487  AC093690.5   44.1  Homo sapiens BAC clone RP11-731I19 from 2, complete sequence
17048246  AC012568.7   44.1  Homo sapiens chromosome 15, clone RP11-342M21, complete sequence
14485328  AL355339.7   44.1  Human DNA sequence from clone RP11-461K13 on chromosome 10, complete sequence
5757554   AC007074.2   44.1  Homo sapiens PAC clone RP3-368G6 from X, complete sequence
4176355   AC005509.1   44.1  Homo sapiens chromosome 4 clone B200N5 map 4q25, complete sequence
2829108   AF042090.1   44.1  Homo sapiens chromosome 21q22.3 PAC 171F15, complete sequence
>gi|19747251|gb|AC005089.3| Homo sapiens BAC clone CTA-315H11 from 7, complete sequence
AAGCTTTTCTGGCACTGTTTCCTTCTTCCTGATAACCAGAGAAGGAAAAGATCTCCATTTTACAGATGAGGAAACAGGCTCAGAGAGGTCAAGGCTCTGGCTCAAGGTCACACAGCCTGGGAACGGCAAAGCTGATATTCAAACCCAAGCATCTTGGCTCCAAAGCCCTGGTTTCTGTTCCCACTACTGTCAGTGACCTTGGCAAGCCCTGTCCTCCTCCGGGCTTCACTCTGCACACCTGTAACCTGGGGTTAAATGGGCTCACCTGGACTGTTGAGCG
[Figure: provenance graph around a BLAST report]
Data items: urn:lsid:taverna:datathing:15, urn:lsid:taverna:datathing:13
rdf:type values: ..BLAST_Report, ..nucleotide_sequence
Links between data items (A, B): ..similar_sequences_to, ..masked_sequence_of, ..filtered_version_of
Other nodes: service invocation, workflow invocation, workflow definition, experiment definition, project, person, group, service description, organisation
Other edge labels: ..created_by, ..described_by, ..run_during, ..invocation_of, ..part_of, ..works_for, ..author, ..run_for
Callouts: Relationship BLAST report has with other items in the repository; Other classes of information related to BLAST report
Provenance tracking
Using IBM’s Haystack
GenBank record
Portion of the Web of provenance
Managing collection of sequences for review
Storage
• LSID has no protocol for storage
• Taverna/ Freefluo implements its own data/ metadata storage protocol
Taverna/Freefluo
Metadata Store
Data store
Publish interface
data
metadata
Retrieval
• LSID protocol used to retrieve data and metadata
• Query handled separately
Metadata Store
Data store
LSID interface
LSID aware client
Query
RDF aware client
Queries within Workflows
Grid Data Service
query
query result
Semantic content of result depends on query and data source schema
Select GO_ID FROM GO WHERE GO.term LIKE 'enzyme activity';
Select GO_Annotation_ID FROM GOA WHERE GO.term LIKE 'enzyme activity';
Gene ontology term ID
protein ID
Distributed Query Processing
• DQP linked with the OGSA-DAI activity
• Built within myGrid project
• Plans execution of a query over multiple Grid Data Services
• Each Grid Data Service provides schema metadata
• Currently no semantic mediation
Example query
• select p.proteinId, blast(p.sequence)
  from p in protein, t in proteinTerm
  where t.termId = 'GO:0008372' and p.proteinId = t.proteinId
• “Select proteins and homologous proteins from SWISS-PROT which have been annotated with GO:0008372”
Gene ontology database: t.proteinId
SWISS-PROT protein database: p.proteinId
t.proteinId = p.proteinId: both are data encoding the identity of a protein in SWISS-PROT namespace
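The join in the example query is only meaningful because both proteinId columns carry the same semantic type. The Python sketch below evaluates the query over two in-memory stand-ins for the data services and checks the semantic types first. The rows, the semantic type labels and the blast stub are invented for illustration; this is not how OGSA-DQP plans or executes a query, and the slides state that DQP currently has no semantic mediation.

```python
# Stand-ins for two Grid Data Services; all rows are invented
PROTEIN = [
    {"proteinId": "P1", "sequence": "MEKLN"},
    {"proteinId": "P2", "sequence": "MNGTE"},
]
PROTEIN_TERM = [
    {"proteinId": "P1", "termId": "GO:0008372"},
    {"proteinId": "P2", "termId": "GO:0000001"},
]

# Hypothetical semantic annotations on the schema of each source
SEMANTIC_TYPE = {
    ("protein", "proteinId"): "SWISS-PROT protein identifier",
    ("proteinTerm", "proteinId"): "SWISS-PROT protein identifier",
    ("proteinTerm", "termId"): "Gene Ontology term identifier",
}


def blast(sequence):
    """Placeholder for a call to a BLAST service."""
    return f"blast-result-for:{sequence}"


def joinable(left, right):
    """Two columns may be joined only if both are annotated with the same type."""
    left_type = SEMANTIC_TYPE.get(left)
    return left_type is not None and left_type == SEMANTIC_TYPE.get(right)


def run_query(term_id):
    """select p.proteinId, blast(p.sequence) ... where t.termId = term_id and ids match."""
    if not joinable(("protein", "proteinId"), ("proteinTerm", "proteinId")):
        raise ValueError("join columns do not share a semantic type")
    wanted = {t["proteinId"] for t in PROTEIN_TERM if t["termId"] == term_id}
    return [(p["proteinId"], blast(p["sequence"]))
            for p in PROTEIN if p["proteinId"] in wanted]


print(run_query("GO:0008372"))  # [('P1', 'blast-result-for:MEKLN')]
```

A check like joinable is where semantic descriptions of data could feed into query planning.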
DQP Plan
TAMBIS I
Query 1: Select motifs for antigenic human proteins that participate in apoptosis and are homologous to the lymphocyte associated receptor of death (also known as lard).
Translation: Select patterns in the proteins that invoke an immunological response and participate in programmed cell death that are similar in their sequence of amino acids to the protein that is associated with triggering cell death in the white cells of the immune system.
(A) Ontology expression: Motif which <isComponentOf (Protein which <hasOrganismClassification Species functionsInProcess Apoptosis hasFunction Antigen isHomologousTo (Protein which <hasName ProteinName>)>)>
Species: Is instantiated by value “human”
ProteinName: Is instantiated by value “lard”
TAMBIS II
• Informal query plan:
  – Select proteins with protein name “lard” from SWISS-PROT
  – Execute a BLAST sequence alignment process against SWISS-PROT results
  – Check the entries for apoptosis process and antigen function
  – Pass the resultant sequences to PROSITE to scan for their motifs
• CPL expression:
  set-unique {(#motif1:motif1) |
    \protein3 <- get-sp-entries-by-de("lard"),
    \protein2 <- do-blastp-by-sq-in-entry(protein3),
    check-sp-entries-by-kwd("apoptosis",protein2),
    check-sp-entries-by-de("antigen",protein2),
    check-sp-entry-for-species("human",protein2),
    \motif1 <- do-ps-scan-by-sq-in-entry(protein2)}
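The CPL expression is a comprehension: generators bind variables, the check functions act as filters, and the result is a duplicate-free collection. The same shape can be written as a Python set comprehension. In the sketch below every function is a stub named after its CPL counterpart, and the entries, BLAST hits and motif names are invented for illustration; the result is also simplified to a set of motif values rather than CPL records.

```python
# Invented stand-in for SWISS-PROT entries and their PROSITE motifs
ENTRIES = {
    "E1": {"de": "lard", "kwd": ["apoptosis"], "species": "human", "motifs": ["M1"]},
    "E2": {"de": "antigen-like receptor", "kwd": ["apoptosis"], "species": "human",
           "motifs": ["M1", "M2"]},
    "E3": {"de": "unrelated antigen", "kwd": [], "species": "mouse", "motifs": ["M3"]},
}
SIMILAR = {"E1": ["E2", "E3"]}  # stand-in for BLAST hits


def get_sp_entries_by_de(text):
    return [entry for entry, record in ENTRIES.items() if text in record["de"]]


def do_blastp_by_sq_in_entry(entry):
    return SIMILAR.get(entry, [])


def check_sp_entries_by_kwd(keyword, entry):
    return keyword in ENTRIES[entry]["kwd"]


def check_sp_entries_by_de(text, entry):
    return text in ENTRIES[entry]["de"]


def check_sp_entry_for_species(species, entry):
    return ENTRIES[entry]["species"] == species


def do_ps_scan_by_sq_in_entry(entry):
    return ENTRIES[entry]["motifs"]


# Generators and filters appear in the same order as in the CPL expression
motifs = {
    motif1
    for protein3 in get_sp_entries_by_de("lard")
    for protein2 in do_blastp_by_sq_in_entry(protein3)
    if check_sp_entries_by_kwd("apoptosis", protein2)
    if check_sp_entries_by_de("antigen", protein2)
    if check_sp_entry_for_species("human", protein2)
    for motif1 in do_ps_scan_by_sq_in_entry(protein2)
}
print(sorted(motifs))  # ['M1', 'M2']
```

The order of generators fixes the plan: each later step runs only on what survived the earlier ones, which matches the informal query plan.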
select p.proteinId, blast(p.sequence)
from p in protein, t in proteinTerm
where t.termId = 'GO:0008372' and p.proteinId = t.proteinId
Overview
• How we did it in the past
  – Service type directory
• How we currently plan to do it
  – Shims, GenBank, microarray
• How we may want to do it in the future
  – DQP & TAMBIS
• We’re not attacking the same problem
• When would your problem become our problem?
• Common descriptions of the core entities involved.
  – Data items, Datasets, Services.