“Shopping for data should be as easy
as shopping for shoes!!”
Carole Goble
Evaluating Hypotheses using SPARQL-DL as an
abstract workflow language to choreograph
SADI Semantic Web Services
Mark Wilkinson, Isaac Peral Senior Researcher in Biological Informatics
Centro de Biotecnología y Genómica de Plantas, UPM.
Start at the end...
We wanted to duplicate a real, peer-reviewed, bioinformatics analysis
simply by providing a model describing what the answer
(if one exists) would look like...
...the machine had to make every other decision
on its own
This is the study we chose:
Gordon, P.M.K., Soliman, M.A., Bose, P., Trinh, Q., Sensen, C.W., Riabowol, K.: Interspecies data mining to predict novel ING-protein interactions in human. BMC Genomics 9, 426 (2008).
Original Study - simplified and abstracted:
Using what is known about interactions in other species, predict new interactions in your species of interest
Given a protein P in Species X
Find proteins similar to P in Species Y
Retrieve interactors in Species Y
Sequence-compare Y-interactors with Species X genome
(1) Keep only those with homologue in X
Find proteins similar to P in Species Z
Retrieve interactors in Species Z
Sequence-compare Z-interactors with (1)
Putative interactors in Species X
For our prototype study, we simplified this further to:
Run that workflow twice (x 2), once per model organism
Then intersect the results
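As a rough illustration only, the simplified study can be sketched in a few lines of Python. The protein names, the toy lookup tables and the function `putative_interactors` are all invented for this sketch; in the real study each lookup is a sequence-similarity search or an interaction-database query.

```python
# Toy stand-ins for sequence-similarity searches and interaction databases.
# Every identifier below is invented for illustration.
HOMOLOGS = {
    # (protein, target species) -> similar proteins in that species
    ("P", "Y"): {"Py"},
    ("P", "Z"): {"Pz"},
    ("Ay", "X"): {"A"},
    ("By", "X"): {"B"},
    ("Az", "X"): {"A"},
    ("Cz", "X"): {"C"},
}
INTERACTORS = {
    # protein -> proteins it is known to interact with (same species)
    "Py": {"Ay", "By"},
    "Pz": {"Az", "Cz"},
}


def putative_interactors(protein, model_species, species_of_interest="X"):
    """One pass of the abstract workflow for a single model organism."""
    found = set()
    for homolog in HOMOLOGS.get((protein, model_species), set()):
        for interactor in INTERACTORS.get(homolog, set()):
            # Keep only interactors that have a homologue back in species X.
            found |= HOMOLOGS.get((interactor, species_of_interest), set())
    return found


# Run the same abstract workflow twice, then intersect.
from_y = putative_interactors("P", "Y")
from_z = putative_interactors("P", "Z")
probable = from_y & from_z
print(sorted(probable))
```

The two calls are identical in the abstract; only the species argument differs, which is exactly where the concrete workflows diverge.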
The tricky part is...
In the abstract, the two workflows are identical
but in reality they will be different because they call for information
from different species
Modeling the result – Step 1
OWL
Web Ontology Language (OWL) is the language approved by the W3C
for representing knowledge on the Web
ProbableInteractor:
is homologous to (
protein from ModelOrganism1…) # Potential Interactor in previous slide
and
protein from ModelOrganism2…) # Potential Interactor in previous slide
Modeling the result – Step 2
Probable Interactor is defined in OWL as a subclass of Potential Interactor that requires homologous pairs of interacting proteins to exist in both
comparator model organisms.
(Effectively, an intersection)
We then publish our OWL model of a Probable Interactor on the Web
In a local data-file we provide the protein we are interested in, and the two species we wish to use in our comparison
taxon:9606 a i:OrganismOfInterest . # human
uniprot:Q9UK53 a i:ProteinOfInterest . # ING1
taxon:4932 a i:ModelOrganism1 . # yeast
taxon:7227 a i:ModelOrganism2 . # fly
These four lines represent all of the data provided to the query I am about to show you...
PREFIX i: <http://sadiframework.org/ontologies/InteractingProteins.owl#>
SELECT ?protein
FROM <file:/local/workflow.input.n3>
WHERE {
  ?protein a i:ProbableInteractor .
}
This is the question we ask:
The reference to our OWL model of the answer
Our system then derives (and executes) the following workflow automatically
These are different Web services!
...selected at run-time based on the same model
There are two very cool things about what you just saw...
The system was able to create a workflow based on
an OWL model
The workflow it created (i.e. the services chosen)
differed depending on context
We got the answer
“simply” by designing a model of the answer!
Start at the beginning...
Semantic Automated Discovery and Integration
http://sadiframework.org
Microsoft Research
A Semantic Web-focused Web Services specification
Web Services vs. Semantic Web
Web Services: XML + XML Schema
Semantic Web: RDF + OWL
Web Services: POST of SOAP
Semantic Web: GET of RDF
Web Services: No (rigorous) semantics
Semantic Web: Rich, flexible semantics
Web Services & Semantic Web: fundamentally different Web technologies
A design-practice for Web Service provision on the
Semantic Web
100% standards-compliant with no “invented” standards
Lightweight (only 2 inter-related “rules”)
Rules come from observations:
Web Services in Bioinformatics create implicit biological relationships
between their input and output
SADI Observation #1:
SeqRet
SADI Design Practice #1
Make the implicit explicit…
A Web Service should create “triples” linking the input data to the output data thus
explicitly describing the semantic relationship between them
SADI Best Practice #1
This is what bioinformatics Web Services implicitly do anyway!
Easy to implement this as a best-practice
...and makes SADI Services a source of Linked Data
SADI Observation #2: HTTP GET and POST
GET guarantees the response relates to the request URI
in a very precise and predictable way
POST does not…
SADI Observation #2: HTTP GET and POST
That’s why Web Services have a fundamentally different behaviour than the Semantic Web
We can fix that!
(without breaking any existing rules or standards!)
SADI Best Practice #2
SUBJECT URI of the output graph (triples)
is the same as
SUBJECT URI of the input graph (triples)
(the output is “about” the input... Now explicitly!)
[Figure: SeqRet takes BRCA1 (rdf:type GeneID) as input and returns BRCA1 (rdf:type GeneID; hasDNASequence AGCTTA...); the subject, BRCA1, is the same in both graphs]
Consequence
Web Services now exhibit a very similar behavior to the Web itself
POST “behaves like” GET
SADI Interfaces
Service Interfaces can be described by two OWL classes
(this is 100% compatible with the SAWSDL standard)
OWL Class #1: My Input Class
OWL Class #2: My Output Class
SADI Service Functionality
Consumes OWL Individuals of Class #1
Returns OWL Individuals of Class #2
[Figure: SeqRet consumes BRCA1 (rdf:type GeneID) and returns BRCA1 (rdf:type GeneID; hasDNASequence AGCTTA...)]
...but the URI of those two individuals is the same; they are the same individual, now also a member of a new class.
In practice, of course, you don’t return the input data
Strip it and add the new data provided by the Service
But since the output is still “rooted” in the input node,
Input and output are easily merged client-side
(just concatenate the output with the input)
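A minimal sketch of that behaviour, assuming graphs are modelled as plain Python sets of (subject, predicate, object) tuples. The `seqret_service` function and its lookup table are hypothetical stand-ins, not the SADI API.

```python
# A triple is a (subject, predicate, object) tuple; a graph is a set of triples.
INPUT = {("gene:BRCA1", "rdf:type", "GeneID")}

# Hypothetical lookup standing in for the real sequence retrieval.
SEQUENCES = {"gene:BRCA1": "AGCTTA..."}


def seqret_service(input_graph):
    """Return only the new triples, each rooted at an input subject URI."""
    output = set()
    for subject, predicate, obj in input_graph:
        if predicate == "rdf:type" and obj == "GeneID" and subject in SEQUENCES:
            output.add((subject, "hasDNASequence", SEQUENCES[subject]))
    return output


output = seqret_service(INPUT)

# Client-side merge: the output shares its subject with the input,
# so combining the two graphs is a plain set union.
merged = INPUT | output
subjects = {s for s, _, _ in merged}
print(sorted(merged))
```

Because the service never mints a new subject, the merged graph still describes a single individual.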
Service Description
INPUT OWL Class NamedIndividual: things with a “name” property from “foaf” ontology
OUTPUT OWL Class GreetedIndividual: things with a “greeting” property from “hello” ontology
Input (POST http://example.org/myservice):
person:1 rdf:type hello:NamedIndividual .
person:1 foaf:name "Guy Incognito" .
Output:
person:1 rdf:type hello:GreetedIndividual .
person:1 hello:greeting "Hello, Guy Incognito!" .
How do we discover services?
Input and output are about the same “thing”
Therefore, to describe what a service does simply compare (“diff”) the
Input and Output OWL classes
This is not prescriptive! Just how we use it
Service Description
INPUT OWL Class NamedIndividual: things with a “name” property from “foaf” ontology
OUTPUT OWL Class GreetedIndividual: things with a “greeting” property from “hello” ontology
The service provides a “greeting” to a Named Individual based on its “name”
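The “diff” can be sketched as a set difference, under the simplifying assumption that each OWL class is reduced to the set of properties its members must carry. Real OWL classes are richer than this, and `describe_service` is a hypothetical helper.

```python
# Each class is flattened to the set of properties its members must have.
NAMED_INDIVIDUAL = {"foaf:name"}
GREETED_INDIVIDUAL = {"hello:greeting"}


def describe_service(input_class, output_class):
    """Return (properties consumed, properties added) for a service."""
    consumed = set(input_class)
    # Anything required by the output class but not by the input class
    # must have been attached by the service itself.
    added = set(output_class) - set(input_class)
    return consumed, added


consumed, added = describe_service(NAMED_INDIVIDUAL, GREETED_INDIVIDUAL)
print(f"needs {sorted(consumed)}, adds {sorted(added)}")
```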
Index all of the properties added by all of the services
under all circumstances
Service Discovery
Real-world Example
Input Data: BRCA1 rdf:type Gene ID
Output Data: BRCA1 hasDNASequence AGCTTAGCCA…
Registry Index: Service provides “hasDNASequence” property to Gene IDs
Simply search for the property of interest based on the data in-hand
Service Discovery
e.g. The question:
“what is the DNA sequence of BRCA1?”
Discover a SADI Web Service that generates the DNA Sequence property for gene identifiers
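One way to picture such a registry index is a dictionary keyed by the property a service adds and the class of data it accepts. This is a sketch only: the `Registry` class and the example.org service URLs are hypothetical and do not reflect the actual SADI registry interface.

```python
from collections import defaultdict


class Registry:
    """Index services by (property they add, class of input they accept)."""

    def __init__(self):
        self._index = defaultdict(list)

    def register(self, service_url, input_class, added_property):
        self._index[(added_property, input_class)].append(service_url)

    def find(self, wanted_property, data_class):
        # .get() avoids creating empty entries for unknown keys.
        return list(self._index.get((wanted_property, data_class), []))


registry = Registry()
registry.register("http://example.org/seqret", "GeneID", "hasDNASequence")
registry.register("http://example.org/goterms", "ProteinSequence", "hasGOTerm")

# "What is the DNA sequence of BRCA1?" becomes a lookup on the property
# wanted and the type of the data in hand.
hits = registry.find("hasDNASequence", "GeneID")
print(hits)
```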
DEMO: Knowledge Explorer Plug-in
For more information about the Knowledge Explorer surf to: http://io-informatics.com
SADI has just filled-in “Encodes” property for the three genes from the output of discovered Web Service(s)
Discover services that provide the hasGOTerm property for Protein Sequence datatype
This kind of “Web Service Surfing” is very intuitive for the Biologist!
No need to describe the algorithm or the database; just describe the properties that will be added
Semantic Health And Research Environment
SHARE answers arbitrary SPARQL queries by finding and executing SADI Services
Example #1
What is the phenotype of every allele of the Antirrhinum majus DEFICIENS gene?
SELECT ?allele ?image ?desc
WHERE {
  locus:DEF genetics:hasVariant ?allele .
  ?allele info:visualizedByImage ?image .
  ?image info:hasDescription ?desc
}
Note that there is no “FROM” clause! We don’t tell it where it should get the information. The machine has to figure that out by itself...
Enter that query into SHARE
Click “Submit”...
...and in a few seconds you get your answer.
Based on predicates in your query, SHARE utilized SADI to automatically discover the resources required to answer your question.
Because it is the Semantic Web, the query results are live hyperlinks to the respective database or images
Importantly
We posed and answered a complex database query
WITHOUT A DATABASE
(in fact, the data didn’t even have to exist... as I’ll now show you)
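The idea of answering a query without a database can be sketched as follows. This is not how SHARE is implemented: the three “services” are plain Python functions, the allele, image and description values are invented, and the sketch assumes every pattern has a constant or already-bound subject and a fresh variable as object.

```python
# Hypothetical services: each returns the values of one predicate for a subject.
def variants(subject):
    return {"locus:DEF": ["allele:1", "allele:2"]}.get(subject, [])


def images(subject):
    return {"allele:1": ["image:1"], "allele:2": ["image:2"]}.get(subject, [])


def descriptions(subject):
    table = {"image:1": ["toy description 1"], "image:2": ["toy description 2"]}
    return table.get(subject, [])


# "Discovery" is reduced to a lookup by predicate.
REGISTRY = {
    "genetics:hasVariant": variants,
    "info:visualizedByImage": images,
    "info:hasDescription": descriptions,
}

QUERY = [
    ("locus:DEF", "genetics:hasVariant", "?allele"),
    ("?allele", "info:visualizedByImage", "?image"),
    ("?image", "info:hasDescription", "?desc"),
]


def answer(query):
    """Resolve triple patterns in order, calling one service per predicate."""
    rows = [{}]  # each row maps variable names to bound values
    for subject, predicate, obj in query:
        service = REGISTRY[predicate]
        next_rows = []
        for row in rows:
            # Use the bound value if the subject is a variable, else the constant.
            bound_subject = row.get(subject, subject)
            for value in service(bound_subject):
                next_rows.append({**row, obj: value})
        rows = next_rows
    return rows


results = answer(QUERY)
for row in results:
    print(row["?allele"], row["?image"], row["?desc"])
```

No triple exists before the query runs; every binding is produced by a service call made in response to a predicate in the query.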
Example #2
Show me the latest Blood Urea Nitrogen and Creatinine levels of patients who appear to be rejecting their transplants
SELECT ?patient ?bun ?creat
FROM <http://sadiframework.org/ontologies/patients.rdf>
WHERE {
  ?patient rdf:type patient:LikelyRejecter .
  ?patient l:latestBUN ?bun .
  ?patient l:latestCreatinine ?creat .
}
Likely Rejecter:
A patient who has creatinine levels that are increasing over time
- Wilkinson “MD”
Our database contains various blood chemistry measurements at various time-points
…but there is no “likely rejecter” column or table in our database…
Semantics: SHARE determines, by itself, the need to do a Linear Regression analysis over Creatinine blood chemistry measurements
Semantics: SHARE determines, by itself, how and where that analysis can be done, and does it
The SHARE system utilizes Semantics (via SADI) to discover and access analytical services on the Web that do linear regression analysis
VOILA!
How does SHARE work?
Ontologies
Ontology Spectrum
[Figure: the ontology spectrum, from least to most expressive: Catalog/ID; Terms/glossary; Thesauri (“narrower term” relation); Informal is-a; Formal is-a; Formal instance; Frames (Properties); Value Restrs.; Selected Logical Constraints (disjointness, inverse, …); General Logical constraints]
Originally from AAAI 1999 Ontologies Panel by Gruninger, Lehmann, McGuinness, Uschold, Welty; updated by McGuinness. Description in: www.ksl.stanford.edu/people/dlm/papers/ontologies-come-of-age-abstract.html
Basic SADI/SHARE functionality
Ontology Spectrum
BETTER!!
WHY?
Because I say so!
Because it fulfils XYZ
Ontology Spectrum
Why is this data a member of this OWL Class? (and therefore valid input to the service)
Categorization Systems – like library shelves, inflexible
Discovery systems - flexible
In the upper end of the Ontology Spectrum,
if the data has the right properties
It can be discovered to be a valid input to a Service
regardless of how it was originally classified or which
ontology was used for that classification
...and in the context of a SHARE query
those individual properties may have been aggregated
from many different places;
The data becomes a valid input as properties aggregate
In exactly the same way that the OWL property restrictions of a SADI Input Class tell SHARE what
properties a service requires as input
The property restrictions of an OWL Class in the SPARQL query tell SHARE what properties it
needs to retrieve to create members of that class
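A small sketch of “membership by properties”, assuming a class definition is reduced to a set of required property names. The class name `ServiceInput`, the property names and both helper functions are invented for illustration.

```python
# Class definitions flattened to required properties (a simplification of
# OWL property restrictions, used here only for illustration).
REQUIRED = {
    "ServiceInput": {"hasSequence", "fromOrganism"},
}


def missing_properties(individual, class_name):
    """Properties still to be retrieved before the individual is a member."""
    return REQUIRED[class_name] - set(individual)


def is_member(individual, class_name):
    return not missing_properties(individual, class_name)


record = {"hasSequence": "AGCTTA..."}  # gathered from one source
before = is_member(record, "ServiceInput")
todo = missing_properties(record, "ServiceInput")

record["fromOrganism"] = "taxon:9606"  # aggregated later from another source
after = is_member(record, "ServiceInput")
print(before, sorted(todo), after)
```

The same set difference serves both purposes described above: it tells a service what it needs, and it tells the query engine what is left to fetch.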
Show me the latest Blood Urea Nitrogen and Creatinine levels of patients who appear to be rejecting their transplants
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX patient: <http://sadiframework.org/ontologies/patients.owl#>
PREFIX l: <http://sadiframework.org/ontologies/predicates.owl#>
SELECT ?patient ?bun ?creat
FROM <http://sadiframework.org/ontologies/patients.rdf>
WHERE {
  ?patient rdf:type patient:LikelyRejecter .
  ?patient l:latestBUN ?bun .
  ?patient l:latestCreatinine ?creat .
}
The definition of a Likely Rejecter category is encoded in a machine-readable document written in the OWL Ontology language
Basically:
“the regression line over creatinine measurements should have an increasing slope”
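The rule itself is simple enough to sketch directly. The code below uses an ordinary least-squares slope and invented measurements; it stands in for whatever regression service SHARE actually discovers, and the function names are hypothetical.

```python
def slope(points):
    """Least-squares slope of y over x for a list of (x, y) pairs.

    Assumes at least two points with distinct x values.
    """
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in points)
    denominator = sum((x - mean_x) ** 2 for x, _ in points)
    return numerator / denominator


def is_likely_rejecter(creatinine_series):
    """A patient whose creatinine regression line has an increasing slope."""
    return slope(creatinine_series) > 0


# Invented measurements: (day, creatinine level); values are illustrative only.
rising = [(0, 1.0), (7, 1.2), (14, 1.5), (21, 1.9)]
stable = [(0, 1.1), (7, 1.0), (14, 1.1), (21, 1.0)]
print(is_likely_rejecter(rising), is_likely_rejecter(stable))
```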
SHARE burrows down through the ontological definition to learn about what the properties of “regression models” are
SHARE utilizes SADI to discover analytical services on the Web that do linear regression analysis then other services that can
determine the “latest” measurements from a time-series
Let’s go through that again from a different perspective
OWL Classes that include property restrictions can be “executed” as if they were workflows
Ontology = Query = Workflow
WORKFLOW
QUERY:
SELECT images of mutations from genes in organism XXX that share homology to this gene in
organism YYY
Concept:
“Homologous Mutant Image”
As OWL Axioms
HomologousMutantImage is owl:equivalentTo {
Gene Q hasImage image P
Gene Q hasSequence Sequence Q
Gene R hasSequence Sequence R
Sequence Q similarTo Sequence R
Gene R = “my gene of interest” }
Those axioms combine to create an OWL Class:
Homologous Mutant Image
QUERY:
Retrieve owl:Homologous Mutant Image for
gene XXX
SHARE:
Decomposes the owl:Homologous Mutant Image Class, discovers SADI services
relevant to that class, and pipelines them together into a
workflow
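The pipelining step can be pictured as ordering services by what they need and what they add. In this sketch the three service names, their dependencies and the `plan` function are assumptions made for illustration; they simplify the axioms above and are not SHARE's planner.

```python
# Hypothetical services: the properties each needs and the one it adds.
SERVICES = {
    "getSequence": {"needs": set(), "adds": "hasSequence"},
    "findSimilar": {"needs": {"hasSequence"}, "adds": "similarTo"},
    "getImage": {"needs": {"similarTo"}, "adds": "hasImage"},
}


def plan(required_properties):
    """Order the services that supply the required properties into a pipeline."""
    remaining = {
        name: spec
        for name, spec in SERVICES.items()
        if spec["adds"] in required_properties
    }
    have, ordered = set(), []
    while remaining:
        # A service is ready once everything it needs has been produced.
        ready = sorted(n for n, s in remaining.items() if s["needs"] <= have)
        if not ready:
            raise ValueError("no service can run with the properties available")
        for name in ready:
            have.add(remaining.pop(name)["adds"])
            ordered.append(name)
    return ordered


workflow = plan({"hasImage", "similarTo", "hasSequence"})
print(" -> ".join(workflow))
```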
“The user experiments showed that workflow re-use… is difficult for bioinformaticians.”
- Goderis, A. (2008) Ph.D. Thesis
Under slightly different circumstances (e.g. studying the same phenomenon in a different organism), this workflow
WILL NOT SOLVE THE PROBLEM
It must be edited on a case-by-case basis
This editing turns out to be extremely difficult, and can almost only be done manually
This great idea would be even better if workflows were a little bit easier to re-purpose!
With SHARE, a workflow is generated dynamically, based on all of the information
presented in the query
e.g. to create a new workflow simply specify a different organism ID in the query!
Moreover, the workflow plan CHANGES dynamically based on service outputs!
With SHARE, the ontology is the workflow (not the same as “an ontology that describes workflows”)
The ontology acts as an abstraction of a workflow, which is concretized at run-time based on circumstances of the query
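Concretization at run-time can be sketched as resolving the same abstract plan against a registry that is sensitive to context. The property names, the example.org service URLs and the `concretize` function are hypothetical; only the taxon identifiers come from the input data shown earlier.

```python
# The abstract workflow: properties to obtain, with no services named.
ABSTRACT_PLAN = ["hasHomolog", "hasInteractor"]

# Hypothetical registry: which service supplies a property for which taxon.
SERVICES = {
    ("hasHomolog", "taxon:4932"): "http://example.org/yeast/blast",
    ("hasInteractor", "taxon:4932"): "http://example.org/yeast/interactions",
    ("hasHomolog", "taxon:7227"): "http://example.org/fly/blast",
    ("hasInteractor", "taxon:7227"): "http://example.org/fly/interactions",
}


def concretize(plan, taxon):
    """Pick a concrete service for each abstract step, given the organism."""
    return [SERVICES[(prop, taxon)] for prop in plan]


yeast_workflow = concretize(ABSTRACT_PLAN, "taxon:4932")
fly_workflow = concretize(ABSTRACT_PLAN, "taxon:7227")
print(yeast_workflow)
print(fly_workflow)
```

The plan is written once; changing the organism identifier yields a different sequence of services without anyone editing a workflow.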
Works (best) with ontologies in the “Frames+” part of the spectrum, if there are SADI services available
As far as we are aware, SHARE is the only system that exhibits this particular behaviour
...and IMO this is a pretty big deal...
Gordon, P.M.K., Soliman, M.A., Bose, P., Trinh, Q., Sensen, C.W., Riabowol, K.: Interspecies data mining to predict novel ING-protein interactions in human. BMC genomics. 9, 426 (2008).
That experiment can now be represented ONCE as an OWL Class
It becomes concretized automatically
for each individual researcher
as a distinct workflow given any starting protein and any combination of comparator species
This is far beyond simply changing the parameters entered into a workflow...
Moreover, the idea of automated workflow “individuality” is quite interesting to us...
This is an early prototype of a
Patient-driven Personalized Medicine
Web interface
Matching based on official name, compound name,
brand name, trade name, or “common name”
Still needs some work...
??!?!?
Link out to PubMed
Why the alert?
The SADI+SHARE workflow and reasoning was personalized to YOUR medical data
In future iterations, we will enable the workflow to be further customized through “personalized” OWL Classes (e.g. provided by your Clinician!!)
These OWL Classes might include information about the current trajectory of your treatment for a chronic disease, for example, such that what you read on the Web is placed in the context of your expert Clinical care...
Frankly, I think it’s quite cool that “people” are creating and running personalized workflows at the touch of a button...
...as many of you know, that has been my dream since I started studying this
problem a decade ago!
Final thoughts
An experiment... based on a hypothesis
now modeled in OWL
Does this OWL Class represent a Hypothesis?
I think it does!
I believe that we will soon show, using SADI + SHARE,
that we can model a non-trivial hypothetical biological scenario,
then evaluate whether that hypothesis is supported based on whether the automatically-synthesized workflow
returns any individuals that conform to the model.
Ontology = Hypothesis = Query = Workflow [= Materials and Methods ]
These can be automatically derived through provenance information during workflow execution
Most of your publication is done!
All you need to do now is interpret the results!
Please join us!
SADI and SHARE are Open-Source projects
http://sadiframework.org
My New Home!
Luke McCarthy – Lead Dev. Everything...
Benjamin VanderValk – SHARE & SADI & Experimental modeling & myHeath Button
Soroush Samadian – Cardiovascular data modeling and queries
University of British Columbia
Edward Kawas – SADI Service auto-generator
Ian Wood – Experimental modeling project
U of New Brunswick
Dr. Chris Baker, Alexandre Riazanov
Carleton University
Dr. Michel Dumontier, Marc-Alexandre Nolin, Leonid Chepelev, Steve Etlinger, Nichaella Kieth, Jose Cruz
C-BRASS Collaborators at other sites
Microsoft Research