Yun Zhang J. Craig Venter Institute San Diego, CA, USA August 4, 2012
Presentation to the J. Craig Venter Institute, Dec. 2014
“Shopping for data should be as easy as
shopping for shoes!”
Dr. Carole Goble
Professor, Dept. of Computer Science
University of Manchester
“A little bit of semantics goes a long way”
Dr. James Hendler
Artificial Intelligence Researcher
Rensselaer Polytechnic Institute
One of the originators of the Semantic Web
…but a lot of semantics goes a long, long way!
Mark Wilkinson
Isaac Peral Distinguished Researcher
Director, Fundación BBVA Chair in Biological Informatics
Center for Plant Biotechnology and Genomics
Technical University of Madrid
Making the Web a
biomedical research platform
from hypothesis through to publication
Publication
Discourse
Hypothesis
Experiment
Interpretation
Motivation:
3 intersecting trends in the Life Sciences
that are now, or soon will be,
extremely problematic
NON-REPRODUCIBLE SCIENCE & THE FAILURE OF PEER REVIEW
TREND #1
Trend #1
Multiple recent surveys of high-throughput biology
reveal that upwards of 50% of published studies
are not reproducible
- Baggerly, 2009
- Ioannidis, 2009
Similar (if not worse!) in clinical studies
- Begley & Ellis, Nature, 2012
- Booth, Forbes, 2012
- Huang & Gottardo, Briefings in Bioinformatics, 2012
Trend #1
Trend #1
“the most common errors are simple,
the most simple errors are common”
At least partially because the
analytical methodology was inappropriate
and/or not sufficiently described
- Baggerly, 2009
Trend #1
These errors pass peer review
The researcher is (sometimes) unaware of the error
The process that led to the error is not recorded
Therefore it cannot be detected during peer-review
Agencies have Noticed!
In March 2012, the US Institute of Medicine said, in effect,
“Enough is enough!”
Agencies have Noticed!
Institute of Medicine Recommendations
For Conduct of High-Throughput Research:
Evolution of Translational Omics: Lessons Learned and the Path Forward. The
Institute of Medicine of the National Academies, Report Brief, March 2012.
1. Rigorously-described, -annotated, and -followed data
management and manipulation procedures
2. “Lock down” the computational analysis pipeline once it
has been selected
3. Publish the analytical workflow in a formal manner,
together with the full starting and result datasets
BIGGER, CHEAPER DATA
TREND #2
Trend #2
High-throughput technologies are becoming
cheaper and easier to use
But there are still very few experts trained in
statistical analysis of high-throughput data
Trend #2
The number of job postings for data scientist
positions increased by 15,000% between the
summers of 2011 and 2012
-- Indeed.com job trends data reported by
http://blogs.nature.com/naturejobs/2013/03/18/so-you-want-to-be-a-data-scientist
Trend #2
Therefore
Even small, moderately-funded laboratories
can now afford to produce more data
than they can manage or interpret
These labs will likely never be able to afford
a qualified data scientist
“THE SINGULARITY”
TREND #3
The Healthcare Singularity and the Age of Semantic Medicine, Michael Gillam, et al, The Fourth Paradigm: Data-Intensive Scientific Discovery, Tony Hey (Editor), 2009
Slide adapted with permission from Joanne Luciano, Presentation at Health Web Science Workshop 2012, Evanston IL, USA, June 22, 2012.
Trend #3
The Healthcare Singularity and the Age of Semantic Medicine, Michael Gillam, et al, The Fourth Paradigm: Data-Intensive Scientific Discovery Tony Hey (Editor), 2009
Slide Borrowed with Permission from Joanne Luciano, Presentation at Health Web Science Workshop 2012, Evanston IL, USA
June 22, 2012.
“The Singularity”
The X-intercept is where, the moment a discovery is made,
it is immediately put into practice
Scientific research would have to be
conducted within a medium that
immediately interpreted
and disseminated the results...
You Are
Here
...in a form that immediately (actively!) affected the
results of other researchers...
You Are
Here
...without requiring them to be aware
of these new discoveries.
You Are
Here
3 intersecting and problematic trends
Non-reproducible science that passes peer-review
Cheaper production of larger and more complex datasets
that require specialized expertise to analyze properly
Need to more rapidly disseminate and use new discoveries
We Want More!
I don’t just want to reproduce
your experiment...
I want to re-use your experiment
In my own laboratory... On MY DATA!
When I do my analysis
I want to draw on the knowledge
of global domain-experts like
statisticians and pathologists...
...as if they were mentors sitting
in the chair beside me.
Image from: Mark Smiciklas
Intersection Consulting, cc-nca
Please don’t make me find
all of the data and knowledge
that I require to do my experiment
...it simply isn’t possible anymore...
Image from AJ Cann
cc-by-a license
I want to support peer review(ers)
so that I do better science.
How do we get there from here?
To overcome these intersecting problems
and to achieve the goals of transparent
reproducible research
We must learn how to
do research IN the Web
Not OVER the Web
How we use
The Web today
The Web is not a pigeon!
Semantic Web Technologies
The Web
The Semantic Web
causally related to
This is the critical bit!
causally related to
The link is explicitly labeled!
???
http://semanticscience.org/resource/SIO_000243
SIO_000243:
<owl:ObjectProperty rdf:about="&resource;SIO_000243">
<rdfs:label xml:lang="en">is causally related with</rdfs:label>
<rdf:type rdf:resource="&owl;SymmetricProperty"/>
<rdf:type rdf:resource="&owl;TransitiveProperty"/>
<dc:description xml:lang="en"> A transitive, symmetric, temporal relation
in which one entity is causally related with another non-identical entity.
</dc:description>
<rdfs:subPropertyOf rdf:resource="&resource;SIO_000322"/>
</owl:ObjectProperty>
causally related with
Semantic Web Technologies
“deep semantics”
Deep Semantics?
Ontology Spectrum
[Figure: a spectrum running from lightweight to expressive: Catalog/ID; Terms/glossary; Thesauri (“narrower term” relation); Informal is-a; Formal is-a; Formal instance; Frames (Properties); Value Restrs.; Selected Logical Constraints (disjointness, inverse, …); General Logical constraints]
Originally from AAAI 1999 Ontologies Panel by Gruninger, Lehmann, McGuinness, Uschold, Welty; updated by McGuinness.
Description in: www.ksl.stanford.edu/people/dlm/papers/ontologies-come-of-age-abstract.html
Most biomedical ontologies
e.g. Gene Ontology
Ontologies being used in today’s talk
Most biomedical ontologies
e.g. Gene Ontology
Categorization Systems
Like library shelves, inflexible
Discovery & Interpretation systems – flexible!
Remember, this is the critical bit!
http://semanticscience.org/resource/SIO_000243
causally related with
It’s relationships that make
the Semantic Web “Semantic”
Semantic Web Technologies
“deep semantics”
Even with “deep semantics”
a lot of important information cannot be represented
on the Semantic Web
For example, all of the data that results from
analytical algorithms and statistical analyses
Varying estimates
put the size of the
Deep Web between
500 and 800 times
larger than the
surface Web
On the WWW
“automation” of
access to Deep Web
data happens through
“Web Services”
There are many suggestions for how to bring the Deep Web
into the Semantic Web using Semantic Web Services (SWS)
Describe input data
Describe output data
Describe how the system manipulates the data
Describe how the world changes as a result
None, so far, has proven to be wildly successful
(in my opinion)
…because describing what a Service does is HARD!
Lord, Phillip, et al. The Semantic Web–ISWC 2004 (2004): 350-364.
Scientific Web Services are DIFFERENT!
Lord, Phillip, et al. The Semantic Web–ISWC 2004 (2004): 350-364.
“The service interfaces within bioinformatics are relatively simple. An extensible or constrained interoperability
framework is likely to suffice for current demands: a fully generic framework is currently not necessary.”
Scientific Web Services are DIFFERENT!
They’re simpler!
So perhaps we can solve the Semantic Web Service problem
as it pertains to this (important!) domain
With respect to the Semantic Web
What is missing from this list?
Describe input data
Describe output data
Describe how the system manipulates the data
Describe how the world changes as a result
http://semanticscience.org/resource/SIO_000243
causally related with
http://semanticscience.org/resource/SIO_000243
The Semantic Web gets its semantics from relationships
causally related with
http://semanticscience.org/resource/SIO_000243
In 2008 I published a set of design patterns
for scientific Semantic Web Services
that focus on the biological relationship that the Service “exposes”
causally related with
The Semantic Web gets its semantics from relationships
Design Pattern for
Web Services on the Semantic Web
AACTCTTCGTAGTG...
BLAST
Web Service
AACTCTTCGTAGTG...
BLAST
SADI
has
homology
to
Terminal Flower
type
gene
species
A. thal.
SADI requires you to explicitly declare
as part of your analytical output,
the biological relationship that your
algorithm “exposed”.
sequence
has_seq_string
AACTCTTCGTAGTG...
sequence
has_seq_string
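The pattern in the diagram above can be sketched in a few lines. This is only an illustration in Python using plain tuples as triples; it does not use the SADI libraries. The function name `blast_like_service`, the `ex:` identifiers and the lookup table are hypothetical stand-ins for a real homology search. The point it tries to show is that the output is attached to the same node that came in, and that the relationship is named explicitly.

```python
# Triples are (subject, predicate, object) tuples. All identifiers are illustrative.
# The sequence string is the truncated prefix shown on the slide, used as a toy key.
FAKE_HOMOLOGY_DB = {
    "AACTCTTCGTAGTG": "ex:TerminalFlower",  # stand-in for a real BLAST search
}


def blast_like_service(input_triples):
    """Return new triples that hang off the same subject that was sent in."""
    output = []
    for subject, predicate, obj in input_triples:
        if predicate != "has_seq_string":
            continue
        hit = FAKE_HOMOLOGY_DB.get(obj)
        if hit is None:
            continue
        # The relationship the algorithm "exposed" is stated explicitly.
        output.append((subject, "has_homology_to", hit))
        output.append((hit, "type", "gene"))
        output.append((hit, "species", "A. thal."))
    return output


request = [("ex:seq1", "has_seq_string", "AACTCTTCGTAGTG")]
response = blast_like_service(request)
```

A client can merge `request` and `response` into one graph because both talk about `ex:seq1`.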
I want to share several stories that demonstrate
the cool things that happen when you use
SADI + deep semantics
The Semantic Health
and Research Environment
Story #1: SHARE
A proof-of-concept workflow orchestrator
+ SADI Semantic Web Service registry
Objective: answer biologists’ questions
The SHARE registry
indexes all of the input/output/relationship
triples that can be generated by all known services
This is how SHARE discovers services
SHARE demonstrations
with increasing semantic complexity
What is the phenotype of every allele of the
Antirrhinum majus DEFICIENS gene?
SELECT ?allele ?image ?desc
WHERE {
  locus:DEF genetics:hasVariant ?allele .
  ?allele info:visualizedByImage ?image .
  ?image info:hasDescription ?desc
}
The query language here is SPARQL
The W3C-approved, standard query language for the Semantic Web
Note that there is no “FROM” clause!
We don’t tell it where it should get the information,
The machine has to figure that out by itself...
Starting data: the locus “DEF” (Deficiens)
Query: a series of relationships with respect to DEF
Enter that query into
SHARE
Click “Submit”...
...and in a few seconds you get your answer.
Based on the relationships in your query, SHARE queried its registry
to automatically discover SADI Services capable of generating those triples
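A rough sketch of that discovery step, assuming a registry that simply records which predicate each service can generate. The service URLs and the `discover` function are hypothetical; the real SHARE registry is queried differently, and this only illustrates the idea of matching query relationships to services.

```python
# Hypothetical registry: each entry advertises the predicate a service can generate.
REGISTRY = [
    {"url": "http://example.org/services/getVariants",
     "generates": "genetics:hasVariant"},
    {"url": "http://example.org/services/getImages",
     "generates": "info:visualizedByImage"},
    {"url": "http://example.org/services/getDescriptions",
     "generates": "info:hasDescription"},
]

# The three triple patterns from the DEFICIENS query above.
QUERY_PATTERNS = [
    ("locus:DEF", "genetics:hasVariant", "?allele"),
    ("?allele", "info:visualizedByImage", "?image"),
    ("?image", "info:hasDescription", "?desc"),
]


def discover(query_patterns):
    """For each pattern, list the services able to generate its predicate."""
    plan = []
    for _subject, predicate, _obj in query_patterns:
        matches = [s["url"] for s in REGISTRY if s["generates"] == predicate]
        plan.append((predicate, matches))
    return plan


plan = discover(QUERY_PATTERNS)
```

No data source appears in the query itself; the plan is derived entirely from the relationships it mentions.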
Because it is the Semantic Web
The query results are live hyperlinks
to the respective Database or images
(The answer is IN the Web!)
What pathways does UniProt protein P47989 belong to?
PREFIX pred: <http://sadiframework.org/ontologies/predicates.owl#>
PREFIX ont: <http://ontology.dumontierlab.com/>
PREFIX uniprot: <http://lsrn.org/UniProt:>
SELECT ?gene ?pathway
WHERE {
  uniprot:P47989 pred:isEncodedBy ?gene .
  ?gene ont:isParticipantIn ?pathway .
}
Note again that there is no “From” clause…
I have not told SHARE where to look for the
answer, I am simply asking my question
Enter that query into
SHARE
Two different providers of gene information (KEGG & NCBI) were found & accessed
Two different providers of pathway information (KEGG and GO) were found & accessed
The results are all links to the original data
(The answer is IN the Web!)
Show me the latest Blood Urea Nitrogen (BUN) and
Creatinine levels of patients who appear to be
rejecting their transplants
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX patient: <http://sadiframework.org/ontologies/patients.owl#>
PREFIX l: <http://sadiframework.org/ontologies/predicates.owl#>
SELECT ?patient ?bun ?creat
FROM <http://sadiframework.org/ontologies/patients.rdf>
WHERE {
  ?patient rdf:type patient:LikelyRejecter .
  ?patient l:latestBUN ?bun .
  ?patient l:latestCreatinine ?creat .
}
Likely Rejecter:
A patient who has creatinine levels
that are increasing over time
- Mark D Wilkinson’s definition
Likely Rejecter:
…but there is no “likely rejecter”
column or table in our database…
only blood chemistry measurements
at various time-points
Likely Rejecter:
So the data required to answer this question
DOESN’T EXIST!
My definition of a Likely Rejecter is encoded in
a machine-readable document written in the OWL Ontology language
Basically:
“the regression line over creatinine measurements should have an increasing slope”
Our ontology refers to other ontologies (possibly published by other people)
to learn about what the properties of “regression models” are
e.g. that regression models have slopes and intercepts
and that slopes and intercepts have decimal values
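To make the definition concrete, here is a minimal sketch of the test it implies: fit a least-squares line to (time, creatinine) points and check that the slope is positive. The function names and the sample values are made up for illustration; in SHARE the regression itself is delegated to a discovered service rather than computed locally.

```python
def regression_slope(points):
    """Ordinary least-squares slope for a list of (x, y) pairs."""
    n = len(points)
    if n < 2:
        raise ValueError("need at least two measurements")
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0:
        raise ValueError("all measurements share the same time-point")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return sxy / sxx


def is_likely_rejecter(creatinine_series):
    """True when creatinine is increasing over time (slope > 0)."""
    return regression_slope(creatinine_series) > 0


# (day, creatinine) pairs; the numbers are invented for the example.
rising = [(0, 80.0), (7, 95.0), (14, 120.0)]
falling = [(0, 80.0), (7, 79.0), (14, 78.0)]
```

Nothing in `regression_slope` knows about creatinine: it only sees X/Y pairs, which is the same separation the ontologies describe.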
?
Enter that query into
SHARE
SHARE examines the query
Burrows around the Web reading the various ontologies
then uses the discovered Class definitions as a template to map a path from what it has, to what it needs, using
SADI services
Based on the Class definition
SHARE decides that it needs to do a
Linear Regression analysis
on the blood creatinine measurements
?
The conversation between SHARE and the registry
reveals the use of “Deep Semantics”
Q: Is there a SADI service that will consume instances of Patient and give
me instances of LikelyRejecter?
A: No
Q: Okay... So LikelyRejecters need a regression model of increasing slope
over their BloodCreatinine, so... Is there a SADI service that will consume
BloodCreatinine over time and give me its linear regression model?
A: No
Q: Okay... Blood Creatinine over time is a subclass of data of type
X/Y coordinate, so is there a service that consumes X/Y data and
returns its regression model?
A: Yes, here’s the URL.
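The last two questions in that conversation amount to walking up the class hierarchy until some service accepts the input. A small sketch of that fallback, assuming a toy hierarchy and a toy registry; the class names, the service URL and `find_service` are hypothetical, and a real reasoner does far more than follow a single parent link.

```python
# Toy class hierarchy: child class -> parent class (illustrative names only).
SUBCLASS_OF = {
    "BloodCreatinineOverTime": "XYCoordinateData",
}

# Toy registry keyed by (input class, output class).
SERVICES = {
    ("XYCoordinateData", "RegressionModel"):
        "http://example.org/services/linearRegression",
}


def find_service(input_class, output_class):
    """Try the class itself, then each superclass, until a service matches."""
    current = input_class
    while current is not None:
        url = SERVICES.get((current, output_class))
        if url is not None:
            return current, url
        current = SUBCLASS_OF.get(current)  # generalize and ask again
    return None


match = find_service("BloodCreatinineOverTime", "RegressionModel")
no_match = find_service("Patient", "LikelyRejecter")
```

Here `match` reports that the service was found only after generalizing to X/Y coordinate data, and `no_match` mirrors the first "No" in the conversation.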
The SHARE system utilizes SADI to discover
analytical services on the Web that do linear regression analysis
and sends the data to be analyzed
This happens iteratively
(e.g. SHARE also has to examine the slope of the regression line
using another service, find the “latest” in a series of time measurements, etc.)
There is reasoning after every Service invocation
(i.e. after every clause in the query)
Once it is able to find instances (OWL Individuals)
of the LikelyRejecter class, it continues with the
rest of the query
VOILA!
The way SHARE “interprets” data varies
depending on the context of the query
(i.e. which ontologies it reads – Mine? Yours?)
and on what part of the query
it is trying to answer at any given moment
(which ontological concept is relevant to that clause)
Example?
Blood Creatinine measurements
were not dictated to be
Blood Creatinine measurements
Example?
The data had the ‘qualities/properties’ that
allowed one machine to interpret
that they were Blood Creatinine measurements
(e.g. to determine which patients were rejecting)
Example?
But the data also had the ‘qualities/properties’ that
allowed another machine to interpret them as
Simple X/Y coordinate data
(e.g. the Linear Regression calculation tool)
Benefit
of Deep Semantics
Data is amenable to
constant re-interpretation
http://www.flickr.com/people/faernworks/
One example of the “little ways”
that Semantics will help researchers
day-by-day
Story #2: Measurement Units
Units must be harmonized
Don’t leave this up to the researcher
(it’s fiddly, time-consuming, and error-prone)
NASA Mars Climate Orbiter
Oops!
ID   HEIGHT  WEIGHT  SBP   CHOL  HDL  BMI GR  SBP GR  CHOL GR  HDL GR
pt1  1.82    177     128   227   55   0       0       1        0
pt2  179     196     13.4  5.9   1.7  1       0       1        0
The Reality of Clinical Datasets
(this is a small snapshot of a dataset we worked on,
courtesy of Dr. Bruce McManus & Janet McManus, from the PROOF COE)
Height in m and cm
Chol in mmol/l and mg/l
...and other delicious weirdness
The clinical analyses described here
were supported in part by the
PROOF Center of Excellence
for the Prevention of Organ Failure
GOAL: reduce the likelihood of errors by
getting the clinical researcher
“out of the loop”
(as per the Institute of Medicine Recommendations)
Experiment:
Reproduce a clinical study
(from >10 years ago)
by logically encoding
the clinical diagnosis guidelines
of the American Heart Association
then ask SHARE to automatically
analyse the patient clinical data
Semantically defining globally-accepted clinical phenotypes;
Building on the expertise of others
SystolicBloodPressure =
GALEN:SystolicBloodPressure and
("sio:has measurement value" some ("sio:measurement" and
("sio:has unit" some "om:unit of measure") and
("om:dimension" value "om:pressure or stress dimension") and
"sio:has value" some rdfs:Literal))
GALEN is a popular biomedical ontology
but it is largely, like GO, a series of
named but undefined Classes
So we use OWL to extend the GALEN
Classes with rich, logical descriptors
that take advantage of rich semantic
relationships like “has measurement value”
and “dimension” and “has unit”
Very general definition
“some kind of pressure unit”
(so that others can build on this as they wish!)
HighRiskSystolicBloodPressure (as defined by Framingham)
SystolicBloodPressure and
sio:hasMeasurement some
(sio:Measurement and
("sio:has unit" value om:kilopascal) and
(sio:hasValue some double[>= "18.7"^^double]))
Now we are specific to our clinical study (Framingham definitions):
MUST be in kilopascal and must be >= 18.7
Semantically defining globally-accepted clinical phenotypes;
Building on the expertise of others
SELECT ?record ?convertedvalue ?convertedunit
FROM <./patient.rdf>
WHERE {
?record rdf:type measure:HighRiskSystolicBloodPressure .
?record sio:hasMeasurement ?measurement.
?measurement sio:hasValue ?Pressure.
}
RecordID  Start Val  Start Unit  Pressure  End Unit
Pt1       15         cmHg        19.998    KiloPascal
Pt2       14.6       cmHg        19.465    KiloPascal
Pt1       148        mmHg        19.731    KiloPascal
Pt2       146        mmHg        19.465    KiloPascal
Running the Clinical Analysis
“Select the patients who are at-risk”
All measurements have now been automatically
harmonized to KiloPascal, because we encoded the
semantics in the model
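The arithmetic behind that harmonization is simple, which is exactly why it is worth automating. A minimal sketch, assuming the standard conversion of roughly 0.133322 kPa per mmHg and the 18.7 kPa threshold from the class definition above; the function names are hypothetical, and in SHARE the conversion is driven by the unit ontology rather than a hard-coded table.

```python
# Conversion factors to kilopascal (1 mmHg is approximately 0.133322 kPa).
KPA_PER_UNIT = {
    "mmHg": 0.133322,
    "cmHg": 1.33322,
    "kPa": 1.0,
}

HIGH_RISK_SBP_KPA = 18.7  # threshold taken from the class definition


def to_kilopascal(value, unit):
    """Convert a pressure reading to kilopascal; unknown units raise KeyError."""
    return value * KPA_PER_UNIT[unit]


def is_high_risk_sbp(value, unit):
    """Apply the >= 18.7 kPa rule regardless of the unit the data arrived in."""
    return to_kilopascal(value, unit) >= HIGH_RISK_SBP_KPA


# The four readings from the results table.
readings = [
    ("Pt1", 15, "cmHg"),
    ("Pt2", 14.6, "cmHg"),
    ("Pt1", 148, "mmHg"),
    ("Pt2", 146, "mmHg"),
]
harmonized = [(rid, round(to_kilopascal(v, u), 3)) for rid, v, u in readings]
```

With these factors the converted values agree with the table to within about 0.001 kPa; the small difference presumably comes from rounding in the original conversion service.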
While doing this experiment, we noticed
some interesting anomalies…
Visual inspection of our output data and the AHA guidelines
showed that in many cases the clinician
“tweaked” the guidelines when doing their analysis
------------------
AHA BMI risk threshold: BMI=25
In our dataset the clinical researcher used BMI=26
------------------
AHA HDL guideline HDL<=1.03mmol/l
The dataset from our researcher: HDL<=0.89mmol/l
-------------------
These Alterations Were Not Recorded
in Their Study Notes!
Adjusting our Semantic definitions and re-running the analysis
resulted in nearly 100% correspondence with the clinical researcher
HighRiskCholesterolRecord=
PatientRecord and
(sio:hasAttribute some
(cardio:SerumCholesterolConcentration and
sio:hasMeasurement some ( sio:Measurement and
(sio:hasUnit value cardio:mili-mole-per-liter) and
(sio:hasValue some double[>= 5.0]))))
HighRiskCholesterolRecord=
PatientRecord and
(sio:hasAttribute some
(cardio:SerumCholesterolConcentration and
sio:hasMeasurement some ( sio:Measurement and
(sio:hasUnit value cardio:mili-mole-per-liter) and
(sio:hasValue some double[>= 5.2]))))
Reflect on this for a second... Because this is important!
1. We semantically encoded clinical guidelines
2. We found that clinical researchers did not follow the official guidelines
3. Their “personalization” of the guidelines was unreported
4. Nevertheless, we were able to create “personalized” Semantic Models
5. These models reflect the opinion of an individual domain-expert
6. These models are shared on the Web
7. Can be automatically re-used by others to interpret their own data using
that clinical expert’s viewpoint
AHA:HighRiskCholesterolRecord
PatientRecord and
(sio:hasAttribute some
(cardio:SerumCholesterolConcentration and
sio:hasMeasurement some ( sio:Measurement and
(sio:hasUnit value cardio:mili-mole-per-liter) and
(sio:hasValue some double[>= 5.0]))))
McManus:HighRiskCholesterolRecord
PatientRecord and
(sio:hasAttribute some
(cardio:SerumCholesterolConcentration and
sio:hasMeasurement some ( sio:Measurement and
(sio:hasUnit value cardio:mili-mole-per-liter) and
(sio:hasValue some double[>= 5.2]))))
PREFIX AHA: <http://americanheart.org/measurements/>
PREFIX McManus: <http://stpaulshospital.org/researchers/mcmanus/>
To do the analysis using AHA guidelines
SELECT ?patient ?risk
WHERE {
?patient rdf:type AHA:HighRiskCholesterolRecord .
?patient ex:hasCholesterolProfile ?risk
}
To do the analysis using McManus’ expert-opinion
SELECT ?patient ?risk
WHERE {
?patient rdf:type McManus:HighRiskCholesterolRecord .
?patient ex:hasCholesterolProfile ?risk
}
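Stripped of the ontology machinery, the two queries above differ only in which expert's threshold is applied. A small sketch of that idea, using the 5.0 and 5.2 mmol/l cut-offs from the two class definitions; the patient values and the function name are invented for illustration.

```python
# Cut-offs in mmol/l, taken from the two HighRiskCholesterolRecord definitions.
THRESHOLDS_MMOL_PER_L = {
    "AHA": 5.0,
    "McManus": 5.2,
}


def high_risk_cholesterol(records, viewpoint):
    """Return the IDs of patients at or above the chosen expert's cut-off."""
    cutoff = THRESHOLDS_MMOL_PER_L[viewpoint]
    return [pid for pid, chol in records if chol >= cutoff]


# (patient ID, serum cholesterol in mmol/l); made-up values.
records = [("pt1", 5.1), ("pt2", 5.9), ("pt3", 4.8)]

aha_view = high_risk_cholesterol(records, "AHA")
mcmanus_view = high_risk_cholesterol(records, "McManus")
```

The same data gives different answers under the two viewpoints, and the viewpoint used is stated explicitly rather than buried in someone's study notes.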
Flexibility Transparency
Reproducibility Shareability Comparability
Simplicity Automation
Personalization
(I’m going to return to this point several times)
Reproduce a peer-reviewed
scientific publication
by semantically modelling
the problem
Story #3: in silico Science
The Publication: Discovering Protein Partners of a
Human Tumor Suppressor Protein
Original Study Simplified
Using what is known about protein interactions
in fly & yeast
predict new interactions with this
Human Tumor Suppressor
Semantic Model of the Experiment
OWL
Note that every word in this
diagram is, in reality, a URL
(it’s a Semantic Web model)
i.e. It refers to the expertise of
other researchers, distributed
around the world on the Web
Semantic Model of the Experiment
In a local data-file
provide the protein we are interested in
and the two species we wish to use in our comparison
taxon:9606 a i:OrganismOfInterest . # human
uniprot:Q9UK53 a i:ProteinOfInterest . # ING1
taxon:4932 a i:ModelOrganism1 . # yeast
taxon:7227 a i:ModelOrganism2 . # fly
Set-up the Experimental Conditions
SELECT ?protein
FROM <file:/local/workflow.input.n3>
WHERE {
?protein a i:ProbableInteractor .
}
Run the Experiment
This is the URL that leads our computer
to the Semantic model of the problem
SHARE examines the semantic model of
Probable Interactors
Retrieves third-party expertise from the Web
Discusses with SADI
what analytical tools are necessary
Chooses the right tools for the problem
Solves the problem!
SHARE derives (and executes) the following analysis automatically
SHARE is aware of the context of the specific question being asked
There are five very cool things about what you just saw...
1. SHARE was able to create a
workflow based on a
semantic model
2. SHARE was able to create a
COMPUTATIONAL workflow
based on a BIOLOGICAL model
(this is important because we want
this system to be used by clinicians and biologists
who don’t speak computerese!)
3. The workflow it created, and the services
selected, differed depending on the
context of the question
taxon:4932 a i:ModelOrganism1 . # yeast
taxon:7227 a i:ModelOrganism2 . # fly
The machine was contextually “aware of”
BOTH the biological model
AND the data it was analysing
(...remember this... It will be important later!)
4. The ontological model was abstract (and
shareable!), but the workflow generated
from that model was explicit and concrete
This matters because…
Remember Trend #1
“the most common errors are simple,
the most simple errors are common”
At least partially because the
analytical methodology was inappropriate
and/or not sufficiently described
Here, the methodology leading to a result is explicit
and automatically constructed from an abstract template
so this is (at least in part) a
Solved Problem
5. The choice of tool-selection was
guided by the knowledge of
worldwide domain-experts encoded in
globally-distributed ontologies
(e.g. Expert high-throughput statisticians, etc...)
And this matters because…
Remember Trend #2
Even small, moderately-funded laboratories
can now afford to produce more data
than they can manage or interpret
These labs will likely never be able to afford
a qualified data scientist
But if the expert knowledge of data scientists is
encoded in ontologies, and can be discovered
in a contextually-aware manner… then this is a
SOLVED PROBLEM
Can we make the Health information
on the Web
more “personal”?
Story #4: Personalized Health Info
Remember when I said...
The machine was contextually “aware of”
BOTH the biological model
AND the data it was analysing
This “dual-awareness” provides some
very interesting opportunities
for personalizing a patient’s Health Research activity
PROBLEM:
Patients are self-educating
both about their personal medical situation
(e.g. getting themselves sequenced)
also surfing the Web, getting dubious advice
from sites of dubious authority
and joining social-health groups
to exchange (often anecdotal)
medical “advice” with other patients
PROBLEM:
Patients are self-educating
The information on any given site
may or may not
be relevant to THAT patient
Information on the Web is, by nature, not personalized
PROBLEM:
Clinicians often have patients
(especially chronically-ill patients)
on a “trajectory” of treatment
Medicine is complicated!
e.g. the treatment trajectory of the patient can be
multi-step, and a specific sign/symptom might be
perfectly normal at a particular phase in their
“flow” of treatment
PROBLEM SUMMARY
Patients are reading non-personalized medical text
of dubious quality and relevance
Clinicians have no way to intervene
in this self-education process
explaining to patients how the information they read
relates to their personal “health trajectory”
Now you might see why this is so relevant!
The machine was contextually “aware of”
BOTH the biological model
AND the data it was analysing
This is an early prototype of a
Patient-driven Personalized Medicine
Web interface
Basically, it is a set of SHARE queries
Attached to a local database
of patient information
Running behind a Web bookmarklet
The queries text-mine a Web page
then compare the concepts in the page
to the patient’s personal data
using a SHARE query
(that could contain ontologies...
...ontologies designed by their clinician!!)
Matching based on official name, compound name, brand name, trade name,
or “common name”
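The matching step above can be illustrated with a minimal sketch. This is not the SHARE query used in the prototype; it is plain Python standing in for the idea, and the record layout, field names, and sample medications are all assumptions made for illustration.

```python
import re

# Hypothetical patient record; the field names are illustrative only.
PATIENT_MEDICATIONS = [
    {"official": "acetylsalicylic acid", "brand": ["Aspirin"], "common": ["ASA"]},
    {"official": "warfarin", "brand": ["Coumadin"], "common": []},
]

def names_for(medication):
    """Return every lower-cased name under which a medication may appear."""
    names = [medication["official"], *medication["brand"], *medication["common"]]
    return {name.lower() for name in names}

def match_page(page_text, medications):
    """Return (official name, matched name) pairs found in the page text."""
    text = page_text.lower()
    hits = []
    for medication in medications:
        for name in sorted(names_for(medication)):
            # Whole-word match, so "asa" does not fire inside another word.
            if re.search(r"\b" + re.escape(name) + r"\b", text):
                hits.append((medication["official"], name))
    return hits

page = "Some people take aspirin daily; Coumadin may interact with it."
print(match_page(page, PATIENT_MEDICATIONS))
```

Compound and trade names would simply be further lists in the same record. In the real system the comparison would presumably be driven by ontologies rather than string lists.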
Still needs some work...
??!?!?
Link out to PubMed
Why the alert?
The SADI+SHARE workflow and reasoning were
personalized to YOUR medical data
In future iterations, we will enable the workflow
to be further customized through “personalized”
OWL Classes (e.g. provided by your clinician!)
These OWL Classes might include information about the
current trajectory of your treatment for a chronic disease,
for example, such that what you read on the Web is
placed in the context of your expert Clinical care...
Frankly, I think it’s quite cool that patients
are creating and running
“personal health-research” workflows
at the touch of a button!
Almost the end…
Three brief final points....
Publication
Discourse
Hypothesis
Experiment
Interpretation
??
The Semantic Model represents
a possible solution to a problem
By my definition, that is a hypothesis
That hypothesis is tested by automatically converting it into a workflow;
the workflow and its results are intimately tied to the hypothesis
i.e. you (or anyone!) can determine exactly which aspect
of the hypothesis led to which output data element, why, and how
“Exquisite Provenance”
a perfect record not only of what was done, when, and how
but also WHY
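To make the what/when/how/why idea concrete, here is a small sketch of a provenance log in which every output carries the part of the model that required it. It is an assumption-laden illustration, not how SHARE records provenance: the `ProvenanceRecord` fields, the `run_step` helper, and the clause text are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)
class ProvenanceRecord:
    output: str        # what was produced
    service: str       # how: the (hypothetical) service that produced it
    timestamp: str     # when
    model_clause: str  # why: the part of the hypothesis that required this step

def run_step(service_name, func, value, model_clause, log):
    """Run one workflow step and record what, when, how, and why."""
    result = func(value)
    log.append(ProvenanceRecord(
        output=repr(result),
        service=service_name,
        timestamp=datetime.now(timezone.utc).isoformat(),
        model_clause=model_clause,
    ))
    return result

def why(log, output):
    """Return the model clauses that led to a given output value."""
    return [record.model_clause for record in log if record.output == repr(output)]

log = []
bmi = run_step("bmi-calculator", lambda kg: round(kg / 2.0 ** 2, 1), 80.0,
               "HighRiskPatient requires a BMI value", log)
print(bmi, why(log, bmi))
```

The design point is only that the "why" is stored next to each result at the moment it is produced, so it never has to be reconstructed afterwards.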
And this is important because...
“Exquisite Provenance”
is required
for the output data and knowledge
to be published as...
Richly annotated, citable, and queryable snippets of
scientific knowledge encoded in Linked Data/OWL
i.e. a way to publish data and knowledge on the Semantic Web
Publication
Discourse
Hypothesis
Experiment
Interpretation
A “modest” vision for
pure in silico Science
Last point… perhaps this is not yet obvious…
SADI services consume Linked Data on the Web
The ontologies provided to SHARE are
written in OWL, and are therefore
inherently part of the Web
SADI services create novel semantic links
between existing data-points on the Web, or
between existing data and new data
The output of the automatically-generated workflow
is therefore Linked Data
and is therefore inherently part of the Web
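A toy sketch of that consume-and-link pattern follows. It uses plain Python tuples as stand-ins for RDF triples and does not use any real SADI library; the `example.org` URIs, the predicate names, and the `bmi_service` function are invented for illustration.

```python
# Triples are plain (subject, predicate, object) tuples; all URIs are made up.
EX = "http://example.org/"

def bmi_service(graph):
    """Toy 'service': read height and mass triples, emit a new BMI triple
    attached to the same subject that was consumed."""
    heights = {s: float(o) for s, p, o in graph if p == EX + "heightM"}
    masses = {s: float(o) for s, p, o in graph if p == EX + "massKg"}
    return {
        (s, EX + "bmi", round(masses[s] / heights[s] ** 2, 1))
        for s in heights.keys() & masses.keys()
    }

# Existing data "on the Web".
web = {
    (EX + "patient1", EX + "heightM", "2.0"),
    (EX + "patient1", EX + "massKg", "80.0"),
}

# The service output is itself a set of triples, so it merges straight back in.
new_links = bmi_service(web)
web |= new_links
print(sorted(new_links))
```

Because the output has the same shape as the input, the result of one service can feed the next without any format conversion, which is the property the slide relies on.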
The concluding NanoPublications are a combination
of Linked Data and OWL, and are published directly to the Web
The Life Science “Singularity”
The Semantic Web is a cradle-to-grave
biomedical research platform
that can, and will, dramatically improve
how biomedical research is done
We Are
Here!
The important people
Luke McCarthy
(SADI/SHARE)
Benjamin Vandervalk
(SHARE)
Dr. Soroush Samadian
(clinical experiments)
Ian Wood
(Experiment-replication experiment)
Microsoft Research