Bio2RDF : A Semantic Web Atlas of post genomic knowledge about Human and Mouse
BIO2RDF : A Semantic Web Atlas of post genomic knowledge about
Human and Mouse
François Belleau, Nicole Tourigny, Benjamin Good and Jean Morissette
● Centre de Recherche du CHUL, Université Laval
● Département d'informatique et de génie logiciel, Université Laval
Vaugondy, geographer to Louis XV: view of the world in the 18th century
Google Maps view of the world in the 21st century
Evry, June 27, 2008 CHUL research center Laval University 4
Outline
Introduction
− Problem definition
− Proposed approach
− The 4 rules of linked data
− Related Work
Results
− Bio2RDF first knowledge map
− Semantic ranking
Paget query demo with SPARQL
Future work and Conclusion
Problem definition
● The objective of data integration is to make data distributed over a number of distinct, heterogeneous databases accessible via a single interface [Davidson 1995].
● We already use global text search engines on the web (Google, Yahoo).
● There are many specialized integrated search tools in bioinformatics (NCBI Entrez, EBI search, KEGG GenomeNet).
What is known about "Paget disease"?
but first ...
What is known about the mouse and human genomes?
Popular web search engines without semantics
Some Bioinformatics integrated search tools
● EMBL-EBI EB-eye search
● KEGG GenomeNet
● NCBI Entrez
EMBL-EBI search
NCBI Entrez life science search across databases
KEGG GenomeNet search
Bio2RDF search
What is known about Paget disease in the mouse and human genomes?
Proposed approach
● Apply the semantic web model to data integration in bioinformatics;
● Use a variation of PageRank [Brin 1998] adapted to semantic graphs, a method analogous to the work of the Aleman-Meza group: the LinkRank;
● Adopt standards (RDF, OWL) and use existing software (Sesame, Virtuoso, PiggyBank).
Linked data 4 rules
http://www.w3.org/DesignIssues/LinkedData
Rule #1: Use URIs as names for things.
● Using normalized identifiers to name concepts is already a reality in the biology domain.
● Hexokinase is GO:0004396
● Definition:
− Catalysis of the reaction: ATP + D-hexose = ADP + D-hexose 6-phosphate.
● Synonym of EC:2.7.1.1
Rule #2: Use HTTP URIs so that people can look up those names.
● Dereferenceable URL
● The Banff Manifesto rule for URNs:
− urn:bm:public_namespace:private_identifier
● Normalized URL according to the Banff Manifesto:
− http://bio2rdf.org/public_namespace:private_identifier
● http://bio2rdf.org/go:0004396
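The naming pattern on this slide can be sketched in a few lines of Python. This is only an illustration, not the project's own code: the function names are hypothetical, and lowercasing the namespace is an assumption inferred from the single example on the slide (GO:0004396 becoming go:0004396).

```python
BIO2RDF_BASE = "http://bio2rdf.org/"  # base URL shown on the slide


def split_identifier(identifier):
    """Split 'NAMESPACE:ID' into (namespace, id); hypothetical helper."""
    namespace, sep, local_id = identifier.partition(":")
    if not sep or not namespace or not local_id:
        raise ValueError(f"expected 'namespace:identifier', got {identifier!r}")
    # Assumption: namespaces are lowercased, as in the go:0004396 example.
    return namespace.lower(), local_id


def banff_urn(identifier):
    """Build urn:bm:public_namespace:private_identifier."""
    namespace, local_id = split_identifier(identifier)
    return f"urn:bm:{namespace}:{local_id}"


def bio2rdf_url(identifier):
    """Build http://bio2rdf.org/public_namespace:private_identifier."""
    namespace, local_id = split_identifier(identifier)
    return f"{BIO2RDF_BASE}{namespace}:{local_id}"


print(banff_urn("GO:0004396"))
print(bio2rdf_url("GO:0004396"))
```

Only the first colon separates namespace from identifier, so identifiers that themselves contain colons or dots are passed through unchanged.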
Rule #3: When someone looks up a URI, provide useful information.
● http://bio2rdf.org/go:0004396 returns the RDF graph of this topic
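Looking up such a URI from a program might look like the sketch below. It uses only the Python standard library. Asking for RDF/XML through an Accept header is an assumption about how the server negotiates content (the slide only says that the URI returns the RDF graph), and the helper names are hypothetical. The request is built but not sent, so the example runs without network access.

```python
import urllib.request


def rdf_request(uri, mime="application/rdf+xml"):
    """Build a GET request that asks for an RDF serialization.

    Assumption: the server honours the Accept header.
    """
    return urllib.request.Request(uri, headers={"Accept": mime})


def fetch_rdf(uri, timeout=10.0):
    """Dereference the URI and return the raw response body as bytes."""
    with urllib.request.urlopen(rdf_request(uri), timeout=timeout) as response:
        return response.read()


# Build the request only; call fetch_rdf(...) to actually dereference the URI.
request = rdf_request("http://bio2rdf.org/go:0004396")
print(request.get_method(), request.full_url, request.get_header("Accept"))
```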
Rule #4: Include links to other URIs so that they can discover more things.
● Openness Ratio > 0 (to be defined)
Related work
● DBpedia
● YeastHub
● UniProt
● HCLS linked data
● Bio2RDF architecture
Related work – Linked data map
http://wiki.dbpedia.org/Interlinking
Related work – Linked data map
● If we were to draw a map of the existing relations between linked data from bioinformatics database providers, what would it look like?
● Could we measure the amount of post genomic knowledge available related to a mouse or human genome sequence?
● Could it help answer the "what is known" question?
Related work – YeastHub
Related work – UniProt beta
Related work – HCLS demo
Bio2RDF architecture
Bio2RDF datasources currently loaded in the Atlas graph
What is known about the human and mouse genomes in 2008?
What is bioinformatics linked data?
http://bio2rdf.org/map
The Bio2RDF linked data map is a first attempt at an answer
Semantic Web Ranking
● Openness Ratio
● Average Link Rank
● Semantic Weight
Openness Ratio
Average Link Rank
Semantic Weight
The semantic mashup effect
● MeSH: OR = 0, ALR = 2
● GeneID: OR = 1, ALR = 1
● PubMed: OR = 0.5, ALR = 1.5
● Mean OR = 0.5, mean ALR = 1.5
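The two means on this slide can be checked with a short calculation. The sketch assumes that "mean" here is the plain unweighted average over the three datasources. That assumption reproduces the figures on this slide, but the transcript does not say how the means are computed in general.

```python
from statistics import mean

# (Openness Ratio, Average Link Rank) per datasource, as given on the slide.
sources = {
    "MeSH": (0.0, 2.0),
    "GeneID": (1.0, 1.0),
    "PubMed": (0.5, 1.5),
}

# Assumption: unweighted arithmetic mean over datasources.
mean_or = mean(openness for openness, _ in sources.values())
mean_alr = mean(link_rank for _, link_rank in sources.values())

print(f"mean OR = {mean_or}, mean ALR = {mean_alr}")
```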
The semantic mashup effect
● MeSH: OR = 0, ALR = 2.3
● GeneID: OR = 1, ALR = 1
● PubMed: OR = 0.5, ALR = 1.5
● Mean OR = 0.4, mean ALR = 1.6
Bio2RDF statistics by datasource
The 30 current Bio2RDF datasources
● MeSH: OR = 0
● PubMed: OR = 0.5
● GeneID: OR = 1
● Bio2RDF: OR = 0.6
30 datasources, 225 namespaces
Knowledge gain of 0.19
From 0.77 to 0.58
Bio2RDF Semantic Web Atlas in numbers
● 30 different datasources, 30 different namespaces
− go, geneid, uniprot, pubmed, pdb, reactome, omim, etc.
● 195 namespaces referencing datasources that are not yet RDFized
− cog, genethon, tigr, cath, goa, etc.
● 8 million topics
● 65 million triples
● 973 MB of compressed data in N3 format
− http://bio2rdf.org/download/bio2rdf-atlas-080414.n3.gz
Bio2RDF Semantic Web Atlas in statistics
● Openness Ratio (OR) of 0.58
● Average Link Rank (ALR) of 4.7
● 8 million topics are connected by 19 million relations within the graph
● 58% of URIs reference the open world outside the graph
● 19% knowledge gain due to the mashup effect
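The transcript does not give the formula for the Openness Ratio. One possible reading, suggested by the statement that 58% of URIs reference the open world outside the graph, is the share of outgoing links whose target is not itself described in the graph. The sketch below implements that reading only. The function name, the toy triples and the `example.org` predicate are hypothetical, and the authors' actual definition may differ.

```python
def openness_ratio(triples):
    """Fraction of URI-valued objects that are not subjects in the graph.

    Assumed reading of the Openness Ratio, not the authors' stated formula.
    Each triple is a (subject, predicate, object) tuple of strings.
    """
    described = {subject for subject, _, _ in triples}
    # Only URI objects count as links; literal values are skipped.
    links = [obj for _, _, obj in triples
             if obj.startswith(("http://", "https://", "urn:"))]
    if not links:
        return 0.0
    outside = sum(1 for obj in links if obj not in described)
    return outside / len(links)


# Toy graph (hypothetical identifiers): one link stays inside, one points outside.
TOY_TRIPLES = [
    ("http://bio2rdf.org/geneid:1", "http://example.org/xRef",
     "http://bio2rdf.org/go:0004396"),   # target not described here: outside
    ("http://bio2rdf.org/pubmed:1", "http://example.org/xRef",
     "http://bio2rdf.org/geneid:1"),     # target is a subject: inside
    ("http://bio2rdf.org/geneid:1",
     "http://www.w3.org/2000/01/rdf-schema#label", "toy gene label"),
]

print(openness_ratio(TOY_TRIPLES))
```

Under this reading, a datasource whose links all resolve inside the graph scores 0, and one whose links all point elsewhere scores 1.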
Bio2RDF search demo with SPARQL
What is known about Paget disease in the mouse and human genomes?
Submitted at http://bio2rdf.org:8890/sparql
Submit the SPARQL query to Virtuoso
SPARQL query in a URL
http://bio2rdf.org:8890/sparql?default-graph-uri=&query=CONSTRUCT+%7B%0D%0A%3Fs1+%3Fp1+%3Fo1+.%0D%0A%3Fs1+%3Chttp%3A%2F%2Fwww.w3.org%2F1999%2F02%2F22-rdf-syntax-ns%23type%3E+%3Ftype+.+%0D%0A%3Fs1+%3Chttp%3A%2F%2Fwww.w3.org%2F2000%2F01%2Frdf-schema%23label%3E+%3Flabel.+%0D%0A%3Fs1+%3Chttp%3A%2F%2Fbio2rdf.org%2Fbio2rdf%23linkRank%3E+%3FlinkRank.+%0D%0A%7D%0D%0AWHERE+%7B%0D%0A%3Fs1+%3Fp1+%3Fo1+.+%0D%0A%3Fo1+bif%3Acontains+%22paget%22+.%0D%0A%3Fs1+%3Chttp%3A%2F%2Fwww.w3.org%2F1999%2F02%2F22-rdf-syntax-ns%23type%3E+%3Ftype+.+%0D%0A%3Fs1+%3Chttp%3A%2F%2Fwww.w3.org%2F2000%2F01%2Frdf-schema%23label%3E+%3Flabel.+%0D%0A%3Fs1+%3Chttp%3A%2F%2Fbio2rdf.org%2Fbio2rdf%23linkRank%3E+%3FlinkRank.+%0D%0A%7D%0D%0A%0D%0A%0D%0A%0D%0A&format=application%2Frdf%2Bxml&debug=on
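The percent-encoded URL is hard to read, so the sketch below writes the same CONSTRUCT query out in plain text and rebuilds a query URL with the Python standard library. The query text is decoded from the URL on the slide, with the standard RDF and RDFS namespace URIs written with their hyphens. `bif:contains` is Virtuoso's full-text predicate. The parameter names mirror the slide, the helper name is hypothetical, and no request is sent.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

ENDPOINT = "http://bio2rdf.org:8890/sparql"  # endpoint named on the slides

# Readable form of the query carried by the URL on the slide.
QUERY = """CONSTRUCT {
?s1 ?p1 ?o1 .
?s1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> ?type .
?s1 <http://www.w3.org/2000/01/rdf-schema#label> ?label .
?s1 <http://bio2rdf.org/bio2rdf#linkRank> ?linkRank .
}
WHERE {
?s1 ?p1 ?o1 .
?o1 bif:contains "paget" .
?s1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> ?type .
?s1 <http://www.w3.org/2000/01/rdf-schema#label> ?label .
?s1 <http://bio2rdf.org/bio2rdf#linkRank> ?linkRank .
}"""


def build_query_url(query, result_format="application/rdf+xml"):
    """Percent-encode the query and parameters into a GET URL."""
    params = {
        "default-graph-uri": "",
        "query": query,
        "format": result_format,
        "debug": "on",
    }
    return ENDPOINT + "?" + urlencode(params)


url = build_query_url(QUERY)

# Round trip: decoding the URL gives back the readable query text.
decoded = parse_qs(urlsplit(url).query, keep_blank_values=True)
print(decoded["query"][0] == QUERY)
```

Replacing `"paget"` with another search term reuses the same pattern for other full-text searches over the graph.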
View results in HTML
View results with Sesame
View results with Piggy Bank
Future work
● Create new rdfizers for public data sources;
● Build a community of users around the Bio2RDF project (visit the Google group);
● Connect more datasources to Bio2RDF by building collaborations between research groups;
● Offer a public SPARQL endpoint based on a Virtuoso server:
− http://bio2rdf.org:8890/sparql
Conclusion
Those devices in the hands of scientists have forged our understanding of nature.
Conclusion
We have started to map the knowledge space of biology. We have a first impression of what the bioinformatics nation looks like. The time has come to explore it; the time has come to build the knowledgescope.
Acknowledgments
Jean Morissette, Nicole Tourigny, Benjamin Good
Bioinformatics lab team at CHUL Research Center: Philippe Rigault, Marc-Alexandre Nolin
Thanks to the essential annotators and data providers, and to the developers of open source projects: Sesame, Virtuoso and PiggyBank.
François Belleau was a recipient of a studentship from Génome Québec. This work has been financed in part by the Atlas of Genomic Profiles of Steroid Action, a Genome Canada project. BMG is funded by Pacific Century and University of British Columbia Graduate Fellowships.
http://bio2rdf.org
Query the graph with SPARQL http://bio2rdf.org:8890/sparql
Download our software http://sourceforge.net/projects/bio2rdf/
Download the Atlas data in N3 format http://bio2rdf.org/download
Join our group http://groups.google.ca/group/bio2rdf
Contact us at [email protected]