FlyWeb: the way to go for biological data integration
Jun Zhao, Alistair Miles and Graham Klyne
Image Bioinformatics Research Group
Department of Zoology
University of Oxford
FlyWeb Application
To answer questions such as “What does this gene do?”:
  Gene expression images
  Sequence and ESTs (expressed sequence tags) of the gene
  Publications about the gene
  ...
A first example of the Image Web that our group is developing
Investigate the feasibility of existing Semantic Web tools and technologies for real applications
Gene expression images
Reveal gene expression patterns at different developmental stages
Important for identifying genes of interest and verifying a picture of probable gene functions
FlyWeb demonstration
http://openflydata.org/flyui/build/apps/imagemashup2/ Run application: [go]
Two examples:
  Single gene query (aos1)
  Use gene synonyms to enhance gene matching (rbf)
More than one synonym for gene “rbf”
How does it work?
Data from 3 independent sources:
  www.flybase.org – model organism reference database, gene names and identifiers
  www.fruitfly.org (BDGP) – embryo in situ images
  www.fly-ted.org – testis in situ images
All data accessed via SPARQL
Pure Ajax user application
Essentially, a mashup using a SPARQL API
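A minimal sketch of what "accessed via SPARQL" means at the HTTP level, assuming the standard SPARQL protocol convention of sending the query text in a `query` parameter of a GET request. The endpoint URL is the one listed in the slides; the function names and the example query are hypothetical, and no network call is made.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Endpoint URL as listed in the slides; whether it is still live is not assumed.
FLYTED_ENDPOINT = "http://openflydata.org/query/flyted"


def build_sparql_get_url(endpoint: str, query: str) -> str:
    """Return a GET URL with the query text in the 'query' parameter."""
    return endpoint + "?" + urlencode({"query": query})


def read_back_query(url: str) -> str:
    """Decode the 'query' parameter again, as a server would on receipt."""
    return parse_qs(urlsplit(url).query)["query"][0]


# Illustrative query only; it is not one of the application's queries.
EXAMPLE_QUERY = "SELECT * WHERE { ?s ?p ?o } LIMIT 5"
example_url = build_sparql_get_url(FLYTED_ENDPOINT, EXAMPLE_QUERY)
```

A client that wants JSON back would typically also send an `Accept: application/sparql-results+json` header with this request.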
The client side
FlyUI: a library of JavaScript widgets as front ends to SPARQL data sources
Built on the Yahoo User Interface (YUI) library
Widgets are composed in a browser to create the complete application
Each widget provides:
  A Service that implements SPARQL queries
  A Model encapsulating SPARQL query results
  A Renderer
The in situ search application
GeneFinderWidget
FlyTED ImageWidget
BDGP ImageWidget
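The Service / Model / Renderer split described for each widget could look roughly like the sketch below. It is written in Python rather than the JavaScript that FlyUI uses, and every class name, the canned result document, and the example.org URLs are hypothetical. The query function is injected so the example runs without a network, and the query text omits the PREFIX declarations a real request would need.

```python
import json
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class ImageModel:
    """Model: rows extracted from a SPARQL JSON results document."""
    rows: List[Dict[str, str]]

    @classmethod
    def from_sparql_json(cls, text: str) -> "ImageModel":
        doc = json.loads(text)
        rows = [
            {var: cell["value"] for var, cell in binding.items()}
            for binding in doc["results"]["bindings"]
        ]
        return cls(rows)


class ImageService:
    """Service: builds the query for a gene and fetches the result."""

    def __init__(self, run_query: Callable[[str], str]):
        self._run_query = run_query  # injected; a real widget would call an endpoint

    def find_images(self, gene_uri: str) -> ImageModel:
        query = (
            "SELECT DISTINCT * WHERE { ?fullImageURL "
            f"flyted:associatesToGene <{gene_uri}> ; "
            "flyted:thumbnail ?thumbnailURL ; rdfs:label ?caption }"
        )
        return ImageModel.from_sparql_json(self._run_query(query))


class TextRenderer:
    """Renderer: plain text stands in for the browser widgets."""

    def render(self, model: ImageModel) -> str:
        return "\n".join(f"{r['caption']}: {r['thumbnailURL']}" for r in model.rows)


# Made-up result document in the SPARQL JSON results layout.
CANNED = json.dumps({
    "head": {"vars": ["fullImageURL", "thumbnailURL", "caption"]},
    "results": {"bindings": [{
        "fullImageURL": {"type": "uri", "value": "http://example.org/full/1.jpg"},
        "thumbnailURL": {"type": "uri", "value": "http://example.org/thumb/1.jpg"},
        "caption": {"type": "literal", "value": "example caption"},
    }]},
})

sent_queries: List[str] = []


def fake_run_query(query: str) -> str:
    """Record the query and return the canned document instead of calling out."""
    sent_queries.append(query)
    return CANNED


model = ImageService(fake_run_query).find_images("http://example.org/gene/placeholder")
rendered = TextRenderer().render(model)
```

Keeping the three roles separate means a renderer can be swapped without touching the query logic, which is one plausible reason for the split.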
Gene name mapping
FlyTED and BDGP use different gene names:
  FlyTED data are derived from spreadsheets with an imperfectly controlled gene name vocabulary
  BDGP's data are annotated using FlyBase's unique FBgn numbers
Use FlyBase for automatic gene mapping
Additional inputs from scientists for disambiguating many-to-many mappings
Mappings are stored as a JSON file to assist the “GeneFinder” widget (no need for RDF/OWL reasoning at this stage)
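As an illustration of how a JSON mapping file can serve gene lookup without any reasoning, here is a small sketch. The file layout (lower-cased gene name mapped to a list of FlyBase identifiers) is an assumption, since the slides do not show the actual format, and the names and identifiers are placeholders rather than real records.

```python
import json

# Hypothetical mapping content: gene name -> candidate FlyBase IDs.
# Names and identifiers are placeholders, not real FlyBase records.
MAPPING_JSON = """
{
  "geneA": ["FBgn0000001"],
  "geneB": ["FBgn0000002", "FBgn0000003"]
}
"""


def load_mapping(text):
    """Parse the mapping and normalise keys to lower case."""
    return {name.lower(): ids for name, ids in json.loads(text).items()}


def find_flybase_ids(mapping, user_input):
    """Return every candidate ID; more than one means the name is ambiguous."""
    return mapping.get(user_input.strip().lower(), [])


mapping = load_mapping(MAPPING_JSON)
```

A name that maps to several identifiers, like `geneB` here, is the kind of case where input from scientists would be needed to choose between candidates.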
SPARQL queries
Free-text matching
Case-insensitive searching
Very important for our users
Too expensive using a SPARQL FILTER
Pre-generate lower-case gene names and load them into the FlyBase RDF DB
SELECT * WHERE {
  ?gene fbutil:anyName "userInput"^^xs:string ;
        a chado:Feature ;
        chado:name ?symbol ;
        chado:uniquename ?flybaseID .
  OPTIONAL { ?gene chado:dbxref [ chado:accession ?annotationSymbol ] . }
  OPTIONAL { ?gene chado:synonym [ chado:name ?synonym ] . }
  OPTIONAL { ?gene chado:synonym [ a syntype:FullName ; chado:name ?fullName ] . }
}
SELECT DISTINCT * WHERE {
  ?fullImageURL flyted:associatesToGene <http://openflydata.org/id/flyted/gene-geneName> ;
                flyted:associatesToGene ?gene ;
                flyted:thumbnail ?thumbnailURL ;
                rdfs:seeAlso ?flytedURL ;
                rdfs:label ?caption
}
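The lower-casing approach can be sketched as follows: the client lower-cases the user's input and asks for an exact match against the pre-generated lower-case names, so no FILTER is needed. The function names are hypothetical, the OPTIONAL clauses are left out for brevity, and the PREFIX declarations for `fbutil:`, `xs:` and `chado:` are omitted because the slides do not give all of their URIs.

```python
def sparql_string_literal(text: str) -> str:
    """Escape a Python string for use as a double-quoted SPARQL literal."""
    escaped = (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return f'"{escaped}"'


def gene_finder_query(user_input: str) -> str:
    """Build an exact-match query on the lower-cased input (no FILTER)."""
    literal = sparql_string_literal(user_input.strip().lower())
    return (
        "SELECT * WHERE {\n"
        f"  ?gene fbutil:anyName {literal}^^xs:string ;\n"
        "        a chado:Feature ;\n"
        "        chado:name ?symbol ;\n"
        "        chado:uniquename ?flybaseID .\n"
        "}"
    )


example_query = gene_finder_query("  Rbf ")
```

Escaping the input before embedding it in the query text keeps a stray quote in a gene name from breaking the query.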
The RDF data sources
FlyBase and BDGP: relational databases
FlyTED, an image repository built using Eprints
FlyAtlas (forthcoming), tissue-specific Drosophila gene expression levels, as a single spreadsheet
Creating RDF from data sources
D2RQ mapping
  FlyBase and BDGP, native relational databases
  Conservative mapping, with minimum interpretation
OAI2SPARQL
  Harvesting N3 RDF metadata via the OAI-PMH protocol, built-in support by Eprints
  Further from ESWC2008 paper
Custom Python program
  FlyAtlas
  Generating N3 from spreadsheet table
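The spreadsheet-to-N3 step might look something like the sketch below. It is not the authors' program: the namespace, the column names and the sample row are all invented for illustration, and column headers are assumed to be simple alphanumeric words that are valid as local names.

```python
import csv
import io

# Hypothetical namespace, chosen for illustration only.
PREFIX_LINE = "@prefix ex: <http://example.org/flyatlas/> ."


def n3_literal(text):
    """Quote a cell value as a plain N3 string literal."""
    return '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'


def rows_to_n3(csv_text):
    """Yield N3 lines: one subject per spreadsheet row, one triple per cell."""
    yield PREFIX_LINE
    reader = csv.DictReader(io.StringIO(csv_text))
    for index, row in enumerate(reader, start=1):
        subject = f"ex:row{index}"
        for column, value in row.items():
            if value:  # skip empty cells rather than emit empty literals
                yield f"{subject} ex:{column} {n3_literal(value)} ."


# Made-up sample data standing in for an exported spreadsheet.
SAMPLE_CSV = "gene,tissue,level\ngeneA,testis,12.5\n"
n3_lines = list(rows_to_n3(SAMPLE_CSV))
```

Mapping each column directly to a property, with no interpretation of the values, mirrors the conservative approach described for the D2RQ mappings.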
More about the data sources
Bulk download
  http://openflydata.org/dump/flybase, ~8m triples
  http://openflydata.org/dump/bdgp, ~1m triples
  http://openflydata.org/dump/flyted, ~30,000 triples
SPARQL endpoint
  http://openflydata.org/query/flybase
  http://openflydata.org/query/bdgp
  http://openflydata.org/query/flyted
Schema
  http://purl.org/net/chado/schema/
  http://purl.org/net/flybase/synonym-types/
  http://purl.org/net/bdgp/schema/
SPARQL server
Amazon EC2 (Elastic Compute Cloud):
  To run SPARQL endpoints
  To host the demo you've just seen
Jena TDB as triple store
  For better loading performance: ~6K triples per second (tps) for ~9M triples to Amazon Elastic Block Storage (EBS)
  For better querying performance
SPARQLite
  Home-grown SPARQL protocol implementation
  More later
Apache, Tomcat, mod_jk, etc.
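As a rough check on what the quoted load rate means in practice, the arithmetic can be written out. This assumes the rate stays constant over the whole load, which the slides do not claim; the function name is hypothetical.

```python
def load_time_seconds(triples: int, triples_per_second: int) -> float:
    """Estimated load time at a constant rate."""
    return triples / triples_per_second


# Figures quoted in the slide: ~9M triples at ~6K triples per second.
seconds = load_time_seconds(9_000_000, 6_000)
minutes = seconds / 60
```

Under that assumption the full load takes about 1,500 seconds, or roughly 25 minutes.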
SPARQLite protocol
http://sparqlite.googlecode.com
Also a platform for exploring SPARQL service quality concerns (more later)
Motivation
  Enable streaming
  Create a database connection pool
Designed for Jena TDB/SDB + Postgres
Restricted forms of query (SELECT, ASK)
Restricted query result format (e.g. only JSON)
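One way to restrict an endpoint to SELECT and ASK is to inspect the first keyword after the query prologue. The sketch below is an illustration of that idea only; it is not how SPARQLite is implemented (the slides do not show that), all names are hypothetical, and it does not validate the rest of the query.

```python
import re

ALLOWED_FORMS = {"SELECT", "ASK"}

# Whitespace and '#' comments between tokens of the prologue.
_SKIP = re.compile(r"(?:\s+|#[^\n]*)*")
# A PREFIX or BASE declaration; the IRI is consumed whole, so a '#'
# inside <...> is never mistaken for a comment.
_DECL = re.compile(r"(?:PREFIX\s+[^\s:<]*:\s*<[^>]*>|BASE\s*<[^>]*>)", re.IGNORECASE)
_WORD = re.compile(r"[A-Za-z]+")


def query_form(query: str) -> str:
    """Return the upper-cased keyword that follows the prologue, or ''."""
    pos = _SKIP.match(query, 0).end()
    while True:
        decl = _DECL.match(query, pos)
        if decl is None:
            break
        pos = _SKIP.match(query, decl.end()).end()
    word = _WORD.match(query, pos)
    return word.group(0).upper() if word else ""


def is_allowed(query: str) -> bool:
    """Accept only the query forms listed in ALLOWED_FORMS."""
    return query_form(query) in ALLOWED_FORMS
```

A server would run a check like this before handing the query to the store and reject CONSTRUCT or DESCRIBE requests with an error response.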
Lessons
RDF provides a uniform and flexible data model
RDF dump is cheaper and quicker
Maintaining a separate SPARQL endpoint for each data source makes handling data updates easier than with a data warehouse approach
RDF facilitates data re-use and re-purposing
SPARQL raises the point of departure for an application
Benefits for the future:
  Linking to other data sources
  Querying genes using the Fly Anatomy ontology
  Magic of inference
Performance
Loading:
  Our datasets: ~10 million triples
  Jena / RDB / Postgres, OK with <1 M triples
  Jena / SDB / Postgres better, but problems with load performance with larger datasets
  Jena / TDB gives much better load performance (~6K tps), even on a 32-bit system with Amazon EBS storage (but not so good with local EC2 store)
  Virtuoso performs reasonably well
Querying, particularly text matching and case-insensitive search:
  Problems with using SPARQL regex filter, the only mechanism for case-insensitive search in SPARQL
  Tried with OpenLink Virtuoso, still ~10 seconds for a case-insensitive search
  Any suggestions?
Further lessons
SPARQL results streaming
  Resolves out-of-memory errors for large datasets
  Joseki / SDB / Postgres can be made to stream results, but using just a single JDBC connection, causing performance problems with concurrent requests
  Therefore, SPARQLite
The openness of SPARQL:
  SPARQL is an inherently open query language and protocol
  Open endpoints are vulnerable to simple queries that can overload the service, exposing them to denial-of-service-style attacks (whether intended or not)
  Futures: API key mechanism? Restricted SPARQL profiles?
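To illustrate why streaming avoids the out-of-memory problem mentioned above, the sketch below emits a SPARQL JSON results document one row at a time instead of building the whole document first. It is a simplified illustration, not SPARQLite's code: the function name is hypothetical and every value is emitted as a plain literal.

```python
import json
from typing import Dict, Iterable, Iterator, List


def stream_sparql_json(variables: List[str],
                       rows: Iterable[Dict[str, str]]) -> Iterator[str]:
    """Yield a SPARQL JSON results document piece by piece."""
    yield '{"head": {"vars": ' + json.dumps(variables) + '}, "results": {"bindings": ['
    first = True
    for row in rows:
        # Simplification: every value is typed as a plain literal.
        binding = {var: {"type": "literal", "value": value} for var, value in row.items()}
        yield ("" if first else ", ") + json.dumps(binding)
        first = False
    yield "]}}"


def fake_rows() -> Iterator[Dict[str, str]]:
    """Stand-in for a database cursor; the values are made up."""
    for number in range(3):
        yield {"symbol": f"gene{number}"}


chunks = list(stream_sparql_json(["symbol"], fake_rows()))
document = json.loads("".join(chunks))
```

Only one row is held in memory at a time, so the server's footprint stays flat however large the result set is. Here the chunks are joined only to check that they form a valid document.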
Future directions
Adding new data sources:
  FlyAtlas tissue-specific Drosophila gene expression levels
  More information from FlyBase – e.g. references
More applications:
  Find all the gene expression images of a gene's neighbours
  Find all the genes related to “blood pressure”
  ...
Linked data (dereferenceable, follow-your-nose)
  We're thinking about this, but our application does not currently need it
How to control and predict quality of service for open SPARQL endpoints
Acknowledgement
Alistair Miles, Graham Klyne and David Shotton
Dr Helen White-Cooper and her research group
BBSRC, for funding the construction of the FlyTED database
BDGP and FlyBase for making the data available
JISC, for funding the FlyWeb project
The Jena team, esp. Andy Seaborne