BioSamples Database Linked Data, SWAT4LS Tutorial

Post on 27-Jan-2015


Presentation used for the SWAT4LS tutorial about EBI RDF/Linked Data. This one is about the BioSamples Database dataset.


BioSamples Database Linked Data

Marco Brandizi, Functional Genomics Team

SWAT4LS Tutorial, Dec 9th, 2013

Find this presentation at http://tiny.cc/bsdswt13

Why a BioSamples Database (aka BioSD)?

• A reference system for searching and browsing information about biological samples used, or usable, in biomedical experiments

• Focused on the sample context (i.e., independent of the specific assay type/technology)

• Supports heterogeneous experiments

– A single place that assay repositories can link to (reference samples; an authoritative source for repositories such as Metagenomics/ENA/ArrayExpress)

– A single place for searches and for related-to or same-as relationships (e.g., see the 'myEquivalents' project)

• Allows for consistency/standardisation of sample attributes/annotations

• Common IT interfaces to access sample information and links to specific data/repositories (e.g., web, XML/REST, RDF)

Why Linked Data for BioSD?

• Yet another type of interface, potentially useful to application developers and Linked Data tools

• Integration with similar/related data-sets (see the example queries below!)

• Exploitation of ontologies (see below!)

– Standardisation

– A little semantics goes a long way

• Modelling of certain aspects is enhanced

– e.g., numbers, intervals, dates and units are detected from string value labels and triplified.

• Who knows?

– Apps!

– See the Hackathon ideas below!
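The detection of numbers, intervals, dates and units from string value labels can be illustrated with a minimal Python sketch. This is not the converter's actual logic: the function name parse_value_label, the regular expressions and the example labels are all assumptions made for illustration. The idea is that a label is tried as a date, then as an interval, then as a single number, and otherwise kept as a plain string.

```python
import re
from datetime import date

# Hypothetical patterns, chosen for illustration only.
_NUM = r"[-+]?\d+(?:\.\d+)?"
_ISO_DATE = re.compile(r"^\s*(\d{4}-\d{2}-\d{2})\s*$")
_INTERVAL = re.compile(rf"^\s*({_NUM})\s*(?:-|to)\s*({_NUM})\s*(\S.*)?$", re.IGNORECASE)
_NUMBER = re.compile(rf"^\s*({_NUM})\s*(\S.*)?$")


def _unit(text):
    """Return the stripped unit string, or None when nothing follows the number."""
    return text.strip() if text and text.strip() else None


def parse_value_label(label):
    """Classify a free-text attribute value as date, interval, number or string."""
    m = _ISO_DATE.match(label)
    if m:
        try:
            return {"kind": "date", "value": date.fromisoformat(m.group(1))}
        except ValueError:
            # Looks like a date but is not a valid one: keep it as text.
            return {"kind": "string", "label": label}
    m = _INTERVAL.match(label)
    if m:
        low, high, unit = m.groups()
        return {"kind": "interval", "low": float(low), "high": float(high), "unit": _unit(unit)}
    m = _NUMBER.match(label)
    if m:
        value, unit = m.groups()
        return {"kind": "number", "value": float(value), "unit": _unit(unit)}
    return {"kind": "string", "label": label}


if __name__ == "__main__":
    # Made-up labels, not taken from the BioSD data.
    for text in ("37 C", "10-20 mg", "2013-12-09", "homo sapiens"):
        print(text, "->", parse_value_label(text))
```

In the RDF output, such parsed parts would map naturally onto the properties mentioned later in these slides: sio:SIO_000300 (has value), sio:SIO_000221 (has unit), and biosd-terms:has-low-value / has-high-value for intervals.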

The BioSD Model

[Figure labels: Submission, Sample Groups, Samples, External links]

http://www.ebi.ac.uk/biosamples

The BioSD Model

[Figure labels: a Group's (or Submission's) samples; a Sample's (or Group's) attribute types and values; External links]

BioSD Data (External Data Sources)

SPARQL Source: http://tinyurl.com/o95xa5v

Tag Cloud made with http://www.wordle.net

SPARQL Source: http://tinyurl.com/ocyb2ld

BioSD Data (Common Attribute Types)

SPARQL Source: http://tinyurl.com/pjgdtzs

Tag Cloud made with http://www.wordle.net

BioSD Linked Data Model (Main Entities)

Please have a look at:

http://tinyurl.com/lo33ncc

BioSD Linked Data Model (Sample Attributes)

Please have a look at:

http://tinyurl.com/n5oyvyd

SPARQL Queries

Find Samples and attributes

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX biosd-terms: <http://rdf.ebi.ac.uk/terms/biosd/>
PREFIX sio: <http://semanticscience.org/resource/>

SELECT DISTINCT ?smp ?pvLabel ?propTypeLabel
WHERE {
  ?smp a biosd-terms:Sample;
       biosd-terms:has-bio-characteristic | sio:SIO_000332 ?pv. # is about

  ?pv rdfs:label ?pvLabel;
      biosd-terms:has-bio-characteristic-type ?pvType.

  ?pvType rdfs:label ?propTypeLabel.
}

• Exercise: use FILTER()/REGEX() to find samples with organism = homo sapiens

• Exercise: find the sample provenance repositories and their links

– Hint: explore the sample's links (?smp) and see what a RepositoryWebRecord looks like

Try it at: http://www.ebi.ac.uk/rdf/services/biosamples/sparql

Exercise solution: see the examples on that page
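As one possible starting point for the FILTER()/REGEX() exercise, the Python sketch below builds the query text and a GET URL for it. It is an illustration, not the official solution: the function names, the LIMIT clause and the STR() wrapping are assumptions, and the endpoint address is the one given on the slide, which may have changed since. The "query" URL parameter is the one defined by the SPARQL protocol.

```python
from urllib.parse import urlencode

# Endpoint as given in the slides; it may have moved since the tutorial.
ENDPOINT = "http://www.ebi.ac.uk/rdf/services/biosamples/sparql"

QUERY_TEMPLATE = """\
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX biosd-terms: <http://rdf.ebi.ac.uk/terms/biosd/>
PREFIX sio: <http://semanticscience.org/resource/>

SELECT DISTINCT ?smp ?pvLabel ?propTypeLabel
WHERE {{
  ?smp a biosd-terms:Sample;
       biosd-terms:has-bio-characteristic | sio:SIO_000332 ?pv.

  ?pv rdfs:label ?pvLabel;
      biosd-terms:has-bio-characteristic-type ?pvType.

  ?pvType rdfs:label ?propTypeLabel.

  FILTER ( REGEX ( STR ( ?propTypeLabel ), "{type_pattern}", "i" ) ).
  FILTER ( REGEX ( STR ( ?pvLabel ), "{value_pattern}", "i" ) ).
}}
LIMIT {limit}
"""


def sparql_string(text):
    """Escape backslashes and double quotes for use inside a SPARQL "..." literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def build_attribute_query(type_pattern, value_pattern, limit=100):
    """Return the sample/attribute query restricted by two case-insensitive regexes."""
    return QUERY_TEMPLATE.format(
        type_pattern=sparql_string(type_pattern),
        value_pattern=sparql_string(value_pattern),
        limit=int(limit),
    )


def query_url(query, endpoint=ENDPOINT):
    """Build a SPARQL-protocol GET URL carrying the query text."""
    return endpoint + "?" + urlencode({"query": query})


if __name__ == "__main__":
    q = build_attribute_query("^organism$", "^homo sapiens$")
    print(q)
    print(query_url(q))
```

The query text can also be pasted directly into the endpoint's web form; the URL helper only matters when calling the service from a script.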

Samples about a given organism

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX biosd-terms: <http://rdf.ebi.ac.uk/terms/biosd/>

SELECT DISTINCT ?smp ?pvLabel ?propTypeLabel
WHERE {
  ?smp biosd-terms:has-bio-characteristic ?pv.

  ?pv biosd-terms:has-bio-characteristic-type ?pvType;
      rdfs:label ?pvLabel.

  ?pvType a ?pvTypeClass.

  # Listeria
  ?pvTypeClass rdfs:label ?propTypeLabel;
               # '*' gives you transitive closure, even when inference is disabled
               rdfs:subClassOf* <http://purl.obolibrary.org/obo/NCBITaxon_1637>
}

• Exercise: use the BioPortal service to first find all subclasses of 'alcohol' (obo:CHEBI_30879) and then search for samples annotated with such subclasses

– Hint: use SERVICE <http://sparql.bioontology.org/ontologies/sparql/?apikey=KEY>

Try it at: http://www.ebi.ac.uk/rdf/services/biosamples/sparql

Exercise solution: see one of the examples on that page
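The SERVICE hint can be sketched as a federated query assembled in Python. This is only an illustration under assumptions: the function name is made up, the remote pattern asks for direct subclasses only (whether the remote service supports property paths such as rdfs:subClassOf* is not something the slides state), and a valid API key has to be supplied by the reader. The local part of the pattern follows the organism query shown above.

```python
from urllib.parse import urlencode

# Remote SPARQL service named in the slide's hint.
BIOPORTAL_SPARQL = "http://sparql.bioontology.org/ontologies/sparql/"


def build_federated_query(class_iri, api_key):
    """Return a query that fetches subclasses remotely and matches samples locally."""
    if any(ch in class_iri for ch in '<>" \n\t'):
        raise ValueError("class_iri must be a plain IRI without brackets or spaces")
    service_iri = BIOPORTAL_SPARQL + "?" + urlencode({"apikey": api_key})
    return f"""\
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX biosd-terms: <http://rdf.ebi.ac.uk/terms/biosd/>

SELECT DISTINCT ?smp ?pvLabel ?pvTypeClass
WHERE {{
  # Step 1: ask the remote ontology service for the direct subclasses.
  SERVICE <{service_iri}> {{
    ?pvTypeClass rdfs:subClassOf <{class_iri}>.
  }}

  # Step 2: find local samples whose attribute type is an instance of one of them.
  ?smp biosd-terms:has-bio-characteristic ?pv.
  ?pv biosd-terms:has-bio-characteristic-type ?pvType;
      rdfs:label ?pvLabel.
  ?pvType a ?pvTypeClass.
}}
"""


if __name__ == "__main__":
    # "KEY" is a placeholder, exactly as in the slide's hint.
    print(build_federated_query("http://purl.obolibrary.org/obo/CHEBI_30879", "KEY"))
```

Putting the SERVICE block first mirrors the wording of the exercise (subclasses first, then samples); the actual evaluation order is up to the query engine.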

Geo-located Samples/Sample Groups

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX biosd-terms: <http://rdf.ebi.ac.uk/terms/biosd/>
PREFIX sio: <http://semanticscience.org/resource/>

SELECT DISTINCT ?item ?latVal ?longVal
WHERE {
  ?item biosd-terms:has-bio-characteristic ?latPv, ?longPv.

  ?latPv biosd-terms:has-bio-characteristic-type [ rdfs:label ?latLabel ];
         sio:SIO_000300 ?latVal. # sio:has value

  FILTER ( REGEX ( ?latLabel, "latitude", "i" ) ).

  ?longPv biosd-terms:has-bio-characteristic-type [ rdfs:label ?longLabel ];
          sio:SIO_000300 ?longVal. # sio:has value

  FILTER ( REGEX ( ?longLabel, "longitude", "i" ) ).
}

• Find all samples having an attribute of type temperature, with a numerical value and a unit specified. Hint: use sio:SIO_000221 (has unit), sio:SIO_000300 (has value)

• Find samples/groups annotated with intervals, which use the properties biosd-terms:has-low-value and has-high-value and optionally have a unit.

Try it at: http://www.ebi.ac.uk/rdf/services/biosamples/sparql

Exercise solutions: see the examples on that page
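Once the geo-located query returns results in the standard SPARQL JSON results format, they can be turned into something a map widget understands. The sketch below converts the ?item, ?latVal and ?longVal bindings into a GeoJSON FeatureCollection. The function name and the sample IRIs are invented for illustration, and rows with missing or out-of-range coordinates are simply skipped, which is one possible policy among several.

```python
import json


def bindings_to_geojson(sparql_json):
    """Convert SPARQL JSON results with ?item ?latVal ?longVal into GeoJSON points."""
    features = []
    for row in sparql_json["results"]["bindings"]:
        try:
            item = row["item"]["value"]
            lat = float(row["latVal"]["value"])
            lon = float(row["longVal"]["value"])
        except (KeyError, ValueError):
            continue  # unbound variable or non-numeric value
        if not (-90 <= lat <= 90 and -180 <= lon <= 180):
            continue  # not a plausible coordinate pair (also drops NaN)
        features.append({
            "type": "Feature",
            # GeoJSON orders coordinates as [longitude, latitude].
            "geometry": {"type": "Point", "coordinates": [lon, lat]},
            "properties": {"item": item},
        })
    return {"type": "FeatureCollection", "features": features}


if __name__ == "__main__":
    # Made-up result rows in the SPARQL JSON results layout.
    example = {
        "head": {"vars": ["item", "latVal", "longVal"]},
        "results": {"bindings": [
            {"item": {"type": "uri", "value": "http://example.org/sample/1"},
             "latVal": {"type": "literal", "value": "52.08"},
             "longVal": {"type": "literal", "value": "0.18"}},
            {"item": {"type": "uri", "value": "http://example.org/sample/2"},
             "latVal": {"type": "literal", "value": "north"},
             "longVal": {"type": "literal", "value": "0.18"}},
        ]},
    }
    print(json.dumps(bindings_to_geojson(example), indent=2))
```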

Expressed Genes and Samples

• For http://purl.uniprot.org/uniprot/P04637 (P53 in Human)

• Find the EFO classes for which it is up-regulated in the Atlas (p-value < 1E-9)

• And show the Atlas expression value label

• Hints:

– Start from the example http://tinyurl.com/kvvhw6b

– Use the Atlas endpoint: http://www.ebi.ac.uk/rdf/services/atlas/sparql

• Find the samples having attributes that are instances of such EFO classes

• And that come from a repository other than 'ArrayExpress'

• Hints:

– Use SERVICE <http://www.ebi.ac.uk/rdf/services/biosamples/sparql> and a sub-query

– Search for property values linked to property types that are instances of the experimental factors found via the Atlas

– Then link the property values to the samples, the samples to the submissions, and the submissions to the web records

• Or just have a look: http://tinyurl.com/ln3m7nv (will take a while...)

Ideas for the Hackathon

• Refer to http://tinyurl.com/mo7wgye for details

• From geo-located samples (samples annotated with latitude/longitude) to Google Maps, e.g., by using Exhibit (http://www.simile-widgets.org/exhibit/)

• Take similar datasets (e.g., MAASTRO, Breast Cancer Data, your data), unify the schemas (e.g., using CONSTRUCT), define federated queries

• Use the Shape or OpenPHACTS validator to define sensible rules for BioSD and similar data-sets, e.g., must contain an organism, should have a treatment

• Design/build an App (or Web widget) that asks for eligibility criteria, i.e., pairs of attribute value/type, and translates them into a SPARQL query (or a more complex search based on SPARQL) to find samples

– Use common ontologies for auto-completion over property types

– Use string-based auto-completion for values

– Consider numerical values, intervals, units

– Do approximate matching, e.g., matching 8 out of 10 of the specified pairs is good enough.
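The approximate-matching idea for the eligibility App can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the function names, the case-insensitive exact comparison of type/value pairs, and the in-memory dictionary standing in for samples that would really come from SPARQL results. A sample is kept when the fraction of criteria it satisfies reaches a threshold such as 0.8.

```python
def _norm(text):
    """Normalise a type or value label for comparison (trim and lower-case)."""
    return text.strip().lower()


def match_score(criteria, sample_attrs):
    """Fraction of (type, value) criteria that appear among a sample's attributes."""
    if not criteria:
        return 0.0
    have = {(_norm(t), _norm(v)) for t, v in sample_attrs}
    hits = sum(1 for t, v in criteria if (_norm(t), _norm(v)) in have)
    return hits / len(criteria)


def find_matches(criteria, samples, threshold=0.8):
    """Return (score, sample_id) pairs at or above the threshold, best first."""
    scored = ((match_score(criteria, attrs), sid) for sid, attrs in samples.items())
    kept = [(score, sid) for score, sid in scored if score >= threshold]
    return sorted(kept, key=lambda pair: (-pair[0], pair[1]))


if __name__ == "__main__":
    # Hypothetical criteria and samples, for illustration only.
    criteria = [("organism", "Homo sapiens"), ("sex", "female"),
                ("organism part", "liver"), ("disease state", "normal"),
                ("age", "40")]
    samples = {
        "s1": [("Organism", "homo sapiens"), ("sex", "female"),
               ("organism part", "liver"), ("disease state", "normal")],
        "s2": [("organism", "Mus musculus"), ("sex", "female")],
    }
    print(find_matches(criteria, samples))
```

A fuller version would replace the exact pair comparison with ontology-aware matching for types and with numeric or interval comparison for values, as the other bullets suggest.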

Acknowledgements

• BioSD Team - Alvis Brazma, Tony Burdett, Adam Faulconbridge, Mike Gostev, Helen Parkinson, Rui Perreria, Ugis Sarkans, Drashtti Vasant

• Tony Burdett for the help with Zooma

• Simon Jupp, Andy Jenkinson, James Malone, for their great help with developing and setting up BioSD/RDF

– The rest of the Linked Data team @EBI (http://www.ebi.ac.uk/rdf)

• BiomedBridges FP7 project (http://www.biomedbridges.eu), for funding us

And you all!

Sorry, we have 2.7M samples, but not all of them... (Source: http://en.wikipedia.org/wiki/File:Assorted_computer_mice_-_MfK_Bern.jpg)

Contact info:

www.ebi.ac.uk/biosamples

www.marcobrandizi.info

Extras

Main Ontologies used in BioSD / Linked Data

• biosd-terms (http://tiny.cc/biosd_terms)

– a small application ontology defining specific classes and properties, e.g., sample, sample group, has-knowledgeable-person

• Experimental Factor Ontology (EFO)

– mainly to define/annotate sample attributes

• Ontology for Biomedical Investigations (OBI)

• Information Artifact Ontology (IAO)

• Semanticscience Integrated Ontology (SIO)

– to define main classes in BioSD/RDF

• Bibliographic Ontology (BIBO)

– We link publications about submissions/sample sets

• Dublin Core, schema.org, FOAF

– for general categories and in the Linked Data spirit

• Linked automatically by Zooma: many more (e.g., CHEBI, NCBI-Tax, GO)

BioSD → RDF Conversion

github.com/EBIBioSamples/biosd2rdf
