BioSamples Database Linked Data, SWAT4LS Tutorial

BioSamples Database Linked Data. Marco Brandizi, Functional Genomics Team. SWAT4LS Tutorial, Dec 9th, 2013. Find this presentation at http://tiny.cc/bsdswt13

description

Presentation used for the EBI RDF/Linked Data tutorial at SWAT4LS. This part is about the BioSamples Database dataset.

Transcript of BioSamples Database Linked Data, SWAT4LS Tutorial

Page 1: BioSamples Database Linked Data, SWAT4LS Tutorial

BioSamples Database Linked Data

Marco Brandizi, Functional Genomics Team

SWAT4LS Tutorial, Dec 9th, 2013

Find this presentation at http://tiny.cc/bsdswt13

Page 2: BioSamples Database Linked Data, SWAT4LS Tutorial

Why a BioSamples Database (aka BioSD)?

• A reference system for searching and browsing information about biological samples that are used, or can be used, in biomedical experiments

• Focused on the sample context (i.e., independent of the specific assay type/technology)

• Supports heterogeneous experiments

– A single place that assay repositories can link to (reference samples, authoritative source for repositories like Metagenomics/ENA/ArrayExpress)

– A single place for searches and for related-to or same-as relationships (e.g., see the 'myEquivalents' project)

• Allows for consistency/standardisation of sample attributes/annotations

• Common IT interfaces to access sample information and links to specific data/repositories (e.g., web, XML/REST, RDF)

Page 3: BioSamples Database Linked Data, SWAT4LS Tutorial

Why Linked Data for BioSD?

• Yet another type of interface, potentially useful to application developers and Linked Data tools

• Integration with similar/related data-sets (see example queries below!)

• Exploitation of ontologies (see below!)

– Standardisation

– A little semantics goes a long way

• Enhanced modelling of certain aspects

– e.g., numbers, intervals, dates and units are detected from string value labels and triplified

• Who knows?

– Apps!

– See Hackathon ideas below!
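The detection of numbers, intervals, dates and units from string value labels, mentioned on this slide, can be illustrated with a minimal Python sketch. This is not the logic of the actual BioSD converter: the regular expressions, the `Quantity` and `Interval` classes and the `parse_label` function are all hypothetical names introduced here, and real labels will need far more cases than these.

```python
import re
from dataclasses import dataclass
from datetime import date
from typing import Optional, Union


@dataclass(frozen=True)
class Quantity:
    """A single number with an optional unit, e.g. '37 C'."""
    value: float
    unit: Optional[str] = None


@dataclass(frozen=True)
class Interval:
    """A low/high pair with an optional unit, e.g. '10-20 mg'."""
    low: float
    high: float
    unit: Optional[str] = None


# Illustrative patterns only (assumption): plain decimal numbers, ISO dates,
# and intervals written as 'low-high' or 'low to high'.
_NUM = r"[-+]?\d+(?:\.\d+)?"
_DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}$")
_INTERVAL_RE = re.compile(
    rf"^\s*({_NUM})\s*(?:-|to)\s*({_NUM})\s*(\S.*)?$", re.IGNORECASE
)
_QUANTITY_RE = re.compile(rf"^\s*({_NUM})\s*(\S.*)?$")


def parse_label(label: str) -> Union[date, Quantity, Interval, str]:
    """Return a typed value when the label looks structured, else the label itself."""
    text = label.strip()
    # Dates are checked first, otherwise '2013-12-09' would look like an interval.
    if _DATE_RE.match(text):
        try:
            return date.fromisoformat(text)
        except ValueError:
            pass  # e.g. month 13: fall through to the other patterns
    match = _INTERVAL_RE.match(text)
    if match:
        low, high, unit = match.groups()
        return Interval(float(low), float(high), unit.strip() if unit else None)
    match = _QUANTITY_RE.match(text)
    if match:
        value, unit = match.groups()
        return Quantity(float(value), unit.strip() if unit else None)
    return label
```

For instance, `parse_label("10-20 mg")` yields an `Interval` and `parse_label("homo sapiens")` is returned unchanged; each typed result could then be mapped to separate value and unit triples.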

Page 4: BioSamples Database Linked Data, SWAT4LS Tutorial

The BioSD Model

Sample Groups

Submission

External links

Samples

http://www.ebi.ac.uk/biosamples

Page 5: BioSamples Database Linked Data, SWAT4LS Tutorial

The BioSD Model

Group's (or Submission's) samples

Sample's (or Group's) attribute types and values

External links

Page 6: BioSamples Database Linked Data, SWAT4LS Tutorial

BioSD Data (External Data Sources)

SPARQL Source: http://tinyurl.com/o95xa5v

Tag Cloud made with http://www.wordle.net

SPARQL Source: http://tinyurl.com/ocyb2ld

Page 7: BioSamples Database Linked Data, SWAT4LS Tutorial

BioSD Data (Common Attribute Types)

SPARQL Source: http://tinyurl.com/pjgdtzs

Tag Cloud made with http://www.wordle.net

Page 8: BioSamples Database Linked Data, SWAT4LS Tutorial

BioSD Linked Data Model (Main Entities)

Please have a look at:

http://tinyurl.com/lo33ncc

Page 9: BioSamples Database Linked Data, SWAT4LS Tutorial

BioSD Linked Data Model (Sample Attributes)

Please have a look at:

http://tinyurl.com/n5oyvyd

Page 10: BioSamples Database Linked Data, SWAT4LS Tutorial

SPARQL Queries

Page 11: BioSamples Database Linked Data, SWAT4LS Tutorial

Find Samples and attributes

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX biosd-terms: <http://rdf.ebi.ac.uk/terms/biosd/>
PREFIX sio: <http://semanticscience.org/resource/>

SELECT DISTINCT ?smp ?pvLabel ?propTypeLabel
WHERE {
  ?smp a biosd-terms:Sample;
       biosd-terms:has-bio-characteristic | sio:SIO_000332 ?pv. # is about

  ?pv rdfs:label ?pvLabel;
      biosd-terms:has-bio-characteristic-type ?pvType.

  ?pvType rdfs:label ?propTypeLabel.
}

• Exercise: use FILTER()/REGEX() to find samples with organism = homo sapiens

• Exercise: find sample provenance repositories and their links

– Hint: explore the sample's links (?smp) and see what a RepositoryWebRecord looks like

Try it at: http://www.ebi.ac.uk/rdf/services/biosamples/sparql

Exercise solution: see the examples on that page
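One possible way to approach the FILTER()/REGEX() exercise from a script, rather than from the web form, is sketched below in Python. It extends the query on this slide with two case-insensitive filters and sends it to the endpoint with the standard SPARQL protocol. The function names (`sparql_literal`, `build_attribute_query`, `run_select`) are hypothetical, the patterns "organism" and "homo sapiens" are assumptions about how the labels are written, and the endpoint URL is the one given on this slide, which may no longer be reachable.

```python
import json
import urllib.parse
import urllib.request

# Endpoint as given on the slide (its current availability is an assumption).
BIOSD_ENDPOINT = "http://www.ebi.ac.uk/rdf/services/biosamples/sparql"

PREFIXES = """\
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX biosd-terms: <http://rdf.ebi.ac.uk/terms/biosd/>
PREFIX sio: <http://semanticscience.org/resource/>
"""


def sparql_literal(text: str) -> str:
    """Quote text as a SPARQL string literal, escaping backslashes and quotes."""
    return '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'


def build_attribute_query(type_pattern: str, value_pattern: str, limit: int = 100) -> str:
    """Build the slide's query plus REGEX filters on the type and value labels."""
    return PREFIXES + f"""
SELECT DISTINCT ?smp ?pvLabel ?propTypeLabel
WHERE {{
  ?smp a biosd-terms:Sample;
       biosd-terms:has-bio-characteristic | sio:SIO_000332 ?pv. # is about

  ?pv rdfs:label ?pvLabel;
      biosd-terms:has-bio-characteristic-type ?pvType.

  ?pvType rdfs:label ?propTypeLabel.

  FILTER ( REGEX ( ?propTypeLabel, {sparql_literal(type_pattern)}, "i" ) ).
  FILTER ( REGEX ( ?pvLabel, {sparql_literal(value_pattern)}, "i" ) ).
}}
LIMIT {int(limit)}
"""


def run_select(query: str, endpoint: str = BIOSD_ENDPOINT) -> list:
    """POST a SELECT query and return the list of result bindings (needs network)."""
    data = urllib.parse.urlencode({"query": query}).encode("utf-8")
    request = urllib.request.Request(
        endpoint,
        data=data,
        headers={"Accept": "application/sparql-results+json"},
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.load(response)["results"]["bindings"]
```

A call such as `run_select(build_attribute_query("organism", "homo sapiens"))` would then return one dictionary per result row. The patterns are passed to REGEX as they are, so regular-expression metacharacters in them keep their special meaning.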

Page 12: BioSamples Database Linked Data, SWAT4LS Tutorial

Samples about a given organism

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX biosd-terms: <http://rdf.ebi.ac.uk/terms/biosd/>

SELECT DISTINCT ?smp ?pvLabel ?propTypeLabel
WHERE {
  ?smp biosd-terms:has-bio-characteristic ?pv.

  ?pv biosd-terms:has-bio-characteristic-type ?pvType;
      rdfs:label ?pvLabel.

  ?pvType a ?pvTypeClass.

  # Listeria
  ?pvTypeClass rdfs:label ?propTypeLabel;
               # '*' gives you transitive closure, even when inference is disabled
               rdfs:subClassOf* <http://purl.obolibrary.org/obo/NCBITaxon_1637>
}

• Exercise: use the BioPortal service to first find all subclasses of 'alcohol' (obo:CHEBI_30879) and then search for samples annotated with such subclasses

– Hint: Use SERVICE <http://sparql.bioontology.org/ontologies/sparql/?apikey=KEY>

Try it at: http://www.ebi.ac.uk/rdf/services/biosamples/sparql

Exercise solution: see one of the examples on that page
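The SERVICE hint in the exercise above might be turned into a federated query along the following lines. This is only a sketch and not the solution published on the endpoint page: the function name `build_subclass_sample_query` is hypothetical, the remote pattern asks for direct subclasses only (whether the remote service supports the `*` path operator is not stated in these slides), and the local part simply reuses the triple patterns of the organism query on this slide.

```python
import re
from urllib.parse import quote

# SERVICE address as given in the hint; KEY must be replaced by a real API key.
BIOPORTAL_SPARQL = "http://sparql.bioontology.org/ontologies/sparql/"


def build_subclass_sample_query(root_class_iri: str, api_key: str) -> str:
    """Build a federated query: subclasses from the remote service, samples locally."""
    # Refuse characters that cannot appear inside <...> in a SPARQL query.
    if re.search(r'[<>"{}|\\^`\s]', root_class_iri):
        raise ValueError(f"not a usable IRI: {root_class_iri!r}")
    service_iri = f"{BIOPORTAL_SPARQL}?apikey={quote(api_key, safe='')}"
    return f"""\
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX biosd-terms: <http://rdf.ebi.ac.uk/terms/biosd/>

SELECT DISTINCT ?smp ?pvLabel ?cls
WHERE {{
  # Step 1: ask the remote service for direct subclasses of the root class
  SERVICE <{service_iri}> {{
    ?cls rdfs:subClassOf <{root_class_iri}>.
  }}

  # Step 2: find samples whose attribute type is an instance of such a class
  ?pvType a ?cls.
  ?pv biosd-terms:has-bio-characteristic-type ?pvType;
      rdfs:label ?pvLabel.
  ?smp biosd-terms:has-bio-characteristic ?pv.
}}
"""
```

Calling `build_subclass_sample_query("http://purl.obolibrary.org/obo/CHEBI_30879", "KEY")` gives a query string that can be pasted into the BioSD endpoint form once KEY is replaced.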

Page 13: BioSamples Database Linked Data, SWAT4LS Tutorial

Geo-located Samples/Sample Groups

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX biosd-terms: <http://rdf.ebi.ac.uk/terms/biosd/>
PREFIX sio: <http://semanticscience.org/resource/>

SELECT DISTINCT ?item ?latVal ?longVal
WHERE {
  ?item biosd-terms:has-bio-characteristic ?latPv, ?longPv.

  ?latPv biosd-terms:has-bio-characteristic-type [ rdfs:label ?latLabel ];
         sio:SIO_000300 ?latVal. # sio:has value

  FILTER ( REGEX ( ?latLabel, "latitude", "i" ) ).

  ?longPv biosd-terms:has-bio-characteristic-type [ rdfs:label ?longLabel ];
          sio:SIO_000300 ?longVal. # sio:has value

  FILTER ( REGEX ( ?longLabel, "longitude", "i" ) ).
}

• Find all samples having an attribute of type temperature, with a numerical value and a unit specified. Hint: use sio:SIO_000221 (has unit), sio:SIO_000300 (has value)

• Find samples/groups annotated with intervals, which use the properties biosd-terms:has-low-value and has-high-value and optionally have a unit.

Try it at: http://www.ebi.ac.uk/rdf/services/biosamples/sparql

Exercise solutions: see the examples on that page
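Once the geo-location query on this slide has been run, its results can be post-processed for display on a map. The Python sketch below assumes the results arrive in the standard SPARQL JSON results format, with the variable names `item`, `latVal` and `longVal` used in the query, and converts them to a GeoJSON FeatureCollection. The function name `bindings_to_geojson` is hypothetical, and silently skipping unparsable or out-of-range coordinates is a choice made for this example.

```python
from typing import Any, Dict, List


def bindings_to_geojson(bindings: List[Dict[str, Dict[str, str]]]) -> Dict[str, Any]:
    """Convert SPARQL JSON result bindings into a GeoJSON FeatureCollection."""
    features = []
    for row in bindings:
        try:
            lat = float(row["latVal"]["value"])
            lon = float(row["longVal"]["value"])
        except (KeyError, TypeError, ValueError):
            continue  # missing or non-numeric coordinate: skip the row
        # Also rejects NaN, since every comparison with NaN is false.
        if not (-90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0):
            continue
        features.append(
            {
                "type": "Feature",
                # GeoJSON positions are ordered longitude, latitude.
                "geometry": {"type": "Point", "coordinates": [lon, lat]},
                "properties": {"item": row.get("item", {}).get("value")},
            }
        )
    return {"type": "FeatureCollection", "features": features}
```

The resulting dictionary can be serialised with `json.dumps` and loaded by most web mapping libraries.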

Page 14: BioSamples Database Linked Data, SWAT4LS Tutorial

Expressed Genes and Samples

• For http://purl.uniprot.org/uniprot/P04637 (P53 in Human)

• Find the EFO classes for which it is up-regulated in the Atlas (p-value < 1E-9)

• And show the Atlas expression value label

• Hints:

– Start from the example http://tinyurl.com/kvvhw6b

– Use the Atlas endpoint: http://www.ebi.ac.uk/rdf/services/atlas/sparql

• Find the samples having attributes that are instances of such EFO classes

• Which come from a repository other than 'ArrayExpress'

• Hints:

– Use SERVICE <http://www.ebi.ac.uk/rdf/services/biosamples/sparql> and a sub-query

– Search for property values linked to property types that are instances of the experimental factors (EFO classes) found via the Atlas

– Then link to the samples, the samples to the submissions, the submissions to the web records

● OR JUST HAVE A LOOK: http://tinyurl.com/ln3m7nv (will take a while...)

Page 15: BioSamples Database Linked Data, SWAT4LS Tutorial

Ideas for the Hackathon

• Refer to http://tinyurl.com/mo7wgye for details

• From geo-located samples (samples annotated with latitude/longitude) to Google Maps, e.g., by using Exhibit (http://www.simile-widgets.org/exhibit/)

• Take similar datasets (e.g., MAASTRO, Breast Cancer Data, your data), unify the schemas (e.g., using CONSTRUCT), define federated queries

• Use the Shape or OpenPHACTS validator to define sensible rules for BioSD and similar data-sets, e.g., must contain an organism, should have a treatment

• Design/build an App (or Web widget) that asks for eligibility criteria, i.e., pairs of attribute value/type, and translates them into a SPARQL query (or a more complex search based on SPARQL) to find samples

– Use common ontologies for auto-completion over property types

– Use string-based auto-completion for values

– Consider numerical values, intervals, units

– Do approximate matching, i.e., matching 8/10 of specified pairs is good.
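The approximate-matching idea in the last item could start from something as simple as the Python sketch below, which counts how many of the requested (attribute type, attribute value) pairs a sample satisfies and accepts it when at least 8 out of 10 match. Everything here is an assumption made for illustration: the names `normalise`, `count_matches` and `is_eligible`, the exact-string comparison after lowercasing, and doing the check client-side on attributes already fetched via SPARQL.

```python
from typing import Iterable, Tuple

Criterion = Tuple[str, str]  # (attribute type, attribute value)


def normalise(text: str) -> str:
    """Lowercase and collapse whitespace so 'Homo  Sapiens' equals 'homo sapiens'."""
    return " ".join(text.lower().split())


def count_matches(
    sample_attrs: Iterable[Criterion], criteria: Iterable[Criterion]
) -> Tuple[int, int]:
    """Return (number of criteria met by the sample, number of distinct criteria)."""
    have = {(normalise(t), normalise(v)) for t, v in sample_attrs}
    wanted = {(normalise(t), normalise(v)) for t, v in criteria}
    return len(wanted & have), len(wanted)


def is_eligible(
    sample_attrs: Iterable[Criterion],
    criteria: Iterable[Criterion],
    min_ratio: Tuple[int, int] = (8, 10),
) -> bool:
    """Accept a sample that meets at least min_ratio of the criteria (default 8/10)."""
    matched, total = count_matches(sample_attrs, criteria)
    numerator, denominator = min_ratio
    # Integer arithmetic avoids floating-point rounding at the threshold.
    return total > 0 and matched * denominator >= numerator * total
```

Numerical values, intervals and units would need a comparison that is richer than string equality, for example by parsing both sides into typed values first.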

Page 16: BioSamples Database Linked Data, SWAT4LS Tutorial

Acknowledgements

• BioSD Team - Alvis Brazma, Tony Burdett, Adam Faulconbridge, Mike Gostev, Helen Parkinson, Rui Perreria, Ugis Sarkans, Drashtti Vasant

• Tony Burdett for the help with Zooma

• Simon Jupp, Andy Jenkinson, James Malone, for their great help with developing and setting up BioSD/RDF

– The rest of the Linked Data team @EBI (http://www.ebi.ac.uk/rdf)

• BiomedBridges FP7 project (http://www.biomedbridges.eu), for funding us

Page 17: BioSamples Database Linked Data, SWAT4LS Tutorial

And you all!

Sorry, we have 2.7M samples, but not all of them... (Source: http://en.wikipedia.org/wiki/File:Assorted_computer_mice_-_MfK_Bern.jpg)

Contact info:

www.ebi.ac.uk/biosamples

www.marcobrandizi.info

Page 18: BioSamples Database Linked Data, SWAT4LS Tutorial

Extras

Page 19: BioSamples Database Linked Data, SWAT4LS Tutorial

Main Ontologies used in BioSD / Linked Data

• biosd-terms (http://tiny.cc/biosd_terms)

– a small application ontology defining specific classes and properties, e.g., sample, sample group, has-knowledgeable-person

• Experimental Factor Ontology (EFO)

– mainly to define/annotate sample attributes

• Ontology for Biomedical Investigations (OBI)

• Information Artifact Ontology (IAO)

• Semanticscience Integrated Ontology (SIO)

– to define main classes in BioSD/RDF

• Bibliographic Ontology (BIBO)

– We link publications about submissions/sample sets

• Dublin Core, schema.org, FOAF

– for general categories and in the Linked Data spirit

• Linked automatically by Zooma: many more (e.g., CHEBI, NCBI-Tax, GO)

Page 20: BioSamples Database Linked Data, SWAT4LS Tutorial

BioSD → RDF Conversion

github.com/EBIBioSamples/biosd2rdf
