Semantic Web for Life Science Data Sharing and Integration

Semantic Web for Life Science Data Sharing and Integration Kei Cheung, PhD Yale Center for Medical Informatics CBB752, October 1, 2012, Yale University

Page 1: Semantic Web for Life Science Data Sharing and Integration

Semantic Web for Life Science Data Sharing and Integration

Kei Cheung, PhD

Yale Center for Medical Informatics

CBB752, October 1, 2012, Yale University

Page 2: Semantic Web for Life Science Data Sharing and Integration

in Science

(Nature Special Issue, September 2008)

Page 3: Semantic Web for Life Science Data Sharing and Integration

Big Data in Life Sciences

• The Human Genome Project (HGP) has transformed genome sciences from being experimental to being increasingly computational

• HGP has intensified the growth of computational biology and bioinformatics (CBB)

• 1000 Genomes has further intensified the growth of CBB

• The Web has become a popular medium for accessing information over the Internet

• Numerous bioinformatics databases and tools are Web accessible

• These databases and tools as well as the Web have become indispensable for modern-day biomedical research

Page 4: Semantic Web for Life Science Data Sharing and Integration
Page 5: Semantic Web for Life Science Data Sharing and Integration

Can Google answer every question?!

Page 6: Semantic Web for Life Science Data Sharing and Integration

Google Then

Find the most recent image of the person “Kei Hoi Cheung”

Kei (Hoi) Cheung (20 years ago)

Kei (Hoi) Cheung (more recent)

Kei (Hui) Cheung: Not me!

I’m NOT a company!

Page 7: Semantic Web for Life Science Data Sharing and Integration

Google Now

Page 8: Semantic Web for Life Science Data Sharing and Integration

Schema.org

• It is a community effort providing a standard vocabulary for marking up structured data in HTML

• Search engines such as Google, Yahoo, and Bing have started to incorporate schema.org into their semantic search strategies

• Use microdata/microformat/RDFa to embed schema.org terms within HTML documents

Page 9: Semantic Web for Life Science Data Sharing and Integration

RDFa Example

<div vocab="http://schema.org/" typeof="Product">
  <img property="image" src="dell-30in-lcd.jpg" />
  <span property="name">Dell UltraSharp 30" LCD Monitor</span>
  <div property="aggregateRating" typeof="AggregateRating">
    <span property="ratingValue">87</span> out of
    <span property="bestRating">100</span> based on
    <span property="ratingCount">24</span> user ratings
  </div>
  …
</div>
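To illustrate what a machine can read out of such markup, here is a minimal Python sketch that collects the `property` values from a shortened copy of the snippet. It uses only the standard-library `html.parser` and handles just `<span property="...">` elements, so it is a simplification for illustration and not a real RDFa processor. The class name `PropertyCollector` is made up for this example.

```python
from html.parser import HTMLParser

# Shortened copy of the RDFa example (image element omitted).
RDFA_SNIPPET = """
<div vocab="http://schema.org/" typeof="Product">
  <span property="name">Dell UltraSharp 30" LCD Monitor</span>
  <div property="aggregateRating" typeof="AggregateRating">
    <span property="ratingValue">87</span> out of
    <span property="bestRating">100</span> based on
    <span property="ratingCount">24</span> user ratings
  </div>
</div>
"""


class PropertyCollector(HTMLParser):
    """Collect (property, text) pairs from <span property="..."> elements only."""

    def __init__(self):
        super().__init__()
        self.pairs = []
        self._current = None  # property of the span currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            self._current = dict(attrs).get("property")

    def handle_data(self, data):
        # Text outside a property-bearing span (e.g. "out of") is ignored.
        if self._current is not None:
            self.pairs.append((self._current, data.strip()))

    def handle_endtag(self, tag):
        if tag == "span":
            self._current = None


collector = PropertyCollector()
collector.feed(RDFA_SNIPPET)
properties = dict(collector.pairs)
print(properties)
```

The point of the sketch is that the rating value and count become named fields instead of free text that a search engine would have to guess at.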

Page 10: Semantic Web for Life Science Data Sharing and Integration

Problems of the Current Web

• Lack of use of global identifiers

• Lack of links and link semantics

• Lack of data semantics

Page 11: Semantic Web for Life Science Data Sharing and Integration

Lack of Use of Global Identifiers

• For example, each major protein database (UniProt, Ensembl, RefSeq, etc.) assigns identifiers to protein sequences according to its own internal guidelines and identifier patterns.

• These databases may be in flux, as identifiers and sequences can change due to new data or predictive algorithms.

Page 12: Semantic Web for Life Science Data Sharing and Integration

Lack of Links and Link Semantics

collaborators

Page 13: Semantic Web for Life Science Data Sharing and Integration

Lack of Data Semantics

Page 14: Semantic Web for Life Science Data Sharing and Integration

Transforming Web 1.0 into Web 2.0 & Web 3.0

[Figure: Web 1.0 evolving into Web 2.0 and Web 3.0]

Page 15: Semantic Web for Life Science Data Sharing and Integration

Web 2.0

• It changes the way people communicate and share artifacts on the web (e.g., Flickr, YouTube, Facebook)

• Wiki, blog, RSS, folksonomy (social tagging)

• Multimedia rich (songs, images, videos, etc.)

• Dynamic, interactive, responsive user interface (Ajax)

• XML-based data exchange format

Page 16: Semantic Web for Life Science Data Sharing and Integration

Mashup

• Mashup (music), the musical genre encompassing songs which consist entirely of parts of other songs

• Mashup (video), a video that is edited from more than one source to appear as one

• Mashup (digital), a digital media file containing any or all of text, graphics, audio, video, and animation, which recombines and modifies existing digital works to create a derivative work.

• Mashup (web application hybrid), a web application that combines data and/or functionality from more than one source

Page 17: Semantic Web for Life Science Data Sharing and Integration

XML (KML)-Based Geo Data Mashup

[Figure: data files at sites 1 through n, each paired with a corresponding KML file (KML file 1 through KML file n)]

Page 18: Semantic Web for Life Science Data Sharing and Integration

NCI’s State Cancer Profiles (http://statecancerprofiles.cancer.gov)

Use of Yahoo! Pipes to convert tabular data into KML format
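As a rough illustration of the tabular-to-KML step, the following Python sketch builds a KML document from a small table using the standard-library `xml.etree.ElementTree`. This is not how Yahoo! Pipes works internally; it is a stand-in showing the same kind of conversion. The rows, region names, and values are made-up placeholders, and `rows_to_kml` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

KML_NS = "http://www.opengis.net/kml/2.2"

# Made-up placeholder rows: (label, longitude, latitude, value).
# These are NOT real cancer statistics.
rows = [
    ("Region A", -72.9, 41.3, 10.0),
    ("Region B", -71.1, 42.4, 12.5),
]


def rows_to_kml(rows):
    """Turn each table row into one KML Placemark and return the XML text."""
    ET.register_namespace("", KML_NS)  # serialize KML as the default namespace
    kml = ET.Element(f"{{{KML_NS}}}kml")
    doc = ET.SubElement(kml, f"{{{KML_NS}}}Document")
    for label, lon, lat, value in rows:
        placemark = ET.SubElement(doc, f"{{{KML_NS}}}Placemark")
        ET.SubElement(placemark, f"{{{KML_NS}}}name").text = label
        ET.SubElement(placemark, f"{{{KML_NS}}}description").text = f"value: {value}"
        point = ET.SubElement(placemark, f"{{{KML_NS}}}Point")
        # KML coordinates are written longitude first, then latitude.
        ET.SubElement(point, f"{{{KML_NS}}}coordinates").text = f"{lon},{lat}"
    return ET.tostring(kml, encoding="unicode")


kml_text = rows_to_kml(rows)
print(kml_text)
```

Once each site publishes its table in this shared format, a map viewer can overlay the files without knowing anything about the original table layouts.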

Page 19: Semantic Web for Life Science Data Sharing and Integration

GeoCommons: Mashup of Maps

Page 20: Semantic Web for Life Science Data Sharing and Integration

Web 3.0: Semantic Web

• The Semantic Web provides a common machine-readable framework that allows data to be shared and reused across application, enterprise, and community boundaries

– The Semantic Web is a knowledge web of data

• The Semantic Web is about two things

– It is about common formats for identification, representation, and integration of data drawn from diverse sources

– It is also about languages for recording how the data relates to real world objects

Page 21: Semantic Web for Life Science Data Sharing and Integration

Layers of the Semantic Web

Page 22: Semantic Web for Life Science Data Sharing and Integration

Semantic Web = Brilliant Web

Page 23: Semantic Web for Life Science Data Sharing and Integration

Knowledge Web Data Integration

Page 24: Semantic Web for Life Science Data Sharing and Integration

Web 3.0: Semantic Web (Cont’d)

• Global identifying scheme (URI)

• Standard data modeling languages (RDF, RDFS, OWL)

• Standard query languages (SPARQL)

• Enabling tools/technologies (e.g., Protégé, Jena, triplestore, etc)

Page 25: Semantic Web for Life Science Data Sharing and Integration

Resource Description Framework (RDF)

• It is a standard data model (directed labeled graph) for representing information (metadata) about resources in the World Wide Web

• In general, it can be used to represent information about “things” or “resources” that can be identified (using URIs) on the Web

• It is intended to provide a simple way to make statements (descriptions) about Web resources

Page 26: Semantic Web for Life Science Data Sharing and Integration

Uniform Resource Identifiers (URIs)

• A URI is a string of characters used to identify or name a resource on the Internet.

• URLs (Uniform Resource Locators) are a particular type of URI, used for resources that can be accessed on the WWW (e.g., web pages)

• In RDF, URIs typically look like “normal” URLs, often with fragment identifiers to point at specific parts of a document:

– http://www.semantic-systems-biology.org/SSB#CCO_B0000000 (id for “core cell cycle protein” in Cell Cycle Ontology)
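A short Python sketch of how such a URI splits into a namespace and a fragment identifier, using the standard-library `urllib.parse.urldefrag`. The `PREFIXES` table and the `expand` helper are assumptions added for illustration; they mimic the prefix shorthand (for example `ssb:CCO_B0000000`) used in RDF and SPARQL.

```python
from urllib.parse import urldefrag

uri = "http://www.semantic-systems-biology.org/SSB#CCO_B0000000"

# Split the URI into the document part and the fragment identifier.
namespace, local_name = urldefrag(uri)
print(namespace)   # http://www.semantic-systems-biology.org/SSB
print(local_name)  # CCO_B0000000

# Hypothetical prefix table, in the spirit of a PREFIX declaration.
PREFIXES = {"ssb": "http://www.semantic-systems-biology.org/SSB#"}


def expand(short_name):
    """Expand a prefixed name such as 'ssb:CCO_B0000000' into a full URI."""
    prefix, local = short_name.split(":", 1)
    return PREFIXES[prefix] + local


print(expand("ssb:CCO_B0000000") == uri)  # True
```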

Page 27: Semantic Web for Life Science Data Sharing and Integration

RDF Triple/Graph

• The basic information unit in RDF is an RDF statement in the form of

– (subject, property, object)

• Each RDF statement can be modeled as a graph comprising two nodes connected by a directed arc

• A triple example

• A set of such triples can jointly form a directed labeled graph (DLG) that can in theory model a significant part of domain knowledge.

• An RDF graph can be represented in different formats (XML, Turtle, N3…)
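The triple and graph ideas above can be sketched in plain Python, with each statement as a 3-tuple and a graph as a set of such tuples. The `ex:` names are hypothetical placeholders, not real URIs, and a production system would use an RDF library and a triple store instead.

```python
# Hypothetical example statements; the "ex:" names are placeholders.
triples = {
    ("ex:proteinA", "ex:interacts_with", "ex:proteinB"),
    ("ex:proteinA", "ex:located_in", "ex:nucleus"),
    ("ex:nucleus", "ex:part_of", "ex:cell"),
}


def outgoing(graph, subject):
    """Return the labeled arcs leaving one node as sorted (property, object) pairs."""
    return sorted((p, o) for s, p, o in graph if s == subject)


def nodes(graph):
    """Subjects and objects are the nodes; properties are the arc labels."""
    return {s for s, _, _ in graph} | {o for _, _, o in graph}


print(outgoing(triples, "ex:proteinA"))
print(len(nodes(triples)))
```

Because `ex:nucleus` is the object of one statement and the subject of another, the three statements join into a single connected graph, which is what lets triples from different sources link up.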

Page 28: Semantic Web for Life Science Data Sharing and Integration

Linking statements to aggregate data from multiple sources

[Figure: linked RDF statements; one arc is labeled “is a”]

Page 29: Semantic Web for Life Science Data Sharing and Integration

Linking statements to create knowledge

Page 30: Semantic Web for Life Science Data Sharing and Integration

Named Graph

[Figure: a graph about biordf:P05067 with arcs labeled “located in” and “interacts with”]

Meta Statement

biordf:P05067 (created by) foaf:kei_cheung

Page 31: Semantic Web for Life Science Data Sharing and Integration

Linked Data Cloud

Page 32: Semantic Web for Life Science Data Sharing and Integration

Cell Cycle Ontology (CCO) (Antezana et al, 2009, Genome Biology)

http://genomebiology.com/2009/10/5/R58

Page 33: Semantic Web for Life Science Data Sharing and Integration

RDF Graph Match (SPARQL)

BASE <http://www.semantic-systems-biology.org/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX ssb: <http://www.semantic-systems-biology.org/SSB#>
SELECT ?protein_label
WHERE {
  GRAPH <cco_S_pombe> {
    ?protein ssb:is_a ssb:CCO_B0000000 .
    ?protein rdfs:label ?protein_label
  }
}

core cell cycle protein
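To show what the graph match does, here is a minimal Python sketch that evaluates the same two triple patterns over an in-memory list of quads (graph, subject, property, object). It is not a SPARQL engine. The protein identifiers and labels are made-up placeholders, and `core_protein_labels` is a hypothetical helper name.

```python
IS_A = "ssb:is_a"
LABEL = "rdfs:label"
CORE = "ssb:CCO_B0000000"  # "core cell cycle protein"

# Hypothetical quads: (graph, subject, property, object). Protein names are made up.
quads = [
    ("cco_S_pombe", "ssb:protein1", IS_A, CORE),
    ("cco_S_pombe", "ssb:protein1", LABEL, "example protein 1"),
    ("cco_S_pombe", "ssb:protein2", LABEL, "example protein 2"),
    ("other_graph", "ssb:protein3", IS_A, CORE),
    ("other_graph", "ssb:protein3", LABEL, "example protein 3"),
]


def core_protein_labels(quads, graph):
    """Match both patterns inside one named graph, joined on ?protein."""
    in_graph = [(s, p, o) for g, s, p, o in quads if g == graph]
    # Pattern 1: ?protein ssb:is_a ssb:CCO_B0000000
    proteins = {s for s, p, o in in_graph if p == IS_A and o == CORE}
    # Pattern 2: ?protein rdfs:label ?protein_label (same ?protein binding)
    return sorted(o for s, p, o in in_graph if p == LABEL and s in proteins)


print(core_protein_labels(quads, "cco_S_pombe"))
```

Only `ssb:protein1` satisfies both patterns within the requested graph: `ssb:protein2` has a label but no `is_a` statement, and `ssb:protein3` lives in a different named graph.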

Page 34: Semantic Web for Life Science Data Sharing and Integration

RDF Schema (RDFS)

• RDF Schema terms:

– Class

– Property

– type

– subClassOf

– range

– domain

• Example:

<DNASequence, type, Class>
<Promoter, subClassOf, DNASequence>
<Protein, type, Class>
<TranscriptionFactor, subClassOf, Protein>
<Bind, type, Property>
<Bind, domain, TranscriptionFactor>
<Bind, range, Promoter>
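The schema in the example supports simple inference. The Python sketch below applies two RDFS-style rules to it: domain and range typing, then subClassOf propagation. The instance statement `(tf1, Bind, promoter1)` is a hypothetical addition, and the code assumes each class has at most one superclass and each property one domain and one range, which holds for this example but not for RDFS in general.

```python
schema = [
    ("DNASequence", "type", "Class"),
    ("Promoter", "subClassOf", "DNASequence"),
    ("Protein", "type", "Class"),
    ("TranscriptionFactor", "subClassOf", "Protein"),
    ("Bind", "type", "Property"),
    ("Bind", "domain", "TranscriptionFactor"),
    ("Bind", "range", "Promoter"),
]

# Hypothetical instance data, not from the slide.
data = [("tf1", "Bind", "promoter1")]


def infer_types(schema, data):
    """Derive type statements from domain, range, and subClassOf."""
    domain = {s: o for s, p, o in schema if p == "domain"}
    range_ = {s: o for s, p, o in schema if p == "range"}
    parent = {s: o for s, p, o in schema if p == "subClassOf"}  # single parent assumed

    inferred = set()
    for s, p, o in data:
        if p in domain:
            inferred.add((s, "type", domain[p]))  # subject gets the domain class
        if p in range_:
            inferred.add((o, "type", range_[p]))  # object gets the range class

    # Walk up the subClassOf chain for every inferred type.
    for s, _, cls in list(inferred):
        while cls in parent:
            cls = parent[cls]
            inferred.add((s, "type", cls))
    return inferred


for statement in sorted(infer_types(schema, data)):
    print(statement)
```

From a single Bind statement, the sketch concludes that `tf1` is a TranscriptionFactor and therefore a Protein, and that `promoter1` is a Promoter and therefore a DNASequence.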

Page 35: Semantic Web for Life Science Data Sharing and Integration

Relational table -> RDF -> RDFS ontology

Page 36: Semantic Web for Life Science Data Sharing and Integration

Web Ontology Language (OWL)

• It is more semantically expressive than RDF and RDFS, but it is syntactically the same as RDF

– Relationship constraints such as cardinality, sameAs, etc.

• It has three species: OWL Lite, OWL DL, OWL Full

Page 37: Semantic Web for Life Science Data Sharing and Integration

OWL DL Representation (Subsumption)

:Nucleus a owl:Class ;
    rdfs:subClassOf [
        a owl:Restriction ;
        owl:onProperty :part_of ;
        owl:someValuesFrom :Cell
    ] .

Necessary but not sufficient condition: every nucleus is part of some cell, but something that is part of a cell is not necessarily a nucleus

Page 38: Semantic Web for Life Science Data Sharing and Integration

OWL Reasoning

• Which proteins participate in “mitosis”?

:Protein a owl:Class ;
    rdfs:subClassOf [
        a owl:Restriction ;
        owl:onProperty :participates_in ;
        owl:someValuesFrom :Mitosis
    ] .

Page 39: Semantic Web for Life Science Data Sharing and Integration

Semantic Web Rule Language (SWRL = OWL + Rules)

hasParent(?x1,?x2) ∧ hasBrother(?x2,?x3) ⇒ hasUncle(?x1,?x3)
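The uncle rule can be read as a join on the shared variable ?x2. The Python sketch below applies it once to a tiny set of facts. The names are hypothetical placeholders, and a real SWRL engine would work over OWL individuals and properties; this only illustrates the rule's logic.

```python
# Hypothetical facts: (x, y) in has_parent means "x has parent y".
has_parent = {("alice", "bob"), ("dave", "erin")}
# (y, z) in has_brother means "y has brother z".
has_brother = {("bob", "carl")}


def infer_uncles(has_parent, has_brother):
    """hasParent(x1, x2) AND hasBrother(x2, x3) implies hasUncle(x1, x3)."""
    return {
        (x1, x3)
        for x1, x2 in has_parent
        for y2, x3 in has_brother
        if x2 == y2  # join on the shared variable ?x2
    }


print(infer_uncles(has_parent, has_brother))
```

Here `dave` gains no uncle because his parent `erin` has no recorded brother, so the rule body is not satisfied for him.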

Page 40: Semantic Web for Life Science Data Sharing and Integration

SWRL (cont’d)

Page 41: Semantic Web for Life Science Data Sharing and Integration

Biomedical ontologies available in RDF/OWL format

• UniProt

• Gene Ontology

• NCI Metathesaurus

• MGED Ontology

• Sequence Ontology

• Protein Ontology

• These and many more ontologies are available in ontology repositories such as the NCBO BioPortal (http://bioportal.bioontology.org/)

Page 42: Semantic Web for Life Science Data Sharing and Integration

SW Enabling Technologies

• Ontology editor (e.g., Protégé)

• Triple store (e.g., Virtuoso)

• OWL reasoner (e.g., Pellet)

• SWRL editor (e.g., Protégé plug-in)

Page 43: Semantic Web for Life Science Data Sharing and Integration

Collaborative Semantic Web Projects

• Protein motion (biosciences)

• Pseudogenes (biosciences)

• SenseLab (neurosciences)

• Pathways (translational medicine)

• Semantic Web for Health Care and Life Sciences Interest Group (chartered by W3C)

– BioRDF task force

Page 44: Semantic Web for Life Science Data Sharing and Integration

The End