fédération de données et de ConnaissancEs Distribuées en ...– Increasing amount / number of...

28
http://credible.i3s.unice.fr MI CNRS – bilan 2014 – Paris, 22/1/2015 1 fédération de données et de ConnaissancEs Distribuées en Imagerie BiomédicaLE Data fusion, semantic alignment, distributed queries Johan Montagnat CNRS, I3S lab, Modalis team on behalf of the CrEDIBLE consortium CNRS/UNS, laboratoire I3S (UMR7271), équipe MODALIS INRIA/CNRS/UNS, laboratoire I3S (UMR 7271), équipe Wimmics INSERM U1099, laboratoire LTSI, équipe MediCIS U. Picardie, laboratoire MIS, équipe Connaissances

Transcript of fédération de données et de ConnaissancEs Distribuées en ...– Increasing amount / number of...

Page 1: fédération de données et de ConnaissancEs Distribuées en ...– Increasing amount / number of (open) sources – Big Data Large-scale medical studies (statistical medical studies,

http://credible.i3s.unice.fr MI CNRS – bilan 2014 – Paris, 22/1/2015 1

fédération de données et de ConnaissancEs Distribuées en Imagerie BiomédicaLE

Data fusion, semantic alignment, distributed queries

Johan MontagnatCNRS, I3S lab, Modalis team

on behalf of the CrEDIBLE consortiumCNRS/UNS, laboratoire I3S (UMR7271), équipe MODALIS INRIA/CNRS/UNS, laboratoire I3S (UMR 7271), équipe WimmicsINSERM U1099, laboratoire LTSI, équipe MediCISU. Picardie, laboratoire MIS, équipe Connaissances

Page 2: fédération de données et de ConnaissancEs Distribuées en ...– Increasing amount / number of (open) sources – Big Data Large-scale medical studies (statistical medical studies,

http://credible.i3s.unice.fr MI CNRS – bilan 2014 – Paris, 22/1/2015 2

Motivations● Biomedical data

– Increasing amount / number of (open) sources – Big Data● Large-scale medical studies (statistical medical studies,

epidemiology...)

– High heterogeneity: images, clinical data,biomarkers, biology.... – Variety

– Need for cross-factors analysis – Linked Data● Data (re)analysis opportunities● Multicentric studies, translational research

● Centralized approaches encounter limitations– Multiple data source kinds– Large data volumes to transfer / archive / search– Sensitive patient data / complex access control policies– Need to adopt uniform data model & format

● Data is de facto distributed over acquisition centers

Page 3: fédération de données et de ConnaissancEs Distribuées en ...– Increasing amount / number of (open) sources – Big Data Large-scale medical studies (statistical medical studies,

http://credible.i3s.unice.fr MI CNRS – bilan 2014 – Paris, 22/1/2015 3

Biomedical data integration– Data federation through distributed querying and heterogeneity

management

– Heterogeneous databases● data models alignment● multiple query interfaces

Client

Federator(query decomposition, planning & results federation)

Mediator(ETL or query rewriting)

Mediator(ETL or query rewriting)

Site 1 Site 2

Query-based access to data

Remote sub-queries

Page 4: fédération de données et de ConnaissancEs Distribuées en ...– Increasing amount / number of (open) sources – Big Data Large-scale medical studies (statistical medical studies,

http://credible.i3s.unice.fr MI CNRS – bilan 2014 – Paris, 22/1/2015 4

Domain ontology-based federation

Client

Federator

Mediator MediatorSite 1 Site 2

Query-based access to data

Remote sub-queries

Medical domain ontology(reference model)

Data querying

Dataalignment

Page 5: fédération de données et de ConnaissancEs Distribuées en ...– Increasing amount / number of (open) sources – Big Data Large-scale medical studies (statistical medical studies,

http://credible.i3s.unice.fr MI CNRS – bilan 2014 – Paris, 22/1/2015 5

Challenges and expertise

● Challenges– Representation of data semantics for heterogeneous

data sources → 1. biomedical ontology building– Data federation → 2. distributed query engine– Data mediation → 3. from R2RML to xR2RML

● PartnershipI3S/ModalisINRIA/I3S/WimmicsLTSI/MediCISMIS/ConnaissancesSemantic reference

Semantic representation

Distributed query engineMedical applications

Page 6: fédération de données et de ConnaissancEs Distribuées en ...– Increasing amount / number of (open) sources – Big Data Large-scale medical studies (statistical medical studies,

http://credible.i3s.unice.fr MI CNRS – bilan 2014 – Paris, 22/1/2015 6

● Study of existing ontologies– Foundational: BFO, DOLCE– Measurement and information: OBI, IAO, PATO– Medicine: SNOMED, ICD, NCIT– Medical imaging: RadLex, AIM, OME, OntoNeuroLog,

QIBO...

1. Biomedical ontologies

Reality Images Biomarkers Facts, plans

Acquisition Processing Decision

Human oranimal subject

MRI, CT scan,PET image...

Anatomical structurevolume, bloodflow,lesion load...

Diagnosis, responseto treatment, therapyplanning

Imaging data context

Page 7: fédération de données et de ConnaissancEs Distribuées en ...– Increasing amount / number of (open) sources – Big Data Large-scale medical studies (statistical medical studies,

http://credible.i3s.unice.fr MI CNRS – bilan 2014 – Paris, 22/1/2015 7

● Domain modeling focus– Radiological data– Imaging biomarkers– Brain function clinical assessment tools

● Neuroinformatics● Various projects on neurodegenerative studies and

cognitive functions

– Electronic Health Records● GINSENG epidemiology and sentinel network

– Medical image simulation (acquisition & analysis)● Virtual Imaging Platform project● Link between I/O data sets and image processing tools

● Diffusion: http://bioportal.bioontology.org/ontologies

Biomedical ontologies

Page 8: fédération de données et de ConnaissancEs Distribuées en ...– Increasing amount / number of (open) sources – Big Data Large-scale medical studies (statistical medical studies,

MI CNRS – bilan 2014 – Paris, 22/1/2015 8

Data traceability● Ontology

– Concepts& Rules

● Annotations

● Processing

DataSet DataSetProcessing

T1-MRI T2-MRI fMRI ... Segmentation Registration ...

A → B

Page 9: fédération de données et de ConnaissancEs Distribuées en ...– Increasing amount / number of (open) sources – Big Data Large-scale medical studies (statistical medical studies,

MI CNRS – bilan 2014 – Paris, 22/1/2015 9

Data traceability● Ontology

– Concepts& Rules

● Annotations

● Processing

DataSet DataSetProcessing

T1-MRI T2-MRI fMRI ... Segmentation Registration ...

Img1 IsA T1-MRI Img2 IsA T1-MRIA → B

Page 10: fédération de données et de ConnaissancEs Distribuées en ...– Increasing amount / number of (open) sources – Big Data Large-scale medical studies (statistical medical studies,

MI CNRS – bilan 2014 – Paris, 22/1/2015 10

Data traceability● Ontology

– Concepts& Rules

● Annotations

● Processing

DataSet DataSetProcessing

T1-MRI T2-MRI fMRI ... Segmentation Registration ...

Img1 IsA T1-MRI

Img2 IsA T1-MRI

A → BTool1 HasInput T1-MRI

Tool1 IsA Registration

Tool1 HasOutput Transfo

Page 11: fédération de données et de ConnaissancEs Distribuées en ...– Increasing amount / number of (open) sources – Big Data Large-scale medical studies (statistical medical studies,

MI CNRS – bilan 2014 – Paris, 22/1/2015 11

Data traceability● Ontology

– Concepts& Rules

● Annotations

● Processing

DataSet DataSetProcessing

T1-MRI T2-MRI fMRI ... Segmentation Registration ...

Img1 IsA T1-MRI

Img2 IsA T1-MRI

A → B

Tool1 HasInput T1-MRI

Tool1 IsA Registration

Tool1 HasOutput Transfo

Img1 IsProcessedBy Tool1

Img2 IsProcessedBy Tool1

Tool1 Produced Transfo1

Transfo1 IsA GlobalTransfo

Page 12: fédération de données et de ConnaissancEs Distribuées en ...– Increasing amount / number of (open) sources – Big Data Large-scale medical studies (statistical medical studies,

MI CNRS – bilan 2014 – Paris, 22/1/2015 12

Provenance summarization– Fine-grained annotation traces generated at run-time– Summary generated by inference rules application

● Produce relevant and human-tractable experiment summaries

Page 13: fédération de données et de ConnaissancEs Distribuées en ...– Increasing amount / number of (open) sources – Big Data Large-scale medical studies (statistical medical studies,

http://credible.i3s.unice.fr MI CNRS – bilan 2014 – Paris, 22/1/2015 13

2. Data query and federation engine● KGRAM (Knowledge Graph Abstract Machine) Semantic

query engine:– Full support of SPARQL1.1– Generic interface for heterogeneous backends– Flexible architecture facilitating different deployment scenarios

● Mediation interface to access relational data– Federated relational schema derived from the ontology

Page 14: fédération de données et de ConnaissancEs Distribuées en ...– Increasing amount / number of (open) sources – Big Data Large-scale medical studies (statistical medical studies,

http://credible.i3s.unice.fr MI CNRS – bilan 2014 – Paris, 22/1/2015 14

Distributed Query Processing● Query federator decoupled from data sources● Asynchronous querying of multiple data sources● Query planning and parallel querying

Page 15: fédération de données et de ConnaissancEs Distribuées en ...– Increasing amount / number of (open) sources – Big Data Large-scale medical studies (statistical medical studies,

http://credible.i3s.unice.fr MI CNRS – bilan 2014 – Paris, 22/1/2015 15

3. Access to non-RDF data sources

● Deep Web– Unindexed web data

● Relational databases– R2RML mapping language:

relational schema → RDF● Non-relational databases

– Various types of DBs: RBS, Native XML DB, NoSQL, LDAP directory, OODB...

– xR2RML exension to R2RML to address most NoSQL and XML databases

Page 16: fédération de données et de ConnaissancEs Distribuées en ...– Increasing amount / number of (open) sources – Big Data Large-scale medical studies (statistical medical studies,

http://credible.i3s.unice.fr MI CNRS – bilan 2014 – Paris, 22/1/2015 16

R2RML● Simple mapping language

– Generate RDF triples from values in relational tablesR2RML mapping graph:

Produced RDF:

“AlzheimerStudy”;

Id Name address

4 Hospital X ...

Centre

Page 17: fédération de données et de ConnaissancEs Distribuées en ...– Increasing amount / number of (open) sources – Big Data Large-scale medical studies (statistical medical studies,

http://credible.i3s.unice.fr MI CNRS – bilan 2014 – Paris, 22/1/2015 17

R2RML● Simple mapping language

– Generate RDF triples from values in relational tables

Id Acronym CenterId

10 AlzheimerStudy 4

Study

Centre

R2RML mapping graph:

Produced RDF:

“AlzheimerStudy”;

Id Name address

4 Hospital X ...

Page 18: fédération de données et de ConnaissancEs Distribuées en ...– Increasing amount / number of (open) sources – Big Data Large-scale medical studies (statistical medical studies,

http://credible.i3s.unice.fr MI CNRS – bilan 2014 – Paris, 22/1/2015 18

R2RML● Simple mapping language

– Generate RDF triples from values in relational tables

Id Acronym CenterId

10 AlzheimerStudy 4

Study

Centre

R2RML mapping graph:

Produced RDF:

“AlzheimerStudy”;

Id Name address

4 Hospital X ...

Page 19: fédération de données et de ConnaissancEs Distribuées en ...– Increasing amount / number of (open) sources – Big Data Large-scale medical studies (statistical medical studies,

http://credible.i3s.unice.fr MI CNRS – bilan 2014 – Paris, 22/1/2015 19

R2RML● Simple mapping language

– Generate RDF triples from relational table cells

Id Acronym CenterId

10 AlzheimerStudy 4

Study

Centre

R2RML mapping graph:

Produced RDF:

“AlzheimerStudy”;

Foreign Key

Id Name address

4 Hospital X ...

Page 20: fédération de données et de ConnaissancEs Distribuées en ...– Increasing amount / number of (open) sources – Big Data Large-scale medical studies (statistical medical studies,

http://credible.i3s.unice.fr MI CNRS – bilan 2014 – Paris, 22/1/2015 20

xR2RML extension to R2RML

● Leverages R2RML, backward compatible● Uniform language to describe mappings from

most common databases to RDF● Defines extensions to:

– Allow various query languages and protocols● SQL, NoSQL DBs, HTTP GET query string, SPARQL...

– Various data models● Row-based: RDB, column store, SPARQL result set● Tree-like: collections, key-value associations (JSON, XML,

Object, LDAP...)

– Generate RDF lists and containers from structured values

Page 21: fédération de données et de ConnaissancEs Distribuées en ...– Increasing amount / number of (open) sources – Big Data Large-scale medical studies (statistical medical studies,

http://credible.i3s.unice.fr MI CNRS – bilan 2014 – Paris, 22/1/2015 21

Logical source and data references

xR2RM L m app ing<#Centre> xrr:logicalSource [ xrr:query "/centres/centre";

];

XM L d atab ase , XQ u ery<centres> <centre @Id="4"> <name>Hostital X</name> </centre> <centre @Id=“6"> <name>Pontchaillou</name> </centre> ...</centres>

rr: R2RML vocabularyxrr: xR2RML vocabulary

Page 22: fédération de données et de ConnaissancEs Distribuées en ...– Increasing amount / number of (open) sources – Big Data Large-scale medical studies (statistical medical studies,

http://credible.i3s.unice.fr MI CNRS – bilan 2014 – Paris, 22/1/2015 22

Logical source

xR 2 R M L m app in g<#Centre> xrr:logicalSource [ xrr:query "/centres/centre";

]; rr:subjectMap [ rr:class st:centre; rr:template "http://example.org/centre#{//centre/@Id}"; ]; rr:predicateObjectMap [ rr:predicate lang:has-for-name; rr:objectMap [ xrr:reference “//centre/name" ]; ];

XML database, XQuery<centres>

<centre @Id="4">

<name>Hostital X</name>

</centre>

<centre @Id=“6">

<name>Pontchaillou</name>

</centre>

...

</centres>

rr: R2RML vocabulary

xrr: xR2RML vocabulary

Page 23: fédération de données et de ConnaissancEs Distribuées en ...– Increasing amount / number of (open) sources – Big Data Large-scale medical studies (statistical medical studies,

http://credible.i3s.unice.fr MI CNRS – bilan 2014 – Paris, 22/1/2015 23

xR 2 R M L m app in g<#Centre> xrr:logicalSource [ xrr:query "/centres/centre"; xrr:format xrr:XML;

]; rr:subjectMap [ rr:class st:centre; rr:template "http://example.org/centre#{//centre/@Id}"; ]; rr:predicateObjectMap [ rr:predicate lang:has-for-name; rr:objectMap [ xrr:reference “//centre/name" ]; ];

XML database, XQuery<centres>

<centre @Id="4">

<name>Hostital X</name>

</centre>

<centre @Id=“6">

<name>Pontchaillou</name>

</centre>

...

</centres>

rr: R2RML vocabulary

xrr: xR2RML vocabulary

Page 24: fédération de données et de ConnaissancEs Distribuées en ...– Increasing amount / number of (open) sources – Big Data Large-scale medical studies (statistical medical studies,

http://credible.i3s.unice.fr MI CNRS – bilan 2014 – Paris, 22/1/2015 24

Embedded content of mixed format

xR 2 R M L m app in g<#Centre> xrr:logicalSource [ xrr:sourceName “STAFF"; ]; ... rr:predicateObjectMap [ rr:predicate lang:fist-name; rr:objectMap [ xrr:reference “Column(Name)/JSONPath($.FirstName)" ]; ];

D ata w ith m ixed -fo rm at co ntent

Re latona l tab le “STAFF”, co lum n “N am e” co nta in s JSO N d ata:

... Name ...

... { “FirstName”: “Bob”, “LastName: “Smith” }

...

xrr:format Syntax path constructor

xrr:Row Column(), CSV(), TSV()

xrr:XML XPath()

xrr:JSON JSONPath()

... ...

Page 25: fédération de données et de ConnaissancEs Distribuées en ...– Increasing amount / number of (open) sources – Big Data Large-scale medical studies (statistical medical studies,

http://credible.i3s.unice.fr MI CNRS – bilan 2014 – Paris, 22/1/2015 25

Conclusions● Contribution to heterogeneous medical data

sources integration– Data models to tackle data complexity– Query-based distributed data federation– Heterogeneous data models integration

● Scientific networking– Third annual multidisciplinary workshop (gathering experts

in biomedical data representation, semantic, distribution, federation, integration...)

● Contribution to other projects– ANR GINSENG: epidemiology, sentinel network– ANR VIP: medical image simulation and analysis

Page 26: fédération de données et de ConnaissancEs Distribuées en ...– Increasing amount / number of (open) sources – Big Data Large-scale medical studies (statistical medical studies,

http://credible.i3s.unice.fr MI CNRS – bilan 2014 – Paris, 22/1/2015 26

Perspectives● Distributed query engine optimization

– Improved query planning and sources selection– Parallel query kernel– Flexible software architecture

● Heterogeneous data sources mediation– xR2RML mapping tool implementation– Exploitation in multiple projects beyond medical images

(astronomical data catalogs, humanities...)● Joint effort with SeqPhéHD

– Heterogeneous databases management– Life sciences applications (multi-scale data sets)– Data-driven analysis workflows

Page 27: fédération de données et de ConnaissancEs Distribuées en ...– Increasing amount / number of (open) sources – Big Data Large-scale medical studies (statistical medical studies,

http://credible.i3s.unice.fr MI CNRS – bilan 2014 – Paris, 22/1/2015 27

MédiationAnalyse de données et workflows

Programmes pour le séquençage haut débit

Indexation de données génomiques

Recommendation et recherche de données complexe

Masses de données scientifiques(Scientific Big Data)

génome phénome médical

•Volumineuses•Complexes•Hétérogènes

Infrastructures de traitementCloud, grid, P2P, HPC

Accès,sélecton Traitement

Page 28: fédération de données et de ConnaissancEs Distribuées en ...– Increasing amount / number of (open) sources – Big Data Large-scale medical studies (statistical medical studies,

http://credible.i3s.unice.fr MI CNRS – bilan 2014 – Paris, 22/1/2015 28

Publications● Peer-reviewed

– B. Gibaud et al, “OntoVIP: An ontology for the annotation of object models used for medical image”, JBI 2014.

– B. Batrancourt et al, “ A multilayer ontology of instruments for neurological, behavioral and cognitive assessments”, Neuroinformatics, September 2014.

– A. Gaignard et al. “Domain-specific summarisation of Life-Science e-experiments from provenance traces”, JoWS, July 2014.

– B. Gibaud et al, “Ontology and Imaging Informatics” NCOR workshop, Buffalo, USA, June 2014.

– S. Cipière et al. “Global Initiative for Sentinel e-Health Network on Grid”, CCGrid-Health workshop, Chicago, USA, May 2014.

– G. Kassel et al, “Ontologie des entités fictives et virtuelles : préliminaires”, JFO, Hammamet, Tunisie, November 2014.

● Research reports– Workshop summary: #14-2– Annual report: #14-1– Bibliography study: #13-2-v2