Information Integration José Luis Ambite, Ph.D. Project Leader, Information Sciences Institute...

Information Integration

José Luis Ambite, Ph.D.

Project Leader, Information Sciences Institute

Research Assistant Professor, Computer Science

University of Southern California

Team

Information Integration Infrastructure: Jose Luis Ambite, Craig Knoblock, Maria Muslea, Gowri Kumaraguruparan (USC/ISI)

Domain Collaborators:

FBIRN: Naveen Ashish (UCI), Jessica Turner (MRN), Karl Helmer (MGH), Tim Olsen (WUSTL), Dingying Wei (UCI)

NHPRC: John Nylander, Dave Brink, Liz Moran (NHPRC)

CVRG: Naveen Ashish (UCI), Steve Granite (JHU)

NeuroDev: Dobyns, Paciorkowski (UW), Sherr (UCSF), …

UCI CTSI: Ashish, Keator (UCI), …

Security: Rachana Ananthakrishnan (UC), Laura Pearlman (USC/ISI)

Data Management: Robert Schuler, Ann Chervenak (USC/ISI)

Knowledge Engineering: Gully Burns (USC/ISI), Naveen Ashish (UCI), Jessica Turner (MRN)

User Interfaces: Naveen Ashish (UCI), Jose Luis Ambite, Pedro Szekely, Craig Rogers, Gowri Kumaraguruparan, Maria Muslea (USC/ISI)

Information Integration

• Problem: consistent view of heterogeneous, distributed data

• Challenges:
– Syntactic heterogeneity: formats, data models
– Semantic heterogeneity: names, structure, viewpoint
– Efficiency: query execution
– Scalability: ease of adding new sources

• Approaches:
– Warehouse/ETL
– Common-schema federation
– Virtual Integration/Mediator

• BIRN supports deep integration across complex data sources
– Heterogeneous sources: Relational, XML DBs, Web Services, HTML, files
– Structured queries
– Secure, Efficient Query Execution

[Figure: BIRN Mediator diagram. Labels: Decision Support; Application Programs, Workflows; Mediator; Knowledge Bases; Databases; Computer Programs; Web]

Information Mediator

• Virtual Integration Architecture:
– Virtual organization: providers and consumers sharing data for a specific purpose
– Autonomous sources: data and control remain at the sources; no changes to sources
– Mediator: define domain schema and describe source contents
  • Domain schema: view of the domain agreed upon by the virtual organization
  • Source descriptions: declarative logical formulas relating source and domain schemas

• Query Answering
– User writes query in domain schema
– Mediator:
  • Determines sources relevant to query
  • Rewrites query in the source schemas
  • Breaks query into sub-queries for sources
  • Optimizes query evaluation plan
  • Combines answers from sources

• Declarative: easy to add new sources
• EZ-config: automatic configuration for single-schema federations

[Figure: mediator architecture. Labels: User queries; Mediator (Domain Schema, Logical Source Descriptions, Reformulation, Optimizer, Execution Engine); Wrappers; Source schemas; Data Sources]

[VLDBJ 2005, Frontiers NeuroScience 2010, JAMIA 2011]
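The query-answering steps listed on this slide can be illustrated with a minimal Python sketch. It is not the BIRN Mediator's implementation: source descriptions are reduced here to plain attribute-name mappings (a simplification of the declarative logical formulas the slide mentions), the optimizer step is omitted, and every name in the block (`Source`, `answer`, `site_a`, and so on) is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Source:
    """A hypothetical data source plus its (simplified) source description."""
    name: str
    mapping: dict  # domain attribute -> attribute name in the source schema
    rows: list     # rows in the source's own schema (stand-in for a remote database)


def answer(query_attrs, predicate, sources):
    """Answer a query written against the domain schema."""
    results = []
    for src in sources:
        # Step 1: keep only sources whose description covers every queried attribute.
        if not all(attr in src.mapping for attr in query_attrs):
            continue
        for row in src.rows:
            # Steps 2-3: rewrite domain attributes to source attributes and
            # evaluate the sub-query against this source.
            domain_row = {attr: row[src.mapping[attr]] for attr in query_attrs}
            if predicate(domain_row):
                # Step 5: combine answers, recording where each came from.
                results.append({**domain_row, "source": src.name})
    return results


site_a = Source("site_a", {"subject_id": "subj", "age": "age_years"},
                [{"subj": "A1", "age_years": 61}, {"subj": "A2", "age_years": 40}])
site_b = Source("site_b", {"subject_id": "id", "age": "age"},
                [{"id": "B7", "age": 55}])
site_c = Source("site_c", {"subject_id": "pid"},  # no age attribute: not relevant
                [{"pid": "C3"}])

result = answer(["subject_id", "age"], lambda r: r["age"] > 50,
                [site_a, site_b, site_c])
```

Adding a source in this sketch means adding one more `Source` entry with its mapping; `answer` itself does not change, which is the point of the "declarative" bullet. A real mediator would also push selections down to the sources rather than filtering rows after retrieval.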

FBIRN Data Integration Use Case: HID and XNAT

• Sources:
– HID@MRN, HID@UCI: Human Imaging Database(s), Oracle DB
– XNAT: EXtensible Neuroimaging Archive Toolkit, Web service API

• User query: find all male patients over 50 with T1 scans

• Results integrated from XNAT and HID

[Figure: BIRN Mediator data flow. Labels: Domain query; Logical Source descriptions; SQL query; XML query; HID results; XNAT results (XML); Integrated results]

[Front. NeuroScience 2010]
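As a rough illustration of this use case, the sketch below runs the user query against one relational source and one XML source and merges the answers. Everything concrete in it is an assumption made for the example: an in-memory SQLite database stands in for the Oracle-based HID, a literal XML string stands in for an XNAT web-service response, and the table, column, element, and attribute names are invented rather than taken from the HID or XNAT schemas.

```python
import sqlite3
import xml.etree.ElementTree as ET


def query_hid(conn):
    """Sub-query for the relational source, expressed in SQL."""
    sql = """
        SELECT DISTINCT s.subject_id, s.age
        FROM subject s JOIN scan sc ON sc.subject_id = s.subject_id
        WHERE s.sex = 'M' AND s.age > 50 AND sc.scan_type = 'T1'
        ORDER BY s.subject_id
    """
    return [{"subject_id": sid, "age": age, "source": "HID"}
            for sid, age in conn.execute(sql)]


def query_xnat(xml_text):
    """Sub-query for the XML source: filter the parsed document."""
    matches = []
    for subj in ET.fromstring(xml_text).findall("subject"):
        age = int(subj.get("age"))
        has_t1 = any(scan.get("type") == "T1" for scan in subj.findall("scan"))
        if subj.get("gender") == "male" and age > 50 and has_t1:
            matches.append({"subject_id": subj.get("id"), "age": age, "source": "XNAT"})
    return matches


# Hypothetical stand-in for the HID database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE subject (subject_id TEXT, sex TEXT, age INTEGER);
    CREATE TABLE scan (subject_id TEXT, scan_type TEXT);
    INSERT INTO subject VALUES ('H1', 'M', 63), ('H2', 'F', 70), ('H3', 'M', 45);
    INSERT INTO scan VALUES ('H1', 'T1'), ('H2', 'T1'), ('H3', 'T1'), ('H1', 'T2');
""")

# Hypothetical stand-in for an XNAT XML response.
XNAT_XML = """<subjects>
  <subject id="X1" gender="male" age="58"><scan type="T1"/></subject>
  <subject id="X2" gender="male" age="66"><scan type="T2"/></subject>
</subjects>"""

# Integrated results in one common (domain) shape.
integrated = query_hid(conn) + query_xnat(XNAT_XML)
```

Note that the two sources encode sex differently (`'M'` versus `"male"`); reconciling such differences is what the logical source descriptions are for, and it is hard-coded here only to keep the example short.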

CardioVascular Research Grid

• Sources:
– ECG_Mesa (MySQL DB)
– Chesnokov Analysis (eXistDB XML DBMS)
– Image Metadata, dcm4che PACS (MySQL DB)
– WaveformDB (eXistDB XML DBMS)
– DICOM Image Files (file system)
– Waveform Files (file system)

• Use mediator to identify subjects and files of interest

• Same BIRN mediator: just plug in CVRG source descriptions and an additional wrapper for eXistDB (XML/XQuery database)

[Figure: BIRN Mediator data flow. Labels: Domain query; Logical Source descriptions; Integrated results]
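The "same mediator, additional wrapper" idea can be sketched as a small plug-in interface. This is an illustration only, not the BIRN wrapper API: the `Wrapper` base class, its `fetch` method, and the sample data are all hypothetical, and the XML wrapper parses an in-memory document instead of issuing XQuery to an eXistDB server.

```python
from abc import ABC, abstractmethod
import xml.etree.ElementTree as ET


class Wrapper(ABC):
    """Uniform interface the mediator core calls; one subclass per source type."""

    @abstractmethod
    def fetch(self, subject_id):
        """Return a list of {'subject_id', 'file'} records for one subject."""


class RelationalWrapper(Wrapper):
    """Stand-in for a relational source; rows are held in memory."""

    def __init__(self, rows):
        self.rows = rows

    def fetch(self, subject_id):
        return [r for r in self.rows if r["subject_id"] == subject_id]


class XmlWrapper(Wrapper):
    """Stand-in for an XML database wrapper."""

    def __init__(self, xml_text):
        self.root = ET.fromstring(xml_text)

    def fetch(self, subject_id):
        return [{"subject_id": subject_id, "file": w.get("file")}
                for w in self.root.findall("waveform")
                if w.get("subject") == subject_id]


def files_of_interest(subject_id, wrappers):
    """Mediator-side code: unchanged no matter which wrappers are registered."""
    return [(name, rec["file"])
            for name, wrapper in wrappers.items()
            for rec in wrapper.fetch(subject_id)]


wrappers = {
    "image_metadata": RelationalWrapper([
        {"subject_id": "S1", "file": "img_001.dcm"},
        {"subject_id": "S2", "file": "img_002.dcm"},
    ]),
}
before = files_of_interest("S1", wrappers)

# Plugging in a new source type only adds a wrapper; files_of_interest is untouched.
wrappers["waveform_db"] = XmlWrapper(
    '<waveforms><waveform subject="S1" file="ecg_001.xml"/></waveforms>')
after = files_of_interest("S1", wrappers)
```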

Neuro Developmental Disorders

• Sources: LISDB, SherrDB

• User query: find all white females with Aicardi syndrome

• Results integrated from LISDB and SherrDB

• Same BIRN mediator, NeuroDev source descriptions

[Figure: BIRN Mediator data flow. Labels: Domain query; Logical Source descriptions; SQL query (one per source); LISDB results; SherrDB results; Integrated results]

Non-Human Primate Research Consortium

• Provide data integration infrastructure for NHPRC:
– Colony management, genetics, pathology, …

• BIRN NHPRC Activities:
– BIRN/ISI demonstrated Colony Management integration prototype
– NHPRC team developed DNA Banking application using BIRN mediator
– Collaborated on NHPRC Pathology Project

Scanner mediator

• Integration of multiple clinical data sources
– Relational databases: UCSD, UCI, RAND, Brigham & Women’s Hospital, …
– EMR system: relational export

• Domain model based on the OMOP common data model
– OMOP: Observational Medical Outcomes Partnership

http://scanner.ucsd.edu/

[Figure labels: Custom Interface; BIRN Mediator (OMOP Model); sources RAND, BWH, UCSD, UCI]

[Ashish (UCI), Boxwala (UCSD), …]
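To make "domain model based on OMOP" concrete, here is a small sketch of translating site-specific patient rows into a shared OMOP-style shape. It assumes only a three-field subset of the OMOP `person` table (`person_id`, `gender_concept_id`, `year_of_birth`) and the standard gender concept ids 8507 (male) and 8532 (female); the site schemas, the per-site mappings, and the function name are hypothetical and are not taken from the Scanner project.

```python
# OMOP standard concept ids for gender; 0 is OMOP's "no matching concept".
GENDER_CONCEPT = {"M": 8507, "F": 8532}


def to_omop_person(row, mapping):
    """Translate one site-specific row into a subset of OMOP person fields.

    `mapping` plays the role of a (very simplified) source description:
    it names the site column that holds each domain attribute.
    """
    sex_code = str(row[mapping["sex"]]).strip().upper()[:1]
    return {
        "person_id": row[mapping["person_id"]],
        "gender_concept_id": GENDER_CONCEPT.get(sex_code, 0),
        # Assumes ISO-style dates (YYYY-MM-DD) in the site export.
        "year_of_birth": int(str(row[mapping["birth_date"]])[:4]),
    }


# Two hypothetical sites with different column names and sex encodings.
site_a_row = {"pid": 17, "sex": "f", "dob": "1961-04-02"}
site_a_map = {"person_id": "pid", "sex": "sex", "birth_date": "dob"}

site_b_row = {"patient": 9, "gender": "Male", "birth": "1948-11-30"}
site_b_map = {"person_id": "patient", "sex": "gender", "birth_date": "birth"}

persons = [to_omop_person(site_a_row, site_a_map),
           to_omop_person(site_b_row, site_b_map)]
```

Once every site's rows are expressed in the same model, a single query over `persons` covers all sites, which is what lets one mediator serve several institutions.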

Cross-CTSI Data Integration: Oxytocin Study

• UCSD-UCI cross-CTSI Oxytocin study:
– HID@UCI
– RedCAP@UCSD

• Mediated solution
– BIRN Mediator

• Data from neurological assessment scales
– PANSS, STM, SCID, …

[Figure labels: Custom Interface; BIRN Mediator; RedCAP; HID]

[Ashish, Keator, Potkin, Fiefel, …]

BIRN Information Integration

• General information integration infrastructure
– BIRN Mediator
  • bridge semantics across data sources
  • provide integrated data for analysis and visualization
– Domain model development and curation process
  • Balance bottom-up/top-down domain model/ontology development and reuse
– Security and user data access control built-in

• Approach
– Engage research communities: NHPRC, FBIRN, CVRG, NeuroDev, Radiation Oncology, CTSIs, ...
– Build applications incrementally
– Enhance capabilities while providing useful tools