Information Integration
José Luis Ambite, Ph.D.
Project Leader, Information Sciences Institute
Research Assistant Professor, Computer Science
University of Southern California
Team
Information Integration Infrastructure: Jose Luis Ambite, Craig Knoblock, Maria Muslea, Gowri Kumaraguruparan (USC/ISI)
Domain Collaborators:
FBIRN: Naveen Ashish (UCI), Jessica Turner (MRN), Karl Helmer (MGH), Tim Olsen (WUSTL), Dingying Wei (UCI)
NHPRC: John Nylander, Dave Brink, Liz Moran (NHPRC)
CVRG: Naveen Ashish (UCI), Steve Granite (JHU)
NeuroDev: Dobyns, Paciorkowski (UW), Sherr (UCSF), …
UCI CTSI: Ashish, Keator (UCI), …
Security: Rachana Ananthakrishnan (UC), Laura Pearlman (USC/ISI)
Data Management: Robert Schuler, Ann Chervenak (USC/ISI)
Knowledge Engineering: Gully Burns (USC/ISI), Naveen Ashish (UCI), Jessica Turner (MRN)
User Interfaces: Naveen Ashish (UCI), Jose Luis Ambite, Pedro Szekely, Craig Rogers, Gowri Kumaraguruparan, Maria Muslea (USC/ISI)
Information Integration
• Problem: consistent view of heterogeneous, distributed data
• Challenges:
  – Syntactic heterogeneity: formats, data models
  – Semantic heterogeneity: names, structure, viewpoint
  – Efficiency: query execution
  – Scalability: ease of adding new sources
• Approaches:
  – Warehouse/ETL
  – Common-schema federation
  – Virtual Integration/Mediator
• BIRN supports deep integration across complex data sources
  – Heterogeneous sources: Relational, XML DBs, Web Services, HTML, files
  – Structured queries
  – Secure, Efficient Query Execution
[Figure labels: Decision Support; Application Programs, Workflows; Mediator; Knowledge Bases; Databases; Computer Programs; Web]
BIRN Information Mediator
• Virtual Integration Architecture:
  – Virtual organization: providers and consumers sharing data for a specific purpose
  – Autonomous sources: data and control remain at the sources; no changes to sources
  – Mediator: defines the domain schema and describes source contents
    • Domain schema: view of the domain agreed upon by the virtual organization
    • Source descriptions: declarative logical formulas relating source and domain schemas
• Query Answering
  – User writes query in the domain schema
  – Mediator:
    • Determines sources relevant to the query
    • Rewrites the query in the sources' schemas
    • Breaks the query into sub-queries for the sources
    • Optimizes the query evaluation plan
    • Combines answers from the sources
• Declarative: easy to add new sources
• EZ-config: automatic configuration for single-schema federations
[Figure labels: User queries; Mediator; Domain Schema; Logical Source Descriptions; Reformulation; Optimizer; Execution Engine; Sources' schemas; Wrappers; Data Sources]
[VLDBJ 2005, Frontiers NeuroScience 2010, JAMIA 2011]
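The query-answering steps on this slide can be illustrated with a toy sketch. This is not the BIRN Mediator's formalism: the slide describes source descriptions as declarative logical formulas, whereas the sketch below assumes a much simpler attribute-renaming map. All names (`SourceDescription`, `table_source`, `answer`, the `site_*` sources and their fields) are hypothetical, the sources are in-memory lists, and the optimization step is omitted.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Row = Dict[str, object]


@dataclass
class SourceDescription:
    """Hypothetical, simplified source description (a renaming map only)."""
    name: str
    domain_relation: str               # domain relation this source contributes to
    attribute_map: Dict[str, str]      # domain attribute -> source field
    fetch: Callable[[Row], List[Row]]  # stand-in wrapper: source-level filter -> rows


def table_source(rows: List[Row]) -> Callable[[Row], List[Row]]:
    """Build a stand-in wrapper over an in-memory table (equality filters only)."""
    def fetch(source_filter: Row) -> List[Row]:
        return [r for r in rows
                if all(r.get(k) == v for k, v in source_filter.items())]
    return fetch


def answer(domain_relation: str, selection: Row,
           descriptions: List[SourceDescription]) -> List[Row]:
    results: List[Row] = []
    for d in descriptions:
        # Determine relevant sources: same relation, all queried attributes mapped.
        if d.domain_relation != domain_relation:
            continue
        if not all(attr in d.attribute_map for attr in selection):
            continue
        # Rewrite the domain-level selection into the source's own field names.
        source_filter = {d.attribute_map[a]: v for a, v in selection.items()}
        # Run the sub-query, then translate rows back into the domain schema.
        inverse = {src: dom for dom, src in d.attribute_map.items()}
        for row in d.fetch(source_filter):
            domain_row = {inverse[k]: v for k, v in row.items() if k in inverse}
            domain_row["source"] = d.name
            results.append(domain_row)  # combine answers across sources
    return results


site_a = SourceDescription(
    "site_a", "subject", {"id": "subj_id", "sex": "sex"},
    table_source([{"subj_id": "a1", "sex": "M"}, {"subj_id": "a2", "sex": "F"}]))
site_b = SourceDescription(
    "site_b", "subject", {"id": "pid", "sex": "gender"},
    table_source([{"pid": "b1", "gender": "M"}]))
site_c = SourceDescription(
    "site_c", "scan", {"id": "scan_id"}, table_source([{"scan_id": "c1"}]))

rows = answer("subject", {"sex": "M"}, [site_a, site_b, site_c])
```

In this sketch, adding a source means adding one more `SourceDescription`; `answer` itself does not change, which is the property the "Declarative" bullet refers to.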
FBIRN Data Integration Use Case: HID and XNAT
• Sources:
  – HID@MRN, HID@UCI: Human Imaging Database(s), Oracle DB
  – XNAT: EXtensible Neuroimaging Archive Toolkit, Web service API
• User query: find all male patients over 50 with T1 scans
• Results integrated from XNAT and HID
[Figure labels: Domain query; BIRN Mediator; Logical Source descriptions; SQL query; XML query; HID results; XNAT results (XML); …; Integrated results]
[Front. NeuroScience 2010]
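As a rough illustration of this use case, the sketch below answers the slide's example query against two stand-in sources: an in-memory SQLite table playing the role of a relational HID, and an XML string playing the role of XNAT results. The table layout, XML element and attribute names, and function names are all assumptions made for the example; they are not the real HID schema or the XNAT API.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Stand-in for a relational HID instance (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE subject (subject_id TEXT, sex TEXT, age INTEGER, scan_type TEXT)")
conn.executemany("INSERT INTO subject VALUES (?, ?, ?, ?)", [
    ("hid-001", "M", 63, "t1"),
    ("hid-002", "F", 58, "t1"),
    ("hid-003", "M", 41, "t1"),
])

# Stand-in for XML returned by an XNAT-like web service (hypothetical format).
XNAT_XML = """<results>
  <subject id="xnat-7" gender="male" age="55"><scan type="T1"/></subject>
  <subject id="xnat-8" gender="male" age="72"><scan type="T2"/></subject>
</results>"""


def query_hid(db, min_age):
    """Relational sub-query: the domain conditions expressed as SQL."""
    sql = ("SELECT subject_id, age FROM subject "
           "WHERE sex = 'M' AND age > ? AND scan_type = 't1'")
    return [{"subject": sid, "age": age, "source": "HID"}
            for sid, age in db.execute(sql, (min_age,))]


def query_xnat(xml_text, min_age):
    """XML sub-query: the same domain conditions applied to parsed XML."""
    out = []
    for s in ET.fromstring(xml_text).findall("subject"):
        age = int(s.get("age"))
        has_t1 = any(scan.get("type", "").lower() == "t1"
                     for scan in s.findall("scan"))
        if s.get("gender") == "male" and age > min_age and has_t1:
            out.append({"subject": s.get("id"), "age": age, "source": "XNAT"})
    return out


# Domain query: male patients over 50 with T1 scans, integrated across sources.
integrated = query_hid(conn, 50) + query_xnat(XNAT_XML, 50)
```

Note how the two stand-ins encode the same concepts differently ("M" versus "male", "t1" versus "T1"); reconciling such differences is the semantic-heterogeneity problem the source descriptions are meant to address.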
CardioVascular Research Grid
• Sources:
  – ECG_Mesa (MySQL DB)
  – ChesnokovAnalysis (eXistDB XML DBMS)
  – Image Metadata, dcm4che PACS (MySQL DB)
  – WaveformDB (eXistDB XML DBMS)
  – DICOM Image Files (file system)
  – Waveform Files (file system)
• Use mediator to identify subjects and files of interest
• Same BIRN mediator: just plug in CVRG source descriptions and an additional wrapper for eXistDB (XML/XQuery database)
[Figure labels: Domain query; BIRN Mediator; Logical Source descriptions; Integrated results]
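The "plug in a wrapper" idea can be sketched as a small interface that each source type implements, so that the code combining results stays the same when a new kind of source is added. Everything here is hypothetical: the `Wrapper` interface, the class and source names, and the data. SQLite stands in for MySQL, and ElementTree's limited path syntax stands in for XQuery against eXistDB.

```python
import sqlite3
import xml.etree.ElementTree as ET
from abc import ABC, abstractmethod


class Wrapper(ABC):
    """Hypothetical wrapper interface: run a source-specific sub-query."""

    @abstractmethod
    def run(self, subquery: str) -> list:
        ...


class SqlWrapper(Wrapper):
    def __init__(self, conn):
        self.conn = conn

    def run(self, subquery):
        cur = self.conn.execute(subquery)
        cols = [c[0] for c in cur.description]
        return [dict(zip(cols, row)) for row in cur]


class XmlWrapper(Wrapper):
    """Added later for an XML source; the code in run_plan is untouched."""

    def __init__(self, xml_text):
        self.root = ET.fromstring(xml_text)

    def run(self, subquery):
        # ElementTree path expression, used here in place of XQuery.
        return [dict(el.attrib) for el in self.root.findall(subquery)]


def run_plan(plan, wrappers):
    """Execute (source name, sub-query) pairs and combine the rows."""
    rows = []
    for source, subquery in plan:
        for row in wrappers[source].run(subquery):
            rows.append({**row, "source": source})
    return rows


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE image (subject_id TEXT, file_path TEXT)")
conn.execute("INSERT INTO image VALUES ('s1', '/data/dicom/s1.dcm')")

WAVEFORMS = """<waveforms>
  <record subject_id="s1" file_path="/data/wave/s1.xml"/>
  <record subject_id="s2" file_path="/data/wave/s2.xml"/>
</waveforms>"""

wrappers = {"image_metadata": SqlWrapper(conn), "waveform_db": XmlWrapper(WAVEFORMS)}
plan = [
    ("image_metadata", "SELECT subject_id, file_path FROM image WHERE subject_id = 's1'"),
    ("waveform_db", "record[@subject_id='s1']"),
]
files = [row["file_path"] for row in run_plan(plan, wrappers)]
```

Here `files` lists the image and waveform files for one subject, mirroring the slide's use of the mediator to identify subjects and files of interest before the files themselves are fetched.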
Neuro Developmental Disorders
• Sources: LISDB, SherrDB
• User query: find all white females with Aicardi syndrome
• Results integrated from LISDB and SherrDB
• Same BIRN mediator, NeuroDev source descriptions
[Figure labels: Domain query; BIRN Mediator; Logical Source descriptions; SQL query (LISDB); SQL query (SherrDB); LISDB results; SherrDB results; …; Integrated results]
Non-Human Primate Research Consortium
• Provide data integration infrastructure for NHPRC:
  – Colony management, genetics, pathology, …
• BIRN NHPRC Activities:
  – BIRN/ISI demonstrated Colony Management integration prototype
  – NHPRC team developed DNA Banking application using BIRN mediator
  – Collaborated on NHPRC Pathology Project
Scanner mediator
• Integration of multiple clinical data sources
  – Relational databases: UCSD, UCI, RAND, Brigham & Women’s Hospital, …
  – EMR system: relational export
• Domain model based on the OMOP common data model
  – OMOP: Observational Medical Outcomes Partnership
http://scanner.ucsd.edu/
[Figure labels: Custom Interface; BIRN Mediator (OMOP Model); RAND; BWH; UCSD; UCI]
[Ashish (UCI), Boxwala (UCSD) , …]
Cross-CTSI Data Integration: Oxytocin Study
• UCSD-UCI cross-CTSI Oxytocin study:
  – HID@UCI
  – RedCAP@UCSD
• Mediated solution
  – BIRN Mediator
• Data from neurological assessment scales
  – PANSS, STM, SCID, …
[Figure labels: Custom Interface; BIRN Mediator; RedCAP; HID]
[Ashish, Keator, Potkin, Fiefel, …]
BIRN Information Integration
• General information integration infrastructure
  – BIRN Mediator
    • Bridge semantics across data sources
    • Provide integrated data for analysis and visualization
  – Domain model development and curation process
    • Balance bottom-up/top-down domain model/ontology development and reuse
  – Security and user data access control built-in
• Approach
  – Engage research communities: NHPRC, FBIRN, CVRG, NeuroDev, Radiation Oncology, CTSIs, ...
  – Build applications incrementally
  – Enhance capabilities while providing useful tools