A Federated In-Memory Database Computing Platform Enabling Real-Time Analysis of Big Medical Data
Dr.-Ing. Matthieu-P. Schapranow, Hasso Plattner Institute, Potsdam, Germany
May 17, 2017
Our Motivation: Turn Precision Medicine Into Clinical Routine
■ Can we enable clinicians to make their therapy decisions:
□ Incorporating all available patient specifics,
□ Referencing the latest lab results and worldwide medical knowledge, and
□ In an interactive manner during their ward round?
Analyze Genomes: A Federated In-Memory Database Computing Platform
Dr. Schapranow, Intel Tech Talk at SAPPHIRE, May 17, 2017
Our Vision: Medical Board Incorporating Latest Medical Knowledge
Project Timeline
[Figure: timeline from 2009 to 2017. Labeled items: IMDB Research, Enterprise Software, SAP HANA launched, Oncolyzer, Drug Response Analysis, Medical Knowledge Cockpit, Analyze Genomes Platform, SORMAS.]
The Challenge: Distributed Heterogeneous Data Sources
■ Human genome/biological data: 600 GB per full genome; 15 PB+ in databases of leading institutes
■ Prescription data: 1.5B records from 10,000 doctors and 10M patients (100 GB)
■ Clinical trials: currently more than 30k recruiting on ClinicalTrials.gov
■ Human proteome: 160M data points (2.4 GB) per sample; >3 TB raw proteome data in ProteomicsDB
■ PubMed database: >23M articles
■ Hospital information systems: often more than 50 GB
■ Medical sensor data: scan of a single organ in 1 s creates 10 GB of raw data
■ Cancer patient records: >160k records at NCT
Software Requirements in Life Sciences
■ Requirements
□ Managed services
□ Reproducibility
□ Real-time data analysis
■ Restrictions
□ Data privacy
□ Data locality
□ Volume of big medical data
http://stevedempsen.blogspot.de/2013/08/agile-software-requirements-comic.html
Our Approach: AnalyzeGenomes.com In-Memory Computing Platform for Big Medical Data
■ Apps: Drug Response Analysis, Pathway Topology Analysis, Medical Knowledge Cockpit, Oncolyzer, Clinical Trial Recruitment, Cohort Analysis, ...
■ Extensions for Life Sciences
□ Data Exchange, App Store
□ Access Control, Data Protection
□ Fair Use
□ Statistical Tools
□ Real-time Analysis
□ App-spanning User Profiles
■ Combined and Linked Data
□ Genome Data
□ Cellular Pathways
□ Genome Metadata
□ Research Publications
□ Pipeline and Analysis Models
□ Drugs and Interactions
■ In-Memory Database
■ Indexed Sources
Our Technology: In-Memory Database Technology
■ Combined column and row store
■ Map/Reduce
■ Single and multi-tenancy
■ Lightweight compression
■ Insert only for time travel
■ Real-time replication
■ Working on integers
■ SQL interface on columns and rows
■ Active/passive data store
■ Minimal projections
■ Group key
■ Reduction of software layers
■ Dynamic multi-threading
■ Bulk load of data
■ Object-relational mapping
■ Text retrieval and extraction engine
■ No aggregate tables
■ Data partitioning
■ Any attribute as index
■ No disk
■ On-the-fly extensibility
■ Analytics on historical data
■ Multi-core/parallelization
Scheduling and Execution of Genome Data Processing Pipelines
[Figure: components shown are a Webservice, a Scheduler, two Workers, and the In-Memory Database with the two tables below.]
Tasks
ID  Pipeline  Params
12  BWA       xyz.fastq
13  Stanford  A_1.fastq
14  Bowtie    xyz.fastq
Subtasks
Task  ID  Job     Status  Params
12    97  Split   done    xyz.fastq
12    98  Import  todo    abc.vcf
12    98  Import  done    abc.vcf
1. Trigger task execution
2. Schedule subtasks
3. Execute subtasks
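The three steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: plain Python objects stand in for the task and subtask tables that the platform keeps in the in-memory database, the split of a task into a "Split" and an "Import" job is hard-coded, and all function names (`schedule`, `worker`, `run_task`) are hypothetical.

```python
from dataclasses import dataclass
from queue import Queue
from threading import Thread


@dataclass
class Subtask:
    task_id: int
    subtask_id: int
    job: str
    params: str
    status: str = "todo"


def schedule(task_id, params, first_id=97):
    """Step 2: break a task into subtasks (hypothetical fixed job list)."""
    jobs = [("Split", params), ("Import", "abc.vcf")]
    return [Subtask(task_id, first_id + i, job, p) for i, (job, p) in enumerate(jobs)]


def worker(queue):
    """Step 3: pull subtasks from the queue until a None sentinel arrives."""
    while True:
        sub = queue.get()
        if sub is None:
            queue.task_done()
            break
        sub.status = "done"  # a real worker would run the job here
        queue.task_done()


def run_task(task_id, params, n_workers=2):
    """Step 1: trigger a task, then wait until all of its subtasks are done."""
    subtasks = schedule(task_id, params)
    queue = Queue()
    threads = [Thread(target=worker, args=(queue,)) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for sub in subtasks:
        queue.put(sub)
    for _ in threads:
        queue.put(None)  # one stop signal per worker
    queue.join()
    for t in threads:
        t.join()
    return subtasks


subtasks = run_task(12, "xyz.fastq")
```

The sketch ignores ordering constraints between subtasks; a scheduler for real pipelines would only release a subtask once the jobs it depends on have finished.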
Managed Services provided by Federated In-Memory Database System (FIMDB)
[Figure: federated deployment across two kinds of sites. A cloud service provider hosts shared algorithms and public reference data; a hospital or research department holds sensitive/patient data. Each node (i, j, k, m, n, ...) runs several workers and an IMDB instance. Further labeled elements: Scheduler, Relay, VPN, UDP, TCP, two Shared File System (Pool) instances, and a Shared File System (Global).]
Genome Data Processing Pipelines: State of the Art
■ Not standardized
■ Not exchangeable
■ Concatenation of bash scripts reading from and writing to files
■ Requires IT expertise for
□ Setup,
□ Error handling, and
□ Efficient processing and parallelization
■ Objective: Model, configure, and execute pipelines without involving IT experts
bwa aln ref.fa sample.fastq | bwa samse ref.fa - sample.fastq | samtools view -Su - | samtools sort …
Genome Data Processing Pipelines: Standardized Modeling
■ Graphical modeling notation
■ Compliant with BPMN 2.0, extended by
□ Modular structure
□ Degree of parallelization
□ Parameters and variables
■ Model descriptions (XPDL) are stored in the IMDB
■ Model instances are transformed into a graph structure that is executed by our worker framework
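To illustrate how a pipeline model can be turned into a graph and executed, here is a minimal Python sketch. It is an assumption-laden illustration, not the platform's worker framework: the step names are hypothetical, the graph is a plain dictionary rather than a parsed XPDL model, and the standard-library `graphlib.TopologicalSorter` (Python 3.9+) stands in for the scheduling logic.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each step maps to the set of steps it depends on.
# Two alignment steps model a degree of parallelization of 2.
pipeline = {
    "split": set(),
    "align_1": {"split"},
    "align_2": {"split"},
    "merge": {"align_1", "align_2"},
    "import_results": {"merge"},
}


def execution_batches(graph):
    """Return batches of steps; steps within one batch can run in parallel."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises graphlib.CycleError if the model has a cycle
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # steps whose dependencies are done
        batches.append(ready)
        sorter.done(*ready)
    return batches


batches = execution_batches(pipeline)
# batches == [["split"], ["align_1", "align_2"], ["merge"], ["import_results"]]
```

Each batch could be handed to a pool of workers; the next batch is released only after the previous one has completed.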
Genome Data Processing Pipelines: XML Process Definition Language
Database Structure
[Figure: database tables PIPELINES.MODELS and PIPELINES.PIPELINES]
Genome Data Processing Pipelines: Traditional vs. Optimized Approach
■ Results are imported into the IMDB
■ Optimization reduced execution time by >50%
Reproducibility: Modeling of Data Analysis Pipelines
1. Design time (researcher, process expert)
□ Definition of parameterized process model
□ Uses graphical editor and jobs from repository
2. Configuration time (researcher, lab assistant)
□ Select model and specify parameters, e.g. aln opts
□ Results in model instance stored in repository
3. Execution time (researcher)
□ Select model instance
□ Specify execution parameters, e.g. input files
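The separation into design, configuration, and execution time can be sketched as three small Python steps. This is a hedged illustration only: the class and function names (`ProcessModel`, `ModelInstance`, `configure`, `execute`) and the parameter value are hypothetical, and the repository is left out entirely.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProcessModel:
    """Design time: a parameterized model with its jobs and open parameters."""
    name: str
    jobs: tuple
    parameters: tuple


@dataclass(frozen=True)
class ModelInstance:
    """Configuration time: a model with all of its parameters bound."""
    model: ProcessModel
    bindings: dict


def configure(model, **bindings):
    """Bind every model parameter; reject missing or unknown parameters."""
    missing = set(model.parameters) - set(bindings)
    unknown = set(bindings) - set(model.parameters)
    if missing or unknown:
        raise ValueError(f"missing={sorted(missing)} unknown={sorted(unknown)}")
    return ModelInstance(model, dict(bindings))


def execute(instance, input_files):
    """Execution time: combine a model instance with per-run inputs."""
    return {
        "model": instance.model.name,
        "jobs": list(instance.model.jobs),
        "parameters": dict(instance.bindings),
        "inputs": list(input_files),
    }


model = ProcessModel("alignment", ("aln", "samse", "sort"), ("aln_opts",))
instance = configure(model, aln_opts="-t 4")  # illustrative option value
run = execute(instance, ["sample.fastq"])
```

Because the model and its bound parameters are fixed before execution, two runs of the same instance differ only in their input files, which is what makes a run reproducible.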
App Example: Medical Knowledge Cockpit for Patients and Clinicians
■ Query-oriented search interface
■ Seamless integration of patient specifics, e.g. from EMR
■ Parallel search in international knowledge bases, e.g. for biomarkers, literature, cellular pathways, and clinical trials
Medical Knowledge Cockpit for Patients and Clinicians: Pathway Topology Analysis
■ Today, search in pathways is limited to checking whether a certain element is contained
■ Integrated >1.5k pathways from international sources, e.g. KEGG, HumanCyc, and WikiPathways, into HANA
■ Implemented graph-based topology exploration and ranking based on patient specifics
■ Enables interactive identification of possible dysfunctions affecting the course of a therapy before its start
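One way to picture graph-based exploration and ranking is the following Python sketch. It is not the method used in the platform: the scoring rule (share of pathway nodes reachable downstream of a patient's variant genes) is an assumption chosen for illustration, and the pathway and gene names are placeholders.

```python
from collections import deque


def affected_nodes(pathway, variant_genes):
    """Breadth-first search: nodes reachable from the patient's variant genes.

    `pathway` maps every node to the list of its successor nodes.
    """
    start = [gene for gene in variant_genes if gene in pathway]
    seen = set(start)
    queue = deque(start)
    while queue:
        node = queue.popleft()
        for successor in pathway.get(node, ()):
            if successor not in seen:
                seen.add(successor)
                queue.append(successor)
    return seen


def rank_pathways(pathways, variant_genes):
    """Rank pathways by the share of their nodes that may be affected."""
    scores = {
        name: len(affected_nodes(graph, variant_genes)) / len(graph)
        for name, graph in pathways.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Placeholder data: two tiny pathways and one variant gene.
pathways = {
    "pathway_1": {"A": ["B"], "B": ["C"], "C": [], "D": []},
    "pathway_2": {"A": [], "E": ["F"], "F": []},
}
ranking = rank_pathways(pathways, {"A"})
# pathway_1 scores 3/4 (A, B, C affected); pathway_2 scores 1/3 (only A)
```

A containment-only search would report both pathways equally, since both contain gene A; following the topology is what separates them.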
Unified access to multiple formerly disjoint data sources
Pathway analysis of genetic variants with graph engine
Medical Knowledge Cockpit for Patients and Clinicians: Publications
■ Interactively explore relevant publications, e.g. PDFs
■ Improved ease of exploration, e.g. by highlighted medical terms and relevant concepts
App Example: Real-time Assessment of Clinical Trial Candidates
■ Supports trial design and recruitment process through statistical data analysis
■ Real-time matching and clustering of patients and clinical trial inclusion/exclusion criteria
■ Reassessment of already screened or participating citizens to reduce recruitment costs
■ Integrates smoothly with the
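Matching patients against inclusion and exclusion criteria can be sketched as follows. This is a simplified Python illustration that covers matching only, not clustering; the criteria, field names, and patient records are invented placeholders, and criteria are modeled as plain predicate functions.

```python
def matches(patient, trial):
    """True if the patient meets all inclusion and no exclusion criteria."""
    included = all(check(patient) for check in trial["inclusion"])
    excluded = any(check(patient) for check in trial["exclusion"])
    return included and not excluded


def candidates(patients, trial):
    """Return the ids of all patients eligible for the trial."""
    return [patient["id"] for patient in patients if matches(patient, trial)]


# Placeholder trial: adults with "diagnosis_x" who do not take "drug_y".
trial = {
    "inclusion": [
        lambda p: p["age"] >= 18,
        lambda p: "diagnosis_x" in p["diagnoses"],
    ],
    "exclusion": [
        lambda p: "drug_y" in p["medication"],
    ],
}

patients = [
    {"id": 1, "age": 54, "diagnoses": {"diagnosis_x"}, "medication": set()},
    {"id": 2, "age": 61, "diagnoses": {"diagnosis_x"}, "medication": {"drug_y"}},
    {"id": 3, "age": 16, "diagnoses": {"diagnosis_x"}, "medication": set()},
]

eligible = candidates(patients, trial)
# eligible == [1]: patient 2 is excluded by medication, patient 3 by age
```

Re-running `candidates` over already screened patients whenever criteria change corresponds to the reassessment idea mentioned above.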
Real-time assessment of clinical trial candidates
Where to find additional information?
■ Online: Visit we.analyzegenomes.com for latest research results, slides, videos, tools, and publications
■ Offline: High-Performance In-Memory Genome Data Analysis: In-Memory Data Management Research, Springer, ISBN: 978-3-319-03034-0, 2014
■ In Person: Visit us at the HPI booth 200!
■ Join us for Intel Tech Talks at SAPPHIRE booth 669!
□ May 17, 1:00 pm: A Federated In-Memory Database Computing Platform Enabling Real-time Analysis of Big Medical Data
□ May 18, 3:00 pm: In-Memory Apps For Precision Medicine
Keep in contact with us!
Dr. Matthieu-P. Schapranow
Program Manager E-Health & Life Sciences
Hasso Plattner Institute
August-Bebel-Str. 88
14482 Potsdam, Germany
http://we.analyzegenomes.com/