A functional overview of the tranSMART 17.1 development project

36
27 Oct, 2016 Ward Weistra – The Hyve [email protected] / @wardweistra TranSMART Pro 17.1 project Functional overview

Transcript of A functional overview of the tranSMART 17.1 development project

27 Oct, 2016

Ward Weistra – The Hyve

[email protected] / @wardweistra

TranSMART Pro 17.1 projectFunctional overview

2

Agenda

● Backstory

● Technical improvements (briefly)

● Functional improvements

○ New crucial functionality

○ Backwards compatibility & upgrade path

○ Documentation & tests

● Project structure

BackstoryWhat are we solving here?

4

What are we solving here?

1. Missing crucial functionality○ Time series, Samples, Cross-study concepts○ Transcript-level RNA-Seq data

○ Large file storage

2. Code problems: ‘technical debt’○ Monolithic architecture

○ Lack of automated tests

○ Old version Grails/Java, many code repositories

○ No documentation of database

5

Backend only

● Stable, commercial grade core○ Decoupling of the backend from the transmartApp User

Interface via the REST API

○ Towards an ecosystem of User interfaces on top of the

tranSMART backend

● Why only the backend?○ transmartApp has issues: assumptions, old layered code

○ Enough work already

○ Current data will still work in transmartApp

○ 17.1 project will be part of the full 17.1 release

6

It’s all open! http://bit.ly/171documents

DESIGN

REQUIR

EMENTS

Time series, samples and cross study conceptsi2b2 database alignment and extension

8

History

● tranSMART was developed on top of i2b2

to combine clinical with omics data

● i2b2 has cross-study concepts (with ontology codes) and

support for storing samples and time series data

● tranSMART lost this:

○ Concepts are study specific

○ User Interface assumes a patient-concept pair to have

one value“Patient John has for concept heart rate the value 80 bpm”

“Concept age in study A is not the same as concept age in study B”

9

Time series

● Absolute time○ Blood measurement with start (and end) date+time

○ Hospital visit per patient grouping multiple measurements

with start (and end) date+time

● Relative time○ ‘Baseline’ (0 days) or ‘Week 1’ (7 days) observation

○ Shared between patients

● Ordinal time○ First, second and third observation

10

Samples

● Differentiated by ‘modifiers’

○ Tumor and normal measurement

○ Multiple doses

○ Multiple tissues

● Differentiated only by a number ‘instance_num’

○ Multiple replicas

11

Cross-study concepts

● We want ‘Age’ in different studies to be the same

concept○ Get subjects which match ‘Age > 50’ from ALL studies

○ Use ontology codes, eg. from an external ontology server

● Difference with i2b2: tranSMART is study based○ Study based data loading

○ Study based data access

● We need to support both

The (relevant part of the) 17.1 data model

12

13

Time series and samples - Example 1

A study with tumor and normal

samples

● Multiple observations for the same patient

differentiated by the modifier ‘tissue type’.

● The Start_Date (and End_Date) for the

observation will be empty.

● All observations will be linked to the same

trial_visit, which will link to the study.

Clinical trial with multiple timepoints

(Baseline, Week 1, Week 2)

● Multiple observations for the same patient

differentiated by their trial_visit.

● All observations will be linked to one of the

available trial_visits, which will link to the

study. Each trial_visit has a Label

(Baseline), a Unit (Days) and a Value (0, 7

and 14).

14

Time series and samples - Example 2

15

Time series and samples - Example 3 (1/2)

An EHR dataset with observation and

visit timestamps and samples.

● Multiple observations for the same patient

differentiated by their observation

Start_Date, visit and Instance_Num.

● The Start_Date (and End_Date) for the

observation will be set to a timestamp.

● The Instance_Num will be set starting

from 1 for multiple samples on the same

observation Start_Date and visit.

16

Time series and samples - Example 3 (2/2)

An EHR dataset with observation and

visit timestamps and samples.

● The observations from a patient will be

linked one visit per hospital visit.

● The Start_Date (and End_Date) for the

visit will be set to a timestamp including

time and date for the hospital visit.

● All observations will be linked to the same

trial_visit, which will link to the study.

● Querying observations based on a combination of:○ start time, end time

○ aggregated time series/samples values:■ minimum, maximum, average

○ temporal constraints on sets of events:■ define sets of events (e.g. A: all blood pressure readings for a

patient, B: the first use of drug X by the patient)

■ Specify constraints (e.g. All of A happen at least one week after

any of B).

17

Querying time series and samples

18

● Querying patients based on observations:○ Certain constraints are valid for any or for all observations

for the patient

○ Return patients where all observations of high blood

pressure occur after supply of drug X.

● Querying for aggregated values for numerical data:○ minimum, maximum, average

Querying time series and samples

Transcript-level RNA-Seq

20

TranSMART data types

● Metadata○ Study, concept, patient metadata / Links to source data

● Clinical / NHTMP / Derived imaging data / Biobanking data○ numerical and categorical

● Gene expression - RNA○ Micro array○ mRNAseq, miRNAseq - only linked to genes○ qPCR miRNA

● Copy Number Variation data (Array CGH)● Small Genomic Variants (SNP, indel – VCF format)● Large genomic rearrangements● Proteomics

○ Protein mass spectrometry – peptide or protein quantities○ Immunoassay Rule-based medicine (RBM) – analyte concentrations

● Metabolomics○ Metabolite quantities

21

Transcript-level RNA-Seq data

● Adding a data type where measurements

(readcount, normalised readcount and z-score)

are linked to transcripts instead of genes

● Dictionary will link genes to transcript for searching

REF_ID GPL_ID CHROMOSOME START_BP END_BP TRANSCRIPT

ENST0001 RNASEQ_TRANSCRIPT_ANNOT X 1000 1100 TR1

2 RNASEQ_TRANSCRIPT_ANNOT Y 2000 2500

3 RNASEQ_TRANSCRIPT_ANNOT 10 3000 4000 TR2

Large file storageLinking with Arvados

23

Linking with Arvados: Scalable Genomics

● Linking files in Arvados to studies in tranSMART

for the storage of large files (eg BAM, VCF)

● If possible:○ Align with linking files in MongoDB to studies

● Eventual UI goals:○ See in tranSMART which Arvados files linked to study

○ Start from tranSMART a Arvados workflow on Arvados

files

Backwards compatibilityand the upgrade path

25

Upgrade path / data migration

● If you have your data in 16.1 or 16.2○ There will be a data migration path provided to 17.1

26

Backwards compatibility

If you have your data in 16.1 or 16.2

● The current user interface (transmartApp) will still work on current

data

○ So only for data without time series, samples,

○ Plugins are not guaranteed to work (but might very well)

● The current REST clients will still work with the V1 version of the

REST API

Documentation and automated testing

28

Documentation

● Documentation will be provided for all available

REST and Core API calls

● Data model design will describe the complete data

model

29

Automated testing

● The Core API will have

unit and integration tests

with a minimal test

coverage of 70%.

● The RESTful API will have

automated functional

tests for all API calls.

Project structureGC, TSC and PMC

31

Involved parties

● tranSMART Pro○ Leadership: TranSMART Foundation

○ Sponsors:

■ Pfizer

■ Roche

■ AbbVie

■ Sanofi

○ Execution: The Hyve

● GC: Business decisionsSponsors + TSF

● TSC: Technical decisionsSponsors + TSF

● EUTAB: Represent end usersSponsors (+ selected experts)

● PMC: Direction for the releaseTSC + Partners (ITTM, Curoverse, Clarivate)

32

Governance

17.1 Project Management Committee

tranSMART ProTechnical Steering

Committee

tranSMART ProGoverning Committee

The Hyve

tranSMART ProEnd User Technical Advisory Board

17.1 project

33

Technical Steering Committee

● Pfizer: Jay Bergeron

● Roche: Thomas Thies

● Sanofi: Heike Schürmann

● AbbVie: Samantha Lipsky

● Chair + customer representative:

Prof. Yi-Ke Guo (Imperial College,

tranSMART Foundation)

34

The Hyve team

● Project manager: Erik van Eeuwijk

● Business analyst: Ward Weistra (me)

● Technical lead: Gijs Kant

● Development team:

○ Piotr Zakrzewski (present)

○ Ruslan Forostianov (present)

○ Jan Kanis

○ Ewelina Grudzien

○ Olaf Meuwese

○ Barteld Klasens (automated testing)

35

Timeline

● Module A and B: End of 2016○ Time series, samples, cross-study concepts

○ Transcript-level RNA-Seq

● Module C and project release: End of Q1 2017○ Linking with Arvados

● TranSMART 17.1 version release: Q2 2017○ Integration with all community developments