The Seven Bridges Cloud Ecosystem: Enabling Interoperable ... 20180403 Liz... · HIPAA-compliant on...
sevenbridges.com © 2018 Seven Bridges
The Seven Bridges Cloud Ecosystem: Enabling Interoperable Data Access and Analysis
Liz Williams, PhD
The content of this presentation is solely the responsibility of Seven Bridges Genomics Inc and does not necessarily represent the official views of the National Cancer Institute or National Institutes of Health.
The Seven Bridges Cloud Ecosystem Enables Precision Medicine
Data Users Infrastructure
Interoperability
Partnerships
Infrastructure
Interoperability
Partnerships
The Seven Bridges Cloud Ecosystem
● Project Management
● User Management
● Authentication & Authorization
● System Monitoring
● Usage Logging
● Notification Service
● Backup Service
● Billing Management
The Seven Bridges Platform
Web Application API
Task Execution API Data/Metadata Service
Cloud Storage & Compute
Resource Manager
Core Platform Infrastructure
Data Infrastructure Independent Core Services
Task Execution Infrastructure
Task Scheduler
Job Management Layer
Orchestration Layer
Security & Compliance on the Seven Bridges Platform
● HIPAA-compliant on AWS and GCP deployments
● ISO 27001:2013 certified
● US Federal Information Security Management Act (FISMA) Moderate certification based on NIST 800-53 Rev 4 controls for the CGC
● NIH Trusted Partner for the CGC
● Compliant with dbGaP Security Best Practices
● US-EU Privacy Shield Program registered participant; preparing for GDPR
● Support for CAP, CLIA, and GxP best practices
Essential Features of an Interoperable Data Ecosystem
Collaborative, Usable, Reproducible, Extendable, Scalable
+
Findable, Accessible, Interoperable, Reusable
● Secure, customizable workspaces
● Managed billing
● User-friendly interface
● Easy data management
● Industry-standard bioinformatics pipelines
● Flexible & reproducible methods
● Automated & accessible task logs
● Developer-friendly tools
● Portable bioinformatics pipelines
● Scalable data storage
● Cloud-optimized computation
Collaborative Usable Reproducible Extendable Scalable
Essential Features of an Interoperable Data Ecosystem
Timeline: 2006, 2014, 2015, 2016, 2017, 2018
TCGA Pilot Program announced
Launched the CGC
Awarded NCI Cancer Genomics Cloud (CGC) Pilot contract
Logged 3000th user & 450th year of compute time on the CGC
Registered 1000th CGC user
...
Growth of the Seven Bridges Cloud Ecosystem
CAVATICA selected as NIH Kids First Data Resource
Launched CAVATICA partnership with CHOP
Partnered with JAX to build NCI’s PDXNet Data Commons
Selected for NIH Data Commons Pilot
Launched CAVATICA
* Available by Q4 2018
Data in the Seven Bridges Cloud Ecosystem
An NCI Cancer Research Data Commons Cloud Resource
The Seven Bridges Cancer Genomics Cloud (CGC)
The Seven Bridges CGC
The Seven Bridges Cancer Genomics Cloud has been funded in whole or in part with Federal funds from the National Cancer Institute, National Institutes of Health, Contract No. HHSN261201400008C and Task Order No. 17X146 under Contract No. HHSN261201500003I.
cancergenomicscloud.org
A Cloud Resource within the NCI Cancer Research Data Commons for secure storage, sharing & analysis of petabytes of public, multi-omic cancer datasets
● User-friendly web interface
● Powerful RESTful API, Datasets API & object-oriented and user-friendly libraries in Python, R & Java
● Comprehensive online documentation & training resources
● Technical support from a team of 200+ expert scientists, bioinformaticians & engineers
Accessibility
cancergenomicscloud.org
…
Collaboration Tools
cancergenomicscloud.org
● Secure and customizable private workspaces for management of collaborators, data, tools & analysis results
● Project description, note & notification features for communicating with collaborators around the world
● Automatically generated, durable records of input/output files, apps, versions & parameters for every task run on the platform
[Figure: Petabytes of Public Datasets, 2015 to 2018; * = anticipated availability]
cancergenomicscloud.org
● 3 PB of multi-omic public datasets
● 20 PB of linked data
● 0.5 PB of private & derived data
Interactive and Programmatic Query Tools
cancergenomicscloud.org
● Web- and API-based metadata query tools to explore the data landscape and build cohorts for analysis
● Semantic triple-store technology for dataset harmonization & cross-dataset query building
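The idea behind a triple store can be pictured with a small sketch: every fact is a (subject, predicate, object) triple, and a cohort is built by pattern matching across facts from different datasets. This is not the CGC query API; the predicate names, case identifiers, dataset identifiers, and the `match` and `cohort` helpers are all hypothetical and exist only for illustration.

```python
# Each fact is a (subject, predicate, object) triple.
# All identifiers below are made up for illustration.
TRIPLES = [
    ("case:A1", "hasDiseaseType", "Lung Adenocarcinoma"),
    ("case:A1", "inDataset", "dataset:X"),
    ("case:B7", "hasDiseaseType", "Lung Adenocarcinoma"),
    ("case:B7", "inDataset", "dataset:Y"),
    ("case:C3", "hasDiseaseType", "Glioblastoma"),
    ("case:C3", "inDataset", "dataset:X"),
]


def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return [
        (s, p, o)
        for (s, p, o) in triples
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    ]


def cohort(triples, disease_type):
    """Cases with the given disease type, regardless of source dataset."""
    hits = match(triples, predicate="hasDiseaseType", obj=disease_type)
    return sorted(s for (s, _, _) in hits)


print(cohort(TRIPLES, "Lung Adenocarcinoma"))
```

Because every dataset is expressed with the same predicates, one pattern returns cases from both hypothetical datasets without any dataset-specific code.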
Built-in Data Security
cancergenomicscloud.org
● Per-file, per-user permissions management for third-party controlled-access data
● A permissions management model extendable across datasets & data governance entities
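A per-file, per-user permission model can be sketched as a lookup from each file to the set of users allowed to read it. The sketch below is illustrative only and is not the Seven Bridges implementation; `PermissionStore`, its methods, and the user and file identifiers are hypothetical.

```python
from collections import defaultdict


class PermissionStore:
    """Hypothetical per-file, per-user access table (illustration only)."""

    def __init__(self):
        # Maps file_id -> set of user_ids allowed to read that file.
        self._readers = defaultdict(set)

    def grant(self, user_id, file_id):
        self._readers[file_id].add(user_id)

    def revoke(self, user_id, file_id):
        self._readers[file_id].discard(user_id)

    def can_read(self, user_id, file_id):
        # Use .get so that a lookup never creates an empty entry.
        return user_id in self._readers.get(file_id, set())

    def readable_files(self, user_id, file_ids):
        """Filter a candidate list down to the files this user may read."""
        return [f for f in file_ids if self.can_read(user_id, f)]


store = PermissionStore()
store.grant("alice", "file-001")
store.grant("alice", "file-002")
store.grant("bob", "file-002")
print(store.readable_files("bob", ["file-001", "file-002"]))
```

Keying the table by file rather than by dataset is what keeps the check independent of which dataset or governance entity a file belongs to, in this sketch at least.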
Tools To Connect Data
cancergenomicscloud.org
Import Data to the Platform
● Command Line Uploader & CLI
● Seven Bridges Uploader (GUI)
● API import
● HTTP(S) / FTP import
Connect the Platform to External Resources
● Connect Cloud Storage (Volumes API)
● SBFS (a FUSE-based file system)
Mount projects from your desktop
Tools To Analyze Data
cancergenomicscloud.org
● A curated collection of 350+ bioinformatics tools & workflows
● Optimized for speed & cost in the cloud
● Fully parameterized & customizable
● Accessible via the GUI & API
Tools To Ensure Analytical Reproducibility
cancergenomicscloud.org
● Docker-containerized bioinformatics pipelines
● Automatically generated and accessible logs for every task run on the platform
○ Tool & workflow versions
○ Parameters
○ Input & output files
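To make concrete why logging versions, parameters, and files supports reproducibility, here is a minimal sketch of a task record with a fingerprint derived from those fields. It is not the platform's log schema; `TaskRecord`, its field names, the `fingerprint` method, and the tool and file names are hypothetical.

```python
import hashlib
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskRecord:
    """Hypothetical record of one task run (not the platform's schema)."""
    tool: str
    tool_version: str
    parameters: dict
    input_files: tuple
    output_files: tuple = ()

    def fingerprint(self):
        # Hash only what determines the result: tool, version,
        # parameters, and inputs. Sorting makes the hash independent
        # of the order in which keys and files were listed.
        payload = {
            "tool": self.tool,
            "tool_version": self.tool_version,
            "parameters": self.parameters,
            "input_files": sorted(self.input_files),
        }
        canonical = json.dumps(payload, sort_keys=True)
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


run_a = TaskRecord("example-aligner", "1.0.0",
                   {"threads": 8, "mode": "fast"},
                   ("s1.fastq", "s2.fastq"))
run_b = TaskRecord("example-aligner", "1.0.0",
                   {"mode": "fast", "threads": 8},
                   ("s2.fastq", "s1.fastq"))
run_c = TaskRecord("example-aligner", "1.1.0",
                   {"threads": 8, "mode": "fast"},
                   ("s1.fastq", "s2.fastq"))

print(run_a.fingerprint() == run_b.fingerprint())
print(run_a.fingerprint() == run_c.fingerprint())
```

Treating input order as irrelevant is a simplifying assumption of this sketch; a tool whose output depends on input order would need the order preserved in the record.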
An Extendable Analysis Ecosystem
cancergenomicscloud.org
● SBFS to connect data on the platform to local applications
● Data Cruncher, a custom JupyterLab environment for interactive analysis, data visualization & implementation of custom tertiary analysis tools
Files Instance
Tools To Port Your Own Pipelines to the Platform
cancergenomicscloud.org
● An intuitive and flexible software development kit for developing and porting custom tools to the platform
● Conformance with community standards to ensure pipeline portability & reproducibility
● 3,000+ registered users from 60+ countries
● 347,000+ completed tasks representing 465+ years of total compute time
Value of the CGC Ecosystem to the Research Community
cancergenomicscloud.org
[Figure: Number of Tasks Run, Jan 2016 to Jan 2018 (y-axis 0 to 350,000); series: Completed Tasks, Failed Tasks]
Case Study #1: TCGA Immune Response Working Group
● Collaborative analysis with members of the Immune Response Working Group of The Cancer Genome Atlas (TCGA) Research Network
● Outcome: cost-optimized (<$0.30/sample), high-throughput HLA typing across ~9,000 TCGA RNA-Seq (fastq) files
Case Study #2: PanCancer Analysis of Whole Genomes (PCAWG) Study
● High-throughput, harmonized analysis by Seven Bridges of all tumor and matched genomes in the dataset (~1,350)
● Outcome: rapid generation of ~65,000 output files (including ~5,000 VCFs) totaling 725 TB
Case Study #3: Independent Analysis on 45,000 Genomes
● High-throughput analysis of 45,000 bacterial genomes accessed from SRA via API and analyzed using a custom workflow
● Outcome: analysis completed in ~1 week by a novice CGC user with no substantive assistance from the CGC team
The CGC Enables Scalable, Cost-Effective Research
cancergenomicscloud.org
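The scale implied by the case-study figures can be worked out with a little arithmetic. The sketch below uses only the approximate numbers quoted in the case studies and rests on assumptions that are not stated in the slides: one RNA-Seq file per sample, decimal units (1 TB = 1,000 GB), and "~1 week" taken as 7 days.

```python
# Figures quoted in the case studies (all approximate in the source).
hla_files = 9_000                 # "~9,000 TCGA RNA-Seq (fastq) files"
hla_cost_per_sample_usd = 0.30    # "<$0.30/sample", so this is an upper bound
pcawg_output_files = 65_000       # "~65,000 output files"
pcawg_output_tb = 725             # "totaling 725 TB"
bacterial_genomes = 45_000        # "45,000 bacterial genomes"
analysis_days = 7                 # assumption: "~1 week" read as 7 days

# Assumption: one file per sample, so total cost is bounded by files * cost.
hla_cost_upper_bound_usd = hla_files * hla_cost_per_sample_usd

# Assumption: decimal units, 1 TB = 1,000 GB.
avg_output_gb = pcawg_output_tb * 1000 / pcawg_output_files

genomes_per_day = bacterial_genomes / analysis_days

print(f"HLA typing total cost: under ${hla_cost_upper_bound_usd:,.0f}")
print(f"PCAWG average output file size: about {avg_output_gb:.1f} GB")
print(f"Bacterial genomes analyzed per day: about {genomes_per_day:,.0f}")
```

These are back-of-envelope bounds derived from rounded figures, not numbers reported by the presentation.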
An NCI-funded Resource for the Patient-Derived Xenograft Development and Trial Centers Research Network
The JAX-Seven Bridges PDXNet Data Commons
The JAX-Seven Bridges PDXNet Data Commons
pdxnetwork.org/pdccc/
The JAX-Seven Bridges PDX Data Commons and Coordination Center is funded in whole or in part with Federal funds from the National Cancer Institute, National Institutes of Health, Department of Health and Human Services, under Contract No. 1U24CA224067-01.
A cloud-based environment for secure storage, sharing & analysis of data for the Patient-Derived Xenograft Development and Trial Centers Research Network (PDXNet)
The JAX-Seven Bridges PDXNet Data Commons
pdxnetwork.org/pdccc/
Designed to:
● Connect the PDX Development and Trial Centers (PDTCs) & the Patient-Derived Model Repository (PDMR)
● Colocalize PDXNet data & bioinformatics resources to facilitate data harmonization, discovery & analysis
● Integrate data from individual PDTCs & pilot projects to inform preclinical trials
● Make PDXNet data & harmonized workflows FAIR and available to the broader research community
Key Features of the PDXNet Data Commons
pdxnetwork.org/pdccc/
● Collaborative
● Usable: Custom data sharing features to enable phased release of consortium datasets to PDXNet participants & to the public
● Reproducible: Use of Rabix & CWL for creating reproducible and portable workflows for consortium-wide data harmonization
● Extendable:
○ Full integration with the Seven Bridges CGC to enable access to all available public datasets & bioinformatics resources
○ A harmonized metadata model that enables increasingly complex queries across public and private datasets using existing data query tools
● Scalable
The NIH Common Fund Gabriella Miller Kids First Pediatric Data Resource
CAVATICA
CAVATICA & the Kids First Data Resource
cavatica.org
A cloud-based environment for secure storage, sharing & analysis of large volumes of genomic data from pediatric cancer & rare disease patients
CAVATICA & the Kids First Data Resource
cavatica.org
Designed to:
● Integrate data for multiple rare pediatric diseases across dozens of hospitals & clinical sites
● Colocalize consortium data & bioinformatics resources to facilitate data harmonization, discovery & analysis
● Make Kids First data & harmonized workflows FAIR and available to the broader research community
Key Features of CAVATICA & the Kids First Data Resource
● Collaborative
● Usable: Custom permissions management for fine-grained control of private dataset access
● Reproducible: Use of Rabix & CWL for creating reproducible and portable workflows for consortium-wide harmonization
● Extendable:
○ Interoperability with the CGC to enable authorized access to public datasets
○ A harmonized metadata model that enables queries across pediatric and adult datasets using existing data query tools
● Scalable
cavatica.org
An NIH Data Commons Pilot Solution
FAIR4CURES
FAIR4CURES
A data and standards ecosystem for making NIH data resources FAIR and for enabling secure data sharing & analysis
The FAIR4CURES project is funded in whole or in part with Federal funds from the National Institutes of Health.
in collaboration with the NIH Data Commons Pilot Phase Consortium (DCPPC)
FAIR4CURES
Designed to:
● Be a cloud-agnostic platform for making distributed NIH data resources FAIR and available for analysis by the broader research community
● Establish community standards and generate resources for making digital objects FAIR
Findable:
○ GUIDs for digital objects
○ A common metadata model for indexing & search
Accessible: Standardized authentication / authorization
Interoperable:
○ Open API standards
○ Cross-platform interoperability
Reusable:
○ GUIDs for digital objects
Key Features of FAIR4CURES
● Collaborative: GUIDs to promote data and tool publication & reuse
● Usable: Workspaces connected to multiple cloud providers to enable compute where the data live
● Reproducible: GUIDs to promote analytical reproducibility
● Extendable:
○ A standardized authentication & authorization schema
○ Open API standards & cross-platform interoperability
○ A common metadata model that enables queries across increasingly diverse datasets & data types using existing data query tools
● Scalable
The Seven Bridges Cloud Ecosystem: Interoperable Data Access and Analysis to Drive Precision Medicine
Infrastructure
Interoperability
Partnerships
Liz Williams, PhD