Boston University / Globus Genomics Proof of …Approach: Augment current team and tools with a NGS...
Transcript of Boston University / Globus Genomics Proof of …Approach: Augment current team and tools with a NGS...
Boston University / Globus Genomics Proof of Concept Results and Overview
November 21, 2013
www.globus.org/genomics
• Globus Genomics is developed, operated, and supported by researchers, developers, and bioinformaticians at the Computation Institute – University of Chicago/Argonne National Lab
• We are a non-profit organization building solutions for non-profit researchers
• Our goal is to support the advancement of science by bringing together our strengths and capabilities to help meet the unique needs of researchers and research institutions
Who We Are
www.globus.org/genomics
Challenges in Sequencing Analysis
Sequencing Centers
Sequencing Centers
Data Movement and Access Challenges
Manual Data Analysis
PublicData
Storage
Local Cluster/CloudSeq
Center
Research Lab
• Data is distributed in different locations
• Research labs need access to the data for analysis
• Be able to Share data with other researchers/collaborators• Inefficient ways of data movement
• Data needs to be available on the local and Distributed Compute Resources
• Local Clusters, Cloud, Grid
How do we analyze this Sequence Data
Once we have the Sequence Data
Picard
GATK
Fastq Ref Genome
Alignment
Variant Calling
• Manually move the data to the Compute node
(Re)Run Script
Install
Modify
• Install all the tools required for the Analysis• BWA, Picard, GATK, Filtering Scripts, etc.
• Shell scripts to sequentially execute the tools• Manually modify the scripts for any change
• Error Prone, difficult to keep track, messy..• Difficult to maintain and transfer the knowledge
FTP, SCP, HTTP
SCPFT
P, SC
P, HTT
P
www.globus.org/genomics
Globus Genomics
Sequencing Centers
Sequencing Centers
PublicData
Storage
Local Cluster/CloudSeq
Center
Research Lab
Globus Provides a• High-performance • Fault-tolerant• Secure
file transfer Service between all data-endpoints
Data Management Data Analysis
Picard
GATK
Fastq Ref Genome
Alignment
Variant Calling
Galaxy Data Libraries
• Globus Integrated within Galaxy
• Web-based UI• Drag-Drop workflow
creations• Easily modify Workflows
with new tools
Globus Genomics on Amazon EC2
• Analytical tools are automatically run on the scalable compute resources when possible
Galaxy Based Workflow Management System
FTP, SCP, others
FTP, SCPSCP
Globus Genomics
FTP,
SCP,
HTTP
www.globus.org/genomics
Globus integrated with Galaxy – A flexible, scalable, simplified analysis platform
Accessibility• Unified Web-interface for obtaining genomic data and applying computational
tools to analyze the data• Easily integrate your own tools and scripts for analysis • Collection of tools (Tools Panel) that reflect good practices and community
insights• Access every step of analysis and intermediate results:
§ View, Download, Visualize, Reuse (History Panel)Reproducibility
• Track provenance and ensure repeatability of each analysis step: § input datasets, tools used, parameter values, and output datasets
• Intuitive Workflow Editor to create or modify complex workflows and use them as templates – Reusable and Reproducible
Transparency• Publish and share metadata, histories, and workflows at multiple levels• Store public and generated datasets as Data Libraries – e.g: hg19 Ref Genome• Shared datasets and workflows can be imported by other users for reuse
Publish
Templates
Data and Tools
Globus Integration• Access Globus Endpoints and transfer data from within Galaxy UI and into Galaxy workspace• Leverage local cluster or cloud based scalable computational resources for parallelizing the
tools
www.globus.org/genomics
Globus Overview
• No IT required– Software as a Service (SaaS)
• No client software installation• New features automatically available
– Consolidated support & troubleshooting– Works with existing GridFTP servers– Globus Connect solves “last mile problem”
• GridFTP-based– Open source and freely available– Provides a comprehensive security model– Defacto standard for data movement in large national
cyberinfrastructure projects
• >10,000 registered users, >28 PB moved• Recommended and used by DOE Facilities, NSF
Supercomputing centers, and many campuses
www.globus.org/genomics
• Workflows can be easily defined and automated with integrated Galaxy Platform capabilities
• Data movement is streamlined with integrated Globus file-transfer functionality
• Resources can be provisioned on-demand with Amazon Web Services cloud based infrastructure
Globus Genomics
www.globus.org/genomics
• Professionally managed and supported platform• Best practice pipelines• Enhanced workbench with breadth of analytic tools• Technical support and bioinformatics consulting• Access to pre-integrated end-points for reliable and high-
performance data transfer (e.g. Broad Institute, Perkin Elmer, university sequencing centers, etc.)
• Cost-effective solution with subscription-based pricing
Additional Capabilities
www.globus.org/genomics
• Worked closely with Andi, Charlie, Adam and Andy• Setup a data movement capability with Globus Transfer• Implemented an RNA-Seq pipeline using Globus
Genomics • Setup a scalable analysis instance utilizing Amazon Web
Services• Completed analysis runs on various small test data sets• Completed analysis runs on multiple 48 sample data
sets
RNA-Seq Pipeline – Proof of Concept
www.globus.org/genomics
Huntington’s Disease mRNA-Seq Study
• Huntington’s Disease (HD) mRNA-Seq Study– 48 samples: 21 HD and 27 controls– Total mRNA extracted from prefrontal cortex of postmortem human
brain
• Sequencing Dataset– Illumina TruSeq protocol (poly-A tail mRNA selection)– HiSeq 2000 @ Tufts sequencing center– 101 nucleotides, paired-end reads, unstranded, multiplexed 3/lane– Average of 83M reads per sample (55,810,684 – 167,044,880)– ~350 Gb of data
www.globus.org/genomics
mRNA-Seq Analysis WorkflowSingle Sample
www.globus.org/genomics
mRNA-Seq Analysis WorkflowAll Samples
www.globus.org/genomics
RNA-Seq Analysis Workflow – Globus Genomics (Stage 1- Alignment per sample)
www.globus.org/genomics
RNA-Seq Analysis Workflow – Globus Genomics (Stage 2 – Differential Expression)
www.globus.org/genomics
POC Results – Trial Test Data Set
www.globus.org/genomics
• 48 sample run on RNA-Seq pipeline took ~ 48 hours– Submitted two batches of 25 and 23 samples in parallel– Each batch completed in ~24 hours
• Average cost per RNA-Seq analysis ~ $9.50 / sample + storage
• Expectation is to be able to scale-out the analysis to handle 100+ data sets in parallel
POC Results – Final Test Data Set
www.globus.org/genomics
Globus Genomics Demonstration
www.globus.org/genomics
• Flexibility to add tools (choose amongst 500+ applications or add your own)
• Custom build pipelines from scratch, utilize existing best practice pipelines or refine pipelines as needed
• Run at very large scale• Streamline data movement between sequencing
centers and collaborators with high performance, secure transfers and sophisticated data sharing
• Move from a POC setup to a production environment
Post POC – Next Steps
www.globus.org/genomics
Security Considerations
Globus Genomics compliance with the NCBI Database of Genotypes and Phenotypes (dbGaP) security best practicesProtecting the Security of Controlled Data on Servers
Requirement: Servers must not be accessible directly from the internet and unnecessary services should be disabled.
Ø All Globus Genomics servers are protected by Amazon Security Groups and by stateful packet inspection firewalls. Only necessary services are allowed
Requirement: Keep systems up to date with security patches.
Ø All relevant security patches are applied as soon as they are available.
Requirement: dbGaP data on the systems must be secured from other users and if exported via file sharing, ensure limited access to remote systems.
Ø Globus Genomics and Globus Online provide sharing solutions that are secure and user controlled
www.globus.org/genomics
Security Considerations
Globus Genomics compliance with the NCBI Database of Genotypes and Phenotypes (dbGaP) security best practices
Requirement: If accessing system remotely, encrypted data access must be used.
Ø Globus Genomics uses HTTPS and GridFTP protocol with authentication and encryption when transferring the files
Requirement: Ensure that all users of this data have IT security training suitable for this data access and understand the restrictions and responsibilities involved in access to this data.
Ø Data access is strictly restricted to individual users and only users can share the data with other users. We provide detailed instructions to our users on data security and access control.
Requirement: If data is used on multiple systems, ensure that data access policies are retained throughout the processing of the data on all the other systems. If data is cached on local systems, directory protection must be kept, and data must be removed when processing is complete.
Ø The data sharing and access policies on Globus Genomics are retained across all the systems involved
www.globus.org/genomics
Subscription Pricing
Entry Level Basic Standard
Typical Workload (monthly)
~ 50 exomes ~ 10 whole genomes~ 100 RNA-seqs
~ 250 exomes ~ 50 whole genomes~ 500 RNA-seqs
~ 1000 exomes ~ 200 whole genomes~ 2000 RNA-seqs
Technical Support M-F, 9-5 CTBest Effort
M-F, 9-5 CT,2-business day response
M-F, 9-5 CT1-business day response
Support Channels Email Email + Phone Email + Phone
Support Contacts 1 2 5
Batch Submission Capability
No Yes Yes
Branding No No Yes
Monthly Pricing* $500 $1,500 $5,000
Annual Pricing* $5,000 $13,500 $50,000
* Does not include AWS costs
www.globus.org/genomics
Example Collaborations
Cox LabBackground: A computational lab focusing on identification and characterization of genetic variation influencing susceptibility to complex disorders.
Approach: Develop a consensus variant calling approach to improve quality and confidence in identified variants from multiple variant calling applications.
Results: Consensus caller generated a high quality list of variants (less than 0.01% mendel error rate) for 134 samples in 4 days.
Future Plans: Apply consensus caller to 13,000 exome samples from NDAR
www.globus.org/genomics
Example Collaborations
Dobyns Lab
Backround: Investigate the nature and causes of a wide range of human developmental brain disorders
Approach: Replaced manual analysis with Globus Genomics
Results: Achieved greater than 20X speed-up in analysis of exome data
Future Plans: Leverage scale-out capability of Globus Genomics on 150 exome data set and seek to achieve 50X speed-up in analysis
www.globus.org/genomics
Example Collaborations
Georgetown Medical CenterBackround: Innovation Center for Biomedical Informatics is an academic hub for innovative research in the field of biomedical informatics.
Approach: Augment current team and tools with a NGS analysis platform to support standard and best-practice pipelines while leveraging elastic cloud-based resources.
Results: Pilot effort is nearly complete – improved quality and performance results on whole genome, exome and RNA-Seq pipelines utilizing Globus Genomics
Future Plans: Provide Globus Genomics as a well-managed platform-as-a-service for ICBI collaborators and users
www.globus.org/genomics
• Successful Proof of Concept
• Look into ways where Globus can streamline data movement and aid in data management
– Access Globus Endpoints and data from within Galaxy UI and into Galaxy workspace– High-performance, reliable data transfer protocol optimized for high-bandwidth wide-
area networks
• Leverage the enhanced Galaxy based solution for NGS analysis needs– Collection of tools and workflows that reflect good practices and community insights– Intuitive Workflow Editor to create or modify complex workflows and use them as
templates – Reusable and Reproducible
• Utilize elastic cloud based computational resources to execute analysis at large scale
• Deliver an advanced, well managed, well supported genomics analysis platform to enable researchers to focus on their research and not IT
Summary
www.globus.org/genomics
Questions?