Overview
- Big Data
- Big Data in Genomics
- Enter: The Cloud
- Cloud Technologies: Hadoop/MapReduce
- Cloud Technologies: NoSQL
- Applications in Genomics
- Million Veterans Program
- Challenges and Lessons Learned
- Questions
Slide 3
Big Data
Big Data describes the realization of greater business intelligence by storing, processing, and analyzing data that was previously ignored or siloed due to the limitations of traditional data management techniques.
Three Vs of Big Data: Volume, Velocity, Variety
Slide 4
Big Data in Genomics
Hypothesis-driven vs. data-driven approaches: in the data-driven paradigm, the traditional cause-and-effect model recedes, and correlations mined from the data drive discovery.
Data analytics techniques:
- Hidden Markov Models
- Support Vector Machines
- Boltzmann Chains
The Big Data analytics techniques used to mine Twitter data for consumer sentiment can also be used to predict flu outbreaks or identify genes associated with cancer.
Slide 5
If I had a peso
This document contains Booz Allen Hamilton Inc. proprietary and confidential business information.
Slide 6
Enter: The Cloud
Definition: deploying groups of remote servers and software networks that allow centralized data storage and online access to computer services or resources.
Benefits of the Cloud:
- Resource pooling
- Economies of scale
- Rapid elasticity and scaling
- On-demand storage and compute
- Co-locating data and analytics
Service models and deployment models
Slide 7
Big Data and analytics: a match made in the Cloud
- Software as a Service: Galaxy, GATK, HBase
- Platform as a Service: Hadoop/MapReduce, Spark, MapR
- Infrastructure as a Service: Amazon Web Services, Microsoft Azure, Rackspace
Cloud service models deliver infrastructure, platforms, and software tools as services instead of commodities.
Harnessing the power of Big Data and Cloud Computing entails bringing data and analytics together.
Hadoop/MapReduce is the most widely used platform for Big Data analytics in the Cloud.
Slide 8
Google's Solution to the Big Data Problem
Slide 9
Harnessing the Power of the Cloud: Hadoop/MapReduce
Hadoop/MapReduce are frameworks for automatically scaling storage and compute: data and computations are spread over thousands of computers. HDFS handles storage, while MapReduce handles compute.
MapReduce is Google's framework for large data computations; Hadoop, developed at Yahoo, is an open-source implementation. GATK is an alternative implementation designed specifically for NGS data.
Benefits:
- Scalable, efficient, reliable
- Easy to program
- Runs on commodity computers
- Fast for very large jobs
- Fault tolerant
Challenges:
- Redesigning / retooling applications
- Data storage efficiency
- Threshold to reap processing benefits
- Slow for small jobs
Slide 10
What is HDFS?
The Hadoop Distributed File System breaks data down into chunks and distributes them across a cluster of machines.
Slide 11
How does HDFS work?
NameNode: the master node determines how chunks of data are distributed across DataNodes.
DataNodes: store chunks of data and replicate them across other DataNodes.
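The chunk-and-replicate idea can be sketched in miniature in Python. The block size, node names, and round-robin placement below are illustrative only; real HDFS uses much larger blocks (128 MB by default), a default replication factor of 3, and rack-aware placement.

```python
import itertools

BLOCK_SIZE = 4   # bytes per chunk for this toy example; HDFS defaults to 128 MB
REPLICATION = 3  # copies of each chunk, matching HDFS's default

def split_into_chunks(data: bytes, block_size: int = BLOCK_SIZE):
    """Break a file into fixed-size chunks, as HDFS does on ingest."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_chunks(chunks, datanodes, replication: int = REPLICATION):
    """NameNode-style logic: assign each chunk to `replication` DataNodes
    (simple round robin; nodes are distinct while replication <= node count)."""
    placement = {}
    node_cycle = itertools.cycle(range(len(datanodes)))
    for idx, _chunk in enumerate(chunks):
        placement[idx] = [datanodes[next(node_cycle)] for _ in range(replication)]
    return placement

chunks = split_into_chunks(b"ACGTACGTACGT")
placement = place_chunks(chunks, ["dn1", "dn2", "dn3", "dn4"])
# Each of the 3 chunks now lives on 3 distinct DataNodes
```

If any single DataNode fails, every chunk it held still exists on two other nodes, which is the source of HDFS's fault tolerance.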
Slide 12
What is MapReduce?
MapReduce is a programming model for processing large data sets with parallel, distributed algorithms on a cluster.
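The model can be sketched with word count, the canonical MapReduce example from Google's original paper. This is a single-process illustration of the map/shuffle/reduce phases, not a distributed implementation:

```python
from collections import defaultdict

def map_phase(document: str):
    # Mapper: emit a (word, 1) pair for every word in the input split
    for word in document.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group all intermediate values by key
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Reducer: aggregate the values collected for each key
    return {key: sum(values) for key, values in grouped.items()}

# Each "split" would be a separate HDFS block processed by a separate mapper
splits = ["big data big data", "data in genomics"]
pairs = [pair for split in splits for pair in map_phase(split)]
counts = reduce_phase(shuffle(pairs))
# counts["data"] == 3, counts["big"] == 2
```

Because mappers are independent and reducers only see grouped keys, the framework can run thousands of each in parallel without the programmer writing any coordination code.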
Slide 13
How does MapReduce work? (part 1)
Slide 14
How does MapReduce work? (part 2)
Slide 15
How does MapReduce work? (part 3)
Slide 16
Harnessing the Power of the Cloud: NoSQL
NoSQL, or "Not Only SQL," is a class of databases modeled in means other than the tabular format of relational databases.
- Column-based instead of row-based
- Efficient scale-out architecture
- Flexible schema suited to object-oriented programming
Four basic types: document databases, graph stores, key-value stores, wide-column stores
A schema-less architecture pushes database relationships out to the software level.
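The flexible-schema point can be illustrated with a toy key-value store. The patient IDs and fields below are hypothetical; the point is that two records under the same keyspace can carry entirely different fields, with the "schema" living in application code rather than in the database:

```python
# Minimal key-value store sketch: one opaque value per key
kv_store: dict[str, dict] = {}

def put(key: str, value: dict) -> None:
    kv_store[key] = value

def get(key: str) -> dict:
    return kv_store[key]

# Two records with different fields -- no ALTER TABLE, no NULL columns.
# In a relational schema, adding "deployment" would change every row.
put("patient:PT-001", {"condition": "Diabetes Type II", "snps": ["rs4362914"]})
put("patient:PT-002", {"deployment": "Vietnam War"})
```

The trade-off named on the slide is visible here: nothing in the store enforces relationships between records, so joins and integrity checks move up into the software layer.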
Slide 17
Hadoop applications in genomics
Short Read Mapping: a typical query/subject example
- Query: read libraries split into smaller chunks by MapReduce
- Subject: genome split into blocks by HDFS
Genome Assembly: De Bruijn Graphs
Genome-Wide Association Studies: NoSQL SNP indexing
Genomic Sequence Manager
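The De Bruijn graph construction used in assembly is easy to sketch: nodes are (k-1)-mers and edges are k-mers, so overlapping reads naturally merge into shared paths. The reads and k value below are toy inputs; real assemblers add error correction and graph simplification on top of this core step:

```python
from collections import defaultdict

def de_bruijn_graph(reads, k: int):
    """Build a De Bruijn graph from reads: for each k-mer, add an edge
    from its (k-1)-length prefix to its (k-1)-length suffix."""
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])  # prefix -> suffix edge
    return graph

# Two overlapping reads; their shared k-mers (CGT, GTG) produce shared edges
graph = de_bruijn_graph(["ACGTG", "CGTGA"], k=3)
```

Because each read contributes edges independently, the k-mer counting step parallelizes cleanly across mappers, which is why this formulation suits Hadoop-style assembly pipelines.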
Slide 18
The Million Veterans Program (MVP)
A national voluntary research program funded by the Department of Veterans Affairs Office of Research & Development. The goal is to study how genes and environmental factors affect veterans' health.
Building one of the world's largest medical databases, containing biological samples and health information from one million veterans:
- Blood samples for genomic profiling: Single Nucleotide Polymorphism (SNP) Array Analysis and Next Generation Sequencing (NGS) Analysis
- Personal health surveys and military deployment history
- Electronic health records
Genomic Informatics for Integrative Science (GenISIS) comprises the hardware, platform, and tools to manage, store, and analyze MVP data.
Current recruitment has passed 400K samples, with a goal of 1 million samples in 5 years. Total data volume is expected to exceed 10 petabytes in 5 years.
Slide 19
Overview
Slide 20
MVP Data Warehouse
- Metadata extracted from vendor-generated genomic data (SNP array genotyping, whole genome sequencing, and whole exome sequencing) will be cataloged in a Metadata Database.
- Genomic data will be linked with corresponding de-identified clinical and survey data by an Honest Broker system.
- A Terminology and Annotation Server will allow researchers to incorporate a wide array of genomic and clinical annotations to integrate genomic, survey, and clinical data.
- The Query Mart will enable researchers to build cohorts and subset data using clinical and genomic information, then export to the Data Mart for further analysis.
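The Honest Broker linkage can be illustrated with a keyed-hash pseudonymization sketch. MVP's actual implementation is not described here; this assumes one common approach, in which the broker holds a secret key and issues deterministic study IDs, so genomic and clinical records can be joined without either side ever seeing the real patient identifier:

```python
import hashlib
import hmac

# Hypothetical secret held only by the Honest Broker
SECRET = b"broker-only-secret"

def study_id(patient_id: str) -> str:
    """Deterministic pseudonym: the same patient always maps to the same
    study ID, but the mapping cannot be reversed without the broker's key."""
    digest = hmac.new(SECRET, patient_id.encode(), hashlib.sha256).hexdigest()
    return "STUDY-" + digest[:12]

# Genomic and clinical records are keyed by pseudonym, never by real ID
genomic = {study_id("PT-00589A"): {"snp": "rs4362914", "genotype": "T"}}
clinical = {study_id("PT-00589A"): {"condition": "Diabetes Type II"}}
```

An HMAC (rather than a plain hash) matters here: without the secret key, an attacker cannot recompute pseudonyms from guessed patient IDs.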
Slide 21
Cloud Broker
- The Cloud Portal manages access control for different types of data and users.
- The Cloud Engine co-locates data with analytical tools.
- An Intelligent Orchestration Tool maps data and processes to storage and compute clusters to efficiently manage resources.
- Geographically distributed computational resources are pooled through a virtual private cloud.
Slide 22
Data Lake: Key-Value Data Store
[Diagram: key-value tuples linking entities such as SNP rs4362914, gene TCF7L2, sample SHIP000675221, patient PT-00589A, condition Diabetes Type II, genome location Chr7:4344859978, genotype T, survey S-2014-06-18-A3288, and deployment Vietnam War, partitioned into Tier 1, Tier 2, and Tier 3 access-control levels.]
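One way the linked tuples in this diagram could be modeled is as subject/predicate/object facts, each tagged with the access tier required to read it. The tier semantics below (higher tier = broader access, with patient-level links restricted to Tier 3) are an assumption for illustration, not the documented GenISIS design:

```python
# Each fact: (subject, predicate, object, minimum tier required to read it)
facts = [
    ("rs4362914", "located_in_gene", "TCF7L2", 1),
    ("rs4362914", "associated_condition", "Diabetes Type II", 2),
    ("PT-00589A", "has_condition", "Diabetes Type II", 3),
    ("PT-00589A", "deployment", "Vietnam War", 3),
]

def query(subject: str, user_tier: int):
    """Return only the facts this user's tier is allowed to see."""
    return [(pred, obj) for subj, pred, obj, tier in facts
            if subj == subject and tier <= user_tier]

# A Tier-1 user sees population-level annotation only, never patient links
```

Pushing the tier check into the query layer mirrors the slide's point that schema-less stores move relationships, and here access control, into the software level.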
Slide 23
Challenges and Lessons Learned
Petabyte-scale genomics data poses storage, transfer, and processing challenges. Cloud computing offers effective solutions for data storage and analytics:
- Next-generation algorithms with built-in scalability features (e.g., Apache Hadoop/MapReduce)
- Co-locating data and analytical tools to reduce data replication and transfer bottlenecks
Genomic data is PHI and should be protected using Data-in-Motion and Data-at-Rest best practices:
- Encryption and decryption of genomic datasets constitute a significant fraction of data transfer and analysis time (YMMV)
- Efficient architectural design of storage and processing systems diminishes security risks and encryption/decryption bottlenecks
Data integration and metadata annotation are critical in deriving knowledge from data:
- The lack of unified standard formats in genomics necessitates substantial effort in highly specialized analytical pipelines
- Data integration can be powered by annotation using multiple ontologies
- Data annotation upon ingest is crucial in a rapidly changing genomic sequencing landscape