Overview
- Big Data
- Big Data in Genomics
- Enter: The Cloud
- Cloud Technologies: Hadoop/MapReduce
- Cloud Technologies: NoSQL
- Applications in Genomics
- Million Veterans Program
- Challenges and Lessons Learned
- Questions
Slide 3
Big Data
Big Data describes the realization of greater business intelligence by storing, processing, and analyzing data that was previously ignored or siloed due to the limitations of traditional data management techniques.
Three Vs of Big Data: Volume, Velocity, Variety
Slide 4
Big Data in Genomics
Hypothesis-driven vs. data-driven approaches: in the data-driven paradigm, the traditional cause-and-effect model recedes, and correlations mined from the data drive discovery.
Data analytics techniques:
- Hidden Markov Models
- Support Vector Machines
- Boltzmann Chains
The Big Data analytics techniques used to mine Twitter data for consumer sentiment can also be used to predict flu outbreaks or identify genes associated with cancer.
Slide 5
If I had a peso
This document contains Booz Allen Hamilton Inc. proprietary and confidential business information.
Slide 6
Enter: The Cloud
Definition: deploying groups of remote servers and software networks that allow centralized data storage and online access to computer services or resources.
Benefits of the Cloud:
- Resource pooling
- Economies of scale
- Rapid elasticity and scaling
- On-demand storage and compute
- Co-locating data and analytics
Service models and deployment models
Slide 7
Big Data and analytics: a match made in the Cloud
- Software as a Service: Galaxy, GATK, HBase
- Platform as a Service: Hadoop/MapReduce, Spark, MapR
- Infrastructure as a Service: Amazon Web Services, Microsoft Azure, Rackspace
Cloud service models deliver infrastructure, platforms, and software tools as services instead of commodities.
Harnessing the power of Big Data and Cloud Computing entails bringing data and analytics together.
Hadoop/MapReduce is the most widely used platform for Big Data analytics in the Cloud.
Slide 8
Google's Solution to the Big Data Problem
Slide 9
Harnessing the Power of the Cloud: Hadoop/MapReduce
Hadoop/MapReduce are frameworks for automatically scaling storage and compute: data and computations are spread over thousands of computers. HDFS handles storage, while MapReduce handles compute.
MapReduce is Google's framework for large data computations; Hadoop, developed at Yahoo, is an open-source implementation. GATK is an alternative implementation designed specifically for NGS data.
Benefits:
- Scalable, efficient, reliable
- Easy to program
- Runs on commodity computers
- Fast for very large jobs
- Fault tolerant
Challenges:
- Redesigning / retooling applications
- Data storage efficiency
- Threshold to reap processing benefits
- Slow for small jobs
Slide 10
What is HDFS?
The Hadoop Distributed File System breaks data down into chunks and distributes them across a cluster of machines.
Slide 11
How does HDFS work?
NameNode: the master node determines how chunks of data are distributed across DataNodes.
DataNodes: store chunks of data and replicate them across other DataNodes.
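The chunk-and-replicate idea can be sketched in miniature in Python. The block size, node names, and round-robin placement below are illustrative only; real HDFS uses much larger blocks (128 MB by default), a default replication factor of 3, and rack-aware placement.

```python
import itertools

BLOCK_SIZE = 4   # bytes per chunk for this toy example; HDFS defaults to 128 MB
REPLICATION = 3  # copies of each chunk, matching HDFS's default

def split_into_chunks(data: bytes, block_size: int = BLOCK_SIZE):
    """Break a file into fixed-size chunks, as HDFS does on ingest."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_chunks(chunks, datanodes, replication: int = REPLICATION):
    """NameNode-style logic: assign each chunk to `replication` DataNodes
    (simple round robin; nodes are distinct while replication <= node count)."""
    placement = {}
    node_cycle = itertools.cycle(range(len(datanodes)))
    for idx, _chunk in enumerate(chunks):
        placement[idx] = [datanodes[next(node_cycle)] for _ in range(replication)]
    return placement

chunks = split_into_chunks(b"ACGTACGTACGT")
placement = place_chunks(chunks, ["dn1", "dn2", "dn3", "dn4"])
# Each of the 3 chunks now lives on 3 distinct DataNodes
```

If any single DataNode fails, every chunk it held still exists on two other nodes, which is the source of HDFS's fault tolerance.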
Slide 12
What is MapReduce?
MapReduce is a programming model for processing large data sets with parallel, distributed algorithms on a cluster.
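The model can be sketched with word count, the canonical MapReduce example from Google's original paper. This is a single-process illustration of the map/shuffle/reduce phases, not a distributed implementation:

```python
from collections import defaultdict

def map_phase(document: str):
    # Mapper: emit a (word, 1) pair for every word in the input split
    for word in document.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group all intermediate values by key
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Reducer: aggregate the values collected for each key
    return {key: sum(values) for key, values in grouped.items()}

# Each "split" would be a separate HDFS block processed by a separate mapper
splits = ["big data big data", "data in genomics"]
pairs = [pair for split in splits for pair in map_phase(split)]
counts = reduce_phase(shuffle(pairs))
# counts["data"] == 3, counts["big"] == 2
```

Because mappers are independent and reducers only see grouped keys, the framework can run thousands of each in parallel without the programmer writing any coordination code.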
Slide 13
How does MapReduce work? (part 1)
Slide 14
How does MapReduce work? (part 2)
Slide 15
How does MapReduce work? (part 3)
Slide 16
Harnessing the Power of the Cloud: NoSQL
NoSQL, or "Not Only SQL," is a class of databases modeled in means other than the tabular format of relational databases.
- Column-based instead of row-based
- Efficient scale-out architecture
- Flexible schema suited to object-oriented programming
Four basic types: document databases, graph stores, key-value stores, wide-column stores
A schema-less architecture pushes database relationships out to the software level.
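The flexible-schema point can be illustrated with a toy key-value store. The patient IDs and fields below are hypothetical; the point is that two records under the same keyspace can carry entirely different fields, with the "schema" living in application code rather than in the database:

```python
# Minimal key-value store sketch: one opaque value per key
kv_store: dict[str, dict] = {}

def put(key: str, value: dict) -> None:
    kv_store[key] = value

def get(key: str) -> dict:
    return kv_store[key]

# Two records with different fields -- no ALTER TABLE, no NULL columns.
# In a relational schema, adding "deployment" would change every row.
put("patient:PT-001", {"condition": "Diabetes Type II", "snps": ["rs4362914"]})
put("patient:PT-002", {"deployment": "Vietnam War"})
```

The trade-off named on the slide is visible here: nothing in the store enforces relationships between records, so joins and integrity checks move up into the software layer.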
Slide 17
Hadoop applications in genomics
Short Read Mapping: a typical query/subject example
- Query: read libraries split into smaller chunks by MapReduce
- Subject: genome split into blocks by HDFS
Genome Assembly: De Bruijn Graphs
Genome-Wide Association Studies: NoSQL SNP indexing
Genomic Sequence Manager
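The De Bruijn graph construction used in assembly is easy to sketch: nodes are (k-1)-mers and edges are k-mers, so overlapping reads naturally merge into shared paths. The reads and k value below are toy inputs; real assemblers add error correction and graph simplification on top of this core step:

```python
from collections import defaultdict

def de_bruijn_graph(reads, k: int):
    """Build a De Bruijn graph from reads: for each k-mer, add an edge
    from its (k-1)-length prefix to its (k-1)-length suffix."""
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])  # prefix -> suffix edge
    return graph

# Two overlapping reads; their shared k-mers (CGT, GTG) produce shared edges
graph = de_bruijn_graph(["ACGTG", "CGTGA"], k=3)
```

Because each read contributes edges independently, the k-mer counting step parallelizes cleanly across mappers, which is why this formulation suits Hadoop-style assembly pipelines.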
Slide 18
The Million Veterans Program (MVP)
A national voluntary research program funded by the Department of Veterans Affairs Office of Research & Development. The goal is to study how genes and environmental factors affect veterans' health.
Building one of the world's largest medical databases, containing biological samples and health information from one million veterans:
- Blood samples for genomic profiling: Single Nucleotide Polymorphism (SNP) Array Analysis and Next Generation Sequencing (NGS) Analysis
- Personal health surveys and military deployment history
- Electronic health records
Genomic Informatics for Integrative Science (GenISIS) comprises the hardware, platform, and tools to manage, store, and analyze MVP data.
Current recruitment has passed 400K samples, with a goal of 1 million samples in 5 years. Total data volume is expected to exceed 10 petabytes in 5 years.
Slide 19
Overview
Slide 20
MVP Data Warehouse
- Metadata extracted from vendor-generated genomic data (SNP array genotyping, whole genome sequencing, and whole exome sequencing) will be cataloged in a Metadata Database.
- Genomic data will be linked with corresponding de-identified clinical and survey data by an Honest Broker system.
- A Terminology and Annotation Server will allow researchers to incorporate a wide array of genomic and clinical annotations to integrate genomic, survey, and clinical data.
- The Query Mart will enable researchers to build cohorts and subset data using clinical and genomic information, then export to the Data Mart for further analysis.
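The Honest Broker linkage can be illustrated with a keyed-hash pseudonymization sketch. MVP's actual implementation is not described here; this assumes one common approach, in which the broker holds a secret key and issues deterministic study IDs, so genomic and clinical records can be joined without either side ever seeing the real patient identifier:

```python
import hashlib
import hmac

# Hypothetical secret held only by the Honest Broker
SECRET = b"broker-only-secret"

def study_id(patient_id: str) -> str:
    """Deterministic pseudonym: the same patient always maps to the same
    study ID, but the mapping cannot be reversed without the broker's key."""
    digest = hmac.new(SECRET, patient_id.encode(), hashlib.sha256).hexdigest()
    return "STUDY-" + digest[:12]

# Genomic and clinical records are keyed by pseudonym, never by real ID
genomic = {study_id("PT-00589A"): {"snp": "rs4362914", "genotype": "T"}}
clinical = {study_id("PT-00589A"): {"condition": "Diabetes Type II"}}
```

An HMAC (rather than a plain hash) matters here: without the secret key, an attacker cannot recompute pseudonyms from guessed patient IDs.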
Slide 21
Cloud Broker
- The Cloud Portal manages access control for different types of data and users.
- The Cloud Engine co-locates data with analytical tools.
- An Intelligent Orchestration Tool maps data and processes to storage and compute clusters to efficiently manage resources.
- Geographically distributed computational resources are pooled through a virtual private cloud.
Slide 22
Data Lake: Key-Value Data Store
[Diagram: key-value tuples linking entities such as SNP rs4362914, gene TCF7L2, sample SHIP000675221, patient PT-00589A, condition Diabetes Type II, genome location Chr7:4344859978, genotype T, survey S-2014-06-18-A3288, and deployment Vietnam War, partitioned into Tier 1, Tier 2, and Tier 3 access-control levels.]
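One way the linked tuples in this diagram could be modeled is as subject/predicate/object facts, each tagged with the access tier required to read it. The tier semantics below (higher tier = broader access, with patient-level links restricted to Tier 3) are an assumption for illustration, not the documented GenISIS design:

```python
# Each fact: (subject, predicate, object, minimum tier required to read it)
facts = [
    ("rs4362914", "located_in_gene", "TCF7L2", 1),
    ("rs4362914", "associated_condition", "Diabetes Type II", 2),
    ("PT-00589A", "has_condition", "Diabetes Type II", 3),
    ("PT-00589A", "deployment", "Vietnam War", 3),
]

def query(subject: str, user_tier: int):
    """Return only the facts this user's tier is allowed to see."""
    return [(pred, obj) for subj, pred, obj, tier in facts
            if subj == subject and tier <= user_tier]

# A Tier-1 user sees population-level annotation only, never patient links
```

Pushing the tier check into the query layer mirrors the slide's point that schema-less stores move relationships, and here access control, into the software level.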
Slide 23
Challenges and Lessons Learned
Petabyte-scale genomics data poses storage, transfer, and processing challenges. Cloud computing offers effective solutions for data storage and analytics:
- Next-generation algorithms with built-in scalability features (e.g., Apache Hadoop/MapReduce)
- Co-locating data and analytical tools to reduce data replication and transfer bottlenecks
Genomic data is PHI and should be protected using Data-in-Motion and Data-at-Rest best practices:
- Encryption and decryption of genomic datasets constitute a significant fraction of data transfer and analysis time (YMMV)
- Efficient architectural design of storage and processing systems diminishes security risks and encryption/decryption bottlenecks
Data integration and metadata annotation are critical in deriving knowledge from data:
- The lack of unified standard formats in genomics necessitates substantial effort in highly specialized analytical pipelines
- Data integration can be powered by annotation using multiple ontologies
- Data annotation upon ingest is crucial in a rapidly changing genomic sequencing landscape