
HADOOP IN THE LIFE SCIENCES: An Introduction

White Paper

Abstract

This introductory white paper reviews the Apache Hadoop™ technology, its components – MapReduce and the Hadoop Distributed File System (HDFS) – and its adoption in the life sciences, with an example in genomics data analysis.

March 2012


Copyright © 2012 EMC Corporation. All Rights Reserved. EMC believes the information in this publication is accurate as of its publication date. The information is subject to change without notice. The information in this publication is provided “as is.” EMC Corporation makes no representations or warranties of any kind with respect to the information in this publication, and specifically disclaims implied warranties of merchantability or fitness for a particular purpose. Use, copying, and distribution of any EMC software described in this publication requires an applicable software license. Part number h10574


Table of Contents

Audience

Executive Summary

Hadoop: an Introduction

Genomics example: CrossBow

Enterprise-Class Hadoop on EMC Isilon

Conclusion

References

Audience

This white paper introduces Hadoop™, a new data processing and analysis paradigm, in the context of its usage in the life sciences, specifically genomics sequencing. It is intended for readers with a basic knowledge of storage and computing technology and a rudimentary understanding of DNA sequencing and the bioinformatics analysis associated with it.


Executive Summary

Life Sciences data will soon reach the ExaByte (10^18 bytes, EB) scale. This is "Big Data": as a reference point, all words ever spoken by all human beings, when transcribed, amount to about 5 EB of data. A recent article titled "Will Computers Crash Genomics?" [1] points to exponential growth of the total genomics sequencing market capacity, as outlined in Figure 1 below: 10 Tera base-pairs (10^12 bp) per day, with an astounding 5x year-on-year growth rate (500%). The human genome is approximately 3 billion base pairs long, a base pair (bp) comprising DNA molecules in G-C or A-T pairs.

Figure 1: Genomics Growth

Each base pair represents a total of about 100 bytes (of raw, analyzed, and interpreted data). The genomics market capacity in 2010 storage terms (from Figure 1) was therefore about 200 PetaBytes (PB), growing to about 1 ExaByte (EB) by late 2012. This capacity is overwhelming the technologies attempting to handle the deluge of Big Data in the life sciences, and proteomics (the study of proteins) and imaging data are in the early stages of the same exponential rise. It is not just the data volume, but also its velocity and variability that make this a challenge requiring "scale-out" technologies that grow simply and painlessly as data center and business needs grow. Within the past year, one computing and storage framework has matured into a contender to handle this tsunami of Big Data: Hadoop™.
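As a back-of-envelope check of these numbers (a reconstructed calculation: the roughly 2 × 10^15 stored base pairs implied for 2010 is read back from the 200 PB figure, not stated directly in the text):

$$
2 \times 10^{15}\,\mathrm{bp} \times 100\,\mathrm{bytes/bp} = 2 \times 10^{17}\,\mathrm{bytes} = 200\,\mathrm{PB}
$$

One year of the 5x growth rate then takes this to $200\,\mathrm{PB} \times 5 = 1{,}000\,\mathrm{PB} = 1\,\mathrm{EB}$, consistent with the ExaByte-scale projection above.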

Life Sciences workflows require a High Performance Computing (HPC) infrastructure to process and analyze the data to determine the variations in the genome, and a proper scale of storage to retain this data. With Next Generation (genome) Sequencing (NGS) workflows generating up to 2 TeraBytes (TB) of data per run per week per sequencer – not including the raw images – the need for scale-out storage that integrates easily with HPC is a "line item requirement". EMC Isilon has provided scale-out storage for the workflows of all the DNA sequencer instrument manufacturers in the market today, at more than 150 customers. Since 2008, the EMC Isilon OneFS storage platform has built a Life Sciences installed base of more than 65 PetaBytes (PB).


Because genomics involves very large, semi-structured, file-based data and is modeled on post-process streaming data access and I/O patterns that can be parallelized, it is ideally suited to Hadoop. Hadoop consists of two main components: a file system and a compute framework – the Hadoop Distributed File System (HDFS) and MapReduce, respectively. The Hadoop ecosystem also includes many open source tools, as shown in Figure 2 below:

Figure 2: Hadoop Components

To make Hadoop storage scale out and become truly distributed, the EMC Isilon OneFS™ file system supports connectivity to the Hadoop Distributed File System (HDFS) just like any other shared file system protocol: NFS, CIFS, or SMB [3]. This allows co-location of the data with its compute nodes, using the standard higher-level Java application programming interface (API) to build MapReduce "jobs".

Hadoop: an Introduction

Hadoop was created by Doug Cutting of the Apache Lucene project [4], initially as the Nutch Distributed File System (NDFS), inspired in 2004 by Google's published work on its distributed file system infrastructure and the MapReduce [5] application layer. Hadoop is now an Apache™ Software Foundation project comprising a MapReduce layer for data analysis and a Hadoop Distributed File System (HDFS) layer, written in the Java programming language, to distribute and scale the MapReduce data.

The Hadoop MapReduce framework runs on the compute cluster using the data stored in HDFS. MapReduce "jobs" provide key/value-based processing in a highly parallelized fashion. Since the data is distributed over the cluster, a MapReduce job can be split up into many parallel processes running over the data stored on the cluster. The Map parts of MapReduce run only on the data they can see – that is, the data blocks on the particular machine each is running on – and the Reduce brings together the output of the Maps. The result is a system that provides highly parallel batch-processing capability. The system scales well: adding more hardware increases its storage capability or decreases the time a MapReduce job takes to run.

The partitioning of the storage and compute framework into master and worker node types is outlined in Figure 3 below:

Figure 3: Hadoop Cluster

Hadoop is a Write Once Read Many (WORM) system with no random writes. This makes Hadoop faster than HPC and storage integrated separately. The life sciences have long been at the forefront of the technology adoption curve: one of the earliest use cases of the Sun GridEngine [6] HPC was the BLAST [16] DNA sequence comparison search.

Standard Hadoop interfaces are available via Java, C, FUSE, and WebDAV [7]. The R (statistical language) Hadoop interface, RHIPE [8], is also popular in the life sciences community.

The HDFS layer has a "Name Node" as its controller, which provides "data locality" by tracking where data blocks reside, and uses a "shared nothing" architecture – a scheme of distributed, independent nodes [7].

From a platform perspective, the OneFS HDFS interface is compatible with Apache Hadoop, EMC GreenPlum [3], and Cloudera. In a traditional Hadoop implementation, the HDFS Name Node is a single point of failure, since it is the sole keeper of all the metadata for all the data in the file system; the OneFS HDFS interface resolves this by distributing the Name Node data [3]. HDFS also creates three replicas of each block for redundancy – OneFS drastically reduces the need for this 3x copy.
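For context, block replication in a stock Apache Hadoop deployment is exposed through the dfs.replication configuration property. A minimal sketch of reading it via the standard Hadoop Java API (the class name is illustrative; 3 is HDFS's usual default):

```java
import org.apache.hadoop.conf.Configuration;

public class ReplicationCheck {
    public static void main(String[] args) {
        // Load the Hadoop configuration resources found on the classpath.
        Configuration conf = new Configuration();
        // dfs.replication controls how many copies HDFS keeps of each
        // data block; it defaults to 3 -- the "3x copy" described above.
        int replication = conf.getInt("dfs.replication", 3);
        System.out.println("HDFS block replication factor: " + replication);
    }
}
```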

A good example of the MapReduce "key-value" pair process – counting the occurrences of specific words across documents [9] – is shown in Figure 4 below:


Figure 4: Hadoop Example – word count across documents
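A minimal sketch of this word count in the standard Hadoop Java MapReduce API (the canonical example, with the job-driver boilerplate omitted; class names are illustrative):

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map: for every word seen in an input line, emit the pair (word, 1).
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: the framework has already grouped and sorted the Map output
    // by key, so each call receives one word and all of its 1s to sum.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result); // emit (word, total count)
        }
    }
}
```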

Hadoop is not suited to low-latency, "in-process" use cases such as real-time, spectral, or video analysis, or to large numbers of small files (<8 KB). When small files must be used, the Hadoop Archive (HAR) can bundle them together for processing.

Life sciences organizations have been among Hadoop's earliest adopters. Following Hadoop's establishment as a top-level Apache project in January 2008 [10], the first large-scale MapReduce project was initiated by the Broad Institute, resulting in the comprehensive Genome Analysis Toolkit (GATK) [11]. The Hadoop "CrossBow" project [12] from Johns Hopkins University came soon after. Other projects are Cloud-based, including CloudBurst, Contrail, Myrna, and CloudBLAST [13]. An interesting implementation is the NERSC (Department of Energy) Flash-based Hadoop cluster within the Magellan Science Cloud [14].


Genomics example: CrossBow

The Hadoop "word count across documents" example in Figure 4 can be extended to DNA sequencing: counting single-base changes across millions of short DNA fragments and across hundreds of samples.

A Single Nucleotide Polymorphism (SNP) occurs when a single nucleotide (A, T, C, or G) varies in the DNA sequence between members of the same biological species. Next Generation Sequencers (NGS) like the Illumina® HiSeq can produce on the order of 200 Giga base pairs of data in a single one-week run for a 60x human genome "coverage," meaning that each base is present in an average of 60 reads. The larger the coverage, the more statistically significant the result. Analyzing this data requires specialized software algorithms called "short read aligners."
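The quoted coverage follows directly from the run yield and the genome length (a worked check; the ~3.2 Giga base-pair genome length used here is the usual approximation):

$$
\mathrm{coverage} = \frac{\mathrm{total\ bases\ sequenced}}{\mathrm{genome\ length}} \approx \frac{200 \times 10^{9}\,\mathrm{bp}}{3.2 \times 10^{9}\,\mathrm{bp}} \approx 60\times
$$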

CrossBow [12] is a combination of several algorithms that provide short read alignment and SNP calling, which are common tasks in NGS analysis. Figure 5 outlines the steps necessary to process genome data to look for SNPs.

The Map-Sort-Reduce process is ideally suited to the Hadoop framework; the cluster shown in Figure 5 is a traditional N-node Hadoop cluster. (A minimal code sketch of the pipeline follows the three numbered steps below.)

1. The Map step is the short read alignment, performed by Bowtie, an aligner based on the Burrows-Wheeler Transform (BWT). Multiple instances of Bowtie run in parallel in Hadoop. The input tuples (ordered lists of elements) are the sequence reads, and the output tuples are the alignments of the short reads.

2. The Sort step apportions the alignments according to a primary key (the genome partition) and sorts them on a secondary key (the offset within that partition). The data here are the sorted alignments.

3. The Reduce step calls SNPs for each reference genome partition. Many parallel instances of the SOAPsnp algorithm (Short Oligonucleotide Analysis Package for SNP) run in the Hadoop cluster. The input tuples are the sorted alignments for a partition, and the output tuples are SNP calls.

Results are stored via HDFS and then archived in SOAPsnp format.

Figure 5: CrossBow example – SNP calls across DNA fragments
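A minimal sketch of this pipeline as Hadoop Mapper/Reducer skeletons, in the same Java API as the word count above. This is a hypothetical illustration only: the Alignment class and the align() and callSnps() helpers are stand-ins for Bowtie and SOAPsnp, which CrossBow actually invokes as external programs, and the real implementation performs a secondary sort on the offset.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class CrossbowSketch {

    // Map: align one short read, keyed by genome partition so that the
    // shuffle/sort phase groups alignments partition by partition.
    public static class AlignMapper extends Mapper<Object, Text, Text, Text> {
        @Override
        public void map(Object key, Text read, Context context)
                throws IOException, InterruptedException {
            Alignment aln = align(read.toString()); // stand-in for Bowtie (BWT)
            context.write(new Text(aln.partition),
                          new Text(aln.offset + "\t" + aln.bases));
        }
    }

    // Reduce: call SNPs over all alignments of one genome partition.
    public static class SnpReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        public void reduce(Text partition, Iterable<Text> alignments, Context context)
                throws IOException, InterruptedException {
            // Copy values out: Hadoop reuses the Text object per iteration.
            List<String> sorted = new ArrayList<String>();
            for (Text a : alignments) {
                sorted.add(a.toString());
            }
            for (String snp : callSnps(sorted)) { // stand-in for SOAPsnp
                context.write(partition, new Text(snp));
            }
        }
    }

    // --- Hypothetical helpers; not part of Hadoop or CrossBow ---
    static class Alignment {
        String partition = "chr1-part0"; // illustrative partition id
        long offset = 0L;
        String bases = "";
    }

    static Alignment align(String read) {
        return new Alignment(); // a real mapper would run the aligner here
    }

    static List<String> callSnps(List<String> alignments) {
        return new ArrayList<String>(); // a real reducer would run the SNP caller
    }
}
```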

Enterprise-Class Hadoop on EMC Isilon

As the previous examples demonstrate, the data and analysis scalability required for genomics is ideally suited to Hadoop. EMC Isilon's OneFS distributes the Hadoop Name Node to provide high availability and load balancing, thereby eliminating the single point of failure. The Isilon NAS storage solution provides a highly efficient single file system/single volume, scalable up to 15 PB. Data can be staged from other protocols to HDFS, with OneFS acting as a staging gateway. EMC Isilon also provides enterprise-grade data services to the Hadoop infrastructure via SnapshotIQ and SyncIQ for advanced backup and disaster recovery capabilities.

The equation for Hadoop scalability can be represented as:

$$
\mathrm{Big}(\mathrm{Data} + \mathrm{Analytics}) = \mathrm{Hadoop}_{\mathrm{EMC\ Isilon}}
$$

These advantages are summarized in Fig. 6 below:

Figure 6: Hadoop advantages with EMC Isilon

When combined with the EMC GreenPlum analytics appliance and solution [17], the Hadoop architecture becomes a complete enterprise package.


Conclusion

What began as an internal project at Google in 2004 has matured into a scalable framework for two computing paradigms particularly suited to the life sciences: parallelization and distribution. The post-process streaming data patterns for text strings, clustering, and sorting – core processing patterns in the life sciences – are ideal workloads for Hadoop. The CrossBow example discussed above aligned Illumina NGS reads for SNP calling over a 35x coverage of the human genome in under 3 hours using a 40-node Hadoop cluster – an order of magnitude better than traditional HPC technology for parallel processes.

Even though Hadoop implementations on public Cloud instances are popular, several issues have led most large institutions to maintain their own data repositories internally: large data transfers from on-premise storage to the Cloud; data regulations and security; data availability; data redundancy; and HPC throughput. This is especially true as genome sequencing moves into the clinic for diagnostic testing.

The convergence of these issues is evidenced by the mirroring of the Sequence Read Archive (SRA) at the National Center for Biotechnology Information (NCBI) on the DNAnexus SRA Cloud [15]; its business model is slowly evolving into a "full data and analysis offsite" model via Hadoop. The Hybrid Cloud model (a source data mirror between Private Cloud and Community Cloud) with Hadoop as a Service (HaaS) is the current state of the art.

Hadoop’s advantages far outweigh its challenges – it is ready to become the life sciences analytics framework of the future. The EMC Isilon platform is bringing that future to you today.

References

1. Pennisi E, "Will Computers Crash Genomics?", Science, 11 February 2011: Vol. 331, No. 6018, pp. 666-668.

2. Editorial, "Challenges and Opportunities", Science, 11 February 2011: Vol. 331, No. 6018, p. 692.

3. Hadoop on EMC Isilon Scale Out NAS: EMC White Paper, Part Number h10528

4. Cafarella, M and Cutting D, “Building Nutch, Open Source Search”, ACM Queue vol. 2, no. 2, April 2004.

5. Dean J and Ghemawat S, "MapReduce: Simplified Data Processing on Large Clusters", OSDI conference proceedings, 2004.

6. Vasiliu B, “Integrating BLAST with Sun GridEngine”, July 2003, http://developers.sun.com/solaris/articles/integrating_blast.html, last visited Dec 2011.

7. White T, "Hadoop: The Definitive Guide", 2nd Edition, O'Reilly, October 2010.

8. RHIPE: http://ml.stat.purdue.edu/rhipe/, last visited Dec 2011


9. MapReduce example: http://markusklems.files.wordpress.com/2008/07/mapreduce.png , last visited Dec 2011.

10. “Hadoop wins Terabyte sort benchmark”, Apr 2008, Apr 2009, http://sortbenchmark.org/YahooHadoop.pdf, http://sortbenchmark.org/Yahoo2009.pdf last accessed Dec 2011

11. McKenna A, et al, "The Genome Analysis Toolkit: A MapReduce framework for analyzing next-generation DNA sequencing data", Genome Research, 20:1297–1303, July 2010.

12. Langmead B, Schatz MC, et al, “Human SNPs from short reads in hours using cloud computing” Poster Presentation, WABI Sep 2009, http://www.cbcb.umd.edu/~mschatz/Posters/Crossbow_WABI_Sept2009.pdf, last accessed Dec 2011.

13. Taylor RC, "An overview of the Hadoop/MapReduce/HBase framework and its current applications in bioinformatics" BMC Bioinformatics 2010, 11(Suppl 12):S1, http://www.biomedcentral.com/1471-2105/11/S12/S1 , last accessed Dec 2011.

14. Ramakrishnan L, "Evaluating Cloud Computing for HPC Applications", DoE NERSC, http://www.nersc.gov/assets/Events/MagellanNERSCLunchTalk.pdf, last accessed Dec 2011.

15. “DNAnexus to mirror SRA database in Google Cloud”, BioIT World, Page 41, http://www.bio-itworld.com/uploadedFiles/Bio-IT_World/1111BITW_download.pdf , last visited Dec 2011.

16. Altschul SF, et al, "Basic local alignment search tool". J Mol Biol 215 (3): 403–410, October 1990.

17. Lockner J, "EMC's Enterprise Hadoop Solution: Isilon Scale-out NAS and GreenPlum HD", White Paper, The Enterprise Strategy Group, Inc. (ESG), February 2012.