
Cloud-based Analytics and Map Reduce

INTEL CONFIDENTIAL

Datasets

• Many technologies are converging around the “Big Data” theme

– Cloud Computing, NoSQL, Graph Analytics

• Biology is becoming increasingly data intensive

– Sequencing, imaging, and other instruments

• Computing with big datasets is fundamentally different from big compute on small datasets

– I/O centric, data movement is critical

Cloud-based Analytics

• Must be specific when talking cloud

– IaaS, PaaS, SaaS, ...

• Basic cloud ingredients

– Self-service on-demand compute and storage

– Commodity hardware

– High capacity

– Everything has an API

• Leading example: Amazon Web Services

• Honorable mention: Google Compute Engine, OpenStack, CloudStack, VMware

Mapping Informatics to the Cloud: Good News

• Good News: You don’t have to change much

• Using the basic IaaS building blocks we can handle most traditional use cases

– Clusters, client-server, N-tier deployments

• Running standard HPC clusters in the cloud is simple

• StarCluster: http://star.mit.edu/cluster/

– Create EC2 clusters in minutes

– OpenMPI, ATLAS, LAPACK, NumPy, SciPy

– SGE, IPython, Condor, MPICH2 plugins

Mapping Informatics to the Cloud

• Cloud computing is democratizing access to IT infrastructure resources

– Anyone can have access to massive compute and storage resources in minutes

– This changes the way we solve scientific problems

• Cloud computing is not a silver bullet for scalability

• Cloud providers have primarily focused on horizontal scaling and not on HPC

Map Reduce

• We need a computing framework that is...

– able to handle huge datasets - 1TB+

– massively parallel - runs on commodity hardware

– fault tolerant - hardware fails

– locality aware - moving computation is cheaper than moving data

– easy to use

Map Reduce

• Original paper by Google in 2004

– Introduced a simplified parallel processing model

– Used to build Google search indexes

– Users specify a Map function and Reduce function

– Framework manages task distribution, orchestration, data transfers, redundancy, and fault tolerance

Map Reduce

• Map Reduce is a simple approach to parallel programming

• Existing algorithms must be translated into one or more map/reduce steps

• Batch oriented

• Requires a distributed filesystem

• Map Reduce can be implemented on top of MPI

– http://mapreduce.sandia.gov

Apache Hadoop

• Free, open-source implementation modeled on Google MapReduce - written in Java

• Composed of many complementary projects

– Core - set of components and interfaces for distributed filesystems and general I/O serialization

– MapReduce - distributed data processing model and execution environment

– HDFS - distributed file system that runs on large clusters

Hadoop Ecosystem: Hadoop-related projects at Apache

• Hadoop has a large ecosystem of tools

– HBase - non-relational, column-oriented database that runs on HDFS

– Pig - data flow language for exploring datasets

– Hive - distributed data warehouse with SQL-like query language

– Mahout - machine learning and data mining library

Hadoop Components

• Core consists of compute layer (MapReduce) and storage layer (HDFS)

• Alternatives to HDFS

– Amazon S3

– GlusterFS

– Lustre

• Many distributions/flavors of Hadoop exist

– Apache

– Cloudera

– Amazon Elastic Map Reduce

– Intel

Intel Hadoop: Core improvements and enterprise features

• Encrypted HDFS

• Faster job launch

• Optimized for SSDs and 10GbE networking

• Accelerated Hive queries

• Premium support

• Intel Manager

• Multi-datacenter HBase deployments

Hadoop Job Flow

http://developer.yahoo.com/hadoop/tutorial/module4.html#dataflow

Canonical Example - Word Count
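The figure for this slide did not survive extraction. As a stand-in, here is a minimal in-memory sketch of word count as a map, shuffle, and reduce sequence. The function names (`map_words`, `shuffle`, `reduce_counts`, `word_count`) are hypothetical and are not Hadoop APIs; a real job would distribute each step across nodes.

```python
from collections import defaultdict


def map_words(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle step: group values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(word, counts):
    """Reduce step: sum the counts collected for one word."""
    return word, sum(counts)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence on a list of lines."""
    pairs = [pair for line in lines for pair in map_words(line)]
    return dict(reduce_counts(word, counts) for word, counts in shuffle(pairs).items())
```

Calling `word_count(["the quick brown fox", "the lazy dog"])` returns a dictionary mapping each word to its count. Only the map and reduce functions carry application logic; the grouping in between is what the framework would provide.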

Translating Workloads to Map Reduce

• Parallel programming requires parallel thinking

– Domain decomposition

– Exploit natural parallelism

• A map reduce job assumes independent mappers and reducers running in parallel on individual slices of data

• Share-nothing architecture - avoid communication and global data structures

• Exploit parallelism at each stage in workflow
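The share-nothing idea can be sketched in a few lines, assuming a toy task (counting bases in a list of reads) and hypothetical helper names. Each worker sees only its own slice and returns a partial result; the only combination happens once at the end. A thread pool is used purely to keep the sketch self-contained; on a cluster the workers would be separate processes on separate nodes.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_into_slices(items, n_slices):
    """Domain decomposition: cut the input into roughly equal, independent slices."""
    size = max(1, -(-len(items) // n_slices))  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


def count_bases(reads):
    """Worker: touches only its own slice, no shared or global state."""
    counts = Counter()
    for read in reads:
        counts.update(read)
    return counts


def parallel_base_counts(reads, n_slices=4):
    """Process slices independently, then merge the partial results once."""
    slices = split_into_slices(reads, n_slices)
    with ThreadPoolExecutor(max_workers=n_slices) as pool:
        partials = list(pool.map(count_bases, slices))
    total = Counter()
    for partial in partials:
        total += partial
    return total
```

Because `count_bases` never reads or writes anything outside its slice, the slices can run in any order and on any machine without coordination.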

Translating Workloads to Map Reduce

• Genomic analysis is ideally suited for the Hadoop framework

– large, semi-structured, file-based data, parallel IO

– parallel processing by reads, samples, genes, etc.

• Hadoop interfaces exist for C, Java, Perl, R, ...

• Hadoop Streaming allows any executables to be mappers and reducers

Enter Hadoop Streaming: Unix Pipes for Hadoop

• Utility that comes with Hadoop distribution

• Use any mapper and reducer

– Perl

– Bash

– cat/wc

– Binaries

• Easiest way to get started with Hadoop

• cat in.txt | mapper.sh | sort | reduce.sh > out.txt
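The pipe above can be mimicked in a single process to see what each stage is responsible for. The sketch below is a word count written in the streaming style, assuming tab-separated `key<TAB>value` lines between stages; `mapper`, `reducer`, and `run_pipeline` are hypothetical stand-ins for `mapper.sh`, `reduce.sh`, and the shell pipeline.

```python
from itertools import groupby


def mapper(lines):
    """Stand-in for mapper.sh: read raw lines, write 'key<TAB>value' lines."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_lines):
    """Stand-in for reduce.sh: input arrives sorted, so equal keys are adjacent."""
    keyed = (line.split("\t", 1) for line in sorted_lines)
    for key, group in groupby(keyed, key=lambda kv: kv[0]):
        total = sum(int(value) for _, value in group)
        yield f"{key}\t{total}"


def run_pipeline(lines):
    """Equivalent of: cat in.txt | mapper | sort | reducer."""
    return list(reducer(sorted(mapper(lines))))
```

The reducer relies entirely on the sort step: it never holds more than one key's running total in memory, which is why the `sort` between the two stages cannot be dropped.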

Revolution R with Hadoop

• Series of R connectors to Hadoop features

• Write MapReduce jobs in R using Hadoop Streaming

• Import tables from HDFS and HBase

• https://github.com/RevolutionAnalytics/RHadoop

• http://www.revolutionanalytics.com/why-revolution-r/whitepapers/R-and-Hadoop-Big-Data-Analytics.pdf

Elastic MapReduce: Hadoop on AWS

• Amazon realized customers were spending a lot of time configuring and operating Hadoop clusters on EC2

• EMR is a service that runs on EC2 and handles set-up, teardown, and other Hadoop details

• Charged by instance-hour

– Instances can be terminated automatically when your job finishes

• Adds ‘Job Flows’ feature

– Unique to EMR

Elastic MapReduce: Features

• Elastic

– Add nodes to a running cluster

– New: variable node count in each flow step

– New: easier to dynamically resize the number of nodes in use

• Stores inputs and outputs in S3

– New: can use multi-part HTTP upload if configured

• Easy to Use

– Job Flows, Debugging

• Support for On-Demand and Reserved Instances

– Recent support for Spot and VPC instances!

• Support for bootstrap actions

• Support for Pig, Hive, Hadoop 0.20.*

Elastic MapReduce: Creating Job Flows

Elastic MapReduce: Monitoring and Debugging

Elastic MapReduce: Debugging

Translating Workloads to Map Reduce

• The most efficient programs require use of Java and understanding Hadoop internals

• Hadoop has more scheduling and execution overhead compared to some traditional HPC environments

– Hadoop can be integrated with cluster schedulers like SGE

• Moving large amounts of data into HDFS can be slow

– Filesystem alternatives include S3, GlusterFS, Lustre, HTTP

– HDFS can greatly benefit from SSDs

Data Movement: A few tips

• Primary hurdle in adopting the cloud

• Avoid using SCP or other TCP-based transfers unless you tune your settings

– http://www.psc.edu/index.php/networking/641-tcp-tune

• Alternative transports: GridFTP, Aspera fasp, BitTorrent

• AWS offers physical import/export via FedEx

• Aggregate the data within your job if possible

– Ingest data in preprocess phase or via P2P torrents
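A quick back-of-the-envelope calculation shows why data movement dominates. The sketch below assumes a decimal terabyte and treats link speed and efficiency as free parameters; the 30% efficiency figure is a hypothetical illustration of an untuned transfer, not a measurement.

```python
def transfer_hours(size_bytes, link_bits_per_second, efficiency=1.0):
    """Hours needed to move size_bytes over a link at the given effective efficiency."""
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    seconds = size_bytes * 8 / (link_bits_per_second * efficiency)
    return seconds / 3600


ONE_TB = 10**12  # decimal terabyte, in bytes

# 1 TB over an ideal 100 Mbit/s link: 8e12 bits / 1e8 bits/s = 80,000 s
print(round(transfer_hours(ONE_TB, 100e6), 1))       # 22.2
# Same link at a hypothetical 30% effective throughput
print(round(transfer_hours(ONE_TB, 100e6, 0.3), 1))  # 74.1
```

Even in the ideal case a single terabyte takes most of a day on such a link, which is the arithmetic behind shipping disks or ingesting data in parallel inside the job.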

Life Science Example: Cloud-scale RNA-sequencing differential expression analysis with Myrna

• Langmead et al. Genome Biology 2010, 11:R83

• RNA-Seq analysis pipeline

– Focused on differential expression analysis between genes

– Complementary to whole transcriptome assembly (e.g. Cufflinks)

• Workflow contains 7 stages

– Bowtie for alignment and R/Bioconductor for EM and statistics

Stage 1 - Preprocess

• Process FASTQ input list

– Optional dump from .sra format

• Assign sample names

• Copy into HDFS

• Parallel across input files

Stage 2 - Align

• Align reads to reference genome using Bowtie

• Each node independently obtains the Bowtie index from a local or shared filesystem (HDFS)

• Parallel across reads

Stage 3 - Overlap

• Calculate overlaps between alignment and predefined gene intervals

• Aggregate counts for each genomic feature

• Parallel across alignments
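A minimal sketch of the overlap-and-count idea follows. It is not Myrna's implementation: it assumes half-open `[start, end)` coordinates, uses a naive nested loop, and all names are hypothetical. Each alignment is checked independently, which is what makes the stage parallel across alignments.

```python
from collections import Counter


def overlaps(start_a, end_a, start_b, end_b):
    """True if half-open intervals [start_a, end_a) and [start_b, end_b) intersect."""
    return start_a < end_b and start_b < end_a


def count_overlaps(alignments, genes):
    """Count alignments per gene.

    alignments: list of (chrom, start, end)
    genes: list of (gene_id, chrom, start, end)
    """
    counts = Counter()
    for chrom, start, end in alignments:
        for gene_id, g_chrom, g_start, g_end in genes:
            if chrom == g_chrom and overlaps(start, end, g_start, g_end):
                counts[gene_id] += 1
    return counts
```

In a map/reduce setting the inner test would sit in the mapper, emitting one `(gene_id, 1)` pair per hit, and the per-gene sum would be the reducer.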

Stage 4 - Normalize

• Calculate normalization factor based on count distribution

• Parallel across genetic feature labels
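The slide does not say which statistic of the count distribution is used. As one possible choice, the sketch below assumes the upper quartile (75th percentile) of a sample's non-zero gene counts; the function names and the choice of quantile method are assumptions made for illustration.

```python
import statistics


def normalization_factor(gene_counts):
    """Upper quartile of the non-zero counts in one sample (an assumed choice)."""
    nonzero = sorted(c for c in gene_counts.values() if c > 0)
    if len(nonzero) < 2:
        raise ValueError("need at least two non-zero counts")
    # quantiles(n=4) returns the three quartile cut points; index 2 is the 75th percentile
    return statistics.quantiles(nonzero, n=4, method="inclusive")[2]


def normalize(gene_counts):
    """Divide every gene count by the sample's normalization factor."""
    factor = normalization_factor(gene_counts)
    return {gene: count / factor for gene, count in gene_counts.items()}
```

Dividing by a per-sample factor makes counts comparable across samples sequenced to different depths, which the later statistical stage depends on.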

Stage 5 - Statistical analysis

• Fits a linear model relating the counts to the outcome using R

• Uses values calculated from Align and Overlap stages

• Parallel across genes
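The pipeline does this step in R. To illustrate the per-gene structure only, here is a sketch that assumes ordinary least squares with a single numeric outcome per sample; the model family actually used is not stated on the slide, and `fit_line` and `fit_all_genes` are hypothetical names.

```python
def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have equal length of at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x must not be constant")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def fit_all_genes(counts_by_gene, outcome):
    """Fit one model per gene; fits are independent, so genes can run in parallel."""
    return {gene: fit_line(outcome, counts) for gene, counts in counts_by_gene.items()}
```

Since no fit depends on any other gene's data, each gene can be handed to a different reducer.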

Stage 6 - Summarize

• Significance summaries such as P-values and gene-specific counts are calculated

• Outputs a list of top N genes ranked by false discovery rate

– Hadoop takes care of sorting

• This stage is serial

– Mitigated by small size of calculation at this stage
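Ranking by false discovery rate can be sketched as follows, assuming the Benjamini-Hochberg adjustment (the slide does not name the procedure) and hypothetical function names. The whole list of P-values must be seen at once, which is consistent with this stage being serial.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotonic
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def top_genes(gene_p_values, n):
    """Rank genes by adjusted p-value and keep the n most significant."""
    genes = list(gene_p_values)
    q_values = benjamini_hochberg([gene_p_values[g] for g in genes])
    ranked = sorted(zip(genes, q_values), key=lambda gq: gq[1])
    return ranked[:n]
```

`top_genes` takes a dictionary of gene name to P-value and returns `(gene, adjusted_value)` pairs; in the Hadoop job the final sort would be left to the framework.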

Stage 7 - Postprocess

• Discards overlaps not belonging to top genes

• Creates output files, summary tables, and plots

• Compressed and stored in user-specified output directory

• This stage has modest parallelism

Myrna Performance

• Uses standard bioinformatics tools Bowtie and R/Bioconductor in a Hadoop job flow

• Workflow broken into many stages to take advantage of parallelism

• Near linear speedup

Summary

• The most popular genomics algorithms will eventually get Hadoop implementations

– The other 80% will not...

• Hadoop is well suited for processing large unstructured data offline

• Hadoop is not well suited for communication-heavy jobs or real-time processing

• Hadoop can be run locally or integrated into existing HPC infrastructure

• HBase and other products run on Hadoop taking advantage of framework features

Summary

• Building a Hadoop cluster

– Use dense storage nodes

– Boosting HDFS performance

– Use SSD drives

– Faster interconnects

– Replace HDFS with S3, GlusterFS, Lustre

• Running a Hadoop cluster

– Job monitoring and debugging require additional tooling

– AWS Elastic Map Reduce product for EC2

– Intel Manager for Hadoop

Observations

• The cloud is making good parallel programming techniques more important than ever

– Message passing, threading, distributed systems

• Understand the difference between vertical and horizontal scaling

– Use both!

• Cloud best practices are finding their way back into local infrastructure/HPC

– Hadoop, configuration management, SOA

References

• http://developer.yahoo.com/hadoop/tutorial/module4.html#dataflow

• http://markusklems.files.wordpress.com/2008/07/mapreduce.png

• http://bowtie-bio.sourceforge.net/myrna/index.shtml

INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

Copyright © , Intel Corporation. All rights reserved. Intel, the Intel logo, Xeon, Core, VTune, and Cilk are trademarks of Intel Corporation in the U.S. and other countries.

Optimization Notice

Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804

Legal Disclaimer & Optimization Notice

Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
