Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Invent 2013


Description

Amazon Elastic MapReduce is one of the largest Hadoop operators in the world. Since its launch four years ago, our customers have launched more than 5.5 million Hadoop clusters. In this talk, we introduce you to Amazon EMR design patterns such as using Amazon S3 instead of HDFS, taking advantage of both long- and short-lived clusters, and other Amazon EMR architectural patterns. We talk about how to scale your cluster up or down dynamically and introduce ways you can fine-tune your cluster. We also share best practices for keeping your Amazon EMR cluster cost efficient.

Transcript of Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Invent 2013

Page 1: Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Invent 2013

© 2013 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.

Amazon Elastic MapReduce: Deep Dive and Best Practices Parviz Deyhim

November 13, 2013

Page 2: Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Invent 2013

Outline

Introduction to Amazon EMR

Amazon EMR Design Patterns

Amazon EMR Best Practices

Controlling Cost with Amazon EMR

Advanced Optimizations

Page 3: Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Invent 2013

Outline

Introduction to Amazon EMR

Amazon EMR Design Patterns

Amazon EMR Best Practices

Controlling Cost with Amazon EMR

Advanced Optimizations

Page 4: Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Invent 2013

What is EMR?

• Hadoop-as-a-service

• MapReduce engine integrated with tools

• Massively parallel

• Cost-effective AWS wrapper

• Integrated with AWS services

Page 5: Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Invent 2013

[Architecture diagram: Amazon EMR cluster with HDFS]

Page 6: Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Invent 2013

[Architecture diagram: Amazon EMR and HDFS, with Amazon S3 and Amazon DynamoDB as data stores]

Page 7: Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Invent 2013

[Architecture diagram: analytics languages and data management layered over Amazon EMR, HDFS, Amazon S3, and Amazon DynamoDB]

Page 8: Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Invent 2013

[Architecture diagram: Amazon RDS added alongside Amazon EMR, HDFS, analytics languages, data management, Amazon S3, and Amazon DynamoDB]

Page 9: Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Invent 2013

[Architecture diagram: the full picture: analytics languages and data management over Amazon EMR and HDFS, integrated with Amazon S3, Amazon DynamoDB, Amazon RDS, Amazon Redshift, and AWS Data Pipeline]

Page 10: Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Invent 2013

Amazon EMR Introduction

• Launch clusters of any size in a matter of minutes

• Use a variety of instance sizes that match your workload

Page 11: Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Invent 2013

Amazon EMR Introduction

• Don’t get stuck with hardware

• Don’t deal with capacity planning

• Run multiple clusters with different sizes, specs and node types

Page 12: Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Invent 2013

Amazon EMR Introduction

• Integration with the Spot market
• 70-80% discount

Page 13: Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Invent 2013

Outline

Introduction to Amazon EMR

Amazon EMR Design Patterns

Amazon EMR Best Practices

Controlling Cost with Amazon EMR

Advanced Optimizations

Page 14: Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Invent 2013

Amazon EMR Design Patterns

Pattern #1: Transient vs. Alive Clusters

Pattern #2: Core Nodes and Task Nodes

Pattern #3: Amazon S3 as HDFS

Pattern #4: Amazon S3 & HDFS

Pattern #5: Elastic Clusters

Page 15: Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Invent 2013

Pattern #1: Transient vs. Alive Clusters

Page 16: Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Invent 2013

Pattern #1: Transient Clusters

• Cluster lives for the duration of the job

• Shut down the cluster when the job is done

• Data persists on Amazon S3

• Input and output data live on Amazon S3
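A minimal sketch of launching a transient cluster with the 2013-era elastic-mapreduce CLI: without --alive, the job flow terminates once its steps finish. The output bucket is a placeholder; the word-count sample paths are the ones AWS shipped at the time.

./elastic-mapreduce --create --name "Transient wordcount" \
  --num-instances 5 --instance-type m1.xlarge \
  --stream \
  --mapper s3://elasticmapreduce/samples/wordcount/wordSplitter.py \
  --reducer aggregate \
  --input s3://elasticmapreduce/samples/wordcount/input \
  --output s3://mybucket/wordcount/output

When the step completes, the cluster shuts itself down and both input and output remain on Amazon S3.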

Page 17: Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Invent 2013

Benefits of Transient Clusters

1. Control your cost
2. Minimal maintenance
   • Cluster goes away when the job is done
3. Practice cloud architecture
   • Pay for what you use
   • Data processing as a workflow

Page 18: Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Invent 2013

When to use a Transient cluster?

If (Data Load Time + Processing Time) * Number of Jobs < 24 hours:
    Use Transient Clusters
Else:
    Use Alive Clusters

Page 19: Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Invent 2013

When to use a Transient cluster?

(20 min data load + 1 hour processing) * 10 jobs ≈ 13 hours < 24 hours, so use transient clusters

Page 20: Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Invent 2013

Alive Clusters

• Very similar to traditional Hadoop deployments

• Cluster stays around after the job is done

• Data persistence models:
  • Amazon S3
  • Amazon S3 copied to HDFS
  • HDFS, with Amazon S3 as backup

Page 21: Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Invent 2013

Alive Clusters

• Always keep data safe on Amazon S3, even if you're using HDFS for primary storage

• Get in the habit of shutting down your cluster and starting a new one, once a week or month

• Design your data processing workflow to account for failure

• You can use workflow management tools such as AWS Data Pipeline

Page 22: Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Invent 2013

Benefits of Alive Clusters

• Ability to share data between multiple jobs

[Diagram: a transient cluster passes data between jobs through Amazon S3; long-running clusters share data directly over HDFS]

Page 23: Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Invent 2013

Benefits of Alive Clusters

• Cost effective for repetitive jobs

[Diagram: transient clusters spin up for each scheduled job; a single long-running cluster serves the repeating jobs instead]

Page 24: Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Invent 2013

When to use an Alive cluster?

If (Data Load Time + Processing Time) * Number of Jobs > 24 hours:
    Use Alive Clusters
Else:
    Use Transient Clusters

Page 25: Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Invent 2013

When to use an Alive cluster?

(20 min data load + 1 hour processing) * 20 jobs ≈ 26.7 hours > 24 hours, so use alive clusters

Page 26: Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Invent 2013

Pattern #2: Core & Task nodes

Page 27: Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Invent 2013

Core Nodes

[Diagram: Amazon EMR cluster with a master instance group and a core instance group running HDFS]

• Run TaskTrackers (compute)
• Run DataNodes (HDFS)

Page 28: Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Invent 2013

Core Nodes

[Diagram: a core node being added to the core instance group]

• You can add core nodes

Page 29: Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Invent 2013

Core Nodes

[Diagram: the cluster after adding a core node]

• Adding core nodes gives more HDFS space and more CPU/memory

Page 30: Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Invent 2013

Core Nodes

[Diagram: the core instance group with HDFS on every node]

• You can't remove core nodes, because HDFS lives on them

Page 31: Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Invent 2013

Amazon EMR Task Nodes

[Diagram: Amazon EMR cluster with master, core, and task instance groups]

• Run TaskTrackers
• No HDFS; read from core node HDFS

Page 32: Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Invent 2013

Amazon EMR Task Nodes

[Diagram: a task node being added to the task instance group]

• You can add task nodes

Page 33: Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Invent 2013

Amazon EMR Task Nodes

[Diagram: the task instance group after growing]

• More CPU power, more memory

Page 34: Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Invent 2013

Amazon EMR Task Nodes

[Diagram: a task node being removed]

• You can remove task nodes

Page 35: Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Invent 2013

Amazon EMR Task Nodes

[Diagram: the task instance group after shrinking]

• You can remove task nodes

Page 36: Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Invent 2013

Task Node Use Case #1

• Speed up job processing using the Spot market

• Run task nodes on the Spot market

• Get a discount on the hourly price

• Nodes can come and go without interruption to your cluster

Page 37: Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Invent 2013

Task Node Use Case #2

• When you need extra horsepower for a short amount of time

• Example: needing to pull a large amount of data from Amazon S3

Page 38: Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Invent 2013

Example:

[Diagram: two HS1 core nodes with 48 TB of HDFS each, with Amazon S3 as the data source]

Page 39: Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Invent 2013

Example:

[Diagram: six m1.xlarge Spot task nodes added alongside the two HS1 core nodes]

Add Spot task nodes to load data from Amazon S3

Page 40: Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Invent 2013

Example:

[Diagram: the six m1.xlarge Spot task nodes being removed]

Remove the task nodes after the data load from Amazon S3 completes
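A hedged sketch of this flow with the 2013-era CLI; the job flow ID, instance group ID, count, and bid price are placeholders. Task groups can be grown for the load and shrunk afterward because no HDFS lives on them.

# Add a Spot task instance group for the Amazon S3 load
./elastic-mapreduce --jobflow j-XXXXXXXXXXXXX \
  --add-instance-group task \
  --instance-type m1.xlarge --instance-count 6 --bid-price 0.05

# After the load, shrink the task group back to zero
# (the task group's ig-... ID appears in --describe output)
./elastic-mapreduce --jobflow j-XXXXXXXXXXXXX \
  --modify-instance-group ig-XXXXXXXXXX --instance-count 0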

Page 41: Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Invent 2013

Pattern #3: Amazon S3 as HDFS

Page 42: Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Invent 2013

Amazon S3 as HDFS

• Use Amazon S3 as your permanent data store

• Use HDFS for temporary storage of data between jobs

• No additional step to copy data to HDFS

[Diagram: core and task instance groups working directly against Amazon S3]
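A minimal sketch of the pattern, assuming a custom jar that takes input and output paths; the bucket and jar names are hypothetical. Because both paths are s3:// URIs, there is no staging copy into HDFS.

./elastic-mapreduce --create --name "s3-direct" \
  --num-instances 10 --instance-type m1.xlarge \
  --jar s3://mybucket/jars/myjob.jar \
  --args s3://mybucket/input/,s3://mybucket/output/

The same s3:// paths work from Hive and Pig, which is what makes it safe to shut the cluster down afterward.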

Page 43: Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Invent 2013

Benefits: Amazon S3 as HDFS

• Ability to shut down your cluster: a HUGE benefit!

• Use Amazon S3 as your durable storage: 11 9s of durability

Page 44: Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Invent 2013

Benefits: Amazon S3 as HDFS

• No need to scale HDFS
  • Capacity
  • Replication for durability

• Amazon S3 scales with your data
  • Both in IOPS and data storage

Page 45: Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Invent 2013

Benefits: Amazon S3 as HDFS

• Ability to share data between multiple clusters
  • Hard to do with HDFS

[Diagram: two EMR clusters reading the same Amazon S3 data]

Page 46: Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Invent 2013

Benefits: Amazon S3 as HDFS

• Take advantage of Amazon S3 features
  • Amazon S3 server-side encryption
  • Amazon S3 lifecycle policies
  • Amazon S3 versioning to protect against corruption

• Build elastic clusters
  • Add nodes to read from Amazon S3
  • Remove nodes with data safe on Amazon S3

Page 47: Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Invent 2013

What About Data Locality?

• Run your job in the same region as your Amazon S3 bucket

• Amazon EMR nodes have high-speed connectivity to Amazon S3

• If your job is CPU/memory-bound, data locality doesn't make a difference

Page 48: Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Invent 2013

Anti-Pattern: Amazon S3 as HDFS

• Iterative workloads: if you're processing the same dataset more than once

• Disk I/O intensive workloads

Page 49: Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Invent 2013

Pattern #4: Amazon S3 & HDFS

Page 50: Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Invent 2013

Amazon S3 & HDFS

1. Data persists on Amazon S3

Page 51: Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Invent 2013

Amazon S3 & HDFS

2. Launch an Amazon EMR cluster and copy data to HDFS with S3DistCp

Page 52: Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Invent 2013

Amazon S3 & HDFS

3. Start processing the data on HDFS
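A sketch of step #2's copy, modeled on the S3DistCp examples that appear later in this talk; the job flow ID and bucket are placeholders.

./elastic-mapreduce --jobflow j-XXXXXXXXXXXXX --jar \
/home/hadoop/lib/emr-s3distcp-1.0.jar \
--args '--src,s3://mybucket/input,--dest,hdfs:///data'

Jobs then read hdfs:///data for the I/O-heavy passes and write final results back to Amazon S3.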

Page 53: Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Invent 2013

Benefits: Amazon S3 & HDFS

• Better pattern for I/O-intensive workloads

• The Amazon S3 benefits discussed previously apply:
  • Durability
  • Scalability
  • Cost
  • Features: lifecycle policies, security

Page 54: Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Invent 2013

Pattern #5: Elastic Clusters

Page 55: Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Invent 2013

Amazon EMR Elastic Cluster (manual)

1. Start the cluster with a certain number of nodes

Page 56: Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Invent 2013

Amazon EMR Elastic Cluster (manual)

2. Monitor your cluster with Amazon CloudWatch metrics:
• Map Tasks Running
• Map Tasks Remaining
• Cluster Idle?
• Avg. Jobs Failed

Page 57: Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Invent 2013

Amazon EMR Elastic Cluster (manual)

3. Increase the number of nodes as you need more capacity by manually calling the API
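A hedged example of that manual call with the 2013-era CLI; the job flow and instance group IDs are placeholders (the task group's ig-... ID appears in --describe output).

./elastic-mapreduce --jobflow j-XXXXXXXXXXXXX --describe
./elastic-mapreduce --jobflow j-XXXXXXXXXXXXX \
  --modify-instance-group ig-XXXXXXXXXX --instance-count 20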

Page 58: Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Invent 2013

Amazon EMR Elastic Cluster (automated)

1. Start your cluster with a certain number of nodes

Page 59: Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Invent 2013

Amazon EMR Elastic Cluster (automated)

2. Monitor cluster capacity with Amazon CloudWatch metrics:
• Map Tasks Running
• Map Tasks Remaining
• Cluster Idle?
• Avg. Jobs Failed

Page 60: Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Invent 2013

Amazon EMR Elastic Cluster (automated)

3. Get an Amazon SNS HTTP notification delivered to a simple app deployed on AWS Elastic Beanstalk

Page 61: Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Invent 2013

Amazon EMR Elastic Cluster (automated)

4. Your app calls the API to add nodes to your cluster

Page 62: Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Invent 2013

Outline

Introduction to Amazon EMR

Amazon EMR Design Patterns

Amazon EMR Best Practices

Controlling Cost with Amazon EMR

Advanced Optimizations

Page 63: Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Invent 2013

Amazon EMR Nodes and Size

• Use m1.small, m1.large, or c1.medium for functional testing

• Use m1.xlarge and larger nodes for production workloads

Page 64: Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Invent 2013

Amazon EMR Nodes and Size

• Use CC2 (cc2.8xlarge) for memory- and CPU-intensive jobs

• Use CC2 or c1.xlarge for CPU-intensive jobs

• Use HS1 (hs1.8xlarge) instances for HDFS workloads

Page 65: Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Invent 2013

Amazon EMR Nodes and Size

• Use HI1 and HS1 instances for disk I/O-intensive workloads

• CC2 instances are more cost effective than m2.4xlarge

• Prefer a smaller cluster of larger nodes to a larger cluster of smaller nodes

Page 66: Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Invent 2013

Holy Grail Question

How many nodes do I need?

Page 67: Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Invent 2013

Introduction to Hadoop Splits

• Depends on how much data you have

• And how fast you'd like your data to be processed

Page 68: Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Invent 2013

Introduction to Hadoop Splits

Before we can do Amazon EMR capacity planning, we need to understand how Hadoop handles splits

Page 69: Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Invent 2013

Introduction to Hadoop Splits

• Data gets broken up into splits (64 MB or 128 MB)

[Diagram: a data file broken into 128 MB splits]

Page 70: Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Invent 2013

Introduction to Hadoop Splits

• Splits get packaged into mappers

[Diagram: data to splits to mappers]

Page 71: Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Invent 2013

Introduction to Hadoop Splits

• Mappers get assigned to nodes for processing

[Diagram: mappers assigned to instances]

Page 72: Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Invent 2013

Introduction to Hadoop Splits

• More data = more splits = more mappers

Page 73: Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Invent 2013

Introduction to Hadoop Splits

• More data = more splits = more mappers

[Diagram: mappers queuing for capacity]

Page 74: Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Invent 2013

Introduction to Hadoop Splits

• More data mappers than cluster mapper capacity = mappers wait for capacity = processing delay

[Diagram: mapper queue backing up]

Page 75: Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Invent 2013

Introduction to Hadoop Splits

• More nodes = reduced queue size = faster processing

[Diagram: queue draining across more nodes]

Page 76: Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Invent 2013

Calculating the Number of Splits for Your Job

Uncompressed files: Hadoop splits a single file into multiple splits. Example: a 128 MB file = 2 splits with a 64 MB split size

Page 77: Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Invent 2013

Calculating the Number of Splits for Your Job

Compressed files:

1. Splittable compression: same logic as uncompressed files

[Diagram: a 128 MB BZIP file divided into 64 MB splits]

Page 78: Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Invent 2013

Calculating the Number of Splits for Your Job

Compressed files:

2. Unsplittable compression: the entire file is a single split

[Diagram: each 128 MB GZ file remains a single split]

Page 79: Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Invent 2013

Calculating the Number of Splits for Your Job

Number of splits: if data files have unsplittable compression, # of splits = number of files. Example: 10 GZ files = 10 mappers

Page 80: Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Invent 2013

Cluster Sizing Calculation

Just tell me how many nodes I need for my job!!

Page 81: Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Invent 2013

Cluster Sizing Calculation

1. Estimate the number of mappers your job requires.

Page 82: Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Invent 2013

Cluster Sizing Calculation

2. Pick an instance and note down the number of mappers it can run in parallel

m1.xlarge = 8 mappers in parallel

Page 83: Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Invent 2013

Cluster Sizing Calculation

3. Pick some sample data files to run a test workload. The number of sample files should match the per-instance mapper count from step #2.

Page 84: Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Invent 2013

Cluster Sizing Calculation

4. Run an Amazon EMR cluster with a single core node and process your sample files from #3. Note down the amount of time taken to process your sample files.

Page 85: Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Invent 2013

Cluster Sizing Calculation

Estimated number of nodes =

(Total Mappers * Time To Process Sample Files) / (Instance Mapper Capacity * Desired Processing Time)

Page 86: Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Invent 2013

Example: Cluster Sizing Calculation

1. Estimate the number of mappers your job requires: 150

2. Pick an instance and note down the number of mappers it can run in parallel: m1.xlarge, with 8-mapper capacity per instance

Page 87: Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Invent 2013

Example: Cluster Sizing Calculation

3. Pick sample data files to run a test workload, matching the mapper count from step #2: 8 files selected for our sample test

Page 88: Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Invent 2013

Example: Cluster Sizing Calculation

4. Run an Amazon EMR cluster with a single core

node and process your sample files from #3. Note down the amount of time taken to process your sample files.

3 min to process 8 files

Page 89: Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Invent 2013

Cluster Sizing Calculation

Estimated number of nodes =
(Total Mappers For Your Job * Time To Process Sample Files) / (Per-Instance Mapper Capacity * Desired Processing Time)

= (150 * 3 min) / (8 * 5 min) = 450 / 40 ≈ 11 m1.xlarge
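The same arithmetic as a quick shell sketch, using the worked numbers above (the 5-minute target is the example's assumption); integer division reproduces the slide's rounding.

total_mappers=150; sample_time_min=3
mappers_per_node=8; desired_time_min=5
echo $(( (total_mappers * sample_time_min) / (mappers_per_node * desired_time_min) ))
# prints 11 -> roughly 11 m1.xlarge nodes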

Page 90: Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Invent 2013

File Size Best Practices

• Avoid small files at all costs: anything smaller than 100 MB

• Each mapper is a single JVM

• CPU time is required to spawn JVMs/mappers

Page 91: Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Invent 2013

File Size Best Practices

Mappers take 2 sec to spawn and be ready for processing

10 TB in 100 MB files = 100,000 mappers * 2 sec ≈ 55 hours of mapper setup CPU time

Page 92: Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Invent 2013

File Size Best Practices

Mappers take 2 sec to spawn and be ready for processing

10 TB in 1,000 MB files = 10,000 mappers * 2 sec ≈ 5.5 hours of mapper setup CPU time

Page 93: Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Invent 2013

File Size on Amazon S3: Best Practices

• What's the best Amazon S3 file size for Hadoop? About 1-2 GB

• Why?

Page 94: Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Invent 2013

File Size on Amazon S3: Best Practices

• The life of a mapper should not be less than 60 sec

• A single mapper can get 10-15 MB/s of throughput to Amazon S3

60 sec * 15 MB/s ≈ 1 GB

Page 95: Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Invent 2013

Holy Grail Question

What if I have small file issues?

Page 96: Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Invent 2013

Dealing with Small Files

• Use S3DistCp to combine smaller files together

• S3DistCp takes a file pattern and a target size and combines smaller input files into larger ones

Page 97: Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Invent 2013

Dealing with Small Files

Example:

./elastic-mapreduce --jobflow j-3GY8JC4179IOK --jar \
/home/hadoop/lib/emr-s3distcp-1.0.jar \
--args '--src,s3://myawsbucket/cf,\
--dest,hdfs:///local,\
--groupBy,.*XABCD12345678.([0-9]+-[0-9]+-[0-9]+-[0-9]+).*,\
--targetSize,128'

Page 98: Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Invent 2013

Compressions

• Compress as much as you can

• Compress Amazon S3 input data files

– Reduces cost

– Speeds up Amazon S3-to-mapper data transfer time

Page 99: Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Invent 2013

Compressions

• Always compress data files on Amazon S3
  • Reduces storage cost
  • Reduces bandwidth between Amazon S3 and Amazon EMR
  • Speeds up your job

Page 100: Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Invent 2013

Compressions

• Compress mapper and reducer output
  • Reduces disk IO
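A hedged sketch of enabling mapper-output compression at launch via the configure-hadoop bootstrap action; the property names are the Hadoop 1.x ones used by 2013-era EMR AMIs.

./elastic-mapreduce --create --alive \
  --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
  --args "-m,mapred.compress.map.output=true,-m,mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec"

Snappy suits intermediate output because encode/decode speed matters more there than compression ratio (see the table on the next page).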

Page 101: Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Invent 2013

Compressions

• Compression types:
  – Some are fast BUT offer less space reduction
  – Some are space efficient BUT slower
  – Some are splittable and some are not

Algorithm   % Space Remaining   Encoding Speed   Decoding Speed
GZIP        13%                 21 MB/s          118 MB/s
LZO         20%                 135 MB/s         410 MB/s
Snappy      22%                 172 MB/s         409 MB/s

Page 102: Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Invent 2013

Compressions

• If you are time sensitive, faster compressions are a better choice

• If you have a large amount of data, use space-efficient compressions

• If you don't care, pick GZIP

Page 103: Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Invent 2013

Change Compression Type

• You may decide to change compression type

• Use S3DistCp to change the compression type of your files

• Example:

./elastic-mapreduce --jobflow j-3GY8JC4179IOK --jar \
/home/hadoop/lib/emr-s3distcp-1.0.jar \
--args '--src,s3://myawsbucket/cf,\
--dest,hdfs:///local,\
--outputCodec,lzo'

Page 104: Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Invent 2013

Outline

Introduction to Amazon EMR

Amazon EMR Design Patterns

Amazon EMR Best Practices

Controlling Cost with Amazon EMR

Advanced Optimizations

Page 105: Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Invent 2013

Architecting for cost

• AWS pricing models:
  – On-demand: pay-as-you-go
  – Spot: marketplace; bid for instances and get a discount
  – Reserved Instances: upfront payment (for 1 or 3 years) for a reduction in the overall monthly payment

Page 106: Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Invent 2013

Reserved Instances use case: for alive and long-running clusters

Page 107: Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Invent 2013

Reserved Instances use case: for ad hoc but predictable workloads, use medium utilization Reserved Instances

Page 108: Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Invent 2013

Reserved Instances use case: for unpredictable workloads, use Spot or on-demand pricing

Page 109: Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Invent 2013

Outline

Introduction to Amazon EMR

Amazon EMR Design Patterns

Amazon EMR Best Practices

Controlling Cost with Amazon EMR

Advanced Optimizations

Page 110: Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Invent 2013

Adv. Optimizations (Stage 1)

• The best optimization is to structure your data (i.e., smart data partitioning)

• Efficient data structuring = less data processed by Hadoop = faster jobs

Page 111: Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Invent 2013

Adv. Optimizations (Stage 1)

• Hadoop is a batch processing framework

• Data processing time = an hour to days

• Not a great use case for shorter jobs

• Other frameworks may be a better fit
  – Twitter Storm
  – Spark
  – Amazon Redshift, etc.

Page 112: Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Invent 2013

Adv. Optimizations (Stage 1)

• The Amazon EMR team has done a great deal of optimization already

• For smaller clusters, Amazon EMR configuration optimization won't buy you much
  – Remember you're paying for the full hour cost of an instance

Page 113: Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Invent 2013

Adv. Optimizations (Stage 1)

Best Optimization??

Page 114: Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Invent 2013

Adv. Optimizations (Stage 1)

Add more nodes

Page 115: Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Invent 2013

Adv. Optimizations (Stage 2)

• Monitor your cluster using Ganglia

• Amazon EMR has a Ganglia bootstrap action
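The install-ganglia bootstrap action path below is the documented one for 2013-era EMR; the cluster name and sizes are illustrative.

./elastic-mapreduce --create --alive --name "monitored-cluster" \
  --num-instances 10 --instance-type m1.xlarge \
  --bootstrap-action s3://elasticmapreduce/bootstrap-actions/install-ganglia

Ganglia's web UI then runs on the master node, typically reached through an SSH tunnel.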

Page 116: Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Invent 2013

Adv. Optimizations (Stage 2)

• Monitor and look for bottlenecks:
  – Memory
  – CPU
  – Disk IO
  – Network IO

Page 117: Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Invent 2013

Adv. Optimizations

[Flow diagram, step 1: Run Job]

Page 118: Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Invent 2013

Adv. Optimizations

[Flow diagram, step 2: Run Job -> Find Bottlenecks, using Ganglia to check CPU, disk, and memory]

Page 119: Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Invent 2013

Adv. Optimizations

[Flow diagram, step 3: Run Job -> Find Bottlenecks (Ganglia: CPU, disk, memory) -> Address Bottleneck, by fine-tuning or changing the algorithm]

Page 120: Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Invent 2013

Network IO

• The most important metric to watch if you use Amazon S3 for storage

• Goal: drive as much network IO as possible from a single instance

Page 121: Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Invent 2013

Network IO

• Larger instances can drive > 600 Mbps

• Cluster Compute instances can drive 1-2 Gbps

• Optimize to get more out of your instance throughput
  – Add more mappers?

Page 122: Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Invent 2013

Network IO

• If you're using Amazon S3 with Amazon EMR, monitor Ganglia and watch network throughput

• Your goal is to maximize NIC throughput by having enough mappers per node

Page 123: Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Invent 2013

Network IO, Example

Low network utilization: increase the number of mappers, if possible, to drive more traffic

Page 124: Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Invent 2013

CPU

• Watch the CPU utilization of your clusters

• If > 50% idle, increase the number of mappers/reducers per instance
  – Reduces the number of nodes and reduces cost

Page 125: Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Invent 2013

Example Adv. Optimizations (Stage 2)

What potential optimization do you see in this graph?

Page 126: Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Invent 2013

Example Adv. Optimizations (Stage 2)

40% CPU idle. Maybe add more mappers?

Page 127: Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Invent 2013

Disk IO

• Limit the amount of disk IO

• You can increase mapper/reducer memory

• Compress data anywhere you can

• Monitor the cluster and pay attention to the HDFS bytes written metric

• One place to pay attention to is mapper/reducer disk spill

Page 128: Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Invent 2013

Disk IO

• The mapper has an in-memory buffer

[Diagram: mapper with its in-memory buffer]

Page 129: Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Invent 2013

Disk IO

• When the memory buffer gets full, data spills to disk

[Diagram: mapper buffer spilling to disk]

Page 130: Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Invent 2013
Page 131: Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Invent 2013

Disk IO

• If you see mappers/reducers excessively spilling to disk, increase the buffer memory per mapper

• Spill is excessive when the ratio of MAPPER_SPILLED_RECORDS to MAPPER_OUTPUT_RECORDS is more than 1

Page 132: Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Invent 2013

Disk IO

Example:

MAPPER_SPILLED_RECORDS = 221200123

MAPPER_OUTPUT_RECORDS = 101200123
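Checking that ratio in the shell with the counter values above:

echo "scale=2; 221200123 / 101200123" | bc
# prints 2.18; anything above 1 indicates excessive spill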

Page 133: Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Invent 2013

Disk IO

• Increase mapper buffer memory by increasing "io.sort.mb":

<property><name>io.sort.mb</name><value>200</value></property>

• Same logic applies to reducers
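A hedged sketch of setting io.sort.mb cluster-wide at launch with the configure-hadoop bootstrap action (on 2013-era AMIs, -m writes key=value pairs into mapred-site.xml):

./elastic-mapreduce --create --alive \
  --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
  --args "-m,io.sort.mb=200"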

Page 134: Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Invent 2013

Disk IO

• Monitor disk IO using Ganglia

• Look out for disk IO wait

Page 135: Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Invent 2013


Page 136: Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Invent 2013

Remember!

[Flow diagram: Run Job -> Find Bottlenecks (Ganglia: CPU, disk, memory) -> Address Bottleneck, by fine-tuning or changing the algorithm]

Page 137: Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Invent 2013

Please give us your feedback on this presentation

As a thank you, we will select prize winners daily for completed surveys!

BDT404