Deep Dive: Amazon Elastic MapReduce

Deep dive – Amazon Elastic MapReduce

Rahul Bhartia

Solution architect – Amazon Web Services

Handling five billion sessions per day at AnswersAndrew Jorgensen

@ajorgensen

Software engineer – Twitter

Agenda

• Amazon Elastic MapReduce (EMR)

• Amazon EMR: Leveraging Amazon Simple Storage Service (S3)

• Amazon EMR: Design patterns

• Amazon EMR: Storage optimizations

• Answers: Handling five billion sessions per day

• Takeaway

Amazon Elastic MapReduce (EMR)

Why Amazon EMR?

Easy to UseLaunch a cluster in minutes

Low CostPay an hourly rate

ElasticEasily add or remove capacity

ReliableSpend less time monitoring

SecureManage firewalls

FlexibleControl the cluster

Easy to deploy

AWS Management Console Command Line

Or use the Amazon EMR API with your favorite SDK.

Easy to monitor and debug

Integrated with Amazon CloudWatch

Monitor Cluster, Node, and IO

Monitor Debug

Hue

Amazon S3 and Hadoop distributed file system (HDFS)

Hue

Query Editor

Hue

Job Browser

Try different configurations to find your optimal architecture.

CPU

c3 family

cc1.4xlarge

cc2.8xlarge

Memory

m2 family

r3 family

Disk/IO

d2 family

i2 family

General

m1 family

m3 family

Choose your instance types

Batch Machine Spark and Large

process learning interactive HDFS

Easy to add and remove compute

capacity on your cluster.

Match compute

demands with

cluster sizing.

Resizable clusters

Spot Instances

for task nodes

Up to 86% lower

on average

off

on-demand

pricing

On-demand for

core nodes

Standard

Amazon EC2

pricing for

on-demand

capacity

Easy to use Spot Instances

Meet SLA at predictable cost Exceed SLA at lower cost

Use bootstrap actions to install applications…

https://github.com/awslabs/emr-bootstrap-actions

https://github.com/awslabs/emr-bootstrap-actions

…or to configure Hadoop

--bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop

--keyword-config-file (Merge values in new config to existing)

--keyword-key-value (Override values provided)

Configuration File

Name

Configuration File

Keyword

File Name

Shortcut

Key-Value Pair

Shortcut

core-site.xml core C c

hdfs-site.xml hdfs H h

mapred-site.xml mapred M m

yarn-site.xml yarn Y y

Read data directly into Hive,

Apache Pig, and Hadoop

Streaming and Cascading from

Amazon Kinesis streams

No intermediate data

persistence required

Simple way to introduce real-time sources into

batch-oriented systems

Multi-application support and automatic

checkpointing

Amazon EMR Integration with Amazon Kinesis

Amazon EMR: Leveraging Amazon S3

Amazon S3 as your persistent data store

• Amazon S3

– Designed for 99.999999999% durability

– Separate compute and storage

• Resize and shut down Amazon EMR clusters with no data loss

• Point multiple Amazon EMR clusters at same data in Amazon S3

EMRFS makes it easier to leverage Amazon S3

• Better performance and error handling options

• Transparent to applications – just read/write to “s3://”

• Consistent view

– For consistent list and read-after-write for new puts

• Support for Amazon S3 server-side and client-side encryption

• Faster listing using EMRFS metadata

EMRFS support for Amazon S3 client-side encryption

Amazon S3

Am

azo

n S

3 e

ncry

ptio

n c

lien

tsE

MR

FS

en

ab

led

for

Am

azo

n S

3 c

lien

t-sid

e e

ncry

ptio

n

Key vendor (AWS KMS or your custom key vendor)

(client-side encrypted objects)

Amazon S3 EMRFS metadata

in Amazon DynamoDB

• List and read-after-write consistency

• Faster list operations

Number of

objects

Without Consistent

Views

With Consistent

Views

1,000,000 147.72 29.70

100,000 12.70 3.69

Fast listing of Amazon S3 objects using

EMRFS metadata

*Tested using a single node cluster with a m3.xlarge instance.

Optimize to leverage HDFS

• Iterative workloads – If you’re processing the same dataset more than once

• Disk I/O intensive workloads

Persist data on Amazon S3 and use S3DistCp to

copy to HDFS for processing.

Amazon EMR: Design patterns

Amazon EMR example #1: Batch processing

GBs of logs pushed

to Amazon S3 hourlyDaily Amazon EMR

cluster using Hive to

process data

Input and output

stored in Amazon S3

250 Amazon EMR jobs per day, processing 30 TB of data

http://aws.amazon.com/solutions/case-studies/yelp/

http://aws.amazon.com/solutions/case-studies/yelp/

Amazon EMR example #2: Long-running cluster

Data pushed to

Amazon S3Daily Amazon EMR cluster

Extract, Transform, and Load

(ETL) data into database

24/7 Amazon EMR cluster

running HBase holds last 2

years’ worth of data

Front-end service uses

HBase cluster to power

dashboard with high

concurrency

Amazon EMR example #3: Interactive query

TBs of logs sent dailyLogs stored in

Amazon S3Amazon EMR cluster using Presto for ad hoc

analysis of entire log set

Interactive query using Presto on multipetabyte warehouse

http://techblog.netflix.com/2014/10/using-presto-in-our-big-

data-platform.html

http://techblog.netflix.com/2014/10/using-presto-in-our-big-data-platform.html

Amazon EMR: Storage optimizations

File formats

• Row oriented

– Text files

– Sequence files

• Writable object

– Avro data files

• Described by schema

• Columnar format

– Object Record Columnar (ORC)

– Parquet

Logical Table

Row oriented

Column oriented

Choosing the right file format

• Processing and query tools

– Hive, Impala, and Presto.

• Evolution of schema

– Avro for schema and Presto for storage.

• File format “splittability”

– JSON/XML as records instead of single object

• Compression

– Block or file.

File sizes

• Avoid small files

– Avoid anything smaller than 100 MB

• Each mapper processes a single File

• Fewer files, matching closely to block size

– Fewer calls to Amazon S3

– Fewer network/HDFS requests

Dealing with small files

• Reduce HDFS block size (e.g., 1 MB [default is 128 MB])

– --bootstrap-action s3://elasticmapreduce/bootstrap-

actions/configure-hadoop --args “-m,dfs.block.size=1048576”

• Better: use S3DistCp to combine smaller files together

– S3DistCp takes a pattern and target path to combine smaller

input files into larger ones

– Supply a target size and compression codec

Compression

• Always compress data files on Amazon S3

– Reduces network traffic between Amazon S3 and

Amazon EMR

– Speeds up your job

• Compress mappers and reducer output

Amazon EMR compresses internode traffic with LZO with

Hadoop 1, and Snappy with Hadoop 2.

Choosing the right compression

• Time sensitive: faster compressions are a better choice.

• Large amount of data: use space-efficient compressions.

• Combined workload: use Gzip.

Algorithm Splittable? Compression RatioCompress +

Decompress Speed

Gzip (DEFLATE) No High Medium

bzip2 Yes Very high Slow

LZO Yes Low Fast

Snappy No Low Very fast

Answers: Handling five billion sessions per day

Answers

Lambda architecture

Batch layer

Speed layer

Batch view

QueryReal-time view

Computation

S3DistCp

Computations

Input Amazon

S3 bucketIntermediate

Amazon S3

bucket

Final

Amazon S3

bucket

Final

Amazon S3

bucket

Final

Amazon S3

bucket

Computation

Computations

S3DistCp

Cascalog

Input Amazon


Amazon S3

bucket

Final

Amazon S3

bucket

Final

Amazon S3

bucket

Final

Amazon S3

bucket

Computation

Computations

S3DistCp

CascalogLZO

Input Amazon


Amazon S3

bucket

Final

Amazon S3

bucket

Final

Amazon S3

bucket

Final

Amazon S3

bucket

Compression

LZO is fast.

Compression

LZO is splittable.

Compression

LZO is built-in.

Compression

LZO improves performance.

Snapshotting

Result

Snapshot Snapshot

Result

Backfilling

Monday Tuesday Friday

…

Job scheduling

Orchestrator Amazon EMR

Job scheduling

Data Pipeline

Simple Workflow Service

Batch view

Data Pump

Orchestrator

What if my data is not in Amazon S3?

Amazon S3

What if my data is not in Amazon S3?

Amazon S3

Amazon S3

Takeaway

Cost-saving tips

• Use Amazon S3 as your persistent data store (only pay for compute when you need it!).

• Use Amazon EC2 Spot Instances (especially with task nodes) tosignificantly reduce the cost of running your clusters.

• Use Amazon EC2 Reserved Instances if you have steady workloads.

• Create CloudWatch alerts to notify you if a cluster is underutilized so that you can shut it down (e.g. Mappers running == 0 for more than N hours).

• Contact your sales rep about custom pricing options, if you are spending more than $10K per month on Amazon EMR.

Deep Dive: Amazon Elastic MapReduce

Technology

Transcript of Deep Dive: Amazon Elastic MapReduce