Deep Dive: Amazon Elastic MapReduce


Page 1: Deep Dive: Amazon Elastic MapReduce

Deep dive – Amazon Elastic MapReduce

Rahul Bhartia

Solution architect – Amazon Web Services

Handling five billion sessions per day at Answers

Andrew Jorgensen

@ajorgensen

Software engineer – Twitter

Page 2: Deep Dive: Amazon Elastic MapReduce

Agenda

• Amazon Elastic MapReduce (EMR)

• Amazon EMR: Leveraging Amazon Simple Storage Service (S3)

• Amazon EMR: Design patterns

• Amazon EMR: Storage optimizations

• Answers: Handling five billion sessions per day

• Takeaway

Page 3: Deep Dive: Amazon Elastic MapReduce

Amazon Elastic MapReduce (EMR)

Page 4: Deep Dive: Amazon Elastic MapReduce

Why Amazon EMR?

Easy to use: Launch a cluster in minutes

Low cost: Pay an hourly rate

Elastic: Easily add or remove capacity

Reliable: Spend less time monitoring

Secure: Manage firewalls

Flexible: Control the cluster

Page 5: Deep Dive: Amazon Elastic MapReduce

Easy to deploy

AWS Management Console

Command Line

Or use the Amazon EMR API with your favorite SDK.

Page 6: Deep Dive: Amazon Elastic MapReduce

Easy to monitor and debug

Integrated with Amazon CloudWatch

Monitor Cluster, Node, and IO

Monitor Debug

Page 7: Deep Dive: Amazon Elastic MapReduce

Hue

Amazon S3 and Hadoop distributed file system (HDFS)

Page 8: Deep Dive: Amazon Elastic MapReduce

Hue

Query Editor

Page 9: Deep Dive: Amazon Elastic MapReduce

Hue

Job Browser

Page 10: Deep Dive: Amazon Elastic MapReduce

Choose your instance types

Try different configurations to find your optimal architecture.

• CPU: c3 family, cc1.4xlarge, cc2.8xlarge

• Memory: m2 family, r3 family

• Disk/IO: d2 family, i2 family

• General: m1 family, m3 family

Workloads shown: batch process, machine learning, Spark and interactive, large HDFS

Page 11: Deep Dive: Amazon Elastic MapReduce

Resizable clusters

Easy to add and remove compute capacity on your cluster.

Match compute demands with cluster sizing.

Page 12: Deep Dive: Amazon Elastic MapReduce

Easy to use Spot Instances

• Spot Instances for task nodes: up to 86% lower on average than on-demand pricing. Exceed SLA at lower cost.

• On-demand for core nodes: standard Amazon EC2 pricing for on-demand capacity. Meet SLA at predictable cost.
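The pricing split on this slide can be made concrete with a little arithmetic. The sketch below is illustrative only: the node counts, the $0.50 hourly price, and the helper name `cluster_hourly_cost` are assumptions, the 86% figure is treated as an average discount rather than a guaranteed Spot price, and the separate Amazon EMR charge is ignored.

```python
def cluster_hourly_cost(core_nodes, task_nodes, on_demand_price, spot_discount):
    """Hourly EC2 cost of a cluster with on-demand core nodes and Spot task nodes.

    on_demand_price: hypothetical on-demand price per node-hour (USD).
    spot_discount:   fraction saved on Spot relative to on-demand (0.86 = 86%).
    """
    if not 0.0 <= spot_discount <= 1.0:
        raise ValueError("spot_discount must be between 0 and 1")
    spot_price = on_demand_price * (1.0 - spot_discount)
    return core_nodes * on_demand_price + task_nodes * spot_price


# Hypothetical cluster: 10 core nodes, 20 task nodes, $0.50 per node-hour.
all_on_demand = cluster_hourly_cost(10, 20, 0.50, 0.0)
with_spot = cluster_hourly_cost(10, 20, 0.50, 0.86)
print(f"all on-demand: ${all_on_demand:.2f}/h")
print(f"Spot task nodes: ${with_spot:.2f}/h")
```

Under these assumed numbers, the core nodes cost the same either way, so the saving comes entirely from the task nodes.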

Page 13: Deep Dive: Amazon Elastic MapReduce

Use bootstrap actions to install applications…

https://github.com/awslabs/emr-bootstrap-actions

Page 14: Deep Dive: Amazon Elastic MapReduce

…or to configure Hadoop

--bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop

--keyword-config-file (Merge values in new config to existing)

--keyword-key-value (Override values provided)

Configuration File Name   Configuration File Keyword   File Name Shortcut   Key-Value Pair Shortcut
core-site.xml             core                         C                    c
hdfs-site.xml             hdfs                         H                    h
mapred-site.xml           mapred                       M                    m
yarn-site.xml             yarn                         Y                    y
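As a rough illustration of how the key-value shortcuts combine into an argument string, here is a small helper. The function name `configure_hadoop_args` is hypothetical, and the single-dash, comma-separated form is assumed from the `-m,dfs.block.size=1048576` example that appears later in this deck.

```python
# Key-value pair shortcuts from the table above (lowercase = override one key).
KEY_VALUE_SHORTCUT = {
    "core-site.xml": "-c",
    "hdfs-site.xml": "-h",
    "mapred-site.xml": "-m",
    "yarn-site.xml": "-y",
}


def configure_hadoop_args(overrides):
    """Build a comma-separated args string for the configure-hadoop bootstrap action.

    overrides maps a configuration file name to a dict of key/value pairs.
    """
    parts = []
    for file_name, settings in overrides.items():
        flag = KEY_VALUE_SHORTCUT[file_name]  # raises KeyError for unknown files
        for key, value in settings.items():
            parts.extend([flag, f"{key}={value}"])
    return ",".join(parts)


args = configure_hadoop_args({"mapred-site.xml": {"dfs.block.size": 1048576}})
print(args)
```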

Page 15: Deep Dive: Amazon Elastic MapReduce

Amazon EMR integration with Amazon Kinesis

• Read data directly into Hive, Apache Pig, Hadoop Streaming, and Cascading from Amazon Kinesis streams

• No intermediate data persistence required

• Simple way to introduce real-time sources into batch-oriented systems

• Multi-application support and automatic checkpointing

Page 16: Deep Dive: Amazon Elastic MapReduce

Amazon EMR: Leveraging Amazon S3

Page 17: Deep Dive: Amazon Elastic MapReduce

Amazon S3 as your persistent data store

• Amazon S3

– Designed for 99.999999999% durability

– Separate compute and storage

• Resize and shut down Amazon EMR clusters with no data loss

• Point multiple Amazon EMR clusters at same data in Amazon S3

Page 18: Deep Dive: Amazon Elastic MapReduce

EMRFS makes it easier to leverage Amazon S3

• Better performance and error handling options

• Transparent to applications – just read/write to “s3://”

• Consistent view

– For consistent list and read-after-write for new puts

• Support for Amazon S3 server-side and client-side encryption

• Faster listing using EMRFS metadata

Page 19: Deep Dive: Amazon Elastic MapReduce

EMRFS support for Amazon S3 client-side encryption

Amazon S3 (client-side encrypted objects)

Amazon S3 encryption clients

EMRFS enabled for Amazon S3 client-side encryption

Key vendor (AWS KMS or your custom key vendor)

Page 20: Deep Dive: Amazon Elastic MapReduce

Fast listing of Amazon S3 objects using EMRFS metadata

Amazon S3 EMRFS metadata in Amazon DynamoDB

• List and read-after-write consistency

• Faster list operations

Number of objects   Without Consistent Views   With Consistent Views
1,000,000           147.72                     29.70
100,000             12.70                      3.69

*Tested using a single-node cluster with an m3.xlarge instance.
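To put the listing numbers in perspective, the ratio between the two columns can be computed directly. This is only arithmetic on the figures from the slide; the slide does not state the unit, so the values are treated as relative listing times.

```python
# Listing times from the table above: (without, with) consistent view.
listing_times = {
    1_000_000: (147.72, 29.70),
    100_000: (12.70, 3.69),
}

speedups = {
    objects: round(without / with_cv, 2)
    for objects, (without, with_cv) in listing_times.items()
}
for objects, factor in speedups.items():
    print(f"{objects:>9,} objects: about {factor}x faster with consistent view")
```

On these figures the listing is roughly five times faster at one million objects and about three and a half times faster at one hundred thousand.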

Page 21: Deep Dive: Amazon Elastic MapReduce

Optimize to leverage HDFS

• Iterative workloads – If you’re processing the same dataset more than once

• Disk I/O intensive workloads

Persist data on Amazon S3 and use S3DistCp to copy to HDFS for processing.

Page 22: Deep Dive: Amazon Elastic MapReduce

Amazon EMR: Design patterns

Page 23: Deep Dive: Amazon Elastic MapReduce

Amazon EMR example #1: Batch processing

• GBs of logs pushed to Amazon S3 hourly

• Daily Amazon EMR cluster using Hive to process data

• Input and output stored in Amazon S3

250 Amazon EMR jobs per day, processing 30 TB of data

http://aws.amazon.com/solutions/case-studies/yelp/

Page 24: Deep Dive: Amazon Elastic MapReduce

Amazon EMR example #2: Long-running cluster

• Data pushed to Amazon S3

• Daily Amazon EMR cluster: Extract, Transform, and Load (ETL) data into database

• 24/7 Amazon EMR cluster running HBase holds last 2 years’ worth of data

• Front-end service uses HBase cluster to power dashboard with high concurrency

Page 25: Deep Dive: Amazon Elastic MapReduce

Amazon EMR example #3: Interactive query

• TBs of logs sent daily

• Logs stored in Amazon S3

• Amazon EMR cluster using Presto for ad hoc analysis of entire log set

Interactive query using Presto on multipetabyte warehouse

http://techblog.netflix.com/2014/10/using-presto-in-our-big-data-platform.html

Page 26: Deep Dive: Amazon Elastic MapReduce

Amazon EMR: Storage optimizations

Page 27: Deep Dive: Amazon Elastic MapReduce

File formats

• Row oriented

– Text files

– Sequence files

  • Writable object

– Avro data files

  • Described by schema

• Columnar format

– Optimized Row Columnar (ORC)

– Parquet

Logical Table

Row oriented

Column oriented

Page 28: Deep Dive: Amazon Elastic MapReduce

Choosing the right file format

• Processing and query tools

– Hive, Impala, and Presto.

• Evolution of schema

– Avro for schema and Parquet for storage.

• File format “splittability”

– Store JSON/XML as individual records instead of a single object.

• Compression

– Block or file.

Page 29: Deep Dive: Amazon Elastic MapReduce

File sizes

• Avoid small files

– Avoid anything smaller than 100 MB

• Each mapper processes a single file

• Fewer files, matching closely to block size

– Fewer calls to Amazon S3

– Fewer network/HDFS requests
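The effect of file size on the number of mappers can be estimated with a simplified model. The sketch below assumes one mapper per file for anything that fits in a block and one mapper per block for larger splittable files; the function name `estimate_map_tasks` and the 10 GB example are hypothetical, and real input formats may behave differently.

```python
import math


def estimate_map_tasks(file_sizes_mb, block_size_mb=128, splittable=True):
    """Rough count of map tasks for a list of input file sizes (in MB).

    A file no larger than one block, or any non-splittable file, is
    handled by a single mapper; a larger splittable file gets roughly
    one mapper per block.
    """
    tasks = 0
    for size in file_sizes_mb:
        if splittable and size > block_size_mb:
            tasks += math.ceil(size / block_size_mb)
        else:
            tasks += 1
    return tasks


# The same 10 GB of input (10,240 MB), stored two different ways.
small_files = [1] * 10_240   # 10,240 files of 1 MB each
large_files = [128] * 80     # 80 files of 128 MB each
print(estimate_map_tasks(small_files), estimate_map_tasks(large_files))
```

In this model the small-file layout needs 10,240 mappers and the block-sized layout needs 80, for the same amount of data.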

Page 30: Deep Dive: Amazon Elastic MapReduce

Dealing with small files

• Reduce HDFS block size (e.g., 1 MB [default is 128 MB])

– --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop --args "-m,dfs.block.size=1048576"

• Better: use S3DistCp to combine smaller files together

– S3DistCp takes a pattern and target path to combine smaller input files into larger ones

– Supply a target size and compression codec
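A sketch of what such an S3DistCp invocation might look like, built as an argument list. The bucket, prefixes, and file-name pattern are hypothetical. `--src`, `--dest`, `--groupBy`, `--targetSize`, and `--outputCodec` are S3DistCp options; the `s3-dist-cp` command name is an assumption that applies to newer Amazon EMR releases, while older AMIs run S3DistCp as a JAR step instead.

```python
import shlex


def s3distcp_combine_args(src, dest, group_by, target_size_mib, output_codec):
    """Return an s3-dist-cp argument list that merges small files.

    group_by is a regular expression; files whose captured group matches
    are concatenated into one output file.
    """
    return [
        "s3-dist-cp",
        "--src", src,
        "--dest", dest,
        "--groupBy", group_by,
        "--targetSize", str(target_size_mib),
        "--outputCodec", output_codec,
    ]


# Hypothetical layout: hourly logs named like 2015-04-09-13.log,
# grouped by date into files of roughly 128 MiB, compressed with LZO.
args = s3distcp_combine_args(
    src="s3://example-bucket/logs/raw/",
    dest="s3://example-bucket/logs/combined/",
    group_by=r".*/(\d{4}-\d{2}-\d{2})-\d{2}\.log",
    target_size_mib=128,
    output_codec="lzo",
)
print(shlex.join(args))
```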

Page 31: Deep Dive: Amazon Elastic MapReduce

Compression

• Always compress data files on Amazon S3

– Reduces network traffic between Amazon S3 and Amazon EMR

– Speeds up your job

• Compress mapper and reducer output

Amazon EMR compresses internode traffic with LZO on Hadoop 1 and with Snappy on Hadoop 2.

Page 32: Deep Dive: Amazon Elastic MapReduce

Choosing the right compression

• Time sensitive: faster compressions are a better choice.

• Large amount of data: use space-efficient compressions.

• Combined workload: use Gzip.

Algorithm        Splittable?   Compression Ratio   Compress + Decompress Speed
Gzip (DEFLATE)   No            High                Medium
bzip2            Yes           Very high           Slow
LZO              Yes           Low                 Fast
Snappy           No            Low                 Very fast
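The guidance above can be expressed as a small lookup. This is a sketch that encodes only the table and the three rules of thumb from this slide; the function name `choose_codec`, the priority labels, and the numeric rankings are assumptions made for illustration.

```python
# Rows of the table above.
CODECS = {
    "gzip": {"splittable": False, "ratio": "high", "speed": "medium"},
    "bzip2": {"splittable": True, "ratio": "very high", "speed": "slow"},
    "lzo": {"splittable": True, "ratio": "low", "speed": "fast"},
    "snappy": {"splittable": False, "ratio": "low", "speed": "very fast"},
}

RATIO_RANK = {"low": 0, "high": 1, "very high": 2}
SPEED_RANK = {"slow": 0, "medium": 1, "fast": 2, "very fast": 3}


def choose_codec(priority, need_splittable=False):
    """Pick a codec for priority 'speed', 'space', or 'balanced'."""
    candidates = {
        name: props
        for name, props in CODECS.items()
        if props["splittable"] or not need_splittable
    }
    if priority == "speed":
        return max(candidates, key=lambda n: SPEED_RANK[candidates[n]["speed"]])
    if priority == "space":
        return max(candidates, key=lambda n: RATIO_RANK[candidates[n]["ratio"]])
    if priority == "balanced":
        # The slide recommends Gzip for combined workloads, but Gzip
        # cannot satisfy a splittability requirement.
        if "gzip" not in candidates:
            raise ValueError("Gzip is not splittable")
        return "gzip"
    raise ValueError(f"unknown priority: {priority!r}")


print(choose_codec("speed"), choose_codec("speed", need_splittable=True))
print(choose_codec("space"), choose_codec("balanced"))
```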

Page 33: Deep Dive: Amazon Elastic MapReduce

Answers: Handling five billion sessions per day

Page 34: Deep Dive: Amazon Elastic MapReduce

Answers

Page 35: Deep Dive: Amazon Elastic MapReduce

Lambda architecture

Batch layer

Speed layer

Batch view

Query

Real-time view
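In a Lambda architecture, a query is typically answered by combining the batch view with the real-time view. The sketch below shows that idea in its simplest form for additive counts; the function name, the app identifiers, and the session counts are all hypothetical, and it is not a description of how Answers implements its query layer.

```python
def query_sessions(batch_view, realtime_view, app_id):
    """Answer a query by combining the batch view with the real-time view.

    batch_view:    counts computed by the batch layer up to its last run.
    realtime_view: counts the speed layer has seen since that run.
    """
    return batch_view.get(app_id, 0) + realtime_view.get(app_id, 0)


# Hypothetical per-app session counts.
batch_view = {"app-1": 1_000_000, "app-2": 250_000}
realtime_view = {"app-1": 1_200, "app-3": 40}

print(query_sessions(batch_view, realtime_view, "app-1"))
```

The sum only works because session counts are additive and the two views are assumed to cover non-overlapping time ranges.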

Page 36: Deep Dive: Amazon Elastic MapReduce

Computation

Diagram labels: Input Amazon S3 bucket, S3DistCp, Intermediate Amazon S3 bucket, Computations, three Final Amazon S3 buckets

Page 37: Deep Dive: Amazon Elastic MapReduce

Computation

Diagram labels: Input Amazon S3 bucket, S3DistCp, Intermediate Amazon S3 bucket, Computations, Cascalog, three Final Amazon S3 buckets

Page 38: Deep Dive: Amazon Elastic MapReduce

Computation

Diagram labels: Input Amazon S3 bucket, S3DistCp, LZO, Intermediate Amazon S3 bucket, Computations, Cascalog, three Final Amazon S3 buckets

Page 39: Deep Dive: Amazon Elastic MapReduce

Compression

LZO is fast.

Page 40: Deep Dive: Amazon Elastic MapReduce

Compression

LZO is splittable.

Page 41: Deep Dive: Amazon Elastic MapReduce

Compression

LZO is built-in.

Page 42: Deep Dive: Amazon Elastic MapReduce

Compression

LZO improves performance.

Page 43: Deep Dive: Amazon Elastic MapReduce

Snapshotting

Result

Snapshot Snapshot

Result

Page 44: Deep Dive: Amazon Elastic MapReduce

Backfilling

Monday Tuesday Friday

Page 45: Deep Dive: Amazon Elastic MapReduce

Job scheduling

Orchestrator Amazon EMR

Page 46: Deep Dive: Amazon Elastic MapReduce

Job scheduling

Data Pipeline

Simple Workflow Service

Page 47: Deep Dive: Amazon Elastic MapReduce

Batch view

Data Pump

Orchestrator

Page 48: Deep Dive: Amazon Elastic MapReduce

What if my data is not in Amazon S3?

Amazon S3

Page 49: Deep Dive: Amazon Elastic MapReduce

What if my data is not in Amazon S3?

Amazon S3

Page 50: Deep Dive: Amazon Elastic MapReduce

What if my data is not in Amazon S3?

Amazon S3

Amazon S3

Page 51: Deep Dive: Amazon Elastic MapReduce

Takeaway

Page 52: Deep Dive: Amazon Elastic MapReduce

Cost-saving tips

• Use Amazon S3 as your persistent data store (only pay for compute when you need it!).

• Use Amazon EC2 Spot Instances (especially with task nodes) to significantly reduce the cost of running your clusters.

• Use Amazon EC2 Reserved Instances if you have steady workloads.

• Create CloudWatch alarms to notify you if a cluster is underutilized so that you can shut it down (e.g., mappers running == 0 for more than N hours).

• Contact your sales rep about custom pricing options if you are spending more than $10K per month on Amazon EMR.
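The underutilization check in the tips above ("mappers running == 0 for more than N hours") boils down to a simple test over recent metric datapoints. The sketch below shows that logic on a plain list; the function name `is_idle`, the 5-minute period, and the sample data are assumptions, and in practice the same condition would be configured as a CloudWatch alarm rather than evaluated by hand.

```python
def is_idle(mappers_running, period_minutes, idle_hours):
    """True if the most recent datapoints were all zero for idle_hours.

    mappers_running: datapoints ordered oldest to newest, one per period.
    """
    needed = (idle_hours * 60) // period_minutes
    if needed <= 0 or len(mappers_running) < needed:
        return False  # not enough history to decide
    return all(value == 0 for value in mappers_running[-needed:])


# Hypothetical 5-minute datapoints: one busy hour, then two idle hours.
datapoints = [4] * 12 + [0] * 24
print(is_idle(datapoints, period_minutes=5, idle_hours=2))
print(is_idle(datapoints, period_minutes=5, idle_hours=3))
```

With this sample data the cluster counts as idle for a two-hour threshold but not for a three-hour one, because the three-hour window still includes the busy hour.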

Page 53: Deep Dive: Amazon Elastic MapReduce

SAN FRANCISCO

©2015, Amazon Web Services, Inc. or its affiliates. All rights reserved