(SDD401) Amazon Elastic MapReduce Deep Dive and Best Practices | AWS re:Invent 2014

November 13th, 2014 | Las Vegas, NV | Ian Meyers, Amazon Web Services

description

Amazon Elastic MapReduce is one of the largest Hadoop operators in the world. Since its launch five years ago, AWS customers have launched more than 5.5 million Hadoop clusters. In this talk, we introduce you to Amazon EMR design patterns such as using Amazon S3 instead of HDFS, taking advantage of both long and short-lived clusters and other Amazon EMR architectural patterns. We talk about how to scale your cluster up or down dynamically and introduce you to ways you can fine-tune your cluster. We also share best practices to keep your Amazon EMR cluster cost efficient.

Transcript of (SDD401) Amazon Elastic MapReduce Deep Dive and Best Practices | AWS re:Invent 2014

Page 1: (SDD401) Amazon Elastic MapReduce Deep Dive and Best Practices | AWS re:Invent 2014

November 13th, 2014 | Las Vegas, NV

Ian Meyers, Amazon Web Services

Page 2: (SDD401) Amazon Elastic MapReduce Deep Dive and Best Practices | AWS re:Invent 2014

Compute Storage

AWS Global Infrastructure

Database

App Services

Deployment & Administration

Networking

Analytics

Amazon Elastic MapReduce

Managed, elastic Hadoop (1.x & 2.x) cluster

Integrates with Amazon S3, Amazon DynamoDB, Amazon Kinesis and Amazon Redshift

Install Storm, Spark, Presto, Hive, Pig, Impala, & end-user tools automatically

Native support for Spot Instances

Integrated HBase NoSQL database

Amazon EMR

Page 3: (SDD401) Amazon Elastic MapReduce Deep Dive and Best Practices | AWS re:Invent 2014
Page 4: (SDD401) Amazon Elastic MapReduce Deep Dive and Best Practices | AWS re:Invent 2014
Page 5: (SDD401) Amazon Elastic MapReduce Deep Dive and Best Practices | AWS re:Invent 2014
Page 6: (SDD401) Amazon Elastic MapReduce Deep Dive and Best Practices | AWS re:Invent 2014

--bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop

--keyword-config-file – merge values in new config to existing

--keyword-key-value – override values provided

Configuration File Name   Configuration File Keyword   File Name Shortcut   Key-Value Pair Shortcut
core-site.xml             core                         C                    c
hdfs-site.xml             hdfs                         H                    h
mapred-site.xml           mapred                       M                    m
yarn-site.xml             yarn                         Y                    y
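As a rough illustration of how the key-value shortcuts in the table might be assembled into an argument string for the configure-hadoop bootstrap action, here is a minimal Python sketch. It assumes each shortcut is passed as a single-dash flag followed by a `key=value` pair, with all arguments comma-separated. The helper name `key_value_args` and the example property are illustrative and do not come from the slides.

```python
# Shortcut letters from the table above: (file name shortcut, key-value shortcut).
SHORTCUTS = {
    "core-site.xml": ("C", "c"),
    "hdfs-site.xml": ("H", "h"),
    "mapred-site.xml": ("M", "m"),
    "yarn-site.xml": ("Y", "y"),
}


def key_value_args(overrides):
    """Build a comma-separated argument string from
    {config file name: {property: value}} using the key-value shortcuts.

    Assumption: each override is written as "-<shortcut>,<key>=<value>".
    """
    args = []
    for file_name, props in overrides.items():
        flag = "-" + SHORTCUTS[file_name][1]
        for key, value in props.items():
            args.append(flag)
            args.append(f"{key}={value}")
    return ",".join(args)


# Hypothetical override: limit each task tracker to two concurrent map tasks.
example = key_value_args(
    {"mapred-site.xml": {"mapred.tasktracker.map.tasks.maximum": 2}}
)
print(example)
```

The resulting string is what would follow `--args` after the bootstrap action path. Check the exact flag syntax against the EMR documentation for your AMI version before relying on it.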

Page 7: (SDD401) Amazon Elastic MapReduce Deep Dive and Best Practices | AWS re:Invent 2014

Set number of mappers per task tracker

Useful for small memory footprint map tasks

More work done with a given instance

Page 8: (SDD401) Amazon Elastic MapReduce Deep Dive and Best Practices | AWS re:Invent 2014

Set HDFS block size to 1MB

Useful for smaller files when HDFS is used

Page 9: (SDD401) Amazon Elastic MapReduce Deep Dive and Best Practices | AWS re:Invent 2014

Reuse mappers

Mapper startup time ~ 2-20 seconds

Useful for tasks with large number of mappers

Mappers must be “clean” after run (relevant for Java)

Page 10: (SDD401) Amazon Elastic MapReduce Deep Dive and Best Practices | AWS re:Invent 2014

Configure process heap size, Java opts, and allow for replacing the hadoop-user-env.sh

Hadoop 1

Hadoop 2

--bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-daemons --args --{namenode}-heap-size=2048,--{namenode}-opts=-XX:GCTimeRatio=19

Page 11: (SDD401) Amazon Elastic MapReduce Deep Dive and Best Practices | AWS re:Invent 2014
Page 12: (SDD401) Amazon Elastic MapReduce Deep Dive and Best Practices | AWS re:Invent 2014
Page 13: (SDD401) Amazon Elastic MapReduce Deep Dive and Best Practices | AWS re:Invent 2014

EMRFS

HDFS

Amazon EMR

Amazon S3

Amazon DynamoDB

Processed Files

Registry

File Data

Page 14: (SDD401) Amazon Elastic MapReduce Deep Dive and Best Practices | AWS re:Invent 2014
Page 15: (SDD401) Amazon Elastic MapReduce Deep Dive and Best Practices | AWS re:Invent 2014
Page 16: (SDD401) Amazon Elastic MapReduce Deep Dive and Best Practices | AWS re:Invent 2014
Page 17: (SDD401) Amazon Elastic MapReduce Deep Dive and Best Practices | AWS re:Invent 2014

55

Page 18: (SDD401) Amazon Elastic MapReduce Deep Dive and Best Practices | AWS re:Invent 2014

5

Page 19: (SDD401) Amazon Elastic MapReduce Deep Dive and Best Practices | AWS re:Invent 2014
Page 20: (SDD401) Amazon Elastic MapReduce Deep Dive and Best Practices | AWS re:Invent 2014

≈60sec * 15MB/s ≈ 1GB

Page 21: (SDD401) Amazon Elastic MapReduce Deep Dive and Best Practices | AWS re:Invent 2014
Page 22: (SDD401) Amazon Elastic MapReduce Deep Dive and Best Practices | AWS re:Invent 2014

aws emr add-steps --cluster-id <cluster>
--steps Name=GroupSmallFiles,
Type=CUSTOM_JAR,
Args=files,home/hadoop/lib/emr-s3distcp-1.0.jar,
src,s3://myawsbucket/cf,
dest,hdfs:///local,
groupBy,.*(i-\w.log).*,
targetSize,128…
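To make the effect of the groupBy pattern more concrete, the following Python sketch buckets file names by the text captured in the pattern's parentheses. This is the idea behind combining many small files into fewer large ones. It only imitates the grouping logic locally: the file names are hypothetical, and skipping non-matching files is an assumption rather than confirmed S3DistCp behaviour.

```python
import re
from collections import defaultdict

# The groupBy pattern from the step above.
GROUP_BY = r".*(i-\w.log).*"


def group_files(names, pattern=GROUP_BY):
    """Group file names by the first capture group of the pattern."""
    groups = defaultdict(list)
    for name in names:
        match = re.match(pattern, name)
        if match is None:
            continue  # assumption: files that do not match are left out
        groups[match.group(1)].append(name)
    return dict(groups)


# Hypothetical object keys under s3://myawsbucket/cf
sample = [
    "cf/i-a.log.2014-11-13-01",
    "cf/i-a.log.2014-11-13-02",
    "cf/i-b.log.2014-11-13-01",
    "cf/readme.txt",
]
groups = group_files(sample)
for key, members in groups.items():
    print(key, len(members))
```

Each key in `groups` corresponds to one combined output file. Note that `\w` matches a single character, so real instance IDs would need a pattern such as `\w+`.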

Page 23: (SDD401) Amazon Elastic MapReduce Deep Dive and Best Practices | AWS re:Invent 2014
Page 24: (SDD401) Amazon Elastic MapReduce Deep Dive and Best Practices | AWS re:Invent 2014

Algorithm   % Space Remaining   Encoding Speed   Decoding Speed
GZIP        13%                 21MB/s           118MB/s
LZO         20%                 135MB/s          410MB/s
Snappy      22%                 172MB/s          409MB/s
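The trade-off in the table can be explored with a small Python sketch that estimates compressed size and encode/decode time for a given input. It assumes the quoted speeds are measured against the uncompressed data volume and that the ratios hold for your data. Neither assumption is stated on the slide, so treat the output as a rough comparison only.

```python
# (fraction of space remaining, encode MB/s, decode MB/s) from the table above.
CODECS = {
    "GZIP": (0.13, 21, 118),
    "LZO": (0.20, 135, 410),
    "Snappy": (0.22, 172, 409),
}


def estimate(codec, input_mb):
    """Rough size/time estimate for compressing input_mb of data.

    Assumption: speeds apply to uncompressed megabytes.
    """
    remaining, encode_mb_s, decode_mb_s = CODECS[codec]
    return {
        "compressed_mb": input_mb * remaining,
        "encode_s": input_mb / encode_mb_s,
        "decode_s": input_mb / decode_mb_s,
    }


for name in CODECS:
    result = estimate(name, 1000)
    print(name, {k: round(v, 1) for k, v in result.items()})
```

Under these assumptions GZIP gives the smallest output but is by far the slowest to encode, while LZO and Snappy trade some space for much higher throughput.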

Page 25: (SDD401) Amazon Elastic MapReduce Deep Dive and Best Practices | AWS re:Invent 2014

-outputCodec,lzo

Page 26: (SDD401) Amazon Elastic MapReduce Deep Dive and Best Practices | AWS re:Invent 2014
Page 27: (SDD401) Amazon Elastic MapReduce Deep Dive and Best Practices | AWS re:Invent 2014

Amazon EMR Cluster

Task Instance Group

Core Instance Group

HDFS

HDFS

Amazon S3

Page 28: (SDD401) Amazon Elastic MapReduce Deep Dive and Best Practices | AWS re:Invent 2014

HUGE Benefit!!

Page 29: (SDD401) Amazon Elastic MapReduce Deep Dive and Best Practices | AWS re:Invent 2014
Page 30: (SDD401) Amazon Elastic MapReduce Deep Dive and Best Practices | AWS re:Invent 2014

EMR

EMR

Amazon S3

Page 31: (SDD401) Amazon Elastic MapReduce Deep Dive and Best Practices | AWS re:Invent 2014
Page 32: (SDD401) Amazon Elastic MapReduce Deep Dive and Best Practices | AWS re:Invent 2014
Page 33: (SDD401) Amazon Elastic MapReduce Deep Dive and Best Practices | AWS re:Invent 2014

Amazon EMR Cluster

Task Instance Group

Core Instance Group

HDFS

HDFS

Amazon S3

Page 34: (SDD401) Amazon Elastic MapReduce Deep Dive and Best Practices | AWS re:Invent 2014
Page 35: (SDD401) Amazon Elastic MapReduce Deep Dive and Best Practices | AWS re:Invent 2014

S3DistCp

Page 36: (SDD401) Amazon Elastic MapReduce Deep Dive and Best Practices | AWS re:Invent 2014
Page 37: (SDD401) Amazon Elastic MapReduce Deep Dive and Best Practices | AWS re:Invent 2014

S3DistCp

Page 38: (SDD401) Amazon Elastic MapReduce Deep Dive and Best Practices | AWS re:Invent 2014
Page 39: (SDD401) Amazon Elastic MapReduce Deep Dive and Best Practices | AWS re:Invent 2014
Page 40: (SDD401) Amazon Elastic MapReduce Deep Dive and Best Practices | AWS re:Invent 2014
Page 41: (SDD401) Amazon Elastic MapReduce Deep Dive and Best Practices | AWS re:Invent 2014

EMR

HDFS

Pig

Page 42: (SDD401) Amazon Elastic MapReduce Deep Dive and Best Practices | AWS re:Invent 2014

Hive 0.13.1
• Support for ORC
• Window functions
• Decimal types
• TRUNCATE command
• Better optimiser (less need for hinting)

Pig 0.12.0
• Streaming UDFs not written in Java
• Native support for Avro
• Native support for Parquet
• Improved data types

Impala 1.1
• In-memory SQL engine
• Support for HBase tables
• Support for Parquet – column-oriented file format
• Query and interactive shells

HBase 0.94.18
• Database Snapshotting
• Improved read caching and seek optimisation
• Improved transactions

Page 43: (SDD401) Amazon Elastic MapReduce Deep Dive and Best Practices | AWS re:Invent 2014

Read Data Directly into Hive, Pig, Streaming and Cascading from Kinesis Streams

No Intermediate Data Persistence Required

Simple way to introduce real time sources into Batch Oriented Systems

Multi-Application Support & Automatic Checkpointing

Amazon EMR Integration with Amazon Kinesis

Page 44: (SDD401) Amazon Elastic MapReduce Deep Dive and Best Practices | AWS re:Invent 2014

drop table call_data_records;

CREATE TABLE call_data_records (
  start_time bigint,
  end_time bigint,
  phone_number STRING,
  carrier STRING,
  recorded_duration bigint,
  calculated_duration bigint,
  lat double,
  long double
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ","
STORED BY
'com.amazon.emr.kinesis.hive.KinesisStorageHandler'
TBLPROPERTIES("kinesis.stream.name"="TestAggregatorStream");

Amazon EMR Integration with Amazon Kinesis

Page 45: (SDD401) Amazon Elastic MapReduce Deep Dive and Best Practices | AWS re:Invent 2014
Page 46: (SDD401) Amazon Elastic MapReduce Deep Dive and Best Practices | AWS re:Invent 2014
Page 47: (SDD401) Amazon Elastic MapReduce Deep Dive and Best Practices | AWS re:Invent 2014
Page 48: (SDD401) Amazon Elastic MapReduce Deep Dive and Best Practices | AWS re:Invent 2014
Page 49: (SDD401) Amazon Elastic MapReduce Deep Dive and Best Practices | AWS re:Invent 2014
Page 50: (SDD401) Amazon Elastic MapReduce Deep Dive and Best Practices | AWS re:Invent 2014
Page 51: (SDD401) Amazon Elastic MapReduce Deep Dive and Best Practices | AWS re:Invent 2014

EC2 Instance               Map Tasks   Reduce Tasks
m1.small                   2           1
m1.large                   3           1
m1.xlarge                  8           3
m2.xlarge                  3           1
m2.2xlarge                 6           2
m2.4xlarge                 14          4
m3.xlarge                  6           1
m3.2xlarge                 12          3
cg1.4xlarge                12          3
cc2.8xlarge                24          6
c3.4xlarge                 24          6
hi1.4xlarge                24          6
hs1.8xlarge                24          6
cr1.8xlarge & c3.8xlarge   48          12

[Chart legend: Memory (GB), Mappers*, Reducers*, CPU (ECU Units), Local Storage (GB)]

Page 52: (SDD401) Amazon Elastic MapReduce Deep Dive and Best Practices | AWS re:Invent 2014
Page 53: (SDD401) Amazon Elastic MapReduce Deep Dive and Best Practices | AWS re:Invent 2014
Page 54: (SDD401) Amazon Elastic MapReduce Deep Dive and Best Practices | AWS re:Invent 2014

Instance     Cost / Map Task   Cost / Reduce Task
m1.large     $0.08             $0.15
m1.xlarge    $0.06             $0.15
m3.xlarge    $0.04             $0.07
m3.2xlarge   $0.04             $0.07

Page 55: (SDD401) Amazon Elastic MapReduce Deep Dive and Best Practices | AWS re:Invent 2014

Instance     Cost / Map Task   Cost / Reduce Task
c1.medium    $0.13             $0.13
c1.xlarge    $0.35             $0.70
c3.xlarge    $0.05             $0.11
c3.2xlarge   $0.05             $0.11

Page 56: (SDD401) Amazon Elastic MapReduce Deep Dive and Best Practices | AWS re:Invent 2014
Page 57: (SDD401) Amazon Elastic MapReduce Deep Dive and Best Practices | AWS re:Invent 2014
Page 58: (SDD401) Amazon Elastic MapReduce Deep Dive and Best Practices | AWS re:Invent 2014

Estimated number of nodes =

(Total tasks * Time to process sample files) / (Instance task capacity * Desired processing time)

Page 59: (SDD401) Amazon Elastic MapReduce Deep Dive and Best Practices | AWS re:Invent 2014

1. Estimate the number of tasks your job requires

150

2. Pick an instance and note down the number of Tasks it can run in parallel

m1.xlarge with 8 task capacity per instance

Page 60: (SDD401) Amazon Elastic MapReduce Deep Dive and Best Practices | AWS re:Invent 2014

3. Pick some sample data files to run a test workload. The number of sample files should equal the per-instance task capacity noted in step #2.

8 files selected for our sample test

Page 61: (SDD401) Amazon Elastic MapReduce Deep Dive and Best Practices | AWS re:Invent 2014

4. Run an Amazon EMR cluster with a single core node and process your sample files from #3. Note down the amount of time taken to process this dataset.

3 min to process 8 files

Page 62: (SDD401) Amazon Elastic MapReduce Deep Dive and Best Practices | AWS re:Invent 2014

Estimated number of nodes =

(Total tasks for your job * Time to process sample files) / (Per instance task capacity * Desired processing time)

(150 * 3 min) / (8 * 5 min) ≈ 11 m1.xlarge
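The sizing arithmetic from the steps above can be written as a short Python sketch. The function and parameter names are illustrative. The inputs are the figures from the worked example: 150 tasks, 3 minutes per sample run, 8 tasks per m1.xlarge, and a 5 minute target.

```python
import math


def estimate_nodes(total_tasks, sample_minutes, task_capacity, desired_minutes):
    """Estimated node count:
    (total tasks * sample time) / (per-instance task capacity * desired time).
    """
    return (total_tasks * sample_minutes) / (task_capacity * desired_minutes)


raw = estimate_nodes(
    total_tasks=150, sample_minutes=3, task_capacity=8, desired_minutes=5
)
print(raw)             # 11.25
print(math.ceil(raw))  # 12
```

The formula yields 11.25, which the slide reports as 11 m1.xlarge instances. Rounding up to 12 would be the conservative choice if the 5 minute target is a hard limit.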

Page 63: (SDD401) Amazon Elastic MapReduce Deep Dive and Best Practices | AWS re:Invent 2014
Page 64: (SDD401) Amazon Elastic MapReduce Deep Dive and Best Practices | AWS re:Invent 2014
Page 65: (SDD401) Amazon Elastic MapReduce Deep Dive and Best Practices | AWS re:Invent 2014

Master instance group

Amazon EMR cluster

HDFS HDFS

Run TaskTrackers

(Compute)

Run DataNode

(HDFS)

Core instance group

Page 66: (SDD401) Amazon Elastic MapReduce Deep Dive and Best Practices | AWS re:Invent 2014

Can add core nodes

More HDFS space

More CPU/memory

Master instance group

Amazon EMR cluster

HDFS HDFS HDFS

Core instance group

Page 67: (SDD401) Amazon Elastic MapReduce Deep Dive and Best Practices | AWS re:Invent 2014

Can’t remove core nodes because of HDFS

Master instance group

HDFS HDFS HDFS

Amazon EMR cluster

Core instance group

Page 68: (SDD401) Amazon Elastic MapReduce Deep Dive and Best Practices | AWS re:Invent 2014

Run TaskTrackers

No HDFS

Reads from core node

HDFS

Master instance group

HDFS HDFS

Amazon EMR cluster

Task instance group

Core instance group

Page 69: (SDD401) Amazon Elastic MapReduce Deep Dive and Best Practices | AWS re:Invent 2014

Can add task nodes

Master instance group

HDFS HDFS

Amazon EMR cluster

Task instance group

Core instance group

Page 70: (SDD401) Amazon Elastic MapReduce Deep Dive and Best Practices | AWS re:Invent 2014

More CPU power

More memory

Master instance group

HDFS HDFS

Amazon EMR cluster

Task instance group

Core instance group

Page 71: (SDD401) Amazon Elastic MapReduce Deep Dive and Best Practices | AWS re:Invent 2014

You can remove task nodes when processing is completed

Task instance group

Master instance group

Core instance group

HDFS HDFS

Amazon EMR cluster

Page 72: (SDD401) Amazon Elastic MapReduce Deep Dive and Best Practices | AWS re:Invent 2014

You can remove task nodes when processing is completed

Master instance group

HDFS HDFS

Amazon EMR cluster

Task instance group

Core instance group

Page 73: (SDD401) Amazon Elastic MapReduce Deep Dive and Best Practices | AWS re:Invent 2014
Page 74: (SDD401) Amazon Elastic MapReduce Deep Dive and Best Practices | AWS re:Invent 2014
Page 75: (SDD401) Amazon Elastic MapReduce Deep Dive and Best Practices | AWS re:Invent 2014

Amazon CloudWatch

Page 76: (SDD401) Amazon Elastic MapReduce Deep Dive and Best Practices | AWS re:Invent 2014
Page 77: (SDD401) Amazon Elastic MapReduce Deep Dive and Best Practices | AWS re:Invent 2014
Page 78: (SDD401) Amazon Elastic MapReduce Deep Dive and Best Practices | AWS re:Invent 2014
Page 79: (SDD401) Amazon Elastic MapReduce Deep Dive and Best Practices | AWS re:Invent 2014
Page 80: (SDD401) Amazon Elastic MapReduce Deep Dive and Best Practices | AWS re:Invent 2014

http://bit.ly/awsevals