Deep Dive: Amazon Elastic MapReduce

©2015, Amazon Web Services, Inc. or its affiliates. All rights reserved

Deep Dive: Amazon EMR

Matt Yanchyshyn - Sr. Manager, Solutions Architecture

Why Amazon EMR?

Easy to use: launch a cluster in minutes

Low cost: pay an hourly rate

Elastic: easily add or remove capacity

Reliable: spend less time monitoring

Secure: managed firewalls

Flexible: you control the cluster

Easy to deploy

AWS Management Console or command line

or use the EMR API with your favorite SDK

Easy to monitor and debug

Monitor Debug

integrated with Amazon CloudWatch

monitor cluster, node, and IO

Try different configurations to find your optimal architecture

CPU

c3 family

cc1.4xlarge

cc2.8xlarge

Memory

m2 family

r3 family

Disk/IO

d2 family

i2 family

General

m1 family

m3 family

Choose your instance types

Batch process

Machine learning

Spark and interactive

Large HDFS

Easy to add and remove compute capacity on your cluster

Match compute demands with cluster sizing

Resizable clusters

Easy to use Spot Instances

Spot for task nodes: up to 90% off EC2 on-demand pricing

On-demand for core nodes: standard Amazon EC2 pricing for on-demand capacity

Meet SLA at predictable cost; exceed SLA at lower cost
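The split between on-demand core nodes and Spot task nodes can be turned into a rough cost estimate. The sketch below is illustrative only: the node counts, the $0.50 hourly price, and the function name are assumptions, the 90% discount is the upper bound quoted on the slide rather than a typical Spot price, and the Amazon EMR service charge is left out.

```python
def hourly_cluster_cost(core_nodes, task_nodes, on_demand_price, spot_discount):
    """Return (all_on_demand, mixed) hourly EC2 cost for a cluster.

    Core nodes are billed at the on-demand price; task nodes are billed
    at the on-demand price reduced by spot_discount (0.0 to 1.0).
    """
    if not 0.0 <= spot_discount <= 1.0:
        raise ValueError("spot_discount must be between 0 and 1")
    all_on_demand = (core_nodes + task_nodes) * on_demand_price
    spot_price = on_demand_price * (1.0 - spot_discount)
    mixed = core_nodes * on_demand_price + task_nodes * spot_price
    return all_on_demand, mixed


# Hypothetical numbers: 4 core nodes, 6 task nodes, $0.50/hour on demand,
# and a 90% Spot discount (the best case, not a guaranteed price).
baseline, mixed = hourly_cluster_cost(4, 6, 0.50, 0.90)
savings = 1.0 - mixed / baseline
print(f"on-demand only: ${baseline:.2f}/h, mixed: ${mixed:.2f}/h, saved: {savings:.0%}")
```

Because only the task nodes run on Spot, the overall saving depends on how much of the cluster is task capacity.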

Amazon EMR integration with Amazon Kinesis

Read data directly into Hive, Pig, streaming, and Cascading from Amazon Kinesis streams

No intermediate data persistence required

Simple way to introduce real-time sources into batch-oriented systems

Multi-application support and automatic checkpointing

The Hadoop ecosystem can run in Amazon EMR

Bootstrap Actions

Use bootstrap actions to install applications or to configure Hadoop

--bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop

--keyword-config-file – merge values in new config to existing

--keyword-key-value – override values provided

Configuration file name   Configuration file keyword   File name shortcut   Key-value pair shortcut
core-site.xml             core                         C                    c
hdfs-site.xml             hdfs                         H                    h
mapred-site.xml           mapred                       M                    m
yarn-site.xml             yarn                         Y                    y
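The shortcut letters in the table above map onto the argument string passed to the configure-hadoop bootstrap action. As a minimal sketch, the helpers below build that string in plain Python. The function names and the bucket path are hypothetical; the `-m,key=value` form is the one used in this deck, and the upper-case form assumes the file shortcut is followed by a file location in the same comma-separated style.

```python
# Shortcut letters from the table: upper case merges a config file,
# lower case overrides a single key-value pair.
SHORTCUTS = {
    "core-site.xml": "c",
    "hdfs-site.xml": "h",
    "mapred-site.xml": "m",
    "yarn-site.xml": "y",
}


def key_value_args(config_file, overrides):
    """Build the comma-separated --args value for key-value overrides."""
    flag = "-" + SHORTCUTS[config_file]
    parts = []
    for key, value in overrides.items():
        parts.extend([flag, f"{key}={value}"])
    return ",".join(parts)


def config_file_args(config_file, s3_path):
    """Build the --args value that merges a whole config file."""
    flag = "-" + SHORTCUTS[config_file].upper()
    return f"{flag},{s3_path}"


print(key_value_args("mapred-site.xml", {"dfs.block.size": 1048576}))
# The bucket name below is a placeholder, not a real location.
print(config_file_args("core-site.xml", "s3://example-bucket/core.xml"))
```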

Hue: Amazon S3 and HDFS

Hue: Query Editor

Hue: Job Browser

Leverage Amazon S3 with EMRFS

Amazon S3 as your persistent data store

• Separate compute and storage

• Resize and shut down Amazon EMR clusters with no data loss

• Point multiple Amazon EMR clusters at same data in Amazon S3


EMRFS makes it easier to use Amazon S3

• Read-after-write consistency

• Very fast list operations

• Error-handling options

• Support for Amazon S3 encryption

• Transparent to applications: s3://

EMRFS client-side encryption

Diagram: Amazon S3 (client-side encrypted objects), accessed through Amazon S3 encryption clients by EMRFS enabled for Amazon S3 client-side encryption

Key vendor (AWS KMS or your custom key vendor)

HDFS is still there if you need it

• Iterative workloads

– If you’re processing the same dataset more than once

• Disk I/O-intensive workloads

• Persist data on Amazon S3 and use S3DistCp to copy to/from HDFS for processing

Amazon EMR – Design Patterns

EMR example #1: Batch processing

GB of logs pushed to S3 hourly

Daily EMR cluster using Hive to process data

Input and output stored in S3

250 Amazon EMR jobs per day, processing 30 TB of data

http://aws.amazon.com/solutions/case-studies/yelp/

EMR example #2: Long-running cluster

Data pushed to S3

Daily EMR cluster to ETL data into database

24/7 EMR cluster running HBase holds last two years of data

Front-end service uses HBase cluster to power dashboard with high concurrency

EMR example #3: Interactive query

TBs of logs sent daily

Logs stored in Amazon S3

Hive metastore on Amazon EMR

Interactive query using Presto on multi-petabyte warehouse

http://nflx.it/1dO7Pnt

EMR example #4: Streaming-data processing

TBs of logs sent daily

Logs stored in Amazon Kinesis

Amazon Kinesis Client Library

AWS Lambda

Amazon EMR

Amazon EC2

Optimizations for Storage

File formats

• Row-oriented

– text files

– sequence files (writable object)

– Avro data files (described by schema)

• Columnar format

– Optimized Row Columnar (ORC)

– Parquet

Diagram: a logical table stored row-oriented versus column-oriented
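The difference between the two layouts can be sketched with plain Python data structures. This is a simplified illustration with made-up values, not how any of the listed formats are actually encoded on disk (real columnar formats also split data into row groups or stripes).

```python
# A small logical table: one dict per row (values are made up).
rows = [
    {"user": "a", "country": "US", "clicks": 3},
    {"user": "b", "country": "IT", "clicks": 7},
    {"user": "c", "country": "US", "clicks": 1},
]


def to_row_layout(table):
    """Keep the values of each row together (row-oriented)."""
    return [tuple(row.values()) for row in table]


def to_column_layout(table):
    """Keep the values of each column together (column-oriented)."""
    columns = {name: [] for name in table[0]}
    for row in table:
        for name, value in row.items():
            columns[name].append(value)
    return columns


row_layout = to_row_layout(rows)
column_layout = to_column_layout(rows)

# Summing one column touches every record in the row layout,
# but only a single list in the column layout.
total_clicks = sum(column_layout["clicks"])
print(row_layout[0])
print(column_layout["clicks"], total_clicks)
```

Queries that read a few columns of a wide table tend to benefit from the column layout, while workloads that read whole records are a natural fit for the row layout.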

Choosing the right file format

• Processing and query tools: Hive, Impala, and Presto

• Evolution of schema: Avro for schema and Parquet for storage

• File format “splittability”: avoid whole-file JSON/XML documents; use JSON/XML as individual records instead

• Compression: block or file

File sizes

• Avoid small files

– Anything smaller than 100 MB

• Each mapper is a single JVM

– CPU time is required to spawn JVMs/mappers

• Fewer files, matching closely to block size

– fewer calls to S3

– fewer network/HDFS requests
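A rough count of map tasks shows why small files hurt. The sketch below assumes one mapper per split and at least one mapper per file, treats every file as splittable, and uses a hypothetical 10 GB dataset; real input formats apply additional rules.

```python
import math


def estimate_map_tasks(file_sizes_bytes, split_size_bytes):
    """Estimate map tasks as one per split, with at least one per file.

    Simplification: assumes every file is splittable and ignores the
    slack Hadoop allows on the last split of a file.
    """
    return sum(max(1, math.ceil(size / split_size_bytes))
               for size in file_sizes_bytes)


MB = 1024 * 1024
split = 128 * MB

# The same 10 GB of input stored two ways (hypothetical layout).
small_files = [1 * MB] * 10240      # 10,240 files of 1 MB
large_files = [128 * MB] * 80       # 80 files of 128 MB

print(estimate_map_tasks(small_files, split))  # 10240
print(estimate_map_tasks(large_files, split))  # 80
```

Each of those map tasks pays the JVM start-up cost mentioned above, so the small-file layout spends far more time on overhead for the same data.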

Dealing with small files

• Reduce HDFS block size, e.g. 1 MB (default is 128 MB)

– --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop --args "-m,dfs.block.size=1048576"

• Better: use S3DistCp to combine smaller files together

– S3DistCp takes a pattern and target path to combine smaller input files into larger ones

– Supply a target size and compression codec
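The grouping idea behind the pattern can be simulated in a few lines. This is not S3DistCp itself: it is a plain Python sketch that assumes files are combined when the text captured by the pattern's parentheses is the same, which is how S3DistCp's `--groupBy` option is documented to behave. The file names and the regular expression are made up for illustration.

```python
import re
from collections import defaultdict


def group_files(keys, group_by):
    """Group object keys by the text captured by the regex's groups.

    Keys that do not match the pattern are skipped; keys whose capture
    groups are equal end up in the same output group.
    """
    pattern = re.compile(group_by)
    groups = defaultdict(list)
    for key in keys:
        match = pattern.search(key)
        if match:
            groups["".join(match.groups())].append(key)
    return dict(groups)


# Hypothetical hourly log files; group them into one output per day.
keys = [
    "logs/2015-06-01-00.log",
    "logs/2015-06-01-01.log",
    "logs/2015-06-02-00.log",
    "logs/readme.txt",
]
grouped = group_files(keys, r".*/(\d{4}-\d{2}-\d{2})-\d{2}\.log")
for day, members in sorted(grouped.items()):
    print(day, len(members))
```

In an actual S3DistCp run, the target size and compression codec from the last bullet would be supplied alongside the pattern (the `--targetSize` and `--outputCodec` options).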

Compression

• Always compress data files on Amazon S3

– reduces network traffic between Amazon S3 and Amazon EMR

– speeds up your job

• Compress mapper and reducer output

Amazon EMR compresses inter-node traffic with LZO on Hadoop 1 and Snappy on Hadoop 2

Choosing the right compression

• Time-sensitive: faster compression is a better choice

• Large amount of data: use space-efficient compression

• Combined workload: use gzip

Algorithm        Splittable?   Compression ratio   Compress and decompress speed
Gzip (DEFLATE)   No            High                Medium
bzip2            Yes           Very high           Slow
LZO              Yes           Low                 Fast
Snappy           No            Low                 Very fast
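The ratio-versus-speed trade-off in the table can be checked on your own data. The sketch below uses only the gzip and bzip2 codecs that ship with Python's standard library (LZO and Snappy need third-party packages and are left out), and the log line is synthetic, so the numbers it prints are not representative of real workloads.

```python
import bz2
import gzip
import time


def measure(name, compress, data):
    """Return (name, compression ratio, seconds spent compressing)."""
    start = time.perf_counter()
    compressed = compress(data)
    elapsed = time.perf_counter() - start
    return name, len(data) / len(compressed), elapsed


# Synthetic, highly repetitive log lines (made up for illustration).
line = b"2015-06-01T00:00:00Z GET /index.html 200 1234\n"
data = line * 20000

results = [
    measure("gzip", gzip.compress, data),
    measure("bzip2", bz2.compress, data),
]
for name, ratio, seconds in results:
    print(f"{name}: ratio {ratio:.1f}x in {seconds * 1000:.1f} ms")
```

Running the same measurement on a sample of production data is a more reliable guide than any general table.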

The Nielsen Company and Amazon EMR

Copyright © 2012 The Nielsen Company. Confidential and proprietary.

SOCIAL TV IS A CONSUMER PHENOMENON

Viewers interacting around TV programming through social to connect with friends, fans, stars, and advertisers in real time.

84 percent of smartphone/tablet owners use devices as second screens while watching TV

1 billion Tweets about U.S. TV in 2014, sent by 25 million people


WHAT IS NIELSEN SOCIAL?

The leading provider of social TV measurement, analytics, and audience engagement solutions.

• 90+ network, agency and advertiser clients in the U.S.

• Exclusive provider of Nielsen Twitter TV Ratings

• International presence in Italy, Australia, and Mexico


HOW DOES NIELSEN SOCIAL WORK?

Nielsen Social captures Twitter activity about every TV program across 250+ U.S. networks, 1900+ brands, movies, and sports.

Twitter provides impressions and demographics for every tweet about TV, which we aggregate and de-duplicate to produce Nielsen Twitter TV Ratings.


NIELSEN SOCIAL – AMAZON EMR JOB FLOW

1) Country Segmentation Cluster

• Input: Global Firehose (500M TPD, 200 GB gz files on S3)

• Output: US/IT/AU/MX Tweets (150M TPD)

• Data Volatility: Low (+/- 20%)

2) TV Tweet Matching Cluster

• Input: US/IT/AU/MX Tweets (150M TPD)

• Output: TV Tweets (2 to 25M)

• Data Volatility: High (+/- 1200%)

3) TV Analytics Cluster

• Input: TV Tweets (2 to 25M TPD)

• Output: Analytics and Reports

• Data Volatility: High (+/- 1200%)

Amazon EMR Config:

• Multiple Transient Clusters

• On Demand Instances + more for spikes

• Type: m4.2xlarge / m3.xl

• Job Freq: daily

Amazon EMR Config:

• Dedicated (24/7)

• Reserved Instances

• Type: m4.2xlarge

• Job Freq: every 10 min

Amazon EMR Config:

• Transient

• Reserved Instances + On Demand for spikes

• Type: m2.2xlarge

• Job Freq: hourly/overlap

TWITTER FIREHOSE

TV SCHEDULES

DEMOS & IMPRESSIONS



NIELSEN TWITTER TV RATINGS

Takeaway

Cost-saving tips for Amazon EMR

• Use Amazon S3 as your persistent data store

• Only pay for compute when you need it

• Use Amazon EC2 Spot Instances to save >80%

• Use Amazon EC2 Reserved Instances for steady workloads

• Use CloudWatch alarms to notify you if a cluster is underutilized, then shut it down (for example, 0 mappers running for >N hours)

• Contact your AWS sales representative for pricing options if you are spending >$10K/mo on Amazon EMR
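The idle-cluster tip can be sketched as a set of CloudWatch alarm settings. The helper below only builds a dictionary; it does not call AWS. It swaps in the `IsIdle` metric that Amazon EMR publishes in the `AWS/ElasticMapReduce` namespace in place of the slide's "0 mappers running" condition, and the function name, alarm name, and cluster ID are hypothetical. The keys follow the parameter names of the CloudWatch PutMetricAlarm API, so the dictionary could be passed to an SDK call such as boto3's `put_metric_alarm`, together with an alarm action of your choice.

```python
def idle_cluster_alarm(cluster_id, idle_hours, period_seconds=300):
    """Build alarm settings that fire after idle_hours of idleness."""
    evaluation_periods = (idle_hours * 3600) // period_seconds
    if evaluation_periods < 1:
        raise ValueError("idle_hours is shorter than one period")
    return {
        "AlarmName": f"emr-idle-{cluster_id}",
        "Namespace": "AWS/ElasticMapReduce",
        "MetricName": "IsIdle",
        "Dimensions": [{"Name": "JobFlowId", "Value": cluster_id}],
        # Minimum >= 1 means the cluster was idle for the whole period.
        "Statistic": "Minimum",
        "Period": period_seconds,
        "EvaluationPeriods": evaluation_periods,
        "Threshold": 1,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
    }


# Placeholder cluster ID; N = 2 hours in this example.
params = idle_cluster_alarm("j-EXAMPLE12345", idle_hours=2)
print(params["EvaluationPeriods"])  # 24 five-minute periods
```

CloudWatch limits how long an alarm's evaluation window can be, so very large values of N may need a longer period instead of more evaluation periods.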

CHICAGO
