(BDT208) A Technical Introduction to Amazon Elastic MapReduce

© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Abhishek Sinha, Amazon Web Services Gaurav Agrawal, AOL Inc October 2015 BDT208 A Technical Introduction to Amazon EMR

Transcript of (BDT208) A Technical Introduction to Amazon Elastic MapReduce

Page 1: (BDT208) A Technical Introduction to Amazon Elastic MapReduce

© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Abhishek Sinha, Amazon Web Services

Gaurav Agrawal, AOL Inc

October 2015

BDT208

A Technical Introduction to

Amazon EMR

Page 2: (BDT208) A Technical Introduction to Amazon Elastic MapReduce

What to Expect from the Session

• Technical introduction to Amazon EMR

• Basic tenets

• Amazon EMR feature set

• Real-life experience of moving a 2-PB, on-premises Hadoop cluster to the AWS cloud

• Is not a technical introduction to Apache Spark, Apache Hadoop, or other frameworks

Page 3: (BDT208) A Technical Introduction to Amazon Elastic MapReduce

Amazon EMR • Managed platform

• MapReduce, Apache Spark, Presto

• Launch a cluster in minutes

• Open source distribution and MapR distribution

• Leverage the elasticity of the cloud

• Baked-in security features

• Pay by the hour and save with Spot

• Flexibility to customize

Page 4: (BDT208) A Technical Introduction to Amazon Elastic MapReduce

Make it easy, secure, and cost-effective to run data-processing frameworks on the AWS cloud

Page 5: (BDT208) A Technical Introduction to Amazon Elastic MapReduce

What Do I Need to Build a Cluster?

1. Choose instances

2. Choose your software

3. Choose your access method

Page 6: (BDT208) A Technical Introduction to Amazon Elastic MapReduce

An Example EMR Cluster

Master Node: r3.2xlarge

Slave Group - Core: c3.2xlarge

Slave Group - Task: m3.xlarge

Slave Group - Task: m3.2xlarge (EC2 Spot)

HDFS (DataNode)

YARN (NodeManager)

NameNode (HDFS)

ResourceManager (YARN)

Page 7: (BDT208) A Technical Introduction to Amazon Elastic MapReduce

Choice of Multiple Instances

CPU: c3 family, cc1.4xlarge, cc2.8xlarge

Memory: m2 family, r3 family

Disk/IO: d2 family, i2 family

General: m1 family, m3 family

Workloads: Machine Learning, Batch Processing, In-memory (Spark & Presto), Large HDFS

Page 8: (BDT208) A Technical Introduction to Amazon Elastic MapReduce

Select an Instance

Page 9: (BDT208) A Technical Introduction to Amazon Elastic MapReduce

Choose Your Software (Quick Bundles)

Page 10: (BDT208) A Technical Introduction to Amazon Elastic MapReduce

Choose Your Software – Custom

Page 11: (BDT208) A Technical Introduction to Amazon Elastic MapReduce

Hadoop Applications Available in Amazon EMR

Page 12: (BDT208) A Technical Introduction to Amazon Elastic MapReduce

Choose Security and Access Control

Page 13: (BDT208) A Technical Introduction to Amazon Elastic MapReduce

You Are Up and Running!

Page 14: (BDT208) A Technical Introduction to Amazon Elastic MapReduce

You Are Up and Running!

Master Node DNS

Page 15: (BDT208) A Technical Introduction to Amazon Elastic MapReduce

You Are Up and Running!

Information about the software you are running, logs and features

Page 16: (BDT208) A Technical Introduction to Amazon Elastic MapReduce

You Are Up and Running!

Infrastructure for this cluster

Page 17: (BDT208) A Technical Introduction to Amazon Elastic MapReduce

You Are Up and Running!

Security Groups and Roles

Page 18: (BDT208) A Technical Introduction to Amazon Elastic MapReduce

Use the CLI

aws emr create-cluster \
  --release-label emr-4.0.0 \
  --instance-groups \
    InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge \
    InstanceGroupType=CORE,InstanceCount=2,InstanceType=m3.xlarge

Or use your favorite SDK

Page 19: (BDT208) A Technical Introduction to Amazon Elastic MapReduce

Programmatic Access to Cluster Provisioning
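As one possible illustration of SDK-based provisioning, the following Python sketch mirrors the CLI command from the previous slide using the `run_job_flow` call in boto3 (the AWS SDK for Python). The cluster name, the default role names, the keep-alive setting, and the two helper function names are assumptions made for this example; none of them come from the talk.

```python
"""Sketch: request the CLI example's cluster through boto3 instead of the CLI."""


def build_cluster_request(name="example-cluster"):
    """Return keyword arguments for the EMR RunJobFlow API.

    The cluster name and the role names are illustrative assumptions.
    """
    return {
        "Name": name,
        "ReleaseLabel": "emr-4.0.0",
        "Instances": {
            "InstanceGroups": [
                {
                    "Name": "Master",
                    "InstanceRole": "MASTER",
                    "InstanceType": "m3.xlarge",
                    "InstanceCount": 1,
                },
                {
                    "Name": "Core",
                    "InstanceRole": "CORE",
                    "InstanceType": "m3.xlarge",
                    "InstanceCount": 2,
                },
            ],
            # Assumption: keep the cluster running when it has no steps.
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


def launch_cluster(request):
    """Submit the request; needs boto3 installed and AWS credentials configured."""
    import boto3  # imported lazily so the builder works without boto3

    emr = boto3.client("emr")
    response = emr.run_job_flow(**request)
    return response["JobFlowId"]


if __name__ == "__main__":
    request = build_cluster_request()
    print(request["Instances"]["InstanceGroups"])
    # launch_cluster(request)  # uncomment to actually create (and pay for) a cluster
```

Keeping the request builder separate from the submit call makes it possible to review or version the cluster definition without touching an AWS account.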

Page 20: (BDT208) A Technical Introduction to Amazon Elastic MapReduce

Now that I have a cluster, I need to process some data

Page 21: (BDT208) A Technical Introduction to Amazon Elastic MapReduce

Amazon EMR can process data from multiple sources

Hadoop Distributed File System (HDFS)

Amazon S3 (EMRFS)

Amazon DynamoDB

Amazon Kinesis

Page 24: (BDT208) A Technical Introduction to Amazon Elastic MapReduce

On an On-premises Environment

Tightly coupled

Page 25: (BDT208) A Technical Introduction to Amazon Elastic MapReduce

Compute and Storage Grow Together

Tightly coupled

Storage grows along with compute

Compute requirements vary

Page 26: (BDT208) A Technical Introduction to Amazon Elastic MapReduce

Underutilized or Scarce Resources

[Chart: y-axis 0 to 120, x-axis 1 to 26]

Page 27: (BDT208) A Technical Introduction to Amazon Elastic MapReduce

Underutilized or Scarce Resources

[Chart: y-axis 0 to 120, x-axis 1 to 26; annotations: Re-processing, Weekly peaks, Steady state]

Page 28: (BDT208) A Technical Introduction to Amazon Elastic MapReduce

Underutilized or Scarce Resources

[Chart: y-axis 0 to 120, x-axis 1 to 26; annotations: Underutilized capacity, Provisioned capacity]

Page 29: (BDT208) A Technical Introduction to Amazon Elastic MapReduce

Contention for Same Resources

Compute bound

Memory bound

Page 30: (BDT208) A Technical Introduction to Amazon Elastic MapReduce

Separation of Resources Creates Data Silos

Team A

Page 31: (BDT208) A Technical Introduction to Amazon Elastic MapReduce

Replication Adds to Cost

3x

Single datacenter

Page 32: (BDT208) A Technical Introduction to Amazon Elastic MapReduce

So how does Amazon EMR solve these problems?

Page 33: (BDT208) A Technical Introduction to Amazon Elastic MapReduce

Decouple Storage and Compute

Page 34: (BDT208) A Technical Introduction to Amazon Elastic MapReduce

Amazon S3 is Your Persistent Data Store

11 9’s of durability

$0.03 / GB / month in US-East

Lifecycle policies

Versioning

Distributed by default

EMRFS

Amazon S3

Page 35: (BDT208) A Technical Introduction to Amazon Elastic MapReduce

The Amazon EMR File System (EMRFS)

• Allows you to leverage Amazon S3 as a file system

• Streams data directly from Amazon S3

• Uses HDFS for intermediates

• Better read/write performance and error handling than open source components

• Consistent view – consistency for read after write

• Support for encryption

• Fast listing of objects

Page 36: (BDT208) A Technical Introduction to Amazon Elastic MapReduce

Going from HDFS to Amazon S3

CREATE EXTERNAL TABLE serde_regex(
  host STRING,
  referer STRING,
  agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
-- The SerDe properties (for example, the input regex) are not shown here.
LOCATION 'samples/pig-apache/input/';

Page 37: (BDT208) A Technical Introduction to Amazon Elastic MapReduce

Going from HDFS to Amazon S3

CREATE EXTERNAL TABLE serde_regex(
  host STRING,
  referer STRING,
  agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
-- The SerDe properties (for example, the input regex) are not shown here.
LOCATION 's3://elasticmapreduce.samples/pig-apache/input/';

Page 38: (BDT208) A Technical Introduction to Amazon Elastic MapReduce

Benefit 1: Switch Off Clusters

Amazon S3

Page 39: (BDT208) A Technical Introduction to Amazon Elastic MapReduce

Auto-Terminate Clusters

Page 40: (BDT208) A Technical Introduction to Amazon Elastic MapReduce

You Can Build a Pipeline

Page 41: (BDT208) A Technical Introduction to Amazon Elastic MapReduce

Or You Can Use AWS Data Pipeline

Input data

Use Amazon EMR to transform unstructured data to structured

Push to Amazon S3

Ingest into Amazon Redshift

Page 42: (BDT208) A Technical Introduction to Amazon Elastic MapReduce

Sample Pipeline

Page 43: (BDT208) A Technical Introduction to Amazon Elastic MapReduce

Run Transient or Long-Running Clusters

Page 44: (BDT208) A Technical Introduction to Amazon Elastic MapReduce

Run a Long-Running Cluster

Amazon EMR cluster

Page 45: (BDT208) A Technical Introduction to Amazon Elastic MapReduce

Benefit 2: Resize Your Cluster

Page 46: (BDT208) A Technical Introduction to Amazon Elastic MapReduce

Resize the Cluster

Scale Up, Scale Down, Stop a resize, issue a resize on another

Page 47: (BDT208) A Technical Introduction to Amazon Elastic MapReduce

How do you scale up and save cost?

Page 48: (BDT208) A Technical Introduction to Amazon Elastic MapReduce

Spot Instance

Bid Price

On-Demand (OD) Price

Page 49: (BDT208) A Technical Introduction to Amazon Elastic MapReduce

Spot Integration

aws emr create-cluster --name "Spot cluster" --ami-version 3.3 \
  --instance-groups \
    InstanceGroupType=MASTER,InstanceType=m3.xlarge,InstanceCount=1 \
    InstanceGroupType=CORE,BidPrice=0.03,InstanceType=m3.xlarge,InstanceCount=2 \
    InstanceGroupType=TASK,BidPrice=0.10,InstanceType=m3.xlarge,InstanceCount=3

Page 50: (BDT208) A Technical Introduction to Amazon Elastic MapReduce

The Spot Bid Advisor

Page 51: (BDT208) A Technical Introduction to Amazon Elastic MapReduce

Spot Integration with Amazon EMR

• Can provision instances from the Spot market

• Replaces a Spot instance in case of interruption

• Impact of interruption

• Master node – Can lose the cluster

• Core node – Can lose intermediate data

• Task nodes – Jobs will restart on other nodes (application dependent)

Page 52: (BDT208) A Technical Introduction to Amazon Elastic MapReduce

Scale up with Spot Instances

10 node cluster running for 14 hours

Cost = 1.0 * 10 * 14 = $140

Page 53: (BDT208) A Technical Introduction to Amazon Elastic MapReduce

Resize Nodes with Spot Instances

Add 10 more nodes on Spot

Page 54: (BDT208) A Technical Introduction to Amazon Elastic MapReduce

Resize Nodes with Spot Instances

20 node cluster running for 7 hours

On-Demand cost = 1.0 * 10 * 7 = $70

Spot cost = 0.5 * 10 * 7 = $35

Total = $70 + $35 = $105

Page 55: (BDT208) A Technical Introduction to Amazon Elastic MapReduce

Resize Nodes with Spot Instances

50% less run-time (14 → 7 hours)

25% less cost ($140 → $105)
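The arithmetic on the preceding slides can be reproduced with a short Python sketch. It takes the slides' figures at face value: $1.00 per node-hour for On-Demand, $0.50 per node-hour for Spot, and a job that finishes in half the time when the node count doubles. Real Spot prices fluctuate and real jobs rarely scale perfectly linearly, so treat the result as an illustration rather than a forecast. The function name `cluster_cost` is made up for this example.

```python
def cluster_cost(price_per_hour, nodes, hours):
    """Cost of running `nodes` instances for `hours` at a flat hourly price."""
    return price_per_hour * nodes * hours


# Baseline from the slides: 10 On-Demand nodes at $1.00/hour for 14 hours.
baseline = cluster_cost(1.0, 10, 14)

# Resized cluster: the same 10 On-Demand nodes plus 10 Spot nodes at
# $0.50/hour, with the run assumed to take 7 hours instead of 14.
on_demand = cluster_cost(1.0, 10, 7)
spot = cluster_cost(0.5, 10, 7)
resized = on_demand + spot

runtime_saving = (14 - 7) / 14
cost_saving = (baseline - resized) / baseline

print(f"baseline ${baseline:.0f}, resized ${resized:.0f}")
print(f"{runtime_saving:.0%} less run-time, {cost_saving:.0%} less cost")
```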

Page 56: (BDT208) A Technical Introduction to Amazon Elastic MapReduce

Scaling Hadoop Jobs with Spot

http://engineering.bloomreach.com/strategies-for-reducing-your-amazon-emr-costs/

1500 to 2000 clusters

6000 Jobs

Page 57: (BDT208) A Technical Introduction to Amazon Elastic MapReduce

maxCpuPerUnitPrice = 0
for each instance_type in (Availability Zone, Region)
{
    cpuPerUnitPrice = instance.cpuCores / instance.spotPrice
    if (maxCpuPerUnitPrice < cpuPerUnitPrice) {
        maxCpuPerUnitPrice = cpuPerUnitPrice;  // remember the best ratio so far
        optimalInstanceType = instance_type;
    }
}

Source: Github /Bloomreach/ Briefly
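A runnable Python sketch of the same selection idea follows: pick the offer with the most CPU cores per dollar of Spot price. It is not the Bloomreach implementation. The `SpotOffer` type, the function name, and every price and Availability Zone in the sample data are invented for illustration; in practice the prices would come from current Spot price data.

```python
from typing import NamedTuple, Optional


class SpotOffer(NamedTuple):
    """One instance type in one Availability Zone (sample values are made up)."""
    instance_type: str
    availability_zone: str
    cpu_cores: int
    spot_price: float  # USD per instance-hour


def pick_optimal_offer(offers) -> Optional[SpotOffer]:
    """Return the offer with the most CPU cores per dollar, or None if empty."""
    best = None
    best_ratio = 0.0
    for offer in offers:
        if offer.spot_price <= 0:
            continue  # skip invalid prices to avoid dividing by zero
        ratio = offer.cpu_cores / offer.spot_price
        if ratio > best_ratio:
            best_ratio = ratio  # remember the running maximum
            best = offer
    return best


# Hypothetical prices, chosen only to exercise the function.
offers = [
    SpotOffer("m3.xlarge", "us-east-1a", 4, 0.05),
    SpotOffer("c3.2xlarge", "us-east-1b", 8, 0.08),
    SpotOffer("r3.2xlarge", "us-east-1a", 8, 0.16),
]
best = pick_optimal_offer(offers)
print(best.instance_type, best.availability_zone)
```

Cores per dollar is only one possible metric; a memory-bound job might rank offers by memory per dollar instead.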

Page 58: (BDT208) A Technical Introduction to Amazon Elastic MapReduce

Intelligent Scale Down

Page 59: (BDT208) A Technical Introduction to Amazon Elastic MapReduce

Intelligent Scale Down: HDFS

Page 60: (BDT208) A Technical Introduction to Amazon Elastic MapReduce

Effectively Utilize Clusters

[Chart: y-axis 0 to 120, x-axis 1 to 26]

Page 61: (BDT208) A Technical Introduction to Amazon Elastic MapReduce

Benefit 3: Logical Separation of Jobs

Prod: Hive, Pig, Cascading

Ad-Hoc: Presto

Amazon S3

Page 62: (BDT208) A Technical Introduction to Amazon Elastic MapReduce

Benefit 4: Disaster Recovery Built In

Cluster 1 Cluster 2

Cluster 3 Cluster 4

Amazon S3

Availability Zone Availability Zone

Page 63: (BDT208) A Technical Introduction to Amazon Elastic MapReduce

Amazon S3 as a Data Lake

Nate Sammons, Principal Architect – NASDAQ

Reference – AWS Big Data Blog

Page 64: (BDT208) A Technical Introduction to Amazon Elastic MapReduce

Re-cap

Rapid provisioning of clusters

Hadoop, Spark, Presto, and other applications

Standard open-source packaging

Decouple storage and compute and scale them independently

Resize clusters to manage demand

Save costs with Spot instances

Page 65: (BDT208) A Technical Introduction to Amazon Elastic MapReduce

How AOL Inc. moved a 2 PB Hadoop cluster to the AWS cloud

Gaurav Agrawal

Senior Software Engineer, AOL Inc.

AWS Certified Associate Solutions Architect

Page 66: (BDT208) A Technical Introduction to Amazon Elastic MapReduce

AOL Data Platforms Architecture 2014

Page 67: (BDT208) A Technical Introduction to Amazon Elastic MapReduce

Data Stats & Insights

Cluster Size: 2 PB

In-House Cluster: 100 Nodes

Raw Data/Day: 2-3 TB

Data Retention: 13-24 Months

Page 68: (BDT208) A Technical Introduction to Amazon Elastic MapReduce

Challenges with In-House Infrastructure

Fixed Cost

Slow Deployment Cycle

Always On Self Serve

Static : Not Scalable Outages Impact Production Upgrade

Storage Compute

Page 69: (BDT208) A Technical Introduction to Amazon Elastic MapReduce

AOL Data Platforms Architecture 2015

Page 70: (BDT208) A Technical Introduction to Amazon Elastic MapReduce

Migration

• Web Console vs. CLI

Page 71: (BDT208) A Technical Introduction to Amazon Elastic MapReduce

Web Console and CLI

Web Console for Training

Setup IAM for users

AWS Services Options

S3 Data upload

EMR Creation & Steps

Try & Test multiple approaches

CLI is your friend..!!!

Page 72: (BDT208) A Technical Introduction to Amazon Elastic MapReduce

Migration

• Web Console vs. CLI

• Copy Existing Data to S3

Page 73: (BDT208) A Technical Introduction to Amazon Elastic MapReduce

Copy Existing Data to S3

Environment Level Buckets

Dev, QA, Production, Analyst

bucket-dev, bucket-qa, bucket-prod, bucket-analyst

Project Level Buckets

Code, Data, Log, Extract and Control

bucket-prod-code, bucket-prod-data, bucket-prod-log, bucket-prod-extract, bucket-prod-control

Compressed Snappy Data to GZIP

Multi Platforms Support

Best Compression

Lowest storage cost

Low cost for Data OUT

76% Less Storage

70K Saving/Year

Page 74: (BDT208) A Technical Introduction to Amazon Elastic MapReduce

Migration

• Web Console vs. CLI

• Copy Existing Data to S3

• EMR Design options

Page 75: (BDT208) A Technical Introduction to Amazon Elastic MapReduce

EMR Design Options

Transient vs. Persistent Cluster

Amazon S3 vs. local HDFS

Elastic Cluster vs. Static Cluster

On-Demand vs. Reserved vs. Spot

Core Nodes vs. Task Nodes

Amazon EMR

Page 76: (BDT208) A Technical Introduction to Amazon Elastic MapReduce

AOL Data Platforms Architecture 2015

Page 77: (BDT208) A Technical Introduction to Amazon Elastic MapReduce

Migration

• Web Console vs. CLI

• Copy Existing Data to S3

• EMR Design options

• EMR Jobs Submission - CLI

Page 78: (BDT208) A Technical Introduction to Amazon Elastic MapReduce

EMR Jobs Submission - CLI

In-house scheduler

Common Utilities

Provision EMR

Push/Pull Data to S3

Job submission to Scheduler

Database Load

JSON Files

Applications, Steps, Bootstrap, EC2 attributes, Instance Groups

Future: Event Driven Design – Lambda, SQS

Page 79: (BDT208) A Technical Introduction to Amazon Elastic MapReduce

EMR Jobs Submission - CLI

aws emr create-cluster --name "prod_dataset_subdataset_2015-10-08" \
  --tags "Env=prod" "Project=Omniture" "Dataset=DATASET" "Owner=gaurav" "Subdataset=SUBDATASET" "Date=2015-10-08" "Region=us-east-1" \
  --visible-to-all-users \
  --ec2-attributes file://omni_awssot.generic.ec2_attributes.json \
  --ami-version "3.7.0" \
  --log-uri s3://bucket-prod-log/DATASET_NAME/SUBDATASET_NAME/ \
  --enable-debugging \
  --instance-groups file://omni_awssot.generic.instance_groups.json \
  --auto-terminate \
  --applications file://omni_awssot.generic.applications.json \
  --bootstrap-actions file://omni_awssot.generic.bootstrap_actions.json \
  --steps file://omni_awssot.generic.steps.json
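The command above reads its instance groups from a JSON file whose contents are not shown on the slide. As a rough sketch of what such a file could look like, the Python snippet below generates one in the list-of-objects JSON form that the AWS CLI accepts for `--instance-groups`. The function name, output file name, instance types, counts, and bid price are placeholders chosen for illustration and are not AOL's actual settings.

```python
import json


def instance_groups(core_count, task_count, task_bid_price):
    """Build an instance-groups list in the JSON shape the AWS CLI accepts.

    All instance types and counts are placeholders, not AOL's settings.
    """
    groups = [
        {"Name": "Master", "InstanceGroupType": "MASTER",
         "InstanceType": "m3.xlarge", "InstanceCount": 1},
        {"Name": "Core", "InstanceGroupType": "CORE",
         "InstanceType": "m3.xlarge", "InstanceCount": core_count},
    ]
    if task_count > 0:
        # A BidPrice entry requests Spot capacity for this group.
        groups.append({"Name": "Task", "InstanceGroupType": "TASK",
                       "InstanceType": "m3.xlarge", "InstanceCount": task_count,
                       "BidPrice": task_bid_price})
    return groups


groups = instance_groups(core_count=2, task_count=3, task_bid_price="0.10")
document = json.dumps(groups, indent=2)
print(document)

# Hypothetical file name; pass it as: --instance-groups file://instance_groups.json
# with open("instance_groups.json", "w") as handle:
#     handle.write(document)
```

Generating the file from a small function keeps cluster shapes under version control, which fits the "JSON as configuration files" approach described in the talk.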

Page 80: (BDT208) A Technical Introduction to Amazon Elastic MapReduce

Migration

• Web Console vs. CLI

• Copy Existing Data to S3

• EMR Design options

• EMR Jobs Submission – CLI

• Monitoring

Page 81: (BDT208) A Technical Introduction to Amazon Elastic MapReduce

Monitoring

EMR WatchDog : Node.js

Duplicate Clusters

Failed Clusters

Long-running Clusters

Long-provisioning Clusters

CloudWatch Alarms

Monthly Billing

S3 Bucket Size

SNS Email Notifications

Amazon CloudWatch

Amazon SNS
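The checks listed for the watchdog (duplicate, failed, long-running, and long-provisioning clusters) can be sketched as a simple classification over cluster records. The talk describes the real tool as a Node.js program and does not show its code, so the Python below is only a hypothetical illustration: the thresholds, the record layout, and the function name are assumptions, and only the state names follow the EMR cluster states.

```python
from collections import Counter
from datetime import datetime, timedelta

# Hypothetical thresholds; the talk does not state the values AOL used.
MAX_RUNTIME = timedelta(hours=6)
MAX_PROVISIONING = timedelta(minutes=30)

PROVISIONING_STATES = {"STARTING", "BOOTSTRAPPING"}
WORKING_STATES = {"RUNNING", "WAITING"}
ACTIVE_STATES = PROVISIONING_STATES | WORKING_STATES


def find_problems(clusters, now):
    """Return (cluster name, problem) pairs for a list of cluster records.

    Each record is a dict with "name", "state", and "created" (a datetime).
    Age is measured from creation time, which is a simplification.
    """
    problems = []
    active_names = Counter(
        c["name"] for c in clusters if c["state"] in ACTIVE_STATES
    )
    for cluster in clusters:
        name, state = cluster["name"], cluster["state"]
        age = now - cluster["created"]
        if state == "TERMINATED_WITH_ERRORS":
            problems.append((name, "failed"))
        elif state in PROVISIONING_STATES and age > MAX_PROVISIONING:
            problems.append((name, "long-provisioning"))
        elif state in WORKING_STATES and age > MAX_RUNTIME:
            problems.append((name, "long-running"))
        # Two active clusters sharing a name are treated as duplicates.
        if state in ACTIVE_STATES and active_names[name] > 1:
            problems.append((name, "duplicate"))
    return problems


now = datetime(2015, 10, 8, 12, 0)
clusters = [
    {"name": "prod_a", "state": "RUNNING", "created": now - timedelta(hours=8)},
    {"name": "prod_b", "state": "STARTING", "created": now - timedelta(hours=1)},
    {"name": "prod_c", "state": "TERMINATED_WITH_ERRORS",
     "created": now - timedelta(hours=2)},
    {"name": "prod_d", "state": "RUNNING", "created": now - timedelta(hours=1)},
    {"name": "prod_d", "state": "RUNNING", "created": now - timedelta(minutes=5)},
]
for name, problem in find_problems(clusters, now):
    print(name, problem)
```

In a real deployment the records would come from the EMR API and each finding would be published as a notification, for example through Amazon SNS as the slide suggests.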

Page 82: (BDT208) A Technical Introduction to Amazon Elastic MapReduce

Migration

• Web Console vs. CLI

• Copy Existing Data to S3

• EMR Design options

• EMR Jobs Submission – CLI

• Monitoring

• Elasticity

Page 83: (BDT208) A Technical Introduction to Amazon Elastic MapReduce

Elasticity

Why be Elastic?

[Chart: Core Nodes Demand - 09/05/2015, x-axis 1 to 24; annotation: Daily Processes]

[Chart: Core Nodes Demand - 09/20/2015, x-axis 1 to 24; annotations: No Clusters, Spike in Demand]

[Chart: Core Nodes Demand - 06/01/2015, x-axis 1 to 24; annotations: Major Restatement, Demand > 10K EC2]

Page 84: (BDT208) A Technical Introduction to Amazon Elastic MapReduce

Elasticity

Why be Elastic?

True Cloud Architecture

Spot is an Open Market

Scale Horizontally

Our Limit : 3,000 EC2/Region

Multiple Regions

Multiple Instance Types

Page 85: (BDT208) A Technical Introduction to Amazon Elastic MapReduce

Migration

• Web Console vs. CLI

• Copy Existing Data to S3

• EMR Design options

• EMR Jobs Submission – CLI

• Monitoring

• Elasticity

• Cost Management & BCDR

Page 86: (BDT208) A Technical Introduction to Amazon Elastic MapReduce

Cost Management & BCDR

Multi Region Deployment

Best AZ for pricing

Design for failure

Global. BC-DR.

Page 87: (BDT208) A Technical Introduction to Amazon Elastic MapReduce

Migration

• Web Console vs. CLI

• Copy Existing Data to S3

• EMR Design options

• EMR Jobs Submission – CLI

• Monitoring

• Elasticity

• Cost Management & BCDR

• Optimization

Page 88: (BDT208) A Technical Introduction to Amazon Elastic MapReduce

Optimization

Data Management

Partition Data on S3

S3 Versioning/Lifecycle

How many nodes?

Based on Data Volume

Complete hour for pricing

Hadoop Run-time Params

Memory Tuning

Compress M & R Output

Combine Splits Input format

Security

Page 89: (BDT208) A Technical Introduction to Amazon Elastic MapReduce

Score Card

Feature AWS

Pay for what you use ✔

Decouple Storage and Compute ✔

True Cloud Architecture ✔

Self Service Model ✔

Elastic & Scalable ✔

Global Infrastructure. BCDR. ✔

Quick & Easy Deployments ✔

Redshift External Tables on S3 ?

More languages for Lambda ?

Page 90: (BDT208) A Technical Introduction to Amazon Elastic MapReduce

AWS vs. In-House Cost

[Chart: Cost Comparison; category: Service; series: AWS, In-House; axis 0 to 6]

Source: AOL & AWS Billing Tool

4x In-House / Month

1x AWS / Month

** In-House cluster includes Storage, Power and Network cost.

Page 91: (BDT208) A Technical Introduction to Amazon Elastic MapReduce

AWS vs. In-House Cost

10/8/2015

Amazon Web Services

1/4th Cost of In-House Hadoop Infrastructure

1/4th Cost

Data Platforms. AOL Inc.

Page 92: (BDT208) A Technical Introduction to Amazon Elastic MapReduce

Restatement Use Case

• Restate historical data going back 6 months

10 Availability Zones

550 EMR Clusters

24,000 Spot EC2 Instances

[Chart: Core Nodes Demand - 06/01/2015, x-axis 1 to 24]

[Chart: Timing Comparison; series: In-House, AWS; axis 0 to 70]

Page 93: (BDT208) A Technical Introduction to Amazon Elastic MapReduce

Best Practices & Suggestions

Tag All Resources

Infrastructure as Code

Command Line Interface

JSON as configuration files

IAM Roles and Policies

Use of Application ID

Enable CloudTrail

S3 Lifecycle Management

S3 Versioning

Separate Code/Data/Logs buckets

Keyless EMR Clusters

Hybrid Model

Enable Debugging

Create Multiple CLI Profiles

Multi-Factor Authentication

CloudWatch Billing Alarms

Spot EC2 Instances

SNS notifications for failures

Loosely coupled Apps

Scale Horizontally

Page 94: (BDT208) A Technical Introduction to Amazon Elastic MapReduce

Remember to complete your evaluations!

Page 95: (BDT208) A Technical Introduction to Amazon Elastic MapReduce

Thank you!

Photo Credits

• Key Board : http://bit.ly/1LRQMdR

• Compression : http://bit.ly/1MtT3Pa

• Optimization : http://bit.ly/1FlidQD

• WatchDog : http://bit.ly/1OX50j6

• Elasticity : http://bit.ly/1YFfCr4

• Fish Bowl : http://bit.ly/1VjrcJd

• Blank Cheque : http://bit.ly/1RkTgGe