Amazon Elastic Map Reduce - Ian Meyers

London Hadoop User Group

Deep experience in building and operating global web scale systems

About Amazon Web Services

How did Amazon…

…get into cloud computing?

Utility computing

On demand Pay as you go

Uniform Available


Compute

Storage

Security Scaling

Database

Networking Monitoring

Messaging

Workflow

DNS

Load Balancing

Backup CDN

No Up-Front Capital Expense

Pay Only for What You Use

Self-Service Infrastructure

Easily Scale Up and Down

Improve Agility & Time-to-Market

Low Cost

Deploy

Cloud computing benefits

Traditional IT capacity

Elastic capacity

Capacity

Time Your IT needs

On and Off Fast Growth

Variable peaks Predictable peaks


WASTE

CUSTOMER DISSATISFACTION


[Chart: Number of EC2 Instances, 4/12/2008 to 4/20/2008]

40 servers to 5000 in 3 days

EC2 scaled to peak of 5000 instances

“Techcrunched”

Launch of Facebook modification

Steady state of ~40 instances

Compute Storage

AWS Global Infrastructure

Database

App Services

Deployment & Administration

Networking

Global Infrastructure


Region

US-WEST (N. California)

EU-WEST (Ireland)

ASIA PAC (Tokyo)

ASIA PAC (Singapore)

US-WEST (Oregon)

SOUTH AMERICA (Sao Paulo)

US-EAST (Virginia)

GOV CLOUD

ASIA PAC (Sydney)

Availability Zone


Customer Needs

•  Store Any Amount of Data
   –  Without Capacity Planning
•  Perform Complex Analysis on Any Data
   –  Scale on Demand
•  Store Data Securely
•  Decrease Time to Market
   –  Build Environments Quickly
•  Reduce Costs
   –  Reduce Capital Expenditure
•  Enable Global Reach

Ingestion | Integration

Elastic Block Store

High performance block storage device

1GB to 1TB in size
Mount as drives to instances, with snapshot/cloning functionality

Availability: 99.99%
Durability: 99.999999999%
Is a web store, not a file system
No single points of failure; eventually consistent
Paradigm: Object store
Performance: Very fast
Redundancy: Across Availability Zones
Security: Public Key / Private Key
Pricing: $0.095/GB/month
Typical use case: Write once, read many
Limits: 100 buckets, unlimited storage, 5TB objects

Simple Storage Service Highly scalable object storage for the internet

1 byte to 5TB in size 99.999999999% durability

Peak Requests: 830,000+ per second

[Chart: Total Number of Objects Stored in Amazon S3, Q4 2006 to Q4 2012. Labelled values: 14 Billion, 40 Billion, 102 Billion, 262 Billion, 762 Billion, 1.3 Trillion]

Glacier Long term object archive

Extremely low cost per gigabyte 99.999999999% durability


Durability: 99.999999999%
Designed for archival; not a file system; vaults & archives
Retrieval time: 3-5 hours
Paradigm: Archive store
Performance: Configurable - low
Redundancy: Across Availability Zones
Security: Public Key / Private Key
Pricing: $0.011/GB/month
Typical use case: Write once, read infrequently (< 10% / month)

Simple Storage Service Highly scalable object storage

1 byte to 5TB in size 99.999999999% durability

Glacier Long term object archive

Extremely low cost per gigabyte 99.999999999% durability

Storage Lifecycle Integration
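The two per-GB prices quoted in this deck ($0.095/GB/month for S3 and $0.011/GB/month for Glacier) suggest why moving cold objects from S3 to Glacier is attractive. The Python sketch below compares the monthly storage bill for a data set held in each tier. It is only an illustration: it assumes flat pricing at the slide's figures, ignores request and retrieval charges, and the 1000 GB data set size is a hypothetical value.

```python
# Flat per-GB monthly prices as quoted on the slides.
# Assumption: no pricing tiers, no request or retrieval fees.
S3_PRICE_PER_GB_MONTH = 0.095
GLACIER_PRICE_PER_GB_MONTH = 0.011


def monthly_cost(size_gb, price_per_gb_month):
    """Return the monthly storage cost in dollars for size_gb gigabytes."""
    return size_gb * price_per_gb_month


def lifecycle_saving(size_gb):
    """Return the monthly saving from holding size_gb in Glacier rather than S3."""
    s3_cost = monthly_cost(size_gb, S3_PRICE_PER_GB_MONTH)
    glacier_cost = monthly_cost(size_gb, GLACIER_PRICE_PER_GB_MONTH)
    return s3_cost - glacier_cost


if __name__ == "__main__":
    size_gb = 1000  # hypothetical data set size
    print(f"S3:      ${monthly_cost(size_gb, S3_PRICE_PER_GB_MONTH):.2f} per month")
    print(f"Glacier: ${monthly_cost(size_gb, GLACIER_PRICE_PER_GB_MONTH):.2f} per month")
    print(f"Saving:  ${lifecycle_saving(size_gb):.2f} per month")
```

The saving only materialises for data that is rarely read, since Glacier retrievals take hours and are charged separately.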

Structured Data Management


Database

Relational Database Service: Managed Oracle, MySQL & SQL Server

DynamoDB: Managed NoSQL Database

Amazon Redshift: Massively Parallel Petabyte Scale Data Warehouse


Relational Database Service
Database-as-a-Service
No need to install or manage database instances
Scalable and fault tolerant configurations
Integration with Data Pipeline


DynamoDB
Provisioned throughput NoSQL database
Fast, predictable, configurable performance
Fully distributed, fault tolerant HA architecture
Integration with EMR & Hive


Redshift
Managed Massively Parallel Petabyte Scale Data Warehouse
Streaming Backup/Restore to S3
Extensive Security
Scales from 2 TB to 1.6 PB


Unstructured Data …

Parallel ETL

Elastic MapReduce
Managed, elastic Hadoop cluster
Integrates with S3 & DynamoDB
Leverage Hive & Pig analytics scripts
Support for Spot Instances
Integrated HBase NoSQL Database


Application Services

Elastic MapReduce

• AWS Web Console
• Command Line

elastic-mapreduce --create --key-pair micro --region eu-west-1 --name IanMM-Test1 --num-instances 5 --instance-type m2.4xlarge --alive --log-uri s3n://meyersi-ire/EMR/log

Launching Clusters

• Enabling Tools

elastic-mapreduce --create --key-pair micro --region eu-west-1 --name IanMM-Test1 --num-instances 5 --instance-type m2.4xlarge --alive --pig-interactive --pig-versions latest --hive-interactive --hive-versions latest --hbase --log-uri s3n://meyersi-ire/EMR/log

Launching Clusters

• Hadoop Configuration Bootstrap Action

elastic-mapreduce --create --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop --args "-s,dfs.block.size=1048576" --key-pair micro --region eu-west-1 --name IanMM-Test-3 --instance-group core --instance-count 2 --instance-type m2.4xlarge --instance-group task --instance-count 2 --instance-type m2.4xlarge --alive --pig-interactive --hive-interactive --log-uri s3n://meyersi-ire/EMR/log

Launching Clusters
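When clusters are launched from scripts rather than typed by hand, it can help to assemble the command line programmatically. The Python sketch below builds an argument list using only flags that appear in the commands in this deck. The function name `build_create_command` and its parameters are hypothetical, and the sketch only constructs the command; it assumes the `elastic-mapreduce` CLI is installed if you go on to run it.

```python
import shlex


def build_create_command(name, region, key_pair, num_instances,
                         instance_type, log_uri, interactive_tools=()):
    """Assemble an `elastic-mapreduce --create` argument list.

    Only flags shown on the slides are used. interactive_tools may
    contain "pig" and/or "hive", which map to --pig-interactive and
    --hive-interactive.
    """
    args = [
        "elastic-mapreduce", "--create", "--alive",
        "--name", name,
        "--region", region,
        "--key-pair", key_pair,
        "--num-instances", str(num_instances),
        "--instance-type", instance_type,
        "--log-uri", log_uri,
    ]
    for tool in interactive_tools:
        args.append(f"--{tool}-interactive")
    return args


if __name__ == "__main__":
    command = build_create_command(
        name="IanMM-Test1",
        region="eu-west-1",
        key_pair="micro",
        num_instances=5,
        instance_type="m2.4xlarge",
        log_uri="s3n://meyersi-ire/EMR/log",
        interactive_tools=("pig", "hive"),
    )
    # Print a shell-safe string; pass `command` to subprocess.run to execute it.
    print(shlex.join(command))
```

Keeping the arguments as a list avoids shell quoting problems if a cluster name or log URI ever contains spaces.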

Input Datanode: This could be an S3 bucket, RDS table, EMR Hive table, etc.

Activity: This is a data aggregation, manipulation, or copy that runs on a user-configured schedule.

Output Datanode: This supports all the same data sources as the input datanode, but the input and output don’t have to be the same type.

Amazon Data Pipeline

Output: S3 file
  Path: s3://trend-data/#{year-month-day}.csv

Activity: EMR Transform
  Hive Query: user-metrics.hql
  Frequency: Daily

Input: RDS Table
  Table: User-Demographics
  SQL Precondition: “Select last_update from table” > #{YY-MM-DD}

Input: DynamoDB Table
  Table: User-Event-Data-#{year-month}

Success Notification: [email protected]
Failure Notification: emr-[email protected]
Delay Notification: emr-[email protected]

Orchestration with Data Pipeline

Analytics Pipeline

Redshift

S3

RDS

EMR

Data Pipeline

…collect & store

…orchestrate

…process & analyse

Dynamo DB

Benefits only possible in the Cloud

                                        Cloud    “Private Cloud” / On Premises
Pay as you Go                             ✔        X
Lower Overall Costs                       ✔        X
Stop Guessing Capacity                    ✔        X
Agility / Speed / Innovation              ✔        X
Avoid Undifferentiated Heavy Lifting      ✔        X
Go Global in Minutes                      ✔        X

Agility & Global Reach at the Core of EMR

Ease of Operation

Compute Infrastructure

Hadoop Configuration
Local Disk
Operating System Config
HDFS
Networking
Hive
Pig
HBase
User Defined Software Installation


Multiple Hadoop Distributions - Open Source & MapR
Clusters Launched with 1 Command, Up in 5 Minutes
Hard Partitioned per Customer on CPU, Memory and Disk
Dynamic Cluster Resizing
In any of 8 Regions around the Globe

Lower Overall Costs

Cheaper | Spot Market Management

Lower TCO

June 2013 Study by Accenture Technology Labs
Not Sponsored or Funded by Amazon
“Accenture assessed the price-performance ratio between bare-metal Hadoop clusters and Hadoop-as-a-Service on Amazon Web Services…[and] revealed that Hadoop-as-a-Service offers better price-performance ratio…”
http://www.accenture.com/us-en/Pages/insight-hadoop-deployment-comparison.aspx

• Spot allows customers to bid on unused EC2 capacity

• Spot price is based on supply/demand of instance types in an Availability Zone

• Customer requests are fulfilled when their bid price is higher than the Spot price

• Instances will be interrupted when the Spot price exceeds the bid price

Spot 101 - What are Spot Instances
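The fulfilment and interruption rules above can be sketched as a small Python function that compares a bid against a sequence of Spot prices. This is a simplified illustration, not how EC2 implements the market: the function name `spot_states` and the hourly price series are hypothetical, and a bid exactly equal to the Spot price is assumed to keep the instance running, since the slide only says interruption happens when the Spot price exceeds the bid.

```python
def spot_states(bid_price, spot_prices):
    """Return 'running' or 'interrupted' for each observed Spot price.

    Assumption: an instance runs while the Spot price does not exceed
    the bid, and is interrupted as soon as it does.
    """
    return [
        "running" if bid_price >= spot_price else "interrupted"
        for spot_price in spot_prices
    ]


if __name__ == "__main__":
    bid = 0.40                              # hypothetical bid in $/hour
    hourly_prices = [0.25, 0.30, 0.45, 0.35]  # hypothetical Spot price history
    for price, state in zip(hourly_prices, spot_states(bid, hourly_prices)):
        print(f"spot ${price:.2f} vs bid ${bid:.2f}: {state}")
```

In this made-up series the instance would be interrupted in the third hour, which is why the deck suggests keeping essential capacity On-Demand and using Spot for additional task nodes.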

elastic-mapreduce --add-instance-group TASK --instance-count 100 --bid-price .4

Mix Spot and On-Demand instances to reduce cost and accelerate computation while protecting against interruption

Scenario #1: Cost without Spot
Job Flow Duration: 14 Hours
4 instances * 14 hrs * $0.50 = $28

Scenario #2: Cost with Spot
Job Flow Duration: 7 Hours
4 instances * 7 hrs * $0.50 = $14
+ 5 instances * 7 hrs * $0.25 = $8.75
Total = $22.75

Time Savings: 50%
Cost Savings: ~20%

Other EMR + Spot Use Cases
• Run entire cluster on Spot for biggest cost savings
• Reduce the cost of application testing

Reducing Hadoop Costs with Spot
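The arithmetic behind the two scenarios can be checked with a short Python sketch. The instance counts, hourly prices and durations are the ones quoted on the slide; the function name `cluster_cost` is hypothetical. The exact saving works out to 18.75%, which the slide rounds to ~20%.

```python
def cluster_cost(instance_groups, hours):
    """Return the total cost of a job flow.

    instance_groups is a list of (instance_count, hourly_price) pairs,
    all of which run for the same number of hours.
    """
    return sum(count * price * hours for count, price in instance_groups)


# Scenario #1: 4 On-Demand instances at $0.50/hr for 14 hours.
on_demand_only = cluster_cost([(4, 0.50)], hours=14)

# Scenario #2: the same 4 On-Demand instances plus 5 Spot instances
# at $0.25/hr, finishing in 7 hours.
with_spot = cluster_cost([(4, 0.50), (5, 0.25)], hours=7)

cost_saving = 1 - with_spot / on_demand_only
time_saving = 1 - 7 / 14

if __name__ == "__main__":
    print(f"Without Spot: ${on_demand_only:.2f}")   # $28.00
    print(f"With Spot:    ${with_spot:.2f}")        # $22.75
    print(f"Cost saving:  {cost_saving:.2%}")       # 18.75%
    print(f"Time saving:  {time_saving:.0%}")       # 50%
```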

Stop Guessing Capacity

Dynamic Clusters

Extend on-premise environments…

with Amazon VPC…

Populate as demand dictates…

Connect over dedicated links…

And turn it off when you are done

EMR is Hadoop…

…cheaper, easier, and more agile

What’s New?

• MapR M7 Introduction
   –  Optimised for HBase Clusters
   –  Failure Recovery
   –  Point in Time Recovery Snapshotting
   –  Low Latency Hadoop Optimisations
   –  HBase Mirroring
   –  NFS + HDFS
   –  MapR M5 Price Drop

• Support for Pig 0.11.1
   –  RANK, CUBE & ROLLUP capability
   –  Groovy UDFs
   –  Support for Guava Functions
   –  Performance Improvements

• Spark/Shark Bootstrap Action
   –  In Memory Hadoop
   –  Spark Scripting (similar to Pig)
   –  Shark Shell with Hive

Interoperability
