Amazon Elastic Map Reduce - Ian Meyers

London Hadoop User Group

description

In this talk, Ian discusses Amazon Elastic MapReduce and how it integrates with other AWS services in a big data stack.

Transcript of Amazon Elastic Map Reduce - Ian Meyers

Page 1: Amazon Elastic Map Reduce - Ian Meyers

London Hadoop User Group

Page 2: Amazon Elastic Map Reduce - Ian Meyers

Deep experience in building and operating global web scale systems

About Amazon Web Services

How did Amazon… …get into cloud computing?

Page 3: Amazon Elastic Map Reduce - Ian Meyers

Utility computing

On demand Pay as you go

Uniform Available

Page 4: Amazon Elastic Map Reduce - Ian Meyers

Utility computing

On demand Pay as you go

Uniform Available

Page 5: Amazon Elastic Map Reduce - Ian Meyers

Utility computing

Page 6: Amazon Elastic Map Reduce - Ian Meyers

Utility computing

On demand Pay as you go

Uniform Available

Compute, Storage, Security, Scaling, Database, Networking, Monitoring, Messaging, Workflow, DNS, Load Balancing, Backup, CDN

Page 7: Amazon Elastic Map Reduce - Ian Meyers

No Up-Front Capital Expense

Pay Only for What You Use

Self-Service Infrastructure

Easily Scale Up and Down

Improve Agility & Time-to-Market

Low Cost

Deploy

Cloud computing benefits

Page 8: Amazon Elastic Map Reduce - Ian Meyers

Traditional IT capacity

Elastic capacity

Capacity

Time Your IT needs

Page 9: Amazon Elastic Map Reduce - Ian Meyers

On and Off

Fast Growth

Variable peaks

Predictable peaks

Elastic capacity

Page 10: Amazon Elastic Map Reduce - Ian Meyers

Elastic capacity

On and Off

Fast Growth

Predictable peaks

Variable peaks

WASTE

CUSTOMER DISSATISFACTION

Page 11: Amazon Elastic Map Reduce - Ian Meyers

Elastic capacity

Fast Growth

On and Off

Predictable peaks

Variable peaks

Page 12: Amazon Elastic Map Reduce - Ian Meyers

[Chart: number of EC2 instances per day, 4/12/2008 to 4/20/2008]

40 servers to 5000 in 3 days

EC2 scaled to peak of 5000 instances

“Techcrunched”

Launch of Facebook modification

Steady state of ~40 instances

Page 13: Amazon Elastic Map Reduce - Ian Meyers

Compute

Storage

AWS Global Infrastructure

Database

App Services

Deployment & Administration

Networking

Global Infrastructure

Page 14: Amazon Elastic Map Reduce - Ian Meyers

Global Infrastructure

Region

US-WEST (N. California)

EU-WEST (Ireland)

ASIA PAC (Tokyo)

ASIA PAC (Singapore)

US-WEST (Oregon)

SOUTH AMERICA (Sao Paulo)

US-EAST (Virginia)

GOV CLOUD

ASIA PAC (Sydney)

Page 15: Amazon Elastic Map Reduce - Ian Meyers

Availability Zone

Global Infrastructure

Page 16: Amazon Elastic Map Reduce - Ian Meyers

Customer Needs

• Store Any Amount of Data
  – Without Capacity Planning
• Perform Complex Analysis on Any Data
  – Scale on Demand
• Store Data Securely
• Decrease Time to Market
  – Build Environments Quickly
• Reduce Costs
  – Reduce Capital Expenditure
• Enable Global Reach

Page 17: Amazon Elastic Map Reduce - Ian Meyers

Ingestion | Integration

Page 18: Amazon Elastic Map Reduce - Ian Meyers

Simple Storage Service
Highly scalable object storage for the internet
1 byte to 5TB in size
99.999999999% durability

Availability: 99.99%
Durability: 99.999999999%
Is a Web Store
Not a file system
No Single Points of Failure
Eventually consistent
Paradigm: Object store
Performance: Very Fast
Redundancy: Across Availability Zones
Security: Public Key / Private Key
Pricing: $0.095/GB/month
Typical use case: Write once, read many
Limits: 100 Buckets, Unlimited Storage, 5TB Objects

Elastic Block Store
High performance block storage device
1GB to 1TB in size
Mount as drives to instances with snapshot/cloning functionalities

Page 19: Amazon Elastic Map Reduce - Ian Meyers

Peak Requests: 830,000+ per second

Total Number of Objects Stored in Amazon S3

14 Billion, 40 Billion, 102 Billion, 262 Billion, 762 Billion, 1.3 Trillion

Q4 2006 Q4 2007 Q4 2008 Q4 2009 Q4 2010 Q4 2011 Q4 2012

Objects in S3

Page 20: Amazon Elastic Map Reduce - Ian Meyers

Glacier
Long term object archive
Extremely low cost per gigabyte
99.999999999% durability

Durability: 99.999999999%
Designed for Archival
Not a file system
Vaults & Archives
3-5 Hour Retrieval Time
Paradigm: Archive Store
Performance: Configurable - Low
Redundancy: Across Availability Zones
Security: Public Key / Private Key
Pricing: $0.011/GB/month
Typical use case: Write once, read infrequently
< 10% / Month

Elastic Block Store
High performance block storage device
1GB to 1TB in size
Mount as drives to instances with snapshot/cloning functionalities
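To put the two per-gigabyte prices side by side, here is a minimal Python sketch comparing the monthly storage bill for the same data set at the S3 price quoted on the earlier slide ($0.095/GB/month) and the Glacier price quoted here ($0.011/GB/month). It assumes flat pricing and ignores request, retrieval, and transfer charges; the 10,000 GB data set size is a hypothetical value chosen only for illustration.

```python
# Per-GB monthly prices as quoted on the S3 and Glacier slides.
S3_PRICE_PER_GB_MONTH = 0.095
GLACIER_PRICE_PER_GB_MONTH = 0.011


def monthly_storage_cost(size_gb: float, price_per_gb_month: float) -> float:
    """Return the monthly storage cost in dollars (storage only, flat rate)."""
    return size_gb * price_per_gb_month


if __name__ == "__main__":
    size_gb = 10_000  # hypothetical data set size, not from the talk
    s3 = monthly_storage_cost(size_gb, S3_PRICE_PER_GB_MONTH)
    glacier = monthly_storage_cost(size_gb, GLACIER_PRICE_PER_GB_MONTH)
    print(f"S3:      ${s3:,.2f} per month")
    print(f"Glacier: ${glacier:,.2f} per month")
    print(f"Glacier costs {glacier / s3:.0%} of the S3 price")
```

At these list prices Glacier storage costs roughly an eighth of S3 storage, which is the trade-off against the 3-5 hour retrieval time.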

Page 21: Amazon Elastic Map Reduce - Ian Meyers

Simple Storage Service
Highly scalable object storage
1 byte to 5TB in size
99.999999999% durability

Glacier
Long term object archive
Extremely low cost per gigabyte
99.999999999% durability

Storage Lifecycle Integration

Page 22: Amazon Elastic Map Reduce - Ian Meyers

Structured Data Management

Page 23: Amazon Elastic Map Reduce - Ian Meyers

Database

Relational Database Service: Managed Oracle, MySQL & SQL Server

DynamoDB: Managed NoSQL Database

Amazon Redshift: Massively Parallel Petabyte Scale Data Warehouse

RDS Dynamo DB

Redshift

Page 24: Amazon Elastic Map Reduce - Ian Meyers

Database

Relational Database Service
Database-as-a-Service
No need to install or manage database instances
Scalable and fault tolerant configurations
Integration with Data Pipeline


Page 25: Amazon Elastic Map Reduce - Ian Meyers

Database

DynamoDB
Provisioned throughput NoSQL database
Fast, predictable, configurable performance
Fully distributed, fault tolerant HA architecture
Integration with EMR & Hive


Page 26: Amazon Elastic Map Reduce - Ian Meyers

Database

Redshift
Managed Massively Parallel Petabyte Scale Data Warehouse
Streaming Backup/Restore to S3
Extensive Security
2 TB -> 1.6 PB


Page 27: Amazon Elastic Map Reduce - Ian Meyers

Unstructured Data …

Parallel ETL

Page 28: Amazon Elastic Map Reduce - Ian Meyers

Elastic MapReduce
Managed, elastic Hadoop cluster
Integrates with S3 & DynamoDB
Leverage Hive & Pig analytics scripts
Support for Spot Instances
Integrated HBase NoSQL Database

Application Services

Elastic MapReduce

Page 29: Amazon Elastic Map Reduce - Ian Meyers

• AWS Web Console
• Command Line

elastic-mapreduce --create --key-pair micro --region eu-west-1 --name IanMM-Test1 --num-instances 5 --instance-type m2.4xlarge --alive --log-uri s3n://meyersi-ire/EMR/log

Launching Clusters

Page 30: Amazon Elastic Map Reduce - Ian Meyers

• Enabling Tools

elastic-mapreduce --create --key-pair micro --region eu-west-1 --name IanMM-Test1 --num-instances 5 --instance-type m2.4xlarge --alive \
  --pig-interactive --pig-versions latest --hive-interactive --hive-versions latest --hbase --log-uri s3n://meyersi-ire/EMR/log

Launching Clusters

Page 31: Amazon Elastic Map Reduce - Ian Meyers

• Hadoop Configuration Bootstrap Action

elastic-mapreduce --create --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop --args "-s,dfs.block.size=1048576" \
  --key-pair micro --region eu-west-1 --name IanMM-Test-3 \
  --instance-group core --instance-count 2 --instance-type m2.4xlarge \
  --instance-group task --instance-count 2 --instance-type m2.4xlarge \
  --alive --pig-interactive --hive-interactive --log-uri s3n://meyersi-ire/EMR/log

Launching Clusters

Page 32: Amazon Elastic Map Reduce - Ian Meyers

Input Datanode: This could be an S3 bucket, an RDS table, an EMR Hive table, etc.

Activity: This is a data aggregation, manipulation, or copy that runs on a user-configured schedule.

Output Datanode: This supports all the same data sources as the input datanode, but the input and output don't have to be the same type.

Amazon Data Pipeline

Page 33: Amazon Elastic Map Reduce - Ian Meyers

Input: DynamoDB Table
Table: User-Event-Data-#{year-month}

Input: RDS Table
Table: User-Demographics
SQL Precondition: "Select last_update from table" > #{YY-MM-DD}

Activity: EMR Transform
Hive Query: user-metrics.hql
Frequency: Daily

Output: S3 file
Path: s3://trend-data/#{year-month-day}.csv

Success Notification: [email protected]
Failure Notification: emr-[email protected]
Delay Notification: emr-[email protected]

 

Orchestration with Data Pipeline

Page 34: Amazon Elastic Map Reduce - Ian Meyers

Analytics Pipeline

Redshift

S3

RDS

EMR

Data Pipeline

…collect & store

…orchestrate

…process & analyse

Dynamo DB

Page 35: Amazon Elastic Map Reduce - Ian Meyers

Benefits only possible in the Cloud

Benefit                                Cloud   "Private Cloud" / On Premises
Pay as you Go                          ✔       X
Lower Overall Costs                    ✔       X
Stop Guessing Capacity                 ✔       X
Agility / Speed / Innovation           ✔       X
Avoid Undifferentiated Heavy Lifting   ✔       X
Go Global in Minutes                   ✔       X

Page 36: Amazon Elastic Map Reduce - Ian Meyers

Agility & Global Reach

at the Core of EMR

Page 37: Amazon Elastic Map Reduce - Ian Meyers

Ease of Operation

Compute Infrastructure

Hadoop Configuration

Local Disk

Operating System Config

HDFS

Networking

Hive

Pig

HBase

User Defined Software Installation

Page 38: Amazon Elastic Map Reduce - Ian Meyers

Ease of Operation

Compute Infrastructure

Hadoop Configuration

Local Disk

Operating System Config

HDFS

Networking

Hive

Pig

HBase

User Defined Software Installation

Multiple Hadoop Distributions - Open Source & MapR
Clusters Launched with 1 Command
Up in 5 Minutes
Hard Partitioned per Customer on CPU, Memory and Disk
Dynamic Cluster Resizing
In any of 8 Regions around the Globe

Page 39: Amazon Elastic Map Reduce - Ian Meyers

Lower Overall Costs

Cheaper | Spot Market Management

Page 40: Amazon Elastic Map Reduce - Ian Meyers

Lower TCO

June 2013 Study by Accenture Technology Labs

Not Sponsored or Funded by Amazon

"Accenture assessed the price-performance ratio between bare-metal Hadoop clusters and Hadoop-as-a-Service on Amazon Web Services…[and] revealed that Hadoop-as-a-Service offers better price-performance ratio…"

http://www.accenture.com/us-en/Pages/insight-hadoop-deployment-comparison.aspx

Page 41: Amazon Elastic Map Reduce - Ian Meyers

• Spot allows customers to bid on unused EC2 capacity

• Spot price based on supply/demand of instance types in an Availability Zone

• Requests are fulfilled when the customer's bid price is higher than the Spot price

• Instances will be interrupted when the Spot price exceeds the bid price

Spot 101 - What are Spot Instances
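The fulfilment and interruption rules above can be sketched as a small Python simulation. This is an illustration of the two rules as stated on the slide, not AWS's actual market mechanism: the function name and the price series are hypothetical, and the case where the Spot price exactly equals the bid is not covered by the slide, so the sketch simply leaves the instance state unchanged in that case.

```python
from typing import Iterable, List


def simulate_spot(bid_price: float, spot_prices: Iterable[float]) -> List[bool]:
    """Return, for each Spot price in sequence, whether the instance is running.

    Rules taken from the slide:
      - a request is fulfilled when the bid is higher than the Spot price
      - a running instance is interrupted when the Spot price exceeds the bid
    Assumption: when the two prices are equal, the state does not change.
    """
    running = False
    states = []
    for price in spot_prices:
        if bid_price > price:
            running = True   # bid beats the market: request is fulfilled
        elif price > bid_price:
            running = False  # market rose above the bid: instance interrupted
        states.append(running)
    return states


if __name__ == "__main__":
    # Hypothetical hourly Spot prices for one instance type in one Availability Zone.
    prices = [0.25, 0.40, 0.50, 0.30]
    print(simulate_spot(0.40, prices))
```

Because capacity can disappear whenever the market moves above the bid, work placed on Spot instances should tolerate interruption.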

Page 42: Amazon Elastic Map Reduce - Ian Meyers

elastic-mapreduce --add-instance-group TASK --instance-count 100 --bid-price .4

Page 43: Amazon Elastic Map Reduce - Ian Meyers

Mix Spot and On-Demand instances to reduce cost and accelerate computation while protecting against interruption

Scenario #1: Cost without Spot
Job Flow Duration: 14 Hours
4 instances * 14 hrs * $0.50 = $28

Scenario #2: Cost with Spot
Job Flow Duration: 7 Hours
4 instances * 7 hrs * $0.50 = $14
+ 5 instances * 7 hrs * $0.25 = $8.75
Total = $22.75

Time Savings: 50%
Cost Savings: ~20%

Other EMR + Spot Use Cases
§ Run entire cluster on Spot for biggest cost savings
§ Reduce the cost of application testing

Reducing Hadoop Costs with Spot
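The arithmetic behind the two scenarios on this slide can be reproduced with a short Python sketch. The instance counts, durations, and hourly rates are the ones shown on the slide; the function and variable names are made up for illustration, and the sketch assumes every instance runs for the full job duration at a constant hourly rate.

```python
def job_cost(instances: int, hours: float, hourly_rate: float) -> float:
    """Cost in dollars of running `instances` machines for `hours` at a flat rate."""
    return instances * hours * hourly_rate


ON_DEMAND_RATE = 0.50  # $/instance-hour, from the slide
SPOT_RATE = 0.25       # $/instance-hour, from the slide

# Scenario #1: 4 On-Demand instances for 14 hours.
scenario_1 = job_cost(4, 14, ON_DEMAND_RATE)

# Scenario #2: the same 4 On-Demand instances plus 5 Spot instances, for 7 hours.
scenario_2 = job_cost(4, 7, ON_DEMAND_RATE) + job_cost(5, 7, SPOT_RATE)

time_savings = 1 - 7 / 14
cost_savings = 1 - scenario_2 / scenario_1

if __name__ == "__main__":
    print(f"Scenario #1: ${scenario_1:.2f}")
    print(f"Scenario #2: ${scenario_2:.2f}")
    print(f"Time savings: {time_savings:.0%}")
    print(f"Cost savings: {cost_savings:.2%}")
```

The exact cost saving works out to 18.75%, which the slide rounds to roughly 20%.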

Page 44: Amazon Elastic Map Reduce - Ian Meyers

Stop Guessing Capacity

Dynamic Clusters

Page 45: Amazon Elastic Map Reduce - Ian Meyers

Extend on-premise environments…

Page 46: Amazon Elastic Map Reduce - Ian Meyers

with Amazon VPC…

Page 47: Amazon Elastic Map Reduce - Ian Meyers

Populate as demand dictates…

Page 48: Amazon Elastic Map Reduce - Ian Meyers

Connect over dedicated links…

Page 49: Amazon Elastic Map Reduce - Ian Meyers

And turn it off when you are done

Page 50: Amazon Elastic Map Reduce - Ian Meyers

EMR is Hadoop…

…cheaper, easier, and more agile

Page 51: Amazon Elastic Map Reduce - Ian Meyers

What’s New?

• MapR M7 Introduction
  – Optimised for HBase Clusters
  – Failure Recovery
  – Point in Time Recovery Snapshotting
  – Low Latency Hadoop Optimisations
  – HBase Mirroring
  – NFS + HDFS
  – MapR M5 Price Drop
• Support for Pig 0.11.1
  – RANK, CUBE & ROLLUP capability
  – Groovy UDFs
  – Support for Guava Functions
  – Performance Improvements
• Spark/Shark Bootstrap Action
  – In Memory Hadoop
  – Spark Scripting (similar to Pig)
  – Shark Shell with Hive Interoperability