Deep Dive: Amazon Elastic MapReduce
Deep dive – Amazon Elastic MapReduce
Rahul Bhartia
Solution architect – Amazon Web Services
Handling five billion sessions per day at Answers
Andrew Jorgensen
@ajorgensen
Software engineer – Twitter
Agenda
• Amazon Elastic MapReduce (EMR)
• Amazon EMR: Leveraging Amazon Simple Storage Service (S3)
• Amazon EMR: Design patterns
• Amazon EMR: Storage optimizations
• Answers: Handling five billion sessions per day
• Takeaway
Amazon Elastic MapReduce (EMR)
Why Amazon EMR?
Easy to use – Launch a cluster in minutes
Low cost – Pay an hourly rate
Elastic – Easily add or remove capacity
Reliable – Spend less time monitoring
Secure – Manage firewalls
Flexible – Control the cluster
Easy to deploy
Use the AWS Management Console, the command line, or the Amazon EMR API with your favorite SDK.
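As a hedged sketch of the API route, the parameters for EMR's RunJobFlow call can be assembled like this in Python with boto3; the cluster name, instance types and counts, and the log bucket below are illustrative placeholders, not prescriptive values:

```python
# Hypothetical sketch: launching an EMR cluster through the API.
# All names, sizes, and buckets here are illustrative only.
def build_job_flow_request(name, master_type, core_type, core_count):
    """Assemble request parameters for EMR's RunJobFlow API call."""
    return {
        "Name": name,
        "Instances": {
            "MasterInstanceType": master_type,
            "SlaveInstanceType": core_type,
            "InstanceCount": 1 + core_count,  # master node + core nodes
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "LogUri": "s3://my-bucket/emr-logs/",  # hypothetical bucket
    }

request = build_job_flow_request("nightly-hive", "m3.xlarge", "m3.xlarge", 4)
print(request["Instances"]["InstanceCount"])  # 5

# With AWS credentials configured, the actual launch would be:
#   import boto3
#   emr = boto3.client("emr")
#   emr.run_job_flow(**request)
```

The API call itself is left commented out so the sketch stays runnable without credentials.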
Easy to monitor and debug
Integrated with Amazon CloudWatch
Monitor Cluster, Node, and IO
Hue – Amazon S3 and Hadoop Distributed File System (HDFS)
Hue – Query Editor
Hue – Job Browser
Choose your instance types
Try different configurations to find your optimal architecture.
• CPU: c3 family, cc1.4xlarge, cc2.8xlarge
• Memory: m2 family, r3 family
• Disk/IO: d2 family, i2 family
• General: m1 family, m3 family
Workloads: batch processing, machine learning, Spark and interactive, large HDFS.
Resizable clusters
Easy to add and remove compute capacity on your cluster. Match compute demands with cluster sizing.
Easy to use Spot Instances
• Spot Instances for task nodes: up to 86% off on-demand pricing, on average.
• On-demand for core nodes: standard Amazon EC2 pricing for on-demand capacity.
Meet SLA at predictable cost; exceed SLA at lower cost.
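The core/task split can be illustrated with back-of-the-envelope arithmetic; the hourly price and the flat 86% discount below are hypothetical placeholders, not real EC2 quotes:

```python
# Illustrative arithmetic only: the prices below are made up, not real
# EC2 quotes. Core nodes run on-demand; task nodes run as Spot Instances.
ON_DEMAND_HOURLY = 0.266   # hypothetical on-demand price per instance-hour
SPOT_DISCOUNT = 0.86       # "up to 86% off on-demand pricing"

def cluster_hourly_cost(core_nodes, task_nodes):
    """Hourly cost of a cluster mixing on-demand core and Spot task nodes."""
    spot_hourly = ON_DEMAND_HOURLY * (1 - SPOT_DISCOUNT)
    return core_nodes * ON_DEMAND_HOURLY + task_nodes * spot_hourly

# 10 on-demand core nodes plus 10 Spot task nodes costs far less than
# 20 on-demand nodes, for the same nominal capacity.
all_on_demand = 20 * ON_DEMAND_HOURLY
mixed = cluster_hourly_cost(core_nodes=10, task_nodes=10)
print(round(all_on_demand, 3), round(mixed, 3))  # 5.32 3.032
```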
Use bootstrap actions to install applications…
https://github.com/awslabs/emr-bootstrap-actions
…or to configure Hadoop
--bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop
--keyword-config-file (Merge values in new config to existing)
--keyword-key-value (Override values provided)
Configuration File Name | Keyword | File Name Shortcut | Key-Value Pair Shortcut
core-site.xml           | core    | C                  | c
hdfs-site.xml           | hdfs    | H                  | h
mapred-site.xml         | mapred  | M                  | m
yarn-site.xml           | yarn    | Y                  | y
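To make the shortcut syntax concrete, here is a small hypothetical Python helper (not part of EMR) that expands key-value shortcut arguments into (file, key, value) triples following the mapping above:

```python
# Hypothetical helper mirroring the configure-hadoop shortcut table:
# lowercase flags map to Hadoop configuration files.
SHORTCUT_FILES = {"c": "core-site.xml", "h": "hdfs-site.xml",
                  "m": "mapred-site.xml", "y": "yarn-site.xml"}

def expand_args(args):
    """Parse '-h,dfs.block.size=1048576'-style key-value shortcuts."""
    triples = []
    for arg in args:
        flag, kv = arg.split(",", 1)
        key, value = kv.split("=", 1)
        triples.append((SHORTCUT_FILES[flag.lstrip("-").lower()], key, value))
    return triples

print(expand_args(["-h,dfs.block.size=1048576"]))
# [('hdfs-site.xml', 'dfs.block.size', '1048576')]
```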
Amazon EMR integration with Amazon Kinesis
• Read data directly into Hive, Apache Pig, Hadoop Streaming, and Cascading from Amazon Kinesis streams
• No intermediate data persistence required
• Simple way to introduce real-time sources into batch-oriented systems
• Multi-application support and automatic checkpointing
Amazon EMR: Leveraging Amazon S3
Amazon S3 as your persistent data store
• Amazon S3
– Designed for 99.999999999% durability
– Separate compute and storage
• Resize and shut down Amazon EMR clusters with no data loss
• Point multiple Amazon EMR clusters at same data in Amazon S3
EMRFS makes it easier to leverage Amazon S3
• Better performance and error handling options
• Transparent to applications – just read/write to "s3://"
• Consistent view
– For consistent list and read-after-write for new puts
• Support for Amazon S3 server-side and client-side encryption
• Faster listing using EMRFS metadata
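As a hedged sketch, on release-based clusters the EMRFS consistent view can also be enabled through the emrfs-site configuration classification; the property names below follow the EMRFS documentation, but verify them against your EMR release:

```python
# Hedged sketch: enabling EMRFS consistent view via the emrfs-site
# configuration classification (property names per the EMRFS docs;
# confirm against your EMR release before use).
def emrfs_consistent_view_config(retry_count=5):
    """Build the Configurations entry passed when creating a cluster."""
    return [{
        "Classification": "emrfs-site",
        "Properties": {
            "fs.s3.consistent": "true",
            "fs.s3.consistent.retryCount": str(retry_count),
        },
    }]

configs = emrfs_consistent_view_config()
print(configs[0]["Properties"]["fs.s3.consistent"])  # true
```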
EMRFS support for Amazon S3 client-side encryption
[Figure: Amazon S3 encryption clients, with EMRFS enabled for Amazon S3 client-side encryption, use a key vendor (AWS KMS or your custom key vendor) to read and write client-side encrypted objects in Amazon S3]
EMRFS metadata in Amazon DynamoDB
• List and read-after-write consistency
• Faster list operations
Fast listing of Amazon S3 objects using EMRFS metadata

Number of objects | Without consistent view | With consistent view
1,000,000         | 147.72                  | 29.70
100,000           | 12.70                   | 3.69

*Tested using a single-node cluster with an m3.xlarge instance.
Optimize to leverage HDFS
• Iterative workloads – If you’re processing the same dataset more than once
• Disk I/O intensive workloads
Persist data on Amazon S3 and use S3DistCp to
copy to HDFS for processing.
Amazon EMR: Design patterns
Amazon EMR example #1: Batch processing
GBs of logs pushed to Amazon S3 hourly. Daily Amazon EMR cluster using Hive to process data. Input and output stored in Amazon S3.
250 Amazon EMR jobs per day, processing 30 TB of data
http://aws.amazon.com/solutions/case-studies/yelp/
Amazon EMR example #2: Long-running cluster
Data pushed to Amazon S3. Daily Amazon EMR cluster performs Extract, Transform, and Load (ETL) of data into a database. 24/7 Amazon EMR cluster running HBase holds last 2 years' worth of data. Front-end service uses HBase cluster to power dashboard with high concurrency.
Amazon EMR example #3: Interactive query
TBs of logs sent daily. Logs stored in Amazon S3. Amazon EMR cluster using Presto for ad hoc analysis of entire log set.
Interactive query using Presto on multipetabyte warehouse
http://techblog.netflix.com/2014/10/using-presto-in-our-big-
data-platform.html
Amazon EMR: Storage optimizations
File formats
• Row oriented
– Text files
– Sequence files
• Writable object
– Avro data files
• Described by schema
• Columnar format
– Object Record Columnar (ORC)
– Parquet
[Figure: the same logical table stored row oriented vs. column oriented]
Choosing the right file format
• Processing and query tools
– Hive, Impala, and Presto.
• Evolution of schema
– Avro for schema and Presto for storage.
• File format “splittability”
– JSON/XML as records instead of single object
• Compression
– Block or file.
File sizes
• Avoid small files
– Avoid anything smaller than 100 MB
• Each mapper processes a single file
• Fewer files, matching closely to block size
– Fewer calls to Amazon S3
– Fewer network/HDFS requests
Dealing with small files
• Reduce HDFS block size (e.g., 1 MB; default is 128 MB)
  --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop --args "-m,dfs.block.size=1048576"
• Better: use S3DistCp to combine smaller files together
– S3DistCp takes a pattern and target path to combine smaller
input files into larger ones
– Supply a target size and compression codec
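A hedged sketch of such an S3DistCp invocation, expressed as EMR step parameters; the bucket names and grouping regex are hypothetical, and the s3-dist-cp options shown (--src/--dest/--groupBy/--targetSize/--outputCodec) follow the S3DistCp documentation:

```python
# Hedged sketch: an EMR step that runs S3DistCp to combine small input
# files into larger, compressed ones. Buckets and regex are hypothetical.
def s3distcp_combine_step(src, dest, group_pattern, target_size_mb, codec):
    """Build EMR step parameters that invoke s3-dist-cp via command-runner."""
    return {
        "Name": "combine-small-files",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "s3-dist-cp",
                "--src", src,
                "--dest", dest,
                "--groupBy", group_pattern,          # regex grouping inputs
                "--targetSize", str(target_size_mb),  # target size in MB
                "--outputCodec", codec,               # e.g., gz, lzo, snappy
            ],
        },
    }

step = s3distcp_combine_step(
    "s3://my-bucket/hourly-logs/", "s3://my-bucket/combined/",
    ".*/logs/(\\d{4}-\\d{2}-\\d{2})-.*", 128, "gz")
print(step["HadoopJarStep"]["Args"][0])  # s3-dist-cp
```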
Compression
• Always compress data files on Amazon S3
– Reduces network traffic between Amazon S3 and
Amazon EMR
– Speeds up your job
• Compress mappers and reducer output
Amazon EMR compresses internode traffic with LZO on Hadoop 1 and Snappy on Hadoop 2.
Choosing the right compression
• Time sensitive: faster compressions are a better choice.
• Large amount of data: use space-efficient compressions.
• Combined workload: use Gzip.
Algorithm      | Splittable? | Compression ratio | Compress + decompress speed
Gzip (DEFLATE) | No          | High              | Medium
bzip2          | Yes         | Very high         | Slow
LZO            | Yes         | Low               | Fast
Snappy         | No          | Low               | Very fast
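The ratio/speed trade-off in the table can be seen with a stdlib-only Python snippet comparing gzip and bzip2 on a synthetic log; the numbers depend entirely on the data and this is not a benchmark:

```python
# Stdlib-only illustration of the ratio/speed trade-off: bzip2 typically
# compresses harder than gzip but takes longer. Data below is synthetic.
import bz2
import gzip
import time

data = b"GET /index.html HTTP/1.1 200 1043\n" * 50_000

for name, compress in [("gzip", gzip.compress), ("bzip2", bz2.compress)]:
    start = time.perf_counter()
    out = compress(data)
    elapsed = time.perf_counter() - start
    print(f"{name}: {len(data) / len(out):.0f}x ratio, {elapsed * 1000:.0f} ms")
```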
Answers: Handling five billion sessions per day
Answers
Lambda architecture
[Figure: a batch layer producing a batch view and a speed layer producing a real-time view, merged at query time]
Computations
[Figure: input Amazon S3 bucket → S3DistCp → intermediate Amazon S3 bucket → Cascalog computations, LZO-compressed → final Amazon S3 buckets]
Compression
• LZO is fast.
• LZO is splittable.
• LZO is built-in.
• LZO improves performance.
Snapshotting
[Figure: snapshots combined to produce results]
Backfilling
[Figure: backfilling daily jobs from Monday through Friday]
Job scheduling
[Figure: an Orchestrator, using Data Pipeline and Simple Workflow Service, schedules jobs on Amazon EMR]
Batch view
[Figure: a Data Pump and the Orchestrator produce the batch view]
What if my data is not in Amazon S3?
Amazon S3
Takeaway
Cost-saving tips
• Use Amazon S3 as your persistent data store (only pay for compute when you need it!).
• Use Amazon EC2 Spot Instances (especially with task nodes) to significantly reduce the cost of running your clusters.
• Use Amazon EC2 Reserved Instances if you have steady workloads.
• Create CloudWatch alerts to notify you if a cluster is underutilized so that you can shut it down (e.g., mappers running == 0 for more than N hours).
• Contact your sales rep about custom pricing options, if you are spending more than $10K per month on Amazon EMR.
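For the underutilization alert, here is a hedged sketch of CloudWatch alarm parameters using EMR's IsIdle metric (1 when the cluster has no running work) and the put_metric_alarm request shape; the cluster ID and alarm name are hypothetical:

```python
# Hedged sketch: CloudWatch alarm parameters for an idle EMR cluster,
# based on EMR's IsIdle metric. Cluster ID and alarm name are made up.
def idle_cluster_alarm(cluster_id, idle_hours):
    """Build parameters for CloudWatch's put_metric_alarm API call."""
    return {
        "AlarmName": f"emr-idle-{cluster_id}",
        "Namespace": "AWS/ElasticMapReduce",
        "MetricName": "IsIdle",
        "Dimensions": [{"Name": "JobFlowId", "Value": cluster_id}],
        "Statistic": "Average",
        "Period": 3600,                   # one-hour periods
        "EvaluationPeriods": idle_hours,  # N consecutive idle hours
        "Threshold": 1,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
    }

alarm = idle_cluster_alarm("j-1ABCDEFGHIJKL", idle_hours=3)
print(alarm["MetricName"])  # IsIdle
# With boto3 configured: boto3.client("cloudwatch").put_metric_alarm(**alarm)
```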
SAN FRANCISCO
©2015, Amazon Web Services, Inc. or its affiliates. All rights reserved