Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Invent 2013
Transcript of Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Invent 2013
© 2013 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.
Amazon Elastic MapReduce: Deep Dive and Best Practices Parviz Deyhim
November 13, 2013
Outline
Introduction to Amazon EMR
Amazon EMR Design Patterns
Amazon EMR Best Practices
Controlling Cost with Amazon EMR
Advanced Optimizations
Outline
Introduction to Amazon EMR
Amazon EMR Design Patterns
Amazon EMR Best Practices
Controlling Cost with Amazon EMR
Advanced Optimizations
What is EMR?
• Hadoop-as-a-service: a managed MapReduce engine, integrated with tools
• Massively parallel
• Cost-effective AWS wrapper
• Integrated with AWS services
[Diagram, built up over several slides: Amazon EMR with HDFS at the center, integrating with Amazon S3, Amazon DynamoDB, Amazon RDS, Amazon Redshift, and AWS Data Pipeline, alongside analytics languages and data management tools.]
Amazon EMR Introduction
• Launch clusters of any size in a matter of minutes
• Use a variety of instance sizes that match your workload
Amazon EMR Introduction
• Don’t get stuck with hardware
• Don’t deal with capacity planning
• Run multiple clusters with different sizes, specs and node types
Amazon EMR Introduction
• Integration with the Spot market
• 70-80% discounts
Outline
Introduction to Amazon EMR
Amazon EMR Design Patterns
Amazon EMR Best Practices
Controlling Cost with Amazon EMR
Advanced Optimizations
Amazon EMR Design Patterns
Pattern #1: Transient vs. Alive Clusters
Pattern #2: Core Nodes and Task Nodes
Pattern #3: Amazon S3 as HDFS
Pattern #4: Amazon S3 & HDFS
Pattern #5: Elastic Clusters
Pattern #1: Transient vs. Alive Clusters
Pattern #1: Transient Clusters
• Cluster lives for the duration of the job
• Shut down the cluster when the job is done
• Data persists on Amazon S3
• Input & output data on Amazon S3
Benefits of Transient Clusters
1. Control your cost
2. Minimum maintenance
• Cluster goes away when the job is done
3. Practice cloud architecture
• Pay for what you use
• Data processing as a workflow
When to use transient clusters?
If (Data Load Time + Processing Time) * Number of Jobs < 24 hours
    Use Transient Clusters
Else
    Use Alive Clusters
When to use transient clusters?
(20 min data load + 1 hour processing time) * 10 jobs ≈ 13.3 hours < 24 hours = Use Transient Clusters
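The rule of thumb above can be sketched as a small helper (a sketch under my own naming, with times expressed in hours; the 24-hour threshold is from the slides):

```python
# Transient-vs-alive decision rule from the slides, as a tiny helper.

def choose_cluster(load_hr, processing_hr, jobs_per_day):
    """Pick a cluster pattern from total daily busy hours."""
    busy_hours = (load_hr + processing_hr) * jobs_per_day
    return "transient" if busy_hours < 24 else "alive"

# The slides' two examples:
# (20 min + 1 hr) * 10 jobs ~= 13.3 h -> transient
# (20 min + 1 hr) * 20 jobs ~= 26.7 h -> alive
```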
Alive Clusters • Very similar to traditional Hadoop deployments
• Cluster stays around after the job is done
• Data persistence models:
• Amazon S3
• Amazon S3 copied to HDFS
• HDFS, with Amazon S3 as backup
Alive Clusters • Always keep data safe on Amazon S3 even if
you’re using HDFS for primary storage
• Get in the habit of shutting down your cluster and starting a new one, once a week or month
• Design your data processing workflow to account for failure
• You can use workflow management tools such as AWS Data Pipeline
Benefits of Alive Clusters • Ability to share data between multiple jobs
[Diagram: a transient cluster reading and writing Amazon S3 directly, next to long-running clusters that copy data from Amazon S3 into HDFS and share it there between jobs.]
Benefits of Alive Clusters • Cost effective for repetitive jobs
[Diagram: a single long-running cluster serving jobs that repeat at scheduled times throughout the day.]
When to use alive clusters?
If (Data Load Time + Processing Time) * Number of Jobs > 24 hours
    Use Alive Clusters
Else
    Use Transient Clusters
When to use alive clusters?
(20 min data load + 1 hour processing time) * 20 jobs ≈ 26.7 hours > 24 hours = Use Alive Clusters
Pattern #2: Core & Task nodes
Core Nodes
[Diagram, built up over several slides: an Amazon EMR cluster with a master instance group and a core instance group whose nodes run HDFS.]
• Run TaskTrackers (compute) and DataNodes (HDFS)
• You can add core nodes: more HDFS space, more CPU/memory
• You can't remove core nodes, because they hold HDFS data

Amazon EMR Task Nodes
[Diagram: the same cluster with a task instance group added alongside the core instance group.]
• Run TaskTrackers only; no HDFS
• Read from the core nodes' HDFS
• You can add task nodes for more CPU power and more memory
• You can remove task nodes at any time
Task Node Use Case #1 • Speed up job processing using the Spot market
• Run task nodes on the Spot market
• Get a discount on the hourly price
• Nodes can come and go without interruption to your cluster
Task Node Use Case #2 • When you need extra horsepower for a short amount of time
• Example: need to pull a large amount of data from Amazon S3
Example:
[Diagram, built up over several slides: a cluster of HS1 core nodes (48 TB of HDFS each) backed by Amazon S3. Six m1.xlarge Spot task nodes are added to speed up loading data from Amazon S3, then removed once the data load is complete.]
Pattern #3: Amazon S3 as HDFS
Amazon S3 as HDFS
• Use Amazon S3 as your permanent data store
• HDFS for temporary storage of data between jobs
• No additional step to copy data to HDFS
[Diagram: an Amazon EMR cluster whose core and task instance groups read from and write to Amazon S3 directly, using HDFS only for intermediate data.]
Benefits: Amazon S3 as HDFS • Ability to shut down your cluster
A HUGE benefit!
• Use Amazon S3 as your durable storage
Eleven 9s (99.999999999%) of durability
Benefits: Amazon S3 as HDFS • No need to scale HDFS
• Capacity
• Replication for durability
• Amazon S3 scales with your data
• Both in IOPS and data storage
Benefits: Amazon S3 as HDFS • Ability to share data between multiple clusters
• Hard to do with HDFS
[Diagram: two EMR clusters reading the same data from Amazon S3.]
Benefits: Amazon S3 as HDFS • Take advantage of Amazon S3 features
• Amazon S3 server-side encryption
• Amazon S3 lifecycle policies
• Amazon S3 versioning to protect against corruption
• Build elastic clusters
• Add nodes to read from Amazon S3
• Remove nodes with data safe on Amazon S3
What About Data Locality? • Run your job in the same region as your Amazon S3 bucket
• Amazon EMR nodes have high-speed connectivity to Amazon S3
• If your job is CPU/memory-bound, data locality doesn't make a difference
Anti-Pattern: Amazon S3 as HDFS
• Iterative workloads – If you’re processing the same dataset more than once
• Disk I/O intensive workloads
Pattern #4: Amazon S3 & HDFS
Amazon S3 & HDFS
1. Data persist on Amazon S3
Amazon S3 & HDFS
2. Launch Amazon EMR and copy data to HDFS with S3DistCp
Amazon S3 & HDFS
3. Start processing the data on HDFS
Benefits: Amazon S3 & HDFS • Better pattern for I/O-intensive workloads
• The Amazon S3 benefits discussed previously apply
• Durability
• Scalability
• Cost
• Features: lifecycle policy, security
Pattern #5: Elastic Clusters
Amazon EMR Elastic Cluster (manual)
1. Start your cluster with a certain number of nodes
2. Monitor your cluster with Amazon CloudWatch metrics:
• Map Tasks Running
• Map Tasks Remaining
• Cluster Idle?
• Avg. Jobs Failed
3. Increase the number of nodes as you need more capacity by manually calling the API
Amazon EMR Elastic Cluster (automatic)
1. Start your cluster with a certain number of nodes
2. Monitor cluster capacity with Amazon CloudWatch metrics:
• Map Tasks Running
• Map Tasks Remaining
• Cluster Idle?
• Avg. Jobs Failed
3. Get an HTTP Amazon SNS notification to a simple app deployed on Elastic Beanstalk
4. Your app calls the API to add nodes to your cluster
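A minimal sketch of steps 3-4, using today's boto3 SDK (which postdates this 2013 talk). The sizing heuristic, names, and node capacity are illustrative assumptions; `modify_instance_groups` is the real EMR resize call:

```python
# Hypothetical sketch of the auto-scaling app: grow the task instance
# group when map tasks are queued. Thresholds and names are made up.

MAPPERS_PER_NODE = 8  # m1.xlarge mapper capacity, per the talk

def nodes_to_add(map_tasks_remaining, mappers_per_node=MAPPERS_PER_NODE):
    """Extra task nodes needed to absorb the queued map tasks."""
    if map_tasks_remaining <= 0:
        return 0
    return -(-map_tasks_remaining // mappers_per_node)  # ceiling division

def resize_task_group(cluster_id, group_id, current_count, map_tasks_remaining):
    """Call EMR's resize API (ModifyInstanceGroups) if growth is needed."""
    extra = nodes_to_add(map_tasks_remaining)
    if extra == 0:
        return current_count
    import boto3  # imported here so the sizing logic runs without the SDK
    boto3.client("emr").modify_instance_groups(
        ClusterId=cluster_id,
        InstanceGroups=[{"InstanceGroupId": group_id,
                         "InstanceCount": current_count + extra}],
    )
    return current_count + extra
```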
Outline
Introduction to Amazon EMR
Amazon EMR Design Patterns
Amazon EMR Best Practices
Controlling Cost with Amazon EMR
Advanced Optimizations
Amazon EMR Nodes and Size • Use m1.small, m1.large, or c1.medium for functional testing
• Use m1.xlarge and larger nodes for production workloads
Amazon EMR Nodes and Size • Use CC2 for memory- and CPU-intensive jobs
• Use CC2 or c1.xlarge for CPU-intensive jobs
• hs1 instances for HDFS workloads
Amazon EMR Nodes and Size • hi1 and hs1 instances for disk-I/O-intensive workloads
• CC2 instances are more cost effective than m2.4xlarge
• Prefer a smaller cluster of larger nodes over a larger cluster of smaller nodes
Holy Grail Question
How many nodes do I need?
How many nodes do I need?
• Depends on how much data you have
• And how fast you'd like your data to be processed
Introduction to Hadoop Splits
Before we get into Amazon EMR capacity planning, we need to understand the inner workings of Hadoop splits.
Introduction to Hadoop Splits • Data gets broken up into splits (64 MB or 128 MB)
[Diagram: a data file broken into 128 MB splits]
Introduction to Hadoop Splits • Splits get packaged into mappers
[Diagram: splits packaged into mappers]
Introduction to Hadoop Splits
• Mappers get assigned to nodes for processing
[Diagram: mappers assigned to instances]
Introduction to Hadoop Splits • More data = more splits = more mappers
[Diagram: mappers queued for cluster capacity]
Introduction to Hadoop Splits • More data mappers than cluster mapper capacity = mappers wait for capacity = processing delay
Introduction to Hadoop Splits • More nodes = reduced queue size = faster processing
Calculating the Number of Splits for Your Job
Uncompressed files: Hadoop splits a single file into multiple splits. Example: a 128 MB file = 2 splits, based on a 64 MB split size
Calculating the Number of Splits for Your Job
Compressed files:
1. Splittable compression: same logic as uncompressed files
[Example: a 128 MB BZIP2 file = two 64 MB splits]
Calculating the Number of Splits for Your Job
Compressed files:
2. Unsplittable compression: the entire file is a single split.
[Example: a 128 MB GZIP file = one 128 MB split]
Calculating the Number of Splits for Your Job
Number of splits: if data files use unsplittable compression, # of splits = number of files. Example: 10 GZIP files = 10 mappers
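The split-counting rules above can be sketched as follows (the helper name and file-extension heuristic are my own; the split size defaults to the classic 64 MB):

```python
# Count Hadoop input splits: unsplittable compression (e.g. gzip)
# yields one split per file; everything else splits by split size.

UNSPLITTABLE = {".gz", ".snappy"}  # bzip2 (.bz2) splits like plain data

def count_splits(files, split_size_mb=64):
    """files: list of (name, size_mb) pairs. Returns total splits."""
    total = 0
    for name, size_mb in files:
        ext = name[name.rfind("."):] if "." in name else ""
        if ext in UNSPLITTABLE:
            total += 1                             # whole file = one split
        else:
            total += -(-size_mb // split_size_mb)  # ceiling division
    return total

# Examples from the slides: a 128 MB uncompressed or BZIP2 file -> 2
# splits; 10 gzip files -> 10 splits (= 10 mappers).
```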
Cluster Sizing Calculation
Just tell me how many nodes I need for my job!!
Cluster Sizing Calculation
1. Estimate the number of mappers your job requires.
Cluster Sizing Calculation
2. Pick an instance and note down the number of mappers it can run in parallel
m1.xlarge = 8 mappers in parallel
Cluster Sizing Calculation
3. We need to pick some sample data files to run a test workload. The number of sample files should be the same number from step #2.
Cluster Sizing Calculation
4. Run an Amazon EMR cluster with a single core node and process your sample files from #3. Note down the amount of time taken to process your sample files.
Cluster Sizing Calculation
Estimated Number of Nodes =
(Total Mappers * Time to Process Sample Files) /
(Instance Mapper Capacity * Desired Processing Time)
Example: Cluster Sizing Calculation
1. Estimate the number of mappers your job requires
150
2. Pick an instance and note down the number of mappers it can run in parallel
m1.xlarge with 8 mapper capacity per instance
Example: Cluster Sizing Calculation
3. We need to pick some sample data files to run a test workload. The number of sample files should be the same number from step #2.
8 files selected for our sample test
Example: Cluster Sizing Calculation
4. Run an Amazon EMR cluster with a single core
node and process your sample files from #3. Note down the amount of time taken to process your sample files.
3 min to process 8 files
Cluster Sizing Calculation
Estimated number of nodes =
(Total Mappers for Your Job * Time to Process Sample Files) /
(Per-Instance Mapper Capacity * Desired Processing Time)
= (150 * 3 min) / (8 * 5 min) ≈ 11 m1.xlarge
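The sizing formula above, as a small function (the name and units are mine; it returns the raw estimate, which you would round to a whole number of nodes):

```python
# Cluster sizing formula from the slides:
# nodes = (total mappers * sample processing time) /
#         (per-instance mapper capacity * desired processing time)

def estimated_nodes(total_mappers, sample_time_min,
                    mappers_per_instance, desired_time_min):
    """Raw node estimate; round up (or to taste) for the real cluster."""
    return (total_mappers * sample_time_min) / (
        mappers_per_instance * desired_time_min)

# Worked example from the talk: 150 mappers, 3 min to process an
# 8-file sample on one m1.xlarge (8 mappers), 5 min desired
# -> 450 / 40 = 11.25, i.e. ~11 m1.xlarge nodes.
```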
File Size Best Practices
• Avoid small files at all costs
• Anything smaller than 100 MB
• Each mapper is a single JVM
• CPU time is required to spawn each JVM/mapper
File Size Best Practices
Mappers take 2 seconds to spawn and be ready for processing.
10 TB in 100 MB files = 100,000 mappers * 2 sec = a total of ~55 hours of mapper setup time
File Size Best Practices
Mappers take 2 seconds to spawn and be ready for processing.
10 TB in 1000 MB files = 10,000 mappers * 2 sec = a total of ~5.5 hours of mapper setup time
File Size on Amazon S3: Best Practices
• What’s the best Amazon S3 file size for
Hadoop?
About 1-2GB
• Why?
File Size on Amazon S3: Best Practices
• The life of a mapper should not be less than 60 sec
• A single mapper can get 10-15 MB/s of throughput to Amazon S3
60 sec * 15 MB/s ≈ 1 GB
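A back-of-the-envelope sketch of the last two slides' arithmetic, using the talk's assumed constants (2 s JVM spawn per mapper, up to 15 MB/s per mapper to S3, 60 s minimum mapper lifetime); helper names are mine:

```python
# File-size arithmetic from the slides: JVM startup overhead as a
# function of file size, and the ~1 GB sweet spot for S3 files.

JVM_SPAWN_SEC = 2
S3_MB_PER_SEC = 15
MIN_MAPPER_LIFE_SEC = 60

def jvm_overhead_hours(total_data_mb, file_size_mb):
    """Total JVM startup cost when each file gets its own mapper."""
    mappers = total_data_mb / file_size_mb
    return mappers * JVM_SPAWN_SEC / 3600

def target_s3_file_size_mb():
    """60 sec * 15 MB/s = 900 MB, i.e. roughly the 1 GB sweet spot."""
    return MIN_MAPPER_LIFE_SEC * S3_MB_PER_SEC

# 10 TB in 100 MB files  -> ~55 hours of mapper setup time
# 10 TB in 1000 MB files -> ~5.5 hours of mapper setup time
```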
Holy Grail Question
What if I have small file issues?
Dealing with Small Files
• Use S3DistCp to combine smaller files together
• S3DistCp takes a pattern and a target size and combines smaller input files into larger ones
Dealing with Small Files
Example:
./elastic-mapreduce --jobflow j-3GY8JC4179IOK --jar \
/home/hadoop/lib/emr-s3distcp-1.0.jar \
--args '--src,s3://myawsbucket/cf,\
--dest,hdfs:///local,\
--groupBy,.*XABCD12345678.([0-9]+-[0-9]+-[0-9]+-[0-9]+).*,\
--targetSize,128'
Compressions
• Compress as much as you can
• Compress Amazon S3 input data files
– Reduces cost
– Speed up Amazon S3->mapper data transfer time
Compressions • Always compress data files on Amazon S3
• Reduces storage cost
• Reduces bandwidth between Amazon S3 and Amazon EMR
• Speeds up your job
Compressions • Compress mapper and reducer output
• Reduces disk I/O
Compressions
• Compression types:
– Some are fast BUT offer less space reduction
– Some are space efficient BUT slower
– Some are splittable and some are not

Algorithm   % Space Remaining   Encoding Speed   Decoding Speed
GZIP        13%                 21 MB/s          118 MB/s
LZO         20%                 135 MB/s         410 MB/s
Snappy      22%                 172 MB/s         409 MB/s
Compressions
• If you are time sensitive, faster compression is a better choice
• If you have a large amount of data, use space-efficient compression
• If you don't care, pick GZIP
Change Compression Type • You may decide to change the compression type of your files
• Use S3DistCp to change the compression type of your files
• Example: ./elastic-mapreduce --jobflow j-3GY8JC4179IOK --jar \
/home/hadoop/lib/emr-s3distcp-1.0.jar \
--args '--src,s3://myawsbucket/cf,\
--dest,hdfs:///local,\
--outputCodec,lzo'
Outline
Introduction to Amazon EMR
Amazon EMR Design Patterns
Amazon EMR Best Practices
Controlling Cost with Amazon EMR
Advanced Optimizations
Architecting for cost • AWS pricing models:
– On-Demand: pay-as-you-go model
– Spot: a marketplace; bid for instances and get a discount
– Reserved Instances: upfront payment (for a 1- or 3-year term) in exchange for a lower overall monthly cost
Reserved Instances use case For alive, long-running clusters
Reserved Instances use case For ad-hoc but predictable workloads, use medium-utilization Reserved Instances
Reserved Instances use case
For unpredictable workloads, use Spot or On-Demand pricing
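A minimal sketch comparing the three pricing models above; every rate and upfront fee here is a placeholder for illustration, not a real AWS price:

```python
# Toy cost comparison of the three pricing models described above.
# All rates are hypothetical; the 75% Spot discount is the midpoint
# of the 70-80% range quoted earlier in the talk.

def on_demand_cost(hourly_rate, hours):
    return hourly_rate * hours

def spot_cost(hourly_rate, hours, discount=0.75):
    return hourly_rate * (1 - discount) * hours

def reserved_cost(upfront, discounted_hourly_rate, hours):
    return upfront + discounted_hourly_rate * hours

# Intuition: a long-running cluster (thousands of hours/year) amortizes
# the Reserved upfront fee, while spiky batch work favors Spot nodes.
```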
Outline
Introduction to Amazon EMR
Amazon EMR Design Patterns
Amazon EMR Best Practices
Controlling Cost with Amazon EMR
Advanced Optimizations
Adv. Optimizations (Stage 1)
• The best optimization is to structure your data (i.e., smart data partitioning)
• Efficient data structuring = limit the amount of data being processed by Hadoop = faster jobs
Adv. Optimizations (Stage 1) • Hadoop is a batch processing framework
• Data processing time = an hour to days
• Not a great use case for shorter jobs
• Other frameworks may be a better fit – Twitter Storm
– Spark
– Amazon Redshift, etc.
Adv. Optimizations (Stage 1) • Amazon EMR team has done a great deal of
optimization already
• For smaller clusters, Amazon EMR configuration optimization won’t buy you much
– Remember you’re paying for the full hour cost of an instance
Adv. Optimizations (Stage 1)
Best Optimization??
Adv. Optimizations (Stage 1)
Add more nodes
Adv. Optimizations (Stage 2)
• Monitor your cluster using Ganglia
• Amazon EMR has Ganglia bootstrap action
Adv. Optimizations (Stage 2)
• Monitor and look for bottlenecks – Memory
– CPU
– Disk IO
– Network IO
Adv. Optimizations
[Diagram: an iterative tuning loop: run job -> find bottlenecks (Ganglia: CPU, disk, memory) -> address the bottleneck (fine-tune or change the algorithm) -> run the job again.]
Network IO • The most important metric to watch if you're using Amazon S3 for storage
• Goal: drive as much network IO as possible from a single instance
Network IO • Larger instances can drive > 600 Mbps
• Cluster Compute instances can drive 1-2 Gbps
• Optimize to get more out of your instance throughput
– Add more mappers?
Network IO • If you’re using Amazon S3 with Amazon
EMR, monitor Ganglia and watch network throughput.
• Your goal is to maximize your NIC throughput by having enough mappers per node
Network IO, Example
[Graph: low network utilization. Increase the number of mappers, if possible, to drive more traffic.]
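A rough sizing sketch for that advice, combining the NIC figures above with the 10-15 MB/s per-mapper S3 throughput quoted earlier (the function name and the lower-bound 10 MB/s assumption are mine):

```python
# How many concurrent mappers per node does it take to fill the
# instance's NIC, at a conservative 10 MB/s of S3 throughput each?

def mappers_to_fill_nic(nic_mbps, per_mapper_mb_per_sec=10):
    nic_mb_per_sec = nic_mbps / 8  # megabits -> megabytes
    return -(-nic_mb_per_sec // per_mapper_mb_per_sec)  # ceiling

# A large instance driving ~600 Mbps needs ~8 mappers at 10 MB/s each;
# a 1 Gbps Cluster Compute node needs ~13.
```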
CPU • Watch for CPU utilization of your clusters
• If > 50% idle, increase the number of mappers/reducers per instance
– Reduces the number of nodes and reduces cost
Example Adv. Optimizations (Stage 2)
What potential optimization do you see in this graph?
Example Adv. Optimizations (Stage 2)
40% CPU idle. Maybe add more mappers?
Disk IO • Limit the amount of disk IO
• Can increase mapper/reducer memory
• Compress data anywhere you can
• Monitor cluster and pay attention to HDFS bytes written metrics
• One place to pay attention to is mapper/reducer disk spill
Disk IO • Each mapper has an in-memory buffer
[Diagram: a mapper writing into its memory buffer]
Disk IO • When the buffer fills up, data spills to disk
[Diagram: a mapper's full buffer spilling records to disk]
Disk IO • If you see mapper/reducer excessive spill to disk,
increase buffer memory per mapper
• Spill is excessive when the ratio of "MAPPER_SPILLED_RECORDS" to "MAPPER_OUTPUT_RECORDS" is more than 1
Disk IO Example:
MAPPER_SPILLED_RECORDS = 221200123
MAPPER_OUTPUT_RECORDS = 101200123
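The spill check described above can be expressed directly on the two job counters (the helper name is mine):

```python
# Excessive spill: spilled/output ratio above 1 means mapper records
# are hitting disk more than once, so the in-memory buffer is too small.

def excessive_spill(spilled_records, output_records):
    """True when the spilled-to-output ratio exceeds 1."""
    return spilled_records > output_records

# With the example slide's counters the ratio is ~2.2, so the fix is
# to raise the mapper buffer memory (io.sort.mb, next slide).
```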
Disk IO • Increase mapper buffer memory by increasing
“io.sort.mb”
<property><name>io.sort.mb</name><value>200</value></property>
• Same logic applies to reducers
Disk IO • Monitor disk IO using Ganglia
• Look out for disk IO wait
Remember!
[Diagram: the same tuning loop: run job -> find bottlenecks (Ganglia: CPU, disk, memory) -> address the bottleneck (fine-tune or change the algorithm) -> repeat.]
Please give us your feedback on this presentation
As a thank you, we will select prize winners daily for completed surveys!
BDT404