(SDD401) Amazon Elastic MapReduce Deep Dive and Best Practices | AWS re:Invent 2014
November 13th, 2014 | Las Vegas, NV
Ian Meyers, Amazon Web Services
[Diagram: AWS service categories: Compute, Storage, AWS Global Infrastructure, Database, App Services, Deployment & Administration, Networking, Analytics]
Amazon Elastic MapReduce
Managed, elastic Hadoop (1.x & 2.x) cluster
Integrates with Amazon S3, Amazon DynamoDB, Amazon Kinesis and Amazon Redshift
Install Storm, Spark, Presto, Hive, Pig, Impala, & end-user tools automatically
Native support for Spot Instances
Integrated HBase NoSQL database
Amazon EMR
--bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop
--keyword-config-file – merge values in new config to existing
--keyword-key-value – override values provided
Configuration File Name   Configuration File Keyword   File Name Shortcut   Key-Value Pair Shortcut
core-site.xml             core                         C                    c
hdfs-site.xml             hdfs                         H                    h
mapred-site.xml           mapred                       M                    m
yarn-site.xml             yarn                         Y                    y
Set number of mappers per task tracker
  Useful for small memory footprint map tasks
  More work done with a given instance
Set HDFS block size to 1MB
  Useful for smaller files when HDFS is used
Reuse mappers
  Mapper startup time ~ 2-20 seconds
  Useful for tasks with large number of mappers
  Mappers must be “clean” after run (relevant for Java)
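The shortcut table and the tuning tips above can be combined into a small sketch that assembles the `--args` value for the configure-hadoop bootstrap action. The property names (`mapred.tasktracker.map.tasks.maximum`, `dfs.block.size`, `mapred.job.reuse.jvm.num.tasks`) are Hadoop 1 settings assumed to be the ones these tips refer to, the value 2 for mappers per task tracker is hypothetical, and the helper `configure_hadoop_args` is made up for illustration. It only formats strings, so check the result against the bootstrap action's documentation before use.

```python
# Key-value shortcuts from the table above: config file -> shortcut letter.
KEY_VALUE_SHORTCUT = {
    "core-site.xml": "c",
    "hdfs-site.xml": "h",
    "mapred-site.xml": "m",
    "yarn-site.xml": "y",
}


def configure_hadoop_args(settings):
    """Build a comma-separated --args value from (file, key, value) tuples.

    Hypothetical helper: it only formats strings and does not call AWS.
    """
    parts = []
    for config_file, key, value in settings:
        shortcut = KEY_VALUE_SHORTCUT[config_file]
        parts.append(f"-{shortcut},{key}={value}")
    return ",".join(parts)


# Assumed Hadoop 1 property names for the three tips on the slide.
tuning = [
    ("mapred-site.xml", "mapred.tasktracker.map.tasks.maximum", 2),  # hypothetical value
    ("hdfs-site.xml", "dfs.block.size", 1024 * 1024),  # 1MB block size
    ("mapred-site.xml", "mapred.job.reuse.jvm.num.tasks", -1),  # reuse mapper JVMs
]

args = configure_hadoop_args(tuning)
print(args)
```

The printed string would be passed after `--args` on the same line as the `--bootstrap-action` path shown earlier.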
Configure process heap size, Java opts, and allow for replacing the hadoop-user-env.sh
Hadoop 1
Hadoop 2
--bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-daemons --args --{namenode}-heap-size=2048,--{namenode}-opts=-XX:GCTimeRatio=19
[Diagram: EMRFS and HDFS on Amazon EMR; Amazon S3 (File Data); Amazon DynamoDB (Processed Files Registry)]
≈60sec * 15MB 1GB
aws emr add-steps --cluster-id <cluster>
--steps Name=GroupSmallFiles,
Type=CUSTOM_JAR,
Args=files,home/hadoop/lib/emr-s3distcp-1.0.jar,
src,s3://myawsbucket/cf,
dest,hdfs:///local,
groupBy,.*(i-\w.log).*,
targetSize,128…
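The `groupBy` argument above is a regular expression: S3DistCp concatenates input files whose capture-group text is the same into one output file. The sketch below imitates only that grouping step with Python's `re` module (S3DistCp itself uses Java regular expressions). The file names are hypothetical and `group_files` is an illustrative helper, not part of any AWS tool.

```python
import re
from collections import defaultdict

# Pattern copied from the groupBy argument on the slide.
GROUP_BY = re.compile(r".*(i-\w.log).*")


def group_files(paths):
    """Map each capture-group value to the list of paths that produced it."""
    groups = defaultdict(list)
    for path in paths:
        match = GROUP_BY.fullmatch(path)
        if match:  # paths that do not match are left out of this sketch
            groups[match.group(1)].append(path)
    return dict(groups)


# Hypothetical input files; not taken from the talk.
files = [
    "cf/i-a.log.2014-11-12",
    "cf/i-a.log.2014-11-13",
    "cf/i-b.log.2014-11-13",
    "cf/readme.txt",
]

grouped = group_files(files)
for key, members in grouped.items():
    print(key, members)
```

Each key corresponds to one combined output file, which is how many small log files end up as a few larger ones.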
Algorithm   % Space Remaining   Encoding Speed   Decoding Speed
GZIP        13%                 21MB/s           118MB/s
LZO         20%                 135MB/s          410MB/s
Snappy      22%                 172MB/s          409MB/s
-outputCodec,lzo
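To make the trade-off between the three codecs concrete, the sketch below applies the GZIP, LZO and Snappy figures to an input of a given size. It assumes the speeds are measured against uncompressed megabytes, and it uses a hypothetical 1,024 MB input; the function name `estimate` is illustrative.

```python
# (space remaining, encoding MB/s, decoding MB/s) per codec, from the slide.
CODECS = {
    "GZIP": (0.13, 21, 118),
    "LZO": (0.20, 135, 410),
    "Snappy": (0.22, 172, 409),
}


def estimate(codec, input_mb):
    """Return (compressed MB, encode seconds, decode seconds) for one codec.

    Assumes the quoted speeds refer to uncompressed data volume.
    """
    remaining, encode_speed, decode_speed = CODECS[codec]
    return (
        input_mb * remaining,
        input_mb / encode_speed,
        input_mb / decode_speed,
    )


input_mb = 1024  # hypothetical input size
for name in CODECS:
    size, enc, dec = estimate(name, input_mb)
    print(f"{name}: {size:.0f} MB stored, {enc:.1f} s to encode, {dec:.1f} s to decode")
```

Under these assumptions GZIP stores the least data but takes several times longer to encode than LZO or Snappy.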
[Diagram: Amazon EMR cluster (task instance group; core instance group with HDFS) and Amazon S3]
HUGE Benefit!!
[Diagram: EMR, EMR, Amazon S3]
[Diagram: Amazon EMR cluster (task instance group; core instance group with HDFS), S3DistCp and Amazon S3]
[Diagram: EMR, HDFS, Pig]
Hive 0.13.1
• Support for ORC
• Window functions
• Decimal types
• TRUNCATE command
• Better optimiser (less need for hinting)
Pig 0.12.0
• Streaming UDFs not written in Java
• Native support for Avro
• Native support for Parquet
• Improved data types
Impala 1.1
• In-memory SQL engine
• Support for HBase tables
• Support for Parquet – column-oriented file format
• Query and interactive shells
HBase 0.94.18
• Database snapshotting
• Improved read caching and seek optimisation
• Improved transactions
Read Data Directly into Hive, Pig, Streaming and Cascading from Kinesis Streams
No Intermediate Data Persistence Required
Simple way to introduce real time sources into Batch Oriented Systems
Multi-Application Support & Automatic Checkpointing
Amazon EMR Integration with Amazon Kinesis
drop table call_data_records;
CREATE TABLE call_data_records (
  start_time bigint,
  end_time bigint,
  phone_number STRING,
  carrier STRING,
  recorded_duration bigint,
  calculated_duration bigint,
  lat double,
  long double
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ","
STORED BY 'com.amazon.emr.kinesis.hive.KinesisStorageHandler'
TBLPROPERTIES("kinesis.stream.name"="TestAggregatorStream");
EC2 Instance               Map Tasks   Reduce Tasks
m1.small                   2           1
m1.large                   3           1
m1.xlarge                  8           3
m2.xlarge                  3           1
m2.2xlarge                 6           2
m2.4xlarge                 14          4
m3.xlarge                  6           1
m3.2xlarge                 12          3
cg1.4xlarge                12          3
cc2.8xlarge                24          6
c3.4xlarge                 24          6
hi1.4xlarge                24          6
hs1.8xlarge                24          6
cr1.8xlarge & c3.8xlarge   48          12
[Chart series: Memory (GB), Mappers*, Reducers*, CPU (ECU Units), Local Storage (GB)]
Instance     Cost / Map Task   Cost / Reduce Task
m1.large     $0.08             $0.15
m1.xlarge    $0.06             $0.15
m3.xlarge    $0.04             $0.07
m3.2xlarge   $0.04             $0.07
Instance     Cost / Map Task   Cost / Reduce Task
c1.medium    $0.13             $0.13
c1.xlarge    $0.35             $0.70
c3.xlarge    $0.05             $0.11
c3.2xlarge   $0.05             $0.11
Estimated number of nodes = (Total tasks * Time to process sample files) / (Instance task capacity * Desired processing time)
1. Estimate the number of tasks your job requires.
   150
2. Pick an instance type and note the number of tasks it can run in parallel.
   m1.xlarge with 8 task capacity per instance
3. Pick sample data files for a test workload. Use as many sample files as the task capacity noted in step 2.
   8 files selected for our sample test
4. Run an Amazon EMR cluster with a single core node and process the sample files from step 3. Note the time taken to process this dataset.
   3 min to process 8 files
Estimated number of nodes = (Total tasks for your job * Time to process sample files) / (Per instance task capacity * Desired processing time)
= (150 * 3 min) / (8 * 5 min)
≈ 11 m1.xlarge
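The node estimate is simple enough to script. A minimal sketch using the figures from the worked example (150 tasks, 3 min for the sample run, 8 tasks per m1.xlarge, 5 min desired processing time); the function name `estimate_nodes` is hypothetical, and rounding is left to the caller because the slide rounds the result down to 11.

```python
def estimate_nodes(total_tasks, sample_minutes, tasks_per_instance, desired_minutes):
    """Estimated nodes = (total tasks * sample time) / (task capacity * desired time)."""
    return (total_tasks * sample_minutes) / (tasks_per_instance * desired_minutes)


# Figures from the slide: 150 tasks, 3 min for the sample run,
# m1.xlarge with 8 task capacity, 5 min desired processing time.
nodes = estimate_nodes(150, 3, 8, 5)
print(nodes)  # 11.25
```

Rounding up instead of down would give 12 nodes, which trades a little extra cost for a better chance of finishing inside the desired time.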
[Diagrams: Amazon EMR cluster with a master instance group, a core instance group (HDFS) and a task instance group]
Core instance group
  Run TaskTrackers (Compute)
  Run DataNode (HDFS)
  Can add core nodes: more HDFS space, more CPU/memory
  Can’t remove core nodes because of HDFS
Task instance group
  Run TaskTrackers
  No HDFS
  Reads from core node HDFS
  Can add task nodes: more CPU power, more memory
  You can remove task nodes when processing is completed
[Diagram: Amazon EMR cluster (master, core and task instance groups) and Amazon CloudWatch]
http://bit.ly/awsevals