Introducing Amazon EMR Release 5.0 - August 2016 Monthly Webinar Series

© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Jonathan Fritz, Amazon EMR August 23, 2016 Introducing Amazon EMR Release 5.0 Faster, Easier Hadoop, Spark, and Presto


Page 1: Introducing Amazon EMR Release 5.0 - August 2016 Monthly Webinar Series

© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Jonathan Fritz, Amazon EMR

August 23, 2016

Introducing Amazon EMR Release 5.0

Faster, Easier Hadoop, Spark, and Presto

Page 2: Introducing Amazon EMR Release 5.0 - August 2016 Monthly Webinar Series

Agenda

• Quick Introduction to Amazon EMR

• What’s New in Amazon EMR release 5.0

• Interactive Query Demo

• Use Cases

• Best Practices

Page 3: Introducing Amazon EMR Release 5.0 - August 2016 Monthly Webinar Series

Why Amazon EMR?

Easy to Use: Launch a cluster in minutes

Low Cost: Pay an hourly rate

Elastic: Easily add or remove capacity

Reliable: Spend less time monitoring

Secure: Manage firewalls

Flexible: Customize the cluster

Page 4: Introducing Amazon EMR Release 5.0 - August 2016 Monthly Webinar Series

Storage: S3 (EMRFS), HDFS

YARN: Cluster Resource Management

Batch: MapReduce

Interactive: Tez

In Memory: Spark

Applications: Hive, Pig, Spark SQL/Streaming/ML, Mahout, Sqoop

HBase / Phoenix

Presto

Hue (SQL Interface/Metastore Management)

Zeppelin (Interactive Notebook)

Ganglia (Monitoring)

HiveServer2/Spark Thriftserver (JDBC/ODBC)

Amazon EMR service

Page 5: Introducing Amazon EMR Release 5.0 - August 2016 Monthly Webinar Series

Options to submit jobs – Off Cluster

Amazon EMR Step API

Submit a Spark application

Amazon EMR

AWS Data Pipeline

Airflow, Luigi, or other schedulers on EC2

Create a pipeline to schedule job submission or create complex workflows

AWS Lambda

Use AWS Lambda to submit applications to EMR Step API or directly to Spark on your cluster
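
A minimal sketch of the Step API option, assuming boto3 is available wherever the code runs (a workstation, a scheduler on EC2, or an AWS Lambda function). It builds a step that runs spark-submit through command-runner.jar and passes it to the EMR add_job_flow_steps call. The step name, S3 script path, and region are hypothetical placeholders, not values from the webinar.

```python
def build_spark_step(name, script_uri, action_on_failure="CONTINUE"):
    """Return an EMR step definition that runs spark-submit on the cluster."""
    return {
        "Name": name,
        "ActionOnFailure": action_on_failure,
        "HadoopJarStep": {
            # command-runner.jar executes the given command on the master node
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "--deploy-mode", "cluster", script_uri],
        },
    }


def submit_spark_step(cluster_id, step, region_name="us-east-1"):
    """Submit the step to a running cluster and return the new step ID."""
    import boto3  # imported lazily so the step builder works without boto3

    emr = boto3.client("emr", region_name=region_name)
    response = emr.add_job_flow_steps(JobFlowId=cluster_id, Steps=[step])
    return response["StepIds"][0]


# Hypothetical names: replace the script path with your own application.
step = build_spark_step("example-spark-app", "s3://my-bucket/jobs/app.py")
```

Calling submit_spark_step with your own cluster ID and this step would queue the application; inside a Lambda handler the same two functions could be reused unchanged.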

Page 6: Introducing Amazon EMR Release 5.0 - August 2016 Monthly Webinar Series

Options to submit jobs – On Cluster

Web UIs: Hue SQL editor, Zeppelin notebooks, R Studio, and more!

Connect with ODBC / JDBC using HiveServer2/Spark Thriftserver (start using start-thriftserver.sh)

Use Spark Actions in your Apache Oozie workflow to create DAGs of jobs.

Or, use the native APIs and CLIs for each application

Page 7: Introducing Amazon EMR Release 5.0 - August 2016 Monthly Webinar Series

Many storage layers to choose from

Amazon DynamoDB

Amazon RDS

Amazon Kinesis

Amazon Redshift

EMR File System(EMRFS)

Amazon S3

Amazon EMR

Page 8: Introducing Amazon EMR Release 5.0 - August 2016 Monthly Webinar Series

EMR 5.0 - Applications

Page 9: Introducing Amazon EMR Release 5.0 - August 2016 Monthly Webinar Series
Page 10: Introducing Amazon EMR Release 5.0 - August 2016 Monthly Webinar Series

Quick introduction to Spark

[Figure: a Spark DAG of RDDs A through F, split into Stages 1 to 3 by map, filter, groupBy, and join operations; the legend distinguishes RDDs from cached partitions]

• Massively parallel

• Uses DAGs instead of map-reduce for execution

• Minimizes I/O by storing data in DataFrames in memory

• Partitioning-aware to avoid network-intensive shuffle

Page 11: Introducing Amazon EMR Release 5.0 - August 2016 Monthly Webinar Series

Spark components to match your use case

Page 12: Introducing Amazon EMR Release 5.0 - August 2016 Monthly Webinar Series

Spark 2.0 – Performance Enhancements

• Second generation Tungsten engine

• Whole-stage code generation to create optimized bytecode at runtime

• Improvements to Catalyst optimizer for query performance

• New vectorized Parquet decoder to increase throughput

Page 13: Introducing Amazon EMR Release 5.0 - August 2016 Monthly Webinar Series

Datasets and DataFrames (Spark 2.0)

• Datasets

  • Distributed collection of data

  • Strong typing, ability to use Lambda functions

  • Object-oriented operations (similar to RDD API)

  • Optimized encoders which increase performance and minimize serialization/deserialization overhead

  • Compile-time type safety for more robust applications

• DataFrames

  • Dataset organized into named columns

  • Represented as a Dataset of rows

Page 14: Introducing Amazon EMR Release 5.0 - August 2016 Monthly Webinar Series

Spark SQL (Spark 2.0)

• SparkSession – replaces the old SQLContext and HiveContext

• Seamlessly mix SQL with Spark programs

• ANSI SQL Parser and subquery support

• HiveQL compatibility, with direct use of tables in the Hive metastore

• Connect through JDBC / ODBC using the Spark Thrift server
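
A short sketch of the points above, assuming PySpark 2.0 is installed on the cluster. It creates a SparkSession with Hive support and mixes a SQL query with DataFrame operations. The function names are illustrative, and any table or column passed in (for example "events" and "category") is a hypothetical name that would have to exist in your Hive metastore.

```python
def build_session(app_name="emr-spark-sql-sketch"):
    """Create a SparkSession with Hive support (requires PySpark)."""
    from pyspark.sql import SparkSession  # lazy import: only needed on the cluster

    return (
        SparkSession.builder
        .appName(app_name)
        .enableHiveSupport()  # lets spark.sql() see tables in the Hive metastore
        .getOrCreate()
    )


def count_by_column(spark, table_name, column):
    """Select rows with SQL, then aggregate with the DataFrame API."""
    df = spark.sql("SELECT * FROM {}".format(table_name))
    return df.groupBy(column).count()
```

On a cluster, count_by_column(build_session(), "events", "category").show() would print the row count per category for that hypothetical table.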

Page 15: Introducing Amazon EMR Release 5.0 - August 2016 Monthly Webinar Series

Spark 2.0 – ML Updates

• Additional distributed algorithms in SparkR, including K-Means, Generalized Linear Models, and Naive Bayes

• ML pipeline persistence is now supported across all languages

Page 16: Introducing Amazon EMR Release 5.0 - August 2016 Monthly Webinar Series

Spark 2.0 – Structured Streaming

• Structured Streaming API is an extension to the DataFrame/Dataset API (instead of DStream)

• SparkSession is the new entry point for streaming

• Better merges processing on static and streaming datasets, abstracting the velocity of the data

Page 17: Introducing Amazon EMR Release 5.0 - August 2016 Monthly Webinar Series

Configuring Executors – Dynamic Allocation

• Optimal resource utilization

• YARN dynamically creates and shuts down executors based on the resource needs of the Spark application

• Spark uses the executor memory and executor cores settings in the configuration for each executor

• Amazon EMR uses dynamic allocation by default, and calculates the default executor size to use based on the instance family of your Core Group
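
If the calculated defaults need overriding, one option is to pass a spark-defaults configuration classification when the cluster is created. The sketch below builds that structure in Python; the memory and core values are arbitrary examples, not recommendations, and the helper name is an assumption made for illustration.

```python
import json


def spark_defaults(executor_memory, executor_cores, dynamic_allocation=True):
    """Build an EMR configuration list that overrides Spark executor settings."""
    return [
        {
            "Classification": "spark-defaults",
            "Properties": {
                # EMR expects every property value as a string
                "spark.dynamicAllocation.enabled": str(dynamic_allocation).lower(),
                "spark.executor.memory": executor_memory,
                "spark.executor.cores": str(executor_cores),
            },
        }
    ]


# Example values only: size executors for your own instance types.
configurations = spark_defaults("4g", 2)
print(json.dumps(configurations, indent=2))
```

The printed JSON can be supplied as the cluster's configurations at launch, for example through the console, the CLI, or the Configurations parameter of the API.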

Page 18: Introducing Amazon EMR Release 5.0 - August 2016 Monthly Webinar Series

Apache Hive 2.1.0 with Apache Tez 0.8.4

Page 19: Introducing Amazon EMR Release 5.0 - August 2016 Monthly Webinar Series

Tez is the default engine for Hive and Pig

Page 20: Introducing Amazon EMR Release 5.0 - August 2016 Monthly Webinar Series

Hive 2.1 – New Features

• Improvements to Hive’s cost-based optimizer

• LLAP (beta) for faster processing (coming soon)

• Predicate pushdown for Parquet file format

• HPL/SQL for procedural SQL

  • Similar to Oracle’s PL/SQL and Teradata’s stored procedures

• Hive-On-Spark improvements

• Apache HBase as Hive Metastore (alpha)

• CLI mode in Beeline (Hive CLI deprecation)

Page 21: Introducing Amazon EMR Release 5.0 - August 2016 Monthly Webinar Series

Use RDS for an external Hive metastore

Amazon Aurora

Hive Metastore with schema for tables in S3

Amazon S3

Set metastore location in hive-site
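
As a sketch of what setting the metastore location in hive-site can look like, the Python below builds a hive-site configuration classification that points Hive at a MySQL-compatible database such as Amazon Aurora. The host, database name, and credentials are hypothetical, and the driver class shown is an assumption that may differ by EMR release.

```python
def hive_site_for_rds(host, database, user, password, port=3306):
    """Build a hive-site classification that uses an external JDBC metastore."""
    jdbc_url = "jdbc:mysql://{}:{}/{}?createDatabaseIfNotExist=true".format(
        host, port, database
    )
    return [
        {
            "Classification": "hive-site",
            "Properties": {
                "javax.jdo.option.ConnectionURL": jdbc_url,
                # Assumed driver class; check the one bundled with your release
                "javax.jdo.option.ConnectionDriverName": "org.mariadb.jdbc.Driver",
                "javax.jdo.option.ConnectionUserName": user,
                "javax.jdo.option.ConnectionPassword": password,
            },
        }
    ]


# Hypothetical endpoint and credentials for illustration only.
hive_config = hive_site_for_rds("metastore.example.com", "hive", "hive_user", "secret")
```

Because the table schemas live in the database and the data lives in S3, a cluster launched with this configuration can be terminated and recreated without losing table definitions.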

Page 22: Introducing Amazon EMR Release 5.0 - August 2016 Monthly Webinar Series

Presto 0.150

Page 23: Introducing Amazon EMR Release 5.0 - August 2016 Monthly Webinar Series

In-memory distributed query engine

Support standard ANSI-SQL

Support rich analytical functions

Support wide range of data sources

Combine data from multiple sources in single query

Response time ranges from seconds to minutes

Page 24: Introducing Amazon EMR Release 5.0 - August 2016 Monthly Webinar Series

High Performance

• E.g. Netflix: runs 3500+ Presto queries / day on 25+ PB dataset in S3 with 350 active platform users

Extensibility

• Pluggable backends: Hive, Cassandra, JMX, Kafka, MySQL, PostgreSQL, and more

• JDBC, ODBC for commercial BI tools or dashboards

• Client Protocol: HTTP+JSON, support various languages (Python, Ruby, PHP, Node.js, Java(JDBC), C#,…)

ANSI SQL

• complex queries, joins, aggregations, various functions (Window functions)

Page 25: Introducing Amazon EMR Release 5.0 - August 2016 Monthly Webinar Series

Presto: In-memory processing and pipelining

Page 26: Introducing Amazon EMR Release 5.0 - August 2016 Monthly Webinar Series

High Level Architecture

Components: a coordinator and multiple workers.

Queries are submitted by a client such as the Presto CLI to the coordinator.

The coordinator parses, analyzes and plans the query execution, then distributes the processing to the workers.

https://prestodb.io/overview.html
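
To illustrate how a client talks to the coordinator, here is a rough sketch of the HTTP+JSON client protocol using only the Python standard library: the query text is POSTed to /v1/statement and the client then follows each nextUri until no more pages remain. The coordinator address, the port (8889 is assumed here for Presto on EMR), and the user name are assumptions, and error handling is kept minimal.

```python
import json
import urllib.request


def build_statement_request(coordinator_url, sql, user="hadoop"):
    """Build the initial POST /v1/statement request for a query."""
    return urllib.request.Request(
        url=coordinator_url.rstrip("/") + "/v1/statement",
        data=sql.encode("utf-8"),
        headers={"X-Presto-User": user},
        method="POST",
    )


def _fetch(request):
    """Send a request and decode the JSON response body."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def run_query(coordinator_url, sql, user="hadoop"):
    """Submit a query and follow nextUri links until all rows are collected."""
    page = _fetch(build_statement_request(coordinator_url, sql, user))
    rows = []
    while True:
        if "error" in page:
            raise RuntimeError(page["error"].get("message", "query failed"))
        rows.extend(page.get("data", []))
        next_uri = page.get("nextUri")
        if next_uri is None:
            return rows
        page = _fetch(
            urllib.request.Request(next_uri, headers={"X-Presto-User": user})
        )
```

For example, run_query("http://localhost:8889", "SELECT 1") run on the master node would return the result rows as a list, assuming the coordinator listens on that port.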

Page 27: Introducing Amazon EMR Release 5.0 - August 2016 Monthly Webinar Series

Hue 3.10

Page 28: Introducing Amazon EMR Release 5.0 - August 2016 Monthly Webinar Series

Hue 3.10 – New Features

• Fully redesigned SQL editor with table assist, better autocomplete of values and nested types, more charts, and search and replace functionality

• New notebook UI with graphical widgets

• Dry-run Oozie jobs to test options before execution

• Email action on Oozie job failure

• Improved security features including TLS certificate chain support, passwords in file scripts, and inactive user timeouts

Page 29: Introducing Amazon EMR Release 5.0 - August 2016 Monthly Webinar Series

Apache Zeppelin 0.6.1

Page 30: Introducing Amazon EMR Release 5.0 - August 2016 Monthly Webinar Series

Zeppelin 0.6.1 – New Features

• Shiro Authentication

• Notebook Authorization

Save your notebook in S3 by setting zeppelin-env:

export ZEPPELIN_NOTEBOOK_S3_BUCKET=bucket_name

export ZEPPELIN_NOTEBOOK_S3_USER=username

(optional) export ZEPPELIN_NOTEBOOK_S3_KMS_KEY_ID=kms-key-id

Page 31: Introducing Amazon EMR Release 5.0 - August 2016 Monthly Webinar Series

Build your own Apache Bigtop Application

Page 32: Introducing Amazon EMR Release 5.0 - August 2016 Monthly Webinar Series

Creating an Apache Bigtop application

https://blogs.aws.amazon.com/bigdata/post/TxNJ6YS4X6S59U/Building-and-Deploying-Custom-Applications-with-Apache-Bigtop-and-Amazon-EMR

Page 33: Introducing Amazon EMR Release 5.0 - August 2016 Monthly Webinar Series

Demo

Page 34: Introducing Amazon EMR Release 5.0 - August 2016 Monthly Webinar Series

Customer Use Cases

Page 35: Introducing Amazon EMR Release 5.0 - August 2016 Monthly Webinar Series

Twitter (Answers) uses EMR as the batch layer in their Lambda architecture

Page 36: Introducing Amazon EMR Release 5.0 - August 2016 Monthly Webinar Series
Page 37: Introducing Amazon EMR Release 5.0 - August 2016 Monthly Webinar Series

FINRA saves money with comparable performance with Hive on Tez with S3

Page 38: Introducing Amazon EMR Release 5.0 - August 2016 Monthly Webinar Series

A Few Tips

Page 39: Introducing Amazon EMR Release 5.0 - August 2016 Monthly Webinar Series

Decouple compute and storage by using S3 as your data layer

S3 is designed for 11 9’s of durability and is massively scalable

Intermediates stored on local disk or HDFS

[Figure: Amazon EMR clusters using Amazon S3 as the data layer, with EC2 instance memory, local disk, and HDFS on the cluster]

Page 40: Introducing Amazon EMR Release 5.0 - August 2016 Monthly Webinar Series

Partitions, compression, and file formats

• Avoid key names in lexicographical order

  • Improve throughput and S3 list performance

  • Use hashing/random prefixes or reverse the date-time

• Compress data set to minimize bandwidth from S3 to EC2

  • Make sure you use splittable compression or have each file be the optimal size for parallelization on your cluster

• Columnar file formats like Parquet can give increased performance on reads
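
The key-naming tip can be sketched in a few lines of Python. The two helpers below are illustrative assumptions rather than anything shown in the webinar: one prepends a short hash of the key so that sequential names spread across prefixes, and the other reverses a timestamp string so the fastest-changing digits come first.

```python
import hashlib


def hashed_key(key, prefix_length=4):
    """Prefix a key with the first characters of its MD5 hex digest."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return "{}/{}".format(digest[:prefix_length], key)


def reversed_timestamp_key(timestamp, name):
    """Build a key whose prefix is the timestamp string reversed."""
    return "{}/{}".format(timestamp[::-1], name)


# Hypothetical object names for illustration.
print(hashed_key("logs/2016-08-23/part-0000.gz"))
print(reversed_timestamp_key("20160823120000", "part-0000.gz"))
```

The hash prefix is deterministic, so a reader can recompute it from the original key when looking an object up.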

Page 41: Introducing Amazon EMR Release 5.0 - August 2016 Monthly Webinar Series

Use EC2 Spot Instances to save money

• Use the Spot Bid Advisor to help find the optimal bid price

• Resize your cluster with EMR task groups to add capacity to YARN without adding HDFS data nodes

• Store data in S3 so cluster can be recreated if Spot reclaims nodes
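
As a sketch of the task-group tip, the Python below describes a Spot task instance group and adds it to a running cluster with the boto3 add_instance_groups call. The instance type, count, bid price, group name, and region are made-up example values; a real bid price would come from the Spot Bid Advisor or your own pricing data.

```python
def spot_task_group(instance_type, instance_count, bid_price, name="spot-task-group"):
    """Describe a task instance group that runs on Spot capacity."""
    return {
        "Name": name,
        "InstanceRole": "TASK",  # task nodes add YARN capacity but no HDFS storage
        "Market": "SPOT",
        "BidPrice": str(bid_price),  # the API expects the price as a string
        "InstanceType": instance_type,
        "InstanceCount": instance_count,
    }


def add_task_group(cluster_id, group, region_name="us-east-1"):
    """Add the instance group to a running cluster and return its ID."""
    import boto3  # imported lazily so the builder works without boto3

    emr = boto3.client("emr", region_name=region_name)
    response = emr.add_instance_groups(InstanceGroups=[group], JobFlowId=cluster_id)
    return response["InstanceGroupIds"][0]


# Example values only.
group = spot_task_group("m4.xlarge", 4, "0.10")
```

Because task nodes hold no HDFS blocks, losing them to a Spot interruption reduces compute capacity without putting stored data at risk.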

Page 42: Introducing Amazon EMR Release 5.0 - August 2016 Monthly Webinar Series

Configuring VPC private subnets

• Use Amazon S3 Endpoints for connectivity to S3

• Use Managed NAT for connectivity to other services or the Internet

• Control the traffic using Security Groups

  • ElasticMapReduce-Master-Private

  • ElasticMapReduce-Slave-Private

  • ElasticMapReduce-ServiceAccess

Page 43: Introducing Amazon EMR Release 5.0 - August 2016 Monthly Webinar Series

Thank you!

Jonathan Fritz - [email protected]/emr

blogs.aws.amazon.com/bigdata