Best Practices for Migrating Big Data Workloads to AWS


Transcript of Best Practices for Migrating Big Data Workloads to AWS

Page 1

© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon Confidential and Trademark

November 14th, 2018

Best Practices for Migrating Big Data Workloads to AWS

Bruno Faria, Sr. EMR Solutions Architect, AWS

Page 2

Agenda

• Deconstructing current big data environments

• Identifying challenges with on-premises or unmanaged architectures

• Migrating big data workloads to Amazon EMR

• Choose the right engine for the job

• Decouple storage and compute with Amazon S3

• Design for cost and scalability

• Q+A

Page 3

Deconstructing on-premises big data environments

Page 4

On-premises Hadoop Clusters

• A cluster of 1U machines

• Typically 12 cores, 32 or 64 GB RAM, and 6–8 TB of HDD ($3–4K per machine)

• Networking switches and racks

• An open-source distribution of Hadoop, or a fixed-term license from a commercial distribution

• HDFS uses local disk and is sized for 3x data replication

[Diagram: server rack 1 through server rack N, 20 nodes per rack, connected through core switches]
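The 3x replication rule above is what drives hardware purchases on-premises. A minimal back-of-the-envelope sizing sketch, assuming the slide's example hardware (8 TB of HDD per node) and a hypothetical 25% free-space headroom:

```python
import math

def hdfs_nodes_needed(dataset_tb: float, node_hdd_tb: float = 8,
                      replication: int = 3, headroom: float = 0.25) -> int:
    """Nodes needed to hold dataset_tb under HDFS replication, with headroom."""
    raw_tb = dataset_tb * replication               # every block is stored 3 times
    usable_per_node = node_hdd_tb * (1 - headroom)  # keep some disk free
    return math.ceil(raw_tb / usable_per_node)

# A 100 TB dataset needs 300 TB raw; at ~6 TB usable per node -> 50 nodes.
print(hdfs_nodes_needed(100))  # -> 50
```

The node size and headroom are illustrative assumptions; the point is that storage growth alone forces compute purchases when storage and compute are coupled.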

Page 5

On-prem: Workload types running on the same cluster

• Large Scale ETL

• Interactive Queries

• Machine Learning and Data Science

• NoSQL

• Stream Processing

• Search

• Data warehouses

Page 6

On-prem: Swim lane of jobs

[Chart: swim lanes of jobs over the day; the cluster is over utilized at peak times and under utilized the rest of the time]

Page 7

On-premises: Role of a big data administrator

• Management of the cluster (failures, hardware replacement, restarting services, expanding cluster)

• Configuration management

• Tuning of specific jobs or hardware

• Managing development and test environments

• Backing up data and disaster recovery

Page 8

Common Challenges with Big Data Environments

Page 9

On-prem: Over utilization and idle capacity

• Tightly coupled compute and storage requires buying excess capacity

• Can be over-utilized during peak hours and underutilized at other times

• Results in high costs and low efficiency

Page 10

On-premises: System Management Issues

• Managing distributed applications and availability

• Durable storage and disaster recovery

• Adding new frameworks and doing upgrades

• Multiple environments

• Need a team to manage the cluster and procure hardware

Page 11

Migrating workloads to AWS

Page 12

Key Migration Considerations

• DO NOT LIFT AND SHIFT

• Deconstruct workloads and use the right tool for the job

• Decouple storage and compute with S3

• Design for cost and scalability

Page 13

Deconstruct workloads - Analytics Types & Frameworks

• Batch: takes minutes to hours. Example: daily/weekly/monthly reports. Tools: Amazon EMR (MapReduce, Hive, Pig, Spark), AWS Glue

• Interactive: takes seconds. Example: self-service dashboards. Tools: Amazon Redshift, Amazon Athena, Amazon EMR (Presto, Spark)

• Stream: takes milliseconds to seconds. Example: fraud alerts, 1-minute metrics. Tools: Amazon EMR (Spark Streaming, Flink), Amazon Kinesis Analytics, KCL, Storm

• Artificial intelligence: takes milliseconds to minutes. Example: fraud detection, demand forecasting, text to speech. Tools: Amazon AI (Lex, Polly, ML, Rekognition), Amazon EMR (Spark ML), Deep Learning AMI

Page 14

Translate use cases to the right tools

Amazon EMR

• Applications: Hive, Pig, Spark SQL/Streaming/ML, Flink, Mahout, Sqoop, Phoenix, HBase, Presto

• Engines: MapReduce (batch), Tez (interactive), Spark (in memory), Flink (streaming)

• Cluster resource management: YARN

• Storage: S3 (EMRFS), HDFS

• Low-latency SQL -> Athena, Presto, Amazon Redshift/Spectrum
• Data warehouse / reporting -> Spark, Hive, AWS Glue, Amazon Redshift
• HDFS -> Amazon S3

Page 15

Many storage layers to choose from

Amazon RDS, Amazon ElastiCache, Amazon Redshift, Amazon DynamoDB, Amazon Kinesis, Amazon S3, Amazon EMR, Amazon ES

Page 16

Amazon S3 as your persistent data store

• Natively supported by big data frameworks (Spark, Hive, Presto, etc.)

• Decouple storage and compute

• No need to run compute clusters for storage (unlike HDFS)

• Can run transient Amazon EMR clusters with Amazon EC2 Spot Instances

• Multiple & heterogeneous analysis clusters and services can use the same data

• Designed for 99.999999999% durability

• No need to pay for data replication

• Secure: SSL in transit, client- and server-side encryption at rest

• Low cost
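The "11 nines" durability figure above can be made concrete with a little arithmetic. A sketch, treating the per-object annual loss probability as an assumption derived from the published design target (not a measured rate), with a hypothetical object count:

```python
# S3 is designed for 99.999999999% annual durability per object.
durability = 0.99999999999
annual_loss_prob = 1 - durability   # per object, per year (design target)

objects = 10_000_000_000            # hypothetical: 10 billion objects stored
expected_losses_per_year = objects * annual_loss_prob
print(expected_losses_per_year)     # roughly 0.1 objects per year
```

In other words, even at 10 billion objects, the design target implies losing about one object per decade, which is why replication no longer needs to be the user's problem.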

Page 17

Decouple Storage and Compute

Compute: Amazon EMR, Amazon Athena, Amazon Redshift Spectrum

External metastore: Amazon RDS, Glue Data Catalog

Storage: Amazon S3

Page 18

Amazon S3 tips: Partitions, compression, and file formats

• Partition your data for improved read performance

• Read only files the query needs

• Reduce amount of data scanned

• Optimize file sizes

• Avoid files that are too small (generally, anything less than 128 MB)

• Fewer files, matching closely to block size

• Fewer calls to S3 (faster listing)

• Fewer network/HDFS requests

• Compress data set to minimize bandwidth from S3 to EC2

• Make sure you use splittable compression or have each file be the optimal size for parallelization on your cluster

• Columnar file formats like Parquet can give increased performance on reads
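The partitioning and file-sizing tips above can be sketched in a few lines. The helpers below are illustrative assumptions (not an EMR or S3 API): one picks an output file count so each file lands near the 128 MB guideline, the other builds a Hive-style partition path that query engines can prune on:

```python
import math

def target_file_count(total_bytes: int,
                      target_file_bytes: int = 128 * 1024**2) -> int:
    """Aim for ~128 MB files rather than many tiny ones."""
    return max(1, math.ceil(total_bytes / target_file_bytes))

def partition_path(bucket: str, table: str,
                   year: int, month: int, day: int) -> str:
    """Hive-style partition layout: engines read only matching prefixes."""
    return f"s3://{bucket}/{table}/year={year}/month={month:02d}/day={day:02d}/"

print(target_file_count(10 * 1024**3))  # 10 GB -> 80 files of ~128 MB
print(partition_path("my-data-lake", "events", 2018, 11, 14))
```

A query filtered on `year=2018 AND month=11` then lists and reads only the matching prefixes instead of the whole table.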

Page 19

File Formats

Columnar – Parquet & ORC

• Compressed

• Column based read optimized

• Integrated indexes and stats

Row – Avro

• Compressed

• Row based read optimized

• Integrated indexes and stats

Text – xSV, JSON

• May or may not be compressed

• Not optimized

• Generic and malleable

Page 20

File Format – Examples

SELECT count(*) as count FROM examples_csv

Run time: 36 seconds, Data scanned: 15.9GB

SELECT count(*) as count FROM examples_parquet

Run time: 5 seconds, Data scanned: 4.93GB
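Because Athena bills per byte scanned, the scan reduction above translates directly into cost. A sketch using the slide's numbers; the $5-per-TB rate was Athena's on-demand price at the time and should be treated as an assumption to check against current pricing:

```python
price_per_tb = 5.00  # USD per TB scanned (assumed Athena on-demand rate)

def query_cost(gb_scanned: float) -> float:
    """Cost of a single query given the data it scans."""
    return round(gb_scanned / 1024 * price_per_tb, 4)

csv_cost = query_cost(15.9)      # full scan of the CSV table
parquet_cost = query_cost(4.93)  # columnar scan of the Parquet table
print(csv_cost, parquet_cost)    # the Parquet query costs roughly a third
```

The same count runs about 7x faster and roughly 3x cheaper purely from the format change.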

Page 21

File Formats – Considerations

Scanning

• xSV and JSON require scanning the entire file

• Columnar ideal when selecting only a subset of columns

• Row ideal when selecting all columns of a subset of rows

Read Performance

• Text – SLOW

• Avro – Optimal (specific to use case)

• Parquet & ORC – Optimal (specific to use case)

Write Performance

• Text – SLOW

• Avro – Good

• Parquet & ORC – Good (has some overhead with large datasets)

* Highly dependent on the dataset

Page 22

External Hive Metastore

Use an external metastore when you require a persistent metastore or a metastore shared by different clusters, services, and applications. There are two options for an external metastore:

• Amazon RDS/Aurora

• AWS Glue Data Catalog (Amazon EMR version 5.8.0 or later only)

• Search metastore

• Utilize crawlers to detect new data, schema, partitions

• Schema and version management
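To use the Glue Data Catalog as the Hive metastore, an EMR cluster (5.8.0 or later) can be launched with a `hive-site` configuration like the following. The classification and property name are the documented ones; treat this as a sketch and verify against the current EMR documentation:

```json
[
  {
    "Classification": "hive-site",
    "Properties": {
      "hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
    }
  }
]
```

With this in place, tables defined once in the catalog are visible to every cluster, as well as to Athena and Redshift Spectrum.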

Page 23

Architecting for Cost and Scalability

Page 24

Amazon EMR Pricing

With Amazon EMR, you pay a per-second rate for every second you use.

The price is based on the instance type and number of EC2 instances that you deploy and the region in which you launch your cluster.
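The model above (per-instance EC2 rate plus per-instance EMR rate, billed per second) can be sketched as follows. The hourly rates are illustrative assumptions, not current price-list values:

```python
def cluster_cost(ec2_hourly: float, emr_hourly: float,
                 instances: int, seconds: int) -> float:
    """Per-second billing: (EC2 + EMR rate) per instance, per second used."""
    per_second = (ec2_hourly + emr_hourly) / 3600
    return round(per_second * instances * seconds, 2)

# 10 nodes at an assumed $0.40/hr EC2 + $0.10/hr EMR, running 30 minutes.
print(cluster_cost(0.40, 0.10, 10, 1800))  # -> 2.5
```

Per-second billing is what makes short-lived transient clusters economical: a 30-minute job costs half of what the same cluster would cost for an hour.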

Page 25

Performance and hardware

Considerations

• Transient or long running

• Instance types

• Cluster size

• Application settings

• File formats and S3 tuning

• Master node: r3.2xlarge

• Slave group – Core: c4.2xlarge

• Slave group – Task: m4.2xlarge (EC2 Spot)

Page 26

Reserved and Spot Instances

Spot Instances

Amazon EC2 Spot Instances offer spare compute capacity available at discounts compared to On-Demand instances.

Reserved Instances

Amazon EC2 Reserved Instances let you commit to instances you want to reserve at a significant discount compared to On-Demand pricing.

Page 27

Understanding the Node Types in Amazon EMR

Master node: The node that manages the cluster. The master node tracks the status of tasks and monitors the health of the cluster.

Core node: A node that runs tasks and stores data in the Hadoop Distributed File System (HDFS) on your cluster.

Task node: A node that only runs tasks and does not store data in HDFS. Task nodes are optional.

Amazon EMR Cluster

Page 28

Use Spot and Reserved Instances to lower costs

• Spot for task nodes: up to 80% off EC2 On-Demand pricing; exceed your SLA at lower cost

• On-Demand for core nodes: standard Amazon EC2 pricing for On-Demand capacity; meet your SLA at predictable cost
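The blended cost of this pattern is easy to estimate. A sketch with an assumed On-Demand rate and the "up to 80% off" Spot discount from the slide (both illustrative, not quoted prices):

```python
def blended_hourly(core_nodes: int, task_nodes: int,
                   on_demand_rate: float = 0.40,
                   spot_discount: float = 0.80) -> float:
    """Hourly cost: On-Demand core nodes plus discounted Spot task nodes."""
    spot_rate = on_demand_rate * (1 - spot_discount)  # up to 80% off
    return round(core_nodes * on_demand_rate + task_nodes * spot_rate, 2)

# 4 core nodes On-Demand + 16 task nodes on Spot,
# versus 20 nodes all On-Demand at the same rate (8.00/hr).
print(blended_hourly(4, 16))
```

Here the 20-node cluster costs 2.88/hr instead of 8.00/hr, because only the nodes holding HDFS data need the price-stable On-Demand capacity.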

Page 29

Advanced Spot Provisioning with Instance Fleets

[Diagram: master node, core instance fleet, and task instance fleet]

• Provision from a list of instance types with Spot and On-Demand

• Launch in the most optimal Availability Zone based on capacity/price

• Spot Block support
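An instance fleet definition looks roughly like the following. The structure follows the documented EMR instance-fleets API, but the instance types, capacities, and timeout are hypothetical values chosen for illustration:

```json
[
  {
    "InstanceFleetType": "TASK",
    "TargetOnDemandCapacity": 0,
    "TargetSpotCapacity": 20,
    "LaunchSpecifications": {
      "SpotSpecification": {
        "TimeoutDurationMinutes": 20,
        "TimeoutAction": "SWITCH_TO_ON_DEMAND"
      }
    },
    "InstanceTypeConfigs": [
      { "InstanceType": "m4.2xlarge", "WeightedCapacity": 1 },
      { "InstanceType": "r4.2xlarge", "WeightedCapacity": 1 },
      { "InstanceType": "c4.4xlarge", "WeightedCapacity": 2 }
    ]
  }
]
```

EMR fills the target capacity from whichever listed types have Spot capacity at the best price, and the timeout action falls back to On-Demand if Spot cannot be fulfilled.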

Page 30

Transient or long running workloads

• Transient: the cluster is launched for a workload and terminated when it finishes; data persists in Amazon S3

• Long running: the cluster stays up to serve recurring jobs or interactive users

Page 31

Amazon EMR: Lower costs with Auto Scaling
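An EMR automatic scaling policy attaches rules to an instance group, each driven by a CloudWatch alarm. The shape below follows the documented policy schema; the metric, thresholds, and capacities are illustrative assumptions:

```json
{
  "Constraints": { "MinCapacity": 2, "MaxCapacity": 10 },
  "Rules": [
    {
      "Name": "ScaleOutOnLowMemory",
      "Action": {
        "SimpleScalingPolicyConfiguration": {
          "AdjustmentType": "CHANGE_IN_CAPACITY",
          "ScalingAdjustment": 2,
          "CoolDown": 300
        }
      },
      "Trigger": {
        "CloudWatchAlarmDefinition": {
          "ComparisonOperator": "LESS_THAN",
          "EvaluationPeriods": 1,
          "MetricName": "YARNMemoryAvailablePercentage",
          "Namespace": "AWS/ElasticMapReduce",
          "Period": 300,
          "Threshold": 15.0,
          "Unit": "PERCENT"
        }
      }
    }
  ]
}
```

Here the cluster adds two nodes whenever available YARN memory drops below 15%, and a matching scale-in rule (not shown) would remove them when the cluster is idle, so you pay for peak capacity only while the peak lasts.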

Page 32

Thank you!