AWS Analytics Modernization

© 2021, Amazon Web Services, Inc. or its Affiliates. Jay Elango Analytics Specialist Architect AWS Analytics Modernization Modernize Your Big Data Platform with AWS Analytics Services

Transcript of AWS Analytics Modernization

Page 1: AWS Analytics Modernization

© 2021, Amazon Web Services, Inc. or its Affiliates.

Jay Elango

Analytics Specialist Architect

AWS Analytics Modernization: Modernize Your Big Data Platform with AWS Analytics Services

Page 2: AWS Analytics Modernization

Topics

• Challenges associated with on-premises Hadoop Big Data Platform

• New realities facing organizations

• Lake house architecture on AWS

• Value of moving to a managed service with Amazon EMR

• EMR Migration Programs

• Customer examples

Page 3: AWS Analytics Modernization

Challenges with on-premises Hadoop Big Data platform

Page 4: AWS Analytics Modernization

Compute and storage grow together

• Storage grows along with compute

• Compute requirements vary

Tightly coupled

Page 5: AWS Analytics Modernization

Replication adds to cost

3x

• Data is replicated several times

• Typically only in one data center
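The effect of replication on raw storage can be sketched with simple arithmetic. This is a minimal illustration, assuming the slide's "3x" refers to a replication factor of 3 (the HDFS default); the 100 TB dataset size and the function name `raw_storage_tb` are hypothetical.

```python
def raw_storage_tb(logical_tb, replication_factor=3):
    """Raw disk capacity consumed when every block is stored `replication_factor` times."""
    return logical_tb * replication_factor


# Hypothetical example: 100 TB of data needs 300 TB of raw disk at 3x replication.
print(raw_storage_tb(100))  # 300
```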

Page 6: AWS Analytics Modernization

Underutilized or scarce resources

[Chart: y-axis from 0 to 120, x-axis from 1 to 26, with annotations for steady state, weekly peaks, and re-processing]

Page 7: AWS Analytics Modernization

Contention for the same resources

Compute bound

Memory bound

Page 8: AWS Analytics Modernization

Separation of resources creates data silos

Team A

Page 9: AWS Analytics Modernization

Limited ability to fast-follow application versions

• Large-Scale Transformation: MapReduce, Hive, Pig, Spark

• Interactive Queries: Impala, Spark SQL, Presto

• Machine Learning: Spark ML, MXNet, TensorFlow

• Interactive Notebooks: Jupyter, Zeppelin

• NoSQL: HBase

With a monolithic cluster, dependencies of downstream applications can block version upgrades. By not upgrading, organizations could be limiting innovation.

Page 10: AWS Analytics Modernization

The new realities organizations are facing

Page 11: AWS Analytics Modernization

New realities – Organizations want more value from their data

• Used by many people

• Growing exponentially

• From new sources

• Increasingly diverse

• Analyzed by many applications

Modernization goals:

▪ Drive innovation

▪ Enable organizations to pursue Customer Experience/Journey Analytics/360 Analytics and build product intelligence by deriving insights from various data sources and formats.

▪ Business agility

▪ Enable the business to scale infrastructure, manage performance, and optimize cost.

What are organizations looking to build? A modernized data platform

Page 12: AWS Analytics Modernization

New realities – Organization’s modernized platform needs

Data silos

• OLTP, ERP, CRM, LOB; DW Silo 1; Business Intelligence

• Devices, Web, Sensors, Social; DW Silo 2; Business Intelligence

to Data Lake

• Non-relational databases

• Machine learning

• Data warehousing

• Log analytics

• Big data processing

• Relational databases

Page 13: AWS Analytics Modernization

New realities - Organizations moving to Lake House architecture

• Scalable data lakes

• Purpose-built data services

• Seamless data movement

• Unified governance

• Performant and cost-effective

[Diagram: Data Lake with Non-relational databases, Machine learning, Data warehousing, Log analytics, Big data processing, and Relational databases]

Page 14: AWS Analytics Modernization

Lake House architecture on AWS

• Scalable data lakes

• Purpose-built data services

• Seamless data movement

• Unified governance

• Performant and cost-effective

[Diagram: Amazon DynamoDB, Amazon SageMaker, Amazon Redshift, Amazon Elasticsearch Service, Amazon EMR, Amazon Aurora, Amazon Athena, and Amazon S3]

Page 15: AWS Analytics Modernization

Move to managed AWS Analytics services

• Amazon EMR: Spark, Hive, Presto, Hudi, HBase

• Amazon Elasticsearch Service: Elasticsearch, Logstash, Kibana (operational analytics)

• Amazon Managed Streaming for Apache Kafka (real-time analytics)

• Amazon Kinesis Data Analytics for Apache Flink: Apache Flink (real-time analytics)

Page 16: AWS Analytics Modernization

Amazon EMR: Easily run Spark, Hadoop, Hive, Presto, HBase, and other big data frameworks

• Automate provisioning, configuring, and tuning: Easy setup, management, and monitoring

• Get the latest, stable, open-source releases: Latest open-source framework updates within 30 days

• Automatically scale up and down: Manage cluster size based on utilization to reduce costs

• Simple and predictable pricing: Per-second pricing, and save 50%–80% with Amazon EC2 Spot and Reserved Instances

[Diagram: Amazon Athena, Amazon S3, Amazon DynamoDB, Amazon SageMaker, Amazon Redshift, Amazon Elasticsearch Service, Amazon Aurora, and Amazon EMR]

Page 17: AWS Analytics Modernization

The value of moving to managed Amazon EMR for big data platforms

Page 18: AWS Analytics Modernization

Foundation 1: Decouple storage and compute

Page 19: AWS Analytics Modernization

Foundation 2: Amazon S3 is your persistent data store

Amazon S3

• Unmatched durability, availability, and scalability

• Strong read-after-write consistency

• Support for transactions

• Easiest to use with cost optimization: Intelligent tiering

• Best security (including Row & Column level), compliance, and audit capabilities

• Most ways to get data in

• Broadest portfolio of analytics tools

• Cold storage and archive capabilities

Page 20: AWS Analytics Modernization

EMR Benefit 1: Turn off clusters

Amazon S3

Page 21: AWS Analytics Modernization

EMR Benefit 2: Built-in Disaster Recovery

[Diagram: Clusters 1 to 4 across two Availability Zones, with Amazon S3]

Page 22: AWS Analytics Modernization

EMR Benefit 3: Logical separation of jobs/applications

Traditional Monolithic Cluster vs. Purpose-built Clusters

Ad-Hoc

Re-architect monolithic to purpose-built clusters by:

• Creating transient and/or persistent clusters

• Separating clusters by application

• Separating clusters by application version

• Isolating department-specific clusters

Design consideration is given to:

• How you submit jobs or build pipelines

• Persisting your data in S3

• Storing metadata off the cluster

• How long the job runs

• What applications are needed
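The design considerations above can be sketched as a request for a transient, purpose-built cluster. This is a minimal sketch, not a definitive setup: it only builds the parameter dictionary that boto3's EMR `run_job_flow` call accepts and does not contact AWS. The function name, bucket paths, release label, instance types, and the choice of the AWS Glue Data Catalog as the off-cluster metastore are all assumptions made for illustration.

```python
def transient_spark_cluster(name, script_s3_uri, log_s3_uri):
    """Build run_job_flow parameters for a cluster that runs one Spark step, then terminates."""
    return {
        "Name": name,
        "ReleaseLabel": "emr-6.2.0",            # assumed release; pick per application version
        "LogUri": log_s3_uri,                   # logs persist in S3, not on the cluster
        "Applications": [{"Name": "Spark"}],    # install only what this job needs
        "Configurations": [{
            # Assumption: keep table metadata off the cluster in the AWS Glue Data Catalog.
            "Classification": "spark-hive-site",
            "Properties": {
                "hive.metastore.client.factory.class":
                    "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
            },
        }],
        "Instances": {
            "InstanceGroups": [
                {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
                {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
            ],
            # Transient cluster: shut down once all steps have finished.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "Steps": [{
            "Name": "spark-job",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", "--deploy-mode", "cluster", script_s3_uri],
            },
        }],
        "JobFlowRole": "EMR_EC2_DefaultRole",   # default EMR role names
        "ServiceRole": "EMR_DefaultRole",
    }


# Hypothetical bucket and script paths.
params = transient_spark_cluster(
    "nightly-etl",
    "s3://example-bucket/jobs/etl.py",
    "s3://example-bucket/emr-logs/",
)
print(params["Instances"]["KeepJobFlowAliveWhenNoSteps"])  # False
```

With boto3 installed and credentials configured, the dictionary could be passed as `boto3.client("emr").run_job_flow(**params)`. A persistent cluster would instead set `KeepJobFlowAliveWhenNoSteps` to `True` and receive steps over time.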

Page 23: AWS Analytics Modernization

EMR Benefit 4: Auto-scaling clusters (persistent/transient)

Amazon EMR Cluster

Page 24: AWS Analytics Modernization

Amazon EMR Managed Scaling: Reduce costs by up to 60%

• Completely managed environment for automatically scaling clusters

• No configurations required except min/max capacity

• More data points and faster reaction time

• Can save 20%–60% in costs depending on the workload pattern
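The "min/max capacity" setting can be illustrated with the payload that boto3's EMR `put_managed_scaling_policy` call expects. This is a minimal sketch under stated assumptions: the helper name `managed_scaling_policy`, the 2 and 20 instance limits, and the placeholder cluster ID are hypothetical, and the code only builds the dictionary without calling AWS.

```python
def managed_scaling_policy(min_units, max_units, unit_type="Instances"):
    """Build a ManagedScalingPolicy payload with only the min/max compute limits."""
    if not 1 <= min_units <= max_units:
        raise ValueError("expected 1 <= min_units <= max_units")
    return {
        "ComputeLimits": {
            "UnitType": unit_type,              # "Instances", "InstanceFleetUnits", or "VCPU"
            "MinimumCapacityUnits": min_units,
            "MaximumCapacityUnits": max_units,
        }
    }


# Hypothetical limits: never fewer than 2 instances, never more than 20.
policy = managed_scaling_policy(min_units=2, max_units=20)
print(policy["ComputeLimits"]["MaximumCapacityUnits"])  # 20

# To apply it (requires boto3, AWS credentials, and a real cluster ID):
# import boto3
# boto3.client("emr").put_managed_scaling_policy(
#     ClusterId="j-XXXXXXXXXXXXX",  # placeholder
#     ManagedScalingPolicy=policy,
# )
```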

Page 25: AWS Analytics Modernization

EMR Benefit 5: Leverage Spot Instances & Instance Fleets

10-node cluster running for 14 hours

Cost = $1 * 10 nodes * 14 hours = $140 total

Auto scale with Spot Instances to reduce cost and run-time: add 10 more nodes of Spot at a 50% discount

20-node cluster running for 7 hours

Cost = $1 * 10 nodes * 7 hours = $70

+ $0.5 * 10 nodes * 7 hours = $35

= $105 total

Results: 50% less run-time (14 hrs → 7 hrs), 25% less cost ($140 → $105)
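The slide's arithmetic can be reproduced in a few lines. This is a sketch using the slide's own figures ($1 per node-hour On-Demand, a 50% Spot discount, and run-time halving when the node count doubles); the function name `cluster_cost` is hypothetical.

```python
def cluster_cost(node_groups, hours):
    """Total cost of a cluster that runs for `hours` hours.

    node_groups: list of (node_count, hourly_price_per_node) tuples.
    """
    return sum(count * price * hours for count, price in node_groups)


# Baseline: 10 On-Demand nodes at $1/hour for 14 hours.
baseline = cluster_cost([(10, 1.00)], hours=14)

# Scaled out: the same 10 On-Demand nodes plus 10 Spot nodes at a 50% discount.
# Assumes run-time halves when the node count doubles, as on the slide.
scaled = cluster_cost([(10, 1.00), (10, 0.50)], hours=7)

savings_pct = (baseline - scaled) / baseline * 100
print(baseline, scaled, savings_pct)  # 140.0 105.0 25.0
```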

Diversify Spot and On-Demand Instances via Instance Fleets

• Can mix different instance types and markets (On-Demand or Spot) in one group

• Don’t specify an AZ and we will find the cheapest one
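A mixed fleet of this kind can be sketched as one entry of the `InstanceFleets` list in boto3's EMR `run_job_flow` parameters. This is an illustrative sketch only: the helper name `task_fleet`, the instance types, the capacity targets, and the 10-minute Spot timeout are assumptions, and nothing here calls AWS.

```python
def task_fleet(instance_types, on_demand_units, spot_units):
    """Build a TASK instance fleet that mixes instance types and purchasing markets."""
    return {
        "Name": "task-fleet",
        "InstanceFleetType": "TASK",
        "TargetOnDemandCapacity": on_demand_units,  # units filled from On-Demand
        "TargetSpotCapacity": spot_units,           # units filled from Spot
        "InstanceTypeConfigs": [
            # Each type counts as one capacity unit in this sketch.
            {"InstanceType": t, "WeightedCapacity": 1} for t in instance_types
        ],
        "LaunchSpecifications": {
            "SpotSpecification": {
                # If Spot capacity is not available in time, fall back to On-Demand.
                "TimeoutDurationMinutes": 10,
                "TimeoutAction": "SWITCH_TO_ON_DEMAND",
            }
        },
    }


# Hypothetical mix: three interchangeable instance types, 2 On-Demand and 8 Spot units.
fleet = task_fleet(["m5.xlarge", "m5a.xlarge", "r5.xlarge"], on_demand_units=2, spot_units=8)
print([c["InstanceType"] for c in fleet["InstanceTypeConfigs"]])
```

When instance fleets are used, `run_job_flow` also accepts several subnets through `Instances.Ec2SubnetIds`, which lets EMR choose the Availability Zone rather than having it fixed in the request.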

Page 26: AWS Analytics Modernization

EMR Benefit 6: EMR self-service with AWS Service Catalog

• Standardize

• Enforce Consistency and Compliance

• Limit Access

• Enforce Tagging, Security Groups

• Developer Autonomy

• One-Stop Shop

• Automate Deployments

• Agile Governance

Configure, Consume

Page 27: AWS Analytics Modernization

EMR Benefit 7: Amazon EMR differentiated performance

• 1.7x faster performance than standard Apache Spark 3.0 at 40% of the cost

• 25.7% average cost reduction with Graviton2

• 11.5% average performance improvement with Graviton2

• Up to 2.6x faster performance than open-source Presto 0.238 at 80% of the cost

Page 28: AWS Analytics Modernization

EMR Benefit 8: Fully managed EMR Notebooks

1. End-to-end data engineering and data science with EMR Notebooks, which are based on the popular open-source Jupyter Notebook, for building applications with Apache Spark

2. Attach to / detach from individual clusters; notebooks are automatically backed up to S3

3. Tag-based permissions

4. Support for PySpark, Spark SQL, SparkR, and Scala

5. New features include a visual experience to debug and monitor Spark jobs in the off-cluster, persistent Apache Spark History Server from the EMR console; associating Git repositories such as GitHub and Bitbucket; and comparing and merging two notebooks using the nbdime utility.

Page 29: AWS Analytics Modernization

EMR Benefit 8: EMR Studio integrated development environment

Easily build and deploy data science code without logging in to the AWS console

Start notebooks in seconds, run jobs later

Save debugging time with native application UIs in one place

Build production pipelines simply and flexibly

Page 30: AWS Analytics Modernization

EMR Benefit 9: Analysts confirm lowest TCO in the industry

Feb. 2019, Forrester recognizes: AWS EMR as the Cloud Hadoop/Spark (HARK) Leader.

Nov. 2018, IDC report confirms: “EMR provides 57% reduced costs vs. on premise resulting in 342% ROI over 5 years.”

Dec. 2018, Gartner suggests: “AWS remains the largest Hadoop provider in terms of both revenue and user base.”

The Forrester Wave™ is copyrighted by Forrester Research, Inc. Forrester and Forrester Wave™ are trademarks of Forrester Research, Inc. The Forrester Wave™ is a graphical representation of Forrester's call on a market and is plotted using a detailed spreadsheet with exposed scores, weightings, and comments. Forrester does not endorse any vendor, product, or service depicted in the Forrester Wave™. Information is based on best available resources. Opinions reflect judgment at the time and are subject to change.

Page 31: AWS Analytics Modernization

EMR Benefit 10: Leverage the AWS Lake House analytics ecosystem

Data Lake on AWS

• Analytics: Amazon Athena, Amazon EMR, Amazon Redshift, Amazon Elasticsearch Service, Amazon Kinesis, Amazon QuickSight

• Machine Learning: Amazon SageMaker, AWS Deep Learning AMIs, Amazon Rekognition, Amazon Lex, AWS DeepLens, Amazon Comprehend, Amazon Translate, Amazon Transcribe, Amazon Polly

• On-premises Data Movement: AWS Direct Connect, AWS Storage Gateway, AWS Snowball, AWS Snowmobile

• Real-time Data Movement: AWS IoT Core, AWS Kinesis Firehose, AWS Kinesis Data Streams, AWS Kinesis Video Streams

• AWS Glue and Lake Formation: Blueprints, ML Transforms, Data Catalog, Access Control

Page 32: AWS Analytics Modernization

EMR Migration Programs

Page 33: AWS Analytics Modernization

EMR Migration Guide

• Technical advice to help plan your migration

Free EMR Migration Workshop

• Jumpstart your migration to the cloud

Visit aws.amazon.com/emr/emr-migration/

Email [email protected]

EMR Migration Program

Page 34: AWS Analytics Modernization

Customer Examples

Page 35: AWS Analytics Modernization

NHS Digital

Challenge

NHS Digital wanted to modernize its data access environment for users across the UK. The legacy system was too slow and expensive to maintain, and users were frustrated with performance issues.

Solution

NHS Digital migrated the dataset from its legacy systems, converted the data into Parquet format, and loaded it into S3. KMS was used to encrypt the data, and Amazon EMR was used to process the data from S3.

Benefit

Performance improved from 137 minutes to 137 seconds (60x faster) using Amazon EMR.

Page 36: AWS Analytics Modernization

Customer Examples - High impact results with Amazon EMR

near real-time analytics for 140M players

scales 3,000 transient clusters on a daily basis

achieves cost savings of 55% when compared to On-Demand pricing and 40% savings when compared to Reserved Instances

powers the Predix solution processing 1,000,000 data executions/day

computes Zestimates on 100M+ homes in hours instead of 1 day

Page 37: AWS Analytics Modernization

Customer Examples - On-premises migrations to Amazon EMR

processes 135B events/day and has cost savings of 60% (~$20M)

decreased costs by $600k in less than 5 months

saves 75% and is 60% more efficient

reduced cost of operation and improved Spark performance 3x

re-architects 1 monolithic pipeline into 3 purpose-built clusters

Page 38: AWS Analytics Modernization

Thank You