AWS Webcast - Database in the Cloud Series - Scalable Games and Analytics with AWS
AWS Analytics Modernization
Transcript of AWS Analytics Modernization
© 2021, Amazon Web Services, Inc. or its Affiliates.
Jay Elango
Analytics Specialist Architect
AWS Analytics Modernization
Modernize Your Big Data Platform with AWS Analytics Services
Topics
• Challenges associated with on-premises Hadoop Big Data Platform
• New realities facing the organizations
• Lake house architecture on AWS
• Value of moving to a managed service with Amazon EMR
• EMR Migration Programs
• Customer examples
Challenges with on-premises
Hadoop Big Data platform
Compute and storage grow together
• Storage grows along with compute
• Compute requirements vary
Tightly coupled
Replication adds to cost
3x
• Data is replicated several times
• Typically only in one data center
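The storage impact of replication comes down to simple arithmetic. The snippet below is a minimal sketch, assuming a replication factor of 3 (the "3x" on the slide); the function name and the 100 TB dataset size are hypothetical and chosen only for illustration.

```python
def raw_storage_tb(logical_tb: float, replication_factor: int = 3) -> float:
    """Raw disk capacity needed when every block is stored
    `replication_factor` times."""
    if logical_tb < 0 or replication_factor < 1:
        raise ValueError("size must be >= 0 and replication factor >= 1")
    return logical_tb * replication_factor


# Hypothetical example: 100 TB of logical data at 3x replication.
print(raw_storage_tb(100))  # 300
```

Every terabyte of data added to the cluster therefore consumes three terabytes of provisioned disk under this assumption.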
Underutilized or scarce resources
[Chart: load over time, with regions labeled Steady state, Weekly peaks, and Re-processing]
Contention for the same resources
Compute bound
Memory bound
Separation of resources creates data silos
Team A
Limited ability to fast-follow application versions
• Large Scale Transformation: MapReduce, Hive, Pig, Spark
• Interactive Queries: Impala, Spark SQL, Presto
• Machine Learning: Spark ML, MXNet, TensorFlow
• Interactive Notebooks: Jupyter, Zeppelin
• NoSQL: HBase
With a monolithic cluster, dependencies among downstream applications can prevent version upgrades.
By not upgrading, organizations may be limiting innovation.
The new realities organizations
are facing
New realities – Organizations want more value from their data
• Used by many people
• Growing exponentially
• From new sources
• Increasingly diverse
• Analyzed by many applications
Modernization goals:
▪ Drive innovation
  ▪ Enable customer experience, customer journey, and 360 analytics, and build product intelligence by deriving insights from varied data sources and formats.
▪ Business agility
  ▪ Enable the business to scale infrastructure, manage performance, and optimize cost.
What are organizations looking to build? A modernized data platform.
New realities – Organizations’ modernized platform needs
Data silos
• DW Silo 1 (sources: OLTP, ERP, CRM, LOB) → Business Intelligence
• DW Silo 2 (sources: Devices, Web, Sensors, Social) → Business Intelligence
to
Data Lake
• Non-relational databases
• Machine learning
• Data warehousing
• Log analytics
• Big data processing
• Relational databases
New realities – Organizations moving to Lake House architecture
• Scalable data lakes
• Purpose-built data services
• Seamless data movement
• Unified governance
• Performant and cost-effective
Data Lake
• Non-relational databases
• Machine learning
• Data warehousing
• Log analytics
• Big data processing
• Relational databases
Lake House architecture on AWS
• Scalable data lakes
• Purpose-built data services
• Seamless data movement
• Unified governance
• Performant and cost-effective
Services
• Amazon DynamoDB
• Amazon SageMaker
• Amazon Redshift
• Amazon Elasticsearch Service
• Amazon EMR
• Amazon Aurora
• Amazon Athena
• Amazon S3
Move to managed AWS Analytics services
• Amazon EMR: Spark, Hive, Presto, Hudi, HBase
• Amazon Elasticsearch Service: Elasticsearch, Logstash, Kibana (operational analytics)
• Amazon Managed Streaming for Apache Kafka (real-time analytics)
• Amazon Kinesis Data Analytics for Apache Flink: Apache Flink (real-time analytics)
Amazon EMR
Easily run Spark, Hadoop, Hive, Presto, HBase, and other big data frameworks
• Automate provisioning, configuring, and tuning: easy setup, management, and monitoring
• Get the latest, stable, open-source releases: latest open-source framework updates within 30 days
• Automatically scale up and down: manage cluster size based on utilization to reduce costs
• Simple and predictable pricing: per-second pricing, and save 50%–80% with Amazon EC2 Spot and Reserved Instances
Services
• Amazon Athena
• Amazon S3
• Amazon DynamoDB
• Amazon SageMaker
• Amazon Redshift
• Amazon Elasticsearch Service
• Amazon Aurora
• Amazon EMR
The value of moving to managed
Amazon EMR for big data platforms
Foundation 1: Decouple storage and compute
Foundation 2: Amazon S3 is your persistent data store
Amazon S3
• Unmatched durability, availability, and scalability
• Strong read-after-write consistency
• Support for transactions
• Easiest to use, with cost optimization: Intelligent-Tiering
• Best security (including row and column level), compliance, and audit capabilities
• Most ways to get data in
• Broadest portfolio of analytics tools
• Cold storage and archive capabilities
EMR Benefit 1: Turn off clusters
Amazon S3
EMR Benefit 2: Built-in Disaster Recovery
[Diagram: Clusters 1–4 in two Availability Zones, with Amazon S3 as shared storage]
EMR Benefit 3: Logical separation of jobs/applications
Traditional monolithic cluster vs. purpose-built clusters (for example, Ad-Hoc)
Re-architect from monolithic to purpose-built clusters by:
• Creating transient and/or persistent clusters
• Separating clusters by application
• Separating clusters by application version
• Isolating department-specific clusters
Design considerations:
• How do you submit jobs or build pipelines?
• Persisting your data in S3
• Storing metadata off the cluster
• How long does the job run?
• What applications are needed?
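A purpose-built transient cluster can be described as a single request. The sketch below builds keyword arguments for boto3's EMR `run_job_flow` call. The function names, release label, instance types, bucket path, and tag values are hypothetical and not taken from the deck. The role names are the EMR defaults and may differ in your account.

```python
def spark_transient_cluster_request(release_label: str, script_s3_uri: str,
                                    department: str) -> dict:
    """Build kwargs for boto3 EMR `run_job_flow`: one Spark job, then terminate."""
    return {
        "Name": f"{department}-spark-transient",
        "ReleaseLabel": release_label,          # separate clusters by version
        "Applications": [{"Name": "Spark"}],    # only what this job needs
        "Instances": {
            "InstanceGroups": [
                {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge",
                 "InstanceCount": 1},
                {"InstanceRole": "CORE", "InstanceType": "m5.xlarge",
                 "InstanceCount": 2},
            ],
            # Transient cluster: shut down once all steps have finished.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "Steps": [{
            "Name": "run-job",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                # The job script lives in S3, not on the cluster.
                "Args": ["spark-submit", script_s3_uri],
            },
        }],
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
        # Tag by department so clusters can be isolated and tracked.
        "Tags": [{"Key": "department", "Value": department}],
    }


def submit(request: dict) -> str:
    """Submit the request; needs boto3, AWS credentials, and a region."""
    import boto3
    return boto3.client("emr").run_job_flow(**request)["JobFlowId"]


# Hypothetical values, for illustration only.
request = spark_transient_cluster_request(
    "emr-6.3.0", "s3://example-bucket/jobs/etl.py", "marketing")
print(request["Name"])  # marketing-spark-transient
```

Because input, output, and the job script all sit in S3 in this sketch, nothing is lost when the cluster terminates after the step completes.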
EMR Benefit 4: Auto-scaling Clusters (Persistent / Transient)
Amazon EMR Cluster
Amazon EMR Managed Scaling:
Reduce costs by up to 60%
• Completely managed environment for automatically scaling clusters
• No configuration required beyond minimum and maximum capacity
• More data points and faster reaction time
• Can reduce costs by 20%–60%, depending on the workload pattern
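The minimum and maximum capacity are the only inputs to a managed scaling policy. The sketch below builds such a policy for boto3's EMR `put_managed_scaling_policy` call. The helper names, the limits of 2 and 20 instances, and the cluster ID are hypothetical.

```python
def managed_scaling_policy(min_units: int, max_units: int) -> dict:
    """Build the `ManagedScalingPolicy` argument from min/max capacity."""
    if not 0 < min_units <= max_units:
        raise ValueError("need 0 < min_units <= max_units")
    return {
        "ComputeLimits": {
            "UnitType": "Instances",  # capacity is counted in instances
            "MinimumCapacityUnits": min_units,
            "MaximumCapacityUnits": max_units,
        }
    }


def apply_policy(cluster_id: str, policy: dict) -> None:
    """Attach the policy to a running cluster; needs boto3 and credentials."""
    import boto3
    boto3.client("emr").put_managed_scaling_policy(
        ClusterId=cluster_id, ManagedScalingPolicy=policy)


# Hypothetical limits: never fewer than 2 instances, never more than 20.
policy = managed_scaling_policy(2, 20)
print(policy["ComputeLimits"]["MaximumCapacityUnits"])  # 20
# apply_policy("j-EXAMPLE", policy)  # hypothetical cluster ID
```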
EMR Benefit 5: Leverage Spot Instances & Instance Fleets
10-node cluster running for 14 hours
Cost = $1 * 10 nodes * 14 hours = $140 total
Auto scale with Spot Instances to reduce cost and run time
Add 10 more nodes of Spot at a 50% discount
20-node cluster running for 7 hours
Cost = $1 * 10 nodes * 7 hours = $70 (On-Demand)
     + $0.50 * 10 nodes * 7 hours = $35 (Spot)
     = $105 total
Results: 50% less run time (14 hrs → 7 hrs),
25% less cost ($140 → $105)
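The cost comparison above can be reproduced in a few lines. This is a sketch using the slide's own figures ($1 per node-hour On-Demand, 50% Spot discount). It assumes, as the slide does, that doubling the node count halves the run time, and it ignores Spot interruptions. The function and constant names are hypothetical.

```python
def cluster_cost(nodes: int, hours: float, rate_per_node_hour: float) -> float:
    """Cost of running `nodes` instances for `hours` at a flat hourly rate."""
    return nodes * hours * rate_per_node_hour


ON_DEMAND_RATE = 1.00   # dollars per node-hour, from the slide
SPOT_DISCOUNT = 0.50    # Spot priced at 50% off On-Demand, from the slide

# Baseline: 10 On-Demand nodes for 14 hours.
baseline = cluster_cost(10, 14, ON_DEMAND_RATE)

# Scaled: 10 On-Demand nodes plus 10 Spot nodes, both for 7 hours.
spot_rate = ON_DEMAND_RATE * (1 - SPOT_DISCOUNT)
scaled = cluster_cost(10, 7, ON_DEMAND_RATE) + cluster_cost(10, 7, spot_rate)

saving = (baseline - scaled) / baseline
print(f"baseline=${baseline:.0f}, with Spot=${scaled:.0f}, saving={saving:.0%}")
# baseline=$140, with Spot=$105, saving=25%
```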
Diversify Spot and On-Demand Instances via Instance Fleets
• Can mix different instance types and markets (On-Demand or Spot) in one group
• Don’t specify an AZ and we will find the cheapest one
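An instance fleet that mixes instance types and markets might be declared as follows. This sketch builds only the `Instances` portion of a boto3 EMR `run_job_flow` request, and a complete request would also need a master fleet. The helper names, instance types, capacities, and subnet IDs are hypothetical.

```python
from typing import Dict, List


def core_fleet(on_demand_units: int, spot_units: int,
               instance_types: List[str]) -> Dict:
    """One CORE fleet mixing On-Demand and Spot across several instance types."""
    return {
        "Name": "core-fleet",
        "InstanceFleetType": "CORE",
        "TargetOnDemandCapacity": on_demand_units,
        "TargetSpotCapacity": spot_units,
        # Each listed type counts as one capacity unit in this sketch.
        "InstanceTypeConfigs": [
            {"InstanceType": t, "WeightedCapacity": 1} for t in instance_types
        ],
    }


def fleet_instances(fleets: List[Dict], subnet_ids: List[str]) -> Dict:
    """`Instances` fragment: subnets in several AZs instead of one fixed AZ."""
    return {
        "InstanceFleets": fleets,
        # No single AZ is pinned; EMR chooses among the listed subnets.
        "Ec2SubnetIds": subnet_ids,
        "KeepJobFlowAliveWhenNoSteps": True,
    }


# Hypothetical values matching the slide's 10 On-Demand + 10 Spot example.
fleet = core_fleet(10, 10, ["m5.xlarge", "m5a.xlarge", "m4.xlarge"])
instances = fleet_instances([fleet], ["subnet-aaaa1111", "subnet-bbbb2222"])
print(len(fleet["InstanceTypeConfigs"]))  # 3
```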
EMR Benefit 6: EMR Self-service with AWS Service Catalog
Standardize
Enforce Consistency and Compliance
Limit Access
Enforce Tagging, Security Groups
Developer Autonomy
One-Stop Shop
Automate Deployments
Agile Governance
Configure Consume
EMR Benefit 7: Amazon EMR differentiated performance
• 1.7x faster performance than standard Apache Spark 3.0 at 40% of the cost
• 25.7% average cost reduction with Graviton2
• 11.5% average performance improvement with Graviton2
• Up to 2.6x faster performance than open-source Presto 0.238 at 80% of the cost
EMR Benefit 8: Fully Managed EMR Notebooks
1. Provide an end-to-end data engineering and data science environment with EMR Notebooks, which are based
on the popular open-source Jupyter Notebook, for building applications with Apache Spark
2. Attach to / detach from individual clusters; automatically backed up to S3
3. Tag-based permissions
4. Support for PySpark, Spark SQL, SparkR, and Scala
5. New features include: a visual experience for debugging and monitoring Spark jobs in the
off-cluster, persistent Apache Spark History Server from the EMR console; association with Git
repositories hosted on services such as GitHub and Bitbucket; and comparing and merging two notebooks
with the nbdime utility.
EMR Benefit 8: EMR Studio integrated development environment
Easily build and deploy data science code without logging in to the AWS console
Start notebooks in seconds, run jobs later
Save debugging time with native application UIs in one place
Build production pipelines simply and flexibly
EMR Benefit 9: Analysts confirm Lowest TCO in the Industry
Feb. 2019, Forrester recognizes:
AWS EMR as the Cloud Hadoop/Spark (HARK) Leader.
Nov. 2018, IDC report confirms:
“EMR provides 57% reduced costs vs. on premise resulting in 342% ROI over 5 years.”
Dec. 2018, Gartner suggests:
“AWS remains the largest Hadoop provider in terms of both revenue and user base.”
The Forrester Wave™ is copyrighted by Forrester Research, Inc. Forrester and
Forrester Wave™ are trademarks of Forrester Research, Inc. The Forrester
Wave™ is a graphical representation of Forrester's call on a market and is
plotted using a detailed spreadsheet with exposed scores, weightings, and
comments. Forrester does not endorse any vendor, product, or service
depicted in the Forrester Wave™. Information is based on best available
resources. Opinions reflect judgment at the time and are subject to change.
EMR Benefit 10: Leverage AWS Lake House Analytic ecosystem
Data Lake on AWS
Analytics
• Amazon Athena
• Amazon EMR
• Amazon Redshift
• Amazon Elasticsearch Service
• Amazon Kinesis
• Amazon QuickSight
Machine Learning
• Amazon SageMaker
• AWS Deep Learning AMIs
• Amazon Rekognition
• Amazon Lex
• AWS DeepLens
• Amazon Comprehend
• Amazon Translate
• Amazon Transcribe
• Amazon Polly
On-premises Data Movement
• AWS Direct Connect
• AWS Storage Gateway
• AWS Snowball
• AWS Snowmobile
Real-time Data Movement
• AWS IoT Core
• AWS Kinesis Firehose
• AWS Kinesis Data Streams
• AWS Kinesis Video Streams
AWS Glue, Blueprints, ML Transforms, Data Catalog, Access Control, Lake Formation
EMR Migration Programs
EMR Migration Guide
• Technical advice to help plan your migration
Free EMR Migration Workshop
• Jumpstart your migration to the cloud
Visit aws.amazon.com/emr/emr-migration/
Email [email protected]
EMR Migration Program
Customer Examples
Challenge
NHS Digital wanted to modernize its data access
environment for users across the UK. The legacy system
was too slow and expensive to maintain, and users were
frustrated with performance issues.
Solution
NHS Digital migrated the dataset from its legacy systems,
converted the data to Parquet format, and loaded it into
S3. KMS was used to encrypt the data, and Amazon EMR to
process the data from S3.
Benefit
Performance improved from 137 minutes to 137
seconds using Amazon EMR.
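Going from 137 minutes to 137 seconds works out to a 60x speedup, since the two figures differ only by the minutes-to-seconds factor. A minimal check follows; the function name is hypothetical.

```python
def speedup(before_seconds: float, after_seconds: float) -> float:
    """How many times faster the new run is than the old one."""
    return before_seconds / after_seconds


# 137 minutes before, 137 seconds after (figures from the slide).
factor = speedup(137 * 60, 137)
print(f"{factor:.0f}x faster")  # 60x faster
```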
NHS Digital
Customer Examples - High impact results with Amazon EMR
near real-time analytics for 140M players
scales 3,000 transient clusters on a daily basis
achieves cost savings of 55% compared to On-Demand pricing and 40% compared to Reserved Instances
powers the Predix solution processing 1,000,000 data executions/day
computes Zestimates on 100M+ homes in hours instead of 1 day
Customer Examples - On-premises migrations to Amazon EMR
processes 135B events/day with cost savings of 60% (~$20M)
decreased costs by $600k in less than 5 months
saves 75% and is 60% more efficient
reduced cost of operation and improved Spark performance 3x
re-architects 1 monolithic pipeline into 3 purpose-built clusters
Thank You