Immersion Day Building a Data Lake on AWS
Transcript of Immersion Day Building a Data Lake on AWS
© 2021, Amazon Web Services, Inc. or its Affiliates.
Jason Moldan, Solutions Architect
Norman Owens, Solutions Architect
Arun Shanmugam, Data Architect
Sreeji Gopal, Data Architect
Matt Atwater, Principle Architect, Clear Scale
Immersion Day – Building a Data
Lake on AWS
© 2021, Amazon Web Services, Inc. or its Affiliates.
Workshop Agenda
1. Overview of the Data Lake
2. Hydrating the Data Lake
Lab1 – Hydrating the Data Lake with DMS
3. Working Within the Data Lake
Lab2 – ETL with AWS Glue
Lab3 - Querying the Data Lake with Amazon Athena and
Amazon QuickSight
4. Consuming the Data Lake – Reporting, Analytics, Machine
Learning
5. Introduction to DataBrew
© 2021, Amazon Web Services, Inc. or its Affiliates.
• A centralized repository for both
structured and unstructured data
• Store data as-is in open-source file
formats to enable direct analytics
What is a Data Lake?
© 2021, Amazon Web Services, Inc. or its Affiliates.
Why a Data Lake?
• Decouple storage from compute,
allowing you to scale
• Enable advanced analytics across all of
your data sources
• Reduce complexity in ETL and
operational overhead
• Future extensibility as new database and
analytics technologies are invented
© 2021, Amazon Web Services, Inc. or its Affiliates.
Traditionally, Analytics Looked Like This
OLTP ERP CRM LOB
Data Warehouse
Business Intelligence
TBs-PBs Scale
Schema Defined Prior to Data Load
Operational and Ad Hoc Reporting
Large Initial Capex + $$K / TB/ Year
Relational Data
© 2021, Amazon Web Services, Inc. or its Affiliates.
Data Lakes Extend the Traditional Approach
OLTP ERP CRM LOB
Catalog
DW
Queries
Big Data
ProcessingInteractive Real-Time
Web Sensors SocialDevices
Business Intelligence Machine Learning TB-EBs Scale
All Data in one place, a Single Source of Truth
Relational and Non-Relational Data
Decouples (low cost) Storage and Compute
Schema on Read
Diverse Analytical Engines
Data Lake
100110000100101011100
101010111001010100001
011111011010001111001
0110010110
0100011000010
© 2021, Amazon Web Services, Inc. or its Affiliates.
A m a z o n S 3
A m a z o n G l a c i e r
A W S G l u e
Store Data in the Format You WantOpen and comprehensive
• Store data in the format you want:
• Text files like CSV
• Columnar like Apache Parquet, and Apache ORC
• Logstash like Grok
• JSON (simple, nested), AVRO
• And more…
CSV
ORC
Grok
Avro
Parquet
JSON
© 2021, Amazon Web Services, Inc. or its Affiliates.
Any ScaleScalable and durable
• S3 has trillions of objects and exabytes of data
• Built to store any amount of data
• Run analytic engines at largest scale by spinning
up any amount of compute resources in minutes
• Runs on the world’s largest global
cloud infrastructure
© 2021, Amazon Web Services, Inc. or its Affiliates.
Designed for 11 9s
of durability
Designed for
99.99% availability
Durable Available High performance Multiple upload
Range GET
Store as much as you need
Scale storage and compute
independently
No minimum usage commitments
Scalable
Amazon EMR
Amazon Redshift
Amazon DynamoDB
Amazon SageMaker
Many more
Integrated
Simple REST API
AWS SDKs
Read-after-create consistency
Event notification
Lifecycle policies
Easy to use
Why Amazon S3 for a Data Lake?
© 2021, Amazon Web Services, Inc. or its Affiliates.
Pay Only for the Resources You Use as you ScaleLowest Cost
• Pay-as-you-go for the resources you consume
• As low as $0.05/GB scanned with Athena
• EMR and Athena can automatically scale down
resources after job completes, saving you costs
• Commit to a set term and save up to 75% with
Reserved Instance
• Run on spare compute capacity with EMR and
save up to 90% with Spot
Traditional approach leads to wasted capacity
Traditional: Rigid
AWS: Elastic
Capacity
Demand
Demand
Servers
Unmet demand
upset players
missed revenue
Excess capacity
wasted $$$
AWS approach: pay for the capacity you use
© 2021, Amazon Web Services, Inc. or its Affiliates.
Lowest Total Cost of Ownership (TCO)Cost-effective
• Less admin time to
manage, and support
• No up-front costs—
hardware acquisition,
installation
• Save on operating
costs—data center space,
power, cooling
• Business value: cost of
delays, risk premium,
competitive abilities,
governance, etc.
• Less admin time to manage,
and support
• No up-front costs—hardware
acquisition, installation
• Save on operating
costs—data center space,
power, cooling
• Business value: cost of delays,
risk premium, competitive
abilities, governance, etc.
© 2021, Amazon Web Services, Inc. or its Affiliates.
Typical steps for building a data lake
Implementing a Data Lake architecture requires a broad set of tools and
technologies to serve an increasingly diverse set of applications and use
cases.
Set up storage1
Move data2 Cleanse, prep, and catalog data
3 Configure and enforce security and compliance policies
4 Make data available for analytics
5
Processing & Analytics
Real-time Analytics
AI & Predictive
BI & Data Visualization
Transactional & RDBMS
AWS LambdaApache Storm on
EMR
Apache Flink
on EMRSpark Streaming
on EMR
Elasticsearch
ServiceKinesis Data Analytics,
Kinesis Data Streams
DynamoDB
NoSQL DB Relational Database
Aurora
EMR
Hadoop, Spark,
Presto
Redshift
Data Warehouse
Athena
Query Service
Amazon Lex
Speech
recognition
Amazon
Rekognition
Amazon Polly
Text to speech
Machine Learning
Predictive analytics
SageMaker
© 2021, Amazon Web Services, Inc. or its Affiliates.
Data Lake on AWS
Data Ingestion
AWS
SnowballAWS Storage
Gateway
Amazon
Kinesis Data
Firehose
AWS Direct
Connect
AWS Database
Migration
Service
TempSST SVT
S3
Central Storage
Scalable, secure, cost-effective ETL Enrich
© 2021, Amazon Web Services, Inc. or its Affiliates.
Data Lake on AWS
Data Ingestion
AWS
SnowballAWS Storage
Gateway
Amazon
Kinesis Data
Firehose
AWS Direct
Connect
AWS Database
Migration
Service
S3
Central Storage
Scalable, secure, cost-effective
Catalog & Search
Amazon
DynamoDBAmazon Elasticsearch
Service
AWS
Glue
Analytics & Serving
Amazon
Athena
Amazon
EMRAWS
Glue
Amazon
Redshift
Amazon
DynamoDB
Amazon
QuickSight
Amazon
Kinesis
Amazon
Elasticsearch
Service
Amazon
NeptuneAmazon
RDS
Access & User Interfaces
AWS
AppSync
Amazon
API Gateway
Amazon
Cognito
AWS
KMSAWS
CloudTrail
Manage & Secure
AWS
IAM
Amazon
CloudWatch
© 2021, Amazon Web Services, Inc. or its Affiliates.
Data Lake on AWS
Access & User Interfaces
AWS
AppSync
Amazon
API Gateway
Amazon
Cognito
AWS
KMSAWS
CloudTrail
Manage & Secure
AWS
IAM
Amazon
CloudWatch
Data Ingestion
AWS
SnowballAWS Storage
Gateway
Amazon
Kinesis Data
Firehose
AWS Direct
Connect
AWS Database
Migration
Service
Analytics & Serving
Amazon
Athena
Amazon
EMRAWS
Glue
Amazon
Redshift
Amazon
DynamoDB
Amazon
QuickSight
Amazon
Kinesis
Amazon
Elasticsearch
Service
Amazon
NeptuneAmazon
RDS
S3
Central Storage
Scalable, secure, cost-effective
Catalog & Search
Amazon
DynamoDBAmazon Elasticsearch
Service
AWS
Glue
© 2021, Amazon Web Services, Inc. or its Affiliates.
Data Lake Architectureson AWS
S CALAB LE D ATA LAK ES
P U R P O S E - B U I LT
D ATA S E R V I C E S
S EAM LE S S
D ATA M O V EM EN T
U N I FI E D G O V E R N AN CE
P E R FO R M AN T AN D
C O S T- EFFECTI V E
Amazon
DynamoDB
Amazon
SageMaker
Amazon
Redshift
Amazon
Elasticsearch
Service
Amazon
EMR
AmazonS3
Amazon
Aurora
AmazonAthena
Non-
relational
databases
Machine
learning
Data
warehousing
Log
analytics
Big data
processing
Relational
databases
© 2021, Amazon Web Services, Inc. or its Affiliates.
Benefits of a Data Lake – All Data in One Place
Store and analyze all of your data,
from all of your sources, in one
centralized location.
“Why is the data distributed in
many locations? Where is the
single source of truth ?”
© 2021, Amazon Web Services, Inc. or its Affiliates.
Benefits of a Data Lake – Quick Ingest
Quickly ingest data
without needing to force it into a
pre-defined schema.
“How can I collect data quickly
from various sources and store
it efficiently?”
© 2021, Amazon Web Services, Inc. or its Affiliates.
Benefits of a Data Lake – Storage vs Compute
Separating your storage and compute
allows you to scale each component as
required
“How can I scale up with the
volume of data being generated?”
© 2021, Amazon Web Services, Inc. or its Affiliates.
Benefits of a Data Lake – Schema on Read
“Is there a way I can apply multiple
analytics and processing frameworks
to the same data?”
A Data Lake enables ad-hoc
analysis by applying schemas
on read, not write.
© 2021, Amazon Web Services, Inc. or its Affiliates.
Load into Downstream Services
AURORAAmazon Redshift
Amazon DynamoDB
Amazon Aurora
Amazon Elasticsearch
Run complex analytic queries against petabytes of structured data
A NoSQL database service that
delivers consistent, single-digit millisecond latency at any scale.
A MySQL and PostgreSQL compatible relational database built for the cloud
Delivers Elasticsearch’s real-time analytics
capabilities alongside the availability,
scalability, and security that production workloads require.
Amazon SageMakerfully managed service that provides
every developer and data scientist with
the ability to build, train, and deploy
machine learning (ML) models quickly