Immersion Day Building a Data Lake on AWS

© 2021, Amazon Web Services, Inc. or its Affiliates.

Jason Moldan, Solutions Architect

Norman Owens, Solutions Architect

Arun Shanmugam, Data Architect

Sreeji Gopal, Data Architect

Matt Atwater, Principle Architect, Clear Scale

Immersion Day – Building a Data

Lake on AWS


Workshop Agenda

1. Overview of the Data Lake

2. Hydrating the Data Lake

Lab1 – Hydrating the Data Lake with DMS

3. Working Within the Data Lake

Lab2 – ETL with AWS Glue

Lab3 - Querying the Data Lake with Amazon Athena and

Amazon QuickSight

4. Consuming the Data Lake – Reporting, Analytics, Machine

Learning

5. Introduction to DataBrew


• A centralized repository for both

structured and unstructured data

• Store data as-is in open-source file

formats to enable direct analytics

What is a Data Lake?


Why a Data Lake?

• Decouple storage from compute,

allowing you to scale

• Enable advanced analytics across all of

your data sources

• Reduce complexity in ETL and

operational overhead

• Future extensibility as new database and

analytics technologies are invented


Traditionally, Analytics Looked Like This

OLTP ERP CRM LOB

Data Warehouse

Business Intelligence

TBs-PBs Scale

Schema Defined Prior to Data Load

Operational and Ad Hoc Reporting

Large Initial Capex + $$K / TB/ Year

Relational Data


Data Lakes Extend the Traditional Approach

OLTP ERP CRM LOB

Catalog

DW

Queries

Big Data

ProcessingInteractive Real-Time

Web Sensors SocialDevices

Business Intelligence Machine Learning TB-EBs Scale

All Data in one place, a Single Source of Truth

Relational and Non-Relational Data

Decouples (low cost) Storage and Compute

Schema on Read

Diverse Analytical Engines

Data Lake

100110000100101011100

101010111001010100001

011111011010001111001

0110010110

0100011000010


A m a z o n S 3

A m a z o n G l a c i e r

A W S G l u e

Store Data in the Format You WantOpen and comprehensive

• Store data in the format you want:

• Text files like CSV

• Columnar like Apache Parquet, and Apache ORC

• Logstash like Grok

• JSON (simple, nested), AVRO

• And more…

CSV

ORC

Grok

Avro

Parquet

JSON


Any ScaleScalable and durable

• S3 has trillions of objects and exabytes of data

• Built to store any amount of data

• Run analytic engines at largest scale by spinning

up any amount of compute resources in minutes

• Runs on the world’s largest global

cloud infrastructure


Designed for 11 9s

of durability

Designed for

99.99% availability

Durable Available High performance Multiple upload

Range GET

Store as much as you need

Scale storage and compute

independently

No minimum usage commitments

Scalable

Amazon EMR

Amazon Redshift

Amazon DynamoDB

Amazon SageMaker

Many more

Integrated

Simple REST API

AWS SDKs

Read-after-create consistency

Event notification

Lifecycle policies

Easy to use

Why Amazon S3 for a Data Lake?


Pay Only for the Resources You Use as you ScaleLowest Cost

• Pay-as-you-go for the resources you consume

• As low as $0.05/GB scanned with Athena

• EMR and Athena can automatically scale down

resources after job completes, saving you costs

• Commit to a set term and save up to 75% with

Reserved Instance

• Run on spare compute capacity with EMR and

save up to 90% with Spot

Traditional approach leads to wasted capacity

Traditional: Rigid

AWS: Elastic

Capacity

Demand

Demand

Servers

Unmet demand

upset players

missed revenue

Excess capacity

wasted $$$

AWS approach: pay for the capacity you use


Lowest Total Cost of Ownership (TCO)Cost-effective

• Less admin time to

manage, and support

• No up-front costs—

hardware acquisition,

installation

• Save on operating

costs—data center space,

power, cooling

• Business value: cost of

delays, risk premium,

competitive abilities,

governance, etc.

• Less admin time to manage,

and support

• No up-front costs—hardware

acquisition, installation

• Save on operating

costs—data center space,

power, cooling

• Business value: cost of delays,

risk premium, competitive

abilities, governance, etc.


Typical steps for building a data lake

Implementing a Data Lake architecture requires a broad set of tools and

technologies to serve an increasingly diverse set of applications and use

cases.

Set up storage1

Move data2 Cleanse, prep, and catalog data

3 Configure and enforce security and compliance policies

4 Make data available for analytics

5

Processing & Analytics

Real-time Analytics

AI & Predictive

BI & Data Visualization

Transactional & RDBMS

AWS LambdaApache Storm on

EMR

Apache Flink

on EMRSpark Streaming

on EMR

Elasticsearch

ServiceKinesis Data Analytics,

Kinesis Data Streams

DynamoDB

NoSQL DB Relational Database

Aurora

EMR

Hadoop, Spark,

Presto

Redshift

Data Warehouse

Athena

Query Service

Amazon Lex

Speech

recognition

Amazon

Rekognition

Amazon Polly

Text to speech

Machine Learning

Predictive analytics

SageMaker


Data Lake on AWS

Data Ingestion

AWS

SnowballAWS Storage

Gateway

Amazon

Kinesis Data

Firehose

AWS Direct

Connect

AWS Database

Migration

Service

TempSST SVT

S3

Central Storage

Scalable, secure, cost-effective ETL Enrich


Data Lake on AWS

Data Ingestion

AWS

SnowballAWS Storage

Gateway

Amazon

Kinesis Data

Firehose

AWS Direct

Connect

AWS Database

Migration

Service

S3

Central Storage

Scalable, secure, cost-effective

Catalog & Search

Amazon

DynamoDBAmazon Elasticsearch

Service

AWS

Glue

Analytics & Serving

Amazon

Athena

Amazon

EMRAWS

Glue

Amazon

Redshift

Amazon

DynamoDB

Amazon

QuickSight

Amazon

Kinesis

Amazon

Elasticsearch

Service

Amazon

NeptuneAmazon

RDS

Access & User Interfaces

AWS

AppSync

Amazon

API Gateway

Amazon

Cognito

AWS

KMSAWS

CloudTrail

Manage & Secure

AWS

IAM

Amazon

CloudWatch


Data Lake on AWS

Access & User Interfaces

AWS

AppSync

Amazon

API Gateway

Amazon

Cognito

AWS

KMSAWS

CloudTrail

Manage & Secure

AWS

IAM

Amazon

CloudWatch

Data Ingestion

AWS

SnowballAWS Storage

Gateway

Amazon

Kinesis Data

Firehose

AWS Direct

Connect

AWS Database

Migration

Service

Analytics & Serving

Amazon

Athena

Amazon

EMRAWS

Glue

Amazon

Redshift

Amazon

DynamoDB

Amazon

QuickSight

Amazon

Kinesis

Amazon

Elasticsearch

Service

Amazon

NeptuneAmazon

RDS

S3

Central Storage

Scalable, secure, cost-effective

Catalog & Search

Amazon

DynamoDBAmazon Elasticsearch

Service

AWS

Glue


Data Lake Architectureson AWS

S CALAB LE D ATA LAK ES

P U R P O S E - B U I LT

D ATA S E R V I C E S

S EAM LE S S

D ATA M O V EM EN T

U N I FI E D G O V E R N AN CE

P E R FO R M AN T AN D

C O S T- EFFECTI V E

Amazon

DynamoDB

Amazon

SageMaker

Amazon

Redshift

Amazon

Elasticsearch

Service

Amazon

EMR

AmazonS3

Amazon

Aurora

AmazonAthena

Non-

relational

databases

Machine

learning

Data

warehousing

Log

analytics

Big data

processing

Relational

databases


Benefits of a Data Lake – All Data in One Place

Store and analyze all of your data,

from all of your sources, in one

centralized location.

“Why is the data distributed in

many locations? Where is the

single source of truth ?”


Benefits of a Data Lake – Quick Ingest

Quickly ingest data

without needing to force it into a

pre-defined schema.

“How can I collect data quickly

from various sources and store

it efficiently?”


Benefits of a Data Lake – Storage vs Compute

Separating your storage and compute

allows you to scale each component as

required

“How can I scale up with the

volume of data being generated?”


Benefits of a Data Lake – Schema on Read

“Is there a way I can apply multiple

analytics and processing frameworks

to the same data?”

A Data Lake enables ad-hoc

analysis by applying schemas

on read, not write.


What can you do with a Data Lake?


Query Directly with Amazon Athena


Analyze with Hadoop on Amazon EMR


Create Visualizations with Amazon QuickSight


Train ML Models with Amazon SageMaker


Create a Central Data Catalog with AWS Glue


Load into Downstream Services

AURORAAmazon Redshift

Amazon DynamoDB

Amazon Aurora

Amazon Elasticsearch

Run complex analytic queries against petabytes of structured data

A NoSQL database service that

delivers consistent, single-digit millisecond latency at any scale.

A MySQL and PostgreSQL compatible relational database built for the cloud

Delivers Elasticsearch’s real-time analytics

capabilities alongside the availability,

scalability, and security that production workloads require.

Amazon SageMakerfully managed service that provides

every developer and data scientist with

the ability to build, train, and deploy

machine learning (ML) models quickly


Thanks! You

Immersion Day Building a Data Lake on AWS

Documents

Transcript of Immersion Day Building a Data Lake on AWS