Post on 20-May-2020
© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Eric Johnson
Senior Developer Advocate - Serverless
AWS
@edjgeek
Big “Serverless” Data
Powering Big Data with Serverless
Background Image by Эдуард Ризванов from Pixabay
Who am I?
• Sr. Developer Advocate – Serverless, AWS
• Serverless / Tooling / Automation Geek
• Software Architect / Solutions Architect
• Husband to Brigitte
• Father to Noah, Jake, Owen, Sophie Anne, & Gracie Mae
• Music lover
• Pizza / Diet Dr. Pepper fanatic
Why are we here?
Serverless in big data processing
Amazon Kinesis Video Streams
Amazon Kinesis Data Streams
Amazon Kinesis Data Firehose
Amazon Kinesis Data Analytics
Amazon Athena
AWS Lambda
Amazon Simple Storage Service
Amazon DynamoDB
Understanding the role Serverless plays in Big Data
Agenda
Ingestion
Real-time processing
Real-time analytics
Post processing
What is serverless?
No infrastructure provisioning, no management
Automatic scaling
Pay for value
Highly available and secure
Ingestion
Ingesting data at scale
Video ingestion: Amazon Kinesis Video Streams
Data ingestion: Amazon Kinesis Data Streams, Amazon Kinesis Data Firehose
Video ingestion
• Fully managed infrastructure that scales to load
• Offers SDKs in C++ and Java
• Supports live and on-demand playback of streams
• Durable storage using Amazon S3
• Works with many forms of time-encoded data
• Supports multiple timecode-based formats
Amazon Kinesis Video Streams
Data ingestion – Kinesis Data Streams
• Uses shards to scale
• 1 MB/second or 1,000 records/second ingress per shard
• 2 MB/second/shard egress
• Works with Kinesis Data Analytics
• Can support connected consumers for enhanced fan-out
• Can store data up to 168 hours (7 days)
Amazon Kinesis Data Streams
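The per-shard limits above translate directly into a sizing calculation. The sketch below is one way to estimate a shard count from them; the function name and the sample workload are hypothetical, and it assumes traffic is spread evenly across shards.

```python
import math

# Per-shard limits quoted on the slide above.
INGRESS_MB_PER_SHARD = 1.0
INGRESS_RECORDS_PER_SHARD = 1000
EGRESS_MB_PER_SHARD = 2.0


def shards_needed(ingress_mb_per_sec, records_per_sec, egress_mb_per_sec):
    """Return the smallest shard count that satisfies all three limits."""
    by_ingress_bytes = math.ceil(ingress_mb_per_sec / INGRESS_MB_PER_SHARD)
    by_ingress_records = math.ceil(records_per_sec / INGRESS_RECORDS_PER_SHARD)
    by_egress = math.ceil(egress_mb_per_sec / EGRESS_MB_PER_SHARD)
    # A stream always has at least one shard.
    return max(1, by_ingress_bytes, by_ingress_records, by_egress)


if __name__ == "__main__":
    # Hypothetical workload: 3.5 MB/s in, 2,500 records/s in, 9 MB/s read out.
    print(shards_needed(3.5, 2500, 9.0))
```

Whichever limit is hit first drives the shard count, so a read-heavy workload can need more shards than its write volume alone suggests.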
Data ingestion – Kinesis Firehose
• Auto-scales to meet load
• Different regions have different capacity
• US East: 5,000 records/second, 2,000 transactions/second, and 5 MiB/second
• Works with Kinesis Data Analytics
• Can transform data before delivery to target
• Stores data up to 24 hours on failed delivery
Amazon Kinesis Firehose
Data ingestion – Kinesis Firehose
Data Sources
• Firehose PUT APIs
• Amazon Kinesis Agent
• AWS IoT
• CloudWatch Logs
• CloudWatch Events
Targets
• Amazon S3
• Amazon Redshift
• Amazon Elasticsearch Service
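As an illustration of the Firehose PUT APIs listed as a data source, here is a minimal sketch that sends records with `put_record_batch`. The helper names and the newline-delimited JSON encoding are assumptions for the example, and the client is passed in (for instance `boto3.client("firehose")`) so the batching logic can be exercised without AWS access. It splits on the 500-records-per-call limit only and does not check the per-call size limit.

```python
import json

# PutRecordBatch accepts at most 500 records per call.
MAX_RECORDS_PER_BATCH = 500


def to_firehose_records(items):
    """Serialize each dict as newline-delimited JSON bytes for Firehose."""
    return [{"Data": (json.dumps(item) + "\n").encode("utf-8")} for item in items]


def send_in_batches(firehose_client, stream_name, items):
    """Send items with put_record_batch; return the number of failed records."""
    records = to_firehose_records(items)
    failed = 0
    for start in range(0, len(records), MAX_RECORDS_PER_BATCH):
        batch = records[start:start + MAX_RECORDS_PER_BATCH]
        response = firehose_client.put_record_batch(
            DeliveryStreamName=stream_name,
            Records=batch,
        )
        # Firehose reports partial failures in the response instead of raising.
        failed += response.get("FailedPutCount", 0)
    return failed
```

A non-zero return value means some records were rejected and would need to be retried by the caller.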
Kinesis Data Streams vs. Kinesis Firehose
[Diagram: data producers sending records into an Amazon Kinesis Data Stream, compared with data producers sending records into Amazon Kinesis Data Firehose]
Kinesis Firehose
[Diagram: data producers sending records into Amazon Kinesis Data Firehose]
Use Kinesis Firehose when you need:
• Ability to transform data in the stream
• Auto scaling for unpredictable load
• Multiple targets for final data
Kinesis Data Streams
[Diagram: data producers sending records into an Amazon Kinesis Data Stream]
Use Kinesis Data Streams when:
• You have semi-predictable traffic
• You need to perform real-time action on data in the stream
Real-time processing
Kinesis Data Stream + Lambda
[Diagram components: Data Producers, AWS IoT Core, Amazon Kinesis Data Stream, Lambda functions, Amazon DynamoDB]
• The Lambda service handles intermittent polling via the GetRecords API
• All applications share 2 MB/second/shard egress
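Because the Lambda service does the polling, the function itself only has to unpack the batch it is handed. The sketch below shows one way to decode a Kinesis event in a Python handler; it assumes the producers send JSON, and the handler body is a hypothetical placeholder for the DynamoDB write shown in the diagram.

```python
import base64
import json


def decode_kinesis_records(event):
    """Return the JSON payloads carried by a Kinesis event delivered to Lambda."""
    payloads = []
    for record in event.get("Records", []):
        # Kinesis record data arrives base64-encoded.
        raw = base64.b64decode(record["kinesis"]["data"])
        payloads.append(json.loads(raw))
    return payloads


def handler(event, context):
    """Hypothetical Lambda entry point: count the readings in each batch."""
    readings = decode_kinesis_records(event)
    # A real function would write the readings to DynamoDB here.
    return {"processed": len(readings)}
```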
Kinesis Data Stream + Enhanced Fan-out + Lambda
[Diagram components: Data Producers, AWS IoT Core, Amazon Kinesis Data Stream, Lambda functions, Amazon DynamoDB]
• Functions triggered by consumers
• Each consumer gets its own 2 MB/second/shard egress
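The difference between shared and per-consumer egress is easiest to see as arithmetic. This is a rough sketch that assumes standard consumers split the shared limit evenly, which is an idealization; the function name is hypothetical.

```python
EGRESS_MB_PER_SHARD = 2.0  # per-shard read limit quoted on the slides


def per_consumer_egress(shards, consumers, enhanced_fan_out):
    """Return the read throughput (MB/second) available to each consumer."""
    total = shards * EGRESS_MB_PER_SHARD
    if enhanced_fan_out:
        # Each registered consumer gets the full per-shard limit.
        return total
    # Standard consumers share the per-shard limit between them.
    return total / consumers
```

With 4 shards and 4 consumers, each standard consumer is left with 2 MB/second in total, while each enhanced fan-out consumer can read 8 MB/second.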
Video Processing
[Diagram components: Amazon Kinesis Video Streams, Amazon Rekognition Video, Amazon SageMaker, S3 bucket]
• Real-time analysis and machine learning
• HLS-compatible live or on-demand playback
• Near real-time processing
Real-time analytics
Amazon Kinesis Data Analytics
• Built-in functions to filter, aggregate, and transform streaming data
• Processes streaming data with sub-second latencies
• Build SQL queries that perform joins, aggregations over time windows, and filters
• Includes open source libraries based on Apache Flink that enable you to build an application in hours instead of months
Real-time analytics
Amazon Kinesis Data Stream
Amazon Kinesis Data Firehose
Amazon Kinesis Data Analytics
Stream source can be a Kinesis Data Stream or Firehose
Inside Kinesis Data Analytics
Stream data

-- Create Fail Stream --
CREATE OR REPLACE STREAM "FAIL_STREAM" (
    sensorId INT,
    currentTemperature INT,
    status VARCHAR(10)
);
CREATE OR REPLACE PUMP "FAIL_STREAM_PUMP" AS
    INSERT INTO "FAIL_STREAM"
    SELECT "sensorId", "currentTemperature", "status"
    FROM "SOURCE_SQL_STREAM_001"
    WHERE "status" SIMILAR TO '%FAIL%';

-- Create Warn Stream --
CREATE OR REPLACE STREAM "WARN_STREAM" (
    sensorId INT,
    currentTemperature INT,
    status VARCHAR(10)
);
CREATE OR REPLACE PUMP "WARN_STREAM_PUMP" AS
    INSERT INTO "WARN_STREAM"
    SELECT "sensorId", "currentTemperature", "status"
    FROM "SOURCE_SQL_STREAM_001"
    WHERE "status" SIMILAR TO '%WARN%';

Output streams: FAIL_STREAM, WARN_STREAM
Use SQL or Apache Flink to filter data
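For readers less familiar with streaming SQL, the two pumps amount to a pair of independent filters on the status column. The sketch below mirrors that logic in plain Python on an in-memory list; it is an illustration of what the queries select, not how Kinesis Data Analytics runs them, and the function name is hypothetical.

```python
def route_by_status(readings):
    """Split sensor readings into FAIL and WARN lists, like the two pumps."""
    fail_stream, warn_stream = [], []
    for reading in readings:
        status = reading["status"]
        # SIMILAR TO '%FAIL%' matches any status containing "FAIL".
        if "FAIL" in status:
            fail_stream.append(reading)
        # The pumps are independent, so a record could land in both streams.
        if "WARN" in status:
            warn_stream.append(reading)
    return fail_stream, warn_stream
```

Readings whose status contains neither word are dropped from both outputs, which is why the raw data needs its own path if it is to be kept.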
Inside Kinesis Data Analytics
FAIL_STREAM feeds AWS Lambda:
• Alert
• Diagnose
• Remediate
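A function attached to FAIL_STREAM might look like the sketch below. It assumes the Lambda output contract for SQL applications as I understand it: records arrive base64-encoded under `records`, and the function answers with a per-record `Ok` or `DeliveryFailed` result. The `alert` helper is a hypothetical stand-in for whatever alerting, diagnosis, or remediation the team needs.

```python
import base64
import json


def alert(reading):
    """Hypothetical alert hook; a real one might page someone or open a ticket."""
    print(f"Sensor {reading.get('sensorId')} reported {reading.get('status')}")


def handler(event, context):
    """Acknowledge each FAIL_STREAM record after raising an alert for it."""
    results = []
    for record in event["records"]:
        try:
            reading = json.loads(base64.b64decode(record["data"]))
            alert(reading)
            results.append({"recordId": record["recordId"], "result": "Ok"})
        except Exception:
            # Report the record as failed so delivery can be retried.
            results.append({"recordId": record["recordId"], "result": "DeliveryFailed"})
    return {"records": results}
```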
Inside Kinesis Data Analytics
WARN_STREAM feeds an Amazon Kinesis Data Stream:
• Dashboards
• Consumer response
Real-time analytics
[Diagram components: Amazon Kinesis Data Stream, Amazon Kinesis Data Firehose, Amazon Kinesis Data Analytics, FAIL_STREAM, AWS Lambda, WARN_STREAM, Amazon Kinesis Data Stream]
What about the raw data?
Real-time analytics
[Diagram components: Amazon Kinesis Data Stream, Amazon Kinesis Data Firehose, Amazon Kinesis Data Analytics, FAIL_STREAM, AWS Lambda, WARN_STREAM, Amazon Kinesis Data Stream, Amazon Kinesis Data Firehose, Raw Data Archive]
Post processing
Serverless data storage
Amazon Simple Storage Service
Amazon DynamoDB
Amazon Timestream
Amazon Quantum Ledger Database
Amazon CloudWatch
Amazon Kinesis Data Firehose
Amazon Kinesis Data Streams
Amazon Kinesis Data Analytics
How you need to process your data determines where to store it
Serverless storage options
• Amazon Simple Storage Service: object storage; unstructured data
• Amazon DynamoDB: NoSQL; key-value or document data
• Amazon Timestream: time series database
• Amazon Quantum Ledger Database: immutable and transparent; cryptographically verifiable
• Amazon CloudWatch: structured data; alerting built in
Post processing – Serverless Tools
• Amazon Athena: Query S3 data with standard SQL expressions
• Amazon S3 Select: Retrieve subsets of object data, instead of the entire object
• AWS Glue: Extract, transform, and load (ETL) service that works across multiple services
AWS Glue
[Diagram: five S3 buckets and five DynamoDB tables]
Other non-serverless services
• MariaDB
• Microsoft SQL Server
• MySQL
• Oracle
• PostgreSQL
Critical data can be stored in many places
AWS Glue
[Diagram: S3 buckets, DynamoDB tables, and other non-serverless data stores, with a Crawler and a Data Catalog]
What the crawler is doing:
• Classifies data to determine the format, schema, and associated properties of the raw data
• Groups data into tables or partitions, based on crawler heuristics
• Writes metadata to the Data Catalog
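To make the crawler step concrete, the sketch below builds a crawler definition covering both S3 and DynamoDB sources and then starts it. The helper names, role, database, and paths are all hypothetical; the Glue client is passed in (for example `boto3.client("glue")`) so the definition can be inspected without AWS access.

```python
def crawler_definition(name, role_arn, database, s3_paths, dynamodb_tables):
    """Build the keyword arguments for Glue's create_crawler call."""
    return {
        "Name": name,
        "Role": role_arn,
        # Tables the crawler discovers are written to this catalog database.
        "DatabaseName": database,
        "Targets": {
            "S3Targets": [{"Path": path} for path in s3_paths],
            "DynamoDBTargets": [{"Path": table} for table in dynamodb_tables],
        },
    }


def create_and_start_crawler(glue_client, definition):
    """Register the crawler, then kick off its first run."""
    glue_client.create_crawler(**definition)
    glue_client.start_crawler(Name=definition["Name"])
```

Once the run finishes, the discovered schemas appear as tables in the named Data Catalog database.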
AWS Glue
[Diagram: S3 buckets, DynamoDB tables, and other non-serverless data stores, with a Crawler and a Data Catalog]
This catalog contains metadata about the data stores. How do I get the data itself in a meaningful way?
Enter: Amazon Athena
[Diagram components: S3 buckets, DynamoDB tables, Crawler, Data Catalog, Amazon Athena]
Other non-serverless services
• Amazon Aurora
• MariaDB
• Microsoft SQL Server
• MySQL
• Oracle
• PostgreSQL
Athena queries the Glue Data Catalog
Glue returns data from the data source
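Querying through Athena from code follows a start, poll, fetch pattern. The sketch below shows that flow with the Athena client passed in (for example `boto3.client("athena")`); the function name, database, and output location are hypothetical, and it reads only the first page of results.

```python
import time


def run_athena_query(athena_client, sql, database, output_location, poll_seconds=1.0):
    """Run a query against a catalog database and return its rows as lists."""
    started = athena_client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = started["QueryExecutionId"]

    # Athena runs queries asynchronously, so poll until a terminal state.
    while True:
        status = athena_client.get_query_execution(QueryExecutionId=query_id)
        state = status["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(poll_seconds)

    if state != "SUCCEEDED":
        raise RuntimeError(f"Athena query ended in state {state}")

    result = athena_client.get_query_results(QueryExecutionId=query_id)
    rows = result["ResultSet"]["Rows"]
    # For SELECT statements the first row holds the column names.
    return [[col.get("VarCharValue") for col in row["Data"]] for row in rows]
```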
Question
I have HUGE compressed CSV files stored on Amazon S3.
How do I get small bits of data without reading the entire file?
Enter: Amazon S3 Select
import boto3

s3 = boto3.client('s3')

# Run a SQL expression against a single CSV object; only matching rows come back.
r = s3.select_object_content(
    Bucket='jbarr-us-west-2',
    Key='sample-data/airportCodes.csv',
    ExpressionType='SQL',
    Expression="select * from s3object s where s.\"Country (Name)\" like '%United States%'",
    # Treat the first CSV row as column names so the expression can reference them.
    InputSerialization={'CSV': {"FileHeaderInfo": "Use"}},
    OutputSerialization={'CSV': {}},
)

# The response payload is an event stream of record chunks and statistics.
for event in r['Payload']:
    if 'Records' in event:
        records = event['Records']['Payload'].decode('utf-8')
        print(records)
    elif 'Stats' in event:
        statsDetails = event['Stats']['Details']
Before S3 Select
[Diagram: a Lambda function reads from a bucket; the entire file is returned]
After S3 Select
[Diagram: a Lambda function reads from a bucket; only the parsed value is returned]
Up to 400% faster and 80% cheaper
Questions?
Image Source: https://pixabay.com/illustrations/questions-font-who-what-how-why-2245264/
Eric Johnson
@edjgeek
Image Source: https://pixabay.com/illustrations/thank-you-polaroid-letters-2490552/