JustGiving | Serverless Data Pipelines, API, Messaging and Stream Processing

40
JustGiving – Serverless Data Pipelines, API, Messaging and Stream Processing Richard Freeman, Ph.D. Lead Data Engineer and Architect 23 rd March 2017 RAVEN data science platform in AWS

Transcript of JustGiving | Serverless Data Pipelines, API, Messaging and Stream Processing

Page 1: JustGiving | Serverless Data Pipelines, API, Messaging and Stream Processing

JustGiving – Serverless Data Pipelines, API, Messaging and Stream ProcessingRichard Freeman, Ph.D. Lead Data Engineer and Architect

23rd March 2017

RAVEN data science platform in AWS

Page 2: JustGiving | Serverless Data Pipelines, API, Messaging and Stream Processing

What to Expect from the Session

• Recap of some AWS services

• Event-driven data platform at JustGiving

• Serverless computing

• Six serverless patterns

• Serverless recommendations and best practices

Page 3: JustGiving | Serverless Data Pipelines, API, Messaging and Stream Processing

Recap of Some AWS Services

Page 4: JustGiving | Serverless Data Pipelines, API, Messaging and Stream Processing

Managed Services

Amazon S3

• Distributed web scale object storage

• Highly scalable, reliable, and secure

• Pay only for what you use

• Supports encryption

AWS Lambda

• Runs code in response to triggers

• Amazon API Gateway requests

• Serverless

• Pay only when code runs

• Automatically scales and high availability

Amazon Simple Notification Service (SNS)

• Set up, operate, and send notifications

• Publish messages from an application and immediately deliver them to subscribers or other applications

• Push messages to mobile devices

Amazon DynamoDB

• Fast, fully-managed NoSQL Database Service

• Capable of handling any amount of data

• Durable and Highly Available

• SSD storage

• Cost Effective

Page 5: JustGiving | Serverless Data Pipelines, API, Messaging and Stream Processing

Amazon KinesisAmazon Kinesis Firehose

Easily load massive volumes of streaming data into S3, Amazon Redshift, and Amazon Elasticsearch Service

Amazon Kinesis Analytics

Easily analyze streaming data with standard SQL

Amazon Kinesis Streams

Build your own custom applications that process or analyze streaming data. Scales out using shards

Page 6: JustGiving | Serverless Data Pipelines, API, Messaging and Stream Processing

Event-Driven Data Platform at JustGiving

Page 7: JustGiving | Serverless Data Pipelines, API, Messaging and Stream Processing

We are

A tech-for-good platform for events based fundraising, charities and crowdfunding

“Ensure no good cause goes unfunded”

• The #1 platform for online social giving in the world

• Raised $4.2bn in donations• 28.5m users • 196 countries • 27,000 good causes• Peaks in traffic: Ice bucket,

natural disasters• GiveGraph

• 91 million nodes • 0.53 billion relationships

Page 8: JustGiving | Serverless Data Pipelines, API, Messaging and Stream Processing

Fundraising Page

Page 9: JustGiving | Serverless Data Pipelines, API, Messaging and Stream Processing

Our Requirements

• Limitation in existing SQL Server data warehouse• Long running and complex queries for data scientists • New data sources: API, clickstream, unstructured, log, behavioural

data etc.

• Easy to add data sources and pipelines

• Reduce time spent on data preparation and experiments

Machine learning

Graph processing

Natural language processing

Stream processing

Data ingestion

Data preparation

Automated Pipelines

Insight

Predictions

Metrics

Inspire

Data-driven

Complex Impact

Page 10: JustGiving | Serverless Data Pipelines, API, Messaging and Stream Processing

Event Driven Data Platform at JustGiving [1 of 2]

• JustGiving developed in-house analytics and data science platform in AWS called RAVEN.

• Reporting, Analytics, Visualization, Experimental, Networks

• Uses event-driven and serverless pipelines rather than directed acyclic graphs (DAGs) or Workflows

• Messaging, queues, pub/sub patterns• Separate storage from compute

• Supports scalable event driven• ETL / ELT• Machine learning• Natural language processing• Graph processing

• Allows users to consume raw tables, data blocks, metrics, KPIs, insight, reports, etc.

Page 11: JustGiving | Serverless Data Pipelines, API, Messaging and Stream Processing

Event Driven Data Platform at JustGiving [2 of 2]

Page 12: JustGiving | Serverless Data Pipelines, API, Messaging and Stream Processing

Event Driven ETL and Data Pipeline Patterns [1 of 2]

Page 13: JustGiving | Serverless Data Pipelines, API, Messaging and Stream Processing

Event Driven ETL and Data Pipeline Patterns [2 of 2]

Pros

• Decouple the L in ETL / ELT

• Adds lots of flexibility to loading

• Supports incremental and multi-cluster loads

• Simple reload on failure, e.g., send an Amazon SNS/Amazon SQS message

• Basic sequencing can be done within message

Cons:

• Does not support complex loading workflows

• No native FIFO support in SQS (support added Nov 17, 2016)

Page 14: JustGiving | Serverless Data Pipelines, API, Messaging and Stream Processing

Serverless Computing

Page 15: JustGiving | Serverless Data Pipelines, API, Messaging and Stream Processing

AWS Lambda FunctionsEVENT SOURCE FUNCTION SERVICES (ANYTHING)

Data stores & streaming event sourcese.g. Amazon S3, Amazon DynamoDB, Amazon Kinesis

Requests to endpointse.g. Amazon Alexa Skills, Amazon API Gateway

Changes in repositories and logse.g. AWS Code Commit, Amazon CloudWatch

Node.jsPythonJavaC#

Event & Message servicese.g. Amazon SNS, Cron events

Page 16: JustGiving | Serverless Data Pipelines, API, Messaging and Stream Processing

Writing Lambda Functions

• The Basics• AWS SDK, IAM Roles and Policies• Python, Node.js, Java8, and C#• Lambda handles inbound traffic and integration

• Stateless• Use S3, DynamoDB, or other Internet storage for persistent data• 128Mb and 1.5GB RAM• Duration 100ms to 5min

• Runs your code in response to events or requests• Pay for execution time in 100ms increments

• Familiar• Use bash, processes, threads, /tmp, sockets, …• Bring your own libraries, even native ones

Page 17: JustGiving | Serverless Data Pipelines, API, Messaging and Stream Processing

AWS Lambda, container or EC2 instance?

Amazon EC2 Instance Containers AWS Lambda

Infrastructure You configure, maintain and needs be built for HA

You configure, maintain and needs be built for HA

Request-driven and AWS manages

Instance Flexibility

Choose instance type, OS, etc.

Choose instance type, OS, etc.

Prioritises ease of use - one OS, default hardware, no maintenance or scheduled downtime

Scaling Provisioning instances andconfigure auto-scaling

Provisioning containers andconfigure auto-scaling

Implicit, based on requests

Launch Minutes Seconds Milliseconds to few seconds

State State or stateless State or stateless Stateless

AWS Integration Custom Custom Built-in triggers SNS, Kinesis Stream, S3, API Gateway etc.

Cost EC2 per hour and spot Containers on EC2 clusters per hour and spot

Per 100ms, invocation and RAM

Page 18: JustGiving | Serverless Data Pipelines, API, Messaging and Stream Processing

Six Serverless Patterns

Page 19: JustGiving | Serverless Data Pipelines, API, Messaging and Stream Processing

Pattern 1 – Serverless Message Bus [1 of 2]

Page 20: JustGiving | Serverless Data Pipelines, API, Messaging and Stream Processing

Pattern 1 – Serverless Message Bus [2 of 2]

Pros• Simple serverless and highly available service bus• Centralised logic for transformation and routing• Kinesis supports more than one Lambda per message• Kinesis streams support replayCons:• SNS - Scalability one lambda per SNS message• Lambda limits: 500 MB disk, 1.5GB RAM, complete within

5min

Page 21: JustGiving | Serverless Data Pipelines, API, Messaging and Stream Processing

Pattern 2 – Serverless Steaming Data Capture API [1 of 2]

Page 22: JustGiving | Serverless Data Pipelines, API, Messaging and Stream Processing

Pattern 2 – Serverless Steaming Data Capture API [2 of 2]

Pros:• Serverless real-time stream capture pipeline• Managed API configuration and deployment• Secure - IAM Roles• Simple transformation• Cost effective – pay for what you useCons:• Complex transformation• API Gateway and Kinesis soft limits need to be aligned

Page 23: JustGiving | Serverless Data Pipelines, API, Messaging and Stream Processing

Pattern 3 - Serverless Small Data Pipeline [1 of 2]

Page 24: JustGiving | Serverless Data Pipelines, API, Messaging and Stream Processing

Pattern 3 - Serverless Small Data Pipeline [2 of 2]

Pros:• Batch and incremental possible• Useful for small and infrequently added files• Can preserve file name• Can be used for projection, enrichment, parsing, etc.Cons:• Lambda limits: 500 MB disk, 1.5GB RAM, complete within 5min• Complex joins• One file on input and output

Page 25: JustGiving | Serverless Data Pipelines, API, Messaging and Stream Processing

Pattern 4 - Serverless Streamify File and Merge into Larger Files [1 of 2]

Page 26: JustGiving | Serverless Data Pipelines, API, Messaging and Stream Processing

Pattern 4 - Serverless Streamify File and Merge into Larger Files [2 of 2]

Pros:

• Batch and incremental possible

• Useful for small and frequently added files

• Lambda and Amazon Kinesis Firehose - Serverless way to persist a stream

Cons:

• Lambda limits: 500 MB disk, 1.5GB RAM, complete within 5min

• Firehose limits: 15 min, 128 MB buffer

• Complex joins

Page 27: JustGiving | Serverless Data Pipelines, API, Messaging and Stream Processing

Pattern 5 - Serverless Streaming Analytics and Persist Stream [1 of 2]

Page 28: JustGiving | Serverless Data Pipelines, API, Messaging and Stream Processing

Pattern 5 - Serverless Streaming Analytics and Persist Stream [2 of 2]

Pros:

• Stream processing without a running cluster

• Lambda and Amazon Kinesis Analytics automatic scaling

• A static website can access DynamoDB metrics

• Choice of language: Python, SQL, Node.js, Java8, and C#

• Code can be changed without interruption

Cons:

• Lambda limits: 500 MB disk, 1.5GB RAM, complete within 5min

• Firehose limits: 15 min, 128 MB buffer

• Complex stream joins

Page 29: JustGiving | Serverless Data Pipelines, API, Messaging and Stream Processing

Pattern 6 – Serverless Data and Process API [1 of 2]

Page 30: JustGiving | Serverless Data Pipelines, API, Messaging and Stream Processing

Pattern 6 – Serverless Data and Process API [2 of 2]

Pros:

• Serverless data API

• DynamoDB uses IAM roles

• Choice of language: Python, Node.js, C#, and Java8

• Code can be changed without interruption

Cons:

• RDS needs encrypted user/pass and VPC

• Lambda limits: 500 MB disk, 1.5GB RAM, complete within 5min

• Complex or compute intensive queries

Page 31: JustGiving | Serverless Data Pipelines, API, Messaging and Stream Processing

Serverless Recommendations and Best Practices

Page 32: JustGiving | Serverless Data Pipelines, API, Messaging and Stream Processing

Lambda Functions Are Good For…• REST API micro-services backend• Rows & events

• Parsing, projecting, enriching, filtering etc.

• REST and other AWS integration• Built in triggers API Gateway, S3, SNS,

CloudWatch, Kinesis Streams etc.

• Secure• IAM Roles, VPC and KMS

• Highly Available and auto-scaling• Simple packaging and deployment• For streams, easily modify code

Page 33: JustGiving | Serverless Data Pipelines, API, Messaging and Stream Processing

Serverless Data Processing Patterns

Pattern Serverless Spark on EMR when …

Pattern 3 - Serverlesssmall data pipeline

• Small and infrequently added files• Preserves existing file names

• Big files > 400MB• Joining data sources• Merge small files into one

Pattern 4 - Serverlessstreamify file and merge into larger files

• Merge small frequently added files into a stream or larger ones in S3

• Streaming analytics and time series

• Big files > 400MB• Joining streams or data

sources

Pattern 5 - Serverlessanalytics and persist stream

• Streaming analytics and time series• Real-time dashboard• Persist an Amazon Kinesis stream

• Joining streams or data sources

Page 34: JustGiving | Serverless Data Pipelines, API, Messaging and Stream Processing

Understand the Lambda limits

• Memory, local disk, execution time

• Limited Tooling – Improving

• Throttle – review the architecture and exception DLQ

• Concurrent Lambda functions execution • AWS account• DynamoDB and Amazon Kinesis shard iterators

• Complex joins harder to implement – better in Spark on EMR

• ORC and Parquet – better in Spark on EMR

Page 35: JustGiving | Serverless Data Pipelines, API, Messaging and Stream Processing

Lambda Recommendations

• Test with production volumes• Think deployment and version control

• Frameworks Serverless, Chalice etc.• New

• AWS Serverless Application Model (SAM)• Environment variables• Dead Letter Queue

• Program defensivelyExceptions, e.g. invalid records

• Reduce execution time • Minimize logging • Optimize code, e.g. aggregate in memory to reduce writes

Page 36: JustGiving | Serverless Data Pipelines, API, Messaging and Stream Processing

1-2-3 of Lambda

1. Create IAM Role• Access resources, e.g. DynamoDB, Kinesis

Streams etc.• Lambda execution

2. Create Lambda Function • Configuration and function code• Test manually

3. Deploy package• Enable event Trigger • Monitor

First 1 million requests per

month are free !

Page 37: JustGiving | Serverless Data Pipelines, API, Messaging and Stream Processing

Other Serverless Example Use CasesThumbnail Generation

• http://docs.aws.amazon.com/lambda/latest/dg/with-s3-example.html

Mobile Application Backend• http://docs.aws.amazon.com/lambda/latest/dg/with-on-demand-custom-android-example.html

Authorizing Access Through a Proxy Resource• https://aws.amazon.com/blogs/compute/authorizing-access-through-a-proxy-resource-to-amazon-api-gateway-and-aws-lambda-

using-amazon-cognito-user-pools/

Serverless IoT Back Ends• https://www.slideshare.net/AmazonWebServices/aws-reinvent-2016-serverless-iot-back-ends-iot401

Creating Voice Experiences with Alexa Skills• https://www.slideshare.net/AmazonWebServices/aws-reinvent-2016-workshop-creating-voice-experiences-with-alexa-skills-from-

idea-to-testing-in-two-hours-alx203

Lambda with AWS CloudTrail• http://docs.aws.amazon.com/lambda/latest/dg/with-cloudtrail.html

Slack Integration for AWS Lambda• https://aws.amazon.com/blogs/aws/new-slack-integration-blueprints-for-aws-lambda/

Dynamic GitHub Actions with AWS Lambda• https://aws.amazon.com/blogs/compute/dynamic-github-actions-with-aws-lambda/

Lambda with Scheduled Events• http://docs.aws.amazon.com/lambda/latest/dg/with-scheduled-events.html

Page 38: JustGiving | Serverless Data Pipelines, API, Messaging and Stream Processing

Find Out More

Serverless Cross Account Stream Replication Using AWS Lambda, Amazon DynamoDB, and Amazon Kinesis Firehose (AWS Compute Blog)

• https://aws.amazon.com/blogs/compute/serverless-cross-account-stream-replication-using-aws-lambda-amazon-dynamodb-and-amazon-kinesis-firehose

Analyze a Time Series in Real Time with AWS Lambda, Amazon Kinesis and Amazon DynamoDB Streams (AWS Big Data Blog)

• https://blogs.aws.amazon.com/bigdata/post/Tx148NMGPIJ6F6F/Analyze-a-Time-Series-in-Real-Time-with-AWS-Lambda-Amazon-Kinesis-and-Amazon-Dyn

Serverless Dynamic Real-Time Dashboard with AWS DynamoDB, S3 and Cognito (Medium.com)• https://medium.com/@rfreeman/serverless-dynamic-real-time-dashboard-with-aws-dynamodb-a1a7f8d3bc01

JustGiving Public Repository• https://github.com/JustGiving

Serverless at re:Invent 2016 – Wrap-up (AWS Compute Blog)• https://aws.amazon.com/blogs/compute/serverless-at-reinvent-2016-wrap-up/

Page 39: JustGiving | Serverless Data Pipelines, API, Messaging and Stream Processing

Summary• Challenges of existing big data pipelines

• Event-driven ETL and serverless pipelines at JustGiving

• Serverless computing

• Six patterns for scalable data pipelines1. Serverless message bus2. Serverless steaming data capture API 3. Serverless small data pipeline4. Serverless streamify file and merge into larger files5. Serverless streaming analytics and persist stream6. Serverless data and process API

• Recommendations• Using AWS managed services• Understanding Lambda

Page 40: JustGiving | Serverless Data Pipelines, API, Messaging and Stream Processing

Thank You!“Ensure no good cause goes unfunded”

Contact

https://linkedin.com/in/drfreeman