(ARC346) Scaling To 25 Billion Daily Requests Within 3 Months On AWS

© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Jagmeet Chawla, Chief Architect, The Weather Channel

Raul Frias, Solutions Architect, AWS

October 2015

Scaling to 25 Billion Daily Requests Within 3 Months

Building a Global Big Data Distribution Platform

ARC346


What to Expect from the Session

Building a Big Data Distribution Platform:

- Goals

- Architecture

- Logical and Physical Components

- Data Supply Chain, from Ingest to Distribution

- Journey

- Building, Tuning and Scaling the Platform

- AWS Insights

- Evolution of the Architecture

Audience:

- Engineering Leaders

- Architects

Video Introduction

Video Conclusion

Background: The Weather Company

We power weather for Apple, Facebook, Google, Microsoft, Twitter, Yahoo and many more

Our B2B Division, WSI, has 4,600+ B2B clients in 60 countries.

WHERE THE WORLD GETS ITS WEATHER

#1 MOST DISTRIBUTED Cable Network

170M+ App Downloads

47.2M Unduplicated Monthly Uniques

124M+ Monthly Unique

72% visit 2x or more Daily

Background: A Data Company

Data

Network of 100K+ weather sensors

Global Lightning Detection Network

Global Radar & Location Data

Largest Collection of Weather Data

State-of-the-Science Forecasts

Technologies

Industry Best Forecast Modeling

Proprietary Radar Algorithms

Proprietary Weather Analytics

220+ Fulltime Meteorologists

TWC Content (Video, Images, Articles)

Weather APIs, Content APIs

20+ TB Data Daily

800+ Sources of Ingest

40+ Billion API Requests Daily

Background: About Data

Weather Data

- Observations

- Forecasts

- Radar

- Alerts

- Notices

- Emergency Bulletins

- Health & Life Style

Content

- Articles

- Images

- Slide Shows

- Videos

- Maps

Domain Specific

- Aviation

- Energy

- Insurance

Background: Big Data

- Push/Pull, every 5 minutes

- Real Time Alerts & Notification

- World’s most volatile atmospheric data

- 15-20 sec. to prepare and serve

- 800+ Partners

- 50+ GB Raw compressed data

- Several Billion Requests / day

Big Data

Variety

Volume

Velocity

Textual data, structured, unstructured, binary data, pictures, images, videos

Background: About Distribution

Digital

- Weather.com, Wunderground.com

- Mobile Apps on all Major Mobile OS Platforms

Partnerships

- Major Mobile Phone Company

- Major Search Engine

- Many Others …

B2B

- Major Airlines

- Energy Trading Desks

- Many Others …

40+ Billion API Requests / day

Expect 60 Billion / day by EOY 2015


The Dark Ages: Before The Cloud

- Run From TWC Data Centers

- Slow Time To Market

- Product

- Content

- Limited Distributed Scaling

- Limits of our existing Data Centers

- Batch Based Forecast Systems

- Java Based Monolithic Applications

- Big Web, Mobile Web

- Data Services

- Homegrown CMS

Reboot & Reimagine: Goals

Business

- Build a Low Latency Global On Demand Forecasting System

- Build a Highly Scalable Global Data Distribution Platform

- Reboot Digital Properties (weather.com, Mobile Apps, CMS)

- Reduce time to deploy new data sets

- Data Distribution APIs as Product

- Secured/Metered access to APIs

- Consolidate Data Centers

Technical

- 100% cloud based

- Capable of handling billions of requests a day

- Capable of ingesting & processing Terabytes of data a day

- Low latency APIs (25-100 ms)

- Highly Scalable

- Highly Available (99.99%)

- Generic Data Processing Engine (DPE)

- Developer Friendly APIs

- Authentication, metering, and throttling
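To make the availability goal concrete, here is a small worked calculation. It assumes the "99.99" on the slide is a percentage measured over a calendar period; the function name is illustrative and not from the talk.

```python
def allowed_downtime_minutes(availability_pct: float, period_days: float = 365.0) -> float:
    """Minutes of downtime permitted over the period at the given availability."""
    unavailable_fraction = 1.0 - availability_pct / 100.0
    return unavailable_fraction * period_days * 24 * 60


# Assumption: 99.99 is a percentage ("four nines").
yearly = allowed_downtime_minutes(99.99)
monthly = allowed_downtime_minutes(99.99, period_days=30)
print(f"{yearly:.1f} minutes per year, {monthly:.1f} minutes per 30 days")
```

Under that reading, the budget is under an hour of downtime per year, which is why the later slides stress destructive testing of instances, AZs, and regions.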

How we did it: Architecture Blueprint

Architecture: Component Layers

- Large Undertaking – Divide & Conquer

- Loosely Coupled Layered Architecture

- Focus on your Core Competency

- Best Tool/Technology for the job

- Independent Delivery Timelines

- DATA PLATFORM: Weather Data Distribution As A Service

- Eat your own dog food!

[Diagram: component layers: Data Processing Engine, Data Services, Storage, Systems of Record, Gateway, CDN]

Architecture: Data Processing Engine (DPE)

- Generic DPE

- API Driven

- Data Agnostic

- Extensible

- Always on, Always flowing

- Asynchronous, Non Blocking

- High availability

- Low latency

- Horizontal scalability


Architecture: Data Processing Engine (DPE)

[Diagram: Push/Pull Data Providers, IAPI, RabbitMQ, DPE (DPE Core with Plugin 1, Plugin 2, Plugin 3), Redis, Riak, S3, RabbitMQ, System Of Record (e.g. Forecast On Demand)]

- DPE Architecture

  - DPE Core

  - Custom Plugins for Process, Download, Store, Archive

- Technical Stack

  - Java 1.7

  - Storage (Redis)

  - Archive (Riak, S3)

  - Distribution – RabbitMQ

  - OS: Amazon Linux (CentOS 6 variant)

- Ingestion API

  - RESTful Web Service

- Messaging Queue

  - RabbitMQ Cluster

- Workers

  - DPE
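The core-plus-plugins idea above can be sketched in a few lines. This is a minimal illustration, not the speakers' implementation: the talk's DPE is written in Java 1.7, while this sketch uses Python, and every name in it (`DPECore`, `register`, the plugin functions, the `obs:KATL` key) is hypothetical. An in-memory dict and list stand in for Redis and Riak/S3.

```python
from typing import Callable, Dict, List

Message = Dict[str, object]
Plugin = Callable[[Message], Message]


class DPECore:
    """Data-agnostic core: passes each message through the registered plugins in order."""

    def __init__(self) -> None:
        self._plugins: List[Plugin] = []

    def register(self, plugin: Plugin) -> None:
        self._plugins.append(plugin)

    def handle(self, message: Message) -> Message:
        for plugin in self._plugins:
            message = plugin(message)
        return message


store: Dict[str, object] = {}   # stand-in for Redis (hypothetical)
archive: List[Message] = []     # stand-in for Riak / S3 (hypothetical)


def process_plugin(msg: Message) -> Message:
    # Example transformation only: normalise the payload text.
    return {**msg, "payload": str(msg["payload"]).strip().upper()}


def store_plugin(msg: Message) -> Message:
    store[str(msg["key"])] = msg["payload"]
    return msg


def archive_plugin(msg: Message) -> Message:
    archive.append(dict(msg))
    return msg


core = DPECore()
for plugin in (process_plugin, store_plugin, archive_plugin):
    core.register(plugin)

core.handle({"key": "obs:KATL", "payload": " clear "})
```

In a real deployment the `handle` call would be driven by a worker consuming from the message queue; the point here is only that new data sets need a new plugin, not a change to the core.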

Architecture: Data Flow (DPE)

[Diagram: data flow across AZ A and AZ B, each with a public subnet and private subnets; components shown: IAPI Endpoint, RabbitMQ Cluster, Data Processing Engine, Data Publisher]

Architecture: Storage

- Polyglot Architecture

- Best Store for the Job

- Most Cost Effective Storage for the Job

- BYOS: Bring Your Own Store

- Cache Rich!


Architecture: Storage Polyglot

- Archive

- Images

- Videos

Bucket

Key/Value

Master

Slaves

- Real-time Data and Caching

Key/Value

Node Node Node Node

Key/Value

- Historical Weather

Archive

- Data Migration

- Gateway Data

- Analytics

Node Node Node Node

Columnar

- Analytics

Parquet

Columnar

Storage

Repositories

MySQL

SQL

Server

- Informatica

- Drupal

Architecture: Cache is your friend!

CDN

Master

Slaves

- App Cache

Key/Value (with data types for values)

- Origin Cache

- Edge Caching

- Edge Compute

- Make Sure All Data Elements are TTL Driven

- Always Respect Cache Control Headers

Varnish

EC2 EC2

App Instances

EC2 EC2

- And Keep It Simple!
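A TTL-driven cache that takes its lifetime from the Cache-Control header can be sketched as follows. This is an illustration under assumptions, not the platform's code: `TTLCache` and `max_age_from_cache_control` are hypothetical names, and the parser only handles a plain `max-age=<seconds>` directive.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple


def max_age_from_cache_control(header: str) -> Optional[int]:
    """Return the max-age value in seconds, or None if the header has none."""
    for directive in header.split(","):
        name, _, value = directive.strip().partition("=")
        if name.lower() == "max-age" and value.isdigit():
            return int(value)
    return None


class TTLCache:
    """Every entry carries an expiry time; expired entries are treated as misses."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._items: Dict[str, Tuple[float, Any]] = {}

    def set(self, key: str, value: Any, ttl_seconds: float) -> None:
        self._items[key] = (self._clock() + ttl_seconds, value)

    def get(self, key: str) -> Optional[Any]:
        entry = self._items.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._items[key]  # expired: drop it and report a miss
            return None
        return value


cache = TTLCache()
ttl = max_age_from_cache_control("public, max-age=300")
if ttl is not None:
    cache.set("forecast:30301", {"temp_f": 72}, ttl)
```

Because expiry is the only invalidation mechanism, there is no purge logic to coordinate across layers, which is one way to read "keep it simple".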

Architecture: Systems Of Record

- Let the system designers focus on the problem they are trying to solve

- Let them pick the best technology

- Just make sure they interface using standard protocols

- Let DPE handle Ingest

- Let Services Layer handle Distribution

- Support both Push/Pull model for publication to distribution engine

Architecture: Systems of Record

Forecast On Demand (GET Model): Data Services GET from Forecast On Demand on cache miss

CMS (POST Model): Content Management System POSTs to Data Services on publish

RESTful End Point

Currents On Demand (GET Model): Data Services GET from Currents On Demand on cache miss
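The two publication models can be contrasted in a short sketch: a pull on cache miss (the GET model) and a push on publish (the POST model). The class and method names are hypothetical, and a plain dict stands in for the cache; TTL handling is left out to keep the contrast visible.

```python
from typing import Callable, Dict


class DataService:
    """Serves from cache; pulls from a system of record on a miss, accepts pushes on publish."""

    def __init__(self, fetch_from_system_of_record: Callable[[str], str]) -> None:
        self._fetch = fetch_from_system_of_record
        self._cache: Dict[str, str] = {}

    def get(self, key: str) -> str:
        # GET model: only call the system of record when the cache has no entry.
        if key not in self._cache:
            self._cache[key] = self._fetch(key)
        return self._cache[key]

    def publish(self, key: str, value: str) -> None:
        # POST model: the system of record pushes new content when it is published.
        self._cache[key] = value


calls = []


def forecast_on_demand(key: str) -> str:
    """Hypothetical stand-in for an on-demand system of record."""
    calls.append(key)
    return f"forecast for {key}"


service = DataService(forecast_on_demand)
service.get("30301")
service.get("30301")                     # served from cache, no second fetch
service.publish("article:1", "headline")  # pushed by a CMS-like system
```

The pull model suits data that is computed per request, while the push model suits content with a clear publish event.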

Architecture: Data Services

- RESTful API Design

- Stateless

- Decoupled

- Atomic / Aggregation Services

- Support both Push/Pull Model

- API Key driven Auth/Metering

- Horizontally Scalable

- Capable of serving billions of requests / day

- Data lends well to caching

Architecture: Distribution – Weather Data

[Diagram: API Users, CDN, API Gateway, OAPI, Aggregate Engine, FOD Dispatcher, COD Dispatcher, FOD Cache, COD Cache, Redis, Riak]

Outbound API (OAPI)

- Fine grained RESTful API

- Intelligent Cache Management

- Accesses datastores, systems of record and other services

Aggregate Engine

- Aggregates fine grained APIs

- Aggregates at Edge through CDN ESI

Architecture: Request Flow

[Diagram: request flow from the Internet into AZ A and AZ B, each with a public subnet and a private subnet; components shown: Distribution Services, OAPI, FOD Cache, COD Cache, FOD, COD]

Architecture: Distribution – Content (Articles, Images, Video)

[Diagram: Drupal CMS, Metadata Store, Images, Videos, Asset Metadata, Image Cut Service, Video Distribution Services, Generic Asset Service, mRSS Feeds, Metadata, Static Asset Pools, S3]

Architecture: Gateway

- Authentication

- Routing

- Metering

- Throttling

- CDN Aware, CDN Driven

- Remember 25ms latency target!

- We rolled our own
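The talk does not describe how the custom gateway implements these functions. As one common way to combine them, the sketch below uses a per-key token bucket for throttling and a counter for metering. The token bucket is a swapped-in technique chosen for illustration, and all names and status-code choices are assumptions.

```python
import time
from collections import Counter
from typing import Callable, Dict, Set


class TokenBucket:
    """Allows bursts up to `capacity`, refilling at `rate_per_sec` tokens per second."""

    def __init__(self, rate_per_sec: float, capacity: int,
                 clock: Callable[[], float]) -> None:
        self._rate = rate_per_sec
        self._capacity = capacity
        self._clock = clock
        self._tokens = float(capacity)
        self._last = clock()

    def allow(self) -> bool:
        now = self._clock()
        self._tokens = min(self._capacity, self._tokens + (now - self._last) * self._rate)
        self._last = now
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return True
        return False


class Gateway:
    """Authenticates by API key, throttles per key, and meters accepted requests."""

    def __init__(self, api_keys: Set[str], rate_per_sec: float, burst: int,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._api_keys = api_keys
        self._rate = rate_per_sec
        self._burst = burst
        self._clock = clock
        self._buckets: Dict[str, TokenBucket] = {}
        self.meter: Counter = Counter()

    def check(self, api_key: str) -> int:
        if api_key not in self._api_keys:
            return 401  # authentication failed
        if api_key not in self._buckets:
            self._buckets[api_key] = TokenBucket(self._rate, self._burst, self._clock)
        if not self._buckets[api_key].allow():
            return 429  # throttled
        self.meter[api_key] += 1  # metering: count accepted requests per key
        return 200
```

All three checks are in-memory operations, which matters when the whole request has a latency target in the tens of milliseconds.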

Architecture: Gateway

[Diagram: API Users, CDN (authentication, metering, throttling; quick response), caching routing, origin routing, Source of Authentication Truth]

- User makes API request

- CDN checks authorization - Look Aside

- If authorized, check cache

- If cache-miss, hit origin caching/routing

- If origin cache-miss, pass through to backend servers
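The five steps above can be read as a tiered lookup. The sketch below follows that order; the function signature, the 403 status, and the decision to fill both caches on the way back are assumptions for illustration, not details given in the talk.

```python
from typing import Callable, Dict, Optional, Set, Tuple


def handle_request(
    api_key: str,
    path: str,
    authorized_keys: Set[str],
    edge_cache: Dict[str, str],
    origin_cache: Dict[str, str],
    backend: Callable[[str], str],
) -> Tuple[int, Optional[str], str]:
    """Return (status, body, tier that answered)."""
    # Step 2: look-aside authorization check at the CDN.
    if api_key not in authorized_keys:
        return 403, None, "denied"
    # Step 3: edge (CDN) cache.
    if path in edge_cache:
        return 200, edge_cache[path], "edge"
    # Step 4: origin caching/routing layer.
    if path in origin_cache:
        edge_cache[path] = origin_cache[path]
        return 200, origin_cache[path], "origin"
    # Step 5: pass through to the backend servers, then fill both caches (assumption).
    body = backend(path)
    origin_cache[path] = body
    edge_cache[path] = body
    return 200, body, "backend"


edge: Dict[str, str] = {}
origin: Dict[str, str] = {}
first = handle_request("k1", "/v1/obs", {"k1"}, edge, origin, lambda p: f"data for {p}")
second = handle_request("k1", "/v1/obs", {"k1"}, edge, origin, lambda p: f"data for {p}")
```

Here `first` is answered by the backend and `second` by the edge, which is the effect the layered caches are meant to achieve for repeat requests.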

Architecture: The Other Side – Events & Analytics!

[Diagram: events and analytics pipeline. Labels shown: Streaming Sources, Batch Sources, Events, 3rd Party, Other DBs, S3, Custom Ingestion Pipeline, Amazon SQS, Streaming, ETL, Data Lake, Long Term Raw Storage, Short Term Storage and Big Data Processing, Stream Processing, Data Access, SQL, Consumers, Operational Analytics, Business Analytics, Executive Dashboards, Data Discovery, Data Science, 3rd Party System Integration]

Architecture: Putting it all together

[Diagram: all layers together: Data Processing Engine, Data Services, Storage, Systems of Record, Gateway, CDN]

Architecture: Implementation

[Diagram: Global Region 1, Global Region 2, Global Region 3, Global Region 4; Global Traffic Management and CDN; Remote Ingestion; FOD and Distribution Engine in the regions; Monitoring, Configuration Mgmt, Automation; Partner Data Sources (Weather, Alerts, Traffic, etc)]

And while we were building it …

A curve ball!

Challenge:

• New deal struck with a MAJOR mobile phone company

• Ship new API

• Time to Market = 3 months

• Scale to 25+ billion requests per day
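For a sense of scale, the daily figures quoted in the deck can be converted to average requests per second. This is plain arithmetic; it assumes traffic is spread evenly over the day, which real traffic is not, so peak rates would be higher by an amount the talk does not state.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def average_rps(requests_per_day: int) -> float:
    """Average requests per second, assuming an even spread over 24 hours."""
    return requests_per_day / SECONDS_PER_DAY


for label, per_day in [
    ("25 billion/day", 25_000_000_000),
    ("40 billion/day", 40_000_000_000),
    ("60 billion/day", 60_000_000_000),
]:
    print(f"{label}: about {average_rps(per_day):,.0f} requests/second on average")
```

At 25 billion requests per day the average alone is close to 290,000 requests per second.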

Some findings

Architecture Already Decoupled

- Focus on Scaling Distribution Layer

Findings in Cycle:

- Load Testing / Tuning

- VPC NAT Saturation

- DNS Servers Sizing

- Instance Types and Characteristics

- OS Kernel Limits

- Destructive Testing / Fixing

- Brought Down instances, AZs, Regions

- Corrupted caches, databases

Cycle: Load Test, Tune, Destructive Test, Fix

KEY TAKEAWAY

It takes time to figure all this out … so please budget time and resources for both load and destructive testing

AWS Insights

Leverage AWS Managed Services

• Amazon Route 53 – DNS

• Amazon RDS – Relational DBs

• Amazon DynamoDB – NoSQL DBs

• Amazon ElastiCache – Redis or Memcached

• Amazon SQS - Queuing

• Amazon Redshift – Data Warehouse

• Amazon Kinesis – Stream Storage

• AWS Lambda – “Code as a Service”


Why RDS vs. EC2-based RDBMS

Independent of RDBMS engine:

• Licensing

• Replication

• Backups

• Updates

| | MySQL, Oracle, Postgres | MS SQL | Amazon Aurora |
|---|---|---|---|
| Max. IOPS | 20,000 | 10,000 | 100,000s |
| Max. TBs | 6 | 4 | 64 |

Storage

Which NoSQL?

+ Write performance more critical than durability

+ Native multi-X replication

+ Ecosystem

– Repartitioning

– Operational burden

– Data transfer cost

+ “Zero downtime”

+ Cross-region replication

– Repartitioning

– Operational burden

– Data transfer cost

+ Managed solution

+ Easy to scale

+ Constantly Evolving

– Item size

– Cross-region replication

Storage

DynamoDB

Stream Storage

Building a DPE – AWS Style

Decouple producers & consumers

Temporary buffer

Preserve client ordering

Streaming MapReduce

[Diagram: Producer 1, Producer 2, Producer 3 ... Producer N write ordered records 1 to 4 with keys (e.g. Key = Red, Key = Green) into Shard 1 and Shard 2. Consumer 1 reports Count of Red = 4 and Count of Violet = 4; Consumer 2 reports Count of Blue = 4 and Count of Green = 4]
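The sharding idea on this slide can be sketched as follows: records with the same key always land on the same shard, so one consumer per shard can keep a correct count per key. Amazon Kinesis maps a partition key to a shard via an MD5 hash over a 128-bit key space; the even split of that space across shards, the function names, and the colour keys are assumptions made for this illustration.

```python
import hashlib
from collections import Counter
from typing import Dict, List


def shard_for(partition_key: str, shard_count: int) -> int:
    """Map a key to a shard by splitting the 128-bit MD5 hash space evenly."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") * shard_count // 2**128


def count_by_key(records: List[str], shard_count: int) -> Dict[int, Counter]:
    """One counter per shard, as if a separate consumer read each shard."""
    per_shard: Dict[int, Counter] = {i: Counter() for i in range(shard_count)}
    for key in records:
        per_shard[shard_for(key, shard_count)][key] += 1
    return per_shard


# Four records for each key, interleaved as if sent by several producers.
records = ["red", "green", "blue", "violet"] * 4
counts = count_by_key(records, shard_count=2)
for shard, counter in counts.items():
    print(f"shard {shard}: {dict(counter)}")
```

Which key lands on which shard depends on the hash, so the split may differ from the slide; the property that matters is that each key's full count appears on exactly one shard.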

Which Stream Store Should I Use?

Amazon Kinesis and Apache Kafka have many similarities

• Multiple consumers

• Ordering of records

• Streaming MapReduce

• Low latency

• Highly durable, available, and scalable

Differences

• Record lifetime: 24 hours in Amazon Kinesis, configurable in Kafka

• Record size: 1MB/record in Amazon Kinesis, configurable in Kafka

• Amazon Kinesis is a fully managed service

• Easier to provision, manage, and scale


Server-less Approach to DPE

Amazon Kinesis captures the stream; AWS Lambda processes the stream.

| Data Input | Action | Data Output |
|---|---|---|
| IT application activity | Audit | SNS |
| Metering records | Condense | Redshift |
| Change logs | Backup | S3 |
| IoT Device Data | Store | RDS |
| Transaction orders | Process | SQS |
| Server health metrics | Monitor | EC2 |
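As an illustration of the "Metering records, Condense" row, here is a minimal sketch of a Lambda handler fed by a Kinesis stream. Kinesis delivers record payloads base64-encoded under `Records[].kinesis.data`; the JSON payload schema (`api_key`, `requests`) is an assumption, and writing the condensed result to Redshift is left out.

```python
import base64
import json
from collections import Counter
from typing import Any, Dict


def handler(event: Dict[str, Any], context: Any) -> Dict[str, int]:
    """Condense a batch of metering records into request totals per API key."""
    usage: Counter = Counter()
    for record in event["Records"]:
        payload = base64.b64decode(record["kinesis"]["data"])
        item = json.loads(payload)
        # Hypothetical schema: {"api_key": "...", "requests": <int>}
        usage[item["api_key"]] += item.get("requests", 1)
    return dict(usage)


def make_record(item: Dict[str, Any]) -> Dict[str, Any]:
    """Build a record shaped like the ones Kinesis hands to Lambda (for local testing)."""
    data = base64.b64encode(json.dumps(item).encode("utf-8")).decode("ascii")
    return {"kinesis": {"data": data}}


sample_event = {"Records": [
    make_record({"api_key": "partner-a", "requests": 3}),
    make_record({"api_key": "partner-b"}),
    make_record({"api_key": "partner-a", "requests": 2}),
]}
summary = handler(sample_event, None)
```

Because the handler holds no state between invocations, scaling is a matter of shard count and batch size rather than worker fleet management.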

Evolution

Architectural Evolution: Micro-services Approach

[Diagram: User, GTM/CDN, Common Services Layer (Router & Controller, Auth & Metering), micro-services Forecast, Aggregation, Location and Lifestyle, each fronted by Varnish, Storage Polyglot, Micro DPE]

Architectural Evolution: Technical Stack

Ingest

- Queue:

- Amazon SQS

- Stream

- Kafka

- Micro DPE

- Avro

- Thrift

- Protocol Buffers

- Micro-Services Type of Model For Ingest

Distribution

- Micro Services

- Language Polyglot

- Service Discovery

Storage

- Amazon Aurora

- BYOS

Analytics

- Parquet + Amazon S3

- Spark

- Amazon EMR

Wrapping Up!

- Have an Architectural Blueprint

- Keep Decoupled or Loosely Coupled Layers

- Communication via Standard Protocols

- Keep Architectural Plan “Technology Agnostic”

- Storage Polyglot

- Language Polyglot

- Be Aware of the Monoliths!

- Keep Caching Architecture Simple – TTL Driven

- Always Budget for

  - Load Testing

  - Destructive Testing

Related Sessions

ARC309 - From Monolithic to Microservices: Evolving Architecture Patterns in the Cloud - Thursday

ARC301 - Scaling Up to Your First 10 Million Users - Thursday

BDT310 - Big Data Architectural Patterns and Best Practices on AWS – Today 2:45 PM

BDT403 - Best Practices for Building Real-time Streaming Applications with Amazon Kinesis - Thursday

Remember to complete your evaluations!

Thank you!