Netflix: Amazon S3 & Amazon Elastic MapReduce to Monitor at Gigascale (BDT302) | AWS re:Invent 2013

Post on 16-Sep-2014

description

How does Netflix stay on top of the operations of its Internet service with millions of users and billions of metrics? With Atlas, its own massively distributed, large-scale monitoring system. Come learn how Netflix built Atlas with multiple processing pipelines using Amazon S3 and Amazon EMR to provide low-latency access to billions of metrics while supporting query-time aggregation along multiple dimensions.

Transcript of Netflix: Amazon S3 & Amazon Elastic MapReduce to Monitor at Gigascale (BDT302) | AWS re:Invent 2013

© 2013 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.

Roy Rapoport

November 14, 2013

Deft Data at Netflix: Using Amazon S3 and Amazon Elastic

Friday, November 15, 13

A Word About Me …

• About 20 years in technology
• Systems engineering, networking, software development, QA, release management
• Time at Netflix: 1599 days (4y:4m:15d)
• Before at Netflix: Service Delivery in the IT/Ops, troubleshooter, Builder of Python Things[tm]
• Current role: Cloud Monitoring
• We build platforms
• Sometimes we make them easy to use

A Word About Netflix …
Just the Stats

• 16 years
• 2000+ employees
• 40 million users
• 5x10^9 hours/quarter

A Word About Netflix …
Freedom and Responsibility Culture

• Optimize speed of innovation
  Constrain availability
  Cost will be what cost will be
• Hire smart (experienced) people
  Get out of their way
• Anti-process bias

A Word About Netflix …
Technology and Operations

• Service Oriented Architecture
• Decentralized Operations. You
  • Build
  • Test
  • Deploy
  • Set up alerting and monitoring
  • Wake up at 2AM

• AWS-based for 100% of streaming*
• Huge expansion
  • Customer Growth
  • New markets
  • Metrics

In the Old Days …
Our Old Alerting System

• Enterprise IT Solution
• Managed by the Enterprise IT Alerting People
• File Tickets
• Send alerts to NOC
• Completely separate from telemetry system

Copyright: State Library of Victoria Collections. CC Attribution 2.0 License
Copyright USAID Microlinks. CC Attribution 2.0 License
Copyright: http://www.flickr.com/photos/s_w_ellis
CC Attribution 2.0 License

In the Old Days …
Our Old Telemetry System

• Spare-time effort by a lone sysadmin
• Loved by developers
• Custom TCP protocol
• RRD file back-end storage
• Mostly Perl
• Datacenter-bound (and limited)
• Starting to falter under metrics growth

Copyright: http://www.flickr.com/photos/acme
CC Attribution 2.0 License

Speaking of Growth

By way of comparison
• Every person in the world
  • twice
• Every smartphone in the world
  • ten times

Copyright: http://www.flickr.com/photos/76651030@N02/

CC Attribution 2.0 License

So We Built Something Better

UI Layer Fronts Multiple Systems
[Diagram: UI in front of Atlas, Epic, and CloudWatch]

Clear Regional Separation
• And aggregation
[Diagram: global above us-east-1, us-west-1, us-west-2, eu-west-1]

Localized Node/Metric Identification

Before: “I think You’re Bob” / “Here’s a metric!” / “OK!”
Now: “I’m Bob. Here’s a metric!”

What’s a Metric?

• com.netflix.eds.nccp.successful.requests.uiversion.nccprt-authorization.devtypid-101.clver-PHL_0AB.uiver-UI_169_mid.geo-US
• 256 characters aren’t enough!
• This is better:

  nf.app       wp
  nf.cluster   wp-batch
  nf.asg       wp-batch-v163
  nf.ami       ami-aa5166ef
  nf.node      i-097c0e52
  nf.region    us-west-1
  nf.zone      us-west-1b
  nf.country   us
  class        nccp
  type         request
  action       authorization
  devtype      101
  clver        PHL_0AB
  uiversion    UI_169_mid
  geo          us
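The contrast between one long dotted name and a set of tags can be sketched in Python. This is only an illustration, not Atlas code: the tag names and values come from the slide, while the dict representation and the `matches` and `group_key` helpers are assumptions made for the example.

```python
# Tag names and values are copied from the slide; storing them in a dict
# is an assumption made for this sketch.
metric_tags = {
    "nf.app": "wp",
    "nf.cluster": "wp-batch",
    "nf.asg": "wp-batch-v163",
    "nf.ami": "ami-aa5166ef",
    "nf.node": "i-097c0e52",
    "nf.region": "us-west-1",
    "nf.zone": "us-west-1b",
    "nf.country": "us",
    "class": "nccp",
    "type": "request",
    "action": "authorization",
    "devtype": "101",
    "clver": "PHL_0AB",
    "uiversion": "UI_169_mid",
    "geo": "us",
}


def matches(tags, wanted):
    """Hypothetical helper: True when every wanted key/value pair is in tags."""
    return all(tags.get(key) == value for key, value in wanted.items())


def group_key(tags, keys):
    """Hypothetical helper: project a metric onto a subset of its tag keys."""
    return tuple(tags.get(key) for key in keys)


print(matches(metric_tags, {"nf.region": "us-west-1", "geo": "us"}))
print(group_key(metric_tags, ["nf.app", "nf.zone"]))
```

With tags, any dimension can be used to filter or group without parsing positions out of a name.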

Powerful queries

• Make the complex possible
• Make the simple … sort of hard

Copyright: Kurt Moerman. CC Attribution 2.0 License

http://atlas/api/v1/graph?q=nf.region,us-west-1,:eq,nf.app,employeeinfo,:eq,:and,name,employeeinfo_api,:eq,:and,:sum&e=now-5m&s=e-3h

http://atlas/api/v1/graph?q=nf.region,us-west-1,:eq,nf.app,employeeinfo,:eq,:and,name,employeeinfo_api,:eq,:and,:sum,(,nf.zone,),:by&e=now-5m&s=e-3h

http://atlas/api/v1/graph?q=sps,nf.cluster,(,nccp-legacy,nccp-modern,),:in,nccprt,(,NCCPLicense,com_netflix_streaming_nccp_request_license,),:in,:and,stat,SuccessfulRequests,:eq,:and,device.rollup,3ds,:eq,:and,:sum,:set,entering_trough,sps,:get,1h,:offset,0.95,:mul,sps,:get,:gt,:set,smoothed,sps,:get,10,0.1,0.02,:des,:set,low_volume,smoothed,:get,-0.005,:mul,0.1,:add,:set,mid_volume,smoothed,:get,-0.00125,:mul,0.1,:add,:set,base,0.06,:set,min_pct,1,smoothed,:get,20,:lt,low_volume,:get,:mul,smoothed,:get,80,:lt,mid_volume,:get,:mul,:add,entering_trough,:get,0.05,:mul,:add,base,:get,:add,:sub,10,0.1,0.02,:des,:set,sps,:get,$(device.rollup)SPS,:legend,min_pct,:get,smoothed,:get,:mul,lowerbound,:legend,sps,:get,min_pct,:get,smoothed,:get,:mul,:lt,5,:rolling-count,2,:ge,:vspan,60,:alpha,$(device.rollup),:legend
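The `q=` parameter in these URLs reads as a comma-separated postfix (stack) expression: operands are pushed, and words starting with `:` consume what is on the stack. A minimal Python sketch of that reading follows. It is not the Atlas parser; the `parse` function is hypothetical, and the operator arities are inferred from the first example URL only.

```python
def parse(expr):
    """Turn a comma-separated postfix expression into nested tuples.

    Only the operators from the first example URL are handled; their
    arities are an assumption based on how that example reads.
    """
    arity = {":eq": 2, ":and": 2, ":sum": 1}
    stack = []
    for token in expr.split(","):
        if token in arity:
            n = arity[token]
            if len(stack) < n:
                raise ValueError(f"not enough operands for {token}")
            args = stack[-n:]
            del stack[-n:]
            stack.append((token, *args))
        elif token.startswith(":"):
            raise ValueError(f"operator {token} is not handled by this sketch")
        else:
            stack.append(token)
    return stack


query = ("nf.region,us-west-1,:eq,nf.app,employeeinfo,:eq,:and,"
         "name,employeeinfo_api,:eq,:and,:sum")
print(parse(query))
```

Read this way, the first URL asks for the sum of everything matching three tag conditions, which is why the simple case already looks "sort of hard".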

Ridiculous Read Volume:

• Engage
• Graphs and Dashboards
• Alerting
• Automated Canaries
• Capacity Analytics
• Special Projects
• BI

[Architecture diagram. Components: client instance, publish cluster, Amazon S3, poller cluster, backend instances, regional endpoint, global endpoint]

That Sounds Great!
Surely there are no problems

• Speed is hard
• Speed at volume is harder
• We looked at spinning disks
• Memory’s the way to go
• m2.4xlarge
• This is operational data
  • People want it available, fast
  • Operations have short memories

20,160 m2.4xlarge
$32,094,720 upfront
$8,005,939/month
per region, with no redundancy

Copyright: http://www.flickr.com/photos/lainetrees/
CC Attribution 2.0 License
Copyright: http://www.flickr.com/photos/amenk/
CC Attribution 2.0 License
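To get a feel for the all-in-memory estimate, the slide's totals can be broken down per instance. The three input numbers below are from the slide; the per-instance and first-year figures are plain arithmetic on them, not numbers from the talk.

```python
instances = 20_160            # m2.4xlarge count, from the slide
upfront_total = 32_094_720    # USD upfront, from the slide
monthly_total = 8_005_939     # USD per month, from the slide

# Derived figures (simple arithmetic, per region and with no redundancy).
upfront_per_instance = upfront_total / instances
monthly_per_instance = monthly_total / instances
first_year_total = upfront_total + 12 * monthly_total

print(f"upfront per instance: ${upfront_per_instance:,.2f}")
print(f"monthly per instance: ${monthly_per_instance:,.2f}")
print(f"first-year total:     ${first_year_total:,}")
```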

That Doesn’t Sound Great!

• If only we could reduce it …
• “Reduce”? Get it? Get it?
• Our granularity is two-dimensional
• We can reduce on either dimension
• Some tags make sense for very rapid reduction
  • Hystrix
  • nf.node
    • Sometimes a lot (vhs)
    • Sometimes a little (Cassandra)

[Chart axes: Step size (time) vs. Dimensionality (tags)]
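The two dimensions of reduction can be sketched as two small functions: one drops tag keys, the other widens the time step. This is an illustrative sketch under stated assumptions, not the talk's implementation: the sample values are made up, and summing is just one possible way to combine points that collide.

```python
from collections import defaultdict

# (tags, minute, value) samples; the values are made up for illustration.
samples = [
    ({"nf.app": "wp", "nf.node": "i-1"}, 0, 3.0),
    ({"nf.app": "wp", "nf.node": "i-2"}, 0, 5.0),
    ({"nf.app": "wp", "nf.node": "i-1"}, 1, 9.0),
    ({"nf.app": "wp", "nf.node": "i-2"}, 5, 14.0),
]


def reduce_tags(samples, drop_keys):
    """Reduce dimensionality: drop the given tag keys, sum what collides."""
    out = defaultdict(float)
    for tags, minute, value in samples:
        kept = tuple(sorted((k, v) for k, v in tags.items() if k not in drop_keys))
        out[(kept, minute)] += value
    return dict(out)


def reduce_time(samples, step):
    """Reduce time resolution: sum values into buckets of `step` minutes."""
    out = defaultdict(float)
    for tags, minute, value in samples:
        key = tuple(sorted(tags.items()))
        out[(key, minute - minute % step)] += value
    return dict(out)


by_app = reduce_tags(samples, {"nf.node"})   # fewer tags, same step
by_5m = reduce_time(samples, 5)              # same tags, coarser step
print(by_app)
print(by_5m)
```

Dropping `nf.node` collapses per-instance series into one series per application, while a larger step collapses neighbouring minutes; either one shrinks the data that has to stay in memory.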

A Reductive Approach

• For a series of values, reduce and keep:
  • minimum
  • maximum
  • total
  • count
• Example:
  • 3,5,9,14,20: min 3, max 20, tot 51, count 5
• Allows for sense of scale
• Allows for arbitrary further reduction w/o loss of precision
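The reason these four values allow further reduction without loss is that each of them can be merged: the minimum of minimums, the maximum of maximums, and the sums of totals and counts. A minimal Python sketch, using the slide's example values; the `Summary` class and its method names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Summary:
    """The four values kept for a series (class name is an assumption)."""
    minimum: float
    maximum: float
    total: float
    count: int

    @classmethod
    def of(cls, values):
        """Reduce a non-empty series of raw values."""
        values = list(values)
        return cls(min(values), max(values), sum(values), len(values))

    def merge(self, other):
        """Combine two summaries; the result equals summarizing all raw values."""
        return Summary(
            min(self.minimum, other.minimum),
            max(self.maximum, other.maximum),
            self.total + other.total,
            self.count + other.count,
        )

    @property
    def mean(self):
        return self.total / self.count


whole = Summary.of([3, 5, 9, 14, 20])
parts = Summary.of([3, 5]).merge(Summary.of([9, 14, 20]))
print(whole)
print(parts == whole, whole.mean)
```

Reducing in two steps gives the same summary as reducing once, and the average is still recoverable from total and count.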

Reduction: Policy

Copyright: http://www.flickr.com/photos/bagaball/

CC Attribution 2.0 License

Friday, November 15, 13

Reduction: Policy

•Policy-driven EMR engine

Copyright: http://www.flickr.com/photos/bagaball/

CC Attribution 2.0 License

Friday, November 15, 13

Reduction: Policy

•Policy-driven EMR engine•Four possible actions

Copyright: http://www.flickr.com/photos/bagaball/

CC Attribution 2.0 License

Friday, November 15, 13

Reduction: Policy

•Policy-driven EMR engine•Four possible actions

•preserve

Copyright: http://www.flickr.com/photos/bagaball/

CC Attribution 2.0 License

Friday, November 15, 13

Reduction: Policy

•Policy-driven EMR engine•Four possible actions

•preserve•drop

Copyright: http://www.flickr.com/photos/bagaball/

CC Attribution 2.0 License

Friday, November 15, 13

Reduction: Policy

•Policy-driven EMR engine•Four possible actions

•preserve•drop•consolidate

Copyright: http://www.flickr.com/photos/bagaball/

CC Attribution 2.0 License

Friday, November 15, 13

Reduction: Policy

• Policy-driven EMR engine
• Four possible actions
  • preserve
  • drop
  • consolidate
  • rollup

Copyright: http://www.flickr.com/photos/bagaball/

CC Attribution 2.0 License

Reduction: Policy

{
  "rules" : [
    {
      "operations" : [ { "op" : "drop" } ],
      "query" : "nf.app,api,:eq,class,(,LastMinuteFailRatio,SLA,NetflixSimpleDBService,),:in,:and"
    },
    {
      "operations" : [
        {
          "config" : { "keys" : [ "nf.node", "device", "nf.country" ] },
          "op" : "rollup"
        }
      ],
      "query" : ":true"
    }
  ]
}
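To make the policy above concrete, here is a rough Python sketch of how such rules might be applied. Everything beyond the policy document itself is an assumption: the query is read as a comma-separated postfix expression, the first matching rule is assumed to win, `rollup` is assumed to remove the listed tag keys and sum what remains, and the function names `compile_query` and `apply_policy` are invented for illustration. The `preserve` and `consolidate` actions are not modeled.

```python
from collections import defaultdict

POLICY = {
    "rules": [
        {"operations": [{"op": "drop"}],
         "query": "nf.app,api,:eq,class,(,LastMinuteFailRatio,SLA,"
                  "NetflixSimpleDBService,),:in,:and"},
        {"operations": [{"config": {"keys": ["nf.node", "device", "nf.country"]},
                         "op": "rollup"}],
         "query": ":true"},
    ]
}


def compile_query(query):
    """Compile a postfix query string into a predicate over tag dicts."""
    stack = []
    for token in query.split(","):
        if token == ":true":
            stack.append(lambda tags: True)
        elif token == ":eq":
            value, key = stack.pop(), stack.pop()
            stack.append(lambda tags, k=key, v=value: tags.get(k) == v)
        elif token == ")":
            items = []
            while (item := stack.pop()) != "(":
                items.append(item)
            stack.append(items)
        elif token == ":in":
            values, key = stack.pop(), stack.pop()
            stack.append(lambda tags, k=key, vs=frozenset(values): tags.get(k) in vs)
        elif token == ":and":
            right, left = stack.pop(), stack.pop()
            stack.append(lambda tags, l=left, r=right: l(tags) and r(tags))
        else:
            stack.append(token)  # literal key, value, or "(" marker
    (predicate,) = stack
    return predicate


def apply_policy(policy, datapoints):
    """Apply the first matching rule to each (tags, value) datapoint."""
    rules = [(compile_query(r["query"]), r["operations"]) for r in policy["rules"]]
    totals = defaultdict(float)
    for tags, value in datapoints:
        ops = next((ops for matches, ops in rules if matches(tags)), [])
        kept, dropped = dict(tags), False
        for op in ops:
            if op["op"] == "drop":
                dropped = True
            elif op["op"] == "rollup":
                for key in op["config"]["keys"]:
                    kept.pop(key, None)  # aggregate across this dimension
        if not dropped:
            totals[frozenset(kept.items())] += value
    return {tuple(sorted(k)): v for k, v in totals.items()}


# Hypothetical datapoints: the first matches the drop rule, the rest roll up.
datapoints = [
    ({"nf.app": "api", "class": "SLA", "nf.node": "i-1"}, 1.0),
    ({"nf.app": "api", "class": "Other", "nf.node": "i-1", "device": "tv"}, 2.0),
    ({"nf.app": "api", "class": "Other", "nf.node": "i-2", "device": "phone"}, 3.0),
]
result = apply_policy(POLICY, datapoints)
```

With these sample datapoints, the per-node and per-device series collapse into a single series keyed only by `nf.app` and `class`, which is the kind of cardinality reduction the rollup rule is after.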

Amazon EMR

[Architecture diagram: client instance, publish cluster, Amazon S3, poller cluster, regional endpoint, global endpoint, EMR driver, and the 6H, 4D, 18D, and Historical clusters, later joined by as-needed clusters. Arrows are labeled metrics, query, and response; steps are numbered 1 through 5.]

Reduction: Benefits

• Indefinite storage in Amazon S3
• Fear of commitment achievement: Unlocked
• Can be aggressive about hiding metrics
• High granularity for special days
• Automated for regular operations*
• Not in critical path for visibility SLA
• Firewalls accidental metric explosions
• Huge efficiency gains

Copyright: http://www.flickr.com/photos/dr_pete/

CC Attribution 2.0 License

Reduction: Efficiency

Copyright: http://www.flickr.com/photos/sebrenner/

CC Attribution 2.0 License

Reduction: Efficiency

                     6H        4D        18D       HISTORY
Time Horizon         6 Hours   4 Days    18 Days   3 Months
Size                 600       512       180       12
Instances Per Hour   100       5         0         0
% Reduction          0         95        100       100
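The last two rows of the table can be reproduced from the first two with simple arithmetic. The sketch below assumes that "Size" is the cluster size in instances, that "Instances Per Hour" means instances divided by hours of data covered, and that "3 Months" is about 90 days; none of these readings is stated on the slide, and the `efficiency` function is a made-up name.

```python
# Figures from the slide. Reading "Size" as instances and expressing the time
# horizon in hours are assumptions made for this illustration.
clusters = {
    "6H": (600, 6),
    "4D": (512, 4 * 24),
    "18D": (180, 18 * 24),
    "HISTORY": (12, 90 * 24),  # "3 Months" approximated as 90 days
}


def efficiency(clusters, baseline="6H"):
    """Return {cluster: (instances per hour of data, % reduction vs. baseline)}."""
    base_size, base_hours = clusters[baseline]
    base_rate = base_size / base_hours
    rows = {}
    for name, (size, hours) in clusters.items():
        rate = size / hours
        reduction = 100 * (1 - rate / base_rate)
        rows[name] = (round(rate), round(reduction))
    return rows


rows = efficiency(clusters)
```

Under those assumptions the rounded results match the slide: each hour of data in the longer-horizon clusters costs a small fraction of what it costs in the 6-hour cluster.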

Previews

• Self-service for special requests
• Different instance types
  • cr1.8xlarge
  • hi1.4xlarge
• Multi-tiered metric visibility

Copyright: http://www.flickr.com/photos/creativealan/

CC Attribution 2.0 License

Growth Redux

[Chart: metrics (M) over time]

Date      Metrics (M)
5/11      2
8/11      2.5
9/11      10
1/12      15
4/12      18
8/12      30
11/12     55
1/13      90
5/13      212
10/13     728
1/14      1,200

And a Last Word About Costs

• Priorities Reminder
• Speed of Innovation
• Availability
• Cost

• Never intended to lower costs
• Cloud migration
• Additional features
• Massive Performance

EMR FTW

BDT302 Thank You
