Yow Conference Dec 2013 Netflix Workshop Slides with Notes

Patterns for Continuous Delivery, High Availability, DevOps & Cloud Native Open Source with NetflixOSS Workshop with Notes December 2013 Adrian Cockcroft @adrianco @NetflixOSS
  • Date posted: 16-Sep-2014
  • Category: Technology

description

Last full deck by Adrian at Netflix, downloadable with added notes.

Transcript of Yow Conference Dec 2013 Netflix Workshop Slides with Notes

Page 1: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

Patterns for Continuous Delivery, High Availability, DevOps & Cloud Native Open Source with NetflixOSS

Workshop with Notes
December 2013
Adrian Cockcroft
@adrianco @NetflixOSS

Page 2: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

Presentation vs. Workshop

• Presentation
  – Short duration, focused subject
  – One presenter to many anonymous audience
  – A few questions at the end

• Workshop
  – Time to explore in and around the subject
  – Tutor gets to know the audience
  – Discussion, rat-holes, “bring out your dead”

Page 3: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

Presenter

Adrian Cockcroft

• Technology Fellow
  – From 2014 Battery Ventures

• Cloud Architect
  – From 2007-2013 Netflix

• eBay Research Labs
  – From 2004-2007

• Sun Microsystems
  – HPC Architect
  – Distinguished Engineer
  – Author of four books
  – Performance and Capacity

• BSc Physics and Electronics
  – City University, London

Biography

Page 4: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

Attendee Introductions

• Who are you, where do you work
• Why are you here today, what do you need
• “Bring out your dead”
  – Do you have a specific problem or question?
  – One sentence elevator pitch

• What instrument do you play?

Page 5: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

Content

Cloud at Scale with Netflix

Cloud Native NetflixOSS

Resilient Developer Patterns

Availability and Efficiency

Questions and Discussion

Page 6: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

Netflix Member Web Site Home Page
Personalization Driven – How Does It Work?

Page 7: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

How Netflix Used to Work

Customer Device (PC Web browser)

Monolithic Web App

Oracle

MySQL

Monolithic Streaming App

Oracle

MySQL

Limelight/Level 3 Akamai CDNs

Content Management

Content Encoding

Consumer Electronics

AWS Cloud Services

CDN Edge Locations

Datacenter

Page 8: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

How Netflix Streaming Works Today

Customer Device (PC, PS3, TV…)

Web Site or Discovery API

User Data

Personalization

Streaming API

DRM

QoS Logging

OpenConnect CDN Boxes

CDN Management and Steering

Content Encoding

Consumer Electronics

AWS Cloud Services

CDN Edge Locations

Datacenter

Page 9: Yow Conference Dec 2013 Netflix Workshop Slides with Notes
Page 10: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

Netflix Scale

• Tens of thousands of instances on AWS
  – Typically 4 core, 30 GByte, Java business logic
  – Thousands created/removed every day

• Thousands of Cassandra NoSQL nodes on AWS
  – Many hi1.4xl - 8 core, 60 GByte, 2 TByte of SSD
  – 65 different clusters, over 300 TB data, triple zone
  – Over 40 are multi-region clusters (6, 9 or 12 zone)
  – Biggest 288 m2.4xl – over 300K rps, 1.3M wps

Page 11: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

Reactions over time

2009 “You guys are crazy! Can’t believe it”

2010 “What Netflix is doing won’t work”

2011 “It only works for ‘Unicorns’ like Netflix”

2012 “We’d like to do that but can’t”

2013 “We’re on our way using Netflix OSS code”

Page 12: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

Objectives:

Scalability
Availability
Agility
Efficiency

Page 13: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

Principles:

Immutability
Separation of Concerns
Anti-fragility
High trust organization
Sharing

Page 14: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

Outcomes:
• Public cloud – scalability, agility, sharing
• Micro-services – separation of concerns
• De-normalized data – separation of concerns
• Chaos Engines – anti-fragile operations
• Open source by default – agility, sharing
• Continuous deployment – agility, immutability
• DevOps – high trust organization, sharing
• Run-what-you-wrote – anti-fragile development

Page 15: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

When to use public cloud?

Page 16: Yow Conference Dec 2013 Netflix Workshop Slides with Notes
Page 17: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

"This is the IT swamp draining manual for anyone who is neck deep in alligators."- Adrian Cockcroft, Cloud Architect at Netflix

Page 18: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

Goal of Traditional IT:
Reliable hardware running stable software

Page 19: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

SCALE
Breaks hardware

Page 20: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

….SPEED
Breaks software

Page 21: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

SPEED at SCALE

Breaks everything

Page 22: Yow Conference Dec 2013 Netflix Workshop Slides with Notes
Page 23: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

Cloud Native

What is it?
Why?

Page 24: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

Strive for perfection

Perfect code
Perfect hardware
Perfectly operated

Page 25: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

But perfection takes too long

Compromises…
Time to market vs. Quality

Utopia remains out of reach

Page 26: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

Where time to market wins big

Making a land-grab
Disrupting competitors (OODA)

Anything delivered as web services

Page 27: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

Observe

Orient

Decide

Act

Land grab opportunity

Competitive move

Customer Pain Point

Analysis

Get buy-in

Plan response

Commit resources

Implement

Deliver

Engage customers

Model alternatives

BIG DATA

INNOVATION

CULTURE

CLOUD

Measure customers

Colonel Boyd, USAF

“Get inside your adversaries' OODA loop to disorient them”

Page 28: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

How Soon?

Product features in days instead of months
Deployment in minutes instead of weeks
Incident response in seconds instead of hours

Page 29: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

Cloud Native
A new engineering challenge

Construct a highly agile and highly available service from ephemeral and assumed broken components

Page 30: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

Inspiration

Page 31: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

How to get to Cloud Native

Freedom and Responsibility for Developers
Decentralize and Automate Ops Activities
Integrate DevOps into the Business Organization

Re-Org!

Page 32: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

Four Transitions

• Management: Integrated Roles in a Single Organization
  – Business, Development, Operations -> BusDevOps

• Developers: Denormalized Data – NoSQL
  – Decentralized, scalable, available, polyglot

• Responsibility from Ops to Dev: Continuous Delivery
  – Decentralized small daily production updates

• Responsibility from Ops to Dev: Agile Infrastructure - Cloud
  – Hardware in minutes, provisioned directly by developers

Page 33: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

The DIY Question

Why doesn’t Netflix build and run its own cloud?

Page 34: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

Fitting Into Public Scale

Public Grey Area Private

1,000 Instances 100,000 Instances

Netflix  Facebook  Startups

Page 35: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

How big is Public?

AWS upper bound estimate based on the number of public IP addresses
Every provisioned instance gets a public IP by default (some VPC instances don’t)

AWS Maximum Possible Instance Count: 5.1 Million – Sept 2013
Growth >10x in three years, >2x per annum - http://bit.ly/awsiprange
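The upper-bound estimate on this slide amounts to summing the sizes of the published public IP ranges. A minimal sketch of that arithmetic follows. The CIDR blocks are hypothetical documentation ranges, not the real AWS list, the ranges are assumed not to overlap, and the function name `max_addresses` is made up for illustration.

```python
import ipaddress


def max_addresses(cidr_blocks):
    """Upper bound on instance count: total addresses across the CIDR blocks.

    Assumes the blocks do not overlap and that each instance uses one address.
    """
    return sum(ipaddress.ip_network(block).num_addresses for block in cidr_blocks)


# Hypothetical sample ranges (RFC 5737 documentation blocks), not real AWS data.
sample_ranges = ["203.0.113.0/24", "198.51.100.0/25", "192.0.2.0/26"]
total = max_addresses(sample_ranges)
print(total)
```

Because some instances have no public address and some addresses are unused, the sum is only a ceiling, which is how the slide presents it.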

Page 36: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

The Alternative Supplier Question

What if there is no clear leader for a feature, or AWS doesn’t have what we need?

Page 37: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

Things We Don’t Use AWS For

SaaS Applications – Pagerduty, Onelogin etc.
Content Delivery Service

DNS Service

Page 38: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

CDN Scale

AWS CloudFront
Akamai
Limelight
Level 3
Netflix Openconnect
YouTube

Gigabits Terabits

Netflix  Facebook  Startups

Page 39: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

Content Delivery Service
Open Source Hardware Design + FreeBSD, bird, nginx
see openconnect.netflix.com

Page 40: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

DNS Service

AWS Route53 is missing too many features (for now)
Multiple vendor strategy: Dyn, Ultra, Route53
Abstracted (broken) DNS APIs with Denominator

Page 41: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

What Changed?

Get out of the way of innovation
Best of breed, by the hour

Choices based on scale

Cost reduction

Slow down developers

Less competitive

Less revenue

Lower margins

Process reduction

Speed up developers

More competitive

More revenue

Higher margins

Page 42: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

Getting to Cloud Native

Page 43: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

Congratulations, your startup got funding!

• More developers
• More customers
• Higher availability
• Global distribution
• No time….

Growth

Page 44: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

AWS Zone A

Your architecture looks like this:

Web UI / Front End API

Middle Tier

RDS/MySQL

Page 45: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

And it needs to look more like this…

Cassandra Replicas

Zone A

Cassandra Replicas

Zone B

Cassandra Replicas

Zone C

Regional Load Balancers

Cassandra Replicas

Zone A

Cassandra Replicas

Zone B

Cassandra Replicas

Zone C

Regional Load Balancers

Page 46: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

Inside each AWS zone:
Micro-services and de-normalized data stores

API or Web Calls

memcached

Cassandra

Web service

S3 bucket

Page 47: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

We’re here to help you get to global scale…
Apache Licensed Cloud Native OSS Platform
http://netflix.github.com

Page 48: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

Technical Indigestion – what do all these do?

Page 49: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

Updated site – make it easier to find what you need

Page 50: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

Getting started with NetflixOSS Step by Step

1. Set up AWS Accounts to get the foundation in place
2. Security and access management setup
3. Account Management: Asgard to deploy & Ice for cost monitoring
4. Build Tools: Aminator to automate baking AMIs
5. Service Registry and Searchable Account History: Eureka & Edda
6. Configuration Management: Archaius dynamic property system
7. Data storage: Cassandra, Astyanax, Priam, EVCache
8. Dynamic traffic routing: Denominator, Zuul, Ribbon, Karyon
9. Availability: Simian Army (Chaos Monkey), Hystrix, Turbine
10. Developer productivity: Blitz4J, GCViz, Pytheas, RxJava
11. Big Data: Genie for Hadoop PaaS, Lipstick visualizer for Pig
12. Sample Apps to get started: RSS Reader, ACME Air, FluxCapacitor

Page 51: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

AWS Account Setup

Page 52: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

Flow of Code and Data Between AWS Accounts

Production Account

Archive Account

Auditable Account

Dev Test Build Account

AMI

AMI

Backup Data to S3

Weekend S3 restore

New Code

Backup Data to S3

Page 53: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

Account Security

• Protect Accounts
  – Two factor authentication for primary login

• Delegated Minimum Privilege
  – Create IAM roles for everything

• Security Groups
  – Control who can call your services

Page 54: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

Cloud Access Control

www-prod

• Userid wwwprod

Dal-prod

• Userid dalprod

Cass-prod

• Userid cassprod

Cloud access audit log ssh/sudo bastion

Security groups don’t allow ssh between instances

Developers

Page 55: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

Tooling and Infrastructure

Page 56: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

Fast Start Amazon Machine Images
https://github.com/Answers4AWS/netflixoss-ansible/wiki/AMIs-for-NetflixOSS

• Pre-built AMIs for
  – Asgard – developer self service deployment console
  – Aminator – build system to bake code onto AMIs
  – Edda – historical configuration database
  – Eureka – service registry
  – Simian Army – Janitor Monkey, Chaos Monkey, Conformity Monkey
• NetflixOSS Cloud Prize Winner
  – Produced by Answers4aws – Peter Sankauskas

Page 57: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

Fast Setup CloudFormation Templates

http://answersforaws.com/resources/netflixoss/cloudformation/

• CloudFormation templates for
  – Asgard – developer self service deployment console
  – Aminator – build system to bake code onto AMIs
  – Edda – historical configuration database
  – Eureka – service registry
  – Simian Army – Janitor Monkey for cleanup,

Page 58: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

CloudFormation Walk-Through for Asgard

(Repeat for Prod, Test and Audit Accounts)

Page 59: Yow Conference Dec 2013 Netflix Workshop Slides with Notes
Page 60: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

Setting up Asgard – Step 1 Create New Stack

Page 61: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

Setting up Asgard – Step 2 Select Template

Page 62: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

Setting up Asgard – Step 3 Enter IP & Keys

Page 63: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

Setting up Asgard – Step 4 Skip Tags

Page 64: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

Setting up Asgard – Step 5 Confirm

Page 65: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

Setting up Asgard – Step 6 Watch CloudFormation

Page 66: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

Setting up Asgard – Step 7 Find PublicDNS Name

Page 67: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

Open Asgard – Step 8 Enter Credentials

Page 68: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

Use Asgard – AWS Self Service Portal

Page 69: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

Use Asgard - Manage Red/Black Deployments

Page 70: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

Track AWS Spend in Detail with ICE

Page 71: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

Ice – Slice and dice detailed costs and usage

Page 72: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

Setting up ICE

• Visit github site for instructions
• Currently depends on Highcharts
  – Non-open source package license
  – Free for non-commercial use
  – Download and license your own copy
  – We can’t provide a pre-built AMI – sorry!
• Long term plan to make ICE fully OSS
  – Anyone want to help?

Page 73: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

Build Pipeline Automation

Jenkins in the Cloud auto-builds NetflixOSS Pull Requests
http://www.cloudbees.com/jenkins

Page 74: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

Automatically Baking AMIs with Aminator

• AutoScaleGroup instances should be identical
• Base plus code/config
• Immutable instances
• Works for 1 or 1000…
• Aminator Launch
  – Use Asgard to start AMI or
  – CloudFormation Recipe

Page 75: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

Discovering your Services - Eureka

• Map applications by name to
  – AMI, instances, Zones
  – IP addresses, URLs, ports
  – Keep track of healthy, unhealthy and initializing instances
• Eureka Launch
  – Use Asgard to launch AMI or use CloudFormation Template
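To make the registry idea concrete, here is a toy in-memory sketch of mapping an application name to instance records and tracking their state. It is not Eureka's API. The class `ServiceRegistry`, its methods, the instance ids and the addresses are all hypothetical, and the three status labels are borrowed from the slide wording.

```python
from collections import defaultdict

# Status labels taken from the slide wording, not from Eureka's API.
STATUSES = ("initializing", "healthy", "unhealthy")


class ServiceRegistry:
    """Toy in-memory registry: application name -> instance records."""

    def __init__(self):
        self._apps = defaultdict(dict)

    def register(self, app, instance_id, zone, ip, port):
        # New instances start out as initializing until they report in.
        self._apps[app][instance_id] = {
            "id": instance_id, "zone": zone, "ip": ip, "port": port,
            "status": "initializing",
        }

    def set_status(self, app, instance_id, status):
        if status not in STATUSES:
            raise ValueError(f"unknown status: {status}")
        self._apps[app][instance_id]["status"] = status

    def lookup(self, app, zone=None):
        """Return healthy instances for an app, optionally limited to one zone."""
        return [
            record for record in self._apps[app].values()
            if record["status"] == "healthy" and zone in (None, record["zone"])
        ]


registry = ServiceRegistry()
registry.register("www", "i-aaa", "zone-a", "10.0.0.1", 7001)
registry.register("www", "i-bbb", "zone-b", "10.0.0.2", 7001)
registry.set_status("www", "i-aaa", "healthy")
print([r["id"] for r in registry.lookup("www")])
```

Callers only ever see healthy instances, so an instance that is still starting or has failed drops out of routing without any client-side configuration change.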

Page 76: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

Deploying Eureka Service – 1 per Zone

Page 77: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

Edda

AWS Instances, ASGs, etc.

Eureka Services metadata

Your Own Custom State

Searchable state history for a Region / Account

Monkeys

Timestamped delta cache of JSON describe call results for anything of interest…

Edda Launch
Use Asgard to launch AMI or use CloudFormation Template

Page 78: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

Edda Query Examples

Find any instances that have ever had a specific public IP address

    $ curl "http://edda/api/v2/view/instances;publicIpAddress=1.2.3.4;_since=0"
    ["i-0123456789","i-012345678a","i-012345678b"]

Show the most recent change to a security group

    $ curl "http://edda/api/v2/aws/securityGroups/sg-0123456789;_diff;_all;_limit=2"
    --- /api/v2/aws.securityGroups/sg-0123456789;_pp;_at=1351040779810
    +++ /api/v2/aws.securityGroups/sg-0123456789;_pp;_at=1351044093504
    @@ -1,33 +1,33 @@
     {
     …
       "ipRanges" : [
         "10.10.1.1/32",
         "10.10.1.2/32",
    +    "10.10.1.3/32",
    -    "10.10.1.4/32"
     …
     }
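The two queries above follow one pattern: a resource path followed by semicolon-separated arguments, some with values and some as bare flags. The sketch below only assembles such URLs. The helper name `edda_url` is hypothetical, and the host, paths and arguments are the ones shown on the slide.

```python
def edda_url(base, collection, resource_id=None, **matrix):
    """Build a query URL with semicolon-separated (matrix style) arguments.

    A value of True produces a bare flag such as ";_diff";
    any other value produces ";name=value".
    """
    url = f"{base}/{collection}"
    if resource_id is not None:
        url += f"/{resource_id}"
    for name, value in matrix.items():
        url += f";{name}" if value is True else f";{name}={value}"
    return url


BASE = "http://edda/api/v2"  # host name as used on the slide
by_ip = edda_url(BASE, "view/instances", publicIpAddress="1.2.3.4", _since=0)
sg_diff = edda_url(BASE, "aws/securityGroups", "sg-0123456789",
                   _diff=True, _all=True, _limit=2)
print(by_ip)
print(sg_diff)
```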

Page 79: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

Archaius – Property Console

Page 80: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

Archaius library – configuration management

SimpleDB or DynamoDB for NetflixOSS. Netflix uses Cassandra for multi-region…

Based on Pytheas. Not open sourced yet
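The core idea of a dynamic property system is that code asks for a value every time it needs it, so a change in the backing store takes effect without a redeploy. The sketch below illustrates that idea only and is not the Archaius API. The class `DynamicProperty`, the property name and the dict standing in for SimpleDB, DynamoDB or Cassandra are all assumptions.

```python
class DynamicProperty:
    """A named setting that is re-read from its source on every access."""

    def __init__(self, source, name, default):
        self._source = source    # any mapping; stands in for a remote store
        self._name = name
        self._default = default

    def get(self):
        # Fall back to the default when the store has no override.
        return self._source.get(self._name, self._default)


config_source = {}  # hypothetical stand-in for the persisted property store
timeout_ms = DynamicProperty(config_source, "client.timeout.ms", 2000)

before = timeout_ms.get()                  # default, nothing overridden yet
config_source["client.timeout.ms"] = 500   # an operator changes the property
after = timeout_ms.get()                   # new value, no restart involved
print(before, after)
```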

Page 81: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

Data Storage and Access

Page 82: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

Data Storage Options

• RDS for MySQL
  – Deploy using Asgard

• DynamoDB
  – Fast, easy to setup and scales up from a very low cost base

• Cassandra
  – Provides portability, multi-region support, very large scale
  – Storage model supports incremental/immutable backups
  – Priam: easy deploy automation for Cassandra on AWS

Page 83: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

Priam – Cassandra co-process

• Runs alongside Cassandra on each instance
• Fully distributed, no central master coordination
• S3 Based backup and recovery automation
• Bootstrapping and automated token assignment
• Centralized configuration management
• RESTful monitoring and metrics
• Underlying config in SimpleDB
  – Netflix uses Cassandra “turtle” for Multi-region

Page 84: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

Astyanax Cassandra Client for Java

• Features
  – Abstraction of connection pool from RPC protocol
  – Fluent Style API
  – Operation retry with backoff
  – Token aware
  – Batch manager
  – Many useful recipes
  – Entity Mapper based on JPA annotations
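"Operation retry with backoff" can be pictured as follows. This is a generic sketch of exponential backoff, not Astyanax's retry policy classes. The function name, the delays, the choice of `ConnectionError` as the retryable failure and the injected `sleep` parameter are all assumptions made so the example is self-contained.

```python
import time


def retry_with_backoff(operation, attempts=3, base_delay=0.01, sleep=time.sleep):
    """Run operation, retrying on ConnectionError with doubling delays."""
    for attempt in range(attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            sleep(base_delay * (2 ** attempt))


calls = {"n": 0}


def flaky():
    """Hypothetical operation that fails twice, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("node unavailable")
    return "ok"


delays = []  # record the waits instead of really sleeping
result = retry_with_backoff(flaky, attempts=4, sleep=delays.append)
print(result, calls["n"], delays)
```

Passing the sleep function in keeps the example fast and lets the waits be inspected. A production client would normally also cap the delay and add jitter.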

Page 85: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

Cassandra Astyanax Recipes

• Distributed row lock (without needing zookeeper)
• Multi-region row lock
• Uniqueness constraint
• Multi-row uniqueness constraint
• Chunked and multi-threaded large file storage
• Reverse index search
• All rows query
• Durable message queue
• Contributed: High cardinality reverse index

Page 86: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

EVCache - Low latency data access
• multi-AZ and multi-Region replication
• Ephemeral data, session state (sort of)
• Client code
• Memcached

Page 87: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

Routing Customers to Code

Page 88: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

Denominator: DNS for Multi-Region Availability

Cassandra Replicas

Zone A

Cassandra Replicas

Zone B

Cassandra Replicas

Zone C

Cassandra Replicas

Zone A

Cassandra Replicas

Zone B

Cassandra Replicas

Zone C

Denominator – manage traffic via multiple DNS providers with Java code

Regional Load Balancers Regional Load Balancers

UltraDNS DynECT DNS

AWS Route53

Denominator

Zuul API Router

Page 89: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

Zuul – Smart and Scalable Routing Layer

Page 90: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

Ribbon library for internal request routing

Page 91: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

Ribbon – Zone Aware LB

Page 92: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

Karyon - Common server container

• Bootstrapping
  o Dependency & Lifecycle management via Governator
  o Service registry via Eureka
  o Property management via Archaius
  o Hooks for Latency Monkey testing
  o Preconfigured status page and healthcheck servlets

Page 93: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

Karyon

• Embedded Status Page Console
  o Environment
  o Eureka
  o JMX

Page 94: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

Availability

Page 95: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

Either you break it, or users will

Page 96: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

Add some Chaos to your system

Page 97: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

Clean up your room! – Janitor Monkey

Works with Edda history to clean up after Asgard

Page 98: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

Conformity Monkey
Track and alert for old code versions and known issues
Walks Karyon status pages found via Edda

Page 99: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

Hystrix Circuit Breaker: Fail Fast -> recover fast

Page 100: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

Hystrix Circuit Breaker State Flow
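The state flow on this slide is a diagram, so here is a simplified sketch of the closed, open and half-open states of a circuit breaker. It is an illustration of the general pattern and not Hystrix's implementation. In particular, tripping on a count of consecutive failures, the class and parameter names, and the injected clock are all assumptions.

```python
import time


class CircuitBreaker:
    """Minimal breaker: CLOSED -> OPEN on failures, HALF_OPEN after a timeout."""

    def __init__(self, failure_threshold=3, reset_timeout=5.0, clock=time.monotonic):
        self.state = "CLOSED"
        self._threshold = failure_threshold
        self._reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = 0.0

    def call(self, operation, fallback):
        if self.state == "OPEN":
            if self._clock() - self._opened_at >= self._reset_timeout:
                self.state = "HALF_OPEN"   # let one trial request through
            else:
                return fallback()          # fail fast without calling the dependency
        try:
            result = operation()
        except Exception:
            self._failures += 1
            if self.state == "HALF_OPEN" or self._failures >= self._threshold:
                self.state = "OPEN"
                self._opened_at = self._clock()
            return fallback()
        self._failures = 0
        self.state = "CLOSED"              # success closes the circuit again
        return result


def failing():
    raise RuntimeError("dependency down")


now = [0.0]  # fake clock so the example does not need to wait
breaker = CircuitBreaker(failure_threshold=2, reset_timeout=10, clock=lambda: now[0])
first = breaker.call(failing, lambda: "fallback")
second = breaker.call(failing, lambda: "fallback")
state_after_failures = breaker.state
while_open = breaker.call(lambda: "live", lambda: "fallback")
now[0] = 11.0
recovered = breaker.call(lambda: "live", lambda: "fallback")
print(state_after_failures, while_open, recovered, breaker.state)
```

While the breaker is open, callers get the fallback immediately, which is the "fail fast" half of the slide title. The trial request after the timeout is the "recover fast" half.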

Page 101: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

Turbine Dashboard
Per Second Update Circuit Breakers in a Web Browser

Page 102: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

Developer Productivity

Page 103: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

Blitz4J – Non-blocking Logging

• Better handling of log messages during storms
• Replace sync with concurrent data structures
• Extreme configurability
• Isolation of app threads from logging threads
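The last bullet, isolating application threads from logging threads, can be sketched with a bounded queue and a single writer thread. This is a generic illustration, not Blitz4J's design. The class `AsyncLogger`, the queue capacity and the choice to drop messages when the queue is full are assumptions.

```python
import queue
import threading

_STOP = object()  # sentinel that tells the writer thread to exit


class AsyncLogger:
    """Callers enqueue messages; one background thread does the slow writing."""

    def __init__(self, sink, capacity=1000):
        self._queue = queue.Queue(maxsize=capacity)
        self._sink = sink
        self.dropped = 0  # approximate count; not synchronized across callers
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def log(self, message):
        try:
            self._queue.put_nowait(message)  # never blocks the calling thread
        except queue.Full:
            self.dropped += 1                # shed load during a log storm

    def _drain(self):
        while True:
            message = self._queue.get()
            if message is _STOP:
                break
            self._sink(message)

    def close(self):
        self._queue.put(_STOP)
        self._worker.join()


lines = []
logger = AsyncLogger(lines.append, capacity=10)
for i in range(5):
    logger.log(f"event {i}")
logger.close()
print(lines, logger.dropped)
```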

Page 104: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

JVM Garbage Collection issues? GCViz!

• Convenient
• Visual
• Causation
• Clarity
• Iterative

Page 105: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

Pytheas – OSS based tooling framework

• Guice

• Jersey

• FreeMarker

• JQuery

• DataTables

• D3

• JQuery-UI

• Bootstrap

Page 106: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

RxJava - Functional Reactive Programming

• A Simpler Approach to Concurrency
  – Use Observable as a simple stable composable abstraction

• Observable Service Layer enables any of
  – conditionally return immediately from a cache
  – block instead of using threads if resources are constrained
  – use multiple threads
  – use non-blocking IO
  – migrate an underlying implementation from network based to in-memory cache

Page 107: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

Big Data and Analytics

Page 108: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

Hadoop jobs - Genie

Page 109: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

Lipstick - Visualization for Pig queries

Page 110: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

Suro Event Pipeline

1.5 Million events/s
80 Billion events/day

Cloud native, dynamic, configurable offline and realtime data sinks

Error rate alerting

Page 111: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

Putting it all together…

Page 112: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

Sample Application – RSS Reader

Page 113: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

3rd Party Sample App by Chris Fregly
fluxcapacitor.com

Flux Capacitor is a Java-based reference app using:
• archaius (zookeeper-based dynamic configuration)
• astyanax (cassandra client)
• blitz4j (asynchronous logging)
• curator (zookeeper client)
• eureka (discovery service)
• exhibitor (zookeeper administration)
• governator (guice-based DI extensions)
• hystrix (circuit breaker)
• karyon (common base web service)
• ribbon (eureka-based REST client)
• servo (metrics client)
• turbine (metrics aggregation)

Flux also integrates popular open source tools such as Graphite, Jersey, Jetty, Netty, and Tomcat.

Page 114: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

3rd party Sample App by IBM
https://github.com/aspyker/acmeair-netflix/

Page 115: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

NetflixOSS Project Categories

Page 116: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

Github NetflixOSS Source

AWS Base AMI

Maven Central

Cloudbees Jenkins

Aminator Bakery

Dynaslave AWS Build Slaves

Asgard (+ Frigga) Console

AWS Baked AMIs

Glisten Workflow DSL

AWS Account

NetflixOSS Continuous Build and Deployment

Page 117: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

AWS Account

Asgard Console

Multiple AWS Regions

Eureka Registry

3 AWS Zones

Application Clusters
Autoscale Groups

Instances

NetflixOSS Services Scope

Page 118: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

NetflixOSS Instance Libraries

Initialization
• Baked AMI – Tomcat, Apache, your code
• Governator – Guice based dependency injection
• Archaius – dynamic configuration properties client
• Eureka - service registration client

Service Requests
• Karyon - Base Server for inbound requests
• RxJava – Reactive pattern
• Hystrix/Turbine – dependencies and real-time status
• Ribbon and Feign - REST Clients for outbound calls

Data Access
• Astyanax – Cassandra client and pattern library
• Evcache – Zone aware Memcached client
• Curator – Zookeeper patterns
• Denominator – DNS routing abstraction

Logging
• Blitz4j – non-blocking logging
• Servo – metrics export for autoscaling
• Atlas – high volume instrumentation

Page 119: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

NetflixOSS Testing and Automation

Test Tools
• CassJmeter – Load testing for Cassandra
• Circus Monkey – Test account reservation rebalancing

Maintenance
• Janitor Monkey – Cleans up unused resources
• Efficiency Monkey
• Doctor Monkey
• Howler Monkey – Complains about AWS limits

Availability
• Chaos Monkey – Kills Instances
• Chaos Gorilla – Kills Availability Zones
• Chaos Kong – Kills Regions
• Latency Monkey – Latency and error injection

Security
• Conformity Monkey – architectural pattern warnings
• Security Monkey – security group and S3 bucket permissions

Page 120: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

Vendor Driven Portability
Interest in using NetflixOSS for Enterprise Private Clouds

“It’s done when it runs Asgard”
Functionally complete
Demonstrated March 2013
Released June 2013 in V3.3

Vendor and end user interest
Openstack “Heat” getting there
Paypal C3 Console based on Asgard

IBM Example application “Acme Air”
Based on NetflixOSS running on AWS
Ported to IBM Softlayer with Rightscale

Page 121: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

Some of the companies using NetflixOSS
(There are many more, please send us your logo!)

Page 122: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

Use NetflixOSS to scale your startup or enterprise

Contribute to existing github projects and add your own

Page 123: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

Resilient API Patterns

Switch to Ben’s Slides

Page 124: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

Availability

Is it running yet?
How many places is it running in?
How far apart are those places?

Page 125: Yow Conference Dec 2013 Netflix Workshop Slides with Notes
Page 126: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

Netflix Outages

• Running very fast with scissors
  – Mostly self inflicted – bugs, mistakes from pace of change
  – Some caused by AWS bugs and mistakes

• Incident Life-cycle Management by Platform Team
  – No runbooks, no operational changes by the SREs
  – Tools to identify what broke and call the right developer

• Next step is multi-region active/active
  – Investigating and building in stages during 2013
  – Could have prevented some of our 2012 outages

Page 127: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

Incidents – Impact and Mitigation

PR: X Incidents (Public Relations Media Impact)
CS: XX Incidents (High Customer Service Calls)
Metrics impact – Feature disable: XXX Incidents (Affects AB Test Results)
No Impact – fast retry or automated failover: XXXX Incidents

Y incidents mitigated by Active Active, game day practicing
YY incidents mitigated by better tools and practices
YYY incidents mitigated by better data tagging

Page 128: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

Real Web Server Dependencies Flow
(Netflix Home page business transaction as seen by AppDynamics)

Start Here

memcached

Cassandra

Web service

S3 bucket

Personalization movie group choosers (for US, Canada and Latam)

Each icon is three to a few hundred instances across three AWS zones

Page 129: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

Three Balanced Availability Zones
Test with Chaos Gorilla

Cassandra and Evcache Replicas
Zone A

Cassandra and Evcache Replicas
Zone B

Cassandra and Evcache Replicas
Zone C

Load Balancers

Chaos Gorilla

Page 130: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

Isolated Regions

Cassandra Replicas

Zone A

Cassandra Replicas

Zone B

Cassandra Replicas

Zone C

US-East Load Balancers

Cassandra Replicas

Zone A

Cassandra Replicas

Zone B

Cassandra Replicas

Zone C

EU-West Load Balancers

Page 131: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

Highly Available NoSQL Storage

A highly scalable, available and durable deployment pattern based

on Apache Cassandra

Page 132: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

Single Function Micro-Service Pattern
One keyspace, replaces a single table or materialized view

Single function Cassandra Cluster Managed by Priam
Between 6 and 288 nodes

Stateless Data Access REST Service
Astyanax Cassandra Client

Optional Datacenter Update Flow

Many Different Single-Function REST Clients

Each icon represents a horizontally scaled service of three to hundreds of instances deployed over three availability zones

Over 60 Cassandra clusters
Over 2000 nodes
Over 300TB data
Over 1M writes/s/cluster

Page 133: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

Stateless Micro-Service Architecture

Linux Base AMI (CentOS or Ubuntu)

Optional Apache

frontend, memcached, non-java apps

Java (JDK 6 or 7)

Java monitoring

Tomcat

Application war file, base servlet, platform, client interface jars, Astyanax

Page 134: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

Cassandra Instance Architecture

Linux Base AMI (CentOS or Ubuntu)

Tomcat and Priam on JDK

Healthcheck, Status

Java (JDK 7)

Java monitoring

Cassandra Server

Local Ephemeral Disk Space – 2TB of SSD or 1.6TB disk holding Commit log and SSTables

Page 135: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

Apache Cassandra

• Scalable and Stable in large deployments
  – No additional license cost for large scale!
  – Optimized for “OLTP” vs. Hbase optimized for “DSS”

• Available during Partition (AP from CAP)
  – Hinted handoff repairs most transient issues
  – Read-repair and periodic repair keep it clean

• Quorum and Client Generated Timestamp
  – Read after write consistency with 2 of 3 copies
  – Latest version includes Paxos for stronger transactions
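The "2 of 3 copies" claim is the usual quorum overlap rule: with N replicas, a read of R copies is guaranteed to include at least one copy touched by a write of W copies whenever R + W > N. A short worked check, with hypothetical function names:

```python
def quorum(replication_factor):
    """Smallest majority of the replicas, e.g. 2 of 3."""
    return replication_factor // 2 + 1


def read_sees_latest_write(n, w, r):
    """True when every read set of size r must overlap every write set of size w."""
    return r + w > n


n = 3
q = quorum(n)
print(q, read_sees_latest_write(n, q, q))   # quorum writes and quorum reads
print(read_sees_latest_write(n, 1, 1))      # single-copy reads and writes
```

With N = 3, quorum writes and quorum reads give 2 + 2 > 3, so they overlap. Writing and reading a single copy gives 1 + 1 = 2, which does not exceed 3, so a read may miss the latest write until repair catches up.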

Page 136: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

Astyanax - Cassandra Write Data FlowsSingle Region, Multiple Availability Zone, Token Aware

Token Aware Clients

1. Client writes to local coordinator

2. Coordinator writes to other zones

3. Nodes return ack

4. Data written to internal commit log disks (no more than 10 seconds later)

If a node goes offline, hinted handoff completes the write when the node comes back up.

Requests can choose to wait for one node, a quorum, or all nodes to ack the write

SSTable disk writes and compactions occur asynchronously
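The choice between waiting for one node, a quorum, or all nodes can be sketched as picking the k-th fastest ack. This is a simplified model, not Astyanax or Cassandra code; the level names mirror the slide wording and the ack times are hypothetical.

```python
def acks_required(level: str, replicas: int) -> int:
    """Number of replica acks a write waits for at the given level."""
    if level == "ONE":
        return 1
    if level == "QUORUM":
        return replicas // 2 + 1
    if level == "ALL":
        return replicas
    raise ValueError(f"unknown consistency level: {level}")


def write_latency_ms(level: str, ack_latencies_ms: list[float]) -> float:
    """Time until enough acks have arrived for the write to return."""
    needed = acks_required(level, len(ack_latencies_ms))
    return sorted(ack_latencies_ms)[needed - 1]


# Hypothetical ack times (ms) for three replicas, one per availability zone.
acks = [1.0, 2.5, 40.0]
for level in ("ONE", "QUORUM", "ALL"):
    print(level, write_latency_ms(level, acks))
```

In this model a single slow replica only hurts requests that wait for all nodes, which is one reason quorum is a common middle ground.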


Page 137: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

Data Flows for Multi-Region Writes

Token Aware, Consistency Level = Local Quorum

US Clients

1. Client writes to local replicas

2. Local write acks returned to Client which continues when 2 of 3 local nodes are committed

3. Local coordinator writes to remote coordinator.

4. When data arrives, remote coordinator node acks and copies to other remote zones

5. Remote nodes ack to local coordinator

6. Data flushed to internal commit log disks (no more than 10 seconds later)

If a node or region goes offline, hinted handoff completes the write when the node comes back up.

Nightly global compare and repair jobs ensure everything stays consistent.

EU Clients


100+ms latency

Page 138: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

Cassandra at Scale

Benchmarking to Retire Risk

More?

Page 139: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

Scalability from 48 to 288 nodes on AWS

http://techblog.netflix.com/2011/11/benchmarking-cassandra-scalability-on.html

[Chart: Client Writes/s by node count – Replication Factor = 3. X axis: node count (0 to 350); Y axis: client writes/s (0 to 1,200,000). Data labels in ascending order: 174373, 366828, 537172, 1099837]

Used 288 of m1.xlarge

4 CPU, 15 GB RAM, 8 ECU

Cassandra 0.8.6

Benchmark config only existed for about 1hr
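A quick way to read the benchmark is to compare per-node throughput at the two ends of the range. The sketch below assumes the smallest and largest data labels on the chart (174373 and 1099837 writes/s) belong to the 48-node and 288-node runs named in the slide title; the two middle labels are left out because the slide text does not state their node counts.

```python
# Assumed pairing of chart labels with the node counts from the slide title.
runs = {48: 174_373, 288: 1_099_837}   # node count -> client writes/s

per_node = {nodes: writes / nodes for nodes, writes in runs.items()}
node_ratio = 288 / 48
throughput_ratio = runs[288] / runs[48]

for nodes, rate in per_node.items():
    print(f"{nodes} nodes: {rate:.0f} writes/s per node")
print(f"nodes grew {node_ratio:.1f}x, throughput grew {throughput_ratio:.2f}x")
```

If the throughput ratio is at least as large as the node ratio, per-node throughput did not degrade as the cluster grew, which is what "linear scalability" means here.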

2011

Page 140: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

Cassandra Disk vs. SSD Benchmark

Same Throughput, Lower Latency, Half Cost

http://techblog.netflix.com/2012/07/benchmarking-high-performance-io-with.html

Load Test Driver

REST service

36x m2.xlarge EVcache

48x m2.4xlarge Cassandra

REST service

15x hi1.4xlarge Cassandra

Load Generation

Application

Memcached

Cassandra

2012

Page 141: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

2013 - Cross Region Use Cases

• Geographic Isolation
  – US to Europe replication of subscriber data
  – Read intensive, low update rate
  – Production use since late 2011

• Redundancy for regional failover
  – US East to US West replication of everything
  – Includes write intensive data, high update rate
  – Testing now

Page 142: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

Benchmarking Global Cassandra

Write intensive test of cross region replication capacity

16 x hi1.4xlarge SSD nodes per zone = 96 total

192 TB of SSD in six locations up and running Cassandra in 20 minutes
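The totals on this slide can be checked with simple arithmetic. The sketch below takes the two regions and three zones per region from the diagram labels on this slide; the variable names are illustrative.

```python
nodes_per_zone = 16
zones_per_region = 3      # Zone A, Zone B, Zone C
regions = 2               # US-West-2 (Oregon) and US-East-1 (Virginia)
total_ssd_tb = 192

locations = zones_per_region * regions
total_nodes = nodes_per_zone * locations
ssd_per_node_tb = total_ssd_tb / total_nodes

print(locations, "locations,", total_nodes, "nodes,", ssd_per_node_tb, "TB of SSD per node")
```

The result of 2 TB per node matches the "2TB of SSD" local ephemeral disk figure given on the Cassandra Instance Architecture slide.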

Cassandra Replicas

Zone A

Cassandra Replicas

Zone B

Cassandra Replicas

Zone C

US-West-2 Region - Oregon

Cassandra Replicas

Zone A

Cassandra Replicas

Zone B

Cassandra Replicas

Zone C

US-East-1 Region - Virginia

Test Load

Test Load

Validation Load

Inter-Zone Traffic

1 Million writes

CL.ONE (wait for one replica to ack)

1 Million reads after 500ms

CL.ONE with no data loss

Inter-Region Traffic

Up to 9Gbits/s, 83ms

18TB backups from S3

Page 143: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

Copying 18TB from East to West

Cassandra bootstrap 9.3 Gbit/s single threaded 48 nodes to 48 nodes

Thanks to boundary.com for these network analysis plots
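As a rough sanity check, the two figures on this slide give a lower bound on the copy time. This sketch assumes decimal units (1 TB = 10^12 bytes) and that 9.3 Gbit/s was sustained for the whole transfer, which the slide does not claim; the real copy would take at least this long.

```python
data_tb = 18
rate_gbit_s = 9.3

data_bits = data_tb * 10**12 * 8            # terabytes -> bits
seconds = data_bits / (rate_gbit_s * 10**9)
hours = seconds / 3600

print(f"at least {hours:.1f} hours at a sustained {rate_gbit_s} Gbit/s")
```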

Page 144: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

Inter Region Traffic Test

Verified at desired capacity, no problems, 339 MB/s, 83ms latency

Page 145: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

Ramp Up Load Until It Breaks!

Unmodified tuning, dropping client data at 1.93GB/s inter region traffic

Spare CPU, IOPS, Network, just need some Cassandra tuning for more

Page 146: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

Failure Modes and Effects

Failure Mode          Probability   Current Mitigation Plan
Application Failure   High          Automatic degraded response
AWS Region Failure    Low           Active-Active multi-region deployment
AWS Zone Failure      Medium        Continue to run on 2 out of 3 zones
Datacenter Failure    Medium        Migrate more functions to cloud
Data store failure    Low           Restore from S3 backups
S3 failure            Low           Restore from remote archive

Until we got really good at mitigating high and medium probability failures, the ROI for mitigating regional failures didn’t make sense. Getting there…

Page 147: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

Cloud Security

Fine grain security rather than perimeter

Leveraging AWS Scale to resist DDOS attacks

Automated attack surface monitoring and testing

http://www.slideshare.net/jason_chan/resilience-and-security-scale-lessons-learned

Page 148: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

Security Architecture

• Instance Level Security baked into base AMI
  – Login: ssh only allowed via portal (not between instances)
  – Each app type runs as its own userid app{test|prod}

• AWS Security, Identity and Access Management
  – Each app has its own security group (firewall ports)
  – Fine grain user roles and resource ACLs

• Key Management
  – AWS Keys dynamically provisioned, easy updates
  – High grade app specific key management using HSM

Page 149: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

Cost-Aware Cloud Architectures

Based on slides jointly developed with Jinesh Varia

@jinman, Technology Evangelist

Page 150: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

« Want to increase innovation? Lower the cost of failure »

Joi Ito

Page 151: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

Go Global in Minutes

Page 152: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

Netflix Examples

• European Launch using AWS Ireland
  – No employees in Ireland, no provisioning delay, everything worked
  – No need to do detailed capacity planning
  – Over-provisioned on day 1, shrunk to fit after a few days
  – Capacity grows as needed for additional country launches

• Brazilian Proxy Experiment
  – No employees in Brazil, no “meetings with IT”
  – Deployed instances into two zones in AWS Brazil
  – Experimented with network proxy optimization
  – Decided that gain wasn’t enough, shut everything down

Page 153: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

Product Launch Agility - Rightsized

[Chart: cost ($) across launch phases: Pre-Launch Build-out, Testing, Launch, Growth, Growth. Series: Demand, Cloud, Datacenter]

Page 154: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

Product Launch - Under-estimated

[Chart: launch phases: Pre-Launch Build-out, Testing, Launch, Growth, Growth]

Page 155: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

Product Launch Agility – Over-estimated

[Chart: cost ($) across launch phases: Pre-Launch Build-out, Testing, Launch, Growth, Growth]

Page 156: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

Return on Agility = Grow Faster, Less Waste… Profit!

Page 157: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

#1 Business Agility by Rapid Experimentation = Profit

Key Takeaways on Cost-Aware Architectures….

Page 158: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

When you turn off your cloud resources, you actually stop paying for them

Page 159: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

[Chart: Weekly CPU Load. X axis: Week (1 to 49); Y axis: Web Servers. Annotations: Optimize during a year; 50% Savings]

Page 160: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

[Chart axes: Business Throughput; Instances]

Page 161: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

50%+ Cost Saving

Scale up/down by 70%+

Move to Load-Based Scaling
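The saving from load-based scaling comes from paying for instance-hours that track demand rather than the daily peak. The sketch below is purely illustrative: the demand profile and the per-instance capacity are hypothetical numbers, not Netflix data, so the resulting percentage is only an example of the calculation.

```python
import math

CAPACITY_PER_INSTANCE = 100  # requests/s one instance can serve (hypothetical)

# Hypothetical hourly demand in requests/s over one day.
hourly_demand = [300] * 10 + [600] * 8 + [1000] * 6


def instances_needed(demand: float, capacity: float = CAPACITY_PER_INSTANCE) -> int:
    """Instances required to serve the demand, never fewer than one."""
    return max(1, math.ceil(demand / capacity))


# Fixed fleet sized for the peak hour versus a fleet resized every hour.
fixed_hours = instances_needed(max(hourly_demand)) * len(hourly_demand)
scaled_hours = sum(instances_needed(d) for d in hourly_demand)
savings = 1 - scaled_hours / fixed_hours

print(fixed_hours, "instance-hours fixed,", scaled_hours, "scaled,", f"{savings:.1%} saved")
```

The deeper and longer the daily trough relative to the peak, the larger the saving.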

Page 162: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

Pay as you go

Page 163: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

AWS Support – Trusted Advisor – Your personal cloud assistant

Page 164: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

Other simple optimization tips

• Don’t forget to…
  – Disassociate unused EIPs
  – Delete unassociated Amazon EBS volumes
  – Delete older Amazon EBS snapshots
  – Leverage Amazon S3 Object Expiration

Janitor Monkey cleans up unused resources

Page 165: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

#1 Business Agility by Rapid Experimentation = Profit

#2 Business-driven Auto Scaling Architectures = Savings

Building Cost-Aware Cloud Architectures

Page 166: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

When Comparing TCO…

Page 167: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

When Comparing TCO…

Make sure that you take all the cost factors into consideration

Place, Power, Pipes, People, Patterns

Page 168: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

Save more when you reserve

On-demand Instances

• Pay as you go
• Starts from $0.02/Hour

Reserved Instances

• One time low upfront fee + Pay as you go
• $23 for 1 year term and $0.01/Hour

1-year and 3-year terms

Light Utilization RI

Medium Utilization RI

Heavy Utilization RI
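Using the example prices on this slide ($0.02/hour on-demand versus $23 upfront plus $0.01/hour reserved), the break-even point for a 1-year reservation can be worked out directly. A minimal sketch, working in cents to avoid floating-point rounding; the 8760 hours per year is an assumption of a non-leap year.

```python
ON_DEMAND_CENTS_PER_HOUR = 2
RESERVED_UPFRONT_CENTS = 2300
RESERVED_CENTS_PER_HOUR = 1
HOURS_PER_YEAR = 365 * 24   # assumes a non-leap year


def cost_cents(hours: int, reserved: bool) -> int:
    """Total cost in cents of running one instance for the given hours."""
    if reserved:
        return RESERVED_UPFRONT_CENTS + RESERVED_CENTS_PER_HOUR * hours
    return ON_DEMAND_CENTS_PER_HOUR * hours


# The upfront fee is repaid by the hourly saving of one cent per hour.
break_even_hours = RESERVED_UPFRONT_CENTS // (
    ON_DEMAND_CENTS_PER_HOUR - RESERVED_CENTS_PER_HOUR
)
break_even_utilization = break_even_hours / HOURS_PER_YEAR

print(break_even_hours, "hours,", f"{break_even_utilization:.0%} of a year")
```

Below that many hours of use per year on-demand is cheaper; above it the reservation wins.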

Page 169: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

Utilization (Uptime)                  Ideal For                              Savings over On-Demand
10% - 40% (>3.5 < 5.5 months/year)    Disaster Recovery (Lowest Upfront)     56%
40% - 75% (>5.5 < 7 months/year)      Standard Reserved Capacity             66%
>75% (>7 months/year)                 Baseline Servers (Lowest Total Cost)   71%

Break-even point

Reserved Instances

• One time low upfront fee + Pay as you go
• $23 for 1 year term and $0.01/Hour

1-year and 3-year terms

Light Utilization RI

Medium Utilization RI

Heavy Utilization RI

Page 170: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

Mix and Match Reserved Types and On-Demand

[Chart: Instances (0 to 12) by Days of Month (1 to 28). Series: Heavy Utilization Reserved Instances; Light RI (four blocks); On-Demand]

Page 171: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

Netflix Concept for Regional Failover Capacity

West Coast

Light Reservations

Heavy Reservations

Normal Use

Failover Use

Page 172: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

#1 Business Agility by Rapid Experimentation = Profit

#2 Business-driven Auto Scaling Architectures = Savings

#3 Mix and Match Reserved Instances with On-Demand = Savings

Building Cost-Aware Cloud Architectures

Page 173: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

Variety of Applications and Environments

Every Application has….

• Production Fleet
• Dev Fleet
• Test Fleet
• Staging/QA
• Perf Fleet
• DR Site

Every Company has….

• Business App Fleet
• Marketing Site
• Intranet Site
• BI App
• Multiple Products
• Analytics

Page 174: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

Consolidated Billing: Single payer for a group of accounts

• One Bill for multiple accounts

• Easy Tracking of account charges (e.g., download CSV of cost data)

• Volume Discounts can be reached faster with combined usage

• Reserved Instances are shared across accounts (including RDS Reserved DBs)

Page 175: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

Over-Reserve the Production Environment

Production Env. Account: 100 Reserved

QA/Staging Env. Account: 0 Reserved

Perf Testing Env. Account: 0 Reserved

Development Env. Account: 0 Reserved

Storage Account: 0 Reserved

Total Capacity

Page 176: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

Consolidated Billing Borrows Unused Reservations

Production Env. Account: 68 Used

QA/Staging Env. Account: 10 Borrowed

Perf Testing Env. Account: 6 Borrowed

Development Env. Account: 12 Borrowed

Storage Account: 4 Borrowed

Total Capacity
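The numbers on this slide add up to the 100 reservations held by the production account. The sketch below models the borrowing as a simple hand-out of unused reservations; the function name and the first-come allocation order are assumptions for illustration and do not describe how AWS applies reservations when it computes the bill.

```python
def share_reservations(reserved: int, owner_used: int, requests: dict[str, int]) -> dict[str, int]:
    """Hand out the owner's unused reservations to other accounts in dict order."""
    unused = max(0, reserved - owner_used)
    borrowed = {}
    for account, wanted in requests.items():
        granted = min(wanted, unused)   # never grant more than is left
        borrowed[account] = granted
        unused -= granted
    return borrowed


# Usage figures taken from the slide.
requests = {"QA/Staging": 10, "Perf Testing": 6, "Development": 12, "Storage": 4}
borrowed = share_reservations(reserved=100, owner_used=68, requests=requests)
total_covered = 68 + sum(borrowed.values())

print(borrowed)
print("instances covered by reservations:", total_covered)
```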

Page 177: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

Consolidated Billing Advantages

• Production account is guaranteed to get burst capacity
  – Reservation is higher than normal usage level
  – Requests for more capacity always work up to reserved limit
  – Higher availability for handling unexpected peak demands

• No additional cost
  – Other lower priority accounts soak up unused reservations
  – Totals roll up in the monthly billing cycle

Page 178: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

#1 Business Agility by Rapid Experimentation = Profit

#2 Business-driven Auto Scaling Architectures = Savings

#3 Mix and Match Reserved Instances with On-Demand = Savings

#4 Consolidated Billing and Shared Reservations = Savings

Building Cost-Aware Cloud Architectures

Page 179: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

Continuous optimization in your architecture results in recurring savings as early as your next month’s bill

Page 180: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

Right-size your cloud: Use only what you need

• An instance type for every purpose

• Assess your memory & CPU requirements
  – Fit your application to the resource
  – Fit the resource to your application

• Only use a larger instance when needed

Page 181: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

Reserved Instance Marketplace

Buy a smaller term instance

Buy instance with different OS or type

Buy a Reserved instance in different region

Sell your unused Reserved Instance

Sell unwanted or over-bought capacity

Further reduce costs by optimizing

Page 182: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

Instance Type Optimization

Older m1 and m2 families

• Slower CPUs
• Higher response times
• Smaller caches (6MB)
• Oldest m1.xl 15GB/8ECU/48c
• Old m2.xl 17GB/6.5ECU/41c
• ~16 ECU/$/hr

Latest m3 family

• Faster CPUs
• Lower response times
• Bigger caches (20MB)
• Even faster for Java vs. ECU
• New m3.xl 15GB/13 ECU/50c
• 26 ECU/$/hr – 62% better!
• Java measured even higher
• Deploy fewer instances
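The ECU per dollar figures can be reproduced from the instance specs on the slide. This sketch assumes that "48c", "41c" and "50c" mean hourly prices in cents, and takes the slide's rounded "~16 ECU/$/hr" as the baseline for the "62% better" comparison.

```python
# (ECU, hourly price in cents) as read from the slide; the unit of the
# price is an assumption.
instances = {
    "m1.xl": (8, 48),
    "m2.xl": (6.5, 41),
    "m3.xl": (13, 50),
}

# ECU per dollar per hour: ECU divided by the price converted to dollars.
ecu_per_dollar = {name: ecu / (cents / 100) for name, (ecu, cents) in instances.items()}
for name, value in ecu_per_dollar.items():
    print(f"{name}: {value:.1f} ECU/$/hr")

baseline = 16   # the slide's rounded figure for the older families
improvement = ecu_per_dollar["m3.xl"] / baseline - 1
print(f"m3.xl vs ~{baseline} ECU/$/hr: {improvement:.1%} better")
```

Both older types land near 16 ECU per dollar, so the m3.xl at 26 ECU per dollar is roughly 62% better on this metric.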

Page 183: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

#1 Business Agility by Rapid Experimentation = Profit

#2 Business-driven Auto Scaling Architectures = Savings

#3 Mix and Match Reserved Instances with On-Demand = Savings

#4 Consolidated Billing and Shared Reservations = Savings

#5 Always-on Instance Type Optimization = Recurring Savings

Building Cost-Aware Cloud Architectures

Page 184: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

Follow the Customer (Run web servers) during the day

Follow the Money (Run Hadoop clusters) at night

[Chart: No. of Instances Running (0 to 16) over a Week (Mon to Sun). Series: Auto Scaling Servers; Hadoop Servers; No. of Reserved Instances]

Page 185: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

Soaking up unused reservations

The number of unused reserved instances is published as a metric

Netflix Data Science ETL Workload

• Daily business metrics roll-up
• Starts after midnight
• EMR clusters started using hundreds of instances

Netflix Movie Encoding Workload

• Long queue of high and low priority encoding jobs
• Can soak up 1000’s of additional unused instances

Page 186: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

#1 Business Agility by Rapid Experimentation = Profit

#2 Business-driven Auto Scaling Architectures = Savings

#3 Mix and Match Reserved Instances with On-Demand = Savings

#4 Consolidated Billing and Shared Reservations = Savings

#5 Always-on Instance Type Optimization = Recurring Savings

Building Cost-Aware Cloud Architectures

#6 Follow the Customer (Run web servers) during the day Follow the Money (Run Hadoop clusters) at night

Page 187: Yow Conference Dec 2013 Netflix Workshop Slides with Notes

Takeaways

Cloud Native Manages Scale and Complexity at Speed

NetflixOSS makes it easier for everyone to become Cloud Native

Rethink deployments and turn things off to save money!

http://netflix.github.com

http://techblog.netflix.com

http://slideshare.net/Netflix

http://www.linkedin.com/in/adriancockcroft

@adrianco @NetflixOSS @benjchristensen