Dynamic Scaling and Redundancy
in the Cloud
Coburn Watson
Manager, Cloud Performance, Netflix
IEEE SCV 10/09/2013
Netflix, Inc.
• World's leading internet television network
• ~ 38 Million subscribers in 40+ countries
• Over a billion hours streamed per month
• Approximately 33% of all US Internet traffic at night
• Recent Notables
• Increased originals catalog
• Large open source contribution
• OpenConnect (homegrown CDN)
About Me
• Manage Cloud Performance Engineering Team
• Sub-team of Cloud Solutions Organization
• Focus on performance since 2000
• Large-scale billing applications, eCommerce, datacenter mgmt., etc.
• Genentech, McKesson, Amdocs, Mercury Int., HP, etc.
• Passion for tackling performance at cloud-scale
• Looking for great performance engineers
• cwatson@netflix.com
Performance on the Cloud
http://onebigphoto.com/castle-in-germany-floating-above-the-clouds/
The balancing act
• Performance
  • Instance type selection
  • Tuning of “Base AMI”
  • Aggressive caching model (memcached)
• Reliability
  • Redundancy
  • Over-provision to absorb failures
  • Focus on OSS vs. COTS
  • Select architectures (C*) with strong redundancy
• Scalability
  • Support “thundering herd” scenarios
  • Stateless service model enables rapid increases in capacity
Instance type selection
• Focus on horizontal vs. vertical scaling model
  • Many moderate systems vs. a few powerhouses
• One third of instances in each of 3 Availability Zones*
• Large-memory apps reduce flexibility of choice
  • m2.2xl = 4 cores, 36 GB RAM
  • m3.2xl = 8 cores, 30 GB RAM
  • 2x the CPU for only ~25% more cost
* Availability Zone ≈ datacenter
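The cost trade-off can be made concrete with a quick cost-per-core calculation. A minimal sketch — the hourly prices below are hypothetical placeholders chosen to match the slide's "2x CPU for ~25% more $", not actual 2013 EC2 pricing:

```python
def cost_per_core(hourly_price, cores):
    """Normalize instance cost by CPU capacity ($/core-hour)."""
    return hourly_price / cores

# Hypothetical prices illustrating the slide's ratio: 25% more money, 2x cores.
m2_2xl = cost_per_core(hourly_price=0.80, cores=4)  # 0.20  $/core-hour
m3_2xl = cost_per_core(hourly_price=1.00, cores=8)  # 0.125 $/core-hour
```

Per core, the m3.2xl works out roughly 37% cheaper despite the higher sticker price, which is why CPU-bound services favor it, while the m2.2xl's larger memory keeps it relevant for memory-bound apps.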
Instance type selection, cont…
• IO capabilities are qualified vs. quantified… measure it!
[chart: measured throughput, ~ 1000 Mbps vs. ~ 700 Mbps]
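"Measure it" can be as simple as timing a bulk write and reporting megabits per second. A minimal disk-throughput sketch — the function name and sizes are illustrative, not a tool referenced in the talk:

```python
import os
import time

def write_throughput_mbps(path, size_mb=256):
    """Write size_mb of data, fsync, and report throughput in Mbps."""
    buf = os.urandom(1024 * 1024)  # 1 MB of incompressible data
    start = time.monotonic()
    with open(path, "wb") as f:
        for _ in range(size_mb):
            f.write(buf)
        f.flush()
        os.fsync(f.fileno())  # force data to disk before stopping the clock
    elapsed = time.monotonic() - start
    os.remove(path)
    return size_mb * 8 / elapsed  # megabits per second
```

Network throughput between instances can be quantified the same way with a tool such as iperf, run from instance to instance within and across Availability Zones.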
Base AMI Tuning
• “Base AMI” – common Linux image
  • OS, middleware, system utilities
• Tens of thousands of instances run the same Base AMI
• Provides global tuning opportunities
  • Example: CFS tuning to improve batch throughput
  • kernel.sched_latency_ns=48000000 # default: 24000000 (24 ms)
  • kernel.sched_min_granularity_ns=6000000 # default: 3000000 (3 ms)
  • Above tunables provided a 2-5% improvement in throughput
• Can be applied during the “baking” process, which produces a new AMI with the application layered on top
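The CFS tunables above can be captured in a sysctl drop-in file so they are baked into the Base AMI. A minimal sketch — the file path is an assumption; the values are the ones from the slide, each double its default:

```
# /etc/sysctl.d/90-cfs-batch.conf — applied at boot (or via `sysctl --system`)
# Doubling the CFS latency target reduces context switches, trading a
# little interactive latency for better batch throughput.
kernel.sched_latency_ns = 48000000        # default: 24000000 (24 ms)
kernel.sched_min_granularity_ns = 6000000 # default: 3000000 (3 ms)
```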
Maximizing Redundancy
Fear (Revere) the Monkeys
• Simulate • Latency
• Errors
• Initiate
• Instance Termination
• Availability Zone Failure
• Identify
• Configuration Drift
… in Test and Production
Hystrix: Defend Your App
● Protection from downstream service failures
● Failures may be functional (service unavailable) or performance-related (slow responses)
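Hystrix itself is a Java library; to illustrate the circuit-breaker pattern it implements, here is a minimal sketch in Python — all names are hypothetical, not the Hystrix API:

```python
import time

class CircuitBreaker:
    """Minimal circuit-breaker sketch: trip open after repeated
    failures, serve a fallback while open, retry after a cooldown."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None => circuit closed

    def call(self, fn, fallback):
        # While open, serve the fallback until the cooldown expires.
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                return fallback()
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # trip the breaker
            return fallback()
        self.failures = 0  # success closes the circuit again
        return result
```

Wrapping each downstream call this way lets an edge service degrade gracefully — returning cached or default data — instead of letting one slow dependency cascade into a site-wide outage.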
Maximizing: Scalability and
Performance
Dynamic Scaling
EC2 footprint autoscales by 2500-3500 instances per day
• On the order of tens of thousands of EC2 instances overall
• Largest ASG spans 200-1000 m2.4xlarge instances daily
  • m2.4xlarge: 8 cores, 64 GB RAM, 1.7 TB disk
Why:
• Improved scalability during unexpected workloads
• Absorbs variance in service performance profiles
• Reactive chain of dependencies
• Creates “reserved instance troughs” for batch activity
Dynamic Scaling, cont.
Example covers 3 services • 2 edge (A,B), 1 mid-tier (C)
• C has more upstream services
than simply A and B
Multiple Autoscaling Policies • (A) System Load Average
• (B,C) Request-Rate based
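A request-rate policy like the one used for B and C boils down to a target-tracking calculation: size the group so each instance carries a target request rate. A minimal sketch — the threshold and names are illustrative, not Netflix's actual policy:

```python
import math

def desired_instances(current_instances, requests_per_sec,
                      target_rps_per_instance=1000,
                      min_instances=3, max_instances=1000):
    """Target-tracking sketch: size the ASG so each instance
    serves roughly target_rps_per_instance requests/sec."""
    needed = math.ceil(requests_per_sec / target_rps_per_instance)
    # Clamp to the group's configured bounds.
    return max(min_instances, min(max_instances, needed))
```

The floor of 3 mirrors the one-third-per-Availability-Zone redundancy model: even at idle, the group keeps an instance in each of three zones.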
Dynamic Scaling, cont.
• Response time variability greatest during scaling events
• Average response time primarily between 75 and 150 msec
Dynamic Scaling, cont.
• Instance counts 3x, Aggregate requests 4.5x (not shown)
• Average CPU utilization per instance: ~25-55%
Cassandra Performance
Study performed:
• 24-node C* SSD-based cluster (hi1.4xlarge)
• Mid-tier service load application
• Targeting 2x production rates
• Increased read ops from 30k to 70k in ~ 3 minutes
• Increased write ops from 750 to 1500 in ~ 3 minutes
Results:
• 95th pctl response time increase: ~ 17 msec to ~ 45 msec
• 99th pctl response time increase: ~ 35 msec to ~ 80 msec
EVcache (memcached) Scalability
Response times consistent during 4x increase in load *
* Due to upstream code change
Takeaways
• Evolve architecture and processes to mitigate risks
• Factor redundancy requirements into scaling strategy
• Stateless micro-service architectures win!
• Benchmark everything…quantify variability
• On-demand Cloud makes benchmarking convenient
Netflix Open Source
Our Open Source Software simplifies management at scale
Great projects, stunning colleagues: jobs.netflix.com
Q&A
• cwatson@netflix.com
• Twitter: @coburnw
• LinkedIn: http://www.linkedin.com/in/coburnw/
• Netflix Tech Blog: http://techblog.netflix.com