Dynamic Scaling and Redundancy
in the Cloud
Coburn Watson
Manager, Cloud Performance, Netflix
IEEE SCV 10/09/2013
Netflix, Inc.
• World's leading internet television network
• ~ 38 Million subscribers in 40+ countries
• Over a billion hours streamed per month
• Approximately 33% of all US Internet traffic at night
• Recent Notables
• Increased originals catalog
• Large open source contribution
• OpenConnect (homegrown CDN)
About Me
• Manage Cloud Performance Engineering Team
• Sub-team of Cloud Solutions Organization
• Focus on performance since 2000
• Large-scale billing applications, eCommerce, datacenter mgmt., etc.
• Genentech, McKesson, Amdocs, Mercury Int., HP, etc.
• Passion for tackling performance at cloud-scale
• Looking for great performance engineers
• cwatson@netflix.com
Performance on the Cloud
http://onebigphoto.com/castle-in-germany-floating-above-the-clouds/
The balancing act
• Performance
  • Instance type selection
  • Tuning of “Base AMI”
  • Aggressive caching model (memcached)
• Reliability
  • Redundancy
  • Over-provision to absorb failures
  • Focus on OSS vs. COTS
  • Select architectures (C*) with strong redundancy
• Scalability
  • Support “thundering herd” scenarios
  • Stateless service model enables rapid increases in capacity
Instance type selection
• Focus on horizontal vs. vertical scaling model
  • Many moderate systems vs. a few powerhouses
• One third of instances in each of 3 Availability Zones*
• Large-memory apps reduce flexibility of choice
  • m2.2xl = 4 cores, 36 GB RAM
  • m3.2xl = 8 cores, 30 GB RAM
  • 2x the CPU for only ~25% more cost
* Availability Zone ≈ datacenter
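The cost trade-off can be made concrete with a quick cost-per-core calculation. A minimal sketch — the hourly prices below are hypothetical placeholders chosen to match the slide's "2x CPU for ~25% more $", not actual 2013 EC2 pricing:

```python
def cost_per_core(hourly_price, cores):
    """Normalize instance cost by CPU capacity ($/core-hour)."""
    return hourly_price / cores

# Hypothetical prices illustrating the slide's ratio: 25% more money, 2x cores.
m2_2xl = cost_per_core(hourly_price=0.80, cores=4)  # 0.20  $/core-hour
m3_2xl = cost_per_core(hourly_price=1.00, cores=8)  # 0.125 $/core-hour
```

Per core, the m3.2xl works out roughly 37% cheaper despite the higher sticker price, which is why CPU-bound services favor it, while the m2.2xl's larger memory keeps it relevant for memory-bound apps.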
Instance type selection, cont…
• IO capabilities are qualified vs. quantified… measure it!
[chart: measured throughput, ~ 1000 Mbps vs. ~ 700 Mbps]
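"Measure it" can be as simple as timing a bulk write and reporting megabits per second. A minimal disk-throughput sketch — the function name and sizes are illustrative, not a tool referenced in the talk:

```python
import os
import time

def write_throughput_mbps(path, size_mb=256):
    """Write size_mb of data, fsync, and report throughput in Mbps."""
    buf = os.urandom(1024 * 1024)  # 1 MB of incompressible data
    start = time.monotonic()
    with open(path, "wb") as f:
        for _ in range(size_mb):
            f.write(buf)
        f.flush()
        os.fsync(f.fileno())  # force data to disk before stopping the clock
    elapsed = time.monotonic() - start
    os.remove(path)
    return size_mb * 8 / elapsed  # megabits per second
```

Network throughput between instances can be quantified the same way with a tool such as iperf, run from instance to instance within and across Availability Zones.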
Base AMI Tuning
• “Base AMI” – common Linux image
  • OS, middleware, system utilities
• Tens of thousands of instances run the same Base AMI
• Provides global tuning opportunities
  • Example: CFS tuning to improve batch throughput
  • kernel.sched_latency_ns=48000000 # default: 24000000 (24 ms)
  • kernel.sched_min_granularity_ns=6000000 # default: 3000000 (3 ms)
  • Above tunables provided a 2-5% improvement in throughput
• Can be applied during the “baking” process, which produces a new AMI with the application layered on top
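The CFS tunables above can be captured in a sysctl drop-in file so they are baked into the Base AMI. A minimal sketch — the file path is an assumption; the values are the ones from the slide, each double its default:

```
# /etc/sysctl.d/90-cfs-batch.conf — applied at boot (or via `sysctl --system`)
# Doubling the CFS latency target reduces context switches, trading a
# little interactive latency for better batch throughput.
kernel.sched_latency_ns = 48000000        # default: 24000000 (24 ms)
kernel.sched_min_granularity_ns = 6000000 # default: 3000000 (3 ms)
```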
Maximizing Redundancy
Fear (Revere) the Monkeys
• Simulate • Latency
• Errors
• Initiate
• Instance Termination
• Availability Zone Failure
• Identify
• Configuration Drift
… in Test and Production
Hystrix: Defend Your App
● Protection from downstream service failures
● Failures may be functional (service unavailable) or performance-related (slow responses)
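Hystrix itself is a Java library; to illustrate the circuit-breaker pattern it implements, here is a minimal sketch in Python — all names are hypothetical, not the Hystrix API:

```python
import time

class CircuitBreaker:
    """Minimal circuit-breaker sketch: trip open after repeated
    failures, serve a fallback while open, retry after a cooldown."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None => circuit closed

    def call(self, fn, fallback):
        # While open, serve the fallback until the cooldown expires.
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                return fallback()
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # trip the breaker
            return fallback()
        self.failures = 0  # success closes the circuit again
        return result
```

Wrapping each downstream call this way lets an edge service degrade gracefully — returning cached or default data — instead of letting one slow dependency cascade into a site-wide outage.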
Maximizing: Scalability and
Performance
Dynamic Scaling
EC2 footprint autoscales by 2500-3500 instances per day
• On the order of tens of thousands of EC2 instances overall
• Largest ASG spans 200-1000 m2.4xlarge instances daily
  • m2.4xlarge: 8 cores, 64 GB RAM, 1.7 TB disk
Why:
• Improved scalability during unexpected workloads
• Absorbs variance in service performance profiles
• Reactive chain of dependencies
• Creates “reserved instance troughs” for batch activity
Dynamic Scaling, cont.
Example covers 3 services • 2 edge (A,B), 1 mid-tier (C)
• C has more upstream services
than simply A and B
Multiple Autoscaling Policies • (A) System Load Average
• (B,C) Request-Rate based
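A request-rate policy like the one used for B and C boils down to a target-tracking calculation: size the group so each instance carries a target request rate. A minimal sketch — the threshold and names are illustrative, not Netflix's actual policy:

```python
import math

def desired_instances(current_instances, requests_per_sec,
                      target_rps_per_instance=1000,
                      min_instances=3, max_instances=1000):
    """Target-tracking sketch: size the ASG so each instance
    serves roughly target_rps_per_instance requests/sec."""
    needed = math.ceil(requests_per_sec / target_rps_per_instance)
    # Clamp to the group's configured bounds.
    return max(min_instances, min(max_instances, needed))
```

The floor of 3 mirrors the one-third-per-Availability-Zone redundancy model: even at idle, the group keeps an instance in each of three zones.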
Dynamic Scaling, cont.
• Response time variability greatest during scaling events
• Average response time primarily between 75 and 150 msec
Dynamic Scaling, cont.
• Instance counts 3x, Aggregate requests 4.5x (not shown)
• Average CPU utilization per instance: ~25-55%
Cassandra Performance
Study performed:
• 24-node C* SSD-based cluster (hi1.4xlarge)
• Mid-tier service load application
• Targeting 2x production rates
• Increased read ops from 30k to 70k in ~ 3 minutes
• Increased write ops from 750 to 1500 in ~ 3 minutes
Results:
• 95th pctl response time increase: ~ 17 msec to ~ 45 msec
• 99th pctl response time increase: ~ 35 msec to ~ 80 msec
EVcache (memcached) Scalability
Response times consistent during 4x increase in load *
* Due to upstream code change
Takeaways
• Evolve architecture and processes to mitigate risks
• Factor redundancy requirements into scaling strategy
• Stateless micro-service architectures win!
• Benchmark everything…quantify variability
• On-demand Cloud makes benchmarking convenient
Netflix Open Source
Our Open Source Software simplifies management at scale
Great projects, stunning colleagues: jobs.netflix.com
Q&A
• cwatson@netflix.com
• Twitter: @coburnw
• LinkedIn: http://www.linkedin.com/in/coburnw/
• Netflix Tech Blog: http://techblog.netflix.com