Epidemic Failures

Cloud Native and Epidemic Failures April 2014 Adrian Cockcroft @adrianco @BatteryVentures http://www.linkedin.com/in/adriancockcroft

description

Slides originally written in April 2013 for a private conference and internal use at Netflix. Publishing now since Heartbleed is another example of an epidemic failure mode.

Transcript of Epidemic Failures

Page 1: Epidemic Failures

Cloud Native and Epidemic Failures

April 2014
Adrian Cockcroft

@adrianco @BatteryVentures
http://www.linkedin.com/in/adriancockcroft

Page 2: Epidemic Failures

Cloud Native?

Epidemic Failures

Automated Diversity

Page 3: Epidemic Failures

Cloud Native

Construct a highly agile and highly available service from ephemeral and often broken components

Page 4: Epidemic Failures

Inspiration

Page 5: Epidemic Failures

Numquam ponenda est pluralitas sine necessitate

Plurality must never be posited without necessity

Occam’s Razor

Page 6: Epidemic Failures

Monoculture

Replicate “the best” as patterns
Reduce interaction complexity
Epidemic single point of failure

Page 7: Epidemic Failures

Pattern Failures

Infrastructure Pattern Failures
Software Stack Pattern Failures
Application Pattern Failures

Page 8: Epidemic Failures

Infrastructure Pattern Failures

• Device failures – bad batch of disks, PSUs, etc.
• CPU failures – cache corruption, math errors
• Datacenter failures – power, network, disaster
• Routing failures – DNS, Internet/ISP path

Page 9: Epidemic Failures

Software Stack Pattern Failures

• Time bombs – Counter wrap, memory leak
• Date bombs – Leap year, leap second, epoch
• Expiration – Certs timing out
• Trust revocation – Certificate Authority fails
• Security exploit – e.g. Heartbleed
• Language bugs – compile time
• Runtime bugs – JVM, Linux, Hypervisor
• Network bugs – routers, firewalls, protocols

Page 10: Epidemic Failures

Application Pattern Failures

• Time bombs – Counter wrap, memory leak
• Date bombs – Leap year, leap second, epoch
• Content bombs – Data dependent failure
• Configuration – wrong/bad syntax
• Versioning – incompatible mixes
• Cascading failures – error handling bugs etc.
• Cascading overload – excessive logging etc.

Page 11: Epidemic Failures

What to do?

Automated diversity management
Diversified automation
Efficient vs. Antifragile

Page 12: Epidemic Failures

Specific Ideas

• Automate running a mixture
  – Diversity as default for any service stack
  – No developer overhead, stay agile, low cost

• Support oldest and newest versions together
  – Automate running 50/50 mix CentOS/Ubuntu
  – Mix versions of JDK, Tomcat, etc.

• Vendor diversity
  – Multiple DNS vendors, cloud regions, costs more
  – Multiple cloud vendors? Much higher cost.
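
One way to picture the 50/50 mix: a minimal R sketch that spreads a pool of instances evenly across two operating systems. The instance identifiers and the round-robin scheme are assumptions for illustration, not something the slides specify.

```r
# Hypothetical instance identifiers; real ones would come from the deployment tool.
instances <- sprintf("i-%03d", 1:10)
variants  <- c("centos", "ubuntu")

# rep_len cycles through the variants, so the per-variant counts
# differ by at most one whatever the pool size.
assignment <- data.frame(
  instance = instances,
  linux    = rep_len(variants, length(instances))
)

# Count of instances per variant.
mix <- table(assignment$linux)
print(mix)
```

The same pattern extends to more than two variants by lengthening `variants`.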

Page 13: Epidemic Failures

Generate Permutations

> epi <- data.frame(java=gl(2,1,8,c("java6","java7")),
+                   linux=gl(2,2,8,c("centos","ubuntu")),
+                   codeversion=gl(2,4,8,c("v34","v35")))
> epi
   java  linux codeversion
1 java6 centos         v34
2 java7 centos         v34
3 java6 ubuntu         v34
4 java7 ubuntu         v34
5 java6 centos         v35
6 java7 centos         v35
7 java6 ubuntu         v35
8 java7 ubuntu         v35
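
The gl() calls need the repeat counts worked out by hand for each factor. As a sketch of an alternative (not the approach used on the slide), base R's expand.grid builds the same full factorial from a list of levels, which may be easier to extend when components are added. The `permutation` label column is a hypothetical naming scheme.

```r
# Component levels, mirroring the slide.
components <- list(
  java        = c("java6", "java7"),
  linux       = c("centos", "ubuntu"),
  codeversion = c("v34", "v35")
)

# expand.grid varies the first factor fastest, so the row order
# matches the gl() construction on the slide.
epi2 <- expand.grid(components, stringsAsFactors = TRUE)

# A compact label per permutation, usable as an instance tag
# (assumed naming scheme, not from the slides).
epi2$permutation <- with(epi2, paste(java, linux, codeversion, sep = "-"))

print(epi2)
```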

Page 14: Epidemic Failures

Deployment

• Builds
  – Manual to test, automate if it works
  – Modify build to generate permutation AMIs
  – Modify Asgard to auto-deploy permutations

• Data collection
  – Tag each instance with its permutation
  – Gather metrics by permutation per instance
  – Do R-based Design of Experiments analysis
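
The data collection step could look roughly like the sketch below: one row per instance, tagged with its permutation, then a summary per permutation. All values are synthetic; the instance identifiers, the `response_ms` column name and the three-instances-per-permutation layout are assumptions for illustration.

```r
set.seed(1)  # synthetic data only, for illustration

# The eight permutations, with three hypothetical instances each.
design <- expand.grid(java = c("java6", "java7"),
                      linux = c("centos", "ubuntu"),
                      codeversion = c("v34", "v35"))
metrics <- design[rep(seq_len(nrow(design)), each = 3), ]
metrics$instance <- sprintf("i-%03d", seq_len(nrow(metrics)))

# Made-up response times standing in for real per-instance metrics.
metrics$response_ms <- rnorm(nrow(metrics), mean = 100, sd = 5)

# Mean response time per permutation.
by_perm <- aggregate(response_ms ~ java + linux + codeversion,
                     data = metrics, FUN = mean)
print(by_perm)
```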

Page 15: Epidemic Failures

Analysis

• As a function of permutations
  – Error rate
  – Response time
  – CPU Utilization

• Interactions
  – E.g. interaction between linux and java
  – Contrasts identify components with issues
  – Small changes with high statistical significance
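
A minimal sketch of how an interaction might surface in R, assuming a linear model with two-way interaction terms. The data is synthetic: a 10 ms penalty is injected only for java7 on ubuntu, so the model has something to find. The effect size, noise level and sample counts are invented for illustration.

```r
set.seed(42)  # synthetic data; the injected effect is made up

design <- expand.grid(java = c("java6", "java7"),
                      linux = c("centos", "ubuntu"),
                      codeversion = c("v34", "v35"))
obs <- design[rep(seq_len(nrow(design)), each = 20), ]

# Baseline of 100 ms with noise, plus a 10 ms penalty only when
# java7 runs on ubuntu.
penalty <- ifelse(obs$java == "java7" & obs$linux == "ubuntu", 10, 0)
obs$response_ms <- 100 + penalty + rnorm(nrow(obs), sd = 2)

# Main effects plus all two-way interactions.
fit <- lm(response_ms ~ (java + linux + codeversion)^2, data = obs)
coefs <- summary(fit)$coefficients

# With default treatment contrasts the java/linux interaction
# coefficient is named "javajava7:linuxubuntu".
interaction_est <- coefs["javajava7:linuxubuntu", "Estimate"]
interaction_p   <- coefs["javajava7:linuxubuntu", "Pr(>|t|)"]
print(round(coefs, 3))
```

In this setup the penalty shows up in the interaction coefficient, not in the java or linux main effects. That is the kind of signal the slide describes.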

Page 16: Epidemic Failures

GCS Total API Outage for ~1hr

Page 17: Epidemic Failures

Takeaway

Watch out for monocultures

A|B Testing – it’s not just for personalization

http://perfcap.blogspot.com

http://slideshare.net/adrianco – Netflix

http://slideshare.net/adriancockcroft - Battery

http://www.linkedin.com/in/adriancockcroft

@adrianco @BatteryVentures