Chaos Driven Development
Transcript of Chaos Driven Development
A LITTLE ABOUT ME
• Founder of Chaos Engineering @ Netflix
• Computer Science Background
• Multiple roles scaling Netflix from 8M to 60M+ subscribers
• Currently Taking a Break
@bruce_m_wong
Most enterprises hire people to fix things. Netflix hires people to break things….
…we should embrace Netflix's culture of "chaos engineering" throughout organizations of all shapes and sizes.
http://readwrite.com/2014/09/17/netflix-chaos-engineering-for-everyone
http://www.techrepublic.com/article/serious-about-cloud-it-might-be-time-to-look-into-chaos-engineering/
https://gigaom.com/2014/09/11/netflixs-new-chaos-engineering-push-aims-to-hire-staff-to-help-break-its-cloud-based-system/
CHAOS DEFINED
“If it ain’t broke don’t fix it”
-Bert Lance, Nation’s Business 1977
“If it ain’t broke, try harder”
-Chaos philosophy
CHAOS DEFINED
Intentionally introducing failure into a system with the purpose of validating resilience design.
WHY CHAOS?
• Hardware fails
• Power outages
• Software has bugs
• Human error
• Natural disasters
http://www.pcworld.com/article/2691772/how-netflix-survived-the-amazon-ec2-reboot.html
https://gigaom.com/2014/10/03/netflix-lost-218-database-servers-during-aws-reboot-and-stayed-online/
FAULT-TOLERANT DESIGN PRINCIPLES
• Eliminate Single Points of Failure
• Allow parts of the system to fail independently (Failure Isolation)
• Prevent propagation (Failure Containment)
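One way to picture failure containment is a call wrapper that swaps in a degraded result when a dependency fails, so the error stops at the boundary. The sketch below is only an illustration: `call_with_fallback`, `personalized_rows`, and `popular_rows` are hypothetical names, not anything from the talk or from Netflix's code.

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")


def call_with_fallback(primary: Callable[[], T], fallback: Callable[[], T]) -> T:
    """Run primary; if it raises, contain the failure and use the fallback."""
    try:
        return primary()
    except Exception:
        # Containment: the caller receives a degraded answer instead of an error.
        return fallback()


def personalized_rows() -> List[str]:
    # Hypothetical dependency that is currently failing.
    raise ConnectionError("recommendation service unavailable")


def popular_rows() -> List[str]:
    # Hypothetical static fallback that needs no remote call.
    return ["Popular right now"]


rows = call_with_fallback(personalized_rows, popular_rows)
```

A real system would typically add timeouts and limit how often a failing dependency is retried; those details are left out here.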
MINIMUM VIABLE PRODUCT
• Understand your users
• Understand your value proposition
• Understand your business
PRIORITIZE
• Many aspects and features are important
• Each has different consequences if it stops working
• A product’s value proposition is what drives your business
WHAT DOES CHAOS LOOK LIKE?
• Types - errors, latency
• Duration - how long?
• Intensity - how much?
WHAT DOES CHAOS LOOK LIKE?
• Return errors for a percentage of requests
• e.g. return HTTP 500 for 1% of requests for 1 minute
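The error example above could be sketched roughly as follows. `ErrorInjector` and `handle_request` are hypothetical names made up for this illustration, and the handler simply returns a status code rather than sitting inside a real web framework.

```python
import random
import time
from typing import Callable, Optional


class ErrorInjector:
    """Fail a fraction of requests until a deadline passes (illustrative only)."""

    def __init__(
        self,
        error_rate: float,
        duration_s: float,
        rng: Optional[random.Random] = None,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.error_rate = error_rate
        self.clock = clock
        self.deadline = clock() + duration_s
        self.rng = rng or random.Random()

    def should_fail(self) -> bool:
        # Outside the experiment window, never inject a failure.
        if self.clock() >= self.deadline:
            return False
        return self.rng.random() < self.error_rate


def handle_request(injector: ErrorInjector) -> int:
    """Return an HTTP status code: 500 when a fault is injected, else 200."""
    if injector.should_fail():
        return 500
    return 200


# 1% of requests fail for 60 seconds, as in the slide's example.
injector = ErrorInjector(error_rate=0.01, duration_s=60)
statuses = [handle_request(injector) for _ in range(10_000)]
```

Because the deadline is checked on every request, the injection stops on its own once the minute is over, which keeps the experiment's duration bounded.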
WHAT DOES CHAOS LOOK LIKE?
• Make it slow(er) - introduce latency
• e.g. sleep for 10 ms on every request for 1 minute
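A minimal sketch of the latency example, assuming the delay is added by a decorator around a request handler. `with_latency` and `get_title` are hypothetical names; the `sleep` parameter exists only so the delay can be observed without actually waiting.

```python
import functools
import time
from typing import Any, Callable


def with_latency(
    delay_s: float,
    duration_s: float,
    sleep: Callable[[float], None] = time.sleep,
) -> Callable[[Callable[..., Any]], Callable[..., Any]]:
    """Decorator: add delay_s to every call for duration_s after decoration."""
    deadline = time.monotonic() + duration_s

    def decorator(func: Callable[..., Any]) -> Callable[..., Any]:
        @functools.wraps(func)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            # Only inject latency while the experiment window is open.
            if time.monotonic() < deadline:
                sleep(delay_s)
            return func(*args, **kwargs)

        return wrapper

    return decorator


# 10 ms on every request for 1 minute, as in the slide's example.
@with_latency(delay_s=0.010, duration_s=60)
def get_title(title_id: int) -> str:
    return f"title-{title_id}"


result = get_title(42)
```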
WHAT DOES CHAOS LOOK LIKE?
Gradually increase:
• e.g. sleep for 10 ms on every request for 1 minute
• then sleep for 100 ms on every request for 3 minutes
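The gradual increase can be thought of as a schedule of stages, each with its own intensity and duration. The sketch below encodes the two stages from the slide; `Stage`, `RAMP`, and `delay_at` are hypothetical names chosen for the illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Stage:
    delay_ms: int    # latency added to every request during this stage
    duration_s: int  # how long the stage lasts


# 10 ms for 1 minute, then 100 ms for 3 minutes.
RAMP: List[Stage] = [Stage(10, 60), Stage(100, 180)]


def delay_at(elapsed_s: float, stages: List[Stage]) -> int:
    """Return the delay (ms) that applies elapsed_s seconds into the experiment."""
    stage_end = 0.0
    for stage in stages:
        stage_end += stage.duration_s
        if elapsed_s < stage_end:
            return stage.delay_ms
    return 0  # experiment over: stop injecting


for t in (0, 59, 60, 239, 240):
    print(t, delay_at(t, RAMP))
```

Keeping the schedule as data makes it easy to start with a very small first stage and add harsher ones only after the earlier ones pass.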
WHAT DOES CHAOS LOOK LIKE?
The design/implementation worked!
• microscopic impact, high confidence
What if it didn’t work?
• smaller impact than an outage
• proactively fix it and try again
WHAT DOES AN OUTAGE LOOK LIKE?
• Detection takes time (TTD)
• Analysis takes time
• Resolution takes time (TTR)
• Outages happen at inconvenient times
CHAOS VS OUTAGE
Chaos
• Controlled
• Planned
• Intentional
• Microscopic user impact
Outages
• Uncontrolled
• Unpredictable
• Unintended
• Large impact
WHAT ABOUT TESTING?
• Testing is good - do it, automate it
• While great testing disciplines can find most functional bugs, they are less likely to catch issues with…
• Scale, traffic and capacity
• System misconfiguration and design limitations
LESSONS LEARNED
• You learn more from chaos exercises than from outages
• Fixing a failure mode will uncover new ones
• Configuration is often overlooked
• Tools can break
WHAT MAKES RESILIENCE DESIGN HARD?
• Product and Engineering Decision
• Tradeoffs are difficult
• Organizational Silos
WHAT MAKES CHAOS HARD?
In addition to the technical challenges:
• Organizations rarely incentivize people to try to break production
• Misconceptions about complex systems and scale
TAKEAWAYS
• What are the consequences?
• Start small, start early
• Work together - share context
• Validate, don’t assume