Chaos Driven Development

CHAOS DRIVEN DEVELOPMENT Future Insights Live 2015, Las Vegas Bruce Wong


A LITTLE ABOUT ME

• Founder of Chaos Engineering @ Netflix

• Computer Science Background

• Multiple roles scaling Netflix from 8M to 60M+ subscribers

• Currently Taking a Break

@bruce_m_wong

Most enterprises hire people to fix things. Netflix hires people to break things….

…we should embrace Netflix's culture of "chaos engineering" throughout organizations of all shapes and sizes.

http://readwrite.com/2014/09/17/netflix-chaos-engineering-for-everyone

http://www.techrepublic.com/article/serious-about-cloud-it-might-be-time-to-look-into-chaos-engineering/

https://gigaom.com/2014/09/11/netflixs-new-chaos-engineering-push-aims-to-hire-staff-to-help-break-its-cloud-based-system/

http://www.cnbc.com/id/102394893


CHAOS DEFINED

“If it ain’t broke don’t fix it”

-Bert Lance, Nation’s Business 1977

“If it ain’t broke, try harder”

-Chaos philosophy


CHAOS DEFINED

Intentionally introducing failure into a system with the purpose of validating resilience design.


WHY CHAOS?

Failure happens.


WHY CHAOS?

• Hardware fails

• Power outages

• Software has bugs

• Human error

• Natural disasters

http://money.cnn.com/2012/10/30/technology/netflix-hurricane-sandy/

http://www.pcworld.com/article/2691772/how-netflix-survived-the-amazon-ec2-reboot.html

https://gigaom.com/2014/10/03/netflix-lost-218-database-servers-during-aws-reboot-and-stayed-online/


BLUE MOONS

A “once in a blue moon” event will eventually happen.

FAULT-TOLERANT DESIGN PRINCIPLES

• Eliminate Single Points of Failure

• Allow parts of the system to fail independently (Failure Isolation)

• Prevent propagation (Failure Containment)


START WITH CONSEQUENCES

Chaos Driven Development


MINIMUM VIABLE PRODUCT

• Understand your users

• Understand your value proposition

• Understand your business


PRIORITIZE

• Many aspects and features are important

• Each has different consequences when it stops working

• A product’s value proposition is what drives your business


DESIGN FOR FAILURE

What failure isolation might look like
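As one possible illustration of failure isolation, here is a minimal Python sketch in which a non-critical dependency fails without taking the core feature down with it. All names (with_fallback, broken_recommendations, render_home_page) and the choice of "playback" as the core feature are hypothetical, picked only for this example; they are not taken from the talk or from any Netflix system.

```python
from typing import Callable, List


def with_fallback(primary: Callable[[], List[str]], fallback: List[str]) -> List[str]:
    """Call a non-critical dependency; on any failure, return a fallback value."""
    try:
        return primary()
    except Exception:
        # Contain the failure here so it does not propagate to the caller.
        return fallback


def broken_recommendations() -> List[str]:
    # Hypothetical dependency that is currently failing.
    raise ConnectionError("recommendation service unavailable")


def render_home_page() -> dict:
    # Hypothetical page: playback is treated as the core value proposition,
    # recommendations as an optional extra that may degrade.
    return {
        "playback_enabled": True,
        "recommendations": with_fallback(broken_recommendations, ["popular-title"]),
    }


if __name__ == "__main__":
    print(render_home_page())
```

The design choice in this sketch is that the caller decides what a degraded answer looks like, so a broken optional feature costs a worse page rather than a failed one.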


APPLYING CHAOS

Validation of fault-tolerant design


BREAKING THE CONNECTION

How Confident are you?

-Next week?

-Next month?

-After that “quick patch”?

WHAT DOES CHAOS LOOK LIKE?

• Types - errors, latency

• Duration - how long?

• Intensity - how much?


WHAT DOES CHAOS LOOK LIKE?

• Return errors for a percentage of requests

• e.g. return HTTP 500 for 1% of requests for 1 minute
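A minimal sketch of this kind of error injection, assuming a single-process request handler. ErrorInjector and handle_request are hypothetical names invented for this example, and the injectable clock and random source are there only to make the behaviour easy to check; a real system would more likely do this in a proxy or client library.

```python
import random
import time
from typing import Callable, Optional


class ErrorInjector:
    """Fail a fraction of requests for a limited time window (hypothetical helper)."""

    def __init__(
        self,
        error_rate: float,
        duration_s: float,
        rng: Optional[random.Random] = None,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.error_rate = error_rate
        self.clock = clock
        self.rng = rng or random.Random()
        # The experiment window starts when the injector is created.
        self.ends_at = clock() + duration_s

    def should_fail(self) -> bool:
        if self.clock() >= self.ends_at:
            return False  # experiment is over: never inject
        return self.rng.random() < self.error_rate


def handle_request(injector: ErrorInjector) -> int:
    """Return an HTTP status code for a hypothetical request handler."""
    if injector.should_fail():
        return 500  # injected failure
    return 200  # normal path


if __name__ == "__main__":
    # The slide's example: HTTP 500 for 1% of requests for 1 minute.
    injector = ErrorInjector(error_rate=0.01, duration_s=60.0)
    statuses = [handle_request(injector) for _ in range(10_000)]
    print("injected failures:", statuses.count(500))
```

Because the window closes on its own, forgetting to switch the experiment off does not turn it into an outage.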


WHAT DOES CHAOS LOOK LIKE?

• Make it slow(er) - Introduce Latency

• e.g. sleep for 10ms on every request for 1 minute
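Latency injection can be sketched the same way. The decorator below is a hypothetical helper (inject_latency and get_profile are not from the talk) that delays every call by a fixed amount until the experiment window, which here starts when the function is decorated, has passed.

```python
import functools
import time
from typing import Any, Callable


def inject_latency(
    delay_s: float,
    duration_s: float,
    sleep: Callable[[float], Any] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Callable[[Callable[..., Any]], Callable[..., Any]]:
    """Delay every call by delay_s seconds for the first duration_s seconds."""
    ends_at = clock() + duration_s  # window starts at decoration time

    def decorator(func: Callable[..., Any]) -> Callable[..., Any]:
        @functools.wraps(func)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            if clock() < ends_at:
                sleep(delay_s)  # injected latency
            return func(*args, **kwargs)

        return wrapper

    return decorator


# The slide's example: 10ms on every request for 1 minute.
@inject_latency(delay_s=0.010, duration_s=60.0)
def get_profile(user_id: int) -> dict:
    # Hypothetical request handler.
    return {"user_id": user_id}


if __name__ == "__main__":
    print(get_profile(42))
```

The sleep and clock parameters are assumptions added so the delay can be verified without actually waiting.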


WHAT DOES CHAOS LOOK LIKE?

Gradually increase

• e.g. sleep for 10ms on every request for 1 minute

• sleep for 100ms on every request for 3 minutes
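One way to express a gradual ramp is as a list of stages, each with an intensity and a duration. The sketch below uses the two stages from the example above; STAGES and delay_at are hypothetical names, and the rule that no delay applies once the last stage ends is an assumption of this sketch.

```python
from typing import List, Tuple

# (delay in seconds, stage duration in seconds), from the example above:
# 10ms for 1 minute, then 100ms for 3 minutes.
STAGES: List[Tuple[float, float]] = [
    (0.010, 60.0),
    (0.100, 180.0),
]


def delay_at(elapsed_s: float, stages: List[Tuple[float, float]] = STAGES) -> float:
    """Return the latency to inject at elapsed_s seconds into the experiment."""
    stage_start = 0.0
    for delay_s, duration_s in stages:
        if stage_start <= elapsed_s < stage_start + duration_s:
            return delay_s
        stage_start += duration_s
    return 0.0  # before the start or after the last stage: inject nothing


if __name__ == "__main__":
    for second in (0, 30, 60, 200, 240):
        print(second, delay_at(second))
```

Keeping the schedule as data makes it easy to stop between stages if the earlier, smaller stage already shows user impact.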


WHAT DOES CHAOS LOOK LIKE?

The design/implementation worked!

• microscopic impact, high confidence

What if it didn’t work?

• smaller impact than an outage

• proactively fix it and try again

WHAT DOES AN OUTAGE LOOK LIKE?

• Detection takes time (TTD)

• Analysis takes time

• Resolution takes time (TTR)

• Inconvenient times


CHAOS VS OUTAGE

Chaos

• Controlled

• Planned

• Intentional

• Microscopic user impact

Outages

• Uncontrolled

• Unpredictable

• Unintended

• Large impact

WHAT ABOUT TESTING?

• Testing is good - do it, automate it

• While great testing disciplines can find most functional bugs…

• scale, traffic and capacity

• System misconfiguration and design limitations


LESSONS LEARNED

• You learn more from chaos exercises than from outages

• Fixing a failure mode will uncover new ones

• Configuration is often overlooked

• Tools can break


WHY IS THIS HARD?


WHAT MAKES RESILIENCE DESIGN HARD?

• Product and Engineering Decision

• Tradeoffs are difficult

• Organizational Silos


ORGANIZATIONAL SILOS

• Services by Domain

• Dev/Ops/Product

• Incomplete context


WHAT MAKES CHAOS HARD?

In addition to the technical challenges:

• Organizations rarely incentivize people to try and break production

• Misconceptions about complex systems and scale


TAKEAWAYS

• What are the consequences?

• Start small, start early

• Work together - share context

• Validate, don’t assume


QUESTIONS?

@bruce_m_wong