What we learned from the AWS Outage

Page 1: What we learned from the AWS Outage

AWS Outage / Availability zone failure in

Sydney region

- 5th June 2016 -

Author: Gilles Baillet

* Disclaimer: The opinions expressed in this presentation are the author's own and do not reflect the view of his employer

Page 2: What we learned from the AWS Outage

Who am I?

Gilles Baillet

Cloud Centre of Excellence Manager

Leading a team of 5 DevOps engineers on the Ops (dark) side of DevOps

AWS Certified SysOps Associate and Solutions Architect Associate

Fan of DevOps, AWS, Data Pipeline and now Lambda

Food, drinks and travel addict, and almost married!

You can meet me at several meetups around Sydney (AWS, Docker, Elastic)

You can connect with me on LinkedIn: https://au.linkedin.com/in/gillesbaillet

I accept connections from (almost) everyone!

Page 3: What we learned from the AWS Outage

Before we start

Availability Zone Alignment

AWS randomises the assignment of AZ names across AWS accounts

Our AZs are “aligned” across all our production and non-production accounts

Tip: Talk to your TAM!
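
Because AZ names are shuffled per account, "ap-southeast-2a" in one account is not necessarily the same physical facility as "ap-southeast-2a" in another. AWS has since exposed a ZoneId field (added after this talk) that identifies the physical AZ consistently across accounts. A minimal boto3 sketch that dumps the mapping for one account; run it in each account and compare the output:

    import boto3

    # Zone IDs (e.g. apse2-az1) name the same physical AZ in every
    # account, so identical ZoneName -> ZoneId maps across accounts
    # confirm the AZs are "aligned".
    ec2 = boto3.client("ec2", region_name="ap-southeast-2")

    for zone in ec2.describe_availability_zones()["AvailabilityZones"]:
        print(zone["ZoneName"], "->", zone["ZoneId"])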

Page 4: What we learned from the AWS Outage

The chain of events as presented by AWS

• At 3:25PM AEST: loss of power at a regional substation

• At 4:46PM AEST: power restored

• At 6:00PM AEST: over 80% of impacted services back online

• At 1:00AM AEST: nearly all instances recovered

• TOTAL DURATION: 1h21 (power outage) / 9h35 (until nearly all instances recovered)

http://aws.amazon.com/message/4372T8/

Page 5: What we learned from the AWS Outage

The chain of events as experienced by my company

• At 3:25PM AEST: monitoring and alerting triggered (see the alarm sketch after this timeline)

• At 3:30PM AEST: conference bridge opened

• At 5:30PM AEST: most services were restored

• At 3:00AM AEST: all production services were restored

• TOTAL DURATION: 2h05 (until most services restored) / 11h35 (until all production services restored)
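
The talk does not describe the monitoring stack, so as an illustration only, here is a minimal boto3 sketch of the kind of CloudWatch status-check alarm that would have fired at 3:25PM; the instance ID, alarm name, and SNS topic ARN are hypothetical:

    import boto3

    cloudwatch = boto3.client("cloudwatch", region_name="ap-southeast-2")

    # Page the on-call team when an instance fails its EC2 status
    # checks for two consecutive one-minute periods.
    cloudwatch.put_metric_alarm(
        AlarmName="status-check-i-0123456789abcdef0",  # hypothetical
        Namespace="AWS/EC2",
        MetricName="StatusCheckFailed",
        Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],
        Statistic="Maximum",
        Period=60,
        EvaluationPeriods=2,
        Threshold=1,
        ComparisonOperator="GreaterThanOrEqualToThreshold",
        AlarmActions=["arn:aws:sns:ap-southeast-2:123456789012:oncall"],
    )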

Page 6: What we learned from the AWS Outage

Black Swan

“An event that comes as a surprise, has a major effect, and is often inappropriately rationalized after the fact with the benefit of hindsight. The term is based on an ancient saying which presumed black swans did not exist, but the saying was rewritten after black swans were discovered in the wild.”

https://en.wikipedia.org/wiki/Black_swan_theory

Taleb, N. N. (2007). The Black Swan: The Impact of the Highly Improbable. Random House.

Page 7: What we learned from the AWS Outage

Impact during the outage

• all services running in the impacted AZ

• some Auto Scaling Group processes

• a NIC failure at 3:26PM (instance restarted)

• no ELB health checks: healthy instances marked as unhealthy (see the health-check sketch after this list)

• EC2 Console / EC2 CLI commands

• Some CloudWatch metrics

• Some services relying on a single instance (e.g. a domain controller)
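
How quickly a “healthy instance marked as unhealthy” situation resolves is governed by the ELB health-check settings. A minimal boto3 sketch for a classic ELB (the load balancer name and target path are hypothetical); Interval × UnhealthyThreshold bounds how long a dead instance keeps receiving traffic, and Interval × HealthyThreshold bounds how fast a recovered one is re-admitted:

    import boto3

    elb = boto3.client("elb", region_name="ap-southeast-2")

    # Explicit health check on a classic ELB. With these numbers an
    # instance is ejected after 2 failed checks (~60s) and re-admitted
    # after 3 consecutive successes (~90s).
    elb.configure_health_check(
        LoadBalancerName="my-app-elb",        # hypothetical
        HealthCheck={
            "Target": "HTTP:80/healthcheck",  # hypothetical path
            "Interval": 30,
            "Timeout": 5,
            "UnhealthyThreshold": 2,
            "HealthyThreshold": 3,
        },
    )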

Page 8: What we learned from the AWS Outage

Impact after the outage

• DB repair / integrity check

• Restoration of data stored on ephemeral storage

• 24 hours spent fixing instances in lower environments (DEV, UAT, etc.)

• Clean up of rogue instances

Page 9: What we learned from the AWS Outage

Some things did work

• ELB Health checks

• RDS database failover (Multi-AZ; see the sketch after this list)

• Some Auto Scaling Group processes

• AWS support escalation

• All critical services running on Cloud 2.0!
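
The RDS failovers worked because Multi-AZ maintains a synchronous standby in another AZ and repoints the instance endpoint's DNS on failure. Enabling it is a single flag; a minimal boto3 sketch against a hypothetical existing instance:

    import boto3

    rds = boto3.client("rds", region_name="ap-southeast-2")

    # Convert an existing instance to Multi-AZ: RDS builds a synchronous
    # standby in another AZ and fails over to it automatically.
    rds.modify_db_instance(
        DBInstanceIdentifier="my-app-db",  # hypothetical
        MultiAZ=True,
        ApplyImmediately=True,
    )

Since failover works by updating the DNS record behind the endpoint, clients that cache DNS lookups for too long keep hitting the dead primary, which is the same issue raised by the DNS TTL item on the “What's next?” slide.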

Page 10: What we learned from the AWS Outage

Lessons learned

• Implementation vs design

• Instance type matters

• AWS Enterprise support is worth the cost

• Cattle are awesome

• Datacenters in Sydney are not weather-proof

• 100s of companies impacted

Page 11: What we learned from the AWS Outage

What’s next?

• Review of design documents vs implementation

• Use older instance types

• Use Chaos Monkey (see the sketch after this list)

• Turn Pets into Cattle (more work for my team!)

• Deploy new VPCs across 3 AZs

• Revisit DNS client TTL versus Health Check timeout

• AWS to fix “things” on their end
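
Chaos Monkey is Netflix's tool for randomly terminating production instances so that designs are forced to tolerate instance loss. A minimal boto3 sketch of the same idea (the Auto Scaling group name is hypothetical); terminating through the Auto Scaling API without decrementing desired capacity means a well-built group simply replaces the victim:

    import random

    import boto3

    autoscaling = boto3.client("autoscaling", region_name="ap-southeast-2")

    def terminate_random_instance(group_name):
        """Chaos-Monkey-style kill of one random instance in an ASG."""
        group = autoscaling.describe_auto_scaling_groups(
            AutoScalingGroupNames=[group_name]
        )["AutoScalingGroups"][0]
        victim = random.choice(group["Instances"])["InstanceId"]
        # Keep desired capacity so the ASG launches a replacement.
        autoscaling.terminate_instance_in_auto_scaling_group(
            InstanceId=victim, ShouldDecrementDesiredCapacity=False
        )
        return victim

    print(terminate_random_instance("my-app-asg"))  # hypothetical ASG name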

Page 12: What we learned from the AWS Outage

Questions?