What we learned from the AWS Outage
Transcript of What we learned from the AWS Outage
AWS Outage / Availability Zone failure in the Sydney region
5 June 2016
Author: Gilles Baillet
* Disclaimer: The opinions expressed in this presentation are the author's own and do not reflect the view of his employer
Who am I?
Gilles Baillet
Cloud Centre of Excellence Manager
Leading a team of 5 DevOps engineers on the Ops (dark) side of DevOps
AWS Certified SysOps Associate and Solutions Architect Associate
Fan of DevOps, AWS, Data Pipeline and now Lambda
Food, drinks and travel addict, and almost married!
You can meet me at several meetups around Sydney (AWS, Docker, Elastic).
You can connect with me on LinkedIn: https://au.linkedin.com/in/gillesbaillet. I accept connections from (almost) everyone!
Before we start
Availability Zone alignment
AWS randomises the assignment of AZ names across AWS accounts.
Our AZs are “aligned” across all our production and non-production accounts.
Tip: Talk to your TAM!
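Because AZ names are randomised per account, "ap-southeast-2a" in one account may be a different physical zone than "ap-southeast-2a" in another. Given a ZoneName-to-ZoneId mapping per account (AWS later exposed the stable ZoneId via `describe-availability-zones`; at the time of this talk you had to ask your TAM), checking alignment is a small exercise. The mappings below are hypothetical illustrations:

```python
# Hypothetical ZoneName -> ZoneId maps for two accounts. The ZoneId is
# stable across accounts, while the ZoneName is randomised per account.
prod = {
    "ap-southeast-2a": "apse2-az1",
    "ap-southeast-2b": "apse2-az3",
    "ap-southeast-2c": "apse2-az2",
}
nonprod = {
    "ap-southeast-2a": "apse2-az3",
    "ap-southeast-2b": "apse2-az2",
    "ap-southeast-2c": "apse2-az1",
}

def az_alignment(a, b):
    """For each AZ name in account a, find the name in account b that
    refers to the physically identical zone (same ZoneId)."""
    name_by_id = {zone_id: name for name, zone_id in b.items()}
    return {name: name_by_id[zone_id] for name, zone_id in a.items()}

print(az_alignment(prod, nonprod))
```

If the two accounts are truly "aligned", this mapping is the identity; any mismatch means the same AZ name points at different physical zones in different accounts.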
The chain of events as presented by AWS
• At 3:25PM AEST: loss of power at a regional substation
• At 4:46PM AEST: power restored
• At 6:00PM AEST: over 80% of impacted services back online
• At 1:00AM AEST: nearly all instances recovered
• TOTAL DURATION: 1h21 (power outage) / 9h35 (until nearly all instances recovered)
http://aws.amazon.com/message/4372T8/
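The two figures in the TOTAL DURATION line follow directly from the timestamps above; a quick sketch in Python:

```python
from datetime import datetime

# Timestamps from the AWS timeline above (AEST, 5 June 2016).
power_lost     = datetime(2016, 6, 5, 15, 25)
power_restored = datetime(2016, 6, 5, 16, 46)
recovered      = datetime(2016, 6, 6, 1, 0)   # 1:00AM the next morning

def hhmm(delta):
    """Format a timedelta as e.g. '1h21'."""
    minutes = int(delta.total_seconds()) // 60
    return f"{minutes // 60}h{minutes % 60:02d}"

print(hhmm(power_restored - power_lost))  # 1h21 without power
print(hhmm(recovered - power_lost))       # 9h35 to near-full recovery
```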
The chain of events as experienced by my company
• At 3:25PM AEST: trigger of monitoring/alerting services
• At 3:30PM AEST: conference bridge opened
• At 5:30PM AEST: most services were restored
• At 3:00AM AEST: all production services were restored
• TOTAL DURATION: 2h05 (until most services restored) / 11h35 (until all production services restored)
Black Swan
“An event that comes as a surprise, has a major effect, and is often inappropriately rationalized after the fact with the benefit of hindsight. The term is based on an ancient saying which presumed black swans did not exist, but the saying was rewritten after black swans were discovered in the wild.”
https://en.wikipedia.org/wiki/Black_swan_theory
Taleb, N. N. (2007). The black swan: The impact of the highly improbable. Random house.
Impact during the outage
• all services running in the impacted AZ
• some Auto Scaling Group processes
• A NIC failure at 3:26PM caused an instance restart
• No ELB health-check responses, so a healthy instance was marked as unhealthy
• EC2 Console / EC2 CLI commands
• Some CloudWatch metrics
• Some services relying on a single instance (e.g. a domain controller)
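The NIC failure illustrates that a health check is only as good as its probe path: when probes cannot reach the instance, even a healthy instance gets marked unhealthy. A minimal sketch of the consecutive-failure threshold logic an ELB-style checker uses (the threshold value is illustrative, not this deck's actual configuration):

```python
def out_of_service(probe_results, unhealthy_threshold=2):
    """Return True once `unhealthy_threshold` consecutive probes fail,
    mimicking ELB-style health-check behaviour."""
    streak = 0
    for ok in probe_results:
        streak = 0 if ok else streak + 1
        if streak >= unhealthy_threshold:
            return True
    return False

# A broken NIC makes every probe fail, so even a healthy instance
# is taken out of service after two consecutive missed checks.
print(out_of_service([True, True, False, False]))  # True
print(out_of_service([True, False, True, False]))  # False (no streak)
```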
Impact after the outage
• DB repair / integrity check
• Restoration of data stored on ephemeral storage
• 24 hours fixing instances in lower environments (DEV, UAT etc.)
• Clean up of rogue instances
Some things did work
• ELB Health checks
• RDS Database failover
• Some Auto Scaling Group processes
• AWS support escalation
• All critical services running on Cloud 2.0!
Lessons learned
• Implementation vs design
• Instance type matters
• AWS Enterprise support is worth the cost
• Cattle are awesome
• Datacenters in Sydney are not weather proof
• 100s of companies impacted
What’s next?
• Review of design documents vs implementation
• Use older instance types
• Use Chaos Monkey
• Turn Pets into Cattle (more work for my team!)
• Deploy new VPCs across 3 AZs
• Revisit DNS client TTL versus Health Check timeout
• AWS to fix “things” on their end
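The DNS-TTL-versus-health-check point can be made concrete: clients may keep resolving a dead endpoint for as long as their cached DNS answer lives, even after the health check has failed it over. A rough worst-case estimate (the parameter values are illustrative assumptions, not the deck's actual settings):

```python
def worst_case_failover_seconds(check_interval, unhealthy_threshold, client_dns_ttl):
    """Rough upper bound on how long clients keep hitting a dead
    endpoint: the time the checker needs to declare it unhealthy,
    plus the time a client may cache the stale DNS answer."""
    detection = check_interval * unhealthy_threshold
    return detection + client_dns_ttl

# e.g. 30s checks, 2 consecutive failures required, 60s client-side TTL
print(worst_case_failover_seconds(30, 2, 60))  # 120 seconds
```

If the client-side TTL dominates the detection time, tightening health-check intervals buys little; both knobs have to be revisited together.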
Questions?