Retrospective from a startup built in the cloud: top three big lessons learned from the AWS outage...
jeff-malek
@JPMALEK
04/13/2023 1
Retrospective from a startup built in the cloud : top 3 big lessons
from the AWS outage on
04.21.2011 plus 4,369 other smaller ones
@JPMALEK
04/13/2023 2
What a country : entrepreneurial resiliency
@JPMALEK
04/13/2023 3
“robust systems: highly fault-tolerant, on or off grid. e.g.: our culture wrt entrepreneurs, AWS, the BD API”
(true story)
@JPMALEK
04/13/2023 4
Boom
@JPMALEK
04/13/2023 5
me: previous startup
teams in 3 countries
highly transactional system
MS tech: IIS/MS SQL Server
co-located, leased/owned hardware
0% in cloud
$75M/yearly rev
@JPMALEK
04/13/2023 6
me: current startup
systems 100% on AWS
99% free/open-source software
standing on the shoulders of giants
@JPMALEK
04/13/2023 7
What Happened
Regions and Zones
US-WEST: zones A, B, C, D
US-EAST: zones A, B, C, D
@JPMALEK
04/13/2023 8
What Happened in us-east
It’s all about the EBS (Elastic Block Store); apologies for the artistic license, AWS
[Diagram: region US-EAST; zones A, B, C, D; control plane services; EBS cluster]
@JPMALEK
04/13/2023 9
What Happened in us-east
It’s all about the EBS (Elastic Block Store); apologies for the artistic license, AWS
EBS Cluster
‘re-mirroring storm’
Control plane services
Thread-starved
Regional API brown-out
Region/Zones
@JPMALEK
04/13/2023 10
fault tolerance: 3 to 47 important failearnings
and 4,369 less important ones
@JPMALEK
04/13/2023 11
in the context of our startup, of course
YMMV depending on velocity
@JPMALEK
04/13/2023 12
Ruger
@JPMALEK
04/13/2023 13
The Ruger Fault Equivalency
time = money
fault tolerance = time² - risk tolerance
Also known as:
'Fast, good and cheap: pick two'
@JPMALEK
04/13/2023 14
system design philosophy:
leverage proven, open-source tech in the cloud
to build a scalable, reliable, secure operational foundation
quickly
@JPMALEK
04/13/2023 15
So how do you achieve the right level of fault tolerance in the cloud?
3 tenets
@JPMALEK
04/13/2023 16
Tenet #1: Scripted Repeatability
Tenet #2: SPOF Elimination
Tenet #3: Clear-Cut Communication
@JPMALEK
04/13/2023 17
Tenet #1
prepare a fault-tolerant foundation with scripted repeatability
aka automation
@JPMALEK
04/13/2023 18
Tenet #1 : scripted repeatability
from the start:
script the non-interactive install of your tools and OS
custom AMI
Debian: great package management
based on Eric Hammond’s work
http://alestic.com/
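A minimal sketch of what "non-interactive install" can look like on Debian, assuming apt-get is the package tool. The function names and any package names passed in are hypothetical illustrations, not the scripts used for the deck; the only real pieces are the `DEBIAN_FRONTEND=noninteractive` environment variable and the `-y`/`-q` flags of `apt-get install`.

```python
import os
import subprocess


def apt_install_command(packages):
    """Build an apt-get invocation that never stops to ask a question."""
    if not packages:
        raise ValueError("no packages given")
    # -y answers yes to prompts, -q keeps the output log-friendly.
    # Sorting and de-duplicating makes the command identical on every run.
    return ["apt-get", "install", "-y", "-q"] + sorted(set(packages))


def noninteractive_env(base_env=None):
    """Copy the environment and tell debconf not to open any dialogs."""
    env = dict(os.environ if base_env is None else base_env)
    env["DEBIAN_FRONTEND"] = "noninteractive"
    return env


def install(packages, dry_run=True):
    """Return the command; only execute it when dry_run is False."""
    cmd = apt_install_command(packages)
    if not dry_run:
        subprocess.run(cmd, env=noninteractive_env(), check=True)
    return cmd
```

Keeping `dry_run=True` as the default means the script can be reviewed or logged before it touches a machine; baking the same command into an image build is one way to get the repeatability the slide asks for.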
@JPMALEK
04/13/2023 19
Tenet #1 : scripted repeatability
which will allow you to
script the setup/tear-down of your stack
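One way to picture scripted setup/tear-down: register each layer of the stack with an "up" and a "down" action, bring layers up in order, and take them down in reverse. This is a sketch under that assumption; the `Stack` class and the step names are hypothetical, and the actions are plain callables standing in for whatever provisioning calls a real script would make.

```python
class Stack:
    """Ordered setup steps with matching tear-down steps."""

    def __init__(self):
        self._steps = []  # (name, up, down) in setup order
        self._done = []   # (name, down) for steps that completed

    def add(self, name, up, down):
        self._steps.append((name, up, down))

    def setup(self):
        """Run every 'up' in order; roll back completed steps on failure."""
        for name, up, down in self._steps:
            try:
                up()
            except Exception:
                self.teardown()
                raise
            self._done.append((name, down))

    def teardown(self):
        """Run 'down' for completed steps, most recent first."""
        while self._done:
            _name, down = self._done.pop()
            down()
```

Tearing down in reverse order keeps dependencies intact (for example, app servers go away before the database they talk to), and the rollback in `setup` avoids leaving half a stack running after a failed step.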
@JPMALEK
04/13/2023 20
Tenet #1 : scripted repeatability
which will allow you to
script system tests
integrity (3-4K tests)
performance (30-40K tests)
load, capacity (2-4M requests)
@JPMALEK
04/13/2023 21
Tenet #1 : scripted repeatability
A/B system test results : MySQL Percona Upgrade
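The chart itself is not in this transcript. As a hedged illustration of how an A/B comparison of two scripted test runs might be computed, the sketch below compares median and 95th-percentile latency between a baseline run and a candidate run. The function names, the nearest-rank percentile choice, and the 5% tolerance are all assumptions made for the example, not the method behind the slide.

```python
import math
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list (pct in 1..100)."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def compare_runs(baseline_ms, candidate_ms, tolerance=0.05):
    """Report relative change per metric and flag regressions."""
    metrics = (
        ("median", statistics.median),
        ("p95", lambda s: percentile(s, 95)),
    )
    report = {}
    for label, fn in metrics:
        before, after = fn(baseline_ms), fn(candidate_ms)
        change = (after - before) / before
        report[label] = {
            "before": before,
            "after": after,
            "change": change,
            # A metric regresses when it got slower by more than tolerance.
            "regressed": change > tolerance,
        }
    return report
```

Looking at a tail percentile alongside the median matters for an upgrade test, since a change can leave typical requests untouched while making the slowest ones much slower.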
@JPMALEK
04/13/2023 22
That’s how 1 person
set up and managed a network
comprised of 90+/- server instances
for 1.5 years
while serving various other roles
without having to leave their chair
try that with real hardware
@JPMALEK
04/13/2023 23
Tenet #2
SPOF Elimination
We don’t need no stinkin single points of failure.
@JPMALEK
04/13/2023 24
Tenet #2 : SPOF Elimination
SPOF Examples:
Cloud Provider
Region
Zone
Load Balancer
App Server
Database
Fred
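A rough sketch of how the server-side entries in that list could be checked mechanically, assuming you can export an inventory of (role, zone) pairs. The `find_spofs` function and its two rules (fewer than two instances, or all instances in one zone) are illustrative assumptions; provider- and region-level SPOFs, and Fred, need a different kind of audit.

```python
from collections import defaultdict


def find_spofs(instances):
    """Return {role: reason} for roles that a single failure would take out.

    instances: iterable of (role, zone) pairs, one per running server.
    """
    zones_by_role = defaultdict(set)
    count_by_role = defaultdict(int)
    for role, zone in instances:
        zones_by_role[role].add(zone)
        count_by_role[role] += 1

    findings = {}
    for role in sorted(count_by_role):
        if count_by_role[role] < 2:
            findings[role] = "single instance"
        elif len(zones_by_role[role]) < 2:
            findings[role] = "single zone"
    return findings
```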
@JPMALEK
04/13/2023 25
Tenet #2 : SPOF Elimination
Cloud Provider fail-over?
e.g. AWS –> Rackspace
@JPMALEK
04/13/2023 26
Tenet #2 : SPOF Elimination
Region fail-over?
e.g. us-east -> us-west within AWS
Nah.
@JPMALEK
04/13/2023 27
Tenet #2 : SPOF Elimination
Zone fail-over?
Yes.
US-WEST: zones A, B, C, D
US-EAST: zones A, B, C, D
@JPMALEK
04/13/2023 28
Tenet #2 : SPOF Elimination
Zone fail-over best practices:
are you using auto-scaling?
no : distribute server instances evenly between 2 or more zones
yes : trigger scaling on network I/O or custom metrics
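For the "no auto-scaling" branch, the even distribution can be sketched in a few lines. The function names here are hypothetical, and the zone identifiers you pass in are whatever your region exposes; the second helper checks whether the plan still leaves enough capacity when any single zone disappears.

```python
def spread_across_zones(instance_count, zones):
    """Split instance_count as evenly as possible over two or more zones."""
    if len(zones) < 2:
        raise ValueError("need at least 2 zones for zone fail-over")
    if instance_count < 0:
        raise ValueError("instance_count must be non-negative")
    base, extra = divmod(instance_count, len(zones))
    # The first `extra` zones each take one additional instance.
    return {zone: base + (1 if i < extra else 0) for i, zone in enumerate(zones)}


def survives_zone_loss(plan, required):
    """True if losing any one zone still leaves `required` instances."""
    total = sum(plan.values())
    return all(total - count >= required for count in plan.values())
```

For example, seven instances over three zones come out as 3/2/2; if five instances are needed to carry the load, that plan does not survive losing the largest zone, which is a signal to add capacity rather than to rebalance.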
@JPMALEK
04/13/2023 29
Tenet #2 : SPOF Elimination
Load-balancer (ELB), app server, database fail-over?
Yes.
@JPMALEK
04/13/2023 30
Tenet #2 : SPOF Elimination
So it’s actually all about reduction of the right SPOFs for
your business context
Just adding the ability to fail-over and have backups within a region is huge!
Probably enough for most.
What about Fred?
@JPMALEK
04/13/2023 31
Tenet #3
Clear-Cut Communication
@JPMALEK
04/13/2023 32
Tenet #3 : Clear-cut Communication
During an outage, communicating the right things at the right time:
hard.
But not that hard.
@JPMALEK
04/13/2023 33
Three Tenets Revisited
Tenet #1: Scripted Repeatability
Tenet #2: SPOF Elimination
Tenet #3: Clear-Cut Communication
@JPMALEK
04/13/2023 34
Thank You
Our AWS account rep :"Dylan Peterson" <[email protected]>
(notes attached to this slide)