Post on 25-Jul-2015
@JPMALEK
04/14/2023 1
Retrospective from a startup built in the cloud: top 3 big lessons
from the AWS outage on 04.21.2011, plus 4,369 other smaller ones
“robust systems: highly fault-tolerant, on or off grid. e.g.: our culture wrt entrepreneurs,
AWS, the BD API”
(true story)
me: previous startup
teams in 3 countries
highly transactional system
MS tech: IIS/MS SQL Server
co-located, leased/owned hardware
0% in cloud
$75M yearly revenue
me: current startup
systems 100% on AWS
99% free/open-source software
standing on the shoulders of giants
The Ruger Fault Equivalency
time = money
fault tolerance = time² - risk tolerance
Also known as:
'Fast, good and cheap: pick two'
system design philosophy:
leverage proven, open-source tech in the cloud
to build a scalable, reliable, secure operational foundation
quickly
So how do you achieve the right level of fault tolerance in the cloud?
3 tenets
Tenet #1: Scripted Repeatability
Tenet #2: SPOF Elimination
Tenet #3: Clear-Cut Communication
Tenet #1
prepare a fault-tolerant foundation with scripted repeatability
aka automation
from the start: script the non-interactive install of your tools and OS
custom AMI
Debian: great package management
based on Eric Hammond’s work: http://alestic.com/
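As a rough illustration of what a scripted, non-interactive install can look like on Debian, here is a minimal Python sketch. It is not the setup used in this talk: the package list and the function names are hypothetical, and it only relies on standard `apt-get` options plus the `DEBIAN_FRONTEND=noninteractive` environment variable.

```python
import os
import subprocess


def build_install_command(packages):
    """Return an apt-get command line that never prompts for input."""
    if not packages:
        raise ValueError("at least one package is required")
    # De-duplicate and sort so the same input always yields the same command.
    return ["apt-get", "install", "--yes", "--no-install-recommends",
            *sorted(set(packages))]


def install(packages, dry_run=True):
    """Run the install; with dry_run=True only return the command."""
    command = build_install_command(packages)
    if dry_run:
        return command
    # DEBIAN_FRONTEND=noninteractive stops debconf from asking questions.
    env = dict(os.environ, DEBIAN_FRONTEND="noninteractive")
    subprocess.run(command, check=True, env=env)
    return command


# Hypothetical package list, for illustration only.
BASE_PACKAGES = ["nginx", "postgresql", "ntp"]

if __name__ == "__main__":
    print(" ".join(install(BASE_PACKAGES)))
```

Defaulting to a dry run keeps the script safe to try on a workstation; on a fresh instance the same call with `dry_run=False` would perform the install (root privileges assumed).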
which will allow you to script system tests:
integrity (3-4K tests)
performance (30-40K tests)
load, capacity (2-4M requests)
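To make the idea of a scripted load test concrete, here is a small sketch in Python. It is an assumption about shape, not the tooling from this talk: `run_load_test` and `fake_handler` are hypothetical names, and the handler stands in for whatever real request a test would send.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_load_test(handler, total_requests, workers=8):
    """Call handler(i) total_requests times in parallel and summarize results."""
    def timed_call(i):
        start = time.perf_counter()
        try:
            handler(i)
            ok = True
        except Exception:
            ok = False
        return ok, time.perf_counter() - start

    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(timed_call, range(total_requests)))

    latencies = [elapsed for _, elapsed in results]
    return {
        "requests": total_requests,
        "failures": sum(1 for ok, _ in results if not ok),
        "max_latency_s": max(latencies) if latencies else 0.0,
    }


def fake_handler(i):
    """Hypothetical stand-in for a real request; fails on every 100th call."""
    if i % 100 == 99:
        raise RuntimeError("simulated failure")


if __name__ == "__main__":
    print(run_load_test(fake_handler, total_requests=1000))
```

Because the runner is just a function, the same script can be pointed at a freshly built instance after every install, which is what makes the test counts above repeatable.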
That’s how 1 person
set up and managed a network
of roughly 90 server instances for 1.5 years
while serving various other roles, without having to leave their chair
try that with real hardware
Zone fail-over best practices: are you using auto-scaling?
no: distribute server instances evenly between 2 or more zones
yes: trigger scaling on network I/O or custom metrics
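For the "no auto-scaling" case, the even split across zones can be sketched in a few lines of Python. The function name is hypothetical and the zone names are only examples; the sketch computes a placement plan and does not call any AWS API.

```python
def distribute_instances(instance_count, zones):
    """Spread instance_count as evenly as possible across the given zones."""
    if instance_count < 0:
        raise ValueError("instance_count must not be negative")
    if len(zones) < 2:
        raise ValueError("use 2 or more zones to survive a zone failure")
    base, extra = divmod(instance_count, len(zones))
    # The first `extra` zones each take one additional instance.
    return {zone: base + (1 if index < extra else 0)
            for index, zone in enumerate(zones)}


if __name__ == "__main__":
    # Example zone names, for illustration only.
    print(distribute_instances(5, ["us-east-1a", "us-east-1b"]))
```

With 5 instances and 2 zones the plan is 3 and 2, so losing either zone still leaves at least 2 instances running.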
So it’s actually all about reduction of the right SPOFs for your business context
Just adding the ability to fail over and have backups within a region is huge!
Probably enough for most.
What about Fred?
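One way to start looking for the "right" SPOFs is to list, per component, where its instances run and flag anything that a single instance or zone failure would take down. The Python sketch below assumes a hand-written inventory; the component names, zone names, and `find_spofs` itself are hypothetical.

```python
def find_spofs(deployment):
    """Map each at-risk component to the reason it is a single point of failure.

    deployment maps a component name to a list of zone names,
    one entry per running instance.
    """
    findings = {}
    for component, zones in deployment.items():
        if not zones:
            findings[component] = "no instances"
        elif len(zones) == 1:
            findings[component] = "single instance"
        elif len(set(zones)) == 1:
            findings[component] = "all instances in one zone"
    return findings


if __name__ == "__main__":
    # Hypothetical inventory, for illustration only.
    inventory = {
        "web": ["us-east-1a", "us-east-1b"],
        "database": ["us-east-1a"],
        "queue": ["us-east-1b", "us-east-1b"],
    }
    for component, reason in find_spofs(inventory).items():
        print(f"{component}: {reason}")
```

Which findings are worth fixing still depends on the business context; the script only makes the candidates visible.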
During an outage, communicating the right things at the right time:
hard. But not that hard.
Three Tenets Revisited
Tenet #1: Scripted Repeatability
Tenet #2: SPOF Elimination
Tenet #3: Clear-Cut Communication