Retrospective from a startup built in the cloud: top three big lessons learned from the AWS outage...

description

All about the April 2011 AWS outage, its causes, effects and ways to mitigate the same sorts of issues in the future.

Transcript of Retrospective from a startup built in the cloud: top three big lessons learned from the AWS outage...

Page 1: Retrospective from a startup built in the cloud: top three big lessons learned from the AWS outage on 4.21.11

@JPMALEK

04/13/2023 1

Retrospective from a startup built in the cloud: top 3 big lessons from the AWS outage on 04.21.2011, plus 4,369 other smaller ones

Page 2: Retrospective from a startup built in the cloud: top three big lessons learned from the AWS outage on 4.21.11

What a country : entrepreneurial resiliency

Page 3: Retrospective from a startup built in the cloud: top three big lessons learned from the AWS outage on 4.21.11

“robust systems: highly fault-tolerant, on or off grid. e.g.: our culture wrt entrepreneurs, AWS, the BD API”

(true story)

Page 4: Retrospective from a startup built in the cloud: top three big lessons learned from the AWS outage on 4.21.11

Boom

Page 5: Retrospective from a startup built in the cloud: top three big lessons learned from the AWS outage on 4.21.11

me: previous startup

teams in 3 countries

highly transactional system

MS tech: IIS/MS SQL Server

co-located, leased/owned hardware

0% in cloud

$75M/yearly rev

Page 6: Retrospective from a startup built in the cloud: top three big lessons learned from the AWS outage on 4.21.11

me: current startup

systems 100% on AWS

99% free/open-source software

standing on the shoulders of giants

Page 7: Retrospective from a startup built in the cloud: top three big lessons learned from the AWS outage on 4.21.11

What Happened: Regions and Zones

US-WEST: zones A, B, C, D

US-EAST: zones A, B, C, D

Page 8: Retrospective from a startup built in the cloud: top three big lessons learned from the AWS outage on 4.21.11

What Happened in us-east

It’s all about the EBS (Elastic Block Store). Apologies for the artistic license, AWS.

[Diagram of US-EAST, labeled: Region; Zones A, B, C, D; Control plane services; EBS Cluster]

Page 9: Retrospective from a startup built in the cloud: top three big lessons learned from the AWS outage on 4.21.11

What Happened in us-east

It’s all about the EBS (Elastic Block Store). Apologies for the artistic license, AWS.

EBS Cluster: ‘re-mirroring storm’

Control plane services: thread-starved

Region/Zones: regional API brown-out

Page 10: Retrospective from a startup built in the cloud: top three big lessons learned from the AWS outage on 4.21.11

fault tolerance: 3 to 47 important failearnings

and 4,369 less important ones

Page 11: Retrospective from a startup built in the cloud: top three big lessons learned from the AWS outage on 4.21.11

in the context of our startup, of course

YMMV depending on velocity

Page 12: Retrospective from a startup built in the cloud: top three big lessons learned from the AWS outage on 4.21.11

Ruger

Page 13: Retrospective from a startup built in the cloud: top three big lessons learned from the AWS outage on 4.21.11

The Ruger Fault Equivalency

time = money

fault tolerance = time² - risk tolerance

Also known as:

'Fast, good and cheap: pick two'

Page 14: Retrospective from a startup built in the cloud: top three big lessons learned from the AWS outage on 4.21.11

system design philosophy: leverage proven, open-source tech in the cloud to build a scalable, reliable, secure operational foundation, quickly

Page 15: Retrospective from a startup built in the cloud: top three big lessons learned from the AWS outage on 4.21.11

So how do you achieve the right level of fault tolerance in the cloud?

3 tenets

Page 16: Retrospective from a startup built in the cloud: top three big lessons learned from the AWS outage on 4.21.11

Tenet #1 : Scripted Repeatability

Tenet #2 : SPOF Elimination

Tenet #3 : Clear-Cut Communication

Page 17: Retrospective from a startup built in the cloud: top three big lessons learned from the AWS outage on 4.21.11

Tenet #1

prepare a fault-tolerant foundation with scripted repeatability

aka automation

Page 18: Retrospective from a startup built in the cloud: top three big lessons learned from the AWS outage on 4.21.11

Tenet #1 : scripted repeatability

from the start: script the non-interactive install of your tools and OS

custom AMI

Debian: great package management

based on Eric Hammond’s work: http://alestic.com/
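
A minimal sketch of what a scripted, non-interactive install can look like on Debian. The function name and the package names are hypothetical and not taken from the talk; the only real pieces are `apt-get install -y` and the `DEBIAN_FRONTEND=noninteractive` environment variable. The script only builds and prints the command, so it is safe to run on any machine.

```python
import os
import shlex


def build_install_command(packages):
    """Return (argv, env) for an unattended apt-get install (hypothetical helper)."""
    if not packages:
        raise ValueError("at least one package is required")
    env = dict(os.environ)
    # Tell debconf not to ask questions while packages are configured.
    env["DEBIAN_FRONTEND"] = "noninteractive"
    # -y answers "yes" to apt-get's own confirmation prompt.
    argv = ["apt-get", "install", "-y", *packages]
    return argv, env


if __name__ == "__main__":
    # Package names are placeholders for whatever your stack needs.
    argv, env = build_install_command(["nginx", "mysql-client"])
    # Print instead of executing; pass argv and env to subprocess.run on a real host.
    print("DEBIAN_FRONTEND=" + env["DEBIAN_FRONTEND"], shlex.join(argv))
```

Keeping the command in a function makes it possible to review and test the install list before it is baked into a custom AMI.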

Page 19: Retrospective from a startup built in the cloud: top three big lessons learned from the AWS outage on 4.21.11

Tenet #1 : scripted repeatability

which will allow you to script the setup/tear-down of your stack
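
One way to picture a scripted setup/tear-down, as a sketch only: register each tier with a setup and a tear-down action, bring tiers up in order, and take them down in reverse. The `StackScript` class, the tier names, and the logging callbacks are all hypothetical stand-ins; a real script would call your cloud provider's tooling where this one appends to a list.

```python
class StackScript:
    """Runs setup steps in order and tear-down steps in reverse order."""

    def __init__(self):
        self._steps = []  # (name, setup_fn, teardown_fn) in registration order
        self._done = []   # steps whose setup has succeeded so far

    def add(self, name, setup, teardown):
        self._steps.append((name, setup, teardown))

    def setup(self):
        for step in self._steps:
            _, setup_fn, _ = step
            try:
                setup_fn()
            except Exception:
                # Undo whatever was already brought up, then re-raise.
                self.teardown()
                raise
            self._done.append(step)

    def teardown(self):
        # Pop from the end so the last tier started is the first one stopped.
        while self._done:
            _, _, teardown_fn = self._done.pop()
            teardown_fn()


def build_demo_stack(log):
    """Build a three-tier demo stack whose actions only write to `log`."""
    stack = StackScript()
    for tier in ("database", "app-server", "load-balancer"):
        stack.add(
            tier,
            setup=lambda t=tier: log.append("up " + t),
            teardown=lambda t=tier: log.append("down " + t),
        )
    return stack


if __name__ == "__main__":
    log = []
    stack = build_demo_stack(log)
    stack.setup()
    stack.teardown()
    print(log)
```

Because a failed setup rolls back the tiers already started, a half-built stack does not linger after an error.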

Page 20: Retrospective from a startup built in the cloud: top three big lessons learned from the AWS outage on 4.21.11

Tenet #1 : scripted repeatability

which will allow you to script system tests

integrity (3-4K tests)

performance (30-40K tests)

load, capacity (2-4M requests)

Page 21: Retrospective from a startup built in the cloud: top three big lessons learned from the AWS outage on 4.21.11

Tenet #1 : scripted repeatability

A/B system test results : MySQL Percona Upgrade

Page 22: Retrospective from a startup built in the cloud: top three big lessons learned from the AWS outage on 4.21.11

That’s how 1 person set up and managed a network of 90+/- server instances for 1.5 years, while serving various other roles, without having to leave their chair

try that with real hardware

Page 23: Retrospective from a startup built in the cloud: top three big lessons learned from the AWS outage on 4.21.11

Tenet #2 : SPOF Elimination

We don’t need no stinkin single points of failure.

Page 24: Retrospective from a startup built in the cloud: top three big lessons learned from the AWS outage on 4.21.11

Tenet #2 : SPOF Elimination

SPOF Examples:

Cloud Provider

Region

Zone

Load Balancer

App Server

Database

Fred

Page 25: Retrospective from a startup built in the cloud: top three big lessons learned from the AWS outage on 4.21.11

Tenet #2 : SPOF Elimination

Cloud Provider fail-over?

e.g. AWS -> Rackspace

Page 26: Retrospective from a startup built in the cloud: top three big lessons learned from the AWS outage on 4.21.11

Tenet #2 : SPOF Elimination

Region fail-over?

e.g. us-east -> us-west within AWS

Nah.

Page 27: Retrospective from a startup built in the cloud: top three big lessons learned from the AWS outage on 4.21.11

Tenet #2 : SPOF Elimination

Zone fail-over?

Yes.

US-WEST: zones A, B, C, D

US-EAST: zones A, B, C, D

Page 28: Retrospective from a startup built in the cloud: top three big lessons learned from the AWS outage on 4.21.11

Tenet #2 : SPOF Elimination

Zone fail-over best practices: are you using auto-scaling?

no: distribute server instances evenly between 2 or more zones

yes: trigger scaling on network I/O or custom metrics
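
For the "no auto-scaling" case, an even spread across zones can be sketched as a round-robin assignment. The function name, the instance names, and the choice of two zones are assumptions made for illustration; the zone identifiers follow the usual AWS naming pattern but are only examples.

```python
from collections import Counter
from itertools import cycle


def assign_zones(instance_names, zones):
    """Assign each instance to a zone in round-robin order (hypothetical helper)."""
    if len(zones) < 2:
        raise ValueError("need at least two zones to survive a zone outage")
    zone_cycle = cycle(zones)
    # Round-robin keeps the per-zone counts within one instance of each other.
    return {name: next(zone_cycle) for name in instance_names}


if __name__ == "__main__":
    names = ["app-%d" % i for i in range(1, 6)]  # placeholder instance names
    placement = assign_zones(names, ["us-east-1a", "us-east-1b"])
    for name, zone in placement.items():
        print(name, zone)
    print(Counter(placement.values()))
```

With this placement, losing one zone takes out at most about half of the instances rather than all of them.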

Page 29: Retrospective from a startup built in the cloud: top three big lessons learned from the AWS outage on 4.21.11

Tenet #2 : SPOF Elimination

Load-balancer (ELB), app server, database fail-over?

Yes.

Page 30: Retrospective from a startup built in the cloud: top three big lessons learned from the AWS outage on 4.21.11

Tenet #2 : SPOF Elimination

So it’s actually all about reduction of the right SPOFs for your business context

Just adding the ability to fail over and have backups within a region is huge!

Probably enough for most.

What about Fred?

Page 31: Retrospective from a startup built in the cloud: top three big lessons learned from the AWS outage on 4.21.11

Tenet #3 : Clear-Cut Communication

Page 32: Retrospective from a startup built in the cloud: top three big lessons learned from the AWS outage on 4.21.11

Tenet #3 : Clear-cut Communication

During an outage, communicating the right things at the right time: hard.

But not that hard.

Page 33: Retrospective from a startup built in the cloud: top three big lessons learned from the AWS outage on 4.21.11

Three Tenets Revisited

Tenet #1 : Scripted Repeatability

Tenet #2 : SPOF Elimination

Tenet #3 : Clear-Cut Communication

Page 34: Retrospective from a startup built in the cloud: top three big lessons learned from the AWS outage on 4.21.11

Thank You

Our AWS account rep: "Dylan Peterson" <[email protected]>

(notes attached to this slide)