ZERO-DOWNTIME DATACENTER FAILOVERS - luka.io fileWHO? Luka Kladaric formerly: web developer for 10+...

64
ZERO-DOWNTIME DATACENTER FAILOVERS (SWITCHING HOSTING PROVIDERS FOR DUMMIES) Luka Kladaric @ Python Belgrade Jan 2018.

Transcript of ZERO-DOWNTIME DATACENTER FAILOVERS - luka.io fileWHO? Luka Kladaric formerly: web developer for 10+...

ZERO-DOWNTIME DATACENTER FAILOVERS(SWITCHING HOSTING PROVIDERS FOR DUMMIES)

Luka Kladaric @ Python Belgrade Jan 2018.

WHO?Luka Kladaric

formerly:web developer for 10+ years

now:architecture, infrastructure & security consultant

also a startup founder and remote work evangelist

2 — Luka Kladaric @ Python Belgrade Jan 2018.

migrating an entire company's infrastructure

from Rackspace to Amazon AWS

3 — Luka Kladaric @ Python Belgrade Jan 2018.

60 virtual machines

3 baremetal boxes (db)

assorted networking equipment

4 — Luka Kladaric @ Python Belgrade Jan 2018.

the migration took 2 months to execute

but a year and a half to prepare...

5 — Luka Kladaric @ Python Belgrade Jan 2018.

WHY?6 — Luka Kladaric @ Python Belgrade Jan 2018.

hand-crafted build server, unreproducible

jobs for 3 Android apps......each completely different

7 — Luka Kladaric @ Python Belgrade Jan 2018.

massive monolthic 10 GB git repository

touching anything triggers a rollout of everything

no concept of "stable"

8 — Luka Kladaric @ Python Belgrade Jan 2018.

half the servers are not deployable from scratch

or their deployability is unknown

9 — Luka Kladaric @ Python Belgrade Jan 2018.

no local dev environments

half the company has to VPN into production to get any work done

everyone works directly on production systems

no db schema migration system == no db versioning

10 — Luka Kladaric @ Python Belgrade Jan 2018.

horrible code review tool (Rietveld)

11 — Luka Kladaric @ Python Belgrade Jan 2018.

same mysql account used by everyone everywhere

>

>

12 — Luka Kladaric @ Python Belgrade Jan 2018.

same mysql account used by everyone everywhere

that mysql account is "root"

>

13 — Luka Kladaric @ Python Belgrade Jan 2018.

same mysql account used by everyone everywhere

that mysql account is "root"

that mysql db is 1.5 TB big

14 — Luka Kladaric @ Python Belgrade Jan 2018.

no access to LB config

has a bunch of magic in it

changes often result in issues and outages

15 — Luka Kladaric @ Python Belgrade Jan 2018.

no server metrics / perfdata

no idea if overprovisioned and by how much

16 — Luka Kladaric @ Python Belgrade Jan 2018.

no access to disaster recovery instancein case the primary DC went down

(access goes through primary DC)

17 — Luka Kladaric @ Python Belgrade Jan 2018.

RACKSPACE WAS REALLY TERRIBLEa constant pain to deal with

unexpected outages of never explained causes

unresponsive support team

zero flexibility

18 — Luka Kladaric @ Python Belgrade Jan 2018.

HOW LONG WOULD IT TAKE TO MIGRATE THIS?optimistically:

conservatively:

realistically:

19 — Luka Kladaric @ Python Belgrade Jan 2018.

HOW LONG WOULD IT TAKE TO MIGRATE THIS?optimistically: 3 months

conservatively:

realistically:

20 — Luka Kladaric @ Python Belgrade Jan 2018.

HOW LONG WOULD IT TAKE TO MIGRATE THIS?optimistically: 3 months

conservatively: 6-9 months (of dedicated work)

realistically:

21 — Luka Kladaric @ Python Belgrade Jan 2018.

HOW LONG WOULD IT TAKE TO MIGRATE THIS?optimistically: 3 months

conservatively: 6-9 months (of dedicated work)

realistically: a year (with interruptions)

22 — Luka Kladaric @ Python Belgrade Jan 2018.

NO LEADERSHIP BUY-IN2 failed attempts to get approval

Infrastructure team makes a pact"Do Things The Right Way From Now On"

mask cleanup work with ongoing maintenance

23 — Luka Kladaric @ Python Belgrade Jan 2018.

PLOT TWISTRACKSPACE STARTS FALLING APART

24 — Luka Kladaric @ Python Belgrade Jan 2018.

NEW ESTIMATE

19 man-days

(after final push for preparation)

25 — Luka Kladaric @ Python Belgrade Jan 2018.

HOSTING COST ESTIMATE

before: $18k

after: $6k

savings: $12k (-66%!)

26 — Luka Kladaric @ Python Belgrade Jan 2018.

GOT APPROVAL!27 — Luka Kladaric @ Python Belgrade Jan 2018.

Actually executed in 25-30 man-days

over 2 months

28 — Luka Kladaric @ Python Belgrade Jan 2018.

HOW?29 — Luka Kladaric @ Python Belgrade Jan 2018.

build server rebuilt from scratch

deployed from Ansible

all build jobs defined in codewith inheritance and templating

tweaking jobs through UI disabled

30 — Luka Kladaric @ Python Belgrade Jan 2018.

monolithic git repository split upinto 40 smaller repositories

changes trigger rollout onlyon affected project

31 — Luka Kladaric @ Python Belgrade Jan 2018.

all servers rebuilt and redeployed with Ansible

"upgrading the fleet to Ubuntu 16.04" ;)

32 — Luka Kladaric @ Python Belgrade Jan 2018.

better code review tool (Phabricator)

allows code ownershiprules of engagement per repository

don't ask me about Phabricator(it's amazing)

33 — Luka Kladaric @ Python Belgrade Jan 2018.

most dev work doesn't require VPN any more

but even if it did...

34 — Luka Kladaric @ Python Belgrade Jan 2018.

no more shared mysql root account (RIP)

no write access to production database(to people *or* their software)

local dev environments!(far from perfect)

35 — Luka Kladaric @ Python Belgrade Jan 2018.

db schema migration system based on gh-ost

sql scripts -> code review -> git -> web ui to run

36 — Luka Kladaric @ Python Belgrade Jan 2018.

all LB logic slowly moved to our own haproxies

haproxy configuration auto-generated from Ansible

makes it easy to shuffle things around

37 — Luka Kladaric @ Python Belgrade Jan 2018.

all apps slowly migrated to be served through haproxies

avoiding Rackspace LB magic

38 — Luka Kladaric @ Python Belgrade Jan 2018.

metrics, metrics, metrics

(Datadog ftw)

39 — Luka Kladaric @ Python Belgrade Jan 2018.

TIME SPENT:A YEAR AND A HALF

40 — Luka Kladaric @ Python Belgrade Jan 2018.

AND ALL BEFORE WE HADAPPROVAL TO DO ANYTHING

;)41 — Luka Kladaric @ Python Belgrade Jan 2018.

"yeah yeah yeah...

how did you do the actual migration?"

42 — Luka Kladaric @ Python Belgrade Jan 2018.

VPN bridge between AWS and RS

~20 MB/s, ~20ms ping

good enough to treat as a "local" connectionfor shorter periods of time

43 — Luka Kladaric @ Python Belgrade Jan 2018.

mysql master-master replication between DCs

this was a massive pain to achieve with a 1.5TB db

44 — Luka Kladaric @ Python Belgrade Jan 2018.

recreate the entire fleet in AWS

45 — Luka Kladaric @ Python Belgrade Jan 2018.

app servers in both DCs (java+python)

46 — Luka Kladaric @ Python Belgrade Jan 2018.

haproxies in both DCs

aware of app servers in both DCspreferring local onesbut falling back to remote if necessary

"no request left behind"

47 — Luka Kladaric @ Python Belgrade Jan 2018.

CloudFlare used for near-instant DNS failover

but even stray requests will get handled

48 — Luka Kladaric @ Python Belgrade Jan 2018.

RESULTS49 — Luka Kladaric @ Python Belgrade Jan 2018.

core production migrated in days

internal tools migrated within a week or two

developer tools migrated within a month(git hosting, build server, etc)

obscure legacy services migrated within 2 months

50 — Luka Kladaric @ Python Belgrade Jan 2018.

all hardware at Rackspacedecomissioned within 3 months

51 — Luka Kladaric @ Python Belgrade Jan 2018.

sideffect: actual HA instead of fake HA

old "two or more of everything" approachtranslated well into Availability Zones

52 — Luka Kladaric @ Python Belgrade Jan 2018.

cost estimate? right on the money.

once the dust settledthe $18k/mo bill from RS

was replacedwith a $6k/mo bill from AWS

53 — Luka Kladaric @ Python Belgrade Jan 2018.

AND IT WAS GOOD54 — Luka Kladaric @ Python Belgrade Jan 2018.

55 — Luka Kladaric @ Python Belgrade Jan 2018.

The moral of this story is:

don't wait for permission to do your job right.

56 — Luka Kladaric @ Python Belgrade Jan 2018.

57 — Luka Kladaric @ Python Belgrade Jan 2018.

1. If you see something broken, fix it

57 — Luka Kladaric @ Python Belgrade Jan 2018.

1. If you see something broken, fix it

2. If you don't have time to fix it - write it down

57 — Luka Kladaric @ Python Belgrade Jan 2018.

1. If you see something broken, fix it

2. If you don't have time to fix it - write it down

3. But do come back to it when you can steal a minute

57 — Luka Kladaric @ Python Belgrade Jan 2018.

1. If you see something broken, fix it

2. If you don't have time to fix it - write it down

3. But do come back to it when you can steal a minute

4. Even if it takes months to make progress

57 — Luka Kladaric @ Python Belgrade Jan 2018.

The team was well aware of how broken things were.

If we pushed for it to be a single massiveproject, it would've never happened.

58 — Luka Kladaric @ Python Belgrade Jan 2018.

QUESTIONS?Luka Kladaric @ Python Belgrade Jan 2018.

THANK YOU!Luka Kladarictwitter: @[email protected]

Luka Kladaric @ Python Belgrade Jan 2018.