Architecting for the Cloud: Hoping for the Best, Prepared for the Worst

AWS Loft: Behind the scenes with Cotap


Infrastructure as Code


● Current state

● Past decisions

● Tracking the evolution

● CloudFormation

● Design -> JSON

● Version Control!
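To illustrate "Design -> JSON", here is a minimal sketch that builds a CloudFormation template as a plain Python dictionary and serializes it deterministically, so that diffs in version control show only real changes. The resource name `WebServerGroup`, the parameter names, and the sizes are assumptions made for this example; they are not taken from the talk.

```python
import json


def build_template(min_size: int, max_size: int) -> dict:
    """Return a minimal CloudFormation template as a plain dict.

    Resource and parameter names are hypothetical, chosen for illustration.
    """
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": "Sketch: one Auto Scaling group described as code",
        "Parameters": {
            "LaunchTemplateId": {"Type": "String"},
            "Subnets": {"Type": "List<AWS::EC2::Subnet::Id>"},
        },
        "Resources": {
            "WebServerGroup": {
                "Type": "AWS::AutoScaling::AutoScalingGroup",
                "Properties": {
                    # CloudFormation expects MinSize/MaxSize as strings
                    "MinSize": str(min_size),
                    "MaxSize": str(max_size),
                    "VPCZoneIdentifier": {"Ref": "Subnets"},
                    "LaunchTemplate": {
                        "LaunchTemplateId": {"Ref": "LaunchTemplateId"},
                        "Version": "1",
                    },
                },
            }
        },
    }


def render(template: dict) -> str:
    # Sorted keys and fixed indentation keep the output stable between runs,
    # which keeps commit diffs small and reviewable.
    return json.dumps(template, indent=2, sort_keys=True) + "\n"


if __name__ == "__main__":
    print(render(build_template(min_size=2, max_size=6)))
```

The rendered file is what would be committed; any change to the infrastructure then shows up as a reviewable diff.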


Rule #1

All changes have to be under Version Control

Design for automation


● AutoScalingGroups

● Hardware: CloudFormation

● Software: Configuration management

● Cattle not Cats

Rule #2

No instances should be launched manually.
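Rule #2 can be checked mechanically. The sketch below assumes instance records shaped like the entries returned by the EC2 DescribeInstances API (`InstanceId` plus a `Tags` list of `Key`/`Value` pairs) and relies on the `aws:autoscaling:groupName` tag that Auto Scaling adds to the instances it launches. The function name and the sample data are hypothetical.

```python
# Tag that Auto Scaling sets on every instance it launches.
ASG_TAG = "aws:autoscaling:groupName"


def manually_launched(instances: list[dict]) -> list[str]:
    """Return the IDs of instances that carry no Auto Scaling group tag."""
    offenders = []
    for inst in instances:
        tag_keys = {tag["Key"] for tag in inst.get("Tags", [])}
        if ASG_TAG not in tag_keys:
            offenders.append(inst["InstanceId"])
    return offenders


if __name__ == "__main__":
    # Hypothetical sample data, not real instance IDs.
    sample = [
        {"InstanceId": "i-aaa", "Tags": [{"Key": ASG_TAG, "Value": "web"}]},
        {"InstanceId": "i-bbb", "Tags": [{"Key": "Name", "Value": "test-box"}]},
        {"InstanceId": "i-ccc"},
    ]
    print(manually_launched(sample))
```

A check like this could run on a schedule and report to the chat channel rather than page anyone.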

Monitoring & Alerting


● Cost of

o Interruptions

o Waking somebody up

● Channels

● Self-healing infrastructure

● External monitoring

● Page only when critical


Situation                        Channel                  Page

Disk full 60%                    Chat, Email              ✗

Disk full 90%                    Chat, Email, PagerDuty   ✓

Chef not running for > 30m       Chat, Email              ✗

Redis not running for > 3 x 5s   Chat, Email, PagerDuty   ✓

ElasticSearch N-1                Chat, Email              ✗

ElasticSearch N-2                Chat, Email, PagerDuty   ✓
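The table above can be expressed as data, which keeps the "page only when critical" policy reviewable in version control. This is a minimal sketch: the situation keys, the `Rule` class, and the fallback for unknown situations are assumptions made for illustration, not the speaker's implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    situation: str
    page: bool  # True means the on-call engineer gets paged


# Every alert goes to chat and email; only critical ones also page.
BASE_CHANNELS = ("chat", "email")

# Hypothetical keys mirroring the rows of the table.
RULES = [
    Rule("disk_full_60", page=False),
    Rule("disk_full_90", page=True),
    Rule("chef_not_running_30m", page=False),
    Rule("redis_down_3x5s", page=True),
    Rule("elasticsearch_n_minus_1", page=False),
    Rule("elasticsearch_n_minus_2", page=True),
]


def channels_for(situation: str) -> tuple[str, ...]:
    """Return the channels an alert for this situation is sent to."""
    for rule in RULES:
        if rule.situation == situation:
            return BASE_CHANNELS + (("pagerduty",) if rule.page else ())
    # Assumption in this sketch: unknown situations notify but never page.
    return BASE_CHANNELS


if __name__ == "__main__":
    for rule in RULES:
        print(rule.situation, "->", ", ".join(channels_for(rule.situation)))
```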



Platform to fail


● Easy creation of temporary “Stacks”

● Branches can get their own hardware

● Clients can talk to a branch

● QA happens on Sandbox

● Exact copy of Production

● Scale up/down based on needs

● Different Region (us-east-1)
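Giving each branch its own temporary stack needs a predictable stack name. A minimal sketch, assuming stacks are named after the git branch: CloudFormation stack names may contain only letters, digits, and hyphens, must start with a letter, and are limited to 128 characters. The `sandbox` prefix and the function name are hypothetical.

```python
import re

MAX_STACK_NAME = 128  # CloudFormation stack name length limit


def stack_name_for_branch(branch: str, prefix: str = "sandbox") -> str:
    """Turn a git branch name into a valid CloudFormation stack name."""
    # Collapse every run of disallowed characters into a single hyphen.
    slug = re.sub(r"[^A-Za-z0-9]+", "-", branch).strip("-")
    if not slug:
        raise ValueError(f"branch {branch!r} has no usable characters")
    # The alphabetic prefix guarantees the name starts with a letter.
    name = f"{prefix}-{slug}"
    return name[:MAX_STACK_NAME].rstrip("-")


if __name__ == "__main__":
    print(stack_name_for_branch("feature/login_v2"))
```

Because the name is derived from the branch, the same function can be used to find and delete the stack once the branch is merged.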


Rule #3

All changes have to go through Sandbox.

Rule #4

Production is just a more powerful Sandbox
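One way to read Rule #4 is that both environments come from the same template and differ only in parameter values. The sketch below assumes that reading; the parameter names, instance types, and sizes are hypothetical.

```python
# Hypothetical parameter sets for one shared template.
SANDBOX = {"InstanceType": "t3.small", "MinSize": 1, "MaxSize": 2}
PRODUCTION = {"InstanceType": "m5.large", "MinSize": 3, "MaxSize": 12}


def check_same_shape(sandbox: dict, production: dict) -> None:
    """Raise if a parameter is set in one environment but not the other."""
    mismatched = set(sandbox) ^ set(production)
    if mismatched:
        raise ValueError(
            f"parameters not set in both environments: {sorted(mismatched)}"
        )


def differences(sandbox: dict, production: dict) -> dict:
    """Return {name: (sandbox_value, production_value)} for differing values."""
    check_same_shape(sandbox, production)
    return {
        name: (sandbox[name], production[name])
        for name in sorted(sandbox)
        if sandbox[name] != production[name]
    }


if __name__ == "__main__":
    for name, (small, big) in differences(SANDBOX, PRODUCTION).items():
        print(f"{name}: sandbox={small} production={big}")
```

If the only differences printed are sizes and counts, Production really is just a more powerful Sandbox.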

Disaster Recovery


● Multi-AZs

● Traffic routing

● Multi-Regions (S3 too)

● AutoScalingGroups Min:1 Max:1

● Off-site backups (VPN + Disks)

● RPO + RTO
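RPO (Recovery Point Objective) bounds how much data, measured in time, may be lost; RTO (Recovery Time Objective) bounds how long a restore may take. A minimal sketch of checking a backup schedule against an RPO follows; the function names and sample times are assumptions made for illustration.

```python
from datetime import datetime, timedelta


def worst_data_loss(backups: list[datetime], now: datetime) -> timedelta:
    """Largest window of data that could be lost, given backup completion times.

    The gap between the latest backup and `now` counts as a window too.
    """
    if not backups:
        raise ValueError("no backups recorded")
    points = sorted(backups) + [now]
    return max(later - earlier for earlier, later in zip(points, points[1:]))


def meets_rpo(backups: list[datetime], now: datetime, rpo: timedelta) -> bool:
    return worst_data_loss(backups, now) <= rpo


def meets_rto(measured_restore: timedelta, rto: timedelta) -> bool:
    # measured_restore would come from timing an actual recovery drill.
    return measured_restore <= rto


if __name__ == "__main__":
    start = datetime(2024, 1, 1)
    backups = [start + timedelta(hours=h) for h in (0, 6, 12)]
    now = start + timedelta(hours=15)
    print("worst window:", worst_data_loss(backups, now))
    print("meets 4h RPO:", meets_rpo(backups, now, timedelta(hours=4)))
```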

Security


● MFA

● Public key distribution

● Root key rotation

● Private/Public Subnets

● ACLs/Security Groups

● Update AMIs

● Trusted Advisor!
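Key rotation is easier to enforce when stale keys are flagged automatically. The sketch below assumes key records shaped like the metadata returned by the IAM ListAccessKeys API (`AccessKeyId`, `Status`, `CreateDate`); the 90-day limit and the sample key IDs are assumptions, not values from the talk.

```python
from datetime import datetime, timedelta, timezone


def keys_due_for_rotation(
    keys: list[dict],
    now: datetime,
    max_age: timedelta = timedelta(days=90),  # assumed policy, adjust as needed
) -> list[str]:
    """Return the IDs of active keys older than max_age."""
    return [
        key["AccessKeyId"]
        for key in keys
        if key.get("Status") == "Active" and now - key["CreateDate"] > max_age
    ]


if __name__ == "__main__":
    utc = timezone.utc
    # Hypothetical sample data.
    sample = [
        {"AccessKeyId": "KEY-OLD", "Status": "Active",
         "CreateDate": datetime(2024, 1, 1, tzinfo=utc)},
        {"AccessKeyId": "KEY-NEW", "Status": "Active",
         "CreateDate": datetime(2024, 5, 1, tzinfo=utc)},
    ]
    print(keys_due_for_rotation(sample, now=datetime(2024, 6, 1, tzinfo=utc)))
```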


Scaling


● Preemptive

● Automatic

● Vertically

● Horizontally

● Bottlenecks
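For automatic horizontal scaling, a simple proportional rule resizes the group so that the average metric returns to a target value, clamped to the group's minimum and maximum. This is a sketch of that generic rule, not a description of the scaling policy used in the talk; the function and parameter names are assumptions.

```python
import math


def desired_capacity(
    current: int, metric: float, target: float, minimum: int, maximum: int
) -> int:
    """Proportional scaling: instance count that brings `metric` back to `target`.

    `metric` is the current average per instance (for example CPU percent).
    """
    if current < 1:
        raise ValueError("current capacity must be at least 1")
    if target <= 0:
        raise ValueError("target must be positive")
    wanted = math.ceil(current * metric / target)
    # Never leave the bounds configured on the group.
    return max(minimum, min(maximum, wanted))


if __name__ == "__main__":
    # 4 instances at 90% CPU with a 60% target
    print(desired_capacity(current=4, metric=90.0, target=60.0, minimum=2, maximum=10))
```

Setting `minimum` and `maximum` to the same value turns the rule into a fixed-size group that is only ever replaced, never resized.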


Cost Control


● Tags

o Role

o Environment

● Cost explorer

● Threshold alerting

● Share monthly

● Export to CSV

● Right-Scale (ASG)
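Once resources are tagged by Role and Environment, an exported CSV can be summed per tag pair for the monthly report. A minimal sketch, assuming a simplified CSV with hypothetical `Role`, `Environment`, and `Cost` columns; a real billing export has a different layout and would need its own column mapping.

```python
import csv
import io
from collections import defaultdict
from decimal import Decimal


def cost_by_tags(csv_text: str) -> dict[tuple[str, str], Decimal]:
    """Sum the Cost column per (Role, Environment) pair."""
    totals: dict[tuple[str, str], Decimal] = defaultdict(Decimal)
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Untagged spend is kept visible instead of being dropped.
        key = (row["Role"] or "untagged", row["Environment"] or "untagged")
        totals[key] += Decimal(row["Cost"])
    return dict(totals)


if __name__ == "__main__":
    # Hypothetical sample data.
    sample = (
        "Role,Environment,Cost\n"
        "web,production,120.50\n"
        "web,sandbox,20.25\n"
        "db,production,300.00\n"
        "web,production,9.50\n"
        ",sandbox,1.00\n"
    )
    for (role, env), cost in sorted(cost_by_tags(sample).items()):
        print(f"{role:10} {env:12} {cost}")
```

`Decimal` is used instead of `float` so that cent amounts add up exactly.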


4 rules of 5 nines.

● All changes have to be under Version Control

● No instance should be launched manually

● All changes are deployed to Sandbox first

● Production is just a more powerful Sandbox

Questions?

t: @martincozzi

e: [email protected]

engineering.cotap.com
