(APP307) Leverage the Cloud with a Blue/Green Deployment Architecture | AWS re:Invent 2014
© 2014 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.
November 13, 2014 | Las Vegas
APP307 - Leveraging the Cloud with a Blue-Green Deployment Architecture
Jim Plush, Sr. Director of Engineering, CrowdStrike - @jimplush
Sean Berry, Principal Software Engineer, CrowdStrike - @schleprachaun
About us
• Founded in September 2011
• ~150 employees
• Detection/prevention
– Advanced cyber threats
– Real-time detection
– Real-time analytics
Cybersecurity startup
Published experts
Event Stream Processing
[Diagram: a sensor on a machine targeted by malicious malware streams events to the "CLOUD" for event stream processing; an example event stream follows.]
{"date": "11/14/2014 08:03", "path": "C:\WINDOWS\Programs\Word.exe", "id": 49, "parentId": 48}
{"date": "11/14/2014 08:03", "path": "C:\WINDOWS\System32\cmd.exe", "id": 50, "parentId": 49}
{"date": "11/14/2014 08:03", "path": "C:\WINDOWS\Programs\Word.exe", "id": 51, "parentId": 50}
DNS Lookup
{"date": "11/14/2014 08:03", "dns": "badapple.cc", "id": 52, "parentId": 51}
TCP Connect
{"date": "11/14/2014 08:03", "tcp_connect": "10.10.10.10", "id": 53, "parentId": 51}
FTP Download
{"date": "11/14/2014 08:03", "download": "10.10.10.10/badstuff.exe", "id": 54, "parentId": 51}
Document Exfiltration
{"date": "11/14/2014 08:03", "scp": "C:\Documents\TradeSecrets.doc", "id": 55, "parentId": 54}
Tactical UI
Data ingestion
[Architecture diagram: sensors connect through an external-service Elastic Load Balancing load balancer to termination servers, which feed Content Routers; processor services consume the events, with UI and API services on top and a shared data plane of Kafka, DynamoDB, Redis, Amazon RDS, Amazon Redshift, Amazon Glacier, and Amazon S3 underneath.]
• Fortune 500, Think Tanks, Non-Profits
• 100K+ events per second
– Expected to hit 500K EPS by end of 2015
• Each enterprise customer can generate 2-4 TB of data per day
• Microservice architecture
• Polyglot environment
High scale, big data
Our tech stack is complicated
…but possible because of AWS
Motivation
Solving for the problems
• OMG, all servers need to be patched??
• I'm afraid to restart that service; it's been running for 2 years
• Large rolling restarts
• Deployment fear
– Friday night deploys
• B/G for event processing?
Our primary objectives for deployments
• Minimize customer impact
– Customers should have no indication that
anything has changed
• Maximize engineer’s weekends
– Avoid burnout
• Reduce rollout dependencies
– Everything goes out together: 50+ services, 1000+ VMs
Leveraging AWS
• Programmable data centers
• Nodes are ephemeral
• It should be easier to re-create an environment
than to fix it
— Think like the cloud
What is blue-green?
[Diagram: a router directs traffic to Application v1 (web server + app server); Application v2 is stood up alongside it against a shared database, and the router's connection to v1 is cut over to v2.]
What is blue-green?
• Full cluster BG
– Everything goes out together
– Indiana Jones: “idol switch”
• App-based BG
– Each app or team controls their own
blue-green deployments
Data plane
The data plane: can't blue-green all the things
[Diagram: the blue and green clusters both share a single data plane of Kafka, DynamoDB, Redis, Amazon RDS (PostgreSQL), Amazon Redshift, Amazon Glacier, and Amazon S3.]
When do we deploy?
• Teams deploy end-of-sprint releases together
• Hot fixes and upgrades are performed frequently via rolling-restart deployments
• Early on deployments took an entire day
– Lack of automation
• Deploys today generally take 45 minutes
– Everyone has run a deployment
Sustaining engineer
• Every team member including QA has run
deployments
• Builds confidence, understanding, and
redundancy
• Ensures documentation is up to date and that everything that can be automated is
Sustaining engineer badge of honor
shirt after their tour of duty
Deployment day
• Apt repo synchronized and locked down
• Data plane migrations applied
• “Green” cluster is launched (1000s of machines)
• IT tests run
• Canary customers
• Logging and error checks
• Active-active
• “Blue” marked as inactive, decommissioned
Keys to success
Pro tip: It’s not just flipping load balancers
Keys to success: Automate all the things
• Junior devs should be able to run your deploy system
Keys to success: Instrumentation & Metrics
https://github.com/codahale/metrics
https://github.com/rcrowley/go-metrics
Keys to success: Use a provisioning system
• Chef
• Puppet
• Salt
• baked AMIs
Keys to success: Live integration / regression test suites
[Diagram: a test system sends deterministic input values and verifies the processed state.]
Keys to success: Canary Customers
[Diagram: canary traffic is routed to the v2 app while regular traffic stays on the v1 app.]
Keys to success: Feature Flags
Keys to success: Unified app requirements
Keys to success: Deployment History
"Thank God we have blue-green" – every team member
Implementation
How we blue-green
Elevator pitch on Kafka
• Distributed commit log
• Similar to a message queue
• Allows for replaying messages from earlier in the stream in case of failure
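That replay capability is the property the blue-green flow leans on later; a toy single-partition commit log (illustrative only, not the Kafka API) shows the idea:

```python
class CommitLog:
    """Toy in-memory, single-partition commit log illustrating
    Kafka-style consume and replay semantics."""

    def __init__(self):
        self.messages = []   # append-only message store
        self.offsets = {}    # consumer-group name -> next offset to read

    def append(self, message):
        self.messages.append(message)

    def poll(self, group):
        """Return messages the group hasn't seen yet and advance its offset."""
        start = self.offsets.get(group, 0)
        self.offsets[group] = len(self.messages)
        return self.messages[start:]

    def seek(self, group, offset):
        """Rewind a group so earlier messages are replayed after a failure."""
        self.offsets[group] = offset

log = CommitLog()
log.append({"id": 49})
log.append({"id": 50})
first = log.poll("processors")     # both messages
log.seek("processors", 0)          # failure: rewind to the beginning
replayed = log.poll("processors")  # same messages delivered again
```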
[Architecture diagram: sensors connect through the external-service ELB load balancer to the blue cluster's termination servers and Content Routers; blue processors consume from the "active" Kafka topics in the shared data plane (Kafka, DynamoDB, Redis, Amazon RDS, Amazon Redshift, Amazon Glacier, Amazon S3).]
• Blue is running; normal operation
• Content Routers are writing to the "active" topics in Kafka
• Blue processors read from the "active" topics
It all starts with a running cluster
Main management page for blue-green
Launching new cluster
[Diagram: the green cluster's termination servers, Content Routers, and processors are launched alongside blue; green processors read from the "inactive" topics while blue continues serving sensors on the "active" topics against the shared data plane.]
• Green cluster is launched
• Termination servers are kept out of the ELB load
balancer by failing health checks
• Content Routers write to the “active” topics
• Processors in green read from the “inactive” topics
Sizing the new cluster
Getting the size right
• Sizing of our Auto Scaling groups is determined programmatically
– Admin page allows for setting min / max
– Script determines appropriate desired capacity based on the running cluster
• Launching is then as simple as updating the Auto Scaling groups to the new sizes
from subprocess import Popen, PIPE

def current_counts(region='us-east-1'):
    # Legacy Auto Scaling CLI; a command string requires shell=True
    proc = Popen(
        "as-describe-auto-scaling-groups "
        "--region {} "
        "--max-records=600".format(region),
        shell=True, stdout=PIPE, stderr=PIPE)
    out, err = proc.communicate()
    if err:
        raise Exception(err)
    counts = {}
    for line in out.splitlines():
        if "AUTO-SCALING-GROUP" not in line:
            continue
        parts = line.split()
        group = parts[1]
        current = parts[-2]
        counts[group] = int(current)
    return counts
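The `as-describe-auto-scaling-groups` tool above comes from the since-retired Auto Scaling command-line tools; the same counts can be read today via boto3's `describe_auto_scaling_groups` (a sketch: the parsing assumes the documented response shape, and a large account would also need pagination):

```python
def counts_from_response(response):
    """Extract {Auto Scaling group name: desired capacity} from a
    describe_auto_scaling_groups response dict."""
    return {g["AutoScalingGroupName"]: g["DesiredCapacity"]
            for g in response["AutoScalingGroups"]}

# With live credentials this would be driven by something like:
#   import boto3
#   asg = boto3.client("autoscaling", region_name="us-east-1")
#   counts = counts_from_response(asg.describe_auto_scaling_groups())

sample = {"AutoScalingGroups": [
    {"AutoScalingGroupName": "processor-asg", "DesiredCapacity": 12},
    {"AutoScalingGroupName": "router-asg", "DesiredCapacity": 4},
]}
```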
Tuning size before we launch
Bootstrapping
User data and Chef get things rolling
• Inside out Chef bootstrapping
– Didn’t feel comfortable running `wget … | bash`
• Custom version of Chef installer
– Version of Chef
– Where to find the Chef servers
– Which role to run
– Which environment (dev, integ, blue, green)
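The parameters listed above suggest a user-data payload along these lines (a sketch with a hypothetical installer path and flags; the real "inside out" installer is baked into the image rather than fetched):

```python
def render_user_data(chef_version, chef_server, role, environment):
    """Build instance user-data that hands a baked-in Chef installer
    its version, server, role, and environment (all names illustrative)."""
    return "\n".join([
        "#!/bin/bash",
        "/opt/bootstrap/install_chef.sh "   # hypothetical installer path
        "--version {} --server {} "
        "--role {} --env {}".format(chef_version, chef_server,
                                    role, environment),
    ])

user_data = render_user_data("11.16.4", "chef.internal", "processor", "green")
```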
Testing the new stuff
[Diagram: integration tests connect directly to a green termination server; test traffic flows through green to the "inactive" topics while blue continues serving sensors on the "active" topics against the shared data plane.]
• Test customer(s) are canaried
• Integration test suite is run by connecting to a termination server directly
• Tests pass; then we canary real customers
Canary customers
• Canary information is stored in ZooKeeper
• Fortunately we dogfood our own tech
• This affords us the ability to use ourselves as canaries for new code
• The inactive processing cluster is set to read from the .inactive topics
– The standard Kafka topics with .inactive appended
• The ingestion layer has a watcher on that znode and routes any canaried customer to the .inactive topics
– E.g., regular traffic goes to foo.bar; canary traffic goes to foo.bar.inactive
• When we are ready to test real traffic, we mark several customers as canaries and start the monitoring process to detect any issues
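The routing rule reduces to a suffix check (a sketch with a hypothetical in-memory canary set; the real ingestion layer watches the set in ZooKeeper):

```python
CANARY_CUSTOMERS = {"123", "456"}   # in production, read from a znode

def topic_for(customer_id, base_topic):
    """Route canaried customers to the .inactive variant of a topic."""
    if customer_id in CANARY_CUSTOMERS:
        return base_topic + ".inactive"
    return base_topic

topic_for("123", "foo.bar")   # -> "foo.bar.inactive"
topic_for("999", "foo.bar")   # -> "foo.bar"
```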
[Diagram: the event ingestor routes regular traffic to the active topics, consumed by blue processors, and canary traffic from customers 123 and 456 to the inactive topics, consumed by green processors.]
Let’s canary some customers
That was easy
Testing
IT tests run
• Integration tests are run
– ~3000 tests in total
– Test customer must be "canaried"
• If any tests fail, we triage and determine if it is still possible to move forward
• Testing is only considered done when we are passing 100% — no exceptions!
Sean is mad - we have work to do
Sean is happy - so we all are happy
Trust, but verify!
[Diagram: both clusters run against the shared data plane; green processors consume the inactive topics while blue continues serving sensors on the active topics.]
• Monitor green services
• Verify health of the cluster by inspecting graphical
data and log outputs
• Rerun tests with load
Monitoring
Logging and error checking
• Every server forwards its relevant logs to Splunk
• Several dashboards have been set up with common things to
watch for
• Raw logs are streamed in near real-time and we watch
specifically for log-level ERROR
• This is one of our most important steps, as it gives us the most
insight into the health of the system as a whole
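The ERROR watch is essentially a filter over the streamed raw logs (a sketch; the real monitoring runs as Splunk searches and dashboards):

```python
def error_lines(log_stream):
    """Pick out log-level ERROR entries from a raw log stream."""
    return [line for line in log_stream if " ERROR " in line]

logs = [
    "2014-11-13 08:03:01 INFO processor-2 consumed offset 4912",
    "2014-11-13 08:03:02 ERROR processor-2 failed to deserialize event",
]
error_lines(logs)   # -> just the ERROR entry
```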
Logging / Error Checking
Moving customers over
[Diagram: both blue and green termination servers now pass health checks behind the ELB; sensors are spread across both clusters, and both sets of Content Routers and processors write to and consume the "active" topics.]
• Flip all customers back away from canary
• Activate green cluster
• Event processors and consuming services in blue
and green now write to and consume the “active” topics
• We are in a state of active-active for a few minutes
Each node in the data processing layer has a watcher on a particular znode which tells the environment whether it is active (use standard Kafka topics) or inactive (append .inactive to the topics).
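A minimal simulation of that watcher pattern (the real code registers a ZooKeeper watch via a client library; here the znode-change callback is invoked by hand):

```python
class ProcessorNode:
    """Chooses Kafka topic names based on the active/inactive flag
    delivered by a (simulated) znode watch."""

    def __init__(self, base_topics):
        self.base_topics = base_topics
        self.active = True

    def on_znode_change(self, state):
        # Callback fired when the watched znode's data changes.
        self.active = (state == "active")

    def topics(self):
        suffix = "" if self.active else ".inactive"
        return [t + suffix for t in self.base_topics]

node = ProcessorNode(["events"])
node.topics()                     # -> ["events"]
node.on_znode_change("inactive")
node.topics()                     # -> ["events.inactive"]
```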
[Diagram: active-active. Ingestion writes to both active and inactive topics in Kafka; one processor cluster consumes the active topics, the other the inactive topics.]
When we are ready to make the switch, we start by making the new cluster active and
enter into an active-active state where both processing clusters are doing work.
[Diagram: green is told to switch to active and begins consuming the active topics.]
This is where it is paramount that code is forward compatible, since two different code bases will be doing work simultaneously.
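In practice, forward compatibility means the older (blue) code base must tolerate events written by the newer one: a tolerant reader ignores unknown fields and defaults missing ones. A sketch (field names illustrative):

```python
import json

def parse_event(raw):
    """Tolerant reader: ignore unknown fields, default missing ones."""
    data = json.loads(raw)
    return {
        "id": data["id"],
        "path": data.get("path", ""),  # default when a producer omits it
        # any extra keys written by newer code are simply dropped
    }

old_event = parse_event('{"id": 49, "path": "C:/Word.exe"}')
new_event = parse_event('{"id": 50, "path": "C:/cmd.exe", "severity": 5}')
```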
[Diagram: active-active. Both processor clusters now consume the active topics fed by ingestion.]
However, blue and green are fully partitioned and there is no intercommunication between the clusters. This allows for things like changes in serialization for inter-service communication.
[Diagram: blue and green run fully partitioned, each consuming the active topics through Kafka, with no communication between clusters.]
Flipping the switch
[Diagram: blue's termination servers drop out of the ELB; sensors reconnect to green's termination servers and Content Routers, which write to the "active" topics, while blue's processors switch to draining the "inactive" topic against the shared data plane.]
• We deactivate Blue, which forces Termination Servers in Blue to
fail health checks and all Blue sensors disconnect
• Blue processors switch to read from the “inactive” topic
• Once all consumers of the “inactive” topic have caught up to the
head of the stream, Blue can be decommissioned
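That "caught up" condition amounts to comparing each partition's consumer offset against the log-end offset (a sketch; with a real Kafka client these values would come from committed offsets and an end-offsets query):

```python
def caught_up(consumer_offsets, end_offsets):
    """True once every partition's consumer offset has reached the
    head of the inactive topic's log."""
    return all(consumer_offsets.get(partition, 0) >= end
               for partition, end in end_offsets.items())

caught_up({"t-0": 100, "t-1": 50}, {"t-0": 100, "t-1": 50})  # drain complete
caught_up({"t-0": 90}, {"t-0": 100})                         # keep waiting
```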
Out with the old…
[Diagram: only the green cluster remains, serving all sensors through the ELB, with its Content Routers and processors on the active topics against the shared data plane.]
• Green is now the active cluster
• If we need to roll back code, we have a snapshot of the repository
in Amazon S3
• We haven’t had to roll back code… yet
Easing the pain
Bootstrapping faster
Half-baked AMIs
We use a process to create "half-baked" AMIs, which speeds up deployments:
• JVM (for our Scala code base)
• Common tools and configurations
• Latest updates to make sure patches are up to date
• Build plan is run twice daily
[Diagram: a half-baked AMI, staged through Amazon S3, seeds the Auto Scaling groups that launch both blue and green servers.]
Getting code ready
How code graduates - Development
[Flow: a commit on main is built into the development apt repo; changed roles are auto-deployed to the development cluster.]
How code graduates - Production
[Flow: release-X.X.X or hotfix-X.X.X branches are built into the integration apt repo; specified packages are synced to the integration cluster, and the production apt repo receives the same exact binary for the new production cluster.]
Choosing what goes out
Viewing debian details
Integration is synced
Production is synced from Integ
Updating the data plane
Data plane migrations
• Migrations applied to the database are forward only
• We have past experience with two-way migrations, but the costs outweigh the benefits
• Code must be forward compatible in case rollbacks are necessary
• Database schemas are only modified via migrations, even in development and integration environments
• We use an in-house migration service (based on Flyway) to parallelize the process
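Forward-only migration ordering reduces to "apply every unapplied version, ascending" (a sketch, not the in-house service; Flyway itself records applied versions in a schema history table):

```python
import re

def version_of(name):
    """Parse the numeric version out of a Flyway-style name, e.g. V12__x.sql."""
    return int(re.match(r"V(\d+)__", name).group(1))

def pending_migrations(available, applied):
    """Forward-only: unapplied migrations in ascending version order."""
    done = set(applied)
    return sorted((m for m in available if m not in done), key=version_of)

pending_migrations(
    ["V2__add_index.sql", "V1__init.sql", "V10__new_table.sql"],
    ["V1__init.sql"],
)   # -> ["V2__add_index.sql", "V10__new_table.sql"]
```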
Final Thoughts
• Blue-green deployments can be done in many ways
• Our requirement of never losing customer data made this the best solution for us
• The automation and tooling around our deployment system were built over many months and were a lot of work (built by 2 people – Hi Dennis!)
• But it is completely worth it, knowing we have a very reliable, fault-tolerant system
Thank you