(APP307) Leverage the Cloud with a Blue/Green Deployment Architecture | AWS re:Invent 2014
© 2014 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.
November 13, 2014 | Las Vegas
APP307 - Leveraging the Cloud with a Blue-Green Deployment Architecture
Jim Plush, Sr. Director of Engineering, CrowdStrike - @jimplush
Sean Berry, Principal Software Engineer, CrowdStrike - @schleprachaun
About us
• Founded in September 2011
• ~150 employees
• Detection/prevention
– Advanced cyber threats
– Real-time detection
– Real-time analytics
Cybersecurity startup
Published experts
Event Stream Processing
[Diagram: a sensor on a machine targeted by malicious malware streams events to the "CLOUD" for event stream processing; an example event stream follows.]
{"date": "11/14/2014 08:03", "path": "C:\WINDOWS\Programs\Word.exe", "id": 49, "parentId": 48}
{"date": "11/14/2014 08:03", "path": "C:\WINDOWS\System32\cmd.exe", "id": 50, "parentId": 49}
{"date": "11/14/2014 08:03", "path": "C:\WINDOWS\Programs\Word.exe", "id": 51, "parentId": 50}
DNS Lookup
{"date": "11/14/2014 08:03", "dns": "badapple.cc", "id": 52, "parentId": 51}
TCP Connect
{"date": "11/14/2014 08:03", "tcp_connect": "10.10.10.10", "id": 53, "parentId": 51}
FTP Download
{"date": "11/14/2014 08:03", "download": "10.10.10.10/badstuff.exe", "id": 54, "parentId": 51}
Document Exfiltration
{"date": "11/14/2014 08:03", "scp": "C:\Documents\TradeSecrets.doc", "id": 55, "parentId": 54}
Tactical UI
Data ingestion
[Architecture diagram: sensors connect through an external-service Elastic Load Balancing load balancer to termination servers, which feed Content Routers; processor services consume the events, with UI and API services on top and a shared data plane of Kafka, DynamoDB, Redis, Amazon RDS, Amazon Redshift, Amazon Glacier, and Amazon S3 underneath.]
• Fortune 500, Think Tanks, Non-Profits
• 100K+ events per second
– Expected to hit 500K EPS by end of 2015
• Each enterprise customer can generate 2-4 TB of data per day
• Microservice architecture
• Polyglot environment
High scale, big data
Our tech stack is complicated
…but possible because of AWS
Motivation
Solving for the problems
• OMG, all servers need to be patched??
• I'm afraid to restart that service; it's been running for 2 years
• Large rolling restarts
• Deployment fear
– Friday night deploys
• B/G for event processing?
Our primary objectives for deployments
• Minimize customer impact
– Customers should have no indication that
anything has changed
• Maximize engineer’s weekends
– Avoid burnout
• Reduce rollout dependencies
– Everything goes out together: 50+ services, 1000+ VMs
Leveraging AWS
• Programmable data centers
• Nodes are ephemeral
• It should be easier to re-create an environment
than to fix it
— Think like the cloud
What is blue-green?
[Diagram: a router directs traffic to Application v1 (web server + app server); Application v2 is stood up alongside it against a shared database, and the router's connection to v1 is cut over to v2.]
What is blue-green?
• Full cluster BG
– Everything goes out together
– Indiana Jones: “idol switch”
• App-based BG
– Each app or team controls their own
blue-green deployments
Data plane
The data plane: can't blue-green all the things
[Diagram: the blue and green clusters both share a single data plane of Kafka, DynamoDB, Redis, Amazon RDS (PostgreSQL), Amazon Redshift, Amazon Glacier, and Amazon S3.]
When do we deploy?
• Teams deploy end-of-sprint releases together
• Hot fixes and upgrades are performed frequently via rolling-restart deployments
• Early on deployments took an entire day
– Lack of automation
• Deploys today generally take 45 minutes
– Everyone has run a deployment
Sustaining engineer
• Every team member including QA has run
deployments
• Builds confidence, understanding, and
redundancy
• Ensures documentation is up to date and that everything that can be automated is
Sustaining engineer badge of honor
shirt after their tour of duty
Deployment day
• Apt repo synchronized and locked down
• Data plane migrations applied
• “Green” cluster is launched (1000s of machines)
• IT tests run
• Canary customers
• Logging and error checks
• Active-active
• “Blue” marked as inactive, decommissioned
Keys to success
Pro tip: It’s not just flipping load balancers
Keys to success: Automate all the things
• Junior devs should be able to run your deploy system
Keys to success: Instrumentation & Metrics
https://github.com/codahale/metrics
https://github.com/rcrowley/go-metrics
Keys to success: Use a provisioning system
• Chef
• Puppet
• Salt
• baked AMIs
Keys to success: Live integration / regression test suites
[Diagram: a test system sends deterministic input values and verifies the processed state.]
Keys to success: Canary Customers
[Diagram: canary traffic is routed to the v2 app while regular traffic stays on the v1 app.]
Keys to success: Feature Flags
Keys to success: Unified app requirements
Keys to success: Deployment History
"Thank God we have blue-green" – every team member
Implementation
How we blue-green
Elevator pitch on Kafka
• Distributed commit log
• Similar to a message queue
• Allows for replaying messages from earlier in the stream in case of failure
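That replay capability is the property the blue-green flow leans on later; a toy single-partition commit log (illustrative only, not the Kafka API) shows the idea:

```python
class CommitLog:
    """Toy in-memory, single-partition commit log illustrating
    Kafka-style consume and replay semantics."""

    def __init__(self):
        self.messages = []   # append-only message store
        self.offsets = {}    # consumer-group name -> next offset to read

    def append(self, message):
        self.messages.append(message)

    def poll(self, group):
        """Return messages the group hasn't seen yet and advance its offset."""
        start = self.offsets.get(group, 0)
        self.offsets[group] = len(self.messages)
        return self.messages[start:]

    def seek(self, group, offset):
        """Rewind a group so earlier messages are replayed after a failure."""
        self.offsets[group] = offset

log = CommitLog()
log.append({"id": 49})
log.append({"id": 50})
first = log.poll("processors")     # both messages
log.seek("processors", 0)          # failure: rewind to the beginning
replayed = log.poll("processors")  # same messages delivered again
```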
[Architecture diagram: sensors connect through the external-service ELB load balancer to the blue cluster's termination servers and Content Routers; blue processors consume from the "active" Kafka topics in the shared data plane (Kafka, DynamoDB, Redis, Amazon RDS, Amazon Redshift, Amazon Glacier, Amazon S3).]
• Blue is running; normal operation
• Content Routers are writing to the "active" topics in Kafka
• Blue processors read from the "active" topics
It all starts with a running cluster
Main management page for blue-green
Launching new cluster
[Diagram: the green cluster's termination servers, Content Routers, and processors are launched alongside blue; green processors read from the "inactive" topics while blue continues serving sensors on the "active" topics against the shared data plane.]
• Green cluster is launched
• Termination servers are kept out of the ELB load
balancer by failing health checks
• Content Routers write to the “active” topics
• Processors in green read from the “inactive” topics
Sizing the new cluster
Getting the size right
• Sizing of our Auto Scaling groups is determined programmatically
– Admin page allows for setting min / max
– Script determines appropriate desired capacity based on the running cluster
• Launching is then as simple as updating the Auto Scaling groups to the new sizes
from subprocess import Popen, PIPE

def current_counts(region='us-east-1'):
    # Legacy Auto Scaling CLI; a command string requires shell=True
    proc = Popen(
        "as-describe-auto-scaling-groups "
        "--region {} "
        "--max-records=600".format(region),
        shell=True, stdout=PIPE, stderr=PIPE)
    out, err = proc.communicate()
    if err:
        raise Exception(err)
    counts = {}
    for line in out.splitlines():
        if "AUTO-SCALING-GROUP" not in line:
            continue
        parts = line.split()
        group = parts[1]
        current = parts[-2]
        counts[group] = int(current)
    return counts
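The `as-describe-auto-scaling-groups` tool above comes from the since-retired Auto Scaling command-line tools; the same counts can be read today via boto3's `describe_auto_scaling_groups` (a sketch: the parsing assumes the documented response shape, and a large account would also need pagination):

```python
def counts_from_response(response):
    """Extract {Auto Scaling group name: desired capacity} from a
    describe_auto_scaling_groups response dict."""
    return {g["AutoScalingGroupName"]: g["DesiredCapacity"]
            for g in response["AutoScalingGroups"]}

# With live credentials this would be driven by something like:
#   import boto3
#   asg = boto3.client("autoscaling", region_name="us-east-1")
#   counts = counts_from_response(asg.describe_auto_scaling_groups())

sample = {"AutoScalingGroups": [
    {"AutoScalingGroupName": "processor-asg", "DesiredCapacity": 12},
    {"AutoScalingGroupName": "router-asg", "DesiredCapacity": 4},
]}
```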
Tuning size before we launch
Bootstrapping
User data and Chef get things rolling
• Inside out Chef bootstrapping
– Didn’t feel comfortable running `wget … | bash`
• Custom version of Chef installer
– Version of Chef
– Where to find the Chef servers
– Which role to run
– Which environment (dev, integ, blue, green)
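The parameters listed above suggest a user-data payload along these lines (a sketch with a hypothetical installer path and flags; the real "inside out" installer is baked into the image rather than fetched):

```python
def render_user_data(chef_version, chef_server, role, environment):
    """Build instance user-data that hands a baked-in Chef installer
    its version, server, role, and environment (all names illustrative)."""
    return "\n".join([
        "#!/bin/bash",
        "/opt/bootstrap/install_chef.sh "   # hypothetical installer path
        "--version {} --server {} "
        "--role {} --env {}".format(chef_version, chef_server,
                                    role, environment),
    ])

user_data = render_user_data("11.16.4", "chef.internal", "processor", "green")
```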
Testing the new stuff
[Diagram: integration tests connect directly to a green termination server; test traffic flows through green to the "inactive" topics while blue continues serving sensors on the "active" topics against the shared data plane.]
• Test customer(s) are canaried
• Integration test suite is run by connecting to a termination server directly
• Tests pass; then we canary real customers
Canary customers
• Canary information is stored in ZooKeeper
• Fortunately we dogfood our own tech
• This affords us the ability to use ourselves as canaries for new code
• The inactive processing cluster is set to read from the .inactive topics
– The standard Kafka topics with .inactive appended
• The ingestion layer has a watcher on that znode and routes any canaried customer to the .inactive topics
– E.g., regular traffic goes to foo.bar; canary traffic goes to foo.bar.inactive
• When we are ready to test real traffic, we mark several customers as canaries and start the monitoring process to detect any issues
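The routing rule reduces to a suffix check (a sketch with a hypothetical in-memory canary set; the real ingestion layer watches the set in ZooKeeper):

```python
CANARY_CUSTOMERS = {"123", "456"}   # in production, read from a znode

def topic_for(customer_id, base_topic):
    """Route canaried customers to the .inactive variant of a topic."""
    if customer_id in CANARY_CUSTOMERS:
        return base_topic + ".inactive"
    return base_topic

topic_for("123", "foo.bar")   # -> "foo.bar.inactive"
topic_for("999", "foo.bar")   # -> "foo.bar"
```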
[Diagram: the event ingestor routes regular traffic to the active topics, consumed by blue processors, and canary traffic from customers 123 and 456 to the inactive topics, consumed by green processors.]
Let’s canary some customers
That was easy
Testing
IT tests run
• Integration tests are run
– ~3000 tests in total
– Test customer must be "canaried"
• If any tests fail, we triage and determine if it is still possible to move forward
• Testing is only considered done when we are passing 100% — no exceptions!
Sean is mad - we have work to do
Sean is happy - so we all are happy
Trust, but verify!
[Diagram: both clusters run against the shared data plane; green processors consume the inactive topics while blue continues serving sensors on the active topics.]
• Monitor green services
• Verify health of the cluster by inspecting graphical
data and log outputs
• Rerun tests with load
Monitoring
Logging and error checking
• Every server forwards its relevant logs to Splunk
• Several dashboards have been set up with common things to
watch for
• Raw logs are streamed in near real-time and we watch
specifically for log-level ERROR
• This is one of our most important steps, as it gives us the most
insight into the health of the system as a whole
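The ERROR watch is essentially a filter over the streamed raw logs (a sketch; the real monitoring runs as Splunk searches and dashboards):

```python
def error_lines(log_stream):
    """Pick out log-level ERROR entries from a raw log stream."""
    return [line for line in log_stream if " ERROR " in line]

logs = [
    "2014-11-13 08:03:01 INFO processor-2 consumed offset 4912",
    "2014-11-13 08:03:02 ERROR processor-2 failed to deserialize event",
]
error_lines(logs)   # -> just the ERROR entry
```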
Logging / Error Checking
Moving customers over
[Diagram: both blue and green termination servers now pass health checks behind the ELB; sensors are spread across both clusters, and both sets of Content Routers and processors write to and consume the "active" topics.]
• Flip all customers back away from canary
• Activate green cluster
• Event processors and consuming services in blue
and green now write to and consume the “active” topics
• We are in a state of active-active for a few minutes
Each node in the data processing layer has a watcher on a particular znode which tells the environment whether it is active (use standard Kafka topics) or inactive (append .inactive to the topics).
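A minimal simulation of that watcher pattern (the real code registers a ZooKeeper watch via a client library; here the znode-change callback is invoked by hand):

```python
class ProcessorNode:
    """Chooses Kafka topic names based on the active/inactive flag
    delivered by a (simulated) znode watch."""

    def __init__(self, base_topics):
        self.base_topics = base_topics
        self.active = True

    def on_znode_change(self, state):
        # Callback fired when the watched znode's data changes.
        self.active = (state == "active")

    def topics(self):
        suffix = "" if self.active else ".inactive"
        return [t + suffix for t in self.base_topics]

node = ProcessorNode(["events"])
node.topics()                     # -> ["events"]
node.on_znode_change("inactive")
node.topics()                     # -> ["events.inactive"]
```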
[Diagram: active-active. Ingestion writes to both active and inactive topics in Kafka; one processor cluster consumes the active topics, the other the inactive topics.]
When we are ready to make the switch, we start by making the new cluster active and
enter into an active-active state where both processing clusters are doing work.
[Diagram: green is told to switch to active and begins consuming the active topics.]
This is where it is paramount that code is forward compatible, since two different code bases will be doing work simultaneously.
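In practice, forward compatibility means the older (blue) code base must tolerate events written by the newer one: a tolerant reader ignores unknown fields and defaults missing ones. A sketch (field names illustrative):

```python
import json

def parse_event(raw):
    """Tolerant reader: ignore unknown fields, default missing ones."""
    data = json.loads(raw)
    return {
        "id": data["id"],
        "path": data.get("path", ""),  # default when a producer omits it
        # any extra keys written by newer code are simply dropped
    }

old_event = parse_event('{"id": 49, "path": "C:/Word.exe"}')
new_event = parse_event('{"id": 50, "path": "C:/cmd.exe", "severity": 5}')
```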
[Diagram: active-active. Both processor clusters now consume the active topics fed by ingestion.]
However, blue and green are fully partitioned and there is no intercommunication between the clusters. This allows for things like changes in serialization for inter-service communication.
[Diagram: blue and green run fully partitioned, each consuming the active topics through Kafka, with no communication between clusters.]
Flipping the switch
[Diagram: blue's termination servers drop out of the ELB; sensors reconnect to green's termination servers and Content Routers, which write to the "active" topics, while blue's processors switch to draining the "inactive" topic against the shared data plane.]
• We deactivate Blue, which forces Termination Servers in Blue to
fail health checks and all Blue sensors disconnect
• Blue processors switch to read from the “inactive” topic
• Once all consumers of the “inactive” topic have caught up to the
head of the stream, Blue can be decommissioned
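That "caught up" condition amounts to comparing each partition's consumer offset against the log-end offset (a sketch; with a real Kafka client these values would come from committed offsets and an end-offsets query):

```python
def caught_up(consumer_offsets, end_offsets):
    """True once every partition's consumer offset has reached the
    head of the inactive topic's log."""
    return all(consumer_offsets.get(partition, 0) >= end
               for partition, end in end_offsets.items())

caught_up({"t-0": 100, "t-1": 50}, {"t-0": 100, "t-1": 50})  # drain complete
caught_up({"t-0": 90}, {"t-0": 100})                         # keep waiting
```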
Out with the old…
[Diagram: only the green cluster remains, serving all sensors through the ELB, with its Content Routers and processors on the active topics against the shared data plane.]
• Green is now the active cluster
• If we need to roll back code, we have a snapshot of the repository
in Amazon S3
• We haven’t had to roll back code… yet
Easing the pain
Bootstrapping faster
Half-baked AMIs
We use a process to create "half-baked" AMIs, which speeds up deployments:
• JVM (for our Scala code base)
• Common tools and configurations
• Latest updates to make sure patches are up to date
• Build plan is run twice daily
[Diagram: a half-baked AMI, staged through Amazon S3, seeds the Auto Scaling groups that launch both blue and green servers.]
Getting code ready
How code graduates - Development
[Flow: a commit on main is built into the development apt repo; changed roles are auto-deployed to the development cluster.]
How code graduates - Production
[Flow: release-X.X.X or hotfix-X.X.X branches are built into the integration apt repo; specified packages are synced to the integration cluster, and the production apt repo receives the same exact binary for the new production cluster.]
Choosing what goes out
Viewing debian details
Integration is synced
Production is synced from Integ
Updating the data plane
Data plane migrations
• Migrations applied to the database are forward only
• We have past experience with two-way migrations, but the costs outweigh the benefits
• Code must be forward compatible in case rollbacks are necessary
• Database schemas are only modified via migrations, even in development and integration environments
• We use an in-house migration service (based on Flyway) to parallelize the process
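Forward-only migration ordering reduces to "apply every unapplied version, ascending" (a sketch, not the in-house service; Flyway itself records applied versions in a schema history table):

```python
import re

def version_of(name):
    """Parse the numeric version out of a Flyway-style name, e.g. V12__x.sql."""
    return int(re.match(r"V(\d+)__", name).group(1))

def pending_migrations(available, applied):
    """Forward-only: unapplied migrations in ascending version order."""
    done = set(applied)
    return sorted((m for m in available if m not in done), key=version_of)

pending_migrations(
    ["V2__add_index.sql", "V1__init.sql", "V10__new_table.sql"],
    ["V1__init.sql"],
)   # -> ["V2__add_index.sql", "V10__new_table.sql"]
```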
Final Thoughts
• Blue-green deployments can be done in many ways
• Our requirement of never losing customer data made this the best solution for us
• The automation and tooling around our deployment system were built over many months and were a lot of work (built by 2 people – Hi Dennis!)
• But it is completely worth it, knowing we have a very reliable, fault-tolerant system
Thank you