Chaos Patterns
![Page 1: Chaos Patterns](https://reader030.fdocuments.in/reader030/viewer/2022021417/58835a261a28ab42678b619d/html5/thumbnails/1.jpg)
CHAOS PATTERNS
Architecting for failure in distributed systems
Bruce Wong - @bruce_m_wong / Jos Boumans - @jiboumans
http://www.soponderando.com.br/
![Page 2: Chaos Patterns](https://reader030.fdocuments.in/reader030/viewer/2022021417/58835a261a28ab42678b619d/html5/thumbnails/2.jpg)
http://fotos.subefotos.com/7a6b3e6df9453d5adf150087e5300834o.jpg
How to measure everything
http://www.slideshare.net/jiboumans/how-to-measure-everything-a-million-metrics-per-second-with-minimal-developer-overhead
Architecting in AWS for resilience & cost
www.slideshare.net/jiboumans/aws-architecting-for-resilience-cost-at-scale
![Page 3: Chaos Patterns](https://reader030.fdocuments.in/reader030/viewer/2022021417/58835a261a28ab42678b619d/html5/thumbnails/3.jpg)
VP of Operations & Infrastructure
http://www.krux.com/
3 Billion Users
![Page 4: Chaos Patterns](https://reader030.fdocuments.in/reader030/viewer/2022021417/58835a261a28ab42678b619d/html5/thumbnails/4.jpg)
ABOUT BRUCE
2010 2015
Software Engineer
Insight Engineering
Senior Engineering Manager
Chaos Engineering
Prosumers Consumers Enterprise
http://techblog.netflix.com/2014/09/introducing-chaos-engineering.html
![Page 5: Chaos Patterns](https://reader030.fdocuments.in/reader030/viewer/2022021417/58835a261a28ab42678b619d/html5/thumbnails/5.jpg)
A LOT OF TRAFFIC
http://www.americapictures.net/buenos-aires-traffic-city-night-argentina.html
![Page 6: Chaos Patterns](https://reader030.fdocuments.in/reader030/viewer/2022021417/58835a261a28ab42678b619d/html5/thumbnails/6.jpg)
http://grandprix247.com/2012/09/03/spa-pile-up-renews-focus-on-formula-1-safety-matters/
REAL WORLD FAILURES
![Page 7: Chaos Patterns](https://reader030.fdocuments.in/reader030/viewer/2022021417/58835a261a28ab42678b619d/html5/thumbnails/7.jpg)
SEPTEMBER 20TH, 2015
Also: April 21, 2011 - June 29, 2012 - October 22, 2012 - December 24, 2012 - August 26, 2013 <out of space>
https://twitter.com/iamDeveloper/status/645659734767329281 https://aws.amazon.com/message/5467D2/
![Page 8: Chaos Patterns](https://reader030.fdocuments.in/reader030/viewer/2022021417/58835a261a28ab42678b619d/html5/thumbnails/8.jpg)
ISOLATION & CONTAINMENT
Ideally limit failure to a single service
Stop it from spreading
http://businessnerds.wordpress.com/2011/05/28/so-far-so-good…-the-review/
![Page 9: Chaos Patterns](https://reader030.fdocuments.in/reader030/viewer/2022021417/58835a261a28ab42678b619d/html5/thumbnails/9.jpg)
[Chart: Amazon Cloud Major Outage - Issues Categories (# of issues)]
Software: 8
Automation: 4
Process: 14
https://steamcommunity.com/app/620/ http://fotos.subefotos.com/7a6b3e6df9453d5adf150087e5300834o.jpg
AWS Root Cause Analysis over time
http://www.slideshare.net/rahultyagi50999/amazon-cloud-major-outages-analysis
![Page 10: Chaos Patterns](https://reader030.fdocuments.in/reader030/viewer/2022021417/58835a261a28ab42678b619d/html5/thumbnails/10.jpg)
Humans, Software, Processes: All likely causes of failure
Isolation: Unlikely
2 - 4x: Yearly frequency of catastrophic failure
![Page 11: Chaos Patterns](https://reader030.fdocuments.in/reader030/viewer/2022021417/58835a261a28ab42678b619d/html5/thumbnails/11.jpg)
THERE ARE DOWNSIDES
http://modernsavage.hubpages.com/hub/10-springfield-shopper-headlines
![Page 12: Chaos Patterns](https://reader030.fdocuments.in/reader030/viewer/2022021417/58835a261a28ab42678b619d/html5/thumbnails/12.jpg)
Complex Systems
Difficult to model, not feasible to simulate at scale
![Page 13: Chaos Patterns](https://reader030.fdocuments.in/reader030/viewer/2022021417/58835a261a28ab42678b619d/html5/thumbnails/13.jpg)
Software is Iterative
testing, code coverage, “agile”
![Page 14: Chaos Patterns](https://reader030.fdocuments.in/reader030/viewer/2022021417/58835a261a28ab42678b619d/html5/thumbnails/14.jpg)
Resilience Design is also Iterative
…unlike software, complexity makes testing difficult
![Page 15: Chaos Patterns](https://reader030.fdocuments.in/reader030/viewer/2022021417/58835a261a28ab42678b619d/html5/thumbnails/15.jpg)
![Page 16: Chaos Patterns](https://reader030.fdocuments.in/reader030/viewer/2022021417/58835a261a28ab42678b619d/html5/thumbnails/16.jpg)
Rich Search Experience
Many optional enhancements
![Page 17: Chaos Patterns](https://reader030.fdocuments.in/reader030/viewer/2022021417/58835a261a28ab42678b619d/html5/thumbnails/17.jpg)
http://usa.streetsblog.org/category/issues-campaigns/air-quality/
NAVIGATING THE CHAOS
![Page 18: Chaos Patterns](https://reader030.fdocuments.in/reader030/viewer/2022021417/58835a261a28ab42678b619d/html5/thumbnails/18.jpg)
FALLBACK PATTERNS
“Expect the Unexpected”
http://blabitcanada.com/category/twitter-2/
![Page 19: Chaos Patterns](https://reader030.fdocuments.in/reader030/viewer/2022021417/58835a261a28ab42678b619d/html5/thumbnails/19.jpg)
BASIC API CALL
3 potential points of failure
![Page 20: Chaos Patterns](https://reader030.fdocuments.in/reader030/viewer/2022021417/58835a261a28ab42678b619d/html5/thumbnails/20.jpg)
FALLBACK PATTERNS
The cost of resilience should be accuracy or latency
http://redis.io/ http://memcached.org/
http://varnish-cache.org/
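One way to read "the cost of resilience should be accuracy" is a fallback that serves the last known good value when the primary call fails. The sketch below is illustrative only: `StaleCacheFallback` and the `fetch` callable are hypothetical names, and an in-process dictionary stands in for a real cache such as Redis, Memcached, or Varnish.

```python
import time
from typing import Callable, Dict, Tuple


class StaleCacheFallback:
    """Serve the last known good value when the primary call fails."""

    def __init__(self, fetch: Callable[[str], str]) -> None:
        self._fetch = fetch  # primary data source (hypothetical)
        # key -> (value, time it was stored); a dict stands in for a real cache
        self._cache: Dict[str, Tuple[str, float]] = {}

    def get(self, key: str) -> Tuple[str, bool]:
        """Return (value, is_stale). Raises only if there is no cached copy."""
        try:
            value = self._fetch(key)
        except Exception:
            if key in self._cache:
                # Trade accuracy for availability: answer with old data.
                return self._cache[key][0], True
            raise
        self._cache[key] = (value, time.time())
        return value, False
```

The `is_stale` flag lets the caller decide whether slightly outdated data is acceptable for that request, which is a product decision as much as an engineering one.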
![Page 21: Chaos Patterns](https://reader030.fdocuments.in/reader030/viewer/2022021417/58835a261a28ab42678b619d/html5/thumbnails/21.jpg)
ENSURING DATA ACCESS
https://www.flickr.com/photos/ichijo2009/8501266124
![Page 22: Chaos Patterns](https://reader030.fdocuments.in/reader030/viewer/2022021417/58835a261a28ab42678b619d/html5/thumbnails/22.jpg)
CAP THEOREM APPLIES
Your choice: sacrifice availability or consistency. Orange is a lie.
RDBMS BigTable Based
Master / Slave based
CouchDB Dynamo Based
http://ferd.ca/beating-the-cap-theorem-checklist.html
![Page 23: Chaos Patterns](https://reader030.fdocuments.in/reader030/viewer/2022021417/58835a261a28ab42678b619d/html5/thumbnails/23.jpg)
SPLIT OUT YOUR CONTROL PLANE
http://paul-barford.blogspot.com/2015/01/sappho-pap-obbink-further-painting-into.html
![Page 24: Chaos Patterns](https://reader030.fdocuments.in/reader030/viewer/2022021417/58835a261a28ab42678b619d/html5/thumbnails/24.jpg)
EC2 EMR RDS
Dynamo
Cloudfront CDN
Route53 DNS
Cloudwatch Monitoring
![Page 25: Chaos Patterns](https://reader030.fdocuments.in/reader030/viewer/2022021417/58835a261a28ab42678b619d/html5/thumbnails/25.jpg)
Cloudfront CDN
Route53 DNS
Cloudwatch Monitoring
![Page 26: Chaos Patterns](https://reader030.fdocuments.in/reader030/viewer/2022021417/58835a261a28ab42678b619d/html5/thumbnails/26.jpg)
Control plane: Separate from workload
DNS & CDN: Your best friends
Latency or Accuracy: Pick one to sacrifice for resilience
![Page 27: Chaos Patterns](https://reader030.fdocuments.in/reader030/viewer/2022021417/58835a261a28ab42678b619d/html5/thumbnails/27.jpg)
USER EXPERIENCE
My tweet got posted
![Page 28: Chaos Patterns](https://reader030.fdocuments.in/reader030/viewer/2022021417/58835a261a28ab42678b619d/html5/thumbnails/28.jpg)
http://mclaughlindrums.com/wp-content/uploads/2013/04/Relativity-by-Escher.jpg
ORDERED CHAOS
![Page 29: Chaos Patterns](https://reader030.fdocuments.in/reader030/viewer/2022021417/58835a261a28ab42678b619d/html5/thumbnails/29.jpg)
Nation’s Business, 1977
![Page 30: Chaos Patterns](https://reader030.fdocuments.in/reader030/viewer/2022021417/58835a261a28ab42678b619d/html5/thumbnails/30.jpg)
CHAOS DEFINED
Intentionally introducing failure into a system with the purpose of validating resilience design.
![Page 31: Chaos Patterns](https://reader030.fdocuments.in/reader030/viewer/2022021417/58835a261a28ab42678b619d/html5/thumbnails/31.jpg)
http://www.cnbc.com/id/102394893
![Page 32: Chaos Patterns](https://reader030.fdocuments.in/reader030/viewer/2022021417/58835a261a28ab42678b619d/html5/thumbnails/32.jpg)
BREAKING THE SYSTEM
How Confident are you?
-Next week?
-Next month?
-After that “quick patch”
![Page 33: Chaos Patterns](https://reader030.fdocuments.in/reader030/viewer/2022021417/58835a261a28ab42678b619d/html5/thumbnails/33.jpg)
CHAOS VS OUTAGE
Chaos
• Controlled
• Planned
• Intentional
• Microscopic user impact
Outages
• Uncontrolled
• Unpredictable
• Unintended
• Large impact
![Page 34: Chaos Patterns](https://reader030.fdocuments.in/reader030/viewer/2022021417/58835a261a28ab42678b619d/html5/thumbnails/34.jpg)
Single Point of Failure
Discover - Fix - Validate
![Page 35: Chaos Patterns](https://reader030.fdocuments.in/reader030/viewer/2022021417/58835a261a28ab42678b619d/html5/thumbnails/35.jpg)
CHAOS MONKEY
http://techblog.netflix.com/2012/07/chaos-monkey-released-into-wild.html
https://github.com/Netflix/SimianArmy
![Page 36: Chaos Patterns](https://reader030.fdocuments.in/reader030/viewer/2022021417/58835a261a28ab42678b619d/html5/thumbnails/36.jpg)
9am-5pm Mon-Fri: Don’t upset your on-call
1 Instance: Per group / per day
Detect SPOF: Intentionally
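The scheduling rules on this slide can be sketched as a small selection function. This is not the SimianArmy implementation; `in_business_hours` and `pick_victims` are hypothetical names, and the sketch assumes a caller that runs it once per day and then terminates the chosen instances.

```python
import random
from datetime import datetime
from typing import Dict, List, Optional


def in_business_hours(now: datetime) -> bool:
    """True Monday to Friday between 09:00 and 17:00 (local time of `now`)."""
    return now.weekday() < 5 and 9 <= now.hour < 17


def pick_victims(groups: Dict[str, List[str]], now: datetime,
                 rng: Optional[random.Random] = None) -> Dict[str, str]:
    """Choose at most one instance per group; choose nothing out of hours."""
    if not in_business_hours(now):
        return {}
    rng = rng or random.Random()
    # Empty groups are skipped so there is always something to choose from.
    return {name: rng.choice(instances)
            for name, instances in groups.items() if instances}
```

Restricting terminations to working hours means a newly discovered single point of failure surfaces while engineers are at their desks rather than paging the on-call at night.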
![Page 37: Chaos Patterns](https://reader030.fdocuments.in/reader030/viewer/2022021417/58835a261a28ab42678b619d/html5/thumbnails/37.jpg)
Slow is Hard
Product + Business + Engineering Decisions
https://pragprog.com/book/mnee/release-it
![Page 38: Chaos Patterns](https://reader030.fdocuments.in/reader030/viewer/2022021417/58835a261a28ab42678b619d/html5/thumbnails/38.jpg)
Custom Fallback: accuracy or latency
Fail Silent: For optional data
Fail Fast: to keep servers healthy
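A minimal sketch of the fail-fast and fail-silent options, assuming a thread pool with a per-call timeout as the mechanism (the slides do not say how this is implemented). `fail_fast`, `fail_silent`, and the pool size are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Optional, TypeVar

T = TypeVar("T")
_pool = ThreadPoolExecutor(max_workers=8)  # pool size is an arbitrary choice


def fail_fast(call: Callable[[], T], timeout_s: float) -> T:
    """Give up after timeout_s seconds instead of waiting on a slow dependency.

    The worker thread keeps running after a timeout; only the caller is freed.
    """
    future = _pool.submit(call)
    return future.result(timeout=timeout_s)  # raises on timeout or call error


def fail_silent(call: Callable[[], T], timeout_s: float) -> Optional[T]:
    """For optional data: swallow any error or timeout and return None."""
    try:
        return fail_fast(call, timeout_s)
    except Exception:
        return None
```

A page would typically use `fail_silent` for an optional enhancement and `fail_fast` for a required call, with a custom fallback handling the error that `fail_fast` raises.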
![Page 39: Chaos Patterns](https://reader030.fdocuments.in/reader030/viewer/2022021417/58835a261a28ab42678b619d/html5/thumbnails/39.jpg)
LATENCY MONKEY
other frameworks
http://www.infoq.com/presentations/failure-as-a-service-netflix
http://techblog.netflix.com/2014/10/fit-failure-injection-testing.html
![Page 40: Chaos Patterns](https://reader030.fdocuments.in/reader030/viewer/2022021417/58835a261a28ab42678b619d/html5/thumbnails/40.jpg)
HTTP 5xx: 1 minute duration
10-100ms: Sleep during request
1-100%: Of requests
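The parameters on this slide suggest a wrapper that delays or fails a configurable fraction of requests. The sketch below is one possible reading, not Latency Monkey or FIT itself: `inject_faults` and its arguments are hypothetical, and the one-minute duration of an experiment is not modeled.

```python
import random
import time
from typing import Callable, Optional, Tuple

Response = Tuple[int, str]  # (HTTP status, body)


def inject_faults(handler: Callable[[], Response], fraction: float,
                  min_delay_ms: int = 10, max_delay_ms: int = 100,
                  error_status: Optional[int] = None,
                  rng: Optional[random.Random] = None,
                  sleep: Callable[[float], None] = time.sleep
                  ) -> Callable[[], Response]:
    """Wrap handler so that `fraction` of calls are delayed or fail outright."""
    rng = rng or random.Random()

    def wrapped() -> Response:
        if rng.random() < fraction:
            if error_status is not None:
                # Simulate a failing dependency with an HTTP 5xx response.
                return error_status, "injected failure"
            # Simulate a slow dependency, then serve the request normally.
            sleep(rng.uniform(min_delay_ms, max_delay_ms) / 1000.0)
        return handler()

    return wrapped
```

Starting with a small `fraction` and raising it gradually keeps the user impact of an experiment small while still exercising timeouts and fallbacks.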
![Page 41: Chaos Patterns](https://reader030.fdocuments.in/reader030/viewer/2022021417/58835a261a28ab42678b619d/html5/thumbnails/41.jpg)
Prevent Propagation
to avoid cascading failure
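The slide does not name a specific technique. One common way to stop a failing dependency from dragging its callers down is the circuit breaker pattern described in Release It!, sketched here under that assumption. The class name, thresholds, and injectable `clock` are all hypothetical.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised when a call is rejected without reaching the dependency."""


class CircuitBreaker:
    def __init__(self, max_failures: int = 3, reset_after_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._max_failures = max_failures
        self._reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None  # None means closed

    def call(self, fn: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_after_s:
                raise CircuitOpenError("dependency marked unhealthy")
            # Cool-down elapsed: allow one trial call; a failure re-opens.
            self._opened_at = None
            self._failures = self._max_failures - 1
        try:
            result = fn()
        except Exception:
            self._failures += 1
            if self._failures >= self._max_failures:
                self._opened_at = self._clock()
            raise
        self._failures = 0
        return result
```

While the circuit is open, callers get an immediate error they can turn into a fallback, so threads and connections are not held by a dependency that is already struggling.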
![Page 42: Chaos Patterns](https://reader030.fdocuments.in/reader030/viewer/2022021417/58835a261a28ab42678b619d/html5/thumbnails/42.jpg)
CHAOS KONG
because regions fail
http://techblog.netflix.com/2015/09/chaos-engineering-upgraded.html
![Page 43: Chaos Patterns](https://reader030.fdocuments.in/reader030/viewer/2022021417/58835a261a28ab42678b619d/html5/thumbnails/43.jpg)
GeoDNS: fallback to LatencyDNS
Proxy: Cross-Region communication
Capacity: Cost-Benefit Decision
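One reading of the routing and capacity points: send users to their geographic region while it is healthy, otherwise to the lowest-latency healthy region, and size each region to absorb the loss of another. The sketch below assumes that reading and an even traffic split across regions. `pick_region` and `headroom_needed` are hypothetical helpers, not a DNS provider's API.

```python
from typing import Dict, Optional, Set


def pick_region(geo_region: str, latency_ms: Dict[str, float],
                healthy: Set[str]) -> Optional[str]:
    """Prefer the geographic region; otherwise the fastest healthy region."""
    if geo_region in healthy:
        return geo_region
    candidates = {r: ms for r, ms in latency_ms.items() if r in healthy}
    if not candidates:
        return None  # nothing healthy to route to
    return min(candidates, key=candidates.get)


def headroom_needed(regions: int) -> float:
    """Extra capacity per region, as a fraction of its normal load, needed
    to absorb the loss of one region when traffic is spread evenly."""
    if regions < 2:
        raise ValueError("need at least two regions to survive losing one")
    return 1.0 / (regions - 1)
```

With three regions each survivor needs 50% headroom, and with two regions 100%, which is why the slide frames spare capacity as a cost-benefit decision.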
![Page 44: Chaos Patterns](https://reader030.fdocuments.in/reader030/viewer/2022021417/58835a261a28ab42678b619d/html5/thumbnails/44.jpg)
"ONCE IN A BLUE MOON"
Happens at least a few times a year...
https://whisperofangels.wordpress.com/2013/08/20/once-in-a-blue-moon/
![Page 45: Chaos Patterns](https://reader030.fdocuments.in/reader030/viewer/2022021417/58835a261a28ab42678b619d/html5/thumbnails/45.jpg)
TAKE AWAY
Go found chaos engineering at your company RIGHT NOW
![Page 46: Chaos Patterns](https://reader030.fdocuments.in/reader030/viewer/2022021417/58835a261a28ab42678b619d/html5/thumbnails/46.jpg)
Most enterprises hire people to fix things. Netflix hires people to break things….
…we should embrace Netflix's culture of "chaos engineering" throughout organizations of all shapes and sizes.
http://readwrite.com/2014/09/17/netflix-chaos-engineering-for-everyone
![Page 47: Chaos Patterns](https://reader030.fdocuments.in/reader030/viewer/2022021417/58835a261a28ab42678b619d/html5/thumbnails/47.jpg)
Q & A
http://vickicaruana.blogspot.com/2011/01/are-you-afraid-to-raise-your-hand.html
@bruce_m_wong / @jiboumans
Slides - https://www.linkedin.com/in/brucemwong