Circuit breakers @ API World 2016
Transcript of Circuit breakers @ API World 2016
Handle API Failure gracefully with Circuit Breakers
Scott Triglia
Jim’s
Yelp’s Mission: Connecting people with great local businesses.
Yelp Stats (as of Q2 2016):
* 92M monthly users
* 32 countries
* 72% of searches from mobile
* 108M reviews
Scott Triglia
@scott_triglia
Work with the $$$
Let’s talk Circuit Breakers
Photo by zerial, CC BY-NC 2.0
Our goals today:
* Introduce a basic circuit breaker
* Describe a modular circuit breaker
* Test it out on several scenarios
The fundamental rule: your systems will fail.
What’s your response?
Nygard’s circuit breaker
Circuit Breaker States:
* Healthy (or “closed”)
* Recovering (or “half-open”)
* Unhealthy (or “open”)
Recovery:
* Wait for recovery_timeout seconds
* Send a trial request, trust its results
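The three states and the recovery rule above can be sketched in a few lines of Python. This is a minimal, single-threaded illustration rather than the implementation from the talk: only `recovery_timeout` comes from the slides, while `BasicCircuitBreaker`, `CircuitOpenError`, `failure_threshold`, and the injectable `clock` are hypothetical names chosen for the example.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying it."""


class BasicCircuitBreaker:
    # Hypothetical names throughout; only recovery_timeout appears in the slides.
    def __init__(self, failure_threshold=3, recovery_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.clock = clock          # injectable so the timeout can be tested
        self.failures = 0
        self.opened_at = None       # None means healthy ("closed")

    @property
    def state(self):
        if self.opened_at is None:
            return "healthy"
        if self.clock() - self.opened_at >= self.recovery_timeout:
            return "recovering"     # "half-open": the next call is the trial request
        return "unhealthy"

    def call(self, func, *args, **kwargs):
        if self.state == "unhealthy":
            raise CircuitOpenError("circuit is open; request rejected")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self):
        # A success (including a successful trial request) closes the circuit.
        self.failures = 0
        self.opened_at = None

    def _on_failure(self):
        self.failures += 1
        if self.opened_at is not None or self.failures >= self.failure_threshold:
            # Failed trial request or too many failures: open and restart the timer.
            self.opened_at = self.clock()
```

Wrapping a remote call as `cb.call(fetch_menu, restaurant_id)` (both names hypothetical) would reject requests immediately while the circuit is open, which is the "stop sending orders to the kitchen" behavior.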
Before a circuit breaker:
* Assume the kitchen gets slow
* Kitchen’s backlog grows
* Diners wait much longer to get food
* New diners make issues worse
* And your entire system collapses
With a circuit breaker:
* CB sees backlog, stops orders
* Fewer frustrated diners
* Reduced load on the kitchen
* A well-defined failure mode
How can we do better?
Improvements
1) Detecting Unhealthiness
2) Mitigating Downtime
3) Recovering Effectively
Component 1: Detecting Unhealthiness
def signal_overload(cb):
    # Mark the breaker unhealthy once the job backlog exceeds the threshold.
    if len(jobs) > THRESH:
        cb.mark_unhealthy()
New Behavior:
* CB gets signals from anywhere
* Signal combining logic

* Allows many (many) new signals
* Must combine signals
* Adds complexity to system
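One way to picture "signals from anywhere" plus combining logic is a breaker that stores the latest report from each named source and reduces them with a pluggable function. The sketch below is an assumption about how this could look, not the design from the talk: `SignalCircuitBreaker`, the `source` argument, `mark_healthy`, `majority`, and the value of `THRESH` are all hypothetical, and `signal_overload` is adapted to take `jobs` as a parameter so the example is self-contained.

```python
class SignalCircuitBreaker:
    """Breaker whose health is derived from named external signals (hypothetical)."""

    def __init__(self, combine=any):
        # combine receives an iterable of booleans (True = source says unhealthy)
        # and returns True when the breaker as a whole should be unhealthy.
        self.combine = combine
        self.signals = {}

    def mark_unhealthy(self, source="default"):
        self.signals[source] = True

    def mark_healthy(self, source="default"):
        self.signals[source] = False

    def is_healthy(self):
        return not self.combine(self.signals.values())


def majority(flags):
    """Alternative combiner: unhealthy only if more than half the sources agree."""
    flags = list(flags)
    return sum(flags) * 2 > len(flags)


THRESH = 100  # hypothetical backlog limit


def signal_overload(cb, jobs):
    # Same idea as the slide's snippet, reporting under a named source.
    if len(jobs) > THRESH:
        cb.mark_unhealthy("queue_depth")
    else:
        cb.mark_healthy("queue_depth")
```

The default `any` combiner trips on a single bad signal, while `majority` tolerates one noisy source. Choosing between them is exactly the added complexity the slide warns about.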
Component 2: Mitigating Downtime
New Behavior:
* Code can check the system’s health in advance
* Automatic monitoring!

* Build features on top of system health status
* Requires a single source of truth?
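Checking health in advance might look like the sketch below: a small registry acts as the single source of truth, and feature code consults it before starting work instead of failing midway. Everything here is illustrative; `HealthRegistry`, `render_checkout`, and the `"payments"` dependency are hypothetical names, not part of the talk.

```python
class HealthRegistry:
    """Single place that records whether each dependency is healthy (hypothetical)."""

    def __init__(self):
        self._healthy = {}

    def set_health(self, name, healthy):
        self._healthy[name] = healthy

    def is_healthy(self, name):
        # Unknown dependencies are treated as healthy in this sketch.
        return self._healthy.get(name, True)

    def report(self):
        """Snapshot that a dashboard or alerting job could poll."""
        return dict(self._healthy)


def render_checkout(registry):
    # Decide up front, before doing any work that depends on payments.
    if not registry.is_healthy("payments"):
        return "Payments are temporarily unavailable; your cart has been saved."
    return "Proceed to payment"
```

The same `report()` snapshot that drives the degraded checkout page can feed monitoring, which is one reading of the "automatic monitoring" bullet.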
Component 3: Recovering Effectively
Dark launch:
* Reject but process normally
* Dangerous with side effects
* Block the user request, but try to process it anyway!
Synthetic:
* Dark launching with fake requests
* Not necessarily representative
* Block the user request, process fake requests instead
New Behavior:
* Traffic determines health
* Removal of recovery timeouts

* Faster(?) recovery
* No timeout tuning required
* Dark launching not always possible
* Synthetic requests can be unrepresentative
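The sketch below illustrates traffic-driven recovery under some assumptions: while unhealthy, every user request is rejected, but a probe is still sent to the backend, and a run of successful probes closes the circuit with no recovery timeout involved. `ProbingCircuitBreaker`, `successes_to_recover`, and `synthetic_request` are hypothetical names. Leaving `synthetic_request` as `None` gives dark-launch behavior (the real request is processed and its result discarded), while setting it gives synthetic probing.

```python
class ProbingCircuitBreaker:
    """Recovers based on probe traffic instead of a timeout (hypothetical sketch)."""

    def __init__(self, backend, successes_to_recover=3, synthetic_request=None):
        self.backend = backend
        self.successes_to_recover = successes_to_recover
        self.synthetic_request = synthetic_request  # None means dark launch
        self.healthy = True
        self.probe_successes = 0

    def mark_unhealthy(self):
        self.healthy = False
        self.probe_successes = 0

    def handle(self, request):
        if self.healthy:
            try:
                return self.backend(request)
            except Exception:
                self.mark_unhealthy()
                raise
        # Unhealthy: the user is rejected either way, but the backend is still probed.
        probe = request if self.synthetic_request is None else self.synthetic_request
        self._probe(probe)
        return None  # None stands in for a "rejected" response

    def _probe(self, probe):
        try:
            self.backend(probe)  # result is discarded; only success or failure matters
        except Exception:
            self.probe_successes = 0
            return
        self.probe_successes += 1
        if self.probe_successes >= self.successes_to_recover:
            self.healthy = True
```

In dark-launch mode the backend really does execute the rejected request, so any side effect (a charge, an email) still happens. That is the danger the slide calls out, and it is the reason to fall back to synthetic probes even though they may not resemble real traffic.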
In summary:
* Your system will fail; have a plan!
* The basic CB is better than nothing
Questions to ask:
* What is “unhealthy” for my system?
* How should I react to unhealthiness?
* How do we recover?
…and much more!
Much comes down to your use case
@YelpEngineering
fb.com/YelpEngineers
engineeringblog.yelp.com
github.com/yelp
@scott_triglia
Can’t we do better than rejecting requests?
http://techblog.netflix.com/2011/12/making-netflix-api-more-resilient.html
How do I safely test out a new circuit breaker?
https://engineering.heroku.com/blogs/2015-06-30-improved-production-stability-with-circuit-breakers/