Page 1
Building Fault Tolerant Micro Services
Kristoffer [email protected]
@kerlandsson
Page 4
Building Fault Tolerant Micro ServicesWhy?
Failure modes
Stability patterns
Monitoring guidelines
Page 5
Kristoffer Erlandsson
Page 6
250+ services
1000+ service instances (JVMs)
Largest actor on the Stockholm Stock Exchange
Page 8
Database
WebApplication
Page 9
Database
WebApplication
Page 10
Database
WebApplication
Page 11
DatabaseWeb
Application
Page 12
DatabaseWeb
Application
Page 13
DatabaseWeb
Application
Page 15
Products
Payments
WebApplication
Special Offers
Page 17
Products
Payments
Special Offers
WebApplication
Page 18
Products
Payments
Special Offers
WebApplication
Page 19
Products
Payments
Special Offers
WebApplication
DatabasePurchaseHistory
Page 20
Cascading Failure
Products
Payments
Special Offers
WebApplication
DatabasePurchaseHistory
Page 21
How about increasing availability?
Page 22
0.99999
5 minutes per year
Page 23
1000 service
instances?
Page 25
0.999991000 ≈ 0.99
87 hours per year
Page 26
Design for Failure
Page 27
Web Application
Thread pool
Page 28
Special Offers
Web Application
Thread pool
Page 29
Special Offers
Web Application
Thread pool
Page 30
Special Offers
Web Application
Thread pool
Page 31
Web Application
Thread pool
Special Offers
Products
Payments
Page 32
Special Offers
Web Application
Thread pool
Products
Payments
Page 33
Special Offers
Web Application
Thread pool
Products
Payments
Page 34
URL url = new URL("http://acme-books.com/special-offers");URLConnection connection = url.openConnection();connection.connect();InputStream inputStream = connection.getInputStream();// Read response from stream
Page 35
Use
TimeoutsPrevents blocked threads
Page 36
URL url = new URL("http://acme-books.com/special-offers");URLConnection connection = url.openConnection();connection.setConnectTimeout(100);connection.setReadTimeout(500);connection.connect();InputStream inputStream = connection.getInputStream();// Read response from stream
Page 37
Set Aggresive Timeouts
Page 38
Special Offers
DatabasePurchaseHistory
Timeouts here?
Page 39
Special Offers
DatabasePurchaseHistory
Here too!
Page 41
Terrible response times
Awful throughput
Page 42
Web Application
Thread pool
Special Offers
Timeout
Page 43
Web Application
Thread pool
Special Offers
Manytimeouts
Page 44
Web Application
Thread pool
Special Offers
Throughput lower than number of
incoming requests
Manytimeouts
Page 45
Web Application
Thread pool
Special Offers
Throughput lower than number of
incoming requests
Manytimeouts
Page 46
Web Application
Thread pool
Special Offers
Throughput lower than number of
incoming requests
Growing queue!
Manytimeouts
Page 47
Frequently called service
Timeouts are not enough
Page 48
Circuit
BreakersCalls to broken services fail fast
Offloads broken services
Page 50
Special Offers
Web Application
Page 51
Timeout
Web Application
Web Application
Special Offers
Page 52
Error
Open state
Special Offers
Web Application
Page 53
Single call
Half open state
Special Offers
Web Application
Error
Page 54
Closed state
Special Offers
Web Application
Page 55
Timeouts over threshold
Unhandled errors over threshold
Known irrecoverable error occurs
Page 57
Handle service call errors
Page 58
try {return specialOffers.getOffers();
} catch (Exception e) {return Offers.emptyOffers();
}
Page 60
Terrible response times
Awful throughput
Again?!?
Page 61
Web Application
Thread pool
Special Offers
Slowresponse
Page 62
Web Application
Thread pool
Special Offers
Slowresponses
Page 63
Web Application
Thread pool
Special Offers
Throughput lower than number of incoming
requests (again)
Slowresponses
Page 64
Response time < timeout
Timeouts and circuit breakers are not enough
Page 65
BulkheadsIsolates components
Prevents cascading
Page 66
Limit number of concurrent calls
Upper bound on numberof waiting threads
Page 67
Special Offers
Web Application
Thread pool
Bulkhead(size=2)
Page 68
Special Offers
Web Application
Thread pool
Bulkhead(size=2)
Page 69
Special Offers
Web Application
Thread pool
Error
Error
Bulkhead(size=2)
Page 70
Special Offers
Web Application
Thread pool
Error
Error
Bulkhead(size=2)
- Fast page load- No special offers
- Slow page load- Including special
offers
Page 71
Products
Payments
WebApplication
Special Offers
PurchaseHistory
One bulkhead per service
Page 72
Upper bound on number ofwaiting threads
Protects very well againstcascading failure …
Page 73
… if bulkhead sizes are …
… significantly smaller thanrequest pool size
Page 74
Peak load when healthy
40 requests per second (rps)
0.1 seconds response time
Page 75
Suitable bulkhead size
40 rps x 0.1 seconds = 4
+ breathing room = 7
Page 76
Bonus: protects services from overload
Page 77
Semaphore bulkhead = new Semaphore(2);
Offers protectedGetOffers() {if (bulkhead.tryAcquire(0, TimeUnit.SECONDS)) {
try {return specialOffers.getOffers();
} finally {bulkhead.release();
}} else {
throw new RejectedByBulkheadException();}
}
Page 79
Many threads are waiting
Few available threads - low throughput
All service calls are rejected!
Page 80
Products
Payments
WebApplication
Special Offers
Page 81
Products
Payments
WebApplication
Special Offers
Page 82
Products
Payments
WebApplication
Special Offers
Where have ourtimeouts gone?!?
Page 83
Broken Client Library
More protection required
Page 84
Thread Pool HandoversCalling threads can always walk away
Generic timeouts
Page 85
Web Application
Requestthreadpool
Service threadpool
Service
Page 86
Web Application
Requestthreadpool
Service threadpool
Service
Page 87
Web Application
Requestthreadpool
Service threadpool
Service
Page 88
Web Application
Requestthreadpool
Service threadpool
Service
Page 89
Web Application
Requestthreadpool
Service threadpool
Service
Page 90
Web Application
Requestthreadpool
Service threadpool
ServiceTimeout
Page 91
Web Application
Error
Requestthreadpool
Service threadpool
Service
Page 92
Web Application
Error
Requestthreadpool
Service threadpool
Service call timeouts still required
Service
Page 93
Web ApplicationBulkhead includedRequest
threadpool
Service threadpool
Service
Error
Page 94
ExecutorService executor = new ThreadPoolExecutor(3, 3, 1,TimeUnit.MINUTES, new SynchronousQueue<>());
Offers protectedGetOffers() {try {
Future<Offers> future =executor.submit(specialOffers::getOffers);
return future.get(1, TimeUnit.SECONDS);} catch (RejectedExecutionException e) {
throw new RejectedByBulkheadException();} catch (TimeoutException e) {
throw new ServiceCallTimeoutException();}
}
Page 95
Thread pool handovers are verypowerful!
What about performance?
Page 97
Monitor Service Calls
Page 98
Timeout rate
Rejected call rate
Short circuit rate
Failure/success rate
Response times
Page 100
Understand problems before changing
configuration
Page 103
”All this seems like a lot of work!”
Page 105
class GetOffersCommand extends HystrixCommand<Offers> {
public GetOffersCommand() {super(HystrixCommandGroupKey
.Factory.asKey("SpecialOffers"));}
@Overrideprotected Offers run() throws Exception {
return specialOffers.getOffers();}
}
public Offers getOffers() {return new GetOffersCommand().execute();
}
Page 106
class GetOffersCommand extends HystrixCommand<Offers> {
// ...
@Overrideprotected Offers getFallback() {
return Offers.emptyOffers();}
}
Page 109
Design for failure
Use timeouts
Circuit Breakers
Bulkheads
Monitor service calls
Page 110
https://github.com/Netflix/Hystrix
Page 111
Image Attributions
• Polycelis felina" by Eduard Solà - Own work. Licensed under CC BY-SA 3.0 via Commons -https://commons.wikimedia.org/wiki/File:Polycelis_felina.jpg#/media/File:Polycelis_felina.jpg
• "Old book bindings" by Tom Murphy VII - Own work. Licensed under CC BY-SA 3.0 via Commons -https://commons.wikimedia.org/wiki/File:Old_book_bindings.jpg#/media/File:Old_book_bindings.jpg
• "Cute Snail" by gniyuhs - Own work. Licensed under CC BY-SA 3.0 via deviantart - http://gniyuhs.deviantart.com/art/Cute-Snail-278597934
• ”Cash” by 401(K) 2012 – Own work. Licensed under CC BY-SA 2.0 via Flickr -https://www.flickr.com/photos/68751915@N05/6355816649
• "Circuit breakers at substation near Denver International Airport, Colorado" by Greg Goebel from Loveland CO, USA -Yipws_2bUploaded by PDTillman. Licensed under CC BY-SA 2.0 via Wikimedia Commons -https://commons.wikimedia.org/wiki/File:Circuit_breakers_at_substation_near_Denver_International_Airport,_Colorado.jpg#/media/File:Circuit_breakers_at_substation_near_Denver_International_Airport,_Colorado.jpg
• ”The control room of the nuclear ship NS Savannah, Baltimore, Maryland, USA” - Own work. Licensed under CC BY-SA 3.0 via Commons -https://en.wikipedia.org/wiki/File:NS_Savannah_control_room_MD1.jpg#/media/File:NS_Savannah_control_room_MD1.jpg
Page 112
Thank you! - Questions?
Kristoffer [email protected]
@kerlandsson