Stability patterns presentation
![Page 1: Stability patterns presentation](https://reader034.fdocuments.in/reader034/viewer/2022042700/559652461a28abc4598b45e3/html5/thumbnails/1.jpg)
Stability Patterns…and Antipatterns
© Michael Nygard, 2007-2012
Michael Nygard
@mtnygard
Saturday, June 23, 12
![Page 2: Stability patterns presentation](https://reader034.fdocuments.in/reader034/viewer/2022042700/559652461a28abc4598b45e3/html5/thumbnails/2.jpg)
Stability Antipatterns
![Page 3: Stability patterns presentation](https://reader034.fdocuments.in/reader034/viewer/2022042700/559652461a28abc4598b45e3/html5/thumbnails/3.jpg)
Integration Points
Integrations are the #1 risk to stability.
Every out of process call can and will eventually kill your system.
Yes, even database calls.
![Page 4: Stability patterns presentation](https://reader034.fdocuments.in/reader034/viewer/2022042700/559652461a28abc4598b45e3/html5/thumbnails/4.jpg)
Example: Wicked database hang
![Page 5: Stability patterns presentation](https://reader034.fdocuments.in/reader034/viewer/2022042700/559652461a28abc4598b45e3/html5/thumbnails/5.jpg)
“In Spec” vs. “Out of Spec”
Example: Request-Reply using XML over HTTP
“In Spec” failures (well-behaved errors):
TCP connection refused
HTTP response code 500
Error message in XML response
“Out of Spec” failures (wicked errors):
TCP connection accepted, but no data sent
TCP window full, never cleared
Server replies with “EHLO”
Server sends link farm HTML
Server streams Weird Al mp3s
![Page 6: Stability patterns presentation](https://reader034.fdocuments.in/reader034/viewer/2022042700/559652461a28abc4598b45e3/html5/thumbnails/6.jpg)
Remember This
Necessary evil.
Peel back abstractions.
Large systems fail faster than small ones.
Useful patterns: Circuit Breaker, Use Timeouts, Use Decoupling Middleware, Handshaking, Test Harness
![Page 7: Stability patterns presentation](https://reader034.fdocuments.in/reader034/viewer/2022042700/559652461a28abc4598b45e3/html5/thumbnails/7.jpg)
Chain Reaction
Failure moves horizontally across tiers
Common in search engines and app servers
![Page 8: Stability patterns presentation](https://reader034.fdocuments.in/reader034/viewer/2022042700/559652461a28abc4598b45e3/html5/thumbnails/8.jpg)
Remember This
One server down jeopardizes the rest.
Hunt for Resource Leaks.
Useful pattern: Bulkheads
![Page 9: Stability patterns presentation](https://reader034.fdocuments.in/reader034/viewer/2022042700/559652461a28abc4598b45e3/html5/thumbnails/9.jpg)
Cascading Failure
Failure moves vertically across tiers
Common in enterprise services & SOA
![Page 10: Stability patterns presentation](https://reader034.fdocuments.in/reader034/viewer/2022042700/559652461a28abc4598b45e3/html5/thumbnails/10.jpg)
Remember This
“Damage Containment”
Stop cracks from jumping the gap
Scrutinize resource pools
Useful patterns: Use Timeouts, Circuit Breaker
![Page 11: Stability patterns presentation](https://reader034.fdocuments.in/reader034/viewer/2022042700/559652461a28abc4598b45e3/html5/thumbnails/11.jpg)
Too many, too clicky
Some malicious users
Buyers
Front-page viewers
Screen scrapers
Users
![Page 12: Stability patterns presentation](https://reader034.fdocuments.in/reader034/viewer/2022042700/559652461a28abc4598b45e3/html5/thumbnails/12.jpg)
Handle Traffic Surges Gracefully
Degrade features automatically
Shed load.
Don’t keep sessions for bots.
Reduce per-user burden:
IDs, not object graphs.
Query parameters, not result sets.
![Page 13: Stability patterns presentation](https://reader034.fdocuments.in/reader034/viewer/2022042700/559652461a28abc4598b45e3/html5/thumbnails/13.jpg)
Blocked Threads
All request threads blocked = “crash”
Impossible to test away
Learn to use java.util.concurrent or System.Threading. (Ruby & PHP coders, just avoid threads completely.)
![Page 14: Stability patterns presentation](https://reader034.fdocuments.in/reader034/viewer/2022042700/559652461a28abc4598b45e3/html5/thumbnails/14.jpg)
Pernicious and Cumulative
Hung request handlers = less capacity.
Hung request handler = frustrated user/caller.
Each remaining thread serves 1/(N-1) extra requests
![Page 15: Stability patterns presentation](https://reader034.fdocuments.in/reader034/viewer/2022042700/559652461a28abc4598b45e3/html5/thumbnails/15.jpg)
Example: Blocking calls
In a request-processing method:
    String key = (String) request.getParameter(PARAM_ITEM_SKU);
    Availability avl = globalObjectCache.get(key);
In GlobalObjectCache.get(String id), a synchronized method:
    Object obj = items.get(id);
    if (obj == null) {
        obj = strategy.create(id);
    }
    …
In the strategy:
    public Object create(Object key) throws Exception {
        return omsClient.getAvailability(key);
    }
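One way to keep a call like the remote lookup above from hanging every request thread is to bound the wait. The following is a minimal sketch, not the author's code: `TimedCall`, `callWithTimeout`, and the fallback value are hypothetical names introduced here, and the remote call is represented by any `Callable`.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper: runs a possibly-blocking call on a worker pool and
// gives up after a fixed wait instead of blocking the request thread forever.
class TimedCall {
    private final ExecutorService pool = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r);
        t.setDaemon(true); // hung workers must not keep the JVM alive
        return t;
    });

    <T> T callWithTimeout(Callable<T> remoteCall, long timeoutMillis, T fallback) {
        Future<T> future = pool.submit(remoteCall);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // interrupt the worker if it is still waiting
            return fallback;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            return fallback;
        } catch (ExecutionException e) {
            return fallback; // the call itself failed
        }
    }
}
```

A timeout only helps here if the remote call is also made outside the synchronized method; otherwise every caller still queues on the cache lock while one thread waits.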
![Page 16: Stability patterns presentation](https://reader034.fdocuments.in/reader034/viewer/2022042700/559652461a28abc4598b45e3/html5/thumbnails/16.jpg)
Remember This
Use proven constructs.
Don’t wait forever.
Scrutinize resource pools.
Beware the code you cannot see.
Useful patterns: Use Timeouts, Circuit Breaker
![Page 17: Stability patterns presentation](https://reader034.fdocuments.in/reader034/viewer/2022042700/559652461a28abc4598b45e3/html5/thumbnails/17.jpg)
Attacks of Self-Denial
BestBuy: XBox 360 Preorder
Amazon: XBox 360 Discount
Victoria’s Secret: Online Fashion Show
Anything on FatWallet.com
![Page 18: Stability patterns presentation](https://reader034.fdocuments.in/reader034/viewer/2022042700/559652461a28abc4598b45e3/html5/thumbnails/18.jpg)
Defenses
Avoid deep links
Static landing pages
CDN diverts or throttles users
Shared-nothing architecture
Session only on 2nd click
Deal pool
![Page 19: Stability patterns presentation](https://reader034.fdocuments.in/reader034/viewer/2022042700/559652461a28abc4598b45e3/html5/thumbnails/19.jpg)
Remember This
Open lines of communication.
Support your marketers.
![Page 20: Stability patterns presentation](https://reader034.fdocuments.in/reader034/viewer/2022042700/559652461a28abc4598b45e3/html5/thumbnails/20.jpg)
Unbalanced Capacities
Clients: Customers, SiteScope NYC, SiteScope San Francisco
Online Store: 20 hosts, 75 instances, 3,000 threads
Order Management: 6 hosts, 6 instances, 450 threads
Scheduling: 1 host, 1 instance, 25 threads
![Page 21: Stability patterns presentation](https://reader034.fdocuments.in/reader034/viewer/2022042700/559652461a28abc4598b45e3/html5/thumbnails/21.jpg)
Scaling Ratios
System              Dev     QA      Prod
Online Store        1/1/1   2/2/2   20/300/6
Order Management    1/1/1   2/2/2   4/6/2
Scheduling          1/1/1   2/2/2   4/2
![Page 22: Stability patterns presentation](https://reader034.fdocuments.in/reader034/viewer/2022042700/559652461a28abc4598b45e3/html5/thumbnails/22.jpg)
Unbalanced Capacities
Scaling effect between systems
Sensitive to traffic & behavior patterns
Stress both sides of the interface in QA
Simulate back end failures during testing
![Page 23: Stability patterns presentation](https://reader034.fdocuments.in/reader034/viewer/2022042700/559652461a28abc4598b45e3/html5/thumbnails/23.jpg)
SLA Inversion
Frammitz: 99.99%
Corporate MTA: 99.999%
SpamCannon's DNS: 98.5%
SpamCannon's Applications: 99%
Corporate DNS: 99.9%
Inventory: 99.9%
Message Broker: 99%
Partner 1's Application: No SLA
Partner 1's DNS: 99%
Message Queues: 99.99%
Pricing and Promotions: No SLA
What SLA can Frammitz really guarantee?
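The question can be made concrete with a little arithmetic. The sketch below rests on assumptions the slide does not state: that Frammitz needs every listed dependency to be up, and that dependencies fail independently. Under those assumptions, the best availability Frammitz can offer is the product of its dependencies' availabilities. The class and method names are hypothetical, and the two "No SLA" dependencies are left out, so the result is an optimistic upper bound.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class SlaInversion {
    // Upper bound on availability for a service that needs all dependencies,
    // assuming independent failures (an assumption, not a claim from the slide).
    static double compositeAvailability(Iterable<Double> availabilities) {
        double product = 1.0;
        for (double a : availabilities) {
            product *= a;
        }
        return product;
    }

    // Dependencies from the slide that publish an SLA; "No SLA" entries are omitted.
    static Map<String, Double> slideDependencies() {
        Map<String, Double> deps = new LinkedHashMap<>();
        deps.put("Corporate MTA", 0.99999);
        deps.put("SpamCannon's DNS", 0.985);
        deps.put("SpamCannon's Applications", 0.99);
        deps.put("Corporate DNS", 0.999);
        deps.put("Inventory", 0.999);
        deps.put("Message Broker", 0.99);
        deps.put("Partner 1's DNS", 0.99);
        deps.put("Message Queues", 0.9999);
        return deps;
    }

    public static void main(String[] args) {
        double bound = compositeAvailability(slideDependencies().values());
        System.out.printf("Upper bound on availability: %.2f%%%n", 100 * bound);
    }
}
```

Under these assumptions the bound comes out at roughly 95.4%, well short of the 99.99% shown for Frammitz, and the dependencies without an SLA can only lower it.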
![Page 24: Stability patterns presentation](https://reader034.fdocuments.in/reader034/viewer/2022042700/559652461a28abc4598b45e3/html5/thumbnails/24.jpg)
Remember This
No empty promises.
Monitor your dependencies.
Decouple from your dependencies.
Measure availability by feature, not by server.
Beware infrastructure services: DNS, SMTP, LDAP.
![Page 25: Stability patterns presentation](https://reader034.fdocuments.in/reader034/viewer/2022042700/559652461a28abc4598b45e3/html5/thumbnails/25.jpg)
Unbounded Result Sets
Development and testing are done with small data sets
Test databases get reloaded frequently
Queries often bonk badly with production data volume
![Page 26: Stability patterns presentation](https://reader034.fdocuments.in/reader034/viewer/2022042700/559652461a28abc4598b45e3/html5/thumbnails/26.jpg)
Unbounded Result Sets: Databases
SQL queries have no inherent limits
ORM tools are bad about this
Appears as slow performance degradation
![Page 27: Stability patterns presentation](https://reader034.fdocuments.in/reader034/viewer/2022042700/559652461a28abc4598b45e3/html5/thumbnails/27.jpg)
Unbounded Result Sets: SOA
Chatty remote protocols, N+1 query problem
Hurts caller and provider
Caller is naive, trusts server not to hurt it.
![Page 28: Stability patterns presentation](https://reader034.fdocuments.in/reader034/viewer/2022042700/559652461a28abc4598b45e3/html5/thumbnails/28.jpg)
Remember This
Test with realistic data volumes.
Don’t trust data producers.
Put limits in your APIs.
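As an illustration of putting the limit on the caller's side, here is a minimal sketch that stops reading after a fixed number of rows, however many the producer is willing to send. `BoundedResults` and `take` are hypothetical names, and the producer is modeled as a plain `Iterator`.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

class BoundedResults {
    // Reads at most maxRows items from a producer we do not trust to stop.
    static <T> List<T> take(Iterator<T> producer, int maxRows) {
        List<T> rows = new ArrayList<>();
        while (rows.size() < maxRows && producer.hasNext()) {
            rows.add(producer.next());
        }
        return rows;
    }
}
```

For JDBC queries, the standard `java.sql.Statement.setMaxRows(int)` method offers a comparable cap; whether the database also stops work early depends on the driver.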
![Page 29: Stability patterns presentation](https://reader034.fdocuments.in/reader034/viewer/2022042700/559652461a28abc4598b45e3/html5/thumbnails/29.jpg)
Stability Patterns
![Page 30: Stability patterns presentation](https://reader034.fdocuments.in/reader034/viewer/2022042700/559652461a28abc4598b45e3/html5/thumbnails/30.jpg)
Circuit Breaker
Ever seen a remote call wrapped with a retry loop?
    int remainingAttempts = MAX_RETRIES;
    while (--remainingAttempts >= 0) {
        try {
            doSomethingDangerous();
            return true;
        } catch (RemoteCallFailedException e) {
            log(e);
        }
    }
    return false;
Why?
![Page 31: Stability patterns presentation](https://reader034.fdocuments.in/reader034/viewer/2022042700/559652461a28abc4598b45e3/html5/thumbnails/31.jpg)
Faults Cluster
Fast retries are good for dropped packets (but let TCP do that)
Most other faults require minutes to hours to correct
Immediate retries are very likely to fail again
![Page 32: Stability patterns presentation](https://reader034.fdocuments.in/reader034/viewer/2022042700/559652461a28abc4598b45e3/html5/thumbnails/32.jpg)
Faults Cluster
Problems with the remote host, application, or the network will probably persist for a long time... minutes or hours.
![Page 33: Stability patterns presentation](https://reader034.fdocuments.in/reader034/viewer/2022042700/559652461a28abc4598b45e3/html5/thumbnails/33.jpg)
Bad for Users and Systems
Systems:
Ties up threads, reducing overall capacity.
Multiplies load on server, at the worst times.
Induces a Cascading Failure
Users:
Wait longer to get an error response.
What happens after final retry?
![Page 34: Stability patterns presentation](https://reader034.fdocuments.in/reader034/viewer/2022042700/559652461a28abc4598b45e3/html5/thumbnails/34.jpg)
Stop Banging Your Head
Wrap a “dangerous” call
Count failures
After too many failures, stop passing calls
After a “cooling off” period, try the next call
If it fails, wait some more before calling again
Closed:
on call / pass through
call succeeds / reset count
call fails / count failure
threshold reached / trip breaker
Open:
on call / fail
on timeout / attempt reset
Half-Open:
on call / pass through
call succeeds / reset
call fails / trip breaker
Transitions: Closed to Open (pop), Open to Half-Open (attempt reset), Half-Open to Closed (reset), Half-Open to Open (pop)
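The three states can be sketched in a few dozen lines of Java. This is an illustrative sketch, not the author's implementation: the class name, `OpenCircuitException`, the consecutive-failure threshold, and the injected millisecond clock are all assumptions made here to keep the example small and testable.

```java
import java.util.concurrent.Callable;
import java.util.function.LongSupplier;

class CircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    static class OpenCircuitException extends RuntimeException {
        OpenCircuitException() { super("circuit breaker is open"); }
    }

    private final int failureThreshold; // assumed: consecutive failures before tripping
    private final long coolOffMillis;   // assumed: wait before allowing a trial call
    private final LongSupplier clock;   // injected so tests can control time
    private State state = State.CLOSED;
    private int failures = 0;
    private long openedAt = 0;

    CircuitBreaker(int failureThreshold, long coolOffMillis, LongSupplier clock) {
        this.failureThreshold = failureThreshold;
        this.coolOffMillis = coolOffMillis;
        this.clock = clock;
    }

    synchronized State state() {
        if (state == State.OPEN && clock.getAsLong() - openedAt >= coolOffMillis) {
            state = State.HALF_OPEN; // on timeout / attempt reset
        }
        return state;
    }

    <T> T call(Callable<T> dangerous) throws Exception {
        if (state() == State.OPEN) {
            throw new OpenCircuitException(); // on call / fail
        }
        try {
            T result = dangerous.call(); // on call / pass through (no lock held)
            onSuccess();
            return result;
        } catch (Exception e) {
            onFailure();
            throw e;
        }
    }

    private synchronized void onSuccess() {
        failures = 0; // call succeeds / reset
        state = State.CLOSED;
    }

    private synchronized void onFailure() {
        failures++; // call fails / count failure
        if (state == State.HALF_OPEN || failures >= failureThreshold) {
            state = State.OPEN; // trip breaker
            openedAt = clock.getAsLong();
        }
    }
}
```

The lock is deliberately not held while the wrapped call runs, so a hung call cannot block other callers on the breaker itself. One simplification to be aware of: several threads may pass through at once in the half-open state, where a stricter version would admit a single trial call.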
![Page 35: Stability patterns presentation](https://reader034.fdocuments.in/reader034/viewer/2022042700/559652461a28abc4598b45e3/html5/thumbnails/35.jpg)
Considerations
Sever malfunctioning features
Degrade gracefully on caller
Critical work must be queued for later
![Page 36: Stability patterns presentation](https://reader034.fdocuments.in/reader034/viewer/2022042700/559652461a28abc4598b45e3/html5/thumbnails/36.jpg)
Remember This
Stop doing it if it hurts.
Expose, monitor, track, and report state changes
Good against: Cascading Failures, Slow Responses
Works with: Use Timeouts
![Page 37: Stability patterns presentation](https://reader034.fdocuments.in/reader034/viewer/2022042700/559652461a28abc4598b45e3/html5/thumbnails/37.jpg)
Bulkheads
Partition the system
Allow partial failure without losing service
Applies at different granularity levels
![Page 38: Stability patterns presentation](https://reader034.fdocuments.in/reader034/viewer/2022042700/559652461a28abc4598b45e3/html5/thumbnails/38.jpg)
Common Mode Dependency
Foo Bar
Baz
Foo and Bar are coupled via Baz
![Page 39: Stability patterns presentation](https://reader034.fdocuments.in/reader034/viewer/2022042700/559652461a28abc4598b45e3/html5/thumbnails/39.jpg)
With Bulkheads
Foo Bar
Baz
Baz
Pool 1
Baz
Pool 2
Foo and Bar have dedicated resources from Baz.
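A dedicated pool per caller can be approximated with one semaphore per partition. The sketch below is one possible reading of the idea, not the author's design: `Bulkhead` and `tryCall` are hypothetical names, and rejecting immediately when a partition is full is a choice made here for simplicity.

```java
import java.util.Optional;
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

class Bulkhead {
    private final Semaphore permits;

    Bulkhead(int maxConcurrentCalls) {
        permits = new Semaphore(maxConcurrentCalls);
    }

    // Runs the call if this partition has spare capacity; otherwise rejects
    // at once instead of queueing. The call must return a non-null value.
    <T> Optional<T> tryCall(Supplier<T> call) {
        if (!permits.tryAcquire()) {
            return Optional.empty();
        }
        try {
            return Optional.of(call.get());
        } finally {
            permits.release();
        }
    }
}
```

With one `Bulkhead` for Foo's calls to Baz and another for Bar's, Foo exhausting its permits leaves Bar's capacity untouched. The cost is that idle permits in one partition cannot be borrowed by the other.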
![Page 40: Stability patterns presentation](https://reader034.fdocuments.in/reader034/viewer/2022042700/559652461a28abc4598b45e3/html5/thumbnails/40.jpg)
Remember This
Save part of the ship
Decide if less efficient use of resources is OK
Pick a useful granularity
Very important with shared-service models
Monitor each partition’s performance to SLA
![Page 41: Stability patterns presentation](https://reader034.fdocuments.in/reader034/viewer/2022042700/559652461a28abc4598b45e3/html5/thumbnails/41.jpg)
Test Harness
Real-world failures are hard to create in QA
Integration tests work for “in-spec” errors, but not “out-of-spec” errors.
![Page 42: Stability patterns presentation](https://reader034.fdocuments.in/reader034/viewer/2022042700/559652461a28abc4598b45e3/html5/thumbnails/42.jpg)
“In Spec” vs. “Out of Spec”
Example: Request-Reply using XML over HTTP
“In Spec” failures (well-behaved errors):
TCP connection refused
HTTP response code 500
Error message in XML response
“Out of Spec” failures (wicked errors):
TCP connection accepted, but no data sent
TCP window full, never cleared
Server replies with “EHLO”
Server sends link farm HTML
Server streams Weird Al mp3s
![Page 43: Stability patterns presentation](https://reader034.fdocuments.in/reader034/viewer/2022042700/559652461a28abc4598b45e3/html5/thumbnails/43.jpg)
“Out-of-spec” errors happen all the time in the real world.
They never happen during testing...
unless you force them to.
![Page 44: Stability patterns presentation](https://reader034.fdocuments.in/reader034/viewer/2022042700/559652461a28abc4598b45e3/html5/thumbnails/44.jpg)
Killer Test Harness
Daemon listening on network
Substitutes for the remote end of an interface
Can run locally (dev) or remotely (dev or QA)
Is totally evil
![Page 45: Stability patterns presentation](https://reader034.fdocuments.in/reader034/viewer/2022042700/559652461a28abc4598b45e3/html5/thumbnails/45.jpg)
Just a Few Evil Ideas
Port    Nastiness
19720   Allows connection requests into the queue, but never accepts them.
19721   Refuses all connections.
19722   Reads requests at 1 byte / second.
19723   Reads HTTP requests, sends back random binary.
19724   Accepts requests, sends responses at 1 byte / second.
19725   Accepts requests, sends back the entire OS kernel image.
19726   Sends an endless stream of data from /dev/random.
Now those are some out-of-spec errors.
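The first behavior in the list (connections queued but never accepted) is easy to reproduce with a listening socket whose owner never calls `accept()`. The following is a small sketch under that assumption: `EvilListener` and `clientTimesOut` are hypothetical names, the port is chosen by the OS rather than fixed at 19720, and the client is only there to show that a read timeout is what saves the caller.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;
import java.nio.charset.StandardCharsets;

// Binds a listening socket but never calls accept(): the OS completes the
// TCP handshake and queues the connection, and no application ever answers.
class EvilListener implements AutoCloseable {
    private final ServerSocket server;

    EvilListener() throws IOException {
        // Port 0 asks the OS for any free port; 50 is the listen backlog.
        server = new ServerSocket(0, 50, InetAddress.getLoopbackAddress());
    }

    int port() {
        return server.getLocalPort();
    }

    @Override
    public void close() throws IOException {
        server.close();
    }

    // Returns true if a client with a read timeout gave up waiting for a reply.
    static boolean clientTimesOut(int port, int readTimeoutMillis) throws IOException {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(InetAddress.getLoopbackAddress(), port), 1000);
            socket.setSoTimeout(readTimeoutMillis); // without this, read() blocks forever
            socket.getOutputStream().write("GET / HTTP/1.0\r\n\r\n".getBytes(StandardCharsets.US_ASCII));
            socket.getInputStream().read();
            return false;
        } catch (SocketTimeoutException e) {
            return true;
        }
    }
}
```

Pointing the code under test at `port()` instead of the real remote end shows quickly whether the caller sets a read timeout at all.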
![Page 46: Stability patterns presentation](https://reader034.fdocuments.in/reader034/viewer/2022042700/559652461a28abc4598b45e3/html5/thumbnails/46.jpg)
Remember This
Force out-of-spec failures
Stress the caller
Build reusable harnesses for L1-L6 errors
Supplement, don’t replace, other testing methods
![Page 47: Stability patterns presentation](https://reader034.fdocuments.in/reader034/viewer/2022042700/559652461a28abc4598b45e3/html5/thumbnails/47.jpg)
Antipatterns: Integration Points, Chain Reactions, Cascading Failures, Users, Blocked Threads, Attacks of Self-Denial, Scaling Effects, Unbalanced Capacities, Slow Responses, SLA Inversion, Unbounded Result Sets
Patterns: Use Timeouts, Circuit Breaker, Bulkheads, Steady State, Fail Fast, Handshaking, Test Harness, Decoupling Middleware
[Diagram: relationship map linking the antipatterns and patterns listed above. Edge labels include "counters", "prevents", "mitigates", "reduces impact", "avoids", "can avoid", "works with", "finds problems in", "leads to", "exacerbates", "mutual aggravation", "found near", and "results from violating".]