On Failure and Resilience
Mike Brittain
Director of Engineering, Etsy
@mikebrittain
Presented at 37signals on Aug 21, 2012
“Software Infrastructure”
“Framework” code, caching, ORM, file storage tier, developer tools, CI/deployment, site performance, front-end architecture.
Managing failures and building resilience into systems, applications,
process, and people.
Photo: http://www.etsy.com/shop/TheOldTimeJunkShop
$61 M in goods sold in the marketplace
2.9 M items sold
1.2 B page views
http://www.etsy.com/blog/news/2012/etsy-statistics-june-weather-report/
Architecture
Linux, Apache, MySQL, PHP, Postgres, Solr, Gearman, Memcache, Chef, Hadoop, EC2/S3/EMR
30+ logical data stores (23 shards + more functionally partitioned)
Search and storage tiers as “services”
150 Engineers + Designers + Product (this was 20 in Feb 2010)
credit: martin_heigan (flickr)
Buyers, sellers, support, developer API, i18n, core infrastructure, storage, payments, security, fraud detection, big data and BI, email delivery, corp IT, operations, developer tools, continuous integration and testing, site performance, search, advertising, seller economics, mobile web, iOS.
Zero Release Managers
There Will Be Fail
Credit: wilkee.deviantart.com
We cannot comprehend all of the ways in which the individual parts of a complex system will interact. We cannot know all of the states and scenarios.
We cannot prevent failures.
Yet, we can mitigate them.
Redundant system architectures.
Small, well-understood changes to production.
Control application using config flags.
Gratuitous metrics collection.
Resilient user interfaces.
GameDay exercises.
“Uptime” is not binary.
Functionally Partitioned
Convos, AsyncTasks, Ads, Auth
Master-Master Replication
[Diagram: Ads, Auth, AsyncTasks, and Convos, each shown as a pair of servers]
Sharded Tables
[Diagram: shard1 through shard4, each shown as a pair of servers]
~[figure illegible] of listing data is stored on shard [number illegible]
Outage is limited to ~[figure illegible] of data set
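To make the "outage is limited to a fraction of the data set" point concrete, here is a minimal sketch. It assumes a plain modulo mapping from user ID to shard, four shards, and 1,000 sample users. None of this is in the slides, which do not show how Etsy assigns rows to shards. The function names `shard_for_user()` and `affected_fraction()` are hypothetical.

```php
<?php
// Hypothetical mapping: the slides do not show the real shard lookup.
// A user ID is assigned to one of $shardCount shards by simple modulo.
function shard_for_user($userId, $shardCount) {
    return ($userId % $shardCount) + 1; // shards are numbered from 1
}

// Fraction of the given users whose data lives on an unavailable shard.
function affected_fraction(array $userIds, array $downShards, $shardCount) {
    if (count($userIds) === 0) {
        return 0.0;
    }
    $affected = 0;
    foreach ($userIds as $id) {
        if (in_array(shard_for_user($id, $shardCount), $downShards, true)) {
            $affected++;
        }
    }
    return $affected / count($userIds);
}

// Illustrative numbers only: 1,000 users, 4 shards, shard 3 is down.
$users = range(1, 1000);
$fraction = affected_fraction($users, array(3), 4);
printf("%.0f%% of users affected\n", $fraction * 100); // prints "25% of users affected"
```

With an even spread, losing one shard out of N touches roughly 1/N of the data, so the remaining users keep a fully working site.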
“Uptime” is not binary.
Uptime of the application is the responsibility of our Operations team.
Uptime of the application is the responsibility of our Operations, Engineering, Product, and Design teams.
If you are committing code, you are operating the site.
Branching in Code
“All existing revision control systems were built by people who build installed software”
Always Ship Trunk, Paul Hammond
Velocity Conf 2010
Config Flags
Enable and disable features quickly.
Features for staff or for beta groups.
Percentage ramp-up of users or requests.
A/B “experiments.”
$cfg['new_search'] = array('enabled' => 'on');
$cfg['sign_in'] = array('enabled' => 'on');
$cfg['checkout'] = array('enabled' => 'on');
$cfg['homepage'] = array('enabled' => 'on');

// Meanwhile...
// Check the 'enabled' value itself: a non-empty array is always truthy.
if ($cfg['new_search']['enabled'] === 'on') {
    # New hotness
    $results = do_solr();
} else {
    # Old and boring
    $results = do_grep();
}
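The slides list percentage ramp-up as a use for config flags but only show an on/off check. A minimal sketch of how a ramp-up could work follows. It assumes a hypothetical `'rampup'` state and `'percent'` key that are not part of the config shown in the talk, and the helpers `bucket_for()` and `feature_enabled()` are illustrative names, not Etsy's API.

```php
<?php
// Hypothetical flag config: 'enabled' may be 'on', 'off', or 'rampup'.
$cfg = array(
    'new_search' => array('enabled' => 'rampup', 'percent' => 25),
    'checkout'   => array('enabled' => 'on'),
    'homepage'   => array('enabled' => 'off'),
);

// Maps a (feature, user) pair to a stable bucket from 0 to 99.
// The mask keeps the checksum non-negative on 32-bit builds.
function bucket_for($feature, $userId) {
    return (crc32($feature . ':' . $userId) & 0x7fffffff) % 100;
}

// Returns true when the feature should be shown to this user.
function feature_enabled(array $cfg, $feature, $userId) {
    if (!isset($cfg[$feature]['enabled'])) {
        return false; // unknown flags default to off
    }
    $state = $cfg[$feature]['enabled'];
    if ($state === 'on') {
        return true;
    }
    if ($state === 'rampup') {
        $percent = isset($cfg[$feature]['percent']) ? (int) $cfg[$feature]['percent'] : 0;
        return bucket_for($feature, $userId) < $percent;
    }
    return false;
}
```

Hashing the feature name together with the user ID keeps each user's experience stable between requests, and raising `'percent'` only ever adds users to the enabled group.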
But...
“Doesn’t that mean you have conditionals all over your code?”
Yes.
“Does anyone ever clean those up?”
Sometimes.
“That sounds like it sucks.”
Really?
“Wait a minute... all of the counter arguments are in Comic Sans. WTF?!?”
Oh, you noticed? ;)
00:00 Site down for maintenance
+01:47 Site up, disabled login and registration
+06:40 Site up, some seller tools disabled
+07:41 All features restored
DB Server Maintenance, June 16, 2012
http://etsystatus.com/2012/06/16/planned-outage-june-16th-7am-gmt/
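The timeline shows the site coming back in stages, with individual features held off while the rest serves traffic. A rough sketch of that idea, assuming a hypothetical `'site'` master flag plus per-feature flags in the same `'enabled'` style as the config slides. The flag names, messages, and the `respond()` helper are illustrative and not taken from Etsy's code.

```php
<?php
// Hypothetical flags mirroring the second step of the timeline:
// the site is up, but login and registration are switched off.
$cfg = array(
    'site'         => array('enabled' => 'on'),
    'sign_in'      => array('enabled' => 'off'),
    'registration' => array('enabled' => 'off'),
    'checkout'     => array('enabled' => 'on'),
);

function is_on(array $cfg, $feature) {
    return isset($cfg[$feature]['enabled']) && $cfg[$feature]['enabled'] === 'on';
}

// Returns array(HTTP status, message) for a request to the given feature.
function respond(array $cfg, $feature) {
    if (!is_on($cfg, 'site')) {
        return array(503, 'Site down for maintenance');
    }
    if (!is_on($cfg, $feature)) {
        return array(503, "Sorry, $feature is temporarily unavailable");
    }
    return array(200, 'OK');
}
```

Restoring a feature is then a config change (`'off'` to `'on'`) rather than a code deploy.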
“Uptime” is not binary.
Features are launched by flipping a config flag, not by deploying
hundreds of lines of code.
“If Engineering at Etsy has a religion, it’s the Church of Graphs.”
Ian Malpass, Code as Craft
http://etsy.me/ePkoZB
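The slides do not show how metrics are emitted. As one possible illustration, the sketch below formats and sends a counter in the StatsD line protocol (`name:value|c`) over UDP. The use of StatsD here, the metric name, the host, and the helper names are all assumptions made for the example; port 8125 is the usual StatsD default.

```php
<?php
// Formats a counter increment in the StatsD line protocol: "<name>:<value>|c".
function statsd_counter_line($name, $value = 1) {
    return sprintf('%s:%d|c', $name, $value);
}

// Fire-and-forget send over UDP. Host and port are assumptions.
function statsd_increment($name, $host = '127.0.0.1', $port = 8125) {
    $socket = @fsockopen('udp://' . $host, $port, $errno, $errstr, 0.1);
    if ($socket === false) {
        return false; // a metrics failure must never take the page down
    }
    @fwrite($socket, statsd_counter_line($name));
    fclose($socket);
    return true;
}

// Hypothetical metric name, for illustration only.
echo statsd_counter_line('errors.template_views'), "\n"; // errors.template_views:1|c
```

Because UDP is connectionless, the application does not wait on the metrics collector, which keeps instrumentation cheap enough to add everywhere.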
THIS IS HOW YOU RUN A COMPLEX SYSTEM
Operator
Config flags
Metrics
Photo: http://www.flickr.com/photos/flyforfun/2694158656/
Oh, you want to talk about how we collect metrics and make graphs?
http://www.slideshare.net/mikebrittain/metricsdriven-engineering
Resilient User Interfaces
Interfaces and user experiences that adapt to technical and architectural failure.
http://www.flickr.com/photos/caffeina/2144044776/
http://www.flickr.com/photos/17793901@N00/106331831/
/**
 * Creates a database connection.
 */
public function __construct($host, $user, $pass, $db) {
    parent::__construct($host, $user, $pass, $db);

    if (mysqli_connect_error()) {
        throw new DBConnection_Exception(
            sprintf("Error: %s, %s", mysqli_connect_errno(), mysqli_connect_error()));
    }
}

try {
    $conn = new DBConnection('viewsdb.host', 'db_read_user', 'ssssshh!', 'views_db');
} catch (DBConnection_Exception $e) {
    // TODO: Someone should figure out what to do if
    // we can't connect to the views db.
    throw $e;
}
Site navigation
Logo
Cute Picture
Generic, catch-all error messaging....
Every back-end service is an opportunity for failure.
Critical Path
[Diagram: numbered back-end service calls]
#srsly?
[value illegible] ms
Non-blocking Ajax
Google Docs
Google Calendar
Gmail
“Oops, we aren’t able to access click metrics right
now, do not worry — your data is safe.”
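A message like the one above implies that a failed back-end call is caught per component instead of being rethrown to a generic error page. A minimal sketch of that pattern, assuming a hypothetical `render_click_metrics()` helper and a callable standing in for the real service client. The message wording is adapted from the slide, and `RuntimeException` stands in for whatever exception the real client would throw.

```php
<?php
// Renders one page component. $fetch is a hypothetical callable that returns
// a click count, or throws when its back-end service is unreachable.
function render_click_metrics(callable $fetch) {
    try {
        $clicks = $fetch();
        return sprintf('Clicks: %d', $clicks);
    } catch (Exception $e) {
        // Degrade only this component; the rest of the page still renders.
        return "Oops, we aren't able to access click metrics right now, "
             . "do not worry, your data is safe.";
    }
}

// Stand-ins for a healthy and a failing back end.
$healthy = function () { return 42; };
$broken  = function () { throw new RuntimeException('views db unreachable'); };

echo render_click_metrics($healthy), "\n";
echo render_click_metrics($broken), "\n";
```

The design choice is that the catch block answers the question the earlier TODO left open: decide, per feature, what the user should see when that one dependency is down.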
Product design doesn’t stop at 100% availability.
Ops, Dev, Product
Operability Reviews
What is changing about the architecture?
What kind of data access patterns are we using?
How much traffic, how many queries?
What metrics are we collecting?
Are there automated alerts? How do we know the thresholds are right?
How do we turn it off? ...and what happens when we do?
“What could possibly go wrong?”
“GameDay” Exercises
Pedro
Surprise!!! Turning off multi-language support improves our page generation times by up to 25%.
Homepage (95th percentile)
(Blameless) Post-Mortems
How could this have gone better?
How quickly did we find out that something was wrong?
Did we communicate well to our visitors and each other?
Why did we have confidence that what we were doing was OK?
Did we have the right tools, did we use them properly?
Did we collect metrics, and could we find them?
Where did we make the wrong decisions?
What steps do we take to reduce the chance of this happening again in the future?
“... an engineer who thinks they’re going to be reprimanded are disincentivized to give the details necessary to get an understanding of the mechanism, pathology, and operation of the failure.
This lack of understanding of how the accident occurred all but guarantees that it will repeat. If not with the original engineer, another one in the future.”
http://codeascraft.etsy.com/2012/05/22/blameless-postmortems/
John Allspaw, VP, Technical Operations, Etsy
We should try to learn not only what went wrong, but also what went right.
00:00 Site down for maintenance
+01:47 Site up, disabled login and registration
+06:40 Site up, some seller tools disabled
+07:41 All features restored
DB Server Maintenance, June 16, 2012
http://etsystatus.com/2012/06/16/planned-outage-june-16th-7am-gmt/
Operational Mindset
Ops, Dev, Product
Business Priorities
Introspection
Page views for error template
...or, how are we screwing our users?
Risk mitigation in a complex system
Redundant system architectures.
Small, well-understood changes to production.
Control application using config flags.
Gratuitous metrics collection.
Resilient user interfaces.
GameDay exercises.
PHOTO CREDITS
Flickr: roboppy http://www.flickr.com/photos/51035735481@N01/163374138/
Flickr: jamesjyu http://www.flickr.com/photos/32593095@N00/3465022/
Flickr: circulating http://www.flickr.com/photos/26835318@N00/2318226026/