On Failure and Resilience


Transcript of On Failure and Resilience

Page 1: On Failure and Resilience

On Failure and Resilience

Mike Brittain
Director of Engineering, Etsy

@mikebrittain

Presented at 37signals on Aug 21, 2012

Page 2: On Failure and Resilience

“Software Infrastructure”
“Framework” code, caching, ORM, file storage tier, developer tools, CI/deployment, site performance, front-end architecture.

Page 3: On Failure and Resilience

Managing failures and building resilience into systems, applications, process, and people.

Page 4: On Failure and Resilience
Page 5: On Failure and Resilience

Photo: http://www.etsy.com/shop/TheOldTimeJunkShop

$61 M in goods sold in the marketplace
2.9 M items sold
1.2 B page views

http://www.etsy.com/blog/news/2012/etsy-statistics-june-weather-report/

Page 6: On Failure and Resilience

Architecture
Linux, Apache, MySQL, PHP, Postgres, Solr, Gearman, Memcache, Chef, Hadoop, EC2/S3/EMR

30+ logical data stores (23 shards + more functionally partitioned)

Search and storage tiers as “services”

Page 7: On Failure and Resilience

150 Engineers + Designers + Product
(this was 20 in Feb 2010)

credit: martin_heigan (flickr)

Page 8: On Failure and Resilience

Buyers, sellers, support, developer API, i18n, core infrastructure, storage, payments, security, fraud detection, big data and BI, email delivery, corp IT, operations, developer tools, continuous integration and testing, site performance, search, advertising, seller economics, mobile web, iOS.

Page 9: On Failure and Resilience
Page 10: On Failure and Resilience

Zero Release Managers

Page 11: On Failure and Resilience

There Will Be Fail

Credit: wilkee.deviantart.com

Page 12: On Failure and Resilience

We cannot comprehend all of the ways in which the individual parts of a complex system will interact. We cannot know all of the states and scenarios.

We cannot prevent failures.

Page 13: On Failure and Resilience

Yet, we can mitigate them.

Redundant system architectures.
Small, well-understood changes to production.
Control application using config flags.
Gratuitous metrics collection.
Resilient user interfaces.
GameDay exercises.

Page 14: On Failure and Resilience

“Uptime” is not binary.

Page 15: On Failure and Resilience

Convos AsyncTasks Ads Auth

Functionally Partitioned

Page 17: On Failure and Resilience

Master-Master Replication

[Diagram: master-master pairs for Ads, Auth, Asynctasks, and Convos, with numbered callouts 1 to 5]

Page 20: On Failure and Resilience

Sharded Tables

[Diagram: master-master pairs for shard1, shard2, shard3, and shard4, with numbered callouts 1 to 5]

~!" of listing data is stored on shard#

Page 22: On Failure and Resilience

Sharded Tables

[Diagram: master-master pairs for shard1, shard2, shard3, and shard4]

Outage is limited to~!" of data set
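
The effect of sharding on failure scope can be sketched in a few lines of PHP. This is only an illustration, not Etsy's implementation: the host names, the modulo placement rule, and the helper functions `shard_for_user`, `pick_host`, and `affected_users` are all assumptions made for the example.

```php
<?php
// Hypothetical shard map: each logical shard is a master-master pair.
// Host names and the modulo placement rule are illustrative assumptions.
$shards = array(
    1 => array('shard1-a', 'shard1-b'),
    2 => array('shard2-a', 'shard2-b'),
    3 => array('shard3-a', 'shard3-b'),
    4 => array('shard4-a', 'shard4-b'),
);

/** Map a non-negative user ID to a shard number from 1 to $shardCount. */
function shard_for_user(int $userId, int $shardCount): int
{
    return ($userId % $shardCount) + 1;
}

/** Return the first host of a pair that is not down, or null if both are. */
function pick_host(array $pair, array $downHosts): ?string
{
    foreach ($pair as $host) {
        if (!in_array($host, $downHosts, true)) {
            return $host;
        }
    }
    return null;
}

/** Count how many of the given users cannot reach any host for their shard. */
function affected_users(array $shards, array $userIds, array $downHosts): int
{
    $affected = 0;
    foreach ($userIds as $userId) {
        $shard = shard_for_user($userId, count($shards));
        if (pick_host($shards[$shard], $downHosts) === null) {
            $affected++;
        }
    }
    return $affected;
}
```

Under these assumptions, losing one host of a pair affects nobody, because the other master takes over. Losing both hosts of one shard affects only the users placed on that shard, which is a quarter of them when four shards are evenly loaded.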

Page 23: On Failure and Resilience

“Uptime” is not binary.

Page 24: On Failure and Resilience

Uptime of the application is the responsibility of our Operations team.

Page 25: On Failure and Resilience

Uptime of the application is the responsibility of our Operations, Engineering, Product, and Design teams.

Page 26: On Failure and Resilience

Uptime of the application is the responsibility of our Operations, Engineering, Product, and Design teams.

If you are committing code, you are operating the site.

Page 27: On Failure and Resilience

Branching in Code

Page 28: On Failure and Resilience

“All existing revision control systems were built by people who build installed software”

Always Ship Trunk
Paul Hammond
Velocity Conf 2010

Page 29: On Failure and Resilience

Config Flags

Enable and disable features quickly.
Features for staff or for beta groups.
Percentage ramp-up of users or requests.
A/B “experiments.”
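
The flag types on this slide (on/off, staff-only, percentage ramp-up) could be checked with something like the sketch below. It is not Etsy's flag library: the `'staff'` and `'rampup'` values, the `'percent'` key, and the `feature_enabled` function are hypothetical names chosen for the example.

```php
<?php
// Hypothetical flag definitions; keys and values are illustrative only.
$cfg = array(
    'new_search'   => array('enabled' => 'on'),
    'sign_in'      => array('enabled' => 'off'),
    'beta_tools'   => array('enabled' => 'staff'),
    'new_checkout' => array('enabled' => 'rampup', 'percent' => 10),
);

/**
 * Decide whether a feature is on for a given user.
 */
function feature_enabled(array $cfg, string $name, int $userId, bool $isStaff = false): bool
{
    if (!isset($cfg[$name]['enabled'])) {
        return false; // Unknown flags default to off.
    }
    switch ($cfg[$name]['enabled']) {
        case 'on':
            return true;
        case 'staff':
            return $isStaff;
        case 'rampup':
            // Hash the flag name with the user ID so each user lands in a
            // stable bucket from 0 to 99, independent of other flags.
            $bucket = abs(crc32($name . ':' . $userId)) % 100;
            return $bucket < (int) ($cfg[$name]['percent'] ?? 0);
        default:
            return false; // 'off' or any unrecognized value.
    }
}
```

Hashing rather than picking at random keeps each user's experience stable between requests, and raising `percent` only ever adds users to the enabled group.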

Page 30: On Failure and Resilience

$cfg['new_search'] = array('enabled' => 'on');
$cfg['sign_in'] = array('enabled' => 'on');
$cfg['checkout'] = array('enabled' => 'on');
$cfg['homepage'] = array('enabled' => 'on');

Page 31: On Failure and Resilience

$cfg['new_search'] = array('enabled' => 'on');

// Meanwhile...

// Check the 'enabled' value itself: a non-empty array is always truthy.
if ($cfg['new_search']['enabled'] === 'on') {
    # New hotness
    $results = do_solr();
} else {
    # Old and boring
    $results = do_grep();
}

Page 32: On Failure and Resilience

But...

Page 36: On Failure and Resilience

“Doesn’t that mean you have conditionals all over your code?”

Yes.

“Does anyone ever clean those up?”

Sometimes.

“That sounds like it sucks.”

Really?

“Wait a minute... all of the counter arguments are in Comic Sans. WTF?!?”

Oh, you noticed? ;)

Page 37: On Failure and Resilience

00:00   Site down for maintenance
+01:47  Site up, disabled login and registration
+06:40  Site up, some seller tools disabled
+07:41  All features restored

DB Server Maintenance, June 16, 2012
http://etsystatus.com/2012/06/16/planned-outage-june-16th-7am-gmt/

Page 38: On Failure and Resilience

“Uptime” is not binary.

Page 39: On Failure and Resilience

Features are launched by flipping a config flag, not by deploying hundreds of lines of code.

Page 40: On Failure and Resilience

“If Engineering at Etsy has a religion, it’s the Church of Graphs.”

Ian Malpass, Code as Craft
http://etsy.me/ePkoZB

Page 41: On Failure and Resilience
Page 42: On Failure and Resilience
Page 43: On Failure and Resilience

http://www.flickr.com/photos/flyforfun/2694158656/

THIS IS HOW YOU RUN A COMPLEX SYSTEM

Page 44: On Failure and Resilience

http://www.flickr.com/photos/flyforfun/2694158656/

Operator
Config flags
Metrics

Page 45: On Failure and Resilience

Oh, you want to talk about how we collect metrics and make graphs?

http://www.slideshare.net/mikebrittain/metricsdriven-engineering
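
As a rough idea of what cheap, "gratuitous" metrics collection can look like in application code, here is a minimal sketch. The slides do not say which collector is used; this assumes a StatsD-style listener on `127.0.0.1:8125`, and the function names and metric names are made up for the example.

```php
<?php
/**
 * Format one metric in the StatsD line format: <name>:<value>|<type>.
 * Type "c" is a counter, "ms" is a timer.
 */
function statsd_line(string $name, int $value = 1, string $type = 'c'): string
{
    return sprintf('%s:%d|%s', $name, $value, $type);
}

/**
 * Fire-and-forget send over UDP. Failures are ignored on purpose:
 * metrics collection must never break the request being measured.
 * The default host and port are assumptions for this sketch.
 */
function send_metric(string $line, string $host = '127.0.0.1', int $port = 8125): void
{
    $socket = @fsockopen('udp://' . $host, $port, $errno, $errstr, 0.1);
    if ($socket === false) {
        return;
    }
    @fwrite($socket, $line);
    fclose($socket);
}

// Usage: count a login and time a search (hypothetical metric names).
send_metric(statsd_line('logins.success'));
send_metric(statsd_line('search.response_time', 320, 'ms'));
```

Because each metric is one UDP packet with no reply, adding another counter costs almost nothing, which makes it practical to instrument first and decide later which graphs matter.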

Page 46: On Failure and Resilience

Resilient User Interfaces

Page 47: On Failure and Resilience

Interfaces and user experiences that adapt to technical and architectural failure.

Page 48: On Failure and Resilience
Page 49: On Failure and Resilience
Page 50: On Failure and Resilience

http://www.flickr.com/photos/caffeina/2144044776/

Page 51: On Failure and Resilience

http://www.flickr.com/photos/17793901@N00/106331831/

Page 52: On Failure and Resilience
Page 53: On Failure and Resilience
Page 54: On Failure and Resilience

/**
 * Creates a database connection.
 */
public function __construct($host, $user, $pass, $db) {
    parent::__construct($host, $user, $pass, $db);

    if (mysqli_connect_error()) {
        throw new DBConnection_Exception(
            sprintf("Error: %s, %s", mysqli_connect_errno(), mysqli_connect_error()));
    }
}

Page 55: On Failure and Resilience

try {
    $conn = new DBConnection('viewsdb.host', 'db_read_user', 'ssssshh!', 'views_db');
} catch (DBConnection_Exception $e) {
    // TODO: Someone should figure out what to do if
    // we can't connect to the views db.
    throw $e;
}

Page 56: On Failure and Resilience
Page 57: On Failure and Resilience
Page 58: On Failure and Resilience

Site navigation
Logo

Cute Picture

Generic, catch-all error messaging....

Page 59: On Failure and Resilience

http://www.flickr.com/photos/caffeina/2144044776/

Page 60: On Failure and Resilience

Every back-end service is an opportunity for failure.

Page 61: On Failure and Resilience
Page 62: On Failure and Resilience
Page 63: On Failure and Resilience
Page 64: On Failure and Resilience

[Figure: numbered callouts 1 to 14]

Page 65: On Failure and Resilience
Page 66: On Failure and Resilience

Critical Path

Page 67: On Failure and Resilience
Page 68: On Failure and Resilience
Page 69: On Failure and Resilience
Page 70: On Failure and Resilience

http://www.flickr.com/photos/caffeina/2144044776/

#srsly?

Page 71: On Failure and Resilience

" #$$ ms

Page 72: On Failure and Resilience

Non-blocking Ajax

Page 73: On Failure and Resilience

Google Docs

Google Calendar

Page 74: On Failure and Resilience

GMail

Page 75: On Failure and Resilience

“Oops, we aren’t able to access click metrics right

now, do not worry — your data is safe.”
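
One way to get this behavior is to catch the failure of a non-critical service at the point where its data is rendered, rather than re-throwing it up to a generic error page. The sketch below is an illustration only: `listing_views_label` and its callable argument are hypothetical, and `DBConnection_Exception` is redeclared here just so the example runs on its own.

```php
<?php
// Stand-in for the exception type thrown by the connection code on the
// earlier slide, redeclared so this sketch is self-contained.
class DBConnection_Exception extends Exception {}

/**
 * Build the "views" label for a listing, degrading gracefully if the
 * views store is down. $fetch is any callable that returns an int or
 * throws DBConnection_Exception.
 */
function listing_views_label(callable $fetch, int $listingId): string
{
    try {
        return sprintf('%d views', $fetch($listingId));
    } catch (DBConnection_Exception $e) {
        // Record the failure for operators, then render a friendly
        // message instead of failing the whole page.
        error_log('views db unavailable: ' . $e->getMessage());
        return 'View counts are temporarily unavailable. Your data is safe.';
    }
}

// Usage with two stand-in fetchers: one healthy, one failing.
$healthy = function (int $id): int { return 42; };
$failing = function (int $id): int {
    throw new DBConnection_Exception('connection refused');
};

echo listing_views_label($healthy, 7), "\n";
echo listing_views_label($failing, 7), "\n";
```

The rest of the page never learns that the views database was unreachable, so an outage in one back-end service costs one label rather than the whole request.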

Page 76: On Failure and Resilience

Product design doesn’t stop at 100% availability.

Page 77: On Failure and Resilience

Ops
Dev

Page 78: On Failure and Resilience

Product

Ops
Dev

Page 79: On Failure and Resilience

[Figure: numbered callouts 1 to 14]

Page 80: On Failure and Resilience

Operability Reviews

Page 81: On Failure and Resilience

What is changing about the architecture?
What kind of data access patterns are we using?
How much traffic, how many queries?
What metrics are we collecting?
Are there automated alerts? How do we know the thresholds are right?
How do we turn it off? ...and what happens when we do?

“What could possibly go wrong?”

Page 83: On Failure and Resilience

“GameDay” Exercises

Page 84: On Failure and Resilience

Tuesday, April 24, 12

Page 85: On Failure and Resilience

Tuesday, April 24, 12

Pedro

Page 86: On Failure and Resilience

Surprise!!!
Turning off multi-language support improves our page generation times by up to 25%.

Homepage (95th perc.)

Page 87: On Failure and Resilience

(Blameless) Post-Mortems

Page 88: On Failure and Resilience

How could this have gone better?

How quickly did we find out that something was wrong?
Did we communicate well to our visitors and each other?
Why did we have confidence that what we were doing was OK?
Did we have the right tools, did we use them properly?
Did we collect metrics, and could we find them?
Where did we make the wrong decisions?

What steps do we take to reduce the chance of this happening again in the future?

Page 89: On Failure and Resilience

“... an engineer who thinks they’re going to be reprimanded are disincentivized to give the details necessary to get an understanding of the mechanism, pathology, and operation of the failure.

This lack of understanding of how the accident occurred all but guarantees that it will repeat. If not with the original engineer, another one in the future.”

http://codeascraft.etsy.com/2012/05/22/blameless-postmortems/

John Allspaw
VP, Technical Operations, Etsy

Page 90: On Failure and Resilience

We should try to learn not only what went wrong, but also what went right.

Page 91: On Failure and Resilience

00:00   Site down for maintenance
+01:47  Site up, disabled login and registration
+06:40  Site up, some seller tools disabled
+07:41  All features restored

DB Server Maintenance, June 16, 2012
http://etsystatus.com/2012/06/16/planned-outage-june-16th-7am-gmt/

Page 92: On Failure and Resilience

Operational Mindset

Ops
Dev
Product

Page 93: On Failure and Resilience

Business Priorities

Operational Mindset

Ops
Dev
Product

Page 94: On Failure and Resilience

Introspection

Page 95: On Failure and Resilience

!"#$ %&$'( )*+ $++*+ ,$-!.",$

Page 96: On Failure and Resilience

!"#$ %&$'( )*+ $++*+ ,$-!.",$...or, how are we screwing our users?

Page 97: On Failure and Resilience

Risk mitigation in a complex system

Redundant system architectures.
Small, well-understood changes to production.
Control application using config flags.
Gratuitous metrics collection.
Resilient user interfaces.
GameDay exercises.

Page 98: On Failure and Resilience

Thank you.

Mike Brittain

@mikebrittain

Page 99: On Failure and Resilience
Page 100: On Failure and Resilience
Page 101: On Failure and Resilience
Page 102: On Failure and Resilience

PHOTO CREDITS

Flickr: roboppy, http://www.flickr.com/photos/51035735481@N01/163374138/

Flickr: jamesjyu, http://www.flickr.com/photos/32593095@N00/3465022/

Flickr: circulating, http://www.flickr.com/photos/26835318@N00/2318226026/