Normal accidents and outpatient surgeries
Uploaded by jonathan-creasy
![Page 1: Normal accidents and outpatient surgeries](https://reader033.fdocuments.in/reader033/viewer/2022061206/5482ef50b47959fb0c8b494d/html5/thumbnails/1.jpg)
Normal Accidents and Outpatient Surgeries
Resilience Engineering Done Right
![Page 3: Normal accidents and outpatient surgeries](https://reader033.fdocuments.in/reader033/viewer/2022061206/5482ef50b47959fb0c8b494d/html5/thumbnails/3.jpg)
Safety in a Complex and Changing Environment
"...so safety isn't about the absence of something...that you need to count errors or monitor violations. But the presence of something. But the presence of what?
When we need to find that things go right under difficult circumstances, it's mostly because of people's adaptive capability; their ability to recognize, adapt to, and absorb changes and disruptions, some of which might fall outside of what the system is designed or trained to handle"
- Sidney Dekker
RESILIENCE
![Page 7: Normal accidents and outpatient surgeries](https://reader033.fdocuments.in/reader033/viewer/2022061206/5482ef50b47959fb0c8b494d/html5/thumbnails/7.jpg)
Vocabulary Lesson
Continuous Integration: The ability to quickly make sure the system is ready for production.
Resilience: The intrinsic ability of a system to adjust its functioning prior to, during, or following changes and disturbances in order to sustain required operations.
Maintainability: Characteristic of design and installation which determines the probability that a failed equipment, machine, or system can be restored to its normal state within a given timeframe.
The SYSTEM includes all the hardware and software, but also all of the PEOPLE involved.
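The maintainability definition above is a probability statement, so it can be put into numbers. The sketch below assumes repair times follow an exponential distribution with a known mean time to repair; that distribution is a modelling assumption chosen for simplicity, not something the slides specify, and the function name `maintainability` is made up for illustration.

```python
import math


def maintainability(t_minutes: float, mttr_minutes: float) -> float:
    """Probability that a failed system is restored within t_minutes.

    Assumption: repair times are exponentially distributed with mean
    mttr_minutes. Real repair-time distributions may look quite different.
    """
    if mttr_minutes <= 0:
        raise ValueError("mttr_minutes must be positive")
    if t_minutes < 0:
        raise ValueError("t_minutes must be non-negative")
    return 1.0 - math.exp(-t_minutes / mttr_minutes)


if __name__ == "__main__":
    # Chance of being back within 30 minutes for two hypothetical MTTR values.
    for mttr in (15, 60):
        print(f"MTTR {mttr} min: {maintainability(30, mttr):.3f}")
```

Under this model, a team with a shorter mean repair time has a much higher chance of restoring service inside any fixed window.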
![Page 11: Normal accidents and outpatient surgeries](https://reader033.fdocuments.in/reader033/viewer/2022061206/5482ef50b47959fb0c8b494d/html5/thumbnails/11.jpg)
Maintainability = Uptime Goodness
MTTR vs. MTBF
Low MTTR > High MTBF
Low MTTR = Better Uptime for most types of failure (F)
Low MTTR Requires:
• more useful metrics
• intelligent data analysis
• pre-planned, purposeful resilience
• cooperation between application and infrastructure
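To make the MTTR vs. MTBF comparison concrete, here is a minimal sketch using the standard steady-state availability formula, MTBF / (MTBF + MTTR). The function names and the numbers in the usage example are hypothetical. In this simple model, halving MTTR buys the same availability as doubling MTBF; the arithmetic does not by itself show which is easier to achieve, which is the operational argument the slide is making.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time the system is up."""
    if mtbf_hours <= 0:
        raise ValueError("mtbf_hours must be positive")
    if mttr_hours < 0:
        raise ValueError("mttr_hours must be non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


def downtime_minutes_per_year(mtbf_hours: float, mttr_hours: float) -> float:
    """Expected minutes of downtime in a 365-day year."""
    return (1.0 - availability(mtbf_hours, mttr_hours)) * 365 * 24 * 60


if __name__ == "__main__":
    # Hypothetical baseline: one failure every 100 hours, 1 hour to repair.
    print("baseline     ", downtime_minutes_per_year(100, 1.0))
    print("halve MTTR   ", downtime_minutes_per_year(100, 0.5))
    print("double MTBF  ", downtime_minutes_per_year(200, 1.0))
```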
![Page 12: Normal accidents and outpatient surgeries](https://reader033.fdocuments.in/reader033/viewer/2022061206/5482ef50b47959fb0c8b494d/html5/thumbnails/12.jpg)
Your Average Operations Engineer
![Page 13: Normal accidents and outpatient surgeries](https://reader033.fdocuments.in/reader033/viewer/2022061206/5482ef50b47959fb0c8b494d/html5/thumbnails/13.jpg)
Your Average Operations Engineer
![Page 16: Normal accidents and outpatient surgeries](https://reader033.fdocuments.in/reader033/viewer/2022061206/5482ef50b47959fb0c8b494d/html5/thumbnails/16.jpg)
Automation as a Default: "One of the best ways to eliminate human problems is to take the human out of the problem. Machines are very good at doing things repeatedly and doing them the same way every single time. Humans are not good at this. Let the machines do it."
Rapid Recovery: "Do we spend an unpredictable amount of time trying to solve some obscure issue, or do we simply recreate the instance providing the service from configuration management?"
blog.lusis.org/blog/2011/10/18/deploy-all-the-things/
PUPPET + KICKSTART + Network Automation
ESPER + HEALTHCHECK + NAGIOS + SPLUNK + OHSHIT
![Page 19: Normal accidents and outpatient surgeries](https://reader033.fdocuments.in/reader033/viewer/2022061206/5482ef50b47959fb0c8b494d/html5/thumbnails/19.jpg)
Comfortable Changes
1) Are Small
• Many Small Changes = Fewer Incidents with lower MTTR
2) Are Reproducible
RPM:
• Really Peaceful Mornings
• Reduce Paging Monitors
• Reusable Provisioning Methods
Rule # 81: If you are logging into servers, you are doing it wrong.
![Page 22: Normal accidents and outpatient surgeries](https://reader033.fdocuments.in/reader033/viewer/2022061206/5482ef50b47959fb0c8b494d/html5/thumbnails/22.jpg)
Comfortable Changes
3) Are easily understood by your most junior team members
Rule # 4: Keep it Simple, because you are smart. Do not make it overly complex because you can.
4) Can be deployed to a subset of production systems
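One way to deploy to a subset of production systems is to pick a stable slice of hosts for each release. The sketch below is an assumption about how that could be done, not a description of the speaker's tooling: it hashes each host name together with a release label so the same hosts are chosen every time the selection is rerun. The function name `canary_hosts`, the `salt` parameter, and the host names are all hypothetical.

```python
import hashlib


def canary_hosts(hosts, percent, salt="release-1"):
    """Return a deterministic subset of roughly `percent` percent of hosts.

    Each host is hashed with a per-release salt and placed in one of 100
    buckets; hosts whose bucket is below `percent` form the subset.
    """
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    selected = []
    for host in hosts:
        digest = hashlib.sha256(f"{salt}:{host}".encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "big") % 100
        if bucket < percent:
            selected.append(host)
    return selected


if __name__ == "__main__":
    fleet = [f"web{n:02d}.example.com" for n in range(1, 21)]
    print(canary_hosts(fleet, 10))
```

Changing the salt for each release rotates which hosts go first, so the same machines do not always absorb the risk.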
![Page 24: Normal accidents and outpatient surgeries](https://reader033.fdocuments.in/reader033/viewer/2022061206/5482ef50b47959fb0c8b494d/html5/thumbnails/24.jpg)
Comfortable Changes
5) Follow Process
Change control, deployment processes, and peer review all matter for a world-class OPS organization.
![Page 26: Normal accidents and outpatient surgeries](https://reader033.fdocuments.in/reader033/viewer/2022061206/5482ef50b47959fb0c8b494d/html5/thumbnails/26.jpg)
Comfortable Changes
6) Have been approved by a GO / NO-GO process with all relevant parties checking in.
Ensure that all teams involved in a change have signed off, including ON-CALL and CUSTOMER SERVICE.
![Page 27: Normal accidents and outpatient surgeries](https://reader033.fdocuments.in/reader033/viewer/2022061206/5482ef50b47959fb0c8b494d/html5/thumbnails/27.jpg)
Tracking Changes
![Page 28: Normal accidents and outpatient surgeries](https://reader033.fdocuments.in/reader033/viewer/2022061206/5482ef50b47959fb0c8b494d/html5/thumbnails/28.jpg)
![Page 29: Normal accidents and outpatient surgeries](https://reader033.fdocuments.in/reader033/viewer/2022061206/5482ef50b47959fb0c8b494d/html5/thumbnails/29.jpg)
Small Changes
John Allspaw presented these graphs of data gathered at Etsy.
More Smaller Deployments
means Faster MTTR
means Fewer Minutes of Disruption
![Page 30: Normal accidents and outpatient surgeries](https://reader033.fdocuments.in/reader033/viewer/2022061206/5482ef50b47959fb0c8b494d/html5/thumbnails/30.jpg)
![Page 31: Normal accidents and outpatient surgeries](https://reader033.fdocuments.in/reader033/viewer/2022061206/5482ef50b47959fb0c8b494d/html5/thumbnails/31.jpg)
![Page 32: Normal accidents and outpatient surgeries](https://reader033.fdocuments.in/reader033/viewer/2022061206/5482ef50b47959fb0c8b494d/html5/thumbnails/32.jpg)
Operations Meta-Metrics
When in doubt, COLLECT DATA, Build a Timeline!
Things to Monitor:
• Changes (who/what/when/type)
• Incidents (Type/Severity/Duration)
• Responses to Incidents (TTD/TTR)
Things to Collect:
• IRC/Jabber Logs
• Jira Logs
Search your Data: Use HBASE+PIG/HIVE, ESPER, SOLR and SPLUNK
Store everything, even stuff you don't yet know how to use.
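A minimal sketch of "build a timeline", assuming changes and incidents are recorded with the fields listed on this slide (who/what/when/type and type/severity/duration). The class names, field names, and the 60-minute lookback window are illustrative assumptions, not part of the talk.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Change:
    when: datetime
    who: str
    what: str
    kind: str  # e.g. "code", "config", "network"


@dataclass(frozen=True)
class Incident:
    when: datetime
    kind: str
    severity: int
    duration_minutes: float


def build_timeline(changes, incidents):
    """Merge changes and incidents into one list ordered by time."""
    events = [(c.when, "change", f"{c.who}: {c.what} ({c.kind})") for c in changes]
    events += [
        (i.when, "incident", f"{i.kind} sev{i.severity}, {i.duration_minutes:g} min")
        for i in incidents
    ]
    return sorted(events, key=lambda event: event[0])


def changes_before(incident, changes, window_minutes=60):
    """Changes made in the window leading up to an incident."""
    start = incident.when - timedelta(minutes=window_minutes)
    return [c for c in changes if start <= c.when <= incident.when]
```

With both kinds of event on one timeline, the first question after an incident ("what changed just before this?") becomes a simple lookup.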
![Page 33: Normal accidents and outpatient surgeries](https://reader033.fdocuments.in/reader033/viewer/2022061206/5482ef50b47959fb0c8b494d/html5/thumbnails/33.jpg)
Tracking Incidents - MTTD
1. Frequency
2. Severity
3. Root Cause: Five Whys Mentality
   o Why was the website down? The CPU utilization on all our front-end servers went to 100%.
   o Why did the CPU usage spike? A new bit of code contained an infinite loop!
   o Why did that code get written? So-and-so made a mistake.
   o Why did his mistake get checked in? He didn't write a unit test for the feature.
   o Why didn't he write a unit test? He's a new employee, and he was not properly trained.
4. Time-to-Detect
5. Time-to-Resolve
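Time-to-Detect and Time-to-Resolve can be averaged into MTTD and MTTR once each incident carries three timestamps. The sketch below assumes time-to-resolve is measured from detection to resolution; some teams measure it from the start of the incident instead, so treat that choice, along with the `IncidentRecord` name and its fields, as an assumption.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass(frozen=True)
class IncidentRecord:
    started: datetime   # when the problem actually began
    detected: datetime  # when someone (or something) noticed
    resolved: datetime  # when service was confirmed stable


def _mean_minutes(deltas):
    """Average a list of timedeltas and return the result in minutes."""
    return mean([d.total_seconds() for d in deltas]) / 60.0


def mttd_minutes(records):
    """Mean time to detect: started -> detected."""
    return _mean_minutes([r.detected - r.started for r in records])


def mttr_minutes(records):
    """Mean time to resolve: detected -> resolved (an assumed convention)."""
    return _mean_minutes([r.resolved - r.detected for r in records])
```

Both functions raise `statistics.StatisticsError` on an empty list, which is a reasonable signal that there is nothing to average yet.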
![Page 34: Normal accidents and outpatient surgeries](https://reader033.fdocuments.in/reader033/viewer/2022061206/5482ef50b47959fb0c8b494d/html5/thumbnails/34.jpg)
Tracking Incidents - MTTD
Rule # 18: Monitor EVERYTHING, alert on actionable items only, and record the rest for trend information.
Rule # 20: Do not make the monitoring system so noisy it is useless.
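Rules 18 and 20 together suggest a simple split: every sample is recorded, but only a short, explicit list of metrics can ever page someone. The sketch below illustrates that idea; the metric names, thresholds, and the `route` function are hypothetical and not taken from any particular monitoring system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    metric: str
    value: float


# Hypothetical thresholds: only metrics listed here are treated as actionable.
ACTIONABLE_THRESHOLDS = {
    "frontend.cpu_percent": 95.0,
    "checkout.error_rate": 0.05,
}


def route(sample, history, thresholds=ACTIONABLE_THRESHOLDS):
    """Record every sample; return True only if it should raise an alert."""
    history.append(sample)  # everything is kept for trend analysis
    limit = thresholds.get(sample.metric)
    return limit is not None and sample.value >= limit
```

Keeping the actionable list small and explicit is one way to stop the monitoring system from becoming too noisy to be useful.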
![Page 35: Normal accidents and outpatient surgeries](https://reader033.fdocuments.in/reader033/viewer/2022061206/5482ef50b47959fb0c8b494d/html5/thumbnails/35.jpg)
Tracking Incidents - MTTD
Data Points to source these metrics from:
Output from Application, CLOG, Puppet, Jabber, Jira, healthcheck, hardware, Eluna, Nagios... all collectible data
![Page 36: Normal accidents and outpatient surgeries](https://reader033.fdocuments.in/reader033/viewer/2022061206/5482ef50b47959fb0c8b494d/html5/thumbnails/36.jpg)
Handling Incident Response - MTTR
Detect a Problem
Communicate to Support/Community/Executives
Begin to take Action
Communicate to Support/Community/Executives
Coordinate Troubleshooting/Diagnosis
Communicate to Support/Community/Executives
Confirm Stability, Resolving Steps
Communicate to Support/Community/Executives
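The pattern on this slide is that every response step is followed by a status update to the same three audiences. A minimal sketch of that loop follows; the `run_response` function and the message wording are hypothetical, and `notify` stands in for whatever channel a team actually uses.

```python
RESPONSE_STEPS = [
    "Detect a Problem",
    "Begin to take Action",
    "Coordinate Troubleshooting/Diagnosis",
    "Confirm Stability, Resolving Steps",
]
AUDIENCES = ("Support", "Community", "Executives")


def run_response(steps=RESPONSE_STEPS, notify=print):
    """Walk the response steps, communicating to every audience after each."""
    log = []
    for step in steps:
        log.append(("step", step))
        for audience in AUDIENCES:
            # Placeholder message; real updates would carry incident details.
            notify(f"[{audience}] status update after: {step}")
            log.append(("communicate", audience))
    return log
```

Returning the log makes it easy to check afterwards that no step was taken without a matching round of communication.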
![Page 37: Normal accidents and outpatient surgeries](https://reader033.fdocuments.in/reader033/viewer/2022061206/5482ef50b47959fb0c8b494d/html5/thumbnails/37.jpg)
Handling Incident Response - MTTR
Rule # 24: Assign people to be point people for every bit of technology.
Rule # 25: Assign Backup People to those People.
Rule # 12: Know your bottlenecks, and how to spot them.
Rule # 42: Create gigantic poster size drawings of the physical layouts of your data center.
Rule # 43: Create gigantic poster size drawings of the logical flows of each part of your product.
![Page 38: Normal accidents and outpatient surgeries](https://reader033.fdocuments.in/reader033/viewer/2022061206/5482ef50b47959fb0c8b494d/html5/thumbnails/38.jpg)
XKCD #974:
I find that when someone is taking time to do something right in the present, they're a perfectionist with no ability to prioritize, whereas when someone took time to do something right in the past, they're a master artisan of great foresight.