There is No Spoon - Ran Leibman, Facebook - DevOpsDays Tel Aviv 2016
-
Production Engineering (PE) There Is No Spoon
Ran Leibman, Production Engineer
-
Agenda
1. How Production Engineering was formed in Facebook
2. What we do in Onavo
3. How we moved to the PE model in Onavo
4. Q & A
-
Facebook - Pre PE Days
-
SRO - Site Reliability Operations
1. Keep the site up 24/7
2. Follow the sun
3. Capacity plans
-
Why was SRO not enough?
-
What are the alternatives?
-
NOC
-
The Production Engineering Model
1. PEs are embedded within the software engineering teams
2. Taking part in meetings
3. Involved in roadmap plans
4. Reviewing diffs
5. Oncall - Software & Production Engineers
-
Onavo - Adopting The PE Model
-
Save, Measure & Protect your mobile data
Protect user traffic using IPsec
Protect against malicious sites
Compress user traffic
Control data leakage
-
Onavo - a bit of context
1. Founded in 2010
2. Classic startup Dev & Ops teams
   1. Dev - writes code
   2. Ops - keeps the infra up & running
3. Acquired by Facebook in 2013
-
Making The Change - Step By Step
-
Step 1 - Go Sit Close To/Next To The Developers
-
Step 2 - Get The Colleagues Onboard
-
Step 3 - Get Your Tooling Ready
-
Have Good (Short) Documentation
you don't want that Confused Travolta moment
Document your alerts
Links to dashboards
Links to third party software docs
Runbooks - how to debug in prod: log files, how to restart the service, getting stack traces & metrics
Links to config management
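The documentation checklist on this slide could be turned into a small completeness check. What follows is only a minimal sketch, not something shown in the talk: the section names, the `ServiceDoc` type and the `example-proxy` service are all hypothetical, chosen to mirror the slide's items.

```python
from dataclasses import dataclass, field

# Hypothetical section names, one per item on the slide's checklist.
REQUIRED_SECTIONS = (
    "alert_description",
    "dashboard_links",
    "third_party_docs",
    "log_files",
    "restart_steps",
    "stack_traces_and_metrics",
    "config_management_links",
)


@dataclass
class ServiceDoc:
    """A short per-service document; every field name here is illustrative."""
    service: str
    sections: dict = field(default_factory=dict)  # section name -> text or links


def missing_sections(doc: ServiceDoc) -> list:
    """Return the checklist sections that are absent or empty, in checklist order."""
    return [name for name in REQUIRED_SECTIONS if not doc.sections.get(name)]


doc = ServiceDoc(
    service="example-proxy",  # hypothetical service name
    sections={
        "alert_description": "Fires when the error rate stays high for 5 minutes.",
        "dashboard_links": ["https://dashboards.example.com/example-proxy"],
        "restart_steps": "Restart through the service manager.",
    },
)
print(missing_sections(doc))
```

A check like this could run whenever a service doc changes, so gaps surface before a developer's first oncall shift rather than during it.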
-
Dev Friendly Systems
-
Simple And Indicative Dashboards
avoid the graph porn
1. Match the product KPIs
2. Strong signal
3. Intuitive titles
4. Easy to spot anomalies
5. Easy to find correlations
-
Step 4 - Review Your Alerts
-
Refactor Your Alerts as Needed
rm -rf /all/false/alarms*
The first challenge is to make sure alerts are handled. To make that possible, every alert should:
Indicate a real problem
Be clear to understand - informative
Be impactful
Be actionable
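One way to apply the four criteria above during an alert review is to record a few answers per alert and list which criteria it fails. The sketch below is an illustration under assumptions, not the speaker's tooling: the `AlertReview` fields, the example alert names, and the 50% threshold for "real problem" are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class AlertReview:
    """Answers collected while reviewing one alert; all fields are illustrative."""
    name: str
    fired: int           # times the alert fired in the review window
    real_problems: int   # firings that pointed at a real problem
    description: str     # what the alert means, in plain words
    user_impact: str     # what breaks for users if it is ignored
    action: str          # first thing the oncall should do


def review(alert: AlertReview, min_real_ratio: float = 0.5) -> list:
    """Return the criteria the alert fails; an empty list means keep it as is."""
    failures = []
    # An alert that never fired is not treated as a false alarm.
    if alert.fired and alert.real_problems / alert.fired < min_real_ratio:
        failures.append("indicates a real problem")
    if not alert.description.strip():
        failures.append("clear to understand")
    if not alert.user_impact.strip():
        failures.append("impactful")
    if not alert.action.strip():
        failures.append("actionable")
    return failures


noisy = AlertReview("disk_70_percent", fired=40, real_problems=2,
                    description="Disk usage above 70%", user_impact="", action="")
solid = AlertReview("handshake_failures", fired=3, real_problems=3,
                    description="Clients cannot establish a tunnel",
                    user_impact="Users lose connectivity",
                    action="Check the gateway logs, then restart the service")

print(review(noisy))
print(review(solid))
```

Alerts that fail several criteria are candidates for deletion or a rewrite; alerts that fail none can stay as they are.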
-
Step 5 - Train The Team - Get Them Ready
-
Train The Team
learning is easy - remembering is hard
Wiki / Doc based makes it easier to remember
Hands-on, Hands-on, Hands-on
Pre create task pool (even if low impact)
Give oncall use cases & examples
Reusable
-
Step 6 - Oncall + Hand Holding
-
Shared Oncall
make yourself available and adjust as you go
Short oncall cycles, 1-2 days
Increase the period each cycle
Oncall Summaries
Do oncall as well - set an example
Preemptively check status with the current oncall
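The "short cycles, then increase the period each cycle" idea can be sketched as a schedule generator. This is a hedged illustration only: the slide gives 1-2 days as the starting length, while growing by one day per cycle and capping at seven days are assumptions made for the example, as are the engineer names and the start date.

```python
from datetime import date, timedelta


def ramp_up_schedule(engineers, start, cycles, first_shift_days=1, max_shift_days=7):
    """Return a list of (engineer, first_day, last_day) shifts.

    One cycle is one full pass over the team. Shifts grow by one day per
    cycle (an assumption) until they reach max_shift_days.
    """
    shifts = []
    day = start
    shift_days = first_shift_days
    for _ in range(cycles):
        for engineer in engineers:
            last_day = day + timedelta(days=shift_days - 1)
            shifts.append((engineer, day, last_day))
            day = last_day + timedelta(days=1)
        shift_days = min(shift_days + 1, max_shift_days)
    return shifts


# The production engineer ("pe") takes shifts too, to set an example.
shifts = ramp_up_schedule(["dev_a", "dev_b", "pe"], date(2016, 1, 4), cycles=2)
for engineer, first_day, last_day in shifts:
    print(engineer, first_day.isoformat(), last_day.isoformat())
```

With these inputs the first pass gives everyone a one-day shift and the second pass gives two-day shifts, so nobody's first oncall experience is a full week.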
-
The Steps
Step 1 - Go Sit Close To/Next To The Developers
Step 2 - Get The Colleagues Onboard
Step 3 - Get Your Tooling Ready
Step 4 - Review Your Alerts
Step 5 - Train The Team
Step 6 - Oncall + Hand Holding
-
Questions?
Ran Leibman, Production Engineer