Production Engineering (PE): There Is No Spoon
Ran Leibman, Production Engineer
Agenda
1. How Production Engineering was formed at Facebook
2. What we do at Onavo
3. How we moved to the PE model at Onavo
4. Q & A
Facebook - Pre PE Days
SRO - Site Reliability Operations
1. Keep the site up 24/7
2. Follow the sun
3. Capacity plans
Why was SRO not enough?
What are the alternatives?
NOC
The Production Engineering Model
1. PEs are embedded within the software engineering teams
2. Taking part in meetings
3. Involved in roadmap plans
4. Reviewing diffs
5. Oncall - Software & Production Engineers
Onavo - Adopting The PE Model
Save, Measure & Protect your mobile data
Protect user traffic using IPsec
Protect against malicious sites
Compress user traffic
Control data leakage
Onavo: a bit of context
1. Founded in 2010
2. Classic startup Dev & Ops teams
   1. Dev - writes code
   2. Ops - keeps the infra up & running
3. Acquired by Facebook in 2013
Making The Change - Step By Step
Step 1 - Go Sit Close To/Next To The Developers
Step 2 - Get The Colleagues Onboard
Step 3 - Get Your Tooling Ready
Have Good (Short) Documentation: you don't want that Confused Travolta moment
Document your alerts
Links to dashboards
Links to third party software docs
Runbooks - how to debug in prod: log files, how to restart the service, getting stack traces & metrics
Links to config management
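One way to keep this documentation from drifting is to treat it as data and check it automatically. The following is a minimal sketch, not something shown in the talk: the `AlertDoc` record, its field names, and the example URLs are all hypothetical.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class AlertDoc:
    """Hypothetical record of the documentation attached to one alert."""
    name: str
    description: Optional[str] = None    # what the alert means
    dashboard_url: Optional[str] = None  # link to the relevant dashboard
    runbook_url: Optional[str] = None    # how to debug in prod
    config_url: Optional[str] = None     # link to config management


def missing_docs(alert: AlertDoc) -> List[str]:
    """Return the names of the documentation fields that are empty."""
    return [
        f.name
        for f in fields(alert)
        if f.name != "name" and not getattr(alert, f.name)
    ]


# Illustrative data only; the alert names and URLs are made up.
alerts = [
    AlertDoc(
        name="proxy_error_rate_high",
        description="More than 1% of proxy requests fail",
        dashboard_url="https://dashboards.example.com/proxy",
        runbook_url="https://wiki.example.com/runbooks/proxy",
        config_url="https://config.example.com/proxy",
    ),
    AlertDoc(name="disk_almost_full", description="Disk usage above 90%"),
]

for alert in alerts:
    gaps = missing_docs(alert)
    if gaps:
        print(f"{alert.name}: missing {', '.join(gaps)}")
```

A check like this could run whenever an alert is added or changed, so an alert without a runbook or dashboard link is caught before it pages anyone.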
Dev Friendly Systems
Simple And Indicative Dashboards: avoid the graph porn
1. Match the product KPIs
2. Strong signal
3. Intuitive titles
4. Easy to spot anomalies
5. Easy to find correlations
Step 4 - Review Your Alerts
Refactor Your Alerts as Needed: rm -rf /all/false/alarms*
The first challenge is to make sure alerts are handled. To make that possible, every alert should:
Indicate a real problem
Be clear to understand and informative
Be impactful
Be actionable
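An alert review can start from data: how often did each alert fire without anyone needing to act? The sketch below assumes a hypothetical alert history of `(alert name, acted on)` pairs and an arbitrary 50% threshold. Neither comes from the talk.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

# Hypothetical history entry: (alert name, whether the oncall had to act on it).
AlertEvent = Tuple[str, bool]


def false_alarm_ratio(history: Iterable[AlertEvent]) -> Dict[str, float]:
    """Fraction of firings, per alert, that required no action."""
    fired: Counter = Counter()
    ignored: Counter = Counter()
    for name, acted_on in history:
        fired[name] += 1
        if not acted_on:
            ignored[name] += 1
    return {name: ignored[name] / fired[name] for name in fired}


def refactor_candidates(
    history: Iterable[AlertEvent], threshold: float = 0.5
) -> List[str]:
    """Alerts whose false alarm ratio is at or above the (assumed) threshold."""
    ratios = false_alarm_ratio(history)
    return sorted(name for name, ratio in ratios.items() if ratio >= threshold)


# Illustrative data only.
history = [
    ("cpu_high", False),
    ("cpu_high", False),
    ("cpu_high", True),
    ("vpn_tunnel_down", True),
    ("vpn_tunnel_down", True),
    ("disk_almost_full", False),
]

print(refactor_candidates(history))  # ['cpu_high', 'disk_almost_full']
```

Alerts on the resulting list are candidates for tuning, rewriting, or deletion. The ratio only measures whether an alert was acted on, so clarity and impact still need a human review.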
Step 5 - Train The Team - Get Them Ready
Train The Team: learning is easy, remembering is hard
Wiki / doc based - makes it easier to remember
Hands-on, hands-on, hands-on
Pre-create a task pool (even if low impact)
Give oncall use cases & examples
Reusable
Step 6 - Oncall + Hand Holding
Shared Oncall: make yourself available and adjust as you go
Short oncall cycles, 1-2 days
Increase the period each cycle
Oncall summaries
Do oncall as well - set an example
Preemptively check status with the current oncall
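The idea of short shifts that grow each cycle can be sketched as a small schedule generator. This is an illustration under stated assumptions, not the schedule used at Onavo: the one-day increment per cycle, the `max_shift_days` cap, and the engineer names are all hypothetical.

```python
from datetime import date, timedelta
from typing import List, Sequence, Tuple

# (engineer, first day, last day), both days inclusive.
Shift = Tuple[str, date, date]


def build_rotation(
    engineers: Sequence[str],
    start: date,
    cycles: int,
    first_shift_days: int = 1,
    max_shift_days: int = 7,  # assumed cap, not from the talk
) -> List[Shift]:
    """Build a rotation where the shift length grows by one day per cycle."""
    shifts: List[Shift] = []
    day = start
    for cycle in range(cycles):
        length = min(first_shift_days + cycle, max_shift_days)
        for engineer in engineers:
            end = day + timedelta(days=length - 1)
            shifts.append((engineer, day, end))
            day = end + timedelta(days=1)
    return shifts


# The production engineer ("pe") takes shifts too, to set an example.
rotation = build_rotation(["dev_a", "dev_b", "pe"], date(2024, 1, 1), cycles=3)

for engineer, first, last in rotation:
    print(f"{engineer}: {first} to {last}")
```

With three engineers and three cycles this yields one-day shifts first, then two-day shifts, then three-day shifts, so each engineer's oncall period grows gradually.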
The Steps
Step 1 - Go Sit Close To/Next To The Developers
Step 2 - Get The Colleagues Onboard
Step 3 - Get Your Tooling Ready
Step 4 - Review Your Alerts
Step 5 - Train The Team
Step 6 - Oncall + Hand Holding
Questions?
Ran Leibman, Production Engineer