DevOpsDays Austin: Helping Horses Become Unicorns, Chef's Operations Maturity Model
Transcript of DevOpsDays Austin: Helping Horses Become Unicorns, Chef's Operations Maturity Model
Chef’s Operations Maturity Model: Helping Horses Become Unicorns
Matt Ray
DevOpsDays Austin
May 5, 2014
Introductions
• Matt Ray
• Director Partner Integration at Chef
• mattray GitHub|IRC|Twitter
“There’s nothing horses hate more than hearing stories about unicorns.”
John Arbuckle, Chief Architect at GE Capital
"Hunting the DevOps Whale in Large Enterprises", ChefConf 2014
https://keepinghouseandhorse.files.wordpress.com/2013/10/photoshop3.jpeg
But… Enterprise
• Our applications are too complex
• Politics get in the way
• We’ve always done it this way
It’s Not Magic
• Not everyone requires Continuous Delivery
• They require:
• Higher reliability
• Greater visibility
• More resilience
• Faster response
https://img0.etsystatic.com/000/0/5209298/il_fullxfull.282855902.jpg
How Do We Get There?
The Map is not the Territory
• Comparative study of Operational Maturity Models
• On one end: ad-hoc, slow to respond, “traditional” approach
• At the other: very fast, fully automated, and disaster indifferent
• Figure out what is most important to your Organization
https://www.chimacumtack.com/images/measurehorse.jpg
Fitting the Model
• Varying degrees of adoption
• Operational trends often correlated and relational, but not definitive
• Roadmap for improving time to deployment and lowering time to recovery
• Understand the challenges, set real expectations for progress
http://www.web3dservice.com/3d_models/images/unicorn_3d_model_03.jpg
Roadmap Considerations
• Hardware Management
• OS Management
• Infrastructure Management
• Software Deployments
• Incident Management
• Disaster Recovery
http://cultofunicorn.com/wp-content/uploads/2013/05/Unicorn_horse.jpg
Hardware Management
Every Server is Sacred!
• HA Support expected across the entire stack
• Dependence on vendor/on-site SE for replacement/maintenance
• “This is the best hardware money can buy!”
• Architecture Review and Request Forms for all changes
• “Tier 1” data centers
• Every project special snowflake
1 SysAdmin to 25-250 systems?
Automate Common Tasks
Maybe not ALL servers are sacred…
• Start using some farms of standardized machines
• Fewer support contracts, less dependence on vendor/on-site support
• Architecture Reviews for new services with some implementation standardization
• HA support across most of the stack
• Probably still using “Tier 1” data centers with excess redundancy
1 Systems Engineer to 250-500 systems
Configuration Management
Most of these servers aren’t sacred?
• Limited support on ALL systems
• On-site support used sparingly, lower-skill onsite staff for “normal” failures
• Architecture Reviews only manage exceptions. Automated requests may be exposed via emerging APIs
• Wide adoption of virtualization: server instances are commoditized
• Hardware becoming standardized and easy to replace
• Smaller, more efficient data centers.
• Limited redundancy with hot/hot/hot N+1/N HA strategies
Application Management
1 Systems Engineer to 500-1000 Systems
None of the servers are sacred
• Infrastructure as a Service
• Hardware (if any) is fully commoditized
• Hardware is completely standardized, special cases are regarded as a risk to business
• Redundant Array of Inexpensive Data centers
1 Site Reliability Engineer to 1000+ Systems
Continuous Delivery
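The staffing ratios quoted across the hardware slides can be read as rough bands. A minimal Python sketch, assuming the overlapping endpoints (250, 500, 1000) belong to the higher band and that ratios under 25 fall outside the model; the function name `hardware_stage` is hypothetical and not part of the talk:

```python
from typing import Optional

# Bands are taken from the slides; how the shared endpoints are resolved
# is an assumption made for this sketch.
BANDS = [
    (1000, "None of the servers are sacred"),
    (500, "Most of these servers aren't sacred?"),
    (250, "Maybe not ALL servers are sacred..."),
    (25, "Every Server is Sacred!"),
]


def hardware_stage(systems: int, engineers: int) -> Optional[str]:
    """Map a systems-per-engineer ratio to the slide title of its band."""
    if engineers <= 0:
        raise ValueError("engineers must be positive")
    ratio = systems / engineers
    for lower_bound, title in BANDS:
        if ratio >= lower_bound:
            return title
    return None  # below 25 systems per engineer: no band on the slides
```

For example, 3000 systems run by 2 engineers gives a ratio of 1500, which lands in the last band.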
Operating System Management
Operating Systems Management
• Many OS flavors and versions. Manual, irregular patching
• Limited flavors and versions, planned upgrades. “Patch Tuesday!”
• Standard versions using JEOS with regular upgrades. Automated patching.
• Internally maintained versions, constant upgrades
http://www.smallwebs.com/Swords/images/UK1796HC2d/SCOTLANDFOREVER2.jpg
Incident Management
Incident Threshold: Recovery Time
• Which teams have regular on call responsibilities?
• What is expected of someone on call?
• How are people notified & engaged on an incident?
Incident Threshold: Recovery Time
• "Something is wrong!" 12+ hours
• "Something is wrong with the…!" 1-12 hours
• "Something went wrong with your deployment!" <60 minutes
• "The core infrastructure fabric is down!" seconds - 10 minutes
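The recovery-time thresholds above can be sketched as a small classifier. The stage numbers 1 to 4 and the function name `recovery_stage` are assumptions made for illustration; the slides list only the time ranges. Because "<60 minutes" and "seconds - 10 minutes" overlap on the slide, this sketch gives anything at or under 10 minutes to the fastest stage:

```python
from datetime import timedelta


def recovery_stage(recovery_time: timedelta) -> tuple:
    """Return (stage, slide label) for an observed recovery time.

    Stage 1 is the slowest range and stage 4 the fastest; the numbering
    is hypothetical and not taken from the talk.
    """
    if recovery_time <= timedelta(minutes=10):
        return 4, "seconds - 10 minutes"
    if recovery_time < timedelta(minutes=60):
        return 3, "<60 minutes"
    if recovery_time < timedelta(hours=12):
        return 2, "1-12 hours"
    return 1, "12+ hours"
```

An outage that took 45 minutes to resolve would be `recovery_stage(timedelta(minutes=45))`, which falls in the "<60 minutes" range.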
Postmortems
http://photography.nationalgeographic.com/photography/photo-of-the-day/
Postmortems
• Postmortem Focus
• Root Cause Orientation
• Root Cause Mitigation/Resolution
• Root Cause Elimination Rate
http://img3.wikia.nocookie.net/__cb20111008164412/mlpfanart/images/thumb/b/b2/Twilight_Sparkle_Angry_by_Ivan-Chan.png/597px-Twilight_Sparkle_Angry_by_Ivan-Chan.png
Postmortems: Ad Hoc
• "Human Error": blame finding & punishment
• "Triggering Event": blaming specific operator error or specific hardware failures
• Cycle between protecting heroes and then firing them
• <10% - Mostly break fix detection
Postmortems: Formal
• Focus on "Triggering Event" or "Human Error", but blaming process and/or infrastructure
• "Let's implement more process and overhead"
• 10% within 3 months - mostly simple fixes
• Tracking but little progress against goals vs. other priorities, frequent recurrence
Postmortems: Officially "Blame Free"
• Primary focus on underlying technical root causes, systemic fixes
• Improved tooling, programmatic checks, operator tools for special cases. Some focus on building resiliency
• 20% - Easily fixable issues eliminated within 3 months, programs to eliminate larger issues over time
Postmortems: “5 Whys”
• Including business and cultural issues
• Primary focus on insights and opportunities from lessons learned
• Increased resiliency and appropriate operator tools, focus on self-healing fixes
• Recurrence becomes infrequent and is a big deal
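The "Root Cause Elimination Rate" figures on the postmortem slides (<10%, 10%, 20% within 3 months) imply a measurable ratio. A minimal sketch of one way to compute it, assuming each root cause is recorded as an (identified, eliminated) pair of dates and that "3 months" is approximated as 90 days; the record shape and the name `elimination_rate` are hypothetical:

```python
from datetime import date, timedelta


def elimination_rate(root_causes, window=timedelta(days=90)):
    """Fraction of root causes eliminated within `window` of being identified.

    root_causes: list of (identified, eliminated) date pairs, where
    eliminated is None for a cause that is still open.
    """
    if not root_causes:
        return 0.0
    fixed = sum(
        1
        for identified, eliminated in root_causes
        if eliminated is not None and eliminated - identified <= window
    )
    return fixed / len(root_causes)
```

With five recorded root causes of which one was eliminated inside the window, the rate is 0.2, matching the 20% figure quoted for the "Blame Free" stage.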
Navigating the Change
• Many more mile markers
• Roadmap to improve your
• Mean Time To Production
• Mean Time to Recovery
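Both metrics are averages over durations, so a roadmap can track them with the same calculation. A minimal sketch, assuming each event is recorded as a (start, end) pair of datetimes: for Mean Time to Recovery that might be incident start and service restored, for Mean Time To Production it might be change committed and change deployed. Those definitions and the name `mean_duration` are assumptions, not taken from the talk:

```python
from datetime import datetime, timedelta


def mean_duration(events):
    """Average elapsed time over a list of (start, end) datetime pairs."""
    if not events:
        raise ValueError("need at least one event")
    # Sum the per-event durations, starting from a zero timedelta.
    total = sum((end - start for start, end in events), timedelta())
    return total / len(events)
```

Two incidents lasting 30 and 90 minutes give a mean time to recovery of one hour.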
Becoming a Unicorn is Possible
• Approach the challenges with realistic expectations for your organization
• Always room for improvement
• Culture trumps everything
http://webecoist.momtastic.com/wp-content/uploads/2010/09/unicorns_3x.jpg
Where Can I Download It?
bit.ly/Chef-OMM
Thanks!
Matt Ray [email protected] @mattray
Thanks to George Miranda, Paul Edelhertz & Jesse Robbins