DevOpsDays Austin: Helping Horses Become Unicorns, Chef's Operations Maturity Model

Post on 27-Aug-2014

Helping customers evaluate their ability to deploy and operate systems while managing incidents is key to our Consulting practice. We have developed an operations maturity model that provides a roadmap for understanding and improving mean time to production while setting realistic expectations. This session will explain the challenges and thresholds for becoming a more effective organization.

Transcript of DevOpsDays Austin: Helping Horses Become Unicorns, Chef's Operations Maturity Model

Chef’s Operations Maturity Model: Helping Horses Become Unicorns

Matt Ray, DevOpsDays Austin, May 5, 2014

Introductions

• Matt Ray

• Director Partner Integration at Chef

• matt@getchef.com

• mattray GitHub|IRC|Twitter

“There’s nothing horses hate more than hearing stories about unicorns.”

John Arbuckle, Chief Architect at GE Capital, "Hunting the DevOps Whale in Large Enterprises", ChefConf 2014

http://pichost.me/1468004/

DevOps Unicorns

• Etsy

• Facebook

• Netflix

https://keepinghouseandhorse.files.wordpress.com/2013/10/photoshop3.jpeg

But… Enterprise

• Our applications are too complex

• Politics get in the way

• We’ve always done it this way

It’s Not Magic

• Not everyone requires Continuous Delivery

• They require:

• Higher reliability

• Greater visibility

• More resilience

• Faster response

https://img0.etsystatic.com/000/0/5209298/il_fullxfull.282855902.jpg

How Do We Get There?

The Map is not the Territory

• Comparative study of Operational Maturity Models

• On one end: ad-hoc, slow to respond, “traditional” approach

• At the other: very fast, fully automated, and disaster indifferent

• Figure out what is most important to your Organization

https://www.chimacumtack.com/images/measurehorse.jpg

Fitting the Model

• Varying degrees of adoption

• Operational trends often correlated and relational, but not definitive

• Roadmap for improving time to deployment and lowering time to recovery

• Understand the challenges, set real expectations for progress

http://www.web3dservice.com/3d_models/images/unicorn_3d_model_03.jpg

Roadmap Considerations

• Hardware Management

• OS Management

• Infrastructure Management

• Software Deployments

• Incident Management

• Disaster Recovery

http://cultofunicorn.com/wp-content/uploads/2013/05/Unicorn_horse.jpg

Hardware Management

Every Server is Sacred!

• HA Support expected across the entire stack

• Dependence on vendor/on-site SE for replacement/maintenance

• “This is the best hardware money can buy!”

• Architecture Review and Request Forms for all changes

• “Tier 1” data centers

• Every project special snowflake

1 SysAdmin to 25-250 systems?

Automate Common Tasks

Maybe not ALL servers are sacred…

• Start using some farms of standardized machines

• Fewer support contracts, less dependence on vendor/on-site support

• Architecture Reviews for new services with some implementation standardization

• HA support across most of the stack

• Probably still using “Tier 1” data centers with excess redundancy

1 Systems Engineer to 250-500 systems

Configuration Management

Most of these servers aren’t sacred?

• Limited support on ALL systems

• On-site support used sparingly, lower-skill onsite staff for “normal” failures

• Architecture Reviews only manage exceptions. Automated requests may be exposed via emerging APIs

• Wide adoption of virtualization: server instances are commoditized

• Hardware becoming standardized and easy to replace

• Smaller, more efficient data centers.

• Limited redundancy with hot/hot/hot N+1/N HA strategies

Application Management

1 Systems Engineer to 500-1000 Systems

None of the servers are sacred

• Infrastructure as a Service

• Hardware (if any) is fully commoditized

• Hardware is completely standardized, special cases are regarded as a risk to business

• Redundant Array of Inexpensive Data centers

1 Site Reliability Engineer to 1000+ Systems

Continuous Delivery
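The systems-per-engineer ratios quoted on the four hardware slides can be read as a rough lookup table. The sketch below is only an illustration of that reading: the function name `hardware_stage` is hypothetical, the stage labels are paraphrased from the slide titles, and the handling of boundary values (250, 500 and 1000 fall into the higher stage, anything under 25 falls into the first) is an assumption, since the slides list overlapping ranges.

```python
# Hypothetical helper: maps a systems-per-engineer ratio to the stage
# names used on the hardware slides. Boundary handling is an assumption.
STAGES = [
    (250, "Every server is sacred"),                # slide: 25-250 systems
    (500, "Maybe not all servers are sacred"),      # slide: 250-500 systems
    (1000, "Most of these servers aren't sacred"),  # slide: 500-1000 systems
]


def hardware_stage(systems_per_engineer: float) -> str:
    """Return the hardware-management stage for a given ratio."""
    if systems_per_engineer < 0:
        raise ValueError("ratio must be non-negative")
    for upper_bound, stage in STAGES:
        # Each upper bound is exclusive, so 250 lands in the second stage.
        if systems_per_engineer < upper_bound:
            return stage
    # slide: 1000+ systems
    return "None of the servers are sacred"


if __name__ == "__main__":
    print(hardware_stage(120))
    print(hardware_stage(750))
```

As the talk stresses, the ratio is a mile marker and not a verdict; an organization can sit at different stages for different parts of its estate.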

Operating System Management

Operating Systems Management

• Many OS flavors and versions. Manual, irregular patching

• Limited flavors and versions, planned upgrades. “Patch Tuesday!”

• Standard versions using JEOS with regular upgrades. Automated patching.

• Internally maintained versions, constant upgrades

http://www.smallwebs.com/Swords/images/UK1796HC2d/SCOTLANDFOREVER2.jpg

Incident Management

Incident Threshold: Recovery Time

• Which teams have regular on call responsibilities?

• What is expected of someone on call?

• How are people notified & engaged on an incident?

Incident Threshold: Recovery Time

• "Something is wrong!" 12+ hours

• "Something is wrong with the…!" 1-12 hours

• "Something went wrong with your deployment!" <60 minutes

• "The core infrastructure fabric is down!" seconds - 10 minutes
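The recovery-time thresholds above can be expressed as a simple classifier. This is a minimal sketch under stated assumptions: the slides do not number the bands, so the levels 1 to 4 and the name `recovery_band` are hypothetical, and because the listed ranges overlap, the boundary choices (10 minutes or less is the top band, exactly 60 minutes falls into the 1-12 hour band, exactly 12 hours falls into the slowest band) are also assumptions. It looks only at elapsed time, not at the kind of incident each slide bullet describes.

```python
from datetime import timedelta


def recovery_band(recovery_time: timedelta) -> int:
    """Map a recovery time to an assumed maturity band (1 = slowest)."""
    if recovery_time < timedelta(0):
        raise ValueError("recovery time must be non-negative")
    if recovery_time <= timedelta(minutes=10):
        return 4  # slide: seconds - 10 minutes
    if recovery_time < timedelta(minutes=60):
        return 3  # slide: <60 minutes
    if recovery_time < timedelta(hours=12):
        return 2  # slide: 1-12 hours
    return 1      # slide: 12+ hours


if __name__ == "__main__":
    print(recovery_band(timedelta(minutes=45)))
```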

Postmortems

http://photography.nationalgeographic.com/photography/photo-of-the-day/

Postmortems

• Postmortem Focus

• Root Cause Orientation

• Root Cause Mitigation/Resolution

• Root Cause Elimination Rate

http://img3.wikia.nocookie.net/__cb20111008164412/mlpfanart/images/thumb/b/b2/Twilight_Sparkle_Angry_by_Ivan-Chan.png/597px-Twilight_Sparkle_Angry_by_Ivan-Chan.png

Postmortems: Ad Hoc

• "Human Error": blame finding & punishment

• "Triggering Event": blaming specific operator error or specific hardware failures

• Cycle between protecting heroes and then firing them

• <10% - Mostly break fix detection

Postmortems: Formal

• Focus on "Triggering Event" or "Human Error", but blaming process and/or infrastructure

• "Let's implement more process and overhead"

• 10% within 3 months - mostly simple fixes

• Tracking but little progress against goals vs. other priorities, frequent recurrence

Postmortems: Officially "Blame Free"

• Primary focus on underlying technical root causes, systemic fixes

• Improved tooling, programmatic checks, operator tools for special cases. Some focus on building resiliency

• 20% - Easily fixable issues eliminated within 3 months, programs to eliminate larger issues over time

Postmortems: “5 Whys”

• Including business and cultural issues

• Primary focus on insights and opportunities from lessons learned

• Increased resiliency and appropriate operator tools, focus on self-healing fixes

• Recurrence becomes infrequent and is a big deal

Navigating the Change

• Many more mile markers

• Roadmap to improve your

• Mean Time To Production

• Mean Time to Recovery
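To track progress against the roadmap, Mean Time to Recovery can be computed as the average of per-incident recovery durations. The sketch below assumes each incident is recorded as a (detected, resolved) pair of timestamps; that record shape and the name `mean_time_to_recovery` are hypothetical, not part of the model. The same averaging would apply to Mean Time To Production if the pairs were (committed, deployed) timestamps.

```python
from datetime import datetime, timedelta


def mean_time_to_recovery(
    incidents: list[tuple[datetime, datetime]],
) -> timedelta:
    """Average the (resolved - detected) duration over all incidents."""
    if not incidents:
        raise ValueError("at least one incident is required")
    durations = [resolved - detected for detected, resolved in incidents]
    # timedelta supports addition and division by an int.
    return sum(durations, timedelta(0)) / len(durations)


if __name__ == "__main__":
    sample = [
        (datetime(2014, 5, 1, 9, 0), datetime(2014, 5, 1, 9, 30)),
        (datetime(2014, 5, 2, 14, 0), datetime(2014, 5, 2, 15, 30)),
    ]
    print(mean_time_to_recovery(sample))
```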

Becoming a Unicorn is Possible

• Approach the challenges with realistic expectations for your organization

• Always room for improvement

• Culture trumps everything

http://webecoist.momtastic.com/wp-content/uploads/2010/09/unicorns_3x.jpg

Where Can I Download It?

bit.ly/Chef-OMM

Thanks!

Matt Ray matt@getchef.com @mattray

Thanks to George Miranda, Paul Edelhertz & Jesse Robbins