TriAgile 2021
Despite Good Code, Production Failures - Why?
Kevin S. Green, IBM Service Management
Anthony J D’Angelo
Agenda
- What is CSMO?
- What is SRE?
- Diversify Agile teams with SRE
- Getting Started
- Q&A
IBM Cloud © 2018 IBM Corporation
What is CSMO?
Cloud Service Management and Operations (CSMO)
[Diagram: CSMO across Development and IT Operations, between Shift Left and Shift Right. Components: DevSecOps, Site Reliability Engineering, Environment Ops, Service Management (ITIL, IT4IT, ZeroOutage), AIOps. Callouts: NoOps?, SRE, aaS, EnvOps, ITIL?, Concept, Client.]
What is Site Reliability Engineering (SRE)?
• System Thinking
• Data-Driven Decisions
• Engineering Rigor
• Embracing Risk
• Eliminating Toil
• Technical Debt
• Simplicity
• Collaboration
• Shared Responsibility
• Trust & Transparency
• Ops to scale with load through Automation, but don’t stop at Automation
• Cap operational load: 50% time spent on toil - 50% on engineering projects (improvements)
• Excess Ops work overflows to the Dev Team, share 5% of Ops work with Dev Team
• Have an SLA / SLO for the service, measure against the SLA / SLO
• Error budget to control velocity. Effective self-regulation of features vs. stability
• Observability, including the four Golden Signals: Latency, Traffic (requests), Errors, Saturation
• Actionable symptom-based alerts, from the user perspective. (Automated) runbooks to govern actions.
• Blameless Post Mortem for every event
• Hire (only) developers; Common staffing pool for SRE and Dev
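As an illustration of the Golden Signals bullet, here is a minimal sketch that summarizes latency, traffic, errors, and saturation for one observation window. The `Request` record, the `golden_signals` function, the choice of the 95th percentile, and the use of CPU as the saturated resource are all assumptions made for this example, not something the slides prescribe.

```python
from dataclasses import dataclass
from statistics import quantiles


@dataclass
class Request:
    """One served request (hypothetical record shape for this sketch)."""
    duration_ms: float
    status: int


def golden_signals(requests, window_s, cpu_used, cpu_capacity):
    """Summarize the four golden signals for one observation window."""
    durations = [r.duration_ms for r in requests]
    errors = sum(1 for r in requests if r.status >= 500)
    return {
        # Latency: 95th percentile of request duration (assumed percentile).
        "latency_p95_ms": quantiles(durations, n=20)[-1],
        # Traffic: requests per second over the window.
        "traffic_rps": len(requests) / window_s,
        # Errors: share of requests that failed server-side.
        "error_rate": errors / len(requests),
        # Saturation: fraction of the assumed bottleneck resource in use.
        "saturation": cpu_used / cpu_capacity,
    }


# Made-up sample data: ten requests in a five-second window, one failure.
sample = [Request(duration_ms=10.0 * (i + 1), status=500 if i == 0 else 200)
          for i in range(10)]
print(golden_signals(sample, window_s=5, cpu_used=3, cpu_capacity=4))
```

A symptom-based alert would then be defined on these user-facing numbers (for example, error rate above a threshold) rather than on internal causes.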
[Figure labels: Monitoring, Incident Response, Post Mortem / RCA, Testing & Release Procedures, Capacity Planning, Development, Product. Side labels: response, analysis, preparation, design.]
“Fundamentally, it’s what happens when you ask a software engineer to design an operations function.”
Questionable Progress
Most clients focus on Development and DevOps to get code created and delivered to their customers quickly. Ops modernization has waned.
Diversify w/ SRE: Client Example
• Scenario: The client's DevOps teams achieved high velocity. Development/Software Engineers and Ops were working separately. Development was successfully modernizing its work.
• Outcome: The client's Agile development practices resulted in significant progress, but Ops struggled to keep pace. The client experienced regular monthly outages during peak demand.
  – Stability outages
The client implemented CSMO practices to achieve stability!
Diversify w/ SRE: Manifesto for Agile Software Development
We are uncovering better ways of developing software by doing it and helping others do it. Through this work we have come to value:
• Individuals and interactions over processes and tools
• Working software over comprehensive documentation
• Customer collaboration over contract negotiation
• Responding to change over following a plan
That is, while there is value in the items on the right, we value the items on the left more.
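SRE contributions to each Agile value:
• Individuals and interactions: SRE Roles
• Working software: NFRs, IM, B2M
• Customer collaboration: Service Levels
• Responding to change: Error Budget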
Diversify w/ SRE: Individuals and Interactions
• Application SREs – work closely with Application Development team.
• Platform SREs – focus on platforms such as cloud or other foundation infrastructure.
• Transformation SREs – drive the transformation of the organization to adopt SRE.
• Solution SREs – focus on products such as monitoring tools, CI/CD pipeline.
Diversify w/ SRE: Working Software
SREs use several practices to ensure that the target software is operational. While development focuses on functional requirements, SREs focus on non-functional requirements (NFRs) to ensure the software works. An example practice is Build to Manage (B2M).
Diversify w/ SRE: Customer Collaboration
How many 9’s do you need?

Uptime      Downtime per month   Downtime per year
99.999 %    0.4 min              5 min
99.99 %     4 min                52 min
99.9 %      43 min               8h 46m
99.5 %      3h 36m               1d 19h 48m
99 %        7h 12m               3d 15h 36m

[Figure labels: Well engineered software, Well engineered operations, Well engineered business, Well engineered infrastructure]
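The downtime figures follow directly from the uptime percentage. The sketch below reproduces them, assuming a 30-day month and a 365-day year (assumptions that match the rounded values on the slide); the function name `allowed_downtime_minutes` is hypothetical.

```python
def allowed_downtime_minutes(availability_pct, period_days):
    """Minutes of downtime permitted in a period at the given availability."""
    unavailable_fraction = (100 - availability_pct) / 100
    return unavailable_fraction * period_days * 24 * 60


# Assumed period lengths: 30-day month, 365-day year.
for pct in (99.999, 99.99, 99.9, 99.5, 99.0):
    per_month = allowed_downtime_minutes(pct, 30)
    per_year = allowed_downtime_minutes(pct, 365)
    print(f"{pct:7.3f} %  {per_month:8.1f} min/month  {per_year:9.1f} min/year")
```

Each additional nine cuts the permitted downtime by a factor of ten, which is why the target should be negotiated with the customer rather than maximized by default.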
Service levels are mechanisms SREs use to determine how important a service is to the customer. Each SRE-supported service has key measures, agreed through collaboration, that inform the team and the business of the level of resilience required. Key metrics include:
• Service Level Indicator (SLI) – a quantitative measure of service reliability.
• Service Level Objective (SLO) – a goal, the reliability target for a given SLI.
• Service Level Agreement (SLA) – the consequences for not meeting the SLO.
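To make the SLI/SLO distinction concrete, here is a minimal sketch that treats availability as the SLI (good events divided by total events) and checks it against an SLO target. The function names and the sample numbers are assumptions for illustration only.

```python
def availability_sli(good_events, total_events):
    """SLI: percentage of events that were served successfully."""
    return 100.0 * good_events / total_events


def meets_slo(sli_pct, slo_pct):
    """SLO check: is the measured SLI at or above the target?"""
    return sli_pct >= slo_pct


# Made-up sample: 999,200 good requests out of 1,000,000, with a 99.9 % SLO.
sli = availability_sli(good_events=999_200, total_events=1_000_000)
print(f"SLI = {sli:.2f} %, SLO met: {meets_slo(sli, slo_pct=99.9)}")
```

The SLA is not computed here: it is the contractual layer that states what happens when the SLO check fails.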
Diversify w/ SRE: Responding to change
[Figure: availability axis from the SLA (e.g. 99%) up to 100%; the gap between them is the Error Budget.]
• SLA at risk: No more releases this cycle!
• SLA overachieved: Be more aggressive in rolling out changes. Explore new things. OR, as a very advanced approach: force downtime to set realistic expectations on contracted availability (for instance, for internal services).
Error Budgets are utilized by SREs and Software Engineers to drive the velocity of changes released.
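A minimal sketch of how an error budget could gate releases, following the two cases on the slide (budget exhausted: stop releasing; budget largely unspent: release more aggressively). The function names and the 50 % threshold for "aggressive" are assumptions chosen for this example.

```python
def error_budget_remaining(slo_pct, good_events, total_events):
    """Fraction of the error budget still unspent (negative if overspent)."""
    allowed_failures = total_events * (100 - slo_pct) / 100
    actual_failures = total_events - good_events
    return 1 - actual_failures / allowed_failures


def release_decision(remaining, aggressive_threshold=0.5):
    """Map the remaining budget to a release policy (threshold is assumed)."""
    if remaining <= 0:
        return "freeze: no more releases this cycle"
    if remaining >= aggressive_threshold:
        return "release aggressively: explore new things"
    return "release normally"


# Made-up sample: 99 % target, 100,000 requests, 200 failures so far.
remaining = error_budget_remaining(slo_pct=99, good_events=99_800,
                                   total_events=100_000)
print(f"{remaining:.0%} of budget left -> {release_decision(remaining)}")
```

Because both Development and SRE read the same number, the budget acts as a shared, self-regulating control on feature velocity versus stability.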
Getting Started
CSMO Resources (including SRE Information)
• IBM Garage Architecture Center - https://www.ibm.com/cloud/architecture/architectures/serviceManagementArchitecture
• CSMO Field Guide - https://www.ibm.com/cloud/architecture/content/field-guide/csmo-field-guide/
• CSMO Ref Arch - https://www.ibm.com/cloud/architecture/architectures/serviceManagementArchitecture/referenceArchitecture
• CSMO Course - https://www.ibm.com/cloud/architecture/content/course/explore-csmo