Copyright © 2008 Siemens Medical Solutions USA, Inc. All rights reserved.

Say Goodbye to Post Mortems
Say Hello to Effective Problem Management

Charles T. Foy
Siemens Medical Solutions USA, Inc.
Health Services Division
[email protected]

Company: Siemens, AG

• Our division: healthcare software
• Our department: application hosting
  • Mainframe, mid-range, open systems, distributed systems
  • All operating systems (except Tandem)
• My role

Caution!

This company founded by former employees of International Business Machines (IBM)

Proclivity for acronyms is part of the culture.

Proclivity: “a natural or habitual inclination or tendency; propensity; predisposition”

You have been warned…

Agenda

• What drove creation of a Problem Management System?
• First steps
• Give it a name?
• Got Lucky!
• Build versus Buy
• It’s a Defect!
• What to track?
• Classifications?
• Database Structure
• The Process
• Trending
• Benefits

What drove creation of a Problem Management System?

• Disparate, inconsistent ‘post-mortems’
• Usually driven by customer demand for an explanation
• Needed a defined process
  • Consistent across the company
  • Communicates to the customer – internal and external

First Steps: Launch

Assigned to a small group
• Two service delivery managers
• One consultant (employee #26)
• Quality Assurance and Process Definition expert

No detailed marching orders other than “standard post-mortem process”

First Steps: Started with…

[Diagram: the post-mortem record grows from a standardized text document, to a document with a Root Cause section, to a document with Root Cause and Follow-up sections, and finally to a database with a Root Cause field, a Follow-up field, and the document.]

First Steps: Defined our own goal

Redefined project outcomes:
• reduces unscheduled outages
• increases availability
• communicates the root cause and preventive measures implemented to internal and external audiences

Has to:
• Drive to the root cause
• In a searchable manner, track:
  • outage details
  • root causes, corrective actions
  • customer communications
  • preventive measures
  • implementation status
  • etc.

First Steps: Give it a Name?

Needed a new name
• no longer a “Post Mortem” process
• “Post Mortem” didn’t sit well
• Before fully ITIL-aware

How about a working title for our project?
• Perhaps the Post Event Analysis Process, a.k.a. PEAP?
• Always change it later on

Never Happened!

Thus, PEAP was born!

And if the Post Event Analysis Process produces a Report,

it of course would be called….

First Steps: Post Mortem Report new name

The Post Event Analysis Report

Or

PEAR

Define the database and process

Database needs:
1. Description, short term resolution, root cause
2. Customers impacted, length of outage
3. Corrective actions implemented & their status
4. Etc.

Process:
1. Capture the root cause
2. Ensure the corrective action was implemented
3. Communicate all the above

Seemed straightforward, linear, one to one…

Next Steps – define the database requirements: We Got Lucky!

Ran into a friend…
• Provided us with an excellent service outage to use as our model
• Decided to use it as proof of concept

The outage:
• Slowdown affecting almost all his applications
• Response time dropped to zero within 5 minutes…
• Started looking like it was the Storage Area Network (SAN)
• Started looking for commonalities – network was suspect
• A Configuration Management Database (CMDB) would have helped!
• Problem cleared up, 45 minutes into the event

The Outage Incident

• Look up - Jake, SAN Technician
• Fixes the problem! Not!
• Battery Swap!
• 45 minutes ago, looks good!
• Here’s what happened…

Root cause:

Battery was going to go bad and was swapped out.
• So Hardware is the root cause

But wait… is it really a Hardware issue?
• Battery didn’t actually die… it was Jake, SAN Technician!
• Human Error!

But wait… is it really a “Human Error” issue?
• Jake doing his job
• OK, a… “Rules” issue – “always swap batteries off peak”

Root cause?

Aren’t these ‘contributing’ root causes?
• They didn’t know the battery was alerting
• SAN vendor knew
• SAN technician walked in and worked without their knowledge
• SAN technician education
• Data center employees education
• No battery swap rule/process

Root cause?

What would we put as our root cause?

Do we need to track all these ‘root’ causes?

Do we need to track the corrective actions for each?

Don’t most outages have multiple root causes?

Conclusion: MULTIPLE root causes

Multiple root causes, multiple follow-ups.

This would be complex.

Build a database?

• Designed requirements, got a resource time estimate
• Presented to upper management
• Anything on the shelf?

Tools and Methodology Manager: “Essentially, you’re tracking defects!”

• Hardware that breaks
• Software that breaks
• Humans that make errors…

Defect Tracking

• Company standard defect tracking application
• Fully implemented and operational
• Subject Matter Expert (SME)
• Does 90% of what you need
• Easy to implement
• What are your major defect categories?

Defect Tracking

To build this, you need Classifications….

What are your major defect areas?

How granular?

The Classifications

Asked our peers

Specific type of hardware

Specific type of software

Human error

The Classifications

How much detail?
• Major category (hardware)
• The thing that broke (server)
• Thing that caused it to break (bad power supply)
• Model that broke (Fleetwood XL340)

Human Error

Does that work for Human Error?

Example: Jeff mistyped a static route in a backup router. Primary router fails. Backup router kicks in but does not recover all the interfaces…

• Major category (human error)
• The thing that broke (typing)
• Thing that caused it to break (not enough sleep)
• Model that broke (Jeff)

Human Error?

• Do we really want to say “human error”?
• What does it mean to make a human error?
• Failure To Follow A Process? …FTFAP

Eureka! A five letter acronym!

Classifications

Euphemism at first, then… The “Process” category was born!
• Process Not Followed (a.k.a. Human Error)
• Process Incomplete
• Process Incorrect (covers the “need to change the Rule” root cause)
• Documentation wrong

More items to track

• Version and vendor of the software/hardware?
• Name of the Human?
• Impacted application(s)?
• Impacted customer(s)?
• O/S level? 3rd party software? Something we wrote?
• Was this tested before it was put into production?
• Did it happen before?
• What is the air-speed velocity of an unladen swallow?

Database Structure: Supports Multiple Levels of Classification

Global Keyword: allows for over-all groupings
1. Hardware
2. Software
3. Process

Keyword 1 answers “What broke?”
• Answer: Server

Keyword 2 answers “What thing within KW1 broke?”
• Answer: Power Supply

Keyword Grouping Samples: Hardware

Keyword 1    Keyword 2
Router       Chassis
             Memory
             NIC Card
             NPE
             Pwr Supply
Server       Cable
             CPU
             Hard Drive
             HBA
             Memory
             MthrBoard
             Pwr Supply

Keyword Grouping Samples: Software

Keyword 1        Keyword 2
Application A    Print Subsys
                 GSM
                 RSA
                 Service Pack
CICS             Configuration
                 Dayend Flow
                 MODS
                 PTF
Server           BIOS
                 Term Svcs
                 DHCP
                 Firewall
                 IIS
                 LDAP
                 Virus-Wm

Keyword Grouping Samples: Process

Keyword 1        Keyword 2
Process          Incomplete
Process          Incorrect
Process          Not Follow
Documentation    Incorrect
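
As a rough illustration of the multi-level classification, the sample keyword groupings could be held in a nested mapping and checked before a record is saved. This is a minimal Python sketch, not the schema of the defect-tracking tool the deck refers to. The names `KEYWORDS` and `is_valid_classification` are hypothetical, and only some of the Hardware and Software samples are included.

```python
# Hypothetical hierarchy: Global Keyword -> Keyword 1 -> allowed Keyword 2 values.
# The values are a subset of the sample groupings on the slides.
KEYWORDS = {
    "Hardware": {
        "Router": {"Chassis", "Memory", "NIC Card", "NPE", "Pwr Supply"},
        "Server": {"Cable", "CPU", "Hard Drive", "HBA", "Memory",
                   "MthrBoard", "Pwr Supply"},
    },
    "Software": {
        "CICS": {"Configuration", "Dayend Flow", "MODS", "PTF"},
    },
}


def is_valid_classification(global_kw, kw1, kw2=None):
    """Return True if the keyword path exists in the hierarchy.

    Keyword 2 is optional here (an assumption), so a record may stop
    at Keyword 1.
    """
    level1 = KEYWORDS.get(global_kw, {})
    if kw1 not in level1:
        return False
    return kw2 is None or kw2 in level1[kw1]


print(is_valid_classification("Hardware", "Server", "Pwr Supply"))   # True
print(is_valid_classification("Hardware", "Server", "Dayend Flow"))  # False
```

Keeping the allowed values in one place is what makes later trending possible: free-text categories would not group cleanly.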

Database Structure

Primary Root Cause

Contributing Root Cause 1

Contributing Root Cause 2

Contributing Root Cause 3

Contributing Root Cause 4

Database Structure: All root causes and keywords

Root causes:
• Battery Maintenance During Prime Time
• Flaky Battery
• SAN Technician working without their knowledge
• Vendor allowed in data center without escort
• No battery preventive maintenance schedule

Keywords assigned:
• Hardware, SAN, Battery
• Process, Process Incorrect
• Process, Process Not Followed
• Process, Process Incomplete
• Process, Process Incorrect

Database Structure: All root causes and follow-ups

[Diagram: Service Outage View. The Primary Root Cause and Contributing Root Causes 1 to 4 are shown with five Preventive Action Items.]

Service Outage View

Root causes:
• Battery Maintenance During Prime Time
• Bad Battery
• SAN Technician working without their knowledge
• Vendor allowed in data center without escort
• No battery preventive maintenance schedule

Preventive actions:
• Install second battery
• Change process, require SAN Technician to get permission from SAN group for all work
• Change security process, no unescorted vendors
• Create process to replace batteries every x months, well in advance of MTBF
• Change Battery Maintenance Process – swap is done off-peak

The Process: Who will own the process?

Owner? PEAP Owner role? (PO?)

We need action in the title… PEAP Driver (PD?)

How about a PEAP Owner/Driver? A POD!

The Process: POD role

[Service Outage View as on the previous slide, with the same five root causes and five preventive actions, annotated with the POD’s two tasks.]
• ID all root causes
• Describe Preventive Action

The Process: Assign follow-ups…

Service Outage View

Root causes:
• Battery Maintenance During Prime Time
• Flaky Battery
• SAN Technician working without knowledge
• Vendor allowed in data center without escort
• No battery preventive maintenance schedule

Preventive actions and assignments:
• Install second battery: assign to Manager of SAN Group
• Change process, require SAN Technician to get permission from SAN group for all work: assign to Manager of SAN Group
• Change security process, no unescorted vendors: assign to Building Security Manager
• Create process to replace batteries every x months, well in advance of MTBF: assign to Manager of SAN Group, drive process internal and external
• Change Battery Maintenance Process – swap is done off-peak: assign to Manager of SAN Group
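
The outage view above (one outage, several root causes, each with its own follow-up and owner) could be modeled roughly as follows. This is a minimal Python sketch under stated assumptions, not the actual database design. The class and field names are hypothetical, the contributing causes are a list rather than four fixed slots, and the pairing of each root cause with an action is one reading of the slides. Treating the first listed cause as the primary one is also an assumption.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class FollowUp:
    """A preventive action item assigned to an owner."""
    action: str
    owner: str
    done: bool = False


@dataclass
class RootCause:
    """One root cause with its keyword classification and follow-ups."""
    description: str
    global_keyword: str
    keyword1: Optional[str] = None
    keyword2: Optional[str] = None
    follow_ups: List[FollowUp] = field(default_factory=list)


@dataclass
class ServiceOutage:
    """An outage with one primary and any number of contributing root causes."""
    title: str
    primary: RootCause
    contributing: List[RootCause] = field(default_factory=list)

    def root_causes(self):
        return [self.primary] + self.contributing

    def open_follow_ups(self, owner=None):
        """List unfinished follow-ups, optionally filtered by owner."""
        return [
            f
            for rc in self.root_causes()
            for f in rc.follow_ups
            if not f.done and (owner is None or f.owner == owner)
        ]


# Example data taken from the SAN battery outage; the pairings are assumed.
outage = ServiceOutage(
    title="SAN slowdown",
    primary=RootCause(
        "Battery Maintenance During Prime Time", "Process",
        follow_ups=[FollowUp("Change Battery Maintenance Process: swap off-peak",
                             "Manager of SAN Group")],
    ),
    contributing=[
        RootCause(
            "Flaky Battery", "Hardware", "SAN", "Battery",
            follow_ups=[FollowUp("Install second battery",
                                 "Manager of SAN Group")],
        ),
        RootCause(
            "Vendor allowed in data center without escort", "Process",
            follow_ups=[FollowUp("Change security process, no unescorted vendors",
                                 "Building Security Manager")],
        ),
    ],
)

for item in outage.open_follow_ups("Manager of SAN Group"):
    print(item.action)
```

Hanging follow-ups off each root cause, rather than off the outage, is what lets the record show which cause each corrective action addresses.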

The Process: Document and Communicate

• Document all in the database
• Communicate:
  • Internally
  • Externally
• Drive the process to completion

Surprisingly, nobody wants to be a POD!

Actually a good thing…

If your area contributed or caused an outage, you get to be POD.

Incentive not to have outages

• Battery Maintenance During Prime Time
• Not aware of potential bad battery
• SAN Technician working without knowledge
• Vendor allowed in data center without escort
• No battery preventive maintenance schedule

The Process - details to work out

• How to define an outage?
• When is the outage over?
• Who is best to drive this process?
• How does the process get initiated?

The Process - details

Eureka!
• Outage Manager can launch PEAP
• Assign POD = manager of group that fixed the outage

• Existing Outage Management Process
• Existing outage definition
• Knowledge of incident
• Communicates incident status to customers

The Process

ITIL Terminology:
• Incident: Any event that is not part of the standard operation of a service and that causes, or may cause, an interruption to, or a reduction in, the quality of that service.
• Problem: unknown underlying cause of one or more incidents.

-from ITIL Foundations by ITpreneurs B.V. 2006

At the end of the Incident Management process, the item is moved to the Problem Management Process

The Process: Incident to Problem Transition

Outage incident:
• Details in incident tracking system
• Outage resolved, incident ticket is solved

Interface to PEAP – Problem Management system:
• Details transferred to a defect record
• Defect assigned to an owner – the POD
• Updates to defect record pass back to incident ticket

Post Event Analysis Process in a nutshell

• Outage ends, incident transferred to PEAP
• PEAP assigned to a manager (POD)
• POD notified automatically by e-mail
• POD:
  • gathers information, determines root causes
  • enters findings in database and internal post-mortem (PEAR)
  • assigns follow-ups as needed (new records created)
• PEAR sent internally
• Customer Letter is created, reviewed, sent to affected customers
• Corrective actions implemented
• PEAR reviewed by senior management
• PEAP solved
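
The nutshell steps above can be read as a simple ordered lifecycle. The following is a minimal Python sketch of that reading, assuming the steps happen strictly in the order listed, which the slides do not promise. The `PeapStep` and `advance` names are hypothetical, and the three POD sub-tasks are collapsed into one step.

```python
from enum import Enum


class PeapStep(Enum):
    """Steps of the PEAP lifecycle, in the order given on the slide."""
    INCIDENT_TRANSFERRED = 1
    POD_ASSIGNED = 2
    POD_NOTIFIED = 3
    FINDINGS_ENTERED = 4
    PEAR_SENT_INTERNALLY = 5
    CUSTOMER_LETTER_SENT = 6
    CORRECTIVE_ACTIONS_IMPLEMENTED = 7
    PEAR_REVIEWED = 8
    SOLVED = 9


def advance(current):
    """Return the step after `current`; refuse to move past SOLVED."""
    if current is PeapStep.SOLVED:
        raise ValueError("PEAP is already solved")
    return PeapStep(current.value + 1)


# Walk one PEAP record from transfer to solved.
step = PeapStep.INCIDENT_TRANSFERRED
history = [step]
while step is not PeapStep.SOLVED:
    step = advance(step)
    history.append(step)

print([s.name for s in history])
```

Recording the current step on each record would also give the administrative reminders something concrete to check against.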

Post Event Analysis Process - Rollout

Collateral
• PEAR template
• Customer Letter templates
• Process user guide – database navigation, process steps

Education class
• Overview of root cause determination
• Overview of process
• Navigating the database

Post Event Analysis Process - Rollout

Limited scope initially
• Multi-customer outages > 15 min
• All multi-customer outages
• All outages

Quality Management System – central location
• Process description
• User Guide
• PEAR and Customer Letter templates

Post Event Analysis Process - Rollout

Challenges:

1. Process defined in QMS but lengthy
• “Checklist” with links to QMS section

2. Original process – too many steps
• Gathered feedback
• Reduced the number of steps
• Second round of education – new process, value

Post Event Analysis Process - Rollout

Challenges:

3. Culture change – gaps in compliance
• Phased roll-out
• Re-education
• Administrative reminders
• Senior management support

4. Not all root causes identified
• Weekly reviews with senior management
• “5 Why’s”

Benefits

1. Decreased downtime
• “one-fers” (that aren’t) are identified
  • across platforms
  • across time spans

2. Increased customer satisfaction
• Many “customers” of ours are CIOs or IT staff
  • Explain to their own customers
  • Knowledge of cause and remediation

Benefits

3. Adjust monitoring focus
• Identify gaps – component level
• Identify gaps – end-user experience

4. Fewer outages due to late running maintenance
• More precise estimates, smaller scopes
• Avoid effort to complete PEAP

Trending

[Diagrams: the Primary Root Cause and Contributing Root Causes 1 to 4 of a single outage, followed by the Root Cause records of many outages collected together.]

Trending

Root Cause
• Corrective Actions
• Major Category, Keyword 1, Keyword 2
• Part that failed, Vendor
• Customers Impacted, Duration
• Applications Impacted

Trending: What can you discover?

Outage Type by Percentage (not actual data)

Outage type    Percentage
Hardware       27%
Software       31%
Process        42%
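
A breakdown like the one above falls out of a simple count over classified records. The sketch below is a hypothetical Python illustration, not the reporting used for the deck. The `percent_by` helper and the record layout are assumptions, and the records are made up to reproduce the illustrative (not actual) figures.

```python
from collections import Counter


def percent_by(records, key):
    """Return {value: percentage of records}, rounded to whole percents."""
    counts = Counter(r[key] for r in records)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {value: round(100 * n / total) for value, n in counts.items()}


# Made-up records: one per outage, tagged with its global keyword.
records = (
    [{"global_keyword": "Hardware"}] * 27
    + [{"global_keyword": "Software"}] * 31
    + [{"global_keyword": "Process"}] * 42
)

print(percent_by(records, "global_keyword"))
# {'Hardware': 27, 'Software': 31, 'Process': 42}
```

The same helper works for Keyword 1 or Keyword 2 by passing a different key, provided every record carries that field.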

Trending

Process Issues Time Percentage (not actual data)

Process issue         Hours    Percentage
Process-Incorrect     146      55%
Process-Incomplete    63       23%
Process-Not Follow    60       22%

Trending

[Bar chart: Hardware Outage Minutes by Month (NOT ACTUAL DATA). Y axis: minutes, 0 to 80. X axis: months, 2006-Jan through 2007-Dec.]

Trending

[Bar chart: Outages by Weekday (NOT ACTUAL DATA). Y axis: count, 0 to 30. X axis: Sun through Sat. Bar labels as extracted: 4, 27, 28, 28, 18, 21, 9.]

Trending

[Stacked bar chart: Outages by Weekday (NOT ACTUAL DATA), split into Software, Process, and Hardware. Y axis: count, 0 to 30. X axis: Sun through Sat.]
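
A weekday view split by category can be produced by grouping outage dates. The following is a minimal Python sketch with made-up sample data. The function name `outages_by_weekday` and the `(date, global_keyword)` pair layout are assumptions for illustration only.

```python
from collections import defaultdict
from datetime import date

WEEKDAYS = ["Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"]


def outages_by_weekday(outages):
    """Count outages per weekday, split by global keyword.

    `outages` is an iterable of (date, global_keyword) pairs.
    """
    table = {day: defaultdict(int) for day in WEEKDAYS}
    for day, category in outages:
        # date.weekday() is 0 for Monday; shift so that index 0 is Sunday.
        name = WEEKDAYS[(day.weekday() + 1) % 7]
        table[name][category] += 1
    return table


sample = [
    (date(2007, 3, 5), "Process"),    # a Monday
    (date(2007, 3, 5), "Hardware"),
    (date(2007, 3, 10), "Software"),  # a Saturday
]
counts = outages_by_weekday(sample)
print(dict(counts["Mon"]))  # {'Process': 1, 'Hardware': 1}
print(dict(counts["Sat"]))  # {'Software': 1}
```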

Trending

Application A Downtime Minutes (not actual data)

Cause                            Minutes
Network-Loose Electrical Plug    101
Process-Incorrect                98
DB2                              85
W2K Server OS-HIS                360
Mainframe-SNA Server             75
M/F-FEP                          140
Process-Incomplete               30
Server-undetermined outage       30
Causes < 120 mins                108
Mainframe-TELNET                 180
Circuit Breaker-UPS              69
Network-Switch                   69
Process-Not Follow               39
W2K3 OS-HIS                      135
Windows Server OS-IIS            252
Network-Circuit                  292

Conclusion

• Methodology – works for all size IT shops
• Robust defect-tracking database
  • critical for large shops
  • smaller scale - standard document, keywords
• No group per landscape - someone is responsible
• Integration of PEAP into workflows
• Phased roll-out, repeat education
• Admin to assist with tracking and notifications
• Problem manager to ‘own’ the process?
• How to categorize keywords – ongoing refinements

Thank you!
