Incident Consequence Analysis


Description: Incident consequence analysis used in the major incident process

Transcript of Incident Consequence Analysis

Page 1: Incident Consequence Analysis

Incident: dd/mm hhmm
Detected: hhmm
Repair: hhmm
Recover: hhmm
Restore: hhmm
Resolution: hhmm
Workaround: <Description>
Diagnosis: hhmm
Escalations: Problem management: dd/mm <Any extra details>
Notification / Report: dd/mm

Incident Consequence Analysis: <Description of major incident>
Service desk references: ######

This is reported as a <Minor, Major> Incident.

<Business units> affected in <location>. <x> minutes unavailable and/or <x> minutes degraded.

<Resolution>. <Service> affected by <cause>. <No, blank> further root cause analysis required.

Escalated to <escalations>.

This incident affected the company <less than, the same as, greater than> usual. The outage was <less than, blank, greater than> normal. The risk is <less than, blank, greater than> average.

Prepared by: <first name, surname>

<Major Incident Dashboard>

Rolling Incident averages: Classification – <Norm>, Outage analysis – <Norm>, Risk management – <Norm>

This incident was <calculation> less than the norm using the Incident User Metric.

Resources: <Job descriptions and names of resources involved>

Service affected: <Name of service from catalogue>

Description of incident: <Description of incident>

Resolution: <Description of resolution>

Timelines & details

Time analysis

<Graph of Expanded Incident Lifecycle>

Incident Breakdowns
<Pie chart of incident breakdown by service>
<Pie chart of incident breakdown by cause>

Time unavailable/degraded: <x> minutes unavailable, <x> minutes degraded
MTTR=<x> minutes, MTBF=<x> days, MTBSI=<x> days.

Incident User Metric: Cost of downtime analysis <x>
<Incident User Metric skyline>

<Last 10 Incidents – ROC analysis>

Classification (<x>%), Outage analysis (<x>%), Risk management (<x>%)

S CR OP U P IT B I V CM

Thinking problem management! (ICA Template) Page: 1

Page 2: Incident Consequence Analysis

Major Incident Dashboard

<Bar chart: Classification, Outage and Risk, on a scale of 0% to 70%>

Scores (S, CR, OP, U, P, IT, B, I, V, CM): <x> <x> <x> <x> <x> <x> <x> <x> <x> <x>

Classification
Scope (S): <input from major incident draft template>
Credibility (CR): <input from major incident draft template>
Operations (OP): <input from major incident draft template>
Urgency (U): <input from major incident draft template>
Prioritization (P): <input from major incident draft template>

Outage analysis
IT service outage analysis (IT): <input from major incident draft template>
Business service outage analysis (B): <input from major incident draft template>

Risk management
Risk impact (I): Best practice CIA analysis (CRAMM) – Confidentiality (unauthorized disclosure), Integrity (unauthorized modification or misuse), Availability (destruction or loss).
<input from major incident draft template>
Risk vulnerability (V): What are the chances of the outage occurring considering loss, error or failure?
<input from major incident draft template>
Countermeasures (CM): What is being done to prevent this from happening again?
<input from major incident draft template>

Closure
Escalations: Please note that if no comments or questions are received within 5 working days, this report is classed as accepted.
<input from major incident draft template>

Example

Incident Consequence Analysis: Email outage in Pofadder
Service desk references: 555772

This is reported as a Minor Incident.

All Business units affected in Pofadder. 12 minutes unavailable and 238 minutes degraded.

Mail server recycled. Messaging affected by bug. No further root cause analysis required.

Escalated to Infrastructure Manager.

This incident affected the company less than usual. The outage was normal. The risk is less than average.

Prepared by: Ronald Bartels

Rolling Incident averages: Classification – 69%, Outage analysis – 49%, Risk management – 54%

This incident was 66% less than the norm using the Incident User Metric.
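The "less than / the same as / greater than" wording in the summary can be read as a comparison of this incident's dashboard percentages (60%, 50% and 41% in this example) against the rolling averages (69%, 49% and 54%). The following is a minimal sketch of such a comparison. The function name `compare_to_norm` and the 5-point tolerance band are assumptions made for illustration; the template does not state how close a value must be to count as normal.

```python
def compare_to_norm(value, norm, tolerance=5):
    """Describe a percentage relative to its rolling average.

    The tolerance (in percentage points) is a hypothetical choice,
    not something defined by the ICA template.
    """
    if value < norm - tolerance:
        return "less than"
    if value > norm + tolerance:
        return "greater than"
    return "the same as"


# Example values taken from the worked example in this document:
# (incident percentage, rolling average)
example = {
    "Classification": (60, 69),
    "Outage analysis": (50, 49),
    "Risk management": (41, 54),
}

for category, (value, norm) in example.items():
    print(f"{category}: {compare_to_norm(value, norm)} the norm")
```

With these inputs the sketch agrees with the example's wording: classification and risk fall below their norms, and the outage is within the band treated as normal.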

Resources: Service Level Manager (M Mouse), Regional Infrastructure team leader (D Duck).

Service affected: Messaging

Description of incident: IT customers located in the Pofadder office experienced slow delivery of mail messages to other regions and business units. IT customers were unable to confirm instructions or send credit minutes via email. The inbound and outbound queues on the Exchange server were not flowing. Documents scanned and emailed via multi-function devices were largely affected where the size of the document was over 1.5 MB. The log file gave a specific error code, which suggested several resolutions from the knowledge base (http://support.microsoft.com/kb/329617). The bad mail folder was cleared and the SMTP service was restarted. However, this did not clear the issue and it was only when the mail server was power cycled that


Page 3: Incident Consequence Analysis

Incident: 9/10 09h30
Detected: 11h25
Repair: 13h15
Recover: 13h17
Restore: 13h35
Resolution: 13h35 (Server restarted)
Workaround: Failed - Bad mail folder cleared and SMTP service restarted.
Diagnosis: 13h06
Escalations: Problem management: 9/10
Notification / Report: 9/10

Incident breakdown by service (affected: Messaging)
<Pie chart; categories: Ecommerce, Monitoring, Printing, Third party, Operations, Backups, Service desk, Storage, AD, Documents, Security, Intranet, Hosting, Payments, Voice, Messaging, Data networks>

Incident breakdown by cause (caused by: Bug)
<Pie chart; categories: Change, Capacity, Process, Vendor, Hardware, Bug, Environmental, Service Provider, Carrier, Configuration, Component Failure>

normal operations resumed.

Resolution: Mail server recycled.

Timelines & details

Time analysis

Incident times (hh:mm)
<Bar chart of the expanded incident lifecycle, axis from 00:00 to 02:09>
Detect: 01:55
Diagnose: 01:41
Repair: 00:09
Recover: 00:02
Restore: 00:18
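The phase durations in the time analysis follow from the timestamps recorded in the example timeline (Incident 09h30, Detected 11h25, Diagnosis 13h06, Repair 13h15, Recover 13h17, Restore 13h35). As a minimal sketch, assuming every timestamp falls on the same day and that each phase runs from the previous milestone to the next, the durations could be derived as follows. The names `PHASES`, `parse` and `phase_durations` are illustrative and not part of the template.

```python
from datetime import datetime

# Timestamps from the example timeline (all on 9/10).
timeline = {
    "Incident": "09h30",
    "Detected": "11h25",
    "Diagnosis": "13h06",
    "Repair": "13h15",
    "Recover": "13h17",
    "Restore": "13h35",
}

# Assumed mapping: each phase runs from one milestone to the next.
PHASES = [
    ("Detect", "Incident", "Detected"),
    ("Diagnose", "Detected", "Diagnosis"),
    ("Repair", "Diagnosis", "Repair"),
    ("Recover", "Repair", "Recover"),
    ("Restore", "Recover", "Restore"),
]


def parse(hhmm):
    """Parse a time written as e.g. '09h30' (same-day times only)."""
    return datetime.strptime(hhmm, "%Hh%M")


def phase_durations(times):
    """Return each phase's duration as an 'hh:mm' string."""
    result = {}
    for name, start, end in PHASES:
        delta = parse(times[end]) - parse(times[start])
        minutes = int(delta.total_seconds() // 60)
        result[name] = f"{minutes // 60:02d}:{minutes % 60:02d}"
    return result


for phase, duration in phase_durations(timeline).items():
    print(f"{phase}: {duration}")
```

For the example timeline this gives 01:55, 01:41, 00:09, 00:02 and 00:18, matching the values shown on the chart.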

Incident Breakdowns

Time unavailable/degraded: 12 minutes unavailable, 238 minutes degraded
MTTR=238 minutes, MTBF=8 days, MTBSI=7 days.

Incident User Metric: Cost of downtime analysis 347


Page 4: Incident Consequence Analysis

Incident User Metric Skyline

<Chart: Incident User Metric per week, vertical axis from 0 to 20000, horizontal axis from 24/03/2007 to 06/10/2007>

Last 10 incidents - ROC Analysis

<Chart: Risk, Outage and Classification for incidents 1 to 10>

Classification (60%), Outage analysis (50%), Risk management (41%)

Scores: S=2, CR=2, OP=1, U=4, P=3, IT=3, B=1, I=2, V=2, CM=1

Classification
Scope (S): Less than 25% of customers affected*
Credibility (CR): Multiple business units affected negatively
Operations (OP): Some interference with normal completion of work
Urgency (U): Underway and could not be stopped
Prioritization (P): High - Technicians respond immediately, assess the situation, and may interrupt other staff working low or medium priority jobs for assistance.

Outage analysis
IT service outage analysis (IT): Major - App, server, link (network or voice) unavailable for greater than 1 hour or degraded for greater than 4 hours
Business service outage analysis (B): Minor - Financial loss with a visible impact on profitability but no real effect, greater than $10k or some embarrassment or rule or process breaches or medical treatment

Risk management
Risk impact (I): Best practice CIA analysis (CRAMM) – Confidentiality (unauthorized disclosure), Integrity (unauthorized modification or misuse), Availability (destruction or loss).
Confidentiality=Confidential, Integrity=High, Availability=Moderate
Risk vulnerability (V): What are the chances of the outage occurring considering loss, error or failure?
Low loss probability, moderate error probability, moderate failure probability
Countermeasures (CM): What is being done to prevent this from happening again?
Proactive monitoring of environment. Refer to the knowledge base at service desk. Antivirus service locks up SMTP Service when BadMail queue reaches a specific size. Add check to daily check list to monitor BadMail folder size.
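The three dashboard percentages in this example (60%, 50% and 41%) are consistent with summing the scores in each category and dividing by the maximum possible total, if each score is taken to be out of 4 and the result is rounded down. The template does not state the scale or the rounding, so both are assumptions in the sketch below, as are the names `CATEGORIES`, `MAX_SCORE` and `category_percentages`.

```python
# Scores from the example dashboard, keyed by their column codes.
SCORES = {
    "S": 2, "CR": 2, "OP": 1, "U": 4, "P": 3,  # Classification
    "IT": 3, "B": 1,                           # Outage analysis
    "I": 2, "V": 2, "CM": 1,                   # Risk management
}

CATEGORIES = {
    "Classification": ["S", "CR", "OP", "U", "P"],
    "Outage analysis": ["IT", "B"],
    "Risk management": ["I", "V", "CM"],
}

# Assumption: each score has a maximum of 4. This is inferred from the
# example's figures and is not stated in the template.
MAX_SCORE = 4


def category_percentages(scores, max_score=MAX_SCORE):
    """Return each category's total as a whole percentage (rounded down)."""
    result = {}
    for category, codes in CATEGORIES.items():
        total = sum(scores[code] for code in codes)
        result[category] = (100 * total) // (max_score * len(codes))
    return result


for category, percent in category_percentages(SCORES).items():
    print(f"{category}: {percent}%")
```

Under these assumptions the sketch gives 12 out of 20 for classification, 4 out of 8 for outage analysis and 5 out of 12 for risk management, which reproduces the percentages shown above.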

Closure
Escalations: Please note that if no comments or questions are received within 5 working days, this report is classed as accepted.
No further root cause analysis required. Escalated to Infrastructure Manager.