Incident Management

16
INCIDENT MANAGEMENT Goal of Incident Management The primary goal of the Incident Management process is to restore normal service operation as quickly as possible and minimise the adverse impact on business operations, thus ensuring that the best possible levels of service quality and availability are maintained. 'Normal service operation' is defined here as service operation within Service Level Agreement (SLA) limits INCINDENT MGT PROCESS Inputs : Incident details sourced from (for example) Service Desk, networks or computer operations configuration details from the Configuration Management Database (CMDB)

Transcript of Incident Management

Page 1: Incident Management

INCIDENT MANAGEMENT

Goal of Incident ManagementThe primary goal of the Incident Management process is to restore normal service operation as quickly as possible and minimise the adverse impact on business operations, thus ensuring that the best possible levels of service quality and availability are maintained. 'Normal service operation' is defined here as service operation within Service Level Agreement (SLA) limits

INCINDENT MGT PROCESS

Inputs:

Incident details sourced from (for example) Service Desk, networks or computer operations

configuration details from the Configuration Management Database (CMDB)

response from Incident matching against Problems and Known Errors resolution details response on RFC to effect resolution for Incident(s).

Outputs:

Page 2: Incident Management

RFC for Incident resolution; updated Incident record (including resolution and/or Work-arounds)

resolved and closed Incidents communication to Customers management information (reports).

Incident Management activities:

Incident detection and recording Classification and initial support investigation and diagnosis resolution and recovery Incident closure Incident ownership, monitoring, tracking and communication

The roles and functions related to the Incident Management process are:

first, second-and third-line support groups, including specialist support groups and external suppliers (roles)

Incident Manager (role) Service Desk manager (function).

Incident Handling

Most IT departments and specialist groups contribute to handling Incidents at some time. The Service Desk is responsible for the monitoring of the resolution process of all registered Incidents - in effect the Service Desk is the owner of all Incidents. The process is mostly reactive. To react efficiently and effectively therefore demands a formal method of working that can be supported by software tools.

Incidents that cannot be resolved immediately by the Service Desk may be assigned to specialist groups. A resolution or Work-around should be established as quickly as possible in order to restore the service to Users with minimum disruption to their work. After resolution of the cause of the Incident and restoration of the agreed service, the Incident is closed.

Incident Handling (incident life cycle)

Page 3: Incident Management

Priority

The priority of an Incident is primarily determined by the impact on the business and the urgency with which a resolution or Work-around is needed. Targets for resolving Incidents or handling requests are generally embodied in an SLA.

The Service Desk plays an important role in the Incident Management process, as follows:

all Incidents are reported to and registered by the Service Desk - where Incidents are generated automatically, the process should still include registration by the Service Desk

the majority of Incidents (perhaps up to 85% in a highly skilled environment) will be resolved at the Service Desk

the Service Desk is the 'independent'function monitoring Incident resolution progress of all registered Incidents.

On receipt of an Incident notification, the main actions to be carried out by the Service Desk are:

Page 4: Incident Management

record basic details - this includes timing data and details of symptoms obtained

if a service request has been made, the request is handled in conformance with the organisation's standard procedures

from the CMDB , the Configuration Items ( CI ) reported as the cause for an Incident is selected, to complete the Incident record

the appropriate priority is assigned and the User is given the unique system-generated Incident number (to be quoted at the beginning of all further communication)

the Incident is assessed and, if possible, resolution advice is given: this frequently will be possible for routine Incidents or when a match to a known Problem/error is achieved

following successful resolution the Incident record is closed: details of the resolution action and the appropriate category code are added

the Incident is assigned to second-line support (i.e. a specialist group) following unsuccessful resolution or recognition that a further level of support is needed.

Relationship between Incidents, Problems, Known Errors and RFCs

Incident s, the result of failures or errors within the IT infrastructure, result in actual or potential variations from the planned operation of the IT services.

The cause of Incidents may be apparent and that cause can be addressed without the need for further investigation, resulting in a repair, a Work-around or an RFC to remove the error. In some cases the Incident itself, i.e. the effect or potential effect upon the Customer, can be dealt with quickly. Perhaps by rebooting a PC or resetting a communications line, without directly addressing the underlying cause of the Incident.

Where the underlying cause of the Incident is not identifiable, then it may be appropriate to raise a Problem record. A Problem is thus, in effect, indicative of an unknown error within the infrastructure. Normally a Problem record is raised only if investigation is warranted.

This impact will often be assessed via the impact, (both actual and potential), upon the business services, and the number of similar Incidents apparently sharing a common underlying cause that have reported. This may be appropriate even where the actual result of the Incident has been addressed. It can be seen therefore that a Problem record is independent of associated Incident records, and both the Problem record and the investigation into its cause can persist even after the initial Incident has been successfully closed.

Successful processing of a Problem record will result in the identification of the underlying error, and the record can then be converted into a Known Error once a Work-around has been developed, and/or an RFC . This logical flow,

Page 5: Incident Management

from an initial report to the resolution of an underlying Problem, is shown in Figure

We thus have the following definitions:

Problem The unknown underlying cause of one or more Incidents.

Known Error

A Problem that is successfully diagnosed and for which a Work-around is known.

RFC A Request For Change to any component of an IT Infrastructure or to any aspect of IT services.

A Problem can result in multiple Incidents, and it is possible that the Problem will not be diagnosed until several Incidents have occurred, over a period of time. Handling Problems is quite different from handling Incidents and is therefore covered by the Problem Management process.

During the Incident-resolution process the Incident is matched against the Problem and Known Error database. It should also be matched against the Incident database to see whether there is a similar Incident outstanding, or whether there has been resolution action taken for any previous similar Incident. If a Work-around or resolution is available, the Incident can be resolved immediately. If not,

Page 6: Incident Management

Incident Management is responsible for finding a resolution or Work-around with minimum disruption to the business process.

When Incident Management finds a Work-around it will be analysed by the Problem Management team who will update the associated Problem record (see Figure 5.5). Note that an associated Problem record may not exist at this time - for example, the Work-around may be to send a report by fax due to a communication line failure, but at this point there may not be a Problem record for the communication line failure, which the Problem Management team would have to create. The process is then that the Service Desk will link Incidents that are clearly the result of an existing Problem record.

It is also possible that the Problem Management team, while investigating the Problem associated with the Incident, finds a Work-around or a resolution for a Problem and/or some related Incidents. In this case, the Problem Management team should inform the Incident Management process in order that open Incidents have their status changed to 'Known Error' or 'closed' as appropriate.

Where it is felt at Incident logging that an Incident should be treated as a Problem, then it should be referred immediately to the Problem Management process, where, if appropriate, a new Problem record will be raised. Incident Management will, as always, remain responsible for pursuing a resolution to the Incident with minimal possible disruption to the business processes.

Benefits of Incident ManagementThe major benefits to be gained by implementing an Incident Management process are as follows:

For the business as a whole:

reduced business impact of Incident s by timely resolution, thereby increasing effectiveness

the proactive identification of beneficial system enhancements and amendments

the availability of business-focused management information related to the SLA.

For the IT organization in particular:

improved monitoring, allowing performance against SLA s to be accurately measured

improved management information on aspects of service quality better staff utilisation , leading to greater efficiency elimination of lost or incorrect Incidents and service requests more accurate CMDB information (giving an ongoing audit while registering

Incident s) improved User and Customer satisfaction .

Page 7: Incident Management

In contrast, failing to implement Incident Management may result in:

no one to manage and escalate Incidents - hence Incidents may become more severe than necessary and adversely affect IT service quality

specialist support staff being subject to constant interruptions, making them less effective

business staff being disrupted as people ask their colleagues for advice frequent reassessment of Incidents from first principle rather than reference

to existing solutions lack of coordinated management information lost, or incorrectly or badly managed Incidents .

Planning and implementation:

Critical success factors

Successful Incident Management requires a sound basis, as highlighted by the following points:

An up-to-date CMDB is a prerequisite for an efficiently working Incident Management process. If a CMDB is not available, information about Configuration Items (CIs) related to Incidents should be obtained manually, and determining impact and urgency will be much more difficult and time-consuming.

A 'knowledge base' in the form of an up-to-date Problem/error database should be developed to provide for resolutions and Work-arounds. This will greatly speed up the process of resolving Incidents. Third-party Known Error databases should also be available to assist in this process.

An effectively automated system for Incident Management is fundamental to the success of a Service Desk. Paper-based systems are not really practical or necessary, now that good and cheap support tools are available.

Forge a close link with the Service-Level Management (SLM) process to obtain necessary Incident response targets. Timely Incident resolution will satisfy Customers and Users.

Possible problem areas

Be prepared to overcome:

no visible management or staff commitment, resulting in non-availability of resources for implementation

lack of clarity about business needs working practices not being reviewed or changed poorly defined service objectives, goals and responsibilities no provision of agreed Customer service levels lack of knowledge for resolving Incidents inadequate training for staff

Page 8: Incident Management

lack of integration with other processes lack of, or expense of, tools to automate the process resistance to change.

Incident Management activitiesThis section discusses in more detail the six activities encapsulated…

Incident detection and recording classification and initial support investigation and diagnosis resolution and recovery Incident closure ownership, monitoring, tracking and communication

Incident detection and recording

Incident details from Service Desk or event management systems are the inputs for Incident Management. Resultant actions are to:

record basic details of the Incident alert specialist support group(s) as necessary start procedures for handling the service request.

Outputs will be:

updated details of Incidents the recognition of any errors on the CMDB notice to Customers when an Incident has been resolved.

Classification and initial support

Inputs:

recorded Incident details configuration details from the CMDB response from Incident matching against Problems and Known Errors.

Incident records raised in the previous activity are now analysed to discover the reason for the Incident. The Incident should also be classified, the process on which further resolution actions are based. Annex 5A provides some examples of classification codes.

Actions:

classifying Incidents

Page 9: Incident Management

matching against Known Errors and Problems informing Problem Management of the existence of new Problems and of

unmatched or multiple Incidents assigning impact and urgency, and thereby defining priority assessing related configuration details providing initial support (assess Incident details, find quick resolution) closing the Incident or routing to a specialist support group, and

informing the User(s).

Outputs:

RFC for Incident resolution updated Incident details, and Work-arounds for Incidents, or Incident routed to second- or third-line

support.

Classification is the process of identifying the reason for the Incident and hence the corresponding resolution action. Many Incidents are regularly experienced and the appropriate resolution actions are well known. This is not always the case, however, and a procedure for matching Incident classification data against that for Problems and Known Errors is necessary. Successful matching gives access to proven resolution actions, which should require no further investigation effort

Investigation and diagnosis

Inputs:

updated Incident details configuration details from the CMDB.

Actions:

assessment of the Incident details, collection and analysis of all related information, and resolution (including any Work-around) or a route to n-line support.

Outputs:

Incident details yet further updated, and a specification of the selection or required Work-around

Resolution and recovery

Inputs:

Page 10: Incident Management

updated Incident details any response on an RFC to effect resolution for the Incident(s) any derived Work-around or solution.

Actions:

resolve the Incident using the solution/Work-around or, alternatively, to raise an RFC (including a check for resolution)

take recovery actions.

Outputs:

RFC for future Incident resolution resolved Incident, including recovery details, updated Incident details.

Incident closure

Inputs:

updated Incident details, resolved Incident.

Actions:

the confirmation of the resolution with the Customer or originator 'close' category Incident.

Outputs:

updated Incident detail, closed Incident record.

When the Incident has been resolved, the Service Desk should ensure that:

details of the action taken to resolve the Incident are concise and readable classification is complete and accurate according to root cause resolution/action is agreed with the Customer - verbally or, preferably, by

email or in writing all details applicable to this phase of the Incident control are recorded, such

that: the Customer is satisfied cost-centre project codes are allocated the time spent on the Incident is recorded the person, date and time of closure are recorded.

Page 11: Incident Management

Ownership, monitoring, tracking and communication

Inputs:

Incident records.

Actions:

monitor Incidents escalate Incidents inform User.

Outputs:

management reports about Incident progress escalated Incident details; and Customer reports and communication.

Roles of the Incident Management processProcesses span the organisation's hierarchy. Therefore it is important to define the responsibilities associated with the activities that have to be performed in the process. To remain flexible, it is advisable to use the concept of roles; in many organisations roles may be combined because of the small size of the organisation or because of cost. For example, many organisations combine the roles of Change Management and Configuration Management. A role embraces a set of responsibilities, tasks and levels of authorisation.

5.8.1 Incident Manager

An Incident Manager has the responsibility for:

driving the efficiency and effectiveness of the Incident Management process producing management information managing the work of Incident support staff (first-and second-line) monitoring the effectiveness of Incident Management and making

recommendations for improvement developing and maintaining the Incident Management systems.

In many organisations, the role of Incident Manager is assigned to the (function) Service Desk Supervisor.

Page 12: Incident Management

5.8.2 Incident-handling support staff

First-line support (Service Desk) responsibilities include:

Incident registration routing service requests to support groups when Incidents are not closed initial support and classification ownership, monitoring, tracking and communication resolution and recovery of Incidents not assigned to second-line support closure of Incidents .

Second-line support (specialist groups that may be part of the Service Desk) will be involved in tasks such as:

handling service requests monitoring Incident details, including the Configuration Items affected Incident investigation and diagnosis (including resolution where possible) detection of possible Problem s and the assignment of them to the Problem

Management team for them to raise Problem records the resolution and recovery of assigned Incidents.

Ownership, monitoring, tracking and communication tasks cover:

monitoring the status and progress towards resolution of all open Incidents keeping affected User s informed about progress escalating the process if necessary.

Key Performance IndicatorsTo judge process performance, clearly defined objectives with measurable targets - often referred to as Key Performance Indicators (KPIs) - should be set. The following metrics are examples for the effectiveness and efficiency of the Incident Management process:

total numbers of Incident s mean elapsed time to achieve Incident resolution or circumvention, broken

down by impact code percentage of Incidents handled within agreed response time (Incident

response-time targets may be specified in SLA s, for example, by impact code)

average cost per Incident percentage of Incidents closed by the Service Desk without reference to

other levels of support Incidents processed per Service Desk workstation number and percentage of Incidents resolved remotely , without the need

for a visit.

Reports should be produced under the authority of the Incident Manager, who should draw up a schedule and distribution list, in collaboration with the Service

Page 13: Incident Management

Desk and support groups handling Incidents. Distribution lists should at least include IT services management and specialist support groups. Consider also making the data available to Users and Customers, for example via SLA reports.

INCIDENT MANAGEMENT

Data requirements for service Incident recordsThe following data should be recorded during the Incident life cycle:

unique reference number Incident classification date/time recorded name/id of the person and/or group recording the Incident name/department/phone/location of User calling call-back method (telephone, mail etc.) description of symptoms category (often a main category and a subcategory) impact/urgency/priority Incident status (active, waiting, closed etc.) related Configuration Item support group/person to which the Incident is allocated related Problem/Known Error resolution date and time closure category closure date and time.

To have control during the complete Incident life-cycle, for every action is recorded:

name/id of the support group or person recording the action type of action (routing, diagnose, recovery, resolving, closing etc.) date/time of action description and outcome of action.