Data centre incident nov 2010 v3

Disaster and Recovery, by Alan Davies, Gregynog Colloquium, 17th June 2011

University of Glamorgan's data centre incident.

Page 1

Disaster and Recovery

By Alan Davies

Gregynog Colloquium 17th June 2011

Page 2

Page 3

TOPICS

Before the Flood

The “Disaster” !

The Recovery

Future

Page 4

BEFORE SERVER VIRTUALISATION

HOW THE ROOM LOOKED IN 2009

Page 5

SERVERS

Over 200 standalone

Virtualisation – 200 into 20 will go !

9 new Host Servers, holding 155 Virtual Servers

Power Savings

Space Savings

Resilience ??

Page 6

STORAGE

60TB of data (100,000 CDs)

10GB per staff

Resilience ??

Page 7

DATA BACKUP

Disk-to-Disk-to-Tape

40TB Disk capacity

Tape cartridges 1.6TB

48 Cartridge Tape Library

Secure Fireproof Safes
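
As a rough check on the figures above, the following sketch works out the total capacity of the tape library and how many cartridges one full copy of the 60TB quoted on the storage slide would occupy. It assumes decimal units (1TB = 1000GB) and treats 1.6TB as the usable capacity of each cartridge; the slides do not say whether that figure is native or compressed, so the result is indicative only. The function name is illustrative.

```python
# Figures taken from the slides; decimal units are an assumption (1 TB = 1000 GB).
DATA_GB = 60_000        # 60TB of data (storage slide)
CARTRIDGE_GB = 1_600    # 1.6TB per tape cartridge
LIBRARY_SLOTS = 48      # 48-cartridge tape library


def cartridges_needed(data_gb: int, cartridge_gb: int) -> int:
    """Smallest whole number of cartridges that can hold data_gb (ceiling division)."""
    return -(-data_gb // cartridge_gb)


library_capacity_gb = LIBRARY_SLOTS * CARTRIDGE_GB
full_copy_cartridges = cartridges_needed(DATA_GB, CARTRIDGE_GB)

print(f"Library capacity: {library_capacity_gb / 1000:.1f} TB")
print(f"Cartridges for one full copy: {full_copy_cartridges} of {LIBRARY_SLOTS}")
```

Working in whole gigabytes keeps the arithmetic in integers and avoids floating-point rounding on values such as 1.6.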

Page 8

ENVIRONMENT CONTROL

Power: UPS, Diesel Generator

Cooling

Humidity !!

Page 9

SECONDARY DATA CENTRE

Page 10

THE DISASTER

SUNDAY 28 NOVEMBER

Freezing Temperatures

Rooftop Air Handler

Water, Water, Everywhere !!

Page 11

WATER TRASHED OUR LOVELY SERVER ROOM !

Page 12

WATER TRASHED OUR LOVELY SERVER ROOM !

Page 13

WATER TRASHED OUR LOVELY SERVER ROOM !

Page 14

WATER TRASHED OUR LOVELY SERVER ROOM !

Backup Device survived!! But not the overnight tapes

Library Servers

Page 15

LET'S BUILD ANOTHER ONE..!

Page 16

LET'S BUILD ANOTHER ONE..!

Boxes x 300

Page 17

LET'S BUILD ANOTHER ONE..!

Production Line .. bit by bit .... Luverly !

Page 18

NOW TO RESTORE SERVICES !

University Gold Team (Chaired by the VC)
Business Continuity and Recovery
Prioritising Services
Tracking Progress
Communicating
Regular meetings, 29 Nov to 15 Dec

ISD Contingency Team
Recovery and Business Continuity
Mapping Service Dependencies
Managing Resources (people, procurement, time)
Directing operations
Dealing with Insurance Claim

Lots of staff involved
Everyone in the department had a part to play.

Page 19

NOW TO RESTORE SERVICES !

Scale of Operation
165 Servers destroyed
121 Live Services
Core Services – 39 (Telephone, Web Site, Email, VLE...)
Non Core Services – 82 (Tills, HR, Invoicing...)
20 Test & Development Environments

Process
Cleaning the room and salvaging equipment
Limiting further risk by removing the cause
Identifying what services were working (not working)
Recovering services by alternative means (where we could)
Procuring equipment prior to the rebuild
Building a new server infrastructure
Recovering services by priority
Keeping the Gold Team informed

Page 20

NOW TO RESTORE SERVICES !

Timeline

Page 21

WHAT NEXT ?

Options Paper DISAG

Independent Review Prof David Baker

Secondary Server Room

External Services?

Page 22

LESSONS LEARNT – MANAGEMENT PERSPECTIVE.

People
Successful recovery is based on staff goodwill, commitment and professionalism.
Having and maintaining good relationships with suppliers.
Having a strong recovery team with management, operational and administration experience.
Having the Gold team to agree priorities.
Everyone wants to help!

Communications
Having a contacts list to get hold of key staff and key suppliers.
People are patient and will wait for their systems if they understand the situation.
The value of having a staff and student portal (especially when you don’t have it!)
The value of Facebook to get messages out to staff and students.
Sharing personal emails and mobile phone numbers to ease communication.
Communicating ‘what is happening with the recovery process’ is important for your own department staff.
Tempering expectations by communicating the right message to the organisation and customers.

Page 23

LESSONS LEARNT – MANAGEMENT PERSPECTIVE.

Inventory
Keeping an itemised list of parts of equipment held in your Data Centre will allow you to replace equipment quickly.
Having a list of core services and their dependencies so that you can agree priorities for restoring.
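
One way such a list of services and dependencies could be turned into a restore order is a topological sort, sketched below with Python's standard `graphlib` module. The dependency map is entirely hypothetical: the service names Email, Web Site, VLE, HR and Invoicing appear on the slides, but "Directory" and every dependency shown are invented for illustration and do not describe the university's actual systems.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# HYPOTHETICAL map: each service lists the services that must be restored first.
depends_on = {
    "Directory": set(),              # invented infrastructure service
    "Email": {"Directory"},
    "Web Site": set(),
    "VLE": {"Directory", "Web Site"},
    "HR": {"Directory"},
    "Invoicing": {"HR"},
}
# Services treated as core in this sketch (assumed classification).
core = {"Directory", "Email", "Web Site", "VLE"}


def restore_order(depends_on, core):
    """Return a restore order that respects dependencies.

    Services are released in waves; within each wave, core services
    come before non-core ones, then alphabetically.
    """
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()  # raises graphlib.CycleError if dependencies are circular
    order = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready(), key=lambda s: (s not in core, s))
        order.extend(ready)
        sorter.done(*ready)
    return order


order = restore_order(depends_on, core)
print(order)
```

A circular dependency is reported as an error rather than silently ordered, which is useful when the list is being agreed before an incident rather than during one.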

Resilience
Don’t put all your eggs in one basket.
Do not keep your backup/restore device in the same building.
Never put equipment in front of a room cooling system which has a fan that is capable of blowing water across the room.
Never assume that, because there is no water in the data centre, water cannot find a way into the building.

Procurement
Having the ability to raise orders quickly.
Using existing framework agreements to reduce time for procurements and European competition.

Page 24

LESSONS LEARNT – MANAGEMENT PERSPECTIVE.

Operations
Keep a log of all decisions and actions taken.
If there is a risk, don’t delay in dealing with it.
Ensure that every system is backed up.
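
A log of decisions and actions can be as simple as an append-only list of timestamped entries. The sketch below is one possible shape for such a log, not the method used during the incident; the field names, the JSON-lines output and the two sample entries (including their times) are assumptions made for illustration.

```python
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
from typing import List, Optional


@dataclass(frozen=True)
class LogEntry:
    timestamp: str  # ISO 8601, UTC
    kind: str       # "decision" or "action"
    who: str
    what: str


def record(log: List[LogEntry], kind: str, who: str, what: str,
           when: Optional[datetime] = None) -> LogEntry:
    """Append one entry to the log; existing entries are never edited or removed."""
    if kind not in ("decision", "action"):
        raise ValueError(f"unknown kind: {kind!r}")
    when = when or datetime.now(timezone.utc)
    entry = LogEntry(when.isoformat(timespec="seconds"), kind, who, what)
    log.append(entry)
    return entry


def to_json_lines(log: List[LogEntry]) -> str:
    """Serialise the log as one JSON object per line, oldest first."""
    return "\n".join(json.dumps(asdict(e)) for e in log)


# Illustrative entries only; wording and times are invented, not from the incident.
log: List[LogEntry] = []
record(log, "decision", "Gold Team", "Restore core services before non-core",
       when=datetime(2010, 11, 29, 9, 0, tzinfo=timezone.utc))
record(log, "action", "ISD Contingency Team", "Salvaged equipment from server room",
       when=datetime(2010, 11, 29, 10, 30, tzinfo=timezone.utc))
print(to_json_lines(log))
```

Keeping the log outside the data centre itself, for example on paper or on an external service, follows the same reasoning as the resilience lessons on the previous slide.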

Page 25

THE FUTURE - HOW IT LOOKS TODAY.

Page 26

HOW IT LOOKS TODAY.

Page 27

HOW IT LOOKS TODAY.

Page 28

AN IT INFRASTRUCTURE INCIDENT

Any Questions?