University Information Technology Services
Hurricane Sandy After Action / Improvement Report
Prepared by Victor Font, UITS Business Continuity / Disaster Recovery Coordinator
November 2012


Contents
Executive Summary
Pre-Storm Planning and Preparation
    Volunteer Staff
    Loss of Data Center Contingency Plan
    Logistics
Execution
What We Did Well
Areas for Improvement (Raw Feedback)
    Emergency Preparedness
    Communications
    Logistics
    Facilities
Recommendations
Conclusion
Glossary


Executive Summary

This report details UITS’s planning, execution and response to the threats posed to UConn’s information technology systems by Hurricane Sandy’s arrival, covering Thursday, October 25, 2012 through Tuesday, October 30, 2012. It also includes raw feedback comments, items for improvement and action item assignments.

Pre-Storm Planning and Preparation

On Thursday, October 25th, the Department of Public Safety invited UITS to participate in a 3pm meeting to detail the planned opening of the Emergency Operations Center (EOC) and to direct UConn departments to prepare their own responses accordingly. In attendance from UITS were Debora Romano-Connors, Kathie Sorrentino, Jila Kazerounian and Victor Font. The key focus areas discussed were human and animal safety and communications.

David Martel, Interim Vice President of University Communications, dialed in and asked that UITS keep the following websites and supporting systems running for the duration of the emergency: RAVE, ListServ, Web (alert.uconn.edu, uconn.edu, UConn Today). Kathie noted that, in addition, Message of the Day (MOTD) and itstatus.uconn.edu would be available and used. Should the storm cause a data center outage, the University Communications contingency plan was to direct their audience to the UConn Facebook page and encourage the download of the MyUConn app, which would continue to receive alert messages.

On Friday, October 26th at 11:30am, UITS staff met to discuss emergency readiness and UConn’s expectations of UITS. In attendance were Victor Font, RC Teal, Debora Romano-Connors, Jason Pufahl, Kristy Hughes, Rush Bhatt, Mark Wiggins, Matt Smith, Richard Simon and John Wrynn. The meeting resulted in the following four actions.

First, an emergency command structure was established consisting of the following resources:

• Command Structure Team:
  o Point of Contact for EOC: Victor Font
  o Communications: Kathie Sorrentino
  o Help Center: Pat Meinweiser
  o NOC and Server Admin Support: RC Teal
  o Logistics: Kristy Hughes (Michelle Cahill, Denise Irmscher)

Second, resources were assigned to address Critical Emergency Systems Availability and Reliability:


• RAVE, ListServ, Web (alert.uconn.edu, uconn.edu, UConn Today, ITSTATUS), MOTD

  o (A) Services within MSB
  o (B) Web Services Sub-B of Library (Jon Rifkin)
  o (C) Amazon Site (Matt Smith, Jeff Farese, Steve Maresca)

• Encourage download/use Facebook and MyUConn (Apple/Android App Stores) as backup communications site/app

• Technologies & Technicians:
  o RAVE: Rush Bhatt (Craig Burdick)
  o Web Related Pages & Refresh Rate: Jon Rifkin (Mitch Saba)
  o ListServ: Mark Wiggins (Paul Parciak)
  o Radius: Matt Smith (Steve Maresca)
  o Active Directory: Rick Simon (John Wrynn)
  o Message of the Day (MOTD) & ITSTATUS: Pat Meinweiser (Marybeth Bardot)
  o Network: Jack Babbitt (Jeff Farese)
  o Exchange: Josh Boggis (Rick Simon)

Third, Kristy Hughes updated the UITS emergency contact list. And fourth, Kathie Sorrentino asked for an increased purchasing limit for the ProCard. The ProCard purchasing limit was raised to $25,000. The card was kept in Kathie’s office accessible only to her and RC Teal.

Volunteer Staff

UConn depends heavily on UITS to keep systems running at full capacity during emergency situations. Yet UITS is not classified as an essential service, so all UITS staff are classified as non-essential personnel. Anticipating travel difficulties, we asked for volunteers to remain on-site for the duration of the storm while the EOC was in operation. Room and board would be provided. The following individuals volunteered for duty:

• Rush Bhatt
• Russell Jancewicz
• Elena Sevilla
• Angelo Fazzina

In addition, Matt Smith committed to continue to monitor the situation from home and direct the technical team as needed.

Loss of Data Center Contingency Plan


In the event that the storm caused the UITS data center to fall off-line, the technical team decided the best course of action to support University Communications was to replicate alert.uconn.edu and uconn.edu in the Amazon cloud. In the given time frame, it was not feasible to move UConn Today to the cloud. Matt Smith organized the technical team. Jeff Farese made changes to the domain name servers (DNS) and the technical team began porting the websites to the cloud. The changes to DNS were required so that the website URLs would remain unchanged: all messages coming from University Communications referenced the current web addresses and would continue to do so in the event UITS had to swap to the cloud services. The technical changes made on Friday included:

• Removed several aliases from InfoBlox “Host” records, turning them into distinct “CNAME” records with a 1-minute time to live (TTL)

• Set up a Red Hat/Apache server in the Amazon Cloud, ready for content.
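The effect of the short TTL can be illustrated with a minimal sketch. It is a toy model of a caching resolver, not the InfoBlox configuration itself; the target host names `web.campus.example` and `web.cloud.example` and the helper names are hypothetical and chosen only for illustration. It shows that once an alias is repointed, a resolver that cached the old answer keeps serving it for at most the TTL (here 60 seconds).

```python
from dataclasses import dataclass


@dataclass
class CnameRecord:
    alias: str
    target: str
    ttl_seconds: int


class CachingResolver:
    """Toy resolver: a cached answer is reused until its TTL expires."""

    def __init__(self, zone):
        self.zone = zone    # authoritative data: alias -> CnameRecord
        self.cache = {}     # alias -> (target, expires_at)

    def resolve(self, alias, now):
        cached = self.cache.get(alias)
        if cached is not None and now < cached[1]:
            return cached[0]                      # still within TTL
        record = self.zone[alias]                 # ask the authoritative zone
        self.cache[alias] = (record.target, now + record.ttl_seconds)
        return record.target


def repoint(zone, alias, new_target):
    """Point an existing alias at a new target, keeping its TTL."""
    old = zone[alias]
    zone[alias] = CnameRecord(alias, new_target, old.ttl_seconds)


# Hypothetical targets; only the alias name comes from the report.
zone = {"alert.uconn.edu": CnameRecord("alert.uconn.edu", "web.campus.example", 60)}
resolver = CachingResolver(zone)

first = resolver.resolve("alert.uconn.edu", now=0)    # answer cached until t=60
repoint(zone, "alert.uconn.edu", "web.cloud.example")  # simulated swap to the cloud
during = resolver.resolve("alert.uconn.edu", now=30)  # old answer, cache not expired
after = resolver.resolve("alert.uconn.edu", now=60)   # TTL expired, new answer
```

With a longer TTL, the window in which visitors are still sent to the old location grows accordingly, which is why the records were given short lifetimes before the storm.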

The plan was to create virtual hosts for the sites/names and test on Monday. Based on weather reports, it was announced that the EOC would commence operations at 4am Tuesday. The next EOC preparedness meeting was scheduled for 10am Monday, and the next UITS meeting for 12 noon Monday. We planned to decide at that time when the on-site staff would be required, if at all.

Logistics

Kristy Hughes contacted the Nathan Hale Inn to arrange rooms for the staff volunteering to remain on site. There were two rooms available for Monday and eight available for Tuesday. By the time the communications circulated, there was only one room available, which she booked for Victor Font, EOC contact, for Monday and Tuesday. The UITS plan also included making the following requests at the 10am Monday EOC preparation follow-up meeting:

• Contact dining services to arrange for food for on-site UITS staff
• Obtain sleeping quarters from Student Affairs for UITS on-site staff
• Contact facilities to assure storm drains remain clear to prevent flooding in the data center
• Arrange for emergency pumps for data center and roof in case of flooding
• Obtain transportation team contact number in case we need emergency transportation
• Logistics: Count supply of bottled water, flashlights or lanterns with spare batteries, blankets, etc. Arrange to have these items delivered to the data center if needed.


Execution

Due to a rapidly changing storm track, the EOC commenced limited operational monitoring at 4pm Sunday. Victor Font arrived on-site at 5am Monday morning and checked in at the EOC at 7am. Upon arrival at UITS, Byron Herrera informed Victor that he had already gone outside and cleared the storm drains of debris to protect the UITS area from flooding.

The 10am EOC meeting took place as scheduled. Jason Pufahl, RC Teal, and Victor attended. Kathie reported that a downed tree on her block prevented her departure from home. When the tree was cleared, she reported she was on the way to work. Jason spoke to her and suggested she stay home; between RC, Jason, and Victor there was enough coverage to handle any emergency situation. Kathie continued to manage some communications from home.

At the 10am EOC meeting, the Provost requested that all volunteer staff report to work as soon as possible. Matt’s team continued setup operations for the Amazon cloud. Rush Bhatt, Yi Zhang, and Ruben Mercado were working with Angelo to get the Rave feed automated. By the time Rush was able to leave his home for UConn, the weather had turned quite bad and driving under the prevailing conditions was unwise. We asked Rush to remain available at home, his safety being the highest priority. Jon Rifkin, Russ, Angelo, and Elena were on-site.

Jason and RC returned to UITS and Victor remained at the EOC. Victor arranged for sleeping quarters and meals for the UITS staff. At 1pm Jason relieved Victor at the EOC so that Victor could check into the hotel and rest for a few hours before returning to the EOC for the 4pm update meeting. Jason followed up with Logan Trimble to get the details of the sleeping quarters. The following was communicated to the staff by RC Teal at 1:17pm:

• Everyone will stay in the Veterans House on 195 right beside the Whitney Dining hall. Your ID card will be activated for entry.

• Whitney dining hall is open tonight till 9:30pm. All you have to do is give your name and department to eat.

• The expected schedule on-site in MSB for the next 24 hours or so: the work schedule is flexible; we will call in case of an emergency need.

• Process for transportation to/from Nathan Hale and location(s) just off campus (vans and buses were discussed in Friday's meeting); to get transportation call 6-6924 30 minutes prior to when you want to go to the house. It would be best if you went together.

After confirming these arrangements and distributing them to the staff, Logan Trimble informed us that we had been bumped from the Veterans House. Alternate arrangements were made for Angelo to stay in a residence hall. Russ and Elena went home. In addition, we made arrangements for Craig Burdick, University Communications, to stay in a residence hall. All dining halls were available for the staff to eat.

After the 4pm EOC meeting, Jason went home. RC and Victor stayed in the EOC, which became very busy as the evening progressed. Facilities continued to monitor and clean the MSB storm drains as needed. They also had several large pumps and an ample supply of flashlights available should we have needed them. Victor remained in the EOC until 8:30pm. RC stayed until midnight and returned to the data center, where he slept. Victor returned to the EOC at 7am and RC returned around 8am.

At the 10am Tuesday EOC update meeting, UITS reported no adverse incidents and all systems ran smoothly. The EOC was deactivated at noon on Tuesday.


What We Did Well

• At the closing EOC briefing on Tuesday, David Martel, University Communications, made mention of the great collaboration between UITS and other UConn departments.
• UITS provided EOC coverage for the duration of the storm except for a very brief period late Monday night as the storm’s intensity waned.
• The operations staff reported to work as scheduled throughout the operational period.
• UITS systems kept running without any adverse events.
• The technical staff did a great job conceiving and building a contingency plan should we have lost the data center.
• UITS does very well in an emergency situation (reactionary) … the adrenaline kicks in and the needs are met.
• The ProCard limit was raised so that emergency purchases could be made.


Areas for Improvement (Raw Feedback)

Emergency Preparedness

Last minute rush to ‘stand up’ technologies. UITS has gone through several ‘power related’ outages and has been intimately involved for 4+ years with the technologies supporting emergency situations. Yet off-site web service instances were not in place and/or were not active or documented, and personnel (manager, team lead, team members) were not knowledgeable enough to speak to the current state. It is great that techs were able to stand something up in a 24 hr period … kudos … however, if this could be stood up in this period of time, why not earlier? What keeps staff from identifying and following up with priority work activities, especially those deemed essential in case of emergency?

We need to look for ways to automate the changeover required to implement off-site backup websites. (Action Item: schedule meeting to plan off-site implementations. Required Attendees: Rush, Jon Rifkin, Yi Zhang, Jeff, Russ, Joan Marquis. Assigned to: Victor)

Lack of documentation regarding the services and support that are part of the UConn emergency response package (Service Level Agreement) for which UITS is responsible, which currently includes: ListServ, RAVE/Alert, telephone message of the day, and now Amazon.

The failover alert.uconn.edu web presence is housed in the sub-basement of the library; perhaps Amazon replaces this? Need to work with a provider off the main Storrs campus. (This is in planning and we have other backups available in the Amazon Cloud.)

Ensure that our personnel contact information is maintained and updated regularly. Identify person responsible and schedule as routine activity. (Kathie and Kristy will follow-up on this. Also, Kathie will follow-up with HR systems for single source of truth.)

Ensure contact lists are maintained off-site and available (hard copy, electronic…).

Activate Command Center within UITS, set schedule to meet for updates and who should attend and bring what to the table.

Planning ahead is good but was initiated way too late. I believe there was a late afternoon meeting on Friday to plan for the needs of a Monday storm. I believe leadership knew a storm was coming way before Friday. (We made a mistake and will improve this for the future. This point may become moot when we have automated failover in place for the identified critical systems.)
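As a starting point for the automated failover discussion, the decision rule could look something like the following minimal sketch. It assumes that some monitor periodically records whether the primary site answered a health check; the function name, the threshold of three consecutive failures, and the host names are hypothetical and would need to be agreed on in the planning meeting.

```python
def choose_target(check_results, primary, backup, failures_needed=3):
    """Pick which site should serve traffic.

    check_results is a list of booleans, oldest first, where True means
    the primary site answered its health check. The backup is chosen only
    when the most recent `failures_needed` checks all failed, so a single
    missed check does not trigger a changeover.
    """
    recent = check_results[-failures_needed:]
    if len(recent) == failures_needed and not any(recent):
        return backup
    return primary


# Hypothetical host names, for illustration only.
PRIMARY = "web.campus.example"
BACKUP = "web.cloud.example"

one_blip = choose_target([True, True, False], PRIMARY, BACKUP)
sustained = choose_target([True, False, False, False], PRIMARY, BACKUP)
too_few = choose_target([False, False], PRIMARY, BACKUP)
```

Requiring several consecutive failures is a deliberate choice in this sketch: it trades a slightly slower changeover for fewer false alarms during brief network interruptions.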


Communications

Create a very clear communication plan. Identify roles and responsibilities, backup personnel. (Victor and Kathie will work together to develop this plan.)

People were informed there would be a single source with EOC (Victor) and with UITS (Kathie). Communications would flow in this manner. Changes took place mid-stream: Kathie was told she need not come in, then many began communicating via emails, desk visits, planned updates, etc. (Jason, RC, Victor, Matt, Jila, Kathie). Team Lead and Manager were directing folks from home rather than permitting direction to come from the sources identified. Too many people; need to clearly define roles, backup, schedules, who will communicate, how and when.

Once requirements were defined they were not static and changed throughout the storm. (The nature of any emergency is lack of stasis. We must remain flexible and open to adjustment as situations change.)

The university was closed Tuesday, but leadership had not made any attempt to notify us whether we were still needed or our services had ended. I had to send an email asking the day’s agenda, or what the course of action for the day would be. In my opinion it seems the EOC knew exactly when they were considering the storm ended; it would have been nice if they had thought to tell the support staff when support was no longer needed and they could be released to go home.

Leadership should not make statements based on assumptions, but only facts. This course of action caused leadership to appear unorganized, and have no faith in their words.

Logistics

Communication, planning, and leadership with regards to employee on-site requirements were not defined well. The belief was leadership was looking for volunteers to be on-call and on-site, not that they were looking for employees to volunteer their non-work time to “help out.”

Notice was sent Friday afternoon by leadership saying “...we have hotel rooms for Monday and Tuesday.” I took this to mean rooms were already acquired for the 3 volunteers in my department. Once the storm time table moved quicker we asked about rooms for Sunday. A response was given that rooms for Monday and Tuesday were available but Sunday was ignored; we were told the Governor said not to travel though...

The next day (Monday) we found out last Friday’s notice was not true: the hotel was full, only one room was acquired, and rumor was it was given to a consultant. I feel this is an issue because it was not given to one of the 3 “technical people” doing the actual work they were asked to volunteer to do but to someone else who could not accomplish the tasks of the “technical people,” and could have likely fulfilled their role from another location....?

The alternatives given by leadership were: attempt to drive home, a cot to sleep on, or look for other accommodations. I expected “other accommodations” to be parallel with the original promise, not an apartment without electricity, heat, or hot water. Not to mention I'm walking in the dark in an unfamiliar place, trying not to injure myself, and had to ask for a flashlight.

In order to fulfill leadership’s needs, the staff has minimum needs that need to be planned for. If leadership wanted employees on-site then volunteers' meals should have been thought about as part of the planning. If they were, when was notice sent out indicating so? I don't care if meals are provided or not; I wanted to know ahead of time if they would be or not, so I can plan accordingly, so I can do my job to the best of my ability.

Facilities

Get the Help Center on backup generator to maintain services, communications, etc. (We need to identify the essential staff functions and make certain they are up and running when needed. RC is taking the lead on this planning.)


Recommendations

Maintaining communications with key staff and providing the ability to connect to the internet is paramount during emergency operational periods. It is recommended that UITS obtain 4G/LTE devices and deploy them to designated essential staff and emergency response volunteers. It is also recommended that UITS have several high-reliability cell phones available for distribution. (Kathie will research both of these recommendations.)

Conclusion

While UITS rose to the occasion and received favorable outward recognition for a job well done, internal communications and planning broke down and served as a cause of frustration for the staff. Developing written planning documents, clearly defining roles and responsibilities, testing the plans, and communicating with the staff as frequently and effectively as we do with external customers will help to alleviate the frustration.


Glossary

CNAME: Canonical Name record, a type of resource record in the Domain Name System (DNS) that specifies that the domain name is an alias of another, canonical domain name. “Canonical” usually means: a more generally accepted or standard name.

DNS: Domain Name System, a hierarchical distributed naming system for computers, services, or any resource connected to the Internet or a private network. It associates various details with domain names assigned to each of the participating entities.

EOC: Emergency Operations Center, located in the second-floor training room of the Public Safety building

MOTD: Message of the day

MSB: Math Science Building

NOC: Network Operations Center

Operational Period: The period during which the EOC is activated.

TTL: Time to live, a mechanism that limits the lifespan or lifetime of data in a computer or network. TTL may be implemented as a counter or timestamp attached to or embedded in the data. Once the prescribed event count or timespan has elapsed, data is discarded.

UITS: University Information Technology Services

URL: Uniform resource locator, also known as a web site address