White paper data center critical infrastructure risk and vulnerabilities

An Information Technology Wake-up Call

Disaster Recovery Planning

Impact to Capital Markets Technology & Data Center Critical Infrastructure

TECHNICAL WHITE PAPER

Assess and Mitigate Risk and Vulnerabilities to Business Continuity and Disaster Recovery in the New York Metropolitan Area

Vincent Pelly

Scott Haglund

Sophie Pascal, Contributing Editor



Table of Contents

Executive Summary
What happens to Business when the lights go out?
Intended Audience and Structure
Keeping Business in Business
The unanticipated hidden risks
Lessons Learned
Review of past events are key to effective Disaster Recovery
Risks to the Critical Infrastructure
Climate Conditions and Patterns
Seismic Activity and Risk
Electrical Distribution and the Power Grid
Data Center Reliability Classification
Best Practices
For Infrastructure Design
High Availability & Disaster Recovery
RTO and RPO
Expectations for Continuous Availability
Virtualization
Replication and Network Bandwidth
Database Replication
Types of Backup Recovery and Replication Architectures
Disaster Recovery Site Selection
Summary and Recommendations
Best Practices
Business Continuity Management Framework
Elements of Business Recovery Planning
FEMA Flood Maps
Appendix A
FEMA Flood Hazard Mapping - HIGH
FEMA Flood Hazard Mapping - LOW
Appendix B
Natural Disaster Risk Profiles for Data Centers
Appendix C
East Coast Liquidity Venues
Works Cited & References
About Citihub
About the Authors


Executive Summary

What happens to Business when the lights go out?

In the aftermath of Hurricane Sandy, significant flooding to coastal areas caused a majority of

the Northeastern United States to be left without commercial electricity. Many businesses lost

power because their buildings were located in zones that were flooded with seawater and

because the main electrical panels were located below the rising water level. Generators that

supported data centers could not be supplied with fuel because the pumps were located in flooded

basements. Firms that had not pre-purchased fuel or secured delivery contracts for their backup

generators were unable to operate their data centers beyond fuel storage capacity, and firms

that did pre-purchase fuel could not receive deliveries due to flooded roadways. Employees

were unable to access their offices, critical staff members were unable to travel to offsite

recovery locations because government mandates forbade access to roadways for non-

essential personnel, and customers were unable to complete online transactions. The overall

impact of Hurricane Sandy was estimated at between $30 billion and $50 billion [1].

The numerous failures of mission-critical IT infrastructure brought immediate attention to some

very important design flaws in today’s Recovery plans and processes. These flaws show

that data center facilities are vulnerable, leaving Business exposed to outages it cannot afford.

The objective of this white paper is to provide senior executives with an overview of Disaster

Recovery preparedness as well as the potential risks and vulnerabilities that exist in critical

infrastructure, specifically in the New York metropolitan area. It will also help senior executives to

become aware of critical details that may not be covered in their current Disaster Recovery plans.

We at Citihub believe in the importance of having an end-to-end Business Continuity solution

that includes not only a tested and validated data center and infrastructure design, but also the

ability to provide staff with remote access to the key applications needed to continue operations.

The recommendations listed in this white paper outline high-level frameworks designed for

addressing business systems redundancy. They also demonstrate how to significantly reduce

data loss by using various design principles and best practices to obtain the best Disaster

Recovery system to support Business requirements.

Although the target industry is financial services, this paper can serve as a primary reference for

building the appropriate Disaster Recovery solution for any company, regardless of industry or

geography.

Finally, this paper will offer a long-term business case for addressing critical vulnerabilities as

well as factors that senior executives should take into consideration when setting priorities

regarding critical infrastructure. This will ensure Business Continuity and prevent loss of

revenue in the event of another major outage.

[1] http://online.wsj.com/article/SB10001424052970204712904578092663774022062.html?mod=googlenews_wsj


Intended Audience and Structure

This white paper is intended to help senior management and senior-level executives of financial

services institutions navigate the Business Continuity and Disaster Recovery landscape. It

outlines successful implementation strategies and best practices, and assumes that readers

have basic knowledge of networks and infrastructure, as well as awareness of the geographical

specificity of their businesses.

Citihub will examine how site selection, power, cooling, and inadequacies within the system

recovery architecture can contribute to the data center’s risk of downtime. The analysis will

explore specific data center infrastructure vulnerabilities, and suggest recommendations and

best practices that identify and remediate gaps within the infrastructure to minimize downtime

and achieve the highest possible return on investment.


Keeping Business in Business

The unanticipated hidden risks

The technological ecosystem supporting financial markets relies heavily on centralized data

centers, infrastructure and communication networks as the core processing engines of capital

markets. Uninterrupted service is critical to the daily operations of the financial services

industry, serving e-commerce, market data and pricing, matching engines, settlements and

other critical systems, transactions and data that enable sell-side and buy-side firms to maintain

worldwide market liquidity.

Firms are at risk when disruption to the IT infrastructure occurs; systems are down and

information is unavailable, adversely impacting business operations. Financial market participants, including

retail banks and institutional securities firms, require reliable and consistent operations to

support front and back office systems, particularly settlement and clearing firms that process

open transactions and communications with customers, counterparties and third parties.

Disruptions to daily operations can prevent financial institutions from managing liquidity,

which can increase financial risk to their organizations.

These are some of the business and technical drivers behind the design and implementation of

robust Disaster Recovery plans that should be given priority when selecting proper

backup sites and developing sound Recovery management processes.

Examples of system outages that should be considered when designing business and system resiliency plans:

Isolated failures caused by software, hardware errors or recent system upgrades that were not fully tested

External outages to telecommunications and electrical feeds caused by inadvertent damage to primary lines

Loss of critical infrastructure and mechanical and electrical systems, as well as failure of backup systems to provide continuous operation

Wide-spread outages caused by natural disasters and catastrophic events

Immediate threats and consequences of not having a Disaster Recovery plan:

Loss of revenue and of customer confidence, and damage to the corporate brand and reputation can arise from the inability of clients to access systems and account information or execute transactions

Cost to restore operations to normal state; without proper planning and Disaster Recovery management, this can be expensive

Potential fines or fees can be imposed for non-compliance when inadequate resiliency plans lead to extended outages [2]

[2] Dodd-Frank H.R. 4173 – 316 “(ii) establish and maintain emergency procedures, backup facilities, and a plan for disaster recovery”


Lessons Learned

Review of past events is key to effective Disaster Recovery

Business today has not fully internalized the significant findings of the interagency paper quoted below, published almost ten

years ago.

During the past 12 years the East coast of the United States, in particular the Northeast and the

New York metropolitan area, has experienced several widespread power outages related to

extreme weather conditions that have greatly impacted technology infrastructure. These events

confirm that our IT critical infrastructure is vulnerable to regional disruption (power outages,

climate change and natural disasters), as demonstrated by the increase in wide-scale and

regional disruptions over the past decade.

In response to these events, IT executives have planned accordingly by revising Business

Continuity plans and introducing alternative backup sites, such as tertiary sites in geographical

regions that are outside the location of the primary corporate site. Within the financial services

community, senior industry leaders along with the Federal Reserve Board, OCC and SEC

issued in 2005 an interagency white paper [3] that described best practices to strengthen the

resiliency of U.S. financial services post 9/11. The paper stressed the critical importance of

protecting the financial system from new risks associated with widespread outages by focusing

on the following high-level Business Continuity objectives:

Rapid recovery and timely resumption of critical operations

Key staff to resume critical operations in one major operating location

Comprehensive testing that demonstrates effective internal and external continuity arrangements

[3] Interagency Paper on Sound Practices to Strengthen the Resilience of the U.S. Financial System, September 2005, www.sec.gov/rules/concept/34-46432.htm

“Firms that play significant roles in critical financial markets should maintain sufficient

geographically dispersed resources, including staff, equipment and data to recover clearing

and settlement activities within the business day on which a disruption occurs. Firms may

consider the costs and benefits of a variety of approaches that ensure rapid recovery from a

wide-scale disruption. However, if a backup site relies largely on staff from the primary site, it

is critical for the firm to determine how staffing needs at the backup site would be met if a

disruption results in loss or inaccessibility of staff at the primary site.”

- Federal Reserve White Paper on the Resiliency of the U.S. Financial System, 2005


The results of the Federal interagency white paper, as well as the analyses and discussions

held with financial industry technology experts and practitioners, show that sound practices

based on the above key points have resulted in the development and implementation of best

practices regarding Business Continuity. It is understood industry-wide that many firms at the

time did not embrace the urgency of the report, mostly because of cost considerations. Today its

recommendations can no longer be ignored.

On the strength of that interagency paper and in reviewing past and recent events, it is

imperative that these key points be taken into consideration when designing and building the

Disaster Recovery architecture:

Perform a top-down assessment of critical business activities, mapped to supporting IT systems and key staff members

Prioritize the systems to recover first and assign the required support staff for a potentially limited capability in recovery mode

Establish a crisis management team that will coordinate activities and make prioritization calls on the ground. Critical time is often lost in the decision-making process to invoke a Disaster Recovery plan.

Maintain a solid Recovery plan around established backup site(s), separate from the core processing location, for data centers and all key business staff.

Periodically test backup systems and network connectivity, and perform application role swaps on a scheduled basis to ensure Recovery plans function properly.

Comprehensive Disaster Recovery testing should be end-to-end and involve telecommunication

firms, third-party service providers and securities exchanges, as well as vetting of the business

process and the proper activation sequence for application systems. It should also serve to

familiarize business users with operational procedures in unusual situations.


Risks to the Critical Infrastructure

Climate Conditions and Patterns

“NOAA estimated approximately $1 billion in damage that occurred in 2011 from 12-14 major events” [4]

- NOAA 2012

A significant concern when reviewing an organization’s primary and recovery site is the

geographic vulnerability to severe weather. Tools and resources available from FEMA and the

National Oceanic and Atmospheric Administration [5], together with historical weather patterns, can provide

data on locations that have suffered consistent damage due to severe weather.

Below is a summary of the NOAA 2011 and 2012 National Events Map for the U.S.

[Figure: Significant U.S. Weather and Climate Events, 2011 and 2012]

The summary of the Uptime Institute Natural Disaster Risk Profiles [6], located in

Appendix B, outlines the risks that severe weather poses to data center sites in different

geographic locations. A data center in or near the storm path should expect

disruption, as well as minor to severe infrastructure damage, when subject to the following

natural disasters:

Tornado

Hurricane

Earthquake

Ice Storm

Blizzard

Thunderstorm

Lightning

Flood

For detailed FEMA flood maps of the New York metro area, please refer to Appendix A, which also

lists the primary locations of critical data centers serving financial services.

[4] http://www.noaa.gov/extreme2011/index.html

[5] NOAA National Climatic Data Center, State of the Climate: National Overview for Annual 2012, published online December 2012 from http://www.ncdc.noaa.gov/sotc/national/2012/13.

[6] Uptime Institute, Natural Disaster Risk Profiles for Data Centers, http://uptimeinstitute.com/publications


Seismic Activity and Risk

Historically, earthquakes and seismic activity are a rare occurrence in the New York

metropolitan area, with the exception of the 2011 Virginia Earthquake [7] that produced tremors

throughout the New York area. Although no damage or outages occurred during the 2011 event,

it’s a best practice to evaluate seismic activity when selecting a primary and recovery site.

The following maps summarize the historical impact, magnitude and spread of seismic

activity for the U.S. and the New York area [8].

U.S. & New York Area Seismic Hazard Map

Source: USGS

[7] http://en.wikipedia.org/wiki/2011_Virginia_earthquake

[8] United States Geological Survey, http://earthquake.usgs.gov/earthquakes/states/new_york/hazards.php


Electrical Distribution and the Power Grid

When planning for alternative Disaster Recovery backup locations, as well as performing a risk

and vulnerabilities assessment on the primary site, another key area of concern relates to the

location of the power utilities and the major interconnections of the power grid. This type of

assessment becomes critical when planning for 2N [9] redundancy for primary and secondary

locations. In order to lower the risk of localized power outages, a full disclosure of the locations

of power stations, substations and feeds to the facility, as well as the redundancy within the

feeds, is necessary to determine where electrical power gaps may exist.

[9] For the data center utility feed, a 2N system provides double the capacity needed, on feeds that run separately with no single point of failure.

U.S. Electrical Grid and Power Plants

“The U.S. electric grid is a complex network of independently owned and operated

power plants and transmission lines.”

- NPR, Visualizing The U.S. Electric Grid

Source: NPR


Data Center Components

The critical infrastructure within a data center comprises a number of systems that control and run the

electrical and mechanical components necessary for successful operation. Many of these

systems are tied into the Building Management Systems (BMS), and others are directly linked to

IT monitoring systems. Within the past two years, the industry has taken the stance that both

BMS and IT critical systems should be managed and monitored by a single system and reported

via a holistic dashboard. These systems are part of the building envelope and each contains a

set of core delivery mechanisms and risk profiles.

During Hurricane Sandy, many of the critical systems, specifically the electrical and mechanical

(M&E) systems, were severely damaged because storm surge water entered the basements

and took down main electrical panels, water and fuel pumps, etc. Many data centers with

generator fuel pumps located in basements had difficulty starting up backup generators, and in

some cases fuel had to be manually delivered to generators located on higher floors (via the

bucket brigade).

In addition to the M&E systems, other common infrastructure dependencies required to

maintain operations during a recovery period are generally related to the operations of the

telecommunications infrastructure. During a widespread outage it is critical that the

telecommunications infrastructure remain intact across the United States. Firms can mitigate

this risk by implementing resiliency through the use of circuit diversity and routing when

establishing geographically dispersed facilities.

Source: Citihub


Data Center Reliability Classification

Several data center industry experts have defined reliability classifications for the data center infrastructure. The term reliability refers to a

variety of attributes, including availability, durability and quality, that describe how the data center has been engineered. The following five

performance-based classes have been defined to classify the reliability of the data center, based on the Building Industry Consulting Service

International (BICSI) standard for IT systems [10].

Class      Description
Class F0   Single Path without Alternate Power Source
Class F1   Single Path
Class F2   Single Path with Redundant Components
Class F3   Concurrently Maintainable
Class F4   Fault Tolerant

Class F0 supports the basic environmental and energy requirements of the IT functions without supplementary equipment

Capital cost avoidance is the major driver

There is a high risk of downtime due to planned and unplanned events

Maintenance of Class F0 facilities is performed during non-scheduled hours, and downtime of several hours or even days has minimal impact on the mission

A critical power distribution system separate from the general-use power systems would not exist

No back-up generator system

The system might deploy power conditioning or surge protective devices to allow the specific equipment to function adequately (utility grade power does not meet the basic requirements of critical equipment)

No redundancy for power or air conditioning

Class F1 supports the basic environmental and energy requirements of the IT functions

There is a high risk of downtime due to planned and unplanned events

Maintenance of Class F1 facilities can be performed during non-scheduled hours, and the impact of downtime is relatively low

The critical power distribution system would deploy a power conditioning device to allow the critical equipment to function adequately (utility grade power does not meet the basic requirements of critical equipment)

No redundancy of any kind would be used for power or air conditioning for a similar reason

Class F2 provides a level of reliability higher than Class F1 to reduce the risk of downtime due to component failure

In Class F2 facilities there is a moderate risk of downtime due to planned and unplanned events

Maintenance activities can typically be performed during unscheduled hours

The critical power system would need redundancy in those parts of the electrical distribution system that are most likely to fail

These would include any products that have a high parts count or moving parts, such as UPS, controls, air conditioning, generators or ATS

In addition, it may be appropriate to specify premium quality devices that provide longer life or better reliability

Class F3 provides additional reliability and maintainability to reduce the risk of downtime due to natural disasters, human-driven disasters, planned maintenance, and repair activities

Maintenance and repair activities will typically need to be performed during full production time with no opportunity for curtailed operations

The critical power system in a Class F3 facility must provide for reliable, continuous power even when major components (or, where necessary, major subsystems) are out of service for repair or maintenance

To protect against unplanned downtime, the power system must be able to sustain operations while a dependent component or subsystem is out of service

Class F4 eliminates downtime through the application of all tactics to provide continuous operation regardless of planned or unplanned activities

All recognizable single points of failure from the point of connection to the utility to the point of connection to the critical loads are eliminated

Systems are typically automated to reduce the chances for human error and are staffed 24×7

Rigorous training is provided for the staff to handle any contingency

Compartmentalization and fault tolerance are prime requirements for a Class F4 facility

The critical power system in a Class F4 facility must provide for reliable, continuous power even when major components (or, where necessary, major subsystems) are out of service for repair or maintenance

To protect against unplanned downtime, the power system must be able to sustain operations while a dependent component or subsystem is out of service

[10] BICSI Standards for Data Centers, https://www.bicsi.org/default.aspx


Best Practices For Infrastructure Design

High Availability & Disaster Recovery

High Availability and Disaster Recovery are both concepts related to Business Continuity. But

whereas Business Continuity applies to the whole business (including IT), HA & DR typically are

more related to IT Continuity, as part of overall Business Continuity. High Availability solutions

mainly address outages at a single site, while Disaster Recovery solutions mainly address sudden,

site-wide disasters. High Availability and Disaster Recovery objectives and metrics are different.

A highly available site provides resiliency from errors of the underlying platform and single points

of failure. Availability encompasses reliability, recovery, and failure. One of the most common

measures of availability is the percentage of time that a given system is active and working. The

following table correlates the percentage of availability to calendar time equivalents.

Acceptable Uptime   Downtime per day   Downtime per month   Downtime per year
99%                 14.40 minutes      7 hours              3.65 days
99.9%               86.40 seconds      43 minutes           8.77 hours
99.99%              8.64 seconds       4 minutes            52.60 minutes
99.999%             0.86 seconds       26 seconds           5.26 minutes
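The figures in this table follow from simple arithmetic: the downtime budget is the unavailable fraction multiplied by the length of the period. The Python sketch below reproduces that calculation. It assumes an average year of 365.25 days and a month of one twelfth of a year; these period lengths, like the function and constant names, are illustrative assumptions rather than values stated in this paper.

```python
SECONDS_PER_DAY = 24 * 60 * 60
DAYS_PER_YEAR = 365.25               # assumption: average year length
DAYS_PER_MONTH = DAYS_PER_YEAR / 12  # assumption: average month length


def allowed_downtime_seconds(uptime_percent: float, period_days: float) -> float:
    """Return the downtime budget, in seconds, for a period of the given length."""
    if not 0 <= uptime_percent <= 100:
        raise ValueError("uptime_percent must be between 0 and 100")
    unavailable_fraction = (100.0 - uptime_percent) / 100.0
    return unavailable_fraction * period_days * SECONDS_PER_DAY


if __name__ == "__main__":
    # Print the downtime budget per day, month and year for each availability level.
    for pct in (99.0, 99.9, 99.99, 99.999):
        per_day = allowed_downtime_seconds(pct, 1)
        per_month = allowed_downtime_seconds(pct, DAYS_PER_MONTH)
        per_year = allowed_downtime_seconds(pct, DAYS_PER_YEAR)
        print(f"{pct}%: {per_day:.2f} s/day, "
              f"{per_month / 60:.1f} min/month, {per_year / 3600:.2f} h/year")
```

Expressing every result in seconds first and converting only for display keeps the arithmetic in one unit, which makes it easy to compare a measured outage against the budget.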

RTO and RPO

The Recovery Time Objective (RTO) is the target elapsed time from service interruption until service is restored. It answers the

question: "How long can you be without service?" RTO represents a time limit that cannot be

exceeded without facing severe consequences. A unified High Availability and Disaster Recovery

approach would establish both an uptime objective and an RTO for each service.

The Recovery Point Objective (RPO), on the other hand, is the point in time represented by the data upon service resumption. It

answers the question: "How old can the data be?"
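As a minimal sketch of how these two objectives might be checked after a recovery test, the Python below compares the measured outage against an RTO and the age of the restored data against an RPO. The `RecoveryResult` type, its field names and the sample timestamps are hypothetical and are not taken from this paper.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class RecoveryResult:
    """Timestamps recorded during a recovery test (hypothetical structure)."""
    outage_start: datetime         # when service was interrupted
    service_restored: datetime     # when service became available again
    last_recovered_data: datetime  # timestamp of the newest data present after recovery


def meets_rto(result: RecoveryResult, rto: timedelta) -> bool:
    """RTO check: was service restored within the allowed time?"""
    return result.service_restored - result.outage_start <= rto


def meets_rpo(result: RecoveryResult, rpo: timedelta) -> bool:
    """RPO check: is the recovered data recent enough relative to the interruption?"""
    return result.outage_start - result.last_recovered_data <= rpo


if __name__ == "__main__":
    result = RecoveryResult(
        outage_start=datetime(2013, 1, 15, 9, 0),
        service_restored=datetime(2013, 1, 15, 10, 30),
        last_recovered_data=datetime(2013, 1, 15, 8, 55),
    )
    # A 90-minute outage fits within a 2-hour RTO.
    print(meets_rto(result, timedelta(hours=2)))    # True
    # Five minutes of missing data exceeds a 1-minute RPO.
    print(meets_rpo(result, timedelta(minutes=1)))  # False
```

Keeping the two checks separate reflects the point made above: a service can come back quickly and still fail its data-currency objective, or the reverse.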


Expectations for Continuous Availability

Data Replication

The two basic methods of data replication are synchronous and asynchronous. In general

terms, synchronous capabilities are used for shorter distances, and asynchronous capabilities

are used for longer distances. The method chosen depends on Business Recovery

requirements.

Synchronous replication ensures that a remote copy of the data, identical to the primary copy,

is created at the time the primary copy is updated. In synchronous replication, an update

operation is not considered done until completion is confirmed at both the primary and

secondary site. An incomplete operation is rolled back at both locations, ensuring that the

remote copy is always an exact mirror image of the primary.

Asynchronous replication places data updates in a queue on the primary server. However, it

does not wait for update acknowledgments from the secondary server. As a result, any data that has not

yet been copied across the network to the secondary server is lost if the primary server

fails.

Most companies cannot tolerate more than a few hours or even minutes of downtime without

serious impact to the bottom line. Synchronous data replication may be the appropriate solution

for companies seeking the fastest possible data recovery, minimal data loss, and protection

against database integrity problems.
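The difference between the two replication methods can be illustrated with a toy in-memory model. This is a sketch under simplifying assumptions, not a description of any replication product: the `Site` and `AsyncReplicator` classes and the sample update names are hypothetical, the network is reduced to a queue, and the rollback of incomplete synchronous operations is omitted.

```python
from collections import deque
from typing import Deque, List


class Site:
    """A storage site that holds committed updates in order."""

    def __init__(self) -> None:
        self.committed: List[str] = []


def synchronous_write(primary: Site, secondary: Site, update: str) -> None:
    """Complete the write only once both sites hold it, so the copies never diverge."""
    secondary.committed.append(update)  # stands in for the remote write and its acknowledgment
    primary.committed.append(update)


class AsyncReplicator:
    """Commits on the primary at once and ships updates to the secondary later."""

    def __init__(self, primary: Site, secondary: Site) -> None:
        self.primary = primary
        self.secondary = secondary
        self.queue: Deque[str] = deque()

    def write(self, update: str) -> None:
        self.primary.committed.append(update)  # acknowledged to the application immediately
        self.queue.append(update)              # waits here until the network catches up

    def ship(self, count: int) -> None:
        """Send up to `count` queued updates to the secondary site."""
        for _ in range(min(count, len(self.queue))):
            self.secondary.committed.append(self.queue.popleft())

    def updates_lost_if_primary_fails(self) -> List[str]:
        """Updates still queued on the primary exist nowhere else."""
        return list(self.queue)


if __name__ == "__main__":
    primary, recovery_site = Site(), Site()
    replicator = AsyncReplicator(primary, recovery_site)
    for i in range(5):
        replicator.write(f"trade-{i}")
    replicator.ship(3)  # only three updates cross the network before the failure
    print(replicator.updates_lost_if_primary_fails())  # ['trade-3', 'trade-4']
```

In this model the length of the queue at the moment of failure is the data loss, which is why the achievable RPO for asynchronous replication depends on how quickly queued updates can be shipped.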


Virtualization

Virtualization makes it possible to implement Disaster Recovery plans at a significantly lower

cost. Since virtual machines are hardware-independent, any physical server can be used as a

recovery target for any virtual machine. As virtualization also makes it possible to consolidate

workloads onto fewer servers, organizations can significantly reduce the cost of hardware for

Disaster Recovery by reducing the number of servers needed at the primary site.

Many organizations have already embraced the benefits of virtualization, as it can add

tremendous value to Disaster Recovery planning. Before virtualization, Disaster Recovery was

often too expensive to implement, and many organizations chose only to protect the most

critical applications. Consolidating multiple physical servers as virtual hosts significantly reduces

the number of physical servers that need to be recovered in the event of an outage.

Replication and Network Bandwidth

Network bandwidth can also introduce challenges to data replication strategies. It is important to understand how much data changes within a given period of time; this period is referred to as the replication latency window. The rate of change over that window determines the amount of bandwidth needed. The network bandwidth guideline below can assist with these calculations.

Database Replication

Database replication is similar to database mirroring. These solutions use production database transaction logs to maintain a current copy of the production database on a standby server. In the event of a server outage, the database replication software automatically promotes the standby database to production. There are traditionally no restrictions on where the databases can reside, provided that they can communicate with each other.

Synchronous replication, however, does have some drawbacks. It has a theoretical distance limitation of 200 kilometres (km) or 124 miles, but the practical distance limitation for a busy system could be as little as 50 km or 30 miles.
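Part of this distance sensitivity comes from propagation delay: every synchronous write waits for at least one round trip to the secondary site. The sketch below estimates that floor. It assumes light travels about 200 km per millisecond in optical fiber and that each write needs a single round trip; both figures and the function names are illustrative assumptions, not values from this paper. Real paths add switching and protocol overhead and rarely follow a straight line.

```python
FIBER_KM_PER_MS = 200.0  # assumed propagation speed in optical fiber

def round_trip_ms(distance_km, round_trips_per_write=1):
    """Minimum added latency per synchronous write, in milliseconds."""
    return 2 * distance_km / FIBER_KM_PER_MS * round_trips_per_write

def max_serial_writes_per_second(distance_km):
    """Upper bound on strictly sequential writes from one writer."""
    return 1000.0 / round_trip_ms(distance_km)

for km in (50, 200):
    print(km, "km:", round_trip_ms(km), "ms per write,",
          max_serial_writes_per_second(km), "serial writes/s at most")
```

Under these assumptions, moving from 50 km to 200 km raises the latency floor from 0.5 ms to 2 ms per write, which is why a busy transactional system reaches its practical limit well before the theoretical one.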

Estimated Hours To Replicate Capacity

Network          20 GB    80 GB   120 GB   200 GB   300 GB    730 GB
T1               42.33   169.31   253.97   423.28   634.92   1544.97
10Base-T LAN      6.50    26.01    39.01    65.02    97.52    237.31
DS3 / T3          1.50     6.02     9.03    15.05    22.57     54.93
100Base-T LAN     0.65     2.60     3.90     6.50     9.75     23.73
OC3               0.42     1.68     2.52     4.19     6.29     15.31
OC12              0.10     0.42     0.63     1.05     1.57      3.82
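The same kind of estimate can be computed for any data volume and link. The sketch below is a simplified model under stated assumptions: decimal gigabytes, nominal line rates, and an `efficiency` parameter (hypothetical, defaulting to 1.0) for the share of the link actually available to replication. The paper does not state the assumptions behind its table, whose figures are higher than the ideal values this model produces at full efficiency.

```python
# Nominal line rates in megabits per second.
LINK_MBPS = {
    "T1": 1.544,
    "10Base-T LAN": 10.0,
    "DS3 / T3": 44.736,
    "100Base-T LAN": 100.0,
    "OC3": 155.52,
    "OC12": 622.08,
}

def hours_to_replicate(data_gb, link_mbps, efficiency=1.0):
    """Hours needed to move `data_gb` (decimal GB) over a link."""
    bits = data_gb * 1e9 * 8
    seconds = bits / (link_mbps * 1e6 * efficiency)
    return seconds / 3600

def required_mbps(changed_gb, window_hours, efficiency=1.0):
    """Bandwidth needed to ship `changed_gb` within the replication window."""
    bits = changed_gb * 1e9 * 8
    return bits / (window_hours * 3600) / 1e6 / efficiency

for name, mbps in LINK_MBPS.items():
    print(f"{name:15s} {hours_to_replicate(20, mbps):8.2f} h for 20 GB (ideal)")

# Example: 45 GB of changed data must be shipped within a one-hour window.
print(required_mbps(45, 1), "Mbps required")
```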

Types of Backup Recovery and Replication Architectures

Choosing the best suited backup and recovery option for an organization can be challenging.

Traditionally, businesses request little to no downtime when recovering from a disaster or other

type of outage. Implementing these types of solutions may represent a sizable investment.

Management will have to decide which recovery option best fits the organization’s needs,

particularly in relation to risk assessment, compliance and other requirements, as outlined

earlier in this paper.

Single Site Backup and Recovery

- Backups and snapshots required for off-site storage must be created periodically

- Data can only be as up-to-date as the last backup: daily, weekly or monthly

- Recovery is limited to the point in time of the last backup

Multi-Site Asynchronous Data Replication

- Asynchronous replication is supported by disk arrays, networks and host-based replication products

- Changes to data are committed to the source first, then buffered or journaled and sent to the replication target(s)

- It is designed to work over long distances and greatly reduces bandwidth requirements

- This can introduce delays ranging from nearly instantaneous to several hours, depending on network latency

- There is also no guarantee that the secondary system will have the most recent copy of the data if the primary fails

Multi-Site Synchronous Data Replication

- Used primarily for high-end transactional applications that require instantaneous failover if the primary node fails

- With synchronous replication, data is written to the primary and secondary storage systems at the same time, and the write is not complete until it is acknowledged by both local and remote storage systems

- Synchronous replication requires considerable bandwidth, which also makes it more expensive

Cloud Backup and Recovery

- Applications and data remain on-premises in this approach, with data being backed up into the cloud and restored onto on-premises hardware when a disaster occurs

- In other words, the backup in the cloud becomes a substitute for tape-based off-site backups

- Many backup software vendors now provide options to directly back up to popular cloud service providers such as AT&T, Amazon, Microsoft and Rackspace

Disaster Recovery Site Selection

When assessing the type of backup recovery and replication architecture, one of the key components is the selection of the disaster recovery site. The following recommendations, drawn from leading industry practices, provide guidance for selecting a disaster recovery data center site. In general, primary and backup sites should not be subject to the same threat profile (severe weather risks, same power grid, and flood zones).

Disaster Recovery sites should be located a significant distance [11] from the primary site

Proven practices suggest a minimum of 50 to 200 miles from the primary data center, though neither the SEC nor the FSA [12] mandates a specific distance

Leading Disaster Recovery practices indicate between 200 and 800 miles, provided there are no technical limitations imposed by solution architectures such as low latency / algorithmic trading, synchronous replication, and Fibre Channel distance limitations

Avoid flood-prone areas, major airport flight paths and earthquake zones, and ensure diversity of power feeds

Mitigate key man risk by ensuring labor pool resiliency (data center staff and application recovery resources) and creating appropriate documentation for cross-regional training

[11] 2003 SEC guidelines on Disaster Recovery (http://www.sec.gov/news/studies/34-47638.htm)

[12] FSA BCM guide (http://www.fsa.gov.uk/pubs/other/bcm_guide.pdf)

Figure: Google Earth Imagery 2013: Blue/Red pins (data centers), Red area (0 – 25 miles) / Yellow Area (25-200 Miles) marginal / Green (200-800 Miles)
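A candidate site can be screened against the distance bands in the figure with a great-circle calculation. The sketch below uses the haversine formula; the function names, the handling of the band boundaries and the approximate city-center coordinates are assumptions for illustration. Distance is only one criterion and does not replace the threat-profile review described above.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius

def great_circle_miles(lat1, lon1, lat2, lon2):
    """Haversine distance between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))

def distance_band(miles):
    """Map a distance to the figure's colour bands (boundary handling assumed)."""
    if miles < 25:
        return "red"
    if miles < 200:
        return "yellow (marginal)"
    if miles <= 800:
        return "green"
    return "beyond 800 miles"

# Approximate city-center coordinates, used only as an example.
new_york = (40.7128, -74.0060)
chicago = (41.8781, -87.6298)

miles = great_circle_miles(*new_york, *chicago)
print(round(miles), "miles:", distance_band(miles))
```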

Summary and Recommendations

Target Focus Areas

When performing an evaluation and assessment of IT critical infrastructure, certain issues

should be addressed in order to properly frame and design a sound Business Recovery plan.

The following interview questions can be used as a guide when assessing an environment:

1. Can the IT infrastructure be trusted to withstand a major disruption?

2. Has the resiliency of the Data Center, Network and Compute environment been proven?

3. Has a Disaster Recovery test been performed recently? Were the critical business applications included in the last test? What were the results?

4. Have the business requirements been mapped to the IT infrastructure via a top-down review?

5. Does management fully understand the regulatory ramifications of not adhering to sound business recovery plans?

If those questions cannot be answered, then the business may be at risk of failure because of

its inability to recover production systems.

Citihub recommends an end-to-end assessment of IT infrastructure, along with an in-depth review of business continuity plans.

A detailed infrastructure assessment of the Disaster Recovery plan and processes should

include the following:

A thorough review of the existing primary and backup data centers, as well as the network and compute infrastructure, and the Disaster Recovery plan designs and architecture

An assessment of critical backup systems and confirmation that generator fuel pumps are not located in high-risk areas such as building basements in flood zones

A review of schedules for regular backup exercises and confirmation of failover procedures; confirmation that critical power has been tested and that generators are functioning with sufficient fuel levels

A review of regional and local FEMA flood zone maps (US), or the international equivalent, to determine the level of acceptable risk for data centres and critical systems

An understanding of fuel delivery schedules and the assurance that contracts are in place for emergency fuel delivery, taking into consideration that hospitals and emergency facilities have priority for fuel deliveries

A review of the backup data center location, making sure that the site is outside the primary geographic area and on separate utility grids if possible

The education of teams for preparedness, so that they act proactively and at the appropriate time (for example, not delaying the switch to backup power until the middle of the event)

An evaluation of service provider backup plans to identify dependency risks

An evaluation of remote access procedures and support systems; confirmation of sufficient capacity to support key staff working remotely

Best Practices

To help spearhead a Business Continuity Management plan and a Disaster Recovery program,

the following best practices can drive awareness of the critical nature of these processes as well

as help senior management establish or revise existing plans and eliminate gaps.

Establish a planning group to develop resiliency designs and recovery strategies

Build management awareness by establishing Key Performance Indicators (KPI) for Disaster Recovery to include the following:

- Status of previous Disaster Recovery events/tests with periodic reports to senior management

- Other core IT competencies that are critical to Disaster Recovery planning

- Periodic tests to verify implementation of the Disaster Recovery plan and reports about gaps and risks

- A review process that includes the deployment of new solutions

Perform Risk Assessments and Audits that will:

- Complete a top-down inventory assessment of all critical assets required to sustain operations

- Review process structure assessments, audits, and reports

- Assess gaps and risks from previous events or audits

- Create an implementation plan to eliminate gaps

- Document Disaster Recovery plan actions and escalation procedures

- Build comprehensive training material

- Develop test verification criteria and procedures

Separate people from technology and confirm business processes that require onsite staff to resume operations

Establish a practical remote access strategy for staff who are unable to commute during severe weather conditions

Business Continuity Management Framework

Source: Citihub Business Continuity Management and Disaster Recovery Framework

Elements of Business Recovery Planning

The business process assessment for determining critical areas of recovery begins with a top-down review as shown below. This approach confirms the technical infrastructure and dependencies associated with each business process.

The above process enables end-to-end mapping of the dependencies and key components that make up an application system. To determine business unit IT needs and provide a gap analysis against IT capabilities, Citihub has developed a business impact analysis methodology covering critical processes and the IT systems that support them.

The three areas of focus are:

Business Unit Overview: The business unit overview and readiness heat map is used to capture business process criticality and IT capability readiness in the event of a catastrophic outage

Process Summary: The key process summary examines business processes and rates the impact of a sustained outage on the business on three dimensions: Operational Impact, Financial Impact and Reputational Impact

Application Requirements Summary: The application requirements gap analysis section summarizes the applications each business unit requires and provides a RAG status when compared against IT capabilities

Source: Citihub Business Impact Analysis Methodology
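As an illustration of how a RAG (red/amber/green) status might be derived in an application requirements gap analysis, the sketch below compares the recovery time a business unit requires with the recovery time IT can currently achieve. It is not the Citihub methodology: the function name, the `amber_factor` threshold and the sample applications are all hypothetical.

```python
def rag_status(required_rto_hours, achievable_rto_hours, amber_factor=2.0):
    """Rate the gap between a required and an achievable recovery time."""
    if achievable_rto_hours <= required_rto_hours:
        return "GREEN"   # capability meets the requirement
    if achievable_rto_hours <= required_rto_hours * amber_factor:
        return "AMBER"   # gap exists but is within the assumed tolerance
    return "RED"         # gap exceeds the assumed tolerance

# (application, required RTO in hours, achievable RTO in hours); sample data only
applications = [
    ("order-routing", 1, 1),
    ("settlement", 4, 6),
    ("reporting", 24, 72),
]

statuses = {name: rag_status(req, cap) for name, req, cap in applications}
for name, status in statuses.items():
    print(f"{name:15s} {status}")
```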

FEMA Flood Maps

Appendix A illustrates one of the more critical vulnerabilities that exist within the New York metropolitan area. The storm surge during Hurricane Sandy [13], which caused major flooding in parts of the region, impacted critical systems in the core BMS and data center M&E, as well as transportation infrastructure in and out of New York City and the Tri-State area.

The maps are ranked high to low by impact due to flooding and storm surge severity.

Rank: LOW
Risk: No impact due to storm surge
Impact: None
Mitigation: Ensure redundancy site is active and tested

Rank: MEDIUM
Risk: Storm surge impact can occur but unlikely
Impact: Partial or no building damage and/or access to main entrance
Mitigation: Ensure redundancy site is active and tested; Recovery plans activated

Rank: HIGH
Risk: Storm surge impact is severe
Impact: Damage to main electrical switch gear and/or generators or fuel pumps
Mitigation: Ensure redundancy site is active and tested; Recovery plans activated; Staff plan activated

[13] http://www.nhc.noaa.gov/refresh/graphics_at3+shtml/030345.shtml?gm_esurge

Appendix A FEMA Flood Hazard Mapping - HIGH

New York Locations: Lower Manhattan and 55 Water Street

New York Locations: 25 Broadway and 32 Ave of the Americas

New Jersey Locations: 410 Commerce Blvd. and 760 Washington Ave.

FEMA Flood Hazard Mapping - HIGH (cont’d)

New Jersey Locations: 545 Washington Blvd. and 755 Secaucus Road

New Jersey Locations: 15 Enterprise Ave. North and 300 Boulevard East

FEMA Flood Hazard Mapping - LOW

New York Locations: 111 8th Ave. and 360 Hamilton Ave., White Plains

New York Locations: 480 North Bedford Road, Chappaqua and 11 Skyline Drive, Hawthorne

FEMA Flood Hazard Mapping - LOW (cont’d)

New Jersey Locations: 1400 Federal Blvd. and 3003 Woodbridge Ave.

New Jersey Locations: 165 Halsey Street and 100 Delawanna Ave

FEMA Flood Hazard Mapping - LOW (cont’d)

Chicago Locations: 350 East Cermak, Chicago, IL and 2905 Diehl Road, Aurora IL

Appendix B

Natural Disaster Risk Profiles for Data Centers

Type: Tornado
On-Site: In or near the storm path, expect disruption and minor to severe infrastructure damage
Off-Site: In or near the storm path, expect disruption and minor to severe infrastructure damage
Impact:
- Advanced warning of tornado potential but no site-specific warning
- Employees remain at site
- Duration is brief although intense
- Roof and outside equipment (cooling towers, etc.) damaged or destroyed
- Potential damage to the building structure
- Loss of local utility and communications

Type: Hurricane
On-Site: In or near the storm path, expect disruption and minor to severe infrastructure damage
Off-Site: Expect severe region-wide damage to public infrastructure, utilities and communications
Impact:
- Significant advanced warning
- Duration is hours to a few days
- Employees may require evacuation from site
- Post-storm security may be required
- Emergency supplies needed for at least several days
- Roof and outside equipment (cooling towers, etc.) damaged or destroyed
- Potential damage to the building structure
- Loss of local utility and communications
- Repair to regional damage may require days, weeks or longer for massive reconstruction of electric power transmission or distribution facilities
- Potential for off-site public infrastructure damage

Type: Earthquake
On-Site: Expect catastrophic damage and disruption to data centers near the epicenter and infrastructure damage to data centers further away
Off-Site: Expect severe region-wide damage to public infrastructure, utilities and communications
Impact:
- No warning
- Brief duration with the threat of continued aftershocks
- Employees may be unable to leave site
- Emergency supplies needed for several days of operation
- Building structural damage
- Toppling of un-braced computer hardware and site infrastructure equipment including collapse of raised floor
- Site may be isolated for an extended period
- Highways and bridges may be damaged or destroyed, preventing movement of diesel fuel and other operating supplies required for continuous operation
- Power and communications may sustain extensive damage requiring days, weeks or longer to repair

Source: Uptime Institute, Natural Disaster Risk Profiles for Data Centers, http://uptimeinstitute.com/publications

Natural Disaster Risk Profiles for Data Centers (cont’d)

Type: Ice Storm / Blizzard
On-Site: Expect some disruption or failure of data center if outside equipment is not designed to survive severe ice and snow accumulation
Off-Site: Expect severe region-wide damage to public infrastructure, utilities and communications
Impact:
- Several days' warning generally expected
- Storm or multiple storms may last several days with cumulative effects
- Employees may be unable to leave or enter site
- Emergency supplies needed for at least several days
- Ice damage to structure and outside equipment
- Roof failure from excessive snow load
- Potential freezing of pipes
- Loss of overhead power and/or communications lines over large areas may require several days, weeks or longer to repair
- Roads dangerous or impassable

Type: Thunderstorm / Lightning
On-Site: Expect disruption ranging from disaster to no impact depending on distance to lightning strike and proper operation of surge suppression, UPS, and engine-generator systems
Off-Site: Expect frequent momentary public utility disruptions from lightning strikes hitting the electric power transmission grid
Impact:
- Special sensors can provide minutes of storm approach warning
- Duration is brief but may recur daily during thunderstorm season
- Frequent UPS battery discharges shorten remaining battery life
- Extended power interruption if utility service is overhead or radial and a nearby lightning strike causes protective devices to open
- Possible flooding and roof leakage
- Momentary undervoltages can affect hundreds of square miles
- Fires started by lightning can destroy public infrastructure located in rural areas

Type: Flood
On-Site: Expect catastrophic damage and disruption to data centers in severe flood areas or with infrastructure systems below grade
Off-Site: Expect severe region-wide damage to public infrastructure, utilities and communications
Impact:
- Several days' warning generally expected
- Employees may be unable to leave site
- Emergency supplies needed for at least several days of operation
- Site infrastructure damage requiring days to weeks to repair
- Site may be isolated for an extended period
- Highways and bridges may be damaged, preventing movement of diesel fuel and other operating supplies required for continuous operation
- Power and communications may sustain extensive damage requiring days, weeks or longer to repair

Source: Uptime Institute, Natural Disaster Risk Profiles for Data Centers, http://uptimeinstitute.com/publications

Appendix C East Coast Liquidity Venues

The New York metro area is responsible for approximately 94% of the volume of shares traded for the US cash equity market [14]. The following maps illustrate the major liquidity venues in the New York and Chicago metropolitan locations.

New York

Chicago

[14] http://www.batstrading.com/market_data/daily_volume/

Works Cited & References

Interagency Paper on Sound Practices to Strengthen the Resilience of the U.S. Financial System, September 2005. http://www.sec.gov/news/studies/34-47638.htm

Dodd-Frank H.R. 4173 Wall Street Reform and Consumer Protection Act, January 2010

National Public Radio (NPR), Visualizing The U.S. Electric Grid, April 24th 2009, www.npr.org/templates/story/story.php?storyId=110997398

NOAA National Climatic Data Center, State of the Climate: National Overview for Annual 2012, published online December 2012 from http://www.ncdc.noaa.gov/sotc/national/2012/13.

Uptime Institute, Natural Disaster Risk Profiles for Data Centers, http://uptimeinstitute.com/publications

United States Geological Survey, http://earthquake.usgs.gov/earthquakes/states/new_york/hazards.php

BICSI Standards for Data Centers, https://www.bicsi.org/default.aspx

Colocation Selection, best practices and critical considerations for choosing the right data center colocation solution. Bill Kleyman, Cloud and Virtualization Architect, October 2012

Climate Change and Infrastructure, Urban Systems, and Vulnerabilities, Technical Report for the U.S. Department of Energy in Support of the National Climate Assessment, February 29, 2012

The historic nor’easter of 13-14 March 2010, Richard H. Grumm, National Weather Service

About Citihub

Founded in 1998, Citihub provides IT expertise to some of the world’s leading enterprise

organizations and is comprised of industry veterans who relish the challenge of complex

technology and cultural change. We take a fresh approach to the technical challenges of today

and believe in partnering with our clients through change. Citihub clients include Investment

Banks, Hedge Funds, Media, and Manufacturing.

About the Authors

Vincent Pelly

Vincent Pelly is an Associate Partner at Citihub with more than 30 years of experience across

the financial services industry with specialization in infrastructure, program management and IT

strategy. He has extensive experience managing large enterprise projects in infrastructure and

data center advisory and technology implementation, and has managed large infrastructure

transformation programs.

Scott Haglund

Scott Haglund is an independent consultant with more than 30 years of experience in the

development and execution of global infrastructure strategy, architecture, transformation,

technology roadmaps, optimization, and service delivery standards for the enterprise. He

specializes in data center automation strategies, and has led many enterprise infrastructure

transformation programs.