A Symmetrix White Paper: Disaster Recovery as Business ... · A Symmetrix White Paper: Disaster...

13
June 2002 White Paper A Symmetrix White Paper: Disaster Recovery as Business Continuity

Transcript of A Symmetrix White Paper: Disaster Recovery as Business ... · A Symmetrix White Paper: Disaster...

A Symmetrix White Paper: Disaster Recovery as Business Continuity 1

June 2002

White Paper

A Symmetrix White Paper: Disaster Recovery as Business Continuity

A Symmetrix White Paper: Disaster Recovery as Business Continuity 2

No part of this publication may be reproduced or distributed in any form or by any means, or stored in a database or retrieval system, without the prior written consent of EMC Corporation. The information contained in this document is subject to change without notice. EMC Corporation assumes no responsibility for any errors that may appear.

All computer software programs, including but not limited to microcode, described in this document are furnished under a license, and may be used or copied only in accordance with the terms of such license. EMC either owns or has the right to license the computer software programs described in this document. EMC Corporation retains all rights, title and interest in the computer software programs.

EMC Corporation makes no warranties, expressed or implied, by operation of law or otherwise, relating to this document, the products or the computer software programs described herein. EMC CORPORATION DISCLAIMS ALL IMPLIED WARRANTIES OF MERCHANTIBILITY AND FITNESS FOR A PARTICULAR PURPOSE. In no event shall EMC Corporation be liable for (a) incidental, indirect, special, or consequential damages or (b) any damages whatsoever resulting from the loss of use, data or profits, arising out of this document, even if advised of the possibility of such damages.

EMC2, EMC, and Symmetrix are registered trademarks and EMC Enterprise Storage, SRDF, TimeFinder, and where information lives are trademarks of EMC Corporation.

Copyright 2002 EMC Corporation. All rights reserved

C894.1

A Symmetrix White Paper: Disaster Recovery as Business Continuity 3

Table of Contents Introduction......................................................................................................... 4 No Information, No Business, No Kidding! ...................................................... 5 The Hot-Site Business ....................................................................................... 5

How the Hot-Site Business Works................................................................................................5 The Traditional Recovery Process.................................................................... 6

Stage I: Initial Response ...........................................................................................................7 Stage II: Disaster Declaration ...................................................................................................7 Stage III: Restoration Process ..................................................................................................7 Stage IV: Restoration of Lost and Backlogged Transactions ...................................................7 Stage V: Resume Normal Processing.......................................................................................8 Stage VI: Return Home .............................................................................................................8

What’s Wrong with This Picture?...................................................................... 8 A New Paradigm in Disaster Recovery............................................................. 9 The Cost of Recovery......................................................................................... 9 Building a Disaster Recovery Strategy........................................................... 10 Recovery Solutions .......................................................................................... 11 Remote Mirrored Restart Capability ............................................................... 12 Implementing Business Continuity................................................................. 12 Summary ........................................................................................................... 13

A Symmetrix White Paper: Disaster Recovery as Business Continuity 4

Introduction Today, businesses operate in real time and customers demand around-the-clock access to information. Global competition has created a volatile business environment that simply will not tolerate any interruption in business. Providing access to information is now a 24 hour-per-day, 365 day-per-year challenge.

IT professionals no longer can afford the luxury of removing information from service for any reason. Businesses simply cannot tolerate the downtime needed to back up data, run batch processing, maintain various systems, and perform a wide variety of required computing processes. And in the event of computer disaster or other interruption, rapid restart capability has become mandatory for businesses of all types.

A successful restart program is a function of a well-architected and managed IT infrastructure. Built correctly, an information infrastructure provides for high reliability to mission-critical systems and provides for disaster recovery at minimal cost.

EMC Corporation is the premier provider of information infrastructure and EMC products and services help many businesses protect their information and provide for seamless business continuity. IT professionals worldwide now have primary responsibility for protecting information and providing for business continuity and EMC can help them achieve these goals.

A Symmetrix White Paper: Disaster Recovery as Business Continuity 5

No Information, No Business, No Kidding! Fueled by a 35 percent annual drop in technology costs and breakthroughs in network cost, the disaster recovery industry has grown dramatically. In 1990, Comdisco and SunGard had supported less than 50 disaster events in the first 10 years of existence of the disaster recovery industry. IBM joined the market in 1990.

Over the last decade, businesses both large and small have rapidly increased their focus on building effective, workable computer and business operations disaster recovery programs. The rapid adoption rate has exponentially increased the rate at which companies declare and implement their disaster recovery capability. The total number of disasters declared since 1990 has been approximately 1,500.

In 1990 most sophisticated data centers were built around the concept of a homogeneous computer environment or, at most, two different platforms, e.g., IBM mainframe or DEC VAX. Most processing was dedicated to the transactional side of conducting business.

Today, data centers contain hundreds of large, disparate, and sophisticated computing platforms, all networked together with thousands of individual workstations, both within and outside the walls of the business. Even moderately sized data centers often contain tens of terabytes of information that support not only tactical but highly strategic efforts of the business.

The Hot-Site Business Until the late 1970s, requirements for alternate processing (disaster recovery) usually were satisfied through reciprocal agreements with other companies with similar computer hardware located in close proximity to a business’s computer center. Over time, computing environments became increasingly complex and businesses found themselves processing data seven days per week, 24 hours per day. The evolution of processing made dependence on reciprocal agreements unworkable.

The hot-site industry began in response to a demand for independent recovery sites that could provide on-demand resources. The largest demand for computer backup was the financial services industry. Federal regulations from the Controller of the Currency (OCC), the Securities Exchange Commission (SEC), and other regulatory bodies required institutions not only to backup data but also to maintain the ability to resume processing in an acceptable period of time, usually 24 to 48 hours.

SunGard was the first vendor (around 1979) to enter the market, followed by Comdisco in the early 1980s and IBM in 1990. These three service providers represent approximately 90 percent of the hot-site business. Comdisco was the largest vendor, with about 45 percent market share. Weyerhauser, AllTel, Systematics, and several other smaller companies own the remainder of the business.

During the ’80s, SunGard and Comdisco enjoyed a clear playing field. Prices were high and competition, although intense, was polite. However, the business became extremely competitive when IBM entered the market, and the hot-site industry rapidly became a commodity business. Competition for market share remains extremely fierce. In 2001 SunGard purchased the Comdisco Disaster Recovery Services (CDRS) Division of Comdisco. SunGard and IBM Business Recovery Services are now the two dominate providers in this market space.

How the Hot-Site Business Works Hot-site services can be compared to insurance policies, and most customers view them this way. Customers subscribe to equipment configurations that will meet their recovery requirements. Additionally, a customer subscription comes with a specific number of test hours and typically gives the customer the right to test their recovery capability on an annual basis. It is not uncommon for customers to under-subscribe to their actual requirements to maintain a lower cost solution.

Right-of-access during a time of disaster is a major differentiator between the big two vendors. SunGard guarantees access, IBM does not. The guarantee of access may require a customer to share hot-site resources with other companies that have also declared a disaster. This could mean that a customer would not have access to its full disaster recovery subscription at time of disaster. If access is not guaranteed, a company could have no facility or computing resources if a disaster impacts their operations.

To manage access issues, hot-site vendors may attempt to limit the number of subscriptions per configuration. In some cases, vendors may limit the number of subscriptions they sell in a specific geographic location, such as Wall Street. Although SunGard and IBM are selling a subscription to a recovery configuration, test time is a major

A Symmetrix White Paper: Disaster Recovery as Business Continuity 6

constraint on the number of subscriptions a vendor can truly support on a specific hot-site configuration. Therefore, in a real sense, a hot-site is bounded by the demand for test hours.

The Traditional Recovery Process During the early ‘80s, the hot-site providers employed the traditional tools of tape backup and restore, incorporating them into a methodology for building disaster recovery programs.

This methodology has a well-defined, six-stage recovery process generally referred to as the Recovery Timeline. (See Figure 1) The timeline has been captured in a number of disaster recovery planning tools. One of the most popular PC-based planning tools is Strohl System’s LRDPS.

One of the best ways to look at the recovery process is to think of it as a “data center move project” numerous events and tasks that must be executed in a particular sequence over a specific time frame. The events and tasks have dependencies and other relationships, which must be observed to produce the desired outcome. There are six major stages that make up the recover timeline:

Figure 1.

A Symmetrix White Paper: Disaster Recovery as Business Continuity 7

Stage I: Initial Response When an event occurs that interrupts normal processing in a computer environment, companies must execute a problem determination and escalation process. If the event is deemed significant, hot-site vendors ask that the customer place them on alert during the evaluation period. This allows the vendor to participate in problem determination and, in some cases, take action to prevent the customer from declaring a disaster. During this evaluation the company must determine if there is sufficient cause to declare a disaster. Very few disasters are of the “act-of-nature or manmade catastophe” variety, and often take several hours to evaluate.

Hot-site vendors do not charge a customer a fee for placing an alert call. If a disaster is declared, however, vendors charge a substantial declaration fee, which varies depending on the size of the subscription configuration.

Stage II: Disaster Declaration If the company chooses to activate its disaster recovery plan, all vendors have a specific set of procedures the customer must follow. These procedures are designed to avoid false alarms, and to allow the vendors to appropriately allocate their recovery resources to the event.

This stage of an event includes the retrieving of vital records from an off-site tape storage facility, mobilizing the recovery team, and relocating the tapes and team to the hot-site. When a customer declares a disaster, the vendor will assign them to a hot-site configuration and begin the process of conditioning the hot-site configuration to the customer’s needs. This process includes asking testing companies to evacuate the hot-site, configuring mainframe and midrange processors to the client configurations, loading appropriate disk geometry, etc.

The declaration and relocation process can require between two to eight hours to complete.

Stage III: Restoration Process All vendors are extremely experienced in supporting customers during disasters. When a customer arrives on site to begin the restoration process, vendors begin with an organizational meeting. This meeting serves the purposes of calming everyone down, organizing the work to be done, and acclimating the recovery team to the hot-site.

The vendors will turn over the recovery configuration with “base” operating systems activated, so the customer can begin restoring data. The restoration process is the most time-consuming, error-prone, labor-intensive, and difficult part of a recovery and is a function of how much data a customer has to restore. Operating systems, catalogs, applications libraries, and utility software are included as data. As a general rule, in a mainframe environment you can restore about 5.0 TB of data in a 24-hour period. Open systems cannot achieve nearly this restore rate.

After all data is restored, it must be synchronized. In today’s environment of around-the-clock operations, few companies can shut down processing on a weekly or daily basis to take full backups of all of their computer data. The reality is that each production environment has different applications backed up at different times. These varying schedules result in an unsynchronized set of data files. Synchronization is the process of getting all data files aligned to a particular point in time. Synchronization often results in lost data and can take many hours to accomplish.

Concurrent with the restore process, the customer must switch its production telecommunications network and activate it at the hot-site. Many companies pre-position networking equipment at the hot-site and establish the hot-site as a hot node on their production environment.

Stage IV: Restoration of Lost and Backlogged Transactions Given the nature of most companies’ backup programs, a loss of data occurs for transactions and processing done between the last backup and the disaster event. The loss can be between 24 and 48 hours of data, depending on the company’s backup schedule and when tapes are transferred off-site. While a company is relocating to a hot-site and restoring its data processing environment, the company must backlog transactions. All of this data must be input to the systems and processed so that the systems are current.

At a large multi-national bank, for example, it was determined that the “time to get current” (that is, execute stages one through four) would require between five and 16 days and would result in the loss of up to 44 million transactions.

A Symmetrix White Paper: Disaster Recovery as Business Continuity 8

Stage V: Resume Normal Processing Once all steps are complete, a company is ready to resume production processing for the duration of the event.

Stage VI: Return Home At the end of a disaster event, the customer must return home. With the exception of knowing when this will occur, this stage is essentially a reverse of the activation process, and requires detailed planning and precise execution. This part of a customer’s disaster program is seldom, if ever, tested.

What’s Wrong with This Picture? For many organizations the tools available to build an effective disaster recovery program, as described above, are not sufficient to meet the recovery time objectives (RTOs) and data recovery requirements for large, heterogeneous computing environments. In a global economy, this is no longer acceptable. There are at least five basic problems with this process:

1. Data loss: A tape-based process inherently exposes the company to data loss. The amount of lost data is a function of how often backups are run and when the data is moved off-site. In the event of a disaster very few companies have documented transactions so that they can be re-entered after the systems are re-started. Very few of today’s organizations can tolerate this level of data loss without exposing themselves to severe negative business impact.

2. Timely recovery: In an increasing number of production environments the amount of data to be recovered has scaled well beyond the capability of tape-based restore mechanisms to allow timely recovery and restart. There is simply too much data. In 1990 it was unusual to find a computing environment larger than 1.0TB of data. Today it is not at all unusual to find data centers with tens of terabytes of data. A well-tuned mainframe restore process can restore approximately 5.0 TB of data from tape to disk in a 24-hour period. Open systems and NT environments cannot achieve anywhere near this restore rate. With 6TB of production data, it is extremely unlikely that a company could achieve recovery in less than 72 hours.

3. Data synchronization: Today’s typical computing environment is heterogeneous and extremely complex. The degree of data coupling between applications makes it virtually impossible to define and run a backup program that will allow a company to take apart and accurately rebuild all critical information. In a major mid-western bank EMC Professional Services staff identified over 2,000 discrete synchronization points in the daily process. The problem is complicated by the fact that backups are often not run or otherwise cancelled, run out of sequence, or run while data bases are hot, requiring extensive normalization prior to being put back into production.

4. Testing: An untested disaster recovery plan is an unworkable plan. Companies that are serious about disaster recovery conduct regular tests of their program. Testing is a labor-intensive and costly process. The planning cycle often begins 10 to 12 weeks prior to the execution of a test.

Many companies “fake” results by taking additional or special backups to be used for the recovery test, or by selecting a specific date and time for the test to ensure that they have all data required for a recovery. Few companies conduct “surprise” tests.

Because of the tools used for traditional recovery, most of the time allocated to testing the recovery capability must be devoted to the restoration and synchronization of data. Few companies have the time to run their production cycles during a test.

Production environments can change significantly between tests and invalidate the results of the previous test. Considerable time and effort is devoted to adjusting for those changes during each test and between test cycles. Companies never have a clear certainty of their ability to effectively recover between tests.

5. Return home: Finally, with the existing technology you cannot test the return home portion of the plan. In the event of a disaster, returning home would create an additional production outage, thus compounding the impact of the original disaster on the company’s business.

A Symmetrix White Paper: Disaster Recovery as Business Continuity 9

Because of these problems, it is not uncommon to find companies that cannot meet their recovery time objective (RTO) and recovery point objective (RPO).

A New Paradigm in Disaster Recovery EMC’s Enterprise Storage systems and the data mobility family of software products uniquely position the company to play a major role in this new disaster recovery paradigm. EMC’s focus on data mobility recognizes the central fact that organizations depend on information to conduct business. An EMC advanced business continuity solution would look like this:

Figure 2

As you can see, use of EMC’s SRDF (Symmetrix Remote Data Facility) and EMC TimeFinder eliminate time-consuming data restoration from the recovery process. This essentially eliminates the five problems that are inherent to traditional tape-based recovery programs. At EMC, we don’t do disaster recovery, we make Busness Continuity work!

The Cost of Recovery Two issues are worth examining. First, what do companies invest their disaster recovery dollars in? Second, what do they get for that investment?

Total disaster recovery investment sometimes is difficult to identify. Costs often are buried in normal production support, and staff resources generally are not tracked. Additionally, many companies are not able to meet time-to-recover or data-protection objectives with their current level of expenditure.

The Gartner Group has conducted several studies over the last 10 years exploring the level of investment in disaster recovery. Shortly after September 11, 2001, they were quoted as saying that the average company in North America was spending 3.8 percent of their IT budget on disaster recovery and those companies considered best-of-breed were spending 6.0 percent of their IT budget on disaster recovery. The financial services industry has always invested more on disaster recovery (as a percentage of IT budget) than other industries. It is not unusual to see 10 percent of an institution’s IT budget dedicated to disaster recovery.

A Symmetrix White Paper: Disaster Recovery as Business Continuity 10

Building a Disaster Recovery Strategy The first question a company must answer when deciding to implement a recovery program is what strategy to implement. The key components of a recovery strategy are recovery time objective (RTO) and recovery point objective (RPO). The RTO answers the question, “How long can we be without our computer environment?” The RPO answers the question “How much data can we afford to lose if we have a disaster?”

Companies answer these questions by conducting a business impact or criticality analysis. These studies are designed to identify (1) business processes and systems that are critical to conducting business, and (2) resources essential to supporting those functions. Processes and systems are generally divided into three groups: (see figure 3) those that support the product or service delivery and are essential to the business, those that are business support functions necessary to run the core business, and those that are deferrable.

Figure 3

Initial analysis of data will show that approximately 50 percent of stored data is essential, 30 percent is required for support, and 20 percent is deferrable. However, this analysis may not represent the real world, and systems seldom are independent from other systems processed in the same computing center.

Criticality Analysis – Study that establishes the Recovery Time Objective (RTO) and Data Recovery Objective (DRO) of the business. Data is classified into categories for assigning a RTO. How much of a business’ data belongs in each of the following categories: • Product or Service Related Data – Required to

support the core products and services of the business e.g., Order Entry, MRP, Demand Deposit; RTO is usually 24 hours or less

• Business Support Data – Required to run the business e.g., Financials, Payables, Data Warehouse; RTO is usually 48 hours or less

• Deferrable Data – Required to support the business e.g., Fixed Asset Accounting; RTO is 72 hours or longer

• Initial Studies show that 50 percent of the data is essential, 30 percent is support related, and 20 percent is deferrable.

A Symmetrix White Paper: Disaster Recovery as Business Continuity 11

Figure 4

Additional analysis, which focuses on the relationship (degree of data coupling) between the various classifications of data, yields different results. When the large multinational bank referred to earlier conducted its criticality analysis, managers determined that, because of the degree of coupling, 80 percent of their data was essential, 15 percent was dedicated to support, and only five percent was deferrable.

The output of a criticality analysis is published in a strategy document that highlights the recovery requirements and the cost of implementing a program to mitigate the impacts identified in the study. The cost of implementing a strategy is a function of how long a company has to recover and how much data it can lose. The shorter the RTO and RPO, the higher the cost of implementation.

Recovery Solutions Let’s turn now to a case study. In 1996, under pressure from its Office of the Controller, the large multinational bank referred to previously embarked on solving how to truly recover a large, complex computing environment in a very short period of time with little data loss.

The bank first conducted a time to get current study. This study determined that it would take between five and 16 days for the bank to recover from the loss of a data center. Additionally, because of the way they backed up data, they were going to lose between 24 and 48 hours of transactional data. This was clearly unacceptable to the bank and the bank examiners.

In 1995, the Gartner Group published a study that found the average company in North America spent 2.5 percent of the IT budget on disaster recovery, and that companies with revenue in excess of $10 billion (US) were spending an average of 3.8 percent of the IT budget on disaster recovery. Additionally, financial services companies spend even more on disaster recovery due to regulatory requirements.

The project team set about the evaluation process with the idea that they would spend no more than the average company of their size would spend, or 3.8 percent of the IT budget.

The project team evaluated the currently available recovery tools on the market. Each tool was evaluated with respect to its effect on time to recover and data loss. Those tools—data staging, remote standby operating systems, remote electronic vaulting, remote forward data base recovery—are presented on the following chart.

A Symmetrix White Paper: Disaster Recovery as Business Continuity 12

Ultimately, remote data mirroring was the only solution that would meet the goals of the bank for time and currency.

The cost of implementation of a mirroring environment is usually thought to be prohibitive. But the design at the large multinational bank was targeted at 3.8 percent of their IT budget. This price is approximately the average quoted in the Gartner Group study.

Remote Mirrored Restart Capability For many years, the traditional recovery timeline was acceptable to most companies. However, with the exponential increase in the amount of data and complex relationships between applications, larger businesses have discovered that they cannot meet their recovery time and recovery point requirements using these traditional tools.

The recovery process has become exceedingly complex, labor-intensive, and subject to human error. Businesses operating in the wired world of the global economy require systems restart in minutes, with no data loss. Nothing less is acceptable.

Business impact analysis is a relic of the past. Data classification and coupling causes additional problems with synchronization.

When the large multinational bank was designing its current recovery/restart program in 1996, managers conducted a criticality analysis, described previously, and determined that approximately 95 percent of bank data was either critical or essential. With this realization, the bank took aim at a restart capability that mirrored all production data.

The analysis of currently available recovery solutions led the bank to the conclusion that a remote mirrored restart capability was the only viable option that would meet recovery time and recovery point requirements. Furthermore, technology advances had driven the cost of deployment down sufficiently to make the project financially feasible.

Implementing Business Continuity Implementing a business continuity program is not easy. It requires re-thinking how and why technology is deployed. The biggest impediment to implementation is changing process, not technology. Disaster recovery is a significant part of business continuity. EMC eliminates the most difficult, time-consuming, labor-intensive, and error-prone part of recovery.

Today’s modern computing environment often spends more computing cycles moving data between operating environments (applications, locations, disk to tape, and tape to disk, etc.) than presenting data to customers. The biggest barrier to 24x7 information access is all of the parallel computing requirements required to support customer access.

A Symmetrix White Paper: Disaster Recovery as Business Continuity 13

EMC Enterprise Storage and the company’s suite of data mobility products have consistently demonstrated that they can substantially impact the way businesses deploy and use technology. These products offer solutions to problems that no other vendor can resolve, and it’s done in a way that allows EMC customers to lower their overall run costs, efficiency, and performance.

Business continuity implementation is a very complex process. Professional services is the key to future success in this domain. Architecting and deploying a mirrored business continuity environment requires that EMC have a complete understanding of the way a company has implemented computing in a production environment. Business continuity requires that the customer build a new operating environment. Business continuity is not about implementing disaster recovery. If you build the proper operations environment and include Symmetrix, TimeFinder, and SRDF, then you will achieve disaster recovery.

To be successful in business continuity design and implementation, the methodology has to go beyond an evaluation of storage deployment and recommendations on how to implement EMC Enterprise Storage. It must include a production analysis. This includes the six major areas:

1) Hardware and operating systems environment - Operating systems—UNIX, MVS, NT/2000, Linux, and AS/400 - Tape - Storage - Network architecture and management - Website

2) Applications development - Systems Development Life Cycle (SDLC)—Policy and standards - Architecture philosophy—Buy vs. build, database deployment - Data base philosophy and deployment—DB2, IMS, Oracle, Sybase, etc. - Restart implementation, data base integrity validation, file structure validation—How do I know they are

correct? - Tape philosophy and deployment

3) Operations - Scheduler and how they have implemented it - Job flow mapping - Auto restart

4) Production support and how scheduling is implemented 5) Storage management

- Policy and standards 6) Change management

- Policy and standards - Roll back

The outcome of this study would be a plan to implement business continuity. It would include what could be done (realistically) on a first pass, what should be done over time, and what benefit it would provide. A business case would be a standard part of the deliverable—how much it will cost and how much it would save in operations after implementation.

Summary There’s little question that businesses today must ensure their data is not lost in the event of a disaster, and that they can continue to be in business. EMC stands in a unique position, due to its technology leadership and extensive investment in research, to implement business continuity programs at a higher level than any other vendor. The tragic events of September 11, 2001 demonstrate this unique capability. How many other vendors can claim a 100 percent success rate for recovery? Companies that do not implement the new required level of disaster recovery systems and business continuity programs are risking the future of their businesses, their customers’ businesses and the shareholders’ investments.