VALIDATION OF A FAULT-TREE DOWNTIME
METHODOLOGY THROUGH A CASE STUDY
by
KYLE NORMAN RAMER
B.S., Bucknell University, 2010
A Master’s Report submitted to the
Faculty of the Graduate School of the
University of Colorado in partial fulfillment
of the requirements for the degree of
Master of Science in Civil Engineering
Department of Civil, Environmental, and Architectural Engineering
2011
This Master’s Report entitled: Validation of a Fault-Tree Downtime Methodology through a Case Study
Written by Kyle Norman Ramer has been approved for the Department of Civil, Environmental, and Architectural Engineering
Keith Porter
Abbie Liel
Franck Vernerey
Date: 12/06/2011
The final copy of this Master’s Report has been examined by the signatories, and we find that both the content and the form meet acceptable presentation standards
of scholarly work in the above-mentioned discipline.
ABSTRACT
Ramer, Kyle Norman (M.S., Civil Engineering)
Validation of a Fault-Tree Downtime Methodology through a Case Study
Master’s Report directed by Associate Research Professor Keith Porter
The fault-tree methodology created by Porter et al. [6] was applied to a case study as a means of validation for its use in calculating the probabilistic downtime of a facility caused by an
earthquake event. The fault-tree analysis calculates downtime using both the probability of a
building being non-operational due to the damage state of individual components within the
building and component repair-time distributions conditioned on their damage state. Downtime
estimates were calculated using the fault-tree methodology for a data center and compared with
restoration-time functions for the most-similar classes of facility in both HAZUS-MH and ATC-
13. The fault-tree results were also submitted to a senior operator at the data center in question, as a
sanity check or sniff test of the results. The operator judged that the downtime duration for a
particular probability level seemed to overestimate restoration time by about a factor of two. This
overestimation of approximately a factor of 2 was also seen in the comparison with the HAZUS-
MH and ATC-13 downtime estimates. Whether the fault-tree method overestimates downtime,
or the expert and generic sources underestimate it, or both, is unknown. A factor of two seems to
be within a reasonable range for the accuracy of the fault-tree method. The case study performed here thus provides an initial indication that the fault-tree methodology created by Porter et al. [6] is a valid means of determining the downtime of a facility due to earthquake shaking.
CONTENTS
INTRODUCTION
REVIEW OF LITERATURE
METHODOLOGY
Component Failure Probability Calculation
Component Repair Time Calculation
Fault-Tree Implementation
DATA CENTER CASE STUDY
Component Failure Probability Calculation
Component Repair Time Calculation
Fault-Tree Implementation
Exceedance Probability Calculation
CASE STUDY VALIDATION
Sniff Test Validation
Cross-Validation
Validation Considering the Earthquake Experience of the Facility
CONCLUSIONS
REFERENCES CITED
APPENDIX
LIST OF TABLES
Table 1 – Event directory and component capacities
Table 2 – Event repair times and their source
LIST OF FIGURES
Figure 1 – Fault-Tree logic applied to downtime through an and gate
Figure 2 – Fault-Tree logic applied to downtime through an or gate
Figure 3 – Probability of data center failure
Figure 4 – Facility downtime estimation
Figure 5 – Coefficient of variation of downtime estimation
Figure 6 – 50-year exceedance probability of the facility
Figure 7 – Downtime comparison of Fault-Tree, HAZUS-MH, and ATC-13 50-year exceedance probabilities
Figure 8 – Data Center Fault-Tree
Figure 9 – 50-year exceedance probability for earthquake shaking intensity
INTRODUCTION
A second generation of performance-based earthquake engineering (PBEE-2) has
developed largely in the last decade, with the objective of estimating probabilistic future seismic
performance of buildings and other facilities in terms of repair costs, fatalities, and loss of
functionality (dollars, deaths, and downtime). Most recently, ATC-58 [1], a project funded by
FEMA to develop professional guidelines for carrying out PBEE-2 analysis, has placed most of
the development emphasis on estimating repair costs. ATC-58 estimates probabilistic repair costs
using building-specific structural analysis outputs generated from a suite of earthquake ground-
motion time histories along with building-component fragility functions and conditional
probability distributions of repair cost given a damage state. Downtime in
ATC-58-style PBEE-2 is estimated by dividing repair cost by a burn rate (repair expenditures per day). This
approach does not attempt to resolve the point at which a facility becomes operational again,
only the point in time when all repairs are finished.
An alternative to a PBEE-2 approach is catastrophe risk modeling, a discipline that
considers buildings as independent and identical samples of building categories typically defined
in terms of structural material, lateral-force-resisting system, height, and building-code era. The
National Institute of Building Sciences and Federal Emergency Management Agency [3] have
developed the HAZUS-MH software (the leading public-sector catastrophe-risk model) that
estimates downtime as a function of probabilistic overall damage state of the structural portion of
a building, and of the building’s occupancy category. While catastrophe-risk models seem to
give reasonable downtime estimates (in the opinion of their users), they do not resolve downtime
behavior at the level of individual buildings, nor explain behavior in engineering terms.
Porter et al. (2011) proposed a methodology for calculating downtime as the time from
when repairs begin until the time the building becomes operational [6]. This is accomplished by
combining component fragility functions (which estimate the probabilistic damage state of
individual building components as a function of shaking severity), component repair-time
distributions, i.e., probability distribution of component repair time conditioned on their damage
state, and fault-tree analysis (which estimates the probability of a building being non-operational
as a function of the damage state of the individual components). Examples of components are
individual beams, columns, walls, suspended ceilings, electrical transformers, and packaged air-
conditioning units.
In Porter et al. [6], the fault-tree methodology was applied to a hypothetical data center,
for illustration purposes, but not, so far, to a real facility, and it has not yet been validated in any
substantial way. The objective of this research is to apply Porter’s downtime methodology to a
real building, and to validate it by comparing its results to results generated by preexisting
methodologies and by conducting a sniff test of the results by soliciting the opinion of an expert
on the workings of the facility under investigation.
REVIEW OF LITERATURE
Downtime due to earthquake damage has been explored on a number of levels from the
reasons for losses to means of calculating downtime accurately for both general and specific
facility cases. Downtime is ultimately the result of the loss of function of components utilized by
a given business. Tierney [9] investigates how various component failures affected business losses by surveying businesses affected by the 1993 Midwest floods and the 1994 Northridge earthquake. The survey was given to businesses of varying sizes in regions affected by the disasters, and the results outline the reasons businesses experienced losses
due to the disasters, which were lifeline service interruptions, physical damage, material flow
disruption, and loss of customers. These factors show the broad effect that an earthquake can
have on businesses. These losses that caused downtime ranged from direct interruption of a
business to losses contingent on other businesses’ interruption.
Cabrera [2] follows the same approach as Tierney [9], but focuses on industrial facilities
impacted by the March 2011 Tohoku earthquake in Japan. Cabrera [2] outlines the main drivers
behind both direct business interruption (downtime resulting from damage at the facility of
interest) and contingent business interruption (downtime at the facility of interest, but resulting
from damage elsewhere) in industrial facilities and how they create both business interruption
and lifeline interruption. Lifelines include electricity, water, telecommunications, etc. The
relationship between business interruption and lifeline interruption is particularly relevant
because it shows that the functionality of many businesses is dependent on other businesses and
infrastructure. In the Japanese facilities investigated, the majority of the damage was non-structural. The losses incurred were due to non-structural damage as well as lifeline and supply-
chain interruption. The findings of Cabrera [2] and Tierney [9] indicate that business-specific,
contingent, and lifeline components should be considered when evaluating potential losses and
downtime incurred on a facility.
Given the types of components that are considered when evaluating facility downtime,
downtime estimates for each of these components must be determined. One method of
determining the downtime estimates is through expert opinion. Porter and Sherrill [5] discuss
how expert panels judged impacts of a hypothetical M7.8 southern San Andreas Fault
earthquake. The results of the panels provide downtime assessments for various locations in
California, but more importantly the panel process provides a method for determining component
damage states following earthquakes when field data are unavailable or analytical methods are
impractical. The panel approach can be seen as a structured application of expert opinion to
estimate lifeline downtime in earthquakes.
Using component-specific downtime estimates, methodologies have been created to
estimate the downtime of complete facilities. Porter et al. [4] provide a methodology for
developing building-specific seismic vulnerability functions for a building, considering both
structural and non-structural building components. The methodology applies suites of ground
motion time histories to a probabilistic structural model to estimate probabilistic structural
response conditioned on shaking, and then uses component fragility functions to estimate the
probabilistic damage state of each building component. Repair duration is calculated by
assembling a simple Gantt chart for repair, treating component repair times as probabilistic and
conditioned on damage state. By Monte Carlo Simulation then, one can model ground motion,
structural response, damage, and repair duration.
ATC-13 [7] generated restoration curves for facilities damaged by earthquakes based on
the building’s loss of function by soliciting expert opinion. ATC-13 [7] generalizes facilities into
various usage classifications and associates example/typical equipment and contents that may be
present in each building classification to be used in the restoration estimation. Assuming that
repair/reconstruction follow ordinary construction schedules, experts estimated restoration time
for each facility classification at 30%, 60%, and 100% restoration levels and also gave their
experience level with each facility class. The restoration times were weighted based on the
experience level of each expert to develop functional restoration time curves based on the
facility’s loss of function.
More recently, the HAZUS-MH developers [3] produced software to estimate regional
earthquake losses throughout the United States. Their method applies ground-motion prediction
equations to an earthquake rupture forecast to estimate regional shaking, either probabilistically
or on a scenario basis. It contains an estimate of the building inventory on a census-block basis,
constructed from various proxies such as the US Census of Population and Housing. Damage to
the building stock is estimated using the calculated ground motions and the capacity spectrum
method of structural analysis to estimate structural response for various categories of
construction. Structural response is then input to fragility functions that estimate the probabilistic
damage state of the building stock. The fragility functions are created through engineering first
principles and probably expert opinion. Downtime is then estimated using conditional probability
distributions of downtime given damage state by occupancy class. The derivation of these
conditional probability distributions is vague, apparently drawing in part on ATC-13 [7].
A structural-engineering approach to estimating the impacts of earthquake on individual
buildings is offered by ATC-58, a FEMA-sponsored effort to produce professional guidelines for
applying 2nd-generation performance-based earthquake engineering. The guidelines work also
includes the PACT software [1] for implementing the methodology. This software takes as input
the structural response calculated by building-specific structural analyses for particular
earthquake ground motions. It calculates probabilistic damage and loss using built-in
fragility functions and conditional probability distributions of repair cost (conditioned on
damage). Additionally, PACT determines casualty levels, repair costs, and repair time for the
given building. The downtime calculated by PACT addresses repair time and ignores the time
between when the earthquake occurs and the start of the repairs. It does, however, generate
component-specific repair times.
Porter et al. [6] offer a method to estimate downtime for buildings after earthquakes. The
methodology applies fault-tree analysis to estimate the probability of a component being non-
operational at a given point in time after experiencing a specified level of excitation, as the product of
a fragility function (probability that a component will fail to operate as a function of input
excitation) and a conditional probability of the time to restore the component given that it fails.
The result is that, at the basic-event level, one estimates at any time after the earthquake the
probability that the component is nonfunctional. By “basic event” is meant the failure of a single
component, such as a single wall, column, or ceiling. By defining the top event as the facility
being nonfunctional, i.e., failing to perform some particular operation such as data processing,
the result is that one can calculate probabilistic downtime either by Monte Carlo Simulation or in
closed form. As noted earlier, it is this method that is tested here.
METHODOLOGY
Component Failure Probability Calculation
The downtime methodology created by Porter et al. [6], tested here for validation, bases
its downtime fault-tree logic on failure probabilities of individual components within the facility.
The failure probability of each component is generated by calculating component-specific failure
probabilities conditioned on the peak excitation to which the component is subjected.
“Excitation” can mean member forces or deformations, either in scalar or vector form. In current
practice it is usually parameterized as a scalar value, and most often the peak acceleration to
which a component is subjected or the peak interstory drift at the story level of the building
component.
The components considered here happen to be all acceleration-sensitive. For components
at ground level of a building, that means that the input excitation is peak ground acceleration
(PGA). Excitation for components above the ground floor can in principle be estimated from
structural analysis; here it is simply taken as PGA times an amplification factor: 2 for
components in the top 1/3rd of the building height, 1.5 for components in the middle 1/3rd of the
building height.
The calculation of component failure probability is performed using component-specific fragility functions, as seen in Equation 1. Each component, indexed by i, has a failure probability Pi, calculated using the excitation xi (here, acceleration), the median capacity of the
component, θi, and βi, the logarithmic standard deviation of capacity. Φ is the cumulative
standard normal distribution function. To simulate damage, one can draw a sample u of a
uniformly distributed random variable between 0 and 1; if u < Pi, the component is considered
damaged.
Pi = Φ( ln( xi / θi ) / βi ) (1)
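As a sketch, the fragility evaluation of Equation 1 and the damage draw can be written as follows. The median capacity and dispersion values in the example are illustrative only, not taken from the case-study facility:

```python
import math
import random

def failure_probability(x, theta, beta):
    """Equation 1: Pi = Phi(ln(xi / theta_i) / beta_i), a lognormal fragility.
    x     -- peak excitation (e.g., PGA in g) to which the component is subjected
    theta -- median capacity of the component (same units as x)
    beta  -- logarithmic standard deviation of capacity
    """
    z = math.log(x / theta) / beta
    # Phi(z), the standard normal CDF, expressed via the error function
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def is_damaged(x, theta, beta, rng=random):
    """Simulate damage: draw u ~ U(0, 1); the component is damaged if u < Pi."""
    return rng.random() < failure_probability(x, theta, beta)

# Illustrative values: a component with median capacity 1.0 g and beta = 0.4
p = failure_probability(0.5, 1.0, 0.4)  # failure probability at PGA = 0.5 g
```

At x equal to the median capacity the function returns 0.5 by construction, which is a quick check on any implementation.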
Component Repair Time Calculation
To calculate downtime, conditioned on a component being damaged (calculated using
Equation 1), it is assumed that repair time is lognormally distributed. The component’s
lognormally distributed repair time, ti (days), is simulated using Equation 2 where qi is the
median repair duration, bi is the logarithmic standard deviation of repair duration, and Φ‐1(ri) is
the inverse of the cumulative standard normal probability distribution evaluated at ri, a sample of a uniformly distributed random variable bounded by 0 and 1. If a component
is considered undamaged in a given simulation, the repair time is 0 days.
ti = qi · exp( bi · Φ‐1(ri) ) (2)
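A minimal sketch of the repair-time simulation of Equation 2, using the standard-library `statistics.NormalDist` for the inverse standard normal CDF (the numeric values are illustrative):

```python
import math
import random
from statistics import NormalDist

_INV_PHI = NormalDist().inv_cdf  # inverse of the standard normal CDF

def repair_time(q, b, r):
    """Equation 2: ti = qi * exp(bi * Phi^-1(ri)), a lognormal repair time.
    q -- median repair duration (days)
    b -- logarithmic standard deviation of repair duration
    r -- sample of a uniformly distributed random variable in (0, 1)
    """
    return q * math.exp(b * _INV_PHI(r))

def simulate_repair_time(damaged, q, b, rng=random):
    """Repair time is 0 days if the component is undamaged in this simulation."""
    if not damaged:
        return 0.0
    return repair_time(q, b, rng.random())

# Illustrative: median repair time of 10 days with dispersion 0.6
t = simulate_repair_time(True, 10.0, 0.6)
```

At r = 0.5 the sampled repair time equals the median q, since the inverse standard normal CDF is zero there.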
Fault-Tree Implementation
Each component’s probabilistic repair time for each PGA is applied to the fault-tree
logic’s Boolean operations. These Boolean operations consist of and and or gates. Within the
fault tree, a facility’s failed state is dependent on the failure of various systems. A failure of a
system will be referred to as an upper event (eu) and a failure of a component within a system
will be referred to as a lower event (el). If an upper event requires that all lower events associated
with it occur, an and gate is applied. If an upper event requires that only one lower event
associated with it occurs, an or gate is applied. The ability of a system (upper event) to be
operational, meaning the ability of a system to be taken out of its failed state, is also determined
based on the lower events connected to it. If an upper event requires all lower events associated
with it to fail (and gate), the upper event needs just one lower event to be taken out of its failed
state. In this case Equation 3 is applied to determine the downtime of an upper event, tu, because
the first lower event to be taken out of the failed state will take the upper event out of its failed
state. The repair time for a lower event n is referred to as tln.
tu = min (tl1, tl2, … tln) (3)
If an upper event requires just one of the lower events associated with it to fail (or gate),
all of the lower events associated with the upper event must be taken out of their failed states
for the upper event to be taken out of its failed state. In this case Equation 4 is applied.
tu = max (tl1, tl2, … tln) (4)
An and gate takes the minimum repair time of the components directly below the upper
event and an or gate takes the maximum repair time of the events directly below the upper event.
The fault-tree logic applied to downtime is displayed in Figure 1 and Figure 2. It is applied
throughout the facility’s systems until the facility’s total downtime is simulated relative to PGA.
Downtime is simulated repeatedly using Monte-Carlo simulation with all downtime simulations
averaged to determine the facility’s mean downtime for a given PGA.
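In code, the gate rules of Equations 3 and 4 reduce to min and max over the lower-event repair times. The small tree below is a hypothetical illustration, not the case-study fault tree:

```python
def and_gate(*lower_times):
    """AND gate: all lower events must fail for the upper event to fail, so the
    upper event exits its failed state as soon as the first lower event is
    repaired -- tu = min(tl1, ..., tln) (Equation 3)."""
    return min(lower_times)

def or_gate(*lower_times):
    """OR gate: any one lower-event failure fails the upper event, so every
    lower event must be repaired before the upper event recovers --
    tu = max(tl1, ..., tln) (Equation 4)."""
    return max(lower_times)

# Hypothetical tree: the facility fails if the power system OR the HVAC
# system fails; the power system has two redundant feeds (AND gate).
feed_a, feed_b, hvac = 12.0, 3.0, 7.0  # simulated repair times, days
power = and_gate(feed_a, feed_b)       # one repaired feed restores power
facility = or_gate(power, hvac)        # all lower systems must be repaired
```

Evaluating the gates bottom-up for each Monte Carlo draw yields one simulated facility downtime per draw.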
Figure 1 – Fault-Tree logic applied to downtime through an and gate
Figure 2 – Fault-Tree logic applied to downtime through an or gate
DATA CENTER CASE STUDY
Component Failure Probability Calculation
In Porter et al. [6], the fault-tree methodology was applied to a hypothetical facility to
illustrate the fault-tree’s functionality. However, until this research, the fault-tree methodology
has not been applied thoroughly. In this study, the methodology was applied to an actual building, in collaboration with and inspected by the facility operators, where downtime results could be compared to existing models or judged by the operators themselves or other experts. For this
research, as a first means of validation, the fault-tree methodology for downtime was applied to
an existing data center whose failure probability for varying PGA had already been determined using the same fault-tree logic.
The data center in question is operated by a southern California public power utility. The
facility is located in the San Gabriel Valley of Los Angeles County, and houses the data
processing and telecommunications equipment that controls the power grid. There are probably
dozens, maybe hundreds or more, such facilities around the United States, but in general the
equipment in the facility is similar to probably thousands or more computer data centers around
the U.S.
The site of the data center was the subject of a 2011 geotechnical investigation. It
included two cone penetrometer tests (CPT) that reached approximately 90 ft below grade, and
produced estimates of average shearwave velocity in the top 30m of soil of 450 m/sec to 480
m/sec. No groundwater was encountered, indicating low potential for liquefaction or lateral
spreading. There are no known faults at the site. The implication is that the principal seismic
hazard at the site is shaking produced by rupture of faults in the region. Some regional faults are
capable of producing earthquakes of M7.8.
The data center building is a 2-story, 50,000 square foot, generally rectangular (190 ft x
130 ft) reinforced concrete shearwall structure, built in the late 1950s and expanded to its current
configuration in the 1980s. Data-processing operations happen in 3 computer rooms with raised
access floors and suspended ceilings, served by power and telecommunications equipment in
several additional rooms. Air conditioning equipment is located on the roof and a pad outside
the building. A complete list of the components that contribute to the functionality of the facility
is provided in the appendix. The list was produced in collaboration with the operators of the
facility, as part of a consulting contract with the engineering risk consulting firm SPA Risk LLC. Prof. Porter is a principal of that firm, and provided the data used here.
Confidentiality issues prevent revealing the exact location of the facility or the name of the
operator.
The data center’s probability of operational failure (meaning at least some data-
processing operations cease) was calculated using component fragility functions by applying
Equation 1 and the fault-tree logic displayed in a flow chart in Figure 8 in the Appendix. Figure
3 shows the probability of the facility being in a non-operational state for a range of PGA.
Figure 3 – Probability of data center failure
It should also be noted that the data center’s failure probability distribution utilized in the
downtime calculations was based on the “as-is” condition of the building. The failure probability
calculations also accounted for redundancy in the components, e.g., multiple air conditioning
units, only some of which are required to be operating. If there were more of a component than
was required, a binomial distribution was applied to the component’s fragility function, treating the n installed units as independent trials; the component group fails when more than m units fail, where n is the number of component units and m is the difference between the number of component units and required component units.
Appendix Table 1 displays a list of the events applied to the data center’s fault-tree as well as
their median capacity, logarithmic standard deviation of capacity, and the number of required
and actual units of the component.
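A sketch of the redundancy adjustment described above: with n installed units, each failing independently with the per-unit fragility probability p, the group can tolerate up to m = n − required failures, so it fails when more than m units fail. The unit counts and probability below are illustrative:

```python
from math import comb

def group_failure_probability(p, n, required):
    """Probability that a redundant component group fails, given per-unit
    failure probability p, n installed units, and `required` units needed
    for operation. Up to m = n - required failures are tolerated; the group
    fails when more than m of the n independent units fail (binomial tail)."""
    m = n - required
    return sum(comb(n, k) * p**k * (1.0 - p)**(n - k)
               for k in range(m + 1, n + 1))

# Illustrative: 4 air-conditioning units installed, 2 required, each with a
# 20% failure probability at a given PGA
p_group = group_failure_probability(0.2, 4, 2)
```

With no redundancy (n equal to the required count) the group probability reduces to the binomial tail starting at one failure, and with spare units it is strictly smaller than the per-unit probability.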
Component Repair Time Calculation
For each event listed in Appendix Table 1, a repair time was calculated for a range of
PGA using Equation 2; a nonzero repair time was simulated only if a sample of a uniformly distributed
random variable bounded by 0 and 1, drawn for a given component, was less than that
component’s calculated probability of failure at that PGA. For computational purposes, PGA
ranged from 0.01g to 2.00g in increments of 0.01g. Parameters of the fragility functions and
downtime distributions were determined either by ATC-58 [1], HAZUS-MH [3], Porter et al. [6],
or by judgment. Because of uncertainty, all median repair time estimates are rounded to one
significant figure. Appendix Table 2 shows a list of all events along with their median repair
time, logarithmic standard deviations of repair time, and the source of those statistics. All
components listed are considered non-structural components.
Fault-Tree Implementation
With the repair times probabilistically determined for each component over the range of
PGA, the fault-tree methodology was applied to calculate downtime. The facility’s downtime
distribution over a range of PGA, meaning the amount of time for the facility to become
operational, was calculated to be the time required to repair all equipment systems, address
hazardous-material release or wait out conflagrations, or remove the facility’s red-tag label. A
facility is red-tagged if it is deemed structurally unsafe. (Porter performed a structural analysis of
the facility to estimate its red-tag capacity; details are not provided here.) Equipment systems are
considered to be repaired when all of the following occur: any uncontrolled building fire (inside
the building) is stopped and all burnt components are repaired, the building support systems
become operational, and the grid control and telecommunication systems become operational.
In the case of the building support systems, an or gate connects both its failure and repair
to the four lower systems: heating, ventilation, and air conditioning (HVAC), the power system,
raised access floors supporting data-processing equipment, and the suspended ceilings above
data-processing equipment. This means that the building support system fails if any of those four
lower systems fail. It also means that the building support systems become operational when all
of its four lower systems are repaired. In mathematical terms, the time required for the building
support systems to become operational is the maximum repair time of the HVAC, the power
system, the raised access floors, and the suspended ceilings.
One of the events required for the equipment systems to become operational is the
stopping of an uncontrolled building fire. An uncontrolled building fire occurs if the building
ignites and the fire response system fails (and gate). This means the downtime due to an
uncontrolled building fire is the minimum of the restoration time of the fire response system and
the time needed for the fire to go out on its own.
The facility’s downtime was simulated 1000 times for each PGA tested, and the simulations were averaged to create an expected value of downtime along with a coefficient of variation.
Figure 4 shows the facility’s downtime estimation for various PGA and Figure 5 shows the
coefficients of variation for the downtime estimates. The coefficients of variation (the standard
deviation of downtime duration divided by the mean value, for a given level of PGA) vary
between about 1.0 and 2.0. This means that the downtime is uncertain within a factor of about 2
for PGA greater than 0.5g. At lower magnitudes of PGA, the accuracy of the downtime estimations is very low based on the coefficients of variation and changes significantly relative to PGA. This is due to the signal-to-noise ratio of the PGA versus the damage state. At lower PGA, the
probabilities of failure for components and the facility are very low. In the rare case where the
component enters a damaged state at a low PGA and thus experiences downtime due to low
PGA, an increased error in the downtime estimation occurs for the entire facility.
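The averaging step can be sketched as follows: at each PGA, many downtime simulations are drawn and their mean and coefficient of variation (standard deviation divided by mean) are reported. The simulator below is a hypothetical stand-in for the full fault-tree simulation, with illustrative parameters:

```python
import random
from statistics import mean, pstdev

def mean_and_cov(simulate_downtime, n_sims=1000, rng=None):
    """Run n_sims downtime simulations and return (mean downtime,
    coefficient of variation), where CoV = standard deviation / mean."""
    rng = rng or random.Random(0)  # seeded for reproducibility
    samples = [simulate_downtime(rng) for _ in range(n_sims)]
    m = mean(samples)
    return m, (pstdev(samples) / m if m > 0 else float("nan"))

# Hypothetical stand-in for one fault-tree downtime simulation at a fixed
# PGA: the facility is damaged with probability 0.3; if damaged, downtime
# scatters around a median of 30 days.
def toy_simulation(rng):
    if rng.random() >= 0.3:
        return 0.0
    return 30.0 * (2.0 ** rng.gauss(0.0, 1.0))

m, cov = mean_and_cov(toy_simulation)
```

Because many low-PGA draws return zero downtime, the mixture of zeros and occasional large values drives the CoV up, mirroring the low-PGA behavior described above.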
Figure 4 – Facility downtime estimation
Figure 5 – Coefficient of variation of downtime estimation
Exceedance Probability Calculation
To better understand the downtime estimations displayed in Figure 4, the 50-year
exceedance probability of the facility’s downtime was calculated as a function of downtime. That
is to say, the probability calculated is that the facility’s actual downtime due to earthquake
shaking experienced in the next 50 years exceeds a specified value. This probability is
determined by first calculating the probability that the facility will not be operational for at least
the duration, t, given a shaking intensity, s, from an earthquake. This is shown in Equation 5.
P[T ≥ t | S = s, &] = 1 − Φ( ln( t / qT(s) ) / bT(s) ) (5)
Equation 5 utilizes the median downtime, qT(s), and the logarithmic standard deviation of
downtime, bT(s), given that the uncertain shaking intensity, S, takes on a particular value, s, and
given a number of other conditions abbreviated by “&.” These additional conditions include the
facility location, a mathematical model of its design and reliance on equipment and other
components, a particular model of the regional seismicity, a particular model of the ground
motion intensity given the occurrence of an earthquake, and other parameters. The coefficients
qT(s) and bT(s) must be derived from both the mean downtime m and the coefficient of variation
of downtime d. Coefficients bT(s) and qT(s) are calculated using Equation 7 and Equation 8
respectively in the Appendix. The result from Equation 5 is integrated with the absolute value of
the first derivative with respect to s of the hazard curve, which gives the probability that an
earthquake will occur causing shaking intensity to exceed a particular value s at least once during
the next 50 years. This integral is shown in Equation 6 where & is as defined above, T is
uncertain earthquake-induced downtime, and t is a particular value of T.
\[
P[T \ge t \mid \&] = \int_{s=0}^{\infty} P[T \ge t \mid S = s, \&]\,\left|\frac{dG(s)}{ds}\right| ds \tag{6}
\]
in which G(s) denotes the hazard curve described above.
The second term on the right side of Equation 6 is the seismic hazard at the site, i.e., here
the probability (or almost equivalently, the occurrence frequency) of shaking exceeding a
particular value of s, here measured in terms of peak ground acceleration. Seismic hazard for the
site was provided by Keith Porter, who used the USGS/USC software OpenSHA Seismic Hazard
Calculator ver 1.2.2. The software takes as input a facility location, earthquake rupture forecast,
ground-motion prediction equation, and time period of interest. In the present case, Keith Porter
used the Uniform California Earthquake Rupture Forecast ver 2.0 [10], Campbell and Bozorgnia's
next-generation attenuation relationship [11], and a 50-year period.
For computational purposes, the integration performed in Equation 6 was done
numerically. Figure 9 in the Appendix shows the 50-year exceedance probability for various
shaking intensities and Figure 6 shows the 50-year exceedance probability of the facility’s
downtime.
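The numerical integration described above can be sketched as follows. The hazard-curve points and the downtime model below are hypothetical placeholders, not the OpenSHA output or the fitted functions used in the study:

```python
from math import log
from statistics import NormalDist

def p_downtime_exceeds(t, q_t, b_t):
    """Equation 5: lognormal survival function of downtime."""
    return 1.0 - NormalDist().cdf(log(t / q_t) / b_t)

# Hypothetical 50-year hazard curve: (PGA s in g, P[shaking >= s in 50 yr]).
hazard = [(0.1, 0.50), (0.3, 0.20), (0.5, 0.08), (0.7, 0.03), (0.9, 0.01)]

def p_exceed_50yr(t, median_fn, beta_fn):
    """Equation 6: integrate Equation 5 against |dG/ds| of the hazard
    curve G(s), approximated here with midpoint finite differences, so
    each interval contributes P[T >= t | s_mid] * |G(s1) - G(s0)|."""
    total = 0.0
    for (s0, g0), (s1, g1) in zip(hazard, hazard[1:]):
        s_mid = 0.5 * (s0 + s1)
        total += p_downtime_exceeds(t, median_fn(s_mid), beta_fn(s_mid)) * abs(g1 - g0)
    return total

# Hypothetical downtime model: median downtime grows linearly with PGA.
p_six_months = p_exceed_50yr(180.0, lambda s: 200.0 * s, lambda s: 1.0)
```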
Figure 6 – 50-year exceedance probability of the facility's downtime
Figure 6 indicates, for example, that there is a 5% probability that within the next 50 years
the facility will experience enough earthquake damage to incur a downtime of at least six
months. This is almost equivalent to saying that, with 0.1% probability, an earthquake in 2012
could cause enough damage to render the facility inoperative for at least 6 months.
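The near-equivalence of the 50-year and annual statements follows from the usual assumption of independent years; a quick arithmetic check:

```python
# A 5% probability in 50 years corresponds, assuming independent years,
# to an annual probability of 1 - (1 - 0.05)**(1/50), which is
# approximately 0.001, i.e., roughly 0.1% per year.
p_50yr = 0.05
p_annual = 1.0 - (1.0 - p_50yr) ** (1.0 / 50.0)
```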
CASE STUDY VALIDATION
Sniff Test Validation
The exceedance probability results were submitted to an operator of the facility who had
participated in creating and inspecting the fault tree for this facility. This operator was asked to
judge whether these results seemed credible. After examining the downtime probabilities
calculated using the fault-tree methodology, the operator judged that the downtime was
slightly overestimated. Based on his knowledge of the facility, he reasoned that an earthquake
within 2012 could conceivably cause 3 months of downtime rather than the 6 months calculated
using the fault-tree methodology as having 0.1% occurrence probability. To be precise, he wrote,
“I cannot imagine an earthquake in 2012 would cause 6 months of outage time. Taking the
uncertainty factor into consideration, a 3 month outage seems more within the realm of
possibility.” This feedback seems like neither a strong endorsement nor a strong denunciation of
the credibility of the model, but something in between, as if the curve is plausible if shifted to the
left by a factor of 2.
Cross-Validation
As a further check on the downtime estimates calculated using the fault-tree
methodology, similar exceedance probability calculations were performed using the
HAZUS-MH downtime calculation method [3] as well as the ATC-13 downtime calculation
method [7]. Identical 50-year exceedance probabilities of earthquake shaking were utilized in the
calculations. The results generated by these calculations are based on generalized buildings of
the same class as the data center under examination. For the HAZUS-MH method, the
generalized building was model building type 19.1 (low-rise reinforced concrete shearwall
facility with a High-Code seismic design level) with occupancy class 16 (Banks and Financial
Institutions) [3]. The ATC-13 method utilized a class 6 facility (low-rise reinforced concrete
shearwall facility) and a class 16 social function (industrial/high technology) [7]. The
coefficient of variation of downtime used in Equation 5 for the HAZUS method was approximated
by fitting a curve to ATC-13's coefficient of variation of downtime. Figure 7 compares the
50-year exceedance probability estimates of the facility's downtime from the fault-tree
methodology, HAZUS, and ATC-13, where all three methods are subjected to the same 50-year
seismic hazard.
Figure 7 – Downtime comparison of Fault-Tree, HAZUS-MH, and ATC-13 50-year exceedance probabilities
Figure 7 shows that the fault-tree methodology’s exceedance probabilities are greater
than those calculated using ATC-13 and HAZUS methods by approximately a factor of 2. Given
that HAZUS-MH probability distributions draw partly on ATC-13’s restoration curves, it is not
surprising that the HAZUS and ATC-13 curves yield similar results. Figure 7 indicates that there
is a 5% probability that within the next 50 years, the facility will experience approximately 3
months of downtime based on the HAZUS and ATC-13 calculations. It also shows there is a
2.5% probability that within the next 50 years, the facility will experience greater than
approximately 6 months of downtime based on the HAZUS and ATC-13 calculations. Both of
these results are about half of those calculated using the fault-tree methodology.
Validation Considering the Earthquake Experience of the Facility
An additional means of validation can be drawn from Figure 6, which shows a 46% chance
that the facility will experience no more than 2.5 hours of downtime in the next 50 years;
this is nearly zero downtime. The data center investigated in this case study has not
experienced any downtime due to an earthquake in the 20 years it has been operating. Its
20 years without earthquake-induced downtime therefore fit within the 50-year exceedance
probability distribution of downtime.
CONCLUSIONS
The fault-tree method created by Porter et al. [6] to determine the probabilistic downtime
of a facility due to an earthquake was applied to a case study as a means of validation of the
methodology. 50-year exceedance probabilities of downtime were calculated using the fault-tree
methodology for a computer data center and compared with results generated by both HAZUS-
MH [3] and ATC-13 [7] downtime calculation methods for a facility of the same broad class as
the data center. Both the HAZUS-MH and ATC-13 methods are widely accepted as viable means
for calculating downtime. The fault-tree results were also submitted to an expert familiar with
the design and operation of the data center for a sniff test. The facility operator was asked to
judge whether or not the fault-tree’s downtime estimates seemed credible. The expert judged that
the downtime probabilities presented were overestimated by about a factor of two, at least for
one point on the Figure 6 curve. This overestimation of approximately a factor of 2 was also seen
in the comparison with the HAZUS-MH [3] and ATC-13 [7] downtime estimates. Both the
HAZUS-MH and ATC-13 estimates were based on a generalized building class assumed to have
features similar to those of the data center examined in the case study. It is therefore
reasonable to assume that these models are somewhat less accurate than the fault-tree
methodology, since the fault-tree methodology incorporates component- and event-specific
downtime estimates and the fault tree is specific to the facility to which it is applied. The
factor-of-two difference between the fault-tree method's downtime estimates and both the
accepted methods' results and the expert's opinion falls within the factor-of-two accuracy
range of the fault-tree method, as seen in the coefficient of variation of the case study's
results. This
provides validation for the fault-tree methodology. Additional support for the fault-tree
methodology is given by the past earthquake experience of the data center investigated in the
case study. The facility’s 20 years of operation without experiencing downtime fits within the
50-year exceedance probability distribution of downtime. The case study performed here
provides an initial indication that the fault-tree methodology created by Porter et al. [6] is a valid
means of determining downtime of a facility due to earthquake shaking. To be considered an
acceptable model in the business and engineering community, additional case studies and
comparisons should be performed. The accuracy of the methodology will also greatly improve as
more component-specific fragility functions and downtime estimations become available.
This study is limited in several ways. The accuracy of the fault-tree is dependent on the
component-specific estimations of the median downtime and log standard deviation of
downtime. Within this study, many of the components’ downtime estimations were made based
on experimental testing and research. However, a number of the components’ downtime
estimations were based on judgment. As more experimentation and research are performed in the
area of component-specific downtime estimation, more precise downtime estimates can be made.
Furthermore, the more simulations that are conducted, the more precise the downtime estimates
will be as a function of PGA. The fault-tree methodology assumes that repairs begin immediately
following an earthquake and proceed simultaneously; that is, repairs are not performed
sequentially, and there is enough manpower to begin repairing all components at the same time.
The fault-tree method also assumes that the failure of one component is independent of the
failure of any other component, and that the repair of one component is independent of the
repair of other components.
REFERENCES CITED
[1] (ATC) Applied Technology Council, 2011. ATC-58: Guidelines for Seismic Performance Assessment of Buildings, 75% Draft. Redwood City, CA.
[2] Cabrera, C., 2011. Industrial facilities and business interruption. PEER Reconnaissance
Briefing on East Japan Earthquake, U.C. Berkeley. http://www.youtube.com/watch?v=3AO94vECi7U&feature=relmfu
[3] (NIBS and FEMA) National Institute of Building Sciences and Federal Emergency
Management Agency, 2009. Multi-hazard Loss Estimation Methodology, Earthquake Model, HAZUS®MH MR4 Technical Manual. Federal Emergency Management Agency, Washington, DC
[4] Porter, K.A., A.S. Kiremidjian, J.S. LeGrue, 2001. Assembly-based vulnerability of buildings and its use in performance evaluation. Earthquake Spectra, 17 (2), 291-312.
[5] Porter, K.A., R. Sherrill, 2011. Utility performance panels in the ShakeOut scenario. Earthquake Spectra, 27 (2), 1-20.
[6] Porter, K.A., K. Torisawa, H. Ishida, M. Miyamura, in review. A performance-based earthquake engineering method to estimate downtime using fault-trees. Submitted for publication to Earthquake Engineering & Structural Dynamics, April 2011.
[7] Rojahn, C., R.L. Sharpe, 1985. ATC-13, Earthquake Damage Evaluation Data for California. Applied Technology Council, Redwood City, CA, 492 pp.
[8] SPA Risk LLC, 2011. Update of Earthquake Risk Analysis for Three Facilities. Denver, CO.
[9] Tierney, K.J., 1995. Impact of recent U.S. disasters on businesses: the 1993 Midwest floods and the 1994 Northridge earthquake. University of Delaware Disaster Research Center, Newark DE, 53 pp.
[10] Jordan, T.H., E.H. Field, and P. Somerville, 2006. USC-SCEC/CEA Technical Report #4
Part A: Earthquake Rate Model 2.0 for Milestone 1b. University of Southern California, Southern California Earthquake Center, Los Angeles, CA, 108 pp.
[11] Campbell K.W. and Y. Bozorgnia, 2008. NGA ground motion model for the geometric
mean horizontal component of PGA, PGV, PGD and 5% damped linear elastic response spectra for periods ranging from 0.01 to 10s. Earthquake Spectra 24 (1), 139-171.
APPENDIX
Figure 8 – Data Center Fault-Tree
Figure 8 – Data Center Fault-Tree (cont.)
Figure 8 – Data Center Fault-Tree (cont.)
Table 1 – Event directory and component capacities
Event Name | Existing Units | Required Units | Median Capacity (θ) | Log StDev of Capacity (β)
Data Center is red-tagged | 1 | 1 | 0.53 | 0.43
Server Racks Fail | 21 | 11 | 0.73 | 0.44
Display Console Fails | 1 | 1 | 3 | 0.25
Telephone Exchange Server Fails | 5 | 5 | 1.3 | 0.4
Cable Tray Fails | 1 | 1 | 99 | 0.1
Offsite Telecommunications Fail | 1 | 1 | 0.29 | 0.55
Workstation Desks Fail | 4 | 1 | 1 | 0.5
Network Switch Racks Fail | 6 | 6 | 1.3 | 0.4
Microwave Switch Racks Fail | 50 | 50 | 1.3 | 0.4
Telecom Switches in Racks Fail | 12 | 12 | 1.8 | 0.4
Offsite Water Pipes Fail | 1 | 0 | 0.7 | 0.6
Condenser Fans Fail | 4 | 2 | 4.82 | 0.6
Air Handlers Fail | 14 | 10 | 1.4 | 0.6
Exhaust Fan Fails | 1 | 1 | 1.4 | 0.6
Heat Exchangers Fail | 4 | 4 | 3 | 0.5
Conflagration | 1 | 1 | 150 | 1.6
Hazardous Material Release | 1 | 1 | 1.62 | 0.6
Suspended Ceilings Collapse | 2 | 2 | 0.7 | 0.55
Raised Access Floors Collapse | 2 | 2 | 1.8 | 0.6
Transformers Fail | 3 | 3 | 3.05 | 0.6
Control Panels Fail | 2 | 1 | 3 | 0.4
Power Distribution Models Fail | 6 | 4 | 3.05 | 0.4
Switchgear and Breakers Fail | 1 | 1 | 2.4 | 0.4
Ignition Occurs | 1 | 1 | 9.79 | 1.22
Table 1 – Event directory and component capacities (cont.)
Event Name | Existing Units | Required Units | Median Capacity (θ) | Log StDev of Capacity (β)
Battery Racks Fail | 5 | 4 | 2.32 | 0.2
Rectifiers & Inverters Fail | 3 | 3 | 2.7 | 0.6
Switchgear Fails | 1 | 1 | 0.46 | 0.6
Power Transfer Equipment Fails | 2 | 1 | 3 | 0.4
Fuel Tank Fails | 1 | 1 | 3 | 0.25
Fuel Pipe Fails | 1 | 1 | 2.5 | 0.5
Muffler Fails | 1 | 1 | 99 | 0.1
Exhaust Duct Fails | 1 | 1 | 1.9 | 0.5
Emergency Generator Fails | 2 | 1 | 2 | 0.2
Day Tank and Pumps Fail | 2 | 1 | 0.8 | 0.5
Smoke Detectors Fail | 47 | 9 | 99 | 0.1
FCC Panel Fails | 1 | 1 | 3 | 0.4
Halon Tanks Fail | 2 | 2 | 3 | 0.25
Halon Hose or Diffuser Nozzles Fail | 4 | 2 | 99 | 0.1
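Assuming the median capacities (θ) and logarithmic standard deviations (β) in Table 1 define lognormal fragility functions, which is the standard interpretation in this methodology but should be confirmed against [6] before reuse, an event's occurrence probability at a given PGA can be sketched as:

```python
from math import log
from statistics import NormalDist

def p_failure(pga, theta, beta):
    """Assumed lognormal fragility: P[event occurs | PGA = pga] with
    median capacity theta (in g) and logarithmic standard deviation beta."""
    return NormalDist().cdf(log(pga / theta) / beta)

# "Data Center is red-tagged" row of Table 1: theta = 0.53 g, beta = 0.43.
# At a PGA equal to the median capacity, the failure probability is 0.5.
p_red_tag = p_failure(0.53, 0.53, 0.43)
```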
Table 2 – Event repair times and their source
Event Name | Median Repair Time (q) | Log StDev of Repair Time (b) | Repair Time Source
Data Center is red-tagged | 180 | 1 | [6]
Server Racks Fail | 10 | 0.3 | judgment
Display Console Fails | 10 | 0.3 | judgment
Telephone Exchange Server Fails | 30 | 1 | [6]
Cable Tray Fails | 30 | 1 | judgment
Offsite Telecommunications Fail | 3 | 1 | judgment
Workstation Desks Fail | 3 | 0.3 | judgment
Network Switch Racks Fail | 10 | 0.3 | judgment
Microwave Switch Racks Fail | 30 | 1 | judgment
Telecom Switches in Racks Fail | 10 | 0.3 | judgment
Offsite Water Pipes Fail | 10 | 1 | [6]
Condenser Fans Fail | 7 | 0.3 | [1]
Table 2 – Event repair times and their source (cont.)
Event Name | Median Repair Time (q) | Log StDev of Repair Time (b) | Repair Time Source
Air Handlers Fail | 4 | 0.3 | [1]
Exhaust Fan Fails | 7 | 0.3 | [1]
Heat Exchangers Fail | 4 | 0.3 | [1]
Conflagration | 10 | 1 | [6]
Hazardous Material Release | 1 | 1 | [1]
Suspended Ceilings Collapse | 3 | 1 | [6]
Raised Access Floors Collapse | 3 | 2 | [6]
Transformers Fail | 25 | 0.3 | [1]
Control Panels Fail | 8 | 0.3 | [1]
Power Distribution Models Fail | 20 | 0.3 | [1]
Switchgear and Breakers Fail | 20 | 0.3 | [1]
Ignition Occurs | 10 | 1 | judgment
Battery Racks Fail | 27 | 0.3 | judgment
Rectifiers & Inverters Fail | 2 | 0.3 | judgment
Switchgear Fails | 3 | 1 | [1]
Power Transfer Equipment Fails | 8 | 0.3 | [1]
Fuel Tank Fails | 3 | 2 | [6]
Fuel Pipe Fails | 3 | 1 | judgment
Muffler Fails | 10 | 1 | judgment
Exhaust Duct Fails | 3 | 2 | [6]
Emergency Generator Fails | 2 | 0.3 | [1]
Day Tank and Pumps Fail | 3 | 1 | judgment
Smoke Detectors Fail | 3 | 1 | judgment
FCC Panel Fails | 8 | 0.3 | [3]
Halon Tanks Fail | 10 | 0.5 | [6]
Halon Hose or Diffuser Nozzles Fail | 10 | 0.5 | [6]
Figure 9 – 50-year exceedance probability for earthquake shaking intensity
\[
b_T(s) = \sqrt{\ln\left(1 + d^2\right)} \tag{7}
\]
\[
q_T(s) = \frac{m}{\sqrt{1 + d^2}} \tag{8}
\]
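Equations 7 and 8 are the standard moment conversion from the mean m and coefficient of variation d of a lognormal variable to its logarithmic standard deviation and median; a minimal sketch with illustrative values:

```python
from math import sqrt, log, exp

def lognormal_params(m, d):
    """Equations 7 and 8: logarithmic standard deviation b_T and median
    q_T of a lognormal downtime with mean m and coefficient of variation d."""
    b = sqrt(log(1.0 + d ** 2))   # Equation 7
    q = m / sqrt(1.0 + d ** 2)    # Equation 8
    return q, b

# Round trip with illustrative values: the lognormal mean
# q * exp(b**2 / 2) recovers m exactly.
q, b = lognormal_params(30.0, 1.5)
```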