VALIDATION OF A FAULT-TREE DOWNTIME
METHODOLOGY THROUGH A CASE STUDY
by
KYLE NORMAN RAMER
B.S., Bucknell University, 2010
A Master’s Report submitted to the
Faculty of the Graduate School of the
University of Colorado in partial fulfillment
of the requirements for the degree of
Master of Science in Civil Engineering
Department of Civil, Environmental, and Architectural Engineering
2011
This Master’s Report entitled: Validation of a Fault-Tree Downtime Methodology through a Case Study
Written by Kyle Norman Ramer has been approved for the Department of Civil, Environmental, and Architectural Engineering
Keith Porter
Abbie Liel
Franck Vernerey
Date: 12/06/2011
The final copy of this Master’s Report has been examined by the signatories, and we find that both the content and the form meet acceptable presentation standards
of scholarly work in the above-mentioned discipline.
ABSTRACT
Ramer, Kyle Norman (M.S., Civil Engineering)
Validation of a Fault-Tree Downtime Methodology through a Case Study
Master’s Report directed by Associate Research Professor Keith Porter
The fault-tree methodology created by Porter et al. [6] was applied to a case study as a means of validation for its use in calculating the probabilistic downtime of a facility caused by an
earthquake event. The fault-tree analysis calculates downtime using both the probability of a
building being non-operational due to the damage state of individual components within the
building and component repair-time distributions conditioned on their damage state. Downtime
estimates were calculated using the fault-tree methodology for a data center and compared with
restoration-time functions for the most-similar classes of facility in both HAZUS-MH and ATC-
13. The fault-tree results were also submitted to a senior operator at the data center in question, as a
sanity check or sniff test of the results. The operator judged that the downtime duration for a
particular probability level seemed to overestimate restoration time by about a factor of two. This
overestimation of approximately a factor of 2 was also seen in the comparison with the HAZUS-
MH and ATC-13 downtime estimates. Whether the fault-tree method overestimates downtime,
or the expert and generic sources underestimate it, or both, is unknown. A factor of two seems to
be within a reasonable range for the accuracy of the fault-tree method. The case study performed here thus provides an initial indication that the fault-tree methodology created by Porter et al. [6] is a valid means of determining the downtime of a facility due to earthquake shaking.
CONTENTS
INTRODUCTION
REVIEW OF LITERATURE
METHODOLOGY
Component Failure Probability Calculation
Component Repair Time Calculation
Fault-Tree Implementation
DATA CENTER CASE STUDY
Component Failure Probability Calculation
Component Repair Time Calculation
Fault-Tree Implementation
Exceedance Probability Calculation
CASE STUDY VALIDATION
Sniff Test Validation
Cross-Validation
Validation Considering the Earthquake Experience of the Facility
CONCLUSIONS
REFERENCES CITED
APPENDIX
LIST OF TABLES
Table 1 – Event directory and component capacities
Table 2 – Event repair times and their source
LIST OF FIGURES
Figure 1 – Fault-Tree logic applied to downtime through an and gate
Figure 2 – Fault-Tree logic applied to downtime through an or gate
Figure 3 – Probability of data center failure
Figure 4 – Facility downtime estimation
Figure 5 – Coefficient of variation of downtime estimation
Figure 6 – 50-year exceedance probability of the facility
Figure 7 – Downtime comparison of Fault-Tree, HAZUS-MH, and ATC-13 50-year exceedance probabilities
Figure 8 – Data Center Fault-Tree
Figure 9 – 50-year exceedance probability for earthquake shaking intensity
INTRODUCTION
A second generation of performance-based earthquake engineering (PBEE-2) has
developed largely in the last decade, with the objective of estimating probabilistic future seismic
performance of buildings and other facilities in terms of repair costs, fatalities, and loss of
functionality (dollars, deaths, and downtime). Most recently, ATC-58 [1], a project funded by
FEMA to develop professional guidelines for carrying out PBEE-2 analysis, has placed most of
the development emphasis on estimating repair costs. ATC-58 estimates probabilistic repair costs
using building-specific structural analysis outputs generated from a suite of earthquake ground-
motion time histories along with building-component fragility functions and conditional
probability distributions of repair cost given a damage state. Downtime in
ATC-58-style PBEE-2 is estimated by dividing repair cost by a burn rate (repair expenditures per day). This
approach does not attempt to resolve the point at which a facility becomes operational again,
only the point in time when all repairs are finished.
An alternative to a PBEE-2 approach is catastrophe risk modeling, a discipline that
considers buildings as independent and identical samples of building categories typically defined
in terms of structural material, lateral-force-resisting system, height, and building-code era. The
National Institute of Building Sciences and Federal Emergency Management Agency [3] have
developed the HAZUS-MH software (the leading public-sector catastrophe-risk model) that
estimates downtime as a function of probabilistic overall damage state of the structural portion of
a building, and of the building’s occupancy category. While catastrophe-risk models seem to
give reasonable downtime estimates (in the opinion of their users), they do not resolve downtime
behavior at the level of individual buildings, nor explain behavior in engineering terms.
Porter et al. (2011) proposed a methodology for calculating downtime as the time from
when repairs begin until the time the building becomes operational [6]. This is accomplished by
combining component fragility functions (which estimate the probabilistic damage state of
individual building components as a function of shaking severity), component repair-time
distributions, i.e., probability distribution of component repair time conditioned on their damage
state, and fault-tree analysis (which estimates the probability of a building being non-operational
as a function of the damage state of the individual components). Examples of components are
individual beams, columns, walls, suspended ceilings, electrical transformers, and packaged air-
conditioning units.
In Porter et al. [6], the fault-tree methodology was applied to a hypothetical data center,
for illustration purposes, but not, so far, to a real facility, and it has not yet been validated in any
substantial way. The objective of this research is to apply Porter’s downtime methodology to a
real building, and to validate it by comparing its results to results generated by preexisting
methodologies and by conducting a sniff test of the results by soliciting the opinion of an expert
on the workings of the facility under investigation.
REVIEW OF LITERATURE
Downtime due to earthquake damage has been explored on a number of levels from the
reasons for losses to means of calculating downtime accurately for both general and specific
facility cases. Downtime is ultimately the result of the loss of function of components utilized by
a given business. Tierney [9] investigates how various component failures affected business losses by surveying businesses affected by the 1993 Midwest floods and the 1994 Northridge earthquake. The survey was given to businesses of varying sizes in regions affected by the disasters, and the results outline the reasons businesses experienced losses
due to the disasters, which were lifeline service interruptions, physical damage, material flow
disruption, and loss of customers. These factors show the broad effect that an earthquake can
have on businesses. These losses that caused downtime ranged from direct interruption of a
business to losses contingent on other businesses’ interruption.
Cabrera [2] follows the same approach as Tierney [9], but focuses on industrial facilities
impacted by the March 2011 Tohoku earthquake in Japan. Cabrera [2] outlines the main drivers
behind both direct business interruption (downtime resulting from damage at the facility of
interest) and contingent business interruption (downtime at the facility of interest, but resulting
from damage elsewhere) in industrial facilities and how they create both business interruption
and lifeline interruption. Lifelines include electricity, water, telecommunications, etc. The
relationship between business interruption and lifeline interruption is particularly relevant
because it shows that the functionality of many businesses is dependent on other businesses and
infrastructure. In the Japanese facilities investigated, the majority of the damage was non-structural. The losses incurred were due to non-structural damage as well as lifeline and supply-
chain interruption. The findings of Cabrera [2] and Tierney [9] indicate that business-specific,
contingent, and lifeline components should be considered when evaluating potential losses and
downtime incurred on a facility.
Given the types of components that are considered when evaluating facility downtime,
downtime estimates for each of these components must be determined. One method of
determining the downtime estimates is through expert opinion. Porter and Sherrill [5] discuss
how expert panels judged impacts of a hypothetical M7.8 southern San Andreas Fault
earthquake. The results of the panels provide downtime assessments for various locations in
California, but more importantly the panel process provides a method for determining component
damage states following earthquakes when field data are unavailable or analytical methods are
impractical. The panel approach can be seen as a structured application of expert opinion to
estimate lifeline downtime in earthquakes.
Using component-specific downtime estimates, methodologies have been created to
estimate the downtime of complete facilities. Porter et al. [4] provide a methodology for
developing building-specific seismic vulnerability functions for a building, considering both
structural and non-structural building components. The methodology applies suites of ground
motion time histories to a probabilistic structural model to estimate probabilistic structural
response conditioned on shaking, and then uses component fragility functions to estimate the
probabilistic damage state of each building component. Repair duration is calculated by
assembling a simple Gantt chart for repair, treating component repair times as probabilistic and
conditioned on damage state. By Monte Carlo Simulation then, one can model ground motion,
structural response, damage, and repair duration.
ATC-13 [7] generated restoration curves for facilities damaged by earthquakes based on
the building’s loss of function by soliciting expert opinion. ATC-13 [7] generalizes facilities into
various usage classifications and associates example/typical equipment and contents that may be
present in each building classification to be used in the restoration estimation. Assuming that
repair/reconstruction follow ordinary construction schedules, experts estimated restoration time
for each facility classification at 30%, 60%, and 100% restoration levels and also gave their
experience level with each facility class. The restoration times were weighted based on the
experience level of each expert to develop functional restoration time curves based on the
facility’s loss of function.
More recently, the HAZUS-MH developers [3] produced software to estimate regional
earthquake losses throughout the United States. Their method applies ground-motion prediction
equations to an earthquake rupture forecast to estimate regional shaking, either probabilistically
or on a scenario basis. It contains an estimate of the building inventory on a census-block basis,
constructed from various proxies such as the US Census of Population and Housing. Damage to
the building stock is estimated using the calculated ground motions and the capacity spectrum
method of structural analysis to estimate structural response for various categories of
construction. Structural response is then input to fragility functions that estimate the probabilistic
damage state of the building stock. The fragility functions are created through engineering first
principles and probably expert opinion. Downtime is then estimated using conditional probability
distributions of downtime given damage state by occupancy class. The derivation of these
conditional probability distributions is vague, apparently drawing in part on ATC-13 [7].
A structural-engineering approach to estimating the impacts of earthquake on individual
buildings is offered by ATC-58, a FEMA-sponsored effort to produce professional guidelines for
applying 2nd-generation performance-based earthquake engineering. The guidelines work also
includes the PACT software [1] for implementing the methodology. This software takes as input
the structural response calculated by building-specific structural analyses for particular
earthquake ground motions. It calculates probabilistic damage and loss using built-in
fragility functions and conditional probability distributions of repair cost (conditioned on
damage). Additionally, PACT determines casualty levels, repair costs, and repair time for the
given building. The downtime calculated by PACT addresses repair time and ignores the time
between when the earthquake occurs and the start of the repairs. It does, however, generate
component-specific repair times.
Porter et al. [6] offer a method to estimate downtime for buildings after earthquakes. The
methodology applies fault-tree analysis to estimate the probability of a component being non-
operational at a given point in time after experiencing a specified level of excitation, as the product of
a fragility function (probability that a component will fail to operate as a function of input
excitation) and a conditional probability of the time to restore the component given that it fails.
The result is that, at the basic-event level, one estimates at any time after the earthquake the
probability that the component is nonfunctional. By “basic event” is meant the failure of a single
component, such as a single wall, column, or ceiling. By defining the top event as the facility
being nonfunctional, i.e., failing to perform some particular operation such as data processing,
the result is that one can calculate probabilistic downtime either by Monte Carlo Simulation or in
closed form. As noted earlier, it is this method that is tested here.
METHODOLOGY
Component Failure Probability Calculation
The downtime methodology created by Porter et al. [6], tested here for validation, bases
its downtime fault-tree logic on failure probabilities of individual components within the facility.
The failure probability of each component is generated by calculating component-specific failure
probabilities conditioned on the peak excitation to which the component is subjected.
“Excitation” can mean member forces or deformations, either in scalar or vector form. In current
practice it is usually parameterized as a scalar value, and most often the peak acceleration to
which a component is subjected or the peak interstory drift at the story level of the building
component.
The components considered here happen to be all acceleration-sensitive. For components
at ground level of a building, that means that the input excitation is peak ground acceleration
(PGA). Excitation for components above the ground floor can in principle be estimated from
structural analysis; here it is simply taken as PGA times an amplification factor: 2 for
components in the top 1/3rd of the building height, 1.5 for components in the middle 1/3rd of the
building height.
The calculation of component failure probability is performed using component-specific fragility functions, as seen in Equation 1. Each component, indexed by i, has a failure probability Pi, calculated using the excitation xi (here, acceleration), the median capacity of the
component, θi, and βi, the logarithmic standard deviation of capacity. Φ is the cumulative
standard normal distribution function. To simulate damage, one can draw a sample u of a
uniformly distributed random variable between 0 and 1; if u < Pi, the component is considered
damaged.
Pi = Φ( ln( xi / θi ) / βi ) (1)
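As a sketch, the fragility evaluation of Equation 1 and the damage draw can be written as follows. The median capacity and dispersion values in the example are illustrative only, not taken from the case-study facility:

```python
import math
import random

def failure_probability(x, theta, beta):
    """Equation 1: Pi = Phi(ln(xi / theta_i) / beta_i), a lognormal fragility.
    x     -- peak excitation (e.g., PGA in g) to which the component is subjected
    theta -- median capacity of the component (same units as x)
    beta  -- logarithmic standard deviation of capacity
    """
    z = math.log(x / theta) / beta
    # Phi(z), the standard normal CDF, expressed via the error function
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def is_damaged(x, theta, beta, rng=random):
    """Simulate damage: draw u ~ U(0, 1); the component is damaged if u < Pi."""
    return rng.random() < failure_probability(x, theta, beta)

# Illustrative values: a component with median capacity 1.0 g and beta = 0.4
p = failure_probability(0.5, 1.0, 0.4)  # failure probability at PGA = 0.5 g
```

At x equal to the median capacity the function returns 0.5 by construction, which is a quick check on any implementation.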
Component Repair Time Calculation
To calculate downtime, conditioned on a component being damaged (calculated using
Equation 1), it is assumed that repair time is lognormally distributed. The component’s
lognormally distributed repair time, ti (days), is simulated using Equation 2 where qi is the
median repair duration, bi is the logarithmic standard deviation of repair duration, and Φ‐1(ri) is
the inverse of the cumulative standard normal probability distribution evaluated at ri, a sample of a uniformly distributed random variable bounded by 0 and 1. If a component
is considered undamaged in a given simulation, the repair time is 0 days.
ti = qi · exp( bi · Φ‐1(ri) ) (2)
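A minimal sketch of the repair-time simulation of Equation 2, using the standard-library `statistics.NormalDist` for the inverse standard normal CDF (the numeric values are illustrative):

```python
import math
import random
from statistics import NormalDist

_INV_PHI = NormalDist().inv_cdf  # inverse of the standard normal CDF

def repair_time(q, b, r):
    """Equation 2: ti = qi * exp(bi * Phi^-1(ri)), a lognormal repair time.
    q -- median repair duration (days)
    b -- logarithmic standard deviation of repair duration
    r -- sample of a uniformly distributed random variable in (0, 1)
    """
    return q * math.exp(b * _INV_PHI(r))

def simulate_repair_time(damaged, q, b, rng=random):
    """Repair time is 0 days if the component is undamaged in this simulation."""
    if not damaged:
        return 0.0
    return repair_time(q, b, rng.random())

# Illustrative: median repair time of 10 days with dispersion 0.6
t = simulate_repair_time(True, 10.0, 0.6)
```

At r = 0.5 the sampled repair time equals the median q, since the inverse standard normal CDF is zero there.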
Fault-Tree Implementation
Each component’s probabilistic repair time for each PGA is applied to the fault-tree
logic’s Boolean operations. These Boolean operations consist of and and or gates. Within the
fault tree, a facility’s failed state is dependent on the failure of various systems. A failure of a
system will be referred to as an upper event (eu) and a failure of a component within a system
will be referred to as a lower event (el). If an upper event requires that all lower events associated
with it occur, an and gate is applied. If an upper event requires that only one lower event
associated with it occurs, an or gate is applied. The ability of a system (upper event) to be
operational, meaning the ability of a system to be taken out of its failed state, is also determined
based on the lower events connected to it. If an upper event requires all lower events associated
with it to fail (and gate), the upper event needs just one lower event to be taken out of its failed
state. In this case Equation 3 is applied to determine the downtime of an upper event, tu, because
the first lower event to be taken out of the failed state will take the upper event out of its failed
state. The repair time for a lower event n is referred to as tln.
tu = min (tl1, tl2, … tln) (3)
If an upper event requires just one of the lower events associated with it to fail (or gate),
all of the lower events associated with the upper event must be taken out of their failed states
for the upper event to be taken out of its failed state. In this case Equation 4 is applied.
tu = max (tl1, tl2, … tln) (4)
An and gate takes the minimum repair time of the components directly below the upper
event and an or gate takes the maximum repair time of the events directly below the upper event.
The fault-tree logic applied to downtime is displayed in Figure 1 and Figure 2. It is applied
throughout the facility’s systems until the facility’s total downtime is simulated relative to PGA.
Downtime is simulated repeatedly using Monte-Carlo simulation with all downtime simulations
averaged to determine the facility’s mean downtime for a given PGA.
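In code, the gate rules of Equations 3 and 4 reduce to min and max over the lower-event repair times. The small tree below is a hypothetical illustration, not the case-study fault tree:

```python
def and_gate(*lower_times):
    """AND gate: all lower events must fail for the upper event to fail, so the
    upper event exits its failed state as soon as the first lower event is
    repaired -- tu = min(tl1, ..., tln) (Equation 3)."""
    return min(lower_times)

def or_gate(*lower_times):
    """OR gate: any one lower-event failure fails the upper event, so every
    lower event must be repaired before the upper event recovers --
    tu = max(tl1, ..., tln) (Equation 4)."""
    return max(lower_times)

# Hypothetical tree: the facility fails if the power system OR the HVAC
# system fails; the power system has two redundant feeds (AND gate).
feed_a, feed_b, hvac = 12.0, 3.0, 7.0  # simulated repair times, days
power = and_gate(feed_a, feed_b)       # one repaired feed restores power
facility = or_gate(power, hvac)        # all lower systems must be repaired
```

Evaluating the gates bottom-up for each Monte Carlo draw yields one simulated facility downtime per draw.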
Figure 1 – Fault-Tree logic applied to downtime through an and gate
Figure 2 – Fault-Tree logic applied to downtime through an or gate
DATA CENTER CASE STUDY
Component Failure Probability Calculation
In Porter et al. [6], the fault-tree methodology was applied to a hypothetical facility to
illustrate the fault-tree’s functionality. However, until this research, the fault-tree methodology
has not been applied thoroughly. In this study, the methodology was applied to an actual building, in collaboration with and inspected by the facility operators, where downtime results could be compared to existing models or judged by the operators themselves or other experts. For this
research, as a first means of validation, the fault-tree methodology for downtime was applied to
an existing data center whose failure probability for varying PGA had already been determined using the same fault-tree logic.
The data center in question is operated by a southern California public power utility. The
facility is located in the San Gabriel Valley of Los Angeles County, and houses the data
processing and telecommunications equipment that controls the power grid. There are probably
dozens, maybe hundreds or more, such facilities around the United States, but in general the
equipment in the facility is similar to probably thousands or more computer data centers around
the U.S.
The site of the data center was the subject of a 2011 geotechnical investigation. It
included two cone penetrometer tests (CPT) that reached approximately 90 ft below grade, and
produced estimates of average shearwave velocity in the top 30m of soil of 450 m/sec to 480
m/sec. No groundwater was encountered, indicating low potential for liquefaction or lateral
spreading. There are no known faults at the site. The implication is that the principal seismic
hazard at the site is shaking produced by rupture of faults in the region. Some regional faults are
capable of producing earthquakes of M7.8.
The data center building is a 2-story, 50,000 square foot, generally rectangular (190 ft x
130 ft) reinforced concrete shearwall structure, built in the late 1950s and expanded to its current
configuration in the 1980s. Data-processing operations happen in 3 computer rooms with raised
access floors and suspended ceilings, served by power and telecommunications equipment in
several additional rooms. Air conditioning equipment is located on the roof and a pad outside
the building. A complete list of the components that contribute to the functionality of the facility
is provided in the appendix. The list was produced in collaboration with the operators of the
facility, as part of a consulting contract with the engineering risk consulting firm SPA Risk LLC. Prof. Porter is a principal of that firm, and provided the data used here.
Confidentiality issues prevent revealing the exact location of the facility or the name of the
operator.
The data center’s probability of operational failure (meaning at least some data-
processing operations cease) was calculated using component fragility functions by applying
Equation 1 and the fault-tree logic displayed in a flow chart in Figure 8 in the Appendix. Figure
3 shows the probability of the facility being in a non-operational state for a range of PGA.
Figure 3 – Probability of data center failure
It should also be noted that the data center’s failure probability distribution utilized in the
downtime calculations was based on the “as-is” condition of the building. The failure probability
calculations also accounted for redundancy in the components, e.g., multiple air conditioning
units, only some of which are required to be operating. If there were more of a component than
was required, a binomial distribution was applied to the component’s fragility function, treating the n installed units as independent trials; the component group fails when more than m units fail, where n is the number of component units and m is the difference between the number of component units and required component units.
Appendix Table 1 displays a list of the events applied to the data center’s fault-tree as well as
their median capacity, logarithmic standard deviation of capacity, and the number of required
and actual units of the component.
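A sketch of the redundancy adjustment described above: with n installed units, each failing independently with the per-unit fragility probability p, the group can tolerate up to m = n − required failures, so it fails when more than m units fail. The unit counts and probability below are illustrative:

```python
from math import comb

def group_failure_probability(p, n, required):
    """Probability that a redundant component group fails, given per-unit
    failure probability p, n installed units, and `required` units needed
    for operation. Up to m = n - required failures are tolerated; the group
    fails when more than m of the n independent units fail (binomial tail)."""
    m = n - required
    return sum(comb(n, k) * p**k * (1.0 - p)**(n - k)
               for k in range(m + 1, n + 1))

# Illustrative: 4 air-conditioning units installed, 2 required, each with a
# 20% failure probability at a given PGA
p_group = group_failure_probability(0.2, 4, 2)
```

With no redundancy (n equal to the required count) the group probability reduces to the binomial tail starting at one failure, and with spare units it is strictly smaller than the per-unit probability.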
Component Repair Time Calculation
For each event listed in Appendix Table 1, a repair time was calculated for a range of
PGA using Equation 2; a nonzero repair time was simulated only if a sample of a uniformly distributed
random variable bounded by 0 and 1, drawn for a given component, was less than that
component’s calculated probability of failure at that PGA. For computational purposes, PGA
ranged from 0.01g to 2.00g in increments of 0.01g. Parameters of the fragility functions and
downtime distributions were determined either by ATC-58 [1], HAZUS-MH [3], Porter et al. [6],
or by judgment. Because of uncertainty, all median repair time estimates are rounded to one
significant figure. Appendix Table 2 shows a list of all events along with their median repair
time, logarithmic standard deviations of repair time, and the source of those statistics. All
components listed are considered non-structural components.
Fault-Tree Implementation
With the repair times probabilistically determined for each component over the range of
PGA, the fault-tree methodology was applied to calculate downtime. The facility’s downtime
distribution over a range of PGA, meaning the amount of time for the facility to become
operational, was calculated to be the time required to repair all equipment systems, address
hazardous-material release or wait out conflagrations, or remove the facility’s red-tag label. A
facility is red-tagged if it is deemed structurally unsafe. (Porter performed a structural analysis of
the facility to estimate its red-tag capacity; details are not provided here.) Equipment systems are
considered to be repaired when all of the following occur: any uncontrolled building fire (inside
the building) is stopped and all burnt components are repaired, the building support systems
become operational, and the grid control and telecommunication systems become operational.
In the case of the building support systems, an or gate connects both its failure and repair
to the four lower systems: heating, ventilation, and air conditioning (HVAC), the power system,
raised access floors supporting data-processing equipment, and the suspended ceilings above
data-processing equipment. This means that the building support system fails if any of those four
lower systems fail. It also means that the building support systems become operational when all
of its four lower systems are repaired. In mathematical terms, the time required for the building
support systems to become operational is the maximum repair time of the HVAC, the power
system, the raised access floors, and the suspended ceilings.
One of the events required for the equipment systems to become operational is the
stopping of an uncontrolled building fire. An uncontrolled building fire occurs if the building
ignites and the fire response system fails (and gate). This means the downtime due to an
uncontrolled building fire is the minimum of the restoration time of the fire response system and
the time needed for the fire to go out on its own.
The facility’s downtime was simulated 1000 times for each PGA tested, and the simulations were averaged to create an expected value of downtime along with a coefficient of variation.
Figure 4 shows the facility’s downtime estimation for various PGA and Figure 5 shows the
coefficients of variation for the downtime estimates. The coefficients of variation (the standard
deviation of downtime duration divided by the mean value, for a given level of PGA) vary
between about 1.0 and 2.0. This means that the downtime is uncertain within a factor of about 2
for PGA greater than 0.5g. At lower magnitudes of PGA, the accuracy of the downtime estimations is very low based on the coefficients of variation and changes significantly relative to PGA. This is due to the signal-to-noise ratio of the PGA versus the damage state. At lower PGA, the
probabilities of failure for components and the facility are very low. In the rare case where the
component enters a damaged state at a low PGA and thus experiences downtime due to low
PGA, an increased error in the downtime estimation occurs for the entire facility.
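The averaging step can be sketched as follows: at each PGA, many downtime simulations are drawn and their mean and coefficient of variation (standard deviation divided by mean) are reported. The simulator below is a hypothetical stand-in for the full fault-tree simulation, with illustrative parameters:

```python
import random
from statistics import mean, pstdev

def mean_and_cov(simulate_downtime, n_sims=1000, rng=None):
    """Run n_sims downtime simulations and return (mean downtime,
    coefficient of variation), where CoV = standard deviation / mean."""
    rng = rng or random.Random(0)  # seeded for reproducibility
    samples = [simulate_downtime(rng) for _ in range(n_sims)]
    m = mean(samples)
    return m, (pstdev(samples) / m if m > 0 else float("nan"))

# Hypothetical stand-in for one fault-tree downtime simulation at a fixed
# PGA: the facility is damaged with probability 0.3; if damaged, downtime
# scatters around a median of 30 days.
def toy_simulation(rng):
    if rng.random() >= 0.3:
        return 0.0
    return 30.0 * (2.0 ** rng.gauss(0.0, 1.0))

m, cov = mean_and_cov(toy_simulation)
```

Because many low-PGA draws return zero downtime, the mixture of zeros and occasional large values drives the CoV up, mirroring the low-PGA behavior described above.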
Figure 4 – Facility downtime estimation
Figure 5 – Coefficient of variation of downtime estimation
Exceedance Probability Calculation
To better understand the downtime estimations displayed in Figure 4, the 50-year
exceedance probability of the facility’s downtime was calculated as a function of downtime. That
is to say, the probability calculated is that the facility’s actual downtime due to earthquake
shaking experienced in the next 50 years exceeds a specified value. This probability is
determined by first calculating the probability that the facility will not be operational for at least
the duration, t, given a shaking intensity, s, from an earthquake. This is shown in Equation 5.
P[T ≥ t | S = s, &] = 1 − Φ( ln( t / qT(s) ) / bT(s) ) (5)
Equation 5 utilizes the median downtime, qT(s), and the logarithmic standard deviation of
downtime, bT(s), given that the uncertain shaking intensity, S, takes on a particular value, s, and
given a number of other conditions abbreviated by “&.” These additional conditions include the
facility location, a mathematical model of its design and reliance on equipment and other
components, a particular model of the regional seismicity, a particular model of the ground
motion intensity given the occurrence of an earthquake, and other parameters. The coefficients
qT(s) and bT(s) must be derived from both the mean downtime m and the coefficient of variation
of downtime d. Coefficients bT(s) and qT(s) are calculated using Equation 7 and Equation 8
respectively in the Appendix. The result from Equation 5 is integrated with the absolute value of
the first derivative with respect to s of the hazard curve, which gives the probability that an
earthquake will occur causing shaking intensity to exceed a particular value s at least once during
the next 50 years. This integral is shown in Equation 6 where & is as defined above, T is
uncertain earthquake-induced downtime, and t is a particular value of T.
\[
P[T \ge t \mid \&] = \int_{s=0}^{\infty} P[T \ge t \mid S = s, \&]\,\left|\frac{dG(s)}{ds}\right| ds \tag{6}
\]
in which G(s) denotes the hazard curve described above.
The second term on the right side of Equation 6 is the seismic hazard at the site, i.e., here
the probability (or almost equivalently, the occurrence frequency) of shaking exceeding a
particular value of s, here measured in terms of peak ground acceleration. Seismic hazard for the
site was provided by Keith Porter, who used the USGS/USC software OpenSHA Seismic Hazard
Calculator ver 1.2.2. The software takes as input a facility location, earthquake rupture forecast,
ground-motion prediction equation, and time period of interest. In the present case, Keith Porter
used the Uniform California Earthquake Rupture Forecast ver 2.0 [10], Campbell and Bozorgnia's
next-generation attenuation relationship [11], and a 50-year period.
For computational purposes, the integration performed in Equation 6 was done
numerically. Figure 9 in the Appendix shows the 50-year exceedance probability for various
shaking intensities and Figure 6 shows the 50-year exceedance probability of the facility’s
downtime.
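The numerical integration described above can be sketched as follows. The hazard-curve points and the downtime model below are hypothetical placeholders, not the OpenSHA output or the fitted functions used in the study:

```python
from math import log
from statistics import NormalDist

def p_downtime_exceeds(t, q_t, b_t):
    """Equation 5: lognormal survival function of downtime."""
    return 1.0 - NormalDist().cdf(log(t / q_t) / b_t)

# Hypothetical 50-year hazard curve: (PGA s in g, P[shaking >= s in 50 yr]).
hazard = [(0.1, 0.50), (0.3, 0.20), (0.5, 0.08), (0.7, 0.03), (0.9, 0.01)]

def p_exceed_50yr(t, median_fn, beta_fn):
    """Equation 6: integrate Equation 5 against |dG/ds| of the hazard
    curve G(s), approximated here with midpoint finite differences, so
    each interval contributes P[T >= t | s_mid] * |G(s1) - G(s0)|."""
    total = 0.0
    for (s0, g0), (s1, g1) in zip(hazard, hazard[1:]):
        s_mid = 0.5 * (s0 + s1)
        total += p_downtime_exceeds(t, median_fn(s_mid), beta_fn(s_mid)) * abs(g1 - g0)
    return total

# Hypothetical downtime model: median downtime grows linearly with PGA.
p_six_months = p_exceed_50yr(180.0, lambda s: 200.0 * s, lambda s: 1.0)
```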
Figure 6 – 50-year exceedance probability of the facility's downtime
Figure 6 indicates, for example, that there is a 5% probability that within the next 50 years
the facility will experience enough earthquake damage to incur a downtime of at least six
months. This is almost equivalent to saying that, with 0.1% probability, an earthquake in 2012
could cause enough damage to render the facility inoperative for at least 6 months.
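The near-equivalence of the 50-year and annual statements follows from the usual assumption of independent years; a quick arithmetic check:

```python
# A 5% probability in 50 years corresponds, assuming independent years,
# to an annual probability of 1 - (1 - 0.05)**(1/50), which is
# approximately 0.001, i.e., roughly 0.1% per year.
p_50yr = 0.05
p_annual = 1.0 - (1.0 - p_50yr) ** (1.0 / 50.0)
```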
CASE STUDY VALIDATION
Sniff Test Validation
The exceedance probability results were submitted to an operator of the facility who had
participated in creating and inspecting the fault tree for this facility. This operator was asked to
judge whether these results seemed credible. After examining the downtime probabilities
calculated using the fault-tree methodology, the operator judged that the downtime was
slightly overestimated. Based on his knowledge of the facility, he reasoned that an earthquake
within 2012 could conceivably cause 3 months of downtime rather than the 6 months calculated
using the fault-tree methodology as having 0.1% occurrence probability. To be precise, he wrote,
“I cannot imagine an earthquake in 2012 would cause 6 months of outage time. Taking the
uncertainty factor into consideration, a 3 month outage seems more within the realm of
possibility.” This feedback seems like neither a strong endorsement nor a strong denunciation of
the credibility of the model, but something in between, as if the curve is plausible if shifted to the
left by a factor of 2.
Cross-Validation
As a further check on the downtime estimates calculated using the fault-tree
methodology, similar exceedance probability calculations were performed using the
HAZUS-MH downtime calculation method [3] as well as the ATC-13 downtime calculation
method [7]. Identical 50-year exceedance probabilities of earthquake shaking were utilized in the
calculations. The results generated by these calculations are based on generalized buildings of
the same class as the data center under examination. For the HAZUS-MH method, the
generalized building was model building type 19.1 (low-rise reinforced concrete shearwall
facility with a High-Code seismic design level) with occupancy class 16 (Banks and Financial
Institutions) [3]. The ATC-13 method utilized a class 6 facility (low-rise reinforced concrete
shearwall facility) and a class 16 social function (industrial/high technology) [7]. The
coefficient of variation of downtime used in Equation 5 for the HAZUS method was approximated
by fitting a curve to ATC-13's coefficient of variation of downtime. Figure 7 compares the
50-year exceedance probability estimates of the facility's downtime from the fault-tree
methodology, HAZUS, and ATC-13, where all three methods are subjected to the same 50-year
seismic hazard.
Figure 7 – Downtime comparison of Fault-Tree, HAZUS-MH, and ATC-13 50-year exceedance probabilities
Figure 7 shows that the fault-tree methodology’s exceedance probabilities are greater
than those calculated using ATC-13 and HAZUS methods by approximately a factor of 2. Given
that HAZUS-MH probability distributions draw partly on ATC-13’s restoration curves, it is not
surprising that the HAZUS and ATC-13 curves yield similar results. Figure 7 indicates that there
is a 5% probability that within the next 50 years, the facility will experience approximately 3
months of downtime based on the HAZUS and ATC-13 calculations. It also shows there is a
2.5% probability that within the next 50 years, the facility will experience greater than
approximately 6 months of downtime based on the HAZUS and ATC-13 calculations. Both of
these results are about half of those calculated using the fault-tree methodology.
Validation Considering the Earthquake Experience of the Facility
An additional means of validation can be drawn from Figure 6, which shows a 46% chance
that the facility will experience no more than 2.5 hours of downtime in the next 50 years;
this is nearly zero downtime. The data center investigated in this case study has not
experienced any downtime due to an earthquake in the 20 years it has been operating. Its
20 years without earthquake-induced downtime therefore fit within the 50-year exceedance
probability distribution of downtime.
CONCLUSIONS
The fault-tree method created by Porter et al. [6] to determine the probabilistic downtime
of a facility due to an earthquake was applied to a case study as a means of validation of the
methodology. 50-year exceedance probabilities of downtime were calculated using the fault-tree
methodology for a computer data center and compared with results generated by both HAZUS-
MH [3] and ATC-13 [7] downtime calculation methods for a facility of the same broad class as
the data center. Both the HAZUS-MH and ATC-13 methods are widely accepted as viable means
for calculating downtime. The fault-tree results were also submitted to an expert familiar with
the design and operation of the data center for a sniff test. The facility operator was asked to
judge whether or not the fault-tree’s downtime estimates seemed credible. The expert judged that
the downtime probabilities presented were overestimated by about a factor of two, at least for
one point on the Figure 6 curve. This overestimation of approximately a factor of 2 was also seen
in the comparison with the HAZUS-MH [3] and ATC-13 [7] downtime estimates. Both the
HAZUS-MH and ATC-13 estimates were based on a generalized building class assumed to have
features similar to those of the data center examined in the case study. It is therefore
reasonable to assume that these models are somewhat less accurate than the fault-tree
methodology, since the fault-tree methodology incorporates component- and event-specific
downtime estimates and the fault tree is specific to the facility to which it is applied. The
factor-of-two difference between the fault-tree method's downtime estimates and both the
accepted methods' results and the expert's opinion falls within the factor-of-two accuracy
range of the fault-tree method, as seen in the coefficient of variation of the case study's
results. This
provides validation for the fault-tree methodology. Additional support for the fault-tree
methodology is given by the past earthquake experience of the data center investigated in the
case study. The facility’s 20 years of operation without experiencing downtime fits within the
50-year exceedance probability distribution of downtime. The case study performed here
provides an initial indication that the fault-tree methodology created by Porter et al. [6] is a valid
means of determining downtime of a facility due to earthquake shaking. To be considered an
acceptable model in the business and engineering community, additional case studies and
comparisons should be performed. The accuracy of the methodology will also greatly improve as
more component-specific fragility functions and downtime estimations become available.
This study is limited in several ways. The accuracy of the fault-tree is dependent on the
component-specific estimations of the median downtime and log standard deviation of
downtime. Within this study, many of the components’ downtime estimations were made based
on experimental testing and research. However, a number of the components’ downtime
estimations were based on judgment. As more experimentation and research are performed in the
area of component-specific downtime estimation, more precise downtime estimates can be made.
Furthermore, the more simulations that are conducted, the more precise the downtime estimates
will be as a function of PGA. The fault-tree methodology assumes that repairs begin immediately
following an earthquake and proceed simultaneously; that is, repairs are not performed
sequentially, and there is enough manpower to begin repairing all components at the same time.
The fault-tree method also assumes that the failure of one component is independent of the
failure of any other component, and that the repair of one component is independent of the
repair of other components.
REFERENCES CITED
[1] (ATC) Applied Technology Council, 2011. ATC-58: Guidelines for Seismic Performance Assessment of Buildings, 75% Draft. Redwood City, CA.
[2] Cabrera, C., 2011. Industrial facilities and business interruption. PEER Reconnaissance
Briefing on East Japan Earthquake, U.C. Berkeley. http://www.youtube.com/watch?v=3AO94vECi7U&feature=relmfu
[3] (NIBS and FEMA) National Institute of Building Sciences and Federal Emergency
Management Agency, 2009. Multi-hazard Loss Estimation Methodology, Earthquake Model, HAZUS®MH MR4 Technical Manual. Federal Emergency Management Agency, Washington, DC
[4] Porter, K.A., A.S. Kiremidjian, J.S. LeGrue, 2001. Assembly-based vulnerability of buildings and its use in performance evaluation. Earthquake Spectra, 17 (2), 291-312.
[5] Porter, K.A., R. Sherrill, 2011. Utility performance panels in the ShakeOut scenario. Earthquake Spectra, 27 (2), 1-20.
[6] Porter, K.A., K. Torisawa, H. Ishida, M. Miyamura, in review. A performance-based earthquake engineering method to estimate downtime using fault-trees. Submitted for publication to Earthquake Engineering & Structural Dynamics, April 2011.
[7] Rojahn, C., R.L. Sharpe, 1985. ATC-13, Earthquake Damage Evaluation Data for California. Applied Technology Council, Redwood City, CA, 492 pp.
[8] SPA Risk LLC, 2011. Update of Earthquake Risk Analysis for Three Facilities. Denver, CO.
[9] Tierney, K.J., 1995. Impact of recent U.S. disasters on businesses: the 1993 Midwest floods and the 1994 Northridge earthquake. University of Delaware Disaster Research Center, Newark DE, 53 pp.
[10] Jordan, T.H., E.H. Field, and P. Somerville, 2006. USC-SCEC/CEA Technical Report #4
Part A: Earthquake Rate Model 2.0 for Milestone 1b. University of Southern California, Southern California Earthquake Center, Los Angeles, CA, 108 pp.
[11] Campbell K.W. and Y. Bozorgnia, 2008. NGA ground motion model for the geometric
mean horizontal component of PGA, PGV, PGD and 5% damped linear elastic response spectra for periods ranging from 0.01 to 10s. Earthquake Spectra 24 (1), 139-171.
APPENDIX
Figure 8 – Data Center Fault-Tree
Figure 8 – Data Center Fault-Tree (cont.)
Figure 8 – Data Center Fault-Tree (cont.)
Table 1 – Event directory and component capacities
Event Name | Existing Units | Required Units | Median Capacity (θ) | Log StDev of Capacity (β)
Data Center is red-tagged | 1 | 1 | 0.53 | 0.43
Server Racks Fail | 21 | 11 | 0.73 | 0.44
Display Console Fails | 1 | 1 | 3 | 0.25
Telephone Exchange Server Fails | 5 | 5 | 1.3 | 0.4
Cable Tray Fails | 1 | 1 | 99 | 0.1
Offsite Telecommunications Fail | 1 | 1 | 0.29 | 0.55
Workstation Desks Fail | 4 | 1 | 1 | 0.5
Network Switch Racks Fail | 6 | 6 | 1.3 | 0.4
Microwave Switch Racks Fail | 50 | 50 | 1.3 | 0.4
Telecom Switches in Racks Fail | 12 | 12 | 1.8 | 0.4
Offsite Water Pipes Fail | 1 | 0 | 0.7 | 0.6
Condenser Fans Fail | 4 | 2 | 4.82 | 0.6
Air Handlers Fail | 14 | 10 | 1.4 | 0.6
Exhaust Fan Fails | 1 | 1 | 1.4 | 0.6
Heat Exchangers Fail | 4 | 4 | 3 | 0.5
Conflagration | 1 | 1 | 150 | 1.6
Hazardous Material Release | 1 | 1 | 1.62 | 0.6
Suspended Ceilings Collapse | 2 | 2 | 0.7 | 0.55
Raised Access Floors Collapse | 2 | 2 | 1.8 | 0.6
Transformers Fail | 3 | 3 | 3.05 | 0.6
Control Panels Fail | 2 | 1 | 3 | 0.4
Power Distribution Models Fail | 6 | 4 | 3.05 | 0.4
Switchgear and Breakers Fail | 1 | 1 | 2.4 | 0.4
Ignition Occurs | 1 | 1 | 9.79 | 1.22
Table 1 – Event directory and component capacities (cont.)
Event Name | Existing Units | Required Units | Median Capacity (θ) | Log StDev of Capacity (β)
Battery Racks Fail | 5 | 4 | 2.32 | 0.2
Rectifiers & Inverters Fail | 3 | 3 | 2.7 | 0.6
Switchgear Fails | 1 | 1 | 0.46 | 0.6
Power Transfer Equipment Fails | 2 | 1 | 3 | 0.4
Fuel Tank Fails | 1 | 1 | 3 | 0.25
Fuel Pipe Fails | 1 | 1 | 2.5 | 0.5
Muffler Fails | 1 | 1 | 99 | 0.1
Exhaust Duct Fails | 1 | 1 | 1.9 | 0.5
Emergency Generator Fails | 2 | 1 | 2 | 0.2
Day Tank and Pumps Fail | 2 | 1 | 0.8 | 0.5
Smoke Detectors Fail | 47 | 9 | 99 | 0.1
FCC Panel Fails | 1 | 1 | 3 | 0.4
Halon Tanks Fail | 2 | 2 | 3 | 0.25
Halon Hose or Diffuser Nozzles Fail | 4 | 2 | 99 | 0.1
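Assuming the median capacities (θ) and logarithmic standard deviations (β) in Table 1 define lognormal fragility functions, which is the standard interpretation in this methodology but should be confirmed against [6] before reuse, an event's occurrence probability at a given PGA can be sketched as:

```python
from math import log
from statistics import NormalDist

def p_failure(pga, theta, beta):
    """Assumed lognormal fragility: P[event occurs | PGA = pga] with
    median capacity theta (in g) and logarithmic standard deviation beta."""
    return NormalDist().cdf(log(pga / theta) / beta)

# "Data Center is red-tagged" row of Table 1: theta = 0.53 g, beta = 0.43.
# At a PGA equal to the median capacity, the failure probability is 0.5.
p_red_tag = p_failure(0.53, 0.53, 0.43)
```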
Table 2 – Event repair times and their source
Event Name | Median Repair Time (q) | Log StDev of Repair Time (b) | Repair Time Source
Data Center is red-tagged | 180 | 1 | [6]
Server Racks Fail | 10 | 0.3 | judgment
Display Console Fails | 10 | 0.3 | judgment
Telephone Exchange Server Fails | 30 | 1 | [6]
Cable Tray Fails | 30 | 1 | judgment
Offsite Telecommunications Fail | 3 | 1 | judgment
Workstation Desks Fail | 3 | 0.3 | judgment
Network Switch Racks Fail | 10 | 0.3 | judgment
Microwave Switch Racks Fail | 30 | 1 | judgment
Telecom Switches in Racks Fail | 10 | 0.3 | judgment
Offsite Water Pipes Fail | 10 | 1 | [6]
Condenser Fans Fail | 7 | 0.3 | [1]
Table 2 – Event repair times and their source (cont.)
Event Name | Median Repair Time (q) | Log StDev of Repair Time (b) | Repair Time Source
Air Handlers Fail | 4 | 0.3 | [1]
Exhaust Fan Fails | 7 | 0.3 | [1]
Heat Exchangers Fail | 4 | 0.3 | [1]
Conflagration | 10 | 1 | [6]
Hazardous Material Release | 1 | 1 | [1]
Suspended Ceilings Collapse | 3 | 1 | [6]
Raised Access Floors Collapse | 3 | 2 | [6]
Transformers Fail | 25 | 0.3 | [1]
Control Panels Fail | 8 | 0.3 | [1]
Power Distribution Models Fail | 20 | 0.3 | [1]
Switchgear and Breakers Fail | 20 | 0.3 | [1]
Ignition Occurs | 10 | 1 | judgment
Battery Racks Fail | 27 | 0.3 | judgment
Rectifiers & Inverters Fail | 2 | 0.3 | judgment
Switchgear Fails | 3 | 1 | [1]
Power Transfer Equipment Fails | 8 | 0.3 | [1]
Fuel Tank Fails | 3 | 2 | [6]
Fuel Pipe Fails | 3 | 1 | judgment
Muffler Fails | 10 | 1 | judgment
Exhaust Duct Fails | 3 | 2 | [6]
Emergency Generator Fails | 2 | 0.3 | [1]
Day Tank and Pumps Fail | 3 | 1 | judgment
Smoke Detectors Fail | 3 | 1 | judgment
FCC Panel Fails | 8 | 0.3 | [3]
Halon Tanks Fail | 10 | 0.5 | [6]
Halon Hose or Diffuser Nozzles Fail | 10 | 0.5 | [6]
Figure 9 – 50-year exceedance probability for earthquake shaking intensity
\[
b_T(s) = \sqrt{\ln\left(1 + d^2\right)} \tag{7}
\]
\[
q_T(s) = \frac{m}{\sqrt{1 + d^2}} \tag{8}
\]
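Equations 7 and 8 are the standard moment conversion from the mean m and coefficient of variation d of a lognormal variable to its logarithmic standard deviation and median; a minimal sketch with illustrative values:

```python
from math import sqrt, log, exp

def lognormal_params(m, d):
    """Equations 7 and 8: logarithmic standard deviation b_T and median
    q_T of a lognormal downtime with mean m and coefficient of variation d."""
    b = sqrt(log(1.0 + d ** 2))   # Equation 7
    q = m / sqrt(1.0 + d ** 2)    # Equation 8
    return q, b

# Round trip with illustrative values: the lognormal mean
# q * exp(b**2 / 2) recovers m exactly.
q, b = lognormal_params(30.0, 1.5)
```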