Resilience and Survivability in
Communication Networks: Strategies, Principles, and Survey of Disciplines
J. P. G. Sterbenz, D. Hutchison, E. K. Cetinkaya, A. Jabbar, J. P. Rohrer, M. Schöller, and P. Smith,
Computer Networks: Special Issue on Resilient and Survivable Networks (COMNET), vol. 54, no. 8, pp. 1243–1342, June 2010.
Overview
▐ Past resilience failures
▐ Taxonomy
Related disciplines
From Challenge to Failure
▐ Strategy
Foundations
Principles
▐ The ResumeNet project
© NEC Corporation 2009Page 2
Past Incidents 1: Hinsdale
▐ 1988 Hinsdale Illinois Bell central office fire
100K customers lose service for weeks
also major disruptions in
• long distance
• 800
• 911
• cellular
• ATC for O’Hare
▐ Fault tolerance by redundancy not sufficient
▐ Resilience requires
spatially diverse redundancy
separation of infrastructures
Past Incidents 2: Hurricane Katrina
▐ Internet impact
Little impact on national Internet service
Significant impact on local Internet service [Renesys]
▐ Power grid fails
2.6M w/o power
New Orleans power out for a month
Restoration crews unavailable
Past Incidents 2: Hurricane Katrina
▐ Communication and network infrastructure
Insufficient battery and generator backup
Backup not robust (time duration and spatial diversity) [http://www.oe.netl.doe.gov/hurricanes_emer/katrina.aspx]
▐ Incompatible communications [http://www.livescience.com/technology/ap_050913_comm_breakdown.html]
New Orleans 1992 M/A-Com
LA 1996 Motorola
multiple incompatible federal systems
MS national guard used sneakernet
▐ New Orleans communication not survivable
Energy Center tower lost power
Backup power transformer taken out by glass shard
M/A-Com repair crews denied entry for 3 days by state police
▐ Amateur radio again critical
Past Incidents 3: YouTube hijack
▐ YouTube announces 208.65.152.0/22
▐ Pakistan’s government orders Pakistan Telecom to block YouTube
▐ Pakistan Telecom implements blocking with a rogue BGP advertisement
PT announces a more specific 208.65.153.0/24 of YouTube’s /22
Rogue route also advertised to routing peers
Within 2 minutes most of the DFZ carried the bad route
Most of the Internet goes to Pakistan for YouTube and gets nothing!
▐ YouTube recovers by announcing both the /24 and the two more specific /25s
▐ Finally Pakistan Telecom was disconnected by PCCW
http://www.renesys.com/blog/2008/02/pakistan_hijacks_youtube_1.shtml
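Routers forward on the longest matching prefix, which is why the rogue /24 captured traffic from the legitimate /22 and why the two /25s won it back. The following is a minimal sketch of that selection rule using Python's standard `ipaddress` module. The `best_route` helper, the route table layout, and the sample address are assumptions made for illustration; real BGP route selection also weighs path attributes that are not modelled here.

```python
import ipaddress


def best_route(table, address):
    """Return the origin whose prefix is the longest match for address."""
    addr = ipaddress.ip_address(address)
    matches = [(net, origin) for net, origin in table if addr in net]
    if not matches:
        return None
    # Longest-prefix match: the most specific covering prefix wins.
    return max(matches, key=lambda m: m[0].prefixlen)[1]


youtube_22 = ipaddress.ip_network("208.65.152.0/22")
hijack_24 = ipaddress.ip_network("208.65.153.0/24")

# Route table during the hijack: the /24 is more specific than the /22.
table = [(youtube_22, "YouTube"), (hijack_24, "Pakistan Telecom")]

victim = "208.65.153.238"  # illustrative address inside the hijacked /24
during = best_route(table, victim)

# Recovery step from the slide: announce the two /25s that cover the /24.
recovery = table + [(net, "YouTube") for net in hijack_24.subnets(prefixlen_diff=1)]
after = best_route(recovery, victim)

print(during)  # Pakistan Telecom
print(after)   # YouTube
```

Addresses in the rest of the /22 are unaffected throughout, since only the /22 covers them.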
Past Incidents 4: SuproNet misconfiguration
16 Feb 2009, 16:23:30 UTC
▐ SuproNet (AS 47868) announced 94.125.216.0/21 through AS 29113 with an overly long AS path
SuproNet's intention was AS path prepending
Admin used Cisco-style configuration syntax on a MikroTik router
47868 % 256 = 252
▐ AS paths longer than 255 ASNs triggered a Cisco IOS bug
No filtering of excessively long AS paths
Router reset the BGP session
But propagated the route
▐ Instability of announced networks rose from 0.56% to 4.76%
http://www.renesys.com/blog/2009/02/longer-is-not-better.shtml
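One reading of the slide's arithmetic is that the router kept only the low 8 bits of the value entered as a prepend count, so entering the AS number 47868 produced 252 prepends. The sketch below reproduces that arithmetic and shows the kind of maximum-path-length filter the slide says was missing. The 8-bit field width, the function names, the shape of the example path, and the limit of 50 are assumptions chosen for illustration, not vendor behaviour or recommended values.

```python
def wrapped_count(value, bits=8):
    """Keep only the low `bits` bits of a configured value (assumed field width)."""
    return value % (1 << bits)


def accept_as_path(as_path, max_len=50):
    """Hypothetical import filter: reject routes whose AS path is too long."""
    return len(as_path) <= max_len


ORIGIN_ASN = 47868    # SuproNet, from the slide
UPSTREAM_ASN = 29113  # upstream named on the slide

prepends = wrapped_count(ORIGIN_ASN)  # 47868 % 256

# Illustrative path as a neighbour of the upstream might see it.
long_path = [UPSTREAM_ASN] + [ORIGIN_ASN] * prepends
normal_path = [UPSTREAM_ASN, ORIGIN_ASN]

print(prepends)                     # 252
print(accept_as_path(long_path))    # False
print(accept_as_path(normal_path))  # True
```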
Past Incidents 5: DDoS Attack in Burma
▐ Attack on Burma’s main ISP (MPT)
▐ Connectivity to the country via T3 (45 Mbps) links disrupted for several days
Source: http://asert.arbornetworks.com/2010/11/attac-severs-myanmar-internet/
Challenges categorisation
▐ We identified a number of challenge classes:
1. Component Faults
2. Hardware destruction
3. Communication environment
4. Human mistakes
5. Malicious attacks
6. Unusual but legitimate demand for service
7. Failure of a provider service
Network resilience definition
“The ability of the network to provide and maintain an acceptable
level of service in the face of various faults and challenges.”
[ResiliNets]
Overview
▐ Past resilience failures
▐ Taxonomy
Related disciplines
From Challenge to Failure
▐ Strategy
Foundations
Principles
▐ The ResumeNet project
Challenge Tolerance
Related Disciplines – 1/11
▐ Challenge tolerance deals with the design and engineering of systems that continue to provide service in the face of challenges.
Related Disciplines – 2/11
▐ Disruption tolerance is the ability of a system to tolerate disruptions in connectivity among its components.
▐ Tolerance to environmental challenges:
Weak and episodic channels
Mobility
Delay tolerance
▐ Tolerance of power and energy constraints
[Figure: Challenge Tolerance taxonomy showing Disruption Tolerance, labelled Environmental, Energy, Delay, Mobility, Connectivity]
Related Disciplines – 3/11
▐ Traffic tolerance is the ability of a system
To tolerate unpredictable offered load without a significant drop in carried load (including congestion collapse)
To isolate the effects from cross traffic, other flows, and other nodes
▐ Traffic can either be unexpected but legitimate, such as from a flash crowd, or malicious, such as a DDoS attack.
[Figure: Challenge Tolerance taxonomy, adding Traffic Tolerance (legitimate, attack) to Disruption Tolerance]
Related Disciplines – 4/11
▐ Survivability is the capability of a system to fulfill its mission, in a timely manner, in the presence of threats such as targeted attacks or large-scale natural disasters resulting in many failures.
▐ Fault tolerance
A system survives a few random failures
[Figure: Challenge Tolerance taxonomy, adding Fault Tolerance and Survivability to Disruption Tolerance and Traffic Tolerance]
Trustworthiness
Related Disciplines – 5/11
▐ Trustworthiness
“Assurance that a system will perform as expected.”
Quantifiable behavior of the system
▐ IFIP 10.4
Related Disciplines – 6/11
▐ Security is the property of a system and measures taken such that it protects itself from unauthorized access or change
Confidentiality: “Dependability with respect to the absence of unauthorized disclosure of information”
Nonrepudiability: “Protection against false denial of involvement in an association (especially a communication association that transfers data)”
[Figure: Trustworthiness taxonomy showing Security with Nonrepudiability and Confidentiality]
Related Disciplines – 7/11
▐ Security is the property of a system and measures taken such that it protects itself from unauthorized access or change
Accountability: The property that ensures that the actions of an entity may be traced uniquely to that entity, which can then be held responsible for its actions.
Authenticity: “Property of being genuine and able to be verified and be trusted”
Authorisability: “An approval that is granted to a system entity to access a system resource.”
[Figure: Trustworthiness taxonomy showing Security with Nonrepudiability, Confidentiality, and AAA]
Related Disciplines – 8/11
▐ Security is the property of a system and measures taken such that it protects itself from unauthorized access or change
Availability: “Dependability with respect to the readiness for usage. Measure of correct service delivery with respect to the alternation of correct and incorrect service.”
Integrity: “Dependability with respect to the absence of improper alterations of information.”
[Figure: Trustworthiness taxonomy showing Security with Nonrepudiability, Confidentiality, AAA, Availability, and Integrity]
Availability
▐ Failure probability density f(t): time to failure
▐ Failure cumulative distribution function Q(t): Pr[failure in [0,t]]
▐ A = MTTF / MTBF, where MTBF = MTTF + MTTR
Repair keeps availability higher
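To make these quantities concrete, the sketch below assumes an exponentially distributed time to failure (a constant failure rate), which the slide does not state. Under that assumption Q(t) = 1 - exp(-t / MTTF), and steady-state availability is MTTF / (MTTF + MTTR), taking MTBF as MTTF + MTTR. The function names and numeric values are made up for illustration.

```python
import math


def failure_cdf(t, mttf):
    """Q(t): probability of a failure in [0, t], assuming exponential lifetimes."""
    return 1.0 - math.exp(-t / mttf)


def availability(mttf, mttr):
    """Steady-state availability A = MTTF / (MTTF + MTTR)."""
    return mttf / (mttf + mttr)


mttf_hours = 1000.0  # hypothetical mean time to failure

q = failure_cdf(1000.0, mttf_hours)       # chance of failing within one MTTF
a_slow = availability(mttf_hours, 10.0)   # slow repair: 10 hours
a_fast = availability(mttf_hours, 1.0)    # fast repair: 1 hour

print(round(q, 3))       # 0.632
print(round(a_slow, 4))  # 0.9901
print(round(a_fast, 4))  # 0.999
```

With the same failure behaviour, the faster repair yields the higher availability, which is the point of the last bullet.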
[Figure: availability A plotted over time t, on a scale from 0 to 1]
Related Disciplines – 9/11
▐ Dependability is that property of a computer system such that reliance can justifiably be placed on the service it delivers.
Reliability: “Dependability with respect to the continuity of service. Measure of continuous correct service delivery. Measure of the time to failure.”
Maintainability: “Dependability with respect to the aptitude to undergo repairs and evolutions. Measure of continuous incorrect service delivery (corrective maintenance only). Measure of the time to restoration from the last experienced failure (corrective maintenance only).”
Safety: “Dependability with respect to the non occurrence of catastrophic failures. Measure of continuous delivery of either correct service or incorrect service after benign failure. Measure of the time to catastrophic failure.”
[Figure: Trustworthiness taxonomy, adding Dependability (Reliability, Maintainability, Safety) next to Security; Availability and Integrity also shown]
Reliability and Maintainability
▐ Reliability
Length of uptime
▐ Maintainability
Length of downtime
▐ Availability
Fraction of uptime
[Figure: timeline over t alternating between operable and failed states, annotated with reliability, maintainability, and availability]
Availability vs. Reliability
▐ High availability but low reliability
MTTR very low but MTTF also low
▐ High reliability but low availability
MTTF large but MTTR also large
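The two cases can be checked numerically from an uptime/downtime log. The sketch below treats mean uptime as MTTF (reliability), mean downtime as MTTR (maintainability), and the fraction of uptime as availability. The `summarize` helper and both example logs are hypothetical and exist only to illustrate the contrast.

```python
def summarize(intervals):
    """Return (MTTF, MTTR, availability) from (state, duration) pairs.

    state is "up" or "down"; durations share one time unit.
    """
    ups = [d for state, d in intervals if state == "up"]
    downs = [d for state, d in intervals if state == "down"]
    mttf = sum(ups) / len(ups)        # mean length of uptime
    mttr = sum(downs) / len(downs)    # mean length of downtime
    avail = sum(ups) / (sum(ups) + sum(downs))  # fraction of uptime
    return mttf, mttr, avail


# Fails often but is repaired quickly: high availability, low reliability.
flaky = [("up", 9.0), ("down", 0.1)] * 100
# Fails rarely but is repaired slowly: high reliability, low availability.
sturdy = [("up", 900.0), ("down", 300.0)] * 2

flaky_mttf, flaky_mttr, flaky_avail = summarize(flaky)
sturdy_mttf, sturdy_mttr, sturdy_avail = summarize(sturdy)

print(round(flaky_avail, 3), flaky_mttf)    # 0.989 9.0
print(round(sturdy_avail, 3), sturdy_mttf)  # 0.75 900.0
```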
[Figure: two failed/operable timelines over t, one per case; example applications named on the slide: information access, telepresence]
Related Disciplines – 10/11
▐ Performability is that property of a computer system such that it delivers performance required by the service, as described by QoS (quality of service) measures.
[Figure: Trustworthiness taxonomy, adding Performability (QoS measures) to Security and Dependability]
Related Disciplines – 11/11
▐ Robustness is a control theoretic property that relates the operation of a system to perturbations of its inputs. In the context of resilience, robustness describes the trustworthiness (quantifiable behavior) of a system in the face of challenges.
[Figure: complete taxonomy. Challenge Tolerance: Survivability, Fault Tolerance, Disruption Tolerance (Environmental, Energy, Delay, Mobility, Connectivity), Traffic Tolerance (legitimate, attack). Trustworthiness: Security (Nonrepudiability, Confidentiality, AAA), Dependability (Reliability, Maintainability, Safety), Availability and Integrity, Performability (QoS measures). Robustness relates Trustworthiness to Challenge Tolerance.]
Challenge Fault Error Failure
[Figure: chain from challenges to service failure. Challenges: environmental (mobile, wireless, delay), natural disasters, non-malicious (operations, traffic, accidents), malicious attacks, lower-level failure. Faults: dormant internal faults, external faults, active faults. Active faults produce errors, which are passed on to the operational state of system operation and can end in service failure. Defend and Detect stages are drawn along the chain; Remediate, Recover, Diagnose, and Refine act on system operation.]
[Figure: state space. Operational space: Normal Operation, Partially Degraded, Severely Degraded. Service space: Acceptable, Impaired, Unacceptable. Transitions labelled Degrade and Improve; service resilience is marked across the space. Inputs listed: Network Design, Traffic Engineering, Protocol Specs and Constraints, Service Specs.]
Overview
▐ Past resilience failures
▐ Taxonomy
Related disciplines
From Challenge to Failure
▐ Strategy
Foundations
Principles
▐ The ResumeNet project
Strategy Foundations
▐ Faults are inevitable
Not possible (nor practical) to construct a perfect system
• internal faults will exist
Not possible to prevent challenges and threats
• external faults will occur
▐ Understand normal operations
When no adverse conditions present
Deployment corresponds with design requirements
▐ Expect Adverse Events and Conditions
Defend against challenges and threats to normal operation
Detect when an adverse event or condition has occurred
▐ Respond to Adverse Events and Conditions
Remediation ensuring correct operation and graceful degradation
Restoration to normal operation
Diagnosis of root cause faults
Refinement of future responses
Strategy Principles - Prerequisites
▐ Understand the level of resilience the system should provide
▐ Specify, verify, and refine normal operation of the system
▐ Understand challenges
▐ Develop Metrics to measure and engineer resilience
▐ Heterogeneity in mechanism, trust, and policy among different network realms
[Figure: principles diagram. Prerequisites: service requirements, normal behaviour, threat and challenge models, metrics, heterogeneity]
Strategy Principles - Enablers
▐ Security and self-protection are essential properties of entities to defend against challenges in a resilient network
▐ Management complexity impacts resilience negatively
▐ Alternatives of how to distribute and manage state are critical to resilience
[Figure: principles diagram. Prerequisites: service requirements, normal behaviour, threat and challenge models, metrics, heterogeneity. Tradeoffs: resource tradeoffs, state management, complexity]
Strategy Principles - Tradeoffs
▐ Optimize resilience at reasonable costs
▐ Maintain connectivity and association when possible
▐ Redundancy in space, time, and information
▐ Diversity in space, time, medium, and mechanism
▐ Multilevel resilience is needed with respect to protocol layer, protocol plane, and hierarchical network organisation
▐ Context awareness is necessary to autonomously detect challenges
▐ Translucency of service boundaries is needed to control the degree of abstraction vs. the visibility between levels
[Figure: principles diagram. Prerequisites: service requirements, normal behaviour, threat and challenge models, metrics, heterogeneity. Tradeoffs: resource tradeoffs, state management, complexity. Enablers: redundancy, diversity, context awareness, self-protection, translucency, multilevel, connectivity]
Strategy Principles - Behaviour
▐ Self-organising and autonomic behaviour is necessary for network resilience that is highly reactive with minimal human intervention
▐ Adaptability of all components to the network environment is essential for a node in a resilient network
▐ Evolvability is needed to refine future behaviour to improve the response to challenges
[Figure: principles diagram in four columns. Prerequisites: service requirements, normal behaviour, threat and challenge models, metrics, heterogeneity. Tradeoffs: resource tradeoffs, state management, complexity. Enablers: redundancy, diversity, context awareness, self-protection, translucency, multilevel, connectivity. Behaviour: self-organising and autonomic, adaptable, evolvable]
Overview
▐ Past resilience failures
▐ Taxonomy
Related disciplines
From Challenge to Failure
▐ Strategy
Foundations
Principles
▐ The ResumeNet project
Synopsis of the ResumeNet project
▐ Challenge / Objective
FP7-ICT-2007-2 Objective 1.6: “New Paradigms and Experimental Facilities”
▐ Instrument: STREP / 3 years / 09.2008 – 08.2011
▐ Advisory Board
Rüdiger Grimm (UKoblenz), Jim Kurose (UMassachusetts),
Jean-Claude Laprie* (LAAS-CNRS), Rick Schlichting (AT&T)
Eidgenössische Technische Hochschule Zürich, Switzerland
Lancaster University (D. Hutchison), United Kingdom
Technische Universität München (G. Carle), Germany
France Telecom (C. Lac), France
NEC Europe Ltd (M. Schöller), United Kingdom
Universität Passau (H. de Meer), Germany
Technical University Delft (P. van Mieghem), Netherlands
Uppsala Universitet (P. Gunningberg), Sweden
Université de Liège (G. Leduc), Belgium
Understand normal behaviour
▐ Behaviour of
Infrastructure
Services
Information
▐ The Wray CWMS Example
Online attacks
Adverse weather conditions, e.g., rain, storm
Vandalism
Malicious behaviour of landlords
Mis-configurations
Milk truck
▐ Re-evaluate during operation
Measuring Resilience
▐ Goal: A resilience metric R
Composite metric of non-normalized, non-orthogonal metrics
▐ Huge set of metrics
Graph theory: diameter, betweenness, degree connectivity, …
Networking metrics: QoS, Security, Dependability
▐ Evaluation of one metric for a sequence of failures
Requires exhaustive search over all combinations
Network dependent
[Figure: metric value plotted against number of failures for Sequence 1 and Sequence 2]
Metric Envelopes
▐ Comparing resilience based on metric envelopes gives a visual explanation of the network degradation process
▐ Depending on the application domain, a more bounded envelope might be preferable
▐ The effect of various failure sources on the evaluated metric can be revealed
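A metric envelope can be approximated by evaluating one metric along many failure sequences and recording, for each number of failures, the lowest and highest value seen. The sketch below is an illustration under stated assumptions: the metric (fraction of nodes in the largest connected component), the small ring topology, and random sampling in place of an exhaustive search are all choices made here, not the method of the cited paper.

```python
import random


def largest_component_fraction(nodes, edges, removed):
    """Fraction of all nodes that sit in the largest surviving component."""
    alive = set(nodes) - set(removed)
    if not alive:
        return 0.0
    adj = {n: set() for n in alive}
    for u, v in edges:
        if u in alive and v in alive:
            adj[u].add(v)
            adj[v].add(u)
    seen, best = set(), 0
    for start in alive:
        if start in seen:
            continue
        seen.add(start)
        stack, size = [start], 0
        while stack:  # depth-first walk of one component
            node = stack.pop()
            size += 1
            for nxt in adj[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        best = max(best, size)
    return best / len(nodes)


def metric_envelope(nodes, edges, sequences):
    """For each failure count k, return (min, max) of the metric over sequences."""
    steps = min(len(seq) for seq in sequences)
    envelope = []
    for k in range(steps + 1):
        values = [largest_component_fraction(nodes, edges, seq[:k])
                  for seq in sequences]
        envelope.append((min(values), max(values)))
    return envelope


nodes = list(range(6))
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0), (0, 3)]  # ring plus chord
rng = random.Random(1)  # fixed seed so the sampled sequences are repeatable
sequences = [rng.sample(nodes, len(nodes)) for _ in range(20)]

envelope = metric_envelope(nodes, edges, sequences)
for k, (low, high) in enumerate(envelope):
    print(k, round(low, 2), round(high, 2))
```

A narrow gap between the low and high values at a given failure count corresponds to the "more bounded envelope" mentioned above.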
Resilience Metrics: A Computational Approach
C. Doerr and J. Martin-Hernandez, “A computational approach to multi-level analysis of network resilience,” in 3rd International Conference on Dependability (DEPEND), Venice, Italy, July 2010.
GÉANT2 – Where are the weak points?
▐ Risk maps indicate which areas are most vulnerable to challenges
▐ Impact maps visualize the effect of a particular failure on the network as a whole
Let’s take a deeper look: What concretely would happen?
GÉANT2 – Multi-level Metric Envelopes
Diversity and Redundancy
▐ The Rope Ladder Routing (RLR) protection scheme is designed to unify the advantages of both node protection and link protection
RLR focuses on small jitter and small loss gap
Keeps trunks close together
• Low resilience against areal challenges
▐ Implementation of risk-aware Rope Ladder Routing
RLR construction takes areal challenges into account
Assess the need for repair while challenges are occurring
Risk-aware rope ladder routing
▐ Use the Graph Explorer to detect groups of links that are likely to fail at the same time because of the same challenge
▐ Find shortest paths in risk-disjoint groups to place the two trunks of the rope ladder
▐ Protection schemes for switched mesh networks
Assessment of past protection scheme use
Refining delayed repair of multi-path structures
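The trunk placement step can be illustrated with a simple greedy heuristic: find a shortest path, ban every link that shares a risk group with it, then find a second shortest path. This is a sketch under assumptions, not the project's algorithm. The function names, the topology, and the risk-group labels are hypothetical, and a greedy choice of the first trunk can miss a valid pair that a joint search would find.

```python
from collections import deque


def shortest_path(adj, src, dst, banned_links=frozenset()):
    """Breadth-first (hop count) shortest path that avoids banned links."""
    parent = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nxt in adj.get(node, ()):
            if nxt not in parent and frozenset((node, nxt)) not in banned_links:
                parent[nxt] = node
                queue.append(nxt)
    return None


def risk_disjoint_trunks(links, risk_group, src, dst):
    """Greedy pair of paths whose links share no risk group, or None."""
    adj = {}
    for u, v in links:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    first = shortest_path(adj, src, dst)
    if first is None:
        return None
    used = {risk_group[frozenset(hop)] for hop in zip(first, first[1:])}
    # Ban every link exposed to a risk the first trunk already carries.
    banned = {frozenset(l) for l in links if risk_group[frozenset(l)] in used}
    second = shortest_path(adj, src, dst, banned)
    return (first, second) if second else None


links = [("A", "B"), ("B", "D"), ("A", "C"), ("C", "D"), ("A", "E"), ("E", "D")]
risk_group = {
    frozenset(("A", "B")): "north", frozenset(("B", "D")): "north",
    frozenset(("A", "C")): "north",  # assumed to share a duct with A-B
    frozenset(("C", "D")): "centre",
    frozenset(("A", "E")): "south", frozenset(("E", "D")): "south",
}

trunks = risk_disjoint_trunks(links, risk_group, "A", "D")
print(trunks)  # (['A', 'B', 'D'], ['A', 'E', 'D'])
```

The path through C is skipped for the second trunk because its first link shares the "north" risk group with the first trunk, even though the two paths have no link in common.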
Conclusion
▐ The Internet is a critical infrastructure
▐ Resilience should be a primary design consideration for
networked systems
▐ There are a number of disciplines related to resilience
Addressing resilience issues in each discipline independently is insufficient
A systematic approach is required