
E E 681 - Module 18

M.H. Clouqueur and W. D. Grover

TRLabs & University of Alberta

© Wayne D. Grover 2002, 2003

Analysis of Path Availability in Span-Restorable Mesh Networks


Review of Mesh Design

Motivation: Something must be done to reduce the impact of network element failures on service availability

Solution: Mesh Restoration Mechanism (Requires extra capacity)

Capacity planning methods:
• Max Latching
• Herzberg
• Modular capacity placement
• Joint working-spare capacity placement

These methods are more and more capacity-efficient, but what about availability?

Intuitively: [Figure: relation between capacity and availability]

Questions to answer:
• How much does mesh restoration improve the availability of service?
• How does the availability depend on the total capacity?


What Causes Unavailability?

• Single span failures
• Multiple span failures
• Node failures
• Span maintenance services
• Combinations of the above

Which are the most important?

What we need to compare for each category: Count × P(event) × Impact

• Count: number of such events
• Impact: for example, the probability of bringing the system under study into the down state

By doing this comparison the major contributor to unavailability appears to be:

Combination of Span failure and Span maintenance service (equivalent to dual span failure in the worst case)


Impact of Failures

For the previous comparison we could only guess or make assumptions about the impact of each failure category.

Examples:
• Single span failure: Impact = 0 (network fully restorable to single failures)
• Dual span failure: Impact = 0.5 (at least half of the traffic on average should be restorable)
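The Count × P(event) × Impact comparison can be sketched numerically. This is only an illustration under stated assumptions: S = 37 spans and Us = 3 × 10^-4 are the EuroNetA values quoted later in this module, the impacts are the guessed values from the examples, span failures are taken as independent, and `contribution` is a hypothetical helper name.

```python
from math import comb

def contribution(count, p_event, impact):
    """Unavailability contribution of a failure category: Count x P(event) x Impact."""
    return count * p_event * impact

S = 37       # number of spans (assumed, illustrative)
U_S = 3e-4   # physical unavailability of one span (assumed, illustrative)

# Single span failures: S possible events, impact 0 (fully restorable)
single = contribution(S, U_S, 0.0)

# Dual span failures: C(S, 2) span pairs, probability U_S**2 if independent,
# guessed impact 0.5
dual = contribution(comb(S, 2), U_S ** 2, 0.5)

print(f"single-failure contribution: {single:.3e}")
print(f"dual-failure contribution:   {dual:.3e}")
```

With a zero impact for single failures, the dual-failure category is the only one of the two that contributes, which is why the impact values need to be known exactly rather than guessed.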

Determination of availability of service paths:

We need to know the exact value of the impact of each failure scenario on the availability of that service path

Availability analysis of path p:

Failure of (S1, S2): Impact = 0.3321
Failure of (S1, S3): Impact = 0.0000
...
Failure of (Si, Sj): Impact = 0.5243

We need a tool that determines the probability of path p being down for any given set of failed spans.


Problem of Independence of Span Failures

The contribution of a failure event to the unavailability is: P(event)×Impact

For a dual span failure:

$$P[\text{failure of } (S_i, S_j)] = U_{S_i} \times U_{S_j}$$

This is based on the assumption that failures of $S_i$ and $S_j$ are independent.

Special case: common cable sheaths

[Figure: spans S1, S2 and S3, with the annotation "This span does not really exist"]

In that case:

$$P[\text{failure of } (S_1, S_3)] \neq U_{S_1} \times U_{S_3}$$

but rather

$$P[\text{failure of } (S_1, S_3)] = U_{S_1}$$
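A small numerical sketch shows how much the independence assumption can underestimate a dual failure when two spans share a cable sheath. The function name `dual_failure_probability` and the value Us = 3 × 10^-4 are assumptions for illustration; the shared-sheath case is modelled as complete dependence (both spans always fail together).

```python
def dual_failure_probability(u_a, u_b, common_sheath=False):
    """Probability that spans a and b are down at the same time.

    common_sheath=True models full dependence: one physical cut takes
    both spans down, so the probability is that of a single failure.
    """
    if common_sheath:
        return u_a
    return u_a * u_b  # independent failures

U_S = 3e-4  # assumed span unavailability (illustrative)

independent = dual_failure_probability(U_S, U_S)
shared = dual_failure_probability(U_S, U_S, common_sheath=True)

print(f"independent spans: {independent:.1e}")
print(f"common sheath:     {shared:.1e}")
print(f"underestimate if independence is wrongly assumed: x{shared / independent:.0f}")
```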


Path Availability Calculation

Exact expression:

$$U_{path} = \sum_{\text{all failure events}} P(\text{event}) \cdot P(\text{event affecting path})$$

Simplified approach:

$$U_{path} = \sum_{i \,\in\, \text{path}} U^*_{links,i}$$

where U*links,i is the equivalent link unavailability on span i.

Advantage: we only need to compute one value for each span and then use those values for the calculation of end-to-end availability.

Drawback of simplified approach: some failure events contribute to the unavailability of links on several spans in a neighbourhood and can therefore be counted several times when summing the U*links,i values.
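The simplified approach amounts to a plain sum over the spans of the path. The sketch below compares it with the series-system expression 1 − Π(1 − U), which is exact only if link outages on different spans are independent (an assumption that, as noted above, does not strictly hold). The function names and the five equal per-span values are illustrative.

```python
from math import prod

def path_unavailability_simplified(u_link_star):
    """Simplified approach: sum of the equivalent link unavailabilities."""
    return sum(u_link_star)

def path_unavailability_series(u_link_star):
    """Series-system value, assuming independent link outages."""
    return 1.0 - prod(1.0 - u for u in u_link_star)

# Hypothetical 5-hop path with the same equivalent unavailability on every span
hops = [9.18e-7] * 5

simple = path_unavailability_simplified(hops)
series = path_unavailability_series(hops)

print(f"simplified sum: {simple:.6e}")
print(f"series system:  {series:.6e}")
```

For unavailabilities this small the two values agree to within second-order terms, so the sum is a convenient first-order estimate.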


Link Equivalent Unavailability

Concept of Equivalent Unavailability:

• Non-restorable network: Ulink = Uspan (physical unavailability)

When the span is down, the link is down

• Restorable network: Ulink = Ulink* (Equivalent link unavailability)

Ulink* is different from Uspan because of the restoration mechanism

We will see that Ulink* is of the order of Uspan², therefore Ulink* << Uspan


Derivation of Ulink*

It can easily be shown that the expected number of failed and non-restored links in the network at any time is:

$$\binom{S}{2} \cdot U_s^2 \cdot 2w \cdot (1 - R_2)$$

R2: average dual-failure restorability of links

In general U*link can be defined as:

$$U^*_{link} = \frac{\text{number of failed and non-restored links}}{\text{total number of links subject to failure (the working links)}}$$

$$U^*_{link} = \frac{\binom{S}{2} \cdot U_s^2 \cdot 2w \cdot (1 - R_2)}{\text{working links}} = \frac{\binom{S}{2} \cdot U_s^2 \cdot 2w \cdot (1 - R_2)}{S \cdot w}$$

$$U^*_{link} = U_s^2 \cdot (S - 1) \cdot (1 - R_2)$$

R2 is the only unknown.

S: total number of spans
Us: average physical unavailability of spans
w: average working capacity of spans

A span-specific average U*link(i) can be obtained using the span-specific average R2, R2(i), calculated over the S-1 dual-failure scenarios involving span i.

$$R_2 = 1 - \frac{\sum_{a,b} N_{ab}}{\sum_{a,b} (w_a + w_b)}$$

Nab: non-restored working units in the case of failure of span a and span b
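The two expressions above can be sketched in a few lines. The three-span network, its working capacities and its Nab values are made up purely to exercise the formulas, and the function names are hypothetical.

```python
def average_r2(non_restored, working):
    """R2 = 1 - sum(N_ab) / sum(w_a + w_b) over all dual-failure pairs (a, b)."""
    lost = sum(non_restored.values())
    exposed = sum(working[a] + working[b] for (a, b) in non_restored)
    return 1.0 - lost / exposed

def equivalent_link_unavailability(u_s, num_spans, r2):
    """U*link = Us^2 * (S - 1) * (1 - R2)."""
    return u_s ** 2 * (num_spans - 1) * (1.0 - r2)

# Hypothetical 3-span example (values are illustrative only)
working = {1: 3, 2: 2, 3: 4}                      # w_a per span
non_restored = {(1, 2): 1, (1, 3): 0, (2, 3): 3}  # N_ab per dual failure

r2 = average_r2(non_restored, working)
u_star = equivalent_link_unavailability(3e-4, len(working), r2)

print(f"R2     = {r2:.4f}")
print(f"U*link = {u_star:.2e}")
```

Computing R2 from per-scenario Nab values is the step that requires a restoration analysis tool; once R2 is known, U*link follows directly.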


Determination of R2

There is no closed-form model for R2, as the impact of each failure scenario depends on several factors specific to the failure case. However, failure events can be divided into a few main categories.

Unavailability sequences:

Case 0: Span failure and wi > feasible spare paths → not possible by definition in a restorable network

Case 1: Two failures but no spatial interactions → no outage

Case 2: Two failures and spatial interactions (competition for spare capacity) → possible outage

Case 3: Two failures with second failure hitting the first restoration pathset → possible outage

Case 4: Two failures isolating a degree-2 node → certain outage


Impact of Dual Failures

Example #1, no spatial interaction:

[Figure: two failed spans, W = 3 and W = 2, restored over separate routes carrying 3 and 2 units]

The two restoration paths do not interfere.

NO OUTAGE

W: working capacity
S: spare capacity




Example #2, spatial interaction - capacity dependency:

[Figure: two failed spans, W = 3 and W = 2, whose restoration routes share spare capacity S]

Is there enough spare capacity to restore both failures? S < 5 or S ≥ 5?

POSSIBLE OUTAGE depending on the value of S
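The capacity dependency in this example reduces to simple arithmetic if one assumes that all restoration paths of both failed spans must cross a single span holding S spare units and that nothing else is limiting. That assumption, and the function name, are illustrative only.

```python
def non_restored_units(w_first, w_second, shared_spare):
    """Working units left unrestored when both restoration path sets
    compete for the same shared_spare units (single shared bottleneck)."""
    return max(0, w_first + w_second - shared_spare)

# W = 3 and W = 2 as in the example, for a few values of S
for s in (3, 5, 6):
    lost = non_restored_units(3, 2, s)
    print(f"S = {s}: {lost} unit(s) not restored")
```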




Example #3, spatial interaction - special case:

[Figure: failed span with W = 2 whose 2-unit restoration path set is crossed by the second failure]

The second failure hits the restoration path set deployed for the first failed span.

The outcome in this situation depends on the adaptability of the restoration mechanism and on the amount of remaining spare capacity.

POSSIBLE OUTAGE



Example #4, isolated node:

[Figure: two failed spans, W = 3 and W = 2, isolating a node]

Nothing can be done to restore either of the two failures.

OUTAGE


Adaptability of the restoration mechanism

[Figure: spans S1 and S2, shown once for each of the three behaviours]

Static behaviour:
• Restoration preplan says: “S2 is to be restored through S1”

Partly adaptive behaviour:
• S2 is restored via another route where spare capacity is available
• S1 is left unrestored

Fully adaptive behaviour:
• S2 is restored via another route where spare capacity is available
• S1 is restored again (if possible), with release of spare capacity previously used for restoration of span S1 (similar to path restoration’s stub release)
• Optional: the spare capacity used on span S2 gets “working status” and benefits from the restoration effort for S2


Results of Case Studies

R2 results for 5 test networks:

Restoration behaviour    Non-modular environment    Modular environment*
Static                   0.53 to 0.75               0.69 to 0.83
Partly adaptive          0.55 to 0.79               0.87 to 0.91
Fully adaptive           0.55 to 0.80               0.91 to 0.99

* Designed with Optimal Modular Spare Capacity Placement

[Figure: typical test network]

With a fully adaptive behaviour in a modular environment, the working units enjoy almost full restorability to dual span failures.


Improvement over a non-restorable network

Path availability improvement example:

Test network: EuroNetA (19 nodes, 37 spans)
Reference path: 5 hops
Assumption: Us = 3 × 10^-4

If the network is non-restorable: Ulink = Uspan, Upath = 15 × 10^-4 = 13 hrs/year

If the network is restorable, the simulation with fully adaptive behaviour gives: R2 = 0.716735, U*link = 9.18 × 10^-7, Upath = 4.59 × 10^-6 = 2.4 min/year
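The figures of this example can be reproduced with a short calculation. It assumes the relation U*link = Us² (S − 1)(1 − R2) from the derivation earlier in this module, the simplified sum over hops for the path, and a 365-day year; the variable names are illustrative.

```python
HOURS_PER_YEAR = 365 * 24             # assumes a 365-day year
MINUTES_PER_YEAR = HOURS_PER_YEAR * 60

U_S = 3e-4        # span unavailability (from the example)
S = 37            # spans in EuroNetA
HOPS = 5          # length of the reference path
R2 = 0.716735     # simulated average dual-failure restorability

# Non-restorable network: every span failure takes the link down
u_path_plain = HOPS * U_S
hours_down = u_path_plain * HOURS_PER_YEAR

# Restorable network: equivalent link unavailability, then sum over hops
u_link_star = U_S ** 2 * (S - 1) * (1 - R2)
u_path_restorable = HOPS * u_link_star
minutes_down = u_path_restorable * MINUTES_PER_YEAR

print(f"non-restorable: Upath = {u_path_plain:.2e} ({hours_down:.1f} h/year)")
print(f"restorable:     U*link = {u_link_star:.3e}")
print(f"restorable:     Upath = {u_path_restorable:.3e} ({minutes_down:.1f} min/year)")
```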

Making a network restorable to single span failures brings a considerable improvement in the average availability of service paths.

For specific services it might still not be enough … How can we make service paths even more available?


Design for High Availability

The idea is to provision a network from an availability standpoint.

Two integer programming formulations were developed:

• Dual Failure Minimum Capacity (DFMC)

Finds the minimum capacity assignment for full restorability to dual-failures (R2=100%)

Note: cannot be used for networks with any degree-2 graph cut (a pair of spans whose simultaneous failure disconnects the network).

• Dual Failure Max Restorability (DFMR)

Finds the spare capacity placement that maximizes the average restorability to dual-failures for a given spare capacity budget.


R2 Design - Experimental Results

Cost of improving the R2 restorability:

Total working capacity: 405 units
Spare capacity for R1 = 100%: 223 units (55% redundancy)
Spare capacity for R2 = 100%: 628 units (155% redundancy)

[Figure: average R2 (axis from 0.75 to 1.05) versus total spare capacity (axis from 200 to 700 units), with data points labelled 1 to 12 between 223 and 628 units]

To go from R2 = 80% to R2 = 100% we need to almost TRIPLE the spare capacity
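The redundancy figures and the "almost triple" statement follow from the capacities quoted on this slide. The short check below assumes that redundancy means spare capacity divided by working capacity and that the 223-unit design is the starting point of the curve.

```python
WORKING = 405    # total working capacity (units)
SPARE_R1 = 223   # spare capacity for R1 = 100%
SPARE_R2 = 628   # spare capacity for R2 = 100%

redundancy_r1 = SPARE_R1 / WORKING   # spare-to-working ratio
redundancy_r2 = SPARE_R2 / WORKING
growth = SPARE_R2 / SPARE_R1         # factor by which spare capacity grows

print(f"redundancy for R1 = 100%: {redundancy_r1:.0%}")
print(f"redundancy for R2 = 100%: {redundancy_r2:.0%}")
print(f"spare capacity multiplied by {growth:.2f}")
```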


Conclusion of R2 studies

It is very costly to guarantee R2 restorability to all service paths in the network. However, in most cases of dual span failures the restoration mechanism is able to restore part or all of the failed working units.

Idea: For little or no extra capacity it should be possible to guarantee full restorability to dual failures to selected network connections

[Figure: dual failure of spans with W = 3 (including 1 unit with higher priority) and W = 2, with spare capacity S = 3; labelled "Restoration of higher priority connection"]

Constraint modification for the Dual Failure Minimum Capacity formulation (DFMC):

instead of

“For any dual failure, restoration paths must be found for all failed working units”

we now have

“For any dual failure, restoration paths must always be found for all working units requiring R2 restorability”


Multi-Priority Mesh Design

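Minimize:

$$\sum_{k=1}^{S} C_k \cdot s_k$$

Subject to:

$$\sum_{p=1}^{P_i} f_i^p \geq w_i \qquad \forall i \in \{1, 2, \ldots, S\}$$

In case of single failure of span i, restore all working units that require R1 restorability.

$$s_k \geq \sum_{p=1}^{P_i} \delta_{i,k}^p \cdot f_i^p \qquad \forall (i,k) \in \{1, 2, \ldots, S\}^2,\ i \neq k$$

Spare capacity is needed to support restoration of single span failures.

$$\sum_{p=1}^{P_i} f_{i,j}^p \geq x_i \qquad \forall (i,j) \in \{1, 2, \ldots, S\}^2,\ i \neq j$$

In case of failure of spans i and j, restore all working units (xi) that require R2 restorability.

$$f_{i,j}^p \leq (1 - \delta_{i,j}^p) \cdot C \qquad \forall (i,j) \in \{1, 2, \ldots, S\}^2,\ i \neq j,\ \forall p \in \{1, 2, \ldots, P_i\}$$

Route p cannot be used if it crosses one of the two failed spans.

$$s_k \geq \sum_{p=1}^{P_i} \delta_{i,k}^p \cdot f_{i,j}^p + \sum_{p=1}^{P_j} \delta_{j,k}^p \cdot f_{j,i}^p \qquad \forall (i,j,k) \in \{1, 2, \ldots, S\}^3,\ i \neq j,\ i \neq k,\ j \neq k$$

Spare capacity is needed to support restoration of dual span failures.

Notation as read here: sk is the spare capacity on span k and Ck its unit cost; f_i^p and f_{i,j}^p are the restoration flows assigned to eligible route p of span i under the single failure of i and the dual failure (i, j); δ_{i,k}^p is taken to be 1 if route p of span i crosses span k and 0 otherwise; the unsubscripted C is taken to act as a large constant; the inequality directions are inferred from the constraint descriptions.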


Network Availability Simulator

The simulator determines times of span failures and repairs according to given statistical distributions:

[Figure: timeline t of span failure and repair events (span 12, span 12, span 5, span 2, span 2, span 5, span 8) delimiting stages 1 to 6]

At each stage:

Set of failed spans + Connections → Restoration analyzer → Set of lost connections → Outage recorder

The objective of the simulator is to obtain information about the availability of end-to-end network connections by generating span failures at random times and analyzing the restoration depending on connection priorities

Characteristics of a network connection:
• Origin node
• Destination node
• Size (STS-1, STS-3, STS-12,…)
• Restorability requirement (R0, R1, R2)
• Routing between O and D


Network Availability Simulator

Advantages of the simulator:

• Confirm results obtained with theoretical availability expressions based on R2.

• Obtain information about the distribution of outage times (1000 outages of 0.1 sec have a different impact than 10 outages of 10 sec, although the total outage time is the same)

• Possibility to use different distributions of time-to-repair and time-between-failure for each span.
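The simulator loop described on these slides can be sketched as follows. This is a minimal illustration, not the actual tool: exponential time-between-failure and time-to-repair distributions are assumed, the restoration analyzer is replaced by a caller-supplied predicate `is_lost`, and all function names are hypothetical.

```python
import random

def simulate_span_events(mtbf, mttr, horizon, seed=0):
    """Return time-ordered (time, span, kind) events, kind in {'fail', 'repair'}.

    mtbf and mttr map each span to its mean time between failures and
    mean time to repair (exponential distributions are an assumption).
    """
    rng = random.Random(seed)
    events = []
    for span in mtbf:
        t = 0.0
        while True:
            t += rng.expovariate(1.0 / mtbf[span])  # wait for next failure
            if t >= horizon:
                break
            events.append((t, span, "fail"))
            t += rng.expovariate(1.0 / mttr[span])  # repair duration
            if t >= horizon:
                break
            events.append((t, span, "repair"))
    events.sort()
    return events

def stages(events, horizon):
    """Yield (start, end, failed_spans) for each interval between events."""
    failed = set()
    start = 0.0
    for t, span, kind in events:
        yield start, t, frozenset(failed)
        if kind == "fail":
            failed.add(span)
        else:
            failed.discard(span)
        start = t
    yield start, horizon, frozenset(failed)

def outage_time(events, horizon, is_lost):
    """Outage recorder: total time during which is_lost(failed_spans) is True."""
    return sum(end - start
               for start, end, failed in stages(events, horizon)
               if is_lost(failed))

# Hand-written event sequence in the same order as the timeline figure
demo_events = [(1.0, 12, "fail"), (2.0, 12, "repair"), (3.0, 5, "fail"),
               (4.0, 2, "fail"), (6.0, 2, "repair"), (7.0, 5, "repair"),
               (8.0, 8, "fail")]

# Stand-in restoration analyzer: the connection is lost only while
# spans 2 and 5 are both down (purely illustrative rule)
demo_outage = outage_time(demo_events, 10.0, lambda failed: {2, 5} <= failed)
print(f"outage time for the demo connection: {demo_outage}")

random_events = simulate_span_events({1: 100.0, 2: 100.0}, {1: 5.0, 2: 5.0}, 1000.0, seed=1)
print(f"{len(random_events)} random events generated")
```

Keeping the restoration analysis behind a predicate leaves room for per-connection priorities (R0, R1, R2) and per-span distributions without changing the event loop.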


Mesh/Ring Availability Comparison

Single span failures

Mesh: Full protection

Rings: Full protection

Dual span failures with no spatial interaction

Mesh: Full protection

Rings: Full protection

Dual span failures with spatial interaction

Mesh: Protection from 0% to 100% of the working units depending on available spare capacity and adaptability of the restoration mechanism

Rings (2 span failures on same ring): Protection of about 2/3 of the traffic (demands that are not isolated by the 2 span failures)

For connections requiring R1 restorability, the Ring-based solution and the mesh solution provide similar levels of availability


Mesh/Ring Availability Comparison

Possibility of guaranteeing R2 restorability:

Mesh networks: Yes! With adequate design and an adaptive restoration mechanism.

Ring networks: No, in certain cases restorability to dual failures cannot be guaranteed.

Example:

[Figure: ring with the connection's origin and exit nodes marked]

The connection is lost whatever its priority level is.

Conclusion: The mesh architecture seems to be more appropriate than rings to serve demands with high availability requirements