Routing High Availability NSF & NSRd2zmdbbm9feqrf.cloudfront.net/2011/eur/pdf/BRKIPM-2001.pdf ·...

94
BRKIPM-2001 v1.1 Routing High Availability NSF & NSR


Page 1: Routing High Availability NSF & NSRd2zmdbbm9feqrf.cloudfront.net/2011/eur/pdf/BRKIPM-2001.pdf · 2012-02-22 · What is Routing High Availability? Routing HA Set of technologies &

BRKIPM-2001

v1.1

Routing High Availability – NSF & NSR


© 2010 Cisco and/or its affiliates. All rights reserved. Cisco ConfidentialBRKIPM-2001 4

Agenda

Setting the stage – Introduction

Non-Stop Forwarding & Graceful Restart (NSF/GR)

Non-Stop Routing (NSR)

Deployment Considerations and Scenarios


Introduction – High Availability


Availability Definitions

The probability that an item (or network, etc.) is operational, and functional as needed, at any point in time

Or, the expected or measured fraction of time that the defined service, device or area is operational; annual uptime is the amount of time (in days, hours, minutes, etc.) the item is operational in a year

[Figure: diagram labelled User, Network, Network Provider, Shared Network, Server, Availability]


Availability Definitions

Network Availability

There is a working network path between source and destination (generally bi-directionally)

Generally involves only the Network Layer (OSI Layer 3)

Service Availability

The offered service performs according to the stated SLAs (packet loss, delay, jitter, response time, etc.)

Involves all layers

Network vs. Service Availability

Our focus is on Network Availability today


What Is High Availability?

DPM = Defects per Million (Hours of Running Time)

Availability   DPM     Downtime Per Year (24x365)
99.000%        10000   3 Days 15 Hours 36 Minutes
99.500%        5000    1 Day 19 Hours 48 Minutes
99.900%        1000    8 Hours 46 Minutes
99.950%        500     4 Hours 23 Minutes
99.990%        100     53 Minutes
99.999%        10      5 Minutes
99.9999%       1       30 Seconds

Slide label: “High Availability”
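The downtime figures follow directly from the availability percentage. A minimal Python sketch of that arithmetic, assuming a 365-day year (24x365) and treating DPM as the unavailable fraction scaled to one million; the function names are illustrative and not from the slides:

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # 24x365, no leap days


def downtime_seconds_per_year(availability_percent: float) -> float:
    """Expected downtime per year, in seconds, for a given availability."""
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return unavailable_fraction * SECONDS_PER_YEAR


def defects_per_million(availability_percent: float) -> float:
    """Unavailable time per million units of running time."""
    return (1.0 - availability_percent / 100.0) * 1_000_000


for pct in (99.0, 99.9, 99.999):
    minutes = downtime_seconds_per_year(pct) / 60
    print(f"{pct}% -> {minutes:.1f} min/year, DPM {defects_per_million(pct):.0f}")
```

The slide values are rounded, so small differences from the computed numbers are expected.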


Most common causes of downtime

Common causes of Enterprise Network Downtime **
  Telco/ISP: 35%
  Human error: 31%
  Power failure: 14%
  Hardware failure: 12%
  Other: 8%

Mitigating the Exposure: Targeting Downtime
  Operational Process: 40%
  Network: 20%
  Software Application: 40%

Embedded Management

Best Practices

System and Network Level Resiliency


What is Routing High Availability?

Routing HA

Set of technologies & features to enable traffic to continue to flow through a device during a fault

Routing HA maintains the logical network topology while the faulty device recovers

Routing HA helps to address failures within the control plane of a routing device

Routing HA increases the resiliency of a single system


What is Routing Fast Convergence?

Routing FC

Set of technologies & features to enable traffic to continue to flow around a device during a fault

Routing FC adapts the logical network topology to avoid the faulty component

Routing FC aims to address any component failure within a routing device

Routing FC increases the resiliency of the network


Routing Convergence vs. Routing HA

Routing FC

Set of technologies & features to enable traffic to continue to flow around a device during a fault

Routing FC adapts the logical network topology to avoid the faulty component

Routing FC aims to address any component failure within a routing device

Routing FC increases the resiliency of the network

Routing HA

Set of technologies & features to enable traffic to continue to flow through a device during a fault

Routing HA maintains the logical network topology while the faulty device recovers

Routing HA helps to address failures within the control plane of a routing device

Routing HA increases the resiliency of a single system


Main Routing HA Applications

Route Processor failure

Routing Process failure (modular OS)

Chassis Failure

Cat6k-VSS


Routing HA to help Planned Downtime

Routing HA technologies can help minimize customer impact during planned maintenance

Controlled RP failover, for example to swap hardware, or to upgrade memory on RPs

Routing Protocol patches (IOS-XR)

Clearing BGP Sessions (IOS-XR)

HA technologies are a prerequisite for In-Service Software Upgrade


Non-Stop-Forwarding (NSF)


Behaviour without NSF

Router A loses its control plane for some period of time

It will take some time for Router B to recognize this failure, and react to it


Behaviour without NSF

During the time that A has failed, and B has not detected the failure, B will continue forwarding traffic through A

Once the control plane resets, the data plane will reset as well, and this traffic will be dropped

NSF reduces or eliminates the traffic dropped while A’s control plane is down


Prerequisite 1: Separated Forwarding Plane

[Figure: router block diagram showing the CPU running IOS, Route DRAM, Packet DRAM, ASIC/NP (Network Processor), interconnect and interfaces, with the paths taken by control packets and data packets]

Control Plane: RIB (Routing Information Base), aka routing table

Data Plane: FIB (Forwarding Information Base)

The concept of separate control and forwarding planes is essential for routing HA

Routing HA maintains the forwarding plane while the control plane restarts/recovers


Prerequisite 1: Separated Forwarding Plane

[Figure: Cisco 12000. Control plane: RP (active) and RP (standby), each with a CPU running IOS. Data plane: linecards labelled Engine0 – 622M, Engine3 – 3G, Engine5 – 10G and Engine6 – 20G, with IOS, buffers, queues, forwarding engines (F/NP) and SPAs]

Distributed router architectures have this natively

Forwarding information base (FIB) located on Linecards


Prerequisite 1: Separated Forwarding Plane

[Figure: Catalyst 6500 with Sup720 supervisors (SP and RP, one marked standby), linecards with buffers and forwarding engines, 20G, and 4, 6, 9, or 13 Linecard/Sup slots]

The Cat6500 also has a separated forwarding plane, even though the FIB and Switching Matrix are physically located on the RP

FIB is synced between active and standby


Prerequisite 2: Stateful Switch Over (SSO)

Any routing HA requires one important mechanism: the link and its line protocol need to stay up

If not, all neighbours would re-route around the restarting node

Can be trivial: keep the linecard up and the laser on, for example for POS/HDLC

Keeping the physical link active is easy with Ethernet as well, but ARP/v6ND/adjacency information needs to be synced

Can be complex: PPP, ATM or Frame Relay require state to be maintained when failing over the control plane, so sync is needed as well


GR/NSF Fundamentals

If A is NSF capable, the control plane will not reset the data plane when it restarts

Instead, the forwarding information in the data plane is marked as stale

Any traffic B sends to A will still be switched based on the last known forwarding information

This is the Non-Stop Forwarding behaviour

[Figure: A's control plane restarts with no reset of the data plane; the forwarding information is marked as stale]


GR/NSF Fundamentals

While A’s control plane is down, the routing protocol hold timer on B counts down....

A has to come back up and signal B before B’s hold timer expires, or B will route around it

When A comes back up, it signals B that it is still forwarding traffic, and would like to resync

This is the first step in Graceful Restart (GR)

Hold Timer: 15, 14, 13, 12, 11, 10, 9, 8, 7, 6
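The race between A's restart and B's hold timer can be written down in a few lines. This is only a sketch of the timing condition described on this slide; the function names and the example values are assumptions for illustration, not taken from any router implementation:

```python
def neighbor_keeps_adjacency(hold_time_s: float, restart_time_s: float) -> bool:
    """True if A signals B before B's hold timer for A expires."""
    return restart_time_s < hold_time_s


def outcome(hold_time_s: float, restart_time_s: float) -> str:
    """Describe what B does, given how long A's control plane was down."""
    if neighbor_keeps_adjacency(hold_time_s, restart_time_s):
        return "graceful restart: B keeps forwarding through A and resyncs"
    return "hold timer expired: B routes around A"


print(outcome(hold_time_s=15, restart_time_s=6))
print(outcome(hold_time_s=15, restart_time_s=20))
```

In practice this means the hold time has to be chosen with the expected switchover time of the restarting platform in mind.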


GR/NSF Fundamentals

The second GR phase deals with neighbors updating the restarting router’s routing table

This involves new protocol mechanisms

[Figure: A to B: “I’m restarting”. B to A: “Ok, fine, I’ll send routes”]


GR/NSF Fundamentals – Summary

Key Components of NSF on the restarting router (NSF/GR capable)

Keeping interfaces/linecards up

Maintaining Forwarding State in the data plane

Synchronizing routing information post failover

On the neighbouring router(s) (NSF/GR aware)

Maintain routes while neighbour restarts

Help the restarting node synchronize its routing table

GR/NSF implementations in the various protocols generally differ in the way synchronization works


EIGRP GR/NSF Fundamentals

The signal in EIGRP is an update with the initialization and restart (RS) bits set.

A sends its hellos with the restart bit set until GR is complete.

B transmits the routing information it knows to A.

When B is finished sending information, it sends a special end of table signal so A knows the table is complete

[Figure: A sends Hello + Restart and Init + Restart to B; B sends topology information followed by End of table to A]


EIGRP GR/NSF Fundamentals

When A receives this end of table marker, it recalculates its topology table, and updates the local routing table

When the local routing table is completely updated, EIGRP notifies CEF

CEF then updates the forwarding tables, and removes all information marked as stale
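The stale-marking cycle (mark on restart, keep forwarding, refresh after resync, purge what is left) can be sketched with a toy forwarding table. This is a simplified model under stated assumptions: the class and method names are hypothetical, lookups are exact-match rather than longest-prefix, and it is not how CEF is implemented:

```python
class Fib:
    """Toy forwarding table: prefix -> next hop plus a stale flag."""

    def __init__(self, routes):
        self.entries = {p: {"next_hop": nh, "stale": False} for p, nh in routes.items()}

    def mark_all_stale(self):
        # Control plane restarted: keep the entries, but flag them.
        for entry in self.entries.values():
            entry["stale"] = True

    def lookup(self, prefix):
        # Stale entries are still used for forwarding.
        entry = self.entries.get(prefix)
        return entry["next_hop"] if entry else None

    def refresh(self, prefix, next_hop):
        # The routing protocol relearned this route after resync.
        self.entries[prefix] = {"next_hop": next_hop, "stale": False}

    def purge_stale(self):
        # Routing table is complete: drop whatever was not refreshed.
        self.entries = {p: e for p, e in self.entries.items() if not e["stale"]}


fib = Fib({"10.1.0.0/16": "B", "10.2.0.0/16": "B"})
fib.mark_all_stale()
during_restart = fib.lookup("10.2.0.0/16")  # still forwarded while stale
fib.refresh("10.1.0.0/16", "B")             # relearned from the neighbour
fib.purge_stale()                            # 10.2.0.0/16 was not relearned
```

The same cycle applies to the other protocols in this deck; only the way the routing table is resynchronized differs.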


EIGRP GR/NSF – Configuration

Use the nsf command under the router eigrp configuration mode to enable graceful restart

No configuration required on helper node

show ip protocols can be used to verify graceful restart is operational

Currently only supported for IPv4

Restarting Node (A):

    router eigrp 100
     nsf
     ....

    A#show ip protocols
    Routing Protocol is "eigrp 100"
    ....
    Redistributing: eigrp 100
    EIGRP NSF-aware route hold timer is 240s
    EIGRP NSF enabled
    NSF signal timer is 20s
    NSF converge timer is
    ....

Helper Node (B)

http://www.cisco.com/en/US/tech/tk365/technologies_white_paper0900aecd8023df74.shtml

http://www.cisco.com/en/US/products/sw/iosswrel/ps1839/products_feature_guide09186a0080160010.html


OSPF NSF Implementations

There are two mechanisms: Cisco style and IETF style (RFC3623)

The “cisco” style is also documented in the informational RFC4811 & RFC4812

Approaches differ in the ways ...

… the restart process is signalled

… the restarting node synchronizes the LSA database

… deciding when to abort the GR process


OSPF NSF – Cisco Style

OSPF uses an extension to the hello packets called link local signaling

The first hello A sends to B has an empty neighbor list; this tells B that something is wrong with the neighbor relationship

A sets the restart bit in its hello, which tells B that A is still forwarding traffic, and would like to resynchronize its database

[Figure: A sends B an empty Hello with the restart bit set (Empty Hello + Restart)]


OSPF NSF – Cisco Style

B moves A into the exchange state, and uses out of band signaling (OOB) to resynchronize their databases

This process is the same as initial database synchronization, but it uses different packet types

[Figure: B sets A to exchange; A and B perform DBD exchange and LSA exchange]


OSPF NSF – Cisco Style

When A and B have resynchronized their databases, they place each other in full state, and run SPF

After running SPF, the local routing table is updated, and OSPF notifies CEF

CEF then updates the forwarding tables, and removes all information marked as stale


OSPF NSF Cisco – Configuration

Restarting Node (A):

    router ospf 1
     nsf cisco

Helper Node (B):

    router ospf 1

    B#show ip ospf int
    GigabitEthernet0/0 is up, line protocol is up
    Supports Link-local Signaling (LLS)
    Cisco NSF helper support enabled
    IETF NSF helper support enabled

    A#show ip ospf
    Non-Stop Forwarding enabled
    IETF NSF helper support enabled
    Cisco NSF helper support enabled

    A#show ip ospf neighbor det
    Neighbor 10.0.0.3, interface address 10.0.2.34
    In the area 0 via interface GigabitEthernet4/1
    Neighbor priority is 1, State is FULL, 6 state changes
    DR is 10.0.2.34 BDR is 10.0.2.33
    Options is 0x12 in Hello (E-bit, L-bit)
    Options is 0x52 in DBD (E-bit, L-bit, O-bit)
    LLS Options is 0x1 (LR)
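The Options and LLS Options values in the neighbor output are bit masks. A small Python sketch that decodes them, assuming the commonly documented bit positions (E-bit 0x02, L-bit 0x10, O-bit 0x40 in the OSPF Options field; LR 0x1 and RS 0x2 in the LLS extended options); the helper name is illustrative and only the bits that appear in this output are listed:

```python
# Subset of OSPF Options bits that show up in the output above.
OSPF_OPTION_BITS = {
    0x02: "E-bit",  # external routing capability
    0x10: "L-bit",  # packet carries a link-local signaling (LLS) block
    0x40: "O-bit",  # opaque LSA capability
}

# LLS extended options used for Cisco-style NSF.
LLS_OPTION_BITS = {
    0x1: "LR",  # LSDB resynchronization (out-of-band resync) capable
    0x2: "RS",  # restart signal
}


def decode_bits(value: int, names: dict) -> list:
    """Return the names of all known bits set in value, lowest bit first."""
    return [name for mask, name in sorted(names.items()) if value & mask]


hello_options = decode_bits(0x12, OSPF_OPTION_BITS)
dbd_options = decode_bits(0x52, OSPF_OPTION_BITS)
lls_options = decode_bits(0x1, LLS_OPTION_BITS)
print(hello_options, dbd_options, lls_options)
```

A neighbour that shows the L-bit and the LR option is a candidate helper for Cisco-style NSF.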


OSPF NSF – IETF Style

OSPF IETF NSF uses a new LSA type to signal GR

A will send out a GRACE-LSA to inform its neighbour(s) that it is undergoing a graceful restart

[Figure: A sends a Grace LSA to B]
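To make the Grace-LSA more concrete, here is a hedged Python sketch that builds the TLV body as laid out in RFC 3623: grace period (type 1), graceful restart reason (type 2) and IP interface address (type 3), each padded to a 4-byte boundary. The function names are hypothetical, the LSA header is left out, and the example values (120 seconds, switch to redundant control processor, 10.0.2.33) are the ones that appear in the helper log elsewhere in this deck:

```python
import socket
import struct

REASON_SWITCH_TO_REDUNDANT_CP = 3  # RFC 3623 restart reason code


def tlv(tlv_type: int, value: bytes) -> bytes:
    """Encode one TLV: 2-byte type, 2-byte length, value padded to 4 bytes."""
    padding = (-len(value)) % 4
    return struct.pack("!HH", tlv_type, len(value)) + value + b"\x00" * padding


def grace_lsa_body(grace_period_s: int, reason: int, interface_ip: str) -> bytes:
    """Build the TLV body of a Grace-LSA (LSA header not included)."""
    return (
        tlv(1, struct.pack("!I", grace_period_s))
        + tlv(2, struct.pack("!B", reason))
        + tlv(3, socket.inet_aton(interface_ip))
    )


body = grace_lsa_body(120, REASON_SWITCH_TO_REDUNDANT_CP, "10.0.2.33")
print(body.hex())
```

The helper uses the grace period to bound how long it keeps advertising the restarting router as fully adjacent.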


OSPF NSF – IETF Style

B moves A into the exchange state, and uses the “regular” mechanism to resynchronize their databases

This process is the same as initial database synchronization

[Figure: B sets A to exchange; A and B perform DBD exchange and LSA exchange]


OSPF NSF – IETF Style

When A and B have resynchronized their databases, they place each other in full state, and run SPF

After running SPF, the local routing table is updated, and OSPF notifies CEF

CEF then updates the forwarding tables, and removes all information marked as stale

(all of the above is identical to Cisco-style OSPF NSF)


OSPF NSF – IETF in Operation

Restarting Node (A):

    A# redundancy force-switchover
    This will reload the active unit and force switchover to standby[confirm]y
    Preparing for switchover..

Helper Node (B):

    B#
    *Dec 15 10:13:40.374: OSPF: IETF NSF Received grace-LSA from 10.0.0.2 on GigabitEthernet4/1
    *Dec 15 10:13:40.374: OSPF: IETF NSF Validate grace-LSA from nbr 10.0.0.2 on GigabitEthernet4/1
    *Dec 15 10:13:40.374: OSPF: IETF NSF Process grace-LSA from nbr 10.0.0.2 on GigabitEthernet4/1,
      age 1, grace period 120, graceful restart reason: Switch to redundant control processor,
      graceful ip address 10.0.2.33
    *Dec 15 10:13:40.374: OSPF: IETF NSF helper interface count: 1 (area 0), GigabitEthernet4/1
    *Dec 15 10:13:40.374: OSPF: IETF NSF Enter graceful restart helper mode for 10.0.0.2 on
      GigabitEthernet4/1 for 119 seconds (requested 120 sec)
    *Dec 15 10:14:04.266: OSPF: IETF NSF GR-resync FROM Nbr 10.0.0.2 10.0.2.33 GigabitEthernet4/1
    *Dec 15 10:14:04.266: OSPF: IETF NSF Starting graceful-resync with 10.0.0.2 address 10.0.2.33 on
      GigabitEthernet4/1
    *Dec 15 10:14:04.266: %OSPF-5-ADJCHG: Process 1, Nbr 10.0.0.2 on GigabitEthernet4/1 from LOADING
      to FULL, Loading Done
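When checking many helpers after a switchover, it can help to pull the key fields out of the "Enter graceful restart helper mode" message. The following Python sketch assumes the message text looks exactly like the sample in this log (with the wrapped line joined); the regular expression and function name are my own and are not part of any Cisco tool:

```python
import re

# Matches the helper-mode debug message as it appears in the log above.
HELPER_RE = re.compile(
    r"Enter graceful restart helper mode for (?P<nbr>\S+) on\s+"
    r"(?P<intf>\S+) for (?P<granted>\d+) seconds \(requested (?P<requested>\d+) sec\)"
)


def parse_helper_entry(line: str):
    """Return neighbour, interface and timers from a helper-mode log line, or None."""
    match = HELPER_RE.search(line)
    if match is None:
        return None
    return {
        "neighbor": match["nbr"],
        "interface": match["intf"],
        "granted_s": int(match["granted"]),
        "requested_s": int(match["requested"]),
    }


sample = (
    "*Dec 15 10:13:40.374: OSPF: IETF NSF Enter graceful restart helper mode "
    "for 10.0.0.2 on GigabitEthernet4/1 for 119 seconds (requested 120 sec)"
)
info = parse_helper_entry(sample)
print(info)
```

Debug message formats can change between software releases, so the pattern would need to be checked against the actual output.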


OSPF IETF/RFC3623 vs. Cisco

The main practical difference is in the criteria for aborting the GR process

RFC3623 aborts the process when

it detects a neighbour which is not OSPF-GR aware, or

a topology change occurs during the LSDB synchronization

Cisco NSF continues the process, accepting the caveat of transient routing asymmetry

“nsf cisco enforce global” can be used to abort NSF when non-GR-aware neighbors are found

I find “nsf cisco” more flexible, at the expense of being proprietary

You need to settle on one mode; however, any Cisco box supporting both modes can help a neighbour configured with either of the two while that neighbour restarts
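The abort criteria on this slide can be summarized as a small decision function. This is a sketch of the rules exactly as stated here, not of any implementation; the function and parameter names are assumptions, and `enforce_global` stands for the “nsf cisco enforce global” option:

```python
def should_abort_gr(
    mode: str,
    all_neighbors_gr_aware: bool,
    topology_changed: bool,
    enforce_global: bool = False,
) -> bool:
    """Return True if the restarting router should abort graceful restart."""
    if mode == "ietf":
        # RFC3623 style: abort on a non-GR-aware neighbour or a topology change.
        return (not all_neighbors_gr_aware) or topology_changed
    if mode == "cisco":
        # Cisco style continues, unless enforce global is set and a
        # non-GR-aware neighbour is found.
        return enforce_global and not all_neighbors_gr_aware
    raise ValueError(f"unknown NSF mode: {mode!r}")


print(should_abort_gr("ietf", all_neighbors_gr_aware=True, topology_changed=True))
print(should_abort_gr("cisco", all_neighbors_gr_aware=False, topology_changed=True))
```

The second call shows the flexibility mentioned above: Cisco-style NSF carries on where the RFC3623 rules would give up.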


ISIS NSF Implementations

For ISIS, there are also two approaches: “nsf cisco” and “nsf ietf”

Unlike OSPF, approaches differ more fundamentally:

IETF/RFC3847 works more like a traditional GR/NSF protocol

“cisco”-style ISIS NSF does not require any protocol extensions or neighbour awareness


IS-IS GR/NSF Fundamentals (IETF)

IS-IS adds a new TLV to the hello packet, the restart option. The restart option TLV contains a Restart Request (RR) bit and a Restart Acknowledgement (RA) bit

Restart option TLV needs to be sent in all hellos (IIH).

When A restarts, it transmits its hellos with an empty neighbor list, and the RR bit set

B transmits hellos to A with the RA bit set

[Figure: A sends B an empty Hello with the RR bit set (Empty Hello + RR); B replies with Hello + RA]
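As an illustration of the restart option, the following Python sketch encodes the restart TLV carried in the hellos. It assumes the layout I recall from the IS-IS restart signaling RFCs (TLV type 211, one flags octet with RR as 0x01 and RA as 0x02, followed by a two-octet remaining time); treat the constants and the function name as assumptions to verify against the specification:

```python
import struct

RESTART_TLV_TYPE = 211  # assumed TLV code for the restart option
RR = 0x01               # Restart Request
RA = 0x02               # Restart Acknowledgement


def restart_tlv(rr: bool = False, ra: bool = False, remaining_time_s: int = 0) -> bytes:
    """Encode a restart TLV: type, length, flags, remaining time in seconds."""
    flags = (RR if rr else 0) | (RA if ra else 0)
    value = struct.pack("!BH", flags, remaining_time_s)
    return struct.pack("!BB", RESTART_TLV_TYPE, len(value)) + value


from_a = restart_tlv(rr=True)                       # A: restart request
from_b = restart_tlv(ra=True, remaining_time_s=30)  # B: acknowledgement
print(from_a.hex(), from_b.hex())
```

The remaining time in B's acknowledgement tells A how long B will still hold the adjacency; the value 30 is only an example.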


IS-IS GR/NSF Fundamentals (IETF)

B then clears the flags which indicate routing data that needs to be transmitted to A (the SRM flags)

A and B then use IS-IS’ normal synchronization process, using complete sequence number packets (CSNPs) to describe their databases and exchanging link state packets (LSPs)

[Figure: B clears SRM flags; A and B exchange CSNPs and Link State Packets]


IS-IS GR/NSF Fundamentals (IETF)

When A and B have resynchronized their databases, they run SPF

After running SPF, the local routing table is updated, and IS-IS notifies CEF

CEF then updates the forwarding tables, and removes all information marked as stale


IS-IS GR/NSF IETF – Configuration

Use the nsf ietf command under the router isis configuration mode to enable graceful restart

No configuration required on helper node (enabled by default)

show isis nsf can be used to verify graceful restart is operational

show clns neigh detail shows neighbor support of ISIS GR

Restarting Node (A):

    router isis
     nsf ietf
     ....

    A#show isis nsf
    NSF is ENABLED, mode 'ietf'

    A#show clns neighbor detail
    System Id    Interface   SNPA             State
    neighborxx   Gi7/1       0005.0096.a819   Up
      Area …
      NSF capable

Helper Node (B)


IS-IS GR/NSF Fundamentals (Cisco-Style)

IS-IS Cisco-Style works without any GR protocol extensions

IS-IS constantly syncs the neighbour adjacency state as well as LSP header checkpoints to the standby

Once A restarts, it requests the full LSPs from its neighbors, using a CSNP (Complete Sequence Number Packet)

Neighbour follows regular IS-IS mechanisms and floods its complete LSP database

[Figure: A sends a CSNP to B; B sends LSPs to A]


IS-IS GR/NSF Fundamentals (Cisco)

When A has resynchronized its database, A runs SPF

After running SPF, the local routing table is updated, and IS-IS notifies CEF

CEF then updates the forwarding tables, and removes all information marked as stale


IS-IS GR/NSF Cisco – Configuration

Use the nsf cisco command under the router isis configuration mode to enable graceful restart

No configuration required on helper node

Restarting Node (A):

    router isis
     nsf cisco
     ....

    A#show isis nsf
    NSF is ENABLED, mode 'cisco'

Helper Node (B)


IS-IS IETF/RFC3847 vs. Cisco

Because “nsf cisco” requires no protocol extensions to synchronize the LSDB, it is much easier to deploy

Cisco nodes configured with “nsf cisco” will also signal support for neighbours using IETF-style GR

A:

    router isis
     nsf cisco
     ....

B:

    router isis
     nsf ietf
     ....

    B#show clns neighbor detail
    System Id    Interface   SNPA
    neighborxx   Gi4/3       0005.00fe.3444 …
      NSF capable


BGP GR/NSF Fundamentals

Graceful restart capability is negotiated when the session comes up. If both peers state they are capable of GR, it’s enabled on the peering session, on a per-address-family (ipv4, ipv6, vpnv4, etc.) basis

When A restarts, it opens a new TCP session to B, using the same router ID

B interprets this as a restart, and closes the old TCP session

B also treats the TCP session going down as a signal that A is restarting

While A restarts, B marks all paths received from A as “stale”

[Figure: A and B exchange the GR capability; after restarting, A opens a new TCP session; B detects the restart and closes the old session]

    r3#show ip bgp 10.20.0.0
    BGP routing table entry for 10.20.0.0/16, version 47
    Paths: (1 available, best #1, table Default-IP-Routing
      Flag: 0x820
      Not advertised to any peer
      Local, (stale)
        10.0.0.2 (metric 21) from 10.0.0.1 (0.0.0.0)
          Origin IGP, metric 0, localpref 100, valid, internal, best
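The per-address-family negotiation is carried in the Graceful Restart capability of the OPEN message. A hedged Python sketch of its encoding, assuming the RFC 4724 layout (capability code 64, a 16-bit field holding the restart state bit and a 12-bit restart time, then AFI, SAFI and a flags octet per address family); the function name is hypothetical, and the example uses a 120-second restart time with IPv4 unicast and IPv4 multicast:

```python
import struct

GR_CAPABILITY_CODE = 64
RESTART_STATE_BIT = 0x8000    # "R" bit: the sender has restarted
FORWARDING_STATE_BIT = 0x80   # "F" bit: forwarding state preserved for this family


def gr_capability(restart_time_s: int, families, restarted: bool = False) -> bytes:
    """Encode the capability; families is a list of (afi, safi, preserved)."""
    if not 0 <= restart_time_s < 4096:
        raise ValueError("restart time must fit in 12 bits")
    header = restart_time_s | (RESTART_STATE_BIT if restarted else 0)
    value = struct.pack("!H", header)
    for afi, safi, preserved in families:
        flags = FORWARDING_STATE_BIT if preserved else 0
        value += struct.pack("!HBB", afi, safi, flags)
    return struct.pack("!BB", GR_CAPABILITY_CODE, len(value)) + value


# AFI 1 / SAFI 1 = IPv4 unicast, AFI 1 / SAFI 2 = IPv4 multicast
cap = gr_capability(120, [(1, 1, True), (1, 2, True)])
print(cap.hex())
```

A family that is missing from the peer's capability is not covered by graceful restart, which is why the negotiation is per address family.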


BGP GR/NSF Fundamentals

B transmits updates containing its BGP table (its local RIB-Out)

A goes into read-only mode, and does not run the bestpath calculation until B has finished sending updates

When B has finished sending updates, it sends an End-of-RIB marker, which is an update with no reachable NLRI and empty withdrawn NLRI
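For IPv4 unicast, RFC 4724 defines the End-of-RIB marker as the smallest possible UPDATE message. A minimal Python sketch of that encoding follows; the function names are illustrative only, and other address families use an UPDATE carrying an empty MP_UNREACH_NLRI attribute, which is not shown here.

```python
import struct

BGP_MARKER = b"\xff" * 16  # 16-byte all-ones marker in every BGP header
BGP_UPDATE = 2             # BGP message type code for UPDATE


def ipv4_unicast_end_of_rib():
    """Build the IPv4 unicast End-of-RIB marker (illustrative helper)."""
    # Withdrawn Routes Length = 0, Total Path Attribute Length = 0, no NLRI.
    body = struct.pack("!HH", 0, 0)
    length = len(BGP_MARKER) + 2 + 1 + len(body)
    return BGP_MARKER + struct.pack("!HB", length, BGP_UPDATE) + body


def is_ipv4_unicast_end_of_rib(message):
    """True if the raw message is exactly the empty UPDATE built above."""
    return message == ipv4_unicast_end_of_rib()


eor = ipv4_unicast_end_of_rib()
```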

[Diagram: B sends updates followed by the End-of-RIB marker; A stays in read-only mode]


BGP GR/NSF Fundamentals

When A receives the end of RIB marker, it runs bestpath, and installs the best routes in the routing table

After the local routing table is updated, BGP notifies CEF

CEF then updates the forwarding tables, and removes all information marked as stale
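The sequence on the restarting router A can be sketched as a toy Python model. It assumes a single peer, so "bestpath" is trivial, and the class and attribute names are invented for this illustration rather than taken from IOS.

```python
class RestartingSpeaker:
    """Toy model of router A after a restart (single peer assumed)."""

    def __init__(self, preserved_fib):
        # Forwarding entries kept across the restart start out as stale.
        self.fib = {prefix: (nh, True) for prefix, nh in preserved_fib.items()}
        self.adj_rib_in = {}
        self.read_only = True

    def receive_update(self, prefix, next_hop):
        # Read-only mode: store the path, do not run bestpath yet.
        self.adj_rib_in[prefix] = next_hop

    def receive_end_of_rib(self):
        # End-of-RIB received: run bestpath and install the results.
        self.read_only = False
        refreshed = {p: (nh, False) for p, nh in self.adj_rib_in.items()}
        # Entries that were not re-learned stay stale and are removed here.
        self.fib = refreshed


a = RestartingSpeaker({"10.20.0.0/16": "10.0.0.2", "10.30.0.0/16": "10.0.0.2"})
a.receive_update("10.20.0.0/16", "10.0.0.2")
stale_before_eor = a.fib["10.30.0.0/16"]
a.receive_end_of_rib()
```

Until the End-of-RIB marker arrives, A keeps forwarding on the preserved (stale) entries; afterwards only re-learned prefixes remain.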


BGP GR/NSF Fundamentals

Use the bgp graceful-restart command in global router bgp configuration mode to enable graceful restart

IOS-XR and recent IOS can disable it on a per-neighbor basis

Needs to be enabled on both ends; sessions need to be reset for the config to take effect

show ip bgp neighbors can be used to verify that graceful restart is operational

[Diagram: routers A and B, with the following output and configuration]

router#show ip bgp neighbors x.x.x.x
....
Neighbor capabilities:
....
Graceful Restart Capability: advertised and received
Remote Restart timer is 120 seconds
Address families preserved by peer:
IPv4 Unicast, IPv4 Multicast

router bgp 65000
 bgp graceful-restart
 ....

router bgp 65501
 bgp graceful-restart
 ....
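Verification can be scripted. The following Python sketch checks captured show ip bgp neighbors text for the "advertised and received" state; the sample text mirrors the slide, the exact output format varies between releases, and the function name is an assumption made for this example.

```python
import re


def graceful_restart_negotiated(show_output):
    """Return True if the GR capability line says 'advertised and received'."""
    # 'Capabil\w*' tolerates spelling variations of the word in the output.
    match = re.search(r"Graceful Restart Capabil\w*:\s*(.+)", show_output)
    return bool(match) and match.group(1).strip().startswith(
        "advertised and received"
    )


SAMPLE = """Neighbor capabilities:
  Graceful Restart Capability: advertised and received
    Remote Restart timer is 120 seconds
    Address families preserved by peer:
      IPv4 Unicast, IPv4 Multicast
"""

negotiated = graceful_restart_negotiated(SAMPLE)
```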


GR/NSF Summary

All NSF protocols require some form of neighbour interaction and functionality/configuration on the adjacent systems

Holding onto the routes while the neighbour restarts

Re-Sending the routing information

Deploying NSF in scaled edge deployments (for example large hub site or service provider edge) can be challenging as all neighbors need to be “touched” (config, OS upgrade, etc.)

What if we used another approach …


Non-Stop Routing


Non-stop Routing – NSR

Idea: Why not sync all routing protocol state to the standby RP (or standby process)?

Restarting RP could pick up right where the primary left off

No need to refresh any information, no need for the neighbour to know that anything happened

Easy idea – challenging implementation

Now we absolutely must avoid anything that would let the neighbour know

[Diagram: active and standby RPs synchronized via SSO; line cards keep forwarding; routing adjacency maintained to neighbours; no link flap]


The “easy” NSR

IS-IS “nsf cisco” (available for a long time) actually looks like NSR (only on the surface, though)

Checkpointed adjacency state (as maintained by hellos) as well as LSDB on standby, able to recover with existing protocol mechanisms

Neighbour actually notices that something happened, but we still achieve non-stop forwarding

RSVP and PIM in IOS-XR use checkpoints and refresh state from neighbors

There is a substantial difference from real NSR, though: the restarting node forwards on potentially outdated information

Let’s look at “real” NSR now…


OSPFv2 NSR (IOS-XR)

Neighbour & interface state and LSDB constantly synced between active and standby

Input packets replicated to both active and standby (1)

LSDB updated on active & standby (2a/2b)

Standby ACKs LSA to Active (3)

Active RP acks LSA to sender (4)

[Diagram: sender/peer, active RP and standby RP, each running OSPF over raw IP; state & LSDB sync between the RPs; steps (1), (2a), (2b), (3) and (4) as listed above]


More tricky: NSR for TCP-based Protocols

LDP and BGP use TCP for reliable delivery of PDUs

Eases protocol implementation, but makes NSR quite challenging

Strict requirement to maintain TCP session during failover

TCP session reset would be interpreted by the neighbor as adjacency down, triggering rerouting

How can we reliably maintain the TCP session?

Need to ensure the TCP stacks on active and standby RP are sync’ed (sequence numbers, etc.)

Need to acknowledge receipt of a packet only once both active and standby have received it


TCP NSR – Receive Path (IOS-XR)

Input pkt replicated to both active and standby TCP stack (1)

Standby ACKs pkt to active once it has stored it in its buffer (2)

Once active TCP sees the ACK, it ACKs pkt to sender

Active “owns” TCP session

TCP delivers data to application
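The key property of the receive path is that the peer never gets an ACK for data the standby does not hold. A toy Python model of that ordering is sketched below; class, method and parameter names are hypothetical, and real TCP state (windows, sequence arithmetic, timers) is left out.

```python
class NsrReceivePath:
    """Toy model: ACK to the peer only after the standby buffered the data."""

    def __init__(self):
        self.active_buffer = {}
        self.standby_buffer = {}
        self.acked_to_peer = []

    def receive(self, seq, payload, standby_reachable=True):
        # (1) The incoming segment is replicated to both TCP stacks.
        self.active_buffer[seq] = payload
        if not standby_reachable:
            # No standby ACK, so the active side must not ACK to the peer;
            # the peer will retransmit the segment later.
            return False
        # (2) The standby stores the segment and ACKs it to the active.
        self.standby_buffer[seq] = payload
        # Only now does the active TCP ACK the segment to the sender.
        self.acked_to_peer.append(seq)
        return True


path = NsrReceivePath()
first = path.receive(1, b"open")
second = path.receive(2, b"update", standby_reachable=False)
```

If a switchover happens at any point, every byte the peer believes was delivered is already present on the standby.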

[Diagram: sender/peer, active RP and standby RP, each with APP and TCP layers; steps (1), (2), (3), (4), (4a) and (4b)]


TCP NSR – Send Path (IOS-XR)

In the send path, standby TCP stack sends the packet towards the peer

Standby “owns” the session

[Diagram: active RP and standby RP, each with APP and TCP layers, and the sender/peer; steps (1) to (4)]


NSR Support in IOS-XR

Supported for BGP, OSPFv2, and LDP

OSPFv3/IPv6 planned for 4.2

Configured on global protocol level

When GR/NSF is also enabled, protocols can fall back to NSF in case NSR is not possible

for example when standby RP is not in NSR-ready state

generally recommended to enable NSF alongside NSR

Important to monitor NSR state on standby

router bgp …
 nsr
router ospf ..
 nsr
mpls ldp
 nsr
router isis
 nsf cisco

RP/0/RP0/CPU0:router#show redundancy
Redundancy information for node 0/RP0/CPU0:
==========================================
Node 0/RP0/CPU0 is in ACTIVE role
Partner node (0/RP1/CPU0) is in STANDBY role
Standby node in 0/RP1/CPU0 is ready
Standby node in 0/RP1/CPU0 is NSR-ready


NSR Support in IOS

BGP NSR

Supported for IPv4 VRF neighbors on c10k and c7600

GR/NSF should also be enabled

For peers supporting GR, TCP state is not maintained and failover is done via NSF

OSPFv2 NSR

coming in 15.1(2)S

GR/NSF can be enabled to support fallback to NSF in case NSR not ready

router bgp …
 bgp graceful-restart
 address-family ipv4 vrf ..
  neighbor x.x.x.x ha-mode sso
 ....

# show ip bgp vpnv4 all sso summary
# show tcp ha connections

router ospf 1
 nsr
 [ nsf cisco|ietf ]
 ....

# show ip ospf nsr


NSR Summary

Unique, Self-Contained Routing HA Solution

Simplifies NSF/SSO deployment by synchronizing edge routes automatically

NSF-aware neighbour devices not needed

Addresses additional network scenarios – e.g. unmanaged CPE devices

Delivers persistent routing for the entire customer edge

Retains scalability and safety of NSF/GR with benefits of NSR


Deployment Considerations and Use Cases


Complex?!?!

Two approaches (NSF and NSR) to address the same problem

Different protocols, different NSF/NSR variants, implementations and roadmaps

Different fundamental approaches to increase availability: HA and Fast Convergence

Let’s look at some generic deployment guidance, some implementation caveats and use cases


GR/NSF Deployment Considerations

Be careful with partial deployments of GR/NSF capability

If B restarts, A will reset its session, removing all the routing information it learned from B. However, D will continue to forward traffic through B

This will, at best, cause asymmetric routing. At worst, it could cause a routing loop

Router A must be GR capable or GR aware
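A partial deployment like this can be caught with a simple audit before rollout. The Python sketch below flags neighbors of a GR-capable router that are neither GR-capable nor GR-aware; the topology (B adjacent to A and D, only D being GR-aware) is an assumption chosen to match the scenario, not data from a real network.

```python
def unsafe_gr_neighbors(gr_capable, gr_aware, adjacency):
    """Map each GR-capable router to neighbors that cannot act as helpers."""
    problems = {}
    for router in sorted(gr_capable):
        bad = sorted(
            n for n in adjacency.get(router, ())
            if n not in gr_capable and n not in gr_aware
        )
        if bad:
            problems[router] = bad
    return problems


# Assumed example: B is GR/NSF capable, D is GR-aware, A is neither.
findings = unsafe_gr_neighbors(
    gr_capable={"B"},
    gr_aware={"D"},
    adjacency={"B": ["A", "D"]},
)
```

An empty result means every neighbor of a GR-capable router can at least act as a helper.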

[Diagram: core with routers A, B, C and D; B is GR/NSF capable; A resets its session; D continues forwarding; asymmetric return path]


Multiple Routing Protocols

[Diagram: Service Provider; routers A, B, C and D; OSPF]

OSPF is configured for GR/NSF, while BGP is not

D’s next hop for all routes is A; the path to A is learned via OSPF

If the control plane on B restarts, D will continue learning BGP routes from C with a next hop of A; it will also maintain the best path to that next hop through B

[Diagram labels: best path to A; BGP learned routes]


Multiple Routing Protocols

Since the best path to A is still through B, D will continue forwarding through B for all the BGP routes it is learning through C

B will drop this traffic, since it is not maintaining its BGP state, only its OSPF state

If BGP and an IGP are running together, they must both have GR enabled

[Diagram: Service Provider; routers A, B, C and D; OSPF; D continues forwarding; BGP learned routes]


IPv6 Deployment Considerations

NSF/NSR implementation for IPv6 is not yet at the same state as for IPv4, i.e.

no GR support for IPv6-AF in BGP in IOS

no NSF support for OSPFv3

but: works with IS-IS

As v4 and v6 routing are carried in different protocols, everything is fine

[Diagram labels]

IPv4 continues through restarting node

IPv6 routes around the failure


MPLS Deployments – P/LSR Routers

MPLS P (or LSR) routers act as transit node only

no directly connected customers or services

Assuming there is sufficient redundancy and capacity within the network, it can be better to just route around the failure

There are still several deployments around with IOS releases not supporting MPLS SSO

RPR redundancy should be configured to let linecards reload on RP failure/failover

Fast Convergence required to minimize packet loss


Other Protocols

To achieve hitless convergence, all protocols and features involved in routing and forwarding of a given service along a given path need to be GR-enabled or GR-capable

All routing protocols

Don’t forget PIM (Mcast), RSVP (MPLS-TE)

ARP/IPv6 ND

HSRP/VRRP

etc.

Did we miss anything?


HA with NAT/FW/IPSec/L2TP

Network Address Translation (NAT), Firewall or IPSec/L2TP/PPPoX all maintain session state

Broadband platforms (ASR1000, c10k, ASR9000) support SSO for PPPoX/L2TP to allow for stateful switch-over

ASR9000 maintains session state on linecard(s), so state is much easier to maintain for RP failovers

Currently, IPSec (incl. DMVPN), NAT and FW are not SSO-capable on any platform, so sessions need to be re-established after RP failover

Lack of SSO support for a fundamental feature like the ones above on a given platform is often a reason to not deploy Routing HA at all

We would rather fail over to a redundant device


Protocol Hello Considerations

Depending on platform and OS, it can take a few seconds until the standby process is operational

Neighbour adjacencies configured with “fast” hellos could time out, leading to a re-route

Default hello timers are OK, no need to increase them

Restarting RP/process starts to send hellos as soon as possible, and at a higher rate right after restart

Make sure to test failover with tuned hello times with platforms/software prior to deployment (see [1] for some test results)

[1] http://www.cisco.com/en/US/technologies/tk869/tk769/technologies_white_paper09186a00801dce40.html

%OSPF-5-ADJCHG: […], Neighbor Down: Dead timer expired
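Whether an adjacency survives the switchover comes down to simple arithmetic. The Python sketch below assumes the worst case, where the last hello left almost one hello interval before the failure, and uses a hypothetical 3-second control-plane gap; the function name and the gap value are assumptions, and real behaviour should be confirmed by testing as recommended above.

```python
def adjacency_survives(hello_s, dead_s, control_plane_gap_s):
    """Worst-case check: does the neighbor's dead timer outlast the gap?"""
    # Worst case: the last hello was sent one full hello interval before
    # the failure, so only (dead - hello) seconds remain on the neighbor.
    remaining_s = dead_s - hello_s
    return control_plane_gap_s < remaining_s


# Default-style OSPF timers (hello 10 s, dead 40 s) with an assumed 3 s gap.
default_ok = adjacency_survives(10, 40, 3)
# Tuned "fast" hellos (hello 1 s, dead 3 s) with the same assumed gap.
fast_ok = adjacency_survives(1, 3, 3)
```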


BFD Consideration

BFD (Bidirectional Forwarding Detection) is a “hello-type” protocol designed and deployed to provide sub-second failure detection

BFD needs to be SSO-aware to ensure standby RP can take over

BFD session state sync’ed

Still, platform restrictions apply, e.g. a 6500/7600 performing an RP failover causes a short traffic disruption on the bus, affecting traffic to/from the RPs

S/E chassis and 67xx/ES linecards mitigate this

Still: recommended not to go below 500 msec x 3; smaller values can cause BFD to go down


Single-RP Deployments

Any platform supporting only a single control plane (e.g. 7200, ISRs, fixed Catalyst L3 switches, etc.) can only act as a GR helper node

SSO and NSF are not configurable

When BGP GR is configured to act as helper, these platforms won’t announce GR for any address family (AF)

[Diagram: 7600 (dual RP) peering with a 7200; addresses 10.0.0.1 and 10.0.0.2]

router#show ip bgp neighbors 10.0.0.2
....
Neighbor capabilities:
...
Graceful Restart Capability: advertised and received
Remote Restart timer is 120 seconds
Address families advertised by peer:
none

router#show ip bgp neighbors 10.0.0.1
....
Neighbor capabilities:
...
Graceful Restart Capability: advertised and received
Remote Restart timer is 120 seconds
Address families advertised by peer:
IPv4 Unicast


Single-RP Deployments

Dual-RP platforms (e.g. 6500, 7600, 12000, ASR1000) with only a single RP installed are problematic

In this case, redundancy mode can be configured as RPR, documenting that linecards/etc. will be restarted when RP reloads

NSF should not be configured for any protocol; helper support is generally enabled by default

However, configuring BGP GR (to act as helper) will announce GR for supported/configured AFs

Neighbors will hold on to routes if peer goes down

Recommendation: Avoid single-RP deployments when using NSF/GR

[Diagram: 7600 with a single RP]

router#show ip bgp neighbors 10.0.0.1
....
Neighbor capabilities:
...
Graceful Restart Capability: advertised and received
Remote Restart timer is 120 seconds
Address families advertised by peer:
IPv4 Unicast


Example, using multiple AFs

Remote node shutdown, no failover

router#show bgp all neighbors 10.0.0.1 routes
   Network            Next Hop        Metric LocPrf Weight Path
*>i10.20.0.0/16       10.0.0.2             0    100      0 i
   Network            Next Hop        Metric LocPrf Weight Path
*>i2001:DB8:200::/56  2001:DB8:1::2        0    100      0 i

*Nov 24 14:31:55.487: %BGP-5-ADJCHANGE: neighbor 10.0.0.1 Down NSF peer closed the session
*Nov 24 14:31:55.487: IPv6RT[Default]: bgp 65000, Delete 2001:DB8:200::/56 from table

router#show bgp all neighbors 10.0.0.1 routes
For address family: IPv4 Unicast
BGP table version is 43, local router ID is 10.0.0.3
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal,
              r RIB-failure, S Stale
Origin codes: i - IGP, e - EGP, ? - incomplete
   Network            Next Hop        Metric LocPrf Weight Path
S>i10.20.0.0/16       10.0.0.2             0    100      0 i
router#
*Nov 24 14:33:55.323: RT: del 10.20.0.0/16 via 10.0.0.2, bgp metric [200/0]
*Nov 24 14:33:55.323: RT: delete subnet route to 10.20.0.0/16
router#

Routes not purged until the GR stale timer expires (2 mins by default)

No support for IPv6 GR in the test setup
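The two log timestamps in this example make the hold time easy to check. A small Python sketch, using only the time-of-day portion of the log lines (both events fall on the same day), computes how long the stale IPv4 route was kept; the helper name is made up for this illustration.

```python
from datetime import datetime

TIME_FORMAT = "%H:%M:%S.%f"


def seconds_between(first, second):
    """Elapsed seconds between two same-day HH:MM:SS.mmm timestamps."""
    delta = datetime.strptime(second, TIME_FORMAT) - datetime.strptime(
        first, TIME_FORMAT
    )
    return delta.total_seconds()


# Session down at 14:31:55.487, IPv4 route deleted at 14:33:55.323.
held_s = seconds_between("14:31:55.487", "14:33:55.323")
```

The result is just under 120 seconds, in line with the 2-minute default mentioned above and with the 120-second restart timer shown in the neighbor output on the earlier slides.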


BGP GR and Manual Session Resets

Helper node (B) considers TCP session reset as an indication for A restarting

B holds on to the routes via A

If A reloads or the operator on A clears the session, we would rather have B purge the routes and converge around A

BGP supports the CEASE notification: B would interpret this as a “real” reset and route around

Caveat: IOS currently does not send CEASE prior to reload, neighbor shutdown or when doing “clear bgp …”

IOS-XR and NX-OS send CEASE notification as per RFC 4486

No compelling workaround is available, we’re working on getting this implemented in IOS
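For reference, the Cease NOTIFICATION is a small message: error code 6 plus a subcode from RFC 4486. The Python sketch below encodes one without any optional data; the dictionary keys and function name are labels invented for this example, and only three of the RFC 4486 subcodes are listed.

```python
import struct

BGP_MARKER = b"\xff" * 16  # 16-byte all-ones marker in every BGP header
BGP_NOTIFICATION = 3       # BGP message type code for NOTIFICATION
CEASE = 6                  # Error code "Cease"

# Subset of the Cease subcodes defined in RFC 4486.
CEASE_SUBCODES = {
    "administrative-shutdown": 2,
    "peer-deconfigured": 3,
    "administrative-reset": 4,
}


def encode_cease(subcode_name):
    """Build a Cease NOTIFICATION with no data field (illustrative helper)."""
    body = struct.pack("!BB", CEASE, CEASE_SUBCODES[subcode_name])
    length = len(BGP_MARKER) + 2 + 1 + len(body)
    return BGP_MARKER + struct.pack("!HB", length, BGP_NOTIFICATION) + body


shutdown_msg = encode_cease("administrative-shutdown")
```

A helper that receives such a message knows the session was closed on purpose and does not need to treat the event as a restart.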

[Diagram: A (restarting node) and B (helper node) connected by a TCP session]


Fast Convergence

Sub-second IGP convergence can generally be achieved, thanks to feature development and implementation improvements in the past few years

Rapid failure detection often the biggest challenge

Robust implementation, mitigating the risk of churn

Thanks to BGP Prefix-Independent Convergence (BGP-PIC), even very large BGP tables can converge as quickly as the underlying IGP

Fast Convergence technologies enable IP networks to offer strict SLAs for mission-critical, loss-sensitive applications


Interaction with Fast Convergence

Which failure types can be addressed by HA and by Routing Convergence?

Failure               Routing HA   Routing FC
Link Failure          No           Yes
Node Failure          No           Yes
Process Failure       Yes          No *
RP Failure/Failover   Yes          Yes **

*) Some process failures result in effective re-routing, others could lead to blackholes

**) Detection of RP failover depends on HA config


Interaction with Fast Convergence

Design approach for Fast Convergence

Deploy redundant devices/links to provide path diversity for any single failure case

Detect failures as fast as possible and route around

Send notification to other devices so they can also route around

Fast Convergence addresses both link and node/RP failure, while routing HA “only” addresses RP/protocol failover

Link failures are more common than node/RP failures, hence we need to look at Fast Convergence to address those anyway

Why not just rely on Fast Convergence for node/RP failures?


Interaction with Fast Convergence

Fast Convergence generally works extremely well to route around RP failures in the core and distribution, i.e. within the core IGP domain

Core generally designed with enough capacity to allow for single device/link failures

Same level of convergence often can’t be delivered into the access

Distribution routers are therefore a sweet spot for Routing HA

Failure has impact to a large number of “customers”

Can provide lossless failover for PE RP failures

Can minimize downtime for software upgrades


Deploying HA in the Distribution Option 1: NSF disabled in core

[Diagram: Core, Dist. and Access layers]

NSF and SSO disabled in Core devices, and enabled within distribution layer

Core routers:

OSPF/ISIS NSF helper-mode only to support distribution routers

BGP & LDP don’t have helper-mode, need to enable GR to support distribution nodes

Core router RP failure will trigger routing convergence, use of LDP labels or BGP paths follows IGP

Distribution routers:

Dual-RP Nodes: All protocols enabled for graceful restart

Single-RP or non-redundant nodes: no NSF/SSO/GR-helper config

No problem, core and access neighbors will route around (if possible)

Enable NSR when available (or ISIS “nsf cisco”)

Access Routers:

GR helper mode enabled, where available

If IGP is run into the access layer, ensure all access routers are GR-aware for IGP, otherwise use OSPF “nsf cisco” on distribution routers, which doesn’t abort GR if some neighbors are not GR-aware


Deploying HA in the Distribution Option 2: NSF in Core and Dist.

NSF and SSO enabled on Core as well as Distribution devices

Core routers:

Imperative to enable NSF/GR for all protocols, incl. IGP, BGP, LDP, RSVP, etc.

Distribution routers:

Dual-RP Nodes: All protocols enabled for graceful restart, and NSR (when available)

Single-RP: needs GR-helper config to support core router failover

Redundant nodes with single RP: Problematic, can cause black hole

Access Routers:

GR helper mode enabled, where available

If IGP is run into the access layer, ensure all access routers are GR-aware for IGP, otherwise use OSPF “nsf cisco” on distribution routers, which doesn’t abort GR if some neighbors are not GR-aware


NSF/GR Deployment Considerations – Summary

Evaluate NSF/GR/SSO support for all relevant protocols and features

Check for non-standard hello and very low BFD timers

Be aware of single-RP deployments and their dependencies on NSF/GR (especially BGP)

When core provides enough capacity to re-route around failures, consider NSF/GR in distribution only

Remember that NSF/GR only addresses selected failure scenarios, ensure routing convergence is tuned to handle link and node failures quickly


NSR Deployment Considerations

Restarting node doesn’t need any neighbour awareness, so considerations regarding neighbour capabilities don’t really apply

Partial deployments on selected routers easily possible

What still applies: All protocols/features need to be HA/NSR- and SSO-capable

In addition, we generally recommend enabling NSF as a fallback to NSR – Restarting router reverts to NSF in case NSR recovery failed (or NSR wasn’t ready/sync’ed at time of failure)

Hence: Unless pure NSR deployment is targeted, same considerations/evaluations apply


NSR IOS-XR Case Study – ASR9000

ASR9000 MPLS-VPN P & PE Device

ISIS, configured with “nsf ietf”

LDP: NSR & GR

BGP: NSR & GR (L3VPN, VPLS AD)

Multicast

Test Results

RP Failover without HA: 30-140 sec traffic loss

RP FO, RP removal with HA: 0 ms (vpnv4 and VPLS flows)

Link failures (core and edge links): 140-300 msec

nsr process-failures switchover

router isis FOO
 nsf ietf
 address-family ipv4 unicast
  spf-interval initial 100 sec 100 max 1000
 interface …
  bfd fast-detect ipv4
  bfd minimum-interval 50
  bfd multiplier 3

mpls ldp
 nsr
 graceful-restart
 log graceful-restart

router bgp
 nsr
 bgp graceful-restart
 bgp graceful-restart graceful-reset

multicast-routing address-family ipv4
 nsf


Use Case: Cat6500 VSS

Catalyst VSS is a special form of RP redundancy

Active RP is in one chassis, hot-standby RP in the other

State synchronisation/SSO achieved via VSL

RP or chassis fail-over requires Routing HA mechanisms (NSF) in the same way as in a single, dual-RP chassis

Current IOS SW releases offer VSS NSF/SSO feature parity compared to single chassis HA deployments

Using a Quad-Sup deployment doesn’t change this: the redundant Sup in the chassis is not sync’ed to active, so a Sup failure will trigger a chassis reload

[Diagram: physical view (two chassis, active and standby, connected by the VSL) and logical view]


Use Case: IOS Software Redundancy on Single-RP ASR1000

Stand-by IOS process in RP in the single-engine 4RU/2RU system

Two IOS processes in a single RP function similarly to processes on separate RPs

Supports all NSF/SSO features supported by dual-RP systems

Requires additional RP memory – 4G

[Diagram: Route Processor running a Linux kernel, the IOS-XE “middleware” (Chassis Manager, Interface Manager, Forwarding Manager), an active IOS process and a backup IOS process]


Use Case: In-service Software Upgrade (ISSU)

[Diagram: ISSU sequence for the two RPs; legend: RP is active, RP is standby, new Cisco IOS, old Cisco IOS]

1. OLD active, OLD standby

2. OLD active, NEW standby

3. OLD standby, NEW active

4. OLD standby, NEW active

5. NEW standby, NEW active


Summary


Routing HA Evolution

[Figure: Routing HA evolution. Data plane and control plane of the restarting node and of its neighbor, shown for three cases (None, NSF, NSR), with failure propagation between restarting node and neighbor]


Key Takeaways

NSF & NSR augment the portfolio of technologies that increase network availability, offering (near-)zero packet loss during control plane failures

Designing for Fast Routing Convergence has been the #1 priority in most networks and has proven to be very successful

Complexity of NSF/GR deployment often made it the 2nd choice, treated with lower urgency

The introduction of NSR changes the game: it eases deployment considerably because it acts locally, per node

ISSU requires Routing HA to reduce downtime

It’s time to look at HA again!


BRKIPM-2001 Recommended Reading

Cisco Nonstop Forwarding with Stateful Switchover Deployment Guide: http://www.cisco.com/en/US/technologies/tk869/tk769/technologies_white_paper0900aecd801dc5e2_ps6550_Products_White_Paper.html

Cisco Globally Resilient IP: Overview and Applications: http://www.cisco.com/en/US/docs/ios/solutions_docs/grip/GRIP_ovr.html

Please also browse the on-site Cisco Store for suitable reading


