Why are we scared of SPF? IGP Scaling and Stability Dave Katz.

Why are we scared of SPF?

IGP Scaling and Stability

Dave Katz

Overview

History

Components of IGP Convergence

Conclusions

Copyright © 2002, Juniper Networks, Inc.

History

1990: Stability, Scalability, Speed, Correctness--Choose one

First few years spent just getting implementations to work

Naïve implementations had enough trouble accomplishing correctness without being complicated by reality

Prototype-quality software shipped; things tended to fall apart in really ugly ways when pushed hard


History

1994: Stability, Scalability, Speed, Correctness--Choose two

Convergence speed became marketing bullet, InterOp booth fodder

Cute trick for demos, but the world wasn’t clamoring for it

Fast convergence == network back up before someone can call the NOC

Efforts to speed convergence tended to cause instability


History

1995: Stability, Scalability, Speed, Correctness--Choose 2.5

Networks started getting larger; the era of large ISPs began

Stability and scalability were really important, lest you end up in the newspaper (“AOL down for 19 hours,” other less famous catastrophes)

Simplistic software/hardware architectures were inherently unstable

Big guard rails used to stay away from the instability cliff

Speed was sacrificed (chunky timers)


The Modern Era

Pressure is mounting to get fast again

Real applications exist that could make use of it (VoIP, etc.)

Not just a parlor trick any more

Perception of IP as being “too slow” used to promote other technologies

We know how to do better now

Components of IGP Convergence

Detection

LSA/LSP Generation

Flooding/Propagation

SPF Calculation

Route Recursion

Route Download


Detection

Hardware detection is vastly preferable

Can be debounced, held down, etc., in or close to hardware to reduce churn

GE and 10GE use in POPs makes this difficult (since you need a way to detect a failed path to a neighbor, not just a failed interface)
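The debounce and hold-down idea above can be sketched in a few lines. This is a minimal illustration, not any vendor's implementation: the class name `LinkDebouncer`, the 2-second `up_hold_s` value, and the sample trace are all assumptions made for the example. Link-down is reported at once; link-up is reported only after the raw signal has stayed up for the hold period.

```python
class LinkDebouncer:
    """Report link-down at once; report link-up only after it has been stable."""

    def __init__(self, up_hold_s=2.0):
        self.up_hold_s = up_hold_s   # hypothetical hold-down before trusting "up"
        self.reported_up = True      # state last reported to the routing protocol
        self._up_since = None        # time the raw signal last came up while held down

    def sample(self, raw_up, now):
        """Feed one raw carrier sample; return the state reported upward."""
        if not raw_up:
            self._up_since = None
            self.reported_up = False             # bad news is passed on immediately
        elif not self.reported_up:
            if self._up_since is None:
                self._up_since = now
            if now - self._up_since >= self.up_hold_s:
                self.reported_up = True          # stable long enough to trust
                self._up_since = None
        return self.reported_up


def count_transitions(states):
    """Number of up/down changes in a sequence of states."""
    return sum(1 for a, b in zip(states, states[1:]) if a != b)


# A link that bounces five times within one second, then stays up.
raw = [(0.0, False), (0.1, True), (0.2, False), (0.3, True), (0.4, False),
       (0.5, True), (0.6, False), (0.7, True), (0.8, False), (0.9, True),
       (2.0, True), (3.0, True)]

deb = LinkDebouncer(up_hold_s=2.0)
raw_states = [True] + [up for _, up in raw]
reported = [True] + [deb.sample(up, t) for t, up in raw]

print(count_transitions(raw_states), "raw transitions")
print(count_transitions(reported), "transitions seen by the IGP")
```

In this trace the protocol sees one down event and one up event instead of every bounce, which is the churn reduction the slide refers to.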


Detection

Software detection (Hellos) ultimately needed

Fast hellos have been destabilizing in the past due to scheduling latencies (relative to adjacency timeouts)

Fast hellos are now doable, and are even somewhat scalable (subsecond detection and hundreds of neighbors)

Intelligent scheduling and/or distributed processing

If Hello load exceeds 100% of capacity (CPU or protocol I/O bandwidth) things will still fail

Adjacency maintenance must be immune to heavy CPU load
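A rough back-of-the-envelope model shows why scheduling latency matters so much for fast hellos. The function names, the 100 ms interval, the multiplier of 3, and the 300-neighbor figure are illustrative assumptions, not values from the talk. The model assumes one hello sent and one received per neighbor per interval, and ignores transit delay and jitter.

```python
def detection_time_ms(hello_interval_ms, multiplier):
    """Worst-case time to declare a neighbor down (the hold time)."""
    return hello_interval_ms * multiplier


def hello_load_pps(neighbors, hello_interval_ms):
    """Hello packets per second handled, counting both sent and received."""
    return 2 * neighbors * 1000 / hello_interval_ms


def tolerable_stall_ms(hello_interval_ms, multiplier):
    """Longest sender stall that cannot by itself expire the neighbor's hold timer.

    Worst case: the stall begins just before a hello is due, so one full
    interval of the hold time has already been used up.
    """
    return hello_interval_ms * (multiplier - 1)


interval_ms, multiplier, neighbors = 100, 3, 300   # assumed example values

print("detection time:", detection_time_ms(interval_ms, multiplier), "ms")
print("hello load:", hello_load_pps(neighbors, interval_ms), "packets/s")
print("tolerable stall:", tolerable_stall_ms(interval_ms, multiplier), "ms")
```

Under these assumptions, any run-to-completion task longer than about 200 ms risks dropping adjacencies, which is why adjacency maintenance has to be insulated from CPU load.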


LSA/LSP Generation

When something changes, you have to tell the world

Traditionally, generation is delayed to collect multiple changes, then held down to limit network traffic (on the order of seconds)

More intelligent strategy is to rapidly announce interesting changes, allow several successive changes to be announced quickly before holddown

Newer LSPs will tend to overtake old ones during flooding on systems under load, if done intelligently
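One way to picture "announce quickly, then hold down" is a sliding-window limiter that lets a small burst of LSPs out with a short delay and only then enforces the holddown. The sketch below is an assumption-laden illustration: the class name `LspGenerationThrottle`, the 50 ms fast delay, the burst of 3, and the 5-second holddown are made up for the example and are not taken from any specification.

```python
from collections import deque


class LspGenerationThrottle:
    """Allow `burst` quick LSP generations per holddown window, then wait."""

    def __init__(self, fast_delay_ms=50, burst=3, holddown_ms=5000):
        self.fast_delay_ms = fast_delay_ms      # hypothetical short initial delay
        self.holddown_ms = holddown_ms          # hypothetical holddown window
        self._times = deque(maxlen=burst)       # most recent generation times

    def schedule(self, change_ms):
        """Return the time at which an LSP reflecting this change is originated."""
        # A generation is already pending: this change rides along with it.
        if self._times and self._times[-1] >= change_ms:
            return self._times[-1]
        when = change_ms + self.fast_delay_ms
        # Burst used up: wait until the oldest generation leaves the window.
        if len(self._times) == self._times.maxlen:
            when = max(when, self._times[0] + self.holddown_ms)
        self._times.append(when)
        return when


throttle = LspGenerationThrottle()
changes_ms = [0, 100, 200, 300, 400, 20000]
schedule = [throttle.schedule(t) for t in changes_ms]
print(schedule)
```

The first three changes go out 50 ms after they occur, the next two are coalesced into a single delayed LSP, and a change arriving after a quiet period is fast again.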


LSA/LSP Generation

ISIS relatively malleable; some time constants specified but none are “truly normative”

OSPF requires receivers to drop LSAs updated within five seconds (limiting senders is sufficient)

Suggestion--drop receiver behavior completely, use adaptive strategy on transmit

Old receivers will drop rapid updates, but retransmission will operate in similar timeframe (or add a knob)


Flooding/Propagation

Propagation of received LSA/LSPs delayed

Group LSAs into bigger LSUpd packets in OSPF

Throttling transmission bounds neighbor load (no flow control)

Propagation delays directly affect convergence

The next guy can’t even think of calculating routes until the LSA/LSP arrives

Background noise (refreshes, flaps) adds to the problem


Flooding/Propagation

Intelligent scheduling gives “interesting” link-state data flooding priority

Adaptive retransmission schemes can help when things get tough

Proper scheduling puts noise “in the noise”
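A small priority queue illustrates what giving "interesting" link-state data priority might look like. Everything here is a hypothetical sketch: the `FloodQueue` class, the two priority classes, and the rule that an LSP counts as interesting when its contents changed are assumptions for illustration. Superseded copies of an LSP are skipped, so newer data overtakes older data in the queue.

```python
import heapq
import itertools

URGENT, ROUTINE = 0, 1   # hypothetical priority classes (lower value floods first)


class FloodQueue:
    """Flood changed LSPs before refreshes; skip copies that were superseded."""

    def __init__(self):
        self._heap = []
        self._order = itertools.count()   # keeps FIFO order within a class
        self._latest = {}                 # lsp_id -> newest sequence number seen

    def enqueue(self, lsp_id, seqno, changed):
        if seqno <= self._latest.get(lsp_id, -1):
            return                        # stale or duplicate, nothing to flood
        self._latest[lsp_id] = seqno
        prio = URGENT if changed else ROUTINE
        heapq.heappush(self._heap, (prio, next(self._order), lsp_id, seqno))

    def dequeue(self):
        """Return the next (lsp_id, seqno) to flood, or None when idle."""
        while self._heap:
            _, _, lsp_id, seqno = heapq.heappop(self._heap)
            if self._latest[lsp_id] == seqno:
                return lsp_id, seqno
            # Otherwise a newer copy is queued; drop this one silently.
        return None


q = FloodQueue()
q.enqueue("r1", 10, changed=False)   # periodic refresh
q.enqueue("r2", 7, changed=True)     # real topology change
q.enqueue("r3", 3, changed=False)    # periodic refresh
q.enqueue("r2", 8, changed=True)     # newer copy supersedes seqno 7

flooded = []
while (item := q.dequeue()) is not None:
    flooded.append(item)
print(flooded)
```

The changed LSP leaves first and only in its newest version; the refreshes follow in arrival order.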


SPF Calculation

Traditionally viewed with abject terror

Naïve implementations were slow

Run-to-completion scheduling led to lost hellos

Inefficient implementations caused even more overhead (reinstalling all routes in FIB)

Holddowns and scheduling delays added to work around stability problems

Delays slow convergence, create routing loops (2-3 times delay value)


SPF Calculation

In a properly engineered system, SPF should not be destabilizing

Do adjacency maintenance in a preemptive fashion

Schedule SPF calculations as background (relative to LSA/LSP processing, flooding, etc.)

SPF should be able to run back-to-back all day long without threatening stability, and with only marginal impact on overall convergence

Incremental SPF helps even more, though gains are not significant compared to other things given current networks

Backoff algorithms arguably unnecessary (especially exponential backoff)
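For reference, a full SPF run is a small computation. The sketch below is a textbook Dijkstra over a link-state database modeled as a dictionary; the function name `spf`, the four-node topology, and the metrics are assumptions for illustration, and it is not meant to represent any production implementation. It assumes non-negative metrics and records the first hop for each destination.

```python
import heapq


def spf(graph, root):
    """Dijkstra over {node: {neighbor: metric}}; returns {node: (cost, first_hop)}."""
    best = {root: (0, None)}
    heap = [(0, root, None)]
    done = set()
    while heap:
        cost, node, first_hop = heapq.heappop(heap)
        if node in done:
            continue                      # an older, worse entry for this node
        done.add(node)
        for nbr, metric in graph.get(node, {}).items():
            new_cost = cost + metric
            hop = nbr if node == root else first_hop
            if nbr not in best or new_cost < best[nbr][0]:
                best[nbr] = (new_cost, hop)
                heapq.heappush(heap, (new_cost, nbr, hop))
    return best


# Hypothetical four-router topology with symmetric metrics.
topology = {
    "A": {"B": 10, "C": 5},
    "B": {"A": 10, "C": 3, "D": 1},
    "C": {"A": 5, "B": 3},
    "D": {"B": 1},
}
before = spf(topology, "A")

# The C-B link fails; rerun the whole calculation from scratch.
failed = {
    "A": {"B": 10, "C": 5},
    "B": {"A": 10, "D": 1},
    "C": {"A": 5},
    "D": {"B": 1},
}
after = spf(failed, "A")

print(before)
print(after)
```

Before the failure A reaches B and D through C; afterwards it uses the direct link to B. The calculation itself is cheap; what the slides argue is that it must be scheduled so it never starves hello processing or flooding.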


Route Recursion

A change in IGP next hop may cause a next hop change in many thousands of BGP routes

By far the richest target in improving convergence

Traditionally done in software in order to produce a “flat” forwarding table

Indirect lookup in hardware has minimal forwarding time cost (essentially free if forwarding engine has any free cycles) with huge win in convergence time
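The difference between a flat table and an indirect lookup can be shown with plain dictionaries. This is only a sketch: the `IndirectNextHop` class, the interface names, the documentation address 192.0.2.1, and the count of 10,000 BGP routes are assumptions for illustration. In the flat table every BGP prefix stores the resolved forwarding next hop; in the indirect table every prefix points at a shared cell that holds it.

```python
class IndirectNextHop:
    """Shared cell holding the resolved forwarding next hop for one BGP next hop."""

    def __init__(self, forwarding_next_hop):
        self.forwarding_next_hop = forwarding_next_hop


# 10,000 hypothetical BGP prefixes, all resolving through one protocol next hop.
bgp_routes = {f"10.{i >> 8}.{i & 255}.0/24": "192.0.2.1" for i in range(10_000)}
igp = {"192.0.2.1": "ge-0/0/0"}   # protocol next hop -> outgoing interface

# Flat table: recursion resolved in software, one resolved entry per prefix.
flat_fib = {prefix: igp[nh] for prefix, nh in bgp_routes.items()}

# Indirect table: prefixes share one cell per protocol next hop.
cells = {nh: IndirectNextHop(out) for nh, out in igp.items()}
indirect_fib = {prefix: cells[nh] for prefix, nh in bgp_routes.items()}


def reroute_flat(fib, routes, protocol_nh, new_out):
    """Rewrite every prefix that resolves through protocol_nh; return write count."""
    writes = 0
    for prefix, nh in routes.items():
        if nh == protocol_nh:
            fib[prefix] = new_out
            writes += 1
    return writes


def reroute_indirect(shared_cells, protocol_nh, new_out):
    """Update the single shared cell; return write count."""
    shared_cells[protocol_nh].forwarding_next_hop = new_out
    return 1


# The IGP moves 192.0.2.1 to a different interface.
flat_writes = reroute_flat(flat_fib, bgp_routes, "192.0.2.1", "ge-0/0/1")
indirect_writes = reroute_indirect(cells, "192.0.2.1", "ge-0/0/1")

print(flat_writes, "writes with a flat table")
print(indirect_writes, "write with indirection")
```

The forwarding-time cost of indirection is one extra lookup per packet, while the convergence-time saving grows with the number of routes sharing a next hop.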


Route Download

Output of route calculations typically must be downloaded to hardware

Download overhead typically rises with the number of forwarding tables

Can be very expensive unless recursion is done in hardware

Some level of distribution (multiple engines) necessary for scaling; fixing recursion problem and careful engineering minimizes cost


Conclusions

Stability and Scalability have been the primary concerns until recently; this effort was quite successful

Some of the biggest barriers to overall network convergence have been outside of the IGP implementation per se; examine the behavior of the system as a whole (and the network as a whole)

As these barriers fall it becomes more interesting to take more heroic measures to improve IGP performance


Conclusions

2002: Stability, Scalability, Speed, Correctness--Choose 3.5

Careful engineering should be able to provide speed, scalability, and stability

The only effect of a heavily loaded system should be a gradual slowing in convergence (not to crash and burn)

IGPs are not inherently unstable, at least until it is no longer possible to support all of the adjacencies (and even then it should be possible to gnaw off limbs)


Conclusions

Adding knobs is not the answer

Nobody really knows how to set them

Most settings are wrong

Either make the parameters adaptive, or make them non-critical

Keep adaptivity simple and bounded; behavior is chaotic enough as it is

http://www.juniper.net

