Why are we scared of SPF? IGP Scaling and Stability Dave Katz.
Why are we scared of SPF?
IGP Scaling and Stability
Dave Katz
Overview
History
Components of IGP Convergence
Conclusions
Copyright © 2002, Juniper Networks, Inc. 3
History
1990: Stability, Scalability, Speed, Correctness--Choose one
  First few years spent just getting implementations to work
  Naïve implementations had enough trouble accomplishing correctness without being complicated by reality
Prototype-quality software shipped; things tended to fall apart in really ugly ways when pushed hard
History
1994: Stability, Scalability, Speed, Correctness--Choose two
  Convergence speed became marketing bullet, InterOp booth fodder
  Cute trick for demos, but the world wasn’t clamoring for it
  Fast convergence == network back up before someone can call the NOC
  Efforts to speed convergence tended to cause instability
History
1995: Stability, Scalability, Speed, Correctness--Choose 2.5
  Networks started getting larger; the era of large ISPs began
  Stability and scalability were really important, lest you end up in the newspaper (“AOL down for 19 hours,” other less famous catastrophes)
Simplistic software/hardware architectures were inherently unstable
Big guard rails used to stay away from the instability cliff
Speed was sacrificed (chunky timers)
The Modern Era
Pressure is mounting to get fast again
  Real applications exist that could make use of it (VoIP, etc.)
  Not just a parlor trick any more
  Perception of IP as being “too slow” used to promote other technologies
We know how to do better now
Components of IGP Convergence
Detection
LSA/LSP Generation
Flooding/Propagation
SPF Calculation
Route Recursion
Route Download
Detection
Hardware detection is vastly preferable
  Can be debounced, held down, etc., in or close to hardware to reduce churn
  GE and 10GE use in POPs makes this difficult (since you need a way to detect a failed path to a neighbor, not just a failed interface)
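The debounce and hold-down idea can be sketched in a few lines. This is a minimal illustration, not anything shown in the talk: the class name `LinkDebouncer`, the `hold_ms` parameter, and the sampling interface are all hypothetical, and a real implementation would sit in or close to the interface hardware rather than in Python.

```python
class LinkDebouncer:
    """Report a link-state change only after the raw signal has been
    stable for hold_ms milliseconds (hypothetical parameter)."""

    def __init__(self, hold_ms, initial_up=True):
        self.hold_ms = hold_ms
        self.reported_up = initial_up   # state exposed to the routing protocol
        self._candidate = initial_up    # most recent raw state seen
        self._since_ms = 0              # when the raw state last changed

    def sample(self, now_ms, raw_up):
        # Any raw transition restarts the stability timer.
        if raw_up != self._candidate:
            self._candidate = raw_up
            self._since_ms = now_ms
        # Promote the candidate once it has held long enough.
        stable_for = now_ms - self._since_ms
        if self._candidate != self.reported_up and stable_for >= self.hold_ms:
            self.reported_up = self._candidate
        return self.reported_up


# A link that bounces twice before really going down at t=30.
deb = LinkDebouncer(hold_ms=50)
samples = [(0, True), (10, False), (20, True), (30, False), (79, False), (80, False)]
reports = [deb.sample(t, up) for t, up in samples]
```

The two short bounces never reach the protocol; only the outage that lasts the full hold time is reported, which is the churn reduction the slide refers to. The cost is that detection is delayed by the hold time.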
Detection
Software detection (Hellos) ultimately needed
  Fast hellos have been destabilizing in the past due to scheduling latencies (relative to adjacency timeouts)
Fast hellos are now doable, and are even somewhat scalable (subsecond detection and hundreds of neighbors)
  Intelligent scheduling and/or distributed processing
  If Hello load exceeds 100% of capacity (CPU or protocol I/O bandwidth) things will still fail
Adjacency maintenance must be immune to heavy CPU load
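Hello-based detection amounts to a per-neighbor timeout. The sketch below assumes a dead interval of three hello intervals and invents the names `AdjacencyTable`, `hello_received`, and `expire` for illustration; none of these come from the talk or from any particular implementation.

```python
class AdjacencyTable:
    """Track when each neighbor was last heard and expire silent ones."""

    def __init__(self, hello_interval_ms, multiplier=3):
        # Assumed rule: a neighbor is dead after `multiplier` missed intervals.
        self.dead_ms = hello_interval_ms * multiplier
        self.last_heard_ms = {}

    def hello_received(self, neighbor, arrival_ms):
        # Record the arrival time of the packet, not the time it was processed.
        self.last_heard_ms[neighbor] = arrival_ms

    def expire(self, now_ms):
        """Remove and return neighbors silent for longer than the dead interval."""
        dead = sorted(n for n, t in self.last_heard_ms.items()
                      if now_ms - t > self.dead_ms)
        for n in dead:
            del self.last_heard_ms[n]
        return dead


table = AdjacencyTable(hello_interval_ms=100)   # subsecond detection: 300 ms
table.hello_received("r1", 0)
table.hello_received("r2", 0)
for t in (100, 200, 300):
    table.hello_received("r1", t)               # r2 stays silent after t=0
expired = table.expire(350)
```

If the process running this table is stalled for longer than `dead_ms`, queued hellos must be drained before `expire` runs, or healthy neighbors are dropped. That is one way to read the requirement that adjacency maintenance be immune to CPU load.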
LSA/LSP Generation
When something changes, you have to tell the world
Traditionally, generation delayed to collect multiple changes, then hold down to limit network traffic (on order of seconds)
More intelligent strategy is to rapidly announce interesting changes, allow several successive changes to be announced quickly before holddown
Newer LSPs will tend to overtake old ones during flooding on systems under load, if done intelligently
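One possible shape for the "announce quickly, then hold down" strategy is sketched here. The class name, the burst size of three, the 10 ms fast delay, and the 5 s holddown are illustrative assumptions, not values from the talk; changes are assumed to arrive in time order.

```python
class LspGenerationThrottle:
    """Allow a short burst of rapid LSP generations, then fall back to holddown."""

    def __init__(self, fast_delay_ms=10, burst=3, holddown_ms=5000):
        self.fast_delay_ms = fast_delay_ms
        self.burst = burst
        self.holddown_ms = holddown_ms
        self._burst_used = 0
        self._last_gen_ms = None    # time of the latest scheduled generation

    def next_generation_time(self, change_ms):
        """Return when the LSP carrying this change will be generated."""
        last = self._last_gen_ms
        # A generation is already pending: fold this change into it.
        if last is not None and change_ms < last:
            return last
        # A full holddown of quiet restores the fast burst.
        if last is not None and change_ms - last >= self.holddown_ms:
            self._burst_used = 0
        if self._burst_used < self.burst:
            self._burst_used += 1
            when = change_ms + self.fast_delay_ms
        else:
            when = max(change_ms + self.fast_delay_ms, last + self.holddown_ms)
        self._last_gen_ms = when
        return when


throttle = LspGenerationThrottle()
changes = [0, 100, 200, 300, 400, 11000]
times = [throttle.next_generation_time(c) for c in changes]
```

The first three changes go out within 10 ms each, the fourth waits out the holddown, the fifth rides along with it, and the burst allowance returns after the network has been quiet.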
LSA/LSP Generation
IS-IS relatively malleable; some time constants specified but none are “truly normative”
OSPF requires receivers to drop LSAs updated within five seconds (limiting senders is sufficient)
Suggestion--drop receiver behavior completely, use adaptive strategy on transmit
Old receivers will drop rapid updates, but retransmission will operate in similar timeframe (or add a knob)
Flooding/Propagation
Propagation of received LSA/LSPs delayed
  Group LSAs into bigger LSUpd packets in OSPF
  Throttling transmission bounds neighbor load (no flow control)
Propagation delays directly affect convergence
  The next guy can’t even think of calculating routes until the LSA/LSP arrives
Background noise (refreshes, flaps) adds to the problem
Flooding/Propagation
Intelligent scheduling gives “interesting” link-state data flooding priority
Adaptive retransmission schemes can help when things get tough
Proper scheduling puts noise “in the noise”
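A simple way to give "interesting" data flooding priority is a two-level priority queue in which changed LSPs always leave before periodic refreshes. The sketch below uses Python's standard `heapq`; the `FloodQueue` class, its two priority levels, and the per-interval `budget` are assumptions made for illustration.

```python
import heapq
import itertools


class FloodQueue:
    """Transmit queue that floods changed LSPs ahead of refreshes."""

    CHANGE, REFRESH = 0, 1          # lower value is sent first

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()   # preserves FIFO order within a priority

    def enqueue(self, lsp_id, content_changed):
        prio = self.CHANGE if content_changed else self.REFRESH
        heapq.heappush(self._heap, (prio, next(self._seq), lsp_id))

    def drain(self, budget):
        """Send at most `budget` LSPs this interval (throttled transmission)."""
        sent = []
        while self._heap and len(sent) < budget:
            sent.append(heapq.heappop(self._heap)[2])
        return sent


q = FloodQueue()
q.enqueue("A", content_changed=False)
q.enqueue("B", content_changed=False)
q.enqueue("C", content_changed=True)
q.enqueue("D", content_changed=False)
q.enqueue("E", content_changed=True)
first = q.drain(budget=3)
second = q.drain(budget=3)
```

With a budget of three per interval, both real changes go out in the first interval even though they arrived behind two refreshes; the refreshes fill whatever capacity is left over.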
SPF Calculation
Traditionally viewed with abject terror
  Naïve implementations were slow
  Run-to-completion scheduling led to lost hellos
  Inefficient implementations caused even more overhead (reinstalling all routes in FIB)
Holddowns and scheduling delays added to work around stability problems
  Delays slow convergence, create routing loops (2-3 times delay value)
SPF Calculation
In a properly engineered system, SPF should not be destabilizing
  Do adjacency maintenance in a preemptive fashion
  Schedule SPF calculations as background (relative to LSA/LSP processing, flooding, etc.)
  SPF should be able to run back-to-back all day long without threatening stability, and with only marginal impact on overall convergence
Incremental SPF helps even more, though gains are not significant compared to other things given current networks
Backoff algorithms arguably unnecessary (especially exponential backoff)
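To make the contrast with run-to-completion scheduling concrete, here is a sketch of SPF (Dijkstra's algorithm) written as a Python generator that hands control back after every few settled nodes. This is cooperative yielding rather than true preemption, and the function name, `slice_size`, and the toy topology are all invented for the example.

```python
import heapq


def spf_in_slices(graph, root, slice_size):
    """Dijkstra that yields None after every `slice_size` settled nodes,
    then yields the final {node: cost} map."""
    dist = {root: 0}
    done = set()
    heap = [(0, root)]
    settled_in_slice = 0
    while heap:
        d, node = heapq.heappop(heap)
        if node in done:
            continue                    # stale heap entry
        done.add(node)
        for nbr, cost in graph[node].items():
            nd = d + cost
            if nd < dist.get(nbr, float("inf")):
                dist[nbr] = nd
                heapq.heappush(heap, (nd, nbr))
        settled_in_slice += 1
        if settled_in_slice == slice_size:
            settled_in_slice = 0
            yield None                  # let hellos and flooding run
    yield dist


graph = {
    "A": {"B": 1, "C": 4},
    "B": {"A": 1, "C": 2, "D": 5},
    "C": {"A": 4, "B": 2, "D": 1},
    "D": {"B": 5, "C": 1},
}
hello_services = 0
result = None
for step in spf_in_slices(graph, "A", slice_size=2):
    if step is None:
        hello_services += 1             # stand-in for adjacency maintenance
    else:
        result = step
```

Because higher-priority work gets a turn between slices, the calculation can be restarted as often as needed without starving hellos, which is the property the slide asks for.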
Route Recursion
A change in IGP next hop may cause a next hop change in many thousands of BGP routes
By far the richest target in improving convergence
Traditionally done in software in order to produce a “flat” forwarding table
Indirect lookup in hardware has minimal forwarding time cost (essentially free if forwarding engine has any free cycles) with huge win in convergence time
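The difference between a flat table and an indirect one can be shown with two dictionaries. This is a toy model under stated assumptions: exact-match keys stand in for longest-prefix lookup, and `IndirectFib`, its method names, and the interface labels are hypothetical.

```python
class IndirectFib:
    """BGP routes point at a protocol next hop; only that next hop's
    resolution changes when the IGP reroutes."""

    def __init__(self):
        self.route_to_protocol_nh = {}          # prefix -> BGP next hop
        self.protocol_nh_to_forwarding_nh = {}  # BGP next hop -> IGP next hop

    def add_bgp_route(self, prefix, protocol_nh):
        self.route_to_protocol_nh[prefix] = protocol_nh

    def set_igp_resolution(self, protocol_nh, forwarding_nh):
        """Apply an IGP change; returns the number of entries written."""
        self.protocol_nh_to_forwarding_nh[protocol_nh] = forwarding_nh
        return 1

    def lookup(self, prefix):
        # Two-step lookup: the extra indirection is the forwarding-time cost.
        return self.protocol_nh_to_forwarding_nh[self.route_to_protocol_nh[prefix]]

    def flat_rewrite_count(self, protocol_nh):
        """Entries a flat table would have to rewrite for the same change."""
        return sum(1 for nh in self.route_to_protocol_nh.values()
                   if nh == protocol_nh)


fib = IndirectFib()
for i in range(10000):
    fib.add_bgp_route(f"10.{i // 256}.{i % 256}.0/24", "192.0.2.1")
fib.set_igp_resolution("192.0.2.1", "if-A")
writes = fib.set_igp_resolution("192.0.2.1", "if-B")   # IGP next hop changes
flat_writes = fib.flat_rewrite_count("192.0.2.1")
```

One write repoints all ten thousand routes, where a flattened table would need one rewrite per route.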
Route Download
Output of route calculations typically must be downloaded to hardware
Download overhead typically rises with the number of forwarding tables
Can be very expensive unless recursion is done in hardware
Some level of distribution (multiple engines) necessary for scaling; fixing recursion problem and careful engineering minimizes cost
Conclusions
Stability and Scalability have been the primary concerns until recently; this effort was quite successful
Some of the biggest barriers to overall network convergence have been outside of the IGP implementation per se; examine the behavior of the system as a whole (and the network as a whole)
As these barriers fall it becomes more interesting to take more heroic measures to improve IGP performance
Conclusions
2002: Stability, Scalability, Speed, Correctness--Choose 3.5
  Careful engineering should be able to provide speed, scalability, and stability
  The only effect of a heavily loaded system should be a gradual slowing in convergence (not to crash and burn)
IGPs are not inherently unstable, at least until it is no longer possible to support all of the adjacencies (and even then it should be possible to gnaw off limbs)
Conclusions
Adding knobs is not the answer
  Nobody really knows how to set them
  Most settings are wrong
Either make the parameters adaptive, or make them non-critical
  Keep adaptivity simple and bounded; behavior is chaotic enough as it is
http://www.juniper.net
Copyright © 2002, Juniper Networks, Inc. All rights reserved. Juniper Networks is registered in the U.S. Patent and Trademark Office and in other countries as a trademark of Juniper Networks, Inc. G10, Internet Processor, Internet Processor II, JUNOS, JUNOScript, M5, M10, M20, M40, M40e, and M160 are trademarks of Juniper Networks, Inc. All other trademarks, service marks, registered trademarks, or registered service marks are the property of their respective owners. All specifications are subject to change without notice.
Juniper Networks assumes no responsibility for any inaccuracies in this presentation. Juniper Networks reserves the right to change, modify, transfer, or otherwise revise this information without notice.