QoS Support
Shivkumar Kalyanaraman, Rensselaer Polytechnic Institute
What is QoS?
“Better performance,” as described by a set of parameters or measured by a set of metrics.
Generic parameters: bandwidth, delay, delay-jitter, packet loss rate (or loss probability).
Transport/application-specific parameters: timeouts, percentage of “important” packets lost.
What is QoS? (contd.)
These parameters can be measured at several granularities: “micro” flow, aggregate flow, or population.
QoS is considered “better” if (a) more parameters can be specified, and (b) QoS can be specified at a fine granularity.
QoS vs. CoS: CoS maps micro-flows to classes and may perform optional per-class resource reservation.
QoS spectrum: Best Effort ... Leased Line.
Example QoS
Bandwidth: r Mbps in a time T, with burstiness b.
Delay: worst-case. Loss: worst-case or statistical.
[Figure: a token bucket (b tokens, refilled at r tokens per second) regulating a source to at most R bps; the resulting arrival curve has a maximum burst of b*R/(R-r) bits followed by slope r, and is served with delay d and buffer Ba at service rate ra (slope ra).]
Fundamental Problems
In a FIFO service discipline, the performance seen by one flow is convolved with the arrivals of packets from all other flows! Can't get QoS with a “free-for-all.”
Need new scheduling disciplines that “isolate” a flow's performance from the arrival rates of background traffic.
[Figure: a FIFO queue contrasted with a scheduling discipline, both feeding a link of bandwidth B.]
Fundamental Problems: Conservation Law
Kleinrock's conservation law: Σ_i ρ(i)·Wq(i) = K (a constant).
Irrespective of the scheduling discipline chosen: the weighted average backlog (delay) is constant, and the average bandwidth is constant.
Zero-sum game => need to “set aside” resources for premium services.
QoS Big Picture: Control/Data Planes
[Figure: workstations and routers connected across an internetwork or WAN.]
Control plane: signaling + admission control, or SLA (contracting) + provisioning/traffic engineering.
Data plane: traffic conditioning (shaping, policing, marking, etc.) + traffic classification + scheduling and buffer management.
E.g.: Integrated Services (IntServ)
An architecture for providing QoS guarantees in IP networks for individual application sessions.
Relies on resource reservation; routers need to maintain state information about allocated resources (e.g., g) and respond to new call-setup requests.
Call Admission
Routers admit calls based on their R-spec and T-spec, and based on the resources currently allocated at the routers to other calls.
Token Bucket
Characterized by three parameters (b, r, R):
b: token depth
r: average arrival rate
R: maximum arrival rate (e.g., R = link capacity)
A bit is transmitted only when there is an available token; when a bit is transmitted, exactly one token is consumed.
[Figure: tokens arrive at r per second into a bucket of depth b; the regulator limits output to at most R bps. The resulting arrival curve rises at slope R up to a burst of b*R/(R-r) bits, then continues at slope r.]
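The (b, r) part of the regulator can be sketched in a few lines of Python. This is our own minimal sketch (class and method names are ours); the peak rate R is kept for interface parity but peak-rate shaping is not modeled here.

```python
class TokenBucket:
    """(b, r, R) regulator: depth b tokens, refill rate r tokens/s, peak R bps."""
    def __init__(self, b, r, R):
        self.b, self.r, self.R = b, r, R
        self.tokens = b          # bucket starts full
        self.last = 0.0          # time of the last conformance check

    def conforms(self, bits, now):
        """True if `bits` may be sent at time `now` within the (b, r) profile."""
        # accrue tokens since the last check, capped at the depth b
        self.tokens = min(self.b, self.tokens + self.r * (now - self.last))
        self.last = now
        if bits <= self.tokens:
            self.tokens -= bits  # one token per bit sent
            return True
        return False
```

A full burst of b bits conforms immediately; after the bucket empties, the source is limited to the average rate r.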
Per-hop Reservation
Given b, r, R and a per-hop delay d, allocate bandwidth ra and buffer space Ba so as to guarantee d.
[Figure: the arrival curve with burst b and slope r; a service line of slope ra stays within horizontal distance d (delay) and vertical distance Ba (buffer) of the arrival curve.]
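One way to size ra and Ba is the standard network-calculus result for an arrival curve min(R·t, b + r·t): the peak-rate burst lasts theta = b/(R-r) seconds, the worst delay at service rate ra (for r <= ra <= R) is theta·(R-ra)/ra, and the worst backlog is theta·(R-ra). The formulas and function below are ours, not taken verbatim from the slide.

```python
def per_hop_reservation(b, r, R, d):
    """Smallest service rate ra (and buffer Ba) so a (b, r, R) token-bucket
    flow never waits more than d at this hop (assumes r <= ra <= R)."""
    theta = b / (R - r)               # duration of the peak-rate burst
    ra = theta * R / (theta + d)      # solve theta*(R - ra)/ra == d
    ra = max(ra, r)                   # must at least keep up with the mean rate
    Ba = theta * (R - ra)             # worst-case backlog at rate ra
    return ra, Ba
```

For b = 1000 bits, r = 100 bps, R = 1100 bps and d = 1 s, this yields ra = 550 bps and Ba = 550 bits.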
Mechanisms: Queuing/Scheduling
Use a few bits in the header to indicate which queue (class) a packet goes into (also branded as CoS).
High-$$ users are classified into high-priority queues, which may also be less populated => lower delay and low likelihood of packet drop.
Ideas: priority, round-robin, classification, aggregation, ...
[Figure: traffic sources paying $$$$$$, $$$, and $ are mapped into traffic classes A, B, and C.]
Mechanisms: Buffer Management/Priority Drop
Ideas: packet marking, queue thresholds, differential dropping, buffer assignments.
[Figure: below one queue threshold only BLUE packets are dropped; above a higher threshold both RED and BLUE packets are dropped.]
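The two-threshold, color-aware drop idea can be sketched as follows; the threshold values are illustrative, not from the slides.

```python
def drop_decision(queue_len, color, thresh_blue=40, thresh_red=80):
    """Differential dropping: BLUE packets are dropped once the queue
    exceeds the lower threshold; RED packets only near saturation."""
    if color == "BLUE" and queue_len >= thresh_blue:
        return True      # drop only BLUE
    if queue_len >= thresh_red:
        return True      # drop RED and BLUE
    return False
```

With these thresholds, a half-full queue still admits RED (marked-preferred) packets while shedding BLUE ones.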
Classification
Why Classification? Providing Value-Added Services
Some examples:
Differentiated services: regard traffic from Autonomous System #33 as ‘platinum grade’.
Access Control Lists: deny udp host 194.72.72.33 194.72.6.64 0.0.0.15 eq snmp.
Committed Access Rate: rate-limit WWW traffic from sub-interface #739 to 10 Mbps.
Policy-based Routing: route all voice traffic through the ATM network.
Packet Classification
[Figure: an incoming packet's HEADER is matched by the packet-classification stage against a classifier (policy database) of predicate -> action rules; the resulting action drives the forwarding engine.]
Multi-field Packet Classification
Given a classifier with N rules, find the action associated with the highest-priority rule matching an incoming packet.

          Field 1            Field 2           ...  Field k  Action
  Rule 1  152.163.190.69/21  152.163.80.11/32  ...  UDP      A1
  Rule 2  152.168.3.0/24     152.163.0.0/16    ...  TCP      A2
  ...     ...                ...               ...  ...      ...
  Rule N  152.168.0.0/16     152.0.0.0/8       ...  ANY      An
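The simplest classifier is a linear scan in priority order, as sketched below using the rules from the table above (the function name and the "default" action are ours).

```python
import ipaddress

# Rules in priority order (Rule 1 highest): (src prefix, dst prefix, proto, action)
RULES = [
    ("152.163.190.69/21", "152.163.80.11/32", "UDP", "A1"),
    ("152.168.3.0/24",    "152.163.0.0/16",   "TCP", "A2"),
    ("152.168.0.0/16",    "152.0.0.0/8",      "ANY", "An"),
]

def classify(src, dst, proto):
    """Linear scan: return the action of the first (highest-priority) match."""
    s, d = ipaddress.ip_address(src), ipaddress.ip_address(dst)
    for sp, dp, p, action in RULES:
        # strict=False masks off host bits (Rule 1's /21 has host bits set)
        if (s in ipaddress.ip_network(sp, strict=False)
                and d in ipaddress.ip_network(dp, strict=False)
                and p in ("ANY", proto)):
            return action
    return "default"
```

This is exactly what the "sequential evaluation" row of the scheme-comparison slides refers to: small storage, but O(N) lookup time.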
Prefix Matching: 1-D Range Problem
[Figure: the address line from 0 to 2^32 - 1, with nested ranges 128.9/16; 128.9.16/20 and 128.9.176/20; 128.9.19/24 and 128.9.25/24. The address 128.9.16.14 falls inside 128.9.16/20.]
Most specific route = “longest matching prefix”.
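Longest-prefix match over the prefixes in the figure can be sketched as a scan that keeps the most specific hit (a real router would use a trie; this linear version is ours, for illustration only).

```python
import ipaddress

PREFIXES = ["128.9.0.0/16", "128.9.16.0/20", "128.9.176.0/20",
            "128.9.19.0/24", "128.9.25.0/24"]

def longest_match(addr):
    """Most specific route: among matching prefixes, keep the longest."""
    a = ipaddress.ip_address(addr)
    best = None
    for p in map(ipaddress.ip_network, PREFIXES):
        if a in p and (best is None or p.prefixlen > best.prefixlen):
            best = p
    return str(best) if best else None
```

As in the figure, 128.9.16.14 matches both 128.9/16 and 128.9.16/20, and the /20 wins.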
Classification: 2-D Geometry Problem
[Figure: rules R1-R7 drawn as rectangles in the plane Field #1 x Field #2; packets P1 and P2 are points, and each rule maps (Field #1, Field #2) to its data/action. E.g., (128.16.46.23, *) is a line in this plane; (144.24/16, 64/24) is a rectangle.]
Packet Classification: References
T.V. Lakshman and D. Stiliadis, “High-speed policy-based packet forwarding using efficient multi-dimensional range matching,” Sigcomm 1998, pp. 191-202.
V. Srinivasan, S. Suri, G. Varghese and M. Waldvogel, “Fast and scalable layer 4 switching,” Sigcomm 1998, pp. 203-214.
V. Srinivasan, G. Varghese and S. Suri, “Fast packet classification using tuple space search,” Sigcomm 1999.
P. Gupta and N. McKeown, “Packet classification using hierarchical intelligent cuttings,” Hot Interconnects VII, 1999.
P. Gupta and N. McKeown, “Packet classification on multiple fields,” Sigcomm 1999.
Proposed Schemes
Sequential evaluation. Pros: small storage; scales well with the number of fields. Cons: slow classification rates.
Ternary CAMs. Pros: single-cycle classification. Cons: cost, density, power consumption.
Grid of Tries (Srinivasan et al [Sigcomm 98]). Pros: small storage requirements and fast lookup rates for two fields; suitable for big classifiers. Cons: not easily extensible to more than two fields.
Proposed Schemes (contd.)
Crossproducting (Srinivasan et al [Sigcomm 98]). Pros: fast accesses; suitable for multiple fields. Cons: large memory requirements; suitable without caching only for classifiers with fewer than 50 rules.
Bit-level parallelism (Lakshman and Stiliadis [Sigcomm 98]). Pros: suitable for multiple fields. Cons: large memory bandwidth required; comparatively slow lookup rate; hardware only.
Proposed Schemes (contd.)
Hierarchical Intelligent Cuttings (Gupta and McKeown [HotI 99]). Pros: suitable for multiple fields; small memory requirements; good update time. Cons: large preprocessing time.
Tuple Space Search (Srinivasan et al [Sigcomm 99]). Pros: suitable for multiple fields; the basic scheme has good update times and memory requirements. Cons: classification rate can be low; requires perfect hashing for determinism.
Recursive Flow Classification (Gupta and McKeown [Sigcomm 99]). Pros: fast accesses; suitable for multiple fields; reasonable memory requirements for real-life classifiers. Cons: large preprocessing time and memory requirements for large classifiers.
Scheduling
Output Scheduling
The scheduler allocates output bandwidth and controls packet delay.
[Figure: contrast of FIFO (one shared queue) with Fair Queueing (per-flow queues served by a scheduler) at the output link.]
Motivation: Parekh-Gallager Theorem
Let a connection be allocated weights at each WFQ scheduler along its path, so that the least bandwidth it is allocated is g.
Let it be leaky-bucket regulated such that # bits sent in time [t1, t2] <= σ + g(t2 - t1).
Let the connection pass through K schedulers, where the kth scheduler has rate r(k).
Let the largest packet size in the network be P. Then:

  end-to-end delay <= σ/g + Σ_{k=1..K-1} P/g + Σ_{k=1..K} P/r(k)
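The Parekh-Gallager bound, end-to-end delay <= σ/g + (K-1)·P/g + Σ P/r(k), is easy to evaluate numerically; a small sketch (the function name is ours):

```python
def pg_delay_bound(sigma, g, P, rates):
    """Worst-case end-to-end delay for a (sigma, g) leaky-bucket flow
    through K WFQ schedulers with link rates `rates` and max packet size P.
    All quantities in consistent units (e.g., bits and bits/second)."""
    K = len(rates)
    return sigma / g + (K - 1) * P / g + sum(P / r for r in rates)
```

For example, with sigma = 100, g = 10, P = 10 and two schedulers of rate 10, the bound is 10 + 1 + 2 = 13 time units.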
Motivation
FIFO is natural but gives poor QoS: bursty flows increase delays for others, hence FIFO cannot guarantee delays.
Need round-robin-style scheduling of packets: Fair Queueing, Weighted Fair Queueing, Generalized Processor Sharing.
Scheduling: Requirements
An ideal scheduling discipline:
is easy to implement (VLSI space, execution time)
is fair (max-min fairness)
provides performance bounds (deterministic or statistical; at micro-flow or aggregate-flow granularity)
allows easy admission-control decisions (to decide whether a new flow can be allowed)
Choices: 1. Priority
A packet is served from a given priority level only if no packets exist at higher levels (multilevel priority with exhaustive service).
The highest level gets the lowest delay. Watch out for starvation!
Usually map priority levels to delay classes, e.g.: low-bandwidth urgent messages, then realtime, then non-realtime.
Scheduling Policies: Choices #1
Priority Queuing: classes have different priorities; the class may depend on explicit marking or other header info, e.g., IP source or destination, TCP port numbers, etc.
Transmit a packet from the highest-priority class with a non-empty queue. Problem: starvation.
Preemptive and non-preemptive versions exist.
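A non-preemptive multilevel priority scheduler with exhaustive service is a few lines of code; this minimal sketch (class name ours) makes the starvation risk obvious: level 0 can monopolize the link indefinitely.

```python
from collections import deque

class PriorityScheduler:
    """Multilevel priority with exhaustive service:
    a level is served only when all higher levels are empty."""
    def __init__(self, levels):
        self.queues = [deque() for _ in range(levels)]  # index 0 = highest

    def enqueue(self, pkt, level):
        self.queues[level].append(pkt)

    def dequeue(self):
        for q in self.queues:        # scan from the highest level down
            if q:
                return q.popleft()
        return None                  # all queues empty
```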
Scheduling Policies (more)
Round Robin: scan the class queues, serving one packet from each class that has a non-empty queue.
Choices: 2. Work-Conserving vs. Non-Work-Conserving
A work-conserving discipline is never idle when packets await service.
Why bother with non-work-conserving?
Non-Work-Conserving Disciplines
Key conceptual idea: delay a packet until it is eligible. This reduces delay-jitter => fewer buffers in the network.
How to choose the eligibility time?
A rate-jitter regulator bounds the maximum outgoing rate.
A delay-jitter regulator compensates for variable delay at the previous hop.
Do We Need Non-Work-Conservation?
Can remove delay-jitter at an endpoint instead, but non-work-conservation also reduces the size of switch buffers.
Increases mean delay: not a problem for playback applications.
Wastes bandwidth: can serve best-effort packets instead.
Always punishes a misbehaving source: can't have it both ways.
Bottom line: not too bad; implementation cost may be the biggest problem.
Choices: 3. Degree of Aggregation
More aggregation means less state: cheaper (smaller VLSI), less to advertise.
BUT: less individualization.
Solution: aggregate to a class; members of a class have the same performance requirement; no protection within a class.
Choices: 4. Service Within a Priority Level
In order of arrival (FCFS) or in order of a service tag.
Service tags => can arbitrarily reorder the queue; need to sort the queue, which can be expensive.
FCFS: bandwidth hogs win (no protection); no guarantee on delays.
Service tags: with an appropriate choice, both protection and delay bounds are possible, e.g., with differential buffer management and packet drop.
Weighted Round Robin
Serve a packet from each non-empty queue in turn. Unfair if packets are of different lengths or weights are not equal.
Different weights, fixed packet size: serve more than one packet per visit, after normalizing to obtain integer weights.
Different weights, variable-size packets: normalize weights by mean packet size.
E.g., weights {0.5, 0.75, 1.0}, mean packet sizes {50, 500, 1500}.
Normalized weights: {0.5/50, 0.75/500, 1.0/1500} = {0.01, 0.0015, 0.000666}; normalized again to integers: {60, 9, 4}.
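The two-step normalization above can be done exactly with rational arithmetic; the sketch below (function name ours) reproduces the slide's {60, 9, 4}.

```python
from fractions import Fraction
from math import gcd, lcm

def wrr_packet_counts(weights, mean_sizes):
    """Packets to serve per round: weight/mean_size ratios scaled to the
    smallest integers via the LCM of their denominators."""
    ratios = [Fraction(w).limit_denominator() / s
              for w, s in zip(weights, mean_sizes)]
    scale = lcm(*(r.denominator for r in ratios))
    counts = [int(r * scale) for r in ratios]
    g = gcd(*counts)                 # reduce to the smallest integer counts
    return [c // g for c in counts]
```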
Problems with Weighted Round Robin
With variable-size packets and different weights, need to know the mean packet size in advance.
Can be unfair for long periods of time. E.g.: a T3 trunk (45 Mbps) with 500 connections, each with mean packet length 500 bytes; 250 connections with weight 1, 250 with weight 10.
Each packet takes 500 * 8 / 45 Mbps ≈ 88.8 microseconds.
Round time = 2750 * 88.8 µs ≈ 244.2 ms.
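The round-time arithmetic above checks out; unrounded, the per-packet time is 88.89 µs and the round is about 244.4 ms (the slide's 244.2 ms comes from using the rounded 88.8 µs figure).

```python
# T3 example: 250 connections of weight 1 + 250 of weight 10, 500-byte packets
link_mbps = 45
pkt_bits = 500 * 8
per_pkt_us = pkt_bits / link_mbps          # microseconds per packet on a T3
packets_per_round = 250 * 1 + 250 * 10     # 2750 packets per WRR round
round_ms = packets_per_round * per_pkt_us / 1000
```

A connection can thus wait almost a quarter of a second between its service opportunities.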
Generalized Processor Sharing (GPS)
Assume a fluid model of traffic. Visit each non-empty queue in turn (round robin), serving an infinitesimal amount from each; this leads to max-min fairness.
GPS is unimplementable! We cannot serve infinitesimals, only packets.
Fair Queuing (FQ)
Idea: serve packets in the order in which they would have finished transmission in the fluid-flow system.
Map the bit-by-bit schedule onto a packet transmission schedule: transmit the packet with the lowest finish tag Fi at any given time.
Variation: Weighted Fair Queuing (WFQ).
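The finish-tag update at the heart of (W)FQ can be sketched as below; this simplified version (names ours) takes the fluid system's virtual time at arrival as given and omits how virtual time itself is advanced.

```python
def finish_tag(last_finish, virtual_arrival, length, weight=1.0):
    """F_i(k) = max(F_i(k-1), V(arrival)) + length/weight: the virtual time
    at which the packet would finish in the bit-by-bit (fluid) system."""
    return max(last_finish, virtual_arrival) + length / weight
```

The scheduler then always transmits the queued packet with the smallest tag; a backlogged flow's tags grow by length/weight per packet, while an idle flow restarts from the current virtual time.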
FQ Example
[Figure: a Flow 1 packet arrives with finish tag F=10 while a Flow 2 packet (also F=10) is transmitting; queued packets with tags F=2, F=5, F=8 are sent in tag order. The packet currently being transmitted cannot be preempted.]
WFQ: Practical Considerations
For every packet, the scheduler needs to:
classify it into the right flow queue, maintaining a linked list per flow
schedule it for departure
The complexity of both is O(log [# of flows]).
The first is hard to overcome (studied earlier); the second can be overcome by DRR.
Deficit Round Robin
Each queue gets a quantum of credit (e.g., 500) per round; a queue's head packet is sent only when its accumulated deficit covers the packet size, and the remainder carries over to the next round.
[Figure: three queues with variable-size packets (e.g., 50, 700, 250; 400, 600; 200, 600, 100) served with quantum size 500.]
Good approximation of FQ; much simpler to implement.
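The DRR mechanism can be sketched as follows (the quantum and packet sizes in the test are illustrative, not the figure's exact values):

```python
from collections import deque

def drr(queues, quantum, rounds):
    """Deficit Round Robin: each non-empty queue's deficit grows by
    `quantum` per round; send head packets while the deficit covers them."""
    deficits = [0] * len(queues)
    sent = []
    for _ in range(rounds):
        for i, q in enumerate(queues):
            if not q:
                deficits[i] = 0          # an idle queue keeps no credit
                continue
            deficits[i] += quantum
            while q and q[0] <= deficits[i]:
                pkt = q.popleft()
                deficits[i] -= pkt       # spend credit equal to packet size
                sent.append((i, pkt))
    return sent
```

A 700-byte packet behind a 500-byte quantum waits one round to accumulate enough deficit, which is how DRR stays fair with variable-size packets.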
WFQ Problems
To get a delay bound, need to pick g: the lower the delay bound, the larger g needs to be; a large g excludes more competitors from the link; g can be very large, in some cases 80 times the peak rate!
Sources must be leaky-bucket regulated, but choosing leaky-bucket parameters is problematic.
WFQ couples delay and bandwidth allocations: low delay requires allocating more bandwidth, which wastes bandwidth for low-bandwidth, low-delay sources.
Delay-Earliest Due Date (EDD)
Earliest due date: the packet with the earliest deadline is selected. Delay-EDD prescribes how to assign deadlines to packets.
A source is required to send slower than its peak rate; bandwidth at the scheduler is reserved at the peak rate.
Deadline = expected arrival time + delay bound. If a source sends faster than its contract, the delay bound does not apply.
Each packet gets a hard delay bound. The delay bound is independent of the bandwidth requirement, but reservation is at the connection's peak rate.
Implementation requires per-connection state and a priority queue.
Rate-Controlled Scheduling
A class of disciplines with two components: a regulator and a scheduler. Incoming packets are placed in the regulator, where they wait to become eligible; then they are put in the scheduler.
The regulator shapes the traffic; the scheduler provides performance guarantees.
Considered impractical; interest waned after the decline of QoS deployment.
Examples
Recall: a rate-jitter regulator bounds the maximum outgoing rate; a delay-jitter regulator compensates for variable delay at the previous hop.
Rate-jitter regulator + FIFO: similar to Delay-EDD.
Rate-jitter regulator + multi-priority FIFO: gives both bandwidth and delay guarantees (RCSP).
Delay-jitter regulator + EDD: gives bandwidth, delay, and delay-jitter bounds (Jitter-EDD).
Stateful Solution Complexity
Data path: per-flow classification, per-flow buffer management, per-flow scheduling.
Control path: install and maintain per-flow state for the data and control paths.
[Figure: flows 1..n pass through a classifier, per-flow buffer management, and a scheduler at the output interface, all driven by per-flow state.]
Differentiated Services Model
Edge routers: traffic conditioning (policing, marking, dropping) and SLA negotiation; set values in the DS byte in the IP header based upon the negotiated service and the observed traffic.
Interior routers: traffic classification and forwarding (near-stateless core!); use the DS byte as an index into the forwarding table.
[Figure: ingress edge router -> interior routers -> egress edge router.]
Diffserv Architecture
Edge router: per-flow traffic management; marks packets as in-profile and out-of-profile (e.g., against a token-bucket (r, b) profile).
Core router: per-class traffic management; buffering and scheduling based on the marking set at the edge; preference given to in-profile packets (Assured Forwarding).
DiffServ: Implementation
Classify flows into classes; maintain only per-class queues; perform FIFO within each class.
This avoids the “curse of dimensionality”.
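A near-stateless core router then needs nothing beyond a DSCP-to-class map and per-class FIFOs. A minimal sketch (the class and the DSCP table are ours; EF = 46 and AF11 = 10 are the standard DiffServ code points):

```python
# Map DSCP values to traffic classes; unknown DSCPs fall back to best effort.
CLASS_OF_DSCP = {46: "EF", 10: "AF11", 0: "BE"}

class CoreRouter:
    """Near-stateless core: the DS byte alone selects the per-class queue."""
    def __init__(self):
        self.queues = {c: [] for c in CLASS_OF_DSCP.values()}

    def enqueue(self, pkt):
        cls = CLASS_OF_DSCP.get(pkt["dscp"], "BE")
        self.queues[cls].append(pkt)     # FIFO within each class
```

Note the contrast with the stateful IntServ data path: here the number of queues is fixed by the number of classes, not the number of flows.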
DiffServ
A framework for providing differentiated QoS:
set Type of Service (ToS) bits in packet headers, which classifies packets into classes
routers maintain per-class queues
condition traffic at network edges to conform to class requirements
May still need queue management inside the network.
Network Processors (NPUs)
Slides from Raj Yavatkar, [email protected]
CPUs vs. NPUs
What makes a CPU appealing for a PC:
Flexibility: supports many applications.
Time to market: allows quick introduction of new applications.
Future proof: supports as-yet-unthought-of applications.
No one would consider using fixed-function ASICs for a PC.
Why NPUs Seem Like a Good Idea
What makes an NPU appealing:
Time to market: saves 18 months building an ASIC; code re-use.
Flexibility: protocols and standards change.
Future proof: new protocols emerge.
Less risk: bugs are more easily fixed in software.
Surely no one would consider using fixed-function ASICs for new networking equipment?
The Other Side of the NPU Debate...
Jack of all trades, master of none: NPUs are difficult to program; NPUs inevitably consume more power, run more slowly, and cost more than an ASIC.
Requires domain expertise: why would a/the networking vendor educate its suppliers?
Designed for computation rather than memory-intensive operations.
NPU Characteristics
NPUs try hard to hide memory latency. Conventional caching doesn't work:
equal numbers of reads and writes
no temporal or spatial locality
cache misses lose throughput, confuse schedulers, and break pipelines
Therefore it is common to use multiple processors with multiple contexts.
Network Processors: Load Balancing
[Figure: a dispatch stage feeds incoming packets to an array of CPUs, each with its own cache, backed by off-chip memory and shared dedicated HW support, e.g., for lookups.]
Incoming packets are dispatched to:
1. an idle processor, or
2. the processor dedicated to packets in this flow (to prevent mis-sequencing), or
3. a special-purpose processor for the flow, e.g., security, transcoding, application-level processing.
Network Processors: Pipelining
[Figure: CPUs with caches arranged in a pipeline, backed by off-chip memory and dedicated HW support, e.g., for lookups.]
Processing is broken down into (hopefully balanced) steps; each processor performs one step of the processing.
NPUs and Memory
Packet processing is all about getting packets into and out of a chip and memory; computation is a side issue.
Memory speed is everything: speed matters more than size.
NPUs and Memory (contd.)
[Figure: separate memories for packet buffering, lookups, counters, schedule state, classification, program data, and instruction code.]
A typical NPU or packet processor has 8-64 CPUs, 12 memory interfaces, and 2000 pins.
Intel IXP Network Processors
Microengines: RISC processors optimized for packet processing, with hardware support for multi-threading; they handle the fast path.
Embedded StrongARM/XScale: runs an embedded OS and handles exception tasks; the slow path and control plane.
[Figure: microengines ME1..MEn and the StrongARM control processor sharing SRAM and DRAM interfaces and a media/fabric interface.]
NPU Building Blocks: Processors [figure-only slide]
Division of Functions [figure-only slide]
NPU Building Blocks: Memory [figure-only slide]
Memory Scaling [figure-only slide]
Memory Types [figure-only slide]
NPU Building Blocks: CAM and Ternary CAM
[Figure-only slide: CAM operation; Ternary CAM (T-CAM).]
Memory Caching vs. CAM
[Figure-only slide: cache vs. Content Addressable Memory (CAM).]
Ternary CAMs
[Figure: a T-CAM as associative memory of (value, mask) entries feeding a priority encoder that selects the next hop:
  value 10.0.0.0  mask 255.0.0.0        -> R1
  value 10.1.0.0  mask 255.255.0.0      -> R2
  value 10.1.1.0  mask 255.255.255.0    -> R3
  value 10.1.3.0  mask 255.255.255.0    -> R4
  value 10.1.3.1  mask 255.255.255.255  -> R4]
Using T-CAMs for classification:
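In hardware, all T-CAM rows compare against the key in parallel and a priority encoder picks the first match; in software this behaves like a first-match scan over (value, mask) entries ordered most specific first. A sketch of that behavior (entry order mirrors the figure's priority; function names are ours):

```python
# (value, mask, next hop) entries, most specific mask first,
# mirroring the priority-encoder order in the figure.
TCAM = [
    ("10.1.3.1", "255.255.255.255", "R4"),
    ("10.1.1.0", "255.255.255.0",   "R3"),
    ("10.1.3.0", "255.255.255.0",   "R4"),
    ("10.1.0.0", "255.255.0.0",     "R2"),
    ("10.0.0.0", "255.0.0.0",       "R1"),
]

def to_int(dotted):
    a, b, c, d = map(int, dotted.split("."))
    return (a << 24) | (b << 16) | (c << 8) | d

def tcam_lookup(addr):
    """First (highest-priority) entry whose masked value matches the key."""
    x = to_int(addr)
    for value, mask, hop in TCAM:
        if x & to_int(mask) == to_int(value):
            return hop
    return None
```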
IXP: A Building Block for Network Systems
Example: IXP2800
16 microengines + XScale core
Up to 1.4 GHz ME speed
8 HW threads per ME
4K control store per ME
Multi-level memory hierarchy
Multiple inter-processor communication channels
NPU vs. GPU tradeoffs: reduced core complexity (no hardware caching, simpler instructions, shallow pipelines); multiple cores with HW multi-threading per chip.
[Figure: 16 MEv2 microengines arranged around the Intel XScale core, RDRAM controller, QDR SRAM controller, scratch memory, hash unit, PCI, and the media switch fabric interface, with per-engine memory, CAM, and signals on the interconnect.]
IXP2800 Features
Half-duplex OC-192 / 10 Gb/s Ethernet network processor.
XScale core: 700 MHz (half the ME clock); 32 KB instruction cache / 32 KB data cache.
Media/Switch Fabric interface: 2 x 16-bit LVDS transmit and receive; configured as CSIX-L2 or SPI-4.
PCI interface: 64-bit / 66 MHz interface for control; 3 DMA channels.
QDR interface (with parity): four 36-bit SRAM channels (QDR or co-processor); Network Processor Forum LookAside-1 standard interface; using a “clamshell” topology, both memory and a co-processor can be instantiated on the same channel.
RDR interface: three independent Direct Rambus DRAM interfaces; supports 4 banks or 16 interleaved banks; supports 16/32-byte bursts.
Hardware Features to Ease Packet Processing
Ring buffers: for inter-block communication/synchronization (producer-consumer paradigm).
Next-neighbor registers and signaling: allow single-cycle transfer of context to the next logical microengine to dramatically improve performance; simple, easy transfer of state.
Distributed data caching within each microengine: allows all threads to keep processing even when multiple threads are accessing the same data.
XScale Core Processor
Compliant with the ARM V5TE architecture: support for ARM's Thumb instructions; support for Digital Signal Processing (DSP) enhancements to the instruction set.
Intel's improvements to the internal pipeline improve the memory-latency-hiding abilities of the core.
Does not implement the floating-point instructions of the ARM V5 instruction set.
Microengines: RISC Processors
The IXP2800 has 16 microengines, organized into 4 clusters (4 MEs per cluster).
The ME instruction set is specifically tuned for processing network data: 40-bit x 4K control store; six-stage instruction pipeline; on average, one cycle per instruction.
Each ME has eight hardware-assisted threads of execution, and can be configured to use either all eight threads or only four.
The non-preemptive hardware thread arbiter swaps between threads in round-robin order.
MicroEngine v2
[Figure: the MEv2 datapath: a 4K-instruction control store; 256 GPRs; 640-word local memory; 128 next-neighbor registers; 128 S and 128 D transfer registers in each direction; local CSRs, a CRC unit, timers, a timestamp, and a pseudo-random number generator; a 32-bit execution datapath (multiply, find-first-bit, add/shift/logical); a 16-entry CAM with status and LRU logic; and push/pull buses connecting to neighboring engines and memory.]
Registers Available to Each ME
Four different types of registers: general-purpose, SRAM transfer, DRAM transfer, and next-neighbor (NN).
256 32-bit GPRs, accessible in thread-local or absolute mode.
256 32-bit SRAM transfer registers, used to read/write all functional units on the IXP2xxx except the DRAM.
256 32-bit DRAM transfer registers, divided equally into read-only and write-only, used exclusively for communication between the MEs and the DRAM.
Benefit of having separate transfer registers and GPRs: the ME can continue processing with the GPRs while other functional units read and write the transfer registers.
Different Types of Memory

  Type of memory   Logical width (bytes)  Size (bytes)  Approx. unloaded latency (cycles)  Special notes
  Local to ME      4                      2560          3                                  Indexed addressing, post-incr/decr
  On-chip scratch  4                      16K           60                                 Atomic ops; 16 rings with atomic get/put
  SRAM             4                      256M          150                                Atomic ops; 64-element queue array
  DRAM             8                      2G            300                                Direct path to/from the MSF
IXA Software Framework
[Figure: on the microengine pipeline, microblocks built from the microblock library (with protocol and utility libraries) in Microengine C; on the XScale core, core components built on the core component library, resource manager library, Control Plane PDK, and control-plane protocol stacks in C/C++; a hardware abstraction library underneath; external processors above.]
Microengine C Compiler
C language constructs: basic types, pointers, bit fields.
In-line assembly code support.
Aggregates: structs, unions, arrays.
What is a Microblock?
Data-plane packet processing on the microengines is divided into logical functions called microblocks.
Microblocks are coarse-grained and stateful. Examples: 5-tuple classification, IPv4 forwarding, NAT.
Several microblocks running on a microengine thread can be combined into a microblock group.
A microblock group has a dispatch loop that defines the dataflow for packets between microblocks; the group runs on each thread of one or more microengines.
Microblocks can send and receive packets to/from an associated XScale core component.
Core Components and Microblocks
[Figure: on the XScale core, user-written core components sit on the core component and resource manager libraries; on the microengines, user-written microblocks and Intel/3rd-party blocks sit on the microblock library.]
The Debate About Network Processors
The nail (packet processing). Characteristics:
1. Stream processing.
2. Multiple flows.
3. Most processing is on the header, not the data.
4. Two sets of data: packets and context.
5. Packets have no temporal locality, and special spatial locality.
6. Context has temporal and spatial locality.
The hammer (a conventional CPU with data caches). Characteristics:
1. Shared in/out bus.
2. Optimized for data with spatial and temporal locality.
3. Optimized for register accesses.