QoS Support
Shivkumar Kalyanaraman, Rensselaer Polytechnic Institute
What is QoS?
“Better performance,” as described by a set of parameters or measured by a set of metrics.
Generic parameters: bandwidth, delay, delay-jitter, packet loss rate (or loss probability).
Transport/application-specific parameters: timeouts, percentage of “important” packets lost.
What is QoS? (contd.)
These parameters can be measured at several granularities: “micro” flow, aggregate flow, or population.
QoS is considered “better” if (a) more parameters can be specified, and (b) QoS can be specified at a fine granularity.
QoS vs. CoS: CoS maps micro-flows to classes and may perform optional per-class resource reservation.
QoS spectrum: Best Effort ... Leased Line.
Example QoS
Bandwidth: r Mbps in a time T, with burstiness b.
Delay: worst-case. Loss: worst-case or statistical.
[Figure: a token bucket (b tokens, refilled at r tokens per second) regulating a source to at most R bps; the resulting arrival curve has a maximum burst of b*R/(R-r) bits followed by slope r, and is served with delay d and buffer Ba at service rate ra (slope ra).]
Fundamental Problems
In a FIFO service discipline, the performance seen by one flow is convolved with the arrivals of packets from all other flows! Can't get QoS with a “free-for-all.”
Need new scheduling disciplines that “isolate” a flow's performance from the arrival rates of background traffic.
[Figure: a FIFO queue contrasted with a scheduling discipline, both feeding a link of bandwidth B.]
Fundamental Problems: Conservation Law
Kleinrock's conservation law: Σ_i ρ(i)·Wq(i) = K (a constant).
Irrespective of the scheduling discipline chosen: the weighted average backlog (delay) is constant, and the average bandwidth is constant.
Zero-sum game => need to “set aside” resources for premium services.
QoS Big Picture: Control/Data Planes
[Figure: workstations and routers connected across an internetwork or WAN.]
Control plane: signaling + admission control, or SLA (contracting) + provisioning/traffic engineering.
Data plane: traffic conditioning (shaping, policing, marking, etc.) + traffic classification + scheduling and buffer management.
E.g.: Integrated Services (IntServ)
An architecture for providing QoS guarantees in IP networks for individual application sessions.
Relies on resource reservation; routers need to maintain state information about allocated resources (e.g., g) and respond to new call-setup requests.
Call Admission
Routers admit calls based on their R-spec and T-spec, and based on the resources currently allocated at the routers to other calls.
Token Bucket
Characterized by three parameters (b, r, R):
b: token depth
r: average arrival rate
R: maximum arrival rate (e.g., R = link capacity)
A bit is transmitted only when there is an available token; when a bit is transmitted, exactly one token is consumed.
[Figure: tokens arrive at r per second into a bucket of depth b; the regulator limits output to at most R bps. The resulting arrival curve rises at slope R up to a burst of b*R/(R-r) bits, then continues at slope r.]
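The (b, r) part of the regulator can be sketched in a few lines of Python. This is our own minimal sketch (class and method names are ours); the peak rate R is kept for interface parity but peak-rate shaping is not modeled here.

```python
class TokenBucket:
    """(b, r, R) regulator: depth b tokens, refill rate r tokens/s, peak R bps."""
    def __init__(self, b, r, R):
        self.b, self.r, self.R = b, r, R
        self.tokens = b          # bucket starts full
        self.last = 0.0          # time of the last conformance check

    def conforms(self, bits, now):
        """True if `bits` may be sent at time `now` within the (b, r) profile."""
        # accrue tokens since the last check, capped at the depth b
        self.tokens = min(self.b, self.tokens + self.r * (now - self.last))
        self.last = now
        if bits <= self.tokens:
            self.tokens -= bits  # one token per bit sent
            return True
        return False
```

A full burst of b bits conforms immediately; after the bucket empties, the source is limited to the average rate r.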
Per-hop Reservation
Given b, r, R and a per-hop delay d, allocate bandwidth ra and buffer space Ba so as to guarantee d.
[Figure: the arrival curve with burst b and slope r; a service line of slope ra stays within horizontal distance d (delay) and vertical distance Ba (buffer) of the arrival curve.]
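One way to size ra and Ba is the standard network-calculus result for an arrival curve min(R·t, b + r·t): the peak-rate burst lasts theta = b/(R-r) seconds, the worst delay at service rate ra (for r <= ra <= R) is theta·(R-ra)/ra, and the worst backlog is theta·(R-ra). The formulas and function below are ours, not taken verbatim from the slide.

```python
def per_hop_reservation(b, r, R, d):
    """Smallest service rate ra (and buffer Ba) so a (b, r, R) token-bucket
    flow never waits more than d at this hop (assumes r <= ra <= R)."""
    theta = b / (R - r)               # duration of the peak-rate burst
    ra = theta * R / (theta + d)      # solve theta*(R - ra)/ra == d
    ra = max(ra, r)                   # must at least keep up with the mean rate
    Ba = theta * (R - ra)             # worst-case backlog at rate ra
    return ra, Ba
```

For b = 1000 bits, r = 100 bps, R = 1100 bps and d = 1 s, this yields ra = 550 bps and Ba = 550 bits.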
Mechanisms: Queuing/Scheduling
Use a few bits in the header to indicate which queue (class) a packet goes into (also branded as CoS).
High-$$ users are classified into high-priority queues, which may also be less populated => lower delay and low likelihood of packet drop.
Ideas: priority, round-robin, classification, aggregation, ...
[Figure: traffic sources paying $$$$$$, $$$, and $ are mapped into traffic classes A, B, and C.]
Mechanisms: Buffer Management/Priority Drop
Ideas: packet marking, queue thresholds, differential dropping, buffer assignments.
[Figure: below one queue threshold only BLUE packets are dropped; above a higher threshold both RED and BLUE packets are dropped.]
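The two-threshold, color-aware drop idea can be sketched as follows; the threshold values are illustrative, not from the slides.

```python
def drop_decision(queue_len, color, thresh_blue=40, thresh_red=80):
    """Differential dropping: BLUE packets are dropped once the queue
    exceeds the lower threshold; RED packets only near saturation."""
    if color == "BLUE" and queue_len >= thresh_blue:
        return True      # drop only BLUE
    if queue_len >= thresh_red:
        return True      # drop RED and BLUE
    return False
```

With these thresholds, a half-full queue still admits RED (marked-preferred) packets while shedding BLUE ones.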
Classification
Why Classification? Providing Value-Added Services
Some examples:
Differentiated services: regard traffic from Autonomous System #33 as ‘platinum grade’.
Access Control Lists: deny udp host 194.72.72.33 194.72.6.64 0.0.0.15 eq snmp.
Committed Access Rate: rate-limit WWW traffic from sub-interface #739 to 10 Mbps.
Policy-based Routing: route all voice traffic through the ATM network.
Packet Classification
[Figure: an incoming packet's HEADER is matched by the packet-classification stage against a classifier (policy database) of predicate -> action rules; the resulting action drives the forwarding engine.]
Multi-field Packet Classification
Given a classifier with N rules, find the action associated with the highest-priority rule matching an incoming packet.

          Field 1            Field 2           ...  Field k  Action
  Rule 1  152.163.190.69/21  152.163.80.11/32  ...  UDP      A1
  Rule 2  152.168.3.0/24     152.163.0.0/16    ...  TCP      A2
  ...     ...                ...               ...  ...      ...
  Rule N  152.168.0.0/16     152.0.0.0/8       ...  ANY      An
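The simplest classifier is a linear scan in priority order, as sketched below using the rules from the table above (the function name and the "default" action are ours).

```python
import ipaddress

# Rules in priority order (Rule 1 highest): (src prefix, dst prefix, proto, action)
RULES = [
    ("152.163.190.69/21", "152.163.80.11/32", "UDP", "A1"),
    ("152.168.3.0/24",    "152.163.0.0/16",   "TCP", "A2"),
    ("152.168.0.0/16",    "152.0.0.0/8",      "ANY", "An"),
]

def classify(src, dst, proto):
    """Linear scan: return the action of the first (highest-priority) match."""
    s, d = ipaddress.ip_address(src), ipaddress.ip_address(dst)
    for sp, dp, p, action in RULES:
        # strict=False masks off host bits (Rule 1's /21 has host bits set)
        if (s in ipaddress.ip_network(sp, strict=False)
                and d in ipaddress.ip_network(dp, strict=False)
                and p in ("ANY", proto)):
            return action
    return "default"
```

This is exactly what the "sequential evaluation" row of the scheme-comparison slides refers to: small storage, but O(N) lookup time.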
Prefix Matching: 1-D Range Problem
[Figure: the address line from 0 to 2^32 - 1, with nested ranges 128.9/16; 128.9.16/20 and 128.9.176/20; 128.9.19/24 and 128.9.25/24. The address 128.9.16.14 falls inside 128.9.16/20.]
Most specific route = “longest matching prefix”.
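Longest-prefix match over the prefixes in the figure can be sketched as a scan that keeps the most specific hit (a real router would use a trie; this linear version is ours, for illustration only).

```python
import ipaddress

PREFIXES = ["128.9.0.0/16", "128.9.16.0/20", "128.9.176.0/20",
            "128.9.19.0/24", "128.9.25.0/24"]

def longest_match(addr):
    """Most specific route: among matching prefixes, keep the longest."""
    a = ipaddress.ip_address(addr)
    best = None
    for p in map(ipaddress.ip_network, PREFIXES):
        if a in p and (best is None or p.prefixlen > best.prefixlen):
            best = p
    return str(best) if best else None
```

As in the figure, 128.9.16.14 matches both 128.9/16 and 128.9.16/20, and the /20 wins.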
Classification: 2-D Geometry Problem
[Figure: rules R1-R7 drawn as rectangles in the plane Field #1 x Field #2; packets P1 and P2 are points, and each rule maps (Field #1, Field #2) to its data/action. E.g., (128.16.46.23, *) is a line in this plane; (144.24/16, 64/24) is a rectangle.]
Packet Classification: References
T.V. Lakshman and D. Stiliadis, “High-speed policy-based packet forwarding using efficient multi-dimensional range matching,” Sigcomm 1998, pp. 191-202.
V. Srinivasan, S. Suri, G. Varghese and M. Waldvogel, “Fast and scalable layer 4 switching,” Sigcomm 1998, pp. 203-214.
V. Srinivasan, G. Varghese and S. Suri, “Fast packet classification using tuple space search,” Sigcomm 1999.
P. Gupta and N. McKeown, “Packet classification using hierarchical intelligent cuttings,” Hot Interconnects VII, 1999.
P. Gupta and N. McKeown, “Packet classification on multiple fields,” Sigcomm 1999.
Proposed Schemes
Sequential evaluation. Pros: small storage; scales well with the number of fields. Cons: slow classification rates.
Ternary CAMs. Pros: single-cycle classification. Cons: cost, density, power consumption.
Grid of Tries (Srinivasan et al [Sigcomm 98]). Pros: small storage requirements and fast lookup rates for two fields; suitable for big classifiers. Cons: not easily extensible to more than two fields.
Proposed Schemes (contd.)
Crossproducting (Srinivasan et al [Sigcomm 98]). Pros: fast accesses; suitable for multiple fields. Cons: large memory requirements; suitable without caching only for classifiers with fewer than 50 rules.
Bit-level parallelism (Lakshman and Stiliadis [Sigcomm 98]). Pros: suitable for multiple fields. Cons: large memory bandwidth required; comparatively slow lookup rate; hardware only.
Proposed Schemes (contd.)
Hierarchical Intelligent Cuttings (Gupta and McKeown [HotI 99]). Pros: suitable for multiple fields; small memory requirements; good update time. Cons: large preprocessing time.
Tuple Space Search (Srinivasan et al [Sigcomm 99]). Pros: suitable for multiple fields; the basic scheme has good update times and memory requirements. Cons: classification rate can be low; requires perfect hashing for determinism.
Recursive Flow Classification (Gupta and McKeown [Sigcomm 99]). Pros: fast accesses; suitable for multiple fields; reasonable memory requirements for real-life classifiers. Cons: large preprocessing time and memory requirements for large classifiers.
Scheduling
Output Scheduling
The scheduler allocates output bandwidth and controls packet delay.
[Figure: contrast of FIFO (one shared queue) with Fair Queueing (per-flow queues served by a scheduler) at the output link.]
Motivation: Parekh-Gallager Theorem
Let a connection be allocated weights at each WFQ scheduler along its path, so that the least bandwidth it is allocated is g.
Let it be leaky-bucket regulated such that # bits sent in time [t1, t2] <= σ + g(t2 - t1).
Let the connection pass through K schedulers, where the kth scheduler has rate r(k).
Let the largest packet size in the network be P. Then:

  end-to-end delay <= σ/g + Σ_{k=1..K-1} P/g + Σ_{k=1..K} P/r(k)
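The Parekh-Gallager bound, end-to-end delay <= σ/g + (K-1)·P/g + Σ P/r(k), is easy to evaluate numerically; a small sketch (the function name is ours):

```python
def pg_delay_bound(sigma, g, P, rates):
    """Worst-case end-to-end delay for a (sigma, g) leaky-bucket flow
    through K WFQ schedulers with link rates `rates` and max packet size P.
    All quantities in consistent units (e.g., bits and bits/second)."""
    K = len(rates)
    return sigma / g + (K - 1) * P / g + sum(P / r for r in rates)
```

For example, with sigma = 100, g = 10, P = 10 and two schedulers of rate 10, the bound is 10 + 1 + 2 = 13 time units.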
Motivation
FIFO is natural but gives poor QoS: bursty flows increase delays for others, hence FIFO cannot guarantee delays.
Need round-robin-style scheduling of packets: Fair Queueing, Weighted Fair Queueing, Generalized Processor Sharing.
Scheduling: Requirements
An ideal scheduling discipline:
is easy to implement (VLSI space, execution time)
is fair (max-min fairness)
provides performance bounds (deterministic or statistical; at micro-flow or aggregate-flow granularity)
allows easy admission-control decisions (to decide whether a new flow can be allowed)
Choices: 1. Priority
A packet is served from a given priority level only if no packets exist at higher levels (multilevel priority with exhaustive service).
The highest level gets the lowest delay. Watch out for starvation!
Usually map priority levels to delay classes, e.g.: low-bandwidth urgent messages, then realtime, then non-realtime.
Scheduling Policies: Choices #1
Priority Queuing: classes have different priorities; the class may depend on explicit marking or other header info, e.g., IP source or destination, TCP port numbers, etc.
Transmit a packet from the highest-priority class with a non-empty queue. Problem: starvation.
Preemptive and non-preemptive versions exist.
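A non-preemptive multilevel priority scheduler with exhaustive service is a few lines of code; this minimal sketch (class name ours) makes the starvation risk obvious: level 0 can monopolize the link indefinitely.

```python
from collections import deque

class PriorityScheduler:
    """Multilevel priority with exhaustive service:
    a level is served only when all higher levels are empty."""
    def __init__(self, levels):
        self.queues = [deque() for _ in range(levels)]  # index 0 = highest

    def enqueue(self, pkt, level):
        self.queues[level].append(pkt)

    def dequeue(self):
        for q in self.queues:        # scan from the highest level down
            if q:
                return q.popleft()
        return None                  # all queues empty
```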
Scheduling Policies (more)
Round Robin: scan the class queues, serving one packet from each class that has a non-empty queue.
Choices: 2. Work-Conserving vs. Non-Work-Conserving
A work-conserving discipline is never idle when packets await service.
Why bother with non-work-conserving?
Non-Work-Conserving Disciplines
Key conceptual idea: delay a packet until it is eligible. This reduces delay-jitter => fewer buffers in the network.
How to choose the eligibility time?
A rate-jitter regulator bounds the maximum outgoing rate.
A delay-jitter regulator compensates for variable delay at the previous hop.
Do We Need Non-Work-Conservation?
Can remove delay-jitter at an endpoint instead, but non-work-conservation also reduces the size of switch buffers.
Increases mean delay: not a problem for playback applications.
Wastes bandwidth: can serve best-effort packets instead.
Always punishes a misbehaving source: can't have it both ways.
Bottom line: not too bad; implementation cost may be the biggest problem.
Choices: 3. Degree of Aggregation
More aggregation means less state: cheaper (smaller VLSI), less to advertise.
BUT: less individualization.
Solution: aggregate to a class; members of a class have the same performance requirement; no protection within a class.
Choices: 4. Service Within a Priority Level
In order of arrival (FCFS) or in order of a service tag.
Service tags => can arbitrarily reorder the queue; need to sort the queue, which can be expensive.
FCFS: bandwidth hogs win (no protection); no guarantee on delays.
Service tags: with an appropriate choice, both protection and delay bounds are possible, e.g., with differential buffer management and packet drop.
Weighted Round Robin
Serve a packet from each non-empty queue in turn. Unfair if packets are of different lengths or weights are not equal.
Different weights, fixed packet size: serve more than one packet per visit, after normalizing to obtain integer weights.
Different weights, variable-size packets: normalize weights by mean packet size.
E.g., weights {0.5, 0.75, 1.0}, mean packet sizes {50, 500, 1500}.
Normalized weights: {0.5/50, 0.75/500, 1.0/1500} = {0.01, 0.0015, 0.000666}; normalized again to integers: {60, 9, 4}.
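The two-step normalization above can be done exactly with rational arithmetic; the sketch below (function name ours) reproduces the slide's {60, 9, 4}.

```python
from fractions import Fraction
from math import gcd, lcm

def wrr_packet_counts(weights, mean_sizes):
    """Packets to serve per round: weight/mean_size ratios scaled to the
    smallest integers via the LCM of their denominators."""
    ratios = [Fraction(w).limit_denominator() / s
              for w, s in zip(weights, mean_sizes)]
    scale = lcm(*(r.denominator for r in ratios))
    counts = [int(r * scale) for r in ratios]
    g = gcd(*counts)                 # reduce to the smallest integer counts
    return [c // g for c in counts]
```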
Problems with Weighted Round Robin
With variable-size packets and different weights, need to know the mean packet size in advance.
Can be unfair for long periods of time. E.g.: a T3 trunk (45 Mbps) with 500 connections, each with mean packet length 500 bytes; 250 connections with weight 1, 250 with weight 10.
Each packet takes 500 * 8 / 45 Mbps ≈ 88.8 microseconds.
Round time = 2750 * 88.8 µs ≈ 244.2 ms.
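The round-time arithmetic above checks out; unrounded, the per-packet time is 88.89 µs and the round is about 244.4 ms (the slide's 244.2 ms comes from using the rounded 88.8 µs figure).

```python
# T3 example: 250 connections of weight 1 + 250 of weight 10, 500-byte packets
link_mbps = 45
pkt_bits = 500 * 8
per_pkt_us = pkt_bits / link_mbps          # microseconds per packet on a T3
packets_per_round = 250 * 1 + 250 * 10     # 2750 packets per WRR round
round_ms = packets_per_round * per_pkt_us / 1000
```

A connection can thus wait almost a quarter of a second between its service opportunities.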
Generalized Processor Sharing (GPS)
Assume a fluid model of traffic. Visit each non-empty queue in turn (round robin), serving an infinitesimal amount from each; this leads to max-min fairness.
GPS is unimplementable! We cannot serve infinitesimals, only packets.
Fair Queuing (FQ)
Idea: serve packets in the order in which they would have finished transmission in the fluid-flow system.
Map the bit-by-bit schedule onto a packet transmission schedule: transmit the packet with the lowest finish tag Fi at any given time.
Variation: Weighted Fair Queuing (WFQ).
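The finish-tag update at the heart of (W)FQ can be sketched as below; this simplified version (names ours) takes the fluid system's virtual time at arrival as given and omits how virtual time itself is advanced.

```python
def finish_tag(last_finish, virtual_arrival, length, weight=1.0):
    """F_i(k) = max(F_i(k-1), V(arrival)) + length/weight: the virtual time
    at which the packet would finish in the bit-by-bit (fluid) system."""
    return max(last_finish, virtual_arrival) + length / weight
```

The scheduler then always transmits the queued packet with the smallest tag; a backlogged flow's tags grow by length/weight per packet, while an idle flow restarts from the current virtual time.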
FQ Example
[Figure: a Flow 1 packet arrives with finish tag F=10 while a Flow 2 packet (also F=10) is transmitting; queued packets with tags F=2, F=5, F=8 are sent in tag order. The packet currently being transmitted cannot be preempted.]
WFQ: Practical Considerations
For every packet, the scheduler needs to:
classify it into the right flow queue, maintaining a linked list per flow
schedule it for departure
The complexity of both is O(log [# of flows]).
The first is hard to overcome (studied earlier); the second can be overcome by DRR.
Deficit Round Robin
Each queue gets a quantum of credit (e.g., 500) per round; a queue's head packet is sent only when its accumulated deficit covers the packet size, and the remainder carries over to the next round.
[Figure: three queues with variable-size packets (e.g., 50, 700, 250; 400, 600; 200, 600, 100) served with quantum size 500.]
Good approximation of FQ; much simpler to implement.
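The DRR mechanism can be sketched as follows (the quantum and packet sizes in the test are illustrative, not the figure's exact values):

```python
from collections import deque

def drr(queues, quantum, rounds):
    """Deficit Round Robin: each non-empty queue's deficit grows by
    `quantum` per round; send head packets while the deficit covers them."""
    deficits = [0] * len(queues)
    sent = []
    for _ in range(rounds):
        for i, q in enumerate(queues):
            if not q:
                deficits[i] = 0          # an idle queue keeps no credit
                continue
            deficits[i] += quantum
            while q and q[0] <= deficits[i]:
                pkt = q.popleft()
                deficits[i] -= pkt       # spend credit equal to packet size
                sent.append((i, pkt))
    return sent
```

A 700-byte packet behind a 500-byte quantum waits one round to accumulate enough deficit, which is how DRR stays fair with variable-size packets.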
WFQ Problems
To get a delay bound, need to pick g: the lower the delay bound, the larger g needs to be; a large g excludes more competitors from the link; g can be very large, in some cases 80 times the peak rate!
Sources must be leaky-bucket regulated, but choosing leaky-bucket parameters is problematic.
WFQ couples delay and bandwidth allocations: low delay requires allocating more bandwidth, which wastes bandwidth for low-bandwidth, low-delay sources.
Delay-Earliest Due Date (EDD)
Earliest due date: the packet with the earliest deadline is selected. Delay-EDD prescribes how to assign deadlines to packets.
A source is required to send slower than its peak rate; bandwidth at the scheduler is reserved at the peak rate.
Deadline = expected arrival time + delay bound. If a source sends faster than its contract, the delay bound does not apply.
Each packet gets a hard delay bound. The delay bound is independent of the bandwidth requirement, but reservation is at the connection's peak rate.
Implementation requires per-connection state and a priority queue.
Rate-Controlled Scheduling
A class of disciplines with two components: a regulator and a scheduler. Incoming packets are placed in the regulator, where they wait to become eligible; then they are put in the scheduler.
The regulator shapes the traffic; the scheduler provides performance guarantees.
Considered impractical; interest waned after the decline of QoS deployment.
Examples
Recall: a rate-jitter regulator bounds the maximum outgoing rate; a delay-jitter regulator compensates for variable delay at the previous hop.
Rate-jitter regulator + FIFO: similar to Delay-EDD.
Rate-jitter regulator + multi-priority FIFO: gives both bandwidth and delay guarantees (RCSP).
Delay-jitter regulator + EDD: gives bandwidth, delay, and delay-jitter bounds (Jitter-EDD).
Stateful Solution Complexity
Data path: per-flow classification, per-flow buffer management, per-flow scheduling.
Control path: install and maintain per-flow state for the data and control paths.
[Figure: flows 1..n pass through a classifier, per-flow buffer management, and a scheduler at the output interface, all driven by per-flow state.]
Differentiated Services Model
Edge routers: traffic conditioning (policing, marking, dropping) and SLA negotiation; set values in the DS byte in the IP header based upon the negotiated service and the observed traffic.
Interior routers: traffic classification and forwarding (near-stateless core!); use the DS byte as an index into the forwarding table.
[Figure: ingress edge router -> interior routers -> egress edge router.]
Diffserv Architecture
Edge router: per-flow traffic management; marks packets as in-profile and out-of-profile (e.g., against a token-bucket (r, b) profile).
Core router: per-class traffic management; buffering and scheduling based on the marking set at the edge; preference given to in-profile packets (Assured Forwarding).
DiffServ: Implementation
Classify flows into classes; maintain only per-class queues; perform FIFO within each class.
This avoids the “curse of dimensionality”.
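A near-stateless core router then needs nothing beyond a DSCP-to-class map and per-class FIFOs. A minimal sketch (the class and the DSCP table are ours; EF = 46 and AF11 = 10 are the standard DiffServ code points):

```python
# Map DSCP values to traffic classes; unknown DSCPs fall back to best effort.
CLASS_OF_DSCP = {46: "EF", 10: "AF11", 0: "BE"}

class CoreRouter:
    """Near-stateless core: the DS byte alone selects the per-class queue."""
    def __init__(self):
        self.queues = {c: [] for c in CLASS_OF_DSCP.values()}

    def enqueue(self, pkt):
        cls = CLASS_OF_DSCP.get(pkt["dscp"], "BE")
        self.queues[cls].append(pkt)     # FIFO within each class
```

Note the contrast with the stateful IntServ data path: here the number of queues is fixed by the number of classes, not the number of flows.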
DiffServ
A framework for providing differentiated QoS:
set Type of Service (ToS) bits in packet headers, which classifies packets into classes
routers maintain per-class queues
condition traffic at network edges to conform to class requirements
May still need queue management inside the network.
Network Processors (NPUs)
Slides from Raj Yavatkar, [email protected]
CPUs vs. NPUs
What makes a CPU appealing for a PC:
Flexibility: supports many applications.
Time to market: allows quick introduction of new applications.
Future proof: supports as-yet-unthought-of applications.
No one would consider using fixed-function ASICs for a PC.
Why NPUs Seem Like a Good Idea
What makes an NPU appealing:
Time to market: saves 18 months building an ASIC; code re-use.
Flexibility: protocols and standards change.
Future proof: new protocols emerge.
Less risk: bugs are more easily fixed in software.
Surely no one would consider using fixed-function ASICs for new networking equipment?
The Other Side of the NPU Debate...
Jack of all trades, master of none: NPUs are difficult to program; NPUs inevitably consume more power, run more slowly, and cost more than an ASIC.
Requires domain expertise: why would a/the networking vendor educate its suppliers?
Designed for computation rather than memory-intensive operations.
NPU Characteristics
NPUs try hard to hide memory latency. Conventional caching doesn't work:
equal numbers of reads and writes
no temporal or spatial locality
cache misses lose throughput, confuse schedulers, and break pipelines
Therefore it is common to use multiple processors with multiple contexts.
Network Processors: Load Balancing
[Figure: a dispatch stage feeds incoming packets to an array of CPUs, each with its own cache, backed by off-chip memory and shared dedicated HW support, e.g., for lookups.]
Incoming packets are dispatched to:
1. an idle processor, or
2. the processor dedicated to packets in this flow (to prevent mis-sequencing), or
3. a special-purpose processor for the flow, e.g., security, transcoding, application-level processing.
Network Processors: Pipelining
[Figure: CPUs with caches arranged in a pipeline, backed by off-chip memory and dedicated HW support, e.g., for lookups.]
Processing is broken down into (hopefully balanced) steps; each processor performs one step of the processing.
NPUs and Memory
Packet processing is all about getting packets into and out of a chip and memory; computation is a side issue.
Memory speed is everything: speed matters more than size.
NPUs and Memory (contd.)
[Figure: separate memories for packet buffering, lookups, counters, schedule state, classification, program data, and instruction code.]
A typical NPU or packet processor has 8-64 CPUs, 12 memory interfaces, and 2000 pins.
Intel IXP Network Processors
Microengines: RISC processors optimized for packet processing, with hardware support for multi-threading; they handle the fast path.
Embedded StrongARM/XScale: runs an embedded OS and handles exception tasks; the slow path and control plane.
[Figure: microengines ME1..MEn and the StrongARM control processor sharing SRAM and DRAM interfaces and a media/fabric interface.]
NPU Building Blocks: Processors [figure-only slide]
Division of Functions [figure-only slide]
NPU Building Blocks: Memory [figure-only slide]
Memory Scaling [figure-only slide]
Memory Types [figure-only slide]
NPU Building Blocks: CAM and Ternary CAM
[Figure-only slide: CAM operation; Ternary CAM (T-CAM).]
Memory Caching vs. CAM
[Figure-only slide: cache vs. Content Addressable Memory (CAM).]
Ternary CAMs
[Figure: a T-CAM as associative memory of (value, mask) entries feeding a priority encoder that selects the next hop:
  value 10.0.0.0  mask 255.0.0.0        -> R1
  value 10.1.0.0  mask 255.255.0.0      -> R2
  value 10.1.1.0  mask 255.255.255.0    -> R3
  value 10.1.3.0  mask 255.255.255.0    -> R4
  value 10.1.3.1  mask 255.255.255.255  -> R4]
Using T-CAMs for classification:
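In hardware, all T-CAM rows compare against the key in parallel and a priority encoder picks the first match; in software this behaves like a first-match scan over (value, mask) entries ordered most specific first. A sketch of that behavior (entry order mirrors the figure's priority; function names are ours):

```python
# (value, mask, next hop) entries, most specific mask first,
# mirroring the priority-encoder order in the figure.
TCAM = [
    ("10.1.3.1", "255.255.255.255", "R4"),
    ("10.1.1.0", "255.255.255.0",   "R3"),
    ("10.1.3.0", "255.255.255.0",   "R4"),
    ("10.1.0.0", "255.255.0.0",     "R2"),
    ("10.0.0.0", "255.0.0.0",       "R1"),
]

def to_int(dotted):
    a, b, c, d = map(int, dotted.split("."))
    return (a << 24) | (b << 16) | (c << 8) | d

def tcam_lookup(addr):
    """First (highest-priority) entry whose masked value matches the key."""
    x = to_int(addr)
    for value, mask, hop in TCAM:
        if x & to_int(mask) == to_int(value):
            return hop
    return None
```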
IXP: A Building Block for Network Systems
Example: IXP2800
16 microengines + XScale core
Up to 1.4 GHz ME speed
8 HW threads per ME
4K control store per ME
Multi-level memory hierarchy
Multiple inter-processor communication channels
NPU vs. GPU tradeoffs: reduced core complexity (no hardware caching, simpler instructions, shallow pipelines); multiple cores with HW multi-threading per chip.
[Figure: 16 MEv2 microengines arranged around the Intel XScale core, RDRAM controller, QDR SRAM controller, scratch memory, hash unit, PCI, and the media switch fabric interface, with per-engine memory, CAM, and signals on the interconnect.]
IXP2800 Features
Half-duplex OC-192 / 10 Gb/s Ethernet network processor.
XScale core: 700 MHz (half the ME clock); 32 KB instruction cache / 32 KB data cache.
Media/Switch Fabric interface: 2 x 16-bit LVDS transmit and receive; configured as CSIX-L2 or SPI-4.
PCI interface: 64-bit / 66 MHz interface for control; 3 DMA channels.
QDR interface (with parity): four 36-bit SRAM channels (QDR or co-processor); Network Processor Forum LookAside-1 standard interface; using a “clamshell” topology, both memory and a co-processor can be instantiated on the same channel.
RDR interface: three independent Direct Rambus DRAM interfaces; supports 4 banks or 16 interleaved banks; supports 16/32-byte bursts.
Hardware Features to Ease Packet Processing
Ring buffers: for inter-block communication/synchronization (producer-consumer paradigm).
Next-neighbor registers and signaling: allow single-cycle transfer of context to the next logical microengine to dramatically improve performance; simple, easy transfer of state.
Distributed data caching within each microengine: allows all threads to keep processing even when multiple threads are accessing the same data.
XScale Core Processor
Compliant with the ARM V5TE architecture: support for ARM's Thumb instructions; support for Digital Signal Processing (DSP) enhancements to the instruction set.
Intel's improvements to the internal pipeline improve the memory-latency-hiding abilities of the core.
Does not implement the floating-point instructions of the ARM V5 instruction set.
Microengines: RISC Processors
The IXP2800 has 16 microengines, organized into 4 clusters (4 MEs per cluster).
The ME instruction set is specifically tuned for processing network data: 40-bit x 4K control store; six-stage instruction pipeline; on average, one cycle per instruction.
Each ME has eight hardware-assisted threads of execution, and can be configured to use either all eight threads or only four.
The non-preemptive hardware thread arbiter swaps between threads in round-robin order.
MicroEngine v2
[Figure: the MEv2 datapath: a 4K-instruction control store; 256 GPRs; 640-word local memory; 128 next-neighbor registers; 128 S and 128 D transfer registers in each direction; local CSRs, a CRC unit, timers, a timestamp, and a pseudo-random number generator; a 32-bit execution datapath (multiply, find-first-bit, add/shift/logical); a 16-entry CAM with status and LRU logic; and push/pull buses connecting to neighboring engines and memory.]
Registers Available to Each ME
Four different types of registers: general-purpose, SRAM transfer, DRAM transfer, and next-neighbor (NN).
256 32-bit GPRs, accessible in thread-local or absolute mode.
256 32-bit SRAM transfer registers, used to read/write all functional units on the IXP2xxx except the DRAM.
256 32-bit DRAM transfer registers, divided equally into read-only and write-only, used exclusively for communication between the MEs and the DRAM.
Benefit of having separate transfer registers and GPRs: the ME can continue processing with the GPRs while other functional units read and write the transfer registers.
Different Types of Memory

  Type of memory   Logical width (bytes)  Size (bytes)  Approx. unloaded latency (cycles)  Special notes
  Local to ME      4                      2560          3                                  Indexed addressing, post-incr/decr
  On-chip scratch  4                      16K           60                                 Atomic ops; 16 rings with atomic get/put
  SRAM             4                      256M          150                                Atomic ops; 64-element queue array
  DRAM             8                      2G            300                                Direct path to/from the MSF
IXA Software Framework
[Figure: on the microengine pipeline, microblocks built from the microblock library (with protocol and utility libraries) in Microengine C; on the XScale core, core components built on the core component library, resource manager library, Control Plane PDK, and control-plane protocol stacks in C/C++; a hardware abstraction library underneath; external processors above.]
Microengine C Compiler
C language constructs: basic types, pointers, bit fields.
In-line assembly code support.
Aggregates: structs, unions, arrays.
What is a Microblock?
Data-plane packet processing on the microengines is divided into logical functions called microblocks.
Microblocks are coarse-grained and stateful. Examples: 5-tuple classification, IPv4 forwarding, NAT.
Several microblocks running on a microengine thread can be combined into a microblock group.
A microblock group has a dispatch loop that defines the dataflow for packets between microblocks; the group runs on each thread of one or more microengines.
Microblocks can send and receive packets to/from an associated XScale core component.
Core Components and Microblocks
[Figure: on the XScale core, user-written core components sit on the core component and resource manager libraries; on the microengines, user-written microblocks and Intel/3rd-party blocks sit on the microblock library.]
The Debate About Network Processors
The nail (packet processing). Characteristics:
1. Stream processing.
2. Multiple flows.
3. Most processing is on the header, not the data.
4. Two sets of data: packets and context.
5. Packets have no temporal locality, and special spatial locality.
6. Context has temporal and spatial locality.
The hammer (a conventional CPU with data caches). Characteristics:
1. Shared in/out bus.
2. Optimized for data with spatial and temporal locality.
3. Optimized for register accesses.