Task Partitioning for Multi-Core Network Processors
Rob Ennals, Richard Sharp (Intel Research, Cambridge)
Alan Mycroft (Programming Languages Research Group, University of Cambridge Computer Laboratory)
Talk Overview
Network Processors: what they are, and why they are interesting
Architecture Mapping Scripts (AMS): how to separate your high-level program from low-level details
Task Pipelining: how it can go wrong, and how to make sure it goes right
Network Processors
Designed for high-speed packet processing, up to 40Gb/s
High performance per watt
ASIC performance with CPU programmability
Highly parallel: multiple programmable cores
Specialised co-processors
Exploit the inherent parallelism of packet processing
Products available from many manufacturers: Intel, Broadcom, Hifn, Freescale, EZChip, Xelerated, etc.
Lots of Parallelism
Intel IXP 2800: 16 cores, each with 8 threads
EZChip NP-1c: 5 different types of cores
Agere APP: several specialised cores
Freescale C-5: 16 cores, 5 co-processors
Hifn 5NP4G: 16 cores
Xelerated X10: 200 VLIW packet engines
Broadcom BCM1480: 4 cores
Pipelined Programming Model
Used by many NP designs
Packets flow between cores
Why do this?
Cores may have different functional units
Cores may maintain state tables locally
Cores may have limited code space
Reduce contention for shared resources
Makes it easier to preserve packet ordering
[Diagram: packets flowing through a chain of cores: Core -> Core -> Core -> Core]
An Example: IXP2800
16 microengine cores, each with 8 concurrent threads
Each with local memory and specialised functional units
Pipelined programming model: dedicated datapath between adjacent microengines
Exposed IO latency: separate operations to schedule IO, and to wait for it to finish
No cache hierarchy: must manually cache data in faster memories
Very powerful, but hard to program
[Block diagram: IXP2800 internals: XScale core (32K I-cache, 32K D-cache), 16 MEv2 microengines, 16KB scratch memory, hash unit (64/48/128-bit), Rbuf/Tbuf (64 x 128B), 4 QDR SRAM channels, 3 RDRAM channels, PCI (64-bit, 66MHz), SPI4/CSIX media interface, CSRs (Fast_wr, UART, timers, GPIO, BootROM/SlowPort)]
IXP2800
IXDP-2400
[Diagram: packets from the network flow through IXP2400s connected by a CSIX fabric, then back out to the network]
Things are even harder in practice...
Systems contain multiple NPs!
What People Do Now
Design their programs around the architecture
Explicitly program each microengine thread
Explicitly access low-level functional units
Manually hoist IO operations to be early
THIS SUCKS!
The high-level program gets polluted with low-level details
IO hoisting breaks modularity
Programs are hard to understand, hard to modify, hard to write, hard to maintain, and hard to port to other platforms.
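The "manually hoist IO" step can be sketched in ordinary code. The Go model below is illustrative only (not NP microcode): channels stand in for the hardware's split-phase memory operations, and `startRead` is an invented helper that schedules a read and returns a handle to wait on later.

```go
// Sketch of manual IO hoisting. startRead models "schedule IO":
// it kicks off a read and returns a handle that delivers the
// result later, like the IXP's split-phase memory operations.
package main

import "fmt"

func startRead(addr int) <-chan int {
	done := make(chan int, 1)
	go func() {
		done <- addr * 10 // pretend this is the data at addr
	}()
	return done
}

// Unhoisted: each read blocks before its use, so the second
// read cannot start until the first has completed.
func lookupSequential(a, b int) int {
	x := <-startRead(a)
	y := <-startRead(b)
	return x + y
}

// Hoisted: both reads are issued up front and the waits happen
// late, so the two IO latencies overlap. Correct here, but note
// how the schedule leaks into code that used to be two
// independent steps: this is the modularity breakage the slide
// complains about.
func lookupHoisted(a, b int) int {
	hx := startRead(a)
	hy := startRead(b) // issued before waiting on hx
	return <-hx + <-hy
}

func main() {
	fmt.Println(lookupSequential(1, 2), lookupHoisted(1, 2))
}
```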
The PacLang Project
Aiming to make it easier to program Network Processors
Based around the PacLang language: C-like syntax and semantics
Statically allocated threads, linked by queues
Abstracts away all low-level details
A number of interesting features: linear type system
Architecture Mapping Scripts (this talk)
Various other features in progress
A prototype implementation is available
Architecture Mapping Scripts
Our compiler takes two files: a high-level PacLang program, and an architecture mapping script (AMS)
The PacLang program contains no low-level details: portable across different architectures
Very easy to read and debug
Low-level details are all in the AMS: specific to a particular architecture
Can change performance, but not semantics
Tells the compiler how to transform the program so that it executes efficiently
Design Flow with an AMS
[Diagram: the PacLang program and the AMS feed the compiler; deploy the result, analyse performance, refine the AMS, and repeat]
Advantages of the AMS Approach
Improved code readability and portability: the code isn't polluted with low-level details
Easier to get programs correct: correctness depends only on the PacLang program
The AMS can change the performance, but not the semantics
Easy exploration of optimisation choices: you only need to modify the AMS
Performance: the programmer still has a lot of control over the generated code
No need to pass all control over to someone else's optimiser
AMS + Optimiser = Good
Writing an optimiser that can do everything perfectly is hard: Network Processors are much harder to optimise for than CPUs
More like hardware synthesis than conventional compilation
Writing a program that applies an AMS is easier
An AMS can fill in gaps left by an optimiser: write an optimiser that usually does a reasonable job
Use an AMS to deal with places where the optimiser does poorly
Programmers like to have control: I may know exactly how I want to map my program to hardware
Optimisers can give unpredictable behaviour
An AMS is an addition, not an alternative to an automatic optimiser!
An AMS is an addition, not an alternative to an automatic optimiser!
This is a sufficiently important point that it is worth making twice
What can an AMS say?
How to pipeline a task across multiple microengines
What to store in each kind of memory
When to move data between different memories
How to represent data in memory (e.g. pack or not?)
How to protect shared resources
How to implement queues
Which code should be considered the critical path
Which code should be placed on the XScale core
Low-level details such as loop unrolling and function inlining
Which of several alternative algorithms to use
And whatever else one might think of
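To make the list above concrete, here is a purely hypothetical AMS-style fragment. The directive names and syntax are invented for illustration; they are not the actual PacLang AMS grammar, which is defined in the paper.

```
-- Hypothetical AMS fragment (invented syntax, illustration only)
pipeline task ip_forward into 3 stages on microengines 2..4
place table route_trie in SRAM, packed
implement queue rx_to_ip as hardware_ring, depth 128
critical_path rx -> ip_forward -> tx
place task icmp_error on xscale
inline function checksum
unroll loop hdr_copy by 4
```

The point is that every line adjusts performance-relevant mapping decisions while leaving the PacLang program's semantics untouched.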
AMS-based program pipelining
The high-level program has problem-oriented concurrency
The division of the program into tasks models the problem
Tasks do not map directly to hardware units
The AMS transforms this into implementation-oriented concurrency
Original tasks are split and joined to make new tasks
New tasks map directly to hardware units
[Diagram: the compiler, directed by the AMS, maps a small number of user tasks onto many hardware tasks]
Task Pipelining
Convert one repeating task into several tasks with a queue between them
[Diagram: the pipeline transform splits a repeating task A; B; C into two repeating tasks linked by a queue]
Pipelining is not always safe
May change the behaviour of the program:
[Diagram: a task repeatedly running q.enq(1); q.enq(2) writes 1,2,1,2,... to q. After the pipeline transform, t1 runs q.enq(1) and t2 runs q.enq(2); iterations of t1 can get ahead of t2, so elements are written to the queue out of order: 1,1,2,2,...]
Pipelining Safety is tricky (1/3)
Concurrent tasks interact in complex ways
[Diagram: a task built from q1.enq(1); q2.enq(q1.deq); q2.enq(2), with a pipeline split point; the middle fragment passes values from q1 to q2, so after splitting, values can appear on q2 out of order. Streams: q1: 1,1,...  q2: 1,1,2,2,...]
Pipelining Safety is tricky (2/3)
Concurrent tasks interact in complex ways
[Diagram: fragments q1.enq(1) | q1.enq(3); q2.enq(4) | q2.enq(2), with a pipeline split point. Streams: q1: 1,1,3,...  q2: 4,2,2,... q1 says: 1,1 written before 3. q2 says: 4 written before 2. t4 says: 3 written before 4. The unsplit task says: 2 written before 1,1. This combination is not possible in the original program.]
Pipelining Safety is tricky (3/3)
[Diagram: two pipelines compared. Unsafe: q1.enq(1); q2.enq(q1.deq); q2.enq(2), split at a pipeline point; streams q1: 1,1,... q2: 1,1,2,2,... Safe: q1.enq(1); q1.enq(q2.deq); q2.enq(2), split at a pipeline point; streams q1: 1,1,2,2 q2: 2,2,...]
Checking Pipeline Safety
Difficult for the programmer to know if a pipeline is safe
Fortunately, our compiler checks safety
Rejects the AMS if pipelining is unsafe
Applies a safety analysis that checks that pipelining cannot change observable program behaviour
I won't subject you to the full safety analysis now
Read the details in the paper
Task Rearrangement in Action
[Diagram: the forwarder's user tasks (Rx, Classify, IP, ARP, IP Options, ICMP Err, Tx) are repartitioned into hardware tasks between Rx and Tx: Classify is merged with the first third of IP, the rest of IP is split across further engines, and IP Options, ARP, and ICMP Err are merged into one task]
The PacLang Language
High-level language, abstracting all low-level details
Not IXP specific: can be targeted to any architecture
Our toolset can also generate Click modules
C-like, imperative language
Static threads, connected by queues
Advanced type system: linearly typed packets allow a better packet implementation
Packet views make it easier to work with multiple protocols
Performance
One of the main aims of PacLang: no feature is added to the language if it can't be implemented efficiently
PacLang programs run fast
We have implemented a high-performance IP forwarder
It achieves 3Gb/s on a RadiSys ENP2611 (IXP2400) card
Worst case, using min-size packets
Using a standard longest-prefix-match algorithm
Using only 5 of the 8 available microengines (including drivers)
Competitive with other IP forwarders on the same platform
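For reference, longest-prefix-match itself is simple to state. This is a minimal linear-scan sketch in Go; real forwarders, including presumably the one above, use trie-based tables for speed, and the route values here are made up:

```go
package main

import "fmt"

// A route: network address, prefix length in bits, next hop.
type route struct {
	net    uint32
	length int
	hop    string
}

// mask returns a mask covering the top `length` bits.
func mask(length int) uint32 {
	if length == 0 {
		return 0
	}
	return ^uint32(0) << (32 - length)
}

// lookup scans all routes and keeps the longest matching prefix.
// A linear scan keeps the idea visible; it is O(n) per packet.
func lookup(table []route, addr uint32) (string, bool) {
	best, found, bestLen := "", false, -1
	for _, r := range table {
		if addr&mask(r.length) == r.net && r.length > bestLen {
			best, bestLen, found = r.hop, r.length, true
		}
	}
	return best, found
}

func main() {
	table := []route{
		{0x0A000000, 8, "eth0"},  // 10.0.0.0/8
		{0x0A010000, 16, "eth1"}, // 10.1.0.0/16
	}
	hop, _ := lookup(table, 0x0A010203) // 10.1.2.3
	fmt.Println(hop)                    // "eth1" (/16 beats /8)
}
```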
Availability
A preview release of the PacLang compiler is available
Download it from Intel Research Cambridge, or from SourceForge
Full source code is available
A research prototype, not a commercial-quality product
Runs simple demo programs
But lacks many features that would be needed in a full product
Not all AMS features are currently working
A Tangent: LockBend
Abstracted Lock Optimisation for C Programs
Take an existing C program
Add some pragmas telling the compiler how to transform the program to use a different locking strategy
Fine-grained, ordered, optimistic, two-phase, etc.
The compiler verifies that program semantics is preserved
[Diagram: a legacy C program plus LockBend pragmas go into the compiler, which outputs a program with an optimised locking strategy]