NetThreads: Programming NetFPGA with Threaded Software
Martin Labrecque, Gregory Steffan (ECE Dept.)
Geoff Salmon, Monia Ghobadi, Yashar Ganjali (CS Dept.)
University of Toronto
Real-Life Customers
● Hardware:
– NetFPGA board, 4 GigE ports, Virtex II Pro FPGA
● Collaboration with CS researchers:
– interested in performing network experiments
– not in coding Verilog
– want to use the GigE links at maximum capacity
Requirements: an easy-to-program, efficient system
What would the ideal solution look like?
Envisioned System (Someday)
● Many compute engines: processors for control-flow parallelism, hardware accelerators for data-level parallelism
● Delivers the expected performance
● Hardware handles communication and synchronization
(Figure: an array of processors alongside an array of hardware accelerators)
Processors inside an FPGA?
Soft Processors in FPGAs
● Soft processors: processors implemented in the FPGA fabric
● FPGAs increasingly implement SoCs with CPUs
● Commercial soft processors: NIOS-II and MicroBlaze
● Easier to program than HDL
● Customizable
(Figure: a classic 5-stage MIPS datapath (PC, instruction memory, register array, ALU, data memory) instantiated in an FPGA alongside Ethernet MACs and DDR controllers)
What is the performance requirement?
Performance in Packet Processing
● The application defines the throughput required:
– scientific instruments (< 100 Mbps/link)
– home networking (~100 Mbps/link)
– edge routing (≥ 1 Gbps/link)
● Our measure of throughput:
– bisection search for the minimum packet inter-arrival time
– must not drop any packet
Are soft processors fast enough?
Realistic Goals
● 10⁹ bps stream with a normal inter-frame gap of 12 bytes
● 2 processors running at 125 MHz
● Cycle budget:
– 152 cycles for minimally-sized 64B packets
– 3060 cycles for maximally-sized 1518B packets
Soft processors: non-trivial processing at line rate!
How can they efficiently be organized?
Key Design Features
Efficient Network Processing
● Multithreaded soft processors
● Multiple-processor support
● Memory system with specialized memories
Multiprocessor System Diagram
(Figure: two 4-threaded processors with private instruction caches share a synchronization unit, a data cache backed by off-chip DDR, and specialized input/output memories fed by packet input/output buffers)
- Overcomes the 2-port limitation of block RAMs
- Shared data cache is not the main bottleneck in our experiments
Performance of Single-Threaded Processors
● Single-issue, in-order pipeline
● Should commit 1 instruction every cycle, but:
– stalls on instruction dependences
– stalls on memory, I/O, and accelerator accesses
● Throughput depends on sequential execution of:
– packet processing
– device control
– event monitoring
Solution to avoid stalls: multithreading, i.e. many concurrent threads
Avoiding Processor Stall Cycles
Multithreading: execute streams of independent instructions
(Figure: before, a single thread stalls the 5-stage pipeline on every data or control hazard; after, instructions from 4 threads are interleaved and the pipeline stays full)
Ideally, multithreading eliminates all stalls:
- 4 threads eliminate hazards in a 5-stage pipeline
- the 5-stage multithreaded pipeline is 77% more area efficient [FPL'07]
Multithreading Evaluation
Infrastructure
• Compilation:
– modified versions of GCC 4.0.2 and Binutils 2.16 for the MIPS-I ISA
• Timing:
– no free PLL: processors run at the speed of the Ethernet MACs, 125 MHz
• Platform:
– 2 processors, 4 MAC + 1 DMA ports, 64 MB of 200 MHz DDR2 SDRAM
– Virtex II Pro 50 (speed grade 7 ns)
– 16KB private instruction caches and a shared write-back data cache
– capacities would be increased on a more modern FPGA
• Validation:
– reference trace from a MIPS simulator
– ModelSim and online instruction trace collection
- A PC server can send ~0.7 Gbps of maximally-sized packets
- A simple packet-echo application can keep up
- Complex applications are the bottleneck, not the architecture
Our Benchmarks

Benchmark    Description                                Dyn. instr. per      Variance of instr.
                                                        packet (×1000)       per packet (×1000)
UDHCP        DHCP server                                35                   36
Classifier   Regular expression + QoS                   13                   35
NAT          Network address translation + accounting   6                    7

Realistic non-trivial applications, dominated by control flow
What is limiting performance?
(Figure: number of active threads (0 to 9) over 7,000,000 cycles; a packet backlog builds up due to synchronization as tasks serialize)
Let's focus on the underlying problem: synchronization
Addressing Synchronization Overhead
Real Threads Synchronize
• All threads execute the same code
• Concurrent threads may access shared data
• Critical sections ensure correctness
Lock();
shared_var = f();
Unlock();
Impact on round-robin scheduled threads?
Multithreaded Processor with Synchronization
(Figure: 5-stage pipeline trace over time; one thread acquires a lock and later releases it while the other threads continue round-robin)
Synchronization Wrecks Round-Robin Multithreading
(Figure: 5-stage pipeline trace; threads blocked on the lock leave their issue slots empty between the acquire and the release)
With round-robin thread scheduling and contention on locks:
- fewer than 4 threads execute concurrently
- more than 18% of cycles are wasted while blocked on synchronization
Better Handling of Synchronization
(Figure: before, a thread blocked on a lock wastes its pipeline slots; after, the blocked thread is descheduled and threads 3 and 4 fill the freed slots in the 5-stage pipeline)
Thread Scheduler
• Suspend any thread waiting for a lock
• Round-robin among the remaining threads
• An unlock operation resumes threads across processors
- The multithreaded processor hides hazards across active threads
- Fewer than N active threads requires hazard detection
But hazard detection was on the critical path of the single-threaded processor.
Is there a low cost solution?
Static Hazard Detection
• Hazards can be determined at compile time

Instruction      Hazard distance (max. 2)   Min. issue cycle
addi r1,r1,r4    0                          0
addi r2,r2,r5    1                          1
or r1,r1,r8      0                          3
or r2,r2,r9      0                          4

- Hazard distances are encoded as part of the instructions
Static hazard detection allows scheduling without an extra pipeline stage
Very low area overhead (5%), no frequency penalty
Thread Scheduler Evaluation
Results on 3 Benchmark Applications
(Figure: throughput in packets per second (0 to 5000) for UDHCP, NAT, and Classifier, comparing Round-Robin and the Scheduler with 1 and 2 processors)
- Thread scheduling improves throughput by 63%, 31%, and 41%
- Why isn't the 2nd processor always improving throughput?
Cycle Breakdown in Simulation
(Figure: cycles per instruction (0 to 12) for UDHCP, Classifier, and NAT under round-robin (RR) and the scheduler (S1), broken down into Busy, No Packet, Locked, and Other)
- Removed cycles stalled waiting for a lock
- What is the bottleneck?
Impact of Allowing Packet Drops
(Figure: throughput in packets/sec (0 to 25,000) vs. allowed percentage of packet drops (0 to 5%) for 1 and 2 processors)
- System still under-utilized
- Throughput still dominated by serialization
Future Work
• Adding custom hardware accelerators:
– same interconnect as processors
– same synchronization interface
• Evaluate speculative threading:
– alleviates the need for fine-grained synchronization
– reduces conservative synchronization overhead
Conclusions
• Efficient multithreaded design:
– parallel threads hide stalls in any one thread
– thread scheduler mitigates synchronization costs
• System features:
– easy to program in C
– performance from parallelism is easy to get
On the lookout for relevant applications suitable for benchmarking
NetThreads available with compiler at: http://netfpga.org/netfpgawiki/index.php/Projects:NetThreads
Backup
Software Network Processing
• Not meant for:
– straightforward tasks accomplished at line speed in hardware
– e.g. basic switching and routing
• Advantages compared to hardware:
– complex applications are best described in high-level software
– easier to design, faster time-to-market
– can interface with custom accelerators and controllers
– can be easily updated
• Our focus: stateful applications
– data structures modified by most packets
– difficult to pipeline the code into balanced stages
Run-to-completion / pool-of-threads model for parallelism:
– each thread processes a packet from beginning to end
– no thread-specific behavior
Impact of Allowing Packet Drops (NAT benchmark)
(Figure: normalized throughput (0 to 7) vs. allowed percentage of packet drops (0 to 5%) for Round-Robin and the Scheduler with 1 and 2 processors)
Cycle Breakdown in Simulation
(Figure: cycles per instruction (0 to 10) for UDHCP, Classifier, and NAT under round-robin (RR) and the scheduler (S1), broken down into Busy, No Packet, Locked, Hazard/Bubble/Squashed, and Other)
- Removed cycles stalled waiting for a lock
- Throughput still dominated by serialization
More Sophisticated Thread Scheduling
• Add a pipeline stage to pick a hazard-free instruction:
Fetch → Thread Selection → Register Read → Execute → Memory → Writeback
• Result:
– increased instruction latency
– increased hazard window
– increased branch misprediction cost
Add hazard detection without an extra pipeline stage?
Implementation
• Where to store the hazard distance bits?
– block RAMs are multiples of 9 bits wide
– a 36-bit word leaves 4 bits free beside the 32-bit instruction
• Also encode lock and unlock flags:
Lock/Unlock + Hazard Distance (4 bits) | Instruction (32 bits)
(Figure: instruction caches and their paths to the processors are 36 bits wide; the path from off-chip DDR is 32 bits)
How to convert instructions from 36 bits to 32 bits?
Instruction Compaction: 36 → 32 bits
R-Type instructions: opcode (6) | rs (5) | rt (5) | rd (5) | sa (5) | function (6)   (example: add rd, rs, rt)
I-Type instructions: opcode (6) | rs (5) | rt (5) | immediate (16)   (example: addi rt, rs, immediate)
J-Type instructions: opcode (6) | target (26)   (example: j label)
- De-compaction: 2 block RAMs + some logic between DDR and cache
- Not on a critical path of the pipeline