NetThreads: Programming NetFPGA with Threaded Software
Martin Labrecque, Gregory Steffan (ECE Dept.)
Geoff Salmon, Monia Ghobadi, Yashar Ganjali (CS Dept.)
University of Toronto
Real-Life Customers
● Hardware:
– NetFPGA board, 4 GigE ports, Virtex II Pro FPGA
● Collaboration with CS researchers:
– interested in performing network experiments
– not in coding Verilog
– want to use the GigE links at maximum capacity
Requirements: an easy-to-program, efficient system
What would the ideal solution look like?
Envisioned System (Someday)
● Many compute engines: processors for control-flow parallelism, hardware accelerators for data-level parallelism
● Delivers the expected performance
● Hardware handles communication and synchronization
(Figure: an array of processors alongside an array of hardware accelerators)
Processors inside an FPGA?
Soft Processors in FPGAs
● Soft processors: processors implemented in the FPGA fabric
● FPGAs increasingly implement SoCs with CPUs
● Commercial soft processors: NIOS-II and MicroBlaze
● Easier to program than HDL
● Customizable
(Figure: a classic 5-stage MIPS datapath (PC, instruction memory, register array, ALU, data memory) instantiated in an FPGA alongside Ethernet MACs and DDR controllers)
What is the performance requirement?
Performance in Packet Processing
● The application defines the throughput required:
– scientific instruments (< 100 Mbps/link)
– home networking (~100 Mbps/link)
– edge routing (≥ 1 Gbps/link)
● Our measure of throughput:
– bisection search for the minimum packet inter-arrival time
– must not drop any packet
Are soft processors fast enough?
Realistic Goals
● 10⁹ bps stream with a normal inter-frame gap of 12 bytes
● 2 processors running at 125 MHz
● Cycle budget:
– 152 cycles for minimally-sized 64B packets
– 3060 cycles for maximally-sized 1518B packets
Soft processors: non-trivial processing at line rate!
How can they efficiently be organized?
Key Design Features
Efficient Network Processing
● Multithreaded soft processors
● Multiple-processor support
● Memory system with specialized memories
Multiprocessor System Diagram
(Figure: two 4-threaded processors with private instruction caches share a synchronization unit, a data cache backed by off-chip DDR, and specialized input/output memories fed by packet input/output buffers)
- Overcomes the 2-port limitation of block RAMs
- Shared data cache is not the main bottleneck in our experiments
Performance of Single-Threaded Processors
● Single-issue, in-order pipeline
● Should commit 1 instruction every cycle, but:
– stalls on instruction dependences
– stalls on memory, I/O, and accelerator accesses
● Throughput depends on sequential execution of:
– packet processing
– device control
– event monitoring
Solution to avoid stalls: multithreading, i.e. many concurrent threads
Avoiding Processor Stall Cycles
Multithreading: execute streams of independent instructions
(Figure: before, a single thread stalls the 5-stage pipeline on every data or control hazard; after, instructions from 4 threads are interleaved and the pipeline stays full)
Ideally, multithreading eliminates all stalls:
- 4 threads eliminate hazards in a 5-stage pipeline
- the 5-stage multithreaded pipeline is 77% more area efficient [FPL'07]
Multithreading Evaluation
Infrastructure
• Compilation:
– modified versions of GCC 4.0.2 and Binutils 2.16 for the MIPS-I ISA
• Timing:
– no free PLL: processors run at the speed of the Ethernet MACs, 125 MHz
• Platform:
– 2 processors, 4 MAC + 1 DMA ports, 64 MB of 200 MHz DDR2 SDRAM
– Virtex II Pro 50 (speed grade 7 ns)
– 16KB private instruction caches and a shared write-back data cache
– capacities would be increased on a more modern FPGA
• Validation:
– reference trace from a MIPS simulator
– ModelSim and online instruction trace collection
- A PC server can send ~0.7 Gbps of maximally-sized packets
- A simple packet-echo application can keep up
- Complex applications are the bottleneck, not the architecture
Our Benchmarks

Benchmark    Description                                Dyn. instr. per      Variance of instr.
                                                        packet (×1000)       per packet (×1000)
UDHCP        DHCP server                                35                   36
Classifier   Regular expression + QoS                   13                   35
NAT          Network address translation + accounting   6                    7

Realistic non-trivial applications, dominated by control flow
What is limiting performance?
(Figure: number of active threads (0 to 9) over 7,000,000 cycles; a packet backlog builds up due to synchronization as tasks serialize)
Let's focus on the underlying problem: synchronization
Addressing Synchronization Overhead
Real Threads Synchronize
• All threads execute the same code
• Concurrent threads may access shared data
• Critical sections ensure correctness
Lock();
shared_var = f();
Unlock();
Impact on round-robin scheduled threads?
Multithreaded Processor with Synchronization
(Figure: 5-stage pipeline trace over time; one thread acquires a lock and later releases it while the other threads continue round-robin)
Synchronization Wrecks Round-Robin Multithreading
(Figure: 5-stage pipeline trace; threads blocked on the lock leave their issue slots empty between the acquire and the release)
With round-robin thread scheduling and contention on locks:
- fewer than 4 threads execute concurrently
- more than 18% of cycles are wasted while blocked on synchronization
Better Handling of Synchronization
(Figure: before, a thread blocked on a lock wastes its pipeline slots; after, the blocked thread is descheduled and threads 3 and 4 fill the freed slots in the 5-stage pipeline)
Thread Scheduler
• Suspend any thread waiting for a lock
• Round-robin among the remaining threads
• An unlock operation resumes threads across processors
- The multithreaded processor hides hazards across active threads
- Fewer than N active threads requires hazard detection
But hazard detection was on the critical path of the single-threaded processor.
Is there a low cost solution?
Static Hazard Detection
• Hazards can be determined at compile time

Instruction      Hazard distance (max. 2)   Min. issue cycle
addi r1,r1,r4    0                          0
addi r2,r2,r5    1                          1
or r1,r1,r8      0                          3
or r2,r2,r9      0                          4

- Hazard distances are encoded as part of the instructions
Static hazard detection allows scheduling without an extra pipeline stage
Very low area overhead (5%), no frequency penalty
Thread Scheduler Evaluation
Results on 3 Benchmark Applications
(Figure: throughput in packets per second (0 to 5000) for UDHCP, NAT, and Classifier, comparing Round-Robin and the Scheduler with 1 and 2 processors)
- Thread scheduling improves throughput by 63%, 31%, and 41%
- Why isn't the 2nd processor always improving throughput?
Cycle Breakdown in Simulation
(Figure: cycles per instruction (0 to 12) for UDHCP, Classifier, and NAT under round-robin (RR) and the scheduler (S1), broken down into Busy, No Packet, Locked, and Other)
- Removed cycles stalled waiting for a lock
- What is the bottleneck?
Impact of Allowing Packet Drops
(Figure: throughput in packets/sec (0 to 25,000) vs. allowed percentage of packet drops (0 to 5%) for 1 and 2 processors)
- System still under-utilized
- Throughput still dominated by serialization
Future Work
• Adding custom hardware accelerators:
– same interconnect as processors
– same synchronization interface
• Evaluate speculative threading:
– alleviates the need for fine-grained synchronization
– reduces conservative synchronization overhead
Conclusions
• Efficient multithreaded design:
– parallel threads hide stalls in any one thread
– thread scheduler mitigates synchronization costs
• System features:
– easy to program in C
– performance from parallelism is easy to get
On the lookout for relevant applications suitable for benchmarking
NetThreads available with compiler at: http://netfpga.org/netfpgawiki/index.php/Projects:NetThreads
Backup
Software Network Processing
• Not meant for:
– straightforward tasks accomplished at line speed in hardware
– e.g. basic switching and routing
• Advantages compared to hardware:
– complex applications are best described in high-level software
– easier to design, faster time-to-market
– can interface with custom accelerators and controllers
– can be easily updated
• Our focus: stateful applications
– data structures modified by most packets
– difficult to pipeline the code into balanced stages
Run-to-completion / pool-of-threads model for parallelism:
– each thread processes a packet from beginning to end
– no thread-specific behavior
Impact of Allowing Packet Drops (NAT benchmark)
(Figure: normalized throughput (0 to 7) vs. allowed percentage of packet drops (0 to 5%) for Round-Robin and the Scheduler with 1 and 2 processors)
Cycle Breakdown in Simulation
(Figure: cycles per instruction (0 to 10) for UDHCP, Classifier, and NAT under round-robin (RR) and the scheduler (S1), broken down into Busy, No Packet, Locked, Hazard/Bubble/Squashed, and Other)
- Removed cycles stalled waiting for a lock
- Throughput still dominated by serialization
More Sophisticated Thread Scheduling
• Add a pipeline stage to pick a hazard-free instruction:
Fetch → Thread Selection → Register Read → Execute → Memory → Writeback
• Result:
– increased instruction latency
– increased hazard window
– increased branch misprediction cost
Add hazard detection without an extra pipeline stage?
Implementation
• Where to store the hazard distance bits?
– block RAMs are multiples of 9 bits wide
– a 36-bit word leaves 4 bits free beside the 32-bit instruction
• Also encode lock and unlock flags:
Lock/Unlock + Hazard Distance (4 bits) | Instruction (32 bits)
(Figure: instruction caches and their paths to the processors are 36 bits wide; the path from off-chip DDR is 32 bits)
How to convert instructions from 36 bits to 32 bits?
Instruction Compaction: 36 → 32 bits
R-Type instructions: opcode (6) | rs (5) | rt (5) | rd (5) | sa (5) | function (6)   (example: add rd, rs, rt)
I-Type instructions: opcode (6) | rs (5) | rt (5) | immediate (16)   (example: addi rt, rs, immediate)
J-Type instructions: opcode (6) | target (26)   (example: j label)
- De-compaction: 2 block RAMs + some logic between DDR and cache
- Not on a critical path of the pipeline