NetThreads: Programming NetFPGA with Threaded Software

Martin Labrecque, Gregory Steffan (ECE Dept.)

Geoff Salmon, Monia Ghobadi, Yashar Ganjali (CS Dept.)

University of Toronto

Real-Life Customers

● Hardware:
– NetFPGA board, 4 GigE ports, Virtex 2 Pro FPGA
● Collaboration with CS researchers
– Interested in performing network experiments
– Not in coding Verilog
– Want to use GigE link at maximum capacity

Requirements:
– Easy-to-program system
– Efficient system

What would the ideal solution look like?

Envisioned System (Someday)

● Many compute engines
● Delivers the expected performance
● Hardware handles communication and synchronization

[Figure: an array of processors and hardware accelerators; labels: control-flow parallelism, data-level parallelism]

Processors inside an FPGA?

Soft Processors in FPGAs

Soft processors: processors in the FPGA fabric

FPGAs increasingly implement SoCs with CPUs

Commercial soft processors: NIOS-II and Microblaze

• Easier to program than HDL
• Customizable

[Figure: an FPGA containing a soft processor datapath (PC, instruction memory, register array, ALU, data memory) alongside DDR controllers and Ethernet MACs]

What is the performance requirement?

Performance In Packet Processing

● The application defines the throughput required
– Home networking (~100 Mbps/link)
– Edge routing (≥ 1 Gbps/link)
– Scientific instruments (< 100 Mbps/link)
● Our measure of throughput:
– Bisection search of the minimum packet inter-arrival time
– Must not drop any packet
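A minimal sketch of how such a bisection search could look in C. The slides do not show the measurement harness, so `drop_test_fn`, `min_lossless_gap_ns`, and the toy predicate `toy_drops` are hypothetical names, and the sketch assumes that drops are monotonic: if a given inter-arrival gap causes no drops, any larger gap causes none either.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical predicate: runs one experiment with packets spaced
 * gap_ns apart and reports whether any packet was dropped. */
typedef bool (*drop_test_fn)(uint64_t gap_ns, void *ctx);

/* Returns the smallest gap in [lo_ns, hi_ns] that causes no drops.
 * Assumes drops are monotonic in the gap and that hi_ns is lossless. */
uint64_t min_lossless_gap_ns(uint64_t lo_ns, uint64_t hi_ns,
                             drop_test_fn drops, void *ctx)
{
    while (lo_ns < hi_ns) {
        uint64_t mid = lo_ns + (hi_ns - lo_ns) / 2;
        if (drops(mid, ctx)) {
            lo_ns = mid + 1;  /* too fast: a larger gap is needed */
        } else {
            hi_ns = mid;      /* sustainable: try a smaller gap */
        }
    }
    return lo_ns;
}

/* Toy stand-in for a real measurement: pretends the system needs at
 * least *ctx nanoseconds to process each packet. */
bool toy_drops(uint64_t gap_ns, void *ctx)
{
    uint64_t service_ns = *(const uint64_t *)ctx;
    return gap_ns < service_ns;
}
```

In a real experiment the predicate would replay a packet trace at the given spacing and compare sent and received counts; the reported throughput is then derived from the returned gap.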

Are soft processors fast enough?

Realistic Goals

● 10^9 bps stream with normal inter-frame gap of 12 bytes

● 2 processors running at 125 MHz

● Cycle budget:

– 152 cycles for minimally-sized 64B packets;

– 3060 cycles for maximally-sized 1518B packets
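The two budgets appear to follow from simple arithmetic: at 10^9 bps one byte takes 8 ns on the wire, which is exactly one cycle at 125 MHz, so each processor gets one cycle per byte of frame plus inter-frame gap, and two processors double that. The sketch below reproduces the numbers under that reading; the function name `cycle_budget` is hypothetical, and the calculation assumes the Ethernet preamble is not counted, since that is what matches the figures on the slide.

```c
#include <stdint.h>

/* Cycles available per packet, summed over all processors.
 * frame_bytes: Ethernet frame size; gap_bytes: inter-frame gap.
 * Assumption: the preamble is not part of the budget. */
uint64_t cycle_budget(uint64_t frame_bytes, uint64_t gap_bytes,
                      uint64_t link_bps, uint64_t clock_hz,
                      uint64_t num_procs)
{
    uint64_t bits_on_wire = (frame_bytes + gap_bytes) * 8;
    /* cycles = (bits / link_bps) * clock_hz * num_procs;
     * multiply before dividing to keep the integer result exact. */
    return bits_on_wire * clock_hz * num_procs / link_bps;
}
```

With a 12-byte gap, a 10^9 bps link, a 125 MHz clock, and 2 processors, `cycle_budget` gives 152 cycles for a 64-byte frame and 3060 cycles for a 1518-byte frame.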

Soft processors: non-trivial processing at line rate!

How can they efficiently be organized?

Key Design Features

Efficient Network Processing

● Memory system with specialized memories
● Multithreaded soft processor
● Multiple processors support

Multiprocessor System Diagram

[Figure: two 4-thread processors, each with its own instruction cache (I$), connected to a synchronization unit, a data cache, an input buffer (packet input), and an output buffer (packet output); instruction, data, input, and output memories; off-chip DDR]

- Overcomes the 2-port limitation of block RAMs
- Shared data cache is not the main bottleneck in our experiments

Performance of Single-Threaded Processors

● Single-issue, in-order pipeline
● Should commit 1 instruction every cycle, but:
– stalls on instruction dependences
– stalls on memory, I/O, and accelerator accesses

● Throughput depends on sequential execution:

– packet processing

– device control

– event monitoring

Solution to Avoid Stalls: Multithreading

many concurrent threads

Avoiding Processor Stall Cycles

[Figure, before: single-threaded traditional execution in a 5-stage pipeline over time, with slots lost to data or control hazards]

Multithreading: execute streams of independent instructions

[Figure, after: instructions from Thread1, Thread2, Thread3, and Thread4 interleaved in the 5-stage pipeline over time]

- Ideally, eliminates all stalls
- 4 threads eliminate hazards in 5-stage pipeline
- 5-stage pipeline is 77% more area efficient [FPL’07]

Multithreading Evaluation

Infrastructure

• Compilation:
– modified versions of GCC 4.0.2 and Binutils 2.16 for the MIPS-I ISA
• Timing:
– no free PLL: processors run at the speed of the Ethernet MACs, 125 MHz
• Platform:
– 2 processors, 4 MAC + 1 DMA ports, 64 Mbytes 200 MHz DDR2 SDRAM
– Virtex II Pro 50 (speed grade 7ns)
– 16KB private instruction caches and shared data write-back cache
– Capacity would be increased on a more modern FPGA
• Validation:
– Reference trace from MIPS simulator
– Modelsim and online instruction trace collection

- PC server can send ~0.7 Gbps of maximally sized packets
- Simple packet echo application can keep up
- Complex applications are the bottleneck, not the architecture

Our benchmarks

Benchmark    Description                                Dynamic instructions    Variance of instructions
                                                        per packet (x1000)      per packet (x1000)
UDHCP        DHCP server                                35                      36
Classifier   Regular expression + QOS                   13                      35
NAT          Network Address Translation + Accounting   6                       7

Realistic non-trivial applications, dominated by control flow

What is limiting performance?

[Figure: number of active threads (0 to 9) versus time (0 to 7,000,000 cycles); annotations: "Packet Backlog due to Synchronization", "Serializing Tasks"]

Let’s focus on the underlying problem: Synchronization

Addressing Synchronization Overhead

Real Threads Synchronize

• All threads execute the same code

• Concurrent threads may access shared data

• Critical sections ensure correctness

Lock();

shared_var = f();

Unlock();

Impact on round-robin scheduled threads?


Multithreaded processor with Synchronization

[Figure: 5-stage pipeline over time with interleaved threads; one thread acquires a lock and later releases it]

Synchronization Wrecks Round-Robin Multithreading

[Figure: 5-stage pipeline over time under round-robin scheduling, between "Acquire lock" and "Release lock"]

With round-robin thread scheduling and contention on locks:
- < 4 threads execute concurrently
- > 18% of cycles are wasted while blocked on synchronization

Better Handling of Synchronization

[Figure, before: 5-stage pipeline over time under round-robin scheduling with lock contention]

[Figure, after: 5-stage pipeline over time with Thread3 and Thread4 descheduled (label: "DESCHEDULE Thread3 Thread4")]

Thread scheduler

• Suspend any thread waiting for a lock
• Round-robin among the remaining threads
• Unlock operation resumes threads across processors

- Multithreaded processor hides hazards across active threads
- Fewer than N threads requires hazard detection

But, hazard detection was on critical path of single threaded processor

Is there a low cost solution?
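The scheduling policy described on this slide can be sketched in software as follows. This is only an illustrative model, not the hardware implementation: the `struct sched` state and the `sched_*` function names are hypothetical, the model covers a single processor with a single lock, and it resumes every waiter on unlock so that they retry the lock.

```c
#include <stdint.h>

#define NUM_THREADS 4

/* Hypothetical scheduler state for one processor and one lock. */
struct sched {
    uint8_t waiting;     /* bit i set: thread i is suspended on the lock */
    int     lock_owner;  /* thread holding the lock, or -1 if free */
    int     last;        /* thread that issued most recently */
};

void sched_init(struct sched *s)
{
    s->waiting = 0;
    s->lock_owner = -1;
    s->last = NUM_THREADS - 1;  /* so that thread 0 issues first */
}

/* Round-robin among threads that are not suspended.
 * Returns the chosen thread, or -1 if every thread is suspended. */
int sched_next(struct sched *s)
{
    for (int i = 1; i <= NUM_THREADS; i++) {
        int t = (s->last + i) % NUM_THREADS;
        if (!(s->waiting & (1u << t))) {
            s->last = t;
            return t;
        }
    }
    return -1;
}

/* Returns 1 if thread t acquired the lock, 0 if it was suspended. */
int sched_lock(struct sched *s, int t)
{
    if (s->lock_owner < 0) {
        s->lock_owner = t;
        return 1;
    }
    s->waiting |= (uint8_t)(1u << t);
    return 0;
}

/* Releases the lock and resumes all waiters so they can retry. */
void sched_unlock(struct sched *s)
{
    s->lock_owner = -1;
    s->waiting = 0;
}
```

While a thread is suspended, `sched_next` cycles through fewer than `NUM_THREADS` threads, which is the situation in which instructions of the same thread can end up close enough in the pipeline for hazards to matter.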

Static Hazard Detection

• Hazards can be determined at compile time

Instruction      Hazard distance (maximum 2)   Min. issue cycle
addi r1,r1,r4    0                             0
addi r2,r2,r5    1                             1
or r1,r1,r8      0                             3
or r2,r2,r9      0                             4

- Hazard distances are encoded as part of the instructions

Static hazard detection allows scheduling without an extra pipeline stage

Very low area overhead (5%), no frequency penalty
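One way to read the table: the hazard distance stored with an instruction is the number of bubbles needed before the next instruction of the same thread may issue. The sketch below computes both columns under an assumption inferred from the table and not stated on the slide, namely that a consumer must issue at least 3 cycles after its producer (hence a maximum distance of 2). It only tracks register dependences; the `struct insn` representation and the function name are hypothetical.

```c
#define PRODUCER_GAP 3  /* assumed: consumer issues >= 3 cycles after producer */

struct insn {
    int dest, src1, src2;  /* register numbers; -1 if unused */
};

/* Fills issue[i] with the earliest issue cycle of instruction i and
 * dist[i] with the bubbles required between instructions i and i+1. */
void hazard_distances(const struct insn *code, int n, int *issue, int *dist)
{
    for (int i = 0; i < n; i++) {
        int earliest = (i == 0) ? 0 : issue[i - 1] + 1;
        /* Only the previous PRODUCER_GAP - 1 instructions can still
         * be close enough to force a delay. */
        for (int back = 1; back < PRODUCER_GAP && back <= i; back++) {
            const struct insn *p = &code[i - back];
            if (p->dest >= 0 &&
                (p->dest == code[i].src1 || p->dest == code[i].src2)) {
                int ready = issue[i - back] + PRODUCER_GAP;
                if (ready > earliest)
                    earliest = ready;
            }
        }
        issue[i] = earliest;
        if (i > 0)
            dist[i - 1] = issue[i] - issue[i - 1] - 1;
    }
    if (n > 0)
        dist[n - 1] = 0;  /* nothing follows the last instruction */
}
```

Encoding the four instructions of the table as `{1,1,4}`, `{2,2,5}`, `{1,1,8}`, `{2,2,9}` yields issue cycles 0, 1, 3, 4 and hazard distances 0, 1, 0, 0, matching the slide.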

Thread Scheduler Evaluation

Results on 3 benchmark applications

[Figure: throughput in packets per second (0 to 5000) for UDHCP, NAT, and Classifier under Round-Robin 1p, Round-Robin 2p, Scheduler 1p, and Scheduler 2p]

- Thread scheduling improves throughput by 63%, 31%, and 41%
- Why isn’t the 2nd processor always improving throughput?

Cycle Breakdown in Simulation

[Figure: cycles per instruction (0 to 12) for UDHCP, Classifier, and NAT, comparing RR and S1, broken down into Busy, No Packet, Locked, and Other]

- Removed cycles stalled waiting for a lock
- What is the bottleneck?

Impact of Allowing Packet Drops

[Figure: throughput (packets/sec, 0 to 25000) versus allowed percentage of packet drops (0 to 5) for 1 processor and 2 processors]

- System still under-utilized
- Throughput still dominated by serialization

Future Work

• Adding custom hardware accelerators
– Same interconnect as processors
– Same synchronization interface
• Evaluate speculative threading
– Alleviate need for fine-grained synchronization
– Reduce conservative synchronization overhead

Conclusions

• Efficient multithreaded design
– Parallel threads hide stalls on one thread
– Thread scheduler mitigates synchronization costs
• System features
– System is easy to program in C
– Performance from parallelism is easy to get

On the lookout for relevant applications suitable for benchmarking

NetThreads available with compiler at: http://netfpga.org/netfpgawiki/index.php/Projects:NetThreads

Backup

Software Network Processing

• Not meant for:
– Straightforward tasks accomplished at line speed in hardware
– E.g. basic switching and routing
• Advantages compared to hardware
– Complex applications are best described in high-level software
– Easier to design and fast time-to-market
– Can interface with custom accelerators, controllers
– Can be easily updated
• Our focus: stateful applications
– Data structures modified by most packets
– Difficult to pipeline the code into balanced stages

Run-to-Completion/Pool-of-Threads model for parallelism:
− Each thread processes a packet from beginning to end
− No thread-specific behavior

Impact of allowing packet drops

[Figure: NAT benchmark; normalized throughput (packets/sec, 0 to 7) versus allowed percentage of packet drops (0 to 5) for Round-Robin 1p, Round-Robin 2p, Sched 1p, and Sched 2p]

Cycle Breakdown in Simulation

[Figure: cycles per instruction (0 to 10) for UDHCP, Classifier, and NAT, comparing RR and S1, broken down into Busy, No Packet, Locked, Squashed, Bubble, Hazard, and Other]

- Removed cycles stalled waiting for a lock
- Throughput still dominated by serialization

More Sophisticated Thread Scheduling

• Add pipeline stage to pick hazard-free instruction
• Result:
– Increased instruction latency
– Increased hazard window
– Increased branch mis-prediction cost

[Figure: pipeline with stages Fetch, Thread Selection, Register Read, Execute, Memory, and Writeback, with a MUX]

Add hazard detection without an extra pipeline stage?

Implementation

• Where to store the hazard distance bits?
– Block RAMs are multiples of 9 bits wide
– 36-bit word leaves 4 bits available
• Also encode lock and unlock flags

[Figure: 36-bit word = 4 bits (lock/unlock + hazard distance) + 32-bit instruction; the instruction caches (I$) of the two 4-thread processors are 36 bits wide, off-chip DDR is 32 bits wide]
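To make the 36-bit word concrete, here is a minimal packing sketch. The slide only states that the 4 extra bits carry the lock/unlock flags and the hazard distance; the exact bit positions below, the `struct tagged_insn` type, and the `pack36`/`unpack36` names are assumptions chosen for illustration. The sketch does not cover the compaction to 32 bits discussed afterwards.

```c
#include <stdint.h>

/* Hypothetical layout of the 36-bit instruction-cache word:
 *   bits 31..0   original 32-bit instruction
 *   bits 33..32  hazard distance (0 to 2)
 *   bit  34      lock flag
 *   bit  35      unlock flag */
struct tagged_insn {
    uint32_t insn;
    unsigned hazard_dist;
    unsigned lock;
    unsigned unlock;
};

/* Packs the fields into the low 36 bits of a 64-bit integer. */
uint64_t pack36(struct tagged_insn t)
{
    return (uint64_t)t.insn
         | ((uint64_t)(t.hazard_dist & 0x3u) << 32)
         | ((uint64_t)(t.lock & 0x1u) << 34)
         | ((uint64_t)(t.unlock & 0x1u) << 35);
}

/* Recovers the fields from a 36-bit word. */
struct tagged_insn unpack36(uint64_t word)
{
    struct tagged_insn t;
    t.insn        = (uint32_t)(word & 0xFFFFFFFFu);
    t.hazard_dist = (unsigned)((word >> 32) & 0x3u);
    t.lock        = (unsigned)((word >> 34) & 0x1u);
    t.unlock      = (unsigned)((word >> 35) & 0x1u);
    return t;
}
```

Two bits suffice for the hazard distance because its maximum is 2, which leaves one bit each for the lock and unlock flags.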

How to convert instructions from 36 bits to 32 bits?

Instruction Compaction 36 → 32 bits

R-Type Instructions
opcode (6) | rs (5) | rt (5) | rd (5) | sa (5) | function (6)
Example: add rd, rs, rt

I-Type Instructions
opcode (6) | rs (5) | rt (5) | immediate (16)
Example: addi rt, rs, immediate

J-Type Instructions
opcode (6) | target (26)
Example: j label

- De-compaction: 2 block RAMs + some logic between DDR and cache
- Not a critical path of the pipeline
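For reference, the three MIPS formats can be decoded with plain shifts and masks. The sketch below extracts every field from a 32-bit word using the standard MIPS bit positions; the `struct mips_fields` type and the `mips_decode` name are hypothetical, and which fields are meaningful depends on the instruction type. It illustrates the formats only, not the compaction scheme itself, which the slides do not detail.

```c
#include <stdint.h>

struct mips_fields {
    unsigned opcode;     /* bits 31..26, all formats */
    unsigned rs;         /* bits 25..21, R- and I-type */
    unsigned rt;         /* bits 20..16, R- and I-type */
    unsigned rd;         /* bits 15..11, R-type */
    unsigned sa;         /* bits 10..6,  R-type */
    unsigned function;   /* bits 5..0,   R-type */
    unsigned immediate;  /* bits 15..0,  I-type */
    unsigned target;     /* bits 25..0,  J-type */
};

/* Extracts all fields; the caller picks the ones that apply. */
struct mips_fields mips_decode(uint32_t w)
{
    struct mips_fields f;
    f.opcode    = (w >> 26) & 0x3Fu;
    f.rs        = (w >> 21) & 0x1Fu;
    f.rt        = (w >> 16) & 0x1Fu;
    f.rd        = (w >> 11) & 0x1Fu;
    f.sa        = (w >> 6)  & 0x1Fu;
    f.function  = w & 0x3Fu;
    f.immediate = w & 0xFFFFu;
    f.target    = w & 0x03FFFFFFu;
    return f;
}
```

For example, the word 0x012A4020 decodes as an R-type `add` with rs = 9, rt = 10, rd = 8, and function = 0x20, while 0x21280005 decodes as an I-type `addi` with rs = 9, rt = 8, and immediate = 5.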