Design Tradeoffs For Hard and Soft FPGA-based Networks-on-Chip


Page 1: Design Tradeoffs For Hard and Soft FPGA-based Networks-on-Chip

Design Tradeoffs For Hard and Soft FPGA-based Networks-on-Chip

Mohamed ABDELFATTAH, Vaughn BETZ

Page 2: Design Tradeoffs For Hard and Soft FPGA-based Networks-on-Chip

Outline

1. Why NoCs on FPGAs?
2. Hard/soft efficiency gap
3. Integrating hard NoCs with FPGA

Page 3: Design Tradeoffs For Hard and Soft FPGA-based Networks-on-Chip

Outline

1. Why NoCs on FPGAs?
   • Motivation
   • Previous Work
2. Hard/soft efficiency gap
3. Integrating hard NoCs with FPGA

Page 4: Design Tradeoffs For Hard and Soft FPGA-based Networks-on-Chip

1. Why NoCs on FPGAs?

Motivation

[Figure: FPGA fabric, labeled: Interconnect, Logic Blocks, Switch Blocks, Wires]

Page 5: Design Tradeoffs For Hard and Soft FPGA-based Networks-on-Chip

1. Why NoCs on FPGAs?

Motivation

[Figure: FPGA fabric, labeled: Logic Blocks, Switch Blocks, Wires]

Hard Blocks:
• Memory
• Multiplier
• Processor

Page 6: Design Tradeoffs For Hard and Soft FPGA-based Networks-on-Chip

1. Why NoCs on FPGAs?

Motivation

[Figure: FPGA fabric, labeled: Logic Blocks, Switch Blocks, Wires; clock labels: 1600 MHz, 200 MHz, 800 MHz]

Hard Blocks:
• Memory
• Multiplier
• Processor

Hard Interfaces: DDR/PCIe ..

Interconnect still the same

Page 7: Design Tradeoffs For Hard and Soft FPGA-based Networks-on-Chip

1. Why NoCs on FPGAs?

Motivation

[Figure: FPGA with DDR3 PHY and Controller, PCIe Controller, Gigabit Ethernet; clock labels: 1600 MHz, 200 MHz, 800 MHz]

1. Bandwidth requirements for hard logic/interfaces
2. Timing closure

Page 8: Design Tradeoffs For Hard and Soft FPGA-based Networks-on-Chip

1. Why NoCs on FPGAs?

Motivation

[Figure: FPGA with DDR3 PHY and Controller, PCIe Controller, Gigabit Ethernet]

1. Bandwidth requirements for hard logic/interfaces
2. Timing closure
3. High interconnect utilization:
   – Huge CAD Problem
   – Slow compilation
   – Power/area utilization

Page 9: Design Tradeoffs For Hard and Soft FPGA-based Networks-on-Chip

1. Why NoCs on FPGAs?

Motivation

[Figure: FPGA with DDR3 PHY and Controller, PCIe Controller, Gigabit Ethernet]

1. Bandwidth requirements for hard logic/interfaces
2. Timing closure
3. High interconnect utilization:
   – Huge CAD Problem
   – Slow compilation
   – Power/area utilization
4. Wire speed not scaling:
   – Delay is interconnect-dominated

Page 10: Design Tradeoffs For Hard and Soft FPGA-based Networks-on-Chip

1. Why NoCs on FPGAs?

Motivation

[Figure: FPGA with DDR3 PHY and Controller, PCIe Controller, Gigabit Ethernet]

1. Bandwidth requirements for hard logic/interfaces
2. Timing closure
3. High interconnect utilization:
   – Huge CAD Problem
   – Slow compilation
   – Power/area utilization
4. Wire speed not scaling:
   – Delay is interconnect-dominated
5. Low-level interconnect hinders modularity:
   – Parallel compilation
   – Partial reconfiguration
   – Multi-chip interconnect

Page 11: Design Tradeoffs For Hard and Soft FPGA-based Networks-on-Chip

[Figure: Barcelona and Los Angeles, with labels: Hard Blocks, Logic Cluster. Source: Google Earth]

Keep the “roads”, but add “freeways”.

Page 12: Design Tradeoffs For Hard and Soft FPGA-based Networks-on-Chip

1. Why NoCs on FPGAs?

1. Bandwidth requirements for hard logic/interfaces
2. Timing closure
3. High interconnect utilization:
   – Huge CAD Problem
   – Slow compilation
   – Power/area utilization
4. Wire speed not scaling:
   – Delay is interconnect-dominated
5. Low-level interconnect hinders modularity:
   – Parallel compilation
   – Partial reconfiguration
   – Multi-chip interconnect

FPGA with NoC

[Figure: FPGA with DDR3 PHY and Controller, PCIe Controller, Gigabit Ethernet, and a NoC, labeled: NoC, Routers, Links]

• Router forwards data packet
• Router moves data to local interconnect

Page 13: Design Tradeoffs For Hard and Soft FPGA-based Networks-on-Chip

1. Why NoCs on FPGAs?

1. Bandwidth requirements for hard logic/interfaces
2. Timing closure
3. High interconnect utilization:
   – Huge CAD Problem
   – Slow compilation
   – Power/area utilization
4. Wire speed not scaling:
   – Delay is interconnect-dominated
5. Low-level interconnect hinders modularity:
   – Parallel compilation
   – Partial reconfiguration
   – Multi-chip interconnect

FPGA with NoC

[Figure: FPGA with DDR3 PHY and Controller, PCIe Controller, Gigabit Ethernet, and a NoC]

• Pre-design NoC to requirements
• NoC links are “re-usable”
• Latency-tolerant communication
• NoC abstraction favors modularity

High bandwidth endpoints known

Page 14: Design Tradeoffs For Hard and Soft FPGA-based Networks-on-Chip

1. Why NoCs on FPGAs?

1. Bandwidth requirements for hard logic/interfaces
2. Timing closure
3. High interconnect utilization:
   – Huge CAD Problem
   – Slow compilation
   – Power/area utilization
4. Wire speed not scaling:
   – Delay is interconnect-dominated
5. Low-level interconnect hinders modularity:
   – Parallel compilation
   – Partial reconfiguration
   – Multi-chip interconnect

FPGA with NoC

[Figure: FPGA with DDR3 PHY and Controller, PCIe Controller, Gigabit Ethernet, and a NoC]

• Latency-tolerant communication
• NoC abstraction favors modularity

Page 15: Design Tradeoffs For Hard and Soft FPGA-based Networks-on-Chip

1. Why NoCs on FPGAs?

Hard vs. Soft

[Figure: FPGA with DDR3 PHY and Controller, PCIe Controller, Gigabit Ethernet]

Implementation options:
• Soft Logic (LUTs, ..)
• Hard Logic (unchangeable)
• Mixed Soft/Hard

Soft NoC                      Hard NoC
Build as needed out of LUTs   Must build the whole thing
Tailor to application         Must be general enough for any application
Slower, bigger                Faster, smaller
Configurability               Efficiency

Investigate the hard vs. soft tradeoff for NoCs (area/delay)

Page 16: Design Tradeoffs For Hard and Soft FPGA-based Networks-on-Chip

1. Why NoCs on FPGAs?

Previous Work

FPGA-tuned Soft NoCs:
– LiPar (2005), NoCeM (2008), Connect (2012)

Hard NoCs:
– Francis and Moore (2008): Exploring Hard and Soft Networks-on-Chip for FPGAs

Applications that leverage NoCs:
– Chung et al. (2011): CoRAM: An In-Fabric Memory Architecture for FPGA-based Computing

Our Contributions:
1. Quantify area/performance gap of hard and soft NoCs
2. Investigate how this impacts NoC design (hard/soft)
3. Integrate hard NoC with FPGA fabric

Page 17: Design Tradeoffs For Hard and Soft FPGA-based Networks-on-Chip

Outline

1. Why NoCs on FPGAs?
2. Hard/soft efficiency gap
   • NoC Architecture
   • Methodology
   • Soft NoC design
   • Results
   • Area/Speed Efficiency Gap
3. Integrating hard NoCs with FPGA

Page 18: Design Tradeoffs For Hard and Soft FPGA-based Networks-on-Chip

2. Hard/Soft Efficiency

Router Microarchitecture

NoC = Routers + Links

State-of-the-art router architecture from Stanford:
1. Acknowledge that the NoC community has excelled at building a router: we just use it
2. To meet FPGA bandwidth requirements: high-performance router
3. A complex router includes a superset of NoC components that may be used: more complete analysis

Split router into 5 components

Page 19: Design Tradeoffs For Hard and Soft FPGA-based Networks-on-Chip

2. Hard/Soft Efficiency

Router – 5 Components

[Figure: router block diagram with ports 1 to 5 on each side, labeled: Input Modules, Output Modules, Virtual Channel (VC) Allocator, Switch Allocator, Crossbar Switch]

Page 20: Design Tradeoffs For Hard and Soft FPGA-based Networks-on-Chip

2. Hard/Soft Efficiency

Router – 5 Components

[Figure: router block diagram with ports 1 to 5 on each side, labeled: Input Modules, Output Modules, Virtual Channel (VC) Allocator, Switch Allocator, Crossbar Switch]

Input Modules

Multi-Queue Buffer = Memory + Control Logic

• Port Width
• Buffer depth
• Number of VCs

Page 21: Design Tradeoffs For Hard and Soft FPGA-based Networks-on-Chip

2. Hard/Soft Efficiency

Router – 5 Components

[Figure: router block diagram with ports 1 to 5 on each side, labeled: Input Modules, Output Modules, Virtual Channel (VC) Allocator, Switch Allocator, Crossbar Switch]

Crossbar

Multiplexers = Logic + crowded interconnect

• Port Width
• Number of Ports

Page 22: Design Tradeoffs For Hard and Soft FPGA-based Networks-on-Chip

2. Hard/Soft Efficiency

Router – 5 Components

[Figure: router block diagram with ports 1 to 5 on each side, labeled: Input Modules, Output Modules, Virtual Channel (VC) Allocator, Switch Allocator, Crossbar Switch]

Output Modules

Retiming Register = Registers + little control logic

• Port Width
• Number of VCs

Page 23: Design Tradeoffs For Hard and Soft FPGA-based Networks-on-Chip

2. Hard/Soft Efficiency

Router – 5 Components

[Figure: router block diagram with ports 1 to 5 on each side, labeled: Input Modules, Output Modules, Virtual Channel (VC) Allocator, Switch Allocator, Crossbar Switch]

Allocators

Arbiters = Logic + Registers

• Number of Ports
• Number of VCs

Page 24: Design Tradeoffs For Hard and Soft FPGA-based Networks-on-Chip

2. Hard/Soft Efficiency

Design Space

[Figure: router block diagram with ports 1 to 5 on each side, labeled: Input Modules, Output Modules, Virtual Channel (VC) Allocator, Switch Allocator, Crossbar Switch]

5 Components:
• Input Module
• Crossbar
• VC Allocator
• SW Allocator
• Output Module

4 Parameters:
• Port Width
• Number of Ports
• Number of VCs
• Buffer Depth
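
The design space can be written down as a small data structure. The sketch below is only an illustration: the names `RouterConfig`, `COMPONENT_PARAMS`, and `components_affected_by` are hypothetical, and the component-to-parameter mapping simply transcribes the bullets on the per-component slides.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RouterConfig:
    """The four router parameters swept in the design space (names are illustrative)."""
    port_width: int    # bits per port
    num_ports: int     # number of router ports
    num_vcs: int       # virtual channels
    buffer_depth: int  # buffer words


# Which parameters each of the five components depends on,
# transcribed from the per-component slides.
COMPONENT_PARAMS = {
    "input_module": ("port_width", "buffer_depth", "num_vcs"),
    "crossbar": ("port_width", "num_ports"),
    "vc_allocator": ("num_ports", "num_vcs"),
    "sw_allocator": ("num_ports", "num_vcs"),
    "output_module": ("port_width", "num_vcs"),
}


def components_affected_by(param: str) -> List[str]:
    """Return the components whose size depends on the given parameter."""
    return sorted(name for name, params in COMPONENT_PARAMS.items() if param in params)


# The "typical router" quoted later in the talk: 5 ports, 32 bits wide, 2 VCs, 10 buffer words.
typical = RouterConfig(port_width=32, num_ports=5, num_vcs=2, buffer_depth=10)

print(components_affected_by("buffer_depth"))
print(components_affected_by("num_ports"))
```

Under this mapping, buffer depth touches only the input module, while the number of ports touches the crossbar and both allocators.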

Page 25: Design Tradeoffs For Hard and Soft FPGA-based Networks-on-Chip

2. Hard/Soft Efficiency

Methodology

Per router component:
• Post-routing FPGA (soft) area and delay
• Post-synthesis ASIC (hard) area and delay
• Both TSMC 65 nm technology (Stratix III)
• Verify results against previous FPGA:ASIC comparison by Kuon and Rose

Page 26: Design Tradeoffs For Hard and Soft FPGA-based Networks-on-Chip

2. Hard/Soft Efficiency

3 Options for Buffer on FPGA

• Relatively small memories
• Critical component in router design
• 3 options for FPGA:

Option      Capacity
Registers   One per LUT
LUTRAM      640 bits
Block RAM   9 Kbits

Area of each implementation option

Page 27: Design Tradeoffs For Hard and Soft FPGA-based Networks-on-Chip

2. Hard/Soft Efficiency

3 Options for Buffer on FPGA

[Figure: area of each buffer implementation option, width = 32 bits; annotation: "Another logic cluster used"]

Page 28: Design Tradeoffs For Hard and Soft FPGA-based Networks-on-Chip

2. Hard/Soft Efficiency

3 Options for Buffer on FPGA

• Relatively small memories
• 3 options for implementation on FPGA

Option      Capacity      Density
Registers   One per LUT   0.77 Kbit/mm²
LUTRAM      640 bits      23 Kbit/mm²
Block RAM   9 Kbits       142 Kbit/mm²

• 16% utilized BRAM more area efficient than fully used LUTRAM (valid for Stratix III)
• LUTRAM could win for some points in other FPGAs

Soft NoC recommendation: use BRAM for FPGA (soft) implementation
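
The 16% figure follows from the density numbers on this slide: a partly filled BRAM matches a full LUTRAM once its utilization equals the ratio of the two densities. A minimal sketch of that arithmetic, using only the slide's Stratix III values (the function names are illustrative):

```python
# Storage density of each buffer option on Stratix III, in Kbit/mm^2 (from the slide).
DENSITY_KBIT_PER_MM2 = {"registers": 0.77, "lutram": 23.0, "bram": 142.0}


def break_even_utilization(dense: str, sparse: str) -> float:
    """Utilization of the denser option at which it matches the sparser option used fully."""
    return DENSITY_KBIT_PER_MM2[sparse] / DENSITY_KBIT_PER_MM2[dense]


def area_mm2(option: str, kbits: float, utilization: float = 1.0) -> float:
    """Silicon area needed to hold `kbits` when only `utilization` of the memory is used."""
    return kbits / (DENSITY_KBIT_PER_MM2[option] * utilization)


u = break_even_utilization("bram", "lutram")
print(f"BRAM beats a full LUTRAM above {u:.1%} utilization")  # 16.2%

# Example: storing 1 Kbit in a BRAM that is only 20% used vs. a fully used LUTRAM.
print(area_mm2("bram", 1.0, utilization=0.20) < area_mm2("lutram", 1.0))  # True
```

This ignores granularity (a buffer occupies whole memory blocks), so it is a density comparison only, not a full area model.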

Page 29: Design Tradeoffs For Hard and Soft FPGA-based Networks-on-Chip

2. Hard/Soft Efficiency

Results – High Port Count

[Figure: results plot with ratio annotations 24X – 94X and 60X – 170X]

Soft NoC recommendation: high port count inefficient in soft

Page 30: Design Tradeoffs For Hard and Soft FPGA-based Networks-on-Chip

2. Hard/Soft Efficiency

Results – Width

[Figure: results plot with ratio annotations 26X – 17X and 72X]

Soft NoC recommendation: high port count inefficient in soft; width scales better

Page 31: Design Tradeoffs For Hard and Soft FPGA-based Networks-on-Chip

2. Hard/Soft Efficiency

Results – Deep Buffers

[Figure: results plot; annotation: "Filling up the BRAM"]

Soft NoC recommendation: buffer depth is free on FPGAs when using BRAM

Page 32: Design Tradeoffs For Hard and Soft FPGA-based Networks-on-Chip

2. Hard/Soft Efficiency

Soft Router Design

• Design recommendations based on FPGA silicon area
• Supported by delay measurements

Soft NoC recommendations:
• Use BRAM for FPGA (soft) implementation
• High port count inefficient in soft; width scales better
• Buffer depth is free on FPGAs when using BRAM

Page 33: Design Tradeoffs For Hard and Soft FPGA-based Networks-on-Chip

2. Hard/Soft Efficiency

Results – Area

Router Component   Mean Area Ratio   LUT:REG
Input Module       17                --
Crossbar           85                --
VC Allocator       48                8:1
Switch Allocator   56                20:1
Output Module      39                0.6:1
Router             30

[Annotations: Memory; = Logic + Registers]

Page 34: Design Tradeoffs For Hard and Soft FPGA-based Networks-on-Chip

2. Hard/Soft Efficiency

Results – Delay

Router Component   Mean Delay Ratio
Input Module       2.9
Crossbar           4.4
VC Allocator       3.9
Switch Allocator   3.3
Output Module      3.4
Router             3.6

Page 35: Design Tradeoffs For Hard and Soft FPGA-based Networks-on-Chip

Outline

1. Why NoCs on FPGAs?
2. Hard/soft efficiency gap
3. Integrating hard NoCs with FPGA
   • Hard NoC + FPGA Wiring
   • Conclusion
   • Future Work

Page 36: Design Tradeoffs For Hard and Soft FPGA-based Networks-on-Chip

3. Hard NoC with FPGA

What to harden?

Router Component   Area Ratio   Delay Ratio
Input Module       17           2.9
Crossbar           85           4.4
VC Allocator       48           3.9
SW Allocator       56           3.3
Output Module      39           3.4
Router             30           3.6

[Annotations: 50% Total Area; 40%; 10%; Critical Path]

• Results suggest hardening Crossbar and Allocators
• Mixed hard/soft implementation

Page 37: Design Tradeoffs For Hard and Soft FPGA-based Networks-on-Chip

3. Hard NoC with FPGA

Mixed Implementation

[Figure: three router block diagrams (Soft, Hard, Mixed), each with ports 1 to 5 on each side, labeled: Input Modules, Output Modules, Virtual Channel (VC) Allocator, Switch Allocator, Crossbar Switch]

For a typical router:
• 5 ports
• 32 bits wide
• 2 VCs
• 10 buffer words

        Soft           Hard             Mixed
Area    4.1 mm² (1X)   0.14 mm² (30X)   2.3 mm² (1.8X)
Speed   150 MHz (1X)   810 MHz (5X)     390 MHz (2.5X)

Mixed not worth hardening

• How to connect hard and soft?
• How efficient is mixed/hard after doing that?
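
The "X" factors for the typical router can be approximately recomputed from the quoted areas and clock speeds. The sketch below does that; it is not the authors' calculation, and because the slide values are rounded, the recomputed speed factors (about 5.4X and 2.6X) land close to, but not exactly on, the quoted 5X and 2.5X.

```python
# Area (mm^2) and clock speed (MHz) for the typical router, as quoted on the slide.
SOFT = {"area_mm2": 4.1, "fmax_mhz": 150.0}
HARD = {"area_mm2": 0.14, "fmax_mhz": 810.0}
MIXED = {"area_mm2": 2.3, "fmax_mhz": 390.0}


def gap_vs_soft(impl: dict) -> tuple:
    """Return (area improvement, speed improvement) of `impl` relative to the soft router."""
    area_gain = SOFT["area_mm2"] / impl["area_mm2"]
    speed_gain = impl["fmax_mhz"] / SOFT["fmax_mhz"]
    return area_gain, speed_gain


hard_area, hard_speed = gap_vs_soft(HARD)
mixed_area, mixed_speed = gap_vs_soft(MIXED)

print(f"hard:  {hard_area:.1f}X smaller, {hard_speed:.1f}X faster")
print(f"mixed: {mixed_area:.1f}X smaller, {mixed_speed:.1f}X faster")
```

The mixed router recovers well under a tenth of the fully hard router's area gain, which is the basis for "Mixed not worth hardening".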

Page 38: Design Tradeoffs For Hard and Soft FPGA-based Networks-on-Chip

3. Hard NoC with FPGA

Integrating a Hard Router

[Figure: hard router embedded among logic clusters, labeled: Router Logic, Programmable Interconnect, Router, Logic clusters]

• Same I/O mux structure as a logic block – 9X the area
• Conventional FPGA interconnect between routers

Page 39: Design Tradeoffs For Hard and Soft FPGA-based Networks-on-Chip

3. Hard NoC with FPGA

Integrating a Hard Router

[Figure: hard routers on the FPGA, labeled: Router Logic, Programmable Interconnect, FPGA, Router; annotations: 730 MHz; "19th of FPGA vertically (2.5 mm)"]

• Same I/O mux structure as a logic block – 9X the area
• Conventional FPGA interconnect between routers

Page 40: Design Tradeoffs For Hard and Soft FPGA-based Networks-on-Chip

3. Hard NoC with FPGA

Integrating a Hard Router

[Figure: hard routers on the FPGA, labeled: Router Logic, Programmable Interconnect, Router, FPGA]

• Assumed a mesh
• Can form any topology

Page 41: Design Tradeoffs For Hard and Soft FPGA-based Networks-on-Chip

3. Hard NoC with FPGA

Integrating a Hard Router

[Figure: hard router, labeled: Router Logic, Programmable Interconnect, Router]

        Soft           Hard             Hard (+ interconnect)
Area    4.1 mm² (1X)   0.14 mm² (30X)   0.18 mm² = 9 LABs (22X)
Speed   150 MHz (1X)   810 MHz (5X)     730 MHz (4.7X)

64-node NoC on Stratix V

        Soft           Hard (+ interconnect)
Area    ~12,500 LABs   576 LABs
%LABs   33 %           1.6 %
%FPGA   12 %           0.6 %

Hard NoC + Soft Interconnect is very compelling

• Provides 47 GB/s peak bisection bandwidth
• Very cheap! Less than cost of 3 soft nodes
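
The 64-node numbers can be cross-checked with a few lines of arithmetic. The LAB counts come straight from the slide. The bisection-bandwidth part rests on assumptions the slide does not state: an 8×8 mesh, 32-bit links (the typical router width), one flit per cycle at 730 MHz, and links counted in both directions across the cut. Under those assumptions the result lands on the quoted 47 GB/s.

```python
NUM_ROUTERS = 64
LABS_PER_HARD_ROUTER = 9   # hard router + interconnect = 9 LABs (from the slide)
SOFT_NOC_LABS = 12_500     # approximate soft 64-node NoC area (from the slide)

# Area: 64 hard routers vs. the equivalent number of soft routers.
hard_noc_labs = NUM_ROUTERS * LABS_PER_HARD_ROUTER
soft_labs_per_router = SOFT_NOC_LABS / NUM_ROUTERS
equivalent_soft_routers = hard_noc_labs / soft_labs_per_router

# Bisection bandwidth. ASSUMPTIONS (not stated on the slide):
# 8x8 mesh, 32-bit links, 730 MHz, both directions counted across the cut.
MESH_DIM = 8
LINK_WIDTH_BITS = 32
FREQ_HZ = 730e6
links_across_cut = 2 * MESH_DIM
bisection_bytes_per_s = links_across_cut * (LINK_WIDTH_BITS / 8) * FREQ_HZ

print(f"hard NoC area: {hard_noc_labs} LABs")
print(f"equivalent soft routers: {equivalent_soft_routers:.2f}")
print(f"peak bisection bandwidth: {bisection_bytes_per_s / 1e9:.1f} GB/s")
```

The whole hard NoC costs about as much area as 2.95 soft routers, consistent with "less than cost of 3 soft nodes".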

Page 42: Design Tradeoffs For Hard and Soft FPGA-based Networks-on-Chip

1. Why NoCs on FPGAs?
• Big city needs freeways to handle traffic
• Solve communication problems for a large/heterogeneous FPGA:
  – Timing Closure
  – Interconnect Scaling
  – Modular Design

2. Hard/soft efficiency gap
• A hard NoC is on average 30X smaller and 3.6X faster than soft
  – Crossbars and allocators worst
  – Input buffer best
• An efficient soft NoC:
  – Uses BRAMs
  – Large width, low port count
  – Deep buffers

3. Integrating hard NoCs with FPGA
• Mixed implementation does not make sense
• Integrated fully hard NoC with FPGA fabric (for NoC links)
  – 22X area improvement over soft
  – Reaches max. FPGA frequency (4.7X faster than soft)
  – 64-node NoC = 0.6% of total FPGA area (Stratix V)

Page 43: Design Tradeoffs For Hard and Soft FPGA-based Networks-on-Chip

3. Hard NoC with FPGA

Future Work

• Power analysis
• More hardening:
  – Dedicated inter-router links (hard wires)
  – Clock domain crossing hardware
• How do traffic hotspots (DDR/PCIe) influence NoC design?
• Latency insensitive design methodology that uses NoC
• CAD tool changes for a NoC-based FPGA

Page 44: Design Tradeoffs For Hard and Soft FPGA-based Networks-on-Chip

Thank You!
