CS250B: Modern Computer Systemsswjun/courses/2020S-CS250B... · o Supports FPGA programming,...

35
CS250B: Modern Computer Systems What Are FPGAs And Why Should You Care Sang-Woo Jun

Transcript of CS250B: Modern Computer Systemsswjun/courses/2020S-CS250B... · o Supports FPGA programming,...

Page 1: CS250B: Modern Computer Systemsswjun/courses/2020S-CS250B... · o Supports FPGA programming, limited debugging access, etc o Kind of slow… o Bitfile often stored in on-board flash

CS250B: Modern Computer Systems

What Are FPGAsAnd Why Should You Care

Sang-Woo Jun

Page 2: CS250B: Modern Computer Systemsswjun/courses/2020S-CS250B... · o Supports FPGA programming, limited debugging access, etc o Kind of slow… o Bitfile often stored in on-board flash

What Are FPGAs

Field-Programmable Gate Array

Can be configured to act like any circuit – More later!

Can do many things, but we focus on computation acceleration

Page 3: CS250B: Modern Computer Systemsswjun/courses/2020S-CS250B... · o Supports FPGA programming, limited debugging access, etc o Kind of slow… o Bitfile often stored in on-board flash

FPGAs Come In Many Forms

PCIe-Attached

CPU Integrated In-Network

In-Storage

Page 4: CS250B: Modern Computer Systemsswjun/courses/2020S-CS250B... · o Supports FPGA programming, limited debugging access, etc o Kind of slow… o Bitfile often stored in on-board flash

How Is It Different From CPU/GPUs

GPU – The other major accelerator

CPU/GPU hardware is fixedo “General purpose”

o we write programs (sequence of instructions) for them

FPGA hardware is not fixedo “Special purpose”

o Hardware can be whatever we want

o Will our hardware require/support software? Maybe!

Optimized hardware is very efficiento GPU-level performance**

o 10x power efficiency (300 W vs 30 W)

Page 5: CS250B: Modern Computer Systemsswjun/courses/2020S-CS250B... · o Supports FPGA programming, limited debugging access, etc o Kind of slow… o Bitfile often stored in on-board flash

Analogy

“The Z-Berry”“Experimental Investigations on Radiation Characteristics of IC Chips”benryves.com “Z80 Computer”

CPU/GPU comes with fixed circuits FPGA gives you a big bag of components

To build whatever

Shadi Soundation: Homebrew 4 bit CPU

Could be a CPU/GPU!

Page 6: CS250B: Modern Computer Systemsswjun/courses/2020S-CS250B... · o Supports FPGA programming, limited debugging access, etc o Kind of slow… o Bitfile often stored in on-board flash

Fine-Grained Parallelism of Special-Purpose Circuits

Example -- Calculating gravitational force: 𝐺×𝑚1×𝑚2

(𝑥1−𝑥2)2+(𝑦1−𝑦2)

2

8 instructions on a CPU → 8 cycles**

Much fewer cycles on a special purpose circuit

A = G × m1

B = A × m2

C = x1 - x2

D = C2

E = y1 - y2

F = E2

G = D + F

Ret = B / G

A = G × m1 × m2 B = (x1 - x2)2 C = (y1 - y2)2

D = B + C

Ret = B / G

4 cycles with basic operations

3 cycles with compound operations

Ret = (G × m1 × m2) / ((x1 - x2)2 + (y1 - y2)2)

1 cycle with even further compound operations

May slow down clock

Page 7: CS250B: Modern Computer Systemsswjun/courses/2020S-CS250B... · o Supports FPGA programming, limited debugging access, etc o Kind of slow… o Bitfile often stored in on-board flash

Also, Remember:

Page 8: CS250B: Modern Computer Systemsswjun/courses/2020S-CS250B... · o Supports FPGA programming, limited debugging access, etc o Kind of slow… o Bitfile often stored in on-board flash

Coarse-Grained Parallelism ofSpecial-Purpose Circuits

Typical unit of parallelism for general-purpose units are threads ~= cores

Special-purpose processing units can also be replicated for parallelismo Large, complex processing units: Few can fit in chip

o Small, simple processing units: Many can fit in chip

Only generates hardware useful for the applicationo Instruction? Decoding? Cache? Coherence?

Page 9: CS250B: Modern Computer Systemsswjun/courses/2020S-CS250B... · o Supports FPGA programming, limited debugging access, etc o Kind of slow… o Bitfile often stored in on-board flash

Hurdles of FPGA Programming

Very close to actually designing a chip, with all associated difficultieso Signal propagation delays, clock speeds

o Circuit design with acceptable propagation delay

o Explicit pipeline design for performance

o On-chip resource management (Number of available transistors, SRAM, …)

High-level tools exist, but not yet very effectiveo Often trade-off between ease of programming and performance/utilization

Page 10: CS250B: Modern Computer Systemsswjun/courses/2020S-CS250B... · o Supports FPGA programming, limited debugging access, etc o Kind of slow… o Bitfile often stored in on-board flash

How Is It Different From ASICs

ASIC (Application-Specific Integrated Circuit)o Special chip purpose-built for an application

o E.g., ASIC bitcoin miner, Intel neural network accelerator

o Function cannot be changed once expensively built

+ FPGAs can be field-programmedo Function can be changed completely whenever

o FPGA fabric emulates custom circuits

- Emulated circuits are not as efficient as bare-metalo ~10x performance (larger circuits, faster clock)

o ~10x power efficiency

Page 11: CS250B: Modern Computer Systemsswjun/courses/2020S-CS250B... · o Supports FPGA programming, limited debugging access, etc o Kind of slow… o Bitfile often stored in on-board flash

Prominent Technology For Performance Scaling

Efficient use of increasing transistor budgeto Platform for specialized architectures

Reconfigurable based on application/algorithm changeso Kind of like software running on a hardware processor

o Single FPGA hardware can be shared between different applications

Common operations often embedded using ASIC blockso High-performance, low-power, small size

o Arithmetic, shifters, neural network operations, high-speed communications, etc

Page 12: CS250B: Modern Computer Systemsswjun/courses/2020S-CS250B... · o Supports FPGA programming, limited debugging access, etc o Kind of slow… o Bitfile often stored in on-board flash

FPGA Architecture Goals

Reconfigure a physical chip to act like a different circuit

Transistors within an integrated circuit are fixed during fabrication!o How can FPGAs change themselves?

Shadi Soundation: Homebrew 4 bit CPU

Page 13: CS250B: Modern Computer Systemsswjun/courses/2020S-CS250B... · o Supports FPGA programming, limited debugging access, etc o Kind of slow… o Bitfile often stored in on-board flash

Basic FPGA Architecture“Configurable logic block (CLB)”

Programmable interconnect

I/O block 6-InputLook-Up

Table

FF

Latch

Programmable

Input 1 Input 2 Output

0 0 0

0 1 0

1 0 0

1 1 1

Ex) 2-LUT for “AND”

~

Stores state forsequential circuit

construction

Page 14: CS250B: Modern Computer Systemsswjun/courses/2020S-CS250B... · o Supports FPGA programming, limited debugging access, etc o Kind of slow… o Bitfile often stored in on-board flash

Basic FPGA Architecture – DSP Blocks

CLBs act as gates – Many needed to implement high-level logic

Arithmetic operation provided as efficient ALU blockso “Digital Signal Processing (DSP) blocks”

o Each block provides an adder + multiplier

“DSP block”

× +/-

Page 15: CS250B: Modern Computer Systemsswjun/courses/2020S-CS250B... · o Supports FPGA programming, limited debugging access, etc o Kind of slow… o Bitfile often stored in on-board flash

Basic FPGA Architecture – Block RAM

CLB can act as flip-flopso (~1 bit/block) – tiny!

Some on-chip SRAM provided as blockso ~18/36 Kbit/block, MBs per chip

o Massively parallel access to data → multi-TB/s bandwidth

“Block RAM”

Page 16: CS250B: Modern Computer Systemsswjun/courses/2020S-CS250B... · o Supports FPGA programming, limited debugging access, etc o Kind of slow… o Bitfile often stored in on-board flash

Basic FPGA Architecture – Hard Cores

Some functions are provided as efficient, non-configurable “hard cores”o Multi-core ARM cores (“Zynq” series)

o Multi-Gigabit Transceivers

o PCIe/Ethernet PHY

o Memory controllers

o …

ARM PCIe

Ethernet

Memory

Page 17: CS250B: Modern Computer Systemsswjun/courses/2020S-CS250B... · o Supports FPGA programming, limited debugging access, etc o Kind of slow… o Bitfile often stored in on-board flash

Example Accelerator Card Architecture

PCIe

FPGA

DRAM

DRAM

1GbE

FMC

40GbE

“FPGA Mezzanine Card” Expansiono Network Ports, Memory, Storage, PCIe, …

General-Purpose I/O Pins Multi-Gigabit Transceivers

Page 18: CS250B: Modern Computer Systemsswjun/courses/2020S-CS250B... · o Supports FPGA programming, limited debugging access, etc o Kind of slow… o Bitfile often stored in on-board flash

Example Development Board (Xilinx VCU108)

Page 19: CS250B: Modern Computer Systemsswjun/courses/2020S-CS250B... · o Supports FPGA programming, limited debugging access, etc o Kind of slow… o Bitfile often stored in on-board flash

Example Development Board(Intel Stratix 10)

Page 20: CS250B: Modern Computer Systemsswjun/courses/2020S-CS250B... · o Supports FPGA programming, limited debugging access, etc o Kind of slow… o Bitfile often stored in on-board flash

Example Accelerator Card (Xilinx Alveo U250)

Page 21: CS250B: Modern Computer Systemsswjun/courses/2020S-CS250B... · o Supports FPGA programming, limited debugging access, etc o Kind of slow… o Bitfile often stored in on-board flash

Programming FPGAs

Languages and tools overlap with ASIC/VLSI design

FPGAs for acceleration typically done with eithero Hardware Description Languages (HDL): Register-Transfer Level (RTL) languages

o High-Level Synthesis: Compiler translates software programming languages to RTL

Page 22: CS250B: Modern Computer Systemsswjun/courses/2020S-CS250B... · o Supports FPGA programming, limited debugging access, etc o Kind of slow… o Bitfile often stored in on-board flash

Major Hardware Description Languages

Verilog: Most widely used in industryo Relatively low-level language supported by everyone

Chisel – Compiles to Verilogo Relatively high-level language from Berkeley

o Embedded in the Scala programming language

o Prominently used in RISC-V development (Rocket core, etc)

Bluespec – Compiles to Verilogo Relatively high-level language from MIT

o Supports types, interfaces, etc

o Also active RISC-V development (Piccolo, etc)

Page 23: CS250B: Modern Computer Systemsswjun/courses/2020S-CS250B... · o Supports FPGA programming, limited debugging access, etc o Kind of slow… o Bitfile often stored in on-board flash

Register-Transfer Level

RTL models a circuit using:o Registers (state), and

o Combinational logic (computation)

o Typically everything is clock-synchronous

Unfamiliar constraint: Transfer must finish within a clock cycleo Logic must have a short enough critical path, or

o Clock must be slow enough

Register A

Register B

Transfer (Logic)Register A

Register C

Page 24: CS250B: Modern Computer Systemsswjun/courses/2020S-CS250B... · o Supports FPGA programming, limited debugging access, etc o Kind of slow… o Bitfile often stored in on-board flash

Reminder: Critical Path

A chain of logic components has additive delayo The “depth” of combinational circuits is important

The “critical path” defines the overall propagation delay of a circuit

Example: A full adderSource: en:User:Cburnett @ Wikimedia

Critical path of three componentstPD = tPD(xor2)+tPD (and2)+tPD (or2)

Page 25: CS250B: Modern Computer Systemsswjun/courses/2020S-CS250B... · o Supports FPGA programming, limited debugging access, etc o Kind of slow… o Bitfile often stored in on-board flash

Timing Behavior of State Elements

Meeting the setup time constrainto “Processing must fit in clock cycle”

o After rising clock edge,

o tPD(State element 1) + tPD(Combinational logic) + tSETUP(State element 2)

o must be smaller than the clock period

Data from here…must reach here

…before the next clock Otherwise, “timing violation”

Page 26: CS250B: Modern Computer Systemsswjun/courses/2020S-CS250B... · o Supports FPGA programming, limited debugging access, etc o Kind of slow… o Bitfile often stored in on-board flash

Complexities of RTL

Example RTL logic:o Reg#(Bit#(64)) A, B; // Two 64-bit registers

o A <= (A>>B); // Somewhere, do a variable-width shift

o This is very inefficient on an FPGA! Very long critical path• Long critical path -> Slow clock

• Aside: Reg#(Bit#(2)) B; then A>>(B*16); Generates much better hardware

Kind of have to know what kind of circuits are generated by what logico Typically covered by a few rule of thumbs

o Will be covered later!

Page 27: CS250B: Modern Computer Systemsswjun/courses/2020S-CS250B... · o Supports FPGA programming, limited debugging access, etc o Kind of slow… o Bitfile often stored in on-board flash

High-Level Synthesis

Compiler translates software programming languages to RTL

High-Level Synthesis compiler from Xilinx, Altera/Intelo Compiles C/C++, annotated with #pragma’s into RTL

o Theory/history behind it is a complex can of worms we won’t go into

o Personal experience: needs to be HEAVILY annotated to get performance

o Anecdote: Naïve RISC-V in Vivado HLS achieves IPC of 0.0002 [1], 0.04 after optimizations [2]

OpenCLo Inherently parallel language more efficiently translated to hardware

o Stable software interface

[1] http://msyksphinz.hatenablog.com/entry/2019/02/20/040000[2] http://msyksphinz.hatenablog.com/entry/2019/02/27/040000

Page 28: CS250B: Modern Computer Systemsswjun/courses/2020S-CS250B... · o Supports FPGA programming, limited debugging access, etc o Kind of slow… o Bitfile often stored in on-board flash

FPGA Compilation Toolchain

High-Level HDL Code

Language Compiler

Verilog/VHDL

Bitfile

High-level language vendor tool

Synthesize NetlistMap/Place/Route

FPGA Vendor toolchain (Few open source)

Constraint File

“Which transceiver instance shouldtop_transceiver_01 map to?”And so, so much more…

FunctionalSimulation

Cycle-levelSimulation

Page 29: CS250B: Modern Computer Systemsswjun/courses/2020S-CS250B... · o Supports FPGA programming, limited debugging access, etc o Kind of slow… o Bitfile often stored in on-board flash

Example System Abstraction For Accelerators

Vendor-Provided “Shell”

Ethernet DRAM…

User Code

Abstract Interface

Hardware

User Software

Software

PCIe Vendor PCIe Driver

Abstract Interface

Page 30: CS250B: Modern Computer Systemsswjun/courses/2020S-CS250B... · o Supports FPGA programming, limited debugging access, etc o Kind of slow… o Bitfile often stored in on-board flash

Programming/Using an FPGA Accelerator

Bitfile is programmed to FPGA over “JTAG” interfaceo Typically used over USB cable

o Supports FPGA programming, limited debugging access, etc

o Kind of slow…

o Bitfile often stored in on-board flash for persistence

Modern FPGAs provide faster programming methods as wello On-chip accelerator to load from local memory

• e.g., Xilinx ICAP (Internal Configuration Access Port)

o Milliseconds to program a new design

Page 31: CS250B: Modern Computer Systemsswjun/courses/2020S-CS250B... · o Supports FPGA programming, limited debugging access, etc o Kind of slow… o Bitfile often stored in on-board flash

Programming/Using an FPGA Accelerator

PCIe-attached FPGA accelerator card is typically used similarly to GPUso Program FPGA, execute software

o Software copies data to FPGA board, notify FPGA-> FPGA logic performs computations-> Software copies data back from FPGA

FPGA flexibility gives immense freedom of usage patternso Streaming, coherent memory, …

o Copy to on-board DRAM, or directly to on-chip memory, …

o Decompress before loading to DRAM, etc!

Page 32: CS250B: Modern Computer Systemsswjun/courses/2020S-CS250B... · o Supports FPGA programming, limited debugging access, etc o Kind of slow… o Bitfile often stored in on-board flash

Partial Reconfiguration

FPGA

Sub-components Parts of the FPGA can be swapped out dynamically without turning off FPGAo Physical area is drawn on chip

Used in Amazon F1, etc

Toolchain support for isolation

Page 33: CS250B: Modern Computer Systemsswjun/courses/2020S-CS250B... · o Supports FPGA programming, limited debugging access, etc o Kind of slow… o Bitfile often stored in on-board flash

FPGAs In The Cloud

Amazon EC2 F1 instance (1 – 4 FPGAs)

Page 34: CS250B: Modern Computer Systemsswjun/courses/2020S-CS250B... · o Supports FPGA programming, limited debugging access, etc o Kind of slow… o Bitfile often stored in on-board flash

FPGAs In The Cloud

Microsoft Azure with FPGAs embedded in the networko Powering Bing, etc

Page 35: CS250B: Modern Computer Systemsswjun/courses/2020S-CS250B... · o Supports FPGA programming, limited debugging access, etc o Kind of slow… o Bitfile often stored in on-board flash

Questions?