![Page 1: 1 Using GPCE Principles for Hardware Systems and Accelerators (bridging the gap to HW design) Rishiyur S. Nikhil CTO, GPCE 09 October.](https://reader035.fdocuments.in/reader035/viewer/2022070408/56649e715503460f94b706aa/html5/thumbnails/1.jpg)
1
Using GPCE Principles for Hardware Systems and Accelerators
(bridging the gap to HW design)
Rishiyur S. Nikhil
CTO, Bluespec (www.bluespec.com)
GPCE ’09, October 4, 2009
![Page 2: 1 Using GPCE Principles for Hardware Systems and Accelerators (bridging the gap to HW design) Rishiyur S. Nikhil CTO, GPCE 09 October.](https://reader035.fdocuments.in/reader035/viewer/2022070408/56649e715503460f94b706aa/html5/thumbnails/2.jpg)
2
Generative and component approaches are revolutionizing software development ... GPCE provides a venue for researchers and practitioners interested in foundational techniques for enhancing the productivity, quality, and time-to-market in software development ... In addition to exploring cutting-edge techniques for developing generative and component-based software, our goal is to foster further cross-fertilization between the software engineering research community and the programming languages community.
This seems to be a conference about improving software development ...
... so why am I here talking about hardware design?
Two reasons ....
![Page 3: 1 Using GPCE Principles for Hardware Systems and Accelerators (bridging the gap to HW design) Rishiyur S. Nikhil CTO, GPCE 09 October.](https://reader035.fdocuments.in/reader035/viewer/2022070408/56649e715503460f94b706aa/html5/thumbnails/3.jpg)
3
... Generative Programming (developing programs that synthesize other programs), Component Engineering (raising the level of modularization and analysis in application design), and Domain-Specific Languages (elevating program specifications to compact domain-specific notations that are easier to write, maintain, and analyze) are key technologies for automating program development ... enhancing the productivity, quality, and time-to-market in software development that stems from deploying standard components and automating program generation. ...
Reason (1): you may be interested in seeing how the principles in the GPCE description above are used with equal capability and effectiveness in HW design.
![Page 4: 1 Using GPCE Principles for Hardware Systems and Accelerators (bridging the gap to HW design) Rishiyur S. Nikhil CTO, GPCE 09 October.](https://reader035.fdocuments.in/reader035/viewer/2022070408/56649e715503460f94b706aa/html5/thumbnails/4.jpg)
4
Reason (2): I would like to tempt you to upgrade from being not only a software engineer (v 1.0) ...
... to “The Compleat Computation-ware Engineere (v 2.0)” ...
... where you think of hardware computation as an important (and easy to use) component in your toolbox, when you solve your next problem.
(Figure labels: HW, SW)
![Page 5: 1 Using GPCE Principles for Hardware Systems and Accelerators (bridging the gap to HW design) Rishiyur S. Nikhil CTO, GPCE 09 October.](https://reader035.fdocuments.in/reader035/viewer/2022070408/56649e715503460f94b706aa/html5/thumbnails/5.jpg)
5
The traditional HW creation “flow” (early 1990s to present)
Source code (Verilog/VHDL) feeds three paths:
• RTL simulation. Run/debug/edit: “instant”
• Traditional ASIC synthesis → gate-level Verilog/VHDL → place & route, ..., tape out, ..., manufacture ... Cost: 10s of months, $10M-50M
• Traditional FPGA synthesis* → gate-level Verilog/VHDL → place & route, ..., FPGA download. Cost: minutes/hours, $100-10K
* “synthesis” is just jargon for a certain kind of compilation
![Page 6: 1 Using GPCE Principles for Hardware Systems and Accelerators (bridging the gap to HW design) Rishiyur S. Nikhil CTO, GPCE 09 October.](https://reader035.fdocuments.in/reader035/viewer/2022070408/56649e715503460f94b706aa/html5/thumbnails/6.jpg)
6
New flows (not yet mainstream)
Existing flow, as before: source code (Verilog/VHDL) → RTL simulation, or traditional ASIC/FPGA synthesis → gate-level Verilog/VHDL → place & route, ...
New front end:
• Source code (High Level Language) → “High Level” synthesis → source code (Verilog/VHDL), which then follows the existing flow
• Source code (High Level Language) → simulation by compiled execution
By raising the level of abstraction:
• improve design time by 10x (or more)
• improve expressive power and simulation speed
• with no loss of silicon quality (area, speed, power)
• in fact, sometimes with better silicon quality (because improved flexibility can result in better architectures)
![Page 7: 1 Using GPCE Principles for Hardware Systems and Accelerators (bridging the gap to HW design) Rishiyur S. Nikhil CTO, GPCE 09 October.](https://reader035.fdocuments.in/reader035/viewer/2022070408/56649e715503460f94b706aa/html5/thumbnails/7.jpg)
7
Some candidate high level languages
(Flow diagram as on the previous slide, now with two kinds of high-level source code feeding “High Level” synthesis.)
• Source code (C/C++/SystemC): classic limitations of automatic parallelization from sequential codes, cf. “dusty deck Fortran” ca. 1970s
• Source code (BSV): Bluespec’s fresh approach, inspired by
   – Term Rewriting Systems (parallel atomic transactions) to describe complex concurrent behavior. Related to: UNITY, TLA+, EventB, ...
   – Haskell (types, overloading, parameterization, generativity)
![Page 8: 1 Using GPCE Principles for Hardware Systems and Accelerators (bridging the gap to HW design) Rishiyur S. Nikhil CTO, GPCE 09 October.](https://reader035.fdocuments.in/reader035/viewer/2022070408/56649e715503460f94b706aa/html5/thumbnails/8.jpg)
8
HW languages have always been “generative”
Example (Verilog):
   module mkM1 (…);
      mkM3 m3b ( … );   // instantiates mkM3
      mkM2 m2 ( … );    // instantiates mkM2
   endmodule

   module mkM2 (…);
      mkM3 m3a ( … );   // instantiates mkM3
   endmodule

   module mkM3 (…);
      …
   endmodule
(Figure: two visualizations of the resulting module instance hierarchy. m1 (instance of mkM1) contains m3b and m2; m2 contains m3a.)
![Page 9: 1 Using GPCE Principles for Hardware Systems and Accelerators (bridging the gap to HW design) Rishiyur S. Nikhil CTO, GPCE 09 October.](https://reader035.fdocuments.in/reader035/viewer/2022070408/56649e715503460f94b706aa/html5/thumbnails/9.jpg)
9
HW languages have long been “generative” (contd.)
(Flow diagram as before, annotated with two phases: Static Elaboration, then Execution.)
Static elaboration (jargon for “generation”)
• Execute the structural aspects of the program to produce the module hierarchy (structure)
Execution within the fixed structure (behavior)
• Essentially just the execution of a giant FSM
Verilog/VHDL have poor generative capabilities (a weak afterthought!):
• Not orthogonal, not reflective, not Turing-complete
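To make the two phases concrete for a software audience, here is a minimal sketch in Python (a software model, not HDL). It assumes that each module definition can be modeled as a function which, when executed, returns an instance record. The names `Instance` and `show` are illustrative only; `mkM1`, `mkM2`, and `mkM3` mirror the Verilog example from the previous slide.

```python
# Software model only: plain Python standing in for an HDL elaborator.
from dataclasses import dataclass, field

@dataclass
class Instance:
    name: str                 # instance name, e.g. "m2"
    module: str               # generator that produced it, e.g. "mkM2"
    children: list = field(default_factory=list)

def mkM3(name):
    return Instance(name, "mkM3")

def mkM2(name):
    return Instance(name, "mkM2", [mkM3("m3a")])

def mkM1(name):
    return Instance(name, "mkM1", [mkM3("m3b"), mkM2("m2")])

def show(inst, depth=0):
    """Render the elaborated hierarchy as indented lines."""
    lines = ["  " * depth + f"{inst.name} ({inst.module})"]
    for child in inst.children:
        lines.extend(show(child, depth + 1))
    return lines

# "Static elaboration": running the generators once yields a fixed structure.
hierarchy = mkM1("m1")
print("\n".join(show(hierarchy)))
```

Running the generators is the elaboration step. Everything that happens afterwards in real hardware takes place inside the fixed tree this produces.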
![Page 10: 1 Using GPCE Principles for Hardware Systems and Accelerators (bridging the gap to HW design) Rishiyur S. Nikhil CTO, GPCE 09 October.](https://reader035.fdocuments.in/reader035/viewer/2022070408/56649e715503460f94b706aa/html5/thumbnails/10.jpg)
10
I’m now going to show you some code examples for some non-trivial HW designs. I hope, at the end of this, you’ll say:
“Hey! I could do that!”
even if you’ve never designed HW before!
![Page 11: 1 Using GPCE Principles for Hardware Systems and Accelerators (bridging the gap to HW design) Rishiyur S. Nikhil CTO, GPCE 09 October.](https://reader035.fdocuments.in/reader035/viewer/2022070408/56649e715503460f94b706aa/html5/thumbnails/11.jpg)
11
Verilog/VHDL module interfaces: wire oriented
Example: transferring a datum from one module to another
(Figure: two modules joined by three wires: data, RDY, and ENA.)
• In each module: declare input and output wires
• In the parent: declaration of wires; connections to module interface wires; logic for RDY/ENA
• Protocol (proper behavior) specified separately using waveforms and English text
Very verbose, very error-prone
![Page 12: 1 Using GPCE Principles for Hardware Systems and Accelerators (bridging the gap to HW design) Rishiyur S. Nikhil CTO, GPCE 09 October.](https://reader035.fdocuments.in/reader035/viewer/2022070408/56649e715503460f94b706aa/html5/thumbnails/12.jpg)
12
BSV module interfaces: “transactional” (object-oriented)
   interface Get #(type t);   // polymorphic
      method ActionValue #(t) get();
   endinterface

   interface Put #(type t);
      method Action put (t x);
   endinterface

   // g and p are module parameters
   module mkConnection #(Get#(t) g, Put#(t) p) (Empty);
      rule connect;
         let x <- g.get();
         p.put (x);
      endrule
   endmodule
These interface definitions are sufficiently useful and reusable that they’re in standard BSV libraries
   Get#(Packet) g1 <- mkM1 (...);
   Put#(Packet) p1 <- mkM2 (...);
   Empty        e  <- mkConnection (g1, p1);
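For readers who think in software terms, the Get/Put/mkConnection idea can be approximated by the following Python sketch. It is a model under assumptions, not BSV semantics: the class `Fifo` and the explicit `can_get`/`can_put` checks are hypothetical stand-ins for the ready conditions that BSV attaches to methods implicitly.

```python
from collections import deque

class Fifo:
    """Bounded queue offering a Get-like and a Put-like method."""
    def __init__(self, depth=2):
        self.items = deque()
        self.depth = depth

    def can_get(self):
        return len(self.items) > 0

    def can_put(self):
        return len(self.items) < self.depth

    def get(self):
        assert self.can_get()
        return self.items.popleft()

    def put(self, x):
        assert self.can_put()
        self.items.append(x)

def mk_connection(g, p):
    """Return a 'rule' that moves one datum from g to p when both are ready."""
    def connect():
        if g.can_get() and p.can_put():
            p.put(g.get())
            return True      # the rule fired
        return False         # the rule did not fire; nothing changed
    return connect

src, dst = Fifo(depth=4), Fifo(depth=2)
for pkt in ("a", "b", "c"):
    src.put(pkt)

rule = mk_connection(src, dst)
fired = [rule() for _ in range(4)]
print(fired, list(src.items), list(dst.items))
```

In the sketch the rule stops firing once `dst` is full. In BSV that back-pressure logic is derived by the compiler rather than written by hand.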
![Page 13: 1 Using GPCE Principles for Hardware Systems and Accelerators (bridging the gap to HW design) Rishiyur S. Nikhil CTO, GPCE 09 October.](https://reader035.fdocuments.in/reader035/viewer/2022070408/56649e715503460f94b706aa/html5/thumbnails/13.jpg)
13
Interfaces can be composed
Get/Put pairs are very common, and duals of each other, so the BSV library defines Client/Server interfaces for this purpose
   interface Client #(type req_t, type resp_t);
      interface Get#(req_t)  request;
      interface Put#(resp_t) response;
   endinterface

   interface Server #(type req_t, type resp_t);
      interface Put#(req_t)  request;
      interface Get#(resp_t) response;
   endinterface

   module mkConnection #(Client#(t1,t2) c, Server#(t1,t2) s) (Empty);
      mkConnection (c.request,  s.request);
      mkConnection (s.response, c.response);
   endmodule
(Figure: the client’s Get (request, req_t) drives the server’s Put, and the server’s Get (response, resp_t) drives the client’s Put; each connection consists of data, RDY, and ENA wires.)
Note overloaded mkConnection (BSV uses Haskell’s Typeclass mechanism for user-extensible, recursive, statically typed overloading)
![Page 14: 1 Using GPCE Principles for Hardware Systems and Accelerators (bridging the gap to HW design) Rishiyur S. Nikhil CTO, GPCE 09 October.](https://reader035.fdocuments.in/reader035/viewer/2022070408/56649e715503460f94b706aa/html5/thumbnails/14.jpg)
14
Example: a Butterfly cross-bar switch
Recursive structure: 1x1, 2x2, 4x4, …, NxN
Basic building blocks:
• buffer (FIFO)
• 2x1 merge
• routing logic
The entire interface can be defined in a few lines (polymorphic in the data type of packets flowing through the switch):
   interface XBar #(type t);
      interface List#(Put#(t)) input_ports;
      interface List#(Get#(t)) output_ports;
   endinterface
![Page 15: 1 Using GPCE Principles for Hardware Systems and Accelerators (bridging the gap to HW design) Rishiyur S. Nikhil CTO, GPCE 09 October.](https://reader035.fdocuments.in/reader035/viewer/2022070408/56649e715503460f94b706aa/html5/thumbnails/15.jpg)
15
Butterfly switch: module implementation
   // Module parameters
   module mkXBar #(Integer n,                                // size of switch (# of ports)
                   function UInt #(32) destinationOf (t x),  // used by routing logic
                   Module #(Merge2x1 #(t)) mkMerge2x1)       // 2x1 merge module
                 ( XBar #(t) );                              // interface
   endmodule: mkXBar
Parameters are static arguments, and so can be of any type, including (unbounded) Integers, functions, modules, etc.
Interfaces represent dynamic communications and can only carry hardware-representable types.
![Page 16: 1 Using GPCE Principles for Hardware Systems and Accelerators (bridging the gap to HW design) Rishiyur S. Nikhil CTO, GPCE 09 October.](https://reader035.fdocuments.in/reader035/viewer/2022070408/56649e715503460f94b706aa/html5/thumbnails/16.jpg)
16
Butterfly switch: module implementation
   module mkXBar #(...) ( XBar #(t) );
      List #(Put#(t)) iports;
      List #(Get#(t)) oports;

      if (n == 1) begin        // ---- BASE CASE (n = 1): a single buffer (FIFO)
         FIFO #(t) f <- mkFIFO;
         iports = cons (toPut (f), nil);
         oports = cons (toGet (f), nil);
      end
      else begin               // ---- RECURSIVE CASE (n > 1)
      end

      interface input_ports  = iports;
      interface output_ports = oports;
   endmodule: mkXBar
![Page 17: 1 Using GPCE Principles for Hardware Systems and Accelerators (bridging the gap to HW design) Rishiyur S. Nikhil CTO, GPCE 09 October.](https://reader035.fdocuments.in/reader035/viewer/2022070408/56649e715503460f94b706aa/html5/thumbnails/17.jpg)
17
Butterfly switch: module implementation
   module mkXBar #(...) ( XBar #(t) );
      if (n == 1) begin        // ---- BASE CASE (n = 1)
      end
      else begin               // ---- RECURSIVE CASE (n > 1)
         XBar#(t) upper <- mkXBar (n/2, destinationOf, mkMerge2x1);
         XBar#(t) lower <- mkXBar (n/2, destinationOf, mkMerge2x1);

         List#(Merge2x1#(t)) merges <- replicateM (n, mkMerge2x1);

         iports = append (upper.input_ports, lower.input_ports);

         function Get#(t) oport_of (Merge2x1#(t) m) = m.oport;
         oports = map (oport_of, merges);

         ... routing behavior ...
      end
   endmodule: mkXBar
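The recursive elaboration can be mimicked in a few lines of Python to see how much structure it unfolds into. This is a counting sketch only: it assumes n is a power of two and tallies the FIFOs and 2x1 merges that the base and recursive cases would instantiate, without modeling routing. The function name `elaborate_xbar` is hypothetical.

```python
def elaborate_xbar(n):
    """Return (fifos, merges) instantiated by a recursive n-port switch."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    if n == 1:
        return (1, 0)                      # base case: one buffer (FIFO)
    upper = elaborate_xbar(n // 2)         # upper half-size switch
    lower = elaborate_xbar(n // 2)         # lower half-size switch
    fifos = upper[0] + lower[0]
    merges = upper[1] + lower[1] + n       # n merges at this level
    return (fifos, merges)

for size in (1, 2, 4, 8):
    print(size, elaborate_xbar(size))
```

Under these assumptions an n-port switch unfolds into n buffers and n·log2(n) merges, all produced from one short recursive description.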
![Page 18: 1 Using GPCE Principles for Hardware Systems and Accelerators (bridging the gap to HW design) Rishiyur S. Nikhil CTO, GPCE 09 October.](https://reader035.fdocuments.in/reader035/viewer/2022070408/56649e715503460f94b706aa/html5/thumbnails/18.jpg)
18
Butterfly switch: module implementation
   module mkXBar #(...) ( XBar #(t) );
      if (n == 1) begin        // ---- BASE CASE (n = 1)
      end
      else begin               // ---- RECURSIVE CASE (n > 1)
         let ps = append (upper.output_ports, lower.output_ports);
         for (Integer j = 0; j < n; j = j + 1)
            rule route;
               let x <- ps[j].get ();
               case (flip (destinationOf (x), j, n)) matches
                  tagged Invalid         : merges [j]       .iport0.put (x);
                  tagged Valid .jFlipped : merges [jFlipped].iport1.put (x);
               endcase
            endrule
      end
   endmodule: mkXBar
![Page 19: 1 Using GPCE Principles for Hardware Systems and Accelerators (bridging the gap to HW design) Rishiyur S. Nikhil CTO, GPCE 09 October.](https://reader035.fdocuments.in/reader035/viewer/2022070408/56649e715503460f94b706aa/html5/thumbnails/19.jpg)
19
Butterfly switch: atomicity of rules
   for (Integer j = 0; j < n; j = j + 1)
      rule route;
         let x <- ps[j].get ();
         case (flip (destinationOf (x), j, n)) matches
            tagged Invalid         : merges [j]       .iport0.put (x);
            tagged Valid .jFlipped : merges [jFlipped].iport1.put (x);
         endcase
      endrule
• ps[j].get (): there may not be a packet to get
• merges[...].put (x): it may not be possible to put a packet (flow control, contention)
The hardware control logic to manage these complex, dynamic (data-dependent), reactive control conditions is the most tedious and error-prone aspect of designing with RTL (Verilog, VHDL) and even with SystemC.
Creation of this logic is automated (synthesized), based on the atomicity semantics of rules.
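The all-or-nothing behavior of a rule can be illustrated with a small Python model. This is a simplification under stated assumptions: the `Port` class, the `choose` function, and the hand-written readiness checks are hypothetical, and contention between several rules in one clock cycle is not modeled. In BSV the equivalent conditions are derived by the compiler.

```python
from collections import deque

class Port:
    """A bounded buffer standing in for a Get source or a Put destination."""
    def __init__(self, capacity):
        self.q = deque()
        self.capacity = capacity

    def can_get(self):
        return bool(self.q)

    def can_put(self):
        return len(self.q) < self.capacity

def route_rule(src, targets, choose):
    """Fire atomically: move one packet from src to its target, or do nothing."""
    if not src.can_get():
        return False                 # there may not be a packet to get
    x = src.q[0]                     # inspect without dequeuing
    dst = targets[choose(x)]
    if not dst.can_put():
        return False                 # flow control: packet stays in src
    src.q.popleft()
    dst.q.append(x)
    return True

src = Port(capacity=4)
src.q.extend([1, 3])
targets = [Port(capacity=1), Port(capacity=1)]
first = route_rule(src, targets, lambda x: x % 2)
second = route_rule(src, targets, lambda x: x % 2)
print(first, second, list(src.q), [list(t.q) for t in targets])
```

The second attempt leaves every buffer untouched because its destination is full. The rule never dequeues a packet that it cannot also deliver.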
![Page 20: 1 Using GPCE Principles for Hardware Systems and Accelerators (bridging the gap to HW design) Rishiyur S. Nikhil CTO, GPCE 09 October.](https://reader035.fdocuments.in/reader035/viewer/2022070408/56649e715503460f94b706aa/html5/thumbnails/20.jpg)
20
Butterfly switch: summary observations
The core mkXBar module is expressed in ~40-50 lines of code
• Parameterized by packet type, size, routing function, 2x1 merge module
• It’s fully synthesizable (550 MHz using Magma Synthesis, TSMC 0.18 micron libraries)
Static elaboration (“generativity”) has the full power of Haskell evaluation
• Higher-order functions, lists/vectors, recursion, ...
There is no syntactic distinction between the “static elaboration” part and the “dynamic” part of the source code
• An expression “a+b” may be used both for static elaboration and as a dynamic computation (i.e., an adder in the hardware)
2 layers: static elaboration produces a module hierarchy with rules
• The rules are then synthesized according to atomicity semantics into the correct data paths and control logic
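The point about “a+b” serving both purposes can be imitated in Python with operator overloading. In this sketch, which is an analogy and not how the BSV compiler works, adding two integers is evaluated immediately ("statically"), while adding values of a hypothetical `Signal` type builds an `Adder` node that stands for hardware.

```python
from dataclasses import dataclass

class Node:
    """Base class: '+' on hardware values builds structure instead of computing."""
    def __add__(self, other):
        return Adder(self, other)

    def __radd__(self, other):
        return Adder(other, self)

@dataclass(frozen=True)
class Signal(Node):
    name: str            # a value that only exists at run time, in hardware

@dataclass(frozen=True)
class Adder(Node):
    lhs: object
    rhs: object

def add(a, b):
    return a + b         # the same source expression in both uses

static_result = add(2, 3)                        # evaluated during elaboration
dynamic_result = add(Signal("a"), Signal("b"))   # becomes an adder instance
print(static_result, dynamic_result)
```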
![Page 21: 1 Using GPCE Principles for Hardware Systems and Accelerators (bridging the gap to HW design) Rishiyur S. Nikhil CTO, GPCE 09 October.](https://reader035.fdocuments.in/reader035/viewer/2022070408/56649e715503460f94b706aa/html5/thumbnails/21.jpg)
21
Example: IFFT in 802.11a wireless transmitter
(Figure: transmitter block diagram with blocks Controller, Scrambler, Encoder, Interleaver, Mapper, IFFT, and Cyclic Extend; inputs are headers and data, in units of 24 uncoded bits.)
IFFT transforms 64 (frequency domain) complex numbers into 64 (time domain) complex numbers, and accounts for 85% of the area
![Page 22: 1 Using GPCE Principles for Hardware Systems and Accelerators (bridging the gap to HW design) Rishiyur S. Nikhil CTO, GPCE 09 October.](https://reader035.fdocuments.in/reader035/viewer/2022070408/56649e715503460f94b706aa/html5/thumbnails/22.jpg)
22
The IFFT computation (specification)
(Figure: inputs in0 … in63 pass through three stages, each consisting of 16 Bfly4 blocks followed by a permutation (Permute_1, Permute_2, Permute_3), producing out0 … out63. Inset: the internals of a Bfly4, built from multipliers, adders/subtractors, and a multiplication by j, with labels t0 to t3.)
All numbers are complex and represented as two sixteen bit quantities. Fixed-point arithmetic is used to reduce area, power, ...
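As a rough illustration of what a radix-4 butterfly computes, the Python sketch below implements an unnormalized 4-point inverse DFT using only additions, subtractions, and one multiplication by j, and checks it against the defining sum. It is an assumption-laden simplification: the slide's Bfly4 also has input multipliers and uses 16-bit fixed-point arithmetic, both of which are omitted here, and the function names are hypothetical.

```python
import cmath

def bfly4(x0, x1, x2, x3):
    """Unnormalized 4-point inverse DFT built from adds, subtracts and one *j."""
    a, b = x0 + x2, x0 - x2
    c, d = x1 + x3, 1j * (x1 - x3)
    return (a + c, b + d, a - c, b - d)

def idft4_direct(xs):
    """Reference: y_k = sum over n of x_n * exp(+2*pi*i*k*n/4)."""
    return tuple(
        sum(x * cmath.exp(2j * cmath.pi * k * n / 4) for n, x in enumerate(xs))
        for k in range(4)
    )

sample = (1 + 2j, -3 + 0.5j, 0.25 - 1j, 2 + 2j)
fast = bfly4(*sample)
slow = idft4_direct(sample)
print(all(abs(u - v) < 1e-9 for u, v in zip(fast, slow)))
```

Since 64 = 4 × 4 × 4, three stages of 16 such 4-input blocks touch all 64 values in each stage, which matches the shape of the figure.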
![Page 23: 1 Using GPCE Principles for Hardware Systems and Accelerators (bridging the gap to HW design) Rishiyur S. Nikhil CTO, GPCE 09 October.](https://reader035.fdocuments.in/reader035/viewer/2022070408/56649e715503460f94b706aa/html5/thumbnails/23.jpg)
23
IFFT: the HW implementation space (varying in area, power, clock speed, latency, throughput)
• Direct combinational circuit
• Varying degrees of pipelining
• Iterate 1 stage thrice
• In any stage, use fewer than 16 Bfly4s (serialization → fewer Bfly4s → unserialization)
![Page 24: 1 Using GPCE Principles for Hardware Systems and Accelerators (bridging the gap to HW design) Rishiyur S. Nikhil CTO, GPCE 09 October.](https://reader035.fdocuments.in/reader035/viewer/2022070408/56649e715503460f94b706aa/html5/thumbnails/24.jpg)
24
Higher-order functions for building linear pipelines (“linear combinator”)
   module mkLinearPipe #(Integer n_stages,
                         Bool with_registers,
                         function Module #(Pipe#(a,a)) mkStage (Integer stage_j))
                       (Pipe#(a,a));
      ...
   endmodule
(Figure: mkLinearPipe () chains n_stages Pipe stages, numbered 0 to n_stages-1, each with a Put input and a Get output; stage_j is produced by mkStage (stage_j).)
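The shape of the linear combinator can be conveyed by a Python analogue. This sketch models each stage as a plain function and treats `with_registers` as merely adding one cycle of latency per stage. Both are simplifying assumptions, and `mk_linear_pipe` and `mk_stage` are hypothetical names, not the library's API.

```python
def mk_linear_pipe(n_stages, with_registers, mk_stage):
    """Compose stages 0 .. n_stages-1; return (function, latency in cycles)."""
    # Structure is built once, by calling the stage generator per index.
    stages = [mk_stage(j) for j in range(n_stages)]

    def run(x):
        for stage in stages:
            x = stage(x)
        return x

    latency = n_stages if with_registers else 0   # toy timing model
    return run, latency

def mk_stage(j):
    """Hypothetical stage generator: stage j doubles its input and adds j."""
    return lambda x: 2 * x + j

pipe, latency = mk_linear_pipe(3, True, mk_stage)
print(pipe(1), latency)
```

The combinator takes a stage generator as an argument, which is what lets one description cover any number of stages.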
![Page 25: 1 Using GPCE Principles for Hardware Systems and Accelerators (bridging the gap to HW design) Rishiyur S. Nikhil CTO, GPCE 09 October.](https://reader035.fdocuments.in/reader035/viewer/2022070408/56649e715503460f94b706aa/html5/thumbnails/25.jpg)
25
Higher-order functions for building looped pipelines (“loop combinator”)
   module mkLoopPipelined #(Integer n,
                            function Module#(PipeF #(Tuple2#(a, UInt#(logn)), a)) mkLoopBody ())
                          (PipeF #(a,a))
(Figure: mkLoopPipe () feeds each input x, tagged with an iteration index j, through a single loop body n times; the body takes (a, j) and returns a.)
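A Python analogue of the loop combinator, under the same kind of simplifying assumptions: the loop body is a function of (value, iteration index), it is constructed exactly once, and it is reused n times. The names `mk_loop_pipe` and `mk_loop_body` are hypothetical.

```python
def mk_loop_pipe(n, mk_loop_body):
    """Build one loop body and pass each input through it n times."""
    body = mk_loop_body()            # a single instance, reused every pass

    def run(x):
        for j in range(n):
            x = body(x, j)           # the body sees the value and the index j
        return x

    return run

instances = []                       # records how many bodies were built

def mk_loop_body():
    def body(x, j):
        return 2 * x + j
    instances.append(body)
    return body

loop = mk_loop_pipe(3, mk_loop_body)
print(loop(1), len(instances))
```

The loop computes the same function as three chained stages that each apply `2 * x + j`, but with a single body instance instead of three. That is the area-for-time trade the combinator exposes.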
![Page 26: 1 Using GPCE Principles for Hardware Systems and Accelerators (bridging the gap to HW design) Rishiyur S. Nikhil CTO, GPCE 09 October.](https://reader035.fdocuments.in/reader035/viewer/2022070408/56649e715503460f94b706aa/html5/thumbnails/26.jpg)
26
Generating all versions of IFFT
(Figure repeated: the IFFT implementation space, from the direct combinational circuit, through varying degrees of pipelining and iterating 1 stage thrice, to serialized stages that use fewer than 16 Bfly4s.)
Which architecture is “best” depends on the requirements
• Desired latency, throughput
• Area, power, clock speed
• Target silicon technology (FPGA, ASIC 90nm, ASIC 65nm, ...)
“PAClib” (Pipeline Architecture Constructor Library) is a library of such higher-order pipeline combinators. Using PAClib, IFFT can be succinctly expressed in a single source code which, depending on the parameters supplied, will elaborate (unfold) into any one of the possible architectures in the space of architectures illustrated.
PAClib enables a “pipeline DSL”
![Page 27: 1 Using GPCE Principles for Hardware Systems and Accelerators (bridging the gap to HW design) Rishiyur S. Nikhil CTO, GPCE 09 October.](https://reader035.fdocuments.in/reader035/viewer/2022070408/56649e715503460f94b706aa/html5/thumbnails/27.jpg)
27
Another important reason for generativity: it enables rapid experimentation to determine the optimal architecture
Architectural effects can be quite unpredictable. E.g.,
• Hypothesis: linear pipe will take more silicon area than looped pipe
But the looped pipe has other silicon costs:
• Needs multiplexers, control logic → area cost
• Needs higher clock speed for same throughput → area cost, power cost
• A kicker: disables some constant propagations → area cost, power cost
(for ASICs, silicon area directly affects price of chip)
Bottom line:
• Need to be able to experiment with different architectures
• Generativity allows scripting the exploration of the space
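What "scripting the exploration" might look like is sketched below in Python. It only enumerates structural counts for the 64-point IFFT (3 stages, 16 Bfly4 operations per stage) and makes no claim about real area, power, or clock speed, which are exactly the unpredictable quantities that must be measured through synthesis. The style names, the `describe` function, and the assumption that a looped design holds one IFFT at a time are all illustrative.

```python
from itertools import product

STAGES = 3            # 64-point IFFT as three radix-4 stages
BFLY_PER_STAGE = 16   # Bfly4 operations needed in each stage

def describe(style, bfly4s):
    """Count Bfly4 instances and steps between results for one configuration."""
    if BFLY_PER_STAGE % bfly4s:
        raise ValueError("bfly4s must divide 16")
    passes = BFLY_PER_STAGE // bfly4s        # serialization within a stage
    if style == "linear":
        instances = STAGES * bfly4s          # one set of Bfly4s per stage
        interval = passes                    # stages overlap in time
    elif style == "looped":
        instances = bfly4s                   # one stage reused three times
        interval = STAGES * passes           # assumed: one IFFT at a time
    else:
        raise ValueError(style)
    return {"style": style, "bfly4s": bfly4s,
            "instances": instances, "interval": interval}

configs = [describe(s, k)
           for s, k in product(("linear", "looped"), (1, 2, 4, 8, 16))]

# Example query: fewest Bfly4 instances that still deliver a result
# at least every 4 steps.
best = min((c for c in configs if c["interval"] <= 4),
           key=lambda c: c["instances"])
print(best)
```

In a real flow each configuration would be elaborated and synthesized, and the query would run over measured results rather than these counts.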
![Page 28: 1 Using GPCE Principles for Hardware Systems and Accelerators (bridging the gap to HW design) Rishiyur S. Nikhil CTO, GPCE 09 October.](https://reader035.fdocuments.in/reader035/viewer/2022070408/56649e715503460f94b706aa/html5/thumbnails/28.jpg)
28
I hope that by now you’re saying:
“Hey! Writing HW programs doesn’t look too hard!”
(Has all the creature comforts of a modern high-level programming language.)
But, so what?
• Why would I want to compute something directly in HW?
• Even if I want to, aren’t the costs and logistics of actually putting something in HW just too high a barrier?
![Page 29: 1 Using GPCE Principles for Hardware Systems and Accelerators (bridging the gap to HW design) Rishiyur S. Nikhil CTO, GPCE 09 October.](https://reader035.fdocuments.in/reader035/viewer/2022070408/56649e715503460f94b706aa/html5/thumbnails/29.jpg)
29
Why implement things in HW?
Reason (1): Speed
(Figure: a fixed machine (e.g., x86, GPGPU, Cell) runs by interpreting the instructions (program) for application X; an X-machine (fine-grain parallel) runs X directly.)
Direct implementation in HW typically
• removes a layer of interpretation, and interpretation generally costs an order of magnitude in speed
• can exploit more parallelism
Caveat: lots of devils in the details
• Interpretation at GHz may still be faster than direct execution at MHz
• Interpretation with monster memory bandwidth may still be faster than direct execution with anemic memory bandwidth
![Page 30: 1 Using GPCE Principles for Hardware Systems and Accelerators (bridging the gap to HW design) Rishiyur S. Nikhil CTO, GPCE 09 October.](https://reader035.fdocuments.in/reader035/viewer/2022070408/56649e715503460f94b706aa/html5/thumbnails/30.jpg)
30
Why implement things in HW?
Reason (2): Power consumption
• Interpretation on fixed computing architectures costs power
(Figure: a fixed machine (e.g., x86, GPGPU, Cell) interprets the instructions (program) for application X; an X-machine executes X directly.)
• Both pay the energy cost for X-execution
• The fixed machine also pays for fetch, decode, register management, cache management, extra data movement, branch misprediction, ...
• Portable devices: battery life
• Server farms/clouds: cost of power supply, air conditioning
![Page 31: 1 Using GPCE Principles for Hardware Systems and Accelerators (bridging the gap to HW design) Rishiyur S. Nikhil CTO, GPCE 09 October.](https://reader035.fdocuments.in/reader035/viewer/2022070408/56649e715503460f94b706aa/html5/thumbnails/31.jpg)
31
Opportunity with today’s FPGA technology (Field Programmable Gate Arrays)
FPGA capacity:
• millions of gates
FPGA speeds:
• 100s of MHz
FPGA board costs:
• As low as $100s
• $1K-$10K typical
• $10K-$100K for multi-FPGA boards
FPGA-host communication links:
• USB
• 1Gb/10Gb Ethernet
• PCI Express
... new and exciting:
• FPGA-in-processor-socket:
   – AMD Hypertransport bus
   – Intel Front-Side Bus
• FPGA-on-processor-chip:
   – Coming soon
Example of what is possible: a single FPGA can easily run H.264 decoding at VGA resolution (640x480) and, with a good design, at HDTV (1920x1080) resolution
(Figure: your application software on a host, connected to your computation on an FPGA subsystem. The illustration shows a Bluespec emulation board whose FPGA device holds an FA626 processor system (L2 cache, AXI interconnect fabric, AXI-AHB bridge, DDR2 and SRAM controllers, Ethernet GMAC, security engine, RS232 UART, traffic generators and transactors), with console, debugger, and co-emulation link.)
![Page 32: 1 Using GPCE Principles for Hardware Systems and Accelerators (bridging the gap to HW design) Rishiyur S. Nikhil CTO, GPCE 09 October.](https://reader035.fdocuments.in/reader035/viewer/2022070408/56649e715503460f94b706aa/html5/thumbnails/32.jpg)
32
Making FPGA acceleration easy and routine
Atop today’s FPGA technology, we provide the communication infrastructure:
• Make it easy for SW to invoke a HW service or vice versa
• Concurrent, pipelined, ...
• Model: Concurrent RPCs (Remote Procedure Calls)
• Auto-generate SW and HW (BSV) stubs from service specs
• (like using IDL to specify distributed client/server communication)
A “Communications Protocol Stack” (figure):
• Software side: SW app / services / SCE-MI / link layer
• Hardware side: HW app (BSV/RTL) / services / SCE-MI / link layer
• Physical link between the two link layers: sockets/PCIe/USB/Ethernet/FSB/Hypertransport
• HW agnostic: FPGA (or Bluesim/Verilog sim)
Analogy: RPC / socket / TCP/IP / Ethernet
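From the software side, the concurrent-RPC model might feel roughly like the Python sketch below. Everything in it is hypothetical: `HwServiceStub` is not a generated stub or any SCE-MI API, and a thread pool merely stands in for the FPGA so that several requests can be in flight at once.

```python
from concurrent.futures import ThreadPoolExecutor

class HwServiceStub:
    """Hypothetical SW-side stub: each call returns a future, so calls overlap."""
    def __init__(self, service, max_in_flight=4):
        self._service = service
        self._pool = ThreadPoolExecutor(max_workers=max_in_flight)

    def call(self, *args):
        return self._pool.submit(self._service, *args)

    def close(self):
        self._pool.shutdown(wait=True)

def checksum_service(data):
    """Stand-in for a computation that would run on the FPGA."""
    return sum(data) % 256

stub = HwServiceStub(checksum_service)
futures = [stub.call(bytes([i, i + 1, i + 2])) for i in range(4)]  # issue all
results = [f.result() for f in futures]                           # then collect
stub.close()
print(results)
```

Issuing all requests before collecting any response is what allows a pipelined hardware service to stay busy.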
![Page 33: 1 Using GPCE Principles for Hardware Systems and Accelerators (bridging the gap to HW design) Rishiyur S. Nikhil CTO, GPCE 09 October.](https://reader035.fdocuments.in/reader035/viewer/2022070408/56649e715503460f94b706aa/html5/thumbnails/33.jpg)
33
Putting it all together:
(Figure: your application is split into a SW part (e.g., C++) and a HW part (BSV). Each part has Get/Put/Client/Server interfaces, joined by mkConnection connections. The services, SCE-MI, and link layers are generated for both sides. The SW part is compiled with gcc and linked/loaded on the host; the HW part goes through BSV synthesis and FPGA synthesis etc., and is linked/loaded onto the FPGA.)
BSV applies GPCE concepts to HW design: generation, parameterization, changeability; reusability; easy exploration of architecture space, ...
FPGAs are compelling due to speed, lower power, low cost, fast communication with host
![Page 34: 1 Using GPCE Principles for Hardware Systems and Accelerators (bridging the gap to HW design) Rishiyur S. Nikhil CTO, GPCE 09 October.](https://reader035.fdocuments.in/reader035/viewer/2022070408/56649e715503460f94b706aa/html5/thumbnails/34.jpg)
34
Example: CMU ProtoFlex
http://www.ece.cmu.edu/~protoflex
(Figure: a BSV UltraSparc model on a Virtex5 FPGA, connected over Ethernet to Virtutech Simics.)
Virtutech Simics: commercial SW simulator for whole systems (OS/devices/apps) (“Virtual Platform” for early SW development, before ASIC is available)
Problem: very clever tricks for fast simulation, but steady slowdown
– for each added thread and core
– for each added bit of instrumentation
CMU ProtoFlex:
– Fully operational model of 16-cpu UltraSPARC III SunFire 3800 Server, running unmodified Solaris 8; running on FPGA at 90 MHz
– Hybrid simulation: continue to use Simics for modeling rest of system (I/O devices, ...)
– Benchmark: TPC-C OLTP on Oracle 10g Enterprise Database Server. Also SPECINT (bzip2, crafty, gcc, gzip, parser, vortex)
– Performance: 10-60 MIPS; 39x faster than Virtutech Simics alone on same system/benchmark
– Written in BSV by 1 graduate student (Eric Chung) in 1 year!
![Page 35: 1 Using GPCE Principles for Hardware Systems and Accelerators (bridging the gap to HW design) Rishiyur S. Nikhil CTO, GPCE 09 October.](https://reader035.fdocuments.in/reader035/viewer/2022070408/56649e715503460f94b706aa/html5/thumbnails/35.jpg)
35
Example: Univ. of Glasgow document retrieval experiment
“FPGA-Accelerated Information Retrieval: High-Efficiency Document Filtering”, W. Vanderbauwhede, L. Azzopardi, and M. Moadeli, in Proc. 19th IEEE Intl. Conf. on Field Programmable Logic and Applications (FPL'09), Prague, Czech Republic, Aug 31-Sep 2, 2009
(Figure: a document stream enters the FPGA (match algorithm), which consults search terms held in SRAM and emits a score stream.)
E.g.,
• find spam in emails
• find similar patents
• find relevant news stories
Experiments on 3 collections, from ~1M to 1.5M documents each
Ran same algorithm on
• 1.6 GHz Itanium-2
• Virtex-4 FPGA
Power consumption: 130 Watts (Itanium), 1.25 Watts (FPGA)
Speedup: ~10x – 20x
• Itanium slows down as profile (search database) size increases
• FPGA does not (parallelism)
![Page 36: 1 Using GPCE Principles for Hardware Systems and Accelerators (bridging the gap to HW design) Rishiyur S. Nikhil CTO, GPCE 09 October.](https://reader035.fdocuments.in/reader035/viewer/2022070408/56649e715503460f94b706aa/html5/thumbnails/36.jpg)
36
Example: MEMOCODE’08 Design Contest
Goal: Speed up a software reference application running on the PowerPC on Xilinx XUP reference board using SW/HW codesign
The application, run on a large db of records in DRAM:
• decrypt
• sort
• re-encrypt
Time allotted: 4 weeks
(Figure: Xilinx XUP board)
http://rijndael.ece.vt.edu/memocontest08/
![Page 37: 1 Using GPCE Principles for Hardware Systems and Accelerators (bridging the gap to HW design) Rishiyur S. Nikhil CTO, GPCE 09 October.](https://reader035.fdocuments.in/reader035/viewer/2022070408/56649e715503460f94b706aa/html5/thumbnails/37.jpg)
37
Example: MEMOCODE’08 Design Contest Results
(Figure: contest results chart, with one entry labeled “(BSV)”.)
Reference: http://rijndael.ece.vt.edu/memocontest08/everybodywins/
Records had to be repeatedly streamed through a “merge-sort” block.
Advantage to those who could rapidly generate a variety of merge-sort architectures and find the best one to “fit” into the FPGA
![Page 38: 1 Using GPCE Principles for Hardware Systems and Accelerators (bridging the gap to HW design) Rishiyur S. Nikhil CTO, GPCE 09 October.](https://reader035.fdocuments.in/reader035/viewer/2022070408/56649e715503460f94b706aa/html5/thumbnails/38.jpg)
38
In summary
With languages that use GPCE principles, HW design is now ready for incorporation into your programming toolbox!
(Figure repeated from “Putting it all together”: a SW part (e.g., C++) and a HW part (BSV) with Get/Put/Client/Server interfaces and mkConnection connections; generated services, SCE-MI, and link layers; gcc on the software side, BSV synthesis and FPGA synthesis etc. on the hardware side.)
Thank you for your kind attention!
![Page 39: 1 Using GPCE Principles for Hardware Systems and Accelerators (bridging the gap to HW design) Rishiyur S. Nikhil CTO, GPCE 09 October.](https://reader035.fdocuments.in/reader035/viewer/2022070408/56649e715503460f94b706aa/html5/thumbnails/39.jpg)
39
Acknowledgements
James Hoe (MIT/CMU) and Arvind (MIT) for original technology for high-level synthesis from rules to RTL used in BSV today, 1997-2000
Lennart Augustsson (Chalmers/Sandburst) for Haskell-based generative technology used in BSV today, 2000-2003
My colleagues in the engineering teams at Sandburst and Bluespec for continuous and substantial improvements, 2000-2009
Prof. Arvind’s group at MIT for their research and ideas, 2000-2009