Flexible and scalable FPGA-oriented design of multipliers ...
Scalable FPGA-accelerated Cycle-Accurate Hardware ...
Transcript of Scalable FPGA-accelerated Cycle-Accurate Hardware ...
![Page 1: Scalable FPGA-accelerated Cycle-Accurate Hardware ...](https://reader030.fdocuments.in/reader030/viewer/2022012715/61ada8f6a6831b63cd36f376/html5/thumbnails/1.jpg)
Scalable FPGA-accelerated Cycle-Accurate Hardware
Simulation in the Cloud
Introduction
RISC-V Summit 2018 Tutorial
Speakers: Sagar Karandikar, David Biancolin, Alon Amid
https://fires.im
@firesimproject
![Page 2: Scalable FPGA-accelerated Cycle-Accurate Hardware ...](https://reader030.fdocuments.in/reader030/viewer/2022012715/61ada8f6a6831b63cd36f376/html5/thumbnails/2.jpg)
What is this tutorial about?
• How to get started with FireSim • We’re going to keep things very practical today, treat this as a
tutorial • For more of the research background, see our papers/previous
talks
• We’re going to cram in a lot of info, but feel free to interrupt us with questions • We’ll put up the full slide deck online anyway
2
![Page 3: Scalable FPGA-accelerated Cycle-Accurate Hardware ...](https://reader030.fdocuments.in/reader030/viewer/2022012715/61ada8f6a6831b63cd36f376/html5/thumbnails/3.jpg)
Today’s Agenda
• Intro, what is FireSim capable of? • (Sagar, 10 mins)
• Building/Deploying FireSim Simulations • (Sagar, 15 mins)
• Using FireSim to simulate your own custom HW • (David, 15 mins)
• Testing and Debugging your design in FireSim • (Alon, 15 mins)
• Community/Contributing/Conclusion • (David, 10 mins)
3
![Page 4: Scalable FPGA-accelerated Cycle-Accurate Hardware ...](https://reader030.fdocuments.in/reader030/viewer/2022012715/61ada8f6a6831b63cd36f376/html5/thumbnails/4.jpg)
What can FireSim be used for?
• Evaluating a custom hardware design • For example, building a custom accelerator and running real workloads on it • e.g. “A Hardware Accelerator for Tracing Garbage Collection” from ISCA 2018 and the
Hwacha project (see RV-Summit talk tomorrow)
• Debugging a Chisel design at FPGA-speeds • e.g. FireSim Debugging Docs and DESSERT from FPL 2018
• Rapidly Customizing/Evaluating RISC-V cores, • For example, Rocket Chip and BOOM, provided as default designs in FireSim. • FireSim can run all of SPECInt2017 with reference inputs on Rocket Chip in parallel on
10 cloud-FPGAs in ~1 day • e.g. “Composable Building Blocks to Open up Processor Design” from MICRO 2018
• Simulating thousand-node datacenter-scale systems with full control over the hardware • e.g. FireSim ISCA 2018 Paper • Implicitly includes the previous use case
4
![Page 5: Scalable FPGA-accelerated Cycle-Accurate Hardware ...](https://reader030.fdocuments.in/reader030/viewer/2022012715/61ada8f6a6831b63cd36f376/html5/thumbnails/5.jpg)
What can FireSim be used for?
• Evaluating a custom hardware design • For example, building a custom accelerator and running real workloads on it • e.g. “A Hardware Accelerator for Tracing Garbage Collection” from ISCA 2018 and the
Hwacha project (see RV-Summit talk tomorrow)
• Debugging a Chisel design at FPGA-speeds • e.g. FireSim Debugging Docs and DESSERT from FPL 2018
• Rapidly Customizing/Evaluating RISC-V cores, • For example, Rocket Chip and BOOM, provided as default designs in FireSim. • FireSim can run all of SPECInt2017 with reference inputs on Rocket Chip in parallel on
10 cloud-FPGAs in ~1 day • e.g. “Composable Building Blocks to Open up Processor Design” from MICRO 2018
• Simulating thousand-node datacenter-scale systems with full control over the hardware • e.g. FireSim ISCA 2018 Paper • Implicitly includes the previous use case
5
![Page 6: Scalable FPGA-accelerated Cycle-Accurate Hardware ...](https://reader030.fdocuments.in/reader030/viewer/2022012715/61ada8f6a6831b63cd36f376/html5/thumbnails/6.jpg)
How does FireSim simulate a datacenter?
• Model hardware at scale, cycle-accurately: • CPUs down to microarchitecture (automatically transformed from Chisel RTL)
• Fast networks, switches (SW models)
• Novel accelerators (also transformed from Chisel)
• Run real software: • Real OS, networking stack (Linux)
• Real frameworks/applications (not microbenchmarks)
• Be productive/usable: • Run on a commodity platform (Amazon EC2 F1)
• Want to encourage collaboration between systems, architecture: real HW/SW co-design research
6
![Page 7: Scalable FPGA-accelerated Cycle-Accurate Hardware ...](https://reader030.fdocuments.in/reader030/viewer/2022012715/61ada8f6a6831b63cd36f376/html5/thumbnails/7.jpg)
Why do we want an FPGA-based simulator?
7
RTL on FPGA 100 MHz
DRAM
100ns latency
RTL taped-out
1 GHz
DRAM
100ns latency
FPGA Emulation/Prototyping SoC sees 10 cycle DRAM latency
Taped-out Design SoC sees 100 cycle DRAM latency
![Page 8: Scalable FPGA-accelerated Cycle-Accurate Hardware ...](https://reader030.fdocuments.in/reader030/viewer/2022012715/61ada8f6a6831b63cd36f376/html5/thumbnails/8.jpg)
How do we fix this? FAME-1 Xform in FIRRTL w/MIDAS
8
RTL Design
Physical DRAM
100ns
latency
<- Resp Queue
Req Queue ->
DRAM Model
100
cycle latency
Mem Channel
FPGA Fabric
“FAME-1” Transformed RTL Design
![Page 9: Scalable FPGA-accelerated Cycle-Accurate Hardware ...](https://reader030.fdocuments.in/reader030/viewer/2022012715/61ada8f6a6831b63cd36f376/html5/thumbnails/9.jpg)
Comparing existing HW “simulation” systems
9
• Taping-out excels at: • Modeling reality: “single source of truth” • Scalability
• Hardware-accelerated simulators excel at: • Simulation rate • Ability to run real workloads (as fn. of sim rate)
• Software-based simulators excel at: • Ease-of-use • Ease-of-rebuild (time-to-first-cycle) • Commodity host platform • Cost • Introspection
![Page 10: Scalable FPGA-accelerated Cycle-Accurate Hardware ...](https://reader030.fdocuments.in/reader030/viewer/2022012715/61ada8f6a6831b63cd36f376/html5/thumbnails/10.jpg)
Prior work on HW-accelerated simulators: DIABLO
• DIABLO, ASPLOS’15 [4]: • Simulated 3072 servers, 96 ToRs at ~2.7 MHz
• Booted Linux, ran apps like Memcached
• An example of the many projects from RAMP
• Need to hand-write abstract RTL models • Harder than writing “tapeout-ready” RTL
• Need to validate against real HW
• Tied to an expensive custom host-platform • $100k+ host platform, custom built
DIABLO Prototype
10
![Page 11: Scalable FPGA-accelerated Cycle-Accurate Hardware ...](https://reader030.fdocuments.in/reader030/viewer/2022012715/61ada8f6a6831b63cd36f376/html5/thumbnails/11.jpg)
FPGAs in the Cloud
Useful trends throughout the architect’s stack
High-Productivity Hardware Design
Language w/IR
Open, Silicon-Proven SoC Implementations
Open ISA
11
![Page 12: Scalable FPGA-accelerated Cycle-Accurate Hardware ...](https://reader030.fdocuments.in/reader030/viewer/2022012715/61ada8f6a6831b63cd36f376/html5/thumbnails/12.jpg)
FireSim at a high-level
Server Simulations • Or your own custom hardware
• Good fit for the FPGA
• We have tapeout-proven RTL: automatically FAME-1 transform
Network simulation • Or your own custom SW models
• Little parallelism in switch models (e.g. a thread per port)
• Need to coordinate all of our distributed server simulations
• So use CPUs + host network
12
Server Simulation(s)
Server Simulation(s)
Server Simulation(s)
Server Simulation(s)
Server Simulation(s)
Server Simulation(s)
Server Simulation(s)
CPU
Host PCIe
f1.16xlarge
Ho
st E
ther
net
(EC
2 N
etw
ork
)
Server Simulations
Swit
ch M
od
el
![Page 13: Scalable FPGA-accelerated Cycle-Accurate Hardware ...](https://reader030.fdocuments.in/reader030/viewer/2022012715/61ada8f6a6831b63cd36f376/html5/thumbnails/13.jpg)
Now, let’s build a datacenter-scale FireSim simulation!
13
![Page 14: Scalable FPGA-accelerated Cycle-Accurate Hardware ...](https://reader030.fdocuments.in/reader030/viewer/2022012715/61ada8f6a6831b63cd36f376/html5/thumbnails/14.jpg)
14
Modeled System
- 4x RISC-V Rocket Cores @ 3.2 GHz
- 16K I/D L1$
- 256K Shared L2$
- 200 Gb/s Eth. NIC
Resource Util.
- < ¼ of an FPGA
Sim Rate
- N/A
Step 1: Server SoC in RTL
![Page 15: Scalable FPGA-accelerated Cycle-Accurate Hardware ...](https://reader030.fdocuments.in/reader030/viewer/2022012715/61ada8f6a6831b63cd36f376/html5/thumbnails/15.jpg)
15
Modeled System
- 4x RISC-V Rocket Cores @ 3.2 GHz
- 16K I/D L1$
- 256K Shared L2$
- 200 Gb/s Eth. NIC
Resource Util.
- < ¼ of an FPGA
Sim Rate
- N/A
Step 1: Server SoC in RTL
![Page 16: Scalable FPGA-accelerated Cycle-Accurate Hardware ...](https://reader030.fdocuments.in/reader030/viewer/2022012715/61ada8f6a6831b63cd36f376/html5/thumbnails/16.jpg)
16
Modeled System
- 4x RISC-V Rocket Cores @ 3.2 GHz
- 16K I/D L1$
- 256K Shared L2$
- 200 Gb/s Eth. NIC
- 16 GB DDR3
Resource Util.
- < ¼ of an FPGA
- ¼ Mem Chans
Sim Rate
- ~150 MHz
- ~40 MHz (netw)
Step 2: FPGA Simulation of one server blade
![Page 17: Scalable FPGA-accelerated Cycle-Accurate Hardware ...](https://reader030.fdocuments.in/reader030/viewer/2022012715/61ada8f6a6831b63cd36f376/html5/thumbnails/17.jpg)
17
Modeled System
- 4x RISC-V Rocket Cores @ 3.2 GHz
- 16K I/D L1$
- 256K Shared L2$
- 200 Gb/s Eth. NIC
- 16 GB DDR3
Resource Util.
- < ¼ of an FPGA
- ¼ Mem Chans
Sim Rate
- ~150 MHz
- ~40 MHz (netw)
Step 2: FPGA Simulation of one server blade
![Page 18: Scalable FPGA-accelerated Cycle-Accurate Hardware ...](https://reader030.fdocuments.in/reader030/viewer/2022012715/61ada8f6a6831b63cd36f376/html5/thumbnails/18.jpg)
18
Modeled System
- 4 Server Blades
- 16 Cores
- 64 GB DDR3
Resource Util.
- < 1 FPGA
- 4/4 Mem Chans
Sim Rate
- ~14.3 MHz (netw)
Step 3: FPGA Simulation of 4 server blades
Cost: $0.49 per hour (spot) $1.65 per hour (on-demand)
![Page 19: Scalable FPGA-accelerated Cycle-Accurate Hardware ...](https://reader030.fdocuments.in/reader030/viewer/2022012715/61ada8f6a6831b63cd36f376/html5/thumbnails/19.jpg)
19
Modeled System
- 4 Server Blades
- 16 Cores
- 64 GB DDR3
Resource Util.
- < 1 FPGA
- 4/4 Mem Chans
Sim Rate
- ~14.3 MHz (netw)
Step 3: FPGA Simulation of 4 server blades
![Page 20: Scalable FPGA-accelerated Cycle-Accurate Hardware ...](https://reader030.fdocuments.in/reader030/viewer/2022012715/61ada8f6a6831b63cd36f376/html5/thumbnails/20.jpg)
20
Modeled System
- 32 Server Blades
- 128 Cores
- 512 GB DDR3
- 32 Port ToR Switch
- 200 Gb/s, 2us links
Resource Util.
- 8 FPGAs =
- 1x f1.16xlarge
Sim Rate
- ~10.7 MHz (netw)
Step 4: Simulating a 32 node rack
Cost: $2.60 per hour (spot) $13.20 per hour (on-demand)
![Page 21: Scalable FPGA-accelerated Cycle-Accurate Hardware ...](https://reader030.fdocuments.in/reader030/viewer/2022012715/61ada8f6a6831b63cd36f376/html5/thumbnails/21.jpg)
21
Modeled System
- 32 Server Blades
- 128 Cores
- 512 GB DDR3
- 32 Port ToR Switch
- 200 Gb/s, 2us links
Resource Util.
- 8 FPGAs =
- 1x f1.16xlarge
Sim Rate
- ~10.7 MHz (netw)
Step 4: Simulating a 32 node rack
Cost: $2.60 per hour (spot) $13.20 per hour (on-demand)
![Page 22: Scalable FPGA-accelerated Cycle-Accurate Hardware ...](https://reader030.fdocuments.in/reader030/viewer/2022012715/61ada8f6a6831b63cd36f376/html5/thumbnails/22.jpg)
22
Modeled System
- 32 Server Blades
- 128 Cores
- 512 GB DDR3
- 32 Port ToR Switch
- 200 Gb/s, 2us links
Resource Util.
- 8 FPGAs =
- 1x f1.16xlarge
Sim Rate
- ~10.7 MHz (netw)
Step 4: Simulating a 32 node rack
![Page 23: Scalable FPGA-accelerated Cycle-Accurate Hardware ...](https://reader030.fdocuments.in/reader030/viewer/2022012715/61ada8f6a6831b63cd36f376/html5/thumbnails/23.jpg)
23
Modeled System
- 256 Server Blades
- 1024 Cores
- 4 TB DDR3
- 8 ToRs, 1 Aggr
- 200 Gb/s, 2us links
Resource Util.
- 64 FPGAs =
- 8x f1.16xlarge
- 1x m4.16xlarge
Sim Rate
- ~9 MHz (netw)
Step 5: Simulating a 256 node “aggregation pod”
![Page 24: Scalable FPGA-accelerated Cycle-Accurate Hardware ...](https://reader030.fdocuments.in/reader030/viewer/2022012715/61ada8f6a6831b63cd36f376/html5/thumbnails/24.jpg)
24
Modeled System
- 256 Server Blades
- 1024 Cores
- 4 TB DDR3
- 8 ToRs, 1 Aggr
- 200 Gb/s, 2us links
Resource Util.
- 64 FPGAs =
- 8x f1.16xlarge
- 1x m4.16xlarge
Sim Rate
- ~9 MHz (netw)
Step 5: Simulating a 256 node “aggregation pod”
![Page 25: Scalable FPGA-accelerated Cycle-Accurate Hardware ...](https://reader030.fdocuments.in/reader030/viewer/2022012715/61ada8f6a6831b63cd36f376/html5/thumbnails/25.jpg)
25
Modeled System
- 1024 Servers
- 4096 Cores
- 16 TB DDR3
- 32 ToRs, 4 Aggr, 1 Root
- 200 Gb/s, 2us links
Resource Util.
- 256 FPGAs =
- 32x f1.16xlarge
- 5x m4.16xlarge
Sim Rate
- ~6.6 MHz (netw)
Step 6: Simulating a 1024 node datacenter
![Page 26: Scalable FPGA-accelerated Cycle-Accurate Hardware ...](https://reader030.fdocuments.in/reader030/viewer/2022012715/61ada8f6a6831b63cd36f376/html5/thumbnails/26.jpg)
26
Modeled System
- 1024 Servers
- 4096 Cores
- 16 TB DDR3
- 32 ToRs, 4 Aggr, 1 Root
- 200 Gb/s, 2us links
Resource Util.
- 256 FPGAs =
- 32x f1.16xlarge
- 5x m4.16xlarge
Sim Rate
- ~6.6 MHz (netw)
Step 6: Simulating a 1024 node datacenter
Harnesses millions of dollars of FPGAs to simulate 1024 nodes cycle-exactly
with a cycle-accurate network simulation and global synchronization
at a cost-to-user of only 100s of dollars/hour
![Page 27: Scalable FPGA-accelerated Cycle-Accurate Hardware ...](https://reader030.fdocuments.in/reader030/viewer/2022012715/61ada8f6a6831b63cd36f376/html5/thumbnails/27.jpg)
Open-source: Not just datacenter simulation
• An “easy” button for fast, FPGA-accelerated full-system simulation • Replace network endpoints with your own Chisel
designs
• One-click: Parallel FPGA builds, Simulation run/result collection, building target software
• Scales to a variety of use cases: • Networked (performance depends on scale)
• Non-networked (150+ MHz), limited by your budget
• firesim command line program • Like docker or vagrant, but for FPGA sims
• User doesn’t need to care about distributed magic happening behind the scenes
27 FireSim Developer Environment
![Page 28: Scalable FPGA-accelerated Cycle-Accurate Hardware ...](https://reader030.fdocuments.in/reader030/viewer/2022012715/61ada8f6a6831b63cd36f376/html5/thumbnails/28.jpg)
Open-source: Not just datacenter simulation
• Scripts can call firesim to fully automate distributed FPGA sim • Reproducibility: included scripts to
reproduce ISCA 2018 results
• e.g. scripts to automatically run SPECInt2017 reference inputs in ≈1 day
• Many others
• 100+ pages of documentation: https://docs.fires.im
• AWS provides grants for researchers: https://aws.amazon.com/grants/
28
$ cd fsim/deploy/workloads
$ ./run-all.sh
![Page 29: Scalable FPGA-accelerated Cycle-Accurate Hardware ...](https://reader030.fdocuments.in/reader030/viewer/2022012715/61ada8f6a6831b63cd36f376/html5/thumbnails/29.jpg)
Latest Updates
• Growing ecosystem!
• BOOM (Out-of-order RISC-V core) now available in FireSim
• Projects publicly releasing FireSim images at RISC-V Summit: • Hwacha vector accelerator • Keystone Secure Enclave
• Berkeley IceNet (photonic DC network) modeling
• New debugging features • Auto-ILA: Annotate Chisel and get an
ILA automatically wired-up in FireSim • Tracer widgets: Collect live instruction
traces from FireSim sims • Integrating DESSERT [8]
• Assertion Synthesis now on master
• First Academic User papers: • ISCA ‘18: Maas et. al. “A Hardware
Accelerator for Tracing Garbage Collection” (Berkeley)
• MICRO ‘18: Zhang et. al. “Composable Building Blocks to Open up Processor Design” (MIT)
29
![Page 30: Scalable FPGA-accelerated Cycle-Accurate Hardware ...](https://reader030.fdocuments.in/reader030/viewer/2022012715/61ada8f6a6831b63cd36f376/html5/thumbnails/30.jpg)
Wrapping Up
• We can prototype scalable-systems built on arbitrary RTL at unprecedented scale
+ Mix software models when desired
• Simulation is automatically built and deployed
• Automatically deploy real workloads and collect results
• Open-source, runs on Amazon EC2 F1, no capex
30
SoC RTL
Other RTL
Network Topology
Automatically deployed, high-performance,
distributed simulation
SW Models
Full Work-load
![Page 31: Scalable FPGA-accelerated Cycle-Accurate Hardware ...](https://reader030.fdocuments.in/reader030/viewer/2022012715/61ada8f6a6831b63cd36f376/html5/thumbnails/31.jpg)
Administrivia
• Everything we’re going to show you today is documented in excruciating detail at http://docs.fires.im/
• Use these slides as a high-level view of what’s possible
• We’ll also put the slides/videos online
31
![Page 32: Scalable FPGA-accelerated Cycle-Accurate Hardware ...](https://reader030.fdocuments.in/reader030/viewer/2022012715/61ada8f6a6831b63cd36f376/html5/thumbnails/32.jpg)
Today’s Agenda
• Intro, what is FireSim capable of? • (Sagar, 10 mins)
• Building/Deploying FireSim Simulations • (Sagar, 15 mins)
• Using FireSim to simulate your own custom HW • (David, 15 mins)
• Testing and Debugging your design in FireSim • (Alon, 15 mins)
• Community/Contributing/Conclusion • (David, 10 mins)
32
☑ ️
![Page 33: Scalable FPGA-accelerated Cycle-Accurate Hardware ...](https://reader030.fdocuments.in/reader030/viewer/2022012715/61ada8f6a6831b63cd36f376/html5/thumbnails/33.jpg)
Learn More: Web: https://fires.im GitHub: https://github.com/firesim @firesimproject
ISCA’18 Paper: sagark.org/assets/pubs/firesim-isca2018.pdf
The information, data, or work presented herein was funded in part by the Advanced Research Projects Agency-Energy (ARPA-E), U.S. Department of Energy, under Award Number DE-AR0000849. Research was partially funded by ADEPT Lab industrial sponsor Intel, RISE Lab sponsor Amazon Web Services, and ADEPT Lab affiliates Google, Huawei, Siemens, SK Hynix, and Seagate. The views and opinions of authors expressed herein do not necessarily state or reflect those of the United States Government or any agency thereof.
Questions?
![Page 34: Scalable FPGA-accelerated Cycle-Accurate Hardware ...](https://reader030.fdocuments.in/reader030/viewer/2022012715/61ada8f6a6831b63cd36f376/html5/thumbnails/34.jpg)
References
[1] Peter X. Gao, Akshay Narayan, Sagar Karandikar, Joao Carreira, Sangjin Han, Rachit Agarwal, Sylvia Ratnasamy, and Scott Shenker. 2016. Network requirements for resource disaggregation. OSDI'16 [2] Y. Lee et al., "An Agile Approach to Building RISC-V Microprocessors," in IEEE Micro, vol. 36, no. 2, pp. 8-20, Mar.-Apr. 2016. [3] Jacob Leverich and Christos Kozyrakis. Reconciling high server utilization and sub-millisecond quality-of-service. EuroSys '14 [4] Zhangxi Tan, Zhenghao Qian, Xi Chen, Krste Asanovic, and David Patterson. DIABLO: A Warehouse-Scale Computer Network Simulator using FPGAs. ASPLOS '15 [5] Tan, Z., Waterman, A., Cook, H., Bird, S., Asanović, K., & Patterson, D. A case for FAME: FPGA architecture model execution. ISCA ’10 [6] Evaluation of RISC-V RTL Designs with FPGA Simulation. Donggyu Kim, Christopher Celio, David Biancolin, Jonathan Bachrach and Krste Asanovic. CARRV '17. [7] Donggyu Kim, Adam Izraelevitz, Christopher Celio, Hokeun Kim, Brian Zimmer, Yunsup Lee, Jonathan Bachrach, and Krste Asanović. Strober: fast and accurate sample-based energy simulation for arbitrary RTL. ISCA ‘16 [8] Donggyu Kim, Christopher Celio, Sagar Karandikar, David Biancolin, Jonathan Bachrach, and Krste Asanović, “DESSERT: Debugging RTL Effectively with State Snapshotting for Error Replays across Trillions of cycles”, FPL 2018
34
![Page 35: Scalable FPGA-accelerated Cycle-Accurate Hardware ...](https://reader030.fdocuments.in/reader030/viewer/2022012715/61ada8f6a6831b63cd36f376/html5/thumbnails/35.jpg)
Building/Deploying Simulations
FireSim Tutorial
RISC-V Summit 2018 Tutorial
Speaker: Sagar Karandikar
![Page 36: Scalable FPGA-accelerated Cycle-Accurate Hardware ...](https://reader030.fdocuments.in/reader030/viewer/2022012715/61ada8f6a6831b63cd36f376/html5/thumbnails/36.jpg)
What will we cover?
•Building FireSim FPGA images for a set of targets
•Choosing targets at runtime
•Configuring target software
•Managing EC2 infrastructure for simulations
•Running a simulation
•Customizing workloads: SPEC example
36
![Page 37: Scalable FPGA-accelerated Cycle-Accurate Hardware ...](https://reader030.fdocuments.in/reader030/viewer/2022012715/61ada8f6a6831b63cd36f376/html5/thumbnails/37.jpg)
Background Terminology
37
“AGFI”: FPGA Bitstream for F1
FPGAs
![Page 38: Scalable FPGA-accelerated Cycle-Accurate Hardware ...](https://reader030.fdocuments.in/reader030/viewer/2022012715/61ada8f6a6831b63cd36f376/html5/thumbnails/38.jpg)
First-time User Setup
•Won’t discuss this today, but we’ve documented the process of starting with AWS/FireSim from scratch here: http://docs.fires.im/en/latest/Initial-Setup/index.html • For the following walkthrough, we assume that you’ve
setup a Manager instance and cloned firesim into ~/firesim
38
![Page 39: Scalable FPGA-accelerated Cycle-Accurate Hardware ...](https://reader030.fdocuments.in/reader030/viewer/2022012715/61ada8f6a6831b63cd36f376/html5/thumbnails/39.jpg)
Let’s simulate the following system:
Target Design:
• One Rocket Chip-based node (RTL) • 4 Rocket Cores
• 16K L1 I$, D$
• Block Device, UART, Serial Adapter
• No NIC
• 4 MB LLC (RTL Model)
• DDR3 Memory System
• No network
• Boot vanilla buildroot-Linux distro
Host Resources:
• One manager instance (c4.4xlarge)
• One F1 instance (f1.2xlarge, single FPGA)
39
☑ ️
![Page 40: Scalable FPGA-accelerated Cycle-Accurate Hardware ...](https://reader030.fdocuments.in/reader030/viewer/2022012715/61ada8f6a6831b63cd36f376/html5/thumbnails/40.jpg)
Using the firesim manager command line
From inside ~/firesim, source sourceme-f1-manager.sh
This properly sets up your environment and puts firesim on your $PATH
This means that we can now call firesim from anywhere on the instance. It will always run from the directory:
your_firesim_clone/deploy/
in this case:
~/firesim/deploy/
40
![Page 41: Scalable FPGA-accelerated Cycle-Accurate Hardware ...](https://reader030.fdocuments.in/reader030/viewer/2022012715/61ada8f6a6831b63cd36f376/html5/thumbnails/41.jpg)
Configuring the Manager. 4 files in firesim/deploy/
41
config_build_recipes.ini config_build.ini config_hwdb.ini config_runtime.ini
![Page 42: Scalable FPGA-accelerated Cycle-Accurate Hardware ...](https://reader030.fdocuments.in/reader030/viewer/2022012715/61ada8f6a6831b63cd36f376/html5/thumbnails/42.jpg)
Setting up the manager to build FPGA images
42
config_build_recipes.ini config_build.ini config_hwdb.ini config_runtime.ini
![Page 43: Scalable FPGA-accelerated Cycle-Accurate Hardware ...](https://reader030.fdocuments.in/reader030/viewer/2022012715/61ada8f6a6831b63cd36f376/html5/thumbnails/43.jpg)
Running builds
• Once we’ve configured what we want to build, let’s build it
$ firesim buildafi
• This completely automates the process. For each design, in-parallel: • Launch a build instance (c4.4xlarge) • Run FireSim generator (Chisel/FIRRTL/MIDAS) • Ship infrastructure to build instances, run Vivado FPGA
builds in parallel • Collect results back onto manager instance
• ~/firesim/deploy/results-workload/TIMESTAMP-config/ contains final results + QoR information + dcps to open in Vivado for further manual tuning
• Email you the entry to put into config_hwdb.ini • Terminate the build instance
43
![Page 44: Scalable FPGA-accelerated Cycle-Accurate Hardware ...](https://reader030.fdocuments.in/reader030/viewer/2022012715/61ada8f6a6831b63cd36f376/html5/thumbnails/44.jpg)
Status - We have FPGA images for the simulator of our design
Target Design:
• One Rocket Chip-based node (RTL) • 4 Rocket Cores
• 16K L1 I$, D$
• Block Device, UART, Serial Adapter
• No NIC
• 4 MB LLC (RTL Model)
• DDR3 Memory System
• No network
• Boot vanilla buildroot-Linux distro
Host Resources:
• One manager instance (c4.4xlarge)
• One F1 instance (f1.2xlarge, single FPGA)
44
☑ ️
☑ ️☑ ️☑ ️
☑ ️
![Page 45: Scalable FPGA-accelerated Cycle-Accurate Hardware ...](https://reader030.fdocuments.in/reader030/viewer/2022012715/61ada8f6a6831b63cd36f376/html5/thumbnails/45.jpg)
Now, let’s build our target-software
• We need: • bbl + vmlinux image: Berkeley BootLoader + Linux kernel image as payload • Root Filesystem disk image
• FireSim provides two levels of workload automation: • firesim-software repo for automatically building base-images for various
distros (e.g. buildroot or Fedora) that are compatible with Rocket / BOOM cd firesim/sw/firesim-software
./sw-manager.py -c br-disk.json build # simple buildroot distro
This lets us use the linux-uniform.json workload, which we will see later
• firesim manager / workload generation tool can bake custom workloads into rootfses • We’ll get to this later
45
![Page 46: Scalable FPGA-accelerated Cycle-Accurate Hardware ...](https://reader030.fdocuments.in/reader030/viewer/2022012715/61ada8f6a6831b63cd36f376/html5/thumbnails/46.jpg)
Status
Target Design:
• One Rocket Chip-based node (RTL) • 4 Rocket Cores
• 16K L1 I$, D$
• Block Device, UART, Serial Adapter
• No NIC
• 4 MB LLC (RTL Model)
• DDR3 Memory System
• No network
• Boot vanilla buildroot-Linux distro
Host Resources:
• One manager instance (c4.4xlarge)
• One F1 instance (f1.2xlarge, single FPGA)
46
☑ ️
☑ ️☑ ️☑ ️☑ ️
☑ ️
![Page 47: Scalable FPGA-accelerated Cycle-Accurate Hardware ...](https://reader030.fdocuments.in/reader030/viewer/2022012715/61ada8f6a6831b63cd36f376/html5/thumbnails/47.jpg)
Setting up the manager’s runtime configuration
47
config_build_recipes.ini config_build.ini config_hwdb.ini config_runtime.ini
![Page 48: Scalable FPGA-accelerated Cycle-Accurate Hardware ...](https://reader030.fdocuments.in/reader030/viewer/2022012715/61ada8f6a6831b63cd36f376/html5/thumbnails/48.jpg)
Setting up the manager’s runtime configuration
48
config_build_recipes.ini config_build.ini config_hwdb.ini config_runtime.ini
![Page 49: Scalable FPGA-accelerated Cycle-Accurate Hardware ...](https://reader030.fdocuments.in/reader030/viewer/2022012715/61ada8f6a6831b63cd36f376/html5/thumbnails/49.jpg)
Launching Simulation Instances
• Once we’ve configured what we want to simulate and what infrastructure we require, let’s launch our simulation (F1) instances
49
$ firesim launchrunfarm
FireSim Manager. Docs: http://docs.fires.im
Running: launchrunfarm
Waiting for instance boots: f1.16xlarges
Waiting for instance boots: m4.16xlarges
Waiting for instance boots: f1.2xlarges
i-0d6c29ac507139163 booted!
The full log of this run is: /home/centos/firesim-new/deploy/logs/2018-05-19--00-19-43-launchrunfarm-B4Q2ROAK.log
![Page 50: Scalable FPGA-accelerated Cycle-Accurate Hardware ...](https://reader030.fdocuments.in/reader030/viewer/2022012715/61ada8f6a6831b63cd36f376/html5/thumbnails/50.jpg)
Status
Target Design:
• One Rocket Chip-based node (RTL) • 4 Rocket Cores
• 16K L1 I$, D$
• Block Device, UART, Serial Adapter
• No NIC
• 4 MB LLC (RTL Model)
• DDR3 Memory System
• No network
• Boot vanilla buildroot-Linux distro
Host Resources:
• One manager instance (c4.4xlarge)
• One F1 instance (f1.2xlarge, single FPGA)
50
☑ ️
☑ ️☑ ️☑ ️☑ ️
☑ ️☑ ️
![Page 51: Scalable FPGA-accelerated Cycle-Accurate Hardware ...](https://reader030.fdocuments.in/reader030/viewer/2022012715/61ada8f6a6831b63cd36f376/html5/thumbnails/51.jpg)
Distributing Simulation Infrastructure to the Run Farm • Next, we need to setup our run farm instance with simulation infrastructure
$ firesim infrasetup FireSim Manager. Docs: http://docs.fires.im
Running: infrasetup
Building FPGA software driver for FireSimNoNIC-FireSimRocketChipQuadCoreConfig-FireSimDDR3FRFCFSLLC4MBConfig
[172.30.2.254] Executing task 'instance_liveness'
[172.30.2.254] Checking if host instance is up...
[172.30.2.254] Executing task 'infrasetup_node_wrapper'
[172.30.2.254] Copying FPGA simulation infrastructure for slot: 0.
[172.30.2.254] Installing AWS FPGA SDK on remote nodes. Upstream hash: 2fdf23ffad944cb94f98d09eed0f34c220c522fe
[172.30.2.254] Unloading EDMA Driver Kernel Module.
[172.30.2.254] Copying AWS FPGA EDMA driver to remote node.
[172.30.2.254] Clearing FPGA Slot 0.
[172.30.2.254] Checking for Cleared FPGA Slot 0.
[172.30.2.254] Flashing FPGA Slot: 0 with agfi: agfi-06b9b705ab9af1238.
[172.30.2.254] Checking for Flashed FPGA Slot: 0 with agfi: agfi-06b9b705ab9af1238.
[172.30.2.254] Loading EDMA Driver Kernel Module.
[172.30.2.254] Starting Vivado hw_server.
[172.30.2.254] Starting Vivado virtual JTAG.
The full log of this run is:
/home/centos/firesim/deploy/logs/2018-11-14--16-42-35-infrasetup-CHLESMNB83XFPZUR.log
51
![Page 52: Scalable FPGA-accelerated Cycle-Accurate Hardware ...](https://reader030.fdocuments.in/reader030/viewer/2022012715/61ada8f6a6831b63cd36f376/html5/thumbnails/52.jpg)
Let’s run a simulation!
$ firesim runworkload
FireSim Manager. Docs: http://docs.fires.im
Running: runworkload
Creating the directory: /home/centos/firesim/deploy/results-workload/2018-11-14--16-47-17-linux-uniform/
[172.30.2.254] Executing task 'instance_liveness'
[172.30.2.254] Checking if host instance is up...
[172.30.2.254] Executing task 'boot_switch_wrapper'
[172.30.2.254] Executing task 'boot_simulation_wrapper'
[172.30.2.254] Starting FPGA simulation for slot: 0.
52
![Page 53: Scalable FPGA-accelerated Cycle-Accurate Hardware ...](https://reader030.fdocuments.in/reader030/viewer/2022012715/61ada8f6a6831b63cd36f376/html5/thumbnails/53.jpg)
Monitoring
53
![Page 54: Scalable FPGA-accelerated Cycle-Accurate Hardware ...](https://reader030.fdocuments.in/reader030/viewer/2022012715/61ada8f6a6831b63cd36f376/html5/thumbnails/54.jpg)
Manually Interacting with simulations
54
$ ssh 172.30.2.254
$ screen –r fsim0
![Page 55: Scalable FPGA-accelerated Cycle-Accurate Hardware ...](https://reader030.fdocuments.in/reader030/viewer/2022012715/61ada8f6a6831b63cd36f376/html5/thumbnails/55.jpg)
Shutting down simulations/Capturing results
• Frequently, we want to capture results from simulations automatically
• FireSim’s workload system supports this! • More on how to specify what to capture in a few mins
• But let’s continue with our example – it’s configured just to capture each simulated system’s console output, once the simulation is powered off. If we do so, the manager will produce:
...
FireSim Simulation Exited Successfully. See results in:
/home/centos/firesim/deploy/results-workload/2018-11-14--16-52-49-linux-
uniform/
The full log of this run is:
/home/centos/firesim/deploy/logs/2018-11-14--16-52-49-runworkload-
HSMYYMD4JBA6BHT5.log
55
![Page 56: Scalable FPGA-accelerated Cycle-Accurate Hardware ...](https://reader030.fdocuments.in/reader030/viewer/2022012715/61ada8f6a6831b63cd36f376/html5/thumbnails/56.jpg)
Finally, get rid of our run farm EC2 instances
• Easy!
$ firesim terminaterunfarm
FireSim Manager. Docs: http://docs.fires.im
Running: terminaterunfarm
IMPORTANT!: This will terminate the following instances:
f1.16xlarges
[]
m4.16xlarges
[]
f1.2xlarges
['i-033959e7513fcf928']
Type yes, then press enter, to continue. Otherwise, the operation will be cancelled.
yes
Instances terminated. Please confirm in your AWS Management Console.
The full log of this run is:
/home/centos/firesim/deploy/logs/2018-11-14--17-00-01-terminaterunfarm-RGWY68L5ICAYQTA3.log
56
![Page 57: Scalable FPGA-accelerated Cycle-Accurate Hardware ...](https://reader030.fdocuments.in/reader030/viewer/2022012715/61ada8f6a6831b63cd36f376/html5/thumbnails/57.jpg)
Custom Workloads: linux-uniform workload
• Previously, we relied on linux-uniform.json as our workload
• These jsons live in firesim/deploy/workloads/
57
![Page 58: Scalable FPGA-accelerated Cycle-Accurate Hardware ...](https://reader030.fdocuments.in/reader030/viewer/2022012715/61ada8f6a6831b63cd36f376/html5/thumbnails/58.jpg)
Let’s look at a more interesting example: SPEC
58
• JSONs are used for two things: • Building rootfs/binary
combos automatically, with benchmark infrastructure built-in
• Deploying simulations with the manager (like we saw previously) and knowing which results to collect at the end
![Page 59: Scalable FPGA-accelerated Cycle-Accurate Hardware ...](https://reader030.fdocuments.in/reader030/viewer/2022012715/61ada8f6a6831b63cd36f376/html5/thumbnails/59.jpg)
Building/Deploying SPEC on Rocket/BOOM
• Building: We don’t have enough time to go into detail here. Look at the spec17-% target in firesim/deploy/workloads/Makefile
• At a high-level, you can just run make spec17-intrate, and you’ll get 10 rootfs/linux image combos that will automatically run the spec benchmarks in parallel
• Deploying: Set your workload to spec17-intrate.json in config_runtime.ini, set the # f1.2xlarges to 10, select the hardware config you want to benchmark, then firesim launchrunfarm/infrasetup/runworkload as usual
• At the end, all your performance results live in one directory on the manager: firesim/deploy/results-workload/TIMESTAMP-spec17-intrate-HASH/
• Instances get automatically terminated one-by-one as benchmarks complete – essentially zero cost to running in parallel on EC2, since you pay by the machine-second anyway
59
![Page 60: Scalable FPGA-accelerated Cycle-Accurate Hardware ...](https://reader030.fdocuments.in/reader030/viewer/2022012715/61ada8f6a6831b63cd36f376/html5/thumbnails/60.jpg)
Summary
• Don’t fret if you didn’t catch everything, everything we showed you today is documented in excruciating detail at http://docs.fires.im
• We learned how to: • Build FireSim FPGA images for a set of targets
• http://docs.fires.im/en/latest/Building-a-FireSim-AFI.html
• Setup/Launch a simulation, including choosing targets, configuring target software, and managing EC2 infrastructure • http://docs.fires.im/en/latest/Running-Simulations-Tutorial/Running-a-Single-Node-
Simulation.html
• Customize workloads: SPEC example • http://docs.fires.im/en/latest/Advanced-Usage/Workloads/Defining-Custom-
Workloads.html • http://docs.fires.im/en/latest/Advanced-Usage/Workloads/SPEC-2017.html
60
![Page 61: Scalable FPGA-accelerated Cycle-Accurate Hardware ...](https://reader030.fdocuments.in/reader030/viewer/2022012715/61ada8f6a6831b63cd36f376/html5/thumbnails/61.jpg)
Modifying the Target Design
FireSim Tutorial RISC-V Summit 2018
Speaker: David Biancolin
![Page 62: Scalable FPGA-accelerated Cycle-Accurate Hardware ...](https://reader030.fdocuments.in/reader030/viewer/2022012715/61ada8f6a6831b63cd36f376/html5/thumbnails/62.jpg)
Modifying the Target Design
In this section:
• Explain how the target is modeled in FireSim
• Discuss how to add and modify each class of model
Code references apply to FireSim v.1.4.0
Relevant FireSim docs section: “Targets”
https://docs.fires.im/en/latest/Advanced-Usage/Generating-Different-Targets.html
62
![Page 63: Scalable FPGA-accelerated Cycle-Accurate Hardware ...](https://reader030.fdocuments.in/reader030/viewer/2022012715/61ada8f6a6831b63cd36f376/html5/thumbnails/63.jpg)
Important Submodules & Aliases
Submodules:
MIDAS: sim/midas
FIRRTL: sim/firrtl
Firechip: target-design/firechip
Rocket-Chip: target-design/firechip/rocket-chip
Chisel: target-design/firechip/rocket-chip/chisel
Aliases for this presentation:
FSCALA -> sim/src/main/scala
FCPP -> sim/src/main/cc/
MSCALA -> {in MIDAS}/src/main/scala/midas
MCPP -> {in MIDAS}/src/main/cc/
63
![Page 64: Scalable FPGA-accelerated Cycle-Accurate Hardware ...](https://reader030.fdocuments.in/reader030/viewer/2022012715/61ada8f6a6831b63cd36f376/html5/thumbnails/64.jpg)
• Represent the target as a synchronous-dataflow graph
• FireSim stitches together a larger graph to model a datacenter
Target as a Dataflow Graph
64
SW Model To be hosted on a CPU
Transformed RTL Model Derived from SoC RTL
To be hosted on an FPGA
Abstract RTL Model Cannot be taped out
To be hosted on an FPGA
![Page 65: Scalable FPGA-accelerated Cycle-Accurate Hardware ...](https://reader030.fdocuments.in/reader030/viewer/2022012715/61ada8f6a6831b63cd36f376/html5/thumbnails/65.jpg)
Expressing the Target Graph in FireSim
• Target graph is implicit, currently not encoded in one place.
• There is only one transformed-RTL model • The CHIRRTL you passed to midas.Compiler
• In Simulation Mapping • 2+ entry queue channels are instantiated on Decoupled IO
• 1-cycle pipe channels are used everywhere else.
• SW and Custom RTL models are bound to IO bundles using midas.core.Endpoint
• Matches on a IO type, generates a model or endpoint
65
![Page 66: Scalable FPGA-accelerated Cycle-Accurate Hardware ...](https://reader030.fdocuments.in/reader030/viewer/2022012715/61ada8f6a6831b63cd36f376/html5/thumbnails/66.jpg)
Parameterizing the RTL-Transformed Model
• FireSim provides four base Rocket-Chip “cakes” that form the base RTL-transformed model • DESIGN make variable
Two types of parameters:
1. Target: (see TargetConfigs.scala) Configure the generated RTL model • # of Cores, L1 $ Sizes, # of memory channels • TARGET_CONFIG make variable
2. Sim/Platform: (see SimConfigs.scala) Configure MIDAS, SW/Custom RTL model generation
• Configs for endpoints, Type of DRAM model, enable debug features etc… • PLATFORM_CONFIG make variable
66
![Page 67: Scalable FPGA-accelerated Cycle-Accurate Hardware ...](https://reader030.fdocuments.in/reader030/viewer/2022012715/61ada8f6a6831b63cd36f376/html5/thumbnails/67.jpg)
Swapping out the RTL-Transformed Model
Two common approaches, based on the type of Target:
1. Project-template-based Rocket-chip targets (FireChip) • Point at the FireChip submodule at your fork
NB: MIDAS depends on Rocket-Chip; Rocket-Chip provides Chisel submodule
2. Everything else (ex. Your custom CGRA etc..) • Add scala sources for your target (modify sim/build.sbt)
• Define an App (Generator) that calls midas.Compiler(…)
• Add a target-specific makefrag (ex. sim/src/main/makefrag/firesim/Makefrag)
See the MIDAS examples for an simple demonstration of this.
67
![Page 68: Scalable FPGA-accelerated Cycle-Accurate Hardware ...](https://reader030.fdocuments.in/reader030/viewer/2022012715/61ada8f6a6831b63cd36f376/html5/thumbnails/68.jpg)
On Abstract RTL & SW Models.
FireSim & MIDAS provide the following models.
Abstract-RTL :
1. DefaultIOModel ($MSCALA/widgets/PeekPokeIO.scala) • Bound to I/O unbound by custom endpoints
2. DRAM & LLC model ($MSCALA/models/dram/MidasMemModel.scala) • Bound to AXI4 slave interfaces
SW-models:
1. NIC
2. SerialIO
3. Block-device
4. UART
These are all FireSim provided: See $FSCALA/endpoints and $FCPP/endpoints
68
![Page 69: Scalable FPGA-accelerated Cycle-Accurate Hardware ...](https://reader030.fdocuments.in/reader030/viewer/2022012715/61ada8f6a6831b63cd36f376/html5/thumbnails/69.jpg)
Compile-time vs Runtime Configuration
Software models and custom RTL models are configured twice:
1. Compile/Generation time
2. Runtime
On Compile-time configuration:
• RTL-generation for Custom-RTL models (FPGA resynthesis required) • How: using different a PLATFORM_CONFIG; hacking on the Chisel sources
• CPP compilation for SW models (no FPGA resynthesis) • How: changing model & driver CPP sources.
69
![Page 70: Scalable FPGA-accelerated Cycle-Accurate Hardware ...](https://reader030.fdocuments.in/reader030/viewer/2022012715/61ada8f6a6831b63cd36f376/html5/thumbnails/70.jpg)
Runtime Configuration
• Pass plus args to the simulator • SW models: does what you expect
• Custom-RTL models: driver writes to model’s configuration registers
• Ex. +blkdev_write_latency,
• How: Modify deploy/config_runtime.ini • Change ini-exposed variables
• Specify a customruntimeconfig in deploy/config_hwdb.ini
70
![Page 71: Scalable FPGA-accelerated Cycle-Accurate Hardware ...](https://reader030.fdocuments.in/reader030/viewer/2022012715/61ada8f6a6831b63cd36f376/html5/thumbnails/71.jpg)
Runtime Configuration – config_runtime.ini
71
![Page 72: Scalable FPGA-accelerated Cycle-Accurate Hardware ...](https://reader030.fdocuments.in/reader030/viewer/2022012715/61ada8f6a6831b63cd36f376/html5/thumbnails/72.jpg)
HWDB entry: None -> Uses a default runtime.conf emitted at generation time
Runtime.conf -> plusArgs you want to pass to the simulator
Runtime Configuration – config_hwdb.ini
72
![Page 73: Scalable FPGA-accelerated Cycle-Accurate Hardware ...](https://reader030.fdocuments.in/reader030/viewer/2022012715/61ada8f6a6831b63cd36f376/html5/thumbnails/73.jpg)
Adding a Custom Model – Chisel-side
1. Expose a Record/Bundle in your Target RTL’s IO with a specific type
2. Extend midas.Endpoint to: 1. Match on the correct Chisel type
• Implement Endpoint.matchType(Data)=> Boolean
2. Generate your SW model’s endpoint, or custom RTL-model Implement • Implement Endpoint.widget(Parameters)=> EndpointWidget
3. Add your endpoint to the EndpointMap Field in your PLATFORM_CONFIG
See:
• $FSCALA/endpoints/UARTWidget.scala
• $FSCALA/endpoints/BlockDevWidget.scala
• $FSCALA/firesim/SimConfigs.scala
73
![Page 74: Scalable FPGA-accelerated Cycle-Accurate Hardware ...](https://reader030.fdocuments.in/reader030/viewer/2022012715/61ada8f6a6831b63cd36f376/html5/thumbnails/74.jpg)
Adding a Custom Model – Driver-side
1. Define an endpoint driver class • Use simif::{read, write, push, pull} to interact with FPGA-hosted endpoint
2. Register the endpoint driver in firesim_top.cc
See:
• $FCPP/endpoints/uart.{cc, h}
• $FCPP/endpoints/blockdev.{cc, h}
• $FCPP/firesim/firesim_top.cc
74
![Page 75: Scalable FPGA-accelerated Cycle-Accurate Hardware ...](https://reader030.fdocuments.in/reader030/viewer/2022012715/61ada8f6a6831b63cd36f376/html5/thumbnails/75.jpg)
DRAM Timing Models – Generation Time Conf
• Parameterized using the PLATFORM_CONFIG make variable • See $FSCALA/firesim/SimConfigs.scala
• Abstract timing models: • Latency-Bandwidth Pipe • Bank Conflict
• DDR3 Timing models: • First-Come First-Served (FCFS) • First-Ready, First-Come First-Served (FR-FCFS)
• All timing models can be composed with a LLC model.
• To be presented at FPGA2019
75
![Page 76: Scalable FPGA-accelerated Cycle-Accurate Hardware ...](https://reader030.fdocuments.in/reader030/viewer/2022012715/61ada8f6a6831b63cd36f376/html5/thumbnails/76.jpg)
DRAM Timing Models – Runtime Conf
• DDR timing models expose >30 runtime arguments • Many illegal settings, need to validate against generated model instance
• Ex. Setting that would overflow a timing register.
• Many invalid settings, need to validate against DDR3 part database • Ex. A non-standard JEDEC DDR3 timing, timings are density dependent
• A default runtime-configuration is emitted at generation time • See generated-src/f1/${TARGET_TUPLE}/runtime.conf
• To generate a custom runtime-config: • In sim/: make conf
76
![Page 77: Scalable FPGA-accelerated Cycle-Accurate Hardware ...](https://reader030.fdocuments.in/reader030/viewer/2022012715/61ada8f6a6831b63cd36f376/html5/thumbnails/77.jpg)
DRAM Timing Models – custom runtime.conf
77
![Page 78: Scalable FPGA-accelerated Cycle-Accurate Hardware ...](https://reader030.fdocuments.in/reader030/viewer/2022012715/61ada8f6a6831b63cd36f376/html5/thumbnails/78.jpg)
Debugging, Testing, and Profiling
FireSim Tutorial RISC-V Summit 2018 Speaker: Alon Amid
![Page 79: Scalable FPGA-accelerated Cycle-Accurate Hardware ...](https://reader030.fdocuments.in/reader030/viewer/2022012715/61ada8f6a6831b63cd36f376/html5/thumbnails/79.jpg)
Agenda
• FireSim Debugging Using Software Simulation
• FireSim Debugging Using Integrated Logic Analyzers
• Advanced FPGA-Based Debugging and Profiling Features
• The FireSim Vision for Debugging and Profiling
79
![Page 80: Scalable FPGA-accelerated Cycle-Accurate Hardware ...](https://reader030.fdocuments.in/reader030/viewer/2022012715/61ada8f6a6831b63cd36f376/html5/thumbnails/80.jpg)
Debugging Using Software Simulation
80
Modifying internal simulated target hardware, no new external endpoints
Target-Level SW Simulation
What Am I
doing?
Simulator-Level SW Simulation
Adding/Modifying new interfaces and endpoints,
modifying simulation models
Midas-Level SW Simulation
FPGA-Level SW Simulation
My FireSim Simulation Is Not Working
![Page 81: Scalable FPGA-accelerated Cycle-Accurate Hardware ...](https://reader030.fdocuments.in/reader030/viewer/2022012715/61ada8f6a6831b63cd36f376/html5/thumbnails/81.jpg)
Debugging Using Software Simulation
81
Target-Level Simulation
• Software Simulation
• Target Design Untransformed
• No Host-FPGA interfaces
MIDAS-Level Simulation
• Software Simulation
• Target Design Transformed by MIDAS
• Host-FPGA interfaces/shell emulated using abstract models
FPGA-Level Simulation
• Software Simulation
• Target Design Transformed by MIDAS
• Host-FPGA interfaces/shell simulated by the FPGA tools
![Page 82: Scalable FPGA-accelerated Cycle-Accurate Hardware ...](https://reader030.fdocuments.in/reader030/viewer/2022012715/61ada8f6a6831b63cd36f376/html5/thumbnails/82.jpg)
82
Remember this slide from
Sagar’s presentation?
Debugging Using Software Simulation
![Page 83: Scalable FPGA-accelerated Cycle-Accurate Hardware ...](https://reader030.fdocuments.in/reader030/viewer/2022012715/61ada8f6a6831b63cd36f376/html5/thumbnails/83.jpg)
83
RTL Design
Physical DRAM
100ns
latency
<- Resp Queue
Req Queue ->
DRAM Model
100
cycle latency
Mem Channel
“FAME-1” Transformed RTL Design
Target-Level SW Simulation
FPGA Fabric
Debugging Using Software Simulation
![Page 84: Scalable FPGA-accelerated Cycle-Accurate Hardware ...](https://reader030.fdocuments.in/reader030/viewer/2022012715/61ada8f6a6831b63cd36f376/html5/thumbnails/84.jpg)
84
RTL Design
Physical DRAM
100ns
latency
<- Resp Queue
Req Queue ->
DRAM Model
100
cycle latency
Mem Channel
“FAME-1” Transformed RTL Design
MIDAS-Level SW Simulation
FPGA Fabric
Abstract Model
Target-Level SW Simulation
Debugging Using Software Simulation
![Page 85: Scalable FPGA-accelerated Cycle-Accurate Hardware ...](https://reader030.fdocuments.in/reader030/viewer/2022012715/61ada8f6a6831b63cd36f376/html5/thumbnails/85.jpg)
85
RTL Design
Physical DRAM
100ns
latency
<- Resp Queue
Req Queue ->
DRAM Model
100
cycle latency
Mem Channel
“FAME-1” Transformed RTL Design
MIDAS-Level SW Simulation
FPGA Fabric
Abstract Model
Target-Level SW Simulation
FPGA-Level SW Simulation
Debugging Using Software Simulation
![Page 86: Scalable FPGA-accelerated Cycle-Accurate Hardware ...](https://reader030.fdocuments.in/reader030/viewer/2022012715/61ada8f6a6831b63cd36f376/html5/thumbnails/86.jpg)
Debugging Using Software Simulation
86
Level Waves VCS Verilator XSIM
Target Off ~5 kHz ~5 kHz N/A
Target On ~1 kHz ~5 kHz N/A
MIDAS Off ~4 kHz ~2 kHz N/A
MIDAS On ~3 kHz ~1 kHz N/A
FPGA On ~2 Hz N/A ~0.5 Hz
![Page 87: Scalable FPGA-accelerated Cycle-Accurate Hardware ...](https://reader030.fdocuments.in/reader030/viewer/2022012715/61ada8f6a6831b63cd36f376/html5/thumbnails/87.jpg)
• Target-Level Simulation In firesim/target-design/firechip/vsim
$ make DESIGN=<YourDesign> CONFIG=<YourConfig> debug
$ ./simv-<YourDesign>-<YourConfig>-debug +max-cycles=50000 +vcdplusfile=<WaveformFileName>.vpd ../tests/<InputTest>.riscv
• MIDAS-level Simulation In firesim/sim
$ make <verilator|vcs>-debug
$ make EMUL=<verilator|vcs> DESIGN=FireSimNoNIC run-asm-test-debug
• FPGA-Level Simluation In firesim/sim
$ make xsim
$ make xsim-dut <VCS=1> &
$ make run-xsim SIM_BINARY=<PATH/TO/DRIVER/BINARY/FOR/TARGET/TO/RUN>
**FPGA-level simulation currently does not support DMA_PCIS (which is used for the NIC interface)
87
Debugging Using Software Simulation
![Page 88: Scalable FPGA-accelerated Cycle-Accurate Hardware ...](https://reader030.fdocuments.in/reader030/viewer/2022012715/61ada8f6a6831b63cd36f376/html5/thumbnails/88.jpg)
88
“Everything looks OK in SW simulation, but there is still a bug somewhere”
“My bug only appears after hours of running Linux on my simulated HW”
When SW Simulation Debugging is Not Enough…
![Page 89: Scalable FPGA-accelerated Cycle-Accurate Hardware ...](https://reader030.fdocuments.in/reader030/viewer/2022012715/61ada8f6a6831b63cd36f376/html5/thumbnails/89.jpg)
Debugging Using Integrated Logic Analyzers
Integrated Logic Analyzers (ILAs)
• Common debugging feature provided by FPGA vendors
• Continuous recording of a sampling window of up to 1024 cycles. • Stores recorded samples in BRAM.
• Realtime trigger-based sampled output of probed signals • Multiple probes ports can be combined to a single trigger
• Trigger can be in any location within the sampling window
• On the AWS F1-Instances, ILA interfaced through a debug-bridge and server
89
From: aws-fpga cl_hello_world example
![Page 90: Scalable FPGA-accelerated Cycle-Accurate Hardware ...](https://reader030.fdocuments.in/reader030/viewer/2022012715/61ada8f6a6831b63cd36f376/html5/thumbnails/90.jpg)
Debugging Using Integrated Logic Analyzers
AutoILA – Automation of ILA integration with FireSim
• Annotate requested signals and bundles in the Chisel source code
• Automatic configuration and generation of the ILA IP in the FPGA toolchain
• Automatic expansion and wiring of annotated signals to the top level of a design using a FIRRTL transform.
• Remote waveform and trigger setup from the manager instance
90
![Page 91: Scalable FPGA-accelerated Cycle-Accurate Hardware ...](https://reader030.fdocuments.in/reader030/viewer/2022012715/61ada8f6a6831b63cd36f376/html5/thumbnails/91.jpg)
Debugging using Integrated Logic Analyzers
91
Pros:
• No emulated parts – what you see is what’s running on the FPGA
• FPGA simulation speed - O(MHz) compared to O(KHz) in software simulation
• Real-time trigger-based
Cons:
• Requires a full build to modify visible signals/triggers (takes several hours)
• Limited sampling window size
• Consumes FPGA resources
![Page 92: Scalable FPGA-accelerated Cycle-Accurate Hardware ...](https://reader030.fdocuments.in/reader030/viewer/2022012715/61ada8f6a6831b63cd36f376/html5/thumbnails/92.jpg)
Advanced FPGA-Based Debugging Features
• FPGA-based simulation enables high simulation speed, which enables advanced debugging and profiling tools.
• Reach “deep” in simulation time, and obtain large levels of coverage and data
• Examples: • TracerV
• Synthesizable Assertions
92
Simulated Time
SW Simulation
FPGA-based Simulation
![Page 93: Scalable FPGA-accelerated Cycle-Accurate Hardware ...](https://reader030.fdocuments.in/reader030/viewer/2022012715/61ada8f6a6831b63cd36f376/html5/thumbnails/93.jpg)
TracerV
93
• Out-of-band full instruction execution trace
• MIDAS widget connected to target trace ports
• By default, large amount of info wired out of Rocket/BOOM, per-hart, per-cycle: • Instruction Address • Instruction • Privilege Level • Exception/Interrupt Status, Cause
• TracerV can rapidly generate several TB of data.
![Page 94: Scalable FPGA-accelerated Cycle-Accurate Hardware ...](https://reader030.fdocuments.in/reader030/viewer/2022012715/61ada8f6a6831b63cd36f376/html5/thumbnails/94.jpg)
TracerV
• Out-of-Band software-level debugging and profiling: Profiling does not perturb execution
• Useful for kernel and hypervisor level cycle-sensitive profiling
• Examples: • Co-Optimization of NIC and Network Driver • Keystone Secure Enclave Project • High-performance hardware-specific code
(supercomputing?)
• Requires large-scale analytics for insightful profiling and optimization.
94
![Page 95: Scalable FPGA-accelerated Cycle-Accurate Hardware ...](https://reader030.fdocuments.in/reader030/viewer/2022012715/61ada8f6a6831b63cd36f376/html5/thumbnails/95.jpg)
TracerV
95
Pros:
• Out-of-Band (no impact on workload execution)
• SW-centric method
• Large amounts of data
Cons:
• Slower simulation performance (40 MHz)
• No HW visibility
• Large amounts of data
![Page 96: Scalable FPGA-accelerated Cycle-Accurate Hardware ...](https://reader030.fdocuments.in/reader030/viewer/2022012715/61ada8f6a6831b63cd36f376/html5/thumbnails/96.jpg)
Synthesizable Assertions
• Assertions – rapid error checking embedded in HW source code. • Commonly used in SW Simulation
• Halts the simulation upon a triggered assertion. Represented as a “stop” statement in FIRRTL
• By defaults, emitted as non-synthesizable SV functions ($fatal)
96
From: Trillion-Cycle Bug Finding Using FPGA-Accelerated Simulation Donggyu Kim, Christopher Celio, Sagar Karandikar, David Biancolin, Jonathan Bachrach, Krste Asanović. ADEPT Winter Retreat 2018
From: BROOM: An open-source Out-of-Order processor with resilient low-voltage operation in 28nm CMOS, Christopher Celio, Pi-Feng Chiu, Krste Asanovic, David Patterson and Borivoje Nikolic. HotChip 30, 2018
![Page 97: Scalable FPGA-accelerated Cycle-Accurate Hardware ...](https://reader030.fdocuments.in/reader030/viewer/2022012715/61ada8f6a6831b63cd36f376/html5/thumbnails/97.jpg)
Synthesizable Assertions
• Synthesizable Assertions on FPGA • Transform FIRRTL stop statements into synthesizable logic
• Insert combinational logic and signals for the stop condition arguments
• Insert encodings for each assertion (for matching error statements in SW)
• Wire the assertion logic output to the Top-Level
• Generate timing tokens for cycle-exact assertions
• Assertion checker records the cycle and halts simulation when assertion is triggered
97
![Page 98: Scalable FPGA-accelerated Cycle-Accurate Hardware ...](https://reader030.fdocuments.in/reader030/viewer/2022012715/61ada8f6a6831b63cd36f376/html5/thumbnails/98.jpg)
Synthesizable Assertions
• Previously a research feature presented in DESSERT [1]
• Helped Identify BOOM bugs trillions of cycles into execution
• Integrated into FireSim in the latest release.
98
[1] Kim, D., Celio, C., Karandikar, S., Biancolin, D., Bachrach, J. and Asanovic, K., DESSERT: Debugging RTL Effectively with State Snapshotting for Error Replays across Trillions of cycles. The International Conference on Field-Programmable Logic and Applications (FPL), 2018
![Page 99: Scalable FPGA-accelerated Cycle-Accurate Hardware ...](https://reader030.fdocuments.in/reader030/viewer/2022012715/61ada8f6a6831b63cd36f376/html5/thumbnails/99.jpg)
Synthesizable Assertions
99
Pros:
• FPGA simulation speed
• Real-time trigger-based
• Consumes small amount of FPGA resources (compared to ILA)
• Key signals have pre-written assertions in re-usable components/libraries
Cons:
• Low visibility: No waveform/state
• Assertions are best added while writing source RTL rather than during “investigative” debugging
![Page 100: Scalable FPGA-accelerated Cycle-Accurate Hardware ...](https://reader030.fdocuments.in/reader030/viewer/2022012715/61ada8f6a6831b63cd36f376/html5/thumbnails/100.jpg)
The FireSim Vision: Speed and Visibility
• High-performance simulation
• Full application workloads
• Tunable visibility & resolution
• Unique data-based insights
100
![Page 101: Scalable FPGA-accelerated Cycle-Accurate Hardware ...](https://reader030.fdocuments.in/reader030/viewer/2022012715/61ada8f6a6831b63cd36f376/html5/thumbnails/101.jpg)
Speed and Visibility – How do we get there?
• Integration of additional DESSERT features • Hardware state snapshot extraction from FPGA to software simulation using
arbitrary software triggers.
• Easy-to-use information transfer between execution traces and software hooks.
• Data-Processing pipeline for insights from large-scale traces • O(GB)-O(TB) out-of-band logs require big-data analysis methods for insights
(Golden model comparison is one potential method).
• Potential unique insights: can collect globally-cycle-accurate out-of-band instruction traces from a networked datacenter simulation.
101
![Page 102: Scalable FPGA-accelerated Cycle-Accurate Hardware ...](https://reader030.fdocuments.in/reader030/viewer/2022012715/61ada8f6a6831b63cd36f376/html5/thumbnails/102.jpg)
Summary
• Debugging Using Software Simulation (docs) • Target-Level
• MIDAS-Level
• FPGA-Level
• Debugging Using Integrated Logic Analyzers (docs)
• Advanced Debugging and Profiling Features • TracerV (docs)
• Assertion Synthesis (docs)
• FireSim Debugging and Profiling Future Vision
102
![Page 103: Scalable FPGA-accelerated Cycle-Accurate Hardware ...](https://reader030.fdocuments.in/reader030/viewer/2022012715/61ada8f6a6831b63cd36f376/html5/thumbnails/103.jpg)
Project Management, Contributing & Conclusion
FireSim Tutorial RISC-V Summit 2018
Speaker: David Biancolin
![Page 104: Scalable FPGA-accelerated Cycle-Accurate Hardware ...](https://reader030.fdocuments.in/reader030/viewer/2022012715/61ada8f6a6831b63cd36f376/html5/thumbnails/104.jpg)
Development Model – Branches
• FireSim and its key submodules (FireChip, MIDAS, aws-fpga) have two write-controlled branches
• master • Stable
• Contains the most recent release of FireSim
• Can replicate ISCA2018 paper results
• dev • Main development branch
• Base branch for feature PRs and non-critical bug fixes
• Every merge commit has globally shared, regenerated AGFIs
104
![Page 105: Scalable FPGA-accelerated Cycle-Accurate Hardware ...](https://reader030.fdocuments.in/reader030/viewer/2022012715/61ada8f6a6831b63cd36f376/html5/thumbnails/105.jpg)
Development Model – Merges & Releases
• To merge a feature branch into dev: • Pass all MIDAS-level tests (in sim/: `make test`)
• Regenerate all AGFIs if RTL changes were made
• Associated submodule PRs should be merged; submodule pointers should point at merge commit on dev
• Changes to dev will be summarized in a “Dev-tracking PR” • PR description will become the Changelog entry
• Dev will be periodically merged into master and tagged as a release • Changelog will be updated
• Submodule dev branches will also be merged into master
105
![Page 106: Scalable FPGA-accelerated Cycle-Accurate Hardware ...](https://reader030.fdocuments.in/reader030/viewer/2022012715/61ada8f6a6831b63cd36f376/html5/thumbnails/106.jpg)
How to Get Help
• FireSim and MIDAS adds a bunch of additional complexity on top of Rocket-Chip, Chisel, FIRRTL, Scala… • But there’s a team of us here to help!
• For general questions: • Post on the google groups: https://groups.google.com/forum/#!forum/firesim
• Check out the docs: https://docs.fires.im/en/latest/
• (Nascent) FAQ section: https://docs.fires.im/en/latest/Advanced-Usage/FAQs.html
106
![Page 107: Scalable FPGA-accelerated Cycle-Accurate Hardware ...](https://reader030.fdocuments.in/reader030/viewer/2022012715/61ada8f6a6831b63cd36f376/html5/thumbnails/107.jpg)
Reporting Issues
• If you think you’ve encountered a bug (in FireSim or MIDAS): • Open an issue on FireSim’s github page
• Try your best to open an issue in the most relevant repo • Please don’t open duplicates across multiple repos!
• If you are unsure, default to FireSim and we’ll be happy to redirect you
107
![Page 108: Scalable FPGA-accelerated Cycle-Accurate Hardware ...](https://reader030.fdocuments.in/reader030/viewer/2022012715/61ada8f6a6831b63cd36f376/html5/thumbnails/108.jpg)
Contributing
• We welcome PRs!
• Issues are labeled with “Good First Issue”
• Ask on the google groups for suggestions
108
![Page 109: Scalable FPGA-accelerated Cycle-Accurate Hardware ...](https://reader030.fdocuments.in/reader030/viewer/2022012715/61ada8f6a6831b63cd36f376/html5/thumbnails/109.jpg)
Regarding Future Work
• There are many research projects underway at UCB-BAR using and improving FireSim
• Many will be merged into mainline FireSim
• There will be API breakages -> We apologize in advance
• Watch the changelogs, dev tracking PRs
Our goal: make FireSim the standard open way to do fast full-system simulation of Chisel-designed systems, especially ones based on Rocket-Chip.
Focus on:
• Generality (eg. support multi-clock target designs)
• Robustness (eg. More extensive testing in RTL-simulation; CI; regressions)
• Debuggability (eg. DESSERT features) 109
![Page 110: Scalable FPGA-accelerated Cycle-Accurate Hardware ...](https://reader030.fdocuments.in/reader030/viewer/2022012715/61ada8f6a6831b63cd36f376/html5/thumbnails/110.jpg)
Two more talks from Berkeley Architecture Research
110
Both talks open-sourcing their work and releasing FireSim images!
![Page 111: Scalable FPGA-accelerated Cycle-Accurate Hardware ...](https://reader030.fdocuments.in/reader030/viewer/2022012715/61ada8f6a6831b63cd36f376/html5/thumbnails/111.jpg)
Conclusion
Key links: • Website: https://fires.im • Google Groups: https://groups.google.com/forum/#!forum/firesim • Docs: https://docs.fires.im/en/latest/ • Twitter: @firesimproject Acknowledgements: The information, data, or work presented herein was funded in part by the Advanced Research Projects Agency-Energy (ARPA-E), U.S. Department of Energy, under Award Number DE-AR0000849. Research was partially funded by ADEPT Lab industrial sponsor Intel, RISE Lab sponsor Amazon Web Services, and ADEPT Lab affiliates Google, Huawei, Siemens, SK Hynix, and Seagate. The views and opinions of authors expressed herein do not necessarily state or reflect those of the United States Government or any agency thereof.
111