2016-01-05 VectorBlox ORCA RISC-V DEMO · 1/5/2016 · Tiny, Low-Power FPGA 3,500 LUT4s 4 MUL16s <...
Transcript of 2016-01-05 VectorBlox ORCA RISC-V DEMO · 1/5/2016 · Tiny, Low-Power FPGA 3,500 LUT4s 4 MUL16s <...
![Page 1: 2016-01-05 VectorBlox ORCA RISC-V DEMO · 1/5/2016 · Tiny, Low-Power FPGA 3,500 LUT4s 4 MUL16s < $5.00 ISA: RV32IM hw multiply, sw divider < 2,000 LUTs ~ 20MHz ORCA FPGAOptimized..](https://reader035.fdocuments.in/reader035/viewer/2022071022/5fd629686117de67fa1171a5/html5/thumbnails/1.jpg)
1
FPGA-‐Optimized ORCA
© 2016 VectorBlox Computing Inc.
![Page 2: 2016-01-05 VectorBlox ORCA RISC-V DEMO · 1/5/2016 · Tiny, Low-Power FPGA 3,500 LUT4s 4 MUL16s < $5.00 ISA: RV32IM hw multiply, sw divider < 2,000 LUTs ~ 20MHz ORCA FPGAOptimized..](https://reader035.fdocuments.in/reader035/viewer/2022071022/5fd629686117de67fa1171a5/html5/thumbnails/2.jpg)
Tiny, Low-Power FPGA 3,500 LUT4s 4 MUL16s < $5.00
ISA: RV32IM hw multiply, sw divider
< 2,000 LUTs ~ 20MHz
ORCA FPGA-‐Optimized
![Page 3: 2016-01-05 VectorBlox ORCA RISC-V DEMO · 1/5/2016 · Tiny, Low-Power FPGA 3,500 LUT4s 4 MUL16s < $5.00 ISA: RV32IM hw multiply, sw divider < 2,000 LUTs ~ 20MHz ORCA FPGAOptimized..](https://reader035.fdocuments.in/reader035/viewer/2022071022/5fd629686117de67fa1171a5/html5/thumbnails/3.jpg)
What is ORCA?
• Family of RISC-V implementations – Highly parameterized – Ideally suited for FPGAs – Portable across FPGA vendors – BSD license open source hardware
3
![Page 4: 2016-01-05 VectorBlox ORCA RISC-V DEMO · 1/5/2016 · Tiny, Low-Power FPGA 3,500 LUT4s 4 MUL16s < $5.00 ISA: RV32IM hw multiply, sw divider < 2,000 LUTs ~ 20MHz ORCA FPGAOptimized..](https://reader035.fdocuments.in/reader035/viewer/2022071022/5fd629686117de67fa1171a5/html5/thumbnails/4.jpg)
Why ORCA?
• Many reasons – Orcas travel in pods: family of many sizes – Orcas are native to Vancouver
• ORCA – many possible backronyms – ORCA RISC-V Computer Architecture – ORCA Reconfigurable CPU Architecture – Optimized RISC-V CPU Architecture
4
![Page 5: 2016-01-05 VectorBlox ORCA RISC-V DEMO · 1/5/2016 · Tiny, Low-Power FPGA 3,500 LUT4s 4 MUL16s < $5.00 ISA: RV32IM hw multiply, sw divider < 2,000 LUTs ~ 20MHz ORCA FPGAOptimized..](https://reader035.fdocuments.in/reader035/viewer/2022071022/5fd629686117de67fa1171a5/html5/thumbnails/5.jpg)
ORCA: Multiple FPGA Vendors
5
• Altera – Drop-in Qsys
replacement for Nios II/f
– Avalon I / D masters
• Lattice – Wishbone I / D
masters
• Xilinx, Microsemi – Coming soon
![Page 6: 2016-01-05 VectorBlox ORCA RISC-V DEMO · 1/5/2016 · Tiny, Low-Power FPGA 3,500 LUT4s 4 MUL16s < $5.00 ISA: RV32IM hw multiply, sw divider < 2,000 LUTs ~ 20MHz ORCA FPGAOptimized..](https://reader035.fdocuments.in/reader035/viewer/2022071022/5fd629686117de67fa1171a5/html5/thumbnails/6.jpg)
Area 2008 LUT4 1623 LUT4 541 ALMs
Fmax 22 MHz 109 MHz 244 MHz
DMIPS n/a 79 MIPS 212 MIPS
DMIPS/MHz n/a 0.73 0.87
6
ORCA RISC-V RV32I on Different FPGAs
![Page 7: 2016-01-05 VectorBlox ORCA RISC-V DEMO · 1/5/2016 · Tiny, Low-Power FPGA 3,500 LUT4s 4 MUL16s < $5.00 ISA: RV32IM hw multiply, sw divider < 2,000 LUTs ~ 20MHz ORCA FPGAOptimized..](https://reader035.fdocuments.in/reader035/viewer/2022071022/5fd629686117de67fa1171a5/html5/thumbnails/7.jpg)
ORCA vs Other RISC-V
7
ORCA RV32IM
Z-scale RV32IM
PicoRV RV32I
Area 2353 LUT4 (Cyclone IV,
60nm)
2678 LUT4 (Spartan 6,
45nm)
2949 LUT4 (Cyclone IV,
60nm)
Fmax 125 MHz 33 MHz 127 MHz
DMIPS 122 MIPS 44 MIPS 39 MIPS
DMIPS/MHz 0.98 (measured)
1.35 (claimed)
0.31 (claimed)
![Page 8: 2016-01-05 VectorBlox ORCA RISC-V DEMO · 1/5/2016 · Tiny, Low-Power FPGA 3,500 LUT4s 4 MUL16s < $5.00 ISA: RV32IM hw multiply, sw divider < 2,000 LUTs ~ 20MHz ORCA FPGAOptimized..](https://reader035.fdocuments.in/reader035/viewer/2022071022/5fd629686117de67fa1171a5/html5/thumbnails/8.jpg)
ORCA RISC-V vs FPGA CPUs
8
ORCA RV32IM
Altera Nios II/f
Area 2353 LUT4 (Cyclone IV)
2678 LUT4 (Cyclone IV)
Fmax 125 MHz 140 MHz
DMIPS 122 MIPS 163 MIPS
DMIPS/MHz 0.98 (measured)
1.16 (claimed)
![Page 9: 2016-01-05 VectorBlox ORCA RISC-V DEMO · 1/5/2016 · Tiny, Low-Power FPGA 3,500 LUT4s 4 MUL16s < $5.00 ISA: RV32IM hw multiply, sw divider < 2,000 LUTs ~ 20MHz ORCA FPGAOptimized..](https://reader035.fdocuments.in/reader035/viewer/2022071022/5fd629686117de67fa1171a5/html5/thumbnails/9.jpg)
RISC-V: Architecture Space • Width (3 choices)
– 32, 64, 128 bits
• Instruction Set (9 binary options, 2^9 choices) – Minimum: I – Binary options: M, A, F, D (== G), Q, L, B, T, P
• Instruction Encoding (2 choices) – C
• Architecture Space 3 x 2^10 = 3072 possibilities
9
![Page 10: 2016-01-05 VectorBlox ORCA RISC-V DEMO · 1/5/2016 · Tiny, Low-Power FPGA 3,500 LUT4s 4 MUL16s < $5.00 ISA: RV32IM hw multiply, sw divider < 2,000 LUTs ~ 20MHz ORCA FPGAOptimized..](https://reader035.fdocuments.in/reader035/viewer/2022071022/5fd629686117de67fa1171a5/html5/thumbnails/10.jpg)
ORCA Implementation Space • Logic design (12 choices)
– Multiplier (sw, hw) – Divider (sw, hw) – Shifter (1-cycle, 8-cycles,
32-cycles) • Counters (3 choices)
– 0, 32, or 64 bits
• Pipelining (2 choices) – 4 or 5 stages
• Forwarding (2 choices) – ALU only – ALU + other units
• Implementation space 12 x 3 x 2 x 2 = 144 possibilities • Overall arch. x impl. = 3072 x 144 / 2 = 221,184 possibilities
10
![Page 11: 2016-01-05 VectorBlox ORCA RISC-V DEMO · 1/5/2016 · Tiny, Low-Power FPGA 3,500 LUT4s 4 MUL16s < $5.00 ISA: RV32IM hw multiply, sw divider < 2,000 LUTs ~ 20MHz ORCA FPGAOptimized..](https://reader035.fdocuments.in/reader035/viewer/2022071022/5fd629686117de67fa1171a5/html5/thumbnails/11.jpg)
Huge Design Space • Implementation on ASIC
– Need to choose one design point in the architecture + implementation space
– Benefit: user has no choice – Problem: compromise across many applications
• Implementation on FPGA – Can have fully parameterized design – User can choose best architecture + implementation according
to application – Benefit: good performance, area – Problem: overwhelming design space
11
![Page 12: 2016-01-05 VectorBlox ORCA RISC-V DEMO · 1/5/2016 · Tiny, Low-Power FPGA 3,500 LUT4s 4 MUL16s < $5.00 ISA: RV32IM hw multiply, sw divider < 2,000 LUTs ~ 20MHz ORCA FPGAOptimized..](https://reader035.fdocuments.in/reader035/viewer/2022071022/5fd629686117de67fa1171a5/html5/thumbnails/12.jpg)
32b vs 64b Counters
12
Faster
Slower
Smaller LUT6 count Bigger
Execution Time
64b counters
no counters
32b counters
![Page 13: 2016-01-05 VectorBlox ORCA RISC-V DEMO · 1/5/2016 · Tiny, Low-Power FPGA 3,500 LUT4s 4 MUL16s < $5.00 ISA: RV32IM hw multiply, sw divider < 2,000 LUTs ~ 20MHz ORCA FPGAOptimized..](https://reader035.fdocuments.in/reader035/viewer/2022071022/5fd629686117de67fa1171a5/html5/thumbnails/13.jpg)
4 vs 5 Pipeline Stages
13
4 pipeline stages
5 pipeline stages
Faster
Slower
Smaller LUT4 count Bigger
Execution Time
![Page 14: 2016-01-05 VectorBlox ORCA RISC-V DEMO · 1/5/2016 · Tiny, Low-Power FPGA 3,500 LUT4s 4 MUL16s < $5.00 ISA: RV32IM hw multiply, sw divider < 2,000 LUTs ~ 20MHz ORCA FPGAOptimized..](https://reader035.fdocuments.in/reader035/viewer/2022071022/5fd629686117de67fa1171a5/html5/thumbnails/14.jpg)
FPGA è ASIC but
ASIC !è FPGA
good FPGA implementation è often leads to good ASIC implementation
good ASIC implementation
è often leads to poor FPGA implementation
14
![Page 15: 2016-01-05 VectorBlox ORCA RISC-V DEMO · 1/5/2016 · Tiny, Low-Power FPGA 3,500 LUT4s 4 MUL16s < $5.00 ISA: RV32IM hw multiply, sw divider < 2,000 LUTs ~ 20MHz ORCA FPGAOptimized..](https://reader035.fdocuments.in/reader035/viewer/2022071022/5fd629686117de67fa1171a5/html5/thumbnails/15.jpg)
Register File
• Discrete FFs: inefficient
– 32 cpu registers x 32 b = 1024 FFs – 32 mux32 = 32 x 11 LUT4 = 352 LUT4s
• Note: muxes are costly, must avoid!!! 15
x31 x0x1x2
b0
b1
XLEN-1
Cost of muxes (1b wide): mux4: 2 LUT4 or 1 LUT6 mux16: 10 LUT4 or 5 LUT6 mux32: 11 LUT4 or 5.5 LUT6
![Page 16: 2016-01-05 VectorBlox ORCA RISC-V DEMO · 1/5/2016 · Tiny, Low-Power FPGA 3,500 LUT4s 4 MUL16s < $5.00 ISA: RV32IM hw multiply, sw divider < 2,000 LUTs ~ 20MHz ORCA FPGAOptimized..](https://reader035.fdocuments.in/reader035/viewer/2022071022/5fd629686117de67fa1171a5/html5/thumbnails/16.jpg)
Register File Implications • Block RAMs: dual ported, registered output
– Use 1 RD, 1 WR port – Use data-out FFs as pipeline FFs – Needs external data-forwarding
16
rd-data1
addr1
wr-data1
rd-data2
addr2
wr-data2
Port 1
Port 2
(UNUSED)
(OPERAND A or B)
(RS or RT)
(RD)
(WB DATA)
(UNUSED)
RAM cannot forward WB data to OPERAND data internally
rd-data1
addr1
wr-data1
rd-data2
addr2
wr-data2
Port 1
Port 2
(UNUSED)
(OPERAND A or B)
(RS or RT)
(RD)
(WB DATA)
(UNUSED)
![Page 17: 2016-01-05 VectorBlox ORCA RISC-V DEMO · 1/5/2016 · Tiny, Low-Power FPGA 3,500 LUT4s 4 MUL16s < $5.00 ISA: RV32IM hw multiply, sw divider < 2,000 LUTs ~ 20MHz ORCA FPGAOptimized..](https://reader035.fdocuments.in/reader035/viewer/2022071022/5fd629686117de67fa1171a5/html5/thumbnails/17.jpg)
ORCA Datapath
17
ALU / BR / SLT
CSR
LD/ST
Data Avalon /Wishbone
Decode Ex/Mem WB
Forward(can be collapsed
into Ex/Mem stage)
Instr. Avalon /Wishbone
InstrFetch
RF1
RF2
Fetch
![Page 18: 2016-01-05 VectorBlox ORCA RISC-V DEMO · 1/5/2016 · Tiny, Low-Power FPGA 3,500 LUT4s 4 MUL16s < $5.00 ISA: RV32IM hw multiply, sw divider < 2,000 LUTs ~ 20MHz ORCA FPGAOptimized..](https://reader035.fdocuments.in/reader035/viewer/2022071022/5fd629686117de67fa1171a5/html5/thumbnails/18.jpg)
Some FPGA Suggestions
• RV32E spec – Reduced # registers saves nothing in FPGAs – Divide is expensive
• Software – Beware, shifts may be 1b/cycle (slow)
• Privileged Arch spec – Too many CSRs, 64b counters too big
• Increases pressure on multiplexers
– Suggest small / med / full versions • No “official” rules on what to include/exclude to reduce size • Eg, hypervisors not likely to run on FPGAs
18
![Page 19: 2016-01-05 VectorBlox ORCA RISC-V DEMO · 1/5/2016 · Tiny, Low-Power FPGA 3,500 LUT4s 4 MUL16s < $5.00 ISA: RV32IM hw multiply, sw divider < 2,000 LUTs ~ 20MHz ORCA FPGAOptimized..](https://reader035.fdocuments.in/reader035/viewer/2022071022/5fd629686117de67fa1171a5/html5/thumbnails/19.jpg)
Conclusions
• ORCA RISC-V family is free, portable, FPGA-optimized
– FPGA and ASIC optimizations are different • FPGA architecture dictates certain design choices
– Some RISC-V decisions are “unconsciously” aimed towards ASIC implementation
• These do not lead to good FPGA implementations • But good FPGA choices lead to good ASICs
19
![Page 20: 2016-01-05 VectorBlox ORCA RISC-V DEMO · 1/5/2016 · Tiny, Low-Power FPGA 3,500 LUT4s 4 MUL16s < $5.00 ISA: RV32IM hw multiply, sw divider < 2,000 LUTs ~ 20MHz ORCA FPGAOptimized..](https://reader035.fdocuments.in/reader035/viewer/2022071022/5fd629686117de67fa1171a5/html5/thumbnails/20.jpg)
Free FPGA Hardware! • Today only: Lattice donating
FPGA boards for RISC-V users
• ORCA RV32I system ~2000 LUTs http://www.github.com/VectorBlox/risc-v About 1500 LUTs available for user I/O (eg, UART) 20
![Page 21: 2016-01-05 VectorBlox ORCA RISC-V DEMO · 1/5/2016 · Tiny, Low-Power FPGA 3,500 LUT4s 4 MUL16s < $5.00 ISA: RV32IM hw multiply, sw divider < 2,000 LUTs ~ 20MHz ORCA FPGAOptimized..](https://reader035.fdocuments.in/reader035/viewer/2022071022/5fd629686117de67fa1171a5/html5/thumbnails/21.jpg)
© 2016 VectorBlox Computing Inc. 21
LUNCHTIME So Long, and Thanks for All the Fish !!