A High Performance Heterogeneous FPGA-based Accelerator with PyCoRAM (Runner Up Award at Digilent...
A High Performance Heterogeneous FPGA-based Accelerator with PyCoRAM
Team: PyCoRAMist
Shinya Takamaeda-Yamazaki Tokyo Institute of Technology
JSPS Research Fellow (DC1)
February 21, 2014 Digilent Design Contest @TED Yokohama
The 1st IPSJ SIG-ARC High-Performance Processor Design Contest (Jan 2014 @Tokyo)
- A competition to develop a fast computing system for the specified applications on the specified platform
- FPGA board: Digilent Atlys
  - FPGA: Xilinx Spartan-6 LX45
  - DRAM: DDR2-800 (1.6GB/s)
2014-02-21 Shinya T-Y. Tokyo Tech 2
4 Specified Contest Applications
Hybrid System of CPU core + HW Accelerator
- Suitable for HW accelerators: Matrix Mult & Stencil
- Difficult for HW accelerators: Sort & Shortest Path

Application   Description                    Requirements for Memory System
310_sort      Integer Sort                   Low Latency
320_mm        Matrix-Matrix Multiplication   High Bandwidth
330_stencil   9-Point Stencil (Integer)      High Bandwidth
340_spath     Shortest Path Search           Low Latency
How to Implement an Accelerator?
- HDL? NO WAY! It's so annoying
  - Implementing the entire system in HDL is hard, because of ...
    - Scheduling logic for computations and memory accesses
      - Double buffering requires complicated logic
      - State machine implementation is annoying and error-prone
  - But we want to define the pipeline design at the cycle level
    - This is essential for high performance in FPGA-based accelerators
      - HDL is still a good weapon for writing just the computation logic
      - Modern high-level synthesis tools are still not effective
- Can memory abstractions make us happy?
CoRAM Memory Architecture
CoRAM (Connected RAM) [Chung+, FPGA'11]
- Abstract memory system for FPGAs
  - High-level abstraction for memory management
    - Decoupling computing logic and memory access behaviors
    - Memory access patterns in a software model (C language)
[Figure: CoRAM memory architecture. HW kernels (computing logic) read and write CoRAM memories (abstracted on-chip memories); control threads (memory access patterns in C) manage the CoRAM memories and read/write off-chip memory; HW kernels and control threads communicate through CoRAM channels (communication FIFOs, registers).]
PyCoRAM [Takamaeda+,CARL’13]
- Python-based implementation of the CoRAM memory architecture for modern FPGA EDKs
  - CoRAM memory abstraction for the EDK development flow
- Key features
  - Control thread in Python
    - We developed a Python-to-Verilog HLS compiler from scratch
  - AMBA AXI4 for the on-chip interconnect
    - For IP-core based development on Xilinx Platform Studio (XPS)
PyCoRAM Microarchitecture
[Figure: PyCoRAM microarchitecture. Blocks: User I/O, User Logic, CoRAM Channel, CoRAM Register, CoRAM Memory, CoRAM Stream, DMAC (two instances), FSM, Control Thread, GPIO.]
PyCoRAM Microarchitecture
[Figure: the same PyCoRAM block diagram (User I/O, User Logic, CoRAM Channel, CoRAM Register, CoRAM Memory, CoRAM Stream, DMAC, FSM, Control Thread, GPIO), annotated "Modeled in RTL (Verilog HDL)".]
Memory Access Pattern in Python

    def calc_sum(times):
        ram = CoramMemory(idx=0, datawidth=32, size=1024)
        channel = CoramChannel(idx=0, datawidth=32)
        addr = 0
        sum = 0
        for i in range(times):
            ram.write(0, addr, 128)    # Transfer (off-chip DRAM to BRAM)
            channel.write(addr)        # Notification to User-logic
            sum += channel.read()      # Wait for Notification from User-logic
            addr += 128 * (32/8)
        print('sum=', sum)             # $display Verilog system task

    calc_sum(8)
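To experiment with the control thread above without the PyCoRAM tool-chain, its two objects can be mocked in plain Python. The sketch below is not the PyCoRAM implementation: the class bodies are stand-ins, the `DRAM` list and the `user_logic` callback are hypothetical, and the argument order `write(ram_addr, ext_addr, size)` is inferred from the slide's comment (off-chip DRAM to BRAM).

```python
# Software-only mock of the control-thread objects (NOT the PyCoRAM API).
# DRAM, user_logic and the class bodies are illustrative assumptions.
from collections import deque

WORD_BYTES = 4                      # datawidth=32 bits
DRAM = list(range(1024))            # hypothetical off-chip memory, one int per word


class CoramMemory:
    def __init__(self, idx, datawidth, size):
        self.idx = idx
        self.datawidth = datawidth
        self.bram = [0] * size      # models the on-chip block RAM

    def write(self, ram_addr, ext_addr, size):
        # Copy `size` words from DRAM (byte address) into BRAM (word address).
        base = ext_addr // WORD_BYTES
        self.bram[ram_addr:ram_addr + size] = DRAM[base:base + size]


class CoramChannel:
    def __init__(self, idx, datawidth, user_logic):
        self.idx = idx
        self.datawidth = datawidth
        self.user_logic = user_logic   # stands in for the Verilog user logic
        self.to_thread = deque()

    def write(self, value):
        # Notify the mock user logic; its reply is queued for read().
        self.to_thread.append(self.user_logic(value))

    def read(self):
        # On hardware this would block until the user logic answers.
        return self.to_thread.popleft()


def calc_sum(times):
    ram = CoramMemory(idx=0, datawidth=32, size=1024)
    # Mock user logic: sum the 128 words just transferred into BRAM.
    channel = CoramChannel(idx=0, datawidth=32,
                           user_logic=lambda _addr: sum(ram.bram[:128]))
    addr = 0
    total = 0
    for _ in range(times):
        ram.write(0, addr, 128)        # DRAM -> BRAM transfer
        channel.write(addr)            # notify user logic
        total += channel.read()        # wait for its result
        addr += 128 * (32 // 8)        # advance by 128 words, in bytes
    return total


if __name__ == "__main__":
    print("sum=", calc_sum(8))
```

Keeping the transfer (`ram.write`) and the handshake (`channel.write` / `channel.read`) as separate calls mirrors the decoupling of memory access from computing logic that the control thread is meant to express.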
PyCoRAM Microarchitecture
[Figure: the PyCoRAM IP core on the FPGA. Each DMAC has an AXI I/F that connects through the AXI4 Interconnect to the DRAM Controller; the other blocks are User I/O, User Logic, CoRAM Channel, CoRAM Register, CoRAM Memory, CoRAM Stream, FSM, Control Thread and GPIO.]
FPGA Accelerator for PROCON
- 6-stage MIPS-core + UART loader + two accelerators
  - XPS automatically synthesizes the AXI4 interconnections
  - Clock frequency: logic and AXI: 100MHz, DRAM: 400MHz
[Figure: system block diagram. Four blocks, each attached through a PyCoRAM abstraction to the AXI4 Interconnect (32-bit, shared bus), which connects to the DRAM Controller: (1) 6-stage MIPS-core with L1-D Cache (2-way, 32KB, 64 bytes/line), (2) Memory Loader with UART, (3) Matrix Multiplication Accelerator, (4) 9-point Stencil Accelerator.]
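The clock figures above bound what the shared bus can deliver. The following is a rough calculation, assuming one 32-bit transfer per 100MHz AXI cycle with no protocol overhead (an idealization, not a measured number), compared against the 1.6GB/s DDR2-800 peak given for the contest platform. The constant and function names are illustrative.

```python
# Back-of-the-envelope bandwidth estimate (idealized; ignores AXI protocol overhead).
AXI_WIDTH_BITS = 32                       # from the slides: 32-bit shared bus
AXI_CLOCK_HZ = 100_000_000                # from the slides: logic and AXI at 100MHz
DRAM_PEAK_BYTES_PER_S = 1_600_000_000     # from the slides: DDR2-800 (1.6GB/s)


def axi_peak_bytes_per_s(width_bits, clock_hz):
    """Peak bytes per second if one data beat is transferred every cycle."""
    return (width_bits // 8) * clock_hz


axi_peak = axi_peak_bytes_per_s(AXI_WIDTH_BITS, AXI_CLOCK_HZ)
fraction = axi_peak / DRAM_PEAK_BYTES_PER_S

print(f"AXI peak: {axi_peak / 1e6:.0f} MB/s")
print(f"Fraction of DRAM peak: {fraction:.2f}")
```

Under these assumptions the bus tops out at 400MB/s, a quarter of the DRAM peak, so any single master on this interconnect is bus-limited well before it is DRAM-limited.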
[Figure: the system block diagram above, annotated with the values 9.8%, 4.5%, 0.4%, 2.5%, 28.1%, 22.5% and 6.3%.]
Matrix-Matrix Multiplication Accelerator
- Each row of matrices A, B and C is stored in CoRAM memories
  - Data movements between on-chip memory and DRAM are managed by PyCoRAM control threads
  - The pipeline is fully occupied every cycle
  - Double buffering overlaps computation with the transfer of matrix B
    - Matrix B is transposed in advance by separate CoRAM hardware
    - 1/4 of the total memory bandwidth is utilized (about 400MB/s)
[Figure: computing logic (Verilog HDL) and control thread (Python). Three CoRAM memories (0 to 2) hold rows of A, B and C; an 8-stage multiply pipeline (×) feeds an adder (+) that accumulates sum; a second adder accumulates a check sum; Control Logic is connected to CoRAM Channel 0.]
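A software reference model is a convenient way to check an accelerator like this one. The sketch below is an illustration under assumptions, not the contest design: it transposes B first so that each output element is a dot product of two rows, mirroring the row-wise storage described above. The function names are hypothetical, and `checksum` is only a plain element sum, since the slides do not define the contest's check sum.

```python
# Reference model: C = A x B using a pre-transposed B (illustrative sketch only).

def transpose(m):
    """Return the transpose of a list-of-lists matrix."""
    return [list(col) for col in zip(*m)]


def matmul_rowwise(a, b):
    """Multiply a (n x k) by b (k x m), streaming rows of a and rows of b^T."""
    bt = transpose(b)                      # done in advance, as on the slide
    c = []
    for a_row in a:                        # one row of A at a time
        c_row = []
        for bt_row in bt:                  # one row of B^T (a column of B)
            acc = 0
            for x, y in zip(a_row, bt_row):
                acc += x * y               # multiply-accumulate, one pair per step
            c_row.append(acc)
        c.append(c_row)
    return c


def checksum(m):
    """Plain sum of all elements (assumed definition, see lead-in)."""
    return sum(sum(row) for row in m)


if __name__ == "__main__":
    a = [[1, 2], [3, 4]]
    b = [[5, 6], [7, 8]]
    c = matmul_rowwise(a, b)
    print(c, checksum(c))
```

With B transposed, both operands of every dot product are read sequentially, which is what lets a row fetched into on-chip memory be consumed one element per step.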
Stencil Computation Accelerator
- CoRAM memories hold 3 source arrays and 1 result array
  - Data movements between on-chip memory and DRAM are managed by PyCoRAM control threads
  - The pipeline consumes 3 points of data every cycle
    - Result = (sum of the input data within the latest 3 cycles) / 9
  - The result is written back, then the next array is read
    - 1/12 of the total memory bandwidth is utilized (about 130MB/s)
[Figure: computing logic (Verilog HDL) and control thread (Python). CoRAM memories 0 to 3 hold the source arrays (d0, d1, d2) and the result (rslt); a 41-stage add-divide pipeline (+, /) produces the result and an adder accumulates a check sum; Control Logic is connected to CoRAM Channel 0.]
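The "sum of the latest 3 cycles, divided by 9" rule can be modeled in a few lines. This is a minimal sketch with assumptions labeled: the function name is hypothetical, only interior points are produced (boundary handling is not described on the slide), and the division is taken to be truncating integer division because the application is an integer stencil.

```python
# Reference model of a 9-point integer stencil over three source rows (sketch only).
from collections import deque


def stencil_row(d0, d1, d2):
    """Return the interior results for the middle row d1.

    Each step consumes one point from each of the three rows (a column of
    3 values); the output is the sum of the latest 3 columns divided by 9.
    """
    window = deque(maxlen=3)                 # column sums of the latest 3 steps
    result = []
    for p0, p1, p2 in zip(d0, d1, d2):
        window.append(p0 + p1 + p2)          # 3 points consumed this step
        if len(window) == 3:
            result.append(sum(window) // 9)  # integer division (assumed)
    return result


if __name__ == "__main__":
    print(stencil_row([9, 9, 9, 9], [0, 9, 18, 9], [9, 9, 9, 9]))
```

Keeping only three column sums means each input point is read once, which matches a pipeline that accepts three new points per cycle rather than re-reading nine.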
L1 Data Cache for MIPS-core
- CoRAM memory as data memory
  - Data replacements are managed by the control thread
    - When a cache miss occurs, a handling request is issued to the control thread (CT)
[Figure: cache logic (Verilog HDL) and control thread (Python). CoRAM memories 0 and 1 provide the two data ways (D0, D1); Tag0 and Tag1 are each compared (=) with the address, and the result drives the Select input of a MUX that produces Read Data; other signals: Addr, Write Data, Write Enable, Read Enable, Stall; Control Logic is connected to CoRAM Channel 0.]
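The lookup path in the figure can be sketched in software using the cache parameters given earlier in the deck (2-way, 32KB, 64 bytes/line). Everything else here is an assumption for illustration: the class and function names are hypothetical, LRU is assumed because the slides do not state the replacement policy, and the `refill` callback merely stands in for the request issued to the control thread.

```python
# Sketch of a 2-way set-associative cache lookup (parameters from the slides;
# LRU replacement and all names are assumptions for illustration).
CACHE_BYTES = 32 * 1024
WAYS = 2
LINE_BYTES = 64
NUM_SETS = CACHE_BYTES // (WAYS * LINE_BYTES)   # 256 sets


def split_address(addr):
    """Split a byte address into (tag, set index, line offset)."""
    offset = addr % LINE_BYTES
    index = (addr // LINE_BYTES) % NUM_SETS
    tag = addr // (LINE_BYTES * NUM_SETS)
    return tag, index, offset


class TwoWayCache:
    def __init__(self, refill):
        self.refill = refill                      # stands in for the control thread
        # Each set holds up to WAYS tags, least recently used first.
        self.sets = [[] for _ in range(NUM_SETS)]
        self.misses = 0

    def access(self, addr):
        """Return True on a hit, False on a miss (after refilling the line)."""
        tag, index, _ = split_address(addr)
        ways = self.sets[index]
        if tag in ways:                           # Tag0/Tag1 comparison
            ways.remove(tag)
            ways.append(tag)                      # mark as most recently used
            return True
        self.misses += 1
        if len(ways) == WAYS:
            ways.pop(0)                           # evict the LRU way
        self.refill(addr - addr % LINE_BYTES)     # request the whole line
        ways.append(tag)
        return False


if __name__ == "__main__":
    requested = []
    cache = TwoWayCache(refill=requested.append)
    for a in (0, 4, 16384, 0, 32768, 16384):
        cache.access(a)
    print(cache.misses, requested)
```

Routing misses through a callback keeps the hit path purely combinational in spirit (tag compare and select), while the slower line replacement is delegated, as the slide describes for the control thread.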
Evaluation
- Evaluation targets
  - Reference design provided by the contest committee (Ref)
  - 6-stage MIPS-core + L1 cache (6-stage)
  - 6-stage MIPS-core + L1 cache + accelerators (6-stage+ACC)
- Application dataset
  - Dataset provided for the first-round match
- FPGA EDA tools
  - Xilinx Platform Studio 14.6, PlanAhead 14.6
    - Optimization goal: Speed, Optimization effort: High
    - AXI4 Interconnect: 32-bit shared bus (area optimized)
- Compiler for MIPS-core
  - gcc 4.3.3 (-O3)
Performance
- Performance = execution time (not including data transfer time)
- Drastic speedup compared to the reference design
  - 6-stage (MIPS-core + L1 cache) is 3.5 times faster (geometric mean)
  - 6-stage+ACC (MIPS-core + L1 cache + accelerators) is 13.2 times faster on average (geometric mean) and 47.1 times faster at maximum

Relative performance (Ref = 1)
Application   6-stage   6-stage+ACC
310_sort      3.9       3.9
320_mm        1.4       35.2
330_stencil   5.9       47.1
340_spath     4.7       4.7
Gmean         3.5       13.2

Time [sec]
Application   Ref    6-stage   6-stage+ACC
310_sort      14.2   3.6       3.6
320_mm        14.2   9.8       0.4
330_stencil   16.0   2.7       0.3
340_spath     20.8   4.4       4.4
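The reported averages are geometric means of the per-application speedups, which can be rechecked directly. The snippet below assumes the bar labels from the relative-performance chart pair with the applications as listed in the dictionary; the variable and function names are illustrative.

```python
# Recompute the geometric means from the per-application speedups on the slide.
import math

speedup = {
    "6-stage":     {"310_sort": 3.9, "320_mm": 1.4,  "330_stencil": 5.9,  "340_spath": 4.7},
    "6-stage+ACC": {"310_sort": 3.9, "320_mm": 35.2, "330_stencil": 47.1, "340_spath": 4.7},
}


def geometric_mean(values):
    """n-th root of the product, computed via logarithms for stability."""
    values = list(values)
    return math.exp(sum(math.log(v) for v in values) / len(values))


gmeans = {design: round(geometric_mean(per_app.values()), 1)
          for design, per_app in speedup.items()}
print(gmeans)
```

The geometric mean is the usual choice for averaging speedup ratios because it is not dominated by the two very large accelerated results the way an arithmetic mean would be.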
Conclusion
- From the IPSJ SIG-ARC High-Performance Processor Design Contest
- Development of a heterogeneous FPGA-based accelerator with PyCoRAM
  - Heterogeneous system of a MIPS-core and two accelerators
  - Up to 47.1 times faster than the reference design
- The tool-chain and framework are available on GitHub
  - PyCoRAM: http://shtaxxx.github.io/PyCoRAM/
  - Pyverilog: http://shtaxxx.github.io/Pyverilog/