A High Performance Heterogeneous FPGA-based Accelerator with PyCoRAM (Runner Up Award at Digilent...

A High Performance Heterogeneous FPGA-based Accelerator with PyCoRAM

Team: PyCoRAMist

Shinya Takamaeda-Yamazaki Tokyo Institute of Technology

JSPS Research Fellow (DC1)

February 21, 2014 Digilent Design Contest @TED Yokohama

The 1st IPSJ SIG-ARC High-Performance Processor Design Contest (Jan 2014 @Tokyo)

• A competition to develop a fast computing system for the specified applications on the specified platform
• FPGA board: Digilent Atlys
  – FPGA: Xilinx Spartan-6 LX45
  – DRAM: DDR2-800 (1.6GB/s)

2014-02-21 Shinya T-Y. Tokyo Tech

4 Specified Contest Applications

Application    Description                     Requirements for Memory System
310_sort       Integer Sort                    Low Latency
320_mm         Matrix-Matrix Multiplication    High Bandwidth
330_stencil    9-Point Stencil (Integer)       High Bandwidth
340_spath      Shortest Path Search            Low Latency

Suitable for HW Accelerators: Matrix Mult & Stencil
Difficult for HW Accelerators: Sort & Shortest Path

Hybrid System of CPU core + HW Accelerator

How to Implement an Accelerator?

• HDL? NO WAY! It’s so annoying :(
  – Implementing the entire system in HDL is hard, because ...
    • Scheduling logic for computations and memory accesses
      – Double buffering requires complicated logic
      – State machine implementation is annoying and error-prone
  – But we want to define the pipeline design at the cycle level
    • Essential for high performance of FPGA-based accelerators
      – HDL is still a good weapon for writing just the computation logic
      – Modern high-level synthesis tools are still not effective
• Memory abstractions make us happy?
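The double-buffering schedule mentioned above can be illustrated with a small software model. This is only a sketch of the general ping-pong idea, not the team's hardware: the function name `double_buffered_schedule` and the two-buffer numbering are assumptions made for illustration.

```python
def double_buffered_schedule(num_tiles):
    """Return (step, load_buf, compute_buf) for a two-buffer ping-pong schedule.

    In each step one buffer is being filled while the other is being consumed.
    None marks an idle side (pipeline fill on the first step, drain on the last).
    """
    schedule = []
    for step in range(num_tiles + 1):
        load_buf = step % 2 if step < num_tiles else None    # tile `step` arrives
        compute_buf = (step - 1) % 2 if step >= 1 else None   # tile `step - 1` is consumed
        schedule.append((step, load_buf, compute_buf))
    return schedule


schedule = double_buffered_schedule(3)
for step, load_buf, compute_buf in schedule:
    print(f"step {step}: load -> buffer {load_buf}, compute <- buffer {compute_buf}")
```

Even in this toy form the fill and drain steps need special handling, which hints at why the equivalent hand-written state machine in HDL becomes error-prone.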


CoRAM Memory Architecture

CoRAM (Connected RAM) [Chung+,FPGA’11]

• Abstract Memory System for FPGAs
  – High-level abstraction for memory management
    • Decoupling computing logic from memory access behavior
    • Memory access patterns are described in a software model (C language)


[Figure: CoRAM architecture. Labeled blocks: HW Kernels (Computing Logics); CoRAM Memory (Abstracted On-chip Memories); Control Threads (Memory Access Pattern in C); CoRAM Channel (Communication FIFOs (Registers)); Off-chip Memory. Labeled arrows: Read, Write, Manage, Read/Write.]
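The decoupling idea can be mimicked in plain software. The sketch below is not CoRAM itself: it assumes a Python list as stand-in off-chip memory, a small list as the abstracted on-chip memory, and two `queue.Queue` objects playing the role of the channel. All names (`DRAM`, `onchip`, `kernel`, `control_thread`) are hypothetical.

```python
import queue
import threading

DRAM = list(range(32))       # stand-in for off-chip memory (hypothetical data)
BLOCK = 8                    # words moved per transfer (arbitrary choice)

onchip = [0] * BLOCK         # stand-in for one abstracted on-chip memory
to_kernel = queue.Queue()    # channel direction: control thread -> kernel
to_thread = queue.Queue()    # channel direction: kernel -> control thread


def kernel():
    """Computing logic: it never sees a DRAM address."""
    while True:
        count = to_kernel.get()          # block until data is ready (None = stop)
        if count is None:
            break
        to_thread.put(sum(onchip[:count]))


def control_thread():
    """Memory access pattern: decides what to move and when."""
    total = 0
    for base in range(0, len(DRAM), BLOCK):
        onchip[:] = DRAM[base:base + BLOCK]   # off-chip -> on-chip transfer
        to_kernel.put(BLOCK)                  # notify the kernel
        total += to_thread.get()              # wait for its partial result
    to_kernel.put(None)
    return total


worker = threading.Thread(target=kernel)
worker.start()
result = control_thread()
worker.join()
print(result)
```

The kernel only reads its local memory and the channel, so the access pattern can be changed in `control_thread` without touching the computation.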

PyCoRAM [Takamaeda+,CARL’13]

• Python-based implementation of the CoRAM memory architecture for modern FPGA EDKs
  – CoRAM memory abstraction for the EDK development flow
• Key features
  – Control Thread in Python
    • We developed a Python-to-Verilog HLS compiler from scratch
  – AMBA AXI4 for the on-chip interconnect
    • For IP-core based development on Xilinx Platform Studio (XPS)


PyCoRAM Microarchitecture

[Figure: PyCoRAM block diagram. Labeled blocks: User I/O, User Logic, CoRAM Channel, CoRAM Register, Control Thread (FSM), CoRAM Memory with DMAC, CoRAM Stream with DMAC, GPIO.]



[Figure: the same PyCoRAM block diagram, annotated "Modeled in RTL (Verilog HDL)" and "Memory Access Pattern in Python".]

    def calc_sum(times):
        ram = CoramMemory(idx=0, datawidth=32, size=1024)
        channel = CoramChannel(idx=0, datawidth=32)
        addr = 0
        sum = 0
        for i in range(times):
            ram.write(0, addr, 128)   # Transfer (off-chip DRAM to BRAM)
            channel.write(addr)       # Notification to User-logic
            sum += channel.read()     # Wait for Notification from User-logic
            addr += 128 * (32/8)
        print('sum=', sum)            # $display Verilog system task

    calc_sum(8)
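To see what the control thread above does without an FPGA, it can be run against small mock objects. The classes `MockCoramMemory` and `MockCoramChannel` below are hypothetical stand-ins written for this sketch, not PyCoRAM's implementation. Two assumptions are made: `write(ram_addr, ext_addr, size)` copies `size` words from the off-chip byte address `ext_addr`, and the user logic replies with the sum of the transferred block.

```python
DRAM_WORDS = list(range(2048))   # hypothetical off-chip contents, one int per 32-bit word


class MockCoramMemory:
    def __init__(self, idx, datawidth, size):
        self.bytes_per_word = datawidth // 8
        self.data = [0] * size

    def write(self, ram_addr, ext_addr, size):
        # Assumed meaning: copy `size` words from off-chip byte address
        # `ext_addr` into the on-chip memory starting at `ram_addr`.
        first = ext_addr // self.bytes_per_word
        self.data[ram_addr:ram_addr + size] = DRAM_WORDS[first:first + size]


class MockCoramChannel:
    def __init__(self, idx, datawidth, user_logic):
        self.user_logic = user_logic
        self.reply = None

    def write(self, value):
        # The stand-in user logic runs immediately and leaves its reply behind.
        self.reply = self.user_logic(value)

    def read(self):
        return self.reply


def calc_sum(times):
    ram = MockCoramMemory(idx=0, datawidth=32, size=1024)
    channel = MockCoramChannel(idx=0, datawidth=32,
                               user_logic=lambda _addr: sum(ram.data[:128]))
    addr = 0
    total = 0
    for _ in range(times):
        ram.write(0, addr, 128)       # off-chip -> on-chip transfer
        channel.write(addr)           # notify the (mock) user logic
        total += channel.read()       # collect its reply
        addr += 128 * (32 // 8)       # advance by 128 words = 512 bytes
    return total, addr


total, final_addr = calc_sum(8)
print(total, final_addr)
```

With eight iterations the thread walks 8 x 512 = 4096 bytes, that is, the first 1024 words of the mock off-chip array.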



[Figure: the PyCoRAM IP inside the FPGA. Same blocks as in the microarchitecture diagram, with an AXI I/F attached to each DMAC; the PyCoRAM IP connects to the AXI4 Interconnect, which connects to the DRAM Controller.]

FPGA Accelerator for PROCON

• 6-stage MIPS-core + UART loader + Two accelerators
  – XPS automatically synthesizes the AXI4 interconnections
  – Clock frequency: logic and AXI: 100MHz, DRAM: 400MHz

[Figure: system block diagram. An AXI4 Interconnect (32-bit, Shared-bus) links the DRAM Controller with four blocks, each attached through a PyCoRAM Abstraction: L1-D Cache (2-way, 32KB, 64bytes/line) for the 6-stage MIPS-core; Memory Loader (UART); Matrix Multiplication Accelerator; 9-point Stencil Accelerator.]
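The clock and bus figures on this slide allow a quick peak-bandwidth estimate. The arithmetic below is a sketch, assuming at most one 32-bit data beat per AXI clock cycle. The comparison with the "1/4" and "1/12" utilization figures quoted on the accelerator slides is an observation made here, not a claim from the slides.

```python
AXI_CLOCK_HZ = 100e6           # logic and AXI clock (from the slide)
AXI_WIDTH_BYTES = 32 // 8      # 32-bit shared-bus interconnect (from the slide)
DRAM_PEAK_BYTES_PER_S = 1.6e9  # DDR2-800 peak quoted on the platform slide

# Upper bound, assuming one data beat per AXI clock cycle.
axi_peak = AXI_CLOCK_HZ * AXI_WIDTH_BYTES


def to_mb_per_s(bytes_per_s):
    return bytes_per_s / 1e6


print("AXI peak         :", to_mb_per_s(axi_peak), "MB/s")
print("1/4 of DRAM peak :", to_mb_per_s(DRAM_PEAK_BYTES_PER_S / 4), "MB/s")
print("1/12 of DRAM peak:", round(to_mb_per_s(DRAM_PEAK_BYTES_PER_S / 12)), "MB/s")
```

Under that assumption the 32-bit, 100MHz interconnect tops out at the same 400MB/s as one quarter of the DRAM peak.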


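[Figure annotations: percentages shown on the blocks of the system diagram (their assignment to individual blocks is not recoverable from this transcript): 9.8%, 4.5%, 0.4%, 2.5%, 28.1%, 22.5%, 6.3%.]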

Matrix-Matrix Multiplication Accelerator

• Each row of matrices A/B/C is stored in CoRAM memories
  – Data movements between on-chip memory and DRAM are managed by PyCoRAM control threads
  – The pipeline is fully occupied on every cycle
  – Double buffering overlaps computation with the transfer of matrix B
    • Matrix B is transposed in advance by other CoRAM hardware
    • 1/4 of the total memory bandwidth is utilized (about 400MB/s)


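[Figure: matrix multiplication accelerator. Computing Logic (Verilog HDL): CoRAM Memory 0, 1 and 2 holding A, B and C; an 8-stage Multiply Pipeline (×) followed by an adder (+) accumulating into sum; a check sum adder; Control Logic. Control Thread (Python), connected through CoRAM Channel 0.]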
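A software reference model makes the row-oriented data flow easier to follow. The sketch below transposes B once so that every output element is a dot product of two rows, mirroring the "B is transposed in advance" bullet. The function name is hypothetical, and the checksum is assumed to be a plain sum of all elements of C, which the slide does not specify.

```python
def matmul_row_streaming(a, b):
    """Multiply square matrices using only row accesses to A and transposed B."""
    b_t = [list(col) for col in zip(*b)]     # transpose B once, up front
    c = []
    checksum = 0
    for a_row in a:
        c_row = []
        for b_row in b_t:                    # a row of B^T is a column of B
            acc = 0
            for x, y in zip(a_row, b_row):   # one multiply-accumulate per pair
                acc += x * y
            c_row.append(acc)
            checksum += acc                  # assumed checksum: sum of all C elements
        c.append(c_row)
    return c, checksum


c, checksum = matmul_row_streaming([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(c, checksum)
```

Because both operands are read as contiguous rows, each on-chip memory can be refilled with a simple sequential transfer.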

Stencil Computation Accelerator

• 3 arrays for source and 1 array for result, held in CoRAM memories
  – Data movements between on-chip memory and DRAM are managed by PyCoRAM control threads
  – The pipeline consumes 3 data points every cycle
    • Result = (sum of the input data within the latest 3 cycles) / 9
  – The result is written back, then the next array is read
    • 1/12 of the total memory bandwidth is utilized (about 130MB/s)


[Figure: stencil accelerator. Computing Logic (Verilog HDL): CoRAM Memory 0 to 3 providing the source inputs d0, d1, d2 and storing the result rslt; a 41-stage Add-Divide Pipeline (+, /); a check sum adder; Control Logic. Control Thread (Python), connected through CoRAM Channel 0.]
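The "3 points per cycle, sum over the latest 3 cycles, divide by 9" rule can be written as a streaming software model. This is a sketch under assumptions: the three inputs are taken to be three adjacent source rows read column by column, the function name is hypothetical, and Python floor division stands in for the hardware divider, whose rounding behavior the slide does not state.

```python
from collections import deque


def stencil_row(d0, d1, d2):
    """Stream three source rows column by column and return the interior results."""
    window = deque(maxlen=3)               # column sums from the latest 3 cycles
    results = []
    for p0, p1, p2 in zip(d0, d1, d2):     # one "cycle": 3 points, one per row
        window.append(p0 + p1 + p2)
        if len(window) == 3:               # 9 points are now covered
            results.append(sum(window) // 9)
    return results


row = stencil_row([1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12])
print(row)
```

Only the last three column sums need to be kept, which is what lets a fixed-depth pipeline produce one result per cycle once it is full.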

L1 Data Cache for MIPS-core

• CoRAM Memory as Data Memory
  – Data replacements are managed by the control thread
    • When a cache miss occurs, a handling request is issued to the control thread (CT)


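[Figure: L1 data cache. Cache Logic (Verilog HDL): CoRAM Memory 0,1 holding data ways D0 and D1; tag arrays Tag0 and Tag1 with comparators (=); a MUX with Select; pipeline registers (reg); Control Logic. Signals: Addr, Write Data, Write Enable, Read Enable, Read Data, Stall. Control Thread (Python), connected through CoRAM Channel 0.]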
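The cache parameters quoted earlier (2-way, 32KB, 64 bytes/line) fix how an address splits into tag, index and offset. The model below is a sketch, not the team's cache: it assumes round-robin replacement, ignores write-back of dirty lines, and records refill requests in a list where the real design would notify the control thread. All names are hypothetical.

```python
CACHE_BYTES = 32 * 1024
WAYS = 2
LINE_BYTES = 64
NUM_SETS = CACHE_BYTES // (WAYS * LINE_BYTES)   # 256 sets
OFFSET_BITS = LINE_BYTES.bit_length() - 1       # 6 bits
INDEX_BITS = NUM_SETS.bit_length() - 1          # 8 bits


def split(addr):
    """Split a byte address into (tag, set index, line offset)."""
    offset = addr & (LINE_BYTES - 1)
    index = (addr >> OFFSET_BITS) & (NUM_SETS - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset


class TwoWayCache:
    def __init__(self):
        self.tags = [[None] * WAYS for _ in range(NUM_SETS)]
        self.victim = [0] * NUM_SETS     # round-robin replacement (an assumption)
        self.refill_requests = []        # stands in for requests to the control thread

    def access(self, addr):
        """Return True on a hit; on a miss, record a refill request and fill the line."""
        tag, index, _ = split(addr)
        if tag in self.tags[index]:
            return True
        way = self.victim[index]
        self.refill_requests.append((way, addr & ~(LINE_BYTES - 1)))
        self.tags[index][way] = tag
        self.victim[index] = (way + 1) % WAYS
        return False


cache = TwoWayCache()
trace = [0x1000, 0x1004, 0x5000, 0x1000, 0x9000, 0x1000]
hits = [cache.access(a) for a in trace]
print(hits)
```

The three addresses 0x1000, 0x5000 and 0x9000 map to the same set, so the third one evicts the first and the final access misses again.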

Evaluation

• Evaluation targets
  – Reference design provided by the contest committee (Ref)
  – 6-stage MIPS-core + L1 Cache (6-stage)
  – 6-stage MIPS-core + L1 Cache + Accelerators (6-stage+ACC)
• Application dataset
  – Dataset provided for the first-round match
• FPGA EDA tools
  – Xilinx Platform Studio 14.6, PlanAhead 14.6
    • Optimization goal: Speed, Optimization Effort: High
    • AXI4 Interconnect: 32-bit Shared bus (Area optimized)
• Compiler for MIPS-core
  – gcc 4.3.3 (-O3)


Performance

• Performance = execution time (not including data transfer time)
• Drastic speedup compared to the reference design
  – The 6-stage MIPS-core with L1 cache (6-stage) is 3.5 times faster on average
  – The 6-stage MIPS-core with L1 cache and accelerators (6-stage+ACC) is 13.2 times faster on average and 47.1 times faster at maximum


[Chart: Relative Performance (speedup over Ref)]

               310_sort   320_mm   330_stencil   340_spath   Gmean
6-stage        3.9        1.4      5.9           4.7         3.5
6-stage+ACC    3.9        35.2     47.1          4.7         13.2

[Chart: Time [sec]]

               310_sort   320_mm   330_stencil   340_spath
Ref            14.2       14.2     16.0          20.8
6-stage        3.6        9.8      2.7           4.4
6-stage+ACC    3.6        0.4      0.3           4.4
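The Gmean figures can be cross-checked from the per-application speedups. The values below are read from the Relative Performance chart. Which of the two middle bars belongs to 320_mm and which to 330_stencil is an interpretation of the chart layout, but the geometric mean does not depend on that order.

```python
import math

# Speedups over the reference design, as read from the chart.
speedup_6stage = {"310_sort": 3.9, "320_mm": 1.4, "330_stencil": 5.9, "340_spath": 4.7}
speedup_acc = {"310_sort": 3.9, "320_mm": 35.2, "330_stencil": 47.1, "340_spath": 4.7}


def gmean(values):
    """Geometric mean computed through logarithms."""
    values = list(values)
    return math.exp(sum(math.log(v) for v in values) / len(values))


gmean_6stage = gmean(speedup_6stage.values())
gmean_acc = gmean(speedup_acc.values())
print(round(gmean_6stage, 1), round(gmean_acc, 1))
```

Both rounded results agree with the Gmean bars, and the largest single speedup is the 47.1 quoted as the maximum.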

Conclusion

• From the IPSJ SIG-ARC High-Performance Processor Design Contest
• Development of a heterogeneous FPGA-based accelerator with PyCoRAM
  – Heterogeneous system of a MIPS-core and two accelerators
  – Up to 47.1 times faster than the reference design
• The tool-chain and framework are available on GitHub
  – PyCoRAM: http://shtaxxx.github.io/PyCoRAM/
  – Pyverilog: http://shtaxxx.github.io/Pyverilog/

