1
Efficient Weight Reuse for Large LSTMs
Zhiqiang Que1, Thomas Nugent1, Shuanglong Liu1, Li Tian3, Xinyu Niu2, Yongxin Zhu3, Wayne Luk1
1Imperial College London, 2Corerain Technologies Ltd., 3Chinese Academy of Sciences
2
Outline
➢ Motivation
➢ Design & Implementation
    ➢ Stall-free hardware architecture
    ➢ Blocking-batching strategy
    ➢ SBE FPGA Accelerator
➢ Evaluation
    ➢ Batch size and Blocking number
    ➢ Performance and Efficiency
➢ Future Work and Summary
3
Speech Recognition
Google Research Blog, August 2015
4
Video Analysis & Speech Recognition
J. Donahue et al., "LRCN"
Google Research Blog, August 2015
5
• COMAC: on-board flight testing
6
• COMAC: on-board flight testing
• ECG Anomaly Detection
Oh, Shu Lih et al., "Automated diagnosis of arrhythmia…"
7
LSTM Overview
[Figure: LSTM cell, with input xt, previous state (ht-1, ct-1) and new state (ht, ct)]
Most of the calculation within an LSTM cell lies in the matrix-vector multiplications (M x V)
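For reference, the standard LSTM cell equations; each of the four gates requires the matrix-vector products W x_t and U h_{t-1}, which is where most of the computation lies:

```latex
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```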
8
Motivation: Challenges
• C1: RNN data dependence: stalls in execution
• C2: Large RNNs: weights stored in external DRAM
9
Contributions
• Blocking-batching strategy (addresses C2): reuse weights for large LSTM systems
• Stall-free architecture (addresses C1): reorganise multiplications to enhance throughput
• Zynq and Virtex-7 implementations:
    - 23.7x speedup, 208x less energy than CPU
    - 1.3x speedup, 19.2x less energy than GPU
11
Stall-free hardware architecture

Traditional design: [Figure: weights stream in for {xt-1, ht-1} and {xt, ht}; the dependency of ht on ht-1 stalls execution]

M x V:
for i < height:
  for j < width:
    r[i] += w[i][j] * x[j]

New design: [Figure: the reordered M x V starts as soon as each input element arrives, hiding the wait for ht-1] (for simplicity, only one element is involved)

New M x V:
for j < width:
  for i < height:
    r[i] += w[i][j] * x[j]
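The two loop orders compute identical results; a minimal Python sketch (illustrative only, not the hardware datapath) of the interchange:

```python
def mv_row_major(w, x):
    # Conventional order: each output r[i] needs the whole input vector
    # before it completes, so execution stalls until h_{t-1} is ready.
    r = [0.0] * len(w)
    for i in range(len(w)):
        for j in range(len(w[0])):
            r[i] += w[i][j] * x[j]
    return r

def mv_col_major(w, x):
    # Interchanged order: column j only touches element x[j], so partial
    # sums accumulate as soon as each input element arrives (stall-free).
    r = [0.0] * len(w)
    for j in range(len(w[0])):
        for i in range(len(w)):
            r[i] += w[i][j] * x[j]
    return r

w = [[1.0, 2.0], [3.0, 4.0]]
x = [5.0, 6.0]
assert mv_row_major(w, x) == mv_col_major(w, x) == [17.0, 39.0]
```

Because addition is reordered but never reassociated across different outputs, both orders produce the same result vector.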
13
Blocking and Batching
[Figure: weight matrices Wx and Wh partitioned into blocks b0–b3; the inputs {xt, ht-1} from multiple samples are batched so that each fetched weight block is reused across the whole batch]
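A schematic Python sketch of the blocking-batching idea (block and batch sizes here are illustrative): each weight block is fetched from external memory once and reused across every sample in the batch, so off-chip traffic per sample shrinks by the batch size.

```python
def blocked_batched_mv(w, xs, block_rows):
    """Multiply weight matrix w by a batch of input vectors xs,
    one row block at a time, reusing each fetched block across
    the whole batch."""
    height = len(w)
    results = [[0.0] * height for _ in xs]
    for r0 in range(0, height, block_rows):
        block = w[r0:r0 + block_rows]        # one off-chip fetch per block
        for b, x in enumerate(xs):           # reuse the block for every sample
            for i, row in enumerate(block):
                results[b][r0 + i] = sum(wv * xv for wv, xv in zip(row, x))
    return results

w = [[1, 2], [3, 4], [5, 6], [7, 8]]
xs = [[1, 1], [2, 0]]                        # batch of two input vectors
assert blocked_batched_mv(w, xs, 2) == [[3, 7, 11, 15], [2, 6, 10, 14]]
```

With batch size B, each block crosses the memory interface once instead of B times, which is the weight reuse the strategy targets.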
14
3 Scenarios: Case 1
1. The hidden-unit weights can be stored in one block
[Figure: blocks b0–b3 over Wx and Wh; the hidden-unit weights Wh fit entirely in one block]
15
3 Scenarios: Case 2
2. The hidden-unit weights can be stored in two blocks
[Figure: blocks b0–b3 over Wx and Wh; the hidden-unit weights Wh span two blocks]
16
3 Scenarios: Case 3
3. The hidden-unit weights can be stored in more than two blocks
[Figure: blocks b0–b3 over Wx and Wh; the hidden-unit weights Wh span more than two blocks]
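The split into the three scenarios depends only on how many blocks the hidden-unit weights Wh occupy; a hypothetical helper sketching the classification (function and parameter names are illustrative, not from the paper):

```python
import math

def wh_case(wh_rows, rows_per_block):
    """Classify the scheduling scenario by the number of weight
    blocks needed to hold the hidden-unit weights Wh."""
    n_blocks = math.ceil(wh_rows / rows_per_block)
    if n_blocks == 1:
        return "case1"   # Wh fits in one block
    if n_blocks == 2:
        return "case2"   # Wh spans two blocks
    return "case3"       # Wh spans more than two blocks

assert wh_case(100, 128) == "case1"
assert wh_case(200, 128) == "case2"
assert wh_case(400, 128) == "case3"
```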
18
SBE Architecture Details
[Figure: SBE datapath: Multipliers feeding AccPEs (multiply, accumulate), with Inter-Block and Exter-Block accumulation paths and a BB-FIFO]
19
System Overview
21
Batch Size
22
[Figure: Throughput and Throughput/BRAM_Num versus Blocking Number; throughput values range from 0.2*Pm up to 1.0*Pm (with 0.84*Pm and 0.95*Pm in between), and a marked "sweet point" balances throughput against BRAM usage]
24
CPU, GPU vs FPGA
Application: Activity Recognition (LRCN)
CPU & GPU baselines: TensorFlow

                  vs CPU   vs GPU
Speedup           23.7x    1.3x
Energy reduction  208x     19.2x
25
Comparison with other FPGA designs
Fastest
Most Efficient
27
Future Work
• Combining the proposed SBE with pruning methods
• Deploying very large LSTM models using multiple FPGAs
28
Summary
• Blocking-batching strategy: reuse weights for large LSTM systems
• Stall-free architecture: reorganise multiplications to enhance throughput
• Zynq and Virtex-7 implementations:
    - 23.7x speedup, 208x less energy than CPU
    - 1.3x speedup, 19.2x less energy than GPU