Efficient Weight Reuse for Large LSTMs

Zhiqiang Que 1, Thomas Nugent 1, Shuanglong Liu 1, Li Tian 3, Xinyu Niu 2, Yongxin Zhu 3, Wayne Luk 1

1 Imperial College London, 2 Corerain Technologies Ltd., 3 Chinese Academy of Sciences

Page 1

Efficient Weight Reuse for Large LSTMs

Zhiqiang Que1, Thomas Nugent1, Shuanglong Liu1, Li Tian3, Xinyu Niu2, Yongxin Zhu3, Wayne Luk1

1Imperial College London, 2Corerain Technologies Ltd., 3Chinese Academy of Sciences

Page 2

Outline

➢ Motivation

➢ Design & Implementation

  ➢ Stall-free hardware architecture

  ➢ Blocking-batching strategy

  ➢ SBE FPGA Accelerator

➢ Evaluation

  ➢ Batch size and blocking number

  ➢ Performance and efficiency

➢ Future Work and Summary

Page 3

Speech Recognition

Google Research Blog, August 2015

Page 4

J. Donahue et al., "LRCN"

Video Analysis and Speech Recognition

Google Research Blog, August 2015

Page 5

• COMAC: on-board flight testing

Page 6

• COMAC: on-board flight testing

• ECG Anomaly Detection

Oh, Shu Lih et al., "Automated diagnosis of arrhythmia…"

Page 7

LSTM Overview

[Figure: LSTM cell, with input x_t, previous outputs h_t-1 and c_t-1, and outputs h_t and c_t.]

Most of the computation within LSTM cells lies in matrix-vector multiplication (MV).
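To see why MV dominates, one LSTM step can be sketched in NumPy. This is an illustrative sketch, not the paper's implementation; the gate ordering (i, f, g, o) and all variable names are assumptions.

```python
import numpy as np

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step. W: (4H, X), U: (4H, H), b: (4H,).
    The pre-activations of all four gates come from the two
    matrix-vector products below, which dominate the cost."""
    H = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b          # the expensive MV operations
    i = 1 / (1 + np.exp(-z[0:H]))         # input gate
    f = 1 / (1 + np.exp(-z[H:2*H]))       # forget gate
    g = np.tanh(z[2*H:3*H])               # candidate cell state
    o = 1 / (1 + np.exp(-z[3*H:4*H]))     # output gate
    c_t = f * c_prev + i * g              # element-wise: cheap next to the MVs
    h_t = o * np.tanh(c_t)
    return h_t, c_t
```

Everything outside the two products `W @ x_t` and `U @ h_prev` is element-wise, so accelerating the MVs accelerates the cell.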

Page 8

Motivation: Challenges

• C1: RNN data dependence, which stalls execution

• C2: Large RNNs, whose weights must be stored in external DRAM

Page 9

Contributions

• Blocking-batching strategy (addresses C2): reuses weights for large LSTM systems

• Stall-free architecture (addresses C1): reorganises multiplications to enhance throughput

• Zynq and Virtex-7 implementations: 23.7x the speed and 1/208 the energy of a CPU; 1.3x the speed and 1/19.2 the energy of a GPU

Page 10

Outline


Page 11

Stall-free hardware architecture

[Figure: traditional design vs. new design. In the traditional design, computing h_t from the weights and {x_t, h_t-1} must wait on the dependency from the previous time step; the new design removes this stall. For simplicity, only one element is involved.]

M x V:
  for i < height:
    for j < width:
      r[i] += w[i][j] * x[j]

New M x V:
  for j < width:
    for i < height:
      r[i] += w[i][j] * x[j]
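The two loop orders above can be sketched in Python. Both compute the same result, but the interchanged version reads only one input element x[j] per outer iteration, so computation can begin before the whole vector (e.g. h_t-1) is available. This is a software sketch of the idea, not the paper's hardware.

```python
def mv_row_order(w, x):
    # Traditional order: result element r[i] is only final after the
    # inner loop has consumed the entire input vector x.
    height, width = len(w), len(w[0])
    r = [0.0] * height
    for i in range(height):
        for j in range(width):
            r[i] += w[i][j] * x[j]
    return r

def mv_col_order(w, x):
    # Interchanged order: outer iteration j touches only x[j], so each
    # element of x can be consumed as soon as it is produced.
    height, width = len(w), len(w[0])
    r = [0.0] * height
    for j in range(width):
        for i in range(height):
            r[i] += w[i][j] * x[j]
    return r
```

Since addition order is the only difference, the interchange changes when inputs are needed, not what is computed.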

Page 12

Outline


Page 13

Blocking and Batching

[Figure: the weight matrices W_x and W_h are partitioned into blocks b0–b3; the inputs {x_t, h_t-1} are batched so that each weight block is reused across the batch.]
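The blocking-batching idea can be sketched as follows: fetch each weight block once, then reuse it for every sample in the batch, amortising the external-memory traffic of a large weight matrix. This is a minimal NumPy sketch with row-wise blocking for illustration, not the paper's SBE implementation; the function name and blocking scheme are assumptions.

```python
import numpy as np

def blocked_batched_mv(W, X, n_blocks):
    """Apply W (R x C) to a batch X (batch x C) one row block at a time.
    Each block is fetched once (one DRAM transfer in hardware) and
    reused across the whole batch before the next block is loaded."""
    R = W.shape[0]
    rows_per_block = -(-R // n_blocks)                 # ceil division
    out = np.zeros((X.shape[0], R))
    for b in range(n_blocks):
        lo = b * rows_per_block
        hi = min(lo + rows_per_block, R)
        block = W[lo:hi]                               # fetch block once
        for s in range(X.shape[0]):                    # reuse across batch
            out[s, lo:hi] = block @ X[s]
    return out
```

With B samples per batch, each weight block is transferred once instead of B times, which is the reuse the strategy targets.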

Page 14

3 Scenarios: Case 1

1. The hidden-unit weights can be stored in one block

[Figure: W_x and W_h partitioned into blocks b0–b3.]

Page 15

3 Scenarios: Case 2

2. The hidden-unit weights can be stored in two blocks

[Figure: W_x and W_h partitioned into blocks b0–b3.]

Page 16

3 Scenarios: Case 3

3. The hidden-unit weights can be stored in more than two blocks

[Figure: W_x and W_h partitioned into blocks b0–b3.]

Page 17

Outline


Page 18

SBE Architecture Details

[Figure: SBE datapath. Multipliers feed an accumulating PE (AccPE: multiply, add); Inter-Block and Exter-Block results are connected through the BB-FIFO.]

Page 19

System Overview

Page 20

Outline


Page 21

Batching Size

Page 22

Blocking Number vs. Throughput

[Figure: throughput (labelled points at 1.0*Pm, 0.95*Pm, 0.84*Pm and 0.2*Pm) and throughput/BRAM_Num plotted against blocking number; a "sweet point" marks the best trade-off.]

Page 23

Outline


Page 24

CPU, GPU vs. FPGA

Application: Activity Recognition (LRCN). CPU and GPU baselines run TensorFlow.

[Table: 23.7x speedup and 208x energy reduction vs. CPU; 1.3x speedup and 19.2x energy reduction vs. GPU.]

Page 25

Comparison with other FPGA designs

[Figure: comparison table; the proposed design is labelled "Fastest" and "Most Efficient".]

Page 26

Outline


Page 27

Future Work

• Combining the proposed SBE with pruning methods

• Deploying very large LSTM models across multiple FPGAs

Page 28

Summary

• Blocking-batching strategy: reuses weights for large LSTM systems

• Stall-free architecture: reorganises multiplications to enhance throughput

• Zynq and Virtex-7 implementations: 23.7x the speed and 1/208 the energy of a CPU; 1.3x the speed and 1/19.2 the energy of a GPU