1
Efficient Weight Reuse for Large LSTMs
Zhiqiang Que1, Thomas Nugent1, Shuanglong Liu1, Li Tian3, Xinyu Niu2, Yongxin Zhu3, Wayne Luk1
1Imperial College London, 2Corerain Technologies Ltd., 3Chinese Academy of Sciences
2
Outline
➢ Motivation
➢ Design & Implementation
    ➢ Stall-free hardware architecture
    ➢ Blocking-batching strategy
    ➢ SBE FPGA Accelerator
➢ Evaluation
    ➢ Batch size and Blocking number
    ➢ Performance and Efficiency
➢ Future Work and Summary
3
Speech Recognition
Google Research Blog, August 2015
4
Video Analysis & Speech Recognition
J. Donahue et al., "LRCN"
Google Research Blog, August 2015
5
• COMAC: on-board flight testing
6
• COMAC: on-board flight testing
• ECG Anomaly Detection
Oh, Shu Lih et al., "Automated diagnosis of arrhythmia…"
7
LSTM Overview
[Figure: LSTM cell, with input xt, previous state (ht-1, ct-1) and new state (ht, ct)]
Most of the calculation within an LSTM cell lies in the matrix-vector multiplications (M x V)
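For reference, the standard LSTM cell equations; each of the four gates requires the matrix-vector products W x_t and U h_{t-1}, which is where most of the computation lies:

```latex
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```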
8
Motivation: Challenges
• C1: RNN data dependence: stalls in execution
• C2: Large RNNs: weights stored in external DRAM
9
Contributions
• Blocking-batching strategy (addresses C2): reuse weights for large LSTM systems
• Stall-free architecture (addresses C1): reorganise multiplications to enhance throughput
• Zynq and Virtex-7 implementations:
    - 23.7x speedup, 208x less energy than CPU
    - 1.3x speedup, 19.2x less energy than GPU
11
Stall-free hardware architecture

Traditional design: [Figure: weights stream in for {xt-1, ht-1} and {xt, ht}; the dependency of ht on ht-1 stalls execution]

M x V:
for i < height:
  for j < width:
    r[i] += w[i][j] * x[j]

New design: [Figure: the reordered M x V starts as soon as each input element arrives, hiding the wait for ht-1] (for simplicity, only one element is involved)

New M x V:
for j < width:
  for i < height:
    r[i] += w[i][j] * x[j]
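The two loop orders compute identical results; a minimal Python sketch (illustrative only, not the hardware datapath) of the interchange:

```python
def mv_row_major(w, x):
    # Conventional order: each output r[i] needs the whole input vector
    # before it completes, so execution stalls until h_{t-1} is ready.
    r = [0.0] * len(w)
    for i in range(len(w)):
        for j in range(len(w[0])):
            r[i] += w[i][j] * x[j]
    return r

def mv_col_major(w, x):
    # Interchanged order: column j only touches element x[j], so partial
    # sums accumulate as soon as each input element arrives (stall-free).
    r = [0.0] * len(w)
    for j in range(len(w[0])):
        for i in range(len(w)):
            r[i] += w[i][j] * x[j]
    return r

w = [[1.0, 2.0], [3.0, 4.0]]
x = [5.0, 6.0]
assert mv_row_major(w, x) == mv_col_major(w, x) == [17.0, 39.0]
```

Because addition is reordered but never reassociated across different outputs, both orders produce the same result vector.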
13
Blocking and Batching
[Figure: weight matrices Wx and Wh partitioned into blocks b0–b3; the inputs {xt, ht-1} from multiple samples are batched so that each fetched weight block is reused across the whole batch]
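A schematic Python sketch of the blocking-batching idea (block and batch sizes here are illustrative): each weight block is fetched from external memory once and reused across every sample in the batch, so off-chip traffic per sample shrinks by the batch size.

```python
def blocked_batched_mv(w, xs, block_rows):
    """Multiply weight matrix w by a batch of input vectors xs,
    one row block at a time, reusing each fetched block across
    the whole batch."""
    height = len(w)
    results = [[0.0] * height for _ in xs]
    for r0 in range(0, height, block_rows):
        block = w[r0:r0 + block_rows]        # one off-chip fetch per block
        for b, x in enumerate(xs):           # reuse the block for every sample
            for i, row in enumerate(block):
                results[b][r0 + i] = sum(wv * xv for wv, xv in zip(row, x))
    return results

w = [[1, 2], [3, 4], [5, 6], [7, 8]]
xs = [[1, 1], [2, 0]]                        # batch of two input vectors
assert blocked_batched_mv(w, xs, 2) == [[3, 7, 11, 15], [2, 6, 10, 14]]
```

With batch size B, each block crosses the memory interface once instead of B times, which is the weight reuse the strategy targets.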
14
3 Scenarios: Case 1
1. The hidden-unit weights can be stored in one block
[Figure: blocks b0–b3 over Wx and Wh; the hidden-unit weights Wh fit entirely in one block]
15
3 Scenarios: Case 2
2. The hidden-unit weights can be stored in two blocks
[Figure: blocks b0–b3 over Wx and Wh; the hidden-unit weights Wh span two blocks]
16
3 Scenarios: Case 3
3. The hidden-unit weights can be stored in more than two blocks
[Figure: blocks b0–b3 over Wx and Wh; the hidden-unit weights Wh span more than two blocks]
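The split into the three scenarios depends only on how many blocks the hidden-unit weights Wh occupy; a hypothetical helper sketching the classification (function and parameter names are illustrative, not from the paper):

```python
import math

def wh_case(wh_rows, rows_per_block):
    """Classify the scheduling scenario by the number of weight
    blocks needed to hold the hidden-unit weights Wh."""
    n_blocks = math.ceil(wh_rows / rows_per_block)
    if n_blocks == 1:
        return "case1"   # Wh fits in one block
    if n_blocks == 2:
        return "case2"   # Wh spans two blocks
    return "case3"       # Wh spans more than two blocks

assert wh_case(100, 128) == "case1"
assert wh_case(200, 128) == "case2"
assert wh_case(400, 128) == "case3"
```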
18
SBE Architecture Details
[Figure: SBE datapath: Multipliers feeding AccPEs (multiply, accumulate), with Inter-Block and Exter-Block accumulation paths and a BB-FIFO]
19
System Overview
21
Batch Size
22
[Figure: Throughput and Throughput/BRAM_Num versus Blocking Number; throughput values range from 0.2*Pm up to 1.0*Pm (with 0.84*Pm and 0.95*Pm in between), and a marked "sweet point" balances throughput against BRAM usage]
24
CPU, GPU vs FPGA
Application: Activity Recognition (LRCN)
CPU & GPU baselines: TensorFlow

                  vs CPU   vs GPU
Speedup           23.7x    1.3x
Energy reduction  208x     19.2x
25
Comparison with other FPGA designs
Fastest
Most Efficient
27
Future Work
• Combining the proposed SBE with pruning methods
• Deploying very large LSTM models using multiple FPGAs
28
Summary
• Blocking-batching strategy: reuse weights for large LSTM systems
• Stall-free architecture: reorganise multiplications to enhance throughput
• Zynq and Virtex-7 implementations:
    - 23.7x speedup, 208x less energy than CPU
    - 1.3x speedup, 19.2x less energy than GPU