Learning Compact Recurrent Neural Networks with Block-Term Tensor Decomposition
Jinmian Ye1, Linnan Wang2, Guangxi Li1, Di Chen1, Shandian Zhe3, Xinqi Chu4, Zenglin Xu1
1SMILE Lab, University of Electronic Science and Technology of China 2Brown University 3University of Utah 4Xjera Labs pte.ltd
Introduction

RNNs are powerful sequence modeling tools. We introduce a compact architecture based on Block-Term tensor decomposition to address the redundancy problem.

Challenges: 1) traditional RNNs suffer from an excess of parameters; and 2) the Tensor-Train RNN has limited representation ability and flexibility due to the difficulty of finding the optimal setting of its ranks.

Contributions: 1) a new sparsely connected RNN architecture based on Block-Term tensor decomposition; and 2) better performance while maintaining fewer parameters.
Figure: Architecture of BT-LSTM. The redundant dense connections between the input and the hidden state are replaced by a low-rank BT representation.
It should be noted that we only substitute the input-to-hidden matrix multiplication while retaining the current design philosophy of the LSTM:

(f′_t, i′_t, c̃′_t, o′_t) = W · x_t + U · h_{t−1} + b   (1)

(f_t, i_t, c̃_t, o_t) = (σ(f′_t), σ(i′_t), tanh(c̃′_t), σ(o′_t))   (2)
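As a reading aid, the two equations can be sketched as one NumPy step. The cell-state and output updates (c_t = f_t ⊙ c_{t−1} + i_t ⊙ c̃_t, h_t = o_t ⊙ tanh(c_t)) are the standard LSTM rules the poster says are retained; all names and sizes below are illustrative, not the authors' code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, U, b):
    # Eq. (1): one fused affine map produces all four pre-activations.
    # In BT-LSTM only the W @ x term would be replaced by the Block-Term form.
    z = W @ x + U @ h + b
    f_, i_, g_, o_ = np.split(z, 4)
    # Eq. (2): gate nonlinearities.
    f, i, g, o = sigmoid(f_), sigmoid(i_), np.tanh(g_), sigmoid(o_)
    # Standard LSTM state update (unchanged in BT-LSTM).
    c_new = f * c + i * g
    h_new = o * np.tanh(c_new)
    return h_new, c_new

# Toy sizes (hypothetical): input dim 3, hidden dim 4, so z has 4*4 entries.
rng = np.random.default_rng(0)
W, U, b = rng.normal(size=(16, 3)), rng.normal(size=(16, 4)), np.zeros(16)
h, c = lstm_step(rng.normal(size=3), np.zeros(4), np.zeros(4), W, U, b)
```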
Analysis

Comparison of the complexity and memory usage of the vanilla RNN, the Tensor-Train RNN (TT-RNN), and our BT-RNN. In this table, the weight matrix has shape I × J, and J_max = max_k(J_k), k ∈ [1, d].

Method            Time               Memory
RNN forward       O(IJ)              O(IJ)
RNN backward      O(IJ)              O(IJ)
TT-RNN forward    O(dIR²J_max)       O(RI)
TT-RNN backward   O(d²IR⁴J_max)      O(R³I)
BT-RNN forward    O(NdIR^d J_max)    O(R^d I)
BT-RNN backward   O(Nd²IR^d J_max)   O(R^d I)
Total #Parameters: P_BTD = N(∑_{k=1}^d I_k J_k R + R^d)
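A quick sanity check of this parameter count, as a minimal sketch: it assumes equal Tucker-ranks R across modes, and the factorizations 4096 = 16·16·16 and 256 = 8·8·4 are hypothetical choices, not taken from the poster.

```python
def btd_params(shape_in, shape_out, rank, n_blocks=1):
    """Parameter count of a Block-Term decomposed layer:
    P_BTD = N * (sum_k I_k * J_k * R + R^d), with equal Tucker-ranks R."""
    factors = sum(i * j * rank for i, j in zip(shape_in, shape_out))
    cores = rank ** len(shape_in)
    return n_blocks * (factors + cores)

# Setting from the figure: I = 4096, J = 256, N = 1.
dense = 4096 * 256                                    # vanilla RNN: 1048576
compact = btd_params((16, 16, 16), (8, 8, 4), rank=2) # one hypothetical split
print(dense, compact)
```

With d = 3 and R = 2 this yields a few hundred parameters in place of about a million, matching the orders-of-magnitude compression the figure shows.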
Figure: The number of parameters w.r.t. core-order d and Tucker-rank R (R = 1, 2, 3, 4), in the setting I = 4096, J = 256, N = 1; the vanilla RNN contains I × J = 1048576 parameters. When d is small, the first part ∑_{k=1}^d I_k J_k R contributes most of the parameters; when d is large, the second part R^d dominates. Hence the number of parameters first drops sharply, then rises gradually as d grows (except for the case R = 1).
Method

Background 1: Tensor Product. Given A ∈ R^{I_1×···×I_d} and B ∈ R^{J_1×···×J_d} with I_k = J_k. To simplify, i^−_k denotes the indices (i_1, ..., i_{k−1}), while i^+_k denotes (i_{k+1}, ..., i_d):

(A •_k B)_{i^−_k, i^+_k, j^−_k, j^+_k} = ∑_{p=1}^{I_k} A_{i^−_k, p, i^+_k} B_{j^−_k, p, j^+_k}   (3)
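Equation (3) can be sketched in one NumPy call: np.tensordot contracts the shared mode k and orders the remaining axes exactly as the definition does (A's remaining modes first, then B's). The 3-order shapes below are toy values.

```python
import numpy as np

def tensor_product(A, B, k):
    """Eq. (3): contract mode k of A with mode k of B.
    Result indices are ordered (i_k^-, i_k^+, j_k^-, j_k^+)."""
    return np.tensordot(A, B, axes=([k], [k]))

# Hypothetical shapes; the shared mode is I_2 = J_2 = 3.
rng = np.random.default_rng(0)
A, B = rng.normal(size=(2, 3, 4)), rng.normal(size=(5, 3, 6))
C = tensor_product(A, B, k=1)   # shape (2, 4, 5, 6)
```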
Background 2: Block-Term Decomposition
Figure: X = ∑_{n=1}^N G_n •_1 A^(1)_n •_2 A^(2)_n •_3 ··· •_d A^(d)_n
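To make the decomposition concrete, here is a minimal sketch that rebuilds X from its block terms. `mode_n_product` is a hypothetical helper name for the standard Tucker mode-n tensor-matrix product, and the d = 2 shapes are illustrative.

```python
import numpy as np

def mode_n_product(T, M, mode):
    # Multiply tensor T along `mode` by matrix M of shape (dim_out, rank):
    # contract T's axis `mode` with M's rank axis, then restore axis order.
    out = np.tensordot(T, M, axes=([mode], [1]))
    return np.moveaxis(out, -1, mode)

def btd_reconstruct(cores, factors):
    # X = sum_n  G_n •1 A_n^(1) •2 ... •d A_n^(d)
    X = 0
    for G, As in zip(cores, factors):
        T = G
        for mode, A in enumerate(As):
            T = mode_n_product(T, A, mode)
        X = X + T
    return X

# d = 2, N = 2 toy example: each block term equals A1 @ G @ A2.T.
rng = np.random.default_rng(0)
cores = [rng.normal(size=(2, 3)) for _ in range(2)]
factors = [(rng.normal(size=(4, 2)), rng.normal(size=(5, 3))) for _ in range(2)]
X = btd_reconstruct(cores, factors)   # shape (4, 5)
```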
BT-RNN Model: 1) tensorize W and x; 2) decompose W with BTD; 3) compute W · x; 4) train from scratch via BPTT!
(a) vector to tensor   (b) matrix to tensor

Figure: Step 1: Tensorization operation in the case of 3-order tensors.

Figure: Steps 2 & 3: Decomposing the weight matrix and computing y = Wx.
Implementation: W ∈ R^{J×I} is tensorized to W ∈ R^{J_1×I_1×J_2×···×J_d×I_d}, where I = I_1 I_2 ··· I_d and J = J_1 J_2 ··· J_d. G_n ∈ R^{R_1×···×R_d} denotes the core tensor, and A^(k)_n ∈ R^{I_k×J_k×R_k} denotes the factor tensors.

W · x_t = ∑_{n=1}^N X_t •_1 A^(1)_n •_2 ··· •_d A^(d)_n •_{1,2,...,d} G_n   (4)
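Equation (4) can be sketched for d = 2 with one einsum per block. This is an illustrative reading of the formula, not the authors' implementation; the check against the dense W it implicitly represents is the point.

```python
import numpy as np

def bt_matvec(x_tensor, cores, factors):
    # Eq. (4) for d = 2: contract the tensorized input X_t with the factor
    # tensors A_n^(k) of shape (I_k, J_k, R_k) and the core G_n, block by
    # block, without ever materializing the full J x I weight matrix W.
    y = 0
    for G, (A1, A2) in zip(cores, factors):
        y = y + np.einsum('ab,acr,bds,rs->cd', x_tensor, A1, A2, G)
    return y.reshape(-1)

# One block, toy sizes (hypothetical), checked against the dense W.
rng = np.random.default_rng(0)
I1, I2, J1, J2, R1, R2 = 3, 4, 2, 5, 2, 2
A1, A2 = rng.normal(size=(I1, J1, R1)), rng.normal(size=(I2, J2, R2))
G = rng.normal(size=(R1, R2))
W = np.einsum('acr,bds,rs->cdab', A1, A2, G).reshape(J1 * J2, I1 * I2)
x = rng.normal(size=I1 * I2)
y = bt_matvec(x.reshape(I1, I2), [G], [(A1, A2)])
```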
Conclusion

We proposed a Block-Term RNN architecture to address the redundancy problem in RNNs. Experimental results on three challenging tasks show that our BT-RNN architecture not only uses several orders of magnitude fewer parameters but also improves model performance over the standard LSTM and the TT-LSTM.
References
[1] L. De Lathauwer. Decompositions of a higher-order tensor in block terms, Part II: Definitions and uniqueness. SIAM SIMAX, 2008.
[2] Y. Yang, D. Krompass, and V. Tresp. Tensor-Train recurrent neural networks for video classification. In ICML, 2017.
[3] K. Gregor, I. Danihelka, A. Graves, D. J. Rezende, and D. Wierstra. DRAW: A recurrent neural network for image generation. In ICML, 2015.
[4] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: A neural image caption generator. In CVPR, 2015.
Experiments

Tasks: 1. Action Recognition in Videos; 2. Image Generation; 3. Image Captioning; 4. Sensitivity Analysis on Hyper-Parameters.
Datasets: 1. UCF11 YouTube Action dataset; 2. MNIST; 3. MSCOCO.
Figure: Training loss of different RNN models on the Action Recognition task, trained on UCF11. Legend: LSTM (58.9M params, top accuracy 0.697); BT-LSTM R=1 (compression ratio 81693×, top 0.784); BT-LSTM R=2 (CR 40069×, top 0.803); BT-LSTM R=4 (CR 17388×, top 0.853); TT-LSTM R=4 (CR 17554×, top 0.781).
Method                 Accuracy
Orthogonal Approaches
  Original             0.712
  Spatial-temporal     0.761
  Visual Attention     0.850
RNN Approaches
  LSTM                 0.697
  TT-LSTM              0.796
  BT-LSTM              0.853

Table: State-of-the-art results on the UCF11 dataset reported in the literature, in comparison with our best model.
Task 1: We use a single LSTM cell as the model architecture to evaluate BT-LSTM against LSTM and TT-LSTM. The video frames are fed directly into the LSTM cell. The figure on the left shows the training loss of the different models. From these experiments, we claim that our BT-LSTM offers: 1) a roughly 8×10⁴-fold parameter reduction; 2) faster convergence; and 3) better model efficiency.
(a) LSTM, #Params: 1.8M   (b) BT-LSTM, #Params: 1184
Task 2: In this experiment, we use an encoder-decoder architecture to generate images. We substitute only the encoder network to qualitatively compare LSTM and BT-LSTM. The results show that both LSTM and BT-LSTM generate comparable images.
(c) LSTM: A train traveling down tracks next to a forest. TT-LSTM: A train traveling down train tracks next to a forest. BT-LSTM: A train traveling through a lush green forest.
(d) LSTM: A group of people standing next to each other. TT-LSTM: A group of men standing next to each other. BT-LSTM: A group of people posing for a photo.
(e) LSTM: A man and a dog are standing in the snow. TT-LSTM: A man and a dog are in the snow. BT-LSTM: A man and a dog playing with a frisbee.
(f) LSTM: A large elephant standing next to a baby elephant. TT-LSTM: An elephant walking down a dirt road near trees. BT-LSTM: A large elephant walking down a road with cars.
Task 3: Using the architecture described in [4], all three models generate proper sentences, with a slight improvement from BT-LSTM.
(g) Truth: W′, P=4096   (h) y = W · x, P=4096   (i) d=2, R=1, N=1, P=129   (j) d=2, R=4, N=1, P=528   (k) d=2, R=1, N=2, P=258   (l) d=4, R=4, N=1, P=384

Figure: The trained W for different BT-LSTM settings. The closer to (g), the better W is.
Task 4: In this experiment, we use a single LSTM cell as the model architecture to evaluate BT-LSTM against LSTM and TT-LSTM.