
Simple Recurrent Units for Highly Parallelizable Recurrence

Tao Lei 1   Yu Zhang 2   Sida I. Wang 1,3   Hui Dai 1   Yoav Artzi 1,4
1 ASAPP Inc.   2 Google Brain   3 Princeton University   4 Cornell University

1 {tao, hd}@asapp.com
2 [email protected]
3 [email protected]
4 [email protected]

Abstract

Common recurrent neural architectures scale poorly due to the intrinsic difficulty in parallelizing their state computations. In this work, we propose the Simple Recurrent Unit (SRU), a light recurrent unit that balances model capacity and scalability. SRU is designed to provide expressive recurrence, enable highly parallelized implementation, and comes with careful initialization to facilitate training of deep models. We demonstrate the effectiveness of SRU on multiple NLP tasks. SRU achieves 5–9x speed-up over cuDNN-optimized LSTM on classification and question answering datasets, and delivers stronger results than LSTM and convolutional models. We also obtain an average of 0.7 BLEU improvement over the Transformer model (Vaswani et al., 2017) on translation by incorporating SRU into the architecture.¹

1 Introduction

Recurrent neural networks (RNN) are at the core of state-of-the-art approaches for a large number of natural language tasks, including machine translation (Cho et al., 2014; Bahdanau et al., 2015; Jean et al., 2015; Luong et al., 2015), language modeling (Zaremba et al., 2014; Gal and Ghahramani, 2016; Zoph and Le, 2016), opinion mining (Irsoy and Cardie, 2014), and situated language understanding (Mei et al., 2016; Misra et al., 2017; Suhr et al., 2018; Suhr and Artzi, 2018). Key to many of these advancements are architectures of increased capacity and computation. For instance, the top-performing models for semantic role labeling and translation use eight recurrent layers, requiring days to train (He et al., 2017; Wu et al., 2016b). The scalability of these models has become an important problem that impedes NLP research.

¹ Our code is available at https://github.com/taolei87/sru.

The difficulty of scaling recurrent networks arises from the time dependence of state computation. In common architectures, such as Long Short-term Memory (LSTM; Hochreiter and Schmidhuber, 1997) and Gated Recurrent Units (GRU; Cho et al., 2014), the computation of each step is suspended until the complete execution of the previous step. This sequential dependency makes recurrent networks significantly slower than other operations, and limits their applicability. For example, recent translation models consist of non-recurrent components only, such as attention and convolution, to scale model training (Gehring et al., 2017; Vaswani et al., 2017).

In this work, we introduce the Simple Recurrent Unit (SRU), a unit with light recurrence that offers both high parallelization and sequence modeling capacity. The design of SRU is inspired by previous efforts, such as Quasi-RNN (QRNN; Bradbury et al., 2017) and Kernel NN (KNN; Lei et al., 2017), but enjoys additional benefits:

• SRU exhibits the same level of parallelism as convolution and feed-forward nets. This is achieved by balancing sequential dependence and independence: while the state computation of SRU is time-dependent, each state dimension is independent. This simplification enables CUDA-level optimizations that parallelize the computation across hidden dimensions and time steps, effectively using the full capacity of modern GPUs. Figure 1 compares our architecture's runtimes to common architectures.

• SRU replaces the use of convolutions (i.e., n-gram filters), as in QRNN and KNN, with more recurrent connections. This retains modeling capacity, while using less computation (and fewer hyper-parameters).


Figure 1: Average processing time in milliseconds of a batch of 32 samples using cuDNN LSTM, word-level convolution conv2d (with filter width k = 2 and k = 3), and the proposed SRU. We vary the number of tokens per sequence (l) and feature dimension (d).

• SRU improves the training of deep recurrent models by employing highway connections (Srivastava et al., 2015) and a parameter initialization scheme tailored for gradient propagation in deep architectures.

We evaluate SRU on a broad set of problems, including text classification, question answering, translation and character-level language modeling. Our experiments demonstrate that light recurrence is sufficient for various natural language tasks, offering a good trade-off between scalability and representational power. On classification and question answering datasets, SRU outperforms common recurrent and non-recurrent architectures, while achieving 5–9x speed-up compared to cuDNN LSTM. Stacking additional layers further improves performance, while incurring relatively small costs owing to the cheap computation of a single layer. We also obtain an average improvement of 0.7 BLEU score on the English to German translation task by incorporating SRU into Transformer (Vaswani et al., 2017).

2 Related Work

Improving on common architectures for sequence processing has recently received significant attention (Greff et al., 2017; Balduzzi and Ghifary, 2016; Miao et al., 2016; Zoph and Le, 2016; Lee et al., 2017). One area of research involves incorporating word-level convolutions (i.e., n-gram filters) into recurrent computation (Lei et al., 2015; Bradbury et al., 2017; Lei et al., 2017). For example, Quasi-RNN (Bradbury et al., 2017) proposes to alternate convolutions and a minimalist recurrent pooling function and achieves significant speed-up over LSTM. While Bradbury et al. (2017) focus on the speed advantages of the network, Lei et al. (2017) study the theoretical characteristics of such computation and possible extensions. Their results suggest that simplified recurrence retains strong modeling capacity through layer stacking. This finding motivates the design of SRU for both high parallelization and representational power. SRU also relates to IRNN (Le et al., 2015), which uses an identity diagonal matrix to initialize hidden-to-hidden connections. SRU uses point-wise multiplication for hidden connections, which is equivalent to using a diagonal weight matrix. This can be seen as a constrained version of diagonal initialization.

Various strategies have been proposed to scale network training (Goyal et al., 2017) and to speed up recurrent networks (Diamos et al., 2016; Shazeer et al., 2017; Kuchaiev and Ginsburg, 2017). For instance, Diamos et al. (2016) utilize hardware infrastructures by stashing RNN parameters on cache (or fast memory). Shazeer et al. (2017) and Kuchaiev and Ginsburg (2017) improve the computation via conditional computing and matrix factorization respectively. Our implementation for SRU is inspired by the cuDNN-optimized LSTM (Appleyard et al., 2016), but enables more parallelism: while cuDNN LSTM requires six optimization steps, SRU achieves more significant speed-up via two optimizations.

The design of recurrent networks, such as SRU and related architectures, raises questions about representational power and interpretability (Chen et al., 2018; Peng et al., 2018). Balduzzi and Ghifary (2016) apply type-preserving transformations to discuss the capacity of various simplified RNN architectures. Recent work (Anselmi et al., 2015; Daniely et al., 2016; Zhang et al., 2016; Lei et al., 2017) relates the capacity of neural networks to deep kernels. We empirically demonstrate that SRU can achieve compelling results by stacking multiple layers.


3 Simple Recurrent Unit

We present and explain the design of Simple Recurrent Unit (SRU) in this section. A single layer of SRU involves the following computation:

$$f_t = \sigma(W_f x_t + v_f \odot c_{t-1} + b_f) \qquad (1)$$
$$c_t = f_t \odot c_{t-1} + (1 - f_t) \odot (W x_t) \qquad (2)$$
$$r_t = \sigma(W_r x_t + v_r \odot c_{t-1} + b_r) \qquad (3)$$
$$h_t = r_t \odot c_t + (1 - r_t) \odot x_t \qquad (4)$$

where $W$, $W_f$ and $W_r$ are parameter matrices and $v_f$, $v_r$, $b_f$ and $b_r$ are parameter vectors to be learnt during training. The complete architecture decomposes into two sub-components: a light recurrence (Equations 1 and 2) and a highway network (Equations 3 and 4).

The light recurrence component successively reads the input vectors $x_t$ and computes the sequence of states $c_t$ capturing sequential information. The computation resembles other recurrent networks such as LSTM, GRU and RAN (Lee et al., 2017). Specifically, a forget gate $f_t$ controls the information flow (Equation 1) and the state vector $c_t$ is determined by adaptively averaging the previous state $c_{t-1}$ and the current observation $W x_t$ according to $f_t$ (Equation 2).

One key design decision that differs from previous gated recurrent architectures is the way $c_{t-1}$ is used in the sigmoid gate. Typically, $c_{t-1}$ is multiplied with a parameter matrix to compute $f_t$, e.g., $f_t = \sigma(W_f x_t + V_f c_{t-1} + b_f)$. However, the inclusion of $V_f c_{t-1}$ makes it difficult to parallelize the state computation: each dimension of $c_t$ and $f_t$ depends on all entries of $c_{t-1}$, and the computation has to wait until $c_{t-1}$ is fully computed. To facilitate parallelization, our light recurrence component uses a point-wise multiplication $v_f \odot c_{t-1}$ instead. With this simplification, each dimension of the state vectors becomes independent and hence parallelizable.

The highway network component (Srivastava et al., 2015) facilitates gradient-based training of deep networks. It uses the reset gate $r_t$ (Equation 3) to adaptively combine the input $x_t$ and the state $c_t$ produced from the light recurrence (Equation 4), where $(1 - r_t) \odot x_t$ is a skip connection that allows the gradient to directly propagate to the previous layer. Such connections have been shown to improve scalability (Wu et al., 2016a; Kim et al., 2016; He et al., 2016; Zilly et al., 2017).

The combination of the two components makes the overall architecture simple yet expressive, and easy to scale due to enhanced parallelization and gradient propagation.
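For concreteness, the following is a minimal NumPy sketch of a single SRU layer that follows Equations (1)–(4) literally, one step at a time. The function name and parameter layout are illustrative and do not mirror the released implementation, and the batching and kernel fusion of Section 3.1 are intentionally omitted.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sru_forward(x, W, Wf, Wr, vf, vr, bf, br, c0):
    """Run one SRU layer over a sequence following Equations (1)-(4).

    x: (L, d) input sequence; W, Wf, Wr: (d, d) matrices;
    vf, vr, bf, br: (d,) vectors; c0: (d,) initial state.
    Returns the output states h and internal states c, each of shape (L, d).
    """
    L, d = x.shape
    h, c = np.zeros((L, d)), np.zeros((L, d))
    c_prev = c0
    for t in range(L):
        xt = x[t]
        f = sigmoid(Wf @ xt + vf * c_prev + bf)    # Eq. (1): forget gate
        ct = f * c_prev + (1.0 - f) * (W @ xt)     # Eq. (2): light recurrence
        r = sigmoid(Wr @ xt + vr * c_prev + br)    # Eq. (3): reset gate
        h[t] = r * ct + (1.0 - r) * xt             # Eq. (4): highway connection
        c[t] = ct
        c_prev = ct
    return h, c
```

Note that only element-wise operations touch the previous state; this is exactly what the CUDA-level optimizations in Section 3.1 exploit.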

3.1 Parallelized Implementation

Despite the parallelization-friendly design of SRU, a naive implementation which computes Equations (1)–(4) for each step $t$ sequentially would not achieve SRU's full potential. We employ two optimizations to enhance parallelism. The optimizations are performed in the context of GPU / CUDA programming, but the general idea can be applied to other parallel programming models.

We re-organize the computation of Equations (1)–(4) into two major steps. First, given the input sequence $\{x_1, \cdots, x_L\}$, we batch the matrix multiplications across all time steps. This significantly improves the computation intensity (e.g., GPU utilization). The batched multiplication is:

$$U^\top = \begin{pmatrix} W \\ W_f \\ W_r \end{pmatrix} [x_1, x_2, \cdots, x_L],$$

where $L$ is the sequence length, $U \in \mathbb{R}^{L \times 3d}$ is the computed matrix and $d$ is the hidden state size. When the input is a mini-batch of $B$ sequences, $U$ would be a tensor of size $(L, B, 3d)$.

The second step computes the remaining point-wise operations. Specifically, we compile all point-wise operations into a single fused CUDA kernel and parallelize the computation across each dimension of the hidden state. Algorithm 1 shows the pseudo code of the forward function. The complexity of this step is $O(L \cdot B \cdot d)$ per layer, where $L$ is the sequence length and $B$ is the batch size. In contrast, the complexity of LSTM is $O(L \cdot B \cdot d^2)$ because of the hidden-to-hidden multiplications (e.g., $V h_{t-1}$), and each dimension cannot be independently parallelized. The fused kernel also reduces overhead. Without it, operations such as the sigmoid activation would each invoke a separate function call, adding kernel launching latency and more data movement costs.

The implementation of a bidirectional SRU is similar: the matrix multiplications of both directions are batched, and the fused kernel handles and parallelizes both directions at the same time.
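As a rough sketch of the first step (the second step is Algorithm 1 below), the three matrix multiplications can be grouped into one large multiplication over the whole mini-batch; the helper below is illustrative only and stays in NumPy rather than CUDA.

```python
import numpy as np

def batched_projection(x, W, Wf, Wr):
    """Group the matrix multiplications of Equations (1)-(4) across all time steps.

    x: (L, B, d) mini-batch of input sequences; W, Wf, Wr: (d, d) matrices.
    Returns U of shape (L, B, 3d), whose slices U[..., :d], U[..., d:2d] and
    U[..., 2d:] correspond to W x_t, W_f x_t and W_r x_t in Algorithm 1.
    """
    L, B, d = x.shape
    W_all = np.concatenate([W, Wf, Wr], axis=1)   # (d, 3d): one grouped parameter matrix
    U = x.reshape(L * B, d) @ W_all               # a single large matrix multiplication
    return U.reshape(L, B, 3 * d)
```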



Algorithm 1  Mini-batch version of the forward pass defined in Equations (1)–(4).

Indices: sequence length L, mini-batch size B, hidden state dimension d.
Input: input sequence batch x[l, i, j]; grouped matrix multiplication U[l, i, j']; initial state c0[i, j]; parameters vf[j], vr[j], bf[j] and br[j].
Output: output states h[·, ·, ·] and internal states c[·, ·, ·].

Initialize h[·, ·, ·] and c[·, ·, ·] as two L × B × d tensors.
for i = 1, ..., B; j = 1, ..., d do        // parallelize over each example i and dimension j
    c = c0[i, j]
    for l = 1, ..., L do
        f = σ(U[l, i, j + d] + vf[j] × c + bf[j])
        c = f × c + (1 − f) × U[l, i, j]
        r = σ(U[l, i, j + 2d] + vr[j] × c + br[j])
        h = r × c + (1 − r) × x[l, i, j]
        c[l, i, j] = c
        h[l, i, j] = h
return h[·, ·, ·] and c[·, ·, ·]

3.2 Initialization

Proper parameter initialization can reduce gradient propagation difficulties and hence has a positive impact on the final performance. We now describe an initialization strategy tailored for SRU.

We start by adopting common initializations derived for feed-forward networks (Glorot and Bengio, 2010; He et al., 2015). The weights of parameter matrices are drawn with zero mean and $1/d$ variance, for instance, via the uniform distribution $[-\sqrt{3/d}, +\sqrt{3/d}]$. This ensures the output variance remains approximately the same as the input variance after the matrix multiplication.

However, the light recurrence and highway computation would still reduce the variance of hidden representations by a factor of 1/3 to 1/2:

$$\frac{1}{3} \leq \frac{\mathrm{Var}[h_t]}{\mathrm{Var}[x_t]} \leq \frac{1}{2},$$

and the factor converges to 1/2 in deeper layers (see Appendix A). This implies the output $h_t$ and the gradient would vanish in deep models. To offset the problem, we introduce a scaling correction constant $\alpha$ in the highway connection:

$$h_t = r_t \odot c_t + (1 - r_t) \odot x_t \cdot \alpha,$$

where $\alpha$ is set to $\sqrt{3}$ such that $\mathrm{Var}[h_t] \approx \mathrm{Var}[x_t]$ at initialization. When the highway network is initialized with a non-zero bias $b_r = b$, the scaling constant $\alpha$ can be accordingly set as:

$$\alpha = \sqrt{1 + \exp(b) \times 2}.$$

Figure 2 compares the training progress with and without the scaling correction. See Appendix A for the derivation and more discussion.


Figure 2: Training curves of SRU on classification. The x-axis is the number of training steps and the y-axis is the training loss. Scaling correction improves the training progress, especially for deeper models with many stacked layers.
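A hedged sketch of the initialization scheme of Section 3.2, with an illustrative helper name: weight matrices are drawn uniformly from $[-\sqrt{3/d}, +\sqrt{3/d}]$, the highway bias is set to $b$, and the correction constant follows $\alpha = \sqrt{1 + 2\exp(b)}$. Drawing $v_f$ and $v_r$ from the same distribution is an assumption made here, not a detail taken from the paper.

```python
import numpy as np

def init_sru_layer(d, b=0.0, seed=0):
    """Illustrative SRU initialization with the scaling correction of Section 3.2.

    Weight matrices are drawn with zero mean and 1/d variance via the uniform
    distribution [-sqrt(3/d), +sqrt(3/d)]; the highway bias br is set to b and
    alpha = sqrt(1 + 2 * exp(b)) so that Var[h_t] stays close to Var[x_t].
    The treatment of vf and vr below is an assumption, not taken from the paper.
    """
    rng = np.random.default_rng(seed)
    bound = np.sqrt(3.0 / d)
    W, Wf, Wr = (rng.uniform(-bound, bound, size=(d, d)) for _ in range(3))
    vf, vr = (rng.uniform(-bound, bound, size=d) for _ in range(2))
    bf, br = np.zeros(d), np.full(d, b)
    alpha = np.sqrt(1.0 + 2.0 * np.exp(b))   # applied as (1 - r_t) * x_t * alpha
    return dict(W=W, Wf=Wf, Wr=Wr, vf=vf, vr=vr, bf=bf, br=br, alpha=alpha)
```

With b = 0 this recovers α = √3, the default case described above.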

4 Experiments

We evaluate SRU on several natural language processing tasks and perform additional analyses of the model. The set of tasks includes text classification, question answering, machine translation, and character-level language modeling. Training time on these benchmarks ranges from minutes (classification) to days (translation), providing a variety of computation challenges.

The main question we study is the performance-speed trade-off SRU provides in comparison to other architectures.


Model                        Size   CR        SUBJ      MR        TREC      MPQA      SST       Time

Best reported results:
Wang and Manning (2013)      -      82.1      93.6      79.1      -         86.3      -         -
Kalchbrenner et al. (2014)   -      -         -         -         93.0      -         86.8      -
Kim (2014)                   -      85.0      93.4      81.5      93.6      89.6      88.1      -
Zhang and Wallace (2017)     -      84.7      93.7      81.7      91.6      89.6      85.5      -
Zhao et al. (2015)           -      86.3      95.5      83.1      92.4      93.3      -         -

Our setup (default Adam, fixed word embeddings):
CNN                          360k   83.1±1.6  92.7±0.9  78.9±1.3  93.2±0.8  89.2±0.8  85.1±0.6  417
LSTM                         352k   82.7±1.9  92.6±0.8  79.8±1.3  93.4±0.9  89.4±0.7  88.1±0.8  2409
QRNN (k=1)                   165k   83.5±1.9  93.4±0.6  82.0±1.0  92.5±0.5  90.2±0.7  88.2±0.4  345
QRNN (k=1) + highway         204k   84.0±1.9  93.4±0.8  82.1±1.2  93.2±0.6  89.6±1.2  88.9±0.2  371
SRU (2 layers)               204k   84.9±1.6  93.5±0.6  82.3±1.2  94.0±0.5  90.1±0.7  89.2±0.3  320
SRU (4 layers)               303k   85.9±1.5  93.8±0.6  82.9±1.0  94.8±0.5  90.1±0.6  89.6±0.5  510
SRU (8 layers)               502k   86.4±1.7  93.7±0.6  83.1±1.0  94.7±0.5  90.2±0.8  88.9±0.6  879

Table 1: Test accuracies on classification benchmarks (Section 4.1). The first block presents best reported results of various methods. The second block compares SRU and other baselines given the same setup. For the SST dataset, we report average results of 5 runs. For other datasets, we perform 3 independent trials of 10-fold cross validation (3×10 runs). The last column compares the wall clock time (in seconds) to finish 100 epochs on the SST dataset.

We stack multiple layers of SRU to directly substitute other recurrent, convolutional or feed-forward modules. We minimize hyper-parameter tuning and architecture engineering for a fair comparison. Such efforts have a non-trivial impact on the results, which are beyond the scope of our experiments. Unless noted otherwise, the hyperparameters are set identical to prior work.

4.1 Text Classification

Dataset  We use six sentence classification benchmarks: movie review sentiment (MR; Pang and Lee, 2005), sentence subjectivity (SUBJ; Pang and Lee, 2004), customer reviews polarity (CR; Hu and Liu, 2004), question type (TREC; Li and Roth, 2002), opinion polarity (MPQA; Wiebe et al., 2005), and the Stanford sentiment treebank (SST; Socher et al., 2013).²

Following Kim (2014), we use word2vec embeddings trained on 100 billion Google News tokens. For simplicity, all word vectors are normalized to unit vectors and are fixed during training.

Setup  We stack multiple SRU layers and use the last output state to predict the class label for a given sentence. We train for 100 epochs and use the validation (i.e., development) set to select the best training epoch. We perform 10-fold cross validation for datasets that do not have a standard train-evaluation split. The result on SST is averaged over five independent trials. We use Adam (Kingma and Ba, 2014) with the default learning rate 0.001, a weight decay of 0 and a hidden dimension of 128.

² We use the binary version of the SST dataset.
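As a small illustration of this setup, the snippet below (hypothetical names, NumPy only) takes the output states of a stacked SRU for one sentence and predicts a label from the last state with a softmax classifier; the training loop, Adam updates and cross validation are omitted.

```python
import numpy as np

def predict_label(h, W_out, b_out):
    """Predict a class label from the last SRU output state.

    h: (L, d) output states of the top SRU layer for one sentence;
    W_out: (d, n_classes) classifier weights; b_out: (n_classes,) bias.
    """
    logits = h[-1] @ W_out + b_out            # use only the last output state
    logits = logits - logits.max()            # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return int(np.argmax(probs)), probs
```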

We compare SRU with a wide range of methods on these datasets, including various convolutional models (Kalchbrenner et al., 2014; Kim, 2014; Zhang and Wallace, 2017) and a hierarchical sentence model (Zhao et al., 2015) reported as the state of the art on these datasets (Conneau et al., 2017). Their setups are not exactly the same as ours, and may involve more tuning on word embeddings and other regularizations. We use the setup of Kim (2014) but do not fine-tune word embeddings and the learning method for simplicity. In addition, we directly compare against three baselines trained using our code base: a re-implementation of the CNN model of Kim (2014), a two-layer LSTM model and Quasi-RNN (Bradbury et al., 2017). We use the official implementation of Quasi-RNN and also implement a version with highway connection for a fair comparison. These baselines are trained using the same hyper-parameter configuration as SRU.

Results  Table 1 compares the test results on the six benchmarks. We select the best number reported in previous methods when multiple model variants were explored in their experiments.


Figure 3: Mean validation accuracies (y-axis) and standard deviations of the CNN, 2-layer LSTM and 2-layer SRU models. We plot the curves of the first 100 epochs. The x-axis is the training time used (in seconds). Timings are performed on an NVIDIA GeForce GTX 1070 GPU, an Intel Core i7-7700K processor and cuDNN 7003.

Despite our simple setup, SRU outperforms most previous methods and achieves results comparable to the state-of-the-art but more sophisticated model of Zhao et al. (2015). Figure 3 shows validation performance relative to training time for SRU, cuDNN LSTM and the CNN model. Our SRU implementation runs 5–9 times faster than cuDNN LSTM, and 6–40% faster than the CNN model of Kim (2014). On the movie review (MR) dataset for instance, SRU completes 100 training epochs within 40 seconds, while LSTM takes over 320 seconds.

4.2 Question Answering

Dataset  We use the Stanford Question Answering Dataset (SQuAD; Rajpurkar et al., 2016). SQuAD is a large machine comprehension dataset that includes over 100K question-answer pairs extracted from Wikipedia articles. We use the standard train and development sets.

Setup  We use the Document Reader model of Chen et al. (2017) as our base architecture for this task. The model is a combination of word-level bidirectional RNNs and attentions, providing a good testbed to compare our bidirectional SRU implementation with other RNN components.³

We use the open source implementation of Document Reader in our experiments.⁴ We train models for up to 100 epochs, with a batch size of 32 and a hidden dimension of 128. Following the authors' suggestions, we use the Adamax optimizer (Kingma and Ba, 2014) and variational dropout (Gal and Ghahramani, 2016) during training. We compare with two alternative recurrent components: the bidirectional LSTM adopted in the original implementation of Chen et al. (2017) and Quasi-RNN with highway connections for improved performance.

Results  Table 2 summarizes the results on SQuAD. SRU achieves 71.4% exact match and 80.2% F1 score, outperforming the bidirectional LSTM model by 1.9% (EM) and 1.4% (F1) respectively. SRU also exhibits over 5x speed-up over LSTM and 53–63% reduction in total training time. In comparison with QRNN, SRU obtains 0.8% improvement on exact match and 0.6% on F1 score, and runs 60% faster. This speed improvement highlights the impact of the fused kernel (Algorithm 1).

³ The current state-of-the-art models (Seo et al., 2016; Wang et al., 2017) make use of additional components such as character-level embeddings, which are not directly comparable to the setup of Chen et al. (2017). However, these models can potentially benefit from SRU since RNNs are incorporated in the model architecture.

⁴ https://github.com/hitvoice/DrQA


Model                      # layers   Size   Dev EM     Dev F1     Time per epoch (RNN / Total)

LSTM (Chen et al., 2017)       3       4.1m   69.5       78.8       316s / 431s
QRNN (k=1) + highway           4       2.4m   70.1±0.1   79.4±0.1   113s / 214s
QRNN (k=1) + highway           6       3.2m   70.6±0.1   79.6±0.2   161s / 262s
SRU                            3       2.0m   70.2±0.3   79.3±0.1    58s / 159s
SRU                            4       2.4m   70.7±0.1   79.7±0.1    72s / 173s
SRU                            6       3.2m   71.4±0.1   80.2±0.1   100s / 201s

Table 2: Exact match (EM) and F1 scores of various models on SQuAD (Section 4.2). We also report the total processing time per epoch and the time spent in RNN computations. SRU outperforms other models, and is more than five times faster than cuDNN LSTM.

While the QRNN baseline involves a similar amount of computation, assembling all element-wise operations of both directions in SRU achieves better GPU utilization.

4.3 Machine Translation

Dataset  We train translation models on the WMT English→German dataset, a standard benchmark for translation systems (Peitz et al., 2014; Li et al., 2014; Jean et al., 2015). The dataset consists of 4.5 million sentence pairs. We obtain the pre-tokenized dataset from the OpenNMT project (Klein et al., 2017). The sentences were tokenized using the word-piece model (Wu et al., 2016b), which generates a shared vocabulary of about 32,000 tokens. Newstest-2014 and newstest-2017 are provided and used as the validation and test sets.⁵

Setup  We use the state-of-the-art Transformer model of Vaswani et al. (2017) as our base architecture. In the base model, a single Transformer consists of a multi-head attention layer and a bottleneck feed-forward layer. We substitute the feed-forward network using our SRU implementation:

    base:  W · ReLU_layer(x) + b
    ours:  W · SRU_layer(x) + b .

The intuition is that SRU can better capture sequential information as a recurrent network, and potentially achieve better performance while requiring fewer layers.
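A rough NumPy sketch of the two sub-layer variants follows, with `sru_layer` standing in for the recurrent computation of Section 3 (e.g., a wrapper around the earlier `sru_forward` sketch that returns only the output states); the names and shapes here are illustrative and not taken from any Transformer code base.

```python
import numpy as np

def ffn_sublayer(x, W1, b1, W2, b2):
    """Base model: W · ReLU_layer(x) + b, applied position-wise.

    x: (L, d_model); W1: (d_model, d_ff); W2: (d_ff, d_model).
    """
    hidden = np.maximum(0.0, x @ W1 + b1)     # ReLU_layer(x), inner dimension d_ff
    return hidden @ W2 + b2                   # project back to d_model

def sru_sublayer(x, sru_layer, W2, b2):
    """Our variant: W · SRU_layer(x) + b.

    sru_layer: callable mapping (L, d_model) -> (L, d_sru), e.g. d_sru = 2048;
    W2: (d_sru, d_model) output projection.
    """
    hidden = sru_layer(x)                     # recurrent sub-layer replaces the ReLU layer
    return hidden @ W2 + b2
```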

We keep the model configuration the same as Vaswani et al. (2017): the model dimension is $d_{\text{model}} = 512$, the feed-forward and SRU layer has inner dimensionality $d_{\text{ff}} = d_{\text{sru}} = 2048$, and positional encoding (Gehring et al., 2017) is applied on the input word embeddings. The base model without SRU has 6 layers, while we set the number of layers to 4 and 5 when SRU is added. Following the original setup, we use a dropout probability of 0.1 for all components, except the SRU in the 5-layer model, for which we use a dropout of 0.2 as we observe stronger over-fitting in training.

⁵ https://github.com/OpenNMT/OpenNMT-tf/tree/master/scripts/wmt

We use a single NVIDIA Tesla V100 GPU for each model. The published results were obtained using 8 GPUs in parallel, which provide a large effective batch size during training. To approximate the setup, we update the model parameters every 5×5120 tokens and use 16,000 warm-up steps following OpenNMT suggestions. We train each model for 40 epochs (250,000 steps), and perform 3 independent trials for each model configuration. A single run takes about 3.5 days with a Tesla V100 GPU.

Results  Table 3 shows the translation results. When SRU is incorporated into the architecture, both the 4-layer and 5-layer model outperform the Transformer base model. For instance, our 5-layer model obtains an average improvement of 0.7 test BLEU score and an improvement of 0.5 BLEU score by comparing the best results of each model achieved across three runs. SRU also exhibits more stable performance, with smaller variance over 3 runs. Figure 4 further compares the validation accuracy of different models. These results confirm that SRU is better at sequence modeling compared to the original feed-forward network (FFN), requiring fewer layers to achieve similar accuracy.


Model                 # layers   Size   Valid BLEU        Test BLEU         Speed (toks/sec)   Hours per epoch

Transformer (base)        6      76m    26.6±0.2 (26.9)   27.6±0.2 (27.9)   20k                2.0
Transformer (+SRU)        4      79m    26.7±0.1 (26.8)   27.8±0.1 (28.3)   22k                1.8
Transformer (+SRU)        5      90m    27.1±0.0 (27.2)   28.3±0.1 (28.4)   19k                2.1

Table 3: English→German translation results (Section 4.3). We perform 3 independent runs for each configuration. We select the best epoch based on the valid BLEU score for each run, and report the average results and the standard deviation over 3 runs. In addition, we experiment with averaging model checkpoints and use the averaged version for evaluation, following Vaswani et al. (2017). We show the best BLEU results achieved in brackets.


Figure 4: Mean validation accuracy (y-axis) of different translation models after each training epoch (x-axis).

Finally, adding SRU does not affect the parallelization or speed of Transformer: the 4-layer model exhibits 10% speed improvement, while the 5-layer model is only 5% slower compared to the base model. We present more results and discussion in Appendix B.3.

4.4 Character-level Language Modeling

Dataset  We use Enwik8, a large dataset for character-level language modeling. Following standard practice, we use the first 90M characters for training and the remaining 10M split evenly for validation and test.

Setup  Similar to previous work, we use a batch size of 128 and an unroll size of 100 for truncated backpropagation during training. We also experiment with an unroll size of 256 and a batch size of 64 such that each training instance has longer context. We use a non-zero highway bias $b_r = -3$ that is shown to be useful for training language models (Zilly et al., 2017). Previous methods employ different optimizers and learning rate schedulers for training. For simplicity and consistency, we use the Adam optimizer and the same learning rate scheduling (i.e., Noam scheduling) as the translation experiments. We train for a maximum of 100 epochs (about 700,000 steps).

We compare various recurrent models and use a parameter budget similar to previous methods. In addition, we experiment with the factorization trick (Kuchaiev and Ginsburg, 2017) to reduce the total number of parameters without decreasing the performance. See details in Appendix B.
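For reference, a small sketch of the Noam schedule as defined by Vaswani et al. (2017); the default warm-up value below is the 16,000 steps used in the translation setup of Section 4.3, and whether the exact same value is used here is an assumption.

```python
def noam_lr(step, d_model=512, warmup=16000):
    """Noam schedule: linear warm-up followed by inverse square-root decay."""
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)
```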

Results  Table 4 presents the results of SRU and other recurrent models. The 8-layer SRU model achieves validation and test bits per character (BPC) of 1.21, outperforming previous best reported results of LSTM, QRNN and recurrent highway networks (RHN). Increasing the number of SRU layers to 12 and using a longer context of 256 characters in training further improves the BPC to 1.19.

4.5 Ablation Analysis

We perform ablation analyses on SRU by successively disabling different components:

(1) Remove the point-wise multiplication term $v \odot c_{t-1}$ in the forget and reset gates. The resulting variant involves less recurrence and has less representational capacity.

(2) Disable the scaling correction by setting the constant $\alpha = 1$.

(3) Remove the skip connections.

We train model variants on the classification and question answering datasets. Table 5 and Figure 5 confirm the impact of our design decisions: removing these components results in worse classification accuracies and exact match scores.


Model                           Size   # layers   Unroll size   Valid   Test    Time

MI-LSTM (Wu et al., 2016c)      17m        1          100         -      1.44     -
HM-LSTM (Chung et al., 2016)    35m        3          100         -      1.32     -
LSTM (Melis et al., 2017)       46m        4           50        1.28    1.30     -
RHN (Zilly et al., 2017)        46m       10           50         -      1.27     -
FS-LSTM (Mujika et al., 2017)   47m        4          100         -      1.25     -
QRNN (Merity et al., 2018)      26m        4          200         -      1.33     -
LSTM (Merity et al., 2018)      47m        3          200         -      1.23     -
SRU                             37m        6          100        1.29    1.30   28min
SRU                             37m       10          100        1.26    1.27   29min
SRU (with projection)           37m        6          100        1.25    1.26   29min
SRU (with projection)           47m        8          100        1.21    1.21   39min
SRU (with projection)           49m       12          256        1.19    1.19   41min

Table 4: Validation and test BPCs of different recurrent models on the Enwik8 dataset. The last column presents the training time per epoch. For SRU with projection, we set the projection dimension to 512.

Model                    4 layers   6 layers

SRU (full)                 70.7       71.4
 − remove v ⊙ c_{t−1}      70.6       71.1
 − remove α-scaling        70.3       71.0
 − remove highway          69.4       69.1

Table 5: Ablation analysis on SQuAD. Components are successively removed and the EM scores are averaged over 4 runs.

Data underlying Figure 5 (average validation accuracy, in %):

Variant                  CR     SUBJ   MR     TREC
SRU (full)               85.3   95.4   83.3   92.8
 − remove v ⊙ c_{t−1}    84.9   95.3   83.1   92.2
 − remove α-scaling      86.8   94.8   82.2   92.3
 − remove highway        85.9   94.8   82.8   91.2

Figure 5: Ablation analysis on the classification datasets. Average validation results are presented. We compare the full SRU implementation (left, blue), the variant without the $v \odot c_{t-1}$ multiplication (middle, green) and the variant without the highway connection (right, yellow).

5 Discussion

This work presents the Simple Recurrent Unit (SRU), a scalable recurrent architecture that operates as fast as feed-forward and convolutional units. We confirm the effectiveness of SRU on multiple natural language tasks ranging from classification to translation. We open-source our implementation to facilitate future NLP and deep learning research.

Trading capacity with layers  SRU achieves high parallelization by simplifying the hidden-to-hidden dependency. This simplification is likely to reduce the representational power of a single layer and hence should be balanced to avoid performance loss. However, unlike previous work that suggests additional computation (e.g., n-gram filters) within the layer (Balduzzi and Ghifary, 2016; Bradbury et al., 2017), we argue that increasing the depth of the model suffices to retain modeling capacity. Our empirical results on various tasks confirm this hypothesis.

Acknowledgement

We thank Alexander Rush and Yoon Kim for help with machine translation experiments, and Danqi Chen for help with SQuAD experiments. We thank Adam Yala, Howard Chen, Jeremy Wohlwend, Lili Yu, Kyle Swanson and Kevin Yang for providing useful feedback on the paper and the SRU implementation. A special thanks to Hugh Perkins for his support on the experimental environment setup and Runqi Yang for answering questions about his code.

References

Fabio Anselmi, Lorenzo Rosasco, Cheston Tan, and Tomaso A. Poggio. 2015. Deep convolutional networks are hierarchical kernel machines. CoRR, abs/1508.01084.


Jeremy Appleyard, Tomás Kociský, and Phil Blunsom. 2016. Optimizing performance of recurrent neural networks on GPUs. CoRR, abs/1604.01946.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proceedings of the International Conference on Learning Representations.

David Balduzzi and Muhammad Ghifary. 2016. Strongly-typed recurrent neural networks. In International Conference on Machine Learning.

James Bradbury, Stephen Merity, Caiming Xiong, and Richard Socher. 2017. Quasi-recurrent neural networks. In Proceedings of the International Conference on Learning Representations.

Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. 2017. Reading Wikipedia to answer open-domain questions. In Proceedings of the Annual Meeting of the Association for Computational Linguistics.

Yining Chen, Sorcha Gilroy, Kevin Knight, and Jonathan May. 2018. Recurrent neural networks as weighted language recognizers. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.

Kyunghyun Cho, Bart van Merrienboer, Çağlar Gülçehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

Junyoung Chung, Sungjin Ahn, and Yoshua Bengio. 2016. Hierarchical multiscale recurrent neural networks. CoRR, abs/1609.01704.

Alexis Conneau, Douwe Kiela, Holger Schwenk, Loïc Barrault, and Antoine Bordes. 2017. Supervised learning of universal sentence representations from natural language inference data. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

Amit Daniely, Roy Frostig, and Yoram Singer. 2016. Toward deeper understanding of neural networks: The power of initialization and a dual view on expressivity. In Advances in Neural Information Processing Systems.

Greg Diamos, Shubho Sengupta, Bryan Catanzaro, Mike Chrzanowski, Adam Coates, Erich Elsen, Jesse Engel, Awni Hannun, and Sanjeev Satheesh. 2016. Persistent RNNs: Stashing recurrent weights on-chip. In International Conference on Machine Learning.

Yarin Gal and Zoubin Ghahramani. 2016. A theoretically grounded application of dropout in recurrent neural networks. In Advances in Neural Information Processing Systems.

Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann Dauphin. 2017. Convolutional sequence to sequence learning. In International Conference on Machine Learning.

Xavier Glorot and Yoshua Bengio. 2010. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the International Conference on Artificial Intelligence and Statistics.

Priya Goyal, Piotr Dollár, Ross B. Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. 2017. Accurate, large minibatch SGD: Training ImageNet in 1 hour. CoRR, abs/1706.02677.

Klaus Greff, Rupesh Kumar Srivastava, Jan Koutník, Bas R. Steunebrink, and Jürgen Schmidhuber. 2017. LSTM: A search space odyssey. IEEE Transactions on Neural Networks and Learning Systems, 28.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.

Luheng He, Kenton Lee, Mike Lewis, and Luke Zettlemoyer. 2017. Deep semantic role labeling: What works and what's next. In Proceedings of the Annual Meeting of the Association for Computational Linguistics.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9.

Minqing Hu and Bing Liu. 2004. Mining and summarizing customer reviews. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.

Ozan Irsoy and Claire Cardie. 2014. Opinion mining with deep recurrent neural networks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

Sébastien Jean, Kyunghyun Cho, Roland Memisevic, and Yoshua Bengio. 2015. On using very large target vocabulary for neural machine translation. In Proceedings of the Annual Meeting of the Association for Computational Linguistics and the International Joint Conference on Natural Language Processing.


Nal Kalchbrenner, Edward Grefenstette, and Phil Blunsom. 2014. A convolutional neural network for modelling sentences. In Proceedings of the Annual Meeting of the Association for Computational Linguistics.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

Yoon Kim, Yacine Jernite, David A. Sontag, and Alexander M. Rush. 2016. Character-aware neural language models. In Proceedings of the AAAI Conference on Artificial Intelligence.

Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. In Proceedings of the International Conference on Learning Representations.

Guillaume Klein, Yoon Kim, Yuntian Deng, Jean Senellart, and Alexander Rush. 2017. OpenNMT: Open-source toolkit for neural machine translation. In Proceedings of ACL 2017, System Demonstrations.

Oleksii Kuchaiev and Boris Ginsburg. 2017. Factorization tricks for LSTM networks. CoRR, abs/1703.10722.

Quoc V. Le, Navdeep Jaitly, and Geoffrey E. Hinton. 2015. A simple way to initialize recurrent networks of rectified linear units. CoRR, abs/1504.00941.

Kenton Lee, Omer Levy, and Luke S. Zettlemoyer. 2017. Recurrent additive networks. CoRR, abs/1705.07393.

Tao Lei, Regina Barzilay, and Tommi Jaakkola. 2015. Molding CNNs for text: Non-linear, non-consecutive convolutions. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

Tao Lei, Wengong Jin, Regina Barzilay, and Tommi Jaakkola. 2017. Deriving neural architectures from sequence and graph kernels. In International Conference on Machine Learning.

Liangyou Li, Xiaofeng Wu, Santiago Cortes Vaillo, Jun Xie, Andy Way, and Qun Liu. 2014. The DCU-ICTCAS MT system at WMT 2014 on German-English translation task. In Proceedings of the Ninth Workshop on Statistical Machine Translation.

Xin Li and Dan Roth. 2002. Learning question classifiers. In Proceedings of the International Conference on Computational Linguistics.

Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

Hongyuan Mei, Mohit Bansal, and Matthew R. Walter. 2016. What to talk about and how? Selective generation using LSTMs with coarse-to-fine alignment. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.

Gábor Melis, Chris Dyer, and Phil Blunsom. 2017. On the state of the art of evaluation in neural language models. CoRR, abs/1707.05589.

Stephen Merity, Nitish Shirish Keskar, and Richard Socher. 2018. An analysis of neural language modeling at multiple scales. CoRR, abs/1803.08240.

Yajie Miao, Jinyu Li, Yongqiang Wang, Shi-Xiong Zhang, and Yifan Gong. 2016. Simplifying long short-term memory acoustic models for fast training and decoding. In IEEE International Conference on Acoustics, Speech and Signal Processing.

Dipendra Misra, John Langford, and Yoav Artzi. 2017. Mapping instructions and visual observations to actions with reinforcement learning. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

Asier Mujika, Florian Meier, and Angelika Steger. 2017. Fast-slow recurrent neural networks. In Advances in Neural Information Processing Systems.

Bo Pang and Lillian Lee. 2004. A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. In Proceedings of the Annual Meeting of the Association for Computational Linguistics.

Bo Pang and Lillian Lee. 2005. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the Annual Meeting of the Association for Computational Linguistics.

Stephan Peitz, Joern Wuebker, Markus Freitag, and Hermann Ney. 2014. The RWTH Aachen German-English machine translation system for WMT 2014. In Proceedings of the Ninth Workshop on Statistical Machine Translation.

Hao Peng, Roy Schwartz, Sam Thomson, and Noah A. Smith. 2018. Rational recurrences. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

Min Joon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. 2016. Bidirectional attention flow for machine comprehension. CoRR, abs/1611.01603.

Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. 2017. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538.

Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Y. Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

Rupesh K. Srivastava, Klaus Greff, and Jürgen Schmidhuber. 2015. Training very deep networks. In Advances in Neural Information Processing Systems.

Alane Suhr and Yoav Artzi. 2018. Situated mapping of sequential instructions to actions with single-step reward observation. In Proceedings of the Annual Meeting of the Association for Computational Linguistics.

Alane Suhr, Srinivasan Iyer, and Yoav Artzi. 2018. Learning to map context-dependent sentences to executable formal queries. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems.

Sida Wang and Christopher Manning. 2013. Fast dropout training. In International Conference on Machine Learning.

Wenhui Wang, Nan Yang, Furu Wei, Baobao Chang, and Ming Zhou. 2017. Gated self-matching networks for reading comprehension and question answering. In Proceedings of the Annual Meeting of the Association for Computational Linguistics.

Janyce Wiebe, Theresa Wilson, and Claire Cardie. 2005. Annotating expressions of opinions and emotions in language. Language Resources and Evaluation.

Huijia Wu, Jiajun Zhang, and Chengqing Zong. 2016a. An empirical exploration of skip connections for sequential tagging. In Proceedings of the International Conference on Computational Linguistics.

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Łukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2016b. Google's neural machine translation system: Bridging the gap between human and machine translation. CoRR, abs/1609.08144.

Yuhuai Wu, Saizheng Zhang, Ying Zhang, Yoshua Bengio, and Ruslan R. Salakhutdinov. 2016c. On multiplicative integration with recurrent neural networks. In Advances in Neural Information Processing Systems.

Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals. 2014. Recurrent neural network regularization. CoRR, abs/1409.2329.

Yingjie Zhang and Byron C. Wallace. 2017. A sensitivity analysis of (and practitioners' guide to) convolutional neural networks for sentence classification. In Proceedings of the International Joint Conference on Natural Language Processing.

Yuchen Zhang, Jason D. Lee, and Michael I. Jordan. 2016. ℓ1-regularized neural networks are improperly learnable in polynomial time. In International Conference on Machine Learning.

Han Zhao, Zhengdong Lu, and Pascal Poupart. 2015. Self-adaptive hierarchical sentence model. In Proceedings of the International Joint Conference on Artificial Intelligence.

Julian Georg Zilly, Rupesh Kumar Srivastava, Jan Koutník, and Jürgen Schmidhuber. 2017. Recurrent highway networks. In International Conference on Machine Learning.

Barret Zoph and Quoc V. Le. 2016. Neural architecture search with reinforcement learning. CoRR, abs/1611.01578.