
Dynamic Self-Attention: Computing Attention over Words Dynamically for Sentence Embedding

Deunsol Yoon*
Korea University
Seoul, Republic of Korea
[email protected]

Dongbok Lee*
Korea University
Seoul, Republic of Korea
[email protected]

SangKeun Lee
Korea University
Seoul, Republic of Korea
[email protected]

Abstract

In this paper, we propose Dynamic Self-Attention (DSA), a new self-attention mechanism for sentence embedding. We design DSA by modifying dynamic routing in the capsule network (Sabour et al., 2017) for natural language processing. DSA attends to informative words with a dynamic weight vector. We achieve new state-of-the-art results among sentence encoding methods on the Stanford Natural Language Inference (SNLI) dataset with the fewest parameters, while showing competitive results on the Stanford Sentiment Treebank (SST) dataset.

1 Introduction

In Natural Language Processing (NLP), most neural network-based models contain a sentence encoder that maps a sequence of words into a vector. The vector is then used for various downstream tasks, e.g., sentiment analysis and natural language inference. The key part of a sentence encoder is a computation that turns a variable-length input sequence into a fixed-size vector. One of the common approaches is max-pooling over a CNN or RNN (Kim, 2014; Conneau et al., 2017).

Self-attention is another approach to obtaining a fixed-size vector. Self-attention, derived from the attention mechanism originally designed for neural machine translation (Bahdanau et al.), is utilized in various tasks (Liu et al., 2016; Yang et al., 2016; Shen and Huang, 2016). Self-attention computes attention weights by the inner product between words and a learnable weight vector. The weight vector is important in that it detects informative words, yet it is static during inference. Given the significance of this role, it is doubtful whether keeping the weight vector static is optimal.

* Equal contribution.

In parallel, Sabour et al. (2017) recently proposed the capsule network for image classification. In the capsule network, dynamic routing iteratively computes weights over inputs by the inner product between the inputs and a weighted sum of the inputs. Varying with the inputs, the weighted sum detects informative inputs; it can therefore be interpreted as a dynamic weight vector from the perspective of self-attention. We expect the dynamic weight vector to give rise to flexibility in self-attention, since it can adapt to given sentences even after training.

Motivated by dynamic routing (Sabour et al., 2017), we propose a new self-attention mechanism for sentence embedding, namely Dynamic Self-Attention (DSA). To this end, we modify dynamic routing such that it functions as self-attention with a dynamic weight vector. DSA, which is stacked on a CNN with Dense Connection (Huang et al., 2017), achieves new state-of-the-art results among sentence encoding methods on the Stanford Natural Language Inference (SNLI) dataset with the fewest parameters, while obtaining competitive results on the Stanford Sentiment Treebank (SST) dataset. It also outperforms recent models in terms of time efficiency due to its simplicity and highly parallelized computations.

Our technical contributions are as follows:

• We design and implement Dynamic Self-Attention (DSA), a new self-attention mechanism for sentence embedding.

• We devise the dynamic weight vector with which DSA computes attention weights.

• We achieve new state-of-the-art results on the SNLI dataset, while showing competitive results on the SST dataset.

2 Preliminary

In self-attention (Liu et al., 2016; Yang et al., 2016), attention weights are computed as follows:


Figure 1: Overall architecture of our approach. $z_1, \ldots, z_m$ are dynamic weight vectors. The final sentence embedding, i.e., $z$, is generated by concatenating $z_1, \ldots, z_m$.

$a = \mathrm{Softmax}(v^{\top}\mathrm{Tanh}(WX))$   (1)

where $X \in \mathbb{R}^{d_w \times n}$ is an input sequence, $W \in \mathbb{R}^{d_v \times d_w}$ is a projection matrix, and $v \in \mathbb{R}^{d_v}$ is the learnable weight vector of self-attention. The weight vector $v$ plays an important role, since attention weights are computed by the inner product between $v$ and the projection of the input sequence $X$. The weight vector $v$ is static with respect to the input sequence $X$ during inference. Replacing the weight vector $v$ with a weight matrix enables multiple attentions (Lin et al., 2017; Shen et al., 2018a).
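As a concrete reference point, the sketch below implements Eq. 1 in PyTorch. It is a minimal illustration, not the authors' code; the class name, the batch-first tensor layout (batch, n, d_w), and the final weighted sum producing a sentence vector are our assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StaticSelfAttention(nn.Module):
    """Self-attention with a static, learnable weight vector v (Eq. 1).
    Minimal sketch; dimension names (d_w, d_v) follow the paper."""

    def __init__(self, d_w: int, d_v: int):
        super().__init__()
        self.W = nn.Linear(d_w, d_v, bias=False)  # projection matrix W
        self.v = nn.Parameter(torch.randn(d_v))   # static weight vector v

    def forward(self, X: torch.Tensor) -> torch.Tensor:
        # X: (batch, n, d_w), one row per word
        scores = torch.tanh(self.W(X)) @ self.v    # (batch, n)
        a = F.softmax(scores, dim=-1)              # attention weights over words
        return (a.unsqueeze(-1) * X).sum(dim=1)    # attention-weighted sentence vector
```

Note that $v$ is fixed once training ends; DSA, introduced next, replaces it with a vector recomputed from the given sentence.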

3 Our Approach

Our architecture, shown in Figure 1, is built on a CNN with Dense Connection (Huang et al., 2017). Dynamic Self-Attention (DSA), which is stacked on the CNN with Dense Connection, computes attention weights over words.

3.1 CNN with Dense Connection

The goal of this module is to encode each word into a meaningful representation space while capturing local information. We do not add any positional encoding; as suggested by Gehring et al. (2017), deep convolution layers capture relative position information. We also enforce every layer's output to have the same number of columns by using appropriate zero padding.

We denote a sequence of word embeddings as $X^0 \in \mathbb{R}^{d_0 \times n}$, where $X^0 = [x^0_1, x^0_2, \ldots, x^0_n]$. $h^l(\cdot)$ is a composite function of the $l$th layer, composed of 1D convolution, dropout (Srivastava et al., 2014), and a leaky rectified linear unit (Leaky ReLU). We feed the sequence of word embeddings into $h^1(\cdot)$ with kernel size 1:

$X^1 = h^1(X^0)$   (2)

where $X^1 \in \mathbb{R}^{d_1 \times n}$. We add Dense Connection in every layer $h^l(\cdot)$ with the same kernel size:

$X^l = h^l(\mathrm{concat}[X^{l-1}, X^{l-2}, \ldots, X^1])$   (3)

where $X^l \in \mathbb{R}^{d_l \times n}$ and $l \in [2, L]$. We concatenate the outputs of all $h^l(\cdot)$ and denote this as a single function:

$\Phi_k(X^0) = \mathrm{concat}[X^L, X^{L-1}, \ldots, X^1]$   (4)

where the kernel sizes of all $h^l(\cdot)$ in $\Phi_k(\cdot)$ are the same number $k$, except for $h^1(\cdot)$. We then feed the outputs of two different functions $\Phi_{k_1}(\cdot)$, $\Phi_{k_2}(\cdot)$ and the sequence of word embeddings $X^0$ into a compression layer:

$X^c = h^c(\mathrm{concat}[\Phi_{k_1}(X^0), \Phi_{k_2}(X^0), X^0])$   (5)

where $h^c(\cdot)$ is the composite function with kernel size 1. It compresses the first dimension of its input (i.e., $2\sum_{l=1}^{L} d_l + d_0$) into $d_c$ to represent a word compactly. Finally, the L2 norm of every column vector $x^c_i$ in $X^c$ is normalized, which is found to help our model converge fast and stably.
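To make the shape bookkeeping in Eqs. 2-5 concrete, here is a minimal PyTorch sketch of this module under our own assumptions: batch-first tensors of shape (batch, d, n), Conv1d for the 1D convolutions, and the appendix values as defaults (d_0 = 300, d_1 = 150, d_l = 75, L = 4, k_1 = 3, k_2 = 5, d_c = 300). It is an illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DenseCNNEncoder(nn.Module):
    """Sketch of Section 3.1: two densely connected convolution branches
    (kernel sizes k1 and k2), a kernel-size-1 compression layer h^c, and
    L2 normalization of each word vector."""

    def __init__(self, d0=300, d1=150, dl=75, L=4, k1=3, k2=5, dc=300, dropout=0.2):
        super().__init__()
        self.h1 = self._composite(d0, d1, 1, dropout)          # Eq. 2, kernel size 1

        def branch(k):
            layers, in_dim = nn.ModuleList(), d1
            for _ in range(2, L + 1):                          # layers l = 2..L
                layers.append(self._composite(in_dim, dl, k, dropout))
                in_dim += dl                                   # dense connection grows the input
            return layers

        self.branch1, self.branch2 = branch(k1), branch(k2)
        total = 2 * (d1 + (L - 1) * dl) + d0                   # 2 * sum_l d_l + d_0
        self.hc = self._composite(total, dc, 1, dropout)       # Eq. 5, kernel size 1

    @staticmethod
    def _composite(in_dim, out_dim, k, dropout):
        # h^l: 1D convolution -> dropout -> Leaky ReLU (zero padding keeps length n)
        return nn.Sequential(nn.Conv1d(in_dim, out_dim, k, padding=k // 2),
                             nn.Dropout(dropout), nn.LeakyReLU())

    @staticmethod
    def _phi(layers, X1):
        outs = [X1]
        for h in layers:
            outs.insert(0, h(torch.cat(outs, dim=1)))          # Eq. 3: feed all previous outputs
        return torch.cat(outs, dim=1)                          # Eq. 4: concat [X^L, ..., X^1]

    def forward(self, X0):                                     # X0: (batch, d0, n)
        X1 = self.h1(X0)                                       # Eq. 2
        Xc = self.hc(torch.cat([self._phi(self.branch1, X1),
                                self._phi(self.branch2, X1), X0], dim=1))  # Eq. 5
        return F.normalize(Xc, p=2, dim=1)                     # unit L2 norm per word column
```

The padding k // 2 assumes odd kernel sizes (3 and 5 here), which keeps every layer's output at n columns as the section requires.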

3.2 Dynamic Self-Attention (DSA)

Dynamic Self-Attention (DSA) iteratively computes attention weights over words with the dynamic weight vector, which varies with the input. DSA enables multiple attentions in parallel by applying different projection matrices to $X^c$, the output of the CNN with Dense Connection. For the $j$th attention, DSA projects the compact representation of every word $x^c_i$ with a Leaky ReLU activation:

$x_{j|i} = \mathrm{LeakyReLU}(W_j x^c_i + b_j)$   (6)

where $W_j \in \mathbb{R}^{d_o \times d_c}$ is a projection matrix and $b_j \in \mathbb{R}^{d_o}$ is a bias term for the $j$th attention. Given the number of attentions $m$, i.e., $j \in [1, m]$, the attention weights of words are computed by Algorithm 1:

Algorithm 1 Dynamic Self-Attention

1: procedure ATTENTION($x_{j|i}$, $r$)
2:   for all $i$th word, $j$th attention: $q_{ij} = 0$
3:   for $r$ iterations do
4:     for all $i, j$: $a_{ij} = \exp(q_{ij}) / \sum_{k} \exp(q_{kj})$
5:     for all $j$: $s_j = \sum_{i} a_{ij} x_{j|i}$
6:     for all $j$: $z_j = \mathrm{Tanh}(s_j)$
7:     for all $i, j$: $q_{ij} = q_{ij} + x_{j|i}^{\top} z_j$
8:   return all $z_j$

$r$ is the number of iterations, and $a_{ij}$ is the attention weight for the $i$th word in the $j$th attention. $z_j$ is the output for the $j$th attention of DSA at the $r$th iteration, and also the dynamic weight vector for the $j$th attention of DSA before the $r$th iteration. The final sentence embedding $z$ is the concatenation of $z_1, \ldots, z_m$:

$z = \mathrm{concat}[z_1, \ldots, z_m]$   (7)

where $z \in \mathbb{R}^{m d_o}$ is used for downstream tasks.

We modify dynamic routing (Sabour et al., 2017) to make it function as self-attention with the dynamic weight vector. We remove the capsulization layer of the capsule network, which transforms scalar neurons into capsules, i.e., multi-dimensional neurons. A single word is then not decomposed into multiple capsules, but represented as a single vector $x^c_i$ in Eq. 6. The squashing function is a nonlinearity for capsules; we replace it with the Tanh nonlinearity for scalar neurons in Line 6 of Algorithm 1. We also force all the words in the $j$th attention to share a projection matrix $W_j$ in Eq. 6, since an input is a variable-length sequence. By contrast, each capsule in the capsule network has its own projection matrix $W_{ij}$.
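The following PyTorch sketch puts Eq. 6, Algorithm 1, and Eq. 7 together. It reflects our reading of the paper rather than the authors' code: the batch-first layout (batch, n, d_c) and the use of a single linear layer to hold all m projection matrices W_j and biases b_j side by side are our assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicSelfAttention(nn.Module):
    """Sketch of Eq. 6 and Algorithm 1 for m attentions computed in parallel."""

    def __init__(self, d_c: int, d_o: int, m: int, r: int = 2):
        super().__init__()
        self.proj = nn.Linear(d_c, m * d_o)  # shared W_j, b_j for all words, stacked over j
        self.m, self.d_o, self.r = m, d_o, r

    def forward(self, Xc: torch.Tensor) -> torch.Tensor:
        batch, n, _ = Xc.shape
        # Eq. 6: x_{j|i} = LeakyReLU(W_j x^c_i + b_j), shape (batch, n, m, d_o)
        x = F.leaky_relu(self.proj(Xc)).view(batch, n, self.m, self.d_o)
        q = torch.zeros(batch, n, self.m, device=Xc.device)   # logits q_ij (Line 2)
        for _ in range(self.r):
            a = F.softmax(q, dim=1)                            # attention over words (Line 4)
            s = (a.unsqueeze(-1) * x).sum(dim=1)               # weighted sum s_j (Line 5)
            z = torch.tanh(s)                                  # dynamic weight vector z_j (Line 6)
            q = q + (x * z.unsqueeze(1)).sum(dim=-1)           # q_ij += x_{j|i}^T z_j (Line 7)
        return z.view(batch, self.m * self.d_o)                # Eq. 7: concat of z_1..z_m
```

For single DSA one would set m = 1 and d_o = 600, and for multiple DSA m = 8 and d_o = 300, matching Section 4; r = 2 follows the appendix.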

3.3 Dynamic Weight Vectors

The weight vector $v$ of self-attention in Eq. 1 is static during inference. In DSA, however, the dynamic weight vector $z_j$ in Line 6 of Algorithm 1 varies with the input sequence $x_{j|1}, \ldots, x_{j|n}$, even after training. To show how the dynamic weight vectors vary, we perform dimensionality reduction on them ($z_1$ at the $(r-1)$th iteration of Algorithm 1) by Principal Component Analysis (PCA). We randomly select 1,000 sentences from the Stanford Natural Language Inference (SNLI) dataset and plot the dynamic weight vector for each sentence in a 2D vector space. Figure 2 shows that the dynamic weight vectors are scattered in all directions. Thus, DSA adapts the dynamic weight vector to each sentence.

Figure 2: Dynamic weight vectors visualization
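The dimensionality reduction used for Figure 2 can be reproduced along the following lines; the use of scikit-learn and the function name are our choices, not the authors'.

```python
import numpy as np
from sklearn.decomposition import PCA

def project_dynamic_vectors(z_vectors: np.ndarray) -> np.ndarray:
    """Project dynamic weight vectors collected over sentences
    (shape: (num_sentences, d_o)) onto their first two principal
    components for a 2D scatter plot as in Figure 2."""
    return PCA(n_components=2).fit_transform(z_vectors)
```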

4 Experiments

We evaluate our sentence embedding method on two different tasks: natural language inference and sentiment analysis. We implement single DSA, multiple DSA, and the self-attention of Eq. 1 as a baseline. Both DSA and the self-attention baseline are stacked on the CNN with Dense Connection for a fair comparison.

For our implementations, we initialize word embeddings with the 300D GloVe 840B pretrained vectors (Pennington et al., 2014) and fix them during training. We use the cross-entropy loss as the objective function for both tasks. We set $d_o = 600$, $m = 1$ for single DSA and $d_o = 300$, $m = 8$ for multiple DSA. In the Appendix, we provide details for training our implementations, hyperparameter settings, and visualizations of the attention maps of DSA.

4.1 Natural Language Inference Results

Natural language inference is the task of classifying the semantic relationship between two sentences, i.e., a premise and a hypothesis. We conduct experiments on the Stanford Natural Language Inference (SNLI) dataset, consisting of 570k human-written pairs of English sentences labeled with one of three classes: Entailment, Contradiction, and Neutral. As the task considers the semantic relationship, SNLI is used as a benchmark for evaluating the performance of a sentence encoder.

We follow a conventional approach, called heuristic matching (Mou et al., 2016), to classify the relationship between two sentences. The sentences are encoded by our proposed model. Given the encoded sentences $s_h$ and $s_p$ for the hypothesis and the premise respectively, the input of the classifier is $\mathrm{concat}[s_h, s_p, |s_h - s_p|, s_h \odot s_p]$.
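A small illustration of this feature construction (the function name is ours, and the element-wise product for the last term is assumed):

```python
import torch

def heuristic_matching(s_h: torch.Tensor, s_p: torch.Tensor) -> torch.Tensor:
    """Heuristic matching features (Mou et al., 2016) for a hypothesis/premise
    pair of sentence embeddings, each of shape (batch, d)."""
    return torch.cat([s_h, s_p, (s_h - s_p).abs(), s_h * s_p], dim=-1)
```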


| Model | Train (%) | Test (%) | Parameters (m) | T(s)/epoch |
| 600D BiLSTM with self-attention (Liu et al., 2016) | 84.5 | 84.2 | 2.8 | - |
| 300D Directional self-attention network (Shen et al., 2018a) | 91.1 | 85.6 | 2.4 | 587 |
| 600D Gumbel TreeLSTM (Choi et al., 2018) | 93.1 | 86.0 | 10.0 | - |
| 600D Residual stacked encoders (Nie and Bansal, 2017) | 91.0 | 86.0 | 29.0 | - |
| 300D Reinforced self-attention network (Shen et al., 2018b) | 92.6 | 86.3 | 3.1 | 622 |
| 1200D Distance-based self-attention network (Im and Cho, 2017) | 89.6 | 86.3 | 4.7 | 693 |
| 600D CNN (Dense) with self-attention | 88.7 | 84.6 | 2.4 | 121 |
| ours (600D Single DSA) | 87.3 | 86.8 | 2.1 | 135 |
| ours (2400D Multiple DSA) | 89.0 | 87.4 | 7.0 | 198 |

Table 1: SNLI results. The T(s)/epoch values come from the original implementations, run on the same graphics card as ours (a single Nvidia GTX 1080Ti). Word embeddings are not counted in the parameters.

| Model | SST-2 | SST-5 |
| BiLSTM (Cho et al., 2014) | 87.5 | 49.5 |
| CNN-non-static (Kim, 2014) | 87.2 | 48.0 |
| BiLSTM with self-attention | 88.2 | 50.4 |
| CNN (Dense) with self-attention | 88.3 | 50.6 |
| ours (Single DSA) | 88.5 | 50.6 |

Table 2: Test accuracy on the SST dataset.

The results from the official SNLI leaderboard¹ are summarized in Table 1. Single DSA achieves new state-of-the-art test accuracy (86.8%) with the smallest number of parameters (2.1m). Besides, our learning time per epoch (135s) is significantly shorter than that of recent models because of DSA's simple structure and highly parallelized computations. With trade-offs in parameters and learning time per epoch, multiple DSA outperforms the other models by a large margin (+1.1%).

In comparison to the baseline, single DSA performs better than self-attention (+2.2%). This confirms that the dynamic weight vector is more effective for sentence embedding. Note that our implementation of the baseline, self-attention stacked on the CNN with Dense Connection, performs better (+0.4%) than the one stacked on a BiLSTM (Liu et al., 2016).

4.2 Sentiment Analysis Results

Sentiment analysis is the task of classifying the sentiment of sentences. We use the Stanford Sentiment Treebank (SST) dataset, consisting of 10k English sentences, to evaluate our model on single-sentence classification. We experiment with the SST-2 and SST-5 datasets, labeled with binary sentiment labels and five fine-grained labels, respectively.

The SST results are summarized in Table 2. We compare single DSA with four baseline models:

¹ https://nlp.stanford.edu/projects/snli/

BiLSTM (Cho et al., 2014), CNN (Kim, 2014), and self-attention with BiLSTM or CNN with Dense Connection. Single DSA outperforms all the baseline models on the SST-2 dataset and achieves competitive results on SST-5, which again verifies the effectiveness of the dynamic weight vector. In contrast to the distinct improvement on the SNLI dataset (+2.2%), on the SST dataset only marginal differences in performance between DSA and the previous self-attentive models are found. We conclude that DSA exhibits a more significant improvement for large and complex datasets.

5 Related Work

Our work differs from earlier self-attention for sentence embedding (Liu et al., 2016; Yang et al., 2016; Lin et al., 2017; Shen et al., 2018a) in that our weight vector is dynamic rather than static. Independently, there has recently been an approach to capsule network-based NLP: Zhao et al. (2018) applied the whole capsule network to the text classification task. By contrast, we only utilize one algorithm from the capsule network, dynamic routing, and modify it into self-attention with the dynamic weight vector, without unnecessary concepts such as capsules.

6 Conclusion

In this paper, we have proposed Dynamic Self-Attention (DSA), which computes attention weights over words with a dynamic weight vector. With the dynamic weight vector, the self-attention mechanism is furnished with flexibility. Our experiments show that DSA achieves new state-of-the-art results on the SNLI dataset, while showing competitive results on the SST dataset.


References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. Computing Research Repository, arXiv:1409.0473. Version 7.

Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pages 1724-1734.

Jihun Choi, Kang Min Yoo, and Sang-goo Lee. 2018. Learning to compose task-specific tree structures. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence.

Alexis Conneau, Douwe Kiela, Holger Schwenk, Loïc Barrault, and Antoine Bordes. 2017. Supervised learning of universal sentence representations from natural language inference data. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 670-680.

Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. 2017. Convolutional sequence to sequence learning. In Proceedings of the 34th International Conference on Machine Learning, pages 1243-1252.

Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. 2017. Densely connected convolutional networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, pages 2261-2269.

Jinbae Im and Sungzoon Cho. 2017. Distance-based self-attention network for natural language inference. Computing Research Repository, arXiv:1712.02047. Version 1.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pages 1746-1751.

Zhouhan Lin, Minwei Feng, Cícero Nogueira dos Santos, Mo Yu, Bing Xiang, Bowen Zhou, and Yoshua Bengio. 2017. A structured self-attentive sentence embedding. In International Conference on Learning Representations.

Yang Liu, Chengjie Sun, Lei Lin, and Xiaolong Wang. 2016. Learning natural language inference using bidirectional LSTM model and inner-attention. Computing Research Repository, arXiv:1605.09090. Version 1.

Lili Mou, Rui Men, Ge Li, Yan Xu, Lu Zhang, Rui Yan, and Zhi Jin. 2016. Natural language inference by tree-based convolution and heuristic matching. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics.

Yixin Nie and Mohit Bansal. 2017. Shortcut-stacked sentence encoders for multi-domain inference. In Proceedings of the Second Workshop on Evaluating Vector Space Representations for NLP, pages 41-45.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pages 1532-1543.

Sara Sabour, Nicholas Frosst, and Geoffrey E. Hinton. 2017. Dynamic routing between capsules. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems, pages 3859-3869.

Tao Shen, Tianyi Zhou, Guodong Long, Jing Jiang, Shirui Pan, and Chengqi Zhang. 2018a. DiSAN: Directional self-attention network for RNN/CNN-free language understanding. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence.

Tao Shen, Tianyi Zhou, Guodong Long, Jing Jiang, Sen Wang, and Chengqi Zhang. 2018b. Reinforced self-attention network: A hybrid of hard and soft attention for sequence modeling. Computing Research Repository, arXiv:1801.10296. Version 1.

Yatian Shen and Xuanjing Huang. 2016. Attention-based convolutional neural network for semantic relation extraction. In 26th International Conference on Computational Linguistics, Proceedings of the Conference, pages 2526-2536.

Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929-1958.

Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alexander J. Smola, and Eduard H. Hovy. 2016. Hierarchical attention networks for document classification. In The 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1480-1489.

Wei Zhao, Jianbo Ye, Min Yang, Zeyang Lei, Soufei Zhang, and Zhou Zhao. 2018. Investigating capsule networks with dynamic routing for text classification. Computing Research Repository, arXiv:1804.00857. Version 1.


Appendix

Detailed Experimental Settings

For training our implementations on both the Stanford Natural Language Inference (SNLI) dataset and the Stanford Sentiment Treebank (SST) dataset, we initialize word embeddings with the 300D GloVe 840B pretrained vectors and fix them during training. We use the cross-entropy loss as the objective function for both tasks. For both tasks, we set $r = 2$, $k_1 = 3$, $k_2 = 5$, $L = 4$, $d_1 = 150$, $d_l = 75$, where $l \in [2, L]$, and $d_c = 300$. All models are implemented in PyTorch. Details of the hyperparameters for each task are introduced in each section. Note that we followed the data preprocessing of SST from (Kim, 2014)² and of SNLI from (Conneau et al., 2017)³.
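For reference, these shared settings can be collected as follows (a convenience dictionary of ours, not part of the authors' release):

```python
# Shared hyperparameters stated in the appendix, gathered for the sketches above.
SHARED_CONFIG = {
    "word_dim": 300,   # 300D GloVe 840B embeddings, frozen during training
    "r": 2,            # iterations in Algorithm 1
    "k1": 3, "k2": 5,  # kernel sizes of the two dense CNN branches
    "L": 4,            # number of convolution layers per branch
    "d1": 150,         # output channels of h^1
    "dl": 75,          # output channels of h^l for l in [2, L]
    "dc": 300,         # compressed word dimension after h^c
}
```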

A SNLI

We minimize the cross-entropy loss with the Adam optimizer. We apply $m = 1$, $d_o = 600$ for single DSA and $m = 8$, $d_o = 300$ for multiple DSA. We use a two-hidden-layer multilayer perceptron (MLP) with a leaky rectified linear unit (Leaky ReLU) activation as the classifier, where the number of hidden-layer neurons is 300 (single) or 512 (multiple). Batch normalization and dropout are applied to the input of each hidden layer in the MLP. The dropout rate is set to 0.3 (single) or 0.4 (multiple) in the MLP. For both models, we use 0.00001 L2 regularization. We use dropout with a rate of 0.2 in every $h^l(\cdot)$ and $h^c(\cdot)$. We also use dropout with a rate of 0.3 for word embeddings. We initialize the parameters in every layer with He initialization and multiply them by the square root of the dropout rate of the layer. We initialize out-of-vocabulary words with $U(-0.005, 0.005)$. Starting with the default learning rate of Adam, we halve the learning rate if the training loss is not reduced five times, with a patience of 0.001. The size of a mini-batch is set to 256.
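One possible reading of the optimizer and schedule above, expressed in PyTorch (the mapping of "five times" and "patience of 0.001" onto the patience and threshold arguments is our assumption):

```python
import torch
import torch.nn as nn

# Stand-in module; in practice this would be the encoder plus the MLP classifier.
model = nn.Linear(10, 3)
# Adam with its default learning rate and 1e-5 L2 regularization.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-5)
# Halve the learning rate when the training loss has not improved by more than
# 0.001 over five consecutive checks.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=5, threshold=1e-3)
# Per epoch: compute train_loss, then call scheduler.step(train_loss).
```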

B SST

We minimize the cross-entropy loss with the Adadelta optimizer. We apply $d_o = 600$, $m = 1$. We use a one-hidden-layer MLP with Leaky ReLU activation as the classifier, where the number of hidden-layer neurons is 300. Batch normalization and dropout with a rate of 0.4 are applied to the input of the hidden layer in the MLP.

² https://github.com/yoonkim/CNN_sentence
³ https://github.com/facebookresearch/InferSent

For regularization, we use 0.00001 L2 regularization. We use dropout with a rate of 0.2 in every $h^l(\cdot)$ and $h^c(\cdot)$. We also use dropout with a rate of 0.4 for word embeddings. We initialize the parameters in every layer with He initialization and multiply them by the square root of the dropout rate of the layer. We initialize out-of-vocabulary words with $U(-0.05, 0.05)$. Starting with the default learning rate of Adadelta, we halve the learning rate if the training loss is not reduced two times, with a patience of 0.001. The size of a mini-batch is set to 128.

Visualization

Figure 4: Single DSA attends only to informative words in the given sentences.


Figure 3: We visualize 4 out of 8 attentions from multiple DSA, which are human-interpretable. Each attention in multiple DSA attends to different aspects of the given sentences. The 1st attention attends only to words related to a verb, the 2nd to words related to a place, the 3rd to words related to an adjective, and the 4th to words related to an organism.