
Attentive Convolution: Equipping CNNs with RNN-style Attention Mechanisms

Wenpeng Yin, University of Pennsylvania

[email protected]

Hinrich Schütze, Center for Information and Language Processing

LMU Munich, Germany

Abstract

In NLP, convolutional neural networks (CNNs) have benefited less than recurrent neural networks (RNNs) from attention mechanisms. We hypothesize that this is because the attention in CNNs has been mainly implemented as attentive pooling (i.e., it is applied to pooling) rather than as attentive convolution (i.e., it is integrated into convolution). Convolution is the differentiator of CNNs in that it can powerfully model the higher-level representation of a word by taking into account its local fixed-size context in the input text tx. In this work, we propose an attentive convolution network, ATTCONV. It extends the context scope of the convolution operation, deriving higher-level features for a word not only from local context, but also from information extracted from nonlocal context by the attention mechanism commonly used in RNNs. This nonlocal context can come (i) from parts of the input text tx that are distant or (ii) from extra (i.e., external) contexts ty. Experiments on sentence modeling with zero-context (sentiment analysis), single-context (textual entailment) and multiple-context (claim verification) demonstrate the effectiveness of ATTCONV in sentence representation learning with the incorporation of context. In particular, attentive convolution outperforms attentive pooling and is a strong competitor to popular attentive RNNs.[1]

1 Introduction

Natural language processing (NLP) has benefited greatly from the resurgence of deep neural networks (DNNs), due to their high performance with less need for engineered features. A DNN typically is composed of a stack of non-linear transformation layers, each generating a hidden representation for the input by projecting the output of a preceding layer into a new space. To date, building a single and static representation to express an input across diverse problems is far from satisfactory. Instead, it is preferable that the representation of the input vary across different application scenarios. In response, attention mechanisms (Graves, 2013; Graves et al., 2014) have been proposed to dynamically focus on parts of the input that are expected to be more specific to the problem. They are mostly implemented based on fine-grained alignments between two pieces of objects, each emitting a dynamic soft-selection to the components of the other, so that the selected elements dominate in the output hidden representation. Attention-based DNNs have demonstrated good performance on many tasks.

[1] https://github.com/yinwenpeng/Attentive_Convolution

Convolutional neural networks (CNNs, LeCun et al. (1998)) and recurrent neural networks (RNNs, Elman (1990)) are two important types of DNNs. Most work on attention has been done for RNNs. Attention-based RNNs typically take three types of inputs to make a decision at the current step: (i) the current input state, (ii) a representation of local context (computed unidirectionally or bidirectionally; Rocktäschel et al. (2016)) and (iii) the attention-weighted sum of hidden states corresponding to nonlocal context (e.g., the hidden states of the encoder in neural machine translation (Bahdanau et al., 2015)). An important question, therefore, is whether CNNs can benefit from such an attention mechanism as well, and how. This is our technical motivation.

Our second motivation is natural language understanding (NLU). In generic sentence modeling without extra context (Collobert et al., 2011; Kalchbrenner et al., 2014; Kim, 2014), CNNs learn sentence representations by composing word representations that are conditioned on a local context window. We believe that attentive convolution is needed for some NLU tasks that are essentially sentence modeling within contexts.


premise, modeled as context ty                            label
Plant cells have structures that animal cells lack.       0
Animal cells do not have cell walls.                      1
The cell wall is not a freestanding structure.            0
Plant cells possess a cell wall, animals never.           1

Table 1: Examples of four premises for the hypothesis tx = "A cell wall is not present in animal cells." in the SCITAIL dataset. Right column (hypothesis's label): "1" means true, "0" otherwise.

Examples: textual entailment (Dagan et al. (2013); is a hypothesis true given a premise as the single context?) and claim verification (Thorne et al. (2018); is a claim correct given evidence snippets extracted from a text corpus as the context?). Consider the SCITAIL (Khot et al., 2018) textual entailment examples in Table 1; here the input text tx is the hypothesis and each premise is a context text ty. And consider the illustration of claim verification in Figure 1; here the input text tx is the claim and ty can consist of multiple pieces of context. In both cases, we would like the representation of tx to be context specific.

In this work, we propose attentive convolution networks, ATTCONV, to model a sentence (i.e., tx) either in intra-context (where ty = tx) or extra-context (where ty ≠ tx and ty can have many pieces) scenarios. In the intra-context case (sentiment analysis, for example), ATTCONV extends the local context window of standard CNNs to cover the entire input text tx. In the extra-context case, ATTCONV extends the local context window to cover accompanying contexts ty.

For a convolution operation over a window in tx such as (leftcontext, word, rightcontext), we first compare the representation of word with all hidden states in the context ty to get an attentive context representation attcontext; convolution filters then derive a higher-level representation for word, denoted wordnew, by integrating word with three pieces of context: leftcontext, rightcontext and attcontext. We interpret this attentive convolution from two perspectives. (i) For intra-context, a higher-level word representation wordnew is learned by considering the local (i.e., leftcontext and rightcontext) as well as the nonlocal (i.e., attcontext) context. (ii) For extra-context, wordnew is generated to represent word, together with its cross-text alignment attcontext, in the context leftcontext and rightcontext. In other words, the decision for the word is made based on the connected hidden states of cross-text aligned terms, together with local context.

[Figure 1: Verify claims in contexts. Claims such as "Marilyn Monroe worked with Warner Brothers" and "Telemundo is a English-language television network." are checked against multiple contexts c1, c2, ..., ci, ..., cn to predict claim classes.]

We apply ATTCONV to three sentence modeling tasks with variable-size context: a large-scale Yelp sentiment classification task (Lin et al., 2017) (intra-context, i.e., no additional context), SCITAIL textual entailment (Khot et al., 2018) (single extra-context) and claim verification (Thorne et al., 2018) (multiple extra-contexts). ATTCONV outperforms competitive DNNs with and without attention and achieves state-of-the-art results on the three tasks.

Overall, we make the following contributions:

• This is the first work that equips convolution filters with the attention mechanism commonly employed in RNNs.

• We distinguish and build flexible modules – attention source, attention focus and attention beneficiary – to greatly advance the expressivity of attention mechanisms in CNNs.

• ATTCONV provides a new way to broaden the originally constrained scope of filters in conventional CNNs. Broader and richer context comes from either external context (i.e., ty) or the sentence itself (i.e., tx).

• ATTCONV shows its flexibility and effectiveness in sentence modeling with variable-size context.

2 Related Work

In this section we discuss attention-related DNNs in NLP, the most relevant work for our paper.

2.1 RNNs with Attention

Graves (2013) and Graves et al. (2014) first introduced a differentiable attention mechanism that allows RNNs to focus on different parts of the input. This idea has been broadly explored in RNNs, shown in Figure 2, to deal with text generation, such as neural machine translation (Bahdanau et al., 2015; Luong et al., 2015; Libovický and Helcl, 2017; Kim et al., 2017), response generation in social media (Shang et al., 2015), document reconstruction (Li et al., 2015) and document summarization (Nallapati et al., 2016), machine comprehension (Hermann et al., 2015; Kumar et al., 2016; Xiong et al., 2016; Wang and Jiang, 2017; Xiong et al., 2017; Seo et al., 2017; Wang et al., 2017a) and sentence relation classification, such as textual entailment (Rocktäschel et al., 2016; Wang and Jiang, 2016; Cheng et al., 2016; Wang et al., 2017b; Chen et al., 2017b) and answer sentence selection (Miao et al., 2016).

[Figure 2: A simplified illustration of the attention mechanism in RNNs. Hidden states of sentence ty are combined by a weighted sum into an attentive context for sentence tx.]

[Figure 3: Attentive pooling, summarized from ABCNN (Yin et al., 2016) and APCNN (dos Santos et al., 2016). Convolution over sentences tx and ty yields hidden-state feature maps X and Y; an inter-hidden-state match produces matching scores, whose row-wise and column-wise compositions yield the sentence representations X · softmax(·) and Y · softmax(·).]

We explore RNN-style attention mechanisms in CNNs, more specifically in convolution.

2.2 CNNs with Attention

In NLP, there is little work on attention-based CNNs. Gehring et al. (2017) propose an attention-based convolutional seq-to-seq model for machine translation. Both the encoder and decoder are hierarchical convolution layers. At the nth layer of the decoder, the output hidden state of a convolution queries each of the encoder-side hidden states, then a weighted sum of all encoder hidden states is added to the decoder hidden state, and finally this updated hidden state is fed to the convolution at layer n+1. Their attention implementation relies on the existence of a multi-layer convolution structure – otherwise the weighted context from the encoder side could not play a role in the decoder. So essentially their attention is achieved after convolution. In contrast, we aim to modify the vanilla convolution, so that CNNs with attentive convolution can be applied more broadly.

We discuss two systems that are representative of CNNs that implement attention in pooling (i.e., the convolution itself is not affected): Yin et al. (2016) and dos Santos et al. (2016), illustrated in Figure 3. These two systems work on two input sentences, each with a set of hidden states generated by a convolution layer. Each sentence then learns a weight for every hidden state by comparing this hidden state with all hidden states in the other sentence; finally, each input sentence obtains a representation by a weighted mean pooling over all its hidden states. The core component – weighted mean pooling – is referred to as "attentive pooling", aiming to yield the sentence representation.
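For contrast with the attentive convolution introduced in Section 3, the weighted-mean-pooling scheme just described can be sketched as follows. This is our own simplified PyTorch illustration of the general idea; tensor names and sizes are invented, and it is not the ABCNN/APCNN code.

```python
import torch

d, len_x, len_y = 300, 7, 9
X = torch.randn(len_x, d)                # post-convolution hidden states of sentence tx
Y = torch.randn(len_y, d)                # post-convolution hidden states of sentence ty

S = X @ Y.t()                            # matching scores between all hidden-state pairs
w_x = torch.softmax(S.sum(dim=1), dim=0) # one scalar weight per tx hidden state
w_y = torch.softmax(S.sum(dim=0), dim=0) # one scalar weight per ty hidden state

rep_x = w_x @ X                          # weighted mean pooling -> sentence representation
rep_y = w_y @ Y
print(rep_x.shape, rep_y.shape)          # torch.Size([300]) each
```

Only the scalar weights w_x and w_y carry cross-sentence information into rep_x and rep_y, which is exactly the limitation discussed next.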

In contrast to attentive convolution, attentive pooling does not directly connect the hidden states of cross-text aligned phrases in a fine-grained manner to the final decision making; only the matching scores contribute to the final weighting in mean pooling. This important distinction between attentive convolution and attentive pooling is further discussed in Section 3.3.

Inspired by the attention mechanisms in RNNs, we assume that it is the hidden states of aligned phrases, rather than their matching scores, that can better contribute to representation learning and decision making. Hence, our attentive convolution differs from attentive pooling in that it uses attended hidden states from extra context (i.e., ty), or from broader-range context within tx, to participate in the convolution. In our experiments, we show its superiority.

3 ATTCONV Model

We use bold uppercase, e.g., H, for matrices; bold lowercase, e.g., h, for vectors; bold lowercase with index, e.g., h_i, for columns of H; and non-bold lowercase for scalars.

[Figure 4: ATTCONV models sentence tx with context ty. (a) Light attentive convolution layer: at layer n, hidden states h_{i-1}, h_i, h_{i+1} of tx and an attentive context c_i computed over ty feed the attentive convolution that produces layer n+1. (b) Advanced attentive convolution layer: attention source f_mgran(H^x), attention focus f_mgran(H^y) and attention beneficiary f_bene(H^x) are built first; matching between source and focus yields the attentive context, which joins the beneficiary in the attentive convolution.]

To start, we assume that a piece of text t (t ∈ {tx, ty}) is represented as a sequence of hidden states h_i ∈ R^d (i = 1, 2, ..., |t|), forming feature map H ∈ R^{d×|t|}, where d is the dimensionality of hidden states. Each hidden state h_i has its left context l_i and right context r_i. In concrete CNN systems, contexts l_i and r_i can cover multiple adjacent hidden states; we set l_i = h_{i-1} and r_i = h_{i+1} for simplicity in the following description.

We now describe light and advanced versions of ATTCONV. Recall that ATTCONV aims to compute a representation for tx in a way that convolution filters encode not only local context, but also attentive context over ty.

3.1 Light ATTCONV

Figure 4(a) shows the light version of ATTCONV. It differs in two key points – (i) and (ii) – both from the basic convolution layer that models a single piece of text and from the Siamese CNN that models two text pieces in parallel. (i) A matching function determines how relevant each hidden state in the context ty is to the current hidden state h^x_i in sentence tx. We then compute an average of the hidden states in the context ty, weighted by the matching scores, to get the attentive context c^x_i for h^x_i. (ii) The convolution for position i in tx integrates hidden state h^x_i with three sources of context: left context h^x_{i-1}, right context h^x_{i+1} and attentive context c^x_i.

Attentive Context. First, a function generates a matching score e_{i,j} between a hidden state in tx and a hidden state in ty by (i) dot product:

e_{i,j} = (h^x_i)^T · h^y_j    (1)

or (ii) bilinear form:

e_{i,j} = (h^x_i)^T W_e h^y_j    (2)

(where W_e ∈ R^{d×d}) or (iii) additive projection:

e_{i,j} = v_e^T · tanh(W_e · h^x_i + U_e · h^y_j)    (3)

where W_e, U_e ∈ R^{d×d} and v_e ∈ R^d.

Given the matching scores, the attentive context c^x_i for hidden state h^x_i is the weighted average of all hidden states in ty:

c^x_i = Σ_j softmax(e_i)_j · h^y_j    (4)

We refer to the concatenation of attentive contexts [c^x_1; ...; c^x_i; ...; c^x_{|tx|}] as the feature map C^x ∈ R^{d×|tx|} for tx.
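To make Equations 1–4 concrete, here is a minimal PyTorch sketch (not the authors' released implementation) that computes the attentive-context feature map C^x from two hidden-state feature maps; the shapes and variable names are illustrative assumptions, and only the dot-product matching of Equation 1 is shown.

```python
import torch

d, len_x, len_y = 300, 7, 9       # hidden size and (made-up) sentence lengths
H_x = torch.randn(len_x, d)       # hidden states of sentence tx, one row per position
H_y = torch.randn(len_y, d)       # hidden states of context ty

# Eq. (1): dot-product matching scores e_{i,j} between every h^x_i and h^y_j
# (the bilinear and additive variants of Eqs. (2)-(3) would add learned weights here)
E = H_x @ H_y.t()                 # shape (len_x, len_y)

# Eq. (4): softmax over ty positions, then weighted average of ty hidden states
A = torch.softmax(E, dim=1)       # each row sums to 1
C_x = A @ H_y                     # attentive context c^x_i for every position i, shape (len_x, d)

print(E.shape, C_x.shape)
```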

Attentive Convolution. After the attentive context has been computed, a position i in the sentence tx has a hidden state h^x_i, the left context h^x_{i-1}, the right context h^x_{i+1} and the attentive context c^x_i. Attentive convolution then generates the higher-level hidden state at position i:

h^x_{i,new} = tanh(W · [h^x_{i-1}, h^x_i, h^x_{i+1}, c^x_i] + b)    (5)
            = tanh(W_1 · [h^x_{i-1}, h^x_i, h^x_{i+1}] + W_2 · c^x_i + b)    (6)

where W ∈ R^{d×4d} is the concatenation of W_1 ∈ R^{d×3d} and W_2 ∈ R^{d×d}, and b ∈ R^d.

As Equation 6 shows, Equation 5 can be achieved by summing up the results of two separate and parallel convolution steps before the non-linearity. The first is still a standard convolution-without-attention over feature map H^x with filter width 3 over the window (h^x_{i-1}, h^x_i, h^x_{i+1}). The second is a convolution on the feature map C^x, i.e., the attentive context, with filter width 1, i.e., over each c^x_i; the results are then summed element-wise, a bias term is added and the non-linearity is applied. This divide-then-compose strategy makes attentive convolution easy to implement in practice, with no need to create a new feature map, as would be required in Equation 5, to integrate H^x and C^x.
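The divide-then-compose strategy of Equations 5–6 thus amounts to summing a width-3 convolution over H^x and a width-1 convolution over C^x before the non-linearity; a hedged PyTorch sketch under assumed shapes:

```python
import torch
import torch.nn as nn

d, len_x = 300, 7
H_x = torch.randn(1, d, len_x)    # feature map of tx: (batch, channels=d, positions)
C_x = torch.randn(1, d, len_x)    # attentive context feature map from Eq. (4)

conv_local = nn.Conv1d(d, d, kernel_size=3, padding=1, bias=True)   # W_1 and b: width-3 filter over (h_{i-1}, h_i, h_{i+1})
conv_att   = nn.Conv1d(d, d, kernel_size=1, bias=False)             # W_2: width-1 filter over c^x_i

# Eq. (6): sum the two convolution outputs, then apply the non-linearity
H_new = torch.tanh(conv_local(H_x) + conv_att(C_x))                 # shape (1, d, len_x)
print(H_new.shape)
```

conv_local alone is exactly a vanilla CNN layer, which is why the only extra parameters are those of conv_att (W_2), as noted below.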


role        text
premise     Three firefighters come out of subway station
hypothesis  Three firefighters putting out a fire inside of a subway station

Table 2: Multi-granular alignments required in TE

It is worth mentioning that W_1 ∈ R^{d×3d} corresponds to the filter parameters of a vanilla CNN, and the only added parameter here is W_2 ∈ R^{d×d}, which only depends on the hidden size.

This light ATTCONV shows the basic principles of using RNN-style attention mechanisms in convolution. Our experiments show that this light version of ATTCONV – even though it incurs a limited increase of parameters (i.e., W_2) – works much better than the vanilla Siamese CNN and some of the pioneering attentive RNNs. The following two considerations show that there is space to improve its expressivity.

(i) Higher-level or more abstract representations are required in subsequent layers. We find that directly forwarding the hidden states in tx or ty to the matching process does not work well in some tasks. Pre-learning some higher-level or more abstract representations helps in subsequent learning phases.

(ii) Multi-granular alignments are preferred in the interaction modeling between tx and ty. Table 2 shows another example of textual entailment. On the unigram level, "out" in the premise matches "out" in the hypothesis perfectly, while "out" in the premise is contradictory to "inside" in the hypothesis. But their context snippets – "come out" in the premise and "putting out a fire" in the hypothesis – clearly indicate that they are not semantically equivalent. And the gold conclusion for this pair is "neutral", i.e., the hypothesis is possibly true. Therefore, matching should be conducted across phrase granularities.

We now present advanced ATTCONV. It is more expressive and modular, based on the two foregoing considerations (i) and (ii).

3.2 Advanced ATTCONV

Adel and Schütze (2017) distinguish between focus and source of attention. The focus of attention is the layer of the network that is reweighted by attention weights. The source of attention is the information source that is used to compute the attention weights. Adel and Schütze (2017) showed that increasing the scope of the attention source is beneficial. It possesses some preliminary principles of the query/key/value distinction by Vaswani et al. (2017). Here we further extend this principle to define the beneficiary of attention – the feature map (labeled "beneficiary" in Figure 4(b)) that is contextualized by the attentive context (labeled "attentive context" in Figure 4(b)). In the light attentive convolution layer (Figure 4(a)), the source of attention is the hidden states in sentence tx, the focus of attention is the hidden states of the context ty, and the beneficiary of attention is again the hidden states of tx, i.e., it is identical to the source of attention.

We now try to distinguish these three concepts further to promote the expressivity of an attentive convolutional layer. We call it "advanced ATTCONV"; see Figure 4(b). It differs from the light version in three ways: (i) attention source is learned by function f_mgran(H^x), with feature map H^x of tx acting as input; (ii) attention focus is learned by function f_mgran(H^y), with feature map H^y of context ty acting as input; (iii) attention beneficiary is learned by function f_bene(H^x), with H^x acting as input. Both functions f_mgran() and f_bene() are based on a gated convolutional function f_gconv():

o_i = tanh(W_h · i_i + b_h)    (7)
g_i = sigmoid(W_g · i_i + b_g)    (8)
f_gconv(i_i) = g_i · u_i + (1 - g_i) · o_i    (9)

where i_i is a composed representation, denoting a generally defined input phrase [..., u_i, ...] of arbitrary length with u_i as the central unigram-level hidden state; the gate g_i sets a trade-off between the unigram-level input u_i and the temporary output o_i at the phrase level. We elaborate these modules in the remainder of this subsection.
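A minimal sketch of f_gconv (Equations 7–9) as a PyTorch module; the class name, batching and toy sizes are our own assumptions, not the released code.

```python
import torch
import torch.nn as nn

class GatedConv(nn.Module):
    """Sketch of f_gconv (Eqs. 7-9): gate between the central unigram hidden
    state u_i and a tanh projection o_i of the whole input phrase i_i."""
    def __init__(self, d, phrase_width):
        super().__init__()
        self.proj_o = nn.Linear(phrase_width * d, d)   # W_h, b_h
        self.proj_g = nn.Linear(phrase_width * d, d)   # W_g, b_g

    def forward(self, phrase, center):
        # phrase: (batch, phrase_width*d), the concatenated hidden states [..., u_i, ...]
        # center: (batch, d), the central unigram-level hidden state u_i
        o = torch.tanh(self.proj_o(phrase))            # Eq. (7)
        g = torch.sigmoid(self.proj_g(phrase))         # Eq. (8)
        return g * center + (1 - g) * o                # Eq. (9)

# toy usage with made-up sizes
d = 300
fgconv = GatedConv(d, phrase_width=3)
tri = torch.randn(4, 3 * d)                            # [h_{i-1}, h_i, h_{i+1}]
h_i = torch.randn(4, d)
print(fgconv(tri, h_i).shape)                          # torch.Size([4, 300])
```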

Attention Source. First, we present a general instance of generating the source of attention by function f_mgran(H), learning word representations in multi-granular context. In our system, we consider granularities one and three, corresponding to the unigram hidden state and the trigram hidden state. For the uni-hidden-state case, it is a gated convolution layer:

h^x_{uni,i} = f_gconv(h^x_i)    (10)

For the tri-hidden-state case:

h^x_{tri,i} = f_gconv([h^x_{i-1}, h^x_i, h^x_{i+1}])    (11)

Finally, the overall hidden state at position i is the concatenation of h_{uni,i} and h_{tri,i}:

h^x_{mgran,i} = [h^x_{uni,i}, h^x_{tri,i}]    (12)

i.e., f_mgran(H^x) = H^x_mgran.
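Equations 10–12 then amount to running f_gconv at two filter widths and concatenating the results; a hedged functional sketch (random, untrained convolutions, purely to show the shapes):

```python
import torch
import torch.nn as nn

d, length = 300, 7
H = torch.randn(1, d, length)                                   # feature map of tx

def fgconv(conv_o, conv_g, x, center):
    o = torch.tanh(conv_o(x))                                   # Eq. (7)
    g = torch.sigmoid(conv_g(x))                                # Eq. (8)
    return g * center + (1 - g) * o                             # Eq. (9)

# Eq. (10): unigram granularity (width-1 convolution over each h_i)
uni = fgconv(nn.Conv1d(d, d, 1), nn.Conv1d(d, d, 1), H, H)

# Eq. (11): trigram granularity (width-3 convolution over [h_{i-1}, h_i, h_{i+1}])
tri = fgconv(nn.Conv1d(d, d, 3, padding=1), nn.Conv1d(d, d, 3, padding=1), H, H)

# Eq. (12): concatenate the two granularities along the feature dimension
H_mgran = torch.cat([uni, tri], dim=1)                          # shape (1, 2d, length)
print(H_mgran.shape)
```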


Such a comprehensive hidden state can encode the semantics of multi-granular spans at a position, such as "out" and "come out of". Gating here implicitly enables cross-granular alignments in the subsequent attention mechanism, as it sets up highway connections (Srivastava et al., 2015) between the input granularity and the output granularity.

Attention Focus. For simplicity, we use the same architecture for the attention source (just introduced) and for the attention focus, ty; i.e., for the attention focus: f_mgran(H^y) = H^y_mgran. See Figure 4(b). Thus, the focus of attention will participate in the matching process as well as be reweighted to form an attentive context vector. We leave exploring different architectures for attention source and focus for future work.

Another benefit of multi-granular hidden states in the attention focus is to keep structure information in the context vector. In standard attention mechanisms in RNNs, all hidden states are averaged with attention weights into a context vector, so the order information is missing. By introducing hidden states of larger granularity into CNNs, which keep the local order or structures, we boost the attentive effect.

Attention Beneficiary. In our system, we simply use f_gconv() over uni-granularity to learn a more abstract representation over the current hidden representations in H^x, so that

f_bene(h^x_i) = f_gconv(h^x_i)    (13)

Subsequently, the attentive context vector c^x_i is generated based on the attention source feature map f_mgran(H^x) and the attention focus feature map f_mgran(H^y), according to the description of the light ATTCONV. Then attentive convolution is conducted over the attention beneficiary feature map f_bene(H^x) and the attentive context vectors C^x to get a higher-layer feature map for the sentence tx.
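Wiring the three roles together, one advanced attentive convolution layer can be sketched as below. The random linear maps stand in for f_mgran() and f_bene() (which are gated convolutions in the model), so this is an illustration of the data flow under assumed dimensions rather than the exact architecture.

```python
import torch

d, len_x, len_y = 300, 7, 9
H_x = torch.randn(len_x, d)                     # hidden states of sentence tx
H_y = torch.randn(len_y, d)                     # hidden states of context ty

# stand-ins for the gated multi-granular / beneficiary transforms (Eqs. 7-13),
# replaced by random linear maps just to show how the three roles are wired
W_src, W_foc, W_ben = (torch.randn(d, d) for _ in range(3))

source      = H_x @ W_src                       # f_mgran(H^x): queries the context
focus       = H_y @ W_foc                       # f_mgran(H^y): is matched and reweighted
beneficiary = H_x @ W_ben                       # f_bene(H^x): receives the attentive context

A   = torch.softmax(source @ focus.t(), dim=1)  # matching + normalization (Eqs. 1, 4)
C_x = A @ focus                                 # attentive context per position of tx

# the attentive convolution of Eq. (6) then runs over `beneficiary` and `C_x`
print(beneficiary.shape, C_x.shape)
```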

3.3 Analysis

Compared to the standard attention mechanism in RNNs, ATTCONV has a similar matching function and a similar process of computing context vectors, but differs in three ways. (i) The discrimination of attention source, focus and beneficiary improves expressivity. (ii) In CNNs, the surrounding hidden states for a concrete position are available, so the attention matching is able to encode the left context as well as the right context. In RNNs, however, bidirectional RNNs are needed to yield both left and right context representations. (iii) As attentive convolution can be implemented by summing up two separate convolution steps (Eqs. 5–6), this architecture provides both attentive representations and representations computed without the use of attention. This strategy is helpful in practice for using richer representations for some NLP problems. In contrast, such a clean modular separation of representations computed with and without attention is harder to realize in attention-based RNNs.

Prior attention mechanisms explored in CNNs mostly involve attentive pooling (Yin et al., 2016; dos Santos et al., 2016), i.e., the weights of the post-convolution pooling layer are determined by attention. These weights come from the matching process between hidden states of two text pieces. However, a weight value is not informative enough to convey the relationship between aligned terms. Consider a textual entailment sentence pair for which we need to determine whether "inside −→ outside" holds. The matching degree (take cosine similarity as an example) of these two words is high, e.g., ≈ .7 in Word2Vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014). On the other hand, the matching score between "inside" and "in" is lower: .31 in Word2Vec, .46 in GloVe. Apparently, the higher number .7 does not mean that "outside" is more likely than "in" to be entailed by "inside". Instead, joint representations for aligned phrases [h_inside, h_outside], [h_inside, h_in] are more informative and enable finer-grained reasoning than a mechanism that can only transmit information downstream by matching scores. We modify the conventional CNN filters so that "inside" can make the entailment decision by looking at the representation of the counterpart term ("outside" or "in") rather than a matching score.

A more damaging property of attentive pooling is the following. Even if matching scores could convey the phrase-level entailment degree to some extent, the matching weights are in fact not leveraged to make the entailment decision directly; instead, they are used to weight the sum of the output hidden states of a convolution as the global sentence representation. In other words, fine-grained entailment degrees are likely to be lost in the summation of many vectors. This illustrates why attentive context vectors participating in the convolution operation are expected to be more effective than post-convolution attentive pooling (more explanations in Section 4.3; see paragraph "Visualization").


Intra-context attention & extra-context attention. Figures 4(a)–4(b) depict the modeling of a sentence tx with its context ty. This is a common application of attention mechanisms in the literature; we call it extra-context attention. But ATTCONV can also be applied to model a single text input, i.e., intra-context attention. Consider a sentiment analysis example: "With the 2017 NBA All-Star game in the books I think we can all agree that this was definitely one to remember. Not because of the three-point shootout, the dunk contest, or the game itself but because of the ludicrous trade that occurred after the festivities"; this example contains informative points at different locations ("remember" and "ludicrous"); conventional CNNs' ability to model such nonlocal dependencies is limited by fixed-size filter widths. In ATTCONV, we can set ty = tx. The attentive context vector then accumulates all related parts together for a given position. In other words, our intra-context attentive convolution is able to connect all related spans together to form a comprehensive decision. This is a new way to broaden the scope of conventional filter widths: a filter now covers not only the local window, but also those spans that are related but beyond the scope of the window.
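A tiny self-contained sketch of the intra-context case, where the context is simply the sentence itself (ty = tx), so every position can gather related spans from anywhere in the input; sizes are illustrative.

```python
import torch

# intra-context attention: reuse the extra-context computation with ty = tx
H_x = torch.randn(20, 300)                     # |tx| x d hidden states of the input sentence
A   = torch.softmax(H_x @ H_x.t(), dim=1)      # each word attends over the whole sentence
C_x = A @ H_x                                  # nonlocal attentive context per position
print(C_x.shape)                               # torch.Size([20, 300])
```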

Comparison to Transformer.[2] The "focus" in ATTCONV corresponds to "key" and "value" in the Transformer; i.e., our versions of "key" and "value" are the same, coming from the context sentence. The "query" in the Transformer corresponds to the "source" and "beneficiary" of ATTCONV; i.e., our model has two perspectives for utilizing the context: one acts as a real query (i.e., "source") to attend to the context, the other (i.e., "beneficiary") takes the attentive context back to improve the learned representation of itself. If we reduce ATTCONV to unigram convolutional filters, it is pretty much a single Transformer layer (if we neglect the positional encoding in the Transformer and unify the "query-key-value" and "source-focus-beneficiary" mechanisms).

4 Experiments

We evaluate ATTCONV on sentence modeling in three scenarios: (i) Zero-context, i.e., intra-context: the same input sentence acts as tx as well as ty; (ii) Single-context: textual entailment – hypothesis modeling with a single premise as the extra-context; (iii) Multiple-context: claim verification – claim modeling with multiple extra-contexts.

[2] Our "source-focus-beneficiary" mechanism was inspired by Adel and Schütze (2017). Vaswani et al. (2017) later published the Transformer model, which has a similar "query-key-value" mechanism.

systems                        acc
w/o attention
  Paragraph Vector             58.43
  Lin et al. Bi-LSTM           61.99
  Lin et al. CNN               62.05
  MultichannelCNN (Kim)        64.62
with attention
  CNN+internal attention       61.43
  ABCNN                        61.36
  APCNN                        61.98
  Attentive-LSTM               63.11
  Lin et al. RNN Self-Att.     64.21
ATTCONV
  light                        66.75
    w/o convolution            61.34
  advanced                     67.36*

Table 3: System comparison of sentiment analysis on Yelp. Significant improvements over state of the art are marked with * (test of equal proportions, p < .05).

4.1 Common setup and common baselines

All experiments share a common setup. The input is represented using 300-dimensional publicly available Word2Vec (Mikolov et al., 2013) embeddings; OOVs are randomly initialized. The architecture consists of the following four layers in sequence: embedding, attentive convolution, max-pooling and logistic regression. The context-aware representation of tx is forwarded to the logistic regression layer. We use AdaGrad (Duchi et al., 2011) for training. Embeddings are fine-tuned during training. Hyperparameter values include: learning rate 0.01, hidden size 300, batch size 50, filter width 3.
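For concreteness, the four-layer stack could be assembled roughly as below; this is our hedged reconstruction with illustrative module names (using the light, dot-product variant of attentive convolution), not the released code.

```python
import torch
import torch.nn as nn

class AttConvClassifier(nn.Module):
    # hedged reconstruction of the experimental stack:
    # embedding -> attentive convolution -> max-pooling -> logistic regression
    def __init__(self, vocab_size, num_classes, d=300):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, d)            # initialized from Word2Vec in the paper
        self.conv_local = nn.Conv1d(d, d, 3, padding=1)   # filter width 3 (W_1, b)
        self.conv_att = nn.Conv1d(d, d, 1, bias=False)    # width-1 filter over attentive context (W_2)
        self.out = nn.Linear(d, num_classes)              # logistic-regression layer

    def forward(self, x_ids, y_ids):
        H_x = self.emb(x_ids)                             # (batch, len_x, d)
        H_y = self.emb(y_ids)                             # (batch, len_y, d)
        A = torch.softmax(H_x @ H_y.transpose(1, 2), dim=-1)
        C_x = A @ H_y                                     # attentive context, (batch, len_x, d)
        H_new = torch.tanh(self.conv_local(H_x.transpose(1, 2)) +
                           self.conv_att(C_x.transpose(1, 2)))
        rep = H_new.max(dim=2).values                     # max-pooling over positions
        return self.out(rep)                              # class scores

model = AttConvClassifier(vocab_size=10000, num_classes=5)
logits = model(torch.randint(0, 10000, (2, 12)), torch.randint(0, 10000, (2, 15)))
print(logits.shape)                                       # torch.Size([2, 5])
```

Training with AdaGrad and the hyperparameters stated above would be added on top of this skeleton.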

All experiments are designed to explore comparisons in three aspects: (i) within ATTCONV, "light" vs. "advanced"; (ii) "attentive convolution" vs. "attentive pooling"/"attention only"; (iii) "attentive convolution" vs. "attentive RNN".

To this end, we always report "light" and "advanced" ATTCONV performance and compare against five types of common baselines: (i) w/o context; (ii) w/o attention; (iii) w/o convolution: similar to the Transformer's principle (Vaswani et al., 2017), we discard the convolution operation in Equation 5 and forward the addition of the attentive context c^x_i and the h^x_i into a fully connected layer. To keep enough parameters, we stack four layers in total so that "w/o convolution" has the same number of parameters as light ATTCONV; (iv) with attention: RNNs with attention and CNNs with attentive pooling; and (v) prior state of the art, typeset in italics.

4.2 Sentence modeling with zero-context: sentiment analysis

We evaluate sentiment analysis on a Yelp benchmark released by Lin et al. (2017): review-star pairs in sizes 500K (train), 2000 (dev) and 2000 (test). Most text instances in this dataset are long: the 25%, 50% and 75% percentiles are 46, 81 and 125 words, respectively. The task is five-way classification: one to five stars. The measure is accuracy. We use this benchmark because the predominance of long texts lets us evaluate how well a system encodes long-range context, and because the system by Lin et al. (2017) is directly related to ATTCONV in the intra-context scenario.

Baselines. (i) w/o attention. Three baselines from Lin et al. (2017): Paragraph Vector (Le and Mikolov, 2014) (unsupervised sentence representation learning), Bi-LSTM and CNN. We also reimplement MultichannelCNN (Kim, 2014), recognized as a simple but surprisingly strong sentence modeler. (ii) with attention. A vanilla "Attentive-LSTM" by Rocktäschel et al. (2016). "RNN Self-Attention" (Lin et al., 2017) is directly comparable to ATTCONV: it also uses intra-context attention. "CNN+internal attention" (Adel and Schütze, 2017) is an intra-context attention idea similar to, but less complicated than, that of Lin et al. (2017). ABCNN/APCNN are CNNs with attentive pooling.

Results and Analysis. Table 3 shows that advanced ATTCONV surpasses its "light" counterpart and obtains a significant improvement over the state of the art.

In addition, ATTCONV surpasses attentive pooling (ABCNN & APCNN) by a big margin (>5%) and outperforms the representative attentive-LSTM (>4%).

Furthermore, it outperforms the two self-attentive models, CNN+internal attention (Adel and Schütze, 2017) and RNN Self-Attention (Lin et al., 2017), which are specifically designed for single-sentence modeling. Adel and Schütze (2017) generate an attention weight for each CNN hidden state by a linear transformation of the same hidden state, then compute a weighted average over all hidden states as the text representation. Lin et al. (2017) extend that idea by generating a group of attention weight vectors; RNN hidden states are then averaged by those diverse weight vectors, allowing different aspects of the text to be extracted into multiple vector representations. Both works are essentially weighted mean pooling, similar to the attentive pooling in (Yin et al., 2016; dos Santos et al., 2016).

[Figure 5: ATTCONV vs. MultichannelCNN for groups of Yelp text with ascending text lengths (accuracy per group of sorted texts, plus the difference of the two curves). ATTCONV performs more robustly across different lengths of text.]

        #instances  #entail  #neutral
train   23,596      8,602    14,994
dev     1,304       657      647
test    2,126       842      1,284
total   27,026      10,101   16,925

Table 4: Statistics of the SCITAIL dataset

Next, we compare ATTCONV with MultichannelCNN, the strongest "w/o attention" baseline system, across different length ranges to check whether ATTCONV can really encode long-range context effectively. We sort the 2000 test instances by length, then split them into 10 groups, each consisting of 200 instances. Figure 5 shows the performance of ATTCONV vs. MultichannelCNN.

We observe that ATTCONV consistently outperforms MultichannelCNN for all lengths. Furthermore, the improvement over MultichannelCNN generally increases with length. This is evidence that ATTCONV models long text more effectively, likely due to its capability to encode broader context in its filters.
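The length-bucket analysis behind Figure 5 is straightforward to reproduce; a sketch assuming a list of (token_count, is_correct) pairs per system, with fabricated toy data only to show the output shape.

```python
# sketch of the length-bucket analysis behind Figure 5, assuming a list of
# (token_count, is_correct) pairs for the 2000 test instances of a system
def accuracy_by_length(results, num_groups=10):
    results = sorted(results, key=lambda r: r[0])          # sort test instances by length
    size = len(results) // num_groups                      # 200 instances per group here
    return [sum(correct for _, correct in results[i*size:(i+1)*size]) / size
            for i in range(num_groups)]

# toy usage with fabricated correctness flags
toy = [(n, n % 3 != 0) for n in range(2000)]
print(accuracy_by_length(toy))                             # 10 per-group accuracies
```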


systems                 acc
w/o attention
  Majority Class        60.4
  w/o Context           65.1
  Bi-LSTM               69.5
  NGram model           70.6
  Bi-CNN                74.4
with attention
  Enhanced LSTM         70.6
  Attentive-LSTM        71.5
  Decomp-Att            72.3
  DGEM                  77.3
  APCNN                 75.2
  ABCNN                 75.8
ATTCONV-light           78.1
  w/o convolution       75.1
ATTCONV-advanced        79.2

Table 5: ATTCONV vs. baselines on the SCITAIL dataset

4.3 Sentence modeling with a single context: textual entailment

Dataset. SCITAIL (Khot et al., 2018) is a textual entailment benchmark designed specifically for a real-world task: multi-choice question answering. All hypotheses tx were obtained by rephrasing (question, correct answer) pairs into single sentences, and premises ty are relevant web sentences retrieved by an information retrieval (IR) method. The task is then to determine whether a hypothesis is true or not, given a premise as context. All (tx, ty) pairs are annotated via crowdsourcing. Accuracy is reported. Table 1 shows examples and Table 4 gives statistics. By this construction, a substantial performance improvement on SCITAIL is equivalent to better QA performance (Khot et al., 2018). The hypothesis tx is the target sentence, and the premise ty acts as its context.

Baselines. Apart from the common baselines (see Section 4.1), we include systems covered by Khot et al. (2018): (i) Ngram Overlap: an overlap baseline, considering lexical granularities such as unigrams, one-skip bigrams and one-skip trigrams. (ii) Decomposable Attention Model (Decomp-Att) (Parikh et al., 2016): explores attention mechanisms to decompose the task into subtasks that are solved in parallel. (iii) Enhanced LSTM (Chen et al., 2017b): enhances LSTM by taking into account syntax and semantics from parsing information. (iv) DGEM (Khot et al., 2018): a decomposed graph entailment model, the current state of the art.

Table 5 presents results on SCITAIL. (i) Within ATTCONV, "advanced" beats "light" by 1.1%; (ii) "w/o convolution" and attentive pooling (i.e., ABCNN/APCNN) get lower performances by 3%–4%; (iii) more complicated attention mechanisms equipped into LSTMs (e.g., "attentive-LSTM" and "enhanced-LSTM") perform even worse.

Error Analysis. To better understand ATTCONV on SCITAIL, we study some error cases, listed in Table 6.

Language conventions. Pair #1 uses sequential commas (i.e., in "the egg, larva, pupa and adult") or a special symbol sequence (i.e., in "egg −> larva −> pupa −> adult") to form a set or sequence; pair #2 has "A (or B)" to express the equivalence of A and B. This challenge is expected to be handled by deep neural networks with specific training signals.

Knowledge beyond the text ty. In #3, "because smaller amounts of water evaporate in the cool morning" cannot be inferred from the premise ty directly. The main challenge in #4 is to distinguish "weight" from "force", which requires background physical knowledge that is beyond the text presented here and beyond the expressivity of word embeddings.

Complex discourse relation. The premise in #5 has an "or" structure. In #6, the inserted phrase "with about 16,000 species" makes the connection between "nonvascular plants" and "the mosses, liverworts, and hornworts" hard to detect. Both instances require the model to decode the discourse relation.

ATTCONV on SNLI. Table 7 shows the comparison. We observe that: (i) classifying hypotheses without looking at premises, i.e., the "w/o context" baseline, gets a big improvement over the "majority class" baseline. This verifies the strong bias in the hypothesis construction of the SNLI dataset (Gururangan et al., 2018; Poliak et al., 2018). (ii) ATTCONV (advanced) surpasses all "w/o attention" baselines and all "with attention" CNN baselines (i.e., attentive pooling), obtaining a performance (87.8%) that is close to the state of the art (88.7%).

We also report the parameter size on SNLI, as most baseline systems did. Table 7 shows that, in comparison to these baselines, our ATTCONV (light & advanced) has fewer parameters, yet its performance is competitive.


#1 (G/P: 1/0, challenge: language conventions)
  (ty) These insects have 4 life stages, the egg, larva, pupa and adult.
  (tx) The sequence egg −> larva −> pupa −> adult shows the life cycle of some insects.
#2 (G/P: 1/0, challenge: language conventions)
  (ty) . . . the notochord forms the backbone (or vertebral column).
  (tx) Backbone is another name for the vertebral column.
#3 (G/P: 1/0, challenge: beyond text)
  (ty) Water lawns early in the morning . . . prevent evaporation.
  (tx) Watering plants and grass in the early morning is a way to conserve water because smaller amounts of water evaporate in the cool morning.
#4 (G/P: 0/1, challenge: beyond text)
  (ty) . . . the SI unit . . . for force is the Newton (N) and is defined as (kg·m/s^2).
  (tx) Newton (N) is the SI unit for weight.
#5 (G/P: 0/1, challenge: discourse relation)
  (ty) Heterotrophs get energy and carbon from living plants or animals (consumers) or from dead organic matter (decomposers).
  (tx) Mushrooms get their energy from decomposing dead organisms.
#6 (G/P: 1/0, challenge: discourse relation)
  (ty) . . . are a diverse assemblage of three phyla of nonvascular plants, with about 16,000 species, that includes the mosses, liverworts, and hornworts.
  (tx) Moss is best classified as a nonvascular plant.

Table 6: Error cases of ATTCONV in SCITAIL. ". . . ": truncated text. "G/P": gold/predicted label.

Systems                                  #para   acc
w/o attention
  majority class                         0       34.3
  w/o context (i.e., hypothesis only)    270K    68.7
  Bi-LSTM (Bowman et al., 2015)          220K    77.6
  Bi-CNN                                 270K    80.3
  Tree-CNN (Mou et al., 2016)            3.5M    82.1
  NES (Munkhdalai and Yu, 2017)          6.3M    84.8
with attention
  Attentive-LSTM (Rocktäschel)           250K    83.5
  Self-Attentive (Lin et al., 2017)      95M     84.4
  Match-LSTM (Wang & Jiang)              1.9M    86.1
  LSTMN (Cheng et al., 2016)             3.4M    86.3
  Decomp-Att (Parikh)                    580K    86.8
  Enhanced LSTM (Chen et al., 2017b)     7.7M    88.6
  ABCNN (Yin et al., 2016)               834K    83.7
  APCNN (dos Santos et al., 2016)        360K    83.9
ATTCONV – light                          360K    86.3
  w/o convolution                        360K    84.9
ATTCONV – advanced                       900K    87.8
State-of-the-art (Peters et al., 2018)   8M      88.7

Table 7: Performance comparison on SNLI test. Ensemble systems are not included.

Visualization. In Figure 6, we visualize the attention mechanisms explored in attentive convolution (Figure 6(a)) and attentive pooling (Figure 6(b)).

Figure 6(a) explores the visualization of two kinds of features learned by light ATTCONV on the SNLI dataset (mostly short sentences with rich phrase-level reasoning): (i) e_{i,j} in Equation 1 (after softmax), which shows the attention distribution over context ty for the hidden state h^x_i in sentence tx; (ii) h^x_{i,new} in Equation 5 for i = 1, 2, ..., |tx|, which shows the context-aware word features in tx. From these two visualized features, we can tell which parts of the context ty are more important for a word in sentence tx, and a max-pooling over those context-driven word representations selects and forwards dominant (word, leftcontext, rightcontext, attcontext) combinations to the final decision maker.

Figure 6(a) shows the features[3] of sentence tx = "A dog jumping for a Frisbee in the snow." conditioned on the context ty = "An animal is outside in the cold weather, playing with a plastic toy.". Observations: (i) The right figure shows that the attention mechanism successfully aligns some cross-sentence phrases that are informative for the textual entailment problem, such as "dog" to "animal" (i.e., c^x_dog ≈ "animal") and "Frisbee" to "plastic toy" and "playing" (i.e., c^x_Frisbee ≈ "plastic toy" + "playing"); (ii) The left figure shows that a max-pooling over the generated features of filter_1 and filter_2 will focus on the context-aware phrases (A, dog, jumping, c^x_dog) and (a, Frisbee, in, c^x_Frisbee), respectively; the two phrases are crucial to the entailment reasoning for this (ty, tx) pair.

Figure 6(b) shows the phrase-level (i.e., each consecutive trigram) attentions after the convolution operation. As Figure 3 shows, a subsequent pooling step will weight and sum up those phrase-level hidden states as an overall sentence representation. So, even though some phrases such as "in the snow" in tx and "in the cold" in ty show importance in this pair instance, the final sentence representation still (i) lacks fine-grained phrase-to-phrase reasoning, and (ii) underestimates some indicative phrases such as "A dog" in tx and "An animal" in ty.

[3] For simplicity, we show 2 out of 300 ATTCONV filters.

[Figure 6: Attention visualization for Attentive Convolution (top) and Attentive Pooling (bottom) between sentence tx = "A dog jumping for a Frisbee in the snow." and sentence ty = "An animal is outside in the cold weather, playing with a plastic toy.".
(a) Visualization of features generated by ATTCONV's filters on sentences tx and ty. A max-pooling over filter_1 locates the phrase (A, dog, jumping, c^x_dog), and locates the phrase (a, Frisbee, in, c^x_Frisbee) via filter_2. "c^x_dog" (resp. c^x_Frisbee) – the attentive context of "dog" (resp. "Frisbee") in tx – mainly comes from "animal" (resp. "toy" and "playing") in ty.
(b) Attention visualization for attentive pooling (ABCNN). Based on the words in tx and ty, a convolution layer with filter width 3 first outputs hidden states for each sentence; each hidden state then obtains an attention weight for how well it matches all the hidden states in the other sentence; finally all hidden states in each sentence are weighted and summed up as the sentence representation. This visualization shows that the spans "dog jumping for" and "in the snow" in tx and the spans "animal is outside" and "in the cold" in ty are most indicative for the entailment reasoning.]

Briefly, attentive convolution first performs phrase-to-phrase, inter-sentence reasoning, then composes features; attentive pooling composes phrase features into a sentence representation, then performs reasoning. Intuitively, attentive convolution better fits the way humans conduct entailment reasoning, and our experiments validate its superiority – it is the hidden states of the aligned phrases rather than their matching scores that support better representation learning and decision making.

The comparisons on both SCITAIL and SNLI show that:

• CNNs with attentive convolution (i.e., ATTCONV) outperform CNNs with attentive pooling (i.e., ABCNN and APCNN);

• Some competitors got over-tuned on SNLI while demonstrating mediocre performance on SCITAIL – a real-world NLP task. Our system ATTCONV shows its robustness on both benchmark datasets.


        #SUPPORTED  #REFUTED  #NEI
train   80,035      29,775    35,639
dev     3,333       3,333     3,333
test    3,333       3,333     3,333

Table 8: Statistics of claims in the FEVER dataset

[Figure 7: Distribution of #sentences per claim in FEVER evidence. About 71.88% of claims have a single evidence sentence and 12.13% have two; the remaining mass spreads over larger evidence sets (up to >10 sentences).]

4.4 Sentence modeling with multiple contexts: claim verification

Dataset. For this task, we use FEVER (Thorne et al., 2018); it requires inferring the truthfulness of claims from extracted evidence. The claims in FEVER were manually constructed from the introductory sections of about 50K popular Wikipedia articles in the June 2017 dump. Claims have 9.4 tokens on average. Table 8 lists the claim statistics.

In addition to claims, FEVER also provides a Wikipedia corpus of approximately 5.4 million articles, from which gold evidence is gathered and provided. Figure 7 shows the distribution of evidence sizes in FEVER's ground truth evidence set (i.e., the context size in our experimental setup). We can see that roughly 28% of evidence instances cover more than one sentence and roughly 16% cover more than two sentences.

Each claim is labeled as SUPPORTED, REFUTED or NOTENOUGHINFO (NEI) given the gold evidence. The standard FEVER task also evaluates the performance of evidence extraction, measured by F1 between extracted evidence and gold evidence. This work focuses on the claim entailment part, assuming the evidence is provided (extracted or gold). More specifically, we treat a claim as tx and its evidence sentences as context ty.

                                   retrieved evi.      gold
system                             ALL      SUB        evi.
dev
  MLP                              41.86    19.04      65.13
  Bi-CNN                           47.82    26.99      75.02
  APCNN                            50.75    30.24      78.91
  ABCNN                            51.39    32.44      77.13
  Attentive-LSTM                   52.47    33.19      78.44
  Decomp-Att                       52.09    32.57      80.82
  ATTCONV light, context-wise      57.78    34.29      83.20
    w/o conv.                      47.29    25.94      73.18
  ATTCONV light, context-conc      59.31    37.75      84.74
    w/o conv.                      48.02    26.67      73.44
  ATTCONV advan., context-wise     60.20    37.94      84.99
  ATTCONV advan., context-conc     62.26    39.44      86.02
test
  (Thorne et al., 2018)            50.91    31.87      –
  ATTCONV                          61.03    38.77      84.61

Table 9: Performance on dev and test of FEVER. In the "gold evi." scenario, ALL and SUBSET are the same.

This task has two evaluations: (i) ALL – accuracy of claim verification regardless of the validity of the evidence; (ii) SUBSET – verification accuracy on the subset of claims for which the gold evidence for SUPPORTED and REFUTED claims is fully retrieved. We use the official evaluation toolkit.[4]

Setups. (i) We adopt the same retrieved evidence set (i.e., contexts ty) as Thorne et al. (2018): the top-5 most relevant sentences from the top-5 wiki pages returned by a document retriever (Chen et al., 2017a). The quality of this evidence set against the ground truth is: 44.22 (recall), 10.44 (precision), 16.89 (F1) on dev, and 45.89 (recall), 10.79 (precision), 17.47 (F1) on test. This setup challenges our system with potentially unrelated or even misleading context. (ii) We use the ground truth evidence as context. This lets us determine how far our ATTCONV can go on this claim verification problem once accurate evidence is given.

Baselines. We first include the two systems explored by Thorne et al. (2018): (i) MLP: a multi-layer perceptron baseline with a single hidden layer, based on tf-idf cosine similarity between the claim and the evidence (Riedel et al., 2017); (ii) Decomp-Att (Parikh et al., 2016): a decomposable attention model that was tested on SCITAIL and SNLI above. Note that both baselines first relied on an IR system to extract the top-5 relevant sentences from the retrieved top-5 wiki pages as evidence for claims, then concatenated all evidence sentences as a longer context for a claim.

[4] https://github.com/sheffieldnlp/fever-scorer

[Figure 8: Fine-grained ATTCONV accuracy (%) given variable-size gold FEVER evidence as the claim's context (gold #context sentences per claim from 1 to >5).]

We then consider two variants of our ATTCONV for modeling tx with variable-size context ty. (i) Context-wise: we first use the evidence sentences one by one as context ty to guide the representation learning of the claim tx, generating a group of context-aware representation vectors for the claim; we then do element-wise max-pooling over this vector group to obtain the final representation of the claim. (ii) Context-conc: we concatenate all evidence sentences into a single piece of context, then model the claim based on this context. This is the same preprocessing step as in Thorne et al. (2018).
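The two variants can be sketched as follows; encode() is a hypothetical stand-in for the full ATTCONV encoder, used only to show how the evidence sentences are combined, and all tensors are randomly generated placeholders.

```python
import torch

d = 300
def encode(claim_states, context_states):
    # stand-in for the ATTCONV encoder: returns one d-dimensional claim vector
    A = torch.softmax(claim_states @ context_states.t(), dim=1)
    return torch.tanh(A @ context_states + claim_states).max(dim=0).values

claim = torch.randn(9, d)                                   # ~9 tokens per FEVER claim
evidence = [torch.randn(n, d) for n in (12, 20, 7)]         # three retrieved sentences

# (i) context-wise: encode the claim against each evidence sentence separately,
# then element-wise max-pool over the resulting vectors
context_wise = torch.stack([encode(claim, e) for e in evidence]).max(dim=0).values

# (ii) context-conc: concatenate all evidence sentences into one long context
context_conc = encode(claim, torch.cat(evidence, dim=0))

print(context_wise.shape, context_conc.shape)               # both torch.Size([300])
```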

Results. Table 9 compares our ATTCONV in different setups against the baselines. First, ATTCONV surpasses the top competitor "Decomp-Att", reported in Thorne et al. (2018), by big margins on dev (ALL: 62.26 vs. 52.09) and test (ALL: 61.03 vs. 50.91). In addition, "advanced ATTCONV" consistently outperforms its "light" counterpart. Moreover, ATTCONV surpasses attentive pooling (i.e., ABCNN & APCNN) and "attentive-LSTM" by >10% in ALL, >6% in SUB and >8% in "gold evi.".

Figure 8 further explores the fine-grained performance of ATTCONV for different sizes of gold evidence (i.e., different sizes of context ty). The system shows comparable performance for sizes 1 and 2. Even for context sizes bigger than 5, it only drops by about 5%.

The above experiments on claim verification clearly show the effectiveness of ATTCONV in sentence modeling with variable-size context. This should be attributed to the attention mechanism in ATTCONV, which enables a word or a phrase in the claim tx to "see" and accumulate all related clues, even if those clues are scattered across multiple contexts ty.

Error Analysis. We perform error analysis for the "retrieved evidence" scenario.

Error case #1 is due to the failure to fully retrieve all evidence. For example, successfully supporting the claim "Weekly Idol has a host born in the year 1978." requires composing information from three evidence sentences, two from the wiki article "Weekly Idol" and one from "Jeong Hyeong-don". However, only one of them is retrieved among the top-5 candidates. Our system predicts REFUTED. This error is more common in instances for which no evidence is retrieved.

Error case #2 is due to the insufficiency of representation learning. Consider the wrong claim "Corsica belongs to Italy" (i.e., in the REFUTED class). Even though good evidence is retrieved, the system is misled by noisy evidence: "It is located . . . west of the Italian Peninsula, with the nearest land mass being the Italian island . . . ".

Error case #3 is due to the lack of more advanced data preprocessing. For a human, it is very easy to refute the claim "Telemundo is a English-language television network" with the evidence "Telemundo is an American Spanish-language terrestrial television ..." (from the "Telemundo" wiki page) by comparing the key phrases "Spanish-language" and "English-language". Unfortunately, both tokens are unknown words in our system; as a result, they do not have informative embeddings. More careful data preprocessing is expected to help.

5 Summary

We presented ATTCONV, the first work that enables CNNs to acquire the attention mechanism commonly employed in RNNs. ATTCONV combines the strengths of CNNs with the strengths of the RNN attention mechanism. On the one hand, it makes broad and rich context available for prediction, either context from external inputs (extra-context) or from internal inputs (intra-context). On the other hand, it takes full advantage of the strengths of convolution: it is more order-sensitive than attention in RNNs, and local-context information can be powerfully and efficiently modeled through convolution filters. Our experiments demonstrate the effectiveness and flexibility of ATTCONV for modeling sentences with variable-size context.

Acknowledgments

We gratefully acknowledge funding for this work by the European Research Council (ERC #740516). We would like to thank the anonymous reviewers for their helpful comments.

References

Heike Adel and Hinrich Schütze. 2017. Exploring different dimensions of attention for uncertainty detection. In Proceedings of EACL, pages 22–34.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proceedings of ICLR.

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of EMNLP, pages 632–642.

Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. 2017a. Reading Wikipedia to answer open-domain questions. In Proceedings of ACL, pages 1870–1879.

Qian Chen, Xiaodan Zhu, Zhen-Hua Ling, Si Wei, Hui Jiang, and Diana Inkpen. 2017b. Enhanced LSTM for natural language inference. In Proceedings of ACL, pages 1657–1668.

Jianpeng Cheng, Li Dong, and Mirella Lapata. 2016. Long short-term memory-networks for machine reading. In Proceedings of EMNLP, pages 551–561.

Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel P. Kuksa. 2011. Natural language processing (almost) from scratch. JMLR, 12:2493–2537.

Ido Dagan, Dan Roth, Mark Sammons, and Fabio Massimo Zanzotto. 2013. Recognizing Textual Entailment: Models and Applications. Synthesis Lectures on Human Language Technologies. Morgan & Claypool.

John Duchi, Elad Hazan, and Yoram Singer. 2011. Adaptive subgradient methods for online learning and stochastic optimization. JMLR, 12:2121–2159.

Jeffrey L. Elman. 1990. Finding structure in time. Cognitive Science, 14(2):179–211.

Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. 2017. Convolutional sequence to sequence learning. In Proceedings of ICML, pages 1243–1252.

Alex Graves. 2013. Generating sequences with recurrent neural networks. CoRR, abs/1308.0850.

Alex Graves, Greg Wayne, and Ivo Danihelka. 2014. Neural Turing machines. CoRR, abs/1410.5401.

Suchin Gururangan, Swabha Swayamdipta, Omer Levy, Roy Schwartz, Samuel R. Bowman, and Noah A. Smith. 2018. Annotation artifacts in natural language inference data. In Proceedings of NAACL-HLT, pages 107–112.

Karl Moritz Hermann, Tomás Kociský, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. In Proceedings of NIPS, pages 1693–1701.

Nal Kalchbrenner, Edward Grefenstette, and Phil Blunsom. 2014. A convolutional neural network for modelling sentences. In Proceedings of ACL, pages 655–665.

Tushar Khot, Ashish Sabharwal, and Peter Clark. 2018. SciTaiL: A textual entailment dataset from science question answering. In Proceedings of AAAI, pages 5189–5197.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. In Proceedings of EMNLP, pages 1746–1751.

Yoon Kim, Carl Denton, Luong Hoang, and Alexander M. Rush. 2017. Structured attention networks. In Proceedings of ICLR.


Ankit Kumar, Ozan Irsoy, Peter Ondruska, Mohit Iyyer, James Bradbury, Ishaan Gulrajani, Victor Zhong, Romain Paulus, and Richard Socher. 2016. Ask me anything: Dynamic memory networks for natural language processing. In Proceedings of ICML, pages 1378–1387.

Quoc Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. In Proceedings of ICML, pages 1188–1196.

Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. 1998. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324.

Jiwei Li, Minh-Thang Luong, and Dan Jurafsky. 2015. A hierarchical neural autoencoder for paragraphs and documents. In Proceedings of ACL, pages 1106–1115.

Jindrich Libovický and Jindrich Helcl. 2017. Attention strategies for multi-source sequence-to-sequence learning. In Proceedings of ACL, pages 196–202.

Zhouhan Lin, Minwei Feng, Cícero Nogueira dos Santos, Mo Yu, Bing Xiang, Bowen Zhou, and Yoshua Bengio. 2017. A structured self-attentive sentence embedding. In Proceedings of ICLR.

Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. In Proceedings of EMNLP, pages 1412–1421.

Yishu Miao, Lei Yu, and Phil Blunsom. 2016. Neural variational inference for text processing. In Proceedings of ICML, pages 1727–1736.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. 2013. Distributed representations of words and phrases and their compositionality. In Proceedings of NIPS, pages 3111–3119.

Lili Mou, Rui Men, Ge Li, Yan Xu, Lu Zhang, Rui Yan, and Zhi Jin. 2016. Natural language inference by tree-based convolution and heuristic matching. In Proceedings of ACL, pages 130–136.

Tsendsuren Munkhdalai and Hong Yu. 2017. Neural semantic encoders. In Proceedings of EACL, pages 397–407.

Ramesh Nallapati, Bowen Zhou, Cícero Nogueira dos Santos, Çaglar Gülçehre, and Bing Xiang. 2016. Abstractive text summarization using sequence-to-sequence RNNs and beyond. In Proceedings of CoNLL, pages 280–290.

Ankur P. Parikh, Oscar Täckström, Dipanjan Das, and Jakob Uszkoreit. 2016. A decomposable attention model for natural language inference. In Proceedings of EMNLP, pages 2249–2255.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of EMNLP, pages 1532–1543.

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of NAACL-HLT, pages 2227–2237.

Adam Poliak, Jason Naradowsky, Aparajita Haldar, Rachel Rudinger, and Benjamin Van Durme. 2018. Hypothesis only baselines in natural language inference. In Proceedings of *SEM, pages 180–191.

Benjamin Riedel, Isabelle Augenstein, Georgios P. Spithourakis, and Sebastian Riedel. 2017. A simple but tough-to-beat baseline for the fake news challenge stance detection task. CoRR, abs/1707.03264.

Tim Rocktäschel, Edward Grefenstette, Karl Moritz Hermann, Tomáš Kocisky, and Phil Blunsom. 2016. Reasoning about entailment with neural attention. In Proceedings of ICLR.

Cícero Nogueira dos Santos, Ming Tan, Bing Xiang, and Bowen Zhou. 2016. Attentive pooling networks. CoRR, abs/1602.03609.

Min Joon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. 2017. Bidirectional attention flow for machine comprehension. In Proceedings of ICLR.

Lifeng Shang, Zhengdong Lu, and Hang Li. 2015. Neural responding machine for short-text conversation. In Proceedings of ACL, pages 1577–1586.


Rupesh Kumar Srivastava, Klaus Greff, and Jürgen Schmidhuber. 2015. Training very deep networks. In Proceedings of NIPS, pages 2377–2385.

James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. 2018. FEVER: A large-scale dataset for fact extraction and verification. In Proceedings of NAACL-HLT, pages 809–819.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of NIPS, pages 6000–6010.

Shuohang Wang and Jing Jiang. 2016. Learning natural language inference with LSTM. In Proceedings of NAACL, pages 1442–1451.

Shuohang Wang and Jing Jiang. 2017. Machine comprehension using match-LSTM and answer pointer. In Proceedings of ICLR.

Wenhui Wang, Nan Yang, Furu Wei, Baobao Chang, and Ming Zhou. 2017a. Gated self-matching networks for reading comprehension and question answering. In Proceedings of ACL, pages 189–198.

Zhiguo Wang, Wael Hamza, and Radu Florian. 2017b. Bilateral multi-perspective matching for natural language sentences. In Proceedings of IJCAI, pages 4144–4150.

Caiming Xiong, Stephen Merity, and Richard Socher. 2016. Dynamic memory networks for visual and textual question answering. In Proceedings of ICML, pages 2397–2406.

Caiming Xiong, Victor Zhong, and Richard Socher. 2017. Dynamic coattention networks for question answering. In Proceedings of ICLR.

Wenpeng Yin, Hinrich Schütze, Bing Xiang, and Bowen Zhou. 2016. ABCNN: Attention-based convolutional neural network for modeling sentence pairs. TACL, 4:259–272.