
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, pages 2877–2887, Hong Kong, China, November 3–7, 2019. ©2019 Association for Computational Linguistics


Investigating BERT's Knowledge of Language: Five Analysis Methods with NPIs

Alex Warstadt†,1,2, Yu Cao†,3, Ioana Grosu†,2, Wei Peng†,3, Hagen Blix†,1, Yining Nie†,1,2, Anna Alsop†,2, Shikha Bordia†,3, Haokun Liu†,3, Alicia Parrish†,2,3, Sheng-Fu Wang†,3, Jason Phang†,1,3, Anhad Mohananey†,1,3, Phu Mon Htut†,3, Paloma Jeretič†,1,2, and Samuel R. Bowman

New York University

†Equal contribution with roles given below; order assigned randomly. Correspondence: [email protected]. 1 Framing and organizing the paper. 2 Creating diagnostic data. 3 Constructing and running experiments.

    Abstract

Though state-of-the-art sentence representation models can perform tasks requiring significant knowledge of grammar, it is an open question how best to evaluate their grammatical knowledge. We explore five experimental methods inspired by prior work evaluating pretrained sentence representation models. We use a single linguistic phenomenon, negative polarity item (NPI) licensing in English, as a case study for our experiments. NPIs like any are grammatical only if they appear in a licensing environment like negation (Sue doesn't have any cats vs. *Sue has any cats). This phenomenon is challenging because of the variety of NPI licensing environments that exist. We introduce an artificially generated dataset that manipulates key features of NPI licensing for the experiments. We find that BERT has significant knowledge of these features, but its success varies widely across different experimental methods. We conclude that a variety of methods is necessary to reveal all relevant aspects of a model's grammatical knowledge in a given domain.

    1 Introduction

Recent sentence representation models have attained state-of-the-art results on language understanding tasks, but standard methodology for evaluating their knowledge of grammar has been slower to emerge. Recent work evaluating grammatical knowledge of sentence encoders like BERT (Devlin et al., 2018) has employed a variety of methods. For example, Shi et al. (2016), Ettinger et al. (2016), and Tenney et al. (2019) use probing tasks to target a model's knowledge of particular grammatical features. Marvin and Linzen (2018) and Wilcox et al. (2019) compare language models' probabilities for pairs of minimally different sentences differing in grammatical acceptability. Linzen et al. (2016), Warstadt et al. (2018), and Kann et al. (2019) use Boolean acceptability judgments inspired by methodologies in generative linguistics. However, we have not yet seen any substantial direct comparison between these methods, and it is not yet clear whether they tend to yield similar conclusions about what a given model knows.

We aim to better understand the trade-offs in task choice by comparing different methods inspired by previous work to evaluate sentence understanding models in a single empirical domain. We choose as our case study negative polarity item (NPI) licensing in English, an empirically rich phenomenon widely discussed in the theoretical linguistics literature (Kadmon and Landman, 1993; Giannakidou, 1998; Chierchia, 2013, a.o.). NPIs are words or expressions that can only appear in environments that are, in some sense, negative (Fauconnier, 1975; Ladusaw, 1979; Linebarger, 1980). For example, any is an NPI because it is acceptable in negative sentences (1) but not positive sentences (2); negation thus serves as an NPI licensor. NPIs furthermore cannot be outside the syntactic scope of a licensor (3). Intuitively, a licensor's scope is the syntactic domain in which an NPI is licensed, and it varies from licensor to licensor. A sentence with an NPI present is only acceptable in cases where (i) there is a licensor, as in (1) but not (2), and (ii) the NPI is within the scope of that licensor, as in (1) but not (3).

(1) Mary hasn't eaten any cookies.
(2) *Mary has eaten any cookies.
(3) *Any cookies haven't been eaten.

We compare five experimental methods to test BERT's knowledge of NPI licensing. We consider: (i) a Boolean acceptability classification task to test BERT's knowledge of sentences in isolation, (ii) an absolute minimal pair task evaluating whether the absolute Boolean outputs of acceptability classifiers distinguish between pairs of minimally different sentences that differ in acceptability and each isolate a single key property of NPI licensing, (iii) a gradient minimal pair task evaluating whether the gradient outputs of acceptability classifiers distinguish between minimal pairs, (iv) a cloze test evaluating the grammatical preferences of BERT's masked language modeling head, and (v) a probing task directly evaluating BERT's representations for knowledge of specific grammatical features relevant to NPI licensing.

We find that BERT does have knowledge of all the key features necessary to predict the acceptability of NPI sentences in our experiments. However, our five methods give meaningfully different results. While the gradient minimal pair experiment and, to a lesser extent, the acceptability classification and cloze tests indicate that BERT has systematic knowledge of all NPI licensing environments and relevant grammatical features, the absolute minimal pair and probing experiments show that BERT's knowledge is in fact not equal across these domains. We conclude that each method depicts different relevant aspects of a model's grammatical knowledge; comparing both gradient and absolute measures of model performance gives a more complete picture. We recommend that future studies use multiple complementary methods to evaluate model performance.

    2 Related Work

Evaluating Sentence Encoders  The success of sentence encoders and broader neural network methods in NLP has prompted significant interest in understanding the linguistic knowledge encapsulated in these models.

A portion of related work focuses on Boolean classification tasks of English sentences to evaluate the grammatical knowledge encoded in these models. The objective of this task is to predict whether a single input sentence is acceptable or not, abstracting away from gradience in acceptability judgments. Linzen et al. (2016) train classifiers on this task using data with manipulated verbal inflection to investigate whether LSTMs can identify subject-verb agreement violations, and therefore a (potentially long distance) syntactic dependency. Warstadt et al. (2018) train models on this task using the CoLA corpus of acceptability judgments as a method for evaluating domain-general grammatical knowledge, and Warstadt and Bowman (2019) analyze how these domain-general classifiers perform on specific linguistic phenomena. Kann et al. (2019) use this task to evaluate whether sentence encoders represent information about verbal argument structure classes.

A related method employs minimal pairs, consisting of two sentences that differ minimally in their content and differ in linguistic acceptability, to judge whether a model is sensitive to a single grammatical feature. Marvin and Linzen (2018) and Wilcox et al. (2019) apply this method to phenomena such as subject-verb agreement, NPI licensing, and reflexive licensing.

Another branch of work uses probing tasks in which the objective is to predict the value of a particular linguistic feature given an input sentence. Probing tasks have been used to investigate whether sentence embeddings encode syntactic and surface features such as tense and voice (Shi et al., 2016), sentence length and word content (Adi et al., 2016), or syntactic depth and morphological number (Conneau et al., 2018). Giulianelli et al. (2018) use diagnostic classifiers to track the propagation of information in RNN-based language models. Ettinger et al. (2018) and Dasgupta et al. (2018) use automatic data generation to evaluate compositional reasoning. Tenney et al. (2019) introduce sub-sentence level probing tasks derived from common NLP tasks.

Negative Polarity Items  In the theoretical literature on NPIs, proposals have been made to unify the properties of the diverse NPI licensing environments. For example, a popular view states that NPIs are licensed if and only if they occur in downward entailing (DE) environments (Fauconnier, 1975; Ladusaw, 1979), i.e. an environment that licenses inferences from sets to subsets.1

For instance, (4) shows that the environment under the scope of negation is DE. The property been to Paris is a subset of been to France, because those who have been to Paris are a subset of those who have been to France. While the inference from set to subset is normally invalid (4-b), it is valid if the property is embedded under negation (4-a). According to the DE theory of NPI licensing, this explains the contrast between (1) and (2).

(4) a. I haven't been to France.
       → I haven't been to Paris. (DE)
    b. I have been to France.
       ↛ I have been to Paris. (not DE)

This view does not capture licensing in some environments, for example questions. No theory is yet accepted as identifying the unifying properties of all NPI licensing environments.

1 Other prominent theories of NPI licensing are based on notions of non-veridicality (Giannakidou, 1994, 1998; Zwarts, 1998), domain widening and emphasis (Kadmon and Landman, 1993; Krifka, 1995; Chierchia, 2013), a.o.

Within computational linguistics, NPIs are used as a testing ground for neural network models' grammatical knowledge. Marvin and Linzen (2018) find that LSTM LMs do not systematically prefer sentences with licensed NPIs (1) over sentences with unlicensed NPIs (2). Jumelet and Hupkes (2018) show that LSTM LMs find a relation between the licensing context and the negative polarity item and appear to be aware of the scope of this context. Wilcox et al. (2019) use NPIs and filler-gap dependencies, as instances of non-local grammatical dependencies, to probe the effect of supervision with hierarchical structure. They find that structurally supervised models outperform state-of-the-art sequential LSTM models, showing the importance of structure in learning non-local dependencies like NPI licensing.

CoLA  We use the Corpus of Linguistic Acceptability (CoLA; Warstadt et al., 2018) in our experiments to train supervised acceptability classifiers. CoLA is a dataset of over 10k syntactically diverse example sentences from the linguistics literature with Boolean acceptability labels. As is conventional in theoretical linguistics, sentences are taken to be acceptable if native speakers judge them to be possible sentences in their language. Such sentences are widely used in linguistics publications to illustrate phenomena of interest. The examples in CoLA are gathered from diverse sources and represent a wide array of syntactic, semantic, and morphological phenomena. As measured by the GLUE benchmark (Wang et al., 2018), acceptability classifiers trained on top of BERT and related models reach near-human performance on CoLA.

    3 Methods

We experiment with five approaches to the evaluation of grammatical knowledge of sentence representation models like BERT using our generated NPI acceptability judgment dataset (§4). Each data sample in the dataset contains a sentence, a Boolean label which indicates whether the sentence is grammatically acceptable or not, and three Boolean meta-data variables (licensor, NPI, and scope; Table 2). We evaluate four model types: BERT-large, BERT with fine-tuning on one of two tasks, and a simple bag-of-words baseline using GloVe word embeddings (Pennington et al., 2014).

Boolean Acceptability  We test the model's ability to judge the grammatical acceptability of the sentences in the NPI dataset. Following standards in linguistics, sentences for this task are assumed to be either totally acceptable or totally unacceptable. We fine-tune the sentence representation models to perform these Boolean judgments. For BERT-based sentence representation models, we add a classifier on top of the [CLS] embedding of the last layer. For BoW, we use a max pooling layer followed by an MLP classifier. The performance of the models is measured as the Matthews Correlation Coefficient (MCC; Matthews, 1975)2 between the predicted label and the gold label.

2 MCC gives the correlation of two Boolean distributions between -1 and 1. A score of 0 is given to any two unrelated distributions, without requiring balanced class sizes.
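As a quick illustration of the metric (not part of the paper's pipeline), the snippet below computes MCC for a toy set of Boolean acceptability predictions with scikit-learn; the labels are invented.

```python
# Toy illustration of MCC for Boolean acceptability predictions.
from sklearn.metrics import matthews_corrcoef

gold = [1, 1, 0, 0, 1, 0, 1, 0]  # 1 = acceptable, 0 = unacceptable
pred = [1, 0, 0, 0, 1, 0, 1, 1]  # classifier outputs

# 1.0 = perfect correlation, 0.0 = chance level, -1.0 = perfectly anti-correlated.
print(matthews_corrcoef(gold, pred))
```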

Absolute Minimal Pair  We conduct a minimal pair experiment to analyze Boolean acceptability classifiers on minimally different sentences. Two sentences within a paradigm form a minimal pair if they differ in one feature (licensor, NPI, or scope) but have different acceptability. For example, the sentences in (1) and (2) differ in whether an NPI licensor (negation) is present. We evaluate the models trained on acceptability judgments with the minimal pairs. In absolute minimal pair evaluation, the model needs to correctly classify both sentences in the pair to be counted as correct.

Gradient Minimal Pair  The gradient minimal pair evaluation is a more lenient version of absolute minimal pair evaluation: Here, we count a pair as correctly classified as long as the Boolean classifier's output for the acceptable class is higher for the acceptable sentence than for the unacceptable sentence. In other words, the classifier need only predict that the acceptable sentence has the higher likelihood of being acceptable, but need not correctly predict that it is acceptable.
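To make the two scoring rules concrete, here is a minimal sketch; `p_acceptable` is a stand-in for the fine-tuned classifier's probability of the acceptable class and is not a function from the paper's codebase.

```python
# Sketch contrasting absolute and gradient minimal-pair scoring.
# `p_acceptable(sentence)` is assumed to return P(acceptable) from a classifier.

def absolute_correct(p_acceptable, good_sent, bad_sent, threshold=0.5):
    # Both members of the pair must receive the correct Boolean label.
    return p_acceptable(good_sent) >= threshold and p_acceptable(bad_sent) < threshold

def gradient_correct(p_acceptable, good_sent, bad_sent):
    # The acceptable sentence only needs the higher acceptable-class score.
    return p_acceptable(good_sent) > p_acceptable(bad_sent)
```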

Cloze Test  In the cloze test, a standard sentence-completion task, we use the masked language modeling (MLM) component in BERT (Devlin et al., 2018) and evaluate whether it assigns a higher probability to the acceptable sentence in a minimal pair, following Linzen et al. (2016). An MLM predicts the probability of a single masked token based on the rest of the sentence. The minimal pairs tested are a subset of those in the absolute and gradient minimal pair experiments, where both sentences must be equal in length and differ in only one token. This differing token is replaced with [MASK], and the minimal pair is taken to be classified correctly if the MLM assigns a higher probability to the token from the acceptable sentence. In contrast with the other minimal pair experiments, this experiment is entirely unsupervised, using BERT's native MLM functionality.
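The comparison can be reproduced roughly as below using the current Hugging Face transformers API (the experiments themselves used the older pytorch-pretrained-BERT package cited in §5); the minimal pair is taken from the Questions paradigm in Table 2.

```python
# Sketch of the cloze comparison on one minimal pair differing only in ever/often.
import torch
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-large-cased")
model = BertForMaskedLM.from_pretrained("bert-large-cased").eval()

# Licensor absent, so "often" (acceptable) should beat the NPI "ever" (unacceptable).
masked = "Those boys say that the doctors [MASK] went to an art gallery."
good_token, bad_token = "often", "ever"

inputs = tokenizer(masked, return_tensors="pt")
mask_index = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero().item()

with torch.no_grad():
    logits = model(**inputs).logits[0, mask_index]

good_id = tokenizer.convert_tokens_to_ids(good_token)
bad_id = tokenizer.convert_tokens_to_ids(bad_token)
# The pair counts as correct if the MLM prefers the acceptable completion.
print("correct" if logits[good_id] > logits[bad_id] else "incorrect")
```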

Feature Probing  We use probing classifiers as a more fine-grained approach to the identification of grammatical variables. We freeze the sentence encoders both with and without fine-tuning from the acceptability judgment experiments and train lightweight classifiers on top of them to predict meta-data labels corresponding to the key properties a model must learn in order to judge the acceptability of NPI sentences. These properties are whether a licensor is present, whether an NPI is present, and whether the NPI (or a similar item) is in the syntactic scope of the licensor (or a similar item). Crucially, each individual meta-data label by itself does not decide acceptability (i.e., these probing experiments test a different but related set of knowledge from the acceptability experiments).
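A minimal sketch of such a probe follows, assuming a frozen `encode(sentence)` function that returns a fixed-size sentence embedding (e.g., the [CLS] vector); the logistic-regression probe here is a stand-in for the paper's lightweight classifiers.

```python
# Sketch of a probing classifier over frozen sentence embeddings.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import matthews_corrcoef

def probe(encode, train_sents, train_labels, test_sents, test_labels):
    # Only the probe is trained; the encoder itself stays frozen.
    X_train = np.stack([encode(s) for s in train_sents])
    X_test = np.stack([encode(s) for s in test_sents])
    clf = LogisticRegression(max_iter=1000).fit(X_train, train_labels)
    return matthews_corrcoef(test_labels, clf.predict(X_test))

# Run once per meta-data label: licensor presence, NPI presence, and scope.
```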

    4 Data

In order to probe BERT's performance on sentences involving NPIs, we generate a set of sentences and acceptability labels for the experiments in this paper. We use generated data so that we can assess minimal pairs, and so that there are sufficient unacceptable sentences. We release all our data3 and our generation code and vocabulary.4

3 https://alexwarstadt.files.wordpress.com/2019/08/npi_lincensing_data.zip (Clicking on the link will cause a file to download.)

4 https://github.com/alexwarstadt/data_generation

Licensing Features  We create a controlled set of 136,000 English sentences using an automated sentence generation procedure, inspired in large part by previous work by Ettinger et al. (2016, 2018), Marvin and Linzen (2018), Dasgupta et al. (2018), and Kann et al. (2019). The set contains nine NPI licensing environments (Table 1) and two NPIs (any, ever). All but one licensor-NPI pair follows a 2×2×2 paradigm, which manipulates three Boolean NPI licensing features: licensor presence, NPI presence, and the occurrence of an NPI within a licensor's scope. Each 2×2×2 paradigm forms 5 minimal pairs. Table 2 shows an example paradigm.
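The count of five pairs per paradigm can be checked mechanically; the sketch below (ours, not part of the released generation code) enumerates the feature combinations, treating a sentence as acceptable iff it contains no NPI or the NPI is in the scope of a present licensor (cf. Table 2).

```python
# Enumerate the minimal pairs of a 2x2x2 paradigm.
from itertools import product

def acceptable(licensor, npi, scope):
    # Acceptable iff there is no NPI, or the NPI is in the scope of a licensor.
    return (not npi) or (licensor and scope)

conditions = list(product([1, 0], repeat=3))  # (licensor, NPI, scope)
pairs = [
    (good, bad)
    for good in conditions for bad in conditions
    if sum(g != b for g, b in zip(good, bad)) == 1  # differ in exactly one feature
    and acceptable(*good) and not acceptable(*bad)
]
print(len(pairs))  # 5
```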

The Licensor feature indicates whether an NPI licensor is present in the sentence. For many environments, there are multiple lexical items that serve as a licensor (e.g., the adverbs environment contains rarely, hardly, seldom, barely, and scarcely as NPI licensors). When the licensor is not present, we substitute it with a licensor replacement that has a similar syntactic distribution but does not license NPIs, again using multiple appropriate lexical items as replacements. For example, licensor replacements for quantifier licensors like every include quantifiers like some, many, and more than three that do not license NPIs.

The NPI feature indicates whether an NPI is in the sentence or if it is substituted by an NPI replacement with similar structural distribution. For example, NPI replacements for ever include other adverbs such as often, sometimes, and certainly.

The Scope feature indicates whether the NPI/NPI replacement is within the scope of the licensor/licensor replacement. As illustrated earlier in (3), a sentence containing an NPI is only acceptable when the NPI falls within the scope of the licensor. What constitutes the scope of a licensor is highly dependent on the type of licensor.

The exception to the 2×2×2 paradigm is the Simple Questions licensing condition, with a reduced 2×2 paradigm. It lacks a scope manipulation because the question takes scope over the entire clause, and in Simple Questions the clause is the whole sentence. The paradigm for Simple Questions is given in Table 3 in the Appendix; it forms only 2 minimal pairs.

Data Generation  To generate sentences, we create sentence templates for each condition. Templates follow the general structure illustrated in example (5), in which the part-of-speech (auxiliary verb, determiner, noun, verb), as well as the instance number, is specified. For example, N2 is the second instance of a noun in the template. We use these labels here for illustrative purposes; in reality, the templates include more fine-grained properties, such as verb tense and noun number.


Environment          Abbrev.  Example
Adverbs              ADV      The guests who rarely love any actors had left libraries.
Conditionals         COND     If the pedestrian passes any schools, the senator will talk to the adults.
Determiner negation  D-NEG    Just as the waitress said, no customers thought that any dancers bought the dish.
Sentential negation  S-NEG    These drivers have not thought that any customers have lied.
Only                 ONLY     From what the cashier heard, only the children have known any dancers.
Quantifiers          QNT      Every actress who was talking about any high schools criticizes the children.
Questions            QUES     The boys wonder whether the doctors went to any art galleries.
Simple questions     SMP-Q    Has the guy worked with any teenagers?
Superlatives         SUP      The teenagers approach the nicest actress that any customers had criticized.

Table 1: Examples of each of the NPI licensing environments generated. The licensor is in bold, and the NPI (here any) is in italics. All examples show cases where the NPI is present (NPI=1), the licensor is present (Licensor=1), and the NPI is in the scope of the licensor (Scope=1); all are acceptable to native speakers of English.

Licensor  NPI  Scope  Sentence
1         1    1      Those boys wonder whether [the doctors ever went to an art gallery.]
1         1    0      *Those boys ever wonder whether [the doctors went to an art gallery.]
1         0    1      Those boys wonder whether [the doctors often went to an art gallery.]
1         0    0      Those boys often wonder whether [the doctors went to an art gallery.]
0         1    1      *Those boys say that [the doctors ever went to an art gallery.]
0         1    0      *Those boys ever say that [the doctors went to an art gallery.]
0         0    1      Those boys say that [the doctors often went to an art gallery.]
0         0    0      Those boys often say that [the doctors went to an art gallery.]

Table 2: Example 2×2×2 paradigm using the Questions environment. The licensor (whether) or licensor replacement (that) is in bold. The NPI (ever) or NPI replacement (often) is in italics. When Licensor=1, the licensor is present rather than its replacement word. When NPI=1, the NPI is present rather than its replacement. The scope of the licensor/licensor replacement is shown in square brackets (brackets, italicization, and boldface are not present in the actual data). When Scope=1, the NPI/NPI replacement is within the scope of the licensor/licensor replacement. Unacceptable sentences are marked with *. The five minimal pairs are connected by arrows that point from the unacceptable to the acceptable sentence.

Figure 1: Results from the acceptability judgment experiment in MCC. The columns indicate evaluation tests, and the rows fine-tuning settings.

(5) Aux1  D1   N1   V1    any  N2          ?
    Has   the  guy  seen  any  waitresses  ?

Given the specifications encoded in the sentence templates, words were sampled from a vocabulary of over 1000 lexical items annotated with 30 syntactic, morphological, and semantic features. The annotated features allow us to encode selectional requirements of lexical items, e.g., what types of nouns a verb can combine with. This avoids blatantly implausible sentences.
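The sketch below is a toy illustration of this template-filling idea, not the released generation code; the mini-vocabulary and slot names are invented, and the real generator additionally enforces agreement and selectional restrictions.

```python
# Toy template filler in the spirit of example (5).
import random

VOCAB = {
    "Aux1": ["Has"],
    "D1": ["the", "that"],
    "N1": ["guy", "waitress"],
    "V1": ["seen", "criticized"],
    "N2": ["waitresses", "teenagers"],
}

TEMPLATE = ["Aux1", "D1", "N1", "V1", "any", "N2", "?"]

def fill(template, vocab):
    # Sample a word for each slot; literal items like "any" pass through unchanged.
    words = [random.choice(vocab[slot]) if slot in vocab else slot for slot in template]
    return " ".join(words[:-1]) + words[-1]

print(fill(TEMPLATE, VOCAB))  # e.g., "Has the guy seen any waitresses?"
```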

For each environment, the training set contains 10K sentences, and the dev and test sets contain 1K sentences each. Sentences from the same paradigm are always in the same set.

In addition to our data set, we also test BERT on a set of 104 handcrafted sentences from the NPI sub-experiment in Wilcox et al. (2019), who use a paradigm that partially overlaps with ours, but has an additional condition where the NPI linearly follows its licensor while not being in the scope of the licensor. This is included as an additional test set for evaluating acceptability classifiers in §6.


Data validation  We use Amazon Mechanical Turk (MTurk) to validate a subset of our data to assure that the generated sentences contrast in acceptability as intended. We randomly sample 500 sentences in approximately equal amounts from each environment, NPI, and paradigm. Each sentence is rated on a Likert scale of 1-6, with 1 being "the sentence is not possible in English" and 6 being "the sentence is possible in English," by 20 unique participants in the US who self-identified as native English speakers. Participants are compensated $0.25 per HIT and are often able to complete a HIT of 5 sentences in just under 1 minute.

Table 4 in the Appendix shows the participants' scores transformed into a Boolean judgment of 0 (unacceptable, score ≤ 3) or 1 (acceptable, score ≥ 4) and presented as the percentage of 'acceptable' ratings assigned to the sentences in each of the NPI-licensing environments. Across all NPI-licensing environments, 81.3% of the sentences labeled acceptable are assigned an acceptable rating by the MTurk raters, and 85.2% of sentences labeled unacceptable are assigned an unacceptable rating. This gives an overall agreement rating of 82.8%.5 A Wilcoxon signed-rank test (Wilcoxon, 1945) shows that within each environment and for each NPI, the acceptable sentences are more often rated as acceptable by our MTurk validators than the unacceptable sentences (all p-values < 0.001). This contrast holds considering both the raw Likert-scale responses and the responses transformed to a Boolean judgment.
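The analysis can be sketched as follows on invented numbers; the exact pairing of acceptable and unacceptable items for the signed-rank test is our assumption for illustration.

```python
# Toy sketch: binarize Likert ratings and run a Wilcoxon signed-rank test on
# paired mean ratings of acceptable vs. unacceptable sentences.
import numpy as np
from scipy.stats import wilcoxon

acceptable = np.array([5.4, 5.9, 4.8, 5.1, 5.6])     # mean rating per item (invented)
unacceptable = np.array([2.1, 3.3, 1.9, 2.8, 2.4])

as_boolean = lambda scores: (scores >= 4).astype(int)  # score >= 4 counts as acceptable
print(as_boolean(acceptable), as_boolean(unacceptable))

stat, p = wilcoxon(acceptable, unacceptable)
print(f"Wilcoxon statistic = {stat:.1f}, p = {p:.4f}")
```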

    5 Experimental Settings

We conduct our experiments with the jiant 0.9 multitask learning and transfer learning toolkit6 (Wang et al., 2019), the AllenNLP platform (Gardner et al., 2018), and the BERT implementation from HuggingFace.7

5 We observe that sentences with 'any' get over-accepted. Overall agreement for sentences with 'ever' is 87.3%, while agreement for those with 'any' is 78.3%. We believe this is due to a free-choice interpretation of 'any' occurring more easily than is often reported in the semantics literature.

6 https://github.com/nyu-mll/jiant/tree/blimp-and-npi/scripts/bert_npi

7 https://github.com/huggingface/pytorch-pretrained-BERT

Models  We study the following sentence understanding models: (i) GloVe BoW: a bag-of-words baseline obtained by max-pooling of 300-dimensional GloVe word embeddings trained on 840B tokens (Pennington et al., 2014), and (ii) BERT (Devlin et al., 2018): we use the cased version of the BERT-large model, which works the best for our tasks in pilot experiments. In addition, since recent work (Liu et al., 2019; Stickland and Murray, 2019) has shown that intermediate training on related tasks can meaningfully impact BERT's performance on downstream tasks, we also explore two additional BERT-based models: (iii) BERT→MNLI: BERT fine-tuned on the Multi-Genre Natural Language Inference corpus (Williams et al., 2018), motivated both by prior work on pretraining sentence encoders on MNLI (Conneau et al., 2017) as well as work showing significant improvements to BERT on downstream semantic tasks (Phang et al., 2018; Bowman et al., 2018), and (iv) BERT→CCG: BERT fine-tuned on the Combinatory Categorial Grammar Bank corpus (Hockenmaier and Steedman, 2007), motivated by Wilcox et al.'s (2019) finding that structural supervision may improve an LSTM-based sentence encoder's knowledge of non-local syntactic dependencies.
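A rough sketch of the BoW sentence encoder follows; `glove` is assumed to be a dict from word to 300-dimensional numpy vector loaded from the 840B-token GloVe release, and the tokenization and out-of-vocabulary handling here are simplifications.

```python
# Sketch of the GloVe bag-of-words baseline: element-wise max-pooling over word vectors.
import numpy as np

def bow_encode(sentence, glove, dim=300):
    vectors = [glove[w] for w in sentence.lower().split() if w in glove]
    if not vectors:
        return np.zeros(dim)
    return np.max(np.stack(vectors), axis=0)  # element-wise max over tokens

# The resulting fixed-size vector is fed to an MLP classifier (see Section 3).
```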

Training-Evaluation Configurations  We are interested in whether sentence representation models learn NPI licensing as a unified property. Can the models generalize from trained environments to previously unseen environments? To answer these questions, for each NPI environment, we extensively test the performance of the models in the following configurations: (i) CoLA: training on CoLA, evaluating on the environment. (ii) 1 NPI: training and evaluating on the same NPI environment. (iii) Avg Other NPI: training independently on every NPI environment except one, averaged over the evaluation results on that environment. (iv) All-but-1 NPI: training on all environments except for one environment, evaluating on that environment. (v) All NPI: training on all environments, evaluating on the environment.

    6 Results

Acceptability Judgments  The results in Fig. 1 show that BERT outperforms the BoW baseline on all test data with all fine-tuning settings. Within each BERT variant, MCC reaches 1.0 on all test data in the 1 NPI setting. In the All-but-1 NPI training-evaluation configuration, the performance on all NPI environments for BERT drops. While the MCC value on environments like conditionals and sentential negation remains above 0.9, on the simple question environment it drops to 0.58.


Figure 2: Results from the minimal pair test. The top section shows the average accuracy for detecting the presence of the NPI, the middle section shows average accuracy for detecting the presence of the licensor, and the bottom shows average accuracy of minimal pair contrasts that differ in whether the NPI is in the scope of the licensor. Within each section, we show performance of GloVe BoW and BERT models under both absolute preference and gradient preference evaluation methods. The rows represent the training-evaluation configuration, while the columns represent different licensing environments.

Compared with NPI data fine-tuning, CoLA fine-tuning results in lower performance for BERT on most of the NPI environments but better performance on the data from Wilcox et al. (2019).

In comparing the three BERT variants (see full results in Figure 5 in the Appendix), the Avg Other NPI setting shows that on 7 out of 9 NPI environments, plain BERT outperforms BERT→MNLI and BERT→CCG. Even in the remaining two environments, plain BERT yields about as good performance as BERT→MNLI and BERT→CCG, indicating that MNLI and CCG fine-tuning brings no obvious gain to acceptability judgments.

Absolute and Gradient Minimal Pairs  The results (Fig. 2) show that models' performance hinges on how minimal pairs differ. When tested on minimal pairs differing by the presence of an NPI, BoW and plain BERT obtain (nearly) perfect accuracy on both absolute and gradient measures across all settings. For minimal pairs differing by licensor and scope, BERT again achieves near perfect performance on the gradient measure, while BoW does not. On the absolute measure, both BERT and BoW perform worse. Overall, this shows that absolute judgment is more challenging when targeting the licensor, which involves a larger pool of lexical items and syntactic configurations than NPIs, and the scope, which requires nontrivial syntactic knowledge about NPI licensing.

As in the acceptability experiment, we find that intermediate fine-tuning on MNLI and CCG does not improve performance (see full results in Figures 6-8 in the Appendix).

Figure 3: Results of BERT MLM performance in the cloze test. The top section shows the average accuracy for detecting the presence of the NPI; the bottom section shows the accuracy for detecting the presence of the licensor. The columns represent different licensing environments.

Cloze Test  The results (Fig. 3) show that even without supervision on NPI data, the BERT MLM can distinguish between acceptable and unacceptable sentences in the NPI domain. Performance is highly dependent on the NPI-licensing environment and type of minimal pair. Accuracy for detecting NPI presence falls between 0.76 and 0.93 for all environments. Accuracy for detecting licensor presence is much more variable, with the BERT MLM achieving especially high performance in the conditional, sentential negation, and only environments, and low performance in the quantifier and superlative environments.

Feature Probing  Results (Fig. 4) show that plain BERT outperforms the BoW baseline in detecting whether the NPI is in the scope of the licensor (henceforth 'scope detection'). As expected, BoW is nearly perfect in detecting the presence of the NPI and licensor, as these tasks do not require knowledge of syntax or word order. Consistent with results from previous experiments, detecting the presence of a licensor is slightly more challenging for models fine-tuned with CoLA or NPI data. However, the overall lower performance in scope detection compared with detecting the presence of the licensor is not found in the minimal-pair experiments.

CoLA fine-tuning improves the performance for BERT, especially for detecting NPI presence. Fine-tuning on NPI data improves scope detection. Inspection of environment-specific results shows that models struggle when the superlative, quantifiers, and adverb environments are the held-out test sets in the All-but-1 NPI fine-tuning setting.

Unlike in the other experiments, BERT and BERT→MNLI have comparable performance across many settings and tasks, beating BERT→CCG especially in scope detection (see full results in Figure 9 in the Appendix).

    7 Discussion

We find that BERT systematically represents all features relevant to NPI licensing across most individual licensing environments, according to certain evaluation methods. However, these results vary widely across the different methods we compare. In particular, BERT performs nearly perfectly on the gradient minimal pairs task across all minimal pair configurations and nearly all licensing environments. Based on this method alone, we might conclude that BERT's knowledge of this domain is near perfect. However, the other methods show a more nuanced picture.

BERT's knowledge of which expressions are NPIs and NPI licensors is generally stronger than its knowledge of the licensors' scope. This is especially apparent from the probing results (Fig. 4). BERT without acceptability fine-tuning performs close to ceiling on detecting the presence of a licensor, but is inconsistent at scope detection. Tellingly, the BoW baseline is also able to perform at ceiling on telling whether a licensor is present. For BoW to succeed at this task, the GloVe embeddings for NPI licensors must share some common property, most likely the fact that licensors co-occur with NPIs. It is possible that BERT is able to succeed using a similar strategy. By contrast, identifying whether an NPI is in the scope of a licensor requires at the very least word order information and not just co-occurrences.

The contrast in BERT's performance on the gradient and absolute tasks tells us that these evaluations reveal different aspects of BERT's knowledge. The gradient task is strictly easier than the absolute task. On the one hand, BERT's high performance on the gradient task reveals the presence of systematic knowledge in the NPI domain. On the other hand, due to ceiling effects, the gradient task fails to reveal actual differences between NPI-licensing environments that we clearly observe based on the absolute, cloze, and probing tasks.

Figure 4: Results of probing classification on NPI presence, licensor presence, and whether the NPI is in the scope of the licensor (scope detection), shown in MCC. Letters on top of bars refer to NPI environments: A=ADV, B=COND, C=D-NEG, D=S-NEG, E=ONLY, F=QNT, G=QUES, H=SMP-Q, I=SUP.

While BERT has systematic knowledge of acceptability contrasts, this knowledge varies across different licensing environments and is not categorical. Generative models of syntax (Chomsky, 1965, 1981, 1995) model human knowledge of natural language as categorical: In that sense BERT fails at attaining human performance. However, some have argued that acceptability is inherently gradient (Lau et al., 2017), and results from the human validation study on our generated dataset show evidence of gradience in the acceptability of sentences in our dataset.

Supplementing BERT with additional pretraining on CCG and MNLI does not improve performance, and even lowers performance in some cases. While results from Phang et al. (2018) lead us to hypothesize that intermediate pretraining might help, this is not what we observe on our data. This result is in direct contrast with the results from Wilcox et al. (2019), who find that syntactic pretraining does improve performance in the NPI domain. This difference in findings is likely due to differences in models and training procedure, as their model is an RNN jointly trained on language modeling and parsing over the much smaller Penn Treebank (Marcus et al., 1993).

Future studies would benefit from employing a variety of different methodologies for assessing model performance within a specified domain. In particular, a result showing generally good performance for a model should be regarded as possibly hiding actual differences in performance that a different task would reveal. Similarly, generally poor performance for a model does not necessarily mean that the model does not have systematic knowledge in a given domain; it may be that an easier task would reveal systematicity.

    8 Conclusion

We have shown that within a well-defined domain of English grammar, evaluation of sentence encoders using different tasks will reveal different aspects of the encoder's knowledge in that domain. By considering results from several evaluation methods, we demonstrate that BERT has systematic knowledge of NPI licensing. However, this knowledge is unequal across the different features relevant to this phenomenon, and does not reflect the Boolean effect that these features have on acceptability.

    Acknowledgments

This project was a joint effort by the participants in the Spring 2019 NYU Linguistics seminar course Linguistic Knowledge in Reusable Sentence Encoders. We are grateful to the department for making this seminar possible.

This material is based upon work supported by the National Science Foundation under Grant No. 1850208. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation. This project has also benefited from financial support to SB by Samsung Research under the project Improving Deep Learning using Latent Structure and from the donation of a Titan V GPU by NVIDIA Corporation.


    References

Yossi Adi, Einat Kermany, Yonatan Belinkov, Ofer Lavi, and Yoav Goldberg. 2016. Fine-grained analysis of sentence embeddings using auxiliary prediction tasks. arXiv preprint arXiv:1608.04207.

Samuel R. Bowman, Ellie Pavlick, Edouard Grave, Benjamin Van Durme, Alex Wang, Jan Hula, Patrick Xia, Raghavendra Pappagari, R. Thomas McCoy, Roma Patel, Najoung Kim, Ian Tenney, Yinghui Huang, Katherin Yu, Shuning Jin, and Berlin Chen. 2018. Looking for ELMo's friends: Sentence-level pretraining beyond language modeling. CoRR, abs/1812.10860.

Gennaro Chierchia. 2013. Logic in grammar: Polarity, free choice, and intervention. Oxford University Press.

Noam Chomsky. 1965. Aspects of the Theory of Syntax. MIT Press.

Noam Chomsky. 1981. Lectures on Government and Binding. Number 9 in Studies in Generative Grammar. Foris Publications, Dordrecht, Holland; Cinnaminson, N.J.

Noam Chomsky. 1995. The Minimalist Program. Number 28 in Current Studies in Linguistics. The MIT Press, Cambridge, Mass.

Alexis Conneau, Douwe Kiela, Holger Schwenk, Loïc Barrault, and Antoine Bordes. 2017. Supervised learning of universal sentence representations from natural language inference data. CoRR, abs/1705.02364.

Alexis Conneau, German Kruszewski, Guillaume Lample, Loïc Barrault, and Marco Baroni. 2018. What you can cram into a single vector: Probing sentence embeddings for linguistic properties. arXiv preprint arXiv:1805.01070.

Ishita Dasgupta, Demi Guo, Andreas Stuhlmüller, Samuel J. Gershman, and Noah D. Goodman. 2018. Evaluating compositionality in sentence embeddings. arXiv preprint arXiv:1802.04302.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Allyson Ettinger, Ahmed Elgohary, and Philip Resnik. 2016. Probing for semantic evidence of composition by means of simple classification tasks. In Proceedings of the First Workshop on Evaluating Vector Space Representations for NLP, pages 134–139. Association for Computational Linguistics.

Allyson Ettinger, Ahmed Elgohary, Colin Phillips, and Philip Resnik. 2018. Assessing composition in sentence vector representations. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1790–1801. Association for Computational Linguistics.

Gilles Fauconnier. 1975. Polarity and the scale principle. Proceedings of the Chicago Linguistics Society, 11:188–199.

Matt Gardner, Joel Grus, Mark Neumann, Oyvind Tafjord, Pradeep Dasigi, Nelson F. Liu, Matthew Peters, Michael Schmitz, and Luke Zettlemoyer. 2018. AllenNLP: A deep semantic natural language processing platform. In Proceedings of Workshop for NLP Open Source Software (NLP-OSS), pages 1–6.

Anastasia Giannakidou. 1994. The semantic licensing of NPIs and the Modern Greek subjunctive. Language and Cognition, 4:55–68.

Anastasia Giannakidou. 1998. Polarity sensitivity as (non)veridical dependency, volume 23. John Benjamins Publishing.

Mario Giulianelli, Jack Harding, Florian Mohnert, Dieuwke Hupkes, and Willem H. Zuidema. 2018. Under the hood: Using diagnostic classifiers to investigate and improve how language models track agreement information. CoRR, abs/1808.08079.

Julia Hockenmaier and Mark Steedman. 2007. CCGbank: A corpus of CCG derivations and dependency structures extracted from the Penn Treebank. Computational Linguistics, 33(3):355–396.

Jaap Jumelet and Dieuwke Hupkes. 2018. Do language models understand anything? On the ability of LSTMs to understand negative polarity items. arXiv preprint arXiv:1808.10627.

Nirit Kadmon and Fred Landman. 1993. Any. Linguistics and Philosophy, 16(4):353–422.

Katharina Kann, Alex Warstadt, and Adina Williams. 2019. Verb argument structure alternations in word and sentence embeddings. In Proceedings of the Society for Computation in Linguistics (SCiL), pages 287–297.

Manfred Krifka. 1995. The semantics and pragmatics of polarity items. Linguistic Analysis, 25(3-4):209–257.

William Ladusaw. 1979. Negative polarity items as inherent scope relations. Unpublished Ph.D. dissertation, University of Texas at Austin.

Jey Han Lau, Alexander Clark, and Shalom Lappin. 2017. Grammaticality, acceptability, and probability: A probabilistic view of linguistic knowledge. Cognitive Science, 41(5):1202–1241.

Marcia Christine Linebarger. 1980. The grammar of negative polarity. Ph.D. thesis, Massachusetts Institute of Technology.

Tal Linzen, Emmanuel Dupoux, and Yoav Goldberg. 2016. Assessing the ability of LSTMs to learn syntax-sensitive dependencies. Transactions of the Association for Computational Linguistics, 4:521–535.

Xiaodong Liu, Pengcheng He, Weizhu Chen, and Jianfeng Gao. 2019. Multi-task deep neural networks for natural language understanding. CoRR, abs/1901.11504.

Mitchell Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building a large annotated corpus of English: The Penn Treebank.

Rebecca Marvin and Tal Linzen. 2018. Targeted syntactic evaluation of language models. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1192–1202.

Brian W. Matthews. 1975. Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochimica et Biophysica Acta (BBA) - Protein Structure, 405(2):442–451.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Jason Phang, Thibault Févry, and Samuel R. Bowman. 2018. Sentence encoders on STILTs: Supplementary training on intermediate labeled-data tasks. arXiv preprint.

Xing Shi, Inkit Padhi, and Kevin Knight. 2016. Does string-based neural MT learn source syntax? In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1526–1534.

Asa Cooper Stickland and Iain Murray. 2019. BERT and PALs: Projected attention layers for efficient adaptation in multi-task learning. CoRR, abs/1902.02671.

Ian Tenney, Patrick Xia, Berlin Chen, Alex Wang, Adam Poliak, R. Thomas McCoy, Najoung Kim, Benjamin Van Durme, Sam Bowman, Dipanjan Das, and Ellie Pavlick. 2019. What do you learn from context? Probing for sentence structure in contextualized word representations. In International Conference on Learning Representations.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 353–355.

Alex Wang, Ian F. Tenney, Yada Pruksachatkun, Katherin Yu, Jan Hula, Patrick Xia, Raghu Pappagari, Shuning Jin, R. Thomas McCoy, Roma Patel, Yinghui Huang, Jason Phang, Edouard Grave, Najoung Kim, Phu Mon Htut, Thibault Févry, Berlin Chen, Nikita Nangia, Haokun Liu, Anhad Mohananey, Shikha Bordia, Ellie Pavlick, and Samuel R. Bowman. 2019. jiant 0.9: A software toolkit for research on general-purpose text understanding models. http://jiant.info/.

Alex Warstadt and Samuel R. Bowman. 2019. Grammatical analysis of pretrained sentence encoders with acceptability judgments. arXiv preprint arXiv:1901.03438.

Alex Warstadt, Amanpreet Singh, and Samuel R. Bowman. 2018. Neural network acceptability judgments. arXiv preprint arXiv:1805.12471.

Ethan Wilcox, Peng Qian, Richard Futrell, Miguel Ballesteros, and Roger Levy. 2019. Structural supervision improves learning of non-local grammatical dependencies. In Proceedings of the 2019 Annual Conference of the North American Chapter of the Association for Computational Linguistics.

Frank Wilcoxon. 1945. Individual comparisons by ranking methods. Biometrics Bulletin, 1(6):80–83.

Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1112–1122.

Frans Zwarts. 1998. Three types of polarity. In Plurality and Quantification, pages 177–238. Springer.