
Open Sesame: Getting Inside BERT’s Linguistic Knowledge

Yongjie Lin a,* and Yi Chern Tan a,* and Robert Frank b
a Department of Computer Science, Yale University

b Department of Linguistics, Yale University
{yongjie.lin, yichern.tan, robert.frank}@yale.edu

Abstract

How and to what extent does BERT encode syntactically-sensitive hierarchical information or positionally-sensitive linear information? Recent work has shown that contextual representations like BERT perform well on tasks that require sensitivity to linguistic structure. We present here two studies which aim to provide a better understanding of the nature of BERT's representations. The first of these focuses on the identification of structurally-defined elements using diagnostic classifiers, while the second explores BERT's representation of subject-verb agreement and anaphor-antecedent dependencies through a quantitative assessment of self-attention vectors. In both cases, we find that BERT encodes positional information about word tokens well on its lower layers, but switches to a hierarchically-oriented encoding on higher layers. We conclude then that BERT's representations do indeed model linguistically relevant aspects of hierarchical structure, though they do not appear to show the sharp sensitivity to hierarchical structure that is found in human processing of reflexive anaphora.1

1 Introduction

Word embeddings have become an important cornerstone in any NLP pipeline. Although such embeddings traditionally involve context-free distributed representations of words (Mikolov et al., 2013; Pennington et al., 2014), recent successes with contextualized representations (Howard and Ruder, 2018; Peters et al., 2018; Radford et al., 2019) have led to a paradigm shift. One prominent architecture is BERT (Devlin et al., 2018), a Transformer-based model that learns bidirectional encoder representations for words, on the basis of a masked language model and sentence adjacency training objective.

* Equal contribution.
1 The code is available at https://github.com/yongjie-lin/bert-opensesame.

Simply using BERT's representations in place of traditional embeddings has resulted in state-of-the-art performance on a range of downstream tasks including summarization (Liu, 2019), question answering and textual entailment (Devlin et al., 2018). It is still, however, unclear why BERT representations perform well.

A flurry of recent work (Linzen et al., 2016; Gulordava et al., 2018; Marvin and Linzen, 2018; Lakretz et al., 2019) has explored how recurrent neural language models perform in cases that require sensitivity to hierarchical syntactic structure, and how they do so, particularly in the domain of agreement. In these studies, a pre-trained language model is asked to predict the next word in a sentence (a verb in the target sentence) following a sequence that may include other intervening nouns with different grammatical features (e.g., "the bear by the trees eats..."). The predicted verb should agree with the subject noun (bear) and not the attractors (trees), in spite of the latter's recency. Such analyses have revealed that LSTMs exhibit state tracking and explicit notions of word order for modeling long term dependencies, although this effect is diluted when sequential and structural information in a sentence conflict. Further work by Gulordava et al. (2018) and others (Linzen and Leonard, 2018; Giulianelli et al., 2018) argues that RNNs acquire grammatical competence in agreement that is more abstract than word collocations, although language model performance on phenomena that require sensitivity to structure, such as reflexive anaphora, non-local agreement and negative polarity, remains low (Marvin and Linzen, 2018). Meanwhile, studies evaluating which linguistic phenomena are encoded by contextualized representations (Goldberg, 2019; Wolf, 2019; Tenney et al., 2019) successfully demonstrate that purely self-attentive architectures like BERT can capture hierarchy-sensitive syntactic dependencies, and even support the extraction of dependency parses (Hewitt and Manning, 2019).



However, the way in which BERT does this has been less studied. In this paper, we investigate how and where the representations produced by pre-trained BERT models (Devlin et al., 2018) express the hierarchical organization of a sentence.

We proceed in two ways. The first involves the use of diagnostic classifiers (Hupkes et al., 2018) to probe the presence of hierarchical and linear properties in the representations of words. However, unlike past work, we train these classifiers using a "poverty of the stimulus" paradigm, where the training data admit both linear and hierarchical solutions that can be distinguished by an enriched generalization set. This method allows us to identify what kinds of information are represented most robustly and transparently in the BERT embeddings. We find that as we use embeddings from higher layers, the prevalence of linear/sequential information decreases, while the availability of hierarchical information increases, suggesting that with each layer, BERT phases out positional information in favor of hierarchical features of increasing complexity.

In the second set of experiments, we explore a novel approach to the study of BERT's self-attention vectors. Past explorations of attention mechanisms, whether in the domain of vision (Olah et al., 2018; Carter et al., 2019) or NLP (Bahdanau et al., 2015; Karpathy et al., 2015; Young et al., 2018; Voita et al., 2018), have largely involved a range of visualization techniques or the study of the general distribution of attention. Our work takes a quantitative approach to the study of attention and its encoding of syntactic dependencies. Specifically, we consider the relationships between verbs and the subjects with which they agree, and reflexive anaphors and their antecedents. Building on past work in psycholinguistics, we consider the influence of distractor noun phrases on the identification of these dependencies. We propose a simple attention-based metric called the confusion score that captures BERT's response to syntactic distortions in an input sentence. This score provides a novel quantitative method of evaluating BERT's syntactic knowledge as encoded in its attention vectors. We find that BERT does indeed leverage syntactic relationships between words to preferentially attend to the "correct" noun phrase for the purposes of agreement and anaphora, though syntactic structure does not show the strong categorical effects we sometimes find in natural language.

This result again points to a representation of syntactically-relevant hierarchical information in BERT, this time through attention weightings.

Our analysis thus provides evidence that BERT's self-attention layers compose increasingly abstract representations of linguistic structure without explicit word order information, and that structural information is expressly favored over linear information. This explains why BERT can perform well on downstream NLP tasks, which typically require complex modeling of structural relationships.

2 Diagnostic Classification

For our first exploration of the kind of linguistic information captured in BERT's embeddings, we apply diagnostic classifiers to 3 tasks: identifying whether a given word is the sentence's main auxiliary, the sentence's subject noun, or the sentence's nth token. In each task, we assess how well BERT's embeddings encode information about a given linguistic property via the ability of a simple diagnostic classifier to correctly recover the presence of that property from the embeddings of a single word. The three tasks focus on different sorts of information: identifying the main auxiliary and the subject noun requires sensitivity to hierarchical or syntactic information, while the nth token requires linear information.

For each token in a given sentence, its input representation to BERT is a sum of its token, segment and positional embeddings (Devlin et al., 2018). We refer to these inputs as pre-embeddings. Note that by construction, a) the pre-embeddings contain linear but not hierarchical information, and b) BERT cannot generate new linear information that is not already in the input. Thus, any linear information in BERT's embeddings ultimately stems from the pre-embeddings, while any hierarchical information must be constructed by BERT itself.
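To make the construction of the pre-embeddings concrete, the sketch below builds them as the sum of the three components. This is our reconstruction, not the authors' code: it assumes the current transformers API rather than the pytorch-pretrained-BERT implementation used in the paper, and it omits the layer normalization and dropout that BERT applies after the sum. Dropping the positional term yields a baseline of the kind used as pE – pos in the nth-token experiment below.

```python
# Hedged sketch: pre-embeddings as token + segment + positional embeddings.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
emb = model.embeddings

ids = tokenizer("the cat will sleep", return_tensors="pt")["input_ids"]
positions = torch.arange(ids.size(1)).unsqueeze(0)
segments = torch.zeros_like(ids)                     # single-sentence input

with torch.no_grad():
    pre_embeddings = (emb.word_embeddings(ids)
                      + emb.position_embeddings(positions)
                      + emb.token_type_embeddings(segments))        # pE
    pre_embeddings_no_pos = (emb.word_embeddings(ids)
                             + emb.token_type_embeddings(segments)) # pE - pos
```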

2.1 Poverty of the stimulus

To classify an embedding as a sentence's main auxiliary or subject noun, the network needs to have represented structural information about a word's role in the sentence. In many cases, such structural information can be approximated linearly: the main auxiliary or subject noun could be identified as the first auxiliary or noun in a sentence.


Main auxiliary task | Subject noun task

Training, Development:
the cat will sleep | the bee can sting
the cat will eat the fish that can swim | the bee can sting the boy

Generalization:
the cat that can meow will sleep | (compound noun) the queen bee can sting
the cat that can meow will eat the fish that can swim | (possessive) the queen's bee can sting

Table 1: Representative sentences from the main auxiliary and subject noun tasks. For the latter, the generalization set contains two types of sentences, compound nouns and possessives, which are evaluated on separately. In each example, the correct token is underlined, while the distractor (consistent with the incorrect linear rule) is italicized.

Though such a linear generalization may be falsified if given certain complex examples, it will succeed over a large range of simple sentences. Chomsky (1980) argues that the relevant distinguishing examples may be very rare for the case of identifying the main auxiliary (a property that is necessary in order to form questions), and hence this is an instance of the "poverty of the stimulus" that motivates the hypothesis of innate bias toward hierarchical generalizations. However, it seems clear that distinguishing examples are plentiful for the subject noun case. The question we are interested in, then, is whether and how BERT's embeddings, which result from training on a massive dataset, encode hierarchical information.

Pursuing the idea of poverty of the stimulus training (McCoy et al., 2018), we train diagnostic classifiers only on sentences in which the relevant property (main auxiliary or subject noun) is stateable in either hierarchical or sequential terms, i.e., the linearly first auxiliary or noun (cf. Section 2.2). The classifiers are then tested on sentences of greater complexity in which the hierarchical and linear generalizations can be distinguished. Since our classifier is a simple perceptron that can access only one embedding at a time, it cannot compute complex contingencies among the representations of multiple words, and cannot succeed unless such information is already encoded in the individual embeddings. Thus, success on these tasks would indicate that BERT robustly represents the words of a sentence using a feature space where the identification of hierarchical generalizations is easy.

2.2 Dataset

The main auxiliary and subject noun tasks use synthetic datasets generated from context-free grammars (cf. Appendix A.1) that were designed to isolate the relevant syntactic property for a poverty of the stimulus setup. Typical sentences are highlighted in Table 1. In both tasks, the training, development and generalization sets contained 40000, 10000, and 10000 examples respectively.

Main auxiliary In the training and development sets, the main auxiliary (will in Table 1) is always the first auxiliary in the sentence. A classifier that learns the naive linear rule of identifying the first linearly occurring auxiliary instead of the correct hierarchical (syntactic) rule still performs well during training. However, in the generalization set, the subject of each sentence is modified by a relative clause that contains an intervening auxiliary (that can meow). Since the main auxiliary is never the first auxiliary in this case, learning the hierarchical rule becomes imperative.

Subject noun In the training and development sets, the subject noun (bee in Table 1) is always the first noun in the sentence. A classifier that learns the linear rule of identifying the first linearly occurring noun does well during training, but only the hierarchical rule gives the right answer at test time. In the generalization set (both compound nouns & possessives cases), the subject noun is the head of the construction (bee) and not the dependent (queen). In the possessives case, we note that subword tokenization always produces 's as a standalone token, e.g. queen's is tokenized into [queen] ['s]. Also, we allow sentences to chain an arbitrary number of possessives via nesting.

nth token For this experiment, we use sentences from the Penn Treebank WSJ corpus. Following the setup of Collins (2002) and filtering for sentences between 10 and 30 tokens after BERT tokenization, we obtained training, development and generalization sets of 21142, 3017 and 2999 sentences respectively. We only consider 2 ≤ n ≤ 9. In particular, we ignore n = 1 since the first token produced by BERT is always trivially [CLS].


2.3 Methods

BERT models In our experiments, we consider two of Google AI's pre-trained BERT models, bert-base-uncased (bbu) and bert-large-uncased (blu), from a PyTorch implementation.2 bbu has 12 layers, 12 attention heads and embedding width 768, while blu has 24 layers, 16 attention heads and embedding width 1024.
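As a concrete illustration (not the authors' code), the following sketch loads bbu and exposes one embedding per token per layer; it uses the current transformers package rather than the pytorch-pretrained-BERT implementation cited in the footnote.

```python
# Hedged sketch: obtaining per-layer token embeddings from bert-base-uncased.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")          # bbu
model = BertModel.from_pretrained("bert-base-uncased",
                                  output_hidden_states=True)
model.eval()

inputs = tokenizer("the cat will sleep", return_tensors="pt")
with torch.no_grad():
    hidden_states = model(**inputs).hidden_states

# hidden_states is a tuple of 13 tensors of shape [1, seq_len, 768]:
# index 0 is the embedding layer (the pre-embeddings), indices 1-12 are the
# outputs of the 12 transformer layers.
print(len(hidden_states), hidden_states[1].shape)
```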

Training For each task, we train a simple perceptron with a sigmoid output to perform binary classification on individual token embeddings of a sentence, based on whether the underlying token possesses the property relevant to the task. This is similar to the concept of diagnostic classification by Hupkes et al. (2018); Giulianelli et al. (2018).

the cat will sleep
(BERT) ↓ ↓ ↓ ↓
e1 e2 e3 e4
(Classifier) ↓ ↓ ↓ ↓
ŷ1 ŷ2 ŷ3 ŷ4

Each input sentence is tokenized and processed by BERT, and the resulting embeddings {ei} are individually passed to the classifier fθ to produce a sequence of logits {ŷi}. Supervision is provided via a one-hot vector of indicators {yi} for the specified property. For example, in the main auxiliary task, the above example would have y1 = y2 = y4 = 0 and y3 = 1, since the third word is the main auxiliary. The contribution of each example to the total cross-entropy loss is:

L_\theta = -\sum_i \left( y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \right)    (1)

Each classifier is trained for a single epoch using the Adam optimizer (Kingma and Ba, 2014) with hyperparameters lr = 0.001, β1 = 0.9, β2 = 0.999. We freeze BERT's weights throughout training, which allows us to take good classification performance as evidence that the information relevant to the task is being encoded in BERT's embeddings in a salient, easily retrievable manner.
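A minimal sketch of one such diagnostic classifier follows; it is our reconstruction from the description above, with the sigmoid folded into the loss via BCEWithLogitsLoss, which is equivalent to Eq. (1).

```python
# Hedged sketch of a diagnostic classifier: a single linear layer (perceptron)
# trained with the summed binary cross-entropy of Eq. (1), with BERT frozen.
import torch
import torch.nn as nn

hidden_size = 768                        # bbu embedding width
probe = nn.Linear(hidden_size, 1)        # one logit per token embedding
optimizer = torch.optim.Adam(probe.parameters(), lr=0.001, betas=(0.9, 0.999))
loss_fn = nn.BCEWithLogitsLoss(reduction="sum")  # sigmoid folded into the loss

def train_step(embeddings, labels):
    """embeddings: [seq_len, hidden_size] from one (frozen, detached) BERT layer;
    labels: [seq_len] indicator vector for the property (e.g. main auxiliary)."""
    logits = probe(embeddings).squeeze(-1)
    loss = loss_fn(logits, labels.float())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```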

Evaluation For each example at test time, after computing the logits we obtain the index of the classifier's most confident guess within the sentence:

i^* = \arg\max_i \hat{y}_i    (2)

The average y_{i^*} across the test set is reported as the classification accuracy.
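In code, the evaluation of Eq. (2) might look like the following sketch (ours, not the authors'): the prediction is the index of the largest logit, and accuracy is the mean of the true label at that index.

```python
# Hedged sketch of the evaluation: average y_{i*} over the test set.
import torch

def evaluate(probe, test_embeddings, test_labels):
    """test_embeddings: list of [seq_len, hidden] tensors;
    test_labels: list of [seq_len] 0/1 tensors."""
    hits = []
    with torch.no_grad():
        for emb, labels in zip(test_embeddings, test_labels):
            logits = probe(emb).squeeze(-1)
            i_star = torch.argmax(logits).item()   # Eq. (2)
            hits.append(labels[i_star].item())     # y_{i*}
    return sum(hits) / len(hits)
```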

2 https://github.com/huggingface/pytorch-pretrained-BERT

Figure 1: Layerwise accuracy of diagnostic classifiers on the generalization set of the main auxiliary task.

Layerwise diagnosis One key aspect of our experiments is the training of layer-specific classifiers for all layers. This yields a layerwise diagnosis of the information content in BERT's embeddings, providing a glimpse into how BERT internally manipulates and composes linguistic information. We also train classifiers on the pre-embeddings, which can be considered as the "zero-th" layer of BERT and hence act as useful baselines for content present in the input.
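Schematically (our sketch, not the released code), the layerwise setup amounts to one independent probe per layer, with the pre-embeddings serving as layer 0:

```python
# Hedged sketch: independent diagnostic classifiers for the pre-embeddings
# ("layer 0") and for each of the 12 layers of bbu.
import torch.nn as nn

hidden_size, num_layers = 768, 12
probes = nn.ModuleList([nn.Linear(hidden_size, 1) for _ in range(num_layers + 1)])
# probes[0] is trained on the pre-embeddings, probes[k] on the output of layer k.
```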

2.4 Results

Main auxiliary Classifiers for both models achieved near-perfect accuracy across all layers on the development set. In Figure 1, we observe that on the generalization set, the classifiers for both models can identify the main auxiliary with over 85% accuracy past layer 5, and bbu in particular obtains near-perfect accuracy from layers 4 to 11.

As discussed in Section 2.2, the classifiers were only given training examples where the main auxiliary was also the first auxiliary in the sentence. Although the linear rule "pick the first auxiliary" is compatible with the training data, the classifier nonetheless learns the more complex but correct hierarchical rule "pick the auxiliary of the main clause". By our argument from Section 2.1, this suggests that BERT embeddings encode syntactic information relevant to whether a token is the main auxiliary, as a feature salient enough to be recoverable by our simple diagnostic classifier.

We found that almost all instances of classification errors involved the misidentification of the linearly first auxiliary (within the relative clause) as the main auxiliary, e.g. can instead of will in Table 1. We believe that this stems from the significance of part-of-speech information for language modeling: because POS information is salient in the embeddings, any word of a different POS will not be chosen by the classifier, so errors are confined to other auxiliaries.


Figure 2: Layerwise accuracy of diagnostic classifiers on the compound noun generalization set of the subject noun task.

Figure 3: Layerwise accuracy of diagnostic classifiers on the possessive generalization set of the subject noun task.

Subject noun As with the previous task, classifiers for both models achieved near-perfect accuracy across all layers on the development set. On the compound noun generalization set (Figure 2), while bbu achieved near-perfect accuracy in later layers, blu consistently performed poorly. bbu's performance suggests that the classifier successfully learns a generalization that excludes the first noun of a compound, as opposed to the naive linear rule "pick the first noun". As before, this suggests that BERT encodes syntactic information in its embeddings. However, blu's performance is unexpected: it consistently predicts the object noun when it makes errors. In contrast, on the possessive generalization set (Figure 3), both models perform poorly. We offer an explanation for this distinctive performance in Section 2.5.

nth token Since this property is entirely determined by the linear position of a word in a sentence, it directly measures the amount of positional information encoded in the embeddings. Here we have two baselines characterizing both extremes: the normal pre-embeddings (denoted pE) and a variant (pE – pos) where we exclude the positional component from its construction.

Figure 4: Layerwise accuracy of diagnostic classifiers on the generalization set of the nth token task, for the bbu model only. Note that each line corresponds to a particular layer's embeddings as we vary 2 ≤ n ≤ 9. pE denotes pre-embeddings and pE – pos denotes pre-embeddings without the positional component.

Since BERT cannot introduce any new positional information, we expect these two to represent upper and lower bounds on the amount of positional information present in BERT's embeddings.

In Figure 4, we see a dramatic difference in performance on pE (one of the topmost lines) compared to pE – pos (bottommost line). We note that performances across all 12 layers fall between these two extremes, confirming our intuitions from earlier. Specifically, the classifiers for layers 1 – 3 have near-perfect accuracy on identifying an arbitrary nth token (2 ≤ n ≤ 9). However, from layer 4 onwards, the accuracy drops sharply as n increases. This suggests that the positional component of the pre-embeddings is the primary source of positional information in BERT, and that BERT (bbu) discards a significant amount of positional information between layers 3 and 4, possibly in favor of hierarchical information.

2.5 Further Analysis

Main auxiliary In Figure 1, we observe that classification accuracy increases sharply in the first 4 layers, then plateaus before slowly decreasing. This mirrors a similar layerwise trend observed by Hewitt and Manning (2019). We postulate that the embeddings reach their "optimal level of abstraction" with respect to their ability to predict the main auxiliary halfway through BERT (about layer 6 for bbu, 12 for blu). At layer 1, the embedding for each token is a highly localized representation that contains insufficient sentential context to determine whether it is the main auxiliary of the sentence.


As layer depth increases, BERT composes increasingly abstract representations via self-attention, which allows it to extract information from other tokens in the sentence. At some point, the representation becomes abstract enough to represent the hierarchical concept of a "main auxiliary", causing an early increase in classification accuracy. However, as depth increases further, the representations become so abstract that finer linguistic features are increasingly difficult to recover, e.g., a token embedding at the sentence-vector level of abstraction may no longer be capable of identifying itself as the main auxiliary, accounting for the slowly deteriorating performance towards later layers.

Subject noun Given the similarity of the main auxiliary and the subject noun classification tasks, we might expect them to exhibit similar trends in performance. In Figure 2, we observe a similar early increase in diagnostic classification accuracy for the bbu embeddings. The lack of significant performance decay on higher layers possibly reflects the salience of the subject noun feature even at the sentence-vector level of abstraction. Strangely, blu performed poorly, even worse than chance (50%). We are unable to explain why this happens and leave this for future research.

On the possessive generalization set, the poor performance of both models seems to contradict the hypothesis that BERT has learned an abstract hierarchical generalization to classify subject nouns. We conjecture that BERT's issues in the possessive case stem from the ambiguity of the 's token, which can function either as a possessive marker or as a contracted auxiliary verb (e.g. "She's sleeping"). If BERT takes a possessive occurrence of 's as the auxiliary verb, the immediately preceding noun can be (incorrectly) analyzed as the subject. If so, this would suggest that BERT does not represent the syntactic structure of the entire sentence in a unified fashion, but instead uses local cues to constituency. In Figure 3, the gradually increasing but still poor performance towards later layers in both models suggests that the embeddings might be trending toward a more abstract representation, but do not ultimately achieve it.

nth token For each layer k ≥ 3, Figure 4 shows an asymmetry where the classifier for layer k performs worse at identifying the nth token as n increases. We believe that this may be an artifact of the distributional properties of natural language: the distribution of words that occur at the start of a sentence tends to be concentrated on a small class of parts of speech that can occur near the beginning of constituents that can begin a sentence. As n increases, the class of possible parts of speech is no longer constrained by the beginning of the sentence, and as a result becomes more uniform. It is therefore easier for a classifier to predict whether a given word is the nth token when n is small, since it can make use of easily accessible part-of-speech information in the embeddings to limit its options to only the tokens likely to occur in a given position.

3 Diagnostic Attention

Our second exploration of BERT's syntactic knowledge focuses on the encoding of grammatical relationships instead of the identification of elements with specific structural properties. We consider two phenomena: reflexive anaphora and subject-verb agreement. For each, we determine the extent to which BERT attends to linguistically relevant elements via the self-attention mechanism. This gives us further information about how hierarchy-sensitive syntactic information is encoded.

3.1 Quantifying intrusion effects via attention

Subject-verb and antecedent-anaphor dependencies both involve a dependent element, which we call the target (the verb or the anaphor), and the element on which it depends, which we call the trigger (the subject or the antecedent that provides the interpretation). A considerable body of work in psycholinguistics has explored how humans process such dependencies in the presence of elements that are not in relevant structural positions but which linearly intervene between the trigger and target. Dillon et al. (2013) aim to quantify this intrusion effect in human reading for the two dependencies we explore here. Under the assumption that higher reading time and eye movement regressions indicate an intrusion effect, they conclude that intruding noun phrases have a substantial effect on the processing of subject-verb agreement, but not antecedent-anaphor relations.

We adapt this idea in measuring intrusion effects in BERT. We propose a simple and novel metric we term the "confusion score" for quantifying intrusion effects using attention.


Subject-Verb Agreement

Condition | Relative Clause | DN Number Match | Example Sentence | Mean Confusion Score
A1 | ✗ | ✓ | the cat near the dog does sleep | 0.97
A2 | ✗ | ✗ | the cat near the dogs does sleep | 0.93
A3 | ✓ | ✓ | the cat that can comfort the dog does sleep | 0.85
A4 | ✓ | ✗ | the cat that can comfort the dogs does sleep | 0.81

Reflexive Anaphora

Condition | Relative Clause | DNo Gender Match | DNr Gender Match | Example Sentence | Mean Confusion Score
R1 | ✗ | ✓ | NA | the lord could comfort the wizard by himself | 1.01
R2 | ✗ | ✗ | NA | the lord could comfort the witch by himself | 0.92
R3 | ✓ | NA | ✓ | the lord that can hurt the prince could comfort himself | 0.99
R4 | ✓ | NA | ✗ | the lord that can hurt the princess could comfort himself | 0.89
R5 | ✓ | ✓ | ✓ | the lord that can hurt the prince could comfort the wizard by himself | 1.57
R6 | ✓ | ✓ | ✗ | the lord that can hurt the princess could comfort the wizard by himself | 1.52
R7 | ✓ | ✗ | ✓ | the lord that can hurt the prince could comfort the witch by himself | 1.49
R8 | ✓ | ✗ | ✗ | the lord that can hurt the princess could comfort the witch by himself | 1.39

Table 2: Representative sentences from the subject-verb agreement and reflexive anaphora datasets for each condition, and corresponding mean confusion scores. DNo: distractor noun as object. DNr: distractor noun in relative clause.

This quantitative metric allows us to measure the preferential attention of transformer-based self-attention on one entity as opposed to another. Formally, suppose X = {x_1, ..., x_n} are linguistic units of interest, i.e. candidate triggers for the dependency, and Y is the dependency target. For each layer l and attention head a, we sum the self-attention weights from the indices of x_i (since each x_i may consist of multiple words) on attention head a of layer l − 1 to Y on layer l, and take the mean over A attention heads:

\mathrm{attn}_l(x_i, Y) = \frac{1}{A} \sum_{a=1}^{A} \sum_{x_{ij} \in x_i} \mathrm{attn}_{l,a}(x_{ij}, Y)    (3)

We finally define the confusion score on layer l as the binary cross entropy of the normalized attention distribution between {x_i} given Y as follows:

\mathrm{conf}_l(X, Y) = -\log \frac{\mathrm{attn}_l(x_1, Y)}{\sum_{i=1}^{n} \mathrm{attn}_l(x_i, Y)}    (4)

Note that this equation assumes that each dependency has a unique trigger x_1: verbs agree with a single subject, and anaphors take a single noun phrase as their antecedent.
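The sketch below is one way (ours, not the released code) to compute Eq. (4) from the attention matrices exposed by the transformers package; the base-2 logarithm and the exact layer indexing are assumptions, chosen to be consistent with the uniform baselines of 1 and 1.6 reported below.

```python
# Hedged sketch of the confusion score of Eq. (4) for one sentence.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_attentions=True)
model.eval()

def confusion_score(sentence, trigger_spans, target_idx, layer):
    """trigger_spans: token-index spans of the candidate triggers x_1..x_n,
    with x_1 the true trigger; target_idx: position of the target Y.
    Indices count the leading [CLS] token."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        attentions = model(**inputs).attentions  # tuple of [1, heads, seq, seq]
    # attention paid by the target to every position, averaged over heads
    attn = attentions[layer][0].mean(dim=0)[target_idx]
    weights = torch.stack([attn[s:e].sum() for s, e in trigger_spans])
    return -torch.log2(weights[0] / weights.sum()).item()

# condition A1: subject "cat" (x1) vs. distractor "dog", target "does"
print(confusion_score("the cat near the dog does sleep",
                      trigger_spans=[(2, 3), (5, 6)], target_idx=6, layer=4))
```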

Our study focuses on examples of the forms shown in Table 2. For subject-verb agreement, there are two types of examples: with the distractor within a PP (A1 and A2) and with the distractor within a RC (A3 and A4). Past psycholinguistic work has shown that distractor noun phrases within PPs give rise to greater processing difficulty than distractors within RCs (Bock and Cutting, 1992). For each type, we compare confusion in the case of distractors that share features with the subject, the true trigger of agreement (A1 and A3), with those that do not (A2 and A4). Our expectation is that distractors that do not share features with the target of agreement will yield less confusion.

For reflexive anaphora, because of the possibility of ambiguity, we also consider sentences that include a noun phrase that is a structurally possible antecedent. For example, condition R1 has the subject the lord as its antecedent, but the object noun phrase the wizard is also grammatically possible. In contrast, for R2, the mismatch in gender features prevents the object from serving as an antecedent, which should lead to lower confusion. Sentences R3 and R4 include a distractor noun phrase within a RC. Since this noun phrase does not c-command the anaphor, it is grammatically inaccessible and should therefore contribute less, if at all, to confusion. Sentence types R5 through R8 include both the relative modifier and the object noun phrase, and systematically vary the agreement properties of the two distractors.

We hypothesize that attention weights on each linguistic unit indicate the relative importance of that entity as a trigger of a linguistic dependency. As a result, the ideal attention distribution should put all of the probability mass on the antecedent noun phrase for reflexive anaphora or on the subject noun phrase for agreement, and zero on the distractor noun phrases.


As a baseline, a uniform distribution over two noun phrases, one the actual target and the other a distractor, would lead to a confusion score of −log(1/2) = 1; with two distractors, the uniform probability baseline would be −log(1/3) ≈ 1.6.
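The reported baselines of 1 and 1.6 are consistent with a base-2 logarithm, which a quick check confirms (the choice of base is our inference rather than something stated explicitly in the text):

```python
# -log2 of the uniform baselines: 1.0 with one distractor, ~1.585 with two.
import math
print(-math.log2(1 / 2), -math.log2(1 / 3))
```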

3.2 Dataset

We construct synthetic datasets using context-free grammars (shown in Appendix A.1) for both subject-verb agreement and reflexive anaphora and compute mean confusion scores across multiple sentences. This allows us to control for semantic effects on the confusion score. The dataset for each condition contains 10000 examples.

In the subject-verb agreement datasets, we vary 1) the type of subordinate clause (prepositional phrase, PP; or relative clause, RC), and 2) the number on the distractor noun phrase. All conditions should be unambiguous, since only the head noun phrase can agree with the auxiliary.

In the reflexive anaphora datasets, we vary 1) the presence of a RC, 2) the gender match between the RC's noun phrase and the reflexive, 3) the presence of an object noun phrase, and 4) the gender match between the object noun and the reflexive. All nouns are singular. Conditions R1, R5, R6 are ambiguous conditions, as they include an object noun phrase that matches the reflexive in gender. In other conditions, only the head noun phrase is a possible antecedent: the object mismatches in features and the noun phrase within the RC is grammatically inaccessible.

3.3 Methods

We use Equation 4 to compute the confusion score on each layer for the target in each sentence in our dataset. As in Section 2.3, this yields a layerwise diagnosis of confusion in BERT's self-attention mechanism. We also compute the mean confusion score across all layers.3 In our experiments, we compute confusion scores using bbu only.

Note that in conditions R1, R5 and R6, there are two possible antecedents of the reflexive. We nonetheless use Equation 4 to calculate confusion scores relative to a single antecedent (the subject).

To compute the significance of the presence of different types of distractors and of feature mismatch of the distractors, we run a linear regression to predict confusion score.

3 We built on Vig (2019)'s BERT attention visualization library https://github.com/jessevig/bertviz to implement the attention-based confusion score.

Figure 5: Layerwise confusion scores for each reflexive anaphora condition listed in Table 2. Conditions R1 to R4 have one distractor noun phrase, but conditions R5 to R8 have two distractor noun phrases.

Coefficient | Estimate | p-value

Subject-Verb Agreement
Intercept | 1.33 ± 1.32e−3 | < 2e−16
Relative Clause | −0.12 ± 1.03e−3 | < 2e−16
DNr Number Match | 0.03 ± 1.03e−3 | < 2e−16
Layer | −0.06 ± 1.50e−4 | < 2e−16

Reflexive Anaphora
Intercept | 0.63 ± 1.24e−3 | < 2e−16
DNo Gender Match | 0.60 ± 9.09e−4 | < 2e−16
DNo Gender Mismatch | 0.50 ± 9.09e−4 | < 2e−16
DNr Gender Match | 0.57 ± 9.09e−4 | < 2e−16
DNr Gender Mismatch | 0.49 ± 9.09e−4 | < 2e−16
Layer | −0.03 ± 9.72e−5 | < 2e−16

Table 3: Regression estimates and p-values for the coefficient effects under reflexive anaphora and subject-verb agreement. All effects are statistically significant.

For subject-verb agreement, the baseline value is the confusion at layer 1 of a sentence with a PP and a mismatch in number on the distractor noun (condition A2 in Table 2). For reflexive anaphora, the baseline is the confusion at layer 1 of a sentence with no RC and no object noun (e.g. "the lord comforts himself").
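The paper does not name the regression software; a sketch of the kind of layerwise model described above, using statsmodels OLS with hypothetical file and column names, might look like this:

```python
# Hedged sketch: ordinary least squares predicting confusion score from the
# experimental factors of Table 2 plus layer. File and column names are
# illustrative placeholders, not the authors'.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("agreement_confusion_scores.csv")  # one row per (sentence, layer)
fit = smf.ols("confusion ~ relative_clause + dn_number_match + layer", data=df).fit()
print(fit.summary())
```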

3.4 Results

Subject-verb agreement Since sentence types A1 to A4 are all unambiguous, ideal confusion scores should be zero. However, Table 2 indicates that the mean confusion scores are instead closer to the uniform probability baseline confusion score of 1, suggesting that BERT's self-attention mechanism is far from able to perfectly model syntactically-sensitive hierarchical information.


Nonetheless, from Table 3, we see that BERT's attention mechanism is in fact sensitive to subtleties of linguistic structure: a distractor within a PP causes more confusion than one within a relative clause (i.e., the presence of the relative has a negative coefficient in the linear model), in agreement with past psycholinguistic work (Bock and Cutting, 1992). Moreover, the presence of matching distractors has a significant positive effect on confusion scores. These findings therefore suggest that BERT representations are sensitive to different types of syntactic embedding as well as the values of number features in computing subject-verb agreement dependencies.

Reflexive anaphora From Table 2, we see the major effect of the number of distractor noun phrases: mean confusion scores for conditions with one distractor (R1-R4) are lower than those with two distractors (R5-R8). If BERT were perfectly exploiting grammatical structure, we should expect the presence of a grammatically inaccessible distractor noun within a relative clause not to add to confusion. Thus, we might expect R5 and R6 to have mean confusion scores comparable to R1, as both include a single grammatically viable distractor. However, they both have higher mean confusion scores than R1 (the same is true for R7/R8 vs. R2). Moreover, conditions R2 to R4 and R7 to R8 should have confusion scores of zero, since the head noun phrase is the only grammatically possible antecedent. This, however, is not so. Taken together, we might conclude that BERT attends unnecessarily to grammatically inaccessible or grammatically mismatched distractor noun phrases, suggesting that it does not accurately model reflexive dependencies.

Nonetheless, if we look more closely at the effects of the different factors through the linear model reported in Table 3, we once again find evidence for a sensitivity to both syntactic structure and grammatical features: the presence of grammatically accessible distractors has a (slightly) larger effect on confusion than grammatically inaccessible distractors (i.e., DNo vs. DNr), particularly when the distractor matches in features with the actual antecedent.

3.5 Further Analysis

Layerwise diagnosis Figure 5 and Table 3 show that confusion is negatively correlated with layer depth for reflexive anaphora. Confusion scores for subject-verb agreement exhibit a similar trend. This provides additional evidence for our conjecture that BERT composes increasingly abstract representations containing hierarchical information, with an optimal level of abstraction. Notably, the observed sensitivity of BERT's self-attention values to grammatical distortions suggests that BERT's syntactic knowledge is in fact encoded in its attention matrices. Finally, it is worth noting that confusion for both reflexives and subject-verb agreement showed an increase at layer 4. Strikingly, this was the level at which linear information was found, through diagnostic classifiers, to be degraded. We leave an understanding of the connection between these observations for future work.

4 Conclusion

In this paper, we investigated how and to what extent BERT representations encode syntactically-sensitive hierarchical information, as opposed to linear information. Through diagnostic classification, we find that positional information is encoded in BERT from the pre-embedding level up through lower layers of the model. At higher layers, information becomes less positional and more hierarchical, and BERT encodes increasingly complex representations of sentence units.

We propose a simple and novel method of observing, for a given syntactic phenomenon, the intrusion effects of distractors on BERT's self-attention mechanism. Through such diagnostic attention, we find that BERT does encode aspects of syntactic structure that are relevant for subject-verb agreement and reflexive dependencies through attention weights, and that this information is represented more accurately on higher layers. We also find evidence that BERT is responsive to matching of grammatical features such as gender and number. However, BERT's attention is only incompletely modulated by structural and featural properties, and attention is sometimes spread across grammatically irrelevant elements.

We conclude that BERT composes increasingly abstract hierarchical representations of linguistic structure using its self-attention mechanism. To further understand BERT's syntactic knowledge, future work can (1) investigate or visualize layer-on-layer changes in BERT's structural and positional information, particularly between layers 3 and 4 when positional information is largely phased out, and (2) retrieve the increasingly hierarchical representations of BERT across layers via the self-attention mechanism.


References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In International Conference for Learning Representations.

Kathryn Bock and J. Cooper Cutting. 1992. Regulating mental energy: Performance units in language production. Journal of Memory and Language, 31(1):99–127.

Shan Carter, Zan Armstrong, Ludwig Schubert, Ian Johnson, and Chris Olah. 2019. Activation atlas. https://distill.pub/2019/activation-atlas [Accessed: 19 Apr 2019].

Noam Chomsky. 1980. Rules and representations. Behavioral and Brain Sciences, 3(1):1–15.

Michael Collins. 2002. Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing, pages 1–8. Association for Computational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. Computing Research Repository, abs/1810.04805.

Brian Dillon, Alan Mishler, Shayne Sloggett, and Colin Phillips. 2013. Contrasting intrusion profiles for agreement and anaphora: Experimental and modeling evidence. Journal of Memory and Language, 69(2):85–103.

Mario Giulianelli, Jack Harding, Florian Mohnert, Dieuwke Hupkes, and Willem Zuidema. 2018. Under the hood: Using diagnostic classifiers to investigate and improve how language models track agreement information. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 240–248, Brussels, Belgium. Association for Computational Linguistics.

Yoav Goldberg. 2019. Assessing BERT's syntactic abilities. Computing Research Repository, abs/1901.05287.

Kristina Gulordava, Piotr Bojanowski, Edouard Grave, Tal Linzen, and Marco Baroni. 2018. Colorless green recurrent networks dream hierarchically. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1195–1205, New Orleans, Louisiana. Association for Computational Linguistics.

John Hewitt and Christopher Manning. 2019. A structural probe for finding syntax in word representations. In International Conference for Learning Representations.

Jeremy Howard and Sebastian Ruder. 2018. Universal language model fine-tuning for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 328–339, Melbourne, Australia. Association for Computational Linguistics.

Dieuwke Hupkes, Sara Veldhoen, and Willem Zuidema. 2018. Visualisation and diagnostic classifiers reveal how recurrent and recursive neural networks process hierarchical structure. Journal of Artificial Intelligence Research, 61:907–926.

Andrej Karpathy, Justin Johnson, and Li Fei-Fei. 2015. Visualizing and understanding recurrent networks. Computing Research Repository, abs/1506.02078.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. Computing Research Repository, abs/1412.6980.

Yair Lakretz, German Kruszewski, Theo Desbordes, Dieuwke Hupkes, Stanislas Dehaene, and Marco Baroni. 2019. The emergence of number and syntax units in LSTM language models. Computing Research Repository, abs/1903.07435.

Tal Linzen, Emmanuel Dupoux, and Yoav Goldberg. 2016. Assessing the ability of LSTMs to learn syntax-sensitive dependencies. Transactions of the Association for Computational Linguistics, 4:521–535.

Tal Linzen and Brian Leonard. 2018. Distinct patterns of syntactic agreement errors in recurrent networks and humans. In Proceedings of the 40th Annual Conference of the Cognitive Science Society, pages 690–695.

Yang Liu. 2019. Fine-tune BERT for extractive summarization. Computing Research Repository, abs/1903.10318.

Rebecca Marvin and Tal Linzen. 2018. Targeted syntactic evaluation of language models. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1192–1202, Brussels, Belgium. Association for Computational Linguistics.

R. Thomas McCoy, Robert Frank, and Tal Linzen. 2018. Revisiting the poverty of the stimulus: Hierarchical generalization without a hierarchical bias in recurrent neural networks. In Proceedings of the 40th Annual Conference of the Cognitive Science Society, pages 2093–2098.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.


Chris Olah, Arvind Satyanarayan, Ian Johnson, Shan Carter, Ludwig Schubert, Katherine Ye, and Alexander Mordvintsev. 2018. The building blocks of interpretability. https://distill.pub/2018/building-blocks [Accessed: 19 Apr 2019].

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pages 1532–1543.

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. Computing Research Repository, abs/1802.05365.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. https://d4mucfpksywv.cloudfront.net/better-language-models/language-models.pdf [Accessed: 19 Apr 2019].

Ian Tenney, Patrick Xia, Berlin Chen, Alex Wang, Adam Poliak, R. Thomas McCoy, Najoung Kim, Benjamin Van Durme, Samuel R. Bowman, Dipanjan Das, et al. 2019. What do you learn from context? Probing for sentence structure in contextualized word representations. In International Conference on Learning Representations.

Jesse Vig. 2019. Visualizing attention in transformer-based language models. Computing Research Repository, abs/1904.02679.

Elena Voita, Pavel Serdyukov, Rico Sennrich, and Ivan Titov. 2018. Context-aware neural machine translation learns anaphora resolution. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1264–1274, Melbourne, Australia. Association for Computational Linguistics.

Thomas Wolf. 2019. Some additional experiments extending the tech report "Assessing BERT's Syntactic Abilities" by Yoav Goldberg. https://huggingface.co/bert-syntax/extending-bert-syntax.pdf [Accessed: 19 Apr 2019].

Tom Young, Devamanyu Hazarika, Soujanya Poria, and Erik Cambria. 2018. Recent trends in deep learning based natural language processing. IEEE Computational Intelligence Magazine, 13(3):55–75.


A Appendix

A.1 Context-free grammars for dataset generation

S → NPM VPM

NPM → Det N | Det N Prep Det Nom | Det N RC

NPO → Det Nom | Det Nom Prep Det Nom | Det Nom RC

VPM → Aux VI | Aux VT NPO

RC → Rel Aux VI | Rel Det Nom Aux VT | Rel Aux VT Det Nom

Nom → N | JJ Nom

Det → the | some | my | your | our | her

N → bird | bee | ant | duck | lion | dog | tiger | worm | horse | cat | fish | bear | wolf | birds | bees | ants | ducks | lions | dogs | tigers | worms | horses | cats | fish | bears | wolves

VI → cry | smile | sleep | swim | wait | move | change | read | eat

VT → dress | kick | hit | hurt | clean | love | accept | remember | comfort

Aux → can | will | would | could

Prep → around | near | with | upon | by | behind | above | below

Rel → who | that

JJ → small | little | big | hot | cold | good | bad | new | old | young

Figure 6: Context-free grammar for the main auxiliary dataset.
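For illustration only (this is not the authors' released generator), a grammar of this form can be sampled with a few lines of Python; the fragment below hard-codes a small subset of Figure 6's rules.

```python
# Hedged sketch: random sampling from a toy subset of the Figure 6 grammar.
import random

GRAMMAR = {
    "S":   [["NPM", "VPM"]],
    "NPM": [["Det", "N"], ["Det", "N", "RC"]],
    "VPM": [["Aux", "VI"]],
    "RC":  [["Rel", "Aux", "VI"]],
    "Det": [["the"], ["some"]],
    "N":   [["cat"], ["bee"], ["dog"]],
    "Aux": [["can"], ["will"]],
    "VI":  [["sleep"], ["swim"], ["meow"]],
    "Rel": [["that"], ["who"]],
}

def sample(symbol="S"):
    if symbol not in GRAMMAR:                 # terminal symbol
        return [symbol]
    return [word
            for child in random.choice(GRAMMAR[symbol])
            for word in sample(child)]

print(" ".join(sample()))   # e.g. "the cat that can meow will sleep"
```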

S → NPM VP

NPM → Det MNom | Det MNom Prep Det Nom | Det MNom RC

NPO → Det Nom | Det Nom Prep Det Nom | Det Nom RC

VP → Aux VI | Aux VT NPO

RC → Rel Aux VI | Rel Det Nom Aux VT | Rel Aux VT Det Nom

Nom → N | JJ Nom

MNom → MNom1 | MNom2

MNom1 → N | JJ MNom1

MNom2 → N | JJ MNom2 | NS Poss MNom2 | Nadj+MN

Det → the | some | my | your | our | her

Poss → ’s

NS → bird | bee | ant | duck | lion | dog | tiger | worm | horse | cat | fish | bear | wolf

N → bird | bee | ant | duck | lion | dog | tiger | worm | horse | cat | fish | bear | wolf | birds | bees | ants | ducks | lions | dogs | tigers | worms | horses | cats | fish | bears | wolves

Nadj+MN → worker bee | worker ant | race horse | queen bee | german dog | house cat

VI → cry | smile | sleep | swim | wait | move | change | read | eat

VT → dress | kick | hit | hurt | clean | love | accept | remember | comfort

Aux → can | will | would | could

Prep → around | near | with | upon | by | behind | above | below

Rel → who | that

JJ → small | little | big | hot | cold | good | bad | new | old | young

Figure 7: Context-free grammar for the subject noun dataset.


S → NPsg Agr Auxsg VI | NPpl Agr Auxpl VI

NPsg Agr → Det Nsg | Det Nsg Prep Det N | Det Nsg Prep RCsg

NPpl Agr → Det Npl | Det Npl Prep Det N | Det Npl Prep RCpl

RCsg → Rel Auxsg VI | Rel Auxsg VT Det N | Rel Det Nsg Auxsg VT | Rel Det Npl Auxpl VT

N → Nsg | Npl

RCpl → Rel Auxpl VI | Rel Auxpl VT Det N | Rel Det Nsg Auxsg VT | Rel Det Npl Auxpl VT

Auxsg → does | Modal

Auxpl → do | Modal

Det → the | some | my | your | our | her

Nsg → bird | bee | ant | duck | lion | dog | tiger | worm | horse | cat | fish | bear | wolf

Npl → birds | bees | ants | ducks | lions | dogs | tigers | worms | horses | cats | fish | bears | wolves

VI → cry | smile | sleep | swim | wait | move | change | read | eat

VT → dress | kick | hit | hurt | clean | love | accept | remember | comfort

VS → think | say | hope | know

VD → tell | convince | persuade | inform

Modal → can | will | would | could

Prep → around | near | with | upon | by | behind | above | below

Rel → who | that

Figure 8: Context-free grammar for the subject-verb agreement dataset.

S → NPM Ant Aux VT ReflM | NPF Ant Aux VT ReflF | NPM Ant Aux VT Det NF by ReflM | NPF Ant Aux VT Det NM by ReflF | NPM Ant Aux VT Det NM by ReflM | NPF Ant Aux VT Det NF by ReflF

NPM Ant → Det NM | Det NM RC

NPF Ant → Det NF | Det NF RC

N → NM | NF

RC → Rel Aux VI | Rel Det N Aux VT | Rel Aux VT Det N

ReflM → himself

ReflF → herself

Det → the | some | my | your | our | her

NF → girl | woman | queen | actress | sister | wife | mother | princess | aunt | lady | witch | niece | nun

NM → boy | man | king | actor | brother | husband | father | prince | uncle | lord | wizard | nephew | monk

VI → cry | smile | sleep | swim | wait | move | change | read | eat

VT → dress | kick | hit | hurt | clean | love | accept | remember | comfort

VS → think | say | hope | know

VD → tell | convince | persuade | inform

Aux → can | will | would | could

Prep → around | near | with | upon | by | behind | above | below

Rel → who | that

Figure 9: Context-free grammar for the reflexive anaphora dataset.