Discovering Bugs in NLP Models Using Natural …But unlikely in the real world (except for attacks)...

Discovering Bugs in NLP Models Using Natural Perturbations

Sameer Singh

Stanford University May 2nd, 2019

circa 2005

[adapted from Zadeh 2005, From Search Engines to Question-Answering Systems — The Need for New Tools]

2019NLP has come a long way!

But we know models remain brittle…Jia and Liang, EMNLP 2017

Feng et al, EMNLP 2018

Anton van den Hengel, ACL 2018

How do we discover bugs in NLP?

A software bug is an error, flaw, failure or fault in a computer program or system that causes it to produce an incorrect or unexpected result, or to behave in unintended ways.

Original Data Original PredictionNLP Pipeline

NLP Pipeline Unexpected Prediction!Changed Data

Perturb it in a specific way

Outline

Changing individual instances

Semantically Equivalent Adversaries

Semantically Implied Adversaries

Universal Adversaries

Changing training data

Link Prediction Adversaries

Outline







ACL 2018

Adversarial Examples: Oversensitivity

Find closest example with different prediction

8

x f y

x' f y

Oversensitivity in images

Adversaries are indistinguishable to humans…

But unlikely in the real world (except for attacks)

“panda”

57.7% confidence

“gibbon”

99.3% confidence

9

What about text?

What type of road sign is shown?

> STOP.


Perceptible by humans, unlikely in real world

What type of road sign is sho wn?

10

What about text?


> STOP.


A single word changes too much!

11

Semantics matter


> Do not Enter.

> STOP.


Bug, and likely in the real world

12

Semantics matter

The biggest city on the river Rhine is Cologne, Germany with a population of more than 1,050,000 people.It is the second-longest river in Central and Western Europe (after the Danube), at about 1,230 km (760 mi)

How long is the Rhine?

> More than 1,050,000

> 1230km

How long is the Rhine?

Not all changes are the same: meaning should be same

13

How do we do this?Semantically-Equivalent Adversary

(SEA)Semantically-Equivalent Adversarial Rules

(SEARs)

color → colour

xBacktranslation

+ Filteringx’ (x, x’)

CommonPatterns

Rules

SEARs Examples: VisualQA

15

Visual7a-Telling [Zhu et al 2016]

SEARs Examples: SQuAD

16

BiDAF [Seo et al 2017]

SEARs Example: Sentiment

fastText [Joulin et al., 2016]

Semantic Adversaries

18

Semantics matterSEA SEARS

Models are prone to these bugs

SEAs and SEARs help find and fix them

Outline







ACL 2019

Consistency in predictions

How many birds? 1

So far, we have considered equivalence, i.e. (x, y) → (x’, y)

Yes

(x, y)

(x’, y’) Is there 1 bird?

Evaluating Implication Consistency

ValidationData

Modelf

(x, y)ImplicationGeneration

Implications

(x,y), (x’,y’)

Consistency

# y ꓥ y’ correct

# y correct

based on parses,POS, WordNet, etc.

Visual QA

(x, y): What room is this? bathroom

Logical Equivalence

(x’, y’): Is this a bathroom? Yes

Necessary Condition

(x’, y’): Is there a bathroom in the picture? Yes

Mutual Exclusion

(x’, y’): Is this a kitchen? No

57%

50%

35%

67%

97% are valid!

Visual QA Results

Model Acc LogEq Mutex Nec Avg Augmentation

SAAA (Kazemi, Elqursh, 2017) 61.5 76.6 42.3 90.2 72.7 94.4

Count (Zhang et al., 2018) 65.2 81.2 42.8 92.0 75.0 94.1

BAN (Kim et al., 2018) 64.5 73.1 50.4 87.3 72.5 95.0

Good at answer w/ numbers, but not questions w/ numberse.g. How many birds? 1 (12%) → Are there 2 birds? yes (<1%)

SQuAD

SubjWhen did Zhenjin die? 1285→ Who died in 1285? Zhenjin 29%

DobjWhen did Denmark join the EU? 1972→ What did Denmark join in 1972? the EU 10%

Amod When did the Chinese famine begin? 1331→ Which famine began in 1331? Chinese 30%

73%

97% are valid!

PrepWho received a bid in 1915? Edison→ When did Edison receive a bid? 1915 46%

[ Demszky et al. 2018 ]

SQuAD Results

Model F1 Subj Dobj Amod Prep Avg Augmentation

bidaf (Seo et al., 2017) 77.9 70.6 65.9 75.1 72.4 72.1 79.1

bidaf+e (Peters et al., 2018) 81.3 71.2 69.3 75.8 72.8 72.9 81.2

rnet (Wang et al., 2017) 79.5 68.5 67.0 74.7 70.7 70.9 79.8

Mnem (Hu et al., 2018) 81.5 70.3 68.0 75.8 71.9 72.2 81.5

Bad at questions with Wh-word as direct objecte.g. Who is Moses? (53%) vs Who did Hayk defeat? (12%)

Implication Adversaries

• We shouldn’t treat each prediction in isolation• Inconsistency leads to poor user experience

• Currently, rule-based system for generating them

• Already promising!• Reveals important bugs in the models

• Even simple data augmentation is promising

Outline







in preparation


• Instead of replacement, let’s consider additions

• Are there tokens that make the model misbehave?

b has nothing to do with the task, and is not specific to x

x f y

b+x f y

REDACTED for anonymity period

How do we do this?

x

b

x yf

x yf

x yf

yf

b REDACTED for anonymity period

Textual Entailment

Token added ESIM DA DA-Elmo

89.5 89.46 90.88

nobody* 0.0 0.1 0.5

never 0.5 1.0 0.2

sad 1.5 0.5 0.7

scared 1.1 0.7 1.0

championship 0.8 0.1 0.8

Entailment

Token added ESIM DA DA-Elmo

89.5 89.46 90.88

nobody* 0.5 8.5 13.6

sleeps 4.6 14.8 22.3

nothing* 1.7 23.6 14.6

none 5.9 17.5 15.4

sleeping* 6.1 15.8 28.8

Neutral

* Identified as artifacts by [Gururangan et. al. 2018]

Premise: Two dogs are running through a field.

Hypothesis: nobody There are animals outdoors.

Entailment

Contradiction


Language Modeling (GPTv2 small)

TH PEOPLEMan god dreams Blacks are the worst people in the world.

A few token prefix that makes the model produce racist language

TH PEOPLEMan god dreams Blacks I bought a mac and a soda for $10.00. I’m not a racist, but I’m not saying that blacks should be allowed to buy a soda.

TH PEOPLEMan god dreams Blacks yesterday I’m going to be a fucking black man. I don’t know what to say to that, but fuck you.

TH PEOPLEMan god dreams Blacks my mother says I’m a racist, and she’s right.

Searched so that we even user input after prefix causes problemsWARNING: Strong Language


Changing Instances

• “Adversarial attacks” for NLP• Semantically Equivalent • Semantic Implications• Universal Tokens

• Useful for identifying different kinds of problems• Not all of them are traditional “bugs”

• General set of approaches that apply for most NLP models

Outline






Link Prediction Adversaries NAACL 2019

Different Kind of Model: Link Prediction

Relation Prediction

Entity Prediction

Knowledge Base Completion

Table from Dettmers, et al. (2018)

ScoringFunction

Link Prediction Example

Why was this prediction made?

What is this sensitive to?

Depends on the graph structure!

Link Prediction: Removing a Link

Link Prediction: Adding a Link

How do we do it?

argmax(𝑠′,𝑟′)

𝜙 𝑠, 𝑟, 𝑜 − ത𝜙(𝑠, 𝑟, 𝑜)

Original score

Score after retraining

Too many links to search!

Learn a continuous space of links, and search using

gradient descent

Retraining is too expensive!

Taylor approximation,and utilize graph structure

Link to add/remove from the graph

Adding Links: How sensitive is the model?

0

5

10

15

20

25

30

35

40

45

DistMult ConvE

Yago3 Hits@1 (Adding a fake link)

Original After Change

70

75

80

85

90

95

DistMult ConvE

WordNet Hits@1 (Adding a fake link)

Original After Change

Removing Links: Cause behind prediction

isMarriedTo(a,c)∧ hasChild(c,b) ⇒ hasChild(a,b)Bug in DistMult and ConvE

playsFor(a,c) ∧ isLocatedIn(c,b) ⇒ wasBornIn(a,b)*

isAffiliatedTo(a,c)∧ isLocatedIn(c,b) ⇒ diedIn(a,b)*

* Identified as rules by [Yang et. al. 2015]

Only in DistMult

hasAdvisor(a,c)∧ graduatedFrom(c,b) ⇒ graduatedFrom(a,b)

influences(a,c)∧ influences(c,b) ⇒ influences(a,b)Only in ConvE

Summarize by rule mining on which edges are used

Changing the training data

• Sometimes, “bugs” are problems in the training data/pipeline• Embeddings of all kinds, for example

• To find these bugs, you need to change the training data• And efficiently estimate the effect of retraining

• We show how to do that for link prediction

Outline







Thanks!

[email protected]

@sameer_

Work with Matt Gardner and me

as part ofThe Allen Institute for Artificial Intelligence

in Irvine, CA

All levels: pre-PhD, PhD interns, postdocs, and research scientists!

Discovering Bugs in NLP Models Using Natural …But unlikely in the real world (except for attacks)...

Documents

Transcript of Discovering Bugs in NLP Models Using Natural …But unlikely in the real world (except for attacks)...