Transcript: Black-box Generation of Adversarial Text Sequences to Evade Deep Learning Classifiers (slides, 1st Deep Learning and Security Workshop, 2018)

Page 1

Black-box Generation of Adversarial Text Sequences to Evade Deep Learning Classifiers

Ji Gao1, Jack Lanchantin1, Mary Lou Soffa1, Yanjun Qi1

1 University of Virginia
http://trustworthymachinelearning.org/

@ 1st Deep Learning and Security Workshop ; 2018

Page 2

Outline

1 Motivation

White box vs. black box

2 Method

Word scorer
Word transformer

3 Experiment

4 Conclusions

Page 3

Example of black-box classification systems

Google Perspective API


Page 5

Target scenario

Page 6

An example of DeepWordBug

Goal: Flip the prediction of a sentiment analyzer


Page 9

Algorithm (Our Methods)

Page 10

Challenges of language tasks (Our Method)

Adversarial examples

Suppose a deep learning classifier F(·): X → Y and an original sample x. An adversarial example x′ in an untargeted attack satisfies:

x′ = x + ∆x,  ||∆x||_p < ε,  x′ ∈ X,  F(x) ≠ F(x′)

When X is symbolic:

How to perturb x?

No metric for measuring ∆x

Page 11

Our setting (Our Method)

∆x = Edit distance(x, x′)
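The edit distance used here is the standard Levenshtein distance. As a self-contained sketch (not code from the slides), it can be computed with the usual dynamic program:

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete from a
                           cur[j - 1] + 1,              # insert into a
                           prev[j - 1] + (ca != cb)))   # substitute
        prev = cur
    return prev[-1]
```

Each single-character transformation used later (swap excepted) changes a word by edit distance 1, so keeping the number of modified words small keeps ∆x small.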

Page 12

DeepWordBug (Our Methods)

1. Scoring - Find important words to change

2. Transformation - Generate some modification on words of top importance.

∆x = Edit distance(x, x′) = Σ_{i ∈ Selected words} Edit distance(x_i, x′_i)

Page 13

Step 1: Scoring function (Our Methods)

Goal: Select important words

The proposed scoring functions have the following properties:

1. Correctly reflect the importance of words
2. Black-box
3. Efficient to calculate

Page 14

Temporal Head Score


Page 17

Temporal Tail Score

Page 18

Combined score
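The three scoring functions above can be sketched as differences of prefix/suffix queries against the black-box model. This is an illustrative reconstruction, not the authors' code; `predict` stands in for any function that returns the classifier's probability for its current prediction:

```python
def temporal_scores(words, predict, lam=1.0):
    """Black-box word scores from prefix/suffix queries only.

    Temporal Head: how much word i changes the prediction when
    appended to the prefix words[:i].
    Temporal Tail: how much word i changes the prediction when
    prepended to the suffix words[i+1:].
    Combined: head + lam * tail.
    """
    n = len(words)
    head = [predict(words[:i + 1]) - predict(words[:i]) for i in range(n)]
    tail = [predict(words[i:]) - predict(words[i + 1:]) for i in range(n)]
    return [h + lam * t for h, t in zip(head, tail)]
```

Each score needs only forward queries, which is what makes the method black-box and efficient: 2n + 2 model calls score all n words.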

Page 19

Step 2: Ranking and transformation

Calculate the scoring function for all words in the input once.

Rank all the words according to the scores.
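The two steps above can be sketched as follows (illustrative names, not the authors' exact interface): score every word once, sort by score, and transform only the top-m words:

```python
def attack(words, scores, transform, m=2):
    """Apply `transform` to the m words with the highest scores;
    all other words (and the word order) are left untouched."""
    top = sorted(range(len(words)), key=lambda i: scores[i], reverse=True)[:m]
    return [transform(w) if i in top else w for i, w in enumerate(words)]
```

Because scores are computed once for the whole input, the attack cost does not grow with the number of candidate transformations.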

Page 20

Step 3: Word Transformer (Our Methods)

Original    Substitution   Swapping   Deletion   Insertion
Team        Texm           Taem       Tem        Tezam
Artist      Arxist         Artsit     Artst      Articst
Computer    Computnr       Comptuer   Compter    Comnputer

Aim I: Machine-learning based classifiers view the generated words as "unknown".

Aim II: Control the edit distance of the modification.

Page 21

Summary (Our Methods)

Page 22

Dataset

Dataset                 #Training   #Testing   #Classes   Task
AG's News               120,000     7,600      4          News Categorization
Amazon Review Full      3,000,000   650,000    5          Sentiment Analysis
Amazon Review Polarity  3,600,000   400,000    2          Sentiment Analysis
DBPedia                 560,000     70,000     14         Ontology Classification
Yahoo! Answers          1,400,000   60,000     10         Topic Classification
Yelp Review Full        650,000     50,000     5          Sentiment Analysis
Yelp Review Polarity    560,000     38,000     2          Sentiment Analysis
Enron Spam Email        26,972      6,744      2          Spam E-mail Detection

Page 23

Methods in comparison

Random (baseline): Random selection of words. Similar to (Papernot et al., 2016).

Gradient (baseline): White-box method. Judges the importance of a word by the magnitude of its gradient (Samanta & Mehta, 2017).

DeepWordBug (our method): Uses 3 different scoring functions: Temporal Head, Temporal Tail, and Combined.

Page 24

Main result: Effectiveness of adversarial samples (average)

[Bar chart: Relative Performance Decrease (%) by scoring method]

Random: 6.82%
Gradient: 16.36%
Replace-1: 63.02%
Temporal Head: 44.40%
Temporal Tail: 68.05%
Combined: 64.38%

(Replace-1, Temporal Head, Temporal Tail, and Combined are the DeepWordBug variants.)

Page 25

Question: Are the generated adversarial samples transferable to other models?

Adversarial samples generated on one model can be successfully transferred between models, reducing the model accuracy from around 90% to 20-50%.

Page 26

Question: How do different transformer functions work?

Varying the transformation function has only a small effect on the attack performance.

Page 27

Question: How strong are the generated adversarial samples?

The generated adversarial samples successfully make the machine learning model believe a wrong answer with 0.9 probability.

Page 28

Defense: by Adversarial training

[Line chart: accuracy over adversarial retraining rounds 0-10]

Accuracy (raw inputs):  88.5%, 85.9%, 87.3%, 87.6%, 87.5%, 87.4%, 87.4%, 87.6%, 87.5%, 86.8%, 87.0%
Adversarial accuracy:   11.9%, 30.2%, 45.0%, 52.4%, 57.1%, 58.8%, 58.8%, 59.9%, 60.5%, 61.6%, 62.7%

Retrain the model with adversarial samples.

Accuracy on raw inputs slightly decreases;

Accuracy on the adversarial samples rapidly increases from around 12% (before the training) to 62% (after the training).
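The retraining loop can be sketched as follows; `model.fit` and `attack` are stand-in interfaces, not the authors' code:

```python
def adversarial_training(model, xs, ys, attack, rounds=10):
    """Defense sketch: train normally, then repeatedly augment the
    training set with adversarial versions of the inputs and retrain.
    `attack(x, model)` returns an adversarial variant of sample x."""
    model.fit(xs, ys)
    for _ in range(rounds):
        adv = [attack(x, model) for x in xs]
        model.fit(xs + adv, ys + ys)  # adversarial samples keep true labels
    return model
```

Keeping the clean inputs in every round is what limits the drop in raw-input accuracy while adversarial accuracy climbs.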

Page 29

Defense: by an autocorrector?

Transformation   Original   Attack    Defended with Autocorrector
Swap             88.45%     14.77%    77.34%
Substitute       88.45%     12.28%    74.85%
Remove           88.45%     14.06%    62.43%
Insert           88.45%     12.28%    82.07%
Substitute-2     88.45%     11.90%    54.54%
Remove-2         88.45%     14.25%    33.67%

While the spellchecker reduces the effectiveness of the adversarial samples, stronger attacks such as removing 2 characters in every selected word can still successfully reduce the model accuracy to 34%.

Page 30

Related Works

Related works:

Papernot et al. 2016. Iteratively:
Pick words randomly.
Apply a gradient-based algorithm directly on the word embedding.
Project to the nearest word.

Samanta & Mehta 2017. Iteratively:
Pick important words using the gradient.
Generate linguistic-based modifications on the words.

Summary: White-box and costly.

Page 31

Conclusion

Black-box: DeepWordBug generates adversarial samples in a pure black-box manner.

Performance: Reduces the performance of state-of-the-art deep learning models by up to 80%.

Transferability: The adversarial samples generated on one model can be successfully transferred to other models, reducing the target model accuracy from around 90% to 20-50%.

Page 32

Reference

Goodfellow, Ian J., Jonathon Shlens, and Christian Szegedy. "Explaining and harnessing adversarial examples." arXiv preprint arXiv:1412.6572 (2014).

Papernot, Nicolas, et al. "Crafting adversarial input sequences for recurrent neural networks." Military Communications Conference, MILCOM 2016. IEEE, 2016.

Samanta, Suranjana, and Sameep Mehta. "Towards Crafting Text Adversarial Samples." arXiv preprint arXiv:1707.02812 (2017).

Zhang, Xiang, Junbo Zhao, and Yann LeCun. "Character-level convolutional networks for text classification." Advances in Neural Information Processing Systems. 2015.

Rayner, Keith, Sarah J. White, and S. P. Liversedge. "Raeding wrods with jubmled lettres: There is a cost." (2006).

Page 33

Why Word Transformer is Effective?

The transformations do not guarantee the original word will be changed to "unknown", but the chance of failure is very slight.

Suppose the longest word in the dictionary has length l; there are roughly 27^l possible letter sequences of length ≤ l.

Let l = 8 and |D| = 20000. The chance that a changed word is not "unknown" is roughly 20000 / 27^8 ≈ 0.00000007.
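The arithmetic can be checked directly (the constant 27 is the slide's approximation of the alphabet size):

```python
# Probability that a transformed word still collides with some
# dictionary word, using the slide's numbers: |D| = 20000 words,
# max word length l = 8, and ~27^l candidate letter sequences.
D, l = 20000, 8
p_not_unknown = D / 27 ** l
print(f"{p_not_unknown:.8f}")  # prints 0.00000007
```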

Page 34

Why current scoring functions?

For a single step, the Replace-1 score gives the best approximation.

However, globally it is not optimal.

Example:

Here, Temporal Tail gives a better result than Replace-1.