Transcript: Black-box Generation of Adversarial Text Sequences to Evade Deep Learning Classifiers (slides, 1st Deep Learning and Security Workshop, 2018)

Page 1

Black-box Generation of Adversarial Text Sequences to Evade Deep Learning Classifiers

Ji Gao1, Jack Lanchantin1, Mary Lou Soffa1, Yanjun Qi1

1 University of Virginia
http://trustworthymachinelearning.org/

@ 1st Deep Learning and Security Workshop ; 2018

Page 2

Outline

1 Motivation

White box vs. black box

2 Method

Word scorer
Word transformer

3 Experiment

4 Conclusions

Page 3

Example of black-box classification systems

Google Perspective API


Page 5

Target scenario

Page 6

An example of DeepWordBug

Goal: Flip the prediction of a sentiment analyzer


Page 9

Algorithm (Our Methods)

Page 10

Challenges of language tasks (Our Method)

Adversarial examples

Suppose a deep learning classifier F(·): X → Y and an original sample x. An adversarial example x′ in an untargeted attack satisfies:

x′ = x + ∆x,  ||∆x||_p < ε,  x′ ∈ X,  F(x) ≠ F(x′)

When X is symbolic:

How to perturb x?

No metric for measuring ∆x

Page 11

Our setting (Our Method)

∆x = Edit distance(x, x′)
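The edit distance used here is the standard Levenshtein distance. As a self-contained sketch (not code from the slides), it can be computed with the usual dynamic program:

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete from a
                           cur[j - 1] + 1,              # insert into a
                           prev[j - 1] + (ca != cb)))   # substitute
        prev = cur
    return prev[-1]
```

Each single-character transformation used later (swap excepted) changes a word by edit distance 1, so keeping the number of modified words small keeps ∆x small.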

Page 12

DeepWordBug (Our Methods)

1. Scoring - Find important words to change

2. Transformation - Generate some modification on words of top importance.

∆x = Edit distance(x, x′) = Σ_{i ∈ Selected words} Edit distance(x_i, x′_i)

Page 13

Step 1: Scoring function (Our Methods)

Goal: Select important words

The proposed scoring functions have the following properties:

1. Correctly reflect the importance of words
2. Black-box
3. Efficient to calculate

Page 14

Temporal Head Score


Page 17

Temporal Tail Score

Page 18

Combined score
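The three scoring functions above can be sketched as differences of prefix/suffix queries against the black-box model. This is an illustrative reconstruction, not the authors' code; `predict` stands in for any function that returns the classifier's probability for its current prediction:

```python
def temporal_scores(words, predict, lam=1.0):
    """Black-box word scores from prefix/suffix queries only.

    Temporal Head: how much word i changes the prediction when
    appended to the prefix words[:i].
    Temporal Tail: how much word i changes the prediction when
    prepended to the suffix words[i+1:].
    Combined: head + lam * tail.
    """
    n = len(words)
    head = [predict(words[:i + 1]) - predict(words[:i]) for i in range(n)]
    tail = [predict(words[i:]) - predict(words[i + 1:]) for i in range(n)]
    return [h + lam * t for h, t in zip(head, tail)]
```

Each score needs only forward queries, which is what makes the method black-box and efficient: 2n + 2 model calls score all n words.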

Page 19

Step 2: Ranking and transformation

Calculate the scoring function for all words in the input once.

Rank all the words according to the scores.
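The two steps above can be sketched as follows (illustrative names, not the authors' exact interface): score every word once, sort by score, and transform only the top-m words:

```python
def attack(words, scores, transform, m=2):
    """Apply `transform` to the m words with the highest scores;
    all other words (and the word order) are left untouched."""
    top = sorted(range(len(words)), key=lambda i: scores[i], reverse=True)[:m]
    return [transform(w) if i in top else w for i, w in enumerate(words)]
```

Because scores are computed once for the whole input, the attack cost does not grow with the number of candidate transformations.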

Page 20

Step 3: Word Transformer (Our Methods)

Original    Substitution   Swapping   Deletion   Insertion
Team        Texm           Taem       Tem        Tezam
Artist      Arxist         Artsit     Artst      Articst
Computer    Computnr       Comptuer   Compter    Comnputer

Aim I: Machine-learning based classifiers view the generated words as "unknown".

Aim II: Control the edit distance of the modification.

Page 21

Summary (Our Methods)

Page 22

Dataset

Dataset                 #Training   #Testing   #Classes   Task
AG's News               120,000     7,600      4          News Categorization
Amazon Review Full      3,000,000   650,000    5          Sentiment Analysis
Amazon Review Polarity  3,600,000   400,000    2          Sentiment Analysis
DBPedia                 560,000     70,000     14         Ontology Classification
Yahoo! Answers          1,400,000   60,000     10         Topic Classification
Yelp Review Full        650,000     50,000     5          Sentiment Analysis
Yelp Review Polarity    560,000     38,000     2          Sentiment Analysis
Enron Spam Email        26,972      6,744      2          Spam E-mail Detection

Page 23

Methods in comparison

Random (baseline): Random selection of words. Similar to (Papernot et al., 2016).

Gradient (baseline): White-box method. Judges the importance of a word by the magnitude of its gradient (Samanta & Mehta, 2017).

DeepWordBug (our method): Uses 3 different scoring functions: Temporal Head, Temporal Tail, and Combined.

Page 24

Main result: Effectiveness of adversarial samples (average)

[Bar chart: Relative Performance Decrease (%) by scoring method]

Random: 6.82%
Gradient: 16.36%
Replace-1: 63.02%
Temporal Head: 44.40%
Temporal Tail: 68.05%
Combined: 64.38%

(Replace-1, Temporal Head, Temporal Tail, and Combined are the DeepWordBug variants.)

Page 25

Question: Are the generated adversarial samples transferable to other models?

Adversarial samples generated on one model can be successfully transferred between models, reducing the model accuracy from around 90% to 20-50%.

Page 26

Question: How do different transformer functions work?

Varying the transformation function has only a small effect on the attack performance.

Page 27

Question: How strong are the generated adversarial samples?

The generated adversarial samples successfully make the machine learning model believe a wrong answer with 0.9 probability.

Page 28

Defense: by Adversarial training

[Line chart: accuracy over adversarial retraining rounds 0-10]

Accuracy (raw inputs):  88.5%, 85.9%, 87.3%, 87.6%, 87.5%, 87.4%, 87.4%, 87.6%, 87.5%, 86.8%, 87.0%
Adversarial accuracy:   11.9%, 30.2%, 45.0%, 52.4%, 57.1%, 58.8%, 58.8%, 59.9%, 60.5%, 61.6%, 62.7%

Retrain the model with adversarial samples.

Accuracy on raw inputs slightly decreases;

Accuracy on the adversarial samples rapidly increases from around 12% (before the training) to 62% (after the training).
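The retraining loop can be sketched as follows; `model.fit` and `attack` are stand-in interfaces, not the authors' code:

```python
def adversarial_training(model, xs, ys, attack, rounds=10):
    """Defense sketch: train normally, then repeatedly augment the
    training set with adversarial versions of the inputs and retrain.
    `attack(x, model)` returns an adversarial variant of sample x."""
    model.fit(xs, ys)
    for _ in range(rounds):
        adv = [attack(x, model) for x in xs]
        model.fit(xs + adv, ys + ys)  # adversarial samples keep true labels
    return model
```

Keeping the clean inputs in every round is what limits the drop in raw-input accuracy while adversarial accuracy climbs.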

Page 29

Defense: by an autocorrector?

Transformation   Original   Attack    Defended with Autocorrector
Swap             88.45%     14.77%    77.34%
Substitute       88.45%     12.28%    74.85%
Remove           88.45%     14.06%    62.43%
Insert           88.45%     12.28%    82.07%
Substitute-2     88.45%     11.90%    54.54%
Remove-2         88.45%     14.25%    33.67%

While the spellchecker reduces the effectiveness of the adversarial samples, stronger attacks such as removing 2 characters in every selected word can still successfully reduce the model accuracy to 34%.

Page 30

Related Works

Related works:

Papernot et al. 2016. Iteratively:
Pick words randomly.
Apply a gradient-based algorithm directly on the word embedding.
Project to the nearest word.

Samanta & Mehta 2017. Iteratively:
Pick important words using the gradient.
Generate linguistic-based modifications on the words.

Summary: White-box and costly.

Page 31

Conclusion

Black-box: DeepWordBug generates adversarial samples in a pure black-box manner.

Performance: Reduces the performance of state-of-the-art deep learning models by up to 80%.

Transferability: The adversarial samples generated on one model can be successfully transferred to other models, reducing the target model accuracy from around 90% to 20-50%.

Page 32

Reference

Goodfellow, Ian J., Jonathon Shlens, and Christian Szegedy. "Explaining and harnessing adversarial examples." arXiv preprint arXiv:1412.6572 (2014).

Papernot, Nicolas, et al. "Crafting adversarial input sequences for recurrent neural networks." Military Communications Conference, MILCOM 2016. IEEE, 2016.

Samanta, Suranjana, and Sameep Mehta. "Towards Crafting Text Adversarial Samples." arXiv preprint arXiv:1707.02812 (2017).

Zhang, Xiang, Junbo Zhao, and Yann LeCun. "Character-level convolutional networks for text classification." Advances in Neural Information Processing Systems. 2015.

Rayner, Keith, Sarah J. White, and S. P. Liversedge. "Raeding wrods with jubmled lettres: There is a cost." (2006).

Page 33

Why Word Transformer is Effective?

The transformations do not guarantee the original word will be changed to "unknown", but the chance of failure is very slight.

Suppose the longest word in the dictionary has length l; there are roughly 27^l possible letter sequences of length ≤ l.

Let l = 8 and |D| = 20000. The chance that a changed word is not "unknown" is roughly 20000 / 27^8 ≈ 0.00000007.
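The arithmetic can be checked directly (the constant 27 is the slide's approximation of the alphabet size):

```python
# Probability that a transformed word still collides with some
# dictionary word, using the slide's numbers: |D| = 20000 words,
# max word length l = 8, and ~27^l candidate letter sequences.
D, l = 20000, 8
p_not_unknown = D / 27 ** l
print(f"{p_not_unknown:.8f}")  # prints 0.00000007
```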

Page 34

Why current scoring functions?

For a single step, the Replace-1 score gives the best approximation.

However, globally it is not optimal.

Example:

Here, Temporal Tail gives a better result than Replace-1.