Detecting adversarial example attacks to deep neural networks


Transcript of Detecting adversarial example attacks to deep neural networks

Page 1: Detecting adversarial example attacks to deep neural networks

Detecting adversarial example attacks to deep neural networks

CBMI, 19-21 June 2017, Florence, Italy

Fabio Carrara, Fabrizio Falchi, Roberto Caldelli, Giuseppe Amato, Roberta Fumarola, Rudy Becarelli

Page 2: Detecting adversarial example attacks to deep neural networks

Outline

● Introduction
  ○ Adversarial examples for deep neural network image classifiers
  ○ Risk of attacks to vision systems

● Our defense strategy
  ○ Detecting adversarial examples
  ○ CBIR approach

● Evaluation


Page 3: Detecting adversarial example attacks to deep neural networks

Deep Neural Networks as Image Classifiers

● DNNs are used as image classifiers for several vision tasks
  ○ image annotation, face recognition, etc.

● More and more in sensitive applications (safety- or security-related)
  ○ content filtering (spam, porn, violence, terrorist propaganda images, etc.)
  ○ malware detection
  ○ self-driving cars

[Figure: Image Classifier (Deep Neural Network): CONV1 → RELU1 → POOL1 → CONV2 → RELU2 → POOL2 → CONV3 → RELU3 → CONV4 → RELU4 → CONV5 → RELU5 → POOL5 → FC6 → RELU6 → FC7 → RELU7 → FC8; given an input image, it answers “It’s a stop sign. I’m pretty sure.”]

Page 4: Detecting adversarial example attacks to deep neural networks

Adversarial images

● DNN image classifiers are vulnerable to adversarial images
  ○ malicious images crafted by adding a small but intentional (not random!) perturbation
  ○ adversarial images fool DNNs into predicting a wrong class with high confidence
  ○ imperceptible to the human eye, like an optical illusion for the DNN
  ○ efficient algorithms exist to find them (a minimal sketch follows the figure below)

[Figure: Original Image + Adversarial Perturbation (5x amplified for visualization) = Adversarial Image; the adversary submits it to the Image Classifier (DNN), which answers “It’s a roundabout sign! No doubt.”]
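As a rough illustration of how such a perturbation can be computed, here is a minimal sketch of the fast gradient sign (FGS) method mentioned later in the evaluation. It assumes a pretrained PyTorch classifier `model`, a batched input tensor `image` with its true `label`, and a perturbation budget `epsilon`; these names and the use of PyTorch are our assumptions, not part of the slides.

```python
import torch
import torch.nn.functional as F

def fgs_attack(model, image, label, epsilon=0.007):
    """Craft an adversarial image with the fast gradient sign (FGS) method:
    x_adv = x + epsilon * sign( grad_x loss(model(x), y) )
    """
    model.eval()
    x = image.clone().detach().requires_grad_(True)

    loss = F.cross_entropy(model(x), label)  # classification loss for the true label
    loss.backward()                          # gradient of the loss w.r.t. the pixels

    # One signed gradient step: a small but intentional (not random) perturbation
    x_adv = x + epsilon * x.grad.sign()
    return x_adv.clamp(0, 1).detach()        # keep pixel values in a valid range
```

A single step like this is often enough to change the predicted class while leaving the image visually unchanged.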

Page 5: Detecting adversarial example attacks to deep neural networks

Risk of attacks to DNNs


● Attacks are possible:
  ○ if you have the model [1,2]

[1] Szegedy, Christian, et al. "Intriguing properties of neural networks." arXiv preprint arXiv:1312.6199 (2013).

[2] Goodfellow, Ian J., Jonathon Shlens, and Christian Szegedy. "Explaining and harnessing adversarial examples." arXiv preprint arXiv:1412.6572 (2014).

● Bypass filters
  ○ e.g., NSFW images

https://github.com/yahoo/open_nsfw

Page 6: Detecting adversarial example attacks to deep neural networks

Risk of attacks to DNNs


● Attacks are possible:
  ○ if you have the model [1,2]
  ○ if you have access to input and output only! [3]

[1] Szegedy, Christian, et al. "Intriguing properties of neural networks." arXiv preprint arXiv:1312.6199 (2013).

[2] Goodfellow, Ian J., Jonathon Shlens, and Christian Szegedy. "Explaining and harnessing adversarial examples." arXiv preprint arXiv:1412.6572 (2014).

[3] Papernot, Nicolas, et al. "Practical black-box attacks against deep learning systems using adversarial examples." arXiv preprint arXiv:1602.02697 (2016).

[Figure: error rates achieved by the attack: 84.24%, 88.94%, 96.19%]

Page 7: Detecting adversarial example attacks to deep neural networks

Risk of attacks to DNNs


● Attacks are possible:
  ○ if you have the model [1,2]
  ○ if you have access to input and output only! [3]
  ○ in the physical world (printouts of adversarial images) [4]

● Safety-related issues
  ○ e.g., self-driving car crashes

[1] Szegedy, Christian, et al. "Intriguing properties of neural networks." arXiv preprint arXiv:1312.6199 (2013).

[2] Goodfellow, Ian J., Jonathon Shlens, and Christian Szegedy. "Explaining and harnessing adversarial examples." arXiv preprint arXiv:1412.6572 (2014).

[3] Papernot, Nicolas, et al. "Practical black-box attacks against deep learning systems using adversarial examples." arXiv preprint arXiv:1602.02697 (2016).

[4] Kurakin, Alexey, Ian Goodfellow, and Samy Bengio. "Adversarial examples in the physical world." arXiv preprint arXiv:1607.02533 (2016).

Page 8: Detecting adversarial example attacks to deep neural networks

How to defend from adversarial attacks?

● Make classifiers more robust
  ○ e.g., include adversarial images in the training phase (see the sketch below)
  ○ “Every law has a loophole” → every model has adversarial images it is vulnerable to

● Detect adversarial inputs
  ○ understand when the network is talking nonsense
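For the first option, here is a minimal sketch of what including adversarial images in the training phase can look like, assuming a PyTorch `model`, `optimizer`, and a batch of `images` with `labels`; the slides only mention the idea, so the code below is our illustration, not the authors' procedure.

```python
import torch
import torch.nn.functional as F

def adversarial_training_step(model, optimizer, images, labels, epsilon=0.007):
    """One training step that mixes clean and FGS-perturbed images (sketch)."""
    # Craft adversarial versions of the current batch (fast gradient sign)
    x = images.clone().detach().requires_grad_(True)
    F.cross_entropy(model(x), labels).backward()
    adv_images = (x + epsilon * x.grad.sign()).clamp(0, 1).detach()

    # Train on the clean and adversarial examples with the same labels
    optimizer.zero_grad()
    batch = torch.cat([images, adv_images])
    targets = torch.cat([labels, labels])
    loss = F.cross_entropy(model(batch), targets)
    loss.backward()
    optimizer.step()
    return loss.item()
```

As the slide notes, this only hardens the model against the adversarial images seen in training; new adversarial images can still be found for the retrained model.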


Page 9: Detecting adversarial example attacks to deep neural networks

Our Adversarial Detection Approach (I)

[Figure: an input image is given to the system]

Page 10: Detecting adversarial example attacks to deep neural networks

Our Adversarial Detection Approach (I)

[Figure: the input image goes through the Image Classifier (DNN); DNN says: “Stop Sign”]

Page 11: Detecting adversarial example attacks to deep neural networks

Our Adversarial Detection Approach (I)

[Figure: alongside the DNN prediction (“Stop Sign”), the input image is used to search for similar images in a set of labelled images (the train set), and the most similar images are retrieved]

Page 12: Detecting adversarial example attacks to deep neural networks

Our Adversarial Detection Approach (I)

[Figure: the retrieved labelled images are consistent with the DNN prediction (“Stop Sign”), so the classification is accepted (✓)]

Page 13: Detecting adversarial example attacks to deep neural networks

Our Adversarial Detection Approach (II)

[Figure: the same pipeline: the DNN prediction (“Stop Sign”) is checked against the labels of the most similar images retrieved from the labelled train set]

Page 14: Detecting adversarial example attacks to deep neural networks

Deep Features as Similarity Measure

● Reuse the intermediate output of the network (deep features)
  ○ intermediate representation of the visual aspects of the image
  ○ we can use the Euclidean distance between deep features to evaluate visual similarity (see the sketch after the figure below)

[Figure: the same DNN (CONV1 through FC8); the activations of an intermediate layer are read out as deep feature vectors, e.g. (0.2, 1.5, 5.4, …, 1.0, 0.0, 8.3), (1.6, 4.3, 0.1, …, 0.2, 7.0, 4.9), (0.3, 1.9, 5.1, …, 0.0, 0.1, 6.3)]
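A minimal sketch of this idea, assuming PyTorch/torchvision and using AlexNet as a stand-in for the OverFeat network of the talk; the hooked layer (`model.features`, i.e. the output of the convolutional stack) plays the role of pool5, and all names are illustrative.

```python
import torch
import torchvision.models as models

# Illustrative only: AlexNet stands in for the OverFeat network used in the talk
model = models.alexnet(pretrained=True).eval()

def deep_feature(image_batch):
    """Return the activations of an intermediate layer as flat deep feature vectors."""
    feats = {}
    def hook(_module, _input, output):
        feats["x"] = output.flatten(start_dim=1)          # one vector per image
    handle = model.features.register_forward_hook(hook)   # conv stack output (pool5-like)
    with torch.no_grad():
        model(image_batch)
    handle.remove()
    return feats["x"]

def visual_distance(img_a, img_b):
    """Euclidean distance between deep features as a proxy for visual dissimilarity."""
    return torch.dist(deep_feature(img_a), deep_feature(img_b), p=2).item()
```

If `visual_distance(a, b) < visual_distance(a, c)`, image `a` is considered visually more similar to `b` than to `c` according to the deep features.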

Page 15: Detecting adversarial example attacks to deep neural networks

kNN scoring

● Use a k-Nearest Neighbors (kNN) score to evaluate how trustworthy the classification is
● The score is assigned by looking at the classes of the k nearest neighbors
● The distance of each neighbor is taken into account
● A threshold on the score is used to accept or reject the classification (see the sketch below)


δᵢ = 1 if the label of the i-th neighbor equals the predicted label; δᵢ = 0 otherwise
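A minimal sketch of such a kNN scoring with a threshold, assuming the deep features are NumPy arrays; the inverse-distance weighting is our assumption of one way to make the neighbor distance matter, not necessarily the exact weighting used in the paper.

```python
import numpy as np

def knn_score(query_feat, train_feats, train_labels, predicted_label, k=10):
    """Agreement between the DNN prediction and the k nearest labelled neighbors.

    query_feat   : (D,)   deep feature of the input image
    train_feats  : (N, D) deep features of the labelled train set
    train_labels : (N,)   labels of the train set
    """
    dists = np.linalg.norm(train_feats - query_feat, axis=1)  # Euclidean distances
    nn_idx = np.argsort(dists)[:k]                            # indices of the k nearest neighbors

    weights = 1.0 / (dists[nn_idx] + 1e-8)             # closer neighbors count more (our assumption)
    delta = (train_labels[nn_idx] == predicted_label)  # 1 if the labels match, 0 otherwise
    return float(np.sum(weights * delta) / np.sum(weights))   # normalized score in [0, 1]

def accept(score, threshold=0.5):
    """Accept the classification only if the kNN score clears the threshold."""
    return score >= threshold
```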

Page 16: Detecting adversarial example attacks to deep neural networks

Evaluation (I)

● OverFeat classifier on ImageNet, pool5 layer (1024-D) used as the deep feature
● The set of images to be classified is generated from the ImageNet validation set:

(a) 1000 authentic images correctly classified by the DCNN (1 per class, randomly chosen)

(b) ~2000 adversarial images generated from (a) with L-BFGS and FGS generation algorithms

(c) 1000 authentic images incorrectly classified by the DCNN (errors, 1 per class, randomly chosen)

● The model is evaluated as a binary classifier (authentic / spurious)

[Figure: evaluation pipeline: the OverFeat DCNN pretrained on ImageNet (CONV1 through FC8) outputs the predicted class and a deep feature; kNN scoring against the ImageNet train set, followed by a threshold, accepts or rejects the classification]

Page 17: Detecting adversarial example attacks to deep neural networks

Evaluation (II)


[Plot, as a function of the threshold: % of authentic images correctly retained; % of FGS adversarials correctly discarded; % of L-BFGS adversarials correctly discarded; % of wrong classifications of authentic images correctly discarded]
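To make these curves concrete, here is a sketch of how such percentages can be computed from the kNN scores of authentic and adversarial images as the threshold is swept; the function and variable names are ours, not the paper's.

```python
import numpy as np

def detection_curves(scores_authentic, scores_adversarial, thresholds):
    """For each threshold t: fraction of authentic images retained (score >= t)
    and fraction of adversarial images discarded (score < t)."""
    retained = np.array([np.mean(scores_authentic >= t) for t in thresholds])
    discarded = np.array([np.mean(scores_adversarial < t) for t in thresholds])
    return retained, discarded

# Example with dummy scores in [0, 1] (adversarials tend to receive lower scores)
thresholds = np.linspace(0.0, 1.0, 101)
retained, discarded = detection_curves(np.random.rand(1000),
                                        0.6 * np.random.rand(2000),
                                        thresholds)
```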

Page 18: Detecting adversarial example attacks to deep neural networks

Evaluation (II)

● With a low threshold you can, for example:

[1] Filter out 50% of adversarial images (and 10% of errors) while retaining almost all the authentic images

[2] Filter out 80% of adversarial images (and 30% of errors) while retaining 90% of the authentic images

● The aggressiveness of the filter can be adjusted

[Plot: the same curves as before, with the operating points [1] and [2] marked along the threshold axis]

Page 19: Detecting adversarial example attacks to deep neural networks

Evaluation (III) - Good Detections

● Examples of content that might be filtered

● Our approach successfully identifies adversarial images, assigning them low scores

http://deepfeatures.org/adversarials/

Page 20: Detecting adversarial example attacks to deep neural networks

Evaluation (IV) - Bad Detections

● The most difficult adversarial images to detect (the ones with the highest kNN scores)

● Note the visual similarity and common aspects among the classes

http://deepfeatures.org/adversarials/

Page 21: Detecting adversarial example attacks to deep neural networks

Conclusions

● We presented an approach to cope with adversarial images
  ○ with a satisfactory level of accuracy
  ○ without changing the model (no retraining)
  ○ without using additional data

● Future work
  ○ test more network architectures
  ○ test more generation algorithms for adversarial images
  ○ compare with other defense methodologies


Page 22: Detecting adversarial example attacks to deep neural networks

Thanks for your attention!

Questions?

http://deepfeatures.org/adversarials/

Fabio Carrara <[email protected]>
