Detecting adversarial example attacks to deep neural networks
CBMI, 19-21 June 2017, Florence, Italy
Fabio Carrara, Fabrizio Falchi, Roberto Caldelli, Giuseppe Amato, Roberta Fumarola, Rudy Becarelli
Outline
● Introduction
○ Adversarial examples for deep neural network image classifiers
○ Risk of attacks on vision systems
● Our defense strategy
○ Detecting adversarial examples
○ A CBIR (content-based image retrieval) approach
● Evaluation
Deep Neural Networks as Image Classifiers
● DNNs are used as image classifiers in several vision tasks
○ image annotation, face recognition, etc.
● Increasingly deployed in sensitive applications (safety- or security-related)
○ content filtering (spam, porn, violence, terrorist propaganda images, etc.)
○ malware detection
○ self-driving cars
[Figure: an image classifier (deep neural network): CONV1 → RELU1 → POOL1 → CONV2 → RELU2 → POOL2 → CONV3 → RELU3 → CONV4 → RELU4 → CONV5 → RELU5 → POOL5 → FC6 → RELU6 → FC7 → RELU7 → FC8. Given an input image, it answers: “It’s a stop sign. I’m pretty sure.”]
Adversarial images
● DNN image classifiers are vulnerable to adversarial images
○ malicious images crafted by adding a small but intentional (not random!) perturbation
○ adversarial images fool DNNs into predicting a wrong class with high confidence
○ the perturbation is imperceptible to the human eye, like an optical illusion for the DNN
○ efficient algorithms exist to find them (a sketch follows the figure below)
[Figure: an adversary crafts an adversarial image: original image + adversarial perturbation (5x amplified for visualization) = adversarial image. The image classifier (DNN) now says: “It’s a roundabout sign! No doubt.”]
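As an illustration of one such algorithm, here is a minimal PyTorch sketch of the Fast Gradient Sign (FGS) method of [2]; the eps value is illustrative, and model can be any differentiable classifier:

    import torch
    import torch.nn.functional as F

    def fgsm(model, x, y_true, eps=0.007):
        # One gradient step in the direction that maximally increases the loss.
        x = x.clone().detach().requires_grad_(True)
        loss = F.cross_entropy(model(x), y_true)
        loss.backward()
        # Shift every pixel by +/- eps along the sign of the loss gradient:
        # small enough to be imperceptible, yet it often flips the prediction.
        x_adv = x + eps * x.grad.sign()
        return x_adv.clamp(0, 1).detach()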
Risk of attacks on DNNs
● Attacks are possible:
○ if you have the model [1,2]
○ if you have access to the input and output only! [3]
   (error rates of black-box attacks against remotely hosted classifiers: 84.24%, 88.94%, 96.19% [3])
○ in the physical world (printed adversarial images) [4]
● Bypass filters
○ e.g. NSFW image filters: https://github.com/yahoo/open_nsfw
● Safety-related issues
○ e.g. a self-driving car crash

[1] Szegedy, Christian, et al. "Intriguing properties of neural networks." arXiv preprint arXiv:1312.6199 (2013).
[2] Goodfellow, Ian J., Jonathon Shlens, and Christian Szegedy. "Explaining and harnessing adversarial examples." arXiv preprint arXiv:1412.6572 (2014).
[3] Papernot, Nicolas, et al. "Practical black-box attacks against deep learning systems using adversarial examples." arXiv preprint arXiv:1602.02697 (2016).
[4] Kurakin, Alexey, Ian Goodfellow, and Samy Bengio. "Adversarial examples in the physical world." arXiv preprint arXiv:1607.02533 (2016).
How to defend from adversarial attacks?
● Make more robust classifiers
○ e.g. include adversarial images in the training phase (adversarial training)
○ “Every law has a loophole” → every model has adversarial images it is vulnerable to
● Detect adversarial inputs
○ i.e. understand when the network is talking nonsense
Our Adversarial Detection Approach
[Figure: the input image is classified by the image classifier (DNN), e.g. “DNN says: Stop Sign”; in parallel, the most similar images are retrieved from a set of labelled images (the train set). If the labels of the retrieved images agree with the DNN prediction, the classification is accepted (✓); if they disagree, the input is flagged as adversarial and rejected (✗).]
Deep Features as Similarity Measure
● Reuse the intermediate outputs of the network (deep features)
○ an intermediate representation of the visual aspects of the image
○ the Euclidean distance between deep features can be used to evaluate visual similarity (a sketch follows the figure below)
[Figure: deep features, e.g. (0.2, 1.5, 5.4, …, 1.0, 0.0, 8.3), are the intermediate outputs of layers such as POOL5, FC6, and FC7 of the CONV1 → … → FC8 network.]
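A minimal sketch of deep-feature extraction and comparison; torchvision's AlexNet stands in here, as an assumption, for the OverFeat network used in the talk:

    import torch
    import torchvision.models as models

    # AlexNet as a stand-in for OverFeat (an assumption for illustration).
    net = models.alexnet(pretrained=True).eval()

    def deep_feature(x):
        # x: a preprocessed image batch of shape (N, 3, 224, 224).
        with torch.no_grad():
            f = net.features(x)          # activations of the last pooling layer
        return torch.flatten(f, 1)       # one flat deep-feature vector per image

    def visual_distance(f1, f2):
        # Euclidean distance between deep features: smaller = more similar.
        return torch.norm(f1 - f2, dim=1)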
kNN scoring
● Use a k-Nearest Neighbors (kNN) score to evaluate the goodness of the classification
● The score is assigned by looking at the classes of the k retrieved neighbors
● Neighbor distance is important: closer neighbors count more
● A threshold on the score accepts or rejects the classification (a sketch follows the formula below)
score(x) = ( Σᵢ wᵢ δᵢ ) / ( Σᵢ wᵢ ),  i = 1 … k
where δᵢ = 1 if the i-th neighbor's label equals the predicted label, δᵢ = 0 otherwise,
and wᵢ is a weight that decreases with the distance of the i-th neighbor.
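A plausible sketch of the scoring; train_features and train_labels are assumed to be precomputed numpy arrays of deep features and labels of the train set, and the inverse-distance weighting is an assumption (the slides only state that neighbor distance matters):

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    k = 10  # illustrative value
    knn = NearestNeighbors(n_neighbors=k).fit(train_features)

    def knn_score(feature, predicted_label):
        dists, idxs = knn.kneighbors(feature.reshape(1, -1))
        weights = 1.0 / (dists[0] + 1e-8)                  # closer neighbors count more
        agree = train_labels[idxs[0]] == predicted_label   # the delta_i indicator
        return np.sum(weights * agree) / np.sum(weights)   # score in [0, 1]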
Evaluation (I)
● OverFeat classifier on ImageNet, with the pool5 layer (1024-D) as deep feature
● The set of images to be classified is generated from the ImageNet validation set:
(a) 1000 authentic images correctly classified by the DCNN (1 per class, randomly chosen)
(b) ~2000 adversarial images generated from (a) with the L-BFGS and FGS generation algorithms
(c) 1000 authentic images incorrectly classified by the DCNN (errors, 1 per class, randomly chosen)
● The model is evaluated as a binary classifier (authentic / spurious)
[Figure: evaluation pipeline. The OverFeat DCNN (CONV1 → … → FC8), pretrained on ImageNet, outputs the predicted class; its pool5 deep feature is scored with kNN against the ImageNet train set, and a threshold on the score decides Accept or Reject.]
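Putting the pieces together, a hedged end-to-end sketch of the accept/reject decision, reusing the deep_feature and knn_score sketches above (net is the classifier, and the threshold value is illustrative):

    def classify_with_detection(x, threshold=0.5):
        with torch.no_grad():
            predicted = net(x).argmax(dim=1).item()   # the DNN's predicted class
        feature = deep_feature(x).numpy().ravel()     # e.g. pool5 activations
        score = knn_score(feature, predicted)         # agreement of the k neighbors
        return predicted if score >= threshold else None   # None = reject as adversarial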
Evaluation (II)
[Plot: detection rates as a function of the threshold:
- % of authentic images correctly retained
- % of FGS adversarials correctly discarded
- % of L-BFGS adversarials correctly discarded
- % of wrong classifications of authentic images correctly discarded]
● With a low threshold you can:
○ filter out 50% of adversarial images (and 10% of errors) while retaining almost all the authentic images (point [1] on the plot)
○ filter out 80% of adversarial images (and 30% of errors) while retaining 90% of the authentic images (point [2] on the plot)
● The aggressiveness of the filter can be adjusted via the threshold (a sketch of how these curves are computed follows)
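A rough sketch of how such threshold curves can be computed, assuming numpy arrays of kNN scores for authentic and adversarial images:

    import numpy as np

    def detection_curves(scores_authentic, scores_adversarial, thresholds):
        # For each threshold t: fraction of authentic images retained (score >= t)
        # and fraction of adversarial images discarded (score < t).
        retained = np.array([(scores_authentic >= t).mean() for t in thresholds])
        discarded = np.array([(scores_adversarial < t).mean() for t in thresholds])
        return retained, discarded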
Evaluation (III) - Good Detections
● Examples of content that might be filtered
● Our approach successfully identifies adversarial images, assigning them low scores
http://deepfeatures.org/adversarials/
Evaluation (IV) - Bad Detections
● The most difficult adversarial images to detect (the ones with the highest kNN scores)
● Note the visual similarity and the common aspects among the classes
http://deepfeatures.org/adversarials/
Conclusions
● We presented an approach to cope with adversarial images
○ with a satisfactory level of accuracy
○ without changing the model (no retraining)
○ without using additional data
● Future work
○ test more network architectures
○ test more generation algorithms for adversarial images
○ compare with other defense methodologies
Thanks for your attention!
Questions ?
http://deepfeatures.org/adversarials/
Fabio Carrara <[email protected]>