convolutional neural networks
Master's Computer Vision by Neuromation
Sergey Nikolenko, Alex Davydow
Harbour Space University, Barcelona, Spain
May 27, 2019
Random facts:
• on May 27, 1703, Peter the Great founded Saint Petersburg, soon to be the capital of the Russian Empire and still a wonderful city
• on May 27, 1931, Auguste Piccard and Paul Kipfer took off on a balloon from Augsburg and became the first human beings to enter the stratosphere, reaching a record altitude of 15,781 m
• on May 27, 1933, Walt Disney released the cartoon Three Little Pigs, with its hit song Who's Afraid of the Big Bad Wolf?
• on May 27, 1960, a military coup removed Turkish President Celâl Bayar and the rest of the democratic government from office
• on May 27, 1977, Virgin released the Sex Pistols single God Save the Queen; the song was immediately banned on British radio but still reached #1 on the charts
Modern CNN architectures
ResNet
• Residual learning: let's train the differences (residues) between one layer and the next.
• Then the gradients will be able to flow with no obstacle.
• A function implemented by a residual unit looks like
y(𝑘) = 𝐹(x(𝑘)) + x(𝑘),
where x(𝑘) is the input vector of layer 𝑘, 𝐹(x) is the function computed by the layer, and y(𝑘) is the output of the residual layer that will then become x(𝑘+1) for the next layer.
• Now the gradient can pass through and does not vanish when 𝐹 becomes saturated:
𝜕y(𝑘)/𝜕x(𝑘) = 1 + 𝜕𝐹(x(𝑘))/𝜕x(𝑘).
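The gradient identity above can be checked numerically. Here is a minimal sketch with a toy scalar "layer" 𝐹(x) = tanh(wx); the function names are illustrative, not from any framework:

```python
import math

# A toy residual unit y = F(x) + x with a scalar layer F(x) = tanh(w * x).
def residual_forward(x, w):
    return math.tanh(w * x) + x  # identity shortcut added to the layer output

def grad_wrt_input(x, w, eps=1e-6):
    # central-difference estimate of dy/dx; even when tanh saturates
    # (so dF/dx ~ 0), the "+ x" shortcut keeps the gradient near 1
    return (residual_forward(x + eps, w) - residual_forward(x - eps, w)) / (2 * eps)

print(round(grad_wrt_input(10.0, 5.0), 3))  # ~1.0 even in the saturated regime
```

Without the shortcut the same saturated layer would have a gradient near zero, which is exactly the vanishing-gradient problem that residual learning sidesteps.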
ResNet
• This has allowed for very deep networks.
• Another similar approach – highway networks by Jürgen Schmidhuber.
• We again represent y(𝑘), the output of layer 𝑘, as a linear combination of x(𝑘) and 𝐹(x(𝑘)), but in a different way:
y(𝑘) = 𝐶(x(𝑘))x(𝑘) + 𝑇 (x(𝑘))𝐹(x(𝑘)),
where 𝐶 is the carry gate and 𝑇 is the transform gate; usually it's a convex combination, 𝐶 = 1 − 𝑇.
• Practice shows that the residual connections should be as “straight” as possible.
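A scalar sketch of a highway unit with 𝑇 = 𝜎(w_t x) and the usual convex combination 𝐶 = 1 − 𝑇; the toy transform and names are illustrative:

```python
import math

# A scalar highway unit: y = T(x) * F(x) + (1 - T(x)) * x.
def highway_forward(x, w_h, w_t):
    h = math.tanh(w_h * x)                 # the transform F(x)
    t = 1.0 / (1.0 + math.exp(-w_t * x))   # transform gate T (sigmoid)
    return t * h + (1.0 - t) * x           # carry gate C = 1 - T

# with the gate nearly closed (T ~ 0), the unit just carries x through,
# which is exactly the "straight" path that helps gradients flow
print(round(highway_forward(2.0, 1.0, -10.0), 3))  # ~2.0
```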
ResNet: variations
Revolution of Depth (Kaiming He)
ResNeXt
• ResNeXt (Xie et al., 2016): let's replace ResNet units with “split-transform-merge” units, similar to Inception.
• The input is divided into blocks w.r.t. channels, and every block gets its own convolutions.
ResNeXt
• The idea is similar to group convolutions, used already in AlexNet for parallelization:
• They do yield a kind of specialization in the results:
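The split-by-channels idea can be sketched as a toy 1×1 grouped convolution in pure Python; the function name and weight layout are hypothetical:

```python
# A toy grouped 1x1 convolution: each group of input channels is mixed
# only by its own weight matrix, exactly the "split" in split-transform-merge.
def grouped_conv_1x1(x, weights, groups):
    # x: input channel values; weights: one matrix per group
    size = len(x) // groups
    out = []
    for g in range(groups):
        block = x[g * size:(g + 1) * size]  # each group sees only its slice
        for row in weights[g]:               # group-specific output channels
            out.append(sum(wi * xi for wi, xi in zip(row, block)))
    return out

# 4 input channels, 2 groups: each group mixes only 2 channels
x = [1.0, 2.0, 3.0, 4.0]
weights = [[[1.0, 0.0], [0.0, 1.0]],   # group 0: identity
           [[1.0, 1.0], [0.0, 2.0]]]   # group 1: mixes channels 2-3
print(grouped_conv_1x1(x, weights, 2))  # [1.0, 2.0, 7.0, 8.0]
```

Note how channels from different groups never interact inside the unit; real frameworks expose the same behavior via a `groups` parameter on convolution layers.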
Inception v4 and Inception ResNet
• Another classic paper (Szegedy et al., 2016) introduced Inception v4 and Inception ResNet.
• Inception v4 – let's standardize everything and simplify the units. First, the “stem”:
Inception v4 and Inception ResNet
• Second, we now have three basic blocks A, B, and C:
Inception v4 and Inception ResNet
• And special reduction blocks to reduce the dimensions:
Inception v4 and Inception ResNet
• Inception ResNet adds residual connections to these blocks:
Inception v4 and Inception ResNet
• There is no pooling now but there are still reduction blocks:
Inception v4 and Inception ResNet
• As a result, the architecture has become even simpler; Inception v4 (top), Inception ResNet (bottom):
Inception v4 and Inception ResNet
• Inception ResNet v2:
Inception v4 and Inception ResNet
• And it works quite well:
SqueezeNet
• SqueezeNet (Iandola et al., 2017) – how to reduce the number of parameters:
• replace 3 × 3 filters with 1 × 1 ones;
• reduce the number of inputs for 3 × 3 convolutions;
• delay downsampling as late as possible to increase the size of activation maps.
SqueezeNet
• Fire module:
• squeeze convolutional layer (1 × 1 only);
• expand layer (1 × 1 and 3 × 3).
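A back-of-the-envelope sketch of why the Fire module saves parameters (biases ignored); the helper name and the concrete channel counts are illustrative, not taken from the paper's code:

```python
# Parameter count of a Fire module: a 1x1 squeeze layer feeds a mix of
# 1x1 and 3x3 expand convolutions, so the expensive 3x3 filters see few channels.
def fire_module_params(in_ch, squeeze_ch, expand1x1_ch, expand3x3_ch):
    squeeze = in_ch * squeeze_ch              # 1x1 squeeze convolutions
    expand1 = squeeze_ch * expand1x1_ch       # 1x1 part of the expand layer
    expand3 = squeeze_ch * expand3x3_ch * 9   # 3x3 part, fed by few channels
    return squeeze + expand1 + expand3

# compare with a plain 3x3 convolution over the same 96 -> 128 channels
plain = 96 * 128 * 9
print(fire_module_params(96, 16, 64, 64), plain)  # 11776 vs 110592
```

The squeeze layer is what implements the second bullet above: it cuts the number of inputs that the 3 × 3 convolutions have to process.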
SqueezeNet
• General SqueezeNet architecture:
SqueezeNet
• We get 50× fewer parameters than AlexNet. But:
MobileNet
• MobileNet (Howard et al., 2017): networks for mobile devices.
• Depthwise separable convolutions: let's decompose a convolution into a depthwise convolution (one filter for each channel) and a 1 × 1 convolution.
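A small sketch of the multiply-add savings from this decomposition; the formulas follow the standard cost accounting for convolutions, and the concrete layer sizes are just an example:

```python
# Standard conv: every output channel mixes all input channels with a k x k kernel.
def standard_conv_cost(k, c_in, c_out, h, w):
    return k * k * c_in * c_out * h * w

# Depthwise separable conv: one k x k filter per channel, then a 1 x 1 conv
# to mix channels -- the two stages together approximate the standard conv.
def separable_conv_cost(k, c_in, c_out, h, w):
    depthwise = k * k * c_in * h * w   # spatial filtering, per channel
    pointwise = c_in * c_out * h * w   # 1 x 1 channel mixing
    return depthwise + pointwise

std = standard_conv_cost(3, 64, 128, 56, 56)
sep = separable_conv_cost(3, 64, 128, 56, 56)
print(round(std / sep, 1))  # 8.4 -- roughly an 8-9x reduction for 3x3 kernels
```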
MobileNet
• Then the structure of a layer will be more complex (but with fewer weights), and the overall architecture is not so deep:
MobileNet
• We see that we can save a lot of parameters at the price of a small decrease in quality:
Adversarial examples
Adversarial examples
• An interesting feature of neural networks: you can fool any network with a picture that is completely indistinguishable from the original to the naked eye.
• But how? Any ideas?..
Adversarial examples
• Let's do gradient descent not along the weights 𝜃 but along the input x!
• We only need to control that the new example x̂ remains similar to the original x, e.g., ‖x̂ − x‖∞ ≤ 𝜖 (or some other condition). How?
• Moreover, we can try to make x̂ stable to transformations such as rotation.
• How would we do that?
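A toy sketch of "gradient descent along the input": for a linear model with logistic loss the sign of the input gradient has a closed form, so we can craft x̂ = x + 𝜖 sign(∇x𝐿) without any framework. The model and names here are illustrative, not a real network:

```python
import math

# Logistic loss of a linear classifier w on a single example (x, y), y in {-1, +1}.
def loss(w, x, y):
    z = sum(wi * xi for wi, xi in zip(w, x))
    return math.log(1.0 + math.exp(-y * z))

# FGSM-style step: move each input coordinate by eps in the direction that
# increases the loss; for a linear model sign(grad_x L) = sign(-y * w).
def fgsm(w, x, y, eps):
    grad_sign = [(-y) * (1.0 if wi > 0 else -1.0) for wi in w]
    return [xi + eps * g for xi, g in zip(x, grad_sign)]

w = [2.0, -1.0, 0.5]
x = [1.0, 1.0, 1.0]              # correctly classified with y = +1
x_adv = fgsm(w, x, y=1, eps=0.5) # stays within eps of x in the L-infinity norm
print(loss(w, x, 1) < loss(w, x_adv, 1))  # True: the loss strictly increases
```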
Adversarial examples
• Intriguing properties of neural networks (Szegedy et al., 2013). A very intriguing paper indeed...
• For instance, they analyzed the activations of individual neurons.
• I.e., supposedly, if we analyze the last-layer neurons, they will form a nice basis in the latent space where it is easy to find the semantics.
• Right?..
Adversarial examples
• ...not really:
• I.e., regular CNNs don't have any reasonable disentanglement; the latent space is good, but the basis is as good as random.
Adversarial examples
• The same paper introduced adversarial attacks; for AlexNet, everything on the right is an ostrich:
Adversarial examples
• Developed further in (Goodfellow, Shlens, Szegedy, 2014); all highlighted pictures are airplanes:
Adversarial examples
• Conclusions (Goodfellow, Shlens, Szegedy, 2014):
• for a linear classifier it's clear what to do: for x̂ = x + z we want to shift w⊤x̂ = w⊤x + w⊤z, i.e., we take z = sign(w) and apply constraints on the norm of x̂;
• the same can be done in any network; by taking the gradient we do a linear approximation in a neighborhood:

z = 𝜖 sign(∇x𝐿(𝜃, x, 𝑦));

• i.e., this is not because our models are very nonlinear, it's because they are too linear;
• the shift direction is important, not any specific point; i.e., we can even generalize the adversarial shift to different examples;
• and we can try to regularize against it by adding the adversarial shift to the objective function:

𝐿′(𝜃, x, 𝑦) = 𝛼𝐿(𝜃, x, 𝑦) + (1 − 𝛼)𝐿(𝜃, x + 𝜖 sign(∇x𝐿(𝜃, x, 𝑦)), 𝑦).

• But that's not the end of the story either...
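The regularized objective 𝐿′ can be instantiated on the same toy linear model with logistic loss; since sign(∇x𝐿) = sign(−𝑦w) for a linear model, the inner adversarial step has a closed form. All names here are illustrative:

```python
import math

# Logistic loss of a linear classifier w on one example (x, y), y in {-1, +1}.
def loss(w, x, y):
    z = sum(wi * xi for wi, xi in zip(w, x))
    return math.log(1.0 + math.exp(-y * z))

# L'(theta, x, y) = alpha * L(x) + (1 - alpha) * L(x + eps * sign(grad_x L));
# for a linear model sign(grad_x L) = sign(-y * w), so no autodiff is needed.
def adversarial_objective(w, x, y, eps, alpha):
    x_adv = [xi + eps * (-y) * (1.0 if wi > 0 else -1.0) for xi, wi in zip(x, w)]
    return alpha * loss(w, x, y) + (1.0 - alpha) * loss(w, x_adv, y)

w, x = [2.0, -1.0, 0.5], [1.0, 1.0, 1.0]
# the mixed objective is never below the clean loss: the adversarial term
# penalizes examples that are easy to flip within the eps-neighborhood
print(adversarial_objective(w, x, 1, 0.5, 0.5) >= loss(w, x, 1))  # True
```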
Adversarial examples
• Lots of different attacks:
• DeepFool attack (Moosavi-Dezfooli et al., 2016): shift the example to the hyperplane that divides the classes, with z = (𝑓(x₀)/‖w‖₂²) w for a linear classifier and z𝑖 = (𝑓(x𝑖)/‖∇𝑓(x𝑖)‖₂²) ∇𝑓(x𝑖) for any function;
• (Carlini, Wagner, 2016): find minimal changes based on the 𝐿₀, 𝐿₂, and 𝐿∞ norms, still some of the best attacks;
• one can also look not for a direction but for specific features;
• (Papernot et al., 2016): find out which pixels are the most important and shift them;
• and a lot more, hundreds of papers already...
Adversarial examples
• There are different approaches to defense too:
• (Bastani et al., 2016): formalized the notion of robustness to adversarial attacks and proposed methods for evaluating it;
• (Lyu et al., 2015; Roth et al., 2018): other variations on gradient regularization;
• (Shaham et al., 2015; Madry et al., 2017): let's train on “adversarial” examples, choosing the worst example in a neighborhood;
• (Brendel, Bethge, 2017): the more nonzero (small) gradients we have, the worse for attacks, so we can use simple numerical instability as a regularizer;
• DeepCloak defense (Gao et al., 2017): let's remove features that are not needed for classification;
• and a lot more, hundreds of papers already...
Adversarial examples
• (Kurakin, Goodfellow, Bengio, 2016): attacks in the real world! Moreover, black-box attacks: we attack one model and test on another.
• There is an app that changes a photo adversarially:
Adversarial examples
• Even better – you can print out an adversarial example, and it still works!
• It's still unclear how realistic all this is, but quite possibly an important direction for AI security in the future.
Thank you!
Thank you for your attention!