
Quantifying Translation-Invariance in Convolutional Neural Networks

Eric Kauderer-Abrams
Stanford University
450 Serra Mall, Stanford, CA 94305
[email protected]

Abstract

A fundamental problem in object recognition is the development of image representations that are invariant to common transformations such as translation, rotation, and small deformations. There are multiple hypotheses regarding the source of translation invariance in CNNs. One idea is that translation invariance is due to the increasing receptive field size of neurons in successive convolution layers. Another possibility is that invariance is due to the pooling operation. We develop a simple tool, the translation-sensitivity map, which we use to visualize and quantify the translation-invariance of various architectures. We obtain the surprising result that architectural choices such as the number of pooling layers and the convolution filter size have only a secondary effect on the translation-invariance of a network. Our analysis identifies training data augmentation as the most important factor in obtaining translation-invariant representations of images using convolutional neural networks.

1. Introduction

Object recognition is a problem of fundamental importance in visual perception. The ability to extrapolate from raw pixels to the concept of a coherent object persisting through space and time is a crucial link connecting low-level sensory processing with higher-level reasoning. A fundamental problem in object recognition is the development of image representations that are invariant to common transformations such as translation, rotation, and small deformations. Intuitively, these are desirable properties for an object recognition algorithm to have: a picture of a cat should be recognizable regardless of the cat's location and orientation within the image. It has been demonstrated that affine transformations account for a significant portion of intra-class variability in many datasets [1].

In recent years, several different approaches to the invariance problem have emerged. Loosely modeled on the local receptive fields and hierarchical structure of the visual cortex, Convolutional Neural Networks (CNNs) have achieved unprecedented accuracy in object recognition tasks [2] [3]. There is widespread consensus in the literature that CNNs are capable of learning translation-invariant representations [4] [5] [7]. However, there is no definitive explanation for the mechanism by which translation-invariance is obtained.

One idea is that translation-invariance is due to the gradual increase in the receptive field size of neurons in successive convolution layers [4]. An alternative hypothesis is that the destructive pooling operations are responsible for translation-invariance [5]. Despite the widespread consensus, there has been little work done in exploring the nature of translation-invariance in CNNs. In this study we attempt to fill this gap, directly addressing two questions: To what extent are the representations produced by CNNs translation-invariant? Which features of CNNs are responsible for this translation-invariance?

To address these questions we introduce a simple tool for visualizing and quantifying translation-invariance, called translation-sensitivity maps. Using translation-sensitivity maps, we quantify the degree of translation-invariance of several CNN architectures trained on an expanded version of the MNIST digit dataset [6]. We obtain the surprising result that the most important factor in obtaining translation-invariant CNNs is training data augmentation and not the particular architectural choices that are often discussed.

2. Related Work

There have been several previous works addressing the invariance problem in CNNs. A recent study quantified the invariance of representations in CNNs to general affine transformations such as rotations and reflections [8]. There have also been recent developments in altering the structure of CNNs and adding new modules to CNNs to increase the degree of invariance to common transformations [4]. A prominent example is spatial transformer networks [5], which include a new layer that learns an example-specific affine transformation to be applied to each image. The intuition behind this approach, that the network can transform any input into a standard form with the relevant information centered and oriented consistently, has been confirmed empirically.

There have been several studies on the theory behind the invariance problem. One such example is the Scattering Transform [9] [10], which computes representations of images that are provably invariant to affine transformations and stable to small deformations. Furthermore, it does so without any learning, relying instead on a cascade of fixed wavelet transforms and nonlinearities. With performance matching CNNs on many datasets, the Scattering Transform provides a powerful theoretical foundation for addressing the invariance problem. In addition, there are several related works addressing the invariance problem for scenes rather than images [11] and in a more general unsupervised learning framework [12].

The methods that we describe in the next section can be applied to any of the above approaches to quantify the degree of translation-invariance achieved by the algorithms.

3. Methods

3.1. Convolutional Neural Networks

This section describes the development of tools used to quantify the translation-invariance of CNNs, and the setup for the experiments to be run using these tools. It is important to point out that we refer to a network as "translation-invariant" as shorthand for saying that the network output varies little when the input is translated. In practice, no algorithm is truly translation-invariant; instead, we measure the sensitivity of the output to translation in the input. We loosely call one architecture more translation-invariant than another if its output is less sensitive to translations of the input. We now introduce tools to make comparisons between architectures more precise.

Our approach entails constructing and training many different CNN architectures. To provide some background: a CNN is a hierarchy of several different types of transformations, referred to as "layers," that take an image as input and produce an output "score vector" giving the network's prediction for the category that the image belongs to. Each layer implements a specific mathematical operation. Convolution layers compute the convolution of a small, local filter with the input. A unit in a convolution layer acts as a feature detector, becoming active when the template stored in the filter matches a region of the input. Convolution layers are followed by the application of a point-wise nonlinearity, which is commonly chosen to be the rectified linear function (ReLU). Many architectures include Fully-Connected (FC) layers at the end of the hierarchy, which multiply the input (reshaped to be a vector) with a matrix of weights corresponding to the connection "strength" between each FC unit and each input vector element. Additional common layers are pooling layers, which subsample the input, and batch normalization layers, which fix the mean and variance of the output of a given layer.

Figure 1. A translation-sensitivity map obtained from a network with two convolution and two pooling layers.

To train CNNs, the output of the last layer is fed into a loss function that provides a single scalar value corresponding to the desirability of the network's prediction for a given example. The networks are then trained with backpropagation: computing the gradient of the loss function with respect to all of the parameters in the network.

For this work we trained several variants of common, relatively small CNNs using the Adam [13] optimization algorithm and the cross-entropy loss function, $L_i = -\log\left( e^{f_{y_i}} / \sum_j e^{f_j} \right)$, with L2 regularization. The learning rate and regularization strength were selected via the traditional training-validation set approach. All networks were implemented using the TensorFlow framework [14] on an Intel i7 CPU.
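For concreteness, a minimal NumPy sketch of this loss is shown below; the function and variable names are illustrative and not taken from the paper's implementation, and the regularization strength is a placeholder.

```python
import numpy as np

def cross_entropy_with_l2(scores, labels, weights, reg=1e-4):
    """Average cross-entropy loss L_i = -log(e^{f_{y_i}} / sum_j e^{f_j})
    over a batch, plus an L2 penalty on the weight matrices.

    scores:  (N, C) array of class scores f
    labels:  (N,)   array of correct class indices y_i
    weights: list of weight arrays to regularize
    """
    # Shift scores so the largest entry is 0 for numerical stability.
    shifted = scores - scores.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    data_loss = -log_probs[np.arange(len(labels)), labels].mean()
    reg_loss = reg * sum(np.sum(w ** 2) for w in weights)
    return data_loss + reg_loss
```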

Given a fully trained CNN, we select an image from the test set to designate as the "base" image. Next, we generate all possible translations of the base image that leave the digit fully contained within the frame. For the MNIST dataset centered within a 40x40 pixel frame, the translations went from -10 to 10 pixels in the x-direction and -10 to 10 pixels in the y-direction, where negative and positive are used to denote translation to the left and right in the x-direction, and up and down in the y-direction. Next, we perform a forward pass on the base image and save the output, which we refer to as the base output vector. We then perform a forward pass on the translated image and compute the Euclidean distance between the base output vector and the translated output vector. Note that for a fully translation-invariant network, this distance would be zero. Finally, we divide this distance by the median inter-class distance between all score vectors in a large sample of the data in order to normalize the results so that they can be compared across different networks. This normalization also provides easily interpretable numbers: a distance of one means that the network considers the translated image to be so different from the base image that it might as well belong to another class. A distance of zero corresponds to the case of perfect translation-invariance.
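A rough sketch of this normalized distance computation, assuming a generic `model` callable that maps a batch of images to score vectors and a precomputed median inter-class distance; the helper name is hypothetical, not from the paper's code:

```python
import numpy as np

def normalized_score_distance(model, base_image, translated_image, median_interclass_dist):
    """Euclidean distance between the score vectors of the base and translated
    images, divided by the median inter-class score distance, so that a value
    near 1.0 means the translated image looks as different as another class."""
    base_scores = np.asarray(model(base_image[None]))[0]           # forward pass on base image
    shifted_scores = np.asarray(model(translated_image[None]))[0]  # forward pass on translated copy
    return np.linalg.norm(base_scores - shifted_scores) / median_interclass_dist
```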

3.2. Translation-Sensitivity Maps

To produce a translation-sensitivity map, we compute the normalized score-space distances as described above between the base image, $I_{0,0}$, and each translated copy $I_{k_x,k_y}$, where $I_{k_x,k_y}$ is the original image translated $k_x$ units in the x-direction and $k_y$ units in the y-direction. We display the results in a single heat-map in which the color of cell $[k_x, k_y]$ corresponds to $d(I_{0,0}, I_{k_x,k_y})$. The plots are centered such that the untranslated image is represented in the center of the image, and the colors are set such that whiter colors correspond to lower distances. To produce a single translation-sensitivity map for each input class, we produce separate translation-sensitivity maps using numerous examples of each class from the test set and average the results. The averaging makes sense because the translation-sensitivity maps for different images from a given class are extremely similar.
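A minimal sketch of how such a map could be assembled, reusing the hypothetical `normalized_score_distance` helper above; the zero-filled `translate` helper is also an assumption about how the translations are implemented.

```python
import numpy as np

def translate(image, kx, ky):
    """Shift a 2D image by (kx, ky) pixels, filling the uncovered border with zeros."""
    h, w = image.shape
    shifted = np.zeros_like(image)
    dst_y = slice(max(ky, 0), h + min(ky, 0))
    dst_x = slice(max(kx, 0), w + min(kx, 0))
    src_y = slice(max(-ky, 0), h + min(-ky, 0))
    src_x = slice(max(-kx, 0), w + min(-kx, 0))
    shifted[dst_y, dst_x] = image[src_y, src_x]
    return shifted

def translation_sensitivity_map(model, base_image, median_interclass_dist, max_shift=10):
    """Entry [ky + max_shift, kx + max_shift] holds d(I_{0,0}, I_{kx,ky}),
    so the untranslated image sits at the center of the map."""
    size = 2 * max_shift + 1
    sens_map = np.zeros((size, size))
    for ky in range(-max_shift, max_shift + 1):
        for kx in range(-max_shift, max_shift + 1):
            copy = translate(base_image, kx, ky)
            sens_map[ky + max_shift, kx + max_shift] = normalized_score_distance(
                model, base_image, copy, median_interclass_dist)
    return sens_map
```

Per-class maps can then be obtained by averaging the maps computed for many test examples of the same class.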

3.3. Radial Translation-Sensitivity Functions

While the translation-sensitivity maps provide clearly interpretable images quantifying and displaying the nature of translation-invariance for a given trained network, they are unwieldy for quantitatively comparing results between different network architectures. To perform quantitative comparisons, we compute a one-dimensional signature from the translation-sensitivity maps, called a radial translation-sensitivity function, which displays the average value of a translation-sensitivity map at a given radial distance. Intuitively, the radial translation-sensitivity function displays the magnitude of the change in the output of the network as a function of the size of the translation (ignoring the direction). Because most translation-sensitivity maps were anisotropic, the radial translation-sensitivity function provides only a rough metric; however, it is useful for comparing results across architectures, since the curves for many networks can be displayed on a single plot.
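One way to compute such a radial average is sketched below; binning cells by their rounded distance from the center is an assumption, since the paper does not specify the exact binning scheme.

```python
import numpy as np

def radial_sensitivity_function(sens_map):
    """Average the translation-sensitivity map over all cells at (approximately)
    the same radial distance from the center, i.e. the untranslated image."""
    h, w = sens_map.shape
    cy, cx = h // 2, w // 2
    ys, xs = np.indices(sens_map.shape)
    radii = np.rint(np.hypot(ys - cy, xs - cx)).astype(int)
    return np.array([sens_map[radii == r].mean() for r in range(radii.max() + 1)])
```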

Using the translation-sensitivity maps and radial translation-sensitivity functions, we quantify the translation-invariance of the outputs produced by various CNN architectures with different numbers of convolution and pooling layers. For each network, we train two versions: one trained on the original centered MNIST dataset, and the other trained on an augmented version of the MNIST dataset obtained by adding four randomly-translated copies of each training set image.

Figure 2. Untranslated and translated versions of an MNIST image.

3.4. Translated MNIST Dataset

For all experiments we used the MNIST digit dataset. The original images were centered within a 40x40 black frame in order to provide ample space for translating the images. The un-augmented training dataset consists of the 50,000 original centered MNIST images. To generate the augmented dataset, we randomly selected 12,500 of the original centered images and produced four randomly translated versions of each of these images. The translations in the x and y directions were each chosen randomly from a uniform distribution ranging from one to ten pixels. It is important that the augmented and original training sets are the same size in order to separate the effects of augmentation from the effects of simply having more training data. The dataset also includes 10,000 validation images and 10,000 test images.
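A sketch of this dataset construction is shown below, reusing the zero-filled `translate` helper from the earlier sketch; labels are omitted, and the random sign of each shift is an assumption, since the paper only states that shift magnitudes range from one to ten pixels.

```python
import numpy as np

def center_in_frame(digit, frame_size=40):
    """Place a 28x28 MNIST digit at the center of a black frame."""
    frame = np.zeros((frame_size, frame_size), dtype=digit.dtype)
    off = (frame_size - digit.shape[0]) // 2
    frame[off:off + digit.shape[0], off:off + digit.shape[1]] = digit
    return frame

def make_augmented_training_set(mnist_digits, rng=None):
    """Select 12,500 digits and create 4 randomly translated copies of each,
    giving a 50,000-image augmented set the same size as the original one."""
    rng = rng or np.random.default_rng(0)
    chosen = rng.choice(len(mnist_digits), size=12500, replace=False)
    augmented = []
    for idx in chosen:
        base = center_in_frame(mnist_digits[idx])
        for _ in range(4):
            kx = int(rng.integers(1, 11)) * int(rng.choice([-1, 1]))  # 1-10 pixels, random sign (assumed)
            ky = int(rng.integers(1, 11)) * int(rng.choice([-1, 1]))
            augmented.append(translate(base, kx, ky))
    return np.stack(augmented)
```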

4. Results

4.1. Experiment One: Five Architectures With and Without Training Data Augmentation

We begin with a series of experiments in which we use translation-sensitivity maps and radial translation-sensitivity functions to quantify and compare the translation-invariance of several different CNN architectures, each trained both with and without training data augmentation. The goal is to quantify the roles played by training data augmentation, the number of convolution layers, the number of max-pooling layers, and the ordering of the convolution and pooling layers in obtaining translation-invariant CNNs.

For referring to each network, we use the following naming convention: each architecture is described by a string in which 'c' stands for a convolution layer and 'p' stands for a max-pooling layer. The ordering of the letters gives the order of the layers in the network starting at the input. The networks trained with data augmentation end with the string 'aug'. Unless otherwise stated, all convolution layers use a filter size of five pixels, stride one, and padding to preserve the input size, while all max-pooling layers use a two-by-two kernel with stride two.

Figure 3. Translation-sensitivity maps for three different architectures. The images on the top row were obtained from networks trained with non-augmented data. The images on the bottom row were obtained from the same networks trained with augmented data. The increased brightness of the images on the bottom row demonstrates the impact of training data augmentation on the translation-invariance of a network.

For this experiment, we train five networks in total: one with a single convolution layer and no pooling layers, one with one convolution and one pooling layer, one with two convolution layers and no pooling layers, and two networks with two convolution and two pooling layers in different orders. All of the above networks have ReLU layers after each convolution layer and a fully-connected layer with 100 hidden units. All networks were trained to convergence, achieving test-time accuracy of between 98% and 99%.
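To illustrate the naming convention, the sketch below builds one of these small networks from its layer string using the modern tf.keras API. This is only an approximation of the paper's setup (which used a 2016-era TensorFlow implementation): the number of filters per convolution layer and the ReLU on the fully-connected layer are assumptions, while the 5x5 filters, 2x2 pooling, and 100 hidden units follow the text.

```python
import tensorflow as tf

def build_network(spec, num_classes=10, num_filters=32):
    """Build a small CNN from a layer string such as 'ccpp' or 'cpcp':
    'c' = 5x5 convolution, stride 1, same padding, ReLU;
    'p' = 2x2 max-pooling with stride 2."""
    model = tf.keras.Sequential([tf.keras.Input(shape=(40, 40, 1))])
    for layer in spec:
        if layer == 'c':
            model.add(tf.keras.layers.Conv2D(num_filters, 5, strides=1,
                                             padding='same', activation='relu'))
        elif layer == 'p':
            model.add(tf.keras.layers.MaxPooling2D(pool_size=2, strides=2))
    model.add(tf.keras.layers.Flatten())
    model.add(tf.keras.layers.Dense(100, activation='relu'))  # 100-unit FC layer
    model.add(tf.keras.layers.Dense(num_classes))             # class scores
    return model

# Example: the two deepest architectures from this experiment.
ccpp = build_network('ccpp')
cpcp = build_network('cpcp')
```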

4.2. The Importance of Data Augmentation

The results, shown as a radial translation-sensitivity plot in Figure 4, are partially surprising. The most obvious takeaway is that all networks trained on augmented data are significantly more translation-invariant than all networks trained on non-augmented data. From Figures 3 and 4, we see that a simple network with one convolution layer and no pooling layers that is trained on augmented data is significantly more translation-invariant than a network with two convolution and two pooling layers trained on non-augmented data. These results suggest that training data augmentation is the most important factor in obtaining translation-invariant networks.

Figure 4. Radial translation-sensitivity function comparing the translation-invariance of five different architectures. The dashed lines are networks trained with augmented data; the solid lines are networks trained with non-augmented data.

4.3. Architectural Features Play a Secondary Role

Next, we observe that architectural choices such as the number of convolution and pooling layers play a secondary role in determining the translation-invariance of a network. For the networks trained on non-augmented data, there is no correlation between the number of convolution and pooling layers and the translation-invariance of the network. For the networks trained on augmented data, the three network configurations with two convolution layers are more translation-invariant than the shallower networks. However, among the three deeper networks there is no distinction, suggesting that it is the network depth, rather than the specific type of layer used, that contributes to the translation-invariance of the network.

4.4. Experiment Two: The Effect of Filter-Size in aDeeper Network

In the next experiment, we examine the role of the filter size of the convolution layers in determining the translation-invariance of the network. For this experiment, we train a deeper network consisting of conv-conv-pool-conv-conv-pool. We chose to use a deeper network because with four convolution layers, the effect of filter size should be more evident than in shallower networks. We trained four instantiations of this architecture: one with a filter size of three pixels and one with a filter size of five pixels, each trained on both augmented and non-augmented data.

The results, displayed in Figure 5, are in line with the results from the previous section. Again, the biggest difference is between the networks trained on augmented data and the networks trained on non-augmented data. Filter size plays a secondary role, in the manner often discussed in the literature: the network with the larger filter size is more translation-invariant than the network with the smaller filter size.

Figure 5. Radial translation-sensitivity function comparing the translation-invariance of the same architecture with filter sizes of three and five pixels. The dashed lines are networks trained with augmented data; the solid lines are networks trained with non-augmented data. Once again, we see that training data augmentation has the largest effect on the translation-invariance of the network.

5. Conclusion

In this paper we introduced two simple tools, translation-sensitivity maps and radial translation-sensitivity functions, for visualizing and quantifying the translation-invariance of classification algorithms. We applied these tools to the problem of measuring the translation-invariance of various CNN architectures and isolating the network and training features primarily responsible for that translation-invariance. The results provide a refinement of many ideas circulating in the literature that have not been rigorously tested before this work.

CNNs are not inherently translation-invariant by virtue of their architecture alone; however, they have the capacity to learn translation-invariant representations if trained on appropriate data. The single most important factor in obtaining translation-invariant networks is to train on data that features significant amounts of variation due to translation. Beyond training data augmentation, architectural features such as the number of convolution layers, the number of pooling layers, and the filter size play a secondary role. Deeper networks with larger filter sizes have the capacity to learn representations that are more translation-invariant, but this capacity is only realized if the network is trained on augmented data.

To follow up on this work, we plan to use translation-sensitivity maps to quantify the translation-invariance achieved by spatial transformer networks. Many promising results have been obtained using spatial transformer networks, and it would be interesting to place these within the context of this paper.

At a higher level, we are interested in using the invariance problem as a platform for constructing hybrid models between CNNs and algorithms like the Scattering Transform. Given that invariance to nuisance transformations is a desirable property for many object recognition algorithms to have, it seems inefficient that CNNs must learn this from scratch each time. While some have suggested that invariance is inherently built into CNNs through their architecture, our work shows that this is not the case. In contrast, the Scattering Transform, which features no learning in the image representation stage, has translation-invariance built in regardless of the training data. On the other hand, the ability of CNNs to learn arbitrary patterns from the training data is one of the factors contributing to their considerable success. Nevertheless, it will be interesting to consider hybrid algorithms that incorporate the desired invariance structure differently.

References

[1] Sundaramoorthi, Ganesh, et al. "On the set of images modulo viewpoint and contrast changes." IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2009.

[2] LeCun, Yann, et al. "Gradient-based learning applied to document recognition." Proceedings of the IEEE 86.11 (1998): 2278-2324.

[3] Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "ImageNet classification with deep convolutional neural networks." Advances in Neural Information Processing Systems. 2012.

[4] Gens, Robert, and Pedro M. Domingos. "Deep symmetry networks." Advances in Neural Information Processing Systems. 2014.

[5] Jaderberg, Max, Karen Simonyan, and Andrew Zisserman. "Spatial transformer networks." Advances in Neural Information Processing Systems. 2015.

[6] LeCun, Yann, Corinna Cortes, and Christopher J.C. Burges. "The MNIST database of handwritten digits." (1998).

[7] LeCun, Yann. "Learning invariant feature hierarchies." Computer Vision - ECCV 2012: Workshops and Demonstrations. Springer Berlin Heidelberg, 2012.

[8] Lenc, Karel, and Andrea Vedaldi. "Understanding image representations by measuring their equivariance and equivalence." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015.

[9] Bruna, Joan, and Stéphane Mallat. "Invariant scattering convolution networks." IEEE Transactions on Pattern Analysis and Machine Intelligence 35.8 (2013): 1872-1886.

[10] Mallat, Stéphane. "Group invariant scattering." Communications on Pure and Applied Mathematics 65.10 (2012): 1331-1398.

[11] Soatto, Stefano, and Alessandro Chiuso. "Visual Representations: Defining Properties and Deep Approximations." ICLR (2016).

[12] Anselmi, Fabio, et al. "Unsupervised learning of invariant representations in hierarchical architectures." arXiv preprint arXiv:1311.4158 (2013).

[13] Kingma, Diederik, and Jimmy Ba. "Adam: A method for stochastic optimization." arXiv preprint arXiv:1412.6980 (2014).

[14] Abadi, Martín, et al. "TensorFlow: Large-scale machine learning on heterogeneous systems, 2015." Software available from tensorflow.org.
