Object Detection using deep learning and synthetic data


Department of Science and Technology, Linköping University

SE-601 74 Norrköping, Sweden

LiU-ITN-TEK-A--18/030--SE

Object Detection using deep learning and synthetic data

Love Lidberg

2018-06-15


LiU-ITN-TEK-A--18/030--SE

Object Detection using deep learning and synthetic data

Master's thesis carried out in Computer Engineering at the Institute of Technology, Linköping University

Love Lidberg

Supervisor: Pierangelo Dell'Acqua
Examiner: Jonas Unger

Norrköping 2018-06-15



Copyright

The publishers will keep this document online on the Internet - or its possible replacement - for a considerable time from the date of publication barring exceptional circumstances.

The online availability of the document implies a permanent permission for anyone to read, to download, to print out single copies for your own use and to use it unchanged for any non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional on the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility.

According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its WWW home page: http://www.ep.liu.se/

© Love Lidberg


Abstract

This thesis investigates how synthetic data can be utilized when training convolutional neural networks to detect flags with threatening symbols. The synthetic data used in this thesis consisted of rendered 3D flags with different textures and of flags cut out from real images. Networks trained on synthetic data achieved an accuracy above 80%, compared to the 88% accuracy achieved by a data set containing only real images. The highest accuracy was achieved by combining real and synthetic data, showing that synthetic data can be used as a complement to real data. Some attempts to improve the accuracy were made using generative adversarial networks, without achieving any encouraging results.


Acknowledgments

I would like to thank my supervisor at FOI, David Gustafsson, my supervisor at Linköping University, Pierangelo Dell'Acqua, and my examiner, Jonas Unger, for the help and input on my work. I would also like to thank my fellow thesis writers at FOI for the great companionship and all the daily table tennis breaks.


Contents

Abstract
Acknowledgments
Contents
List of Figures
List of Tables

1 Introduction
1.1 Background
1.2 Motivation
1.3 Aim
1.4 Research questions
1.5 Delimitations

2 Theory
2.1 Machine Learning
2.2 Neural Networks
2.3 Convolutional Neural Networks
2.4 Architecture of a CNN
2.5 Transfer Learning
2.6 Generative Models
2.7 Data sets
2.8 Evaluation Metrics
2.9 Related Work

3 Method
3.1 Frameworks, Software & Hardware
3.2 Data sets
3.3 Tests
3.4 Evaluation Metrics

4 Results of the Tests
4.1 Test: Classification
4.2 Test: Object Detection
4.3 Test: Adding Realism to Generated Images
4.4 Test: Flags on Random Generated Background

5 Discussion
5.1 Answer to Research Questions

6 Conclusion

Bibliography


List of Figures

2.1 An overview of the structure of the perceptron.
2.2 Activation functions
2.3 An example of a multi layered perceptron.
2.4 Typical architecture for a convolutional neural network
2.5 Convolutional operation
2.6 Convolutional operation with zero padding.
2.7 Max pooling
2.8 Unpooling
2.9 The Inception module
2.10 An example of a residual block
2.11 Simple overview of a GAN. The generator, G, generates images from a random distribution, Z, and the discriminator tries to distinguish them from some real target domain.
3.1 Example of real images containing desired objects
3.2 Example of images labelled as 'Nothing'. The images are taken from the COCO data set.
3.3 Example of the different flags rendered in Blender. Three different textures were used for each flag.
3.4 Flags pasted into real images.
3.5 Images taken from game play of Grand Theft Auto V.
4.1 Images before and after they've gone through the 'fake to real' generator.
4.2 Images before and after they've gone through the 'real to fake' generator.
4.3 Flags on backgrounds generated by a generative adversarial network.


List of Tables

3.1 Data sets used for the classification test.
3.2 Data sets used in Object Detection test.
4.1 Accuracy for rendered flags on different backgrounds.
4.2 Accuracy scores for the classification experiment.
4.3 Recall and precision scores for the classification experiment.
4.4 Accuracy scores for the object detection experiment.
4.5 Recall and precision scores for the object detection experiment.
4.6 Accuracy for images generated by GAN.


1 Introduction

1.1 Background

Millions of images are posted daily on social media around the world[5]. Though most images are harmless, some images are discriminating, threatening or call for violence and terrorism. In fact, terrorists have in the past been known to post to social media prior to their attacks, and terrorist groups use social media as a way to spread their propaganda and to recruit more people[31]. A person posting images containing terrorists and certain symbols to illustrate their sympathies could be an indicator that this person may pose a threat in the future. The police want to take advantage of this situation to be able to prevent future attacks. Scanning through millions of images by hand to find the ones that contain objects of interest, for example a weapon or an ISIS flag, is time consuming and not sustainable.

To automate this task, one can employ machine learning and computer vision techniques. Machine learning is a field within computer science where computers learn to act on a task without being specifically programmed for that task[3]. Computer vision is another field, highly connected to artificial intelligence, that seeks to automate the tasks of the human visual system. Machine learning has previously proved to be able to automate many tasks such as spam filtering, playing games or predicting stock prices[1][20][32]. A machine learning area that has gained a lot of attention in the last decade is neural networks, often referred to as 'Deep Learning'. With the rise of computing power and the massive amount of available data, deep learning has shown itself to be a powerful technique that can be applied to many fields. Convolutional neural networks are a type of deep feed-forward neural network that has revolutionized many tasks within computer vision. Typical computer vision tasks such as image classification, object detection or semantic segmentation were challenging for a long time; nowadays deep learning is the standard approach to solving these problems. Instead of relying on hand-tuned features as before, convolutional neural networks can be trained on data and learn on their own which features are relevant for the problem.

Convolutional neural networks need a lot of data for training in order to generalize and perform well on new images. The problem arises when there is no data to train the network on. In such cases, augmenting the existing data has been shown to be one way to boost the result[30]. Transfer learning is another way[33]. 3D modelling software is nowadays a powerful tool able to produce images indistinguishable from real images. Can this technique be used when there is a shortage of data? Images that are produced artificially and are not true examples are often referred to as 'synthetic data'. This thesis will investigate how synthetic data can be used to find certain given symbols of hate in images by using deep learning.

1.2 Motivation

This thesis has been carried out at the Swedish Defence Research Agency, FOI. It is part of a bigger project that focuses on evaluating the use of big data in crime and terrorism prevention. A software tool that can scan through thousands of images quickly and automatically to find possible future perpetrators who have posted threatening or suspicious images would be useful. This thesis will focus partly on delivering a model that can be a part of such software.

1.2.1 Swedish Defence Research Agency

FOI is an assignment-based authority working under the Swedish Ministry of Defence. Their main customers are the Swedish Armed Forces and the Swedish Defence Materiel Administration, but they also accept assignments from civil authorities. FOI is one of the leading research institutes in Europe in the areas of defence and security. Their aim is to contribute to a safer and more secure world by being a world leader in these areas.

1.3 Aim

The aim of this thesis is to train a deep convolutional neural network to detect symbols of hate in images and to investigate how and if it is possible to improve the results of the model with synthetic training data.

1.4 Research questions

1. With limited data available, what efforts can be taken to boost the results of a convolutional neural network?

2. Is it possible to train a convolutional neural network using only synthetic training data to detect desired objects?

3. Can generative adversarial networks, GAN, be used for the production of synthetic data?

1.5 Delimitations

This thesis will only focus on identifying three different symbols in images. Object detection and classification are mostly considered solved problems, hence this thesis will focus on the use and production of synthetic data: how synthetic data can be made and how well it performs compared to real images.


2 Theory

This chapter presents the theory and the history behind the technology used in the thesis.

2.1 Machine Learning

As said in the previous chapter, machine learning can be seen as a subfield of artificial intelligence. Its goal is to make computers learn tasks from data without being explicitly programmed. Machine learning is often divided into three types: supervised learning, unsupervised learning and reinforcement learning; this thesis will not cover reinforcement learning. When the data being fed to the machine learning algorithm is labeled, it is called supervised learning. A common use for supervised learning is to predict the outcome of something uncertain or unknown based on previous data, for example to automatically filter e-mails as spam or not spam, or to predict the price of a stock the following month. In supervised learning the algorithm learns to estimate the probability p(y|x), where y is the output of the algorithm given the variable x. If the data instead is unlabeled, it is referred to as unsupervised learning. Unsupervised learning is often used to make sense of unlabeled data, to cluster pieces of data together or to detect anomalies.

When using a machine learning algorithm, one often desires either to classify something, for example whether there is a dog in an image, or to produce a continuous output such as the price of a house. These two tasks are referred to as classification problems and regression problems.

A common problem in machine learning is overfitting or underfitting. Underfitting happens when a model fails to perform well; it can happen because the model is too simple for the problem. A model is overfitting when it performs very well on the training data but fails to generalize to new examples.

When training a model you want to make sure it generalizes well to new examples. A common way to make sure the model achieves this is to divide the data set into three parts: a training set, a validation set and a test set. The training set is the set that the model will learn from, and the majority of the data set will go into this category. The validation set consists of examples from the data set that the model tests itself on to see how well it performs. The model never trains on these examples; they are only available to see how well the training is going. This set is mostly used to determine when a model is done training and for tuning hyperparameters. Finally, when a model is done with the training it is exposed to the test set; this is the set that determines how well the model generalizes. The test set contains examples never seen by the model except during the test. When evaluating different models one commonly uses the same test set for all models, to compare them and see which one performs best.

2.2 Neural Networks

Neural networks are a machine learning algorithm modelled to mimic how the human brain works. A neural network consists of neurons, or units, that take inputs from a previous layer, sum them together and then pass the result through an activation function to generate an output. The precursor to the neural network, the perceptron, consisted of a single neuron that summed up the inputs after they had been multiplied with their corresponding weights and sent the sum through an activation function to produce an output, see figure 2.1. The algorithm was introduced in the 1950s by Frank Rosenblatt[25]. A perceptron can be described by the following equation:

Y = ϕ( ∑_{i=1}^{n} W_i · X_i + b )    (2.1)

where X denotes the input, W is the corresponding weight, ϕ is the activation function and b is the bias. The bias shifts the decision boundary away from the origin and does not depend on any input value.
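As a small illustration, equation 2.1 can be written in a few lines of Python with numpy and a step activation; the AND-gate weights below are an arbitrary example chosen for this sketch, not taken from the thesis.

```python
import numpy as np

def step(z):
    # Step activation used by the original perceptron: fire (1) if z > 0, else 0.
    return 1.0 if z > 0 else 0.0

def perceptron(x, w, b):
    # Equation 2.1: Y = phi(sum_i W_i * X_i + b)
    return step(np.dot(w, x) + b)

# Example: weights and bias chosen so the perceptron acts as a logical AND gate.
w = np.array([1.0, 1.0])
b = -1.5
print(perceptron(np.array([1.0, 1.0]), w, b))  # 1.0
print(perceptron(np.array([0.0, 1.0]), w, b))  # 0.0
```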

Figure 2.1: An overview of the structure of the perceptron.

The activation function is a function that takes the sum of a neuron and decides if this neuron should 'fire', or be activated. The activation function used for the first perceptron was the step function, which mapped the weighted sum of the neuron to either 0 or 1. Today there are a few other activation functions that are used in almost every network, see figure 2.2. The sigmoid function takes an input and produces an output anywhere between 0 and 1. It is commonly used in the last layer of a network for classification problems, to produce a probability of the input belonging to a specific class. The Tanh activation function is similar to the sigmoid function but maps the input from -1 to 1, which produces a stronger gradient when performing backpropagation. The two other most used activation functions are the rectified linear unit, ReLU, and leaky ReLU[22][19]. Both functions leave positive inputs untouched but suppress negative inputs. The non-leaky ReLU (figure 2.2 (b)) maps all negative inputs to zero, while the leaky ReLU function just suppresses them close to 0. By not being zero they still have a gradient, which can be helpful to avoid problems during training.

Figure 2.2: Activation functions: (a) Leaky ReLU, (b) ReLU, (c) Sigmoid, (d) Tanh.
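For concreteness, the activation functions in figure 2.2 can be written with numpy as below; this sketch is purely illustrative and the leaky slope of 0.01 is an assumed default, not a value from the thesis.

```python
import numpy as np

def sigmoid(z):
    # Maps any input to the interval (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    # Like sigmoid but maps to (-1, 1), giving stronger gradients.
    return np.tanh(z)

def relu(z):
    # Leaves positive inputs untouched, maps negative inputs to zero.
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):
    # Negative inputs are suppressed towards zero but keep a small gradient.
    return np.where(z > 0, z, alpha * z)

z = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(z), leaky_relu(z), sigmoid(z), tanh(z))
```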

The first neural networks consisted of more than one perceptron stacked on top of or next to each other, see figure 2.3. These networks are called feed-forward networks or multi-layered perceptrons, MLPs. They consist of an input layer that connects to each neuron in the next layer. The layers between the input and output layers are called hidden layers, where each neuron is connected to all neurons in the previous layer and all neurons in the next layer. A feed-forward network with only one hidden layer, provided with a sufficient amount of neurons, has been shown to be able to approximate any nonlinear function[13]. A feed-forward network can have an arbitrary number of hidden layers. A network with two or more hidden layers is usually referred to as a deep network, which has coined the term often used when referring to neural networks today: Deep Learning.


Figure 2.3: An example of a multi layered perceptron.

When an input is passed through a network, or propagates forward, and produces an output, we refer to it as a forward pass. A newly constructed network would probably produce an arbitrary and useless output, because the weights in the network have not yet been optimized for the problem. To make the algorithm produce an output better fit for the data we have to optimize the weights with respect to the training data. This is what is referred to as training the network.

To train the network, after every forward pass we compare the output to the correct label for the corresponding input and update the weights so they approximate the targets' labels better. To do this we have to have some way of measuring how far off the output of the network is from the correct label. Therefore we define a loss function that tells us how good, or bad, the output was. The most common loss functions are the least square error, LSE, and least absolute error, LAE. The key difference between the two is that the LSE loss penalizes large errors more heavily than the LAE loss. The goal of the training is to minimize this error. Computing the gradient of the loss function tells us in which direction to move the weights of the last hidden layer. By taking advantage of the chain rule of derivatives we can find each neuron's gradient and propagate backwards through all hidden layers. Once all gradients are found we can update the corresponding weights by subtracting the gradient, multiplied by a predefined learning rate, from the current value of the weight, a procedure known as gradient descent. This is done for each example in the training set over and over until the loss of the network converges or the training is stopped manually. The algorithm above that describes how to calculate the gradients is known as backpropagation, and it is the standard approach to optimizing the weights of a network[10].
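As a toy illustration of gradient descent, the snippet below performs the weight update for a single linear neuron under a squared error loss; the data, learning rate and function names are assumptions made only for this example.

```python
import numpy as np

def gradient_step(w, b, x, target, learning_rate=0.1):
    output = np.dot(w, x) + b                 # forward pass
    error = output - target                   # derivative of 0.5 * (output - target)^2
    grad_w = error * x                        # chain rule: dL/dw = dL/dout * dout/dw
    grad_b = error
    # Update: subtract the gradient scaled by the learning rate.
    return w - learning_rate * grad_w, b - learning_rate * grad_b

w, b = np.zeros(2), 0.0
for _ in range(100):
    w, b = gradient_step(w, b, np.array([1.0, 2.0]), target=3.0)
print(w, b)  # converges towards weights that map [1, 2] to 3
```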

Neural networks are scalable, and adding more hidden layers and units is easy. Adding more units allows the network to describe more complex functions, but it also means that neural networks have a tendency to easily overfit the training data. Regularization techniques can be used to avoid overfitting. When dealing with neural networks the most common regularization technique is dropout[12]. Dropout simply means that some neurons are dropped and not used during some iterations of the training phase. This forces other neurons to activate instead, and is a way to prevent relatively few neurons from controlling the output. Dropout is only used during training; once the model has been trained all neurons must be available.

2.3 Convolutional Neural Networks

Computer vision is a subfield of artificial intelligence often highly connected to machine learning and neural networks in particular. Computer vision deals with how computers can gain a high-level understanding of digital images and videos. It seeks to mimic the capabilities of the human visual system. Nowadays convolutional neural networks, CNNs, are the standard approach to many of the tasks within computer vision. Since their breakthrough in the last decade, convolutional neural networks have achieved groundbreaking results in computer vision tasks such as image classification and object detection.

CNNs are a type of deep, feed-forward neural network. The differences from ordinary neural networks are the convolutional layer and how the inputs are fed into the network. Suppose that we give an image as input to a neural network. In a regular neural network every pixel in an image can be seen as a feature and an input. This is computationally very heavy, since a relatively low quality image of size 600x600x3 will have more than one million inputs.

CNNs handle the inputs in a different way and make use of convolutional layers, which are far more effective for this type of input. Convolutional neural networks are today the standard approach for computer vision tasks such as classification and object detection.

2.4 Architecture of a CNN

This section covers the commonly used layers in CNNs.

Figure 2.4: Typical architecture for a convolutional neural network

2.4.1 Convolutional Layer

Section 2.3 explained how standard neural networks are not optimal when working with images. So instead of using each pixel as an input to each neuron, we can use a convolutional layer and slide, or convolve, over the image, which is far more effective. A convolutional layer can be seen as a stack of filters that take in a 3D volume and output another 3D volume of feature maps. Each filter in the layer has a corresponding feature map in the output. For example, let a filter, also often referred to as a kernel, be a 3x3 matrix. To produce a feature map we multiply this filter element-wise at some starting position and sum up the products. The output of each operation corresponds to one cell in the feature map. After each operation we move the filter with a fixed stride and perform the same operation, until we have slid over the whole input and produced a complete feature map. An example of this procedure can be seen in figure 2.5. Since all inputs are convolved with the same filters, which means all inputs share the same weights, we use far fewer parameters than a regular neural network. This allows us to add more filters and make deeper networks. Through the training process the filters in the network learn features that previously had to be hand-engineered.

Figure 2.5: Convolutional operation

In some cases when using convolutional layers it is important to keep the spatial dimension of the input. If we convolve an input with dimensions 10x10 with a filter of size 3x3, the output would be a feature map of size 8x8. To keep the dimensions of the output the same as those of the input we can pad the sides of the input with zeros, a procedure known as 'zero-padding'. This keeps the spatial dimensions after the convolutional layer. See figure 2.6 for an example.

Figure 2.6: Convolutional operation with zero padding.
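A naive single-channel version of the operation in figures 2.5 and 2.6 can be sketched as follows; it is purely illustrative (stride 1, no batching) and not how frameworks actually implement convolutions.

```python
import numpy as np

def conv2d(image, kernel, zero_pad=0):
    # Optionally pad the borders with zeros to keep the spatial dimensions.
    if zero_pad:
        image = np.pad(image, zero_pad)
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # Element-wise multiply the filter with the patch and sum the products.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.random.rand(10, 10)
kernel = np.random.rand(3, 3)
print(conv2d(image, kernel).shape)              # (8, 8) without padding
print(conv2d(image, kernel, zero_pad=1).shape)  # (10, 10) with zero padding
```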

2.4.2 Pooling Layer

In between convolutional layers it is common to use a pooling layer. A pooling layer has the role of sub-sampling the input to a lower dimension without losing important information. By decreasing the dimensions we lower the number of computations and parameters in the network, while also controlling overfitting. The most common pooling operation is max pooling, where from a spatial region we take the largest number and discard the rest, see figure 2.7. There are other, nowadays less common, pooling procedures such as average pooling and sum pooling, but max pooling is by far the most used. The pooling layer is applied in the same way as a convolutional layer: we have a pooling window of fixed size that we slide with a fixed stride across the input. Most common is a 2x2 window with a stride of 2. The output of such a layer has half the width and half the height of the input.

Figure 2.7: Max pooling
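A small sketch of 2x2 max pooling with stride 2, as in figure 2.7; the function name and the toy input are chosen for illustration only.

```python
import numpy as np

def max_pool_2x2(feature_map):
    # Each output cell keeps the largest value of its 2x2 region,
    # halving the width and height of the feature map.
    h, w = feature_map.shape
    out = np.zeros((h // 2, w // 2))
    for i in range(0, h - 1, 2):
        for j in range(0, w - 1, 2):
            out[i // 2, j // 2] = np.max(feature_map[i:i + 2, j:j + 2])
    return out

fm = np.arange(16, dtype=float).reshape(4, 4)
print(max_pool_2x2(fm))  # 2x2 output with the maximum of each region
```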

For some convolutional networks, such as autoencoders or generative adversarial networks, other pooling procedures are used, like unpooling, which can be seen as up-sampling and the opposite of regular pooling. An unpooling layer takes in a 3D volume and outputs a 3D volume with a larger width and height. There are different methods for how the upsampling can be done. In figure 2.8 an example of unpooling is shown with a stride of 2 and a kernel size of 2x2. In this case each cell from the smaller matrix is expanded to a 2x2 matrix, with the value from the original cell placed somewhere in the new matrix; the position in the new matrix depends on the method used. If the network has a corresponding max pooling layer, the positions of the maxima can be saved and used later in the unpooling layer.

Figure 2.8: Unpooling

2.4.3 Fully Connected Layers

The last layers of a CNN are typically fully connected. The last convolutional layer is flattened so that it can be seen as a long vector, which can then be connected to a fully connected layer. The role of the fully connected layers is to make sense of the previous convolutional stages and come up with a prediction. Normally one to three layers are stacked next to each other, with the last layer having as many neurons as there are classes to predict. Each neuron is responsible for the prediction of its corresponding class. If the task on the other hand is to detect an object, the output layer might have additional neurons responsible for the prediction of the size of the bounding box of the object.

2.4.4 Activation Functions

Activation functions were described in section 2.2. In CNNs a few activation functions are more frequently used, depending on where they are placed in the network. ReLU or leaky ReLU are the most used in the convolutional layers. Fully connected layers, except for the last layer, often use ReLU as the standard activation. The last layer, which controls the output of the network, has different activations depending on the task. CNNs used for classification use the sigmoid function for single-class classification. For multiclass classification the activation depends on the labels. If an image can belong to more than one class, sigmoid is also used, since it maps all outputs to the interval [0,1]. If an image can only belong to one class, the softmax activation is used. It is similar to a sigmoid function in that it maps to the same interval, but it makes the outputs of all neurons in the layer add up to 1.
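The softmax behaviour described above can be illustrated in a few lines; the example scores are made up, and the max-shift is a common numerical-stability trick rather than something prescribed by the thesis.

```python
import numpy as np

def softmax(logits):
    # Map raw scores (logits) to probabilities that sum to 1.
    shifted = logits - np.max(logits)  # shift for numerical stability
    exps = np.exp(shifted)
    return exps / np.sum(exps)

scores = np.array([2.0, 1.0, 0.1, -1.0])
probs = softmax(scores)
print(probs, probs.sum())  # class probabilities, summing to 1.0
```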

2.4.5 Batch Normalization

Introduced by Ioffe and Szegedy in 2015, batch normalization has become a popular operation[14]. It is placed between layers: it takes the output of the previous layer and normalizes it with regard to the current batch mean and standard deviation, before sending the result to the next layer as input. Batch normalization is used since it allows the learning rate to be higher, makes the network less sensitive to weight initialization and works as a regularization operation.
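A minimal sketch of what a batch normalization layer computes at training time; the fixed gamma and beta values here are placeholders for what would normally be learned parameters.

```python
import numpy as np

def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    # Normalize each feature over the current batch, then scale and shift.
    mean = batch.mean(axis=0)
    var = batch.var(axis=0)
    normalized = (batch - mean) / np.sqrt(var + eps)
    return gamma * normalized + beta

activations = np.random.randn(16, 8) * 3.0 + 5.0  # batch of 16 examples, 8 features
out = batch_norm(activations)
print(out.mean(axis=0).round(3), out.std(axis=0).round(3))  # roughly 0 and 1
```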

2.5 Transfer Learning

Training a large CNN from scratch can be costly in terms of time and computing power, and it requires a vast amount of data; training large networks on huge data sets takes multiple days with multiple high end GPUs. For smaller projects that do not have a sufficient amount of data, transfer learning can be a solution. We make use of pre-trained networks and retrain, or fine tune, them to fit the desired task. Since the convolutional layers in the pre-trained network have hopefully learned to recognize useful high and low level features, we can make use of them. To do this we freeze the weights of the convolutional layers and only run backpropagation on the last fully connected layers that control the output. Thus, we only need to make one small modification to the last layer to fit the purpose of our task. For instance, COCO has 91 different categories, which means 91 neurons in the last layer; this needs to be changed to the number of categories in the current classification task.
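As a rough sketch (not the thesis configuration), freezing a pre-trained Inception base and retraining only a new classification head could look like this in tf.keras; the input size, optimizer and the four output classes are assumptions made for illustration.

```python
import tensorflow as tf

# Pre-trained Inception network without its original classification head.
base = tf.keras.applications.InceptionV3(weights="imagenet", include_top=False,
                                          input_shape=(299, 299, 3))
base.trainable = False  # freeze the convolutional layers

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(4, activation="softmax"),  # e.g. three symbols + 'Nothing'
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(...) would then only update the weights of the new fully connected head.
```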

The ImageNet Large Scale Visual Recognition Challenge, ILSVRC, has been held every year since 2010[26]. This is a competition where researchers compete to achieve the highest accuracy on a given data set in different computer vision tasks. In 2012 Alex Krizhevsky et al.[16] won the image classification competition with their network, AlexNet, achieving a 15% top-5 error rate and beating the runner-up by over 10%. This is seen as the breakthrough of deep convolutional networks. Since the release of AlexNet the competition has seen its record broken multiple times.

In 2014 Google won the ILSVRC with the network GoogLeNet, a 22 layer deep network[29]. Their network achieved a top-5 error rate of 6.67%. What was groundbreaking was not only the error rate but also the improved utilization of computing resources. While having a deeper architecture and higher accuracy than AlexNet, it used twelve times fewer parameters. The main contribution to this was the so called 'Inception' module, see figure 2.9. The Inception module introduced a new type of architecture for convolutional blocks: each module forms a small network of its own, with multiple convolutional layers of different kernel sizes that are concatenated into a final output.


Figure 2.9: The Inception module

Since the introduction of the Inception network, Google has released multiple pre-trained versions and made them publicly available.

Microsoft Research entered ILSVRC in 2015 and won not only the classification competition but also the semantic segmentation and object detection competitions, with their deep residual network, ResNet[11]. ResNet was by far the deepest network in the competition, with over a hundred layers. The interesting thing with ResNet was that it introduced the new concept of residual blocks. Residual blocks are built from blocks of convolutional layers. Nowadays there are different implementations of residual blocks, but the ones in the original network consisted of two convolutional layers where the input is added to the output after being fed through the convolutional layers, see figure 2.10. This type of architecture forces each block to learn something new from the previous layer. Residual blocks also help with the vanishing gradient problem, a common problem for deep networks, and allow for deeper networks.

Figure 2.10: An example of a residual block

2.5.1 Classification & Object Detection

Humans can without problem distinguish objects from each other, regardless of viewpoint or distance to some degree. We can easily recognize objects even if they are partially truncated. Although many breakthroughs have happened in the last couple of years, this is still a problem for computers.

Image classification means extracting information from a given image and predicting which class, or classes, the image belongs to. It does not matter where in the image the important pieces of information are; the whole image is classified as one. Object detection, on the other hand, has the task of localizing where an object is found in the image and giving it a bounding box surrounding the object. Nowadays both tasks are solved using convolutional neural networks, but with somewhat different approaches. Image classification networks commonly have multiple convolutional layers followed by pooling layers, and lastly some fully connected layers to make a prediction. These networks take a full image as input.

Object detection could be done in a similar fashion, but instead of using the whole image as input we can slide over the image and use a small patch of the image as input. This technique is known as 'sliding window'. But sliding over the image and feeding each patch to the network is computationally costly.

Making object detection faster has been an active research area in recent years. R-CNN gained a lot of attention with an approach that used multiple networks to identify where an object could be found in the image[8]. Since the release of R-CNN, faster variants have been built[7][24]. Another approach that is commonly used today is the single shot multibox detector, SSD[18]. In contrast to R-CNN it uses a single network, which makes it much faster. Multiple bounding box predictions at progressively smaller scales are made by the last layers of the SSD network, and the final prediction is the union of all these predictions.

2.6 Generative Models

A very active research area in the field of deep learning is generative models. The aim of a generative model is to learn to generate data from some known distribution or to translate data from one domain to another. The most common generative models are Variational Autoencoders and Generative Adversarial Networks.

2.6.1 Generative Adversarial Networks

Generative adversarial networks, commonly referred to as GANs, were introduced in 2014 by Ian Goodfellow[9]. A simple GAN consists of two networks, a generator and a discriminator, competing against each other. The generator tries to generate data that approximates some real distribution, while the discriminator determines what is generated data and what is data from the real distribution. The generator can be seen as an art forger that tries to replicate an artist, and the discriminator as an art expert that separates real paintings from the fake ones. Both networks learn from each other's output and get better and better, until ultimately the discriminator can no longer distinguish real data from fake.


Figure 2.11: Simple overview of a GAN. The generator, G, generates images from a random distribution, Z, and the discriminator tries to distinguish them from some real target domain.

2.6.2 GAN Architectures

The discriminator is often an ordinary CNN, with some architectural differences depending on the task. Usually the discriminator outputs a single value between 0 and 1 describing its confidence that the image comes from the real distribution. The generator in the first GAN consisted of a random distribution connected to fully connected layers. In 2015 Radford et al.[23] introduced the DCGAN, which consisted of a random distribution followed by convolutional layers and yielded better looking images. Using convolutional layers is nowadays the standard approach for generating images.

GANs are not only used for generating images but also for translating images between different domains. For these networks the generator is often replaced by an autoencoder, since the output images are not generated from scratch but produced by modifying pre-existing images. An autoencoder is a CNN consisting of an encoder network and a decoder network. The encoder tries to represent the image in some latent space, and the decoder tries to replicate the original image, possibly with some desired modification, from the latent space. It can be used for image compression or to colorize black and white images.

2.6.3 Training a GAN

GANs are notoriously hard to train, and several papers have been published on this topic; up to now there is no standard recipe that works for all networks[2][27]. They are trained using backpropagation like all other neural networks. Since GANs are made up of multiple networks, they come with some particular problems. Tuning one network can be hard, which makes tuning multiple networks simultaneously even harder. It is important not to let one network become too good relative to the other. If the generator is trained too heavily, it might discover some weakness in the discriminator and produce the same output regardless of input. On the other hand, if the discriminator gets too good at distinguishing between real and fake, it won't produce any good gradients for the generator and the networks will probably not converge. Finding the right balance when alternating the training between the networks can be very challenging.

The discriminator is often an ordinary CNN and is therefore trained as one. Its objective is to minimize the probability of a fake sample being predicted as real and maximize the probability of a real sample being predicted as real. For the generator the loss function varies a lot depending on its task. The network can have multiple different losses, sometimes combined. The typical loss for the generator in a GAN is often referred to as the adversarial loss and comes from the output of the discriminator, since the generator wants to maximize the probability of the generated data being labelled as real. For image translation tasks it is often desired to keep the structure of the input image but have it modified in some way. To do this another loss function has to be added. This could be the pixel-wise absolute difference between the input and output of the generator, or a cyclic loss as in CycleGAN. CycleGAN was introduced in 2017 by Jun-Yan Zhu et al.[34] and has gained a lot of attention in this field. The architecture behind this network consists of two generators and two discriminators, one for each domain. The idea is that an image should ideally be able to be translated back and forth between the two domains: if an image from one domain is fed through the first generator, and the output from that generator is fed to the second generator, the result should ideally be the same as the original image. The difference is what is called the cyclic loss.

2.7 Data sets

To become good at prediction a CNN needs a lot of images of the desired object to learn from, and the more variation the better. Collecting well-annotated data can be both costly and time consuming. Luckily, there are many publicly available data sets.

Sometimes no data, or not enough data, containing the objects of interest is available to train a network and make it generalize well enough. There might not exist a public data set, and manually searching for and finding relevant images is time consuming and sometimes not even possible if no public images exist. What can be done in this case? We can try new ways to either replicate data, generate new data or augment the existing data.

2.7.1 Data Augmentation

The most common operation when your data supply is limited is to augment the data you already have in different ways. It has been shown that cropping the image at random places (without losing important information in the image) and adjusting the colors or brightness in the image can improve the network training[30]. It prevents the network from overfitting the training data. This is usually a good approach regardless of data set, since it helps the network generalize better after it has been exposed to more variation.
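A small sketch of such augmentations using TensorFlow's image ops; the crop size, brightness and saturation ranges are arbitrary choices for illustration, and the input is assumed to be larger than the crop size.

```python
import tensorflow as tf

def augment(image):
    # Random crop keeps most of the image; flip, brightness and saturation
    # jitter then add variation between training examples.
    image = tf.image.random_crop(image, size=[200, 200, 3])
    image = tf.image.random_flip_left_right(image)
    image = tf.image.random_brightness(image, max_delta=0.2)
    image = tf.image.random_saturation(image, lower=0.8, upper=1.2)
    return image

# Typically applied on the fly, e.g. dataset = dataset.map(augment)
```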

2.7.2 Synthetic Data

Synthetic data refers to data that have not been obtained by direct measurement. It is insteaddata that have been created artificially.

2.7.2.1 Rendered 3D Models

Rendered 3D models from software such as Blender or 3ds Max can today be indistinguishable from real photos for a human. We can take advantage of this when collecting data. Models of desired objects can be modelled and rendered, and then either cut into an ordinary image or used as a standalone image. These images can then be used as synthetic data. An advantage of working with rendering software is the ability to automatically get information about the positions of objects in the picture. This makes labelling bounding boxes and classifying images an easier and less time consuming task.

2.7.2.2 Cut & Paste Objects

Another way to expand a data set with synthetic data is to cut the desired objects out of images containing them and paste them onto other images. With this technique we get the same advantage regarding image labelling as with rendered images, once all desired objects have been cut out.

2.8 Evaluation Metrics

To evaluate how well a model works, we need some kind of metric. In machine learning, when working with classification algorithms, we often use accuracy, recall and precision. When defining those metrics we use the following terms:

• True Positive (TP) - An image correctly predicted as belonging to a specific class.

• True Negative (TN) - An image correctly predicted as not belonging to the specific class.

• False Positive (FP) - An image wrongly predicted as belonging to the specific class.

• False Negative (FN) - An image wrongly predicted as not belonging to the specific class.

Accuracy is the number of correctly classified data points divided by the total number of data points. It is defined as:

Accuracy = (TP + TN) / (TP + TN + FP + FN)    (2.2)

Precision is a measure of correct predictions of a class divided by the total number of predic-tions for that class. A high precision value indicates a low number of false positives. Precisionis defined as:

Precision = TP / (TP + FP)    (2.3)

Recall is the fraction of correct predictions for a class divided by the number of total instancesfor that class. A high recall value indicates a low number of false negatives. Recall is definedas:

Recall = TP / (TP + FN)    (2.4)

A high value for precision or recall individually is not equivalent to a good score, but combined they give a good indication of how well a model has performed.
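For reference, equations 2.2-2.4 translate directly into a few lines of Python; the counts below are made-up numbers, only there to show the calls.

```python
def accuracy(tp, tn, fp, fn):
    # Equation 2.2: share of all predictions that were correct.
    return (tp + tn) / (tp + tn + fp + fn)

def precision(tp, fp):
    # Equation 2.3: share of positive predictions that were correct.
    return tp / (tp + fp)

def recall(tp, fn):
    # Equation 2.4: share of actual positives that were found.
    return tp / (tp + fn)

tp, tn, fp, fn = 40, 130, 10, 20
print(accuracy(tp, tn, fp, fn), precision(tp, fp), recall(tp, fn))
```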

2.9 Related Work

2.9.1 Synthetic Data

The cost of computational power and storage has decreased dramatically in the last decades, which has paved the way for CNNs to train on large amounts of data. The problem, however, is the availability of large, well-annotated data sets. Previously, researchers have themselves been the ones putting data sets together and manually annotating the data for their task. This is time consuming work, and researchers have been looking for better ways to gather well-annotated data sets. Some have been using Amazon's Mechanical Turk to crowdsource tasks such as labelling data. This, however, might not be a perfect solution, since data sets often require some kind of expertise to label the data correctly. Researchers have also been looking into using synthetic data as a substitute for real data. Movshovitz-Attias et al. used 3D models of cars rendered in different settings that they pasted onto random images from the PASCAL data set[21]. They used this method to train a model to estimate the viewpoint, and showed that combining rendered images with a small amount of real data improved the accuracy of their model. Dwibedi et al. looked into the possibility of reusing objects from images[4]. By cutting out desired kitchen objects from real images, pasting them onto backgrounds with a kitchen environment and combining them with a small amount of real data, they showed that they could beat models trained exclusively on real images.

2.9.2 Image Translation and Style Transfer

Gatys et al.[6] showed in 2015 that, using features learned by convolutional layers in a CNN, it is possible to change the style of an image. In their paper they show examples where regular photos are translated to look like paintings. With the gained popularity of GANs, new techniques for image translation have appeared.

Isola et al.[15] and Liu et al.[17] showed promising results on image translation tasks. Isola et al. used paired images in different domains, for example to translate black and white images to color, while Liu et al., like CycleGAN, used unpaired images and showed how images could be translated between domains.

In a paper from Apple, Shrivastava et al.[28] combined synthetic data and image translation. They used what they call a refiner network, which took a rendered image as input and tried to add realism to it, together with a discriminator whose mission was to separate the refined images from real images. They showed that, using this technique for generating training data, they were able to achieve higher accuracy than with exclusively real images.


3 Method

This chapter will walk through how the tests of the thesis work were carried out and which techniques were used.

3.1 Frameworks, Software & Hardware

All tests were carried out on a computer with Ubuntu 16.04 and an Nvidia GeForce GTX 1080 Ti GPU. All scripts were written in Python and all neural networks were taken from the TensorFlow library. Blender was used for 3D modelling and rendering of images and videos. Gimp was used for image processing.

3.2 Data sets

The tests in this thesis focused on trying to detect certain symbols in images. The symbols of interest were the swastika, the ISIS flag and the flag of the Nordic Resistance Movement. No data set containing images with these symbols was publicly available. For the classification tests a category with images labelled as 'Nothing' was used; these images were randomly picked from the COCO data set.

3.2.1 Real Images

A data set consisting of real images with four different classes was collected. Images containing the objects to be detected were found by searching for keywords on Google Images. Regular images containing no objects of interest were taken from the COCO data set. A total of 412 real images with desired objects were collected. Among these, 172 images contained the ISIS flag, 97 images contained the swastika and 143 images contained the flag of the Nordic Resistance Movement. 50 images per category were used as a test set and the rest were used for training. No validation set was constructed, since removing more images from the already quite small data set would make the results on the test set less reliable. The same test set was used for all tests.


Figure 3.1: Example of real images containing desired objects: (a) ISIS, (b) swastika, (c) Nordic Resistance Movement.

Figure 3.2: Example of images labelled as 'Nothing'. The images are taken from the COCO data set.

3.2.2 3D Rendered Images

A 3D model of a flag was modelled in Blender. To make the flag look like it was blowing in the wind, a physics simulation was added. The model was rendered with three different textures for each category. Each rendering output 100 images of a flag on a transparent background. These images were then pasted onto regular images from the COCO data set. All images of flags were randomly cropped before being added to another image. A Python script was written to be able to generate a desired amount of images easily.
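A minimal sketch of the pasting step using Pillow; the file paths, scale range and placement logic are illustrative assumptions, not the thesis script.

```python
import random
from PIL import Image

def paste_flag(flag_path, background_path, out_path):
    # The rendered flag has a transparent background (RGBA), so its alpha
    # channel can be used directly as the paste mask.
    flag = Image.open(flag_path).convert("RGBA")
    background = Image.open(background_path).convert("RGB")

    # Random scale and position give some variation between generated images.
    scale = random.uniform(0.3, 0.7)
    flag = flag.resize((int(flag.width * scale), int(flag.height * scale)))
    x = random.randint(0, max(0, background.width - flag.width))
    y = random.randint(0, max(0, background.height - flag.height))

    background.paste(flag, (x, y), mask=flag)
    background.save(out_path)
    # The bounding box (x, y, x + flag.width, y + flag.height) can then be
    # written to a text file to label the image automatically.
```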


Figure 3.3: Example of the different flags rendered in Blender. Three different textures were used for each flag.

3.2.3 Cut & Paste

For 10 images from each category in the training set of real images, the objects of interest were cut out and saved as separate images with transparent backgrounds using Gimp. The same procedure regarding augmentations was then applied as above with the images rendered from a 3D model.

Figure 3.4: Flags pasted into real images.


3.2.4 Grand Theft Auto V

For one of the tests a large rendered data set was needed. Since no such data set was available, one had to be made. I decided to take images from Grand Theft Auto V, since it has in recent years been praised for its realistic graphics and it takes place in a realistic environment containing only real life objects. A data set containing 14,966 still images from the campaign mode and video scenes was collected by grabbing three frames per second from one hour of game play.

Figure 3.5: Images taken from game play of Grand Theft Auto V.
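A sketch of grabbing three frames per second from recorded game play with OpenCV; the paths, naming scheme and helper name are assumptions for illustration, and the thesis does not specify how the frames were extracted.

```python
import cv2

def grab_frames(video_path, out_dir, frames_per_second=3):
    capture = cv2.VideoCapture(video_path)
    fps = capture.get(cv2.CAP_PROP_FPS)
    step = max(1, int(round(fps / frames_per_second)))
    index, saved = 0, 0
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        if index % step == 0:
            # Save every step-th frame as a still image.
            cv2.imwrite(f"{out_dir}/frame_{saved:06d}.jpg", frame)
            saved += 1
        index += 1
    capture.release()
    return saved
```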

3.2.5 Labelling

All generated images were automatically labelled and had a corresponding text file containing information about the bounding box of the object in the image. The real images that were collected were labelled and annotated by hand.

3.3 Tests

This section will walk through the tests conducted in this thesis.

3.3.1 Test 1: Classification

Since I was working with a small data set I decided to use a pre-trained network. TensorFlow is written by Google, and they have released their pre-trained Inception network together with code written for transfer learning. Because of this availability, and since Inception is known to be one of the state-of-the-art models, it was chosen for this test.

First, a test was carried out to see how well rendered images performed depending on background and augmentations. Four data sets were composed and compared: rendered objects on a black background, rendered objects on random background noise, rendered objects with a random image from the COCO data set as background, and the same as the previous with some random augmentations. The Inception net was trained on the data sets for a thousand steps, one step being equivalent to one batch's forward propagation through the network, which was enough for the network to reach a high accuracy on the training images. No modifications were made to the default configuration of the network.

The next test was more thorough, using more data sets to evaluate how well synthetic data works as training data compared to real images. Multiple data sets of different sizes were constructed, using both rendered images and objects cut out from real images. The data sets that did not perform well in the previous test were discarded. A set of real images was used as a benchmark. The complete list of data sets for this test is shown in table 3.1. This test used the same configuration as the previous one; no modifications were made to the default settings.

Table 3.1: Data sets used for the classification test.

Data set                         Synthetic images / category
Rendered Images                  300, 600, 900, 1500
Cut & Paste                      300, 600, 900, 1500
Rendered + Cut & Paste           300, 600
Real Images + Rendered           150, 300, 600
Real Images + Cut & Paste        150, 300, 600
Real + Rendered + Cut & Paste    300, 600
Real Images                      -

3.3.2 Test 2: Object Detection

TensorFlow comes with an Object Detection API that is easy to adjust for one's own purposes. The API has multiple pre-trained networks, but I decided to use Inception again for consistency. As explained in the theory chapter, the network for classification differs from the network for detection. I decided to go with a single shot multibox detector network, since it is faster than R-CNN while achieving similar results.

The data sets scoring low in the previous tests were discarded and not taken into account this time, since retraining the network for each data set was considered too time consuming. The data sets used for the test are listed in the table below.

Table 3.2: Data sets used in Object Detection test.

Data set                         Synthetic images / category
Rendered Images                  900, 1500
Cut & Paste                      900, 1500
Rendered + Cut & Paste           300, 600
Real Images + Rendered           300, 600
Real Images + Cut & Paste        300, 600
Real + Rendered + Cut & Paste    300, 600
Real Images                      -

3.3.3 Test: Adding Realism to Rendered Images

Comparing images from GTA V and real images, it is easy to spot which ones are real and which ones are rendered. To add realism to the rendered images a generative adversarial network was set up. The architecture was set up as a CycleGAN with two generators and two discriminators. Each generator consisted of two down-sampling blocks with two convolutional layers and one max pooling layer, three residual blocks, and two up-sampling blocks that mirrored the down-sampling. The down-sampling layers started with 32 filters and doubled for each layer. The number of filters was kept the same throughout the residual blocks. The last layer was a convolutional layer with three filters, to make the output have the correct image dimensions. All convolutional layers used the ReLU function as activation except the last layer, which used the sigmoid function to map the output image between zero and one. All layers except the last used batch normalization before the activation function.


The discriminators consisted of five convolutional layers, with the number of filters starting at 64 and doubling for each layer. The last layer produced a one-dimensional 16x16 feature map with probabilities of the image being real. All activation functions were ReLU functions except for the last layer, which used the sigmoid function. All convolutional layers in the networks had a filter size of 3 and a stride of 1. All inputs to convolutional layers were zero-padded to keep the dimensions after the convolution operation. The loss function for the discriminators was defined as:

L_D = L_real + L_fake    (3.1)

L_real = 1 - D(y)    (3.2)

L_fake = D(G(x))    (3.3)

where D is the discriminator, G is the generator, x denotes a fake (rendered) image and y a real image. The loss for the generators was defined as:

L_G = L_adv + λ L_cyclic    (3.4)

L_cyclic = |x - G1(G2(x))|    (3.5)

L_adv = 1 - D(x)    (3.6)

where the cyclic loss is the same as described in section 2.6.3.
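Written out as code, the loss terms could look roughly like the sketch below. The reduction of the discriminator's 16x16 output map to a scalar mean is an assumption, the default weight lam is a placeholder, and the adversarial term is here evaluated on the translated image, which is one way to read equation (3.6).

import tensorflow as tf

def discriminator_loss(D, real_images, translated_images):
    # L_D = L_real + L_fake  with  L_real = 1 - D(y)  and  L_fake = D(G(x)).
    # The discriminator outputs a 16x16 probability map, so we average it.
    l_real = tf.reduce_mean(1.0 - D(real_images))
    l_fake = tf.reduce_mean(D(translated_images))
    return l_real + l_fake

def generator_loss(D, G_forward, G_backward, x, lam=10.0):
    # L_G = L_adv + lambda * L_cyclic, where the adversarial term is evaluated on
    # the translated image and the cyclic term reconstructs the original input.
    translated = G_forward(x)
    l_adv = tf.reduce_mean(1.0 - D(translated))
    l_cyclic = tf.reduce_mean(tf.abs(x - G_backward(translated)))
    return l_adv + lam * l_cyclic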

The data set used for this test consisted of 14,966 images from Grand Theft Auto V and 15,000 random images from COCO. The GTA V images were used to investigate the possibility of adding realism; in case of success, new 3D scenes containing the objects from the other tests would be modelled and rendered.

3.3.4 Test: Flags on Randomly Generated Background

Inspired by the paper from Apple, see section 2.9, where a refiner network was used to translate rendered images to real images, this test tried to replicate their approach with some modifications. In their work they had two data sets of pictures of eyes, one containing real images and another containing rendered ones. Since I did not have any completely rendered images, I took a slightly different approach. Rendered images containing symbols on a transparent background were pasted onto a randomly generated image. To know the location of the symbol, a binary mask was retrieved from the rendered image. The networks in this test were one generator, an autoencoder, and one discriminator. The loss for the generator was defined as:

L_G = L_adv + λ L_reg    (3.7)

L_adv = 1 - D(x)    (3.8)

L_reg = |x - G(x)| ⊙ B    (3.9)

where B is the binary mask and ⊙ denotes element-wise multiplication. The mask was multiplied element-wise with the absolute difference between the input image and the generated image. This forced the generator to keep the structure of the symbols, while the background was not affected by this term and was only tied to the adversarial loss. The discriminator in this test used the same loss function as in the test above.

The generator had a similar architecture to the generators in the previous test. The only difference was that this network used six residual blocks in the middle. No max pooling layer was used; instead, the stride of the convolutional layers was set to 2, which works as a down-sampling method. The discriminator had the same architecture as in the previous test. The learning rate was the same for both networks and was kept at 0.0001, the batch size was set to 16, and lambda was set to 10.
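One training step for this setup could be sketched as follows. The builder functions build_refiner and build_discriminator are hypothetical placeholders for the networks described above, and the choice of the Adam optimizer is an assumption; the learning rate and lambda follow the values stated in this section.

import tensorflow as tf

generator = build_refiner()            # hypothetical builder for the autoencoder generator
discriminator = build_discriminator()  # hypothetical builder for the discriminator
g_opt = tf.keras.optimizers.Adam(1e-4)     # learning rate 0.0001 for both networks
d_opt = tf.keras.optimizers.Adam(1e-4)
LAMBDA = 10.0                              # weight of the masked regularization loss

@tf.function
def train_step(x, mask, real_images):
    # x: batch of flags pasted on randomly generated backgrounds (batch size 16),
    # mask: binary mask marking the symbol pixels in x.
    with tf.GradientTape() as g_tape, tf.GradientTape() as d_tape:
        refined = generator(x, training=True)
        # L_adv = 1 - D(G(x));  L_reg = |x - G(x)| multiplied element-wise by B.
        l_adv = tf.reduce_mean(1.0 - discriminator(refined, training=True))
        l_reg = tf.reduce_mean(tf.abs(x - refined) * mask)
        g_loss = l_adv + LAMBDA * l_reg
        d_loss = (tf.reduce_mean(1.0 - discriminator(real_images, training=True))
                  + tf.reduce_mean(discriminator(refined, training=True)))
    g_opt.apply_gradients(zip(g_tape.gradient(g_loss, generator.trainable_variables),
                              generator.trainable_variables))
    d_opt.apply_gradients(zip(d_tape.gradient(d_loss, discriminator.trainable_variables),
                              discriminator.trainable_variables))
    return g_loss, d_loss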

3.4 Evaluation Metrics

I used accuracy to evaluate how well the tests involving classification and object detection went. For the object detection test, the highest-scoring object found in an image determined the class the image belonged to. An image was labelled as 'Nothing' if no object could be found in it. Recall and precision were used to analyze the results further. When searching for threatening objects in images, it is more important to find every image that contains an object than to avoid wrongly flagging images; false positives can be filtered out manually afterwards, even though this could be tedious. Therefore, in this project a high recall value for images containing objects is important.
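A small sketch of how these metrics could be computed from the detector output is given below. The per-image detection format and the score threshold are assumptions; the highest-scoring detection decides the image label, and 'Nothing' is assigned when no object is found.

CLASSES = ["ISIS", "Nazi", "NMR", "Nothing"]

def image_label(detections, threshold=0.5):
    # detections: list of (class_name, score) pairs for one image.
    detections = [d for d in detections if d[1] >= threshold]
    return max(detections, key=lambda d: d[1])[0] if detections else "Nothing"

def precision_recall(true_labels, predicted_labels):
    metrics = {}
    for c in CLASSES:
        tp = sum(1 for t, p in zip(true_labels, predicted_labels) if t == c and p == c)
        fp = sum(1 for t, p in zip(true_labels, predicted_labels) if t != c and p == c)
        fn = sum(1 for t, p in zip(true_labels, predicted_labels) if t == c and p != c)
        metrics[c] = {"precision": tp / (tp + fp) if tp + fp else 0.0,
                      "recall": tp / (tp + fn) if tp + fn else 0.0}
    accuracy = sum(t == p for t, p in zip(true_labels, predicted_labels)) / len(true_labels)
    return accuracy, metrics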


4 Results of the Tests

4.1 Test: Classification

The results of the first test, with rendered flags pasted onto different backgrounds, are shown in table 4.1. As shown in the table, pasting the objects onto random images from the COCO data set with augmentations gave the highest accuracy. Hence, images with augmentations were used in later experiments.

Table 4.1: Accuracy for rendered flags on different backgrounds.

                           Images / Category
Data set                   300        600        900
No background              26.50%     -          -
Random background          26%        25.50%     -
COCO background            71%        70%        70%
COCO background w/ aug.    69.50%     73%        70%

The accuracy scores of the second experiment are shown in table 4.2. Precision and recall scores for the data sets that achieved the highest accuracy are shown in table 4.3. Real images achieved the highest score for an individual data set, but with either method of synthetic data added the accuracy was boosted by a couple of percentage points.


Table 4.2: Accuracy scores for the classification experiment.

Data set           Images / Category    Accuracy %
3D                 300                  69.50
3D                 600                  70
3D                 900                  73
Cut                300                  57.50
Cut                600                  55
Cut                900                  64
3D + Cut           300                  72
3D + Cut           600                  69.50
Real + 3D          150                  88
Real + 3D          300                  90
Real + 3D          600                  90
Real + Cut         150                  86
Real + Cut         300                  91.50
Real + Cut         600                  90
Real + 3D + Cut    300                  84.50
Real + 3D + Cut    600                  86
Real               -                    86

Table 4.3: Recall and precision scores for the classification experiment

                          Precision (%)                        Recall (%)
Data set                  ISIS     Nazi     NMR      Nothing   ISIS    Nazi    NMR    Nothing
Rendered                  78.72    71.15    92.59    52.16     74      74      50     92
Cut & Paste               82.86    96.77    90       43.86     58      60      36     100
Real + Rendered           90.74    89.13    89.8     88.24     98      82      88     90
Real + Cut & Paste        87.04    100      95.83    85.96     94      82      92     98
Rendered + Cut & Paste    70.8     83.72    88       60        76      72      44     96
Real + Rendered + Cut     88.46    100      81.13    82.46     82      76      84     94
Real Image                88.68    91.89    79.31    86.54     94      68      92     90

4.2 Test: Object Detection

The overall scores for all data sets were higher when using the Object Detection API. All accuracy scores can be seen in table 4.4. Precision and recall for the data sets with the highest scores are displayed in table 4.5.


Table 4.4: Accuracy scores for the object detection experiment

Data set           Synthetic Images / Category    Accuracy %
3D                 900                            79
3D                 1500                           73.87
Cut                900                            57.29
Cut                1500                           52.26
3D + Cut           300                            80.9
3D + Cut           600                            82.41
Real + 3D          300                            93.97
Real + 3D          600                            91.46
Real + Cut         300                            93.97
Real + Cut         600                            91.96
Real + 3D + Cut    300                            88.94
Real + 3D + Cut    600                            86
Real               -                              89.95

Table 4.5: Recall and precision scores for the object detection experiment

                          Precision (%)                        Recall (%)
Data set                  ISIS     Nazi     NMR      Nothing   ISIS    Nazi     NMR    Nothing
Rendered                  100      100      91       56        70      63       84     98
Cut & Paste               94.74    95.65    39.56    42.55     72      45       72     40
Real + Rendered           98       95.65    97.87    85.45     100     89       92     94
Real + Cut & Paste        90.91    97.87    97.96    94.08     100     93.88    96     89
Rendered + Cut & Paste    91.67    97.62    94.29    62.16     88      83.67    66     92
Real + Rendered + Cut     90.38    100      97.73    90.47     94      85.71    86     90
Real Image                93.88    93.48    97.83    77.59     92      88       90     90

4.3 Test: Adding Realism to Generated Images

The results of the experiment can be seen in figure 4.1 and figure 4.2. The images translated from fake to real showed no clear sign of added realism. The images translated from real to fake had visible signs of having adapted to the other domain: they became visibly smoother but often ended up with smudges of irrelevant colors. The experiment did not achieve a result promising enough to motivate another experiment with new 3D scenes containing the desired symbols.


Figure 4.1: Images before and after they’ve gone through the ’fake to real’ generator.


Figure 4.2: Images before and after they’ve gone through the ’real to fake’ generator.

4.4 Test: Flags on Randomly Generated Background

The results for the images with backgrounds generated by a generative adversarial network can be seen in table 4.6. Examples of how the images could look after they have gone through the generator are shown in figure 4.3. The images achieved worse accuracy than rendered flags pasted onto real images, which indicates that the generator failed to add realism to the flags and to generate a better background. The network never converged, which is another indicator of failure. The discriminator ended up predicting both real and fake images as real. This could be a sign that the generator became too powerful with respect to the discriminator.


Figure 4.3: Flags on backgrounds generated by a generative adversarial network.

Table 4.6: Accuracy for images generated by GAN.

Data set            Images / Category    Without Real Images    With Real Images
Generated images    300                  54.27%                 88.44%
Generated images    600                  61.81%                 79.90%


5 Discussion

The results show that a pre-trained network does not need a large amount of real images to train on in order to learn to generalize to new domains. They also show that if little data exists to begin with, synthetic data can be a good complement.

The results indicate that we can use both rendered flags and flags cut out from images pasted onto random background images as synthetic data. Neither of them scored a high accuracy when the models were trained on them exclusively, but both of them helped boost the accuracy when combined with real images and functioned as a complement to real images. Each approach has advantages over the other. For rendered images, once a model of the desired object has been created it is very easy to render it in different ways; changing the texture, lighting, wind speed, rotation and material is easily done and gives an endless amount of possible rendered images. With no access to modelling software or no modelling skills, cutting out objects from already obtained images might be a better option. The advantage of cut-out objects is that they already come from the same domain that we wish to train our model on. Cutting out objects does not give as much control as the rendering approach, and the only augmentations that can be applied are the same as for regular images, for example rotation, scaling or adjusting brightness. It also requires a lot of time to cut out objects from images cleanly without either losing relevant information or picking up unwanted information from the background, and the task is hard to automate.

Due to time restrictions, none of the experiments was carried out in great depth. The data sets used for the experiments, for example, could have been expanded if time had not been a factor. The amount of real images was not particularly large, and a bigger data set could have been obtained by searching for videos and cutting out relevant parts. The restrictions in size made the test set for the experiments quite small. A small test set might not show how well a model truly generalizes. If the experiments were to be conducted again, a larger test set is advised.

Time restrictions also play a part in how reliable the results obtained in the experiments are. All experiments have a random factor to them: training images were pasted onto random backgrounds with random augmentations, and networks were initialized with random weights. Because of this, all experiments should be carried out multiple times to confirm that the results remain the same. In this thesis I did not have the time to validate the results with multiple tests for every experiment.

The small amount of data made me decide not to use a validation set, since there was basically not enough data for it. A validation set would have made the training of the networks much easier, not only for knowing when a model had gone through enough iterations but also for tuning the hyperparameters.

Trying to add realism to rendered images did not work especially well. The generator in the CycleGAN-inspired network that focused on generating real images did not add any visual effects indicating that the rendered images had become more realistic. The other generator, on the other hand, did make clear visual changes to the images, making them smoother and more painting-like. This could be because it is easier to translate images from the real domain to the rendered one. An experiment for future work could be to translate real images to rendered ones and then evaluate a model trained on purely rendered images.

The generative adversarial networks proved to be very difficult to train without constant supervision. All hyperparameters had to be tuned by trial and error, which wasted a lot of time. The generators and discriminators easily became too powerful if one was trained with more iterations than the other, and a good balance was hard to find. The experiments with generative adversarial networks did not turn out well and were an overall disappointment.

5.1 Answer to Research Questions

• With limited data available, what efforts can be taken to boost the results of a convolutional neural network?

This thesis has shown that using synthetic data, such as rendered symbols or symbols cut out from real images and pasted onto real images, can be one way to go. Using synthetic images boosted the results by several percentage points in more than one experiment. Augmenting the limited existing data is another option, but it lacks flexibility because of the limited set of augmentations that can be applied to an image.

• Is it possible to train a convolutional neural network using only synthetic training data to detect desired objects?

In the experiments in this thesis, networks trained on fully synthetic data sets achieved an accuracy above 80%. While this is still worse than a small data set containing real images, it is a promising result. This thesis focused more on the possibility of using synthetic data than on achieving great results with one method. If more effort were put into rendering symbols with more variety, I believe even better results could be achieved. If no data at all exists for the desired task, training a network on only synthetic data is absolutely an option, as it beats random guessing by a large margin.

• Can generative adversarial networks (GANs) be used for the production of synthetic data?

Generative adversarial models have shown promising results for image generation in recent years, but in this thesis they did not achieve any remarkable results. Rendered symbols on random backgrounds that were passed through the generator, in the hope of adding realistic features, performed worse than rendered symbols pasted onto random real images. The experiment that used fully rendered 3D scenes with the aim of making them appear more realistic did not achieve any visually appealing results. Generative adversarial networks are still a relatively new research area and will probably achieve great results in the future, but in this thesis the experiments using generative adversarial networks were considered a failure.


6 Conclusion

This thesis investigated the possibility of using synthetic data as training data for a convolutional neural network. Synthetic data might not yet be good enough for a model to train on exclusively, but it can be used as a tool when no other data exists. This thesis showed that different approaches are feasible, both rendering new objects and reusing objects from existing data. Combined with real images, synthetic data can help boost the results. The generative adversarial approach did not yield any good results, but generative adversarial networks are at a relatively early stage of research and could possibly be a solution in the future.

