Master of Science Thesis in Electrical Engineering
Department of Electrical Engineering, Linköping University, 2018

Improving Realism in Synthetic Barcode Images using Generative Adversarial Networks

Petter Stenhagen


Master of Science Thesis in Electrical Engineering

Improving Realism in Synthetic Barcode Images using Generative Adversarial Networks

Petter Stenhagen

LiTH-ISY-EX–18/5169–SE

Supervisor: Karl Holmquist, ISY, Linköping University

Erik Ringaby, SICK IVP

Examiner: Klas Nordberg, ISY, Linköping University

Division of Computer Vision
Department of Electrical Engineering

Linköping University
SE-581 83 Linköping, Sweden

Copyright © 2018 Petter Stenhagen


Abstract

This master thesis explores the possibility of using Generative Adversarial Networks (GANs) to refine labeled synthetic code images to resemble real code images while preserving label information. The GAN used in this thesis consists of a refiner and a discriminator. The discriminator tries to distinguish between real images and refined synthetic images. The refiner tries to fool the discriminator by producing refined synthetic images such that the discriminator classifies them as real. By updating these two networks iteratively, the idea is that they will push each other to improve, resulting in refined synthetic images with real image characteristics.

The aspiration, if the exploration of GANs turns out successful, is to be able to use refined synthetic images as training data in Semantic Segmentation (SS) tasks and thereby eliminate the laborious task of gathering and labeling real data. Starting off from a foundational GAN-model, different network architectures, hyperparameters and other design choices are explored to find the best performing GAN-model.

As is widely acknowledged in the relevant literature, GANs can be difficult to train, and the results in this thesis are varying and sometimes ambiguous. Based on the results from this study, the best performing models do, however, perform better in SS tasks than the unrefined synthetic set they are based on and benchmarked against, with regard to Intersection over Union.



Acknowledgement

I would like to thank SICK IVP for providing me with all the resources I needed to conduct this master thesis. I especially want to thank my supervisor Erik Ringaby, who was very accessible for questions and who helped me formulate the general directions of this thesis. I also want to thank Klas Nordberg and Karl Holmquist at the Department of Electrical Engineering at Linköping University for giving feedback on my report and my project plan.


Contents

Notation

1 Introduction
   1.1 Motivation
   1.2 Purpose
   1.3 Research Questions
   1.4 Limitations

2 Theory
   2.1 Neural Networks
       2.1.1 Neuron Model
       2.1.2 Activation Function
       2.1.3 Sample and Error Propagation
       2.1.4 Optimization in Neural Networks
       2.1.5 Softmax Cross Entropy Loss Function
   2.2 Convolutional Neural Networks
       2.2.1 Batch Normalization
       2.2.2 ResNet-Blocks
   2.3 Generative Adversarial Networks
       2.3.1 Overview of Generative Models
       2.3.2 The Original GAN
       2.3.3 Using GANs for Data Augmentation
       2.3.4 Solutions to Common GAN Failure Modes
   2.4 Semantic Segmentation

3 Method
   3.1 Data Collection and Modification
   3.2 Synthetic Image Generation
   3.3 Setting up the GAN
       3.3.1 Integration of R and D in Training
       3.3.2 Sampling from trained R
       3.3.3 Default GAN
       3.3.4 Loss Functions
   3.4 Evaluation through Semantic Segmentation
   3.5 Training and Testing Pipeline
   3.6 Tests and GAN-models
       3.6.1 Main Tests Based on GAN-loss
       3.6.2 Test of Subjective Quality
       3.6.3 Tests for Checking Reliability

4 Results
   4.1 Training the GAN
       4.1.1 Loss Progress
       4.1.2 Training Time
   4.2 Quantitative Segmentation Results
       4.2.1 Quantitative Results from GANs Based on USS0
       4.2.2 Subjective Comparison
       4.2.3 Additional Synthetic Sets
       4.2.4 Reliability
   4.3 Refined Images

5 Discussion
   5.1 Training the GAN
       5.1.1 Loss Progress
       5.1.2 Training Time
   5.2 Quantitative Segmentation Results
       5.2.1 Connection between Quantitative Results and Loss
       5.2.2 Subjective Comparison
       5.2.3 Additional Synthetic Training Sets
       5.2.4 Model Characteristics
       5.2.5 Reliability
   5.3 Refined Images

6 Conclusions and Future Work
   6.1 Conclusions
   6.2 Future Work
   6.3 Social and Ethical Aspects

A Results
   A.1 Training the GAN
       A.1.1 Losses
       A.1.2 Training Time
   A.2 Refined Images


Notation

Abbreviations

Abbreviation  Description
Adam          Adaptive Moment Estimation
BGD           Batch Gradient Descent
CNN           Convolutional Neural Network
D             Discriminator (GAN)
G             Generator (GAN)
GAN           Generative Adversarial Network
GPU           Graphics Processing Unit
IoU           Intersection over Union
IoUN          Intersection over Union Normalized
IoUR          Intersection over Union Raw
LeakyReLU     Leaky Rectified Linear Unit
MBGD          Mini-Batch Gradient Descent
PA            Pixel Accuracy
PAN           Pixel Accuracy Normalized
PAR           Pixel Accuracy Raw
R             Refiner (GAN)
ReLU          Rectified Linear Unit
SGD           Stochastic Gradient Descent
SICK          SICK IVP (company name)
SS            Semantic Segmentation
tanh          Hyperbolic Tangent
USS           Unrefined Synthetic Set


1 Introduction

1.1 Motivation

Great breakthroughs have been made in image classification tasks in recent years with machine learning techniques, in particular Convolutional Neural Networks (CNNs). These tasks include object recognition, scene classification and more. A commonly used notion is that of deep architectures, which, slightly arbitrarily, refers to networks with many successive neural layers. Specific implementations of deep architectures such as VGG16, ResNet50 and Inception V3 have become well known in the machine learning world, and pre-trained versions of these are often used as a foundation when solving new tasks on new data sets [32]. A description of how CNNs work is given in section 2.2. One very crucial requirement when using CNNs for solving classification tasks is a sufficient amount of diverse training data that the model can learn from in a generalizable way. This data also has to be labeled: there must be a ground truth value stating which class each sample belongs to.

Having access to the required amount of data can be a problem in itself, and even when it is not, gathering and labeling this data is often expensive and time consuming. This is, in broad terms, the problem this work aims to address.

My thesis work has been conducted for SICK IVP (SICK), a developer of intelligence for cameras used in industrial applications. One type of image that is of interest to SICK, and hence is studied in this thesis, is code images. Figure 1.1 shows an example of what such a code image can look like. The actual real code images used as training data in this thesis cannot, however, be shown in this report for privacy reasons, due to the personal information written on the packages and embedded in the codes.


Figure 1.1: An example of a code image

SICK has customers within logistics, where automatic camera reading of codes is one of the offered solutions. What makes cameras more advantageous for code reading tasks than traditional laser code scanners is their ability to deal with irregular inputs, where the number of codes, the code types, and the position as well as the orientation of the codes are unknown beforehand. CNNs are a potentially attractive model family for detecting and reading these codes, but, as mentioned earlier, labeled data is then a requirement. Synthetically generating such data with corresponding labels would eliminate a lot of the time spent on data gathering and labeling activities and would therefore be attractive for SICK.

To accomplish this, generative models, and more specifically Generative Adversarial Networks (GANs), have been used in this thesis. The goal of a generative model is to generate samples from a desired distribution, but a parametric expression of this distribution is very seldom accessible. As in SICK's case, all that exists are samples from the distribution, namely the real code images. As described in section 2.3.1, researchers have gone in different directions to find methods for high-performing generative models in high-dimensional spaces, but no method has yet turned out to be flawless.

GANs provide a novel way to generate samples from a desired distribution by indirectly measuring and minimizing the divergence between the desired distribution and that produced by the current generator. This is achieved by training a discriminator to differentiate between real and generated samples, and training the generator to fool the discriminator into misclassifying the generated samples as real. The original feature of GANs is the implicit representation of the density functions, which requires no explicit formulation of, or assumptions about, either the distribution or its parameters. This makes GANs an interesting subject of exploration for facilitating SICK's code detection and code reading activities.

1.2 Purpose

The purpose of this master thesis work is to explore how Generative Adversarial Networks can be used as code image refiners. More specifically, given simple but labeled synthetic code images, the possibility of refining these has been explored, making them statistically more similar to real images while preserving label information, thereby enabling their successful use as training data in pixel-wise Semantic Segmentation (SS) tasks.

1.3 Research Questions

The research questions this master thesis aims to give answers to are:

1. Which architectural choices, hyperparameter sets and other design parameters with respect to the GAN give the training sets that produce the best results with regard to SS Pixel Accuracy (PA), SS Intersection over Union (IoU) and convergence time?

2. How do the GAN-refined training sets perform on segmentation tasks with regard to PA and IoU, compared to training sets comprised of unrefined synthetic images and real images respectively?

3. How do GAN-refined training sets compare to each other, and to their unrefined counterparts, for sets of synthetic input images with different degrees of complexity? The phrasing "different degrees of complexity" refers to whether the images have full gray scale range or are binary, and whether they contain text or not.

1.4 Limitations

The limitations in this thesis work are mainly in the form of the methods and models used. Convolutional GAN networks can have very different architectural structures. The architectures studied in this thesis are variations of the ones proposed in the paper Learning from Simulated and Unsupervised Images through Adversarial Training by Shrivastava et al. [35]. This choice is solely a question of time constraints and of the successful results that Shrivastava et al. showed.


2 Theory

In this chapter, the theoretical concepts essential for understanding this thesis are presented. In section 2.1, a walk-through of neural networks is given. This is followed by a description of CNNs in section 2.2, before moving on to GANs, the main topic of this thesis, in section 2.3.

2.1 Neural Networks

Neural networks are unidirectional networks consisting of neurons organized in layers. Starting with a simple neural network architecture as in Figure 2.1, there is an input layer, hidden layers and an output layer. As we can see in Figure 2.1, all nodes in the hidden layers as well as in the output layer are connected to all nodes in the preceding layer. Layers that have this property are called fully connected. In fully connected layers, the output can be calculated as a matrix multiplication between the input and the weights [12]. This is opposed to convolutional layers, which are described in section 2.2.

The input layer consists of a number of nodes equal to the number of features of each data point. For classification networks, the output layer has one node per class. In the hidden layers, both the number of layers and the number of neurons in each layer are design choices of the programmer; there is no exact number directly derivable from the data properties or from the task.

The data used in a classification network consists of input samples and corresponding labels. The input sample x is a vector consisting of the features x_1, x_2, ..., x_k, and the label y is a vector that is 1 at the position corresponding to the correct class and 0 everywhere else. These will be called input-label pairs.


Figure 2.1: Fully-connected neural network

The hidden layers and the output layer consist of artificial neurons, or simply neurons. These are the core computational building blocks of neural networks.

2.1.1 Neuron Model

The neuron model is visualized in Figure 2.2. The input sample x is multiplied by the weight vector w = (w_1, w_2, ..., w_k) and added together with the bias weight w_0. The result is then fed through an activation function f. Mathematically, this takes the form:

y = f\Big(w_0 + \sum_{i=1}^{k} w_i x_i\Big).    (2.1)

More compactly, it can be written:

y = f(\mathbf{w}^\top \mathbf{x}),    (2.2)

where \mathbf{w} = (w_0, w_1, \dots, w_k)^\top and \mathbf{x} = (1, x_1, \dots, x_k)^\top.

It is the weights w_0, w_1, ..., w_k that are updated during training to minimize some loss function, resulting in what is referred to as learning. The loss function is described more closely in section 2.1.5 but is, in essence, a quantified expression of how far we are from the correct solution.
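As a minimal sketch of the forward pass in (2.2), assuming NumPy and arbitrary example values for the weights and inputs (none of which come from the thesis):

```python
import numpy as np

def neuron(x, w, f):
    # Prepend a constant 1 so that w[0] acts as the bias weight w0 (eq. 2.2).
    x_aug = np.concatenate(([1.0], x))
    return f(w @ x_aug)

# Example: a neuron with 3 inputs and a sigmoid activation (values are arbitrary).
w = np.array([0.1, -0.5, 0.3, 0.8])   # [w0, w1, w2, w3]
x = np.array([1.0, 2.0, -1.0])
y = neuron(x, w, lambda a: 1 / (1 + np.exp(-a)))
```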


Figure 2.2: Neuron model

2.1.2 Activation Function

Activation functions are used in neural networks to enable a non-linear mapping from the input to the output space. As an example, we can look at Figure 2.3. In (a) there exists a linear classifier that perfectly separates the two classes. This is however not true in (b); there, a non-linear activation function is necessary to map the input features to an output space where the classes can be separated linearly [12]. Below are examples of activation functions used in neural networks. All activations presented here are also shown in Figure 2.4.

Sigmoid
The sigmoid, or logistic function as it is also called, has the form:

f(x) = \frac{1}{1 + e^{-x}}.    (2.3)

The strong advantage of the sigmoid is that its range is the interval (0, 1), which makes it appropriate for outputs with a probabilistic interpretation [34].

Hyperbolic Tangent
The hyperbolic tangent (tanh) also has a sigmoid shape but has a range of (−1, 1). This makes it appropriate for classification between two classes [34]. tanh is calculated in the following way:

f(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}.    (2.4)

Rectified Linear Unit
The Rectified Linear Unit (ReLU) has the form:

f(x) = \max(0, x).    (2.5)


Figure 2.3: Comparison between linearly separable data (a) and linearly inseparable data (b)

As touched upon earlier, an architecture consisting exclusively of linear units will never be able to approximate a non-linear function. However, linear functions have characteristics that make them very pleasant to use in optimization. This is the reason why the ReLU function, which is piecewise linear, has become so popular for deep neural networks: while preserving the desirable characteristics of linear functions, networks with successive layers of ReLU neurons can theoretically approximate any non-linear function [12].

Leaky Rectified Linear Unit
When ReLU was introduced as an activation function for neural networks by Glorot et al. [10], it was argued that it was the sparsity of the output values (input values < 0 being mapped to exactly 0) that made it so successful. However, Xu et al. [40] have shown that variations of ReLU where the negative input values are mapped linearly, but not to zero, perform better than the original ReLU. The simplest modified version of ReLU is LeakyReLU:

f(x) = \begin{cases} x, & x \geq 0, \\ \alpha x, & x < 0. \end{cases}    (2.6)
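The four activations above translate directly into code. The sketch below is a transcription of equations (2.3)-(2.6) in NumPy; the default α = 0.2 matches the value used in Figure 2.4 but is otherwise arbitrary:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))            # eq. (2.3), range (0, 1)

def tanh(x):
    return np.tanh(x)                      # eq. (2.4), range (-1, 1)

def relu(x):
    return np.maximum(0, x)                # eq. (2.5)

def leaky_relu(x, alpha=0.2):
    return np.where(x >= 0, x, alpha * x)  # eq. (2.6)
```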

2.1.3 Sample and Error Propagation

Two concepts are vital for understanding learning in neural networks, namely feed forward and back propagation.

Feed Forward
The feed forward aspect of a network refers to the property of an input being fed through all layers of intermediate computation to the output without any feedback. No outputs from layers are fed back into previous layers [5]. All neural networks studied in this work have this property.


Figure 2.4: Overview of the presented activations. α = 0.2 for LeakyReLU

Back Propagation
Without any feedback in the feed forward process, neural networks must work in another way to adapt to training losses. This is done in the latter stage, called back propagation. After the feed forward stage, the partial derivative of the loss function with respect to any network weight can be calculated and evaluated for the sample and the present network weights with the use of the chain rule from multivariate calculus. We look at a loss function L(x, y, w), evaluated for x_0, a vector of one specific sample with corresponding label y_0, when the current network weights are w_0. We then have the partial derivative for the network weight w_{klm} going between neuron k in layer m and neuron l in layer m + 1:

\eta_{klm} = \frac{\partial}{\partial w_{klm}} L(x_0, y_0, w_0).    (2.7)

What gives rise to the name back propagation are the means used to find updated values for the weights in successive layers. In Figure 2.5, a simple single-input single-output example is shown. x^{(0)} is the input, x^{(k)} = y is the output, w^{(1)}, ..., w^{(k)} are the network weights for layers 1, ..., k, and σ is an activation function such that x^{(i)} = σ(w^{(i)}, x^{(i−1)}). Then, if we know ∂L/∂x^{(k)}, the derivatives with respect to previous layers' outputs and weights can be calculated recursively through equations (2.8) and (2.9) [25]:

\frac{\partial L}{\partial w^{(k)}} = \frac{\partial \sigma}{\partial w}\big(w^{(k)}, x^{(k-1)}\big) \frac{\partial L}{\partial x^{(k)}},    (2.8)

\frac{\partial L}{\partial x^{(k-1)}} = \frac{\partial \sigma}{\partial x}\big(w^{(k)}, x^{(k-1)}\big) \frac{\partial L}{\partial x^{(k)}}.    (2.9)


Figure 2.5: Small example for showing back propagation

In this formulation, ∂σ/∂w(w^{(k)}, x^{(k−1)}) is the derivative of σ with respect to w, evaluated at the point of input x^{(k−1)} and the current value of the weight w^{(k)}. ∂σ/∂x(w^{(k)}, x^{(k−1)}) is the corresponding derivative, but with respect to x. In networks with more than one weight per layer, these become Jacobian matrices instead.

The recursive update formulas in (2.8) and (2.9) make results from one recursion step reusable in the next, as we move from the output towards the input layer. This is where the name back propagation comes from. The simple example in Figure 2.5 is easily extended to the multiple-input multiple-output case.
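For the single-input single-output chain of Figure 2.5, the recursion (2.8)-(2.9) can be sketched directly in code. The snippet below assumes σ(w, x) = tanh(wx) and a squared-error loss; both choices are illustrative, not taken from the thesis:

```python
import numpy as np

def forward(w, x0):
    # x[i] = sigma(w[i], x[i-1]) with sigma(w, x) = tanh(w * x)
    xs = [x0]
    for wi in w:
        xs.append(np.tanh(wi * xs[-1]))
    return xs

def backward(w, xs, dL_dxk):
    # Apply (2.8) and (2.9) recursively from the output towards the input.
    grads = [0.0] * len(w)
    dL_dx = dL_dxk
    for k in reversed(range(len(w))):
        pre = w[k] * xs[k]               # pre-activation of layer k+1
        dsig = 1 - np.tanh(pre) ** 2     # derivative of tanh
        grads[k] = dsig * xs[k] * dL_dx  # eq. (2.8): dL/dw(k)
        dL_dx = dsig * w[k] * dL_dx      # eq. (2.9): dL/dx(k-1)
    return grads

w = [0.5, -1.2, 0.8]                     # arbitrary weights for a 3-layer chain
xs = forward(w, x0=0.7)
dL_dxk = xs[-1] - 1.0                    # dL/dx(k) for L = 0.5 * (x(k) - target)^2
grads = backward(w, xs, dL_dxk)
```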

2.1.4 Optimization in Neural Networks

All common optimization techniques for updating the network weights are versions of gradient descent. The notions weight and parameter will be used interchangeably; parameter is a more general term, relevant outside the area of neural networks, and therefore becomes more natural to use in some situations. These continuously updated weights/parameters should be separated from hyperparameters, which are often constant throughout training and dictate the training process rather than being trained themselves. The general formula for a gradient descent update of the weight w_j is given by [20]:

w_j \leftarrow w_j - \alpha \frac{\partial}{\partial w_j} L(w),    (2.10)

where α is the learning rate.

Gradient descent is very intuitive: the parameters are changed so as to travel through the parameter landscape in the direction where the downward slope of the function is maximal. However, using the standard gradient descent formula in neural network training would make it necessary to compute the loss function L(x, y, w) with respect to all training input-label pairs (x_1, y_1), (x_2, y_2), ..., (x_n, y_n) for every gradient update. This method is called Batch Gradient Descent (BGD) and has several drawbacks, mainly with regard to speed and required memory. To remedy these shortcomings, Mini-Batch Gradient Descent (MBGD) is introduced.

Mini-Batch Gradient Descent
MBGD uses a batch of input-label pairs of size m, where m < n = training set size. A basic assumption is that the m samples on average have statistics that closely resemble those of the whole data set, resulting in a good approximation of the BGD gradient while massively reducing the time per weight update. The choice of m is an act of balance. Too low an m will approximate the full data set poorly and result in noisy weight updates [2]; the extreme case, m = 1, is called Stochastic Gradient Descent (SGD). Too high an m, on the other hand, will lead to unnecessarily long training times.

What is beneficial with MBGD is that, given intelligent vectorization of the input, intermediary outputs and weights, Graphics Processing Units (GPUs) can process multiple samples in parallel. This has a scaling effect that causes the time for calculating a gradient from m samples in one mini-batch to be much lower than for gradient updates from m samples computed serially [18].
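As an illustration of the sampling logic (not code from the thesis), a minimal MBGD loop for a least-squares problem could look as follows; the learning rate, batch size and data are arbitrary example values:

```python
import numpy as np

def mbgd(X, Y, w, lr=0.01, m=32, steps=1000, seed=0):
    # Each step estimates the full-batch gradient from m randomly chosen samples.
    rng = np.random.default_rng(seed)
    n = len(X)
    for _ in range(steps):
        idx = rng.choice(n, size=m, replace=False)
        Xb, Yb = X[idx], Y[idx]
        grad = 2 * Xb.T @ (Xb @ w - Yb) / m   # gradient of the mean squared error
        w = w - lr * grad                     # eq. (2.10)
    return w

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 3))
w_true = np.array([1.0, -2.0, 0.5])
Y = X @ w_true
w = mbgd(X, Y, w=np.zeros(3))   # converges towards w_true
```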

Gradient Descent Extensions
The choice of optimization method is not only a matter of how many samples to base each weight update on. There are other extensions to gradient descent that speed up convergence and reduce loss variance. One simple extension is momentum terms, such as in Nesterov's accelerated gradient shown below:

Algorithm 1 Nesterov's accelerated gradient
  g is the gradient
  m is the momentum of the gradient
  w is the parameters to be updated
  index t indicates iteration number t

  1: g_t ← ∂/∂w_{t−1} L(w_{t−1} − ημ m_{t−1})
  2: m_t ← μ m_{t−1} + g_t
  3: w_t ← w_{t−1} − η m_t

The extension proposed in Nesterov's accelerated gradient can be seen as an integration over the velocity, which has the benefit of increasing the step length in areas of homogeneous gradient behaviour and decreasing the step length in areas of gradient oscillation [39], [8].
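Algorithm 1 translates line-for-line into code. A sketch, with the gradient supplied as a function (this is an illustration, not the thesis implementation):

```python
def nesterov_step(w, m, grad_fn, lr, mu):
    # Line 1: gradient evaluated at the look-ahead point w - lr * mu * m
    g = grad_fn(w - lr * mu * m)
    # Line 2: update the momentum of the gradient
    m = mu * m + g
    # Line 3: take the step along the accumulated momentum
    w = w - lr * m
    return w, m
```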

A constant challenge with standard gradient descent in (2.10) is to get the learning rate parameter α right. α influences the step length of each update, is the same for all weights and, for standard gradient descent, is constant throughout training. α can be updated during training in so-called learning rate schemes, which are structured ways of achieving faster and more stable learning. However, when deciding how to change α in the learning rate scheme, new hyperparameters often have to be introduced, and the speed and stability of learning can be just as sensitive to these new hyperparameters as to the original α. Consequently, the problem of deciding a good value for α is just transferred to other hyperparameters [24]. To make performance more robust to non-optimal hyperparameter configurations, and also to allow for learning rates that vary across weights and time, Adam is used [39].


Adam
Adaptive Moment Estimation (Adam) was introduced by Kingma and Ba in 2014 and has shown good relative performance in high-dimensional parameter landscapes with characteristics such as sparse gradients and non-stationary or stochastic loss functions.

Since it is unknown beforehand what the input looks like, what is essentially minimized is the expected loss when classifying a random sample. The loss function thus depends on random variables, namely which samples are chosen. When training, implementations using SGD and MBGD therefore have stochastic loss functions, since the samples used for each update are randomly chosen. BGD, on the other hand, uses all the available training samples each iteration, so no stochasticity exists [21].

The idea of Adam is to estimate the first and second moments of the gradient with exponential weighting and use them to update the weights. These estimates will however be biased towards zero, the initial value of the moments. The estimates are therefore bias-corrected before being used in the parameter update. One step of a parameter update with Adam at time t is given below.

Algorithm 2 Adam
  g is the gradient
  m is the first moment of the gradient
  v is the second moment of the gradient
  w is the weight to be updated
  β_1 and β_2 are the exponential weights

  1: g_t ← ∂/∂w_{t−1} L(w_{t−1})
  2: m_t ← β_1 m_{t−1} + (1 − β_1) g_t
  3: v_t ← β_2 v_{t−1} + (1 − β_2) g_t^2
  4: m̂_t ← m_t / (1 − β_1^t)
  5: v̂_t ← v_t / (1 − β_2^t)
  6: w_t ← w_{t−1} − α m̂_t / (√(v̂_t) + ε)

2.1.5 Softmax Cross Entropy Loss Function

Learning in a neural network is essentially just an optimization problem: it is about changing the weights leading into the neurons to minimize some loss function. The loss function determines how well the model is performing and must be adapted to the specific task. With model, I mean the complete set of design choices, i.e. network architecture, choice of loss function, weight initialization and hyperparameters, as a means to solve some problem, e.g. classification. The most used loss function for classification networks is the cross entropy loss. It is defined as [7]:


L = -\frac{1}{n} \sum_{i=1}^{n} \sum_{k=1}^{m} y_{ik} \ln a_k(x_i),    (2.11)

where n is the number of samples in the batch/mini-batch, m is the number of classes, a_k(x_i) is the predicted probability that sample x_i is from class k, and y_i is an m × 1 label vector that is 1 at the position corresponding to the correct class and 0 everywhere else.

Nielsen shows the beneficial property of the cross entropy loss in an example, by studying its derivative when using the sigmoid activation function in a one-layer neural network [27]. In such a setting, the output of output node j for a single sample x is:

a_j(x) = \sigma(w_j^\top x + b_j),    (2.12)

where σ is the sigmoid activation function of the layer, w_j is the weight vector connected to output node j and b_j is a bias. For a single sample x_0, the partial derivative of (2.11) becomes:

\frac{\partial L_{x_0}}{\partial w_j} = x_0 \big(\sigma(w_j^\top x_0 + b_j) - y_j\big).    (2.13)

Looking at (2.13), the cross entropy derivative is controlled by the error ε = σ(w_j^⊤ x_0 + b_j) − y_j, which is compelling: when the error increases, the derivative, and thereby also the step length, increases. This is what makes the cross entropy an attractive loss function when performing gradient descent.

Softmax Function
As mentioned in section 2.1.2, the sigmoid and tanh are appropriate for classification tasks in two-class cases. In classification tasks with more than two classes, however, an extension has to be used instead: the softmax function. The softmax has the following form:

f(x_l) = \frac{e^{x_l}}{\sum_{k=1}^{m} e^{x_k}}.    (2.14)

Since e^x > 0 for all x ∈ R, the softmax ensures that 0 < f(x_1), f(x_2), ..., f(x_m) < 1 and that \sum_{k=1}^{m} f(x_k) = 1. This gives the softmax function an interpretation as the probabilities of sample x having the class labels 1, 2, ..., m respectively.

2.2 Convolutional Neural Networks

We now return to the property of layer connections that was discussed in section 2.1. Unlike Figure 2.1, which was an example of a fully connected network, we will turn to another architectural structure, namely Convolutional Neural Networks (CNNs). CNNs are networks that have at least one convolutional layer, one of which is shown in Figure 2.6. Instead of each connection between two neurons in subsequent layers receiving its own weight, there is now a set of kernels that are slid over the input, producing an output by element-wise multiplication and summation. In the context of machine learning, this operation is referred to as convolution. Strictly mathematically, however, it is rather cross-correlation, since the kernel is applied without being flipped. This detail has no importance with regard to the performance of the network, and the network type will still be referred to as convolutional [12].

Figure 2.6: Schematic overview of computations in convolutional layers

For the CNN structure to be useful, the input needs to have a grid-like topology: there must be dependencies between input features that have to do with their relative location. Examples are time series, images and videos. CNNs make use of three key features: sparse interactions, parameter sharing and equivariant representations. Goodfellow et al. [12] give a detailed description of these concepts. In short, sparse interactions and parameter sharing are clever ways of connecting neurons in different layers that greatly reduce the number of parameters needed in the network. This frees up memory, increases runtime speed and gives improved statistical efficiency. Statistical efficiency here means a measurement of the variance between multiple evaluations of the model for a given number of samples [9]; a statistically efficient neural network can be trained with a low number of samples and produces consistent results. The concept of equivariance means that translational changes in the input follow through to the output, as if the translation had been applied to the output directly [12].

2.2.1 Batch Normalization

Generally speaking, the task of a neural network is to learn the statistical properties of its inputs. In image classification tasks, this means capturing the distinguishing image statistics of the different classes. For the first layer in the network, this is a straightforward optimization problem, since its input statistics do not change during training. However, for all subsequent layers, the input will depend on the parameters of the preceding layers, resulting in non-stationary inputs across training. This becomes a problem, especially when saturating activation functions are used deep down in the network. It has historically called for very small learning rates and scrupulous initialization, making sure the initial values of the weights are precisely within the very narrow bounds from where convergence is possible. But in 2015, Ioffe and Szegedy introduced batch normalization, which is a whitening transformation (removal of the mean and division by the standard deviation) across the batch for layer inputs at arbitrary network depth, see (2.15) [18]:

\hat{x}_k = \frac{x_k - \mathrm{E}_k[x]}{\sqrt{\mathrm{Var}_k[x]}}.    (2.15)

Here, E_k[x] and Var_k[x] are the mean and variance of x over dimension k.

Originally, the batch normalization was taken only over the batch for each feature [18]. However, for images, which have a grid-like topology, it makes sense to use a global mean and a global variance calculated over the batch as well as over the spatial dimensions of the image [39], although not over the channels. The channels can be seen as the depth of an input tensor to some layer. Into the first layer, the channels are the different color channels (only 1 if grayscale is used, 3 if RGB is used, and so on). In subsequent layers, the input to layer k will have m channels when layer k − 1 has m kernels. If a single feature value into a deep layer is denoted x_{b,r,c,ch} for batch b, row r, column c and channel ch, then we get:

\hat{x}_{b,r,c,ch} = \frac{x_{b,r,c,ch} - \mathrm{E}_{b,r,c}[x]}{\sqrt{\mathrm{Var}_{b,r,c}[x]}}.    (2.16)

We will come back to batch normalization and how it can be applied to GANs in section 2.3.
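Equation (2.16) amounts to normalizing over the batch and spatial axes, but not over the channel axis. A sketch for an input tensor of shape (batch, rows, cols, channels); the small ε added inside the square root is a numerical-stability detail, not part of (2.16):

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    # x has shape (batch, rows, cols, channels); mean and variance are taken
    # over the b, r, c dimensions, leaving one statistic per channel (eq. 2.16).
    mean = x.mean(axis=(0, 1, 2), keepdims=True)
    var = x.var(axis=(0, 1, 2), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

x = np.random.default_rng(0).normal(size=(8, 16, 16, 3))
x_hat = batch_norm(x)   # per-channel mean ~0 and variance ~1
```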

2.2.2 ResNet-Blocks

A specific architectural building block that has become popular in deep CNNs is the ResNet-block. The theory behind the ResNet-block was presented by He et al. [14]. It consists of two convolutional layers with a ReLU activation in between them. The input to the first layer, x, is then merged additively with the output from the second layer, F(x), and led through a ReLU activation, as shown in Figure 2.7. This figure is a recreated version of Figure 2 in [14].

He et al. motivate this architecture by looking at the degradation problem that is prevalent in deep CNNs. This term describes the phenomenon where training error increases when additional layers are added to an already converging architecture. Theoretically, an otherwise identical network, except for one or more additional layers, should never perform worse than its shallower counterpart with regard to training error, because the additional layers in the deeper network could simply learn an identity mapping and thereby replicate the result of the shallower network. The degradation problem shows that different networks are in practice differently equipped to learn the same mapping. To conclude: since training error theoretically decreases when adding layers to a converging architecture, and since the number of layers necessary to achieve satisfying training error is difficult to predict beforehand, it is tempting to go for very deep architectures. But as He et al. show, excessive layers have adverse effects on training error in practice. Consequently, what He et al. strive for is to minimize the increase in training error caused by excessive layers, making performance more robust to sub-optimal architectural choices. They do this by introducing the ResNet-block.

Figure 2.7: A schematic image of a ResNet-block

If we consider H(x) to be any non-linear mapping through the ResNet-block, then F(x) = H(x) − x is the residual mapping. By letting the network learn the residual mapping F(x) instead of the initial mapping H(x), He et al. show that training error decreases in deep architectures.
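The structure of the block is easiest to see in code. The sketch below uses fully connected layers as stand-ins for the two convolutional layers, a simplification for readability; the identity shortcut and the placement of the ReLUs follow the description above:

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def resnet_block(x, W1, W2):
    # F(x): two layers with a ReLU in between (convolutions in the real block).
    fx = W2 @ relu(W1 @ x)
    # Identity shortcut: add the block input before the final ReLU, so the
    # layers only have to learn the residual F(x) = H(x) - x.
    return relu(x + fx)

rng = np.random.default_rng(0)
x = rng.normal(size=4)
W1 = rng.normal(size=(4, 4))
W2 = rng.normal(size=(4, 4))
y = resnet_block(x, W1, W2)
```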

2.3 Generative Adversarial Networks

One natural breakdown of neural network models is into discriminative and generative models, respectively. Discriminative models try to categorize data into different classes. Since data can vary endlessly within classes, we cannot expect new data points to manifest input features exactly equal to some labeled sample from which we can transfer the label. Instead, the task has to be approached statistically: the label of the new data point has to be inferred from the statistical distribution of the training data. While discriminative models only have to capture the statistical distribution of the training data insofar as to establish efficient decision boundaries for classification, generative models aim to generate new data according to those statistics.


2.3.1 Overview of Generative Models

Most generative models can be thought of as doing maximum likelihood estimation to some degree [11]:

\theta^* = \underset{\theta}{\arg\max}\; \mathbb{E}_{x \sim p_{data}} \log p_{model}(x \mid \theta).    (2.17)

In words, we want to maximize the expected log probability of x as being a sample from the distribution p_model, which is defined by the parameters θ, where x is a random sample from the desired distribution p_data. The notation E_{x∼p_data} denotes an expectation dependent on the stochastic variable x, which is a random sample from the distribution p_data.

This is achieved by optimizing over the parameter set θ. Several types of models exist that explicitly try to find this θ. The inherent difficulty here is computational intractability, stemming from the complexity and high dimensionality of the parameter space. One type of remedy is careful construction of density models that ensure tractability. One example of this kind of model is PixelRNN. Here, Van den Oord et al. [28] find a tractable formulation of the joint probability of pixels in an image by conditioning on the pixels already generated. The notion of tractability denotes the possibility of finding a solution that can be expressed analytically. The problem with this kind of model is the sequential nature of the data generation process, which makes subtasks impossible to parallelize by design, resulting in long inference times. Inference time is in this case the time taken to generate one sample from the generative model.

There are also methods that relax the condition of tractability in the density model, and instead use approximations to the gradient or the likelihood to ensure tractability. Examples of these are Variational Autoencoders, which place a tractable lower bound on the likelihood that can be optimized [22]:

\mathcal{L}(x; \theta) \leq \log p_{model}(x; \theta).    (2.18)

Goodfellow argues, however, that the gap between the lower bound L and log p_model causes the final learned model to differ significantly from p_model, leading to subjectively less appealing samples [11].

Boltzmann Machines rely on Markov Chain Monte Carlo methods, such as Gibbs sampling, to get samples for training and inference that over time are theoretically ensured to have converged to the joint distribution [15]. Gibbs sampling is used when the joint distribution is unknown but the conditional probabilities of the components are known [31]. The drawbacks of Boltzmann Machines and their Markovian underpinnings are, according to Goodfellow:

• Markov chain methods are less effective in high-dimensional spaces such as images.

• There is in practice no way of knowing whether the sample generation has converged to the desired distribution yet.

• Inference takes too long, because of the relatively high computational cost of generating a single sample.

As a response to all the problematic features of explicit models, essentially stemming from the same root, the difficulty of finding tractable, accurate, high-dimensional probability density functions, implicit models have recently become popular. These implicit models do not require an explicit formulation of a density function and its parameters, but can still sample from it. One of these implicit model types is Generative Adversarial Networks (GANs).

2.3.2 The Original GAN

The original GAN architecture is composed of two separate neural networks: a Discriminator (D) and a Generator (G). The task of D is to classify a sample s into the correct one of the two classes: Target (C_T) and Generator (C_G). The task of G is to generate a sample s_G, given some input i, that belongs to C_G but fools D into classifying it as C_T. A sample correctly belonging to the target class C_T is denoted s_T. A general GAN is shown in Figure 2.8.

To connect to section 2.3, the motivation here is to disregard any explicit formulation of the density functions and instead let D judge the ability of G to learn the desired distribution p_T. Both the original GAN and many extensions have applied the model to synthetic image generation, where p_T is the distribution over a set of real images and p_G is the distribution of synthetic images, which G tries to make as similar to p_T as possible. In many of the studies, real images are simply photographs of everyday objects, such as animals, human faces and nature scenes. The distinguishing features of real images compared to their synthetic counterparts obviously depend on the motif, but there are generalities, such as sharpness of object borders, that are characteristic of real images and hard to capture realistically in synthetic generation. The number and relative location of features in an image (e.g. two eyes symmetrically placed above one nose in a human face) also give real images their realism. As an example, Radford et al. [30] show GAN-generated examples of bedrooms, faces and handwritten digits with varying degrees of realism. There is, however, no intrinsic aspect of the model limiting the area of application to synthetic image generation; for instance, Pascual et al. [29] apply GANs to denoising speech.

Figure 2.8: Schematic overview of a GAN

There are two ways to think about GANs. Either they can be viewed from a game theoretical perspective, as finding the Nash equilibrium of the following minimax problem:

\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_T(x)} \log D(x) + \mathbb{E}_{z \sim p_{G_i}(z)} \log(1 - D(G(z))),    (2.19)

where D(y) is the probability D assigns to the sample y of belonging to the target class C_T, and p_{G_i} is the distribution over the generator input. As noted by the original GAN authors, no method for finding this unique solution exists for problems of high dimensionality (such as images), with non-convex loss functions and continuous parameters [13]. Finding a true Nash equilibrium through gradient descent is therefore more of an intuitive motivation in theory than a realistic end goal in practice. The other approach to GANs, which is by no means neglected by Goodfellow et al. but more accentuated by others, is to think of GANs as likelihood ratio estimators. Sønderby et al. [36] show that if the loss functions are defined as:

L^{(D)} = -\mathbb{E}_{x \sim p_T(x)} \log D(x) - \mathbb{E}_{z \sim p_{G_i}(z)} \log(1 - D(G(z))),    (2.20)

L^{(G)} = -\mathbb{E}_{z \sim p_{G_i}(z)} \log \frac{D(G(z))}{1 - D(G(z))},    (2.21)

then minimizing them iteratively is the same as minimizing the Kullback-Leibler divergence D_{KL}[p_G || p_T]. Here, the Kullback-Leibler divergence is defined as:

D_{KL}[p_1 \,\|\, p_2] = -\sum_i p_1(i) \log \frac{p_2(i)}{p_1(i)}.    (2.22)

The Kullback-Leibler divergence measures how much the distribution p_1, approximating p_2, diverges from p_2 itself [23]. This is intuitively compelling, since what GANs are trying to achieve is to find a generator distribution p_G from which it is possible to generate samples statistically identical to the target distribution p_T.
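Given discriminator outputs expressed as probabilities, the losses (2.20) and (2.21) are straightforward to evaluate. A sketch with NumPy, where d_real holds D(x) on real samples and d_fake holds D(G(z)) on generated ones (the sample values are arbitrary):

```python
import numpy as np

def discriminator_loss(d_real, d_fake):
    # eq. (2.20): push D(x) towards 1 and D(G(z)) towards 0.
    return -np.mean(np.log(d_real)) - np.mean(np.log(1 - d_fake))

def generator_loss(d_fake):
    # eq. (2.21): the log-ratio form shown by Sønderby et al.
    return -np.mean(np.log(d_fake / (1 - d_fake)))

d_real = np.array([0.9, 0.8, 0.95])   # D's outputs on real samples
d_fake = np.array([0.2, 0.1, 0.3])    # D's outputs on generated samples
ld = discriminator_loss(d_real, d_fake)
lg = generator_loss(d_fake)
```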

2.3.3 Using GANs for Data Augmentation

Shrivastava et al. [35] use a version of GAN to refine labeled synthetic data. This section is solely about this paper. While the original GAN takes a vector z of uniformly sampled noise as the input to G, Shrivastava et al. use a labeled synthetic image. They therefore call their own network equivalent of G in the original GAN a Refiner (R). Their aspiration is to train a GAN with unlabeled real data to add realism to the synthetic images while preserving their label information. Figure 2.9 is inspired by Figure 2 in [35] and shows the essentials of what is attempted.


Figure 2.9: Schematic overview of the architecture of Shrivastava et al.

The loss functions for D and R are given by:

L^{(D)} = -\sum_i \log(1 - D(R(x_i))) - \sum_j \log(D(z_j)),    (2.23)

L^{(R)} = -\sum_i \log(D(R(x_i))) + \lambda \| R(x_i) - x_i \|_1,    (2.24)

where z_j is a real image and x_i is an unrefined synthetic image. The function R(x_i) produces a refined image given the input x_i. D(y) is, just as in (2.19), the probability assigned to the sample y of being a real image. Being a real image in Shrivastava's regard is equivalent to belonging to C_T in the more general formulation in section 2.3.2.

In (2.23), the discriminator achieves its lowest loss when it assigns a high probability of the real sample z_j being real, since the negative log-expression is then close to zero. Analogously, the probability of R(x_i), which is actually synthetic, being real should be kept as low as possible, as the negative log-expression then is close to zero. In (2.24), the probability that R(x_i) is classified as real should be high, as this means R is producing images deemed realistic by D. The L1-norm between R(x_i) and x_i should be kept low, as this means the image is changed as little as possible, thereby increasing the chances of label preservation.
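As a sketch of (2.23) and (2.24), assuming the discriminator outputs per-image probabilities and λ is a scalar weight. The sums over i and j are written as means over a batch here, a common implementation choice rather than something stated in the paper:

```python
import numpy as np

def d_loss(d_refined, d_real):
    # eq. (2.23): refined images should get low probability, real ones high.
    return -np.mean(np.log(1 - d_refined)) - np.mean(np.log(d_real))

def r_loss(d_refined, refined, synthetic, lam):
    # eq. (2.24): fool D, while the L1 term keeps each refined image close
    # to its synthetic input to preserve the label information.
    adv = -np.mean(np.log(d_refined))
    reg = lam * np.mean(np.abs(refined - synthetic))
    return adv + reg
```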


Architectural Choices
D is a CNN that includes non-strided and strided convolutions as well as max pooling, but no fully connected layers. Stride here means the interval, in number of pixels, between two consecutive applications of the convolution kernel: after the kernel has been applied to the input area with centre at (k, l), it is moved to (k + i, l + j). In non-strided convolutions, i, j = 1. In strided convolutions, i, j ∈ Z and i, j > 1. In the work of Shrivastava et al., the strides are constrained by i = j.

R's architecture includes a number of ResNet-blocks, as described in section 2.2.2. Shrivastava et al. vary the exact design choices, such as the number of ResNet-blocks, kernel sizes, etc., depending on the data set and input image size.

Probability Map Output from D
Shrivastava et al. note that all local patches in a refined image should have statistics identical to a similarly sized and located patch in a real image. To increase the number of classifiable samples per image, effectively increasing the data size, and to limit the capacity of D to become too strong, D outputs a probability map with dimensions w × h. This corresponds to a loss over w·h local (to some degree overlapping) patches in the input image.

Historic Batches of Refined Images in D-training
When training GANs, D normally only sees and learns from samples produced by the most recent version of G. This may lead to divergence in training, or to the reintroduction of old artefacts by G. Artefacts are unwanted features constructed by G that help fool imperfect intermediary versions of D, but that are not realistic and therefore not productive in the long run. Such artefacts are unavoidable during training, but the reemergence of one and the same artefact (old artefacts) should be avoided, as this can produce a loop-like behaviour of perpetually falling into the same traps. To prevent these undesirables, Shrivastava et al. make half of the refined batch that D is trained on consist of images produced by earlier generations of R. This buffer of historic refined images is updated iteratively as newer generations of R are created.
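One way such a buffer could be implemented is sketched below, assuming a fixed capacity and random replacement of old entries. This follows the general idea of historic batching rather than the exact update scheme in the paper:

```python
import random

class ImageHistoryBuffer:
    def __init__(self, capacity):
        self.capacity = capacity
        self.images = []

    def add(self, refined_batch):
        # Store images from the newest generation of R; once the buffer is
        # full, overwrite random old entries so the contents keep evolving.
        for img in refined_batch:
            if len(self.images) < self.capacity:
                self.images.append(img)
            else:
                self.images[random.randrange(self.capacity)] = img

    def sample(self, k):
        # Draw k historic refined images for half of D's training batch.
        return random.sample(self.images, min(k, len(self.images)))
```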

2.3.4 Solutions to Common GAN Failure Modes

GANs are widely acknowledged to be difficult to train. Here, some of the mostcommon failure modes and methods for preventing them are presented.

Noise to D
As discussed in section 2.3.2, training GANs can be seen as minimizing some divergence (for instance the Kullback-Leibler divergence) between two probability density functions. This means, as Huszár [17] points out, that the following logarithmic ratio must be finite:

\log \frac{p_G(x)}{p_T(x)}.    (2.25)


This means that p_G and p_T must overlap, which is far from certain due to the high dimensionality of the space of learned parameters. Moreover, a lack of overlap between the density functions means that there will be multiple local optima for D, since there are multiple, perhaps fairly disparate, classification decision boundaries able to perfectly discriminate between real and synthetic samples. This is not good, since each time we complete a training iteration for D, we might end up in a new local optimum, providing completely new gradient information to G and thereby aggravating continuous convergence.

To increase overlap, Huszár suggests instance noise, which means adding Gaussian noise to both refined and real input images to D. This will smooth the density functions and thereby increase the likelihood of overlap early in training, from which convergence can be achieved [17].

Another method used to remedy this problem of non-overlapping density functions is occasionally flipping the label of a real sample into D [3]. Huszár argues that this does not fix the problem as effectively as instance noise, since there is no smart way for D to handle label noise, and it does not change the optimization landscape characterized by multiple local optima [17]. It essentially just adds new modes to p_G as it is perceived by D. A mode is essentially a hill, an area of high density, in a density function. This is exemplified in Figure 2.10 below.

A similar trick proposed by Chintala et al. [3] is label smoothing, where instead of setting the correct label to 1 and the false label to 0, the correct label is set to a random number between 0.7 and 1.2, and the wrong label to a random number between 0 and 0.3. Szegedy et al. [37] propose something similar, but without the randomness. They use the label distribution over the K classes in (2.26):

q(k) = (1 - \varepsilon)\,\delta_{k,y} + \frac{\varepsilon}{K},    (2.26)

where 0 ≤ ε ≤ 1. Szegedy et al. motivate this by stating that a classifier should not be too confident about its label assignment, because this will reduce the ability of the classifier to generalize and adapt.
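The smoothed target in (2.26) is a one-liner. A sketch where y is the index of the correct class; the default ε = 0.1 is an arbitrary example value:

```python
import numpy as np

def smooth_labels(y, K, eps=0.1):
    # eq. (2.26): (1 - eps) extra mass on the correct class,
    # eps / K spread uniformly over all K classes.
    q = np.full(K, eps / K)
    q[y] += 1 - eps
    return q

q = smooth_labels(y=2, K=4)   # -> [0.025, 0.025, 0.925, 0.025]
```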

Figure 2.10: Example showing the difference between a unimodal (a) and a multimodal (b) density function in the one-dimensional case


Feature MatchingFeature Matching is a method proposed by Salimans et al. to avoid over-trainingon specificD-instances. It is implemented as a changed loss function forG, whereinstead of trying to fool D at its output layer, G now attempts to generate samplesthat are expected to differ minimally from real samples at some intermediary Dlayer. G’s loss is now:

$$\left\| \mathbb{E}_{x \sim p_T(x)} f(x) - \mathbb{E}_{z \sim p_G(z)} f(G(z)) \right\|_2^2, \qquad (2.27)$$

where f(y) is the output from some intermediary layer of D for the network input y. As Salimans et al. point out, it makes sense to attempt to replicate these deep features, because when training D these will incorporate the most discriminative aspects between real and generated data [33].
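As a sketch (assuming a Keras discriminator and an arbitrarily chosen intermediary layer name), the loss in (2.27) can be approximated over batch means as follows:

import tensorflow as tf
from tensorflow.keras import Model

def feature_matching_loss(discriminator, real_batch, refined_batch, layer_name):
    # Sub-model exposing the activations f(.) of an intermediary D layer.
    f = Model(discriminator.input, discriminator.get_layer(layer_name).output)
    # Batch means approximate the expectations in (2.27).
    real_features = tf.reduce_mean(f(real_batch), axis=0)
    fake_features = tf.reduce_mean(f(refined_batch), axis=0)
    return tf.reduce_sum(tf.square(real_features - fake_features))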

DCGAN
Radford et al. [30] propose their GAN-version, DCGAN, which includes several architectural elements that the authors claim stabilize training. These are shown below:

• Strided convolutions instead of pooling layers

• Batch Normalization in both G and D

• Removal of all fully-connected layers

• ReLU activation in generator except in output layer, which uses tanh

• LeakyReLU activation in all layers of D

Solutions to Mode Collapse
Mode Collapse is a well-known problem in GAN-training that describes the undesirable scenario where pG has far fewer modes than pT. This causes different inputs to be mapped to more or less the same point. We say that there is too little entropy in pG. This is caused by D basically studying each sample in the mini-batch separately, resulting in a non-existent incentive for G to capture the full diversity of pT rather than just a few single points that fool the current version of D. [33]

There are several methods proposed in the literature to remedy mode collapse. One such method is Batch Normalization in D [30]. By statistically relating each sample to all others in the mini-batch it belongs to, D now implicitly takes into consideration the distribution over the mini-batches rather than just the individual samples they contain.

Historic batching, which was discussed in section 2.3.3, should in theory work to remind D of previously explored modes and thereby create a landscape for pG with a more complete set of modes. Another trick is dropout for G, as is used by Isola et al. [19] and sketched below. Dropout(n) means deactivating a fraction n of the nodes in a network layer, by dropping the weights leading to and from the nodes. Which weights are deactivated is random and changes every iteration.
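As an illustration of the dropout trick (the placement and rate here are assumptions, not Isola et al.'s exact configuration):

from tensorflow.keras import layers

def conv_block_with_dropout(x, rate=0.5):
    # Dropout(rate) deactivates a random fraction `rate` of the units each
    # iteration, injecting entropy into G's outputs against mode collapse.
    x = layers.Conv2D(32, 3, padding='same')(x)
    x = layers.Dropout(rate)(x)
    return layers.LeakyReLU(0.3)(x)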


2.4 Semantic Segmentation

Semantic segmentation (SS) is the task of classifying each pixel in an image to a specific class. The difference relative to instance segmentation should be noted. In the former we are only interested in the class belonging of each pixel, while in the latter case we also segment between different instances of each class. The difference is shown in Figure 2.11, where in semantic segmentation there is no label difference between pixels belonging to different instances of the same class. CNNs of some sort are used in many implementations to solve SS tasks. Long et al. [26] present a method for SS in their paper Fully Convolutional Networks for Semantic Segmentation by re-organizing the architecture of already successful classification nets such as AlexNet, VGG16 and GoogLeNet. By the use of fractionally strided convolutional layers, the original classification nets are reconstructed to take in an image of any size and output a map of the same size, corresponding to a pixel-to-pixel classification of the original image.

Several previous methods classify pixels by doing classification on image patches with the pixel of interest centered in the patch. Ciresan et al. [4] propose one such method where the patch size is predefined beforehand. This has the drawback of not taking advantage of both local appearance information in shallower layers and global semantic information in deeper layers. Long et al. achieve this by combining outputs from layers of different depth before producing the final classification maps.

Metrics
Long et al. use four different metrics to evaluate their results. $n_{ij}$ is the number of pixels actually belonging to class $i$, classified as $j$, and $t_i = \sum_j n_{ij}$ is the number of pixels actually belonging to class $i$. There are $m$ different classes.

• Pixel Accuracy: $\sum_i n_{ii} \,/\, \sum_i t_i$

• Mean Accuracy: $(1/m) \sum_i n_{ii} / t_i$

(a) Semantic segmentation (b) Instance segmentation
Figure 2.11: Label difference between Semantic segmentation and Instance segmentation


• Mean IoU: $(1/m) \sum_i n_{ii} \,/\, (t_i + \sum_j n_{ji} - n_{ii})$

• Frequency weighted IoU: $(\sum_k t_k)^{-1} \sum_i t_i\, n_{ii} \,/\, (t_i + \sum_j n_{ji} - n_{ii})$


3 Method

In this chapter the method for conducting the project is presented. Firstly, it is described how the real data was gathered and preprocessed and how the synthetic data was generated. Thereafter, we go into the architecture of the GAN-network. This is followed by a description of the method for evaluating and quantifying the results. Lastly, the full training and testing pipeline is described, as well as the concrete tests conducted.

Much of the content in this chapter relies on the theory presented in the previous chapter. The method chosen and presented in this chapter subsequently led to the results, which are presented in the next chapter.

3.1 Data Collection and Modification

SICK has an internal database with labeled grayscale images of codes. From there, 3000 images were taken and have been used throughout this project. The label here is in the form of pixel-wise class-belonging over the three classes Background (BG), 1-D code (1-D) and 2-D code (2-D). These are also the classes used for this project. The images have a dimension of 2048×2048 pixels and are down-sampled to enable storing of larger batches in the GPU working memory and to decrease training time. However, to meet the preferred image size of 64×64 without losing too much of the original information in interpolation, the images are cropped before down-sampling. For data augmentation purposes all images are cropped both with a 400×400 and with an 800×800 window.

The cropping is made to capture the codes, which can be located fairly randomly in the original image. The segmentation label is used to find the center of mass for one code in the image, which is then used as a preliminary center in the cropping window.


The actual center is randomized uniformly around the preliminary center so that codes can be located not only in the exact center of the cropped image. This procedure is carried out for one 1-D code and one 2-D code in each original image. This is for data augmentation purposes and to ensure a relatively equal distribution of images including 1-D codes and images including 2-D codes. If one of the code-types does not exist in the image, the center of mass is randomized over the whole image. In some cases this can lead to completely black images when cropping, since no constraints are set on the location of the cropped image center.
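A NumPy sketch of this cropping procedure is given below; the offset range around the preliminary center, the boundary handling and the nearest-neighbour down-sampling are assumptions made for illustration:

import numpy as np

def random_code_crop(image, label, code_class, crop=400, out=64):
    # Preliminary center: center of mass of the code pixels in the label.
    ys, xs = np.nonzero(label == code_class)
    if len(ys) > 0:
        cy, cx = int(ys.mean()), int(xs.mean())
    else:
        # Code type absent: randomize the center over the whole image.
        cy = np.random.randint(0, image.shape[0])
        cx = np.random.randint(0, image.shape[1])
    # Randomize the actual center uniformly around the preliminary one.
    cy += np.random.randint(-crop // 2, crop // 2)
    cx += np.random.randint(-crop // 2, crop // 2)
    y0 = int(np.clip(cy - crop // 2, 0, image.shape[0] - crop))
    x0 = int(np.clip(cx - crop // 2, 0, image.shape[1] - crop))
    patch = image[y0:y0 + crop, x0:x0 + crop]
    # Nearest-neighbour down-sampling to the 64x64 working size.
    idx = np.linspace(0, crop - 1, out).astype(int)
    return patch[np.ix_(idx, idx)]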

With the two cropping window sizes and the two cropping image centers for each original image, there are in total 2 × 2 × 3000 = 12000 labeled training images accessible. 6000 of these images are used for GAN training, 5000 are used for SS training and 1000 are used in SS validation. Removal of images that are almost completely black after cropping is necessary for the GAN training set; otherwise the distribution of real images would include this very strong mode, which has an adverse effect on GAN training. About 800 images were removed. It is important to note that the labels for the GAN training set are completely unutilized, and this set could just as well have been unlabeled without any implication on the results.

3.2 Synthetic Image Generation

The GAN-refiner in this project takes in synthetic images from an Unrefined Synthetic Set (USS) and tries to refine them to resemble real images. The USSs are constructed in Python. Since one part of the project is to compare the performance of the GAN given different USSs, the generation of these sets varies slightly, but the fundamental concepts are the same. A bright sticker is attached to a darker background, and on that sticker either a 1-D code, a 2-D code, or one of each is attached. The codes have random size. The sticker is then rotated with a random rotation angle. The different variations are shown in Table 3.1. The differences between the USSs are in the form of whether background text is included and whether the color scale is binary (only 0 or 255) or full range (0 to 255 with every integer in between). A generation sketch follows below.
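The sketch below illustrates the fundamental USS0-style generation idea (background and sticker value ranges as in Table 3.1); the stand-in bar pattern, the fixed sticker size, the rotation handling and the omitted label generation are simplifying assumptions:

import numpy as np
from scipy.ndimage import rotate

def synth_image(size=64):
    # Dark background and brighter sticker, value ranges as in USS0.
    img = np.full((size, size), np.random.randint(0, 101), dtype=np.float32)
    sticker = np.full((32, 32), np.random.randint(200, 256), dtype=np.float32)
    # Crude stand-in for a 1-D code: random black/white vertical bars.
    bars = np.random.randint(0, 2, size=(1, 16)).repeat(16, axis=0) * 255.0
    sticker[8:24, 8:24] = bars
    # Random rotation; corners uncovered by the sticker get background value.
    sticker = rotate(sticker, angle=np.random.uniform(0, 360),
                     reshape=False, order=0, cval=img[0, 0])
    y, x = np.random.randint(0, size - 32, size=2)
    img[y:y + 32, x:x + 32] = sticker
    return img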

3.3 Setting up the GAN

The implementation of the GAN is made in Python using TensorFlow, with Keras as a more user-friendly API for many subparts and operations.

3.3.1 Integration of R and D in Training

In this project, a refiner R is trying to refine images such that they fool the discriminator D into classifying them as real, while preserving label information. A pixel labeled as 1-D should reasonably be labeled as 1-D also after the refinement.


Name   Description
USS0   Background has random value in [0, 100], sticker background has random value in [200, 255], text is included
USS1   Same as USS0 above but without text
USS2   Same as USS1 but binary

Table 3.1: This table describes the different synthetic sets used

The same is obviously true for pixels labeled as BG and as 2-D. R's output distribution pR is made to resemble the real image distribution pdata as closely as possible. Preserving the label information amounts to keeping the regularization loss low. Regularization loss was described in section 2.3.3. How the regularization loss is calculated is covered in section 3.3.4. D is trying to discriminate between refined synthetic images from R and real images.

Such an iteration of both one training step of R (R-step) and one training step of D (D-step) is shown in Figure 3.1. This will be called alternating training, to differentiate it from pre-training. When performing an R-step, D's weights are frozen. The same is true for R's weights when performing a D-step. Each time there is a switch between taking an R-step and a D-step, the frozen model is updated to its latest version. These steps are the core operations when training the GAN. However, before reaching the alternating training stage, R and D are first pre-trained separately, as is suggested by Shrivastava et al. [35]. This is to ensure sound starting values of the weights to increase the chance of fast convergence. While a pre-training D-step is identical to a D-step in alternating training, a pre-training R-step is slightly different from its alternating training counterpart. Since a well performing D is essential for R's adversarial loss, there is not much use for R to try to reduce this loss before D has been trained, as it would only fool D with unwanted artefacts. Hence, R is pre-trained only with the regularization loss.


Figure 3.1: Schematic overview of GAN iteration in Alternating Training

Figure 3.2: Schematic overview of R when it is being pre-trained

This is shown in Figure 3.2. The whole training pipeline is given in Algorithm 3.

3.3.2 Sampling from trained R

When the training is finished it is possible to use R for sampling refined synthetic images. The input is, just as when training R, a labeled unrefined synthetic image, and the output is a labeled refined synthetic image. Disregarding memory constraints, it is possible to create labeled refined image sets of any size, since all that is needed is an equal number of unrefined input images. The generation of the synthetic unrefined images was described in section 3.2 and no limit on the generated set sizes exists there.


Algorithm 3 GAN Training
Specify:
K_RP: number of pre-train iterations for R
K_DP: number of pre-train iterations for D
K_RA: number of R-steps per Alternating Training iteration
K_DA: number of D-steps per Alternating Training iteration
I: number of full Alternating Training iterations

1:  # Pre-training R
2:  for k = 1 to K_RP do
3:      Train R with only regularization loss
4:  end for
5:  # Pre-training D
6:  for k = 1 to K_DP do
7:      Train D with R frozen
8:  end for
9:  # Training R and D in alternating fashion
10: for i = 1 to I do
11:     for k = 1 to K_RA do
12:         Train R with D frozen
13:     end for
14:     for k = 1 to K_DA do
15:         Train D with R frozen
16:     end for
17: end for
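A condensed Keras-style sketch of Algorithm 3 is given below. The names are illustrative assumptions: `refiner` is assumed compiled with the regularization loss against its own input, `discriminator` compiled trainable on its own, `combined` stacks R and a frozen D (so an R-step only updates R), `history` is the image history buffer, the batch generators yield data, and the target arrays mark batches as real or refined:

def r_step():
    # R-step: R tries to fool the frozen D inside the stacked model.
    x, _ = next(synthetic_batches)
    combined.train_on_batch(x, real_targets)

def d_step():
    # D-step: D trains on a mix of new and historic refined images plus reals.
    x, _ = next(synthetic_batches)
    refined = history.mix(refiner.predict(x, verbose=0))
    history.add(refined)
    discriminator.train_on_batch(refined, fake_targets)
    discriminator.train_on_batch(next(real_batches), real_targets)

for _ in range(K_RP):          # pre-train R with only the regularization loss
    x, _ = next(synthetic_batches)
    refiner.train_on_batch(x, x)
for _ in range(K_DP):          # pre-train D with R frozen
    d_step()
for _ in range(I):             # alternating training
    for _ in range(K_RA):
        r_step()
    for _ in range(K_DA):
        d_step()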


3.3.3 Default GAN

A large part of this project has been to investigate which GAN architecture and hyperparameter set work best. What is described here is therefore the default GAN model, the foundation on which further GAN versions are built. Some different architectural choices, hyperparameter sets and other design choices were attempted and discarded before even reaching this default model. The default GAN should therefore not be seen as the absolute first configuration attempted, but rather as a sound structure to build on.

The architectures of the default model are shown in Figure 3.3 and are also described below.

Refiner Architecture
R is built from ResNet-blocks as described in section 2.2.2. The default is to have nine ResNet-blocks with LeakyReLU(0.3) as activation function, instead of ReLU as in the original paper. This was due to performance reasons when testing and comparing the two activations. One convolutional layer is added before the first ResNet-block and one is added after the last. All convolutional layers have 32 kernels each, a 3×3 kernel size and stride length 1. The activation from the last convolutional layer is tanh. Batch Normalization is applied to all convolutional layers except the input layer. This setup was found through experiments.
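A Keras sketch of this refiner is given below; the internal layout of each ResNet-block (two 3×3 convolutions) and the single-channel output of the last layer are assumptions where the details of Figure 3.3 are not reproduced here:

from tensorflow.keras import layers, Model

def resnet_block(x, filters=32, alpha=0.3):
    # Two 3x3 convolutions with Batch Normalization and a skip connection.
    y = layers.Conv2D(filters, 3, padding='same')(x)
    y = layers.BatchNormalization()(y)
    y = layers.LeakyReLU(alpha)(y)
    y = layers.Conv2D(filters, 3, padding='same')(y)
    y = layers.BatchNormalization()(y)
    return layers.LeakyReLU(alpha)(layers.Add()([x, y]))

def build_refiner(n_blocks=9):
    inp = layers.Input(shape=(64, 64, 1))          # grayscale, scaled to [-1, 1]
    x = layers.Conv2D(32, 3, padding='same')(inp)  # no BN on the input layer
    x = layers.LeakyReLU(0.3)(x)
    for _ in range(n_blocks):
        x = resnet_block(x)
    out = layers.Conv2D(1, 3, padding='same', activation='tanh')(x)
    return Model(inp, out, name='refiner')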

(a) Refiner architecture (b) Discriminator architecture
Figure 3.3: Refiner and discriminator architectures. The dimensions are given as (batch size × height × width × channels). K: number of Kernels, KS: Kernel size, S: Strides, A: Activation


Discriminator Architecture
D consists of a straight pipeline of convolutional layers with LeakyReLU(0.3) activations. The last convolutional layer has activation function tanh. Some of the convolutions have stride lengths greater than 1 to reduce the output dimension of the original image. It is worth noting, though, that the output from the final convolutional layer does not have dimension 1×1, as would be the case if the whole input image were classified in its totality. Instead, the output of the final convolutional layer has dimension 4×4, corresponding to the idea of classifying patches in the input images, as was discussed in section 2.3.3. Moreover, some fraction of the synthetic image batches that are fed into D is not from the last refiner version, as was also explained in section 2.3.3. The fraction of new images in the batch is bnew, and the fraction of historic images is bold, where 0 ≤ bnew, bold ≤ 1 and bnew + bold = 1.
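A corresponding Keras sketch of D is shown below; the number of layers and the kernel counts are assumptions, chosen so that four stride-2 convolutions map the 64×64 input to the 4×4 patch output described above:

from tensorflow.keras import layers, Model

def build_discriminator():
    inp = layers.Input(shape=(64, 64, 1))
    x = inp
    for filters in (32, 64, 64, 64):     # four stride-2 convolutions: 64 -> 4
        x = layers.Conv2D(filters, 3, strides=2, padding='same')(x)
        x = layers.LeakyReLU(0.3)(x)
    # 4x4 map of per-patch scores; tanh output as stated in the text.
    out = layers.Conv2D(1, 3, padding='same', activation='tanh')(x)
    return Model(inp, out, name='discriminator')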

Hyper Parameters and other Design Choices
In Table 3.2 the relevant hyper parameters and design choices of the default GAN are shown:

Image size: 64 × 64
Iterations: I = 10000 (or stopped earlier if stuck in an endless state), K_RP = 250, K_DP = 500, K_RA = 1, K_DA = 1
Batch size: 64
Historic batching: b_new = 0.5, b_old = 0.5
Adam Optimizer: α = 0.001, β1 = 0.9, β2 = 0.999
Input Normalization: images are normalized linearly to the interval [−1, 1]

Table 3.2: A declaration of the design parameters used for the default GAN model.

3.3.4 Loss Functions

A key component of GANs is the adversarial loss that was given by the minimax formulation (2.19) in section 2.3.2. For D, this formulation leads directly to the loss used by Shrivastava et al. in (2.23). This is also the loss used for D in this project. The corresponding loss for R is slightly different from the one proposed in (2.24). The losses for the two networks are:

$$\mathcal{L}(D) = -\sum_i \log(1 - D(R(x_i))) - \sum_j \log(D(z_j)) \qquad (3.1)$$

$$\mathcal{L}(R) = \sum_i \Big[ -\log(D(R(x_i))) + \lambda \left\| R(x_i)(c_i) - x_i(c_i) \right\|_1 \Big] \qquad (3.2)$$

The regularization used in (3.2) is the sum of the absolute differences over the codes rather than over the whole image. ci is therefore the subset of pixels in the input image xi which are labeled as either 1-D or 2-D codes. This is chosen since the purpose of the regularization primarily is to preserve the label information in the codes.


Other measures of regularization, such as preserving the standard deviation over the codes, were attempted but abandoned quickly because of lack of loss convergence.
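A TensorFlow sketch of (3.1) and (3.2) is given below, assuming D outputs values in (0, 1) and that code_mask is a binary mask over the pixels labeled as 1-D or 2-D; the value of λ is an illustrative placeholder:

import tensorflow as tf

def discriminator_loss(d_on_refined, d_on_real, eps=1e-8):
    # (3.1): -sum log(1 - D(R(x_i))) - sum log(D(z_j)).
    return (-tf.reduce_sum(tf.math.log(1.0 - d_on_refined + eps))
            - tf.reduce_sum(tf.math.log(d_on_real + eps)))

def refiner_loss(d_on_refined, refined, unrefined, code_mask, lam=1.0, eps=1e-8):
    # (3.2): adversarial term plus L1 regularization restricted to code pixels,
    # so the label information inside the codes is preserved.
    adversarial = -tf.reduce_sum(tf.math.log(d_on_refined + eps))
    regularization = tf.reduce_sum(tf.abs(refined - unrefined) * code_mask)
    return adversarial + lam * regularization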

3.4 Evaluation through Semantic Segmentation

The performance of the GAN-models is evaluated using an implementation of the SS algorithm by Long et al. [26]. The metrics used are shown below. Inspiration is taken from the metrics proposed in section 2.4. All metrics are calculated for each class separately, since it is of interest to know how well the algorithm performs on each of BG, 1-D and 2-D. $n_{ijk}$ is the number of pixels actually belonging to class $i$, classified as class $j$, in image $k$. $t_{ik} = \sum_j n_{ijk}$ is the number of pixels of class $i$ in image $k$, and $K$ is the number of images.

• Pixel Accuracy Raw (PAR): $\sum_k n_{iik} \,/\, \sum_j \sum_k n_{ijk}, \;\forall i$

• Pixel Accuracy Normalized (PAN): $\frac{1}{K} \sum_k n_{iik} / \sum_j n_{ijk}, \;\forall i$

• IoU Raw (IoUR): $\sum_k n_{iik} \,/\, \sum_k (t_{ik} + \sum_j n_{jik} - n_{iik}), \;\forall i$

• IoU Normalized (IoUN): $\frac{1}{K} \sum_k n_{iik} \,/\, (t_{ik} + \sum_j n_{jik} - n_{iik}), \;\forall i$

The difference between the raw versions and the normalized versions is that the normalized ones assign equal importance to every image regardless of the number of pixels belonging to the class. The reason for including the normalized versions is that the importance of accurately catching the presence and the locations of codes does not depend on the size of the codes.
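The IoU metrics are simple to compute from per-image intersection and union counts. A NumPy sketch (illustrative; classes absent from an image are guarded against division by zero):

import numpy as np

def iou_scores(preds, labels, num_classes=3):
    # preds, labels: integer class maps of shape (K, H, W).
    K = len(preds)
    inter = np.zeros((K, num_classes))   # n_iik
    union = np.zeros((K, num_classes))   # t_ik + pixels predicted as i - n_iik
    for k in range(K):
        for i in range(num_classes):
            actual = labels[k] == i
            predicted = preds[k] == i
            inter[k, i] = np.logical_and(actual, predicted).sum()
            union[k, i] = np.logical_or(actual, predicted).sum()
    iou_raw = inter.sum(axis=0) / np.maximum(union.sum(axis=0), 1)   # IoUR
    iou_norm = (inter / np.maximum(union, 1)).mean(axis=0)           # IoUN
    return iou_raw, iou_norm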

3.5 Training and Testing Pipeline

The full process from GAN-refinement to receiving quantified SS-scores involves several steps. Hardware-wise, all training and testing are performed on an Nvidia GeForce GTX 1080 Graphical Processing Unit (GPU). The full pipeline of training and testing operations is shown in Figure 3.4 and is described below.

• Step 1: The GAN is trained using a set of real images and a set of labeled synthetic images. Every twentieth iteration, the weights of the current R-version and intermediary loss data are saved. The intermediary loss data is in the form of mean training loss per iteration for R and for D. We denote one such intermediary version of R: Ri. Consequently, with R we mean the refiner-model in general; with Ri we mean a particular instance of R, corresponding to a specific iteration for some particular GAN-model, from which we can sample refined synthetic images.


How the GAN is trained was given in detail in Algorithm 3 in section 3.3.1.

• Step 2: A whole refined synthetic image set with labels is created from the GAN trained in step 1. This set serves as the representation of that GAN in the SS evaluation. Because of volatility in the loss and in the subjective quality of the refined images over training, the refined images are not generated from the Ri corresponding to the last GAN-training iteration. Instead they are generated one third each from the three Ris corresponding to the iterations with lowest loss. This loss is the sum of the loss for R and for D. Sampling from a trained R was described in section 3.3.2.

• Step 3: The labeled refined image set created in step 2 is used to train the SS algorithm. The SS algorithm was described in sections 2.4 and 3.4.

• Step 4: The trained SS algorithm from step 3 is evaluated using real labeled images. This step is described in section 3.4.

Just for the sake of clarification, it is not the performance of the SS algorithm itself that we wish to explore, but rather the performance of the GAN-refined image set as a training set for that SS algorithm. The SS algorithm itself is the same for all GAN-models and is only a means for quantifying the GAN-performance.

Figure 3.4: The pipeline of how a GAN-model is trained and evaluated


3.6 Tests and GAN-models

To reconnect to the research questions in section 1.3, the main objectives are:

1. To find the best performing GAN-model.

2. To explore how the best performing GANs from 1 perform on the different synthetic sets presented in section 3.2.

3.6.1 Main Tests Based on GAN-loss

To reach the solution to the first objective, the 4-step process in section 3.5 is completed multiple times with various modifications made to the default GAN presented in section 3.3.3. The synthetic image set used is USS0. The various GAN-models explored are given in Table 3.3. Note that every parameter and design choice that is not explicitly changed is the same as for the default GAN.

The raw versions from Table 3.3 that work well are also combined, to hopefully create even better performing models. This means combining the deviations from the default GAN of several models into one single model. For example, a combination of GAN3 and GAN4 would be a model identical to the default GAN except that it uses both instance noise and noisy labels. Obviously, there was not time to try all possible combinations, and no objectively right answer for how to combine them could be conceived of either. Therefore the selection process is heavily based on intuition and common sense. For instance, GAN3, GAN4 and GAN5 all involve methods that make D's job harder by artificially mixing up the two classes' probability distributions, to hopefully find more stable solutions. Including all of these in one model might, however, not be preferable, since creating too much confusion for D has an adverse effect on stability and quality of results. On the other hand, combinations of models where one has high PA but low IoU, and the other has high IoU but low PA, can hopefully extract the high true positive rate of the former and the low false positive rate of the latter and merge them into one single model which is superior in both respects.

For benchmarking the quantitative results of the GANs, the SS algorithm is also trained one time with unrefined synthetic images and one time with real images. This amounts to changing the refined synthetic set in Step 3 in Figure 3.4 to a USS and a real set respectively. When the USS used for GAN-training in step 1 is USS0, the USS used for benchmarking in step 3 is also USS0. The same holds for USS1 and USS2.

The second objective stated in the beginning of section 3.6 is addressed by testing the three best performing GANs from training on the USS0 synthetic set on the USS1 and USS2 sets described in section 3.2. In case there is ambiguity regarding which GANs could be considered "best performing", precedence is given to the model with highest IoUR on 2-D. As above, the results are benchmarked against SS training on real data and SS training on the corresponding USS.


GAN-model   Description
GAN0        Default GAN
GAN1        DCGAN
GAN2        Deeper architecture: 12 ResNet-blocks in refiner, 6 additional Conv-layers in discriminator to ensure the same visual field for refiner and discriminator
GAN3        Instance noise: original std = 0.1, annealed linearly over 24000 epochs
GAN4        Noisy labels: uniform random label; positive interval = [0.85, 0.95], negative interval = [0.05, 0.15]
GAN5        Flip real labels: with probability 0.1, real samples get labeled as synthetic in the discriminator
GAN6        Noise refiner: additive Gaussian noise with std = 0.02 is applied to the input and to every ResNet-block input
GAN7        Dropout refiner: dropout(0.5) is applied to the first layer and to the first layer in each ResNet-block, i.e. randomly freezing 50% of the weights during training
GAN8        LeakyReLU output discriminator: the activation from the last Conv-layer is LeakyReLU
GAN9        More discriminator training: K_DA = 3
GAN10       More images from latest refiner version: b_new = 0.75, b_old = 0.25
GAN11       Changed Adam optimizer parameters: α = 0.0002, β1 = 0.5, β2 = 0.9
GAN12       Feature matching: R's adversarial loss is changed to (2.27)

Table 3.3: Summary of the GAN models implemented and what differentiates them from the default GAN


Tests of Training Time
As was stated in the research questions in section 1.3, the training times for the various models are to be measured and compared. This might seem redundant at first, since each model is only trained once. However, every small modification to the scenario, e.g. changing image size or synthetic data generation, requires re-training of the GAN. Therefore, significant differences in training times between different models are still of interest.

Measuring the training time does not require a separate test; instead, training time is recorded during training of all the GANs when trained on dataset USS0. The training time measurement of highest interest is that of full time until convergence, as this tells us how long it takes to achieve a fully developed refiner. It is also of interest to measure training time per alternating training iteration, as this gives us an indication of the additional time required for (or time saved when) working with deviations from the default GAN.

3.6.2 Test of Subjective Quality

In the research questions stated in section 1.3, there are strictly speaking no constraints requiring generated images to resemble real images for the human eye. However, the aspect of visual appeal is not completely insignificant. Since a lot of data processing is still carried out or controlled by humans, the erosion of confidence in synthetic data sets that show little or no resemblance to the data they are supposed to emulate might make such data unusable, despite desirable quantitative performance. Therefore, as a side-task, a subjective selection of the versions of R producing the most visually appealing results is also made, in addition to the loss based selection method described in section 3.5.

3.6.3 Tests for Checking Reliability

Since there is significant randomness in the method used, i.e. initialization of the GAN-weights, training image order for the GAN and training image order for the SS algorithm, two tests were carried out to check the reliability of the results. One test checks the spread of the quantitative results due to stochasticity over the whole 4-step process in Figure 3.4, and the other test checks the same spread for the SS algorithm alone. The first test essentially measures reliability in GAN-training (step 1) and SS training (step 3), and the second test only measures reliability in step 3. Steps 2 and 4 have no stochastic aspects.

Test 1: GAN Training and SS Training
This test is carried out by taking one and the same GAN-model (GAN0), feeding it through the 4-step process eight times and recording the min, max, mean and standard deviation of IoUR.


Test 2: SS Training
This test is carried out by taking one and the same GAN-refined image set (generated from GAN0) from step 2, feeding it through steps 3 and 4 eight times and recording the min, max, mean and standard deviation of IoUR.


4 Results

In this chapter, all results that are relevant for answering the research questions are presented. Firstly, results concerning the training of the GANs are presented in section 4.1, before moving on to the quantitative results from the SS evaluation in section 4.2. In section 4.3 there is then a presentation of the actual refined images generated from various trained models. In chapter 5 there will be more of a discussion about the findings presented in this chapter and how they can be explained by theory and the research method applied.

4.1 Training the GAN

This section covers the results and findings connected to the training of the GANs, i.e. loss progress and training time.

4.1.1 Loss Progress

One key to supervising the learning progress is to study the loss values during training. The loss development over training varies quite a lot between the different GAN-models. In Figures 4.1, 4.2 and 4.3 there are some examples of models trained on dataset USS0. The rest can be found in Appendix A.1. The different GAN versions were described in section 3.6, and the models with names with prefix COMB are simply combinations of these "raw" models. What is shown in these figures is the mean loss over the samples over K = 20 iterations. This is calculated by (4.1) for D, using (3.1),

$$L_{\text{mean}}(D) = \frac{1}{KN} \sum_k^K L_k(D), \qquad (4.1)$$


and by (4.2) for R, using (3.2)

$$L_{\text{mean}}(R) = \frac{1}{KN} \sum_k^K L_k(R), \qquad (4.2)$$

where N is the number of samples per batch and Lk is the loss over batch k.

Figure 4.1: Loss progress for GAN12 trained on dataset USS0


Figure 4.2: Loss progress for comb3,5,7 trained on dataset USS0

Figure 4.3: Loss progress for comb3,7,9,10 trained on dataset USS0


The loss for GAN12 is shown in Figure 4.1. GAN12 has a stable, over time declining, D-loss, but an increasing R-loss. COMB3,5,7, exhibited in Figure 4.2, shows stability in terms of D- as well as R-loss, although there are some spikes. For COMB3,5,7 we see a strong correlation between the R-loss and the D-loss, where they follow each other both upwards and downwards. As opposed to COMB3,5,7, COMB3,7,9,10, shown in Figure 4.3, has a very unstable R, where we see large oscillations in the R-loss for small perturbations of the D-loss. There is also an evidently non-optimal plateau for R between iteration 2200 and 3900. It is essentially just for a couple of hundred iterations around iteration 1000 that the loss of R looks reasonable.

4.1.2 Training Time

The intended training time tests were described in section 3.6.1. These were:

1. total time to convergence,

2. time per alternating training iteration.

But as was shown in section 4.1.1, the loss is not decreasing over time for all models, and the lowest losses do not necessarily occur at the end of training. This makes test 1 above significantly harder to perform, since there is ambiguity regarding whether the model converges at all, and if so, when it is fully converged. Therefore, the way test 1 is actually carried out in practice is to measure the total time for reaching the successive 20 iterations corresponding to the lowest mean combined loss. With combined loss, we mean the R-loss and the D-loss added together. This is shown, together with test 2 above, in Table A.1 in Appendix A.1.

We see that the training time per iteration is fairly similar for most models. It is around 0.65 seconds. However, GAN9, GAN12 and combinative models including these have noticeably higher training times per iteration. For GAN9 it is 1.15 seconds and for GAN12 it is 0.96 seconds.

The total time to optimal iteration does not show the same commonly shared center. The lowest loss value shows up after minute 22 and before minute 96 for all models.

4.2 Quantitative Segmentation Results

Here the quantitative results from the SS evaluation are presented in tables. The most interesting features of these tables are extracted and described in text.

4.2.1 Quantitative Results from GANs Based on USS0

In Table 4.1, the metrics IoUR and IoUN from the SS evaluation are shown for all the different models based on USS0. In Table 4.2, the corresponding results for the metrics PAR and PAN are presented. All metrics were described in section 3.4.


There are two benchmarks: the real set and the unrefined set USS0, as was described in section 3.6.1.

With regards to IoU, no GAN model performed as well as the benchmark Real on any of the three classes (BG, 1-D, 2-D). This is however not true with regards to PA, where for example COMB3,7,10 has a PAN of 47.4% for 2-D, whereas Real only has 28.7%. COMB3,7,10 shows the highest PA-value on 2-D of all models, but the phenomenon of relatively high PA scores compared to IoU is prevalent in other models as well. This indicates that the factor damaging IoU for the GAN-refined sets is in most cases too many false positive classifications of codes, rather than too few true positives. This could also be seen when studying the pixel-wise inference made by the segmentation algorithm.

Comparisons to training on USS0 can also be made. This means training the SS algorithm on the unrefined image set USS0 and then evaluating the SS algorithm on real images with the metrics described in section 3.4 as usual. The result from this comparison is that all GAN-refined sets perform better with regards to IoU on 1-D and BG. Looking at PA, the explanation for why the IoU of USS0 is relatively low is a strong over-classification of the 1-D code. For 2-D however, USS0 performs better than many of the GAN-models.

On the other hand, some GANs perform significantly better than USS0 also with regards to 2-D codes. The best model with regards to IoU on 2-D is COMB3,7,9,10, which has 34.4% in IoUR and 23.5% in IoUN. USS0 has 26.2% in IoUR and 18.0% in IoUN for 2-D.

4.2.2 Subjective Comparison

In Tables 4.3 and 4.4, the results are shown from the SS evaluation on the sets of subjectively selected Ris. The motivation for why this test is useful was given in section 3.6.2. This evaluation was only carried out for those models where there was a clear difference between the images produced by the Ris corresponding to the lowest loss and the images produced by those Ris producing the most visually appealing images.

We see in Figure 4.4 that the training set of subjectively chosen Ris performs slightly better on average for all classes, for both IoUR and IoUN. The difference is however not great (<2 percentage points at maximum). The variance was also fairly high between models. For some models, for instance GAN3, the objective version performed much better than the subjective one.


                     IoUR (%)               IoUN (%)
Training data sets   BG    1-D   2-D        BG    1-D   2-D
Real (benchmark)     96.9  76.5  46.9       96.7  56.2  24.9
USS0 (benchmark)     86.4  37.3  26.2       86.2  27.9  18.0
GAN0                 92.3  46.5  24.5       92.0  28.8  15.1
GAN1                 92.6  51.6  21.7       92.2  31.9  11.3
GAN2                 92.4  52.7  20.2       92.1  31.5  13.6
GAN3                 93.3  56.8  26.5       93.0  34.3  17.2
GAN4                 91.9  48.5  24.3       91.4  29.9  12.1
GAN5                 94.0  52.7  18.0       93.8  32.5  17.1
GAN6                 93.8  57.3  28.5       93.5  35.5  20.1
GAN7                 88.2  42.0  29.2       87.8  29.7  19.0
GAN8                 94.2  57.1  20.0       94.1  33.8  15.3
GAN9                 92.3  46.5  24.1       92.1  28.7  13.2
GAN10                92.0  51.4  25.2       91.7  32.1  14.4
GAN11                93.0  50.4  20.3       92.6  29.6  11.9
GAN12                94.6  56.7  25.6       94.4  34.5  17.5
COMB3,5,7            94.9  64.5  31.2       94.7  41.0  21.0
COMB3,6,9            94.1  55.5  24.2       93.8  33.1  14.1
COMB3,7,10           89.1  49.2  22.6       89.0  31.2  20.3
COMB3,7,9,10         94.7  65.6  34.4       94.5  41.8  23.5
COMB6,10             93.6  53.8  23.8       93.3  32.1  16.6

Table 4.1: IoU scores from the SS evaluation of all different GAN-models trained on USS0. Included are also the two benchmark sets: Real and USS0. The best GAN-model result for each metric and class is bold.


                     PAR (%)                PAN (%)
Training data sets   BG    1-D   2-D        BG    1-D   2-D
Real (benchmark)     99.0  85.5  50.2       98.9  78.3  28.7
USS0 (benchmark)     87.0  93.8  50.5       86.8  89.7  31.3
GAN0                 95.8  61.0  48.0       95.6  56.9  32.4
GAN1                 94.9  81.0  32.5       94.6  76.8  17.0
GAN2                 94.9  80.6  32.7       94.7  75.8  26.8
GAN3                 95.7  75.2  53.4       95.4  68.5  36.7
GAN4                 93.9  81.7  39.0       93.5  76.2  18.8
GAN5                 98.1  62.9  22.6       98.0  57.4  26.5
GAN6                 96.5  71.5  53.2       96.3  65.3  44.9
GAN7                 89.2  91.6  57.6       88.9  86.0  38.0
GAN8                 97.6  72.2  25.8       97.5  65.8  23.7
GAN9                 95.7  63.3  43.9       95.6  59.8  25.0
GAN10                93.8  85.8  41.2       93.6  80.9  26.4
GAN11                96.1  72.4  28.4       95.8  63.5  18.1
GAN12                98.4  65.3  32.7       98.3  57.5  26.2
COMB3,5,7            97.0  81.7  49.2       96.9  72.7  34.4
COMB3,6,9            97.3  71.6  33.0       97.1  65.7  21.4
COMB3,7,10           90.3  87.4  63.2       90.3  80.6  47.4
COMB3,7,9,10         96.2  85.8  60.8       96.1  80.4  45.6
COMB6,10             96.9  65.4  42.3       96.8  59.4  32.7

Table 4.2: PA scores from the SS evaluation of all different GAN-models trained on USS0. Included are also the two benchmark sets: Real and USS0. The best GAN-model result for each metric and class is bold.

                     IoUR (%)               IoUN (%)
Training data sets   BG    1-D   2-D        BG    1-D   2-D
Real (benchmark)     96.9  76.5  46.9       96.7  56.2  24.9
USS0 (benchmark)     86.4  37.3  26.2       86.2  27.9  18.0
GAN1                 93.0  51.4  19.7       92.7  30.7  13.1
GAN3                 91.6  47.2  16.9       91.1  27.5   8.9
GAN4                 93.6  54.2  28.7       93.2  34.2  16.7
GAN9                 94.2  56.5  30.3       93.9  37.3  20.4
GAN10                95.0  63.9  27.6       94.8  40.4  18.3

Table 4.3: IoU for some GAN-models, for which the Ris are subjectively chosen. The best GAN-model result for each metric and class is bold.


(a) IoU Raw (b) IoU Normalized
Figure 4.4: These plots show the difference in mean semantic segmentation scores (IoU) between the objectively and the subjectively best refiners for GAN models GAN1, GAN3, GAN4, GAN9, GAN10, for the classes 0: BG, 1: 1-D, 2: 2-D

4.2.3 Additional Synthetic Sets

As mentioned in section 3.6, the best performing models from training with the USS0 set were also trained on USS1 and USS2. The motivation for why these tests are useful was given in section 3.2. The results are shown in the tables below.

USS1
In Tables 4.5 and 4.6 are the SS evaluation results for some GAN-models trained on USS1. We see that losing the text in the synthetic input images has a significant negative effect on the results. GAN6, COMB3,5,7 and COMB3,7,9,10 trained on USS1 all perform worse in the SS evaluation, in terms of IoU, than when trained on USS0. This is also true for the unrefined benchmark USS1. Only GAN6 has an IoUR higher than that of the unrefined benchmark USS1. Looking at Table 4.6, it is obvious that the low IoU stems from an over-prediction of codes, since PA is essentially negatively correlated with IoU for both 1-D codes and 2-D codes.

USS2
The results when using USS2 for training the GANs, seen in Tables 4.7 and 4.8, are very similar to the results when using USS1 described above. There is a larger positive performance gap between the best GAN-refined training set (GAN6) and the unrefined benchmark USS2 in terms of IoU.

4.2.4 Reliability

The results from the reliability tests are presented below. The motivation for why reliability tests are useful was given in section 3.6.3. Test 1 is presented in Table 4.9 and Test 2 is presented in Table 4.10. We can see that the variation is much higher for Test 1. The mean values of the two tests are on the other hand quite close to each other. The reliability tests are discussed much more in depth in section 5.2.5.


4.3 Refined Images

In Figure A.2 in Appendix A.2 there is a collection of refined images from the Ris corresponding to the lowest combined loss for the best performing GAN-models. In Figure A.3 there is a collection of the most visually appealing images for some of the models.

Almost all of the models produce a different transition from the sticker to the background compared to the sharpness of the original unrefined images. Many models produce some sort of color change across the sticker, resembling the color and shadow transitions in crumpled surfaces to some degree or another. The degree of realism of these varies. GAN0, for example, produces stripe-like patterns that are quite unnatural. This can be seen in Figure A.2, on the row corresponding to GAN0. We see in Figure A.3, on the row corresponding to GAN0, that these stripe-like patterns exist for the subjectively chosen images as well, but they are far smoother here.

Another characteristic worth noting is the similarity between samples coming from the same Ri. We see that all images for each model are modified in more or less the exact same way. If one image has border artefacts, then most images from that Ri have border artefacts, and so on. This can be seen in Figure A.2 on the row corresponding to GAN1, where all images have dark horizontal stripes close to the lower image border. Looking at GAN6 in Figure A.2, we see that the images have vertical stripes close to the right border.


                     PAR (%)                PAN (%)
Training data sets   BG    1-D   2-D        BG    1-D   2-D
Real (benchmark)     99.0  85.5  50.2       98.9  78.3  28.7
USS0 (benchmark)     87.0  93.8  50.5       86.8  89.7  31.3
GAN1                 96.4  69.4  30.7       96.2  62.6  22.7
GAN3                 94.6  75.0  25.9       94.2  67.6  15.3
GAN4                 96.5  73.3  41.2       96.3  65.9  26.3
GAN9                 97.0  73.8  43.0       96.9  70.1  34.9
GAN10                97.6  80.1  36.3       97.5  73.9  28.5

Table 4.4: PA for some GAN-models, for which the Ris are subjectively chosen. The best GAN-model result for each metric and class is bold.

                     IoUR (%)               IoUN (%)
Training data sets   BG    1-D   2-D        BG    1-D   2-D
Real (benchmark)     96.9  76.5  46.9       96.7  56.2  24.9
USS1 (benchmark)     79.0  39.0  12.7       78.4  26.8  14.1
GAN6                 88.6  42.8  15.5       88.2  25.9  11.4
COMB3,5,7            75.9  33.0  12.3       75.2  24.2  12.8
COMB3,7,9,10         78.4  38.0  11.8       77.8  24.8  11.7

Table 4.5: IoU for a couple of GAN-generated training sets where the unrefined synthetic set used is USS1. The best GAN-model result for each metric and class is bold.

                     PAR (%)                PAN (%)
Training data sets   BG    1-D   2-D        BG    1-D   2-D
Real (benchmark)     99.0  85.5  50.2       98.9  78.3  28.7
USS1 (benchmark)     79.5  85.5  80.5       79.0  78.7  73.1
GAN6                 91.2  68.8  48.8       90.7  64.4  31.2
COMB3,5,7            76.3  89.7  73.1       75.7  84.3  63.9
COMB3,7,9,10         79.0  85.0  75.7       78.5  77.4  57.7

Table 4.6: PA for a couple of GAN-generated training sets where the synthetic set used was USS1. The best GAN-model result for each metric and class is bold.


                     IoUR (%)               IoUN (%)
Training data sets   BG    1-D   2-D        BG    1-D   2-D
Real (benchmark)     96.9  76.5  46.9       96.7  56.2  24.9
USS2 (benchmark)     65.8  27.4  10.2       65.9  20.1  11.9
GAN6                 92.6  52.8  19.9       92.3  31.1  14.2
COMB3,5,7            62.8  24.3  10.7       62.2  20.0  11.9
COMB3,7,9,10         75.3  37.5  10.9       74.7  24.6  12.7

Table 4.7: IoU for a couple of GAN-generated training sets where the synthetic set used is USS2. The best GAN-model result for each metric and class is bold.

                     PAR (%)                PAN (%)
Training data sets   BG    1-D   2-D        BG    1-D   2-D
Real (benchmark)     99.0  85.5  50.2       98.9  78.3  28.7
USS2 (benchmark)     65.9  90.3  87.0       65.1  83.8  82.1
GAN6                 96.0  60.8  49.7       95.8  52.9  34.4
COMB3,5,7            62.9  93.4  84.8       62.3  87.8  78.4
COMB3,7,9,10         75.6  82.1  89.4       75.0  71.2  84.0

Table 4.8: PA for a couple of GAN-generated training sets where the synthetic set used is USS2. The best GAN-model result for each metric and class is bold.

           IoUR (%)
Classes    min    max    mean   std
BG         87.8   94.7   92.7   2.0
1-D        36.4   62.8   53.1   7.1
2-D        14.3   29.7   23.6   5.1

Table 4.9: Reliability Test 1 shows variation in IoUR scores for model GAN0 over GAN training and SS training on synthetic set USS0 over eight iterations.

           IoUR (%)
Classes    min    max    mean   std
BG         92.6   94.0   93.5   0.4
1-D        49.4   57.4   54.0   2.8
2-D        22.4   26.1   24.7   1.2

Table 4.10: Reliability Test 2 shows variation in IoUR scores for model GAN0 over SS training on synthetic set USS0 over eight iterations.


5 Discussion

In this chapter the results from chapter 4 are discussed in more detail. When possible, connections are also drawn to chapters 2 and 3 to relate the results to relevant theory and the method chosen. The essence of this chapter is then reiterated in chapter 6 and used to answer the research questions.

5.1 Training the GAN

This section concerns the loss progress and training time for the different models.

5.1.1 Loss Progress

The Discriminator (D) shares all of the characteristics of a normal 2-class classifier network, except that the statistics of its inputs are non-stationary, since the refiner (R), and hence the refined synthetic set, is updated over time. For standard classifiers the loss is normally steadily decreasing before reaching a plateau. For D however, there are often spikes and sometimes a lack of decrease. This can be seen in the loss graphs in Figure A.1, and it was mentioned in section 4.1.1.

Spikes in D-loss
The non-stationary inputs to D surely have some explanatory value for why there are so many large spikes in D's loss for some of the models. When pre-training D on static inputs, no such large spikes were discernible. However, it is not correct to say that these spikes are normal for a well-behaving GAN. Chintala et al. [3], for example, explicitly say that: "when things are working, D loss has low variance and goes down over time vs having huge variance and spiking".


One part of the explanation for the unwanted spikes is probably mode collapse. As was discussed in section 4.3, when looking at the refined images from one specific Ri for some GAN-model, the images are always very similar and also often a bit blurry. This is another indicator of mode collapse, according to Huang et al. [16]. As can be seen in Figure A.2, all refined images from one specific Ri are changed in very much the same way, in terms of colouring, border artefacts, crumple characteristics, and so on. Just to give a reminder, an Ri is a specific instance of a refiner (R) model during training, defined by the current values of R's network weights. If the output distribution of the refiner (pR) is concentrated around one or a few dominant modes, the locations of these relative to D's decision boundary become critical for D's overall performance. The peaks in D's loss would thereby stem from these dominant pR-modes moving back and forth around the discriminative boundary. This is opposed to a more diverse density function, where the impact of specific pR-modes would be less dramatic in D's decision landscape.

D and R Interaction
The interaction between D's and R's losses is not straightforwardly interpreted. Because of the adversarial nature of the two networks' loss functions, the intuition is that they should be negatively correlated, so that when D becomes very good at its job, R's loss increases because of the difficulty it has fooling D, and vice versa. Looking at the loss graphs in Figure A.1, this is true in some instances but far from all. For GAN7 there are evident signs of positive correlation up to iteration 6000. Judging from the images generated, which show no discernible changes between iterations 2000 and 6000, this is probably some sort of local minima-pair. Both R and D find some region from which they have no incentive to deviate. Because of the non-stationary input of D and the non-stationary loss function of R (which depends on a continuously updated D), it could also be an orbital trajectory around some minimum rather than one consistent single point. We see around iteration 7000, where this behaviour is disrupted, how the correlation between the R-loss and the D-loss turns from positive to negative, and D reaches its absolute lowest loss when R's loss is spiking.

One of the most well-behaved models in terms of loss structure is GAN12. This is shown in Figure A.1. Here D shows a slowly decreasing loss without any spiking. Since GAN12 is the feature matching model, R's loss is defined differently than for all other models, making comparisons difficult. But we can see that the R-loss goes up. This is intuitively undesirable, since it means, in the case of feature matching, that the refined image statistics differ increasingly from the real image statistics as training progresses. But it is essential to remember that those statistical differences are defined by D, which gets better and better. The judge of R's performance gets more critical with each iteration. The performance and progress of R might therefore still be relatively good.


5.1.2 Training Time

The research question concerning training time lost some of its essence when there were so few instances of longer lasting convergence to any specific solution, but rather just shorter spurts of convergence to local optima, which were soon abandoned. This problem was touched upon in section 4.1.2. The metric used, Time to Optimal Iteration, defined as the time taken to reach the iteration of lowest combined loss, is sensitive to individual loss values and stochasticity. Comparisons between models based on single measurements are therefore not fully reliable.

The metric Time/iteration, on the other hand, is fully reliable. In Table A.1 in Appendix A.1.2, we see almost doubled values for the models including GAN9. This is not surprising, since we then have two additional D-training steps per alternating training step.

5.2 Quantitative Segmentation Results

Solely judging from the quantitative results presented in Table 4.1 and described in section 4.2.1, COMB3,7,9,10 should be considered the best performing model overall, with IoU-values exceeding all other models and the unrefined benchmark by several percentage points for both 1-D and 2-D. In Table 4.1, this can be seen by comparing COMB3,7,9,10 with the other models in the 1-D and 2-D columns, for IoUR as well as for IoUN. In general, the GAN-models proved to be successful compared to the unrefined benchmark on BG and 1-D, whereas the success compared to the unrefined benchmark on 2-D varied between models.

5.2.1 Connection between Quantitative Results and Loss

It is hard to find any consistent patterns connecting the loss graph appearance in Figure A.1 with the quantitative performance of the GANs in Table 4.1. Those models that show signs of longer lasting convergence for D, as is deemed preferable by Huszár [17] and Chintala et al. [3], do not consistently outperform the ones where no longer lasting convergence was discernible. This was for example true for GAN1 and GAN8. For these models, the D-loss clearly goes down and the loss variance decreases, which is seen in Figure A.1. However, when studying the IoU-results for all classes for GAN1 and GAN8 in Table 4.1 and comparing them to the other models, GAN1 and GAN8 do not perform significantly better than the majority of the rest.

The reason for this could be that the statistics determining resemblance in the GAN network are not the same as those determining the performance of a data set in semantic segmentation (SS) training. There could be image aspects essential for SS performance that are deemed negligible by D in the GAN network, and that therefore never get transferred to R and the images it generates. This would imply either that the choice of SS as an evaluation method for determining the realism produced by GANs is non-optimal, or that searching for maximized realism in the training data is a non-optimal way of maximizing SS scores.


These are two sides of the same coin. The latter perspective is to some degree accounted for by introducing the regularization loss in addition to the adversarial loss for R, as was covered in section 3.3.4.

5.2.2 Subjective Comparison

The subjective comparison presented in section 4.2.2 shows that picking out the top Ris by looking at the images slightly outperformed the objective selection measure of combined loss. This could be seen in Figure 4.4. This tells us that, against imperfect Ds, the human eye is a strong competitor for recognizing differences between real and synthetic data. It is highly likely that humans focus on different aspects when deciding on the degree of realism in synthetic images than D does. Furthermore, it could be the case that these humanly chosen aspects are of higher importance for the SS performance than the ones of high discriminative value for D.

5.2.3 Additional Synthetic Training Sets

When using the synthetic sets USS1 and USS2, the quantitative results go down across the board, as was mentioned in section 4.2.3. As is seen in Tables 4.5-4.8, text (or text-like features) especially seems to be vital in the training data for the SS algorithm, and it also seems hard to introduce through GAN training.

5.2.4 Model Characteristics

It is difficult to discern any strong patterns distinguishing good models from bad ones when looking at the IoU results in Table 4.1. It seems like instance noise (GAN3) is generally good with regards to IoU. It is at least better than the other discriminator regularization methods. Both GAN6 and GAN7 also work well with regards to IoU. These are methods of regularization for R. As noted by Arjovsky et al. [1], most generative models need noise components to get an overlap between the desired distribution and the distribution of the generator. This was also discussed in section 2.3.4.

5.2.5 Reliability

As could be seen in section 4.2.4, the reliability of the results can be put into some question. For Test 2, shown in Table 4.10, the variations were considerably smaller than for Test 1, shown in Table 4.9. This tells us that most of the unreliability stems from the GAN training rather than the SS training. So when talking about COMB3,7,9,10 as the best performing model, this variability has to be kept in mind. Considering the classes BG and 1-D, it still seems safe to say that GAN-training improves IoU compared to SS-evaluating unrefined synthetic data.
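
As a sketch of how this attribution can be made, one can compare the spread of repeated runs of the full pipeline with the spread of repeated SS trainings on a single fixed refined set. The numbers below are placeholders for illustration, not the values in Tables 4.9 and 4.10.

    import statistics

    # Hypothetical repeated-run IoU values, for illustration only.
    test1 = [0.71, 0.66, 0.74]  # re-running GAN training + SS training (Test 1)
    test2 = [0.70, 0.71, 0.69]  # re-running SS training on one refined set (Test 2)

    # A clearly larger spread in Test 1 than in Test 2 points at the GAN
    # training, rather than the SS training, as the main source of variability.
    print(statistics.stdev(test1), statistics.stdev(test2))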

The best performing model, COMB3,7,9,10, had IoUR on 1-D almost 4 standard deviations higher than the unrefined set. For 2-D, the corresponding difference is less than 2 standard deviations. This is seen by generalizing the standard deviation of model GAN0 in Table 4.9 to COMB3,7,9,10 and calculating the difference in IoUR-performance between COMB3,7,9,10 and the unrefined synthetic set USS0 in terms of the number of standard deviations. This line of reasoning obviously risks being incorrect if the variability of COMB3,7,9,10 were significantly different from that of GAN0.
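
The calculation behind this comparison is plain arithmetic; the numbers below are placeholders, with the actual values found in Tables 4.1 and 4.9.

    # Placeholder values, not the thesis' numbers: plug in IoU_R from Table 4.1
    # and the standard deviation of GAN0 from Table 4.9.
    iou_comb_1d = 0.80   # hypothetical IoU_R of COMB3,7,9,10 on 1-D
    iou_uss0_1d = 0.72   # hypothetical IoU_R of the unrefined set USS0 on 1-D
    std_gan0_1d = 0.02   # hypothetical standard deviation of GAN0 on 1-D

    n_std = (iou_comb_1d - iou_uss0_1d) / std_gan0_1d
    print(f"Difference corresponds to {n_std:.1f} standard deviations")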

Perhaps more clever methodological choices could have been made to increase reliability. In particular, using the three intermediary Ris corresponding to the lowest combined loss, as was described in section 3.5, is not a perfect representation of that GAN-model as a whole, because of the unnatural dependency on specific loss values that the method creates. Sometimes, a shift in loss of just a couple of percent for some Ri gives rise to very different outcomes with regards to which three Ris are chosen, and subsequently, which images are used to represent the GAN-model in the SS evaluation.

This dependency on particular Ris also grew stronger through the seemingly low entropy in pR, where the refinements by one Ri of two different input images were very often similar. It should be noted that this solution was not the original intention, which was rather to simply pick the last Ri once there had been evident convergence. But as has been noted several times, there was often no longer lasting convergence.

One intuitive remedy for the dependency problem discussed above would be to increase the number of Ris used, from the top three to the top ten or something along those lines. This could very well increase reliability (and perhaps also overall scores). Letting several Ris contribute to the final result has evident similarities to the successful model family of "ensemble learning", where sets of individually weak algorithms are combined in some voting process, so as to together form a strong algorithm [6]. However, one should consider the impractical aspect of having to sample from multiple Ris.
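
A minimal sketch of what such an ensembling could look like, here as plain averaging of several refiners' outputs rather than a voting scheme (a simplification chosen for illustration; refiners is assumed to be a list of trained Ri checkpoints loaded as PyTorch modules):

    import torch

    def ensemble_refine(refiners, synthetic_batch):
        # Run each R_i checkpoint on the same batch and average the results.
        with torch.no_grad():
            outputs = [r(synthetic_batch) for r in refiners]
        return torch.stack(outputs).mean(dim=0)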

Another solution would have been to go through the full process in Figure 3.4 more than once and then average the quantitative results. Because of time constraints, this would however not have been possible without limiting the number of different GAN-models explored considerably.

5.3 Refined Images

We close this chapter with some general comments about the subjective quality of the images produced. A subset of these are shown in Figure A.3. First of all, the GANs have a tendency to smooth the image a lot. Instead of keeping transitions sharp, the background and foreground are often melted together. This was
touched upon in section 4.3 and shown in the figures of Appendix A.2. As Huang et al. [16] point out, this smoothness commonly occurs in models suffering from mode collapse, where multiple modes in pdata are averaged into a single mode in pR. There were also often unrealistic border effects introduced along one or several sides of the image.

On the positive side, there are signs of light effects, creases and crumpled surfaces, sometimes convincingly similar to real samples. COMB3,7,9,10, for example, shows this, with color transitions on the sticker that resemble creases and shadowing.

6 Conclusions and Future Work

This chapter concludes the whole report, primarily by answering the research questions stated in section 1.3. The project is also discussed from ethical and social perspectives, and recommendations regarding future work are given.

6.1 Conclusions

First the research questions stated in section 1.3 are reiterated:

1. Which architectural choices, hyperparameter sets and other design parameters with respect to the GAN give the training sets that produce the best results with regards to SS Pixel Accuracy (PA), SS Intersection over Union (IoU) and convergence time?

2. How do the GAN-refined training sets perform at segmentation tasks with regards to PA and IoU compared to training sets comprised of unrefined synthetic images and real images respectively?

3. How do GAN-refined training sets compare to each other and to their unrefined counterparts for sets of synthetic input images with different degrees of complexity? The phrasing "different degrees of complexity" refers to whether the images have full gray scale range or are binary, and whether they contain text or not.

Below the research questions are answered one by one:

1. The best performing GAN model with regards to IoU is COMB3,7,9,10, although some precaution due to the variability in quantitative results should be taken before making this a final conclusion. With regards to PA, the results are a bit more ambiguous and no single model is best across all
classes. But COMB3,7,9,10 is among the top performers here as well. The model COMB3,7,9,10 shares its architectural foundation and hyperparameters with the Default model (GAN0) declared in Figure 3.3 and Table 3.2. But it contains the following modifications (a training-loop sketch follows the list):

• instance noise,

• dropout added to R,

• the number of discriminator iterations per alternating training (KDA) is increased to 3,

• historic batching is reduced by taking 75% of the images into D from the most recent Ri instead of only 50%.
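
The following is a hedged sketch of how these modifications could enter the alternating training loop. All hyperparameter values and helper names are illustrative assumptions, not the thesis' exact settings; dropout in R is assumed to be part of R's own definition, D is assumed to output probabilities, and the label-preserving regularization term from section 3.3.4 is omitted for brevity.

    import torch
    import torch.nn.functional as F

    K_DA = 3            # discriminator iterations per alternating step
    RECENT_FRAC = 0.75  # share of D's fake batch taken from the newest R_i
    NOISE_STD = 0.1     # instance-noise standard deviation (hypothetical value)

    def train_step(D, R, d_opt, r_opt, real, synthetic, history):
        # history: tensor buffer of refined images from earlier R_i's.
        for _ in range(K_DA):
            refined = R(synthetic).detach()
            n_recent = int(RECENT_FRAC * refined.size(0))
            idx = torch.randint(history.size(0), (refined.size(0) - n_recent,))
            fake = torch.cat([refined[:n_recent], history[idx]])
            # Instance noise: perturb both real and fake inputs to D.
            real_n = real + NOISE_STD * torch.randn_like(real)
            fake_n = fake + NOISE_STD * torch.randn_like(fake)
            p_real, p_fake = D(real_n), D(fake_n)
            d_loss = (F.binary_cross_entropy(p_real, torch.ones_like(p_real))
                      + F.binary_cross_entropy(p_fake, torch.zeros_like(p_fake)))
            d_opt.zero_grad(); d_loss.backward(); d_opt.step()
        # One refiner update per alternating step.
        p = D(R(synthetic))
        r_loss = F.binary_cross_entropy(p, torch.ones_like(p))
        r_opt.zero_grad(); r_loss.backward(); r_opt.step()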

The metric of convergence time is not easily defined considering the loss volatility, but all models reached their lowest loss after between 22 and 96 minutes. Most models are similar in Time/Iteration, around 0.65 seconds, except for GAN9, GAN12 and the combinative models including these. The model GAN12 is about 50% slower and GAN9 is about 75% slower than the mean of the remaining models. These training times were acquired using an Nvidia GeForce GTX 1080 GPU; other hardware might very well produce different results.

2. No model succeeds in performing as well as the Real benchmark with regards to IoU on any of the classes. Some models perform better on PA than the real benchmark. As is seen when comparing to IoU, however, this rather reflects overprediction of codes than more accurate predictions.

Compared to the unrefined set USS0, GAN-refinement improves IoU for the classes BG and 1-D for all models. Some models also outperform USS0 on 2-D. The lack of full reliability should be considered here.

3. Changing synthetic set from USS0 to USS1, whose images have no text, has a large negative impact on the quantitative results. But the decline in IoU is fairly similar in terms of IoU percentage points for all models, including the unrefined benchmark USS1. Therefore, the declining performance when the synthetic images lack text-like features is more indicative of the SS algorithm than of the GAN. There is no consistent difference in performance between using USS1 and using USS2.

6.2 Future Work

Essentially, GANs are all about statistical optimization in high-dimensional spaces. The difficulty of visualizing this space makes it hard to evaluate the quality of the GAN and its generator's output distribution. As Huang et al. state, "evaluation is still predominantly qualitative" [16]. This has been a question mark throughout this thesis work. How close are the distributions? Is the overlap that Huszár points out as absolutely crucial for training GANs really there [17]?

The Kullback-Leibler divergence was touched upon, but several other divergences are presented by Uehara et al. [38]. These lead to slightly different formulations of the loss functions and can possibly also lead to better optimization, despite having the same global optimum. Huang et al. present several evaluation metrics for GANs and try to find ways to evaluate the evaluation metrics [16]. More work studying these different divergences for GANs, and evaluating them with sound quantitative metrics, would be interesting to show with high reliability how far away the GAN-models are from producing truly realistic code images.
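
For reference, the Kullback-Leibler divergence referred to here has the standard definition, written for the data distribution pdata and the refiner's output distribution pR:

    \[
    D_{\mathrm{KL}}(p_{\mathrm{data}} \,\|\, p_R)
      = \mathbb{E}_{x \sim p_{\mathrm{data}}}\!\left[ \log \frac{p_{\mathrm{data}}(x)}{p_R(x)} \right]
    \]

It is asymmetric and blows up wherever pR assigns near-zero density to regions where pdata does not, which is one way to see why the distribution overlap discussed above matters for stable training.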

The label-preserving regularization is also something worth looking more closely into. There might be alternatives to the absolute difference over the codes that would produce more realistic results while preserving label information just as successfully.

For SICK, a natural next step would be to compare the GAN-generated data sets to other alternatives, such as end-to-end hand-crafted synthetic sets and augmented real data.

6.3 Social and Ethical Aspects

As with most machine learning applications, there are both social benefits and risks. As discussed in section 1.1, this project and similar ones have great potential in overcoming the obstacle of gathering the required training data, which so many applications are struggling with. This can be of great benefit to society if the applications are constructed with society's best interest in mind.

But looking at generative models in general, there are also concrete social risks. In an envisioned scenario where synthetic image, speech and even video generation are perfect, so that no discernible difference between synthetic and real exists, it is easy to see the potential danger. Effortless production of truly realistic but forged images, speech and video snippets can lead to a society where false information spreads widely and is difficult to stop. And considering the financial and political gains that could come from such unethical activity, it seems highly likely that these risks will be realized if the technology allows it and nobody stops it. It is therefore of highest importance that the users of these tools act responsibly and that the legal framework adapts to these new risks.


Appendix


A Results

A.1 Training the GAN

This section includes results from training the GANs.

A.1.1 Losses

Below are graphs of the training loss progress for all the GAN-models presented.

[Figure A.1, panels (a)-(r): loss curves for GAN0-GAN12 and the combinative models COMB3,5,7, COMB3,6,9, COMB3,7,9,10, COMB3,7,10 and COMB6,10.]

Figure A.1: Loss progress for the different GAN-models trained on USS0.

A.1.2 Training Time

Training data sets   Time/Iteration (s)   Time to optimal iteration (min)
GAN0                 0.71                 26
GAN1                 0.57                 70
GAN2                 0.72                 95
GAN3                 0.65                 58
GAN4                 0.64                 68
GAN5                 0.62                 96
GAN6                 0.65                 27
GAN7                 0.68                 73
GAN8                 0.62                 52
GAN9                 1.15                 96
GAN10                0.63                 95
GAN11                0.62                 40
GAN12                0.96                 38
COMB3,5,7            0.67                 48
COMB3,6,9            1.23                 69
COMB3,7,9,10         1.26                 22
COMB3,7,10           0.68                 80
COMB6,10             0.65                 71

Table A.1: Training times for all GAN models trained on USS0. Time to optimal iteration measures the total time until the model reaches its lowest additively combined loss for D and R.

A.2 Refined Images

Below are some examples of GAN-generated images from various models.

[Figure A.2, rows: USS0, GAN0, GAN1, GAN6, COMB3,7,9,10.]

Figure A.2: Refined, objectively chosen samples from various GAN-models based on unrefined synthetic set USS0. For each model, all samples come from the same Ri.

[Figure A.3, rows: USS0, GAN0 Subjective, GAN1 Subjective, GAN4 Subjective, GAN9 Subjective.]

Figure A.3: Refined, subjectively chosen samples from various GAN-models based on synthetic set USS0. For each model, all samples come from the same Ri.

Bibliography

[1] Martin Arjovsky, Soumith Chintala, and Léon Bottou. “Wasserstein GAN”. In: arXiv preprint arXiv:1701.07875 (2017).

[2] Jason Brownlee. A Gentle Introduction to Mini-Batch Gradient Descent and How to Configure Batch Size. 2017. url: https://machinelearningmastery.com/gentle-introduction-mini-batch-gradient-descent-configure-batch-size/.

[3] Soumith Chintala et al. How to Train a GAN? Tips and tricks to make GANs work. 2016. url: https://github.com/soumith/ganhacks (accessed: 16.04.2018).

[4] Dan Ciresan et al. “Deep neural networks segment neuronal membranes in electron microscopy images”. In: Advances in Neural Information Processing Systems. 2012, pp. 2843–2851.

[5] Deep Learning: Feedforward Neural Network. url: https://towardsdatascience.com/deep-learning-feedforward-neural-network-26a6705dbdc7 (accessed: 13.03.2018).

[6] Thomas G Dietterich. “Ensemble learning”. In: The Handbook of Brain Theory and Neural Networks 2 (2002), pp. 110–125.

[7] Rob DiPietro. A Friendly Introduction to Cross-Entropy Loss. 2016. url: https://rdipietro.github.io/friendly-intro-to-cross-entropy-loss/.

[8] Timothy Dozat. “Incorporating Nesterov momentum into Adam”. In: (2016).

[9] Efficiency / Relative Efficiency and the Efficient Estimator. url: http://www.statisticshowto.com/efficient-estimator-efficiency/ (accessed: 19.07.2018).

[10] Xavier Glorot, Antoine Bordes, and Yoshua Bengio. “Deep sparse rectifier neural networks”. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics. 2011, pp. 315–323.

[11] Ian Goodfellow. “NIPS 2016 tutorial: Generative adversarial networks”. In: arXiv preprint arXiv:1701.00160 (2016).

[12] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. http://www.deeplearningbook.org. MIT Press, 2016.

[13] Ian Goodfellow et al. “Generative adversarial nets”. In: Advances in Neural Information Processing Systems. 2014, pp. 2672–2680.

[14] Kaiming He et al. “Deep residual learning for image recognition”. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016, pp. 770–778.

[15] Geoffrey E Hinton, Simon Osindero, and Yee-Whye Teh. “A fast learning algorithm for deep belief nets”. In: Neural Computation 18.7 (2006), pp. 1527–1554.

[16] Gao Huang et al. An empirical study on evaluation metrics of generative adversarial networks. 2018. url: https://openreview.net/forum?id=Sy1f0e-R-.

[17] Ferenc Huszár. Instance Noise: A trick for stabilising GAN training. 2016. url: http://www.inference.vc/instance-noise-a-trick-for-stabilising-gan-training/ (accessed: 17.04.2018).

[18] Sergey Ioffe and Christian Szegedy. “Batch normalization: Accelerating deep network training by reducing internal covariate shift”. In: arXiv preprint arXiv:1502.03167 (2015).

[19] Phillip Isola et al. “Image-to-image translation with conditional adversarial networks”. In: arXiv preprint (2017).

[20] Aerin Kim. Difference between Batch Gradient Descent and Stochastic Gradient Descent. 2017. url: https://towardsdatascience.com/difference-between-batch-gradient-descent-and-stochastic-gradient-descent-1187f1291aa1.

[21] Diederik P Kingma and Jimmy Ba. “Adam: A method for stochastic optimization”. In: arXiv preprint arXiv:1412.6980 (2014).

[22] Diederik P Kingma and Max Welling. “Auto-encoding variational Bayes”. In: arXiv preprint arXiv:1312.6114 (2013).

[23] Will Kurt. Kullback-Leibler Divergence Explained. 2017. url: https://www.countbayesie.com/blog/2017/5/9/kullback-leibler-divergence-explained (accessed: 12.04.2018).

[24] Learning Rate Schedules and Adaptive Learning Rate Methods for Deep Learning. url: https://towardsdatascience.com/learning-rate-schedules-and-adaptive-learning-rate-methods-for-deep-learning-2c8f433990d1 (accessed: 21.03.2018).

[25] Yann A LeCun et al. “Efficient backprop”. In: Neural Networks: Tricks of the Trade. Springer, 2012, pp. 9–48.

[26] Jonathan Long, Evan Shelhamer, and Trevor Darrell. “Fully convolutional networks for semantic segmentation”. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015, pp. 3431–3440.

[27] Michael Nielsen. Neural Networks and Deep Learning. Determination Press, 2015.

[28] Aaron van den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. “Pixel recurrent neural networks”. In: arXiv preprint arXiv:1601.06759 (2016).

[29] Santiago Pascual, Antonio Bonafonte, and Joan Serra. “SEGAN: Speech enhancement generative adversarial network”. In: arXiv preprint arXiv:1703.09452 (2017).

[30] Alec Radford, Luke Metz, and Soumith Chintala. “Unsupervised representation learning with deep convolutional generative adversarial networks”. In: arXiv preprint arXiv:1511.06434 (2015).

[31] Don van Ravenzwaaij, Pete Cassey, and Scott D Brown. “A simple introduction to Markov Chain Monte Carlo sampling”. In: Psychonomic Bulletin & Review (2016), pp. 1–12.

[32] Adrian Rosebrock. ImageNet: VGGNet, ResNet, Inception, and Xception with Keras. url: https://www.pyimagesearch.com/2017/03/20/imagenet-vggnet-resnet-inception-xception-keras/ (accessed: 07.05.2018).

[33] Tim Salimans et al. “Improved techniques for training GANs”. In: Advances in Neural Information Processing Systems. 2016, pp. 2234–2242.

[34] Sagar Sharma. Activation Functions: Neural Networks. 2017. url: https://towardsdatascience.com/activation-functions-neural-networks-1cbd9f8d91d6 (accessed: 28.03.2018).

[35] Ashish Shrivastava et al. “Learning from simulated and unsupervised images through adversarial training”. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Vol. 3. 4. 2017, p. 6.

[36] Casper Kaae Sønderby et al. “Amortised MAP inference for image super-resolution”. In: arXiv preprint arXiv:1610.04490 (2016).

[37] Christian Szegedy et al. “Rethinking the inception architecture for computer vision”. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016, pp. 2818–2826.

[38] Masatoshi Uehara et al. “Generative adversarial nets from a density ratio estimation perspective”. In: arXiv preprint arXiv:1610.02920 (2016).

[39] Stanford University. Lecture notes in CS231n, Convolutional Neural Networks for Visual Recognition. 2017.

[40] Bing Xu et al. “Empirical evaluation of rectified activations in convolutional network”. In: arXiv preprint arXiv:1505.00853 (2015).