Evaluation of Pretraining Methods for Deep Reinforcement ...1191656/FULLTEXT01.pdf · In recent...

54
Examensarbete 30 hp Februari 2018 Evaluation of Pretraining Methods for Deep Reinforcement Learning Emil Larsson

Transcript of Evaluation of Pretraining Methods for Deep Reinforcement ...1191656/FULLTEXT01.pdf · In recent...

Page 1: Evaluation of Pretraining Methods for Deep Reinforcement ...1191656/FULLTEXT01.pdf · In recent years, Machine Learning research has made notable progress using Deep Learning methods.

Examensarbete 30 hpFebruari 2018

Evaluation of Pretraining Methods for Deep Reinforcement Learning

Emil Larsson

Page 2: Evaluation of Pretraining Methods for Deep Reinforcement ...1191656/FULLTEXT01.pdf · In recent years, Machine Learning research has made notable progress using Deep Learning methods.

Teknisk- naturvetenskaplig fakultet UTH-enheten Besöksadress: Ångströmlaboratoriet Lägerhyddsvägen 1 Hus 4, Plan 0 Postadress: Box 536 751 21 Uppsala Telefon: 018 – 471 30 03 Telefax: 018 – 471 30 00 Hemsida: http://www.teknat.uu.se/student

Abstract

Evaluation of Pretraining Methods for DeepReinforcement Learning

Emil Larsson

In recent years, Machine Learning research has made notable progress using Deep Learning methods. Deep Learning leverages deep convolutional neural networks to extract features from data, and has been able to reinstate interest in Reinforcement Learning, a Machine Learning method for modeling behaviour. It is well suited for game type environments and Deep Reinforcement Learning broke headlines teaching a computer to play Atari games only from pixel inputs.The Swedish Defence Agency is conducting research to incorporate Deep Reinforcement Learning for creating virtual actors in military training simulations. While shown great potential, Deep Reinforcement Learning comes with the major downsides of being computationally heavy and needing extensive amounts of datato work well.To increase efficiency, this thesis evaluates methods for pretraining Deep Reinforcement Learning models using both novel and state-of-the-art pretraining methods. Evaluation is done in Atari games, using recorded data from a demonstrator playing the games. Results indicate that pretraining a network to learn features can have benefits when using Deep Reinforcement Learning. It alsoindicate that pretraining on demonstrator behaviour confounds learning. The thesis lastly discusses issues with the methods and data used, and highlights potentialways for improvement.

ISSN: 1654-7616, UPTEC E 18005Examinator: Tomas NybergÄmnesgranskare: Niklas WahlströmHandledare: Linus Luotsinen

Page 3: Evaluation of Pretraining Methods for Deep Reinforcement ...1191656/FULLTEXT01.pdf · In recent years, Machine Learning research has made notable progress using Deep Learning methods.

Populärvetenskaplig SammanfattningDe som följt områden inom artificiell intelligens har på senare år sett en dramatiskutveckling. I samband med större tillgång till datorkraft och data har forskninginom framförallt maskininlärning gjort stora framsteg. Maskininlärning är en grenav artificiell intelligens som innebär att datorer lär sig utföra uppgifter genom data.Kraftfulla generella algoritmer har presenterats och gjort det möjligt för datorer attlösa uppgifter som tidigare varit svåra att utföra, t.ex. bild- och röstigenkänning elleratt lära sig att bemästra diverse spel.

Inom militär träning används idag virtuella träningsmiljöer innehållandes förpro-grammerade aktörer att interagera med. Syftet är realistisk träning i scenarion somkan vara svåra att genomföra i verkligheten eller skulle kräva för mycket resurser.Totalförsvarets Forskningsinstitut (FOI) bedriver forskning med målet att skapa in-telligenta, syntetiska aktörer med hjälp av maskininlärning. För nuvarande skrivsbeteendena för dessa aktörer manuellt, något som är både tidskrävande och lätt kangå fel. En annan stor nackdel är att det kan vara svårt att programmera in mänskligtbeteende, något som krävs för realistisk träning. Tidigare forskning har visat detmöjligt för datorer att lära sig beteenden inom olika spelscenarion, kända exempel ärbrädspelet Go samt klassiska arkadspel som t.ex. Pong. Målet är att använda dennatypen av maskininlärning, så kallad förstärkt inlärning, för att skapa syntetiska ak-törer som inte bara går snabbare att ta fram utan också kan bete sig mer som enmänsklig motpart.

Även om förstärkt inlärning har uppvisat framgång finns det fortfarande mångautmaningar. En av dem är den stora mängden data som behövs för att träna uppett bra beteende. Algoritmen behöver testa miljoner steg inom miljön som aktörenska fungera i, och även då finns ingen garanti att det alltid fungerar. I denna rapportundersöks ifall demonstrationer från spelare i olika spel kan främja algoritmerna attlära sig bra beteenden snabbare. I samband med det presenteras också hur olikatekniska aspekter av maskininlärning kan användas för att snabbare skapa bra aktörersamt utnyttja kunskap som presenteras av en demonstratör.

1

Page 4: Evaluation of Pretraining Methods for Deep Reinforcement ...1191656/FULLTEXT01.pdf · In recent years, Machine Learning research has made notable progress using Deep Learning methods.

Acknowledgments

This thesis was conducted at the Swedish Defence Research Agency (FOI), as partof the project Synthetic Actors. Supervisors for the thesis were Dr Linus Luotsinenand Dr Farzad Kamrani and reviewer was Dr Niklas Wahlström at Uppsala University.

I want to express my gratitude for being given the opportunity to work at FOI,and to Linus and Farzad, thank you for your support and supervision.

To Babak, thank you for always showing up, ProPud in hand, to help me out ofthe holes I dug for myself.

To Niklas, thank you for tirelessly reading and giving advices. You guided me outsidemy comfort zone and elevated this thesis to something I now proudly display.

To my family and friends, thank you for all your kind words and support. Andto Erika, thank you for your endless encouragement and helping me to keep focus onwhat is important.

Emil Larsson, 2018

2

Page 5: Evaluation of Pretraining Methods for Deep Reinforcement ...1191656/FULLTEXT01.pdf · In recent years, Machine Learning research has made notable progress using Deep Learning methods.

Contents

List of Acronyms 5

1 Introduction 61.1 A Brief History of Artificial Intelligence . . . . . . . . . . . . . . . . . 61.2 Military Training using Simulations . . . . . . . . . . . . . . . . . . . . 71.3 Purpose . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71.4 Delimitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

2 Theory 92.1 Artificial Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . 9

2.1.1 Activation Functions . . . . . . . . . . . . . . . . . . . . . . . . 92.1.2 Feed Forward Neural Networks . . . . . . . . . . . . . . . . . . 102.1.3 Backpropagation . . . . . . . . . . . . . . . . . . . . . . . . . . 11

2.2 Convolutional Neural Network . . . . . . . . . . . . . . . . . . . . . . 122.2.1 Convolutional Layer . . . . . . . . . . . . . . . . . . . . . . . . 122.2.2 Pooling Layer . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132.2.3 Fully Connected Layer . . . . . . . . . . . . . . . . . . . . . . . 13

2.3 Supervised Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132.3.1 Softmax Cross-Entropy . . . . . . . . . . . . . . . . . . . . . . 142.3.2 Bias–Variance Trade-Off . . . . . . . . . . . . . . . . . . . . . . 152.3.3 Regularisation . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

2.4 Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . . . . . 152.4.1 Policy and Value Iteration . . . . . . . . . . . . . . . . . . . . . 172.4.2 Actor-Critic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172.4.3 Q-Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

2.5 Deep Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182.5.1 Deep Supervised Learning . . . . . . . . . . . . . . . . . . . . . 182.5.2 Deep Reinforcement Learning . . . . . . . . . . . . . . . . . . . 19

2.6 Deep Q-Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192.6.1 Double Q-Learning . . . . . . . . . . . . . . . . . . . . . . . . . 202.6.2 Dueling Deep Q-Learning . . . . . . . . . . . . . . . . . . . . . 21

2.7 Advantage Actor-Critic . . . . . . . . . . . . . . . . . . . . . . . . . . 22

3 Pretraining Methods for Deep Reinforcement Learning 243.1 Deep Q-Learning from Demonstrations . . . . . . . . . . . . . . . . . . 243.2 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

3.2.1 Method 1: Preloading Experience Replay in DDQN . . . . . . 263.2.2 Method 2: Pretraining Network for DDQN . . . . . . . . . . . 263.2.3 Method 3: Pretraining Network for A2C . . . . . . . . . . . . . 273.2.4 Method 4: Deep Q-Learning from Demonstrations . . . . . . . 27

4 Evaluation 294.1 Implementations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

4.1.1 Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 294.1.2 Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 294.1.3 Games . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 294.1.4 Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . . 304.1.5 Hyperparameters . . . . . . . . . . . . . . . . . . . . . . . . . . 304.1.6 Demonstration Data . . . . . . . . . . . . . . . . . . . . . . . . 31

4.2 Methods Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 324.3 Method 1: Preloading Experience Replay in DDQN . . . . . . . . . . 32

3

Page 6: Evaluation of Pretraining Methods for Deep Reinforcement ...1191656/FULLTEXT01.pdf · In recent years, Machine Learning research has made notable progress using Deep Learning methods.

4.3.1 Breakout . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 324.3.2 Pong . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 324.3.3 Discussion of Method 1 . . . . . . . . . . . . . . . . . . . . . . 35

4.4 Method 2: Pretraining the Network for DDQN . . . . . . . . . . . . . 354.4.1 Breakout . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 354.4.2 Pong . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 364.4.3 Discussion of Method 2 . . . . . . . . . . . . . . . . . . . . . . 37

4.5 Method 3: Pretraining Network for A2C . . . . . . . . . . . . . . . . . 384.5.1 Breakout . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 384.5.2 Pong . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 394.5.3 Discussion of Method 3 . . . . . . . . . . . . . . . . . . . . . . 40

4.6 Method 4: Deep Q-learning from Demonstrations . . . . . . . . . . . . 404.6.1 Breakout . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 404.6.2 Pong . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 424.6.3 Discussion of Method 4 . . . . . . . . . . . . . . . . . . . . . . 44

4.7 Demonstrator Data Evaluation . . . . . . . . . . . . . . . . . . . . . . 454.8 Deep Q-Learning Deficiencies . . . . . . . . . . . . . . . . . . . . . . . 46

5 Conclusions 485.1 Pretraining for Deep Reinforcement Learning . . . . . . . . . . . . . . 48

5.1.1 Pretraining Network Layers through Supervised Learning . . . 485.2 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48

References 50

4

Page 7: Evaluation of Pretraining Methods for Deep Reinforcement ...1191656/FULLTEXT01.pdf · In recent years, Machine Learning research has made notable progress using Deep Learning methods.

List of AcronymsA2C Synchronous Advantage Actor-Critic. 22, 27, 29, 38–40, 48

A3C Asynchronous Advantage Actor-Critic. 22, 30

AI Artificial Intelligence. 6

ANN Artificial Neural Network. 9, 12, 14, 18

CNN Convolutional Neural Network. 9, 12–14, 18, 19, 22, 24, 26

DDQN Dueling Deep Q-Network. 21, 22, 24–26, 28, 29, 31–37, 40–48

DL Deep Learning. 6, 7, 9, 18, 19, 24

DQfD Deep Q-learning from Demonstrations. 24, 25, 27, 29–31, 40–48

DQN Deep Q-Network. 19–22, 25, 29, 30

FFNN Feed-Forward Neural Network. 10–12

MDP Markov Decision Process. 15, 16, 18

ML Machine Learning. 6, 7, 9–11, 14, 15, 18, 19, 26, 29

RL Reinforcement Learning. 6–9, 15–17, 19, 22, 24, 26, 27, 29, 31, 32, 40, 45, 47, 48

SL Supervised Learning. 6, 9, 13–15, 24, 27, 35, 37, 38, 40, 48

5

Page 8: Evaluation of Pretraining Methods for Deep Reinforcement ...1191656/FULLTEXT01.pdf · In recent years, Machine Learning research has made notable progress using Deep Learning methods.

1 IntroductionThis chapter gives an introduction to the background for this thesis. It contextualisesthe thesis with a general background to machine learning (ML) and artificial intelli-gence (AI), outlines the purpose seen from the point of view of the Swedish DefenceResearch Agency and lastly presents the key questions this thesis will cover.

1.1 A Brief History of Artificial Intelligence

AI has been a human fascination for a long time. It has been a topic not onlyexamined in science, but is widely discussed and portrayed in both philosophy andfictional works of art.

In recent decades, AI has become almost synonymous with computer algorithmsand it has had some prominent success stories. In 1997, the supercomputer Deep

Blue [1] beat the world champion Garry Kasparov in chess [2], a game hailed asthe benchmark of intelligence throughout history. In 2010, IBM’s Watson won thepopular game show Jeopardy! [3] and in 2016, Google DeepMind’s AlphaGo beat LeeSedol in the game Go [4], a 19x19 tile board game which involves long term tacticsand territory takeover. AI has also found some ground outside of games in a datamanagement tasks such as logistics and finance. If these machines are "intelligent"has been, and still is, debated but none the less, AI has gotten a strong foothold incomputer science and society.

Until recent years, AI could more or less be seen as complex search trees. Machinessuch as Deep Blue performed extensive searches for "if this, do that" kind of problems.This meant that with improved search algorithms and hardware, AI improved too.This way of finding intelligent behaviour is bound to stagnate when problems becomeexponentially more complex, search trees are simply way too inefficient. The progres-sion of problem solving for AI systems also seemed backwards. Chess, a challenge fora human, was conquered by brute force using AI; image and speech recognition, easiertasks for a human, did not work at all. This is where modern ML enters the picture.While ML itself is not new, modern algorithms have revolutionised many fields andproblems. Its applications range from facial recognition in your phone [5] to helpingmedical doctors detect cancer [6].

What makes ML different is its absence of explicit programming. ML algorithmsare general, the machine "learns" by processing data and approximating functions topredict future outcomes. Instead of needing to explicitly implement every solutionby hand, the machine is left to find patterns and classify data for itself, guided bysome predetermined algorithm. The main historical drawback for ML has been itsrequirement for computational power and lack of training data. Nowadays, whencomputational power has skyrocketed and there is an excess of data available, MLresearch has made fast progress alongside the development of better algorithm. Thetwo main approaches to ML in this thesis are supervised learning (SL) and reinforce-ment learning (RL). In very brief terms, SL learns by getting feedback from knowndata while RL learns by exploring and trial-and-error. Both concepts are describedin detail in Chapter 2. The implementations of SL and RL in this thesis makes use ofa moden technique called deep learning (DL) (Section 2.5). DL makes use of vastlymore complex models to learn tasks that comes natural to humans - image recogni-tion, speech recognition - or as mentioned above, way more complex tasks such asthe mechanics of Go in DeepMind’s AlphaGo. AlphaGo is an example of what thisthesis will explore, where instead of detailing a massive search tree as in Deep Blue,AlphaGo first learned by applying SL, learning from old games played by the mastersof Go. After this, it started playing old iterations of itself using RL, surpassing humanlevel to eventually beat the world champion in 2016.

6

Page 9: Evaluation of Pretraining Methods for Deep Reinforcement ...1191656/FULLTEXT01.pdf · In recent years, Machine Learning research has made notable progress using Deep Learning methods.

This is a new, and very effective, way of "programming" machines. ML, andspecifically DL, currently sits in the frontier of AI research, solving many complextasks that seemingly were out of scope for computer algorithms not long ago.

1.2 Military Training using Simulations

Setting up military procedures for tactical training are often complex and resourceheavy tasks which involves actors in specific environments and can be hard to manage.In order to increase efficiency and save time and resources, the Swedish Armed Forcesare increasingly using virtual training simulations. One simulation engine used usVirtual Battlespace 3. The engine offers realistic environments and contains a largevariety of different military units, called computer generated forces (CGF). CFG canrange from single entities to large groups and can be viewed and managed in both2D and 3D perspective, giving the means to tailor the simulations in complex anddetailed ways. Currently, units can be controlled manually through input but they canalso act autonomously, making decisions based on implemented behaviour models. Inorder to achieve rewarding training using the simulator, the autonomous actors mustbe realistic and act human-like. The behaviour models used today are set up byacquiring information from field experts and by programming the agents manuallyto cover all these scenarios and act accordingly. This process of designing behaviourmodels are time-consuming and prone to flaws, so to further increase efficiency, theSwedish Defence Research Agency (FOI) investigates potential ways of incorporatingML in the development of these models. In recent years, research evaluating deepreinforcement learning (see Section 2.5.2) is being conducted and this thesis will lookinto ways of integrating supervised learning (see Section 2.3) to increase efficiencywhen using deep reinforcement learning algorithms. This evaluation is interesting formainly two reasons:

1. To be able to make use of data available to FOI while simultaneously takingadvantage of unsupervised learning algorithms using Deep Learning.

2. To speed up the process of using Deep Learning models. Creating models usingdeep Reinforcement Learning alone is very slow and pretraining models couldpotentially speed up the process.

1.3 Purpose

The objective of this thesis is to evaluate methods of using existing demonstrativedata to leverage models for deep reinforcement learning. The evaluation will focuson simpler problems and be compared to implementations using only reinforcementlearning. The aim is to get a clear picture of potential advantages and/or disadvan-tages with a future implementation for creating behaviour models. Specifically, thisthesis will try to answer some key questions:

1. Does demonstration data help with completing a proposed problem or play/beata game?

2. How fast is the training? Does the behaviour from the supervised learningincrease time efficiency of the reinforcement learning process?

1.4 Delimitations

The scope of this thesis is not to deliver a working prototype to be used as a militaryasset in simulations. Instead, it aims to evaluate potential ways to merge supervisedand unsupervised ML algorithms and discuss their merits on a conceptual level. Forthis reason the evaluation will not be done in the high end simulator VBS3 but in the

7

Page 10: Evaluation of Pretraining Methods for Deep Reinforcement ...1191656/FULLTEXT01.pdf · In recent years, Machine Learning research has made notable progress using Deep Learning methods.

simpler Atari Learning Environment (ALE) [7]. These environments are significantlyless complex which implies faster convergence and have worked as benchmarks forreinforcement learning over the last years. These are good upsides; however, they dobring with them low realism and applicability for real military situations. The resultsfrom this thesis will therefore only work as a foundation from which further researchand implementation can be extended, not as the actual solution.

8

Page 11: Evaluation of Pretraining Methods for Deep Reinforcement ...1191656/FULLTEXT01.pdf · In recent years, Machine Learning research has made notable progress using Deep Learning methods.

2 TheoryOutlined in this Chapter is the ML theory and relevant concepts used in this thesis.It will start off by explaining artificial neural networks and its progeny convolutionalneural network (CNN). It then explains SL and classification before moving into RL.Lastly, the term DL is explained followed by deep RL-algorithms that are implementedin the experiments conducted in this thesis.

2.1 Artificial Neural Networks

An artificial neural network (ANN) is a network architecture inspired by biologicalneural networks in animal and human brains [8]. Similar to the interconnected neu-rons in the brain, ANN are built up by interconnected nodes. The strength of theconnection between two nodes is referred to as the connections weight. As a neuron inthe brain, which will fire (outputting an electrical impulse) if it reaches a certain inputthreshold, a node activates and outputs a signal according to its activation function.Figure 2.1 visually shows the structure of a node. All signals xk from connected nodesk are multiplied with the respective connection weight wk and summed together. Thesummed signal u is then run through the activation function f(u) which defines theoutput y, see (2.1).

y = f(u) , u =nX

k

xkwk (2.1)

x1w1

xnwn

... Σ f(u) yu

OutputsInputs Activationfunction

Summation

Figure 2.1: A node containing n inputs.

A network containing just several of these interconnected nodes will lead to avastly complex system and it is this complexity that makes it possible to tune a neuralnetwork to approximate a function. This is achieved by updating its parameters - theweights between nodes, and is often referred to as training a network.

2.1.1 Activation Functions

Activation functions are mathematical functions that defines the output of a nodewhen receiving inputs. There are numerous versions, three examples below are binarystep (2.2), Sigmoid (2.3) and Rectified Linear Unit (ReLU) (2.4).

9

Page 12: Evaluation of Pretraining Methods for Deep Reinforcement ...1191656/FULLTEXT01.pdf · In recent years, Machine Learning research has made notable progress using Deep Learning methods.

f(u) =

(0 u < 0

1 u � 0(2.2) f(u) =

1

1 + e�u(2.3) f(u) =

(0 u < 0

u u � 0(2.4)

The reason for showing these three equations is their general occurrence withintechnology. The binary step is the standard function for electric devices, an earlysort of network containing physical logic gates instead of nodes. Sigmoid offers asmoother function between zero and one, and has been largely used up until recentyears in many different neural settings. Lastly there is ReLU , an activation functionused in most newer ML [9].

2.1.2 Feed Forward Neural Networks

While there are a plethora of different network architectures, this thesis focuses ona branch called feed-forward neural network (FFNN). A FFNN contains nodes thatare aligned in so called layers, see Figure 2.2. Seen in this figure is an input layer,a hidden layer and an output layer. Each layer, except the input, is fully connected

to its subsequent layer. This means that all outputs from one layer are inputs forevery node in the next layer. The term Feed-Forward alludes to the nature of theconnections, where a signal from the input layer is only traversing "forward" towardsthe output layer.

Inputs Outputs

Inputlayer

Hiddenlayer

Output layer

x1

x2

xn

y1

ym

.........

Figure 2.2: A feed-forward neural network containing an input layer, one hidden layerand an output layer.

The network generally contains one input layer and one output layer but theamount of hidden layers as well as nodes in all layers can be arbitrarily chosen. Thereare many different ways for choosing a good architecture for specific tasks; howeverFFNNs with just one hidden layer containing a finite number of nodes have beenshown to work as universal approximators [10]. This means that given any continuousfunction F (x) and ✏ > 0, the network can approximate the function f(x) so that

10

Page 13: Evaluation of Pretraining Methods for Deep Reinforcement ...1191656/FULLTEXT01.pdf · In recent years, Machine Learning research has made notable progress using Deep Learning methods.

8x, |F (x)� f(x)| < ✏. (2.5)

Missing in Figure 2.2, are what is called the biases. The term bias refers to oneextra node, identical to a normal node, that is added in each layer and always receivesmax input. The biases serves the purpose of ensuring an offset in the approximatedfunction of the network.

2.1.3 Backpropagation

The way neural networks are trained in ML practice is by updating the weightsbetween nodes. Changing the weights will change the dynamics of the whole networkand the aim is to find the correct weights corresponding to the wanted dynamic.The most common way of training a FFNN is by using backpropagation, or backprop.Backpropagation refers to propagating the output error back through all the layersin the network and updating all weights according to their contribution to the error,using an optimising method. One such optimising method is gradient descent, whichuses the gradient of a loss function to update the weights [11]. Using the notations inFigure 2.1, let y be the wanted output vector, referred to as the targets, and y be theprediction output vector from the network. The optimisation method then minimisesthe loss function

L(w) =1

n

nX

i

E(yi, yi) (2.6)

which is the mean of the squared errors between the targets and network outputsgiven by

E(yi, yi) =1

2(yi � yi)

2. (2.7)

The error gradient with respect to a weight can then be calculated using the chainrule as follows

@E

@wi=

@E

@y

@y

@u

@u

@wi(2.8)

where

@E

@y= �(yt � y),

@y

@u= f

0(u),@u

@wi= xi (2.9)

which gives

@E

@wi= �(yt � y)f 0(u)xi. (2.10)

Using stochastic gradient descent, the weights are then updated according to

wi = wi + ⌘@E

@wi= wi � ⌘(yt � y)f 0(u)xi (2.11)

where ⌘ is the step size parameter of the algorithm, more commonly referred to as thelearning rate in the field of ML. Parameters such as the learning rate are tuned beforelearning begins, and are referred to as tuning paremeters. (2.11) comes with the prob-lem of tuning the learning rate and that it can get stuck in local minimum. Extensionsthat deals with these problems when running the algorithm includes RMSprop [12]and Adam [13]. Adam is an extension of RMSprop which keeps an adaptive learningrate per parameter using previous updates to find better and faster convergence ofthe network.

11

Page 14: Evaluation of Pretraining Methods for Deep Reinforcement ...1191656/FULLTEXT01.pdf · In recent years, Machine Learning research has made notable progress using Deep Learning methods.

2.2 Convolutional Neural Network

Convolutional neural network (CNN) are a special type of a FFNN which makes useof spatial information in what is called a convolutional layer. Inspired by the visualcortex of animals, it replicates the overlapping nature of neurons in the visual field[14] and has been highly successful in fields such as image classification. Below, themain components of a CNN are outlined.

2.2.1 Convolutional Layer

Convolutional layers are similar to normal layers within an ANN in that they consistof weights, nodes and biases. The convolutional layers are however distributed inthree dimensions, height, width and depth. The depth of a layer is determined bythe number of feature maps it contains. One feature map is a set of nodes in twodimensions, width and height, which is connected to feature maps in the next layer.It is not fully connected, instead, a set of spatially confined nodes in one feature mapcorresponds to a set of spatially confined nodes in the next. The relationship betweeninput and output is defined by a filter, also known as a kernel. A filter is a matrix ofnumbers which is element-wise multiplied with the input and summed together. Thenumbers in the filter is what corresponds to the networks weights, and therefore thenetworks parameters. As shown in Figure 2.3, the filter is convolved over a featuremap, creating a two dimensional output corresponding to the connected feature map.The output is defined by the filters size and stride (step size of the convolution). Anexample of a filter with different strides can be seen in Figure 2.3.

Input Output Input Output Input Output

(a) Stride 1, resulting in an output feature map with dimensions 3x3.

Input Output Input Output Input Output

(b) Stride 2, resulting in an output feature map with dimensions 2x2.

Figure 2.3: First three convolution steps over a 5x5 feature map using a 3x3 filterwith different strides.

As explained earlier, one layer does not necessarily contain only one map of fea-tures. Figure 2.4 shows two layers containing four and eight feature maps respectively.The relationship of spatiality between each feature map is the same (that is, the filtersize and stride are identical), however the weights of the filters between each set ofmaps are different. This means that the weights between two layers in a CNN can beseen as a four dimensional array containing number of input feature maps, width andheight of the filter as well as number of output feature maps.

12

Page 15: Evaluation of Pretraining Methods for Deep Reinforcement ...1191656/FULLTEXT01.pdf · In recent years, Machine Learning research has made notable progress using Deep Learning methods.

Layer 1 Layer 2

Figure 2.4: A layer with dimensions n x n x 4 convolving onto a layer with dimensionsm x m x 8.

2 1 4 87 2 1 52 3 4 04 3 3 1

3 53 2

(a) Mean pooling

2 1 4 87 2 1 52 3 4 04 3 3 1

7 84 4

(b) Max pooling

Figure 2.5: Mean and max pooling down-sampling from 4x4 to 2x2.

2.2.2 Pooling Layer

Pooling layers are used to down-sample the spatial size of an output. This decreasesthe amount of parameters in the network, consequently lightening the computationalpower needed to train it. Down-sampling data can also prevent over-fitting, see Sec-tion 2.3.2, by reducing the spatiality of features [15]. Pooling layers uses convolutionsto downscale layer sizes and can represent many various operations, Figure 2.5 showstwo different pooling operations: mean and max pooling.

2.2.3 Fully Connected Layer

As mentioned in Section 2.1.2, a fully connected layer refers to a layer of nodes whereall inputs are the sum of every output from the previous layer. The fully connectedlayers in a CNN removes the spatial information in the network, and as a consequenceare often the last component in the pipeline, see Figure 2.6. The reason for addingthem are to classify the relations between representations learned and outputted fromthe convolutional layers.

2.3 Supervised Learning

Supervised learning (SL) means, as the name suggests, that the learning model isoverseen and guided by a supervisor. The supervisor refers to a feedback function,evaluating the model performance while learning, overseeing that the model approx-imates a wanted function. The learning is done by giving the model a set of inputdata, letting it predict outputs and finally having the supervisor evaluate said out-puts. Commonly this is referred to as having labeled training data and the modelis evaluated by having the supervisor compare the models predictions and the cor-

13

Page 16: Evaluation of Pretraining Methods for Deep Reinforcement ...1191656/FULLTEXT01.pdf · In recent years, Machine Learning research has made notable progress using Deep Learning methods.

Inputs Convolutional layers FC layers Outputs

Figure 2.6: An example of a CNN architecture.

rectly labeled data. Feedback from the supervisor is then used to update the modelaccording to some predetermined optimisation algorithm. There are different ways ofdoing this except for ANNs, support vector machines [16] and decision trees are twosuccessful methods, the latter used in IBM’s Deep Blue [1] mentioned in Chapter 1.However, this thesis only deals with neural networks and when talking about a model

it will, if not stated otherwise, refer to a function approximated by a CNN. Using SLto solve a problem generally contains the following steps:

1. Gather data with known outputs which corresponds to the problem that shouldbe solved. This data set should contain selected input features and outputrepresentations. Separate the data set into a training set and validation set.

2. Design a model containing a chosen method with corresponding learning algo-rithm that can approximate a function between the input features and outputrepresentation from step 1.

3. Train the model using the training data set until some predetermined objectiveis fulfilled.

4. Evaluate the model on the validation data set in order to test the accuracy androbustness.

When designing a model for SL (and ML in general) setting tuning paremetersare of great importance. As briefly mentioned in Section 2.1.3, tuning paremeters areoften set before learning begins, they work "outside" of the model. This is important,because even though a neural network in theory works as a universal approximator, inpractice this might mean an infinite number of learning steps. To increase efficiency,tuning paremeters narrows down the search space of the method in order to convergefaster.

2.3.1 Softmax Cross-Entropy

This thesis makes use of a specific SL technique called softmax cross-entropy. Softmaxrefers to the softmax activation function, probability distribution function defined as

f(ui) =eui

Pmj=1 e

uj(2.12)

where u is the vector of all input signals. The softmax function normalises outputs toform a distribution that sums up to 1, which can then be used to calculate the cross-

entropy loss. Cross-entropy loss shows the error between two distributions, definedas

14

Page 17: Evaluation of Pretraining Methods for Deep Reinforcement ...1191656/FULLTEXT01.pdf · In recent years, Machine Learning research has made notable progress using Deep Learning methods.

L(w) = � 1

n

nX

i=1

f(ui) log yi (2.13)

where y is the wanted output vector. (2.13) can then be used to train the networkby updating the weights with backpropagation.

2.3.2 Bias–Variance Trade-Off

A key component of SL is to test the model on data that is separated from the trainingdata (referred to as validation data in the previous section). If the model has a highaccuracy on training data but does not perform well on validation data, the model isoverfitted. Overfitting is a statistical problem in which the model is taking noise intoconsideration when approximating a function. It means the model is overly complexand will result in bad predictions on validation data. The opposite of overfitting isunderfitting, meaning the model does not pick up enough details from the trainingdata. The model might miss important features in the data which also leads to badpredictions.

The trade-off between these two errors is called the bias-variance trade-off, wherebias corresponds to underfitting and variance to overfitting. The importance of thisdilemma comes from the objective of supervised learning; the model should learnto perform well on training data while simultaneously be general enough to performbeyond said training data. This is a key part of the hybrid approach in this thesis- pretraining a model to make use of already acquired knowledge when learning in anew environment in order to increase efficiency.

2.3.3 Regularisation

In ML, regularisation is a way to handle the bias-variance trade-off. Regularisationentail many concepts, but in one way or another penalises the model to prevent it fromoverfitting. A common regularisation term used in this thesis is the L2 regularisation

of weights in a network, which penalises large weights and bounds the model better[17]. L2 regularisation means adding a weight dependent term to the loss function,see (2.14).

LL2(w) = �1

n

nX

i=0

w2i (2.14)

The strength of the regularisation changes with �, a tuning parameter compromis-ing the model between precise estimation of the loss function or bounding all weightsto smaller values.

2.4 Reinforcement Learning

Reinforcement learning (RL) is an area of unsupervised learning in which a learnerexplores an environment by trying actions and observing the outcome. The termreinforcement originates from behavioural psychology; it refers to a consequence of alearners action in a certain state which increases or decreases the learners future po-tential of selecting that action, presented with a similar state [18]. In contrast to SL,there is no labeled training data, the learning is done by exploring the presented en-vironment. The main goal is therefore to find a suitable behaviour which correspondsto the best outcomes in a specific environment by trial-and-error.

The two main terms in RL are the agent and the environment. The agent refersto the learner and decision-maker while the environment refers to what the agentis interacting with [19]. The environment is formally described as a finite markov

15

Page 18: Evaluation of Pretraining Methods for Deep Reinforcement ...1191656/FULLTEXT01.pdf · In recent years, Machine Learning research has made notable progress using Deep Learning methods.

decision process (MDP) (MDP) [20] containing a set of discrete states S, a set ofdiscrete actions A, a reward function R(st, st+1) and transition function betweenstates T (st, at) = st+1. At a time step t, the agent is in a state st 2 S. The agentthen performs an action at 2 A, transitions into state st+1 according to T (st, at, st+1)and receives a reward rt. This interacting process between agent and environment isvisualised in Figure 2.7.

Agent

Environment

Reward (rt)New State (st+1) Action (at)

Figure 2.7: Agent-environment interface in reinforcement learning.

The idea behind RL comes down to deriving a mapping of actions from statesthat will lead to a high reward. This mapping is called the agents policy, defined as⇡(st) = at, and the objective is to find the optimal policy which will maximize thecumulative reward.

In this thesis, following established RL methodology, the cumulative reward willbe seen as a sum of all rewards starting from a time step t and ending at T . Thisdoes pose a problem if T ! 1 since it could lead to an infinite reward. To boundthe reward, a discount rate � is presented and the reward function can be written as

Gt =1X

k=0

�krt+k+1, 0 � < 1. (2.15)

From this equation and the fact that the environment can be seen as an MDP,two value functions for policies can be derived, the state-value function, V⇡(s) andthe action-value function Q⇡(s, a) [19]. V⇡(s) is defined as

V⇡(st) = E⇡[Gt|st] (2.16)

and gives the expected reward from being in state s and following policy ⇡. Theaction-value function Q⇡(s, a) is similar and defined as

Q⇡(st, at) = E⇡[Gt|st, at] (2.17)

which gives the expected reward from being in state s and taking action a, after whichpolicy ⇡ is followed. This means that finding a good solution in RL, presuming a finiteMDP, is the same as finding the optimal state-value and action-value functions

V⇤(s) = max⇡

V⇡(s), (2.18)

Q⇤(s, a) = max⇡

Q⇡(s, a) (2.19)

for all s 2 S, a 2 A. The two main techniques used in this thesis are Actor-Critic andQ-learning. They build on two different concepts of solving RL problems, on-policy

and off-policy, which refers to the way an algorithm learns the policy with regards toexploring. Actor-critic is on-policy, which means following and updating the policyaccording to the action that was selected while Q-learning is off-policy, meaning itupdates unrelated to the action picked [21]. More precisely, it updates a posterioriwith regards to which action should have returned the highest cumulative reward.

16

Page 19: Evaluation of Pretraining Methods for Deep Reinforcement ...1191656/FULLTEXT01.pdf · In recent years, Machine Learning research has made notable progress using Deep Learning methods.

2.4.1 Policy and Value Iteration

As previously mentioned, finding a solution to a RL problem means finding an optimalpolicy ⇡. There are two fundamental ways from which the techniques in this thesisare extended, policy iteration and value iteration.

Policy iteration starts with a random policy, evaluates that policy using its state-value function (2.16) [22] and updating it according to

⇡(s) = argmaxa

Q⇡(s, a). (2.20)

Policy iteration is bound to always improve unless the optimal policy is found.This is clear from looking at (2.20); if the policy is not optimal, Q⇡ will return ahigher value for a certain state-action pair and the policy will improve.

Value iteration means finding a good policy by estimating one of the optimal valuefunctions [22]. The iterative update equation for finding the optimal value functionis defined as

V (st) maxa

E⇡[rt + �V (st)|st, a], 0 � < 1. (2.21)

This method iteratively approximates the linear equations for an unknown system,using the Bellman Equation update rule [23], an equation used in many RL methods.

In this thesis, this method of iteration is used together with exploration of theenvironment, seeing the reward function is not known beforehand. One issue in thisthesis arises from this fact; the exploration vs exploitation issue, where an agentneeds to choose between trying a new action or following policy [24]. Exploringthe environment will add data points for approximation, while exploitation mightstreamline performance in computational power as well as reward.

2.4.2 Actor-Critic

Actor-critic is an on-policy method where the policy is stored independently from thevalue function. The method contains an actor, which represents the policy, and acritic, which evaluates the actions picked by the actor.

The actor-critic interface can be seen in Figure 2.8. The critic in this thesisrepresents the state-value function V (s), which evaluates the new state st+1 afteraction at is picked by the actor. The evaluation is the temporal difference [19] error

�t = rt + �V (st+1)� V (st) (2.22)

which is positive if action at lead to an improvement and negative if it did not. Theerror then strengthen or weaken the tendencies of the actors policy according to someupdate rule. This rule can be to simply increase probabilities of actions being pickedor used in loss functions for updating neural networks [25], explained more in Section2.7. The actor-critic method is always on-policy as the critic evaluates the direct stategiven by the actors action.

17

Page 20: Evaluation of Pretraining Methods for Deep Reinforcement ...1191656/FULLTEXT01.pdf · In recent years, Machine Learning research has made notable progress using Deep Learning methods.

Actor

Critic

Environment

Error (�t)

Reward (rt)

New State (st+1) Action (at)

Figure 2.8: Interface of the Actor-Critic model.

2.4.3 Q-Learning

Q-learning is an off-policy method where the action-value function is optimized with-out consideration of the policy followed [26]. The Q-learning algorithm updates usingthe temporal difference error, but it does not take into account the actual policy fol-lowed, instead it updates as if the policy took the best possible action. The updaterule for a one step Q-learning can be written as

Q(st, at) Q(st, at) + ↵[rt + �maxa

Q(st+1, a)�Q(st, at)], (2.23)

where ↵ denotes the learning rate. From (2.23), the issue of exploration vs ex-ploitation can be traced. Q-learning only updates as if following the policy with thehighest reward. Unlike actor-critic, which updates using the actual action picked,there is no intrinsic way of exploration in Q-learning, the same states leads to thesame updates (provided a finite MDP). The common way around this is to simplyadd a parameter ✏ 2 [0, 1], known as the exploration rate. This parameters determinehow often the agent will pick a random action instead of following its policy, leadingto a higher exploration of the environment [24], known as ✏-greedy.

2.5 Deep Learning

DL (DL) is a a subfield of ML which in later years has lead to state-of-the-art resultsin varying fields, such as computer vision, language processing and speech recognition[9]. The term DL was coined in 1986 [27] and DL is therefore itself not a new theory;however the lack of computational power and stable algorithms has mostly stoppedit from being a viable method in science. Starting in the mid 2000s it slowly gainedsome ground and with new technologies and algorithms, such as parallel computationon GPUs (Graphical Processing Units) [28] or the DQN algorithm (Section 2.6), DLhas become a monumental force in ML.

DL means solving problems by learning broader representations instead of specificfeatures. Using deep neural network architectures, meaning networks containing manyhidden layers, have been on the frontier of DLs success. These deep networks are ableto extract and learn high-level representations of the input features leading to thenetwork being able to learn very complex functions [9].

2.5.1 Deep Supervised Learning

Deep supervised learning is simply problems where a deep neural network is usedto solve a problem where training data is available, see Section 2.3. This can bedone using ANNs, CNNs or other neural networks, one example being RecurrentNeural Networks (RNN), networks where nodes propagate forward and to themselves,creating a sort of "memory" [29]. RNNs are outside the scope of this thesis, but

18

Page 21: Evaluation of Pretraining Methods for Deep Reinforcement ...1191656/FULLTEXT01.pdf · In recent years, Machine Learning research has made notable progress using Deep Learning methods.

are used for example in language processing. On the other hand, deep supervisedlearning using CNNs has lead to leaps forward in accuracy on problems such as imageclassification; in 2012 the first deep CNN architecture won an image classificationcompetition from ImageNet [30] with a 10% margin over the second best entry [31].

As previously mentioned, deep networks can extract representations of the inputs.This leads to an abstraction of the input which makes DL able to handle generalproblems that have previously been hard to do with ML, an example would be toclassify features of a face on many different people - something every photo applicationin your phone can do fairly well nowadays.

2.5.2 Deep Reinforcement Learning

In theory, both policy and value iteration will converge under the assumption thatevery state can be reached enough times [19]. In practice, things gets complicated, andone of the most important practical questions for getting RL to work is how to storethe approximated function. If a state space is small, storing a function approximationis easy. A classic example would be a small grid where an agent can receive positiveor negative reward depending on which square it ends up in; storing values for eachsquare of the grid in a table will not be a problem. But what if the amount of statesare near endless? Hardware limitations would lead to that we run out of memory.Deep Reinforcement Learning uses a deep neural network to approximate the functionbetween state and value. Making use of the learned features in the network, it usesRL algorithms to map these on a policy. While still computationally heavy, it makesit possible to deal with huge state spaces as well as approximate novel scenarios.Deep RL has resulted in superhuman behaviour in various tasks; in this thesis, twoalgorithms that have shown these results are explained and extended from. Theybuild on Q-learning (Section 2.4.3) and actor-critic (Section 2.4.2), and they bothimplement deep CNNs.

2.6 Deep Q-Learning

Deep Q-Learning refers to approximating the action-value function using a deep neuralnetwork. This thesis will use the term deep Q-learning interchangeably with deep Q-network (DQN) [32]. DQN is an algorithm of solving deep Q-learning presented byDeepMind [33] that had a big impact on RL. It used pixels from the screen directlyas input and learned to play 49 different Atari games without any interference orchanging of parameters. The algorithm managed to outperform human experts insome of the games, and it learned in a relatively short amount of time. It sets thebenchmark of Atari games in deep RL and showcased great potential for deep neuralnetworks in RL.

One large part of what made DQN successful was its way of breaking correlationbetween input data. Solving RL using a non-linear function approximator, such asa neural network, can be highly unstable [34] as the states are used sequentiallyto update the model. Slowly changing states means highly correlated data, whichcan lead to a bootstrap-effect when Q(st, a) ⇡ Q(st+1, a) (see (2.23)), making thefunction diverge. DQN breaks this correlation with two tricks. The first trick is usingtwo networks, referred to as the online network and the target network. They areseparated using two different connotations for their weights, ✓ and ✓

�, and respectivelyapproximated by the Q-functions Q(s, a; ✓) and Q(s, a; ✓�). The target network isthen used to calculate the error E(✓) of the online network as

E(✓) = rt + �maxa

Q(st+1, a; ✓�)�Q(st, at; ✓). (2.24)

For every K update step, the target networks weights are then copied from theonline network, leading to a slower and more stable approximation of the Q-function.

19

Page 22: Evaluation of Pretraining Methods for Deep Reinforcement ...1191656/FULLTEXT01.pdf · In recent years, Machine Learning research has made notable progress using Deep Learning methods.

The other trick is an experience replay, a buffer where every variable at eachtime step (state, action, reward etc.) gets stored. Instead of updating the networkon sequential data as it is presented in time, the algorithm samples random sets ofvariables from the buffer and updates the network using those, breaking eventualcorrelations in time.

Adding these two tricks together, DQN samples variables from the experience re-play and calculates the error (2.24) in order to minimise the loss function (2.25). Thisloss is minimised using stochastic gradient descent (Section 2.1.3) and the pseudocodefor DQN can be seen in Algorithm 1.

LQ(✓) = E(✓)2 (2.25)

Algorithm 1 Deep Q-Learning with Experience Replay1: Initialise Experience Replay D with size N

2: Initialise online network and target network with random weights ✓ = ✓�

3: for episode 1 to M do4: Initialise s1

5: for t 1 to T do6: Select random action at with probability ✏

7: Otherwise select at argmaxa Q(st, a; ✓)8: Run action at and observe rt, st+1

9: Store transition {st, at, rt, st+1} in D10: Sample random batch of transitions {sj , aj , rj , sj+1} from D

11: yj (rj if sj+1 is terminalrj + �maxa Q(sj+1, a; ✓�) otherwise

12: Update ✓ ✓ � ↵@@✓ (yj �Q(sj , aj ; ✓))2

13: Every K step set ✓� ✓

14: end for15: end for

There have been many additions and tweaks to the DQN algorithm following therelease of DeepMinds first paper. Two extensions, Double Q Learning and Dueling Q

Network, are used in this thesis and explained below.

2.6.1 Double Q-Learning

An issue with the Q-learning algorithm is that it often overestimates the action values.This is due to the algorithm selecting for maximising estimated reward when updating,which means overestimations will be used more than underestimations. A solution tothis is double Q-learning [35], which can be implemented as an extension to DQN.The main idea is to extend the return in the DQN algorithm from

yt = rt + �maxa

Q(st+1, a; ✓�) (2.26)

toyt = rt + �Q(st+1, argmax

aQ(st+1, a; ✓); ✓

�). (2.27)

This separates the action evaluation from the action picking, the latter now beingdone in the online network. This way, actions are chosen from current estimates ofthe action-values using the online network. The actual action-values instead comesfrom the slower changing target network which decreases bootstrap effects whichcan lead to escalating estimations of action-values. This stabilises learning and canincrease end performance [35].

20

Page 23: Evaluation of Pretraining Methods for Deep Reinforcement ...1191656/FULLTEXT01.pdf · In recent years, Machine Learning research has made notable progress using Deep Learning methods.

84x84x4 32x21x21 64x11x11 512x2 4

32x8x8, 4 64x4x4, 2

64x11x11

64x3x3, 1

Input Convolutional layers FC layers Outputs

Filters, stride

4 + 1Adv./value

Figure 2.9: The DDQN architecture.

2.6.2 Dueling Deep Q-Learning

As explained in Section 2.4, V⇡(s) gives the value of being in a certain state, whileQ⇡(s, a) gives the value of picking a certain action in said state. However, in manystates there might be little to no difference in picking a certain action, while in somestates the wrong action might lead to catastrophe. To combat this, Dueling Q-learning

approximates the action-value function Q⇡(s, a) using the state-value function V⇡(s)combined with a new term, the advantage function, A⇡(s, a) [36]. A⇡(s, a) gives arelative value of importance between actions in a certain state and is expressed as

A⇡(s, a) = Q⇡(s, a)� V⇡(s). (2.28)

While Q(s, a) approximates the value of an action, the advantage function can bebeneficial as it states if there is of high importance to pick an action in the immediatemoment.

An implementation of dueling Q-learning as an extension of DQN was presentedin [37], called dueling deep Q-network (DDQN), see Figure 2.9. In this architecture,the one fully connected layer in the DQN architecture is replaced with two fully con-nected layers in parallel, parameterised by ↵ and �. The layers respectively approx-imates A(st, at; ✓,↵) and V (st; ✓,�) which are used to approximate Q(st, at; ✓,↵,�).The expression (2.28) is however unidentifiable, meaning there is no way to sepa-rate A(st, at; ✓,↵) from V (st; ✓,�) given Q⇡(st, at; ✓,↵,�). The estimates are there-fore inseparable when training, and the practical implementation of the estimatorA⇡(st, at; ✓,↵) is forced to zero for the chosen action:

Q(st, at; ✓,↵,�) = V (st; ✓,�) +A0(st, at; ✓,↵) (2.29)

whereA

0(st, at; ✓,↵) = A(st, at; ✓,↵)�maxa0t

A(st, a0t; ✓,↵). (2.30)

This means that V (st; ✓,�) = Q⇡(st, at; ✓,↵,�) for the chosen action (as Q-learning picks the highest action value), making it possible to separately estimatethe advantage and value. Important to note is that advantage and value is integratedin the network, see Figure 2.9. Consequently, DDQN can therefore make use of allextensions to Q-learning explained earlier, such as experience replay and a target net-work, as well as the double Q-learning return. The network architecture is used forall Q-learning experiments and can be seen in Figure 2.9.

21

Page 24: Evaluation of Pretraining Methods for Deep Reinforcement ...1191656/FULLTEXT01.pdf · In recent years, Machine Learning research has made notable progress using Deep Learning methods.

2.7 Advantage Actor-Critic

Advantage Actor-Critic is an RL architectural approach where gradient ascent on theexpected reward is used in order to approximate the policy. The policy is representedby a deep CNN, parameterised by ✓, with a softmax function output (Section 2.3.1).The gradient for an unbiased estimate of the reward with regards to the networkparameters can then be written as

dRt

d✓= r✓ log ⇡(at|st; ✓)Rt (2.31)

where Rt is an estimate of the expected reward from action at in state st, shown in(2.17) to be given by the action-value function Q⇡(st, at) [38, 39]. To reduce variancein the estimate, a learned state-function can be subtracted, which in this case wouldbe the state-value function V⇡(st) [38]. This subtraction has been explained in Section2.6.2, and refers to the advantage of taking an action in a certain state, written asA⇡(st, at) = Q⇡(st, at) � V⇡(st). The update gradient for the network is then givenby

dRt

d✓= r✓ log ⇡(at|st; ✓)A(at, st; ✓). (2.32)

This updating rule was used in a paper presented in 2016 [25] which introduced analgorithm called asynchronous advantage actor-critic (A3C). To increase exploration,a regularising entropy term H(⇡(st; ✓)) is added and the full gradient becomes

dRt

d✓= r✓ log ⇡(at|st; ✓)A(at, st; ✓)� �r✓H(⇡(st; ✓)) (2.33)

where β is a tuning parameter which controls the regularisation strength [25].

The paper used multi-threading to have several agents make asynchronous updates to a deep network. Each agent explored its own instance of the environment and updated a copy of the main network. The main network was then periodically updated using the accumulated changes to the local networks. By having several actors working at the same time, the correlation between data is broken in a similar fashion to experience replay in DQN, assuming the agents are likely to explore different parts of the environment [25].

In order to increase stability, it also implements the n-step return, a Monte Carlo update that takes into account the accumulated return up until a fixed or terminal step. This can be implemented in all temporal difference step equations (see (2.20)), and is not exclusive to actor-critic [19]. The n-step return is expressed as

R_t = Σ_{i=0}^{n−1} γ^i r_{t+i} + γ^n V_π(s_{t+n}),    (2.34)

as long as step t is not terminal. The algorithm is computationally lightweight compared to DQN; A3C reached state-of-the-art results running only on the central processing unit (CPU). After the success of A3C, researchers found that making the updates to the main network synchronously did not seem to make any difference if done correctly, and the synchronous algorithm simply got the name advantage actor-critic (A2C) [40]. The network architecture is similar to that of DDQN (Figure 2.9), except there is only one fully connected layer that outputs to both advantage and value. Pseudo-code for the A2C algorithm is outlined in Algorithm 2.


Algorithm 2 Advantage Actor-Critic
1:  Assume global weights θ, θ_v and global timer T = 0
2:  Assume thread-local weights θ′, θ′_v
3:  Initialise t ← 1
4:  repeat
5:    Initialise dθ ← 0, dθ_v ← 0
6:    Synchronise local weights θ′ = θ and θ′_v = θ_v
7:    t_start ← t
8:    Initialise s_t
9:    repeat
10:     Select action a_t ∼ π(a_t | s_t; θ′)
11:     Run action a_t and observe r_t, s_{t+1}
12:     t ← t + 1
13:     T ← T + 1
14:   until s_t is terminal or t − t_start = t_max
15:   R ← 0 if s_t is terminal, V(s_t; θ′_v) otherwise
16:   for i ← t − 1 to t_start do
17:     R ← r_i + γR
18:     dθ ← dθ + ∇_θ′ log π(a_i | s_i; θ′)(R − V(s_i; θ′_v))
19:     dθ_v ← dθ_v + ∂(R − V(s_i; θ′_v))² / ∂θ′_v
20:   end for
21:   Synchronously update θ ← θ + dθ and θ_v ← θ_v + dθ_v
22: until T > T_max
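As a small illustration of the n-step return (2.34) used in lines 15-17 above, the sketch below computes the return for one trajectory segment in plain Python; it is a simplified stand-alone helper rather than the thesis code, and all names are illustrative.

    import numpy as np

    def n_step_return(rewards, bootstrap_value, gamma=0.99, terminal=False):
        # rewards:         [r_t, r_{t+1}, ..., r_{t+n-1}] for one segment
        # bootstrap_value: V(s_{t+n}) from the critic, ignored if the
        #                  segment ended in a terminal state
        n = len(rewards)
        ret = 0.0 if terminal else (gamma ** n) * bootstrap_value
        for i, r in enumerate(rewards):
            ret += (gamma ** i) * r
        return ret

    # Three rewards followed by a bootstrapped value estimate.
    print(n_step_return([1.0, 0.0, 1.0], bootstrap_value=2.0, gamma=0.9))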


3 Pretraining Methods for Deep Reinforcement Learning

The focus of this thesis will be to merge SL and RL using DL techniques. This means making use of data comprising a demonstrator performing a task that could be suitable for RL. The goal would be to leverage the knowledge inherent in the demonstrator, while simultaneously making use of the capacities of deep RL. Separately, these two can be done: imitating the demonstrator would be done using SL to classify demonstrator actions from states; deep RL would be released without any knowledge of the domain to try to find good behaviour. The main issue, though, is to find a bridge between these different types of learning. SL does not make use of any value functions, and RL uses trial-and-error from scratch.

There are two ways of merging that will be tested in this thesis. The first way of pretraining would entail a deep CNN learning features from the demonstrator data using a classic supervised learning approach (classifying actions from input states). The parameters of the trained convolutional layers in the network would then be copied to a new network used for RL. Theoretically, the features learned would be general for the environment, so an RL network starting with pretrained parameters could converge to a good policy faster. As this should be no different from having the RL network learn the features itself using only RL, having the network learn features in a separate step could potentially be a benefit.

The other way of pretraining might be more intuitive: making use of SL to approximate a value function and policy directly from demonstration data. This does come with some issues, one being that the demonstration data only shows a narrow part of the state-space [41]. A good demonstrator would not take random actions, and the vast majority of the state-space goes unexplored. State-space exploration is a general problem in RL: infinite exploration will lead to convergence, but it is not practically feasible. Algorithms that pretrain to approximate a value function must therefore have some mechanism to generalise learning not only to good states, but to bad states as well.

3.1 Deep Q-Learning from Demonstrations

Deep Q-learning From Demonstrations (DQfD) is an algorithm presented by DeepMind which extends their DDQN algorithm to make use of demonstration data [41]. The algorithm does this by adding an L2 regularisation loss L_L2(θ) (Section 2.3.2), an n-step return loss L_n(θ), and a large margin classification loss L_D(θ) [42]. L_n(θ) combines a version of the n-step return (2.34) with the normal Q-loss (2.25). Start by defining

K^n_t = r_{t+1} + γ r_{t+2} + ... + γ^{n−t−1} r_{t+n} + γ^{n−t} max_a Q(s_{n+1}, a; θ)    (3.1)

which can be seen as an n-step version of (2.26). The loss function for the n-step return is then defined as

L_n(θ) = Σ_{i=0}^{n−1} [K^{n−1}_{n−i−1} − Q(s_{n−i}, a_{n−i}; θ)]².    (3.2)

Note that the n-step return and loss will stop if presented with a terminal state, meaning it includes at most n steps.

DQfD can be seen as a mix between SL and RL, where the network uses prerecorded transitions to map the demonstrator's actions to states before exploring the environment. The supervised loss used is L_D(θ), defined as

L_D(θ) = max_a [Q(s, a; θ) + l(a_D, a)] − Q(s, a_D; θ)    (3.3)

where

l(a_D, a) = 0 if a_D = a, and l(a_D, a) = k_e if a_D ≠ a.    (3.4)

Here, a_D refers to the action taken by the demonstrator, and (3.3) increases the loss of non-demonstrator actions by a margin k_e, which is set as a tuning parameter. The demonstrator data is added to the experience replay buffer before starting the algorithm and is never replaced. The full loss for DQfD is defined as

L(θ) = L_Q(θ) + λ_1 L_n(θ) + λ_2 L_D(θ) + λ_3 L_L2(θ).    (3.5)

All λ parameters are tuning parameters, but λ_2 is set to zero when calculating the loss on data acquired while exploring the environment.
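To illustrate how (3.3)-(3.5) fit together, here is a minimal numpy sketch of the large-margin loss and the weighted combination; it is not the DQfD implementation used in this thesis, the names are illustrative, and the default weights simply mirror the values later listed in Table 1.

    import numpy as np

    def margin_loss(q_values, demo_action, margin=0.8):
        # Large-margin classification loss L_D of (3.3)-(3.4).
        # q_values: Q(s, a; theta) for all actions, demo_action: index of a_D.
        q = np.asarray(q_values, dtype=np.float64)
        l = np.full_like(q, margin)     # l(a_D, a) = k_e for a != a_D
        l[demo_action] = 0.0            # l(a_D, a) = 0   for a  = a_D
        return np.max(q + l) - q[demo_action]

    def dqfd_loss(q_loss, n_step_loss, sup_loss, l2_loss,
                  lam_1=1.0, lam_2=1.0, lam_3=1e-4, demo_sample=True):
        # Combined loss of (3.5); lambda_2 is zeroed for self-generated data.
        if not demo_sample:
            lam_2 = 0.0
        return q_loss + lam_1 * n_step_loss + lam_2 * sup_loss + lam_3 * l2_loss

    # The margin loss is zero when the demonstrator action already dominates
    # every other action by at least the margin.
    print(margin_loss([2.0, 0.5, 1.0], demo_action=0))  # 0.0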

The DQfD algorithm works in the same way as DQN, with the exception that it samples and runs a set amount of pretraining steps on data from the replay memory without interacting with the environment. The added losses are meant to stabilise this part of learning, as demonstrator data will naturally cover a very narrow part of the states in the environment [41]. Pseudocode for DQfD can be seen in Algorithm 3, where updating using gradient descent is done in the same way as in normal DDQN with the added losses explained in (3.5).

Algorithm 3 Deep Q-Learning from Demonstrations
1:  Initialise experience replay D_D with size N and add demonstration data
2:  Initialise online network and target network with random weights θ⁻ = θ
3:  for t ← 1 to T_pretrain do
4:    Sample random batch of n demonstrator transitions from D_D
5:    Use batch to calculate loss L(θ)
6:    θ ← θ − α ∂L(θ)/∂θ
7:    Every K steps set θ⁻ ← θ
8:  end for
9:  for episode ← 1 to M do
10:   Initialise s_1
11:   for t ← 1 to T do
12:     Select action a_t using an ε-greedy policy
13:     Store transition {s_t, a_t, r_t, s_{t+1}} in D_D
14:     Sample random batch of n transitions from D_D
15:     Use batch to calculate loss L(θ)
16:     θ ← θ − α ∂L(θ)/∂θ
17:     Every K steps set θ⁻ ← θ
18:   end for
19: end for

3.2 Methods

The pretraining methods in this thesis include learning features only, and learning features and policy simultaneously. Figures 3.1-3.4 visualise the pipelines of the different methods, and the theoretical benefits and downsides of each method are presented separately in the sections below.


3.2.1 Method 1: Preloading Experience Replay in DDQN

Method 1 is a simple way of trying to learn a better policy directly from demonstration data. Since the DDQN algorithm makes use of an experience replay to update the network, a naive solution for "pretraining" would be to simply add demonstration data to the replay buffer. In the original DDQN, the agent takes 50000 random steps to fill the buffer with some transitions before learning starts. Instead of these random steps, the experience replay buffer will be filled with however many demonstration steps are at hand. This data will never be overwritten, which differs from the original experience replay, which overwrites old data when the buffer is full (due to practical memory limitations). Formally, there is no pretraining; the algorithm is run as normal but with good transitions to learn from immediately.
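A minimal sketch of the idea, assuming a simple list-based buffer rather than the Baselines replay implementation actually used; demonstration transitions are stored separately so that only self-generated transitions are ever overwritten.

    import random
    from collections import deque

    class PreloadedReplayBuffer:
        # Demonstration transitions are kept permanently; self-generated
        # transitions live in a bounded deque and are overwritten when full.
        def __init__(self, capacity, demo_transitions):
            self.demo = list(demo_transitions)
            self.agent = deque(maxlen=capacity - len(self.demo))

        def add(self, transition):
            # transition = (s, a, r, s_next, done) from the environment
            self.agent.append(transition)

        def sample(self, batch_size):
            # Uniform sampling over demonstration and agent data alike.
            pool = self.demo + list(self.agent)
            return random.sample(pool, batch_size)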

Figure 3.1: A figure showing the pipeline of Method 1. The method is simply DDQN with a preloaded replay memory.

3.2.2 Method 2: Pretraining Network for DDQN

For Method 2 (and 3), the network is pretrained using supervised learning, and the trained parameters are then used to initialise the network for RL. This is commonly used in ML classification, where it is referred to as transfer learning. By using a network that has been proven to have a good architecture and to work well on classification tasks, there is a good chance that retraining this network on new tasks will be faster and provide better results. In [43], it was shown that layers in deep CNNs can be transferable to various degrees. Specifically, it was shown that the first layers were mostly general and did not necessarily contain specific information about the input. Deeper into the network, however, layers seem to become more specific to the input, ending with the fully connected layers being matched concretely to the classification task [43].

This method makes use of this information by training a network to predict the demonstrator's actions when presented with certain states. The network architecture is identical, but pretraining is done using softmax cross-entropy and not value functions. The parameters of the first convolutional layer are then copied to initialise the RL network, which the DDQN algorithm uses to approximate a policy.
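The transfer step can be sketched as below, assuming the network parameters are available as dictionaries of numpy arrays; the key names are hypothetical and do not correspond to the actual Baselines variable names.

    import numpy as np

    def transfer_first_conv_layer(pretrained_params, rl_params):
        # Copy only the first convolutional layer from the supervised
        # network into the freshly initialised RL network; all other
        # layers keep their random initialisation.
        for key in ("conv1/kernel", "conv1/bias"):
            rl_params[key] = np.copy(pretrained_params[key])
        return rl_params

In practice this would correspond to assigning the pretrained variables of the first layer before the RL training loop starts.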


Figure 3.2: A figure showing the pipeline of Method 2. The network is pretrained by predicting demonstrator actions using SL, and the agent network is initialised with one of the convolutional layers from the pretrained network. The experience replay is initialised with demonstrator data, as in Method 1.

3.2.3 Method 3: Pretraining Network for A2C

This method works the same as Method 2, but uses the A2C algorithm in the RL phase. This could have some implications; while A2C uses a value function to learn the policy, it also uses a softmax activation function for selecting actions. This could potentially mean that pretraining using softmax cross-entropy leads to parameters in the network similar to those obtained when using A2C.

Figure 3.3: A figure showing the pipeline of Method 3. The network is pretrained by predicting demonstrator actions using SL, and the agent network is initialised with one of the convolutional layers from the pretrained network.

3.2.4 Method 4: Deep Q-Learning from Demonstrations

This method makes use of DQfD to learn a policy directly from demonstration data. Data is loaded into the experience replay and never overwritten, and the algorithm runs for a set number of steps using only this data before starting to explore the environment. The benefit of this algorithm is that it is highly regularised, meaning it should be able to generalise well from demonstrator data alone.


Figure 3.4: A figure showing the pipeline of Method 4. The agent is trained using demonstration data in the replay buffer, with no interaction with the environment. After a set number of steps the agent is then exposed to the environment, where it trains like the normal DDQN algorithm.


4 Evaluation

This chapter explains the practical implementations of the methods presented in Chapter 3. It then evaluates the final model performance and whether pretraining benefits the learning time on the proposed problems.

4.1 Implementations

4.1.1 Algorithms

The three algorithms used in the methods are:

• Dueling Deep Q-Network (DDQN), which has shown strong results on various Atari games using deep RL. The algorithm builds on DQN and implements a dueling network architecture and double Q-value learning.

• Advantage Actor-Critic (A2C), which has led to faster convergence and state-of-the-art results while needing less computational power.

• Deep Q-Learning from Demonstrations (DQfD), an extension of DDQN which makes use of additional supervised and unsupervised losses to stabilise learning from demonstration data before exploring the environment.

In order to streamline performance, open-source projects have been used and extended in this thesis:

• Gym is an open-source library from OpenAI containing a standardised set of environments such as the classic Atari games [44].

• Baselines is another open-source library from OpenAI which contains streamlined implementations of a range of modern deep RL algorithms [45]. The algorithms implemented in this thesis are all extended from Baselines implementations.

• TensorFlow is an open-source library for ML [46]. Parameters not explicitly specified in this thesis follow TensorFlow's built-in values. One example of this is the Adam optimizer [13], where parameters not specified use the standard values found in the official documentation on TensorFlow's website.

4.1.2 Network

The deep network is the same for all methods, seen in Figure 2.9. The output layer implements a softmax activation function when using A2C, instead of selecting the highest value as in Q-learning.
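The difference between the two output heads can be sketched as follows, assuming the network already produces a vector of Q-values (Q-learning) or policy logits (A2C); a small illustrative sketch, not the Baselines code.

    import numpy as np

    def select_action_q(q_values, epsilon=0.0):
        # Q-learning head: epsilon-greedy over the estimated action values.
        if np.random.rand() < epsilon:
            return np.random.randint(len(q_values))
        return int(np.argmax(q_values))

    def select_action_a2c(logits):
        # A2C head: sample an action from the softmax distribution.
        z = np.asarray(logits, dtype=np.float64) - np.max(logits)  # stability
        probs = np.exp(z) / np.exp(z).sum()
        return int(np.random.choice(len(probs), p=probs))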

4.1.3 Games

The methods will be evaluated by learning two games, Pong and Breakout, both from Atari (Figure 4.1). Due to limitations in time and hardware there was a need to scale down the number of games evaluated on. These two games were chosen as they have previously been shown to work well with deep RL.

• Pong simulates table tennis and lets the player control a paddle which can be moved vertically on one side of the screen. On the opposite side another paddle is controlled by either another player or a preprogrammed opponent. Both players then try to hit a ball which goes back and forth between them. Letting the ball through is punished with minus one point, while the opponent is rewarded with one point. The first player to score 21 points wins.


Figure 4.1: The Atari games used in this thesis: (a) Breakout, (b) Pong.

• Breakout lets the player control a paddle that moves horizontally at the bottom of the screen. At the top are six rows of bricks, and the player tries to hit the bricks with a ball that bounces back and forth. Higher rows give more points, but will also return the ball faster and shrink the paddle size. The bricks can be cleared twice, meaning that if every brick is destroyed, a new set will spawn. After the second set is destroyed, the game restarts. The player has five lives and one life is lost every time the player lets the ball through. There is no negative reward for this, and the reward range for Breakout is r = [0, 7], r ∈ Z.

4.1.4 Preprocessing

In order to lower the amount of data that needs to be processed, and therefore the number of parameters to be adjusted in the network, as well as to make learning faster and more stable, DeepMind presented some preprocessing procedures which are all used in this thesis: downscaling, reward clipping, frame skipping and frame stacking. A minimal sketch of these steps follows the list below.

1. Downscaling. Frames are downscaled from 210x160 RGB format to 84x84 grayscale. This downscaling includes a cropping of the frame; the algorithm is given the reward through the output of Gym, not by reading it from the screen.

2. Reward clipping. The reward is processed so that a positive reward becomes 1, a negative reward becomes -1 and 0 is unchanged. This lowers the error magnitude for the algorithms and generalises better across games. Pong only rewards the player with 1 point when scoring, or -1 if the opponent scores.

3. Frame skipping. Changes between two consecutive frames in Atari games can be minuscule; saving all frames wastes computational resources. Instead, the algorithm uses the same action for 4 frames in a row. The first two frames are disregarded, and the last two are merged. The merge compares the two last frames, taking the highest value of every pixel, and returns one frame.

4. Frame stacking. On top of skipping frames, the algorithm also stacks four frames as input. This is done by replacing the oldest processed frame with the newest. The input is therefore always of size 84x84x4.
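The sketch below illustrates the four steps with small stand-alone helpers; it uses OpenCV for the resizing and is a simplified approximation of the wrappers in Baselines rather than the exact implementation used.

    import cv2
    import numpy as np

    def downscale(frame):
        # 210x160x3 RGB frame -> 84x84 grayscale, scaled to [0, 1].
        gray = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)
        return cv2.resize(gray, (84, 84), interpolation=cv2.INTER_AREA) / 255.0

    def clip_reward(reward):
        # Positive rewards become 1, negative become -1, zero is unchanged.
        return float(np.sign(reward))

    def merge_skipped_frames(second_last, last):
        # Pixel-wise maximum over the two last frames of a 4-frame skip window.
        return np.maximum(second_last, last)

    def push_frame(stack, new_frame):
        # Keep an 84x84x4 stack by dropping the oldest processed frame.
        return np.concatenate([stack[:, :, 1:], new_frame[:, :, None]], axis=2)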

4.1.5 Hyperparameters

Table 1 outlines all hyperparameters used in the methods. There has been no search for optimal values; instead the standard values from [32] (DQN), [25] (A3C) and [41] (DQfD) have been used. The only difference is the maximum number of steps; since both Pong and Breakout tend to stagnate earlier and training time was limited, the maximum amount of training steps has been reduced. Pong has been reduced even more as the game does not change, and the max score has a lower ceiling, leading to faster convergence. Another hyperparameter that differs is the number of pretraining steps for DQfD, which is varied in the experiments. Note that when DQfD is done pretraining, the same hyperparameters as for DDQN are used for RL.

DQN
Variable     Hyperparameter                   Value
T_max        Max steps (Pong/Breakout)        3 000 000 / 5 000 000
D            Experience replay size           500 000
K            Target network update freq.      10 000
ε_start      Initial exploration rate         1.0
ε_end        Final exploration rate           0.1
ε_anneal     Exploration anneal steps         0.2 * T_max
γ            Discount factor                  0.99
α            Learning rate                    0.00025

A2C
Variable     Hyperparameter                   Value
T            Max steps                        15 000 000
K            Number of agents                 16
t_n          Max steps n-step return          5
α            Learning rate                    0.0007 (annealed to 0)
β            Entropy regularisation           0.01

DQfD
Variable     Hyperparameter                   Value
T_pre        Pretraining steps                150 000 / 450 000
λ_L2         L2 loss regularisation           0.0001
λ_n          n-step loss regularisation       1.0
λ_D          Supervised loss regularisation   1.0
k_e          Expert margin                    0.8
t_n          Max steps n-step return          10

Table 1: Hyperparameters used in the methods.

4.1.6 Demonstration Data

Two different data sets are used for pretraining in this thesis. The first set is labeled expert data and is recorded from an agent trained to expert behaviour using RL. The other data set, labeled human data, is recorded from a human playing the game for around 50000 time steps. In all methods, the data sets are used separately to give an indication of how different data changes model performance. In Breakout, the expert data contains a wider spread of states than the human data as it progresses further into the game. In Pong however, since the game is highly static, the expert found a perfect way of winning and the data therefore contains many of the same states. The human data is a lot more varied as the human player made many different moves, resulting in both good and bad outcomes.


4.2 Methods Evaluation

The methods used are outlined in Chapter 3 and two ways of evaluating the results are used in the thesis. The first looks at the reward acquired during training, where the reward is synonymous with the points from the game played. Note that rewards are only clipped when backpropagating through the network; when visualising training the real reward is used. Visualising the reward during training indicates whether the algorithm converges to a good policy, how fast it converges and how stable the progress is. The second evaluation is done when training is complete. The trained model then plays a set number of steps in the environment, the reward is logged and an average reward per game played is presented. This evaluation shows how well the final model works, but it can also show robustness; in games with big reward spreads it is interesting to see the variation in reward over several games played. Lastly, evaluation of the models is always compared to a vanilla model, referring to the corresponding RL algorithm with no pretraining or demonstration data. For optimal results, training would be conducted several times using different random seeds. Due to time limitations and the heavy computational demand of the algorithms, each method was only run once. This severely limits the conclusions that can be drawn from the results, as outliers could occur.
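The second evaluation can be summarised by a loop of the following form, assuming the classic Gym step interface and a policy function mapping an observation to an action; this is an illustrative sketch rather than the evaluation script used in the thesis.

    import numpy as np

    def evaluate(env, policy, num_steps=100000):
        # Run the trained policy for a fixed number of steps and report the
        # average and spread of the (unclipped) reward per completed game.
        episode_rewards, current = [], 0.0
        obs = env.reset()
        for _ in range(num_steps):
            obs, reward, done, _ = env.step(policy(obs))
            current += reward
            if done:
                episode_rewards.append(current)
                current = 0.0
                obs = env.reset()
        return np.mean(episode_rewards), np.std(episode_rewards)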

4.3 Method 1: Preloading Experience Replay in DDQN

In this method, the experience replay buffer is filled with the expert data and the human data in separate experiments. This is done before the agent is initialised to start exploring the environment and learning begins.

4.3.1 Breakout

In the case of Breakout, training gave no indication of improvement over the vanilla DDQN (Figure 4.2). Figure 4.3, however, indicates that adding the demonstrator data had some effect on end performance. This would probably be a consequence of providing the algorithm with more good data points earlier in the training phase. As exploration starts with the agent acting completely randomly, most transitions early in training will not result in good rewards and therefore improve the model very slowly. Providing the algorithm with the demonstration transitions could therefore result in better and faster learning. However, as the evaluation was done on only one seed, there is a high chance that this comes down to luck. Seeing that the spread of reward is large, strong conclusions should not be drawn from this method alone (which unfortunately goes for all methods).

4.3.2 Pong

Preloading the experience replay in Pong seemed to achieve a minor decrease in training time. This effect could also be seen in the training time of Pong in Method 2, Figure 4.8, which could signify that simply adding the demonstration data is beneficial. It would suggest that giving the model access to better transitions early boosts initial training. Notably, the end result for preloading was worse, especially when preloading with expert data. As the expert data contained many of the same transitions, this could have hindered model performance as it trains on a smaller state-space early on. The better end result from using human data compared to expert data would also support this idea; the human data contains a more varied set of transitions, meaning a bigger state-space for training.


Figure 4.2: Average reward acquired during 5000000 training steps for vanilla DDQN compared to using a preloaded replay buffer containing expert and human demonstration data.

Figure 4.3: Final average model performances after training is complete (Method 1, Breakout).


Figure 4.4: Average reward during 3000000 training steps for vanilla DDQN compared to using a preloaded replay buffer containing expert and human demonstration data.

Figure 4.5: Final average model performances after training is complete (Method 1, Pong).


4.3.3 Discussion of Method 1

Preloading the replay memory gave a small increase in end performance for Breakout, Figure 4.3, and in learning speed for Pong, Figure 4.4. A reason why this method did not have more of an impact could be the amount of demonstration data compared to the data gathered throughout training. The experience replay buffer had a maximum capacity of 500000 transitions, and inefficient sampling of the demonstration data means it will not be used in many learning updates. As learning is detached from exploring when using deep Q-learning, the experience replay buffer is randomly sampled for transitions which the Q-learning algorithm uses to update the model. The demonstration data is randomly sampled without any indication that it might be more important than random data from the exploration of the environment. Hypothetically, preloading data could make a bigger difference in this method (and the other DDQN methods) if there were some importance indication for the demonstration samples. This might have been achieved using what is called Prioritised Replay Memory [47], which samples with regard to the temporal difference error of a transition. This gives a measurement of how well a transition aligns with the already learned policy, a high error meaning it would be important (if we consider the demonstrator actions to be good). This extension was unfortunately not implemented due to time limitations, but would be a natural step in future work.
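For reference, the proportional variant of Prioritised Replay Memory turns absolute TD errors into sampling probabilities roughly as sketched below; this is a simplified illustration of the scheme in [47], not an implementation used in the thesis, and the parameter values are illustrative.

    import numpy as np

    def prioritised_sampling_probabilities(td_errors, alpha=0.6, eps=1e-6):
        # Larger absolute TD error -> higher sampling probability.
        # alpha interpolates between uniform (0) and fully greedy (1) sampling.
        priorities = (np.abs(np.asarray(td_errors)) + eps) ** alpha
        return priorities / priorities.sum()

    # Transitions that disagree most with the current policy, such as fresh
    # demonstrator transitions, would be sampled more often.
    print(prioritised_sampling_probabilities([0.1, 2.0, 0.5]))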

4.4 Method 2: Pretraining the Network for DDQN

This method uses SL, training the network to learn features in the data by having it predict the demonstrator's actions. The parameters of the first convolutional layer are then copied to the network used for DDQN.

4.4.1 Breakout

As can be seen in Figure 4.6, initialising the network with the pretrained convolutional layer had a negligible effect on training time; neither training on expert nor human data made any difference. Figure 4.7 shows that the final performance of the models was slightly improved, with both expert and human data giving a slight average reward increase over the vanilla learning algorithm. The results are similar to those of Method 1, and as demonstrator data was added to the experience replay in this method as well, it could indicate that having access to demonstrator transitions makes for better learning.


Figure 4.6: Average reward during 5000000 training steps for vanilla DDQN and using expert and human data to pretrain a model and transfer 1 convolutional layer.

Figure 4.7: Final average model performances after training is complete (Method 2, Breakout).

4.4.2 Pong

In the case of Pong, the results indicate some improvement in learning speed, Figure 4.8, but a slight decrease in final model performance, Figure 4.9. This echoes the results of Method 1, but it also indicates that pretraining the network stabilises the end results somewhat.


Figure 4.8: Average reward during 3000000 training steps for vanilla DDQN and using expert and human data to pretrain a model and transfer 1 convolutional layer.

Figure 4.9: Final average model performances after training is complete (Method 2, Pong).

4.4.3 Discussion of Method 2

Pretraining using SL and transferring the first convolutional layer in DDQN had only minor implications. The final model performance for Breakout got a small boost and the learning speed for Pong was slightly increased. The results were similar to those of Method 1, which would support the claim that loading data into the experience replay affects learning. This would however indicate that the pretraining of the network had less effect, but as with Method 1 it could also come down to having a lucky (or unlucky) seed.

A factor to consider for the learning speed is that all algorithms start with an exploration rate of 100%. This naturally leads to the same rewards until the exploration rate is lower and the model starts relying more on its policy. As features close to the end result might already have been learned in the pretraining steps, and preferable data has already been added to the replay buffer, starting with a lower exploration rate might be favourable. Having the agent try random actions for many tens of thousands of steps could hinder the algorithm from bootstrapping on the pretraining and decrease performance. There could also be a point in transferring more layers, hopefully starting with layer parameters closer to those of a successful model.

4.5 Method 3: Pretraining Network for A2C

Similar to Method 2, this method uses SL, training the network to learn features in the data by having it predict the demonstrator's actions. The parameters of the first convolutional layer are then copied to the network used for A2C.

4.5.1 Breakout

Pretraining the network made a big difference for A2C in Breakout. Figure 4.10 shows a sizable increase in learning speed, and it seemed to have an impact on the final model performance as well (Figure 4.11). However, a caveat should be added; none of the models seem to have completely converged, meaning additional learning steps might have brought the end performance of the models closer together.

Figure 4.10: Average reward during 15000000 training steps for vanilla A2C and using expert and human data to pretrain a model and transfer 1 convolutional layer.


Figure 4.11: Final average model performances after training is complete (Method 3, Breakout).

4.5.2 Pong

Figure 4.12 shows a large speed increase when A2C is learning Pong. Using the pretrained network, the algorithm seems to converge on a well-working model in almost half the time. Figure 4.13 shows that the final performance of the models is not very different, but as the maximum number of points that can be acquired is 21, this mostly indicates that all models have settled on expert behaviour.

Figure 4.12: Average reward during 15000000 training steps for vanilla A2C and using expert and human data to pretrain a model and transfer 1 convolutional layer.


Figure 4.13: Final average model performances after training is complete (Method 3, Pong; highest possible reward is 21).

4.5.3 Discussion of Method 3

In both games, pretraining had a positive impact. In Pong, the learning curves for the pretrained network in Figure 4.12 are steeper and learning takes almost half the time compared to using a randomly initialised network. The fact that both data sets gave almost the same results suggests that what matters is the amount of transitions that can be pretrained on. As more data is often synonymous with better results in SL, and looking at Figure 4.10, this claim is strengthened; the expert data reached further into the game, giving the algorithm more transitions on which the network could be pretrained.

In comparison to DDQN, differences in exploration could benefit models that have been pretrained, which could be a reason for A2C working better. As A2C uses several agents exploring different parts of the environment, as well as updating the network from the actual policy followed, pretraining should increase efficiency immediately when starting to use RL. This only concerns pretraining to learn features, as this is the only pretraining conducted for A2C in this thesis. As for the end results of the models, it is unclear whether they are a consequence of the pretraining or simply of having had more time to settle on a better policy. In theory, the longer the training, the smaller the fluctuations in reward, as bad states will be visited less often once a good policy has been learned.

4.6 Method 4: Deep Q-learning from Demonstrations

Method 4 implements DQfD and is pretrained for 150000 and 450000 steps respectively in both games. As with Methods 1 and 2, the demonstration data is added to the experience replay buffer and kept there during all of the RL training.

4.6.1 Breakout

Using DQfD slowed down learning for both numbers of pretraining steps, as seen in Figures 4.14 and 4.16. The results show that more pretraining leads to worse learning, and in both cases the algorithm never learns to approximate a model as good as the vanilla solution, see Figures 4.15 and 4.17.


Figure 4.14: Average reward during 5000000 training steps for vanilla DDQN and using expert and human data in DQfD to pretrain the model for 150000 timesteps.

Figure 4.15: Final average model performances after training is complete (Method 4, Breakout, 150000 training steps).


Figure 4.16: Average reward during 5000000 training steps for vanilla DDQN and using expert and human data in DQfD to pretrain the model for 450000 timesteps.

Figure 4.17: Final average model performances after training is complete (Method 4, Breakout, 450000 training steps).

4.6.2 Pong

Pretraining Pong using expert data could benefit learning, as seen in Figure 4.18. Figure 4.20 however shows that pretraining for more steps resulted in worse learning, and as both Methods 1 and 2 showed improved learning when adding expert data to the replay buffer, it might be a consequence of the data rather than the pretraining. Using human data was fatal to both learning and final performance, which can especially be seen in Figures 4.20 and 4.21, where the model pretrained using human data averages -21 points, meaning it generally lost all balls.

Figure 4.18: Average reward during 3000000 training steps for vanilla DDQN and using expert and human data in DQfD to pretrain the model for 150000 timesteps.

Figure 4.19: Final average model performances after training is complete (Method 4, Pong, 150000 training steps).


Figure 4.20: Average reward during 3000000 training steps for vanilla DDQN and using expert and human data in DQfD to pretrain the model for 450000 timesteps.

Figure 4.21: Final average model performances after training is complete (Method 4, Pong, 450000 training steps).

4.6.3 Discussion of Method 4

DQfD came with almost no benefits over DDQN. All results were worse than DDQN, with the exception of a small decrease in learning time for Pong. This did not translate to a better end result however, and looking at the previous methods, it could simply have been a consequence of the experience replay being loaded with the expert data. There is an immense difference in regularisation for this algorithm compared to the vanilla DDQN, something that seemed to have a big impact. The results seem to show that more pretraining leads to worse learning in the RL phase, something that can be inferred from the heavier regularisation. Regularisation admittedly works as a countermeasure to overfitting, but a model pretrained for too long on a small amount of data can still overfit. As a consequence of the regularisation constraints, the RL algorithm will then have a hard time optimising away from the learned approximation. This can be seen when pretraining on the expert data in Breakout (Figure 4.17) and the human data in Pong (Figure 4.21): the algorithm does not manage to find any noticeable improvement throughout the training steps. As opposed to Methods 2 and 3, the pretraining in DQfD should not only train the model to approximate features but also to learn a policy. This could be the reason that expert data worked better in Pong, but worse in Breakout. In Pong, learning just a couple of action sequences from a very narrow set of states is enough to win the game. In Breakout however, getting a high score entails learning many more action sequences over a much wider set of states. The simple policy of Pong therefore seems to be learned, as it does not require many expert moves; the complex policy of Breakout is never found, as pretraining on too few expert moves hinders exploration. The experiments on human data would suggest the same thing. Pretraining for Pong using human data learns a bad policy, and the best policy seems to never be found with RL, probably due to the heavy regularisation. In Breakout however, one would assume the pretraining teaches the model a decent policy for many different early states, which can then be polished using RL.

Another consequence of learning a policy during pretraining is that DQfD should theoretically work immediately after pretraining. This was one of the main points of the original article [41]. To see if it had any bearing in this thesis, a smaller test run was done in Breakout with a lower starting exploration rate. The model used was the model pretrained for 150000 steps on human data, and the starting exploration rate was set to 10%, ending at 1%. The results for the first 2 million training steps can be seen in Figure 4.22. The results showed that training efficiency increased at the start, but the model was overtaken by the vanilla RL algorithm later on.

The results are not aligned with the results presented in the original article introducing the algorithm [41], but as Prioritised Replay Memory was not used here, conclusions cannot really be drawn from comparisons. As mentioned before, Prioritised Replay Memory could potentially benefit pretraining, as demonstrator data should be useful to sample more often. It is also hard to compare results as the demonstrator data might be very different. However, the article mentions that the demonstrator data was recorded from humans playing the games, which would imply that there should not be a big difference here.

4.7 Demonstrator Data Evaluation

The results have shown that different data can make a big difference, and there are two main differences in the data that will be discussed here. The first one is label distribution, meaning the proportions of actions taken by the demonstrator. In the human data set, many of the actions taken were so called 'no-ops', timesteps where the player did not take an action but simply waited for the game to progress. This makes sense; a human player might not feel the need to move the paddle when the ball is far away, but will simply track it and try to strategise for when and where to move the paddle. The common trait of a trained agent is different: it tends to strafe back and forth even if the ball is far away. This behaviour stems from the intrinsic property of value functions in RL; the values of different actions are similar when they do not change the outcome. The agent has no concept of doing 'nothing', a 'no-op' is simply another action that does not entail any less computation. What this ultimately means is that there could be a clash between how a human and a trained agent plays the game, something that could impact how useful human demonstration data is for finding viable agent policies. This could be both a benefit and a drawback; there could be a point in having an agent adopt human behaviour rather than finding the best policy.

Figure 4.22: Average reward during the first 2000000 training steps for vanilla DDQN and using DQfD with a lower start exploration rate.

The second difference has already been touched upon and refers to the data spread, meaning how many and which scenarios are in the demonstration data. The trained agent performed significantly better in both games than the human did; the data from the agent therefore contained many more scenarios from further into the game. However, in the earlier stages of the game, the human player tends to play differently every time, while the agent selects exactly the same actions when presented with the same frame (due to the exploration rate being set to zero when recording data). In these games, exploring further into the game might not mean very much, as there are no significant changes in game layout. Breakout has minor changes as blocks disappear (which seemed to help learning), while there is no change at all in Pong. However, playing differently might lead to exploring more states of the game, especially showing more bad states which will likely result in worse rewards. This could interestingly be helpful for pretraining if trained on correctly, the whole purpose of pretraining being to narrow down the state-action space from which the agent learns a good policy. As can be seen from the results of DQfD, this would seem to be the case with Breakout.

4.8 Deep Q-Learning Deficiencies

Overarching DDQN and DQfD, two immediate points for discussion are replay buffer sampling and exploration rate. As briefly mentioned before, sampling the data buffer more efficiently to make better use of demonstrator data could hypothetically improve the algorithms. The previously mentioned Prioritised Replay Memory makes use of the TD-error to mark the "importance" of different samples. Samples that are less aligned with the already learned policy would be used more often, as they would impact the learning more. In the original paper for DQfD [41], yet another prioritisation parameter is added to give an increased sampling rate from the demonstration data. This seems to give better results, which makes sense considering the vast number of transitions that are unnecessary for the algorithm to put much weight on.

The other point, exploration, refers to the initial exploration rate and the annealing end point. The exploration rate was set to 100% from the start and annealed to 10%. As Figure 4.22 showed, decreasing this could increase efficiency; pretraining should theoretically narrow down the state-space enough that the agent can make use of a learned policy when starting to explore the environment. Random steps are important to sample necessary transitions, but as most transitions are not valuable, additional searching might not always be necessary. The counterpoint is that the demonstrator might miss a better way of achieving high reward, something an RL algorithm could find given enough trial and error. As shown in state-of-the-art RL implementations such as AlphaZero [48], using RL alone can beat out algorithms using human demonstrations given enough time. This also seemed to be the case when looking at Figure 4.22. Regarding the annealed exploration rate, as can be seen in the results from all methods using DDQN and DQfD, the learning curve stagnates and becomes somewhat unstable. If the exploration rate were annealed further, say to 1% (as is the case in some newer papers, including [41]), the algorithm could potentially become more stable and get better results later in the learning.


5 Conclusions

5.1 Pretraining for Deep Reinforcement Learning

The results of this thesis would suggest that using demonstration data for deep RL can streamline the learning process. They suggest that exploration of the environment can be reduced or narrowed down, which could increase the time performance of the time-consuming RL methods and give better model performance in earlier stages of learning. Explicit results of improvement were however only shown for A2C, while hypothetical solutions for better performance of DDQN and DQfD were discussed in Chapter 4.

The results suggest that while pretraining for deep RL might increase training efficiency, it could also hinder an agent from finding the best policy in the presented environment, which was discussed as not necessarily being a bad thing. To add to this point, it could be argued that a worse policy might inspire better AI in simulation implementations for the military.

Overarching the whole thesis are, however, the time limitations. No attention was drawn to the actual time the different algorithms consume, but this is essentially the reason why the results should not be completely trusted. As there was not enough time to run several trials of every experiment, the results could be prone to outliers from lucky (or unlucky) random seeds. Some cases would seem more unlikely than others to be outliers, but the possibility cannot be excluded. The thesis does however give a basis for what has potential to work well, and suggested points for future research are outlined below.

5.1.1 Pretraining Network Layers through Supervised Learning

Pretraining the network using supervised learning had the biggest effect of all methods when used with A2C. Reasons for this have been outlined already, but a general point of discussion is the pretraining itself. As mentioned in Section 3.2.2, the network is pretrained using a cross-entropy loss with a softmax activation function. The output of the DDQN was changed to a softmax function during pretraining, but other than that no changes were made to the network structure. Additionally, no extra implementations or methods were undertaken in order to improve the supervised learning. As the games are relatively simple, the supervised learning worked well nonetheless, but going forward with more complex games there could be potential benefits to using more sophisticated methods in the supervised learning process.

Another point of interest in transferring layers is the number of layers transferred. Due to time restrictions, this thesis only looked into transferring the first convolutional layer of the network. As pointed out in [43], it would seem that transferring more layers can be beneficial depending on the task, especially in cases like these, where pretraining and RL work on the same problem. SL and RL both try to approximate the same features, so transferring more layers from a pretrained network with good results would most likely translate into better results. Also, overfitting when doing SL could have major consequences when transferred, as RL might not find good policies if working with bad feature approximations.

5.2 Future work

An important factor overarching this thesis is that time limited the experiments to only one seed per model training. This leaves open the chance for lucky seeds in the algorithms, meaning outliers are possible.

While this thesis works only as an exploration of deep RL methods that incorporate demonstration data, it shows that leveraging data has some effect on learning and end results. A natural next step for the Swedish Defence Research Agency would be to use these methods in simpler environments imitating military simulations, similar to [49]. Experimenting in somewhat more complex environments could give better results indicating how an agent could benefit from demonstration data by looking closer at behaviour. There are also other Atari games requiring more eloquent strategy and thinking which could serve as a better benchmark for drawing conclusions towards military operations.

Another task for future research would be to pretrain on data from tasks that are different from the goal of the agent. This could save time when agents need to be trained in new environments, as well as on different tasks in the same environment. This falls more under transfer learning, from which the concepts used for Method 2 in this thesis were drawn.


References[1] F. Hsu, “IBM’s deep blue chess grandmaster chips,” IEEE Micro, vol. 19, no. 2,

pp. 70–81, 1999.

[2] G. Kasparov and M. Greengard, Deep Thinking: Where Machine Intelligence

Ends and Human Creativity Begins. PublicAffairs, 2017.

[3] D. Ferrucci, E. Brown, J. Chu-Carroll, J. Fan, D. Gondek, A. A. Kalyanpur,A. Lally, J. W. Murdock, E. Nyberg, J. Prager, et al., “Building Watson: Anoverview of the DeepQA project,” AI Magazine, vol. 31, no. 3, pp. 59–79, 2010.

[4] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche,J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al., “Master-ing the game of Go with deep neural networks and tree search,” Nature, vol. 529,no. 7587, pp. 484–489, 2016.

[5] “About face ID advanced technology.” https://support.apple.com/en-us/HT208108/. Accesed: 2017-11-26.

[6] H. Wieslander, G. Forslid, E. Bengtsson, C. Wählby, J.-M. Hirsch,C. Runow Stark, and S. Kecheril Sadanandan, “Deep convolutional neural net-works for detecting cellular changes due to malignancy,” in Proceedings of the

IEEE Conference on Computer Vision and Pattern Recognition, pp. 82–89, 2017.

[7] M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling, “The arcade learningenvironment: An evaluation platform for general agents.,” Journal of Artificial

Intelligence Research, vol. 47, pp. 253–279, 2013.

[8] S.-C. Wang, “Artificial neural network,” in Interdisciplinary computing in Java

programming, pp. 81–100, Springer, 2003.

[9] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553,pp. 436–444, 2015.

[10] K. Hornik, “Approximation capabilities of multilayer feedforward networks,” Neu-

ral Networks, vol. 4, no. 2, pp. 251–257, 1991.

[11] “Stochastic gradient descent.” http://ufldl.stanford.edu/tutorial/supervised/OptimizationStochasticGradientDescent/. Accessed: 2017-10-13.

[12] T. Tieleman and G. E. Hinton, “Lecture 6.5: RMSprop,” Coursera: Neural Net-

works for Machine Learning, vol. 4, no. 2, pp. 26–31, 2012.

[13] D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv

preprint arXiv:1412.6980, 2014.

[14] “Convolutional neural networks (LeNet).” http://deeplearning.net/tutorial/lenet.html. Accessed: 2017-10-03.

[15] “Stanford class CS231: Convolutional neural networks.” http://cs231n.github.io/convolutional-networks/. Accessed: 2017-10-18.

[16] M. A. Hearst, S. T. Dumais, E. Osuna, J. Platt, and B. Scholkopf, “Supportvector machines,” IEEE Intelligent Systems and their applications, vol. 13, no. 4,pp. 18–28, 1998.

50

Page 53: Evaluation of Pretraining Methods for Deep Reinforcement ...1191656/FULLTEXT01.pdf · In recent years, Machine Learning research has made notable progress using Deep Learning methods.

[17] J. Schmidhuber, “Deep learning in neural networks: An overview,” Neural Net-

works, vol. 61, pp. 85–117, 2015.

[18] E. Vargas, “B.F. Skinner’s theory of behavior,” European Journal of Behavior

Analysis, vol. 18, no. 1, pp. 2–38, 2017.

[19] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction, vol. 1.MIT press Cambridge, 1998.

[20] R. Bellman, “A Markovian decision process,” Journal of Mathematics and Me-

chanics, pp. 679–684, 1957.

[21] M. Hausknecht and P. Stone, “On-policy vs. off-policy updates for deep rein-forcement learning,” in Deep Reinforcement Learning: Frontiers and Challenges,

IJCAI 2016 Workshop, 2016.

[22] “Artificial intelligence - foundations of computational agents.” http://artint.info/. Accessed: 2017-10-11.

[23] R. Bellman, Dynamic programming. Courier Corporation, 2013.

[24] L. P. Kaelbling, M. L. Littman, and A. W. Moore, “Reinforcement learning: Asurvey,” Journal of Artificial Intelligence Research, vol. 4, pp. 237–285, 1996.

[25] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver,and K. Kavukcuoglu, “Asynchronous methods for deep reinforcement learning,”in International Conference on Machine Learning, pp. 1928–1937, 2016.

[26] C. J. C. H. Watkins and P. Dayan, “Q-learning,” Machine Learning, vol. 8, no. 3-4, pp. 279–292, 1992.

[27] R. Dechter, Learning while searching in constraint-satisfaction problems. Univer-sity of California, Computer Science Department, Cognitive Systems Laboratory,1986.

[28] D. C. Cireşan, U. Meier, L. M. Gambardella, and J. Schmidhuber, “Deep, big, simple neural nets for handwritten digit recognition,” Neural Computation, vol. 22, no. 12, pp. 3207–3220, 2010.

[29] “Understanding LSTM.” http://colah.github.io/posts/2015-08-Understanding-LSTMs/. Accessed: 2017-10-17.

[30] “ImageNet.” http://www.image-net.org/. Accessed: 2017-10-13.

[31] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems, pp. 1097–1105, 2012.

[32] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529–533, 2015.

[33] “DeepMind - DQN.” https://deepmind.com/research/dqn/. Accessed: 2017-10-16.

[34] J. N. Tsitsiklis and B. Van Roy, “Analysis of temporal-difference learning with function approximation,” in Advances in Neural Information Processing Systems, pp. 1075–1081, 1997.

[35] H. Van Hasselt, A. Guez, and D. Silver, “Deep reinforcement learning with double Q-learning,” in AAAI, pp. 2094–2100, 2016.

[36] L. C. Baird III, “Advantage updating,” tech. rep., Wright Lab, Wright-Patterson AFB, 1993.

[37] Z. Wang, T. Schaul, M. Hessel, H. Van Hasselt, M. Lanctot, and N. De Freitas, “Dueling network architectures for deep reinforcement learning,” arXiv preprint arXiv:1511.06581, 2015.

[38] R. J. Williams, “Simple statistical gradient-following algorithms for connectionist reinforcement learning,” Machine Learning, vol. 8, no. 3-4, pp. 229–256, 1992.

[39] R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour, “Policy gradient methods for reinforcement learning with function approximation,” in Advances in Neural Information Processing Systems, pp. 1057–1063, 2000.

[40] J. X. Wang, Z. Kurth-Nelson, D. Tirumala, H. Soyer, J. Z. Leibo, R. Munos, C. Blundell, D. Kumaran, and M. Botvinick, “Learning to reinforcement learn,” arXiv preprint arXiv:1611.05763, 2016.

[41] T. Hester, M. Vecerik, O. Pietquin, M. Lanctot, T. Schaul, B. Piot, A. Sendonaris, G. Dulac-Arnold, I. Osband, J. Agapiou, et al., “Learning from demonstrations for real world reinforcement learning,” arXiv preprint arXiv:1704.03732, 2017.

[42] B. Piot, M. Geist, and O. Pietquin, “Boosted Bellman residual minimization handling expert demonstrations,” in Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 549–564, Springer, 2014.

[43] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson, “How transferable are features in deep neural networks?,” in Advances in Neural Information Processing Systems, pp. 3320–3328, 2014.

[44] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba, “OpenAI Gym,” arXiv preprint arXiv:1606.01540, 2016.

[45] “OpenAI baselines.” https://github.com/openai/baselines. Accessed: 2017-08-30.

[46] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, et al., “TensorFlow: Large-scale machine learning on heterogeneous distributed systems,” arXiv preprint arXiv:1603.04467, 2016.

[47] T. Schaul, J. Quan, I. Antonoglou, and D. Silver, “Prioritized experience replay,”arXiv preprint arXiv:1511.05952, 2015.

[48] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, et al., “Mastering the game of Go without human knowledge,” Nature, vol. 550, no. 7676, p. 354, 2017.

[49] B. Toghiani-Rizi, F. Kamrani, L. J. Luotsinen, and L. Gisslén, “Evaluating deep reinforcement learning for computer generated forces in ground combat simulation,” in Submitted to IEEE International Conference on Systems, Man, and Cybernetics, 2017.
