State of the Art Control of Atari Games Using Shallow Reinforcement Learning

Yitao Liang†, Marlos C. Machado‡, Erik Talvitie†, and Michael Bowling‡

†Franklin & Marshall College, Lancaster, PA, USA
{yliang, erik.talvitie}@fandm.edu

‡University of Alberta, Edmonton, AB, Canada
{machado, mbowling}@ualberta.ca

ABSTRACT
The recently introduced Deep Q-Networks (DQN) algorithm has gained attention as one of the first successful combinations of deep neural networks and reinforcement learning. Its promise was demonstrated in the Arcade Learning Environment (ALE), a challenging framework composed of dozens of Atari 2600 games used to evaluate general competency in AI. It achieved dramatically better results than earlier approaches, showing that its ability to learn good representations is quite robust and general. This paper attempts to understand the principles that underlie DQN's impressive performance and to better contextualize its success. We systematically evaluate the importance of key representational biases encoded by DQN's network by proposing simple linear representations that make use of these concepts. Incorporating these characteristics, we obtain a computationally practical feature set that achieves competitive performance to DQN in the ALE. Besides offering insight into the strengths and weaknesses of DQN, we provide a generic representation for the ALE, significantly reducing the burden of learning a representation for each game. Moreover, we also provide a simple, reproducible benchmark for the sake of comparison to future work in the ALE.

Keywords
Reinforcement Learning, Function Approximation, DQN, Representation Learning, Arcade Learning Environment

1. INTRODUCTION
In the reinforcement learning (RL) problem an agent autonomously learns a behavior policy from experience in order to maximize a provided reward signal. Most successful RL approaches have relied upon the engineering of problem-specific state representations, undermining the agent's autonomy and reducing its flexibility. The recent Deep Q-Network (DQN) algorithm [20] aims to tackle this problem, presenting one of the first successful combinations of RL and deep convolutional neural networks (CNN) [13, 14], which are proving to be a powerful approach to representation learning in many areas. DQN is based upon the well-known Q-learning algorithm [31] and uses a CNN to simultaneously learn a problem-specific representation and estimate a value function.

Appears in: Proceedings of the 15th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2016), John Thangarajah, Karl Tuyls, Stacy Marsella, Catholijn Jonker (eds.), May 9–13, 2016, Singapore.
Copyright © 2016, International Foundation for Autonomous Agents and Multiagent Systems (www.ifaamas.org). All rights reserved.

Games have always been an important testbed for AI, frequently being used to demonstrate major contributions to the field [5, 6, 8, 23, 28]. DQN follows this tradition, demonstrating its success by achieving human-level performance in the majority of games within the Arcade Learning Environment (ALE) [2]. The ALE is a platform composed of dozens of qualitatively diverse Atari 2600 games. As pictured in Figure 1, the games in this suite include first-person perspective shooting games (e.g. Battle Zone), platforming puzzle games (e.g. Montezuma's Revenge), sports games (e.g. Ice Hockey), and many other genres. Because of this diversity, successful approaches in the ALE necessarily exhibit a degree of robustness and generality. Further, because it is based on problems designed by humans for humans, the ALE inherently encodes some of the biases that allow humans to successfully navigate the world. This makes it a potential stepping-stone to other, more complex decision-making problems, especially those with visual input.

DQN’s success in the ALE rightfully attracted a great dealof attention. For the first time an artificial agent achievedperformance comparable to a human player in this challeng-ing domain, achieving by far the best results at the timeand demonstrating generality across many diverse problems.It also demonstrated a successful large scale integration ofdeep neural networks and RL, surely an important step to-ward more flexible agents. That said, problematic aspectsof DQN’s evaluation make it difficult to fully interpret theresults. As will be discussed in more detail in Section 6, theDQN experiments exploit non-standard game-specific priorinformation and also report only one independent trial pergame, making it difficult to reproduce these results or tomake principled comparisons to other methods.Independentre-evaluations have already reported different results due tothe variance in the single trial performance of DQN (e.g.[12]). Furthermore, the comparisons that Mnih et al. didpresent were to benchmark results using far less trainingdata. Without an adequate baseline for what is achievableusing simpler techniques, it is difficult to evaluate the cost-benefit ratios of more complex methods like DQN.

In addition to these methodological concerns, the evaluation of a complicated method such as DQN often leaves open the question of which of its properties were most important to its success. While it is tempting to assume that the neural network must be discovering insightful tailored representations for each game, there is also a considerable amount of domain knowledge embedded in the very structure of the network.



Figure 1: Examples from the ALE (left to right: Battle Zone, Montezuma's Revenge, Ice Hockey).

We can identify three key structural biases in DQN's representation. First, CNNs provide a form of spatial invariance not exploited in earlier work using linear function approximation. DQN also made use of multiple frames of input, allowing for the representation of short-range non-Markovian value functions. Finally, small-sized convolutions are quite well suited to detecting small patterns of pixels that commonly represent objects in early video game graphics. These biases were not fully explored in earlier work in the ALE, so a natural question is whether non-linear deep network representations are key to strong performance in the ALE or whether the general principles implicit in DQN's network architecture might be captured more simply.

The primary goal of this paper is to systematically investigate the sources of DQN's success in the ALE. Obtaining a deeper understanding of the core principles influencing DQN's performance and placing its impressive results in perspective should aid practitioners aiming to apply or extend it (whether in the ALE or in other domains). It should also reveal representational issues that are key to success in the ALE itself, potentially inspiring new directions of research.

We perform our investigation by elaborating upon a simple linear representation that was one of DQN's main comparison points, progressively incorporating the representational biases identified above and evaluating the impact of each one. This process ultimately yields a fixed, generic feature representation able to obtain performance competitive to DQN in the ALE, suggesting that the general form of the representations learned by DQN may in many cases be more important than the specific features it learns. Further, this feature set offers a simple and computationally practical alternative to DQN as a platform for future research, especially when the focus is not on representation learning. Finally, we are able to provide an alternative benchmark that is more methodologically sound, easing reproducibility and comparison to future work.

2. BACKGROUND
In this section we introduce the reinforcement learning problem setting and describe existing results in the ALE.

2.1 Reinforcement Learning
In the reinforcement learning (RL) problem [26, 27] an agent interacts with an unknown environment and attempts to maximize a "reward" signal. The environment is commonly formalized as a Markov decision process (MDP) M defined as a 5-tuple M = ⟨S, A, R, P, γ⟩. At time t the agent is in the state s_t ∈ S where it takes an action a_t ∈ A that leads to the next state s_{t+1} ∈ S according to the transition probability kernel P, which encodes Pr(s_{t+1} | s_t, a_t). The agent also observes a reward r_{t+1} ∼ R(s_t, a_t, s_{t+1}). The agent's goal is to learn the optimal policy, a conditional distribution π(a | s) that maximizes the state value function

$$V^\pi(s) \doteq \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\middle|\, s_t = s\right]$$
for all s ∈ S, where γ ∈ [0, 1) is known as the discount factor.

As a critical step toward improving a given policy π, it is common for reinforcement learning algorithms to learn a state-action value function, denoted:

$$Q^\pi(s, a) \doteq \mathbb{E}_\pi\!\left[R(s_t, a_t, s_{t+1}) + \gamma V^\pi(s_{t+1}) \,\middle|\, s_t = s, a_t = a\right].$$

However, in large problems it may be infeasible to learn a value for each state-action pair. To tackle this issue agents often learn an approximate value function: Q^π(s, a; θ) ≈ Q^π(s, a). A common approach uses linear function approximation (LFA) where Q^π(s, a; θ) = θ^⊤ φ(s, a), in which θ denotes the vector of weights and φ(s, a) denotes a static feature representation of the state s when taking action a. However, this can also be done through non-linear function approximation methods, including neural networks.

One of the most popular reinforcement learning algorithms is Sarsa(λ) [22]. It consists of learning an approximate action-value function while following a continually improving policy π. As states are visited and rewards are observed, the action-value function is updated and consequently the policy is improved, since each update improves the estimate of the agent's expected return from state s, taking action a, and then following policy π afterwards, i.e. Q^π(s, a). The Sarsa(λ) update equations, when using function approximation, are:

$$\delta_t = r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}; \vec{\theta}_t) - Q(s_t, a_t; \vec{\theta}_t),$$
$$\vec{e}_t = \gamma \lambda \vec{e}_{t-1} + \nabla Q(s_t, a_t; \vec{\theta}_t),$$
$$\vec{\theta}_{t+1} = \vec{\theta}_t + \alpha \delta_t \vec{e}_t,$$

where α denotes the step-size, the elements of $\vec{e}_t$ are known as the eligibility traces, and $\delta_t$ is the temporal difference error. Theoretical results suggest that, as an on-policy method, Sarsa(λ) may be more stable in the linear function approximation case than off-policy methods such as Q-learning, which are known to risk divergence [1, 10, 18]. These theoretical insights are confirmed in practice; Sarsa(λ) seems to be far less likely to diverge in the ALE than Q-learning and other off-policy methods [7].
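These updates map directly onto code. The following is a minimal sketch of Sarsa(λ) with linear function approximation and replacing traces (the variant used in our experiments); the environment interface (`reset`/`step`) and the feature function `phi`, which returns the indices of the active binary features, are placeholders rather than part of our implementation. In practice the step-size is often normalized by the number of active features.

```python
import numpy as np

def sarsa_lambda(env, phi, n_features, n_actions, episodes,
                 alpha=0.5, gamma=0.99, lam=0.9, epsilon=0.01, seed=0):
    """Sarsa(lambda) with linear function approximation and replacing traces.

    `phi(s, a)` returns the indices of the active binary features for a
    state-action pair; `env` is a placeholder with reset()/step() methods,
    where step(a) returns (next_state, reward, done).
    """
    rng = np.random.default_rng(seed)
    theta = np.zeros(n_features)                  # weight vector
    q = lambda s, a: theta[phi(s, a)].sum()       # sparse dot product

    def epsilon_greedy(s):
        if rng.random() < epsilon:
            return int(rng.integers(n_actions))
        return int(np.argmax([q(s, a) for a in range(n_actions)]))

    for _ in range(episodes):
        e = np.zeros(n_features)                  # eligibility traces
        s = env.reset()
        a = epsilon_greedy(s)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            delta = r - q(s, a)                   # temporal difference error
            if not done:
                a_next = epsilon_greedy(s_next)
                delta += gamma * q(s_next, a_next)
            e[phi(s, a)] = 1.0                    # replacing traces
            theta += alpha * delta * e
            e *= gamma * lam                      # decay traces
            if not done:
                s, a = s_next, a_next
    return theta
```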

2.2 Arcade Learning Environment
In the Arcade Learning Environment agents have access only to sensory information (160 pixels wide by 210 pixels high images) and aim to maximize the score of the game being played using the 18 actions on a standard joystick, without game-specific prior information. Note that in most games a single screen does not constitute Markovian state. That said, Atari 2600 games are deterministic, so the entire history of interaction fully determines the next screen. It is most common for ALE agents to base their decisions on only the most recent few screens.
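For concreteness, a single random-policy episode against the ALE might look like the sketch below (assuming the ale-py Python bindings; the ROM path is a placeholder and the exact loading call may differ between ALE versions):

```python
import numpy as np
from ale_py import ALEInterface  # assumption: the ale-py bindings are installed

ale = ALEInterface()
ale.setInt("random_seed", 0)
ale.loadROM("roms/breakout.bin")        # placeholder ROM path

actions = ale.getLegalActionSet()        # the full 18-action joystick set
rng = np.random.default_rng(0)

total_reward = 0.0
while not ale.game_over():
    screen = ale.getScreenRGB()          # 210 x 160 x 3 observation
    action = actions[rng.integers(len(actions))]   # random stand-in policy
    total_reward += ale.act(action)      # reward is the change in game score
ale.reset_game()
print("episode score:", total_reward)
```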

2.2.1 Linear Value Function Approximation
Early work in the ALE focused on developing new generic feature representations to be used with linear RL methods. Along with introducing the ALE itself, Bellemare et al. [2] presented an RL benchmark using four different feature sets obtained from the game screen. Basic features tile the screen and check if each of the available colors in the Atari 2600 is active in each tile. BASS features add pairwise combinations of Basic features. DISCO features attempt to detect


and classify objects on the screen in order to infer their positions and velocities. LSH simply applies Locality-Sensitive Hashing [9] to raw Atari 2600 screens. Bellemare, Veness, and Bowling proposed an extension to the Basic feature set which involved identifying which parts of the screen were under the agent's direct control. This additional information is known as contingency awareness [3].

These feature sets were, for some time, the standard representations for Atari 2600 games, being directly used in other work [7, 16, 17] or serving as the base for more elaborate feature sets [4]. The feature representations presented in this paper follow the spirit of this early work. We use Basic features (formally described below) as a starting point, and attempt to capture pairwise spatial relationships between objects on the screen.

Basic Features: To obtain Basic features we first divide the Atari 2600 screen into 16×14 tiles of size 10×15 pixels. For every tile (c, r) and color k, where c ∈ {1, ..., 16}, r ∈ {1, ..., 14}, and k ∈ {1, ..., 128}, we check whether color k is present within the tile (c, r), generating the binary feature φ_{c,r,k}. Intuitively, Basic features encode information like "there is a red pixel within this region". There are 16 × 14 × 128 = 28,672 Basic features in total.
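A sketch of the Basic feature computation, assuming the screen is given as a 210×160 array of color indices in {0, ..., 127} (indices here are 0-based, unlike the 1-based notation above):

```python
import numpy as np

def basic_features(screen, n_cols=16, n_rows=14):
    """Active Basic features as a set of (c, r, k) triples.

    `screen` is a 210x160 array of color indices; background pixels (if
    background subtraction has been applied) are assumed to be marked
    with a negative value and are ignored.
    """
    tile_h, tile_w = 210 // n_rows, 160 // n_cols   # 15 x 10 pixel tiles
    active = set()
    for r in range(n_rows):
        for c in range(n_cols):
            tile = screen[r * tile_h:(r + 1) * tile_h,
                          c * tile_w:(c + 1) * tile_w]
            for k in np.unique(tile):
                if k >= 0:
                    active.add((c, r, int(k)))
    return active
```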

It can be computationally demanding to work directly with 160 × 210 pixel screens with a 128 color palette. To make the screen more sparse, following previous work (e.g. [2, 3]) we subtract the background from the screen at each frame before processing it. The background is precomputed using 18,000 samples obtained from random trajectories.
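A plausible sketch of the background computation is the per-pixel most frequent color over the sampled frames; treat the choice of statistic as an assumption, since prior work does not pin it down here.

```python
import numpy as np

def compute_background(frames, n_colors=128):
    """Per-pixel most frequent color over sampled frames of shape (n, 210, 160).

    The background can then be removed with, e.g.,
    `np.where(screen == background, -1, screen)`.
    """
    frames = np.asarray(frames)
    counts = np.zeros((n_colors,) + frames.shape[1:], dtype=np.int32)
    for k in range(n_colors):
        counts[k] = (frames == k).sum(axis=0)   # occurrences of color k per pixel
    return counts.argmax(axis=0)
```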

2.2.2 Non-Linear Value Function Approximation
The use of non-linear function approximation in the ALE is more recent. Veness et al. [30], for example, propose the use of compression techniques as a policy evaluation approach in RL, evaluating their method in the ALE. Soon after, the research landscape changed due to the success of using deep learning techniques to approximate the action-value function, demonstrated by the DQN algorithm [19, 20]. DQN achieved at least 75% of a human player's score in the majority of games, and this inspired much more work in the ALE using deep learning (e.g. [11, 24, 25]).

DQN employs a deep convolutional neural network to represent the state-action value function Q. A deep convolutional network uses weight sharing to make it practical to learn a function of high-dimensional image input. Specifically, when processing the image a small network focused on a small region of the screen, called a filter, is applied at multiple positions on the screen. Its output at each position forms a new (smaller) image, which can then be processed by another filter, and so on. Multiple layers of convolution allow the network to detect patterns and relationships at progressively higher levels of abstraction. DQN's network uses three layers of convolution with multiple filters at each layer. The final result of those convolutions is then processed by a more standard fully connected feed-forward network. The convolutional aspect of the network allows it to detect relationships like "A bullet is near an alien here," where "bullet" and "alien" can be position-invariant concepts. The non-linear global layers allow it to represent whole-screen concepts such as "A bullet is near an alien somewhere." This is the inspiration for our spatially invariant features (Section 3).

Further, Mnih et al. actually provide the four most recent images as input to their network, allowing it to detect relationships through time (such as movement). This motivates our short-order-Markov features (Section 4). Finally, note that Basic features are essentially a simple convolution where the filters simply detect the presence of each color in their range. DQN's more sophisticated filters motivate our improvement of object detection for base features (Section 5).
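For reference, DQN's network can be sketched as follows (a PyTorch sketch; the specific filter sizes, strides, and layer widths are taken from the published DQN description rather than from this paper and should be treated as illustrative):

```python
import torch
import torch.nn as nn

class DQNNetwork(nn.Module):
    """Convolutional value network in the style of DQN [20].

    Three convolutional layers over a stack of the four most recent
    (84x84, grayscale) frames, followed by fully connected layers that
    output one action value per action.
    """
    def __init__(self, n_actions, in_frames=4):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_frames, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),   # 7x7 spatial output for 84x84 input
            nn.Linear(512, n_actions),
        )

    def forward(self, frames):          # frames: (batch, 4, 84, 84)
        return self.head(self.conv(frames))
```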

3. SPATIAL INVARIANCE
Recall that Basic features detect the presence of each color in various regions of the screen. As discussed above, this can be seen as analogous to a single convolution with a crude filter. The BASS feature set, which was amongst the best performing representations before DQN's introduction, encodes pairwise relationships between Basic features, still anchored at specific positions, so it is somewhat analogous to a second convolutional layer. But in many games the absolute positions of objects are not as important as their relative positions to each other. To investigate the effect of ignoring absolute position we impose a non-linearity over BASS (analogous to DQN's fully connected layers), specifically taking the max of BASS features over absolute position.

We call the resulting feature set Basic Pairwise Relative Offsets in Space (B-PROS) because it captures pairwise relative distances between objects in a single screen. More specifically, a B-PROS feature checks if there is a pair of Basic features with colors k1, k2 ∈ {1, ..., 128} separated by an offset (i, j), where −13 ≤ i ≤ 13 and −15 ≤ j ≤ 15. If so, φ_{k1,k2,(i,j)} is set to 1, meaning that a pixel of color k1 is contained within some block (c, r) and a pixel of color k2 is contained within the block (c + i, r + j). Intuitively, B-PROS features encode information like "there is a yellow pixel three tiles below a red pixel". The computational complexity of generating B-PROS features is similar to that of BASS, though ultimately fewer features are generated. Note that, as described, B-PROS contains redundant features (e.g. φ_{1,2,(4,0)} and φ_{2,1,(−4,0)}), but it is straightforward to eliminate them. The complete feature set is composed of the Basic features and the B-PROS features. After redundancy is eliminated, the B-PROS feature set has 6,885,440 features in total (28,672 + ((31 × 27 × 128² − 128)/2 + 128)).
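A sketch of B-PROS generation from the active Basic features of a single screen, with the redundancy φ_{k1,k2,(i,j)} ≡ φ_{k2,k1,(−i,−j)} removed by canonicalizing each pair (0-based indices, as in the earlier sketch):

```python
def bpros_features(basic_active):
    """Pairwise relative offsets in space from active Basic features.

    `basic_active` is a set of (c, r, k) triples (tile column, tile row,
    color). Returns a set of (k1, k2, i, j) offset features, canonicalized
    so that (k1, k2, i, j) and (k2, k1, -i, -j) map to the same feature.
    """
    offsets = set()
    for (c1, r1, k1) in basic_active:
        for (c2, r2, k2) in basic_active:
            i, j = c2 - c1, r2 - r1
            # Canonical form: smaller color first; for equal colors, keep
            # the lexicographically smaller offset.
            if (k1, (i, j)) <= (k2, (-i, -j)):
                offsets.add((k1, k2, i, j))
            else:
                offsets.add((k2, k1, -i, -j))
    return offsets
```

In an actual agent these (k1, k2, i, j) tuples would then be mapped into the flat, sparse feature vector used by the learner.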

Note that there have been other attempts to represent pairwise spatial relationships between objects, for instance DISCO [2] and contingency awareness features [3]. However, these existing attempts are complicated to implement, computationally demanding, and less effective (as will be seen), most likely due to unreliable estimates of object positions.

3.1 Empirical Evaluation
Our first set of experiments compares B-PROS to the previous state of the art in linear representations for the Atari 2600 [2, 3] in order to evaluate the impact of the non-linearity applied to BASS. For the sake of comparison we follow Bellemare et al.'s methodology [2]. Specifically, in 24 independent trials the agent was trained for 5000 episodes. After the learning phase we froze the weights and evaluated the learned policy by recording its average performance over 499 episodes. We report the average evaluation score across the 24 trials. Following Bellemare et al., we defined a maximum length for each episode: 18,000 frames, i.e. five minutes of real-time play.


Table 1: Comparison of linear representations. Bold denotes the largest value between B-PROS and Best Linear [2]. See text for more details.

Game Best Linear CAF B-PROS (std. dev.)

Asterix 987.3 1332.0 4078.7 (1016.0)

Beam Rider 929.4 1742.7 1528.7 (504.4)

Breakout 5.2 6.1 13.5 (11.0)

Enduro 129.1 159.4 240.5 (24.2)

Freeway 16.4 19.97 31.0 (0.7)

Pong -19.2 -17.4 4.9 (6.6)

Q*Bert 613.5 960.3 1099.8 (740.0)

Seaquest 664.8 722.89 1048.5 (321.2)

Space Invaders 250.1 267.93 384.7 (87.2)

We also used a frame-skipping technique, in which the agent selects actions and updates its value function every x frames, repeating the selected action for the next x − 1 frames. This allows the agent to play approximately x times faster. Following Bellemare et al. we use x = 5 (DQN also uses a similar procedure).
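As a sketch, frame skipping simply repeats the selected action and accumulates the reward over the skipped frames (reusing the hypothetical ale-py handle from the earlier example):

```python
def act_with_frame_skip(ale, action, skip=5):
    """Apply `action` for `skip` consecutive frames and return the summed reward.

    The agent observes the screen and updates its value function only once
    per `skip` frames; a sketch against the ale-py style interface used in
    the earlier example.
    """
    reward = 0.0
    for _ in range(skip):
        reward += ale.act(action)
        if ale.game_over():
            break
    return reward
```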

We used Sarsa(λ) with replacing traces and an ε-greedy policy. We performed a parameter sweep over nine games, which we call "training" games. The reported results use a discount factor γ = 0.99, an exploration rate ε = 0.01, a step-size α = 0.5 and eligibility trace decay rate λ = 0.9.

Table 1 compares our agent to existing linear agents in our set of training games (note that Breakout, Enduro, Pong, and Q*Bert were not training games for the earlier methods). "Best Linear" denotes the best performance obtained among four different feature sets: Basic, BASS, DISCO and LSH [2]. For reference, in the "CAF" column we also include results obtained by contingency awareness features [3], as reported by Mnih et al. [20]. Note that these results are not directly comparable because their agent was given 10,000 episodes of training (rather than 5000).

B-PROS’ performance surpasses the original benchmarksby a large margin and, except in one game, even surpassesthe performance of CAF, despite being far simpler and train-ing with half as much data. Further, some of the improve-ments represent qualitative breakthroughs. For instance,in Pong, B-PROS allows the agent to win the game witha score of 20-15 on average, while previous methods rarelyscored more than a few points. In Enduro, the earlier meth-ods seem to be stymied by a sudden change in dynamics af-ter approximately 120 points, while B-PROS is consistentlyable to continue beyond that point.

When comparing these feature sets for all 53 games, B-PROS performs better on average than all of Basic, BASS, DISCO, and LSH in 77% (41/53) of the games. Even with less training data, B-PROS performs better on average than CAF in 77% (38/49) of the games in which CAF results were reported.¹ The dramatic improvement yielded by B-PROS clearly indicates that focusing on relative spatial relationships rather than absolute positions is vital to success in the ALE (and likely in other visual domains as well).

4. NON-MARKOVIAN FEATURES
B-PROS is capable of encoding relative distances between objects but it fails to encode movement, which can be a very important aspect of Atari games. For instance, the agent may need to know whether the ball is moving toward or away from the paddle in Pong. Previous linear representations similarly relied mainly on the most recent screen.

¹For reference, the full results for all 53 games are available in Table 3 in the Appendix.

In contrast, Mnih et al. use the four most recent screens as input, allowing DQN to represent short-order-Markov features of the game screens. In this section we present an extension to B-PROS that takes a similar approach, extracting information from the two most recent screens.

Basic Pairwise Relative Offsets in Time (B-PROT) features represent pairwise relative offsets between Basic features obtained from the screen five frames in the past and Basic features from the current screen (i.e. PROT features). More specifically, for every pair of colors k1, k2 ∈ {1, ..., 128} and every offset (i, j), where −13 ≤ i ≤ 13 and −15 ≤ j ≤ 15, a binary B-PROT feature φ_{k1,k2,(i,j)} is 1 if a pixel of color k1 is contained within some block (c + i, r + j) on the screen five frames ago and a pixel of color k2 is contained within the block (c, r) in the current screen.

The B-PROST feature set contains Basic, B-PROS, and B-PROT features. Note that there are roughly twice as many B-PROT features as B-PROS features because there are no redundant offsets. As such, B-PROST has a total of 20,598,848 sparse, binary features (6,885,440 + 31 × 27 × 128²).
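B-PROT features can be generated analogously to B-PROS but across the two screens; since the past and current screens play asymmetric roles, no canonicalization is applied (a sketch, reusing the 0-based Basic feature sets from the earlier examples):

```python
def bprot_features(basic_past, basic_current):
    """Pairwise relative offsets in time between two sets of Basic features.

    A feature (k1, k2, i, j) is active if color k1 appears in block
    (c + i, r + j) on the screen five frames ago and color k2 appears in
    block (c, r) on the current screen.
    """
    offsets = set()
    for (c_old, r_old, k1) in basic_past:
        for (c_cur, r_cur, k2) in basic_current:
            offsets.add((k1, k2, c_old - c_cur, r_old - r_cur))
    return offsets
```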

4.1 Empirical Evaluation
B-PROS outperformed all the other linear architectures in the previous experiment. Subsequent extensions will be primarily compared to DQN (see Section 6). As such, in these experiments, we adopted an evaluation protocol similar to Mnih et al.'s. Each agent was trained for 200,000,000 frames (equivalent to 40,000,000 decisions) over 24 independent trials. The learned policy in each trial was evaluated by recording its average performance in 499 episodes with no learning. We report the average evaluation score over the 24 trials. In an effort to make our results comparable to DQN's, we also started each episode with a random number of "no-op" actions and restricted the agent to the minimal set of actions that have a unique effect in each game.

The first two columns of Table 2 present results using B-PROS and B-PROST in the training games. B-PROST outperforms B-PROS in all but one of the training games. One particularly dramatic improvement occurs in Pong; B-PROS wins with a score of 20-9, on average, while B-PROST rarely allows the opponent to score at all. Another result worth noting is in Enduro. The randomized initial conditions seem to have significantly harmed the performance of B-PROS in this game, but B-PROST seems to be robust to this effect. Over the 49 games evaluated by Mnih et al., the average score using B-PROST is higher than that using B-PROS in 82% of the games (40/49).² This clearly indicates the critical importance of non-Markov features to success in the ALE.

Before making a final comparison with DQN, we will make one more improvement to our representation.

5. OBJECT DETECTION
Because Basic features encode the positions of individual pixels on the screen, both B-PROS and B-PROST features struggle to distinguish which pixels are part of the same object. DQN's network is capable of learning far more subtle filters. In order to measure the impact of improved low-level object detection, we consider a simple extension to Basic that exploits the fact that Atari screens often contain several contiguous regions of pixels of the same color. We call such regions "blobs".

²The full results are available in Table 4 in the Appendix.


Table 2: Comparison of relative offset features. Bold indicates the best average of the three columns. The † indicates significant differences between B-PROST and Blob-PROST.

Game    B-PROS Avg. (std. dev.)    B-PROST Avg. (std. dev.)    Blob-PROST Avg. (std. dev.)

Asterix 8194.3 (1802.9) 8928.8† (1714.5) 4400.8 (1201.3)

Beam Rider 1686.9 (255.2) 1808.9 (309.2) 1902.3 (423.9)

Breakout 7.1 (1.7) 15.0 (4.7) 46.7† (45.7)

Enduro 34.9 (71.5) 207.7 (23.1) 257.0† (79.8)

Freeway 23.8 (6.7) 29.1 (6.3) 31.5† (1.2)

Pong 10.9 (5.2) 18.9 (1.3) 20.1† (0.5)

Q*Bert 3647.8 (1273.3) 3608.7 (1129.0) 6946.8† (3036.4)

Seaquest 1366.1 (445.9) 1636.5 (519.5) 1664.2 (440.4)

Space Invaders 505.1 (130.4) 582.9 (139.0) 723.1† (101.5)

Rather than directly represent coarsened positions of pixels, we first process the screen to find a list of blobs. Blob features then represent coarsened positions of blobs on the screen. Changing the "primitive" features from Basic to Blob yields the Blob-PROST feature set.

Note that blobs are a simplification; in many Atari games, as would be true in more natural images, objects consist of multiple close but separate blobs. Grouping only strictly contiguous pixels into each blob may generate redundant blobs that all represent a single object. As a simple means to address this we add a tolerance to the contiguity condition, i.e. we consider pixels that are in the same s × s pixel square to be contiguous. This approach has an inherent trade-off. On the one hand, with sufficiently large s we may successfully represent each object with few blobs. Also, by reducing the number of blobs, we substantially decrease the number of primitive features, making Blob-PROS and Blob-PROT features easier to compute. On the other hand, if s is too high then multiple distinct objects may be grouped together. In our experiments we set s to 6 after an informal search using the set of training games. It is very likely that with a more systematic selection of s one can obtain better results than those reported here.

We define the position of a blob as the centroid of the blob's smallest bounding box. To generate the features, we first generate Blob features that are analogous to Basic features: we divide the screen into tiles of 4×7 pixels and then for every color k and block (c, r), where c ∈ {1, ..., 40}, r ∈ {1, ..., 30} and k ∈ {1, ..., 128}, the Blob feature φ_{c,r,k} is 1 if the block (c, r) contains the centroid of some blob of color k. Note that we use a finer resolution than Basic. This is feasible only because of the extreme sparsity of blobs on the screen. The number of blobs in one frame ranged from 6 (Pong) to 412 (Battle Zone). This sparsity also makes the background subtraction step unnecessary; the background typically reduces to a handful of blobs itself.
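A sketch of blob extraction and Blob feature generation; here the s × s tolerance is approximated by dilating each color mask before connected-component labelling, which is one way (not necessarily the one used in our implementation) to merge nearby same-colored pixels:

```python
import numpy as np
from scipy import ndimage

def blob_features(screen, s=6, tile_w=4, tile_h=7, n_colors=128):
    """Active Blob features as (c, r, k): the centroid of some blob of color k
    falls in tile (c, r) of the 40x30 grid of 4x7-pixel tiles.

    `screen` is a 210x160 array of color indices. Same-colored pixels within
    an s x s neighbourhood are merged into one blob by dilating the color
    mask before labelling (an approximation of the tolerance in the text);
    the bounding box used is that of the dilated component.
    """
    active = set()
    structure = np.ones((s, s), dtype=bool)
    for k in range(n_colors):
        mask = screen == k
        if not mask.any():
            continue
        merged = ndimage.binary_dilation(mask, structure=structure)
        labels, _ = ndimage.label(merged)
        for rows, cols in ndimage.find_objects(labels):
            y = (rows.start + rows.stop - 1) / 2.0   # bounding-box centroid
            x = (cols.start + cols.stop - 1) / 2.0
            active.add((int(x // tile_w), int(y // tile_h), k))
    return active
```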

The Blob-PROST feature set is then constructed from these primitive features in the same manner as B-PROST. There are 153,600 Blob features (40 × 30 × 128), 38,182,976 Blob-PROS features ((79 × 59 × 128² − 128)/2 + 128), and 76,365,824 Blob-PROT features (79 × 59 × 128²). This yields a total of 114,702,400 Blob-PROST features. Though the number of possible features is very large, most would never be generated due to the sparsity of blobs.
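These counts follow directly from the 79 × 59 possible tile offsets over the 40×30 grid and the 128-color palette; a quick arithmetic check:

```python
n_colors = 128
blob = 40 * 30 * n_colors                                        # 153,600
blob_pros = (79 * 59 * n_colors**2 - n_colors) // 2 + n_colors   # 38,182,976
blob_prot = 79 * 59 * n_colors**2                                # 76,365,824
assert blob + blob_pros + blob_prot == 114_702_400               # Blob-PROST total
```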

5.1 Empirical Evaluation
The last column of Table 2 presents results using Blob-PROST in the training games.

In all but one of the training games, Blob-PROST outperformed both B-PROS and B-PROST on average.³ In 6 of the 9, Blob-PROST's average performance was statistically significantly better than B-PROST's (using Welch's t-test with p < 0.05). Asterix is a game in which Blob-PROST notably fails; perhaps blob detection lumps together distinct objects in a harmful way. Q*Bert is a notable success; the scores of B-PROS and B-PROST indicate that they rarely complete the first level, while Blob-PROST's score indicates that it consistently clears the first level. Out of the 49 games evaluated by Mnih et al., Blob-PROST's average performance was higher than that of both B-PROS and B-PROST in 59% (29/49) of the games. Its performance was statistically significantly higher than that of B-PROST in 47% (23/49) of the games, showing the clear contribution of even simple improvements in object detection. Though object detection itself is not a feature, it makes all features based upon it carry more meaningful information that helps our agents better "interpret" the screen.
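The significance tests here are Welch's (unequal-variance) t-tests over the 24 per-trial evaluation averages of each method; as a sketch with hypothetical score arrays:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical per-trial evaluation averages, 24 trials per method.
bprost_scores = rng.normal(loc=3600.0, scale=1100.0, size=24)
blobprost_scores = rng.normal(loc=6900.0, scale=3000.0, size=24)

# Welch's t-test (equal_var=False); p < 0.05 marks a significant difference.
t_stat, p_value = stats.ttest_ind(blobprost_scores, bprost_scores, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```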

6. COMPARISON WITH DQN
The next set of experiments compares Blob-PROST to DQN, showing that the three enhancements investigated above largely explain the performance gap between DQN and earlier LFA approaches. In performing this comparison, we necessarily confront problematic aspects of the original DQN evaluation (some of which we adopt for comparison's sake). In Section 7 we present a more methodologically sound ALE benchmark for comparison to future work.

6.1 DQN Evaluation Methodology
In each game, Mnih et al. trained a DQN agent once for 200,000,000 frames. Rather than evaluating the learned policy after the whole learning phase completed (as in earlier work), they evaluated their agent's performance every 4,000,000 frames and selected the best performing weights. They reported the average and standard deviation of the scores of this best learned policy in 30 evaluation episodes.

Mnih et al. also restricted their agent to the minimal set of actions that have a unique effect in each game, game-specific prior information not exploited by earlier work in the ALE. Since in most games the minimal action set contains fewer actions than the full action set, agents with access to the minimal set may benefit from a faster effective learning rate (this is discussed further in Section 7).

To avoid overfitting to the Atari's determinism, DQN added a random number of "no-op" actions to the beginning of each training and evaluation episode: the number of "no-ops" was uniformly randomly selected from {1, ..., 30} before a new episode officially began.

Mnih et al. also modified when episodes ended. In many Atari games the player has a number of "lives" and the game ends when they are exhausted. DQN employed the same episode termination criteria as Bellemare et al. [2], i.e. the end of the game or expiration of the 5 minute time limit, but during training also terminated episodes when the agent lost a life (another form of game-specific prior information). We have not specifically evaluated the impact of this mechanism, but one might speculate that in games that have "lives" (e.g. Breakout, Space Invaders) this modification could more easily generate agents with an "avoid death" policy.

³The implementation of the three feature sets we introduced, as well as the code used in our experiments, is available at https://github.com/mcmachado/b-pro.


Figure 2: Learning curves for Blob-PROST on the game Up 'n Down, smoothed over 500 episodes (legend: DQN, Blob-PROST).

The DQN experimental methodology possesses three main methodological flaws.

First, one cannot make statistical comparisons from a single trial. While Mnih et al. report a standard deviation, this only represents the amount of variation observed when executing the one chosen policy, and not the variation observed over different independent executions of the inherently stochastic algorithm. DQN's consistently high performance across games suggests it does indeed regularly perform well, but the reported performance numbers on any particular game may not be representative of their expected performance. Second, selecting the best performing parameterization over the whole learning period is a significant deviation from typical RL methodology. Such a choice may be masking instability issues where learning performance is initially strong but later goes awry. This effect can occur even using comparatively stable linear function approximators. For example, Figure 2 shows 24 independent learning trials from the game Up 'n Down using our linear Blob-PROST agent. Many learning trials in this game exhibit an agent steadily improving its performance before a sudden plummet from which it does not recover. The reported performance of DQN and Blob-PROST is also depicted, showing the impact of such instability: most trials perform much better than the reported average at some point. We contend that general competency should include consistent and stable improvement over time, best measured by an agent's performance at the end of training. Third, using designer-provided game-specific prior knowledge in the form of the minimal action set and termination on "death" is counter to the goal of evaluating general competency. Emerging follow-up work [21] has begun to address some of these issues by using the final network weights for evaluation and by reporting the average score over five independent trials.

6.2 Comparing Blob-PROST and DQN
As discussed in Section 4, for comparison's sake, we adopted a similar methodology. We did use the minimal action set, and did add random "no-ops" to the beginning of episodes, but we opted not to utilize the life counter.

6.2.1 Computational Cost
We found that Blob-PROST is far more computationally practical than DQN. We compared the algorithms' resource needs on 3.2GHz Intel Core i7-4790S CPUs. We ran them for 24 hours (to allow resource usage to stabilize), then measured their runtime and memory use.

The computational cost of our implementation of Blob-PROST can vary from game to game. The runtime of our implementation ranged from 56 decisions per second (Alien) to 300 decisions per second (Star Gunner), that is, 280-1500 frames per second (many times real-time speed). The memory utilization of our implementation ranged from 50MB (Pong) to 9GB (Battle Zone). Note that Battle Zone was an outlier; the next most memory intensive game (Star Gunner) used only 3.7GB of memory. Furthermore, the memory utilization of Blob-PROST can likely be effectively controlled by simplifying the color palette or through the use of feature hashing (e.g. [4]).

In contrast, we found DQN's computational cost to be quite consistent across games, running at a speed of approximately 5 decisions per second (i.e. 20 frames per second, 3 times slower than real time), and requiring approximately 9.8GB of memory. DQN already uses a reduced color palette and is not immediately amenable to feature hashing to control memory usage. On the other hand, it is amenable to GPU acceleration to improve runtime. Mnih et al. do not report DQN's runtime, but in recent follow-up work the GPU-accelerated speed has been reported to be approximately 330 frames per second [29], still slower than the Blob-PROST agent in most games. Further note that obtaining enough GPUs to support multiple independent trials in all the games (necessary for statistical comparisons) is, in itself, a prohibitive cost. Emerging work [21] shows that a related approach can be accelerated via CPU parallelism.

6.2.2 ALE Performance
Because only one trial of DQN was reported in each game, and because of the prohibitively high computational cost of independently re-evaluating DQN, a principled comparison between our method and DQN is essentially impossible. Instead, we used a number of ad hoc measures aimed at forming an intuitive understanding of the relative capabilities of the two algorithms, based on available evidence.

First, for each game we record how many trials of Blob-PROST obtained evaluation scores greater than DQN's single trial. If we were to compare an algorithm to itself in this way, we would expect roughly 50% of the trials to have greater performance. In Table 5 (available in the Appendix), the column marked "% trials > DQN" reports the results. The average percentage of trials better than DQN, across all games, was 41%. That is, if you select a game uniformly at random and run Blob-PROST, there is an estimated 41% chance that it will surpass DQN's reported score.

Second, in each game we compared Blob-PROST's middle trial (12th best) to the single reported DQN trial. If Blob-PROST's evaluation score compares favorably to DQN's in this trial then this suggests that scores like DQN's are typical for Blob-PROST. Again, if this method were used to compare an algorithm to itself, one would expect the middle trial to be better in roughly half of the games. In Table 5 the column marked "Middle trial" reports the results. Blob-PROST's middle trial evaluation score was better than DQN's in 43% (21/49) of the games. In 3 of the games in which it performed worse, Blob-PROST's score was not statistically significantly different from DQN's. So in 49% of the games Blob-PROST's middle trial was either not different from or better than DQN's single reported trial.


Third, in each game we compared Blob-PROST's best trial to the single DQN trial. Since it is possible that some of DQN's trials are high-performing outliers, this is intended to give a sense of the level of performance Blob-PROST is capable of reaching, regardless of whether it can do so consistently. In 65% (32/49) of the games, Blob-PROST's best trial's performance exceeded that of the DQN trial. Again, in three games in which Blob-PROST did worse, the difference was not statistically significant. So in 71% of the games Blob-PROST's best trial was either not different from or better than DQN's single reported trial.

Finally, DQN’s main result was achieving at least 75% of ahuman’s performance in over half of the games (29/49). Thebest trial of Blob-PROST crossed this threshold in 34/49games. The middle trial did so in only 20/49 games.

The Blob-PROST representation was generated using simple and natural enhancements to BASS (one of DQN's main comparison points) aimed at incorporating similar structural biases to those encoded by DQN's designer-provided network architecture. All told, these results seem to indicate that this fixed representation is of comparable quality to DQN's learned representation across many games.

7. ALE BENCHMARK
As discussed above, besides better understanding the core issues underlying DQN's success in the ALE, it is important to present a reproducible benchmark without the methodological flaws discussed in Section 6 (i.e. one that reports more than one trial, evaluates the final set of weights, and eschews game-specific prior knowledge). Moreover, establishing this benchmark is important for the ALE's continued use as an evaluation platform. The principled evaluation of complicated approaches depends upon the existence of a sound baseline for what can be achieved with simple approaches.

To this end we also present the performance of Blob-PROST using the full action set. Our agents' average performance after 24 trials using the full action set is reported in the last column of Table 4 in the Appendix. We recommend that future comparisons to Blob-PROST use these results. Surprisingly, we found that when using the full action set Blob-PROST performed slightly better in comparison to DQN, using the measures described above, than when using the minimal action set. It is not entirely clear why this would be; one possible explanation is that the presence of redundant actions may have a positive impact on exploration in some games.

8. CONCLUSIONS AND FUTURE WORK
While it is difficult to draw firm conclusions, the results indicate that Blob-PROST's performance in the ALE is competitive with DQN's, albeit likely slightly worse overall. Most importantly, these results indicate we may have been able to capture some of the key features of DQN in a practical, fixed linear representation. We saw progressive and dramatic improvements by respectively incorporating relative distances between objects, non-Markov features, and more sophisticated object detection. This illuminates some important representational issues that likely underlie DQN's success. It also suggests that the general properties of the representations learned by DQN may be more important to its success in the ALE than the specific features it learns in each game. This deeper understanding of the specific strengths of DQN should aid practitioners in determining when its benefits are likely to outweigh its costs.

It is important to note that we do not intend to diminish the importance of DQN as a seemingly stable combination of deep neural networks and RL. This is a promising and exciting research direction with potential applications in many domains. That said, taking into account that DQN's performance may be inflated to an unknown degree as a result of evaluating the best performing set of weights, DQN's extreme computational cost, and the methodological issues underlying the reported DQN results, we conclude that Blob-PROST is a strong alternative candidate to DQN both as an ALE performance benchmark and as a platform on which to build future ALE agents when representation learning is not the topic being studied. Our results also suggest, in general, that fixed representations inspired by the principles underlying convolutional networks may yield competitive, lighter-weight alternatives.

As future work, it may be interesting to investigate the remaining discrepancies between our results and DQN's. In some games, like Gravitar, Frostbite and Krull, Blob-PROST's performance was several times higher than DQN's, while the opposite was true in games such as Star Gunner, Breakout and Atlantis. In general, DQN seems to excel in shooting games (a large part of the games supported by the ALE), maybe because it is able to easily encode object velocities and to predict objects' future positions. DQN also excels in games that require a more holistic view of the whole screen (e.g. Breakout, Space Invaders), something pairwise features struggle with. On the other hand, some of the games in which Blob-PROST surpasses DQN are games where it is fairly easy to die, maybe because DQN interrupts the learning process (terminating the episode) whenever the agent loses a life. Other games Blob-PROST succeeds in are games where the return is very sparse (e.g. Tennis, Montezuma's Revenge). These games may generate very small gradients for the network, making it harder for an algorithm to learn both the representation and a good policy. Alternatively, DQN's process of sampling from the experience memory may not be effective at drawing the "interesting" samples.

These conjectures suggest directions for future work. Adaptive representation methods may potentially benefit from these insights by building in a stronger bias toward the types of features we have investigated here. This would allow them to quickly perform well, and then focus energy on learning exceptions to the rule. In the realm of linear methods, these results suggest that there may still be simple, generic enhancements that yield dramatic improvements. More sophisticated object detection may be beneficial, as might features that encode more holistic views of the screen.

Acknowledgements
This research was supported by Alberta Innovates Technology Futures and the Alberta Innovates Centre for Machine Learning. Computing resources were provided by Compute Canada through Calcul Québec.


APPENDIX
The following tables report the complete results discussed in the paper. Table 3 reports the data used to compare B-PROS results to other linear feature representations' reported results (see Section 3). Table 4 presents the results of the three feature sets introduced in this paper (B-PROS, B-PROST and Blob-PROST) as well as the results using Blob-PROST with the full action set, which we recommend be used as a benchmark in the future (see Sections 4, 5 and 7). Finally, Table 5 reports the data used to compare Blob-PROST to DQN's reported results (see Section 6).


Table 3: Control performance in Atari 2600 games when using linear function approximation. Bold indicates the best mean score in the first five columns, which were obtained using all the 18 actions available, after learning for 5,000 episodes and evaluating for 500 episodes [2]. CAF denotes contingency awareness features [3]. These results were reported after learning for 10,000 episodes and the average reported is the average score of the last 500 learning episodes. They are not directly comparable, but note that B-PROS outperforms CAF in the majority of games, even using half of the samples.

Game Basic BASS DISCO LSH B-PROS (std. dev.) CAF
Asterix 862.3 859.8 754.6 987.3 4078.7 (1016.0) 1332.0
Beam Rider 929.4 872.7 563.0 793.6 1528.7 (504.4) 1742.7
Breakout 3.3 5.2 3.9 2.5 13.5 (11.0) 6.1
Enduro 111.8 129.1 0.0 95.8 240.5 (24.2) 159.4
Freeway 11.3 16.4 12.8 15.4 31.0 (0.7) 20.0
Pong -19.2 -19.0 -19.6 -19.9 4.9 (6.6) -17.4
Q*Bert 613.5 497.2 326.3 529.1 1099.8 (740.0) 960.3
Seaquest 579.0 664.8 421.9 508.5 1048.5 (321.2) 675.5
Space Invaders 203.6 250.1 239.1 222.2 384.7 (87.2) 267.9
Alien 939.2 893.4 623.6 510.2 1563.8 (529.6) 103.2
Amidar 64.9 103.4 67.9 45.1 174.7 (63.0) 183.6
Assault 465.8 378.4 371.7 628.0 472.9 (103.7) 537.0
Asteroids 829.7 800.3 744.5 590.7 1281.6 (182.5) 89.0
Atlantis 62687.0 25375.0 20857.3 17593.9 33328.9 (9787.2) 852.9
Bank Heist 98.8 71.1 51.4 64.6 214.5 (128.2) 67.4
Battle Zone 15534.3 12750.8 0.0 14548.1 18433.5 (4473.2) 16.2
Berzerk 329.2 491.3 329.0 441.0 471.6 (100.4) -
Bowling 28.5 43.9 35.2 26.1 36.9 (11.7) 36.4
Boxing -2.8 15.5 12.4 10.5 8.3 (5.8) 9.8
Centipede 7725.5 8803.8 6210.6 6161.6 12258.6 (2064.0) 4647.0
Chopper Command 1191.4 1581.5 1349.0 943.0 2180.1 (584.5) 16.9
Crazy Climber 6303.1 7455.6 4552.9 20453.7 24669.0 (8526.0) 149.8
Demon Attack 520.5 318.5 208.8 355.8 664.2 (108.9) 0.0
Double Dunk -15.8 -13.1 -23.2 -21.6 -6.7 (2.5) -16.0
Elevator Action 3025.2 2377.6 4.6 3220.6 6879.5 (3610.5) -
Fishing Derby -92.6 -92.1 -89.5 -93.2 -87.8 (4.5) -85.1
Frostbite 161.0 161.1 176.6 216.9 768.8 (747.5) 180.9
Gopher 545.8 1288.3 295.7 941.8 5028.8 (1159.3) 2368
Gravitar 185.3 251.1 197.4 105.9 613.8 (142.3) 429
H.E.R.O. 6053.1 6458.8 2719.8 3835.8 8086.0 (2034.2) 7295
Ice Hockey -13.9 -14.8 -18.9 -15.1 0.6 (1.9) -3.2
James Bond 197.3 202.8 17.3 77.1 336.6 (107.3) 354.1
Journey Escape -8441.0 -14730.7 -9392.2 -13898.9 -3953.9 (2968.2) -
Kangaroo 962.4 1622.1 457.9 256.4 1337.6 (628.2) 8.8
Krull 2823.3 3371.5 2350.9 2798.1 4446.6 (999.3) 3341.0
Kung-Fu Master 16416.2 19544.0 3207.0 8715.6 25255.8 (3848.9) 29151.0
Montezuma's Revenge 10.7 0.1 0.0 0.1 0.1 (0.3) 259.0
Ms. Pac-Man 1537.2 1691.8 999.6 1070.8 3494.1 (471.3) 1227.0
Name This Game 1818.9 2386.8 1951.0 2029.8 6857.4 (477.1) 2247.0
Pooyan 800.3 1018.9 402.7 1225.3 1461.7 (192.3) -
Private Eye 81.9 100.7 -23.0 684.3 148.8 (151.1) 86.0
River Raid 1708.9 1438.0 0.0 1904.3 4228.0 (697.8) 2650.0
Road Runner 67.7 65.2 21.4 42.0 17891.9 (3795.6) 89.1
Robotank 12.8 10.1 9.3 10.8 18.5 (1.5) 12.4
Star Gunner 850.2 1069.5 1002.2 722.9 1040.5 (63.1) 9.4
Tennis -0.2 -0.1 -0.1 -0.1 0.0 (0.0) 0.0
Time Pilot 1728.2 2299.5 0.0 2429.2 1813.8 (696.4) 24.9
Tutankham 40.7 52.6 0.0 85.2 119.0 (31.7) 98.2
Up 'n Down 3532.7 3351.0 2473.4 2475.1 7737.7 (1936.2) 2449.0
Venture 0.0 66.0 0.0 0.0 0.0 (0.0) 0.6
Video Pinball 15046.8 12574.2 10779.5 9813.9 18086.3 (5195.4) 19761.0
Wizard of Wor 1768.8 1981.3 935.6 945.5 1900.6 (619.4) 36.9
Zaxxon 1392.0 2069.1 69.8 3365.1 4465.1 (2856.4) 21.4
Times Best 2 7 0 3 41


Table 4: Comparison between the three proposed feature sets. See Section 3.1 for details. The last column presents the average performance of Blob-PROST using the full action set for comparison to future work (see Section 7). The standard deviation reported represents the variability over the 24 independent learning trials. The highest average among the first three columns is in bold face. We use † to denote statistically different averages between B-PROST and Blob-PROST, which were obtained with Welch's t-tests (p < 0.05).

Game    B-PROS Avg. (std. dev.)    B-PROST Avg. (std. dev.)    Blob-PROST (min.) Avg. (std. dev.)    Blob-PROST (full) Avg. (std. dev.)

Asterix 8194.3 (1802.9) 8928.8† (1714.5) 4400.8 (1201.3) 3996.6 (743.9)
Beam Rider 1686.9 (255.2) 1808.9 (309.2) 1902.3 (423.9) 2367.3 (815.0)

Breakout 7.1 (1.7) 15.0 (4.7) 46.7† (45.7) 52.9 (38.0)

Enduro 34.9 (71.5) 207.7 (23.1) 257.0† (79.8) 296.7 (7.8)

Freeway 23.8 (6.7) 29.1 (6.3) 31.5† (1.2) 32.3 (0.5)

Pong 10.9 (5.2) 18.9 (1.3) 20.1† (0.5) 20.2 (0.4)

Q*Bert 3647.8 (1273.3) 3608.7 (1129.0) 6946.8† (3036.4) 8072.4 (2210.5)
Seaquest 1366.1 (445.9) 1636.5 (519.5) 1664.2 (440.4) 1664.2 (440.4)

Space Invaders 505.1 (130.4) 582.9 (139.0) 723.1† (101.5) 844.8 (144.9)

Alien 3169.5 (616.8) 4074.5 (533.7) 4154.8 (532.5) 4154.8 (532.5)

Amidar 337.9 (77.8) 308.2 (121.3) 486.7† (195.4) 408.4 (177.5)

Assault 1329.4 (215.3) 1485.9† (416.9) 1356.0 (299.6) 1107.9 (207.7)

Asteroids 1145.7 (139.9) 1306.7 (198.6) 1673.5† (211.1) 1759.5 (182.1)
Atlantis 33465.9 (13879.0) 19195.9 (15223.3) 19347.2 (10478.5) 37428.5 (11599.7)

Bank Heist 227.7 (127.8) 332.5 (53.1) 463.4† (93.6) 463.4 (93.6)
Battle Zone 22794.8 (4643.4) 25321.8 (5063.5) 26222.8 (4070.0) 26222.8 (4070.0)

Bowling 36.5 (10.6) 48.0 (20.0) 59.1† (18.6) 65.9 (14.2)

Boxing 27.2 (30.1) 26.1 (25.6) 89.4† (16.5) 89.4 (16.5)
Carnival - - - - - - 4322.0 (3705.0)

Centipede 18871.4 (2624.9) 2579.9 (4848.7) 3903.3 (6838.8) 3903.3 (6838.8)

Chopper Command 3884.9 (872.5) 4072.3† (931.0) 3006.6 (782.0) 3006.6 (782.0)

Crazy Climber 24763.4 (8080.5) 34486.7 (13056.1) 59514.3† (13721.0) 73241.5 (10467.9)

Demon Attack 2085.4 (532.5) 2441.9† (571.9) 1745.8 (256.0) 1441.8 (184.8)

Double Dunk -5.5 (3.0) 5.9† (13.3) -6.4 (0.9) -6.4 (0.9)

Fishing Derby -86.5 (5.5) -82.8 (6.2) -58.8† (12.0) -58.8 (12.0)

Frostbite 1974.9 (1046.3) 2382.9 (1182.8) 3389.7† (743.6) 3389.7 (743.6)
Gopher 3784.7 (670.7) 5272.4 (1077.0) 5076.6 (1235.1) 6823.4 (1022.5)

Gravitar 778.2 (191.8) 788.7 (350.7) 1231.8† (423.4) 1231.8 (423.4)
H.E.R.O. 9572.1 (4160.2) 13552.5 (1867.4) 13690.3 (4291.2) 13690.3 (4291.2)

Ice Hockey 2.1 (1.7) 7.3 (2.3) 14.5† (3.4) 14.5 (3.4)

James Bond 378.0 (128.9) 373.5 (158.3) 636.3† (191.8) 636.3 (191.8)

Kangaroo 4946.0 (2694.3) 5898.4† (3166.3) 3800.3 (2211.0) 3800.3 (2211.0)

Krull 5247.1 (957.4) 5973.8 (904.6) 8333.9† (5599.5) 8333.9 (5599.5)
Kung-Fu Master 34025.7 (5994.3) 35173.0 (6190.8) 34075.6 (7716.7) 33868.5 (6247.5)

Montezuma’s Revenge 300.3 (155.7) 345.8 (433.6) 778.1† (789.8) 778.1 (789.8)

Ms. Pac-Man 4402.9 (905.4) 5544.1† (727.9) 4625.6 (774.3) 4625.6 (774.3)

Name This Game 5697.7 (809.7) 6593.1† (750.9) 5883.7 (1542.1) 6580.1 (1773.0)
Pooyan - - - - - - 2228.1 (274.5)

Private Eye 106.3 (42.9) 502.7† (1842.1) 33.0 (47.6) 33.0 (47.6)
River Raid 9171.4 (973.6) 9928.8 (1443.6) 10629.1 (2110.5) 10629.1 (2110.5)

Road Runner 25342.7 (4168.4) 32282.9† (5707.8) 24558.3 (12369.2) 24558.3 (12369.2)

Robotank 21.4 (1.3) 22.6 (2.4) 28.3† (2.7) 28.3 (2.7)
Skiing - - - - - - -29842.6 (19.8)

Star Gunner 1355.3 (343.8) 1020.1 (304.8) 1227.7† (165.1) 1227.7 (165.1)
Tennis -0.2 (0.8) 0.0 (0.0) 0.0 (0.0) 0.0 (0.0)

Time Pilot 2501.2 (943.4) 2154.6 (875.4) 3933.1† (713.9) 3972.0 (878.3)

Tutankham 110.3 (40.8) 90.0† (52.4) 61.2 (61.3) 81.4 (50.5)
Up’n Down 6140.4 (5418.7) 7888.2 (5950.1) 12197.0 (14027.3) 19533.0 (18733.6)

Venture 97.3 (223.7) 182.1 (370.1) 244.5 (490.4) 244.5 (490.4)

Video Pinball 14239.4 (4314.0) 19358.3† (3006.2) 10006.3 (5860.8) 9783.9 (3043.9)
Wizard of Wor 2500.8 (754.7) 2893.6 (842.6) 3174.8 (1227.9) 2733.6 (1259.0)

Zaxxon 10993.8 (2118.2) 11599.8† (2660.7) 8204.4 (2845.3) 8204.4 (2845.3)

Times best 4 14.5 30.5
Times statistically best N/A 14 23
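The significance marker in Table 4 corresponds to Welch’s (unequal-variance) t-test applied, for each game, to the 24 per-trial scores of B-PROST and Blob-PROST. The Python sketch below illustrates this check under our own assumptions: the function name and the example trial scores are ours for illustration and are not taken from the paper.

from scipy import stats  # Welch's t-test via ttest_ind(equal_var=False)

def statistically_different(trials_a, trials_b, alpha=0.05):
    # Welch's t-test: compares the two means without assuming equal variances.
    _, p_value = stats.ttest_ind(trials_a, trials_b, equal_var=False)
    return p_value < alpha

# Hypothetical per-trial averages for one game, 24 independent trials each.
b_prost_trials = [15.0, 14.2, 16.8, 13.9, 15.6, 14.8] * 4
blob_prost_trials = [46.7, 52.1, 39.8, 44.5, 48.2, 45.3] * 4
print(statistically_different(b_prost_trials, blob_prost_trials))  # True for these values

The default two-sided test at alpha = 0.05 matches the p < 0.05 threshold stated in the caption.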


Table 5: The first column presents the average performance of Blob-PROST using the minimal action set. The standard deviation reported represents the variability over the 24 independent learning trials. The rest of the columns present our comparison between Blob-PROST and DQN (see Section 6 for details). In these results the Blob-PROST agent uses the minimal action set, for the sake of comparison. The standard deviation reported for the best trial as well as for the median trial is the standard deviation while evaluating for 499 episodes. We use bold face to denote values that are higher than DQN’s and † to denote Blob-PROST averages that are statistically significantly different from DQN’s average. To perform the comparison for each game we used Welch’s t-test (p < 0.025, which accounts for DQN’s results being used in two tests, for an overall significance level of 0.05); a sketch of this procedure is given after the table.

Game    Avg. (std. dev.)    Best trial (std. dev.)    Middle trial (std. dev.)    % trials > DQN    DQN (std. dev.)    Human    Random

Asterix 4400.8 (1201.3) 6921.5 (4501.2) 4472.9† (2514.8) 8.3 6012.0 (1744.0) 8503.0 210.0

Beam Rider 1902.3 (423.9) 2965.5† (1376.1) 1807.6† (657.0) 0.0 6846.0 (1619.0) 5775.0 363.9

Breakout 46.7 (45.7) 190.3† (179.7) 33.2† (21.3) 0.0 401.2 (26.9) 31.8 1.7

Enduro 257.0 (79.8) 299.1 (26.9) 279.2† (28.2) 0.0 301.8 (24.6) 309.6 0.0

Freeway 31.5 (1.2) 32.6† (1.0) 31.7† (1.0) 95.8 30.3 (0.7) 29.6 0.0

Pong 20.1 (0.5) 20.5† (0.8) 20.2† (1.4) 95.8 18.9 (1.3) 9.3 -20.7

Q*Bert 6946.8 (3036.4) 14449.4† (4111.7) 5881.4† (1069.0) 12.5 10596.0 (3294.0) 13455.0 163.9

Seaquest 1664.2 (440.4) 2278.0† (294.1) 1845.1† (469.8) 0.0 5286.0 (1310.0) 20182.0 68.4

Space Invaders 723.1 (101.5) 889.8† (374.5) 712.9† (250.3) 0.0 1976.0 (893.0) 1652.0 148.0

Alien 4154.8 (532.5) 4886.6† (1151.9) 4219.1† (635.4) 95.8 3069.0 (1093.0) 6875.0 227.8

Amidar 486.7 (195.4) 825.4 (225.3) 522.3 (153.1) 12.5 739.5 (3024.0) 1676.0 5.8

Assault 1356.0 (299.6) 1829.3† (702.9) 1380.8† (449.9) 0.0 3359.0 (775.0) 1496.0 224.4

Asteroids 1673.5 (211.1) 2229.9† (937.0) 1635.1 (617.9) 50.0 1629.0 (542.0) 13157.0 719.1

Atlantis 19347.2 (10478.5) 42937.7† (13763.1) 19983.2† (5830.4) 0.0 85641.0 (17600.0) 29028.0 12850.0

Bank Heist 463.4 (93.6) 793.6† (102.9) 455.1 (100.9) 62.5 429.7 (650.0) 734.4 14.2

Battle Zone 26222.8 (4070.0) 37850.0† (7445.0) 25836.0 (6616.3) 41.7 26300.0 (7725.0) 37800.0 2360.0

Bowling 59.1 (18.6) 91.1† (13.6) 62.0 (2.5) 91.7 42.4 (88.0) 154.8 23.1

Boxing 89.4 (16.5) 98.3† (3.3) 95.8† (5.5) 87.5 71.8 (8.4) 4.3 0.1

Centipede 3903.3 (6838.8) 21137.0† (6722.8) 426.7† (313.8) 20.8 8309.0 (5237.0) 11963.0 2091.0

Chopper Command 2998.0 (763.1) 4898.9† (2511.2) 2998.0† (1197.7) 0.0 6687.0 (2916.0) 9882.0 811.0

Crazy Climber 59514.3 (13721.0) 81016.0† (26529.2) 59848.9† (27960.7) 0.0 114103.0 (22797.0) 35411.0 10781.0

Demon Attack 1745.8 (256.0) 2166.0† (1327.5) 1774.2† (979.8) 0.0 9711.0 (2406.0) 3401.0 152.1

Double Dunk -6.4 (0.9) -4.1† (2.3) -6.2† (2.8) 100.0 -18.1 (2.6) -15.5 -18.6

Fishing Derby -58.8 (12.0) -28.7† (19.0) -57.4† (18.3) 0.0 -0.8 (19.0) 5.5 -91.7

Frostbite 3389.7 (743.6) 4534.0† (1496.8) 3557.5† (864.3) 100.0 328.3 (250.5) 4335.0 65.2

Gopher 5076.6 (1235.1) 7451.1† (3146.1) 5006.9† (3118.9) 0.0 8520.0 (32.8) 2321.0 257.6

Gravitar 1231.8 (423.4) 1709.7† (669.7) 1390.6† (752.9) 91.7 306.7 (223.9) 2672.0 173.0

H.E.R.O. 13690.3 (4291.2) 20273.1† (1155.1) 13642.1† (48.4) 12.5 19950.0 (158.0) 25673.0 1027.0

Ice Hockey 14.5 (3.4) 22.8† (6.5) 14.5† (5.5) 100.0 -1.6 (2.5) 0.9 -11.2

James Bond 636.3 (191.8) 1030.5† (947.1) 587.8 (99.5) 50.0 576.7 (175.5) 406.7 29.0

Kangaroo 3800.3 (2211.0) 9492.8† (2918.2) 3839.1† (1601.1) 8.3 6740.0 (2959.0) 3035.0 52.0

Krull 8333.9 (5599.5) 33263.4† (15403.3) 7463.9† (1719.7) 95.8 3805.0 (1033.0) 2395.0 1598.0

Kung-Fu Master 34075.6 (7716.7) 51007.6† (9131.9) 33232.1† (7904.5) 91.7 23270.0 (5955.0) 22736.0 258.5

Montezuma’s Revenge 778.1 (789.8) 2508.4† (27.8) 400.0† (0.0) 100.0 0.0 (0.0) 4367.0 0.0

Ms. Pac-Man 4625.6 (774.3) 5917.9† (1984.0) 4667.4† (2126.0) 100.0 2311.0 (525.0) 15693.0 307.3

Name This Game 5883.7 (1542.1) 7787.0† (744.7) 6387.1† (1046.7) 20.8 7257.0 (547.0) 4076.0 2292.0

Private Eye 33.0 (47.6) 100.3 (4.0) 0.0 (0.0) 0.0 1788.0 (5473.0) 69571.0 24.9

River Raid 10629.1 (2110.5) 14583.3† (3078.3) 9540.4† (1053.8) 91.7 8316.0 (1049.0) 13513.0 1339.0

Road Runner 24572.3 (12381.1) 41828.0† (9408.1) 28848.5† (6857.4) 75.0 18257.0 (4268.0) 7845.0 11.5

Robotank 28.3 (2.7) 34.4† (7.7) 28.5† (8.0) 0.0 51.6 (4.7) 11.9 2.2

Star Gunner 1227.7 (165.1) 1651.6† (435.4) 1180.0† (85.2) 0.0 57997.0 (3152.0) 10250.0 664.0

Tennis 0.0 (0.0) 0.0† (0.2) 0.0† (0.2) 100.0 -2.5 (1.9) -8.9 -23.8

Time Pilot 3933.1 (713.9) 5429.5 (1659.7) 4069.3† (647.7) 0.0 5947.0 (1600.0) 5925.0 3568.0

Tutankham 61.2 (61.3) 217.7† (33.9) 41.7† (4.2) 8.3 186.7 (41.9) 167.7 11.4

Up and Down 12197.0 (14027.3) 41257.8† (12061.8) 3716.7† (1050.8) 45.8 8456.0 (3162.0) 9082.0 533.4

Venture 244.5 (490.4) 1397.0† (294.4) 0.0† (0.0) 20.8 380.0 (238.6) 1188.0 0.0

Video Pinball 10006.3 (5860.8) 21313.0† (14916.4) 9480.7† (6563.1) 0.0 42684.0 (16287.0) 17298.0 16257.0

Wizard of Wor 3174.8 (1227.9) 5681.2† (4573.2) 3442.9 (2582.0) 50.0 3393.0 (2019.0) 4757.0 563.5

Zaxxon 8204.4 (2845.3) 11721.8† (4009.7) 9280.0† (2251.4) 87.5 4977.0 (1235.0) 9173.0 32.5

Times > DQN N/A 32/49 21/49
Times statist. ≥ DQN N/A 35/49 24/49
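Because DQN’s reported score enters two comparisons per game (against the best trial and against the median trial), each test in Table 5 is run at a Bonferroni-corrected threshold of 0.05 / 2 = 0.025. Since only summary statistics are available on each side, the test can be computed directly from means, standard deviations, and episode counts, as in the sketch below; the 499-episode count follows the caption, while the DQN episode count and the example scores are our own illustrative assumptions, not values from the paper.

from scipy import stats

ALPHA = 0.05 / 2  # DQN's score is reused in two tests per game (best and median trial)

def differs_from_dqn(mean, std, episodes, dqn_mean, dqn_std, dqn_episodes):
    # Welch's t-test computed from reported summary statistics.
    _, p_value = stats.ttest_ind_from_stats(
        mean, std, episodes, dqn_mean, dqn_std, dqn_episodes, equal_var=False)
    return p_value < ALPHA

# Illustrative check in the spirit of the Breakout row (DQN episode count assumed).
print(differs_from_dqn(190.3, 179.7, 499, 401.2, 26.9, 30))

Correcting the per-test threshold in this way keeps the overall significance level at 0.05, as described in the caption.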

