
DISENTANGLING CONTROLLABLE OBJECT THROUGH VIDEO PREDICTION IMPROVES VISUAL REINFORCEMENT LEARNING

Yuanyi Zhong Alexander Schwing Jian Peng

University of Illinois at Urbana-Champaign, IL, USA

ABSTRACT

In many vision-based reinforcement learning (RL) problems, the agent controls a movable object in its visual field, e.g., the player's avatar in video games and the robotic arm in visual grasping and manipulation. Leveraging action-conditioned video prediction, we propose an end-to-end learning framework to disentangle the controllable object from the observation signal. The disentangled representation is shown to be useful for RL as additional observation channels to the agent. Experiments on a set of Atari games with the popular Double DQN algorithm demonstrate improved sample efficiency and game performance (from 222.8% to 261.4% measured in normalized game scores, with the prediction bonus reward).

Index Terms— reinforcement learning, video prediction, representation learning, sample efficiency

1. INTRODUCTION

Recent advances in deep reinforcement learning (RL) have significantly improved the state of the art on several challenging visual decision making and control tasks, such as robot grasping [1, 2] and video game playing in Atari games [3], StarCraft II [4] and Dota 2 [5]. Modern deep RL solutions typically adopt a deep neural network that directly maps the raw sensory input signal to either action probabilities (as in policy optimization) or values (as in value-based RL).

It has been postulated that behind the success of deep learning is its ability to learn good representations [6], and furthermore, that a good representation for signal processing should disentangle the factors of variation [7]. We argue in this paper that the agent's controllable object is an important factor to be isolated from the observation space in vision-based (visual) RL. Indeed, such controllable objects are prevalent in both real and artificial worlds. In many real-world vision-based control problems, such as autonomous driving and robotic arm manipulation [1], there is typically an object in the visual field that the agent can control. In computer games, the player often controls an avatar on screen that follows the player's commands, as illustrated in Fig. 1.

[Fig. 1 diagram: observation o_{t-1}, actions (up, left, down-right), observation o_t decomposed into a controllable part I^c_t = m_t ⊙ o_t and an uncontrollable part I^u_t = (1 − m_t) ⊙ o_t.]

Fig. 1. An illustration of controllable object disentanglement in the Atari game Seaquest. The location of the submarine in the future frame depends on the agent's action, whereas the environmental objects are not affected by actions.

To achieve the goal of disentangling the controllable objects, we propose a self-supervised learning approach based on action-conditional future frame prediction. By carefully designing the network architecture, the training objective and the regularization terms, the model can be trained effectively without any ground-truth annotations, at the same time as the agent's policy learning.

The disentanglement of the controllable object is useful for reinforcement learning. We applied our approach to a suite of Atari 2600 games with the Double DQN [8] algorithm, where the agent's input was augmented with the controllable-object images. We first verified that our model was able to successfully discover the controllable-object regions in those games. Second, the augmented agents achieved both better sample efficiency and higher final test scores compared to the vanilla DDQN agents, which take in only the original game frames. Further improvement was obtained by utilizing the prediction error as a bonus reward.

Our contributions are summarized as follows:
1. Formulating a video prediction model specifically for visual RL to disentangle the controllable objects in a self-supervised manner;
2. Introducing novel training techniques to learn the prediction model;
3. Incorporating the prediction model and the disentangled representation with DDQN and demonstrating significant improvement on several Atari games.



[Fig. 2 diagram: frames o_{t-4}, …, o_{t-1} feed a Conv encoder; one Deconv branch outputs I^u_t, another Deconv branch merges the action embedding of a_{t-1} and outputs I^c_t; the frame o_t feeds a separate Conv/Deconv branch that outputs the mask m_t.]

Fig. 2. Prediction network structure. o_{t-4}, . . . , o_t are the frames at time t − 4 to t, and a_{t-1} is the action taken after observing o_{t-1}. The controllable image part I^c_t, the uncontrollable part I^u_t and the mask m_t are predicted by three neural network branches.

2. FRAME PREDICTION MODEL

The problem of disentangling the controllable image region is formalized as judging whether a pixel (x, y) belongs to an object controlled by the agent on a target frame o_t at time t. Denoting by m_t(x, y) the probability that pixel (x, y) belongs to the controllable region, m_t is essentially an attention mask on the image o_t. The frame o_t is then decomposed into two non-intersecting parts as in Eq. 1: the element-wise product m_t ⊙ o_t gives the controllable object region, and (1 − m_t) ⊙ o_t gives the remaining uncontrollable regions.

o_t(x, y) = m_t(x, y) o_t(x, y) + (1 − m_t(x, y)) o_t(x, y).   (1)

The key idea behind our approach is to utilize action-conditioned future frame prediction, using the action information as a bottleneck supervision signal to learn the disentanglement. Specifically, we divide the prediction model into three branches: (1) one that predicts the action-related image m_t ⊙ o_t given the action a_{t-1} and the previous frames o_{t-4}, . . . , o_{t-1}; (2) one that predicts the rest of the image (1 − m_t) ⊙ o_t given only the previous frames; (3) a mask prediction m_t = m(o_t) given the frame o_t, which combines branches (1) and (2).

The detailed structure is shown in Fig. 2. The model operates on gray-scale 84×84 images. The Conv block contains 3 convolutional layers with 16, 32 and 32 channels, kernel size 6 and stride 2, encoding the previous frames into a 10×10×32 feature map. The Deconv block is symmetric to the Conv block. The embedding of the action a_{t-1} has size 10×10×8 and is merged with the image feature map via a convolutional layer (kernel size 5, stride 1). The mask prediction m(o_t) has a final sigmoid activation; everything else uses ReLU activations.
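
To make the architecture concrete, the following is a minimal PyTorch-style sketch of the three-branch predictor described above. Layer sizes follow the text (three 6×6 stride-2 conv layers with 16/32/32 channels, a 10×10×8 action embedding merged by a 5×5 conv, sigmoid on the mask branch, ReLU elsewhere); the padding values, the shared history encoder, and names such as PredictionNet are our own assumptions rather than the authors' released code.

```python
import torch
import torch.nn as nn

def conv_encoder(in_ch):
    # 3 conv layers, 16/32/32 channels, kernel 6, stride 2; padding=2 is our guess
    # so that 84x84 inputs reach the 10x10x32 feature map stated in the text.
    return nn.Sequential(
        nn.Conv2d(in_ch, 16, 6, stride=2, padding=2), nn.ReLU(),
        nn.Conv2d(16, 32, 6, stride=2, padding=2), nn.ReLU(),
        nn.Conv2d(32, 32, 6, stride=2, padding=2), nn.ReLU(),
    )

def deconv_decoder(out_act):
    # mirror of the encoder: 10 -> 21 -> 42 -> 84 (output_padding chosen to invert shapes)
    return nn.Sequential(
        nn.ConvTranspose2d(32, 32, 6, stride=2, padding=2, output_padding=1), nn.ReLU(),
        nn.ConvTranspose2d(32, 16, 6, stride=2, padding=2), nn.ReLU(),
        nn.ConvTranspose2d(16, 1, 6, stride=2, padding=2), out_act,
    )

class PredictionNet(nn.Module):  # hypothetical name
    def __init__(self, num_actions):
        super().__init__()
        self.hist_enc = conv_encoder(4)            # encodes o_{t-4..t-1}
        self.mask_enc = conv_encoder(1)            # encodes o_t for the mask branch
        self.act_emb = nn.Embedding(num_actions, 8 * 10 * 10)  # 10x10x8 action embedding
        self.merge = nn.Conv2d(32 + 8, 32, 5, stride=1, padding=2)  # fuse action with features
        self.dec_c = deconv_decoder(nn.ReLU())     # controllable part I^c_t
        self.dec_u = deconv_decoder(nn.ReLU())     # uncontrollable part I^u_t
        self.dec_m = deconv_decoder(nn.Sigmoid())  # mask m_t

    def forward(self, prev_frames, action, cur_frame):
        h = self.hist_enc(prev_frames)                 # (B, 32, 10, 10)
        a = self.act_emb(action).view(-1, 8, 10, 10)   # (B, 8, 10, 10)
        i_c = self.dec_c(torch.relu(self.merge(torch.cat([h, a], dim=1))))
        i_u = self.dec_u(h)
        m = self.dec_m(self.mask_enc(cur_frame))
        return i_c, i_u, m
```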

2.1. Loss Functions

We propose the following loss function L (Eq. 2) for training,

L = L_masked + L_recon + λ1 L_1 + λ2 L_act_pred + λ3 L_flow   (2)

L_masked = ‖m_t ⊙ o_t − I^c_t‖² + ‖(1 − m_t) ⊙ o_t − I^u_t‖²   (3)

L_recon = ‖o_t − I^c_t − I^u_t‖²   (4)

where the λ's are coefficients, ⊙ denotes the element-wise product, and ‖·‖ is the L2 norm.

The first two terms are the main components of our objective function. Intuitively, since m is expected to take values near either 0 or 1, L_masked decomposes the target image into the non-intersecting parts m_t ⊙ o_t and (1 − m_t) ⊙ o_t, which serve as training targets for I^c_t and I^u_t respectively. The second term L_recon is the reconstruction error of the whole image. These two terms are consistent, since for a perfect model L_recon ≈ 0, m_t ⊙ o_t ≈ I^c_t and (1 − m_t) ⊙ o_t ≈ I^u_t all hold.

Since the action information is fed only into the I^c_t branch and actions have stochasticity, only the I^c_t branch will learn to account for action-related image changes. Ideally, the I^u_t branch will learn to account for environment or background changes that are unrelated to the agent's action. The mask m_t balances the two, is learnt implicitly, and forces each pixel to be predicted by either I^c_t or I^u_t but not both.

However, the decomposition o_t = I^c_t + I^u_t is still quite arbitrary. For instance, a high-capacity I^c_t branch could implicitly infer the player's action and end up predicting everything else as well. Moreover, action-related pixel changes do not necessarily correspond to controllable objects, e.g., the ghosts chase PacMan in MsPacman. Hence we introduce several necessary regularization terms to refine the predictions.

The L_1 term is the L1 regularization on the pixel values of m_t, L_1 = ‖m_t‖_1, which encourages a sparser mask.

L_act_pred is the inverse-model regularization term in Eq. 5, which tries to predict the action taken between consecutive frames from the mask images of those frames. The masks are thereby encouraged to focus on action-relevant information. The probability p(â_{t-1} = a_{t-1} | m_{t-1}, m_t) is parameterized by a simple 2-layer conv net, where â_{t-1} is the predicted action and a_{t-1} is the ground-truth action.

L_act_pred = −a_{t-1} log p(â_{t-1} = a_{t-1} | m_{t-1}, m_t)   (5)

L_flow is the flow-constraint term in Eq. 6, essentially a smoothness constraint on adjacent masks. Note that in Atari games each action has a specific semantic meaning, e.g., left, right, fire, etc. The intuition is that, if the action defines a desired movement direction, the estimated mask should move continuously in that direction. Here a_x and a_y are the desired increments in the x and y directions.

L_flow = ‖m_t(x + a_x, y + a_y) − m_{t-1}(x, y)‖²   (6)
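
To make the objective concrete, here is a minimal sketch of how Eqs. 2–6 could be assembled, assuming PyTorch, the PredictionNet outputs from the previous sketch, a hypothetical two-layer conv inverse_model returning action logits, and a lookup table flow_offsets mapping each action to its intended (a_x, a_y) displacement. The mean-vs-sum reductions and the roll-based shift are simplifications, not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def prediction_loss(i_c, i_u, m_t, m_tm1, o_t, action, inverse_model,
                    flow_offsets, lam1=0.001, lam2=0.1, lam3=0.01):
    """Sketch of Eqs. 2-6; frame/mask tensors are (B, 1, 84, 84), `action` is (B,)."""
    # Eq. 3: each branch regresses its masked share of the target frame.
    l_masked = F.mse_loss(i_c, m_t * o_t) + F.mse_loss(i_u, (1 - m_t) * o_t)
    # Eq. 4: together the two branches should reconstruct the whole frame.
    l_recon = F.mse_loss(i_c + i_u, o_t)
    # L1 sparsity on the mask (summed over pixels, averaged over the batch).
    l_sparse = m_t.abs().sum(dim=(1, 2, 3)).mean()
    # Eq. 5: inverse model predicts the action from consecutive masks (cross-entropy).
    logits = inverse_model(torch.cat([m_tm1, m_t], dim=1))
    l_act = F.cross_entropy(logits, action)
    # Eq. 6: the mask should move by the (a_x, a_y) displacement the action implies.
    offsets = flow_offsets[action]  # (B, 2) integer shifts, one per sampled action
    l_flow = 0.0
    for b in range(m_t.shape[0]):
        ax, ay = int(offsets[b, 0]), int(offsets[b, 1])
        # roll so that shifted(x, y) = m_t(x + a_x, y + a_y)
        shifted = torch.roll(m_t[b], shifts=(-ay, -ax), dims=(-2, -1))
        l_flow = l_flow + F.mse_loss(shifted, m_tm1[b])
    l_flow = l_flow / m_t.shape[0]
    return l_masked + l_recon + lam1 * l_sparse + lam2 * l_act + lam3 * l_flow
```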

3. APPLICATION TO DEEP Q-LEARNING

In principle, the prediction model can readily be used alongside any RL algorithm, such as DQN [3], A3C [9], TRPO [10], etc. Since deep Q-learning methods already maintain an experience replay buffer, it is convenient to reuse the same buffer to train the prediction model. During learning, a random minibatch of transitions is sampled at every training step to train the prediction model with the aforementioned losses. The Q network is then updated with the usual Bellman loss.
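
A minimal sketch of one such training iteration, sharing a single replay buffer between the two updates; replay_buffer, q_update, the optimizer objects and the batch fields are placeholder names, not the paper's code.

```python
# One training iteration: update the prediction model and the Q network
# from the same replay minibatch (placeholder names throughout).
for step in range(num_train_steps):
    batch = replay_buffer.sample(32)     # frames o_{t-4..t}, action a_{t-1}, reward, done
    # (1) prediction-model update with the losses of Sec. 2.1
    i_c, i_u, m_t = pred_net(batch.prev_frames, batch.action, batch.obs)
    _, _, m_tm1 = pred_net(batch.prev_frames2, batch.prev_action, batch.prev_obs)
    loss = prediction_loss(i_c, i_u, m_t, m_tm1, batch.obs, batch.action,
                           inverse_model, flow_offsets)
    pred_opt.zero_grad(); loss.backward(); pred_opt.step()
    # (2) standard Double DQN Bellman update on the same transitions
    q_update(q_net, target_net, batch, q_opt)
```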


[Fig. 3 diagram: two input streams, s_t = (o_{t-3}, . . . , o_t) and (m_{t-3} ⊙ o_{t-3}, . . . , m_t ⊙ o_t), each passing through Conv 8×8×32 s4 and Conv 4×4×32 s2 with shared parameters; the streams are concatenated, followed by Conv 3×3×64 s1, Linear 512, and the outputs Q(s_t, a_1), Q(s_t, a_2), . . .]

Fig. 3. Q network structure. The original frames (on the left) and the controllable-object masked images (on the right) are fed into two network streams with shared parameters. A late fusion is done before predicting Q values. We mark the conv parameters as kernel × kernel × channels, s stride.

Incorporating the prediction model and using the disentangled image representation as input augmentation to the Double DQN agent [8] yields the DDQN+Pred model. We further show that the prediction model can also generate a useful exploration bonus signal, resulting in the DDQN+Pred+Bonus model.

3.1. Augmented Q Network Structure (DDQN+Pred)

In DDQN+Pred, we feed the disentangled representation m_{t-3:t} ⊙ o_{t-3:t} as additional input to the Q network. The frame-stacking depth of 4 is conventional [3]. The modified Q network structure is shown in Fig. 3. The representations are merged in a late-fusion manner, as we find it works better than the early-fusion counterpart. We keep the model size comparable to that of the original DDQN by halving the number of channels and sharing the parameters of the two streams.
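
For concreteness, a minimal sketch of the late-fusion, two-stream Q network of Fig. 3, assuming PyTorch; the exact placement of the fusion point and the class name AugmentedQNet are our reading of the figure rather than released code.

```python
import torch
import torch.nn as nn

class AugmentedQNet(nn.Module):  # hypothetical name
    """Two streams with shared conv weights; late fusion before the Q head."""
    def __init__(self, num_actions):
        super().__init__()
        # Shared stream, with channels roughly halved relative to the standard DQN trunk.
        self.stream = nn.Sequential(
            nn.Conv2d(4, 32, 8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 32, 4, stride=2), nn.ReLU(),
        )
        self.fuse = nn.Sequential(
            nn.Conv2d(64, 64, 3, stride=1), nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(), nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, num_actions),
        )

    def forward(self, frames, masked_frames):
        # frames:        stacked o_{t-3..t},              shape (B, 4, 84, 84)
        # masked_frames: stacked m_{t-3..t} * o_{t-3..t}, same shape
        f1 = self.stream(frames)
        f2 = self.stream(masked_frames)         # shared parameters
        fused = self.fuse(torch.cat([f1, f2], dim=1))
        return self.head(fused)                 # Q(s_t, a) for every action a
```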

3.2. Prediction Error as Bonus (DDQN+Pred+Bonus)

Following the argument suggested in prior works [11, 12], the error of a predictive model can be used as a novelty measure to incentivize exploration during the early stage of learning. At time t, the bonus-augmented reward r̃_t writes as

r̃_t = r_t + (β / t) e_t(o_t, a_t),   e_t(o_t, a_t) = (1 / N_pix) ‖m_t ⊙ o_t − I^c_t‖²   (7)

where e_t is the prediction error signal defined in Eq. 7, N_pix is the number of pixels per image, and β is a scaling constant.

We choose to include only the action-related part of the prediction error rather than the whole residual ‖o_t − I^c_t − I^u_t‖², mainly because the former contains more information about how well the agent believes it is controlling its avatar. Irrelevant background changes of the frame are suppressed in this way, as those changes might not be beneficial for exploration.
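
A small sketch of the bonus computation of Eq. 7, assuming the β/t decay as read off the equation and single-frame tensors from the prediction model; the function name and arguments are ours.

```python
def bonus_reward(r_t, o_t, i_c, m_t, t, beta=0.5):
    """Augment the environment reward with the action-related prediction error (Eq. 7)."""
    n_pix = o_t.numel()                                    # N_pix: pixels per image
    e_t = ((m_t * o_t - i_c) ** 2).sum().item() / n_pix    # action-related prediction error
    return r_t + (beta / t) * e_t                          # bonus decays with time step t
```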

Fig. 4. Prediction results for Seaquest, Breakout and MsPacman (rows). First column: original frames o_t; second column: I^c_t; third column: I^u_t; last column: mask m_t. The game frames are pre-processed into gray-scale.

4. EXPERIMENTAL RESULTS

4.1. Qualitative Results of Prediction

Fig. 4 qualitatively shows the action-related controllable-part predictions I^c, the action-unrelated uncontrollable-part predictions I^u, and the predicted masks m that separate out the controllable region, on three games: Seaquest, Breakout and MsPacman. Seaquest is a game in which the player controls a submarine to attack enemies while rescuing divers. The proposed predictive model successfully identifies the submarine, as opposed to the rest of the image, as the controllable object. Similarly, in Breakout, the predicted mask circles out the paddle at the bottom of the screen that the player can move to catch the ball. In MsPacman, the player controls PacMan to collect rewards along the path and avoid (or attack) enemy ghosts; the predicted mask identifies PacMan successfully. In all games, the predictions I^c and I^u make sense in that only the relevant image regions specified by the mask are generated.

4.2. Agent Performance on Atari Games

We chose 20 games in total to test the hypothesis that our disentangled representation helps visual RL. Some games have the controllable-avatar property (Breakout, Pong, Seaquest, Freeway, Riverraid, StarGunner, etc.), while others (Enduro, TimePilot, Qbert, IceHockey, ChopperCommand) do not: either no apparent avatar exists or every action causes all pixels to change.

We followed the conventional setting of performing a gradient update every 4th frame; thus these models are trained for 2.5M gradient steps in total. The hyper-parameters of the baseline DDQN experiments were set to the recommended default values, except for a slightly prolonged decay of the exploration coefficient ε of the ε-greedy policy.


[Fig. 5 panels: Amidar, Assault, Boxing, Breakout, ChopperCommand, Enduro, Freeway, Gopher, IceHockey, Kangaroo, MsPacman, NameThisGame, Pong, Qbert, Riverraid, RoadRunner, Seaquest, SpaceInvaders, StarGunner, TimePilot; x-axis: frames (up to 1e7), y-axis: episode reward.]

Fig. 5. Training curves on Atari games (10 million game frames). Blue lines: DDQN+Pred+Bonus, green lines: DDQN+Pred, and red lines: DDQN baseline. The plots are averaged over two runs with different random seeds.

The remaining hyper-parameters of our proposed variants were roughly tuned on three games (Breakout, Seaquest, SpaceInvaders), resulting in λ1 = 0.001, λ2 = 0.1, λ3 = 0.01, β = 0.5. The prediction net was trained with the RMSProp optimizer with learning rate 1e−3 and batch size 32. The hyper-parameters were fixed across the different games.

The training curves of all methods (DDQN baseline, DDQN+Pred and DDQN+Pred+Bonus), evaluated in the Arcade Learning Environment [13] on the selected games, are shown in Fig. 5, where the horizontal axis represents time t (number of frames) and the vertical axis represents the episode reward. The curves are smoothed with a sliding window of 100 episodes. In order to obtain a fair quantitative comparison, we compute the following normalized score [8]:

score_normalized = (score_agent − score_random) / (score_human − score_random).
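
For reference, the normalization is a one-liner (a trivial helper of our own, not from the paper):

```python
def normalized_score(score_agent, score_random, score_human):
    """Human- and random-normalized score, in percent."""
    return 100.0 * (score_agent - score_random) / (score_human - score_random)
```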

As can be seen in Fig. 5 and Table 1, our DDQN+Pred agent outperforms the DDQN baseline on multiple games with the controllable-avatar property, including Breakout and Riverraid, and especially Seaquest, Pong and StarGunner, achieving a 252.3% normalized score with a 29.5% absolute improvement (13% relative) on average over the baseline. As expected, on the games without the controllable-avatar property, the proposed DDQN variants achieve roughly the same performance as the baseline, indicating that there is no harm in utilizing our method. The learning is also more data-efficient with the proposed DDQN variants in many games; for example, in Breakout, Pong and Freeway, the episode scores saturate earlier than with the baseline. This can be explained as follows: since we already provide the masked image of the controllable object, the agent can treat it as an existing feature instead of learning this feature from scratch, and thus can quickly discover and make use of the location of the object in the early training stage.

Table 1. Performance on Atari games in normalized scores.

Method             Norm. score   Relative gain
Random             0%            -
DDQN               222.8%        0%
DDQN+Pred          252.3%        +13.2%
DDQN+Pred+Bonus    261.4%        +34.5%


On the other hand, the DDQN+Pred+Bonus model, which further utilizes the described exploration bonus, yields an extra 9.1% improvement in normalized score over the DDQN+Pred model (in total a 38.6% absolute gain over DDQN). Note that the bonus improves performance not only in the games with a controllable avatar, but also in a few games without an explicit controllable avatar. This can be explained as follows: as long as the predicted mask is not completely empty, the prediction error in Eq. 7 is non-zero and carries information about the "familiarity" of the frame to the prediction net, which may correlate with the agent's visitation count of that state.

5. RELATED WORK

Action-conditioned frame prediction has been studied in Atari games [14, 15], in robotic vision [16, 17], and on real-world videos [18], usually with an encode-transform-decode framework. However, these works do not explicitly isolate the controllable object regions, nor do they investigate the possibility of boosting RL with controllable objects. [19] studies the problem of learning contingent regions, whose definition resembles our mask; however, they use a traditional pixel-level transition model, while we adopt an effective deep-learning frame prediction method. The idea of using prediction error as an exploration bonus appears in [12, 11, 20]. A recent work [21] studies contingency awareness to promote better exploration for A2C with a count-based exploration method, but does not investigate using the contingency region directly as input or the prediction error as a bonus, which differs from our approach. Finally, video prediction models can also be used in model-based RL [22, 23, 24]; however, predicting images accurately is hard, which limits the power of model-based RL, while our method only requires predicting the mask image well.

6. CONCLUSIONS

We propose a novel learning framework to disentangle the controllable objects from the visual input signal of RL agents. We tested our approach on Atari games and demonstrated improved performance when using the controllable-object images as additional input, with a further gain from using the prediction error as a bonus reward.


7. REFERENCES

[1] Deirdre Quillen, Eric Jang, Ofir Nachum, Chelsea Finn, Julian Ibarz, and Sergey Levine, "Deep reinforcement learning for vision-based robotic grasping: A simulated comparative evaluation of off-policy methods," in ICRA. IEEE, 2018, pp. 6284–6291.

[2] Frederik Ebert, Chelsea Finn, Sudeep Dasari, Annie Xie, Alex Lee, and Sergey Levine, "Visual foresight: Model-based deep reinforcement learning for vision-based robotic control," arXiv preprint arXiv:1812.00568, 2018.

[3] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, 2015.

[4] "AlphaStar: Mastering the Real-Time Strategy Game StarCraft II," https://deepmind.com/blog/alphastar-mastering-real-time-strategy-game-starcraft-ii/, 2019.

[5] "https://openai.com/five/," 2019.

[6] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton, "Deep learning," Nature, vol. 521, no. 7553, 2015.

[7] Valentin Thomas, Jules Pondard, Emmanuel Bengio, Marc Sarfati, Philippe Beaudoin, Marie-Jean Meurs, Joelle Pineau, Doina Precup, and Yoshua Bengio, "Independently controllable features," RLDM, 2017.

[8] Hado Van Hasselt, Arthur Guez, and David Silver, "Deep reinforcement learning with double q-learning," in AAAI Conference on Artificial Intelligence, 2016.

[9] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu, "Asynchronous methods for deep reinforcement learning," in International Conference on Machine Learning, 2016.

[10] John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz, "Trust region policy optimization," in International Conference on Machine Learning, 2015, pp. 1889–1897.

[11] Bradly C Stadie, Sergey Levine, and Pieter Abbeel, "Incentivizing exploration in reinforcement learning with deep predictive models," NIPS 2015 Workshop on Deep Reinforcement Learning, 2015.

[12] Deepak Pathak, Pulkit Agrawal, Alexei A Efros, and Trevor Darrell, "Curiosity-driven exploration by self-supervised prediction," in International Conference on Machine Learning, 2017, pp. 2778–2787.

[13] M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling, "The arcade learning environment: An evaluation platform for general agents," Journal of Artificial Intelligence Research, vol. 47, pp. 253–279, Jun 2013.

[14] Junhyuk Oh, Xiaoxiao Guo, Honglak Lee, Richard L. Lewis, and Satinder P. Singh, "Action-conditional video prediction using deep networks in Atari games," in Neural Information Processing Systems, 2015.

[15] Felix Leibfried, Nate Kushman, and Katja Hofmann, "A deep learning approach for joint video frame and reward prediction in Atari games," ICML 2017 Workshop on Principled Approaches to Deep Learning, 2016.

[16] Chelsea Finn, Ian J. Goodfellow, and Sergey Levine, "Unsupervised learning for physical interaction through video prediction," in Neural Information Processing Systems, 2016, pp. 64–72.

[17] Chelsea Finn and Sergey Levine, "Deep visual foresight for planning robot motion," in ICRA. IEEE, 2017.

[18] Carl Vondrick, Hamed Pirsiavash, and Antonio Torralba, "Generating videos with scene dynamics," in Neural Information Processing Systems, 2016.

[19] Marc G. Bellemare, Joel Veness, and Michael Bowling, "Investigating contingency awareness using Atari 2600 games," in AAAI, 2012.

[20] Marc G. Bellemare, Sriram Srinivasan, Georg Ostrovski, Tom Schaul, David Saxton, and Rémi Munos, "Unifying count-based exploration and intrinsic motivation," in Neural Information Processing Systems, 2016.

[21] Jongwook Choi, Yijie Guo, Marcin Moczulski, Junhyuk Oh, Neal Wu, Mohammad Norouzi, and Honglak Lee, "Contingency-aware exploration in reinforcement learning," ICLR, 2019.

[22] Sébastien Racanière, Théophane Weber, David Reichert, Lars Buesing, Arthur Guez, Danilo Jimenez Rezende, Adria Puigdomènech Badia, Oriol Vinyals, Nicolas Heess, Yujia Li, et al., "Imagination-augmented agents for deep reinforcement learning," in Neural Information Processing Systems, 2017.

[23] Danijar Hafner, Timothy Lillicrap, Ian Fischer, Ruben Villegas, David Ha, Honglak Lee, and James Davidson, "Learning latent dynamics for planning from pixels," in International Conference on Machine Learning, 2019.

[24] Lukasz Kaiser, Mohammad Babaeizadeh, Piotr Milos, Blazej Osinski, Roy H Campbell, Konrad Czechowski, Dumitru Erhan, Chelsea Finn, Piotr Kozakowski, Sergey Levine, et al., "Model-based reinforcement learning for Atari," ICLR, 2020.