ELEC 576: Deep Reinforcement Learning
Ankit B. Patel, CJ Barberan
Baylor College of Medicine (Neuroscience Dept.)
Rice University (ECE Dept.)
11-08-2017
Value-Based Deep Reinforcement Learning
Value-Based Deep RL
• The value function is represented by a Q-network with weights w
[David Silver ICML 2016]
Q-Learning
• Optimal Q-values obey the Bellman equation
• Minimise MSE loss via SGD
• Naive Q-learning with a neural network diverges due to
  • Correlations between samples
  • Non-stationary targets
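The two equations referenced above appear as images in the original slides; in standard form, the Bellman optimality equation and the squared TD-error loss minimised by SGD are:

```latex
Q^{*}(s,a) \;=\; \mathbb{E}_{s'}\!\left[\, r + \gamma \max_{a'} Q^{*}(s',a') \;\middle|\; s,a \right]

l \;=\; \Big( r + \gamma \max_{a'} Q(s',a',w) - Q(s,a,w) \Big)^{2}
```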
Interesting Question
• What can go wrong in reinforcement learning that doesn’t go wrong in standard supervised learning?
Experience Replay
• Build a dataset from the agent’s own experience
• Sample experiences from the dataset and apply the update
[David Silver ICML 2016]
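A minimal sketch of such a buffer (Python; the class name and capacity are illustrative, not from the slides):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size store of (state, action, reward, next_state, done) tuples."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are evicted

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # Uniform random sampling breaks correlations between consecutive steps
        return random.sample(self.buffer, batch_size)
```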
Benefits of Experience Replay
• Greater data efficiency
• Breaks correlations between consecutive samples
• Smooths out learning, avoiding oscillations or divergence in the parameters
DQN w/ Experience Replay
[DeepMind 2014]
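The algorithm figure did not survive extraction; below is a hedged sketch of the update it describes, assuming PyTorch and illustrative names (q_net is the online network w, target_net the periodically-copied network w⁻):

```python
import torch
import torch.nn.functional as F

def dqn_update(q_net, target_net, optimizer, batch, gamma=0.99):
    """One SGD step on the squared TD error over a replayed minibatch."""
    states, actions, rewards, next_states, dones = batch
    # Q(s, a, w) for the actions that were actually taken
    q = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # Freezing the target network keeps the regression target stationary
        next_q = target_net(next_states).max(1).values
        target = rewards + gamma * next_q * (1 - dones)
    loss = F.mse_loss(q, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```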
DQN: Atari 2600
[David Silver ICML 2016]
DQN: Atari 2600
• End-to-end learning of Q(s,a) from the frames
• Input state is a stack of pixels from the last 4 frames
• Output is Q(s,a) for each of the joystick/button positions
  • The number of valid actions varies between games (~3-18)
• Reward is the change in score for that step
DQN: Atari 2600
• Network architecture and hyperparameters are fixed across all the games
[David Silver ICML 2016]
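For concreteness, a sketch of that fixed architecture with layer sizes following the Nature DQN paper (Mnih et al. 2015); the PyTorch class itself is illustrative:

```python
import torch.nn as nn

class AtariDQN(nn.Module):
    """Maps a stack of 4 grayscale 84x84 frames to one Q-value per action."""
    def __init__(self, n_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, n_actions),  # n_actions varies per game (~3-18)
        )

    def forward(self, x):
        return self.net(x)
```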
DQN: Atari Video
[Jon Juett YouTube]
DQN: Atari Video
[PhysX Vision YouTube]
Improvements on the Basic DQN Algorithm
Double DQN
• Double DQN
  • Current Q-network w is used to select actions
  • Older Q-network w⁻ is used to evaluate actions
• Prioritized replay
  • Store experience in a priority queue ordered by the DQN error
Dueling Network
• Dueling Network
  • Action-independent value function V(s,v)
  • Action-dependent advantage function A(s,a,w)
  • Q(s,a) = V(s,v) + A(s,a,w)
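A sketch of both ideas (illustrative PyTorch; the dueling head uses the mean-subtracted aggregation from Wang et al. 2016, which makes V and A identifiable, rather than the raw sum written above):

```python
import torch
import torch.nn as nn

def double_dqn_target(q_net, target_net, rewards, next_states, dones, gamma=0.99):
    # Current network w selects the action ...
    best = q_net(next_states).argmax(1, keepdim=True)
    # ... older network w- evaluates it, reducing over-estimation bias
    q_next = target_net(next_states).gather(1, best).squeeze(1)
    return rewards + gamma * q_next * (1 - dones)

class DuelingHead(nn.Module):
    """Q(s,a) = V(s) + A(s,a) - mean_a A(s,a)."""
    def __init__(self, in_features, n_actions):
        super().__init__()
        self.value = nn.Linear(in_features, 1)               # V(s, v)
        self.advantage = nn.Linear(in_features, n_actions)   # A(s, a, w)

    def forward(self, features):
        v, a = self.value(features), self.advantage(features)
        return v + a - a.mean(dim=1, keepdim=True)
```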
Prioritized Experience Replay
[Schaul et al. 2016]
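The slide's figures are lost; a minimal sketch of proportional prioritization in the spirit of Schaul et al. 2016 (the α, β values and function shape are illustrative):

```python
import numpy as np

def sample_prioritized(priorities, batch_size, alpha=0.6, beta=0.4):
    """Sample transition indices with probability proportional to TD error."""
    p = priorities ** alpha               # priorities = |TD error| + epsilon
    probs = p / p.sum()
    idx = np.random.choice(len(priorities), batch_size, p=probs)
    # Importance-sampling weights correct the bias of non-uniform sampling
    weights = (len(priorities) * probs[idx]) ** (-beta)
    return idx, weights / weights.max()
```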
Improvements after DQN
[Marc G. Bellemare YouTube]
General Reinforcement Learning Architecture
[David Silver ICML 2016]
Policy-based Deep Reinforcement Learning
Deep Policy Network
• The policy is represented by a network with weights u
• Define the objective function as the total discounted reward
• Optimise the objective via SGD
  • Adjust policy parameters u to achieve higher reward
[David Silver ICML 2016]
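The slide's equations were images; in standard form, the parameterised policy (deterministic or stochastic) and the objective are:

```latex
a = \pi(s, u) \qquad \text{or} \qquad a \sim \pi(a \mid s, u)

L(u) = \mathbb{E}\!\left[\, r_{1} + \gamma r_{2} + \gamma^{2} r_{3} + \cdots \;\middle|\; \pi(\cdot, u) \right]
```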
Policy Gradients
• Gradient of a stochastic policy
• Gradient of a deterministic policy
[David Silver ICML 2016]
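In standard form, these two gradients are: for a stochastic policy π(a|s,u), the likelihood-ratio (REINFORCE) form; for a deterministic policy a = π(s,u) with continuous a and differentiable Q, the chain-rule form:

```latex
\frac{\partial L(u)}{\partial u} = \mathbb{E}\!\left[ \frac{\partial \log \pi(a \mid s, u)}{\partial u} \, Q(s,a) \right]
\qquad
\frac{\partial L(u)}{\partial u} = \mathbb{E}\!\left[ \frac{\partial Q(s,a)}{\partial a} \, \frac{\partial a}{\partial u} \right]
```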
Actor-Critic Algorithm
• Estimate the value function Q(s, a, w)
• Update policy parameters u via SGD
[David Silver ICML 2016]
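A compact sketch of one such step (PyTorch; assumes policy(s) returns a torch.distributions object and q_net(s, a) a scalar estimate; all names are illustrative):

```python
import torch

def actor_critic_step(policy, q_net, pol_opt, q_opt,
                      s, a, r, s2, a2, gamma=0.99):
    # Critic: regress Q(s, a, w) toward the one-step TD target
    with torch.no_grad():
        td_target = r + gamma * q_net(s2, a2)
    critic_loss = (q_net(s, a) - td_target).pow(2).mean()
    q_opt.zero_grad()
    critic_loss.backward()
    q_opt.step()

    # Actor: ascend E[ d log pi(a|s,u)/du * Q(s,a,w) ]
    log_prob = policy(s).log_prob(a)
    actor_loss = -(log_prob * q_net(s, a).detach()).mean()
    pol_opt.zero_grad()
    actor_loss.backward()
    pol_opt.step()
```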
Asynchronous Advantage Actor-Critic: Labyrinth
[DeepMind YouTube]
A3C: Labyrinth
• End-to-end learning of a softmax policy from raw pixels
• Observations are the pixels of the current frame
• State is the hidden state of an LSTM over past observations
• Outputs both the value V(s) & a softmax over actions
• Task is to collect apples (+1) and escape (+10)
[David Silver ICML 2016]
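A sketch of such a recurrent actor-critic network (PyTorch; layer sizes and the 84x84 RGB input are assumptions, not taken from the slide):

```python
import torch
import torch.nn as nn

class A3CNet(nn.Module):
    """Conv encoder -> LSTM state -> softmax policy and value heads."""
    def __init__(self, n_actions, hidden=256):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        self.lstm = nn.LSTMCell(32 * 9 * 9, hidden)  # state carried by the LSTM
        self.policy = nn.Linear(hidden, n_actions)
        self.value = nn.Linear(hidden, 1)

    def forward(self, frame, lstm_state):
        h, c = self.lstm(self.encoder(frame), lstm_state)
        return torch.softmax(self.policy(h), dim=-1), self.value(h), (h, c)
```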
Model-based Deep Reinforcement Learning
Learning Models of the Environment
• Generative model of Atari 2600
• Issues
  • Errors in the transition model compound over the trajectory
  • Planned trajectories differ from executed trajectories
  • Rewards are totally wrong at the end of a long, unusual trajectory
[David Silver ICML 2016]
Learning Models of the Environment
[Junhyuk Oh YouTube]
Newer Implementations
AlphaGo
[DeepMind Nature]
Policy and Value Networks
[DeepMind Nature]
Monte Carlo Tree Search
[DeepMind Nature]
AlphaGo Results
[DeepMind Nature]
Which Move to Make
[DeepMind Nature]
Five Matches Against Fan Hui
[DeepMind Nature]
Neural Architecture Search with Reinforcement Learning
• Uses an RNN to generate model descriptions of neural networks
• Trains the RNN with RL to maximize the expected accuracy of the generated architectures on a validation set (sketched below)
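A minimal sketch of that training loop (controller.sample and train_and_eval_child are hypothetical helpers; the reward is the child network's validation accuracy):

```python
import torch

def controller_step(controller, optimizer, baseline=0.0):
    """Sample an architecture, evaluate it, and apply REINFORCE."""
    tokens, log_probs = controller.sample()   # architecture description tokens
    reward = train_and_eval_child(tokens)     # hypothetical: validation accuracy
    # REINFORCE: scale log-probabilities of the sampled choices by (R - b)
    loss = -(torch.stack(log_probs).sum() * (reward - baseline))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return reward
```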
Neural Architecture Search
[Zoph & Le arXiv 2016]
RNN Controller
[Zoph & Le arXiv 2016]
Training with REINFORCE
[Zoph & Le arXiv 2016]
Parallel Training
[Zoph & Le arXiv 2016]
Increase Architecture Complexity
[Zoph & Le arXiv 2016]
Constructing the Network
[Zoph & Le arXiv 2016]
Results
[Zoph & Le arXiv 2016]
Reinforcement Learning with Unsupervised Auxiliary Tasks
• Train an agent that simultaneously maximises many other pseudo-reward functions by RL
• These auxiliary tasks let learning continue even in the absence of extrinsic rewards
• Outperforms the previous state-of-the-art on Atari
  • Averaging 880% of expert human performance
UNREAL Agent
[Jaderberg et al. arXiv 2016]
Auxiliary Control Tasks
[Jaderberg et al. arXiv 2016]
Auxiliary Tasks
[Jaderberg et al. arXiv 2016]
UNREAL Algorithm
• A3C loss is minimised on-policy
• Value function is optimised from replayed data
• Auxiliary control loss is optimised off-policy from replayed data
• Reward loss is optimised from rebalanced replay data
[Jaderberg et al. arXiv 2016]
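Put together, these four terms give the combined UNREAL objective; in the paper's notation:

```latex
\mathcal{L}_{\mathrm{UNREAL}}(\theta) = \mathcal{L}_{\mathrm{A3C}}
  + \lambda_{VR}\,\mathcal{L}_{VR}
  + \lambda_{PC} \sum_{c} \mathcal{L}_{Q}^{(c)}
  + \lambda_{RP}\,\mathcal{L}_{RP}
```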
Atari Results
[Jaderberg et al. arXiv 2016]
Reinforcement Learning with Unsupervised Auxiliary Tasks Video
[DeepMind YouTube]