CS230: Lecture 9 Deep Reinforcement Learning
cs230.stanford.edu/files/lecture-notes-9.pdf

Kian Katanforoosh, Ramtin Keramati

Today’s outline

I. Motivation
II. Recycling is good: an introduction to RL
III. Deep Q-Networks
IV. Application of Deep Q-Network: Breakout (Atari)
V. Tips to train Deep Q-Network
VI. Advanced topics

I. Motivation

Human Level Control through Deep Reinforcement Learning; AlphaGo
[Mnih et al. (2015): Human Level Control through Deep Reinforcement Learning] [Silver et al. (2017): Mastering the game of Go without human knowledge]

Why RL?
• Delayed labels
• Making sequences of decisions

What is RL?
• Automatically learning to make good sequences of decisions

Examples of RL applications: Robotics, Games, Advertisement


II. Recycling is good: an introduction to RL

Problem statement

An agent moves along a line of five states:
State 1, State 2 (initial, START), State 3, State 4, State 5
with a reward “r” defined in every state: +2, 0, 0, +1, +10 respectively.

Number of states: 5. Types of states: initial, normal, terminal (States 1 and 5 are terminal).
Agent’s possible actions: move left (←) or right (→) to an adjacent state.
Goal: maximize the return (the rewards collected).

How to define the long-term return? Use the discounted return:
R = ∑_{t=0}^{∞} γ^t r_t = r_0 + γ r_1 + γ² r_2 + …

What is the best strategy to follow if γ = 1?
Additional rule: the garbage collector is coming in 3 minutes, and it takes 1 minute to move between states.
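To make the discounted return concrete, here is a minimal Python sketch (illustrative only; the helper name `discounted_return` and the reward sequence below are not from the slides):

```python
# Minimal sketch of the discounted return R = sum_t gamma^t * r_t.
def discounted_return(rewards, gamma):
    """r_0 + gamma*r_1 + gamma^2*r_2 + ..."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Example: starting in State 2 and always moving right, the agent collects
# 0 (entering State 3), +1 (entering State 4), +10 (entering State 5).
print(discounted_return([0, 1, 10], gamma=0.9))  # 0 + 0.9*1 + 0.81*10 = 9.0
print(discounted_return([0, 1, 10], gamma=1.0))  # 11
```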

II. Recycling is good: an introduction to RL

What do we want to learn? The Q-table: a (#states × #actions) matrix Q whose entry Q_sa answers “how good is it to take action a in state s?” (e.g. Q_21: how good is it to take action 1 in state 2). Here there are 5 states and 2 actions (←, →):

           Q(s, ←)   Q(s, →)
State 1      Q11       Q12
State 2      Q21       Q22
State 3      Q31       Q32
State 4      Q41       Q42
State 5      Q51       Q52

How? Assuming γ = 0.9 and the discounted return R = ∑ γ^t r_t, work backward from the rewards (+2, 0, 0, +1, +10; START in State 2; States 1 and 5 terminal), collecting a state’s reward when the agent enters it:

- Q(State 4, →) = +10: moving right enters the terminal State 5 and collects +10.
- Q(State 3, →) = +10 (= 1 + 0.9 × 10): collect +1 entering State 4, plus the discounted best return from State 4.
- Q(State 2, →) = +9 (= 0 + 0.9 × 10)
- Q(State 4, ←) = +9 (= 0 + 0.9 × 10)
- Q(State 3, ←) = +8.1 (= 0 + 0.9 × 9)
- Q(State 2, ←) = +2: moving left enters the terminal State 1 and collects +2.

II. Recycling is good: an introduction to RL

Best strategy to follow if γ = 0.9

Q-table (filled in):

           Q(s, ←)   Q(s, →)
State 1      0          0
State 2      2          9
State 3      8.1       10
State 4      9         10
State 5      0          0

Bellman equation (optimality equation): Q*(s, a) = r + γ max_{a′} Q*(s′, a′)

Policy (the function telling us our best strategy): π(s) = argmax_a Q*(s, a)

When the state and action spaces are too big, this tabular method has a huge memory cost.
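As a sanity check on the table above, here is a small tabular value-iteration sketch (not lecture code) that applies the Bellman optimality equation to this 5-state example; the transition and reward conventions in the comments are assumptions chosen to match the worked values:

```python
import numpy as np

# Recycling example: 5 states, 2 actions (0 = left, 1 = right).
# Assumed conventions (matching the worked values above): a state's reward is
# collected when the agent enters it, and States 1 and 5 are terminal.
rewards = np.array([2.0, 0.0, 0.0, 1.0, 10.0])
terminal = np.array([True, False, False, False, True])
gamma = 0.9

def next_state(s, a):
    return s - 1 if a == 0 else s + 1  # move left or right

Q = np.zeros((5, 2))
for _ in range(100):  # iterate the Bellman optimality equation until it settles
    for s in range(5):
        if terminal[s]:
            continue  # no actions are taken from a terminal state
        for a in range(2):
            s2 = next_state(s, a)
            Q[s, a] = rewards[s2] + (0.0 if terminal[s2] else gamma * Q[s2].max())

print(np.round(Q, 1))
# [[ 0.   0. ]
#  [ 2.   9. ]
#  [ 8.1 10. ]
#  [ 9.  10. ]
#  [ 0.   0. ]]
```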


What we’ve learned so far:

- Vocabulary: environment, agent, state, action, reward, total return, discount factor.

- Q-table: matrix of entries representing “how good is it to take action a in state s”

- Policy: function telling us what’s the best strategy to adopt

- Bellman equation satisfied by the optimal Q-table



III. Deep Q-Networks

Main idea: find a Q-function to replace the Q-table.

Instead of storing the (#states × #actions) Q-table, encode the state as a one-hot vector, e.g. s = (0, 1, 0, 0, 0)ᵀ for State 2, and feed it to a small neural network whose two outputs are Q(s, →) and Q(s, ←). Then compute a loss and backpropagate.

How to compute the loss?
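Before defining the loss, here is a minimal sketch of such a toy Q-network (one-hot state in, two Q-values out), written with tf.keras purely for illustration; the hidden-layer sizes are arbitrary assumptions, not values from the lecture:

```python
import numpy as np
import tensorflow as tf

# Toy Q-network for the 5-state example: one-hot state in, [Q(s,->), Q(s,<-)] out.
q_network = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(5,)),  # sizes assumed
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(2),  # linear output layer: the two Q-values
])

s = np.array([[0, 1, 0, 0, 0]], dtype=np.float32)  # one-hot encoding of State 2
print(q_network(s).numpy())  # shape (1, 2): untrained Q(s,->), Q(s,<-)
```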


III. Deep Q-Networks

Loss function (regression)

Forward propagate the one-hot state s through the Q-network to get Q(s, →) and Q(s, ←), and build a target value y for the action that would be taken:

Case Q(s, ←) < Q(s, →) (take →):  y = r_→ + γ max_{a′} Q(s′_→, a′)
Case Q(s, ←) > Q(s, →) (take ←):  y = r_← + γ max_{a′} Q(s′_←, a′)

where r_→ / r_← is the immediate reward for taking that action in state s, s′_→ / s′_← is the resulting next state, and the max term is the discounted maximum future reward from that next state. The target value y is held fixed during backpropagation.

Loss (for the case where → is taken): L = (y − Q(s, →))²

Backpropagation: compute ∂L/∂W and update W using stochastic gradient descent.
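A minimal numpy sketch of this target and squared loss for a single transition where the agent takes →; the numbers and variable names are made up for illustration:

```python
import numpy as np

gamma = 0.9
r_right = 0.0                        # immediate reward for taking -> in s
q_s      = np.array([0.3, 0.1])      # pretend outputs [Q(s,->), Q(s,<-)]
q_s_next = np.array([0.7, 0.2])      # pretend outputs for the next state s'_->

# Target value, held fixed during backpropagation:
y = r_right + gamma * q_s_next.max()  # y = r_-> + gamma * max_a' Q(s'_->, a')

# Regression loss on the action actually taken (here ->):
L = (y - q_s[0]) ** 2
print(y, L)
```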


Recap’

DQN implementation:

- Initialize your Q-network parameters
- Loop over episodes:
  - Start from the initial state s
  - Loop over time-steps:
    - Forward propagate s through the Q-network
    - Execute the action a that has the maximum Q(s, a) output of the Q-network
    - Observe the reward r and the next state s’
    - Compute the target y by forward propagating s’ through the Q-network, then compute the loss
    - Update the parameters with gradient descent


IV. Deep Q-Networks application: Breakout (Atari)

Goal: play Breakout, i.e. destroy all the bricks.

Input of the Q-network: the raw screen frame s. Output of the Q-network: the Q-values (Q(s, ←), Q(s, →), Q(s, −)) for the three actions (move left, move right, stay).

Would that work?

Demo: https://www.youtube.com/watch?v=V1eYniJ0Rnk

IV. Deep Q-Networks application: Breakout (Atari)

Goal: play Breakout, i.e. destroy all the bricks.

Instead of the raw frame s, the input of the Q-network is a preprocessed version φ(s); the output is still the vector of Q-values (Q(s, ←), Q(s, →), Q(s, −)).

What is done in preprocessing?
- Convert to grayscale
- Reduce the dimensions (h, w)
- Keep a history of the 4 most recent frames
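A rough numpy-only sketch of such a φ(s) (the crude grayscale average, the naive 2× downsampling, and the 210×160×3 frame shape are stand-in assumptions, not the lecture’s exact preprocessing):

```python
import numpy as np
from collections import deque

def preprocess_frame(frame):
    """frame: (210, 160, 3) uint8 Atari screen (shape assumed).
    Convert to grayscale and reduce (h, w) with a naive 2x downsample."""
    gray = frame.mean(axis=2)          # crude grayscale
    return gray[::2, ::2] / 255.0      # (105, 80), scaled to [0, 1]

history = deque(maxlen=4)              # the 4 most recent preprocessed frames

def phi(frame):
    history.append(preprocess_frame(frame))
    while len(history) < 4:            # pad at the start of an episode
        history.append(history[-1])
    return np.stack(list(history), axis=-1)  # (105, 80, 4) Q-network input
```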


IV. Deep Q-Networks application: Breakout (Atari)

Input of the Q-network: φ(s), the stack of preprocessed frames.

Deep Q-network architecture:
φ(s) → CONV → ReLU → CONV → ReLU → FC (ReLU) → FC (linear) → (Q(s, ←), Q(s, →), Q(s, −))
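A sketch of a Q-network with this CONV-ReLU-CONV-ReLU-FC(ReLU)-FC(linear) shape in tf.keras; the filter counts, kernel sizes, strides, and the (105, 80, 4) input (matching the preprocessing sketch above) are assumptions, not values from the slide:

```python
import tensorflow as tf

num_actions = 3  # <-, ->, stay

q_network = tf.keras.Sequential([
    tf.keras.layers.Conv2D(16, kernel_size=8, strides=4, activation="relu",
                           input_shape=(105, 80, 4)),  # phi(s): 4 stacked frames
    tf.keras.layers.Conv2D(32, kernel_size=4, strides=2, activation="relu"),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(256, activation="relu"),     # FC (ReLU)
    tf.keras.layers.Dense(num_actions),                # FC (linear): Q-values
])
q_network.summary()
```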


Recap’ (+ preprocessing + terminal state)

DQN implementation:

- Initialize your Q-network parameters
- Loop over episodes:
  - Start from the initial state s and compute φ(s)
  - Create a boolean to detect terminal states: terminal = False
  - Loop over time-steps:
    - Forward propagate φ(s) through the Q-network
    - Execute the action a that has the maximum Q(φ(s), a) output of the Q-network
    - Observe the reward r and the next state s’; use s’ to create φ(s’)
    - Check if s’ is a terminal state, then compute the target y by forward propagating φ(s’) through the Q-network and compute the loss:
        if terminal == False:  y = r + γ max_{a′} Q(φ(s′), a′)
        if terminal == True:   y = r   (and break out of the time-step loop)
    - Update the parameters with gradient descent

Some training challenges:
- Keep track of the terminal step
- Experience replay
- Epsilon-greedy action choice (exploration / exploitation tradeoff)

IV - DQN training challenges: Experience replay

Current method: start from the initial state s and follow the trajectory
φ(s) → a → r → φ(s’),  φ(s’) → a’ → r’ → φ(s’’),  φ(s’’) → a’’ → r’’ → φ(s’’’), …
where each experience leads to one iteration of gradient descent, i.e. training uses E1, then E2, then E3, …

Experience replay: store the experiences E1, E2, E3, … in a replay memory D, and at each step train on a random sample from the memory: E1, then sample(E1, E2), then sample(E1, E2, E3), then sample(E1, E2, E3, E4), …
This can be used with mini-batch gradient descent.

Advantages of experience replay?
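A minimal sketch of such a replay memory D (class name and capacity are illustrative), exposing just the add/sample interface described above:

```python
import random
from collections import deque

class ReplayMemory:
    """Stores experiences (phi_s, a, r, phi_s_next, terminal) and returns random
    mini-batches, so updates are not tied to the most recent transition only."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)  # oldest experiences fall out

    def add(self, experience):
        self.buffer.append(experience)

    def sample(self, batch_size):
        k = min(batch_size, len(self.buffer))
        return random.sample(list(self.buffer), k)

# Usage sketch:
# D = ReplayMemory()
# D.add((phi_s, a, r, phi_s_next, terminal))
# batch = D.sample(32)   # can be used with mini-batch gradient descent
```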


Recap’ (+ experience replay)

DQN implementation:

- Initialize your Q-network parameters
- Initialize the replay memory D
- Loop over episodes:
  - Start from the initial state s and compute φ(s)
  - Create a boolean to detect terminal states: terminal = False
  - Loop over time-steps:
    - Forward propagate φ(s) through the Q-network
    - Execute the action a that has the maximum Q(φ(s), a) output of the Q-network
    - Observe the reward r and the next state s’; use s’ to create φ(s’)
    - Add the experience (φ(s), a, r, φ(s’)) to the replay memory D. The transition resulting from this step will not always be used in this iteration’s update!
    - Sample a random mini-batch of transitions from D
    - Check if s’ is a terminal state, compute the targets y by forward propagating the sampled φ(s’) through the Q-network, then compute the loss. Update using the sampled transitions.
    - Update the parameters with gradient descent

Some training challenges:
- Keep track of the terminal step
- Experience replay
- Epsilon-greedy action choice (exploration / exploitation tradeoff)

Exploration vs. Exploitation

Consider an initial state S1 with three actions: a1 leads to terminal state S2 (R = +0), a2 leads to terminal state S3 (R = +1), a3 leads to terminal state S4 (R = +1000).

Just after initializing the Q-network, we get:
Q(S1, a1) = 0.5,  Q(S1, a2) = 0.4,  Q(S1, a3) = 0.3

A purely greedy agent first takes a1 (the highest initial Q-value), observes R = 0, and Q(S1, a1) is pushed toward 0; it then takes a2 and observes R = +1. From then on it keeps taking a2: S4 will never be visited, because Q(S1, a3) < Q(S1, a2), even though a3 yields the reward of +1000.
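Epsilon-greedy action choice addresses this: with probability ε the agent explores with a random action, so a3 still gets tried and the +1000 reward can eventually be discovered. A minimal sketch (the ε value and function name are illustrative):

```python
import random
import numpy as np

def epsilon_greedy_action(q_values, epsilon):
    """q_values: 1-D array of Q(s, a) over actions."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))  # explore: random action
    return int(np.argmax(q_values))             # exploit: greedy action

# With epsilon = 0.1 and three actions, a3 is still picked about 3% of the time,
# instead of never, so its large reward can be observed.
a = epsilon_greedy_action(np.array([0.0, 1.0, 0.3]), epsilon=0.1)
```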


Overall recap’ (+ epsilon-greedy action)

DQN implementation:

- Initialize your Q-network parameters
- Initialize the replay memory D
- Loop over episodes:
  - Start from the initial state s and compute φ(s)
  - Create a boolean to detect terminal states: terminal = False
  - Loop over time-steps:
    - With probability epsilon, take a random action a; otherwise, forward propagate φ(s) through the Q-network and execute the action a that has the maximum Q(φ(s), a) output
    - Observe the reward r and the next state s’; use s’ to create φ(s’)
    - Add the experience (φ(s), a, r, φ(s’)) to the replay memory D
    - Sample a random mini-batch of transitions from D
    - Check if s’ is a terminal state, compute the targets y by forward propagating the sampled φ(s’) through the Q-network, then compute the loss
    - Update the parameters with gradient descent

Summary of the additions: preprocessing, detect terminal state, experience replay, epsilon-greedy action.
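Putting the pieces together, here is a compact training-loop sketch under several assumptions: a gym-style environment with reset()/step(), the q_network, phi, ReplayMemory, and epsilon_greedy_action sketched earlier in these notes, and arbitrary hyperparameters. It is an illustration of the recap above, not the lecture’s reference implementation:

```python
import numpy as np
import tensorflow as tf

# Assumed: a gym-style `env` providing reset() and step(a), plus the q_network,
# phi, ReplayMemory and epsilon_greedy_action sketches from earlier.
gamma, epsilon, batch_size, num_episodes = 0.99, 0.1, 32, 1000  # assumed values
optimizer = tf.keras.optimizers.Adam(1e-4)
D = ReplayMemory()

for episode in range(num_episodes):
    s = phi(env.reset())                         # preprocessed initial state
    terminal = False
    while not terminal:
        # Epsilon-greedy action from the current Q-network outputs.
        q_values = q_network(s[None])[0].numpy()
        a = epsilon_greedy_action(q_values, epsilon)
        frame, r, terminal, _ = env.step(a)      # observe reward and next state
        s_next = phi(frame)
        D.add((s, a, r, s_next, terminal))       # store the transition
        s = s_next

        # Sample a random mini-batch and build fixed targets y.
        batch = D.sample(batch_size)
        phi_s, acts, rews, phi_s2, dones = map(np.array, zip(*batch))
        q_next = q_network(phi_s2).numpy().max(axis=1)
        y = tf.constant(rews + gamma * q_next * (1.0 - dones), dtype=tf.float32)

        # Regression loss on the actions actually taken, then one gradient step.
        with tf.GradientTape() as tape:
            q = q_network(phi_s)
            q_taken = tf.reduce_sum(q * tf.one_hot(acts, q.shape[-1]), axis=1)
            loss = tf.reduce_mean(tf.square(y - q_taken))
        grads = tape.gradient(loss, q_network.trainable_variables)
        optimizer.apply_gradients(zip(grads, q_network.trainable_variables))
```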


Results

[https://www.youtube.com/watch?v=TmPfTpjtdgg]

Other Atari games: Pong, SeaQuest, Space Invaders

[https://www.youtube.com/watch?v=p88R2_3yWPA] [https://www.youtube.com/watch?v=NirMkC5uvWU] [https://www.youtube.com/watch?v=W2CAghUiofY&t=2s]


VI - Advanced topics

- AlphaGo [DeepMind Blog] [Silver et al. (2017): Mastering the game of Go without human knowledge]
- Competitive self-play
- Meta-learning [Finn et al. (2017): Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks]
- Imitation learning [Ho et al. (2016): Generative Adversarial Imitation Learning]
- Auxiliary tasks

Announcements

For Thursday 03/15, 9am:

C5M3:
• Quiz: Sequence models & attention mechanism
• Programming Assignment: Neural Machine Translation with Attention
• Programming Assignment: Trigger word detection

Final project and posters!

Check out the project example code (cs230-stanford.github.io)