Reinforcement Learning: CNNs and Deep Q Learning
Emma Brunskill, Stanford University
Spring 2017
With many slides for DQN from David Silver and Ruslan Salakhutdinov, and some vision slides from Gianni Di Caro
Today
• Convolutional Neural Nets (CNNs)
• Deep Q learning
Why Do We Care About CNNs?
• CNNs are extensively used in computer vision
• If we want to go from pixels to decisions, it is likely useful to leverage insights developed for visual input
How many weight parameters are needed for a single node which is a linear combination of the input?
• Traditional NNs receive input as a single vector & transform it through a series of (fully connected) hidden layers
• For an image (32 w, 32 h, 3 c), the input layer has 32 × 32 × 3 = 3072 neurons
• A single fully-connected neuron in the first hidden layer would have 3072 weights…
• Two main issues:
  • space-time complexity
  • lack of structure, locality of info
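The space-complexity issue can be made concrete with a quick count (plain Python; the hidden-layer size of 100 is an illustrative assumption, not a number from the lecture):

```python
# Weight counts for a 32x32x3 input image (as in the slide).
width, height, channels = 32, 32, 3
n_inputs = width * height * channels  # 3072 input neurons

# Fully connected: every hidden neuron sees all inputs.
hidden_neurons = 100  # illustrative choice
fc_weights = n_inputs * hidden_neurons  # plus hidden_neurons biases

# Convolutional: one 5x5 filter shared across all spatial positions.
filter_size = 5
conv_weights = filter_size * filter_size * channels  # plus 1 bias

print(fc_weights)    # 307200
print(conv_weights)  # 75
```

The shared filter needs four orders of magnitude fewer parameters, which is the point of the "weight sharing" bullet later in the deck.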
Images Have Structure
• Have local structure and correlation
• Have distinctive features in the space & frequency domains
Image Features
• Want uniqueness
• Want invariance
  • Geometric invariance: translation, rotation, scale
  • Photometric invariance: brightness, exposure, ...
• Leads to unambiguous matches in other images or with respect to known entities of interest
• Look for “interest points”: image regions that are unusual
• Coming up with these is nontrivial
Convolutional NN
• Consider local structure and common extraction of features
• Not fully connected
• Locality of processing
• Weight sharing for parameter reduction
• Learn the parameters of multiple convolutional filter banks
• Compress to extract salient features & favor generalization
Image: http://d3kbpzbmcynnmx.cloudfront.net/wp-content/uploads/2015/11/Screen-Shot-2015-11-07-at-7.26.20-AM.png
Locality of Information: Receptive Fields
(Figure: a hidden neuron connected to a local patch of the input)
(Filter) Stride
• Slide the 5×5 mask over all the input pixels
• Stride length = 1
• Can use other stride lengths
• Assume the input is 28×28; how many neurons are in the 1st hidden layer?
(Figure: input and 1st hidden layer)
Stride and Zero Padding
Stride: how far (spatially) the filter moves over the input
Zero padding: how many 0s to add to either side of the input layer
(Figure: input, filter, and output)
What is the Stride, and What Are the Values, in the Second Example?
The stride is 2
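The output size in these examples follows a standard formula: with input width W, filter width F, zero padding P, and stride S, the output width is (W − F + 2P)/S + 1. A small sketch (assuming square inputs and filters):

```python
def conv_output_size(w, f, p, s):
    """Output width for input width w, filter width f, zero padding p, stride s."""
    assert (w - f + 2 * p) % s == 0, "filter placements do not tile the input evenly"
    return (w - f + 2 * p) // s + 1

# 28x28 input, 5x5 filter, no padding, stride 1 -> 24x24 hidden layer (the earlier question)
print(conv_output_size(28, 5, 0, 1))  # 24
# 7x7 input, 3x3 filter, no padding, stride 2 -> 3x3 output (a typical stride-2 example)
print(conv_output_size(7, 3, 0, 2))   # 3
```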
Shared Weights
• What is the precise relationship between the neurons in the receptive field and the one in the hidden layer?
• What is the activation value of the hidden layer neuron?
• The sum over i is only over the (in this case, 25) neurons in the receptive field of the hidden layer neuron
• The same weights w and bias b are used for each of the (here, 24×24) hidden neurons
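The activation formula this slide refers to did not survive transcription; for the hidden neuron at position (j, k) with a 5×5 receptive field it presumably takes the standard form (σ is the activation function, e.g. a sigmoid):

```latex
a_{j,k} = \sigma\!\left( b + \sum_{l=0}^{4} \sum_{m=0}^{4} w_{l,m}\, x_{j+l,\, k+m} \right)
```

The 25 terms of the double sum are exactly the 25 input neurons in the receptive field, and the same w and b are reused for every (j, k).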
Feature Map
• All the neurons in the first hidden layer detect exactly the same feature, just at different locations in the input image.
• “Feature”: the kind of input pattern (e.g., a local edge) that makes the neuron “fire” or, more generally, produce a certain response level
• Why does this make sense?
  • Suppose the weights and bias are (learned) such that the hidden neuron can pick out, say, a vertical edge in a particular local receptive field.
  • That ability is also likely to be useful at other places in the image.
  • And so it is useful to apply the same feature detector everywhere in the image.
• Inspired by the visual system
• The map from the input layer to the hidden layer is therefore a feature map: all nodes detect the same feature in different parts of the image
• The map is defined by the shared weights and bias
• The shared map is the result of applying a convolutional filter (defined by the weights and bias), also known as convolution with learned kernels
Feature Map
Same weights!
Convolutional Image Filters
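As an illustration of the kind of filter such a layer can learn, here is a hand-coded vertical-edge kernel applied with plain NumPy. This is a sketch, not the lecture's code; like CNN layers in practice, it computes cross-correlation (no kernel flip):

```python
import numpy as np

def conv2d(image, kernel):
    """Valid (no padding) 2D cross-correlation with stride 1."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

# Image that is dark on the left half, bright on the right half.
img = np.zeros((6, 6))
img[:, 3:] = 1.0

# Vertical-edge kernel: responds where brightness changes left-to-right.
kernel = np.array([[-1.0, 0.0, 1.0],
                   [-1.0, 0.0, 1.0],
                   [-1.0, 0.0, 1.0]])

response = conv2d(img, kernel)
print(response)  # each row is [0. 3. 3. 0.]: strong response only near the edge
```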
Why Only 1 Filter?
• At the i-th hidden layer, n filters can be active in parallel
• A bank of convolutional filters, each learning a different feature (different weights and bias)
• 3 feature maps, each defined by a set of 5×5 shared weights and one bias
• The result is that the network can detect 3 different kinds of features, with each feature being detectable across the entire image
(Figure: 28×28 input, 3 feature maps of 24×24 neurons)
Volumes and Depths
Each node here has the same input but different weight vectors (e.g., computing different features of the same input)
Pooling Layers
• Pooling layers are usually used immediately after convolutional layers.
• Pooling layers simplify/subsample/compress the information in the output from the convolutional layer
• A pooling layer takes each feature map output from the convolutional layer and prepares a condensed feature map
Max Pooling
• Max-pooling: a pooling unit simply outputs the max activation in the input region
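A 2×2 max-pooling sketch in NumPy (assuming, for simplicity, feature-map sides divisible by the pool size):

```python
import numpy as np

def max_pool(feature_map, size=2):
    """Non-overlapping size x size max pooling over a 2D feature map."""
    h, w = feature_map.shape
    assert h % size == 0 and w % size == 0
    # Split into (row-block, row-in-block, col-block, col-in-block), max over each block.
    return feature_map.reshape(h // size, size, w // size, size).max(axis=(1, 3))

fm = np.array([[1., 3., 2., 0.],
               [4., 2., 1., 1.],
               [0., 0., 5., 6.],
               [1., 2., 7., 8.]])
print(max_pool(fm))
# [[4. 2.]
#  [2. 8.]]
```

Each output entry keeps only the strongest activation in its 2×2 region, discarding the exact position, which is exactly the trade-off the next slide motivates.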
Max Pooling Benefits
• Max-pooling is a way for the network to ask whether a given feature is found anywhere in a region of the image. It then throws away the exact positional information.
• Once a feature has been found, its exact location isn’t as important as its rough location relative to other features.
• A big benefit is that there are many fewer pooled features, which helps reduce the number of parameters needed in later layers.
Final Layer Typically Fully Connected
Overview
• CNNs
• Deep Q learning
Generalization
• Using function approximation helps scale up to making decisions in really large domains
Deep Reinforcement Learning
‣ Use deep neural networks to represent
- Value function
- Policy
- Model
‣ Optimize loss function by stochastic gradient descent (SGD)
Deep Q-Networks (DQNs)
‣ Represent the value function by a Q-network with weights w
Q-Learning
‣ Optimal Q-values should obey the Bellman equation
‣ Treat the right-hand side as a target
‣ Minimize the MSE loss by stochastic gradient descent
‣ Remember the VFA lecture: minimize the mean-squared error between the true action-value function qπ(S,A) and the approximate Q function
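The equations for this slide are missing from the transcript; based on the surrounding bullets, they are presumably the Bellman optimality equation and the resulting MSE objective on the Q-network weights w:

```latex
Q^*(s,a) \;=\; \mathbb{E}\!\left[\, r + \gamma \max_{a'} Q^*(s',a') \;\middle|\; s, a \right]
\qquad
L(w) \;=\; \mathbb{E}\!\left[ \left( r + \gamma \max_{a'} Q(s',a'; w) - Q(s,a; w) \right)^{2} \right]
```

When differentiating L(w), the bootstrapped target term is treated as a constant, which is what makes this a semi-gradient method.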
Q-Learning
‣ Minimize the MSE loss by stochastic gradient descent
‣ Converges to Q* using a table-lookup representation
‣ But diverges using neural networks due to:
- Correlations between samples
- Non-stationary targets
DQNs: Experience Replay
‣ To remove correlations, build a data-set from the agent’s own experience
‣ To deal with non-stationarity, the target parameters w− are held fixed
‣ Sample experiences from the data-set and apply the update
Remember: Experience Replay
‣ Given experience consisting of ⟨state, value⟩ (or ⟨state, action/value⟩) pairs
‣ Repeat
- Sample a state, value pair from the experience
- Apply a stochastic gradient descent update
DQNs: Experience Replay
‣ DQN uses experience replay and fixed Q-targets
‣ Use stochastic gradient descent
‣ Store transition (st, at, rt+1, st+1) in replay memory D
‣ Sample a random mini-batch of transitions (s, a, r, s′) from D
‣ Compute Q-learning targets w.r.t. the old, fixed parameters w−
‣ Optimize the MSE between the Q-network and the Q-learning targets
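A minimal sketch of the loop above, with a table standing in for the Q-network so it stays self-contained (all names, sizes, and the toy transition stream are illustrative, not the DQN code):

```python
import random
from collections import deque

import numpy as np

n_states, n_actions = 5, 2
alpha, gamma = 0.1, 0.99

Q = np.zeros((n_states, n_actions))   # "Q-network" (here just a table of weights)
Q_target = Q.copy()                   # old, fixed parameters w-
replay = deque(maxlen=1000)           # replay memory D

def store(s, a, r, s_next):
    replay.append((s, a, r, s_next))

def train_step(batch_size=4):
    # Sample a random mini-batch of transitions from D.
    batch = random.sample(replay, min(batch_size, len(replay)))
    for s, a, r, s_next in batch:
        # Q-learning target computed with the old, fixed parameters.
        target = r + gamma * Q_target[s_next].max()
        # SGD-style step toward the target (MSE gradient for a table).
        Q[s, a] += alpha * (target - Q[s, a])

random.seed(0)
for _ in range(50):  # toy stream of random transitions
    store(random.randrange(n_states), random.randrange(n_actions),
          random.random(), random.randrange(n_states))
    train_step()

Q_target = Q.copy()  # periodically refresh the fixed target parameters
```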
DQNs in Atari
(Figure: agent-environment loop, with state st, reward rt, action at)
DQNs in Atari
‣ End-to-end learning of values Q(s,a) from pixels s
‣ Input state s is a stack of raw pixels from the last 4 frames
‣ Output is Q(s,a) for 18 joystick/button positions
‣ Reward is the change in score for that step
‣ Network architecture and hyperparameters fixed across all games
Mnih et al., Nature, 2015
DQN source code: sites.google.com/a/deepmind.com/dqn/
Demo
Mnih et al., Nature, 2015
DQN Results in Atari
Which Aspects of DQN Were Important for Success?

| Game | Linear | Deep network | DQN with fixed Q | DQN with replay | DQN with replay and fixed Q |
| --- | --- | --- | --- | --- | --- |
| Breakout | 3 | 3 | 10 | 241 | 317 |
| Enduro | 62 | 29 | 141 | 831 | 1006 |
| River Raid | 2345 | 1453 | 2868 | 4102 | 7447 |
| Seaquest | 656 | 275 | 1003 | 823 | 2894 |
| Space Invaders | 301 | 302 | 373 | 826 | 1089 |

• Replay is hugely important
Double Q-Learning
‣ Train 2 action-value functions, Q1 and Q2
‣ Do Q-learning on both, but
- never on the same time steps (Q1 and Q2 are independent)
- pick Q1 or Q2 at random to be updated on each step
‣ Action selections are ε-greedy with respect to the sum of Q1 and Q2
‣ If updating Q1, use Q2 for the value of the next state:
Double Q-learning (Sutton & Barto, Figure 6.15):

Initialize Q1(s, a) and Q2(s, a), for all s ∈ S, a ∈ A(s), arbitrarily
Initialize Q1(terminal-state, ·) = Q2(terminal-state, ·) = 0
Repeat (for each episode):
  Initialize S
  Repeat (for each step of episode):
    Choose A from S using the policy derived from Q1 and Q2 (e.g., ε-greedy in Q1 + Q2)
    Take action A, observe R, S′
    With 0.5 probability:
      Q1(S, A) ← Q1(S, A) + α[R + γ Q2(S′, argmax_a Q1(S′, a)) − Q1(S, A)]
    else:
      Q2(S, A) ← Q2(S, A) + α[R + γ Q1(S′, argmax_a Q2(S′, a)) − Q2(S, A)]
    S ← S′
  until S is terminal

The idea of doubled learning extends naturally to algorithms for full MDPs. For example, the doubled learning algorithm analogous to Q-learning, called Double Q-learning, divides the time steps in two, perhaps by flipping a coin on each step. If the coin comes up heads, the update is

Q1(St, At) ← Q1(St, At) + α[Rt+1 + γ Q2(St+1, argmax_a Q1(St+1, a)) − Q1(St, At)]    (6.8)

If the coin comes up tails, then the same update is done with Q1 and Q2 switched, so that Q2 is updated. The two approximate value functions are treated completely symmetrically. The behavior policy can use both action-value estimates; for example, an ε-greedy policy for Double Q-learning could be based on the average (or sum) of the two action-value estimates. In this setting, doubled learning seems to eliminate the harm caused by maximization bias.
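The coin-flip update above can be sketched directly in code (tabular NumPy version; sizes and the sample transition are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 4, 3
alpha, gamma = 0.5, 0.9
Q1 = np.zeros((n_states, n_actions))
Q2 = np.zeros((n_states, n_actions))

def double_q_update(s, a, r, s_next):
    """One Double Q-learning step: flip a coin, update one table using the other."""
    if rng.random() < 0.5:
        a_star = Q1[s_next].argmax()  # select the next action with Q1 ...
        Q1[s, a] += alpha * (r + gamma * Q2[s_next, a_star] - Q1[s, a])  # ... evaluate it with Q2
    else:
        a_star = Q2[s_next].argmax()  # symmetric case: select with Q2, evaluate with Q1
        Q2[s, a] += alpha * (r + gamma * Q1[s_next, a_star] - Q2[s, a])

# One illustrative transition: whichever table the coin picks moves toward
# r + gamma * (other table's value) = 1.0, i.e. to 0.5 after one alpha=0.5 step.
double_q_update(s=0, a=1, r=1.0, s_next=2)
```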
Double Q-Learning
(Slide reproduces the Sutton & Barto Figure 6.15 algorithm box)
Hado van Hasselt, 2010
Double DQN
‣ Current Q-network w is used to select actions
‣ Older Q-network w− is used to evaluate actions
Action selection: w
Action evaluation: w−
van Hasselt, Guez, Silver, 2015
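The select/evaluate split amounts to one line in the target computation. A sketch with NumPy arrays standing in for the two networks' next-state Q outputs (all values are hypothetical, chosen so the two targets differ):

```python
import numpy as np

gamma = 0.99
r = np.array([1.0, 0.0])                  # rewards for a batch of 2 transitions
q_next_online = np.array([[2.0, 5.0],     # Q(s', .; w): current network
                          [1.0, 0.5]])
q_next_target = np.array([[3.0, 1.0],     # Q(s', .; w-): older network
                          [0.2, 0.8]])

# DQN: select AND evaluate the next action with the target network w-.
dqn_target = r + gamma * q_next_target.max(axis=1)

# Double DQN: select with w, evaluate with w-.
a_star = q_next_online.argmax(axis=1)     # actions [1, 0]
ddqn_target = r + gamma * q_next_target[np.arange(2), a_star]

print(dqn_target)   # [3.97  0.792]
print(ddqn_target)  # [1.99  0.198]
```

Because the action chosen by w need not be the one w− scores highest, the Double DQN target is never larger than the DQN target, which is how it reduces the overestimation from the max.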
Double DQN
van Hasselt, Guez, Silver, 2015