Term Project: Reinforcement learning applied to Othello
![Page 1](https://reader033.fdocuments.in/reader033/viewer/2022060306/5f098b347e708231d4275547/html5/thumbnails/1.jpg)
Term Project: Reinforcement learning applied to Othello
George Tucker
![Page 2](https://reader033.fdocuments.in/reader033/viewer/2022060306/5f098b347e708231d4275547/html5/thumbnails/2.jpg)
Othello
![Page 3](https://reader033.fdocuments.in/reader033/viewer/2022060306/5f098b347e708231d4275547/html5/thumbnails/3.jpg)
What is reinforcement learning?
After a sequence of actions, get a reward
  Positive or negative
Temporal credit assignment problem
  Determine credit for the reward
Temporal Difference Methods
  TD-lambda
  Q-learning (TD(0))
![Page 4](https://reader033.fdocuments.in/reader033/viewer/2022060306/5f098b347e708231d4275547/html5/thumbnails/4.jpg)
Contrast to Conventional Strategies
Most methods use an evaluation function
Use minimax/alpha-beta search
Hand-designed feature detectors
  Evaluation function is a weighted sum
So why TD learning?
  Does not need hand-coded features
  Generalization
![Page 5](https://reader033.fdocuments.in/reader033/viewer/2022060306/5f098b347e708231d4275547/html5/thumbnails/5.jpg)
Temporal Difference Learning
![Page 6](https://reader033.fdocuments.in/reader033/viewer/2022060306/5f098b347e708231d4275547/html5/thumbnails/6.jpg)
Temporal Difference Learning
![Page 7](https://reader033.fdocuments.in/reader033/viewer/2022060306/5f098b347e708231d4275547/html5/thumbnails/7.jpg)
Key Observation
If we let
at time step t. Then at time step t+1,
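The formulas on this slide and the two before it did not survive extraction. As a hedged reconstruction only, assuming the deck follows the standard TD(λ) formulation of Sutton (1988), the update and the observation would read as below. Here Y_t is the prediction at step t and w the weight vector; the symbols α (learning rate) and e_t (running sum of gradients) are notation assumed for this sketch and may differ from the slides.

```latex
% TD(lambda) weight update (standard form, assumed)
\Delta w_t = \alpha \,(Y_{t+1} - Y_t) \sum_{k=1}^{t} \lambda^{t-k}\, \nabla_w Y_k

% Key observation: the sum can be maintained incrementally
e_t = \sum_{k=1}^{t} \lambda^{t-k}\, \nabla_w Y_k
\quad\Longrightarrow\quad
e_{t+1} = \nabla_w Y_{t+1} + \lambda\, e_t
```

Under this reading, the weighted sum of past gradients never has to be recomputed from scratch: one running vector, decayed by λ at each step, is enough.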
![Page 8](https://reader033.fdocuments.in/reader033/viewer/2022060306/5f098b347e708231d4275547/html5/thumbnails/8.jpg)
Temporal Difference Learning
If we let λ = 0, then we get
Widrow-Hoff rule
  Makes Y_t closer to Y_{t+1}
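To make "makes Y_t closer to Y_{t+1}" concrete, here is a minimal sketch of a single TD(λ) step with λ = 0. It assumes a linear predictor Y = w · x in place of the neural network, so the gradient of Y_t is just the input x_t. The names `predict` and `td_update` and all numeric values are illustrative, not taken from the project code.

```python
def predict(w, x):
    """Linear prediction Y = w . x (a stand-in for the network output)."""
    return sum(wi * xi for wi, xi in zip(w, x))


def td_update(w, trace, x_t, y_next, alpha, lam):
    """One TD(lambda) step for a linear predictor.

    trace holds the running sum e_t = grad Y_t + lam * e_{t-1};
    for a linear model grad Y_t is simply x_t.
    Returns the new (weights, trace).
    """
    trace = [xi + lam * ei for xi, ei in zip(x_t, trace)]
    delta = y_next - predict(w, x_t)  # temporal difference Y_{t+1} - Y_t
    w = [wi + alpha * delta * ei for wi, ei in zip(w, trace)]
    return w, trace


# Illustrative values (assumptions): two features, next prediction of 1.0.
w = [0.0, 0.0]
x_t = [1.0, 0.5]
y_next = 1.0

before = predict(w, x_t)
w, trace = td_update(w, [0.0, 0.0], x_t, y_next, alpha=0.1, lam=0.0)
after = predict(w, x_t)

# The gap to the next prediction shrinks after the update.
print(abs(y_next - before), abs(y_next - after))
```

With λ = 0 the trace reduces to the current gradient alone, so only the most recent state receives credit for the prediction error.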
![Page 9](https://reader033.fdocuments.in/reader033/viewer/2022060306/5f098b347e708231d4275547/html5/thumbnails/9.jpg)
Disadvantage
Requires lots of training
Self-play
  Short-term pathologies
  Randomization
![Page 10](https://reader033.fdocuments.in/reader033/viewer/2022060306/5f098b347e708231d4275547/html5/thumbnails/10.jpg)
Setup
Board: 64 element vector
  +1 = black
  0 = empty
  -1 = white
  Corresponds to human representation
Network
  64 inputs
  30 hidden nodes – Sigmoid activation
  Single output
Goal: predict final game score
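The setup above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the project's code: the row-major square ordering, the uniform weight initialization in [-0.1, 0.1], the bias terms, and the linear (unsquashed) output unit are all choices made here, since the slide does not specify them.

```python
import math
import random

BLACK, EMPTY, WHITE = 1, 0, -1
N_INPUTS, N_HIDDEN = 64, 30


def initial_board():
    """Standard Othello start position as a flat 64-element list (row-major, assumed)."""
    board = [EMPTY] * N_INPUTS
    board[3 * 8 + 3] = WHITE
    board[3 * 8 + 4] = BLACK
    board[4 * 8 + 3] = BLACK
    board[4 * 8 + 4] = WHITE
    return board


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def init_network(rng):
    """Small random weights; the initialization scheme is an assumption."""
    w_hidden = [[rng.uniform(-0.1, 0.1) for _ in range(N_INPUTS)]
                for _ in range(N_HIDDEN)]
    b_hidden = [0.0] * N_HIDDEN
    w_out = [rng.uniform(-0.1, 0.1) for _ in range(N_HIDDEN)]
    b_out = 0.0
    return w_hidden, b_hidden, w_out, b_out


def evaluate(net, board):
    """Forward pass: 64 inputs -> 30 sigmoid hidden units -> 1 linear output."""
    w_hidden, b_hidden, w_out, b_out = net
    hidden = [sigmoid(sum(w * x for w, x in zip(row, board)) + b)
              for row, b in zip(w_hidden, b_hidden)]
    return sum(w * h for w, h in zip(w_out, hidden)) + b_out


net = init_network(random.Random(0))
score = evaluate(net, initial_board())
print(score)  # untrained network: an arbitrary small number
```

The single output is read as the predicted final game score, so an untrained network returns a meaningless value until TD updates shape the weights.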
![Page 11](https://reader033.fdocuments.in/reader033/viewer/2022060306/5f098b347e708231d4275547/html5/thumbnails/11.jpg)
Setup
TD lambda learning
  Lambda = 0.3
  Learning rate = 0.005
  Reward is endgame score
Move selection
  Evaluate every legal 1 ply move
  Choose randomly with exponential weight
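One common reading of "choose randomly with exponential weight" is softmax-style sampling: each legal move is picked with probability proportional to exp(value / temperature). The sketch below assumes that reading. The `temperature` parameter, the function name `choose_move`, and the move labels and values are hypothetical, since the slide gives no such details.

```python
import math
import random
from collections import Counter


def choose_move(moves, values, temperature=1.0, rng=random):
    """Pick a move with probability proportional to exp(value / temperature)."""
    best = max(values)
    # Subtracting the maximum keeps exp() from overflowing; it does not
    # change the resulting probabilities.
    weights = [math.exp((v - best) / temperature) for v in values]
    return rng.choices(moves, weights=weights, k=1)[0]


# Hypothetical 1-ply evaluations for three legal moves.
moves = ["d3", "c4", "f5"]
values = [2.0, 0.0, -1.0]

rng = random.Random(0)
picks = [choose_move(moves, values, rng=rng) for _ in range(1000)]
counts = Counter(picks)
print(counts)  # the highest-valued move dominates, but others still appear
```

Sampling instead of always taking the best move supplies the randomization that self-play needs to avoid replaying the same game.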
![Page 12](https://reader033.fdocuments.in/reader033/viewer/2022060306/5f098b347e708231d4275547/html5/thumbnails/12.jpg)
Player Handling
Two Neural Networks
Board inversion
  On white's move, invert board and score
  Faster and superior learning
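Board inversion can be sketched as follows, assuming the +1/0/-1 encoding from the Setup slide and a single evaluator that always scores the position for the side encoded as +1. The names `invert_board`, `score_for_black`, and `disc_difference` are illustrative; the toy evaluator stands in for the trained network.

```python
BLACK, EMPTY, WHITE = 1, 0, -1


def invert_board(board):
    """Swap colors: +1 becomes -1 and vice versa; empty squares stay 0."""
    return [-square for square in board]


def score_for_black(evaluate, board, player):
    """Score a position in black's terms using one shared evaluator.

    On black's move the board is evaluated directly. On white's move the
    board is inverted so white's discs look like +1, and the resulting
    score is negated to express it from black's point of view.
    """
    if player == BLACK:
        return evaluate(board)
    return -evaluate(invert_board(board))


def disc_difference(board):
    """Toy evaluator (assumption): discs of the +1 side minus discs of the -1 side."""
    return sum(board)


# Hypothetical position: three black discs and one white disc.
demo = [EMPTY] * 64
demo[27], demo[28], demo[35] = BLACK, BLACK, BLACK
demo[36] = WHITE

print(score_for_black(disc_difference, demo, BLACK))
print(score_for_black(disc_difference, demo, WHITE))
```

Because one network sees every position from the mover's side, each game yields training signal for both colors, which is one plausible reason for the faster learning reported on this slide.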
![Page 13](https://reader033.fdocuments.in/reader033/viewer/2022060306/5f098b347e708231d4275547/html5/thumbnails/13.jpg)
Training Data
Recall:
  Random play
  Fixed opponent
  Database play
  Self-play
I focused on:
  Database play
  Self-play
![Page 14](https://reader033.fdocuments.in/reader033/viewer/2022060306/5f098b347e708231d4275547/html5/thumbnails/14.jpg)
Opponent
Java Othello
  www.luthman.nu/Othello/Othello.html
  Variable levels corresponding to ply depth
  Used as benchmark
  Trained against
![Page 15](https://reader033.fdocuments.in/reader033/viewer/2022060306/5f098b347e708231d4275547/html5/thumbnails/15.jpg)
Database Training
Logistello database
  120,000 games
Fast
  Less than 30 minutes to train on the full set
Wins 10% of games against a 1 ply opponent
[Plot: x-axis 0 to 120,000; y-axis 0 to 0.6]
![Page 16](https://reader033.fdocuments.in/reader033/viewer/2022060306/5f098b347e708231d4275547/html5/thumbnails/16.jpg)
Self-play
Extremely slow improvement
  Even after nearly 2,000,000 iterations almost no improvement
Only wins 1% of games against 1 ply opponent
[Plot: x-axis 0 to 500,000; y-axis 0 to 0.6]
![Page 17](https://reader033.fdocuments.in/reader033/viewer/2022060306/5f098b347e708231d4275547/html5/thumbnails/17.jpg)
Two Ply Opponent
Opponent looks ahead one ply and chooses the best move
Much slower by a factor of 6 or more
[Plot: x-axis 0 to 120,000; y-axis 0 to 0.7]
![Page 18](https://reader033.fdocuments.in/reader033/viewer/2022060306/5f098b347e708231d4275547/html5/thumbnails/18.jpg)
Website
For source code and reference material
  www.cs.hmc.edu/~gtucker/othello.html
![Page 19](https://reader033.fdocuments.in/reader033/viewer/2022060306/5f098b347e708231d4275547/html5/thumbnails/19.jpg)
Conclusions
Board inversion should definitely be used
Initially, at least, self-play is poor
Database play significantly improves network
Asymmetric self-play is far superior to standard self-play
Playing a fixed opponent may be best
Future Work
  Add in additional feature detectors
  Investigate more advanced depth play