Model-Based Deep RL in Go
Transcript of slides credited from Dr. Hung-Yi Lee (miulab/s107-adl/doc/190423...)
Slides credited from Dr. Hung-Yi Lee
RL Agent Taxonomy
(Diagram: agents are organized along two axes, model-free vs. model-based and value vs. policy. Value-based methods learn a critic, policy-based methods learn an actor, and actor-critic methods combine both.)
Model-Based: Agent’s Representation of the Environment
Model
At each step the agent takes action a_t, then receives observation o_t and reward r_t.
A model predicts what the environment will do next:
◦ P predicts the next state
◦ R predicts the next immediate reward
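The P-and-R decomposition can be sketched as a minimal tabular model; the class and its deterministic-lookup behavior are illustrative assumptions, not from the slides:

```python
# Minimal sketch of a learned environment model: P predicts the next
# state, R predicts the next immediate reward. Here both are simple
# lookup tables filled from observed (s, a, s', r) transitions.
class TabularModel:
    def __init__(self):
        self.next_state = {}  # (s, a) -> last observed next state (P)
        self.reward = {}      # (s, a) -> last observed reward (R)

    def update(self, s, a, s_next, r):
        """Record one observed transition."""
        self.next_state[(s, a)] = s_next
        self.reward[(s, a)] = r

    def predict(self, s, a):
        """Deterministic approximation of P and R for a known (s, a)."""
        return self.next_state[(s, a)], self.reward[(s, a)]

model = TabularModel()
model.update("s0", "right", "s1", 1.0)
s_next, r = model.predict("s0", "right")  # -> ("s1", 1.0)
```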
Model-Based Deep RL
Goal: learn a transition model of the environment and plan based on the transition model
The objective is to maximize the measured goodness of the model
Model-based deep RL is challenging, and so far has failed in Atari
Issues for Model-Based Deep RL
Compounding errors
◦ Errors in the transition model compound over the trajectory
◦ A long trajectory may result in totally wrong rewards
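A small numeric sketch (not from the slides) of how a tiny per-step model error grows over a long rollout:

```python
# True dynamics: x' = 0.90 * x.  Learned model: x' = 0.95 * x -- a small
# one-step error in the coefficient (values chosen for illustration).
def rollout(coef, x0=1.0, steps=50):
    x = x0
    for _ in range(steps):
        x = coef * x
    return x

true_final = rollout(0.90)
model_final = rollout(0.95)

# The one-step prediction error is tiny...
one_step_error = abs(0.95 * 1.0 - 0.90 * 1.0)  # 0.05
# ...but after 50 steps the model overestimates the true state
# by a factor of (0.95 / 0.90) ** 50, roughly 15x.
blowup = model_final / true_final
```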
Deep networks of value/policy can “plan” implicitly
◦ Each layer of the network performs an arbitrary computational step
◦ An n-layer network can “look ahead” n steps
Model-Based Deep RL in Go
Monte-Carlo tree search (MCTS)
◦ MCTS simulates future trajectories
◦ Builds a large lookahead search tree with millions of positions
◦ State-of-the-art Go programs use MCTS
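The simulate-and-build-a-tree idea can be sketched as a compact MCTS with UCB1 selection and random rollouts; the toy game and every name here are illustrative assumptions, not the Go setup:

```python
# Minimal MCTS sketch: selection (UCB1), expansion, random rollout,
# backpropagation. Toy game: start at 0, move +1/-1 for 4 steps;
# reward 1 iff you end at +4 (i.e., only repeated +1 moves win).
import math, random

ACTIONS = (+1, -1)
HORIZON = 4

class Node:
    def __init__(self, state, depth):
        self.state, self.depth = state, depth
        self.children = {}        # action -> child Node
        self.visits, self.value = 0, 0.0

def rollout(state, depth):
    """Finish the game with random moves; reward 1 iff we end at +HORIZON."""
    while depth < HORIZON:
        state += random.choice(ACTIONS)
        depth += 1
    return 1.0 if state == HORIZON else 0.0

def mcts(root, iters=2000):
    for _ in range(iters):
        node, path = root, [root]
        # Selection: descend with UCB1 while the node is fully expanded.
        while node.depth < HORIZON and len(node.children) == len(ACTIONS):
            node = max(node.children.values(),
                       key=lambda c: c.value / c.visits
                       + math.sqrt(2 * math.log(node.visits) / c.visits))
            path.append(node)
        # Expansion: try one untested action, then simulate from the child.
        if node.depth < HORIZON:
            a = random.choice([a for a in ACTIONS if a not in node.children])
            child = Node(node.state + a, node.depth + 1)
            node.children[a] = child
            path.append(child)
            node = child
        reward = rollout(node.state, node.depth)
        # Backpropagation: update statistics along the selected path.
        for n in path:
            n.visits += 1
            n.value += reward
    # Recommend the most-visited root action.
    return max(root.children, key=lambda a: root.children[a].visits)

random.seed(0)
best = mcts(Node(0, 0))
```

The search concentrates visits on the +1 branch, since only that subtree ever produces reward.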
Convolutional Networks
◦ 12-layer CNN trained to predict expert moves
◦ The raw CNN (looking at 1 position, with no search at all) equals the performance of MoGo (the first strong Go program) with a 10^5-position search tree
https://deepmind.com/alphago/
Silver et al., “Mastering the game of Go with deep neural networks and tree search,” Nature, 2016.
Problems within RL
Learning and Planning
In sequential decision making:
◦ Reinforcement learning
• The environment is initially unknown
• The agent interacts with the environment
• The agent improves its policy
◦ Planning (a.k.a. deliberation, reasoning, introspection, pondering, thought, search)
• A model of the environment is known
• The agent performs computations with its model (without any external interaction)
• The agent improves its policy
Atari Example: Reinforcement Learning
Rules of the game are unknown
Learn directly from interactive game-play
Pick actions on joystick, see pixels and scores
Atari Example: Planning
Rules of the game are known
Query the emulator based on the perfect model inside the agent’s brain
◦ If I take action a from state s:
• what would the next state be?
• what would the score be?
Plan ahead to find the optimal policy, e.g. with tree search
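The query-the-emulator idea can be sketched as exhaustive tree search over a perfect model; the toy dynamics are an assumption for illustration:

```python
# Sketch of planning with a known model: the agent queries a perfect
# model ("emulator") for the next state and score, and searches the
# full tree of action sequences up to a horizon.
def model(state, action):
    """Perfect model: next state and immediate score for a toy chain."""
    next_state = state + action               # actions move +1 / -1
    reward = 1.0 if next_state == 3 else 0.0  # scoring square at +3
    return next_state, reward

def plan(state, depth):
    """Best achievable return from `state` within `depth` moves."""
    if depth == 0:
        return 0.0
    best = float("-inf")
    for action in (+1, -1):
        s2, r = model(state, action)
        best = max(best, r + plan(s2, depth - 1))
    return best

best_return = plan(0, 3)  # optimal path 0 -> 1 -> 2 -> 3 scores once
```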
Exploration and Exploitation
Reinforcement learning is like trial-and-error learning
The agent should discover a good policy from its experience without losing too much reward along the way
Exploration finds more information about the environment
Exploitation exploits known information to maximize reward
When should the agent try something new?
It is usually important to explore as well as exploit
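One standard way to balance the two (not named on the slide) is epsilon-greedy selection:

```python
# Epsilon-greedy: exploit the best-known action most of the time,
# explore a uniformly random action with probability epsilon.
import random

def epsilon_greedy(q_values, epsilon):
    """q_values: list of estimated action values."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))                  # explore
    return max(range(len(q_values)), key=q_values.__getitem__)  # exploit

random.seed(0)
q = [0.1, 0.5, 0.2]                       # current value estimates
picks = [epsilon_greedy(q, 0.1) for _ in range(1000)]
greedy_fraction = picks.count(1) / len(picks)  # mostly the best action
```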
RL for Unsupervised Model:Modularizing Unsupervised Sense Embeddings (MUSE)
Word2Vec Polysemy Issue
Words are polysemous:
◦ An apple a day, keeps the doctor away.
◦ Smartphone companies including apple, …
If words are polysemous, are their embeddings?
◦ No
◦ What’s the problem?
(Figure: word2vec assigns one point per word, e.g. “tree”, “trees”, “rock”, “rocks”, regardless of sense.)
Modular Framework
Example: “Smartphone companies including blackberry, and sony will be invited.”
Two key mechanisms:
◦ Sense selection given a text context
◦ Sense representation to embed statistical characteristics of sense identity
(Diagram: the word “apple” maps to senses apple-1 and apple-2; the sense selection and sense representation modules are linked through the selected sense identity and its sense embedding, and trained with reinforcement learning.)
Sense Selection Module
Input: a text context C̄_t = (C_{t−m}, …, C_t = w_i, …, C_{t+m})
Output: the fitness for each sense z_{i1}, …, z_{i3}
Model architecture: Continuous Bag-of-Words (CBOW) for efficiency
Sense selection:
◦ Policy-based
◦ Value-based (Q-value)
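A value-based selection step can be sketched as follows; the dimensions, random weights, and function names are illustrative assumptions, with the CBOW context encoding taken as a simple average:

```python
# Sketch of value-based sense selection: encode the context CBOW-style
# through matrix P, score each sense of word w_i with matrix Q_i, and
# pick the argmax. Sizes and weights here are toy.
import numpy as np

rng = np.random.default_rng(0)
dim, vocab, n_senses = 8, 10, 3
P = rng.normal(size=(vocab, dim))       # context word embeddings (matrix P)
Q_i = rng.normal(size=(n_senses, dim))  # sense scorer for w_i (matrix Q_i)

def select_sense(context_ids):
    """Return (chosen sense index, Q-values) for w_i in this context."""
    context_vec = P[context_ids].mean(axis=0)  # CBOW-style averaging
    q_values = Q_i @ context_vec               # fitness of z_i1 .. z_i3
    return int(np.argmax(q_values)), q_values

sense, q_values = select_sense([1, 4, 7, 2])
```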
(Diagram: context words …, C_{t−1}, C_t = w_i, C_{t+1}, … — e.g. “companies including apple blackberry and” — are embedded through matrix P, and matrix Q_i produces the scores q(z_{i1}|C̄_t), q(z_{i2}|C̄_t), q(z_{i3}|C̄_t) for sense selection of the target word C_t.)
Sense Representation Module
Input: a sense collocation (z_{ik}, z_{jl})
Output: collocation likelihood estimation
Model architecture: skip-gram
Sense representation learning
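The collocation-likelihood output can be sketched skip-gram-style with a full softmax (the actual training uses negative sampling, as the summary slide notes; sizes and weights here are toy assumptions):

```python
# Sketch of the sense representation module: sense identities are
# embedded by input matrix U and output matrix V, and the collocation
# likelihood P(z_out | z_in) is a softmax over all senses.
import numpy as np

rng = np.random.default_rng(1)
dim, n_all_senses = 8, 30
U = rng.normal(size=(n_all_senses, dim))  # input sense embeddings (matrix U)
V = rng.normal(size=(n_all_senses, dim))  # output sense embeddings (matrix V)

def collocation_likelihood(z_in, z_out):
    """Full-softmax estimate of P(z_out | z_in) over all senses."""
    scores = V @ U[z_in]
    scores -= scores.max()                # numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return float(probs[z_out])

p = collocation_likelihood(0, 5)
total = sum(collocation_likelihood(0, z) for z in range(n_all_senses))
```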
(Diagram: the selected sense z_{i1} is embedded through matrix U, and matrix V produces the collocation likelihoods P(z_{j2}|z_{i1}), …, P(z_{uv}|z_{i1}).)
A Summary of MUSE
Corpus: { Smartphone companies including apple blackberry, and sony will be invited. }
(Diagram: (1) the sense selection module selects a sense for the target word C_t = w_i from its context via matrices P and Q_i; (2) a second sense selection chooses a sense for the collocated word C_t′ = w_j via matrices P and Q_j, and a sense collocation is sampled; (3) the sense representation module, with matrices U and V trained by negative sampling, estimates the collocation likelihood P(z_{j2}|z_{i1}), which is fed back as the reward signal for sense selection.)
The first purely sense-level embedding learning with efficient sense selection.
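The numbered steps above can be sketched as one toy training step; the linear modules, the update rule, and all names are illustrative assumptions that match the slides only in structure:

```python
# Toy MUSE-style loop: select a sense, score its collocation likelihood
# with the representation module, and feed that likelihood back as the
# reward signal for the sense-selection scorer.
import numpy as np

rng = np.random.default_rng(0)
dim, n_senses = 4, 3
Q = rng.normal(size=(n_senses, dim)) * 0.1  # sense-selection scorer (toy)
U = rng.normal(size=(n_senses, dim))        # input sense embeddings
V = rng.normal(size=(n_senses, dim))        # output sense embeddings
lr = 0.1

def train_step(context_vec, collocated_sense):
    q_vals = Q @ context_vec
    z = int(np.argmax(q_vals))              # 1) select a sense for the target
    scores = V @ U[z]                       # 2) collocation likelihood of the
    probs = np.exp(scores - scores.max())   #    sampled collocated sense
    probs /= probs.sum()
    reward = float(probs[collocated_sense])
    # 3) feed the likelihood back as a reward signal: move the chosen
    # sense's Q-value toward the reward (a toy Q-learning-style update).
    Q[z] += lr * (reward - q_vals[z]) * context_vec
    return z, reward

context_vec = rng.normal(size=dim)
z, reward = train_step(context_vec, collocated_sense=1)
```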
Qualitative Analysis

| Context | k-NN |
| --- | --- |
| … braves finish the season in tie with the los angeles dodgers … | scoreless otl shootout 6-6 hingis 3-3 7-7 0-0 |
| … his later years proudly wore tie with the chinese characters for … | pants trousers shirt juventus blazer socks anfield |
Qualitative Analysis

| Context | k-NN |
| --- | --- |
| … of the mulberry or the blackberry and minos sent him to … | cranberries maple vaccinium apricot apple |
| … of the large number of blackberry users in the us federal … | smartphones sap microsoft ipv6 smartphone |
Demonstration
OpenAI Universe
A software platform for measuring and training an AI’s general intelligence via the OpenAI Gym environment
Concluding Remarks
RL is a general-purpose framework for decision making under interactions between agent and environment
An RL agent may include one or more of these components:
◦ Value function: how good is each state and/or action
◦ Policy: agent’s behavior function
◦ Model: agent’s representation of the environment
RL problems can be solved by end-to-end deep learning
Reinforcement Learning + Deep Learning = AI