Simulation-Based Search
David Silver
Part I: Background
The Real World
Real-world problems are MESSY
Huge state spaces
Huge branching factors
Long-term consequences of actions
No good heuristics
Traditional search algorithms fail (e.g. A*, alpha-beta)
Example: The Game of Go
The ancient East Asian game of Go is ~4000 years old
Usually played on 19x19 board
Simple rules, complex strategy
Black and white place down stones alternately
Capturing
If stones are completely surrounded they are captured
Winner
The game is finished when both players pass
The intersections surrounded by each player are known as territory
The player with more territory wins the game
Shape Knowledge in Go
Go players utilise a large vocabulary of shapes:
One-point jump
Ponnuki
Hane
A Grand Challenge for AI
Huge state space: 10^170 states
Big branching factor: 361 actions
Long-term consequences: hundreds of moves
No good heuristics: amateur level after 40 years
Traditional search has failed in Computer Go
Progress In 19x19 Computer Go
[Chart: program strength on the human rank scale (15 kyu up to 7 dan, beginner to master) by year, 1996-2010. Traditional search programs shown: Handtalk, Go++, Many Faces of Go.]
Progress In 19x19 Computer Go
[Chart: same axes as above; Monte-Carlo search programs (Indigo, CrazyStone, MoGo, Zen) climb well above the traditional search programs.]
Success of Monte-Carlo Search
Human master level in Computer Go
Super-human, world-champion level in Backgammon and Scrabble
Computer world champion in general game playing, Hex, Amazons, Lines of Action, Hearts, Skat, ...
World computer record for many puzzles: Morpion Solitaire, 16x16 Sudoku, SameGame, ...
Part II: Monte-Carlo Tree Search
Position Evaluation
Game outcome z
Black wins z=1
White wins z=0
Value of position s
Vπ(s) = Eπ[z | s]   ⇐ estimated by Monte-Carlo simulation
V*(s) = maxπ Vπ(s)   ⇐ estimated by tree search
Monte-Carlo Simulation
Simulate n random games from current position s
Evaluate position by mean outcome of simulations
V(s) = (1/n) Σᵢ₌₁ⁿ zᵢ
Monte-Carlo Simulation
Current position s
Four random simulations from s, with outcomes 1, 1, 0, 0
V(s) = 2/4 = 0.5
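A minimal sketch of this evaluation, assuming a hypothetical Game interface with copy(), legal_moves(), play(move), is_over() and outcome() (returning z = 1 for a Black win, 0 for a White win):

```python
import random

def rollout(game):
    """Play uniformly random moves to the end of the game; return the outcome z."""
    while not game.is_over():
        game.play(random.choice(game.legal_moves()))
    return game.outcome()  # z = 1 if Black wins, z = 0 if White wins

def monte_carlo_value(game, n):
    """V(s) = (1/n) * sum of outcomes of n random games from position s."""
    return sum(rollout(game.copy()) for _ in range(n)) / n
```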
Simple Monte-Carlo Search
Run Monte-Carlo simulation from the current position s, for each action a
Select action maximising Monte-Carlo value
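Continuing the sketch above; shown from Black's perspective (White would minimise), an assumption not spelled out on the slide:

```python
def simple_mc_search(game, n_per_action=100):
    """Run Monte-Carlo simulation for each legal action from s;
    return the action with the highest Monte-Carlo value (Black to move)."""
    def mc_value(a):
        child = game.copy()
        child.play(a)
        return monte_carlo_value(child, n_per_action)
    return max(game.legal_moves(), key=mc_value)
```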
Monte-Carlo Tree Search
Monte-Carlo simulation from the current position s
Builds a search tree containing Monte-Carlo values
Simulation policy has two phases:
Tree policy (e.g. greedy)
Default policy (e.g. uniform random)
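A sketch of one MCTS simulation with the two-phase policy, building on the rollout sketch above. The greedy tree policy and the uniform-random default policy match the slide's examples; evaluating every node from a single (Black's) perspective is a simplification:

```python
class Node:
    def __init__(self):
        self.children = {}  # action -> Node
        self.n = 0          # visit count n(s)
        self.q = 0.0        # mean simulation outcome Q(s)

def mcts_simulate(node, game):
    """One simulation: tree policy within the tree, default policy beyond it."""
    if game.is_over():
        z = game.outcome()
    elif not node.children:
        # Leaf reached: expand it, then finish with the default policy
        for a in game.legal_moves():
            node.children[a] = Node()
        z = rollout(game)
    else:
        # Tree policy (e.g. greedy over current Monte-Carlo values)
        a = max(node.children, key=lambda a: node.children[a].q)
        game.play(a)
        z = mcts_simulate(node.children[a], game)
    node.n += 1
    node.q += (z - node.q) / node.n  # incremental mean of outcomes
    return z
```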
Optimism in the Face of Uncertainty
Want to exploit the knowledge we have accumulated
Search the best nodes of the tree most deeply
Want to explore to accumulate more knowledge
Search the most uncertain nodes of the tree
Solution: use upper confidence bound on value
Q⊕(s, a) = Q(s, a) + c √( log n(s) / n(s, a) )
UCT (Upper Confidence Trees)
Exploitation term: Q(s, a)
Exploration term: c √( log n(s) / n(s, a) )
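A sketch of UCT action selection, replacing the greedy tree policy in the simulation sketch above; giving unvisited actions an infinite bound (so each is tried at least once) is a common convention, assumed here:

```python
import math

def uct_select(node, c=1.0):
    """Select the action maximising Q(s,a) + c * sqrt(log n(s) / n(s,a))."""
    def ucb(a):
        child = node.children[a]
        if child.n == 0:
            return float('inf')  # maximal uncertainty: try every action once
        return child.q + c * math.sqrt(math.log(node.n) / child.n)
    return max(node.children, key=ucb)
```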
Weaknesses of MCTS
No generalisation between similar positions
No knowledge beyond the search tree
Monte-Carlo evaluation is high variance
Highly dependent on random rollout policy
Part III: Heuristic MC-RAVE
MoGo
Generalise the value of a move across similar positions
Use a heuristic function to initialise leaf nodes
(Gelly and Silver, 2007)
Rapid Action Value Estimate (RAVE)
Assume that the value of a move is the same regardless of when it is played
Rapid Action Value Estimate (RAVE)
MC value of C3 = 0/1
RAVE value of C3 = 3/5
Rapid Action Value Estimate (RAVE)
MC value of C3 = 1/1
RAVE value of C3 = 2/3
MC-RAVE
Monte-Carlo value is unbiased but has high variance
RAVE value is biased but has low variance
Combine both values so as to minimise MSE
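A sketch of the blended estimate. The schedule below, β = √(k / (3n + k)) with equivalence parameter k, is the hand-tuned form from Gelly and Silver's MC-RAVE work; it trusts RAVE while simulations are few and shifts toward the Monte-Carlo value as n grows, with β = 1/2 at n = k:

```python
import math

def mc_rave_value(q_mc, n, q_rave, k=1000):
    """Combine the unbiased (high-variance) MC value with the biased
    (low-variance) RAVE value; beta decays from 1 toward 0 as n grows."""
    beta = math.sqrt(k / (3 * n + k))
    return (1 - beta) * q_mc + beta * q_rave
```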
MC-RAVE in MoGo
[Chart: winning rate against GnuGo 3.7.10 as a function of the equivalence parameter k (1 to 10000), comparing UCT and MC-RAVE.]
Heuristic MCTS
[Diagram: search tree in which each node s stores a value V(s) and count n(s), initialised by the heuristic function.]
Heuristic MCTS in MoGo
[Chart: winning rate against GnuGo 3.7.10 (default level) as a function of equivalent experience n̂ₕ (0 to 200), comparing heuristic initialisations: local shape features, handcrafted heuristic, grandfather heuristic, even-game heuristic, and UCT-RAVE.]
MoGo (2007)
• MoGo = heuristic MCTS + MC-RAVE + handcrafted default policy
• 99% winning rate against best traditional programs
• Highest rating on 9x9 and 19x19 Computer Go Server
• Gold medal at Computer Go Olympiad
• First victory against 9-dan professional player (9x9)
MoGo (2009)
• MoGo += massive parallelisation, expert Go knowledge, better heuristics, ...
• First victory against 9-dan professional player (19x19) (7 stones handicap)
• Is the end nigh for humankind?
MoGo: 13x13 Scalability
Part IV: Temporal-Difference Search
MC vs. TD Learning
In reinforcement learning:
TD learning reduces variance but increases bias
TD learning is usually more efficient than MC
TD(λ) can be much more efficient than MC
Function Approximation
Also in reinforcement learning:
Can use MC or TD with function approximation
Approximate value over large state spaces
Generalise between similar states
e.g. linear function approximation (tile coding, coarse coding, etc.)
Temporal-Difference Search
Simulation-based search
Using TD instead of MC
Using function approximation
Temporal-Difference Search
Consider subgame starting from current state s
Apply temporal-difference learning to subgame:
Simulate games of self-play from s
Update feature weights using TD
Value function is specialised online to current subgame
(Silver et al. 2008)
Linear TD learning
Features: φ(s)
Value function: V(s) = θ·φ(s)
Play many games from start to end
Update feature weights θ after every move
TD-error: δ = V(s’) - V(s)
Weight update: Δθ = αδφ(s)
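A sketch of one learning step with numpy feature vectors; at a terminal successor state, V(s') would be replaced by the game outcome z (a detail not shown on the slide):

```python
import numpy as np

def td_step(theta, phi_s, phi_s_next, alpha=0.01):
    """One linear TD(0) update for V(s) = theta . phi(s)."""
    delta = theta @ phi_s_next - theta @ phi_s  # TD-error: delta = V(s') - V(s)
    theta += alpha * delta * phi_s              # weight update: alpha * delta * phi(s)
    return delta
```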
Linear TD search
Features: φ(s)
Value function: V(s) = θ·φ(s)
Simulate many games from current position
Update feature weights θ after every simulated move
TD-error: δ = V(s’) - V(s)
Weight update: Δθ = αδφ(s)
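The same update driven from the current position rather than the start of the game. features() and select_move() are hypothetical helpers (select_move might be ε-greedy on V(s) = θ·φ(s)), and the final step bootstraps from the actual outcome z:

```python
def td_search(position, theta, n_sims=10000, alpha=0.01):
    """Simulate games of self-play from the current position,
    updating the feature weights theta after every simulated move."""
    for _ in range(n_sims):
        game = position.copy()
        phi = features(game)                     # phi(s), hypothetical feature map
        while not game.is_over():
            game.play(select_move(game, theta))  # hypothetical simulation policy
            # Bootstrap from V(s') mid-game, from the outcome z at the end
            target = game.outcome() if game.is_over() else theta @ features(game)
            theta += alpha * (target - theta @ phi) * phi
            phi = features(game)
```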
Local Shape Features
Binary features matching a local configuration of stones
All possible locations and configurations from 1x1 to 3x3
~1 million features for 9x9 Go
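A sketch of extracting the active features, assuming the board is a list of lists over {0: empty, 1: black, 2: white}; for compactness it returns feature keys for a dict of weights rather than indices into a ~10⁶-dimensional vector:

```python
def local_shape_features(board, max_size=3):
    """Return the active binary features: one per (window size, location,
    stone configuration). Every k x k window matches exactly one
    configuration, so exactly one feature per window is active."""
    n = len(board)
    active = set()
    for k in range(1, max_size + 1):
        for row in range(n - k + 1):
            for col in range(n - k + 1):
                config = tuple(board[r][c]
                               for r in range(row, row + k)
                               for c in range(col, col + k))
                active.add((k, row, col, config))
    return active
```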
TD Learning
Current State
Learning
Q(s, a) = φ(s, a)ᵀ θ
Feature weights
TD Search
Search
Current State
Q̄(s, a) = φ(s, a)ᵀ θ + φ̄(s, a)ᵀ θ̄
Feature weights
Empty Triangle
Temporal difference learning:
Local shape feature acquires a negative weight
Guzumi
Temporal difference search:
Local shape feature acquires a positive weight
Blood Vomiting Game
Dyna-2: Two Memories
A memory is a vector of feature weights
Long-term memory updated by TD learning
General domain knowledge
Short-term memory updated by TD search
Specific knowledge about current situation
Positions are evaluated by combining both memories
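A sketch of the per-move loop, reusing the hypothetical td_search above; a faithful Dyna-2 search would evaluate simulated positions with θ + θ̄ while updating only θ̄, which this simplification glosses over:

```python
import numpy as np

def dyna2_evaluate(phi, theta_long, theta_short):
    """Combine both memories: V(s) = phi(s) . (theta + theta_bar)."""
    return phi @ (theta_long + theta_short)

def dyna2_move(position, theta_long, n_sims=1000):
    """Short-term memory starts empty each move, is specialised to the
    current subgame by TD search, then both memories pick the move."""
    theta_short = np.zeros_like(theta_long)
    td_search(position, theta_short, n_sims)  # simplified: search ignores theta_long
    def value_after(a):
        child = position.copy()
        child.play(a)
        return dyna2_evaluate(features(child), theta_long, theta_short)
    return max(position.legal_moves(), key=value_after)
```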
Results for Dyna-2
[Chart: comparing UCT, TD Learning, TD Search, and TD Learning + TD Search.]
Dynamic Evaluation
Traditional search algorithms (e.g. alpha-beta) use a static evaluation function
Dyna-2 provides a dynamic evaluation function
Re-learn the evaluation function after every move
Dyna-2 + Alpha-Beta: Results
RLGO
RLGO = TD Learning + TD Search (+ Alpha-Beta)
Outperforms all handcrafted, traditional search and traditional machine learning programs
Outperforms MCTS in 9x9 Go and scales better with board size
A New Paradigm for Big AI
Given a generative model of the world’s dynamics:
Consider subproblem starting from now
Simulate experience from now with the model
Apply reinforcement learning to simulations
Questions?