Mastering the game of Go with deep neural networks and tree search (article overview)
Article overview by Ilya Kuzovkin
David Silver et al., Google DeepMind
Reinforcement Learning Seminar, University of Tartu, 2016
THE GAME OF GO
BOARD
STONES
GROUPS
LIBERTIES
CAPTURE
KO
TWO EYES
FINAL COUNT
EXAMPLES
TRAINING
TRAINING THE BUILDING BLOCKS
• Supervised (SL) policy network pσ(a|s), trained by supervised classification
• Rollout policy network pπ(a|s), trained by supervised classification
• Tree policy network pτ(a|s), trained by supervised classification
• Reinforcement (RL) policy network pρ(a|s), trained by reinforcement learning
• Value network vθ(s), trained by supervised regression
Supervised (SL) policy network pσ(a|s)

Architecture:
• 19 × 19 × 48 input
• 1 convolutional layer, 5×5, k = 192 filters, ReLU
• 11 convolutional layers, 3×3, k = 192 filters, ReLU
• 1 convolutional layer, 1×1, ReLU
• Softmax

Training:
• 29.4M positions from games between 6 and 9 dan players
• Augmented with 8 reflections/rotations
• Stochastic gradient ascent; learning rate α = 0.003, halved every 80M steps
• Batch size m = 16
• 3 weeks on 50 GPUs to make 340M steps

Results:
• Test set (1M positions) accuracy: 57.0%
• 3 ms to select an action
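To make the architecture concrete, here is a minimal PyTorch sketch of this 13-layer policy network. Layer counts and filter sizes follow the slide; the padding choices are assumptions, and the 1×1 head feeds the softmax directly (the slide lists a ReLU there, omitted here since it would clip negative logits):

```python
import torch
import torch.nn as nn

class SLPolicyNet(nn.Module):
    """Sketch of the SL policy network: 19x19x48 input, one 5x5 conv,
    eleven 3x3 convs (k=192 filters each), a 1x1 head and a softmax
    over the 19*19 = 361 board points."""
    def __init__(self, planes: int = 48, k: int = 192):
        super().__init__()
        layers = [nn.Conv2d(planes, k, kernel_size=5, padding=2), nn.ReLU()]
        for _ in range(11):
            layers += [nn.Conv2d(k, k, kernel_size=3, padding=1), nn.ReLU()]
        layers += [nn.Conv2d(k, 1, kernel_size=1)]  # 1x1 output head
        self.body = nn.Sequential(*layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 48, 19, 19) -> move probabilities, shape (batch, 361)
        return torch.softmax(self.body(x).flatten(1), dim=1)

probs = SLPolicyNet()(torch.zeros(1, 48, 19, 19))  # shape (1, 361)
```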
19 × 19 × 48 INPUT
[Slides 21–24 illustrate the 48 feature planes used to encode a board position.]
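For illustration, a toy encoder that builds such a 48-plane input tensor; only the first three planes are filled in, and the plane layout here is my own stand-in, not the paper's exact feature set:

```python
import numpy as np

def encode_position(black: np.ndarray, white: np.ndarray,
                    black_to_play: bool) -> np.ndarray:
    """Toy 48x19x19 feature encoder (plane layout is illustrative)."""
    planes = np.zeros((48, 19, 19), dtype=np.float32)
    own, opp = (black, white) if black_to_play else (white, black)
    planes[0] = own   # stones of the player to move
    planes[1] = opp   # opponent stones
    planes[2] = 1.0   # constant plane (e.g. colour to play)
    # planes 3..47 would hold liberties, capture sizes, ladders, ...
    return planes

x = encode_position(np.zeros((19, 19), dtype=np.float32),
                    np.zeros((19, 19), dtype=np.float32), True)
```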
Rollout policy pπ(a|s):
• Supervised, trained on the same data as pσ(a|s)
• Less accurate: 24.2% (vs. 57.0%)
• Much faster: 2 μs per action (roughly 1500× faster)
• Just a linear model with a softmax

Tree policy pτ(a|s):
• "similar to the rollout policy but with more features"
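Such a rollout policy is just a softmax over a linear function of hand-crafted pattern features, which is what makes it so fast. A sketch, where the per-move feature rows are stand-ins for AlphaGo's actual pattern features:

```python
import numpy as np

class LinearSoftmaxPolicy:
    """A rollout-style policy: one linear layer plus a softmax
    over the legal moves. No convolutions, hence microsecond speed."""
    def __init__(self, n_features: int):
        self.w = np.zeros(n_features, dtype=np.float32)

    def probs(self, move_features: np.ndarray) -> np.ndarray:
        # move_features: (n_legal_moves, n_features), one row per move
        scores = move_features @ self.w
        scores -= scores.max()          # numerical stability
        exp = np.exp(scores)
        return exp / exp.sum()

policy = LinearSoftmaxPolicy(n_features=8)
p = policy.probs(np.random.rand(5, 8).astype(np.float32))  # 5 legal moves
```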
Reinforcement (RL) policy network pρ(a|s)

• Same architecture as pσ(a|s); the weights ρ are initialized with σ
• Self-play: the current network vs. a randomized pool of its previous versions
• Play a game until the end to get the reward z_t = ±r(s_T) = ±1
• Set z^i_t = z_t and play the same game again, this time updating the network parameters at each time step t (see the sketch below)
• The baseline v(s^i_t) is:
  ‣ 0 "on the first pass through the training pipeline"
  ‣ vθ(s^i_t) "on the second pass"
• Batch size n = 128 games; 10,000 batches; one day on 50 GPUs

Results:
• 80% wins against the SL network
• 85% wins against Pachi (no search yet!)
• 3 ms to select an action
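A minimal sketch of this policy-gradient update (REINFORCE with a baseline), assuming a PyTorch policy like the SLPolicyNet sketch above; the tensor shapes and the replay loop are simplified:

```python
import torch

def reinforce_update(policy, optimizer, states, actions, z, baseline=None):
    """Replay one finished self-play game and update the policy.

    states:   (T, 48, 19, 19) positions; actions: (T,) moves played;
    z:        +1 win / -1 loss, shared by every step of the game;
    baseline: optional (T,) tensor of v_theta(s_t) values, zeros on
              the first pass through the pipeline (per the slide).
    """
    logp = torch.log(policy(states).gather(1, actions[:, None]).squeeze(1))
    if baseline is None:
        baseline = torch.zeros_like(logp)
    advantage = z - baseline.detach()     # z_t - v(s_t)
    loss = -(logp * advantage).mean()     # minimizing this = gradient ascent
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```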
Value network vθ(s)

Architecture:
• 19 × 19 × 49 input (one extra plane encoding the colour to play)
• 1 convolutional layer, 5×5, k = 192 filters, ReLU
• 11 convolutional layers, 3×3, k = 192 filters, ReLU
• 1 convolutional layer, 1×1, ReLU
• Fully connected layer, 256 ReLU units
• Fully connected layer, 1 tanh unit
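Mirroring the policy-network sketch above, a minimal PyTorch version of this value network (padding and head details are again assumptions):

```python
import torch
import torch.nn as nn

class ValueNet(nn.Module):
    """Sketch of the value network: a conv trunk like the policy net
    but with 49 input planes and a scalar tanh output."""
    def __init__(self, planes: int = 49, k: int = 192):
        super().__init__()
        layers = [nn.Conv2d(planes, k, kernel_size=5, padding=2), nn.ReLU()]
        for _ in range(11):
            layers += [nn.Conv2d(k, k, kernel_size=3, padding=1), nn.ReLU()]
        layers += [nn.Conv2d(k, 1, kernel_size=1), nn.ReLU()]
        self.trunk = nn.Sequential(*layers)
        self.fc1 = nn.Linear(19 * 19, 256)   # 256 ReLU units
        self.fc2 = nn.Linear(256, 1)         # 1 tanh unit

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = torch.relu(self.fc1(self.trunk(x).flatten(1)))
        return torch.tanh(self.fc2(h))       # value in (-1, 1)

v = ValueNet()(torch.zeros(1, 49, 19, 19))   # shape (1, 1)
```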
Training:
• Evaluate the value of position s under policy p: v^p(s) = E[z_t | s_t = s, a_{t…T} ∼ p]
• Double approximation: vθ(s) ≈ v^{pρ}(s) ≈ v*(s)
• Stochastic gradient descent to minimize the MSE between the prediction vθ(s) and the game outcome z
• Train on 30M state-outcome pairs (s, z), each from a unique game generated by self-play (a sketch of this loop follows below):
  ‣ choose a random time step u
  ‣ sample moves t = 1…u−1 from the SL policy
  ‣ make a random move u
  ‣ sample t = u+1…T from the RL policy and get the game outcome z
  ‣ add the pair (s_u, z_u) to the training set
• One week on 50 GPUs to train on 50M batches of size m = 32

Results:
• MSE on the test set: 0.234
• Close to the MC estimate from the RL policy, but 15,000× faster
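A sketch of the data-generation recipe above; game, sample_move and random_legal_move are hypothetical helpers standing in for a real Go engine:

```python
import random

def generate_training_pair(game, sl_policy, rl_policy):
    """Produce one (s, z) pair: SL policy up to step u-1, one random
    move at u, RL policy from u+1 to the end of the game.
    game / sample_move / random_legal_move are hypothetical helpers."""
    u = random.randrange(1, game.max_moves())
    for _ in range(1, u):
        game.play(sample_move(sl_policy, game))    # moves 1 .. u-1
    game.play(random_legal_move(game))             # random move at u
    s_u = game.state()                             # position to label
    while not game.over():
        game.play(sample_move(rl_policy, game))    # moves u+1 .. T
    z_u = game.outcome()                           # +1 / -1 outcome z
    return s_u, z_u
```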
PLAYING
APV-MCTS: ASYNCHRONOUS POLICY AND VALUE MCTS

Each node s has edges (s, a) for all legal actions, and each edge stores statistics:
• prior P(s, a)
• number of value-network evaluations Nv(s, a)
• number of rollouts Nr(s, a)
• MC value estimate Wv(s, a)
• rollout value estimate Wr(s, a)
• combined mean action value Q(s, a)

Selection: a simulation starts at the root and descends the tree until time step L, when a leaf (unexplored state) is found. The leaf position sL is added to the evaluation queue, and a batch of such leaves is selected for evaluation.
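A compact sketch of these per-edge statistics and of the descent itself. The selection rule a = argmax_a [Q(s, a) + u(s, a)] with a PUCT-style bonus u(s, a) ∝ P(s, a) / (1 + Nr(s, a)) follows the paper; the class names and the exploration constant are illustrative:

```python
import math
from dataclasses import dataclass, field

C_PUCT = 5.0  # exploration constant; the value here is illustrative

@dataclass
class Edge:
    """Per-edge statistics {P, Nv, Nr, Wv, Wr, Q} as listed above."""
    P: float          # prior probability
    Nv: int = 0       # value-network evaluations
    Nr: int = 0       # rollouts
    Wv: float = 0.0   # accumulated value-network results
    Wr: float = 0.0   # accumulated rollout results
    Q: float = 0.0    # combined mean action value

@dataclass
class Node:
    edges: dict = field(default_factory=dict)     # action -> Edge
    children: dict = field(default_factory=dict)  # action -> Node

def select_action(node: Node):
    """In-tree step: argmax over Q(s, a) plus the exploration bonus."""
    total_n = sum(e.Nr for e in node.edges.values())
    def score(item):
        _, e = item
        return e.Q + C_PUCT * e.P * math.sqrt(total_n + 1) / (1 + e.Nr)
    return max(node.edges.items(), key=score)[0]
```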
Evaluation: each queued leaf sL is evaluated in two ways: using the value network, to obtain vθ(sL), and using a rollout simulation that follows the rollout policy pπ(a|s) until the end of the simulated game, to get the final game score. Once every leaf is evaluated, we are ready to propagate the updates.
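The two evaluations are combined with a mixing parameter λ, giving the leaf value V(sL) = (1 − λ)·vθ(sL) + λ·zL (the paper found λ = 0.5 best). A sketch with placeholder callables:

```python
LAMBDA = 0.5  # mixing parameter lambda

def evaluate_leaf(s_L, value_net, rollout_to_end):
    """Mix the value-network estimate with a fast-rollout outcome.
    value_net(s) -> float in (-1, 1); rollout_to_end(s) -> +1 or -1.
    Both callables are stand-ins for the real components."""
    v = value_net(s_L)           # v_theta(s_L)
    z = rollout_to_end(s_L)      # outcome of a p_pi rollout from s_L
    return (1 - LAMBDA) * v + LAMBDA * z

print(evaluate_leaf(None, lambda s: 0.3, lambda s: 1))  # 0.65
```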
Backup: statistics along the path of each simulation are updated during the backward pass through all steps t ≤ L. Visit counts are updated as well, and finally the overall evaluation Q(s, a) of each visited state-action edge is recomputed. The current tree is now up to date.
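A sketch of that backward pass over the visited path, recomputing the combined value as Q(s, a) = (1 − λ)·Wv/Nv + λ·Wr/Nr, as in the paper (it reuses the Edge class from the selection sketch):

```python
def backup(path, v_leaf: float, z_rollout: float, lam: float = 0.5):
    """Propagate one simulation's results up the visited edges.
    path: Edge objects visited from the root down to the leaf."""
    for edge in path:
        edge.Nv += 1
        edge.Wv += v_leaf        # value-network result
        edge.Nr += 1
        edge.Wr += z_rollout     # rollout outcome
        edge.Q = (1 - lam) * edge.Wv / edge.Nv + lam * edge.Wr / edge.Nr
```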
Expansion: once an edge (s, a) has been visited more than a threshold nthr times, its successor state s′ is added to the tree. The new node's priors are initialized using the tree policy pτ(a|s′) and later updated with the SL policy pσ(a|s′). The tree is now expanded, fully updated, and ready for the next move!
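A sketch of the expansion step, reusing the Node/Edge classes above; tree_policy and sl_policy are stand-in callables returning {action: prior} maps, and the fixed threshold is a simplification (the paper adapts nthr dynamically):

```python
N_THR = 40  # expansion threshold n_thr; fixed here for simplicity

def maybe_expand(node, a, successor_state, tree_policy, sl_policy=None):
    """Add the successor s' to the tree once edge (s, a) is visited
    more than N_THR times. Priors come from the fast tree policy and
    are later refreshed by the (slower) SL policy."""
    edge = node.edges[a]
    if a in node.children or edge.Nr <= N_THR:
        return
    child = Node(edges={m: Edge(P=p)
                        for m, p in tree_policy(successor_state).items()})
    node.children[a] = child
    if sl_policy is not None:    # asynchronous refresh with p_sigma
        for m, p in sl_policy(successor_state).items():
            child.edges[m].P = p
```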
WINNING
https://www.youtube.com/watch?v=oRvlyEpOQ-8