Accelerating Parallel Monte Carlo Tree Search using CUDA
Kamil Rocki and Reiji Suda
Department of Computer Science, Graduate School of Information Science and Technology, The University of Tokyo
This work was partially supported by Core Research of Evolutional Science and Technology (CREST) project "ULP-HPC: Ultra Low-Power, High-Performance Computing via Modeling and Optimization of Next Generation HPC Technologies" of Japan Science and Technology Agency (JST) and Grant-in-Aid for Scientific Research of MEXT Japan.
Monte Carlo Tree Search (MCTS) is a method for making optimal decisions in artificial intelligence (AI) problems, typically move planning in combinatorial games. It combines the generality of random simulation with the precision of tree search. It can theoretically be applied to any domain that can be described in terms of (state, action) pairs and a simulation used to forecast outcomes, such as decision support, control, delayed-reward problems or complex optimization.
The motivation for this work is the emergence of GPU-based systems and their high computational potential combined with relatively low power usage compared to CPUs. As the problem to be solved we chose to develop a GPU-based AI agent for the game of Reversi (Othello), which provides a sufficiently complex problem for tree searching, with a non-uniform structure and an average branching factor of over 8.
We present an efficient parallel GPU MCTS implementation based on the introduced 'block-parallelism' scheme, which combines GPU SIMD thread groups and performs independent searches without any need for intra-GPU or inter-GPU communication. The obtained results show that, using our GPU MCTS implementation on the TSUBAME 2.0 system, one GPU can be compared to 100-200 CPU threads in terms of obtained results, depending on factors such as the search time and other MCTS parameters. We propose and analyze simultaneous CPU/GPU execution, which improves the overall result.
Introduction
•The basic MCTS algorithm is simple
•1. Selection
•2. Expansion
•3. Simulation
•4. Backpropagation
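As an illustration, the four steps above can be sketched in plain Python (a minimal sketch; the `Node` class and the `simulate`/`expand_children` callbacks are our own naming, not from the poster):

```python
import math
import random

class Node:
    """MCTS tree node: win/visit statistics plus children."""
    def __init__(self, state, parent=None):
        self.state = state          # opaque game state
        self.parent = parent
        self.children = []
        self.wins = 0
        self.visits = 0

def ucb(node, c):
    """UCB score of a child node; unvisited nodes are tried first."""
    if node.visits == 0:
        return float("inf")
    return (node.wins / node.visits
            + c * math.sqrt(math.log(node.parent.visits) / node.visits))

def mcts_iteration(root, simulate, expand_children, c=1.4):
    # 1. Selection: descend via UCB until a leaf is reached
    node = root
    while node.children:
        node = max(node.children, key=lambda ch: ucb(ch, c))
    # 2. Expansion: add children for the leaf's possible moves
    if node.visits > 0:
        for child_state in expand_children(node.state):
            node.children.append(Node(child_state, parent=node))
        if node.children:
            node = random.choice(node.children)
    # 3. Simulation: random playout from the selected node (returns 0 or 1)
    result = simulate(node.state)
    # 4. Backpropagation: update statistics on the path back to the root
    while node is not None:
        node.visits += 1
        node.wins += result
        node = node.parent
```

Repeating `mcts_iteration` grows the tree asymmetrically toward the most promising moves.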
Standard UCB formula: UCB_i = X̄_i + C * sqrt(ln(N) / n_i), where
X̄_i - mean value of node i (i.e. success/loss ratio)
n_i - number of visits of node i; N - number of visits of its parent
C - exploitation/exploration ratio factor, tunable
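A minimal Python rendering of this standard formula (the function and argument names are ours):

```python
import math

def ucb(mean_value, node_visits, parent_visits, c):
    """Standard UCB: mean node value plus an exploration bonus.

    mean_value    -- wins/visits ratio of the node (X-bar)
    node_visits   -- times this node has been visited (n_i)
    parent_visits -- times its parent has been visited (N)
    c             -- tunable exploitation/exploration factor (C)
    """
    return mean_value + c * math.sqrt(math.log(parent_visits) / node_visits)

# Example: a node that won 3 of 6 playouts, parent visited 10 times, C = 1.0
score = ucb(3 / 6, 6, 10, 1.0)
```

With `c = 0` the search is pure exploitation (greedy on the mean value); larger `c` spreads visits across siblings.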
MCTS - Coulom (2006); UCB - Kocsis and Szepesvári (2006)
Parallel MCTS Schemes - Chaslot et al. (2008)
a. Leaf parallelism (n simulations) - easy
b. Root parallelism (n trees) - efficient
(Tree parallelism - complex, not efficient)
Our approach - Parallel MCTS on GPU = block parallelism (c):
n = blocks (trees) x threads (simulations at once)
Weakness: the CPU sequential tree-management part (proportional to the number of trees)
Advantage: works well with SIMD hardware; improves the overall result on 2 levels of parallelization
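A plain-Python stand-in for one block-parallel kernel launch makes the counting concrete (purely illustrative; in the real implementation this is a CUDA kernel, one independent tree per block and one playout per thread):

```python
import random

def block_parallel_playouts(n_blocks, n_threads, simulate):
    """Emulate one 'kernel launch' of block-parallel MCTS.

    Each 'block' owns an independent tree (root parallelism); each
    'thread' in a block runs one playout at once (leaf parallelism).
    Returns per-tree (wins, playouts) statistics, so one launch
    yields n_blocks * n_threads simulations in total.
    """
    results = []
    for _ in range(n_blocks):                             # one tree per block
        wins = sum(simulate() for _ in range(n_threads))  # SIMD-style playouts
        results.append((wins, n_threads))
    return results

# e.g. the poster's 112 blocks x 128 threads = 14336 simulations per launch
stats = block_parallel_playouts(112, 128, lambda: random.randint(0, 1))
total = sum(n for _, n in stats)
```

The per-tree statistics stay separate, which is what lets the scheme avoid intra-GPU and inter-GPU communication.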
[Figure: example MCTS tree annotated with win/visit counts at each node, e.g. 3/6 at the root with children 3/5 and 1/3.]
The search consists of 2 parts:
1. Tree building - stored in the CPU memory.
2. Simulating - temporary (not remembered), done by the CPU or GPU; the results are used to affect the tree's expansion strategy. Final result: 0 or 1.
•MCTS has many applications already
•New ones are appearing
•The architecture is likely to follow the trend in the future
•Programming GPUs may become easier, rather than harder
TSUBAME 2.0
•CPUs - Intel(R) Xeon(R) CPU X5670 @ 2.93 GHz, ~1400 nodes of 12 cores each
•GPUs - NVIDIA Tesla C2050 - 14 (MPs) x 32 (cores/MP) = 448 cores @ 1.15 GHz, ~1400 nodes of 3 GPUs each (around 515 GFlops peak capability per GPU)
•If not specified otherwise, the MCTS search time = 500 ms and the GPU block size = 128
[Figure: four state-space diagrams - Sequential MCTS, Root parallel MCTS, Leaf parallel MCTS, Block (root-leaf) parallel MCTS.]
Sequential/leaf parallel MCTS can be seen as an optimization problem: with leaf parallelism the search scope is broader/more accurate (more samples), but a single starting point may converge to a local solution (extremum).
Root parallel MCTS uses many starting points, so there is a greater chance of finding the global solution.
Problem statement
Parallel tree search is one of the basic problems in computer science. It is used to solve many kinds of problems. Effective parallelization is hard, especially beyond hundreds of threads. SIMD hardware (i.e. the GPU) is fast, but hard to utilize. How can GPUs/CUDA be utilized?
Mapping MCTS trees to blocks
[Figure: mapping MCTS trees onto GPU hardware. The hardware has a fixed number of multiprocessors (MPs). A GPU program launches a configurable number of blocks (Block 0, Block 1, ...), each containing a configurable number of threads grouped into SIMD warps of 32 threads (fixed for current hardware).]
Root parallelism - one tree per block
Leaf parallelism - the threads within a block
Block parallelism - both combined
Scalability - MPI Parallel Scheme
[Figure: MPI scheme. The root process (id = 0) reads the input data and broadcasts it to the n-1 worker processes, which may run on other machines (e.g. a Core i7 under Fedora, a Phenom under Ubuntu) connected by the network; all processes simulate independently; the results are collected with a reduce into the output data.]
Send the current state of the game to all processes
Think
Choose the best move and send it to the opponent
Receive the opponent’s move
Accumulate results
All simulations are independent
Process number 0 controls the game
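The scheme can be sketched with Python's multiprocessing standing in for MPI (layout and names ours): the controller 'broadcasts' the game state to the workers, all simulations run independently, and the results are reduced at process 0.

```python
import random
from multiprocessing import Pool

def simulate_batch(args):
    """Worker: run n independent playouts from the given state."""
    state, n, seed = args
    rng = random.Random(seed)
    wins = sum(rng.randint(0, 1) for _ in range(n))  # stand-in for a real playout
    return wins, n

def root_parallel_search(state, n_processes=4, playouts_each=1000):
    # 'Broadcast': every worker process receives the current game state
    jobs = [(state, playouts_each, seed) for seed in range(n_processes)]
    with Pool(n_processes) as pool:
        results = pool.map(simulate_batch, jobs)
    # 'Reduce': accumulate the independent results at the controller
    wins = sum(w for w, _ in results)
    total = sum(n for _, n in results)
    return wins, total
```

Because the simulations never exchange data, adding workers only changes how many results reach the reduce step.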
Results and findings
Simultaneous CPU/GPU simulating
[Figure: simulations per second (10^6-10^7) and average point difference (26.5-29.5) vs the number of GPUs (1-32), each GPU running 112 blocks x 64 threads.]
•No communication bottleneck
•The improvement diminishes as more GPUs are added
•At 32 GPUs: 229,376 threads, ~20 mln simulations/s
[Figure: win ratio (0.2-1.0) vs the number of GPU threads (1-14336) against 1 CPU thread, averaged over 2000 games. Curves: leaf parallelism (block size = 64), block parallelism (block size = 32, 448 trees), block parallelism (block size = 128, 112 trees).]
[Figure: 1 GPU vs 128 CPUs, 500 ms search time. Average point difference (score) and average tree depth vs game step (10-60), comparing GPU-only with GPU + CPU.]
[Figure: average score vs game step for 256 GPUs (3,670,016 threads) and 2048 CPU threads vs sequential MCTS.]
•Findings:
•Weak scaling of the algorithm - problem’s complexity affects the scalability
•Exploitation/exploration ratio - higher exploitation needed for more trees
•No communication bottleneck
•Much more efficient than the CPU version
Exploration/exploitation in parallel MCTS
[Figure: per-tree results (trees 1-5 and their SUM) under a high-exploitation setting vs a high-exploration setting.]
[Figure: simulations per second (0-9 x 10^5) vs the number of GPU threads (1-14336). Curves: leaf parallelism (block size = 64), block parallelism (block size = 32, 448 trees), block parallelism (block size = 128, 112 trees).]
1 CPU: around 10,000 simulations/s - the GPU is much faster!
•More trees = higher score
•More simulations = higher score
•More trees = fewer simulations
•Block size needs to be adjusted
•1 GPU ~ 64-128 CPUs (AI power)
•While the GPU runs a kernel, the CPU can work too
• Increases the tree depth, improves the overall result
Hybrid CPU/GPU search
[Figure: timeline of one hybrid search step. The kernel execution call hands part of the tree to the GPU; during the GPU kernel execution time the CPU keeps control and can work too, expanding other parts of the tree in the meantime; the GPU-ready event marks when the results processed by the GPU can be collected.]
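The overlap can be sketched with a background Python thread standing in for the asynchronous GPU kernel (names ours; in the real implementation the overlap comes from asynchronous CUDA kernel launches rather than Python threads):

```python
import random
import threading

def hybrid_step(expand_tree, n_playouts=1000):
    """One hybrid search step: launch the 'GPU' playouts asynchronously,
    keep expanding the tree on the CPU, then collect at the ready event."""
    result = {}
    ready = threading.Event()          # stands in for the 'gpu ready' event

    def gpu_kernel():                  # stands in for the CUDA kernel
        result["wins"] = sum(random.randint(0, 1) for _ in range(n_playouts))
        ready.set()

    worker = threading.Thread(target=gpu_kernel)
    worker.start()                     # kernel execution call returns at once
    expanded = 0
    while not ready.is_set():          # CPU can work here!
        expand_tree()                  # parts expanded by the CPU meanwhile
        expanded += 1
    worker.join()                      # results processed by the 'GPU' are in
    return result["wins"], expanded
```

The CPU-side expansions are what deepen the tree during the kernel run, which is where the improved overall result comes from.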