Parallelization of Tau-Leap Coarse-Grained Monte Carlo Simulations on GPUs
Lifan Xu, Michela Taufer, Stuart Collins, Dionisios G. Vlachos
Global Computing Lab, University of Delaware
Multiscale Modeling: Challenges
Lifan Xu, GCLab@UD
[Figure: map of modeling methods across length scales (10^-10 to 10^-6 m, up to lab scale) and time scales (10^-12 to 10^3 s): Quantum Mechanics/TST at the smallest scales; MD and extended trajectory calculations; LB, DPD, KMC, DSMC; mesoscopic theory/CGMC; CFD at the largest scales. Arrows indicate downscaling (top-down) and upscaling (bottom-up) information traffic.]
Our goal: increase the time and length scales of CGMC by using GPUs.
By courtesy of Dionisios G. Vlachos
Outline
• Background: CGMC, GPU
• CGMC implementation on a single GPU: optimize memory use, optimize the algorithm
• CGMC implementation on multiple GPUs: combine OpenMP and CUDA
• Accuracy
• Conclusions
• Future work
Coarse-Grained Kinetic Monte Carlo (CGMC)
• Studies phenomena such as catalysis, crystal growth, and surface diffusion
• Groups neighboring microscopic sites (molecules) together into "coarse-grained" cells
[Figure: mapping of 8 x 8 microscopic sites onto 2 x 2 coarse-grained cells]
Terminology
• Coverage: a 2D matrix containing the number of molecules of each species in every cell
• Events: reactions and diffusion
• Probability: a list of probabilities of all possible events in every cell. A probability is calculated from the neighboring cells' coverage and must be updated whenever a neighbor's coverage changes. Events are selected based on these probabilities.
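The relationships above can be sketched in plain Python (a toy model, not the paper's code; the grid size, rate constant, and function names are our own illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (illustrative sizes): a 16-cell system with 2 species.
N_CELLS, N_SPECIES = 16, 2
coverage = rng.integers(10, 100, size=(N_CELLS, N_SPECIES))

def diffusion_probability(coverage, cell, species, k=0.01):
    """Toy propensity for one molecule of `species` leaving `cell`.
    Real CGMC rates also depend on the neighbors' coverage, which is
    why a neighbor's update forces a probability recomputation."""
    return k * coverage[cell, species]

def select_event(probs, rng):
    """Pick one event with probability proportional to its rate."""
    probs = np.asarray(probs, dtype=float)
    return rng.choice(len(probs), p=probs / probs.sum())
```

The dependence of each probability on the neighbors' coverage is what makes the parallelization nontrivial: one executed event invalidates the probabilities of the surrounding cells.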
GPU Overview (I)
• NVIDIA Tesla C1060: 30 streaming multiprocessors (SMs), 8 scalar processors per SM; 30 8-way SIMD cores = 240 processing elements
• Massively parallel, multithreaded: up to 30,720 active threads handled by the thread execution manager
• Processing power: 933 GigaFLOPS (single precision), 78 GigaFLOPS (double precision)
From: CUDA Programming Guide, NVIDIA
GPU Overview (II)
• Memory types:
  Read/write per thread: registers, local memory
  Read/write per block: shared memory
  Read/write per grid: global memory
  Read-only per grid: constant memory, texture memory
• Communication among devices and with the CPU goes through the PCI Express bus
From: CUDA Programming Guide, NVIDIA
Multi-thread Approach
One GPU thread per cell. Within one τ leap, each thread:
1. Reads coverage from global memory
2. Reads probabilities from global memory
3. Selects events
4. Updates coverage in global memory
5. Updates probabilities in global memory
Threads synchronize at the end of each leap.
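One leap of this scheme can be mimicked sequentially (a sketch only: each loop iteration stands in for one GPU thread, and the buffered update stands in for the synchronization barrier; the one-species ring topology and rates are invented for illustration):

```python
import numpy as np

def tau_leap(coverage, rates, tau, rng):
    """One tau leap. Each iteration plays the role of one GPU thread
    (one thread == one cell); updates are buffered in `delta` and
    applied only after the loop, mimicking thread synchronization."""
    n = len(coverage)
    delta = np.zeros_like(coverage)
    for cell in range(n):                    # one "thread" per cell
        # firings of this cell's diffusion event ~ Poisson(rate * tau)
        fires = rng.poisson(rates[cell] * tau)
        delta[cell] -= fires                 # molecules leaving the cell
        delta[(cell + 1) % n] += fires       # hopping to the next cell
    return coverage + delta                  # "sync": apply all updates at once
```

The total molecule count is conserved exactly; production tau-leap codes additionally guard against a leap driving a population negative.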
Memory Optimization
One GPU thread per cell. Within one τ leap, each thread:
1. Reads coverage from shared memory
2. Reads probabilities from global memory
3. Selects and executes events
4. Updates coverage in shared memory
5. Updates probabilities in global memory
After thread synchronization, the coverage is written back to global memory.
Performance
Platforms:
                  GPU (Tesla C1060)   CPU (Intel Xeon X5450)
Number of cores   240                 4
Memory            4 GB (global)       4 GB
Clock rate        1.44 GHz            3 GHz
Performance
Significant increase in performance.
Time scale: CGMC simulations can be 100X faster than a CPU implementation.
Rethinking the CGMC Algorithm for GPUs
• Redesign of cell structures: a macro-cell is a set of cells; the number of cells per macro-cell is flexible
• Redesign of event execution: one thread per macro-cell (coarse-grained parallelism) or one block per macro-cell (fine-grained parallelism); events are replicated across macro-cells
[Figure: cells in a 4x4 system (numbered 1-16), with examples of 2-layer and 3-layer macro-cells]
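A macro-cell on the 4x4 grid can be enumerated as follows (a sketch assuming a periodic grid and von Neumann neighborhoods, consistent with the five-cell 2-layer macro-cells described later; the function is our own illustration):

```python
def macro_cell(center, n=4, layers=2):
    """Cells belonging to a macro-cell on an n x n periodic grid.
    A 2-layer macro-cell is the central cell plus its 4 nearest
    neighbors (5 cells); each extra layer adds the next ring."""
    r, c = divmod(center, n)
    cells = {center}
    frontier = {(r, c)}
    for _ in range(layers - 1):
        ring = set()
        for (rr, cc) in frontier:
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                ring.add(((rr + dr) % n, (cc + dc) % n))  # periodic wrap
        cells |= {rr * n + cc for rr, cc in ring}
        frontier = ring
    return sorted(cells)
```

With more layers, a macro-cell can take more independent leaps before it must synchronize with its neighbors, trading extra replicated work for less frequent synchronization.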
Synchronization Frequency
[Figure: four τ leaps (Leap 1-4) with synchronization points between them; the number of macro-cell layers (first, second, third) determines how often synchronization is needed]
Event Replication
Example: E1: A[cell 1] -> A[cell 2], where A is a molecular species and cells 1 and 2 sit on the boundary between macro-cell 1 and macro-cell 2.
[Figure: macro-cell 1 (block 0) and macro-cell 2 (block 1), with one thread per cell. In Leap 1, the boundary event E1 is replicated: both macro-cells apply A-- to their copy of cell 1 and A++ to their copy of cell 2. In Leap 2, E1 executes once.]
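The replication idea can be sketched with hypothetical data structures (our own illustration: each macro-cell keeps a local copy of the coverage of the cells it touches, and a boundary event is applied in every copy so the replicas stay consistent without a global synchronization):

```python
def replicate_event(local_coverages, owners, src, dst, species="A"):
    """Apply the boundary event species[src] -> species[dst] in every
    macro-cell whose local copy contains src or dst.
    `local_coverages` maps macro-cell id -> {cell: {species: count}};
    `owners` maps macro-cell id -> set of cells it holds a copy of."""
    for mc, cells in owners.items():
        cov = local_coverages[mc]
        if src in cells:
            cov[src][species] -= 1   # A-- in every replica of src
        if dst in cells:
            cov[dst][species] += 1   # A++ in every replica of dst
```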
Random Number Generator
Three options:
• Generate the random numbers on the CPU and transfer them to the GPU: simple and easy; uniform distribution
• Have a dedicated kernel generate the numbers and store them in global memory: many global memory accesses
• Have each thread generate random numbers as needed (e.g., Wallace Gaussian generator): high shared memory requirement
Random Number Generator
Chosen scheme:
1. Generate as many random numbers on the CPU as there are possible events
2. Store the random numbers in texture memory (read-only)
3. Run one leap of the simulation:
   a. Each event fetches its own random number from texture memory
   b. Select events
   c. Execute events and update
4. At the end of the leap, go to step 1
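Steps 1-3a can be mimicked with a pre-filled pool (a sketch; on the GPU the pool lives in read-only texture memory, and the class here is our own invention):

```python
import numpy as np

class RandomPool:
    """Mimics staging pre-generated random numbers in read-only
    memory: a batch is generated up front, each event fetches its
    own number by index, and the pool is refilled per leap."""
    def __init__(self, size, seed=0):
        self.rng = np.random.default_rng(seed)
        self.size = size
        self.refill()

    def refill(self):
        # Steps 1-2: generate one number per possible event, store read-only.
        self.pool = self.rng.random(self.size)
        self.next = 0

    def fetch(self):
        # Step 3a: each event reads its own pre-generated number.
        u = self.pool[self.next]
        self.next += 1
        return u
```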
Coarse-Grained vs. Fine-Grained
• 2-layer coarse-grained: each thread is in charge of one macro-cell; each thread simulates five cells in every first leap and one central cell in every second leap
• 2-layer fine-grained: each block is in charge of one macro-cell and has five threads; each thread simulates one cell in every first leap, and the five threads simulate the central cell together in every second leap
Performance
[Figure: events/ms versus number of cells (64 to 32,768) for the 1-layer shared-memory, 2-layer coarse-grained, 2-layer fine-grained, and 3-layer fine-grained implementations, from small to large molecular systems]
Multi-GPU Implementation
• A single GPU limits the number of cells that can be processed in parallel: use multiple GPUs
• Synchronization between multiple GPUs is costly: take advantage of 2-layer fine-grained parallelism
• Combine OpenMP with CUDA for multi-GPU programming: use portable pinned memory for communication between CPU threads; use mapped pinned memory for data copies between CPU and GPU (zero copy); use write-combined memory for fast data access
Pinned Memory
• Portable pinned memory (PPM): available to all host threads; can be freed by any host thread
• Mapped pinned memory (MPM): allocated on the host side and mapped into the device's address space; two addresses, one in host memory and one in device memory; no explicit memory copy between CPU and GPU
• Write-combined pinned memory (WCM): transfers across the PCI Express bus without the buffer being snooped; high transfer performance but no data-coherency guarantee
Multi-GPU Implementation
[Figure: for the 2-layer implementation, the CPU holds the CGMC parameters and the random number generator (RNG) and communicates with the GPUs through pinned memory (random numbers, coverage, probability). Each GPU (GPU 1, GPU 2) runs repeated leaps, each consisting of: select events, execute events, update coverage, update probability.]
Performance with Different Numbers of GPUs
[Figure: events/ms versus number of cells (2,048 to 81,920) for 1-layer on 1 GPU, 2-layer on 2 GPUs, and 2-layer on 4 GPUs]
Performance with Different Layers on 4 GPUs
[Figure: events/ms versus number of cells (2,048 to 81,920) for the 1-layer, 2-layer, and 3-layer OpenMP+PPM+RNG implementations]
Accuracy
• Simulations with three species of molecules: A, B, and C
  A changes to B and vice versa; 2 B molecules change to C and vice versa; A, B, and C can diffuse
• We track the number of molecules in the system while reaching equilibrium and at equilibrium
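This reaction network can be tau-leaped directly (a toy sketch: rate constants, time step, and initial counts are invented, and diffusion is omitted). The quantity A + B + 2C is invariant under both reversible reactions, which gives one easy correctness check:

```python
import numpy as np

def leap(state, tau, rng, k=(1e-3, 1e-3, 1e-6, 1e-3)):
    """One tau leap of the accuracy-test network:
    A -> B, B -> A, 2B -> C, C -> 2B (diffusion omitted).
    Rate constants `k` and `tau` are illustrative only."""
    A, B, C = state
    # mass-action propensities of the four reaction channels
    a = np.array([k[0]*A, k[1]*B, k[2]*B*(B - 1)/2, k[3]*C])
    n = rng.poisson(a * tau)   # firings of each channel in this leap
    # apply each channel's stoichiometry
    A += n[1] - n[0]
    B += n[0] - n[1] - 2*n[2] + 2*n[3]
    C += n[2] - n[3]
    return [A, B, C]
```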
[Figure: number of molecules versus simulation step (roughly 21,000 to 22,400 molecules) for the GPU, CPU-32, and CPU-64 runs, showing agreement while reaching equilibrium and at equilibrium]
Conclusion
• We present a parallel implementation of the tau-leap method for CGMC simulations on single and multiple GPUs
• We identify the most efficient parallelism granularity for a molecular system of a given size:
  42X speedup for small systems using 2-layer fine-grained thread parallelism
  100X speedup for average systems using 1-layer parallelism
  120X speedup for large systems using 2-layer fine-grained thread parallelism across multiple GPUs
• We increase both the time scale (up to 120 times) and the length scale (up to 100,000)
Future Work
• Study molecular systems with heterogeneous distributions
• Combine MPI, OpenMP, and CUDA to use multiple GPUs across different machines
• Use MPI to implement CGMC parallelization on a CPU cluster, and compare the performance of GPU and CPU clusters
• Link phenomena and models across scales, i.e., from QM to MD to CGMC
Acknowledgements
Sponsors:
Collaborators: Dionisios G. Vlachos and Stuart Collins (UDel)
GCL members in Spring 2010: Trilce Estrada, Boyu Zhang, Abel Licon, Narayan Ganesan, Lifan Xu, Philip Saponaro, Maria Ruiz, Michela Taufer
More questions: [email protected]