Parallelization of Tau-Leap Coarse-Grained Monte Carlo Simulations on GPUs
Lifan Xu, Michela Taufer, Stuart Collins, Dionisios G. Vlachos
Global Computing Lab, University of Delaware
Multiscale Modeling: Challenges
Lifan Xu, GCLab@UD
[Figure: map of modeling methods across length scales (10^-10 to 10^-6 m, up to lab scale) and time scales (10^-12 to 10^3 s): Quantum Mechanics/TST at the smallest scales; MD and extended trajectory calculations; LB, DPD, KMC, DSMC; mesoscopic theory/CGMC; CFD at the largest scales. Arrows indicate downscaling (top-down) and upscaling (bottom-up) information traffic.]
Our goal: increase the time and length scales of CGMC by using GPUs.
By courtesy of Dionisios G. Vlachos
Outline
• Background: CGMC, GPU
• CGMC implementation on a single GPU: optimize memory use, optimize the algorithm
• CGMC implementation on multiple GPUs: combine OpenMP and CUDA
• Accuracy
• Conclusions
• Future work
Coarse-Grained Kinetic Monte Carlo (CGMC)
• Studies phenomena such as catalysis, crystal growth, and surface diffusion
• Groups neighboring microscopic sites (molecules) together into "coarse-grained" cells
[Figure: mapping of 8 x 8 microscopic sites onto 2 x 2 coarse-grained cells]
Terminology
• Coverage: a 2D matrix containing the number of molecules of each species in every cell
• Events: reactions and diffusion
• Probability: a list of probabilities of all possible events in every cell. A probability is calculated from the neighboring cells' coverage and must be updated whenever a neighbor's coverage changes. Events are selected based on these probabilities.
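The relationships above can be sketched in plain Python (a toy model, not the paper's code; the grid size, rate constant, and function names are our own illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (illustrative sizes): a 16-cell system with 2 species.
N_CELLS, N_SPECIES = 16, 2
coverage = rng.integers(10, 100, size=(N_CELLS, N_SPECIES))

def diffusion_probability(coverage, cell, species, k=0.01):
    """Toy propensity for one molecule of `species` leaving `cell`.
    Real CGMC rates also depend on the neighbors' coverage, which is
    why a neighbor's update forces a probability recomputation."""
    return k * coverage[cell, species]

def select_event(probs, rng):
    """Pick one event with probability proportional to its rate."""
    probs = np.asarray(probs, dtype=float)
    return rng.choice(len(probs), p=probs / probs.sum())
```

The dependence of each probability on the neighbors' coverage is what makes the parallelization nontrivial: one executed event invalidates the probabilities of the surrounding cells.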
GPU Overview (I)
• NVIDIA Tesla C1060: 30 streaming multiprocessors (SMs), 8 scalar processors per SM; 30 8-way SIMD cores = 240 processing elements
• Massively parallel, multithreaded: up to 30,720 active threads handled by the thread execution manager
• Processing power: 933 GigaFLOPS (single precision), 78 GigaFLOPS (double precision)
From: CUDA Programming Guide, NVIDIA
GPU Overview (II)
• Memory types:
  Read/write per thread: registers, local memory
  Read/write per block: shared memory
  Read/write per grid: global memory
  Read-only per grid: constant memory, texture memory
• Communication among devices and with the CPU goes through the PCI Express bus
From: CUDA Programming Guide, NVIDIA
Multi-thread Approach
One GPU thread per cell. Within one τ leap, each thread:
1. Reads coverage from global memory
2. Reads probabilities from global memory
3. Selects events
4. Updates coverage in global memory
5. Updates probabilities in global memory
Threads synchronize at the end of each leap.
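One leap of this scheme can be mimicked sequentially (a sketch only: each loop iteration stands in for one GPU thread, and the buffered update stands in for the synchronization barrier; the one-species ring topology and rates are invented for illustration):

```python
import numpy as np

def tau_leap(coverage, rates, tau, rng):
    """One tau leap. Each iteration plays the role of one GPU thread
    (one thread == one cell); updates are buffered in `delta` and
    applied only after the loop, mimicking thread synchronization."""
    n = len(coverage)
    delta = np.zeros_like(coverage)
    for cell in range(n):                    # one "thread" per cell
        # firings of this cell's diffusion event ~ Poisson(rate * tau)
        fires = rng.poisson(rates[cell] * tau)
        delta[cell] -= fires                 # molecules leaving the cell
        delta[(cell + 1) % n] += fires       # hopping to the next cell
    return coverage + delta                  # "sync": apply all updates at once
```

The total molecule count is conserved exactly; production tau-leap codes additionally guard against a leap driving a population negative.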
Memory Optimization
One GPU thread per cell. Within one τ leap, each thread:
1. Reads coverage from shared memory
2. Reads probabilities from global memory
3. Selects and executes events
4. Updates coverage in shared memory
5. Updates probabilities in global memory
After thread synchronization, the coverage is written back to global memory.
Performance
Platforms:
                  GPU (Tesla C1060)   CPU (Intel Xeon X5450)
Number of cores   240                 4
Memory            4 GB (global)       4 GB
Clock rate        1.44 GHz            3 GHz
Performance
Significant increase in performance.
Time scale: CGMC simulations can be 100X faster than a CPU implementation.
Rethinking the CGMC Algorithm for GPUs
• Redesign of cell structures: a macro-cell is a set of cells; the number of cells per macro-cell is flexible
• Redesign of event execution: one thread per macro-cell (coarse-grained parallelism) or one block per macro-cell (fine-grained parallelism); events are replicated across macro-cells
[Figure: cells in a 4x4 system (numbered 1-16), with examples of 2-layer and 3-layer macro-cells]
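A macro-cell on the 4x4 grid can be enumerated as follows (a sketch assuming a periodic grid and von Neumann neighborhoods, consistent with the five-cell 2-layer macro-cells described later; the function is our own illustration):

```python
def macro_cell(center, n=4, layers=2):
    """Cells belonging to a macro-cell on an n x n periodic grid.
    A 2-layer macro-cell is the central cell plus its 4 nearest
    neighbors (5 cells); each extra layer adds the next ring."""
    r, c = divmod(center, n)
    cells = {center}
    frontier = {(r, c)}
    for _ in range(layers - 1):
        ring = set()
        for (rr, cc) in frontier:
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                ring.add(((rr + dr) % n, (cc + dc) % n))  # periodic wrap
        cells |= {rr * n + cc for rr, cc in ring}
        frontier = ring
    return sorted(cells)
```

With more layers, a macro-cell can take more independent leaps before it must synchronize with its neighbors, trading extra replicated work for less frequent synchronization.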
Synchronization Frequency
[Figure: four τ leaps (Leap 1-4) with synchronization points between them; the number of macro-cell layers (first, second, third) determines how often synchronization is needed]
Event Replication
Example: E1: A[cell 1] -> A[cell 2], where A is a molecular species and cells 1 and 2 sit on the boundary between macro-cell 1 and macro-cell 2.
[Figure: macro-cell 1 (block 0) and macro-cell 2 (block 1), with one thread per cell. In Leap 1, the boundary event E1 is replicated: both macro-cells apply A-- to their copy of cell 1 and A++ to their copy of cell 2. In Leap 2, E1 executes once.]
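The replication idea can be sketched with hypothetical data structures (our own illustration: each macro-cell keeps a local copy of the coverage of the cells it touches, and a boundary event is applied in every copy so the replicas stay consistent without a global synchronization):

```python
def replicate_event(local_coverages, owners, src, dst, species="A"):
    """Apply the boundary event species[src] -> species[dst] in every
    macro-cell whose local copy contains src or dst.
    `local_coverages` maps macro-cell id -> {cell: {species: count}};
    `owners` maps macro-cell id -> set of cells it holds a copy of."""
    for mc, cells in owners.items():
        cov = local_coverages[mc]
        if src in cells:
            cov[src][species] -= 1   # A-- in every replica of src
        if dst in cells:
            cov[dst][species] += 1   # A++ in every replica of dst
```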
Random Number Generator
Three options:
• Generate the random numbers on the CPU and transfer them to the GPU: simple and easy; uniform distribution
• Have a dedicated kernel generate the numbers and store them in global memory: many global memory accesses
• Have each thread generate random numbers as needed (e.g., Wallace Gaussian generator): high shared memory requirement
Random Number Generator
Chosen scheme:
1. Generate as many random numbers on the CPU as there are possible events
2. Store the random numbers in texture memory (read-only)
3. Run one leap of the simulation:
   a. Each event fetches its own random number from texture memory
   b. Select events
   c. Execute events and update
4. At the end of the leap, go to step 1
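Steps 1-3a can be mimicked with a pre-filled pool (a sketch; on the GPU the pool lives in read-only texture memory, and the class here is our own invention):

```python
import numpy as np

class RandomPool:
    """Mimics staging pre-generated random numbers in read-only
    memory: a batch is generated up front, each event fetches its
    own number by index, and the pool is refilled per leap."""
    def __init__(self, size, seed=0):
        self.rng = np.random.default_rng(seed)
        self.size = size
        self.refill()

    def refill(self):
        # Steps 1-2: generate one number per possible event, store read-only.
        self.pool = self.rng.random(self.size)
        self.next = 0

    def fetch(self):
        # Step 3a: each event reads its own pre-generated number.
        u = self.pool[self.next]
        self.next += 1
        return u
```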
Coarse-Grained vs. Fine-Grained
• 2-layer coarse-grained: each thread is in charge of one macro-cell; each thread simulates five cells in every first leap and one central cell in every second leap
• 2-layer fine-grained: each block is in charge of one macro-cell and has five threads; each thread simulates one cell in every first leap, and the five threads simulate the central cell together in every second leap
Performance
[Figure: events/ms versus number of cells (64 to 32,768) for the 1-layer shared-memory, 2-layer coarse-grained, 2-layer fine-grained, and 3-layer fine-grained implementations, from small to large molecular systems]
Multi-GPU Implementation
• A single GPU limits the number of cells that can be processed in parallel: use multiple GPUs
• Synchronization between multiple GPUs is costly: take advantage of 2-layer fine-grained parallelism
• Combine OpenMP with CUDA for multi-GPU programming: use portable pinned memory for communication between CPU threads; use mapped pinned memory for data copies between CPU and GPU (zero copy); use write-combined memory for fast data access
Pinned Memory
• Portable pinned memory (PPM): available to all host threads; can be freed by any host thread
• Mapped pinned memory (MPM): allocated on the host side and mapped into the device's address space; two addresses, one in host memory and one in device memory; no explicit memory copy between CPU and GPU
• Write-combined pinned memory (WCM): transfers across the PCI Express bus without the buffer being snooped; high transfer performance but no data-coherency guarantee
Multi-GPU Implementation
[Figure: for the 2-layer implementation, the CPU holds the CGMC parameters and the random number generator (RNG) and communicates with the GPUs through pinned memory (random numbers, coverage, probability). Each GPU (GPU 1, GPU 2) runs repeated leaps, each consisting of: select events, execute events, update coverage, update probability.]
Performance with Different Numbers of GPUs
[Figure: events/ms versus number of cells (2,048 to 81,920) for 1-layer on 1 GPU, 2-layer on 2 GPUs, and 2-layer on 4 GPUs]
Performance with Different Layers on 4 GPUs
[Figure: events/ms versus number of cells (2,048 to 81,920) for the 1-layer, 2-layer, and 3-layer OpenMP+PPM+RNG implementations]
Accuracy
• Simulations with three species of molecules: A, B, and C
  A changes to B and vice versa; 2 B molecules change to C and vice versa; A, B, and C can diffuse
• We track the number of molecules in the system while reaching equilibrium and at equilibrium
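This reaction network can be tau-leaped directly (a toy sketch: rate constants, time step, and initial counts are invented, and diffusion is omitted). The quantity A + B + 2C is invariant under both reversible reactions, which gives one easy correctness check:

```python
import numpy as np

def leap(state, tau, rng, k=(1e-3, 1e-3, 1e-6, 1e-3)):
    """One tau leap of the accuracy-test network:
    A -> B, B -> A, 2B -> C, C -> 2B (diffusion omitted).
    Rate constants `k` and `tau` are illustrative only."""
    A, B, C = state
    # mass-action propensities of the four reaction channels
    a = np.array([k[0]*A, k[1]*B, k[2]*B*(B - 1)/2, k[3]*C])
    n = rng.poisson(a * tau)   # firings of each channel in this leap
    # apply each channel's stoichiometry
    A += n[1] - n[0]
    B += n[0] - n[1] - 2*n[2] + 2*n[3]
    C += n[2] - n[3]
    return [A, B, C]
```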
[Figure: number of molecules versus simulation step (roughly 21,000 to 22,400 molecules) for the GPU, CPU-32, and CPU-64 runs, showing agreement while reaching equilibrium and at equilibrium]
Conclusion
• We present a parallel implementation of the tau-leap method for CGMC simulations on single and multiple GPUs
• We identify the most efficient parallelism granularity for a molecular system of a given size:
  42X speedup for small systems using 2-layer fine-grained thread parallelism
  100X speedup for average systems using 1-layer parallelism
  120X speedup for large systems using 2-layer fine-grained thread parallelism across multiple GPUs
• We increase both the time scale (up to 120 times) and the length scale (up to 100,000)
Future Work
• Study molecular systems with heterogeneous distributions
• Combine MPI, OpenMP, and CUDA to use multiple GPUs across different machines
• Use MPI to implement CGMC parallelization on a CPU cluster, and compare the performance of GPU and CPU clusters
• Link phenomena and models across scales, i.e., from QM to MD to CGMC
Acknowledgements
Sponsors:
Collaborators: Dionisios G. Vlachos and Stuart Collins (UDel)
GCL members in Spring 2010: Trilce Estrada, Boyu Zhang, Abel Licon, Narayan Ganesan, Lifan Xu, Philip Saponaro, Maria Ruiz, Michela Taufer
More questions: [email protected]