JACOBI ITERATIVE TECHNIQUE ON MULTI GPU PLATFORM
By Ishtiaq Hossain and Venkata Krishna Nimmagadda
APPLICATION OF JACOBI ITERATION
Cardiac tissue is modeled as a grid of cells.
Each GPU thread handles the voltage calculation for one cell. This calculation requires the voltage values of the neighboring cells.
Two different models are shown in the bottom right corner.
Vcell0 in the current time step is calculated from the values of the surrounding cells in the previous time step, which avoids synchronization issues.
$$V_{\text{cell}_0}^{k} = f\left(V_{\text{cell}_1}^{k-1} + V_{\text{cell}_2}^{k-1} + V_{\text{cell}_3}^{k-1} + \dots + V_{\text{cell}_N}^{k-1}\right)$$
where N can be 6 or 18
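The update rule can be sketched in plain host-side C, with the loop body standing in for what one GPU thread would do for its cell. This is only a sketch under assumptions: the slides do not define f, so it is taken here to be the average of the N = 6 face neighbours, boundary cells are simply copied, and the names `cell_index` and `jacobi_sweep6` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Flattened index of cell (x, y, z) in an s*s*s grid. */
size_t cell_index(int x, int y, int z, int s)
{
    return ((size_t)z * s + y) * s + x;
}

/* One Jacobi sweep. Every interior cell of `next` is computed only from
 * the six face neighbours stored in `prev` (the k-1 values), so no cell
 * reads a value written during the same sweep.
 * ASSUMPTION: f() is the plain average of the neighbours; the slides do
 * not define f. Boundary cells are copied unchanged. */
void jacobi_sweep6(const double *prev, double *next, int s)
{
    assert(s >= 3);
    for (int z = 0; z < s; ++z) {
        for (int y = 0; y < s; ++y) {
            for (int x = 0; x < s; ++x) {
                size_t i = cell_index(x, y, z, s);
                if (x == 0 || y == 0 || z == 0 ||
                    x == s - 1 || y == s - 1 || z == s - 1) {
                    next[i] = prev[i];   /* boundary: keep old value */
                    continue;
                }
                double sum = prev[cell_index(x - 1, y, z, s)]
                           + prev[cell_index(x + 1, y, z, s)]
                           + prev[cell_index(x, y - 1, z, s)]
                           + prev[cell_index(x, y + 1, z, s)]
                           + prev[cell_index(x, y, z - 1, s)]
                           + prev[cell_index(x, y, z + 1, s)];
                next[i] = sum / 6.0;
            }
        }
    }
}
```

Keeping `prev` and `next` as separate buffers is what makes this a Jacobi sweep: the caller swaps the two pointers between sweeps instead of updating in place.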
APPLICATION OF JACOBI ITERATION
Initial values are provided to start computation
In a single time step, the ODE and PDE parts are evaluated sequentially and added.
By solving the finite difference equations, each thread calculates the voltage value of its cell for the time step.
Figure 1 shows a healthy cell’s voltage curve with time.
Figure 1
THE TIME STEP
Solve the ODE part and add it to the current cell's voltage to obtain the voltage Vtemp1 for each cell.
Vtemp2 is generated in every iteration for all the cells in the grid.
Calculation of Vtemp2 requires the Vtemp2 values of the previous time step.
Once the iterations are completed, the final Vtemp2 is added to Vtemp1 to generate the voltage values for that time step.
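The structure of one time step can be sketched as follows, on a 1D chain of cells to keep it short. Everything model-specific here is an assumption and not taken from the slides: the cell model `cell_model_rhs` (dV/dt = -V), the neighbour-average Jacobi update, the fixed iteration count, and the function names.

```c
#include <assert.h>
#include <string.h>

/* HYPOTHETICAL cell model: dV/dt = -V. The slides do not give the ODE. */
double cell_model_rhs(double v) { return -v; }

/* Advance n cells by one time step.
 *   v       : cell voltages, updated in place
 *   vtemp2  : PDE term carried over from the previous time step,
 *             updated in place
 *   scratch : work buffer of n doubles for the Jacobi ping-pong */
void time_step(double *v, double *vtemp2, double *scratch,
               int n, double dt, int iters)
{
    assert(n >= 3 && iters >= 0);

    /* ODE part: forward Euler gives Vtemp1 (stored in v). */
    for (int i = 0; i < n; ++i)
        v[i] += dt * cell_model_rhs(v[i]);

    /* PDE part: Jacobi sweeps. Each sweep reads only `cur` and
     * writes only `nxt`; end cells are held fixed (assumption). */
    double *cur = vtemp2, *nxt = scratch;
    for (int it = 0; it < iters; ++it) {
        nxt[0] = cur[0];
        nxt[n - 1] = cur[n - 1];
        for (int i = 1; i < n - 1; ++i)
            nxt[i] = 0.5 * (cur[i - 1] + cur[i + 1]);
        double *tmp = cur; cur = nxt; nxt = tmp;
    }
    if (cur != vtemp2)   /* odd number of sweeps: result is in scratch */
        memcpy(vtemp2, cur, (size_t)n * sizeof *vtemp2);

    /* Final voltage for this time step: Vtemp1 + Vtemp2. */
    for (int i = 0; i < n; ++i)
        v[i] += vtemp2[i];
}
```

With `v = {1, 1, 1}`, `vtemp2 = {0, 0, 2}`, `dt = 0.5` and one sweep, the Euler step gives 0.5 for every cell, the sweep gives `vtemp2 = {0, 1, 2}`, and the sum is `{0.5, 1.5, 2.5}`.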
CORRECTNESS OF OUR IMPLEMENTATION
MEMORY COALESCING
typedef struct __align__(N)
{
    int a[N];
    int b[N];
    // ... (further members elided on the slide)
} NODE;
// ...
NODE nodes[N*N];
N*N blocks of N threads each are launched so that the N threads of a block access values in consecutive memory locations.
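Why this layout helps can be checked with a little address arithmetic in plain C (no CUDA qualifiers, so `__align__` is omitted). This is an illustrative sketch: the value `N = 32`, the contrasting per-thread `CELL` struct, and both helper functions are assumptions made for the comparison, not part of the slides.

```c
#include <assert.h>
#include <stddef.h>

#define N 32   /* ASSUMPTION: the slides leave N symbolic */

/* Layout from the slide: one NODE per block, arrays indexed by thread. */
typedef struct {
    int a[N];
    int b[N];
} NODE;

/* HYPOTHETICAL contrasting layout: one small struct per thread. */
typedef struct {
    int a;
    int b;
} CELL;

/* Byte offset of nodes[blk].a[t] from the start of the nodes array. */
size_t offset_of_a(size_t blk, size_t t)
{
    assert(t < N);
    return blk * sizeof(NODE) + offsetof(NODE, a) + t * sizeof(int);
}

/* Byte offset of cells[t].a from the start of a CELL array. */
size_t offset_of_cell_a(size_t t)
{
    return t * sizeof(CELL) + offsetof(CELL, a);
}
```

With the slide's layout, threads t and t+1 of a block read addresses `sizeof(int)` bytes apart, so the accesses of a block fall in one contiguous run. With one struct per thread the stride grows to `sizeof(CELL)`, and each fetched segment carries data the access does not need.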
[Chart: time in milliseconds (axis 260 to 310) for a single cell, by design of data structure: unaligned vs. with __align__]
SERIAL VS SINGLE GPU
Serial is not helpful
[Chart: serial time in secs (axis 0 to 800) for grid sizes 3X3X3, 10X10X10, 20X20X20, 30X30X30]
Hey serial, what takes you so long?
[Chart: 1 GPU time in secs (axis 0 to 50) for grid sizes 16X16X16, 32X32X32, 64X64X64]
128X128X128 gives us 309 secs
Enormous Speed Up
STEP 1 LESSONS LEARNT
Choose a data structure that maximizes memory coalescing.
The mechanics of serial code and parallel code are very different.
Develop algorithms that address the areas where serial code takes a long time.
MULTI GPU APPROACH
Multiple Host Threads Creation
Establishing Multiple Host-GPU Contexts
Choosing Time Step According to Current Phase
Solve Cell Model ODE
Solve Communication Model PDE
Visualize Data
OpenMP is used for launching host threads.
Data partitioning and kernel invocation are done for GPU computation.
The ODE is solved using the Forward Euler method.
The PDE is solved using Jacobi iteration.
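The "one host thread per GPU" pattern can be sketched with an OpenMP parallel loop. This is a host-only sketch under assumptions: the function name `run_on_all_gpus` and the `device_of_thread` array are hypothetical, and the CUDA calls a real program would make (for example `cudaSetDevice(g)` followed by copies and kernel launches) appear only as a comment so the code builds without a GPU toolchain.

```c
#include <assert.h>

/* Run one loop iteration per GPU, each on its own host thread when the
 * code is compiled with OpenMP enabled. Without OpenMP the pragma is
 * ignored and the iterations simply run one after another. */
void run_on_all_gpus(int ngpu, int *device_of_thread)
{
    assert(ngpu >= 1);
    #pragma omp parallel for num_threads(ngpu)
    for (int g = 0; g < ngpu; ++g) {
        /* A real program would bind this host thread to GPU g here,
         * e.g. cudaSetDevice(g), then copy its data partition to the
         * device and launch the ODE and PDE kernels. */
        device_of_thread[g] = g;   /* record which device was served */
    }
}
```

Each iteration writes a distinct array element, so the host threads need no locking among themselves.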
INTER GPU DATA PARTITIONING
•Let both cubes have dimensions s X s X s.
•Interface region of the left one is $2s^2$.
•Interface region of the right one is $3s^2$.
•After division, data is copied into the device memory (global) of each GPU.
•Input data: 2D array of structures. Structures contain arrays.
•Data resides in host memory.
Interface Region
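How the interface size depends on the way the cube is divided can be illustrated with a slab decomposition. This is one possible scheme, assumed here for illustration: the grid is cut into `ngpu` slabs along a single axis, and the names `slab_range` and `interface_cells` are hypothetical. The slides' figure, which is not reproduced in this transcript, may show a different division.

```c
#include <assert.h>

/* Planes [*first, *first + *count) along one axis that GPU `gpu` owns
 * when an s x s x s grid is cut into `ngpu` slabs of near-equal
 * thickness (the first s % ngpu slabs get one extra plane). */
void slab_range(int s, int ngpu, int gpu, int *first, int *count)
{
    assert(ngpu >= 1 && ngpu <= s && gpu >= 0 && gpu < ngpu);
    int base = s / ngpu;
    int rem = s % ngpu;
    *count = base + (gpu < rem ? 1 : 0);
    *first = gpu * base + (gpu < rem ? gpu : rem);
}

/* Total cut-face area in cells: (ngpu - 1) cuts of s * s cells each. */
long interface_cells(int s, int ngpu)
{
    assert(s >= 1 && ngpu >= 1);
    return (long)(ngpu - 1) * s * s;
}
```

Under this assumption three slabs produce two cuts, an interface area of $2s^2$, and every extra cut adds another $s^2$ cells that need data from a neighbouring GPU.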
SOLVING PDES USING MULTIPLE GPUS
During each Jacobi iteration, threads use global memory to share data among themselves.
Threads in the interface region need data from other GPUs.
Inter-GPU sharing is done through host memory.
A separate kernel is launched that handles the interface region computation and copies the result back to device memory, so the GPUs stay synchronized.
Once the PDE calculation is completed for one time step, all values are written back to host memory.
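Sharing interface data through host memory can be sketched as a staged copy between two slabs. This is a host-only stand-in, not the authors' code: the two `dev` arrays merely play the role of device buffers, the extra ghost plane per slab is an assumed layout, and `exchange_interface` is a hypothetical name. In a CUDA program each `memcpy` below would be a `cudaMemcpy` in the device-to-host or host-to-device direction.

```c
#include <assert.h>
#include <string.h>

/* Exchange interface planes between two neighbouring slabs via a host
 * staging buffer of `plane_cells` doubles.
 * ASSUMED layout:
 *   dev0: planes0 real planes, followed by 1 ghost plane
 *   dev1: 1 ghost plane, followed by planes1 real planes */
void exchange_interface(double *dev0, int planes0,
                        double *dev1, int planes1,
                        int plane_cells, double *host_buf)
{
    assert(planes0 >= 1 && planes1 >= 1 && plane_cells >= 1);
    size_t bytes = (size_t)plane_cells * sizeof(double);

    /* Last real plane of slab 0 -> host -> ghost plane of slab 1. */
    memcpy(host_buf, dev0 + (size_t)(planes0 - 1) * plane_cells, bytes);
    memcpy(dev1, host_buf, bytes);

    /* First real plane of slab 1 -> host -> ghost plane of slab 0. */
    memcpy(host_buf, dev1 + plane_cells, bytes);
    memcpy(dev0 + (size_t)planes0 * plane_cells, host_buf, bytes);
}
```

After the exchange, each slab can run its next Jacobi sweep on its own data, because the neighbour values it needs are already in its ghost plane.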
SOLVING PDES USING MULTIPLE GPUS
[Timeline diagram along a time axis, with stages labeled: host to device copy, GPU computation, device to host copy, interface region computation]
THE CIRCUS OF INTER GPU SYNC
Ghost cell computing! Pad with dummy cells at the inter-GPU interfaces to reduce communication.
Let's make the other cores of the CPU work: 4 out of 8 CPU cores hold contexts, so use the 4 free cores to do the interface computation.
Simple is the best: launch new kernels with different dimensions to handle the cells at the interface.
VARIOUS STAGES
[Chart: time spent in each stage: inter-GPU sync, solve PDE, solve ODE, memory copy]
Solving the ODE and PDE takes most of the time.
Interestingly, solving the PDE using Jacobi iteration takes the largest share.
SCALABILITY
[Chart: panels A, B, C, D comparing 1 GPU, 2 GPU, and 3 GPU; axis 0 to 500]
A = 32X32X32 cells executed by each GPU
B = 32X32X32 cells executed by each GPU
C = 32X32X32 cells executed by each GPU
D = 32X32X32 cells executed by each GPU
STEP 2 LESSONS LEARNT
The Jacobi iterative technique scales pretty well.
Interface selection is very important.
Making a multi-GPU program generic takes a lot of effort from the programmer.
LET'S WATCH A VIDEO
Q & A