Hybrid Parallel Programming, Part 2
Bhupender Thakur, HPC @ LSU
6/4/14, LONI Programming Workshop 2014
Overview
• Ring exchange
• Jacobi
• Advanced
• Overlapping
Exchange on a ring

[Figure: ring of ranks: Rank 0 [u0] -> Rank 1 [u1] -> Rank 2 [u2] -> Rank 3 [u3] -> Rank 0]

Right neighbor: mod(rank+1+nproc, nproc)
Left neighbor:  mod(rank-1+nproc, nproc)
Exchange on a ring

[Figure: each Rank i holds u_i and a receive buffer ucopy_i]

1. Set up the initial vector across MPI processes.
2. Start copying vector u to its neighbor's buffer ucopy.
3. In the meantime, do useful work on u.
4. When both the computation and the copy finish, replace u with the newer value from ucopy.
5. Repeat steps 2-4 until all vector components have been cycled.
Exchange on a ring

[Figure: while each rank computes on its local u, the value is also being copied to the next rank's ucopy]

Essentially two ways to improve performance:
• Reduce the number of MPI processes, so as to reduce communication and communication overhead
• Overlap computation and communication
Where is it useful?

The vector space is HUGE, so distributed vectors are the only way to go. The matrix is analytical, so it can be computed on the fly: it either does not need storage or cannot be stored at all!

[Figure: vector partitioned into U0, U1, U2, U3 across ranks]
Code Walkthrough
Exercise

• Replace the blocking communication in the ring code with non-blocking calls.
• Create a local vector c of the same dimensions. Define a function that replicates a banded matrix:
  A(i,j) = 1 for i = j; A(i,j) = random(x) for 1 < abs(i-j) < 100
Cannon's Algorithm: distributed matrix multiplication

• The matrix is distributed across a 2D grid of processors
• The matrix is too large to be present on one node
• Requires communication to get sub-matrices from other ranks

A11 (P0)  A12 (P3)  A13 (P6)
A21 (P1)  A22 (P4)  A23 (P7)
A31 (P2)  A32 (P5)  A33 (P8)
Cannon's Algorithm: distributed matrix multiplication

• How big? Consider size = [1 million x 1 million] 4-byte elements:
  Memory = (10^6 * 10^6 * 4) / 1024^3 ≈ 3725 GB
• How many Supermike nodes (32 GB each)? 3725 / 32 ≈ 116 nodes
Cannon's Algorithm: distributed matrix multiplication

• The input matrices are distributed, and so should the output matrix be.

A11 (P0)  A12 (P3)  A13 (P6)      B11 (P0)  B12 (P3)  B13 (P6)
A21 (P1)  A22 (P4)  A23 (P7)      B21 (P1)  B22 (P4)  B23 (P7)
A31 (P2)  A32 (P5)  A33 (P8)      B31 (P2)  B32 (P5)  B33 (P8)
Cannon's Algorithm: initial arrangement

• The idea is to bring the required pieces of each matrix to each processor. Start with C11, which is accumulated on P0.

(A and B distributed as before; C11 lives on P0.)
Cannon's Algorithm: initial arrangement

C11 = A11 * B11 + A12 * B21 + A13 * B31

The first iteration partly computes C11: A11 * B11. No problem with the 1st row of A and the 1st column of B: those blocks are already in place.

[Figure: A and B grids with A11 and B11 together on P0]
Cannon's Algorithm: initial arrangement

C12 = A11 * B12 + A12 * B22 + A13 * B32

If A12 stays where it is (on P3), then B22 needs to be there too!

[Figure: A and B grids with B22 moved onto A12's processor]
Cannon's Algorithm: initial arrangement

C12 = A11 * B12 + A12 * B22 + A13 * B32
C13 = A11 * B13 + A12 * B23 + A13 * B33

Wind the 2nd column of B up once, and the 3rd column up twice.

[Figure: B grid after winding its columns]
Cannon's Algorithm: initial arrangement

C21 = A21 * B11 + A22 * B21 + A23 * B31
C31 = A31 * B11 + A32 * B21 + A33 * B31

[Figure: A grid after winding its 2nd row left once and its 3rd row left twice]
Cannon's Algorithm: initial arrangement

C32 = A31 * B12 + A32 * B22 + A33 * B32
C23 = A21 * B13 + A22 * B23 + A23 * B33

[Figure: fully skewed A and B grids]
Cannon's Algorithm: initial arrangement

A11 (P0)  A12 (P3)  A13 (P6)
A22 (P1)  A23 (P4)  A21 (P7)
A33 (P2)  A31 (P5)  A32 (P8)

Rows have been shifted to the left: row i is wound left i times.
Cannon's Algorithm: initial arrangement

B11 (P0)  B22 (P3)  B33 (P6)
B21 (P1)  B32 (P4)  B13 (P7)
B31 (P2)  B12 (P5)  B23 (P8)

Columns have been shifted upwards: column j is wound up j times.
Cannon's Algorithm: initial arrangement

A11 (P0)  A12 (P3)  A13 (P6)      B11 (P0)  B22 (P3)  B33 (P6)
A22 (P1)  A23 (P4)  A21 (P7)      B21 (P1)  B32 (P4)  B13 (P7)
A33 (P2)  A31 (P5)  A32 (P8)      B31 (P2)  B12 (P5)  B23 (P8)

Matrices after initial skewing.
Cannon's Algorithm: iterative compute and transfer

• Sub-matrices within a row are copied from the right neighbor to the left.
• Sub-matrices within a column are copied from the bottom neighbor to the top.
• Each iteration multiplies the local sub-matrix of A with the local sub-matrix of B; partial results are added to the local sub-matrix of C. The full matrices are never needed.

(Grids as after the initial skewing.)
Cannon's Algorithm: iterative compute and transfer

Sub-matrices of A are copied from the right neighbor to the left, and sub-matrices of B are copied from the bottom neighbor to the top.
Partial results from each iteration are added to the local matrix C.

Local (A, B) pairs in the first iteration:
(A11, B11)  (A12, B22)  (A13, B33)
(A22, B21)  (A23, B32)  (A21, B13)
(A33, B31)  (A31, B12)  (A32, B23)
Cannon's Algorithm: iterative compute and transfer

The algorithm begs for overlap of computation and communication!
Jacobi solver

1. The master thread calculates the boundary rows and Isends/Irecvs them to/from the other processes: send down, send up, recv down, recv up.
2. All threads chime in and start working on the inner points. Check convergence.
3. Wait for communication to finish; copy new to old.
4. Proceed with another iteration if needed.

[Figure: Proc k exchanges its boundary rows with Proc k-1 and Proc k+1]
Jacobi solver

/* Use master thread to calculate and communicate boundaries */
#pragma omp master
{
    /* Loop over top and bottom boundary */
    for (k = 1; k <= NC; k++){
        /* Calculate average of neighbors as new value (point Jacobi method) */
        t[*new][1][k]   = 0.25 * (t[old][2][k]     + t[old][0][k] +
                                  t[old][1][k+1]   + t[old][1][k-1]);
        t[*new][nrl][k] = 0.25 * (t[old][nrl+1][k] + t[old][nrl-1][k] +
                                  t[old][nrl][k+1] + t[old][nrl][k-1]);
        /* Calculate local maximum change from last step; puts thread's max in d */
        d = MAX(fabs(t[*new][1][k]   - t[old][1][k]),   d);
        d = MAX(fabs(t[*new][nrl][k] - t[old][nrl][k]), d);
    }
Jacobi solver

    if (nPEs != 1){  /* Exchange boundaries with neighbor tasks */
        if (myPE < nPEs-1)   /* Sending down; only nPEs-1 ranks do this */
            MPI_Isend(&(t[*new][nrl][1]), NC, MPI_FLOAT, myPE+1, DOWN,
                      MPI_COMM_WORLD, &request[0]);
        if (myPE != 0)       /* Sending up; only nPEs-1 ranks do this */
            MPI_Isend(&t[*new][1][1], NC, MPI_FLOAT, myPE-1, UP,
                      MPI_COMM_WORLD, &request[1]);
        if (myPE != 0)       /* Receive from up (tag DOWN: message was sent downwards) */
            MPI_Irecv(&t[*new][0][1], NC, MPI_FLOAT, MPI_ANY_SOURCE, DOWN,
                      MPI_COMM_WORLD, &request[2]);
        if (myPE != nPEs-1)  /* Receive from down (tag UP: message was sent upwards) */
            MPI_Irecv(&t[*new][nrl+1][1], NC, MPI_FLOAT, MPI_ANY_SOURCE, UP,
                      MPI_COMM_WORLD, &request[3]);
    }
}  /* end of omp master */

Computation goes here, followed by MPI_Wait on the outstanding requests.
Jacobi solver: code walkthrough

/* Everyone calculates values and finds local max change */
#pragma omp for schedule(runtime) nowait
for (i = 2; i <= nrl-1; i++)
    for (j = 1; j <= NC; j++){
        t[*new][i][j] = 0.25 * (t[old][i+1][j] + t[old][i-1][j] +
                                t[old][i][j+1] + t[old][i][j-1]);
        d = MAX(fabs(t[*new][i][j] - t[old][i][j]), d);
    }
/* Local max change becomes task-global max change */
#pragma omp critical
dt = MAX(d, dt);  /* Finds max of the d's */
}
Advanced Hybrid Programming

[Figure: CPU0 runs threads 0-3 and drives GPU0; CPU1 runs threads 0-3 and drives GPU1]