Parallel Programming in C with MPI and OpenMP
Transcript of Parallel Programming in C with MPI and OpenMP
Copyright © The McGraw-Hill Companies, Inc. Permission required for reproduction or display.
Parallel Programming in C with MPI and OpenMP
Michael J. Quinn
Chapter 18
Combining MPI and OpenMP
Outline
Advantages of using both MPI and OpenMP
Case Study: Conjugate gradient method
Case Study: Jacobi method
C+MPI vs. C+MPI+OpenMP
(Figure: side-by-side comparison of a C + MPI program and a C + MPI + OpenMP program)
Why C + MPI + OpenMP Can Execute Faster
Lower communication overhead
More portions of the program may be practical to parallelize
May allow more overlap of communications with computations
Case Study: Conjugate Gradient
Conjugate gradient method solves Ax = b
In our program we assume A is dense
Methodology:
Start with MPI program
Profile functions to determine where most execution time is spent
Tackle most time-intensive function first
Result of Profiling MPI Program
Function                1 CPU     8 CPUs
matrix_vector_product   99.55%    97.49%
dot_product              0.19%     1.06%
cg                       0.25%     1.44%

Clearly our focus needs to be on function matrix_vector_product
Code for matrix_vector_product

void matrix_vector_product (int id, int p, int n,
                            double **a, double *b, double *c)
{
   int i, j;
   double tmp;   /* Accumulates sum */

   for (i = 0; i < BLOCK_SIZE(id,p,n); i++) {
      tmp = 0.0;
      for (j = 0; j < n; j++)
         tmp += a[i][j] * b[j];
      piece[i] = tmp;
   }
   new_replicate_block_vector (id, p, piece, n, (void *) c, MPI_DOUBLE);
}
Adding OpenMP directives
Want to minimize fork/join overhead by making parallel the outermost possible loop
Outer loop may be executed in parallel if each thread has a private copy of tmp and j

#pragma omp parallel for private(j,tmp)
   for (i = 0; i < BLOCK_SIZE(id,p,n); i++) {
User Control of Threads
Want to give user opportunity to specify number of active threads per process
Add a call to omp_set_num_threads to function main
Argument comes from command line

omp_set_num_threads (atoi(argv[3]));
What Happened?
We transformed a C+MPI program to a C+MPI+OpenMP program by adding only two lines to our program!
Benchmarking
Target system: a commodity cluster with four dual-processor nodes
C+MPI program executes on 1, 2, ..., 8 CPUs
On 1, 2, 3, 4 CPUs, each process runs on a different node, maximizing memory bandwidth per CPU
C+MPI+OpenMP program executes on 1, 2, 3, 4 processes
Each process has two threads
C+MPI+OpenMP program executes on 2, 4, 6, 8 threads
Results of Benchmarking
Analysis of Results
C+MPI+OpenMP program is slower on 2, 4 CPUs because C+MPI+OpenMP threads are sharing memory bandwidth, while C+MPI processes are not
C+MPI+OpenMP program is faster on 6, 8 CPUs because it has lower communication cost
Case Study: Jacobi Method
Begin with C+MPI program that uses the Jacobi method to solve the steady-state heat distribution problem of Chapter 13
Program based on rowwise block striped decomposition of the two-dimensional matrix containing the finite difference mesh
Methodology
Profile execution of C+MPI program
Focus on adding OpenMP directives to most compute-intensive function
Result of Profiling
Function             1 CPU     8 CPUs
initialize_mesh       0.01%     0.03%
find_steady_state    98.48%    93.49%
print_solution        1.51%     6.48%
Function find_steady_state (1/2)

its = 0;
for (;;) {
   if (id > 0)
      MPI_Send (u[1], N, MPI_DOUBLE, id-1, 0, MPI_COMM_WORLD);
   if (id < p-1) {
      MPI_Send (u[my_rows-2], N, MPI_DOUBLE, id+1, 0, MPI_COMM_WORLD);
      MPI_Recv (u[my_rows-1], N, MPI_DOUBLE, id+1, 0,
                MPI_COMM_WORLD, &status);
   }
   if (id > 0)
      MPI_Recv (u[0], N, MPI_DOUBLE, id-1, 0, MPI_COMM_WORLD, &status);
Function find_steady_state (2/2)

   diff = 0.0;
   for (i = 1; i < my_rows-1; i++)
      for (j = 1; j < N-1; j++) {
         w[i][j] = (u[i-1][j] + u[i+1][j] +
                    u[i][j-1] + u[i][j+1]) / 4.0;
         if (fabs(w[i][j] - u[i][j]) > diff)
            diff = fabs(w[i][j] - u[i][j]);
      }
   for (i = 1; i < my_rows-1; i++)
      for (j = 1; j < N-1; j++)
         u[i][j] = w[i][j];
   MPI_Allreduce (&diff, &global_diff, 1, MPI_DOUBLE,
                  MPI_MAX, MPI_COMM_WORLD);
   if (global_diff <= EPSILON) break;
   its++;
Making Function Parallel (1/3)

Except for two initializations and a return statement, the function is one big for loop
Cannot execute the for loop in parallel:
Not in canonical form
Contains a break statement
Contains calls to MPI functions
Data dependences between iterations
Making Function Parallel (2/3)

Focus on first for loop indexed by i
How to handle multiple threads testing/updating diff?
Putting the if statement in a critical section would increase overhead and lower speedup
Instead, create private variable tdiff
Thread tests tdiff against diff before the call to MPI_Allreduce
Modified Function

   diff = 0.0;
#pragma omp parallel private (i, j, tdiff)
{
   tdiff = 0.0;
#pragma omp for
   for (i = 1; i < my_rows-1; i++)
      ...
#pragma omp for nowait
   for (i = 1; i < my_rows-1; i++)
      ...
#pragma omp critical
   if (tdiff > diff) diff = tdiff;
}
   MPI_Allreduce (&diff, &global_diff, 1, MPI_DOUBLE,
                  MPI_MAX, MPI_COMM_WORLD);
Making Function Parallel (3/3)
Focus on second for loop indexed by i
Copies elements of w to corresponding elements of u: no problem with executing in parallel
Benchmarking
Target system: a commodity cluster with four dual-processor nodes
C+MPI program executes on 1, 2, ..., 8 CPUs
On 1, 2, 3, 4 CPUs, each process runs on a different node, maximizing memory bandwidth per CPU
C+MPI+OpenMP program executes on 1, 2, 3, 4 processes
Each process has two threads
C+MPI+OpenMP program executes on 2, 4, 6, 8 threads
Benchmarking Results
Analysis of Results
Hybrid C+MPI+OpenMP program uniformly faster than C+MPI program
Computation/communication ratio of hybrid program is superior
Number of mesh points per element communicated is twice as high per node for the hybrid program
Lower communication overhead leads to 19% better speedup on 8 CPUs
Summary
Many contemporary parallel computers consist of a collection of multiprocessors
On these systems, performance of C+MPI+OpenMP programs can exceed performance of C+MPI programs
OpenMP enables us to take advantage of shared memory to reduce communication overhead
Often, conversion requires addition of relatively few pragmas