Typical performance bottlenecks and how they can be found
Bert Wesarg
ZIH, Technische Universität Dresden
SC’13: Hands-on Practical Hybrid Parallel Application Performance Engineering
Outline
• Case I: Finding load imbalances in OpenMP codes
• Case II: Finding communication and computation imbalances in MPI codes
Case I: Sparse Matrix Vector Multiplication
• The matrix has significantly more zero elements than non-zero elements => sparse matrix
• Only the non-zero elements of A are stored, in a memory-efficient format
• Algorithm:
    foreach row r in A
        y[r.x] = 0
        foreach non-zero element e in row
            y[r.x] += e.value * x[e.y]
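The pseudocode above can be sketched in C. This is a minimal illustration, not the tutorial's actual code: it assumes the matrix is stored in compressed sparse row (CSR) form, which the slides do not state, and the type `csr_matrix` and function `csr_spmv` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical CSR container; the field names are illustrative only. */
typedef struct {
    size_t nrows;
    const size_t *row_ptr;  /* nrows + 1 entries; row r spans [row_ptr[r], row_ptr[r + 1]) */
    const size_t *col_idx;  /* column index of each stored element */
    const double *values;   /* value of each stored element */
} csr_matrix;

/* y = A * x, visiting only the stored (non-zero) elements of each row. */
void csr_spmv(const csr_matrix *A, const double *x, double *y)
{
    assert(A != NULL && x != NULL && y != NULL);
    for (size_t r = 0; r < A->nrows; ++r) {
        double sum = 0.0;
        for (size_t k = A->row_ptr[r]; k < A->row_ptr[r + 1]; ++k)
            sum += A->values[k] * x[A->col_idx[k]];
        y[r] = sum;
    }
}
```

The cost of a row is proportional to its number of stored elements, so rows can differ widely in how much work they represent.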
Case I: Sparse Matrix Vector Multiplication
• Naïve OpenMP algorithm:
    #pragma omp parallel for
    foreach row r in A
        y[r.x] = 0
        foreach non-zero element e in row
            y[r.x] += e.value * x[e.y]
• Distributes the rows of A evenly across the threads in the parallel region
• The distribution of the non-zero elements may influence the load balance in the parallel application
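To see why an even split of rows need not be an even split of work, the sketch below counts how many stored elements each thread would process. It assumes a CSR row-pointer array and models the static schedule as contiguous, near-equal blocks of rows; the OpenMP runtime chooses the actual default schedule, and the function name `nnz_per_thread_static` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Count the stored elements per thread when nrows rows are split into
 * nthreads contiguous, near-equal blocks (an assumed model of a static
 * schedule). row_ptr is a CSR row-pointer array with nrows + 1 entries;
 * nnz_out must have room for nthreads counters.
 */
void nnz_per_thread_static(const size_t *row_ptr, size_t nrows,
                           size_t nthreads, size_t *nnz_out)
{
    assert(nthreads > 0);
    size_t base = nrows / nthreads;   /* rows every thread gets */
    size_t extra = nrows % nthreads;  /* the first `extra` threads get one more */
    size_t first = 0;                 /* first row of the current block */
    for (size_t t = 0; t < nthreads; ++t) {
        size_t count = base + (t < extra ? 1 : 0);
        nnz_out[t] = row_ptr[first + count] - row_ptr[first];
        first += count;
    }
}
```

For a matrix whose first rows are much denser than the rest, the first thread's counter dominates even though every thread owns the same number of rows.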
Case I: Load imbalances in OpenMP codes
• Measuring the static OpenMP application:
    % cd ~/Bottlenecks/smxv
    % make PREP=scorep
    scorep gcc -fopenmp -DLITTLE_ENDIAN \
        -DFUNCTION_INC='"y_Ax-omp.inc.c"' -DFUNCTION=y_Ax_omp \
        -o smxv-omp smxv.c -lm
    scorep gcc -fopenmp -DLITTLE_ENDIAN \
        -DFUNCTION_INC='"y_Ax-omp-dynamic.inc.c"' \
        -DFUNCTION=y_Ax_omp_dynamic -o smxv-omp-dynamic smxv.c -lm
    % OMP_NUM_THREADS=8 scan -t ./smxv-omp yax_large.bin
Case I: Load imbalances in OpenMP codes: Profile
• Two metrics which indicate load imbalances:
    – Time spent in OpenMP barriers
    – Computational imbalance
• Open the prepared measurement on the LiveDVD with Cube:
    % cube ~/Bottlenecks/smxv/scorep_smxv-omp_large/trace.cubex
[CUBE GUI showing trace analysis report]
Case I: Time spent in OpenMP barriers
These threads spent up to 20% of their running time in the barrier
Case I: Computational imbalance
Master thread does 66% of the work
Case I: Sparse Matrix Vector Multiplication
• Improved OpenMP algorithm:
    #pragma omp parallel for schedule(dynamic,1000)
    foreach row r in A
        y[r.x] = 0
        foreach non-zero element e in row
            y[r.x] += e.value * x[e.y]
• Distributes the rows of A dynamically across the threads in the parallel region
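A C version of the dynamically scheduled loop might look like the sketch below. It again assumes CSR storage, which the slides do not confirm, and the function name `spmv_omp_dynamic` is hypothetical; only the `schedule(dynamic,1000)` clause is taken from the slide.

```c
#include <assert.h>

/*
 * y = A * x for a CSR matrix (assumed layout). Rows are handed out in
 * chunks of 1000 on demand, so a thread that finishes a cheap chunk
 * fetches the next one instead of idling at the barrier.
 */
void spmv_omp_dynamic(long nrows, const long *row_ptr, const long *col_idx,
                      const double *values, const double *x, double *y)
{
    assert(nrows >= 0);
    #pragma omp parallel for schedule(dynamic, 1000)
    for (long r = 0; r < nrows; ++r) {
        double sum = 0.0;  /* private to the iteration, so no data race */
        for (long k = row_ptr[r]; k < row_ptr[r + 1]; ++k)
            sum += values[k] * x[col_idx[k]];
        y[r] = sum;
    }
}
```

Compiled without OpenMP support, the pragma is ignored and the loop runs serially with the same result. The chunk size is a trade-off: smaller chunks balance better but increase scheduling overhead.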
Case I: Profile Analysis
• Two metrics which indicate load imbalances:
    – Time spent in OpenMP barriers
    – Computational imbalance
• Open the prepared measurement on the LiveDVD with Cube:
    % cube ~/Bottlenecks/smxv/scorep_smxv-omp-dynamic_large/trace.cubex
[CUBE GUI showing trace analysis report]
Case I: Time spent in OpenMP barriers
All threads spent similar time in the barrier
Case I: Computational imbalance
Threads do nearly equal work
Case I: Trace Comparison
• Open the prepared measurement on the LiveDVD with Vampir:
    % vampir ~/Bottlenecks/smxv/scorep_smxv-omp_large/traces.otf2
[Vampir GUI showing trace]
Case I: Time spent in OpenMP barriers
Improved runtime
Less time in OpenMP barrier
Case I: Computational imbalance
Large imbalance in the time spent in computational code
Outline
• Case I: Finding load imbalances in OpenMP codes
• Case II: Finding communication and computation imbalances in MPI codes
Case II: Heat Conduction Simulation
• Calculating the heat conduction at each timestep
• Discretized formula for space and time
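The formula shown on the original slide is not preserved in this transcript. As a hedged illustration only, a common explicit finite-difference discretization of the two-dimensional heat equation is given below; the symbols (temperature u, diffusivity α, time step Δt, grid spacings Δx and Δy) and the choice of scheme are assumptions, not taken from the tutorial code.

```latex
% Continuous heat equation (assumed 2D form)
\frac{\partial u}{\partial t}
  = \alpha \left( \frac{\partial^2 u}{\partial x^2}
                + \frac{\partial^2 u}{\partial y^2} \right)

% Forward difference in time, central differences in space
u_{i,j}^{n+1} = u_{i,j}^{n}
  + \alpha \, \Delta t \left(
      \frac{u_{i+1,j}^{n} - 2 u_{i,j}^{n} + u_{i-1,j}^{n}}{\Delta x^{2}}
    + \frac{u_{i,j+1}^{n} - 2 u_{i,j}^{n} + u_{i,j-1}^{n}}{\Delta y^{2}}
    \right)
```

In a scheme of this kind, each new value depends only on the cell and its four direct neighbors at the previous timestep, which is why only the boundary cells of a subdomain have to be communicated.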
Case II: Heat Conduction Simulation
• The application uses MPI for the boundary exchange and OpenMP for the computation
• The simulation grid is distributed across the MPI ranks
Case II: Heat Conduction Simulation
• Ranks need to exchange their boundaries before the next iteration step
Case II: Profile Analysis
• MPI algorithm:
    foreach step in [1:nsteps]
        exchangeBoundaries
        computeHeatConduction
• Building and measuring the heat conduction application:
    % cd ~/Bottlenecks/heat
    % make PREP='scorep --user'
    [... make output ...]
    % scan -t mpirun -np 16 ./heat-MPI 3072 32
• Open the prepared measurement on the LiveDVD with Cube:
    % cube ~/Bottlenecks/heat/scorep_heat-MPI_16/trace.cubex
[CUBE GUI showing trace analysis report]
Case II: Hide MPI communication with computation
• Step 1: Compute heat in the area which is communicated to your neighbors
Case II: Hide MPI communication with computation
• Step 2: Start communicating boundaries with your neighbors
Case II: Hide MPI communication with computation
• Step 3: Compute heat in the interior area
Case II: Profile Analysis
• Improved MPI algorithm:
    foreach step in [1:nsteps]
        computeHeatConductionInBoundaries
        startBoundaryExchange
        computeHeatConductionInInterior
        waitForCompletionOfBoundaryExchange
• Measuring the improved heat conduction application:
    % scan -t mpirun -np 16 ./heat-MPI-overlap 3072 32
• Open the prepared measurement on the LiveDVD with Cube:
    % cube ~/Bottlenecks/heat/scorep_heat-MPI-overlap_16/trace.cubex
[CUBE GUI showing trace analysis report]
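The reordering in the improved algorithm relies on the boundary and interior updates being independent within one timestep. The sketch below illustrates this without MPI: it assumes a rank owns `rows` rows of a row-major grid with one layer of halo cells, a five-point stencil, and a combined coefficient `c`. All names are hypothetical, and the places where a real code would start and complete a non-blocking exchange are marked in comments only.

```c
#include <assert.h>
#include <stddef.h>

/* Update rows [row_begin, row_end) of a grid with one halo cell per side. */
static void update_rows(const double *u, double *u_new, size_t cols,
                        size_t row_begin, size_t row_end, double c)
{
    size_t stride = cols + 2;  /* one halo column on each side */
    for (size_t i = row_begin; i < row_end; ++i) {
        for (size_t j = 1; j <= cols; ++j) {
            size_t k = i * stride + j;
            u_new[k] = u[k] + c * (u[k - stride] + u[k + stride]
                                   + u[k - 1] + u[k + 1] - 4.0 * u[k]);
        }
    }
}

/* Reference order: update all owned rows (1..rows) in one sweep. */
void heat_step_reference(const double *u, double *u_new,
                         size_t rows, size_t cols, double c)
{
    update_rows(u, u_new, cols, 1, rows + 1, c);
}

/* Overlapped order: boundary rows first, interior rows afterwards. */
void heat_step_overlapped(const double *u, double *u_new,
                          size_t rows, size_t cols, double c)
{
    assert(rows >= 2);
    /* Step 1: rows whose new values the neighbors need. */
    update_rows(u, u_new, cols, 1, 2, c);
    update_rows(u, u_new, cols, rows, rows + 1, c);
    /* Step 2: a real code would start the non-blocking boundary exchange
       here, for example with MPI_Isend and MPI_Irecv. */
    /* Step 3: interior rows, which do not depend on the exchange. */
    update_rows(u, u_new, cols, 2, rows, c);
    /* Step 4: a real code would wait for the exchange to complete here,
       for example with MPI_Waitall. */
}
```

Both orders produce the same grid, so the only change is when communication can proceed: in the overlapped order the exchange runs while the interior rows are being computed.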
Case II: Trace Comparison
• Open the prepared measurement on the LiveDVD with Vampir:
    % vampir ~/Bottlenecks/heat/scorep_heat-MPI_16/traces.otf2
[Vampir GUI showing trace]
Acknowledgments
• Thanks to Dirk Schmidl, RWTH Aachen, for providing the sparse matrix vector multiplication code