www.openfabrics.org
Resource Utilization in Large Scale InfiniBand Jobs
Galen M. Shipman
Los Alamos National Labs
LAUR-07-2873
The Problem
InfiniBand specifies that receive resources are consumed in order regardless of size
Small messages may therefore consume much larger receive buffers
At very large scale, many applications are dominated by small message transfers
Message sizes vary substantially from job to job and even rank to rank
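To make the cost concrete, here is a minimal Python sketch of a receive queue whose pre-posted buffers are consumed strictly in posting order. The 8 KiB buffer size, the message sizes, and the function name `consume_in_order` are illustrative assumptions, not figures from the talk.

```python
def consume_in_order(posted_buffers, message_sizes):
    """Match each arriving message to the next posted buffer, in order.

    Returns (bytes_received, bytes_of_buffer_consumed).
    Hypothetical model: every message must fit in the buffer it lands in.
    """
    received = 0
    consumed = 0
    for buf_size, msg_size in zip(posted_buffers, message_sizes):
        if msg_size > buf_size:
            raise ValueError("message does not fit in the next posted buffer")
        received += msg_size
        # The whole buffer is used up, however small the message was.
        consumed += buf_size
    return received, consumed


# Assumed workload: five 8 KiB buffers, mostly small messages.
buffers = [8192] * 5
messages = [64, 128, 8192, 256, 32]
received, consumed = consume_in_order(buffers, messages)
wasted = consumed - received
print(received, consumed, wasted)  # 8672 40960 32288
```

In this assumed workload, about four fifths of the posted buffer space is consumed without carrying any data.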
Receive Buffer Efficiency
Implication for SRQ
Flood of small messages may exhaust SRQ resources
• Probability of RNR NAK increases
• Stalls the pipeline
Performance degrades
• Wasted resource utilization
• Application may not complete within allotted time slot (12+ hours for some jobs)
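The exhaustion scenario can be illustrated with a toy model in Python. It is a sketch under simplifying assumptions: time advances in discrete ticks, the receiver reposts a fixed number of buffers per tick, and an arrival that finds the shared queue empty is counted once as an RNR event (a real sender would back off and retry). The function name and all parameters are hypothetical.

```python
def count_rnr_events(srq_depth, arrivals_per_tick, reposts_per_tick, ticks):
    """Count arrivals that find no free buffer in a shared receive queue."""
    free = srq_depth
    rnr = 0
    for _ in range(ticks):
        for _ in range(arrivals_per_tick):
            if free > 0:
                free -= 1   # arrival consumes one posted buffer
            else:
                rnr += 1    # queue empty: stands in for an RNR NAK
        # Receiver reposts buffers, never beyond the queue depth.
        free = min(srq_depth, free + reposts_per_tick)
    return rnr


print(count_rnr_events(srq_depth=4, arrivals_per_tick=2, reposts_per_tick=2, ticks=10))
print(count_rnr_events(srq_depth=4, arrivals_per_tick=3, reposts_per_tick=2, ticks=10))
```

When arrivals keep pace with reposts, the queue never empties. Once arrivals outrun reposts, the initial depth only delays the first RNR event.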
Why not just tune the buffer size?
There is no “one size fits all” solution!
Message size patterns differ based on:
• Number of processes in the parallel job
• Input deck
• Identity / function in the parallel job
Need to balance optimization between:
• Performance
• Memory footprint
Tuning for each application run is not acceptable
What Do Users Want?
Optimal performance is important
• But predictability at “acceptable” performance is more important
HPC users want a default/“good enough” solution
• Parameter tweaking is fine for papers
• Not for our end users
Parameter explosion
• OMPI OpenFabrics-related driver parameters: 48
• OMPI other parameters: …many…
What Do Others Do?
Portals
• Contiguous memory region for unexpected messages (receiver-managed offset semantic)
Myrinet GM
• Variable size receive buffers can be allocated
• Sender specifies which size receive buffer to consume (SIZE & PRIORITY fields)
Quadrics Elan
• TPORTS manages pools of buffers of various sizes
• On receipt of an unexpected message a buffer is chosen from the relevant pool
Bucket-SRQ
Inspired by standard bucket allocation methods
Multiple “buckets” of receive descriptors are created in multiple SRQs
• Each is associated with a different buffer size
A small pool of per-peer resources is also allocated
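The size-to-bucket mapping behind this idea can be sketched in a few lines of Python. The bucket sizes below and the name `select_bucket` are assumptions for illustration. The slides do not list the actual bucket sizes or say how a sender steers a message to a particular SRQ, so the sketch covers only the "smallest bucket that fits" rule of standard bucket allocation.

```python
import bisect

# Assumed bucket buffer sizes in bytes, sorted ascending (not from the talk).
BUCKET_SIZES = [256, 1024, 4096, 16384]


def select_bucket(msg_size, bucket_sizes=BUCKET_SIZES):
    """Return the smallest bucket buffer size that can hold msg_size."""
    idx = bisect.bisect_left(bucket_sizes, msg_size)
    if idx == len(bucket_sizes):
        raise ValueError("message larger than the largest bucket")
    return bucket_sizes[idx]


messages = [64, 128, 8192, 256, 32]
bucketed = sum(select_bucket(m) for m in messages)
single_size = max(BUCKET_SIZES) * len(messages)
print(bucketed, single_size)  # 17408 81920
```

For this assumed message mix, bucketed buffers consume 17408 bytes, against 81920 bytes if every message landed in a buffer of the largest size.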
Bucket-SRQ
Performance Implications
Good overall performance
• Decreased/no RNR NAKs from draining SRQ
  • Never trigger “SRQ limit reached” event
Latency penalty for SRQ: ~1 usec
Large number of QPs may not be efficient
• Still investigating impact of high QP count on performance
Results
Evaluation applications
• SAGE (DOE/LANL application)
• Sweep3D (DOE/LANL application)
• NAS Parallel Benchmarks (benchmark)
Instrumented Open MPI
• Measured receive buffer efficiency: size of receive buffer / size of data received
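As a worked example of this metric, the Python sketch below computes the ratio exactly as the slide states it (buffer size over data received), along with its reciprocal, the fraction of consumed buffer space that held data. The function names and sample numbers are illustrative assumptions. The slides do not describe how the instrumented Open MPI aggregates the measurements.

```python
def buffer_to_data_ratio(samples):
    """Ratio as stated on the slide: total buffer bytes / total data bytes.

    `samples` is a list of (buffer_size, data_size) pairs, one per receive.
    """
    buffer_bytes = sum(buf for buf, _ in samples)
    data_bytes = sum(data for _, data in samples)
    if data_bytes <= 0:
        raise ValueError("no data received")
    return buffer_bytes / data_bytes


def buffer_utilization(samples):
    """Reciprocal view: fraction of consumed buffer space that held data."""
    return 1.0 / buffer_to_data_ratio(samples)


# Assumed sample: two 4096-byte buffers receiving 512 bytes each.
samples = [(4096, 512), (4096, 512)]
print(buffer_to_data_ratio(samples))  # 8.0
print(buffer_utilization(samples))    # 0.125
```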
SAGE: Hydrodynamics
SAGE – SAIC’s Adaptive Grid Eulerian hydrocode
Hydrodynamics code with Adaptive Mesh Refinement (AMR)
Applied to: water shock, energy coupling, hydro instability problems, etc.
Routinely run on 1,000s of processors.
Scaling characteristic: Weak
Data Decomposition (Default): 1-D (of a 3-D AMR spatial grid)
"Predictive Performance and Scalability Modeling of a Large-Scale Application", D.J. Kerbyson, H.J. Alme, A. Hoisie, F. Petrini, H.J. Wasserman, M. Gittings, in Proc. SC, Denver, 2001
Courtesy: PAL Team - LANL
SAGE
Adaptive Mesh Refinement (AMR) hydro-code
3 repeated phases
• Gather data (including processor boundary data)
• Compute
• Scatter data (send back results)
3-D spatial grid, partitioned in 1-D
Parallel characteristics
• Message sizes vary, typically tens to hundreds of Kbytes
• Distance between neighbors increases with scale
Courtesy: PAL Team - LANL
SAGE: Receive Buffer Usage
256 Processes
SAGE: Receive Buffer Usage
4096 Processes
SAGE: Receive buffer efficiency
SAGE: Performance
Sweep3D
3-D spatial grid, partitioned in 2-D
Pipelined wavefront processing
• Dependency in ‘sweep’ direction
Parallel characteristics:
• Logical neighbors in X and Y
• Small message sizes: 100s of bytes (typical)
• Number of processors determines pipeline length (PX + PY)
2-D example:
Courtesy: PAL Team - LANL
Sweep3D: Wavefront Algorithm
Characterized by a dependency in cell processing
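Figure: wavefront propagation in 1-D (cells processed in order 1 to 5), 2-D, and 3-D, marking previously processed cells and the wavefront edge.
Direction of wavefront can change: a sweep can start from any corner point.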
Courtesy: PAL Team - LANL
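The dependency can be sketched for the 2-D case in Python. The sketch assumes a PX × PY processor grid in which a processor may start once its upstream neighbors in X and Y have finished, so its start step equals its grid distance from the corner where the sweep begins. The function name `wavefront_steps` is hypothetical, and this is not Sweep3D's actual scheduling code.

```python
def wavefront_steps(px, py, corner=(0, 0)):
    """Group processors (i, j) of a px-by-py grid by the step at which
    the wavefront reaches them, for a sweep starting at `corner`."""
    ci, cj = corner
    if ci not in (0, px - 1) or cj not in (0, py - 1):
        raise ValueError("sweep must start from a corner of the grid")
    steps = {}
    for i in range(px):
        for j in range(py):
            # Start step = grid distance from the starting corner.
            step = abs(i - ci) + abs(j - cj)
            steps.setdefault(step, []).append((i, j))
    return steps


steps = wavefront_steps(3, 2)
for step in sorted(steps):
    print(step, steps[step])
```

For a 3 × 2 grid the sweep takes 4 wavefront steps, that is PX + PY - 1, which grows as the PX + PY pipeline length quoted in the Sweep3D characteristics. Processors on the same step work concurrently, and each exchanges only small boundary messages with its X and Y neighbors.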
Sweep3D Receive Buffer Usage
256 Processes
Sweep3D: Receive Buffer Efficiency
Sweep3D: Performance
NPB Receive Buffer Usage
Class D 256 Processes
NPB Receive Buffer Efficiency
Class D 256 Processes
IS Benchmark Not Available for Class D
NPB Performance Results
NPB Class D 256 Processes
Conclusions
Bucket SRQ provides:
• Good performance at scale
• “One size fits most” solution
  • Eliminates need to custom-tune each run
• Minimizes receive buffer memory footprint
  • No more than 25 MB was allocated for any run
• Avoids RNR NAKs in communication patterns we examined
Future Work
Take advantage of ConnectX SRC feature to reduce the number of active QPs
Further examine our protocol at 4K+ processor count on SNL’s ThunderBird cluster