Advancing Application Process Affinity Experimentation: Open MPI's LAMA-Based Affinity Interface
Jeff Squyres and Joshua Hursey
September 18, 2013
Locality Matters
• Multiple talks here at EuroMPI'13 about network locality
• Goals:
  - Minimize data transfer distance
  - Reduce network congestion and contention
• …this also matters inside the server, too!
(figure: server topology of an Intel Xeon E5-2690 "Sandy Bridge" system: 2 sockets, each with 8 cores and 64GB; per-core L1 and L2 caches; shared L3; hyperthreading enabled; 1G and 10G NICs)
The intent of this work is to provide a mechanism that allows users to explore the process-placement space
within the scope of their own applications.
A User’s Playground
LAMA
• Locality-Aware Mapping Algorithm (LAMA)
  - Supports a wide range of regular mapping patterns
• Adapts at runtime to available hardware
  - Supports homogeneous and heterogeneous systems
• Extensible to any depth of server topology
  - Naturally supports potentially deeper topologies of future server architectures
LAMA Inspiration
• Drawn from much prior work
• Most notably, heavily inspired by the BlueGene/P and /Q mapping systems
  - LAMA's mapping specification is similar
Launching MPI Applications
• Three steps in MPI process placement:
  1. Mapping
  2. Ordering
  3. Binding
• Let's discuss how these work in Open MPI
1. Mapping
• Create a layout of processes-to-resources
(figure: a grid of servers, each populated with MPI processes as the mapping is laid out)
Mapping
• MPI's runtime must create a map pairing processes to processors (and memory)
• Basic technique:
  - Gather hwloc topologies from allocated nodes
  - The mapping agent then makes a plan for which resources are assigned to which processes
Mapping Agent
• Act of planning mappings:
  - Specify which process will be launched on each server
  - Identify if any hardware resource will be oversubscribed
• Processes are mapped to the resolution of a single processing unit (PU)
  - Smallest unit of allocation: hardware thread
  - In HPC, usually the same as a processor core
Oversubscription
• Common / usual definition: when a single PU is assigned more than one process
• Complicating the definition: some applications may need more than one PU per process (multithreaded applications)
• How can the user express what their application means by "oversubscription"?
2. Ordering: By “Slot”
Assigning MCW ranks to mapped processes
(figure: by-slot ordering: consecutive MCW ranks fill each server before moving to the next)
2. Ordering: By Node
Assigning MCW ranks to mapped processes
(figure: by-node ordering: MCW ranks distributed round-robin across servers)
Ordering
• Each process must be assigned a unique rank in MPI_COMM_WORLD
• Two common types of ordering:
  - Natural: the order in which processes are mapped determines their rank in MCW
  - Sequential: processes are sequentially numbered starting at the first processing unit, and continuing until the last processing unit
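The difference between the two orderings can be seen in a small sketch. This is a toy model, not Open MPI code; the 2-node, 4-core layout and the by-node mapping are illustrative assumptions.

```python
# Toy model (not Open MPI code): 2 nodes x 4 cores, mapped "by node"
# (round-robin across nodes).
nodes, cores = 2, 4

# Mapping step: visit (node, core) pairs in by-node order.
mapping = [(n, c) for c in range(cores) for n in range(nodes)]

# Natural ordering: rank = position in the mapping iteration order.
natural = {pu: rank for rank, pu in enumerate(mapping)}

# Sequential ordering: number PUs from the first to the last,
# regardless of how they were mapped.
sequential = {pu: rank for rank, pu in enumerate(sorted(mapping))}

print(natural[(1, 0)])     # 1: node 1's first core was mapped second
print(sequential[(1, 0)])  # 4: it comes after all four cores of node 0
```

With the same by-node map, natural ordering interleaves ranks across nodes while sequential ordering numbers straight through each node's PUs.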
3. Binding
• Launch processes and enforce the layout
(figure: mapped processes launched and bound to their assigned PUs)
Binding
• Process-launching agent working with the OS to limit where each process can run:
  1. No restrictions
  2. Limited set of restrictions
  3. Specific resource restrictions
• "Binding width": the number of PUs to which a process is bound
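To make "binding width" concrete, here is a hedged sketch. The 2-hardware-threads-per-core machine and the `binding_set` helper are illustrative assumptions, not Open MPI internals.

```python
# Toy sketch of "binding width": expand a width spec like "3c" (3 cores)
# into the set of PUs (hardware threads) a process may run on.
# Assumes a machine with 2 hardware threads per core; illustrative only.
PUS_PER_CORE = 2

def binding_set(first_core, spec):
    """Return the PU indices covered by `spec` consecutive cores, e.g. '3c'."""
    width, unit = int(spec[:-1]), spec[-1]
    assert unit == "c", "this sketch only models core-width bindings"
    return {core * PUS_PER_CORE + t
            for core in range(first_core, first_core + width)
            for t in range(PUS_PER_CORE)}

print(sorted(binding_set(0, "3c")))  # [0, 1, 2, 3, 4, 5]
```

A wider binding gives a multithreaded process more PUs to roam over; a width of one core pins it to that core's hardware threads.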
Command Line Interface (CLI)
• 4 levels of abstraction for the user:
  - Level 1: None
  - Level 2: Simple, common patterns
  - Level 3: LAMA process layout regular patterns
  - Level 4: Irregular patterns
CLI: Level 1 (none)
• No mapping or binding options specified
  - May or may not specify the number of processes to launch (-np)
  - If not specified, default to the number of cores available in the allocation
  - One process is mapped to each core in the system in a "by-core" style
  - Processes are not bound
• …for backwards compatibility reasons
CLI: Level 2 (common)
• Simple, common patterns for mapping and binding
  - Specify mapping pattern with --map-by X (e.g., --map-by socket)
  - Specify binding option with --bind-to Y (e.g., --bind-to core)
• All of these options are translated to Level 3 options for processing by LAMA
• (full list of X / Y values shown later)
CLI: Level 3 (regular patterns)
• LAMA process layout regular patterns
  - For power users wanting something unique for their application
• Four MCA run-time parameters:
  - rmaps_lama_map: mapping process layout
  - rmaps_lama_bind: binding width
  - rmaps_lama_order: ordering of MCW ranks
  - rmaps_lama_mppr: maximum allowable number of processes per resource (oversubscription)
rmaps_lama_map (map)
• Takes as an argument the "process layout"
  - A series of nine tokens, allowing 9! (362,880) mapping permutation options
• Preferred iteration order for LAMA:
  - Innermost iteration specified first
  - Outermost iteration specified last
Example system
2 servers (nodes), 4 sockets, 2 cores, 2 PUs
rmaps_lama_map (map)
• map=scbnh (a.k.a., by socket, then by core)
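The layout-driven iteration can be sketched as nested loops over the topology levels. This is a toy model of the idea, not the Open MPI implementation; the resource counts follow the example system above, with boards reduced to a single level.

```python
# Toy sketch (not the Open MPI implementation) of LAMA layout iteration.
# Example system from the slides: 2 nodes, 4 sockets/node, 2 cores/socket,
# 2 hardware threads/core; boards collapse to 1 per node here.
from itertools import product

COUNTS = {"n": 2, "b": 1, "s": 4, "c": 2, "h": 2}

def lama_iter(layout):
    """Yield (node, board, socket, core, hwthread) coordinates in LAMA order.

    `layout` lists tokens innermost-first, so "scbnh" iterates sockets
    fastest, then cores, boards, nodes, and hardware threads slowest.
    """
    outer_to_inner = layout[::-1]
    for combo in product(*(range(COUNTS[t]) for t in outer_to_inner)):
        coord = dict(zip(outer_to_inner, combo))
        yield tuple(coord[t] for t in "nbsch")

# map=scbnh: the first four processes land on sockets 0-3 of node 0.
print(list(lama_iter("scbnh"))[:4])
# [(0, 0, 0, 0, 0), (0, 0, 1, 0, 0), (0, 0, 2, 0, 0), (0, 0, 3, 0, 0)]
```

Reordering the token string changes which resource the mapping walks fastest, which is how one string can express all 9! regular patterns.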
rmaps_lama_bind (bind)
• “Binding width" and layer
• Example: bind=3c (3 cores)
rmaps_lama_bind (bind)
• “Binding width" and layer
• Example: bind=2s (2 sockets)
rmaps_lama_bind (bind)
• “Binding width" and layer
• Example: bind=1L2 (all PUs in an L2)
rmaps_lama_bind (bind)
• “Binding width" and layer
• Example: bind=1N (all PUs in NUMA locality)
rmaps_lama_order (order)
• Select which ranks are assigned to processes in MCW
  - Natural order for map-by-node (default)
  - Sequential order for any mapping
• There are other possible orderings, but no one has asked for them yet…
rmaps_lama_mppr (mppr)
• mppr ("mip-per") sets the Maximum number of allowable Processes Per Resource
  - User-specified definition of oversubscription
• Comma-delimited list of <#:resource>
  - 1:c: at most one process per core
  - 1:c,2:s: at most one process per core, and at most two processes per socket
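As a sketch of how an mppr string constrains placement, consider the toy model below. The function names and the usage representation are illustrative assumptions, not Open MPI's internals.

```python
# Toy sketch (not Open MPI code) of interpreting an mppr specification.

def parse_mppr(spec):
    """Parse a comma-delimited '<#>:<resource>' list, e.g. '1:c,2:s'."""
    limits = {}
    for item in spec.split(","):
        count, resource = item.split(":")
        limits[resource] = int(count)
    return limits

def allowed(limits, usage):
    """True if one more process fits; `usage` maps resource type to the
    number of processes already placed on the candidate instance."""
    return all(usage.get(r, 0) < cap for r, cap in limits.items())

limits = parse_mppr("1:c,2:s")
print(allowed(limits, {"c": 0, "s": 1}))  # True: core empty, socket has room
print(allowed(limits, {"c": 0, "s": 2}))  # False: socket already at its cap
```

The mapper would consult such limits at each candidate PU, so "oversubscribed" means whatever the user's mppr list says it means.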
CLI: Level 4 (rankfile)
• Complete specification of the processor-to-resource mapping
  - Bypasses LAMA
• Not described in the paper
Level 2 to Level 3 Chart
Remember the prior example?
• -np 24 -mppr 2:c -map scbnh
Same example, different mapping
• -np 24 -mppr 2:c -map nbsch
Report Bindings
• Displays a pretty-print representation of the binding actually used for each process
  - Visual feedback is quite helpful when exploring
mpirun -np 4 --mca rmaps lama --mca rmaps_lama_bind 1c --mca rmaps_lama_map nbsch --mca rmaps_lama_mppr 1:c --report-bindings hello_world
MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: [BB/../../../../../../..][../../../../../../../..]
MCW rank 1 bound to socket 1[core 8[hwt 0-1]]: [../../../../../../../..][BB/../../../../../../..]
MCW rank 2 bound to socket 0[core 1[hwt 0-1]]: [../BB/../../../../../..][../../../../../../../..]
MCW rank 3 bound to socket 1[core 9[hwt 0-1]]: [../../../../../../../..][../BB/../../../../../..]
Future Work
• Available in Open MPI v1.7.2 (and later)
• Open questions to users:
  - Are more flexible ordering options useful?
  - What common mapping patterns are useful?
  - What additional features would you like to see?
Thank You