Parallel Communications and NUMA Control on the Teragrid’s New Sun
Constellation System
Lars Koesterke
with Kent Milfeld and Karl W. Schulz
AUS Presentation 09/06/08
Bigger Systems, Higher Complexity
Ranger is BIG!
Ranger’s Architecture is Multi-Level Parallel and Asymmetric
3936 nodes, 62976 cores, 2 large Switches
Understand the Implications of the Multi-Level Parallel Architecture
Optimize Operational Methods and Applications
Maximize the Yield from Ranger and other Big TeraGrid Machines yet to come!
Get the Most out of the New Generation of Supercomputers!
Outline
• Introduction
  • General description of the Experiment
  • Layout of the Node
  • Layout of the Interconnect (NEM and Magnum switches)
• Experiments : Ping-Pong and Barrier cost
  • On Node Experiments
  • On NEM Experiment
  • Switch Experiment
• Conclusion
  • Implications for System Management
  • Implications for Users
NEM: Network Express Module
Parameter Selection for Experiments
Ranger Nodes have 4 quad-core Sockets : 16 cores per Node
Natural Setups
  Pure MPI : 16 tasks per Node
  Hybrid : 4 tasks per Node, or 1 task per Node
Tests are selected accordingly: 1, 4 and 16 tasks
[Figure: three node layouts: 16 MPI Tasks; 4 MPI Tasks with 4 Threads/Task; 1 MPI Task with 16 Threads/Task. Legend: MPI Task on Core, Master Thread of MPI Task, Slave Thread of MPI Task]
In Large-scale calculations with 16 tasks per Node, communication could/should be bundled
Measure with one Task per Node
Experiment 1 : Ping-Pong with MPI
• MPI processes reside on :
  – same Node
  – same Chassis (connected by one NEM)
  – different Chassis (connected by Magnum switch)
• Messages are sent back and forth (Ping-Pong)
  – Communication Distance is varied (Node, NEM, Magnum)
  – Communication Volume is varied
    • Message Size : 32 Bytes --- 16 MB
    • Number of processes sending/receiving simultaneously
• Effective Bandwidth per Communication Channel
  – Timing taken from multiple runs on a dedicated system
Node : 16 Cores
Chassis : 12 Nodes
Total : 328 Chassis, 3936 Nodes
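The slides do not say how the effective bandwidth is computed from the ping-pong timings. A minimal sketch of one common convention follows. It assumes the message sizes double from 32 Bytes to 16 MB, that the one-way time is half of the measured round trip, and that 1 MB is 10**6 bytes. The function names are hypothetical.

```python
def message_sizes(smallest=32, largest=16 * 1024 * 1024):
    """Yield message sizes in bytes, doubling from smallest to largest.

    The doubling step is an assumption; the slides only give the range.
    """
    size = smallest
    while size <= largest:
        yield size
        size *= 2


def effective_bandwidth_mb_s(message_bytes, round_trip_seconds):
    """Bandwidth of one communication channel in MB/s (1 MB = 10**6 bytes).

    A ping-pong moves the message twice, so the one-way time is taken
    as half of the measured round-trip time (an assumed convention).
    """
    one_way_seconds = round_trip_seconds / 2.0
    return message_bytes / one_way_seconds / 1e6


sizes = list(message_sizes())
```

For example, `effective_bandwidth_mb_s(size, t)` would be evaluated for every entry of `sizes`, with `t` taken from the timed runs.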
Experiment 2 : MPI Barrier Cost
• MPI processes reside on :
  – same Node
  – same Chassis (connected by a NEM)
  – different Chassis (connected by Magnum switch)
• Synchronize on Barriers
  – Communication Distance is varied (Node, NEM, Magnum)
  – Communication Volume is varied
    • Number of processes executing Barrier
• Barrier Cost measured in Clock Periods (CP)
  – Timing taken from multiple runs on a dedicated system
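Barrier costs are reported in Clock Periods (CP). The sketch below shows how a wall-clock timing could be converted to CPs and back. The clock frequency is not stated in the slides, so the 2.3 GHz value is an assumption and should be replaced with the actual core clock. The function names are hypothetical.

```python
# Assumed core clock in Hz; the slides do not state the frequency.
CLOCK_HZ = 2.3e9


def seconds_to_clock_periods(seconds, clock_hz=CLOCK_HZ):
    """Convert an elapsed time in seconds to clock periods."""
    return seconds * clock_hz


def clock_periods_to_microseconds(clock_periods, clock_hz=CLOCK_HZ):
    """Convert a cost in clock periods to microseconds."""
    return clock_periods / clock_hz * 1e6


def barrier_cost_cp(elapsed_seconds, n_barriers, clock_hz=CLOCK_HZ):
    """Average cost of one barrier in clock periods.

    elapsed_seconds is the time measured around a loop that executes
    n_barriers barriers back to back.
    """
    return seconds_to_clock_periods(elapsed_seconds / n_barriers, clock_hz)
```

Averaging over many barriers per timing keeps the timer resolution from dominating a cost of only a few thousand CPs.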
Node Architecture
• 4 quad-core CPUs (Sockets) per node
• Memory local to Sockets
• 3-way HyperTransport
• “Missing” connection 0---3
[Figure: four quad-core Sockets, numbered 0 to 3, linked by HyperTransport, with the PCI Express Bridge]
Asymmetry
– Local vs. Remote Memory
  0---3 requires one additional “hop”
– PCI Connection
Note: Accessing local memory on both Sockets 0 and 3 is slower with extra HT hop (Cache Coherence)
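The extra hop between Sockets 0 and 3 can be illustrated with a small hop-count calculation. This is a sketch under an assumption: every pair of Sockets is taken to be directly linked except 0---3, which is the only missing connection the slides name. The `LINKS` set and the `hops` helper are hypothetical names.

```python
from collections import deque

# Assumed direct HyperTransport links between Sockets: all pairs except 0---3.
LINKS = {(0, 1), (0, 2), (1, 2), (1, 3), (2, 3)}


def neighbors(socket):
    """Return the Sockets directly linked to the given Socket."""
    result = set()
    for a, b in LINKS:
        if a == socket:
            result.add(b)
        elif b == socket:
            result.add(a)
    return result


def hops(src, dst):
    """Minimum number of HT hops from src to dst (breadth-first search)."""
    seen = {src}
    queue = deque([(src, 0)])
    while queue:
        socket, distance = queue.popleft()
        if socket == dst:
            return distance
        for nxt in neighbors(socket):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, distance + 1))
    raise ValueError("sockets are not connected")


# Hop counts for every ordered pair of Sockets.
hop_matrix = [[hops(i, j) for j in range(4)] for i in range(4)]
```

Under this assumed topology only the 0---3 pair needs two hops; all other pairs need one.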
Network Architecture
• Each Chassis (12 Blades) is connected to a Network Express Module (NEM)
• Each NEM is connected to a Line Card in the Magnum Switch
• The Switch connects the Line Cards through a Backplane
[Figure: path HCA, NEM, Line Card, NEM, HCA, with routes of 1, 3, 5 and 7 hops]
Number of Hops / Latency
1 Hop : 1.57 μsec : Blades in the same Chassis
3 Hops : 2.04 μsec : NEMs connected to the same Line Card
5/7 Hops : 2.45/2.85 μsec : Connection through the Magnum switch
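The latency figures above allow a rough estimate of what each additional hop costs. The sketch below takes the four latencies as microseconds, which matches the "< 1.6 μsecs" quoted for NEM connections in the conclusions. The per-hop increments are derived arithmetic, not measurements from the slides. The function name is hypothetical.

```python
# Latency in microseconds by number of hops, as listed on the slide.
LATENCY_USEC = {1: 1.57, 3: 2.04, 5: 2.45, 7: 2.85}


def added_latency_per_hop(table):
    """Latency increase per extra hop between consecutive table entries."""
    hop_counts = sorted(table)
    return {
        (a, b): (table[b] - table[a]) / (b - a)
        for a, b in zip(hop_counts, hop_counts[1:])
    }


increments = added_latency_per_hop(LATENCY_USEC)
for (a, b), usec in increments.items():
    print(f"{a} -> {b} hops: about {usec:.3f} usec per extra hop")
```

Each extra hop adds roughly 0.2 μsec, which is small next to the 1.57 μsec already paid for a single hop.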
On-Node : Ping-Pong
• Socket 0 ping-pongs with Sockets 1, 2 and 3
• 1, 2, 4 simultaneous communications (quad-core)
• Bandwidth scales with number of communications
Missing Connection : Communication between 0 and 3 is slower
Maximum Bandwidth : 1100 MB/s, 700 MB/s, 300 MB/s
On-Node : Barrier Cost (2 Cores)
• One Barrier : 0---1, 0---2, 0---3
• Cost : 1600 - 1900 CPs
Asymmetry: Communication between 0 and 3 is slower
On-Node : Barrier Cost (Multiple Cores, 2 Sockets)
• Barriers per Socket : 1, 2, 4
• Cost : 1700, 3200, 6800 CPs
On-NEM: Ping-Pong
• 2-12 Nodes in the same Chassis
• 1 MPI Process per Node (1-6 communication pairs)
Perfect Scaling for up to 6 simultaneous communications
Maximum Bandwidth : 6 x 900 MB/s
On-NEM: Barrier Scaling
• Barriers per Node : 1, 4, 16
• Cost : start at 5000/15000 CPs and increase up to 20000/27000/32000 CPs
NEM-to-NEM: Ping-Pong
• Maximum Distance : 7 hops
• 1 MPI Process per Node (1-12 communication pairs)
Maximum Performance : 2 x 900 up to 12 x 450 MB/s
Switch : Barrier Scaling
• Communication between 1-12 Nodes on 2 Chassis
• Barriers per Node : 1, 4, 16
• Two Runs: System was not clean during this test
• Results similar to On-NEM test
• Communication pattern reveals Asymmetry on the Node level
  – No Direct HT Connection between Sockets 0 and 3
• Max. Bandwidth :
  – On-NEM : 6 x 900 MB/s
  – NEM-to-NEM : 2 x 900 MB/s --- 12 x 450 MB/s
16-way Nodes: NUMA*
Multi-Level Interconnect: low-latency, high-bandwidth
Further Investigation necessary to achieve theoretical 12 x 900 MB/s
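The bandwidth figures can be compared by multiplying them out. The sketch below reads "N x B MB/s" as N simultaneous communication pairs, each sustaining B MB/s. That reading is an interpretation of the slide notation, and the function name is hypothetical.

```python
def aggregate_mb_s(pairs, per_pair_mb_s):
    """Aggregate bandwidth of several simultaneous communication pairs."""
    return pairs * per_pair_mb_s


on_nem = aggregate_mb_s(6, 900)            # 6 pairs inside one Chassis
nem_to_nem_low = aggregate_mb_s(2, 900)    # 2 pairs across two Chassis
nem_to_nem_high = aggregate_mb_s(12, 450)  # 12 pairs across two Chassis
theoretical = aggregate_mb_s(12, 900)      # 12 pairs at full per-pair rate

# Fraction of the theoretical aggregate reached with 12 pairs.
fraction = nem_to_nem_high / theoretical
```

With 12 pairs the NEM-to-NEM aggregate equals the 6-pair On-NEM aggregate (5400 MB/s), half of the theoretical 10800 MB/s.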
Ranger
Conclusions
• Aggregate Communication and I/O on Node (SMP) level
  – Reduce total number of Communications
  – Reduce Traffic through Magnum switches
  – On 16-way Node : 15 compute tasks and a single Communication task?
  – Use of MPI with OpenMP?
• Apply Load-Balancing
  – Asymmetry on Node Level
  – Multi-Level Interconnect (Node, NEM, Magnum switches)
• Use full Chassis (12 Nodes, 192 Cores)
  – Use extremely low-latency Connections through NEM (< 1.6 μsecs)
Take Advantage of the Architecture at all Levels
Applications should be cognizant of various SMP/Network levels
More topology aware scheduling is under investigation