
Parallel Communications and NUMA Control on the TeraGrid’s New Sun Constellation System

Lars Koesterke
with Kent Milfeld and Karl W. Schulz

AUS Presentation, 09/06/08

Bigger Systems, Higher Complexity

Ranger is BIG!
• 3936 nodes
• 62976 cores
• 2 large Switches

Ranger’s Architecture is Multi-Level Parallel and Asymmetric

• Understand the Implications of the Multi-Level Parallel Architecture
• Optimize Operational Methods and Applications
• Maximize the Yield from Ranger and other Big TeraGrid Machines yet to come!

Get the Most out of the New Generation of Supercomputers!

Outline

• Introduction
– General description of the Experiment
– Layout of the Node
– Layout of the Interconnect (NEM and Magnum switches)
• Experiments : Ping-Pong and Barrier cost
– On Node Experiments
– On NEM Experiment
– Switch Experiment
• Conclusion
– Implications for System Management
– Implications for Users

NEM: Network Express Module

Parameter Selection for Experiments

Ranger Nodes have 4 quad-core Sockets : 16 cores per Node

Natural Setups
• Pure MPI : 16 tasks per Node
• Hybrid : 4 tasks per Node, or 1 task per Node

Tests are selected accordingly: 1, 4 and 16 tasks

[Figure: three node layouts. 16 MPI Tasks; 4 MPI Tasks with 4 Threads/Task; 1 MPI Task with 16 Threads/Task. Legend: MPI Task on Core; Master Thread of MPI Task; Slave Thread of MPI Task]

In Large-scale calculations with 16 tasks per Node, communication could/should be bundled

Measure with one Task per Node

Experiment 1 : Ping-Pong with MPI

• MPI processes reside on :
– same Node
– same Chassis (connected by one NEM)
– different Chassis (connected by Magnum switch)
• Messages are sent back and forth (Ping-Pong)
– Communication Distance is varied (Node, NEM, Magnum)
– Communication Volume is varied : Message Size (32 Bytes to 16 MB) and Number of processes sending/receiving simultaneously
• Effective Bandwidth per Communication Channel
– Timing taken from multiple runs on a dedicated system

Node : 16 Cores
Chassis : 12 Nodes
Total : 328 Chassis, 3936 Nodes
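The effective bandwidth of one ping-pong channel can be derived from the message size and the measured round-trip time. The following is a minimal sketch, not the authors' benchmark code: the function names are hypothetical, the doubling of message sizes between 32 Bytes and 16 MB is an assumption (the slides give only the two end points), and 1 MB is taken as 10**6 bytes.

```python
def effective_bandwidth_mb_s(message_bytes, round_trip_seconds):
    """Effective one-way bandwidth of a ping-pong exchange in MB/s.

    One round trip moves the message twice (ping, then pong), so the
    one-way transfer time is taken as half the round-trip time.
    Assumption: 1 MB = 10**6 bytes.
    """
    if round_trip_seconds <= 0:
        raise ValueError("round_trip_seconds must be positive")
    one_way_seconds = round_trip_seconds / 2.0
    return message_bytes / one_way_seconds / 1e6


def message_sizes(smallest=32, largest=16 * 1024 * 1024):
    """Message sizes in bytes from `smallest` to `largest`.

    Assumption: sizes double at each step; the slides state only the
    range (32 Bytes to 16 MB), not the step.
    """
    sizes = []
    size = smallest
    while size <= largest:
        sizes.append(size)
        size *= 2
    return sizes


if __name__ == "__main__":
    # Hypothetical timing: a 1 MB message with a 2 ms round trip.
    print(effective_bandwidth_mb_s(1_000_000, 0.002))
    print(len(message_sizes()), "message sizes")
```

Halving the round-trip time is the usual ping-pong convention; a benchmark that reports the full round trip would give half the bandwidth shown here.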

Experiment 2 : MPI Barrier Cost

• MPI processes reside on :
– same Node
– same Chassis (connected by a NEM)
– different Chassis (connected by Magnum switch)
• Synchronize on Barriers
– Communication Distance is varied (Node, NEM, Magnum)
– Communication Volume is varied : Number of processes executing Barrier
• Barrier Cost measured in Clock Periods (CP)
– Timing taken from multiple runs on a dedicated system

Node : 16 Cores
Chassis : 12 Nodes
Total : 328 Chassis, 3936 Nodes
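Both experiments vary the communication distance across three levels: Node, NEM and Magnum. The sketch below classifies a pair of MPI ranks into one of those levels using the counts from the slides (16 cores per node, 12 nodes per chassis). It assumes that ranks are placed in blocks, filling one node after another and one chassis after another. That placement is an assumption and not something the slides state, and the function names are hypothetical.

```python
CORES_PER_NODE = 16     # from the slides
NODES_PER_CHASSIS = 12  # from the slides


def locate(rank):
    """Return (chassis, node) for an MPI rank.

    ASSUMPTION: block placement, i.e. ranks 0-15 on node 0, 16-31 on
    node 1, and nodes 0-11 in chassis 0, and so on.
    """
    node = rank // CORES_PER_NODE
    chassis = node // NODES_PER_CHASSIS
    return chassis, node


def communication_level(rank_a, rank_b):
    """Classify the distance between two ranks as Node, NEM or Magnum."""
    chassis_a, node_a = locate(rank_a)
    chassis_b, node_b = locate(rank_b)
    if node_a == node_b:
        return "Node"    # same node: shared memory / HyperTransport
    if chassis_a == chassis_b:
        return "NEM"     # same chassis: connected by one NEM
    return "Magnum"      # different chassis: through the Magnum switch


if __name__ == "__main__":
    for pair in [(0, 15), (0, 16), (0, 191), (0, 192)]:
        print(pair, communication_level(*pair))
```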

Node Architecture

• 4 quad-core CPUs (Sockets) per node
• Memory local to Sockets
• 3-way HyperTransport
• “Missing” connection 0---3

[Figure: node diagram with four quad-core CPU sockets numbered 0 to 3 and a PCI Express Bridge]

Asymmetry
– Local vs. Remote Memory
– 0---3 requires one additional “hop”
– PCI Connection

Note: Accessing local memory on both Sockets 0 and 3 is slower with extra HT hop (Cache Coherence)
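The effect of the missing 0---3 link can be illustrated by counting HyperTransport hops between sockets with a breadth-first search. This is a sketch under an assumed link list: the slides state only that 0---3 is missing, so the code assumes every other socket pair is directly linked. The names `HT_LINKS` and `ht_hops` are hypothetical.

```python
from collections import deque

# ASSUMED link list: every socket pair is directly linked except 0---3.
# The slides confirm only that the 0---3 connection is missing.
HT_LINKS = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3)]


def ht_hops(src, dst, links=HT_LINKS):
    """Return the minimum number of HT hops between two sockets (BFS)."""
    neighbors = {}
    for a, b in links:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    distance = {src: 0}
    queue = deque([src])
    while queue:
        current = queue.popleft()
        if current == dst:
            return distance[current]
        for nxt in neighbors.get(current, ()):
            if nxt not in distance:
                distance[nxt] = distance[current] + 1
                queue.append(nxt)
    raise ValueError("no HyperTransport path between the sockets")


if __name__ == "__main__":
    for dst in (1, 2, 3):
        print("socket 0 to socket", dst, ":", ht_hops(0, dst), "hop(s)")
```

Under this assumed topology, socket 0 reaches sockets 1 and 2 in one hop and socket 3 in two, which matches the "one additional hop" stated on the slide.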

Network Architecture

• Each Chassis (12 Blades) is connected to a Network Express Module (NEM)
• Each NEM is connected to a Line Card in the Magnum Switch
• The Switch connects the Line Cards through a Backplane

[Figure: communication path HCA, NEM, Line Card, NEM, HCA, annotated with 1, 3, 5 and 7 hops]

Number of Hops / Latency

1 Hop : 1.57 μsec : Blades in the same Chassis
3 Hops : 2.04 μsec : NEMs connected to the same Line Card
5/7 Hops : 2.45/2.85 μsec : Connection through the Magnum switch
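The hop/latency figures can be kept in a small lookup table to see how much each additional pair of hops costs. The sketch below takes the latencies as microseconds, consistent with the "< 1.6 μsecs" figure quoted for NEM connections in the conclusions. The names are hypothetical, and only the four hop counts listed on the slide are covered.

```python
# Latency in microseconds by hop count, copied from the slide.
LATENCY_USEC_BY_HOPS = {1: 1.57, 3: 2.04, 5: 2.45, 7: 2.85}


def latency_usec(hops):
    """Return the measured latency for a hop count listed on the slide."""
    try:
        return LATENCY_USEC_BY_HOPS[hops]
    except KeyError:
        raise ValueError("no measurement for %r hops" % (hops,))


def added_latency_per_step():
    """Latency added by each step of two extra hops (1->3, 3->5, 5->7)."""
    hops = sorted(LATENCY_USEC_BY_HOPS)
    return [
        round(LATENCY_USEC_BY_HOPS[b] - LATENCY_USEC_BY_HOPS[a], 2)
        for a, b in zip(hops, hops[1:])
    ]


if __name__ == "__main__":
    print(latency_usec(1))
    print(added_latency_per_step())
```

Each step of two extra hops adds roughly 0.4 to 0.5 μsec, which is small compared with the 1.57 μsec base latency of a single hop.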

On-Node : Ping-Pong

• Socket 0 ping-pongs with Sockets 1, 2 and 3
• 1, 2, 4 simultaneous communications (quad-core)
• Bandwidth scales with number of communications

Missing Connection : Communication between 0 and 3 is slower

Maximum Bandwidth : 1100 MB/s, 700 MB/s, 300 MB/s

On-Node : Barrier Cost (2 Cores)

• One Barrier : 0---1, 0---2, 0---3
• Cost : 1600 - 1900 CPs

Asymmetry: Communication between 0 and 3 is slower

On-Node : Barrier Cost (Multiple Cores, 2 Sockets)

• Barriers per Socket : 1, 2, 4
• Cost : 1700, 3200, 6800 CPs
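Two small calculations help in reading the barrier numbers: the cost per barrier when several barriers run on the same pair of sockets, and the conversion from clock periods to time. This is a sketch with hypothetical names. The slides do not state the core clock rate, so the frequency is a required parameter, and the 2.3 GHz used in the example is an assumed value only.

```python
# Barrier cost in clock periods (CP) by barriers per socket, from the slide.
BARRIER_COST_CP = {1: 1700, 2: 3200, 4: 6800}


def cost_per_barrier(costs=BARRIER_COST_CP):
    """Return {barriers per socket: CP per barrier}."""
    return {count: total / count for count, total in costs.items()}


def clock_periods_to_usec(clock_periods, clock_hz):
    """Convert clock periods to microseconds for a given clock rate."""
    if clock_hz <= 0:
        raise ValueError("clock_hz must be positive")
    return clock_periods / clock_hz * 1e6


if __name__ == "__main__":
    print(cost_per_barrier())
    # ASSUMED clock rate (not given on the slides): 2.3 GHz.
    print(clock_periods_to_usec(1700, 2.3e9), "microseconds")
```

The cost per barrier stays at about 1600 to 1700 CPs, so the total cost grows roughly linearly with the number of simultaneous barriers.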


On-NEM: Ping-Pong

• 2-12 Nodes in the same Chassis
• 1 MPI Process per Node (1-6 communication pairs)

Perfect Scaling for up to 6 simultaneous communications
Maximum Bandwidth : 6 x 900 MB/s

On-NEM: Barrier Scaling

• Barriers per Node : 1, 4, 16
• Cost : start at 5000/15000 CPs and increase up to 20000/27000/32000 CPs

NEM-to-NEM: Ping-Pong

• Maximum Distance : 7 hops
• 1 MPI Process per Node (1-12 communication pairs)

Maximum Performance : 2 x 900 up to 12 x 450 MB/s

Switch : Barrier Scaling

• Communication between 1-12 Nodes on 2 Chassis
• Barriers per Node : 1, 4, 16
• Two Runs: System was not clean during this test
• Results similar to On-NEM test

• Communication pattern reveals Asymmetry on the Node level
– No Direct HT Connection between Sockets 0 and 3
• Max. Bandwidth :
– On-NEM : 6 x 900 MB/s
– NEM-to-NEM : 2 x 900 MB/s up to 12 x 450 MB/s

16-way Nodes: NUMA*
Multi-Level Interconnect: low-latency, high-bandwidth

Further Investigation necessary to achieve theoretical 12 x 900 MB/s
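The gap between measured and theoretical NEM-to-NEM bandwidth follows directly from the figures above. The sketch below only multiplies the number of communication pairs by the per-pair bandwidth quoted on the slides; the function name is hypothetical.

```python
def aggregate_mb_s(pairs, per_pair_mb_s):
    """Aggregate bandwidth of `pairs` simultaneous ping-pong channels."""
    return pairs * per_pair_mb_s


on_nem = aggregate_mb_s(6, 900)               # measured, same chassis
nem_to_nem_2 = aggregate_mb_s(2, 900)         # measured, 2 pairs
nem_to_nem_12 = aggregate_mb_s(12, 450)       # measured, 12 pairs
theoretical_12 = aggregate_mb_s(12, 900)      # theoretical, 12 pairs

fraction_of_theoretical = nem_to_nem_12 / theoretical_12

if __name__ == "__main__":
    print("On-NEM, 6 pairs      :", on_nem, "MB/s")
    print("NEM-to-NEM, 2 pairs  :", nem_to_nem_2, "MB/s")
    print("NEM-to-NEM, 12 pairs :", nem_to_nem_12, "MB/s")
    print("Theoretical, 12 pairs:", theoretical_12, "MB/s")
    print("Fraction achieved    :", fraction_of_theoretical)
```

With 12 pairs the aggregate is 5400 MB/s, the same as 6 pairs inside one chassis and half of the theoretical 10800 MB/s.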

Ranger

Conclusions

• Aggregate Communication and I/O on Node (SMP) level
– Reduce total number of Communications
– Reduce Traffic through Magnum switches
– On 16-way Node : 15 compute tasks and a single Communication task?
– Use of MPI with OpenMP?
• Apply Load-Balancing
– Asymmetry on Node Level
– Multi-Level Interconnect (Node, NEM, Magnum switches)
• Use full Chassis (12 Nodes, 192 Cores)
– Use extremely low-latency Connections through NEM (< 1.6 μsecs)
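The recommendation to use full chassis amounts to rounding a job's core count up to a multiple of 192. The sketch below shows only that arithmetic, with a hypothetical function name. Whether the batch system can place a job on whole chassis is not stated on the slides.

```python
CORES_PER_NODE = 16     # from the slides
NODES_PER_CHASSIS = 12  # from the slides
CORES_PER_CHASSIS = CORES_PER_NODE * NODES_PER_CHASSIS  # 192


def chassis_aligned_cores(requested_cores):
    """Round a core request up to whole chassis.

    Returns (number of chassis, total cores in those chassis).
    """
    if requested_cores <= 0:
        raise ValueError("requested_cores must be positive")
    # Integer ceiling division avoids floating-point rounding.
    chassis = -(-requested_cores // CORES_PER_CHASSIS)
    return chassis, chassis * CORES_PER_CHASSIS


if __name__ == "__main__":
    for cores in (192, 193, 1000, 62976):
        print(cores, "->", chassis_aligned_cores(cores))
```

For the whole machine the result is 328 chassis and 62976 cores, which agrees with the totals given earlier in the presentation.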

Take Advantage of the Architecture at all Levels

Applications should be cognizant of various SMP/Network levels


More topology-aware scheduling is under investigation

