Altix 4700. ccNUMA Architecture Distributed Memory - Shared address space.

Altix 4700

ccNUMA Architecture

• Distributed Memory - Shared address space

Altix HLRB II – Phase 2

• 19 partitions with 9728 cores
  – Each with 256 Itanium dual-core processors, i.e., 512 cores
  – Clock rate 1.6 GHz
  – 4 Flops per cycle per core
  – 12.8 GFlop/s per processor (6.4 GFlop/s per core)

• 13 high-bandwidth partitions
  – Blades with 1 processor (2 cores) and 4 GB memory
  – Frontside bus 533 MHz (8.5 GB/s)

• 6 high-density partitions
  – Blades with 2 processors (4 cores) and 4 GB memory
  – Same memory bandwidth

• Peak performance: 62.3 TFlop/s (6.4 GFlop/s per core)
• Memory: 39 TB
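The peak figure follows directly from the numbers on this slide. As a minimal sketch (the function names are illustrative and not part of any SGI or LRZ tool), the arithmetic can be written as:

```c
#include <assert.h>

/* Peak rate of one core in GFlop/s: clock rate (GHz) times flops per cycle. */
double core_peak_gflops(double clock_ghz, int flops_per_cycle)
{
    assert(clock_ghz > 0.0 && flops_per_cycle > 0);
    return clock_ghz * flops_per_cycle;
}

/* Peak rate of the whole machine in TFlop/s. */
double system_peak_tflops(int partitions, int cores_per_partition,
                          double clock_ghz, int flops_per_cycle)
{
    assert(partitions > 0 && cores_per_partition > 0);
    long cores = (long)partitions * cores_per_partition;
    return cores * core_peak_gflops(clock_ghz, flops_per_cycle) / 1000.0;
}
```

Calling system_peak_tflops(19, 512, 1.6, 4) multiplies 9728 cores by 6.4 GFlop/s, which gives roughly the 62.3 TFlop/s quoted above.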

Memory Hierarchy

• L1D
  – 16 KB, 1 cycle latency, 25.6 GB/s bandwidth
  – Cache line size 64 bytes

• L2D
  – 256 KB, 6 cycles, 51 GB/s
  – Cache line size 128 bytes

• L3
  – 9 MB, 14 cycles, 51 GB/s
  – Cache line size 128 bytes
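To relate these figures to the 1.6 GHz clock, latencies can be converted to nanoseconds and bandwidths to bytes per cycle. The helpers below are a small sketch with illustrative names, assuming 1 GB/s = 10^9 bytes/s:

```c
#include <assert.h>

/* Convert a latency given in clock cycles to nanoseconds. */
double cycles_to_ns(int cycles, double clock_ghz)
{
    assert(cycles >= 0 && clock_ghz > 0.0);
    return cycles / clock_ghz;
}

/* Convert a bandwidth in GB/s to bytes transferred per clock cycle. */
double bytes_per_cycle(double bandwidth_gb_s, double clock_ghz)
{
    assert(bandwidth_gb_s >= 0.0 && clock_ghz > 0.0);
    return bandwidth_gb_s / clock_ghz;
}
```

With the values above, the L2D latency of 6 cycles is 3.75 ns, the L3 latency of 14 cycles is 8.75 ns, L1D delivers 16 bytes per cycle, and L2D and L3 deliver about 32 bytes per cycle.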

Interconnect

• NUMAlink 4
  – 2 links per blade
  – Each link 2 × 3.2 GB/s bandwidth
  – MPI latency 1-5 µs

Disks

• Direct attached disks (temporary large files)
  – 600 TB
  – 40 GB/s bandwidth

• Network attached disks (home directories)
  – 60 TB
  – 800 MB/s bandwidth

Environment

• Footprint: 24 m x 12 m
• Weight: 103 metric tons
• Electrical power: ~1 MW

NUMAlink Building Block

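[Figure: a level-1 NUMAlink 4 router connects two groups of four compute blades, each group with one I/O blade. The I/O blades attach via PCI/FC to a SAN switch and to 10 GE. Each group of four blades holds 8 cores (high-bandwidth) or 16 cores (high-density).]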

Blades and Rack

Interconnection in a Partition

Interconnection of Partitions

• Gray squares
  – 1 partition with 512 cores
  – L: Login, B: Batch

• Lines
  – 2 NUMAlink 4 planes with 16 cables
  – Each cable: 2 × 3.2 GB/s

Interactive Partition

• Login cores
  – 32 for compile & test

• Interactive batch jobs
  – 476 cores
  – Managed by PBS
    – Daytime interactive usage
    – Small-scale and nighttime batch processing
    – Single partition only

• High-density blades
  – 4 cores per memory

[Figure: layout of the interactive partition, with core groups labeled Login, Batch, and OS (4 cores).]

18 Batch Partitions

• Batch jobs
  – 510 (508) cores
  – Managed by PBS
  – Large-scale parallel jobs
  – Single or multi-partition jobs

• 5 partitions with high-density blades

• 13 partitions with high-bandwidth blades

[Figure: layout of a batch partition, with 4 OS cores and groups labeled 6 (12) and 8 (16).]

Bandwidth

[Figure: bandwidth (MB/s) for intranode and internode communication. The x-axis ticks run in powers of two from 1 to 1048576; the y-axis runs from 0 to 3000 MB/s.]

Coherence Implementation

• SHUB2 supports up to 8192 SHUBs (32768 cores)

• Coherence domain of up to 1024 SHUBs (4096 cores)
  – SGI term: "Sharing mode"
  – Directory with one bit per SHUB
  – Multiple shared copies are supported

• Accesses to other coherence domains
  – SGI term: "Exclusive sharing mode"
  – Always translated into an exclusive access
  – Only a single copy is supported
  – Directory stores the address of the SHUB (13 bits)
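The two directory formats can be illustrated with a short sketch. The types and function names below are hypothetical and do not reflect the actual SHUB2 directory layout; they only show why one bit per SHUB suffices for sharing mode within a 1024-SHUB coherence domain and why 13 bits are enough to name one of 8192 SHUBs in exclusive sharing mode.

```c
#include <assert.h>
#include <stdint.h>

#define DOMAIN_SHUBS 1024   /* SHUBs per coherence domain (from the slide) */

/* Sharing mode: one presence bit per SHUB in the coherence domain. */
typedef struct {
    uint64_t bits[DOMAIN_SHUBS / 64];
} sharer_vector;

/* Record that the given SHUB holds a shared copy. */
void add_sharer(sharer_vector *v, unsigned shub)
{
    assert(shub < DOMAIN_SHUBS);
    v->bits[shub / 64] |= UINT64_C(1) << (shub % 64);
}

/* Return 1 if the given SHUB holds a shared copy, 0 otherwise. */
int is_sharer(const sharer_vector *v, unsigned shub)
{
    assert(shub < DOMAIN_SHUBS);
    return (int)((v->bits[shub / 64] >> (shub % 64)) & 1u);
}

/* Exclusive sharing mode: bits needed to store the address of one
   of n SHUBs, i.e. ceil(log2(n)). */
unsigned pointer_bits(unsigned n_shubs)
{
    assert(n_shubs > 0);
    unsigned bits = 0;
    while ((1u << bits) < n_shubs)
        bits++;
    return bits;
}
```

The bit vector can track any number of simultaneous sharers, whereas a single 13-bit pointer can name only one holder, which matches the restriction to a single copy across coherence domains.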

SHMEM Latency Model for Altix

• SHMEM get latency is the sum of:
  – 80 nsec for the function call
  – 260 nsec for memory latency
  – 340 nsec for the first hop
  – 60 nsec per hop
  – 20 nsec per meter of NUMAlink cable

• Example
  – 64-processor system: max hops is 4, max total cable length is 4 m
  – Total SHMEM get latency is:

    1000 nsec = 80 + 260 + 340 + 60×4 + 20×4
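The latency model is a simple linear formula, sketched here as a C function. The constant and function names are illustrative. Following the worked example, the per-hop cost is charged for every hop in addition to the fixed first-hop term.

```c
#include <assert.h>

/* Constants of the SHMEM get latency model, in nanoseconds. */
enum {
    CALL_NS      = 80,   /* function call */
    MEMORY_NS    = 260,  /* memory latency */
    FIRST_HOP_NS = 340,  /* first hop */
    PER_HOP_NS   = 60,   /* each hop */
    PER_METER_NS = 20    /* each meter of NUMAlink cable */
};

/* Estimated SHMEM get latency in nanoseconds for a given number of
   hops and total cable length in meters. */
int model_get_latency_ns(int hops, int cable_meters)
{
    assert(hops >= 0 && cable_meters >= 0);
    return CALL_NS + MEMORY_NS + FIRST_HOP_NS
         + PER_HOP_NS * hops + PER_METER_NS * cable_meters;
}
```

For the 64-processor example, model_get_latency_ns(4, 4) reproduces the 1000 nsec total.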

Parallel Programming Models

[Figure: an Altix system showing Linux images 1 and 2 and coherency domains 1 and 2.]

• Intra-host (512 cores): OpenMP, Pthreads, MPI, SHMEM, global segments

• Intra-coherency-domain (4096 cores) and across the entire machine: MPI, SHMEM, global segments

Barrier Synchronization

• Frequent in OpenMP, SHMEM, and MPI one-sided operations (MPI_Win_fence)

• Tree-based implementation using multiple fetch-op variables to minimize contention on the SHUB.

• Uses uncached loads to reduce NUMAlink traffic.

[Figure: two CPUs attached to a HUB, which connects to a router; the fetch-op variable is located at the HUB.]
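The idea of spreading arrivals over several counters can be sketched with C11 atomics. This is not SGI's implementation: it is a two-level tree in which atomic_fetch_add stands in for the SHUB fetch-op, ordinary atomic loads stand in for uncached loads, and all names (tree_barrier, FANIN, and so on) are assumptions made for the illustration.

```c
#include <assert.h>
#include <stdatomic.h>

#define FANIN      4     /* threads per leaf counter (assumed value) */
#define MAX_LEAVES 128

typedef struct {
    atomic_int leaf[MAX_LEAVES]; /* one counter per group of FANIN threads */
    atomic_int root;             /* counts groups that have fully arrived */
    atomic_int sense;            /* release flag, flipped once per episode */
    int nthreads, nleaves;
} tree_barrier;

void tree_barrier_init(tree_barrier *b, int nthreads)
{
    assert(nthreads > 0 && nthreads <= FANIN * MAX_LEAVES);
    b->nthreads = nthreads;
    b->nleaves = (nthreads + FANIN - 1) / FANIN;
    for (int i = 0; i < b->nleaves; i++)
        atomic_init(&b->leaf[i], 0);
    atomic_init(&b->root, 0);
    atomic_init(&b->sense, 0);
}

/* Number of threads attached to leaf g (the last group may be smaller). */
static int group_size(const tree_barrier *b, int g)
{
    int rest = b->nthreads - g * FANIN;
    return rest < FANIN ? rest : FANIN;
}

/* Register the arrival of thread tid. Returns 1 if this caller was the
   last thread overall and released the barrier, 0 otherwise. */
int tree_barrier_arrive(tree_barrier *b, int tid)
{
    assert(tid >= 0 && tid < b->nthreads);
    int g = tid / FANIN;
    if (atomic_fetch_add(&b->leaf[g], 1) + 1 < group_size(b, g))
        return 0;                      /* not the last in its group */
    atomic_store(&b->leaf[g], 0);      /* reset for the next episode */
    if (atomic_fetch_add(&b->root, 1) + 1 < b->nleaves)
        return 0;                      /* other groups still missing */
    atomic_store(&b->root, 0);
    atomic_fetch_xor(&b->sense, 1);    /* release all waiting threads */
    return 1;
}

/* Full barrier: arrive, then spin until the release flag changes. */
void tree_barrier_wait(tree_barrier *b, int tid)
{
    int old = atomic_load(&b->sense);
    if (!tree_barrier_arrive(b, tid))
        while (atomic_load(&b->sense) == old)
            ;                          /* spin */
}
```

Only one thread per group touches the root counter, so at most FANIN threads contend on any single variable. Each participating thread would call tree_barrier_wait with its own thread index.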

Programming Models

• OpenMP on a Linux image
• MPI
• SHMEM
• Shared segments (System V and Global Shared Memory)

SHMEM

• Can be used for MPI programs where all processes execute same code.

• Enables access within and across partitions.

• Static data and symmetric heap data (shmalloc or shpalloc)

• info: man intro_shmem

Example

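#include <stdio.h>
#include <mpi.h>
#include <mpp/shmem.h>

int main(int argc, char **argv)
{
    long source[10] = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 };
    static long target[10];   /* static data is remotely accessible */
    int myrank;

    MPI_Init(&argc, &argv);   /* arguments are elided on the slide */
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);   /* rank is used as PE number */

    if (myrank == 0) {
        /* put 10 elements into target on PE 1 */
        shmem_long_put(target, source, 10, 1);
    }
    shmem_barrier_all();      /* sync sender and receiver */

    if (myrank == 1)
        printf("target[0] on PE %d is %ld\n", myrank, target[0]);

    MPI_Finalize();
    return 0;
}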

Global Shared Memory Programming

• Allocation of a shared memory segment via the collective GSM_Alloc.

• Similar to memory-mapped files or System V shared segments, but those are limited to a single OS instance.

• A GSM segment can be distributed across partitions.
  – GSM_ROUNDROBIN: Pages are distributed round-robin across processes
  – GSM_SINGLERANK: Places all pages near a single process
  – GSM_CUSTOM_ROUNDROBIN: Each process specifies how many pages should be placed in its memory

• Data structures can be placed in this memory segment and accessed from all processes with normal load and store instructions.
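The three placement policies can be pictured as mappings from a page index to the process whose memory holds the page. The sketch below is purely illustrative: the function names are hypothetical, the real GSM library performs placement internally, and the interpretation of GSM_CUSTOM_ROUNDROBIN as repeated rounds of per-process page counts is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* GSM_ROUNDROBIN-style placement: page p goes to process p mod nprocs. */
int owner_roundrobin(size_t page, int nprocs)
{
    assert(nprocs > 0);
    return (int)(page % (size_t)nprocs);
}

/* GSM_SINGLERANK-style placement: every page goes to one process. */
int owner_singlerank(size_t page, int rank)
{
    (void)page;   /* the page index does not matter */
    return rank;
}

/* GSM_CUSTOM_ROUNDROBIN-style placement (assumed semantics): in each
   round, process r receives pages_per_rank[r] consecutive pages. */
int owner_custom_roundrobin(size_t page, const size_t *pages_per_rank,
                            int nprocs)
{
    size_t round = 0;
    for (int r = 0; r < nprocs; r++)
        round += pages_per_rank[r];
    assert(round > 0);

    size_t offset = page % round;
    for (int r = 0; r < nprocs; r++) {
        if (offset < pages_per_rank[r])
            return r;
        offset -= pages_per_rank[r];
    }
    return -1;    /* not reached when round > 0 */
}
```

With weights {2, 0, 1}, for example, pages 0 and 1 map to process 0, page 2 to process 2, and page 3 starts the next round at process 0.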

Example

#include <mpi_gsm.h>

placement = GSM_ROUNDROBIN;
flags = 0;
size = ARRAY_LEN * sizeof(int);
int *shared_buf;
rc = GSM_Alloc(size, placement, flags, MPI_COMM_WORLD, &shared_buf);

// Have one rank initialize the shared memory region
if (rank == 0) {
    for (i = 0; i < ARRAY_LEN; i++) {
        shared_buf[i] = i;
    }
}

MPI_Barrier(MPI_COMM_WORLD);

// Have every rank verify they can read from the shared memory
for (i = 0; i < ARRAY_LEN; i++) {
    if (shared_buf[i] != i) {
        printf("ERROR!! element %d = %d\n", i, shared_buf[i]);
        printf("Rank %d - FAILED shared memory test.\n", rank);
        exit(1);
    }
}

Summary

• Altix 4700 is a ccNUMA system
• >60 TFlop/s
• MPI messages sent with two-copy or single-copy protocol
• Hierarchical coherence implementation
  – Intranode
  – Coherence domain
  – Across coherence domains
• Programming models
  – OpenMP
  – MPI
  – SHMEM
  – GSM

The Compute Cube of LRZ

[Figure: the LRZ compute cube. Labeled areas: supercomputer floor (column-free), server/network, archive/backup, air conditioning, electrical, recooling units, access bridge.]
