Optimizing Collective Communication for Multicore
Transcript of "Optimizing Collective Communication for Multicore"
By Rajesh Nishtala, BERKELEY PAR LAB
What Are Collectives?
An operation called by all threads together to perform globally coordinated communication
May involve a modest amount of computation, e.g. to combine values as they are communicated
Can be extended to teams (or communicators), in which case they operate on a predefined subset of the threads
Focus on collectives in Single Program Multiple Data (SPMD) programming models
Some Collectives
Barrier (MPI_Barrier())
• A thread cannot exit a call to the barrier until all other threads have called the barrier
Broadcast (MPI_Bcast())
• A root thread sends a copy of an array to all the other threads
Reduce-To-All (MPI_Allreduce())
• Each thread contributes an operand to an arithmetic operation across all the threads
• The result is then broadcast to all the threads
Exchange (MPI_Alltoall())
• For all i, j < N, thread i copies the jth piece of its input array to the ith slot of an output array located on thread j
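These four operations map directly onto MPI calls. A minimal sketch of how each might be invoked in C follows; the buffer sizes, datatypes, and use of MPI_COMM_WORLD are illustrative assumptions, not taken from the talk.

#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Barrier: no rank exits until every rank has entered. */
    MPI_Barrier(MPI_COMM_WORLD);

    /* Broadcast: the root (rank 0) sends a copy of buf to everyone. */
    double buf[8] = {0.0};
    MPI_Bcast(buf, 8, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    /* Reduce-To-All: every rank contributes 'local'; all ranks get the sum. */
    double local = (double)rank, sum = 0.0;
    MPI_Allreduce(&local, &sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    /* Exchange: the j-th block of sendbuf on rank i ends up as the
       i-th block of recvbuf on rank j. */
    double *sendbuf = malloc(nprocs * sizeof(double));
    double *recvbuf = malloc(nprocs * sizeof(double));
    for (int j = 0; j < nprocs; j++) sendbuf[j] = rank * nprocs + j;
    MPI_Alltoall(sendbuf, 1, MPI_DOUBLE, recvbuf, 1, MPI_DOUBLE, MPI_COMM_WORLD);

    free(sendbuf); free(recvbuf);
    MPI_Finalize();
    return 0;
}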
Why Are They Important?
Basic communication building blocks, found in many parallel programming languages and libraries
Abstraction: if an application is written with collectives, it passes the responsibility of tuning to the runtime

Percentage of runtime spent in collectives (Opteron/InfiniBand/256):

                       Class C   Class D
Exchange in NAS FT      ~28%      ~23%
Reductions in NAS CG    ~42%      ~28%
Experimental Setup
Platforms:
Sun Niagara2
• 1 socket of 8 multi-threaded cores
• Each core supports 8 hardware thread contexts, for 64 total threads
Intel Clovertown
• 2 "traditional" quad-core sockets
BlueGene/P
• 1 quad-core socket
MPI used for inter-process communication: shared-memory MPICH2 1.0.7
Threads v. Processes (Niagara2)
Barrier performance: perform a barrier across all 64 threads
Threads arranged into processes in different ways
– One extreme has one thread per process, while the other has 1 process with 64 threads
MPI_Barrier() called between processes
Flat barrier amongst the threads within a process
2 orders of magnitude difference in performance!
Threads v. Processes (Niagara2) cont.
Other collectives see similar scaling issues when using processes
MPI collectives are called between processes, while shared memory is leveraged within a process
Intel Clovertown and BlueGene/P
Fewer threads per node
Differences are not as drastic, but they are non-trivial
[Plots: Intel Clovertown, BlueGene/P]
Optimizing Barrier w/ Trees
Leveraging shared memory is a critical optimization
Flat trees don't scale; use trees to aid parallelism
Requires two passes of a tree
First (UP) pass indicates that all threads have arrived
• Signal your parent once all of your children have arrived
• Once the root gets the signal from all of its children, all threads have reported in
Second (DOWN) pass tells every thread that the barrier is complete
• Wait for my parent to send me a clear signal
• Propagate the clear signal down to my children
[Figure: example signaling tree over threads 0-15, rooted at thread 0]
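A minimal sketch of this two-pass tree barrier, assuming a hypothetical per-thread tree description and flag-signaling helpers (the flag layout itself is discussed in the backup "Details of Signaling" slide):

/* Hypothetical per-thread view of the signaling tree. */
typedef struct {
    int parent;          /* -1 at the root                       */
    int *children;       /* thread ids of this thread's children */
    int num_children;
} tree_node_t;

/* Assumed helpers: each writes or spins on a per-thread shared flag. */
void signal_arrival(int thread_id);   /* "my whole subtree has arrived" */
void wait_arrival(int thread_id);
void signal_clear(int thread_id);     /* "the barrier is complete"      */
void wait_clear(int thread_id);

void tree_barrier(const tree_node_t *me, int my_id)
{
    /* UP pass: wait for every child subtree, then report to the parent. */
    for (int c = 0; c < me->num_children; c++)
        wait_arrival(me->children[c]);
    if (me->parent >= 0) {
        signal_arrival(my_id);
        /* DOWN pass: wait for the parent's clear signal... */
        wait_clear(me->parent);
    }
    /* ...and propagate it to my children, who spin on my clear flag. */
    signal_clear(my_id);
}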
Example Tree Topologies
[Figures: three trees over threads 0-15, each rooted at thread 0]
Radix 2 k-nomial tree (binomial)
Radix 4 k-nomial tree (quadnomial)
Radix 8 k-nomial tree (octnomial)
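The slides do not spell out the exact numbering used in these trees; one common k-nomial construction, given here as an illustrative sketch, computes a thread's children directly from its id:

#include <stdio.h>

/* Print the children of thread 'me' in a radix-k k-nomial tree rooted at
   thread 0 over 'nthreads' threads (radix >= 2 assumed).  This is one
   common numbering; the layout used in the slides may differ. */
void knomial_children(int me, int nthreads, int radix)
{
    for (int stride = 1; stride < nthreads; stride *= radix) {
        /* 'me' is a subtree root at this level iff it is a multiple of
           stride*radix; otherwise it joined a parent here and is done. */
        if (me % (stride * radix) != 0)
            break;
        for (int d = 1; d < radix; d++) {
            int child = me + d * stride;
            if (child < nthreads)
                printf("thread %d -> child %d\n", me, child);
        }
    }
}

int main(void)
{
    for (int t = 0; t < 16; t++)
        knomial_children(t, 16, 2);   /* radix 2: binomial tree */
    return 0;
}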
Barrier Performance Results
Time many back-to-back barriers
Flat tree is just one level, with all threads reporting to thread 0
• Leverages shared memory, but non-scalable
Architecture-independent tree (radix = 2)
• Pick a generic "good" radix that is suitable for many platforms
• May be mismatched to the architecture
Architecture-dependent tree
• Search over all radices to pick the tree that best matches the architecture
Broadcast Performance Results
Time a latency-sensitive Broadcast (8 bytes)
Time a Broadcast followed by a Barrier, and subtract the time for the Barrier
Yields an approximation of how long it takes for the last thread to get the data
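A sketch of that measurement written against MPI; the iteration count, timer, and structure are assumptions for illustration:

#include <mpi.h>

/* Approximate broadcast latency: time (Bcast + Barrier) loops, time
   Barrier-only loops, and report the per-iteration difference. */
double time_bcast(MPI_Comm comm, int iters)
{
    double buf = 0.0;

    MPI_Barrier(comm);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        MPI_Bcast(&buf, 1, MPI_DOUBLE, 0, comm);   /* 8-byte broadcast */
        MPI_Barrier(comm);
    }
    double with_bcast = (MPI_Wtime() - t0) / iters;

    MPI_Barrier(comm);
    t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++)
        MPI_Barrier(comm);
    double barrier_only = (MPI_Wtime() - t0) / iters;

    /* ~ time until the last thread has the data */
    return with_bcast - barrier_only;
}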
Reduce-To-All Performance Results
4 kByte (512 doubles) Reduce-To-All
In addition to the data movement, we also want to parallelize the computation
In the flat approach, the computation gets serialized at the root
Tree-based approaches allow us to parallelize the computation amongst all the floating point units
On Niagara2, 8 threads share one FPU, thus radix 2, 4, and 8 serialize the computation in about the same way
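To make the parallelization concrete, here is a sketch that reuses the hypothetical tree_node_t and arrival-flag helpers from the barrier sketch above; all_vecs is assumed to hold each thread's contribution in shared memory:

/* Each parent folds its children's partial sums into its own vector before
   reporting up, so the additions are spread across the tree instead of
   being serialized at the root. */
void tree_reduce_sum(const tree_node_t *me, int my_id,
                     double *my_vec, double **all_vecs, int n)
{
    for (int c = 0; c < me->num_children; c++) {
        int child = me->children[c];
        wait_arrival(child);                  /* child's subtree sum is ready */
        for (int i = 0; i < n; i++)
            my_vec[i] += all_vecs[child][i];
    }
    if (me->parent >= 0)
        signal_arrival(my_id);
    /* The root now holds the full sum; a DOWN pass (not shown) broadcasts
       it to complete the Reduce-To-All. */
}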
Optimization Summary
Relying on flat trees is not enough for most collectives
Architecture-dependent tuning is a further and important optimization
Extending the Results to a Cluster
Use one rack of BlueGene/P (1024 nodes, or 4096 cores)
Reduce-To-All by having one representative thread make the call to the inter-node all-reduce
• Reduces the number of messages in the network
Vary the number of threads per process, but use all cores
Relying purely on shared memory doesn't always yield the best performance
• The number of active cores working on the computation drops
Can optimize so that the computation is partitioned across cores
• Not suitable for a direct call to MPI_Allreduce()
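A rough sketch of this hierarchical scheme for a hybrid threads-plus-MPI process; smp_reduce_sum and smp_broadcast are assumed stand-ins for the shared-memory tree collectives described earlier:

#include <mpi.h>

/* Assumed shared-memory collectives over the threads of one process. */
double smp_reduce_sum(double my_val, int my_thread);  /* result valid on thread 0 */
void   smp_broadcast(double *val, int my_thread);     /* thread 0 is the root     */

double hierarchical_allreduce_sum(double my_val, int my_thread, MPI_Comm comm)
{
    /* 1. Reduce within the node using shared memory. */
    double node_sum = smp_reduce_sum(my_val, my_thread);
    double global_sum = 0.0;

    /* 2. Only one representative thread per process talks to the network. */
    if (my_thread == 0)
        MPI_Allreduce(&node_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, comm);

    /* 3. Broadcast the result to the other threads through shared memory. */
    smp_broadcast(&global_sum, my_thread);
    return global_sum;
}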
Potential Synchronization Problem
1. Broadcast variable x from the root
2. Have proc 1 set a new value for x on proc 4

broadcast x = 1 from proc 0
if (myid == 1) {
    put x = 5 to proc 4
} else {
    /* do nothing */
}
[Animation: as the broadcast of x = 1 propagates among procs 0-4, proc 1 finishes its part of the broadcast early and issues the put of x = 5 to proc 4, which the still-arriving broadcast value then overwrites]
The put of x = 5 by proc 1 has been lost
Proc 1 observed a globally incomplete collective: it thinks the collective is done before the broadcast has reached every proc
Strict v. Loose Synchronization
A fix to the problem: add a barrier before/after the collective
• Enforces a global ordering of the operations
Is there a problem? We want to decouple synchronization from data movement
Specify the synchronization requirements
• Potential to aggregate synchronization
• Done by the user or a smart compiler
How can we realize these gains in applications?
Conclusions
Moving from processes to threads is a crucial optimization for single-node collective communication
Can use tree-based collectives to realize better performance, even for collectives on one node
Picking the tree that best matches the architecture yields the best performance
Multicore adds to the (auto)tuning space for collective communication
Shared-memory semantics allow us to create new, loosely synchronized collectives
Questions?
Backup Slides
Threads and Processes
Threads
• A sequence of instructions and an execution stack
• Communication between threads occurs through a common, shared address space
• No OS/network involvement needed
• Reasoning about inter-thread communication can be tricky
Processes
• A set of threads and an associated memory space
• All threads within a process share its address space
• Communication between processes must be managed through the OS
• Inter-process communication is explicit but may be slow
• More expensive to switch between processes
Experimental Platforms
Niagara2, Clovertown, BG/P
Specs

                                            Niagara2                Clovertown              BlueGene/P
# Sockets                                   1                       2                       1
# Cores/Socket                              8                       4                       4
Threads Per Core                            8                       1                       1
Total Thread Count                          64                      8                       4
Instruction Set                             SPARC                   x86/64                  PowerPC
Core Frequency                              1.4 GHz                 2.6 GHz                 0.85 GHz
Peak DP Floating Point Performance / Core   1.4 GFlop/s             10.4 GFlop/s            3.4 GFlop/s
DRAM Read Bandwidth / Socket                42.7 GB/s               21.3 GB/s               13.6 GB/s
DRAM Write Bandwidth / Socket               21.3 GB/s               10.7 GB/s               13.6 GB/s
L1 Cache Size                               8 kB                    32 kB                   32 kB
L2 Cache Size                               4 MB (shared)           16 MB (4 MB/2 cores)    8 MB (4 MB/2 cores)
OS Version                                  Solaris 5.10            Linux 2.6.18            BG/P Compute Kernel
C Compiler                                  Sun C (5.9)             Intel ICC (10.1)        IBM BlueGene XLC
MPI Implementation                          MPICH2 1.0.7 (ch3:shm)  MPICH2 1.0.7 (ch3:shm)  MPICH2 port for BG/P
Details of Signaling
For optimal performance, have many readers and one writer
• Each thread sets a flag (a single word) that the others will read
• Every reader gets a copy of the cache line and spins on that copy; when the writer changes the value of the variable, the cache-coherency system handles broadcasting/updating the change
• Avoids atomic primitives
On the way up the tree, a child sets a flag indicating that its subtree has arrived
• The parent spins on that flag for each child
On the way down, each child spins on its parent's flag
• When it is set, it indicates that the parent wants to broadcast the clear signal down
Flags must be on different cache lines to avoid false sharing
Need to switch back and forth between two sets of flags so that consecutive barriers do not interfere
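A minimal sketch of how such flags might be laid out, assuming C11 atomics, a 64-byte cache line, and illustrative names; this is not the actual implementation from the talk:

#include <stdatomic.h>

#define CACHE_LINE   64   /* assumed cache-line size */
#define MAX_THREADS  64

/* One writer, many readers: each thread owns one flag word per signaling
   direction, padded so that no two flags share a cache line (avoids
   false sharing). */
typedef struct {
    atomic_int val;
    char pad[CACHE_LINE - sizeof(atomic_int)];
} padded_flag_t;

/* Two sets of flags: consecutive barriers alternate between set 0 and
   set 1, and each thread re-zeroes its own flags in the idle set before
   that set is used again. */
static padded_flag_t arrive_flag[2][MAX_THREADS]; /* child -> parent (UP)   */
static padded_flag_t clear_flag[2][MAX_THREADS];  /* parent -> child (DOWN) */

static inline void set_flag(padded_flag_t *f)
{
    /* Single writer: a plain release store, no atomic read-modify-write. */
    atomic_store_explicit(&f->val, 1, memory_order_release);
}

static inline void spin_on_flag(padded_flag_t *f)
{
    /* Readers spin on their cached copy of the line; the coherence
       protocol propagates the writer's update when it arrives. */
    while (atomic_load_explicit(&f->val, memory_order_acquire) == 0)
        ;
}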