Optimizing Collective Communication for Multicore
Transcript of "Optimizing Collective Communication for Multicore"
By Rajesh Nishtala, BERKELEY PAR LAB
What Are Collectives?
An operation called by all threads together to perform globally coordinated communication
May involve a modest amount of computation, e.g. to combine values as they are communicated
Can be extended to teams (or communicators), in which case they operate on a predefined subset of the threads
Focus on collectives in Single Program Multiple Data (SPMD) programming models
Some Collectives
Barrier (MPI_Barrier())
• A thread cannot exit a call to the barrier until all other threads have called the barrier
Broadcast (MPI_Bcast())
• A root thread sends a copy of an array to all the other threads
Reduce-To-All (MPI_Allreduce())
• Each thread contributes an operand to an arithmetic operation across all the threads
• The result is then broadcast to all the threads
Exchange (MPI_Alltoall())
• For all i, j < N, thread i copies the jth piece of its input array to the ith slot of an output array located on thread j
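These four operations map directly onto MPI calls. A minimal sketch of how each might be invoked in C follows; the buffer sizes, datatypes, and use of MPI_COMM_WORLD are illustrative assumptions, not taken from the talk.

#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Barrier: no rank exits until every rank has entered. */
    MPI_Barrier(MPI_COMM_WORLD);

    /* Broadcast: the root (rank 0) sends a copy of buf to everyone. */
    double buf[8] = {0.0};
    MPI_Bcast(buf, 8, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    /* Reduce-To-All: every rank contributes 'local'; all ranks get the sum. */
    double local = (double)rank, sum = 0.0;
    MPI_Allreduce(&local, &sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    /* Exchange: the j-th block of sendbuf on rank i ends up as the
       i-th block of recvbuf on rank j. */
    double *sendbuf = malloc(nprocs * sizeof(double));
    double *recvbuf = malloc(nprocs * sizeof(double));
    for (int j = 0; j < nprocs; j++) sendbuf[j] = rank * nprocs + j;
    MPI_Alltoall(sendbuf, 1, MPI_DOUBLE, recvbuf, 1, MPI_DOUBLE, MPI_COMM_WORLD);

    free(sendbuf); free(recvbuf);
    MPI_Finalize();
    return 0;
}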
Why Are They Important?
Basic communication building blocks, found in many parallel programming languages and libraries
Abstraction: if an application is written with collectives, it passes the responsibility of tuning to the runtime

Percentage of runtime spent in collectives (Opteron/InfiniBand/256):

                       Class C   Class D
Exchange in NAS FT      ~28%      ~23%
Reductions in NAS CG    ~42%      ~28%
Experimental Setup
Platforms:
Sun Niagara2
• 1 socket of 8 multi-threaded cores
• Each core supports 8 hardware thread contexts, for 64 total threads
Intel Clovertown
• 2 "traditional" quad-core sockets
BlueGene/P
• 1 quad-core socket
MPI used for inter-process communication: shared-memory MPICH2 1.0.7
Threads v. Processes (Niagara2)
Barrier performance: perform a barrier across all 64 threads
Threads arranged into processes in different ways
– One extreme has one thread per process, while the other has 1 process with 64 threads
MPI_Barrier() called between processes
Flat barrier amongst the threads within a process
2 orders of magnitude difference in performance!
Threads v. Processes (Niagara2) cont.
Other collectives see similar scaling issues when using processes
MPI collectives are called between processes, while shared memory is leveraged within a process
Intel Clovertown and BlueGene/P
Fewer threads per node
Differences are not as drastic, but they are non-trivial
[Plots: Intel Clovertown, BlueGene/P]
Optimizing Barrier w/ Trees
Leveraging shared memory is a critical optimization
Flat trees don't scale; use trees to aid parallelism
Requires two passes of a tree
First (UP) pass indicates that all threads have arrived
• Signal your parent once all of your children have arrived
• Once the root gets the signal from all of its children, all threads have reported in
Second (DOWN) pass tells every thread that the barrier is complete
• Wait for my parent to send me a clear signal
• Propagate the clear signal down to my children
[Figure: example signaling tree over threads 0-15, rooted at thread 0]
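A minimal sketch of this two-pass tree barrier, assuming a hypothetical per-thread tree description and flag-signaling helpers (the flag layout itself is discussed in the backup "Details of Signaling" slide):

/* Hypothetical per-thread view of the signaling tree. */
typedef struct {
    int parent;          /* -1 at the root                       */
    int *children;       /* thread ids of this thread's children */
    int num_children;
} tree_node_t;

/* Assumed helpers: each writes or spins on a per-thread shared flag. */
void signal_arrival(int thread_id);   /* "my whole subtree has arrived" */
void wait_arrival(int thread_id);
void signal_clear(int thread_id);     /* "the barrier is complete"      */
void wait_clear(int thread_id);

void tree_barrier(const tree_node_t *me, int my_id)
{
    /* UP pass: wait for every child subtree, then report to the parent. */
    for (int c = 0; c < me->num_children; c++)
        wait_arrival(me->children[c]);
    if (me->parent >= 0) {
        signal_arrival(my_id);
        /* DOWN pass: wait for the parent's clear signal... */
        wait_clear(me->parent);
    }
    /* ...and propagate it to my children, who spin on my clear flag. */
    signal_clear(my_id);
}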
Example Tree Topologies
[Figures: three trees over threads 0-15, each rooted at thread 0]
Radix 2 k-nomial tree (binomial)
Radix 4 k-nomial tree (quadnomial)
Radix 8 k-nomial tree (octnomial)
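The slides do not spell out the exact numbering used in these trees; one common k-nomial construction, given here as an illustrative sketch, computes a thread's children directly from its id:

#include <stdio.h>

/* Print the children of thread 'me' in a radix-k k-nomial tree rooted at
   thread 0 over 'nthreads' threads (radix >= 2 assumed).  This is one
   common numbering; the layout used in the slides may differ. */
void knomial_children(int me, int nthreads, int radix)
{
    for (int stride = 1; stride < nthreads; stride *= radix) {
        /* 'me' is a subtree root at this level iff it is a multiple of
           stride*radix; otherwise it joined a parent here and is done. */
        if (me % (stride * radix) != 0)
            break;
        for (int d = 1; d < radix; d++) {
            int child = me + d * stride;
            if (child < nthreads)
                printf("thread %d -> child %d\n", me, child);
        }
    }
}

int main(void)
{
    for (int t = 0; t < 16; t++)
        knomial_children(t, 16, 2);   /* radix 2: binomial tree */
    return 0;
}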
Barrier Performance Results
Time many back-to-back barriers
Flat tree is just one level, with all threads reporting to thread 0
• Leverages shared memory, but non-scalable
Architecture-independent tree (radix = 2)
• Pick a generic "good" radix that is suitable for many platforms
• May be mismatched to the architecture
Architecture-dependent tree
• Search over all radices to pick the tree that best matches the architecture
Broadcast Performance Results
Time a latency-sensitive Broadcast (8 bytes)
Time a Broadcast followed by a Barrier, and subtract the time for the Barrier
Yields an approximation of how long it takes for the last thread to get the data
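A sketch of that measurement written against MPI; the iteration count, timer, and structure are assumptions for illustration:

#include <mpi.h>

/* Approximate broadcast latency: time (Bcast + Barrier) loops, time
   Barrier-only loops, and report the per-iteration difference. */
double time_bcast(MPI_Comm comm, int iters)
{
    double buf = 0.0;

    MPI_Barrier(comm);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        MPI_Bcast(&buf, 1, MPI_DOUBLE, 0, comm);   /* 8-byte broadcast */
        MPI_Barrier(comm);
    }
    double with_bcast = (MPI_Wtime() - t0) / iters;

    MPI_Barrier(comm);
    t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++)
        MPI_Barrier(comm);
    double barrier_only = (MPI_Wtime() - t0) / iters;

    /* ~ time until the last thread has the data */
    return with_bcast - barrier_only;
}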
Reduce-To-All Performance Results
4 kByte (512 doubles) Reduce-To-All
In addition to the data movement, we also want to parallelize the computation
In the flat approach, the computation gets serialized at the root
Tree-based approaches allow us to parallelize the computation amongst all the floating point units
On Niagara2, 8 threads share one FPU, thus radix 2, 4, and 8 serialize the computation in about the same way
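To make the parallelization concrete, here is a sketch that reuses the hypothetical tree_node_t and arrival-flag helpers from the barrier sketch above; all_vecs is assumed to hold each thread's contribution in shared memory:

/* Each parent folds its children's partial sums into its own vector before
   reporting up, so the additions are spread across the tree instead of
   being serialized at the root. */
void tree_reduce_sum(const tree_node_t *me, int my_id,
                     double *my_vec, double **all_vecs, int n)
{
    for (int c = 0; c < me->num_children; c++) {
        int child = me->children[c];
        wait_arrival(child);                  /* child's subtree sum is ready */
        for (int i = 0; i < n; i++)
            my_vec[i] += all_vecs[child][i];
    }
    if (me->parent >= 0)
        signal_arrival(my_id);
    /* The root now holds the full sum; a DOWN pass (not shown) broadcasts
       it to complete the Reduce-To-All. */
}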
Optimization Summary
Relying on flat trees is not enough for most collectives
Architecture-dependent tuning is a further and important optimization
Extending the Results to a Cluster
Use one rack of BlueGene/P (1024 nodes, or 4096 cores)
Reduce-To-All by having one representative thread make the call to the inter-node all-reduce
• Reduces the number of messages in the network
Vary the number of threads per process, but use all cores
Relying purely on shared memory doesn't always yield the best performance
• The number of active cores working on the computation drops
Can optimize so that the computation is partitioned across cores
• Not suitable for a direct call to MPI_Allreduce()
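A rough sketch of this hierarchical scheme for a hybrid threads-plus-MPI process; smp_reduce_sum and smp_broadcast are assumed stand-ins for the shared-memory tree collectives described earlier:

#include <mpi.h>

/* Assumed shared-memory collectives over the threads of one process. */
double smp_reduce_sum(double my_val, int my_thread);  /* result valid on thread 0 */
void   smp_broadcast(double *val, int my_thread);     /* thread 0 is the root     */

double hierarchical_allreduce_sum(double my_val, int my_thread, MPI_Comm comm)
{
    /* 1. Reduce within the node using shared memory. */
    double node_sum = smp_reduce_sum(my_val, my_thread);
    double global_sum = 0.0;

    /* 2. Only one representative thread per process talks to the network. */
    if (my_thread == 0)
        MPI_Allreduce(&node_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, comm);

    /* 3. Broadcast the result to the other threads through shared memory. */
    smp_broadcast(&global_sum, my_thread);
    return global_sum;
}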
Potential Synchronization Problem
1. Broadcast variable x from the root
2. Have proc 1 set a new value for x on proc 4

broadcast x = 1 from proc 0
if (myid == 1) {
    put x = 5 to proc 4
} else {
    /* do nothing */
}
[Animation: as the broadcast of x = 1 propagates among procs 0-4, proc 1 finishes its part of the broadcast early and issues the put of x = 5 to proc 4, which the still-arriving broadcast value then overwrites]
The put of x = 5 by proc 1 has been lost
Proc 1 observed a globally incomplete collective: it thinks the collective is done before the broadcast has reached every proc
Strict v. Loose Synchronization
A fix to the problem: add a barrier before/after the collective
• Enforces a global ordering of the operations
Is there a problem? We want to decouple synchronization from data movement
Specify the synchronization requirements
• Potential to aggregate synchronization
• Done by the user or a smart compiler
How can we realize these gains in applications?
Conclusions
Moving from processes to threads is a crucial optimization for single-node collective communication
Can use tree-based collectives to realize better performance, even for collectives on one node
Picking the tree that best matches the architecture yields the best performance
Multicore adds to the (auto)tuning space for collective communication
Shared-memory semantics allow us to create new, loosely synchronized collectives
Questions?
Backup Slides
Threads and Processes
Threads
• A sequence of instructions and an execution stack
• Communication between threads occurs through a common, shared address space
• No OS/network involvement needed
• Reasoning about inter-thread communication can be tricky
Processes
• A set of threads and an associated memory space
• All threads within a process share its address space
• Communication between processes must be managed through the OS
• Inter-process communication is explicit but may be slow
• More expensive to switch between processes
Experimental Platforms
Niagara2, Clovertown, BG/P
Specs

                                            Niagara2                Clovertown              BlueGene/P
# Sockets                                   1                       2                       1
# Cores/Socket                              8                       4                       4
Threads Per Core                            8                       1                       1
Total Thread Count                          64                      8                       4
Instruction Set                             SPARC                   x86/64                  PowerPC
Core Frequency                              1.4 GHz                 2.6 GHz                 0.85 GHz
Peak DP Floating Point Performance / Core   1.4 GFlop/s             10.4 GFlop/s            3.4 GFlop/s
DRAM Read Bandwidth / Socket                42.7 GB/s               21.3 GB/s               13.6 GB/s
DRAM Write Bandwidth / Socket               21.3 GB/s               10.7 GB/s               13.6 GB/s
L1 Cache Size                               8 kB                    32 kB                   32 kB
L2 Cache Size                               4 MB (shared)           16 MB (4 MB/2 cores)    8 MB (4 MB/2 cores)
OS Version                                  Solaris 5.10            Linux 2.6.18            BG/P Compute Kernel
C Compiler                                  Sun C (5.9)             Intel ICC (10.1)        IBM BlueGene XLC
MPI Implementation                          MPICH2 1.0.7 (ch3:shm)  MPICH2 1.0.7 (ch3:shm)  MPICH2 port for BG/P
Details of Signaling
For optimal performance, have many readers and one writer
• Each thread sets a flag (a single word) that the others will read
• Every reader gets a copy of the cache line and spins on that copy; when the writer changes the value of the variable, the cache-coherency system handles broadcasting/updating the change
• Avoids atomic primitives
On the way up the tree, a child sets a flag indicating that its subtree has arrived
• The parent spins on that flag for each child
On the way down, each child spins on its parent's flag
• When it is set, it indicates that the parent wants to broadcast the clear signal down
Flags must be on different cache lines to avoid false sharing
Need to switch back and forth between two sets of flags so that consecutive barriers do not interfere
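A minimal sketch of how such flags might be laid out, assuming C11 atomics, a 64-byte cache line, and illustrative names; this is not the actual implementation from the talk:

#include <stdatomic.h>

#define CACHE_LINE   64   /* assumed cache-line size */
#define MAX_THREADS  64

/* One writer, many readers: each thread owns one flag word per signaling
   direction, padded so that no two flags share a cache line (avoids
   false sharing). */
typedef struct {
    atomic_int val;
    char pad[CACHE_LINE - sizeof(atomic_int)];
} padded_flag_t;

/* Two sets of flags: consecutive barriers alternate between set 0 and
   set 1, and each thread re-zeroes its own flags in the idle set before
   that set is used again. */
static padded_flag_t arrive_flag[2][MAX_THREADS]; /* child -> parent (UP)   */
static padded_flag_t clear_flag[2][MAX_THREADS];  /* parent -> child (DOWN) */

static inline void set_flag(padded_flag_t *f)
{
    /* Single writer: a plain release store, no atomic read-modify-write. */
    atomic_store_explicit(&f->val, 1, memory_order_release);
}

static inline void spin_on_flag(padded_flag_t *f)
{
    /* Readers spin on their cached copy of the line; the coherence
       protocol propagates the writer's update when it arrives. */
    while (atomic_load_explicit(&f->val, memory_order_acquire) == 0)
        ;
}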