Optimization of Collective Communication in Intra-Cell MPI
Ashok Srinivasan
Florida State University
Goals
1. Efficient implementation of collectives for intra-Cell MPI
2. Evaluate the impact of different algorithms on performance
Collaborators: A. Kumar¹, G. Senthilkumar¹, M. Krishna¹, N. Jayam¹, P.K. Baruah¹, R. Sarma¹, S. Kapoor²
¹ Sri Sathya Sai University, Prashanthi Nilayam, India; ² IBM, Austin
Acknowledgment: IBM, for providing access to a Cell blade under the VLP program
Outline
Cell Architecture
Intra-Cell MPI Design Choices
Barrier
Broadcast
Reduce
Conclusions and Future Work
Cell Architecture
A PowerPC core (PPE), with 8 co-processors (SPEs), each with a 256 KB local store
Shared 512 MB to 2 GB main memory; the SPEs access it through DMA
Peak speeds of 204.8 Gflops in single precision and 14.64 Gflops in double precision for the SPEs
204.8 GB/s EIB bandwidth, 25.6 GB/s to main memory
Two Cell processors can be combined to form a Cell blade with global shared memory
DMA put times
[Figure: DMA put times for memory-to-memory copy using (i) the SPE local store and (ii) memcpy by the PPE]
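The local-store path in that figure can be illustrated with a short C sketch. This is a minimal, hypothetical illustration assuming the Cell SDK's spu_mfcio.h intrinsics (mfc_get, mfc_put, tag waits); the name mm_copy, the 16 KB chunking, and the assumption of 128-byte-aligned, DMA-legal sizes are all illustrative:

    /* Copy main memory to main memory by staging through the SPE local
     * store: DMA each chunk in (mfc_get), wait, then DMA it out (mfc_put). */
    #include <spu_mfcio.h>

    #define CHUNK 16384   /* 16 KB: the largest single MFC transfer */
    static char ls_buf[CHUNK] __attribute__((aligned(128)));

    void mm_copy(unsigned long long src_ea, unsigned long long dst_ea,
                 unsigned long long nbytes)
    {
        const unsigned int tag = 0;
        mfc_write_tag_mask(1 << tag);
        while (nbytes > 0) {
            unsigned int sz = nbytes < CHUNK ? (unsigned int)nbytes : CHUNK;
            mfc_get(ls_buf, src_ea, sz, tag, 0, 0);  /* main memory -> LS */
            mfc_read_tag_status_all();               /* wait for the get */
            mfc_put(ls_buf, dst_ea, sz, tag, 0, 0);  /* LS -> main memory */
            mfc_read_tag_status_all();               /* wait for the put */
            src_ea += sz; dst_ea += sz; nbytes -= sz;
        }
    }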
Intra-Cell MPI Design Choices
Cell features
• In-order execution, but DMAs can be out of order
• Over 100 simultaneous DMAs can be in flight

Constraints
• Unconventional, heterogeneous architecture
• SPEs have limited functionality, and can act directly only on their local stores
• SPEs access main memory through DMA
• Use of the PPE should be limited to get good performance

MPI design choices
• Application data in: (i) local store or (ii) main memory
• MPI data in: (i) local store or (ii) main memory
• PPE involvement: (i) active or (ii) only during initialization and finalization
• Collective calls can: (i) synchronize or (ii) not synchronize
Barrier (1)
OTA List: the "root" receives notification from all others, and then acknowledges through a DMA list
OTA: like OTA List, but the root notifies the others through individual non-blocking DMAs
SIG: like OTA, but the others notify the root through a signal register in OR mode
Degree-k TREE: each node has k-1 children; in the first phase, children notify parents, and in the second phase, parents acknowledge children (see the sketch below)
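A minimal sketch of the degree-k TREE barrier (k >= 2), with hypothetical notify()/wait_for() primitives standing in for the DMA or signal-register notifications described above:

    extern void notify(int rank);    /* hypothetical: signal a flag on 'rank' */
    extern void wait_for(int rank);  /* hypothetical: spin until 'rank' signals us */

    void tree_barrier(int rank, int P, int k)
    {
        int nchild = k - 1;                    /* each node has k-1 children */
        int parent = (rank - 1) / nchild;
        int first = rank * nchild + 1;         /* rank of the first child */

        /* Phase 1: children notify parents (leaves upward). */
        for (int c = first; c < first + nchild && c < P; c++)
            wait_for(c);                       /* all children have arrived */
        if (rank != 0)
            notify(parent);

        /* Phase 2: parents acknowledge children (root downward). */
        if (rank != 0)
            wait_for(parent);
        for (int c = first; c < first + nchild && c < P; c++)
            notify(c);
    }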
Barrier (2)
PE: consider the SPUs to be a logical hypercube; in each step, each SPU exchanges messages with its neighbor along one dimension
DIS: in step i, SPU j sends to SPU j + 2^i and receives from SPU j - 2^i (mod P), as sketched below
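A sketch of the DIS (dissemination) pattern under the same hypothetical flag primitives; a real implementation needs a separate flag per round so that successive rounds do not interfere:

    /* Dissemination barrier: ceil(log2(P)) rounds; in round i each SPU
     * signals the SPU 2^i ahead of it and waits on the SPU 2^i behind it. */
    void dissemination_barrier(int rank, int P)
    {
        for (int dist = 1; dist < P; dist <<= 1) {
            notify((rank + dist) % P);
            wait_for((rank - dist + P) % P);
        }
    }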
Comparison of MPI_Barrier on different hardware (times in µs)

P     Cell (PE)   Xeon/Myrinet   NEC SX-8   SGI Altix BX2
8     0.4         10             13         3
16    1.0         14             5          5
Alternatives (slower):
• Atomic increments in main memory: several microseconds (sketched below)
• PPE coordinates using its mailbox: tens of microseconds
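For contrast, a sketch of the atomic-increment alternative. GCC's __sync builtins are used here for brevity; on the SPE the increment would go through the MFC's atomic commands, which is why this approach costs several microseconds. The counter is assumed pre-zeroed, and this version is single-use (a reusable barrier would reset the counter or alternate sense):

    void atomic_barrier(volatile int *counter, int P)
    {
        __sync_fetch_and_add(counter, 1);  /* atomically announce arrival */
        while (*counter < P)
            ;                              /* spin until all P SPUs arrive */
    }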
Broadcast (1)
[Figure: OTA on 4 SPUs]
OTA: each SPE copies the data to its own location. Different shifts are used to avoid hotspots in memory; on larger numbers of SPUs, different shifts yield results that are close to each other.

[Figure: AG on 16 SPUs]
AG: each SPE is responsible for a different portion of the data; different minimum piece sizes are tried (see the sketch below).
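A rough sketch of the AG partitioning, with memcpy standing in for the LS-staged DMA copies and every name (ag_copy_piece, min_piece) hypothetical; the exchange that then gathers the pieces on all SPEs is omitted:

    #include <string.h>

    /* Each of the P SPEs copies a different piece of the broadcast buffer,
     * so the copies proceed in parallel; pieces smaller than min_piece are
     * rounded up (the "minimum sizes" varied in the plots). */
    void ag_copy_piece(char *dst, const char *src, size_t nbytes,
                       int rank, int P, size_t min_piece)
    {
        size_t piece = (nbytes + P - 1) / P;   /* ceil(nbytes / P) */
        if (piece < min_piece) piece = min_piece;
        size_t off = (size_t)rank * piece;
        if (off < nbytes) {
            size_t sz = nbytes - off < piece ? nbytes - off : piece;
            memcpy(dst + off, src + off, sz);
        }
    }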
Broadcast (2)
[Figure: TREEMM on 12 SPUs]
TREEMM: tree-structured Send/Recv-type implementation. The data for degrees 2 and 4 are close; degree 3 is best, or close to it, for all SPU counts.

[Figure: TREE on 16 SPUs]
TREE: pipelined tree-structured communication based on local stores (sketched below). Results for other SPU counts are similar to this figure.
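A sketch of the pipelined TREE structure, with hypothetical send_chunk()/recv_chunk() standing in for the local-store-to-local-store DMAs; a real implementation would double-buffer so the receive of chunk i+1 overlaps the forwarding of chunk i:

    #include <stddef.h>

    extern void recv_chunk(int from, char *dst, size_t n);      /* hypothetical LS DMA in */
    extern void send_chunk(int to, const char *src, size_t n);  /* hypothetical LS DMA out */

    /* Forward the message chunk by chunk down the tree: once a chunk has
     * arrived from the parent it is passed on to the children while later
     * chunks are still in flight higher up the tree. */
    void tree_bcast(char *buf, size_t nbytes, size_t chunk, int rank,
                    int root, int parent, const int *children, int nchild)
    {
        for (size_t off = 0; off < nbytes; off += chunk) {
            size_t sz = nbytes - off < chunk ? nbytes - off : chunk;
            if (rank != root)
                recv_chunk(parent, buf + off, sz);
            for (int c = 0; c < nchild; c++)
                send_chunk(children[c], buf + off, sz);
        }
    }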
Broadcast (3)
[Figure: Broadcast on 16 SPEs (2 processors)]
• TREE: pipelined tree-structured communication based on local stores
• TREEMM: tree-structured Send/Recv-type implementation
• AG: each SPE is responsible for a different portion of the data
• OTA: each SPE copies the data to its own location
• G: the root copies all the data

[Figure: Broadcast with a good choice of algorithm for each data size and SPE count; the maximum main memory bandwidth is also shown]
Broadcast (4)
Each node of the SX-8 has 8 vector processors capable of 16 Gflop/s, with 64 GB/s bandwidth to memory from each processor; the total bandwidth to memory for a node is 512 GB/s. Nodes are connected through a crossbar switch capable of 16 GB/s in each direction.

The Altix is a CC-NUMA system with a global shared memory. Each node contains eight Itanium 2 processors. Nodes are connected using NUMALINK4; the bandwidth between processors on a node is 3.2 GB/s, and between nodes 1.6 GB/s.
Comparison of MPI_Bcast on different hardware (times in µs; — = not reported)

Data Size   Cell (PE)        Infiniband     NEC SX-8        SGI Altix BX2
            P=8     P=16     P=8    P=16    P=8     P=16    P=8     P=16
128 B       1.7     3.1      18     —       —       —       10      —
1 KB        2.0     3.7      25     —       —       —       20      —
32 KB       12.8    33.7     220    —       —       —       —       —
1 MB        414     653      —      —       100     215     2600    3100
Reduce
[Figure: Reduce of MPI_INT with MPI_SUM on 16 SPUs] Similar trends were observed for other SPU counts; a sketch of the tree reduction follows.
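A sketch of one natural structure for this reduction: a tree in which each SPU combines its children's partial sums before forwarding. recv_ints()/send_ints() are hypothetical stand-ins for the DMA transfers, and tmp is a scratch buffer for one child's vector:

    extern void recv_ints(int from, int *dst, int n);       /* hypothetical DMA in */
    extern void send_ints(int to, const int *src, int n);   /* hypothetical DMA out */

    /* Tree reduction of MPI_INT with MPI_SUM: combine children's partial
     * results into buf, then pass the partial sum up to the parent. */
    void tree_reduce_sum(int *buf, int n, int rank, int parent,
                         const int *children, int nchild, int *tmp)
    {
        for (int c = 0; c < nchild; c++) {
            recv_ints(children[c], tmp, n);   /* child's partial result */
            for (int i = 0; i < n; i++)
                buf[i] += tmp[i];             /* apply MPI_SUM */
        }
        if (rank != 0)
            send_ints(parent, buf, n);        /* rank 0 ends with the full sum */
    }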
Comparison of MPI_Reduce on different hardware (times in µs; — = not reported)

Data Size   Cell (PE)        IBM SP   NEC SX-8        SGI Altix BX2
            P=8     P=16     P=16     P=8     P=16    P=8      P=16
128 B       3.06    5.69     40       —       —       —        —
1 KB        4.41    8.8      60       —       —       —        —
1 MB        689     1129     13000    230     350     10000    12000

Each node of the IBM SP was a 16-processor SMP.
Conclusions and Future Work
Conclusions
• The Cell processor has good potential for MPI implementations
• The PPE should have a limited role
• High bandwidth and low latency are achievable even with application data in main memory
• But the local store should be used effectively, with double buffering to hide DMA latency (sketched below); main memory bandwidth is then the bottleneck
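A minimal sketch of that double buffering, assuming spu_mfcio.h: two local-store buffers and two DMA tags let the SPU fetch chunk i+1 from main memory while it works on chunk i. The process() routine and all sizes are illustrative:

    #include <spu_mfcio.h>

    #define CHUNK 16384
    static char buf[2][CHUNK] __attribute__((aligned(128)));

    extern void process(char *data, unsigned int n);  /* hypothetical per-chunk work */

    void process_stream(unsigned long long ea, unsigned long long nbytes)
    {
        if (nbytes == 0) return;
        int cur = 0;
        unsigned int sz = nbytes < CHUNK ? (unsigned int)nbytes : CHUNK;
        mfc_get(buf[cur], ea, sz, cur, 0, 0);           /* prefetch first chunk */
        while (nbytes > 0) {
            unsigned long long left = nbytes - sz;
            if (left > 0) {                             /* start the next DMA early */
                unsigned int nsz = left < CHUNK ? (unsigned int)left : CHUNK;
                mfc_get(buf[1 - cur], ea + sz, nsz, 1 - cur, 0, 0);
            }
            mfc_write_tag_mask(1 << cur);
            mfc_read_tag_status_all();                  /* wait only for the current chunk */
            process(buf[cur], sz);                      /* compute while the next DMA flies */
            ea += sz; nbytes = left;
            sz = nbytes < CHUNK ? (unsigned int)nbytes : CHUNK;
            cur = 1 - cur;
        }
    }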
Current and future work
• Implemented: collective communication operations optimized for contiguous data
• Future work: optimize collectives for derived data types with non-contiguous data