A comparison of CC-SAS, MP and SHMEM on SGI Origin2000.
Three Programming Models
CC-SAS: linear address space for shared memory
MP: communicate with other processes explicitly via a message-passing interface
SHMEM: communicate via get and put primitives
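To make the contrast concrete, here is a minimal single-process sketch. It is not Origin2000 code: Python threads stand in for processors, and the helper names (`cc_sas_read`, `mp_exchange`, `shmem_put`) are hypothetical, chosen only for illustration. It shows who takes part in each transfer, nothing about performance.

```python
import queue
import threading

# Illustrative only: plain Python lists stand in for memory, and the
# function names are hypothetical, not part of any real library.

def cc_sas_read(shared, index):
    """CC-SAS: any processor simply loads from the shared address space."""
    return shared[index]

def mp_exchange(value):
    """MP: both sides take part, one explicit send and one matching receive."""
    channel = queue.Queue()
    received = []

    def sender():
        channel.put(value)              # explicit send

    def receiver():
        received.append(channel.get())  # matching explicit receive

    threads = [threading.Thread(target=sender),
               threading.Thread(target=receiver)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return received[0]

def shmem_put(remote_buffer, index, value):
    """SHMEM: one-sided put; the target does not post a receive."""
    remote_buffer[index] = value
```

In the MP case both threads must run for the data to arrive, whereas the CC-SAS read and the SHMEM put each involve only the initiating side.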
Platforms:
Tightly-coupled multiprocessors: SGI Origin2000, a cache-coherent distributed shared memory machine
Less tightly-coupled clusters: a cluster of workstations connected by Ethernet
Purpose
Compare the three programming models on the Origin2000, a modern 64-processor hardware cache-coherent machine
We focus on scientific applications that access data regularly or predictably.
Questions to be answered
Can parallel algorithms be structured in the same way for good performance in all three models?
If there are substantial differences in performance among the three models, where are the key bottlenecks?
Do we need to change the data structures or algorithms substantially to solve those bottlenecks?
Applications and Algorithms
FFT: all-to-all communication (regular)
Ocean: nearest-neighbor communication
Radix: all-to-all communication (irregular)
LU: one-to-many communication
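The three communication patterns named above can be sketched as sets of (source, destination) processor pairs. This is an illustrative model only: the function names, the 2-D grid layout for nearest-neighbor, and the absence of wraparound are assumptions, not details from the slides. The sketch captures who talks to whom; it does not model how much data each pair exchanges or how predictable the accessed addresses are, which is presumably where the regular and irregular all-to-all cases differ.

```python
def all_to_all(num_procs):
    """Every processor sends to every other processor (FFT, Radix)."""
    return {(s, d) for s in range(num_procs)
            for d in range(num_procs) if s != d}

def nearest_neighbor(rows, cols):
    """Each processor on a rows x cols grid talks to its 4 neighbors.

    Assumes a 2-D grid decomposition with no wraparound at the edges.
    """
    pairs = set()
    for r in range(rows):
        for c in range(cols):
            src = r * cols + c
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols:
                    pairs.add((src, nr * cols + nc))
    return pairs

def one_to_many(root, receivers):
    """One processor sends to a group of others (LU)."""
    return {(root, d) for d in receivers if d != root}
```

For example, `all_to_all(4)` yields 12 pairs, while `nearest_neighbor(2, 2)` yields 8, since each of the four processors has exactly two neighbors on a 2 x 2 grid.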
Performance Result
Question: why is MP much worse than CC-SAS and SHMEM?
Analysis:
Execution time = BUSY + LMEM + RMEM + SYNC
where
BUSY: CPU computation time
LMEM: CPU stall time for local cache miss
RMEM: CPU stall time for sending/receiving remote data
SYNC: CPU time spent at synchronization events
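As a small worked example of this breakdown, the sketch below sums the four components and reports each one's share of the total. The function name and the timing values are hypothetical, chosen only to show the arithmetic; they are not measurements from the study.

```python
def time_breakdown(busy, lmem, rmem, sync):
    """Return total execution time and each component's share of it."""
    components = {"BUSY": busy, "LMEM": lmem, "RMEM": rmem, "SYNC": sync}
    total = sum(components.values())
    shares = {name: t / total for name, t in components.items()}
    return total, shares

# Hypothetical per-processor times in seconds (not measured data).
total, shares = time_breakdown(busy=5.0, lmem=1.0, rmem=2.0, sync=2.0)
for name, share in shares.items():
    print(f"{name}: {share:.0%} of {total:.1f} s")
```

Comparing the shares across models is one way to see which component (for instance RMEM or SYNC) accounts for a slowdown.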
Where does the time go in MP?
Improving MP performance
Remove extra data copy: allocate all data involved in communication in the shared address space
Reduce SYNC time: use lock-free queue management in communication
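The slides do not say which lock-free design was used. One common option is a single-producer, single-consumer ring buffer, sketched below under that assumption. The class and method names are hypothetical. The key idea is that only the producer writes the tail index and only the consumer writes the head index, so neither side needs a lock.

```python
class SpscRing:
    """Single-producer, single-consumer ring buffer without a lock.

    Illustrative sketch: only the producer updates `_tail`, and only the
    consumer updates `_head`, so the two sides never write the same index.
    """

    def __init__(self, capacity):
        # One slot is kept empty to distinguish "full" from "empty".
        self._slots = [None] * (capacity + 1)
        self._head = 0  # next slot to read (consumer-owned)
        self._tail = 0  # next slot to write (producer-owned)

    def try_put(self, item):
        """Producer side: return False instead of blocking when full."""
        nxt = (self._tail + 1) % len(self._slots)
        if nxt == self._head:
            return False
        self._slots[self._tail] = item
        self._tail = nxt  # publish only after the slot is written
        return True

    def try_get(self):
        """Consumer side: return (False, None) instead of blocking when empty."""
        if self._head == self._tail:
            return False, None
        item = self._slots[self._head]
        self._head = (self._head + 1) % len(self._slots)
        return True, item
```

A version written in C for real hardware would also need appropriate memory ordering between writing the slot and publishing the index; this Python sketch shows only the ownership discipline.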
Speedups under Improved MP
Why does CC-SAS perform best?
Extra packing/unpacking operation in MP and SHMEM
Extra packet queue management in MP …
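To illustrate what a packing/unpacking step can cost, the sketch below moves a sub-block of a matrix to its transposed position in two ways. The slides do not specify which data is packed; a transpose step of the kind FFT uses is assumed here, and all function names are hypothetical. The message-style path stages the block in a contiguous buffer and then scatters it, while the shared-address-space path reads the source elements in place.

```python
def pack_block(matrix, row_start, row_end, col_start, col_end):
    """Copy a non-contiguous sub-block into one contiguous buffer."""
    return [matrix[r][c]
            for r in range(row_start, row_end)
            for c in range(col_start, col_end)]

def unpack_block_transposed(buffer, dest, rows, cols, dest_row, dest_col):
    """Scatter a received rows x cols buffer into dest, transposed."""
    for r in range(rows):
        for c in range(cols):
            dest[dest_row + c][dest_col + r] = buffer[r * cols + c]

def direct_transpose(src, dest, row_start, row_end, col_start, col_end):
    """Shared-address-space style: read source elements in place."""
    for r in range(row_start, row_end):
        for c in range(col_start, col_end):
            dest[c][r] = src[r][c]

# Small usage example on a 4 x 4 matrix.
src = [[r * 4 + c for c in range(4)] for r in range(4)]
via_buffer = [[0] * 4 for _ in range(4)]
in_place = [[0] * 4 for _ in range(4)]

staged = pack_block(src, 0, 2, 2, 4)                     # extra copy 1
unpack_block_transposed(staged, via_buffer, 2, 2, 2, 0)  # extra copy 2
direct_transpose(src, in_place, 0, 2, 2, 4)              # no staging buffer
```

Both paths produce the same destination contents; the difference is the two additional passes over the data on the staged path.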
Speedups for Ocean
Speedups for Radix
Speedups for LU
Conclusions
Good algorithm structures are portable among programming models.
MP is much worse than CC-SAS and SHMEM on a hardware-coherent machine. However, we can achieve similar performance if the extra data copy and queue synchronization are addressed.
Something about programmability
Future work
How about those applications that indeed have irregular, unpredictable and naturally fine-grained data access and communication patterns?
How about software-based coherent machines (i.e. clusters)?