A comparison of CC-SAS, MP and SHMEM on SGI Origin2000.


Transcript of A comparison of CC-SAS, MP and SHMEM on SGI Origin2000.

Page 1: A comparison of CC-SAS, MP and SHMEM on SGI Origin2000.

A comparison of CC-SAS, MP and SHMEM on SGI Origin2000

Page 2:

Three Programming Models

CC-SAS– cache-coherent shared address space: a single linear address space; communication is implicit through ordinary loads and stores

MP– processes communicate with each other explicitly via a message-passing interface (matched sends and receives)

SHMEM– one-sided communication via explicit get and put primitives
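As a toy illustration of the three models (hypothetical code, not the benchmarks' actual implementations), the same producer-to-consumer transfer of one value looks quite different under each. Threads stand in for processes here:

```python
import threading
import queue

results = []

# CC-SAS: a single shared address space -- the producer does an ordinary
# store, the consumer an ordinary load, separated by a synchronization event.
shared = {"x": 0}
cc_ready = threading.Event()

def cc_sas_producer():
    shared["x"] = 42         # plain store into shared memory
    cc_ready.set()

def cc_sas_consumer(out):
    cc_ready.wait()
    out.append(shared["x"])  # plain load

# MP: explicit two-sided message passing -- a send must match a receive.
chan = queue.Queue()

def mp_producer():
    chan.put(42)             # explicit send

def mp_consumer(out):
    out.append(chan.get())   # explicit receive

# SHMEM: one-sided communication -- the producer writes directly into the
# consumer's buffer, then signals completion (cf. a put followed by a fence).
peer_buf = [0]
put_done = threading.Event()

def shmem_producer():
    peer_buf[0] = 42         # one-sided "put" into the target's memory
    put_done.set()

def shmem_consumer(out):
    put_done.wait()
    out.append(peer_buf[0])

for prod, cons in [(cc_sas_producer, cc_sas_consumer),
                   (mp_producer, mp_consumer),
                   (shmem_producer, shmem_consumer)]:
    c = threading.Thread(target=cons, args=(results,))
    p = threading.Thread(target=prod)
    c.start(); p.start(); p.join(); c.join()

print(results)  # [42, 42, 42]
```

Note that only MP involves the consumer in moving the data; in CC-SAS and SHMEM the transfer itself is one-sided, which foreshadows the performance results below.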

Page 3:

Platforms:

Tightly-coupled multiprocessors– the SGI Origin2000, a hardware cache-coherent distributed shared memory machine

Less tightly-coupled clusters– a cluster of workstations connected by Ethernet

Page 4:

Purpose

Compare the three programming models on the Origin2000, a modern 64-processor hardware cache-coherent machine. The focus is on scientific applications that access data regularly or predictably.

Page 5:

Questions to be answered

Can parallel algorithms be structured in the same way for good performance in all three models?

If there are substantial performance differences among the three models, where are the key bottlenecks?

Do the data structures or algorithms need to change substantially to remove those bottlenecks?

Page 6:

Applications and Algorithms

FFT– all-to-all communication (regular)

Ocean– nearest-neighbor communication

Radix– all-to-all communication (irregular)

LU– one-to-many communication

Page 7:

Performance Results

Page 8:

Question:

Why is MP so much worse than CC-SAS and SHMEM?

Page 9:

Analysis:

Execution time = BUSY + LMEM + RMEM + SYNC

where

BUSY: CPU computation time

LMEM: CPU stall time on local cache misses

RMEM: CPU stall time for sending/receiving remote data

SYNC: CPU time spent at synchronization events
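As a worked example of this decomposition (the timings below are made up, purely for illustration), the four components can be turned into percentage shares to locate the dominant cost:

```python
# Hypothetical per-process timings (seconds), following the decomposition
# Execution time = BUSY + LMEM + RMEM + SYNC from the slide above.
components = {"BUSY": 6.0, "LMEM": 1.5, "RMEM": 2.0, "SYNC": 0.5}

total = sum(components.values())
shares = {name: 100.0 * t / total for name, t in components.items()}

print(f"total = {total}s")            # total = 10.0s
for name, pct in shares.items():
    print(f"{name}: {pct:.0f}%")      # BUSY: 60%, LMEM: 15%, ...
```

Breaking the time down this way is what lets the next slides attribute MP's slowdown to specific components (RMEM and SYNC) rather than to computation.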

Page 10:

Where does the time go in MP?

Page 11:

Improving MP performance

Remove the extra data copy– allocate all data involved in communication in the shared address space

Reduce SYNC time– use lock-free queue management in communication instead of lock-based queues
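A minimal sketch of the lock-free queue idea, assuming a single producer and a single consumer per queue (the class and names are illustrative, not the paper's implementation). Each index is written by exactly one side, so no lock is needed:

```python
class SPSCQueue:
    """Single-producer single-consumer ring buffer.

    The producer only advances `head`; the consumer only advances `tail`.
    Since each index has one writer, pushes and pops need no lock.
    (A real C implementation would also need memory fences; Python's
    interpreter lock hides that concern in this sketch.)
    """

    def __init__(self, capacity):
        # One slot is left unused so that full and empty are distinguishable.
        self.buf = [None] * (capacity + 1)
        self.head = 0   # next write slot (producer-owned)
        self.tail = 0   # next read slot (consumer-owned)

    def push(self, item):
        nxt = (self.head + 1) % len(self.buf)
        if nxt == self.tail:
            return False            # queue full; producer retries later
        self.buf[self.head] = item  # write the payload first...
        self.head = nxt             # ...then publish it by moving head
        return True

    def pop(self):
        if self.tail == self.head:
            return None             # queue empty
        item = self.buf[self.tail]
        self.tail = (self.tail + 1) % len(self.buf)
        return item

q = SPSCQueue(4)
for i in range(3):
    q.push(i)
out = [q.pop(), q.pop(), q.pop()]
print(out)  # [0, 1, 2]
```

Replacing a locked queue with this structure removes the SYNC cost of acquiring and releasing a lock on every message, which is the bottleneck the slide identifies.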

Page 12:

Speedups under Improved MP

Page 13:

Why does CC-SAS perform best?

Page 14:

Why does CC-SAS perform best?

Extra packing/unpacking operations in MP and SHMEM

Extra packet queue management in MP …
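The packing cost is easiest to see on a strided access pattern (an illustrative example, not from the paper): to send a matrix column, MP and SHMEM gather the strided elements into a contiguous buffer and copy them out again on receipt, while a CC-SAS reader simply loads the same elements in place.

```python
# A row-major matrix stored flat: element (r, c) lives at index r*cols + c.
rows, cols = 4, 3
matrix = list(range(rows * cols))

def pack_column(mat, c):
    # Extra traversal + copy into a contiguous send buffer.
    return [mat[r * cols + c] for r in range(rows)]

def unpack_column(buf, mat, c):
    # Second copy on the receiving side.
    for r, v in enumerate(buf):
        mat[r * cols + c] = v

# MP/SHMEM style: pack, "send", unpack.
col1 = pack_column(matrix, 1)
recv = [0] * (rows * cols)
unpack_column(col1, recv, 1)
print(col1)  # [1, 4, 7, 10]

# CC-SAS style: read the column directly, no intermediate buffer.
direct = [matrix[r * cols + 1] for r in range(rows)]
assert direct == col1
```

The two extra copies (pack and unpack) are pure overhead that the shared-address-space version never pays, which is one reason CC-SAS comes out ahead here.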

Page 15:

Speedups for Ocean

Page 16:

Speedups for Radix

Page 17:

Speedups for LU

Page 18:

Conclusions

Good algorithm structures are portable among programming models.

MP performs much worse than CC-SAS and SHMEM on a hardware-coherent machine. However, it can achieve similar performance once the extra data copies and queue synchronization are properly addressed.

Something about programmability

Page 19:

Future work

What about applications that truly have irregular, unpredictable, and naturally fine-grained data access and communication patterns?

What about software-based coherent machines (i.e., clusters)?