SHMEM Programming Model

SHMEM Programming Model Hung-Hsun Su UPC Group, HCS lab 1/23/2004


Page 1: SHMEM Programming Model

SHMEM Programming Model

Hung-Hsun Su

UPC Group, HCS lab

1/23/2004

Page 2: SHMEM Programming Model

Outline

1. Background

2. Nuts and Bolts

3. GPSHMEM

4. Performance

5. Conclusion

6. Reference

Page 3: SHMEM Programming Model

Background: What is SHMEM?

SHared MEMory library
  Based on the SPMD model
  Available for C / Fortran

Hybrid message passing / shared memory programming model
  Message passing like
    Explicit communication, replication and synchronization
    Specification of the remote data location (processor ID) is required
  Shared memory like
    Provides a logically shared memory system view
    Communication requires a processor on one side only

Allows any processing element (PE) to access memory on a remote PE without involving the microprocessor on the remote PE (put / get)

Non-blocking data transfer

Page 4: SHMEM Programming Model

Background: What is SHMEM?

Must know the address of a variable on the remote processor for a transfer (the address is the same on all PEs)

Remotely accessible data objects (symmetric variables)
  Global variables
  Local static variables
  Variables in common blocks
  Fortran variables modified by a !DIR$ SYMMETRIC directive
  C variables modified by a #pragma symmetric directive

[Figure: PE X and PE Y each have private memory and remotely accessible memory. The variables int x and float y sit at the SAME ADDRESS in the remotely accessible memory of both PEs.]
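The "same address on all PEs" idea can be sketched in plain C without any SHMEM library. This is a single-process model only: `NPES`, `struct symmetric_vars`, `region` and `sim_put` are hypothetical names invented for this sketch, and a real put moves data between separate processes rather than between array slots.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NPES 2   /* number of simulated PEs (assumption for this sketch) */

/* Every PE holds the same symmetric variables in the same layout. */
struct symmetric_vars {
    int   x;
    float y;
};

/* One copy of the remotely accessible region per simulated PE. */
static struct symmetric_vars region[NPES];

/* Hypothetical stand-in for a put. The caller passes the address of ITS OWN
 * copy of the variable; the offset of that address inside the local region
 * is reused to locate the matching variable on the target PE. */
static void sim_put(int my_pe, void *local_addr, const void *src,
                    size_t nbytes, int target_pe)
{
    assert(my_pe >= 0 && my_pe < NPES);
    assert(target_pe >= 0 && target_pe < NPES);

    size_t offset = (size_t)((char *)local_addr - (char *)&region[my_pe]);
    assert(offset + nbytes <= sizeof(struct symmetric_vars));

    /* Only the target PE's copy changes; the caller's copy is untouched. */
    memcpy((char *)&region[target_pe] + offset, src, nbytes);
}
```

Because the layout is identical everywhere, the sender never needs to ask the target for an address; its own address for `x` is enough.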

Page 5: SHMEM Programming Model

Background: Why program in SHMEM?

Easier to program in than MPI / PVM
Low latency, high bandwidth data transfer
  Puts
  Gets
Provides efficient collective communication
  Gather / Scatter
  All-to-all
  Broadcast
  Reductions
Provides mechanisms to implement mutual exclusion
  Atomic swap
  Locking
Provides synchronization mechanisms
  Barrier
  Fence, Quiet

Page 6: SHMEM Programming Model

Background: Supported Platforms

SHMEM
  Cray T3D, T3E, PVP
  SGI Irix, Origin
  Compaq SC
  IBM SP
  Quadrics Linux Cluster
  SCI (?)

GPSHMEM (Version 1.0)
  IBM SP
  SGI Origin
  Cray J90, T3E
  Unix/Linux
  Windows NT
  Myrinet (?)

Page 7: SHMEM Programming Model

Nuts & Bolts: Initialization

Include header shmem.h / shmem.fh to access the library
shmem_init() – Initializes SHMEM
my_pe() – Gets the PE ID of the local processor
num_pes() – Gets the total number of PEs in the system

    #include <stdio.h>
    #include <stdlib.h>
    #include "shmem.h"

    int main(int argc, char **argv)
    {
        /* Named me / npes so they do not shadow my_pe() and num_pes() */
        int me, npes;

        shmem_init();
        me = my_pe();
        npes = num_pes();
        printf("Hello World from process %d of %d\n", me, npes);
        exit(0);
    }

Page 8: SHMEM Programming Model

Nuts & Bolts: Data Transfer

Put
  Specific variable
    void shmem_TYPE_p(TYPE *addr, TYPE value, int pe)
      TYPE = double, float, int, long, short
  Contiguous object
    void shmem_put(void *target, const void *source, size_t len, int pe)
    void shmem_TYPE_put(TYPE *target, const TYPE *source, size_t len, int pe)
      TYPE = double, float, int, long, longdouble, longlong, short
    void shmem_putSS(void *target, const void *source, size_t len, int pe)
      Storage Size (SS) = 32, 64 (default), 128, mem (any size)
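For the sized variants (SS = 32, 64, 128), `len` counts elements of that size rather than bytes, as described in the Cray and OpenSHMEM documentation; the array-copy example later in the deck computes its length the same way. A small sketch of that conversion follows. `elems_for` is a hypothetical helper written for illustration, not part of any SHMEM library.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: number of SS-bit elements needed to cover nbytes.
 * Returns 0 when nbytes is not a whole multiple of the element size,
 * because the sized routines move whole elements only. */
static size_t elems_for(size_t nbytes, unsigned ss_bits)
{
    assert(ss_bits == 32 || ss_bits == 64 || ss_bits == 128);

    size_t elem_bytes = ss_bits / 8;
    return (nbytes % elem_bytes == 0) ? nbytes / elem_bytes : 0;
}
```

With this helper, an array of 8 doubles would be passed as `elems_for(sizeof a, 64)` elements to a 64-bit put, while the "mem" variant would take `sizeof a` directly.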

Page 9: SHMEM Programming Model

Nuts & Bolts: Data Transfer

Get
  Specific variable
    TYPE shmem_TYPE_g(TYPE *addr, int pe)
      Returns the value read from the remote PE
      TYPE = double, float, int, long, short
  Contiguous object
    void shmem_get(void *target, const void *source, size_t len, int pe)
    void shmem_TYPE_get(TYPE *target, const TYPE *source, size_t len, int pe)
      TYPE = double, float, int, long, longdouble, longlong, short
    void shmem_getSS(void *target, const void *source, size_t len, int pe)
      Storage Size (SS) = 32, 64 (default), 128, mem (any size)

Page 10: SHMEM Programming Model

Nuts & Bolts: Collective Communication

Broadcast
  void shmem_broadcast(void *target, void *source, int nlong, int PE_root, int PE_start, int PE_group, int PE_size, long *pSync)
  One-to-all communication

Collection
  void shmem_collect(void *target, void *source, int nlong, int PE_start, int PE_group, int PE_size, long *pSync)
  void shmem_fcollect(void *target, void *source, int nlong, int PE_start, int PE_group, int PE_size, long *pSync)
  Concatenates data items from the source array into the target array over the defined set of PEs. The resultant target array consists of the contribution from the first PE in the set, followed by the contribution from the second PE, and so on.

pSync - symmetric work array. Every element of this array must be initialized with the value _SHMEM_SYNC_VALUE before any of the PEs in the active set enter the routine. Used to prevent overlapping collective communications.
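The concatenation performed by a collect can be modelled in a single process. The sketch below assumes that the middle field of the (PE_start, PE_group, PE_size) triple is a log2 stride, which is how the Cray man pages define it under the name logPE_stride; the slide does not say so. `sim_fcollect`, `active_pe`, `NPES` and `NLONG` are hypothetical names, and pSync is left out because nothing runs concurrently here.

```c
#include <assert.h>
#include <string.h>

#define NPES  8   /* simulated PEs (assumption for this sketch) */
#define NLONG 2   /* elements contributed by each PE */

static long source[NPES][NLONG];          /* per-PE source arrays */
static long target[NPES][NPES * NLONG];   /* per-PE target arrays */

/* i-th member of the active set: pe_start, pe_start + stride, ...
 * where stride = 2^log_stride (assumed meaning of the middle field). */
static int active_pe(int pe_start, int log_stride, int i)
{
    return pe_start + i * (1 << log_stride);
}

/* Hypothetical model of a fixed-size collect: every PE in the active set
 * ends up with all contributions concatenated in active-set order. */
static void sim_fcollect(int pe_start, int log_stride, int pe_size)
{
    assert(pe_start >= 0 && pe_size >= 1);
    assert(active_pe(pe_start, log_stride, pe_size - 1) < NPES);

    for (int r = 0; r < pe_size; r++) {          /* each receiving PE */
        int recv = active_pe(pe_start, log_stride, r);
        for (int s = 0; s < pe_size; s++) {      /* each contributing PE */
            int send = active_pe(pe_start, log_stride, s);
            memcpy(&target[recv][s * NLONG], source[send],
                   NLONG * sizeof(long));
        }
    }
}
```

Calling `sim_fcollect(1, 1, 3)` selects PEs 1, 3 and 5; PEs outside the set neither contribute nor receive.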

Page 11: SHMEM Programming Model

Nuts & Bolts: Synchronization

Barrier
  void shmem_barrier_all(void)
    Suspends all operations until all PEs call this function
  void shmem_barrier(int PE_start, int PE_group, int PE_size, long *pSync)
    Barrier operation on a subset of PEs

Wait
  Suspends until a remote PE writes a value NOT equal to the one specified
  void shmem_wait(long *var, long value)
  void shmem_TYPE_wait(TYPE *var, TYPE value)
    TYPE = int, long, longlong, short

Conditional Wait
  Same as wait except the comparison can now be >=, >, =, !=, <, <=
  void shmem_wait_until(long *var, int cond, long value)
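The relationship between wait and conditional wait can be shown with a small single-process model. The comparison codes and the functions `sim_cond_holds`, `sim_wait_done` and `sim_wait_until` are hypothetical; a real SHMEM header defines its own constants for `cond`, and a real wait blocks on writes from another PE instead of replaying a list.

```c
#include <assert.h>

/* Hypothetical comparison codes for this sketch only. */
enum sim_cmp { SIM_EQ, SIM_NE, SIM_GT, SIM_GE, SIM_LT, SIM_LE };

/* True when the waiting PE may stop waiting. */
static int sim_cond_holds(long var, enum sim_cmp cond, long value)
{
    switch (cond) {
    case SIM_EQ: return var == value;
    case SIM_NE: return var != value;
    case SIM_GT: return var >  value;
    case SIM_GE: return var >= value;
    case SIM_LT: return var <  value;
    case SIM_LE: return var <= value;
    }
    assert(0 && "unknown comparison");
    return 0;
}

/* Plain wait is the special case "wait until var != value". */
static int sim_wait_done(long var, long value)
{
    return sim_cond_holds(var, SIM_NE, value);
}

/* Replays a sequence of simulated remote writes to *var. Returns how many
 * writes were applied before the condition held, or -1 if it never did. */
static int sim_wait_until(long *var, enum sim_cmp cond, long value,
                          const long *writes, int nwrites)
{
    for (int i = 0; ; i++) {
        if (sim_cond_holds(*var, cond, value))
            return i;
        if (i == nwrites)
            return -1;
        *var = writes[i];   /* the next "remote" update arrives */
    }
}
```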

Page 12: SHMEM Programming Model

Nuts & Bolts: Synchronization

Fence
  All put operations issued to a particular PE prior to the call are guaranteed to be delivered before any subsequent remote write operation to the same PE that follows the call
  Ensures ordering of remote write (put) operations

Quiet
  Waits for completion of all outstanding remote writes initiated from the calling PE
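The difference between the two calls can be illustrated with a toy delivery queue: fence only constrains the order in which puts to the same PE may land, while quiet drains everything. This is a single-process model under strong simplifying assumptions; `sim_put_nb`, `sim_fence`, `sim_deliver` and `sim_quiet` are hypothetical names, and real hardware decides delivery order on its own.

```c
#include <assert.h>

#define NPES    2    /* simulated PEs (assumption) */
#define MAXPEND 16   /* capacity of the pending-put queue */

static long mem[NPES];   /* one symmetric word per simulated PE */

struct pending { int pe; long value; int epoch; int done; };
static struct pending q[MAXPEND];
static int nq;
static int epoch[NPES];  /* advanced by every fence */

/* Non-blocking put: recorded now, delivered later. Returns its queue slot. */
static int sim_put_nb(long value, int pe)
{
    assert(nq < MAXPEND && pe >= 0 && pe < NPES);
    q[nq] = (struct pending){ pe, value, epoch[pe], 0 };
    return nq++;
}

/* Fence: later puts to a PE may not overtake earlier puts to that PE. */
static void sim_fence(void)
{
    for (int pe = 0; pe < NPES; pe++)
        epoch[pe]++;
}

/* The "network" tries to deliver put i. It is refused (returns 0) while a
 * put to the same PE from before an intervening fence is still pending. */
static int sim_deliver(int i)
{
    assert(i >= 0 && i < nq);
    if (q[i].done)
        return 1;
    for (int j = 0; j < nq; j++)
        if (!q[j].done && q[j].pe == q[i].pe && q[j].epoch < q[i].epoch)
            return 0;
    mem[q[i].pe] = q[i].value;
    q[i].done = 1;
    return 1;
}

/* Quiet: returns only after every outstanding put has been delivered.
 * Queue order is also epoch order, so in-order delivery never stalls. */
static void sim_quiet(void)
{
    for (int i = 0; i < nq; i++)
        (void)sim_deliver(i);
    nq = 0;
}
```

In this model two puts to the same PE with no fence between them may land in either order, which is why a fence is needed whenever the second write must be seen last.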

Page 13: SHMEM Programming Model

Nuts & Bolts: Atomic Operations

Atomic Swap
  Unconditional
    long shmem_swap(long *target, long value, int pe)
  Conditional
    int shmem_int_cswap(int *target, int cond, int value, int pe)

Arithmetic
  add, increment
    int shmem_int_fadd(int *target, int value, int pe)
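A conditional swap is enough to build a simple try-lock, and fetch-and-add hands out unique values. The sketch below models both in one process, where atomicity is trivial; each symmetric variable is represented as an array indexed by PE. `sim_int_cswap`, `sim_int_fadd`, `sim_try_lock` and `sim_unlock` are hypothetical stand-ins that only mirror the argument order shown on the slide.

```c
#include <assert.h>

#define NPES 4   /* simulated PEs (assumption for this sketch) */

static int lock_word[NPES];   /* symmetric lock variable, 0 = free */
static int counter[NPES];     /* symmetric counter */

/* Model of a conditional swap on PE pe: store value only if the current
 * content equals cond. The old content is returned either way. */
static int sim_int_cswap(int *sym, int cond, int value, int pe)
{
    assert(pe >= 0 && pe < NPES);
    int old = sym[pe];
    if (old == cond)
        sym[pe] = value;
    return old;
}

/* Model of fetch-and-add on PE pe: add value, return the previous content. */
static int sim_int_fadd(int *sym, int value, int pe)
{
    assert(pe >= 0 && pe < NPES);
    int old = sym[pe];
    sym[pe] += value;
    return old;
}

/* Try to take the lock that lives on PE 0. The holder is stored as me + 1
 * so that 0 can mean "free". Returns 1 on success, 0 if already held. */
static int sim_try_lock(int me)
{
    return sim_int_cswap(lock_word, 0, me + 1, 0) == 0;
}

/* Release the lock; only the current holder may do so. */
static void sim_unlock(int me)
{
    int old = sim_int_cswap(lock_word, me + 1, 0, 0);
    assert(old == me + 1);
    (void)old;
}
```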

Page 14: SHMEM Programming Model

Nuts & Bolts: Collective Reduction

Collective logical operations
  and, or, xor
  void shmem_int_and_to_all(int *target, int *source, int nreduce, int PE_start, int PE_group, int PE_size, int *pWrk, long *pSync)

Collective comparison operations
  max, min
  void shmem_double_max_to_all(double *target, double *source, int nreduce, int PE_start, int PE_group, int PE_size, double *pWrk, long *pSync)

Collective arithmetic operations
  product, sum
  void shmem_double_prod_to_all(double *target, double *source, int nreduce, int PE_start, int PE_group, int PE_size, double *pWrk, long *pSync)
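The result of a reduction is element-wise: element i of every target holds the combination of element i from each PE in the active set. A minimal single-process sketch of a sum reduction follows, assuming again that the middle field of the PE triple is a log2 stride. `sim_double_sum_to_all` is a hypothetical model; pWrk and pSync are omitted because no real communication takes place.

```c
#include <assert.h>

#define NPES    4   /* simulated PEs (assumption for this sketch) */
#define NREDUCE 3   /* elements reduced per PE */

static double source[NPES][NREDUCE];   /* per-PE source arrays */
static double target[NPES][NREDUCE];   /* per-PE target arrays */

/* Hypothetical model of a sum reduction over the active set
 * pe_start, pe_start + 2^log_stride, ... (pe_size members). */
static void sim_double_sum_to_all(int nreduce, int pe_start,
                                  int log_stride, int pe_size)
{
    int stride = 1 << log_stride;

    assert(nreduce >= 0 && nreduce <= NREDUCE);
    assert(pe_start >= 0 && pe_size >= 1);
    assert(pe_start + (pe_size - 1) * stride < NPES);

    /* Combine element i across all members of the active set. */
    double sum[NREDUCE] = {0};
    for (int k = 0; k < pe_size; k++)
        for (int i = 0; i < nreduce; i++)
            sum[i] += source[pe_start + k * stride][i];

    /* Every member of the active set receives the full result. */
    for (int k = 0; k < pe_size; k++)
        for (int i = 0; i < nreduce; i++)
            target[pe_start + k * stride][i] = sum[i];
}
```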

Page 15: SHMEM Programming Model

Nuts & Bolts: Other

Address Manipulation
  shmem_ptr - Returns a pointer to a data object on a remote PE

Cache Control
  shmem_clear_cache_inv - Disables automatic cache coherency mode
  shmem_set_cache_inv - Enables automatic cache coherency mode
  shmem_set_cache_line_inv - Enables automatic line cache coherency mode
  shmem_udcflush - Makes the entire user data cache coherent
  shmem_udcflush_line - Makes a cache line coherent

Page 16: SHMEM Programming Model

Nuts & Bolts: Example (Array copy)

    #include <stdio.h>
    #include <mpp/shmem.h>
    #include <intrinsics.h>

    int me, npes, i;
    int source[8], dest[8];   /* global, hence symmetric */

    int main(void)
    {
        /* Get PE information */
        me = _my_pe();
        npes = _num_pes();

        /* Initialize and send on PE 1 */
        if (me == 1) {
            for (i = 0; i < 8; i++)
                source[i] = i + 1;
            /* Length is given in 64-bit words: total bytes / 8 */
            shmem_put64(dest, source, 8*sizeof(dest[0])/8, 0);
        }

        /* Make sure the transfer is complete */
        shmem_barrier_all();

        /* Print from the receiving PE */
        if (me == 0) {
            _shmem_udcflush();
            printf(" DEST ON PE 0:");
            for (i = 0; i < 8; i++)
                printf(" %d%c", dest[i], (i < 7) ? ',' : '\n');
        }
        return 0;
    }

Page 17: SHMEM Programming Model

GPSHMEM

Ames Lab / Pacific Northwest National Lab collaborative project

Communication library like SHMEM library, but tries to achieve full portability

Mostly the T3D components with some “extensions” of functionality

Research Quality at this point

ARMCI = A Portable Remote Memory Copy Library for Distributed Array Libraries and Compiler Run-time Systems

Page 18: SHMEM Programming Model

Performance – Latency (Origin 2000)

Page 19: SHMEM Programming Model

Performance – Latency (T3E 600)

Page 20: SHMEM Programming Model

Performance – Bandwidth

Taken from http://infm.cineca.it/documenti/incontro_infm/comunicazio/sld015.htm

Page 21: SHMEM Programming Model

Performance – Bandwidth

Page 22: SHMEM Programming Model

Performance - Broadcast

Page 23: SHMEM Programming Model

Performance – All to all

Page 24: SHMEM Programming Model

Performance – Ocean

On SGI Origin 2000

Page 25: SHMEM Programming Model

Performance – Radix

On SGI Origin 2000

Page 26: SHMEM Programming Model

Conclusion

Hybrid MP / Shared Memory programming model

Compared to MP
  Pro
    Easier to use
    Lower latency, higher bandwidth communication
    More scalable (within limits)
    Remote CPU not interrupted during transfer
  Con
    Limited platform support (as of now)

Page 27: SHMEM Programming Model

Reference

1. Ricky A. Kendall et al., GPSHMEM and other Parallel Programming Models, PowerPoint presentation

2. Hongzhang Shan and Jaswinder Pal Singh, A Comparison of MPI, SHMEM and Cache-coherent Shared Address Space Programming Models on the SGI Origin2000 http://citeseer.nj.nec.com/rd/48418321%2C296348%2C1%2C0.25%2CDownload/http://citeseer.nj.nec.com/cache/papers/cs/14068/http:zSzzSzwww.cs.princeton.eduzSz%7EshzzSzpaperszSzics99.pdf/a-comparison-of-mpi.pdf

3. Quadrics SHMEM Programming Manual http://www.psc.edu/~oneal/compaq/ShmemMan.pdf

4. Karl Feind, Shared Memory Access (SHMEM) Routines

5. Glenn Luecke et al., The Performance and Scalability of SHMEM and MPI-2 One-Sided Routines on a SGI Origin 2000 and a Cray T3E-600 http://dsg.port.ac.uk/Journals/PEMCS/papers/paper19.pdf

6. Patrick H. Worley, CCSM Component Performance Benchmarking and Status of the CRAY X1 at ORNL http://www.csm.ornl.gov/~worley/talks/index.html