Unified Parallel C (UPC)
Costin Iancu
The Berkeley UPC Group: C. Bell, D. Bonachea, W. Chen, J. Duell,
P. Hargrove, P. Husbands, C. Iancu, R. Nishtala, M. Welcome, K. Yelick
http://upc.lbl.gov
Slides edited by K. Yelick, T. El-Ghazawi, P. Husbands, C. Iancu
Context
• Most parallel programs are written using either:
  – Message passing with a SPMD model (MPI)
    • Usually for scientific applications with C++/Fortran
    • Scales easily
  – Shared memory with threads in OpenMP, Threads+C/C++/Fortran, or Java
    • Usually for non-scientific applications
    • Easier to program, but less scalable performance
• Partitioned Global Address Space (PGAS) languages take the best of both:
  – SPMD parallelism like MPI (performance)
  – Local/global distinction, i.e., layout matters (performance)
  – Global address space like threads (programmability)
• 3 current languages: UPC (C), CAF (Fortran), and Titanium (Java)
• 3 new languages: Chapel, Fortress, X10
Partitioned Global Address Space
• Shared memory is logically partitioned by processors
• Remote memory may stay remote: no automatic caching implied
• One-sided communication: reads/writes of shared variables
• Both individual and bulk memory copies
• Some models have a separate private memory area
• Models differ in distributed array generality and in how arrays are constructed

[Figure: global address space with a shared region per thread holding X[0], X[1], ..., X[P], and a private region per thread (Thread0 ... Threadn) holding a local ptr]
Partitioned Global Address Space Languages
• Explicitly parallel programming model with SPMD parallelism
  – Fixed at program start-up, typically one thread per processor
• Global address space model of memory
  – Allows programmer to directly represent distributed data structures
• Address space is logically partitioned
  – Local vs. remote memory (two-level hierarchy)
• Programmer control over performance-critical decisions
  – Data layout and communication
• Performance transparency and tunability are goals
  – Initial implementation can use fine-grained shared memory
Current Implementations
• A successful language/library must run everywhere
• UPC
  – Commercial compilers: Cray, SGI, HP, IBM
  – Open source compilers: LBNL/UCB (source-to-source), Intrepid (gcc)
• CAF
  – Commercial compilers: Cray
  – Open source compilers: Rice (source-to-source)
• Titanium
  – Open source compilers: UCB (source-to-source)
• Common tools
  – Open64 open source research compiler infrastructure
  – ARMCI, GASNet for distributed memory implementations
  – Pthreads, System V shared memory
Talk Overview
• UPC Language Design
  – Data distribution (layout, memory management)
  – Work distribution (data parallelism)
  – Communication (implicit, explicit, collective operations)
  – Synchronization (memory model, locks)
• Programming in UPC
  – Performance (one-sided communication)
  – Application examples: FFT, LU
  – Productivity (compiler support)
  – Performance tuning and modeling
UPC Overview and Design
• Unified Parallel C (UPC) is:
  – An explicit parallel extension of ANSI C, with common and familiar syntax and semantics for parallel C and simple extensions to ANSI C
  – A partitioned global address space (PGAS) language
  – Based on ideas in Split-C, AC, and PCP
• Similar to the C language philosophy
  – Programmers are clever and careful, and may need to get close to the hardware
    • to get performance, but
    • can get in trouble
• SPMD execution model: (THREADS, MYTHREAD), static vs. dynamic threads
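As a concrete reminder of the SPMD model, a minimal illustrative program (not from the slides) might look as follows; every thread executes main() and can query MYTHREAD and THREADS:

  /* Minimal SPMD sketch: every UPC thread runs main(). */
  #include <upc.h>
  #include <stdio.h>

  int main(void) {
      /* MYTHREAD and THREADS are built-in; the thread count is fixed at
         startup (static threads) or at job-launch time (dynamic threads). */
      printf("Hello from thread %d of %d\n", MYTHREAD, THREADS);
      upc_barrier;          /* global synchronization across all threads */
      return 0;
  }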
Data Distribution
Data Distribution
[Figure: shared global address space holding ours and the distributed array X[0], X[1], ..., X[P]; each thread (Thread0 ... Threadn) also has a private region holding mine and a local ptr]

• Distinction between memory spaces through extensions of the type system (shared qualifier):

  shared int ours;
  shared int X[THREADS];
  shared int *ptr;
  int mine;

• Data in shared address space:
  – Static: scalars (with affinity to thread 0), distributed arrays
  – Dynamic: dynamic memory management (upc_alloc, upc_global_alloc, upc_all_alloc)
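As an illustrative sketch of the three dynamic allocators (block sizes and variable names are made up; the signatures follow the UPC 1.2 library):

  #include <upc.h>

  shared [10] int *a, *b, *c;

  void alloc_examples(void) {
      /* Collective: called by all threads; every thread receives a pointer
         to the same distributed region of 10 blocks of 10 ints. */
      a = (shared [10] int *) upc_all_alloc(10, 10 * sizeof(int));

      /* Non-collective: a single thread allocates a distributed region. */
      if (MYTHREAD == 0)
          b = (shared [10] int *) upc_global_alloc(10, 10 * sizeof(int));

      /* Shared memory that lives entirely in the calling thread's partition. */
      c = (shared [10] int *) upc_alloc(10 * sizeof(int));
  }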
Data Layout
• Data layout is controlled through extensions of the type system (layout specifiers):
  – [0] or [] (indefinite layout, all on one thread):
      shared [] int *p;
  – Empty (cyclic layout):
      shared int array[THREADS*M];
  – [*] (blocked layout):
      shared [*] int array[THREADS*M];
  – [b] or [b1][b2]…[bn] = [b1*b2*…*bn] (block-cyclic):
      shared [B] int array[THREADS*M];
• Element array[i] has affinity with thread (i / B) % THREADS
• Layout determines pointer arithmetic rules
• Introspection: upc_threadof, upc_phaseof, upc_blocksize
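For example (an illustrative calculation, not from the slides): with THREADS = 4 and block size B = 3, element array[7] lies in block 7/3 = 2, so it has affinity with thread (7/3) % 4 = 2 and sits at phase 7 % 3 = 1 within that block.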
UPC Pointers Implementation

• In UPC, pointers to shared objects have three fields:
  – thread number
  – local address of the block
  – phase (specifies the position in the block)
• Example: the Cray T3E implementation packs all three into one 64-bit word:

    Phase (bits 63-49) | Thread (bits 48-38) | Virtual Address (bits 37-0)

• Pointer arithmetic can be expensive in UPC
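Conceptually, a pointer-to-shared behaves like the following "fat" struct (an illustrative sketch only; field names are invented, and real compilers may pack the fields into a single word as on the T3E):

  typedef struct {
      unsigned long long addr;   /* local virtual address of the block      */
      unsigned int       thread; /* owning thread number                    */
      unsigned int       phase;  /* element offset within the current block */
  } upc_shared_ptr_sketch;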
UPC Pointers
                        Where does the pointer point?
                          Local         Shared
Where does    Private     PP (p1)       PS (p2)
the pointer
reside?       Shared      SP (p3)       SS (p4)

int *p1;                /* private pointer to local memory   */
shared int *p2;         /* private pointer to shared space   */
int *shared p3;         /* shared pointer to local memory    */
shared int *shared p4;  /* shared pointer to shared space    */
UPC Pointers
int *p1;                /* private pointer to local memory   */
shared int *p2;         /* private pointer to shared space   */
int *shared p3;         /* shared pointer to local memory    */
shared int *shared p4;  /* shared pointer to shared space    */

[Figure: each thread (Thread0 ... Threadn) holds private pointers p1 and p2; the shared pointers p3 and p4 live in the shared global address space; any of them may point at local or remote data]

Pointers to shared often require more storage and are more costly to dereference; they may refer to local or remote memory.
Common Uses for UPC Pointer Types
int *p1;
  • These pointers are fast (just like C pointers)
  • Use to access local data in the part of the code performing local work
  • Often cast a pointer-to-shared to one of these to get faster access to shared data that is local

shared int *p2;
  • Use to refer to remote data
  • Larger and slower due to the test-for-local plus possible communication

int *shared p3;
  • Not recommended

shared int *shared p4;
  • Use to build shared linked structures, e.g., a linked list

• typedef is the UPC programmer's best friend
UPC Pointers Usage Rules
• Pointer arithmetic supports blocked and non-blocked array distributions
• Casting of shared to private pointers is allowed, but not vice versa!
• When casting a pointer-to-shared to a pointer-to-local, the thread number of the pointer to shared may be lost
• Casting of shared to local is well defined only if the object pointed to by the pointer to shared has affinity with the thread performing the cast
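The cast-to-local idiom implied by these rules might look like the following sketch (the array name and block size are illustrative):

  #include <upc.h>

  shared [10] double grid[10*THREADS];

  void scale_my_block(double s) {
      /* grid[MYTHREAD*10] has affinity with this thread, so the cast below
         is well defined; the resulting plain C pointer is fast to use. */
      double *mine = (double *) &grid[MYTHREAD * 10];
      for (int i = 0; i < 10; i++)
          mine[i] *= s;
  }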
Work Distribution
Work Distribution: upc_forall()

• Owner-computes rule: loop over all iterations, work on those owned by you
• UPC adds a special type of loop:

  upc_forall(init; test; step; affinity)
      statement;

• The programmer indicates that the iterations are independent
  – Undefined if there are dependencies across threads
• The affinity expression indicates which iterations run on each thread. It may have one of two types:
  – Integer: run where affinity % THREADS == MYTHREAD
  – Pointer: run where upc_threadof(affinity) == MYTHREAD
• Syntactic sugar for:
    for (i = MYTHREAD; i < N; i += THREADS) ...
  or
    for (i = 0; i < N; i++)
        if (MYTHREAD == i % THREADS) ...
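A small illustrative sketch (not from the slides) using a pointer affinity expression, so that each thread touches only the elements it owns:

  #include <upc_relaxed.h>

  #define N (THREADS * 100)
  shared int v[N];

  void increment_all(void) {
      int i;
      upc_forall (i = 0; i < N; i++; &v[i])   /* affinity: the owner of v[i] */
          v[i] += 1;
      upc_barrier;
  }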
Inter-Processor Communication
Data Communication
• Implicit (assignments through pointers-to-shared):

  shared int *p;
  ...
  *p = 7;

• Explicit (bulk, synchronous) point-to-point: upc_memget, upc_memput, upc_memcpy, upc_memset
• Collective operations (http://www.gwu.edu/~upc/docs/)
  – Data movement: broadcast, scatter, gather, ...
  – Computational: reduce, prefix, ...
  – Interface has synchronization modes:
    • Avoid over-synchronizing (a barrier before/after is the simplest semantics, but may be unnecessary)
    • Data being collected may be read/written by any thread simultaneously
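A hedged sketch of the collective interface, using upc_all_broadcast from <upc_collective.h> (the array names and sizes are made up); the flags argument selects the synchronization mode:

  #include <upc.h>
  #include <upc_collective.h>

  #define NELEMS 4
  shared [] int src[NELEMS];                  /* lives entirely on thread 0 */
  shared [NELEMS] int dst[NELEMS*THREADS];    /* one block per thread       */

  void bcast_example(void) {
      /* ALLSYNC in/out is the simplest (barrier-like) mode; looser modes
         such as UPC_IN_NOSYNC avoid over-synchronizing when it is safe. */
      upc_all_broadcast(dst, src, NELEMS * sizeof(int),
                        UPC_IN_ALLSYNC | UPC_OUT_ALLSYNC);
  }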
Data Communication
• The UPC Language Specification v1.2 does not contain non-blocking communication primitives
• Extensions for non-blocking communication are available in the BUPC (Berkeley UPC) implementation
• UPC v1.2 does not have higher-level primitives for point-to-point communication
• See the BUPC extensions for:
  – scatter, gather
  – VIS (vector/indexed/strided transfers)
• Should non-blocking communication be a first-class language citizen?
Synchronization
Synchronization
• Point-to-point synchronization: locks
  – opaque type: upc_lock_t*
  – dynamically managed: upc_all_lock_alloc, upc_global_lock_alloc
• Global synchronization:
  – Barriers (unaligned): upc_barrier
  – Split-phase barriers:
      upc_notify;   // this thread is ready for the barrier
      ...           // computation unrelated to the barrier
      upc_wait;     // wait for the others to be ready
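A minimal sketch of lock usage for a shared counter (the names are illustrative; the calls are the standard UPC lock API):

  #include <upc.h>

  shared int counter;
  upc_lock_t *lock;    /* each thread holds its own copy of the same pointer */

  void count_me(void) {
      lock = upc_all_lock_alloc();   /* collective: returns the same lock to all threads */

      upc_lock(lock);
      counter++;                     /* critical section on shared data */
      upc_unlock(lock);

      upc_barrier;
  }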
Memory Consistency in UPC
• The consistency model defines the order in which one thread may see another thread's accesses to memory
  – If you write a program with unsynchronized accesses, what happens?
  – Does this work?

      // producer              // consumer
      data = ...;              while (!flag) { };
      flag = 1;                ... = data;   // use the data

• UPC has two types of accesses:
  – Strict: will always appear in order
  – Relaxed: may appear out of order to other threads
• Consistency is associated either with a program scope (file, statement):

      { #pragma upc strict
        flag = 1; }

  or with a type:

      shared strict int flag;
Sample UPC Code
Matrix Multiplication in UPC
• Given two integer matrices A (N×P) and B (P×M), we want to compute C = A × B.
• Entries c_ij in C are computed by the formula:

    c_{ij} = \sum_{l=1}^{P} a_{il} \cdot b_{lj}
Serial C code
#define N 4
#define P 4
#define M 4

int a[N][P] = {1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16}, c[N][M];
int b[P][M] = {0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1};

void main (void) {
  int i, j, l;
  for (i = 0; i < N; i++) {
    for (j = 0; j < M; j++) {
      c[i][j] = 0;
      for (l = 0; l < P; l++)
        c[i][j] += a[i][l]*b[l][j];
    }
  }
}
Domain Decomposition
• A (N × P) is decomposed row-wise into blocks of size (N × P) / THREADS as shown below:

    Thread 0:          elements 0 .. (N*P / THREADS) - 1
    Thread 1:          elements (N*P / THREADS) .. (2*N*P / THREADS) - 1
    ...
    Thread THREADS-1:  elements ((THREADS-1)*N*P) / THREADS .. (THREADS*N*P / THREADS) - 1

• B (P × M) is decomposed column-wise into M / THREADS blocks as shown below:

    Thread 0:          columns 0 .. (M/THREADS) - 1
    ...
    Thread THREADS-1:  columns ((THREADS-1)*M)/THREADS .. M - 1

• Note: N and M are assumed to be multiples of THREADS
• Exploits locality in matrix multiplication
UPC Matrix Multiplication Code

#include <upc_relaxed.h>
#define N 4
#define P 4
#define M 4

// data distribution: a and c are blocked shared matrices
shared [N*P/THREADS] int a[N][P] = {1,..,16}, c[N][M];
shared [M/THREADS] int b[P][M] = {0,1,0,1,…,0,1};

void main (void) {
  int i, j, l;                                 // private variables
  upc_forall (i = 0; i < N; i++; &c[i][0]) {   // work distribution
    for (j = 0; j < M; j++) {
      c[i][j] = 0;
      for (l = 0; l < P; l++)
        c[i][j] += a[i][l]*b[l][j];            // implicit communication
    }
  }
}
UPC Matrix Multiplication With Block Copy
#include <upc_relaxed.h>

// a and c are blocked shared matrices
shared [N*P/THREADS] int a[N][P], c[N][M];
shared [M/THREADS] int b[P][M];
int b_local[P][M];

void main (void) {
  int i, j, l;                                 // private variables

  // explicit bulk communication
  upc_memget(b_local, b, P*M*sizeof(int));

  // work distribution (c aligned with a??)
  upc_forall (i = 0; i < N; i++; &c[i][0]) {
    for (j = 0; j < M; j++) {
      c[i][j] = 0;
      for (l = 0; l < P; l++)
        c[i][j] += a[i][l]*b_local[l][j];
    }
  }
}
Programming in UPC
Don’t ask yourself what can my compiler do for me, ask yourself what can I do for my compiler!
Principles of Performance Software
To minimize the cost of communication:
• Use the best available communication mechanism on a given machine
• Hide communication by overlapping (programmer, compiler, or runtime)
• Avoid synchronization using data-driven execution (programmer or runtime)
• Tune communication using performance models when they work; search when they don't (programmer or compiler/runtime)
Best Available Communication Mechanism
• Performance is determined by overhead, latency, and bandwidth
• Data transfer (one-sided communication) is often faster than (two-sided) message passing
• Message-passing semantics limit performance:
  – In-order message delivery
  – Message and tag matching
  – Need to acquire information from the remote host processor
  – Synchronization (message receipt) tied to data transfer
One-Sided vs Two-Sided: Theory
• A two-sided message needs to be matched with a receive to identify the memory address where the data should be put
  – Matching can be offloaded to the network interface in networks like Quadrics
  – Match tables need to be downloaded to the interface (from the host)
• A one-sided put/get message can be handled directly by a network interface with RDMA support
  – Avoids interrupting the CPU or storing data from the CPU (preposts)

[Figure: a one-sided put message carries the destination address with its data payload, so the network interface can deposit it directly into memory; a two-sided message carries only a message id, so the host CPU must match it to a receive]
GASNet: Portability and High-Performance

[Figure: 8-byte roundtrip latency in usec (down is good) on Elan3, Elan4, Myrinet, G5/IB, Opteron/IB, SP/Fed, and XT3, comparing MPI ping-pong with GASNet put+sync]

GASNet is better for overhead and latency across machines.
UPC Group; GASNet design by Dan Bonachea
GASNet: Portability and High-Performance

[Figure: flood bandwidth for 4 KB and 2 MB messages as a percentage of hardware peak (up is good) on Elan3, Elan4, Myrinet, G5/IB, Opteron/IB, SP/Fed, and XT3, comparing MPI with GASNet]

GASNet excels at mid-range sizes, which is important for overlap.
GASNet is at least as high (comparable) for large messages.
Joint work with UPC Group; GASNet design by Dan Bonachea
One-Sided vs. Two-Sided: Practice
[Figure: flood bandwidth (MB/s) and GASNet/MPI speedup versus message size, 16 B to 256 KB (up is good), comparing GASNet non-blocking put with MPI flood bandwidth]

• InfiniBand: GASNet vapi-conduit and OSU MVAPICH 0.9.5
• Half power point (N½) differs by one order of magnitude
• This is not a criticism of the implementation!

NERSC Jacquard machine with Opteron processors
Yelick, Hargrove, Bonachea
Overlap
Hide Communication by Overlapping
• A programming model that decouples data transfer and synchronization (init, sync)
• BUPC has several extensions (programmer-controlled):
  – explicit handle based
  – region based
  – implicit handle based
• Examples:
  – 3D FFT (programmer)
  – split-phase optimizations (compiler)
  – automatic overlap (runtime)
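A hedged sketch of the explicit-handle style, assuming the Berkeley UPC async extensions (bupc_memget_async / bupc_waitsync, declared in bupc_extensions.h); the function get_and_compute and its arguments are illustrative:

  #include <upc.h>
  #include <bupc_extensions.h>

  void get_and_compute(double *dst, shared double *src, size_t n) {
      /* start the transfer, keep a handle */
      bupc_handle_t h = bupc_memget_async(dst, src, n * sizeof(double));

      /* ... unrelated local computation overlaps with the transfer ... */

      bupc_waitsync(h);   /* complete the transfer; dst is now safe to read */
  }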
Performing a 3D FFT

• NX x NY x NZ elements spread across P processors
• Will use a 1-dimensional layout in the Z dimension
  – Each processor gets NZ / P "planes" of NX x NY elements per plane

[Figure: 1D partition of the NX x NY x NZ cube along Z; with P = 4, processors p0..p3 each own NZ/P planes]

Bell, Nishtala, Bonachea, Yelick
Performing a 3D FFT (part 2)
• Perform an FFT in all three dimensions
• With the 1D layout, 2 of the 3 dimensions are local, while the last (Z) dimension is distributed

  Step 1: FFTs on the columns (all elements local)
  Step 2: FFTs on the rows (all elements local)
  Step 3: FFTs in the Z dimension (requires communication)

Bell, Nishtala, Bonachea, Yelick
Performing the 3D FFT (part 3)
• Steps 1 and 2 can be performed since all the data is available without communication
• Perform a global transpose of the cube
  – Allows Step 3 to continue

[Figure: transpose of the distributed cube]

Bell, Nishtala, Bonachea, Yelick
Communication Strategies for 3D FFT

• Three approaches:
  – Chunk (all rows with the same destination):
    • Wait for the 2nd-dimension FFTs to finish
    • Minimizes the number of messages
  – Slab (all rows in a single plane with the same destination):
    • Wait for the chunk of rows destined for one processor to finish
    • Overlap with computation
  – Pencil (one row):
    • Send each row as it completes
    • Maximizes overlap and matches the natural layout

Bell, Nishtala, Bonachea, Yelick
NAS FT Variants Performance Summary
• Slab is always best for MPI; the small-message cost is too high
• Pencil is always best for UPC; more overlap

[Figure: best MFlop rates per thread for all NAS FT benchmark versions on Myrinet (64 procs), InfiniBand (256), Elan3 (256 and 512), and Elan4 (256 and 512), comparing Chunk (NAS FT with FFTW), Best MPI (always slabs), and Best UPC (always pencils); a 0.5 TFlops aggregate rate is noted on the chart]
Bisection Bandwidth Limits
• Full bisection bandwidth is (too) expensive
• During an all-to-all communication phase:
  – Effective (per-thread) bandwidth is a fractional share
  – Significantly lower than link bandwidth
  – Use smaller messages mixed with computation to avoid swamping the network

Bell, Nishtala, Bonachea, Yelick
Compiler Optimizations
• The naïve scheme (a blocking call for each load/store) is not good enough
• PRE (partial redundancy elimination) on shared expressions
  – Reduces the amount of unnecessary communication
  – Also applies to UPC shared-pointer arithmetic
• Split-phase communication
  – Hides communication latency through overlapping
• Message coalescing
  – Reduces the number of messages to save startup overhead and achieve better bandwidth

Chen, Iancu, Yelick
Benchmarks
• Gups
  – Random access (read/modify/write) to a distributed array
• Mcop
  – Parallel dynamic programming algorithm
• Sobel
  – Image filter
• Psearch
  – Dynamic load balancing / work stealing
• Barnes-Hut
  – Shared-memory-style code from SPLASH2
• NAS FT/IS
  – Bulk communication
Performance Improvements

[Figure: percent improvement over unoptimized code for the benchmarks above]

Chen, Iancu, Yelick
Data Driven Execution
Data-Driven Execution
• Many algorithms require synchronization with a remote processor. Mechanisms (BUPC extensions):
  – Signaling store: raise a semaphore upon transfer
  – Remote enqueue: put a task in a remote queue
  – Remote execution: floating functions (X10 activities)
• Many algorithms have irregular data dependencies (e.g., LU). Mechanism (BUPC extension):
  – Cooperative multithreading
Matrix Factorization

[Figure: the matrix blocks are 2D block-cyclic distributed; panel factorizations involve communication for pivoting; matrix-matrix multiplication is used for the updates and can be coalesced. The picture marks the completed parts of L and U, the panel being factored, the trailing matrix to be updated, and the blocks A(i,j), A(i,k), A(j,i), A(j,k)]

Husbands, Yelick
Three Strategies for LU Factorization
• Organize in bulk-synchronous phases (ScaLAPACK)
  – Factor a block column, then perform updates
  – Relatively easy to understand/debug, but extra synchronization
• Overlapping phases (HPL)
  – Work associated with one block-column factorization can be overlapped
  – A parameter determines how many (need temp space accordingly)
• Event-driven multithreaded (UPC Linpack)
  – Each thread runs an event-handler loop
  – Tasks: factorization (with pivoting), update trailing matrix, update upper matrix
  – Tasks may suspend (voluntarily) to wait for data, synchronization, etc.
  – Data is moved with remote gets (synchronization built in)
  – Threads must "gang" together for factorizations
  – Scheduling priorities are key to performance and deadlock avoidance

Husbands, Yelick
UPC-HP Linpack Performance
[Figure: GFlop/s for UPC vs. MPI/HPL on Cray X1 (64 and 128 procs), an Opteron cluster (64 procs), and Altix (32 procs), plus UPC vs. ScaLAPACK on 2x4 and 4x4 processor grids]

• Comparable to HPL (numbers from the HPCC database)
• Faster than ScaLAPACK due to less synchronization
• Large-scale runs of the UPC code on Itanium/Quadrics (Thunder):
  – 2.2 TFlops on 512 procs and 4.4 TFlops on 1024 procs

Husbands, Yelick
Performance Tuning
Iancu, Strohmaier
Efficient Use of One-Sided
• Implementations need to be efficient and have scalable performance
• Application-level use of non-blocking communication benefits from new design techniques: finer-grained decompositions and overlap
• Overlap exercises the system in "unexpected" ways
• Prototyping implementations for large-scale systems is a hard problem: networks behave non-linearly and communication scheduling is NP-hard
• Need a methodology for fast prototyping:
  – understand the network/CPU interaction at large scale
Performance Tuning
• Performance is determined by overhead, latency, bandwidth, computational characteristics, and communication topology
• It's all relative: performance characteristics are determined by system load
• Basic principles:
  – Minimize communication overhead
  – Avoid congestion:
    • control the injection rate (end-point)
    • avoid hotspots (end-points, network routes)
• Have to use models
• What kind of questions can a model answer?
Example: Vector-Add
shared double *rdata;
double *ldata, *buf;

/* Version 1: one blocking bulk get, then compute */
upc_memget(buf, rdata, N);
for (i = 0; i < N; i++)
    ldata[i] += buf[i];

/* Version 2: strip-mined non-blocking gets with block size B */
for (i = 0; i < N/B; i++)
    h[i] = upc_memget_nb(buf + i*B, rdata + i*B, B);
for (i = 0; i < N/B; i++) {
    sync(h[i]);
    for (j = 0; j < B; j++)
        ldata[i*B + j] += buf[i*B + j];
}

/* Version 3: pipelined schedule with pipeline depth b */
GET_nb(B0) ... GET_nb(Bb)
GET_nb(Bb+1) ... GET_nb(B2b)
sync(B0); compute(B0)
...
sync(Bb); compute(Bb)
GET_nb(B2b+1) ... GET_nb(B3b)
...
sync(BN); compute(BN)

Which implementation is faster? What are B and b?
Prototyping
• Usual approach: use a time-accurate performance model (applications, automatically tuned collectives)
  – Models (LogP, ...) don't capture important behavior (parallelism, congestion, resource constraints, non-linear behavior)
  – Exhaustive search of the optimization space
  – Validated only at low concurrency (~tens of procs); might break at high concurrency or for torus networks
• Our approach:
  – Use a performance model for the ideal implementation
  – Understand hardware resource constraints and the variation of performance parameters (understand trends, not absolute values)
  – Derive implementation constraints that satisfy both the optimal implementation and the hardware constraints
  – Force implementation parameters to converge towards the optimum
Performance
• Network: bandwidth and overhead
• Application: communication pattern and schedule (congestion), computation

Overhead is determined by message size, communication schedule, and hardware flow of control.
Bandwidth is determined by message size, communication schedule, and fairness of allocation.

Iancu, Strohmaier
Validation

• Understand network behavior in the presence of non-blocking communication (InfiniBand, Elan)
• Develop a performance model for scenarios widely encountered in applications (point-to-point, scatter, gather, all-to-all) and a variety of aggressive optimization techniques (strip mining, pipelining, communication schedule skewing)
• Use both micro-benchmarks and application kernels to validate the approach

Iancu, Strohmaier
Findings
• Can choose optimal values for implementation parameters
• A time-accurate model for an implementation is hard to develop and inaccurate at high concurrency
• The methodology does not require exhaustive search of the optimization space (only point-to-point and the qualitative behavior of gather)
• In practice one can produce templatized implementations for an algorithm and use our approach to determine optimal values: code generation (UPC), automatic tuning of collective operations, application development
• Need to further understand the mathematical and statistical properties
End
Avoid Synchronization: Data-Driven Execution

• Many algorithms require synchronization with a remote processor
• What is the right mechanism in a PGAS model for doing this?
• Is it still one-sided?

Part 3: Event-Driven Execution Models
Mechanisms for Event-Driven Execution
• A "put" operation does a send-side notification
  – Needed for memory consistency model ordering
• Need to signal the remote side on completion
• Strategies:
  – Have the remote side do a "get" (works in some algorithms)
  – Put + strict flag write: do a put, wait for completion, then do another (strict) put
  – Pipelined put + put-flag: works only on ordered networks
  – Signaling put: add a new "store" operation that embeds the signal (a 2nd remote address) into a single message
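A minimal sketch of the "put + strict flag write" strategy, using only standard UPC strict/relaxed semantics (array and function names are illustrative):

  #include <upc_relaxed.h>

  shared [] double data[100];    /* payload, affinity with thread 0          */
  shared strict int flag;        /* strict accesses order the payload writes */

  double local_buf[100];

  void producer(void) {          /* e.g., run on thread 1 */
      upc_memput(data, local_buf, sizeof(local_buf));
      flag = 1;                  /* strict write: signals that the put is done */
  }

  void consumer(void) {          /* e.g., run on thread 0 */
      while (!flag) ;            /* strict read: spin until the producer signals */
      /* data[] is now safe to read */
  }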
![Page 64: Unified Parallel C (UPC) Costin Iancu The Berkeley UPC Group: C. Bell, D. Bonachea, W. Chen, J. Duell, P. Hargrove, P. Husbands, C. Iancu, R. Nishtala,](https://reader036.fdocuments.in/reader036/viewer/2022081519/56649ea05503460f94ba3a2c/html5/thumbnails/64.jpg)
Mechanisms for Event-Driven Execution
Signaling Store

[Figure: time (usec) vs. payload size (1 B to 100 KB) comparing memput + signal, upc memput + strict flag, pipelined memput (unsafe), MPI n-byte ping/pong, and MPI send/recv]

Preliminary results on Opteron + InfiniBand