Unified Parallel C (UPC)
Costin Iancu
The Berkeley UPC Group: C. Bell, D. Bonachea, W. Chen, J. Duell,
P. Hargrove, P. Husbands, C. Iancu, R. Nishtala, M. Welcome, K. Yelick
http://upc.lbl.gov
Slides edited by K. Yelick, T. El-Ghazawi, P. Husbands, C. Iancu
Context
• Most parallel programs are written using either:
  – Message passing with a SPMD model (MPI)
    • Usually for scientific applications with C++/Fortran
    • Scales easily
  – Shared memory with threads in OpenMP, Threads+C/C++/Fortran, or Java
    • Usually for non-scientific applications
    • Easier to program, but less scalable performance
• Partitioned Global Address Space (PGAS) languages take the best of both:
  – SPMD parallelism like MPI (performance)
  – Local/global distinction, i.e., layout matters (performance)
  – Global address space like threads (programmability)
• 3 current languages: UPC (C), CAF (Fortran), and Titanium (Java)
• 3 new languages: Chapel, Fortress, X10
Partitioned Global Address Space
• Shared memory is logically partitioned by processors
• Remote memory may stay remote: no automatic caching implied
• One-sided communication: reads/writes of shared variables
• Both individual and bulk memory copies
• Some models have a separate private memory area
• Models differ in distributed array generality and in how arrays are constructed

[Figure: global address space with a shared region per thread holding X[0], X[1], ..., X[P], and a private region per thread (Thread0 ... Threadn) holding a local ptr]
Partitioned Global Address Space Languages
• Explicitly parallel programming model with SPMD parallelism
  – Fixed at program start-up, typically one thread per processor
• Global address space model of memory
  – Allows programmer to directly represent distributed data structures
• Address space is logically partitioned
  – Local vs. remote memory (two-level hierarchy)
• Programmer control over performance-critical decisions
  – Data layout and communication
• Performance transparency and tunability are goals
  – Initial implementation can use fine-grained shared memory
Current Implementations
• A successful language/library must run everywhere
• UPC
  – Commercial compilers: Cray, SGI, HP, IBM
  – Open source compilers: LBNL/UCB (source-to-source), Intrepid (gcc)
• CAF
  – Commercial compilers: Cray
  – Open source compilers: Rice (source-to-source)
• Titanium
  – Open source compilers: UCB (source-to-source)
• Common tools
  – Open64 open source research compiler infrastructure
  – ARMCI, GASNet for distributed memory implementations
  – Pthreads, System V shared memory
Talk Overview
• UPC Language Design
  – Data distribution (layout, memory management)
  – Work distribution (data parallelism)
  – Communication (implicit, explicit, collective operations)
  – Synchronization (memory model, locks)
• Programming in UPC
  – Performance (one-sided communication)
  – Application examples: FFT, LU
  – Productivity (compiler support)
  – Performance tuning and modeling
UPC Overview and Design
• Unified Parallel C (UPC) is:
  – An explicit parallel extension of ANSI C, with common and familiar syntax and semantics for parallel C and simple extensions to ANSI C
  – A partitioned global address space (PGAS) language
  – Based on ideas in Split-C, AC, and PCP
• Similar to the C language philosophy
  – Programmers are clever and careful, and may need to get close to the hardware
    • to get performance, but
    • can get in trouble
• SPMD execution model: (THREADS, MYTHREAD), static vs. dynamic threads
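As a concrete reminder of the SPMD model, a minimal illustrative program (not from the slides) might look as follows; every thread executes main() and can query MYTHREAD and THREADS:

  /* Minimal SPMD sketch: every UPC thread runs main(). */
  #include <upc.h>
  #include <stdio.h>

  int main(void) {
      /* MYTHREAD and THREADS are built-in; the thread count is fixed at
         startup (static threads) or at job-launch time (dynamic threads). */
      printf("Hello from thread %d of %d\n", MYTHREAD, THREADS);
      upc_barrier;          /* global synchronization across all threads */
      return 0;
  }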
Data Distribution
Data Distribution
[Figure: shared global address space holding ours and the distributed array X[0], X[1], ..., X[P]; each thread (Thread0 ... Threadn) also has a private region holding mine and a local ptr]

• Distinction between memory spaces through extensions of the type system (shared qualifier):

  shared int ours;
  shared int X[THREADS];
  shared int *ptr;
  int mine;

• Data in shared address space:
  – Static: scalars (with affinity to thread 0), distributed arrays
  – Dynamic: dynamic memory management (upc_alloc, upc_global_alloc, upc_all_alloc)
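As an illustrative sketch of the three dynamic allocators (block sizes and variable names are made up; the signatures follow the UPC 1.2 library):

  #include <upc.h>

  shared [10] int *a, *b, *c;

  void alloc_examples(void) {
      /* Collective: called by all threads; every thread receives a pointer
         to the same distributed region of 10 blocks of 10 ints. */
      a = (shared [10] int *) upc_all_alloc(10, 10 * sizeof(int));

      /* Non-collective: a single thread allocates a distributed region. */
      if (MYTHREAD == 0)
          b = (shared [10] int *) upc_global_alloc(10, 10 * sizeof(int));

      /* Shared memory that lives entirely in the calling thread's partition. */
      c = (shared [10] int *) upc_alloc(10 * sizeof(int));
  }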
Data Layout
• Data layout is controlled through extensions of the type system (layout specifiers):
  – [0] or [] (indefinite layout, all on one thread):
      shared [] int *p;
  – Empty (cyclic layout):
      shared int array[THREADS*M];
  – [*] (blocked layout):
      shared [*] int array[THREADS*M];
  – [b] or [b1][b2]…[bn] = [b1*b2*…*bn] (block-cyclic):
      shared [B] int array[THREADS*M];
• Element array[i] has affinity with thread (i / B) % THREADS
• Layout determines pointer arithmetic rules
• Introspection: upc_threadof, upc_phaseof, upc_blocksize
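For example (an illustrative calculation, not from the slides): with THREADS = 4 and block size B = 3, element array[7] lies in block 7/3 = 2, so it has affinity with thread (7/3) % 4 = 2 and sits at phase 7 % 3 = 1 within that block.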
UPC Pointers Implementation

• In UPC, pointers to shared objects have three fields:
  – thread number
  – local address of the block
  – phase (specifies the position in the block)
• Example: the Cray T3E implementation packs all three into one 64-bit word:

    Phase (bits 63-49) | Thread (bits 48-38) | Virtual Address (bits 37-0)

• Pointer arithmetic can be expensive in UPC
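Conceptually, a pointer-to-shared behaves like the following "fat" struct (an illustrative sketch only; field names are invented, and real compilers may pack the fields into a single word as on the T3E):

  typedef struct {
      unsigned long long addr;   /* local virtual address of the block      */
      unsigned int       thread; /* owning thread number                    */
      unsigned int       phase;  /* element offset within the current block */
  } upc_shared_ptr_sketch;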
UPC Pointers
                        Where does the pointer point?
                          Local         Shared
Where does    Private     PP (p1)       PS (p2)
the pointer
reside?       Shared      SP (p3)       SS (p4)

int *p1;                /* private pointer to local memory   */
shared int *p2;         /* private pointer to shared space   */
int *shared p3;         /* shared pointer to local memory    */
shared int *shared p4;  /* shared pointer to shared space    */
UPC Pointers
int *p1;                /* private pointer to local memory   */
shared int *p2;         /* private pointer to shared space   */
int *shared p3;         /* shared pointer to local memory    */
shared int *shared p4;  /* shared pointer to shared space    */

[Figure: each thread (Thread0 ... Threadn) holds private pointers p1 and p2; the shared pointers p3 and p4 live in the shared global address space; any of them may point at local or remote data]

Pointers to shared often require more storage and are more costly to dereference; they may refer to local or remote memory.
Common Uses for UPC Pointer Types
int *p1;
  • These pointers are fast (just like C pointers)
  • Use to access local data in the part of the code performing local work
  • Often cast a pointer-to-shared to one of these to get faster access to shared data that is local

shared int *p2;
  • Use to refer to remote data
  • Larger and slower due to the test-for-local plus possible communication

int *shared p3;
  • Not recommended

shared int *shared p4;
  • Use to build shared linked structures, e.g., a linked list

• typedef is the UPC programmer's best friend
UPC Pointers Usage Rules
• Pointer arithmetic supports blocked and non-blocked array distributions
• Casting of shared to private pointers is allowed, but not vice versa!
• When casting a pointer-to-shared to a pointer-to-local, the thread number of the pointer to shared may be lost
• Casting of shared to local is well defined only if the object pointed to by the pointer to shared has affinity with the thread performing the cast
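The cast-to-local idiom implied by these rules might look like the following sketch (the array name and block size are illustrative):

  #include <upc.h>

  shared [10] double grid[10*THREADS];

  void scale_my_block(double s) {
      /* grid[MYTHREAD*10] has affinity with this thread, so the cast below
         is well defined; the resulting plain C pointer is fast to use. */
      double *mine = (double *) &grid[MYTHREAD * 10];
      for (int i = 0; i < 10; i++)
          mine[i] *= s;
  }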
Work Distribution
Work Distribution: upc_forall()

• Owner-computes rule: loop over all iterations, work on those owned by you
• UPC adds a special type of loop:

  upc_forall(init; test; step; affinity)
      statement;

• The programmer indicates that the iterations are independent
  – Undefined if there are dependencies across threads
• The affinity expression indicates which iterations run on each thread. It may have one of two types:
  – Integer: run where affinity % THREADS == MYTHREAD
  – Pointer: run where upc_threadof(affinity) == MYTHREAD
• Syntactic sugar for:
    for (i = MYTHREAD; i < N; i += THREADS) ...
  or
    for (i = 0; i < N; i++)
        if (MYTHREAD == i % THREADS) ...
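A small illustrative sketch (not from the slides) using a pointer affinity expression, so that each thread touches only the elements it owns:

  #include <upc_relaxed.h>

  #define N (THREADS * 100)
  shared int v[N];

  void increment_all(void) {
      int i;
      upc_forall (i = 0; i < N; i++; &v[i])   /* affinity: the owner of v[i] */
          v[i] += 1;
      upc_barrier;
  }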
Inter-Processor Communication
Data Communication
• Implicit (assignments through pointers-to-shared):

  shared int *p;
  ...
  *p = 7;

• Explicit (bulk, synchronous) point-to-point: upc_memget, upc_memput, upc_memcpy, upc_memset
• Collective operations (http://www.gwu.edu/~upc/docs/)
  – Data movement: broadcast, scatter, gather, ...
  – Computational: reduce, prefix, ...
  – Interface has synchronization modes:
    • Avoid over-synchronizing (a barrier before/after is the simplest semantics, but may be unnecessary)
    • Data being collected may be read/written by any thread simultaneously
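A hedged sketch of the collective interface, using upc_all_broadcast from <upc_collective.h> (the array names and sizes are made up); the flags argument selects the synchronization mode:

  #include <upc.h>
  #include <upc_collective.h>

  #define NELEMS 4
  shared [] int src[NELEMS];                  /* lives entirely on thread 0 */
  shared [NELEMS] int dst[NELEMS*THREADS];    /* one block per thread       */

  void bcast_example(void) {
      /* ALLSYNC in/out is the simplest (barrier-like) mode; looser modes
         such as UPC_IN_NOSYNC avoid over-synchronizing when it is safe. */
      upc_all_broadcast(dst, src, NELEMS * sizeof(int),
                        UPC_IN_ALLSYNC | UPC_OUT_ALLSYNC);
  }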
Data Communication
• The UPC Language Specification v1.2 does not contain non-blocking communication primitives
• Extensions for non-blocking communication are available in the BUPC (Berkeley UPC) implementation
• UPC v1.2 does not have higher-level primitives for point-to-point communication
• See the BUPC extensions for:
  – scatter, gather
  – VIS (vector/indexed/strided transfers)
• Should non-blocking communication be a first-class language citizen?
Synchronization
Synchronization
• Point-to-point synchronization: locks
  – opaque type: upc_lock_t*
  – dynamically managed: upc_all_lock_alloc, upc_global_lock_alloc
• Global synchronization:
  – Barriers (unaligned): upc_barrier
  – Split-phase barriers:
      upc_notify;   // this thread is ready for the barrier
      ...           // computation unrelated to the barrier
      upc_wait;     // wait for the others to be ready
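A minimal sketch of lock usage for a shared counter (the names are illustrative; the calls are the standard UPC lock API):

  #include <upc.h>

  shared int counter;
  upc_lock_t *lock;    /* each thread holds its own copy of the same pointer */

  void count_me(void) {
      lock = upc_all_lock_alloc();   /* collective: returns the same lock to all threads */

      upc_lock(lock);
      counter++;                     /* critical section on shared data */
      upc_unlock(lock);

      upc_barrier;
  }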
Memory Consistency in UPC
• The consistency model defines the order in which one thread may see another thread's accesses to memory
  – If you write a program with unsynchronized accesses, what happens?
  – Does this work?

      // producer              // consumer
      data = ...;              while (!flag) { };
      flag = 1;                ... = data;   // use the data

• UPC has two types of accesses:
  – Strict: will always appear in order
  – Relaxed: may appear out of order to other threads
• Consistency is associated either with a program scope (file, statement):

      { #pragma upc strict
        flag = 1; }

  or with a type:

      shared strict int flag;
Sample UPC Code
Matrix Multiplication in UPC
• Given two integer matrices A (N×P) and B (P×M), we want to compute C = A × B.
• Entries c_ij in C are computed by the formula:

    c_{ij} = \sum_{l=1}^{P} a_{il} \cdot b_{lj}
Serial C code
#define N 4
#define P 4
#define M 4

int a[N][P] = {1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16}, c[N][M];
int b[P][M] = {0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1};

void main (void) {
  int i, j, l;
  for (i = 0; i < N; i++) {
    for (j = 0; j < M; j++) {
      c[i][j] = 0;
      for (l = 0; l < P; l++)
        c[i][j] += a[i][l]*b[l][j];
    }
  }
}
Domain Decomposition
• A (N × P) is decomposed row-wise into blocks of size (N × P) / THREADS as shown below:

    Thread 0:          elements 0 .. (N*P / THREADS) - 1
    Thread 1:          elements (N*P / THREADS) .. (2*N*P / THREADS) - 1
    ...
    Thread THREADS-1:  elements ((THREADS-1)*N*P) / THREADS .. (THREADS*N*P / THREADS) - 1

• B (P × M) is decomposed column-wise into M / THREADS blocks as shown below:

    Thread 0:          columns 0 .. (M/THREADS) - 1
    ...
    Thread THREADS-1:  columns ((THREADS-1)*M)/THREADS .. M - 1

• Note: N and M are assumed to be multiples of THREADS
• Exploits locality in matrix multiplication
UPC Matrix Multiplication Code

#include <upc_relaxed.h>
#define N 4
#define P 4
#define M 4

// data distribution: a and c are blocked shared matrices
shared [N*P/THREADS] int a[N][P] = {1,..,16}, c[N][M];
shared [M/THREADS] int b[P][M] = {0,1,0,1,…,0,1};

void main (void) {
  int i, j, l;                                 // private variables
  upc_forall (i = 0; i < N; i++; &c[i][0]) {   // work distribution
    for (j = 0; j < M; j++) {
      c[i][j] = 0;
      for (l = 0; l < P; l++)
        c[i][j] += a[i][l]*b[l][j];            // implicit communication
    }
  }
}
UPC Matrix Multiplication With Block Copy
#include <upc_relaxed.h>

// a and c are blocked shared matrices
shared [N*P/THREADS] int a[N][P], c[N][M];
shared [M/THREADS] int b[P][M];
int b_local[P][M];

void main (void) {
  int i, j, l;                                 // private variables

  // explicit bulk communication
  upc_memget(b_local, b, P*M*sizeof(int));

  // work distribution (c aligned with a??)
  upc_forall (i = 0; i < N; i++; &c[i][0]) {
    for (j = 0; j < M; j++) {
      c[i][j] = 0;
      for (l = 0; l < P; l++)
        c[i][j] += a[i][l]*b_local[l][j];
    }
  }
}
Programming in UPC
Don’t ask yourself what can my compiler do for me, ask yourself what can I do for my compiler!
Principles of Performance Software
To minimize the cost of communication:
• Use the best available communication mechanism on a given machine
• Hide communication by overlapping (programmer, compiler, or runtime)
• Avoid synchronization using data-driven execution (programmer or runtime)
• Tune communication using performance models when they work; search when they don't (programmer or compiler/runtime)
Best Available Communication Mechanism
• Performance is determined by overhead, latency, and bandwidth
• Data transfer (one-sided communication) is often faster than (two-sided) message passing
• Message-passing semantics limit performance:
  – In-order message delivery
  – Message and tag matching
  – Need to acquire information from the remote host processor
  – Synchronization (message receipt) tied to data transfer
One-Sided vs Two-Sided: Theory
• A two-sided message needs to be matched with a receive to identify the memory address where the data should be put
  – Matching can be offloaded to the network interface in networks like Quadrics
  – Match tables need to be downloaded to the interface (from the host)
• A one-sided put/get message can be handled directly by a network interface with RDMA support
  – Avoids interrupting the CPU or storing data from the CPU (preposts)

[Figure: a one-sided put message carries the destination address with its data payload, so the network interface can deposit it directly into memory; a two-sided message carries only a message id, so the host CPU must match it to a receive]
GASNet: Portability and High-Performance

[Figure: 8-byte roundtrip latency in usec (down is good) on Elan3, Elan4, Myrinet, G5/IB, Opteron/IB, SP/Fed, and XT3, comparing MPI ping-pong with GASNet put+sync]

GASNet is better for overhead and latency across machines.
UPC Group; GASNet design by Dan Bonachea
GASNet: Portability and High-Performance

[Figure: flood bandwidth for 4 KB and 2 MB messages as a percentage of hardware peak (up is good) on Elan3, Elan4, Myrinet, G5/IB, Opteron/IB, SP/Fed, and XT3, comparing MPI with GASNet]

GASNet excels at mid-range sizes, which is important for overlap.
GASNet is at least as high (comparable) for large messages.
Joint work with UPC Group; GASNet design by Dan Bonachea
One-Sided vs. Two-Sided: Practice
[Figure: flood bandwidth (MB/s) and GASNet/MPI speedup versus message size, 16 B to 256 KB (up is good), comparing GASNet non-blocking put with MPI flood bandwidth]

• InfiniBand: GASNet vapi-conduit and OSU MVAPICH 0.9.5
• Half power point (N½) differs by one order of magnitude
• This is not a criticism of the implementation!

NERSC Jacquard machine with Opteron processors
Yelick, Hargrove, Bonachea
Overlap
Hide Communication by Overlapping
• A programming model that decouples data transfer and synchronization (init, sync)
• BUPC has several extensions (programmer-controlled):
  – explicit handle based
  – region based
  – implicit handle based
• Examples:
  – 3D FFT (programmer)
  – split-phase optimizations (compiler)
  – automatic overlap (runtime)
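A hedged sketch of the explicit-handle style, assuming the Berkeley UPC async extensions (bupc_memget_async / bupc_waitsync, declared in bupc_extensions.h); the function get_and_compute and its arguments are illustrative:

  #include <upc.h>
  #include <bupc_extensions.h>

  void get_and_compute(double *dst, shared double *src, size_t n) {
      /* start the transfer, keep a handle */
      bupc_handle_t h = bupc_memget_async(dst, src, n * sizeof(double));

      /* ... unrelated local computation overlaps with the transfer ... */

      bupc_waitsync(h);   /* complete the transfer; dst is now safe to read */
  }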
Performing a 3D FFT

• NX x NY x NZ elements spread across P processors
• Will use a 1-dimensional layout in the Z dimension
  – Each processor gets NZ / P "planes" of NX x NY elements per plane

[Figure: 1D partition of the NX x NY x NZ cube along Z; with P = 4, processors p0..p3 each own NZ/P planes]

Bell, Nishtala, Bonachea, Yelick
Performing a 3D FFT (part 2)
• Perform an FFT in all three dimensions
• With the 1D layout, 2 of the 3 dimensions are local, while the last (Z) dimension is distributed

  Step 1: FFTs on the columns (all elements local)
  Step 2: FFTs on the rows (all elements local)
  Step 3: FFTs in the Z dimension (requires communication)

Bell, Nishtala, Bonachea, Yelick
Performing the 3D FFT (part 3)
• Steps 1 and 2 can be performed since all the data is available without communication
• Perform a global transpose of the cube
  – Allows Step 3 to continue

[Figure: transpose of the distributed cube]

Bell, Nishtala, Bonachea, Yelick
Communication Strategies for 3D FFT

• Three approaches:
  – Chunk (all rows with the same destination):
    • Wait for the 2nd-dimension FFTs to finish
    • Minimizes the number of messages
  – Slab (all rows in a single plane with the same destination):
    • Wait for the chunk of rows destined for one processor to finish
    • Overlap with computation
  – Pencil (one row):
    • Send each row as it completes
    • Maximizes overlap and matches the natural layout

Bell, Nishtala, Bonachea, Yelick
NAS FT Variants Performance Summary
• Slab is always best for MPI; the small-message cost is too high
• Pencil is always best for UPC; more overlap

[Figure: best MFlop rates per thread for all NAS FT benchmark versions on Myrinet (64 procs), InfiniBand (256), Elan3 (256 and 512), and Elan4 (256 and 512), comparing Chunk (NAS FT with FFTW), Best MPI (always slabs), and Best UPC (always pencils); a 0.5 TFlops aggregate rate is noted on the chart]
Bisection Bandwidth Limits
• Full bisection bandwidth is (too) expensive
• During an all-to-all communication phase:
  – Effective (per-thread) bandwidth is a fractional share
  – Significantly lower than link bandwidth
  – Use smaller messages mixed with computation to avoid swamping the network

Bell, Nishtala, Bonachea, Yelick
Compiler Optimizations
• The naïve scheme (a blocking call for each load/store) is not good enough
• PRE (partial redundancy elimination) on shared expressions
  – Reduces the amount of unnecessary communication
  – Also applies to UPC shared-pointer arithmetic
• Split-phase communication
  – Hides communication latency through overlapping
• Message coalescing
  – Reduces the number of messages to save startup overhead and achieve better bandwidth

Chen, Iancu, Yelick
Benchmarks
• Gups
  – Random access (read/modify/write) to a distributed array
• Mcop
  – Parallel dynamic programming algorithm
• Sobel
  – Image filter
• Psearch
  – Dynamic load balancing / work stealing
• Barnes-Hut
  – Shared-memory-style code from SPLASH2
• NAS FT/IS
  – Bulk communication
Performance Improvements

[Figure: percent improvement over unoptimized code for the benchmarks above]

Chen, Iancu, Yelick
Data Driven Execution
Data-Driven Execution
• Many algorithms require synchronization with a remote processor. Mechanisms (BUPC extensions):
  – Signaling store: raise a semaphore upon transfer
  – Remote enqueue: put a task in a remote queue
  – Remote execution: floating functions (X10 activities)
• Many algorithms have irregular data dependencies (e.g., LU). Mechanism (BUPC extension):
  – Cooperative multithreading
Matrix Factorization

[Figure: the matrix blocks are 2D block-cyclic distributed; panel factorizations involve communication for pivoting; matrix-matrix multiplication is used for the updates and can be coalesced. The picture marks the completed parts of L and U, the panel being factored, the trailing matrix to be updated, and the blocks A(i,j), A(i,k), A(j,i), A(j,k)]

Husbands, Yelick
Three Strategies for LU Factorization
• Organize in bulk-synchronous phases (ScaLAPACK)
  – Factor a block column, then perform updates
  – Relatively easy to understand/debug, but extra synchronization
• Overlapping phases (HPL)
  – Work associated with one block-column factorization can be overlapped
  – A parameter determines how many (need temp space accordingly)
• Event-driven multithreaded (UPC Linpack)
  – Each thread runs an event-handler loop
  – Tasks: factorization (with pivoting), update trailing matrix, update upper matrix
  – Tasks may suspend (voluntarily) to wait for data, synchronization, etc.
  – Data is moved with remote gets (synchronization built in)
  – Threads must "gang" together for factorizations
  – Scheduling priorities are key to performance and deadlock avoidance

Husbands, Yelick
UPC-HP Linpack Performance
[Figure: GFlop/s for UPC vs. MPI/HPL on Cray X1 (64 and 128 procs), an Opteron cluster (64 procs), and Altix (32 procs), plus UPC vs. ScaLAPACK on 2x4 and 4x4 processor grids]

• Comparable to HPL (numbers from the HPCC database)
• Faster than ScaLAPACK due to less synchronization
• Large-scale runs of the UPC code on Itanium/Quadrics (Thunder):
  – 2.2 TFlops on 512 procs and 4.4 TFlops on 1024 procs

Husbands, Yelick
Performance Tuning
Iancu, Strohmaier
Efficient Use of One-Sided
• Implementations need to be efficient and have scalable performance
• Application-level use of non-blocking communication benefits from new design techniques: finer-grained decompositions and overlap
• Overlap exercises the system in "unexpected" ways
• Prototyping implementations for large-scale systems is a hard problem: networks behave non-linearly and communication scheduling is NP-hard
• Need a methodology for fast prototyping:
  – understand the network/CPU interaction at large scale
Performance Tuning
• Performance is determined by overhead, latency, bandwidth, computational characteristics, and communication topology
• It's all relative: performance characteristics are determined by system load
• Basic principles:
  – Minimize communication overhead
  – Avoid congestion:
    • control the injection rate (end-point)
    • avoid hotspots (end-points, network routes)
• Have to use models
• What kind of questions can a model answer?
Example: Vector-Add
shared double *rdata;
double *ldata, *buf;

/* Version 1: one blocking bulk get, then compute */
upc_memget(buf, rdata, N);
for (i = 0; i < N; i++)
    ldata[i] += buf[i];

/* Version 2: strip-mined non-blocking gets with block size B */
for (i = 0; i < N/B; i++)
    h[i] = upc_memget_nb(buf + i*B, rdata + i*B, B);
for (i = 0; i < N/B; i++) {
    sync(h[i]);
    for (j = 0; j < B; j++)
        ldata[i*B + j] += buf[i*B + j];
}

/* Version 3: pipelined schedule with pipeline depth b */
GET_nb(B0) ... GET_nb(Bb)
GET_nb(Bb+1) ... GET_nb(B2b)
sync(B0); compute(B0)
...
sync(Bb); compute(Bb)
GET_nb(B2b+1) ... GET_nb(B3b)
...
sync(BN); compute(BN)

Which implementation is faster? What are B and b?
Prototyping
• Usual approach: use a time-accurate performance model (applications, automatically tuned collectives)
  – Models (LogP, ...) don't capture important behavior (parallelism, congestion, resource constraints, non-linear behavior)
  – Exhaustive search of the optimization space
  – Validated only at low concurrency (~tens of procs); might break at high concurrency or for torus networks
• Our approach:
  – Use a performance model for the ideal implementation
  – Understand hardware resource constraints and the variation of performance parameters (understand trends, not absolute values)
  – Derive implementation constraints that satisfy both the optimal implementation and the hardware constraints
  – Force implementation parameters to converge towards the optimum
Performance
• Network: bandwidth and overhead
• Application: communication pattern and schedule (congestion), computation

Overhead is determined by message size, communication schedule, and hardware flow of control.
Bandwidth is determined by message size, communication schedule, and fairness of allocation.

Iancu, Strohmaier
Validation

• Understand network behavior in the presence of non-blocking communication (InfiniBand, Elan)
• Develop a performance model for scenarios widely encountered in applications (point-to-point, scatter, gather, all-to-all) and a variety of aggressive optimization techniques (strip mining, pipelining, communication schedule skewing)
• Use both micro-benchmarks and application kernels to validate the approach

Iancu, Strohmaier
Findings
• Can choose optimal values for implementation parameters
• A time-accurate model for an implementation is hard to develop and inaccurate at high concurrency
• The methodology does not require exhaustive search of the optimization space (only point-to-point and the qualitative behavior of gather)
• In practice one can produce templatized implementations for an algorithm and use our approach to determine optimal values: code generation (UPC), automatic tuning of collective operations, application development
• Need to further understand the mathematical and statistical properties
End
Avoid Synchronization: Data-Driven Execution

• Many algorithms require synchronization with a remote processor
• What is the right mechanism in a PGAS model for doing this?
• Is it still one-sided?

Part 3: Event-Driven Execution Models
Mechanisms for Event-Driven Execution
• A "put" operation does a send-side notification
  – Needed for memory consistency model ordering
• Need to signal the remote side on completion
• Strategies:
  – Have the remote side do a "get" (works in some algorithms)
  – Put + strict flag write: do a put, wait for completion, then do another (strict) put
  – Pipelined put + put-flag: works only on ordered networks
  – Signaling put: add a new "store" operation that embeds the signal (a 2nd remote address) into a single message
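A minimal sketch of the "put + strict flag write" strategy, using only standard UPC strict/relaxed semantics (array and function names are illustrative):

  #include <upc_relaxed.h>

  shared [] double data[100];    /* payload, affinity with thread 0          */
  shared strict int flag;        /* strict accesses order the payload writes */

  double local_buf[100];

  void producer(void) {          /* e.g., run on thread 1 */
      upc_memput(data, local_buf, sizeof(local_buf));
      flag = 1;                  /* strict write: signals that the put is done */
  }

  void consumer(void) {          /* e.g., run on thread 0 */
      while (!flag) ;            /* strict read: spin until the producer signals */
      /* data[] is now safe to read */
  }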
![Page 64: Unified Parallel C (UPC) Costin Iancu The Berkeley UPC Group: C. Bell, D. Bonachea, W. Chen, J. Duell, P. Hargrove, P. Husbands, C. Iancu, R. Nishtala,](https://reader036.fdocuments.in/reader036/viewer/2022081519/56649ea05503460f94ba3a2c/html5/thumbnails/64.jpg)
Mechanisms for Event-Driven Execution
Signaling Store

[Figure: time (usec) vs. payload size (1 B to 100 KB) comparing memput + signal, upc memput + strict flag, pipelined memput (unsafe), MPI n-byte ping/pong, and MPI send/recv]

Preliminary results on Opteron + InfiniBand