Brief Overview of Parallel Computing


description

A relatively short presentation concerning parallel computing for the purpose of scientific computing on medium to large clusters. Given at SUNY University at Buffalo's CCR.

Transcript of Brief Overview of Parallel Computing

Brief Overview of Parallel Programming

M. D. Jones, Ph.D.

Center for Computational Research, University at Buffalo

State University of New York

Spring 2014

M. D. Jones, Ph.D. (CCR/UB) Brief Overview of Parallel Programming Spring 2014 1 / 61

Background

A Little Motivation

Why do we parallel process in the first place?

Performance - either solve a bigger problem, or reach a solution faster (or both)

Hardware is becoming (actually, it has been for a while now) intrinsically parallel

Parallel programming is not easy. How easy is sequential programming? Now add another layer (a sizable one, at that) for parallelism ...

M. D. Jones, Ph.D. (CCR/UB) Brief Overview of Parallel Programming Spring 2014 3 / 61

Background

Big Picture

The basic idea of parallel programming is a simple one:

[Figure: a large problem is decomposed onto multiple CPUs (problem -> decomposition -> CPUs).]

in which we decompose a large computational problem onto the available processing elements (CPUs). How you achieve that decomposition is where all the fun lies ...

M. D. Jones, Ph.D. (CCR/UB) Brief Overview of Parallel Programming Spring 2014 4 / 61

Background

Decomposition


Uniform volume example - decomposition in 1D, 2D, and 3D. Note that load balancing in this case is simpler if the work load per cell is the same ...

M. D. Jones, Ph.D. (CCR/UB) Brief Overview of Parallel Programming Spring 2014 5 / 61

Background

Decomposition (continued)

Nonuniform example - decomposition in 3D for a 16-way parallel adaptive finite element calculation (in this case using a terrain elevation). Load balancing in this case is much more complex.

M. D. Jones, Ph.D. (CCR/UB) Brief Overview of Parallel Programming Spring 2014 6 / 61

Demand for HPC More Speed!

Why Use Parallel Computation?

Note that, in the modern era, inevitably HPC = Parallel (or concurrent) Computation. The driving forces behind this are pretty simple - the desire is:

Solve my problem faster, i.e. I want the answer now (and who doesn't want that?)

I want to solve a bigger problem than I (or anyone else, for that matter) have ever before been able to tackle, and do so in a reasonable time (generally reasonable = within a graduate student's time to graduate!)

M. D. Jones, Ph.D. (CCR/UB) Brief Overview of Parallel Programming Spring 2014 8 / 61

Demand for HPC More Speed!

A Concrete Example

Well, more of a discrete example, actually. Let's consider the gravitational N-body problem.

Example
Using classical gravitation, we have a very simple (but long-ranged) force/potential. For each of N bodies, the resulting force is computed from the other N − 1 bodies, thereby requiring ∼N² force calculations per step. If a galaxy consists of approximately 10¹² such bodies, and even the best algorithm for computing requires N log₂ N calculations, that means ≈ 10¹² ln(10¹²)/ln(2) ≈ 4×10¹³ calculations. If each calculation takes ≈ 1 µs, that is ∼ 4×10⁷ seconds per step. That is about 1.3 CPU-years per step. Ouch!
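As a quick sanity check on the arithmetic above, a few lines of C reproduce the estimate (a minimal sketch; the 10¹² bodies and 1 µs per calculation are the figures assumed in the example):

#include <math.h>
#include <stdio.h>

int main(void) {
    double N = 1e12;             /* bodies in the galaxy                    */
    double t_calc = 1e-6;        /* seconds per force calculation (assumed) */
    double calcs = N * log2(N);  /* N log2(N) calculations per step         */
    double seconds = calcs * t_calc;
    double years = seconds / (365.25 * 24.0 * 3600.0);
    printf("%.2e calculations -> %.2e s -> %.2f CPU-years per step\n",
           calcs, seconds, years);
    return 0;
}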

M. D. Jones, Ph.D. (CCR/UB) Brief Overview of Parallel Programming Spring 2014 9 / 61

Demand for HPC More Speed!

A “Size” Example

On the other side, suppose that we need to diagonalize (or invert) a dense matrix being used in, say, an eigenproblem derived from some engineering or multiphysics problem.

Example
In this problem, we want to increase the resolution to capture the essential underlying behavior of the physical process being modeled. So we determine that we need a matrix of order, say, 400000. Simply to store this matrix, in 64-bit representation, requires ≈ 1.28×10¹² bytes of memory, or ∼ 1200 GBytes. We could fit this onto a cluster with, say, 10³ nodes, each having 4 GBytes of memory, by distributing the matrix across the individual memories of each cluster node.

M. D. Jones, Ph.D. (CCR/UB) Brief Overview of Parallel Programming Spring 2014 10 / 61

Demand for HPC More Speed!

So ...

So the lessons to take from the preceding examples should be clear:

It is the nature of the research (and commercial) enterprise to always be trying to accomplish larger tasks with ever increasing speed, or efficiency. For that matter, it is human nature.

These examples, while specific, apply quite generally - do you see the connection between the first example and general molecular modeling? The second example and finite element (or finite difference) approaches to differential equations? How about web searching?

M. D. Jones, Ph.D. (CCR/UB) Brief Overview of Parallel Programming Spring 2014 11 / 61

Demand for HPC Inherent Limitations

Scaling Concepts

By “scaling” we typically mean the relative performance of a parallel vs. serial implementation:

Definition (Scaling): the Speedup Factor, S(p), is given by

    S(p) = (sequential execution time, using the optimal implementation) / (parallel execution time using p processors),

so, in the ideal case, S(p) = p.

M. D. Jones, Ph.D. (CCR/UB) Brief Overview of Parallel Programming Spring 2014 12 / 61

Demand for HPC Inherent Limitations

Parallel Efficiency

Using τ_S as the (best) sequential execution time, we note that

    S(p) = τ_S / τ_p(p) ≤ p,

so the ideal speedup p is an upper bound, and the parallel efficiency is given by

    E(p) = S(p) / p = τ_S / (p × τ_p).

M. D. Jones, Ph.D. (CCR/UB) Brief Overview of Parallel Programming Spring 2014 13 / 61

Demand for HPC Inherent Limitations

Inherent Limitations in Parallel Speedup

Limitations on the maximum speedup:

A fraction of the code, f, cannot be made to execute in parallel

Parallel overhead (communication, duplication costs)

Using this serial fraction, f, we can note that

    τ_p ≥ f·τ_S + (1 − f)·τ_S/p.

M. D. Jones, Ph.D. (CCR/UB) Brief Overview of Parallel Programming Spring 2014 14 / 61

Demand for HPC Inherent Limitations

Amdahl’s Law

This simplification for τ_p leads directly to Amdahl's Law.

Definition (Amdahl's Law):

    S(p) ≤ τ_S / (f·τ_S + (1 − f)·τ_S/p)
         ≤ p / (1 + f(p − 1)).

G. M. Amdahl, "Validity of the Single Processor Approach to Achieving Large-Scale Computing Capabilities," AFIPS Conference Proceedings 30 (AFIPS Press, Atlantic City, NJ), 483-485, 1967.

M. D. Jones, Ph.D. (CCR/UB) Brief Overview of Parallel Programming Spring 2014 15 / 61

Demand for HPC Inherent Limitations

Implications of Amdahl’s Law

The implications of Amdahl's law are pretty straightforward:

In the limit as p → ∞,

    lim_{p→∞} S(p) ≤ 1/f.

If the serial fraction is 5%, the maximum parallel speedup is only 20.
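The bound is easy to explore numerically; here is a minimal C sketch (the serial fractions are the same illustrative values plotted on the next slide):

#include <stdio.h>

/* Amdahl's Law: upper bound on speedup for serial fraction f and p processors */
static double amdahl(double f, int p) {
    return (double)p / (1.0 + f * (p - 1));
}

int main(void) {
    const double f[] = {0.001, 0.01, 0.02, 0.05, 0.1};
    for (int i = 0; i < 5; i++)
        printf("f = %.3f: S(256) <= %6.1f, limit 1/f = %6.1f\n",
               f[i], amdahl(f[i], 256), 1.0 / f[i]);
    return 0;
}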

M. D. Jones, Ph.D. (CCR/UB) Brief Overview of Parallel Programming Spring 2014 16 / 61

Demand for HPC Inherent Limitations

Implications of Amdahl’s Law (cont’d)

[Figure: Amdahl's Law maximum parallel speedup - MAX(S(p)) vs. number of processors p (1 to 256), for serial fractions f = 0.001, 0.01, 0.02, 0.05, and 0.1.]

M. D. Jones, Ph.D. (CCR/UB) Brief Overview of Parallel Programming Spring 2014 17 / 61

Demand for HPC Inherent Limitations

A Practical Example

Let τ_p = τ_comm + τ_comp, where τ_comm is the time spent in communication between parallel processes (unavoidable overhead) and τ_comp is the time spent in (parallelizable) computation. Then

    τ_1 ≤ p·τ_comp,

and

    S(p) = τ_1 / τ_p
         ≤ p·τ_comp / (τ_comm + τ_comp)
         ≤ p·(1 + τ_comm/τ_comp)^(−1).

The point here is that it is critical to minimize the ratio of time spent in communication to time spent doing computation (a recurring theme).

M. D. Jones, Ph.D. (CCR/UB) Brief Overview of Parallel Programming Spring 2014 18 / 61

Demand for HPC Inherent Limitations

Defeating Amdahl’s Law

There are ways to work around the serious implications of Amdahl's law:

We assumed that the problem size was fixed, which is (very) often not the case.

Now consider a case where the problem size is allowed to vary.

Assume that the problem size is scaled with p such that τ_p is held constant.

M. D. Jones, Ph.D. (CCR/UB) Brief Overview of Parallel Programming Spring 2014 19 / 61

Demand for HPC Inherent Limitations

Gustafson’s Law

Now let τ_p be held constant as p is increased:

    S_s(p) = τ_S / τ_p
           = (f·τ_S + (1 − f)·τ_S) / τ_p
           = (f·τ_S + (1 − f)·τ_S) / (f·τ_S + (1 − f)·τ_S/p)
           = p / (1 − (1 − p)f)
           ≈ p + p(1 − p)f + · · · .

Another way of looking at this is that the serial fraction becomes negligible as the problem size is scaled. Actually, that is a pretty good definition of a scalable code ...

J. L. Gustafson, "Reevaluating Amdahl's Law," Comm. ACM 31(5), 532-533 (1988).

M. D. Jones, Ph.D. (CCR/UB) Brief Overview of Parallel Programming Spring 2014 20 / 61

Demand for HPC Inherent Limitations

Scalability

Definition (Scalable): An algorithm is scalable if there is a minimal nonzero efficiency as p → ∞ and the problem size is allowed to take on any value.

I like this (equivalent) one better:

Definition (Scalable): For a scalable code, the sequential fraction becomes negligible as the problem size (and the number of processors) grows.

M. D. Jones, Ph.D. (CCR/UB) Brief Overview of Parallel Programming Spring 2014 21 / 61

Overview of Parallel APIs Basic Terminology

The Parallel Zoo

There are many parallel programming models, but roughly they break down into the following categories:

Threaded models: e.g., OpenMP or POSIX threads as the application programming interface (API)

Message passing: e.g., MPI = Message Passing Interface

Hybrid methods: combine elements of other models

M. D. Jones, Ph.D. (CCR/UB) Brief Overview of Parallel Programming Spring 2014 23 / 61

Overview of Parallel APIs Basic Terminology

Thread Models

[Figure: typical thread model (e.g., OpenMP) - a single program (a.out) with shared data spawns threads 1 through N over time.]

M. D. Jones, Ph.D. (CCR/UB) Brief Overview of Parallel Programming Spring 2014 24 / 61

Overview of Parallel APIs Basic Terminology

Thread Models


Shared address space (all threads can access the data of program a.out).

Generally limited to a single machine, except in very specialized cases.

GPGPU (general purpose GPU computing) uses a thread model (with a large number of threads on the host machine).
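As a concrete sketch of this shared-address-space model, here is a minimal OpenMP example in C (the array size is an arbitrary choice for illustration; compiled with, e.g., gcc -fopenmp):

#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(void) {
    static double x[N];
    double sum = 0.0;

    /* all threads share x and sum; loop iterations are divided among them */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++) {
        x[i] = 1.0 / (i + 1);
        sum += x[i];
    }
    printf("sum = %f (up to %d threads)\n", sum, omp_get_max_threads());
    return 0;
}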

M. D. Jones, Ph.D. (CCR/UB) Brief Overview of Parallel Programming Spring 2014 25 / 61

Overview of Parallel APIs Basic Terminology

Message Passing Models

[Figure: typical message passing model (e.g., MPI) - tasks 0 through N−1, each a separate process (PID) running its own copy of a.out with its own data, exchanging point-to-point messages (send/recv) and participating in collective communication (broadcast/reduce/...) over time.]

M. D. Jones, Ph.D. (CCR/UB) Brief Overview of Parallel Programming Spring 2014 26 / 61

Overview of Parallel APIs Basic Terminology

Message Passing Models


Separate address space (tasks can not access the data of other tasks without sending messages or participating in collective communications).

General purpose model - messages can be sent over a network, shared memory, etc.

Managing communication is generally up to the programmer.
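A minimal MPI sketch in C of this model (the message value and tag are arbitrary, chosen only to show a point-to-point send/recv pair and one collective):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* point-to-point: task 0 sends a value to task 1 */
    if (rank == 0 && size > 1) {
        double msg = 3.14;
        MPI_Send(&msg, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        double msg;
        MPI_Recv(&msg, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("task 1 received %f\n", msg);
    }

    /* collective: sum one value per task onto task 0 */
    double local = rank, total = 0.0;
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("sum of ranks = %f\n", total);

    MPI_Finalize();
    return 0;
}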

M. D. Jones, Ph.D. (CCR/UB) Brief Overview of Parallel Programming Spring 2014 27 / 61

Overview of Parallel APIs Basic Terminology

Parallelism at Three Levels

Task - macroscopic view, in which an algorithm is organized around separate concurrent tasks

Data - structure data in (separately) updateable "chunks"

Instruction - ILP through the "flow" of data in a predictable fashion

M. D. Jones, Ph.D. (CCR/UB) Brief Overview of Parallel Programming Spring 2014 28 / 61

Overview of Parallel APIs Basic Terminology

Approaching the Problem

The first step in deciding where to add parallelism to an algorithm or application is usually to analyze it from the point of view of tasks and data:

Tasks: How do I reduce this problem into a set of tasks that can be executed concurrently?

Data: How do I take the key data and represent it in such a way that large chunks can be operated on independently (and thus concurrently)?

M. D. Jones, Ph.D. (CCR/UB) Brief Overview of Parallel Programming Spring 2014 29 / 61

Overview of Parallel APIs Basic Terminology

Dependencies

Bear in mind that you always want to minimize the dependencies between your concurrent tasks and data:

On the task level, design your tasks to be as independent as possible - this can also be temporal, namely take into account any necessity for ordering the tasks and their execution

Sharing of data will require synchronization and thus introduce data dependencies, if not race conditions or contention

M. D. Jones, Ph.D. (CCR/UB) Brief Overview of Parallel Programming Spring 2014 30 / 61

Overview of Parallel APIs Basic Terminology

An Hour of Planning ...

Taking the time to carefully analyze an existing application, or plan a new one, can really pay off later:

Plan for parallel programming by keeping data and task parallel constructs/opportunities in mind

Analyze an existing or new program to discover/confirm the locations of the time-consuming portions

Optimize by implementing a strategy using data or task parallel constructs

M. D. Jones, Ph.D. (CCR/UB) Brief Overview of Parallel Programming Spring 2014 31 / 61

Overview of Parallel APIs Auto-Parallelization

Auto-Parallelization

Often called "implicit" parallelism, some compilers (usually commercial ones) have the ability to:

automatically parallelize simple (outer) loops (thread-level parallelism, or TLP)

automatically vectorize (usually innermost loops) using ILP

Note that thread-level parallelism is only usable on an SMP architecture.

M. D. Jones, Ph.D. (CCR/UB) Brief Overview of Parallel Programming Spring 2014 32 / 61

Overview of Parallel APIs Auto-Vectorization

Auto-Vectorization

Like auto-parallelization, a feature of high-performance (often commercial) compilers on hardware that supports at least some level of vectorization:

the compiler finds low-level operations that can operate simultaneously on multiple data elements using a single instruction

Intel/AMD processors with SSE/SSE2/SSE3 are usually limited to vector lengths of 2¹ to 2³ (a hardware limitation that true "vector" processors do not share)

the user can exert some control through compiler directives (usually vendor specific) and data organization focused on vector operations
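For example, a loop like the following has independent, unit-stride iterations with no aliasing, which is exactly the pattern an auto-vectorizer targets (a sketch; the restrict qualifiers and a flag along the lines of gcc's -O3/-ftree-vectorize are typical but compiler-specific):

/* a(i) = a(i) + s*b(i): each iteration is independent, so the compiler
 * can issue one SSE instruction covering several elements at a time   */
void saxpy(int n, float s, float *restrict a, const float *restrict b)
{
    for (int i = 0; i < n; i++)
        a[i] += s * b[i];
}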

M. D. Jones, Ph.D. (CCR/UB) Brief Overview of Parallel Programming Spring 2014 33 / 61

Overview of Parallel APIs Auto-Vectorization

Downside of Automatic Methods

The automatic parallelization schemes have a number of shortfalls:

Very limited in scope - parallelizes only a very small subset of code

Without directives from the programmer, the compiler is very limited in identifying parallelizable regions of code

Performance is rather easily degraded rather than sped up

Wrong results are easy to (automatically) produce

M. D. Jones, Ph.D. (CCR/UB) Brief Overview of Parallel Programming Spring 2014 34 / 61

Overview of Parallel APIs Shared Memory APIs

Shared Memory Parallelism

Here we consider "explicit" modes of parallel programming on shared memory architectures:

HPF, High Performance Fortran

SHMEM, the old Cray shared memory library

Pthreads, the POSIX (UNIX) standard threads API (available on Windows as well)

OpenMP, common specification for compiler directives and API

M. D. Jones, Ph.D. (CCR/UB) Brief Overview of Parallel Programming Spring 2014 35 / 61

Overview of Parallel APIs Distributed Memory APIs

Distributed Memory Parallel Programming

In this case, mainly message passing, with a few other (compiler directives again) possibilities:

HPF, High Performance Fortran

Cluster OpenMP, a new product from Intel to push OpenMP into a distributed memory regime

PVM, Parallel Virtual Machine

MPI, Message Passing Interface, has become the dominant API for general purpose parallel programming

UPC, Unified Parallel C, extensions to ISO C99 for parallel programming (new)

CAF, Co-Array Fortran, parallel extensions to Fortran 95/2003 (new, officially part of Fortran 2008)

M. D. Jones, Ph.D. (CCR/UB) Brief Overview of Parallel Programming Spring 2014 36 / 61

Overview of Parallel APIs APIs/Language Extensions & Availability

OpenMP

Synopsis: designed for multi-platform shared memory parallel programming in C/C++/Fortran, primarily through the use of compiler directives and environment variables (library routines are also available).

Target: shared memory systems.

Current Status: Specification 1.0 released in 1997, 2.0 in 2002, 2.5 in 2005, 3.0 in 2008, 3.1 in 2011. Widely implemented by commercial compiler vendors, and now also in most open-source compilers (the 4.0 specification was released in July 2013, but is not yet available in production compilers).

More Info: the OpenMP home page: http://www.openmp.org

M. D. Jones, Ph.D. (CCR/UB) Brief Overview of Parallel Programming Spring 2014 37 / 61

Overview of Parallel APIs APIs/Language Extensions & Availability

OpenMP Availability

Platform       Version   Invocation (example)
Linux x86_64   3.1       ifort -openmp -openmp_report2 ...
               3.0       pgf90 -mp ...
GNU (>=4.2)    2.5       gfortran -fopenmp ...
GNU (>=4.4)    3.0       gfortran -fopenmp ...
GNU (>=4.7)    3.1       gfortran -fopenmp ...

M. D. Jones, Ph.D. (CCR/UB) Brief Overview of Parallel Programming Spring 2014 38 / 61

Overview of Parallel APIs APIs/Language Extensions & Availability

MPI

Message Passing Interface

Synopsis: message passing library specification proposed as a standard by a committee of vendors, implementors, and users.

Target: distributed and shared memory systems.

Current Status: the most popular (and most portable) of the message passing APIs.

More Info: ANL's main MPI site: www-unix.mcs.anl.gov/mpi

M. D. Jones, Ph.D. (CCR/UB) Brief Overview of Parallel Programming Spring 2014 39 / 61

Overview of Parallel APIs APIs/Language Extensions & Availability

MPI Availability at CCR

Platform Version(+MPI-2)Linux IA64 1.2+(C++,MPI-I/O)Linux x86_64 1.2+(C++,MPI-I/O),2.x(various)

I am starting to favor the commercial Intel MPI for its ease of use,expecially in terms of supporting multiple networks/protocols.

M. D. Jones, Ph.D. (CCR/UB) Brief Overview of Parallel Programming Spring 2014 40 / 61

Overview of Parallel APIs APIs/Language Extensions & Availability

Unified Parallel C (UPC)

Unified Parallel C (UPC) is a project based at Lawrence Berkeley Laboratory to extend C (ISO C99) and provide large-scale parallel computing on both shared and distributed memory hardware:

Explicit parallel execution (SPMD)

Shared address space (shared/private data like OpenMP), but exploits data locality

Primitives for memory management

BSD license (free download)

http://upc.lbl.gov/, community website at http://upc.gwu.edu/

M. D. Jones, Ph.D. (CCR/UB) Brief Overview of Parallel Programming Spring 2014 41 / 61

Overview of Parallel APIs APIs/Language Extensions & Availability

Simple example of some UPC syntax:

shared int all_hits[THREADS];
...
for (i = 0; i < my_trials; i++) my_hits += hit();
all_hits[MYTHREAD] = my_hits;
upc_barrier;
if (MYTHREAD == 0) {
    total_hits = 0;
    for (i = 0; i < THREADS; i++) {
        total_hits += all_hits[i];
    }
    pi = 4.0*((double)total_hits)/((double)trials);
    printf("PI estimated to %10.7f from %d trials on %d threads.\n",
           pi, trials, THREADS);
}

M. D. Jones, Ph.D. (CCR/UB) Brief Overview of Parallel Programming Spring 2014 42 / 61

Overview of Parallel APIs APIs/Language Extensions & Availability

Co-Array Fortran

Co-Array Fortran (CAF) is an extension to Fortran (Fortran 95/2003) to provide data decomposition for parallel programs (somewhat akin to UPC and HPF):

Original specification by Numrich and Reid, ACM Fortran Forum 17, no. 2, pp. 1-31 (1998).

The ISO Fortran committee included co-arrays in the next revision of the Fortran standard (c.f. Numrich and Reid, ACM Fortran Forum 24, no. 2, pp. 4-17 (2005)), Fortran 2008.

Does not require shared memory (more applicable than OpenMP)

Early compiler work at Rice: http://www.hipersoft.rice.edu/caf/index.html

M. D. Jones, Ph.D. (CCR/UB) Brief Overview of Parallel Programming Spring 2014 43 / 61

Overview of Parallel APIs APIs/Language Extensions & Availability

Simple examples of CAF usage:

REAL, DIMENSION(N)[*] :: X, Y
X = Y[PE]              ! get from Y[PE]
Y[PE] = X              ! put into Y[PE]
Y[:] = X               ! broadcast X
Y[LIST] = X            ! broadcast X over subset of PEs in array LIST
Z(:) = Y[:]            ! collect all Y
S = MINVAL(Y[:])       ! min (reduce) all Y
B(1:M)[1:N] = S        ! S scalar, promoted to array of shape (1:M,1:N)

UPC and CAF are examples of partitioned global address space (PGAS) languages, for which a large push is being made by AHPCRC/DARPA as part of the Petascale computing initiative (there is also one based on Java called Titanium).

M. D. Jones, Ph.D. (CCR/UB) Brief Overview of Parallel Programming Spring 2014 44 / 61

Libraries

Leveraging Existing Parallel Libraries

One "easy" way to utilize parallel programming resources is through the use of existing parallel libraries. The most common examples (note the orientation around standard mathematical routines):

BLAS - Basic Linear Algebra Subroutines

LAPACK - Linear Algebra PACKage (uses BLAS)

ScaLAPACK - distributed memory (MPI) solvers for common LAPACK functions

PETSc - Portable, Extensible Toolkit for Scientific Computation

FFTW - Fastest Fourier Transforms in the West

M. D. Jones, Ph.D. (CCR/UB) Brief Overview of Parallel Programming Spring 2014 46 / 61

Libraries

BLAS

The Basic Linear Algebra Subroutines (BLAS) form a standard set of library functions for vector and matrix operations:

Level 1: vector-vector (e.g. xdot, xaxpy, where x = s, d, c, z)

Level 2: matrix-vector (e.g. xgemv, where x = s, d, c, z)

Level 3: matrix-matrix (e.g. xgemm, where x = s, d, c, z)

These routines are generally provided by vendors, hand-tuned at the level of assembly code for optimum performance on a particular processor. Shared memory (multithreaded) and distributed memory versions are available.

www.netlib.org/blas
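A short sketch of calling a Level 1 and a Level 3 routine through the C interface (CBLAS); this assumes a CBLAS-providing library (MKL, ACML, or the netlib reference) is linked in:

#include <cblas.h>
#include <stdio.h>

int main(void) {
    double x[3] = {1, 2, 3}, y[3] = {4, 5, 6};

    /* Level 1: dot product (ddot) */
    double d = cblas_ddot(3, x, 1, y, 1);

    /* Level 3: C = alpha*A*B + beta*C (dgemm), 2x2 matrices, row-major */
    double A[4] = {1, 2, 3, 4}, B[4] = {5, 6, 7, 8}, C[4] = {0};
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                2, 2, 2, 1.0, A, 2, B, 2, 0.0, C, 2);

    printf("ddot = %g, C(1,1) = %g\n", d, C[0]);
    return 0;
}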

M. D. Jones, Ph.D. (CCR/UB) Brief Overview of Parallel Programming Spring 2014 47 / 61

Libraries

Vendor BLAS

Vendor implementations of the BLAS:

AMD: ACML
Apple: Velocity Engine
Compaq: CXML
Cray: libsci
HP: MLIB
IBM: ESSL
Intel: MKL
NEC: PDLIB/SX
SGI: SCSL
SUN: Sun Performance Library

M. D. Jones, Ph.D. (CCR/UB) Brief Overview of Parallel Programming Spring 2014 48 / 61

Libraries

Performance Example

[Figure: DDOT performance on a U2 compute node - MFlop/s vs. vector length (8-byte elements, 10¹ to 10⁶), comparing Intel MKL 8.1.1 against the reference BLAS (-lblas), with the L1 cache, L2 cache, and main memory regimes marked.]

Performance advantage of Intel MKL ddot versus the reference (system) version. Level 3 BLAS routines show even more significant gains.

M. D. Jones, Ph.D. (CCR/UB) Brief Overview of Parallel Programming Spring 2014 49 / 61

Libraries

LAPACK

The Linear Algebra PACKage (LAPACK) is a library that lies on top of the BLAS (for optimum performance and parallelism) and provides:

solvers for systems of simultaneous linear equations

least-squares solutions

eigenvalue problems

singular value problems

On CCR systems, the Intel MKL is generally preferred (it includes optimized BLAS and LAPACK)

www.netlib.org/lapack

M. D. Jones, Ph.D. (CCR/UB) Brief Overview of Parallel Programming Spring 2014 50 / 61

Libraries

ScaLAPACK

The Scalable LAPACK library, ScaLAPACK, is designed for use on the largest of problems (typically implemented on distributed memory systems):

a subset of LAPACK routines redesigned for distributed memory MIMD parallel computers

explicit message passing for interprocessor communication

assumes matrices are laid out in a two-dimensional block cyclic decomposition

On CCR systems, the Intel Cluster MKL includes the ScaLAPACK libraries (it includes optimized BLAS, LAPACK, and ScaLAPACK)

www.netlib.org/scalapack

M. D. Jones, Ph.D. (CCR/UB) Brief Overview of Parallel Programming Spring 2014 51 / 61

Parallel Program Design Considerations Programming Costs

Programming Costs

Consider:

Complexity: parallel codes can be orders of magnitude more complex (especially those using message passing) - you have to plan for and deal with multiple instruction/data streams

Portability: parallel codes frequently have long lifetimes (proportional to the amount of effort invested in them) - all of the serial application porting issues apply, plus the choice of parallel API (MPI, OpenMP, and POSIX threads are currently good portable choices, but implementations of them can differ from platform to platform)

Resources: overhead for parallel computation can be significant for smaller calculations

Scalability: limitations in hardware (CPU-memory speed and contention, for one example) and in the parallel algorithm will limit speedups. All codes will eventually reach a state of decreasing returns at some point

M. D. Jones, Ph.D. (CCR/UB) Brief Overview of Parallel Programming Spring 2014 53 / 61

Parallel Program Design Considerations Examples

Speedup Limitation Example (MD/NAMD)

JAC (Joint Amber-CHARMM) Benchmark: DHFR protein, 7182 residues, 23558 atoms, 7023 TIP3 waters, PME (9 Å cutoff)

M. D. Jones, Ph.D. (CCR/UB) Brief Overview of Parallel Programming Spring 2014 54 / 61

Parallel Program Design Considerations Examples

[Figure: Joint Amber-CHARMM benchmark, NAMD v2.6b1 on u2 - benchmark MD s/step (0.01 to 1) and parallel speedup vs. number of processors (1 to 256, ppn=2) for the ch_p4 and ch_gm interconnects; one curve reaches a speedup of roughly 88 at 256 processors, while the other peaks near 7 and then falls off.]

M. D. Jones, Ph.D. (CCR/UB) Brief Overview of Parallel Programming Spring 2014 55 / 61

Parallel Program Design Considerations Communication Costs

Communications

All parallel programs will need some communication between tasks, but the amount varies considerably:

minimal communication, also known as "embarrassingly" parallel - think Monte Carlo calculations

robust communication, in which processes must exchange or update information frequently - time evolution codes, domain decomposed finite difference/element applications

M. D. Jones, Ph.D. (CCR/UB) Brief Overview of Parallel Programming Spring 2014 56 / 61

Parallel Program Design Considerations Communication Costs

Communication Considerations

Communication needs between parallel processes affect parallel programs in several important ways:

Cost: there is always overhead when communicating:

  latency - the cost in overhead for a zero-length message (microseconds)

  bandwidth - the amount of data that can be sent per unit time (MBytes/second); many applications use many small messages and are latency-bound

Scope: point-to-point communication is always faster than collective communication (which involves a large subset or all processes)

Efficiency: are you using the best available resource (network) for communicating?
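A simple cost model makes the latency/bandwidth split concrete: t(n) ≈ latency + n/bandwidth. The C sketch below uses illustrative numbers (a 2 µs latency and a 3 GByte/s link), not CCR measurements:

#include <stdio.h>

/* time to send an n-byte message: fixed latency plus n/bandwidth */
static double msg_time(double n_bytes, double latency_s, double bw_Bps) {
    return latency_s + n_bytes / bw_Bps;
}

int main(void) {
    const double latency = 2e-6;   /* 2 microseconds (illustrative) */
    const double bw = 3e9;         /* 3 GByte/s      (illustrative) */
    for (double n = 1; n <= 1e6; n *= 100) {
        double t = msg_time(n, latency, bw);
        printf("%8.0f bytes: %10.2f us (%5.1f%% of the time is latency)\n",
               n, 1e6 * t, 100.0 * latency / t);
    }
    return 0;
}

Small messages are dominated by latency, large ones by bandwidth - which is why latency-bound applications gain little from a faster network unless its latency is also lower.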

M. D. Jones, Ph.D. (CCR/UB) Brief Overview of Parallel Programming Spring 2014 57 / 61

Parallel Program Design Considerations Communication Costs

Latency Example - MPI/CCR

[Figure: MPI PingPong benchmark latency on U2 cluster interconnects - message time (µs, 1 to 10000) vs. message length (10⁰ to 10⁶ bytes) for ch_p4, ch_gm, ch_mx, and DAPL-QDR-IB.]

Latency on CCR - TCP/IP Ethernet/ch_p4, Myrinet/ch_mx, and QDR Infiniband.

M. D. Jones, Ph.D. (CCR/UB) Brief Overview of Parallel Programming Spring 2014 58 / 61

Parallel Program Design Considerations Communication Costs

Bandwidth Example - MPI/CCR

[Figure: MPI PingPong benchmark bandwidth on U2 cluster interconnects - bandwidth (0.01 to 1000 MByte/s) vs. message length (10⁰ to 10⁶ bytes) for ch_p4, ch_gm, ch_mx, and DAPL-QDR-IB.]

Bandwidth on CCR - TCP/IP Ethernet/ch_p4, Myrinet/ch_mx, and QDR Infiniband.

M. D. Jones, Ph.D. (CCR/UB) Brief Overview of Parallel Programming Spring 2014 59 / 61

Parallel Program Design Considerations Communication Costs

Visibility: message-passing codes require the programmer to explicitly manage all communications, but many data-parallel codes do not, masking the underlying data transfers

Synchronicity: synchronous communications are called blocking, as they require an explicit "handshake" between parallel processes - asynchronous (non-blocking) communications offer the ability to overlap computation with communication, but place an extra burden on the programmer. Examples include:

Barriers - force collective synchronization; often implied, and always expensive

Locks/Semaphores - typically protect memory locations from simultaneous/conflicting updates to prevent race conditions
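As a small illustration of the last point, the OpenMP sketch below guards a shared counter with a critical section; without the lock, the concurrent increments would race (the counter and iteration count are arbitrary for the example):

#include <omp.h>
#include <stdio.h>

int main(void) {
    long counter = 0;

    #pragma omp parallel for
    for (int i = 0; i < 100000; i++) {
        /* the critical section acts as a lock: only one thread at a
         * time may update the shared counter                        */
        #pragma omp critical
        counter++;
    }
    printf("counter = %ld (correct only because of the lock)\n", counter);
    return 0;
}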

M. D. Jones, Ph.D. (CCR/UB) Brief Overview of Parallel Programming Spring 2014 60 / 61

Parallel Program Design Considerations Communication Costs

Collective Example - MPI/CCR

[Figure: MPI_Alltoall benchmark - time for a 4 KB-buffer MPI_Alltoall (µs, 1 to 10⁵) vs. number of MPI processes (10 to 1000), for ppn = 1, 2, 4, 6, 12; Intel MPI on 12-core Xeon E5645 nodes with QLogic QDR IB.]

Cost of MPI_Alltoall for a 4 KB buffer on CCR/QDR IB.

M. D. Jones, Ph.D. (CCR/UB) Brief Overview of Parallel Programming Spring 2014 61 / 61