CPE 779 Parallel Computing - Spring 2012 Lecture:
Brief Overview of Parallel Computing
Brief Overview of Parallel Programming
M. D. Jones, Ph.D.
Center for Computational ResearchUniversity at Buffalo
State University of New York
Spring 2014
M. D. Jones, Ph.D. (CCR/UB) Brief Overview of Parallel Programming Spring 2014 1 / 61
Background
A Little Motivation
Why do we parallel process in the first place? Performance: either solve a bigger problem, or reach a solution faster (or both).
Hardware is becoming (actually, it has been for a while now) intrinsically parallel.
Parallel programming is not easy. How easy is sequential programming? Now add another layer (a sizable one, at that) for parallelism ...
Background
Big Picture
Basic idea of parallel programming is a simple one:
[Figure: a large problem decomposed across multiple CPUs]
in which we decompose a large computational problem into available processing elements (CPUs). How you achieve that decomposition is where all the fun lies ...
Background
Decomposition
[Figure: uniform volume decomposed in 1D, 2D, and 3D]
Uniform volume example - decomposition in 1D, 2D, and 3D. Note that load balancing in this case is simpler if the work load per cell is the same ...
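The uniform case above can be sketched in code. A minimal Python illustration (not from the slides; the function name is ours) of a 1D decomposition that hands each worker a contiguous slab of cells:

```python
# Hypothetical sketch: splitting a uniform row of cells across p workers.
# With uniform work per cell, spreading the remainder over the first
# n_cells % p ranks keeps the imbalance to at most one cell.

def decompose_1d(n_cells, p, rank):
    """Return the half-open [lo, hi) range of cells owned by `rank` of `p`."""
    base, extra = divmod(n_cells, p)
    lo = rank * base + min(rank, extra)
    hi = lo + base + (1 if rank < extra else 0)
    return lo, hi

# Example: 100 cells over 8 workers -> slabs of 13 or 12 cells.
ranges = [decompose_1d(100, 8, r) for r in range(8)]
```

2D and 3D decompositions apply the same splitting independently along each axis.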
Background
Decomposition (continued)
Nonuniform example - decomposition in 3D for a 16-way parallel adaptive finite element calculation (in this case using a terrain elevation). Load balancing in this case is much more complex.
Demand for HPC More Speed!
Why Use Parallel Computation?
Note that, in the modern era, HPC inevitably means parallel (or concurrent) computation. The driving forces behind this are pretty simple - the desire is:
Solve my problem faster, i.e., I want the answer now (and who doesn't want that?)
I want to solve a bigger problem than I (or anyone else, for that matter) have ever before been able to tackle, and do so in a reasonable time (generally, reasonable = within a graduate student's time to graduate!)
Demand for HPC More Speed!
A Concrete Example
Well, more of a discrete example, actually. Let's consider the gravitational N-body problem.

Example: Using classical gravitation, we have a very simple (but long-ranged) force/potential. For each of N bodies, the resulting force is computed from the other N − 1 bodies, thereby requiring O(N^2) force calculations per step. If a galaxy consists of approximately 10^12 such bodies, and even the best algorithm for computing requires N log2 N calculations, that means ≈ 10^12 ln(10^12)/ln(2) ≈ 4 × 10^13 calculations. If each calculation takes ≈ 1 µs, that is ∼ 40 × 10^6 seconds per step. That is about 1.3 CPU-years per step. Ouch!
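The arithmetic in the example can be checked directly; a quick Python sketch of the same estimate:

```python
# Back-of-the-envelope check of the N-body cost estimate above.
import math

N = 1e12                              # bodies in the galaxy
calcs_per_step = N * math.log2(N)     # best-case N log2 N force calculations
seconds_per_calc = 1e-6               # ~1 microsecond per calculation
seconds_per_step = calcs_per_step * seconds_per_calc   # ~4e7 s
cpu_years_per_step = seconds_per_step / (365 * 24 * 3600)  # ~1.3
```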
Demand for HPC More Speed!
A “Size” Example
On the other side, suppose that we need to diagonalize (or invert) a dense matrix being used in, say, an eigenproblem derived from some engineering or multiphysics problem.

Example: In this problem, we want to increase the resolution to capture the essential underlying behavior of the physical process being modeled. So we determine that we need a matrix of order, say, 400000. Simply to store this matrix, in 64-bit representation, requires ≈ 1.28 × 10^12 bytes of memory, or ∼ 1200 GBytes. We could fit this onto a cluster with, say, 10^3 nodes, each having 4 GBytes of memory, by distributing the matrix across the individual memories of each cluster node.
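Again, the storage estimate is easy to verify; a small Python sketch:

```python
# Checking the storage estimate above: a dense matrix of order 400000
# in 64-bit (8-byte) floating point.
n = 400_000
bytes_total = n * n * 8              # 1.28e12 bytes
gbytes_total = bytes_total / 2**30   # ~1192 GB (the slide rounds to ~1200)

# Distributed over a 1000-node cluster with 4 GB per node:
nodes, mem_per_node_gb = 1000, 4
per_node_gb = gbytes_total / nodes   # ~1.2 GB per node -> it fits
```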
Demand for HPC More Speed!
So ...
So the lessons to take from the preceding examples should be clear:
It is the nature of the research (and commercial) enterprise to always be trying to accomplish larger tasks with ever increasing speed, or efficiency. For that matter, it is human nature.
These examples, while specific, apply quite generally - do you see the connection between the first example and general molecular modeling? The second example and finite element (or finite difference) approaches to differential equations? How about web searching?
Demand for HPC Inherent Limitations
Scaling Concepts
By “scaling” we typically mean the relative performance of a parallel vs. serial implementation:

Definition (Scaling): the speedup factor, S(p), is given by

S(p) = (sequential execution time, using the optimal implementation) / (parallel execution time using p processors)

so, in the ideal case, S(p) = p.
Demand for HPC Inherent Limitations
Parallel Efficiency
Using τS as the (best) sequential execution time, we note that

S(p) = τS / τp(p) ≤ p,

which gives an upper bound on the speedup, and the parallel efficiency is given by

E(p) = S(p) / p = τS / (p × τp).
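The two definitions translate directly into code; a trivial Python sketch (function names are ours):

```python
# Speedup and parallel efficiency from measured serial time tau_s and
# parallel time tau_p on p processors, per the definitions above.
def speedup(tau_s, tau_p):
    return tau_s / tau_p

def efficiency(tau_s, tau_p, p):
    return speedup(tau_s, tau_p) / p   # equivalently tau_s / (p * tau_p)

# Ideal case: p processors cut the time by exactly p -> S = p, E = 1.
s_ideal = speedup(100.0, 100.0 / 8)
e_ideal = efficiency(100.0, 100.0 / 8, 8)
```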
Demand for HPC Inherent Limitations
Inherent Limitations in Parallel Speedup
Limitations on the maximum speedup:
A fraction of the code, f, can not be made to execute in parallel
Parallel overhead (communication, duplication costs)
Using this serial fraction, f, we can note that

τp ≥ f τS + (1 − f) τS / p.
Demand for HPC Inherent Limitations
Amdahl’s Law
This simplification for τp leads directly to Amdahl's Law.

Definition (Amdahl's Law):

S(p) ≤ τS / (f τS + (1 − f) τS / p)
     ≤ p / (1 + f (p − 1)).
G. M. Amdahl, “Validity of the Single Processor Approach to Achieving Large-Scale Computing Capabilities,” AFIPS Conference Proceedings 30 (AFIPS Press, Atlantic City, NJ) 483-485, 1967.
Demand for HPC Inherent Limitations
Implications of Amdahl’s Law
The implications of Amdahl's law are pretty straightforward: in the limit as p → ∞,

lim_{p→∞} S(p) ≤ 1/f.

If the serial fraction is 5%, the maximum parallel speedup is only 20.
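The bound is easy to sketch and check in Python (the function name is ours):

```python
# Amdahl's law: upper bound on speedup for serial fraction f on p processors.
def amdahl_bound(p, f):
    return p / (1 + f * (p - 1))

# With f = 5%, the bound approaches 1/f = 20 as p grows.
s_1024 = amdahl_bound(1024, 0.05)   # already ~19.6
```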
Demand for HPC Inherent Limitations
Implications of Amdahl’s Law (cont’d)
[Figure: Amdahl's Law maximum parallel speedup - MAX(S(p)) vs. number of processors p (1-256), for serial fractions f = 0.001, 0.01, 0.02, 0.05, 0.1]
Demand for HPC Inherent Limitations
A Practical Example
Let τp = τcomm + τcomp, where τcomm is the time spent in communication between parallel processes (unavoidable overhead) and τcomp is the time spent in (parallelizable) computation. Then

τ1 ≤ p τcomp,

and

S(p) = τ1 / τp
     ≤ p τcomp / (τcomm + τcomp)
     ≤ p (1 + τcomm/τcomp)^(−1).

The point here is that it is critical to minimize the ratio of communication time to time spent doing computation (a recurring theme).
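A quick Python sketch of that bound (the function name and timings are ours, purely illustrative):

```python
# Speedup cap from the communication-to-computation ratio tau_comm/tau_comp,
# per the bound S(p) <= p / (1 + tau_comm/tau_comp) derived above.
def speedup_bound(p, tau_comm, tau_comp):
    return p / (1 + tau_comm / tau_comp)

# Spending 10% of each parallel step communicating already caps S at ~0.91 p.
s = speedup_bound(64, tau_comm=0.1, tau_comp=1.0)   # ~58.2 instead of 64
```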
Demand for HPC Inherent Limitations
Defeating Amdahl’s Law
There are ways to work around the serious implications of Amdahl's law:
We assumed that the problem size was fixed, which is (very) often not the case.
Now consider a case where the problem size is allowed to vary.
Assume that the problem size is scaled such that τp is held constant.
Demand for HPC Inherent Limitations
Gustafson’s Law
Now let τp be held constant as p is increased:

Ss(p) = τS / τp
      = (f τS + (1 − f) τS) / τp
      = (f τS + (1 − f) τS) / (f τS + (1 − f) τS / p)
      = p / (1 − (1 − p) f)
      ≈ p + p (1 − p) f + ...

Another way of looking at this is that the serial fraction becomes negligible as the problem size is scaled. Actually, that is a pretty good definition of a scalable code ...

J. L. Gustafson, “Reevaluating Amdahl's Law,” Comm. ACM 31(5), 532-533 (1988).
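A Python sketch of the scaled-speedup argument (the f values are ours, chosen only to illustrate the trend):

```python
# Same bound as Amdahl, p / (1 + (p - 1) f), but under scaling the serial
# fraction f shrinks with problem size, so the bound climbs back toward p.
def scaled_speedup(p, f):
    return p / (1 + (p - 1) * f)

# Fixed problem: f = 0.05 caps speedup near 20 even at p = 256.
fixed = scaled_speedup(256, 0.05)       # ~18.6
# Scaled problem: if growing the problem drives f down to 0.0005,
# 256 processors deliver ~227x.
scaled = scaled_speedup(256, 0.0005)
```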
Demand for HPC Inherent Limitations
Scalability
Definition (Scalable): an algorithm is scalable if there is a minimal nonzero efficiency as p → ∞ and the problem size is allowed to take on any value.

I like this (equivalent) one better:

Definition (Scalable): for a scalable code, the sequential fraction becomes negligible as the problem size (and the number of processors) grows.
Overview of Parallel APIs Basic Terminology
The Parallel Zoo
There are many parallel programming models, but roughly they break down into the following categories:
Threaded models: e.g., OpenMP or POSIX threads as the application programming interface (API)
Message passing: e.g., MPI = Message Passing Interface
Hybrid methods: combine elements of other models
Overview of Parallel APIs Basic Terminology
Thread Models
[Figure: a single program a.out with shared data, forking threads 1..N that run and join over time]
Typical thread model (e.g., OpenMP)
Overview of Parallel APIs Basic Terminology
Thread Models
Shared address space (all threads can access the data of program a.out).
Generally limited to a single machine, except in very specialized cases.
GPGPU (general-purpose GPU computing) uses a thread model (with a large number of threads on the host machine).
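The shared-address-space idea can be sketched with Python threads (the lecture's real examples use OpenMP or Pthreads; this is only an illustration of the model, not of either API):

```python
# All threads of one process see the same data: each worker writes its own
# slot of a shared list, and every update is visible after the join.
import threading

data = [0] * 8                 # one array, visible to every thread

def worker(tid):
    data[tid] = tid * tid      # each thread writes a distinct slot: no race

threads = [threading.Thread(target=worker, args=(t,)) for t in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```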
Overview of Parallel APIs Basic Terminology
Message Passing Models
[Figure: N tasks (task 0 ... task N−1), each its own a.out with private data and PID, exchanging msg send/recv and collective communication (broadcast/reduce/...) over time]
Typical message passing model (e.g., MPI).
Overview of Parallel APIs Basic Terminology
Message Passing Models
Separate address space (tasks can not access the data of other tasks without sending messages or participating in collective communications).
General purpose model - messages can be sent over a network, shared memory, etc.
Managing communication is generally up to the programmer.
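The separate-address-space model can be sketched with Python processes (MPI is the lecture's real example; this only illustrates explicit send/recv between tasks that share no memory):

```python
# Each task is a separate process with private data; the ONLY way to share
# that data is to send it as a message (here over a pipe).
import multiprocessing as mp

def task(rank, conn):
    local = rank * 10          # private data, invisible to other tasks
    conn.send(local)           # explicit message to the parent
    conn.close()

parent_ends, procs = [], []
for rank in range(4):
    parent, child = mp.Pipe()
    p = mp.Process(target=task, args=(rank, child))
    p.start()
    procs.append(p)
    parent_ends.append(parent)

gathered = [c.recv() for c in parent_ends]   # a hand-rolled "gather"
for p in procs:
    p.join()
```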
Overview of Parallel APIs Basic Terminology
Parallelism at Three Levels
Task - macroscopic view in which an algorithm is organized around separate concurrent tasks
Data - structure data in (separately) updateable “chunks”
Instruction - ILP through the “flow” of data in a predictable fashion
Overview of Parallel APIs Basic Terminology
Approaching the Problem
The first step in deciding where to add parallelism to an algorithm or application is usually to analyze it from the point of view of tasks and data:
Tasks: How do I reduce this problem into a set of tasks that can be executed concurrently?
Data: How do I take the key data and represent it in such a way that large chunks can be operated on independently (and thus concurrently)?
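The data question above can be sketched in plain Python (an illustration of ours, not from the slides): carve the key data into independent chunks, operate on each separately, then combine.

```python
# Data decomposition: split a list into n nearly equal chunks; each partial
# sum is independent work, and the final combine is a single reduction.
def chunks(seq, n):
    k, m = divmod(len(seq), n)
    out, lo = [], 0
    for i in range(n):
        hi = lo + k + (1 if i < m else 0)
        out.append(seq[lo:hi])
        lo = hi
    return out

data = list(range(1, 101))
partials = [sum(c) for c in chunks(data, 4)]  # each chunk is independent
total = sum(partials)                         # combine step
```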
Overview of Parallel APIs Basic Terminology
Dependencies
Bear in mind that you always want to minimize the dependencies between your concurrent tasks and data:
On the task level, design your tasks to be as independent as possible - this can also be temporal, namely, take into account any necessity for ordering the tasks and their execution
Sharing of data will require synchronization and thus introduce data dependencies, if not race conditions or contention
Overview of Parallel APIs Basic Terminology
An Hour of Planning ...
Taking the time to carefully analyze an existing application, or plan a new one, can really pay off later:
Plan for parallel programming by keeping data and task parallel constructs/opportunities in mind
Analyze an existing or new program to discover/confirm the locations of the time-consuming portions
Optimize by implementing a strategy using data or task parallel constructs
Overview of Parallel APIs Auto-Parallelization
Auto-Parallelization
Often called “implicit” parallelism; some compilers (usually commercial ones) have the ability to:
automatically parallelize simple (outer) loops (thread-level parallelism, or TLP)
automatically vectorize (usually innermost loops) using ILP
Note that thread-level parallelism is only usable on an SMP architecture.
Overview of Parallel APIs Auto-Vectorization
Auto-Vectorization
Like auto-parallelization, a feature of high-performance (often commercial) compilers on hardware that supports at least some level of vectorization:
the compiler finds low-level operations that can operate simultaneously on multiple data elements using a single instruction
Intel/AMD processors with SSE/SSE2/SSE3 are usually limited to vector lengths of 2^1 - 2^3 (a hardware limitation that true “vector” processors do not share)
the user can exert some control through compiler directives (usually vendor specific) and data organization focused on vector operations
Overview of Parallel APIs Auto-Vectorization
Downside of Automatic Methods
The automatic parallelization schemes have a number of shortfalls:
Very limited in scope - parallelizes only a very small subset of code
Without directives from the programmer, the compiler is very limited in identifying parallelizable regions of code
Performance is rather easily degraded rather than sped up
Wrong results are easy to (automatically) produce
Overview of Parallel APIs Shared Memory APIs
Shared Memory Parallelism
Here we consider “explicit” modes of parallel programming on shared memory architectures:
HPF, High Performance Fortran
SHMEM, the old Cray shared memory library
Pthreads, the POSIX threads standard (available on Windows as well)
OpenMP, common specification for compiler directives and API
Overview of Parallel APIs Distributed Memory APIs
Distributed Memory Parallel Programming
In this case, mainly message passing, with a few other (compiler directive) possibilities:
HPF, High Performance Fortran
Cluster OpenMP, a product from Intel to push OpenMP into a distributed memory regime
PVM, Parallel Virtual Machine
MPI, Message Passing Interface, which has become the dominant API for general purpose parallel programming
UPC, Unified Parallel C, extensions to ISO C99 for parallel programming (new)
CAF, Co-Array Fortran, parallel extensions to Fortran 95/2003 (new; officially part of Fortran 2008)
Overview of Parallel APIs APIs/Language Extensions & Availability
OpenMP
Synopsis: designed for multi-platform shared memory parallel programming in C/C++/Fortran, primarily through the use of compiler directives and environment variables (library routines also available).
Target: shared memory systems.
Current Status: specification 1.0 released in 1997, 2.0 in 2002, 2.5 in 2005, 3.0 in 2008, 3.1 in 2011. Widely implemented by commercial compiler vendors, also now in most open-source compilers (4.0 specification released in 2013-07, not yet available in production compilers).
More Info: the OpenMP home page: http://www.openmp.org
Overview of Parallel APIs APIs/Language Extensions & Availability
OpenMP Availability
Platform               Version   Invocation (example)
Linux x86_64 (Intel)   3.1       ifort -openmp -openmp_report2 ...
Linux x86_64 (PGI)     3.0       pgf90 -mp ...
GNU (>=4.2)            2.5       gfortran -fopenmp ...
GNU (>=4.4)            3.0       gfortran -fopenmp ...
GNU (>=4.7)            3.1       gfortran -fopenmp ...
Overview of Parallel APIs APIs/Language Extensions & Availability
MPI
Message Passing Interface
Synopsis: message passing library specification proposed as a standard by a committee of vendors, implementors, and users.
Target: distributed and shared memory systems.
Current Status: most popular (and most portable) of the message passing APIs.
More Info: ANL's main MPI site: www-unix.mcs.anl.gov/mpi
Overview of Parallel APIs APIs/Language Extensions & Availability
MPI Availability at CCR
Platform       Version (+MPI-2)
Linux IA64     1.2+ (C++, MPI-I/O)
Linux x86_64   1.2+ (C++, MPI-I/O), 2.x (various)

I am starting to favor the commercial Intel MPI for its ease of use, especially in terms of supporting multiple networks/protocols.
Overview of Parallel APIs APIs/Language Extensions & Availability
Unified Parallel C (UPC)
Unified Parallel C (UPC) is a project based at Lawrence Berkeley Laboratory to extend C (ISO C99) and provide large-scale parallel computing on both shared and distributed memory hardware:
Explicit parallel execution (SPMD)
Shared address space (shared/private data, like OpenMP), but exploits data locality
Primitives for memory management
BSD license (free download)
http://upc.lbl.gov/, community website at http://upc.gwu.edu/
Overview of Parallel APIs APIs/Language Extensions & Availability
Simple example of some UPC syntax:

shared int all_hits[THREADS];
...
for (i = 0; i < my_trials; i++) my_hits += hit();
all_hits[MYTHREAD] = my_hits;
upc_barrier;
if (MYTHREAD == 0) {
    total_hits = 0;
    for (i = 0; i < THREADS; i++) {
        total_hits += all_hits[i];
    }
    pi = 4.0*((double)total_hits)/((double)trials);
    printf("PI estimated to %10.7f from %d trials on %d threads.\n",
           pi, trials, THREADS);
}
Overview of Parallel APIs APIs/Language Extensions & Availability
Co-Array Fortran
Co-Array Fortran (CAF) is an extension to Fortran (Fortran 95/2003) to provide data decomposition for parallel programs (somewhat akin to UPC and HPF):
Original specification by Numrich and Reid, ACM Fortran Forum 17(2), 1-31 (1998).
The ISO Fortran committee included co-arrays in the next revision of the Fortran standard, Fortran 2008 (cf. Numrich and Reid, ACM Fortran Forum 24(2), 4-17 (2005)).
Does not require shared memory (more broadly applicable than OpenMP)
Early compiler work at Rice: http://www.hipersoft.rice.edu/caf/index.html
Overview of Parallel APIs APIs/Language Extensions & Availability
Simple examples of CAF usage:

REAL, DIMENSION(N)[*] :: X, Y
X = Y[PE]        ! get from Y[PE]
Y[PE] = X        ! put into Y[PE]
Y[:] = X         ! broadcast X
Y[LIST] = X      ! broadcast X over subset of PEs in array LIST
Z(:) = Y[:]      ! collect all Y
S = MINVAL(Y[:]) ! min (reduce) all Y
B(1:M)[1:N] = S  ! S scalar, promoted to array of shape (1:M,1:N)

UPC and CAF are examples of partitioned global address space (PGAS) languages, for which a large push is being made by AHPCRC/DARPA as part of the Petascale computing initiative (there is also one based on Java called Titanium).
Libraries
Leveraging Existing Parallel Libraries
One “easy” way to utilize parallel programming resources is through the use of existing parallel libraries. The most common examples (note the orientation around standard mathematical routines):
BLAS - Basic Linear Algebra Subroutines
LAPACK - Linear Algebra PACKage (uses BLAS)
ScaLAPACK - distributed memory (MPI) solvers for common LAPACK functions
PETSc - Portable, Extensible Toolkit for Scientific Computation
FFTW - Fastest Fourier Transforms in the West
Libraries
BLAS
The Basic Linear Algebra Subroutines (BLAS) form a standard set of library functions for vector and matrix operations:
Level 1: vector-vector (e.g., xaxpy, xdot, where x = s, d, c, z)
Level 2: matrix-vector (e.g., xgemv, where x = s, d, c, z)
Level 3: matrix-matrix (e.g., xgemm, where x = s, d, c, z)
These routines are generally provided by vendors, hand-tuned at the level of assembly code for optimum performance on a particular processor. Shared memory (multithreaded) and distributed memory versions are available.
www.netlib.org/blas
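The three levels can be illustrated with naive pure-Python routines (a sketch of ours; real BLAS implementations are hand-tuned vendor code, and the names only mirror the daxpy/dgemv/dgemm conventions):

```python
# Naive reference versions of one routine per BLAS level.
def daxpy(alpha, x, y):                 # Level 1: y <- alpha*x + y
    return [alpha * xi + yi for xi, yi in zip(x, y)]

def dgemv(A, x):                        # Level 2: y <- A x
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def dgemm(A, B):                        # Level 3: C <- A B
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols]
            for row in A]
```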
Libraries
Vendor BLAS
Vendor implementations of the BLAS:
AMD: ACML
Apple: Velocity Engine
Compaq: CXML
Cray: libsci
HP: MLIB
IBM: ESSL
Intel: MKL
NEC: PDLIB/SX
SGI: SCSL
Sun: Sun Performance Library
Libraries
Performance Example
[Figure: DDOT performance on a U2 compute node - MFlop/s (0-5000) vs. vector length (10^1 to 10^6 8-byte words) for Intel MKL 8.1.1 vs. the reference BLAS (-lblas), with L1 cache, L2 cache, and main memory regions marked]
Performance advantage of Intel MKL ddot versus the reference (system) version. Level 3 BLAS routines have even more significant gains.
Libraries
LAPACK
The Linear Algebra PACKage (LAPACK) is a library that sits on top of the BLAS (for optimum performance and parallelism) and provides:
solvers for systems of simultaneous linear equations
least-squares solutions
eigenvalue problems
singular value problems
On CCR systems, the Intel MKL is generally preferred (includes optimized BLAS and LAPACK)
www.netlib.org/lapack
Libraries
ScaLAPACK
The Scalable LAPACK library, ScaLAPACK, is designed for use on the largest of problems (typically implemented on distributed memory systems):
a subset of LAPACK routines redesigned for distributed memory MIMD parallel computers
explicit message passing for interprocessor communication
assumes matrices are laid out in a two-dimensional block cyclic decomposition
On CCR systems, the Intel Cluster MKL includes ScaLAPACK libraries (optimized BLAS, LAPACK, and ScaLAPACK)
www.netlib.org/scalapack
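The two-dimensional block cyclic layout mentioned above can be sketched in a few lines (a minimal illustration of ours: which process in a Pr × Pc grid owns global entry (i, j), given block size nb):

```python
# Block-cyclic ownership: blocks of nb x nb entries are dealt out over the
# process grid like cards, cycling in both the row and column dimensions.
def owner(i, j, nb, p_rows, p_cols):
    return ((i // nb) % p_rows, (j // nb) % p_cols)

# 2x2 process grid, 2x2 blocks, 8x8 matrix: the owner pattern repeats
# every p_rows*nb rows and p_cols*nb columns.
grid = [[owner(i, j, nb=2, p_rows=2, p_cols=2) for j in range(8)]
        for i in range(8)]
```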
Parallel Program Design Considerations Programming Costs
Programming Costs
Consider:
Complexity: parallel codes can be orders of magnitude more complex (especially those using message passing) - you have to plan for and deal with multiple instruction/data streams
Portability: parallel codes frequently have long lifetimes (proportional to the amount of effort invested in them) - all of the serial application porting issues apply, plus the choice of parallel API (MPI, OpenMP, and POSIX threads are currently good portable choices, but implementations of them can differ from platform to platform)
Resources: overhead for parallel computation can be significant for smaller calculations
Scalability: limitations in hardware (CPU-memory speed and contention, for one example) and in the parallel algorithms will limit speedups. All codes will eventually reach a state of decreasing returns at some point
Parallel Program Design Considerations Examples
Speedup Limitation Example (MD/NAMD)
JAC (Joint Amber-CHARMM) benchmark: DHFR protein, 7182 residues, 23558 atoms, 7023 TIP3 waters, PME (9 Å cutoff)
Parallel Program Design Considerations Examples
[Figure: Joint Amber-CHARMM benchmark, NAMD v2.6b1 on U2 - benchmark MD s/step and parallel speedup vs. number of processors (ppn=2, 1-256) for ch_p4 and ch_gm; one interconnect's speedup climbs 1, 1.9, 3.67, 7.3, 14.2, 27.2, 49.6, 79.9, 88, while the other peaks at 7.01 and falls back to 2.8]
Parallel Program Design Considerations Communication Costs
Communications
All parallel programs will need some communication between tasks, but the amount varies considerably:
minimal communication, also known as “embarrassingly” parallel - think Monte Carlo calculations
robust communication, in which processes must exchange or update information frequently - time evolution codes, domain decomposed finite difference/element applications
Parallel Program Design Considerations Communication Costs
Communication Considerations
Communication needs between parallel processes affect parallel programs in several important ways:
Cost: there is always overhead when communicating:
  latency - the cost in overhead for a zero-length message (microseconds)
  bandwidth - the amount of data that can be sent per unit time (MBytes/second); many applications use many small messages and are latency-bound
Scope: point-to-point communication is always faster than collective communication (which involves a large subset of, or all, processes)
Efficiency: are you using the best available resource (network) for communicating?
Parallel Program Design Considerations Communication Costs
Latency Example - MPI/CCR
[Figure: MPI PingPong benchmark, U2 cluster interconnects - message time (µs) vs. message length (10^0 to 10^6 bytes) for ch_mx, ch_gm, ch_p4, and DAPL-QDR-IB]
Latency on CCR - TCP/IP Ethernet/ch_p4, Myrinet/ch_mx, and QDR Infiniband.
Parallel Program Design Considerations Communication Costs
Bandwidth Example - MPI/CCR
[Figure: MPI PingPong benchmark, U2 cluster interconnects - bandwidth (MByte/s) vs. message length (10^0 to 10^6 bytes) for ch_mx, ch_gm, ch_p4, and DAPL-QDR-IB]
Bandwidth on CCR - TCP/IP Ethernet/ch_p4, Myrinet/ch_mx, and QDR Infiniband.
Parallel Program Design Considerations Communication Costs
Visibility: message-passing codes require the programmer to explicitly manage all communications, but many data-parallel codes do not, masking the underlying data transfers
Synchronicity: synchronous communications are called blocking, as they require an explicit “handshake” between parallel processes - asynchronous (non-blocking) communications offer the ability to overlap computation with communication, but place an extra burden on the programmer. Examples include:
Barriers - force collective synchronization; often implied, and always expensive
Locks/Semaphores - typically protect memory locations from simultaneous/conflicting updates, to prevent race conditions
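The lock case can be sketched with Python threads (an illustration of ours, not from the slides): a shared counter protected from simultaneous conflicting updates.

```python
# Protecting a shared counter with a lock: without it, concurrent
# read-modify-write updates could race and lose increments.
import threading

counter = 0
lock = threading.Lock()

def add_many(n):
    global counter
    for _ in range(n):
        with lock:          # only one thread updates at a time
            counter += 1

threads = [threading.Thread(target=add_many, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# With the lock, the result is exactly 4 * 10000.
```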
Parallel Program Design Considerations Communication Costs
Collective Example - MPI/CCR
[Figure: MPI_Alltoall benchmark, Intel MPI, 4 KB buffer, 12-core nodes with QLogic QDR IB (Xeon E5645) - t_alltoall(4KB) in µs vs. number of MPI processes (10-1000) for ppn = 1, 2, 4, 6, 12]
Cost of MPI_Alltoall for a 4 KB buffer on CCR/QDR IB.