Parallel Programming and MPI
An Advanced Simulation & Computing (ASC) Academic Strategic Alliances Program (ASAP) Center
at The University of Chicago
The Center for Astrophysical Thermonuclear Flashes
FLASH Tutorial, May 13, 2004
Parallel Computing and MPI
What is Parallel Computing? And why is it useful?
Parallel computing is more than one CPU working together on one problem
It is useful when the problem is large and could take very long, or when the data are too big to fit in the memory of one processor
When to parallelize: when the problem can be subdivided into relatively independent tasks
How much to parallelize: as long as the speedup relative to a single processor remains of the order of the number of processors
Parallel paradigms
SIMD – Single Instruction Multiple Data: processors work in lock-step
MIMD – Multiple Instruction Multiple Data: processors do their own thing, with occasional synchronization
Shared Memory: one-way communications
Distributed Memory: message passing
Loosely Coupled: the process on each CPU is fairly self-contained and relatively independent of processes on other CPUs
Tightly Coupled: CPUs need to communicate with each other frequently
How to Parallelize
Divide a problem into a set of mostly independent tasks
Partition the problem: tasks get their own data
Localize each task: it operates on its own data for the most part; try to make it self-contained
Occasionally, data may be needed from other tasks: inter-process communication
Synchronization may be required between tasks: global operations
Map tasks to different processors: one processor may get more than one task; the task distribution should be well balanced
New Code Components
Initialization
Query parallel state: identify process, identify number of processes
Exchange data between processes: local, global
Synchronization: barriers, blocking communication, locks
Finalization
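To make these components concrete, here is a minimal sketch in C (not from the original slides): initialization, querying the number of processes and this process's rank, a simple block partition of an assumed problem size N, and finalization. The helper block_range and the value of N are illustrative assumptions, not part of MPI or FLASH.

#include <mpi.h>
#include <stdio.h>

/* Hypothetical helper: compute the [lo, hi) index block owned by `rank`
   when N items are distributed as evenly as possible over `nprocs`. */
static void block_range(int N, int rank, int nprocs, int *lo, int *hi)
{
    int base = N / nprocs;            /* minimum items per rank          */
    int rem  = N % nprocs;            /* first `rem` ranks get one extra */
    *lo = rank * base + (rank < rem ? rank : rem);
    *hi = *lo + base + (rank < rem ? 1 : 0);
}

int main(int argc, char **argv)
{
    int rank, nprocs, lo, hi;
    const int N = 1000;               /* illustrative problem size */

    MPI_Init(&argc, &argv);                    /* initialization        */
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);    /* number of processes   */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);      /* identify this process */

    block_range(N, rank, nprocs, &lo, &hi);
    printf("rank %d of %d owns indices [%d, %d)\n", rank, nprocs, lo, hi);

    MPI_Finalize();                            /* finalization          */
    return 0;
}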
MPI
Message Passing Interface: a standard for the distributed-memory model of parallelism
MPI-2 will support one-way communication, commonly associated with shared-memory operations
Works with communicators: a communicator is a collection of processes; MPI_COMM_WORLD is the default
Has support for the lowest-level communication operations as well as composite operations
Has blocking and non-blocking operations
Communicators
[Figure: two communicators, COMM1 and COMM2, each grouping a subset of processes]
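One standard way to create communicators such as COMM1 and COMM2 is MPI_Comm_split, which partitions an existing communicator by a "color" value; the even/odd split below is purely an illustrative choice.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int world_rank, world_size, sub_rank, sub_size;
    MPI_Comm sub_comm;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    /* Split MPI_COMM_WORLD into two sub-communicators:
       even world ranks form one group, odd ranks the other. */
    int color = world_rank % 2;
    MPI_Comm_split(MPI_COMM_WORLD, color, world_rank, &sub_comm);

    MPI_Comm_rank(sub_comm, &sub_rank);
    MPI_Comm_size(sub_comm, &sub_size);
    printf("world rank %d of %d -> group %d, rank %d of %d\n",
           world_rank, world_size, color, sub_rank, sub_size);

    MPI_Comm_free(&sub_comm);
    MPI_Finalize();
    return 0;
}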
Low level Operations in MPI
MPI_Init
MPI_Comm_size – find number of processors
MPI_Comm_rank – find my processor number
MPI_Send/Recv – communicate with other processors one at a time
MPI_Bcast – global data transmission
MPI_Barrier – synchronization
MPI_Finalize
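A minimal sketch combining these low-level calls, assuming a made-up exchange in which every rank sends its rank number to rank 0, and rank 0 then broadcasts a value back to everyone:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, nprocs, value = 0;

    MPI_Init(&argc, &argv);                     /* initialization       */
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);     /* how many processes?  */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);       /* which one am I?      */

    if (rank == 0) {
        /* Point-to-point: collect one integer from every other rank. */
        for (int src = 1; src < nprocs; src++) {
            MPI_Recv(&value, 1, MPI_INT, src, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("rank 0 received %d from rank %d\n", value, src);
        }
        value = 42;                             /* illustrative payload */
    } else {
        MPI_Send(&rank, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    }

    /* Broadcast: rank 0 sends `value` to everyone. */
    MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);

    MPI_Barrier(MPI_COMM_WORLD);                /* synchronize          */
    MPI_Finalize();
    return 0;
}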
Advanced Constructs in MPI
Composite operations: Gather/Scatter, Allreduce, Alltoall
Cartesian grid operations: Shift
Communicators: creating subgroups of processors to operate on
User-defined datatypes
I/O: parallel file operations
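As an example of a composite operation, the sketch below uses MPI_Allreduce to compute a global sum and a global maximum; the per-rank local value is an arbitrary stand-in for a physical quantity.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Each rank contributes a local value; Allreduce combines them
       and returns the result on every rank. */
    double local = (double)(rank + 1);   /* illustrative local quantity */
    double global_sum, global_max;

    MPI_Allreduce(&local, &global_sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    MPI_Allreduce(&local, &global_max, 1, MPI_DOUBLE, MPI_MAX, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum = %g, max = %g over %d ranks\n",
               global_sum, global_max, nprocs);

    MPI_Finalize();
    return 0;
}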
Communication Patterns
[Figure: communication patterns among four processes (0–3): Point to Point, Collective, Shift, All to All, One to All Broadcast]
Communication Overheads
Latency vs. bandwidth
Blocking vs. non-blocking: overlap, buffering and copy
Scale of communication: nearest neighbor, short range, long range
Volume of data: resource contention for links
Efficiency: hardware, software, communication method
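One common way to hide latency is to overlap communication with computation using non-blocking calls. The sketch below posts MPI_Irecv/MPI_Isend for a ring exchange, does some stand-in work, and only then waits for completion; the ring pattern and the dummy work loop are illustrative assumptions.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Exchange a value with the neighboring ranks in a ring. */
    int right = (rank + 1) % nprocs;
    int left  = (rank - 1 + nprocs) % nprocs;
    int sendbuf = rank, recvbuf = -1;
    MPI_Request reqs[2];

    /* Post non-blocking receive and send, ... */
    MPI_Irecv(&recvbuf, 1, MPI_INT, left,  0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(&sendbuf, 1, MPI_INT, right, 0, MPI_COMM_WORLD, &reqs[1]);

    /* ... do useful work that does not need `recvbuf` while the
       messages are in flight (here just a stand-in computation) ... */
    double work = 0.0;
    for (int i = 0; i < 1000000; i++) work += 1e-6 * i;

    /* ... then wait for completion before using the received data. */
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    printf("rank %d received %d from rank %d (work=%g)\n",
           rank, recvbuf, left, work);

    MPI_Finalize();
    return 0;
}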
Parallelism in FLASH
Short-range communications: nearest neighbor
Long-range communications: regridding
Other global operations: all-reduce operations on physical quantities
Specific to solvers: multipole method, FFT-based solvers
Domain Decomposition
[Figure: a 2D domain decomposed into four blocks, one per processor: P0, P1, P2, P3]
Border Cells / Ghost Points
When splitting up solnData, data are needed from other processors.
A layer of cells is needed from each neighboring processor
These ghost cells need to be updated each time step
Border/Ghost Cells
Short-range communication
Two MPI Methods for doing it
Method 1 – Cartesian topology:
  MPI_Cart_create – create the topology
  MPE_Decomp1d – domain decomposition on the topology
  MPI_Cart_shift – who is on the left/right?
  MPI_Sendrecv – ghost cells left
  MPI_Sendrecv – ghost cells right
Method 2 – manual decomposition:
  MPI_Comm_rank, MPI_Comm_size – manually decompose the grid over processors
  Calculate left/right neighbors
  MPI_Send/MPI_Recv – ordered carefully to avoid deadlocks
A sketch of the first method follows.
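This is a minimal sketch of the Cartesian topology method for a 1D ghost-cell exchange, using MPI_Cart_create, MPI_Cart_shift, and MPI_Sendrecv. The periodic boundary, the local array size NLOCAL, and the omission of MPE_Decomp1d (a helper from the MPE library) are illustrative simplifications.

#include <mpi.h>
#include <stdio.h>

#define NLOCAL 8   /* interior cells per rank (illustrative) */

int main(int argc, char **argv)
{
    int rank, nprocs, left, right;
    MPI_Comm cart;
    double u[NLOCAL + 2];   /* interior cells plus one ghost cell on each side */

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* 1D periodic Cartesian topology over all processes. */
    int dims[1] = { nprocs }, periods[1] = { 1 };
    MPI_Cart_create(MPI_COMM_WORLD, 1, dims, periods, 1, &cart);
    MPI_Comm_rank(cart, &rank);

    /* Who is on my left and right in the topology? */
    MPI_Cart_shift(cart, 0, 1, &left, &right);

    /* Fill interior cells with something rank-dependent. */
    for (int i = 1; i <= NLOCAL; i++) u[i] = rank;

    /* Exchange ghost cells: send rightmost interior cell to the right
       neighbor while receiving the left ghost cell from the left neighbor ... */
    MPI_Sendrecv(&u[NLOCAL], 1, MPI_DOUBLE, right, 0,
                 &u[0],      1, MPI_DOUBLE, left,  0,
                 cart, MPI_STATUS_IGNORE);
    /* ... and the mirror image for the other direction. */
    MPI_Sendrecv(&u[1],          1, MPI_DOUBLE, left,  1,
                 &u[NLOCAL + 1], 1, MPI_DOUBLE, right, 1,
                 cart, MPI_STATUS_IGNORE);

    printf("rank %d: left ghost=%g, right ghost=%g\n", rank, u[0], u[NLOCAL + 1]);

    MPI_Comm_free(&cart);
    MPI_Finalize();
    return 0;
}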
Adaptive Grid Issues
Discretization not uniform
Simple left-right guard cell fills inadequate
Adjacent grid points may not be mapped to the nearest neighbors in the processor topology
Redistribution of work necessary
Regridding
Change in number of cells/blocks
Some processors get more work than others: load imbalance
Redistribute data to even out work on all processors
Long-range communications; large quantities of data moved
Other parallel operations in FLASH
Global max/sum etc. (Allreduce): physical quantities, in solvers, performance monitoring
Alltoall: FFT-based solver on UG
User-defined datatypes and file operations: parallel I/O
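A generic sketch of a parallel file write with MPI-IO, in which every rank writes its own block of a shared file at a rank-dependent offset. The file name output.dat and the block size are assumptions for illustration; this is not FLASH's actual I/O code.

#include <mpi.h>

#define NLOCAL 4   /* values written per rank (illustrative) */

int main(int argc, char **argv)
{
    int rank;
    double buf[NLOCAL];
    MPI_File fh;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (int i = 0; i < NLOCAL; i++) buf[i] = rank + 0.1 * i;

    /* Every rank writes its block at a rank-dependent offset in the
       same shared file; the collective call lets the MPI library
       coordinate and optimize the file access. */
    MPI_File_open(MPI_COMM_WORLD, "output.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    MPI_Offset offset = (MPI_Offset)rank * NLOCAL * sizeof(double);
    MPI_File_write_at_all(fh, offset, buf, NLOCAL, MPI_DOUBLE,
                          MPI_STATUS_IGNORE);
    MPI_File_close(&fh);

    MPI_Finalize();
    return 0;
}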