Parallel Programming with MPI and OpenMP


Parallel Programming with MPI and OpenMP

Michael J. Quinn

Chapter 6

Floyd's Algorithm

Chapter Objectives

- Creating 2-D arrays
- Thinking about "grain size"
- Introducing point-to-point communications
- Reading and printing 2-D matrices
- Analyzing performance when computations and communications overlap

Outline

- All-pairs shortest path problem
- Dynamic 2-D arrays
- Parallel algorithm design
- Point-to-point communication
- Block row matrix I/O
- Analysis and benchmarking

All-pairs Shortest Path Problem

(Figure: a weighted, directed graph on the vertices A, B, C, D, and E; its edge weights are the input to the all-pairs shortest path computation.)

Resulting Adjacency Matrix Containing Distances

       A    B    C    D    E
  A    0    6    3    6    4
  B    4    0    7   10    8
  C   12    6    0    3    1
  D    7    3   10    0   11
  E    9    5   12    2    0

Floyd’s Algorithm

for k ← 0 to n-1
   for i ← 0 to n-1
      for j ← 0 to n-1
         a[i,j] ← min (a[i,j], a[i,k] + a[k,j])
      endfor
   endfor
endfor
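A direct C rendering of this pseudocode is sketched below (the function name floyd and the int ** matrix representation are illustrative; the layout matches the dynamic 2-D arrays built later in this chapter):

/* Serial Floyd's algorithm.  After the call, a[i][j] holds the length
   of the shortest path from vertex i to vertex j. */
void floyd (int **a, int n)
{
   int i, j, k;

   for (k = 0; k < n; k++)
      for (i = 0; i < n; i++)
         for (j = 0; j < n; j++)
            if (a[i][k] + a[k][j] < a[i][j])
               a[i][j] = a[i][k] + a[k][j];
}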

Why It Works

(Figure: vertices i, k, and j, with the three paths labeled.)

- Shortest path from i to k through 0, 1, …, k-1
- Shortest path from k to j through 0, 1, …, k-1
- Shortest path from i to j through 0, 1, …, k-1
- The first two were computed in previous iterations, so iteration k only has to compare the current i-to-j distance with the sum a[i,k] + a[k,j]

Dynamic 1-D Array Creation

(Figure: the pointer A lives on the run-time stack and points to the array's storage on the heap.)
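In C this is a single call to malloc: the pointer stays on the run-time stack of the calling function, the elements go on the heap. A minimal sketch (the wrapper name create_vector is illustrative):

#include <stdlib.h>

/* The returned pointer (the caller's copy of A) lives on the
   run-time stack; the n ints it points to live on the heap. */
int *create_vector (int n)
{
   int *A = (int *) malloc (n * sizeof(int));
   return A;
}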

Dynamic 2-D Array Creation

(Figure: the pointers B and Bstorage live on the run-time stack; Bstorage points to one contiguous block of elements on the heap, and B points to an array of row pointers into that block.)
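A sketch of the corresponding 2-D allocation, keeping the slide's names B and Bstorage (the wrapper name create_matrix is illustrative): one contiguous block of m*n elements plus an array of m row pointers, so B[i][j] works and the whole matrix can still be handled as a single buffer.

#include <stdlib.h>

/* Allocate an m x n matrix of ints.  Bstorage is one contiguous
   heap block holding all the elements; B is an array of row
   pointers into it, so B[i][j] addresses element (i, j). */
int **create_matrix (int m, int n)
{
   int *Bstorage = (int *) malloc (m * n * sizeof(int));
   int **B = (int **) malloc (m * sizeof(int *));
   for (int i = 0; i < m; i++)
      B[i] = &Bstorage[i * n];
   return B;
}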

Designing Parallel Algorithm

- Partitioning
- Communication
- Agglomeration and Mapping

Partitioning

- Domain or functional decomposition?
- Look at the pseudocode
- Same assignment statement executed n³ times
- No functional parallelism
- Domain decomposition: divide matrix A into its n² elements

Communication

(Figure: the grid of primitive tasks, showing the update of a[3,4] when k = 1.)

- Iteration k: every task in row k broadcasts its value within its task column
- Iteration k: every task in column k broadcasts its value within its task row

Agglomeration and Mapping

- Number of tasks: static
- Communication among tasks: structured
- Computation time per task: constant
- Strategy:
  - Agglomerate tasks to minimize communication
  - Create one task per MPI process

Two Data Decompositions

- Rowwise block striped
- Columnwise block striped

Comparing Decompositions

- Columnwise block striped
  - Broadcast within columns eliminated
- Rowwise block striped
  - Broadcast within rows eliminated
  - Reading the matrix from a file is simpler
- Choose rowwise block striped decomposition (a sketch of the resulting update loop follows below)
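With the rowwise block striped decomposition, every process needs row k in iteration k, so the owner broadcasts it before the local update. The sketch below assumes, for brevity only, that n is divisible by p, so process id owns the n/p consecutive rows starting at global row id*(n/p); that simplification and the function name are not from the slides.

#include <stdlib.h>
#include <string.h>
#include <mpi.h>

/* Parallel Floyd with a rowwise block striped decomposition.
   a holds this process's n/p rows; tmp receives row k each iteration. */
void parallel_floyd (int **a, int n, int id, int p, MPI_Comm comm)
{
   int rows = n / p;                              /* rows per process (assumes p divides n) */
   int *tmp = (int *) malloc (n * sizeof(int));

   for (int k = 0; k < n; k++) {
      int root = k / rows;                        /* rank that owns global row k */
      if (id == root)
         memcpy (tmp, a[k % rows], n * sizeof(int));
      MPI_Bcast (tmp, n, MPI_INT, root, comm);    /* row k to every process */

      for (int i = 0; i < rows; i++)              /* update this process's rows */
         for (int j = 0; j < n; j++)
            if (a[i][k] + tmp[j] < a[i][j])
               a[i][j] = a[i][k] + tmp[j];
   }
   free (tmp);
}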

File Input

(Figure: process 0 reads the matrix file and passes block rows to the other processes one at a time.)

Pop Quiz

Why don't we input the entire file at once and then scatter its contents among the processes, allowing concurrent message passing?

Point-to-point Communication

- Involves a pair of processes
- One process sends a message
- The other process receives the message
- Send/receive is not collective

Function MPI_Send

int MPI_Send (
   void         *message,
   int           count,
   MPI_Datatype  datatype,
   int           dest,
   int           tag,
   MPI_Comm      comm
)

Function MPI_Recv

int MPI_Recv (
   void         *message,
   int           count,
   MPI_Datatype  datatype,
   int           source,
   int           tag,
   MPI_Comm      comm,
   MPI_Status   *status
)

Coding Send/Receive

…
if (ID == j) {
   …
   Receive from i
   …
}
…
if (ID == i) {
   …
   Send to j
   …
}
…

Receive is before Send. Why does this work?
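Filled out with actual MPI calls, the pattern looks like the minimal program below (the ranks 0 and 1, the single-int message, and tag 0 are illustrative; run it with at least two processes):

#include <stdio.h>
#include <mpi.h>

int main (int argc, char *argv[])
{
   int id, value;
   MPI_Status status;

   MPI_Init (&argc, &argv);
   MPI_Comm_rank (MPI_COMM_WORLD, &id);

   /* The receive appears first in the source, but it executes in a
      different process than the send; MPI_Recv simply blocks until
      the matching message from process 0 arrives. */
   if (id == 1) {
      MPI_Recv (&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
      printf ("Process 1 received %d\n", value);
   }
   if (id == 0) {
      value = 42;
      MPI_Send (&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
   }

   MPI_Finalize ();
   return 0;
}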

Inside MPI_Send and MPI_Recv

(Figure: MPI_Send copies the message out of the sending process's program memory, typically into a system buffer; it then travels to a system buffer on the receiving side, and MPI_Recv copies it into the receiving process's program memory.)

Return from MPI_Send

- Function blocks until the message buffer is free
- The message buffer is free when:
  - The message has been copied to a system buffer, or
  - The message has been transmitted
- Typical scenario:
  - Message copied to a system buffer
  - Transmission overlaps computation

Return from MPI_Recv

- Function blocks until the message is in the buffer
- If the message never arrives, the function never returns

Deadlock

- Deadlock: a process waiting for a condition that will never become true
- Easy to write send/receive code that deadlocks:
  - Two processes: both receive before they send (sketched below)
  - Send tag doesn't match receive tag
  - Process sends a message to the wrong destination process
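The first case is easy to reproduce. In the sketch below (assuming exactly two processes), both processes block in MPI_Recv waiting for a message that neither has sent yet, so neither ever reaches its MPI_Send:

#include <mpi.h>

int main (int argc, char *argv[])
{
   int id, mine = 1, yours;
   MPI_Status status;

   MPI_Init (&argc, &argv);
   MPI_Comm_rank (MPI_COMM_WORLD, &id);

   int other = 1 - id;   /* the other process (assumes exactly 2 processes) */

   /* DEADLOCK: both processes block here, so the sends are never reached. */
   MPI_Recv (&yours, 1, MPI_INT, other, 0, MPI_COMM_WORLD, &status);
   MPI_Send (&mine, 1, MPI_INT, other, 0, MPI_COMM_WORLD);

   MPI_Finalize ();
   return 0;
}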

Computational Complexity

- Innermost loop has complexity Θ(n)
- Middle loop executed at most ⌈n/p⌉ times
- Outer loop executed n times
- Overall complexity: Θ(n³/p)

Communication Complexity

- No communication in the inner loop
- No communication in the middle loop
- Broadcast in the outer loop: complexity is Θ(n log p)
- Overall complexity: Θ(n² log p)

Execution Time Expression (1)

n ⌈n/p⌉ n χ + n ⌈log p⌉ (λ + 4n/β)

- n: iterations of outer loop
- ⌈n/p⌉: iterations of middle loop
- n: iterations of inner loop
- χ: cell update time
- ⌈log p⌉: messages per broadcast
- λ + 4n/β: message-passing time

Computation/communication Overlap

(Figure: once a process has received row k and passed it along, it can begin its own updates while the remaining transmissions of the broadcast are still in progress, so most of the message transmission time is hidden behind computation.)

Execution Time Expression (2)

n ⌈n/p⌉ n χ + n ⌈log p⌉ λ + ⌈log p⌉ (4n/β)

- n: iterations of outer loop
- ⌈n/p⌉: iterations of middle loop
- n: iterations of inner loop
- χ: cell update time
- ⌈log p⌉: messages per broadcast
- λ: message-passing time (latency)
- 4n/β: message transmission
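Once χ, λ, and β have been measured for a particular machine, this expression can be evaluated directly. The sketch below plugs in purely hypothetical parameter values (placeholders, not the measurements behind the table that follows) to show how a prediction for p processes is produced:

#include <math.h>
#include <stdio.h>

int main (void)
{
   /* Hypothetical machine parameters (placeholders, not measured values). */
   double chi    = 25.0e-9;   /* cell update time chi (sec)     */
   double lambda = 250.0e-6;  /* message latency lambda (sec)   */
   double beta   = 1.0e7;     /* bandwidth beta (bytes/sec)     */
   int    n      = 1000;      /* number of vertices             */

   for (int p = 1; p <= 8; p++) {
      double logp = ceil (log2 ((double) p));
      double t = n * ceil ((double) n / p) * n * chi   /* computation          */
               + n * logp * lambda                     /* broadcast latency    */
               + logp * 4.0 * n / beta;                /* message transmission */
      printf ("p = %d: predicted time = %.2f sec\n", p, t);
   }
   return 0;
}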

Predicted vs. Actual Performance

Execution time (sec):

Processes   Predicted   Actual
    1          25.54     25.54
    2          13.02     13.89
    3           9.01      9.60
    4           6.89      7.29
    5           5.86      5.99
    6           5.01      5.16
    7           4.40      4.50
    8           3.94      3.98

Summary

- Two matrix decompositions
  - Rowwise block striped
  - Columnwise block striped
- Blocking send/receive functions
  - MPI_Send
  - MPI_Recv
- Overlapping communications with computations