Introduction to the Message Passing Interface (MPI)
Transcript of Introduction to the Message Passing Interface (MPI)
Jan Fostier
May 3rd, 2017
Outline
• Distributed-memory architecture: general considerations
• Programming model: Message Passing Interface (MPI)
  § Point-to-point communication
    o Blocking communication
    o Point-to-point network performance
    o Non-blocking communication
  § Collective communication
    o Collective communication algorithms
    o Global network performance
• Parallel program performance evaluation
  § Amdahl's law
  § Gustafson's law
• Parallel program development: case studies
Moore's Law
"Transistor count doubles every two years"
Illustration from Wikipedia
Moore's Law
[Figure: comparison of processors from 2000 and 2011; illustration from Wikipedia]
Evolution of top 500 supercomputers over time
[Figure: peak performance of the top 500 supercomputers over time; image taken from www.top500.org]
• 17.6 PFlop/s (#1 ranked)
• 76 TFlop/s (#500 ranked)
Exponential increase of supercomputer peak performance over time
Application area - performance share
[Figure: performance share per application area, dominated by "research" and "unknown"; image taken from www.top500.org]
Operating system family
[Figure: performance share per operating system family: "Linux", "Unix", "Mixed"]
Motivation for parallel computing
• Want to run the same program faster
  § Depends on the application what is considered an acceptable runtime
    o SETI@Home, Folding@Home, GIMPS: years may be acceptable
    o For R&D applications: days or even weeks are acceptable
      - CFD, CEM, Bioinformatics, Cheminformatics, and many more
    o Prediction of tomorrow's weather should take less than a day of computation time
    o Some applications require real-time behavior
      - Computer games, algorithmic trading
• Want to run bigger datasets
• Want to reduce financial cost and/or power consumption
Distributed-memory architecture
[Figure: several machines, each with its own CPU, memory and network interface (NI), connected by an interconnection network (e.g. Gigabit Ethernet, Infiniband, Myrinet, ...)]
Outline
• Distributed-memory architecture: general considerations
• Programming model: Message Passing Interface (MPI)
  § Point-to-point communication
    o Blocking communication
    o Point-to-point network performance
    o Non-blocking communication
  § Collective communication
    o Collective communication algorithms
    o Global network performance
• Parallel program performance evaluation
  § Amdahl's law
  § Gustafson's law
• Parallel program development: case studies
Message Passing Interface (MPI)
• MPI = library specification, not an implementation
• Most important implementations
  § OpenMPI (http://www.open-mpi.org, MPI-3 standard)
  § Intel MPI (proprietary, MPI-3 standard)
• Specifies routines for (among others)
  § Point-to-point communication (between 2 processes)
  § Collective communication (> 2 processes)
  § Topology setup
  § Parallel I/O
• Bindings for C/C++ and Fortran
MPI reference works
• MPI standards: http://www.mpi-forum.org/docs/
• MPI: The Complete Reference (M. Snir, S. Otto, S. Huss-Lederman, D. Walker, J. Dongarra).
  Available from http://switzernet.com/people/emin-gabrielyan/060708-thesis-ref/papers/Snir96.pdf
• Using MPI: Portable Parallel Programming with the Message Passing Interface, 2nd ed. (W. Gropp, E. Lusk, A. Skjellum).
MPI standard
• Started in 1992 (Workshop on Standards for Message-Passing in a Distributed Memory Environment) with support from vendors, library writers and academia
• MPI version 1.0 (May 1994)
  • Final pre-draft in 1993 (Supercomputing '93 conference)
  • Final version June 1994
• MPI version 2.0 (July 1997)
  • Support for one-sided communication
  • Support for process management
  • Support for parallel I/O
• MPI version 3.0 (September 2012)
  • Support for non-blocking collective communication
  • Fortran 2008 bindings
  • New one-sided communication routines
Hello world example in MPI

#include <mpi.h>
#include <iostream>
#include <cstdlib>

using namespace std;

int main(int argc, char* argv[]) {
    int rank, size;
    MPI_Init( &argc, &argv );
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );
    MPI_Comm_size( MPI_COMM_WORLD, &size );

    cout << "Hello World from process " << rank << "/" << size << endl;

    MPI_Finalize();
    return EXIT_SUCCESS;
}

john@doe ~]$ mpirun -np 4 ./helloWorld
Hello World from process 2/4
Hello World from process 3/4
Hello World from process 0/4
Hello World from process 1/4

Output order is random
Basic MPI routines
• int MPI_Init(int *argc, char ***argv)
  § Initialization: all processes must call this prior to any other MPI routine.
  § Strips off (possible) arguments provided by "mpirun".
• int MPI_Finalize(void)
  § Cleanup: all processes must call this routine at the end of the program.
  § All pending communication should have finished before calling this.
• int MPI_Comm_size(MPI_Comm comm, int *size)
  § Returns the size of the "communicator" associated with "comm"
  § Communicator = user-defined subset of processes
  § MPI_COMM_WORLD = communicator that involves all processes
• int MPI_Comm_rank(MPI_Comm comm, int *rank)
  § Returns the rank of the process in the communicator
  § Range: [0 ... size - 1]
Message Passing Interface Mechanisms
[Figure: P processes (or instances) of the "Hello World" program running on different machines, connected by an interconnection network (e.g. Gigabit Ethernet, Infiniband, Myrinet, ...)]
• mpirun launches P independent processes across the different machines
• Each process is an instance of the same program
• Each process holds its own stack variables: rank = 0, 1, 2, ..., P-1 and size = P
• Each process prints "Hello World"
Terminology
• Computer program = passive collection of instructions.
• Process = instance of a computer program that is being executed.
• Multitasking = running multiple processes on a CPU.
• Thread = smallest stream of instructions that can be managed by an OS scheduler (= light-weight process).
• Distributed-memory system = multi-processor system where each processor has direct (fast) access to its own private memory and relies on inter-processor communication to access another processor's memory (typically slower).
Multithreading versus multiprocessing

Multi-threading                              Message passing
Single process                               Multiple processes
Shared memory address space                  Separate memory address spaces
Protect data against simultaneous writing    Explicitly communicate everything
Limited to a single machine                  Multiple machines possible
E.g. Pthreads, CILK, OpenMP, etc.            E.g. MPI, Unified Parallel C, PVM

[Figure: left, a single process spawning threads A, B, C, D between sequential sections Seq I and Seq II (thread spawning / thread joining); right, four independent processes each running Seq I, one of A, B, C, D, and Seq II]
Message Passing Interface (MPI)
• MPI mechanisms (depends on implementation)
  • Compiling an MPI program from source
    • mpicc -O3 main.cpp -o main
      (equivalent to: gcc -O3 main.cpp -o main -L<IncludeDir> -l<mpiLibs>)
    • Also mpic++ (or mpicxx), mpif77, mpif90, etc.
    • mpicc / mpic++ defaults to gcc
  • Running MPI applications (manually)
    • mpirun -np <number of program instances> <your program>
    • List of worker nodes specified in some config file.
• Using MPI on the UGent HPC cluster
  • Load the appropriate module first, e.g.
    • module load intel/2017a
  • Compiling an MPI program from source
    • mpigcc (uses the GNU "gcc" C compiler)
    • mpiicc (uses the Intel "icc" C compiler)
    • mpigxx (uses the GNU "g++" C++ compiler)
    • mpiicpc (uses the Intel "icpc" C++ compiler)
  • Submit the job using a job script (see further)
Outline
• Distributed-memory architecture: general considerations
• Programming model: Message Passing Interface (MPI)
  § Point-to-point communication
    o Blocking communication
    o Point-to-point network performance
    o Non-blocking communication
  § Collective communication
    o Collective communication algorithms
    o Global network performance
• Parallel program performance evaluation
  § Amdahl's law
  § Gustafson's law
• Parallel program development: case studies
Basic MPI point-to-point communication

...
int rank, size, count;
char b[40];
MPI_Status status;

... // init MPI and rank and size variables

if (rank != 0) {                          // branching on rank
    char *str = "Hello World";
    MPI_Send(str, 12, MPI_CHAR, 0, 123, MPI_COMM_WORLD);
} else {
    for (int i = 1; i < size; i++) {
        MPI_Recv(b, 40, MPI_CHAR, i, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
        MPI_Get_count(&status, MPI_CHAR, &count);
        printf("I received %s from process %d with size %d and tag %d\n",
               b, status.MPI_SOURCE, count, status.MPI_TAG);
    }
}
...

john@doe ~]$ mpirun -np 4 ./ptpcomm
I received Hello World from process 1 with size 12 and tag 123
I received Hello World from process 2 with size 12 and tag 123
I received Hello World from process 3 with size 12 and tag 123
Message Passing Interface Mechanisms
[Figure: P processes of the program running on different machines, connected by an interconnection network; the process with rank = 0 receives 3x, while the processes with rank = 1, 2, ..., P-1 each send a "Hello World" message to rank 0]
• mpirun launches P independent processes across the different machines
• Each process is an instance of the same program
Blocking send and receive

int MPI_Send(void *buf, int count, MPI_Datatype datatype,
             int dest, int tag, MPI_Comm comm)
• buf: pointer to the message to send
• count: number of items to send
• datatype: datatype of each item
  - number of bytes sent: count * sizeof(datatype)
• dest: rank of destination process
• tag: value to identify the message [0 ... at least 32767]
• comm: communicator specification (e.g. MPI_COMM_WORLD)

int MPI_Recv(void *buf, int count, MPI_Datatype datatype,
             int source, int tag, MPI_Comm comm,
             MPI_Status *status)
• buf: pointer to the buffer to store received data
• count: upper bound (!) of the number of items to receive
• datatype: datatype of each item
• source: rank of source process (or MPI_ANY_SOURCE)
• tag: value to identify the message (or MPI_ANY_TAG)
• comm: communicator specification (e.g. MPI_COMM_WORLD)
• status: structure that contains { MPI_SOURCE, MPI_TAG, MPI_ERROR }
Sending and receiving

Two-sided communication:
• Both the sender and the receiver are involved in the data transfer, as opposed to one-sided communication
• A posted send must match a posted receive

When do MPI_Send and MPI_Recv match?
• 1. Rank of the receiving process
• 2. Rank of the sending process
• 3. Tag
  - custom value to distinguish messages from the same sender
• 4. Communicator

Rationale for communicators
• Used to create subsets of processes
• Transparent use of tags
  - modules can be written in isolation
  - communication within a module through its own communicator
  - communication between modules through a shared communicator
MPI Datatypes

MPI_Datatype          C datatype
MPI_CHAR              signed char
MPI_SHORT             signed short int
MPI_INT               signed int
MPI_LONG              signed long int
MPI_UNSIGNED_CHAR     unsigned char
MPI_UNSIGNED_SHORT    unsigned short int
MPI_UNSIGNED          unsigned int
MPI_UNSIGNED_LONG     unsigned long int
MPI_FLOAT             float
MPI_DOUBLE            double
MPI_LONG_DOUBLE       long double
MPI_BYTE              no conversion, bit pattern transferred as is
MPI_PACKED            grouped messages
Querying for information

• MPI_Status
  • Stores information about the MPI_Recv operation

typedef struct MPI_Status {
    int MPI_SOURCE;
    int MPI_TAG;
    int MPI_ERROR;
}

  • Does not contain the size of the received message

• int MPI_Get_count(MPI_Status *status, MPI_Datatype datatype, int *count)
  • returns the number of data items received in the count variable
  • not directly accessible from the status variable
Blocking send and receive

[Figure: timing diagrams of the handshake protocol (request to send / ok to send / data) for three cases:
 a) sender comes first: idling at the sender (no buffering of the message)
 b) sending/receiving at about the same time: idling minimized
 c) receiver comes first: idling at the receiver]

Slide reproduced from slides by John Mellor-Crummey
Deadlocks

int a[10], b[10], myRank;
MPI_Status s1, s2;
MPI_Comm_rank( MPI_COMM_WORLD, &myRank );

if ( myRank == 0 ) {
    MPI_Send( a, 10, MPI_INT, 1, 1, MPI_COMM_WORLD );
    MPI_Send( b, 10, MPI_INT, 1, 2, MPI_COMM_WORLD );
} else if ( myRank == 1 ) {
    MPI_Recv( b, 10, MPI_INT, 0, 2, MPI_COMM_WORLD, &s1 );
    MPI_Recv( a, 10, MPI_INT, 0, 1, MPI_COMM_WORLD, &s2 );
}

Slide credit: John Mellor-Crummey

• If MPI_Send is blocking (handshake protocol), this program will deadlock
• If the 'eager' protocol is used, it may run
  • Depends on the message size
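One way to make this example safe under either protocol is to post the receives in the same order as the matching sends. A minimal sketch, reusing the variables from the listing above:

if ( myRank == 0 ) {
    MPI_Send( a, 10, MPI_INT, 1, 1, MPI_COMM_WORLD );
    MPI_Send( b, 10, MPI_INT, 1, 2, MPI_COMM_WORLD );
} else if ( myRank == 1 ) {
    // receive in the same order as the sends were posted,
    // so the first MPI_Send can always complete
    MPI_Recv( a, 10, MPI_INT, 0, 1, MPI_COMM_WORLD, &s1 );
    MPI_Recv( b, 10, MPI_INT, 0, 2, MPI_COMM_WORLD, &s2 );
}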
Outline
• Distributed-memory architecture: general considerations
• Programming model: Message Passing Interface (MPI)
  § Point-to-point communication
    o Blocking communication
    o Point-to-point network performance
    o Non-blocking communication
  § Collective communication
    o Collective communication algorithms
    o Global network performance
• Parallel program performance evaluation
  § Amdahl's law
  § Gustafson's law
• Parallel program development: case studies
Network cost modeling

• Various choices of interconnection network
  § Gigabit Ethernet: cheap, but far too slow for HPC applications
  § Infiniband / Myrinet: high-speed interconnect

• Simple performance model for point-to-point communication:
  T_comm = α + β·n
  § α = latency
  § β = 1/B, with B the saturation (asymptotic) bandwidth (bytes/s)
  § n = number of bytes to transmit
  § Effective bandwidth B_eff:
    B_eff = n / (α + β·n) = n / (α + n/B)
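A small, self-contained sketch of evaluating this model; the latency and bandwidth values below are assumed placeholders, not measurements of any particular network:

#include <stdio.h>

int main(void) {
    double alpha = 2.0e-6;        /* latency in seconds (assumed value)     */
    double beta  = 1.0 / 1.0e9;   /* time per byte, i.e. 1 / (1 GB/s)       */
    for (long n = 1; n <= 1 << 20; n *= 16) {
        double t_comm = alpha + beta * n;   /* T_comm = alpha + beta*n      */
        double b_eff  = n / t_comm;         /* effective bandwidth          */
        printf("n = %8ld bytes  T = %.3e s  Beff = %.1f MB/s\n",
               n, t_comm, b_eff / 1.0e6);
    }
    return 0;
}

For small n the latency term dominates and B_eff is far below B; for large n, B_eff approaches the saturation bandwidth.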
The case of Gengar (UGent cluster): bandwidth and latency

[Figure: a node with two quad-core CPUs (each core with its own L1 cache, 6 MByte of L2 cache per CPU) and 16 GByte of RAM, connected to the 193 other machines through non-blocking Infiniband (switched fabric); indicative numbers from the diagram: latencies of 2 ns, 50 ns and ~1 μs (to the other machines), bandwidths of ~100 Gbit/s and ~20 Gbit/s inside the node]
Measure effective bandwidth: ring test

Idea: send a single message of size N in a circle (process 0 → 1 → 2 → ... → P-1 → 0)
• Increase the message size N exponentially
  • 1 byte, 2 bytes, 4 bytes, ..., 1024 bytes, 2048 bytes, 4096 bytes
• Benchmark the results (measure wall clock time T)
• Bandwidth = N * P / T
Hands-on: ring test in MPI

void sendRing( char *buffer, int length ) {
    /* send message in a ring here */
}

int main( int argc, char * argv[] )
{
    ...
    char *buffer = (char*) calloc( 1048576, sizeof(char) );
    int msgLen = 8;
    for (int i = 0; i < 18; i++, msgLen *= 2) {
        double startTime = MPI_Wtime();
        sendRing( buffer, msgLen );
        double stopTime = MPI_Wtime();
        double elapsedSec = stopTime - startTime;
        if (rank == 0)
            printf( "Bandwidth for size %d is: %f\n", ... );
    }
    ...
}
Jobscript example

mpijob.sh jobscript example:

#!/bin/sh
#
#PBS -o output.file
#PBS -e error.file
#PBS -l nodes=2:ppn=all
#PBS -l walltime=00:02:00
#PBS -m n

cd $VSC_SCRATCH/<your directory>
module load intel/2017a
module load scripts
mympirun ./<program name> <program arguments>

qsub mpijob.sh
qstat / qdel / etc. remains the same
Hands-on: ring test in MPI (solution)

void sendRing( char *buffer, int msgLen )
{
    int myRank, numProc;
    MPI_Comm_rank( MPI_COMM_WORLD, &myRank );
    MPI_Comm_size( MPI_COMM_WORLD, &numProc );
    MPI_Status status;

    int prevR = (myRank - 1 + numProc) % numProc;
    int nextR = (myRank + 1) % numProc;

    if (myRank == 0) { // send first, then receive
        MPI_Send( buffer, msgLen, MPI_CHAR, nextR, 0, MPI_COMM_WORLD );
        MPI_Recv( buffer, msgLen, MPI_CHAR, prevR, 0, MPI_COMM_WORLD, &status );
    } else {           // receive first, then send
        MPI_Recv( buffer, msgLen, MPI_CHAR, prevR, 0, MPI_COMM_WORLD, &status );
        MPI_Send( buffer, msgLen, MPI_CHAR, nextR, 0, MPI_COMM_WORLD );
    }
}
Basic MPI routines

Timing routines in MPI
• double MPI_Wtime( void )
  • returns the time in seconds relative to "some time" in the past
  • "some time" in the past is fixed during the process
• double MPI_Wtick( void )
  • Returns the resolution of MPI_Wtime() in seconds
  • e.g. 10^-3 = millisecond resolution
Bandwidth on Gengar (UGent cluster)

[Figure: effective bandwidth B_eff (Mbyte/s, 0 to 1200) versus message size (1E+00 to 1E+08 bytes) for Gigabit Ethernet and Infiniband]

Effective BW increases for larger messages.
Benchmark results: comparison of CPU load

[Figure: CPU load during communication, Infiniband versus Gigabit Ethernet]
Exchanging messages in MPI

int MPI_Sendrecv( void *sendbuf, int sendcount, MPI_Datatype sendtype,
                  int dest, int sendtag, void *recvbuf,
                  int recvcount, MPI_Datatype recvtype, int source,
                  int recvtag, MPI_Comm comm, MPI_Status *status )
• sendbuf: pointer to the message to send
• sendcount: number of elements to transmit
• sendtype: datatype of the items to send
• dest: rank of destination process
• sendtag: identifier for the message
• recvbuf: pointer to the buffer to store the message (disjoint with sendbuf)
• recvcount: upper bound (!) to the number of elements to receive
• recvtype: datatype of the items to receive
• source: rank of the source process (or MPI_ANY_SOURCE)
• recvtag: value to identify the message (or MPI_ANY_TAG)
• comm: communicator specification (e.g. MPI_COMM_WORLD)
• status: structure that contains { MPI_SOURCE, MPI_TAG, MPI_ERROR }

int MPI_Sendrecv_replace( ... )
• Buffer is replaced by the received data
Basic MPI routines: Sendrecv example

const int len = 10000;
int a[len], b[len];
if ( myRank == 0 ) {
    MPI_Send( a, len, MPI_INT, 1, 0, MPI_COMM_WORLD );
    MPI_Recv( b, len, MPI_INT, 2, 2, MPI_COMM_WORLD, &status );
} else if ( myRank == 1 ) {
    MPI_Sendrecv( a, len, MPI_INT, 2, 1, b, len, MPI_INT, 0,
                  0, MPI_COMM_WORLD, &status );
} else if ( myRank == 2 ) {
    MPI_Sendrecv( a, len, MPI_INT, 0, 2, b, len, MPI_INT, 1,
                  1, MPI_COMM_WORLD, &status );
}

• Compatibility between Sendrecv and 'normal' send and recv
• Sendrecv can help to prevent deadlocks

[Figure: three processes 0, 1, 2 exchanging messages in a ring: safe to exchange!]
Outline
• Distributed-memory architecture: general considerations
• Programming model: Message Passing Interface (MPI)
  § Point-to-point communication
    o Blocking communication
    o Point-to-point network performance
    o Non-blocking communication
  § Collective communication
    o Collective communication algorithms
    o Global network performance
• Parallel program performance evaluation
  § Amdahl's law
  § Gustafson's law
• Parallel program development: case studies
Non-blocking communication

Idea:
• Do something useful while waiting for communications to finish
• Try to overlap communications and computations

How?
• Replace blocking communication by the non-blocking variants
  MPI_Send(...) → MPI_Isend(..., MPI_Request *request)
  MPI_Recv(...) → MPI_Irecv(..., MPI_Request *request)
• I = immediate: the MPI_Isend and MPI_Irecv routines return immediately
• Need polling routines to verify progress
  • The request handle is used to identify communications
  • The status field is moved to the polling routines (see further)
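A minimal sketch of the typical usage pattern; compute_something() is a hypothetical placeholder for any local work:

char buffer[1024];
int source = 0, tag = 0;
MPI_Request request;
MPI_Status status;

// start the receive, then do useful local work while the message is in flight
MPI_Irecv( buffer, 1024, MPI_CHAR, source, tag, MPI_COMM_WORLD, &request );

compute_something();            // local work, overlapped with the transfer (hypothetical)

// block until the receive has actually completed before touching 'buffer'
MPI_Wait( &request, &status );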
Non-blocking communications

Asynchronous progress
• = ability to progress communications while performing calculations
• Depends on hardware
  • Gigabit Ethernet = very limited
  • Infiniband = much more possibilities
• Depends on the MPI implementation
  • Multithreaded implementations of MPI (e.g. OpenMPI)
  • Daemon for asynchronous progress (e.g. LAM/MPI)
• Depends on the protocol
  • Eager protocol
  • Handshake protocol
• Still the subject of ongoing research
Non-blocking sending and receiving

[Figure: timing diagrams (Isend / Irecv, request to send / ok to send / data) for
 a) a network interface that supports overlapping computations and communications
 b) a network interface without such support]

Slide reproduced from slides by John Mellor-Crummey
Non-blocking communications: polling/waiting routines

int MPI_Wait( MPI_Request *request, MPI_Status *status )
• request: handle to identify the communication
• status: status information (cf. 'normal' MPI_Recv)

int MPI_Test( MPI_Request *request, int *flag, MPI_Status *status )
• Returns immediately. Sets flag = true if the communication has completed.

int MPI_Waitany( int count, MPI_Request *array_of_requests,
                 int *index, MPI_Status *status )
• Waits for exactly one communication to complete
• If more than one communication has completed, it picks a random one
• index returns the index of the completed communication

int MPI_Testany( int count, MPI_Request *array_of_requests,
                 int *index, int *flag, MPI_Status *status )
• Returns immediately. Sets flag = true if at least one communication completed.
• If more than one communication has completed, it picks a random one
• index returns the index of the completed communication
• If flag = false, index returns MPI_UNDEFINED
Example: client-server code

if ( rank != 0 ) { // client code
    while ( true ) { // generate requests and send to the server
        generate_request( data, &size );
        MPI_Send( data, size, MPI_CHAR, 0, tag, MPI_COMM_WORLD );
    }
} else { // server code (rank == 0)
    MPI_Request *reqList = new MPI_Request[nProc];
    for ( int i = 0; i < nProc - 1; i++ )
        MPI_Irecv( buffer[i].data, MAX_LEN, MPI_CHAR, i+1, tag,
                   MPI_COMM_WORLD, &reqList[i] );

    while ( true ) { // main consumer loop
        MPI_Status status;
        int reqIndex, recvSize;
        MPI_Waitany( nProc-1, reqList, &reqIndex, &status );
        MPI_Get_count( &status, MPI_CHAR, &recvSize );
        do_service( buffer[reqIndex].data, recvSize );
        MPI_Irecv( buffer[reqIndex].data, MAX_LEN, MPI_CHAR,
                   status.MPI_SOURCE, tag, MPI_COMM_WORLD, &reqList[reqIndex] );
    }
}
Non-blocking communications: polling/waiting routines (cont'd)

int MPI_Waitall( int count, MPI_Request *array_of_requests,
                 MPI_Status *array_of_statuses )
• Waits for all communications to complete

int MPI_Testall( int count, MPI_Request *array_of_requests,
                 int *flag, MPI_Status *array_of_statuses )
• Returns immediately. Sets flag = true if all communications have completed.

int MPI_Waitsome( int incount, MPI_Request *array_of_requests,
                  int *outcount, int *array_of_indices,
                  MPI_Status *array_of_statuses )
• Waits for at least one communication to complete
• outcount contains the number of communications that have completed
• Completed requests are set to MPI_REQUEST_NULL

int MPI_Testsome( int incount, MPI_Request *array_of_requests,
                  int *outcount, int *array_of_indices,
                  MPI_Status *array_of_statuses )
• Same as Waitsome, but returns immediately. The flag field is no longer needed; returns outcount = 0 if no communications have completed.
Example: improved client-server code

if ( rank != 0 ) { // same client code
    ...
} else { // server code (rank == 0)
    MPI_Request *reqList = new MPI_Request[nProc-1];
    MPI_Status *status = new MPI_Status[nProc-1];
    int *reqIndex = new int[nProc-1];
    int recvSize;
    for ( int i = 0; i < nProc - 1; i++ )
        MPI_Irecv( buffer[i].data, MAX_LEN, MPI_CHAR, i+1, tag,
                   MPI_COMM_WORLD, &reqList[i] );

    while ( true ) { // main consumer loop
        int numMsg;
        MPI_Waitsome( nProc-1, reqList, &numMsg, reqIndex, status );
        for ( int i = 0; i < numMsg; i++ ) {
            MPI_Get_count( &status[i], MPI_CHAR, &recvSize );
            do_service( buffer[reqIndex[i]].data, recvSize );
            MPI_Irecv( buffer[reqIndex[i]].data, MAX_LEN, MPI_CHAR,
                       status[i].MPI_SOURCE, tag, MPI_COMM_WORLD,
                       &reqList[reqIndex[i]] );
        }
    }
}
Outline
• Distributed-memory architecture: general considerations
• Programming model: Message Passing Interface (MPI)
  § Point-to-point communication
    o Blocking communication
    o Point-to-point network performance
    o Non-blocking communication
  § Collective communication
    o Collective communication algorithms
    o Global network performance
• Parallel program performance evaluation
  § Amdahl's law
  § Gustafson's law
• Parallel program development: case studies
Barrier synchronization

[Figure: processes reach the call to MPI_Barrier at different times (computations, then process-dependent waiting/idle time); no process continues until all processes in comm have entered the barrier]

MPI_Barrier(MPI_Comm comm)
This function does not return until all processes in comm have called it.
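A common use is to synchronize all processes before and after a timed region, so that the measurement reflects the slowest process. A sketch, where do_work() is a hypothetical local computation:

MPI_Barrier( MPI_COMM_WORLD );          // make sure all processes start together
double start = MPI_Wtime();

do_work();                              // hypothetical local computation

MPI_Barrier( MPI_COMM_WORLD );          // wait until the slowest process is done
double elapsed = MPI_Wtime() - start;

if (rank == 0)
    printf("Elapsed wall clock time: %f s\n", elapsed);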
Broadcast

MPI_Bcast(void *buffer, int count, MPI_Datatype datatype,
          int root, MPI_Comm comm)

MPI_Bcast broadcasts count elements of type datatype stored in buffer at the root process to all other processes in comm, where this data is stored in buffer.

[Figure: before the call, only the buffer at the root process contains useful data (count elements); the buffer at the non-root processes will be overwritten. After MPI_Bcast, the buffer at all processes p0 ... p4 contains the same data.]
Broadcast example

...
int rank, size;

... // init MPI and rank and size variables

int root = 0;
char buffer[12];

if (rank == root)
    sprintf(buffer, "Hello world");        // fill the buffer at the root process only

MPI_Bcast(buffer, 12, MPI_CHAR, root, MPI_COMM_WORLD);   // all processes must call MPI_Bcast

printf("Process %d has %s stored in the buffer.\n", rank, buffer);
...

john@doe ~]$ mpirun -np 4 ./broadcast
Process 1 has Hello world stored in the buffer.
Process 0 has Hello world stored in the buffer.
Process 3 has Hello world stored in the buffer.
Process 2 has Hello world stored in the buffer.
Broadcast algorithm

[Figure: broadcasting a message of n bytes from the root process p0 to processes p1 ... p7]

• Linear algorithm: subsequent sending of n bytes from the root process to the P-1 other processes takes (α + βn)(P - 1) time.
• Binary tree algorithm takes only (α + βn)⌈log2 P⌉ time.
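The tree-based idea can be sketched with point-to-point calls. The recursive-doubling variant below is one possible formulation for root = 0 and P a power of two; it is not necessarily how MPI_Bcast is implemented in a given MPI library:

// Binomial-tree broadcast from rank 0, for P a power of two (sketch).
void tree_bcast( char *buffer, int count, int rank, int P ) {
    // in round k (mask = 2^k), every process that already holds the data
    // sends it to the process with rank + mask
    for (int mask = 1; mask < P; mask <<= 1) {
        if (rank < mask) {
            int dest = rank + mask;
            if (dest < P)
                MPI_Send( buffer, count, MPI_CHAR, dest, 0, MPI_COMM_WORLD );
        } else if (rank < 2 * mask) {
            MPI_Recv( buffer, count, MPI_CHAR, rank - mask, 0,
                      MPI_COMM_WORLD, MPI_STATUS_IGNORE );
        }
    }
}

In every round the number of processes holding the data doubles, so ⌈log2 P⌉ rounds suffice, matching the (α + βn)⌈log2 P⌉ estimate above.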
Scatter

MPI_Scatter(void *sendbuf, int sendcount, MPI_Datatype sendType,
            void *recvbuf, int recvcount, MPI_Datatype recvType,
            int root, MPI_Comm comm)

MPI_Scatter partitions sendbuf at the root process into P equal parts of size sendcount and sends each process in comm (including root) a portion in rank order.

[Figure: the send buffer (which only matters at the root process, here root = p1) holds blocks d0 ... d4 of sendcount elements; after MPI_Scatter, process pi holds block di (recvcount elements) in its receive buffer.]
Scatter example

int root = 0;
char recvBuf[7];

if (rank == root) {
    char sendBuf[25];                              // fill the send buffer at the root process only
    sprintf(sendBuf, "This is the source data.");
    MPI_Scatter(sendBuf, 6, MPI_CHAR, recvBuf, 6, MPI_CHAR,
                root, MPI_COMM_WORLD);
} else {
    MPI_Scatter(NULL, 0, MPI_CHAR, recvBuf, 6, MPI_CHAR,   // first three parameters are ignored on non-root processes
                root, MPI_COMM_WORLD);
}
recvBuf[6] = '\0';
printf("Process %d has %s in receive buffer\n", rank, recvBuf);
...

john@doe ~]$ mpirun -np 4 ./scatter
Process 1 has s the  in receive buffer
Process 0 has This i in receive buffer
Process 3 has  data. in receive buffer
Process 2 has source in receive buffer
Scatter algorithm

[Figure: scattering blocks of n bytes from the root process p0 to processes p1 ... p7 (one block = n bytes)]

• Linear algorithm: subsequent sending of n bytes from the root process to the P-1 other processes takes (α + βn)(P - 1) time.
• Binary algorithm takes only α⌈log2 P⌉ + βn(P - 1) time (reduced number of communication rounds!)
Scatter (vector variant)

MPI_Scatterv(void *sendbuf, int *sendcnts, int *displs,
             MPI_Datatype sendType, void *recvbuf, int recvcnt,
             MPI_Datatype recvType, int root, MPI_Comm comm)

Partitions of sendbuf don't need to be of equal size and are specified per receiving process: their first index by the displs array, their size by the sendcnts array.

[Figure: the send buffer (which only matters at the root process, here root = p1) holds blocks of sendcnts[i] elements starting at offsets displs[i]; after MPI_Scatterv, process pi holds its block (recvcount elements) in its receive buffer.]
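A sketch of how the sendcnts and displs arrays could be filled, here giving process i a block of i+1 integers; this is an illustration, not part of the original slides:

int root = 0;
int total = size * (size + 1) / 2;          // total number of elements at the root
int *sendcnts = NULL, *displs = NULL, *sendbuf = NULL;
if (rank == root) {
    sendcnts = (int*) malloc(size * sizeof(int));
    displs   = (int*) malloc(size * sizeof(int));
    sendbuf  = (int*) malloc(total * sizeof(int));
    for (int i = 0, offset = 0; i < size; i++) {
        sendcnts[i] = i + 1;                // process i receives i+1 elements
        displs[i]   = offset;               // start of block i inside sendbuf
        offset += sendcnts[i];
    }
}
int recvcnt = rank + 1;
int *recvbuf = (int*) malloc(recvcnt * sizeof(int));
MPI_Scatterv(sendbuf, sendcnts, displs, MPI_INT,
             recvbuf, recvcnt, MPI_INT, root, MPI_COMM_WORLD);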
Gather

MPI_Gather(void *sendbuf, int sendcount, MPI_Datatype sendType,
           void *recvbuf, int recvcount, MPI_Datatype recvType,
           int root, MPI_Comm comm)

MPI_Gather gathers equal partitions of size recvcount from each of the P processes in comm (including root) and stores them in recvbuf at the root process in rank order.

[Figure: each process pi holds a block di of sendcount elements in its send buffer; after MPI_Gather, the receive buffer (which only matters at the root process, here root = p1) holds d0 ... d4 in rank order.]

Gather is the opposite of Scatter.
A vector variant, MPI_Gatherv, exists: a similar generalization as MPI_Scatterv.
Gather example

int root = 0;
int sendBuf = rank;

if (rank == root) {
    int *recvBuf = new int[size];           // the receive buffer exists at the root process only
    MPI_Gather(&sendBuf, 1, MPI_INT, recvBuf, 1, MPI_INT,
               root, MPI_COMM_WORLD);
    cout << "Receive buffer at root process: " << endl;
    for (size_t i = 0; i < size; i++)
        cout << recvBuf[i] << " ";
    cout << endl;
    delete [] recvBuf;
} else {
    MPI_Gather(&sendBuf, 1, MPI_INT, NULL, 1, MPI_INT,     // receive parameters are ignored on non-root processes
               root, MPI_COMM_WORLD);
}

john@doe ~]$ mpirun -np 4 ./gather
Receive buffer at root process:
0 1 2 3
Gather algorithm

[Figure: gathering blocks of n bytes from processes p1 ... p7 at the root process p0 (one block = n bytes)]

• Linear algorithm: subsequent sending of n bytes from the P-1 processes to the root takes (α + βn)(P - 1) time.
• Binary algorithm takes only α⌈log2 P⌉ + βn(P - 1) time (reduced number of communication rounds!)
Allgather

MPI_Allgather(void *sendbuf, int sendcnt, MPI_Datatype sendType,
              void *recvbuf, int recvcnt, MPI_Datatype recvType,
              MPI_Comm comm)

MPI_Allgather is a generalization of MPI_Gather, in the sense that the data is gathered by all processes, instead of just the root process.

[Figure: each process pi holds a block of sendcount elements in its send buffer; after MPI_Allgather, every process p0 ... p4 holds all blocks in rank order in its receive buffer (recvcount elements per block).]
Allgather algorithm

[Figure: butterfly exchange among processes p0 ... p7, exchanging n bytes, then 2n bytes, then 4n bytes per round (one block = n bytes)]

• P calls to gather take P[α⌈log2 P⌉ + βn(P - 1)] time (using the best gather algorithm)
• Gather followed by broadcast takes 2α⌈log2 P⌉ + βn(P⌈log2 P⌉ + P - 1) time
• "Butterfly" algorithm takes only α log2 P + βn(P - 1) time (in case P is a power of two)
All-to-all communication

MPI_Alltoall(void *sendbuf, int sendcnt, MPI_Datatype sendType,
             void *recvbuf, int recvcnt, MPI_Datatype recvType,
             MPI_Comm comm)

Using MPI_Alltoall, every process sends a distinct message to every other process.

[Figure: process p0 holds blocks a0 ... a4 in its send buffer, p1 holds b0 ... b4, p2 holds c0 ... c4, p3 holds d0 ... d4, p4 holds e0 ... e4 (sendcount elements per block); after MPI_Alltoall, process pi holds blocks ai, bi, ci, di, ei in its receive buffer (recvcount elements per block), i.e. a transpose of the data.]

A vector variant, MPI_Alltoallv, exists, allowing for different sizes for each process.
Reduce

MPI_Reduce(void *sendbuf, void *recvbuf, int count, MPI_Datatype dataType,
           MPI_Op op, int root, MPI_Comm comm)

The reduce operation aggregates ("reduces") scattered data at the root process.

[Figure: each process pi holds di (count elements) in its send buffer; after MPI_Reduce, the receive buffer at the root process (root = p0) holds d0 ◊ d1 ◊ d2 ◊ d3 ◊ d4 (count elements), where ◊ is an operation like sum, product, maximum, etc.]
Reduce operations

Available reduce operations (associative and commutative):

MPI_MAX       maximum
MPI_MIN       minimum
MPI_SUM       sum
MPI_PROD      product
MPI_LAND      logical AND
MPI_BAND      bitwise AND
MPI_LOR       logical OR
MPI_BOR       bitwise OR
MPI_LXOR      logical exclusive OR
MPI_BXOR      bitwise exclusive OR
MPI_MAXLOC    maximum and its location
MPI_MINLOC    minimum and its location

User-defined operations are also possible.
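A typical use is a global sum, e.g. a distributed dot product. A sketch, where x, y and localN denote the assumed local data on each process:

// each process computes its partial dot product, the root reduces with MPI_SUM
double localSum = 0.0;
for (int i = 0; i < localN; i++)
    localSum += x[i] * y[i];

double globalSum = 0.0;
MPI_Reduce(&localSum, &globalSum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

if (rank == 0)
    printf("Dot product = %f\n", globalSum);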
Allreduce operation

MPI_Allreduce(void *sendbuf, void *recvbuf, int count,
              MPI_Datatype dataType, MPI_Op op, MPI_Comm comm)

Similar to the reduce operation, but the result is available on every process.

[Figure: each process pi holds di (count elements) in its send buffer; after MPI_Allreduce, the receive buffer of every process p0 ... p4 holds d0 ◊ d1 ◊ d2 ◊ d3 ◊ d4.]
Scan operation

MPI_Scan(void *sendbuf, void *recvbuf, int count,
         MPI_Datatype dataType, MPI_Op op, MPI_Comm comm)

A scan performs a partial reduction of the data: every process has a distinct result.

[Figure: each process pi holds di (count elements) in its send buffer; after MPI_Scan, the receive buffer of p0 holds d0, p1 holds d0 ◊ d1, p2 holds d0 ◊ d1 ◊ d2, p3 holds d0 ◊ d1 ◊ d2 ◊ d3, and p4 holds d0 ◊ d1 ◊ d2 ◊ d3 ◊ d4.]
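A classic use of MPI_Scan is computing, from the number of elements each process owns, the global offset at which its block starts. A minimal sketch (the per-process count is just an example value):

int localCount = rank + 1;               // example: process i owns i+1 elements
int inclusive = 0;
MPI_Scan(&localCount, &inclusive, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
int offset = inclusive - localCount;     // exclusive prefix sum = global start index
printf("Process %d writes its block at offset %d\n", rank, offset);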
Hands-on: Matrix-vector multiplication

Master: coordinates the work of the others
Slave: does a bit of work

Task: compute A.b
A: double precision (m x n) matrix
b: double precision (n x 1) column matrix

Master algorithm:
1. Broadcast b to each slave
2. Send 1 row of A to each slave
3. while ( not all m results received ) {
     Receive result from any slave s
     if ( not all m rows sent )
       Send new row to slave s
     else
       Send termination message to s
   }
4. continue

Slave algorithm:
1. Broadcast b (in fact receive b)
2. do {
     Receive message m
     if ( m != termination )
       compute result
       send result to master
   } while ( m != termination )
3. slave terminates
Matrix-vector multiplication

int main( int argc, char** argv ) {
    int rows = 100, cols = 100;  // dimensions of a
    double **a;
    double *b, *c;
    int master = 0;              // rank of master
    int myid;                    // rank of this process
    int numprocs;                // number of processes

    // allocate memory for a, b and c
    a = (double**)malloc(rows * sizeof(double*));
    for( int i = 0; i < rows; i++ )
        a[i] = (double*)malloc(cols * sizeof(double));
    b = (double*)malloc(cols * sizeof(double));
    c = (double*)malloc(rows * sizeof(double));

    MPI_Init( &argc, &argv );
    MPI_Comm_rank( MPI_COMM_WORLD, &myid );
    MPI_Comm_size( MPI_COMM_WORLD, &numprocs );

    if( myid == master ) {
        // execute master code
    } else {
        // execute slave code
    }

    MPI_Finalize();
}
Matrix-vector multiplication (master code)

// initialize a and b
for(int j = 0; j < cols; j++) {
    b[j] = 1.0;
    for(int i = 0; i < rows; i++)
        a[i][j] = i;
}

// broadcast b to each slave
MPI_Bcast( b, cols, MPI_DOUBLE_PRECISION, master, MPI_COMM_WORLD );

// send a row of a to each slave, tag = row number
int numsent = 0;
for( int i = 0; (i < numprocs-1) && (i < rows); i++ ) {
    MPI_Send( a[i], cols, MPI_DOUBLE_PRECISION, i+1, i, MPI_COMM_WORLD );
    numsent++;
}

for( int i = 0; i < rows; i++ ) {
    MPI_Status status;
    double ans;
    int sender;
    MPI_Recv( &ans, 1, MPI_DOUBLE_PRECISION, MPI_ANY_SOURCE,
              MPI_ANY_TAG, MPI_COMM_WORLD, &status );
    c[status.MPI_TAG] = ans;
    sender = status.MPI_SOURCE;
    if ( numsent < rows ) {      // send more work if any
        MPI_Send( a[numsent], cols, MPI_DOUBLE_PRECISION,
                  sender, numsent, MPI_COMM_WORLD );
        numsent++;
    } else                        // send termination message
        MPI_Send( MPI_BOTTOM, 0, MPI_DOUBLE_PRECISION, sender,
                  rows, MPI_COMM_WORLD );
}
Matrix-vector multiplication (slave code)

// broadcast b to each slave (receive here)
MPI_Bcast( b, cols, MPI_DOUBLE_PRECISION, master, MPI_COMM_WORLD );

// receive rows of a from the master, tag = row number
if( myid <= rows ) {
    double* buffer = (double*)malloc(cols * sizeof(double));
    while (true) {
        MPI_Status status;
        MPI_Recv( buffer, cols, MPI_DOUBLE_PRECISION, master,
                  MPI_ANY_TAG, MPI_COMM_WORLD, &status );
        if( status.MPI_TAG != rows ) {   // not a termination message
            double ans = 0.0;
            for(int i = 0; i < cols; i++)
                ans += buffer[i] * b[i];
            MPI_Send( &ans, 1, MPI_DOUBLE_PRECISION, master,
                      status.MPI_TAG, MPI_COMM_WORLD );
        } else
            break;
    }
} // more processes than rows => no work for some nodes
Outline
• Distributed-memory architecture: general considerations
• Programming model: Message Passing Interface (MPI)
  § Point-to-point communication
    o Blocking communication
    o Point-to-point network performance
    o Non-blocking communication
  § Collective communication
    o Collective communication algorithms
    o Global network performance
• Parallel program performance evaluation
  § Amdahl's law
  § Gustafson's law
• Parallel program development: case studies
Network cost modeling

• Simple performance model for multiple communications:
  - Bisection bandwidth: sum of the bandwidths of the (average number of) links to cut to partition the network into two halves.
  - Diameter: maximum number of hops to connect any two devices.

[Figure: a network split into two halves by a bisection]

Bus topology

Bus = communication channel that is shared by all connected devices
• No more than two devices can communicate at any given time
• Hardware controls which devices have access
• High risk of contention when multiple devices try to access the bus simultaneously. The bus is a "blocking" interconnect.

[Figure: devices attached to a shared bus; the bisection cuts the bus itself]

Bus network properties
• Bisection bandwidth = point-to-point bandwidth (independent of the number of devices)
• Diameter = 1 (single hop)
Crossbar switch

[Figure: input devices i1 ... i4 and output devices o1 ... o4 connected through a grid of switches]

A crossbar switch can connect any combination of input/output pairs at full bandwidth = fully non-blocking.

Static routing in a crossbar switch

[Figure: the switches at the crossing of each input/output pair are in state B; all other switches are in state A. Path selection is fixed = static routing. Example input-output pairings: i1-o2, i2-o4, i3-o1, i4-o3.]
Crossbar switch for distributed-memory systems

• Crossbar implementation of a switch with P = 4 machines.
• Fully non-blocking
• Expensive: P^2 switches and O(P^2) wires
• Usually implemented in a switch

[Figure: machines M1 ... M4 connected through a 4-port non-blocking switch]
Clos switching networks

[Figure: n inputs and n outputs connected through an input stage, a middle stage and an output stage, each built from 4x4 crossbar (CB) switches]

• Built in three stages, using crossbar switches as components.
• Every output of an input-stage switch is connected to a different middle-stage switch.
• Every input can be connected to every output, using any one of the middle-stage switches.
• All input/output pairs can communicate simultaneously (non-blocking).
• Requires far less wiring than conventional CB switches.
Clos switching networks: adaptive routing

[Figure: the same Clos network, shown with existing connections A, B, C and D and a new connection request E]

• Adaptive routing may be necessary: with the current assignment of middle-stage switches, no connection can be found for E.
• By rerouting an existing path through another middle-stage switch, E can now be connected.
Switch examples

[Photos: non-blocking switches: an Infiniband switch (24 ports) and a Gigabit Ethernet switch (24 ports)]

Mesh networks

[Figure: a 2D mesh of nodes and its bisection]
Outline
• Distributed-memory architecture: general considerations
• Programming model: Message Passing Interface (MPI)
  § Point-to-point communication
    o Blocking communication
    o Point-to-point network performance
    o Non-blocking communication
  § Collective communication
    o Collective communication algorithms
    o Global network performance
• Parallel program performance evaluation
  § Amdahl's law
  § Gustafson's law
• Parallel program development: case studies
Basic performance terminology

• Runtime ("How long does it take to run my program?")
  § In practice, only wall clock time matters
  § Depends on the number of parallel processes P
  § T_P = runtime using P processes

• Speedup ("How much faster does my program run in parallel?")
  § S_P = T_1 / T_P
  § In the ideal case, S_P = P
  § Superlinear speedups are usually due to cache effects

• Parallel efficiency ("How well is the parallel infrastructure used?")
  § η_P = S_P / P   (0 ≤ η_P ≤ 1)
  § In the ideal case, η_P = 1 = 100%
  § Depends on the application what is considered acceptable efficiency
Strong scaling: Amdahl's law

• Strong scaling = increasing the number of parallel processes for a fixed-size problem
• Simple model: partition the sequential runtime T_1 into a parallelizable fraction (1-s) and an inherently sequential fraction (s).
  § T_1 = s·T_1 + (1-s)·T_1   (0 ≤ s ≤ 1)
  § Therefore, T_P = s·T_1 + (1-s)·T_1 / P

[Figure: bar diagram of T_1, T_2, T_3, T_4, ..., T_∞: the sequential part s·T_1 stays constant, the parallelizable part (1-s)·T_1 shrinks as it is divided over more processes]
Strong scaling: Amdahl's law

• Consequently, S_P = T_1 / T_P = T_1 / (s·T_1 + (1-s)·T_1/P) = 1 / (s + (1-s)/P)   (Amdahl's law)

• The speedup is bounded: S_∞ = 1/s
  § e.g. s = 1%, S_∞ = 100
  § e.g. s = 5%, S_∞ = 20

• Sources of the sequential fraction s·T_1
  § Process startup overhead
  § Inherently sequential portions of the code
  § Dependencies between subtasks
  § Communication overhead
  § Function calling overhead
  § Load imbalance
Amdahl's law

[Figure: parallel speedup versus number of parallel processes P (0 to 30), for s = 0%, 1%, 2%, 5%; for small P, the speedup is close to the linear speedup line]
Amdahl's law

[Figure: same graph as on the previous slide, but with a logarithmic x-axis (P from 1 to 65536); each curve saturates at S_∞ = 1/s]
Amdahl's law

[Figure: same graph as on the previous slide, but expressed as parallel efficiency (0 to 1) versus number of parallel processes P, for s = 0%, 1%, 2%, 5%]
Weak scaling: Gustafson's law

• Weak scaling: increasing both the number of parallel processes P and the problem size N.
• Simple model: assume that the sequential part is constant, and that the parallelizable part is proportional to N.
  § T_1(N) = T_1(1)·[s + (1-s)·N]   (0 ≤ s ≤ 1)

[Figure: bar diagram of T_1(1), T_1(2), T_1(3), T_1(4): the sequential part s·T_1(1) stays constant, the parallelizable part (1-s)·T_1(1) grows proportionally to N]
Gustafson's law

• Therefore, T_P(N) = T_1(1)·[s + (1-s)·N/P], and hence T_P(P) = T_1(1)
• Speedup S_P(P) = s + (1-s)·P = P - s·(P-1)   (Gustafson's law)
• Solve increasingly larger problems using a proportionally higher number of processes (N = P).

[Figure: bar diagram of T_1(1), T_2(2), T_3(3), T_4(4): the runtime stays constant as the problem size and the number of processes grow together]
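• As a quick check of the formula: with s = 5% and P = N = 100, S_P(P) = 100 - 0.05 · 99 = 95.05, i.e. a parallel efficiency of about 95%.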
Gustafson's law

[Figure: parallel speedup versus number of parallel processes P = problem size N (0 to 250), for s = 0%, 1%, 2%, 5%; the speedup grows nearly linearly with P]
Gustafson's law

[Figure: same graph as on the previous slide, but expressed as parallel efficiency (0.94 to 1) and with a logarithmic scale for P = N (1 to 65536); each curve tends to η_∞ = 1 - s]
Outline
• Distributed-memory architecture: general considerations
• Programming model: Message Passing Interface (MPI)
  § Point-to-point communication
    o Blocking communication
    o Point-to-point network performance
    o Non-blocking communication
  § Collective communication
    o Collective communication algorithms
    o Global network performance
• Parallel program performance evaluation
  § Amdahl's law
  § Gustafson's law
• Parallel program development: case studies
Case Study 1: Parallel matrix-matrix product

Case study: parallel matrix multiplication

• Matrix-matrix multiplication: C = α*A*B + β*C (BLAS xgemm)
  § Assume C = m x n; A = m x k; B = k x n matrix.
  § α, β = scalars; assume α = β = 1 in what follows, i.e. C = C + A*B
• Initially, matrix elements are distributed among P processes
  § Assume the same scheme for each matrix A, B and C
• Each process computes values for C that are local to that process
  § Required data from A and B that is not local needs to be communicated
  § Performance modeling assuming the α, β, γ model
    o α = latency
    o β = per-element transfer time
    o γ = time for a single floating point computation

Case study reproduced from J. Demmel
Case study: parallel matrix-matrix product

• Two-dimensional partitioning
  § Partition the matrices in 2D in an r x c mesh (P = r*c)
  § X(I,J) refers to block (I,J) of matrix X (X = {A, B, C})
  § Process p_n is also denoted by p_i,j (n = i*c + j) and holds X(I,J)

[Figure: an r x c process mesh p_0,0 ... p_r-1,c-1 overlaid on a k x n matrix, e.g. matrix B, with block B(I,J) held by process p_i,j]

Case study reproduced from J. Demmel
Case study: parallel matrix-matrix product

• Second approach: two-dimensional partitioning (SUMMA)
  § Process p_i,j needs to compute C(I,J):

    C(I,J) = C(I,J) + Σ_{i=0}^{k-1} A(I,i) * B(i,J)   for all I = 0 ... r-1 and J = 0 ... c-1 (do this in parallel)

  § The index i refers to a single column of A or a single row of B (k elements)

[Figure: block C(I,J) is updated from block row A(I,·) and block column B(·,J); data local to p_i,j, data that needs to be communicated to p_i,j, and data not needed by p_i,j are highlighted]

Case study reproduced from J. Demmel
Case study: parallel matrix-matrix product

• Second approach: two-dimensional partitioning
  § SUMMA algorithm (all I and J in parallel):

    for i = 0 to k-1
        broadcast A(I, i) within process row
        broadcast B(i, J) within process column
        C(I,J) += A(I, i) * B(i, J)
    endfor

  § Cost for the inner loop (executed k times):
    log2(c)·(α + β·m/r) + log2(r)·(α + β·n/c) + 2mnγ/P
  § Total: T_P = 2kmnγ/P + kα(log2 c + log2 r) + kβ((m/r)·log2 c + (n/c)·log2 r)
  § For n = m = k and r = c = √P we find:
    parallel efficiency η_P = 1 / (1 + (α/γ)·(P·log2 P)/(2n²) + (β/γ)·(√P·log2 P)/n)
    Isoefficiency when n grows as √P (constant memory per node!)

Case study reproduced from J. Demmel
Case study: parallel matrix-matrix product

• Even more efficient: use a "blocking" algorithm

    for i = 0 to k-1 step b
        end = min(i+b-1, k-1)
        broadcast A(I, i:end) within process row
        broadcast B(i:end, J) within process column
        C(I,J) += A(I, i:end) * B(i:end, J)   // perform this product using Level-3 BLAS
    endfor

• The SUMMA algorithm is implemented in PBLAS = Parallel BLAS
• The algorithm can be extended to a block-cyclic layout (see further)
Case study: parallel Gaussian elimination

[Image taken from J. Demmel]
Case Study 2: Parallel Sorting

Case study: parallel sorting algorithm

• Sequential sorting of n keys (1st Bachelor)
  § Bubble sort: O(n²)
  § Merge sort: O(n log n), even in the worst case
  § Quicksort: O(n log n) expected, O(n²) worst case, fast in practice
• Parallel sorting of n keys, using P processes
  § Initially, each process holds n/P keys (unsorted)
  § Eventually, each process holds n/P keys (sorted)
    o Keys per process are sorted
    o If q < r, each key assigned to process q is less than or equal to every key assigned to process r (sort keys in rank order)
Case study: parallel sorting algorithm

Bubble sort algorithm: O(n²)

void Bubble_sort(int *a, int n) {
    for (int listLen = n; listLen >= 2; listLen--)
        for (int i = 0; i < listLen-1; i++)
            if (a[i] > a[i+1]) {      // "compare-swap" operation
                int temp = a[i];
                a[i] = a[i+1];
                a[i+1] = temp;
            }
}

Example: 5 2 4 8 1 → 2 4 5 1 8 (after iteration 1) → 2 4 1 5 8 (after iteration 2)
         → 2 1 4 5 8 (after iteration 3) → 1 2 4 5 8 (after iteration 4)

Algorithm reproduced from P. Pacheco
Case study: parallel sorting algorithm

• Bubble sort
  § The result of the current step (a[i] > a[i+1]) depends on the previous step
    o The value of a[i] is determined by the previous step
    o The algorithm is "inherently serial"
    o Not much point in trying to parallelize this algorithm
• Odd-even transposition sort
  § Decouple the algorithm into two phases: even and odd
    o Even phase: compare-swap on the following elements:
      (a[0], a[1]), (a[2], a[3]), (a[4], a[5]), ...
    o Odd phase: compare-swap operations on the following elements:
      (a[1], a[2]), (a[3], a[4]), (a[5], a[6]), ...
Case study: parallel sorting algorithm

Even-odd transposition sort algorithm: O(n²)

void Even_odd_sort(int *a, int n) {
    for (int phase = 0; phase < n; phase++)
        if (phase % 2 == 0) {  // even phase
            for (int i = 0; i < n-1; i += 2)
                if (a[i] > a[i+1]) {      // "compare-swap" operation
                    int temp = a[i];
                    a[i] = a[i+1];
                    a[i+1] = temp;
                }
        } else {               // odd phase
            for (int i = 1; i < n-1; i += 2)
                if (a[i] > a[i+1]) {      // "compare-swap" operation
                    int temp = a[i];
                    a[i] = a[i+1];
                    a[i+1] = temp;
                }
        }
}

Algorithm reproduced from P. Pacheco
Case study: parallel sorting algorithm

Even-odd transposition sort algorithm: O(n²)

Example: 5 2 4 8 1 → 2 5 4 8 1 (even phase) → 2 4 5 1 8 (odd phase) → 2 4 1 5 8 (even phase)
         → 2 1 4 5 8 (odd phase) → 1 2 4 5 8 (even phase)

Parallelism within each even or odd phase is now obvious:
the compare-swap between (a[i], a[i+1]) is independent from (a[i+2], a[i+3]).
Case study: parallel sorting algorithm

Parallel algorithm
• First, assume P == n (one element per process)
• Communicate the value with the neighbor
  • Right process (highest rank) keeps the largest value
  • Left process (lowest rank) keeps the smallest value

[Figure: in phase j, neighboring pairs such as (a[i], a[i+1]) exchange values in parallel; in phase j+1, the pairing shifts by one, e.g. (a[i-1], a[i]), (a[i+1], a[i+2])]

Image reproduced from P. Pacheco
Case study: parallel sorting algorithm

Parallel algorithm
• Now, assume n/P >> 1 (as is typically the case)
• Example: P = 4; n = 16

                  process 0      process 1      process 2      process 3
initial values    15,11,9,16     3,14,8,7       4,6,12,10      5,2,13,1
local sorting     9,11,15,16     3,7,8,14       4,6,10,12      1,2,5,13
phase 0 (even)    3,7,8,9        11,14,15,16    1,2,4,5        6,10,12,13
phase 1 (odd)     3,7,8,9        1,2,4,5        11,14,15,16    6,10,12,13
phase 2 (even)    1,2,3,4        5,7,8,9        6,10,11,12     13,14,15,16
phase 3 (odd)     1,2,3,4        5,6,7,8        9,10,11,12     13,14,15,16

Algorithm reproduced from P. Pacheco
Case study: parallel sorting algorithm

Parallel even-odd transposition sort pseudocode:

sort local keys
for (int phase = 0; phase < P; phase++) {
    neighbor = computeNeighbor(phase, myRank);
    if (I'm not idle) {  // first and/or last process may be idle
        send all my keys to neighbor
        receive all keys from neighbor
        if (myRank < neighbor)
            keep smaller keys
        else
            keep larger keys
    }
}

Theorem: the parallel odd-even transposition sort algorithm will sort the input list after P (= number of processes) phases.

Algorithm reproduced from P. Pacheco
Case study: parallel sorting algorithm

Implementation of computeNeighbor (MPI):

int computeNeighbor(int phase, int myRank) {
    int neighbor;
    if (phase % 2 == 0) {
        if (myRank % 2 == 0)
            neighbor = myRank + 1;
        else
            neighbor = myRank - 1;
    } else {
        if (myRank % 2 == 0)
            neighbor = myRank - 1;
        else
            neighbor = myRank + 1;
    }
    if (neighbor == -1 || neighbor == P)
        neighbor = MPI_PROC_NULL;
    return neighbor;
}

When MPI_PROC_NULL is used as the destination or source rank in MPI_Send or MPI_Recv, no communication takes place.

Algorithm reproduced from P. Pacheco
Case study: parallel sorting algorithm

Implementation of the data exchange in MPI
• Be careful of deadlocks
• In both even and odd phases, communication always takes place between a process with an even rank and a process with an odd rank
• Exchange order to prevent deadlocks:

if (myRank % 2 == 0) {
    MPI_Send(...)
    MPI_Recv(...)
} else {
    MPI_Recv(...)
    MPI_Send(...)
}

OR use MPI_Sendrecv(...)

Algorithm reproduced from P. Pacheco
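The "keep smaller / keep larger keys" step from the pseudocode can be sketched as a compare-split on the already sorted local blocks; myKeys, recvKeys and temp are assumed buffers of localN = n/P keys each (memcpy requires <string.h>):

// both processes exchange their sorted blocks and each keeps one half of the merge
MPI_Sendrecv( myKeys, localN, MPI_INT, neighbor, 0,
              recvKeys, localN, MPI_INT, neighbor, 0,
              MPI_COMM_WORLD, MPI_STATUS_IGNORE );

if (myRank < neighbor) {            // keep the localN smallest keys
    for (int i = 0, j = 0, k = 0; k < localN; k++)
        temp[k] = (myKeys[i] <= recvKeys[j]) ? myKeys[i++] : recvKeys[j++];
} else {                            // keep the localN largest keys
    for (int i = localN - 1, j = localN - 1, k = localN - 1; k >= 0; k--)
        temp[k] = (myKeys[i] >= recvKeys[j]) ? myKeys[i--] : recvKeys[j--];
}
memcpy( myKeys, temp, localN * sizeof(int) );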
Case study: parallel sorting algorithm

Parallel odd-even transposition sort algorithm analysis
• Initial sorting: O(n/P · log(n/P)) time
  • Use an efficient sequential sorting algorithm, e.g. quicksort or merge sort
• Per phase: 2(α + βn/P) + γn/P
• Total runtime T_P(n) = O(n/P · log(n/P)) + 2(αP + βn) + γn
  = 1/P · O(n log n) + O(n)
• Linear speedup when P is small and n is large
• However, bad asymptotic behaviour
  • When n and P increase proportionally, the runtime per process is O(n)
  • What we really want: O(log n)
  • Difficult! (but possible!)

Algorithm reproduced from P. Pacheco
Case study: parallel sorting algorithm

• Sorting networks (= graphical depiction of sorting algorithms)
  § Number of horizontal "wires" (= elements to sort)
  § Connected by vertical "comparators" (= compare and swap)

[Figure: a comparator takes inputs x and y and outputs min(x,y) on one wire and max(x,y) on the other; example network with 4 elements to sort, time running from left to right]
Case study: parallel sorting algorithm

[Figure: the bubble sort algorithm drawn as a sorting network (sequential) and the same bubble sort network with comparators scheduled in parallel]

• Sequential runtime = number of comparators = "size of the sorting network" = n·(n - 1)/2
• Parallel runtime (assume P == n processes) = "depth of the sorting network" = 2n - 3
Case study: parallel sorting algorithm

Definition: depth of a sorting network (= parallel runtime)
• Zero at the inputs of each wire
• For a comparator with inputs of depth d1 and d2, the depth of its outputs is 1 + max(d1, d2)
• Depth of the sorting network = maximum depth over all outputs

[Figure: the parallel bubble sort network annotated with the depth of each comparator output, increasing up to 2n - 3]
Case study: parallel sorting algorithm

• Sorting network of odd-even transposition sort

[Figure: the odd-even transposition sort network]

• Parallel runtime = "depth of the sorting network" = n
• ... or P (we assume n == P)
Case study: parallel sorting algorithm

• Can we do better?
  • Sequential sorting algorithms are O(n log n)
  • Ideally, P and n can scale proportionally: P = O(n)
  • That means that we want to sort n numbers in O(log n) time
• This is possible (!), however, with a big constant pre-factor
• We will describe an algorithm that can sort n numbers in O(log² n) parallel time (using P = O(n) processes)
  • This algorithm has the best performance in practice
  • ... unless n becomes huge (n > 2^2000)
  • Nobody wants to sort that many numbers
Case study: parallel sorting algorithm

• Theorem: if a sorting network with n inputs sorts all 2^n binary strings of length n correctly, then it sorts all sequences correctly (proof: see references).

[Figure: example of a sorting network applied to a binary input sequence of zeros and ones]

We will design an algorithm that can sort binary sequences in O(log² n) time.
Case study: parallel sorting algorithm

• Step 1: create a sorting network that sorts bitonic sequences
• Definition: a bitonic sequence is a sequence which is first increasing and then decreasing, or can be circularly shifted to become so.
  § (1, 2, 3, 3.14, 5, 4, 3, 2, 1) is bitonic
  § (4, 5, 4, 3, 2, 1, 1, 2, 3) is bitonic
  § (1, 2, 1, 2) is not bitonic
• Over zeros and ones, a bitonic sequence is of the form
  § 0^i 1^j 0^k or 1^i 0^j 1^k (with e.g. 0^i = 000...0 = i consecutive zeros)
  § i, j or k can be zero
Case study: parallel sorting algorithm

• Now, let's create a sorting network that sorts a bitonic sequence
• A half-cleaner network connects line i with line i + n/2 (half-cleaner example: n = 8)
• If the input is a binary bitonic sequence, then for the output:
  § Elements in the top half are smaller than the corresponding elements in the bottom half, i.e. the halves are relatively sorted.
  § One of the halves of the output consists of only zeros or ones (i.e. is "clean"), the other half is bitonic.
Case study: parallel sorting algorithm

• Example of a half-cleaner network

[Figure: a half-cleaner on 8 wires applied to a bitonic input sequence of zeros and ones; the upper half of the output is "clean", the lower half is bitonic, and the upper half is smaller than the lower half]
Case study: parallel sorting algorithm

• Therefore, a bitonic-sorter[n] (i.e. a network that sorts a bitonic sequence of length n) is obtained as a half-cleaner[n] followed by two bitonic-sorter[n/2] networks, one on each half.

A bitonic sorter sorts a bitonic sequence of length n = 2^k using
• size = nk/2 = (n/2)·log2 n comparators (= sequential time)
• depth = k = log2 n (= parallel time)
Case study: parallel sorting algorithm

• Step 2: build a network merger[n] that merges two sorted sequences of length n/2 so that the output is sorted
  § Flip the second sequence and concatenate the first and the flipped second sequence
  § The concatenated sequence is bitonic, so sort it using step 1

[Figure: two sorted sequences of length n/2; the second is flipped, and the first comparator stage feeds into two bitonic-sorter[n/2] networks]
Case study: parallel sorting algorithm

• Step 3: build a sorter[n] network that sorts arbitrary sequences
  § Do this recursively from merger[n] building blocks: two sorter[n/2] networks followed by a merger[n] (from step 2)
  § Depth: D(1) = 0 and D(n) = D(n/2) + log2 n = O(log² n)
Case study: parallel sorting algorithm

Example: sorter[16]

[Figure: a sorter[16] network built from merger[2], merger[4], merger[8] and merger[16] stages; the merger[16] stage consists of a flip-cleaner[16] followed by half-cleaner[8] stages]
Case study: parallel sorting algorithm

• Further reading on bitonic networks:
  § http://valis.cs.uiuc.edu/~sariel/teach/2004/b/webpage/lec/14_sortnet_notes.pdf
• In case n is not a power of two:
  § http://www.iti.fh-flensburg.de/lang/algorithmen/sortieren/bitonic/oddn.htm