CS 179: GPU Programming
Transcript of CS 179: GPU Programming
CS 179: GPU Programming
Lecture 14: Inter-process Communication
The Problem
What if we want to use GPUs across a distributed system?
GPU cluster, CSIRO
Distributed System
A collection of computers
Each computer has its own local memory!
Communication over a network
Communication suddenly becomes harder! (and slower!)
GPUs can’t be trivially used between computers
We need some way to communicate between processes…
Message Passing Interface (MPI)
A standard for message-passing
Multiple implementations exist
Standard functions that allow easy communication of data between processes
Works on non-networked systems too! (equivalent to memory copying on the local system)
MPI Functions
There are seven basic functions:
MPI_Init        initialize MPI environment
MPI_Finalize    terminate MPI environment
MPI_Comm_size   how many processes we have running
MPI_Comm_rank   the ID of our process
MPI_Isend       send data (nonblocking)
MPI_Irecv       receive data (nonblocking)
MPI_Wait        wait for request to complete
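A minimal sketch of how these calls fit together (illustrative only; the full send/receive example appears later in the lecture):

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);                    // set up the MPI environment

    int size, rank;
    MPI_Comm_size(MPI_COMM_WORLD, &size);      // how many processes exist
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);      // which one we are

    printf("process %d of %d\n", rank, size);

    // ... MPI_Isend / MPI_Irecv / MPI_Wait would go here ...

    MPI_Finalize();                            // tear down the MPI environment
    return 0;
}
```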
MPI Functions
Some additional functions:
MPI_Barrier   wait for all processes to reach a certain point
MPI_Bcast     send data to all other processes
MPI_Reduce    receive data from all processes and reduce to a value
MPI_Send      send data (blocking)
MPI_Recv      receive data (blocking)
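A quick hedged sketch of MPI_Bcast and MPI_Barrier in use (not from the slides; assumes it runs between MPI_Init and MPI_Finalize, with rank obtained from MPI_Comm_rank):

```c
int n = 0;
if (rank == 0) n = 1024;                       // only rank 0 knows the value
MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);  // now every rank has n
MPI_Barrier(MPI_COMM_WORLD);                   // wait until all processes reach this point
```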
The workings of MPI
Big idea:
We have some process ID
We know how many processes there are
A send request/receive request pair is necessary to communicate data!
MPI Functions – Init
int MPI_Init(int *argc, char ***argv)
Takes in pointers to command-line parameters (more on this later)
MPI Functions – Size, Rank
int MPI_Comm_size(MPI_Comm comm, int *size)
int MPI_Comm_rank(MPI_Comm comm, int *rank)
Take in a “communicator”
For us, this will be MPI_COMM_WORLD – essentially, the communication group that contains all processes
Take in a pointer to where we will store our size (or rank)
MPI Functions – Send
int MPI_Isend(const void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm, MPI_Request *request)
Non-blocking send
The program can proceed without waiting for the send to complete! (Can use MPI_Send for a blocking send)
MPI Functions – Send
int MPI_Isend(const void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm, MPI_Request *request)
Takes in…
A pointer to what we send
Size of data
Type of data (special MPI designations: MPI_INT, MPI_FLOAT, etc.)
Destination process ID
…
MPI Functions – Send
int MPI_Isend(const void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm, MPI_Request *request)
Takes in… (continued)
A “tag” – a nonnegative integer (0 up to at least 32767)
Uniquely identifies our operation – used to identify messages when many are sent simultaneously
MPI_ANY_TAG allows all possible incoming messages
A “communicator” (MPI_COMM_WORLD)
A “request” object
Holds information about our operation
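A hedged sketch of a call (x, dest, and tag are example names and values, not from the slides; assumes an initialized MPI program):

```c
int x = 42;                       // what we send
int dest = 1;                     // destination process ID
int tag = 7;                      // identifies this message
MPI_Request request;
MPI_Isend(&x, 1, MPI_INT,         // pointer to data, element count, element type
          dest, tag,
          MPI_COMM_WORLD,         // communicator
          &request);              // request handle; completion is checked with MPI_Wait
```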
MPI Functions – Recv
int MPI_Irecv(void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Request *request)
Non-blocking receive (similar idea)
(Can use MPI_Recv for a blocking receive)
MPI Functions – Recv
int MPI_Irecv(void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Request *request)
Takes in…
A buffer (pointer to where received data is placed)
Size of incoming data
Type of data (special MPI designations: MPI_INT, MPI_FLOAT, etc.)
Sender’s process ID
…
MPI Functions – Recv
int MPI_Irecv(void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Request *request)
Takes in…
A “tag”
Should be the same tag as the tag used to send the message! (assuming MPI_ANY_TAG not in use)
A “communicator” (MPI_COMM_WORLD)
A “request” object
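A hedged sketch of the matching receive (y, source, and tag are example names; source and tag must match the sender’s; assumes an initialized MPI program):

```c
int y;                            // buffer where the received int is placed
int source = 0;                   // sender's process ID (or MPI_ANY_SOURCE)
int tag = 7;                      // must match the sender's tag (or MPI_ANY_TAG)
MPI_Request request;
MPI_Irecv(&y, 1, MPI_INT, source, tag, MPI_COMM_WORLD, &request);
MPI_Wait(&request, MPI_STATUS_IGNORE);   // y is not valid until the wait returns
```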
Blocking vs. Non-blocking
MPI_Isend and MPI_Irecv are non-blocking communications
Calling these functions returns immediately
The operation may not be finished!
Should use MPI_Wait to make sure operations are completed
Special “request” objects are used for tracking status
MPI_Send and MPI_Recv are blocking communications
The functions don’t return until the operation is complete
Can cause deadlock!
(we won’t focus on these)
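A short sketch of the deadlock hazard (illustrative, not from the slides; rank comes from MPI_Comm_rank and exactly two processes are assumed):

```c
/* If both ranks call the blocking MPI_Send first, each may wait forever
   for the other to post a matching receive. */
int a = rank, b = 0;
int other = 1 - rank;             // the other of the two processes
MPI_Send(&a, 1, MPI_INT, other, 0, MPI_COMM_WORLD);   // may block until matched...
MPI_Recv(&b, 1, MPI_INT, other, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
/* Fix: have one rank receive first, or use MPI_Isend/MPI_Irecv + MPI_Wait. */
```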
MPI Functions – Wait
int MPI_Wait(MPI_Request *request, MPI_Status *status)
Takes in…
A “request” object corresponding to a previous operation
Indicates what we’re waiting on
A “status” object
Basically, information about incoming data
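A hedged fragment showing a wait on a previously posted MPI_Irecv and a look at the resulting status (the MPI_Get_count call is extra detail not covered on the slide):

```c
MPI_Status status;
MPI_Wait(&request, &status);                 // blocks until the operation completes

int received;
MPI_Get_count(&status, MPI_INT, &received);  // how many ints actually arrived
printf("got %d ints from rank %d (tag %d)\n",
       received, status.MPI_SOURCE, status.MPI_TAG);
```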
MPI Functions – Reduce
int MPI_Reduce(const void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype, MPI_Op op, int root, MPI_Comm comm)
Takes in…
A “send buffer” (data obtained from every process)
A “receive buffer” (where our final result will be placed)
Number of elements in the send buffer
Can reduce element-wise: array -> array
Type of data (MPI label, as before)
…
MPI Functions – Reduce
int MPI_Reduce(const void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype, MPI_Op op, int root, MPI_Comm comm)
Takes in… (continued)
Reducing operation (special MPI labels, e.g. MPI_SUM, MPI_MIN)
ID of the process that obtains the result
MPI communicator (as before)
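A hedged sketch of a sum across processes (local_value is a made-up placeholder; assumes an initialized MPI program with rank from MPI_Comm_rank):

```c
int local_value = rank;           // placeholder for whatever this process computed
int global_sum = 0;
MPI_Reduce(&local_value,          // send buffer: one value from every process
           &global_sum,           // receive buffer: only meaningful on the root
           1, MPI_INT,            // count and element type
           MPI_SUM,               // reduction operation
           0,                     // root: the rank that obtains the result
           MPI_COMM_WORLD);
if (rank == 0) printf("total = %d\n", global_sum);
```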
MPI Example
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, numprocs;
    MPI_Status status;
    MPI_Request request;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int tag = 1234;
    int source = 0;
    int destination = 1;
    int count = 1;
    int send_buffer;
    int recv_buffer;

    if (rank == source) {
        send_buffer = 5678;
        MPI_Isend(&send_buffer, count, MPI_INT, destination, tag,
                  MPI_COMM_WORLD, &request);
    }
    if (rank == destination) {
        MPI_Irecv(&recv_buffer, count, MPI_INT, source, tag,
                  MPI_COMM_WORLD, &request);
    }

    MPI_Wait(&request, &status);   // each rank waits on its own request

    if (rank == source) {
        printf("processor %d sent %d\n", rank, send_buffer);
    }
    if (rank == destination) {
        printf("processor %d got %d\n", rank, recv_buffer);
    }

    MPI_Finalize();
    return 0;
}
Two processes
Sends a number from process 0 to process 1
Note: Both processes are running this code!
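To try it (assuming an MPI implementation such as Open MPI or MPICH is installed, and using mpi_example.c as a made-up name for the listing above), compile with `mpicc mpi_example.c -o mpi_example` and launch two copies with `mpirun -np 2 ./mpi_example`.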
Solving Problems
Goal: to use GPUs across multiple computers!
Two levels of division:
Divide work between processes (using MPI)
Divide work within each GPU (using CUDA, as usual)
Important note: single-program, multiple-data (SPMD) for our purposes
i.e. running multiple copies of the same program!
Multiple Computers, Multiple GPUs
Can combine MPI, CUDA streams/device selection, and single-GPU CUDA!
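One common pattern (a hedged sketch, not prescribed by the slides) is to map each MPI rank on a node to one of that node's GPUs before doing any CUDA work:

```c
int rank, deviceCount;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
cudaGetDeviceCount(&deviceCount);
cudaSetDevice(rank % deviceCount);   // assumes roughly one process per GPU per node
```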
MPI/CUDA Example
Suppose we have our “polynomial problem” again
Suppose we have n processes for n computers, each with one GPU
(Suppose the array of data is obtained through process 0)
Every process will run the following:
MPI/CUDA Example

MPI_Init

declare rank, numberOfProcesses
set rank with MPI_Comm_rank
set numberOfProcesses with MPI_Comm_size

if process rank is 0:
    array <- (obtain our raw data)
    for all processes idx = 0 through numberOfProcesses - 1:
        send fragment idx of the array to process ID idx using MPI_Isend
        (this fragment will have size (size of array) / numberOfProcesses)

declare local_data
read into local_data for our process using MPI_Irecv
MPI_Wait for our process's receive request to finish

local_sum <- process local_data with CUDA and obtain a sum

declare global_sum
set global_sum on process 0 with MPI_Reduce
(use global_sum somehow)

MPI_Finalize
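Below is a hedged C sketch of the same flow. N, fragment_size, local_data, and gpu_sum are invented names; gpu_sum is written as a plain CPU loop standing in for the real CUDA kernel launch and reduction, and N is assumed divisible by the number of processes.

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

/* Stand-in for the CUDA part: in real code this would copy local_data to the
   GPU, launch a kernel, and reduce the result to one value. */
static float gpu_sum(const float *data, int n) {
    float s = 0.0f;
    for (int i = 0; i < n; i++) s += data[i];
    return s;
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, numberOfProcesses;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &numberOfProcesses);

    const int N = 1 << 20;                          // total number of elements (assumed)
    const int fragment_size = N / numberOfProcesses;

    /* Every process (including rank 0) posts a receive for its own fragment. */
    float *local_data = malloc(fragment_size * sizeof(float));
    MPI_Request recv_request;
    MPI_Irecv(local_data, fragment_size, MPI_FLOAT, 0, 0,
              MPI_COMM_WORLD, &recv_request);

    if (rank == 0) {
        float *array = malloc(N * sizeof(float));
        for (int i = 0; i < N; i++) array[i] = 1.0f;   // placeholder raw data

        MPI_Request send_requests[numberOfProcesses];
        for (int idx = 0; idx < numberOfProcesses; idx++) {
            MPI_Isend(array + idx * fragment_size, fragment_size, MPI_FLOAT,
                      idx, 0, MPI_COMM_WORLD, &send_requests[idx]);
        }
        MPI_Waitall(numberOfProcesses, send_requests, MPI_STATUSES_IGNORE);
        free(array);
    }

    MPI_Wait(&recv_request, MPI_STATUS_IGNORE);     // local_data is now valid

    float local_sum = gpu_sum(local_data, fragment_size);   // CUDA would do the real work

    float global_sum = 0.0f;
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_FLOAT, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("global sum = %f\n", global_sum);

    free(local_data);
    MPI_Finalize();
    return 0;
}
```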
MPI/CUDA – Wave Equation
Recall our calculation:

$$y_{x,\,t+1} = 2\,y_{x,\,t} - y_{x,\,t-1} + \left(\frac{c\,\Delta t}{\Delta x}\right)^{2}\left(y_{x+1,\,t} - 2\,y_{x,\,t} + y_{x-1,\,t}\right)$$
MPI/CUDA – Wave Equation
Big idea: divide our data array between n processes!
MPI/CUDA – Wave Equation
In reality, we have three regions of data at a time (old, current, new)
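As a hedged illustration of what each process computes, here is the interior update for one timestep written as serial C for clarity (on the GPU, each x would be handled by one thread); old, current, next, local_n, and coeff = (c Δt / Δx)² are invented names:

```c
/* Advance the interior points of this process's fragment by one timestep. */
for (int x = 1; x < local_n - 1; x++) {
    next[x] = 2.0f * current[x] - old[x]
            + coeff * (current[x + 1] - 2.0f * current[x] + current[x - 1]);
}
```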
MPI/CUDA – Wave Equation
The calculation for timestep t+1 uses the following data:

$$y_{x,\,t+1} = 2\,y_{x,\,t} - y_{x,\,t-1} + \left(\frac{c\,\Delta t}{\Delta x}\right)^{2}\left(y_{x+1,\,t} - 2\,y_{x,\,t} + y_{x-1,\,t}\right)$$

[Diagram: the grid points at timesteps t-1 and t that feed the value at timestep t+1]
MPI/CUDA – Wave Equation
Problem if we’re at the boundary of a process!

$$y_{x,\,t+1} = 2\,y_{x,\,t} - y_{x,\,t-1} + \left(\frac{c\,\Delta t}{\Delta x}\right)^{2}\left(y_{x+1,\,t} - 2\,y_{x,\,t} + y_{x-1,\,t}\right)$$

Where do we get $y_{x+1,\,t}$ (or $y_{x-1,\,t}$ at the other edge)? (It’s outside our process!)

[Diagram: the stencil for position x at timestep t+1 reaching past the process boundary]
Wave Equation – Simple Solution
After every timestep, each process gives its leftmost and rightmost pieces of “current” data to its neighbor processes!

[Diagram: the data array split across Proc0–Proc4, with boundary values passed between neighboring processes]
Wave Equation – Simple Solution
Pieces of data to communicate:
[Diagram: the boundary values each of Proc0–Proc4 sends to its neighbors]
Wave Equation – Simple Solution
Can do this with MPI_Irecv, MPI_Isend, and MPI_Wait (a sketch follows below):
Suppose our process has rank r:
If we’re not the rightmost process:
    Send data to process r+1
    Receive data from process r+1
If we’re not the leftmost process:
    Send data to process r-1
    Receive data from process r-1
Wait on requests
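A hedged sketch of that exchange (current, local_n, left_ghost, and right_ghost are invented names; rank and size come from MPI_Comm_rank and MPI_Comm_size):

```c
/* Exchange boundary "current" values with neighboring processes. */
MPI_Request requests[4];
int nreq = 0;

if (rank != size - 1) {   // not the rightmost process
    MPI_Isend(&current[local_n - 1], 1, MPI_FLOAT, rank + 1, 0,
              MPI_COMM_WORLD, &requests[nreq++]);
    MPI_Irecv(&right_ghost, 1, MPI_FLOAT, rank + 1, 0,
              MPI_COMM_WORLD, &requests[nreq++]);
}
if (rank != 0) {          // not the leftmost process
    MPI_Isend(&current[0], 1, MPI_FLOAT, rank - 1, 0,
              MPI_COMM_WORLD, &requests[nreq++]);
    MPI_Irecv(&left_ghost, 1, MPI_FLOAT, rank - 1, 0,
              MPI_COMM_WORLD, &requests[nreq++]);
}
MPI_Waitall(nreq, requests, MPI_STATUSES_IGNORE);   // boundary values are now exchanged
```

MPI_Waitall is used here instead of several separate MPI_Wait calls purely for brevity.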
Wave Equation – Simple Solution
Boundary conditions:
Use MPI_Comm_rank and MPI_Comm_size
The rank 0 process will set the leftmost condition
The rank (size-1) process will set the rightmost condition
Simple Solution – Problems
Communication can be expensive!
It’s expensive to communicate every timestep just to send one value!
Better solution: send m values every m timesteps!