Transcript of Lecture 8: Distributed memory (bindel/class/cs5220-f11/slides/lec08.pdf)
Lecture 8: Distributed memory
David Bindel
22 Sep 2011
Logistics
- Interesting CS colloquium tomorrow at 4:15 in Upson B15: "Trumping the Multicore Memory Hierarchy with Hi-Spade"
- Proj 1 due tomorrow at 11:59! I will be gone after 2-3 tomorrow, so get in your questions now. (Alternate suggestion: Monday at 11:59?)
- Suggest a two-fold strategy: work on a fast kernel (say 16-by-16), and then work on a good blocked code that employs that kernel. Be careful not to spend too much time on optimizations that the compiler does better.
- When you submit your Makefile, please make sure that you copy any changes that you made to the flags in Makefile.in into the Makefile proper.
- Some good discussions on Piazza – keep it going!
Next HW: a particle dynamics simulation.
Plan for this week
- Last week: shared memory programming
  - Shared memory HW issues (cache coherence)
  - Threaded programming concepts (pthreads and OpenMP)
  - A simple example (Monte Carlo)
- This week: distributed memory programming
  - Distributed memory HW issues (topologies, cost models)
  - Message-passing programming concepts (and MPI)
  - A simple example ("sharks and fish")
Basic questions
How much does a message cost?
- Latency: time to get between processors
- Bandwidth: data transferred per unit time
- How does contention affect communication?
This is a combined hardware-software question!
We want to understand just enough for reasonable modeling.
Thinking about interconnects
Several features characterize an interconnect:
- Topology: who do the wires connect?
- Routing: how do we get from A to B?
- Switching: circuits, store-and-forward?
- Flow control: how do we manage limited resources?
Thinking about interconnects
- Links are like streets
- Switches are like intersections
- Hops are like blocks traveled
- Routing algorithm is like a travel plan
- Stop lights are like flow control
- Short packets are like cars, long ones like buses?
At some point the analogy breaks down...
Bus topology
[Diagram: processors P0-P3, each with a cache ($), sharing a single bus to memory (Mem)]
- One set of wires (the bus)
- Only one processor allowed to use it at any given time
- Contention for the bus is an issue
- Example: basic Ethernet, some SMPs
Crossbar
[Diagram: 4-by-4 crossbar with processors P0-P3 along both rows and columns and a switch at each crossing]
- Dedicated path from every input to every output
- Takes O(p^2) switches and wires!
- Example: recent AMD/Intel multicore chips (older: front-side bus)
Bus vs. crossbar
- Crossbar: more hardware
- Bus: more contention (less capacity?)
- Generally seek happy medium:
  - Less contention than bus
  - Less hardware than crossbar
  - May give up one-hop routing
Network properties
Think about latency and bandwidth via two quantities:
- Diameter: max distance between nodes
- Bisection bandwidth: smallest bandwidth cut to bisect
  - Particularly important for all-to-all communication
Linear topology
- p − 1 links
- Diameter p − 1
- Bisection bandwidth 1
Ring topology
- p links
- Diameter p/2
- Bisection bandwidth 2
Mesh
- May be more than two dimensions
- Route along each dimension in turn
Torus
Torus : Mesh :: Ring : Linear
Hypercube
- Label processors with binary numbers
- Connect p1 to p2 if labels differ in one bit (neighbor computation sketched below)
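To make the neighbor rule concrete, here is a minimal sketch (not from the slides; the dimension and node label are arbitrary examples) that lists a node's neighbors by flipping one bit of its label at a time:

```c
/* Sketch: neighbors of a node in a d-dimensional hypercube (p = 2^d nodes).
   Neighbor k is obtained by flipping bit k of the node's binary label. */
#include <stdio.h>

int main(void)
{
    int d = 3;      /* illustrative dimension: 8-node hypercube */
    int node = 5;   /* illustrative node label (binary 101) */
    for (int k = 0; k < d; ++k)
        printf("neighbor %d of node %d is %d\n", k, node, node ^ (1 << k));
    return 0;
}
```

One consequence: the diameter is d = log2 p, since any label can be turned into any other by flipping at most d bits.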
Fat tree
- Processors at leaves
- Increase link bandwidth near root
Others...
- Butterfly network
- Omega network
- Cayley graph
Current picture
- Old: latencies = hops
- New: roughly constant latency (?)
  - Wormhole routing (or cut-through) flattens latencies vs store-forward at the hardware level
  - Software stack dominates HW latency!
  - Latencies not the same between networks (in box vs across)
  - May also have store-forward at the library level
- Old: mapping algorithms to topologies
- New: avoid topology-specific optimization
  - Want code that runs on next year's machine, too!
  - Bundle topology awareness in vendor MPI libraries?
  - Sometimes specify a software topology
α-β model
Crudest model: tcomm = α + βM
- tcomm = communication time
- α = latency
- β = inverse bandwidth
- M = message size
Works pretty well for basic guidance!
Typically α ≫ β ≫ tflop. More money on network, lower α.
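To get a feel for the regimes, here is a small sketch (the α and β values are assumptions for illustration, not measurements) that tabulates the model cost for a range of message sizes:

```c
/* Sketch: alpha-beta cost model t = alpha + beta*M for a few message sizes.
   Parameter values below are illustrative assumptions. */
#include <stdio.h>

int main(void)
{
    double alpha = 1.0e-6;   /* latency (s), assumed */
    double beta  = 4.0e-10;  /* inverse bandwidth (s/byte), assumed */
    for (int bytes = 8; bytes <= (1 << 20); bytes *= 16) {
        double t = alpha + beta * bytes;
        printf("%8d bytes: %.3g s (%.0f%% latency)\n",
               bytes, t, 100.0 * alpha / t);
    }
    return 0;
}
```

With these assumed parameters, latency dominates until messages reach a few kilobytes (α = βM at M = α/β ≈ 2.5 kB).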
LogP model
Like α-β, but includes CPU time on send/recv:
- Latency: the usual
- Overhead: CPU time to send/recv
- Gap: min time between send/recv
- P: number of processors
Assumes small messages (gap ∼ bw for fixed message size).
Communication costs
Some basic goals:
- Prefer larger to smaller messages (avoid latency)
- Avoid communication when possible
  - Great speedup for Monte Carlo and other embarrassingly parallel codes!
- Overlap communication with computation
  - Models tell you how much computation is needed to mask communication costs (e.g. under the α-β model, hide a message behind at least α + βM worth of work).
Message passing programming
Basic operations:
- Pairwise messaging: send/receive
- Collective messaging: broadcast, scatter/gather
- Collective computation: sum, max, other parallel prefix ops
- Barriers (no need for locks!)
- Environmental inquiries (who am I? do I have mail?)
(Much of what follows is adapted from Bill Gropp’s material.)
MPI
- Message Passing Interface
- An interface spec with many implementations
- Bindings to C, C++, Fortran
Hello world
#include <mpi.h>
#include <stdio.h>

int main(int argc, char** argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("Hello from %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}
Communicators
- Processes form groups
- Messages sent in contexts
  - Separate communication for libraries
- Group + context = communicator
- Identify process by rank in group
- Default is MPI_COMM_WORLD (a sketch of creating others follows)
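As a small sketch of building a non-default communicator (not from the slides; the group size of 4 is an arbitrary choice), MPI_Comm_split forms new groups and contexts from an existing communicator:

```c
/* Sketch: split MPI_COMM_WORLD into "row" communicators of 4 ranks each. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char** argv)
{
    int rank, row_rank;
    MPI_Comm row_comm;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Ranks with the same color (rank/4) end up in the same new communicator. */
    MPI_Comm_split(MPI_COMM_WORLD, rank / 4, rank, &row_comm);
    MPI_Comm_rank(row_comm, &row_rank);
    printf("World rank %d is rank %d in its row communicator\n", rank, row_rank);

    MPI_Comm_free(&row_comm);
    MPI_Finalize();
    return 0;
}
```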
Sending and receiving
Need to specify:
- What's the data?
  - Different machines use different encodings (e.g. endian-ness)
  - ⟹ "bag o' bytes" model is inadequate
- How do we identify processes?
- How does receiver identify messages?
- What does it mean to "complete" a send/recv?
MPI datatypes
Message is (address, count, datatype). Allow:
- Basic types (MPI_INT, MPI_DOUBLE)
- Contiguous arrays
- Strided arrays (sketched below)
- Indexed arrays
- Arbitrary structures
Complex data types may hurt performance?
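For instance, a strided array can be described once as a derived type and then sent with count 1; a minimal sketch (not from the slides; the 4x8 matrix and column argument are arbitrary) is:

```c
/* Sketch: send one column of a row-major 4x8 matrix using MPI_Type_vector:
   4 blocks of 1 double each, separated by a stride of 8 doubles. */
#include <mpi.h>

void send_column(double A[4][8], int col, int dest, MPI_Comm comm)
{
    MPI_Datatype column;
    MPI_Type_vector(4, 1, 8, MPI_DOUBLE, &column);
    MPI_Type_commit(&column);
    MPI_Send(&A[0][col], 1, column, dest, 0, comm);
    MPI_Type_free(&column);
}
```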
MPI tags
Use an integer tag to label messages
- Help distinguish different message types (see the sketch below)
- Can screen messages with wrong tag
- MPI_ANY_TAG is a wildcard
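As a sketch of why tags help (not from the slides; the tag values and payloads are arbitrary), two different kinds of messages between the same pair of ranks can be kept apart by matching on tag:

```c
/* Sketch: tag 1 carries data, tag 2 carries a control flag, between ranks 1 and 0.
   Each receive matches only messages with its own tag. */
#include <mpi.h>

void send_two_kinds(int rank)
{
    double coords[3] = {1.0, 2.0, 3.0};
    int    done      = 0;
    if (rank == 1) {
        MPI_Send(coords, 3, MPI_DOUBLE, 0, 1, MPI_COMM_WORLD);   /* tag 1 */
        MPI_Send(&done,  1, MPI_INT,    0, 2, MPI_COMM_WORLD);   /* tag 2 */
    } else if (rank == 0) {
        MPI_Recv(coords, 3, MPI_DOUBLE, 1, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Recv(&done,  1, MPI_INT,    1, 2, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
}
```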
MPI Send/Recv
Basic blocking point-to-point communication:
int
MPI_Send(void *buf, int count,
         MPI_Datatype datatype,
         int dest, int tag, MPI_Comm comm);

int
MPI_Recv(void *buf, int count,
         MPI_Datatype datatype,
         int source, int tag, MPI_Comm comm,
         MPI_Status *status);
MPI send/recv semantics
- Send returns when data gets to system
  - ... might not yet arrive at destination!
- Recv ignores messages that don't match source and tag
  - MPI_ANY_SOURCE and MPI_ANY_TAG are wildcards
- Recv status contains more info (tag, source, size; see the sketch below)
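A small sketch of reading that information back out of the status (not from the slides; the buffer and count arguments are arbitrary):

```c
/* Sketch: receive from anyone, then ask the status who sent it,
   what tag it carried, and how many items actually arrived. */
#include <mpi.h>
#include <stdio.h>

void recv_any(double* buf, int maxcount)
{
    MPI_Status status;
    int count;
    MPI_Recv(buf, maxcount, MPI_DOUBLE, MPI_ANY_SOURCE, MPI_ANY_TAG,
             MPI_COMM_WORLD, &status);
    MPI_Get_count(&status, MPI_DOUBLE, &count);
    printf("Got %d doubles from rank %d with tag %d\n",
           count, status.MPI_SOURCE, status.MPI_TAG);
}
```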
Ping-pong pseudocode
Process 0:
for i = 1:ntrials
    send b bytes to 1
    recv b bytes from 1
end
Process 1:
for i = 1:ntrials
    recv b bytes from 0
    send b bytes to 0
end
Ping-pong MPI
void ping(char* buf, int n, int ntrials, int p)
{
    for (int i = 0; i < ntrials; ++i) {
        MPI_Send(buf, n, MPI_CHAR, p, 0, MPI_COMM_WORLD);
        MPI_Recv(buf, n, MPI_CHAR, p, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);  /* status not needed here */
    }
}
(Pong is similar)
Ping-pong MPI
for (int sz = 1; sz <= MAX_SZ; sz += 1000) {
    if (rank == 0) {
        clock_t t1, t2;
        t1 = clock();
        ping(buf, sz, NTRIALS, 1);
        t2 = clock();
        printf("%d %g\n", sz,
               (double) (t2-t1)/CLOCKS_PER_SEC);
    } else if (rank == 1) {
        pong(buf, sz, NTRIALS, 0);
    }
}
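One variant worth considering (an assumption on my part, not in the slides): MPI_Wtime gives wall-clock time, which is usually preferable to clock()'s CPU time for this benchmark. A drop-in sketch for the loop body above, assuming the same rank, buf, sz, NTRIALS, ping, and pong:

```c
/* Sketch: wall-clock timing of the same ping loop, reported per round trip. */
if (rank == 0) {
    double t1 = MPI_Wtime();
    ping(buf, sz, NTRIALS, 1);
    double t2 = MPI_Wtime();
    printf("%d %g\n", sz, (t2 - t1) / NTRIALS);
} else if (rank == 1) {
    pong(buf, sz, NTRIALS, 0);
}
```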
Running the code
On my laptop (OpenMPI)
mpicc -std=c99 pingpong.c -o pingpong.x
mpirun -np 2 ./pingpong.x
Details vary, but this is pretty normal.
Approximate α-β parameters (2-core laptop)
[Plot: measured time per message (s) vs. message size b (bytes), 0 to 18000; "Measured" points against the α-β "Model" fit]
α ≈ 1.46 × 10⁻⁶ s, β ≈ 3.89 × 10⁻¹⁰ s/byte
Where we are now
Can write a lot of MPI code with the six operations we've seen:
- MPI_Init
- MPI_Finalize
- MPI_Comm_size
- MPI_Comm_rank
- MPI_Send
- MPI_Recv
... but there are sometimes better ways.
Next time: non-blocking and collective operations!