Transcript of Computer Science 320: Broadcasting

Computer Science 320

Broadcasting


Floyd’s Algorithm on SMP

for i = 0 to n – 1
    parallel for r = 0 to n – 1
        for c = 0 to n – 1
            d_rc = min(d_rc, d_ri + d_ic)
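The pseudocode above can be sketched in plain Java, using a parallel stream as a stand-in for the SMP parallel-for (the class and method names here are illustrative, not the course's Parallel Java API):

```java
import java.util.stream.IntStream;

public class FloydSmp {
    // Floyd's all-pairs shortest paths. The row loop is parallel,
    // mirroring the slide's "parallel for r". This is safe because at
    // iteration i, row i and column i are fixed points (d[i][i] = 0
    // with nonnegative weights), so concurrent reads of row i are
    // never overwritten during that pass.
    static void floyd(double[][] d) {
        int n = d.length;
        for (int i = 0; i < n; ++i) {
            final int k = i;
            IntStream.range(0, n).parallel().forEach(r -> {
                for (int c = 0; c < n; ++c)
                    d[r][c] = Math.min(d[r][c], d[r][k] + d[k][c]);
            });
        }
    }
}
```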


Floyd’s Algorithm on Cluster

• Root node reads distance matrix from input file and scatters row slices to other nodes

• Other nodes compute distances and update their slices

• The slices are gathered back to the root node for output


Parallel I/O File Pattern

• Eliminate the gather of data by having each node write its slice to a separate file

• Eliminate the scatter of data by having each node read its slice from the input file
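As a sketch of the read side of this pattern, each node can seek directly to its own rows and read only its slice. This assumes the matrix file holds big-endian doubles in row-major order (a file-format assumption for this sketch; the class name is hypothetical):

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.DoubleBuffer;

public class SliceReader {
    // Read rows lb..ub (inclusive) of an n-column matrix of doubles.
    // Each node calls this with its own slice bounds, so no scatter
    // message is needed.
    static double[][] readSlice(String file, int n, int lb, int ub)
            throws IOException {
        double[][] slice = new double[ub - lb + 1][n];
        try (RandomAccessFile raf = new RandomAccessFile(file, "r")) {
            // Skip the rows owned by lower-ranked nodes.
            raf.seek((long) lb * n * Double.BYTES);
            byte[] buf = new byte[(ub - lb + 1) * n * Double.BYTES];
            raf.readFully(buf);
            DoubleBuffer db = ByteBuffer.wrap(buf).asDoubleBuffer();
            for (double[] row : slice) db.get(row);
        }
        return slice;
    }
}
```

The write side is symmetric: each node seeks to the same offset in (its own) output file and writes its slice, eliminating the gather.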


Execution Timeline


Sharing Data in Computation

• On each pass through the outer loop, the ith row must be available to all of the processes (they all execute the same line of code in the inner loop)

• They can do this in SMP because they share the entire matrix

• They can’t do this in a cluster setup, because the nodes don’t share memory

for i = 0 to n – 1
    parallel for r = 0 to n – 1
        for c = 0 to n – 1
            d_rc = min(d_rc, d_ri + d_ic)


Share Row via a Broadcast Message

• On each pass through the outer loop, the process that owns row i broadcasts it before the parallel loop runs

• Process that owns the row acts as the root for the broadcast, setting up the source buffer

• The other processes set up a destination buffer

• Broadcast also enforces synchronization; they all wait for the broadcast

for i = 0 to n – 1
    broadcast row i of d
    parallel for r = 0 to n – 1
        for c = 0 to n – 1
            d_rc = min(d_rc, d_ri + d_ic)


// Allocate storage for row broadcast from another process.
row_i = new double [n];
row_i_buf = DoubleBuf.buffer (row_i);

int i_root = 0;
for (int i = 0; i < n; ++ i)
    {
    double[] d_i = d[i];

    // Determine which process owns row i.
    if (! ranges[i_root].contains (i)) ++ i_root;

    // Broadcast row i from owner process to all processes.
    if (rank == i_root)
        world.broadcast (i_root, DoubleBuf.buffer (d_i));
    else
        {
        world.broadcast (i_root, row_i_buf);
        d_i = row_i;
        }

    // Inner loops over rows in my slice and over all columns.
    for (int r = mylb; r <= myub; ++ r)
        {
        double[] d_r = d[r];
        for (int c = 0; c < n; ++ c)
            d_r[c] = Math.min (d_r[c], d_r[i] + d_i[c]);
        }
    }


Problem: Too Many Messages

• The program does one broadcast on every pass through the outer loop, n messages in all, so the amount of time spent in communication is too high compared to the time spent in computation