Chapter 8: Matrix-Vector Multiplication

Transcript of Chapter 8

Page 1: Chapter 8

Chapter 8

Matrix-Vector Multiplication

Page 2: Chapter 8
Page 3: Chapter 8

Sequential Algorithm

Matrix-Vector Multiplication:

Input:  a[0..m-1, 0..n-1] – matrix of dimension m×n
        b[0..n-1] – vector of dimension n×1
Output: c[0..m-1] – vector of dimension m×1

for i ← 0 to m-1
    c[i] ← 0
    for j ← 0 to n-1
        c[i] ← c[i] + a[i,j] × b[j]
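The same algorithm written as a C function (a minimal sketch; the array names follow the pseudocode above, and the matrix is assumed to be stored row-major):

/* Sequential matrix-vector multiplication: c = a × b.
   a is an m×n matrix stored row-major, b has n elements, c has m elements. */
void matvec(int m, int n, const double a[], const double b[], double c[])
{
    for (int i = 0; i < m; i++) {
        c[i] = 0.0;
        for (int j = 0; j < n; j++)
            c[i] += a[i * n + j] * b[j];
    }
}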

Page 4: Chapter 8

MPI_Scatter

Splits an array on the specified source CPU id into equal parts and sends one part to each CPU id in the same communicator.

int Sdata[], Rdata[], Send_cnt, Recv_cnt, src, err;

MPI_Comm COMM; MPI_Datatype Stype, Rtype;

err = MPI_Scatter(Sdata, Send_cnt, Stype, Rdata, Recv_cnt, Rtype, src, COMM);

Sdata     Send array.
Send_cnt  Amount of data sent to each CPU id.
Stype     Send data type.
Rdata     Receive data. If Recv_cnt > 1, Rdata is an array.
Recv_cnt  Amount of data received from the source CPU id.
Rtype     Receive data type.
COMM      Communicator.
src       CPU id that is the source of the data.

Page 5: Chapter 8

MPI_Scatter

int Sdata[8] = {1,2,3,4,5,6,7,8}, Rdata[2];
int Send_cnt = 2, Recv_cnt = 2, src = 0;
MPI_Scatter(Sdata, Send_cnt, MPI_INT, Rdata, Recv_cnt, MPI_INT, src, MPI_COMM_WORLD);

Figure: Sdata = [1,2,3,4,5,6,7,8] on CPU0 is scattered so that
CPU0 gets Rdata = [1,2], CPU1 gets [3,4], CPU2 gets [5,6], CPU3 gets [7,8].
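The call above is only a fragment; a complete, compilable version might look like the following sketch (the main program, MPI_Init/MPI_Finalize and the printf are added here and are not part of the slide):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int Sdata[8] = {1,2,3,4,5,6,7,8};   /* only the contents on CPU0 matter */
    int Rdata[2];
    int Send_cnt = 2, Recv_cnt = 2, src = 0;
    int id;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &id);

    /* With 4 processes, each one receives 2 consecutive elements of Sdata. */
    MPI_Scatter(Sdata, Send_cnt, MPI_INT, Rdata, Recv_cnt, MPI_INT,
                src, MPI_COMM_WORLD);

    printf("CPU%d received [%d,%d]\n", id, Rdata[0], Rdata[1]);

    MPI_Finalize();
    return 0;
}

Running it with 4 processes reproduces the distribution shown in the figure.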

Page 6: Chapter 8

MPI_Scatter

In general, Send_cnt and Recv_cnt must be equal, and Stype and Rtype must match; otherwise the result is unpredictable.

If there are N CPUs in the communicator, the size of Sdata must be at least Send_cnt*N.

Page 7: Chapter 8

MPI_Scatterv

A scatter operation in which different processes may end up with different numbers of elements.

Page 8: Chapter 8

Function MPI_Scatterv

int MPI_Scatterv (void *send_buffer, int *send_cnt, int *send_disp, MPI_Datatype send_type, void *recv_buffer, int recv_cnt, MPI_Datatype recv_type, int root, MPI_Comm communicator)
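As an illustration of the varying counts (a sketch, not from the slides), MPI_Scatterv can distribute the rows of an m×n matrix stored row-major on the root even when m is not a multiple of the number of processes p. The counts and displacements below use the usual block-decomposition formulas; the names A, local, m and n are assumptions for this example.

/* Fragment assumed to run after MPI_Init (needs <mpi.h> and <stdlib.h>);
   A (double*) holds the full m×n matrix on the root process only. */
int p, id;
MPI_Comm_size(MPI_COMM_WORLD, &p);
MPI_Comm_rank(MPI_COMM_WORLD, &id);

int *send_cnt  = malloc(p * sizeof(int));
int *send_disp = malloc(p * sizeof(int));
for (int i = 0; i < p; i++) {
    int low  = i * m / p;             /* first row owned by process i   */
    int high = (i + 1) * m / p;       /* one past its last row          */
    send_cnt[i]  = (high - low) * n;  /* elements sent to process i     */
    send_disp[i] = low * n;           /* offset of that block in A      */
}

double *local = malloc(send_cnt[id] * sizeof(double));
MPI_Scatterv(A, send_cnt, send_disp, MPI_DOUBLE,
             local, send_cnt[id], MPI_DOUBLE, 0, MPI_COMM_WORLD);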

Page 9: Chapter 8

MPI_Gather

Collects data from every CPU id in the same communicator and places the result on a specified CPU id.

int Sdata[], Rdata[], Send_cnt, Recv_cnt, dest, err;

MPI_Comm COMM; MPI_Datatype Stype, Rtype;

err = MPI_Gather(Sdata, Send_cnt, Stype, Rdata, Recv_cnt, Rtype, dest, COMM);

Sdata     Send data. If Send_cnt > 1, Sdata is an array.
Send_cnt  Amount of data sent from every CPU id.
Stype     Send data type.
Rdata     Receive array.
Recv_cnt  Amount of data received from each sending CPU id.
Rtype     Receive data type.
COMM      Communicator.
dest      CPU id that collects the data from the other CPU ids.

Page 10: Chapter 8

MPI_Gather

int Send_cnt = 2, Recv_cnt = 2, dest = 0;
MPI_Gather(Sdata, Send_cnt, MPI_INT, Rdata, Recv_cnt, MPI_INT, dest, MPI_COMM_WORLD);

Figure: Sdata = [1,2] on CPU0, [3,4] on CPU1, [5,6] on CPU2 and [7,8] on CPU3
are gathered into Rdata = [1,2,3,4,5,6,7,8] on CPU0.
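In the context of this chapter (a sketch, not from the slides), MPI_Gather is what assembles the result vector after a rowwise block-striped matrix-vector multiplication. Here m is assumed divisible by p, and local_c / c are hypothetical names for the local and full result vectors.

/* Each process has computed local_m = m/p elements of c = A×b in local_c.
   Gather the complete result vector on process 0. */
int local_m = m / p;
MPI_Gather(local_c, local_m, MPI_DOUBLE,
           c, local_m, MPI_DOUBLE, 0, MPI_COMM_WORLD);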

Page 11: Chapter 8

MPI_Gather

In general, Send_cnt and Recv_cnt must be equal, and Stype and Rtype must match; otherwise the result is unpredictable.

If there are N CPUs in the communicator, the size of Rdata must be at least Send_cnt*N.

Page 12: Chapter 8

MPI_Gatherv

A gather operation in which the number of elements collected from different processes may vary.

Page 13: Chapter 8

Function MPI_Gatherv

int MPI_Gatherv (void *send_buffer, int send_cnt, MPI_Datatype send_type, void *recv_buffer, int *recv_cnt, int *recv_disp, MPI_Datatype recv_type, int root, MPI_Comm communicator)
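A sketch of the counterpart of the earlier MPI_Scatterv example: gathering blocks of the result vector c whose sizes may differ by one element. The names recv_cnt, recv_disp, local_c, c, m, p and id are assumptions for the example.

/* Collect unequal-sized blocks of the m-element result vector on process 0. */
int *recv_cnt  = malloc(p * sizeof(int));
int *recv_disp = malloc(p * sizeof(int));
for (int i = 0; i < p; i++) {
    recv_cnt[i]  = (i + 1) * m / p - i * m / p;  /* block size of process i */
    recv_disp[i] = i * m / p;                    /* where it goes in c      */
}
MPI_Gatherv(local_c, recv_cnt[id], MPI_DOUBLE,
            c, recv_cnt, recv_disp, MPI_DOUBLE, 0, MPI_COMM_WORLD);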

Page 14: Chapter 8

MPI_Allgather

Like MPI_Gather, but MPI_Allgather delivers the gathered result to all CPU ids in the same communicator.

int Sdata[], Rdata[], Send_cnt, Recv_cnt, err;

MPI_Comm Comm; MPI_Datatype Stype, Rtype;

err = MPI_Allgather(Sdata, Send_cnt, Stype, Rdata, Recv_cnt, Rtype, Comm);

Sdata     Send data. If Send_cnt > 1, Sdata is an array.
Send_cnt  Amount of data sent from every CPU id.
Stype     Send data type.
Rdata     Receive array.
Recv_cnt  Amount of data received from each sending CPU id.
Rtype     Receive data type.
COMM      Communicator.

Page 15: Chapter 8

MPI_Allgather

int Send_cnt = 2, Recv_cnt = 2;
MPI_Allgather(Sdata, Send_cnt, MPI_INT, Rdata, Recv_cnt, MPI_INT, MPI_COMM_WORLD);

(Recv_cnt is the amount received from each process, so Rdata must have room for Recv_cnt × N elements.)

Figure: Sdata = [1,2] on CPU0, [3,4] on CPU1, [5,6] on CPU2 and [7,8] on CPU3;
after the call, Rdata on every CPU holds [1,2,3,4,5,6,7,8].

Page 16: Chapter 8

MPI_Allgatherv

An all-gather operation in which different processes may contribute different numbers of elements.

Page 17: Chapter 8

Function MPI_Allgatherv

int MPI_Allgatherv (void *send_buffer, int send_cnt, MPI_Datatype send_type, void *receive_buffer, int *receive_cnt, int *receive_disp, MPI_Datatype receive_type, MPI_Comm communicator)
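In this chapter's columnwise decomposition, an all-gather of this kind is what gives every process a full copy of the vector. A sketch (the names local_b, b, cnt, disp, n, p and id are assumptions for the example):

/* Each process owns cnt[id] consecutive elements of the n-element vector b
   in local_b; after the call every process holds the complete vector in b. */
int *cnt  = malloc(p * sizeof(int));
int *disp = malloc(p * sizeof(int));
for (int i = 0; i < p; i++) {
    cnt[i]  = (i + 1) * n / p - i * n / p;
    disp[i] = i * n / p;
}
MPI_Allgatherv(local_b, cnt[id], MPI_DOUBLE,
               b, cnt, disp, MPI_DOUBLE, MPI_COMM_WORLD);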

Page 18: Chapter 8

MPI_Alltoall

An all-to-all exchange of data elements among processes

Page 19: Chapter 8

Function MPI_Alltoall

int MPI_Alltoall (void *send_buffer, int send_count, MPI_Datatype send_type, void *recv_buffer, int recv_count, MPI_Datatype recv_type, MPI_Comm communicator)
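A minimal sketch of the call (the buffers and their contents are made up for this example): every process contributes one int for each destination process and receives one int from each of them.

/* Each process sends sendbuf[j] to process j and receives into recvbuf[j]
   the element that process j addressed to it. */
int *sendbuf = malloc(p * sizeof(int));
int *recvbuf = malloc(p * sizeof(int));
for (int j = 0; j < p; j++)
    sendbuf[j] = id * p + j;             /* element destined for process j */
MPI_Alltoall(sendbuf, 1, MPI_INT, recvbuf, 1, MPI_INT, MPI_COMM_WORLD);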

Page 20: Chapter 8

Data Decomposition Options

Rowwise block-striped decomposition
Columnwise block-striped decomposition
Checkerboard block decomposition

Pages 21–43: Chapter 8 (no transcript text)
Page 44: Chapter 8

Creating a Communicator

Page 45: Chapter 8

There are four collective communication operations:

1. The processes in the first column of the virtual process grid participate in the communication that gathers vector b when p is not square.

2. The processes in the first row of the virtual process grid participate in the communication that scatters vector b when p is not square.

3. Each first-row process broadcasts its block of b to other processes in the same column of the process grid.

4. Each row of processes in the grid performs an independent sum-reduction, yielding vector c in the first column of processes.

Page 46: Chapter 8

int MPI_Dims_create

int MPI_Dims_create (int nodes, int dims, int *size)

nodes: an input parameter, the number of processes in the grid.

dims: an input parameter, the number of dimensions in the desired grid.

size: an input/output parameter, the size of each grid dimension.

Page 47: Chapter 8

int MPI_Cart_create

int MPI_Cart_create (MPI_Comm old_comm, int dims, int *size, int *periodic, int reorder, MPI_Comm *cart_comm)

old_comm: the old communicator. All processes in the old communicator must collectively call the function.

dims: the number of grid dimensions.

*size: an array of size dims. Element size[j] is the number of processes in dimension j.

*periodic: an array of size dims. Element periodic[j] should be 1 if dimension j is periodic (communications wrap around the edges of the grid) and 0 otherwise.

reorder: a flag indicating if process ranks can be reordered. If reorder is 0, the rank of each process in the new communicator is the same as its rank in old_comm.
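Putting the two calls together (a sketch using only the functions described above):

/* Build a 2-D virtual process grid for the checkerboard decomposition. */
int p, size[2] = {0, 0}, periodic[2] = {0, 0};
MPI_Comm grid_comm;

MPI_Comm_size(MPI_COMM_WORLD, &p);
MPI_Dims_create(p, 2, size);            /* e.g. p = 6 gives a 3 × 2 grid     */
MPI_Cart_create(MPI_COMM_WORLD, 2, size, periodic,
                1, &grid_comm);         /* reorder = 1: ranks may be changed */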

Page 48: Chapter 8

Reading a Checkerboard Matrix

Page 49: Chapter 8
Page 50: Chapter 8

int MPI_Cart_rank

int MPI_Cart_rank (MPI_Comm comm, int *coords, int *rank)

comm: an input parameter whose value is the Cartesian communicator in which the communication is occurring.

coords: an input parameter, an integer array containing the coordinates of a process in the virtual grid.

rank : the rank of the process in comm with the specified coordinates.

Page 51: Chapter 8

int MPI_Cart_coords

int MPI_Cart_coords (MPI_Comm comm, int rank, int dims, int *coords)

comm: the Cartesian communicator being examined.

rank: the rank of the process whose coordinates we seek.

dims: the number of dimensions in the process grid.

The function returns through the last parameter the coordinates of the specified process in the grid.
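A sketch combining the two functions with the grid communicator created earlier (grid_comm): find this process's grid coordinates, then the rank of the process in column 0 of the same row.

int grid_id, coords[2], first_col_rank;

MPI_Comm_rank(grid_comm, &grid_id);
MPI_Cart_coords(grid_comm, grid_id, 2, coords);  /* coords[0] = row, coords[1] = column */

int target[2] = { coords[0], 0 };                /* same row, first column */
MPI_Cart_rank(grid_comm, target, &first_col_rank);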

Page 52: Chapter 8

int MPI_Comm_split

int MPI_Comm_split (MPI_Comm old_comm, int partition, int new_rank, MPI_Comm *new_comm)

old_comm: the existing communicator to which these processes belong.
partition: the partition number.
new_rank: rank order of the process within the new communicator.

The function returns through new_comm a pointer to the new communicator to which this process belongs.
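For example, the grid can be split into one communicator per row (a sketch; grid_comm and coords come from the earlier sketches): processes that share a row coordinate end up in the same communicator, ordered by their column coordinate.

MPI_Comm row_comm;
MPI_Comm_split(grid_comm, coords[0], coords[1], &row_comm);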

Page 53: Chapter 8

Benchmarking

Page 54: Chapter 8
Page 55: Chapter 8
Page 56: Chapter 8

High Performance Computing

Page 57: Chapter 8

Program Parallelism

Algorithm Level
Program Level
Instruction Level

Page 58: Chapter 8

Cache Memory

To improve the average memory access time, modern computer systems use a high-speed cache memory.

◦ Temporal locality: a memory word that was fetched is likely to be fetched again in the near future.
◦ Spatial locality: the cache keeps nearby words as well.

Page 59: Chapter 8

A Matrix Multiplication

Simple matrix multiplication C = A × B:

for i=1 to n do
  for j=1 to n do
    for k=1 to n do
      C[i,j] = C[i,j] + A[i,k] * B[k,j]

Page 60: Chapter 8

Improving Spatial Locality

Reordering the loops to get the ikj form will satisfy spatial locality:

for i=1 to n do
  for k=1 to n do
    for j=1 to n do
      C[i,j] = C[i,j] + A[i,k] * B[k,j]
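The same idea in C, where arrays are stored row-major (a minimal sketch): the innermost loop walks C[i][*] and B[k][*] along consecutive addresses while A[i][k] stays fixed.

/* Matrix multiplication with ikj loop ordering. With row-major storage,
   the innermost loop sweeps C[i][j] and B[k][j] along consecutive
   addresses, while A[i][k] stays fixed in a register. */
void matmul_ikj(int n, const double *A, const double *B, double *C)
{
    for (int i = 0; i < n; i++)
        for (int k = 0; k < n; k++) {
            double a = A[i * n + k];
            for (int j = 0; j < n; j++)
                C[i * n + j] += a * B[k * n + j];
        }
}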

Page 61: Chapter 8

Improving Temporal Locality (1/2)

We divide the matrices into rectangular sub-matrices, as shown below. We have chosen a sub-matrix size s = n ÷ 3, so each matrix is partitioned into 3 × 3 blocks:

  C11 C12 C13     A11 A12 A13     B11 B12 B13
  C21 C22 C23  =  A21 A22 A23  ×  B21 B22 B23
  C31 C32 C33     A31 A32 A33     B31 B32 B33

The first sub-matrix C11 can be computed by sub-matrix multiplication:

  C11 = A11 × B11 + A12 × B21 + A13 × B31

Page 62: Chapter 8

Improving Temporal Locality (2/2)

The program for the reformulated algorithm:

for it=1 to n by s do
  for kt=1 to n by s do
    for jt=1 to n by s do
      for i=it to min(it+s-1,n) do
        for k=kt to min(kt+s-1,n) do
          for j=jt to min(jt+s-1,n) do
            C[i,j] = C[i,j] + A[i,k] * B[k,j]
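A C version of the blocked algorithm (a minimal sketch with block size s):

/* Blocked (tiled) matrix multiplication with block size s.
   Each s×s tile of C is updated while the corresponding tiles of
   A and B are still resident in the cache, improving temporal locality. */
void matmul_blocked(int n, int s, const double *A, const double *B, double *C)
{
    for (int it = 0; it < n; it += s)
        for (int kt = 0; kt < n; kt += s)
            for (int jt = 0; jt < n; jt += s)
                for (int i = it; i < it + s && i < n; i++)
                    for (int k = kt; k < kt + s && k < n; k++) {
                        double a = A[i * n + k];
                        for (int j = jt; j < jt + s && j < n; j++)
                            C[i * n + j] += a * B[k * n + j];
                    }
}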

Page 63: Chapter 8

Storage Order

Row-major and column-major storage order for a 3 × 4 array.