Vector: Data Layout
Transcript of Vector: Data Layout
1
Vector: Data Layout
Vector: x[n], P processors. Assume n = r * p.
Notation: A[m:n] denotes for(i=m; i<=n) A[i]. Let A[m : s : n] denote for(i=m; i<=n; i=i+s) A[i].
Block distribution: processor id = 0, 1, …, p-1 holds x[r*id : r*(id+1)-1].
Cyclic distribution: processor id holds x[id : p : n-1].
Block-cyclic distribution: x = [x1, x2, …, xN]^T, where xj is a subvector of length n/N. Assume N = s*p; do a cyclic distribution of the xj, j = 1, 2, …, N.
[Figure: block, cyclic, and block-cyclic distributions of a length-n vector; block pieces have length r, the cyclic pattern repeats with period p, and the block-cyclic pattern uses blocks of length n/N repeating with period p*n/N.]
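The three layouts can be written down as index sets; a minimal Python sketch (0-based indices; the helper names are mine, not from the slides):

```python
# Index sets owned by processor `pid` under each distribution of x[0..n-1].
# Assumes n = r*p for block/cyclic, and that the block size b divides n
# with n//b a multiple of p for block-cyclic.

def block(pid, n, p):
    r = n // p                        # elements per processor
    return list(range(r * pid, r * (pid + 1)))

def cyclic(pid, n, p):
    return list(range(pid, n, p))     # x[pid : p : n-1]

def block_cyclic(pid, n, p, b):
    # split x into n//b subvectors of length b, distribute them cyclically
    idx = []
    for j in range(pid, n // b, p):   # subvector j goes to cpu j % p
        idx.extend(range(j * b, (j + 1) * b))
    return idx
```

For example, with n = 8 and p = 2, cpu 0 owns [0,1,2,3] under block, [0,2,4,6] under cyclic, and [0,1,4,5] under block-cyclic with b = 2.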
2
Matrix: Data Layout
Rows: block, cyclic, or block cyclic. Columns: block, cyclic, or block cyclic. Matrix: 9 combinations. If there is only one block in one direction, it is a 1D decomposition; otherwise it is a 2D decomposition.
E.g., p processors with an (Nx, Ny) virtual topology, p = Nx*Ny, and matrix A[n][n], n = rx*Nx = ry*Ny.
A[rx*I : rx*(I+1)-1, J : Ny : n-1], I = 0,…,Nx-1, J = 0,…,Ny-1, is a 2D decomposition: block distribution in the x direction, cyclic distribution in the y direction.
[Figure: 1D block-cyclic vs. 2D block-cyclic layouts.]
3
Matrix-Vector Multiply: Row-wise
A*X = Y. A: N x N matrix, row-wise block distribution. X, Y: vectors of dimension N.
With 3 cpus, partition A into 3 x 3 blocks Aij, and X, Y into blocks X1..X3, Y1..Y3:
Y1 = A11*X1 + A12*X2 + A13*X3
Y2 = A21*X1 + A22*X2 + A23*X3
Y3 = A31*X1 + A32*X2 + A33*X3
cpu 0 holds [A11 A12 A13], X1, Y1; cpu 1 holds [A21 A22 A23], X2, Y2; cpu 2 holds [A31 A32 A33], X3, Y3.
[Figure: the X blocks are rotated among the cpus over the steps — (X1, X2, X3), then (X2, X3, X1), then (X3, X1, X2) — so that each cpu multiplies its row block against every block of X and accumulates its own Y block.]
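The rotation can be checked with a serial simulation; a numpy sketch (`matvec_rowwise_rotate` is my name, not from the slides), where each simulated cpu owns one row block of A and the X blocks travel around a ring:

```python
import numpy as np

def matvec_rowwise_rotate(A, x, p):
    """Row-wise block matvec: cpu i holds row block i of A and block i of x.
    The x blocks are rotated p times so every cpu sees all of x."""
    n = A.shape[0]
    r = n // p                                    # assume p divides n
    Ablk = [A[i*r:(i+1)*r, :] for i in range(p)]  # row block on cpu i
    xblk = [x[i*r:(i+1)*r] for i in range(p)]     # x block i (travels around)
    y = [np.zeros(r) for _ in range(p)]           # y block on cpu i
    hold = list(range(p))                         # hold[i]: which x block cpu i has
    for _ in range(p):
        for i in range(p):                        # local partial product
            s = hold[i]
            y[i] += Ablk[i][:, s*r:(s+1)*r] @ xblk[s]
        hold = [hold[(i+1) % p] for i in range(p)]  # rotate: recv from right
    return np.concatenate(y)
```

After p steps each cpu has multiplied its row block by every block of x, so the concatenated result equals A @ x.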
4
Matrix-Vector Multiply: Column-wise
A*X = Y. A: N x N matrix, column-wise block distribution. X, Y: vectors of dimension N.
Y1 = A11*X1 + A12*X2 + A13*X3
Y2 = A21*X1 + A22*X2 + A23*X3
Y3 = A31*X1 + A32*X2 + A33*X3
cpu 0 holds column block [A11; A21; A31] and X1; cpu 1 holds [A12; A22; A32] and X2; cpu 2 holds [A13; A23; A33] and X3.
[Figure: this time the Y blocks are rotated among the cpus — (Y1, Y2, Y3), then (Y2, Y3, Y1), then (Y3, Y1, Y2) — each cpu adding its local partial product to the Y block it currently holds.]
5
Matrix-Vector Multiply: Row-wise
Alternative: all-gather the X blocks so that every cpu has the whole of X, then each cpu computes its Y block locally.
6
Matrix-Vector Multiply: Column-wise
A*X = Y. A: N x N matrix, column-wise block distribution. X, Y: vectors of dimension N.
First, the local computations:
cpu 0: Y1' = A11*X1, Y2' = A21*X1, Y3' = A31*X1
cpu 1: Y1' = A12*X2, Y2' = A22*X2, Y3' = A32*X2
cpu 2: Y1' = A13*X3, Y2' = A23*X3, Y3' = A33*X3
Then a reduce-scatter across processors sums the partial results and leaves block Yi on cpu i-1.
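These two phases can be simulated serially; a numpy sketch (the function name is mine) of the local products followed by the reduce-scatter:

```python
import numpy as np

def matvec_colwise_reduce_scatter(A, x, p):
    """Column-wise block matvec: cpu j holds column block j of A and block j
    of x. Each cpu forms a full-length partial vector; a reduce-scatter then
    sums the partials, leaving block i of y on cpu i."""
    n = A.shape[0]
    r = n // p                                   # assume p divides n
    # local step: partial[j] = A(:, block j) * x(block j), length n
    partial = [A[:, j*r:(j+1)*r] @ x[j*r:(j+1)*r] for j in range(p)]
    # reduce-scatter: cpu i receives the sum of block i over all partials
    yblk = [sum(partial[j][i*r:(i+1)*r] for j in range(p)) for i in range(p)]
    return np.concatenate(yblk)
```

In MPI the second step would be a single collective (reduce-scatter) rather than the explicit sums shown here.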
7
Matrix-Vector Multiply: 2D Decomposition
P = K^2 cpus, arranged as a 2D K x K mesh: P_0, P_1, …, P_{K-1} in the first row, P_K, P_{K+1}, …, P_{2K-1} in the second row, and so on.
A: K x K block matrix, each block (N/K) x (N/K). X: K x 1 blocks, each block (N/K) x 1.
Each block of A is distributed to one cpu; X is distributed to the K cpus of the last column. The result A*X is to be distributed on the K cpus of the last column as well.
8
Matrix-Vector Multiply: 2D Decomposition
9
Homework: Write an MPI program implementing the matrix-vector multiplication algorithm with 2D decomposition. Assume:
Y = A*X, with A an N x N matrix and X a vector of length N.
Number of processors P = K^2, arranged as a K x K mesh in row-major fashion, i.e. cpus 0, …, K-1 in the first row, K, …, 2K-1 in the second row, etc. N is divisible by K.
Initially, each cpu has the data for its own submatrix of A; the input vector X is distributed on the processors of the rightmost column, i.e. cpus K-1, 2K-1, …, P-1.
In the end, the result vector Y should be distributed on the processors of the rightmost column.
A[i][j] = 2*i+j, X[i] = i. Make sure your result is correct using a small value of N.
Turn in: source code + binary; wall-time and speedup vs. number of cpus for 1, 4, and 16 processors, with N = 1024.
10
Load Balancing: (Block) Cyclic
[Figure: a triangular matrix whose rows contain 1, 2, …, 9 nonzeros, marked a, b, c by owning cpu. A block distribution of rows gives cpu a the three shortest rows and cpu c the three longest — badly imbalanced. A cyclic distribution gives each cpu one short, one medium, and one long row, balancing the work.]
[Figure: the corresponding products y = A*x. With the block layout the result blocks are (y1, y2, y3), (y4, y5, y6), (y7, y8, y9); with the cyclic layout cpu a holds (y1, y4, y7), cpu b holds (y2, y5, y8), and cpu c holds (y3, y6, y9).]
11
Cyclic Distribution
Matrix-vector multiply with a row-wise cyclic distribution of A and y, and a block distribution of x.
Initial data: id – cpu id; p – number of cpus; ids of left/right neighbors; n – matrix dimension, n = r*p; Aloc = A(id:p:n-1, :); yloc = y(id:p:n-1); xloc = x(id*r : (id+1)*r-1).

r = n/p
for t = 0:p-1
  send(xloc, left)
  s = (id+t)%p                       // xloc currently holds x(s*r : (s+1)*r-1)
  for i = 0:r-1                      // local row i is global row id + i*p
    for j = 0:min(id+i*p-s*r, r-1)   // triangular A: only columns <= global row
      yloc(i) = yloc(i) + Aloc(i, j+s*r)*xloc(j)
    end
  end
  recv(xloc, right)
end
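The bookkeeping above can be checked with a serial simulation; a sketch in Python (0-based indices, lower-triangular A; the function name is mine) that walks each simulated cpu through the same ring schedule:

```python
import numpy as np

def tri_matvec_cyclic(A, x, p):
    """y = A*x for lower-triangular A with rows distributed cyclically:
    cpu `cpu` owns rows cpu, cpu+p, ...; x is block-distributed and its
    blocks are passed around the ring as in the pseudocode above."""
    n = A.shape[0]
    r = n // p                          # assume p divides n
    y = np.zeros(n)
    for cpu in range(p):                # simulate each cpu in turn
        for t in range(p):              # ring step: xloc = x[s*r:(s+1)*r]
            s = (cpu + t) % p
            for i in range(r):          # local row i = global row cpu + i*p
                g = cpu + i * p
                for j in range(r):
                    col = s * r + j
                    if col <= g:        # lower-triangular: skip cols > row
                        y[g] += A[g, col] * x[col]
    return y
```

Each global row is touched by exactly one simulated cpu, and over the p ring steps every column block passes by, so the result matches a direct triangular matvec.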
12
Matrix Multiplication
Row: A(i,:) = [Ai1, Ai2, …, Ain]. Column: A(:,j) = [A1j, A2j, …, Anj]^T. Submatrix: A(i1:i2, j1:j2) = [A(i,j)], i1 <= i <= i2, j1 <= j <= j2.
C = AB + C, with A – m x p, B – p x n, C – m x n.

(ijk) variant of matrix multiplication:
for i=1:m
  for j=1:n
    for k=1:p
      C(i,j) = C(i,j) + A(i,k)*B(k,j)
    end
  end
end

Dot-product formulation, A(i,:) dot B(:,j):
for i=1:m
  for j=1:n
    C(i,j) = C(i,j) + A(i,:)*B(:,j)
  end
end
A is accessed by row, B by column – non-optimal memory access!
13
Matrix Multiply
The ijk loops can be arranged in other orders.

(ikj) variant:
for i=1:m
  for k=1:p
    for j=1:n
      C(i,j) = C(i,j) + A(i,k)*B(k,j)
    end
  end
end

axpy formulation (B by row, C by row):
for i=1:m
  for k=1:p
    C(i,:) = C(i,:) + A(i,k)*B(k,:)
  end
end

(jki) variant:
for j=1:n
  for k=1:p
    for i=1:m
      C(i,j) = C(i,j) + A(i,k)*B(k,j)
    end
  end
end

axpy formulation (A by column, C by column):
for j=1:n
  for k=1:p
    C(:,j) = C(:,j) + A(:,k)*B(k,j)
  end
end
14
Other Variants
Loop order | Inner loop | Middle loop           | Inner-loop data access
ijk        | dot        | axpy                  | A by row, B by column
jik        | dot        | axpy                  | A by row, B by column
ikj        | axpy       | axpy                  | B by row, C by row
jki        | axpy       | axpy                  | A by column, C by column
kij        | axpy       | row outer product     | B by row, C by row
kji        | axpy       | column outer product  | A by column, C by column
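For reference, two of the table's variants written out in Python; all six orders produce the same C, only the memory access pattern differs:

```python
import numpy as np

def matmul_ijk(A, B, C):
    """(ijk): inner loop is a dot product A(i,:) . B(:,j)."""
    m, p = A.shape
    n = B.shape[1]
    for i in range(m):
        for j in range(n):
            for k in range(p):           # A traversed by row, B by column
                C[i, j] += A[i, k] * B[k, j]
    return C

def matmul_jki(A, B, C):
    """(jki): inner loop is an axpy, C(:,j) += A(:,k)*B(k,j)."""
    m, p = A.shape
    n = B.shape[1]
    for j in range(n):
        for k in range(p):
            for i in range(m):           # A traversed by column, C by column
                C[i, j] += A[i, k] * B[k, j]
    return C
```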
15
Block Matrices
[Figure: an m x n matrix A partitioned into q x r blocks A_{IJ}, I = 1,…,q, J = 1,…,r.]

Block matrix multiply: A, B, C – N x N block matrices, each block s x s.

(mnp) variant:
for m=1:N
  for n=1:N
    for p=1:N
      i = (m-1)s+1 : ms
      j = (n-1)s+1 : ns
      k = (p-1)s+1 : ps
      C(i,j) = C(i,j) + A(i,k)*B(k,j)
    end
  end
end

i.e. C_{mn} = sum over p of A_{mp}*B_{pn}. There are also other variants. This is the idea behind cache blocking.
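The (mnp) block variant maps directly onto code; a numpy sketch (the function name is mine; the block product uses @, standing in for a cache-resident s x s multiply):

```python
import numpy as np

def matmul_blocked(A, B, s):
    """(mnp) block variant: A, B square, dimension divisible by block size s."""
    n = A.shape[0]
    N = n // s                                 # N x N blocks
    C = np.zeros((n, n))
    for m in range(N):
        for q in range(N):                     # `q` plays the role of n in the slide
            for p in range(N):
                i = slice(m*s, (m+1)*s)
                j = slice(q*s, (q+1)*s)
                k = slice(p*s, (p+1)*s)
                C[i, j] += A[i, k] @ B[k, j]   # s x s block update
    return C
```

Choosing s so that three s x s blocks fit in cache is what makes the blocked version fast on real hardware.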
16
Block Matrices
[Figure: the block product written out entrywise, e.g. C_{11} = A_{11}B_{11} + A_{12}B_{21} + …, illustrating that each block C_{mn} is the sum of block products A_{mp}B_{pn}.]
17
Matrix Multiply: Column-wise
A, B, C – N x N matrices; P – number of processors. Column-wise decomposition: A1, A2, A3 – N x (N/P) matrices; C1, C2, C3 likewise; Bij – (N/P) x (N/P) matrices.
[A1 A2 A3] * [Bij] = [C1 C2 C3]:
C1 = A1*B11 + A2*B21 + A3*B31   (cpu 0)
C2 = A1*B12 + A2*B22 + A3*B32   (cpu 1)
C3 = A1*B13 + A2*B23 + A3*B33   (cpu 2)
18
Matrix Multiply: Row-wise
A, B, C – N x N matrices; P – number of processors. Row-wise decomposition: B1, B2, B3 – (N/P) x N matrices; C1, C2, C3 likewise; Aij – (N/P) x (N/P) matrices.
[Aij] * [B1; B2; B3] = [C1; C2; C3]:
C1 = A11*B1 + A12*B2 + A13*B3   (cpu 0)
C2 = A21*B1 + A22*B2 + A23*B3   (cpu 1)
C3 = A31*B1 + A32*B2 + A33*B3   (cpu 2)
19
Matrix Multiply: 2D Decomposition (Hypercube-Ring)
Cpus: P = K^2. Matrices A, B, C: dimension N x N, as K x K block matrices, each block (N/K) x (N/K).

Determine the coordinates (irow, icol) of the current cpu.
Set B' = B_local
for j = 0:K-1
  root_col = (irow+j)%K
  broadcast A' = A_local from the root cpu (irow, root_col) to the other cpus in the row
  C_local = C_local + A'*B'
  shift B' upward one step
end

Broadcast the diagonals of A; shift B; C stays fixed.
[Figure: at step 2 the broadcast roots are the superdiagonal blocks A01, A12, A23, A30, each broadcast across its row while B is shifted upward.]
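A serial simulation of the broadcast-and-shift loop (assuming K divides N; the function name is mine) — the row broadcast appears as indexing the root's A block, the upward shift as a cyclic relabeling of the B blocks:

```python
import numpy as np

def matmul_broadcast_shift(A, B, K):
    """2D decomposition: cpu (r,c) owns blocks A_rc, B_rc, C_rc.
    Each step, one A block per row is broadcast and B is shifted upward."""
    n = A.shape[0]
    s = n // K                                    # block size
    blk = lambda M, r, c: M[r*s:(r+1)*s, c*s:(c+1)*s]
    Bp = [[blk(B, r, c).copy() for c in range(K)] for r in range(K)]
    C = np.zeros((n, n))
    for j in range(K):
        for r in range(K):
            root = (r + j) % K                    # broadcast A_{r,root} along row r
            for c in range(K):
                blk(C, r, c)[:] += blk(A, r, root) @ Bp[r][c]
        # shift B' upward one step: cpu (r,c) receives the block from (r+1,c)
        Bp = [[Bp[(r+1) % K][c] for c in range(K)] for r in range(K)]
    return C
```

At step j, cpu (r,c) multiplies A_{r,(r+j)%K} by B_{(r+j)%K,c}, so after K steps every term of C_{rc} has been accumulated.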
20
Matrix Multiply
Total ~K*log(K) communication steps, i.e. sqrt(P)*log(sqrt(P)) steps. In contrast, a 1D decomposition needs P communication steps. The 2D algorithm can use at most N^2 processors for an N x N problem; a 1D decomposition can use at most N processors.
21
Matrix Multiply: Ring-Hypercube
Number of cpus: P = K^2. A, B, C: K x K block matrices, each block (N/K) x (N/K).

Determine the coordinates (irow, icol) of the current cpu.
Set A' = A_local
for j = 0:K-1
  root_row = (icol+j)%K
  broadcast B' = B_local from the root cpu (root_row, icol) to the other cpus in the column
  C_local = C_local + A'*B'
  shift A' leftward one step
end

Broadcast the diagonals of B; shift the columns of A; C stays fixed.
[Figure (K=4): step 1 — after the diagonal blocks B00, B11, B22, B33 are broadcast down their columns, cpu (i,j) computes A_{ij}B_{jj}; step 2 — after shifting A left, cpu (i,j) computes A_{i,j+1}B_{j+1,j} (indices mod K); and so on.]
22
Matrix Multiply: Ring-Hypercube
[Figure (K=4): a step-by-step trace. Initially cpu (i,j) holds A_{ij} and B_{ij}. Broadcast the B diagonal and compute A_{ij}B_{jj}; shift A left so that cpu (i,j) holds A_{i,j+1}, broadcast the next B blocks and compute A_{i,j+1}B_{j+1,j}; shift again and compute A_{i,j+2}B_{j+2,j}; and so on, indices mod K.]
23
Matrix Multiply: Systolic (Torus)
Number of cpus: P = K^2. A, B, C: K x K block matrices, each block (N/K) x (N/K). Shift the rows of A leftward and the columns of B upward; C stays fixed.
[Figure (K=3): initially cpu (i,j) holds A_{ij} and B_{ij}. After skewing, cpu (i,j) holds A_{i,i+j} and B_{i+j,j} (indices mod K); step 1 computes those products, and each subsequent shift-and-compute step supplies the next term of C_{ij}, for K steps in total.]
24
Matrix Multiply: Systolic
P = K^2 processors, arranged as a K x K 2D torus. A, B, C: K x K block matrices, each block (N/K) x (N/K). Each cpu computes one block and holds A_loc, B_loc, C_loc. Coordinates of the current cpu in the torus: (irow, icol); ids of the left, right, top, bottom neighboring processors.

// first get the appropriate initial (skewed) distribution
for j = 0:irow-1
  send(A_loc, left); recv(A_loc, right)
end
for j = 0:icol-1
  send(B_loc, top); recv(B_loc, bottom)
end
// start computation
for j = 0:K-1
  send(A_loc, left)
  send(B_loc, top)
  C_loc = C_loc + A_loc*B_loc
  recv(A_loc, right)
  recv(B_loc, bottom)
end

Max N^2 processors; ~sqrt(P) communication steps.
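A serial simulation of the skew-then-shift schedule (the function name is mine); the initial skew puts A_{i,i+j} and B_{i+j,j} on cpu (i,j), and each step multiplies, then shifts A left and B up:

```python
import numpy as np

def matmul_systolic(A, B, K):
    """Systolic (torus) multiply: after skewing, every cpu multiplies its
    current A and B blocks each step, then A shifts left and B shifts up."""
    n = A.shape[0]
    s = n // K                                    # block size
    blk = lambda M, r, c: M[r*s:(r+1)*s, c*s:(c+1)*s].copy()
    # initial skew: cpu (r,c) gets A_{r,(r+c)%K} and B_{(r+c)%K,c}
    Ap = [[blk(A, r, (r+c) % K) for c in range(K)] for r in range(K)]
    Bp = [[blk(B, (r+c) % K, c) for c in range(K)] for r in range(K)]
    C = np.zeros((n, n))
    for _ in range(K):
        for r in range(K):
            for c in range(K):
                C[r*s:(r+1)*s, c*s:(c+1)*s] += Ap[r][c] @ Bp[r][c]
        Ap = [[Ap[r][(c+1) % K] for c in range(K)] for r in range(K)]  # shift left
        Bp = [[Bp[(r+1) % K][c] for c in range(K)] for r in range(K)]  # shift up
    return C
```

After t steps cpu (r,c) holds A_{r,(r+c+t)%K} and B_{(r+c+t)%K,c}, so the K products it forms are exactly the K terms of C_{rc}.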
25
Matrix Multiply on P = K^3 CPUs
Assume:
A, B, C: dimension N x N.
P = K^3 processors, organized into a K x K x K 3D mesh.
A (N x N) can be considered as a q x q block matrix, each block (N/q) x (N/q).
Let q = K^(1/3), i.e. consider A as a K^(1/3) x K^(1/3) block matrix, each block being (N/K^(1/3)) x (N/K^(1/3)). Similarly for B and C.
26
Matrix Multiply on K^3 CPUs
C_{rs} = sum over t = 1,…,K^(1/3) of A_{rt}*B_{ts}, for r, s = 1, 2, …, K^(1/3).
Total K^(1/3)*K^(1/3)*K^(1/3) = K block matrix multiplications.
Idea: perform these K matrix multiplications on the K different planes (or levels) of the 3D mesh of processors.
Processor (i,j,k) (i,j = 1,…,K) belongs to plane k. Plane k performs the multiplication A_{rt}*B_{ts}, where k = (r-1)*K^(2/3)+(s-1)*K^(1/3)+t.
Within a plane: an (N/K^(1/3)) x (N/K^(1/3)) matrix multiply on K x K processors, using the systolic multiplication algorithm. Within plane k, A_{rt}, B_{ts} and C_{rs} are decomposed into K x K block matrices, each block (N/K^(4/3)) x (N/K^(4/3)).
27
Matrix Multiply
Initial data distribution: initially, processor (i,j,1) has the (i,j) sub-blocks of all A_{rt} and B_{ts} blocks, for all r,s,t = 1,…,K^(1/3), i,j = 1,…,K.
[Figure: A, B, C of dimension N split into K^(1/3) blocks per side, each block of dimension N/K^(1/3); each block is further split into K x K pieces, one per processor of a plane, so that each product A_{rt} x B_{ts} runs on K x K processors.]
A_{rt} is destined to levels k = (r-1)*K^(2/3)+(s-1)*K^(1/3)+t, for all s = 1,…,K^(1/3).
B_{ts} is destined to levels k = (r-1)*K^(2/3)+(s-1)*K^(1/3)+t, for all r = 1,…,K^(1/3).
28
Matrix Multiply
// set up input data
On processor (i,j,1): read in the (i,j)-th block of matrices A_{r,t} and B_{t,s}, 1 <= r,s,t <= K^(1/3); pass the data on to processor (i,j,2).
On processor (i,j,m): make a copy of A_{rt} if m = (r-1)*K^(2/3)+(s-1)*K^(1/3)+t for some s = 1,…,K^(1/3); make a copy of B_{ts} if m = (r-1)*K^(2/3)+(s-1)*K^(1/3)+t for some r = 1,…,K^(1/3); pass the data onward to (i,j,m+1).
// computation
On each processor (i,j,m): compute A_{rt}*B_{ts} on the K x K processors of the plane using the systolic matrix multiplication algorithm; some initial data setup may be needed before the multiplication.
// summation
Determine the (r0,s0) of the matrix the current processor (i,j,k) works on: r0 = k/K^(2/3)+1; s0 = (k-(r0-1)*K^(2/3))/K^(1/3). Do a reduction (sum) over the processors (i,j,m), m = (r0-1)*K^(2/3)+(s0-1)*K^(1/3)+t, over all 1 <= t <= K^(1/3).
29
Matrix Multiply
Communication steps: ~K, or P^(1/3).
Maximum number of CPUs: the blocks shrink to size N/K^(4/3) = 1 when K = N^(3/4), i.e. P = N^(9/4).
30
Matrix Multiply
If the number of processors is P = K*Q^2, arranged into a K x Q x Q mesh (K planes, each plane Q x Q processors), handle it similarly: decompose A, B, C into K^(1/3) x K^(1/3) blocks; perform the different block matrix multiplications in different planes, K multiplications in total; each block multiplication is handled within a plane on Q x Q processors, using any favorable algorithm, e.g. systolic.
31
Processor Array in Higher Dimension
Processors P = K^4, arranged into a K x K x K x K mesh.
Similar strategy: divide A, B, C into K^(1/3) x K^(1/3) block matrices; the different multiplications (K in total) are computed on different levels of the 1st dimension; each block matrix multiplication is done on the K x K x K mesh at one level, repeating the above strategy. For even higher dimensions, P = K^n, n > 4, handle similarly.
32
Matrix Multiply: DNS Algorithm
Assume:
A, B, C: dimension N x N.
P = K^3 processors, organized into a K x K x K 3D mesh.
C_{rs} = sum over t = 1,…,K of A_{rt}*B_{ts}.
A, B, C are K x K block matrices, each block (N/K) x (N/K); in total K*K*K block matrix multiplications.
Idea: each block matrix multiplication is assigned to one processor. Processor (i,j,k) computes C_{ij} = A_{ik}*B_{kj}. A reduction (sum) over processors (i,j,k), k = 0,…,K-1, is then needed.
33
Matrix Multiply: DNS Algorithm
Initial data distribution: A_{ij} and B_{ij} at processor (i,j,0).
Need to transfer A_{ik} (i,k = 0,…,K-1) to processor (i,j,k) for all j = 0,1,…,K-1.
Two steps:
- Send A_{ik} from processor (i,k,0) to (i,k,k).
- Broadcast A_{ik} from processor (i,k,k) to processors (i,j,k).
34
Matrix Multiply
Send A_{ik} from (i,k,0) to (i,k,k).
Then broadcast A_{ik} from (i,k,k) to the processors (i,j,k), j = 0,…,K-1.
35
Matrix Multiply
Final data distribution for A
A can also be considered to come in through the (i,k) plane, with a broadcast along the j direction.
36
B Distribution
B distribution: initially B_{kj} is on processor (k,j,0); it needs to be transferred to processors (i,j,k) for all i = 0,1,…,K-1.
Two steps:
- First send B_{kj} from (k,j,0) to (k,j,k).
- Broadcast B_{kj} from (k,j,k) to (i,j,k) for all i = 0,…,K-1, i.e. along the i direction.
37
B Distribution
[Figure (K=4): the K x K x K processor mesh with axes i, j, k; the (k,j) coordinates label the front plane. B_{kj} is sent from (k,j,0) to (k,j,k), then broadcast from (k,j,k) along the i direction.]
38
Matrix Multiply
Final B distribution:
B can also be considered to come in through the (j,k) plane, then broadcast along the i direction.
39
Matrix Multiply
A_{ik} and B_{kj} are on cpu (i,j,k). Compute C_{ij} locally. Reduce (sum) C_{ij} along the k direction. Final result: C_{ij} on cpu (i,j,0).
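In a serial simulation the DNS compute-and-reduce collapses to a triple loop over block indices (the function name is mine); each (i,j,k) iteration is one cpu's single block product, and the += plays the role of the reduction along k:

```python
import numpy as np

def matmul_dns(A, B, K):
    """DNS: conceptually P = K^3 cpus; cpu (i,j,k) computes A_ik @ B_kj,
    and the results are summed (reduced) along the k direction."""
    n = A.shape[0]
    s = n // K                                  # block size
    blk = lambda M, r, c: M[r*s:(r+1)*s, c*s:(c+1)*s]
    C = np.zeros((n, n))
    for i in range(K):
        for j in range(K):
            for k in range(K):                  # one block product per cpu
                blk(C, i, j)[:] += blk(A, i, k) @ blk(B, k, j)
    return C
```

In MPI the inner accumulation would be a reduce over the k-direction communicator, landing C_{ij} on cpu (i,j,0).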
40
Matrix Multiply
The A matrix comes in through the (i,k) plane, broadcast along the j direction; the B matrix comes in through the (j,k) plane, broadcast along the i direction; the result C goes to the (i,j) plane.
Broadcasts: 2*log(K) steps. Reduction: log(K) steps. Total: 3*log(K) = log(P) steps.
Can use a maximum of P = N^3 processors.
In contrast:
Systolic: P^(1/2) communication steps; can use a maximum of P = N^2 processors.
The K^3-mesh algorithm above: P^(1/3) communication steps; can use a maximum of P = N^(9/4) processors.