Parallel Linear Algebra
Our goals: fast and efficient parallel algorithms for the matrix-vector product, the matrix-matrix product, solving systems of linear equations, applying the finite difference method, and computing the fast Fourier transform.
The matrix-vector product is the basis of most of our algorithms.
Decomposing a matrix
How to distribute an m × n matrix A to p processes?
Rowwise decomposition: each process is responsible for m/p contiguous rows.
Columnwise decomposition: each process is responsible for n/p contiguous columns.
Checkerboard decomposition: assume that k divides m, that l divides n, and that k · l = p. Imagine that the processes form a k × l mesh. Process (i, j) obtains the submatrix of A consisting of the i-th row interval of length m/k and the j-th column interval of length n/l.
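As a small illustration of the checkerboard decomposition, the following sketch computes the index range owned by process (i, j) of the k × l mesh. It uses 0-based indices and assumes, as above, that k divides m and l divides n; the struct and function names are ours.

/* Index range of an m x n matrix owned by process (i, j) of a k x l
 * process mesh under the checkerboard decomposition (0-based indices).
 * Assumes that k divides m and that l divides n.                      */
typedef struct { int row_lo, row_hi, col_lo, col_hi; } Block;

Block checkerboard_block(int m, int n, int k, int l, int i, int j)
{
    Block b;
    int rows = m / k, cols = n / l;                    /* dimensions of one block */
    b.row_lo = i * rows;  b.row_hi = (i + 1) * rows - 1;   /* i-th row interval    */
    b.col_lo = j * cols;  b.col_hi = (j + 1) * cols - 1;   /* j-th column interval */
    return b;
}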
The Matrix-Vector Product
Our goal: compute y = A · x for an m × n matrix A and a vector x with n components.
Assumptions:
- We assume that matrix A has already been distributed to the various processes.
- Process 1 knows the vector x and has to determine the vector y.
The conventional sequential algorithm determines y by setting

y_i = ∑_{j=1}^{n} A[i, j] · x_j.

- To compute y_i we perform n multiplications and n − 1 additions.
- Overall, m · n multiplications and m · (n − 1) additions suffice.
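As a baseline, here is a minimal sequential version of this computation (0-based indices and row-major storage assumed):

/* y = A * x for an m x n matrix A, stored row-major in a[m*n].
 * n multiplications and n-1 additions per entry of y.              */
void matvec(int m, int n, const double *a, const double *x, double *y)
{
    for (int i = 0; i < m; i++) {
        double s = 0.0;
        for (int j = 0; j < n; j++)
            s += a[i * n + j] * x[j];    /* y_i = sum_j A[i,j] * x_j */
        y[i] = s;
    }
}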
The Rowwise Decomposition
Replicate x: broadcast x to all processes in time O(n · log2 p).
Each process determines its m/p vector-vector products in time O(m · n / p).
Process 1 performs a Gather operation in time O(m): p − 1 messages of length m/p are involved.
Performance analysis:
- Communication time is proportional to n · log2 p + m, and overall time Θ(m · n/p + n · log2 p + m) is sufficient.
- Efficiency is Θ(m · n / (m · n + p · (n · log2 p + m))).
- Constant efficiency follows if m · n = Ω(p · (n · log2 p + m)) = Ω(p · log2 p · n + m · p).
- Hence we get constant efficiency for m = Ω(p · log2 p) and n = Ω(p).
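A possible MPI sketch of the rowwise scheme: rank 0 plays the role of process 1, p is assumed to divide m, each rank already holds its m/p × n row block a_loc in row-major order, and all names are ours.

#include <stdlib.h>
#include <mpi.h>

/* Rowwise matrix-vector product y = A*x: replicate x with a broadcast,
 * multiply the local m/p x n row block, gather the partial results.
 * x must have room for n doubles on every rank; y is only used on rank 0. */
void matvec_rowwise(int m, int n, const double *a_loc, double *x,
                    double *y, MPI_Comm comm)
{
    int p;
    MPI_Comm_size(comm, &p);

    MPI_Bcast(x, n, MPI_DOUBLE, 0, comm);          /* O(n log p)       */

    int rows = m / p;
    double *y_loc = malloc(rows * sizeof *y_loc);
    for (int i = 0; i < rows; i++) {               /* O(m*n/p) work    */
        double s = 0.0;
        for (int j = 0; j < n; j++)
            s += a_loc[i * n + j] * x[j];
        y_loc[i] = s;
    }
    /* rank 0 collects the p blocks of length m/p: O(m) data           */
    MPI_Gather(y_loc, rows, MPI_DOUBLE, y, rows, MPI_DOUBLE, 0, comm);
    free(y_loc);
}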
The Columnwise Decomposition
Apply MPI_Scatter to distribute the blocks of x to “their” processes. Since this involves p − 1 messages of length n/p, time O(n) is sufficient.
Each process i computes the matrix-vector product y^i = A^i · x^i for its block A^i of columns. Time O(m · n/p) is sufficient.
Process 1 applies a Reduce operation to sum up y^1, y^2, . . . , y^p in time O(m · log2 p).
Performance analysis:
- Run time is bounded by O(m · n/p + n + m · log2 p).
- Here we have constant efficiency if the computing time dominates the communication time: we require m = Ω(p) and n = Ω(p · log2 p).
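A matching MPI sketch of the columnwise scheme, under the same kind of assumptions (p divides n, rank 0 plays the role of process 1, a_loc is this rank's m × n/p column block in row-major order, names ours):

#include <stdlib.h>
#include <mpi.h>

/* Columnwise matrix-vector product: scatter the blocks of x, multiply the
 * local m x n/p column block, sum the partial vectors with a reduction.
 * x is only needed on rank 0 (it is ignored on the other ranks).        */
void matvec_colwise(int m, int n, const double *a_loc, const double *x,
                    double *y, MPI_Comm comm)
{
    int p;
    MPI_Comm_size(comm, &p);
    int cols = n / p;

    double *x_loc = malloc(cols * sizeof *x_loc);
    double *y_loc = malloc(m * sizeof *y_loc);

    /* distribute x in blocks of length n/p: O(n) data sent by rank 0    */
    MPI_Scatter(x, cols, MPI_DOUBLE, x_loc, cols, MPI_DOUBLE, 0, comm);

    for (int i = 0; i < m; i++) {                  /* O(m*n/p) work      */
        double s = 0.0;
        for (int j = 0; j < cols; j++)
            s += a_loc[i * cols + j] * x_loc[j];
        y_loc[i] = s;
    }
    /* y = y^1 + ... + y^p on rank 0: O(m log p)                          */
    MPI_Reduce(y_loc, y, m, MPI_DOUBLE, MPI_SUM, 0, comm);
    free(x_loc); free(y_loc);
}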
Checkerboard Decomposition
Process 1 applies a Scatter operation addressed to the l processes of row 1 of the process mesh. Time O(l · n/l) = O(n).
Then each process of row 1 broadcasts its block of x to the k processes in its column: time O(n/l · log2 k) suffices.
All processes compute their matrix-vector products in time O(m · n/p).
The processes in column 1 of the process mesh apply a Reduce operation for their row to sum up the l vectors of length m/k: time O(m/k · log2 l) is sufficient.
Process 1 gathers the k − 1 vectors of length m/k in time O(m).
Performance analysis:
- The total running time is bounded by O(m · n/p + n + n/l · log2 k + m/k · log2 l + m).
- The total communication time is bounded by O(n + m), provided log2 k ≤ l and log2 l ≤ k.
- We obtain constant efficiency if m = Ω(p) and n = Ω(p).
Summary
The checkerboard decomposition has the best performance if m ≈ n. Why?
All three decompositions have the same computation time. Assuming m = n,
- the communication time of the rowwise decomposition is dominated by broadcasting the vector x: time O(n log2 p),
- whereas the final Reduce dominates for the columnwise decomposition: time O(m log2 p).
- The checkerboard decomposition cuts down on the message length!
Matrix-Matrix Product
Our goal is to compute the n × n product matrix C = A · B for n × n matrices A and B.
To compute C[i, j] = ∑_{k=1}^{n} A[i, k] · B[k, j] sequentially, n multiplications and n − 1 additions are required.
Since C has n² entries, we obtain running time Θ(n³).
We discuss four approaches:
- the first algorithm uses the rowwise decomposition,
- the algorithm of Fox and its improvement, the algorithm of Cannon, use the checkerboard decomposition,
- the DNS algorithm assumes a variant of the checkerboard decomposition.
The Rowwise Decomposition
Process i receives the submatrices A_i of A and B_i of B, corresponding to the i-th row interval of length n/p.
Further subdivide A_i, B_i into the p square submatrices A_{i,j}, B_{i,j} of size n/p × n/p.
Define C_{i,j} analogously and observe that C_{i,j} = ∑_{k=1}^{p} A_{i,k} · B_{k,j} holds.
The computation:
- In phase 1, process i computes all products A_{i,i} · B_{i,j} for j = 1, . . . , p in time O(p · (n/p) · (n/p) · (n/p)) = O(n³/p²), then sends B_i to process i + 1 and receives B_{i−1} from process i − 1 in time O(n²/p).
- In phase 2, process i computes all products A_{i,i−1} · B_{i−1,j}, sends B_{i−1} to process i + 1 and receives B_{i−2} from i − 1, and so on.
Performance analysis:
- All in all p phases. Hence the computing time is bounded by O(n³/p) and the communication time is bounded by O(n²).
- The compute/communicate ratio (n³/p) / n² = n/p is small!
The Algorithm of Fox
We again determine the product matrix according to C_{i,j} = ∑_{k=1}^{√p} A_{i,k} · B_{k,j}, but now
- processes are arranged in a √p × √p mesh of processes,
- process (i, j) knows the n/√p × n/√p submatrices A_{i,j} and B_{i,j}.
We have √p phases. In phase k we want process (i, j) to compute A_{i,i+k−1} · B_{i+k−1,j}:
- process (i, i + k − 1) broadcasts A_{i,i+k−1} to all processes in row i,
- process (i, j) computes A_{i,i+k−1} · B_{i+k−1,j},
- receives B_{i+k,j} from (i + 1, j) and sends B_{i+k−1,j} to (i − 1, j).
Performance analysis:
- Per phase: computing time O((n/√p)³) and communication time O(n²/p · log p).
- We have √p phases: computation time O(n³/p), communication time O(n²/√p · log p). The compute/communicate ratio n/(√p · log2 p) increases.
The Algorithm of Cannon
The setup is as for the algorithm of Fox. In particular, process (i, j) has to determine C_{i,j} = ∑_{k=1}^{√p} A_{i,k} · B_{k,j}.
At the very beginning, redistribute the matrices such that process (i, j) holds A_{i,i+j} and B_{i+j,j}.
We again have √p phases. In phase k we want process (i, j) to compute A_{i,i+j+k−1} · B_{i+j+k−1,j}:
- process (i, j) computes A_{i,i+j+k−1} · B_{i+j+k−1,j},
- sends A_{i,i+j+k−1} to (i, j − 1) and B_{i+j+k−1,j} to (i − 1, j), and
- receives A_{i,i+j+k} from (i, j + 1) and B_{i+j+k,j} from (i + 1, j).
Performance analysis:
- Per phase: computation time O((n/√p)³), communication time O((n/√p)²).
- Overall, computation time O(n³/p), communication time O(n²/√p), and the compute/communicate ratio n/√p increases again.
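A sketch of Cannon's algorithm using an MPI Cartesian topology. It assumes a periodic √p × √p grid communicator (created elsewhere with MPI_Cart_create), that √p divides n, and that each process already holds its nb × nb blocks a and b of A and B; block indices wrap around modulo √p as in the description above, and all names are ours.

#include <string.h>
#include <mpi.h>

/* Cannon's algorithm on a periodic sqrt(p) x sqrt(p) process mesh.
 * a, b: local nb x nb blocks of A and B (row-major); c: local block of C. */
void cannon(int nb, double *a, double *b, double *c, MPI_Comm grid)
{
    int dims[2], periods[2], coords[2], src, dst;
    MPI_Status st;
    MPI_Cart_get(grid, 2, dims, periods, coords);
    int q = dims[0];                                   /* q = sqrt(p)        */

    memset(c, 0, (size_t)nb * nb * sizeof *c);

    /* initial alignment: shift row i of A left by i, column j of B up by j */
    MPI_Cart_shift(grid, 1, -coords[0], &src, &dst);
    MPI_Sendrecv_replace(a, nb*nb, MPI_DOUBLE, dst, 0, src, 0, grid, &st);
    MPI_Cart_shift(grid, 0, -coords[1], &src, &dst);
    MPI_Sendrecv_replace(b, nb*nb, MPI_DOUBLE, dst, 0, src, 0, grid, &st);

    for (int phase = 0; phase < q; phase++) {
        for (int i = 0; i < nb; i++)                   /* c += a * b         */
            for (int k = 0; k < nb; k++)
                for (int j = 0; j < nb; j++)
                    c[i*nb + j] += a[i*nb + k] * b[k*nb + j];

        /* point-to-point shifts instead of broadcasts: A one step left,     */
        MPI_Cart_shift(grid, 1, -1, &src, &dst);
        MPI_Sendrecv_replace(a, nb*nb, MPI_DOUBLE, dst, 0, src, 0, grid, &st);
        /* and B one step up                                                 */
        MPI_Cart_shift(grid, 0, -1, &src, &dst);
        MPI_Sendrecv_replace(b, nb*nb, MPI_DOUBLE, dst, 0, src, 0, grid, &st);
    }
}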
How did we save Communication?
- Rowwise decomposition: in each of the p phases row blocks are exchanged. All in all O(p · n²/p) = O(n²) communication.
- The algorithm of Fox: a broadcast in each of the √p phases with communication time O(n²/p · log p). All in all communication time O(n²/√p · log p): merging point-to-point messages into broadcasts is profitable!
- The algorithm of Cannon: after initially rearranging submatrices, the broadcasts in the algorithm of Fox are replaced by point-to-point messages. All in all communication time O(√p · n²/p) = O(n²/√p).
The DNS Algorithm
p = n³ processes are arranged in an n × n × n mesh of processes.
Process (i, j, 1) stores A[i, j], B[i, j] and has to determine C[i, j].
We move A[i, k] to the processes (i, ∗, k): (i, k, 1) sends A[i, k] to (i, k, k), which broadcasts A[i, k] to all processes (i, ∗, k).
Next we move B[k, j] to the processes (∗, j, k): (k, j, 1) sends B[k, j] to (k, j, k), which broadcasts B[k, j] to all processes (∗, j, k).
Process (i, j, k) computes the product A[i, k] · B[k, j].
Process (i, j, 1) computes ∑_{k=1}^{n} A[i, k] · B[k, j] with MPI_Reduce.
Performance analysis:
- The replication step takes time O(log2 n), since the broadcast dominates. The multiplication step runs in constant time and the Reduce operation runs in logarithmic time.
- Time O(log2 n) suffices. Its efficiency Θ(1/log2 n) is too small.
- We scale down.
Scaling down the number of processors
We work with p processes. Let q = p^{1/3} and imagine that the p processes are arranged in a q × q × q mesh.
Input distribution: process (i, j, 1) receives the n/q × n/q submatrices A_{i,j} and B_{i,j}: the matrices A_{i,j} and B_{i,j} play the role of the entries A[i, j] and B[i, j].
Mimic the algorithm for n³ processes.
Performance analysis:
- The total computing time is O(n³/q³) = O(n³/p), since n/q × n/q matrices have to be multiplied.
- During replication and summing, n/q × n/q matrices are involved and hence the communication time is bounded by O(n²/q² · log p).
- The compute/communicate ratio is n/(q · log2 p).
Best performance so far. p should be sufficiently large.
Summary
The checkerboard decomposition is again better than the rowwise decomposition.
Cannon’s algorithm replaces a broadcast by a point-to-point message and is therefore faster than the algorithm of Fox.
The DNS algorithm partitions the matrices A and B among q² of the q³ processes.
- Thus each “input process” gets a relatively large chunk.
- However, there are only two (instead of √p) communication steps: namely when replicating and summing.
- Observe that DNS is better than Cannon only if p is sufficiently large.
Solving Linear Systems
We are given a matrix A and a right-hand side b and would like to solve the linear system A · x = b.
We begin with the easy case of lower triangular matrices A and describe back substitution.
Then we discuss efficient parallelizations of Gaussian elimination and continue with iterative methods: Jacobi relaxation, the Gauss-Seidel algorithm, the conjugate gradient approach, and the Newton method.
Finally, we consider the parallelization of the finite difference method.
Backsubstitution
We have to solve the system

A[i, 1] · x_1 + · · · + A[i, i] · x_i = b_i   for i = 1, . . . , n.

A sequential solution:
- First determine x_1 from the first equation A[1, 1] · x_1 = b_1.
- If we already know x_1, . . . , x_{i−1}, then determine x_i from the i-th equation.
- Since an evaluation of the i-th equation requires time O(i), the sequential solution runs in time O(n²).
We consider two input distributions:
- the off-diagonal decomposition of matrix A: process 1 knows the main diagonal and process i (i ≥ 2) knows the (i − 1)-st off-diagonal A[i, 1], A[i + 1, 2], . . . , A[n, n − i + 1];
- and the rowwise decomposition.
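A sequential sketch of this procedure for a lower triangular system (0-based indices, row-major storage, A[i, i] ≠ 0 assumed):

/* Solve the lower triangular system A*x = b as described above:
 * x_i follows from equation i once x_0, ..., x_{i-1} are known.
 * A is n x n, row-major; only entries A[i][j] with j <= i are used. */
void backsubstitution(int n, const double *A, const double *b, double *x)
{
    for (int i = 0; i < n; i++) {
        double s = b[i];
        for (int j = 0; j < i; j++)
            s -= A[i * n + j] * x[j];   /* subtract the known terms   */
        x[i] = s / A[i * n + i];        /* O(i) work for equation i   */
    }
}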
The Off-Diagonal Decomposition I
We use the linear array as communication pattern.
Process 1 successively determines x_1, . . . , x_n. Once computed, x_i is forwarded through the linear array.
How do we solve the i-th equation A[i, 1] · x_1 + · · · + A[i, i] · x_i = b_i?
- Process i computes A[i, 1] · x_1 immediately after receiving x_1 from process i − 1. Then i sends A[i, 1] · x_1 to process i − 1 and x_1 to process i + 1.
- If process i − 1 receives x_2 from process i − 2, it computes the product A[i, 2] · x_2, sends the sum A[i, 1] · x_1 + A[i, 2] · x_2 to process i − 2, and forwards x_2 to process i.
- We communicate according to the principle of “just-in-time production”.
The Off-Diagonal Decomposition II
[Figure: pipelined backsubstitution on the linear array for a small example with four unknowns, plotted as processors versus time. The values x_1, x_2, x_3, x_4 travel toward higher-numbered processes, while the partial products A[2,1]·x_1, A[3,1]·x_1, A[4,1]·x_1, A[3,2]·x_2, A[4,2]·x_2, A[4,3]·x_3 flow back toward process 1.]
The Off-Diagonal Decomposition III
Backsubstitution with p processes.
Assign the off-diagonals (A[j, 1], . . . , A[n, n − j + 1]) for j ∈ {(i − 1) · n/p + 1, . . . , i · n/p} to process i.
The computing time: we have p phases with compute time O((n/p)²) per phase. All in all, the compute time is bounded by O(n²/p).
Communication is O(n/p) per phase and hence O(n) in total.
The running time is bounded by O(n²/p + n).
We achieve constant efficiency whenever n = Ω(p).
The Rowwise Decomposition
This time process i determines x_i. Once x_i is determined, process i broadcasts x_i to processes i + 1, . . . , n.
For p processes:
Each process is responsible for n/p variables. Time O(n) per variable is sufficient.
⇒ The compute time is bounded by O(n²/p).
There is one broadcast per unknown and the communication time is bounded by O(n · log2 p).
We achieve constant efficiency whenever n = Ω(p · log2 p).
Gaussian Elimination with Partial Pivoting
Include the right-hand side b as the last column of matrix A.
If we have already eliminated the nonzeroes below the diagonal in columns 1, . . . , i − 1, then use the largest entry A[j, i] for j = i, . . . , n as pivot, swap rows i and j, and set row_k = row_k − (A[k, i]/A[i, i]) · row_i for k > i.
Performance analysis for the sequential algorithm:
When dealing with row i:
- Determine the largest entry A[j, i] in column i in time O(n).
- The elimination step for each row k > i requires O(n − i + 1) arithmetic operations.
- All in all, O(n + (n − i + 1)²) = O(n²) operations suffice.
The total number of arithmetic operations is bounded by O(n³).
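A sequential sketch of the elimination phase with partial pivoting, with b stored as the last column of A as described above (0-based indices, nonsingular A assumed; the function name is ours):

#include <math.h>

/* Sequential Gaussian elimination with partial pivoting.
 * A is n x (n+1), row-major, with b stored as the last column.
 * Transforms A into an upper triangular system in O(n^3) operations. */
void gauss_eliminate(int n, double *A)
{
    int w = n + 1;                                 /* row width         */
    for (int i = 0; i < n; i++) {
        int piv = i;                               /* largest |A[j,i]|  */
        for (int j = i + 1; j < n; j++)
            if (fabs(A[j*w + i]) > fabs(A[piv*w + i])) piv = j;
        for (int c = 0; c < w; c++) {              /* swap rows i, piv  */
            double t = A[i*w + c]; A[i*w + c] = A[piv*w + c]; A[piv*w + c] = t;
        }
        for (int k = i + 1; k < n; k++) {          /* eliminate below   */
            double f = A[k*w + i] / A[i*w + i];
            for (int c = i; c < w; c++)
                A[k*w + c] -= f * A[i*w + c];      /* row_k -= f*row_i  */
        }
    }
}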
A parallelization of Gaussian Elimination I
We work with p processes and the rowwise decomposition: each process receives an “interval” of n/p rows.
We maintain the sequential structure of pivoting, but parallelize each pivoting step instead.
Assume that we have reached row i.
- To utilize the rowwise decomposition we look for the largest entry in row i (and not in column i).
- We have to eliminate, using row i, the entries of the pivot column in all remaining rows: the process holding row i has to
  - determine the largest entry A[i, k] in row i,
  - compute the vector m_i of multiples for the elimination step,
  - and send m_i to the remaining processes.
A parallelization of Gaussian Elimination II
Avoid broadcasting (m_i, k). When dealing with row i − 1:
- After computing m_{i−1}, the process j holding row i interrupts its elimination work for row i − 1,
- immediately recomputes row i and determines (m_i, k) instead,
- sends (m_i, k) to process j + 1, and
- then resumes its elimination work for row i − 1.
We hide communication behind computation:
- the expensive broadcast of (m_i, k) is replaced by sending (m_i, k) through the linear array of processes.
- Whenever a process receives (m_i, k), it immediately forwards (m_i, k) to its neighbor process.
Performance analysis:
- There is no delay when eliminating row i if the compute time Θ((n/p) · n) for pivoting dominates the maximal communication delay p · n.
The overall compute time is bounded by O(n · (n/p) · n) = O(n³/p).
There is no delay due to communication, provided n = Ω(p²).
Iterative Methods
In an iterative method an approximate solution of a linear system A · x = b is successively improved.
One starts with an initial “guess” x(0) and replaces x(t) by a presumably better solution x(t + 1).
Assume that the computation of x(t + 1) is based on the matrix-vector product.
Then we obtain a fast parallel algorithm and can exploit sparse linear systems.
We describe:
- the Jacobi relaxation and its variants,
- the Newton method to approximately compute the inverse A⁻¹.
Jacobi Relaxation
Assume that A · x* = b. If A has a nonzero diagonal, then

x*_i = (1 / A[i, i]) · ( b_i − ∑_{j ≠ i} A[i, j] · x*_j ).

The Jacobi iteration: if x_i(t) is an approximate solution, set

x_i(t + 1) = (1 / A[i, i]) · ( b_i − ∑_{j ≠ i} A[i, j] · x_j(t) ).

Each Jacobi iteration corresponds to a matrix-vector product.
Hence one iteration runs in time O(n²/p) and we obtain a fast approximation whenever few iterations suffice.
When does the Jacobi iteration converge to the unique solution?
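A sequential sketch of one Jacobi iteration; since it is essentially a matrix-vector product, it can be distributed with any of the decompositions above (0-based indices, dense row-major A assumed, names ours):

/* One Jacobi iteration x(t) -> x(t+1) for A*x = b (nonzero diagonal).
 * Essentially a matrix-vector product, so it parallelizes like y = A*x. */
void jacobi_step(int n, const double *A, const double *b,
                 const double *x_old, double *x_new)
{
    for (int i = 0; i < n; i++) {
        double s = b[i];
        for (int j = 0; j < n; j++)
            if (j != i) s -= A[i * n + j] * x_old[j];
        x_new[i] = s / A[i * n + i];
    }
}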
Jacobi Relaxation: Convergence
Let D be the diagonal matrix with D[i, i] = A[i, i]. Set M = D⁻¹ · (D − A).
- Another view of the Jacobi iteration: if A is invertible and if x* is the unique solution of A · x = b, then x(t + 1) − x* = M · (x(t) − x*).
- Consequently, x(t) − x* = M^t · (x(0) − x*) follows for all t.
- If lim_{t→∞} M^t = 0, then x(t) converges to x*.
The Jacobi relaxation converges for row diagonally dominant matrices A, i.e., if

|A[i, i]| > ∑_{j ≠ i} |A[i, j]|

holds for all i.
Two Extensions
In many practical applications the Jacobi overrelaxation converges faster.
- For a suitable coefficient γ:

  x_i(t + 1) = (1 − γ) · x_i(t) + (γ / A[i, i]) · ( b_i − ∑_{j ≠ i} A[i, j] · x_j(t) ).

- The Jacobi relaxation is a special case: set γ = 1.
The Gauss-Seidel algorithm incorporates already recomputed values of x_j (i.e., it replaces x_j(t) by x_j(t + 1)).
- An example is

  x_i(t + 1) = (1 / A[i, i]) · ( b_i − ∑_{j < i} A[i, j] · x_j(t + 1) − ∑_{j > i} A[i, j] · x_j(t) ).

- The Gauss-Seidel method does not look parallelizable!?
Inversion by Newton Iteration
Assume that A is invertible and that X_t is an approximate inverse of A. The Newton iteration is

X_{t+1} = 2 · X_t − X_t · A · X_t.

What is the intuition behind this approach?
- Consider the residual matrix R_t = I − A · X_t, which measures the “distance” between A⁻¹ and X_t. Then

  R_{t+1} = I − A · X_{t+1} = I − A · (2 · X_t − X_t · A · X_t) = (I − A · X_t)² = R_t².

- R_t converges rapidly towards the 0-matrix whenever X_0 is a good approximation of A⁻¹.
Since the Newton iteration is based on matrix-matrix products, each iteration is easily parallelized.
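A plain sequential sketch of one Newton step; in a parallel setting the two matrix-matrix products would be delegated to one of the multiplication algorithms above. The helper and function names are ours.

#include <stdlib.h>

/* Helper: C = alpha*A*B + beta*C for n x n row-major matrices.          */
static void matmul(int n, double alpha, const double *A, const double *B,
                   double beta, double *C)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double s = 0.0;
            for (int k = 0; k < n; k++)
                s += A[i*n + k] * B[k*n + j];
            C[i*n + j] = alpha * s + beta * C[i*n + j];
        }
}

/* One Newton step X_{t+1} = 2*X_t - X_t*A*X_t, built from two products.  */
void newton_step(int n, const double *A, const double *X, double *X_next)
{
    double *T = calloc((size_t)n * n, sizeof *T);
    matmul(n, 1.0, A, X, 0.0, T);                   /* T = A*X            */
    for (int i = 0; i < n * n; i++) X_next[i] = 2.0 * X[i];
    matmul(n, -1.0, X, T, 1.0, X_next);             /* X_next -= X*(A*X)   */
    free(T);
}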
The Finite Difference Method
Finite differences can be used to approximate derivatives of a function f, since

f′(x) = lim_{h→0} (f(x + h) − f(x)) / h = lim_{h→0} (f(x) − f(x − h)) / h,
f″(x) = lim_{h→0} (f′(x + h) − f′(x)) / h = lim_{h→0} (f(x + h) − 2f(x) + f(x − h)) / h².

The finite difference method is used to solve differential equations:
- Derivatives are approximated by finite differences,
- and differential equations are modeled by linear systems of equations.
The usually sparse systems are mostly solved with iterative methods.
An Example: The Poisson Equation
Find a function u : [0, 1]² → R which satisfies the Poisson equation u_xx + u_yy = H and which has prescribed values on the boundary of the unit square [0, 1]².
If u is sufficiently smooth and if h is sufficiently small, then

u_xx(x, y) ≈ (u(x + h, y) − 2u(x, y) + u(x − h, y)) / h².

Approximate u_yy analogously and we get

H(x, y) ≈ (u(x + h, y) + u(x − h, y) + u(x, y + h) + u(x, y − h) − 4u(x, y)) / h².

For N sufficiently large, set

h = 1/N,   u_{i,j} = u(i/N, j/N)   and   H_{i,j} = H(i/N, j/N).
The Linear System I
Choose (x, y) as one of the grid points { (i/N, j/N) | 0 < i, j < N } and we get the linear system

−4u_{i,j} + u_{i+1,j} + u_{i−1,j} + u_{i,j+1} + u_{i,j−1} = H_{i,j} / N².

The system is huge: (N − 1)² equations in (N − 1)² unknowns. (The values of u at the boundary are prescribed.)
The matrix of the system has (N − 1)⁴ entries, but is sparse, since any equation has at most five nonzero coefficients.
To utilize sparsity, we apply iterative methods.
The Linear System II
We process the system beginning with the lower boundary and working upwards:

u_{i+1,j} = 4u_{i,j} − (u_{i−1,j} + u_{i,j+1} + u_{i,j−1}) + H_{i,j} / N².

We apply the Jacobi relaxation and get

u_{i+1,j}(t + 1) = 4u_{i,j}(t) − (u_{i−1,j}(t) + u_{i,j+1}(t) + u_{i,j−1}(t)) + H_{i,j} / N².

How do we obtain an efficient parallelization of one iteration?
- We use the checkerboard decomposition of the (N − 1) × (N − 1) grid. p processes are arranged in a √p × √p mesh.
- The process responsible for determining u_{i+1,j}(t + 1) has to know u_{i−1,j}(t), u_{i,j+1}(t), u_{i,j−1}(t) and u_{i,j}(t).
- Any missing value belongs to the “near-boundary” of a neighbor.
The Linear System: Performance Analysis
The processes communicate their O(N/√p) near-boundary values.
Afterwards they perform an iteration without further communication: the computation time is bounded by O(N²/p), since the update time for any grid point is constant and each process has to update O(N²/p) grid points.
We have constant efficiency whenever N = Ω(√p).
One can show that the matrix of the linear system is symmetric and positive definite: the Jacobi relaxation converges.
Gauss-Seidel Revisited
So far we have used the Jacobi relaxation

u_{i+1,j}(t + 1) = 4u_{i,j}(t) − (u_{i−1,j}(t) + u_{i,j+1}(t) + u_{i,j−1}(t)) + H_{i,j} / N².

Can we use the generally better Gauss-Seidel method instead?
- Use the new update

  u_{i,j}(t + 1) = (u_{i+1,j}(t) + u_{i−1,j}(t) + u_{i,j+1}(t) + u_{i,j−1}(t)) / 4 − H_{i,j} / (4N²).

- Label grid point (i, j) white iff i + j is even and black otherwise: white grid points are updated with black grid points only.
- Execute one iteration of the Gauss-Seidel algorithm (a sketch follows below) by
  - first updating the black grid points conventionally with the Jacobi relaxation,
  - then applying the Gauss-Seidel algorithm to update the white grid points using the already updated black grid points.
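A sequential sketch of one red-black sweep, under the assumption that u and H are stored as (N+1) × (N+1) row-major grids with fixed boundary values; in the parallel version each process would first exchange its near-boundary values and then apply the same updates to its own subgrid. Function names and storage layout are ours.

/* One red-black Gauss-Seidel sweep for the discretized Poisson equation.
 * u, H: (N+1) x (N+1) grids, row-major; the boundary values of u stay fixed.
 * parity 1 selects the "black" points (i+j odd), parity 0 the "white" ones. */
static void sweep(int N, double *u, const double *H, int parity)
{
    for (int i = 1; i < N; i++) {
        int j0 = 1 + ((i + 1 + parity) % 2);   /* first j with (i+j)%2 == parity */
        for (int j = j0; j < N; j += 2)
            u[i*(N+1) + j] = (u[(i+1)*(N+1) + j] + u[(i-1)*(N+1) + j]
                            + u[i*(N+1) + j+1] + u[i*(N+1) + j-1]
                            - H[i*(N+1) + j] / ((double)N * N)) / 4.0;
    }
}

void red_black_gauss_seidel(int N, double *u, const double *H)
{
    sweep(N, u, H, 1);   /* update black points with the old (Jacobi-style) values */
    sweep(N, u, H, 0);   /* then white points, using the freshly updated black ones */
}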