Block LU Factorization Lecture 24
MA471 Fall 2003
Example Case
1) Suppose we are faced with the solution of a linear system Ax=b
2) Further suppose:
   1) A is large (dim(A)>10,000)
   2) A is dense
   3) A is full
   4) We have a sequence of different b vectors.
Problems
• Suppose we are able to compute the matrix
– It costs N^2 doubles to store the matrix
– E.g. for N=100,000 we require 8*10^10 bytes, about 80 gigabytes (74.5 GiB), of storage for the matrix alone.
– 32 bit processors are limited to 4 gigabytes of memory
– Most desktops (even 64 bit) do not have that much memory
– What to do?
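The storage estimate can be checked with a few lines of Python. This is only a sketch: the function name dense_storage_bytes is made up for illustration, and it assumes 8 bytes per double and a 16-processor split like the 4-by-4 grid used on the next slide.

```python
def dense_storage_bytes(n, bytes_per_entry=8):
    """Bytes needed to store a dense n-by-n matrix of doubles."""
    return n * n * bytes_per_entry

n = 100_000
total = dense_storage_bytes(n)
print(total, "bytes")
print(total / 1e9, "GB (decimal)")
print(round(total / 2**30, 1), "GiB (binary)")

# Share per processor if the matrix is split evenly over 16 processors.
print(total // 16, "bytes per processor")
```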
Divide and Conquer
P0 P1 P2 P3
P4 P5 P6 P7
P8 P9 P10 P11
P12 P13 P14 P15
One approach is to assume we have a square number of processors. We then divide the matrix into blocks, storing one block per processor.
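One possible way to express this layout in code is shown below. The names owner_rank and block_bounds are hypothetical, and the sketch assumes row-major rank numbering (as in the P0..P15 picture) and block sizes that differ by at most one row when N is not divisible by the grid dimension.

```python
def owner_rank(block_row, block_col, grid_dim):
    """Rank of the processor that stores block (block_row, block_col)."""
    return block_row * grid_dim + block_col

def block_bounds(index, n, grid_dim):
    """Half-open range [start, stop) of the rows (or columns) in block `index`."""
    base, extra = divmod(n, grid_dim)
    start = index * base + min(index, extra)
    stop = start + base + (1 if index < extra else 0)
    return start, stop

# 4-by-4 processor grid: block (1, 2) lives on P6.
print(owner_rank(1, 2, 4))
# Rows of a 10-row matrix held by each of the 4 block rows.
print([block_bounds(i, 10, 4) for i in range(4)])
```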
Back to the Linear System
• We are now faced with LU factorization of a distributed matrix.
• This calls for a modified LU routine which acts on blocks of the matrix.
• We will demonstrate this algorithm for one level.
• i.e. we need to construct matrices L,U such that A=LU and we only store single blocks of A,L,U on any processor.
Constructing the Block LU Factorization
[ A00 A01 A02 ]   [ L00 0 0 ]   [ U00 U01 U02 ]
[ A10 A11 A12 ] = [ L10 1 0 ] * [ 0   ?11 ?12 ]
[ A20 A21 A22 ]   [ L20 0 1 ]   [ 0   ?21 ?22 ]
First we LU factorize A00 and look for the above block factorization. However, we need to figure out what each of the entries is:
A00 = L00*U00 (compute L00, U00 by LU factorization)
A01 = L00*U01 => U01 = L00\A01
A02 = L00*U02 => U02 = L00\A02
A10 = L10*U00 => L10 = A10/U00
A20 = L20*U00 => L20 = A20/U00
A11 = L10*U01 + ?11 => ?11 = A11 - L10*U01
A12 = L10*U02 + ?12 => ?12 = A12 - L10*U02
A21 = L20*U01 + ?21 => ?21 = A21 - L20*U01
A22 = L20*U02 + ?22 => ?22 = A22 - L20*U02
In the general case:
Anm = Ln0*U0m + ?nm => ?nm = Anm - Ln0*U0m
Summary First Stage
[ A00 A01 A02 ]   [ L00 0 0 ]   [ U00 U01 U02 ]
[ A10 A11 A12 ] = [ L10 1 0 ] * [ 0   ?11 ?12 ]
[ A20 A21 A22 ]   [ L20 0 1 ]   [ 0   ?21 ?22 ]
First step: LU factorize the uppermost diagonal block
Second step:
a) compute U0n = L00\A0n, n>0
b) compute Ln0 = An0/U00, n>0
Third step: compute ?nm = Anm - Ln0*U0m, (n,m>0)
Now Factorize Lower SE Block
[ ?11 ?12 ]   [ L11 0 ]   [ U11 U12  ]
[ ?21 ?22 ] = [ L21 1 ] * [ 0   ??22 ]
We repeat the previous algorithm this time on the two by two SE block.
End Result
[ A00 A01 A02 ]   [ L00 0   0   ]   [ U00 U01 U02 ]
[ A10 A11 A12 ] = [ L10 L11 0   ] * [ 0   U11 U12 ]
[ A20 A21 A22 ]   [ L20 L21 L22 ]   [ 0   0   U22 ]
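The stages above can be sketched serially in Python. This is an illustration under stated assumptions, not the lecture's blocklu.m: all helper names are invented here, blocks are plain lists of rows, no pivoting is done (so every diagonal block must factor without row swaps), and block entries that are structurally zero are stored as None.

```python
def matmul(A, B):
    """Dense product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def matsub(A, B):
    """Entrywise difference A - B."""
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def lu_nopivot(A):
    """LU without pivoting; L has a unit diagonal. Assumes nonzero pivots."""
    n = len(A)
    L = [[float(i == j) for j in range(n)] for i in range(n)]
    U = [[float(v) for v in row] for row in A]
    for k in range(n):
        for i in range(k + 1, n):
            L[i][k] = U[i][k] / U[k][k]
            U[i][k] = 0.0
            for j in range(k + 1, n):
                U[i][j] -= L[i][k] * U[k][j]
    return L, U

def solve_left_lower(L, B):
    """X = L\\B: solve L*X = B for lower-triangular L."""
    X = [[0.0] * len(B[0]) for _ in L]
    for i in range(len(L)):
        for j in range(len(B[0])):
            s = sum(L[i][k] * X[k][j] for k in range(i))
            X[i][j] = (B[i][j] - s) / L[i][i]
    return X

def solve_right_upper(B, U):
    """X = B/U: solve X*U = B for upper-triangular U."""
    X = [[0.0] * len(U) for _ in B]
    for i in range(len(B)):
        for j in range(len(U)):
            s = sum(X[i][k] * U[k][j] for k in range(j))
            X[i][j] = (B[i][j] - s) / U[j][j]
    return X

def block_lu(blocks):
    """Block LU of a q-by-q grid of square blocks; returns block grids (L, U)."""
    q = len(blocks)
    work = [list(block_row) for block_row in blocks]  # blocks are replaced, never mutated
    L = [[None] * q for _ in range(q)]
    U = [[None] * q for _ in range(q)]
    for k in range(q):
        L[k][k], U[k][k] = lu_nopivot(work[k][k])                  # first step
        for m in range(k + 1, q):                                  # second step
            U[k][m] = solve_left_lower(L[k][k], work[k][m])
            L[m][k] = solve_right_upper(work[m][k], U[k][k])
        for n in range(k + 1, q):                                  # third step
            for m in range(k + 1, q):
                work[n][m] = matsub(work[n][m], matmul(L[n][k], U[k][m]))
    return L, U

def max_residual(blocks, L, U):
    """Largest entry of |A - L*U|, accumulated block by block."""
    q, worst = len(blocks), 0.0
    for n in range(q):
        for m in range(q):
            R = blocks[n][m]
            for k in range(min(n, m) + 1):
                R = matsub(R, matmul(L[n][k], U[k][m]))
            worst = max(worst, max(abs(v) for row in R for v in row))
    return worst

# 6x6 diagonally dominant test matrix, cut into a 3x3 grid of 2x2 blocks.
A = [[12.0 if i == j else float((i + 2 * j) % 3) for j in range(6)] for i in range(6)]
blocks = [[[row[2 * j:2 * j + 2] for row in A[2 * i:2 * i + 2]] for j in range(3)]
          for i in range(3)]
L, U = block_lu(blocks)
print(max_residual(blocks, L, U))  # should be at rounding-error level
```

The loop over k repeats the three steps on the shrinking south-east block, which is the "repeat the previous algorithm" idea of the preceding slides.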
Matlab Version
Parallel Algorithm
P0 P1 P2
P3 P4 P5
P6 P7 P8
P0: A00 = L00*U00 (compute L00, U00 by LU factorization)
P1: U01 = L00\A01
P2: U02 = L00\A02
P3: L10 = A10/U00
P6: L20 = A20/U00
P4: A11 <- A11 - L10*U01
P5: A12 <- A12 - L10*U02
P7: A21 <- A21 - L20*U01
P8: A22 <- A22 - L20*U02
In the general case:
Anm = Ln0*U0m + ?nm => ?nm = Anm - Ln0*U0m
Parallel Communication
L00,U00 U01 U02
L10     A11 A12
L20     A21 A22
P0: L00,U00 = lu(A00)
P1: U01 = L00\A01
P2: U02 = L00\A02
P3: L10 = A10/U00
P6: L20 = A20/U00
P4: A11 <- A11 - L10*U01
P5: A12 <- A12 - L10*U02
P7: A21 <- A21 - L20*U01
P8: A22 <- A22 - L20*U02
In the general case:
Anm = Ln0*U0m + ?nm => ?nm = Anm - Ln0*U0m
Communication Summary
P0: L00,U00 = lu(A00)
P1: U01 = L00\A01
P2: U02 = L00\A02
P3: L10 = A10/U00
P6: L20 = A20/U00
P4: A11 <- A11 - L10*U01
P5: A12 <- A12 - L10*U02
P7: A21 <- A21 - L20*U01
P8: A22 <- A22 - L20*U02
P0: sends L00 to P1,P2; sends U00 to P3,P6
P1: sends U01 to P4,P7
P2: sends U02 to P5,P8
P3: sends L10 to P4,P5
P6: sends L20 to P7,P8
P0 P1 P2
P3 P4 P5
P6 P7 P8
L00,U00 U01 U02
L10     A11 A12
L20     A21 A22
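The send pattern generalizes to any stage k on a q-by-q grid. The sketch below only lists who would send what to whom; it performs no actual MPI communication. The function name stage_messages is hypothetical, and it assumes row-major rank numbering with block (n,m) owned by processor n*q+m.

```python
def stage_messages(k, q):
    """List (sender, payload, receivers) for stage k on a q-by-q processor grid."""
    if k == q - 1:
        return []  # last diagonal block: nothing left to update

    def rank(i, j):
        return i * q + j

    later = range(k + 1, q)
    msgs = [
        # The diagonal owner sends Lkk along its row and Ukk down its column.
        (rank(k, k), f"L{k}{k}", [rank(k, m) for m in later]),
        (rank(k, k), f"U{k}{k}", [rank(n, k) for n in later]),
    ]
    for m in later:  # Ukm goes down block column m
        msgs.append((rank(k, m), f"U{k}{m}", [rank(n, m) for n in later]))
    for n in later:  # Lnk goes along block row n
        msgs.append((rank(n, k), f"L{n}{k}", [rank(n, m) for m in later]))
    return msgs

for sender, payload, receivers in stage_messages(0, 3):
    print(f"P{sender}: sends {payload} to " + ",".join(f"P{r}" for r in receivers))
```

Calling stage_messages(1, 3) gives the corresponding list for the factorization of the two-by-two SE block.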
Upshot
Notes:
1) I added an MPI_Barrier purely to separate the LU factorization and the backsolve.
2) In terms of efficiency we can see that quite a bit of time is spent in MPI_Wait compared to compute time.
3) The compute part of this code can be optimized much more, making the parallel efficiency even worse.
1st stage:
(a) P0: sends L00 to P1,P2; sends U00 to P3,P6
(b) P1: sends U01 to P4,P7
(c) P2: sends U02 to P5,P8
(d) P3: sends L10 to P4,P5
(e) P6: sends L20 to P7,P8
2nd stage:
(f) P4: sends L11 to P5; sends U11 to P7
(g) P5: sends U12 to P8
(h) P7: sends L21 to P8
Block Back Solve
• After factorization we are left with the task of using the distributed L and U to compute the backsolve:
Block distribution of L and U:
L00,U00 U01     U02
L10     L11,U11 U12
L20     L21     L22,U22
P0 P1 P2
P3 P4 P5
P6 P7 P8
Recall
• Given an LU factorization of A, namely L,U such that A=LU
• Then we can solve Ax=b by
– y=L\b
– x=U\y
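As a small serial illustration of the two triangular solves, here is a sketch in plain Python. The function names forward_sub and back_sub are chosen for this example, and the 2-by-2 factors are made-up data. Because L and U are computed once, the same two calls can be reused for each new b in a sequence of right-hand sides.

```python
def forward_sub(L, b):
    """Solve L*y = b for lower-triangular L (the y = L\\b step)."""
    y = []
    for i, row in enumerate(L):
        y.append((b[i] - sum(row[j] * y[j] for j in range(i))) / row[i])
    return y

def back_sub(U, y):
    """Solve U*x = y for upper-triangular U (the x = U\\y step)."""
    n = len(U)
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(U[i][j] * x[j] for j in range(i + 1, n))) / U[i][i]
    return x

# A = L*U = [[4, 2], [2, 4]]
L = [[1.0, 0.0], [0.5, 1.0]]
U = [[4.0, 2.0], [0.0, 3.0]]

for b in ([10.0, 14.0], [4.0, 2.0]):
    x = back_sub(U, forward_sub(L, b))
    print(b, "->", x)
```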
Distributed Back Solve
[ L00 0   0   ] [ y0 ]   [ b0 ]
[ L10 L11 0   ] [ y1 ] = [ b1 ]
[ L20 L21 L22 ] [ y2 ]   [ b2 ]
P0: solve L00*y0 = b0; send: y0 to P3,P6
P3: send: L10*y0 to P4
P4: solve L11*y1 = b1-L10*y0; send: y1 to P7
P6: send: L20*y0 to P8
P7: send: L21*y1 to P8
P8: solve L22*y2 = b2-L20*y0-L21*y1
Results: y0 on P0, y1 on P4, y2 on P8
P0 P1 P2
P3 P4 P5
P6 P7 P8
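The same sequence of steps can be simulated serially, which also makes the serial dependency between the block solves visible: y1 cannot start until y0 is known, and y2 needs both. This is a sketch with invented names and made-up 2-by-2 blocks; comments indicate which processor would do each step in the 3-by-3 layout, but no message passing is performed.

```python
def forward_sub(L, b):
    """Solve L*y = b for a lower-triangular block L."""
    y = []
    for i, row in enumerate(L):
        y.append((b[i] - sum(row[j] * y[j] for j in range(i))) / row[i])
    return y

def matvec(A, x):
    """Matrix-vector product for a block stored as a list of rows."""
    return [sum(a * v for a, v in zip(row, x)) for row in A]

def block_forward_solve(L_blocks, b_blocks):
    """Solve the block lower-triangular system L*y = b one block row at a time."""
    q = len(L_blocks)
    rhs = [list(b) for b in b_blocks]
    y = []
    for k in range(q):
        # Diagonal owner (P0, P4, P8) solves with its own block.
        y.append(forward_sub(L_blocks[k][k], rhs[k]))
        for n in range(k + 1, q):
            # Owner of Lnk (e.g. P3, P6, P7) forms Lnk*yk and passes it on.
            update = matvec(L_blocks[n][k], y[k])
            rhs[n] = [r - u for r, u in zip(rhs[n], update)]
    return y

L00 = [[1, 0], [2, 1]]
L10 = [[1, 1], [0, 1]]
L11 = [[1, 0], [3, 1]]
L20 = [[2, 0], [1, 1]]
L21 = [[1, 2], [0, 1]]
L22 = [[1, 0], [1, 1]]
L_blocks = [[L00, None, None], [L10, L11, None], [L20, L21, L22]]
b_blocks = [[1, 4], [6, 15], [18, 18]]

y = block_forward_solve(L_blocks, b_blocks)
print(y)
```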
Matlab Code
Back Solve
After the factorization we computed a solution to Ax=b.
This consists of solving two distributed block triangular systems.
Barrier Between Back Solves
This time I inserted an MPI_Barrier call between the backsolves. This highlights the serial nature of the backsolves.
Example Code
http://www.math.unm.edu/~timwar/MA471F03/blocklu.m
http://www.math.unm.edu/~timwar/MA471F03/parlufact2.c