Sparse factorization using low rank submatrices
Cleve Ashcraft, LSTC
2010 MUMPS User Group Meeting, April 15-16, 2010
Toulouse, FRANCE
ftp.lstc.com:outgoing/cleve/MUMPS10 Ashcraft.pdf
LSTC : Livermore Software Technology Corporation
Livermore, California 94550
• Founded by John Hallquist of LLNL in the 1980s
• Public domain versions of the DYNA and NIKE codes
• LS-DYNA : implicit/explicit, nonlinear finite element analysis code
• Multiphysics capabilities
– Fluid/structure interaction
– Thermal analysis
– Acoustics
– Electromagnetics
Multifrontal Algorithm
• very large, sparse LDL^T and LDU factorizations
• tree structure organizes factor storage, solve and factor operations
• medium-to-large sparse linear systems located at each leaf node of the tree
• medium-sized dense linear systems located at each interior node of the tree
• dense matrix-matrix operations at each interior node
• sparse matrix-matrix adds between nodes
Multifrontal tree
[Figure: multifrontal tree with nodes 0-71]
Multifrontal tree – polar representation
[Figure: multifrontal tree in polar representation, nodes 0-71]
Multifrontal Algorithm
• 40M equations, present frontier at LSTC
• serial, SMP, MPP, hybrid, GPU
• large problems, > 1M - 10M dof, require out-of-core storage of factor entries, even on distributed memory systems
• IO cost largely hidden during the factorization
• IO cost dominant during the solves
• eigensolver ⟹ several right hand sides
• many applications, e.g., Newton's method, have a single right hand side
Low Rank Approximations
• hierarchical matrices, Hackbusch, Bebendorf, Leborne, others
• semi-separable matrices, Gu, others
• submatrices are numerically rank deficient
• method of choice for Boundary Elements (BEM)
• now applied to Finite Elements (FEM)
• A ≈ X Y^T, where A is m × n, X is m × r and Y is n × r
• storage = r(m + n) vs mn
• reduction in ops = r / min(m, n)
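A minimal numpy sketch of such a low rank approximation via truncated SVD; the kernel matrix, sizes and rank here are hypothetical illustrations, not the slides' test problem:

```python
import numpy as np

def low_rank_approx(A, r):
    """Best rank-r approximation A ~= X @ Y.T via truncated SVD."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    X = U[:, :r] * s[:r]    # m x r, singular values folded into X
    Y = Vt[:r, :].T         # n x r
    return X, Y

# a smooth kernel gives a numerically rank-deficient matrix
m, n, r = 120, 80, 15
x = np.linspace(1.0, 2.0, m)[:, None]
y = np.linspace(3.0, 4.0, n)[None, :]
A = 1.0 / (x + y)           # Cauchy-like matrix, rapidly decaying spectrum
X, Y = low_rank_approx(A, r)
err = np.linalg.norm(A - X @ Y.T, 'fro') / np.linalg.norm(A, 'fro')
print(err)                  # tiny: rank 15 captures the matrix
print(r * (m + n), m * n)   # storage: r(m + n) = 3000 vs mn = 9600
```

The storage and operation savings grow as r shrinks relative to min(m, n).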
Multifrontal Algorithm + Low Rank Approximations
• At each leaf node in the multifrontal tree — use standard multifrontal
• At each interior node in the multifrontal tree — low rank matrix-matrix multiplies and sums
• Between nodes — low rank matrix sums
• Dramatic reduction in storage
• Dramatic reduction in operations
• Excellent approximation properties for finite element operators
• Our experience is with potential equations, elasticity with solids and shells
Outline
• graph, tree, matrix perspectives
• experiments – 2-D potential equation
• low rank computations
• blocking strategies
• summary
One subgraph
[Figure: subgraph on the unit square]
One subtree
[Diagram: subdomains Ω_I and Ω_J, separator S, external boundary ∂S]
One submatrix
[ A_{ΩI,ΩI}              A_{ΩI,S}   A_{ΩI,∂S} ]
[            A_{ΩJ,ΩJ}   A_{ΩJ,S}   A_{ΩJ,∂S} ]
[ A_{S,ΩI}   A_{S,ΩJ}    A_{S,S}    A_{S,∂S}  ]
[ A_{∂S,ΩI}  A_{∂S,ΩJ}   A_{∂S,S}   A_{∂S,∂S} ]
Portion of original matrix
[Spy plot, nz = 9052]
A ordered by domains, separator and external boundary
Portion of factor matrix
[Spy plot, nz = 48899]
L ordered by domains, separator and external boundary
Schur complement matrix
[ A_{S,S}   A_{S,∂S}  ]
[ A_{∂S,S}  A_{∂S,∂S} ] =

[Spy plot, nz = 20659]
Schur complement, separator and external boundary
Outline
• graph, tree, matrix perspectives
• experiments – 2-D potential equation
• low rank computations
• blocking strategies
• summary
Computational experiments
• Compute L_{S,S}, L_{∂S,S} and A_{∂S,∂S}
• Find domain decompositions of S and ∂S
• Form block matrices, e.g., L_{S,S} = Σ_{K≥J} L_{K,J}
• Find the singular value decomposition of each L_{K,J}
• Collect all singular values: σ = Σ_{K≥J} Σ_{i=1}^{min(|K|,|J|)} σ_i^{(K,J)}
• Split the matrix L_{S,S} = M_{S,S} + N_{S,S} using the singular values σ
• We want ‖N_{S,S}‖_F ≤ 10^{-14} ‖L_{S,S}‖_F
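One way to realize this split can be sketched in numpy; the uniform blocking and the smooth test kernel here are hypothetical stand-ins for the slides' domain-decomposition partition:

```python
import numpy as np

def blockwise_truncate(L, p, tol=1e-14):
    """Split L = M + N with ||N||_F <= tol * ||L||_F by truncating
    block SVDs against one global singular-value cutoff."""
    rows = np.array_split(np.arange(L.shape[0]), p)
    cols = np.array_split(np.arange(L.shape[1]), p)
    blocks, sigmas = {}, []
    for i, I in enumerate(rows):
        for j, J in enumerate(cols):
            U, s, Vt = np.linalg.svd(L[np.ix_(I, J)], full_matrices=False)
            blocks[(i, j)] = (U, s, Vt)
            sigmas.extend(s)
    sigmas = np.sort(np.array(sigmas))                # ascending
    cum = np.cumsum(sigmas ** 2)                      # energy of a discarded tail
    budget = (tol * np.linalg.norm(L, 'fro')) ** 2
    idx = np.searchsorted(cum, budget)                # how many values we may drop
    cut = sigmas[idx - 1] if idx > 0 else -1.0
    M = np.zeros_like(L)
    for (i, j), (U, s, Vt) in blocks.items():
        k = int(np.sum(s > cut))                      # block rank after the cutoff
        M[np.ix_(rows[i], cols[j])] = (U[:, :k] * s[:k]) @ Vt[:k]
    return M

x = np.linspace(0.0, 1.0, 200)
L = 1.0 / (1.0 + np.abs(x[:, None] - x[None, :]))     # smooth kernel stand-in
M = blockwise_truncate(L, 4)
err = np.linalg.norm(L - M, 'fro') / np.linalg.norm(L, 'fro')
print(err)    # at or below the 1e-14 target, up to roundoff
```

Pooling the singular values of all blocks before truncating is what lets one global tolerance control the total Frobenius error.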
255 × 255 diagonal block factor LS,S
43% dense, relative accuracy 10^{-14}
[Spy plot of L_{S,S}]
754 × 255 lower block factor L∂S,S
16% dense, relative accuracy 10^{-14}
[Spy plot of L_{∂S,S}]
754 × 754 update matrix A∂S,∂S
21% dense, relative accuracy 10^{-14}
[Spy plot of A_{∂S,∂S}]
255 × 255 factor matrix LS,S
storage vs accuracy
[Plot: log10(1 − ‖M‖_F/‖A‖_F) vs fraction of dense storage, approximating the 255 × 255 separator factor matrix L_{3,3}; curves for 2 × 2 through 8 × 8 blocks]
754 × 255 factor matrix L∂S,S
storage vs accuracy
[Plot: log10(1 − ‖M‖_F/‖A‖_F) vs fraction of dense storage, approximating the 754 × 255 interface-separator factor matrix L_{4,3}; curves for 3 × 1, 6 × 2, 9 × 3 and 12 × 4 blocks]
754 × 754 update matrix A∂S,∂S
storage vs accuracy
[Plot: log10(1 − ‖M‖_F/‖A‖_F) vs fraction of dense storage, approximating the 754 × 754 Schur complement matrix Â_{4,4}; curves for 2 × 2 through 8 × 8 blocks]
Outline
• graph, tree, matrix perspectives
• experiments – 2-D potential equation
• low rank computations
• blocking strategies
• summary
How to compute low rank submatrices ?
• SVD – singular value decomposition
  A = U Σ V^T
  where U and V are orthogonal and Σ is diagonal
• Gold standard, expensive, O(n^3) ops
• QR factorization
  A P = Q R
  where Q is orthogonal, R is upper triangular, and P is a permutation matrix
• Silver standard, moderate cost, O(rn^2) ops
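The comparison can be sketched as follows; the column-pivoted QR below is a plain modified Gram-Schmidt implementation, and the smooth-kernel test matrix is a hypothetical illustration:

```python
import numpy as np

def qr_col_pivot(A):
    """Modified Gram-Schmidt QR with column pivoting: A[:, piv] = Q @ R.
    The row norms of R then track the singular values of A."""
    A = A.astype(float).copy()
    m, n = A.shape
    Q, R = np.zeros((m, n)), np.zeros((n, n))
    piv = np.arange(n)
    for j in range(n):
        # pivot: bring the remaining column of largest norm to position j
        p = j + int(np.argmax(np.sum(A[:, j:] ** 2, axis=0)))
        A[:, [j, p]] = A[:, [p, j]]
        R[:j, [j, p]] = R[:j, [p, j]]
        piv[[j, p]] = piv[[p, j]]
        R[j, j] = np.linalg.norm(A[:, j])
        if R[j, j] == 0.0:
            break
        Q[:, j] = A[:, j] / R[j, j]
        R[j, j + 1:] = Q[:, j] @ A[:, j + 1:]
        A[:, j + 1:] -= np.outer(Q[:, j], R[j, j + 1:])
    return Q, R, piv

# hypothetical smooth-kernel test matrix, numerically rank deficient
x = np.linspace(0.0, 1.0, 100)
A = np.exp(-10.0 * (x[:, None] - x[None, :]) ** 2)
s = np.linalg.svd(A, compute_uv=False)
Q, R, piv = qr_col_pivot(A)
row_norms = np.linalg.norm(R, axis=1)

tol = 1e-12 * s[0]
rank_svd = int(np.sum(s > tol))
rank_qr = int(np.sum(row_norms > tol))
print(rank_svd, rank_qr)    # the two rank estimates agree closely
```

The cheaper pivoted QR detects a numerical rank close to the SVD's, which is the trade the slide calls "silver standard".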
row norms of R vs singular values of A
754 × 754 update matrix A∂S,∂S
94 × 94 submatrix size; near, mid and far interactions
[Plot: row norms of R versus singular values of the 754 × 754 A using 8 segments; curves for near ‖R_{i,:}‖_2, near σ_i, mid ‖R_{i,:}‖_2, mid σ_i, far ‖R_{i,:}‖_2, far σ_i]
SVD vs QR with column pivoting — Conclusions :
• Column pivoting QR factorization does well.
• Row norms of R track the singular values σ
• The numerical rank of R is greater than needed, but not that much greater
• For more accuracy / less storage, two-sided orthogonal factorizations
  – A P = U L V^T, with U and V orthogonal and L triangular
  – P A Q = U B V^T, with U and V orthogonal and B bidiagonal
  track the singular values very closely.
Type of approximation of submatrices
• submatrix L_{I,J} of L_{∂S,S}, ‖L_{I,J}‖_F = 2.92 × 10^{-2}

  factorization   numerical rank   total entries
  none            —                4900
  QR              20               2681
  ULV             18               2680
  SVD             18               2538

• submatrix L_{I,J} of L_{∂S,S}, ‖L_{I,J}‖_F = 2.04 × 10^{-3}

  factorization   numerical rank   total entries
  none            —                4900
  QR              14               1896
  ULV             10               1447
  SVD             10               1410
Operations with low rank submatrices
A = U_A V_A^T,  B = U_B V_B^T,  C = U_C V_C^T
See Bebendorf, "Hierarchical Matrices", Chapter 1
• Multiplication A = B C, rank(A) ≤ min(rank(B), rank(C))
  A = U_A V_A^T = (U_B V_B^T)(U_C V_C^T) = B C
    = U_B (V_B^T U_C) V_C^T
    = U_B ((V_B^T U_C) V_C^T)
    = (U_B (V_B^T U_C)) V_C^T
• Addition A = B + C, rank(A) ≤ rank(B) + rank(C)
  A = U_A V_A^T = (U_B V_B^T) + (U_C V_C^T) = B + C
    = [U_B  U_C] [V_B  V_C]^T
Multiplication A = B C
A = U_A V_A^T,  B = U_B V_B^T,  C = U_C V_C^T
A = B C = ((m × h)(h × l)) ((l × k)(k × n))
        = (m × min(h, k)) (min(h, k) × n)
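The product of two low rank matrices can be formed without ever building B or C densely; a numpy sketch with hypothetical sizes:

```python
import numpy as np

def lr_multiply(UB, VB, UC, VC):
    """Product of low-rank matrices B = UB @ VB.T (rank h) and
    C = UC @ VC.T (rank k), never formed densely.  Contract through
    the small h x k core W = VB.T @ UC and absorb it on the side
    that keeps the result's rank at min(h, k)."""
    h, k = UB.shape[1], UC.shape[1]
    W = VB.T @ UC                  # h x k, the only coupling term
    if h <= k:
        return UB, (W @ VC.T).T    # A = UB @ (W @ VC.T), rank h
    else:
        return UB @ W, VC          # A = (UB @ W) @ VC.T, rank k

rng = np.random.default_rng(1)
m, l, n, h, k = 60, 50, 40, 7, 5
UB, VB = rng.standard_normal((m, h)), rng.standard_normal((l, h))
UC, VC = rng.standard_normal((l, k)), rng.standard_normal((n, k))
UA, VA = lr_multiply(UB, VB, UC, VC)
dense = (UB @ VB.T) @ (UC @ VC.T)
print(UA.shape[1])                    # min(h, k) = 5
print(np.allclose(UA @ VA.T, dense))  # same product as the dense route
```

The cost is dominated by the small core W, which is why low rank products are so much cheaper than dense ones.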
Near-near matrix product
[Plot: singular values for the near-near interaction L_{2,1} L_{2,1}^T; curves σ(L_{21}) and σ(L_{21} L_{21}^T)]
mid-near matrix product
[Plot: singular values for the mid-near interaction L_{2,1} L_{1,1}^T; curves σ(L_{21}), σ(L_{11}) and σ(L_{21} L_{11}^T)]
far-near matrix product
[Plot: singular values for the far-near interaction L_{5,1} L_{6,1}^T; curves σ(L_{51}), σ(L_{61}) and σ(L_{51} L_{61}^T)]
mid-mid matrix product
[Plot: singular values for the mid-mid interaction L_{1,1} L_{1,1}^T; curves σ(L_{11}) and σ(L_{11} L_{11}^T)]
far-mid matrix product
[Plot: singular values for the far-mid interaction L_{6,1} L_{1,1}^T; curves σ(L_{61}) and σ(L_{61} L_{11}^T)]
far-far matrix product
[Plot: singular values for the far-far interaction L_{6,1} L_{6,1}^T; curves σ(L_{61}) and σ(L_{61} L_{61}^T)]
Addition A = B + C
rank(A) ≤ rank(B) + rank(C)
A = U_A V_A^T = (U_B V_B^T) + (U_C V_C^T) = B + C
  = [U_B  U_C] [V_B  V_C]^T
  = (Q_1 R_1)(Q_2 R_2)^T
  = Q_1 (R_1 R_2^T) Q_2^T
  = Q_1 (Q_3 R_3) Q_2^T
  = (Q_1 Q_3)(R_3 Q_2^T)
  = (Q_1 Q_3)(Q_2 R_3^T)^T
• R_1 R_2^T usually has low numerical rank
• examples follow
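The QR-based recompression of the sum can be sketched in numpy; the sizes and the shared-column-space test factors are hypothetical (an SVD of the small core stands in for the slides' QR of R_1 R_2^T, which serves the same rank-revealing purpose):

```python
import numpy as np

def lr_add(UB, VB, UC, VC, tol=1e-12):
    """Sum of low-rank matrices B + C = [UB UC] @ [VB VC].T, recompressed:
    QR both stacked factors, factor the small core R1 @ R2.T, truncate."""
    U = np.hstack([UB, UC])                  # m x (rB + rC)
    V = np.hstack([VB, VC])                  # n x (rB + rC)
    Q1, R1 = np.linalg.qr(U)
    Q2, R2 = np.linalg.qr(V)
    W, s, Zt = np.linalg.svd(R1 @ R2.T)      # small (rB+rC) core
    r = int(np.sum(s > tol * s[0]))          # numerical rank of the sum
    UA = (Q1 @ W[:, :r]) * s[:r]             # m x r
    VA = Q2 @ Zt[:r].T                       # n x r
    return UA, VA

rng = np.random.default_rng(2)
m, n = 80, 60
X = rng.standard_normal((m, 3))
# B and C share the column space X, so rank(B + C) stays 3, not 6
UB, VB = X, rng.standard_normal((n, 3))
UC, VC = X, rng.standard_normal((n, 3))
UA, VA = lr_add(UB, VB, UC, VC)
print(UA.shape[1])                                         # recompressed rank: 3
print(np.allclose(UA @ VA.T, UB @ VB.T + UC @ VC.T))
```

Without recompression the ranks would grow at every extend-add in the tree; truncating the core keeps them near the true numerical rank.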
update matrix, diagonal block
A_{3,3} = L_{3,1} L_{3,1}^T + L_{3,2} L_{3,2}^T + L_{3,3} L_{3,3}^T
[Plot: log10(σ), update matrix self interaction; curves σ(near), σ(sum), σ(mid), σ(far)]
update matrix, mid-distance off-diagonal block
A_{5,3} = L_{5,1} L_{3,1}^T + L_{5,2} L_{3,2}^T + L_{5,3} L_{3,3}^T
[Plot: log10(σ), update matrix mid-range interaction; curves σ(sum), σ(near), σ(mid), σ(far)]
update matrix, far-distance off-diagonal block
A_{7,3} = L_{7,1} L_{3,1}^T + L_{7,2} L_{3,2}^T + L_{7,3} L_{3,3}^T
[Plot: log10(σ), update matrix far-range interaction; curves σ(sum), σ(near), σ(mid), σ(far)]
Outline
• graph, tree, matrix perspectives
• experiments – 2-D potential equation
• low rank computations
• blocking strategies
• summary
Blocking Strategies
• Active data structures
  [ L_{J,J}               ]
  [ L_{∂J,J}   A_{∂J,∂J}  ]
• partition J and ∂J independently
• for 2-d problems, J and ∂J are 1-d manifolds
• for 3-d problems, J and ∂J are 2-d manifolds
• we need mesh partitioning of the separator and the boundary of a region J
• each index set of a partition is a segment
Segment partition
[Figure: segment wirebasket domain decomposition on the unit square]
• [Spy plot]  L_{J,J} = Σ_{σ,τ} L_{σ,τ},  σ × τ ⊆ J × J
• [Spy plot]  L_{∂J,J} = Σ_{σ,τ} L_{σ,τ},  σ × τ ⊆ ∂J × J
• [Spy plot]  A_{∂J,∂J} = Σ_{σ,τ} A_{σ,τ},  σ × τ ⊆ ∂J × ∂J
Strategy I
• The partition of ∂J (local to node J) conforms to the partitions of ancestors K with K ∩ ∂J ≠ ∅
• Advantages :
– Update assembly is simplified:
  A^{(p(J))}_{σ2,τ2} = A^{(p(J))}_{σ2,τ2} + A^{(J)}_{σ1,τ1},  where σ1 ⊆ σ2, τ1 ⊆ τ2
– One destination for each A^{(J)}_{σ1,τ1}
• Disadvantages :
– Partition of ∂J can be fragmented: more segments, smaller size, less efficient storage
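Strategy I's one-destination assembly can be sketched as a single scatter-add; the index sets and the `assemble` helper below are hypothetical illustrations:

```python
import numpy as np

def assemble(parent_block, parent_rows, parent_cols,
             child_block, child_rows, child_cols):
    """Add the child update A^(J)_{sigma1,tau1} into the parent front
    A^(p(J))_{sigma2,tau2}; requires sigma1 within sigma2, tau1 within tau2,
    and both global index lists sorted."""
    ri = np.searchsorted(parent_rows, child_rows)   # positions of sigma1 in sigma2
    ci = np.searchsorted(parent_cols, child_cols)   # positions of tau1 in tau2
    parent_block[np.ix_(ri, ci)] += child_block

parent_rows = np.array([2, 5, 7, 9, 12])    # sigma2 (global indices, sorted)
parent_cols = np.array([1, 3, 4, 8])        # tau2
P = np.zeros((5, 4))
child_rows = np.array([5, 9])               # sigma1, a subset of sigma2
child_cols = np.array([3, 8])               # tau1, a subset of tau2
assemble(P, parent_rows, parent_cols, np.ones((2, 2)), child_rows, child_cols)
print(P)    # only the (5,3), (5,8), (9,3), (9,8) entries received the update
```

Because the child's segments nest inside one parent segment pair, the whole child block lands in exactly one destination, which is the simplification the slide describes.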
Strategy II
• The partition of ∂J (local to node J) need not conform to the partitions of ancestors
• Advantages :
– Partition of ∂J can be optimized better, since ∂J is small and localized
• Disadvantages :
– Update assembly is more complex:
  A^{(p(J))}_{σ1∩σ2,τ1∩τ2} = A^{(p(J))}_{σ1∩σ2,τ1∩τ2} + A^{(J)}_{σ1∩σ2,τ1∩τ2},  where σ1 ∩ σ2 ≠ ∅, τ1 ∩ τ2 ≠ ∅
– Several destinations for a submatrix A^{(J)}_{σ1,τ1}
Outline
• graph, tree, matrix perspectives
• experiments – 2-D potential equation
• low rank computations
• blocking strategies
• summary
Multifrontal Factorization
[Figure: multifrontal tree in polar representation, nodes 0-71]
• each leaf node is a d-dimensional sparse FEM matrix
• each interior node is a (d − 1)-dimensional dense BEM matrix
• use low rank storage and computation at each interior node
Call to action
• progress to date in low rank factorizations driven by iterative methods
• work needed from the direct methods community
• start from an industrial strength multifrontal code
– serial, SMP, MPP, hybrid, GPU
– pivoting for stability, out-of-core, singular systems, null spaces
Call to action
• Many challenges
– partition of space, partition of interface
– added programming complexity of low rank matrices
– challenges to pivot for stability
– challenges/opportunities to implement in parallel
• Payoff will be huge
– reduction in storage footprint
– reduction in computational work
– take direct methods to the next level of problem size