Sparse factorization using low rank submatrices
Cleve Ashcraft, LSTC
2010 MUMPS User Group Meeting, April 15-16, 2010
Toulouse, FRANCE
ftp.lstc.com:outgoing/cleve/MUMPS10 Ashcraft.pdf
LSTC : Livermore Software Technology Corporation
Livermore, California 94550
• Founded by John Hallquist of LLNL in the 1980s
• Public domain versions of the DYNA and NIKE codes
• LS-DYNA : implicit/explicit, nonlinear finite element analysis code
• Multiphysics capabilities
– Fluid/structure interaction
– Thermal analysis
– Acoustics
– Electromagnetics
Multifrontal Algorithm
• very large, sparse LDL^T and LDU factorizations
• tree structure organizes factor storage, solve and factor operations
• medium-to-large sparse linear systems located at each leaf node of the tree
• medium-sized dense linear systems located at each interior node of the tree
• dense matrix-matrix operations at each interior node
• sparse matrix-matrix adds between nodes
Multifrontal tree
[Figure: multifrontal tree with nodes 0-71]
Multifrontal tree – polar representation
[Figure: multifrontal tree in polar representation, nodes 0-71]
Multifrontal Algorithm
• 40M equations, present frontier at LSTC
• serial, SMP, MPP, hybrid, GPU
• large problems, > 1M - 10M dof, require out-of-core storage of factor entries, even on distributed memory systems
• IO cost largely hidden during the factorization
• IO cost dominant during the solves
• eigensolver ⟹ several right hand sides
• many applications, e.g., Newton's method, have a single right hand side
Low Rank Approximations
• hierarchical matrices, Hackbusch, Bebendorf, Leborne, others
• semi-separable matrices, Gu, others
• submatrices are numerically rank deficient
• method of choice for Boundary Elements (BEM)
• now applied to Finite Elements (FEM)
• A ≈ X Y^T, where A is m × n, X is m × r and Y is n × r
• storage = r(m + n) vs mn
• reduction in ops = r / min(m, n)
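A minimal numpy sketch of such a low rank approximation via truncated SVD; the kernel matrix, sizes and rank here are hypothetical illustrations, not the slides' test problem:

```python
import numpy as np

def low_rank_approx(A, r):
    """Best rank-r approximation A ~= X @ Y.T via truncated SVD."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    X = U[:, :r] * s[:r]    # m x r, singular values folded into X
    Y = Vt[:r, :].T         # n x r
    return X, Y

# a smooth kernel gives a numerically rank-deficient matrix
m, n, r = 120, 80, 15
x = np.linspace(1.0, 2.0, m)[:, None]
y = np.linspace(3.0, 4.0, n)[None, :]
A = 1.0 / (x + y)           # Cauchy-like matrix, rapidly decaying spectrum
X, Y = low_rank_approx(A, r)
err = np.linalg.norm(A - X @ Y.T, 'fro') / np.linalg.norm(A, 'fro')
print(err)                  # tiny: rank 15 captures the matrix
print(r * (m + n), m * n)   # storage: r(m + n) = 3000 vs mn = 9600
```

The storage and operation savings grow as r shrinks relative to min(m, n).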
Multifrontal Algorithm + Low Rank Approximations
• At each leaf node in the multifrontal tree — use standard multifrontal
• At each interior node in the multifrontal tree — low rank matrix-matrix multiplies and sums
• Between nodes — low rank matrix sums
• Dramatic reduction in storage
• Dramatic reduction in operations
• Excellent approximation properties for finite element operators
• Our experience is with potential equations, elasticity with solids and shells
Outline
• graph, tree, matrix perspectives
• experiments – 2-D potential equation
• low rank computations
• blocking strategies
• summary
One subgraph
[Figure: subgraph on the unit square]
One subtree
[Diagram: subdomains Ω_I and Ω_J, separator S, external boundary ∂S]
One submatrix
[ A_{ΩI,ΩI}              A_{ΩI,S}   A_{ΩI,∂S} ]
[            A_{ΩJ,ΩJ}   A_{ΩJ,S}   A_{ΩJ,∂S} ]
[ A_{S,ΩI}   A_{S,ΩJ}    A_{S,S}    A_{S,∂S}  ]
[ A_{∂S,ΩI}  A_{∂S,ΩJ}   A_{∂S,S}   A_{∂S,∂S} ]
Portion of original matrix
[Spy plot, nz = 9052]
A ordered by domains, separator and external boundary
Portion of factor matrix
[Spy plot, nz = 48899]
L ordered by domains, separator and external boundary
Schur complement matrix
[ A_{S,S}   A_{S,∂S}  ]
[ A_{∂S,S}  A_{∂S,∂S} ] =

[Spy plot, nz = 20659]
Schur complement, separator and external boundary
Outline
• graph, tree, matrix perspectives
• experiments – 2-D potential equation
• low rank computations
• blocking strategies
• summary
Computational experiments
• Compute L_{S,S}, L_{∂S,S} and A_{∂S,∂S}
• Find domain decompositions of S and ∂S
• Form block matrices, e.g., L_{S,S} = Σ_{K≥J} L_{K,J}
• Find the singular value decomposition of each L_{K,J}
• Collect all singular values: σ = Σ_{K≥J} Σ_{i=1}^{min(|K|,|J|)} σ_i^{(K,J)}
• Split the matrix L_{S,S} = M_{S,S} + N_{S,S} using the singular values σ
• We want ‖N_{S,S}‖_F ≤ 10^{-14} ‖L_{S,S}‖_F
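One way to realize this split can be sketched in numpy; the uniform blocking and the smooth test kernel here are hypothetical stand-ins for the slides' domain-decomposition partition:

```python
import numpy as np

def blockwise_truncate(L, p, tol=1e-14):
    """Split L = M + N with ||N||_F <= tol * ||L||_F by truncating
    block SVDs against one global singular-value cutoff."""
    rows = np.array_split(np.arange(L.shape[0]), p)
    cols = np.array_split(np.arange(L.shape[1]), p)
    blocks, sigmas = {}, []
    for i, I in enumerate(rows):
        for j, J in enumerate(cols):
            U, s, Vt = np.linalg.svd(L[np.ix_(I, J)], full_matrices=False)
            blocks[(i, j)] = (U, s, Vt)
            sigmas.extend(s)
    sigmas = np.sort(np.array(sigmas))                # ascending
    cum = np.cumsum(sigmas ** 2)                      # energy of a discarded tail
    budget = (tol * np.linalg.norm(L, 'fro')) ** 2
    idx = np.searchsorted(cum, budget)                # how many values we may drop
    cut = sigmas[idx - 1] if idx > 0 else -1.0
    M = np.zeros_like(L)
    for (i, j), (U, s, Vt) in blocks.items():
        k = int(np.sum(s > cut))                      # block rank after the cutoff
        M[np.ix_(rows[i], cols[j])] = (U[:, :k] * s[:k]) @ Vt[:k]
    return M

x = np.linspace(0.0, 1.0, 200)
L = 1.0 / (1.0 + np.abs(x[:, None] - x[None, :]))     # smooth kernel stand-in
M = blockwise_truncate(L, 4)
err = np.linalg.norm(L - M, 'fro') / np.linalg.norm(L, 'fro')
print(err)    # at or below the 1e-14 target, up to roundoff
```

Pooling the singular values of all blocks before truncating is what lets one global tolerance control the total Frobenius error.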
255 × 255 diagonal block factor LS,S
43% dense, relative accuracy 10^{-14}
[Spy plot of L_{S,S}]
754 × 255 lower block factor L∂S,S
16% dense, relative accuracy 10^{-14}
[Spy plot of L_{∂S,S}]
754 × 754 update matrix A∂S,∂S
21% dense, relative accuracy 10^{-14}
[Spy plot of A_{∂S,∂S}]
255 × 255 factor matrix LS,S
storage vs accuracy
[Plot: log10(1 − ‖M‖_F/‖A‖_F) vs fraction of dense storage, approximating the 255 × 255 separator factor matrix L_{3,3}; curves for 2 × 2 through 8 × 8 blocks]
754 × 255 factor matrix L∂S,S
storage vs accuracy
[Plot: log10(1 − ‖M‖_F/‖A‖_F) vs fraction of dense storage, approximating the 754 × 255 interface-separator factor matrix L_{4,3}; curves for 3 × 1, 6 × 2, 9 × 3 and 12 × 4 blocks]
754 × 754 update matrix A∂S,∂S
storage vs accuracy
[Plot: log10(1 − ‖M‖_F/‖A‖_F) vs fraction of dense storage, approximating the 754 × 754 Schur complement matrix Â_{4,4}; curves for 2 × 2 through 8 × 8 blocks]
Outline
• graph, tree, matrix perspectives
• experiments – 2-D potential equation
• low rank computations
• blocking strategies
• summary
How to compute low rank submatrices ?
• SVD – singular value decomposition
  A = U Σ V^T
  where U and V are orthogonal and Σ is diagonal
• Gold standard, expensive, O(n^3) ops
• QR factorization
  A P = Q R
  where Q is orthogonal, R is upper triangular, and P is a permutation matrix
• Silver standard, moderate cost, O(rn^2) ops
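The comparison can be sketched as follows; the column-pivoted QR below is a plain modified Gram-Schmidt implementation, and the smooth-kernel test matrix is a hypothetical illustration:

```python
import numpy as np

def qr_col_pivot(A):
    """Modified Gram-Schmidt QR with column pivoting: A[:, piv] = Q @ R.
    The row norms of R then track the singular values of A."""
    A = A.astype(float).copy()
    m, n = A.shape
    Q, R = np.zeros((m, n)), np.zeros((n, n))
    piv = np.arange(n)
    for j in range(n):
        # pivot: bring the remaining column of largest norm to position j
        p = j + int(np.argmax(np.sum(A[:, j:] ** 2, axis=0)))
        A[:, [j, p]] = A[:, [p, j]]
        R[:j, [j, p]] = R[:j, [p, j]]
        piv[[j, p]] = piv[[p, j]]
        R[j, j] = np.linalg.norm(A[:, j])
        if R[j, j] == 0.0:
            break
        Q[:, j] = A[:, j] / R[j, j]
        R[j, j + 1:] = Q[:, j] @ A[:, j + 1:]
        A[:, j + 1:] -= np.outer(Q[:, j], R[j, j + 1:])
    return Q, R, piv

# hypothetical smooth-kernel test matrix, numerically rank deficient
x = np.linspace(0.0, 1.0, 100)
A = np.exp(-10.0 * (x[:, None] - x[None, :]) ** 2)
s = np.linalg.svd(A, compute_uv=False)
Q, R, piv = qr_col_pivot(A)
row_norms = np.linalg.norm(R, axis=1)

tol = 1e-12 * s[0]
rank_svd = int(np.sum(s > tol))
rank_qr = int(np.sum(row_norms > tol))
print(rank_svd, rank_qr)    # the two rank estimates agree closely
```

The cheaper pivoted QR detects a numerical rank close to the SVD's, which is the trade the slide calls "silver standard".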
row norms of R vs singular values of A
754 × 754 update matrix A∂S,∂S
94 × 94 submatrix size; near, mid and far interactions
[Plot: row norms of R versus singular values of the 754 × 754 A using 8 segments; curves for near ‖R_{i,:}‖_2, near σ_i, mid ‖R_{i,:}‖_2, mid σ_i, far ‖R_{i,:}‖_2, far σ_i]
SVD vs QR with column pivoting — Conclusions :
• Column pivoting QR factorization does well.
• Row norms of R track the singular values σ
• The numerical rank of R is greater than needed, but not that much greater
• For more accuracy / less storage, two-sided orthogonal factorizations
  – A P = U L V^T, with U and V orthogonal and L triangular
  – P A Q = U B V^T, with U and V orthogonal and B bidiagonal
  track the singular values very closely.
Type of approximation of submatrices
• submatrix L_{I,J} of L_{∂S,S}, ‖L_{I,J}‖_F = 2.92 × 10^{-2}

  factorization   numerical rank   total entries
  none            —                4900
  QR              20               2681
  ULV             18               2680
  SVD             18               2538

• submatrix L_{I,J} of L_{∂S,S}, ‖L_{I,J}‖_F = 2.04 × 10^{-3}

  factorization   numerical rank   total entries
  none            —                4900
  QR              14               1896
  ULV             10               1447
  SVD             10               1410
Operations with low rank submatrices
A = U_A V_A^T,  B = U_B V_B^T,  C = U_C V_C^T
See Bebendorf, "Hierarchical Matrices", Chapter 1
• Multiplication A = B C, rank(A) ≤ min(rank(B), rank(C))
  A = U_A V_A^T = (U_B V_B^T)(U_C V_C^T) = B C
    = U_B (V_B^T U_C) V_C^T
    = U_B ((V_B^T U_C) V_C^T)
    = (U_B (V_B^T U_C)) V_C^T
• Addition A = B + C, rank(A) ≤ rank(B) + rank(C)
  A = U_A V_A^T = (U_B V_B^T) + (U_C V_C^T) = B + C
    = [U_B  U_C] [V_B  V_C]^T
Multiplication A = B C
A = U_A V_A^T,  B = U_B V_B^T,  C = U_C V_C^T
A = B C = ((m × h)(h × l)) ((l × k)(k × n))
        = (m × min(h, k)) (min(h, k) × n)
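The product of two low rank matrices can be formed without ever building B or C densely; a numpy sketch with hypothetical sizes:

```python
import numpy as np

def lr_multiply(UB, VB, UC, VC):
    """Product of low-rank matrices B = UB @ VB.T (rank h) and
    C = UC @ VC.T (rank k), never formed densely.  Contract through
    the small h x k core W = VB.T @ UC and absorb it on the side
    that keeps the result's rank at min(h, k)."""
    h, k = UB.shape[1], UC.shape[1]
    W = VB.T @ UC                  # h x k, the only coupling term
    if h <= k:
        return UB, (W @ VC.T).T    # A = UB @ (W @ VC.T), rank h
    else:
        return UB @ W, VC          # A = (UB @ W) @ VC.T, rank k

rng = np.random.default_rng(1)
m, l, n, h, k = 60, 50, 40, 7, 5
UB, VB = rng.standard_normal((m, h)), rng.standard_normal((l, h))
UC, VC = rng.standard_normal((l, k)), rng.standard_normal((n, k))
UA, VA = lr_multiply(UB, VB, UC, VC)
dense = (UB @ VB.T) @ (UC @ VC.T)
print(UA.shape[1])                    # min(h, k) = 5
print(np.allclose(UA @ VA.T, dense))  # same product as the dense route
```

The cost is dominated by the small core W, which is why low rank products are so much cheaper than dense ones.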
Near-near matrix product
[Plot: singular values for the near-near interaction L_{2,1} L_{2,1}^T; curves σ(L_{21}) and σ(L_{21} L_{21}^T)]
mid-near matrix product
[Plot: singular values for the mid-near interaction L_{2,1} L_{1,1}^T; curves σ(L_{21}), σ(L_{11}) and σ(L_{21} L_{11}^T)]
far-near matrix product
[Plot: singular values for the far-near interaction L_{5,1} L_{6,1}^T; curves σ(L_{51}), σ(L_{61}) and σ(L_{51} L_{61}^T)]
mid-mid matrix product
[Plot: singular values for the mid-mid interaction L_{1,1} L_{1,1}^T; curves σ(L_{11}) and σ(L_{11} L_{11}^T)]
far-mid matrix product
[Plot: singular values for the far-mid interaction L_{6,1} L_{1,1}^T; curves σ(L_{61}) and σ(L_{61} L_{11}^T)]
far-far matrix product
[Plot: singular values for the far-far interaction L_{6,1} L_{6,1}^T; curves σ(L_{61}) and σ(L_{61} L_{61}^T)]
Addition A = B + C
rank(A) ≤ rank(B) + rank(C)
A = U_A V_A^T = (U_B V_B^T) + (U_C V_C^T) = B + C
  = [U_B  U_C] [V_B  V_C]^T
  = (Q_1 R_1)(Q_2 R_2)^T
  = Q_1 (R_1 R_2^T) Q_2^T
  = Q_1 (Q_3 R_3) Q_2^T
  = (Q_1 Q_3)(R_3 Q_2^T)
  = (Q_1 Q_3)(Q_2 R_3^T)^T
• R_1 R_2^T usually has low numerical rank
• examples follow
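The QR-based recompression of the sum can be sketched in numpy; the sizes and the shared-column-space test factors are hypothetical (an SVD of the small core stands in for the slides' QR of R_1 R_2^T, which serves the same rank-revealing purpose):

```python
import numpy as np

def lr_add(UB, VB, UC, VC, tol=1e-12):
    """Sum of low-rank matrices B + C = [UB UC] @ [VB VC].T, recompressed:
    QR both stacked factors, factor the small core R1 @ R2.T, truncate."""
    U = np.hstack([UB, UC])                  # m x (rB + rC)
    V = np.hstack([VB, VC])                  # n x (rB + rC)
    Q1, R1 = np.linalg.qr(U)
    Q2, R2 = np.linalg.qr(V)
    W, s, Zt = np.linalg.svd(R1 @ R2.T)      # small (rB+rC) core
    r = int(np.sum(s > tol * s[0]))          # numerical rank of the sum
    UA = (Q1 @ W[:, :r]) * s[:r]             # m x r
    VA = Q2 @ Zt[:r].T                       # n x r
    return UA, VA

rng = np.random.default_rng(2)
m, n = 80, 60
X = rng.standard_normal((m, 3))
# B and C share the column space X, so rank(B + C) stays 3, not 6
UB, VB = X, rng.standard_normal((n, 3))
UC, VC = X, rng.standard_normal((n, 3))
UA, VA = lr_add(UB, VB, UC, VC)
print(UA.shape[1])                                         # recompressed rank: 3
print(np.allclose(UA @ VA.T, UB @ VB.T + UC @ VC.T))
```

Without recompression the ranks would grow at every extend-add in the tree; truncating the core keeps them near the true numerical rank.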
update matrix, diagonal block
A_{3,3} = L_{3,1} L_{3,1}^T + L_{3,2} L_{3,2}^T + L_{3,3} L_{3,3}^T
[Plot: log10(σ), update matrix self interaction; curves σ(near), σ(sum), σ(mid), σ(far)]
update matrix, mid-distance off-diagonal block
A_{5,3} = L_{5,1} L_{3,1}^T + L_{5,2} L_{3,2}^T + L_{5,3} L_{3,3}^T
[Plot: log10(σ), update matrix mid-range interaction; curves σ(sum), σ(near), σ(mid), σ(far)]
update matrix, far-distance off-diagonal block
A_{7,3} = L_{7,1} L_{3,1}^T + L_{7,2} L_{3,2}^T + L_{7,3} L_{3,3}^T
[Plot: log10(σ), update matrix far-range interaction; curves σ(sum), σ(near), σ(mid), σ(far)]
Outline
• graph, tree, matrix perspectives
• experiments – 2-D potential equation
• low rank computations
• blocking strategies
• summary
Blocking Strategies
• Active data structures
  [ L_{J,J}               ]
  [ L_{∂J,J}   A_{∂J,∂J}  ]
• partition J and ∂J independently
• for 2-d problems, J and ∂J are 1-d manifolds
• for 3-d problems, J and ∂J are 2-d manifolds
• we need mesh partitioning of the separator and the boundary of a region J
• each index set of a partition is a segment
Segment partition
[Figure: segment wirebasket domain decomposition on the unit square]
• [Spy plot]  L_{J,J} = Σ_{σ,τ} L_{σ,τ},  σ × τ ⊆ J × J
• [Spy plot]  L_{∂J,J} = Σ_{σ,τ} L_{σ,τ},  σ × τ ⊆ ∂J × J
• [Spy plot]  A_{∂J,∂J} = Σ_{σ,τ} A_{σ,τ},  σ × τ ⊆ ∂J × ∂J
Strategy I
• The partition of ∂J (local to node J) conforms to the partitions of ancestors K with K ∩ ∂J ≠ ∅
• Advantages :
– Update assembly is simplified:
  A^{(p(J))}_{σ2,τ2} = A^{(p(J))}_{σ2,τ2} + A^{(J)}_{σ1,τ1},  where σ1 ⊆ σ2, τ1 ⊆ τ2
– One destination for each A^{(J)}_{σ1,τ1}
• Disadvantages :
– Partition of ∂J can be fragmented: more segments, smaller size, less efficient storage
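Strategy I's one-destination assembly can be sketched as a single scatter-add; the index sets and the `assemble` helper below are hypothetical illustrations:

```python
import numpy as np

def assemble(parent_block, parent_rows, parent_cols,
             child_block, child_rows, child_cols):
    """Add the child update A^(J)_{sigma1,tau1} into the parent front
    A^(p(J))_{sigma2,tau2}; requires sigma1 within sigma2, tau1 within tau2,
    and both global index lists sorted."""
    ri = np.searchsorted(parent_rows, child_rows)   # positions of sigma1 in sigma2
    ci = np.searchsorted(parent_cols, child_cols)   # positions of tau1 in tau2
    parent_block[np.ix_(ri, ci)] += child_block

parent_rows = np.array([2, 5, 7, 9, 12])    # sigma2 (global indices, sorted)
parent_cols = np.array([1, 3, 4, 8])        # tau2
P = np.zeros((5, 4))
child_rows = np.array([5, 9])               # sigma1, a subset of sigma2
child_cols = np.array([3, 8])               # tau1, a subset of tau2
assemble(P, parent_rows, parent_cols, np.ones((2, 2)), child_rows, child_cols)
print(P)    # only the (5,3), (5,8), (9,3), (9,8) entries received the update
```

Because the child's segments nest inside one parent segment pair, the whole child block lands in exactly one destination, which is the simplification the slide describes.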
Strategy II
• The partition of ∂J (local to node J) need not conform to the partitions of ancestors
• Advantages :
– Partition of ∂J can be optimized better, since ∂J is small and localized
• Disadvantages :
– Update assembly is more complex:
  A^{(p(J))}_{σ1∩σ2,τ1∩τ2} = A^{(p(J))}_{σ1∩σ2,τ1∩τ2} + A^{(J)}_{σ1∩σ2,τ1∩τ2},  where σ1 ∩ σ2 ≠ ∅, τ1 ∩ τ2 ≠ ∅
– Several destinations for a submatrix A^{(J)}_{σ1,τ1}
Outline
• graph, tree, matrix perspectives
• experiments – 2-D potential equation
• low rank computations
• blocking strategies
• summary
Multifrontal Factorization
[Figure: multifrontal tree in polar representation, nodes 0-71]
• each leaf node is a d-dimensional sparse FEM matrix
• each interior node is a (d − 1)-dimensional dense BEM matrix
• use low rank storage and computation at each interior node
Call to action
• progress to date in low rank factorizations driven by iterative methods
• work needed from the direct methods community
• start from an industrial strength multifrontal code
– serial, SMP, MPP, hybrid, GPU
– pivoting for stability, out-of-core, singular systems, null spaces
Call to action
• Many challenges
– partition of space, partition of interface
– added programming complexity of low rank matrices
– challenges to pivot for stability
– challenges/opportunities to implement in parallel
• Payoff will be huge
– reduction in storage footprint
– reduction in computational work
– take direct methods to the next level of problem size