Memory-Aware Scheduling for Sparse Direct Methods

Emmanuel AGULLO, ICL - University of Tennessee
Jean-Yves L'EXCELLENT, LIP - INRIA
Abdou GUERMOUCHE, LaBRI, Universite de Bordeaux

MS31 Parallel Sparse Matrix Computations and Enabling Algorithms
SIAM CSE 2009, Miami, FL, March 2-6, 2009
AGULLO - GUERMOUCHE - L’EXCELLENT Memory-Aware Scheduling for Sparse Direct Methods 1
Context

Solving sparse linear systems
Ax = b ⇒ direct methods: A = LU

Typical matrix: BRGM matrix
- 3.7 × 10^6 variables
- 156 × 10^6 nonzeros in A
- 4.5 × 10^9 nonzeros in LU
- 26.5 × 10^12 flops

Hardware paradigm
- Many-core architecture.
- Large global amount of memory.
- Limited memory per core.

Software challenge
→ Need for algorithms whose memory usage scales with the number of processors.
- Case study: MUMPS
Outline

1. MUMPS
2. Limits to memory scalability
3. A new memory-aware algorithm
4. Preliminary results
5. Conclusion
MUMPS

MUMPS: a MUltifrontal Massively Parallel sparse direct Solver

Solution of large sparse linear systems with:
- Symmetric positive definite matrices;
- General symmetric matrices;
- General unsymmetric matrices.

Implementation
- Distributed multifrontal solver (F90, MPI based);
- Dynamic distributed scheduling;
- Use of BLAS, BLACS, ScaLAPACK.

Interfaces
- Fortran, C, Matlab, Scilab, Visual Studio.
MUMPS

The multifrontal method (Duff, Reid '83)

[Figure: 5 × 5 sparsity patterns of A and L + U − I, showing the original
nonzeros and the fill-in introduced by the factorization.]

Storage divided into two parts:
- Factors systematically written to disk;
- Active storage kept in memory: a stack of contribution blocks plus the
  active frontal matrix.

[Figure: elimination tree on nodes 1-5; processing each node produces
factors (written to disk) and a contribution block passed to its parent.]
MUMPS

Memory behaviour (serial postorder traversal)

[Figure: evolution of the active storage while a small elimination tree
(leaves 1 and 2, root 3) is processed in postorder; the stack of
contribution blocks grows and shrinks as nodes are assembled.]
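The stack behaviour sketched on this slide can be simulated in a few lines. The following is an illustrative sketch, not MUMPS code: the tree, frontal-matrix sizes, and contribution-block sizes are made-up inputs, and the assembly model (children's contribution blocks live on the stack until the parent's frontal matrix is allocated) is a simplification.

```python
# Sketch (illustrative, not MUMPS code): peak active storage of a
# serial postorder traversal of an elimination tree.

def peak_active_storage(tree, front, cb, root):
    """tree: node -> list of children; front/cb: node -> size in entries.
    Returns the peak of (frontal matrix + stack of contribution blocks)."""
    peak = 0
    stack = 0  # current size of the stack of contribution blocks

    def visit(node):
        nonlocal peak, stack
        for child in tree.get(node, []):
            visit(child)
        # Assembly: the children's contribution blocks are still on the
        # stack while the frontal matrix of `node` is allocated.
        peak = max(peak, stack + front[node])
        for child in tree.get(node, []):
            stack -= cb[child]
        # Factors go to disk; only the contribution block stays in memory.
        stack += cb[node]
        peak = max(peak, stack)

    visit(root)
    return peak

# Tiny 3-node example as in the slide: leaves 1 and 2, root 3.
tree = {3: [1, 2]}
front = {1: 10, 2: 10, 3: 20}
cb = {1: 4, 2: 4, 3: 0}
print(peak_active_storage(tree, front, cb, 3))
```

The peak here is reached at the root, where both children's contribution blocks (4 + 4) coexist with the root's frontal matrix (20).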
MUMPS

Memory efficiency of MUMPS

Definition: memory efficiency on p processors (or cores)
e(p) = Sseq / (p × Smax(p)), where Sseq is the serial storage and Smax(p)
the peak storage on any single processor.

Results: memory efficiency of MUMPS (with factors on disk)

Number p of processors     16     32     64    128
AUDI_KW_1                0.16   0.12   0.13   0.10
CONESHL_MOD              0.28   0.28   0.22   0.19
CONV3D64                 0.42   0.40   0.41   0.37
QIMONDA07                0.30   0.18   0.11      -
ULTRASOUND80             0.32   0.31   0.30   0.26
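The metric can be computed directly from its definition; a minimal sketch, with illustrative values (not taken from the table):

```python
# Sketch: the memory-efficiency metric from the slide,
# e(p) = Sseq / (p * Smax(p)).

def memory_efficiency(s_seq, s_max, p):
    """s_seq: serial storage; s_max: peak storage on the most loaded of
    the p processors. e(p) = 1 means perfect memory scalability."""
    return s_seq / (p * s_max)

# Example: a factorization needing 100 GB serially, where the most
# loaded of 64 processors peaks at 10 GB.
print(memory_efficiency(100.0, 10.0, 64))  # 100 / 640 = 0.15625
```

An efficiency well below 1, as in the table, means the aggregate memory of the parallel run far exceeds the serial footprint: adding processors does not proportionally reduce the per-processor memory need.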
Limits to memory scalability

Parallel multifrontal scheme

- Type 1: nodes processed on a single processor;
- Type 2: nodes processed with a parallel 1D blocked factorization;
- Type 3: parallel 2D cyclic factorization (root node).

[Figure: parallel traversal over time. Subtrees at the bottom are mapped
statically, one processor each; higher nodes use a 1D pipelined
factorization whose slave processors (e.g. P3 and P0, chosen by P2) are
selected dynamically at runtime; the root node uses a static 2D
decomposition.]
Limits to memory scalability

Limits to memory scalability

[Figure: the parallel traversal of the previous slide, highlighting the
sources of memory overhead.]

- Many simultaneous active tasks;
- Large master tasks;
- Large subtrees;
- Proportional mapping.
Limits to memory scalability

Proportional mapping vs. postorder traversal (1/2)

Proportional mapping:

[Figure: elimination tree with depths d = 0 to 4; the 512 processors at
the root are split into 256 + 256 at depth 1, then 128 per subtree at
depth 2, and so on down the tree.]

Mapping
- Initially: all processors on the root node;
- Recursively split the set of processors over the child subtrees.

Advantages and drawbacks
+ Fine-grain and coarse-grain parallelism;
- Bad memory efficiency.
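The proportional-mapping recursion can be sketched as follows. This is an illustrative toy, not the MUMPS implementation: real implementations split disjoint processor *sets* (here only counts are tracked), and the rounding of shares is simplified.

```python
# Sketch (illustrative): proportional mapping splits the processors of
# a node among its child subtrees in proportion to subtree workload.

def proportional_mapping(tree, work, node, nprocs, out=None):
    """tree: node -> children; work: node -> subtree workload.
    Fills out[node] with the processor count assigned to node."""
    if out is None:
        out = {}
    out[node] = nprocs
    children = tree.get(node, [])
    if children and nprocs > 1:
        total = sum(work[c] for c in children)
        for c in children:
            # Proportional share, at least one processor per child.
            # (Rounding may overlap; a real mapping partitions the set.)
            share = max(1, round(nprocs * work[c] / total))
            proportional_mapping(tree, work, c, min(share, nprocs), out)
    else:
        for c in children:
            proportional_mapping(tree, work, c, nprocs, out)
    return out

# Balanced binary tree with equal workloads on 512 processors,
# as in the slide: 512 -> 256 + 256 -> 128 per subtree.
tree = {0: [1, 2], 1: [3, 4], 2: [5, 6]}
work = {n: 1 for n in range(7)}
print(proportional_mapping(tree, work, 0, 512))
```

Each subtree ends up with few processors but still needs its whole subtree in memory, which is exactly the "bad memory efficiency" drawback noted above.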
Limits to memory scalability

Proportional mapping vs. postorder traversal (2/2)

Postorder traversal:

[Figure: elimination tree with depths d = 0 to 4; all 512 processors
work on every node, one node at a time, in postorder.]

Traversal
- Postorder traversal, node by node;
- All processors on each node.

Advantages and drawbacks
- Only fine-grain parallelism;
+ High memory efficiency.
A new memory-aware algorithm

Memory-aware mapping algorithm

Memory-aware mapping:

[Figure: elimination tree with depths d = 0 to 4; the 512 processors
split into 256 + 256 at depth 1, but deeper subtrees keep all their
processors (and are serialized) where memory does not allow a further
split.]

Mapping
- Initially: all processors on the root node;
- Recursively split the set of processors over the child subtrees if
  memory allows for it.
Advantages
+ Robust: completion is guaranteed (provided each processor has memory
  M0 ≥ Sseq/p).
+ Efficient: available memory provides coarse-grain parallelism.
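Under the same toy model as proportional mapping, the memory-aware variant can be sketched like this. It is illustrative only: `storage` (serial subtree storage) and `m0` (per-processor memory budget) are assumed inputs, and the fit test is a simplification of the real acceptance criterion used in the mapping.

```python
# Sketch (illustrative): memory-aware mapping. Split the processors of
# a node among its child subtrees, as in proportional mapping, ONLY if
# each subtree still fits in the memory of its reduced processor set;
# otherwise keep all processors on every child, which serializes the
# subtrees (postorder-like) but bounds per-processor memory.

def memory_aware_mapping(tree, work, storage, node, nprocs, m0, out=None):
    """storage: node -> serial storage of the subtree rooted at node.
    out[node] = processor count assigned to node."""
    if out is None:
        out = {}
    out[node] = nprocs
    children = tree.get(node, [])
    if not children:
        return out
    total = sum(work[c] for c in children)
    # Tentative proportional shares.
    shares = {c: max(1, round(nprocs * work[c] / total)) for c in children}
    # Accept the split only if every subtree fits its reduced set.
    fits = all(storage[c] <= shares[c] * m0 for c in children)
    for c in children:
        np_child = shares[c] if fits else nprocs
        memory_aware_mapping(tree, work, storage, c, np_child, m0, out)
    return out

tree = {0: [1, 2]}
work = {1: 1, 2: 1}
storage = {0: 600, 1: 300, 2: 300}
print(memory_aware_mapping(tree, work, storage, 0, 4, m0=100))  # no split
print(memory_aware_mapping(tree, work, storage, 0, 4, m0=200))  # split
```

With a tight budget (m0 = 100) each 300-unit subtree cannot fit on 2 processors, so both keep all 4 and run one after the other; with m0 = 200 the proportional split is accepted. This is the trade-off the slide describes: memory permitting, the mapping recovers the coarse-grain parallelism of proportional mapping.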
Preliminary results

Preliminary results

- Excellent memory scalability:
  - memory efficiency close to 1.
- Competitive (time) efficiency:
  - close to proportional mapping (if enough memory);
  - memory provides coarse-grain parallelism.

[Figure: average number of processors per node (normalized) as a function
of the distance to the root node (depth), for proportional mapping and
for memory-aware mappings with M0 = 1/32, 2/32, 5/32 and 8/32 of the
sequential peak.]
Conclusion

Conclusion

Prototype of a memory-aware algorithm
- Maximizes the amount of coarse-grain parallelism with respect to the
  amount of memory available per processor/core.
- New static mapping implemented, with constraints on dynamic
  schedulers; experimented within the out-of-core (OOC) version of
  MUMPS.
- Very good memory scalability obtained.

On-going work
- Further tuning and validation.
- Generalization to the in-core case.
- Reinject dynamic information into the schedulers.