ScicomP 10, Aug 9-13, 2004

Parallel Out-of-Core LU and QR Factorization

Brian Gunter
Center for Space Research
The University of Texas at Austin, Austin, TX
gunter@csr.utexas.edu

Enrique Quintana-Ortí
Depto. de Ingenieria y Ciencia de Computadores
Universidad Jaume I, Castellón, Spain
quintana@icc.uji.es

Robert van de Geijn
Department of Computer Sciences
The University of Texas at Austin, Austin, TX
rvdg@cs.utexas.edu

Thierry Joffrain
Department of Computer Sciences
The University of Texas at Austin, Austin, TX
joffrain@cs.utexas.edu

Motivation

Traditional methods use a slab approach, where entire columns of the out-of-core matrix are brought into memory.

[Figure: an m×n out-of-core matrix; the in-core slab spans entire columns.]

While this is effective for many applications, it is inherently unscalable.

As m grows relative to n (m >> n), fewer columns can fit into memory.
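The scaling argument can be made concrete with a little arithmetic. The sketch below is illustrative only: the function names, the 1 GiB memory budget, the 8-byte element size, and the choice of three tiles held in core at once are all assumptions, not figures from the talk.

```python
import math

def slab_columns(mem_bytes, m, itemsize=8):
    """Number of whole columns of an m-row matrix that fit in memory (slab approach)."""
    return mem_bytes // (m * itemsize)

def tile_size(mem_bytes, tiles_in_core=3, itemsize=8):
    """Largest t such that `tiles_in_core` square t-by-t tiles fit in memory.

    The count of three resident tiles is a hypothetical choice for illustration.
    """
    return math.isqrt(mem_bytes // (tiles_in_core * itemsize))

mem = 2**30  # assumed memory budget: 1 GiB
for m in (10**6, 10**7, 10**8):
    # The slab width shrinks as m grows; the tile size never depends on m.
    print(m, slab_columns(mem, m), tile_size(mem))
```

With a fixed memory budget the slab narrows toward a single column as m increases, while the tile edge stays constant because it depends only on the memory size.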

Out-of-Core QR Factorization

Given the m×n matrix, A, we wish to apply the factorization

A = QR

Compact WY representation: Q = I + Y T Y^T
• Q is an orthogonal matrix
• R is upper triangular
• Y is an m×r collection of Householder vectors, normalized to be unit lower triangular (trapezoidal)
• T is r×r upper triangular
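A small in-core sketch of the compact WY representation may help fix the notation. It is not the PLAPACK routine: the function name `compact_wy_qr` is hypothetical, the code assumes m ≥ n with no all-zero column encountered during elimination, and it uses the plus-sign convention Q = I + Y T Y^T from the slide (so the diagonal of T is negative).

```python
import numpy as np

def compact_wy_qr(A):
    """Householder QR returning Y, T, R with Q = I + Y @ T @ Y.T and A = Q @ R."""
    R = np.array(A, dtype=float)
    m, n = R.shape
    r = min(m, n)
    Y = np.zeros((m, r))
    T = np.zeros((r, r))
    for j in range(r):
        x = R[j:, j]
        alpha = -np.copysign(np.linalg.norm(x), x[0])
        v = x.copy()
        v[0] -= alpha
        v /= v[0]                      # normalize so that Y has a unit diagonal
        tau = 2.0 / (v @ v)
        # Apply the reflector H_j = I - tau * v v^T to the trailing part of R.
        R[j:, j:] -= tau * np.outer(v, v @ R[j:, j:])
        Y[j:, j] = v
        # Extend T so that H_1 ... H_j = I + Y[:, :j+1] T[:j+1, :j+1] Y[:, :j+1]^T.
        T[:j, j] = -tau * (T[:j, :j] @ (Y[:, :j].T @ Y[:, j]))
        T[j, j] = -tau
    return Y, T, np.triu(R)

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 3))
Y, T, R = compact_wy_qr(A)
Q = np.eye(6) + Y @ T @ Y.T
print(np.allclose(Q @ R, A), np.allclose(Q.T @ Q, np.eye(6)))
```

Storing only Y and T, rather than Q itself, is what allows the transformation to be applied later to other tiles with matrix-matrix products.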

QR Factorization: Out-of-Core Implementation

Step 1:

Begin with an unfactored matrix which resides on disk.

Step 2:

Divide the matrix into a mesh of square t×t tiles, where each tile is stored as a separate file.

Step 3:

Read in the first tiles and factor them, saving the T matrices (T_i) and overwriting the lower tile with Y (Y_i).

Step 4:

Read in the remaining tiles in the row and apply Q = I + Y_i T_i Y_i^T, reading Y_i in one panel at a time.

Step 5:

Factor the next tile in the first column using the QR update algorithm.

Step 6:

Apply the transformations to the remaining tiles in the row.

Step 7:

Repeat Steps 5 and 6 for any remaining rows of tiles.

Step 8:

Repeat Steps 1-7 on the lower-right quadrant (the trailing submatrix of tiles).

Continue until entire matrix has been factored.
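The steps above can be sketched as a sequential toy version in NumPy. This is a simplification under stated assumptions, not the POOCLAPACK implementation: tiles are stored as `.npy` files in a temporary directory, an explicit Q from `np.linalg.qr` stands in for the stored Y and T factors, the matrix dimensions are assumed to be multiples of t with m ≥ n, and all function names are hypothetical.

```python
import os
import tempfile
import numpy as np

def tile_path(root, i, j):
    return os.path.join(root, f"tile_{i}_{j}.npy")

def write_tiles(A, t, root):
    """Step 2: split A into t-by-t tiles, one file per tile."""
    mt, nt = A.shape[0] // t, A.shape[1] // t
    for i in range(mt):
        for j in range(nt):
            np.save(tile_path(root, i, j), A[i*t:(i+1)*t, j*t:(j+1)*t])
    return mt, nt

def apply_qt(Qt, root, rows, j, t):
    """Steps 4 and 6: read the listed tiles of column j, apply Q^T, write back."""
    S = Qt @ np.vstack([np.load(tile_path(root, i, j)) for i in rows])
    for idx, i in enumerate(rows):
        np.save(tile_path(root, i, j), S[idx*t:(idx+1)*t])

def tiled_qr(root, mt, nt, t):
    for k in range(nt):                          # Step 8: next trailing submatrix
        # Step 3: factor the diagonal tile.
        Q, R = np.linalg.qr(np.load(tile_path(root, k, k)), mode="complete")
        np.save(tile_path(root, k, k), R)
        for j in range(k + 1, nt):               # Step 4: update the rest of the row
            apply_qt(Q.T, root, [k], j, t)
        for i in range(k + 1, mt):               # Steps 5-7: remaining rows of tiles
            stacked = np.vstack([np.load(tile_path(root, k, k)),
                                 np.load(tile_path(root, i, k))])
            Q, R = np.linalg.qr(stacked, mode="complete")   # QR update of [R; A_ik]
            np.save(tile_path(root, k, k), R[:t])
            np.save(tile_path(root, i, k), R[t:])  # zeros here; the talk stores Y instead
            for j in range(k + 1, nt):           # Step 6
                apply_qt(Q.T, root, [k, i], j, t)

def read_matrix(root, mt, nt):
    return np.block([[np.load(tile_path(root, i, j)) for j in range(nt)]
                     for i in range(mt)])

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))
with tempfile.TemporaryDirectory() as root:
    mt, nt = write_tiles(A, 2, root)
    tiled_qr(root, mt, nt, 2)
    R = read_matrix(root, mt, nt)
# Orthogonal updates preserve A^T A, and the result is upper triangular.
print(np.allclose(R.T @ R, A.T @ A), np.allclose(np.tril(R, -1), 0.0))
```

At most a few tiles are in memory at any point in this sketch, which is the property that makes the memory footprint independent of the number of tile rows.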

Out-of-Core LU Factorization

Given the m×n matrix, A, we wish to apply the factorization

PA = LU

• P is a permutation matrix
• U is n×n upper triangular
• L is lower trapezoidal
• Implementation analogous to out-of-core QR factorization
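For reference, the shapes in PA = LU can be checked with a short in-core sketch of LU with partial pivoting. This illustrates only the factorization itself, not the tiled LU update algorithm used out of core; the function name `lu_partial_pivot` is hypothetical and the code assumes m ≥ n.

```python
import numpy as np

def lu_partial_pivot(A):
    """Return P (m-by-m), L (m-by-n, unit lower trapezoidal), U (n-by-n) with P @ A = L @ U."""
    W = np.array(A, dtype=float)
    m, n = W.shape
    perm = np.arange(m)
    for k in range(min(m, n)):
        # Pick the largest entry in column k (on or below the diagonal) as pivot.
        p = k + np.argmax(np.abs(W[k:, k]))
        if p != k:
            W[[k, p]] = W[[p, k]]
            perm[[k, p]] = perm[[p, k]]
        if W[k, k] != 0.0:
            W[k+1:, k] /= W[k, k]                               # multipliers (column of L)
            W[k+1:, k+1:] -= np.outer(W[k+1:, k], W[k, k+1:])   # Schur complement update
    L = np.tril(W, -1) + np.eye(m, n)
    U = np.triu(W)[:n, :]
    P = np.eye(m)[perm]
    return P, L, U

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 3))
P, L, U = lu_partial_pivot(A)
print(np.allclose(P @ A, L @ U))
```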

LU Factorization: Out-of-Core Implementation

Step 1:

Factor the first tile, saving the permutation matrix.

Step 2:

Update the remaining tiles in the row using panels of L and the saved permutation matrices.

Step 3:

Factor the next tile in the first column using the LU update algorithm.

Step 4:

Update the remaining tiles in the row using panels of L and the stored permutation matrices.

Development Environment

Parallel Linear Algebra Package (PLAPACK)
• Optimized parallel routines (FORTRAN and C interfaces)
• 'View-based' infrastructure
• Uses standard MPI and BLAS libraries

Parallel Out-Of-Core Linear Algebra Package (POOCLAPACK)
• Out-of-core extension to PLAPACK
• Handles the complexity of the I/O operations (i.e., hidden from the user)
• Uses standard read/write functions for portability

Performance of Parallel OOC QR

IBM P690: 32 GB of memory, theoretical peak (T.P.) of 5.2 Gflops, DGEMM at 3.723 Gflops

Performance for Sequential OOC LU

Earth Science Application

Gravity Recovery And Climate Experiment (GRACE)

A collaborative effort between
• The University of Texas Center for Space Research (CSR)
• The Jet Propulsion Laboratory (JPL)
• GeoForschungsZentrum (GFZ)
• Deutschen Zentrum für Luft- und Raumfahrt (DLR)
• National Aeronautics and Space Administration (NASA)

Goal was to compute a rigorous 360x360 gravity model
• No approximation techniques
• Translates to roughly 100 km² resolution
• Involves the least squares estimation of ~130,000 parameters

Requires the combination of hundreds of millions of observations
• Surface gravity data (land) – ½ TB
• Altimetry-based mean sea surface data (ocean)
• GRACE data (satellite)

Using new parallel OOC QR algorithm
• A 360x360 field was generated, complete with full covariance
• Largest rigorous gravity field model ever created
• Used a single IBM P690 node
• OOC QR required only 32 GB
  • To do in-core would require 165 GB of memory
• Required ~6 days of wall clock time to compute (2326 CPU hours)
  • A single processor machine with sufficient memory would require 3.2 months

Conclusion

Tile-based out-of-core algorithms provide scalability
• Size of the tile is based on the memory of the machine (i.e., fixed) and is independent of the problem size

Algorithms achieve excellent performance
• The large tile sizes mean the algorithm spends nearly all of its time in large, highly efficient matrix-matrix operations
• This helps to offset the I/O cost associated with moving the tiles to and from disk

Use of PLAPACK and POOCLAPACK greatly simplified the implementation
• Reduces complexity of code
• Makes code portable

Has already proven valuable to Earth science applications

Broad spectrum of applications
• Large scale problems
• Small clusters
• Embedded systems
• Other small memory machines

Tile-based OOC approach can be extended to other dense linear algebra operations
• Cholesky, matrix inverse, BLAS-3, etc.

Goal is to provide a full suite of OOC utilities

For More Information

Visit the PLAPACK website: www.cs.utexas.edu/users/plapack

Visit the GRACE website: www.csr.utexas.edu/grace