ScicomP 10, Aug 9-13, 2004

Parallel Out-of-Core LU and QR Factorization

Brian Gunter
Center for Space Research
The University of Texas at Austin, Austin, TX
gunter@csr.utexas.edu

Enrique Quintana-Ortí
Depto. de Ingenieria y Ciencia de Computadores
Universidad Jaume I, Castellón, Spain
quintana@icc.uji.es

Robert van de Geijn
Department of Computer Sciences
The University of Texas at Austin, Austin, TX
rvdg@cs.utexas.edu

Thierry Joffrain
Department of Computer Sciences
The University of Texas at Austin, Austin, TX
joffrain@cs.utexas.edu

Motivation

Traditional methods use a slab approach, where entire columns of the out-of-core matrix are brought into memory.

[Figure: the in-core slab of columns within the out-of-core matrix, shown for the case m >> n]

While this is effective for many applications, it is inherently unscalable: as m grows relative to n (m >> n), fewer columns can fit into memory.
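The scaling problem with slabs can be made concrete with a little arithmetic. The sketch below is illustrative only: the 32 GB memory size and the 8-byte (double precision) entries are assumptions, and `slab_columns` is a hypothetical helper, not part of any library.

```python
def slab_columns(m, memory_bytes, bytes_per_entry=8):
    """Number of full columns of an m-row matrix that fit in memory."""
    return memory_bytes // (m * bytes_per_entry)


memory = 32 * 10**9  # assumed: 32 GB of usable memory

# The slab narrows in direct proportion to the number of rows m.
for m in (10**5, 10**6, 10**7, 10**8):
    cols = slab_columns(m, memory)
    print(f"m = {m:>11,d}: {cols:>6,d} columns per slab")
```

With a fixed memory size the slab width shrinks as 1/m, so for very tall matrices the slab becomes too narrow for efficient blocked computation.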

Out-of-Core QR Factorization

Given the m×n matrix A, we wish to compute the factorization A = QR
• Q is an orthogonal matrix
• R is upper triangular

Compact WY representation: Q = I + Y T Y^T
• Y is an m×r collection of Householder vectors, normalized to be unit lower triangular (trapezoidal)
• T is r×r upper triangular
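As an illustration of the compact WY representation Q = I + Y T Y^T, here is a minimal in-core NumPy sketch. It is not the PLAPACK/POOCLAPACK code; `qr_compact_wy` is a hypothetical name, and the sign convention for T is chosen so that the product of Householder reflectors matches the plus sign used on the slide (LAPACK-style routines store Q = I - Y T Y^T instead).

```python
import numpy as np


def qr_compact_wy(A):
    """Householder QR of A, returning Y, T, R with Q = I + Y T Y^T and A = Q R."""
    R = np.array(A, dtype=float)
    m, n = R.shape
    r = min(m, n)
    Y = np.zeros((m, r))  # unit lower trapezoidal Householder vectors
    T = np.zeros((r, r))  # upper triangular accumulator
    for j in range(r):
        x = R[j:, j]
        normx = np.linalg.norm(x)
        v = np.zeros(m)
        v[j] = 1.0
        if normx == 0.0:
            tau = 0.0  # nothing to annihilate; reflector is the identity
        else:
            beta = -np.copysign(normx, x[0])
            tau = (beta - x[0]) / beta
            v[j + 1:] = x[1:] / (x[0] - beta)
        # Apply H_j = I - tau v v^T to the working matrix.
        R -= tau * np.outer(v, v @ R)
        # Extend T so that H_1 ... H_j = I + Y[:, :j+1] T[:j+1, :j+1] Y[:, :j+1]^T.
        T[:j, j] = -tau * (T[:j, :j] @ (Y[:, :j].T @ v))
        T[j, j] = -tau
        Y[:, j] = v
    return Y, T, np.triu(R)


rng = np.random.default_rng(0)
A = rng.standard_normal((8, 3))
Y, T, R = qr_compact_wy(A)
Q = np.eye(8) + Y @ T @ Y.T
print(np.allclose(Q.T @ Q, np.eye(8)), np.allclose(Q @ R, A))
```

Storing only Y and T means Q is never formed explicitly; applying it to another matrix reduces to a few matrix-matrix products.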

QR Factorization: Out-of-Core Implementation

Step 1:

Begin with an unfactored matrix which resides on disk.

[Figure legend: tiles stored on disk vs. tiles in memory]

Step 2:

Divide matrix into a mesh of tiles of size t, where each tile is stored as a separate file.
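A rough sketch of the tiling step follows, assuming t×t tiles (smaller at the edges) written as NumPy `.npy` files. The file naming scheme and the helpers `write_tiles` and `read_tile` are hypothetical; the on-disk format actually used by POOCLAPACK is not described in the slides.

```python
import os
import tempfile

import numpy as np


def write_tiles(A, t, directory):
    """Split A into t-by-t tiles (edge tiles may be smaller), one file per tile."""
    m, n = A.shape
    paths = {}
    for i, r0 in enumerate(range(0, m, t)):
        for j, c0 in enumerate(range(0, n, t)):
            path = os.path.join(directory, f"tile_{i}_{j}.npy")
            np.save(path, A[r0:r0 + t, c0:c0 + t])
            paths[i, j] = path
    return paths


def read_tile(paths, i, j):
    """Load tile (i, j) from disk into memory."""
    return np.load(paths[i, j])


with tempfile.TemporaryDirectory() as d:
    A = np.arange(35.0).reshape(7, 5)
    paths = write_tiles(A, 3, d)
    tile = read_tile(paths, 2, 1)  # bottom-right edge tile

print(len(paths), tile.shape)
```

Because each tile is its own file, reading or writing one tile never touches the rest of the matrix.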

Step 3:

Read in first tiles and factor, saving T matrices and overwriting lower tile with Y

Step 4:

Read in remaining tiles in row and apply Q_i = I + Y_i T_i Y_i^T, reading Y_i in one panel at a time.
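One way to apply a block reflector I + Y T Y^T while holding only one panel of Y in memory at a time is a two-pass scheme, sketched below. The row-panel partitioning and the name `apply_block_reflector` are assumptions for illustration; in-memory lists stand in for panels read from disk. The `transpose` flag applies I + Y T^T Y^T instead, which is the transposed transformation.

```python
import numpy as np


def apply_block_reflector(Y_panels, T, C_panels, transpose=False):
    """Overwrite C with (I + Y T Y^T) C, or (I + Y T^T Y^T) C if transpose is set.

    Y and C are given as matching lists of row panels.
    """
    # Pass 1: accumulate W = Y^T C, one panel of Y at a time.
    W = sum(Yk.T @ Ck for Yk, Ck in zip(Y_panels, C_panels))
    W = (T.T if transpose else T) @ W
    # Pass 2: update each panel of C in place, C_k += Y_k W.
    for Yk, Ck in zip(Y_panels, C_panels):
        Ck += Yk @ W


rng = np.random.default_rng(0)
m, r, n, p = 6, 2, 3, 2  # p = rows per panel
Y = rng.standard_normal((m, r))
T = np.triu(rng.standard_normal((r, r)))
C = rng.standard_normal((m, n))
expected = C + Y @ T @ (Y.T @ C)  # dense reference result

Y_panels = [Y[k:k + p] for k in range(0, m, p)]
C_panels = [C[k:k + p] for k in range(0, m, p)]  # views into C
apply_block_reflector(Y_panels, T, C_panels)
print(np.allclose(C, expected))
```

Only the small r×n matrix W has to persist between the two passes, so the memory footprint is set by the panel size rather than by the height of Y.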

Step 5:

Factor next tile in first column using QR update algorithm.
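The idea behind a QR update can be sketched as factoring the stacked matrix [R; B], where R is the triangular factor already computed for the tile above and B is the next tile in the column. The version below simply calls `numpy.linalg.qr` on the stacked matrix; it does not exploit the triangular structure of R the way a dedicated update algorithm would, and `qr_update` is a hypothetical name.

```python
import numpy as np


def qr_update(R, B):
    """Return the triangular factor of the stacked matrix [R; B]."""
    return np.linalg.qr(np.vstack([R, B]), mode="r")


rng = np.random.default_rng(1)
A0 = rng.standard_normal((4, 4))  # tile already factored
A1 = rng.standard_normal((4, 4))  # next tile in the same column

R0 = np.linalg.qr(A0, mode="r")
R1 = qr_update(R0, A1)

# Reference: factor both tiles at once. R is unique up to the signs of its rows.
R_ref = np.linalg.qr(np.vstack([A0, A1]), mode="r")
print(np.allclose(np.abs(R1), np.abs(R_ref)))
```

The update only needs R and the new tile in memory, which is what lets the factorization proceed down a column one tile at a time.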

Step 6:

Apply transformations to remaining tiles in row.

Step 7:

Repeat Steps 5 and 6 for any remaining rows of tiles.

Step 8:

Repeat Steps 1-7 on lower quadrant.

Continue until entire matrix has been factored.
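The overall loop structure of Steps 3 to 8 can be sketched in-core as follows. This is a simplified illustration under several assumptions: the whole matrix is kept in memory with tiles as array views (no disk I/O), the dimensions are multiples of t, and explicit Q matrices from `numpy.linalg.qr` are used in place of the compact WY form. `tiled_qr` is a hypothetical name.

```python
import numpy as np


def tiled_qr(A, t):
    """Return the n-by-n factor R of a tile-by-tile QR of A (m >= n, multiples of t)."""
    A = np.array(A, dtype=float)
    m, n = A.shape
    assert m % t == 0 and n % t == 0 and m >= n
    mt, nt = m // t, n // t

    def tile(i, j):
        return A[i * t:(i + 1) * t, j * t:(j + 1) * t]  # view of tile (i, j)

    for k in range(nt):
        # Steps 3-4: factor the diagonal tile, then update the tiles to its right.
        Q, R = np.linalg.qr(tile(k, k), mode="complete")
        tile(k, k)[:] = R
        for j in range(k + 1, nt):
            tile(k, j)[:] = Q.T @ tile(k, j)
        # Steps 5-7: fold each lower tile of column k into R, updating its row.
        for i in range(k + 1, mt):
            Q, R = np.linalg.qr(np.vstack([tile(k, k), tile(i, k)]), mode="complete")
            tile(k, k)[:] = R[:t]
            tile(i, k)[:] = 0.0
            for j in range(k + 1, nt):
                C = Q.T @ np.vstack([tile(k, j), tile(i, j)])
                tile(k, j)[:] = C[:t]
                tile(i, j)[:] = C[t:]
        # Step 8: the next iteration of k works on the trailing (lower) quadrant.
    return np.triu(A[:n])


rng = np.random.default_rng(2)
A = rng.standard_normal((12, 8))
R = tiled_qr(A, 4)
print(np.allclose(R.T @ R, A.T @ A))
```

At any point the inner loops touch at most two tiles from the current column and two from the column being updated, which is why the working set depends on t and not on m or n.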

Out-of-Core LU Factorization

Given the m×n matrix A, we wish to compute the factorization PA = LU
• P is a permutation matrix
• L is lower trapezoidal
• U is n×n upper triangular

Implementation is analogous to the out-of-core QR factorization.
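For reference, a minimal in-core sketch of an LU factorization with partial pivoting of an m×n matrix (m >= n assumed), producing a permutation, a unit lower trapezoidal L and an n×n upper triangular U. This is a textbook unblocked version for illustration, not the tiled out-of-core algorithm; `lu_partial_pivoting` is a hypothetical name.

```python
import numpy as np


def lu_partial_pivoting(A):
    """Return perm, L, U with A[perm] == L @ U, for an m-by-n matrix with m >= n."""
    LU = np.array(A, dtype=float)
    m, n = LU.shape
    perm = np.arange(m)
    for k in range(n):
        # Pick the largest entry in column k (on or below the diagonal) as pivot.
        p = k + np.argmax(np.abs(LU[k:, k]))
        if p != k:
            LU[[k, p]] = LU[[p, k]]
            perm[[k, p]] = perm[[p, k]]
        if LU[k, k] != 0.0:
            LU[k + 1:, k] /= LU[k, k]  # multipliers (column k of L)
            LU[k + 1:, k + 1:] -= np.outer(LU[k + 1:, k], LU[k, k + 1:])
    L = np.tril(LU, -1) + np.eye(m, n)  # m-by-n, unit lower trapezoidal
    U = np.triu(LU)[:n]                 # n-by-n, upper triangular
    return perm, L, U


rng = np.random.default_rng(3)
A = rng.standard_normal((6, 4))
perm, L, U = lu_partial_pivoting(A)
P = np.eye(6)[perm]  # explicit permutation matrix, so that P @ A == A[perm]
print(np.allclose(P @ A, L @ U))
```

The permutation is kept as an index vector; forming P explicitly is only done here to check PA = LU.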

LU Factorization: Out-of-Core Implementation

Step 1:

Factor first tile, saving permutation matrix.

Step 2:

Update remaining tiles in row using panels of L and the saved permutation matrices.

Step 3:

Factor next tile in first column using LU update algorithm.

Step 4:

Update remaining tiles in row using panels of L and stored permutation matrices.

Development Environment

Parallel Linear Algebra Package (PLAPACK)
• Optimized parallel routines (FORTRAN and C interfaces)
• ‘View-based’ infrastructure
• Uses standard MPI and BLAS libraries

Parallel Out-Of-Core Linear Algebra Package (POOCLAPACK)
• Out-of-core extension to PLAPACK
• Handles the complexity of the I/O operations (i.e., hidden from the user)
• Uses standard read/write functions for portability

Performance of Parallel OOC QR

IBM P690: 32 GB of memory, theoretical peak (T.P.) of 5.2 Gflops, DGEMM of 3.723 Gflops

Performance for Sequential OOC LU

Earth Science Application

Gravity Recovery And Climate Experiment (GRACE)

A collaborative effort between
• The University of Texas Center for Space Research (CSR)
• The Jet Propulsion Laboratory (JPL)
• GeoForschungsZentrum (GFZ)
• Deutschen Zentrum für Luft- und Raumfahrt (DLR)
• National Aeronautics and Space Administration (NASA)

Goal was to compute a rigorous 360x360 gravity model
• No approximation techniques
• Translates to roughly 100 km² resolution
• Involves the least squares estimation of ~130,000 parameters

Requires the combination of hundreds of millions of observations
• surface gravity data (land) – ½ TB
• altimetry-based mean sea surface data (ocean)
• GRACE data (satellite)

Using new parallel OOC QR algorithm
• A 360x360 field was generated, complete with full covariance
• Largest rigorous gravity field model ever created
• Used a single IBM P690 node
• OOC QR required only 32 GB
    • To do in-core would require 165 GB of memory
• Required ~6 days of wall clock time to compute (2326 CPU hours)
    • A single processor machine with sufficient memory would require 3.2 months
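The figures quoted above can be cross-checked with some back-of-the-envelope arithmetic. The sketch below assumes a spherical harmonic model complete to degree and order 360 has (360 + 1)² coefficients, 8-byte entries, and 30-day months; the dense n×n storage estimate is only a rough lower bound and is not meant to reproduce the 165 GB figure, whose exact breakdown is not given in the slides.

```python
degree = 360
n_coeffs = (degree + 1) ** 2  # coefficients up to degree and order 360
print(f"parameters: {n_coeffs:,d}")  # consistent with the quoted ~130,000

bytes_per_entry = 8  # assumed double precision
dense_gb = n_coeffs**2 * bytes_per_entry / 1e9
print(f"dense n-by-n matrix alone: {dense_gb:.0f} GB")

cpu_hours = 2326
wall_days = 6
months_single_cpu = cpu_hours / 24 / 30
avg_busy_cpus = cpu_hours / (wall_days * 24)
print(f"single-processor time: {months_single_cpu:.1f} months")
print(f"average processors kept busy: {avg_busy_cpus:.1f}")
```

The 2326 CPU hours do work out to roughly 3.2 months on a single processor, matching the slide.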

Conclusion

Tile-based out-of-core algorithms provide scalability
• Size of the tile is based on the memory of the machine (i.e. fixed) and is independent of the problem size

Algorithms achieve excellent performance
• The large tile sizes mean the algorithm spends nearly all of its time in large, highly efficient matrix-matrix operations
• This helps to offset the I/O cost associated with moving the tiles to and from disk

Use of PLAPACK & POOCLAPACK greatly simplified the implementation
• Reduces complexity of code
• Makes code portable

Has already proven valuable to Earth science applications

Broad spectrum of applications
• Large scale problems
• Small clusters
• Embedded systems
• Other small memory machines

Tile-based OOC approach can be extended to other dense linear algebra operations
• Cholesky, matrix inverse, BLAS-3, etc.

Goal is to provide a full suite of OOC utilities

For More Information

Visit the PLAPACK website: www.cs.utexas.edu/users/plapack

Visit the GRACE website: www.csr.utexas.edu/grace
