The Future of LAPACK and ScaLAPACK

Jason Riedy, Yozo Hida, James Demmel

EECS Department, University of California, Berkeley

November 18, 2005

Outline

Survey responses: What users want

Improving LAPACK and ScaLAPACK
- Improved Numerics
- Improved Performance
- Improved Functionality
- Improved Engineering and Community

Two Example Improvements
- Numerics: Iterative Refinement for Ax = b
- Performance: The MRRR Algorithm for Ax = λx

Survey: What users want

- Survey available from http://www.netlib.org/lapack-dev/.
- 212 responses from over 100 different, non-anonymous groups
- Problem sizes:

      100    1K     10K    100K   1M     (other)
      8%     26%    24%    12%    6%     (24%)

- >80% interested in small to medium SMPs
- >40% interested in large distributed-memory systems
- Vendor libraries seen as faster, but buggier
- Over 20% want more than double precision, 70% out-of-core
- Requests: high-level interfaces, low-level interfaces, parallel redistribution∗, and tuning

Participants

- UC Berkeley
  - Jim Demmel, Ming Gu, W. Kahan, Beresford Parlett, Xiaoye Li, Osni Marques, Christof Vömel, David Bindel, Yozo Hida, Jason Riedy, Jianlin Xia, Jiang Zhu, undergrads, . . .
- U Tennessee, Knoxville
  - Jack Dongarra, Julien Langou, Julie Langou, Piotr Luszczek, Stan Tomov, . . .
- Other Academic Institutions
  - UT Austin, UC Davis, U Kansas, U Maryland, North Carolina SU, San Jose SU, UC Santa Barbara, TU Berlin, FU Hagen, U Madrid, U Manchester, U Umeå, U Wuppertal, U Zagreb
- Research Institutions
  - CERFACS, LBL, UEC (Japan)
- Industrial Partners
  - Cray, HP, Intel, MathWorks, NAG, SGI

You?

Improved Numerics

Improved accuracy at standard asymptotic speed; some are even faster!

- Iterative refinement for linear systems, least squares — Demmel / Hida / Kahan / Li / Mukherjee / Riedy / Sarkisyan
- Pivoting and scaling for symmetric systems, both definite and indefinite
- Jacobi SVD (and faster) — Drmač / Veselić
- Condition numbers and estimators — Higham / Cheng / Tisseur
- Useful approximate error estimates

Improved Performance

Improved performance with at least standard accuracy

- MRRR algorithm for eigenvalues, SVD — Parlett / Dhillon / Vömel / Marques / Willems / Katagiri
- Fast Hessenberg QR and QZ — Byers / Mathias / Braman, Kågström / Kressner
- Fast reductions and BLAS 2.5 — van de Geijn, Bischof / Lang, Howell / Fulton
- Recursive data layouts — Gustavson / Kågström / Elmroth / Jonsson
- Generalized SVD — Bai, Wang
- Polynomial roots from semi-separable form — Gu / Chandrasekaran / Zhu / Xia / Bindel / Garmire / Demmel
- Automated tuning, optimizations in ScaLAPACK, . . .

Improved Functionality

Algorithms

- Updating / downdating of factorizations — Stewart, Langou
- More generalized SVDs: products, CSD — Bai, Wang
- More generalized Sylvester and Lyapunov solvers — Kågström, Jonsson, Granat
- Quadratic eigenproblems — Mehrmann
- Matrix functions — Higham

Implementations

- Add “missing” features to ScaLAPACK
- Generate LAPACK and ScaLAPACK for higher precisions
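The precision-generation idea can be illustrated outside Fortran. Below is a minimal sketch (the helper `lu_solve` is hypothetical, not a LAPACK interface): one precision-generic Gaussian elimination whose single source text runs in hardware floats or exactly over rationals, mirroring the goal of generating one code for many precisions.

```python
from fractions import Fraction

def lu_solve(A, b):
    """Gaussian elimination with partial pivoting, written generically:
    any numeric type supporting +, -, *, / and abs works (float,
    Fraction, a quad-double class, ...)."""
    n = len(A)
    A = [row[:] for row in A]          # work on copies
    b = list(b)
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(A[i][k]))  # pivot row
        A[k], A[p] = A[p], A[k]
        b[k], b[p] = b[p], b[k]
        for i in range(k + 1, n):
            m = A[i][k] / A[k][k]
            for j in range(k, n):
                A[i][j] -= m * A[k][j]
            b[i] -= m * b[k]
    x = [None] * n
    for i in reversed(range(n)):       # back substitution
        s = b[i] - sum(A[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / A[i][i]
    return x

# The notoriously ill-conditioned Hilbert matrix, solved exactly:
n = 4
H = [[Fraction(1, i + j + 1) for j in range(n)] for i in range(n)]
ones = [Fraction(1)] * n
x = lu_solve(H, ones)
```

Over `Fraction` the residual is exactly zero; the same routine in `float` inherits ordinary rounding error.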


Improved Engineering and Community

Use new features without a rewrite

- Use modern Fortran 95, maybe 2003
  - DO . . . END DO, recursion, allocation (in wrappers)
- Provide higher-level wrappers for common languages
  - F95, C, C++
- Automatic generation of precisions, bindings
  - Full automation (FLAME, etc.) is not quite ready for all functions
- Tests for algorithms, implementations, installations

Open development

Need a community for long-term evolution: http://www.netlib.org/lapack-dev/

Lots of work to do, in both research and development.

Two Example Improvements

Recent, locally developed improvements

Improved Numerics

Iterative refinement for linear systems Ax = b:

- Extra precision ⇒ small error, dependable estimate
- Both normwise and componentwise
- (See LAWN 165 for full details.)

Improved Performance

MRRR algorithm for eigenvalue and SVD problems:

- Optimal complexity: O(n) per value/vector
- (See LAWNs 162, 163, 166, 167, . . . for more details.)

Numerics: Iterative Refinement

Improve the solution to Ax = b:

    Repeat: r = b − Ax,  dx = A⁻¹ r,  x = x + dx
    Until:  good enough
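The loop above can be sketched in a few lines of NumPy. This is an illustration, not LAPACK's routine (which factors A once, reuses the LU factors, and carries the residual in doubled precision): here the factorization and solves run in float32 as the working precision, while residuals accumulate in float64 as the extra precision.

```python
import numpy as np

def refine(A, b, iters=5):
    """Iterative refinement sketch: solve in float32 (working precision),
    accumulate residuals in float64 (extra precision). A production code
    would factor A once and reuse the factors for each correction."""
    A32 = A.astype(np.float32)
    x = np.linalg.solve(A32, b.astype(np.float32)).astype(np.float64)
    for _ in range(iters):
        r = b - A @ x                                # residual, extra precision
        dx = np.linalg.solve(A32, r.astype(np.float32))
        x = x + dx.astype(np.float64)                # x = x + dx
    return x

# A well-conditioned test problem (sizes are illustrative):
rng = np.random.default_rng(0)
n = 50
A = rng.standard_normal((n, n)) + n * np.eye(n)
x_true = rng.standard_normal(n)
b = A @ x_true

x32 = np.linalg.solve(A.astype(np.float32), b.astype(np.float32)).astype(np.float64)
x_ref = refine(A, b)
```

For a modest condition number the refined solution recovers roughly double-precision accuracy even though every factorization ran in single.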

Not-too-ill-conditioned ⇒ error O(√n ε)

[Two plots: normwise error E_norm vs. condition number κ_norm (2,000,000 cases).]

Numerics: Iterative Refinement (continued)

Dependable normwise relative error estimate

[Two plots: normwise error E_norm vs. its error bound B_norm (2,000,000 cases).]

Numerics: Iterative Refinement (continued)

Also small componentwise errors and dependable estimates

[Two plots: componentwise error E_comp vs. condition number κ_comp, and E_comp vs. its error bound B_comp (2,000,000 cases).]

Relying on Condition Numbers

Dependable error estimates need condition numbers. The trick is picking the right condition number and estimating it well.
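The precision issue can be seen with NumPy/SciPy (an illustration, not the LAPACK estimators themselves): once 1/κ falls below single-precision roundoff, a condition number computed in single precision is dominated by rounding noise and cannot be trusted.

```python
import numpy as np
from scipy.linalg import hilbert

H = hilbert(8)                  # kappa ~ 1.5e10, far beyond float32's reach
c64 = np.linalg.cond(H)         # double-precision estimate
c32 = np.linalg.cond(H.astype(np.float32))  # single precision: unreliable

# 1/kappa sits below float32 roundoff, so the single-precision SVD
# cannot resolve the smallest singular value at all:
eps32 = np.finfo(np.float32).eps
```

Comparing `c32` against `c64` shows wildly different values; only the double-precision estimate is meaningful here.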

[Plot: κ_norm estimated in single vs. double precision (2,000,000 cases), with the fraction of cases falling in each region of the plot.]

Performance: The MRRR Algorithm

Multiple Relatively Robust Representations

- 1999 Householder Award honorable mention for Dhillon
- Optimal complexity with small error!
  - O(nk) flops for k eigenvalues/vectors of an n × n tridiagonal matrix
  - Small residuals: ‖T x_i − λ_i x_i‖ = O(nε)
  - Orthogonal eigenvectors: |x_i^T x_j| = O(nε)
- Similar algorithm for the SVD.
- Eigenvectors are computed independently ⇒ naturally parallelizable
- (LAPACK release 3 had bugs and missing cases.)
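MRRR is what LAPACK's xSTEMR routines implement, and SciPy exposes them for symmetric tridiagonal problems. A small check on the 1-D Laplacian, whose eigenvalues 2 − 2 cos(kπ/(n+1)) are known in closed form:

```python
import numpy as np
from scipy.linalg import eigh_tridiagonal

n = 500
d = 2.0 * np.ones(n)        # diagonal of the 1-D Laplacian
e = -1.0 * np.ones(n - 1)   # off-diagonal
w, v = eigh_tridiagonal(d, e, lapack_driver='stemr')  # MRRR (xSTEMR)

T = np.diag(d) + np.diag(e, 1) + np.diag(e, -1)
resid = np.linalg.norm(T @ v - v * w)        # small residuals
ortho = np.linalg.norm(v.T @ v - np.eye(n))  # near-orthogonal eigenvectors
```

Both the residual and the departure from orthogonality come out at the O(nε) level the slide claims, and the eigenvalues match the closed form.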


Performance: The MRRR Algorithm

[Performance comparison plot. Legend note: “fast DC”: Wilkinson, deflate like crazy.]

Summary

- LAPACK and ScaLAPACK are open for improvement!
- Planned improvements in numerics, performance, functionality, and engineering.
- Forming a community for long-term development.