The Future of LAPACK and ScaLAPACK

Jason Riedy, Yozo Hida, James Demmel

EECS Department, University of California, Berkeley

November 18, 2005

Outline

Survey responses: What users want

Improving LAPACK and ScaLAPACK
- Improved Numerics
- Improved Performance
- Improved Functionality
- Improved Engineering and Community

Two Example Improvements
- Numerics: Iterative Refinement for Ax = b
- Performance: The MRRR Algorithm for Ax = λx

Survey: What users want

- Survey available from http://www.netlib.org/lapack-dev/.
- 212 responses from over 100 different, non-anonymous groups
- Problem sizes:

      100    1K     10K    100K   1M     (other)
      8%     26%    24%    12%    6%     (24%)

- >80% interested in small to medium SMPs
- >40% interested in large distributed-memory systems
- Vendor libraries seen as faster, but buggier
- Over 20% want more than double precision, 70% out-of-core
- Requests: high-level interfaces, low-level interfaces, parallel redistribution∗, and tuning

Participants

- UC Berkeley
  - Jim Demmel, Ming Gu, W. Kahan, Beresford Parlett, Xiaoye Li, Osni Marques, Christof Vömel, David Bindel, Yozo Hida, Jason Riedy, Jianlin Xia, Jiang Zhu, undergrads, . . .
- U Tennessee, Knoxville
  - Jack Dongarra, Julien Langou, Julie Langou, Piotr Luszczek, Stan Tomov, . . .
- Other Academic Institutions
  - UT Austin, UC Davis, U Kansas, U Maryland, North Carolina SU, San Jose SU, UC Santa Barbara, TU Berlin, FU Hagen, U Madrid, U Manchester, U Umeå, U Wuppertal, U Zagreb
- Research Institutions
  - CERFACS, LBL, UEC (Japan)
- Industrial Partners
  - Cray, HP, Intel, MathWorks, NAG, SGI

You?

Improved Numerics

Improved accuracy at standard asymptotic speed; some are even faster!

- Iterative refinement for linear systems, least squares — Demmel / Hida / Kahan / Li / Mukherjee / Riedy / Sarkisyan
- Pivoting and scaling for symmetric systems, both definite and indefinite
- Jacobi SVD (and faster) — Drmač / Veselić
- Condition numbers and estimators — Higham / Cheng / Tisseur
- Useful approximate error estimates

Improved Performance

Improved performance with at least standard accuracy

- MRRR algorithm for eigenvalues, SVD — Parlett / Dhillon / Vömel / Marques / Willems / Katagiri
- Fast Hessenberg QR and QZ — Byers / Mathias / Braman, Kågström / Kressner
- Fast reductions and BLAS 2.5 — van de Geijn, Bischof / Lang, Howell / Fulton
- Recursive data layouts — Gustavson / Kågström / Elmroth / Jonsson
- Generalized SVD — Bai, Wang
- Polynomial roots from semi-separable form — Gu / Chandrasekaran / Zhu / Xia / Bindel / Garmire / Demmel
- Automated tuning, optimizations in ScaLAPACK, . . .

Improved Functionality

Algorithms

- Updating / downdating of factorizations — Stewart, Langou
- More generalized SVDs: products, CSD — Bai, Wang
- More generalized Sylvester and Lyapunov solvers — Kågström, Jonsson, Granat
- Quadratic eigenproblems — Mehrmann
- Matrix functions — Higham

Implementations

- Add “missing” features to ScaLAPACK
- Generate LAPACK and ScaLAPACK for higher precisions
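The precision-generation idea can be illustrated outside Fortran. Below is a minimal sketch (the helper `lu_solve` is hypothetical, not a LAPACK interface): one precision-generic Gaussian elimination whose single source text runs in hardware floats or exactly over rationals, mirroring the goal of generating one code for many precisions.

```python
from fractions import Fraction

def lu_solve(A, b):
    """Gaussian elimination with partial pivoting, written generically:
    any numeric type supporting +, -, *, / and abs works (float,
    Fraction, a quad-double class, ...)."""
    n = len(A)
    A = [row[:] for row in A]          # work on copies
    b = list(b)
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(A[i][k]))  # pivot row
        A[k], A[p] = A[p], A[k]
        b[k], b[p] = b[p], b[k]
        for i in range(k + 1, n):
            m = A[i][k] / A[k][k]
            for j in range(k, n):
                A[i][j] -= m * A[k][j]
            b[i] -= m * b[k]
    x = [None] * n
    for i in reversed(range(n)):       # back substitution
        s = b[i] - sum(A[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / A[i][i]
    return x

# The notoriously ill-conditioned Hilbert matrix, solved exactly:
n = 4
H = [[Fraction(1, i + j + 1) for j in range(n)] for i in range(n)]
ones = [Fraction(1)] * n
x = lu_solve(H, ones)
```

Over `Fraction` the residual is exactly zero; the same routine in `float` inherits ordinary rounding error.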


Improved Engineering and Community

Use new features without a rewrite

- Use modern Fortran 95, maybe 2003
  - DO . . . END DO, recursion, allocation (in wrappers)
- Provide higher-level wrappers for common languages
  - F95, C, C++
- Automatic generation of precisions, bindings
  - Full automation (FLAME, etc.) is not quite ready for all functions
- Tests for algorithms, implementations, installations

Open development

Need a community for long-term evolution: http://www.netlib.org/lapack-dev/

Lots of work to do, in both research and development.

Two Example Improvements

Recent, locally developed improvements

Improved Numerics

Iterative refinement for linear systems Ax = b:

- Extra precision ⇒ small error, dependable estimate
- Both normwise and componentwise
- (See LAWN 165 for full details.)

Improved Performance

MRRR algorithm for eigenvalue and SVD problems:

- Optimal complexity: O(n) per value/vector
- (See LAWNs 162, 163, 166, 167, . . . for more details.)

Numerics: Iterative Refinement

Improve the solution to Ax = b:

    Repeat: r = b − Ax,  dx = A⁻¹ r,  x = x + dx
    Until:  good enough
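The loop above can be sketched in a few lines of NumPy. This is an illustration, not LAPACK's routine (which factors A once, reuses the LU factors, and carries the residual in doubled precision): here the factorization and solves run in float32 as the working precision, while residuals accumulate in float64 as the extra precision.

```python
import numpy as np

def refine(A, b, iters=5):
    """Iterative refinement sketch: solve in float32 (working precision),
    accumulate residuals in float64 (extra precision). A production code
    would factor A once and reuse the factors for each correction."""
    A32 = A.astype(np.float32)
    x = np.linalg.solve(A32, b.astype(np.float32)).astype(np.float64)
    for _ in range(iters):
        r = b - A @ x                                # residual, extra precision
        dx = np.linalg.solve(A32, r.astype(np.float32))
        x = x + dx.astype(np.float64)                # x = x + dx
    return x

# A well-conditioned test problem (sizes are illustrative):
rng = np.random.default_rng(0)
n = 50
A = rng.standard_normal((n, n)) + n * np.eye(n)
x_true = rng.standard_normal(n)
b = A @ x_true

x32 = np.linalg.solve(A.astype(np.float32), b.astype(np.float32)).astype(np.float64)
x_ref = refine(A, b)
```

For a modest condition number the refined solution recovers roughly double-precision accuracy even though every factorization ran in single.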

Not-too-ill-conditioned ⇒ error O(√n ε)

[Two plots: normwise error E_norm vs. condition number κ_norm (2,000,000 cases).]

Numerics: Iterative Refinement (continued)

Dependable normwise relative error estimate

[Two plots: normwise error E_norm vs. its error bound B_norm (2,000,000 cases).]

Numerics: Iterative Refinement (continued)

Also small componentwise errors and dependable estimates

[Two plots: componentwise error E_comp vs. condition number κ_comp, and E_comp vs. its error bound B_comp (2,000,000 cases).]

Relying on Condition Numbers

Dependable error estimates need condition numbers. The trick is picking the right condition number and estimating it well.
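The precision issue can be seen with NumPy/SciPy (an illustration, not the LAPACK estimators themselves): once 1/κ falls below single-precision roundoff, a condition number computed in single precision is dominated by rounding noise and cannot be trusted.

```python
import numpy as np
from scipy.linalg import hilbert

H = hilbert(8)                  # kappa ~ 1.5e10, far beyond float32's reach
c64 = np.linalg.cond(H)         # double-precision estimate
c32 = np.linalg.cond(H.astype(np.float32))  # single precision: unreliable

# 1/kappa sits below float32 roundoff, so the single-precision SVD
# cannot resolve the smallest singular value at all:
eps32 = np.finfo(np.float32).eps
```

Comparing `c32` against `c64` shows wildly different values; only the double-precision estimate is meaningful here.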

[Plot: κ_norm estimated in single vs. double precision (2,000,000 cases), with the fraction of cases falling in each region of the plot.]

Performance: The MRRR Algorithm

Multiple Relatively Robust Representations

- 1999 Householder Award honorable mention for Dhillon
- Optimal complexity with small error!
  - O(nk) flops for k eigenvalues/vectors of an n × n tridiagonal matrix
  - Small residuals: ‖T x_i − λ_i x_i‖ = O(nε)
  - Orthogonal eigenvectors: |x_i^T x_j| = O(nε)
- Similar algorithm for the SVD.
- Eigenvectors are computed independently ⇒ naturally parallelizable
- (LAPACK release 3 had bugs and missing cases.)
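MRRR is what LAPACK's xSTEMR routines implement, and SciPy exposes them for symmetric tridiagonal problems. A small check on the 1-D Laplacian, whose eigenvalues 2 − 2 cos(kπ/(n+1)) are known in closed form:

```python
import numpy as np
from scipy.linalg import eigh_tridiagonal

n = 500
d = 2.0 * np.ones(n)        # diagonal of the 1-D Laplacian
e = -1.0 * np.ones(n - 1)   # off-diagonal
w, v = eigh_tridiagonal(d, e, lapack_driver='stemr')  # MRRR (xSTEMR)

T = np.diag(d) + np.diag(e, 1) + np.diag(e, -1)
resid = np.linalg.norm(T @ v - v * w)        # small residuals
ortho = np.linalg.norm(v.T @ v - np.eye(n))  # near-orthogonal eigenvectors
```

Both the residual and the departure from orthogonality come out at the O(nε) level the slide claims, and the eigenvalues match the closed form.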


Performance: The MRRR Algorithm

[Performance comparison plot. Legend note: “fast DC”: Wilkinson, deflate like crazy.]

Summary

- LAPACK and ScaLAPACK are open for improvement!
- Planned improvements in numerics, performance, functionality, and engineering.
- Forming a community for long-term development.