Future of LAPACK and ScaLAPACK


Progress and directions for LAPACK and ScaLAPACK as of 2005. See LAPACK at netlib.


The Future of LAPACK and ScaLAPACK

Jason Riedy, Yozo Hida, James Demmel

EECS Department, University of California, Berkeley

November 18, 2005

Outline

Survey responses: What users want

Improving LAPACK and ScaLAPACK
- Improved Numerics
- Improved Performance
- Improved Functionality
- Improved Engineering and Community

Two Example Improvements
- Numerics: Iterative Refinement for Ax = b
- Performance: The MRRR Algorithm for Ax = λx

2 / 16

Survey: What users want

- Survey available from http://www.netlib.org/lapack-dev/.

- 212 responses, over 100 different, non-anonymous groups

- Problem sizes:

  Size:   100   1K    10K   100K   1M    (other)
  Share:  8%    26%   24%   12%    6%    (24%)

- >80% interested in small-medium SMPs

- >40% interested in large distributed-memory systems

- Vendor libs seen as faster, buggier

- Over 20% want > double precision, 70% out-of-core

- Requests: high-level interfaces, low-level interfaces, parallel redistribution∗, and tuning

3 / 16

Participants

- UC Berkeley
  - Jim Demmel, Ming Gu, W. Kahan, Beresford Parlett, Xiaoye Li,
    Osni Marques, Christof Vomel, David Bindel, Yozo Hida,
    Jason Riedy, Jianlin Xia, Jiang Zhu, undergrads, . . .
- U Tennessee, Knoxville
  - Jack Dongarra, Julien Langou, Julie Langou, Piotr Luszczek,
    Stan Tomov, . . .
- Other Academic Institutions
  - UT Austin, UC Davis, U Kansas, U Maryland, North Carolina SU,
    San Jose SU, UC Santa Barbara, TU Berlin, FU Hagen, U Madrid,
    U Manchester, U Umea, U Wuppertal, U Zagreb
- Research Institutions
  - CERFACS, LBL, UEC (Japan)
- Industrial Partners
  - Cray, HP, Intel, MathWorks, NAG, SGI

You?

4 / 16

Improved Numerics

Improved accuracy with standard asymptotic speed: some are faster!

- Iterative refinement for linear systems, least squares
  Demmel / Hida / Kahan / Li / Mukherjee / Riedy / Sarkisyan

- Pivoting and scaling for symmetric systems
  - Definite and indefinite

- Jacobi SVD (and faster) — Drmac / Veselic

- Condition numbers and estimators
  Higham / Cheng / Tisseur

- Useful approximate error estimates

5 / 16

Improved Performance

Improved performance with at least standard accuracy

- MRRR algorithm for eigenvalues, SVD
  Parlett / Dhillon / Vomel / Marques / Willems / Katagiri

- Fast Hessenberg QR & QZ
  Byers / Mathias / Braman, Kagstrom / Kressner

- Fast reductions and BLAS2.5
  van de Geijn, Bischof / Lang, Howell / Fulton

- Recursive data layouts
  Gustavson / Kagstrom / Elmroth / Jonsson

- Generalized SVD — Bai, Wang

- Polynomial roots from semi-separable form
  Gu / Chandrasekaran / Zhu / Xia / Bindel / Garmire / Demmel

- Automated tuning, optimizations in ScaLAPACK, . . .

6 / 16

Improved Functionality

Algorithms

- Updating / downdating factorizations — Stewart, Langou

- More generalized SVDs: products, CSD — Bai, Wang

- More generalized Sylvester, Lyapunov solvers
  Kagstrom, Jonsson, Granat

- Quadratic eigenproblems — Mehrmann

- Matrix functions — Higham

Implementations

- Add “missing” features to ScaLAPACK

- Generate LAPACK, ScaLAPACK for higher precisions

7 / 16

Improved Engineering and Community

Use new features without a rewrite

- Use modern Fortran 95, maybe 2003
  - DO . . . END DO, recursion, allocation (in wrappers)

- Provide higher-level wrappers for common languages
  - F95, C, C++

- Automatic generation of precisions, bindings
  - Full automation (FLAME, etc.) not quite ready for all functions

- Tests for algorithms, implementations, installations

Open development

Need a community for long-term evolution.
http://www.netlib.org/lapack-dev/

Lots of work to do, research and development.

8 / 16

Two Example Improvements

Recent, locally developed improvements

Improved Numerics

Iterative refinement for linear systems Ax = b:

- Extra precision ⇒ small error, dependable estimate

- Both normwise and componentwise

- (See LAWN 165 for full details.)

Improved Performance

MRRR algorithm for eigenvalue, SVD problems

- Optimal complexity: O(n) per value/vector

- (See LAWNs 162, 163, 166, 167, . . . for more details.)

9 / 16

Numerics: Iterative Refinement
Improve solution to Ax = b

Repeat: r = b − Ax, dx = A⁻¹r, x = x + dx

Until: good enough

Not-too-ill-conditioned ⇒ error O(√n ε)
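The loop above can be sketched in a few lines of Python. This is an illustration of the idea only, not LAPACK code: `solve2` and `chop` are hypothetical helpers, and rounding to four significant digits stands in for a lower-precision factorization while the residual is computed in full double precision.

```python
# Sketch of the refinement loop on a 2x2 system; solve2 and chop are
# hypothetical helpers, not LAPACK routines. chop rounds to 4 significant
# digits to mimic a low-precision solve, while the residual r is computed
# in full double precision -- the key to why refinement works.

def solve2(A, b):
    # Cramer's rule, fine at this size
    det = A[0][0]*A[1][1] - A[0][1]*A[1][0]
    return [(b[0]*A[1][1] - b[1]*A[0][1]) / det,
            (A[0][0]*b[1] - A[1][0]*b[0]) / det]

def chop(v, digits=4):
    # keep only `digits` significant digits of each entry
    return [float(("%%.%dg" % digits) % e) for e in v]

A = [[4.0, 3.0], [6.0, 3.0]]
b = [10.0, 13.0]               # exact solution: x = [3/2, 4/3]
x = chop(solve2(A, b))         # "working precision" solve: x[1] == 1.333
for _ in range(3):
    r = [b[i] - sum(A[i][j]*x[j] for j in range(2)) for i in range(2)]
    dx = chop(solve2(A, r))    # correction from the same cheap solver
    x = [x[i] + dx[i] for i in range(2)]
# x now agrees with [1.5, 4/3] to far better than the 4-digit solver alone
```

Each pass cuts the error by roughly the rounding level of the cheap solver, precisely because the residual is formed more accurately than the solve.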

[Figure: two scatter plots of log10 Enorm against log10 κnorm, each titled “Normwise error vs. condition number κnorm” (2,000,000 cases).]

10 / 16

Numerics: Iterative Refinement
Improve solution to Ax = b

Repeat: r = b − Ax, dx = A⁻¹r, x = x + dx

Until: good enough

Dependable normwise relative error estimate

[Figure: two scatter plots of log10 Bnorm against log10 Enorm, each titled “Normwise Error vs. Bound” (2,000,000 cases).]

11 / 16

Numerics: Iterative Refinement
Improve solution to Ax = b

Repeat: r = b − Ax, dx = A⁻¹r, x = x + dx

Until: good enough

Also small componentwise errors and dependable estimates
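A tiny sketch (the function names here are ours, not LAPACK's) of why both measures matter: a large relative error in a tiny component is invisible normwise but caught componentwise.

```python
# Two error measures for a computed solution xhat vs. a reference x.
# Normwise error divides by the norm of the whole vector; componentwise
# error takes the worst relative error entry by entry.

def norm_inf(v):
    return max(abs(e) for e in v)

def err_normwise(xhat, x):
    return norm_inf([a - b for a, b in zip(xhat, x)]) / norm_inf(x)

def err_componentwise(xhat, x):
    return max(abs(a - b) / abs(b) for a, b in zip(xhat, x))

x    = [1.0, 1e-6]
xhat = [1.0, 2e-6]   # the tiny component is off by 100%
en = err_normwise(xhat, x)        # ~1e-6: looks fine normwise
ec = err_componentwise(xhat, x)   # ~1.0: caught componentwise
```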

[Figure: scatter plots titled “Componentwise error vs. condition number κcomp” and “Componentwise Error vs. Bound” (2,000,000 cases).]

12 / 16

Relying on Condition Numbers

Need condition numbers for dependable estimates.

The challenge is picking the right condition number and estimating it well.
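For intuition, κ∞ can be computed exactly for a 2 × 2 matrix. This is a toy sketch of what the condition number measures; LAPACK's estimators avoid forming A⁻¹ explicitly.

```python
# Toy condition number kappa_inf(A) = ||A||_inf * ||A^{-1}||_inf for a
# 2x2 matrix, with the inverse formed explicitly (fine at this size,
# ruinous in general).

def cond_inf_2x2(A):
    det = A[0][0]*A[1][1] - A[0][1]*A[1][0]
    Ainv = [[ A[1][1]/det, -A[0][1]/det],
            [-A[1][0]/det,  A[0][0]/det]]
    ninf = lambda M: max(abs(row[0]) + abs(row[1]) for row in M)
    return ninf(A) * ninf(Ainv)

k_id   = cond_inf_2x2([[1.0, 0.0], [0.0, 1.0]])    # 1.0 for the identity
k_near = cond_inf_2x2([[1.0, 1.0], [1.0, 1.0001]]) # ~4e4: nearly singular
```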

[Figure: scatter plot titled “κnorm: single vs. double precision” (2,000,000 cases), comparing log10 κnorm computed in single and in double precision.]

13 / 16

Performance: The MRRR Algorithm

Multiple Relatively Robust Representations

- 1999 Householder Award honorable mention for Dhillon

- Optimal complexity with small error!
  - O(nk) flops for k eigenvalues/vectors of an n × n tridiagonal matrix
  - Small residuals: ‖T x_i − λ_i x_i‖ = O(nε)
  - Orthogonal eigenvectors: |x_i^T x_j| = O(nε)

- Similar algorithm for SVD.

- Eigenvectors computed independently ⇒ naturally parallelizable

- (LAPACK release 3 had bugs, missing cases)
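Both quality criteria are easy to check numerically. A toy sketch on a 2 × 2 symmetric matrix [[a, b], [b, c]] with closed-form eigenpairs; `eig2` is our helper, not an MRRR implementation.

```python
import math

# Check the two MRRR quality criteria -- small residual and orthogonal
# eigenvectors -- on [[a, b], [b, c]] using the closed-form eigenpairs.

def eig2(a, b, c):
    # eigenpairs of [[a, b], [b, c]]; assumes b != 0
    mid, rad = (a + c) / 2.0, math.hypot((a - c) / 2.0, b)
    pairs = []
    for lam in (mid + rad, mid - rad):
        vx, vy = b, lam - a          # eigenvector direction: T v = lam v
        n = math.hypot(vx, vy)
        pairs.append((lam, (vx / n, vy / n)))
    return pairs

a, b, c = 2.0, 1.0, 3.0
(l1, x1), (l2, x2) = eig2(a, b, c)
# residual ||T x - lambda x|| for the first eigenpair
res = math.hypot(a*x1[0] + b*x1[1] - l1*x1[0],
                 b*x1[0] + c*x1[1] - l1*x1[1])
# orthogonality |x1^T x2|
dot = abs(x1[0]*x2[0] + x1[1]*x2[1])
# both come out at roundoff level, matching the O(n eps) bounds above
```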

14 / 16

Performance: The MRRR Algorithm

[Performance plot not preserved. Caption: “fast DC”: Wilkinson, deflate like crazy]

15 / 16

Summary

- LAPACK and ScaLAPACK are open for improvement!

- Planned improvements in
  - numerics,
  - performance,
  - functionality, and
  - engineering.

- Forming a community for long-term development.

16 / 16