The Future of LAPACK and ScaLAPACK netlib/lapack-dev

59
The Future of LAPACK and ScaLAPACK www.netlib.org/lapack-dev Jim Demmel UC Berkeley 28 Sept 2005

description

The Future of LAPACK and ScaLAPACK www.netlib.org/lapack-dev. Jim Demmel UC Berkeley 28 Sept 2005. Outline. Motivation Participants Goals Better numerics (faster and more accurate algorithms) Expand contents (more functions, more parallel implementations) Automate performance tuning - PowerPoint PPT Presentation

Transcript of The Future of LAPACK and ScaLAPACK netlib/lapack-dev

Page 1: The Future of  LAPACK and ScaLAPACK netlib/lapack-dev

The Future of LAPACK and ScaLAPACK

www.netlib.org/lapack-dev

Jim DemmelUC Berkeley28 Sept 2005

Page 2: The Future of  LAPACK and ScaLAPACK netlib/lapack-dev

Outline• Motivation• Participants • Goals

1. Better numerics (faster and more accurate algorithms)2. Expand contents (more functions, more parallel

implementations)3. Automate performance tuning4. Improve ease of use5. Better maintenance and support6. Increase community involvement (this means you!)

• Questions for the audience• Selected Highlights• Concluding poem

Page 3: The Future of  LAPACK and ScaLAPACK netlib/lapack-dev

Motivation• LAPACK and ScaLAPACK are widely used

– Adopted by Cray, Fujitsu, HP, IBM, IMSL, MathWorks, NAG, NEC, SGI, …

– >52M web hits @ Netlib (incl. CLAPACK, LAPACK95)• Many ways to improve them, based on

– Own algorithmic research– Enthusiastic participation of research community– On-going user/vendor survey (url below)– Opportunities and demands of new architectures, programming

languages• New releases planned (NSF support)• Your feedback desired

– www.netlib.org/lapack-dev

Page 4: The Future of  LAPACK and ScaLAPACK netlib/lapack-dev

Participants• UC Berkeley:

– Jim Demmel, Ming Gu, W. Kahan, Beresford Parlett, Xiaoye Li, Osni Marques, Christof Voemel, David Bindel, Yozo Hida, Jason Riedy, Jianlin Xia, Jiang Zhu, undergrads…

• U Tennessee, Knoxville– Jack Dongarra, Victor Eijkhout, Julien Langou, Julie Langou, Piotr

Luszczek, Stan Tomov• Other Academic Institutions

– UT Austin, UC Davis, Florida IT, U Kansas, U Maryland, North Carolina SU, San Jose SU, UC Santa Barbara

– TU Berlin, FU Hagen, U Madrid, U Manchester, U Umeå, U Wuppertal, U Zagreb

• Research Institutions– CERFACS, LBL

• Industrial Partners – Cray, HP, Intel, MathWorks, NAG, SGI

Page 5: The Future of  LAPACK and ScaLAPACK netlib/lapack-dev

Goal 1 – Better Numerics• Fastest algorithm providing “standard” backward stability

– MRRR algorithm for symmetric eigenproblem / SVD: Parlett / Dhillon / Voemel / Marques / Willems

– Up to 10x faster HQR: Byers / Mathias / Braman– Extensions to QZ: Kågström / Kressner– Faster Hessenberg, tridiagonal, bidiagonal reductions:

van de Geijn, Bischof / Lang , Howell / Fulton– Recursive blocked layouts for packed formats:

Gustavson / Kågström / Elmroth / Jonsson/

Page 6: The Future of  LAPACK and ScaLAPACK netlib/lapack-dev

Goal 1 – Better Numerics• Fastest algorithm providing “standard” backward stability

– MRRR algorithm for symmetric eigenproblem / SVD: Parlett / Dhillon / Voemel / Marques / Willems

– Up to 10x faster HQR: Byers / Mathias / Braman– Extensions to QZ: Kågström / Kressner– Faster Hessenberg, tridiagonal, bidiagonal reductions:

van de Geijn, Bischof / Lang , Howell / Fulton– Recursive blocked layouts for packed formats: Gustavson /

Kågström / Elmroth / Jonsson/• New: Most accurate algorithm providing “standard” speed

– Iterative refinement for Ax=b, least squares• Assume availability of Extra Precise BLAS (Li/Hida/…)• www.netlib.org/blas/blast-forum/

– Jacobi SVD: Drmaĉ/Veselić

Page 7: The Future of  LAPACK and ScaLAPACK netlib/lapack-dev

Goal 1 – Better Numerics• Fastest algorithm providing “standard” backward stability

– MRRR algorithm for symmetric eigenproblem / SVD: Parlett / Dhillon / Voemel / Marques / Willems

– Up to 10x faster HQR: Byers / Mathias / Braman– Extensions to QZ: Kågström / Kressner– Faster Hessenberg, tridiagonal, bidiagonal reductions: van de

Geijn, Bischof / Lang , Howell / Fulton– Recursive blocked layouts for packed formats, Gustavson /

Kågström / Elmroth / Jonsson/• New: Most accurate algorithm providing “standard” speed

– Iterative refinement for Ax=b, least squares• Assume availability of Extra Precise BLAS (Li/Hida/…)• www.netlib.org/blas/blast-forum/

– Jacobi SVD: Drmaĉ/Veselić• Condition estimates for (almost) everything (ongoing)

Page 8: The Future of  LAPACK and ScaLAPACK netlib/lapack-dev

Goal 1 – Better Numerics• Fastest algorithm providing “standard” backward stability

– MRRR algorithm for symmetric eigenproblem / SVD: Parlett / Dhillon / Voemel / Marques / Willems

– Up to 10x faster HQR: Byers / Mathias / Braman– Extensions to QZ: Kågström / Kressner– Faster Hessenberg, tridiagonal, bidiagonal reductions: van de Geijn,

Bischof / Lang , Howell / Fulton– Recursive blocked layouts for packed formats, Gustavson / Kågström /

Elmroth / Jonsson/• New: Most accurate algorithm providing “standard” speed

– Iterative refinement for Ax=b, least squares• Assume availability of Extra Precise BLAS (Li/Hida/…)• www.netlib.org/blas/blast-forum/

– Jacobi SVD: Drmaĉ/Veselić• Condition estimates for (almost) everything (ongoing)• What is not fast or accurate enough?

Page 9: The Future of  LAPACK and ScaLAPACK netlib/lapack-dev

What goes into Sca/LAPACK?For all linear algebra problems

For all matrix structures

For all data types

For all programming interfaces

Produce best algorithm(s) w.r.t. performance and accuracy (including condition estimates, etc)

For all architectures and networks

Need to automate and prioritize!

Page 10: The Future of  LAPACK and ScaLAPACK netlib/lapack-dev

Goal 2 – Expanded Content• Make content of ScaLAPACK mirror LAPACK as much

as possible– Full Automation would be nice, not yet robust, general enough

• Telescoping languages, Bernoulli, Rose, FLAME, …

• New functions (examples)– Updating / downdating of factorizations: Stewart, Langou– More generalized SVDs: Bai , Wang– More generalized Sylvester/Lyapunov eqns:

Kågström, Jonsson, Granat– Structured eigenproblems

• O(n2) version of roots(p) – Gu, Chandrasekaran, Zhu et al• Selected matrix polynomials: Mehrmann

• How should we prioritize missing functions?

Page 11: The Future of  LAPACK and ScaLAPACK netlib/lapack-dev

Goal 3 – Automate Performance Tuning

• Not just BLAS• 1300 calls to ILAENV() to get block sizes, etc.

– Never been systematically tuned• Extend automatic tuning techniques of

ATLAS, etc. to these other parameters– Automation important as architectures evolve

• Convert ScaLAPACK data layouts on the fly• How important is peak performance?

Page 12: The Future of  LAPACK and ScaLAPACK netlib/lapack-dev

Goal 4: Improved Ease of Use

• Which do you prefer?

CALL PDGESV( N ,NRHS, A, IA, JA, DESCA, IPIV, B, IB, JB, DESCB, INFO)

A \ B

CALL PDGESVX( FACT, TRANS, N ,NRHS, A, IA, JA, DESCA, AF, IAF, JAF, DESCAF, IPIV, EQUED, R, C, B, IB, JB, DESCB, X, IX, JX, DESCX, RCOND, FERR, BERR, WORK, LWORK, IWORK, LIWORK, INFO)

Page 13: The Future of  LAPACK and ScaLAPACK netlib/lapack-dev

Goal 4: Improved Ease of Use, Software Engineering (1)

• Development versus Research– Development: practical approach to produce useful code– Research: If we could start over and do it right, how would we?

• Life after F77?– Fortran95, C, C++, Java, Matlab, Python, …

• Easy interfaces vs access to details– No universal agreement across systems on “easiest interface”– Leave decision to higher level packages– Keep expert driver / simple driver / computational routines

• Conclusion: Subset of F95 for core + wrappers for drivers– What subset?

• Recursion, for new data structures • Modules, to produce multiple precision versions• Environmental enquiries, to replace xLAMCH

– Wrappers for Fortran95, Java, Matlab, Python, … even for CLAPACK• Automatic memory allocation of workspace

Page 14: The Future of  LAPACK and ScaLAPACK netlib/lapack-dev

Goal 4: Improved Ease of Use, Software Engineering (2)

• Why not full F95 for core? – Would make interfacing to other languages and

packages harder– Some users want control over memory allocation

• Why not C for core?– High cost/benefit ratio for full rewrite– Performance

• Automation would be nice– Use Babel/SIDL to produce native looking interfaces

when possible

Page 15: The Future of  LAPACK and ScaLAPACK netlib/lapack-dev

Precisions beyond double (1)

• Range of designs possible– Just run in quad (or some other fixed precision)– Support codes like

#bits = 32Repeat #bits = 2* #bits Solve(A, b, x, error_bound, #bits)Until error_bound < tol

Page 16: The Future of  LAPACK and ScaLAPACK netlib/lapack-dev

Precisions beyond double (2)• Easiest approach – fixed precision

– Use F95 modules to produce any precision on request– Could use QD, ARPREC, GMP, …– Keep current memory allocation (twiddle GMP…)

• Next easier approach – maximum precision– Build maximum allowable precision on request– Pass in precision parameter, up to this amount

• More flexible (and difficult) approach– Choose any precision at run time– Dynamically allocate all variables

• Most aggressive approach– New algorithms that minimize work to get desired prec.

• What do users want? – Compatibility with symbolic manipulation systems?

Page 17: The Future of  LAPACK and ScaLAPACK netlib/lapack-dev

Goal 4: Improved Ease of Use, Software Engineering (3)

• Research Issues – May or may not impact development

• How to map multiple software layers to emerging architectural layers?

• Are emerging HPCS languages better?• How much can we automate? Do we keep having

to write Gaussian elimination over and over again?• Statistical modeling to limit performance tuning

costs, improve use of shared clusters

Page 18: The Future of  LAPACK and ScaLAPACK netlib/lapack-dev

Goal 5:Better Maintenance and Support

• Website for user feedback and requests• New developer and discussion forums• URL: www.netlib.org/lapack-dev

– Includes NSF proposal• Version control and bug tracking system• Automatic Build and Test environment• Wide variety of supported platforms• Cooperation with vendors• Anything else desired?

Page 19: The Future of  LAPACK and ScaLAPACK netlib/lapack-dev

Goal 6: Involve the Community• To help identify priorities

– More interesting tasks than we are funded to do– See www.netlib.org/lapack-dev for list

• To help identify promising algorithms– What have we missed?

• To help do the work– Bug reports, provide fixes– Again, more tasks than we are funded to do– Already happening: thank you!– We retain final decisions on content

• Anything else?

Page 20: The Future of  LAPACK and ScaLAPACK netlib/lapack-dev

Some Highlights• Putting more of LAPACK into ScaLAPACK• ScaLAPACK performance on 1D vs 2D grids• MRRR “Holy Grail” algorithm for symmetric EVD• Iterative Refinement for Ax=b• O(n2) polynomial root finder• Generalized SVD

Page 21: The Future of  LAPACK and ScaLAPACK netlib/lapack-dev

Missing Drivers in Sca/LAPACKLAPACK ScaLAPACK

Linear Equations

LUCholeskyLDLT

xGESVxPOSVxSYSV

PxGESVPxPOSVmissing

Least Squares (LS)

QRQR+pivotSVD/QRSVD/D&CSVD/MRRRQR + iterative refine.

xGELSxGELSYxGELSSxGELSDmissingmissing

PxGELSmissing drivermissing drivermissing (intent)missingmissing

Generalized LS LS + equality constr.Generalized LMAbove + Iterative ref.

xGGLSExGGGLMmissing

missingmissingmissing

Page 22: The Future of  LAPACK and ScaLAPACK netlib/lapack-dev

More missing driversLAPACK ScaLAPACK

Symmetric EVD QR / Bisection+InvitD&CMRRR

xSYEV / XxSYEVDxSYEVR

PxSYEV / Xmissing (intent)missing

Nonsymmetric EVD Schur formVectors too

xGEES / XxGEEV /X

missing drivermissing driver

SVD QRD&CMRRRJacobi

xGESVDxGESDDmissingmissing

PxGESVDmissing (intent)missingmissing

Generalized Symmetric EVD

QR / Bisection+InvitD&CMRRR

xSYGV / XxSYGVDmissing

PxSYGV / Xmissing (intent)missing

Generalized Nonsymmetric EVD

Schur formVectors too

xGGES / XxGGEV / X

missingmissing

Generalized SVD KogbetliantzMRRR

xGGSVDmissing

missing (intent)missing

Page 23: The Future of  LAPACK and ScaLAPACK netlib/lapack-dev

Missing matrix types in ScaLAPACK

• Symmetric, Hermitian, triangular– Band, Packed

• Positive Definite– Packed

• Orthogonal, Unitary– Packed

Page 24: The Future of  LAPACK and ScaLAPACK netlib/lapack-dev

0

10

20

30

40

50

60

70

80

90

100

seconds

10002000300040005000600070008000900010000

1x602x30

3x204x15

5x126x10

problem size

grid shape

Execution time of PDGESV for various grid shape

90-10080-9070-8060-7050-6040-5030-4020-3010-200-10

Times obtained on:

60 processors, Dual AMD Opteron 1.4GHz Cluster w/Myrinet Interconnect

2GB Memory

Speedups for using 2D processor grid range from 2x to 8x

Page 25: The Future of  LAPACK and ScaLAPACK netlib/lapack-dev

0.01

0.1

1

10

100

seconds

1000 4000 7000 1000problem size

Optimal grid (6x10) for PDGESVComparison between Computation and Redistribution of Data from Linear Grid

Calculation Time

RedistributionTime

Times obtained on:60 processors, Dual AMD Opteron 1.4GHz Cluster w/Myrinet Interconnect2GB Memory

Cost of redistributing matrix to optimal layout is small

Page 26: The Future of  LAPACK and ScaLAPACK netlib/lapack-dev

MRRR Algorithm for eig(tridiagonal) and svd(bidiagonal)

• “Multiple Relatively Robust Representation”• 1999 Householder Award honorable mention for Dhillon• O(nk) flops to find k eigenvalues/vectors of nxn tridiagonal matrix (similar for SVD)

– Minimum possible!• Naturally parallelizable• Accurate

– Small residuals || Txi – i xi || = O(n )– Orthogonal eigenvectors || xi

Txj || = O(n )• Hence nickname: “Holy Grail”• 2 versions

– LAPACK 3.0: large error on “hard” cases– Next release: fixed!

• How should we tradeoff speed and accuracy?

Page 27: The Future of  LAPACK and ScaLAPACK netlib/lapack-dev

Timing of Eigensolvers(1.2 GHz Athlon, only matrices where time > .1 sec)

Page 28: The Future of  LAPACK and ScaLAPACK netlib/lapack-dev

Timing of Eigensolvers(1.2 GHz Athlon, only matrices where time > .1 sec)

Page 29: The Future of  LAPACK and ScaLAPACK netlib/lapack-dev

Timing of Eigensolvers(1.2 GHz Athlon, only matrices where time > .1 sec)

Page 30: The Future of  LAPACK and ScaLAPACK netlib/lapack-dev

Timing of Eigensolvers(only matrices where time > .1 sec)

Page 31: The Future of  LAPACK and ScaLAPACK netlib/lapack-dev

Accuracy Results (old vs new Grail)maxi ||Tqi – i qi || / ( n ) || QQT – I || / (n )

Page 32: The Future of  LAPACK and ScaLAPACK netlib/lapack-dev

Accuracy Results (Grail vs QR vs DC)maxi ||Tqi – i qi || / ( n ) || QQT – I || / (n )

Page 33: The Future of  LAPACK and ScaLAPACK netlib/lapack-dev

More Accurate: Solve Ax=bConventional Gaussian Elimination With extra precise

iterative refinement

n1/2

Page 34: The Future of  LAPACK and ScaLAPACK netlib/lapack-dev

What’s new?• Need extra precision (beyond double)

– Part of new BLAS standard– Cost = O(n2) extra per right-hand-side, vs O(n3) to factor

• Get tiny componentwise bounds too– Error in xi small compared to |xi|, not just maxj |xj|

• “Guarantees” based on condition number estimates– No bad bounds in 6.2M tests– Different condition number for componentwise bounds– Traditional iterative refinement can “fail”– Only get “matrix close to singular” message when answer wrong?

• Extends to least squares

• Demmel, Kahan, Hida, Riedy, X. Li, Sarkisyan, …• LAPACK Working Note # 165

Page 35: The Future of  LAPACK and ScaLAPACK netlib/lapack-dev

Can condition estimators lie?

• Yes, but rarely, unless they cost as much as matrix multiply = cost of LU factorization– Demmel/Diament/Malajovich (FCM2001)

• But what if matrix multiply costs O(n2)?– Cohn/Umans/Kleinberg (FOCS 2003/5)

Page 36: The Future of  LAPACK and ScaLAPACK netlib/lapack-dev

New algorithm for roots(p)

• To find roots of polynomial p– Roots(p) does eig(C(p))– Costs O(n3), stable, reliable

• O(n2) Alternatives– Newton, Jenkins-Traub, Laguerre, …– Stable? Reliable?

• New: Exploit “semiseparable” structure of C(p)– Low rank of any submatrix of upper triangle of C(p) preserved under QR

iteration– Complexity drops from O(n3) to O(n2)

• Related work: Gemignani, Bini, Pan, et al• Ming Gu, Shiv Chandrasekaran, Jiang Zhu, Jianlin Xia, David Bindel, David

Garmire, Jim Demmel

-p1 -p2 … -pd

1 0 … 0 0 1 … 0 … … … … 0 … 1 0

C(p)=

Page 37: The Future of  LAPACK and ScaLAPACK netlib/lapack-dev

Properties of new roots(p)

• First (?) algorithm that– Is O(n2) – Is backward stable, in the matrix sense– Is backward stable, in the sense that the computed

roots are the exact roots of a slightly perturbed input polynomial

• Depends on balancing = scaling roots by a constant • Still need to automate choice of

• Byers, Mathias, Braman• Tisseur, Higham, Mackey …

Page 38: The Future of  LAPACK and ScaLAPACK netlib/lapack-dev

New GSVD Algorithm: Timing Comparisons 1.0GHz Itanium–2 (2GB RAM);

Intel's Math Kernel Library 7.2.1 (incl. BLAS, LAPACK)

Bai et al, UC Davis PSVD, CSD on the way

Page 39: The Future of  LAPACK and ScaLAPACK netlib/lapack-dev

Conclusions

• Lots to do in Dense Linear Algebra– New numerical algorithms– Continuing architectural challenges

• Parallelism, performance tuning

• Grant support, but success depends on contributions from community

Page 40: The Future of  LAPACK and ScaLAPACK netlib/lapack-dev

Extra Slides

Page 41: The Future of  LAPACK and ScaLAPACK netlib/lapack-dev

Downa Dating Should some equations be forgot when overdetermīned?Should some equations be forgot using hyperbolic sines?  With hyperbolic sines, my dear with hyperbolic sines, We’ll hope to get some boundedness with hyperbolic sines.

With apologies to Robert Burns & Nick Higham

Page 42: The Future of  LAPACK and ScaLAPACK netlib/lapack-dev

New GSVD Algorithm (XGGQSV)• UT A Z =∑a( 0 R ), VT B Z =∑b( 0 R ); A: M x N, B: P xN• Modified Van Loan's method

(1) Pre-processing: reveal rank of ( AT ; BT )T

(2) Split QR: reduce two upper triangular into one (3) CSD: Cosine-Sine Decomposition (4) Post-processing: assemble resulted matrices

• Workspace = (max(M, N, P)) + 5L2 where L = rank(B)• Outperforms current XGGSVD in LAPACK

Page 43: The Future of  LAPACK and ScaLAPACK netlib/lapack-dev

Profile of SGGQSV 1.0GHz Itanium–2 (2GB RAM);

Intel's Math Kernel Library 7.2.1 (incl. BLAS, LAPACK)

Page 44: The Future of  LAPACK and ScaLAPACK netlib/lapack-dev

Related New Routines• CSD (XORCSD)

– UTQ1Z =∑a, VTQ2Z =∑b

– Based on Von Loan's method (1985)– Dominated by the cost of SVD– Workspace =(Max(P+M, N))

• PSVD (XGGPSV)– UTABV = ∑– Based on work by Golub, Solna, Van Dooren (2000) – Workspace =(Max(M, P, N))

Page 45: The Future of  LAPACK and ScaLAPACK netlib/lapack-dev

Benchmark Details

• AMD 1.2 GHz Athlon, 2GB mem, Redhat + Intel compiler

• Compute all eigenvalues and eigenvectors of a symmetric tridiagonal matrix T

• Codes compared:– qr: QR iteration from LAPACK: dsteqr– dc: Cuppen’s Divide&Conquer from LAPACK: dstedc– gr: New implementation of MRRR algorithm (“Grail”)– ogr: MRRR from LAPACK 3.0: dstegr (“old Grail”)

Page 46: The Future of  LAPACK and ScaLAPACK netlib/lapack-dev

Timing of Eigensolvers(1.2 GHz Athlon, only matrices where time > .1 sec

Page 47: The Future of  LAPACK and ScaLAPACK netlib/lapack-dev

Timing of Eigensolvers(1.2 GHz Athlon, only matrices where time > .1 sec)

Page 48: The Future of  LAPACK and ScaLAPACK netlib/lapack-dev

More Accurate: Solve Ax=b

• Old idea: Use Newton’s method on f(x) = Ax-b– On a linear system? – Roundoff in Ax-b makes it interesting (“nonlinear”)– Iterative refinement

• Snyder, Wilkinson, Moler, Skeel, …

Repeat r = Ax-b … compute with extra precision Solve Ad = r … using LU factorization of A Update x = x – dUntil “accurate enough” or no progress

Page 49: The Future of  LAPACK and ScaLAPACK netlib/lapack-dev

0

5

10

15

20

25

30

35

40

seconds

10002000300040005000600070008000900010000

1x60

2x303x20

4x15

5x126x10

problem size

grid shape

Execution time of PDPOSV for various grid shapes

35-4030-3525-3020-2515-2010-155-100-5

Times obtained on:60 processors, Dual AMD Opteron 1.4GHz Cluster w/Myrinet Interconnect2GB Memory

Speedups from 1.33x to 11.5x

Page 50: The Future of  LAPACK and ScaLAPACK netlib/lapack-dev

0.010.1

110

100

seconds

1000 4000 7000 1000problem size

Optimal grid (6x10) for PDPOSVComparison between Computation and Redistribution of Data from Linear Grid

Calculation Time

RedistributionTime

Times obtained on:

60 processors, Dual Opteron 1.4GHz, 64-bit machine

2GB Memory, Myrinet Interconnect

Page 51: The Future of  LAPACK and ScaLAPACK netlib/lapack-dev

0

50

100

150

200

250

300

seconds

10002000300040005000600070008000900010000

1x602x30

3x204x15

5x126x10

problem size

grid shape

Execution time of PDGESV for various grid shapes

250-300200-250150-200100-15050-1000-50

Times obtained on:60 processors, Dual Intel Xeon 2.4GHz, IA-32 w/Gigabit Ethernet Interconnect2GB Memory

Speedups from 1.8x to 7.5x

Page 52: The Future of  LAPACK and ScaLAPACK netlib/lapack-dev

0.1

1

10

100

seconds

1000 4000 7000 1000problem size

Optimal grid (4x15) for PDGESVComparison between Computation and Redistribution of Data from Linear Grid

Calculation Time

RedistributionTime

Times obtained on:60 processors, Dual Intel Xeon 2.4GHz, IA-32 w/Gigabit Ethernet Interconnect2GB Memory

Page 53: The Future of  LAPACK and ScaLAPACK netlib/lapack-dev

0

10

20

30

40

50

60

70

80

seconds

10002000300040005000600070008000900010000

1x60

2x30

3x204x15

5x12

6x10

problem size

grid shape

Execution time of PDPOSV for various grid shapes

70-8060-7050-6040-5030-4020-3010-200-10

Times obtained on:60 processors, Dual Intel Xeon 2.4GHz, IA-32 w/Gigabit Ethernet Interconnect2GB Memory

Speedups from 2.2x to 6x

Page 54: The Future of  LAPACK and ScaLAPACK netlib/lapack-dev

0.1

1

10

100

seconds

1000 4000 7000 1000problem size

Optimal grid (6x10) for PDPOSV Comparison between Computation and Redistribution of Data from Linear Grid

Calculation Time

RedistributionTime

Times obtained on:

60 processors, Dual Intel Xeon 2.4GHz, IA-32 w/Gigabit Ethernet Interconnect

2GB Memory

Page 55: The Future of  LAPACK and ScaLAPACK netlib/lapack-dev

0

50

100

150

200

250

300

350

400

450

seconds

10002000300040005000600070008000900010000

1x602x30

3x204x15

5x126x10

problem size

grid shape

Execution time of PDGEHRD for various grid shape

400-450350-400300-350250-300200-250150-200100-15050-1000-50

Times obtained on:

60 processors, IBM SP RS/6000, 16 shared memory (16 Gbytes) processors per node, 4 nodes connected with high-bandwidth low-latency switching network

Block size: 64

Speedups from 2.2x to 2.8x

Page 56: The Future of  LAPACK and ScaLAPACK netlib/lapack-dev

0.010.1

110

1001000

seconds10

00

4000

7000

1000

problem size

Optimal grid (6x10) for PDGEHRDComparison between Computation and Redistribution of Data from Linear Grid

Calculation Time

RedistributionTime

Times obtained on:

60 processors, IBM SP RS/6000, 16 shared memory (16 Gbytes) processors per node, 4 nodes connected with high-bandwidth low-latency switching network

Block size: 64

Page 57: The Future of  LAPACK and ScaLAPACK netlib/lapack-dev

The Future of LAPACK and ScaLAPACK

www.cs.berkeley.edu/~demmel/Sca-LAPACK-Proposal.pdf

Jim DemmelUC Berkeley

Householder XVI

Page 58: The Future of  LAPACK and ScaLAPACK netlib/lapack-dev

OO-like interfaces (VE)• Aim: native looking interfaces in modern

languages (C++, F90, Java, Python…)• Overloading to have identical interfaces for all

data types (precision/storage)• Overloading to omit certain parameters (stride,

transpose,…)• Expose data only when necessary (automatic

allocation of temporaries, pivoting permutation,….)• Mechanism: Use Babel/SIDL to interface to

existing F77 code base

Page 59: The Future of  LAPACK and ScaLAPACK netlib/lapack-dev

Example of HLL interface (VE)// C++ example Lapack::PDTridiagonalMatrix A = Lapack::PDTridiagonalMatrix::_create(); x = (double*) malloc(SIZE*sizeof(double)); y = (double*) malloc(SIZE*sizeof(double)); A.MatVec(x,y); // or A.MatVec(MatTranspose,x,y); A.Factor(); A.Solve(y);

! F90 example! Create matrix from user arrayallocate(data_array(20,25)) ! Fill in the array….call CreateWithArray(A,data_array)! Retrieve the array of a (factored) matrixcall GetArray(A,mat)mat_array => mat%d_dataprint *,'pivots',mat_array(0,0),mat_array(1,1)

Lots of details omitted here!