![Page 1: The Future of LAPACK and ScaLAPACK netlib/lapack-dev](https://reader036.fdocuments.in/reader036/viewer/2022081418/568149ab550346895db6e99b/html5/thumbnails/1.jpg)
The Future of LAPACK and ScaLAPACK
www.netlib.org/lapack-dev
Jim Demmel
UC Berkeley
23 Feb 2007
![Page 2: The Future of LAPACK and ScaLAPACK netlib/lapack-dev](https://reader036.fdocuments.in/reader036/viewer/2022081418/568149ab550346895db6e99b/html5/thumbnails/2.jpg)
Outline
• Motivation for new Sca/LAPACK
• Challenges (or research opportunities…)
• Goals of new Sca/LAPACK
• Highlights of progress
  – With some excursions …
![Page 3: The Future of LAPACK and ScaLAPACK netlib/lapack-dev](https://reader036.fdocuments.in/reader036/viewer/2022081418/568149ab550346895db6e99b/html5/thumbnails/3.jpg)
Motivation
• LAPACK and ScaLAPACK are widely used
  – Adopted by Cray, Fujitsu, HP, IBM, IMSL, MathWorks, NAG, NEC, SGI, …
  – >68M web hits @ Netlib (incl. CLAPACK, LAPACK95)
    • 35K hits/day
• Many ways to improve them, based on
  – Own algorithmic research
  – Enthusiastic participation of research community
  – User/vendor survey
  – Opportunities and demands of new architectures, programming languages
• New releases planned (NSF support)
![Page 4: The Future of LAPACK and ScaLAPACK netlib/lapack-dev](https://reader036.fdocuments.in/reader036/viewer/2022081418/568149ab550346895db6e99b/html5/thumbnails/4.jpg)
Participants
• UC Berkeley:
  – Jim Demmel, Ming Gu, W. Kahan, Beresford Parlett, Xiaoye Li, Osni Marques, Christof Voemel, David Bindel, Yozo Hida, Jason Riedy, undergrads…
• U Tennessee, Knoxville
  – Jack Dongarra, Julien Langou, Julie Langou, Piotr Luszczek, Stan Tomov, Alfredo Buttari, Jakub Kurzak
• Other Academic Institutions
  – UT Austin, UC Davis, Florida IT, U Kansas, U Maryland, North Carolina SU, San Jose SU, UC Santa Barbara
  – TU Berlin, U Electrocomm. (Japan), FU Hagen, U Carlos III Madrid, U Manchester, U Umeå, U Wuppertal, U Zagreb
• Research Institutions
  – CERFACS, LBL
• Industrial Partners
  – Cray, HP, Intel, Interactive Supercomputing, MathWorks, NAG, SGI
![Page 5: The Future of LAPACK and ScaLAPACK netlib/lapack-dev](https://reader036.fdocuments.in/reader036/viewer/2022081418/568149ab550346895db6e99b/html5/thumbnails/5.jpg)
Challenges: Increasing Parallelism
In the Top500:
[Figure: number of Top500 systems (0 to 500) by processor-count range (1 up to 64k-128k), 1993 to 2005]
In your laptop (Intel just announced an 80-core, 1 Teraflop chip)
![Page 6: The Future of LAPACK and ScaLAPACK netlib/lapack-dev](https://reader036.fdocuments.in/reader036/viewer/2022081418/568149ab550346895db6e99b/html5/thumbnails/6.jpg)
Challenges
• Increasing parallelism
• Exponentially growing gaps between
  – Floating point time << 1/Memory BW << Memory Latency
    • Improving 59%/year vs 23%/year vs 5.5%/year
  – Floating point time << 1/Network BW << Network Latency
    • Improving 59%/year vs 26%/year vs 15%/year
• Heterogeneity (performance and semantics)
• Asynchrony
• Unreliability
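To make "exponentially growing gaps" concrete, the annual improvement rates quoted above can be compounded. The sketch below is only arithmetic on those rates; the helper name `gap_after` is hypothetical and the ten-year horizon is an arbitrary choice for illustration.

```python
def gap_after(years, fast_rate, slow_rate):
    """Factor by which the faster-improving quantity pulls ahead after `years`.

    Rates are annual improvements expressed as fractions (0.59 means 59%/year).
    """
    return ((1.0 + fast_rate) / (1.0 + slow_rate)) ** years


# Rates taken from the slide: flops, memory bandwidth, memory latency.
FLOPS, MEM_BW, MEM_LAT = 0.59, 0.23, 0.055

for years in (1, 5, 10):
    print(
        years,
        round(gap_after(years, FLOPS, MEM_BW), 1),   # flops vs. memory bandwidth
        round(gap_after(years, FLOPS, MEM_LAT), 1),  # flops vs. memory latency
    )
```

Under these rates the flop rate outgrows memory bandwidth by roughly 13x and memory latency by roughly 60x over ten years, which is why algorithms that move less data matter more each year.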
![Page 7: The Future of LAPACK and ScaLAPACK netlib/lapack-dev](https://reader036.fdocuments.in/reader036/viewer/2022081418/568149ab550346895db6e99b/html5/thumbnails/7.jpg)
What do users want?
• High performance, ease of use, …
• Survey results at www.netlib.org/lapack-dev
  – Small but interesting sample
  – What matrix sizes do you care about?
    • 1000s: 34%
    • 10,000s: 26%
    • 100,000s or 1Ms: 26%
  – How many processors, on distributed memory?
    • >10: 34%, >100: 31%, >1000: 19%
  – Do you use more than double precision?
    • Sometimes or frequently: 16%
  – Would Automatic Memory Allocation help?
    • Very useful: 72%, Not useful: 14%
![Page 8: The Future of LAPACK and ScaLAPACK netlib/lapack-dev](https://reader036.fdocuments.in/reader036/viewer/2022081418/568149ab550346895db6e99b/html5/thumbnails/8.jpg)
Goals of next Sca/LAPACK
1. Better algorithms
  – Faster, more accurate
2. Expand contents
  – More functions, more parallel implementations
3. Automate performance tuning
4. Improve ease of use
5. Better software engineering
6. Increased community involvement
![Page 9: The Future of LAPACK and ScaLAPACK netlib/lapack-dev](https://reader036.fdocuments.in/reader036/viewer/2022081418/568149ab550346895db6e99b/html5/thumbnails/9.jpg)
Goal 2 – Expanded Content
• Make content of ScaLAPACK mirror LAPACK as much as possible (possible class projects)
![Page 10: The Future of LAPACK and ScaLAPACK netlib/lapack-dev](https://reader036.fdocuments.in/reader036/viewer/2022081418/568149ab550346895db6e99b/html5/thumbnails/10.jpg)
Missing Routines in Sca/LAPACK

| Problem | Algorithm | LAPACK | ScaLAPACK |
|---|---|---|---|
| Linear Equations | LU | xGESV | PxGESV |
| | LU + iterative refine | xGESVX | missing |
| | Cholesky | xPOSV | PxPOSV |
| | LDL^T | xSYSV | missing |
| Least Squares (LS) | QR | xGELS | PxGELS |
| | QR+pivot | xGELSY | missing |
| | SVD/QR | xGELSS | missing |
| | SVD/D&C | xGELSD | missing (intent?) |
| | SVD/MRRR | missing | missing |
| | QR + iterative refine. | missing | missing |
| Generalized LS | LS + equality constr. | xGGLSE | missing |
| | Generalized LM | xGGGLM | missing |
| | Above + Iterative ref. | missing | missing |
![Page 11: The Future of LAPACK and ScaLAPACK netlib/lapack-dev](https://reader036.fdocuments.in/reader036/viewer/2022081418/568149ab550346895db6e99b/html5/thumbnails/11.jpg)
More missing routines

| Problem | Algorithm | LAPACK | ScaLAPACK |
|---|---|---|---|
| Symmetric EVD | QR / Bisection+Invit | xSYEV / X | PxSYEV / X |
| | D&C | xSYEVD | PxSYEVD |
| | MRRR | xSYEVR | missing |
| Nonsymmetric EVD | Schur form | xGEES / X | missing (driver) |
| | Vectors too | xGEEV / X | missing |
| SVD | QR | xGESVD | PxGESVD |
| | D&C | xGESDD | missing (intent?) |
| | MRRR | missing | missing |
| | Jacobi | missing | missing |
| Generalized Symmetric EVD | QR / Bisection+Invit | xSYGV / X | PxSYGV / X |
| | D&C | xSYGVD | missing (intent?) |
| | MRRR | missing | missing |
| Generalized Nonsymmetric EVD | Schur form | xGGES / X | missing |
| | Vectors too | xGGEV / X | missing |
| Generalized SVD | Kogbetliantz | xGGSVD | missing (intent) |
| | MRRR | missing | missing |
![Page 12: The Future of LAPACK and ScaLAPACK netlib/lapack-dev](https://reader036.fdocuments.in/reader036/viewer/2022081418/568149ab550346895db6e99b/html5/thumbnails/12.jpg)
Goal 1: Better Algorithms
• Faster
  – But provide “usual” accuracy, stability
  – … Or accurate for an important subclass
• More accurate
  – But provide “usual” speed
  – … Or at any cost
![Page 13: The Future of LAPACK and ScaLAPACK netlib/lapack-dev](https://reader036.fdocuments.in/reader036/viewer/2022081418/568149ab550346895db6e99b/html5/thumbnails/13.jpg)
Goal 1a – Faster Algorithms (Highlights)
• MRRR algorithm for symmetric eigenproblem / SVD:
  – Parlett / Dhillon / Voemel / Marques / Willems
• Up to 10x faster HQR:
  – Byers / Mathias / Braman
• Faster Hessenberg, tridiagonal, bidiagonal reductions:
  – van de Geijn/Quintana, Howell / Fulton, Bischof / Lang
• Extensions to QZ:
  – Kågström / Kressner
• Recursive blocked layouts for packed formats:
  – Gustavson / Kågström / Elmroth / Jonsson/
![Page 14: The Future of LAPACK and ScaLAPACK netlib/lapack-dev](https://reader036.fdocuments.in/reader036/viewer/2022081418/568149ab550346895db6e99b/html5/thumbnails/14.jpg)
Goal 1a – Faster Algorithms (Highlights)
• MRRR algorithm for symmetric eigenproblem / SVD:
  – Parlett / Dhillon / Voemel / Marques / Willems
  – Faster and more accurate than previous algorithms
  – SIAM SIAG/LA Prize in 2006
  – New sequential, first parallel versions out in 2006
![Page 15: The Future of LAPACK and ScaLAPACK netlib/lapack-dev](https://reader036.fdocuments.in/reader036/viewer/2022081418/568149ab550346895db6e99b/html5/thumbnails/15.jpg)
Flop Counts of Eigensolvers (2.2 GHz Opteron + ACML)
![Page 19: The Future of LAPACK and ScaLAPACK netlib/lapack-dev](https://reader036.fdocuments.in/reader036/viewer/2022081418/568149ab550346895db6e99b/html5/thumbnails/19.jpg)
Flop Count Ratios of Eigensolvers (2.2 GHz Opteron + ACML)
![Page 20: The Future of LAPACK and ScaLAPACK netlib/lapack-dev](https://reader036.fdocuments.in/reader036/viewer/2022081418/568149ab550346895db6e99b/html5/thumbnails/20.jpg)
Run Time Ratios of Eigensolvers (2.2 GHz Opteron + ACML)
![Page 21: The Future of LAPACK and ScaLAPACK netlib/lapack-dev](https://reader036.fdocuments.in/reader036/viewer/2022081418/568149ab550346895db6e99b/html5/thumbnails/21.jpg)
MFlop Rates of Eigensolvers (2.2 GHz Opteron + ACML)
![Page 22: The Future of LAPACK and ScaLAPACK netlib/lapack-dev](https://reader036.fdocuments.in/reader036/viewer/2022081418/568149ab550346895db6e99b/html5/thumbnails/22.jpg)
Exploiting GPUs
• Numerous emerging co-processors
  – Cell, SSE, Grape, GPU, “physics coprocessor,” …
• When can we exploit them?
  – Little help if memory is bottleneck
  – Various attempts to use GPUs for dense linear algebra
• Bisection on GPUs for symmetric tridiagonal eigenproblem
  – Evaluate Count(x) = #(evals < x) for many x
  – Very little memory traffic
  – Speedups up to 100x (Volkov)
    • 43 Gflops on ATI Radeon X1900 vs running on 2.8 GHz Pentium 4
    • Overall eigenvalue solver 6.8x faster
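The Count(x) kernel mentioned above can be sketched in a few lines. This is a minimal serial illustration using the standard Sturm-sequence (LDL^T pivot sign) recurrence, not the GPU code referred to on the slide; the names `count_below` and `kth_eigenvalue`, the zero-pivot guard value, and the fixed iteration count are assumptions made for the example.

```python
def count_below(d, e, x):
    """Count(x): number of eigenvalues less than x of the symmetric tridiagonal
    matrix with diagonal d (length n) and off-diagonal e (length n-1).

    Counts the negative pivots of the LDL^T factorization of T - x*I.
    """
    count = 0
    q = 1.0
    for i in range(len(d)):
        off = e[i - 1] ** 2 if i > 0 else 0.0
        q = d[i] - x - off / q
        if abs(q) < 1e-300:      # guard against a zero pivot (assumed threshold)
            q = -1e-300
        if q < 0.0:
            count += 1
    return count


def kth_eigenvalue(d, e, k, iters=100):
    """Bisection for the k-th smallest eigenvalue (k is 0-based)."""
    n = len(d)
    # Gershgorin discs give an interval that contains every eigenvalue.
    radius = [
        (abs(e[i - 1]) if i > 0 else 0.0) + (abs(e[i]) if i < n - 1 else 0.0)
        for i in range(n)
    ]
    lo = min(d[i] - radius[i] for i in range(n))
    hi = max(d[i] + radius[i] for i in range(n))
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if count_below(d, e, mid) > k:   # eigenvalue k lies below mid
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)


# Example: the 3x3 matrix with 2 on the diagonal and -1 off the diagonal.
d, e = [2.0, 2.0, 2.0], [-1.0, -1.0]
print([count_below(d, e, x) for x in (0.0, 1.5, 2.5, 4.0)])
print(kth_eigenvalue(d, e, 0))
```

Each evaluation reads only the 2n-1 matrix entries and does O(n) arithmetic, and evaluations at different x are independent, which is consistent with the slide's point about low memory traffic and many simultaneous shifts.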
![Page 23: The Future of LAPACK and ScaLAPACK netlib/lapack-dev](https://reader036.fdocuments.in/reader036/viewer/2022081418/568149ab550346895db6e99b/html5/thumbnails/23.jpg)
Parallel Runtimes of Eigensolvers (2.4 GHz Xeon Cluster + Ethernet)
![Page 24: The Future of LAPACK and ScaLAPACK netlib/lapack-dev](https://reader036.fdocuments.in/reader036/viewer/2022081418/568149ab550346895db6e99b/html5/thumbnails/24.jpg)
Goal 1b – More Accurate Algorithms
• Iterative refinement for Ax=b, least squares
  – “Promise” the right answer for $O(n^2)$ additional cost
• Jacobi-based SVD
  – Faster than QR, can be arbitrarily more accurate
• Arbitrary precision versions of everything
  – Using your favorite multiple precision package
![Page 25: The Future of LAPACK and ScaLAPACK netlib/lapack-dev](https://reader036.fdocuments.in/reader036/viewer/2022081418/568149ab550346895db6e99b/html5/thumbnails/25.jpg)
Goal 1b – More Accurate Algorithms
• Iterative refinement for Ax=b, least squares
  – “Promise” the right answer for $O(n^2)$ additional cost
  – Iterative refinement with extra-precise residuals
    • Newton’s method applied to solving f(x) = A*x-b = 0
  – Extra-precise BLAS needed (LAWN#165)
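The connection between Newton's method and iterative refinement can be written out as follows. This is the standard textbook derivation, stated here as a reading aid; the iterate index $k$ and the names $r_k$, $d_k$ are notation introduced for the example.

```latex
% Newton's method for f(x) = Ax - b, whose Jacobian is f'(x) = A:
x_{k+1} = x_k - f'(x_k)^{-1} f(x_k) = x_k - A^{-1}(A x_k - b)

% Written as the three steps of one refinement sweep:
r_k = b - A x_k        % residual, computed in extra precision
A\, d_k = r_k          % solved with the existing (working-precision) factorization of A
x_{k+1} = x_k + d_k    % update
```

Only the residual needs extra precision, and each sweep costs $O(n^2)$ once the factorization is available, which matches the "$O(n^2)$ additional cost" claim above.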
![Page 26: The Future of LAPACK and ScaLAPACK netlib/lapack-dev](https://reader036.fdocuments.in/reader036/viewer/2022081418/568149ab550346895db6e99b/html5/thumbnails/26.jpg)
More Accurate: Solve Ax=b
[Figure panels: Conventional Gaussian Elimination; With extra-precise iterative refinement. Annotation: $n^{1/2}$]
![Page 27: The Future of LAPACK and ScaLAPACK netlib/lapack-dev](https://reader036.fdocuments.in/reader036/viewer/2022081418/568149ab550346895db6e99b/html5/thumbnails/27.jpg)
Iterative Refinement: for speed
• What if double precision much slower than single?
  – Cell processor in Playstation 3
    • 256 GFlops single, 25 GFlops double
  – Pentium SSE2: single twice as fast as double
• Given Ax=b in double precision
  – Factor in single, do refinement in double
  – If $\kappa(A) < 1/\varepsilon_{single}$, runs at speed of single
• 1.9x speedup on Intel-based laptop
• Applies to many algorithms, if difference large
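The "factor in single, refine in double" idea can be sketched in pure Python. This is an illustrative toy, not the LAPACK implementation: single precision is simulated by rounding every intermediate result through `struct`, and the function names (`to_single`, `lu_single`, `solve_single`, `solve_mixed`) and the fixed iteration count are assumptions made for the example.

```python
import struct


def to_single(x):
    """Round a Python float (double) to the nearest IEEE single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def lu_single(A):
    """LU with partial pivoting; every arithmetic result is rounded to single."""
    n = len(A)
    LU = [[to_single(v) for v in row] for row in A]
    perm = list(range(n))
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(LU[i][k]))
        LU[k], LU[p] = LU[p], LU[k]
        perm[k], perm[p] = perm[p], perm[k]
        for i in range(k + 1, n):
            LU[i][k] = to_single(LU[i][k] / LU[k][k])
            for j in range(k + 1, n):
                LU[i][j] = to_single(LU[i][j] - to_single(LU[i][k] * LU[k][j]))
    return LU, perm


def solve_single(LU, perm, b):
    """Forward and back substitution, all in simulated single precision."""
    n = len(b)
    y = [to_single(b[p]) for p in perm]
    for i in range(n):                       # unit lower triangular solve
        for j in range(i):
            y[i] = to_single(y[i] - to_single(LU[i][j] * y[j]))
    for i in reversed(range(n)):             # upper triangular solve
        for j in range(i + 1, n):
            y[i] = to_single(y[i] - to_single(LU[i][j] * y[j]))
        y[i] = to_single(y[i] / LU[i][i])
    return y


def solve_mixed(A, b, iters=10):
    """Single-precision factorization, double-precision residuals and updates."""
    n = len(b)
    LU, perm = lu_single(A)                  # the O(n^3) work, done in single
    x = solve_single(LU, perm, b)
    for _ in range(iters):
        r = [b[i] - sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
        d = solve_single(LU, perm, r)        # O(n^2) correction solve
        x = [x[i] + d[i] for i in range(n)]
    return x


A = [[4.0, 1.0, 2.0], [1.0, 5.0, 1.0], [2.0, 1.0, 6.0]]
b = [8.0, -6.0, 18.0]                        # chosen so the solution is (1, -2, 3)
print(solve_mixed(A, b))
```

A production code would stop when the correction is small instead of running a fixed number of sweeps, and would fall back to a double-precision factorization when the matrix is too ill-conditioned for the refinement to converge.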
![Page 28: The Future of LAPACK and ScaLAPACK netlib/lapack-dev](https://reader036.fdocuments.in/reader036/viewer/2022081418/568149ab550346895db6e99b/html5/thumbnails/28.jpg)
Goal 2 – Expanded Content
• Make content of ScaLAPACK mirror LAPACK as much as possible
• New functions (highlights)
  – Updating / downdating of factorizations:
    • Stewart, Langou
  – More generalized SVDs:
    • Bai, Wang
  – More generalized Sylvester/Lyapunov eqns:
    • Kågström, Jonsson, Granat
  – Structured eigenproblems
    • $O(n^2)$ version of roots(p): Gu, Chandrasekaran, Bindel et al
    • Selected matrix polynomials: Mehrmann
![Page 29: The Future of LAPACK and ScaLAPACK netlib/lapack-dev](https://reader036.fdocuments.in/reader036/viewer/2022081418/568149ab550346895db6e99b/html5/thumbnails/29.jpg)
New algorithm for roots(p)
• To find roots of polynomial p
  – Roots(p) does eig(C(p))
  – Costs $O(n^3)$, stable, reliable
• $O(n^2)$ Alternatives
  – Newton, Jenkins-Traub, Laguerre, …
  – Stable? Reliable?
• New: Exploit “semiseparable” structure of C(p)
  – Low rank of any submatrix of upper triangle of C(p) preserved under QR iteration
  – Complexity drops from $O(n^3)$ to $O(n^2)$, stable in practice
• Related work: Gemignani, Bini, Pan, et al
• Ming Gu, Shiv Chandrasekaran, Jiang Zhu, Jianlin Xia, David Bindel, David Garmire, Jim Demmel
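$$
C(p) = \begin{pmatrix}
-p_1 & -p_2 & \cdots & -p_d \\
1 & 0 & \cdots & 0 \\
0 & 1 & \cdots & 0 \\
\vdots & & \ddots & \vdots \\
0 & \cdots & 1 & 0
\end{pmatrix}
$$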
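As a small check on why eig(C(p)) yields the roots, the sketch below builds the companion matrix and verifies that a known root r gives an eigenvector $(r^{d-1}, \ldots, r, 1)$. It assumes the monic convention $p(x) = x^d + p_1 x^{d-1} + \cdots + p_d$, which the slide does not state explicitly; the function names are hypothetical.

```python
def companion(p):
    """Companion matrix C(p) of the monic polynomial
    x^d + p[0]*x^(d-1) + ... + p[d-1] (assumed coefficient convention)."""
    d = len(p)
    C = [[0.0] * d for _ in range(d)]
    C[0] = [-float(c) for c in p]      # first row: -p1, -p2, ..., -pd
    for i in range(1, d):
        C[i][i - 1] = 1.0              # ones on the subdiagonal
    return C


def eigen_residual(p, r):
    """max |C v - r v| for v = (r^(d-1), ..., r, 1); zero when r is a root of p."""
    d = len(p)
    C = companion(p)
    v = [r ** (d - 1 - k) for k in range(d)]
    Cv = [sum(C[i][j] * v[j] for j in range(d)) for i in range(d)]
    return max(abs(Cv[i] - r * v[i]) for i in range(d))


# (x - 1)(x - 2)(x - 3) = x^3 - 6x^2 + 11x - 6
p = [-6, 11, -6]
print(companion(p))
print([eigen_residual(p, r) for r in (1.0, 2.0, 3.0)])
```

The first row reproduces $r^d = -p_1 r^{d-1} - \cdots - p_d$, which holds exactly when $p(r) = 0$; the remaining rows only shift the powers of r.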
![Page 30: The Future of LAPACK and ScaLAPACK netlib/lapack-dev](https://reader036.fdocuments.in/reader036/viewer/2022081418/568149ab550346895db6e99b/html5/thumbnails/30.jpg)
Goal 3 – Automate Performance Tuning
• Widely used in performance tuning of Kernels
  – ATLAS (PhiPAC) – BLAS - www.netlib.org/atlas
  – FFTW – Fast Fourier Transform – www.fftw.org
  – Spiral – signal processing - www.spiral.net
  – OSKI – Sparse BLAS – bebop.cs.berkeley.edu/oski
![Page 31: The Future of LAPACK and ScaLAPACK netlib/lapack-dev](https://reader036.fdocuments.in/reader036/viewer/2022081418/568149ab550346895db6e99b/html5/thumbnails/31.jpg)
Optimizing blocksizes for mat-mul
Finding a Needle in a Haystack – So Automate
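The "so automate" idea can be illustrated with an empirical search over block sizes. The sketch below is a toy in pure Python, so its timings say nothing about real BLAS performance; it only shows the shape of the search (run each candidate, keep the fastest). The functions `matmul_blocked` and `tune_block_size` and the candidate list are made up for the example and are not part of ATLAS or LAPACK.

```python
import time


def matmul_blocked(A, B, nb):
    """C = A*B for square lists-of-lists, traversed in nb-by-nb blocks."""
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, nb):
        for kk in range(0, n, nb):
            for jj in range(0, n, nb):
                for i in range(ii, min(ii + nb, n)):
                    Ci, Ai = C[i], A[i]
                    for k in range(kk, min(kk + nb, n)):
                        a, Bk = Ai[k], B[k]
                        for j in range(jj, min(jj + nb, n)):
                            Ci[j] += a * Bk[j]
    return C


def tune_block_size(n, candidates, repeats=3):
    """Time every candidate block size and return (fastest, all timings)."""
    A = [[float(i + j) for j in range(n)] for i in range(n)]
    B = [[float(i - j) for j in range(n)] for i in range(n)]
    timings = {}
    for nb in candidates:
        best = float("inf")
        for _ in range(repeats):             # keep the best of a few runs
            t0 = time.perf_counter()
            matmul_blocked(A, B, nb)
            best = min(best, time.perf_counter() - t0)
        timings[nb] = best
    return min(timings, key=timings.get), timings


best_nb, timings = tune_block_size(32, (4, 8, 16, 32))
print("fastest block size on this machine:", best_nb)
```

Real tuners search a much larger space (register and cache blocking, unrolling, instruction ordering), which is why an exhaustive manual search is impractical.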
![Page 32: The Future of LAPACK and ScaLAPACK netlib/lapack-dev](https://reader036.fdocuments.in/reader036/viewer/2022081418/568149ab550346895db6e99b/html5/thumbnails/32.jpg)
Goal 3 – Automate Performance Tuning
• Widely used in performance tuning of Kernels
• 1300 calls to ILAENV() to get block sizes, etc.
  – Never been systematically tuned
• Extend automatic tuning techniques of ATLAS, etc. to these other parameters
  – Automation important as architectures evolve
• Convert ScaLAPACK data layouts on the fly
  – Important for ease-of-use too
![Page 33: The Future of LAPACK and ScaLAPACK netlib/lapack-dev](https://reader036.fdocuments.in/reader036/viewer/2022081418/568149ab550346895db6e99b/html5/thumbnails/33.jpg)
ScaLAPACK Data Layouts
1D Block
1D Block Cyclic
1D Cyclic
2D Block Cyclic
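The four layouts differ only in block size and grid shape, which a short index-mapping sketch makes visible. It assumes 0-based indices and that the first block lives on process (0, 0); the function names are hypothetical and are not ScaLAPACK routines.

```python
def owner_2d_block_cyclic(i, j, mb, nb, prow, pcol):
    """Grid coordinates (p, q) of the process that owns global entry (i, j)
    for mb-by-nb blocks dealt out cyclically over a prow-by-pcol grid."""
    return (i // mb) % prow, (j // nb) % pcol


def local_index(i, nb, nprocs):
    """Local position of global index i along one dimension
    (block size nb, nprocs processes in that dimension)."""
    return (i // (nb * nprocs)) * nb + i % nb


# Which process row owns each of 8 matrix rows under the 1D special cases?
n, p = 8, 4
block = [owner_2d_block_cyclic(i, 0, n // p, 1, p, 1)[0] for i in range(n)]   # 1D block
cyclic = [owner_2d_block_cyclic(i, 0, 1, 1, p, 1)[0] for i in range(n)]       # 1D cyclic
print(block)
print(cyclic)
```

1D block is the case with one block of n/p rows per process, 1D cyclic is block size 1, and the 2D block cyclic layout applies the same rule independently to rows and columns.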
![Page 34: The Future of LAPACK and ScaLAPACK netlib/lapack-dev](https://reader036.fdocuments.in/reader036/viewer/2022081418/568149ab550346895db6e99b/html5/thumbnails/34.jpg)
Execution time of PDGESV for various grid shapes
[Figure: execution time in seconds (0 to 100) versus problem size (1000 to 10000) and grid shape (1x60, 2x30, 3x20, 4x15, 5x12, 6x10)]
Times obtained on: 60 processors, Dual AMD Opteron 1.4GHz Cluster w/Myrinet Interconnect, 2GB Memory
Speedups for using 2D processor grid range from 2x to 8x
Cost of redistributing from 1D to best 2D layout 1% - 10%
![Page 35: The Future of LAPACK and ScaLAPACK netlib/lapack-dev](https://reader036.fdocuments.in/reader036/viewer/2022081418/568149ab550346895db6e99b/html5/thumbnails/35.jpg)
Conclusions
• Lots to do in Dense Linear Algebra
  – New numerical algorithms
  – Continuing architectural challenges
    • Parallelism, performance tuning
  – Ease of use, software engineering
• Grant support, but success depends on contributions from community
• www.netlib.org/lapack-dev
• www.cs.berkeley.edu/~demmel
![Page 36: The Future of LAPACK and ScaLAPACK netlib/lapack-dev](https://reader036.fdocuments.in/reader036/viewer/2022081418/568149ab550346895db6e99b/html5/thumbnails/36.jpg)
Extra Slides
![Page 37: The Future of LAPACK and ScaLAPACK netlib/lapack-dev](https://reader036.fdocuments.in/reader036/viewer/2022081418/568149ab550346895db6e99b/html5/thumbnails/37.jpg)
Fast Matrix Multiplication (1) (Cohn, Kleinberg, Szegedy, Umans)
• Can think of fast convolution of polynomials p, q as
  – Map p (q) into group algebra $\sum_i p_i z^i \in C[G]$ of cyclic group $G = \{ z^i \}$
  – Multiply elements of C[G] (use divide&conquer = FFT)
  – Extract coefficients
• For matrix multiply, need non-abelian group satisfying triple product property
  – There are subsets X, Y, Z of G where $xyz = 1$ with $x \in X$, $y \in Y$, $z \in Z$ $\Rightarrow$ $x = y = z = 1$
  – Map matrix A into group algebra via $\sum_{xy} A_{xy}\, x^{-1}y$, B into $\sum_{y'z} B_{y'z}\, y'^{-1}z$.
  – Since $x^{-1}y\, y'^{-1}z = x^{-1}z$ iff $y = y'$ we get $\sum_y A_{xy} B_{yz} = (AB)_{xz}$
• Search for fast algorithms reduced to search for groups with certain properties
  – Fastest algorithm so far is $O(n^{2.38})$, same as Coppersmith/Winograd
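The polynomial case in the first bullet (map into the group algebra, multiply, extract coefficients) can be sketched for a cyclic group. The code below uses a naive $O(n^2)$ product in C[G] purely for illustration; the slide's point is that an FFT performs this same multiplication quickly. The function names are hypothetical, and the non-abelian matrix case is not shown.

```python
def group_algebra_multiply(a, b):
    """Product of two elements of C[Z_n].

    a[i] and b[i] are the coefficients of z^i, and z^n = 1 in the cyclic group,
    so exponents add modulo n.
    """
    n = len(a)
    c = [0] * n
    for i in range(n):
        for j in range(n):
            c[(i + j) % n] += a[i] * b[j]
    return c


def poly_multiply(p, q):
    """Multiply polynomials (coefficient lists, lowest degree first) by
    embedding them in C[Z_n] with n large enough that nothing wraps around."""
    n = len(p) + len(q) - 1
    a = list(p) + [0] * (n - len(p))      # step 1: map p into the group algebra
    b = list(q) + [0] * (n - len(q))      #         and likewise q
    return group_algebra_multiply(a, b)   # steps 2-3: multiply, read off coefficients


print(poly_multiply([1, 2], [3, 4]))      # (1 + 2x)(3 + 4x)
```

Choosing the group order at least deg(p) + deg(q) + 1 is what keeps the cyclic wrap-around from mixing coefficients.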
![Page 38: The Future of LAPACK and ScaLAPACK netlib/lapack-dev](https://reader036.fdocuments.in/reader036/viewer/2022081418/568149ab550346895db6e99b/html5/thumbnails/38.jpg)
Fast Matrix Multiplication (2) (Cohn, Kleinberg, Szegedy, Umans)
1. Embed A, B in group algebra (exact)
2. Perform FFT (roundoff)
3. Reorganize results into new matrices (exact)
4. Multiply new matrices recursively (roundoff)
5. Reorganize results into new matrices (exact)
6. Perform IFFT (roundoff)
7. Extract C = AB from group algebra (exact)
![Page 39: The Future of LAPACK and ScaLAPACK netlib/lapack-dev](https://reader036.fdocuments.in/reader036/viewer/2022081418/568149ab550346895db6e99b/html5/thumbnails/39.jpg)
Fast Matrix Multiplication (3) (Demmel, Dumitriu, Holtz, Kleinberg)
• Thm 1: Any algorithm of this class for C = AB is “numerically stable”
  – $\|C_{comp} - C\| \le c \cdot n^d \cdot \varepsilon \cdot \|A\| \cdot \|B\| + O(\varepsilon^2)$
  – c and d are “modest” constants
  – Like Strassen
• Let $\omega$ be the exponent of matrix multiplication, i.e. no algorithm is faster than $O(n^{\omega})$.
• Thm 2: For all $\eta > 0$ there exists an algorithm with complexity $O(n^{\omega+\eta})$ that is numerically stable in the sense of Thm 1.
![Page 40: The Future of LAPACK and ScaLAPACK netlib/lapack-dev](https://reader036.fdocuments.in/reader036/viewer/2022081418/568149ab550346895db6e99b/html5/thumbnails/40.jpg)
Commodity Processor Trends

| | Annual increase | Typical value in 2006 | Predicted value in 2010 | Typical value in 2020 |
|---|---|---|---|---|
| Single-chip floating-point performance | 59% | 4 GFLOP/s | 32 GFLOP/s | 3300 GFLOP/s |
| Memory bus bandwidth | 23% | 1 GWord/s = 0.25 word/flop | 3.5 GWord/s = 0.11 word/flop | 27 GWord/s = 0.008 word/flop |
| Memory latency | (5.5%) | 70 ns = 280 FP ops = 70 loads | 50 ns = 1600 FP ops = 170 loads | 28 ns = 94,000 FP ops = 780 loads |
Source: Getting Up to Speed: The Future of Supercomputing, National Research Council, 222 pages, 2004, National Academies Press, Washington DC, ISBN 0-309-09502-6.
Will our algorithms run at a high fraction of peak?
![Page 41: The Future of LAPACK and ScaLAPACK netlib/lapack-dev](https://reader036.fdocuments.in/reader036/viewer/2022081418/568149ab550346895db6e99b/html5/thumbnails/41.jpg)
Challenges
• For all large scale computing, not just linear algebra!
• Example … your laptop
• Exponentially growing gaps between
– Floating point time << 1/Memory BW << Memory Latency
– Floating point time << 1/Network BW << Network Latency
![Page 42: The Future of LAPACK and ScaLAPACK netlib/lapack-dev](https://reader036.fdocuments.in/reader036/viewer/2022081418/568149ab550346895db6e99b/html5/thumbnails/42.jpg)
Parallel Processor Trends

| | Annual increase | Typical value in 2004 | Predicted value in 2010 | Typical value in 2020 |
|---|---|---|---|---|
| # Processors | 20% | 4,000 | 12,000 | 3300 GFLOP/s |
| Network bandwidth | 26% | 65 MWord/s = 0.03 word/flop | 260 MWord/s = 0.008 word/flop | 27 GWord/s = 0.008 word/flop |
| Network latency | (15%) | 5 μs = 20K FP ops | 2 μs = 64K FP ops | 28 ns = 94,000 FP ops = 780 loads |
Source: Getting Up to Speed: The Future of Supercomputing, National Research Council, 222 pages, 2004, National Academies Press, Washington DC, ISBN 0-309-09502-6.
Will our algorithms scale up to more processors?
![Page 43: The Future of LAPACK and ScaLAPACK netlib/lapack-dev](https://reader036.fdocuments.in/reader036/viewer/2022081418/568149ab550346895db6e99b/html5/thumbnails/43.jpg)
Goal 1a – Faster Algorithms (Highlights)
• MRRR algorithm for symmetric eigenproblem / SVD:
  – Parlett / Dhillon / Voemel / Marques / Willems
  – Faster and more accurate than previous algorithms
  – New sequential, first parallel versions out in 2006
![Page 44: The Future of LAPACK and ScaLAPACK netlib/lapack-dev](https://reader036.fdocuments.in/reader036/viewer/2022081418/568149ab550346895db6e99b/html5/thumbnails/44.jpg)
Timing of Eigensolvers (1.2 GHz Athlon, only matrices where time > .1 sec)
![Page 47: The Future of LAPACK and ScaLAPACK netlib/lapack-dev](https://reader036.fdocuments.in/reader036/viewer/2022081418/568149ab550346895db6e99b/html5/thumbnails/47.jpg)
Timing of Eigensolvers (only matrices where time > .1 sec)
![Page 48: The Future of LAPACK and ScaLAPACK netlib/lapack-dev](https://reader036.fdocuments.in/reader036/viewer/2022081418/568149ab550346895db6e99b/html5/thumbnails/48.jpg)
Accuracy Results (old vs new MRRR)
[Figure panels: $\max_i \|T q_i - \lambda_i q_i\| / (n\,\varepsilon)$ and $\|Q Q^T - I\| / (n\,\varepsilon)$]
![Page 49: The Future of LAPACK and ScaLAPACK netlib/lapack-dev](https://reader036.fdocuments.in/reader036/viewer/2022081418/568149ab550346895db6e99b/html5/thumbnails/49.jpg)
Goal 1a – Faster Algorithms (Highlights)
• MRRR algorithm for symmetric eigenproblem / SVD:
  – Parlett / Dhillon / Voemel / Marques / Willems
• Up to 10x faster HQR:
  – Byers / Mathias / Braman
• Extensions to QZ:
  – Kågström / Kressner
• Faster Hessenberg, tridiagonal, bidiagonal reductions:
  – van de Geijn/Quintana, Howell / Fulton, Bischof / Lang
  – Full nonsymmetric eigenproblem: n=1500: 3.43x faster
    • HQR: 5x faster, Reduction: 14% faster
  – Bidiagonal Reduction (LAWN#174): n=2000: 1.32x faster
  – Sequential versions out in 2006
![Page 50: The Future of LAPACK and ScaLAPACK netlib/lapack-dev](https://reader036.fdocuments.in/reader036/viewer/2022081418/568149ab550346895db6e99b/html5/thumbnails/50.jpg)
Goal 1a – Faster Algorithms (Highlights)
• MRRR algorithm for symmetric eigenproblem / SVD:
  – Parlett / Dhillon / Voemel / Marques / Willems
• Up to 10x faster HQR:
  – Byers / Mathias / Braman
• Faster Hessenberg, tridiagonal, bidiagonal reductions:
  – van de Geijn/Quintana, Howell / Fulton, Bischof / Lang
• Extensions to QZ:
  – Kågström / Kressner
  – LAPACK Working Note (LAWN) #173
  – On 26 real test matrices, speedups up to 11.9x, 4.4x average
![Page 51: The Future of LAPACK and ScaLAPACK netlib/lapack-dev](https://reader036.fdocuments.in/reader036/viewer/2022081418/568149ab550346895db6e99b/html5/thumbnails/51.jpg)
Goal 4: Improved Ease of Use
• Which do you prefer?
CALL PDGESV( N ,NRHS, A, IA, JA, DESCA, IPIV, B, IB, JB, DESCB, INFO)
A \ B
CALL PDGESVX( FACT, TRANS, N ,NRHS, A, IA, JA, DESCA, AF, IAF, JAF, DESCAF, IPIV, EQUED, R, C, B, IB, JB, DESCB, X, IX, JX, DESCX, RCOND, FERR, BERR, WORK, LWORK, IWORK, LIWORK, INFO)
![Page 52: The Future of LAPACK and ScaLAPACK netlib/lapack-dev](https://reader036.fdocuments.in/reader036/viewer/2022081418/568149ab550346895db6e99b/html5/thumbnails/52.jpg)
Goal 4: Improved Ease of Use
• Easy interfaces vs access to details
  – Some users want access to all details, because
    • Peak performance matters
    • Control over memory allocation
  – Other users want “simpler” interface
    • Automatic allocation of workspace
    • No universal agreement across systems on “easiest interface”
    • Leave decision to higher level packages
• Keep expert driver / simple driver / computational routines
• Add wrappers for other languages
  – Fortran95, Java, Matlab, Python, even C
  – Automatic allocation of workspace
• Add wrappers to convert to “best” parallel layout
![Page 53: The Future of LAPACK and ScaLAPACK netlib/lapack-dev](https://reader036.fdocuments.in/reader036/viewer/2022081418/568149ab550346895db6e99b/html5/thumbnails/53.jpg)
Goal 5: Better SW Engineering: What could go into Sca/LAPACK?
For all linear algebra problems
For all matrix structures
For all data types
For all programming interfaces
Produce best algorithm(s) w.r.t. performance and accuracy (including condition estimates, etc)
For all architectures and networks
Need to prioritize, automate!
![Page 54: The Future of LAPACK and ScaLAPACK netlib/lapack-dev](https://reader036.fdocuments.in/reader036/viewer/2022081418/568149ab550346895db6e99b/html5/thumbnails/54.jpg)
Goal 5: Better SW Engineering
• How to map multiple SW layers to emerging HW layers?
• How much better are asynchronous algorithms?
• Are emerging PGAS languages better?
• Statistical modeling to limit performance tuning costs, improve use of shared clusters
• Only some things understood well enough for automation now
  – Telescoping languages, Bernoulli, Rose, FLAME, …
• Research Plan: explore above design space
• Development Plan to deliver code (some aspects)
  – Maintain core in F95 subset
  – Friendly wrappers for other programming environments
  – Use variety of source control, maintenance, development tools
![Page 55: The Future of LAPACK and ScaLAPACK netlib/lapack-dev](https://reader036.fdocuments.in/reader036/viewer/2022081418/568149ab550346895db6e99b/html5/thumbnails/55.jpg)
Goal 6: Involve the Community
• To help identify priorities
  – More interesting tasks than we are funded to do
  – See www.netlib.org/lapack-dev for list
• To help identify promising algorithms
  – What have we missed?
• To help do the work
  – Bug reports, provide fixes
  – Again, more tasks than we are funded to do
  – Already happening: thank you!
![Page 56: The Future of LAPACK and ScaLAPACK netlib/lapack-dev](https://reader036.fdocuments.in/reader036/viewer/2022081418/568149ab550346895db6e99b/html5/thumbnails/56.jpg)
Accuracy of Eigensolvers
[Figure panels: $\max_i \|T q_i - \lambda_i q_i\| / (n\,\varepsilon)$ and $\|Q Q^T - I\| / (n\,\varepsilon)$]
![Page 57: The Future of LAPACK and ScaLAPACK netlib/lapack-dev](https://reader036.fdocuments.in/reader036/viewer/2022081418/568149ab550346895db6e99b/html5/thumbnails/57.jpg)
Goal 2 – Expanded Content
• Make content of ScaLAPACK mirror LAPACK as much as possible
• New functions (highlights)
  – Updating / downdating of factorizations:
    • Stewart, Langou
  – More generalized SVDs:
    • Bai, Wang
![Page 58: The Future of LAPACK and ScaLAPACK netlib/lapack-dev](https://reader036.fdocuments.in/reader036/viewer/2022081418/568149ab550346895db6e99b/html5/thumbnails/58.jpg)
New GSVD Algorithm
Bai et al, UC Davis
PSVD, CSD on the way
Given m x n A and p x n B, factor $A = U \Sigma_a X$ and $B = V \Sigma_b X$
![Page 59: The Future of LAPACK and ScaLAPACK netlib/lapack-dev](https://reader036.fdocuments.in/reader036/viewer/2022081418/568149ab550346895db6e99b/html5/thumbnails/59.jpg)
Motivation
• LAPACK and ScaLAPACK are widely used
  – Adopted by Cray, Fujitsu, HP, IBM, IMSL, MathWorks, NAG, NEC, SGI, …
  – >63M web hits @ Netlib (incl. CLAPACK, LAPACK95)
    • 35K hits/day
![Page 60: The Future of LAPACK and ScaLAPACK netlib/lapack-dev](https://reader036.fdocuments.in/reader036/viewer/2022081418/568149ab550346895db6e99b/html5/thumbnails/60.jpg)
Impact (with NERSC, LBNL)
Cosmic Microwave Background Analysis,
BOOMERanG collaboration, MADCAP code (Apr. 27, 2000).
ScaLAPACK
![Page 61: The Future of LAPACK and ScaLAPACK netlib/lapack-dev](https://reader036.fdocuments.in/reader036/viewer/2022081418/568149ab550346895db6e99b/html5/thumbnails/61.jpg)
Challenges
• For all large scale computing, not just linear algebra!
• Example …
![Page 62: The Future of LAPACK and ScaLAPACK netlib/lapack-dev](https://reader036.fdocuments.in/reader036/viewer/2022081418/568149ab550346895db6e99b/html5/thumbnails/62.jpg)
Challenges
• For all large scale computing, not just linear algebra!
• Example … your laptop
![Page 63: The Future of LAPACK and ScaLAPACK netlib/lapack-dev](https://reader036.fdocuments.in/reader036/viewer/2022081418/568149ab550346895db6e99b/html5/thumbnails/63.jpg)
CPU Trends
[Figure: cores per processor chip and hardware threads per chip (0 to 300) by year, 2004 to 2010]
• Relative processing power will continue to double every 18 months
• 256 logical processors per chip in late 2010
![Page 64: The Future of LAPACK and ScaLAPACK netlib/lapack-dev](https://reader036.fdocuments.in/reader036/viewer/2022081418/568149ab550346895db6e99b/html5/thumbnails/64.jpg)
Challenges
• For all large scale computing, not just linear algebra!
• Example … your laptop
• Exponentially growing gaps between
  – Floating point time << 1/Memory BW << Memory Latency
    • Improving 59%/year vs 23%/year vs 5.5%/year
![Page 65: The Future of LAPACK and ScaLAPACK netlib/lapack-dev](https://reader036.fdocuments.in/reader036/viewer/2022081418/568149ab550346895db6e99b/html5/thumbnails/65.jpg)
Accuracy of Eigensolvers: Old vs New MRRR
[Figure panels: $\max_i \|T q_i - \lambda_i q_i\| / (n\,\varepsilon)$ and $\|Q Q^T - I\| / (n\,\varepsilon)$]
![Page 66: The Future of LAPACK and ScaLAPACK netlib/lapack-dev](https://reader036.fdocuments.in/reader036/viewer/2022081418/568149ab550346895db6e99b/html5/thumbnails/66.jpg)
Goal 1a – Faster Algorithms (Highlights)
• MRRR algorithm for symmetric eigenproblem / SVD:
  – Parlett / Dhillon / Voemel / Marques / Willems
  – Faster and more accurate than previous algorithms
  – New sequential, first parallel versions out in 2006
  – Numerical evidence shows DC faster if it “deflates” often, which is hard to predict in advance. So having both algorithms is important.
  – SVD still an open problem
![Page 67: The Future of LAPACK and ScaLAPACK netlib/lapack-dev](https://reader036.fdocuments.in/reader036/viewer/2022081418/568149ab550346895db6e99b/html5/thumbnails/67.jpg)
Goal 1a – Faster Algorithms (Highlights)
• MRRR algorithm for symmetric eigenproblem / SVD:
  – Parlett / Dhillon / Voemel / Marques / Willems
• Up to 10x faster HQR:
  – Byers / Mathias / Braman
  – SIAM SIAG/LA Prize in 2003
  – Sequential version out in 2006
  – More on performance later
![Page 68: The Future of LAPACK and ScaLAPACK netlib/lapack-dev](https://reader036.fdocuments.in/reader036/viewer/2022081418/568149ab550346895db6e99b/html5/thumbnails/68.jpg)
Goal 1a – Faster Algorithms (Highlights)
• MRRR algorithm for symmetric eigenproblem / SVD:
  – Parlett / Dhillon / Voemel / Marques / Willems
• Up to 10x faster HQR:
  – Byers / Mathias / Braman
• Faster Hessenberg, tridiagonal, bidiagonal reductions:
  – van de Geijn/Quintana, Howell / Fulton, Bischof / Lang
  – Full nonsymmetric eigenproblem: n=1500: 3.43x faster
    • HQR: 5x faster, Reduction: 14% faster
![Page 69: The Future of LAPACK and ScaLAPACK netlib/lapack-dev](https://reader036.fdocuments.in/reader036/viewer/2022081418/568149ab550346895db6e99b/html5/thumbnails/69.jpg)
ARCH: Intel Pentium 4 (3.4 GHz)
F77: GNU Fortran (GCC) 3.4.4
BLAS: libgoto_prescott32p-r1.00.so (one thread)
Dense nonsymmetric eigenvalue problem
[Figure panels: No vectors; All vectors]
Source: Julien Langou
![Page 72: The Future of LAPACK and ScaLAPACK netlib/lapack-dev](https://reader036.fdocuments.in/reader036/viewer/2022081418/568149ab550346895db6e99b/html5/thumbnails/72.jpg)
Comparison of ScaLAPACK QR and new parallel multishift QZ
Execution times in secs for 4096 x 4096 random problems Ax = sx and Ax = sBx, using processor grids including 1-16 processors.
Note: work(QZ) > 2 * work(QR), but Time(// QZ) << Time(// QR)!! Times include cost for computing eigenvalues and transformation matrices.
[Figure: execution time in seconds (axis 0 to 6000) for // QR and // QZ on processor grids 1 x 1, 2 x 1, 2 x 2, 4 x 2, 4 x 4]
Adlerborn-Kågström-Kressner, SIAM PP’2006
![Page 73: The Future of LAPACK and ScaLAPACK netlib/lapack-dev](https://reader036.fdocuments.in/reader036/viewer/2022081418/568149ab550346895db6e99b/html5/thumbnails/73.jpg)
Goal 1a – Faster Algorithms (Highlights)
• MRRR algorithm for symmetric eigenproblem / SVD:
  – Parlett / Dhillon / Voemel / Marques / Willems
• Up to 10x faster HQR:
  – Byers / Mathias / Braman
• Faster Hessenberg, tridiagonal, bidiagonal reductions:
  – van de Geijn / Quintana, Howell / Fulton, Bischof / Lang
• Extensions to QZ:
  – Kågström / Kressner
• Recursive blocked layouts for packed formats:
  – Gustavson / Kågström / Elmroth / Jonsson
  – SIAM Review Article 2004
![Page 74: The Future of LAPACK and ScaLAPACK netlib/lapack-dev](https://reader036.fdocuments.in/reader036/viewer/2022081418/568149ab550346895db6e99b/html5/thumbnails/74.jpg)
Recursive Layouts and Algorithms
Still merges multiple elimination steps into a few BLAS 3 operations
Best speedups for packed storage of symmetric matrices
![Page 75: The Future of LAPACK and ScaLAPACK netlib/lapack-dev](https://reader036.fdocuments.in/reader036/viewer/2022081418/568149ab550346895db6e99b/html5/thumbnails/75.jpg)
Goal 1b – More Accurate Algorithms
• Iterative refinement for Ax=b, least squares
  – “Promise” the right answer for O(n^2) additional cost
  – Iterative refinement with extra-precise residuals
  – Extra-precise BLAS needed (LAWN#165)
  – “Guarantees” based on condition number estimates
    • Condition estimate < 1/(n^(1/2) ε) ⇒ reliable answer and tiny error bounds
    • No bad bounds in 6.2M tests
    • Can condition estimators lie?
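The refinement loop named on this slide can be sketched in a few lines. This is a minimal illustration, not the LAPACK routine: the function names (`lu_factor`, `lu_solve`, `residual_extra_precise`, `refine`) are hypothetical, the fixed step count replaces the real stopping and error-bound logic, and exact rational arithmetic from Python's `fractions` module stands in for the extra-precise BLAS. After the one-time O(n^3) factorization, each step costs O(n^2): one residual and one pair of triangular solves.

```python
from fractions import Fraction

def lu_factor(A):
    """LU with partial pivoting in working (double) precision; returns (LU, perm)."""
    n = len(A)
    LU = [row[:] for row in A]
    perm = list(range(n))
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(LU[i][k]))
        LU[k], LU[p] = LU[p], LU[k]
        perm[k], perm[p] = perm[p], perm[k]
        for i in range(k + 1, n):
            LU[i][k] /= LU[k][k]
            for j in range(k + 1, n):
                LU[i][j] -= LU[i][k] * LU[k][j]
    return LU, perm

def lu_solve(LU, perm, b):
    """Solve A x = b using the factors: permute, forward solve (unit L), back solve (U)."""
    n = len(b)
    y = [b[p] for p in perm]
    for i in range(n):
        for j in range(i):
            y[i] -= LU[i][j] * y[j]
    for i in reversed(range(n)):
        for j in range(i + 1, n):
            y[i] -= LU[i][j] * y[j]
        y[i] /= LU[i][i]
    return y

def residual_extra_precise(A, x, b):
    """r = b - A*x accumulated exactly in rationals, rounded to double once at the end."""
    return [float(Fraction(bi) - sum(Fraction(a) * Fraction(xj) for a, xj in zip(row, x)))
            for row, bi in zip(A, b)]

def refine(A, b, steps=5):
    """Iterative refinement: reuse one factorization, correct x with extra-precise residuals."""
    LU, perm = lu_factor(A)
    x = lu_solve(LU, perm, b)
    for _ in range(steps):
        r = residual_extra_precise(A, x, b)
        d = lu_solve(LU, perm, r)
        x = [xi + di for xi, di in zip(x, d)]
    return x
```

A call such as `refine([[2.0, 1.0, 1.0], [4.0, -6.0, 0.0], [-2.0, 7.0, 2.0]], [7.0, -8.0, 18.0])` should return a vector very close to `[1, 2, 3]`.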
![Page 76: The Future of LAPACK and ScaLAPACK netlib/lapack-dev](https://reader036.fdocuments.in/reader036/viewer/2022081418/568149ab550346895db6e99b/html5/thumbnails/76.jpg)
Can condition estimators lie?
• Yes, but rarely, unless they cost as much as matrix multiply = cost of LU factorization
  – Demmel / Diament / Malajovich (FCM2001)
• But what if matrix multiply costs O(n^2)?
  – More later
![Page 77: The Future of LAPACK and ScaLAPACK netlib/lapack-dev](https://reader036.fdocuments.in/reader036/viewer/2022081418/568149ab550346895db6e99b/html5/thumbnails/77.jpg)
Goal 1b – More Accurate Algorithms
• Iterative refinement for Ax=b, least squares
  – “Promise” the right answer for O(n^2) additional cost
  – Iterative refinement with extra-precise residuals
  – Extra-precise BLAS needed (LAWN#165)
  – “Guarantees” based on condition number estimates
  – Get tiny componentwise bounds too
    • Each x_i accurate
    • Slightly different condition number
  – Extends to Least Squares
  – Release in 2006
![Page 78: The Future of LAPACK and ScaLAPACK netlib/lapack-dev](https://reader036.fdocuments.in/reader036/viewer/2022081418/568149ab550346895db6e99b/html5/thumbnails/78.jpg)
Goal 1b – More Accurate Algorithms
• Iterative refinement for Ax=b, least squares
  – Promise the right answer for O(n^2) additional cost
• Jacobi-based SVD
  – Faster than QR, can be arbitrarily more accurate
  – LAWNs #169, #170
  – Can be arbitrarily more accurate on tiny singular values
  – Yet faster than QR iteration!
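To give a feel for what a Jacobi-based SVD does, here is a sketch of the textbook one-sided (Hestenes) Jacobi iteration, which rotates pairs of columns until they are mutually orthogonal and then reads off singular values as column norms. It is not the preconditioned algorithm of LAWNs #169 and #170, it computes singular values only, and the name `jacobi_singular_values` and the tolerance are assumptions made for illustration.

```python
import math

def jacobi_singular_values(A, tol=1e-15, max_sweeps=30):
    """Singular values of a dense m x n matrix (list of rows) by one-sided Jacobi."""
    m, n = len(A), len(A[0])
    U = [row[:] for row in A]
    for _ in range(max_sweeps):
        rotated = False
        for p in range(n - 1):
            for q in range(p + 1, n):
                alpha = sum(U[i][p] ** 2 for i in range(m))
                beta = sum(U[i][q] ** 2 for i in range(m))
                gamma = sum(U[i][p] * U[i][q] for i in range(m))
                # Skip pairs of columns that are already orthogonal to working accuracy.
                if abs(gamma) <= tol * math.sqrt(alpha * beta):
                    continue
                rotated = True
                zeta = (beta - alpha) / (2.0 * gamma)
                t = math.copysign(1.0, zeta) / (abs(zeta) + math.sqrt(1.0 + zeta * zeta))
                c = 1.0 / math.sqrt(1.0 + t * t)
                s = c * t
                # Plane rotation that makes columns p and q orthogonal.
                for i in range(m):
                    up, uq = U[i][p], U[i][q]
                    U[i][p] = c * up - s * uq
                    U[i][q] = s * up + c * uq
        if not rotated:
            break
    norms = (math.sqrt(sum(U[i][j] ** 2 for i in range(m))) for j in range(n))
    return sorted(norms, reverse=True)
```

For the matrix `[[3.0, 0.0], [4.0, 5.0]]` the exact singular values are sqrt(45) and sqrt(5), which the sketch should reproduce to roughly working precision.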
![Page 79: The Future of LAPACK and ScaLAPACK netlib/lapack-dev](https://reader036.fdocuments.in/reader036/viewer/2022081418/568149ab550346895db6e99b/html5/thumbnails/79.jpg)
Goal 1b – More Accurate Algorithms
• Iterative refinement for Ax=b, least squares
  – Promise the right answer for O(n^2) additional cost
• Jacobi-based SVD
  – Faster than QR, can be arbitrarily more accurate
• Arbitrary precision versions of everything
  – Using your favorite multiple precision package
  – Quad, Quad-double, ARPREC, MPFR, …
  – Using Fortran 95 modules
![Page 80: The Future of LAPACK and ScaLAPACK netlib/lapack-dev](https://reader036.fdocuments.in/reader036/viewer/2022081418/568149ab550346895db6e99b/html5/thumbnails/80.jpg)
Goal 2 – Expanded Content
• Make content of ScaLAPACK mirror LAPACK as much as possible
• New functions (highlights)
  – Updating / downdating of factorizations:
    • Stewart, Langou
  – More generalized SVDs:
    • Bai, Wang
  – More generalized Sylvester/Lyapunov eqns:
    • Kågström, Jonsson, Granat
  – Structured eigenproblems
    • O(n^2) version of roots(p)
      – Gu, Chandrasekaran, Bindel et al
    • Selected matrix polynomials:
      – Mehrmann
• How should we prioritize missing functions?
![Page 81: The Future of LAPACK and ScaLAPACK netlib/lapack-dev](https://reader036.fdocuments.in/reader036/viewer/2022081418/568149ab550346895db6e99b/html5/thumbnails/81.jpg)
The Difficulty of Tuning SpMV: Sparse Matrix Vector Multiply
// y <-- y + A*x
for all A(i,j):
    y(i) += A(i,j) * x(j)
![Page 82: The Future of LAPACK and ScaLAPACK netlib/lapack-dev](https://reader036.fdocuments.in/reader036/viewer/2022081418/568149ab550346895db6e99b/html5/thumbnails/82.jpg)
The Difficulty of Tuning SpMV
// y <-- y + A*x
for all A(i,j):
    y(i) += A(i,j) * x(j)
// Compressed sparse row (CSR)
for each row i:
    t = 0
    for k=row[i] to row[i+1]-1:
        t += A[k] * x[J[k]]
    y[i] += t
• Exploit 8x8 dense blocks
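The CSR loop above translates almost line for line into runnable code. The sketch below assumes the usual three-array CSR layout; the names `row_ptr`, `col_idx`, and `vals` are illustrative stand-ins for the slide's `row`, `J`, and `A`.

```python
def spmv_csr(row_ptr, col_idx, vals, x, y):
    """In-place y <- y + A*x for A stored in compressed sparse row (CSR) form."""
    for i in range(len(row_ptr) - 1):
        t = y[i]  # start from the existing y[i] so the update accumulates
        for k in range(row_ptr[i], row_ptr[i + 1]):
            t += vals[k] * x[col_idx[k]]
        y[i] = t
    return y
```

The indirect access `x[col_idx[k]]` is the part that makes this kernel hard to tune: the memory access pattern depends on the matrix, not just on its size.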
![Page 83: The Future of LAPACK and ScaLAPACK netlib/lapack-dev](https://reader036.fdocuments.in/reader036/viewer/2022081418/568149ab550346895db6e99b/html5/thumbnails/83.jpg)
Speedups on Itanium 2: The Need for Search
[Figure: Mflop/s chart. Labels: Reference, Mflop/s (7.6%); Mflop/s (31.1%)]
![Page 84: The Future of LAPACK and ScaLAPACK netlib/lapack-dev](https://reader036.fdocuments.in/reader036/viewer/2022081418/568149ab550346895db6e99b/html5/thumbnails/84.jpg)
Speedups on Itanium 2: The Need for Search
[Figure: Mflop/s chart. Labels: Reference, Mflop/s (7.6%); Best: 4x2, Mflop/s (31.1%)]
![Page 85: The Future of LAPACK and ScaLAPACK netlib/lapack-dev](https://reader036.fdocuments.in/reader036/viewer/2022081418/568149ab550346895db6e99b/html5/thumbnails/85.jpg)
SpMV Performance—raefsky3
![Page 86: The Future of LAPACK and ScaLAPACK netlib/lapack-dev](https://reader036.fdocuments.in/reader036/viewer/2022081418/568149ab550346895db6e99b/html5/thumbnails/86.jpg)
SpMV Performance—raefsky3
![Page 87: The Future of LAPACK and ScaLAPACK netlib/lapack-dev](https://reader036.fdocuments.in/reader036/viewer/2022081418/568149ab550346895db6e99b/html5/thumbnails/87.jpg)
More Surprises tuning SpMV
• More complex example
• Example: 3x3 blocking
  – Logical grid of 3x3 cells
![Page 88: The Future of LAPACK and ScaLAPACK netlib/lapack-dev](https://reader036.fdocuments.in/reader036/viewer/2022081418/568149ab550346895db6e99b/html5/thumbnails/88.jpg)
Extra Work Can Improve Efficiency
• More complex example
• Example: 3x3 blocking
  – Logical grid of 3x3 cells
  – Pad with zeros
  – “Fill ratio” = 1.5
• On Pentium III: 1.5x speedup! (2/3 time)
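The padding idea can be made concrete with a small block-CSR (BCSR) sketch. It is illustrative only: `to_bcsr` and `spmv_bcsr` are hypothetical names, the matrix dimensions are assumed to be multiples of the block size, and pure Python will not show the speedup, which comes from register reuse in compiled code. The fill ratio is the number of stored entries, explicit zeros included, divided by the number of true nonzeros.

```python
def to_bcsr(entries, n_rows, r, c):
    """Pack {(i, j): value} into r x c blocks, padding with explicit zeros.

    Returns (block_row_ptr, block_col_idx, blocks, fill_ratio); each block is
    a flat row-major list of r*c values.
    """
    blocks = {}
    for (i, j), v in entries.items():
        blk = blocks.setdefault((i // r, j // c), [0.0] * (r * c))
        blk[(i % r) * c + (j % c)] = v
    row_ptr, col_idx, vals = [0], [], []
    for I in range((n_rows + r - 1) // r):
        for key in sorted(k for k in blocks if k[0] == I):
            col_idx.append(key[1])
            vals.append(blocks[key])
        row_ptr.append(len(col_idx))
    fill_ratio = len(vals) * r * c / len(entries)
    return row_ptr, col_idx, vals, fill_ratio

def spmv_bcsr(row_ptr, col_idx, vals, r, c, x, y):
    """In-place y <- y + A*x with A in BCSR; the inner r x c loops have fixed trip counts."""
    for I in range(len(row_ptr) - 1):
        for k in range(row_ptr[I], row_ptr[I + 1]):
            blk, j0 = vals[k], col_idx[k] * c
            for a in range(r):
                t = 0.0
                for b in range(c):
                    t += blk[a * c + b] * x[j0 + b]
                y[I * r + a] += t
    return y
```

For example, six nonzeros that all fall inside one 3x3 cell are stored as nine values, giving a fill ratio of 9/6 = 1.5, the same ratio quoted on the slide.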
![Page 89: The Future of LAPACK and ScaLAPACK netlib/lapack-dev](https://reader036.fdocuments.in/reader036/viewer/2022081418/568149ab550346895db6e99b/html5/thumbnails/89.jpg)
Tuning for Workloads
• BiCG - equal mix of A*x and A^T*y
  – 3x1: Ax, A^T y = 1053, 343 Mflop/s
  – 3x3: Ax, A^T y = 806, 826 Mflop/s
• Higher-level operation - (Ax, A^T y) kernel
  – 3x1: 757 Mflop/s
  – 3x3: 1400 Mflop/s
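One way to read the "(Ax, A^T y) kernel" is as a single pass over the matrix that produces both products, so each stored entry is loaded once instead of twice. The sketch below shows that idea for plain CSR; the function name and argument layout are assumptions, and the slide's measured kernels use register blocking, which is omitted here.

```python
def fused_ax_aty(row_ptr, col_idx, vals, x, y, n_cols):
    """Return (A*x, A^T*y) from one sweep over a CSR matrix."""
    n_rows = len(row_ptr) - 1
    u = [0.0] * n_rows   # u = A*x
    v = [0.0] * n_cols   # v = A^T*y
    for i in range(n_rows):
        t, yi = 0.0, y[i]
        for k in range(row_ptr[i], row_ptr[i + 1]):
            a, j = vals[k], col_idx[k]
            t += a * x[j]      # row i of A contributes to u[i]
            v[j] += a * yi     # the same entry contributes to v[j]
        u[i] = t
    return u, v
```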
![Page 90: The Future of LAPACK and ScaLAPACK netlib/lapack-dev](https://reader036.fdocuments.in/reader036/viewer/2022081418/568149ab550346895db6e99b/html5/thumbnails/90.jpg)
Optimizations Available in OSKI: Optimized Sparse Kernel Interface
• Optimizations for SpMV
  – Register blocking (RB): up to 4x over CSR
  – Variable block splitting: 2.1x over CSR, 1.8x over RB
  – Diagonals: 2x over CSR
  – Reordering to create dense structure + splitting: 2x over CSR
  – Symmetry: 2.8x over CSR, 2.6x over RB
  – Cache blocking: 3x over CSR
  – Multiple vectors (SpMM): 7x over CSR
  – And combinations…
• Sparse triangular solve
  – Hybrid sparse/dense data structure: 1.8x over CSR
• Higher-level kernels
  – A*A^T*x, A^T*A*x: 4x over CSR, 1.8x over RB
  – A*x: 2x over CSR, 1.5x over RB
• Available stand alone or integrated into PETSc
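As an illustration of why a higher-level kernel such as A^T*A*x can beat two separate SpMV calls, the product can be written as a sum over rows a_i of A: A^T*A*x = sum_i a_i (a_i . x). Each row is then read once and reused immediately while it is still in cache. The sketch below is a generic CSR version with a hypothetical function name; it does not use or reproduce the OSKI interface.

```python
def ata_x(row_ptr, col_idx, vals, x):
    """Return A^T*(A*x) for a CSR matrix, touching each row of A only once."""
    y = [0.0] * len(x)
    for i in range(len(row_ptr) - 1):
        ks = range(row_ptr[i], row_ptr[i + 1])
        t = sum(vals[k] * x[col_idx[k]] for k in ks)   # t = a_i . x
        for k in ks:
            y[col_idx[k]] += vals[k] * t               # y += t * a_i
    return y
```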