HPL is a portable implementation of the High Performance...
Transcript of HPL is a portable implementation of the High Performance...
SPONSORED BY
FIND OUT MORE AT http://icl.eecs.utk.edu/hpl/
A COLLABORATION OF
HPL is a portable implementation of the High Performance LINPACK Benchmark for distributed memory computers.
• Algorithm: recursive panel factorization, multiple lookahead depths, bandwidth reducing swapping
• Easy to install, only needs MPI+BLAS or VSIPL
• Highly scalable and efficient from the smallest cluster to the largest supercomputers in the world
CPU/Nodes N ½ N max R max (Gflop/s)
1/1
4/1
16/4
64/16
256/64
150
800
2300
5700
14000
6625
12350
26500
53000
106000
1
4
17
67.5
263.6
Compaq 64 Node AlphaServer SC (4 EV67 667 MHz CPUs per node); constant memory load/CPU = 335 MiB
LINPACK software is releasedSolves systems of linear equations in FORTRAN 66
LINPACK 100 releasedMeasures system performance in Mflop/s and solves 100x100 linear systems
LINPACK 1000 releasedAny language allowed and the linear system of size 1000 can be used
LINPACKDv2 releasedExtends random number generator from 16384 to 65536
LINPACK Table 3 (Highly Parallel Computing)Any size linear system is allowed
TOP500 first releasedWith CM-5 running the LINPACK benchmark at nearly 60 Gflop/s
9th TOP500 is releasedWith the 1st system breaking the 1 Tflop/s barrier: ASCI Red from Sandia National Laboratory
HPLv1 is releasedBy Antoine Petitet, Jack Dongarra, Clint Whaley, and Andy Cleary
31st TOP500 is releasedWith the 1st system breaking the 1 Pflop/s barrier: Roadrunner from Los Alamos National Laboratory
HPLv2 is releasedThe new version features a 64-bit random number generator that prevents the benchmark failures from generating nearly singular matrices which was the problem with the old generator.
Peta flop/s are spreadingThe upgrades of the machines hosted at ORNL results in shattering the 2 Pflop/s theoretical peak barrier for Cray's XT5 Jaguar. Also, the first academic system reaches 1 Pflop/s theoretical peak: University of Tennessee's Kraken.
HISTORY OF THE BENCHMARK1974
1977
1986
1989
1991
1993
1996
2000
2008
2008
2009
PARALLEL EFFICIENCY OF HPL
0
10
20
30
40
50
60
70
80
90
1 2 4 8 16 32 64 128 256 384 416 512
PAR
ALL
EL E
FFIC
IEN
CY
(%
)
NUMBER OF CORES
INTEL XEON
INTEL CORE 2
AMD OPTERON
HPL was used to obtain a number of results in the current TOP500 list including the #1 entry.
SPONSORED BY
FIND OUT MORE AT http://icl.eecs.utk.edu/hpl/
A COLLABORATION OF
The original LINPACK Benchmark is, in some sense, an accident. It was originally designed to assist users of the LINPACK package by providing information on execution times required to solve a system of linear equations. The first “LINPACK Benchmark” report appeared as an appendix in the LINPACK Users' Guide in 1979. Over the years additional performance data was added, more as a hobby than anything else, and today the collection includes over 1300 different computer systems. In addition to the number of computers increasing, the scope of the benchmark has also expanded. The benchmark report describes the performance for solving a general dense matrix problem Ax=b at three levels of problem size and optimization opportunity: 100 by 100 problem (inner loop optimization), 1000 by 1000 problem (three loop optimization – the whole program), and a scalable parallel problem – the HPL. New statistics were required to reflect the diversification of supercomputers, the enormous performance difference between low-end and high-end models, the increasing availability of massively parallel processing (MPP) systems, and the strong increase in computing power of the high-end models of workstation suppliers (SMP). To provide this new statistical foundation, the TOP500 list was created in 1993 to assemble and maintain a list of the 500 most powerful computer systems. The list is compiled twice a year with the help of high-performance computer experts, computational scientists, manufacturers, and the Internet community in general. In the list, computers are ranked by their performance on the HPL benchmark. With the unprecedented increase in performance comes the price of increased time the benchmark takes to complete. As the chart below shows, the time to completion is increasing with the number of cores. And in June 2009, the #2 entry reported a run that took over 18 hours.
To deal with this problem a new version of HPL will offer a partial run option. This will allow to run the benchmark only a fraction of the time it takes to run the full version of the benchmark without losing much of the reported performance. Initial runs on a system with over 30000 cores indicate only a 6% drop in performance with a 50% reduction in time. This becomes a viable alternative to, say, checkpointing to solve the problem of the the rapidly increasing number of cores and the decrease of Mean Time Between Failures (MTBF) for large supercomputer installations.
0
1
2
3
4
5
6
7
8
0
20
40
60
80
100
120
140
160
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33
TIME TAKEN BY #1 ENTRY IN TOP500
TOP500 ISSUE
TOTA
L TI
ME
(HO
UR
S)
PR
OC
ESS
OR
S*
*0
.4
Processors**0.4
Total Time (Hours)