Performance Analysis of Computer Systems (Leistungsanalyse von Rechnersystemen)
9 November 2011
Nöthnitzer Straße 46
Room 1026
Tel. +49 351 - 463 - 35048
Holger Brunst ([email protected])
Matthias S. Mueller ([email protected])
Center for Information Services and High Performance Computing (ZIH)
Summary of Previous Lecture
Different workloads:
– Test workload
– Real workload
– Synthetic workload
Historical examples for test workloads:
– Addition instruction
– Instruction mixes
– Kernels
– Synthetic programs
– Application benchmarks
Excursion on Speedup and Efficiency Metrics
Comparison of sequential and parallel algorithms
Speedup:
– n is the number of processors
– T1 is the execution time of the sequential algorithm
– Tn is the execution time of the parallel algorithm with n processors
Efficiency:
– Its value estimates how well n processors are utilized in solving a given problem
– Usually between zero and one. Exception: super-linear speedup (later)
Sn = T1 / Tn
En = Sn / n
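For illustration (hypothetical numbers, not from the lecture): if the sequential run takes T1 = 100 s and the run on n = 8 processors takes T8 = 20 s, then S8 = 100/20 = 5 and E8 = 5/8 = 0.625, i.e. the 8 processors are used at 62.5% efficiency.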
Amdahl's Law
Find the maximum expected improvement to an overall system when only part of the system is improved
Serial execution time = s+p
Parallel execution time = s+p/n
– Normalizing with respect to serial time (s+p) = 1 results in:
• Sn = 1/(s+p/n)
– Drops off rapidly as serial fraction increases
– Maximum speedup possible = 1/s, independent of n the number of processors!
Bad news: if an application has only 1% serial work (s = 0.01), you will never see a speedup greater than 100. So why do we build systems with more than 100 processors?
What is wrong with this argument?
Sn = (s + p) / (s + p/n)
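A minimal sketch (not from the slides) that evaluates this formula for a hypothetical serial fraction s = 0.01, normalized so that s + p = 1; it illustrates that the speedup approaches but never exceeds 1/s = 100:

#include <stdio.h>

/* Amdahl's law: S(n) = 1 / (s + (1 - s)/n) for serial fraction s,
   using the normalization s + p = 1 from the slide above. */
static double amdahl_speedup(double s, int n)
{
    return 1.0 / (s + (1.0 - s) / n);
}

int main(void)
{
    const double s = 0.01;                 /* hypothetical 1% serial work */
    const int n[] = { 1, 10, 100, 1000, 10000 };
    for (int i = 0; i < 5; i++)
        printf("n = %5d  S = %6.2f\n", n[i], amdahl_speedup(s, n[i]));
    return 0;                              /* S never exceeds 1/s = 100 */
}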
Popular and historical benchmarks
Popular benchmarks:
– Eratosthenes sieve algorithm
– Ackermann's function
– Whetstone
– LINPACK
– Dhrystone
– Lawrence Livermore Loops
– TPC-C
– SPEC
Workload description
Level of Detail of the workload description - Examples:
– Most frequent request (e.g. Addition)
– Frequency of request type (instruction mix)
– Time-stamped sequence of requests
– Average resource demand (e.g. 20 I/O requests per second)
– Distribution of resource demands (not only the average, but also probability distribution)
Characterization of Benchmarks
There are many metrics; each one has its purpose
– Raw machine performance: Tflops
– Microbenchmarks: Stream
– Algorithmic benchmarks: Linpack
– Compact Apps/Kernels: NAS benchmarks
– Application Suites: SPEC
– User-specific applications: Custom benchmarks
(Figure: the benchmark classes span the range from computer hardware to applications.)
Comparison of different benchmark classes
Class        Coverage  Relevance  Identify problems  Time evolution
Micro        0         0          ++                 +
Algorithmic  -         0          +                  ++
Kernels      0         0          +                  +
SPEC         +         +          +                  +
Apps         -         ++         0                  0
SPEC Benchmarks: CPU 2006
Application Benchmarks
Different metrics:
– Integer, floating point
– Standard and rate
– Base, peak
Run rules
Example for a Microbenchmark: Stream
Stream Benchmark
Author: John McCalpin ("Mr. Bandwidth")
John McCalpin, "Memory Bandwidth and Machine Balance in High Performance Computers", IEEE TCCA Newsletter, December 1995
http://www.cs.virginia.edu/stream/
STREAM: measure memory bandwidth with the operations:
– Copy: a(i) = b(i)
– Scale: a(i)=s*b(i)
– Add: a(i)=b(i)+c(i)
– Triad: a(i)=b(i)+s*c(i)
STREAM2: measures memory hierarchy bandwidth with the operations:
– Fill: a(i)=0
– Copy: a(i)=b(i)
– Daxpy: a(i) = a(i) +q*b(i)
– Sum: sum += a(i)
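For illustration, a minimal serial sketch of the four STREAM kernels, simplified from the official stream.c (the array length N, the timer, and the single repetition are assumptions; the real benchmark adds repetitions, validation, and OpenMP parallelism):

#include <stdio.h>
#include <time.h>

#define N 10000000                    /* assumed array length; must exceed the caches */

static double a[N], b[N], c[N];

static double seconds(void)           /* simple wall-clock timer */
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + 1e-9 * ts.tv_nsec;
}

int main(void)
{
    const double s = 3.0;
    for (long i = 0; i < N; i++) { a[i] = 1.0; b[i] = 2.0; c[i] = 0.5; }

    double t = seconds();
    for (long i = 0; i < N; i++) c[i] = a[i];              /* Copy:  16 bytes/iter */
    printf("Copy:  %8.1f MB/s\n", 2.0 * 8 * N / (seconds() - t) / 1e6);

    t = seconds();
    for (long i = 0; i < N; i++) b[i] = s * c[i];          /* Scale: 16 bytes/iter */
    printf("Scale: %8.1f MB/s\n", 2.0 * 8 * N / (seconds() - t) / 1e6);

    t = seconds();
    for (long i = 0; i < N; i++) c[i] = a[i] + b[i];       /* Add:   24 bytes/iter */
    printf("Add:   %8.1f MB/s\n", 3.0 * 8 * N / (seconds() - t) / 1e6);

    t = seconds();
    for (long i = 0; i < N; i++) a[i] = b[i] + s * c[i];   /* Triad: 24 bytes/iter */
    printf("Triad: %8.1f MB/s\n", 3.0 * 8 * N / (seconds() - t) / 1e6);
    return 0;
}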
Stream 2 properties
Stream Results: TOP 10 in 2011
STREAM Memory Bandwidth --- John D. McCalpin, [email protected]
Revised to Thu Jul 21 13:02:04 CDT 2011
All results are in MB/s --- 1 MB = 10^6 B, *not* 2^20 B

Sub. Date   Machine ID                   ncpus       COPY      SCALE        ADD      TRIAD
2011.04.05  SGI_Altix_UV_1000             2048  5321074.0  5346667.0  5823380.0  5859367.0
2006.07.10  SGI_Altix_4700                1024  3661963.0  3677482.0  4385585.0  4350166.0
2011.06.06  ScaleMP_Xeon_X6560_64B         768  1493963.0  2112630.0  2252598.0  2259709.0
2004.12.22  SGI_Altix_3700_Bx2             512   906388.0   870211.0  1055179.0  1119913.0
2003.11.13  SGI_Altix_3000                 512   854062.0   854338.0  1008594.0  1007828.0
2003.10.02  NEC_SX-7                        32   876174.7   865144.1   869179.2   872259.1
2008.04.07  IBM_Power_595                   64   679207.2   624707.8   777334.8   805804.6
1999.12.07  NEC_SX-5-16A                    16   607492.0   590390.0   607412.0   583069.0
2009.08.10  ScaleMP_XeonX5570_vSMP_16B     128   437571.0   431726.0   442722.0   445869.0
1997.06.10  NEC_SX-4                        32   434784.0   432886.0   437358.0   436954.0
Stream 2 Results
Linpack and TOP500
Slides courtesy Jack Dongarra
The Linpack Benchmark is a measure of a computer's floating-point rate of execution.
It is determined by running a computer program that solves a dense system of linear equations.
Over the years the characteristics of the benchmark have changed a bit.
In fact, there are three benchmarks included in the Linpack Benchmark report.
LINPACK Benchmark: dense linear system solve with LU factorization using partial pivoting
Operation count: (2/3)n^3 + O(n^2)
Benchmark measure: MFlop/s
The original benchmark measures the execution rate of a Fortran program on a matrix of size 100x100.
When the Linpack Fortran n = 100 benchmark is run it produces the following kind of results:
Please send the results of this run to:
Jack J. Dongarra
Computer Science Department
University of Tennessee Knoxville, Tennessee 37996-1300
Fax: 865-974-8296
Internet: [email protected]
norm. resid      resid           machep          x(1)            x(n)
1.67005097E+00   7.41628980E-14  2.22044605E-16  1.00000000E+00  1.00000000E+00

times are reported for matrices of order 100
     dgefa      dgesl      total     mflops       unit      ratio
times for array with leading dimension of 201
 1.540E-03  6.888E-05  1.609E-03  4.268E+02  4.686E-03  2.873E-02
 1.509E-03  7.084E-05  1.579E-03  4.348E+02  4.600E-03  2.820E-02
 1.509E-03  7.003E-05  1.579E-03  4.348E+02  4.600E-03  2.820E-02
 1.502E-03  6.593E-05  1.568E-03  4.380E+02  4.567E-03  2.800E-02
times for array with leading dimension of 200
 1.431E-03  6.716E-05  1.498E-03  4.584E+02  4.363E-03  2.675E-02
 1.424E-03  6.694E-05  1.491E-03  4.605E+02  4.343E-03  2.663E-02
 1.431E-03  6.699E-05  1.498E-03  4.583E+02  4.364E-03  2.676E-02
 1.432E-03  6.439E-05  1.497E-03  4.588E+02  4.360E-03  2.673E-02
(In this output, dgefa is the time to factor, dgesl the time to solve, total their sum, and mflops the resulting Mflop/s rate.)
In the beginning there was the Linpack 100 Benchmark (1977)
n=100 (80KB); size that would fit in all the machines
Fortran; 64 bit floating point arithmetic
No hand optimization (only compiler options)
Year  Computer                                       Number of Processors  Cycle time  Mflop/s
2006  Intel Pentium Woodcrest (3 GHz)                 1                    3 GHz        3018
2005  NEC SX-8/1 (1 proc)                             1                    2 GHz        2177
2004  Intel Pentium Nocona (1 proc 3.6 GHz)           1                    3.6 GHz      1803
2003  HP Integrity Server rx2600 (1 proc 1.5 GHz)     1                    1.5 GHz      1635
2002  Intel Pentium 4 (3.06 GHz)                      1                    3.06 GHz     1414
2001  Fujitsu VPP5000/1                               1                    3.33 nsec    1156
2000  Fujitsu VPP5000/1                               1                    3.33 nsec    1156
1999  CRAY T916                                       4                    2.2 nsec     1129
1995  CRAY T916                                       1                    2.2 nsec      522
1994  CRAY C90                                       16                    4.2 nsec      479
1993  CRAY C90                                       16                    4.2 nsec      479
1992  CRAY C90                                       16                    4.2 nsec      479
1991  CRAY C90                                       16                    4.2 nsec      403
1990  CRAY Y-MP                                       8                    6.0 nsec      275
1989  CRAY Y-MP                                       8                    6.0 nsec      275
1988  CRAY Y-MP                                       1                    6.0 nsec       74
1987  ETA 10-E                                        1                   10.5 nsec       52
1986  NEC SX-2                                        1                    6.0 nsec       46
1985  NEC SX-2                                        1                    6.0 nsec       46
1984  CRAY X-MP                                       1                    9.5 nsec       21
1983  CRAY 1                                          1                   12.5 nsec       12
1979  CRAY 1                                          1                   12.5 nsec      3.4
In the beginning there was the Linpack 100 Benchmark (1977)
n=100 (80KB); size that would fit in all the machines
Fortran; 64 bit floating point arithmetic
No hand optimization (only compiler options)
Linpack 1000 (1986)
n=1000 (8MB); wanted to see higher performance levels
Any language; 64 bit floating point arithmetic
Hand optimization OK
Linpack TPP (1991) (Top500; 1993)
Any size (n as large as you can; n = 10^6; 8 TB; ~6 hours)
Any language; 64 bit floating point arithmetic
Hand optimization OK; Strassen's method not allowed (confuses the op count and rate)
Reference implementation available
In all cases results are verified by looking at:
  ||Ax - b|| / (||A|| ||x|| n) = O(1)
Operation count for factorization: (2/3)n^3 + (1/2)n^2; solve: 2n^2
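As a sketch of how such a scaled residual could be computed (the choice of the infinity norm and the plain C loops are assumptions here; the benchmark distributions contain their own check):

#include <math.h>

/* Scaled residual ||Ax - b|| / (||A|| ||x|| n), using the max (infinity)
   norm; the result is accepted when this quantity is of order one.
   A is stored row-major, n x n. */
double scaled_residual(int n, const double *A, const double *x, const double *b)
{
    double res = 0.0, norm_A = 0.0, norm_x = 0.0;
    for (int i = 0; i < n; i++) {
        double axi = 0.0, rowsum = 0.0;
        for (int j = 0; j < n; j++) {
            axi    += A[i * n + j] * x[j];
            rowsum += fabs(A[i * n + j]);
        }
        if (fabs(axi - b[i]) > res) res = fabs(axi - b[i]);
        if (rowsum > norm_A)        norm_A = rowsum;
        if (fabs(x[i]) > norm_x)    norm_x = fabs(x[i]);
    }
    return res / (norm_A * norm_x * n);
}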
LINPACK NxN benchmark:
– Solves a system of linear equations by some method
– Allows the vendors to choose the size of the problem for the benchmark
– Measures the execution time for each problem size
LINPACK NxN report:
– Nmax: the size of the chosen problem run on a machine
– Rmax: the performance in Gflop/s for the chosen size problem run on the machine
– N1/2: the size where half the Rmax execution rate is achieved
– Rpeak: the theoretical peak performance in Gflop/s for the machine
LINPACK NxN is used to rank the TOP500 fastest computers in the world
(Figure: Linpack rate vs. problem size; the rate reaches Rmax at size Nmax and half of Rmax at N1/2.)
TPP performance
(Entries for this table began in 1991.)
Year       Computer                                     # of Procs  Measured Gflop/s  Size of Problem  Size of 1/2 Perf  Theoretical Peak Gflop/s
2005-2006  IBM Blue Gene/L                                  131072            280600          1769471                                      367001
2002-2004  Earth Simulator Computer, NEC                      5104             35610          1041216            265408                     40832
2001       ASCI White-Pacific, IBM SP Power 3                 7424              7226           518096            179000                     11136
2000       ASCI White-Pacific, IBM SP Power 3                 7424              4938           430000                                       11136
1999       ASCI Red Intel Pentium II Xeon core                9632              2379           362880             75400                      3207
1998       ASCI Blue-Pacific SST, IBM SP 604E                 5808              2144           431344                                        3868
1997       Intel ASCI Option Red (200 MHz Pentium Pro)        9152              1338           235000             63000                      1830
1996       Hitachi CP-PACS                                    2048             368.2           103680             30720                       614
1995       Intel Paragon XP/S MP                              6768             281.1           128600             25700                       338
1994       Intel Paragon XP/S MP                              6768             281.1           128600             25700                       338
1993       Fujitsu NWT                                         140             124.5            31920             11950                       236
1992       NEC SX-3/44                                           4              20.0             6144               832                        22
1991       Fujitsu VP2600/10                                     1               4.0             1000               200                         5
Performance of Supercomputers at ZIH
(Figure: performance in TFlop/s over the years, compared with the TOP500 rank 1, rank 10, and rank 500 lines:
– VP200-EX: 472 MFlops, rank 500
– Cray T3E: 28 GFlops, rank 237
– SGI Origin 2000: 16.5 GFlops, rank 236
– SGI Origin 3800: 85.4 GFlops, rank 351
– PC-Farm: 10.88 TFlops, rank 79
– SGI Altix: 11.9 TFlops, rank 49
– HRSK-II stage 1 and HRSK-II stage 2)
TOP10 (June 2011)
Rank, site, computer / year / vendor, cores, Rmax (TFlop/s), Rpeak (TFlop/s), power (kW):

1. RIKEN Advanced Institute for Computational Science (AICS), Japan
   K computer, SPARC64 VIIIfx 2.0 GHz, Tofu interconnect / 2011 / Fujitsu
   548352 cores, Rmax 8162.00, Rpeak 8773.63, Power 9898.56
2. National Supercomputing Center in Tianjin, China
   Tianhe-1A - NUDT TH MPP, X5670 2.93 GHz 6C, NVIDIA GPU, FT-1000 8C / 2010 / NUDT
   186368 cores, Rmax 2566.00, Rpeak 4701.00, Power 4040.00
3. DOE/SC/Oak Ridge National Laboratory, United States
   Jaguar - Cray XT5-HE Opteron 6-core 2.6 GHz / 2009 / Cray Inc.
   224162 cores, Rmax 1759.00, Rpeak 2331.00, Power 6950.60
4. National Supercomputing Centre in Shenzhen (NSCS), China
   Nebulae - Dawning TC3600 Blade, Intel X5650, NVidia Tesla C2050 GPU / 2010 / Dawning
   120640 cores, Rmax 1271.00, Rpeak 2984.30, Power 2580.00
5. GSIC Center, Tokyo Institute of Technology, Japan
   TSUBAME 2.0 - HP ProLiant SL390s G7 Xeon 6C X5670, Nvidia GPU, Linux/Windows / 2010 / NEC/HP
   73278 cores, Rmax 1192.00, Rpeak 2287.63, Power 1398.61
6. DOE/NNSA/LANL/SNL, United States
   Cielo - Cray XE6 8-core 2.4 GHz / 2011 / Cray Inc.
   142272 cores, Rmax 1110.00, Rpeak 1365.81, Power 3980.00
7. NASA/Ames Research Center/NAS, United States
   Pleiades - SGI Altix ICE 8200EX/8400EX, Xeon HT QC 3.0/Xeon 5570/5670 2.93 GHz, Infiniband / 2011 / SGI
   111104 cores, Rmax 1088.00, Rpeak 1315.33, Power 4102.00
8. DOE/SC/LBNL/NERSC, United States
   Hopper - Cray XE6 12-core 2.1 GHz / 2010 / Cray Inc.
   153408 cores, Rmax 1054.00, Rpeak 1288.63, Power 2910.00
9. Commissariat a l'Energie Atomique (CEA), France
   Tera-100 - Bull bullx super-node S6010/S6030 / 2010 / Bull SA
   138368 cores, Rmax 1050.00, Rpeak 1254.55, Power 4590.00
10. DOE/NNSA/LANL, United States
    Roadrunner - BladeCenter QS22/LS21 Cluster, PowerXCell 8i 3.2 GHz / Opteron DC 1.8 GHz, Voltaire Infiniband / 2009 / IBM
    122400 cores, Rmax 1042.00, Rpeak 1375.78, Power 2345.50
Trends: Architectures
Trends: Processor Family
Trends: Interconnect Family
Trends: Operating System Family
HPCC Benchmark
Slides courtesy Jack Dongarra
From Linpack Benchmark and Top500: “no single number can reflect overall performance”
Clearly need something more than Linpack
HPC Challenge Benchmark
Test suite stresses not only the processors, but the memory system and the interconnect.
The real utility of the HPCC benchmarks is that architectures can be described with a wider range of metrics than just Flop/s from Linpack.
Linpack Benchmark
Good
One number
Simple to define & easy to rank
Allows problem size to change with machine and over time
Bad
Emphasizes only peak CPU speed and number of CPUs
Does not stress local bandwidth
Does not stress the network
Does not test gather/scatter
Ignores Amdahl's Law (only does weak scaling)
…
Ugly
Benchmarketeering hype
Consists of basically 7 benchmarks; think of it as a framework or harness for adding benchmarks of interest.
1. HPL (LINPACK): MPI Global (Ax = b)
2. STREAM: Local (single CPU); *STREAM: Embarrassingly parallel
3. PTRANS (A = A + B^T): MPI Global
4. RandomAccess: Local (single CPU); *RandomAccess: Embarrassingly parallel; RandomAccess: MPI Global
5. BW and Latency – MPI
6. FFT – Global, single CPU, and EP
7. Matrix Multiply – single CPU and EP
HPCC was developed by HPCS to assist in testing new HEC systems
Each benchmark focuses on a different part of the memory hierarchy
HPCS performance targets attempt to:
– Flatten the memory hierarchy
– Improve real application performance
– Make programming easier
HPC Challenge Performance Targets
HPL: linear system solve Ax = b
STREAM: vector operations A = B + s * C
FFT: 1D Fast Fourier Transform Z = fft(X)
RandomAccess: integer update T[i] = XOR( T[i], rand)
(Figure: memory hierarchy from registers through cache(s), local memory, remote memory, disk, and tape; data moves between the levels as operands, cache lines/blocks, pages, and messages.)
Local - only a single processor is performing computations.
Embarrassingly Parallel - each processor in the entire system is performing computations, but they do not communicate with each other explicitly.
Global - all processors in the system are performing computations and they explicitly communicate with each other.
(Figure: mapping of the HPCC tests to computational resources. HPL and Matrix Multiply stress CPU computational speed, STREAM stresses memory bandwidth, and the Random & Natural Ring Bandwidth & Latency tests stress node interconnect bandwidth; the tests differ in their memory access patterns.)
TPP Linpack Benchmark
– Used for the Top500 ratings
– Solves Ax = b for a dense problem; the matrix is random
– Uses LU decomposition with partial pivoting
– Based on the ScaLAPACK routines but optimized
– The algorithm is scalable in the sense that the parallel efficiency is maintained constant with respect to the per-processor memory usage
– Double precision (64-bit) arithmetic
– Run on all processors; problem size set by the user
– These settings are used for the other tests
Requires an implementation of MPI and an implementation of the Basic Linear Algebra Subprograms (BLAS)
Reports the total TFlop/s achieved for the set of processors
Takes the most time; stopping the run after, say, 25% is under consideration (the result would still be checked for correctness)
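The underlying operation can be illustrated with a small stand-alone solver. This is not HPL itself: LAPACKE_dgesv from LAPACK, the n = 1000 problem size, and the timing are illustrative choices, and HPL distributes the matrix over MPI processes instead of solving it on one node:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <lapacke.h>

int main(void)
{
    const int n = 1000;                       /* illustrative problem size */
    double *A = malloc((size_t)n * n * sizeof *A);
    double *b = malloc((size_t)n * sizeof *b);
    lapack_int *ipiv = malloc((size_t)n * sizeof *ipiv);

    srand(1);                                 /* random dense system, as in the benchmark */
    for (long i = 0; i < (long)n * n; i++) A[i] = (double)rand() / RAND_MAX - 0.5;
    for (int i = 0; i < n; i++)            b[i] = (double)rand() / RAND_MAX - 0.5;

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    /* LU factorization with partial pivoting plus triangular solves (Ax = b) */
    int info = LAPACKE_dgesv(LAPACK_ROW_MAJOR, n, 1, A, n, ipiv, b, 1);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs  = (t1.tv_sec - t0.tv_sec) + 1e-9 * (t1.tv_nsec - t0.tv_nsec);
    double flops = 2.0 / 3.0 * n * (double)n * n + 2.0 * (double)n * n;
    printf("info = %d, %.3f s, %.1f MFlop/s\n", info, secs, flops / secs / 1e6);

    free(A); free(b); free(ipiv);
    return 0;
}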
The STREAM Benchmark is a standard benchmark for the measurement of computer memory bandwidth
Measures the bandwidth sustainable from standard operations -- not the theoretical "peak bandwidth" provided by most vendors
Four operations: COPY, SCALE, ADD, TRIAD
Measures machine balance: the relative cost of memory accesses vs. arithmetic
Vector lengths are chosen to fill local memory
Tested on a single processor and on all processors in the set in an embarrassingly parallel fashion
Reports the total GB/s achieved per processor
------------------------------------------------------------------
name kernel bytes/iter FLOPS/iter
------------------------------------------------------------------
COPY: a(i) = b(i) 16 0
SCALE: a(i) = q*b(i) 16 1
SUM: a(i) = b(i) + c(i) 24 1
TRIAD: a(i) = b(i) + q*c(i) 24 2
------------------------------------------------------------------
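Illustration of machine balance (hypothetical numbers, not measurements): a processor with 50 GFlop/s peak and a sustained Triad bandwidth of 10 GB/s delivers 10/8 = 1.25 billion 8-byte words per second, giving a balance of 50/1.25 = 40 flops per memory access; the larger this number, the more memory-bound typical code becomes.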
Implements parallel matrix transpose
A = A + B^T
The matrices A and B are distributed across the processors
Two-dimensional block-cyclic storage; same storage as for HPL
Exercises the communications pattern where pairs of processors communicate with each other simultaneously.
Large (out-of-cache) data transfers across the network
Stresses the global bisection bandwidth
Reports total GB/s achieved for set of processors
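A serial sketch of the operation itself (the block-cyclic distribution and the communication described above are omitted; the matrix size and row-major storage are assumptions):

#include <stdio.h>
#include <stdlib.h>

/* A = A + B^T for dense n x n matrices stored row-major.  In PTRANS both
   matrices are block-cyclically distributed, so the element B(j,i) needed
   for A(i,j) usually lives on another processor, forcing pairwise
   exchanges across the network; this serial version hides that. */
static void add_transpose(int n, double *A, const double *B)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            A[i * n + j] += B[j * n + i];
}

int main(void)
{
    const int n = 1000;                       /* illustrative size */
    double *A = calloc((size_t)n * n, sizeof *A);
    double *B = calloc((size_t)n * n, sizeof *B);
    for (long k = 0; k < (long)n * n; k++) B[k] = (double)k;
    add_transpose(n, A, B);
    printf("A[1] = %.1f (should equal B[n] = %.1f)\n", A[1], B[(long)n]);
    free(A); free(B);
    return 0;
}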
Integer read-modify-write to random addresses; no spatial or temporal locality
Measures memory latency, or the ability to hide memory latency
Architecture stresses: latency to cache and main memory
Architectures that can generate enough outstanding memory operations to tolerate the latency turn this into a main-memory-bandwidth-constrained benchmark
Three forms:
– Tested on a single processor
– Tested on all processors in the set in an embarrassingly parallel fashion
– Tested with an MPI version across the set of processors: each processor caches updates, then all processors perform MPI all-to-all communication to apply the updates across processors
Reports Gup/s (giga updates per second) per processor
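A minimal serial sketch of the idea (the real GUPS benchmark prescribes its own pseudo-random sequence and table size; the xorshift generator and the 2^24-entry table below are illustrative stand-ins):

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

#define TABLE_BITS 24                        /* assumed table size: 2^24 entries */
#define TABLE_SIZE (1UL << TABLE_BITS)

int main(void)
{
    uint64_t *T = malloc(TABLE_SIZE * sizeof *T);
    for (uint64_t i = 0; i < TABLE_SIZE; i++) T[i] = i;

    uint64_t r = 0x123456789abcdef0ULL;      /* xorshift64 as an illustrative random stream */
    uint64_t updates = 4 * TABLE_SIZE;       /* HPCC performs 4 updates per table entry */
    for (uint64_t u = 0; u < updates; u++) {
        r ^= r << 13; r ^= r >> 7; r ^= r << 17;
        T[r & (TABLE_SIZE - 1)] ^= r;        /* read-modify-write at a random address */
    }
    printf("T[0] = %llu\n", (unsigned long long)T[0]);   /* keep the work observable */
    free(T);
    return 0;
}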
Ping-Pong test between pairs of processors
Send a message from proc_i to proc_k, then return the message from proc_k to proc_i:
– proc_i: MPI_Send(), proc_k: MPI_Recv(); then proc_i: MPI_Recv(), proc_k: MPI_Send(); the other processors are in MPI_Waitall()
– time += MPI_Wtime(); time /= 2
The test is performed between as many distinct pairs of processors as possible
There is an upper bound on the time for the test
Tries to find the weakest link amongst all pairs: minimum bandwidth and maximum latency (not necessarily the same link is the worst for bandwidth and latency)
An 8 B message is used for the latency test (take the max time); a 2 MB message is used for the bandwidth test (take the min GB/s)
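A hedged sketch of a single ping-pong exchange between two ranks (the pair selection, repetitions, warm-up, and the 8 B latency variant of the real benchmark are omitted; the rank numbers and buffer size are illustrative; run with at least 2 MPI processes):

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int nbytes = 2000000;               /* 2 MB message for the bandwidth test */
    char *buf = malloc(nbytes);
    const int proc_i = 0, proc_k = 1;         /* one pair; the benchmark tests many pairs */

    MPI_Barrier(MPI_COMM_WORLD);
    double t = MPI_Wtime();
    if (rank == proc_i) {
        MPI_Send(buf, nbytes, MPI_CHAR, proc_k, 0, MPI_COMM_WORLD);
        MPI_Recv(buf, nbytes, MPI_CHAR, proc_k, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    } else if (rank == proc_k) {
        MPI_Recv(buf, nbytes, MPI_CHAR, proc_i, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Send(buf, nbytes, MPI_CHAR, proc_i, 0, MPI_COMM_WORLD);
    }
    t = (MPI_Wtime() - t) / 2.0;              /* one-way time */
    if (rank == proc_i)
        printf("one-way time %.6f s, bandwidth %.2f GB/s\n", t, nbytes / t / 1e9);

    free(buf);
    MPI_Finalize();
    return 0;
}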
Two types of rings:
– Naturally ordered (use MPI_COMM_WORLD): 0, 1, 2, ..., P-1
– Randomly ordered (30 rings tested), e.g. 7, 2, 5, 0, 3, 1, 4, 6
Each node posts two sends (to its left and right neighbor) and two receives (from its left and right neighbor).
Two types of communication routines are used: combined send/receive and non-blocking send/receive:
– MPI_Sendrecv(TO: right_neighbor, FROM: left_neighbor)
– MPI_Irecv(left_neighbor), MPI_Irecv(right_neighbor) and MPI_Isend(right_neighbor), MPI_Isend(left_neighbor)
The smaller (better) time for each is taken (which one is smaller depends on the MPI implementation).
Message 8B used for latency test; Message 2MB used for bandwidth test;
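A simplified sketch of one step of the naturally ordered ring using the combined send/receive variant (the randomly ordered rings, the non-blocking variant, and the repetition and averaging rules of the real benchmark are omitted; the 2 MB message size matches the bandwidth test above):

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int nbytes = 2000000;                       /* 2 MB per message */
    char *sbuf = malloc(nbytes), *rbuf = malloc(nbytes);
    int right = (rank + 1) % size;                    /* natural ring: 0, 1, ..., P-1 */
    int left  = (rank - 1 + size) % size;

    MPI_Barrier(MPI_COMM_WORLD);
    double t = MPI_Wtime();
    /* send to the right neighbor, receive from the left neighbor */
    MPI_Sendrecv(sbuf, nbytes, MPI_CHAR, right, 0,
                 rbuf, nbytes, MPI_CHAR, left,  0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    /* send to the left neighbor, receive from the right neighbor */
    MPI_Sendrecv(sbuf, nbytes, MPI_CHAR, left,  1,
                 rbuf, nbytes, MPI_CHAR, right, 1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    t = MPI_Wtime() - t;
    if (rank == 0)
        printf("ring step: %.6f s, per-process traffic %.2f GB/s\n",
               t, 4.0 * nbytes / t / 1e9);            /* 2 sends + 2 receives */

    free(sbuf); free(rbuf);
    MPI_Finalize();
    return 0;
}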
Uses the FFTE software (Daisuke Takahashi's code from the University of Tsukuba); 64-bit complex 1-D FFT
Uses 64-bit addressing
Global transpose with MPI_Alltoall()
Three transposes (data is never scrambled)
Single program to download and run; simple input file similar to the HPL input
Base run and optimization run; the base run must be made
– Base run: user supplies MPI and the BLAS
– Optimized run: allowed to replace certain routines; the user specifies what was done
Results are uploaded via the website; an HTML table and an Excel spreadsheet are generated with the performance results
Intentionally we are not providing a single figure of merit (no overall ranking)
Goal: no more than 2 X the time to execute HPL.
1.Download
2.Install
3.Run
4.Upload results
5. Confirm via email
6.Tune
7.Run
8.Upload results
9. Confirm via email
Only some routines can be replaced
Data layout needs to be preserved
Multiple languages can be used
Provide detailed installation and execution environment
Results are immediately available on the web site: interactive HTML, XML, MS Excel, Kiviat charts (radar plots)
Optional
Prerequisites: C compiler, BLAS, MPI