1 “How Can We Address the Needs and Solve the Problems in HPC Benchmarking?” Jack Dongarra...
Slide 1
"How Can We Address the Needs and Solve the Problems in HPC Benchmarking?"
Jack Dongarra, Innovative Computing Laboratory, University of Tennessee
http://www.cs.utk.edu/~dongarra/
Workshop on the Performance Characterization of Algorithms
Slide 2
LINPACK Benchmark - accidental benchmarking
Designed to help users extrapolate execution time for the Linpack software
First benchmark report from 1979
My iPAQ running the benchmark in Java comes in here today.
Slide 3
Accidental Benchmarking
Portable, runs on any system
Easy to understand
Content changed over time: n = 100, 300, 1000, as large as possible (Top500)
Allows for restructuring the algorithm
Performance data with the same arithmetic precision
Benchmark checks to see if the "correct solution" is achieved
Not intended to measure entire machine performance.
In the benchmark report: "One further note: The following performance data should not be taken too seriously."
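The "correct solution" check works roughly like this: solve Ax = b, then verify that a scaled residual is of order one. A minimal sketch in Python (the real benchmark is Fortran/C; the exact scaling formula here is an assumption based on common practice):

```python
import sys

def solve(A, b):
    """Solve Ax = b by Gaussian elimination with partial pivoting
    (on copies, so the inputs are left intact)."""
    n = len(A)
    A = [row[:] for row in A]
    x = b[:]
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(A[i][k]))
        A[k], A[p] = A[p], A[k]
        x[k], x[p] = x[p], x[k]
        for i in range(k + 1, n):
            m = A[i][k] / A[k][k]
            for j in range(k, n):
                A[i][j] -= m * A[k][j]
            x[i] -= m * x[k]
    for k in range(n - 1, -1, -1):
        s = sum(A[k][j] * x[j] for j in range(k + 1, n))
        x[k] = (x[k] - s) / A[k][k]
    return x

def scaled_residual(A, x, b):
    """||Ax - b|| / (n * ||A|| * ||x|| * eps) using infinity norms;
    a value of order one means the solution passes."""
    n = len(A)
    r = max(abs(sum(A[i][j] * x[j] for j in range(n)) - b[i]) for i in range(n))
    norm_A = max(sum(abs(v) for v in row) for row in A)
    norm_x = max(abs(v) for v in x)
    return r / (n * norm_A * norm_x * sys.float_info.epsilon)
```

For a well-conditioned system the scaled residual stays small; a run whose residual grows large is flagged as incorrect.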
Slide 4
LINPACK Benchmark
Historical data: for n = 100, the same software for the last 22 years
Unbiased reporting
Software and results freely available worldwide
Should be able to achieve high performance on this problem, if not...
Compiler test at n = 100; heavily hand-optimized at TPP (modified ScaLAPACK implementation)
Scalable benchmark, in size and parallelism
Pressure on vendors to optimize my software and provide a set of kernels that benefit others
Run rules very important
Today, n = 0.5 x 10^6 at 7.2 TFlop/s requires 3.3 hours. On a Petaflops machine, n = 5 x 10^6 will require 1 day.
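Those run times follow from the operation count: dense LU factorization does about (2/3)n^3 floating-point operations, so time is roughly (2/3)n^3 divided by the sustained rate. A quick sanity check (my arithmetic, not from the slide):

```python
def linpack_hours(n, flops_per_sec):
    """Estimated dense-LU wall time: about (2/3) * n^3 flops
    at the given sustained rate, converted to hours."""
    return (2.0 / 3.0) * n ** 3 / flops_per_sec / 3600.0
```

n = 0.5 x 10^6 at 7.2 TFlop/s gives about 3.2 hours, and n = 5 x 10^6 at 1 PFlop/s gives about 23 hours, i.e. roughly a day, matching the figures above.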
Slide 5
Benchmarks give machine signatures and algorithm characteristics
Make improvements in applications
Users looking for performance portability
Many of the things we do are specific to one system's parameters
Need a way to understand and rapidly develop software which has a chance at high performance
Slide 6
Self-Adapting Numerical Software (SANS)
Today's processors can achieve high performance, but this requires extensive machine-specific hand tuning.
Simple operations like matrix-vector ops require many man-hours per platform
Software lags far behind hardware introduction
Only done if the financial incentive is there
Compilers are not up to the optimization challenge
Hardware, compilers, and software have a large design space with many parameters: blocking sizes, loop nesting permutations, loop unrolling depths, software pipelining strategies, register allocations, and instruction schedules.
Complicated interactions with the increasingly sophisticated micro-architectures of new microprocessors.
Need for quick/dynamic deployment of optimized routines.
ATLAS - Automatically Tuned Linear Algebra Software
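Blocking size is the canonical example of such a parameter. A sketch of a cache-blocked matrix multiply in Python, where `nb` is the value a tuner would search over (illustrative only; ATLAS generates C):

```python
def matmul_blocked(A, B, nb):
    """C = A * B for square matrices (lists of lists), computed in
    nb-by-nb blocks so a working set of B stays resident in cache."""
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, nb):
        for kk in range(0, n, nb):
            for jj in range(0, n, nb):
                for i in range(ii, min(ii + nb, n)):
                    for k in range(kk, min(kk + nb, n)):
                        aik = A[i][k]
                        Ci, Bk = C[i], B[k]
                        for j in range(jj, min(jj + nb, n)):
                            Ci[j] += aik * Bk[j]
    return C
```

The result is identical for any `nb`; only the memory traffic pattern, and hence the speed, changes, which is why the right value must be found per machine.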
Slide 7
Software Generation Strategy - BLAS
Takes ~20 minutes to run.
"New" model of high-performance programming where critical code is machine-generated using parameter optimization.
Designed for RISC architectures (superscalar); needs a reasonable C compiler
Today ATLAS is in use by Matlab, Mathematica, Octave, Maple, Debian, Scyld Beowulf, SuSE, ...
Parameter study of the hardware
Generate multiple versions of code, with different values of key performance parameters
Run and measure the performance of the various versions
Pick the best and generate the library
Level 1 cache multiply optimizes for: TLB access, L1 cache reuse, FP unit usage, memory fetch, register reuse, loop overhead minimization
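The generate/measure/pick loop itself is simple; the value is in the search. A hedged sketch of the timing harness (the function name and structure are mine, not ATLAS's):

```python
import time

def pick_best(versions, trials=3):
    """Time each candidate implementation (a dict of name -> zero-argument
    callable), keep the best of a few trials to reduce noise, and return
    the winner's name."""
    timings = {}
    for name, run in versions.items():
        best = float('inf')
        for _ in range(trials):
            t0 = time.perf_counter()
            run()
            best = min(best, time.perf_counter() - t0)
        timings[name] = best
    return min(timings, key=timings.get)
```

In use this looks like `pick_best({"nb=16": lambda: kernel(16), "nb=32": lambda: kernel(32)})`; the winning parameter values are then baked into the generated library.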
Slide 8
ATLAS (DGEMM n = 500)
ATLAS is faster than all other portable BLAS implementations and is comparable with machine-specific libraries provided by the vendor.
[Chart: MFLOP/s (0-2000) across architectures for Vendor BLAS, ATLAS BLAS, and F77 BLAS]
Slide 9
Related Tuning Projects
PHiPAC: Portable High Performance ANSI C; initial automatic GEMM generation project; http://www.icsi.berkeley.edu/~bilmes/phipac
FFTW: Fastest Fourier Transform in the West; http://www.fftw.org
UHFFT: tuning parallel FFT algorithms; http://rodin.cs.uh.edu/~mirkovic/fft/parfft.htm
SPIRAL: Signal Processing Algorithms Implementation Research for Adaptable Libraries; maps DSP algorithms to architectures; http://www.ece.cmu.edu/~spiral/
Sparsity: sparse-matrix-vector and sparse-matrix-matrix multiplication; tunes code to the sparsity structure of the matrix; more later in this tutorial; http://www.cs.berkeley.edu/~ejim/publication/
Slide 10
Experiments with C, Fortran, and Java for ATLAS (DGEMM kernel)
[Chart: Mflop/s (0-800) for Fortran, C, and Java kernels on IBM Power 3 375 MHz (bars near 693, 686, 552), Intel Pentium II 400 MHz (Linux, IBM 1.1.8 JDK), and Compaq Alpha 21264 500 MHz (Java 2 SDK 1.2.2 with FastVM 1.2.2)]
Slide 11
Machine-Assisted Application Development and Adaptation
Communication libraries: optimize for the specifics of one's configuration.
Algorithm layout and implementation: look at the different ways to express the implementation
Slide 12
Work in Progress: ATLAS-like Approach Applied to Broadcast (PII 8-way cluster with 100 Mb/s switched network)

Message Size (bytes)   Optimal algorithm   Buffer Size (bytes)
8                      binomial            8
16                     binomial            16
32                     binary              32
64                     binomial            64
128                    binomial            128
256                    binomial            256
512                    binomial            512
1K                     sequential          1K
2K                     binary              2K
4K                     binary              2K
8K                     binary              2K
16K                    binary              4K
32K                    binary              4K
64K                    ring                4K
128K                   ring                4K
256K                   ring                4K
512K                   ring                4K
1M                     binary              4K

[Diagram: broadcast trees rooted at Root - Sequential, Binary, Binomial, Ring]
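At run time, the tuned library just looks the message size up in such a measured table. A sketch of the dispatch (table values transcribed from the measurements above; the function name is mine):

```python
from bisect import bisect_left

# (message size in bytes, optimal algorithm, buffer size in bytes),
# as measured on the PII cluster described above.
BCAST_TABLE = [
    (8, "binomial", 8),            (16, "binomial", 16),
    (32, "binary", 32),            (64, "binomial", 64),
    (128, "binomial", 128),        (256, "binomial", 256),
    (512, "binomial", 512),        (1 << 10, "sequential", 1 << 10),
    (2 << 10, "binary", 2 << 10),  (4 << 10, "binary", 2 << 10),
    (8 << 10, "binary", 2 << 10),  (16 << 10, "binary", 4 << 10),
    (32 << 10, "binary", 4 << 10), (64 << 10, "ring", 4 << 10),
    (128 << 10, "ring", 4 << 10),  (256 << 10, "ring", 4 << 10),
    (512 << 10, "ring", 4 << 10),  (1 << 20, "binary", 4 << 10),
]
_SIZES = [row[0] for row in BCAST_TABLE]

def choose_broadcast(nbytes):
    """Pick (algorithm, buffer size) for a message, rounding up to the
    nearest measured size (largest entry for anything bigger)."""
    i = min(bisect_left(_SIZES, nbytes), len(BCAST_TABLE) - 1)
    return BCAST_TABLE[i][1], BCAST_TABLE[i][2]
```

The table itself is what the ATLAS-like search produces; a different cluster would yield different crossover points.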
Slide 13
Conjugate Gradient Variants by Dynamic Selection at Run Time
Variants combine inner products to reduce the communication bottleneck at the expense of more scalar ops.
Same number of iterations; no advantage on a sequential processor
With a large number of processors and a high-latency network there may be advantages.
Improvements can range from 15% to 50% depending on size.
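The idea behind these variants: accumulate several inner products in one pass so a parallel implementation needs one global reduction instead of two. A minimal illustration (the actual CG restructurings are more involved):

```python
def fused_inner_products(r, q):
    """Accumulate <r, r> and <r, q> in a single sweep over the vectors;
    in a parallel CG variant the two partial sums then travel in one
    allreduce instead of two, halving the latency cost."""
    rr = rq = 0.0
    for ri, qi in zip(r, q):
        rr += ri * ri
        rq += ri * qi
    return rr, rq
```

On one processor this saves nothing, which matches the point above: the win appears only when each reduction carries a network latency cost.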
Slide 15
Reformulating/Rearranging/Reuse
Example is the reduction to narrow band form for the SVD
Fetch each entry of A once
Restructure and combine operations
Results in a speedup of > 30%

A_new = A - u y^T - w v^T
y = A^T u
w = A_new v
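The "fetch each entry of A once" restructuring fuses the two matrix passes into one sweep. A simplified sketch in Python, fusing y = A^T u with a product A v (the real kernel folds in the rank-update to form A_new v in the same pass):

```python
def fused_matvecs(A, u, v):
    """Compute y = A^T u and z = A v in a single sweep over A, so each
    entry A[i][j] is fetched from memory only once instead of twice."""
    n, m = len(A), len(A[0])
    y = [0.0] * m
    z = [0.0] * n
    for i in range(n):
        Ai, ui = A[i], u[i]
        zi = 0.0
        for j in range(m):
            a = Ai[j]
            y[j] += a * ui      # contributes to y = A^T u
            zi += a * v[j]      # contributes to z = A v
        z[i] = zi
    return y, z
```

Since these operations are memory-bound, halving the traffic through A is where the >30% speedup comes from.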
Slide 16
Tools for Performance Evaluation
Timing and performance evaluation has been an art:
Resolution of the clock
Issues about cache effects
Different systems
Can be cumbersome and inefficient with traditional tools
Situation about to change: today's processors have internal counters
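Clock resolution, the first issue above, can be measured directly by polling a timer until it ticks. A small sketch:

```python
import time

def timer_resolution(clock=time.perf_counter, samples=50):
    """Estimate the smallest observable tick of a clock by polling
    until its value changes; return the smallest gap seen."""
    best = float('inf')
    for _ in range(samples):
        t0 = clock()
        t1 = clock()
        while t1 == t0:
            t1 = clock()
        best = min(best, t1 - t0)
    return best
```

A coarse timer makes a single run of a fast kernel untimeable, which is one reason cycle-accurate hardware counters are attractive.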
Slide 17
Performance Counters
Almost all high-performance processors include hardware performance counters.
Some are easy to access; others are not available to users.
On most platforms the APIs, if they exist, are not appropriate for the end user or well documented.
Existing performance counter APIs: Compaq Alpha EV6 & EV6/7, SGI MIPS R10000, IBM Power series, CRAY T3E, Sun Solaris, Pentium Linux and Windows, IA-64, HP PA-RISC, Hitachi, Fujitsu, NEC
Slide 18
Directions
Need tools that allow us to examine performance and identify problems.
Should be simple to use, perhaps in an automatic fashion
Machine-assisted optimization of key components: think of it as a higher-level compiler, done via experimentation