1 “How Can We Address the Needs and Solve the Problems in HPC Benchmarking?” Jack Dongarra...
Slide 1
"How Can We Address the Needs and Solve the Problems in HPC Benchmarking?"
Jack Dongarra, Innovative Computing Laboratory, University of Tennessee
http://www.cs.utk.edu/~dongarra/
Workshop on the Performance Characterization of Algorithms
Slide 2
LINPACK Benchmark - accidental benchmarking
Designed to help users extrapolate execution time for the Linpack software
First benchmark report from 1979
My iPAQ running the benchmark in Java comes in here today.
Slide 3
Accidental Benchmarking
Portable, runs on any system
Easy to understand
Content changed over time: n = 100, 300, 1000, as large as possible (Top500)
Allows for restructuring the algorithm
Performance data with the same arithmetic precision
Benchmark checks to see if the "correct solution" is achieved
Not intended to measure entire machine performance.
In the benchmark report: "One further note: The following performance data should not be taken too seriously."
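The "correct solution" check works roughly like this: solve Ax = b, then verify that a scaled residual is of order one. A minimal sketch in Python (the real benchmark is Fortran/C; the exact scaling formula here is an assumption based on common practice):

```python
import sys

def solve(A, b):
    """Solve Ax = b by Gaussian elimination with partial pivoting
    (on copies, so the inputs are left intact)."""
    n = len(A)
    A = [row[:] for row in A]
    x = b[:]
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(A[i][k]))
        A[k], A[p] = A[p], A[k]
        x[k], x[p] = x[p], x[k]
        for i in range(k + 1, n):
            m = A[i][k] / A[k][k]
            for j in range(k, n):
                A[i][j] -= m * A[k][j]
            x[i] -= m * x[k]
    for k in range(n - 1, -1, -1):
        s = sum(A[k][j] * x[j] for j in range(k + 1, n))
        x[k] = (x[k] - s) / A[k][k]
    return x

def scaled_residual(A, x, b):
    """||Ax - b|| / (n * ||A|| * ||x|| * eps) using infinity norms;
    a value of order one means the solution passes."""
    n = len(A)
    r = max(abs(sum(A[i][j] * x[j] for j in range(n)) - b[i]) for i in range(n))
    norm_A = max(sum(abs(v) for v in row) for row in A)
    norm_x = max(abs(v) for v in x)
    return r / (n * norm_A * norm_x * sys.float_info.epsilon)
```

For a well-conditioned system the scaled residual stays small; a run whose residual grows large is flagged as incorrect.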
Slide 4
LINPACK Benchmark
Historical data: for n = 100, the same software for the last 22 years
Unbiased reporting
Software and results freely available worldwide
Should be able to achieve high performance on this problem, if not...
Compiler test at n = 100; heavily hand-optimized at TPP (modified ScaLAPACK implementation)
Scalable benchmark, in size and parallelism
Pressure on vendors to optimize my software and provide a set of kernels that benefit others
Run rules very important
Today, n = 0.5 x 10^6 at 7.2 TFlop/s requires 3.3 hours. On a Petaflops machine, n = 5 x 10^6 will require 1 day.
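Those run times follow from the operation count: dense LU factorization does about (2/3)n^3 floating-point operations, so time is roughly (2/3)n^3 divided by the sustained rate. A quick sanity check (my arithmetic, not from the slide):

```python
def linpack_hours(n, flops_per_sec):
    """Estimated dense-LU wall time: about (2/3) * n^3 flops
    at the given sustained rate, converted to hours."""
    return (2.0 / 3.0) * n ** 3 / flops_per_sec / 3600.0
```

n = 0.5 x 10^6 at 7.2 TFlop/s gives about 3.2 hours, and n = 5 x 10^6 at 1 PFlop/s gives about 23 hours, i.e. roughly a day, matching the figures above.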
Slide 5
Benchmarks give machine signatures and algorithm characteristics
Make improvements in applications
Users looking for performance portability
Many of the things we do are specific to one system's parameters
Need a way to understand and rapidly develop software which has a chance at high performance
Slide 6
Self-Adapting Numerical Software (SANS)
Today's processors can achieve high performance, but this requires extensive machine-specific hand tuning.
Simple operations like matrix-vector ops require many man-hours per platform
Software lags far behind hardware introduction
Only done if the financial incentive is there
Compilers are not up to the optimization challenge
Hardware, compilers, and software have a large design space with many parameters: blocking sizes, loop nesting permutations, loop unrolling depths, software pipelining strategies, register allocations, and instruction schedules.
Complicated interactions with the increasingly sophisticated micro-architectures of new microprocessors.
Need for quick/dynamic deployment of optimized routines.
ATLAS - Automatically Tuned Linear Algebra Software
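Blocking size is the canonical example of such a parameter. A sketch of a cache-blocked matrix multiply in Python, where `nb` is the value a tuner would search over (illustrative only; ATLAS generates C):

```python
def matmul_blocked(A, B, nb):
    """C = A * B for square matrices (lists of lists), computed in
    nb-by-nb blocks so a working set of B stays resident in cache."""
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, nb):
        for kk in range(0, n, nb):
            for jj in range(0, n, nb):
                for i in range(ii, min(ii + nb, n)):
                    for k in range(kk, min(kk + nb, n)):
                        aik = A[i][k]
                        Ci, Bk = C[i], B[k]
                        for j in range(jj, min(jj + nb, n)):
                            Ci[j] += aik * Bk[j]
    return C
```

The result is identical for any `nb`; only the memory traffic pattern, and hence the speed, changes, which is why the right value must be found per machine.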
Slide 7
Software Generation Strategy - BLAS
Takes ~20 minutes to run.
"New" model of high-performance programming where critical code is machine-generated using parameter optimization.
Designed for RISC architectures (superscalar); needs a reasonable C compiler
Today ATLAS is in use by Matlab, Mathematica, Octave, Maple, Debian, Scyld Beowulf, SuSE, ...
Parameter study of the hardware
Generate multiple versions of code, with different values of key performance parameters
Run and measure the performance of the various versions
Pick the best and generate the library
Level 1 cache multiply optimizes for: TLB access, L1 cache reuse, FP unit usage, memory fetch, register reuse, loop overhead minimization
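The generate/measure/pick loop itself is simple; the value is in the search. A hedged sketch of the timing harness (the function name and structure are mine, not ATLAS's):

```python
import time

def pick_best(versions, trials=3):
    """Time each candidate implementation (a dict of name -> zero-argument
    callable), keep the best of a few trials to reduce noise, and return
    the winner's name."""
    timings = {}
    for name, run in versions.items():
        best = float('inf')
        for _ in range(trials):
            t0 = time.perf_counter()
            run()
            best = min(best, time.perf_counter() - t0)
        timings[name] = best
    return min(timings, key=timings.get)
```

In use this looks like `pick_best({"nb=16": lambda: kernel(16), "nb=32": lambda: kernel(32)})`; the winning parameter values are then baked into the generated library.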
Slide 8
ATLAS (DGEMM n = 500)
ATLAS is faster than all other portable BLAS implementations and is comparable with machine-specific libraries provided by the vendor.
[Chart: MFLOP/s (0-2000) across architectures for Vendor BLAS, ATLAS BLAS, and F77 BLAS]
Slide 9
Related Tuning Projects
PHiPAC: Portable High Performance ANSI C; initial automatic GEMM generation project; http://www.icsi.berkeley.edu/~bilmes/phipac
FFTW: Fastest Fourier Transform in the West; http://www.fftw.org
UHFFT: tuning parallel FFT algorithms; http://rodin.cs.uh.edu/~mirkovic/fft/parfft.htm
SPIRAL: Signal Processing Algorithms Implementation Research for Adaptable Libraries; maps DSP algorithms to architectures; http://www.ece.cmu.edu/~spiral/
Sparsity: sparse-matrix-vector and sparse-matrix-matrix multiplication; tunes code to the sparsity structure of the matrix; more later in this tutorial; http://www.cs.berkeley.edu/~ejim/publication/
Slide 10
Experiments with C, Fortran, and Java for ATLAS (DGEMM kernel)
[Chart: Mflop/s (0-800) for Fortran, C, and Java kernels on IBM Power 3 375 MHz (bars near 693, 686, 552), Intel Pentium II 400 MHz (Linux, IBM 1.1.8 JDK), and Compaq Alpha 21264 500 MHz (Java 2 SDK 1.2.2 with FastVM 1.2.2)]
Slide 11
Machine-Assisted Application Development and Adaptation
Communication libraries: optimize for the specifics of one's configuration.
Algorithm layout and implementation: look at the different ways to express the implementation
Slide 12
Work in Progress: ATLAS-like Approach Applied to Broadcast (PII 8-way cluster with 100 Mb/s switched network)

Message Size (bytes)   Optimal algorithm   Buffer Size (bytes)
8                      binomial            8
16                     binomial            16
32                     binary              32
64                     binomial            64
128                    binomial            128
256                    binomial            256
512                    binomial            512
1K                     sequential          1K
2K                     binary              2K
4K                     binary              2K
8K                     binary              2K
16K                    binary              4K
32K                    binary              4K
64K                    ring                4K
128K                   ring                4K
256K                   ring                4K
512K                   ring                4K
1M                     binary              4K

[Diagram: broadcast trees rooted at Root - Sequential, Binary, Binomial, Ring]
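At run time, the tuned library just looks the message size up in such a measured table. A sketch of the dispatch (table values transcribed from the measurements above; the function name is mine):

```python
from bisect import bisect_left

# (message size in bytes, optimal algorithm, buffer size in bytes),
# as measured on the PII cluster described above.
BCAST_TABLE = [
    (8, "binomial", 8),            (16, "binomial", 16),
    (32, "binary", 32),            (64, "binomial", 64),
    (128, "binomial", 128),        (256, "binomial", 256),
    (512, "binomial", 512),        (1 << 10, "sequential", 1 << 10),
    (2 << 10, "binary", 2 << 10),  (4 << 10, "binary", 2 << 10),
    (8 << 10, "binary", 2 << 10),  (16 << 10, "binary", 4 << 10),
    (32 << 10, "binary", 4 << 10), (64 << 10, "ring", 4 << 10),
    (128 << 10, "ring", 4 << 10),  (256 << 10, "ring", 4 << 10),
    (512 << 10, "ring", 4 << 10),  (1 << 20, "binary", 4 << 10),
]
_SIZES = [row[0] for row in BCAST_TABLE]

def choose_broadcast(nbytes):
    """Pick (algorithm, buffer size) for a message, rounding up to the
    nearest measured size (largest entry for anything bigger)."""
    i = min(bisect_left(_SIZES, nbytes), len(BCAST_TABLE) - 1)
    return BCAST_TABLE[i][1], BCAST_TABLE[i][2]
```

The table itself is what the ATLAS-like search produces; a different cluster would yield different crossover points.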
Slide 13
Conjugate Gradient Variants by Dynamic Selection at Run Time
Variants combine inner products to reduce the communication bottleneck at the expense of more scalar ops.
Same number of iterations; no advantage on a sequential processor
With a large number of processors and a high-latency network there may be advantages.
Improvements can range from 15% to 50% depending on size.
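The idea behind these variants: accumulate several inner products in one pass so a parallel implementation needs one global reduction instead of two. A minimal illustration (the actual CG restructurings are more involved):

```python
def fused_inner_products(r, q):
    """Accumulate <r, r> and <r, q> in a single sweep over the vectors;
    in a parallel CG variant the two partial sums then travel in one
    allreduce instead of two, halving the latency cost."""
    rr = rq = 0.0
    for ri, qi in zip(r, q):
        rr += ri * ri
        rq += ri * qi
    return rr, rq
```

On one processor this saves nothing, which matches the point above: the win appears only when each reduction carries a network latency cost.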
Slide 15
Reformulating/Rearranging/Reuse
Example is the reduction to narrow band form for the SVD
Fetch each entry of A once
Restructure and combine operations
Results in a speedup of > 30%

A_new = A - u y^T - w v^T
y = A^T u
w = A_new v
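The "fetch each entry of A once" restructuring fuses the two matrix passes into one sweep. A simplified sketch in Python, fusing y = A^T u with a product A v (the real kernel folds in the rank-update to form A_new v in the same pass):

```python
def fused_matvecs(A, u, v):
    """Compute y = A^T u and z = A v in a single sweep over A, so each
    entry A[i][j] is fetched from memory only once instead of twice."""
    n, m = len(A), len(A[0])
    y = [0.0] * m
    z = [0.0] * n
    for i in range(n):
        Ai, ui = A[i], u[i]
        zi = 0.0
        for j in range(m):
            a = Ai[j]
            y[j] += a * ui      # contributes to y = A^T u
            zi += a * v[j]      # contributes to z = A v
        z[i] = zi
    return y, z
```

Since these operations are memory-bound, halving the traffic through A is where the >30% speedup comes from.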
Slide 16
Tools for Performance Evaluation
Timing and performance evaluation has been an art:
Resolution of the clock
Issues about cache effects
Different systems
Can be cumbersome and inefficient with traditional tools
Situation about to change: today's processors have internal counters
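Clock resolution, the first issue above, can be measured directly by polling a timer until it ticks. A small sketch:

```python
import time

def timer_resolution(clock=time.perf_counter, samples=50):
    """Estimate the smallest observable tick of a clock by polling
    until its value changes; return the smallest gap seen."""
    best = float('inf')
    for _ in range(samples):
        t0 = clock()
        t1 = clock()
        while t1 == t0:
            t1 = clock()
        best = min(best, t1 - t0)
    return best
```

A coarse timer makes a single run of a fast kernel untimeable, which is one reason cycle-accurate hardware counters are attractive.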
Slide 17
Performance Counters
Almost all high-performance processors include hardware performance counters.
Some are easy to access; others are not available to users.
On most platforms the APIs, if they exist, are not appropriate for the end user or well documented.
Existing performance counter APIs: Compaq Alpha EV6 & EV6/7, SGI MIPS R10000, IBM Power series, CRAY T3E, Sun Solaris, Pentium Linux and Windows, IA-64, HP PA-RISC, Hitachi, Fujitsu, NEC
Slide 18
Directions
Need tools that allow us to examine performance and identify problems.
Should be simple to use, perhaps in an automatic fashion
Machine-assisted optimization of key components: think of it as a higher-level compiler, done via experimentation