
Titanium Review: Ti Parallel Benchmarks
Kaushik Datta

Titanium NAS Parallel Benchmarks

Kathy Yelick

http://titanium.cs.berkeley.edu

U.C. Berkeley

September 9, 2004


Benchmarks

• Current Titanium NAS benchmarks:
  • MG (Multigrid)
  • FT (Fast Fourier Transform)
  • CG (Conjugate Gradient - Armando)
  • IS (Integer Sort - Omair)
  • EP (Embarrassingly Parallel - Meling)

• Today’s focus is on MG and FT


Platforms

• Seaborg:
  • NERSC IBM SP RS/6000
  • 16-way SMP nodes
  • 375 MHz Power3 processors, 1.5 GFlops peak
  • 64 KB L1 D-cache, 8 MB L2 cache

• Results for the Compaq Alphaserver, AMD Opteron, and Intel Itanium II processors are also mentioned briefly


MG Benchmark

• Class A problem is 4 iterations on a 256³ grid
• All the computations are nearest-neighbor operations across 3D grids
• For coarse grids, all computations are done on one processor to minimize fine-grain communication

• Communication pattern for updating ghost cells is very regular

• Tests both the computation and communication aspects of the platform


Major MG Components

• Computation (applies a 27-point 3D stencil; see the sketch after this list):
  • ApplySmoother
  • EvaluateResidual
  • Coarsen
  • Prolongate

• Communication (very regular):
  • UpdateBorder
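As a rough illustration, here is a minimal Titanium sketch of one 27-point stencil sweep; the names A, B, and w and the weight setup are hypothetical stand-ins for the benchmark's actual kernels, not the tuned code:

    // Hedged sketch of a 27-point stencil sweep in Titanium (not the
    // tuned benchmark kernel). A is the input grid, B the output grid,
    // and w holds one weight per neighbor offset -- all hypothetical.
    RectDomain<3> offsets = [[-1,-1,-1] : [1,1,1]];
    double [3d] w = new double[offsets];
    // ... fill w with the four distinct NAS MG coefficients ...
    foreach (p in B.domain()) {          // B's domain is the interior
        double sum = 0.0;
        foreach (q in offsets) {
            sum += w[q] * A[p + q];      // nearest-neighbor reads only
        }
        B[p] = sum;
    }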


Possible Serial Optimizations

• To improve the performance of the naive MG code, we first tried to make the serial code faster

• The optimizations that seemed most promising were:
  • Cache Blocking
  • Common Subexpression Elimination


Possible Serial Optimization #1: Cache Blocking

• Cache blocking attempts to take a portion of the grid (that will fit into a given level of cache) and perform all necessary computations on it before proceeding to the next cache block

• In our case, we used cubic 3D cache blocks and varied the side length of the cube (see the sketch below)
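A minimal sketch of the blocked sweep, reusing the stencil from the earlier sketch; blockSide, lo, hi, and stencil27 are hypothetical names:

    // Hedged sketch of cubic cache blocking in Titanium. blockSide is
    // the tuned cube side; [lo,hi] bound the interior in each dimension.
    for (int i = lo; i <= hi; i += blockSide)
      for (int j = lo; j <= hi; j += blockSide)
        for (int k = lo; k <= hi; k += blockSide) {
          // one cubic block, clipped at the grid boundary
          RectDomain<3> block =
            [[i, j, k] : [Math.min(i + blockSide - 1, hi),
                          Math.min(j + blockSide - 1, hi),
                          Math.min(k + blockSide - 1, hi)]];
          foreach (p in block) {
            // finish all stencil work on this block before moving on,
            // so the block stays resident in cache
            B[p] = stencil27(A, p);   // hypothetical helper
          }
        }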


Possible Serial Optimization #1: Cache Blocking (cont.)

• Cache blocking seems to help slightly on the Itanium II, but not on the Power3

[Figure: "Power3 Running Time" and "Itanium II Running Time" - time (sec.) vs. side length of cubic cache block, for applySmoother and evaluateResidual]


Possible Serial Optimization #1: Cache Blocking (cont.)

• The Alphaserver and Opteron processors do not benefit from cache blocking

[Figure: "Alphaserver Running Time" and "Opteron Running Time" - time (sec.) vs. side length of cubic cache block, for applySmoother and evaluateResidual]


Possible Serial Optimization #2: Common Subexpression Elimination

• CSE is a technique to reduce the flop count by memoizing results

• However, it may not always reduce the overall running time, since each pencil through the grid needs to be traversed twice (see the sketch after this list):
  • The first traversal memoizes certain partial results
  • The second traversal then uses these results to compute the final answer
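A minimal sketch of the two-pass scheme, mirroring the structure of the Fortran kernel and written with plain Java-style arrays for brevity; u is the n x n x n input grid, r the output, a0-a3 the four distinct 27-point stencil weights, and u1/u2 the scratch pencils (all names illustrative):

    double[] u1 = new double[n];
    double[] u2 = new double[n];
    for (int k = 1; k < n - 1; k++)
      for (int j = 1; j < n - 1; j++) {
        // Pass 1: memoize partial neighbor sums along the pencil
        for (int i = 0; i < n; i++) {
          u1[i] = u[i][j-1][k]   + u[i][j+1][k]
                + u[i][j][k-1]   + u[i][j][k+1];
          u2[i] = u[i][j-1][k-1] + u[i][j+1][k-1]
                + u[i][j-1][k+1] + u[i][j+1][k+1];
        }
        // Pass 2: combine the memoized sums into the 27-point result
        for (int i = 1; i < n - 1; i++) {
          r[i][j][k] = a0 * u[i][j][k]
                     + a1 * (u[i-1][j][k] + u[i+1][j][k] + u1[i])
                     + a2 * (u2[i] + u1[i-1] + u1[i+1])
                     + a3 * (u2[i-1] + u2[i+1]);
        }
      }

Each u1/u2 value is reused by three output points along the pencil, which is where the flop savings come from.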


Possible Serial Optimization #2: Common Subexpression Elimination (cont.)

• CSE does a good job of lowering the running time, partly because it reduces the flop count

• Note: The Fortran MG benchmark uses CSE

[Figure: "Power3 Running Time" (time in sec.) and "Power3 FP Operations" (millions of FP operations) by MG component, Regular vs. CSE]


Chosen Serial Optimizations

• Based on these results, we kept the CSE optimization, but omitted any type of cache blocking

• This mimics the Fortran code


Parallel Optimizations

• Have each processor block communicate with only 6 nearest neighbors instead of 27 to update its border (Dan)

• Eliminate "static" timers
  • This gets rid of a level of indirection
  • Dan reduced false sharing in static variables by grouping each processor's static variables together

• Force bulk arraycopy by using contiguous array buffers (manual packing/unpacking)

• Use the "local" keyword to let each processor know that all its computations are local (this and the bulk-arraycopy optimization are sketched below)
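A minimal sketch of the last two ideas, with hypothetical names (allGrids, faceDomain, remoteRecvBuf); Titanium's copy method transfers elements over the intersection of the two arrays' domains:

    // The "local" qualifier asserts the grid lives on this processor,
    // so accesses compile to direct loads/stores rather than going
    // through the global-pointer representation.
    double [3d] local myGrid =
        (double [3d] local) allGrids[Ti.thisProc()];

    // Pack: sendBuf is allocated over exactly the ghost-face domain,
    // so it is contiguous even though that face is strided in myGrid.
    double [3d] sendBuf = new double[faceDomain];
    sendBuf.copy(myGrid);          // strided-to-contiguous local pack
    remoteRecvBuf.copy(sendBuf);   // one bulk transfer to the neighbor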


Seaborg SMP Performance of MG Class A Problem

• Titanium does about as well as Fortran up to the 16-processor case

• Our serial tuning seems to be successful in this case

[Figure: "Time on Single SMP of Seaborg" (time in sec.) and "Speedup on Single SMP of Seaborg" vs. processor configuration (serial, 1x1, 1x2, 1x4, 1x8, 1x16), for Fortran, Ti 32-bit gls, and Ti 32-bit mcs]


FT Benchmark

• Class A problem is 6 iterations on a 256² x 128 grid

• Each 3D FFT is performed as 3 separate 1D FFTs and 2 transposes

• 1D FFTs:
  • All are local
  • Currently done as library calls (using FFTW)

• Transposes (see the sketch after this list):
  • One local transpose and one all-to-all transpose
  • The all-to-all transpose tests the machine's bisection bandwidth
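Schematically, one 3D FFT then looks like the sketch below; fft1dPencils, localTranspose, and allToAllTranspose are hypothetical stand-ins for the benchmark's actual routines:

    // Hedged sketch of the 3D-FFT decomposition on a slab-decomposed
    // grid: three batches of local 1D FFTs plus two transposes.
    fft1dPencils(A);          // local 1D FFTs (FFTW) along dimension 1
    localTranspose(A, B);     // swap dims within each processor's slab
    fft1dPencils(B);          // local 1D FFTs along the new dimension 1
    allToAllTranspose(B, C);  // global transpose: every processor sends
                              // to every other; tests bisection bandwidth
    fft1dPencils(C);          // local 1D FFTs along the last dimension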


Major FT Components

• Computation:
  • 1D FFTs (part of the 3D FFT)
  • Evolve
  • Checksum

• Communication:
  • Transposes (part of the 3D FFT)

• Both:
  • Setup


Serial FT Optimizations

• Removed an unnecessary transpose in the 3D FFT

• Memoized the time evolution array to reduce the flop count (see the sketch below)
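A minimal sketch of the memoization, with the spectrum stored as separate real/imaginary arrays for brevity; ex, kSquared, alpha, uRe, and uIm are illustrative names:

    // Setup (once): memoize the decay factor for each Fourier mode.
    foreach (p in ex.domain()) {
        ex[p] = Math.exp(-4.0 * Math.PI * Math.PI * alpha * kSquared(p));
    }

    // Evolve (each iteration): one multiply per mode, no exp() calls.
    // Applied cumulatively, iteration t has multiplied each mode by
    // exp(-4*pi^2*alpha*t*|k|^2), as the benchmark requires.
    foreach (p in ex.domain()) {
        double f = ex[p];
        uRe[p] *= f;
        uIm[p] *= f;
    }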


Seaborg SMP Performance of FT Class A Problem

• Titanium does slightly better than Fortran, but the Titanium code is calling the FFTW library

• We will compare each component of the benchmark separately

[Figure: "Total Time (using FFTW for Ti)" (time in sec.) and "Total Speedup (using FFTW for Ti)" vs. processor configuration (serial, 1x1, 1x2, 1x4, 1x8, 1x16), for Fortran and Ti 32-bit gls]


Seaborg SMP Performance of Setup

• Setup creates distributed arrays (see the sketch below) and memoizes an array used in later computations

• This method is only called once, but still needs tuning
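For reference, a minimal sketch of the Titanium exchange idiom that this kind of distributed-array construction is based on; the names and the slab decomposition are illustrative, not the benchmark's actual code:

    // Each processor allocates its own slab of the global grid, then
    // exchange() gathers every processor's reference into a replicated
    // directory, giving all processors handles to all slabs.
    double [3d] mySlab = new double[mySlabDomain];
    double [3d] [1d] single directory =
        new double [3d] [0 : Ti.numProcs() - 1];
    directory.exchange(mySlab);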

[Figure: "Setup Time" (time in sec.) and "Setup Speedup" vs. processor configuration (serial, 1x1, 1x2, 1x4, 1x8, 1x16), for Fortran and Ti 32-bit gls]


Seaborg SMP Performance of 1D FFTs

• The Titanium code calls the FFTW library in this case

• We are in the process of converting the FFT into pure Titanium code

[Figure: "1D FFT Time (using FFTW for Ti)" (time in sec.) and "1D FFT Speedup (using FFTW for Ti)" vs. processor configuration (serial, 1x1, 1x2, 1x4, 1x8, 1x16), for Fortran and Ti 32-bit gls]


Seaborg SMP Performance of Transpose

• The Titanium and Fortran codes perform similarly using shared memory

• Note: The Fortran code does cache blocking for the local transpose (possible Ti optimization)

[Figure: "Transpose Time" (time in sec.) vs. processor configuration (1x2, 1x4, 1x8, 1x16), for Fortran Global, Fortran Local, Ti 32-bit gls Global, and Ti 32-bit gls Local]


Seaborg SMP Performance of Evolve

• Evolve consists of purely local FP computations
• The Titanium code performs slightly worse than the Fortran code, but scales better

[Figure: "Evolve Time" (time in sec.) and "Evolve Speedup" vs. processor configuration (serial, 1x1, 1x2, 1x4, 1x8, 1x16), for Fortran and Ti 32-bit gls]


Conclusion

• On Seaborg, Titanium serial and SMP performance is slightly worse than or comparable to Fortran with MPI in most cases


Future Work

• Examine and tune the multinode performance of the MG and FT benchmarks

• Convert the FT benchmark into pure Titanium (instead of calling FFTW)

• Start profiling and tuning the serial versions of the CG, IS, and EP benchmarks

• Check the performance of the benchmarks across several different platforms