S3D: Performance Impact of Hybrid XT3/XT4

Sameer Shende
tau-team@cs.uoregon.edu

TAU Performance System: S3D Scalability Study

Acknowledgements

• Alan Morris [UO]
• Kevin Huck [UO]
• Allen D. Malony [UO]
• Kenneth Roche [ORNL]
• Bronis R. de Supinski [LLNL]
• John Mellor-Crummey [Rice]
• Nick Wright [SDSC]
• Jeff Larkin [Cray, Inc.]

The performance data presented here is available at:

http://www.cs.uoregon.edu/research/tau/s3d

TAU Parallel Performance System

http://www.cs.uoregon.edu/research/tau/

• Multi-level performance instrumentation
  • Multi-language automatic source instrumentation
• Flexible and configurable performance measurement
• Widely-ported parallel performance profiling system
  • Computer system architectures and operating systems
  • Different programming languages and compilers
• Support for multiple parallel programming paradigms
  • Multi-threading, message passing, mixed-mode, hybrid

The Story So Far...

• Scalability study of S3D using TAU
  • MPI_Wait
  • I/O (WRITE_SAVEFILE)
  • Loop: ComputeSpeciesDiffFlux (630-656) [Rice, SDSC]
  • Loop: ReactionRateBounds (374-386) [exp]
• 3D scatter plots pointed to a single “slow” node before
• Identifying individual nodes by mapping ranks to nodes within TAU
• Cray utilities: nodeinfo, xtshowmesh, xtshowcabs
• Ran a 6400 core simulation to identify XT3/XT4 partition performance issues (removed -feature=xt3)

Total Runtime Breakdown by Events - Time

MPI_Wait

WRITE_SAVEFILE

Relative Efficiency

MPI Scaling

Relative Efficiency & Speedup for One Event

ParaProf’s Source Browser (8 core profile)

Case Study

• Harness testcase
• Platform: Jaguar, combined Cray XT3/XT4 at ORNL
• 6400p
• Goal: To evaluate the performance impact of combined XT3/XT4 nodes on S3D executions
  • Performance evaluation of MPI_Wait
  • Study mapping of MPI ranks to nodes

TAU: ParaProf Profile

Overall Mean Profile: Exclusive Wallclock Time

Overall Inclusive Time

Mean Mflops observed over all ranks

Inclusive Total Instructions Executed

Total Instructions Executed (Exclusive)

Comparing Exclusive PAPI Counters, MFlops

3D Scatter Plots

• Plot four routines along X, Y, Z, and Color axes
• Each routine has a range (max, min)
• Each process (rank) has a unique position along the three axes and a unique color
• Allows us to examine the distribution of nodes (clusters)
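The positioning described on this slide can be sketched as follows. This is a minimal illustration, assuming each routine's per-rank value is scaled linearly into [0, 1] using that routine's own (min, max) range. The function names `normalize` and `scatter_coordinates` and the sample numbers are hypothetical; this is not ParaProf's code or the measured S3D data.

```python
def normalize(values):
    """Scale a list of per-rank values into [0, 1] using its own (min, max) range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All ranks identical: place every point at the middle of the axis.
        return [0.5] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def scatter_coordinates(x_vals, y_vals, z_vals, color_vals):
    """Return one (x, y, z, color) tuple per rank, each component in [0, 1]."""
    axes = [normalize(a) for a in (x_vals, y_vals, z_vals, color_vals)]
    return list(zip(*axes))


# Hypothetical exclusive times (seconds) of four routines on four ranks.
points = scatter_coordinates(
    [10.0, 12.0, 30.0, 31.0],   # routine on the X axis
    [5.0, 5.5, 9.0, 9.5],       # routine on the Y axis
    [100.0, 98.0, 60.0, 61.0],  # routine on the Z axis
    [1.0, 1.0, 2.0, 2.0],       # routine mapped to color
)
```

With data like this, ranks 0-1 and ranks 2-3 land in opposite corners of the unit cube, which is how two groups of nodes would show up as two clusters.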

Scatter Plot: 6400 cores XT3/XT4 - 2 Clusters!

3D Triangle Mesh Display

• Plot MPI rank, routine name, and exclusive time along X, Y and Z axes
• Color can be shown by a fourth metric
• Scalable view
• Suitable for very large number of processors

MPI_Wait: 3D View

3D View: Zooming In... Jagged Edges!

3D View: Uh Oh!

Zoom, Change Color to L1 Data Cache Misses

• Loop in ComputeSpeciesDiffFlux (630-656) has high L1 DCMs (red)
• Takes longer to execute on this “slice” of processors. So do other routines. Slower memory?

Changing Color to MFLOPS

• Loop in ComputeSpeciesDiffFlux (630-656) lower Mflops (dark blue)

Getting Back to MPI_Wait()

• Why does MPI_Wait take less time on these cores?

• What does the profile of MPI_Wait look like?

MPI_Wait - Sorted by Exclusive Time

• MPI_Wait takes 435.84 seconds on rank 3101
• It takes 59.6 s on rank 3233 and 29.2 s on rank 3200
• It takes 15.49 seconds on rank 0!
• How is rank 3101 different from rank 0?

Comparing Ranks 3101 and 0 (extremes)

Comparing Inclusive Times - Same for S3D

Comparing PAPI Floating Point Instructions

• PAPI_FP_INS are the same - as expected

Comparing Performance - MFLOPS

• For the memory-intensive loop in ComputeSpeciesDiffFlux, rank 0 achieves only 65% of the Mflops of rank 3101 (114 vs 174 Mflops)!
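The 65% figure follows directly from the two Mflops values on the slide. The sketch below works through the arithmetic, and adds one inference: assuming the loop executes the same number of floating-point instructions on both ranks (the PAPI_FP_INS comparison reports equal counts), the time ratio is the inverse of the Mflops ratio. The helper `mflops` is a hypothetical name for the usual definition, not a TAU function.

```python
def mflops(fp_instructions, seconds):
    """Millions of floating-point instructions per second."""
    return fp_instructions / seconds / 1.0e6


# Values quoted on the slide for the loop in ComputeSpeciesDiffFlux.
rank0_mflops = 114.0
rank3101_mflops = 174.0

# Fraction of rank 3101's rate that rank 0 achieves.
rate_ratio = rank0_mflops / rank3101_mflops

# Assumption: equal FP instruction counts on both ranks, so
# time is inversely proportional to the Mflops rate.
time_ratio = rank3101_mflops / rank0_mflops

print(f"rank 0 runs at {rate_ratio:.1%} of rank 3101's rate")
print(f"rank 0 needs about {time_ratio:.2f}x the time for this loop")
```

Under that assumption, rank 0 spends roughly one and a half times as long in the loop, which is consistent with it arriving late and therefore waiting less in MPI_Wait.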

Comparing MFLOPS: Rank 3101 vs Rank 0

• Rank 0 appears to be “slower” than rank 3101
• Are there other nodes that are similarly slow, with shorter wait times?
• What does the MPI_Wait profile look like over all nodes?

MPI_Wait Profile

What is this rank?

MPI_Wait Profile Shifts at rank 114!

• Ranks 0 through 113 take less time in MPI_Wait than 114...

Another Shift in MPI_Wait()

• This shift is observed in ranks 3200 through 3313
• Again 114 processors... (like ranks 0 through 113)
• Hmm...
• How do other routines perform on these ranks?
• What are the physical node ids?
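The shifts were found by inspecting the MPI_Wait profile visually. A minimal sketch of how the same rank ranges could be located programmatically is shown below, assuming a list of per-rank MPI_Wait times is available. The function name `low_wait_runs`, the threshold, and the synthetic wait times are all hypothetical; only the rank ranges mirror the slides.

```python
def low_wait_runs(wait_times, threshold):
    """Return (first_rank, last_rank) for each contiguous run of ranks
    whose wait time is below the threshold."""
    runs = []
    start = None
    for rank, t in enumerate(wait_times):
        if t < threshold:
            if start is None:
                start = rank          # a new run begins here
        elif start is not None:
            runs.append((start, rank - 1))
            start = None
    if start is not None:             # run extends to the last rank
        runs.append((start, len(wait_times) - 1))
    return runs


# Synthetic profile for 6400 ranks: most ranks wait a long time,
# two blocks of ranks wait much less (values are made up).
wait = [400.0] * 6400
for r in list(range(0, 114)) + list(range(3200, 3314)):
    wait[r] = 20.0

runs = low_wait_runs(wait, threshold=100.0)
sizes = [last - first + 1 for first, last in runs]
print(runs, sizes)
```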

MPI_Wait

• While MPI_Wait takes less time on these CPUs, other routines take longer
• Points to a load imbalance!

Identifying Physical Processors using Metadata

MetaData for Ranks 3200 and 0

• Rank 3200 and 0 both lie on the same physical node nid03406!

Mapping Ranks from TAU to Physical Processors

• Ranks 0..113 lie on processors 3406..3551
• Ranks 3200..3313 are also on 3406..3551
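The rank-to-node mapping can be illustrated with a short sketch. It assumes the hostname recorded in each rank's metadata has already been collected into a dictionary; how that dictionary is read from the profiles is not shown. The names `ranks_by_node` and `node_id` are hypothetical, and in the sample data only the placement of ranks 0 and 3200 on nid03406 comes from the slides; the other entry is made up.

```python
from collections import defaultdict


def ranks_by_node(rank_to_host):
    """Group MPI ranks by the hostname of the node they ran on."""
    nodes = defaultdict(list)
    for rank in sorted(rank_to_host):
        nodes[rank_to_host[rank]].append(rank)
    return dict(nodes)


def node_id(hostname):
    """Convert a hostname of the form 'nid03406' to the integer 3406."""
    if not hostname.startswith("nid"):
        raise ValueError(f"unexpected hostname format: {hostname!r}")
    return int(hostname[3:])


# Sample metadata: ranks 0 and 3200 on nid03406 as reported on the slides;
# rank 114 on nid00020 is a made-up entry for illustration.
sample = {0: "nid03406", 3200: "nid03406", 114: "nid00020"}

groups = ranks_by_node(sample)
print(groups["nid03406"], node_id("nid03406"))
```

Grouping this way makes it easy to see which ranks share a physical node and to hand the numeric node ids to a utility such as nodeinfo.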

Results from Cray’s nodeinfo Utility

• Processors 3406..3551 (physical ids) are located on the XT3 partition
• XT3 partition has slow DDR-400 memory (5986 MB/s)
• XT3 has a slower SS1 (1109 MB/s) interconnect
• XT4 partition has faster DDR2-667 memory modules (7147 MB/s) and faster Seastar2 (SS2) (2022 MB/s) interconnect
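To put the nodeinfo figures side by side, the short calculation below computes the XT4-to-XT3 ratio for each bandwidth. The numbers are the ones quoted on the slide; the dictionary keys are hypothetical names chosen for this sketch.

```python
# Bandwidths in MB/s as quoted from Cray's nodeinfo output.
xt3 = {"memory_mb_s": 5986, "interconnect_mb_s": 1109}
xt4 = {"memory_mb_s": 7147, "interconnect_mb_s": 2022}

# Ratio of XT4 to XT3 for each resource.
ratios = {key: xt4[key] / xt3[key] for key in xt3}

for key, value in ratios.items():
    print(f"{key}: XT4 is {value:.2f}x XT3")
```

The memory bandwidth differs by roughly a fifth, while the interconnect bandwidth differs by close to a factor of two.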

Location of Physical Nodes in the Cabinets

• Using the Cray utilities xtshowcabs and xtshowmesh
• All nodes marked with Job “c” came from our S3D job

xtshowcabs

• Nodes marked with a “c” are from our S3D run
• What does the mesh look like?

xtshowmesh (1 of 2)

• Nodes marked with a “c” are from our S3D run

xtshowmesh (2 of 2)

• Nodes marked with a “c” are from our S3D run

Conclusions

• Using a combination of XT3/XT4 nodes slowed down parts of S3D
• The application spends a considerable amount of time spinning/polling in MPI_Wait
• The load imbalance is probably caused by non-uniform nodes
• Conducted a performance characterization of S3D
• This data will help derive communication models that explain the performance data observed [John Mellor-Crummey, Rice]
• Techniques to improve cache memory utilization in the loops identified by TAU will help overall performance [SDSC, Rice]
• I/O characterization of S3D will help identify I/O scaling issues

S3D - Building with TAU

• Change name of compiler in build/make.XT3
    ftn => tau_f90.sh
    cc => tau_cc.sh
• Set compile time environment variables
    setenv TAU_MAKEFILE /spin/proj/perc/TOOLS/tau_latest/xt3/lib/Makefile.tau-nocomm-multiplecounters-mpi-papi-pdt-pgi
  • Disabled tracking message communication statistics in TAU (MPI_Comm_compare() is not called inside TAU’s MPI wrapper)
  • Choose callpath, PAPI counters, MPI profiling, PDT for source instrumentation
    setenv TAU_OPTIONS '-optTauSelectFile=select.tau -optPreProcess'
  • Selective instrumentation file eliminates instrumentation in lightweight routines
  • Pre-process Fortran source code using cpp before compiling
• Set runtime environment variables for instrumentation control and PAPI event counter selection in job submission script:
    export TAU_THROTTLE=1
    export COUNTER1=GET_TIME_OF_DAY
    export COUNTER2=PAPI_FP_INS
    export COUNTER3=PAPI_L1_DCM
    export COUNTER4=PAPI_TOT_INS
    export COUNTER5=PAPI_L2_DCM

Selective Instrumentation in TAU

% cat select.tau
BEGIN_EXCLUDE_LIST
MCADIF
GETRATES
TRANSPORT_M::MCAVIS_NEW
MCEDIF
MCACON
CKYTCP
THERMCHEM_M::MIXCP
THERMCHEM_M::MIXENTH
THERMCHEM_M::GIBBSENRG_ALL_DIMT
CKRHOY
MCEVAL4
THERMCHEM_M::HIS
THERMCHEM_M::CPS
THERMCHEM_M::ENTROPY
END_EXCLUDE_LIST

BEGIN_INSTRUMENT_SECTION
loops routine="#"
END_INSTRUMENT_SECTION

Getting Access to TAU on Jaguar

• set path=(/spin/proj/perc/TOOLS/tau_latest/x86_64/bin $path)
• Choose Stub Makefiles (TAU_MAKEFILE env. var.) from /spin/proj/perc/TOOLS/tau_latest/xt3/lib/Makefile.*
  • Makefile.tau-mpi-pdt-pgi (flat profile)
  • Makefile.tau-mpi-pdt-pgi-trace (event trace, for use with Vampir)
  • Makefile.tau-callpath-mpi-pdt-pgi (single metric, callpath profile)
• Binaries of S3D can be found in: ~sameer/scratch/S3D-BINARIES
  • withtau (papi, multiplecounters, mpi, pdt, pgi options)
  • without_tau

Concluding Discussion

• Performance tools must be used effectively
• More intelligent performance systems for productive use
  • Evolve to application-specific performance technology
  • Deal with scale by “full range” performance exploration
  • Autonomic and integrated tools
  • Knowledge-based and knowledge-driven process
• Performance observation methods do not necessarily need to change in a fundamental sense
  • They should be more automatically controlled and more efficiently used
• Develop next-generation tools and deliver to community
• Open source with support by ParaTools, Inc.
• http://www.cs.uoregon.edu/research/tau

Support Acknowledgements

• Department of Energy (DOE)
  • Office of Science
  • LLNL, LANL, ORNL, ASC
  • PERI
