Post on 19-Dec-2015
TAU Performance SystemS3D Scalability Study 2
Acknowledgements
Alan Morris [UO] Kevin Huck [UO] Allen D. Malony [UO] Kenneth Roche [ORNL] Bronis R. de Supinski [LLNL] John Mellor-Crummey [Rice] Nick Wright [SDSC] Jeff Larkin [Cray, Inc.]
The performance data presented here is available at:
http://www.cs.uoregon.edu/research/tau/s3d
TAU Parallel Performance System
• http://www.cs.uoregon.edu/research/tau/
• Multi-level performance instrumentation
  • Multi-language automatic source instrumentation
• Flexible and configurable performance measurement
• Widely-ported parallel performance profiling system
  • Computer system architectures and operating systems
  • Different programming languages and compilers
• Support for multiple parallel programming paradigms
  • Multi-threading, message passing, mixed-mode, hybrid
The Story So Far...
• Scalability study of S3D using TAU
  • MPI_Wait
  • I/O (WRITE_SAVEFILE)
  • Loop: ComputeSpeciesDiffFlux (630-656) [Rice, SDSC]
  • Loop: ReactionRateBounds (374-386) [exp]
• Earlier, 3D scatter plots pointed to a single “slow” node
• Identifying individual nodes by mapping ranks to nodes within TAU
  • Cray utilities: nodeinfo, xtshowmesh, xtshowcabs
• Ran a 6400 core simulation to identify XT3/XT4 partition performance issues (removed -feature=xt3)
Total Runtime Breakdown by Events - Time
MPI_Wait
WRITE_SAVEFILE
Case Study
• Harness testcase
• Platform: Jaguar, combined Cray XT3/XT4 at ORNL
• 6400p
• Goal:
  • To evaluate the performance impact of combined XT3/XT4 nodes on S3D executions
  • Performance evaluation of MPI_Wait
  • Study mapping of MPI ranks to nodes
3D Scatter Plots
• Plot four routines along X, Y, Z, and Color axes
• Each routine has a range (max, min)
• Each process (rank) has a unique position along the three axes and a unique color
• Allows us to examine the distribution of nodes (clusters)
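The placement rule on this slide can be sketched in a few lines. This is a minimal illustration, assuming each routine's per-rank values are rescaled linearly between that routine's minimum and maximum; the function names and the sample values are hypothetical, and TAU's own visualizer may scale the axes differently.

```python
def normalize(values):
    """Rescale one routine's per-rank values to [0, 1] using its (min, max) range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A routine with no spread puts every rank at the axis origin.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def scatter_points(metrics):
    """Turn four per-rank lists (X, Y, Z, Color routines) into one tuple per rank."""
    if len(metrics) != 4:
        raise ValueError("expected exactly four routines: X, Y, Z and Color")
    axes = [normalize(m) for m in metrics]
    return list(zip(*axes))


# Hypothetical exclusive times (seconds) for three ranks and four routines.
points = scatter_points([
    [10.0, 20.0, 30.0],    # X axis routine
    [5.0, 5.0, 5.0],       # Y axis routine (no spread)
    [1.0, 3.0, 2.0],       # Z axis routine
    [100.0, 50.0, 75.0],   # Color routine
])
for rank, point in enumerate(points):
    print(rank, point)
```

Ranks that behave alike end up close together in the normalized space, which is what makes clusters of similar nodes visible.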
3D Triangle Mesh Display
• Plot MPI rank, routine name, and exclusive time along X, Y and Z axes
• Color can be shown by a fourth metric
• Scalable view
• Suitable for very large number of processors
Zoom, Change Color to L1 Data Cache Misses
• Loop in ComputeSpeciesDiffFlux (630-656) has high L1 DCMs (red)
• Takes longer to execute on this “slice” of processors. So do other routines. Slower memory?
Changing Color to MFLOPS
• Loop in ComputeSpeciesDiffFlux (630-656) has lower Mflops (dark blue)
Getting Back to MPI_Wait()
• Why does MPI_Wait take less time on these cores?
• What does the profile of MPI_Wait look like?
MPI_Wait - Sorted by Exclusive Time
• MPI_Wait takes 435.84 seconds on rank 3101
• It takes 59.6 s on rank 3233 and 29.2 s on rank 3200
• It takes 15.49 seconds on rank 0!
• How is rank 3101 different from rank 0?
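The sorted view on this slide can be reproduced outside the profile browser. The sketch below uses only the four timings quoted above; the function names are illustrative and are not part of TAU.

```python
# Exclusive MPI_Wait times in seconds, as quoted on the slide.
mpi_wait = {3101: 435.84, 3233: 59.6, 3200: 29.2, 0: 15.49}


def sort_by_exclusive_time(times):
    """Return (rank, seconds) pairs, longest wait first."""
    return sorted(times.items(), key=lambda item: item[1], reverse=True)


def relative_to_max(times):
    """Express every rank's wait as a fraction of the longest wait."""
    longest = max(times.values())
    return {rank: seconds / longest for rank, seconds in times.items()}


ranking = sort_by_exclusive_time(mpi_wait)
share = relative_to_max(mpi_wait)
for rank, seconds in ranking:
    print(f"rank {rank:5d}: {seconds:8.2f} s ({share[rank]:.1%} of max)")
```

On these numbers, rank 3101 spends roughly 28 times longer in MPI_Wait than rank 0.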
Comparing PAPI Floating Point Instructions
• PAPI_FP_INS are the same - as expected
Comparing Performance - MFLOPS
• For the memory-intensive loop in ComputeSpeciesDiffFlux, rank 0 achieves only about 65% of the Mflops of rank 3101 (114 vs. 174 Mflops)!
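The arithmetic behind this comparison is worth spelling out. Because the PAPI_FP_INS counts are the same on both ranks, the Mflops ratio is simply the inverse of the time ratio for the loop. The sketch below uses the 114 and 174 Mflops figures from the slide; the instruction count is a hypothetical value chosen only to illustrate the relationship.

```python
def mflops(fp_instructions, seconds):
    """Millions of floating point instructions executed per second."""
    return fp_instructions / seconds / 1.0e6


rank0_mflops = 114.0      # from the slide
rank3101_mflops = 174.0   # from the slide

ratio = rank0_mflops / rank3101_mflops      # rank 0 relative to rank 3101
slowdown = rank3101_mflops / rank0_mflops   # how much longer the loop runs on rank 0

# Hypothetical instruction count, identical on both ranks.
fp_ins = 1.0e9
t_rank0 = fp_ins / (rank0_mflops * 1.0e6)
t_rank3101 = fp_ins / (rank3101_mflops * 1.0e6)

print(f"rank 0 reaches {ratio:.1%} of rank 3101's Mflops")
print(f"the loop takes {slowdown:.2f}x longer on rank 0")
```

With equal work, a rank running at about 65% of the Mflops needs about 1.53 times as long for the same loop.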
Comparing MFLOPS: Rank 3101 vs Rank 0
• Rank 0 appears to be “slower” than rank 3101
• Are there other nodes that are similarly slow with shorter wait times?
• What does the MPI_Wait profile look like over all nodes?
MPI_Wait Profile Shifts at rank 114!
• Ranks 0 through 113 take less time in MPI_Wait than rank 114 and the ranks that follow
Another Shift in MPI_Wait()
• This shift is observed in ranks 3200 through 3313
• Again 114 processors... (like ranks 0 through 113)
• Hmm...
• How do other routines perform on these ranks?
• What are the physical node ids?
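Shifts like these can also be located programmatically rather than by eye. The sketch below is one simple approach, not the method used in the study: it scans a per-rank MPI_Wait profile for contiguous runs of ranks below a threshold. The wait times and the threshold are synthetic; only the two rank ranges come from the slides.

```python
def low_wait_blocks(wait_times, threshold):
    """Return (first_rank, last_rank) for each contiguous run of ranks
    whose wait time is below the threshold. wait_times is indexed by rank."""
    blocks = []
    start = None
    for rank, seconds in enumerate(wait_times):
        if seconds < threshold:
            if start is None:
                start = rank
        elif start is not None:
            blocks.append((start, rank - 1))
            start = None
    if start is not None:
        blocks.append((start, len(wait_times) - 1))
    return blocks


# Synthetic profile for 6400 ranks: a uniform wait time everywhere except
# the two rank ranges named on the slides, which are given a lower value.
wait = [60.0] * 6400
for rank in list(range(0, 114)) + list(range(3200, 3314)):
    wait[rank] = 20.0

blocks = low_wait_blocks(wait, threshold=40.0)
for first, last in blocks:
    print(f"ranks {first}..{last} ({last - first + 1} ranks)")
```

Both blocks contain 114 ranks, which is the coincidence that prompts the question about physical node ids.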
MPI_Wait
• While MPI_Wait takes less time on these cpus, other routines take longer
• Points to a load imbalance!
MetaData for Ranks 3200 and 0
• Ranks 3200 and 0 both lie on the same physical node, nid03406!
Mapping Ranks from TAU to Physical Processors
• Ranks 0..113 lie on processors 3406..3551
• Ranks 3200..3313 are also on 3406..3551
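The rank-to-node mapping can be inverted to see which ranks share a physical node. This is a minimal sketch that assumes the per-rank node names have already been collected into a dictionary (for example from the metadata TAU records for each rank). Only the pairing of ranks 0 and 3200 on nid03406 comes from the slides; the other entries and the function names are hypothetical.

```python
from collections import defaultdict


def ranks_by_node(rank_to_node):
    """Invert a rank -> node mapping into node -> sorted list of ranks."""
    nodes = defaultdict(list)
    for rank in sorted(rank_to_node):
        nodes[rank_to_node[rank]].append(rank)
    return dict(nodes)


def shared_nodes(rank_to_node):
    """Keep only the nodes that host more than one MPI rank."""
    return {
        node: ranks
        for node, ranks in ranks_by_node(rank_to_node).items()
        if len(ranks) > 1
    }


rank_to_node = {
    0: "nid03406",            # from the slides
    3200: "nid03406",         # from the slides
    114: "hypothetical_a",    # placeholder node name
    3314: "hypothetical_b",   # placeholder node name
}

shared = shared_nodes(rank_to_node)
print(shared)
```

Applied to the full 6400-rank mapping, the same grouping would show whether every rank in 0..113 is paired with a rank in 3200..3313.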
Results from Cray’s nodeinfo Utility
• Processors 3406..3551 (physical ids) are located on the XT3 partition
• XT3 partition has slow DDR-400 memory (5986 MB/s)
• XT3 has a slower SS1 (1109 MB/s) interconnect
• XT4 partition has faster DDR2-667 memory modules (7147 MB/s) and faster Seastar2 (SS2) (2022 MB/s) interconnect
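The size of the gap between the two partitions follows directly from the figures above. The short calculation below uses only the bandwidth numbers reported on this slide; the dictionary keys are illustrative names.

```python
# Bandwidth figures in MB/s, as reported on the slide.
xt3 = {"memory": 5986, "interconnect": 1109}
xt4 = {"memory": 7147, "interconnect": 2022}

# How many times faster the XT4 partition is for each component.
advantage = {component: xt4[component] / xt3[component] for component in xt3}

for component, factor in advantage.items():
    print(f"XT4 {component} bandwidth is {factor:.2f}x that of XT3")
```

The XT4 nodes have about 1.19 times the memory bandwidth and about 1.82 times the interconnect bandwidth of the XT3 nodes.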
Location of Physical Nodes in the Cabinets
• Using the Cray utilities xtshowcabs and xtshowmesh
• All nodes marked with job “c” came from our S3D job
xtshowcabs
• Nodes marked with a “c” are from our S3D run
• What does the mesh look like?
xtshowmesh (1 of 2)
• Nodes marked with a “c” are from our S3D run
xtshowmesh (2 of 2)
• Nodes marked with a “c” are from our S3D run
Conclusions
• Using a combination of XT3/XT4 nodes slowed down parts of S3D
• The application spends a considerable amount of time spinning/polling in MPI_Wait
• The load imbalance is probably caused by non-uniform nodes
• Conducted a performance characterization of S3D
• This data will help derive communication models that explain the performance data observed [John Mellor-Crummey, Rice]
• Techniques to improve cache memory utilization in the loops identified by TAU will help overall performance [SDSC, Rice]
• I/O characterization of S3D will help identify I/O scaling issues
S3D - Building with TAU
• Change name of compiler in build/make.XT3
  • ftn => tau_f90.sh
  • cc => tau_cc.sh
• Set compile time environment variables
  • setenv TAU_MAKEFILE /spin/proj/perc/TOOLS/tau_latest/xt3/lib/Makefile.tau-nocomm-multiplecounters-mpi-papi-pdt-pgi
    » Disabled tracking message communication statistics in TAU
    » MPI_Comm_compare() is not called inside TAU’s MPI wrapper
    » Choose callpath, PAPI counters, MPI profiling, PDT for source instrumentation
  • setenv TAU_OPTIONS '-optTauSelectFile=select.tau -optPreProcess'
    » Selective instrumentation file eliminates instrumentation in lightweight routines
    » Pre-process Fortran source code using cpp before compiling
• Set runtime environment variables for instrumentation control and PAPI counter selection in job submission script:
  • export TAU_THROTTLE=1
  • export COUNTER1=GET_TIME_OF_DAY
  • export COUNTER2=PAPI_FP_INS
  • export COUNTER3=PAPI_L1_DCM
  • export COUNTER4=PAPI_TOT_INS
  • export COUNTER5=PAPI_L2_DCM
Selective Instrumentation in TAU
% cat select.tau
BEGIN_EXCLUDE_LIST
MCADIF
GETRATES
TRANSPORT_M::MCAVIS_NEW
MCEDIF
MCACON
CKYTCP
THERMCHEM_M::MIXCP
THERMCHEM_M::MIXENTH
THERMCHEM_M::GIBBSENRG_ALL_DIMT
CKRHOY
MCEVAL4
THERMCHEM_M::HIS
THERMCHEM_M::CPS
THERMCHEM_M::ENTROPY
END_EXCLUDE_LIST
BEGIN_INSTRUMENT_SECTION
loops routine="#"
END_INSTRUMENT_SECTION
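When the exclude list changes often, a file in this layout can be generated rather than edited by hand. The helper below is a hypothetical convenience, not a TAU utility: it emits the same section markers and loop directive shown in the listing above, for whatever routine names it is given.

```python
def render_select_file(excluded_routines, instrument_all_loops=True):
    """Build the text of a selective instrumentation file in the layout shown above."""
    lines = ["BEGIN_EXCLUDE_LIST"]
    lines.extend(excluded_routines)
    lines.append("END_EXCLUDE_LIST")
    if instrument_all_loops:
        # Same loop directive as in the select.tau listing.
        lines.append("BEGIN_INSTRUMENT_SECTION")
        lines.append('loops routine="#"')
        lines.append("END_INSTRUMENT_SECTION")
    return "\n".join(lines) + "\n"


# A few of the routines excluded in the listing above.
text = render_select_file(["MCADIF", "GETRATES", "TRANSPORT_M::MCAVIS_NEW"])
print(text, end="")
```

Writing the returned string to select.tau gives a file that the -optTauSelectFile option can point to.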
Getting Access to TAU on Jaguar
• set path=(/spin/proj/perc/TOOLS/tau_latest/x86_64/bin $path)
• Choose Stub Makefiles (TAU_MAKEFILE env. var.) from /spin/proj/perc/TOOLS/tau_latest/xt3/lib/Makefile.*
  • Makefile.tau-mpi-pdt-pgi (flat profile)
  • Makefile.tau-mpi-pdt-pgi-trace (event trace, for use with Vampir)
  • Makefile.tau-callpath-mpi-pdt-pgi (single metric, callpath profile)
• Binaries of S3D can be found in: ~sameer/scratch/S3D-BINARIES
  • withtau
    » papi, multiplecounters, mpi, pdt, pgi options
  • without_tau
Concluding Discussion
• Performance tools must be used effectively
• More intelligent performance systems for productive use
  • Evolve to application-specific performance technology
  • Deal with scale by “full range” performance exploration
  • Autonomic and integrated tools
  • Knowledge-based and knowledge-driven process
• Performance observation methods do not necessarily need to change in a fundamental sense
  • More automatically controlled and efficiently used
• Develop next-generation tools and deliver to community
• Open source with support by ParaTools, Inc.
• http://www.cs.uoregon.edu/research/tau