HPC Application Profiling and Analysis
![Page 1: HPC Application Profiling and Analysis](https://reader033.fdocuments.in/reader033/viewer/2022051400/559c04811a28ab98188b45be/html5/thumbnails/1.jpg)
Application Profiling & Analysis
Rishi Pathak
National PARAM Supercomputing Facility, C-DAC
![Page 2: HPC Application Profiling and Analysis](https://reader033.fdocuments.in/reader033/viewer/2022051400/559c04811a28ab98188b45be/html5/thumbnails/2.jpg)
What is application profiling
• Profiling
– Recording of summary information during execution
• inclusive, exclusive time, # calls, hardware counter statistics, …
– Reflects performance behavior of program entities
• functions, loops, basic blocks
• user-defined “semantic” entities
– Very good for low-cost performance assessment
– Helps to expose performance bottlenecks and hotspots
– Implemented through either
• sampling: periodic OS interrupts or hardware counter traps
• measurement: direct insertion of measurement code
![Page 3: HPC Application Profiling and Analysis](https://reader033.fdocuments.in/reader033/viewer/2022051400/559c04811a28ab98188b45be/html5/thumbnails/3.jpg)
Sampling vs. Instrumentation

| | Sampling | Instrumentation |
| --- | --- | --- |
| Overhead | Typically about 1% | High, may be 500%! |
| System-wide profiling | Yes, profiles all apps, drivers, OS functions | Just application and instrumented DLLs |
| Detect unexpected events | Yes, can detect other programs using OS resources | No |
| Setup | None | Automatic insertion of data collection stubs required |
| Data collected | Counters, processor and OS state | Call graph, call times, critical path |
| Data granularity | Assembly-level instructions, with source line | Functions, sometimes statements |
| Detects algorithmic issues | No, limited to processes, threads | Yes, can see that an algorithm or call path is expensive |
![Page 4: HPC Application Profiling and Analysis](https://reader033.fdocuments.in/reader033/viewer/2022051400/559c04811a28ab98188b45be/html5/thumbnails/4.jpg)
Inclusive v/s Exclusive Profiling
• Inclusive time for main: 100 secs
• Exclusive time for main: 100 - 20 - 50 - 20 = 10 secs
int main(void)
{               /* takes 100 secs in total */
    f1();       /* takes 20 secs */
    /* other work */
    f2();       /* takes 50 secs */
    f1();       /* takes 20 secs */
    /* other work */
}
/* similar for other metrics, such as hardware performance counters, etc. */
![Page 5: HPC Application Profiling and Analysis](https://reader033.fdocuments.in/reader033/viewer/2022051400/559c04811a28ab98188b45be/html5/thumbnails/5.jpg)
What are Application Traces
• Tracing
– Recording of information about significant points (events) during program execution
• entering/exiting code region (function, loop, block, …)
• thread/process interactions (e.g., send/receive message)
– Save information in event record
• timestamp
• CPU identifier, thread identifier
• Event type and event-specific information
– Event trace is a time-sequenced stream of event records
– Can be used to reconstruct dynamic program behavior
– Typically requires code instrumentation
![Page 6: HPC Application Profiling and Analysis](https://reader033.fdocuments.in/reader033/viewer/2022051400/559c04811a28ab98188b45be/html5/thumbnails/6.jpg)
Profiling v/s Tracing
• Profiling
– Summary statistics of performance metrics
• Number of times a routine was invoked
• Exclusive, inclusive time or hardware performance monitor (HPM) counts spent executing it
• Number of instrumented child routines invoked, etc.
• Structure of invocations (call-trees/call-graphs)
• Memory, message communication sizes
• Tracing
– When and where events took place along a global timeline
• Time-stamped log of events
• Message communication events (sends/receives) are tracked
• Shows when and from/to where messages were sent
• Large volume of performance data generated usually leads to more perturbation in the program
![Page 7: HPC Application Profiling and Analysis](https://reader033.fdocuments.in/reader033/viewer/2022051400/559c04811a28ab98188b45be/html5/thumbnails/7.jpg)
The Big Picture
[Diagram with the stages: Instrumentation, Sampling, Profiling, Analysis, Optimization]
![Page 8: HPC Application Profiling and Analysis](https://reader033.fdocuments.in/reader033/viewer/2022051400/559c04811a28ab98188b45be/html5/thumbnails/8.jpg)
Instrumentation & Sampling
![Page 9: HPC Application Profiling and Analysis](https://reader033.fdocuments.in/reader033/viewer/2022051400/559c04811a28ab98188b45be/html5/thumbnails/9.jpg)
Measurements - Instrumentation
Instrumentation: adding measurement probes to the code to observe its execution
– Can be done on several levels
– Different techniques for different levels
– Different overheads and levels of accuracy with each technique
– No instrumentation: run in a simulator. E.g., Valgrind
![Page 10: HPC Application Profiling and Analysis](https://reader033.fdocuments.in/reader033/viewer/2022051400/559c04811a28ab98188b45be/html5/thumbnails/10.jpg)
Measurements - Instrumentation
• Source code instrumentation
– User-added time measurement, etc. (e.g., printf(), gettimeofday())
– Many tools expose mechanisms for source code instrumentation in addition to automatic instrumentation facilities they offer
– Instrument program phases:
• initialization/main iteration loop/data post processing
![Page 11: HPC Application Profiling and Analysis](https://reader033.fdocuments.in/reader033/viewer/2022051400/559c04811a28ab98188b45be/html5/thumbnails/11.jpg)
Measurements - Instrumentation
• Preprocessor Instrumentation
– Example: Instrumenting OpenMP constructs with Opari
– Preprocessor operation
[Diagram: Original source code, Preprocessor, Modified (instrumented) source code]
[Code figure: /* ORIGINAL CODE in parallel region */ surrounded by instrumentation added by Opari]
This is used for OpenMP analysis in tools such as KOJAK/Scalasca/ompP
![Page 12: HPC Application Profiling and Analysis](https://reader033.fdocuments.in/reader033/viewer/2022051400/559c04811a28ab98188b45be/html5/thumbnails/12.jpg)
Measurements - Instrumentation
• Compiler Instrumentation
– Many compilers can instrument functions automatically
– GNU compiler flag: -finstrument-functions
– Automatically calls functions on function entry/exit that a tool can capture
– Not standardized across compilers, often undocumented flags, sometimes not available at all
![Page 13: HPC Application Profiling and Analysis](https://reader033.fdocuments.in/reader033/viewer/2022051400/559c04811a28ab98188b45be/html5/thumbnails/13.jpg)
Measurements - Instrumentation
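– GNU compiler example (hooks called by code compiled with -finstrument-functions):
/* The hooks themselves must not be instrumented, or they would recurse. */
void __attribute__((no_instrument_function))
__cyg_profile_func_enter(void *this_fn, void *call_site)
{
    /* called on function entry */
}
void __attribute__((no_instrument_function))
__cyg_profile_func_exit(void *this_fn, void *call_site)
{
    /* called just before returning from function */
}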
![Page 14: HPC Application Profiling and Analysis](https://reader033.fdocuments.in/reader033/viewer/2022051400/559c04811a28ab98188b45be/html5/thumbnails/14.jpg)
Measurements - Instrumentation
• Library Instrumentation:
MPI library interposition
– All functions are available under two names, MPI_xxx and PMPI_xxx; the MPI_xxx symbols are weak and can be overridden by an interposition library
– Measurement code in the interposition library records begin time, end time, transmitted data, etc., and then calls the corresponding PMPI routine
– Not all MPI functions need to be instrumented
![Page 15: HPC Application Profiling and Analysis](https://reader033.fdocuments.in/reader033/viewer/2022051400/559c04811a28ab98188b45be/html5/thumbnails/15.jpg)
Measurements - Instrumentation
• Binary Instrumentation
– Static binary instrumentation
• LD_PRELOAD (Linux)
• DLL injection (MS)
– Debuggers
• Breakpoints (software & hardware)
– Dynamic binary instrumentation
• Injection of instrumentation code into a running process
• Tools: Pin (Intel), Valgrind
![Page 16: HPC Application Profiling and Analysis](https://reader033.fdocuments.in/reader033/viewer/2022051400/559c04811a28ab98188b45be/html5/thumbnails/16.jpg)
Measurements - Sampling
Using event triggers
– Recurring: the program counter is sampled many times
– Histogram of program contexts (calling context tree, CCT)
– Sufficiently large number of samples required
– Uniformity in event triggers with respect to execution time
![Page 17: HPC Application Profiling and Analysis](https://reader033.fdocuments.in/reader033/viewer/2022051400/559c04811a28ab98188b45be/html5/thumbnails/17.jpg)
Measurements - Sampling
Event trigger types
• Synchronous
– Initiated by direct program action
– E.g. memory allocation, I/O, and inter-process communication(including MPI communication)
• Asynchronous
– Not initiated by direct program action
– OS timer interrupt or
– Hardware performance counter events
– E.g. CPU, floating point instructions, clock cycles etc.
![Page 18: HPC Application Profiling and Analysis](https://reader033.fdocuments.in/reader033/viewer/2022051400/559c04811a28ab98188b45be/html5/thumbnails/18.jpg)
Profiling Tools
• Gprof
• Intel VTune
• Scalasca
• HPCToolkit
![Page 19: HPC Application Profiling and Analysis](https://reader033.fdocuments.in/reader033/viewer/2022051400/559c04811a28ab98188b45be/html5/thumbnails/19.jpg)
HPCToolkit
• Name: HPCToolkit
• Developer: Rice University
• Website:
– http://hpctoolkit.org
![Page 20: HPC Application Profiling and Analysis](https://reader033.fdocuments.in/reader033/viewer/2022051400/559c04811a28ab98188b45be/html5/thumbnails/20.jpg)
HPCToolkit Overview
• Consists of
– hpcviewer
• Sorts by any collected metric, from any processes displayed
• Displays samples at various levels in call hierarchy through “flattening”
• Allows user to focus in on interesting sections of the program through “zooming”
– hpcprof and hpcprof-mpi
• Correlate dynamic profiles with static source code structure
– hpcrun
• Application profiling using statistical sampling
• hpcrun-flat: for collection of “flat” profiles
– hpcstruct
• Recovers static program structure such as procedures and loop nests
– hpctraceviewer
– hpclink
• For statically linked executables (e.g. for Cray XT or BG/P)
![Page 21: HPC Application Profiling and Analysis](https://reader033.fdocuments.in/reader033/viewer/2022051400/559c04811a28ab98188b45be/html5/thumbnails/21.jpg)
Available Metrics in HPCToolkit
• Metrics, obtained by sampling/profiling
– PAPI Hardware counters
– OS program counters
• Wallclock time (WALLCLK)
– However, can’t get PAPI metrics and Wallclock time in a single run
• Derived metrics
– Combination of existing metrics created by specifying a mathematical formula in an XML configuration file.
• Source Code Correlation
– Metrics reflect exclusive time spent in function based on counter overflow events
– Metrics correlated at the source line level and the loop level
– Metrics are related back to source code loops (even if code has been significantly altered by optimization) (“bloop”)
![Page 22: HPC Application Profiling and Analysis](https://reader033.fdocuments.in/reader033/viewer/2022051400/559c04811a28ab98188b45be/html5/thumbnails/22.jpg)
HPCToolkit workflow
![Page 23: HPC Application Profiling and Analysis](https://reader033.fdocuments.in/reader033/viewer/2022051400/559c04811a28ab98188b45be/html5/thumbnails/23.jpg)
hpcviewer Views
• Calling context view
– top-down view shows dynamic calling contexts in which costs were incurred
• Callers view
– bottom-up view apportions costs incurred in a routine to the routine’s dynamic calling contexts
• Flat view
– aggregates all costs incurred by a routine in any context and shows the details of where they were incurred within the routine
![Page 24: HPC Application Profiling and Analysis](https://reader033.fdocuments.in/reader033/viewer/2022051400/559c04811a28ab98188b45be/html5/thumbnails/24.jpg)
hpcviewer User Interface
![Page 25: HPC Application Profiling and Analysis](https://reader033.fdocuments.in/reader033/viewer/2022051400/559c04811a28ab98188b45be/html5/thumbnails/25.jpg)
hpctraceviewer Views
• Trace view (left, top)
– Time on the horizontal axis
– Process (or thread) rank on the vertical axis
• Depth view (left, bottom) & Summary view
– Call-path/time view for the process rank selected
• Call view (right, top)
– Current call path depth that defines the hierarchical slice shown in the Trace View
– Actual call path for the point selected by the Trace View's crosshair
![Page 26: HPC Application Profiling and Analysis](https://reader033.fdocuments.in/reader033/viewer/2022051400/559c04811a28ab98188b45be/html5/thumbnails/26.jpg)
hpctraceviewer user interface - DEMO
![Page 27: HPC Application Profiling and Analysis](https://reader033.fdocuments.in/reader033/viewer/2022051400/559c04811a28ab98188b45be/html5/thumbnails/27.jpg)
Call Path Profiling: Costs in Context
Event-based sampling method for performance measurement
• When a profile event occurs, e.g. a timer expires
– determine context in which cost is incurred
• unwind call stack to determine set of active procedure frames
– attribute cost of sample to PC in calling context
• Benefits
– monitor unmodified fully optimized code
– language independent: C/C++, Fortran, assembly code, …
– accurate
– low overhead (1K samples per second has ~3-5% overhead)
![Page 28: HPC Application Profiling and Analysis](https://reader033.fdocuments.in/reader033/viewer/2022051400/559c04811a28ab98188b45be/html5/thumbnails/28.jpg)
Demo for :
• hpcrun (list events & profiling)
• hpcstruct
• hpcprof
![Page 29: HPC Application Profiling and Analysis](https://reader033.fdocuments.in/reader033/viewer/2022051400/559c04811a28ab98188b45be/html5/thumbnails/29.jpg)
PAPI
• Performance Application Programming Interface
– The purpose of the PAPI project is to design, standardize and implement a portable and efficient API to access the hardware performance monitor counters found on most modern microprocessors.
• Parallel Tools Consortium project started in 1998
• Developed by University of Tennessee, Knoxville
• http://icl.cs.utk.edu/papi/
![Page 30: HPC Application Profiling and Analysis](https://reader033.fdocuments.in/reader033/viewer/2022051400/559c04811a28ab98188b45be/html5/thumbnails/30.jpg)
PAPI - Support
• Unix/Linux
– Perfctr kernel patch for kernel < 2.6.30
– Perf package for kernel >= 2.6.30
![Page 31: HPC Application Profiling and Analysis](https://reader033.fdocuments.in/reader033/viewer/2022051400/559c04811a28ab98188b45be/html5/thumbnails/31.jpg)
PAPI - Implementation
[Diagram: PAPI layered architecture, top to bottom]
• 3rd Party and GUI Tools
• Portable Layer: PAPI High Level, PAPI Low Level
• Machine Specific Layer: PAPI Machine Dependent Substrate, Kernel Extension, Operating System, Hardware Performance Counters
![Page 32: HPC Application Profiling and Analysis](https://reader033.fdocuments.in/reader033/viewer/2022051400/559c04811a28ab98188b45be/html5/thumbnails/32.jpg)
PAPI - Hardware Events
• Preset Events (platform neutral)
– Standard set of over 100 events for application performance tuning
– No standardization of the exact definition
– Mapped to either single or linear combinations of native events on each platform
– Use papi_avail utility to see what preset events are available on a given platform
– Example: PAPI_TOT_INS
• Native Events (platform dependent)
– Any event countable by the CPU
– Same interface as for preset events
– Use papi_native_avail utility to see all available native events
– Example: L3_MISSES
• Use papi_event_chooser utility to select a compatible set of events
![Page 33: HPC Application Profiling and Analysis](https://reader033.fdocuments.in/reader033/viewer/2022051400/559c04811a28ab98188b45be/html5/thumbnails/33.jpg)
PAPI events demo
papi_avail
papi_native_avail
papi_event_chooser
Availability for QPI h/w performance counters using /sbin/lspci
![Page 34: HPC Application Profiling and Analysis](https://reader033.fdocuments.in/reader033/viewer/2022051400/559c04811a28ab98188b45be/html5/thumbnails/34.jpg)
PAPI-Derived Metrics
Instructions

| Metric | Formula |
| --- | --- |
| Graduated instructions per cycle | `PAPI_TOT_INS/PAPI_TOT_CYC` |
| Issued instructions per cycle | `PAPI_TOT_IIS/PAPI_TOT_CYC` |
| Graduated floating point instructions per cycle | `PAPI_FP_INS/PAPI_TOT_CYC` |
| Percentage floating point instructions | `PAPI_FP_INS/PAPI_TOT_INS` |
| Ratio of graduated instructions to issued instructions | `PAPI_TOT_INS/PAPI_TOT_IIS` |
| Percentage of cycles with no instruction issue | `100.0 * (PAPI_STL_ICY/PAPI_TOT_CYC)` |
| Data references per instruction | `PAPI_L1_DCA/PAPI_TOT_INS` |
| Ratio of floating point instructions to L1 data cache accesses | `PAPI_FP_INS/PAPI_L1_DCA` |
| Ratio of floating point instructions to L2 cache accesses (data) | `PAPI_FP_INS/PAPI_L2_DCA` |
| Issued instructions per L1 instruction cache miss | `PAPI_TOT_IIS/PAPI_L1_ICM` |
| Graduated instructions per L1 instruction cache miss | `PAPI_TOT_INS/PAPI_L1_ICM` |
| L1 instruction cache miss ratio | `PAPI_L2_ICR/PAPI_L1_ICR` |
![Page 35: HPC Application Profiling and Analysis](https://reader033.fdocuments.in/reader033/viewer/2022051400/559c04811a28ab98188b45be/html5/thumbnails/35.jpg)
PAPI-Derived Metrics
Cache & Memory Hierarchy

| Metric | Formula |
| --- | --- |
| Graduated loads & stores per cycle | `PAPI_LST_INS/PAPI_TOT_CYC` |
| Graduated loads & stores per floating point instruction | `PAPI_LST_INS/PAPI_FP_INS` |
| L1 cache line reuse (data) | `((PAPI_LST_INS - PAPI_L1_DCM) / PAPI_L1_DCM)` |
| L1 cache data hit rate | `1.0 - (PAPI_L1_DCM/PAPI_LST_INS)` |
| L1 data cache read miss ratio | `PAPI_L1_DCM/PAPI_L1_DCA` |
| L2 cache line reuse (data) | `((PAPI_L1_DCM - PAPI_L2_DCM) / PAPI_L2_DCM)` |
| L2 cache data hit rate | `1.0 - (PAPI_L2_DCM/PAPI_L1_DCM)` |
| L2 cache miss ratio | `PAPI_L2_TCM/PAPI_L2_TCA` |
| L3 cache line reuse (data) | `((PAPI_L2_DCM - PAPI_L3_DCM) / PAPI_L3_DCM)` |
| L3 cache data hit rate | `1.0 - (PAPI_L3_DCM/PAPI_L2_DCM)` |
| L3 data cache miss ratio | `PAPI_L3_DCM/PAPI_L3_DCA` |
| L3 cache data read ratio | `PAPI_L3_DCR/PAPI_L3_DCA` |
| L3 cache instruction miss ratio | `PAPI_L3_ICM/PAPI_L3_ICR` |
| Bandwidth used (Lx cache) | `((PAPI_Lx_TCM * Lx_linesize) / PAPI_TOT_CYC) * Clock(MHz)` |
![Page 36: HPC Application Profiling and Analysis](https://reader033.fdocuments.in/reader033/viewer/2022051400/559c04811a28ab98188b45be/html5/thumbnails/36.jpg)
PAPI-Derived Metrics
http://perfsuite.ncsa.illinois.edu/psprocess/metrics.shtml

Branching

| Metric | Formula |
| --- | --- |
| Ratio of mispredicted to correctly predicted branches | `PAPI_BR_MSP/PAPI_BR_PRC` |

Processor Stalls

| Metric | Formula |
| --- | --- |
| Percentage of cycles waiting for memory access | `100.0 * (PAPI_MEM_SCY/PAPI_TOT_CYC)` |
| Percentage of cycles stalled on any resource | `100.0 * (PAPI_RES_STL/PAPI_TOT_CYC)` |

Aggregate Performance

| Metric | Formula |
| --- | --- |
| MFLOPS (CPU cycles) | `(PAPI_FP_INS/PAPI_TOT_CYC) * Clock(MHz)` |
| MFLOPS (effective) | `PAPI_FP_INS/Wallclock time` |
| MIPS (CPU cycles) | `(PAPI_TOT_INS/PAPI_TOT_CYC) * Clock(MHz)` |
| MIPS (effective) | `PAPI_TOT_INS/Wallclock time` |
| Processor utilization | `(PAPI_TOT_CYC*Clock) / Wallclock time` |
![Page 37: HPC Application Profiling and Analysis](https://reader033.fdocuments.in/reader033/viewer/2022051400/559c04811a28ab98188b45be/html5/thumbnails/37.jpg)
Component PAPI (PAPI-C)
• Goals:
– Support for simultaneous access to on- and off-processor counters
– Isolation of hardware dependent code in a separable ‘substrate’ module
– Extension of platform independent code to support multiple simultaneous substrates
– API calls to support access to any of several substrates
• Released in PAPI 4.0
![Page 38: HPC Application Profiling and Analysis](https://reader033.fdocuments.in/reader033/viewer/2022051400/559c04811a28ab98188b45be/html5/thumbnails/38.jpg)
Extension to PAPI to Support Multiple Substrates
[Diagram: layered PAPI-C architecture]
• Portable Layer: PAPI High Level, PAPI Low Level, Hardware Independent Layer
• Machine Specific Layer (on-processor): PAPI Machine Dependent Substrate, Kernel Extension, Operating System, Hardware Performance Counters
• Machine Specific Layer (off-processor): PAPI Machine Dependent Substrate, Kernel Extension, Operating System, Off-Processor Hardware Counters
![Page 39: HPC Application Profiling and Analysis](https://reader033.fdocuments.in/reader033/viewer/2022051400/559c04811a28ab98188b45be/html5/thumbnails/39.jpg)
High-level tools that use PAPI
• TAU (U Oregon)
• HPCToolkit (Rice Univ)
• KOJAK (UTK, FZ Juelich)
• PerfSuite (NCSA)
• Scalasca
• Open|SpeedShop (SGI)
• Intel VTune
![Page 40: HPC Application Profiling and Analysis](https://reader033.fdocuments.in/reader033/viewer/2022051400/559c04811a28ab98188b45be/html5/thumbnails/40.jpg)
• hpcviewer demo (trace, flops and clock cycles)
– Nek5000 (CFD solver using spectral element method)
– xhpl (Linpack)