Performance Programming Module I: Measuring Program Performance
Performance Programming: Theory, Practice and Case Studies
Module I: Measuring Program Performance

Outline
• Measuring methodology and guidelines
• Measurement tools
  - Timing tools
  - Profiling tools
  - Process monitoring and tracing tools
  - System monitoring tools
• Hardware counter measurements
  - Monitoring tools
  - Code instrumentation
• Parallel performance measurements
  - Guidelines and recommendations
  - Tools for parallel monitoring
• Summary

Measurement Methodology
• Quantifying performance is the first step in the application tuning process
• It is important to set reasonable expectations for optimization
• Measurements should be made repeatedly to identify the parts of the program that need to be optimized
• Choose measurement characteristics suited to the particular application
• Compare measurements against theoretical peak values

What to Measure
• Timing measurements
  - Wall clock time for a single job (turnaround time)
  - Wall clock time for multiple jobs (throughput measurements)
  - Wall clock time for parallel runs (scalability measurements)
• Execution and computation rates
  - MFLOPS (million floating point operations per second)
  - MIPS (million instructions per second)
  - IPC (instructions per cycle)
• Resource utilization
  - Memory usage
  - I/O utilization
  - Network usage

Benchmarking Guidelines
• Benchmark runs should adequately represent the use of the application
• Preferably, change only one parameter at a time
• The overhead of measurement should be considered
• Run from tmpfs or from a locally mounted UFS file system
• System activities should be monitored
• The systems should not have any other computational jobs running during benchmarking
• System parameters and settings should be documented together with the results of the runs

Measurement Tools
• Functionality
  - Timing tools
  - Profiling tools
  - Monitoring tools
• Usage requirements
  - Tools that can operate on optimized binaries
  - Tools that require recompilation
  - Tools that require source code instrumentation
• Parallel / serial measurement tools
  - Tools measuring serial performance
  - Tools measuring parallel performance

Timing Entire Program
• Measure the elapsed (wall clock) time that passes during program execution
• Example: Solaris time, timex, and ptime

Timing Program Portions
• Fortran 77: etime, dtime (neither is thread safe)
• C, C++, Fortran 90/95: gethrtime
  - High resolution timer (nanoseconds)
  - Can be called via a C wrapper from Fortran 77
  - Can be used for multithreaded applications
• Platform-specific tools and methods
  - Solaris microstate accounting
  - Fine-grain timing measurements by accessing the UltraSPARC TICK register directly

    .inline readtick,1
    rd %tick, %o1
    stx %o1, [%o0]
    .end
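
Measurement Overhead
• Computing the overhead of a gethrtime() call

    #include <stdio.h>
    #include <sys/time.h>

    int main(void)
    {
        hrtime_t start, end;   /* gethrtime() returns hrtime_t (nanoseconds) */
        int i, iters = 100000;

        for (i = 0; i < iters; i++) {
            start = gethrtime();
            end = gethrtime();   /* back-to-back calls: the difference is the call overhead */
            (void)printf("%lld\n", (long long)(end - start));
        }
        return 0;
    }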
[Figure: "Call overhead (ns)" histogram; series "Distribution", vertical axis from 0 to 22500, horizontal axis in 5 ns bins from 180-185 ns to 225-230 ns]

Program Profiling with gprof
• Application profiling
  - Special form of timing measurement that shows which functions account for large parts of the application run time
  - Should be used on multiple, representative test cases
• gprof: the standard UNIX profiling utility
  - Can be used for profiling executables and shared libraries
  - Based on program counter (PC) sampling at periodic intervals
  - Requires recompilation with -pg (Linux, Solaris, Tru64) or -G (HP-UX)
  - After the run, the data is collected in the gmon.out file
  - Profiling results are displayed with the gprof command

gprof Output
• Output includes
  - Absolute time spent in a function
  - Percentage of total run time spent in a function
  - Number of calls to the function
  - Average time per call
• Functions can be sorted by
  - time they consume together with their descendants (cumulative or inclusive time)
  - time spent executing the function itself (self or exclusive time)

      %   cumulative     self                self    total
     time    seconds  seconds     calls   ms/call  ms/call  name
     66.4      65.70    65.70    186116      0.35     0.35  dmmch_ [4]
     15.2      80.72    15.02     20448      0.73     0.73  dmake_ [8]
     10.9      91.51    10.79     16924      0.64     0.64  dgemm_ [9]
     ...

Profiling Using Coverage Analysis
• Coverage analysis tools annotate source code with the number of times each line was executed
  - Basic block profiling
  - Results can be accumulated for multiple runs
  - Information about hot loops in the code and branches taken
  - Code coverage for quality assurance

                      DO 350 L = LL, LL+LSEC-1
      150483840 ->       F11 = F11 + T1( L-LL+1, I-II+1 ) *
           $                   T2( L-LL+1, J-JJ+1 )

• Available on UNIX platforms
  - Linux/GNU: gcov
  - Solaris: tcov
  - IRIX: cvcov, cvxcov
  - Tru64: pixie
  - AIX: tprof

Advanced Profiling Tools
• Measurement parameters and features
  - Measurements based on hardware counters
  - Profiling by
      * functions
      * basic blocks
      * lines of high-level code
      * assembly instructions
  - Source code annotation
  - Capabilities to work with parallel programs
      * synchronization overhead
      * load balancing monitoring
• Available tools

    Tool        Vendor           Platforms
    VTune       Intel            NT
    Analyzer    Sun              Solaris
    SpeedShop   SGI              IRIX
    DCPI        DEC/Compaq/HP    Tru64, NT

Example: Sun Performance Analyzer (1 of 3)
• Profiling by function and module (no recompilation)

Example: Sun Performance Analyzer (2 of 3)
• Annotated source (recompilation with -g) and disassembly (no recompilation)

Example: Sun Performance Analyzer (3 of 3)
• Hardware counter overflow profiling

Process Monitoring Tools
• Tracing tools
  - Linux: strace (ltrace for dynamic library calls)
  - Solaris: truss (sotruss for dynamic library calls)
  - IRIX: par
  - Tru64: atom -tool ptrace
• procfs-based tools
  - pmap: prints the address space map of the process
  - pldd: lists the dynamic shared objects linked into the process (including ones explicitly attached using dlopen)
  - pstack: prints a stack trace for each LWP in the process
  - pflags: prints the /proc tracing flags
  - ptree: prints process trees containing the specified PIDs or users
  - pwait: waits for the specified processes to terminate
  - pcred: prints the credentials (effective, real, and saved UIDs and GIDs)

Example: Profiling System Calls
• truss on Solaris
  - Reports the number of system calls for a process and the associated time

System Monitoring Tools
• Tools for various UNIX platforms
  - vmstat, vm_stat, memvis: virtual memory and CPU statistics
  - mpstat, mpvis: parallel memory/CPU statistics
  - netstat, nfsstat, nfsvis: network status and statistics
  - iostat, dkvis: I/O statistics
  - sar: system activity report
  - top, prstat: list of most active processes
  - systat: system activity statistics
  - lockstat: kernel lock statistics
  - dkstat: file status information

vmstat - Virtual Memory Statistics
• Available on HP-UX, Tru64, Solaris, Linux, FreeBSD, etc.
• Example on Alpha/Tru64
[Figure: vmstat output on an idle Alpha/Tru64 system, with the memory usage, paging activity, and CPU usage columns marked]

Hardware Counter Measurements
• Hardware performance counters allow low-overhead run-time measurement of various hardware events
  - Cache references
  - Cache misses
  - Pipeline stalls
  - Branch misprediction statistics
  - D-TLB (Data Translation Lookaside Buffer) misses
  - I-TLB (Instruction Translation Lookaside Buffer) misses
  - Bus statistics, including DMA and cache coherency transactions on multiprocessor systems
  - Others
• Only a few events can be monitored at the same time

Code Instrumentation
• APIs can be used directly in the code
  - High-resolution timing of performance-critical parts of the program
  - Access to hardware performance counters
• Example (Solaris)

    if (cpc_take_sample(&before) == -1)
        exit(-1);
    for (k = 0; k < N-1; k++)
        sum = sum + a[k]*b[k];
    if (cpc_take_sample(&after) == -1)
        exit(-1);

• Counters are specified by setting the PERFEVENTS environment variable

    example% setenv PERFEVENTS pic0=Load_use,pic1=Load_use_RAW

• Works on UltraSPARC CPUs

Parallel Measurement Methodology
• Same guidelines as in the serial case
  - Parallel benchmarks should be representative of typical uses of applications
  - Benchmarking must be performed so as to ensure repeatable and consistent results
  - Probe effects and tool overheads should be minimized
• Specifics of parallel benchmarking
  - Parallelism vs. concurrency
  - Dedicated mode of benchmarking
  - Number of processors
  - Choice of timer and time criterion
  - Processor-set configuration
  - Processor allocation in clusters

Timing a Parallel Threaded Program
• timex can be used for parallel timing
• Note that the real time decreases, but the user time, which represents the combined CPU usage of all threads, stays constant

Specific Parallel Timers
• Timing MPI programs
  - The time or timex timers can be used in combination with MPI job submission commands (mprun, mpirun, etc.)
  - For timing portions of an MPI program, one can use the MPI_Wtime function, available in the Fortran, C, and C++ bindings (typically highly accurate)
• Threaded applications can use gethrvtime (Solaris, Tru64 with Solaris Compatibility Library)
  - Shows the user time on a per-thread basis
  - Can be used in combination with gethrtime, which returns the elapsed real (wall clock) time and can be called from each thread

Parallel System Monitoring
• mpstat: multiprocessor monitoring
[Figure: sample mpstat measurements, with columns marked for CPU ID, interrupts, cross-calls, context switches, thread migrations, mutex information, system calls, and CPU usage. The first snapshot shows the average since boot.]

Kernel Lock Statistics
• Tools that report kernel lock statistics
  - lockstat: Solaris, IRIX, AIX, Linux
  - lockinfo: Tru64
• Allow one to specify which events to monitor
  - spin on adaptive mutex
  - block on read access to rwlock due to waiting writers
• On some platforms, generate gprof-like output

    # lockstat -IWk example_tnf 24
    ...
    Profiling interrupt: 151649 events in 130.282 seconds (1164 events/sec)

    Count indv cuml rcnt     nsec Hottest CPU+PIL   Caller
    --------------------------------------------------------------------
    85698  57%  57% 1.00      188 cpu[12]           mutex_vector_enter
    14247   9%  66% 1.00      160 cpu[9]+10         disp_getwork
    12792   8%  74% 1.00      746 cpu[14]           mutex_tryenter
    10359   7%  81% 1.00      280 cpu[5]            (usermode)
     1951   1%  82% 1.00       59 cpu[1]            splx
     1648   1%  84% 1.00      365 cpu[5]+10         _resume_from_idle
     1510   1%  85% 1.00      490 cpu[9]+10         disp
     1259   1%  85% 1.00      255 cpu[15]+10        setfrontdq

Binding a Program to a Set of Processors
• Process monitoring can be difficult on multiprocessor systems due to process migration
• Single-threaded programs
  - One can bind to a processor
• Multithreaded programs
  - One can use processor sets
• Commands to set up and use processor sets
  - psrset (HP-UX, Solaris)
  - pset (IRIX)
  - pset_create, pset_assign_cpu, pset_assign_pid, etc. (Tru64)

Summary
• Monitoring performance is essential to optimization
  - If you cannot measure it, you cannot improve it
• It is important to select benchmarks carefully and identify the parameters to measure
• Select tools suitable for the task
  - System-wide or process-specific?
  - Parallel or serial?
  - Require recompilation or instrumentation?
  - Need source-level information?
  - Need hardware counter information?