Programming for Performance
Prof. Dr. Michael Gerndt
Lehrstuhl für Rechnertechnik und
Rechnerorganisation/Parallelrechnerarchitektur
Speedup Limited by Overheads
$$\text{Speedup} < \frac{\text{Sequential Work}}{\max(\text{Work} + \text{Synch Time} + \text{Comm Cost} + \text{Extra Work})}$$
Load Balance
• Limit on speedup:
$$\text{Speedup}(p) \le \frac{\text{Sequential Work}}{\text{Max Work on any Processor}}$$
• Work includes data access and other costs
• Not just equal work, but must be busy at same time
• Four parts to load balance
1. Identify enough concurrency
2. Decide how to manage it
3. Determine the granularity at which to exploit it
4. Reduce serialization
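The limit on this slide can be checked with a few lines of Python. This is only a sketch: the function name and the work figures are made up for illustration, and "work" is any consistent unit (operations, seconds).

```python
def load_balance_speedup_bound(sequential_work, work_per_processor):
    """Upper bound on speedup: sequential work / max work on any processor."""
    if not work_per_processor:
        raise ValueError("need at least one processor")
    return sequential_work / max(work_per_processor)


# Hypothetical example: 100 work units distributed over four processors.
balanced = [25, 25, 25, 25]
imbalanced = [40, 20, 20, 20]

bound_balanced = load_balance_speedup_bound(100, balanced)
bound_imbalanced = load_balance_speedup_bound(100, imbalanced)
print(bound_balanced, bound_imbalanced)
```

In the imbalanced case the most heavily loaded processor alone determines the bound, no matter how little the other three do.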
Reducing Synch Time
• Reduce wait time due to load imbalance
• Reduce synchronization overhead
Reducing Synchronization Overhead
• Event synchronization
• Reduce use of conservative synchronization
– e.g. point-to-point instead of barriers, or granularity of pt-to-pt
• But fine-grained synch more difficult to program, more synch
ops.
• Mutual exclusion
• Separate locks for separate data
– e.g. locking records in a database: lock per process, record, or
field
– lock per task in task queue, not per queue
– finer grain => less contention/serialization, more space, less
reuse
• Smaller, less frequent critical sections
– don’t do reading/testing in critical section, only modification
– e.g. searching for task to dequeue in task queue, building tree
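The two mutual-exclusion guidelines above (lock per task rather than per queue, and only the modification inside the critical section) might look as follows in Python. This is a toy sketch, not a production queue: `TaskQueue`, `try_dequeue` and `run_workers` are hypothetical names, and the unlocked read of the `taken` flag is only a hint that is re-checked under the lock.

```python
import threading


class TaskQueue:
    """Toy task list with one lock per task instead of one lock per queue."""

    def __init__(self, tasks):
        self._slots = [
            {"task": task, "taken": False, "lock": threading.Lock()}
            for task in tasks
        ]

    def try_dequeue(self, wanted):
        for slot in self._slots:
            # Searching/testing happens outside the critical section.
            if slot["taken"] or not wanted(slot["task"]):
                continue
            # Only the modification is protected, and only for this one task.
            with slot["lock"]:
                if not slot["taken"]:  # re-check: another thread may have won
                    slot["taken"] = True
                    return slot["task"]
        return None


def run_workers(queue, num_workers=4):
    taken = [[] for _ in range(num_workers)]

    def worker(out):
        while True:
            task = queue.try_dequeue(lambda t: t % 2 == 0)  # want even tasks
            if task is None:
                return
            out.append(task)

    threads = [threading.Thread(target=worker, args=(out,)) for out in taken]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return sorted(task for out in taken for task in out)


even_tasks = run_workers(TaskQueue(range(200)))
```

The price of the finer grain is visible in the data structure: every task carries its own lock object, which is the "more space" trade-off named on the slide.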
Implications of Load Balance/Synchronization
• Extends speedup limit expression to:
$$\text{Speedup}(p) \le \frac{\text{Sequential Work}}{\max(\text{Work} + \text{Synch Time})}$$
• Generally, responsibility of software
• Architecture can support task stealing and synchronization efficiently
• Fine-grained communication, low-overhead access to queues
– efficient support allows smaller tasks, better load balance
• Accessing shared data in the presence of task stealing
– need to access data of stolen tasks
– Hardware shared address space advantageous
Reducing Inherent Communication
• Communication is expensive!
• Measure: communication to computation ratio
• Focus here on inherent communication
• Determined by assignment of tasks to processes
• Actual communication can be greater
• Assign tasks that access same data to same process
• Solving communication and load balance NP-hard in
general case
• But simple heuristic solutions work well in practice
• Applications have structure!
$$\text{Speedup} < \frac{\text{Sequential Work}}{\max(\text{Work} + \text{Synch Time} + \text{Comm Cost})}$$
Reducing Extra Work
• Common sources of extra work:
• Computing a good partition
• Using redundant computation to avoid communication
• Task, data and process management overhead
– applications, languages, runtime systems, OS
• Imposing structure on communication
– coalescing messages, allowing effective naming
• Architectural implications:
• Reduce need by making communication and orchestration
efficient
$$\text{Speedup} < \frac{\text{Sequential Work}}{\max(\text{Work} + \text{Synch Wait Time} + \text{Comm Cost} + \text{Extra Work})}$$
A Lot Depends on Sizes
• Application parameters and no. of procs affect
inherent properties
• Load balance, communication, extra work, temporal and
spatial locality
• Memory hierarchy
• Interactions with organization parameters of extended
memory hierarchy affect artifactual communication and
performance
• Effects often dramatic, sometimes small: application-
dependent
A Lot Depends on Sizes
[Figure: speedup versus number of processors (1 to 31), two panels. Ocean: curves for N = 130, 258, 514 and 1,026. Barnes-Hut: curves for Origin with 16 K, 64 K and 512 K, and Challenge with 16 K and 512 K.]
Measuring Performance
• Absolute performance
• Performance = Work / Time
• Most important to end user
• Performance improvement due to parallelism
• Speedup(p) = Performance(p) / Performance(1)
• Both should be measured
• Work is determined by input configuration of the problem
• If work is fixed, can measure performance as 1/Time
– Or retain explicit work measure (e.g. transactions/sec, bonds/sec)
– Still w.r.t. a particular configuration, and what is measured is still time
• Speedup(p) =
$$\frac{\text{Time}(1)}{\text{Time}(p)} \quad \text{or} \quad \frac{\text{Operations Per Second}(p)}{\text{Operations Per Second}(1)}$$
Scaling: Why Worry?
• Fixed problem size is of limited usefulness
• Too small a problem:
• May be appropriate for small machine
• Parallelism overheads begin to dominate benefits for larger
machines
– Load imbalance
– Communication to computation ratio
• May even achieve slowdowns
• Doesn’t reflect real usage, and inappropriate for large
machines
– Can exaggerate benefits of architectural improvements,
especially when measured as percentage improvement in
performance
• Too large a problem
• Difficult to measure improvement (next)
Too Large a Problem
• Suppose problem realistically large for big machine
• May not “fit” in small machine
• Can’t run
• Thrashing to disk
• Working set doesn’t fit in cache
• Fits at some p, leading to superlinear speedup
• Finally, users want to scale problems as machines
grow
• Can help avoid these problems
Demonstrating Scaling Problems
• Small Ocean and big equation solver problems on SGI
Origin2000
[Figure: speedup versus number of processors (1 to 31), two panels, each with the ideal speedup for comparison. Left: Ocean, 258 x 258. Right: Grid solver, 12 K x 12 K (speedup axis up to 50).]
Questions in Scaling
• Under what constraints to scale the application?
• What are the appropriate metrics for performance
improvement?
– work is not fixed any more, so time not enough
• How should the application be scaled?
• Definitions:
• Scaling a machine: Can scale power in many ways
– Assume adding identical nodes, each bringing memory
• Problem size: Vector of input parameters, e.g. N = (n, θ, Δt)
– Determines work done
– Distinct from data set size and memory usage
– Start by assuming it’s only one parameter n, for simplicity
Scaling Models
• Problem constrained (PC)
• Memory constrained (MC)
• Time constrained (TC)
Problem Constrained Scaling
• User wants to solve same problem, only faster
• Video compression
• Computer graphics
• VLSI routing
• But limited when evaluating larger machines
$$\text{Speedup}_{PC}(p) = \frac{\text{Time}(1)}{\text{Time}(p)}$$
Time Constrained Scaling
• Execution time is kept fixed as system scales
• Example: User has fixed time to use machine or wait for result
• Performance = Work/Time as usual, and time is fixed, so
$$\text{Speedup}_{TC}(p) = \frac{\text{Work}(p)}{\text{Work}(1)}$$
• How to measure work(p)?
• Execution time on a single processor? (thrashing problems)
• The work metric should be easy to measure, ideally analytical.
• Should scale linearly with sequential complexity
– Or ideal speedup will not be linear in p (e.g. no. of rows, no. of points, no. of operations in matrix program)
• If we cannot find an intuitive application measure, as is often the case, measure execution time with ideal memory system on a uniprocessor.
Memory Constrained Scaling (1)
• Scale so memory usage per processor stays fixed
• Speedup cannot be defined as Time(1) / Time(p) for the scaled-up problem, since Time(1) is hard to measure and inappropriate
• Inserting Performance = Work/Time into the speedup formula gives
$$\text{Speedup}_{MC}(p) = \frac{\text{Work}(p)/\text{Time}(p)}{\text{Work}(1)/\text{Time}(1)} = \frac{\text{Work}(p)/\text{Work}(1)}{\text{Time}(p)/\text{Time}(1)} = \frac{\text{Increase in Work}}{\text{Increase in Time}}$$
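The speedup definitions of the three scaling models (PC, TC, MC) can be written down directly. The following is a minimal sketch; the function names and the sample measurements are assumptions chosen for illustration, not values from the lecture.

```python
def speedup_pc(time_1, time_p):
    """Problem constrained: same problem, so compare execution times."""
    return time_1 / time_p


def speedup_tc(work_1, work_p):
    """Time constrained: time is fixed, so compare the work that gets done."""
    return work_p / work_1


def speedup_mc(work_1, time_1, work_p, time_p):
    """Memory constrained: increase in work divided by increase in time."""
    return (work_p / work_1) / (time_p / time_1)


# Hypothetical measurements for p = 16 processors.
pc = speedup_pc(time_1=120, time_p=10)
tc = speedup_tc(work_1=1_000_000_000, work_p=12_000_000_000)
mc = speedup_mc(work_1=1_000_000_000, time_1=10,
                work_p=64_000_000_000, time_p=40)
print(pc, tc, mc)
```

In the MC example the work grows 64-fold while the time grows 4-fold, so the speedup is 16 even though the scaled run takes longer than the base run.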
Memory Constrained Scaling (2)
• MC scaling can lead to large increases in execution
time
• If work grows faster than linearly in memory usage
• e.g. matrix factorization with complexity n³
– 10,000-by-10,000 matrix takes 800 MB and 1 hour on a uniprocessor
– With 1,000 processors, can run 320K-by-320K matrix, but ideal
parallel time grows to 32 hours!
– With 10,000 processors, 100 hours ...
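The numbers in this example follow from two scaling laws: memory grows with n² and work with n³. A small sketch reproduces them, assuming memory per processor stays fixed and the parallel run is ideal (work divided evenly by p); the function name is hypothetical.

```python
import math


def mc_scaled_factorization(n_1, hours_1, p):
    """Scale an n x n factorization so that memory per processor stays fixed.

    Memory ~ n^2, so p times the memory allows n to grow by sqrt(p).
    Work ~ n^3, and the ideal parallel time is the scaled work divided by p.
    """
    n_p = n_1 * math.sqrt(p)
    work_growth = (n_p / n_1) ** 3
    ideal_hours = hours_1 * work_growth / p
    return n_p, ideal_hours


n_1000, hours_1000 = mc_scaled_factorization(10_000, 1.0, 1_000)
n_10000, hours_10000 = mc_scaled_factorization(10_000, 1.0, 10_000)
print(round(n_1000), round(hours_1000, 1))
print(round(n_10000), round(hours_10000, 1))
```

For 1,000 processors this gives a matrix dimension of about 316,000 (rounded to 320K on the slide) and roughly 32 hours; in general the ideal time grows with the square root of p.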
Scaling Down Problem Parameters
• Some parameters don’t affect parallel performance
much, but do affect runtime, and can be scaled down
• Common example is no. of time-steps in many scientific
applications
– need a few to allow settling down, but don’t need more
– may need to omit cold-start when recording time and statistics
• First look for such parameters
• But many application parameters affect key
characteristics
• Scaling them down requires scaling down no. of processors
too
• Otherwise can obtain highly unrepresentative behavior
Difficulties in Scaling N, p Representatively
• Want to preserve many aspects of full-scale scenario
• Distribution of time in different phases
• Key behavioral characteristics
• Scaling relationships among application parameters
• Contention and communication patterns
• Can’t really hope for full representativeness, but can
• Cover range of realistic operating points
• Avoid unrealistic scenarios
• Gain insights and estimates of performance
Performance Analysis Process
Measurement
Analysis
Ranking
Refinement
Coding
Performance Analysis
Production
Program Tuning
Performance Prediction and Benchmarking
• Performance analysis determines the performance on
a given machine.
• Performance prediction evaluates programs
for a hypothetical machine. It is based on:
• runtime data of an actual execution
• machine model of the target machine
• analytical techniques
• simulation techniques
• Benchmarking determines the performance of a
computer system on the basis of a set of typical
applications.
Overhead Analysis
• How to decide whether a code performs well:
• Comparison of measured MFLOPS with peak performance
• Comparison with a sequential version
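• Estimate distance to ideal time via overhead classes
– t_mem
– t_comm
– t_sync
– t_red
– ...
$$\text{speedup}(p) = \frac{t_s}{t_p}$$
[Figure: speedup versus #processors. The gap between the ideal and the measured speedup is split into the overhead classes t_mem, t_comm and t_red.]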
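One way to read the overhead classes is as additive terms on top of the ideal parallel time t_s / p. The sketch below makes that assumption explicit (the slide itself does not state the model), and all timings are invented.

```python
def overhead_report(t_s, p, overheads):
    """Split parallel time into ideal time plus per-class overheads.

    Assumption for this sketch: t_p = t_s / p + sum of all overhead classes.
    """
    ideal_time = t_s / p
    t_p = ideal_time + sum(overheads.values())
    return {
        "ideal_time": ideal_time,
        "parallel_time": t_p,
        "speedup": t_s / t_p,
        "share_of_parallel_time": {
            name: value / t_p for name, value in overheads.items()
        },
    }


# Hypothetical timings in seconds for p = 8.
report = overhead_report(
    t_s=80.0,
    p=8,
    overheads={"t_mem": 2.0, "t_comm": 5.0, "t_sync": 1.0, "t_red": 2.0},
)
print(report["speedup"], report["share_of_parallel_time"])
```

Ranking the shares shows where tuning effort pays off first; here communication accounts for a quarter of the parallel time.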
The Basics
• Successful tuning is a combination of
• right algorithms and libraries
• compiler flags and directives
• thinking!
• Measurement is better than guessing:
• to determine performance problems
• to validate tuning decisions and optimization
• Measurement should be repeated after each
significant code modification and optimization
The Basics
• Do I have a performance problem at all?
• Compare MFlops/MOps to typical rate
• Speedup measurements
• What are the hot code regions?
• Flat profiling
• Is there a bottleneck in those regions?
• Single node: Hardware counter profiling
• Parallel: Synchronization and communication analysis profiling
• Does the bottleneck vary over time or processor space?
• Profiling individual processes and/or threads
• Tracing
• Does the code behave similarly for different configurations?
• Analyze runs with different processor counts
• Analyze different input configurations
Performance Analysis
[Figure: cycle of Instrumentation, Execution, Analysis and Refinement. Labels on the cycle: Current Hypotheses, Requirements, Performance Data, Detected Bottlenecks.]
Instr: Dat → ISPEC
Info: Hyp → Dat
Prove: Hyp × Dat → {T,F}
Refine: Hyp → P(Hyp)
Performance Measurement Techniques
• Event model of the execution
• Events occur at a processor at a specific point in time
• Events belong to event types
– clock cycles
– cache misses
– remote references
– start of a send operation
– ...
• Profiling: Recording accumulated performance data for
events
• Sampling: Statistical approach
• Instrumentation: Precise measurement
• Tracing: Recording performance data of individual
events
Statistical Sampling
[Figure: a program with the functions Main, Asterix and Obelix runs on a CPU that provides a program counter, cycle counter, cache miss counter and flop counter. An interrupt every 10 ms adds the counter values to the Function Table entry (Main, Asterix or Obelix) of the function currently executing, then resets the counters.]
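The sampling scheme can be simulated deterministically. The sketch below is not a real profiler: it takes a made-up timeline of which function runs for how long, fires an "interrupt" every 10 ms, and charges the whole interval to the function running at that instant. Function names follow the slide; the durations are invented.

```python
from collections import Counter


def sample_profile(timeline, interval_ms=10):
    """timeline: list of (function_name, duration_ms) in execution order."""
    # End time of each timeline segment, paired with the function running in it.
    boundaries = []
    elapsed = 0
    for name, duration in timeline:
        elapsed += duration
        boundaries.append((elapsed, name))

    function_table = Counter()
    index = 0
    for tick in range(interval_ms, elapsed + 1, interval_ms):
        # Advance to the segment that contains this interrupt time.
        while boundaries[index][0] < tick:
            index += 1
        # Charge the full sampling interval to the interrupted function.
        function_table[boundaries[index][1]] += interval_ms
    return function_table


timeline = [("Main", 30), ("Asterix", 55), ("Obelix", 4), ("Main", 11)]
profile = sample_profile(timeline)
print(dict(profile))
```

The true times are 41 ms for Main, 55 ms for Asterix and 4 ms for Obelix. The sampled table reports 50, 50 and 0: short-running functions can be missed entirely, which is why sampling is a statistical approach.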
Instrumentation and Monitoring
...
Function Obelix (...)
  call monitor("Obelix", "enter")
  ...
  call monitor("Obelix", "exit")
end Obelix
...

monitor(routine, location)
  if ("enter") then
  else
  end if

[Figure: monitor reads the CPU's cache miss counter and updates the Function Table entries for Main, Asterix and Obelix (marked - and + in the diagram). Values shown in the table: 10200, 1300, 1490.]
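A Python decorator can play the role of the inserted monitor calls. The sketch assumes one plausible reading of the diagram: the counter value is subtracted on enter and added on exit, so each table entry accumulates the misses that occurred inside the routine. `Monitor`, `instrument` and the fake counter are hypothetical names, and the counter is a plain variable rather than a hardware register.

```python
import functools


class Monitor:
    """Accumulates a counter difference per routine in a function table."""

    def __init__(self, read_counter):
        self.read_counter = read_counter
        self.function_table = {}

    def __call__(self, routine, location):
        value = self.read_counter()
        current = self.function_table.get(routine, 0)
        if location == "enter":
            self.function_table[routine] = current - value
        else:
            self.function_table[routine] = current + value


def instrument(monitor):
    """Wrap a function with monitor calls at entry and exit."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            monitor(func.__name__, "enter")
            try:
                return func(*args, **kwargs)
            finally:
                monitor(func.__name__, "exit")
        return wrapper
    return decorator


cache_misses = [0]  # stand-in for a hardware cache miss counter
monitor = Monitor(lambda: cache_misses[0])


@instrument(monitor)
def obelix(n):
    cache_misses[0] += n  # pretend the body causes n cache misses


cache_misses[0] += 7  # misses outside obelix are not attributed to it
obelix(5)
obelix(3)
print(monitor.function_table)
```

Unlike sampling, every call is measured exactly, at the cost of two monitor calls per invocation.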
Instrumentation Techniques
• Source code instrumentation
• done by the compiler, source-to-source tool, or manually
– portability
– link back to source code easy
– re-compile necessary when instrumentation is changed
– difficult to instrument mixed-code applications
– cannot instrument system or 3rd party libraries or executables
• Binary instrumentation
• "patching" the executable to insert hooks (like a debugger)
– inverse pros/cons
• Offline
• Online
Instrumentation Tools
• Standard compilers
• Add callbacks for profiling functions
• Typically at function level
• Be careful of overhead for frequently called functions
• gcc, for example, adds calls if -finstrument-functions
is given.
• OPARI
• Jülich Supercomputing Center
• OpenMP for C and FORTRAN
• Source-level instrumentation (see exercise)
• PMPI interface
• Library interposition
• Link own library before real library, e.g. frequently used for
own malloc function.
Instrumentation Tools
• TAU Generic Instrumenter
• Parsers for C++, FORTRAN, UPC,…
• Creation of PDT (Program Database Toolkit)
• Approach
– Specify which string to insert before and after certain regions
– Use provided variables to access file and line information
• Limited program region types
• tau.oregon.edu
• OMPT
• Proposal for profiling API
• Based on callbacks
Source Code Transformation Tools
• Rose
• rosecompiler.org, LLNL
• LLVM
• Language independent code optimizer
and code generator
• http://www.llvm.org/, Univ. Illinois
• Clang C frontend for LLVM, http://clang.llvm.org/
• C/C++ and Objective C/C++
• Open64
• www.open64.net
• Compiler infrastructure based originally on the SGI compiler.
• Interprocedural and loop optimizations
Binary Instrumentation Tools
• Dyninst
• Dynamic instrumentation on binary level
• Developed in the context of the Paradyn project
• Univ. Wisconsin-Madison, Maryland
• Bart Miller, Jeff Hollingsworth
• Intel Pin
• Intel for x86
• Online instrumentation of binaries
• Valgrind
• Dynamic instrumentation
• Based on emulation of x86 machine instructions
Tracing
...
Function Obelix (...)
  call monitor("Obelix", "enter")
  ...
  call monitor("Obelix", "exit")
end Obelix
...

MPI Library
Function MPI_send (...)
  call monitor("MPI_send", "enter")
  ...
  call PMPI_send(...)
  call monitor("MPI_send", "exit")
end MPI_send
...

[Figure: Process 0, Process 1, ..., Process n-1 each write their own trace: Trace P0, Trace P1, ..., Trace P n-1.]
Trace P0
10.4 P0 Obelix enter
10.6 P0 MPI_Send enter
10.8 P0 MPI_Send exit
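The wrapper pattern on this slide (record enter, forward to the real routine, record exit) can be sketched in Python. Everything here is illustrative: `Tracer`, `traced` and `real_send` are hypothetical stand-ins, not MPI or PMPI calls, and a logical counter replaces the wall clock so the output is reproducible.

```python
import itertools


class Tracer:
    """Collects one trace record per event for a single process."""

    def __init__(self, process, clock):
        self.process = process
        self.clock = clock
        self.records = []

    def monitor(self, routine, location):
        self.records.append((self.clock(), self.process, routine, location))


def traced(tracer, name, real_function):
    """Return a wrapper that logs enter/exit around the real function."""
    def wrapper(*args, **kwargs):
        tracer.monitor(name, "enter")
        try:
            return real_function(*args, **kwargs)
        finally:
            tracer.monitor(name, "exit")
    return wrapper


ticks = itertools.count(1)  # logical clock instead of real timestamps
tracer = Tracer("P0", clock=lambda: next(ticks))


def real_send(message):
    """Hypothetical stand-in for the library's real send routine."""
    return len(message)


send = traced(tracer, "send", real_send)


def obelix_body():
    return send("hello")


obelix = traced(tracer, "Obelix", obelix_body)
result = obelix()
for record in tracer.records:
    print(*record)
```

The application keeps calling `send`; only the binding changes, which is the idea behind interposing a wrapper library in front of the real one.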
Merging
[Figure: a merge process combines Trace P0, Trace P1, ..., Trace P n-1 into a single trace for P0 - Pn-1.]
P0 - Pn-1
10.4 P0 Obelix enter
10.5 P1 Obelix enter
10.6 P0 MPI_Send enter
10.7 P1 MPI_Recv enter
10.8 P0 MPI_Send exit
11.0 P1 MPI_Recv exit
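Because each per-process trace is already sorted by time, merging is a k-way merge. A minimal sketch with the records from this slide, using `heapq.merge` from the Python standard library; it assumes the timestamps of all processes come from synchronized clocks, which real tools have to ensure or correct for.

```python
import heapq

trace_p0 = [
    (10.4, "P0", "Obelix", "enter"),
    (10.6, "P0", "MPI_Send", "enter"),
    (10.8, "P0", "MPI_Send", "exit"),
]
trace_p1 = [
    (10.5, "P1", "Obelix", "enter"),
    (10.7, "P1", "MPI_Recv", "enter"),
    (11.0, "P1", "MPI_Recv", "exit"),
]


def merge_traces(*traces):
    """Merge time-sorted per-process traces into one time-sorted trace."""
    return list(heapq.merge(*traces, key=lambda record: record[0]))


merged = merge_traces(trace_p0, trace_p1)
for timestamp, process, routine, location in merged:
    print(timestamp, process, routine, location)
```

`heapq.merge` reads its inputs lazily, so the same approach works on trace files too large to hold in memory if the records are streamed instead of collected in a list.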
Visualization of Dynamic Behaviour
P0 - Pn-1
10.4 P0 Obelix enter
10.5 P1 Obelix enter
10.6 P0 MPI_Send enter
10.7 P1 MPI_Recv enter
10.8 P0 MPI_Send exit
11.0 P1 MPI_Recv exit
[Figure: Timeline Visualization. One row per process (P0, P1) over the time axis 10.4 to 11.0. Each row shows the intervals spent in Obelix, MPI_Send (P0) and MPI_Recv (P1).]
Profiling vs Tracing
• Profiling
• recording summary information (time, #calls, #misses, ...)
• about program entities (functions, objects, basic blocks)
• very good for quick, low cost overview
• points out potential bottlenecks
• implemented through sampling or instrumentation
• moderate amount of performance data
• Tracing
• recording information about events
• trace record typically consists of timestamp, process id, ...
• output is a trace file with trace records sorted by time
• can be used to reconstruct the dynamic behavior
• creates huge amounts of data
• needs selective instrumentation
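The relation between the two approaches can be shown by reducing a trace to a profile: the trace keeps every event, the profile keeps only per-routine totals. A sketch using the merged records from the earlier slides; `profile_from_trace` is a hypothetical helper, and times are inclusive (time in callees counts towards the caller).

```python
from collections import defaultdict


def profile_from_trace(records):
    """Reduce (timestamp, process, routine, location) records to a profile."""
    stacks = defaultdict(list)  # per process: open calls as (routine, enter time)
    profile = defaultdict(lambda: {"calls": 0, "time": 0.0})
    for timestamp, process, routine, location in records:
        if location == "enter":
            stacks[process].append((routine, timestamp))
        else:
            entered_routine, entered_at = stacks[process].pop()
            if entered_routine != routine:
                raise ValueError("unbalanced enter/exit in trace")
            profile[routine]["calls"] += 1
            profile[routine]["time"] += timestamp - entered_at
    return dict(profile)


trace = [
    (10.4, "P0", "Obelix", "enter"),
    (10.5, "P1", "Obelix", "enter"),
    (10.6, "P0", "MPI_Send", "enter"),
    (10.7, "P1", "MPI_Recv", "enter"),
    (10.8, "P0", "MPI_Send", "exit"),
    (11.0, "P1", "MPI_Recv", "exit"),
]
summary = profile_from_trace(trace)
print(summary)
```

Obelix does not appear in the summary because its exit events are not part of this trace excerpt. The reduction is one-way: the order of events and the overlap of MPI_Send and MPI_Recv cannot be recovered from the profile.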
Program Monitors
• Each PA tool has its own monitor
• Score-P
• In recent years, Score-P was developed by the tool groups of
Scalasca, Vampir and Periscope.
• Provides support for
– MPI, OpenMP, CUDA
– Profiling and tracing
– Callpath profiles
– Online Access Interface
• Cube 4 profiling data format
• OTF2 (Open Trace Format)
Performance Analysis
[Figure: cycle of Instrumentation, Execution, Analysis and Refinement. Labels on the cycle: Current Hypotheses, Requirements, Performance Data, Detected Bottlenecks.]
Instr: Dat → ISPEC
Info: Hyp → Dat
Prove: Hyp × Dat → {T,F}
Refine: Hyp → P(Hyp)
Common Performance Problems with MPI
• Single node performance
• Excessive number of 2nd-level cache misses
• Low number of issued instructions
• IO
• High data volume
• Sequential IO due to IO subsystem or sequentialization in the
program
• Excessive communication
• Frequent communication
• High data volume
Common Performance Problems with MPI
• Frequent synchronization
• Reduction operations
• Barrier operations
• Load balancing
• Wrong data decomposition
• Dynamically changing load
Common Performance Problems with SM
• Single node performance
• ...
• IO
• ...
• Excessive communication
• Large number of remote memory accesses
• False sharing
• False data mapping
• Frequent synchronization
• Implicit synchronization of parallel constructs
• Barriers, locks, ...
• Load balancing
• Uneven scheduling of parallel loops
• Uneven work in parallel sections
Analysis Techniques
• Offline vs Online Analysis
• Offline: first generate data then analyze
• Online: generate and analyze data while application is running
• Online requires automation → limited to standard bottlenecks
• Offline suffers more from size of measurement information
• Three techniques to support user in analysis
• Source-level presentation of performance data
• Graphical visualization
• Ranking of high-level performance properties
Statistical Profiling based Tools
• Gprof – GNU profiling tool
• Time profiling
• Inclusive and exclusive time
• Flat profile
• Call graph profile
• Based on instrumentation of
function entry and exit
• Records where the call is
coming from.
Statistical Profiling based Tools
• Allinea MAP
• Annotations to the application source code.
• Based on time series of profiles
• For parallel applications it indicates outlying processes.
Profiling Tools based on Instrumentation
• TAU (Tuning and Analysis
Utilities)
• Measurements are based on
instrumentation
• Visualization via paraprof
– Graphical display for aggregated and
per node, context, or thread
– Topology views of performance data
• Scalasca
• Cube performance visualizer
• Profiles based on Score-P
• Call-path profiling
Trace-based Analysis Tools
• Vampir
• Graphical views presenting
events and summary data
• Flexible scrolling and
zooming features
• OTF2 trace format
generated by Score-P
• Commercial license
• www.vampir.eu
Trace-based Analysis Tools
• Paraver
• Barcelona Supercomputing
Center
• MPI, OMP, pthreads, OmpSs,
CUDA
• http://www.bsc.es/computer-sciences/performance-tools/paraver
• Clustering of program phases, i.e. segments between MPI calls
• More recently: tracking of clusters in time series of profiles, based on
object tracking
Automatic Analysis Tools
• Paradyn
• University of Wisconsin Madison
• Periscope
• TU München
• Automatic detection of formalized performance properties
• Profile data
• Distributed online tool
• Scalasca
• Search for performance patterns in traces
• Post-mortem on parallel resources of the application
• Visualization of patterns in CUBE