Understanding applications with Paraver
Transcript of Understanding applications with Paraver
![Page 1: Understanding applications with Paraver](https://reader030.fdocuments.in/reader030/viewer/2022020621/61e78c2765866215c37a8986/html5/thumbnails/1.jpg)
www.bsc.es
Understanding applications with Paraver
Judit Gimenez [email protected]
NERSC - Berkeley, August 2016
![Page 2: Understanding applications with Paraver](https://reader030.fdocuments.in/reader030/viewer/2022020621/61e78c2765866215c37a8986/html5/thumbnails/2.jpg)
Humans are visual creatures
Films or books? PROCESS – Two hours vs. days (months)
Memorizing a deck of playing cards STORE – Each card translated to an image (person, action, location)
Our brain loves pattern recognition IDENTIFY – What do you see on the pictures?
![Page 3: Understanding applications with Paraver](https://reader030.fdocuments.in/reader030/viewer/2022020621/61e78c2765866215c37a8986/html5/thumbnails/3.jpg)
Our Tools
Since 1991 Based on traces Open Source
– http://www.bsc.es/paraver
Core tools: – Paraver (paramedir) – offline trace analysis – Dimemas – message passing simulator – Extrae – instrumentation
Focus – Detail, variability, flexibility – Behavioral structure vs. syntactic structure – Intelligence: Performance Analytics
![Page 4: Understanding applications with Paraver](https://reader030.fdocuments.in/reader030/viewer/2022020621/61e78c2765866215c37a8986/html5/thumbnails/4.jpg)
www.bsc.es
Paraver
![Page 5: Understanding applications with Paraver](https://reader030.fdocuments.in/reader030/viewer/2022020621/61e78c2765866215c37a8986/html5/thumbnails/5.jpg)
Paraver – Performance data browser
Timelines
Raw data
2/3D tables (Statistics)
Goal = Flexibility No semantics
Programmable
Comparative analyses Multiple traces Synchronize scales
+ trace manipulation Trace visualization/analysis
![Page 6: Understanding applications with Paraver](https://reader030.fdocuments.in/reader030/viewer/2022020621/61e78c2765866215c37a8986/html5/thumbnails/6.jpg)
Timelines
Each window displays one view – Piecewise constant function of time
Types of functions – Categorical
• State, user function, outlined routine
– Logical • In specific user function, In MPI call, In long MPI call
– Numerical • IPC, L2 miss ratio, Duration of MPI call, duration of computation
burst
[ ] <⊂∈ nNSi ,n0,
[ )1,,)( +∈= iii ttiSts
RSi ∈
{ }1 ,0 ∈iS
![Page 7: Understanding applications with Paraver](https://reader030.fdocuments.in/reader030/viewer/2022020621/61e78c2765866215c37a8986/html5/thumbnails/7.jpg)
Tables: Profiles, histograms, correlations
From timelines to tables MPI calls profile
Useful Duration
Histogram Useful Duration
MPI calls
![Page 8: Understanding applications with Paraver](https://reader030.fdocuments.in/reader030/viewer/2022020621/61e78c2765866215c37a8986/html5/thumbnails/8.jpg)
Analyzing variability through histograms and timelines
Useful Duration
Instructions
IPC
L2 miss ratio
![Page 9: Understanding applications with Paraver](https://reader030.fdocuments.in/reader030/viewer/2022020621/61e78c2765866215c37a8986/html5/thumbnails/9.jpg)
By the way: six months later ….
Useful Duration
Instructions
IPC
L2 miss ratio
Analyzing variability through histograms and timelines
![Page 10: Understanding applications with Paraver](https://reader030.fdocuments.in/reader030/viewer/2022020621/61e78c2765866215c37a8986/html5/thumbnails/10.jpg)
Variability … is everywhere CESM: 16 processes, 2 simulated days
Histogram useful computation duration shows high variability
How is it distributed? Dynamic imbalance
– In space and time – Day and night. – Season ? ☺
![Page 11: Understanding applications with Paraver](https://reader030.fdocuments.in/reader030/viewer/2022020621/61e78c2765866215c37a8986/html5/thumbnails/11.jpg)
Data handling/summarization capability – Filtering
• Subset of records in original trace • By duration, type, value,… • Filtered trace IS a paraver trace and can
be analysed with the same cfgs (as long as needed data kept)
– Cutting • All records in a given time interval • Only some processes
– Software counters • Summarized values computed from those
in the original trace emitted as new even types
• #MPI calls, total hardware count,…
570 s 2.2 GB
MPI, HWC
WRF-NMM Peninsula 4km 128 procs
570 s 5 MB
4.6 s 36.5 MB
Trace manipulation
![Page 12: Understanding applications with Paraver](https://reader030.fdocuments.in/reader030/viewer/2022020621/61e78c2765866215c37a8986/html5/thumbnails/12.jpg)
www.bsc.es
Dimemas
![Page 13: Understanding applications with Paraver](https://reader030.fdocuments.in/reader030/viewer/2022020621/61e78c2765866215c37a8986/html5/thumbnails/13.jpg)
Dimemas: Coarse grain, Trace driven simulation
Simulation: Highly non linear model – MPI protocols, resource contention…
Parametric sweeps – On abstract architectures – On application computational regions
What if analysis – Ideal machine (instantaneous network) – Estimating impact of ports to MPI+OpenMP/CUDA/… – Should I use asynchronous communications? – Are all parts equally sensitive to network?
MPI sanity check – Modeling nominal
Paraver – Dimemas tandem – Analysis and prediction – What-if from selected time window
Detailed feedback on simulation (trace)
CPU
Local Memory
B
CPU
CPU
L
CPU
CPU
CPU Local
Memory
L
CPU
CPU
CPU Local
Memory
L
Impact of BW (L=8; B=0)
0.00
0.20
0.40
0.60
0.80
1.00
1.20
1 4 16 64 256 1024
Effic
ienc
y
NMM 512ARW 512NMM 256ARW 256NMM 128ARW 128
![Page 14: Understanding applications with Paraver](https://reader030.fdocuments.in/reader030/viewer/2022020621/61e78c2765866215c37a8986/html5/thumbnails/14.jpg)
Network sensitivity
MPIRE 32 tasks, no network contention
L=5µs–BW=1GB/s L=1000µs–BW=1GB/s
L=5µs–BW=100MB/s
Allwindowssamescale
![Page 15: Understanding applications with Paraver](https://reader030.fdocuments.in/reader030/viewer/2022020621/61e78c2765866215c37a8986/html5/thumbnails/15.jpg)
Would I will benefit from asynchronous communications?
SPECFEM3D
Courtesy Dimitri Komatitsch
Real
Ideal
Predic2onMN
Predic2on5MB/s
Predic2on1MB/s
Predic2on10MB/s
Predic2on100MB/s
![Page 16: Understanding applications with Paraver](https://reader030.fdocuments.in/reader030/viewer/2022020621/61e78c2765866215c37a8986/html5/thumbnails/16.jpg)
Ideal machine
The impossible machine: BW = ∞, L = 0 Actually describes/characterizes Intrinsic application behavior
– Load balance problems? – Dependence problems?
waitall
sendrec
alltoall
Real run
Ideal network
Allgather +
sendrecv allreduce GADGET @ Nehalem cluster
256 processes
Impact on practical machines?
![Page 17: Understanding applications with Paraver](https://reader030.fdocuments.in/reader030/viewer/2022020621/61e78c2765866215c37a8986/html5/thumbnails/17.jpg)
Impact of architectural parameters
Ideal speeding up ALL the computation bursts by the CPUratio factor – The more processes the less speedup (higher impact of bandwidth
limitations) !! 64128
256
512
1024
2048
4096
8192
16384
141664
0
20
40
60
80
100
120
140
Bandwidth (MB/s)
CPUratio
Speedup
64128
256
512
1024
2048
4096
8192
16384
1
8
640
20
40
60
80
100
120
140
Bandwidth (MB/s)
CPUratio
Speedup
64128
256
512
1024
2048
4096
8192
16384
1
8
640
20
40
60
80
100
120
140
Bandwidth (MB/s)
CPUratio
Speedup
64 procs
128 procs
256 procs
GADGET
![Page 18: Understanding applications with Paraver](https://reader030.fdocuments.in/reader030/viewer/2022020621/61e78c2765866215c37a8986/html5/thumbnails/18.jpg)
Profile
0
5
10
15
20
25
30
35
40
1 2 3 4 5 6 7 8 9 10 11 12 13code region
% o
f com
puta
tion
time
Hybrid parallelization
Hybrid/accelerator parallelization – Speed-up SELECTED regions
by the CPUratio factor
% C
ompu
tatio
n Ti
me
64128
256
512
1024
2048
4096
8192
16384
141664
0
5
10
15
20
Bandwdith (MB/s)
CPUratio
Speedup
64128
256
512
1024
2048
4096
8192
16384
1
8
64
0
5
10
15
20
Bandwdith (MB/s)
CPUratio
Speedup
64128
256
512
1024
2048
4096
8192
16384
1
8
64
0
5
10
15
20
Bandwdith (MB/s)
CPUratio
Speedup
93.67% 97.49% 99.11%
Code regions
128 procs.
(Previous slide: speedups up to 100x) 18
![Page 19: Understanding applications with Paraver](https://reader030.fdocuments.in/reader030/viewer/2022020621/61e78c2765866215c37a8986/html5/thumbnails/19.jpg)
www.bsc.es
Models and Extrapolation
![Page 20: Understanding applications with Paraver](https://reader030.fdocuments.in/reader030/viewer/2022020621/61e78c2765866215c37a8986/html5/thumbnails/20.jpg)
Parallel efficiency model
Parallel efficiency = LB eff * Comm eff
Computa2on Communica2on
MPI_Recv
MPI_Send Do not blame MPI
LB Comm
![Page 21: Understanding applications with Paraver](https://reader030.fdocuments.in/reader030/viewer/2022020621/61e78c2765866215c37a8986/html5/thumbnails/21.jpg)
Parallel efficiency refinement: LB * µLB * Transfer
Serializations / dependences (µLB) Dimemas ideal network " Transfer (efficiency) = 1
Computa2on Communica2on
𝑳𝑩=1 MPI_SendMPI_Recv
MPI_Send MPI_Recv
MPI_Send
MPI_Send
MPI_Recv
MPI_Recv
Do not blame MPI
LB µLB Transfer
![Page 22: Understanding applications with Paraver](https://reader030.fdocuments.in/reader030/viewer/2022020621/61e78c2765866215c37a8986/html5/thumbnails/22.jpg)
Why scaling?
0
5
10
15
20
0 50 100 150 200 250 300 350 400
speed up
Good scalability !! Should we be happy?
CG-POP mpi2s1D - 180x120
TrfSerLB **=η
0.5
0.6
0.7
0.8
0.9
1
1.1
0 100 200 300 400
Parallel eff
LB
uLB
transfer
0.4
0.6
0.8
1
1.2
1.4
1.6
0 100 200 300 400
Efficiency
Parallel eff
instr. eff
IPC eff
IPCinstr ηηηη **=
![Page 23: Understanding applications with Paraver](https://reader030.fdocuments.in/reader030/viewer/2022020621/61e78c2765866215c37a8986/html5/thumbnails/23.jpg)
AVBP (strong scale) Input: runs 512 – 4096
PffmetricAmdahl
metricmetricfit *)1(
0
−+=
Strong scale " small computations betweeen MPI calls become more and more important
![Page 24: Understanding applications with Paraver](https://reader030.fdocuments.in/reader030/viewer/2022020621/61e78c2765866215c37a8986/html5/thumbnails/24.jpg)
TURBORVB(strong scale) Input: runs 512 – 4096
3/10
*)1( PffmetricAmdahl
metricmetricfit −+=
Network contention can be reduced balancing / limiting the random selection within a node
1.00
1.20
1.40
1.60
1.80
2.00
100 1000 10000 100000 1000000 Avg.
iter
atio
n tim
e (s
)
# MPI ranks
meassured
![Page 25: Understanding applications with Paraver](https://reader030.fdocuments.in/reader030/viewer/2022020621/61e78c2765866215c37a8986/html5/thumbnails/25.jpg)
www.bsc.es
Clustering
![Page 26: Understanding applications with Paraver](https://reader030.fdocuments.in/reader030/viewer/2022020621/61e78c2765866215c37a8986/html5/thumbnails/26.jpg)
Using Clustering to identify structure
IPC
Completed Instructions
Automatic Detection of Parallel Applications Computation Phases (IPDPS 2009)
![Page 27: Understanding applications with Paraver](https://reader030.fdocuments.in/reader030/viewer/2022020621/61e78c2765866215c37a8986/html5/thumbnails/27.jpg)
19% gain
What should I improve?
13% gain
… we increase the IPC of Cluster1?
What if ….
… we balance Clusters 1 & 2?
PEPC
![Page 28: Understanding applications with Paraver](https://reader030.fdocuments.in/reader030/viewer/2022020621/61e78c2765866215c37a8986/html5/thumbnails/28.jpg)
Tracking scability through clustering
64 128 192
256 384 512
64 128 192
256 384 512
OpenMX (strong scale from 64 to 512 tasks)
![Page 29: Understanding applications with Paraver](https://reader030.fdocuments.in/reader030/viewer/2022020621/61e78c2765866215c37a8986/html5/thumbnails/29.jpg)
www.bsc.es
Folding
![Page 30: Understanding applications with Paraver](https://reader030.fdocuments.in/reader030/viewer/2022020621/61e78c2765866215c37a8986/html5/thumbnails/30.jpg)
Folding: Detailed metrics evolution
• Performance of a sequential region = 2000 MIPS
Is it good enough?
Is it easy to improve?
![Page 31: Understanding applications with Paraver](https://reader030.fdocuments.in/reader030/viewer/2022020621/61e78c2765866215c37a8986/html5/thumbnails/31.jpg)
Folding: Instantaneous CPI stack
MRGENESIS
• Trivial fix.(loop interchange) • Easy to locate?
• Next step?
• Availability of CPI stack models
for production processors? • Provided by
manufacturers?
![Page 32: Understanding applications with Paraver](https://reader030.fdocuments.in/reader030/viewer/2022020621/61e78c2765866215c37a8986/html5/thumbnails/32.jpg)
Instantaneous metrics with minimum overhead – Combine instrumentation and sampling
• Instrumentation delimits regions (routines, loops, …) • Sampling exposes progression within a region
– Captures performance counters and call-stack references
Folding
Iteration #1 Iteration #2 Iteration #3
Synth Iteration
Initialization Finalization
![Page 33: Understanding applications with Paraver](https://reader030.fdocuments.in/reader030/viewer/2022020621/61e78c2765866215c37a8986/html5/thumbnails/33.jpg)
“Blind” optimization
From folded samples of a few levels to timeline structure of “relevant” routines
Recommendation without access to source code
Harald Servat et al. “Bio-inspired call-stack reconstruction for performance analysis” Submitted, 2015
![Page 34: Understanding applications with Paraver](https://reader030.fdocuments.in/reader030/viewer/2022020621/61e78c2765866215c37a8986/html5/thumbnails/34.jpg)
17.20 M instructions ~ 1000 MIPS
24.92 M instructions ~ 1100 MIPS
32.53 M instructions ~ 1200 MIPS
MPI call
MPI call
Unbalanced MPI application – Same code – Different duration – Different performance
CG-POP multicore MN3 study
![Page 35: Understanding applications with Paraver](https://reader030.fdocuments.in/reader030/viewer/2022020621/61e78c2765866215c37a8986/html5/thumbnails/35.jpg)
Address space references
Based on PEBS events • Loads: address, cost in cycles, level providing the data • Stores: only address
35
Modified Stream for (i = 0; i < NITERS; i++) { Extrae_event (BEGIN); for (j = 0; j < N; j++) c[j] = a[j]; for (j = 0; j < N/8; j++) b[j] = s*c[random()%j]; for (j = 0; j < N; j++) c[j] = a[j]+b[j]; for (j = 0; j < N; j++) a[j] = b[j]+s*c[j]; Extrae_event (END); }
![Page 36: Understanding applications with Paraver](https://reader030.fdocuments.in/reader030/viewer/2022020621/61e78c2765866215c37a8986/html5/thumbnails/36.jpg)
Lulesh memory access
Architecture impact Stalls distribution
![Page 37: Understanding applications with Paraver](https://reader030.fdocuments.in/reader030/viewer/2022020621/61e78c2765866215c37a8986/html5/thumbnails/37.jpg)
www.bsc.es
Methodology
![Page 38: Understanding applications with Paraver](https://reader030.fdocuments.in/reader030/viewer/2022020621/61e78c2765866215c37a8986/html5/thumbnails/38.jpg)
Performance analysis tools objective
Help validate hypotheses
Help generate hypotheses
Qualitatively
Quantitatively
![Page 39: Understanding applications with Paraver](https://reader030.fdocuments.in/reader030/viewer/2022020621/61e78c2765866215c37a8986/html5/thumbnails/39.jpg)
First steps
Parallel efficiency – percentage of time invested on computation – Identify sources for “inefficiency”:
• load balance • Communication /synchronization
Serial efficiency – how far from peak performance? – IPC, correlate with other counters
Scalability – code replication? – Total #instructions
Behavioral structure? Variability?
Paraver Tutorial:
Introduction to Paraver and Dimemas methodology
![Page 40: Understanding applications with Paraver](https://reader030.fdocuments.in/reader030/viewer/2022020621/61e78c2765866215c37a8986/html5/thumbnails/40.jpg)
BSC Tools web site
www.bsc.es/paraver • downloads
– Sources / Binaries – Linux / windows / MAC
• documentation – Training guides – Tutorial slides
Getting started – Start wxparaver – Help " tutorials and follow instructions – Follow training guides
• Paraver introduction (MPI): Navigation and basic understanding of Paraver operation
![Page 41: Understanding applications with Paraver](https://reader030.fdocuments.in/reader030/viewer/2022020621/61e78c2765866215c37a8986/html5/thumbnails/41.jpg)
Thanks!
Use your brain, use visual tools :) Look at your codes!
www.bsc.es/paraver
![Page 42: Understanding applications with Paraver](https://reader030.fdocuments.in/reader030/viewer/2022020621/61e78c2765866215c37a8986/html5/thumbnails/42.jpg)
www.bsc.es
Demo
![Page 43: Understanding applications with Paraver](https://reader030.fdocuments.in/reader030/viewer/2022020621/61e78c2765866215c37a8986/html5/thumbnails/43.jpg)
Some examples of efficiencies
Code Parallel efficiency Communication efficiency
Load Balance efficiency
Gromacs@mt 66.77 75.68 88.22
BigDFT@altamira 59.64 78.97 75.52
CG-POP@mt 80.98 98.92 81.86
ntchem_mini@pi 92.56 94.94 97.49
nicam@pi 87.10 75.97 89.22
cp2k@jureca 75.34 81.07 92.93
lulesh@mn3 90.55 99.22 91.26
lulesh@leftraru 69.15 99.12 69.76
lulesh@uv2 (mpt) 70.55 96.56 73.06
lulesh@uv2 (impi) 85.65 95.09 90.07
lulesh@mt 83.68 95.48 87.64
lulesh@cori 90.92 98.59 92.20