Integrated MPI/OpenMP Performance Analysis KAI Software Lab Intel Corporation & Pallas, GmbH...
![Page 1: ® 1 Integrated MPI/OpenMP Performance Analysis KAI Software Lab Intel Corporation & Pallas, GmbH Bob Kuhn, bob.kuhn@intel.com Hans-Christian Hoppe, hoppe@pallas.com.](https://reader036.fdocuments.in/reader036/viewer/2022062322/56649e725503460f94b70b37/html5/thumbnails/1.jpg)
Integrated MPI/OpenMP Performance Analysis
KAI Software Lab, Intel Corporation & Pallas, GmbH
Bob Kuhn, bob.kuhn@intel.com
Hans-Christian Hoppe, hoppe@pallas.com
![Page 2: ® 1 Integrated MPI/OpenMP Performance Analysis KAI Software Lab Intel Corporation & Pallas, GmbH Bob Kuhn, bob.kuhn@intel.com Hans-Christian Hoppe, hoppe@pallas.com.](https://reader036.fdocuments.in/reader036/viewer/2022062322/56649e725503460f94b70b37/html5/thumbnails/2.jpg)
Outline
- Why integrated MPI/OpenMP programming?
- A performance tool for MPI/OpenMP programming (Phase 1)
- Integrated performance analysis capability for ASCI Apps (Phase 2)
![Page 3: ® 1 Integrated MPI/OpenMP Performance Analysis KAI Software Lab Intel Corporation & Pallas, GmbH Bob Kuhn, bob.kuhn@intel.com Hans-Christian Hoppe, hoppe@pallas.com.](https://reader036.fdocuments.in/reader036/viewer/2022062322/56649e725503460f94b70b37/html5/thumbnails/3.jpg)
Why Integrate MPI and OpenMP?
- Hardware trends
- Simple example: how is it done now?
- An FEA example
- ASCI examples
![Page 4: ® 1 Integrated MPI/OpenMP Performance Analysis KAI Software Lab Intel Corporation & Pallas, GmbH Bob Kuhn, bob.kuhn@intel.com Hans-Christian Hoppe, hoppe@pallas.com.](https://reader036.fdocuments.in/reader036/viewer/2022062322/56649e725503460f94b70b37/html5/thumbnails/4.jpg)
Parallel Hardware Keeps Coming
Example: recent LLNL ASCI clusters
Parallel Capacity Resource (PCR) cluster
- Three clusters totaling 472 Pentium 4s; the largest with 252
- Theoretical peak 857 gigaFLOP/s
- Linux
- NetworX via SGI Federal
- Source: HPCWire, 8/31/01
Parallel global file system cluster
- Total 48 Pentium 4 processors
- 1,024 clients/servers
- Deliver I/O rates of over 32 GB/s
- Fail-over and global lock manager
- Linux open source
- NetworX via SGI Federal
- Source: HPCWire, 8/31/01
![Page 5: ® 1 Integrated MPI/OpenMP Performance Analysis KAI Software Lab Intel Corporation & Pallas, GmbH Bob Kuhn, bob.kuhn@intel.com Hans-Christian Hoppe, hoppe@pallas.com.](https://reader036.fdocuments.in/reader036/viewer/2022062322/56649e725503460f94b70b37/html5/thumbnails/5.jpg)
Parallelism Performance Analysis
[Figure: code performance plotted against effort. Labels: OpenMP; MPI; OpenMP performance tools; MPI/OpenMP performance tools; debuggers, IDEs]
![Page 6: ® 1 Integrated MPI/OpenMP Performance Analysis KAI Software Lab Intel Corporation & Pallas, GmbH Bob Kuhn, bob.kuhn@intel.com Hans-Christian Hoppe, hoppe@pallas.com.](https://reader036.fdocuments.in/reader036/viewer/2022062322/56649e725503460f94b70b37/html5/thumbnails/6.jpg)
Cost Effective Parallelism Long Term
Wealth of parallelism experience, from single-person codes to large teams
| | TETON | CRETIN | LASNEX |
|---|---|---|---|
| Purpose | radiation transport | non-LTE physics | ICF simulations |
| Age (years) | ~5-10 | ~10 | ~25 |
| Size (lines) | 20 K | 100 K | large |
| Developers | 1-2 | 1 | many |
| Complexity | low | moderate | high |
Parallel model: 1 level SMP; 1 level DMP; varied level SMP/DMP; single level SMP
Complications: memory management; build process
![Page 7: ® 1 Integrated MPI/OpenMP Performance Analysis KAI Software Lab Intel Corporation & Pallas, GmbH Bob Kuhn, bob.kuhn@intel.com Hans-Christian Hoppe, hoppe@pallas.com.](https://reader036.fdocuments.in/reader036/viewer/2022062322/56649e725503460f94b70b37/html5/thumbnails/7.jpg)
ASCI Ultrascale Tools Project
Pathforward project
- RTS: Parallel System Performance
Ten goals in three areas:
- Scalability: work with 10,000+ processors
- Integration: how about hardware monitors, object oriented, and runtime environment?
- Ease of use: dynamic instrumentation, and be prescriptive, not just data management
![Page 8: ® 1 Integrated MPI/OpenMP Performance Analysis KAI Software Lab Intel Corporation & Pallas, GmbH Bob Kuhn, bob.kuhn@intel.com Hans-Christian Hoppe, hoppe@pallas.com.](https://reader036.fdocuments.in/reader036/viewer/2022062322/56649e725503460f94b70b37/html5/thumbnails/8.jpg)
Architecture for Ultrascale Performance
1) Guide: source instrumentation
2) Vampirtrace: MPI/OpenMP instrumentation
3) Vampir: MPI analysis
4) GuideView: OpenMP analysis
[Diagram labels: Application Source, Guide, Object Files, Guidetrace Library, Vampirtrace Library, Executable, Trace File, Vampir, GuideView]
![Page 9: ® 1 Integrated MPI/OpenMP Performance Analysis KAI Software Lab Intel Corporation & Pallas, GmbH Bob Kuhn, bob.kuhn@intel.com Hans-Christian Hoppe, hoppe@pallas.com.](https://reader036.fdocuments.in/reader036/viewer/2022062322/56649e725503460f94b70b37/html5/thumbnails/9.jpg)
Phase One Goal: Integrated MPI/OpenMP
Phase One goals:
- Integrated MPI/OpenMP tracing
  - Mode most compatible with ASCI systems
- Whole program profiling
  - Integrate program profile with parallelism
- Increased scalability of performance analysis
  - 1000 processors
![Page 10: ® 1 Integrated MPI/OpenMP Performance Analysis KAI Software Lab Intel Corporation & Pallas, GmbH Bob Kuhn, bob.kuhn@intel.com Hans-Christian Hoppe, hoppe@pallas.com.](https://reader036.fdocuments.in/reader036/viewer/2022062322/56649e725503460f94b70b37/html5/thumbnails/10.jpg)
Vampir: Integrated MPI/OpenMP
- SWEEP3D run on 4 MPI tasks with 4 OpenMP threads each
- Timeline shows OpenMP regions with a glyph
- Threaded activity during OpenMP region
![Page 11: ® 1 Integrated MPI/OpenMP Performance Analysis KAI Software Lab Intel Corporation & Pallas, GmbH Bob Kuhn, bob.kuhn@intel.com Hans-Christian Hoppe, hoppe@pallas.com.](https://reader036.fdocuments.in/reader036/viewer/2022062322/56649e725503460f94b70b37/html5/thumbnails/11.jpg)
GuideView: Integrated MPI/OpenMP & Profile
- SWEEP3D run on 4 MPI tasks, each with 4 OpenMP threads
- All OpenMP regions for a process are summarized in one bar
- Highlight (red arrow) shows the speedup curve for that set of threads
- Thread view shows balance between MPI tasks and threads
![Page 12: ® 1 Integrated MPI/OpenMP Performance Analysis KAI Software Lab Intel Corporation & Pallas, GmbH Bob Kuhn, bob.kuhn@intel.com Hans-Christian Hoppe, hoppe@pallas.com.](https://reader036.fdocuments.in/reader036/viewer/2022062322/56649e725503460f94b70b37/html5/thumbnails/12.jpg)
GuideView: Integrated MPI/OpenMP & Profile
- Sorting and filtering bring large amounts of information to a manageable level
- Profile allows comparison of MPI, OpenMP, and application activity, inclusive and exclusive
![Page 13: ® 1 Integrated MPI/OpenMP Performance Analysis KAI Software Lab Intel Corporation & Pallas, GmbH Bob Kuhn, bob.kuhn@intel.com Hans-Christian Hoppe, hoppe@pallas.com.](https://reader036.fdocuments.in/reader036/viewer/2022062322/56649e725503460f94b70b37/html5/thumbnails/13.jpg)
Guide: Compiler Workhorse
- Compilation of OpenMP
- Automatic subroutine entry and exit instrumentation
  - Fortran
  - C/C++
New compiler options:
- -WGtrace: link with the Vampirtrace
- -WGprof: subroutine entry/exit profiling
- -WGprof_leafprune: minimum size of procedures to retain in profile
![Page 14: ® 1 Integrated MPI/OpenMP Performance Analysis KAI Software Lab Intel Corporation & Pallas, GmbH Bob Kuhn, bob.kuhn@intel.com Hans-Christian Hoppe, hoppe@pallas.com.](https://reader036.fdocuments.in/reader036/viewer/2022062322/56649e725503460f94b70b37/html5/thumbnails/14.jpg)
Vampirtrace: Profiling
Support for pruning of short routines
[Diagram: ROUTINE X ENTRY, ROUTINE Y ENTRY, ROUTINE Y EXIT, with intervals marked > Δt and < Δt]
- This tree will be pruned. ROUTINE X will be marked as having calltree info summarized.
- All events that have not been pruned could now be written to the tracefile.
- ...
- ROUTINE Z ENTRY: ROUTINE Z may still be < Δt, so it cannot yet be written.
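The pruning rule on this slide can be sketched on a small in-memory call tree. This is only an illustration under stated assumptions: the `Call` structure, `countCalls`, and `prune` are hypothetical names and not part of Vampirtrace, and the sketch assumes that a callee whose total run time is below Δt is removed together with everything it called, while its caller is flagged as summarized.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// One routine invocation in a call tree (illustrative, not a Vampirtrace type).
struct Call {
    std::string name;
    double entry;                // entry timestamp in seconds
    double exit;                 // exit timestamp in seconds
    std::vector<Call> children;  // nested invocations
    bool summarized;             // true once pruned callees were folded into this call
};

// Count a call and everything nested below it.
int countCalls(const Call& call) {
    int n = 1;
    for (const Call& child : call.children) n += countCalls(child);
    return n;
}

// Drop every callee subtree that ran for less than `limit` seconds and flag
// its caller as summarized. Returns the number of calls removed.
int prune(Call& call, double limit) {
    assert(limit >= 0.0);
    int dropped = 0;
    std::vector<Call> kept;
    for (Call& child : call.children) {
        if (child.exit - child.entry < limit) {
            dropped += countCalls(child);  // the whole short subtree disappears
            call.summarized = true;
        } else {
            dropped += prune(child, limit);
            kept.push_back(std::move(child));
        }
    }
    call.children = std::move(kept);
    return dropped;
}
```

A real tracer has to make this decision while events are still arriving, which is why an open routine cannot be written out until its exit is seen; the sketch sidesteps that by working on a finished tree.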
![Page 15: ® 1 Integrated MPI/OpenMP Performance Analysis KAI Software Lab Intel Corporation & Pallas, GmbH Bob Kuhn, bob.kuhn@intel.com Hans-Christian Hoppe, hoppe@pallas.com.](https://reader036.fdocuments.in/reader036/viewer/2022062322/56649e725503460f94b70b37/html5/thumbnails/15.jpg)
Scalability on Phase One
- Timeline scaling to 256 tasks/nodes
- Gathering of tasks in node into group
  - Filtering by nodes
  - Expand each node
  - Message statistics by nodes
![Page 16: ® 1 Integrated MPI/OpenMP Performance Analysis KAI Software Lab Intel Corporation & Pallas, GmbH Bob Kuhn, bob.kuhn@intel.com Hans-Christian Hoppe, hoppe@pallas.com.](https://reader036.fdocuments.in/reader036/viewer/2022062322/56649e725503460f94b70b37/html5/thumbnails/16.jpg)
Phase Two: Integrating Capabilities for ASCI Apps
Phase Two goals:
- Deployment to other platforms
  - Compaq, CPlant, SGI
- Thread safety
- Scalability
  - Grouping
  - Statistical analysis
  - Integrated GuideView
- Hardware performance monitors
- Dynamic control of instrumentation
- Environmental awareness
![Page 17: ® 1 Integrated MPI/OpenMP Performance Analysis KAI Software Lab Intel Corporation & Pallas, GmbH Bob Kuhn, bob.kuhn@intel.com Hans-Christian Hoppe, hoppe@pallas.com.](https://reader036.fdocuments.in/reader036/viewer/2022062322/56649e725503460f94b70b37/html5/thumbnails/17.jpg)
Thread Safety
Collect data from each thread:
- Thread-safe Vampirtrace library
- Per-thread profiling data
- In the previous release, only the master thread logged data
Improves accuracy of data
Value to users:
- Enhances integration between MPI and OpenMP
- Enhances visibility into functional balance between threads
![Page 18: ® 1 Integrated MPI/OpenMP Performance Analysis KAI Software Lab Intel Corporation & Pallas, GmbH Bob Kuhn, bob.kuhn@intel.com Hans-Christian Hoppe, hoppe@pallas.com.](https://reader036.fdocuments.in/reader036/viewer/2022062322/56649e725503460f94b70b37/html5/thumbnails/18.jpg)
Scalability: Grouping
Up to end of FY00:
- Fixed hierarchy levels (system, nodes, CPUs)
- Fixed grouping of processes
- E.g., impossible to reflect communicators
Need more levels:
- Threads are a fourth group
- Systems with deeper hierarchies (30T)
- Reduce number of on-screen entities for scalability
[Diagram labels: Whole system; Node 1 ... Node n; Quadboard; CPU 1 ... CPU c; T_1 ... T_p; t_1 ... t_c]
![Page 19: ® 1 Integrated MPI/OpenMP Performance Analysis KAI Software Lab Intel Corporation & Pallas, GmbH Bob Kuhn, bob.kuhn@intel.com Hans-Christian Hoppe, hoppe@pallas.com.](https://reader036.fdocuments.in/reader036/viewer/2022062322/56649e725503460f94b70b37/html5/thumbnails/19.jpg)
Default Grouping
- By nodes
- By processes
- By master threads
- All threads
Can be changed in configuration file
[Diagram labels: All Cluster, All Processes, All Masters, All Threads; Node 1 ... Node n; Process 0 ... Process n; threads T_0, T_1 ... T_p per process]
![Page 20: ® 1 Integrated MPI/OpenMP Performance Analysis KAI Software Lab Intel Corporation & Pallas, GmbH Bob Kuhn, bob.kuhn@intel.com Hans-Christian Hoppe, hoppe@pallas.com.](https://reader036.fdocuments.in/reader036/viewer/2022062322/56649e725503460f94b70b37/html5/thumbnails/20.jpg)
Scalability: Grouping
Filter processes dialog:
- Select groups combo-box
Display of groups:
- By aggregation
- By representative
Grouping applies to:
- "Timeline bars"
- Counter streams
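The two display modes for a group can be illustrated with a small sketch. It assumes one numeric value per thread (for example a counter reading) and groups by node; `ThreadSample`, `aggregateByNode`, and `representativeByNode` are hypothetical names, and how Vampir actually chooses a representative is not stated in the slides, so the sketch simply takes the first member it encounters.

```cpp
#include <cassert>
#include <map>
#include <vector>

// One value sampled on one thread (illustrative record, not a Vampir type).
struct ThreadSample {
    int node;
    int process;
    int thread;
    double value;
};

// Display by aggregation: every node shows the sum over all of its threads.
std::map<int, double> aggregateByNode(const std::vector<ThreadSample>& samples) {
    std::map<int, double> perNode;
    for (const ThreadSample& s : samples) perNode[s.node] += s.value;
    return perNode;
}

// Display by representative: every node shows the value of the first thread
// encountered for it (an assumption), and the other members stay hidden.
std::map<int, double> representativeByNode(const std::vector<ThreadSample>& samples) {
    std::map<int, double> perNode;
    for (const ThreadSample& s : samples) perNode.emplace(s.node, s.value);  // keeps the first
    return perNode;
}
```

Either way the number of on-screen entities drops from one per thread to one per node, which is the scalability benefit the slide describes.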
![Page 21: ® 1 Integrated MPI/OpenMP Performance Analysis KAI Software Lab Intel Corporation & Pallas, GmbH Bob Kuhn, bob.kuhn@intel.com Hans-Christian Hoppe, hoppe@pallas.com.](https://reader036.fdocuments.in/reader036/viewer/2022062322/56649e725503460f94b70b37/html5/thumbnails/21.jpg)
Scalability by Grouping
- Parallelism display showing all threads
- Parallelism display showing only master threads, alternating between MPI and OpenMP parallelism
![Page 22: ® 1 Integrated MPI/OpenMP Performance Analysis KAI Software Lab Intel Corporation & Pallas, GmbH Bob Kuhn, bob.kuhn@intel.com Hans-Christian Hoppe, hoppe@pallas.com.](https://reader036.fdocuments.in/reader036/viewer/2022062322/56649e725503460f94b70b37/html5/thumbnails/22.jpg)
Statistical Information Gathering
- Collects basic statistics at runtime
- Saves statistics in an ASCII file
- View statistics in your favorite spreadsheet ...
- Reduced overhead compared to tracing
[Diagram labels: Parallel Executable; Tracefile (big); Statsfile (small); Perl filter; Excel, ...]
![Page 23: ® 1 Integrated MPI/OpenMP Performance Analysis KAI Software Lab Intel Corporation & Pallas, GmbH Bob Kuhn, bob.kuhn@intel.com Hans-Christian Hoppe, hoppe@pallas.com.](https://reader036.fdocuments.in/reader036/viewer/2022062322/56649e725503460f94b70b37/html5/thumbnails/23.jpg)
Statistical Information Gathering
- Can work independent of tracing
- Significantly lower overhead (memory, runtime)
- Restriction: for the whole application run ...
| What | Organization | Data |
|---|---|---|
| Subroutines | Per process | Min/max/total time, # of calls |
| Messages | Per sender/receiver | Min/max/total bytes, # of messages |
| Parallel region | Per process | |
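Why statistics cost less memory than tracing can be seen in a minimal sketch of a per-routine accumulator: each call only updates a fixed-size record (minimum, maximum, total, count) instead of appending an event. `RoutineStats` and `StatsCollector` are hypothetical names for illustration, not the Vampirtrace implementation.

```cpp
#include <algorithm>
#include <cassert>
#include <limits>
#include <map>
#include <string>

// Running summary for one subroutine on one process.
struct RoutineStats {
    long calls = 0;
    double minTime = std::numeric_limits<double>::infinity();
    double maxTime = 0.0;
    double totalTime = 0.0;

    void record(double seconds) {
        assert(seconds >= 0.0);
        ++calls;
        minTime = std::min(minTime, seconds);
        maxTime = std::max(maxTime, seconds);
        totalTime += seconds;
    }
};

// Keeps one RoutineStats per routine name, so memory grows with the number
// of routines rather than with the number of calls.
class StatsCollector {
public:
    void record(const std::string& routine, double seconds) {
        byRoutine_[routine].record(seconds);
    }
    const RoutineStats& stats(const std::string& routine) const {
        return byRoutine_.at(routine);  // throws std::out_of_range if never recorded
    }
private:
    std::map<std::string, RoutineStats> byRoutine_;
};
```

The trade-off matches the restriction on the slide: once individual events are folded into a summary, the numbers describe the whole run and cannot be split by time interval afterwards.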
![Page 24: ® 1 Integrated MPI/OpenMP Performance Analysis KAI Software Lab Intel Corporation & Pallas, GmbH Bob Kuhn, bob.kuhn@intel.com Hans-Christian Hoppe, hoppe@pallas.com.](https://reader036.fdocuments.in/reader036/viewer/2022062322/56649e725503460f94b70b37/html5/thumbnails/24.jpg)
Statistical Information Gathering
Record format: <act>:<sym>:<proc>:<calls>:<minexcl>:<maxexcl>:<totalexcl>:<minincl>:<maxincl>:<totalincl>
INFO ACTSTATS Application:PK_2112_YBDRYS:0:16:3.539324e-04:5.249977e-04:7.470846e-03:3.539324e-04:5.249977e-04:7.470846e-03
INFO ACTSTATS Application:PK_2112_YBDRYS:1:16:3.600121e-04:5.509853e-04:7.577062e-03:3.600121e-04:5.509853e-04:7.577062e-03
INFO ACTSTATS Application:PK_2112_YBDRYS:2:16:3.390312e-04:5.350113e-04:7.542133e-03:3.390312e-04:5.350113e-04:7.542133e-03
INFO ACTSTATS Application:PK_2112_YBDRYS:3:16:3.429651e-04:5.450249e-04:7.494092e-03:3.429651e-04:5.450249e-04:7.494092e-03
[Chart: series PK_814_CALCHYDZ, RDPARAM, and PK_562_CALCHYDY over categories 0 to 7; value axis from 0.00E+00 to 3.50E+05]
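Because the statistics file is plain ASCII with colon-separated fields, it is easy to post-process. The sketch below parses one `INFO ACTSTATS` record using the ten-field layout shown on this slide; the `ActStats` structure and `parseActStats` function are hypothetical names, and the parser assumes that symbol names contain no colons.

```cpp
#include <cassert>
#include <exception>
#include <sstream>
#include <string>
#include <vector>

// One ACTSTATS record: activity, symbol, process, call count, and
// min/max/total exclusive and inclusive times.
struct ActStats {
    std::string activity;
    std::string symbol;
    int process;
    long calls;
    double minExcl, maxExcl, totalExcl;
    double minIncl, maxIncl, totalIncl;
};

// Returns true and fills `out` if `line` is a well-formed ACTSTATS record.
bool parseActStats(const std::string& line, ActStats& out) {
    const std::string prefix = "INFO ACTSTATS ";
    if (line.compare(0, prefix.size(), prefix) != 0) return false;

    // Split the remainder on ':' into exactly ten fields.
    std::vector<std::string> fields;
    std::istringstream in(line.substr(prefix.size()));
    std::string field;
    while (std::getline(in, field, ':')) fields.push_back(field);
    if (fields.size() != 10) return false;

    try {
        out.activity = fields[0];
        out.symbol = fields[1];
        out.process = std::stoi(fields[2]);
        out.calls = std::stol(fields[3]);
        out.minExcl = std::stod(fields[4]);
        out.maxExcl = std::stod(fields[5]);
        out.totalExcl = std::stod(fields[6]);
        out.minIncl = std::stod(fields[7]);
        out.maxIncl = std::stod(fields[8]);
        out.totalIncl = std::stod(fields[9]);
    } catch (const std::exception&) {
        return false;  // a numeric field did not parse
    }
    return true;
}
```

Records parsed this way can be written out as comma-separated values for a spreadsheet, which is the role the slides assign to the Perl filter.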
![Page 25: ® 1 Integrated MPI/OpenMP Performance Analysis KAI Software Lab Intel Corporation & Pallas, GmbH Bob Kuhn, bob.kuhn@intel.com Hans-Christian Hoppe, hoppe@pallas.com.](https://reader036.fdocuments.in/reader036/viewer/2022062322/56649e725503460f94b70b37/html5/thumbnails/25.jpg)
GuideView Integrated Inside Vampir
Creating an extension API in Vampir:
- Insert menu items
- Include new displays
- Have access to trace data & statistics
[Diagram labels: Trace data (in memory), Vampir menus, Vampir GUI engine, New GuideView, Motif graphics library; arrows: invoke, access, control, display]
![Page 26: ® 1 Integrated MPI/OpenMP Performance Analysis KAI Software Lab Intel Corporation & Pallas, GmbH Bob Kuhn, bob.kuhn@intel.com Hans-Christian Hoppe, hoppe@pallas.com.](https://reader036.fdocuments.in/reader036/viewer/2022062322/56649e725503460f94b70b37/html5/thumbnails/26.jpg)
New GuideView: Whole Program View
Goals:
- Improve MPI/OpenMP integration
- Improve scalability
- Integrate look and feel
Works like old GuideView!
Load time: fast!
![Page 27: ® 1 Integrated MPI/OpenMP Performance Analysis KAI Software Lab Intel Corporation & Pallas, GmbH Bob Kuhn, bob.kuhn@intel.com Hans-Christian Hoppe, hoppe@pallas.com.](https://reader036.fdocuments.in/reader036/viewer/2022062322/56649e725503460f94b70b37/html5/thumbnails/27.jpg)
New GuideView: Region View
Looks like old Region view turned on the side!
Scalability test:
- 16 MPI tasks
- 16 OpenMP threads
- 300 parallel regions
![Page 28: ® 1 Integrated MPI/OpenMP Performance Analysis KAI Software Lab Intel Corporation & Pallas, GmbH Bob Kuhn, bob.kuhn@intel.com Hans-Christian Hoppe, hoppe@pallas.com.](https://reader036.fdocuments.in/reader036/viewer/2022062322/56649e725503460f94b70b37/html5/thumbnails/28.jpg)
Hardware Performance Monitors
1) User can call HPM API in the source code
2) User can define events in Config file for Guide instrumentation
3) HPM counter events are also logged from Guidetrace and Vampirtrace library
4) Underlying HPM library is PAPI
[Diagram labels: Application Source, Config File, Guide, Object Files, Guidetrace, Vampirtrace, PAPI, Executable, Trace File, Vampir, GuideView]
![Page 29: ® 1 Integrated MPI/OpenMP Performance Analysis KAI Software Lab Intel Corporation & Pallas, GmbH Bob Kuhn, bob.kuhn@intel.com Hans-Christian Hoppe, hoppe@pallas.com.](https://reader036.fdocuments.in/reader036/viewer/2022062322/56649e725503460f94b70b37/html5/thumbnails/29.jpg)
PAPI: Hardware Performance Monitors
- Standardizes names across platforms
- Users define counter sets
- User could instrument by hand
- But better, counters are instrumented at OpenMP regions and subroutines
- Can't support unsupported counters

    int main(int argc, char **argv) {
      int set_id;
      int inner, outer, other;
      /* Create a new event set to measure L1 & L2 data cache misses. */
      set_id = VT_create_event_set("MySet");
      VT_add_event(set_id, PAPI_L1_DCM);
      VT_add_event(set_id, PAPI_L2_DCM);
      VT_symdef(outer, "OUTER", "USERSTATES");
      VT_symdef(inner, "INNER", "USERSTATES");
      VT_symdef(other, "OTHER", "USERSTATES");
      /* Activate the event set. */
      VT_change_hpm(set_id);
      /* Collect the events over two user-defined intervals. */
      VT_begin(outer);
      foo();
      VT_begin(inner);
      bar();
      VT_end(inner);
      foo();
      VT_end(outer);
    }
![Page 30: ® 1 Integrated MPI/OpenMP Performance Analysis KAI Software Lab Intel Corporation & Pallas, GmbH Bob Kuhn, bob.kuhn@intel.com Hans-Christian Hoppe, hoppe@pallas.com.](https://reader036.fdocuments.in/reader036/viewer/2022062322/56649e725503460f94b70b37/html5/thumbnails/30.jpg)
Hardware Performance Example
- MPI tasks on timeline
- Floating point instructions correlated, but in a different window
- Or, per-MPI-task activity correlated in the same window
![Page 31: ® 1 Integrated MPI/OpenMP Performance Analysis KAI Software Lab Intel Corporation & Pallas, GmbH Bob Kuhn, bob.kuhn@intel.com Hans-Christian Hoppe, hoppe@pallas.com.](https://reader036.fdocuments.in/reader036/viewer/2022062322/56649e725503460f94b70b37/html5/thumbnails/31.jpg)
Hardware Performance Can Be Rich
- 4 x 4 SWEEP3D run showing L1 data cache misses
- Cycles stalled waiting for memory accesses
![Page 32: ® 1 Integrated MPI/OpenMP Performance Analysis KAI Software Lab Intel Corporation & Pallas, GmbH Bob Kuhn, bob.kuhn@intel.com Hans-Christian Hoppe, hoppe@pallas.com.](https://reader036.fdocuments.in/reader036/viewer/2022062322/56649e725503460f94b70b37/html5/thumbnails/32.jpg)
Hardware Performance in GuideView
- You can see the HPM data in all GuideView windows
- L1 data cache misses and cycles stalled due to memory accesses in the per-MPI-task profile view
![Page 33: ® 1 Integrated MPI/OpenMP Performance Analysis KAI Software Lab Intel Corporation & Pallas, GmbH Bob Kuhn, bob.kuhn@intel.com Hans-Christian Hoppe, hoppe@pallas.com.](https://reader036.fdocuments.in/reader036/viewer/2022062322/56649e725503460f94b70b37/html5/thumbnails/33.jpg)
Derived Hardware Counters
- Vampir and GuideView displays present derived counters
- In this menu you can arithmetically combine measured counters into derived counters
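A derived counter is an arithmetic combination of measured ones, evaluated sample by sample. The sketch below shows the idea for a ratio such as data cache misses per data cache access; `CounterSeries`, `deriveCounter`, and `safeRatio` are hypothetical names, and the choice of a ratio is only an example, not a description of what the Vampir menu offers.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// A counter sampled at successive points in time.
using CounterSeries = std::vector<double>;

// Combine two equally long counter series sample by sample.
CounterSeries deriveCounter(const CounterSeries& a, const CounterSeries& b,
                            const std::function<double(double, double)>& combine) {
    assert(a.size() == b.size());
    CounterSeries derived(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) derived[i] = combine(a[i], b[i]);
    return derived;
}

// Ratio that yields 0 for an empty interval instead of dividing by zero.
double safeRatio(double numerator, double denominator) {
    return denominator == 0.0 ? 0.0 : numerator / denominator;
}
```

Calling `deriveCounter(misses, accesses, safeRatio)` produces a miss-rate series that can be displayed alongside the measured counters.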
![Page 34: ® 1 Integrated MPI/OpenMP Performance Analysis KAI Software Lab Intel Corporation & Pallas, GmbH Bob Kuhn, bob.kuhn@intel.com Hans-Christian Hoppe, hoppe@pallas.com.](https://reader036.fdocuments.in/reader036/viewer/2022062322/56649e725503460f94b70b37/html5/thumbnails/34.jpg)
Environmental Counters
| Parameter | Meaning |
|---|---|
| utime | user time used |
| stime | system time used |
| maxrss | max resident set size |
| ixrss | shared memory size |
| idrss | unshared data size |
| minflt | page reclaims |
| majflt | page faults |
| nswap | swaps |
| inblock | block input operations |
| oublock | block output operations |
Select rusage information like HPMs
Data appears in Vampir and GuideView like HPM data
Time-varying OS counters:
- Config variable sets sampling frequency
- Difficult to attribute to source code precisely
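The parameters in this table correspond to fields of the POSIX `rusage` structure. As a minimal sketch of what one sample of these OS counters looks like, the function below reads them for the current process with `getrusage`; the `EnvSample` structure and `sampleRusage` function are hypothetical names, and how Vampirtrace schedules its sampling is not shown.

```cpp
#include <cassert>
#include <sys/resource.h>
#include <sys/time.h>

// A subset of the rusage values listed on the slide.
struct EnvSample {
    double utime;   // user CPU time in seconds
    double stime;   // system CPU time in seconds
    long maxrss;    // maximum resident set size (units are platform dependent)
    long minflt;    // page reclaims
    long majflt;    // page faults
    long inblock;   // block input operations
    long oublock;   // block output operations
};

// Take one sample for the calling process. Returns false if getrusage fails.
bool sampleRusage(EnvSample& out) {
    struct rusage ru;
    if (getrusage(RUSAGE_SELF, &ru) != 0) return false;
    out.utime = ru.ru_utime.tv_sec + ru.ru_utime.tv_usec * 1e-6;
    out.stime = ru.ru_stime.tv_sec + ru.ru_stime.tv_usec * 1e-6;
    out.maxrss = ru.ru_maxrss;
    out.minflt = ru.ru_minflt;
    out.majflt = ru.ru_majflt;
    out.inblock = ru.ru_inblock;
    out.oublock = ru.ru_oublock;
    return true;
}
```

Since these values are cumulative for the whole process and are read only at sampling points, a change between two samples cannot be pinned to a specific source line, which is the attribution difficulty the slide mentions.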
![Page 35: ® 1 Integrated MPI/OpenMP Performance Analysis KAI Software Lab Intel Corporation & Pallas, GmbH Bob Kuhn, bob.kuhn@intel.com Hans-Christian Hoppe, hoppe@pallas.com.](https://reader036.fdocuments.in/reader036/viewer/2022062322/56649e725503460f94b70b37/html5/thumbnails/35.jpg)
Environmental Awareness
| Parameter | Meaning |
|---|---|
| MP_EUIDEVICE | adapter set to be used for message passing |
| MP_EUILIB | communication subsystem library implementation |
| MP_INFOLEVEL | level of message reporting |
| MP_BUFFER_MEM | size of unexpected message buffers |
| MP_CSS_INTERRUPT | generate interrupts for arriving packets |
| MP_EAGER_LIMIT | threshold for switching to rendezvous protocol |
| MP_USE_FLOW_CONTROL | enforce flow control for outgoing messages |
Type 1: Collects IBM MPI information
- Treated as static (one time) event in tracefile
- Over 50 parameters
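Recording such settings as a one-time event amounts to taking a snapshot of selected environment variables at startup. The sketch below shows one way to do that; `captureEnvironment` is a hypothetical helper, and how the snapshot is actually encoded in the tracefile is not covered by the slides.

```cpp
#include <cassert>
#include <cstdlib>
#include <map>
#include <string>
#include <vector>

// Return the current value of every listed variable that is set.
// Unset variables are simply left out of the snapshot.
std::map<std::string, std::string> captureEnvironment(const std::vector<std::string>& names) {
    std::map<std::string, std::string> snapshot;
    for (const std::string& name : names) {
        const char* value = std::getenv(name.c_str());
        if (value != nullptr) snapshot[name] = value;
    }
    return snapshot;
}
```

A call such as `captureEnvironment({"MP_EAGER_LIMIT", "MP_INFOLEVEL"})` would be made once per process, since these values do not change during the run.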
![Page 36: ® 1 Integrated MPI/OpenMP Performance Analysis KAI Software Lab Intel Corporation & Pallas, GmbH Bob Kuhn, bob.kuhn@intel.com Hans-Christian Hoppe, hoppe@pallas.com.](https://reader036.fdocuments.in/reader036/viewer/2022062322/56649e725503460f94b70b37/html5/thumbnails/36.jpg)
Dynamic Control of Instrumentation
1) In source, user puts VT_confsync() calls
2) At runtime, TotalView is attached and a breakpoint is inserted
3) From process #0, user adjusts several instrumentation settings
4) VTconfigchanged flag is set, breakpoint is exited
Tracefile reflects change after next VT_confsync()
[Diagram labels: Application Source, Guide, Object Files, Vampirtrace Library, Executable, TotalView, Trace File, Vampir, GuideView]
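The key property of this scheme is that edited settings take effect only at the next synchronization point, never in the middle of a traced interval. A single-process sketch of that behavior follows; `TraceConfig` and its methods are hypothetical stand-ins rather than the Vampirtrace API, the `changed_` member plays the role of the VTconfigchanged flag, and treating unknown keywords as ON is an assumption. The distribution of settings from process #0 to the other processes is not modeled.

```cpp
#include <cassert>
#include <map>
#include <string>

class TraceConfig {
public:
    // Called from outside the traced code path, for example from a debugger.
    void request(const std::string& keyword, bool on) {
        pending_[keyword] = on;
        changed_ = true;
    }

    // Stand-in for a VT_confsync() call: adopt pending edits, if any.
    // Returns true when the active configuration changed.
    bool sync() {
        if (!changed_) return false;
        for (const auto& edit : pending_) active_[edit.first] = edit.second;
        pending_.clear();
        changed_ = false;
        return true;
    }

    // What the tracer currently honors. Untouched keywords count as ON (assumption).
    bool enabled(const std::string& keyword) const {
        auto it = active_.find(keyword);
        return it == active_.end() ? true : it->second;
    }

private:
    std::map<std::string, bool> active_;   // settings in effect
    std::map<std::string, bool> pending_;  // edits waiting for the next sync
    bool changed_ = false;
};
```

After `request("OPENMP", false)`, `enabled("OPENMP")` still returns true until `sync()` runs, which mirrors the statement that the tracefile reflects a change only after the next synchronization call.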
![Page 37: ® 1 Integrated MPI/OpenMP Performance Analysis KAI Software Lab Intel Corporation & Pallas, GmbH Bob Kuhn, bob.kuhn@intel.com Hans-Christian Hoppe, hoppe@pallas.com.](https://reader036.fdocuments.in/reader036/viewer/2022062322/56649e725503460f94b70b37/html5/thumbnails/37.jpg)
Dynamic Control of Instrumentation
| Keyword | Description | Default Value |
|---|---|---|
| LOGFILE-NAME | Tracefile name | <argv[0]>.bvt |
| LOGFILE-PREFIX | Tracefile path prefix | Null string |
| ACTIVITY | Trace activities (user defined) | * ON |
| SYMBOL | Trace symbols (often subroutines) | * ON |
| COUNTER | Trace counters | * ON |
| OPENMP | Trace OpenMP regions | * ON |
| PCTRACE | Record return address | OFF |
| SUM-MPITESTS | Collapse MPI probe and test routines | ON |
| CLUSTER | Trace cluster nodes | All enabled |
| PROCESS | Trace processes | All enabled |
| ENVIRONMENT | Record environment information | ON |
| MEM-MAXBLOCKS | Maximum number of memory blocks | Unlimited |
| MEM-OVERWRITE | Overwrite in-core buffers | OFF |
| PRUNE-LIMIT | Execution time threshold | No pruning |
![Page 38: ® 1 Integrated MPI/OpenMP Performance Analysis KAI Software Lab Intel Corporation & Pallas, GmbH Bob Kuhn, bob.kuhn@intel.com Hans-Christian Hoppe, hoppe@pallas.com.](https://reader036.fdocuments.in/reader036/viewer/2022062322/56649e725503460f94b70b37/html5/thumbnails/38.jpg)
Structured Trace Files: Frames Manage Scalability
- A section of the timeline
- A set of processors
- Instances of a subroutine
- OpenMP regions
- Messages or collectives
![Page 39: ® 1 Integrated MPI/OpenMP Performance Analysis KAI Software Lab Intel Corporation & Pallas, GmbH Bob Kuhn, bob.kuhn@intel.com Hans-Christian Hoppe, hoppe@pallas.com.](https://reader036.fdocuments.in/reader036/viewer/2022062322/56649e725503460f94b70b37/html5/thumbnails/39.jpg)
Structured Trace Files Consist of Frames
Frames are defined in the source code:
- int VT_framedef(char *name, unsigned int type_mask, int *frame_handle)
- int VT_framestart(int frame_handle)
- int VT_framestop(int frame_handle)
type_mask defines the types of data collected:
- VT_FUNCTION
- VT_REGION
- VT_PAR_REGION
- VT_OPENMP
- VT_COUNTER
- VT_MESSAGE
- VT_COLL_OP
- VT_COMMUNICATION
- VT_ALL
Analyze time frames will be available
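A type mask of this kind is typically a bit set in which each data type owns one bit and several types are combined with bitwise OR. The sketch below illustrates that convention only: the `FRAME_*` constants and their numeric values are invented for the example, the real Vampirtrace values are not given in the slides, and treating the communication type as the union of messages and collective operations is an assumption.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical bit assignments, one bit per kind of trace data.
enum FrameType : std::uint32_t {
    FRAME_FUNCTION      = 1u << 0,
    FRAME_REGION        = 1u << 1,
    FRAME_PAR_REGION    = 1u << 2,
    FRAME_OPENMP        = 1u << 3,
    FRAME_COUNTER       = 1u << 4,
    FRAME_MESSAGE       = 1u << 5,
    FRAME_COLL_OP       = 1u << 6,
    FRAME_COMMUNICATION = FRAME_MESSAGE | FRAME_COLL_OP,  // assumed composite
    FRAME_ALL           = 0x7Fu
};

// True if a frame defined with `mask` collects every bit of `type`.
bool frameCollects(std::uint32_t mask, FrameType type) {
    return (mask & type) == static_cast<std::uint32_t>(type);
}
```

A frame defined with `FRAME_OPENMP | FRAME_MESSAGE` would then keep OpenMP regions and point-to-point messages while ignoring counters and collective operations.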
![Page 40: ® 1 Integrated MPI/OpenMP Performance Analysis KAI Software Lab Intel Corporation & Pallas, GmbH Bob Kuhn, bob.kuhn@intel.com Hans-Christian Hoppe, hoppe@pallas.com.](https://reader036.fdocuments.in/reader036/viewer/2022062322/56649e725503460f94b70b37/html5/thumbnails/40.jpg)
Structured Trace Files: Rapid Access By Frames
1) Structured tracefile
2) Vampir thumbnail displays represent frames
3) Selecting thumbnails displays frames in Vampir
[Diagram labels: Index File, Frame, Frame]
Object Oriented Performance Analysis
How to avoid SOOX (Scalability Object Oriented eXplosion) – Instrument with API
– C++ templates, classes make it much easier
– Can be used with or without source
[Figure: VT Activity/InformerMappings (ImX, ImY, ImZ, ImQ); Informers (I_A, I_B, I_C, I_D); Events (MPI_Send, MPI_Recv, MPI_Finalize, Func A, Func Init, Func X, Func Y, Func Z)]
Use TAU model
Example of OO Informers
class Matrix {
public:
  InformerMapping im;
  Matrix(int rows, int columns) {
    if (rows * columns > 500)
      im.Rename("LargeMatrix");
    else im.Rename("Matrix"); }
  void invert () {
    Informer(im, "invert", 12, 15, "Example.C");
    #pragma omp parallel
    { .... }
    MPI_Send(...);
  }
  void compl () {
    Informer(im, "typeid(…)" );
    ....
  }
};
int main(int argc, char **argv) {
  Matrix A(10,10),B(512,512),C(1000,1000); // line 1
  B.im.Rename("MediumMatrix"); // line 2
  A.invert(); // line 3
  B.compl(); // line 4
  C.invert(); // line 5
}
Line 1: Create three Matrix instances: A (mapped to “Matrix” bin), B (mapped to “LargeMatrix” bin), and C (mapped to “LargeMatrix” bin)
Line 2: Remap B to “MediumMatrix” bin
Line 3: A.invert() is traced. Entry and exit events are collected and associated with (“Matrix:invert”) in Matrix bin
Line 4: B.compl() is traced. Entry and exit events are collected and associated with (“Matrix:void compl(void)”) in MediumMatrix bin
Line 5: C.invert() is traced. Entry and exit events are collected and associated with (“Matrix:invert”) in LargeMatrix bin
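The binning behaviour walked through above can be reproduced with a small mock. This is a sketch under stated assumptions: the `InformerMapping` and `Informer` classes below are simplified stand-ins written for this illustration (the slide's line-number and file arguments are dropped), the event log `g_events` replaces the trace file, and the method is named `complement` because `compl` is a reserved alternative token in standard C++.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Stand-in for the trace: (bin, event) pairs in the order they occur.
static std::vector<std::pair<std::string, std::string>> g_events;

// Mock InformerMapping: it only remembers the current bin name.
class InformerMapping {
public:
    void Rename(const std::string &bin) { bin_ = bin; }
    const std::string &bin() const { return bin_; }
private:
    std::string bin_;
};

// Mock Informer as a scope guard: logs entry on construction, exit on destruction.
class Informer {
public:
    Informer(const InformerMapping &im, const std::string &name)
        : bin_(im.bin()), name_(name) {
        g_events.emplace_back(bin_, "enter " + name_);
    }
    ~Informer() { g_events.emplace_back(bin_, "exit " + name_); }
    Informer(const Informer &) = delete;
    Informer &operator=(const Informer &) = delete;
private:
    std::string bin_;
    std::string name_;
};

class Matrix {
public:
    InformerMapping im;
    Matrix(int rows, int columns) {
        // Same size threshold as the slide: large instances get their own bin.
        im.Rename(rows * columns > 500 ? "LargeMatrix" : "Matrix");
    }
    void invert() { Informer scope(im, "Matrix:invert"); /* work goes here */ }
    void complement() { Informer scope(im, "Matrix:complement"); /* work goes here */ }
};

// Replays the five annotated lines of the slide's main().
std::vector<std::pair<std::string, std::string>> informer_demo() {
    g_events.clear();
    Matrix A(10, 10), B(512, 512), C(1000, 1000);  // line 1
    B.im.Rename("MediumMatrix");                   // line 2
    A.invert();                                    // line 3
    B.complement();                                // line 4
    C.invert();                                    // line 5
    return g_events;
}
```

The mock names its `Informer` object so that it lives until the end of the method; an unnamed temporary would be destroyed at the end of the statement and log the exit immediately.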
Vampir OO Timeline Shows Informer Bins
InformerMappings: each bin is displayed as a Vampir activity. MPI is put into a separate activity with the same prefix
Rename as ‘mangled name’: InformerMapping:Informer:NormalEventName
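The naming scheme can be sketched as two small helpers. Only the colon-separated form InformerMapping:Informer:NormalEventName comes from the slide. The function names, the `MPI_` prefix test, and the `:MPI` suffix used for the separate MPI activity are assumptions made for this illustration.

```cpp
#include <cassert>
#include <string>

// Builds the mangled event name in the slide's colon-separated form.
std::string mangled_name(const std::string &mapping,
                         const std::string &informer,
                         const std::string &event) {
    return mapping + ":" + informer + ":" + event;
}

// Assumed activity naming: the bin name for ordinary events, and the bin
// name plus ":MPI" (same prefix, separate activity) for MPI calls.
std::string activity_for(const std::string &mapping, const std::string &event) {
    const bool is_mpi = event.compare(0, 4, "MPI_") == 0;
    return is_mpi ? mapping + ":MPI" : mapping;
}
```

Under these assumptions, an `MPI_Send` issued inside `invert` on a large matrix would appear as `LargeMatrix:invert:MPI_Send` in the activity `LargeMatrix:MPI`.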
Vampir OO Profile Shows Informer Bins
Time in Classes: Queens
MPI Time in Class: Queens
OO GuideView Shows Regions in Bins
Time and counter data per thread by Bin
Parallel Performance Engineering
ASCI Ultrascale Performance Tools
– Scalability
– Integration
– Ease of Use
Read about what was presented
– ftp://ftp.kai.com/private/Lab_notes_2001.doc.gz
– Contact: [email protected]
Thank you for your attention!