PerfMon redux: analyzing a CUDA application with the Windows Performance Monitor
S6287
Richard Wilton
Department of Physics and Astronomy
Johns Hopkins University
S6287: Analyzing a CUDA application with PerfMon
What to monitor and why
• What is there to monitor?
  - Speed (duration)
  - Resource utilization
  - Interactions between resources
• Why bother?
  - Prove that things are operating as expected
  - Make things run faster
  - Find performance bottlenecks
  - Identify resource contention
Setup for performance monitoring
• Tools you need
  - Microsoft Windows
  - NVIDIA GPU and CUDA toolkit (NVML)
  - Microsoft Visual Studio (PerfLib v2)
• Monitoring setup
  - Target machine with target hardware
  - Application “release” build
  - Choose your performance counters
Choosing performance counters
Counters in the GPU group:
• Clock speed (MHz): memory
• Clock speed (MHz): SM
• Fan speed (% maximum)
• Global memory allocated (bytes)
• Global memory allocated (percent)
• Global memory free (bytes)
• Global memory read/write activity (%)
• GPU compute activity (%)
• GPU temperature (°C)
• GPU total power draw (watts)
• PCIe receive throughput (KB/s)
• PCIe transmit throughput (KB/s)
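The slides name NVML as the data source but do not show how the counter values are read. As a rough sketch only (not the author's implementation), the readings behind the list above could be gathered through the `pynvml` Python bindings for NVML. The function names `read_raw` and `to_counters` and the dictionary keys are hypothetical, and it is an assumption that the two activity counters correspond to NVML's utilization rates.

```python
def read_raw(device_index=0):
    """Read one snapshot of raw NVML values (needs an NVIDIA GPU and pynvml)."""
    import pynvml as nv  # imported lazily so to_counters() works without a GPU

    nv.nvmlInit()
    try:
        h = nv.nvmlDeviceGetHandleByIndex(device_index)
        mem = nv.nvmlDeviceGetMemoryInfo(h)         # bytes: total, free, used
        util = nv.nvmlDeviceGetUtilizationRates(h)  # percent: gpu, memory
        return {
            "clock_mem_mhz": nv.nvmlDeviceGetClockInfo(h, nv.NVML_CLOCK_MEM),
            "clock_sm_mhz": nv.nvmlDeviceGetClockInfo(h, nv.NVML_CLOCK_SM),
            # May raise NVMLError on boards without a controllable fan.
            "fan_pct": nv.nvmlDeviceGetFanSpeed(h),
            "mem_total": mem.total,
            "mem_used": mem.used,
            "mem_free": mem.free,
            "util_gpu_pct": util.gpu,
            "util_mem_pct": util.memory,
            "temp_c": nv.nvmlDeviceGetTemperature(h, nv.NVML_TEMPERATURE_GPU),
            "power_mw": nv.nvmlDeviceGetPowerUsage(h),  # milliwatts
            "pcie_rx_kbs": nv.nvmlDeviceGetPcieThroughput(
                h, nv.NVML_PCIE_UTIL_RX_BYTES),
            "pcie_tx_kbs": nv.nvmlDeviceGetPcieThroughput(
                h, nv.NVML_PCIE_UTIL_TX_BYTES),
        }
    finally:
        nv.nvmlShutdown()


def to_counters(raw):
    """Convert raw readings into the counter names and units used on the slide."""
    return {
        "Clock speed (MHz): memory": raw["clock_mem_mhz"],
        "Clock speed (MHz): SM": raw["clock_sm_mhz"],
        "Fan speed (% maximum)": raw["fan_pct"],
        "Global memory allocated (bytes)": raw["mem_used"],
        "Global memory allocated (percent)":
            100.0 * raw["mem_used"] / raw["mem_total"],
        "Global memory free (bytes)": raw["mem_free"],
        "Global memory read/write activity (%)": raw["util_mem_pct"],
        "GPU compute activity (%)": raw["util_gpu_pct"],
        "GPU temperature (°C)": raw["temp_c"],
        "GPU total power draw (watts)": raw["power_mw"] / 1000.0,
        "PCIe receive throughput (KB/s)": raw["pcie_rx_kbs"],
        "PCIe transmit throughput (KB/s)": raw["pcie_tx_kbs"],
    }
```

On a machine with a GPU, `to_counters(read_raw(0))` would give one reading per counter for device 0. Publishing those values to PerfMon (the PerfLib v2 provider mentioned in the setup slide) is a separate step that this sketch does not cover.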
Choosing performance counters
Monitoring everything at once is probably not a good idea.
Application pipeline (circa 2013)
• CPU compute activity
• GPU (CUDA) compute activity
GPU activity
Device-related counters – device 0, 1, 2
• GPU compute activity %
• Global memory read/write activity %
Host-related counters
• CPU activity %
• Host memory allocation
GPU activity
Device-related counters – device 0
• GPU compute activity %
• Global memory read/write activity %
Host-related counters
• CPU activity %
• Host memory allocation
Sampling → Jaggedness
Device-related counters – device 0
• GPU compute activity %
• Global memory read/write activity %
Sampled at 1-second intervals
Samples are “snapshots” (not averaged)
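To illustrate why snapshot sampling produces a jagged trace, here is a small simulation. It is only a sketch under stated assumptions: the workload (`busy_trace`, busy for 4 ticks out of every 7) and the choice of 10 ticks per one-second sample are hypothetical and do not come from the measurements in the talk.

```python
def busy_trace(n_ticks, period=7, busy=4):
    """Hypothetical workload: 100% busy for `busy` ticks out of every `period`."""
    return [100.0 if (t % period) < busy else 0.0 for t in range(n_ticks)]


def snapshot_samples(trace, ticks_per_sample):
    """Keep only the reading at each sampling instant (a "snapshot")."""
    return [trace[i]
            for i in range(ticks_per_sample - 1, len(trace), ticks_per_sample)]


def averaged_samples(trace, ticks_per_sample):
    """Average every reading inside each sampling interval, for comparison."""
    return [
        sum(trace[i:i + ticks_per_sample]) / ticks_per_sample
        for i in range(0, len(trace) - ticks_per_sample + 1, ticks_per_sample)
    ]


def spread(values):
    """Distance between the highest and lowest sample."""
    return max(values) - min(values)


trace = busy_trace(n_ticks=100)       # 10 ticks per "second", 10 seconds
snaps = snapshot_samples(trace, 10)   # what a snapshot counter would record
avgs = averaged_samples(trace, 10)    # what an averaging counter would record

print("snapshots:", snaps, "spread:", spread(snaps))
print("averages: ", avgs, "spread:", spread(avgs))
```

The underlying workload is perfectly regular, yet the snapshot series swings between the extremes while the averaged series stays in a narrow band. A jagged PerfMon graph therefore does not by itself mean the GPU load is erratic.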
Concurrency among multiple GPUs
Device-related counters – device 0, 1, 2
• GPU compute activity %
• Global memory read/write activity %
Host-related counters
• CPU activity %
• Host memory allocation
Concurrency among multiple GPUs
Device-related counters – device 0
• GPU compute activity %
• Global memory read/write activity %
Host-related counters
• CPU activity %
• Host memory allocation
Concurrency among multiple GPUs
Device-related counters – device 1
• GPU compute activity %
• Global memory read/write activity %
Host-related counters
• CPU activity %
• Host memory allocation
Concurrency among multiple GPUs
Device-related counters – device 2
• GPU compute activity %
• Global memory read/write activity %
Host-related counters
• CPU activity %
• Host memory allocation
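One way to put a number on the concurrency that the per-device graphs show visually is to count the sampling instants at which every GPU is busy at once. The sketch below is an illustration only: the function name `concurrency_fraction`, the 50% busy threshold, and the sample values are all hypothetical, and because the samples are snapshots the result is a rough estimate.

```python
def concurrency_fraction(samples_by_device, busy_threshold=50.0):
    """Fraction of sample instants at which every device is at or above
    the busy threshold.

    samples_by_device maps a device index to its list of
    "GPU compute activity %" samples, taken at the same instants.
    """
    series = list(samples_by_device.values())
    if not series:
        return 0.0
    n = min(len(s) for s in series)  # use only instants present for all devices
    if n == 0:
        return 0.0
    all_busy = sum(
        1 for i in range(n)
        if all(s[i] >= busy_threshold for s in series)
    )
    return all_busy / n


# Hypothetical samples for devices 0, 1 and 2 (not data from the talk).
samples = {
    0: [90.0, 80.0, 0.0, 95.0],
    1: [85.0, 0.0, 0.0, 90.0],
    2: [70.0, 75.0, 10.0, 99.0],
}
print(concurrency_fraction(samples))
```

A value near 1.0 would suggest the devices are working in parallel for most of the run, and a low value would suggest they are taking turns or waiting on the host.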
Starving for CPU cycles
Device-related counters – device 0, 1, 2
• GPU compute activity %
• Global memory read/write activity %
Host-related counters
• CPU activity %
• Host memory allocation
Starving for CPU cycles
Device-related counters – device 0
• GPU compute activity %
• Global memory read/write activity %
Host-related counters
• CPU activity %
• Host memory allocation
Consuming a resource
Device-related counters – device 2
• GPU compute activity %
• Global memory allocated (bytes)
Host-related counters
• CPU activity %
GPU mystery
Device-related counters – device 0, 1
• GPU compute activity %
• Global memory read/write activity %
• GPU temperature (°C)
• GPU total power draw (watts)
Host-related counters
• CPU activity %
• Host memory allocation
PerfMon and CUDA
• What is there to monitor?
  - Speed (duration)
  - Resource utilization
  - Interactions between resources
• Why bother?
  - Prove that things are operating as expected
  - Make things run faster
  - Find performance bottlenecks
  - Identify resource contention
Questions / Comments