Post on 07-Oct-2020

PerfMon redux: analyzing a CUDA application with the Windows Performance Monitor

S6287

Richard Wilton

Department of Physics and Astronomy

Johns Hopkins University

S6287: Analyzing a CUDA application with PerfMon

What to monitor and why

• What is there to monitor?

  • Speed (duration)

  • Resource utilization

  • Interactions between resources

• Why bother?

  • Prove that things are operating as expected

  • Make things run faster

  • Find performance bottlenecks

  • Identify resource contention

Setup for performance monitoring

• Tools you need

  • Microsoft Windows

  • NVIDIA GPU and CUDA toolkit (NVML)

  • Microsoft Visual Studio (PerfLib v2)

• Monitoring setup

  • Target machine with target hardware

  • Application “release” build

  • Choose your performance counters

Choosing performance counters

Counters in the GPU group:

• Clock speed (MHz): memory

• Clock speed (MHz): SM

• Fan speed (% maximum)

• Global memory allocated (bytes)

• Global memory allocated (percent)

• Global memory free (bytes)

• Global memory read/write activity (%)

• GPU compute activity (%)

• GPU temperature (°C)

• GPU total power draw (watts)

• PCIe receive throughput (KB/s)

• PCIe transmit throughput (KB/s)

Monitoring everything at once is probably not a good idea.

Application pipeline (circa 2013)

• CPU compute activity

• GPU (CUDA) compute activity

GPU activity

Device-related counters – device 0, 1, 2

• GPU compute activity %

• Global memory read/write activity %

Host-related counters

• CPU activity %

• Host memory allocation

GPU activity

Device-related counters – device 0

• GPU compute activity %

• Global memory read/write activity %

Host-related counters

• CPU activity %

• Host memory allocation

Sampling → Jaggedness

Device-related counters – device 0

• GPU compute activity %

• Global memory read/write activity %

Sampled at 1-second intervals

Samples are “snapshots” (not averaged)
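The effect of snapshot sampling can be illustrated with a small simulation. This is only a sketch under stated assumptions: the workload (a GPU that alternates between 100 % and 0 % activity every two ticks, ten ticks per second) is hypothetical and is not taken from the application shown in the slides, and the function names are made up for illustration. The averaged series is included only for comparison; the counters described here report snapshots.

```python
def snapshot_samples(signal, period):
    """Take one instantaneous reading every `period` ticks (no averaging)."""
    return [signal[i] for i in range(0, len(signal), period)]


def averaged_samples(signal, period):
    """Average each `period`-tick window (shown only for comparison)."""
    windows = [signal[i:i + period] for i in range(0, len(signal), period)]
    return [sum(w) / len(w) for w in windows]


# Hypothetical workload: 6 seconds at 10 ticks per second.
# The GPU is busy for 2 ticks, idle for 2 ticks, so true utilization is 50 %.
TICKS_PER_SECOND = 10
signal = [100 if t % 4 < 2 else 0 for t in range(6 * TICKS_PER_SECOND)]

snapshots = snapshot_samples(signal, TICKS_PER_SECOND)
averages = averaged_samples(signal, TICKS_PER_SECOND)

print(snapshots)  # [100, 0, 100, 0, 100, 0]
print(averages)   # [60.0, 40.0, 60.0, 40.0, 60.0, 40.0]
```

In this toy case the snapshots swing between 0 and 100 even though the underlying utilization is a steady 50 %, which is one plausible source of a jagged trace at a 1-second sampling interval.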

Concurrency among multiple GPUs

Device-related counters – device 0, 1, 2

• GPU compute activity %

• Global memory read/write activity %

Host-related counters

• CPU activity %

• Host memory allocation

Concurrency among multiple GPUs

Device-related counters – device 0

• GPU compute activity %

• Global memory read/write activity %

Host-related counters

• CPU activity %

• Host memory allocation

Concurrency among multiple GPUs

Device-related counters – device 1

• GPU compute activity %

• Global memory read/write activity %

Host-related counters

• CPU activity %

• Host memory allocation

Concurrency among multiple GPUs

Device-related counters – device 2

• GPU compute activity %

• Global memory read/write activity %

Host-related counters

• CPU activity %

• Host memory allocation
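One way to put a number on concurrency is to count, per sample, how many devices were busy at the same time. The sketch below assumes the "GPU compute activity %" samples for devices 0, 1 and 2 have already been exported into rows of three values; the sample values, the 50 % "busy" threshold and the function names are all hypothetical.

```python
from collections import Counter


def busy_overlap(samples, threshold=50.0):
    """Fraction of samples in which every device is at or above `threshold`."""
    if not samples:
        return 0.0
    all_busy = sum(1 for row in samples if all(v >= threshold for v in row))
    return all_busy / len(samples)


def busy_counts(samples, threshold=50.0):
    """Map 'number of busy devices' to 'number of samples with that count'."""
    return dict(Counter(sum(v >= threshold for v in row) for row in samples))


# Hypothetical 1-second samples: (device 0, device 1, device 2) activity in %.
rows = [
    (95, 90, 88),
    (97, 0, 92),
    (0, 0, 0),
    (93, 96, 91),
    (90, 85, 0),
    (99, 94, 97),
]

print(busy_overlap(rows))  # 0.5
print(busy_counts(rows))   # {3: 3, 2: 2, 0: 1}
```

Because each sample is a snapshot, a summary like this is only as trustworthy as the sampling interval allows; it is meant to complement the per-device charts, not replace them.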

Starving for CPU cycles

Device-related counters – device 0, 1, 2

• GPU compute activity %

• Global memory read/write activity %

Host-related counters

• CPU activity %

• Host memory allocation

Starving for CPU cycles

Device-related counters – device 0

• GPU compute activity %

• Global memory read/write activity %

Host-related counters

• CPU activity %

• Host memory allocation
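A simple screen for CPU starvation is to flag samples where CPU activity is near saturation while the GPUs are mostly idle. The following is a rough sketch, not the method used in the talk: the thresholds (90 % CPU, 50 % mean GPU activity), the sample values and the function name are assumptions chosen for illustration.

```python
def cpu_starved_samples(samples, cpu_high=90.0, gpu_low=50.0):
    """Return indices of samples with high CPU activity and low mean GPU activity.

    Each sample is (cpu_percent, [gpu_percent_device0, gpu_percent_device1, ...]).
    """
    flagged = []
    for index, (cpu, gpus) in enumerate(samples):
        mean_gpu = sum(gpus) / len(gpus)
        if cpu >= cpu_high and mean_gpu < gpu_low:
            flagged.append(index)
    return flagged


# Hypothetical 1-second samples: CPU activity % and GPU compute activity %
# for devices 0, 1 and 2.
samples = [
    (45, [92, 95, 90]),  # GPUs busy, CPU has headroom
    (98, [20, 35, 10]),  # CPU saturated, GPUs mostly idle
    (99, [5, 0, 15]),    # CPU saturated, GPUs mostly idle
    (60, [88, 91, 85]),
    (97, [70, 80, 75]),  # CPU saturated, but GPUs are still being fed
]

print(cpu_starved_samples(samples))  # [1, 2]
```

A run of flagged samples suggests the GPUs may be waiting on host-side work, though the charts still need to be read in context before drawing that conclusion.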

Consuming a resource

Device-related counters – device 2

• GPU compute activity %

• Global memory allocated (bytes)

Host-related counters

• CPU activity %
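If the "Global memory allocated (bytes)" counter climbs steadily, its growth rate can be estimated from the samples and compared with the "Global memory free (bytes)" counter. The sketch below assumes a steadily growing allocation, which the slide text itself does not state; the sample values, the 1-second interval and the function names are hypothetical.

```python
def growth_rate(allocated_bytes, interval_s=1.0):
    """Least-squares slope of allocated memory, in bytes per second."""
    n = len(allocated_bytes)
    if n < 2:
        raise ValueError("need at least two samples")
    times = [i * interval_s for i in range(n)]
    mean_t = sum(times) / n
    mean_y = sum(allocated_bytes) / n
    numerator = sum((t - mean_t) * (y - mean_y)
                    for t, y in zip(times, allocated_bytes))
    denominator = sum((t - mean_t) ** 2 for t in times)
    return numerator / denominator


def seconds_until_full(free_bytes, rate):
    """Time until free memory runs out at `rate` bytes/s; None if not growing."""
    return free_bytes / rate if rate > 0 else None


MIB = 1 << 20

# Hypothetical samples: allocation grows by 64 MiB every second from 512 MiB.
allocated = [(512 + 64 * i) * MIB for i in range(6)]

rate = growth_rate(allocated)
print(rate / MIB)                            # 64.0
print(seconds_until_full(2048 * MIB, rate))  # 32.0
```

A least-squares fit is used rather than a first-to-last difference so that a single noisy snapshot has less influence on the estimate.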

GPU mystery

Device-related counters – device 0, 1

• GPU compute activity %

• Global memory read/write activity %

• GPU temperature (°C)

• GPU total power draw (watts)

Host-related counters

• CPU activity %

• Host memory allocation

PerfMon and CUDA

• What is there to monitor?

  • Speed (duration)

  • Resource utilization

  • Interactions between resources

• Why bother?

  • Prove that things are operating as expected

  • Make things run faster

  • Find performance bottlenecks

  • Identify resource contention

Questions / Comments