Energy Consumption of CUDA Kernels with Varying Thread Topology

Energy Consumption of CUDA Kernels with Varying Thread Topology Sebastian Dreßler & Thomas Steinke Zuse Institute Berlin September 12, 2012

Page 1: Energy Consumption of CUDA Kernels with Varying Thread Topology

Energy Consumption of CUDA Kernels with Varying Thread Topology

Sebastian Dreßler & Thomas Steinke, Zuse Institute Berlin

September 12, 2012

Page 2: Energy Consumption of CUDA Kernels with Varying Thread Topology

Outline

1 GPGPUs @ HPC

2 Energy Awareness of GPGPUs

3 Applied Methods

4 Results & Interpretation

5 Conclusion & Outlook

Energy Consumption of CUDA Kernels with Varying Thread Topology Sebastian Dreßler & Thomas Steinke [email protected] 2 / 25

Page 3: Energy Consumption of CUDA Kernels with Varying Thread Topology

GPGPUs @ HPC

• GPGPUs are increasingly used in modern HPC systems
• Top500 of June ’12:
  • 57 systems using accelerators in total
  • NVIDIA GPUs:
    • #5 (Tianhe-1A),
    • #10 (Nebulae), and
    • #14 (TSUBAME 2.0)

Energy awareness of accelerators becomes a key element in HPC systems.

Page 5: Energy Consumption of CUDA Kernels with Varying Thread Topology

Fermi GPU Architecture

[Figure: block diagram of a Fermi Streaming Multiprocessor (SM): instruction cache; two warp schedulers, each with a dispatch unit; register file (32,768 x 32-bit); 32 CUDA cores; 16 LD/ST units; 4 SFUs; interconnect network; 64 KB shared memory / L1 cache; uniform cache. Inset of a single CUDA core: dispatch port, operand collector, FP unit, INT unit, result queue.]

Image copyright by NVIDIA, Source: NVIDIA Fermi Architecture Whitepaper

Page 6: Energy Consumption of CUDA Kernels with Varying Thread Topology

Energy Awareness of GPGPUs

Page 7: Energy Consumption of CUDA Kernels with Varying Thread Topology

Measurements I

• E measurement, GPGPU vs. CPU (Rofouei et al. [6])
  • Scan application measured
  • Higher E consumption, much lower run-time → improved E footprint
• E measurement on heterogeneous systems (McIntosh-Smith et al. [5])
  • NVIDIA beats AMD on E efficiency
  • GPGPU: in general better E footprint than multi-core CPUs only

Page 8: Energy Consumption of CUDA Kernels with Varying Thread Topology

Measurements II

• Relation of utilized SMs vs. P̄ (Collange et al. [1])
  • Steep linear rise until half of the SMs are utilized
  • More than half of the SMs utilized → flatter linear rise
• E measurement for block / thread combinations (Huang et al. [4])
  • Single application measured
  • Lower runtime → lower energy consumption

Applied methods: external hardware + software logging

Page 10: Energy Consumption of CUDA Kernels with Varying Thread Topology

Predictive Models

• Prediction of the # of utilized SMs for best EC (Hong and Kim [3])
  • Goal: decreased power consumption
  • Result: avg. E saving of ≈26%
• Fine-grained model for P̄ at instruction level (Haifeng and Qingkui [2])
  • Decomposition of PTX instructions into groups
    • Arithmetic, memory transfer, control, ...
  • Reference P̄ measurement + instruction group count → P̄ prediction
  • Error: 5%

Page 11: Energy Consumption of CUDA Kernels with Varying Thread Topology

Goal of Our Work

• With respect to measurements
  • Switch from hardware (power meter, oscilloscope, ...) to software
  • Migration to software-controlled remote measurement
  • Supported by upcoming integration of power sensors
    • Fermi architecture: HW power sensor available on Tesla M2090
  • Provide fine-grained measurement for instructions and applications
• Concerning predictive models
  • Clearly distinguish energy consumption from power consumption
  • Provide additional information to improve models
  • Demonstration of the thread scheduler’s impact

Page 12: Energy Consumption of CUDA Kernels with Varying Thread Topology

Applied Methods

Page 13: Energy Consumption of CUDA Kernels with Varying Thread Topology

Software Measurement Framework

• NVIDIA Management Library (NVML)
  • Measures instantaneous overall power consumption P of a GPU card
• Framework: threaded library using NVML
  • On Tesla M2090: a new sample every 20 ms
• Measurement accuracy
  • Assumption: sensor values are correct
  • Captured multiple runtime profiles with constant card utilization
  • Statistical analysis on runtime profiles
    • P̄ = 165.3 W ± 0.73 W
    • Relative uncertainty: 0.44 %
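
The accuracy figures above, and the step from sampled power to energy, can be sketched in host-side C++. This is a minimal illustration, not the framework's actual code: the names `PowerStats`, `power_stats`, and `energy_joules` are hypothetical, treating the ± value as a sample standard deviation is an assumption, and so is the trapezoidal integration over a fixed sampling interval.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Summary of a power profile; all power values are in watts.
struct PowerStats {
    double mean;        // average power (P-bar on the slides)
    double stddev;      // sample standard deviation (assumed meaning of "±")
    double rel_uncert;  // stddev / mean
};

// Mean, sample standard deviation, and relative uncertainty of power samples.
PowerStats power_stats(const std::vector<double>& watts) {
    assert(watts.size() >= 2);
    double sum = 0.0;
    for (double w : watts) sum += w;
    const double mean = sum / watts.size();

    double sq = 0.0;
    for (double w : watts) sq += (w - mean) * (w - mean);
    const double sd = std::sqrt(sq / (watts.size() - 1));

    return {mean, sd, sd / mean};
}

// Energy in joules from equally spaced power samples (trapezoidal rule).
// dt_seconds is the sampling interval, e.g. 0.020 for one sample per 20 ms.
double energy_joules(const std::vector<double>& watts, double dt_seconds) {
    assert(watts.size() >= 2);
    double e = 0.0;
    for (std::size_t i = 1; i < watts.size(); ++i)
        e += 0.5 * (watts[i - 1] + watts[i]) * dt_seconds;
    return e;
}
```

As a check on the quoted numbers: 0.73 W / 165.3 W ≈ 0.0044, which matches the stated relative uncertainty of 0.44 %.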

Page 14: Energy Consumption of CUDA Kernels with Varying Thread Topology

Instruction Level Kernels

• Purpose: Measure energy consumption at instruction level

Listing 1: Instruction level kernel example

     1  float r1, r2, r3;
     2
     3  for (int i = 0; i < RUNS; i++) {
     4      r3 = r1 + r2;
     5      r2 = r3 + r1;
     6      r1 = r2 + r3;
     7      [...]
     8      r3 = r1 + r2;
     9      r2 = r3 + r1;
    10      r1 = r2 + r3;
    11  }

• Runtime > 20 ms (l. 3)
• Obfuscate that “code does nothing” (ll. 4-10)
• Ensured above points with PTX code analysis
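
One plausible way to turn the measured energy of such a kernel into a per-instruction value is to divide by the number of executed instruction instances. The slides do not state how the per-instruction numbers were derived, so the sketch below is an assumption; the function name and all parameters are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical normalization: total kernel energy divided by the number of
// executed instances of the instruction under test, returned in microjoules.
// instr_per_iteration is the number of instructions in one loop iteration.
double energy_per_instruction_uj(double kernel_energy_j,
                                 std::uint64_t blocks,
                                 std::uint64_t threads_per_block,
                                 std::uint64_t runs,
                                 std::uint64_t instr_per_iteration) {
    const double count = static_cast<double>(blocks) * threads_per_block *
                         runs * instr_per_iteration;
    assert(count > 0.0);
    return kernel_energy_j / count * 1e6;  // J -> uJ
}
```

For example, 1 J spent by 10 blocks of 100 threads, each running 1000 iterations of a single instruction, corresponds to about 1 µJ per instruction instance under this normalization.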

Page 15: Energy Consumption of CUDA Kernels with Varying Thread Topology

Application Level Kernels

• Purpose: Measure energy consumption at application level

Listing 2: Weak scaling: vector norm

    void norm(
        double *v,
        double *norm) {

        int i;
        int idx = [...];
        double a = 0.0;

        for (i = 0; i < SVEC; i++) {
            a += pow(  // Inefficient code
                v[idx + i * SVEC], 2.0f
            );
        }

        norm[idx] = sqrt(a);
    }

Listing 3: Strong scaling: vector calculation

    void vecpowadd(
        double ma, double mb,
        double *vec_a,
        double *vec_b) {

        int i;
        int idx = [...];
        int L = [...];

        for (i = 0; i < SVEC / L; i++) {
            vec_a[idx + i] += pow(
                vec_a[idx + L],
                vec_b[idx + L]
            );
        }
    }

Page 16: Energy Consumption of CUDA Kernels with Varying Thread Topology

Results & Interpretation

Page 17: Energy Consumption of CUDA Kernels with Varying Thread Topology

Single Instruction SP Floating Point

[Figure: 3D surface plot of energy (µJ, axis 0 to 24) over blocks (0 to 64) and threads (0 to 1024); color scale 0 to 22 µJ.]

Page 18: Energy Consumption of CUDA Kernels with Varying Thread Topology

Vector Norm Application Kernel (Weak Scaling)

[Figure: 3D surface plot of energy (J, axis 0 to 2.5) over blocks (0 to 64) and threads (0 to 1024); color scale 0 to 2.2 J.]

Page 19: Energy Consumption of CUDA Kernels with Varying Thread Topology

Vector Calc. Application Kernel (Strong Scaling)

[Figure: line plot of energy (axis 0 to 45, unit not shown) versus threads (0 to 1024), one curve each for 16, 32, 48, and 64 blocks.]

Page 20: Energy Consumption of CUDA Kernels with Varying Thread Topology

P̄ vs. E

• Differentiation between P̄ and E is necessary
• Taking only P̄ leads to wrong conclusions

[Figure: 3D surface plot of average power consumption (axis 400 to 800, unit not shown) over blocks (0 to 64) and threads (768 to 800); color scale 450 to 800. Annotation: "Reason for breakdowns?"]

[Figure: power consumption in W (axis 130 to 190) versus samples (0 to 340), with curves "Power Consumption" and "Average Power Consumption". Annotations: "Range for avg. power calculation" and "Different utilization due to scheduling".]
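
Why P̄ alone can mislead follows from E = P̄ · t: a configuration that draws more average power can still use less energy if it finishes sooner. The sketch below illustrates this with made-up numbers; the `Run` type and both function names are hypothetical and not part of the authors' framework.

```cpp
#include <cassert>

// One kernel run: average power in watts and runtime in seconds.
struct Run {
    double avg_power_w;
    double runtime_s;
};

// Energy in joules for a run: E = P_avg * t.
double energy_j(const Run& r) { return r.avg_power_w * r.runtime_s; }

// True if run `a` draws more average power than run `b` yet consumes less
// energy, i.e. ranking the two by average power alone picks the wrong one.
bool power_misleads(const Run& a, const Run& b) {
    return a.avg_power_w > b.avg_power_w && energy_j(a) < energy_j(b);
}
```

With illustrative values of 180 W for 0.05 s (9 J) against 150 W for 0.08 s (12 J), the run with the higher average power is the more energy-efficient one.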

Page 22: Energy Consumption of CUDA Kernels with Varying Thread Topology

Scheduler Capabilities (Likely)

Notable result: energy breakdowns with increasing utilization

[Figure: energy (axis 5 to 14, unit not shown) versus threads, one curve for 32 blocks (axis "Threads for b = 32", 640 to 1024) and one for 64 blocks (axis "Threads for b = 64", 256 to 640).]

• Most likely reason: two distinct scheduling cases
  1. Scheduler can provide low delay for outstanding requests (low E)
  2. The opposite case (high E)
• Sometimes the scheduler is able to switch to optimal scheduling

Page 24: Energy Consumption of CUDA Kernels with Varying Thread Topology

Relative Energy

[Figure: line plot of relative energy (axis 0 to 16) versus threads (0 to 1024), one curve each for 16, 32, 48, and 64 blocks.]

• Linear dependence between energy consumption and threads per block
• Optimal scheduling: linear increase of energy
• Suboptimal scheduling: energy jumps
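
The distinction between the linear trend and the jumps can be made mechanical: fit a straight line to points believed to be optimally scheduled, then flag measurements that lie well above it. The sketch below is one possible approach, not the authors' analysis; the names `Line`, `fit_line`, and `find_jumps`, and the choice of a fixed tolerance, are assumptions.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Line {
    double slope;
    double intercept;
};

// Ordinary least-squares fit of y = slope * x + intercept.
Line fit_line(const std::vector<double>& x, const std::vector<double>& y) {
    assert(x.size() == y.size() && x.size() >= 2);
    const double n = static_cast<double>(x.size());
    double sx = 0.0, sy = 0.0, sxx = 0.0, sxy = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        sx += x[i];
        sy += y[i];
        sxx += x[i] * x[i];
        sxy += x[i] * y[i];
    }
    const double denom = n * sxx - sx * sx;
    assert(denom != 0.0);  // requires at least two distinct x values
    const double slope = (n * sxy - sx * sy) / denom;
    return {slope, (sy - slope * sx) / n};
}

// Indices of points whose energy exceeds the reference line by more than
// `tolerance`; these are candidates for suboptimal scheduling.
std::vector<std::size_t> find_jumps(const std::vector<double>& threads,
                                    const std::vector<double>& energy,
                                    const Line& ref, double tolerance) {
    assert(threads.size() == energy.size());
    std::vector<std::size_t> jumps;
    for (std::size_t i = 0; i < threads.size(); ++i) {
        const double expected = ref.slope * threads[i] + ref.intercept;
        if (energy[i] - expected > tolerance) jumps.push_back(i);
    }
    return jumps;
}
```

The reference line should be fitted only on points known to follow the linear trend; fitting on data that already contains jumps would bias the slope upward and hide them.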

Page 25: Energy Consumption of CUDA Kernels with Varying Thread Topology

Conclusion & Outlook

• Software library for P̄ measurement and E calculation implemented
  • Based on NVML, high accuracy and remote measurement
  • LGPL licensed, available on GitHub
    • CUDA Power and Energy Measurement Framework
    • https://github.com/sdressler/CUDA-PEMF
• Showed impact of thread scheduler, very likely two categories
  • Optimal scheduling → energy efficient
  • Suboptimal scheduling → energy inefficient
  • Open question: assumption to be evaluated by NVIDIA
• Outlook
  • Investigate scheduler impact further
  • Based on results: improve predictive models further

Page 26: Energy Consumption of CUDA Kernels with Varying Thread Topology

Thank you for your attention.

Page 27: Energy Consumption of CUDA Kernels with Varying Thread Topology

[1] S. Collange, D. Defour, and A. Tisserand. Power Consumption of GPUs from a Software Perspective. In Proceedings of the 9th International Conference on Computational Science: Part I, ICCS ’09, pages 914–923, Berlin, Heidelberg, 2009. Springer-Verlag.

[2] W. Haifeng and C. Qingkui. An Energy Consumption Model for GPU Computing at Instruction Level. IJACT, 4(2):192–200, Feb. 2012.

[3] S. Hong and H. Kim. An integrated GPU power and performance model. SIGARCH Comput. Archit. News, 38(3):280–289, June 2010.

Page 28: Energy Consumption of CUDA Kernels with Varying Thread Topology

[4] S. Huang, S. Xiao, and W. Feng. On the energy efficiency of graphics processing units for scientific computing. In Proceedings of the 2009 IEEE International Symposium on Parallel & Distributed Processing, IPDPS ’09, pages 1–8, Washington, DC, USA, 2009. IEEE Computer Society.

[5] S. McIntosh-Smith, T. Wilson, A. A. Ibarra, J. Crisp, and R. B. Sessions. Benchmarking energy efficiency, power costs and carbon emissions on heterogeneous systems. The Computer Journal, 55(2):192–205, 2012.

Page 29: Energy Consumption of CUDA Kernels with Varying Thread Topology

[6] M. Rofouei, T. Stathopoulos, S. Ryffel, W. Kaiser, and M. Sarrafzadeh. Energy-aware high performance computing with graphic processing units. In Proceedings of the 2008 conference on Power aware computing and systems, HotPower’08, pages 11–11, Berkeley, CA, USA, 2008. USENIX Association.