The impact of process placement on job cost and performance€¦ · 5 Managed by UT-Battelle for...

12
Managed by UT-Battelle for the Department of Energy The impact of process placement on job cost and performance Using Titan’s Turbo Switch

Transcript of The impact of process placement on job cost and performance€¦ · 5 Managed by UT-Battelle for...

Page 1: The impact of process placement on job cost and performance€¦ · 5 Managed by UT-Battelle for the Department of Energy What is aprun’s default placement • The integer units

Managed by UT-Battelle for the Department of Energy

The impact of process placement on job cost

and performance

Using Titan’s Turbo Switch

Page 2: The impact of process placement on job cost and performance€¦ · 5 Managed by UT-Battelle for the Department of Energy What is aprun’s default placement • The integer units

2 Managed by UT-Battelle for the Department of Energy

Outline

•  Who is the target audience?

•  An overview of Titan’s CPUs

•  What is aprun’s default placement?

•  Avoiding floating point contention

•  Best practices

•  What about Eos?

Page 3: The impact of process placement on job cost and performance€¦ · 5 Managed by UT-Battelle for the Department of Energy What is aprun’s default placement • The integer units

3 Managed by UT-Battelle for the Department of Energy

Who is the target audience?

•  Any user whose application primarily uses the CPUs’ floating point units (FPUs)

•  Any user whose application uses 2-8 MPI ranks per node –  e.g. needs more memory per MPI rank

•  This does not apply to codes using the GPUs –  Unless they also heavily use the CPU FPUs

Page 4: The impact of process placement on job cost and performance€¦ · 5 Managed by UT-Battelle for the Department of Energy What is aprun’s default placement • The integer units

4 Managed by UT-Battelle for the Department of Energy

An overview of Titan’s CPUs

•  AMD Opteron 6274 (Interlagos)

•  2 NUMA domains

•  8 Bulldozer modules –  4 per NUMA domain

•  Each Bulldozer has: –  2 Integer Units –  1 Floating Point Unit

(FPU) https://www.olcf.ornl.gov/kb_articles/xk7-titan-node-description/

Page 5: The impact of process placement on job cost and performance€¦ · 5 Managed by UT-Battelle for the Department of Energy What is aprun’s default placement • The integer units

5 Managed by UT-Battelle for the Department of Energy

What is aprun’s default placement

•  The integer units are numbered 0-15

•  By default, aprun will: –  place 16 processes per node –  place the processes sequentially starting at core 0

•  Two processes will be assigned to each Bulldozer and each pair will have to share a FPU –  Even if you only request 8 processes per node,

they will be placed on cores 0-7 and compute for the first four FPUs

Page 6: The impact of process placement on job cost and performance€¦ · 5 Managed by UT-Battelle for the Department of Energy What is aprun’s default placement • The integer units

6 Managed by UT-Battelle for the Department of Energy

Avoiding floating point contention

•  Request twice as many nodes from qsub –  Will consume 2x the node-hours

•  Place 8 processes per node (-N 8 or -S 4)

•  Use only 1 process per FPU (-j 1) –  Don’t use -d 2 (has different behavior on Eos)

Page 7: The impact of process placement on job cost and performance€¦ · 5 Managed by UT-Battelle for the Department of Energy What is aprun’s default placement • The integer units

7 Managed by UT-Battelle for the Department of Energy

Avoiding floating point contention

•  Requesting 2x nodes is not enough

•  You must restrict the processes to one per FPU

•  Using -j 1 is easier than specifying the cores to use (-cc)

•  Set -r 1 as well to have OS processes run on core 15

1.00

1.69 1.69

1.01

0%

25%

50%

75%

100%

125%

150%

175%

200%

Default Even-OS Core Odd-OS Core Half-OS Core

S3D speedup when using twice the nodes

Page 8: The impact of process placement on job cost and performance€¦ · 5 Managed by UT-Battelle for the Department of Energy What is aprun’s default placement • The integer units

8 Managed by UT-Battelle for the Department of Energy

Avoiding floating point contention

•  FPU-intensive apps see ~2.0x speedup –  e.g. HPL

•  Real apps do I/O or have non-floating-point sections

•  Typically, 1.4-1.7x speedup

•  Run your app to determine what, if any, speedup you get

Page 9: The impact of process placement on job cost and performance€¦ · 5 Managed by UT-Battelle for the Department of Energy What is aprun’s default placement • The integer units

9 Managed by UT-Battelle for the Department of Energy

Best practices

•  If I see a speedup, should I always try to avoid FPU contention?

•  You are trading 2x the node-hours for 1.4-1.7x speedup

•  If you are debugging, conserve your node-hours

•  If you want to do your work faster, avoid FPU contention –  Bonus: larger jobs get priority in the queue

Page 10: The impact of process placement on job cost and performance€¦ · 5 Managed by UT-Battelle for the Department of Energy What is aprun’s default placement • The integer units

10 Managed by UT-Battelle for the Department of Energy

What about Eos?

•  2 NUMA domains

•  16 Cores (8/NUMA)

•  Hyper-Threading –  32 integer cores –  but only 16 FPUs

•  Yes, it works here to, but…

•  Intel Xeon 5E-2670

https://www.olcf.ornl.gov/kb_articles/xc30-cpu-description/

Page 11: The impact of process placement on job cost and performance€¦ · 5 Managed by UT-Battelle for the Department of Energy What is aprun’s default placement • The integer units

11 Managed by UT-Battelle for the Department of Energy

What about Eos?

•  By default, aprun uses -j 1 to place one process per core (i.e. no Hyper-Threading) –  Opposite policy from Titan

•  You can conserve node-hours –  Request ½ the nodes form qsub –  Pass -j 2 to aprun –  e.g. debugging

Page 12: The impact of process placement on job cost and performance€¦ · 5 Managed by UT-Battelle for the Department of Energy What is aprun’s default placement • The integer units

12 Managed by UT-Battelle for the Department of Energy

Thank you

•  Questions?