Trends in Efficient Parallel Computing and Performance


Transcript of Trends in Efficient Parallel Computing and Performance

Trends in systems and how to get efficient performance

Martin Hilgeman

HPC Consultant

[email protected]


The landscape is changing

“We are no longer in the general purpose era… the argument of tuning software for hardware is moot. Now, to get the best bang for the buck, you have to tune both.”

https://www.nextplatform.com/2017/03/08/arm-amd-x86-server-chips-get-mainstream-lift-microsoft/amp/

- Kushagra Vaid, general manager of server engineering, Microsoft Cloud Solutions


System trend over the years (1)

[Diagram: single-core processors from ~1970 to 2000; in 2005 the move to multi-core: TOCK]


System trend over the years (2)

[Diagram: 2007: integrated memory controller (TOCK); 2012: integrated PCIe controller (TOCK)]


Future

[Diagram: integrated network fabric adapter (TOCK); SoC designs (TOCK)]


Moore’s Law vs. Amdahl’s Law

• The clock speed plateau

• The power ceiling

• IPC limit

• Industry is applying Moore’s Law by adding more cores

• Meanwhile, Amdahl’s Law says that you cannot use them all efficiently

Chuck Moore, “Data Processing in Exascale-Class Computer Systems”, The Salishan Conference on High Speed Computing, 2011

[Chart: wall clock time (serial plus parallel fraction) versus number of cores (1, 2, 4, 12, 16, 24, 32) for an 80% parallel application; the speedup saturates at 1.00, 1.67, 2.50, 3.75, 4.00, 4.29, 4.44]
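The saturation in the chart is just Amdahl’s Law. A minimal sketch that reproduces the numbers above for an 80% parallel application (the script is illustrative, not part of the original deck):

```python
# Amdahl's Law: speedup(n) = 1 / ((1 - p) + p / n)
# p is the parallel fraction of the run time, n the number of cores.

def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)

if __name__ == "__main__":
    p = 0.80  # 80% parallel application, as on the slide
    for n in (1, 2, 4, 12, 16, 24, 32):
        print(f"{n:3d} cores -> speedup {amdahl_speedup(p, n):.2f}")
    # Prints 1.00, 1.67, 2.50, 3.75, 4.00, 4.29, 4.44 and approaches the
    # asymptote 1 / (1 - p) = 5, no matter how many cores are added.
```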


Moore’s Law vs. Amdahl’s Law: “Too Many Cooks in the Kitchen”

Industry is applying Moore’s Law by adding more cores; meanwhile Amdahl’s Law says that you cannot use them all efficiently.


Improving performance - what levels do we have?

• Challenge: Sustain the performance trajectory without massive increases in cost, power, real estate, and unreliability

• Solutions: No single answer, must intelligently turn “Architectural Knobs”

Performance = Freq × (cores / socket) × #sockets × (inst or ops / core / clock) × Efficiency

The first four factors are the hardware performance knobs (1-4); the fifth knob, efficiency, is the software performance term, and that is what determines what you really get.
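To make the knob formula concrete, here is a minimal sketch that multiplies the hardware factors out to a peak FLOP/s number. The 16 and 32 FLOP/core/clock values are the usual AVX2 and AVX-512 FMA throughputs, and the sustained clocks are assumptions for illustration; they only roughly reproduce the ~1.4 TF and ~3.3 TF peaks quoted on the roofline slides later in the deck.

```python
# Peak FLOP/s = Freq x (cores/socket) x #sockets x FLOP per core per clock.
# Knob 5 (efficiency) is the fraction of this that software actually delivers.

def peak_gflops(freq_ghz, cores_per_socket, sockets, flop_per_core_per_clock):
    return freq_ghz * cores_per_socket * sockets * flop_per_core_per_clock

# Dual-socket E5-2699 v4: 22 cores, ~2.1 GHz, 16 FLOP/clock with AVX2 FMA (assumed)
print(peak_gflops(2.1, 22, 2, 16))   # ~1478 GFLOP/s, roughly 1.4 TF

# Dual-socket Xeon 8180: 28 cores, ~1.9 GHz sustained AVX-512 clock (assumed), 32 FLOP/clock
print(peak_gflops(1.9, 28, 2, 32))   # ~3405 GFLOP/s, roughly 3.3 TF
```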


Turning the knobs 1 - 4

1. Frequency is unlikely to change much: thermal, power and leakage challenges.

2. Moore’s Law still holds (130 nm -> 14 nm): LOTS of transistors for more cores per socket.

3. The number of sockets per system is the easiest knob, but challenging for power, density, cooling and networking.

4. IPC still grows: FMA, AVX and accelerator implementations for algorithms, but challenging for the user/developer.


Turning knob #5

Hardware tuning knobs are limited, but there is far more possible in the software layer.

From easy to hard, the tuning options span the whole stack (a minimal affinity example follows this list):

• Hardware: BIOS settings, P-states, memory profile

• Operating system: I/O cache tuning, process affinity, memory allocation

• Middleware: MPI (parallel) tuning, use of performance libs (math, I/O, IPP)

• Application: compiler hints, source changes, adding parallelism
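As a small, hedged illustration of the process-affinity knob in the operating-system layer above (Linux only; the core set is purely illustrative and would normally be derived from the node topology):

```python
import os

# Pin the current process to four cores; pid 0 means "this process".
os.sched_setaffinity(0, {0, 1, 2, 3})

print("running on cores:", sorted(os.sched_getaffinity(0)))
```

In practice the MPI launcher or OpenMP runtime usually sets affinity for you through its own binding options, which is the more convenient route for parallel jobs.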


New capabilities according to Intel

2007: SSSE3; 2009: SSE4; 2012-2013: AVX; 2014-2015: AVX2; 2017: AVX-512


The state of ISV software

Vectorization support by segment and application:

• CFD (Fluent, LS-DYNA, STAR-CCM+): limited SSE2 support

• CSM (CFX, RADIOSS, Abaqus): limited SSE2 support

• Weather (WRF, UM, NEMO, CAM): yes

• Oil and gas, seismic processing: not applicable

• Oil and gas, reservoir simulation: yes

• Chemistry (Gaussian, GAMESS, Molpro): not applicable

• Molecular dynamics (NAMD, GROMACS, Amber, …): PME kernels support SSE2

• Biology (BLAST, Smith-Waterman): not applicable

• Molecular mechanics (CPMD, VASP, CP2k, CASTEP): yes

Bottom line: ISV support for new instructions is poor. It is less of an issue for in-house developed codes, but programming is hard.


Add to this the Memory Bandwidth and System Balance

Obtained from: http://sc16.supercomputing.org/2016/10/07/sc16-invited-talk-spotlight-dr-john-d-mccalpin-presents-memory-bandwidth-system-balance-hpc-systems/


What does Intel do about these trends?

• QPI bandwidth: Westmere: no problem; Sandy Bridge: even better; Ivy Bridge: two snoop modes; Haswell: three snoop modes; Broadwell: four (!) snoop modes; Skylake: UPI and COD snoop modes.

• Memory bandwidth: Westmere: no problem; Sandy Bridge: extra memory channel; Ivy Bridge: larger cache; Haswell: extra load/store units; Broadwell: larger cache; Skylake: extra load/store units and +50% memory channels.

• Core frequency: Westmere: no problem; Sandy Bridge: more cores, AVX, better Turbo; Ivy Bridge: even more cores, above-TDP Turbo; Haswell: still more cores, AVX2, per-core Turbo; Broadwell: again even more cores, optimized FMA, per-core Turbo based on instruction type; Skylake: more cores, larger OOO engine, AVX-512, three different core frequency modes.

The roofline model


Predicting performance – the roofline model

Predict system performance as a function of peak performance, maximum bandwidth and arithmetic intensity.

Obtained from: https://crd.lbl.gov/departments/computer-science/PAR/research/roofline/


Roofline variables and terms

“Work” (floating-point operations performed) and “Memory Traffic” (bytes moved between the processor and memory)


The last variable: Arithmetic intensity
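Arithmetic intensity is the ratio of the two quantities on the previous slide: FLOPs of work per byte of memory traffic. As a worked example (my own illustration, using the STREAM Triad kernel that shows up on the Broadwell results slide later):

```python
# STREAM Triad: a[i] = b[i] + scalar * c[i]
# Per element: 2 FLOPs and, in double precision, at least 24 bytes of traffic
# (read b and c, write a); write-allocate adds another 8 bytes per element.

flops_per_element = 2
bytes_per_element = 3 * 8          # ignoring write-allocate traffic

ai = flops_per_element / bytes_per_element
print(f"Triad arithmetic intensity ~ {ai:.3f} FLOP/byte")   # ~0.083
```

At ~0.08 FLOP/byte even 100 GB/s of memory bandwidth sustains only about 8 GFLOP/s, which is why Triad sits at the far left of every roofline that follows.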


Roofline model of a modern system

[Roofline chart: GFLOP/s versus arithmetic intensity (FLOP/Byte). Left of the ridge point the code is memory bound and “better use of caches” moves it up the sloped bandwidth ceiling; right of it the code is compute bound and “micro-optimization” pushes it toward the flat FP peak. Williams et al.: “Roofline: An Insightful Visual Performance Model”, CACM 2009]
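The model itself is one line of arithmetic: attainable performance is the smaller of the FP peak and memory bandwidth times arithmetic intensity. A minimal sketch, with placeholder peak and bandwidth values rather than measurements from the deck:

```python
def roofline_gflops(ai_flop_per_byte, peak_gflops, bw_gb_per_s):
    """Attainable GFLOP/s = min(FP peak, memory bandwidth x arithmetic intensity)."""
    return min(peak_gflops, bw_gb_per_s * ai_flop_per_byte)

# Illustrative dual-socket node: ~1400 GFLOP/s peak, ~150 GB/s sustained bandwidth
for ai in (0.08, 0.25, 1.0, 10.0):
    print(f"AI {ai:5.2f} FLOP/byte -> {roofline_gflops(ai, 1400, 150):7.1f} GFLOP/s")
# The ridge point sits at peak / bandwidth ~ 9.3 FLOP/byte: to the left the
# kernel is memory bound, to the right it is compute bound.
```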


The Broadwell baseline

[Roofline for the E5-2699 v4 (22C, 2.1 GHz): GFLOP/s versus FLOP/byte, FP peak ~1.4 TF]


Intel Skylake-SP

[Roofline comparing the E5-2699 v4 (22C, 2.1 GHz) with the Xeon 8180 (28C, 2.5 GHz): Skylake-SP FP peak ~3.3 TF]


What about AMD Epyc?

[Roofline adding the AMD Epyc 7601 (32C, 2.2 GHz) to the E5-2699 v4 and Xeon 8180 curves: Epyc FP peak ~1 TF]


What about accelerators?

[Roofline adding the Xeon Phi 7250 (MCDRAM and DDR) and an NVIDIA P100 to the previous curves; FP peaks around 3.3 TF and 4.7 TF. Williams et al.: “Roofline: An Insightful Visual Performance Model”, CACM 2009]


Ideal roofline on a sunny day

[Roofline with the full AVX-512 compute ceiling in place]


Might need to repair some rooftiles…

[Roofline with an additional AVX2 ceiling: half the register width of AVX-512, so half the compute ceiling]


It’s worse than we thought…

[Roofline adding a “no FMA” ceiling below the AVX-512 and AVX2 ceilings: without fused multiply-add the compute ceiling halves again]


Everything is failing on us!

[Roofline adding a “no SIMD” ceiling: scalar code sits far below the AVX2 and AVX-512 ceilings]


The reality after hurricane Irma

[Roofline with the final “no ILP” ceiling: without instruction-level parallelism even the scalar peak is out of reach]
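The successive rooftops come from stripping one hardware feature at a time out of the peak-FLOP/s product. The sketch below does that for an assumed 2-socket, 56-core node at ~2 GHz with two FMA ports per core; all of those numbers are illustrative assumptions, not measurements from the deck.

```python
# Per-core FLOP/clock = SIMD lanes x (2 if FMA else 1) x independent FMA ports (ILP).

def node_peak_gflops(ghz, cores, simd_lanes, fma, ports):
    return ghz * cores * simd_lanes * (2 if fma else 1) * ports

ghz, cores = 2.0, 56       # assumed: 2 sockets x 28 cores at ~2 GHz sustained
ceilings = {
    "AVX-512": node_peak_gflops(ghz, cores, 8, True,  2),
    "AVX2":    node_peak_gflops(ghz, cores, 4, True,  2),
    "no FMA":  node_peak_gflops(ghz, cores, 4, False, 2),
    "no SIMD": node_peak_gflops(ghz, cores, 1, False, 2),
    "no ILP":  node_peak_gflops(ghz, cores, 1, False, 1),
}
for name, gf in ceilings.items():
    print(f"{name:8s} ceiling ~ {gf:6.0f} GFLOP/s")
# ~3584, 1792, 896, 224, 112 GFLOP/s: the compute ceiling collapses step by step.
```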


A few Broadwell results

[Measured roofline on Broadwell (FP peak ~1 TF, with “no FMA” and “no SIMD” ceilings), showing where STREAM Triad, HPL, GROMACS, OpenFOAM, CP2k, HPCG and sparse CFD solvers land between the memory-bound and compute-bound regions]
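A quick way to read this chart: the ridge point of the Broadwell roofline tells you which of these codes can ever approach the FP peak. The bandwidth figure below is an assumed, illustrative value, not a measurement from the slide.

```python
peak_gflops = 1000.0    # ~1 TF FP peak, as on the slide
stream_bw   = 140.0     # GB/s, assumed sustained STREAM bandwidth

ridge = peak_gflops / stream_bw
print(f"ridge point ~ {ridge:.1f} FLOP/byte")
# Codes with far lower arithmetic intensity (STREAM Triad, HPCG, sparse CFD
# solvers) stay memory bound; dense kernels like HPL can approach the FP peak.
```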

A case study on power


Key Aspects of Acceleration

We have lots of transistors… Moore’s law is holding; this isn’t necessarily the problem.

We don’t really need lower power per transistor, we need lower power per operation.

How to do this?


Performance and Efficiency with Intel® AVX-512

LINPACK performance (GFLOPs, system power and core frequency per instruction set):

• SSE4.2: 669 GFLOPs, 760 W, 3.1 GHz

• AVX: 1178 GFLOPs, 768 W, 2.8 GHz

• AVX2: 2034 GFLOPs, 791 W, 2.5 GHz

• AVX-512: 3259 GFLOPs, 767 W, 2.1 GHz

Normalized to SSE4.2, GFLOPs/Watt improves 1.00 -> 1.74 -> 2.92 -> 4.83 and GFLOPs/GHz improves 1.00 -> 1.95 -> 3.77 -> 7.19 across SSE4.2, AVX, AVX2 and AVX-512.
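The two normalized series are easy to reproduce from the absolute numbers above; a small sketch (values transcribed from the slide):

```python
# (GFLOPs, Watts, GHz) per instruction set, as reported on the slide.
results = {
    "SSE4.2":  (669,  760, 3.1),
    "AVX":     (1178, 768, 2.8),
    "AVX2":    (2034, 791, 2.5),
    "AVX-512": (3259, 767, 2.1),
}

base_gf, base_w, base_ghz = results["SSE4.2"]
for isa, (gf, watt, ghz) in results.items():
    per_watt = (gf / watt) / (base_gf / base_w)     # GFLOPs/Watt, normalized to SSE4.2
    per_ghz  = (gf / ghz)  / (base_gf / base_ghz)   # GFLOPs/GHz, normalized to SSE4.2
    print(f"{isa:8s}  {per_watt:.2f}x per Watt  {per_ghz:.2f}x per GHz")
# Reproduces the 1.00 / 1.74 / 2.92 / 4.83 and 1.00 / 1.95 / 3.77 / 7.19 series.
```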

Intel® AVX-512 delivers significant performance and efficiency gains


Source as of June 2017: Intel internal measurements on platform with Xeon Platinum 8180, Turbo enabled, UPI=10.4, SNC1, 6x32GB DDR4-2666 per CPU, 1 DPC. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

Slide taken from Intel


Powerful instructions to save power

For LINPACK, powerful instructions can bring significant performance gains.

What about real applications?

The NAS parallel benchmarks are “mini applications” containing kernels for the major HPC workload types.


NPB kernels

• CG: conjugate gradient (memory-latency bound)

• MG: multigrid (memory intensive)

• FT: Fourier transform (compute and transpose)

• BT: block tridiagonal solver

• SP: scalar pentadiagonal solver

• LU: lower-upper Gauss-Seidel solver
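Why CG is memory-latency bound is visible in its core operation, a sparse matrix-vector product with indirect addressing. A minimal sketch of that access pattern in plain Python (CSR storage; this illustrates the pattern, it is not code from the NPB sources):

```python
def csr_matvec(values, col_idx, row_ptr, x):
    """y = A @ x for a sparse matrix A stored in CSR format."""
    y = [0.0] * (len(row_ptr) - 1)
    for row in range(len(y)):
        s = 0.0
        for k in range(row_ptr[row], row_ptr[row + 1]):
            # x[col_idx[k]] is an indirect, effectively random access: few FLOPs
            # per byte loaded, and the gather pattern defeats prefetching and
            # wide SIMD loads, so the kernel spends its time waiting on memory.
            s += values[k] * x[col_idx[k]]
        y[row] = s
    return y
```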


How to benchmark?

There are three ways that the kernels can be run:

• Serial – N copies on N cores

• OpenMP – shared-memory parallelism

• MPI – parallelism across multiple systems (nodes)

There are five classes of problem sizes, from A (tiny) to E (very large, to 1000s of cores).


Conjugate Gradient

[Charts: sustained core frequency (MHz, scale 0-4000) and power draw (Watt, scale 0-250) for CG built with no-vec, SSE4.2, AVX, AVX2 and AVX-512]


Multigrid

[Charts: sustained core frequency (MHz, scale 0-4000) and power draw (Watt, scale 0-250) for MG built with no-vec, SSE4.2, AVX, AVX2 and AVX-512]


Block tridiagonal solver

[Charts: sustained core frequency (MHz, scale 0-4000) and power draw (Watt, scale 0-250) for BT built with no-vec, SSE4.2, AVX, AVX2 and AVX-512]


Scalar pentadiagonal solver

[Charts: sustained core frequency (MHz, scale 0-4000) and power draw (Watt, scale 0-250) for SP built with no-vec, SSE4.2, AVX, AVX2 and AVX-512]


Lower-upper Gauss-Seidel solver

[Charts: sustained core frequency (MHz, scale 0-4000) and power draw (Watt, scale 0-250) for LU built with no-vec, SSE4.2, AVX, AVX2 and AVX-512]


Fourier transform

[Charts: sustained core frequency (MHz, scale 0-4000) and power draw (Watt, scale 0-250) for FT built with no-vec, SSE4.2, AVX, AVX2 and AVX-512]