PyFR and GiMMiK on Intel KNL - WitherdenPyFR and GiMMiK on Intel KNL F.D. Witherden1, J.S. Park2, A....

39
PyFR and GiMMiK on Intel KNL F.D. Witherden 1 , J.S. Park 2 , A. Heinecke 3 , P. Kelly 2 , P.E. Vincent 2 and A. Jameson 1 1 Department of Aeronautics & Astronautics, Stanford University 2 Departments of Aeronautics and Computing, Imperial College London 3 Intel Corporation

Transcript of PyFR and GiMMiK on Intel KNL - WitherdenPyFR and GiMMiK on Intel KNL F.D. Witherden1, J.S. Park2, A....

Page 1: PyFR and GiMMiK on Intel KNL - WitherdenPyFR and GiMMiK on Intel KNL F.D. Witherden1, J.S. Park2, A. Heinecke3, P. Kelly2, P.E. Vincent2 and A. Jameson1 1Department of Aeronautics

PyFR and GiMMiK on Intel KNL

F.D. Witherden1, J.S. Park2, A. Heinecke3, P. Kelly2, P.E. Vincent2 and A. Jameson1

1Department of Aeronautics & Astronautics, Stanford University 2Departments of Aeronautics and Computing, Imperial College London

3Intel Corporation

Page 2: PyFR and GiMMiK on Intel KNL - WitherdenPyFR and GiMMiK on Intel KNL F.D. Witherden1, J.S. Park2, A. Heinecke3, P. Kelly2, P.E. Vincent2 and A. Jameson1 1Department of Aeronautics

Motivation: Turbulent Flows

• Interested in simulating unsteady, turbulent, flows.

Page 3: PyFR and GiMMiK on Intel KNL - WitherdenPyFR and GiMMiK on Intel KNL F.D. Witherden1, J.S. Park2, A. Heinecke3, P. Kelly2, P.E. Vincent2 and A. Jameson1 1Department of Aeronautics

The PyFR Framework

• Uses high-order flux reconstruction (FR) to solve the compressible Navier–Stokes equations on mixed unstructured grids with explicit time stepping.

Page 4: PyFR and GiMMiK on Intel KNL - WitherdenPyFR and GiMMiK on Intel KNL F.D. Witherden1, J.S. Park2, A. Heinecke3, P. Kelly2, P.E. Vincent2 and A. Jameson1 1Department of Aeronautics

The PyFR Framework

• Performance portable across a range of platforms.

• Finalist for the 2016 Gordon Bell Prize.

Page 5: PyFR and GiMMiK on Intel KNL - WitherdenPyFR and GiMMiK on Intel KNL F.D. Witherden1, J.S. Park2, A. Heinecke3, P. Kelly2, P.E. Vincent2 and A. Jameson1 1Department of Aeronautics

The PyFR Framework

• Existing support for KNC based around offloading via pyMIC.

Page 6: PyFR and GiMMiK on Intel KNL - WitherdenPyFR and GiMMiK on Intel KNL F.D. Witherden1, J.S. Park2, A. Heinecke3, P. Kelly2, P.E. Vincent2 and A. Jameson1 1Department of Aeronautics

• Python outer layer.

PyFR

Python Outer Layer (Hardware Independent)

• Setup

• Distributed memory parallelism

• Outer ‘for’ loop and calls to hardware specific kernels

Page 7: PyFR and GiMMiK on Intel KNL - WitherdenPyFR and GiMMiK on Intel KNL F.D. Witherden1, J.S. Park2, A. Heinecke3, P. Kelly2, P.E. Vincent2 and A. Jameson1 1Department of Aeronautics

• Need to generate hardware specific kernels.

PyFR

Python Outer Layer (Hardware Independent)

• Setup

• Distributed memory parallelism

• Outer ‘for’ loop and calls to hardware specific kernels

Page 8: PyFR and GiMMiK on Intel KNL - WitherdenPyFR and GiMMiK on Intel KNL F.D. Witherden1, J.S. Park2, A. Heinecke3, P. Kelly2, P.E. Vincent2 and A. Jameson1 1Department of Aeronautics

• In FR two types of kernel are required.

PyFR

Python Outer Layer (Hardware Independent)

Matrix Multiply Kernels

Point-Wise Nonlinear Kernels

• Data interpolation/extrapolation etc.

• Flux functions, Riemann solvers etc.

• Setup

• Distributed memory parallelism

• Outer ‘for’ loop and calls to hardware specific kernels

Page 9: PyFR and GiMMiK on Intel KNL - WitherdenPyFR and GiMMiK on Intel KNL F.D. Witherden1, J.S. Park2, A. Heinecke3, P. Kelly2, P.E. Vincent2 and A. Jameson1 1Department of Aeronautics

• Matrix multiplications are quite simple.

PyFR

Python Outer Layer (Hardware Independent)

Matrix Multiply Kernels

Point-Wise Nonlinear Kernels

• Data interpolation/extrapolation etc.

• Flux functions, Riemann solvers etc.

Call GEMM

• Setup

• Distributed memory parallelism

• Outer ‘for’ loop and calls to hardware specific kernels

Page 10: PyFR and GiMMiK on Intel KNL - WitherdenPyFR and GiMMiK on Intel KNL F.D. Witherden1, J.S. Park2, A. Heinecke3, P. Kelly2, P.E. Vincent2 and A. Jameson1 1Department of Aeronautics

• For the point-wise nonlinear kernels we use a DSL.

PyFR

Python Outer Layer (Hardware Independent)

Pass templates through Mako

derived templating engine

Matrix Multiply Kernels

Point-Wise Nonlinear Kernels

• Data interpolation/extrapolation etc.

• Flux functions, Riemann solvers etc.

Call GEMM

• Setup

• Distributed memory parallelism

• Outer ‘for’ loop and calls to hardware specific kernels

Page 11: PyFR and GiMMiK on Intel KNL - WitherdenPyFR and GiMMiK on Intel KNL F.D. Witherden1, J.S. Park2, A. Heinecke3, P. Kelly2, P.E. Vincent2 and A. Jameson1 1Department of Aeronautics

• Kernels are generated and compiled at start-up.

PyFR

Python Outer Layer (Hardware Independent)

Pass templates through Mako

derived templating engine

CUDA Hardware specific kernels

OpenCL Hardware specific kernels

Matrix Multiply Kernels

Point-Wise Nonlinear Kernels

• Data interpolation/extrapolation etc.

• Flux functions, Riemann solvers etc.

Call GEMM

pyMIC Hardware specific kernels

• Setup

• Distributed memory parallelism

• Outer ‘for’ loop and calls to hardware specific kernels

C/OpenMP Hardware specific kernels

Page 12: PyFR and GiMMiK on Intel KNL - WitherdenPyFR and GiMMiK on Intel KNL F.D. Witherden1, J.S. Park2, A. Heinecke3, P. Kelly2, P.E. Vincent2 and A. Jameson1 1Department of Aeronautics

• Which may then be called by the outer layer.

PyFR

Python Outer Layer (Hardware Independent)

Pass templates through Mako

derived templating engine

CUDA Hardware specific kernels

OpenCL Hardware specific kernels

Matrix Multiply Kernels

Point-Wise Nonlinear Kernels

• Data interpolation/extrapolation etc.

• Flux functions, Riemann solvers etc.

Call GEMM

pyMIC Hardware specific kernels

• Setup

• Distributed memory parallelism

• Outer ‘for’ loop and calls to hardware specific kernels

C/OpenMP Hardware specific kernels

Page 13: PyFR and GiMMiK on Intel KNL - WitherdenPyFR and GiMMiK on Intel KNL F.D. Witherden1, J.S. Park2, A. Heinecke3, P. Kelly2, P.E. Vincent2 and A. Jameson1 1Department of Aeronautics

Matrix Multiplications in PyFR• Multiplications are of the block-by-panel variety:

• where N ~ 105 with N ≫ (M, K) and A is constant.

C A B

N K

M

Page 14: PyFR and GiMMiK on Intel KNL - WitherdenPyFR and GiMMiK on Intel KNL F.D. Witherden1, J.S. Park2, A. Heinecke3, P. Kelly2, P.E. Vincent2 and A. Jameson1 1Department of Aeronautics

GEMM in PyFR• On x86 S/DGEMM has three kernels providers.

Dense A (Small)

Dense A (Large)

Sparse A (Small)

Sparse A (Large)

MKL ∅ ★ ∅ ∅

GiMMiK ★ ∅ ★ ★

Libxsmm (new) ★★ ▲ ★★ ▲

Page 15: PyFR and GiMMiK on Intel KNL - WitherdenPyFR and GiMMiK on Intel KNL F.D. Witherden1, J.S. Park2, A. Heinecke3, P. Kelly2, P.E. Vincent2 and A. Jameson1 1Department of Aeronautics

Initial Results• Flow over a cylinder at Re = 3,900 and Ma = 0.2.

• Quadratically curved hexahedral mesh with NE = 118,820.

Page 16: PyFR and GiMMiK on Intel KNL - WitherdenPyFR and GiMMiK on Intel KNL F.D. Witherden1, J.S. Park2, A. Heinecke3, P. Kelly2, P.E. Vincent2 and A. Jameson1 1Department of Aeronautics

Initial Results

Page 17: PyFR and GiMMiK on Intel KNL - WitherdenPyFR and GiMMiK on Intel KNL F.D. Witherden1, J.S. Park2, A. Heinecke3, P. Kelly2, P.E. Vincent2 and A. Jameson1 1Department of Aeronautics

Initial Results• PyFR 1.4.0: K40c (cuBLAS) vs KNL 7250F (MKL).

p = 1

p = 2

p = 3

p = 4

Time per DOF per RK stage / ns

0 2 4 6 8 10 12

Page 18: PyFR and GiMMiK on Intel KNL - WitherdenPyFR and GiMMiK on Intel KNL F.D. Witherden1, J.S. Park2, A. Heinecke3, P. Kelly2, P.E. Vincent2 and A. Jameson1 1Department of Aeronautics

Initial Results

• Profiling indicates point-wise kernels are the bottleneck.

• Surprising!

• Must therefore rethink our data layout and push for further vectorisation.

Page 19: PyFR and GiMMiK on Intel KNL - WitherdenPyFR and GiMMiK on Intel KNL F.D. Witherden1, J.S. Park2, A. Heinecke3, P. Kelly2, P.E. Vincent2 and A. Jameson1 1Department of Aeronautics

Data Layout 101

• Three main layouts:

• AoS

• SoA (used by PyFR 1.4.0)

• AoSoA(k)

Page 20: PyFR and GiMMiK on Intel KNL - WitherdenPyFR and GiMMiK on Intel KNL F.D. Witherden1, J.S. Park2, A. Heinecke3, P. Kelly2, P.E. Vincent2 and A. Jameson1 1Department of Aeronautics

AoS

struct { float rho; float rhou; float E; } data[NELES];

Page 21: PyFR and GiMMiK on Intel KNL - WitherdenPyFR and GiMMiK on Intel KNL F.D. Witherden1, J.S. Park2, A. Heinecke3, P. Kelly2, P.E. Vincent2 and A. Jameson1 1Department of Aeronautics

• Cache and TLB friendly.

• Difficult to vectorise.

AoS

Memory

Page 22: PyFR and GiMMiK on Intel KNL - WitherdenPyFR and GiMMiK on Intel KNL F.D. Witherden1, J.S. Park2, A. Heinecke3, P. Kelly2, P.E. Vincent2 and A. Jameson1 1Department of Aeronautics

SoA

struct { float rho[NELES]; float rhou[NELES]; float E[NELES]; } data;

Page 23: PyFR and GiMMiK on Intel KNL - WitherdenPyFR and GiMMiK on Intel KNL F.D. Witherden1, J.S. Park2, A. Heinecke3, P. Kelly2, P.E. Vincent2 and A. Jameson1 1Department of Aeronautics

• Trivial to vectorise.

• Can put pressure on TLB and/or hardware pre-fetchers.

SoA

Memory

Page 24: PyFR and GiMMiK on Intel KNL - WitherdenPyFR and GiMMiK on Intel KNL F.D. Witherden1, J.S. Park2, A. Heinecke3, P. Kelly2, P.E. Vincent2 and A. Jameson1 1Department of Aeronautics

AoSoA(k = 2)

struct { float rho[k]; float rhou[k]; float E[k]; } data[NELES / k];

Page 25: PyFR and GiMMiK on Intel KNL - WitherdenPyFR and GiMMiK on Intel KNL F.D. Witherden1, J.S. Park2, A. Heinecke3, P. Kelly2, P.E. Vincent2 and A. Jameson1 1Department of Aeronautics

AoSoA(k = 2)

• Can be vectorised efficiently for suitable k.

• Cache and TLB friendly.

Memory

Page 26: PyFR and GiMMiK on Intel KNL - WitherdenPyFR and GiMMiK on Intel KNL F.D. Witherden1, J.S. Park2, A. Heinecke3, P. Kelly2, P.E. Vincent2 and A. Jameson1 1Department of Aeronautics

AoSoA(k = 2)

• The Goldilocks solution

• …albeit at the cost of messy indexing

• …also requires coaxing for compilers to vectorise.

Page 27: PyFR and GiMMiK on Intel KNL - WitherdenPyFR and GiMMiK on Intel KNL F.D. Witherden1, J.S. Park2, A. Heinecke3, P. Kelly2, P.E. Vincent2 and A. Jameson1 1Department of Aeronautics

AoSoA(k = 2)

• Here PyFR’s DSL pays dividends.

• Iteration and indexing are hidden from kernels.

• Can therefore change data structures with ease.

Page 28: PyFR and GiMMiK on Intel KNL - WitherdenPyFR and GiMMiK on Intel KNL F.D. Witherden1, J.S. Park2, A. Heinecke3, P. Kelly2, P.E. Vincent2 and A. Jameson1 1Department of Aeronautics

AoSoA(k) Results• PyFR 1.4.0 KNL (MKL) vs PyFR 1.5.0 KNL (MKL).

p = 1

p = 2

p = 3

p = 4

Time per DOF per RK stage / ns

0 2 4 6 8 10 12

Page 29: PyFR and GiMMiK on Intel KNL - WitherdenPyFR and GiMMiK on Intel KNL F.D. Witherden1, J.S. Park2, A. Heinecke3, P. Kelly2, P.E. Vincent2 and A. Jameson1 1Department of Aeronautics

AoSoA(k) Results• Comparing with PyFR 1.4.0 K40c (cuBLAS)

p = 1

p = 2

p = 3

p = 4

Time per DOF per RK stage / ns

0 2 4 6 8 10 12

Page 30: PyFR and GiMMiK on Intel KNL - WitherdenPyFR and GiMMiK on Intel KNL F.D. Witherden1, J.S. Park2, A. Heinecke3, P. Kelly2, P.E. Vincent2 and A. Jameson1 1Department of Aeronautics

AoSoA(k) Results

• Large performance boost.

• KNL now outperforms a K40c in the dense regime.

• Other backends improve, too, but only slightly.

Page 31: PyFR and GiMMiK on Intel KNL - WitherdenPyFR and GiMMiK on Intel KNL F.D. Witherden1, J.S. Park2, A. Heinecke3, P. Kelly2, P.E. Vincent2 and A. Jameson1 1Department of Aeronautics

Sparsity Exploitation

• Hexahedral operator matrices are sparse.

• Can therefore use GiMMiK/libxsmm.

Page 32: PyFR and GiMMiK on Intel KNL - WitherdenPyFR and GiMMiK on Intel KNL F.D. Witherden1, J.S. Park2, A. Heinecke3, P. Kelly2, P.E. Vincent2 and A. Jameson1 1Department of Aeronautics

Sparsity Exploitation• All with PyFR 1.5.0

• K40c (GiMMiK)

• KNL (GiMMiK)

• KNL (libxsmm)

p = 1

p = 2

p = 3

p = 4

Time per DOF per RK stage / ns

0 1 2 3 4 5

Page 33: PyFR and GiMMiK on Intel KNL - WitherdenPyFR and GiMMiK on Intel KNL F.D. Witherden1, J.S. Park2, A. Heinecke3, P. Kelly2, P.E. Vincent2 and A. Jameson1 1Department of Aeronautics

Summary

• Performance is promising.

• Lots of room for improvement in GiMMiK and libxsmm.

Page 34: PyFR and GiMMiK on Intel KNL - WitherdenPyFR and GiMMiK on Intel KNL F.D. Witherden1, J.S. Park2, A. Heinecke3, P. Kelly2, P.E. Vincent2 and A. Jameson1 1Department of Aeronautics

Acknowledgements

• Air Force Office of Scientific Research for support under grant FA9550-14-1-0186.

Page 35: PyFR and GiMMiK on Intel KNL - WitherdenPyFR and GiMMiK on Intel KNL F.D. Witherden1, J.S. Park2, A. Heinecke3, P. Kelly2, P.E. Vincent2 and A. Jameson1 1Department of Aeronautics

Backup Slides

Page 36: PyFR and GiMMiK on Intel KNL - WitherdenPyFR and GiMMiK on Intel KNL F.D. Witherden1, J.S. Park2, A. Heinecke3, P. Kelly2, P.E. Vincent2 and A. Jameson1 1Department of Aeronautics

Start Up Time

• Normally start up time is a non-issue for PyFR.

• A few minutes for a simulation which will run for hours to days.

Page 37: PyFR and GiMMiK on Intel KNL - WitherdenPyFR and GiMMiK on Intel KNL F.D. Witherden1, J.S. Park2, A. Heinecke3, P. Kelly2, P.E. Vincent2 and A. Jameson1 1Department of Aeronautics

Start Up Time

• Time is split roughly equally between

(i) running serial Python code;

(ii) using ICC to compile run-time generated kernels.

Page 38: PyFR and GiMMiK on Intel KNL - WitherdenPyFR and GiMMiK on Intel KNL F.D. Witherden1, J.S. Park2, A. Heinecke3, P. Kelly2, P.E. Vincent2 and A. Jameson1 1Department of Aeronautics

Start Up Time

• Difficult to improve

• …Python is virtually impossible to JIT

• …and due to the GIL does not benefit from multi-threading.

Page 39: PyFR and GiMMiK on Intel KNL - WitherdenPyFR and GiMMiK on Intel KNL F.D. Witherden1, J.S. Park2, A. Heinecke3, P. Kelly2, P.E. Vincent2 and A. Jameson1 1Department of Aeronautics

Start Up Time

• But did add a kernel cache.

• Huge reduction in time for p = 4 cylinder case with GiMMiK.

Tim

e / s

0

1000

2000

3000

4000

No cache Cache