
PyFR and GiMMiK on Intel KNL

F.D. Witherden1, J.S. Park2, A. Heinecke3, P. Kelly2, P.E. Vincent2 and A. Jameson1

1Department of Aeronautics & Astronautics, Stanford University 2Departments of Aeronautics and Computing, Imperial College London

3Intel Corporation

Motivation: Turbulent Flows

• Interested in simulating unsteady, turbulent flows.

The PyFR Framework

• Uses high-order flux reconstruction (FR) to solve the compressible Navier–Stokes equations on mixed unstructured grids with explicit time stepping.

• Performance portable across a range of platforms.

• Finalist for the 2016 Gordon Bell Prize.

• Existing support for KNC is based around offloading via pyMIC.

• Python outer layer.

PyFR

Python Outer Layer (Hardware Independent)

• Setup

• Distributed memory parallelism

• Outer ‘for’ loop and calls to hardware-specific kernels

• The outer layer must therefore generate hardware-specific kernels; in FR two types of kernel are required.

Matrix Multiply Kernels

• Data interpolation/extrapolation etc.

• Quite simple: call GEMM.

Point-Wise Nonlinear Kernels

• Flux functions, Riemann solvers etc.

• Written in a DSL: templates are passed through a Mako-derived templating engine to emit hardware-specific kernels for CUDA, OpenCL, pyMIC and C/OpenMP.

• Kernels are generated and compiled at start-up, and may then be called by the outer layer (a sketch follows).
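• For flavour, a hand-written sketch of the shape a generated C/OpenMP point-wise kernel might take (the name, arguments and body are illustrative assumptions, not PyFR's actual generated output):

/* Hypothetical point-wise kernel: ideal-gas pressure from the internal
   energy at every solution point; compile with -fopenmp */
void pointwise_pressure(int n, const float *restrict rhoe, float *restrict p)
{
    const float gamma = 1.4f;

    #pragma omp parallel for simd
    for (int i = 0; i < n; i++)
        p[i] = (gamma - 1.0f) * rhoe[i];
}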

Matrix Multiplications in PyFR

• Multiplications are of the block-by-panel variety:

  C = A B,  with C of size M × N, A of size M × K and B of size K × N,

• where N ~ 10⁵ with N ≫ (M, K) and A is constant.
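• As a minimal illustration (sizes and names are assumptions, not PyFR's backend code), such a block-by-panel product is a single CBLAS call:

#include <stdio.h>
#include <stdlib.h>
#include <cblas.h>  /* CBLAS interface, e.g. from MKL or OpenBLAS */

int main(void)
{
    /* Small constant operator A (M x K), wide panel B (K x N) */
    const int M = 27, K = 64, N = 100000;  /* illustrative sizes */

    float *A = calloc((size_t)M * K, sizeof *A);
    float *B = calloc((size_t)K * N, sizeof *B);
    float *C = calloc((size_t)M * N, sizeof *C);

    /* C = A * B in row-major storage with no transposes */
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                M, N, K, 1.0f, A, K, B, N, 0.0f, C, N);

    printf("C[0] = %f\n", C[0]);
    free(A); free(B); free(C);
    return 0;
}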

GEMM in PyFR

• On x86, S/DGEMM has three kernel providers:

                 Dense A (small)   Dense A (large)   Sparse A (small)   Sparse A (large)
  MKL                   ∅                 ★                 ∅                  ∅
  GiMMiK                ★                 ∅                 ★                  ★
  libxsmm (new)         ★★                ▲                 ★★                 ▲

Initial Results

• Flow over a cylinder at Re = 3,900 and Ma = 0.2.

• Quadratically curved hexahedral mesh with NE = 118,820 elements.

Initial Results

• PyFR 1.4.0: K40c (cuBLAS) vs KNL 7250F (MKL).

[Chart: time per DOF per RK stage / ns (0–12) for p = 1 through p = 4.]

Initial Results

• Profiling indicates point-wise kernels are the bottleneck.

• Surprising!

• Must therefore rethink our data layout and push for further vectorisation.

Data Layout 101

• Three main layouts:

• AoS

• SoA (used by PyFR 1.4.0)

• AoSoA(k)

AoS

struct { float rho; float rhou; float E; } data[NELES];

• Cache and TLB friendly.

• Difficult to vectorise.

[Diagram: AoS layout in memory, with the variables of each element stored together.]

SoA

struct { float rho[NELES]; float rhou[NELES]; float E[NELES]; } data;

• Trivial to vectorise.

• Can put pressure on TLB and/or hardware pre-fetchers.

[Diagram: SoA layout in memory, with each variable contiguous across all elements.]

AoSoA(k = 2)

struct { float rho[k]; float rhou[k]; float E[k]; } data[NELES / k];

• Can be vectorised efficiently for suitable k.

• Cache and TLB friendly.

[Diagram: AoSoA layout in memory, with variables interleaved in blocks of k.]
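• A minimal sketch of why the layout vectorises (k, names and the kernel body are illustrative assumptions): with k matched to the SIMD width, the inner loop is unit-stride and of fixed length.

#define K 16  /* sub-array size, e.g. one AVX-512 register of floats */

struct block { float rho[K]; float rhou[K]; float E[K]; };

void scale_rho(struct block *data, int nblocks, float fac)
{
    for (int b = 0; b < nblocks; b++) {
        /* Fixed-length, unit-stride inner loop: one SIMD register wide */
        #pragma omp simd
        for (int j = 0; j < K; j++)
            data[b].rho[j] *= fac;
    }
}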

AoSoA(k = 2)

• The Goldilocks solution…

• …albeit at the cost of messy indexing

• …and requires coaxing compilers to vectorise.

AoSoA(k = 2)

• Here PyFR’s DSL pays dividends.

• Iteration and indexing are hidden from kernels.

• Can therefore change data structures with ease.
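• One hypothetical way to hide the indexing (macro name and constants are assumptions; PyFR's generated code differs): kernel bodies address variables through a macro, so switching from SoA to AoSoA(k) changes only the macro definition.

/* AoSoA(k) address of variable v (of NVARS) at logical element i;
   assumes the element count is a multiple of SOA_SZ */
#define SOA_SZ 16
#define IDX(i, v, NVARS) \
    (((i) / SOA_SZ) * (NVARS) * SOA_SZ + (v) * SOA_SZ + (i) % SOA_SZ)

/* A kernel body written against IDX never names the layout directly */
void add_momentum_to_density(int n, float *u)
{
    for (int i = 0; i < n; i++)
        u[IDX(i, 0, 3)] += u[IDX(i, 1, 3)];
}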

AoSoA(k) Results

• PyFR 1.4.0 KNL (MKL) vs PyFR 1.5.0 KNL (MKL).

[Chart: time per DOF per RK stage / ns (0–12) for p = 1 through p = 4.]

AoSoA(k) Results

• Comparing with PyFR 1.4.0 K40c (cuBLAS).

[Chart: time per DOF per RK stage / ns (0–12) for p = 1 through p = 4.]

AoSoA(k) Results

• Large performance boost.

• KNL now outperforms a K40c in the dense regime.

• Other backends improve, too, but only slightly.

Sparsity Exploitation

• Hexahedral operator matrices are sparse.

• Can therefore use GiMMiK/libxsmm.
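• In outline, GiMMiK generates a bespoke kernel for each constant operator matrix, embedding the non-zero entries of A as literals so the zeros cost nothing. A toy hand-written sketch of the idea for a fixed 2 × 3 sparse A (names assumed; real kernels are machine-generated and vectorised):

/* Generated-style kernel for A = [[2, 0, 1],
 *                                 [0, 3, 0]]; zero entries never appear */
void gimmik_mm(int n, const float *restrict b, float *restrict c)
{
    for (int i = 0; i < n; i++) {       /* panel dimension N */
        c[0*n + i] = 2.0f*b[0*n + i] + 1.0f*b[2*n + i];
        c[1*n + i] = 3.0f*b[1*n + i];
    }
}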

Sparsity Exploitation

• All with PyFR 1.5.0:

• K40c (GiMMiK)

• KNL (GiMMiK)

• KNL (libxsmm)

[Chart: time per DOF per RK stage / ns (0–5) for p = 1 through p = 4.]

Summary

• Performance is promising.

• Lots of room for improvement in GiMMiK and libxsmm.

Acknowledgements

• Air Force Office of Scientific Research for support under grant FA9550-14-1-0186.

Backup Slides

Start Up Time

• Normally start-up time is a non-issue for PyFR.

• A few minutes for a simulation which will run for hours to days.

Start Up Time

• Time is split roughly equally between

(i) running serial Python code;

(ii) using ICC to compile run-time generated kernels.

Start Up Time

• Difficult to improve

• …Python is virtually impossible to JIT

• …and due to the GIL does not benefit from multi-threading.

Start Up Time

• But we did add a kernel cache.

• Huge reduction in time for the p = 4 cylinder case with GiMMiK.

[Chart: start-up time / s (0–4000), no cache vs cache.]
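• A minimal sketch of the idea behind such a cache (paths, hash and compiler invocation are assumptions, not PyFR's implementation): key the compiled shared object by a hash of the generated source and only invoke the compiler on a miss.

#include <dlfcn.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* djb2 string hash; keys the cache entry (illustrative choice) */
static unsigned long djb2(const char *s)
{
    unsigned long h = 5381;
    for (; *s; s++)
        h = h * 33 + (unsigned char)*s;
    return h;
}

/* Return a dlopen handle for the kernel library built from src */
void *cached_kernel_lib(const char *src)
{
    char lib[256], csrc[256], cmd[640];
    snprintf(lib, sizeof lib, "/tmp/kern-%lx.so", djb2(src));

    if (access(lib, R_OK) != 0) {          /* cache miss: compile once */
        snprintf(csrc, sizeof csrc, "/tmp/kern-%lx.c", djb2(src));
        FILE *f = fopen(csrc, "w");
        if (!f)
            return NULL;
        fputs(src, f);
        fclose(f);
        snprintf(cmd, sizeof cmd, "cc -O3 -fopenmp -shared -fPIC -o %s %s",
                 lib, csrc);
        if (system(cmd) != 0)
            return NULL;
    }
    return dlopen(lib, RTLD_NOW);          /* hit, or freshly built */
}

• Runs that regenerate identical kernel source then skip the compiler entirely.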