Mont-Blanc (Arm for HPC) · Mont-Blanc 3 key objectives Design a compute node based on ARM...

34
montblanc-project.eu | @MontBlanc_EU This project has received funding from the European Union's Horizon 2020 research and innovation program under grant agreement n° 671697 Mont-Blanc (Arm for HPC) Etienne Walter, Bull with contributions from the whole team Going Arm workshop, Frankfort

Transcript of Mont-Blanc (Arm for HPC) · Mont-Blanc 3 key objectives Design a compute node based on ARM...

Page 1: Mont-Blanc (Arm for HPC) · Mont-Blanc 3 key objectives Design a compute node based on ARM architecture for a pre-exascale system Well balanced : Memory, Interconnect, IO Use of simulation

montblanc-project.eu | @MontBlanc_EU

This project has received funding from the European Union's Horizon 2020 research and innovation program under grant agreement n° 671697

Mont-Blanc (Arm for HPC)

Etienne Walter, Bull with contributions from the whole team

Going Arm workshop, Frankfort

Page 2: Mont-Blanc (Arm for HPC) · Mont-Blanc 3 key objectives Design a compute node based on ARM architecture for a pre-exascale system Well balanced : Memory, Interconnect, IO Use of simulation

Mont-Blanc – Origin & context

Franfort, June 28, 2018 Mont-Blanc - Going Arm 2

Page 3: Mont-Blanc (Arm for HPC) · Mont-Blanc 3 key objectives Design a compute node based on ARM architecture for a pre-exascale system Well balanced : Memory, Interconnect, IO Use of simulation

EPI

Mont-Blanc 2020

Design SoC

Develop IPs

The Mont-Blanc pitch

3 2012 2013 2014 2016 2015 2017 2018

Mont-Blanc

Proof of concept : HPC

computing based

on mobile embedded

technology

Mont-Blanc 2

Extend the concept

and explore new

possibilities Mont-Blanc 3

Prepare industrial solution

Test market acceptance

Vision: leverage the fast growing market of mobile

technology for scientific computation

2019 2020

Franfort, June 28, 2018 Mont-Blanc - Going Arm

Page 4: Mont-Blanc (Arm for HPC) · Mont-Blanc 3 key objectives Design a compute node based on ARM architecture for a pre-exascale system Well balanced : Memory, Interconnect, IO Use of simulation

PRACE prototypes

• Tibidabo

• Carma

• Pedraforca

Mini-clusters

• Arndale

• Odroid XU

• Odroid XU-3

• NVIDIA Jetson

Mont-Blanc prototype

• Operational since May 2015 @ BSC

• 1080 compute cards

• Dual Cortex-A15

• GPU Mali-T604

• 4 GB LPDDR3

• Up to 64 GB local storage

• USB-to-Eth network

• Fine grained power monitoring system

ARM 64-bit mini-clusters

• APM X-GENE2

• Cavium ThunderX

• NVIDIA TX1

The Mont-Blanc prototype ecosystem

Franfort, June 28, 2018 Mont-Blanc - Going Arm 4

2012 2013 2014

Prototypes are critical to accelerate software development System software stack + applications

2015 2016 2017

Page 5: Mont-Blanc (Arm for HPC) · Mont-Blanc 3 key objectives Design a compute node based on ARM architecture for a pre-exascale system Well balanced : Memory, Interconnect, IO Use of simulation

Mont-Blanc prototype

Exynos 5 compute card 2 x Cortex-A15 @ 1.7GHz

1 x Mali T604 GPU

6.8 + 25.5 GFLOPS

15 Watts

2.1 GFLOPS/W

Carrier blade 15 x Compute cards

Blade chassis 7U 9 x Carrier blade

Rack 8 BullX chassis - 72 blades

1080 Compute cards

2160 CPUs -1080 GPUs

4.3 TB DRAM - 17.2 TB Flash

35 TFLOPS

24 kWatt

Franfort, June 28, 2018 Mont-Blanc - Going Arm 5

Page 6: Mont-Blanc (Arm for HPC) · Mont-Blanc 3 key objectives Design a compute node based on ARM architecture for a pre-exascale system Well balanced : Memory, Interconnect, IO Use of simulation

Mont-Blanc & Mont-Blanc 2 key results

Showed HPC computing was possible

with ARM based hardware

growing interest in using ARM for HPC

Showed that application scalability is

possible even with very low end

component

Need denser compute - more cores

Need a better interconnect ( Native

Ethernet)

Optimizing energy efficiency at the

system-level is hard

Franfort, June 28, 2018 Mont-Blanc - Going Arm 6

Page 7: Mont-Blanc (Arm for HPC) · Mont-Blanc 3 key objectives Design a compute node based on ARM architecture for a pre-exascale system Well balanced : Memory, Interconnect, IO Use of simulation

Mont-Blanc 3 key objectives

Design a compute node based on ARM architecture for a pre-exascale system

Well balanced : Memory, Interconnect, IO

Use of simulation to evaluate the options on applications

Energy efficient

Evaluate new high-end ARM core and accelerator, and assess different options for compute efficiency

Heterogeneous cores, vectorization, high performance core

Assessment with existing solutions & applications

Develop the software ecosystem needed for market acceptance of ARM solutions

Mont-Blanc - Going Arm 7

Prototypes

Scientific applications

HPC software ecosystem

Franfort, June 28, 2018

Key ideas: Well-balanced architecture, (Energy & Compute) Efficiency, Throughput computing, Co-design, SoC/SoP design

Page 8: Mont-Blanc (Arm for HPC) · Mont-Blanc 3 key objectives Design a compute node based on ARM architecture for a pre-exascale system Well balanced : Memory, Interconnect, IO Use of simulation

Franfort, June 28, 2018 Mont-Blanc - Going Arm 8

Sw Environment

Applications

Hw. platform

Page 9: Mont-Blanc (Arm for HPC) · Mont-Blanc 3 key objectives Design a compute node based on ARM architecture for a pre-exascale system Well balanced : Memory, Interconnect, IO Use of simulation

Franfort, June 28, 2018 Mont-Blanc - Going Arm 9

Sw Environment

Applications

Hw. platform

Page 10: Mont-Blanc (Arm for HPC) · Mont-Blanc 3 key objectives Design a compute node based on ARM architecture for a pre-exascale system Well balanced : Memory, Interconnect, IO Use of simulation

Applications

Franfort, June 28, 2018 Mont-Blanc - Going Arm 10

Page 11: Mont-Blanc (Arm for HPC) · Mont-Blanc 3 key objectives Design a compute node based on ARM architecture for a pre-exascale system Well balanced : Memory, Interconnect, IO Use of simulation

Contribution to co-design effort

Global need to change programmer’s mindset

Optimize throughput

& avoid (as far as possible) to be latency limited

• work on « taskification »

• work on asynchronism

Key role of runtimes

dynamic resource reallocation / maximize overall efficiency and load balancing

need for proper interaction between programming models (MPI & OpenMP)

Needed : mechanism to set locality & data management

cf. programming model (pushing this in OpenMP)

Franfort, June 28, 2018 Mont-Blanc - Going Arm 11

Page 12: Mont-Blanc (Arm for HPC) · Mont-Blanc 3 key objectives Design a compute node based on ARM architecture for a pre-exascale system Well balanced : Memory, Interconnect, IO Use of simulation

From App analysis to large scale extrapolations (OpenIFS - ECMWF )

Franfort, June 28, 2018 Mont-Blanc - Going Arm 12

8

16

32

64

96

No noise L=50us

BW=100MB/s

Simulation

Page 13: Mont-Blanc (Arm for HPC) · Mont-Blanc 3 key objectives Design a compute node based on ARM architecture for a pre-exascale system Well balanced : Memory, Interconnect, IO Use of simulation

Franfort, June 28, 2018 Mont-Blanc - Going Arm 13

Sw Environment

Applications

Hw. platform

Page 14: Mont-Blanc (Arm for HPC) · Mont-Blanc 3 key objectives Design a compute node based on ARM architecture for a pre-exascale system Well balanced : Memory, Interconnect, IO Use of simulation

RedHat

GCC LLVM Mercurium Compilers

SLURM OpenLDAP NTP Ganglia Cluster management

OpenMP Nanos++ HSA MPI Runtime libraries

Drivers Ubuntu OpenSUSE CentOS Fedora Operating System (Linux kernel-based)

Power management Lustre NFS DVFS Hardware support

(Heterogeneous) Hardware

Extrae Perf MAQAO Allinea

ATLAS Arm Performance Libraries BLIS Scientific libraries

Kernels Mini-applications Full applications Applications (source files – C/C++/Fortran/Python etc.)

HPC Arm Software Stack

14

Porting MAQAO & CERE

Optimised Arm Performance Libraries

Porting PGI-Flang

OpenMP+OmpSs heterogeneous support

improvements

OmpSs accelerator support improvements

LLVM+GNU OpenMP optimisations

Porting TinyMPI

OmpSs and MPI communication/computation overlap

Topology-aware MPI

Linux kernel heterogeneous support improvements

memcpy-optimisations techniques via RDMA

Heterogeneous device specifications

Power management improvements

+ Contribution to OpenHPC (from 1.2)

Franfort, June 28, 2018 Mont-Blanc - Going Arm

PAPI Developer tools

- Mont-Blanc 3 contributions

Most elements to be available on Dibona platform

Page 15: Mont-Blanc (Arm for HPC) · Mont-Blanc 3 key objectives Design a compute node based on ARM architecture for a pre-exascale system Well balanced : Memory, Interconnect, IO Use of simulation

– Now on ARM (since 1.2 version)

OpenHPC is a community effort to provide a common, verified set of open source packages for HPC deployments

ARM and Atos Bull:

• Members of OpenHPC

• Both are on the OpenHPC Technical Steering Committee and drive ARM build support

Status: 1.3.2 release out now

• All packages built on ARMv8 for CentOS and SUSE

• ARM-based machines are being used for building and also in the OpenHPC build infrastructure

Franfort, June 28, 2018 Mont-Blanc - Going Arm 15

Functional Areas Components include

Base OS RHEL/CentOS 7.1, SLES 12

Administrative Tools Conman, Ganglia, Lmod, LosF, ORCM, Nagios, pdsh, prun

Provisioning Warewulf

Resource Mgmt. SLURM, Munge. Altair PBS Pro*

I/O Services Lustre client (community version)

Numerical/Scientific Libraries

Boost, GSL, FFTW, Metis, PETSc, Trilinos, Hypre, SuperLU, Mumps

I/O Libraries HDF5 (pHDF5), NetCDF (including C++ and Fortran interfaces), Adios

Compiler Families GNU (gcc, g++, gfortran)

MPI Families OpenMPI, MVAPICH2

Development Tools Autotools (autoconf, automake, libtool), Valgrind,R, SciPy/NumPy

Performance Tools PAPI, Intel IMB, mpiP, pdtoolkit TAU

Page 16: Mont-Blanc (Arm for HPC) · Mont-Blanc 3 key objectives Design a compute node based on ARM architecture for a pre-exascale system Well balanced : Memory, Interconnect, IO Use of simulation

Franfort, June 28, 2018 Mont-Blanc - Going Arm 16

Sw Environment

Applications

Hw. platform

Page 17: Mont-Blanc (Arm for HPC) · Mont-Blanc 3 key objectives Design a compute node based on ARM architecture for a pre-exascale system Well balanced : Memory, Interconnect, IO Use of simulation

Franfort, June 28, 2018 Mont-Blanc - Going Arm 17

Dibona: our new test platform

Page 18: Mont-Blanc (Arm for HPC) · Mont-Blanc 3 key objectives Design a compute node based on ARM architecture for a pre-exascale system Well balanced : Memory, Interconnect, IO Use of simulation

Franfort, June 28, 2018 Mont-Blanc - Going Arm 18

IB EDR switches

Internal Ethernet management network

Redundant management server

including storage Power supply units

Hydraulics for Direct Liquid Cooling (ultra-energy efficient with

hot water cooling)

48 computes nodes: 16 blades 96 CPUs

3000 cores 12000 threads

Dibona (Mont-Blanc 3 test platform)

Optimized in density for 2-socket systems with up to 8 memory channels per socket in 1U

Page 19: Mont-Blanc (Arm for HPC) · Mont-Blanc 3 key objectives Design a compute node based on ARM architecture for a pre-exascale system Well balanced : Memory, Interconnect, IO Use of simulation

Franfort, June 28, 2018 Mont-Blanc - Going Arm 19

Cavium ThunderX2 choice

Interconnect Mezzanine

1U form factor

Direct liquid cooling

3 dual-socket compute

nodes per blade with:

• 2 Cavium ThunderX2

processors (up to 32

Armv8 cores per CPU, 4

threads per core, up to

2.5GHz)

• 16 DDR4 DIMM slots

• 1 Interconnect mezzanine

board (EDR)

Cavium® ThunderX2™

Page 20: Mont-Blanc (Arm for HPC) · Mont-Blanc 3 key objectives Design a compute node based on ARM architecture for a pre-exascale system Well balanced : Memory, Interconnect, IO Use of simulation

CVN thermal design & mechanical enclosure

3D detailed PCB calculation has been performed

Franfort, June 28, 2018 Mont-Blanc - Going Arm 20

new coldplate

new CPU heatspreader

new DDR heatspreader

Page 21: Mont-Blanc (Arm for HPC) · Mont-Blanc 3 key objectives Design a compute node based on ARM architecture for a pre-exascale system Well balanced : Memory, Interconnect, IO Use of simulation

Software stack and tools status (@ Dibona)

Franfort, June 28,

2018 Mont-Blanc - Going Arm

21

GCC ARM/LLVM compilers

Compilers

Cluster management

Runtime libraries

Drivers Ubuntu (optional)

Operating System (Linux kernel-based)

Power management

Hardware support

Dibona ThunderX2 compute node

Developer tools

Scientific libraries

Paqages dependencies

RHEL 7.5

DVFS NFS

BSC SW stack OpenMPI

SLURM

Papi Perf Allinea

ARM Per. Libs.

RHEL ATLAS/BLAS packages

RHEL 7.5 linux kernel-4.14.0-22

Open MPI 2.0.x

SLURM 17.x

Mellanox OFED 4.3.x

PAPI 5.4.x

GCC 5.3.0, 6.2.x, 7.1.0, 7.2.x, 7.3.x

HPCToolkit 5.4.3

the Allinea Studio tool : the Arm HPC compiler 18.2. for RH 7.2+

the Allinea Forge debugging and profiling tool

the Allinea Performance Reports reporting tool

Patched perf tool (providind 900+

uncore events for L3 partitions and

DMCs)

Patched PAPI (26 available events)

Page 22: Mont-Blanc (Arm for HPC) · Mont-Blanc 3 key objectives Design a compute node based on ARM architecture for a pre-exascale system Well balanced : Memory, Interconnect, IO Use of simulation

Power management features (@ Dibona)

Energy and Power monitoring

HDEEM data collection

Energy cumulated through time

(Blade + VRs + NICs) in Joules

High frequency FPGA energy

calculation (100Hz for VRs, 1000Hz

for Blade)

SLURM Power plugin

Grafana visualisation

DVFS, Power capping and

Turbo support

Franfort, June 28, 2018 Mont-Blanc - Going Arm 22

Page 23: Mont-Blanc (Arm for HPC) · Mont-Blanc 3 key objectives Design a compute node based on ARM architecture for a pre-exascale system Well balanced : Memory, Interconnect, IO Use of simulation

Preparing the future SoCs

Franfort, June 28, 2018 Mont-Blanc - Going Arm 23

Simulation Architecture

Balanced Architecture

Computing Efficiency

Sw Environment

Applications

Hw. platform

Page 24: Mont-Blanc (Arm for HPC) · Mont-Blanc 3 key objectives Design a compute node based on ARM architecture for a pre-exascale system Well balanced : Memory, Interconnect, IO Use of simulation

Simulation Tools

Mont-Blanc - Going Arm 24

Name on-line/offline

scope granularity speed(up)

gem5 (classic memory)

on-line core/SoC u-arch/insts ~100-200 KIPS

Elastic trace (gem5)

off-line interconnect/memory system

u-arch/insts ~ 7x (gem5)

SimMATE (gem5)

off-line memory system u-arch/insts ~6x – 800x (gem5)

Garnet (gem5)

on-line interconnect ~0.2x (gem5)

TaskSim off-line compute node/task scheduler

arch/tasks ~ 10x native (burst) ~20x gem5 (memory)

Dimemas off-line cluster/off-chip network

bursts/ messages

“very fast”

SST/macro on-line cluster/off-chip network

bursts/ messages

~ 0.3-3x native

u-arch/insts arch/insts bursts/tasks inter-proc messages

SimMATE

TaskSim

Dimemas

Spe

ed

Granularity

gem5 (on-line)

SST/macro

Elastic trace

dist-gem5

Scope

core/SoC multi-core compute-node cluster

MUSA

Simulation tools map Integration in a global framework

Franfort, June 28, 2018

Page 25: Mont-Blanc (Arm for HPC) · Mont-Blanc 3 key objectives Design a compute node based on ARM architecture for a pre-exascale system Well balanced : Memory, Interconnect, IO Use of simulation

Design Space Exploration (DSE)

New Experiments

Franfort, June 28, 2018 Mont-Blanc - Going Arm 25

New Features

FP Vector Width Simulation

Chip and Main Mem. Power measurements

Applications

BT, SP, Hydro, LULESH,

Specfem3D

Cores Issue width / ROB

Frequency Cache size L3 / L2

Memory Vector width

1 2 / 40 (lo-end) 1.5 64MB/1MB 4ch DDR4 128

32 4 / 180 (thunX) 2.0 32MB/1MB 8ch DDR4 256

64 6/224 (skylak) 2.5 32MB/256K 512

8/300 (hi-end) 3.0

SIMULATOR PARAMETERS/COMPONENTS

Simulate every possible combination of these

865 detailed arch simulations per app

Page 26: Mont-Blanc (Arm for HPC) · Mont-Blanc 3 key objectives Design a compute node based on ARM architecture for a pre-exascale system Well balanced : Memory, Interconnect, IO Use of simulation

MUSA DSE Results: (BSC)

Franfort, June 28, 2018 Mont-Blanc - Going Arm 26

32 core 64 core 1 core

Po

we

r c

on

sum

pti

on

Performance Speed-up

Results of this extended space exploration

Wide study across architectural components

• Interaction between components

• Power / Performance tradeoffs

Identifying trends and optimal cases. Understanding bottlenecks.

Every point is 1 architectural simulation

Page 27: Mont-Blanc (Arm for HPC) · Mont-Blanc 3 key objectives Design a compute node based on ARM architecture for a pre-exascale system Well balanced : Memory, Interconnect, IO Use of simulation

MUSA DSE Results: Exploration (HYDRO)

Franfort, June 28, 2018 Mont-Blanc - Going Arm 27

• Hydro simulation: 1 Rank during 1 iteration. • 400 Architectural configurations per Core • Legend coded by Vector Width used in the simulation

Po

we

r c

on

sum

pti

on

Performance Speed-up

Best for Power Consumption

Best for Performance

Sweet spot!

Example: Identifying Optimal Cases

Page 28: Mont-Blanc (Arm for HPC) · Mont-Blanc 3 key objectives Design a compute node based on ARM architecture for a pre-exascale system Well balanced : Memory, Interconnect, IO Use of simulation

Franfort, June 28, 2018 Mont-Blanc - Going Arm 28

Simulation Architecture

Balanced Architecture

Computing Efficiency

Sw Environment

Applications

Hw. platform

Page 29: Mont-Blanc (Arm for HPC) · Mont-Blanc 3 key objectives Design a compute node based on ARM architecture for a pre-exascale system Well balanced : Memory, Interconnect, IO Use of simulation

Franfort, June 28, 2018 Mont-Blanc - Going Arm 29

Page 30: Mont-Blanc (Arm for HPC) · Mont-Blanc 3 key objectives Design a compute node based on ARM architecture for a pre-exascale system Well balanced : Memory, Interconnect, IO Use of simulation

The mission:

Trigger the development of the

next generation of industrial

processor for Big Data and High

Performance Computing

Franfort, June 28, 2018 Mont-Blanc - Going Arm 30

Page 31: Mont-Blanc (Arm for HPC) · Mont-Blanc 3 key objectives Design a compute node based on ARM architecture for a pre-exascale system Well balanced : Memory, Interconnect, IO Use of simulation

Keeping in mind economic sustainability

Mont-Blanc 2020 includes

the analysis of the

requirements of other

markets

The project’s strategy based

on modular packaging

would make it possible to

create a family of SoCs

targeting markets beyond

conventional HPC, such as

“embedded HPC” for

autonomous driving

Franfort, June 28, 2018 Mont-Blanc - Going Arm 31

Page 32: Mont-Blanc (Arm for HPC) · Mont-Blanc 3 key objectives Design a compute node based on ARM architecture for a pre-exascale system Well balanced : Memory, Interconnect, IO Use of simulation

3 Key Challenges

Franfort, June 28, 2018

To achieve the desired performance with the targeted power consumption, we’ll need to:

understand the trade-offs between vector length, NoC bandwidth and memory bandwidth to maximize processing unit efficiency;

come up with an innovative on-die interconnect that can deliver enough bandwidth with minimum energy consumption;

opt for a high-bandwidth / low power memory solution with enough capacity and bandwidth for Exascale applications.

Mont-Blanc - Going Arm 32

Page 33: Mont-Blanc (Arm for HPC) · Mont-Blanc 3 key objectives Design a compute node based on ARM architecture for a pre-exascale system Well balanced : Memory, Interconnect, IO Use of simulation

Franfort, June 28,

2018 Mont-Blanc - Going Arm 33

The arm ecosystem is maturing :

Real High End solutions are here

We are now at the point where users can play seriously

access and provide further feedbacks

Openness (induced by the arm business model) is a key value

We can expect to have more and more technical solutions,

while guarantying a good sustainability to users

Takeaway

Page 34: Mont-Blanc (Arm for HPC) · Mont-Blanc 3 key objectives Design a compute node based on ARM architecture for a pre-exascale system Well balanced : Memory, Interconnect, IO Use of simulation

montblanc-project.eu | @MontBlanc_EU

This project has received funding from the European Union's Horizon 2020 research and innovation program under grant agreement n° 671697

Thank you for your attention

[email protected]

Credits : the Mont-Blanc team

A final event is being prepared in Cambridge,

on Sept. 19th (synchronized with the end of Arm research summit) Stay tuned (on our web, Linkedin & twitter information channels) for further information.