Mont-Blanc (Arm for HPC) · Mont-Blanc 3 key objectives Design a compute node based on ARM...
Transcript of Mont-Blanc (Arm for HPC) · Mont-Blanc 3 key objectives Design a compute node based on ARM...
montblanc-project.eu | @MontBlanc_EU
This project has received funding from the European Union's Horizon 2020 research and innovation program under grant agreement n° 671697
Mont-Blanc (Arm for HPC)
Etienne Walter, Bull with contributions from the whole team
Going Arm workshop, Frankfort
Mont-Blanc – Origin & context
Franfort, June 28, 2018 Mont-Blanc - Going Arm 2
EPI
Mont-Blanc 2020
Design SoC
Develop IPs
The Mont-Blanc pitch
3 2012 2013 2014 2016 2015 2017 2018
Mont-Blanc
Proof of concept : HPC
computing based
on mobile embedded
technology
Mont-Blanc 2
Extend the concept
and explore new
possibilities Mont-Blanc 3
Prepare industrial solution
Test market acceptance
Vision: leverage the fast growing market of mobile
technology for scientific computation
2019 2020
Franfort, June 28, 2018 Mont-Blanc - Going Arm
PRACE prototypes
• Tibidabo
• Carma
• Pedraforca
Mini-clusters
• Arndale
• Odroid XU
• Odroid XU-3
• NVIDIA Jetson
Mont-Blanc prototype
• Operational since May 2015 @ BSC
• 1080 compute cards
• Dual Cortex-A15
• GPU Mali-T604
• 4 GB LPDDR3
• Up to 64 GB local storage
• USB-to-Eth network
• Fine grained power monitoring system
ARM 64-bit mini-clusters
• APM X-GENE2
• Cavium ThunderX
• NVIDIA TX1
The Mont-Blanc prototype ecosystem
Franfort, June 28, 2018 Mont-Blanc - Going Arm 4
2012 2013 2014
Prototypes are critical to accelerate software development System software stack + applications
2015 2016 2017
Mont-Blanc prototype
Exynos 5 compute card 2 x Cortex-A15 @ 1.7GHz
1 x Mali T604 GPU
6.8 + 25.5 GFLOPS
15 Watts
2.1 GFLOPS/W
Carrier blade 15 x Compute cards
Blade chassis 7U 9 x Carrier blade
Rack 8 BullX chassis - 72 blades
1080 Compute cards
2160 CPUs -1080 GPUs
4.3 TB DRAM - 17.2 TB Flash
35 TFLOPS
24 kWatt
Franfort, June 28, 2018 Mont-Blanc - Going Arm 5
Mont-Blanc & Mont-Blanc 2 key results
Showed HPC computing was possible
with ARM based hardware
growing interest in using ARM for HPC
Showed that application scalability is
possible even with very low end
component
Need denser compute - more cores
Need a better interconnect ( Native
Ethernet)
Optimizing energy efficiency at the
system-level is hard
Franfort, June 28, 2018 Mont-Blanc - Going Arm 6
Mont-Blanc 3 key objectives
Design a compute node based on ARM architecture for a pre-exascale system
Well balanced : Memory, Interconnect, IO
Use of simulation to evaluate the options on applications
Energy efficient
Evaluate new high-end ARM core and accelerator, and assess different options for compute efficiency
Heterogeneous cores, vectorization, high performance core
Assessment with existing solutions & applications
Develop the software ecosystem needed for market acceptance of ARM solutions
Mont-Blanc - Going Arm 7
Prototypes
Scientific applications
HPC software ecosystem
Franfort, June 28, 2018
Key ideas: Well-balanced architecture, (Energy & Compute) Efficiency, Throughput computing, Co-design, SoC/SoP design
Franfort, June 28, 2018 Mont-Blanc - Going Arm 8
Sw Environment
Applications
Hw. platform
Franfort, June 28, 2018 Mont-Blanc - Going Arm 9
Sw Environment
Applications
Hw. platform
Applications
Franfort, June 28, 2018 Mont-Blanc - Going Arm 10
Contribution to co-design effort
Global need to change programmer’s mindset
Optimize throughput
& avoid (as far as possible) to be latency limited
• work on « taskification »
• work on asynchronism
Key role of runtimes
dynamic resource reallocation / maximize overall efficiency and load balancing
need for proper interaction between programming models (MPI & OpenMP)
Needed : mechanism to set locality & data management
cf. programming model (pushing this in OpenMP)
Franfort, June 28, 2018 Mont-Blanc - Going Arm 11
From App analysis to large scale extrapolations (OpenIFS - ECMWF )
Franfort, June 28, 2018 Mont-Blanc - Going Arm 12
8
16
32
64
96
No noise L=50us
BW=100MB/s
Simulation
Franfort, June 28, 2018 Mont-Blanc - Going Arm 13
Sw Environment
Applications
Hw. platform
RedHat
GCC LLVM Mercurium Compilers
SLURM OpenLDAP NTP Ganglia Cluster management
OpenMP Nanos++ HSA MPI Runtime libraries
Drivers Ubuntu OpenSUSE CentOS Fedora Operating System (Linux kernel-based)
Power management Lustre NFS DVFS Hardware support
(Heterogeneous) Hardware
Extrae Perf MAQAO Allinea
ATLAS Arm Performance Libraries BLIS Scientific libraries
Kernels Mini-applications Full applications Applications (source files – C/C++/Fortran/Python etc.)
HPC Arm Software Stack
14
Porting MAQAO & CERE
Optimised Arm Performance Libraries
Porting PGI-Flang
OpenMP+OmpSs heterogeneous support
improvements
OmpSs accelerator support improvements
LLVM+GNU OpenMP optimisations
Porting TinyMPI
OmpSs and MPI communication/computation overlap
Topology-aware MPI
Linux kernel heterogeneous support improvements
memcpy-optimisations techniques via RDMA
Heterogeneous device specifications
Power management improvements
+ Contribution to OpenHPC (from 1.2)
Franfort, June 28, 2018 Mont-Blanc - Going Arm
PAPI Developer tools
- Mont-Blanc 3 contributions
Most elements to be available on Dibona platform
– Now on ARM (since 1.2 version)
OpenHPC is a community effort to provide a common, verified set of open source packages for HPC deployments
ARM and Atos Bull:
• Members of OpenHPC
• Both are on the OpenHPC Technical Steering Committee and drive ARM build support
Status: 1.3.2 release out now
• All packages built on ARMv8 for CentOS and SUSE
• ARM-based machines are being used for building and also in the OpenHPC build infrastructure
Franfort, June 28, 2018 Mont-Blanc - Going Arm 15
Functional Areas Components include
Base OS RHEL/CentOS 7.1, SLES 12
Administrative Tools Conman, Ganglia, Lmod, LosF, ORCM, Nagios, pdsh, prun
Provisioning Warewulf
Resource Mgmt. SLURM, Munge. Altair PBS Pro*
I/O Services Lustre client (community version)
Numerical/Scientific Libraries
Boost, GSL, FFTW, Metis, PETSc, Trilinos, Hypre, SuperLU, Mumps
I/O Libraries HDF5 (pHDF5), NetCDF (including C++ and Fortran interfaces), Adios
Compiler Families GNU (gcc, g++, gfortran)
MPI Families OpenMPI, MVAPICH2
Development Tools Autotools (autoconf, automake, libtool), Valgrind,R, SciPy/NumPy
Performance Tools PAPI, Intel IMB, mpiP, pdtoolkit TAU
Franfort, June 28, 2018 Mont-Blanc - Going Arm 16
Sw Environment
Applications
Hw. platform
Franfort, June 28, 2018 Mont-Blanc - Going Arm 17
Dibona: our new test platform
Franfort, June 28, 2018 Mont-Blanc - Going Arm 18
IB EDR switches
Internal Ethernet management network
Redundant management server
including storage Power supply units
Hydraulics for Direct Liquid Cooling (ultra-energy efficient with
hot water cooling)
48 computes nodes: 16 blades 96 CPUs
3000 cores 12000 threads
Dibona (Mont-Blanc 3 test platform)
Optimized in density for 2-socket systems with up to 8 memory channels per socket in 1U
Franfort, June 28, 2018 Mont-Blanc - Going Arm 19
Cavium ThunderX2 choice
Interconnect Mezzanine
1U form factor
Direct liquid cooling
3 dual-socket compute
nodes per blade with:
• 2 Cavium ThunderX2
processors (up to 32
Armv8 cores per CPU, 4
threads per core, up to
2.5GHz)
• 16 DDR4 DIMM slots
• 1 Interconnect mezzanine
board (EDR)
Cavium® ThunderX2™
CVN thermal design & mechanical enclosure
3D detailed PCB calculation has been performed
Franfort, June 28, 2018 Mont-Blanc - Going Arm 20
new coldplate
new CPU heatspreader
new DDR heatspreader
Software stack and tools status (@ Dibona)
Franfort, June 28,
2018 Mont-Blanc - Going Arm
21
GCC ARM/LLVM compilers
Compilers
Cluster management
Runtime libraries
Drivers Ubuntu (optional)
Operating System (Linux kernel-based)
Power management
Hardware support
Dibona ThunderX2 compute node
Developer tools
Scientific libraries
Paqages dependencies
RHEL 7.5
DVFS NFS
BSC SW stack OpenMPI
SLURM
Papi Perf Allinea
ARM Per. Libs.
RHEL ATLAS/BLAS packages
RHEL 7.5 linux kernel-4.14.0-22
Open MPI 2.0.x
SLURM 17.x
Mellanox OFED 4.3.x
PAPI 5.4.x
GCC 5.3.0, 6.2.x, 7.1.0, 7.2.x, 7.3.x
HPCToolkit 5.4.3
the Allinea Studio tool : the Arm HPC compiler 18.2. for RH 7.2+
the Allinea Forge debugging and profiling tool
the Allinea Performance Reports reporting tool
Patched perf tool (providind 900+
uncore events for L3 partitions and
DMCs)
Patched PAPI (26 available events)
Power management features (@ Dibona)
Energy and Power monitoring
HDEEM data collection
Energy cumulated through time
(Blade + VRs + NICs) in Joules
High frequency FPGA energy
calculation (100Hz for VRs, 1000Hz
for Blade)
SLURM Power plugin
Grafana visualisation
DVFS, Power capping and
Turbo support
Franfort, June 28, 2018 Mont-Blanc - Going Arm 22
Preparing the future SoCs
Franfort, June 28, 2018 Mont-Blanc - Going Arm 23
Simulation Architecture
Balanced Architecture
Computing Efficiency
Sw Environment
Applications
Hw. platform
Simulation Tools
Mont-Blanc - Going Arm 24
Name on-line/offline
scope granularity speed(up)
gem5 (classic memory)
on-line core/SoC u-arch/insts ~100-200 KIPS
Elastic trace (gem5)
off-line interconnect/memory system
u-arch/insts ~ 7x (gem5)
SimMATE (gem5)
off-line memory system u-arch/insts ~6x – 800x (gem5)
Garnet (gem5)
on-line interconnect ~0.2x (gem5)
TaskSim off-line compute node/task scheduler
arch/tasks ~ 10x native (burst) ~20x gem5 (memory)
Dimemas off-line cluster/off-chip network
bursts/ messages
“very fast”
SST/macro on-line cluster/off-chip network
bursts/ messages
~ 0.3-3x native
u-arch/insts arch/insts bursts/tasks inter-proc messages
SimMATE
TaskSim
Dimemas
Spe
ed
Granularity
gem5 (on-line)
SST/macro
Elastic trace
dist-gem5
Scope
core/SoC multi-core compute-node cluster
MUSA
Simulation tools map Integration in a global framework
Franfort, June 28, 2018
Design Space Exploration (DSE)
New Experiments
Franfort, June 28, 2018 Mont-Blanc - Going Arm 25
New Features
FP Vector Width Simulation
Chip and Main Mem. Power measurements
Applications
BT, SP, Hydro, LULESH,
Specfem3D
Cores Issue width / ROB
Frequency Cache size L3 / L2
Memory Vector width
1 2 / 40 (lo-end) 1.5 64MB/1MB 4ch DDR4 128
32 4 / 180 (thunX) 2.0 32MB/1MB 8ch DDR4 256
64 6/224 (skylak) 2.5 32MB/256K 512
8/300 (hi-end) 3.0
SIMULATOR PARAMETERS/COMPONENTS
Simulate every possible combination of these
865 detailed arch simulations per app
MUSA DSE Results: (BSC)
Franfort, June 28, 2018 Mont-Blanc - Going Arm 26
32 core 64 core 1 core
Po
we
r c
on
sum
pti
on
Performance Speed-up
Results of this extended space exploration
Wide study across architectural components
• Interaction between components
• Power / Performance tradeoffs
Identifying trends and optimal cases. Understanding bottlenecks.
Every point is 1 architectural simulation
MUSA DSE Results: Exploration (HYDRO)
Franfort, June 28, 2018 Mont-Blanc - Going Arm 27
• Hydro simulation: 1 Rank during 1 iteration. • 400 Architectural configurations per Core • Legend coded by Vector Width used in the simulation
Po
we
r c
on
sum
pti
on
Performance Speed-up
Best for Power Consumption
Best for Performance
Sweet spot!
Example: Identifying Optimal Cases
Franfort, June 28, 2018 Mont-Blanc - Going Arm 28
Simulation Architecture
Balanced Architecture
Computing Efficiency
Sw Environment
Applications
Hw. platform
Franfort, June 28, 2018 Mont-Blanc - Going Arm 29
The mission:
Trigger the development of the
next generation of industrial
processor for Big Data and High
Performance Computing
Franfort, June 28, 2018 Mont-Blanc - Going Arm 30
Keeping in mind economic sustainability
Mont-Blanc 2020 includes
the analysis of the
requirements of other
markets
The project’s strategy based
on modular packaging
would make it possible to
create a family of SoCs
targeting markets beyond
conventional HPC, such as
“embedded HPC” for
autonomous driving
Franfort, June 28, 2018 Mont-Blanc - Going Arm 31
3 Key Challenges
Franfort, June 28, 2018
To achieve the desired performance with the targeted power consumption, we’ll need to:
understand the trade-offs between vector length, NoC bandwidth and memory bandwidth to maximize processing unit efficiency;
come up with an innovative on-die interconnect that can deliver enough bandwidth with minimum energy consumption;
opt for a high-bandwidth / low power memory solution with enough capacity and bandwidth for Exascale applications.
Mont-Blanc - Going Arm 32
Franfort, June 28,
2018 Mont-Blanc - Going Arm 33
The arm ecosystem is maturing :
Real High End solutions are here
We are now at the point where users can play seriously
access and provide further feedbacks
Openness (induced by the arm business model) is a key value
We can expect to have more and more technical solutions,
while guarantying a good sustainability to users
Takeaway
montblanc-project.eu | @MontBlanc_EU
This project has received funding from the European Union's Horizon 2020 research and innovation program under grant agreement n° 671697
Thank you for your attention
Credits : the Mont-Blanc team
A final event is being prepared in Cambridge,
on Sept. 19th (synchronized with the end of Arm research summit) Stay tuned (on our web, Linkedin & twitter information channels) for further information.