Technology emerging from the DEEP & DEEP-ER projectsJülich Supercomputing Centre 03.06.2016 The...

35
Technology emerging from the DEEP & DEEP-ER projects Estela Suarez Jülich Supercomputing Centre 03.06.2016 The research leading to these results has received funding from the European Community's Seventh Framework Programme (FP7/2007-2013) under Grant Agreement n° 287530 and n° 610476

Transcript of Technology emerging from the DEEP & DEEP-ER projectsJülich Supercomputing Centre 03.06.2016 The...

Page 1: Technology emerging from the DEEP & DEEP-ER projectsJülich Supercomputing Centre 03.06.2016 The research leading to these results has received funding from the European Community's

Technology emerging from the DEEP & DEEP-ER projects

Estela Suarez

Jülich Supercomputing Centre

03.06.2016

The research leading to these results has received funding from the European Community's Seventh Framework Programme (FP7/2007-2013) under Grant Agreement n° 287530 and n° 610476

Page 2: Technology emerging from the DEEP & DEEP-ER projectsJülich Supercomputing Centre 03.06.2016 The research leading to these results has received funding from the European Community's

DEEP

• Cluster-Booster archit.

• Software stack

• Programming environ.

• Energy efficiency

• Applications:

– Co-design

– Evaluation/demonstration

– Code modernisation

DEEP-ER

• Extend memory hierarchy

• High-performance I/O

• Scalable resiliency

• Applications:

– Co-design

– Evaluation/demonstration

– Code modernisation

2

Topics

Page 3: Technology emerging from the DEEP & DEEP-ER projectsJülich Supercomputing Centre 03.06.2016 The research leading to these results has received funding from the European Community's

CLUSTER-BOOSTER ARCHITECTURE

3

Page 4: Technology emerging from the DEEP & DEEP-ER projectsJülich Supercomputing Centre 03.06.2016 The research leading to these results has received funding from the European Community's

“Standard” heterogeneity

CN

CN

CN

InfiniBand

CN

CN

CN

GPU

GPU

GPU

GPU

GPU

GPU

Flat topology

Simple management of

resources

Static assignment of

accelerators to CPUs

Accelerators cannot act

autonomously

4

Page 5: Technology emerging from the DEEP & DEEP-ER projectsJülich Supercomputing Centre 03.06.2016 The research leading to these results has received funding from the European Community's

CN

CN

CN

InfiniBand

Cluster

Cluster-Booster architecture

5

BI BN

BI BN

BI BN

BN

BN

BN

BN

BN

BN

EXTOLL

Booster

Flexible assignment of resources (CPUs, accelerators)

Direct communication between accelerators

“Offload” of large and complex parts of applications

Page 6: Technology emerging from the DEEP & DEEP-ER projectsJülich Supercomputing Centre 03.06.2016 The research leading to these results has received funding from the European Community's

DEEP Architecture

6

Intel® Xeon ®

Intel ® Xeon PhiTM

Page 7: Technology emerging from the DEEP & DEEP-ER projectsJülich Supercomputing Centre 03.06.2016 The research leading to these results has received funding from the European Community's

DEEP System

• Installed at JSC

• 1,5 racks

• 500 TFlop/s

peak perf.

• 3.5 GFlop/s/W

• Water cooled

7

Cluster

(128 Xeon)Booster

(384 Xeon Phi

KNC)

Page 8: Technology emerging from the DEEP & DEEP-ER projectsJülich Supercomputing Centre 03.06.2016 The research leading to these results has received funding from the European Community's

Node Card Interface Card

8

Booster main components

Page 9: Technology emerging from the DEEP & DEEP-ER projectsJülich Supercomputing Centre 03.06.2016 The research leading to these results has received funding from the European Community's

Booster Nodes (KNC) from 1 to 384

Bo

ost

er

No

de

s (K

NC

) fr

om

1 t

o 3

84

Booster measurementsMPI Linktest: ping-pong

15

Latency: 870 us

BW: 1.3 MB/s

PCIe BW-

issue in 3

nodes

Page 10: Technology emerging from the DEEP & DEEP-ER projectsJülich Supercomputing Centre 03.06.2016 The research leading to these results has received funding from the European Community's

Booster Nodes (KNC) from 1 to 384

Bo

ost

er

No

de

s (K

NC

) fr

om

1 t

o 3

84

Booster measurementsMPI Linktest: ping-pong

16

Stddev < 10%

Page 11: Technology emerging from the DEEP & DEEP-ER projectsJülich Supercomputing Centre 03.06.2016 The research leading to these results has received funding from the European Community's

Main EXTOLL characteristics– Direct network: no switches required

– Integrates network interface controller

– Supports 6+1 links

– Capable of tunneling PCIe (allows remote-booting KNC from the network)

Current (A3) version of EXTOLL ASIC– 270 million transistors

– Link bandwidth: 100 G

– MPI latency: 850 ns

– MPI bandwidth: 8.5 GB/s

– Message rate: 70 million mgs/sec

– PCIe Gen3 x16

Network EXTOLL Tourmalet

Tourmalet PCI Express Board

Tourmalet Chip and Wafer

17

Page 12: Technology emerging from the DEEP & DEEP-ER projectsJülich Supercomputing Centre 03.06.2016 The research leading to these results has received funding from the European Community's

GreenICE system

Alternative Booster implementation

• Interconnect EXTOLL ASIC “Tourmalet”

• 32 KNC-node system

• Implement 4×4×2 topology, with Z dimension open

2-phase immersion cooling

• NOVEC liquid from 3M

• Evaporates at about 50 degrees

• Condensates again in a water cooling pipe

• Allows very high-density integration

19

GreenICE Booster

Page 13: Technology emerging from the DEEP & DEEP-ER projectsJülich Supercomputing Centre 03.06.2016 The research leading to these results has received funding from the European Community's

MPI performanceLatency

20

Page 14: Technology emerging from the DEEP & DEEP-ER projectsJülich Supercomputing Centre 03.06.2016 The research leading to these results has received funding from the European Community's

MPI performanceBandwidth

21

Factor of 3× achieved by EXTOLL TOURMALET (5Gbit/s version A2)

Page 15: Technology emerging from the DEEP & DEEP-ER projectsJülich Supercomputing Centre 03.06.2016 The research leading to these results has received funding from the European Community's

DEEP Architecture

23

Xeon Xeon Phi

Page 16: Technology emerging from the DEEP & DEEP-ER projectsJülich Supercomputing Centre 03.06.2016 The research leading to these results has received funding from the European Community's

Xeon Phi

DEEP-ER Architecture Innovation

Simplified Interconnect

On-Node NVM

Self-Booting Nodes

Network Attached Memory

24

Xeon

Page 17: Technology emerging from the DEEP & DEEP-ER projectsJülich Supercomputing Centre 03.06.2016 The research leading to these results has received funding from the European Community's

DEEP-ER Aurora Blade

prototype

Eurotech’s Aurora technologyDirect water cooled, high density

25

7U

19 inch

Rootcard

-18x EXTOLL

-18x NVMe

Chassis:

-18x KNL

-94GB Mem

-1x backplane

4U

3U

EXTOLL Tourmalet

Aurora Blade DEEP-ER Booster(in construction)

Aurora Blade Chassis

NVMeIntel Xeon Phi (KNL)

Page 18: Technology emerging from the DEEP & DEEP-ER projectsJülich Supercomputing Centre 03.06.2016 The research leading to these results has received funding from the European Community's

Network Attached Memory

NAM architecture

• Xilinx Virtex 7 FPGA

• Hybrid Memory Cube (HMC)– Bandwidth HMC↔FPGA: 40+ GByte/s

– HMC Conrtroler: Open source development

• Attached to TOURMALET NIC

libNAM (libc based) for ease of use

Use cases:

• global (shared) storage

• compute node for an X-OR C/R app

• “active memory”, etc.

26

EXTOLL

Tourmalet NIC

12 EXTOLL lanes

HD

I6

HD

I6

HD

I6H

DI6

HD

I6H

DI6

UHEI Aspin v2 Board

HD

I6

XilinxVirtex 7

FPGA

HMCDRAM

16 HMC lanes

12-laneEXTOLL Cable

Xilinx Virtex 7 FPGA

EXTOLL

Endpoint

HMC

Controller

16 HMC lanes

12 EXTOLL lanes

RMA

Engine

NAM

Functions

NAM Board

Page 19: Technology emerging from the DEEP & DEEP-ER projectsJülich Supercomputing Centre 03.06.2016 The research leading to these results has received funding from the European Community's

SOFTWARE

27

Page 20: Technology emerging from the DEEP & DEEP-ER projectsJülich Supercomputing Centre 03.06.2016 The research leading to these results has received funding from the European Community's

28

Programming environment

Cluster Booster

BoosterInterface

Infin

iband

Exto

ll

Cluster BoosterProtocol

MPI_Comm_spawn

ParaStation MPI

OmpSs on top of MPI provides pragmas to ease the offload process

Page 21: Technology emerging from the DEEP & DEEP-ER projectsJülich Supercomputing Centre 03.06.2016 The research leading to these results has received funding from the European Community's

0

100

200

300

400

500

600

700

800

900

1000

Message size

Th

rou

gh

pu

t[M

B/s

]

CBP

Message-based

RMA limit

Cluster-Booster Protocol

29

Page 22: Technology emerging from the DEEP & DEEP-ER projectsJülich Supercomputing Centre 03.06.2016 The research leading to these results has received funding from the European Community's

Application running on DEEP

3030

Source code

Compiler

Application

binaries

DEEP

Runtime

#pragma omp task in(…) out (…) onto (com, size*rank+1)

Page 23: Technology emerging from the DEEP & DEEP-ER projectsJülich Supercomputing Centre 03.06.2016 The research leading to these results has received funding from the European Community's

DEEP Offload (with OmpSs)Performance & Scalability evaluation

31

Rank 0

master

Rank 0-15

slave0

Rank 0

Worker Rank 1

Worker Rank 2

Worker Rank 3

wk63

Rank 16-31

slave1

Rank 0

Worker Rank 1

Worker Rank 2

Worker Rank 3

wk127

Rank 240-255

slave15

Rank 0

Worker Rank 1

Worker Rank 2

Worker Rank 3

wk1023

x16

x16

x16

Published in: “Collective Offload for Heterogeneous Clusters”, HiPC 2015

FWI (full wave inversion) code

Page 24: Technology emerging from the DEEP & DEEP-ER projectsJülich Supercomputing Centre 03.06.2016 The research leading to these results has received funding from the European Community's

Scalable I/O

• Improve I/O scalability on all usage-levels

• Used also for checkpointing

32

Page 25: Technology emerging from the DEEP & DEEP-ER projectsJülich Supercomputing Centre 03.06.2016 The research leading to these results has received funding from the European Community's

Filesystem

• Two instances:

– Global FS on HDD server

– Cache FS on NVM at node

• API for cache domain handling

– Synchronous version

– Asynchronous version

33

Page 26: Technology emerging from the DEEP & DEEP-ER projectsJülich Supercomputing Centre 03.06.2016 The research leading to these results has received funding from the European Community's

Resiliency

• Develop a hierarchical, distributed checkpoint/restart

mechanism leveraging DEEP-ER architecture

34

Page 27: Technology emerging from the DEEP & DEEP-ER projectsJülich Supercomputing Centre 03.06.2016 The research leading to these results has received funding from the European Community's

APPLICATIONS

36

Page 28: Technology emerging from the DEEP & DEEP-ER projectsJülich Supercomputing Centre 03.06.2016 The research leading to these results has received funding from the European Community's

Application-driven approach

• DEEP+DEEP-ER applications:

– Brain simulation (EPFL)

– Space weather simulation (KULeuven)

– Climate simulation (Cyprus Institute)

– Computational fluid engineering (CERFACS)

– High temperature superconductivity (CINECA)

– Seismic imaging (CGG)

– Human exposure to electromagnetic fields (INRIA)

– Geoscience (LRZ Munich)

– Radio astronomy (Astron)

– Oil exploration (BSC)

– Lattice QCD (University of Regensburg)

• Goals:

– Co-design and evaluation of architecture and its programmability

– Analysis of the I/O and resiliency requirements of HPC codes

37

Page 29: Technology emerging from the DEEP & DEEP-ER projectsJülich Supercomputing Centre 03.06.2016 The research leading to these results has received funding from the European Community's

Cluster-Booster Advantages

• More flexible than a standard architecture

→ This enables different use models:

1. Dynamic ratio of processors/coprocessors

2. Use Booster as pool of accelerators (globally shared)

3. Discrete use of the Booster

4. Discrete use + I/O offload

5. Specialized symmetric mode

• Enables a more efficient use of system resources

– Only resources actually needed are blocked by applications

– Dynamic allocation further increases system utilization

38

Page 30: Technology emerging from the DEEP & DEEP-ER projectsJülich Supercomputing Centre 03.06.2016 The research leading to these results has received funding from the European Community's

39

Code optimisations

0

2

4

6

8

10

12

020406080

100120140

160

Sp

ee

du

p

Gfl

op

s/s

Impact of different optimizations of

wave propagator on Xeon Phi

Gflops/s

speedup

BSC: Enhancing Oil Exploration (FWI, wave propagator)

1 XeonPhi (60 cores), 180 OpenMP threads

Page 31: Technology emerging from the DEEP & DEEP-ER projectsJülich Supercomputing Centre 03.06.2016 The research leading to these results has received funding from the European Community's

Using SIONlib

LRZ: Rapid crustal deformation & earthquake source equation (Seisol)

1 process per node, 16 threads per process

writing 20 checkpoint files (4GB/checkpoint)

40

0.1

1.0

10.0

100.0

16 32 64 128 256 512 1,024

Ba

nd

wid

th [

GB

/s]

# of nodes

Accumulated bandwidth on SuperMUC

SIONlib POSIX MPI I/O HDF5

Page 32: Technology emerging from the DEEP & DEEP-ER projectsJülich Supercomputing Centre 03.06.2016 The research leading to these results has received funding from the European Community's

41

Using NVMe

Inria: Assessment of Human exposure to EM fields

24 MPI processes, 1 thread per process

0

10

20

30

40

50

60

P1 P2 P3 P4

Wri

tin

g t

ime

[s]

I/O performance of MAXW-DGTD

sdv-work NVMe

Increasing

model precision

P1<P2<P3<P4

Page 33: Technology emerging from the DEEP & DEEP-ER projectsJülich Supercomputing Centre 03.06.2016 The research leading to these results has received funding from the European Community's

DEEP/-ER

Emerging Technologies

• Cluster-Booster Architecture:

– Alternative approach to heterogeneity

– High flexibility enabling various use modes

• Hardware components:

– Booster (new kind of cluster of accelerators)

– GreenICE Booster (2-phase immersion cooling)

– EXTOLL network tested at scale

– Warm-water cooling

– Memory hierarchy based on NVM

– Network Attached Memory

42

Page 34: Technology emerging from the DEEP & DEEP-ER projectsJülich Supercomputing Centre 03.06.2016 The research leading to these results has received funding from the European Community's

DEEP/-ER

Emerging Technologies

43

• Software

– Cluster-Booster Protocol: low-level communication protocol between different high-speed networks

– Programming environment for future heterogeneous systems

– ParaStation Global MPI supporting EXTOLL and CBP

– OmpSs extensions for DEEP Offload

– Resiliency extensions for OmpSs (task recovery) and ParaStation

– BeeGFS extension for local caches (on NVM)

– SIONlib extensions for buddy-checkpointing, integration with SCR and use of BeeGFS functionality

– E10 scalability optimisations for MPI-I/O

– Extrae/Paraver support for DEEP Offload

– Applications modernisation and optimisation

Page 35: Technology emerging from the DEEP & DEEP-ER projectsJülich Supercomputing Centre 03.06.2016 The research leading to these results has received funding from the European Community's

DEEP and DEEP-ER

44www.deep-project.eu www.deep-er.eu

EU-Exascale projects

20 partners

Total budget: 28,3 M€

EU-funding: 14,5 M€

Nov 2011 – Mar 2017

Visit us @

ISC’16, Frankfurt

(Germany)

20.-22.06.2016

-Booth #1340

-BoF #11

-Workshop