Center for Computational Sciences, Univ. of Tsukuba
Japan's Supercomputing Systems on Today and Tomorrow
Taisuke Boku
Deputy Director, Center for Computational Sciences
University of Tsukuba
www.hpcs.cs.tsukuba.ac.jp/~taisuke/
Center for Computational Sciences, Univ. of Tsukuba
Where I’m from: CCS, Univ. of Tsukuba
Center for Computational Sciences
Established in 1992
Operated for 12 years as the Center for Computational Physics
Reorganized as the Center for Computational Sciences in 2004
Daily collaborative research between two kinds of researchers (about 30 in total):
Computational scientists, who have NEEDS (applications)
Computer scientists, who have SEEDS (systems & solutions)
Center for Computational Sciences, Univ. of Tsukuba
Oakforest-PACS (OFP) System: Japan's fastest supercomputer
Center for Computational Sciences, Univ. of Tsukuba
Towards Exascale Computing
[Chart: peak performance (1 to 1000 PF) vs. year (2008-2020), showing T2K (U. of Tsukuba / U. of Tokyo / Kyoto U., around 2008) and TSUBAME2.0 (Tokyo Tech., around 2010), then OFP (JCAHPC: U. Tsukuba and U. Tokyo, 2016), the Post-K Computer (RIKEN AICS, around 2020), and future Exascale systems beyond.]
Tier-1 and Tier-2 supercomputers form the HPCI and move forward to Exascale computing like two wheels.
Deployment plan of 9 supercomputing centers (as of Feb. 2017), fiscal years 2014-2025: Hokkaido, Tohoku, Tsukuba, Tokyo, Tokyo Tech., Nagoya, Kyoto, Osaka, and Kyushu.
[Roadmap chart: each center's row shows its systems in operation (e.g. Fujitsu FX10/FX100, NEC SX-ACE, Cray XC30, HA-PACS, COMA, TSUBAME 2.5/3.0, Reedbush, Oakforest-PACS 25 PF, PACS-X) and planned successors in the 10 PF to 200+ PF class, annotated with usage category (FAC/UCC/TPF/CFL) and power budget.]
Power consumption indicates the maximum of the power supply (including the cooling facility).
Center for Computational Sciences, Univ. of Tsukuba
OFP in JCAHPC
Operated jointly by U. Tsukuba and U. Tokyo
Very tight relationship and collaboration between the two universities
For their primary supercomputer resources, a uniform specification for a single shared system
Each university is financially responsible for introducing the machine and for its operation
First attempt in Japan: unified procurement toward a single system of the largest scale in Japan
[Organization chart of JCAHPC]
Advanced HPC Facility (supercomputer) at the Kashiwa Campus, The University of Tokyo
Administrative Council, Chairman: Professor Hiroshi Nakamura
Divisions: Public Relations and Planning Division; Operation Technology Division; Advanced Computational Science Support Technology Division
Budget and operation are provided jointly by the Information Technology Center, The University of Tokyo, and the Center for Computational Sciences, University of Tsukuba
Oakforest-PACS (OFP): Japan’s Fastest Supercomputer
• 25 PFLOPS peak
• 8,208 KNL CPUs
• Full-bisection-bandwidth (FBB) fat-tree by Omni-Path
• HPL 13.55 PFLOPS: #1 in Japan, #6 in the world
• HPCG: #3 in the world
• Green500: #6 in the world
• Full operation started in Dec. 2016
• Official program starts in April 2017
Center for Computational Sciences, Univ. of Tsukuba
Photo of computation node & chassis
Computation node (Fujitsu next-generation PRIMERGY) with a single-chip Intel Xeon Phi (Knights Landing, 3+ TFLOPS) and an Intel Omni-Path Architecture card (100 Gbps)
Chassis with 8 nodes, 2U size
Water cooling fan & pipe
Center for Computational Sciences, Univ. of Tsukuba
Specification of OFP system
Total peak performance: 25 PFLOPS
Linpack performance: 13.55 PFLOPS (with 8,178 nodes, 556,104 cores)
Total number of compute nodes: 8,208
Compute node
  Product: Fujitsu PRIMERGY CX1640 M1
  Processor: Intel Xeon Phi 7250 (code name Knights Landing), 68 cores, 3 TFLOPS
  Memory (high BW): 16 GB, > 400 GB/sec (MCDRAM, effective rate)
  Memory (low BW): 96 GB, 115.2 GB/sec (DDR4-2400 x 6 channels, peak rate)
Interconnect
  Product: Intel Omni-Path Architecture
  Link speed: 100 Gbps
  Topology: fat-tree with (completely) full bisection bandwidth (102.6 TB/s)
Login node
  Product: Fujitsu PRIMERGY RX2530 M2 server
  Number of servers: 20
  Processor: Intel Xeon E5-2690v4 (2.6 GHz, 14 cores x 2 sockets)
  Memory: 256 GB, 153 GB/sec (DDR4-2400 x 4 channels x 2 sockets)
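The headline bandwidth numbers in the table follow from simple arithmetic; the sketch below reproduces the DDR4 peak rate and the quoted fat-tree aggregate, assuming the 102.6 TB/s figure counts the 100 Gbps links of all 8,208 nodes (that assumption is mine, not stated on the slide):

    /* Back-of-envelope check of the OFP bandwidth figures quoted above.
     * Assumptions (not from the slide): DDR4-2400 transfers 8 bytes per channel
     * per transfer, and the fat-tree figure counts every node's 100 Gbps link. */
    #include <stdio.h>

    int main(void) {
        double ddr4 = 2400e6 * 8 * 6;         /* 6 channels of DDR4-2400, bytes/s */
        double opa  = 8208 * (100e9 / 8.0);   /* 8,208 nodes x 100 Gbps, bytes/s  */
        printf("DDR4 peak per node : %.1f GB/s\n", ddr4 / 1e9);  /* 115.2 GB/s */
        printf("OPA aggregate      : %.1f TB/s\n", opa  / 1e12); /* 102.6 TB/s */
        return 0;
    }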
Center for Computational Sciences, Univ. of Tsukuba
Specification of OFP system (I/O)
Parallel file system
  Type: Lustre file system
  Total capacity: 26.2 PB
  Metadata: DataDirect Networks MDS servers + SFA7700X; 4 MDS servers x 3 sets; MDT 7.7 TB (SAS SSD) x 3 sets
  Object storage: DataDirect Networks SFA14KE; 10 OSS (20 nodes); aggregate bandwidth 500 GB/sec
Fast file cache system
  Type: burst buffer, Infinite Memory Engine (by DDN)
  Total capacity: 940 TB (NVMe SSD, including parity data by erasure coding)
  Product: DataDirect Networks IME14K; 25 servers (50 nodes); aggregate bandwidth 1,560 GB/sec
Center for Computational Sciences, Univ. of Tsukuba
Important Role of OFP in Japan
OFP is now the fastest system in Japan: OFP 25 PF vs. the K Computer 11.2 PF.
After the K Computer, OFP is expected to be a bridge system toward the post-K computer.
The post-K computer will be a many-core system, as OFP is.
Toward the post-K computer, OFP plays an important role:
for application development with large-scale problems
for deployment of system software for many-core systems, including McKernel and XcalableMP (see the sketch below for a flavor of XcalableMP)
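For that flavor, here is a minimal global-view XcalableMP sketch in C. It is my own illustration based on my reading of the XcalableMP specification, not code from the OFP or post-K projects; the node count and array are arbitrary:

    /* Sketch of XcalableMP's directive-based global-view style (assumed syntax
     * per the XMP spec; verify against the official specification).
     * Array a is block-distributed over 4 nodes and the loop is partitioned. */
    #include <stdio.h>
    #define N 100

    #pragma xmp nodes p(4)
    #pragma xmp template t(0:N-1)
    #pragma xmp distribute t(block) onto p

    double a[N];
    #pragma xmp align a[i] with t(i)

    int main(void) {
    #pragma xmp loop on t(i)
        for (int i = 0; i < N; i++)    /* each node touches only its local block */
            a[i] = 2.0 * i;
        return 0;
    }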
Machine location: Kashiwa Campus of U. Tokyo
[Map showing U. Tsukuba, the Hongo Campus of U. Tokyo, and the Kashiwa Campus of U. Tokyo]
Center for Computational Sciences, Univ. of Tsukuba
ARTED: Ab-initio Real-Time Electron Dynamics simulation
[Strong-scaling plots of ARTED on OFP for Si and SiO2; lower execution time is better]
• Strong scaling up to 128 nodes for Si
• >12x faster than the K Computer (with 128 nodes) for Si
• Not enough parallelism for SiO2, and MPI communication becomes a bottleneck
Center for Computational Sciences, Univ. of Tsukuba
OFP resource sharing program (nation-wide)
JCAHPC (20%)
HPCI – HPC Infrastructure program in Japan to share all supercomputers (free!)
Big challenge special use (full system size)
U. Tsukuba (23.5%)
Interdisciplinary Academic Program (free!)
Large scale general use
U. Tokyo (56.5%)
General use
Industrial trial use
Educational use
Young & Female special use
Center for Computational Sciences, Univ. of Tsukuba
AI-related projects in the near future in Japan
Tokyo Institute of Technology
TSUBAME 3.0 (in operation from August 2017)
Intel Omni-Path optical fiber network, full-bisection-bandwidth fat-tree
432 Terabits/sec bi-directional, about twice the average traffic of the whole Internet (see the check below)
540 compute nodes (SGI ICE XA): 2x Intel Xeon CPU + 4x NVIDIA Pascal GPU (SXM2) per node
256 GB memory, 2 TB NVMe SSD, quad-rail Omni-Path HFI per node
47.2 AI-Petaflops, 12.1 Petaflops (DP)
DDN storage system: Lustre FS 15.9 PB, 150 GB/s; home 45 TB
UNIVA Grid Engine with Docker support for virtualization
Warm water cooling (estimated PUE = 1.033)
(slides provided by Tokyo Tech.)
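The quoted 432 Tbps can be reproduced from the node count and link configuration above; a minimal check, assuming the figure counts 540 nodes with quad-rail 100 Gbps links in both directions (that interpretation is mine, not stated on the slide):

    /* Back-of-envelope check of the TSUBAME 3.0 network figure quoted above. */
    #include <stdio.h>

    int main(void) {
        /* nodes x rails x link speed x directions, reported in Tbps */
        double tbps = 540 * 4 * 100e9 * 2 / 1e12;
        printf("aggregate bandwidth: %.0f Tbps\n", tbps);   /* 432 Tbps */
        return 0;
    }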
TSUBAME 3.0 features for HPC, Big Data, and DL/AI
High FP16 performance
High memory bandwidth
Full-bisection network, quad-rail OPA HFI
Local NVMe SSD, large-capacity Lustre FS
A supercomputer not only for HPC but also for Big Data and DL/AI; these are TSUBAME 3.0's target applications.
(slides provided by Tokyo Tech.)
METI AIST-AIRC ABCI as the world's first large-scale OPEN AI infrastructure
ABCI: AI Bridging Cloud Infrastructure
Top-level SC compute & data capability (130~200 AI-Petaflops)
Open, public & dedicated infrastructure for AI & Big Data algorithms, software, and applications
Platform to accelerate joint academic-industry R&D for AI in Japan
Univ. Tokyo Kashiwa Campus
• 130~200 AI-Petaflops
• < 3 MW power
• < 1.1 avg. PUE
• Operational 2017 Q3~Q4
(slides provided by AIST AIRC)
ABCI Prototype: AIST AI Cloud (AAIC), March 2017
400 NVIDIA Tesla P100s and InfiniBand EDR accelerate various AI workloads including ML (Machine Learning) and DL (Deep Learning).
Advanced data analytics leveraged by 4 PiB of shared Big Data storage and Apache Spark with its ecosystem.
AI computation system
  Computation nodes (with GPU) x50: Intel Xeon E5 v4 x2, NVIDIA Tesla P100 (NVLink) x8, 256 GiB memory, 480 GB SSD
  Computation nodes (without GPU) x68: Intel Xeon E5 v4 x2, 256 GiB memory, 480 GB SSD
  Management & service nodes x16, interactive nodes x2
  In total: 400 Pascal GPUs, 30 TB memory, 56 TB SSD
Large-capacity storage system
  DDN SFA14K: file servers (10GbE x2, IB EDR x4) x4, 8 TB 7.2 krpm NL-SAS HDD x730, GRIDScaler (GPFS)
  > 4 PiB effective capacity, R/W 100 GB/s
Computation network
  Mellanox CS7520 director switch, EDR (100 Gbps) x216 ports
  200 Gbps bi-directional, full bisection bandwidth
Service and management network
  IB EDR (100 Gbps); GbE or 10GbE
Firewall
  FortiGate 3815D x2, FortiAnalyzer 1000E x2; UTM firewall, 40-100 Gbps class, 10GbE
SINET5 Internet connection, 10-100 GbE
(slides provided by AIST AIRC)
Center for Computational Sciences, Univ. of Tsukuba
Flagship 2020 Project (Post-K)
Flagship 2020 project
Developing the next Japanese flagship computer, the so-called "post-K"
Developing a wide range of application codes, to run on the post-K, to solve major social and scientific issues: society with health and longevity, disaster prevention and global climate, energy issues, industrial competitiveness, and basic science
Development is carried out together with a vendor partner
The Japanese government selected 9 social & scientific priority issues and their R&D organizations.
(slides provided by RIKEN AICS)
Overview of the post-K Computer
Hardware
  Manycore architecture
  6D mesh/torus interconnect
  3-level hierarchical storage system: silicon disk, magnetic disk, and storage for archive
Target performance
  100 times (maximum) that of K for capacity computing
  50 times (maximum) that of K for capability computing
  Power consumption of 30-40 MW (cf. K computer: 12.7 MW)
[System diagram: compute nodes connected through an I/O network to login servers, maintenance servers, portal servers, and the hierarchical storage system]
System software
  Multi-kernel: Linux with a light-weight kernel
  File I/O middleware for the 3-level hierarchical storage system and applications
  Application-oriented file I/O middleware
  MPI+OpenMP programming environment (a minimal hybrid sketch follows below)
  Highly productive programming languages and libraries
(slides provided by RIKEN AICS)
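As a reminder of what the MPI+OpenMP hybrid model listed above looks like in practice, here is a minimal, generic sketch (my own illustration, not post-K system software):

    /* Hybrid MPI+OpenMP sketch: each MPI rank computes a partial sum with an
     * OpenMP-parallel loop, then the ranks combine results with MPI_Reduce. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int provided, rank;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double local = 0.0;
        #pragma omp parallel for reduction(+:local)
        for (int i = 0; i < 1000000; i++)
            local += 1.0 / (i + 1.0);          /* any per-rank work */

        double global;
        MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0) printf("sum = %f\n", global);
        MPI_Finalize();
        return 0;
    }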
Nine Priority Application Areas
① Innovative drug discovery infrastructure through functional control of biomolecular systems (RIKEN Quantitative Biology Center)
② Integrated computational life science to support personalized and preventive medicine (Inst. of Medical Science, U. Tokyo)
③ Integrated prediction systems for hazards and disasters induced by earthquakes and tsunamis (Earthquake Research Inst., U. Tokyo)
④ Advancement of meteorological and global environmental predictions utilizing observational big data (Center for Earth Information, JAMSTEC)
⑤ New fundamental technologies for high-efficiency energy creation, conversion/storage, and use (Inst. for Molecular Science, NINS)
⑥ Practical application of innovative clean energy systems (Grad. School of Engineering, U. Tokyo)
⑦ Creation of new functional devices and high-performance materials to support next-generation industries (Inst. for Solid State Physics, U. Tokyo)
⑧ Innovative design and production processes leading the manufacturing industry in the near future (Inst. of Industrial Science, U. Tokyo)
⑨ Elucidation of the fundamental laws and evolution of the universe (CCS, U. Tsukuba)
(slides provided by RIKEN AICS)
Co-design in the Post K development
Target applications:
① GENESIS: MD for proteins
② Genomon: genome processing (genome alignment)
③ GAMERA: earthquake simulator (FEM on unstructured & structured grids)
④ NICAM+LETKF: weather prediction system using big data (structured-grid stencil & ensemble Kalman filter)
⑤ NTChem: molecular electronic structure calculation
⑥ FFB: large eddy simulation (unstructured grid)
⑦ RSDFT: ab-initio program (density functional theory)
⑧ Adventure: computational mechanics system for large-scale analysis and design (unstructured grid)
⑨ CCS-QCD: lattice QCD simulation (structured-grid Monte Carlo)
The nine social & scientific priority issues and their R&D organizations were selected from the following points of view:
• high priority from a social and national viewpoint
• promise of creating world-leading achievements
• promise of strategic use of the post-K computer
Goal: every application achieves 100x the performance of the K Computer, i.e., performance worthy of the exascale.
Center for Computational Sciences, Univ. of Tsukuba
PACS-X Project in CCS, U. Tsukuba
Center for Computational Sciences, Univ. of Tsukuba
Schematic of Accelerator in Switch
[Schematic: CPU, GPU, and FPGA connected through a PCIe switch. The GPU handles coarse-grain offloading; the FPGA handles fine-grain partial offloading and high-speed communication (post-PEACH) to the interconnect; the CPU handles miscellaneous work and computation. Multi-hetero: each device covers the parts it is suited to, without invading the others.]
LET (Locally Essential Tree) construction
A distance judgment is made between the partial regional data on the receiver side and each cell on the sender side.
The result is stored in a mask[id] array: mask[id] == 0 means skip the cell; mask[id] == 1 means add it to the LET.
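A rough illustration of this mask-based LET construction, written in plain C. This is my own sketch, not the actual HA-PACS/PACS-X code; the types, names, and opening criterion are assumptions for illustration:

    /* Illustrative mask-based LET build: for every sender-side tree cell, a
     * distance judgment against the receiver's local domain sets mask[id];
     * cells with mask[id] == 1 are then packed into the LET buffer. */
    typedef struct { double cx, cy, cz, size; } cell_t;   /* cell center and size */
    typedef struct { double lo[3], hi[3]; } domain_t;     /* receiver-side region */

    static int needed_by_receiver(const cell_t *c, const domain_t *d) {
        /* distance from the cell center to the nearest point of the domain */
        double dx = 0.0, dy = 0.0, dz = 0.0;
        if (c->cx < d->lo[0]) dx = d->lo[0] - c->cx; else if (c->cx > d->hi[0]) dx = c->cx - d->hi[0];
        if (c->cy < d->lo[1]) dy = d->lo[1] - c->cy; else if (c->cy > d->hi[1]) dy = c->cy - d->hi[1];
        if (c->cz < d->lo[2]) dz = d->lo[2] - c->cz; else if (c->cz > d->hi[2]) dz = c->cz - d->hi[2];
        double r2 = dx * dx + dy * dy + dz * dz;
        /* keep cells close/large enough that the receiver must open them
         * (illustrative opening criterion, not the production one) */
        return (c->size * c->size) > 0.25 * r2;
    }

    void build_let(const cell_t *cells, int ncell, const domain_t *recv,
                   int *mask, cell_t *let, int *nlet) {
        for (int id = 0; id < ncell; id++)            /* distance judgment per cell */
            mask[id] = needed_by_receiver(&cells[id], recv) ? 1 : 0;
        int n = 0;
        for (int id = 0; id < ncell; id++)            /* mask == 0: skip, == 1: add */
            if (mask[id]) let[n++] = cells[id];
        *nlet = n;
    }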
Center for Computational Sciences, Univ. of Tsukuba
Preliminary Evaluation
[Two bar charts of execution time (μs), comparing CPU execution vs. the FPGA offload mechanism:
Execution time for making the LET: the module itself speeds up by 2.2x.
From LET making through GPU data transfer: 7.2x speedup.]
Center for Computational Sciences, Univ. of Tsukuba
Challenge on external communication
PEACH2/PEACH3 I/O bottleneck: they depend on PCIe everywhere, both to connect CPU and GPU and for the external link
PCIe is a bottleneck for today's advanced interconnects
High-performance interconnection between FPGAs: an optical interconnect interface is ready
up to 100 Gbps speed
provided as IP for users
FPGA-to-FPGA communication without the intra-node communication bottleneck enables on-the-fly computation & communication
Center for Computational Sciences, Univ. of Tsukuba
Challenge on Programming
OpenCL is not perfect, but it is the best option today
Much smaller number of lines than Verilog (see the sketch after this list)
Easy to understand even for application users
Very long compilation time causes a serious turnaround time (TAT) for development
Not sufficient to use all functions of the FPGA
We need "glue" to support end-user programming
The relation of OpenCL to Verilog is similar to that of C to assembler
Make Verilog-written low-level code available as a "library" for OpenCL
Potentially possible (according to Altera documentation), but hard
Challenge: partial reconfiguration
Open source for applications & examples
Combination of OpenCL applications + Verilog modules
On a commodity platform (unlike PEACH2, which required special hardware)
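To illustrate the abstraction gap noted above, here is a trivial streaming kernel in OpenCL C. It is a generic example, not OFP or PACS-X code; the point is only that the C-level description stays a few lines, whereas hand-written Verilog would need explicit state machines, FIFOs, and handshaking:

    /* Minimal OpenCL C kernel (illustrative only): element-wise AXPY.
     * On an FPGA OpenCL toolchain this C-level kernel is compiled into a
     * hardware pipeline by the offline compiler. */
    __kernel void axpy(const float alpha,
                       __global const float *x,
                       __global float *y)
    {
        size_t i = get_global_id(0);   /* one work-item per element */
        y[i] = alpha * x[i] + y[i];
    }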
Center for Computational Sciences, Univ. of Tsukuba
PACS-X (ten) Project
PACS (Parallel Advanced system for Computational Sciences): a series of co-design-based parallel system developments at U. Tsukuba (since 1978), covering both systems and applications
Recent systems focus on accelerators:
PACS-VIII: HA-PACS (GPU cluster, Fermi + Kepler, PEACH2, 1.1 PFLOPS)
PACS-IX: COMA (MIC cluster, KNC, 1 PFLOPS)
Next generation of the TCA implementation: PEACH2 with PCIe is old and has several limitations
New generation of GPUs and FPGAs with high-speed interconnection
More tightly co-designed with applications
System deployment starts in 2018-2019
PPX: Pre-PACS-X (also used for CREST)
Center for Computational Sciences, Univ. of Tsukuba
PPX: latest multi-hetero platform (x6 nodes)
[Node block diagram]
CPU: Xeon Broadwell x 2 (connected by QPI)
GPU (coarse-grain offloading): NVIDIA P100 x 2
FPGA (fine-grain partial offloading + high-speed interconnect): Altera Arria 10 (Bitware A10PL4)
HCA: Mellanox InfiniBand EDR (100G IB/EDR)
Ethernet: 40 GbE x 2, to be upgraded to 100G x 2
Local storage: 1.6 TB NVMe
Center for Computational Sciences, Univ. of Tsukuba
PPX mini-cluster system
[Cluster diagram]
Compute node (x6): CPU BDW x 2, GPU P100 x 2, FPGA Arria 10 x 1
InfiniBand/EDR switch, with InfiniBand/EDR (100 Gbps) links to each node
100G Ethernet switch, with Ethernet 40 Gbps x 2 per node (to be upgraded to 100 Gb x 2 soon)
GbE switch and a login node
Center for Computational Sciences, Univ. of Tsukuba
Summary
Japan's supercomputer development projects are ongoing in two streams:
Tier-1 national projects (NFL: K, post-K)
Tier-2 university supercomputer center projects (UCC, TPF, small NFL) -> Oakforest-PACS is the current leadership machine, exceeding the ex-Tier-1 K Computer
Two streams of systems among the Tier-2 systems:
General-purpose high-performance CPU = many-core now (KNL based)
Accelerated computers = GPU (currently) + FPGA (in the near future at Tsukuba)
Flagship 2020 (post-K)
based on a feasibility study of the hardware and co-design of the software
9 important application fields + 9 basic core applications for co-design
operation starts in 2020-2021
Center for Computational Sciences, Univ. of Tsukuba
(Backup slides)
Center for Computational Sciences, Univ. of Tsukuba
ARTED: Ab-initio Real Time Electron Dynamics simulation
[Performance plots of ARTED on OFP for Si and SiO2; lower execution time is better]
• KNL socket = 3 TFLOPS peak
• Up to 25% of theoretical peak is achieved for the 3D stencil
• Performance per node is more than 3x that of FX100
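For context on the 3D-stencil figure above, the sketch below shows an illustrative 7-point stencil sweep in C with OpenMP. It is not ARTED's actual kernel (ARTED applies a higher-order finite-difference Hamiltonian to complex wavefunctions); it only shows the bandwidth-bound access pattern that such fraction-of-peak figures reflect:

    /* Illustrative 3D 7-point stencil sweep (NOT ARTED's kernel). */
    #define NX 64
    #define NY 64
    #define NZ 64
    static double u[NX][NY][NZ], v[NX][NY][NZ];

    void stencil_sweep(double c0, double c1) {
        #pragma omp parallel for collapse(2)
        for (int i = 1; i < NX - 1; i++)
            for (int j = 1; j < NY - 1; j++)
                for (int k = 1; k < NZ - 1; k++)
                    v[i][j][k] = c0 * u[i][j][k]
                               + c1 * (u[i-1][j][k] + u[i+1][j][k]
                                     + u[i][j-1][k] + u[i][j+1][k]
                                     + u[i][j][k-1] + u[i][j][k+1]);
    }

    int main(void) {
        for (int i = 0; i < NX; i++)            /* initialize the input grid */
            for (int j = 0; j < NY; j++)
                for (int k = 0; k < NZ; k++)
                    u[i][j][k] = 1.0;
        stencil_sweep(0.5, 1.0 / 12.0);         /* illustrative coefficients */
        return 0;
    }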