Center for Computational Sciences, Univ. of Tsukuba
Japan's Supercomputing Systems on Today and Tomorrow
Taisuke Boku
Deputy Director, Center for Computational Sciences
University of Tsukuba
www.hpcs.cs.tsukuba.ac.jp/~taisuke/
Center for Computational Sciences, Univ. of Tsukuba
Where I’m from: CCS, Univ. of Tsukuba
Center for Computational Sciences
Established in 1992
Operated for 12 years as the Center for Computational Physics
Reorganized as the Center for Computational Sciences in 2004
Daily collaborative research between two kinds of researchers (about 30 in total):
Computational scientists, who have NEEDS (applications)
Computer scientists, who have SEEDS (systems & solutions)
Center for Computational Sciences, Univ. of Tsukuba
Oakforest-PACS (OFP) System: Japan's fastest supercomputer
Center for Computational Sciences, Univ. of Tsukuba
Towards Exascale Computing
[Chart: peak performance (1 to 1000 PF) vs. year (2008-2020), showing T2K (U. of Tsukuba / U. of Tokyo / Kyoto U., around 2008) and TSUBAME2.0 (Tokyo Tech., around 2010), then OFP (JCAHPC: U. Tsukuba and U. Tokyo, 2016), the Post-K Computer (RIKEN AICS, around 2020), and future Exascale systems beyond.]
Tier-1 and Tier-2 supercomputers form the HPCI and move forward to Exascale computing like two wheels.
Deployment plan of 9 supercomputing centers (as of Feb. 2017), fiscal years 2014-2025: Hokkaido, Tohoku, Tsukuba, Tokyo, Tokyo Tech., Nagoya, Kyoto, Osaka, and Kyushu.
[Roadmap chart: each center's row shows its systems in operation (e.g. Fujitsu FX10/FX100, NEC SX-ACE, Cray XC30, HA-PACS, COMA, TSUBAME 2.5/3.0, Reedbush, Oakforest-PACS 25 PF, PACS-X) and planned successors in the 10 PF to 200+ PF class, annotated with usage category (FAC/UCC/TPF/CFL) and power budget.]
Power consumption indicates the maximum of the power supply (including the cooling facility).
Center for Computational Sciences, Univ. of Tsukuba
OFP in JCAHPC
Operated jointly by U. Tsukuba and U. Tokyo
Very tight relationship and collaboration between the two universities
For their primary supercomputer resources, a uniform specification for a single shared system
Each university is financially responsible for introducing the machine and for its operation
First attempt in Japan: unified procurement toward a single system of the largest scale in Japan
[Organization chart of JCAHPC]
Advanced HPC Facility (supercomputer) at the Kashiwa Campus, The University of Tokyo
Administrative Council, Chairman: Professor Hiroshi Nakamura
Divisions: Public Relations and Planning Division; Operation Technology Division; Advanced Computational Science Support Technology Division
Budget and operation are provided jointly by the Information Technology Center, The University of Tokyo, and the Center for Computational Sciences, University of Tsukuba
Oakforest-PACS (OFP): Japan’s Fastest Supercomputer
• 25 PFLOPS peak
• 8,208 KNL CPUs
• Full-bisection-bandwidth (FBB) fat-tree by Omni-Path
• HPL 13.55 PFLOPS: #1 in Japan, #6 in the world
• HPCG: #3 in the world
• Green500: #6 in the world
• Full operation started in Dec. 2016
• Official program starts in April 2017
Center for Computational Sciences, Univ. of Tsukuba
Photo of computation node & chassis
Computation node (Fujitsu next-generation PRIMERGY) with a single-chip Intel Xeon Phi (Knights Landing, 3+ TFLOPS) and an Intel Omni-Path Architecture card (100 Gbps)
Chassis with 8 nodes, 2U size
Water cooling fan & pipe
Center for Computational Sciences, Univ. of Tsukuba
Specification of OFP system
Total peak performance: 25 PFLOPS
Linpack performance: 13.55 PFLOPS (with 8,178 nodes, 556,104 cores)
Total number of compute nodes: 8,208
Compute node
  Product: Fujitsu PRIMERGY CX1640 M1
  Processor: Intel Xeon Phi 7250 (code name Knights Landing), 68 cores, 3 TFLOPS
  Memory (high BW): 16 GB, > 400 GB/sec (MCDRAM, effective rate)
  Memory (low BW): 96 GB, 115.2 GB/sec (DDR4-2400 x 6 channels, peak rate)
Interconnect
  Product: Intel Omni-Path Architecture
  Link speed: 100 Gbps
  Topology: fat-tree with (completely) full bisection bandwidth (102.6 TB/s)
Login node
  Product: Fujitsu PRIMERGY RX2530 M2 server
  Number of servers: 20
  Processor: Intel Xeon E5-2690v4 (2.6 GHz, 14 cores x 2 sockets)
  Memory: 256 GB, 153 GB/sec (DDR4-2400 x 4 channels x 2 sockets)
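The headline bandwidth numbers in the table follow from simple arithmetic; the sketch below reproduces the DDR4 peak rate and the quoted fat-tree aggregate, assuming the 102.6 TB/s figure counts the 100 Gbps links of all 8,208 nodes (that assumption is mine, not stated on the slide):

    /* Back-of-envelope check of the OFP bandwidth figures quoted above.
     * Assumptions (not from the slide): DDR4-2400 transfers 8 bytes per channel
     * per transfer, and the fat-tree figure counts every node's 100 Gbps link. */
    #include <stdio.h>

    int main(void) {
        double ddr4 = 2400e6 * 8 * 6;         /* 6 channels of DDR4-2400, bytes/s */
        double opa  = 8208 * (100e9 / 8.0);   /* 8,208 nodes x 100 Gbps, bytes/s  */
        printf("DDR4 peak per node : %.1f GB/s\n", ddr4 / 1e9);  /* 115.2 GB/s */
        printf("OPA aggregate      : %.1f TB/s\n", opa  / 1e12); /* 102.6 TB/s */
        return 0;
    }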
Center for Computational Sciences, Univ. of Tsukuba
Specification of OFP system (I/O)
Parallel file system
  Type: Lustre file system
  Total capacity: 26.2 PB
  Metadata: DataDirect Networks MDS servers + SFA7700X; 4 MDS servers x 3 sets; MDT 7.7 TB (SAS SSD) x 3 sets
  Object storage: DataDirect Networks SFA14KE; 10 OSS (20 nodes); aggregate bandwidth 500 GB/sec
Fast file cache system
  Type: burst buffer, Infinite Memory Engine (by DDN)
  Total capacity: 940 TB (NVMe SSD, including parity data by erasure coding)
  Product: DataDirect Networks IME14K; 25 servers (50 nodes); aggregate bandwidth 1,560 GB/sec
Center for Computational Sciences, Univ. of Tsukuba
Important Role of OFP in Japan
OFP is now the fastest system in Japan: OFP 25 PF vs. the K Computer 11.2 PF.
After the K Computer, OFP is expected to be a bridge system toward the post-K computer.
The post-K computer will be a many-core system, as OFP is.
Toward the post-K computer, OFP plays an important role:
for application development with large-scale problems
for deployment of system software for many-core systems, including McKernel and XcalableMP (see the sketch below for a flavor of XcalableMP)
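For that flavor, here is a minimal global-view XcalableMP sketch in C. It is my own illustration based on my reading of the XcalableMP specification, not code from the OFP or post-K projects; the node count and array are arbitrary:

    /* Sketch of XcalableMP's directive-based global-view style (assumed syntax
     * per the XMP spec; verify against the official specification).
     * Array a is block-distributed over 4 nodes and the loop is partitioned. */
    #include <stdio.h>
    #define N 100

    #pragma xmp nodes p(4)
    #pragma xmp template t(0:N-1)
    #pragma xmp distribute t(block) onto p

    double a[N];
    #pragma xmp align a[i] with t(i)

    int main(void) {
    #pragma xmp loop on t(i)
        for (int i = 0; i < N; i++)    /* each node touches only its local block */
            a[i] = 2.0 * i;
        return 0;
    }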
Machine location: Kashiwa Campus of U. Tokyo
[Map showing U. Tsukuba, the Hongo Campus of U. Tokyo, and the Kashiwa Campus of U. Tokyo]
Center for Computational Sciences, Univ. of Tsukuba
ARTED: Ab-initio Real-Time Electron Dynamics simulation
[Strong-scaling plots of ARTED on OFP for Si and SiO2; lower execution time is better]
• Strong scaling up to 128 nodes for Si
• >12x faster than the K Computer (with 128 nodes) for Si
• Not enough parallelism for SiO2, and MPI communication becomes a bottleneck
Center for Computational Sciences, Univ. of Tsukuba
OFP resource sharing program (nation-wide)
JCAHPC (20%)
HPCI – HPC Infrastructure program in Japan to share all supercomputers (free!)
Big challenge special use (full system size)
U. Tsukuba (23.5%)
Interdisciplinary Academic Program (free!)
Large scale general use
U. Tokyo (56.5%)
General use
Industrial trial use
Educational use
Young & Female special use
Center for Computational Sciences, Univ. of Tsukuba
AI-related projects in the near future in Japan
Tokyo Institute of Technology
TSUBAME 3.0 (in operation from August 2017)
Intel Omni-Path optical fiber network, full-bisection-bandwidth fat-tree
432 Terabits/sec bi-directional, about twice the average traffic of the whole Internet (see the check below)
540 compute nodes (SGI ICE XA): 2x Intel Xeon CPU + 4x NVIDIA Pascal GPU (SXM2) per node
256 GB memory, 2 TB NVMe SSD, quad-rail Omni-Path HFI per node
47.2 AI-Petaflops, 12.1 Petaflops (DP)
DDN storage system: Lustre FS 15.9 PB, 150 GB/s; home 45 TB
UNIVA Grid Engine with Docker support for virtualization
Warm water cooling (estimated PUE = 1.033)
(slides provided by Tokyo Tech.)
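The quoted 432 Tbps can be reproduced from the node count and link configuration above; a minimal check, assuming the figure counts 540 nodes with quad-rail 100 Gbps links in both directions (that interpretation is mine, not stated on the slide):

    /* Back-of-envelope check of the TSUBAME 3.0 network figure quoted above. */
    #include <stdio.h>

    int main(void) {
        /* nodes x rails x link speed x directions, reported in Tbps */
        double tbps = 540 * 4 * 100e9 * 2 / 1e12;
        printf("aggregate bandwidth: %.0f Tbps\n", tbps);   /* 432 Tbps */
        return 0;
    }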
TSUBAME 3.0 features for HPC, Big Data, and DL/AI
High FP16 performance
High memory bandwidth
Full-bisection network, quad-rail OPA HFI
Local NVMe SSD, large-capacity Lustre FS
A supercomputer not only for HPC but also for Big Data and DL/AI; these are TSUBAME 3.0's target applications.
(slides provided by Tokyo Tech.)
METI AIST-AIRC ABCI as the world's first large-scale OPEN AI infrastructure
ABCI: AI Bridging Cloud Infrastructure
Top-level SC compute & data capability (130~200 AI-Petaflops)
Open, public & dedicated infrastructure for AI & Big Data algorithms, software, and applications
Platform to accelerate joint academic-industry R&D for AI in Japan
Univ. Tokyo Kashiwa Campus
• 130~200 AI-Petaflops
• < 3 MW power
• < 1.1 avg. PUE
• Operational 2017 Q3~Q4
(slides provided by AIST AIRC)
ABCI Prototype: AIST AI Cloud (AAIC), March 2017
400 NVIDIA Tesla P100s and InfiniBand EDR accelerate various AI workloads including ML (Machine Learning) and DL (Deep Learning).
Advanced data analytics leveraged by 4 PiB of shared Big Data storage and Apache Spark with its ecosystem.
AI computation system
  Computation nodes (with GPU) x50: Intel Xeon E5 v4 x2, NVIDIA Tesla P100 (NVLink) x8, 256 GiB memory, 480 GB SSD
  Computation nodes (without GPU) x68: Intel Xeon E5 v4 x2, 256 GiB memory, 480 GB SSD
  Management & service nodes x16, interactive nodes x2
  In total: 400 Pascal GPUs, 30 TB memory, 56 TB SSD
Large-capacity storage system
  DDN SFA14K: file servers (10GbE x2, IB EDR x4) x4, 8 TB 7.2 krpm NL-SAS HDD x730, GRIDScaler (GPFS)
  > 4 PiB effective capacity, R/W 100 GB/s
Computation network
  Mellanox CS7520 director switch, EDR (100 Gbps) x216 ports
  200 Gbps bi-directional, full bisection bandwidth
Service and management network
  IB EDR (100 Gbps); GbE or 10GbE
Firewall
  FortiGate 3815D x2, FortiAnalyzer 1000E x2; UTM firewall, 40-100 Gbps class, 10GbE
SINET5 Internet connection, 10-100 GbE
(slides provided by AIST AIRC)
Center for Computational Sciences, Univ. of Tsukuba
Flagship 2020 Project (Post-K)
Flagship 2020 project
Developing the next Japanese flagship computer, the so-called "post-K"
Developing a wide range of application codes, to run on the post-K, to solve major social and scientific issues: society with health and longevity, disaster prevention and global climate, energy issues, industrial competitiveness, and basic science
Development is carried out together with a vendor partner
The Japanese government selected 9 social & scientific priority issues and their R&D organizations.
(slides provided by RIKEN AICS)
Overview of the post-K Computer
Hardware
  Manycore architecture
  6D mesh/torus interconnect
  3-level hierarchical storage system: silicon disk, magnetic disk, and storage for archive
Target performance
  100 times (maximum) that of K for capacity computing
  50 times (maximum) that of K for capability computing
  Power consumption of 30-40 MW (cf. K computer: 12.7 MW)
[System diagram: compute nodes connected through an I/O network to login servers, maintenance servers, portal servers, and the hierarchical storage system]
System software
  Multi-kernel: Linux with a light-weight kernel
  File I/O middleware for the 3-level hierarchical storage system and applications
  Application-oriented file I/O middleware
  MPI+OpenMP programming environment (a minimal hybrid sketch follows below)
  Highly productive programming languages and libraries
(slides provided by RIKEN AICS)
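As a reminder of what the MPI+OpenMP hybrid model listed above looks like in practice, here is a minimal, generic sketch (my own illustration, not post-K system software):

    /* Hybrid MPI+OpenMP sketch: each MPI rank computes a partial sum with an
     * OpenMP-parallel loop, then the ranks combine results with MPI_Reduce. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int provided, rank;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double local = 0.0;
        #pragma omp parallel for reduction(+:local)
        for (int i = 0; i < 1000000; i++)
            local += 1.0 / (i + 1.0);          /* any per-rank work */

        double global;
        MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0) printf("sum = %f\n", global);
        MPI_Finalize();
        return 0;
    }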
Nine Priority Application Areas
① Innovative drug discovery infrastructure through functional control of biomolecular systems (RIKEN Quantitative Biology Center)
② Integrated computational life science to support personalized and preventive medicine (Inst. of Medical Science, U. Tokyo)
③ Integrated prediction systems for hazards and disasters induced by earthquakes and tsunamis (Earthquake Research Inst., U. Tokyo)
④ Advancement of meteorological and global environmental predictions utilizing observational big data (Center for Earth Information, JAMSTEC)
⑤ New fundamental technologies for high-efficiency energy creation, conversion/storage, and use (Inst. for Molecular Science, NINS)
⑥ Practical application of innovative clean energy systems (Grad. School of Engineering, U. Tokyo)
⑦ Creation of new functional devices and high-performance materials to support next-generation industries (Inst. for Solid State Physics, U. Tokyo)
⑧ Innovative design and production processes leading the manufacturing industry in the near future (Inst. of Industrial Science, U. Tokyo)
⑨ Elucidation of the fundamental laws and evolution of the universe (CCS, U. Tsukuba)
(slides provided by RIKEN AICS)
Co-design in the Post K development
Target applications:
① GENESIS: MD for proteins
② Genomon: genome processing (genome alignment)
③ GAMERA: earthquake simulator (FEM on unstructured & structured grids)
④ NICAM+LETKF: weather prediction system using big data (structured-grid stencil & ensemble Kalman filter)
⑤ NTChem: molecular electronic structure calculation
⑥ FFB: large eddy simulation (unstructured grid)
⑦ RSDFT: ab-initio program (density functional theory)
⑧ Adventure: computational mechanics system for large-scale analysis and design (unstructured grid)
⑨ CCS-QCD: lattice QCD simulation (structured-grid Monte Carlo)
The nine social & scientific priority issues and their R&D organizations were selected from the following points of view:
• high priority from a social and national viewpoint
• promise of creating world-leading achievements
• promise of strategic use of the post-K computer
Goal: every application achieves 100x the performance of the K Computer, i.e., performance worthy of the exascale.
Center for Computational Sciences, Univ. of Tsukuba
PACS-X Project in CCS, U. Tsukuba
Center for Computational Sciences, Univ. of Tsukuba
Schematic of Accelerator in Switch
[Schematic: CPU, GPU, and FPGA connected through a PCIe switch. The GPU handles coarse-grain offloading; the FPGA handles fine-grain partial offloading and high-speed communication (post-PEACH) to the interconnect; the CPU handles miscellaneous work and computation. Multi-hetero: each device covers the parts it is suited to, without invading the others.]
LET (Locally Essential Tree) construction
A distance judgment is made between the partial regional data on the receiver side and each cell on the sender side.
The result is stored in a mask[id] array: mask[id] == 0 means skip the cell; mask[id] == 1 means add it to the LET.
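A rough illustration of this mask-based LET construction, written in plain C. This is my own sketch, not the actual HA-PACS/PACS-X code; the types, names, and opening criterion are assumptions for illustration:

    /* Illustrative mask-based LET build: for every sender-side tree cell, a
     * distance judgment against the receiver's local domain sets mask[id];
     * cells with mask[id] == 1 are then packed into the LET buffer. */
    typedef struct { double cx, cy, cz, size; } cell_t;   /* cell center and size */
    typedef struct { double lo[3], hi[3]; } domain_t;     /* receiver-side region */

    static int needed_by_receiver(const cell_t *c, const domain_t *d) {
        /* distance from the cell center to the nearest point of the domain */
        double dx = 0.0, dy = 0.0, dz = 0.0;
        if (c->cx < d->lo[0]) dx = d->lo[0] - c->cx; else if (c->cx > d->hi[0]) dx = c->cx - d->hi[0];
        if (c->cy < d->lo[1]) dy = d->lo[1] - c->cy; else if (c->cy > d->hi[1]) dy = c->cy - d->hi[1];
        if (c->cz < d->lo[2]) dz = d->lo[2] - c->cz; else if (c->cz > d->hi[2]) dz = c->cz - d->hi[2];
        double r2 = dx * dx + dy * dy + dz * dz;
        /* keep cells close/large enough that the receiver must open them
         * (illustrative opening criterion, not the production one) */
        return (c->size * c->size) > 0.25 * r2;
    }

    void build_let(const cell_t *cells, int ncell, const domain_t *recv,
                   int *mask, cell_t *let, int *nlet) {
        for (int id = 0; id < ncell; id++)            /* distance judgment per cell */
            mask[id] = needed_by_receiver(&cells[id], recv) ? 1 : 0;
        int n = 0;
        for (int id = 0; id < ncell; id++)            /* mask == 0: skip, == 1: add */
            if (mask[id]) let[n++] = cells[id];
        *nlet = n;
    }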
Center for Computational Sciences, Univ. of Tsukuba
Preliminary Evaluation
[Two bar charts of execution time (μs), comparing CPU execution vs. the FPGA offload mechanism:
Execution time for making the LET: the module itself speeds up by 2.2x.
From LET making through GPU data transfer: 7.2x speedup.]
Center for Computational Sciences, Univ. of Tsukuba
Challenge on external communication
PEACH2/PEACH3 I/O bottleneck: they depend on PCIe everywhere, both to connect CPU and GPU and for the external link
PCIe is a bottleneck for today's advanced interconnects
High-performance interconnection between FPGAs: an optical interconnect interface is ready
up to 100 Gbps speed
provided as IP for users
FPGA-to-FPGA communication without the intra-node communication bottleneck enables on-the-fly computation & communication
Center for Computational Sciences, Univ. of Tsukuba
Challenge on Programming
OpenCL is not perfect, but it is the best option today
Much smaller number of lines than Verilog (see the sketch after this list)
Easy to understand even for application users
Very long compilation time causes a serious turnaround time (TAT) for development
Not sufficient to use all functions of the FPGA
We need "glue" to support end-user programming
The relation of OpenCL to Verilog is similar to that of C to assembler
Make Verilog-written low-level code available as a "library" for OpenCL
Potentially possible (according to Altera documentation), but hard
Challenge: partial reconfiguration
Open source for applications & examples
Combination of OpenCL applications + Verilog modules
On a commodity platform (unlike PEACH2, which required special hardware)
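To illustrate the abstraction gap noted above, here is a trivial streaming kernel in OpenCL C. It is a generic example, not OFP or PACS-X code; the point is only that the C-level description stays a few lines, whereas hand-written Verilog would need explicit state machines, FIFOs, and handshaking:

    /* Minimal OpenCL C kernel (illustrative only): element-wise AXPY.
     * On an FPGA OpenCL toolchain this C-level kernel is compiled into a
     * hardware pipeline by the offline compiler. */
    __kernel void axpy(const float alpha,
                       __global const float *x,
                       __global float *y)
    {
        size_t i = get_global_id(0);   /* one work-item per element */
        y[i] = alpha * x[i] + y[i];
    }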
Center for Computational Sciences, Univ. of Tsukuba
PACS-X (ten) Project
PACS (Parallel Advanced system for Computational Sciences): a series of co-design-based parallel system developments at U. Tsukuba (since 1978), covering both systems and applications
Recent systems focus on accelerators:
PACS-VIII: HA-PACS (GPU cluster, Fermi + Kepler, PEACH2, 1.1 PFLOPS)
PACS-IX: COMA (MIC cluster, KNC, 1 PFLOPS)
Next generation of the TCA implementation: PEACH2 with PCIe is old and has several limitations
New generation of GPUs and FPGAs with high-speed interconnection
More tightly co-designed with applications
System deployment starts in 2018-2019
PPX: Pre-PACS-X (also used for CREST)
Center for Computational Sciences, Univ. of Tsukuba
PPX: latest multi-hetero platform (x6 nodes)
[Node block diagram]
CPU: Xeon Broadwell x 2 (connected by QPI)
GPU (coarse-grain offloading): NVIDIA P100 x 2
FPGA (fine-grain partial offloading + high-speed interconnect): Altera Arria 10 (Bitware A10PL4)
HCA: Mellanox InfiniBand EDR (100G IB/EDR)
Ethernet: 40 GbE x 2, to be upgraded to 100G x 2
Local storage: 1.6 TB NVMe
Center for Computational Sciences, Univ. of Tsukuba
PPX mini-cluster system
[Cluster diagram]
Compute node (x6): CPU BDW x 2, GPU P100 x 2, FPGA Arria 10 x 1
InfiniBand/EDR switch, with InfiniBand/EDR (100 Gbps) links to each node
100G Ethernet switch, with Ethernet 40 Gbps x 2 per node (to be upgraded to 100 Gb x 2 soon)
GbE switch and a login node
Center for Computational Sciences, Univ. of Tsukuba
Summary
Japan's supercomputer development projects are ongoing in two streams:
Tier-1 national projects (NFL: K, post-K)
Tier-2 university supercomputer center projects (UCC, TPF, small NFL) -> Oakforest-PACS is the current leadership machine, exceeding the ex-Tier-1 K Computer
Two streams of systems among the Tier-2 systems:
General-purpose high-performance CPU = many-core now (KNL based)
Accelerated computers = GPU (currently) + FPGA (in the near future at Tsukuba)
Flagship 2020 (post-K)
based on a feasibility study of the hardware and co-design of the software
9 important application fields + 9 basic core applications for co-design
operation starts in 2020-2021
Center for Computational Sciences, Univ. of Tsukuba
(Backup slides)
Center for Computational Sciences, Univ. of Tsukuba
ARTED: Ab-initio Real Time Electron Dynamics simulation
[Performance plots of ARTED on OFP for Si and SiO2; lower execution time is better]
• KNL socket = 3 TFLOPS peak
• Up to 25% of theoretical peak is achieved for the 3D stencil
• Performance per node is more than 3x that of FX100
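For context on the 3D-stencil figure above, the sketch below shows an illustrative 7-point stencil sweep in C with OpenMP. It is not ARTED's actual kernel (ARTED applies a higher-order finite-difference Hamiltonian to complex wavefunctions); it only shows the bandwidth-bound access pattern that such fraction-of-peak figures reflect:

    /* Illustrative 3D 7-point stencil sweep (NOT ARTED's kernel). */
    #define NX 64
    #define NY 64
    #define NZ 64
    static double u[NX][NY][NZ], v[NX][NY][NZ];

    void stencil_sweep(double c0, double c1) {
        #pragma omp parallel for collapse(2)
        for (int i = 1; i < NX - 1; i++)
            for (int j = 1; j < NY - 1; j++)
                for (int k = 1; k < NZ - 1; k++)
                    v[i][j][k] = c0 * u[i][j][k]
                               + c1 * (u[i-1][j][k] + u[i+1][j][k]
                                     + u[i][j-1][k] + u[i][j+1][k]
                                     + u[i][j][k-1] + u[i][j][k+1]);
    }

    int main(void) {
        for (int i = 0; i < NX; i++)            /* initialize the input grid */
            for (int j = 0; j < NY; j++)
                for (int k = 0; k < NZ; k++)
                    u[i][j][k] = 1.0;
        stencil_sweep(0.5, 1.0 / 12.0);         /* illustrative coefficients */
        return 0;
    }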