
Center for Computational Sciences, Univ. of Tsukuba

Towards Exascale Systems: Activity in Japan

Taisuke Boku Deputy Director

Center for Computational Sciences University of Tsukuba

2012/10/04 ORAP Forum 2012, Paris

1

Center for Computational Sciences, Univ. of Tsukuba

Outline

- Status of K computer
- HPCI (High Performance Computing Infrastructure)
- SDHPC (Strategic Direction/Development for HPC)
- FS (Feasibility Study) & CREST
- Activity in U. Tsukuba
- Towards Exascale

2012/10/04 ORAP Forum 2012, Paris

2

Center for Computational Sciences, Univ. of Tsukuba

K computer status

We are #2 in the world now :-)

2012/10/04 ORAP Forum 2012, Paris

3

Full operation: 2012/06
Public utilization: 2012/09

K computer (#1 at TOP500 on 2011/06, 2011/11)

" Final spec. of K computer (as on Nov. 2011) " Computation nodes (CPUs): 88,128 " Total core#: 705,024 " Peak perforance: 11.28 PFLOPS Linpack: 10.51 PFLOPS " Memory capacity: 1.41 PB (16GB/node)

2012/10/04 ORAP Forum 2012, Paris

4

Kei (京) represents the numerical unit of 10 Peta (10^16) in the Japanese language, representing the system's performance goal of 10 Petaflops. The Chinese character 京 can also be used to mean "a large gateway", so it can also be associated with the concept of a new gateway to computational science.

K computer’s official name

一 = 10^0, 十 = 10^1, 百 = 10^2, 千 = 10^3, 万 = 10^4, 億 = 10^8, 兆 = 10^12, 京 = 10^16,
垓 = 10^20, 杼 = 10^24, 穰 = 10^28, 溝 = 10^32, 澗 = 10^36, 正 = 10^40, 載 = 10^44, 極 = 10^48,
恒河沙 = 10^52, 阿僧祗 = 10^56, 那由他 = 10^60, 不可思議 = 10^64, 無量大数 = 10^68

2012/10/04 ORAP Forum 2012, Paris

5

K computer: compute nodes and network
- Logical 3-dimensional torus network
- Peak bandwidth: 5 GB/s x 2 for each direction of the logical 3-dimensional torus network
- Bisection bandwidth: > 30 TB/s

[Figure: K computer compute node (courtesy of RIKEN/FUJITSU Ltd.). Each node (ノード) holds one SPARC64 VIIIfx CPU of 128 GFLOPS (8 cores, each a SIMD core with 4 FMA units delivering 16 GFLOPS), a shared 5 MB L2 cache, 16 GB of memory at 64 GB/s, and an ICC interconnect controller providing 5 GB/s (peak) x 2 links in each direction (x, y, z) of the logical 3-dimensional torus network used for programming. System totals as on the previous slide: 88,128 nodes, 705,024 cores, 11.28 PFLOPS peak (10.51 PFLOPS Linpack), 1.41 PB memory.]

6

7

CPU Features (Fujitsu SPARC64 VIIIfx)
- 8 cores
- 2 SIMD operation circuits per core
  - 2 multiply-and-add floating-point operations (SP or DP) are executed in one SIMD instruction
- 256 FP registers (double precision)
- Shared 5 MB L2 cache (10-way)
- Hardware barrier
- Prefetch instruction
- Software-controllable cache (sectored cache)
- Performance: 16 GFLOPS/core, 128 GFLOPS/CPU
- Commercial version (PRIMEHPC FX10): 16 cores/CPU (with reduced frequency)

Reference: SPARC64 VIIIfx Extensions http://img.jp.fujitsu.com/downloads/jp/jhpc/sparc64viiifx-extensions.pdf

45 nm CMOS process, 2 GHz, 22.7 mm x 22.6 mm, 760 M transistors, 58 W (at 30°C, water cooled)
16 GFLOPS/core (2 x 4 x 2 GHz)
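The "2 x 4 x 2 GHz" shorthand above unpacks as 2 flops per FMA, 4 FMA units per core, and the 2 GHz clock; a quick check, with the watts-per-GFLOPS figure derived from the slide's own numbers:

```python
# SPARC64 VIIIfx per-core and per-chip peak, from the figures on this slide.
flops_per_fma = 2           # multiply + add
fma_units_per_core = 4      # 2 SIMD pipelines x 2 FMA units each
clock_ghz = 2.0
cores = 8
chip_watts = 58             # at 30 deg C, water cooled

gflops_per_core = flops_per_fma * fma_units_per_core * clock_ghz   # 16
gflops_per_chip = gflops_per_core * cores                          # 128
print(gflops_per_core, gflops_per_chip)
print(round(chip_watts / gflops_per_chip, 2), "W per GFLOPS")      # ~0.45
```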

2012/10/04 ORAP Forum 2012, Paris

Center for Computational Sciences, Univ. of Tsukuba

Operation of K computer
- Operation site: AICS (Advanced Institute for Computational Science), RIKEN (@ Kobe)
- Job scheduling
  - Since the K computer's interconnection network is a 3-D torus, users can select either (1) rectangular-shape-aware node allocation or (2) allocation of the required number of nodes without regard to shape (see the sketch after this list)
  - Currently there is no priority control based on node size, but this could be modified based on user requests
- Program categories
  - 70%: SPIRE (Strategic, 5 special projects)
  - 25%: public use under HPCI (15% general, 5% young researchers, 5% industry)
  - 5%: AICS research on computer science
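Purely as an illustration of the two allocation policies above (this is not the actual K computer scheduler; the function names, the free-node map and the first-fit search are all hypothetical), a minimal sketch:

```python
# Illustrative sketch only (not the actual K computer scheduler): the difference
# between (1) rectangular-shape-aware allocation and (2) count-only allocation on
# a 3-D torus whose free nodes are tracked in a boolean map.
import itertools

def fits_rectangle(free, shape, origin, dims):
    """True if a shape = (sx, sy, sz) block of free nodes starts at `origin`
    (with wrap-around, since the network is a torus of size `dims`)."""
    sx, sy, sz = shape
    ox, oy, oz = origin
    X, Y, Z = dims
    return all(free[(ox + i) % X][(oy + j) % Y][(oz + k) % Z]
               for i in range(sx) for j in range(sy) for k in range(sz))

def allocate(free, dims, shape=None, count=None):
    """Policy (1): a shape is given, find an origin where the rectangle fits.
       Policy (2): only a node count is given, grab any free nodes, ignoring shape."""
    X, Y, Z = dims
    if shape is not None:
        for origin in itertools.product(range(X), range(Y), range(Z)):
            if fits_rectangle(free, shape, origin, dims):
                return ("rectangle", origin)
        return None
    free_nodes = [(x, y, z) for x in range(X) for y in range(Y) for z in range(Z)
                  if free[x][y][z]]
    return ("any", free_nodes[:count]) if len(free_nodes) >= count else None

# Example on a 4x4x4 torus with every node free:
dims = (4, 4, 4)
free = [[[True] * 4 for _ in range(4)] for _ in range(4)]
print(allocate(free, dims, shape=(2, 2, 2)))   # ('rectangle', (0, 0, 0))
print(len(allocate(free, dims, count=8)[1]))   # 8 nodes, arbitrary shape
```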

2012/10/04 ORAP Forum 2012, Paris

8

Center for Computational Sciences, Univ. of Tsukuba 9

SPIRE (Strategic Programs for Innovative Research)
- MEXT has identified five application areas that are expected to create breakthroughs using the K computer
  - Field 1: Life science / drug manufacture (PI inst. RIKEN)
  - Field 2: New material / energy creation (PI inst. U. Tokyo)
  - Field 3: Global change prediction for disaster prevention/mitigation (PI inst. JAMSTEC)
  - Field 4: Mono-zukuri (manufacturing technology) (PI inst. U. Tokyo)
  - Field 5: The origin of matter and the universe (PI inst. U. Tsukuba)
- MEXT funds 5 core organizations that lead the research activities in these 5 strategic fields (US$ 25M each, for 5 years)
- Public users -> HPCI

Center for Computational Sciences, Univ. of Tsukuba

Prizes on K computer
- #1 in TOP500 on 2011/06 and 2011/11
- HPCC (HPC Challenge) at SC11: all 4 categories won by the K computer
  - HPL
  - Bandwidth
  - Random Access
  - FFT
- ACM Gordon Bell Prize, Peak Performance Award
  - Joint team of AICS RIKEN, U. Tokyo, U. Tsukuba and Fujitsu
  - Achieved 3.4 PFLOPS sustained performance on a first-principles materials simulation of a semiconductor with 100,000 atoms
  - Silicon nanowire simulation based on Real-Space Density Functional Theory

10

Center for Computational Sciences, Univ. of Tsukuba

Large-scale first-principles calculations in nano science

11

[Figure: length scales in nano science, from 0.1 nm to sub-µm; examples include a CMOS gate (gate length = 5 nm), an FET with a carbon nanotube, and Cytochrome c Oxidase; system sizes range from 100 atoms (~1 nm) through 10,000 atoms to 100,000 atoms (10 nm - sub-µm), where theory and experiment meet.]

- Materials = nuclei + electrons (quantum objects)
- Density Functional Theory (DFT) = an efficient framework based on the first principles of quantum theory, but previously limited to systems of N = 100-1000 atoms

Large-scale DFT calculations and experiments meet in the nano world!

Challenge: 10,000 - 100,000-atom calculations, overcoming the N^3 scaling, to reveal the nano-scale world!
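The cubic scaling is the crux: moving from the former ~1,000-atom limit to the 100,000-atom target multiplies the conventional-DFT work by a factor of a million, as the trivial check below shows.

```python
# Why overcoming the N^3 scaling matters: work ratio between the former
# ~1,000-atom limit of conventional DFT and the 100,000-atom target.
n_before = 1_000
n_target = 100_000
print((n_target / n_before) ** 3)   # 1,000,000x more work under cubic scaling
```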

Center for Computational Sciences, Univ. of Tsukuba

2011 ACM Gordon Bell Prize for Peak Performance Award “First-principle calculation of electronic states of a silicon nanowire with 100,000 atoms on the K computer”

Center for Computational Sciences, Univ. of Tsukuba

HPCI (High Performance Computing Infrastructure)

2012/10/04 ORAP Forum 2012, Paris

13

Center for Computational Sciences, Univ. of Tsukuba

What is HPCI?
- Nation-wide High Performance Computing Infrastructure in Japan
- Gathers the main supercomputer resources in Japan, including the K computer and all university supercomputer centers
- Grid technology to enable "SSO (single sign-on)" use of any supercomputer and file system nationwide
- Large-scale distributed file system at two sites (West and East) as a data pool accessible from any supercomputer resource under HPCI
- Science-driven, project-based proposals to use HPCI resources are submitted, then a free CPU-time budget is awarded
- Operation just started on 2012/09/28 (just last week)
  - Project proposals: 2012/05
  - Project selection: 2012/06-08

2012/10/04 ORAP Forum 2012, Paris

14

Formation of HPCI
- Background
  - After re-evaluation of the project at the "government party change" in 2011, the NGS project was restarted as "Creation of the Innovative High Performance Computing Infrastructure (HPCI)"
- Building HPCI: High-Performance Computing Infrastructure
  - To establish a hierarchical organization of supercomputers linking the K computer with other supercomputers at universities and institutes
  - To set up a large-scale storage system for the K computer and other supercomputers
- Organizing the HPCI Consortium
  - To act as the main body that designs and operates HPCI
  - To organize computational science communities from several application fields and institutional/university supercomputer centers
  - Including the Kobe Center

[Diagram: the HPCI consortium links the K computer at the Advanced Institute for Computational Science (AICS), RIKEN, the institutional/university computer centers (supercomputers), and the computational science communities.]

(slide courtesy of M. Sato, U. Tsukuba) 2012/10/04 ORAP Forum 2012, Paris

15

HPCI Consortium Members

User Communities (13)
- RIKEN
- Computational Materials Science Initiative
- Japan Agency for Marine-Earth Science and Technology
- Institute of Industrial Science at University of Tokyo
- Joint Institute for Computational Fundamental Science
- Industrial Committee for Super Computing Promotion
- Foundation for Computational Science
- BioGrid Center Kansai
- Japan Aerospace Exploration Agency
- Center for Computational Science & e-Systems, Japan Atomic Energy Agency
- National Institute for Fusion Science
- Solar-Terrestrial Environment Laboratory, Nagoya University
- Kobe University

Resource Providers (25)
- Information Initiative Center, Hokkaido University
- Cyberscience Center, Tohoku University
- Information Technology Center, University of Tokyo
- Information Technology Center, Nagoya University
- Academic Center for Computing and Media Studies, Kyoto University
- Cybermedia Center, Osaka University
- Research Institute for Information Technology, Kyushu University
- Center for Computational Sciences, University of Tsukuba
- Global Scientific Information and Computing Center, Tokyo Institute of Technology
- Institute for Materials Research, Tohoku University
- Institute for Solid State Physics, University of Tokyo
- Yukawa Institute for Theoretical Physics, Kyoto University
- Research Center for Nuclear Physics, Osaka University
- Computing Research Center, KEK (High Energy Accelerator Research Organization)
- National Astronomical Observatory of Japan
- Research Center for Computational Science, Institute for Molecular Science
- The Institute of Statistical Mathematics
- JAXA's Engineering Digital Innovation Center
- The Earth Simulator Center
- Information Technology Research Institute, AIST
- Center for Computational Science & e-Systems, Japan Atomic Energy Agency
- Advanced Center for Computing and Communication, RIKEN
- Advanced Institute for Computational Science
- National Institute of Informatics
- Research Organization for Information Science & Technology

(slide courtesy of M. Sato, U. Tsukuba) 2012/10/04 ORAP Forum 2012, Paris

16

Center for Computational Sciences, Univ. of Tsukuba

HPCI: AICS and Supercomputer Centers in Japanese Universities

- Hokkaido Univ.: SR11000/K1 (5.4 Tflops, 5 TB), PC Cluster (0.5 Tflops, 0.64 TB)
- Tohoku Univ.: NEC SX-9 (29.4 Tflops, 18 TB), NEC Express5800 (1.74 Tflops, 3 TB)
- Univ. of Tsukuba: T2K Open Supercomputer (95.4 Tflops, 20 TB)
- Univ. of Tokyo: T2K Open Supercomputer (140 Tflops, 31.25 TB)
- Tokyo Institute of Technology: Tsubame 2 (2.4 Pflops, 100 TB)
- Nagoya Univ.: FX1 (30.72 Tflops, 24 TB), HX600 (25.6 Tflops, 10 TB), M9000 (3.84 Tflops, 3 TB)
- Kyoto Univ.: T2K Open Supercomputer (61.2 Tflops, 13 TB)
- Osaka Univ.: SX-9 (16 Tflops, 10 TB), SX-8R (5.3 Tflops, 3.3 TB), PC Cluster (23.3 Tflops, 2.9 TB)
- Kyushu Univ.: PC Cluster (55 Tflops, 18.8 TB), SR16000 L2 (25.3 Tflops, 5.5 TB), PC Cluster (18.4 Tflops, 3 TB)
- AICS, RIKEN: K computer (10 Pflops, 4 PB), available in 2012
- A 1 Pflops machine without accelerator will be installed by the end of 2011

17

Center for Computational Sciences, Univ. of Tsukuba

Storage System in first phase for HPCI

Connected sites: Hokkaido University, Tohoku University, University of Tokyo, University of Tsukuba, Tokyo Institute of Technology, Nagoya University, Kyushu University, Osaka University, Kyoto University, and AICS, RIKEN

HPCI WEST HUB / HPCI EAST HUB:
- 12 PB storage (52 OSS) and a cluster for data analysis (87 nodes)
- 10 PB storage (30 OSS) and a cluster for data analysis (88 nodes)

Gfarm2 is used as the global shared file system

18 2012/10/04 ORAP Forum 2012, Paris

Center for Computational Sciences, Univ. of Tsukuba

SDHPC (Strategic Direction/Development of High Performance Computing Systems)

2012/10/04 ORAP Forum 2012, Paris

19

Center for Computational Sciences, Univ. of Tsukuba

Post-peta~Exa scale computing formation in Japan

20

[Diagram: organization of post-peta to exascale computing in Japan]
- MEXT; Council for Science and Technology Policy (CSTP); Council on HPCI Plan and Promotion
- HPCI Consortium, established 2012/04: users and supercomputer centers (resource providers); the consortium will play an important role in future HPC R&D
- Workshop of SDHPC (Strategic Development of HPC), organized by Univ. of Tsukuba, Univ. of Tokyo, Tokyo Institute of Tech., Kyoto Univ., RIKEN supercomputer center and AICS, AIST, JST, and Tohoku Univ., with a WG for applications and a WG for systems -> White paper for strategic direction/development of HPC in Japan
- JST (Japan Science and Technology Agency) Basic Research Programs CREST: Development of System Software Technologies for post-Peta Scale High Performance Computing (2010 - 2018)
- Feasibility Study of Advanced High Performance Computing (2012 - 2013)

2012/10/04 ORAP Forum 2012, Paris

Center for Computational Sciences, Univ. of Tsukuba

Plans and funding scheme

21

[Timeline diagram, 2011 - 2020 (slide courtesy of Y. Ishikawa, U. Tokyo)]
- Basic Research Programs CREST: Development of System Software Technologies for post-Peta Scale High Performance Computing
- Feasibility Study of Advanced High Performance Computing (three teams will run)
- Funding figures shown on the chart: US$ 23M, US$ 17M, US$ 15M, US$ 10M
- White paper for strategic direction/development of HPC in Japan
- Deployments:
  - 2011Q4: HA-PACS, 0.8 PF, Univ. Tsukuba -> to be upgraded to 1.3 PF (2013Q3)
  - 2012Q1: <1 PF, Univ. Tokyo
  - 2012Q2: 0.7 PF, Kyoto U.
  - 2012: BG/Q, 1.2 PF, KEK
  - 2015: 100 PF?, ??
- R&D of Advanced HPC: whether or not the national R&D project starts depends on the results of the feasibility studies

2012/10/04 ORAP Forum 2012, Paris

Center for Computational Sciences, Univ. of Tsukuba

What is SDHPC ?

- The White Paper for Strategic Direction/Development of HPC in Japan was written by young Japanese researchers with (senior) advisers
- The white paper was provided to the Council for HPCI Plan and Promotion in 2012/03
- Contents
  - Expected scientific and technological value of exa-scale computing
  - Challenges towards exa-scale computing
  - Current research efforts in Japan and other countries
  - Approaches: competition vs. collaboration
  - Plan of R&D and organization
- The white paper is referenced in the call for proposals of the "Feasibility Study of Advanced High Performance Computing"

22 2012/10/04 ORAP Forum 2012, Paris

Strategic Development of Exascale Systems
- Exascale systems
  - Cannot be built upon traditional technological advances
  - Need special efforts in architecture / system software to develop effective (useful) exascale systems
- Strategy
  - HW/SW/application co-design
  - Close cooperation with the application WG
  - Architecture design suited to target application requirements
  - Exploring the best match between available technologies and application requirements

23 2012/10/04 ORAP Forum 2012, Paris

System Requirement for Target Sciences
- System performance
  - FLOPS: 800 GFLOPS - 2500 PFLOPS
  - Memory capacity: 10 TB - 500 PB
  - Memory bandwidth: 0.001 - 1.0 B/F (see the sketch below)
- Example applications
  - Small capacity requirement: MD, climate, space physics, ...
  - Small BW requirement: quantum chemistry, ...
  - High capacity/BW requirement: incompressible fluid dynamics, ...
- Interconnection network
  - Not enough analysis has been carried out
  - Some applications need <1 us latency and large bisection BW
- Storage
  - Demand is not particularly large

24
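To make the B/F (bytes per flop) axis concrete, here is a minimal sketch under stated assumptions; the 7-point-stencil byte and flop counts below are illustrative choices, not figures from the white paper:

```python
# Illustrative sketch (assumed kernel numbers) of how an application's B/F
# requirement is estimated, using a 7-point stencil in double precision.
bytes_per_point = 8 * 2       # assume one 8-byte load + one 8-byte store per grid point
flops_per_point = 13          # assume 7 multiplies + 6 adds per grid point
bf_requirement = bytes_per_point / flops_per_point
print(round(bf_requirement, 2))           # ~1.23 B/F -> the "high BW" region of the chart

# A machine delivering 0.1 B/F (the general-purpose projection later in this talk)
# would run such a memory-bound kernel at roughly 8% of its peak FLOPS.
machine_bf = 0.1
print(round(machine_bf / bf_requirement, 2))   # ~0.08 of peak
```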

[Chart: requirement of B/F (1.0E-4 to 1.0E+1) vs. requirement of memory capacity (1.0E-3 to 1.0E+3 PB) for the target applications; the points cluster into four groups: low BW / middle capacity, high BW / small capacity, high BW / middle capacity, and high BW / high capacity. (slide courtesy of M. Kondo, U. Elec. Comm.)]

2012/10/04 ORAP Forum 2012, Paris

Candidate ExaScale Architectures
- Four types of architecture are considered
  - General Purpose (GP): ordinary CPU-based MPPs, e.g. K computer, GPU, Blue Gene, x86-based PC clusters
  - Capacity-Bandwidth oriented (CB): with expensive memory I/F rather than computing capability, e.g. vector machines
  - Reduced Memory (RM): with embedded (main) memory, e.g. SoC, MD-GRAPE4, Anton
  - Compute Oriented (CO): many processing units, e.g. ClearSpeed, GRAPE-DR

25

[Diagram: the four architecture classes positioned on FLOPS / memory capacity / memory bandwidth axes: CB oriented, compute oriented, reduced memory, general purpose. (slide courtesy of M. Kondo, U. Elec. Comm.)]

2012/10/04 ORAP Forum 2012, Paris

Performance Projection
- Performance projection for an HPC system in 2018
  - Achieved through continuous technology development
  - Constraints: 20 - 30 MW electricity & 2000 sqm space

26

Node performance:
  Architecture           Total CPU Perf.  Total Memory BW  Total Memory Capacity  Byte/Flop
                         (PetaFLOPS)      (PetaByte/s)     (PetaByte)
  General Purpose        200~400          20~40            20~40                  0.1
  Capacity-BW Oriented   50~100           50~100           50~100                 1.0
  Reduced Memory         500~1000         250~500          0.1~0.2                0.5
  Compute Oriented       1000~2000        5~10             5~10                   0.005

Network:
  Topology                 Injection  P-to-P   Bisection  Min Latency  Max Latency
  High-radix (Dragonfly)   32 GB/s    32 GB/s  2.0 PB/s   200 ns       1000 ns
  Low-radix (4D Torus)     128 GB/s   16 GB/s  0.13 PB/s  100 ns       5000 ns

Storage:
  Total Capacity: 1 EB (100 times larger than main memory)
  Total Bandwidth: 10 TB/s (for saving all data in memory to disks within 1000 sec)

(slide courtesy of M. Kondo, U. Elec. Comm.) 2012/10/04 ORAP Forum 2012, Paris
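The Byte/Flop column and the storage bandwidth are consistent with the other entries in the tables above; a quick check in Python:

```python
# Consistency checks on the 2018 projection tables above.
# Byte/Flop column = total memory bandwidth / total CPU performance:
projections = {  # name: ((PFLOPS low, high), (PB/s low, high))
    "General Purpose":      ((200, 400),   (20, 40)),
    "Capacity-BW Oriented": ((50, 100),    (50, 100)),
    "Reduced Memory":       ((500, 1000),  (250, 500)),
    "Compute Oriented":     ((1000, 2000), (5, 10)),
}
for name, ((f_lo, f_hi), (b_lo, b_hi)) in projections.items():
    print(name, b_lo / f_lo, b_hi / f_hi)   # 0.1, 1.0, 0.5, 0.005 as in the table

# Storage bandwidth: dumping the main memory (1 EB / 100 = 10 PB) within 1000 s:
storage_bytes = 1e18                     # 1 EB
main_memory_bytes = storage_bytes / 100  # "100 times larger than main memory"
print(main_memory_bytes / 1000 / 1e12)   # 10.0 TB/s total storage bandwidth
```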

Gap Between Requirement and Technology Trends
- Mapping the four architectures onto the science requirements
- Projected performance vs. science requirements
  - Big gap between projected and required performance

27

[Chart "Mapping of Architectures": the CB, GP, CO and RM architecture classes placed on the requirement-of-B/F vs. requirement-of-memory-capacity (PB) plane from the earlier chart.]
[Chart "Projected vs. Required Perf.": required performance (PFLOPS, up to ~2700) for the CO, RM, GP and CB classes compared against the projections.]

Needs a national research project for science-driven HPC systems

(slide courtesy of M. Kondo, U. Elec. Comm.) 2012/10/04 ORAP Forum 2012, Paris

Center for Computational Sciences, Univ. of Tsukuba

Challenges Toward Exascale System Development

- Challenges in all architectures
  - Power efficiency, power management, dependability
- Challenges in each architecture
  - General Purpose (GP): multi-level memory hierarchy; management of heterogeneity
  - Capacity-Bandwidth oriented (CB): memory system power reduction (3D-ICs, smart memory)
  - Reduced Memory (RM): on-chip network; small-memory algorithms; huge-scale system management
  - Compute Oriented (CO): flexibility for a wide variety of sciences

28

[Diagram: the challenge keywords (power reduction, memory hierarchy, 3D-integration, huge-scale system, flexibility, small memory, dependability, heterogeneity, power management) mapped onto the GP, CB, RM and CO classes.]

2012/10/04 ORAP Forum 2012, Paris

Research Issues
- Key R&D issues in each system component

29

[Diagram: a node made of a processor (latency core, throughput cores, cache), memory and a network, replicated across many nodes, with applications on top; the R&D-issue callouts around it are listed below.]

- Data sharing & network between latency and throughput cores; data sharing among throughput cores; high-performance & low-power NoC
- Suitability of high/low-radix networks; optimization for collective communication; QoS management
- Checkpointing support; migration support; HW monitoring for fault prediction
- Power knobs in each system component; fine-grain power-performance monitoring; system-level power management
- New devices
- Memory hierarchy design; refined on-chip memory architecture; on-chip main memory
- 3D memory integration; Wide I/O, HMC; NVRAM; smart memories
- Intelligent NI; collective communication support; fine-grain barriers
- Architecture development with co-design; dynamic HW adaptation

(slide courtesy of M. Kondo, U. Elec. Comm.) 2012/10/04 ORAP Forum 2012, Paris

Center for Computational Sciences, Univ. of Tsukuba

FS (Feasibility Study for Exascale Systems)

2012/10/04 ORAP Forum 2012, Paris

30

Center for Computational Sciences, Univ. of Tsukuba

Feasibility Study of Advanced High Performance Computing
- MEXT (Ministry of Education, Culture, Sports, Science, and Technology) has launched a two-year project for a feasibility study of advanced HPC, which started in 2012/07 (ends 2014/03)
- Objectives
  - High performance computing technology is an important national infrastructure
  - Continued development of top-level HPC technologies is one of Japan's sources of international competitiveness and contributes to national security and safety
  - This two-year project studies the feasibility of the developments the Japanese community should focus on
  - Vendor(s) must join each research team so that "feasibility" covers power and cost
- Budget
  - Approximately US$ 10M / year

31 2012/10/04 ORAP Forum 2012, Paris

Center for Computational Sciences, Univ. of Tsukuba

Projects in FS
- Four projects (3 system-side + 1 application-side) were selected based on proposals, in 2012/07

2012/10/04 ORAP Forum 2012, Paris

32

Focus & Challenge              Leading PI                    Member Institutes & Vendors
general purpose (base) system  Y. Ishikawa (Univ. of Tokyo)  U. Tokyo, Kyushu U., Fujitsu, Hitachi, NEC
accelerated computing          M. Sato (Univ. of Tsukuba)    U. Tsukuba, TITECH, Aizu U., AICS, Hitachi
vector processing              H. Kobayashi (Tohoku Univ.)   Tohoku U., JAMSTEC, NEC
core applications              H. Tomita (AICS, RIKEN)       AICS, TITECH

Center for Computational Sciences, Univ. of Tsukuba

Basic concept of U. Tsukuba's FS
- Fully concentrated on FLOPS (CO + RM)
  - A large number of SIMD cores (500 - 1K/chip) with very low dynamism
  - Very limited memory capacity per core -> on-chip SRAM
  - Reduced I/O, allowing only limited communication patterns
- High-density accelerated chip
  - Simple and small CPU core with a modest number of registers and on-chip SRAM
  - Tightly coupled with an on-chip interconnection network
- Interconnection network
  - 2-D torus on-chip (up to 1K cores), 2-D torus on-board (# of chips not decided), and a tree network among boards
  - Special feature for a broadcast/reduction tree
- Co-design issues
  - Deciding memory capacity and # of cores on chip (trade-off)
  - Instruction set and performance
  - Network bandwidth and # of ports
  - Based on target applications and algorithms

2012/10/04 ORAP Forum 2012, Paris

33

Core CPU architecture

34

[Diagram: each core contains an integer ALU, a T-register, general registers, local memory, PEID/BBID registers, and a NEWS register connected through multiplexers to the neighboring cores and to broadcast memories.]

2012/10/04 ORAP Forum 2012, Paris

Center for Computational Sciences, Univ. of Tsukuba

Inter-core communication

[Diagram: the local memory of each core has a second port through which a neighboring core can access it.]

How to specify the address?
- Source oriented: low throughput
- Destination oriented: difficult to program
- How should path conflicts be controlled?

35 2012/10/04 ORAP Forum 2012, Paris

Center for Computational Sciences, Univ. of Tsukuba 36

Chip Architecture

[Diagram: the FSACC chip comprises a 4 x 4 array of PEs with a reduction network, communication buffers along the edges of the array, a controller with instruction memory and data memory, and an interface to the host CPU; the communication buffers connect to board memories (Brd. Mem.) and to the on-board interconnection.]

Center for Computational Sciences, Univ. of Tsukuba

System configuration image
- Performance target: 1 EFLOPS
  - core: 1 GHz, 2 FLOP/clock = 2 GFLOPS
  - chip: 32 x 32 cores = 1K cores = 2 TFLOPS
  - node: 8 x 8 chips = 64K cores = 128 TFLOPS/node
  - system: 8,000 nodes = 1 EFLOPS
- Power consumption limit: 20 MW
  - 20 MW / 8,000 = 2.5 kW/node
  - 2.5 kW / 64 = ~40 W/chip
- This balance is possible only if we limit the SRAM capacity per core to about 512 words or so
- It looks quite strange compared with today's general-purpose systems, but this is one style of "how to achieve EFLOPS"
- It needs to be very carefully designed and configured for application demand (a quick arithmetic check follows below)
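A quick check of this configuration (note the slide rounds 1,024 cores to "1K" and 2.048 TFLOPS to 2 TFLOPS):

```python
# Arithmetic check of the 1 EFLOPS configuration sketched above.
core_flops = 1e9 * 2                   # 1 GHz x 2 FLOP/clock = 2 GFLOPS
cores_per_chip = 32 * 32               # "1K" cores (exactly 1024)
chips_per_node = 8 * 8                 # 64 chips
nodes = 8000

chip_tflops = core_flops * cores_per_chip / 1e12        # ~2 TFLOPS
node_tflops = chip_tflops * chips_per_node              # ~128 TFLOPS
system_eflops = node_tflops * nodes * 1e12 / 1e18       # ~1 EFLOPS
print(round(chip_tflops, 3), round(node_tflops, 1), round(system_eflops, 3))

power_limit_w = 20e6                   # 20 MW
watts_per_node = power_limit_w / nodes                  # 2500 W
watts_per_chip = watts_per_node / chips_per_node        # ~39 W, i.e. the ~40 W/chip budget
print(watts_per_node, round(watts_per_chip, 1))
```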

2012/10/04 ORAP Forum 2012, Paris

37

Center for Computational Sciences, Univ. of Tsukuba

Trade-off between # of cores and I/O

For the case of 2-D stencil computation (n x n grid / core, p x p cores / chip):
  computation on a core = O(n^2), communication among cores = O(n), communication among chips = O(np)
  (ex: n = 32 -> 1K cores/chip, p = 8 -> 64 chips/board = 64K cores/board)

To reduce communication overhead:
- larger "n" (memory capacity is the limiting factor)
  - trade-off between memory capacity per core and # of cores
- communication during time development to upload data to the host CPU = O(n^2 x p^2)
  - a large interval between data uploads (m >> p^2) is required
  - enables overlapping the overheads of computation and communication
- if we have a 2-D torus between chips, p buffers of size n are required
  - this buffer can also be used for other, general communication
  - high throughput for inter-chip communication

To improve the computation performance:
- SIMD instructions even within a core -> the communication ratio gets worse
- increasing the # of cores -> the communication ratio gets worse

(A small model of these ratios follows below.)
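```python
# Minimal model (assumed numbers, not from the slide) of the compute/communication
# balance for the 2-D stencil case above: n x n grid points per core, halo exchange
# with the four neighbouring cores, p x p cores per chip.
def per_core_comp_comm_ratio(n, flops_per_point=13, words_per_halo_point=1):
    comp = flops_per_point * n * n        # O(n^2) flops per time step
    comm = 4 * words_per_halo_point * n   # O(n) halo words per time step
    return comp / comm

def off_chip_halo_words(n, p, words_per_halo_point=1):
    # Only the cores on the chip boundary communicate off-chip: O(n * p) words per step.
    return 4 * words_per_halo_point * n * p

for n in (8, 32, 128):
    # Larger n improves the ratio, but needs ~n*n words of on-chip SRAM per core,
    # which is exactly the memory-capacity trade-off stated above.
    print(n, per_core_comp_comm_ratio(n))

print(off_chip_halo_words(n=32, p=32))    # off-chip words per step for a 1K-core chip
```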

38 2012/10/04 ORAP Forum 2012, Paris

Center for Computational Sciences, Univ. of Tsukuba

JST-CREST (Core Research for Evolutional Science and Technology by Japan Science and Technology Agency)

2012/10/04 ORAP Forum 2012, Paris

39

Center for Computational Sciences, Univ. of Tsukuba

Post-petascale system software in CREST
- Area director: A. Yonezawa (AICS, RIKEN)
  "Development of System Software Technologies for Post-Peta Scale High Performance Computing"
- 3 stages, starting in 2010, 2011 and 2012
  - 4-5 projects / stage
  - 5 years / project
  - US$ 3-5M / project (for 5 years)
- Purpose
  - Besides hardware development, system software, languages, libraries and applications should be developed under separate funding
  - Leadership and education of (relatively) young researchers to support the next generation of Japan's HPC

2012/10/04 ORAP Forum 2012, Paris

40

Stage-1 (2010-2015)
- T. Sakurai (U. Tsukuba): Development of an Eigen-Supercomputing Engine using a Post-Petascale Hierarchical Model
- O. Tatebe (U. Tsukuba): System Software for Post Petascale Data Intensive Science
- K. Nakajima (U. Tokyo): ppOpen-HPC: Open Source Infrastructure for Development and Execution of Large-Scale Scientific Applications on Post-Peta-Scale Supercomputers with Automatic Tuning (AT)
- A. Hori (AICS): Parallel System Software for Multi-core and Many-core
- N. Maruyama (AICS): Highly Productive, High Performance Application Frameworks for Post Petascale Computing

Stage-2 (2011-2016)
- R. Shioya (U. Tokyo): Development of a Numerical Library based on Hierarchical Domain Decomposition for Post Petascale Simulation
- H. Takizawa (Tohoku U.): An Evolutionary Approach to Construction of a Software Development Environment for Massively-Parallel Heterogeneous Systems
- S. Chiba (TITECH): Software Development for Post Petascale Super Computing --- Modularity for Super Computing
- T. Nanri (Kyushu U.): Development of Scalable Communication Library with Technologies for Memory Saving and Runtime Optimization
- K. Fujisawa (Chuo U.): Advanced Computing and Optimization Infrastructure for Extremely Large-Scale Graphs on Post Peta-Scale Supercomputers

Stage-3 (2012-2017)
- T. Endo (TITECH): Software Technology for Deep Memory Hierarchy for Post-Peta Scale Era
- M. Kondo (U. Elec. Comm.): Development of Power Management Framework for Post-Peta Scale System
- I. Noda (AIST): Management and Execution Framework for Ultra Large Scale Society Simulation
- T. Boku (U. Tsukuba): Development of Unified Environment of Accelerators and Communication System for Post-Peta Scale Era

41

Center for Computational Sciences, Univ. of Tsukuba

AC (Accelerator-Communication) CREST
- Research formation
  - Leading PI: T. Boku (U. of Tsukuba)
  - Joint institutes: U. Tsukuba, Keio U., Osaka U. and AICS, RIKEN
  - 2012/10 - 2018/03, US$ 4.3M
- AC (Accelerator-Communication) unification
  - Based on the TCA (Tightly Coupled Accelerator) concept, develop technology for direct communication among accelerators based on PCIe technology
  - System software at all layers to support TCA
  - Application development based on the same concept
  - New algorithm development for AC-unified systems
- Research topics
  - PEACH2-oriented system software (firmware, drivers, API, library)
  - PGAS language to support the TCA model (XcalableMP/TCA)
  - Applications (astrophysics, Lattice QCD, life science, climate, ...)
  - Advanced PEACH (PEACH3) based on PCIe gen3

2012/10/04 ORAP Forum 2012, Paris

42

Center for Computational Sciences, Univ. of Tsukuba

HA-PACS/TCA (Tightly Coupled Accelerator)
- Enhanced version of PEACH ⇒ PEACH2
  - x4 lanes -> x8 lanes
  - hardwired on the main data path and PCIe interface fabric
- True GPU-direct
  - Current GPU clusters require 3-hop communication (3-5 memory copies)
  - For strong scaling, an inter-GPU direct communication protocol is needed for lower latency and higher throughput

[Diagram: two nodes, each with CPUs (and memory), GPUs and an IB HCA on the PCIe fabric; a PEACH2 chip in each node attaches to that PCIe fabric and links the two nodes directly, alongside the InfiniBand switch.]

43

Center for Computational Sciences, Univ. of Tsukuba

Implementation of PEACH2: FPGA
- FPGA-based implementation
  - Today's advanced FPGAs allow a PCIe hub with multiple ports: currently gen2 x 8 lanes x 4 ports are available ⇒ gen3 will be available soon (?)
  - Easy modification and enhancement
  - Fits on a standard (full-size) PCIe board
  - An internal multi-core general-purpose CPU with programmability is available ⇒ the hardwired/firmware partitioning of the control layer can easily be split at a suitable level
- Controlling PEACH2 for the GPU communication protocol
  - Collaboration with NVIDIA for information sharing and discussion
  - Based on the CUDA 4.0 device-to-device direct memory copy protocol

44

Center for Computational Sciences, Univ. of Tsukuba

PEACH2 prototype board for TCA

45

[Photo: the prototype board, showing the FPGA (Altera Stratix IV GX530), two PCIe external link connectors (one more on a daughter board), the PCIe edge connector to the host server, the daughter board connector, and the power regulators for the FPGA.]

Center for Computational Sciences, Univ. of Tsukuba

PEACH2 configuration and performance (tentative)

[Diagram: GPU (or CPU) - PEACH2 - PEACH2 - GPU (or CPU) communication path; stage latencies: PCIe 160 ns, internal routing 80 ns, PCIe 196 ns, additional routing and PCIe 240 ns.]

GPU-PEACH2-PEACH2-GPU communication latency: ~700 ns

Test bed cluster: HA-PACS at U. of Tsukuba (800 TFLOPS, NVIDIA Fermi M2090 x 1072, Intel SB-EP x 536)
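Summing the per-stage latencies quoted above gives the ~700 ns end-to-end figure:

```python
# Sum of the per-stage latencies quoted on this slide for the
# GPU-PEACH2-PEACH2-GPU path (stage order as drawn in the diagram).
stages_ns = [
    ("PCIe", 160),
    ("internal routing", 80),
    ("PCIe", 196),
    ("additional routing and PCIe", 240),
]
print(sum(ns for _, ns in stages_ns))   # 676 ns, consistent with the ~700 ns total
```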

46

[Diagram: HA-PACS node block diagram: two multi-core CPU sockets connected by QPI, with PCIe links to four GPUs, an IB HCA, and the PEACH2 board.]

Center for Computational Sciences, Univ. of Tsukuba

Summary n  Japan’s activity towards exa-scale computing is now strategically performed

under governmental program (MEXT) n  K computer is now in full operation with main contribution for SPIRE and

partial contribution for public projects in HPCI n  HPCI provides SSO facility access to all supercomputers in Japan for K

computer and universities’ supercomputers with large scale shared file system n  Feasibility Study research started to provide the hints for decision making

based on the concept of co-designing towards exa-scale n  JST-CREST supports the development of system software for next generation n  Center for Computational Sciences, University of Tsukuba takes important role

on all of these activities n  HPCI resource provider n  Shared file system in HPCI is based on Gfarm2 developed and maintained by U. Tsukuba n  Feasibility Study: leading PI institute of Accelerator based System n  JST-CREST: leading 3 projects as PI institute

2012/10/04 ORAP Forum 2012, Paris

47