AI Bridging Cloud Infrastructure (ABCI) and its Communication Performance
Shinichiro Takizawa, The National Institute of Advanced Industrial Science and Technology (AIST), Japan
Outline
• Introduction of AIST and ABCI
• ABCI in detail
  – Architecture, software stack, network topology
• MPI performance on ABCI
• A recent application on ABCI
Introduction of AIST
• A research institute under the Ministry of Economy, Trade and Industry (METI) of Japan
• Our mission
  – Integrate scientific and engineering knowledge to address the needs of industry and society
  – Bridge the gap between innovative technological seeds and commercialization
We built ABCI to popularize AI technologies in Japan
AI Bridging Cloud Infrastructure (ABCI)
• The first large-scale open AI computing infrastructure, constructed and operated by AIST
• Provides large-scale compute and data processing capability: 550 PFlops (FP16), 37.2 PFlops (FP64)
• An open and dedicated infrastructure for AI and Big Data algorithms, software, and applications
• An open innovation platform to accelerate joint academic-industry R&D for AI
Univ. Tokyo / Kashiwa II Campus
OPERATED SINCE AUGUST 1ST, 2018
Achievements from the First Year
• 100+ projects, 1000+ users
  – Academia, research institutes, and companies
• Large Japanese companies have started to use ABCI as their R&D platform
• ABCI supports NGC containers
• Sony and Fujitsu Laboratories achieved world-leading ImageNet-1k classification training performance on ABCI
• Two research papers that use ABCI were accepted at SC19
https://blogs.nvidia.com/blog/2019/06/17/abci-adopts-ngc/
[Chart: ImageNet / ResNet-50 training, relative speedup and accuracy, for MSRA (2015), Facebook (2017), Google Brain (2017), Preferred Networks (2017), Tencent (2018), Sony + ABCI (2018), Google (2018), Sony + ABCI (2019), Fujitsu Lab + ABCI (2019), NVIDIA (2019), and Google (2019)]
World’s Highest Speed in ImageNet-1k Training
MLPerf Training v0.6
Sony's work: https://arxiv.org/abs/1811.05233
Fujitsu Lab's work: https://arxiv.org/abs/1903.12650
ABCI Hardware Overview

Gateway and Firewall: Nexus 3232C x 2, FortiGate 1500D x 2, FortiAnalyzer 400E x 1; 100Gbps link to SINET5
Interactive Nodes x 4
Management and Gateway Nodes x 15
Service Network: 10GbE
Interconnect (InfiniBand EDR): Mellanox CS7500 x 2, Mellanox SB7890 x 229

High-Performance Computing System: 550 PFlops (FP16), 37.2 PFlops (FP64), 476 TiB memory, 1.74 PB NVMe SSD
• Computing Nodes (w/ GPU) x 1088
  – GPU: NVIDIA Tesla V100 SXM2 x 4
  – CPU: Intel Xeon Gold 6148 (2.4GHz/20cores) x 2
  – Memory: 384GiB
  – Local storage: Intel SSD DC P4600 (NVMe) 1.6TB x 1
  – Interconnect: InfiniBand EDR x 2
• Multi-platform Nodes (w/o GPU) x 10
  – Intel Xeon Gold 6132 (2.6GHz/14cores) x 2
  – 768GiB memory, 3.8TB NVMe SSD, 1.5TB Intel Optane x 2

Large-scale Storage System
• 1 PB Lustre (home directory): DDN SFA14KX (w/ SS9012 Enclosure x 10) x 1; 7.68TB SAS SSD x 185 for data, 960GB SAS SSD x 13 for metadata
• 22 PB GPFS (group shared directory, etc.): DDN SFA14K (w/ SS8462 Enclosure x 10) x 3; 12TB 7.2Krpm NL-SAS HDD x 2400, 3.84TB SAS SSD x 216
• 17 PB object storage (in preparation): HPE Apollo 4510 Gen10 x 24; 12TB SATA HDD x 1440, 3.2TB SSD x 24
ABCI Software Stack
Operating System: RHEL / CentOS 7.6
Job Scheduler: Univa Grid Engine 8.6.3
Container Engine: Docker 17.12.0 (users can run only supported container images), Singularity 2.6.1 (users can run any container image)
MPI: Intel MPI 2018.2.199; MVAPICH2 2.3rc2, 2.3; MVAPICH2-GDR 2.3a, 2.3rc1, 2.3, 2.3.1; OpenMPI 1.10.7, 2.1.3, 2.1.5, 2.1.6, 3.0.3, 3.1.0, 3.1.2, 3.1.3
Development Tools: Intel Parallel Studio XE Cluster Edition 2017.8, 2018.2, 2018.3, 2019.3; PGI Professional Edition 17.10, 18.5, 18.10, 19.3; NVIDIA CUDA SDK 8.0, 9.0, 9.1, 9.2, 10.0; cuDNN 5.1, 6.0, 7.0, 7.1, 7.2, 7.3, 7.4, 7.5; NCCL 1.3.5, 2.1, 2.2, 2.3, 2.4; Intel MKL 2017.8, 2018.2, 2018.3, 2019.3; GCC, Python, Ruby, R, OpenJDK, Go, Perl
Deep Learning: Caffe, Caffe2, TensorFlow, Theano, Torch, PyTorch, CNTK, MXNet, Chainer, Keras, etc. (frameworks provided by NVIDIA GPU Cloud can also be deployed)
Big Data Processing: Hadoop, Spark
ABCI Compute Node
[Block diagram: two Xeon Gold 6148 (Skylake) CPUs connected by UPI x 3 (10.4 GT/s), each with 6 x 32GB DDR4-2666 (128 GB/s); each CPU attaches a PCIe switch (x48 / x64) that connects two Tesla V100 SXM2 GPUs over PCIe gen3 x16 and one IB HCA (100Gbps); the four GPUs are linked by NVLink2 x 2; one NVMe SSD on the node]
FUJITSU PRIMERGY Server (2 servers in 2U)
CPU: Xeon Gold 6148 (27.5M Cache, 2.40 GHz, 20 cores) x 2
GPU: NVIDIA Tesla V100 (SXM2) x 4
Memory: 384GiB DDR4 2666MHz RDIMM
Local Storage: 1.6TB NVMe SSD (Intel SSD DC P4600 U.2) x 1
Interconnect: InfiniBand EDR x 2
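To see this topology from the application side, a short CUDA runtime sketch like the one below (illustrative only, not part of the ABCI software stack) prints which GPU pairs on a node report direct peer access; on a node of this type that typically reflects the NVLink2 links and the shared PCIe switches.

```c
#include <stdio.h>
#include <cuda_runtime.h>

/* Print the GPU-to-GPU peer access matrix of the local node.
 * Peer access is typically reported for GPUs that can reach each
 * other over NVLink or a common PCIe switch. */
int main(void)
{
    int ndev = 0;
    cudaGetDeviceCount(&ndev);
    printf("GPUs on this node: %d\n", ndev);
    for (int i = 0; i < ndev; i++) {
        for (int j = 0; j < ndev; j++) {
            if (i == j)
                continue;
            int can = 0;
            cudaDeviceCanAccessPeer(&can, i, j);
            printf("GPU%d -> GPU%d: peer access %s\n", i, j, can ? "yes" : "no");
        }
    }
    return 0;
}
```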
ABCI Node / Rack Interconnect
[Diagram: each rack houses 17 CX400 chassis (34 CX2570 compute nodes) connected to 4 leaf switches (SB7890); the leaves connect to 3 full-bisection-bandwidth switches (FBB, SB7890) in the rack with IB-EDR x 72 (full bisection bandwidth within the rack) and to 2 spine switches (CS7500) with IB-EDR x 24 (1/3 over-subscription between racks); link legend: InfiniBand EDR x1, x6, and x4]
• Densely packaged racks: 34 compute nodes per rack
• Interconnect
  – Fat-tree topology
  – Intra-rack: full bisection bandwidth
  – Inter-rack: 1/3 over-subscription
  – Without adaptive routing
  – Without Mellanox SHARP
Grounds for the Interconnect Design
• 1/3 over-subscription network (24 inter-rack uplinks vs. 72 intra-rack links per rack)
  – A cost-effective solution
  – Many DL training applications do not use a large number of nodes
  – A high-bandwidth network is required by only a small set of nodes
• Without adaptive routing
  – InfiniBand is shared by compute and I/O traffic, and the delivery vendor suggested not using adaptive routing in such a configuration
• Without Mellanox SHARP
  – We use EDR and switches that do not support large messages in SHARP
Distribution of Number of Used GPUs/Nodes in DL Jobs
• Workload collected from the pre-ABCI system, AAIC
  – A system dedicated to AI research
• Single-GPU jobs are dominant and the degree of parallelism is low
• ABCI is designed as a capacity computing system for AI
AAIC scale: 32-38 nodes, 8 GPUs/node, 256-304 GPUs
[Charts: breakdown of single-GPU, single-node, and multi-node jobs; number of used nodes in multi-node jobs]
Workload data is published at: https://github.com/aistairc/aaic-workload
Performance Impact on MPI under 1/3 Over-subscription without Adaptive Routing
• Measured P2P host-memory to host-memory transfer performance using OpenMPI 2.1.6
• Intra-rack: 17 nodes to 17 nodes; inter-rack: 34 nodes to 34 nodes
• Increased the number of communicating node pairs (a simplified measurement sketch appears after the diagram below)
[Diagram: paired nodes, with each node in Group0 (Rack0) communicating with a partner node in Group1 (Rack1)]
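The following is a simplified sketch of this kind of pairwise P2P bandwidth test, not the actual benchmark behind the plots: each rank in the first group streams a large host buffer to a partner rank in the second group and reports per-pair throughput. Rank placement (one rank per node, the first half of the ranks in Group0) is an assumption of the sketch.

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Simplified pairwise P2P bandwidth test (illustration only).
 * Assumes an even number of ranks, one rank per node: rank r in the
 * first half streams a large host buffer to rank r + size/2 and
 * reports per-pair throughput. */
#define MSG_BYTES (1 << 26)   /* 64 MiB per message */
#define ITERS 20

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int half = size / 2;
    int peer = (rank < half) ? rank + half : rank - half;
    char *buf = malloc(MSG_BYTES);
    memset(buf, 0, MSG_BYTES);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int it = 0; it < ITERS; it++) {
        if (rank < half)
            MPI_Send(buf, MSG_BYTES, MPI_CHAR, peer, 0, MPI_COMM_WORLD);
        else
            MPI_Recv(buf, MSG_BYTES, MPI_CHAR, peer, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
    }
    double t1 = MPI_Wtime();

    if (rank < half)
        printf("pair %d-%d: %.2f GB/s\n", rank, peer,
               (double)MSG_BYTES * ITERS / (t1 - t0) / 1e9);

    free(buf);
    MPI_Finalize();
    return 0;
}
```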
Intra-Rack P2P Performance
[Charts: per-node average throughput and aggregated throughput (GB/s) vs. the number of intra-group connections, ideal vs. measured]
More than 80% of the theoretical performance is achieved.
Inter-Rack P2P Performance
[Charts: per-node average throughput and total rack throughput (GB/s) vs. the number of inter-rack connections, ideal vs. measured]
Large performance degradation occurs at 6 and 20 connections, and throughput is far from the ideal performance.
Collective Comm. Performance on ABCI
• Measured the performance of two collective communication patterns
  – Allreduce: essential in distributed training of DL
  – Reduce: the heaviest communication in our application
• Used OSU Micro-Benchmarks 5.6.1 (a minimal sketch of this kind of timing loop appears below)
  – NCCL Bench 2.0.0 is used for NCCL
  – Default parameters
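Roughly, what osu_allreduce reports is the average latency of the collective over a sweep of message sizes. The loop below is a minimal stand-in for that measurement, not the OSU code itself.

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

/* OSU-style timing loop for MPI_Allreduce: for each message size,
 * run the collective many times and report the average latency.
 * A simplified stand-in for osu_allreduce, not the benchmark itself. */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int iters = 100;
    const size_t max_bytes = 1 << 22;          /* up to 4 MiB */
    float *sbuf = calloc(max_bytes, 1);
    float *rbuf = calloc(max_bytes, 1);

    for (size_t bytes = sizeof(float); bytes <= max_bytes; bytes *= 2) {
        int count = bytes / sizeof(float);
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < iters; i++)
            MPI_Allreduce(sbuf, rbuf, count, MPI_FLOAT, MPI_SUM,
                          MPI_COMM_WORLD);
        double t1 = MPI_Wtime();
        if (rank == 0)
            printf("%8zu bytes : %8.2f us\n",
                   bytes, (t1 - t0) / iters * 1e6);
    }

    free(sbuf);
    free(rbuf);
    MPI_Finalize();
    return 0;
}
```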
Host to host transfer, MPI versions: MVAPICH2 2.3.1; IntelMPI 2018 Update 2; OpenMPI 3.1.3
GPU to GPU transfer, MPI and NCCL versions: MVAPICH2 2.3.1; MVAPICH2-GDR 2.3.1 with gdrcopy 1.2; OpenMPI 3.1.3; NCCL 2.4.7
Compiler: GCC 4.8.5
CUDA: 10.0.130
H2H AllReduce
Inter-Rack (68 Nodes) / Intra-Rack (34 Nodes)
• MVAPICH2 is comparable to IntelMPI
• Performance does not suffer from the 1/3 over-subscribed inter-rack bandwidth
H2H Reduce
Inter-Rack (68 Nodes) / Intra-Rack (34 Nodes)
MVAPICH2 is better than the others for small messages, but IntelMPI outperforms MVAPICH2 for large messages
D2D AllReduce
Inter-Rack (68 Nodes) / Intra-Rack (34 Nodes)
MVAPICH2-GDR and NCCL outperform the others for large messages
D2D Reduce
Inter-Rack (68 Nodes) / Intra-Rack (34 Nodes)
MVAPICH2-GDR achieves the best performance
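The D2D runs rely on CUDA-aware MPI: MVAPICH2-GDR (and CUDA-aware OpenMPI builds) accept GPU device pointers directly in collectives, so data moves GPU to GPU without an explicit host staging copy. A minimal usage sketch, assuming a CUDA-aware MPI build and four consecutively placed ranks per node, looks like this:

```c
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

/* D2D collective sketch: with a CUDA-aware MPI (e.g. MVAPICH2-GDR or a
 * CUDA-aware OpenMPI build), device pointers can be passed directly to
 * MPI collectives. Assumes 4 GPUs per node and 4 ranks per node placed
 * consecutively. */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int count = 1 << 20;                 /* 1M floats = 4 MiB */
    float *d_send, *d_recv;
    cudaSetDevice(rank % 4);                   /* assumption: 4 GPUs per node */
    cudaMalloc((void **)&d_send, count * sizeof(float));
    cudaMalloc((void **)&d_recv, count * sizeof(float));
    cudaMemset(d_send, 0, count * sizeof(float));

    /* Device buffers used directly; a non-CUDA-aware MPI would require
     * cudaMemcpy to host buffers before and after the call. */
    MPI_Allreduce(d_send, d_recv, count, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("allreduce on device buffers done\n");

    cudaFree(d_send);
    cudaFree(d_recv);
    MPI_Finalize();
    return 0;
}
```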
Application Performance: iFDK
• A high-speed, high-resolution CT image reconstruction framework running on GPU clusters
  – Creates 3D images from multiple 2D X-ray images
  – Demand for high-resolution CT images: non-invasive inspection, reverse engineering, etc.
• What we have done
  – A GPU kernel whose compute cost is 1/6 of the standard algorithm
  – Efficient FDK computation by overlapping CPU computation, GPU computation, and communication
  – A distributed framework for high-resolution image reconstruction
Peng Chen, Mohamed Wahib, Shinichiro Takizawa, Ryousei Takano, Satoshi Matsuoka. iFDK: A Scalable Framework for Instant High-resolution Image Reconstruction.
[Figure: CBCT geometry and trajectory, with a micro-focus X-ray source]
Will be presented at SC19
Distributed Framework for High-resolution Image Reconstruction
[Pipeline diagram: 2D projections (input) are loaded and filtered on CPUs, the filtered data is exchanged with AllGather, back-projection runs on GPUs, partial results are combined with Reduce, and the resulting 3D volume (output) is stored from CPUs]
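In MPI terms, the pattern above can be sketched as filter, allgather, back-project, reduce. The code below is only a schematic of that communication pattern: filter() and backproject() are hypothetical stubs, sizes are placeholders, and the real iFDK pipeline runs back-projection on GPUs and overlaps these stages.

```c
#include <mpi.h>
#include <stdlib.h>
#include <string.h>

/* Schematic of the communication pattern: each rank filters its share
 * of 2D projections, the filtered projections are allgathered, every
 * rank back-projects into a partial subvolume, and partial subvolumes
 * are summed onto the owning rank with MPI_Reduce. */

#define PROJ_PER_RANK 4
#define PROJ_PIXELS   (2048 * 2048)         /* one 2K x 2K projection */
#define SUBVOL_VOXELS (256 * 256 * 256)     /* one output subvolume   */

static void filter(float *projs, int n)     /* hypothetical stub */
{
    memset(projs, 0, (size_t)n * PROJ_PIXELS * sizeof(float));
}

static void backproject(const float *projs, int n, float *vol)  /* stub */
{
    (void)projs; (void)n;
    memset(vol, 0, SUBVOL_VOXELS * sizeof(float));
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int local_n = PROJ_PER_RANK * PROJ_PIXELS;   /* floats per rank */
    float *local_p  = malloc((size_t)local_n * sizeof(float));
    float *all_p    = malloc((size_t)local_n * size * sizeof(float));
    float *part_vol = malloc((size_t)SUBVOL_VOXELS * sizeof(float));
    float *sub_vol  = malloc((size_t)SUBVOL_VOXELS * sizeof(float));

    filter(local_p, PROJ_PER_RANK);                      /* on CPUs */

    /* every rank obtains all filtered projections */
    MPI_Allgather(local_p, local_n, MPI_FLOAT,
                  all_p, local_n, MPI_FLOAT, MPI_COMM_WORLD);

    backproject(all_p, PROJ_PER_RANK * size, part_vol);  /* on GPUs in iFDK */

    /* sum partial subvolumes onto the subvolume owner (rank 0 here) */
    MPI_Reduce(part_vol, sub_vol, SUBVOL_VOXELS, MPI_FLOAT, MPI_SUM,
               0, MPI_COMM_WORLD);

    free(local_p); free(all_p); free(part_vol); free(sub_vol);
    MPI_Finalize();
    return 0;
}
```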
Example Case
[Diagram: 4096 input projections (16 MB/image) are distributed across processes, allgathered, and reduced into 64 output subvolumes (vol 0 to vol 63, 8 GB/volume)]
Input: 4K (4096) 2K x 2K projection images; Output: 4K^3 volume; 32 nodes
Performance Result
• Large performance gap between MPI implementations
  – Mainly caused by Reduce
  – IntelMPI was used in the paper
• Next steps
  – Tune parameters for collectives
  – Use CUDA-aware MPI collectives
[Chart: time in seconds, broken down into Allgather, Reduce, IO, and Compute, for IntelMPI, MVAPICH2, and OpenMPI]
Input: 4K (4096) 2K x 2K projection images; Output: 4K^3 volume; 32 nodes
Summary
• Introduced ABCI, an open AI platform
  – Architecture, software stack, and latest achievements
  – Detailed network architecture
• MPI communication performance on ABCI
  – Effect of the 1/3 inter-rack over-subscription bandwidth on P2P and collective communication
  – An application performance example
• Future work
  – Evaluate the latest MVAPICH2 and MVAPICH2-GDR
  – Provide detailed communication performance results on ABCI to users
https://abci.ai/