AI Bridging Cloud Infrastructure (ABCI) and its Communication Performance
Shinichiro Takizawa, The National Institute of Advanced Industrial Science and Technology (AIST), Japan
Outline
• Introduction of AIST and ABCI
• ABCI in detail
  – Architecture, software stack, network topology
• MPI performance on ABCI
• A recent application on ABCI
Introduction of AIST
• A research institute under the Ministry of Economy, Trade and Industry (METI) of Japan
• Our mission
  – Integrate scientific and engineering knowledge to address the needs of industry and society
  – Bridge the gap between innovative technological seeds and commercialization
We built ABCI to popularize AI technologies in Japan
AI Bridging Cloud Infrastructure (ABCI)
• The first large-scale open AI computing infrastructure, constructed and operated by AIST
• Provides large-scale compute and data processing capability: 550 PFlops (FP16), 37.2 PFlops (FP64)
• An open and dedicated infrastructure for AI and Big Data algorithms, software, and applications
• An open innovation platform to accelerate joint academic-industry R&D for AI
Univ. Tokyo / Kashiwa II Campus
OPERATED SINCE AUGUST 1ST, 2018
Achievements from the First Year
• 100+ projects, 1000+ users
  – Academia, research institutes, and companies
• Large Japanese companies have started to use ABCI as their R&D platform
• ABCI supports NGC containers
• Sony and Fujitsu Laboratories achieved world-leading ImageNet-1k classification training performance on ABCI
• Two research papers that use ABCI were accepted at SC19
https://blogs.nvidia.com/blog/2019/06/17/abci-adopts-ngc/
[Chart: ImageNet / ResNet-50 training, relative speedup and accuracy, for MSRA (2015), Facebook (2017), Google Brain (2017), Preferred Networks (2017), Tencent (2018), Sony + ABCI (2018), Google (2018), Sony + ABCI (2019), Fujitsu Lab + ABCI (2019), NVIDIA (2019), and Google (2019)]
World’s Highest Speed in ImageNet-1k Training
MLPerf Training v0.6
Sony's work: https://arxiv.org/abs/1811.05233
Fujitsu Lab's work: https://arxiv.org/abs/1903.12650
ABCI Hardware Overview

Gateway and Firewall: Nexus 3232C x 2, FortiGate 1500D x 2, FortiAnalyzer 400E x 1; 100Gbps link to SINET5
Interactive Nodes x 4
Management and Gateway Nodes x 15
Service Network: 10GbE
Interconnect (InfiniBand EDR): Mellanox CS7500 x 2, Mellanox SB7890 x 229

High-Performance Computing System: 550 PFlops (FP16), 37.2 PFlops (FP64), 476 TiB memory, 1.74 PB NVMe SSD
• Computing Nodes (w/ GPU) x 1088
  – GPU: NVIDIA Tesla V100 SXM2 x 4
  – CPU: Intel Xeon Gold 6148 (2.4GHz/20cores) x 2
  – Memory: 384GiB
  – Local storage: Intel SSD DC P4600 (NVMe) 1.6TB x 1
  – Interconnect: InfiniBand EDR x 2
• Multi-platform Nodes (w/o GPU) x 10
  – Intel Xeon Gold 6132 (2.6GHz/14cores) x 2
  – 768GiB memory, 3.8TB NVMe SSD, 1.5TB Intel Optane x 2

Large-scale Storage System
• 1 PB Lustre (home directory): DDN SFA14KX (w/ SS9012 Enclosure x 10) x 1; 7.68TB SAS SSD x 185 for data, 960GB SAS SSD x 13 for metadata
• 22 PB GPFS (group shared directory, etc.): DDN SFA14K (w/ SS8462 Enclosure x 10) x 3; 12TB 7.2Krpm NL-SAS HDD x 2400, 3.84TB SAS SSD x 216
• 17 PB object storage (in preparation): HPE Apollo 4510 Gen10 x 24; 12TB SATA HDD x 1440, 3.2TB SSD x 24
ABCI Software Stack
Operating System: RHEL / CentOS 7.6
Job Scheduler: Univa Grid Engine 8.6.3
Container Engine: Docker 17.12.0 (users can run only supported container images), Singularity 2.6.1 (users can run any container image)
MPI: Intel MPI 2018.2.199; MVAPICH2 2.3rc2, 2.3; MVAPICH2-GDR 2.3a, 2.3rc1, 2.3, 2.3.1; OpenMPI 1.10.7, 2.1.3, 2.1.5, 2.1.6, 3.0.3, 3.1.0, 3.1.2, 3.1.3
Development Tools: Intel Parallel Studio XE Cluster Edition 2017.8, 2018.2, 2018.3, 2019.3; PGI Professional Edition 17.10, 18.5, 18.10, 19.3; NVIDIA CUDA SDK 8.0, 9.0, 9.1, 9.2, 10.0; cuDNN 5.1, 6.0, 7.0, 7.1, 7.2, 7.3, 7.4, 7.5; NCCL 1.3.5, 2.1, 2.2, 2.3, 2.4; Intel MKL 2017.8, 2018.2, 2018.3, 2019.3; GCC, Python, Ruby, R, OpenJDK, Go, Perl
Deep Learning: Caffe, Caffe2, TensorFlow, Theano, Torch, PyTorch, CNTK, MXNet, Chainer, Keras, etc. (frameworks provided by NVIDIA GPU Cloud can also be deployed)
Big Data Processing: Hadoop, Spark
ABCI Compute Node
[Block diagram: two Xeon Gold 6148 (Skylake) CPUs connected by UPI x 3 (10.4 GT/s), each with 6 x 32GB DDR4-2666 (128 GB/s); each CPU attaches a PCIe switch (x48 / x64) that connects two Tesla V100 SXM2 GPUs over PCIe gen3 x16 and one IB HCA (100Gbps); the four GPUs are linked by NVLink2 x 2; one NVMe SSD on the node]
FUJITSU PRIMERGY Server (2 servers in 2U)
CPU: Xeon Gold 6148 (27.5M Cache, 2.40 GHz, 20 cores) x 2
GPU: NVIDIA Tesla V100 (SXM2) x 4
Memory: 384GiB DDR4 2666MHz RDIMM
Local Storage: 1.6TB NVMe SSD (Intel SSD DC P4600 U.2) x 1
Interconnect: InfiniBand EDR x 2
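To see this topology from the application side, a short CUDA runtime sketch like the one below (illustrative only, not part of the ABCI software stack) prints which GPU pairs on a node report direct peer access; on a node of this type that typically reflects the NVLink2 links and the shared PCIe switches.

```c
#include <stdio.h>
#include <cuda_runtime.h>

/* Print the GPU-to-GPU peer access matrix of the local node.
 * Peer access is typically reported for GPUs that can reach each
 * other over NVLink or a common PCIe switch. */
int main(void)
{
    int ndev = 0;
    cudaGetDeviceCount(&ndev);
    printf("GPUs on this node: %d\n", ndev);
    for (int i = 0; i < ndev; i++) {
        for (int j = 0; j < ndev; j++) {
            if (i == j)
                continue;
            int can = 0;
            cudaDeviceCanAccessPeer(&can, i, j);
            printf("GPU%d -> GPU%d: peer access %s\n", i, j, can ? "yes" : "no");
        }
    }
    return 0;
}
```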
ABCI Node / Rack Interconnect
[Diagram: each rack houses 17 CX400 chassis (34 CX2570 compute nodes) connected to 4 leaf switches (SB7890); the leaves connect to 3 full-bisection-bandwidth switches (FBB, SB7890) in the rack with IB-EDR x 72 (full bisection bandwidth within the rack) and to 2 spine switches (CS7500) with IB-EDR x 24 (1/3 over-subscription between racks); link legend: InfiniBand EDR x1, x6, and x4]
• Densely packaged racks: 34 compute nodes per rack
• Interconnect
  – Fat-tree topology
  – Intra-rack: full bisection bandwidth
  – Inter-rack: 1/3 over-subscription
  – Without adaptive routing
  – Without Mellanox SHARP
Grounds for the Interconnect Design
• 1/3 over-subscription network (24 inter-rack uplinks vs. 72 intra-rack links per rack)
  – A cost-effective solution
  – Many DL training applications do not use a large number of nodes
  – A high-bandwidth network is required by only a small set of nodes
• Without adaptive routing
  – InfiniBand is shared by compute and I/O traffic, and the delivery vendor suggested not using adaptive routing in such a configuration
• Without Mellanox SHARP
  – We use EDR and switches that do not support large messages in SHARP
Distribution of Number of Used GPUs/Nodes in DL Jobs
• Workload collected from the pre-ABCI system, AAIC
  – A system dedicated to AI research
• Single-GPU jobs are dominant and the degree of parallelism is low
• ABCI is designed as a capacity computing system for AI
AAIC scale: 32-38 nodes, 8 GPUs/node, 256-304 GPUs
[Charts: breakdown of single-GPU, single-node, and multi-node jobs; number of used nodes in multi-node jobs]
Workload data is published at: https://github.com/aistairc/aaic-workload
Performance Impact on MPI under 1/3 Over-subscription without Adaptive Routing
• Measured P2P host-memory to host-memory transfer performance using OpenMPI 2.1.6
• Intra-rack: 17 nodes to 17 nodes; inter-rack: 34 nodes to 34 nodes
• Increased the number of communicating node pairs (a simplified measurement sketch appears after the diagram below)
[Diagram: paired nodes, with each node in Group0 (Rack0) communicating with a partner node in Group1 (Rack1)]
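The following is a simplified sketch of this kind of pairwise P2P bandwidth test, not the actual benchmark behind the plots: each rank in the first group streams a large host buffer to a partner rank in the second group and reports per-pair throughput. Rank placement (one rank per node, the first half of the ranks in Group0) is an assumption of the sketch.

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Simplified pairwise P2P bandwidth test (illustration only).
 * Assumes an even number of ranks, one rank per node: rank r in the
 * first half streams a large host buffer to rank r + size/2 and
 * reports per-pair throughput. */
#define MSG_BYTES (1 << 26)   /* 64 MiB per message */
#define ITERS 20

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int half = size / 2;
    int peer = (rank < half) ? rank + half : rank - half;
    char *buf = malloc(MSG_BYTES);
    memset(buf, 0, MSG_BYTES);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int it = 0; it < ITERS; it++) {
        if (rank < half)
            MPI_Send(buf, MSG_BYTES, MPI_CHAR, peer, 0, MPI_COMM_WORLD);
        else
            MPI_Recv(buf, MSG_BYTES, MPI_CHAR, peer, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
    }
    double t1 = MPI_Wtime();

    if (rank < half)
        printf("pair %d-%d: %.2f GB/s\n", rank, peer,
               (double)MSG_BYTES * ITERS / (t1 - t0) / 1e9);

    free(buf);
    MPI_Finalize();
    return 0;
}
```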
Intra-Rack P2P Performance
[Charts: per-node average throughput and aggregated throughput (GB/s) vs. the number of intra-group connections, ideal vs. measured]
More than 80% of the theoretical performance is achieved.
Inter-Rack P2P Performance
[Charts: per-node average throughput and total rack throughput (GB/s) vs. the number of inter-rack connections, ideal vs. measured]
Large performance degradation occurs at 6 and 20 connections, and throughput is far from the ideal performance.
Collective Comm. Performance on ABCI
• Measured the performance of two collective communication patterns
  – Allreduce: essential in distributed training of DL
  – Reduce: the heaviest communication in our application
• Used OSU Micro-Benchmarks 5.6.1 (a minimal sketch of this kind of timing loop appears below)
  – NCCL Bench 2.0.0 is used for NCCL
  – Default parameters
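Roughly, what osu_allreduce reports is the average latency of the collective over a sweep of message sizes. The loop below is a minimal stand-in for that measurement, not the OSU code itself.

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

/* OSU-style timing loop for MPI_Allreduce: for each message size,
 * run the collective many times and report the average latency.
 * A simplified stand-in for osu_allreduce, not the benchmark itself. */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int iters = 100;
    const size_t max_bytes = 1 << 22;          /* up to 4 MiB */
    float *sbuf = calloc(max_bytes, 1);
    float *rbuf = calloc(max_bytes, 1);

    for (size_t bytes = sizeof(float); bytes <= max_bytes; bytes *= 2) {
        int count = bytes / sizeof(float);
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < iters; i++)
            MPI_Allreduce(sbuf, rbuf, count, MPI_FLOAT, MPI_SUM,
                          MPI_COMM_WORLD);
        double t1 = MPI_Wtime();
        if (rank == 0)
            printf("%8zu bytes : %8.2f us\n",
                   bytes, (t1 - t0) / iters * 1e6);
    }

    free(sbuf);
    free(rbuf);
    MPI_Finalize();
    return 0;
}
```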
Host to host transfer, MPI versions: MVAPICH2 2.3.1; IntelMPI 2018 Update 2; OpenMPI 3.1.3
GPU to GPU transfer, MPI and NCCL versions: MVAPICH2 2.3.1; MVAPICH2-GDR 2.3.1 with gdrcopy 1.2; OpenMPI 3.1.3; NCCL 2.4.7
Compiler: GCC 4.8.5
CUDA: 10.0.130
H2H AllReduce
Inter-Rack (68 Nodes) / Intra-Rack (34 Nodes)
• MVAPICH2 is comparable to IntelMPI
• Performance does not suffer from the 1/3 over-subscribed inter-rack bandwidth
H2H Reduce
Inter-Rack (68 Nodes) / Intra-Rack (34 Nodes)
MVAPICH2 is better than the others for small messages, but IntelMPI outperforms MVAPICH2 for large messages
D2D AllReduce
Inter-Rack (68 Nodes) / Intra-Rack (34 Nodes)
MVAPICH2-GDR and NCCL outperform the others for large messages
D2D Reduce
Inter-Rack (68 Nodes) / Intra-Rack (34 Nodes)
MVAPICH2-GDR achieves the best performance
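The D2D runs rely on CUDA-aware MPI: MVAPICH2-GDR (and CUDA-aware OpenMPI builds) accept GPU device pointers directly in collectives, so data moves GPU to GPU without an explicit host staging copy. A minimal usage sketch, assuming a CUDA-aware MPI build and four consecutively placed ranks per node, looks like this:

```c
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

/* D2D collective sketch: with a CUDA-aware MPI (e.g. MVAPICH2-GDR or a
 * CUDA-aware OpenMPI build), device pointers can be passed directly to
 * MPI collectives. Assumes 4 GPUs per node and 4 ranks per node placed
 * consecutively. */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int count = 1 << 20;                 /* 1M floats = 4 MiB */
    float *d_send, *d_recv;
    cudaSetDevice(rank % 4);                   /* assumption: 4 GPUs per node */
    cudaMalloc((void **)&d_send, count * sizeof(float));
    cudaMalloc((void **)&d_recv, count * sizeof(float));
    cudaMemset(d_send, 0, count * sizeof(float));

    /* Device buffers used directly; a non-CUDA-aware MPI would require
     * cudaMemcpy to host buffers before and after the call. */
    MPI_Allreduce(d_send, d_recv, count, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("allreduce on device buffers done\n");

    cudaFree(d_send);
    cudaFree(d_recv);
    MPI_Finalize();
    return 0;
}
```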
Application Performance: iFDK
• A high-speed, high-resolution CT image reconstruction framework running on GPU clusters
  – Creates 3D images from multiple 2D X-ray images
  – Demand for high-resolution CT images: non-invasive inspection, reverse engineering, etc.
• What we have done
  – A GPU kernel whose compute cost is 1/6 of the standard algorithm
  – Efficient FDK computation by overlapping CPU computation, GPU computation, and communication
  – A distributed framework for high-resolution image reconstruction
Peng Chen, Mohamed Wahib, Shinichiro Takizawa, Ryousei Takano, Satoshi Matsuoka. iFDK: A Scalable Framework for Instant High-resolution Image Reconstruction.
[Figure: CBCT geometry and trajectory, with a micro-focus X-ray source]
Will be presented at SC19
Distributed Framework for High-resolution Image Reconstruction
[Pipeline diagram: 2D projections (input) are loaded and filtered on CPUs, the filtered data is exchanged with AllGather, back-projection runs on GPUs, partial results are combined with Reduce, and the resulting 3D volume (output) is stored from CPUs]
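In MPI terms, the pattern above can be sketched as filter, allgather, back-project, reduce. The code below is only a schematic of that communication pattern: filter() and backproject() are hypothetical stubs, sizes are placeholders, and the real iFDK pipeline runs back-projection on GPUs and overlaps these stages.

```c
#include <mpi.h>
#include <stdlib.h>
#include <string.h>

/* Schematic of the communication pattern: each rank filters its share
 * of 2D projections, the filtered projections are allgathered, every
 * rank back-projects into a partial subvolume, and partial subvolumes
 * are summed onto the owning rank with MPI_Reduce. */

#define PROJ_PER_RANK 4
#define PROJ_PIXELS   (2048 * 2048)         /* one 2K x 2K projection */
#define SUBVOL_VOXELS (256 * 256 * 256)     /* one output subvolume   */

static void filter(float *projs, int n)     /* hypothetical stub */
{
    memset(projs, 0, (size_t)n * PROJ_PIXELS * sizeof(float));
}

static void backproject(const float *projs, int n, float *vol)  /* stub */
{
    (void)projs; (void)n;
    memset(vol, 0, SUBVOL_VOXELS * sizeof(float));
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int local_n = PROJ_PER_RANK * PROJ_PIXELS;   /* floats per rank */
    float *local_p  = malloc((size_t)local_n * sizeof(float));
    float *all_p    = malloc((size_t)local_n * size * sizeof(float));
    float *part_vol = malloc((size_t)SUBVOL_VOXELS * sizeof(float));
    float *sub_vol  = malloc((size_t)SUBVOL_VOXELS * sizeof(float));

    filter(local_p, PROJ_PER_RANK);                      /* on CPUs */

    /* every rank obtains all filtered projections */
    MPI_Allgather(local_p, local_n, MPI_FLOAT,
                  all_p, local_n, MPI_FLOAT, MPI_COMM_WORLD);

    backproject(all_p, PROJ_PER_RANK * size, part_vol);  /* on GPUs in iFDK */

    /* sum partial subvolumes onto the subvolume owner (rank 0 here) */
    MPI_Reduce(part_vol, sub_vol, SUBVOL_VOXELS, MPI_FLOAT, MPI_SUM,
               0, MPI_COMM_WORLD);

    free(local_p); free(all_p); free(part_vol); free(sub_vol);
    MPI_Finalize();
    return 0;
}
```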
Example Case
[Diagram: 4096 input projections (16 MB/image) are distributed across processes, allgathered, and reduced into 64 output subvolumes (vol 0 to vol 63, 8 GB/volume)]
Input: 4K (4096) 2K x 2K projection images; Output: 4K^3 volume; 32 nodes
Performance Result
• Large performance gap between MPI implementations
  – Mainly caused by Reduce
  – IntelMPI was used in the paper
• Next steps
  – Tune parameters for collectives
  – Use CUDA-aware MPI collectives
[Chart: time in seconds, broken down into Allgather, Reduce, IO, and Compute, for IntelMPI, MVAPICH2, and OpenMPI]
Input: 4K (4096) 2K x 2K projection images; Output: 4K^3 volume; 32 nodes
Summary
• Introduced ABCI, an open AI platform
  – Architecture, software stack, and latest achievements
  – Detailed network architecture
• MPI communication performance on ABCI
  – Effect of the 1/3 inter-rack over-subscription bandwidth on P2P and collective communication
  – An application performance example
• Future work
  – Evaluate the latest MVAPICH2 and MVAPICH2-GDR
  – Provide detailed communication performance results on ABCI to users
https://abci.ai/