Robustness of Interconnection
Networks
3rd JLESC Summer School
Atsushi Hori, RIKEN AICS
June 28, 2016
3rd JLESC SS@Lyon 2016
Self Introduction
• Atsushi Hori - System Software Researcher
• RIKEN - the oldest and largest governmental research institute in Japan, since 1917
• Advanced Institute for Computational Science (AICS), since 2010
  • Running the K computer, the largest supercomputer in Japan (ranked 5th in the Top500, June 2016)
  • Involved in the Flagship2020 project to develop the post-K computer
DISCLAIMER: The contents of this talk are based on my personal experiences and are independent from the Flagship2020 project.
The colored slides are supplements.
The venue of the next JLESC, in Dec., Kobe
HPC Network
• Low latency and high bandwidth
  • Higher performance than silicon disks
• High bisection bandwidth
  • Low congestion possibility (hopefully)
• Very reliable
  • No errors, no losses
• Dense (in a computer room)
  • unlike the Internet, which covers the whole earth
• Packet switching
  • not circuit switching (as in the old telephone network)
Outline
• Network Basics
  • Topology
  • Routing
  • Implementation
• Fault Resilience
  • + my personal opinion
Glossary
• A network consists of:
  • Nodes - where packets are sent and received; may include a switch (see below)
  • Switches (routers)
  • Links - connecting nodes and switches
• Data transfer:
  • Packet - a unit of transfer
  • Message - consists of multiple packets
Topology
Network Topologies (1)
• Mesh
• Torus
• FatTree
• "SkinnyTree"
[Figure: mesh, torus, fat-tree, and skinny-tree topologies, showing nodes, links, and switches]
Network Topology in Top500
• Topologies in Top500 (http://www.top500.org)
  • Torus/Mesh: BG/Q, the K (Tofu)
  • FatTree: Infiniband, Aries, Cray Gemini, Tianhe
  • SkinnyTree: Ethernet
  • Misc.: IBM Power 775
[Figure: topology vs. rank in the Top500 as of Nov. 2015 - legend: FatTree, Torus/Mesh, SkinnyTree, Misc.]
Network Topologies (2)
• Hypercube: CM-2, nCUBE in the 90s
• Dragonfly: Cray XC series
• and many others (ring, star, butterfly, to name a few)
[Figure: hypercube and dragonfly topologies, showing nodes, links, and switches]
Routing
Routing
• Find a path from a sender node to a receiver node
• Ex) X-Y (Dimension Order) Routing in a 2D mesh
[Figure: X-Y routing path from node Ni to node Nj in a 2D mesh]
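The X-Y rule above can be sketched in a few lines. This is an illustrative sketch, not any machine's actual routing table: coordinates and the hop-by-hop path are assumptions made for the example.

```python
# Minimal sketch of X-Y (dimension-order) routing on a 2D mesh.
def xy_route(src, dst):
    """Return the list of nodes visited from src to dst.

    A packet first travels along the X axis until its X coordinate
    matches the destination's, then along the Y axis. Because every
    packet follows this same rule, the routing is deterministic.
    """
    (x, y), (dx, dy) = src, dst
    path = [(x, y)]
    while x != dx:                  # correct X first
        x += 1 if dx > x else -1
        path.append((x, y))
    while y != dy:                  # then correct Y
        y += 1 if dy > y else -1
        path.append((x, y))
    return path

print(xy_route((0, 0), (3, 2)))
# visits (0,0) (1,0) (2,0) (3,0) (3,1) (3,2)
```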
Deadlock
• A routing algorithm on a network topology must be deadlock free
• A cyclic path can cause deadlock
• Deadlock can be avoided by having a bypass
  • Virtual channels
[Figure: 1 channel vs. 2 (virtual) channels between switches]
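The deadlock-freedom condition can be made concrete with a channel dependency graph (a routing is deadlock-free iff that graph is acyclic). The four-switch ring and the dateline placement below are toy assumptions, only meant to show how a second virtual channel breaks the cycle.

```python
# Toy channel-dependency-graph check for deadlock freedom.
def has_cycle(deps):
    """deps: dict channel -> set of channels it may wait for."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {c: WHITE for c in deps}

    def dfs(c):
        color[c] = GRAY
        for nxt in deps[c]:
            if color[nxt] == GRAY or (color[nxt] == WHITE and dfs(nxt)):
                return True
        color[c] = BLACK
        return False

    return any(color[c] == WHITE and dfs(c) for c in deps)

# One channel per link on a 4-switch ring: cyclic waiting -> deadlock.
ring = {0: {1}, 1: {2}, 2: {3}, 3: {0}}
print(has_cycle(ring))  # True

# Two virtual channels: packets move from VC "a" to VC "b" when they
# cross the "dateline" between switch 3 and switch 0, breaking the cycle.
vc = {"0a": {"1a"}, "1a": {"2a"}, "2a": {"3a"}, "3a": {"0b"},
      "0b": {"1b"}, "1b": {"2b"}, "2b": {"3b"}, "3b": set()}
print(has_cycle(vc))  # False
```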
Hot Spot
• Packet congestion happens
• 2D mesh: hot spot at the center
• 2D torus: no hot spots
[Figure: hot spot at the center of a 2D mesh]
Partitioning
• Multiple jobs can run on a big machine
• Node space is partitioned
• Partitioning may change the topology of a job
  • a 2D torus turns into a 2D mesh
• Jobs may interfere with each other
  • Jobs B, C and D can interfere with the others
[Figure: node space partitioned among Jobs A, B, C and D]
Dynamic (Adaptive) Routing
• Static routing
  • Once a path is fixed, packets go along that path
• Dynamic (adaptive) routing
  • Paths can be changed dynamically according to the state of the network
• Issues
  • Algorithm: how, who, when?
  • Deadlock freedom
  • Route-changing latency & H/W resources
  • Stability (see next slide)
  • Packet order is not preserved (see the slide after next)
Oscillation in Adaptive Routing
Two roads lead to the same destination:
1. One is very crowded
2. The radio says the other is empty
3. Everybody rushes onto the other road
4. (repeat 1-3)
Packet Order
• Adaptive routing cannot preserve packet ordering
• This can be problematic when receiving large messages consisting of multiple packets
[Figure: with Sending Order = Receiving Order, packets P0 P1 P2 P3 P4 P5 P6 P7 … fill Recvbuf 0 and Recvbuf 1 in order; with Sending Order ≠ Receiving Order, packets arrive as P0 P5 P3 P2 P4 P7 P1 P6 … and must be placed into the receive buffers out of order]
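Why out-of-order delivery complicates message reception can be shown with a small sketch. The sequence-number scheme and fixed packet size here are illustrative assumptions, not any real NIC's protocol: each packet carries a sequence number so the receiver can place its payload at the right offset, whatever order packets arrive in.

```python
# Illustrative reassembly of a multi-packet message delivered out of order.
PACKET_SIZE = 4  # bytes of payload per packet (assumed for the example)

def reassemble(packets, message_len):
    """packets: iterable of (seq_no, payload) pairs in arrival order."""
    buf = bytearray(message_len)
    received = set()
    for seq, payload in packets:
        # The sequence number tells us where the payload belongs.
        buf[seq * PACKET_SIZE:(seq + 1) * PACKET_SIZE] = payload
        received.add(seq)
    assert len(received) == message_len // PACKET_SIZE, "missing packets"
    return bytes(buf)

msg = b"ABCDEFGHIJKLMNOP"  # 16 bytes = 4 packets
arrival = [(0, msg[0:4]), (3, msg[12:16]), (1, msg[4:8]), (2, msg[8:12])]
print(reassemble(arrival, len(msg)) == msg)  # True despite out-of-order arrival
```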
Metrics
• Topology
  • Network diameter
  • High-radix or low-radix
    • The higher the radix, the smaller the network diameter
• Performance
  • Whole network: bisection bandwidth
  • P2P: bandwidth and latency, hop count
  • Collective operations (barrier, and so on): latency
Implementation
Installation of the K Computer
Direct/Indirect Network
• Direct network: every node has a switch inside
• Indirect network: a node has no switch

Machine/Network | Direct/Indirect
the K (Tofu)    | Direct
BG/Q            | Direct
Infiniband      | Indirect
Ethernet        | Indirect

Note: in many books, direct vs. indirect network is categorized as an aspect of topology
Level off Cable Lengths
• A naive implementation results in uneven cable lengths
Level off Cable Lengths
• To level off cable lengths, alternate nodes are connected
Co-Design
• Network Cost = Σ C + Σ S + Σ L
  • C: network interface (card) of a node
  • S: switch
  • L: cable
• Co-design
  • Analyze communication patterns of applications
  • Find protocols that maximize performance of the possible applications, while
    • minimizing network cost
    • minimizing power consumption
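The cost sum can be made concrete with a toy instance. All prices and component counts below are made-up placeholders (no real system's numbers), only to show how the three Σ terms trade off between an indirect network with external switches and a direct network with switches integrated into the nodes.

```python
# Toy instance of: Network Cost = sum(C) + sum(S) + sum(L)
def network_cost(n_nic, c_nic, n_switch, c_switch, n_cable, c_cable):
    """C: per-node network interface, S: switch, L: cable."""
    return n_nic * c_nic + n_switch * c_switch + n_cable * c_cable

nodes = 1024
# Hypothetical fat-tree: cheaper NICs, but external switches and more cables.
fat_tree = network_cost(nodes, 500, 96, 8000, 2048, 100)
# Hypothetical 3D torus: pricier NIC (switch on die), no external switches.
torus = network_cost(nodes, 900, 0, 0, 3 * nodes, 40)
print(fat_tree, torus)
```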
Fault Resilience
Fault Resilience
• The system and/or jobs can survive a network component failure
• Possible failure points
  • Link
  • Switch
  • Node
Link or Switch Failure
• Static routing
  • Somebody must change the routing information to bypass the failed part(s)
• Dynamic routing
  • If a failure can be detected, the failed part(s) can be bypassed automatically
  • Needless to say, it must remain deadlock free
Node Failure
• If the application has dynamic load balancing
  • The job stops using the failed node and rebalances the load
• If the application has static load balancing
  • Ex) stencil computation
  • Hard to rebalance the load => spare node
• Ex) 2D Jacobi iteration on a 2D array V(N,M):
  V'(i,j) = A * ( V(i-1,j) + V(i+1,j) + V(i,j-1) + V(i,j+1) )
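The Jacobi iteration above can be sketched directly. The coefficient A and the grid size are illustrative assumptions, and boundary values are held at zero for simplicity.

```python
# Minimal 5-point Jacobi sweep: V'(i,j) = A*(V(i-1,j)+V(i+1,j)+V(i,j-1)+V(i,j+1))
def jacobi_step(V, A=0.25):
    N, M = len(V), len(V[0])
    Vp = [[0.0] * M for _ in range(N)]
    for i in range(1, N - 1):
        for j in range(1, M - 1):
            Vp[i][j] = A * (V[i-1][j] + V[i+1][j] + V[i][j-1] + V[i][j+1])
    return Vp

# In a parallel run the 2D array is block-decomposed over a Cartesian
# process grid, and each rank exchanges a halo row/column with its four
# neighbors every step - which is why a failed node is hard to work
# around without a spare under static load balancing.
V = [[0.0] * 4 for _ in range(4)]
V[1][1] = 1.0
print(jacobi_step(V)[1][2])  # 0.25: the value spreads to a neighbor
```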
Spare Node Substitution
• Assume switches and links are all healthy
• A naive spare node substitution may result in a large number of packet collisions
  • Max. latency depends on # collisions
• Is there a way to avoid this situation?
[Figure: migrating a failed node's work to a spare node causes 4-5 possible packet collisions]
Spare Node
• Pros
  • Easy to program
  • Balanced load
• Cons
  • Lower node utilization
  • Additional packet collisions
[Figure: a 6x6 node grid with spare nodes; node 21 fails and is substituted by 0D, 1D, or 2D sliding]
Sliding Methods - Basic Idea
• Sliding methods
  • 0D - the naive method
  • 1D - up to 3 (worst case)
  • 2D - up to 2
  • and so on
• Hybrid sliding
  • Try the highest-degree method first
  • If it fails, try a lower-degree method
[Figure: a 6x6 node grid with spare nodes; node 21 fails and is substituted by 0D, 1D, or 2D sliding]
Repair
The K and FX100
Key technologies of the FX100 (Fujitsu slides, translated from Japanese):
• SPARC64™ XIfx
  • HPC-ACE2; L1 cache ways doubled
  • Enhanced superscalar execution: more out-of-order resources, improved branch prediction
  • 256-bit-wide SIMD: single-precision double-width mode, 8-byte integer instructions
  • Assistant cores: handle I/O, OS, and communication daemons
• Rack: 216 nodes per cabinet; CPU, memory, and optical modules directly water-cooled (90% water-cooling ratio)
• Chassis: 19-inch rack-mount, 12 nodes / 2U; Tofu2 between main units is optically connected
• CPU Memory Board: 3 CPUs, 3 x 8 Micron HMCs

The K Computer (2011): CPU with 8 cores + ICC (Tofu network); 1 system board = 4 CPUs and 4 ICCs (32 cores); 24 chassis (768 cores)
FX100 (2015): 32+2 cores + Tofu on one chip; 3 CPUs (nodes) per board; 4 system boards (384 cores); 18 chassis (6,912 cores)
http://accc.riken.jp/wp-content/uploads/2015/06/chiba.pdf
Fujitsu FX100
• A CPU memory board contains 3 nodes, with switches and links
• A Tofu unit consists of 12 nodes and is also a scheduling unit
• Tofu 6D coordinates: "XYZabc" (a=2, b=3, c=2)
  • The "XYZ" coordinates give the location of a Tofu unit
  • The "abc" coordinates give the location inside a Tofu unit
• A chassis contains 12 nodes
  • 3 chassis compose 3 Tofu units

From the white paper "FUJITSU Supercomputer PRIMEHPC FX100 - Evolution to the Next Generation" (http://www.fujitsu.com/global/products/computing/servers/supercomputer/primehpc-fx100/, https://www.fujitsu.com/global/Images/primehpc-fx100-hard-en.pdf):

Tofu Interconnect 2. The Tofu interconnect 2 (Tofu2) is an interconnect integrated into the SPARC64™ XIfx processor. Tofu2 enhances the bandwidth and functions of the Tofu interconnect (Tofu1) of the previous PRIMEHPC FX10 systems.

6D mesh/torus network. Tofu2 interconnects nodes to construct a system with a 6D mesh/torus network, like with Tofu1. The sizes of the three axes X, Y, and Z vary depending on the system configuration. The sizes of the other three axes, A, B, and C, are fixed at 2, 3, and 2, respectively. Each node has 10 ports. The network topology from the user's view is a virtual 1D/2D/3D torus. An arbitrary number of dimensions and size of each axis are specified by a user. The virtual torus space is mapped to the 6D mesh/torus network and reflected in the rank numbers. This virtual torus scheme improves the system fault tolerance and availability by enabling the region containing a failed node to be utilized as a torus.

High-speed 25 Gbps serial transmission. Each link of Tofu2 consists of four lanes of signals with a data transfer speed of 25.78125 Gbps and provides peak throughput of 12.5 GB/s. The link bandwidth is 2.5 times higher than that of Tofu1, which uses 8 lanes of 6.25 Gbps signals and provides 5 GB/s of throughput. Tofu2 connects 12 nodes in a PRIMEHPC FX100 main unit by electrical links, and inter-main unit links use optical transceiver modules because of a large transmission loss at 25 Gbps. Optical transceivers are even placed near the CPU to minimize the transmission loss. In contrast, Tofu1 does not use optical transceivers.

Optical-link dominant network. Twelve nodes in a main unit are connected in the configuration of (X, Y, Z, A, B, C) = (1, 1, 3, 2, 1, 2). The number of intra-main unit links is 20 (Figure 14). Therefore, 40 out of 120 ports are used for intra-main unit links, and the other 80 are used for inter-main unit links. For conventional HPC interconnects using 10 Gbps generation transmission technology, the ratio of optical links in the total network was up to one-third (A, B, and C in Figure 15). These interconnects partially used optical transmission and only used it to extend the wire length. In contrast, the ratio of optical links in Tofu2 is far beyond that of electrical links. Tofu2 is recognized as a next-generation HPC interconnect that mainly uses optical transmission.

[Figure 12: 6D mesh/torus topology model] [Figure 13: close placement of optical modules and CPU] [Figure 14: connection topology in a main unit]

From "The Tofu Interconnect" (IEEE Micro, Hot Interconnects) [2]:

The host bus connects a SPARC64 chip to the Tofu interconnect and PCI Express devices. Four SPARC64 chips on a board are interconnected with the A- and C-axis links. Three boards in a Tofu unit are interconnected with the B-axis links. Tofu units are interconnected with the X-, Y- and Z-axis links that form a 3D torus. The Z-axis links connect 17 Tofu units in two racks: 16 for computing nodes, and one for I/O nodes. The X-axis and Y-axis are expandable according to the number of columns and rows of racks. In the event of a single-board failure that decreases the B-axis's length, the 3D torus graph's embeddability is unaffected because the ABC 3D topology remains cubic.

One of the big challenges in building the K computer was system reliability. For instance, a mean time between failures (MTBF) of five years per node, assuming commodity processing nodes, would bring about two failures every hour in the 80,000-node system. We needed about one-hundredth of that failure rate. To minimize the failure rate, we integrated all active components of the Tofu interconnect into a single ICC chip, protected major data paths using error-correction code, and …

[Figure 1: a micrograph of the ICC chip, which integrates all active components of the Tofu interconnect: a Tofu network router (TNR), four Tofu network interfaces (TNIs), a Tofu barrier interface (TBI), a host bus interface, and two PCI Express root complexes]
[Figure 2: the ICC chip structure and interconnections along the six axes (a); a topological model of the A-, B-, and C-axes (b)]
The K and FX100
• The Tofu circuit is on the same CPU die; however, it can keep running while the CPU cores are shut down and powered off.
Various Units in FX100
• Unit of network: Tofu (12 nodes)
• Unit of scheduling: Tofu (so jobs do not interfere with other jobs)
• Physical unit: chassis - a chassis spans 3 Tofu units
• Replacement of a chassis
  • Affects 3 Tofu units (36 nodes, 1,152 cores)
  • Before the replacement, the jobs running on those 36 nodes must be aborted
  • During the replacement, the affected 36 nodes cannot accept jobs
  • Tofu is a direct network, so a replacement can affect the entire network because the XYZ connections for I/O are lost
• These 36 − Nf nodes are called the "apparent failure" in this talk (Nf: the number of actually failed nodes)
Repair Schedule
• The entire system must keep going as much as possible
• Replacement may cause more apparent failures as packaging density increases
• Replacement cannot take place as soon as a failure happens
  • The remedy for apparent failures is getting harder
  • The more frequent the system service, the higher the running cost
  • So repair is scheduled once a day, 2-3 times a week, once a week, and so on

The K's case: every morning, SEs replace the failed nodes
1. Shut down the chassis
2. Unplug the chassis
3. Replace the failed mother board
4. Plug the chassis back in
5. Reboot (K's nodes are disk-less)
Repair Interval
• The longer the repair interval, the larger the number of failed parts
• K: one node failure every 1-2 days
[Figure: number of failed components over time for short vs. long repair intervals - operation, repair time, apparent failure, and the average number of failed components]
Towards Exa-flops
• Higher failure rate
  • larger number of components
  • the end of Moore's law is close
• Longer time between repairs
  • to reduce running cost
  • denser packaging results in more apparent failures
  • larger impact on running jobs
➡ The system will always have one or more failed components
Network Resilience Towards Exa-scale and Beyond
My Personal Opinion
Failure will be Daily Life
• Assumption of current HPC: failure happens unexpectedly and rarely
• System design is based on particular rules and algorithms
  • Random failure breaks those rules and algorithms
• Node MTBF is already less than a day
• If failure happens daily, why don't we design HPC systems with failures in mind?
Failure Conscious Design
• Failure
  • happens randomly
  • # combinations is factorial!
• impossible to handle failures case by case
• impossible to predict the performance degradation due to failures
Stencil and Cartesian Topology
• The node failure problem in stencil computations, revisited
• The communication pattern of stencil computation fits a Cartesian topology very well
• When spare node substitutions take place, the fit is gone and performance degrades
[Figure: 5P-stencil communication performance degradation (# collisions: best, average, worst) over the number of failed nodes [7]]
Topology and Protocol
• Protocols of collective operations are optimized according to the topology
• If the conditions for H/W support are NOT met, a general protocol takes over
• Failures break those conditions
[Figure: slowdown over the number of node failures - K-Barrier, K-Allreduce, BGQ-Barrier, BGQ-Allreduce, BGQ-Barrier*, BGQ-Allreduce*]
Regular topology turns into random topology as the number of failed links increases
[Figure: a (full) dragonfly, then with 22/28 and 16/28 links surviving]
Regular topology turns into random topology as the number of failed links increases
[Figure: a (full) dragonfly, then with 22/28 and 16/28 links surviving - annotated with "qualitative change" and "quantitative change"]
Randomness may be an answer
• Can we rely on rules and algorithms which can be broken by failures?
  • Failures on regularity → qualitative change: hard to imagine
• What if we give up having such rules?
  • Failures on randomness → quantitative change: easier to imagine
• Let's start designing random systems from the beginning, and forget about failures in regular systems
➡ Random Topology
➡ Random Network
Random Topology (1)
• Goal: make a low-latency topology for HPC networks - low diameter and low average path hops
• Random topology is best [Koibuchi et al., ISCA 2012]
  • 100 times improvement
[Figure: average shortest path length [hops] vs. switch degree ≈ number of shortcuts in a 1,024-node network, for (a) non-random shortcuts and (b) random shortcuts]
Michihiro Koibuchi, http://research.nii.ac.jp/~koibuchi/pdf/hpca2013_slide.pdf
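A small experiment in the spirit of the random-shortcut result can be run in a few lines. The network size and shortcut budget below are illustrative assumptions, not those used by Koibuchi et al.: compare the average shortest-path hop count of a plain ring with a ring augmented by random shortcuts.

```python
# Average shortest-path hops: ring vs. ring + random shortcuts.
import random
from collections import deque

def avg_hops(n, edges):
    adj = [[] for _ in range(n)]
    for a, b in edges:
        adj[a].append(b)
        adj[b].append(a)
    total = 0
    for s in range(n):              # BFS from every node
        dist = [-1] * n
        dist[s] = 0
        q = deque([s])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if dist[v] < 0:
                    dist[v] = dist[u] + 1
                    q.append(v)
        total += sum(dist)
    return total / (n * (n - 1))

random.seed(0)
n = 64
ring = [(i, (i + 1) % n) for i in range(n)]
shortcuts = [tuple(random.sample(range(n), 2)) for _ in range(n)]
print(avg_hops(n, ring))              # about n/4 hops for a plain ring
print(avg_hops(n, ring + shortcuts))  # far fewer hops with shortcuts
```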
Random Topology (2)
Two approaches to quasi-randomness:
• Method A makes a non-random topology random
• Method B makes a random topology layout-friendly
[Figure: a randomness axis from low (not random) to high (fully random); Method A and Method B start from opposite ends and meet at the quasi-random topologies]
Michihiro Koibuchi, http://research.nii.ac.jp/~koibuchi/pdf/hpca2013_slide.pdf
Random Routing in Hypercube (Sid C-K Chau, https://www.cl.cam.ac.uk/teaching/1011/CompSysMod/RandBits_Lec2V2.pdf)
• For deterministic bit-fixing routing, the worst case requires at least 2^(n/2)/2 steps (exponential in n)
• But random bit-fixing routing requires O(n) steps with high probability (i.e., using more than O(n) steps has a vanishing probability converging to 0 as n → ∞)
• Random bit-fixing routing has two stages:
  1. Pick a random node r(i) in the hypercube independently, and use bit-fixing routing from i to r(i)
  2. Use bit-fixing routing from r(i) to d(i)
• Obviously, longer paths are needed for random bit-fixing routing. Then why is this better?
  • The intuition is that random routing can average out the worst-case configuration of deterministic routing
  • The probability that a randomly generated configuration is the worst case is very low, and vanishes for large n
  • This intuition is behind many randomized algorithms
[Figure: random bit-fixing routing on a 4D hypercube, and a two-stage configuration i → r(i) → d(i): 0000 → 0000 → 0000, 0001 → 0001 → 0100, 0010 → 1000 → 1000, 0011 → 0101 → 1100, 0100 → 0001 → 0001, 0101 → 1110 → 0101, 0111 → 1101 → 1101, 1110 → 0000 → 1011, 1111 → 1110 → 1111]
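The two-stage scheme above can be sketched directly. Node IDs are n-bit integers, and the low-to-high bit-correction order is an assumption made for the example (any fixed order works for bit-fixing).

```python
# Sketch of two-stage random bit-fixing (Valiant) routing on an
# n-dimensional hypercube.
import random

def bit_fix_path(src, dst, n):
    """Deterministic bit-fixing route: flip mismatched bits low-to-high."""
    path, cur = [src], src
    for b in range(n):
        if (cur ^ dst) & (1 << b):
            cur ^= 1 << b          # traverse the link along dimension b
            path.append(cur)
    return path

def random_bit_fix_path(src, dst, n, rng=random):
    """Stage 1: route src -> random intermediate r; stage 2: r -> dst."""
    r = rng.randrange(1 << n)
    return bit_fix_path(src, r, n) + bit_fix_path(r, dst, n)[1:]

n = 4
p = random_bit_fix_path(0b0000, 0b1111, n)
print(p[0], p[-1])  # starts at 0 and ends at 15, whatever r was chosen
```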
Dynamic Routing vs. Random Routing
• A switch has several routing candidates for a packet to go through
• Static routing: always choose a fixed one
• Dynamic routing: choose one according to the network status
• Random routing: choose one at random (does not have to be uniformly random)
Randomness in a Network
• Combination of regularity and randomness
  • Random topology: regular part + random part
    • Ex) ring + random shortcuts
  • Random (oblivious) routing (≠ Brownian motion)
    • random routing + regular routing
    • a node/switch on the way is randomly chosen
• What if a failure happens on the regular part?
  • The factorial nature can be relaxed
  • Ex) redundant links in the regular part of the topology

From Koibuchi's conclusions slide:
• Use of random shortcuts at HPC interconnects
  • Ring + random shortcuts is best
• Advantage of high-radix networks
  • Little variability of sampling and performance
• Random shortcut topology imposes no constraints on the number of switches and links
[Figure: the random shortcut topology (ring + random shortcuts) has up to 18% lower latency than a hypercube (non-random topology)]
My Last Word
"An eye for an eye, a tooth for a tooth"
→ Randomness for randomness
Randomness MAY save future supercomputers (not yet proven)

Thank you
References
1) High-radix router: "Microarchitecture of a High-Radix Router," John Kim, William J. Dally, et al., ISCA '05.
2) Tofu network: "The Tofu Interconnect," Yuichiro Ajima, et al., Hot Interconnects, 2012.
3) Dragonfly network: "Technology-Driven, Highly-Scalable Dragonfly Topology," John Kim, William J. Dally, et al., ISCA '08.
4) Routing algorithms: "A Survey and Evaluation of Topology-Agnostic Deterministic Routing Algorithms," J. Flich et al., IEEE Transactions on Parallel and Distributed Systems, vol. 23, no. 3, pp. 405-425, March 2012.
5) Shortest path finding algorithm (Dijkstra's algorithm): "A Note on Two Problems in Connexion with Graphs," E. W. Dijkstra, Numerische Mathematik, 1959.
6) Adaptive routing in Infiniband: "Fail-in-place Network Design: Interaction Between Topology, Routing Algorithm and Failures," J. Domke, T. Hoefler, and S. Matsuoka, SC '14, 2014.
7) Spare node substitution: "Sliding Substitution of Failed Nodes," Atsushi Hori, et al., Proceedings of the 22nd European MPI Users' Group Meeting, ACM, 2015.
8) Random algorithms including random routing: "Randomized Algorithms," Rajeev Motwani and Prabhakar Raghavan, Cambridge University Press, 1995.
9) Random network: "A Case for Random Shortcut Topologies for HPC Interconnects," Michihiro Koibuchi, et al., ISCA '12.
10) Another view on HPC network robustness: "Robustness Attributes of Interconnection Networks for Parallel Processing," Behrooz Parhami, keynote lecture, 1st Int'l Supercomputing Conf. (ISUM 2010), March 4, 2010. (https://www.ece.ucsb.edu/~parhami/pres_folder/parh10-isum-robustness-int-nets.ppt)