Robustness of Interconnection
Networks
3rd JLESC Summer School
Atsushi Hori, RIKEN AICS
June 28, 2016
3rd JLESC SS@Lyon 2016
Self Introduction
• Atsushi Hori - System Software Researcher
• RIKEN - the oldest and largest governmental research institute in Japan, since 1917
• Advanced Institute for Computational Science (AICS), since 2010
  • Running the K computer, the largest supercomputer in Japan (ranked 5th in the Top500, June 2016)
  • Involved in the Flagship2020 project to develop the post-K computer
DISCLAIMER: The contents of this talk are based on my personal experiences and are independent from the Flagship2020 project.
The colored slides are supplements.
The venue of the next JLESC, in Dec., Kobe
HPC Network
• Low latency and high bandwidth
  • Higher performance than silicon disks
• High bisection bandwidth
  • Low congestion possibility (hopefully)
• Very reliable
  • No errors, no losses
• Dense (in a computer room)
  • unlike the Internet, which covers the whole earth
• Packet switching
  • not circuit switching (as in the old telephone network)
Outline
• Network Basics
  • Topology
  • Routing
  • Implementation
• Fault Resilience
  • + my personal opinion
Glossary
• A network consists of:
  • Nodes - where packets are sent and received; may include a switch (see below)
  • Switches (routers)
  • Links - connecting nodes and switches
• Data transfer:
  • Packet - a unit of transfer
  • Message - consists of multiple packets
Topology
Network Topologies (1)
• Mesh
• Torus
• FatTree
• "SkinnyTree"
[Figure: mesh, torus, fat-tree, and skinny-tree topologies, showing nodes, links, and switches]
Network Topology in Top500
• Topologies in Top500 (http://www.top500.org)
  • Torus/Mesh: BG/Q, the K (Tofu)
  • FatTree: Infiniband, Aries, Cray Gemini, Tianhe
  • SkinnyTree: Ethernet
  • Misc.: IBM Power 775
[Figure: topology vs. rank in the Top500 as of Nov. 2015 - legend: FatTree, Torus/Mesh, SkinnyTree, Misc.]
Network Topologies (2)
• Hypercube: CM-2, nCUBE in the 90s
• Dragonfly: Cray XC series
• and many others (ring, star, butterfly, to name a few)
[Figure: hypercube and dragonfly topologies, showing nodes, links, and switches]
Routing
Routing
• Find a path from a sender node to a receiver node
• Ex) X-Y (Dimension Order) Routing in a 2D mesh
[Figure: X-Y routing path from node Ni to node Nj in a 2D mesh]
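The X-Y rule above can be sketched in a few lines. This is an illustrative sketch, not any machine's actual routing table: coordinates and the hop-by-hop path are assumptions made for the example.

```python
# Minimal sketch of X-Y (dimension-order) routing on a 2D mesh.
def xy_route(src, dst):
    """Return the list of nodes visited from src to dst.

    A packet first travels along the X axis until its X coordinate
    matches the destination's, then along the Y axis. Because every
    packet follows this same rule, the routing is deterministic.
    """
    (x, y), (dx, dy) = src, dst
    path = [(x, y)]
    while x != dx:                  # correct X first
        x += 1 if dx > x else -1
        path.append((x, y))
    while y != dy:                  # then correct Y
        y += 1 if dy > y else -1
        path.append((x, y))
    return path

print(xy_route((0, 0), (3, 2)))
# visits (0,0) (1,0) (2,0) (3,0) (3,1) (3,2)
```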
Deadlock
• A routing algorithm on a network topology must be deadlock free
• A cyclic path can cause deadlock
• Deadlock can be avoided by having a bypass
  • Virtual channels
[Figure: 1 channel vs. 2 (virtual) channels between switches]
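The deadlock-freedom condition can be made concrete with a channel dependency graph (a routing is deadlock-free iff that graph is acyclic). The four-switch ring and the dateline placement below are toy assumptions, only meant to show how a second virtual channel breaks the cycle.

```python
# Toy channel-dependency-graph check for deadlock freedom.
def has_cycle(deps):
    """deps: dict channel -> set of channels it may wait for."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {c: WHITE for c in deps}

    def dfs(c):
        color[c] = GRAY
        for nxt in deps[c]:
            if color[nxt] == GRAY or (color[nxt] == WHITE and dfs(nxt)):
                return True
        color[c] = BLACK
        return False

    return any(color[c] == WHITE and dfs(c) for c in deps)

# One channel per link on a 4-switch ring: cyclic waiting -> deadlock.
ring = {0: {1}, 1: {2}, 2: {3}, 3: {0}}
print(has_cycle(ring))  # True

# Two virtual channels: packets move from VC "a" to VC "b" when they
# cross the "dateline" between switch 3 and switch 0, breaking the cycle.
vc = {"0a": {"1a"}, "1a": {"2a"}, "2a": {"3a"}, "3a": {"0b"},
      "0b": {"1b"}, "1b": {"2b"}, "2b": {"3b"}, "3b": set()}
print(has_cycle(vc))  # False
```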
Hot Spot
• Packet congestion happens
• 2D mesh: hot spot at the center
• 2D torus: no hot spots
[Figure: hot spot at the center of a 2D mesh]
Partitioning
• Multiple jobs can run on a big machine
• Node space is partitioned
• Partitioning may change the topology of a job
  • a 2D torus turns into a 2D mesh
• Jobs may interfere with each other
  • Jobs B, C and D can interfere with the others
[Figure: node space partitioned among Jobs A, B, C and D]
Dynamic (Adaptive) Routing
• Static routing
  • Once a path is fixed, packets go along that path
• Dynamic (adaptive) routing
  • Paths can be changed dynamically according to the state of the network
• Issues
  • Algorithm: how, who, when?
  • Deadlock freedom
  • Route-changing latency & H/W resources
  • Stability (see next slide)
  • Packet order is not preserved (see the slide after next)
Oscillation in Adaptive Routing
Two roads lead to the same destination:
1. One is very crowded
2. The radio says the other is empty
3. Everybody rushes onto the other road
4. (repeat 1-3)
Packet Order
• Adaptive routing cannot preserve packet ordering
• This can be problematic when receiving large messages consisting of multiple packets
[Figure: with Sending Order = Receiving Order, packets P0 P1 P2 P3 P4 P5 P6 P7 … fill Recvbuf 0 and Recvbuf 1 in order; with Sending Order ≠ Receiving Order, packets arrive as P0 P5 P3 P2 P4 P7 P1 P6 … and must be placed into the receive buffers out of order]
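Why out-of-order delivery complicates message reception can be shown with a small sketch. The sequence-number scheme and fixed packet size here are illustrative assumptions, not any real NIC's protocol: each packet carries a sequence number so the receiver can place its payload at the right offset, whatever order packets arrive in.

```python
# Illustrative reassembly of a multi-packet message delivered out of order.
PACKET_SIZE = 4  # bytes of payload per packet (assumed for the example)

def reassemble(packets, message_len):
    """packets: iterable of (seq_no, payload) pairs in arrival order."""
    buf = bytearray(message_len)
    received = set()
    for seq, payload in packets:
        # The sequence number tells us where the payload belongs.
        buf[seq * PACKET_SIZE:(seq + 1) * PACKET_SIZE] = payload
        received.add(seq)
    assert len(received) == message_len // PACKET_SIZE, "missing packets"
    return bytes(buf)

msg = b"ABCDEFGHIJKLMNOP"  # 16 bytes = 4 packets
arrival = [(0, msg[0:4]), (3, msg[12:16]), (1, msg[4:8]), (2, msg[8:12])]
print(reassemble(arrival, len(msg)) == msg)  # True despite out-of-order arrival
```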
Metrics
• Topology
  • Network diameter
  • High-radix or low-radix
    • The higher the radix, the smaller the network diameter
• Performance
  • Whole network: bisection bandwidth
  • P2P: bandwidth and latency, hop count
  • Collective operations (barrier, and so on): latency
Implementation
Installation of the K Computer
Direct/Indirect Network
• Direct network: every node has a switch inside
• Indirect network: a node has no switch

Machine/Network | Direct/Indirect
the K (Tofu)    | Direct
BG/Q            | Direct
Infiniband      | Indirect
Ethernet        | Indirect

Note: in many books, direct vs. indirect network is categorized as an aspect of topology
Level off Cable Lengths
• A naive implementation results in uneven cable lengths
Level off Cable Lengths
• To level off cable lengths, alternate nodes are connected
Co-Design
• Network Cost = Σ C + Σ S + Σ L
  • C: network interface (card) of a node
  • S: switch
  • L: cable
• Co-design
  • Analyze communication patterns of applications
  • Find protocols that maximize performance of the possible applications, while
    • minimizing network cost
    • minimizing power consumption
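The cost sum can be made concrete with a toy instance. All prices and component counts below are made-up placeholders (no real system's numbers), only to show how the three Σ terms trade off between an indirect network with external switches and a direct network with switches integrated into the nodes.

```python
# Toy instance of: Network Cost = sum(C) + sum(S) + sum(L)
def network_cost(n_nic, c_nic, n_switch, c_switch, n_cable, c_cable):
    """C: per-node network interface, S: switch, L: cable."""
    return n_nic * c_nic + n_switch * c_switch + n_cable * c_cable

nodes = 1024
# Hypothetical fat-tree: cheaper NICs, but external switches and more cables.
fat_tree = network_cost(nodes, 500, 96, 8000, 2048, 100)
# Hypothetical 3D torus: pricier NIC (switch on die), no external switches.
torus = network_cost(nodes, 900, 0, 0, 3 * nodes, 40)
print(fat_tree, torus)
```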
Fault Resilience
Fault Resilience
• The system and/or jobs can survive a network component failure
• Possible failure points
  • Link
  • Switch
  • Node
Link or Switch Failure
• Static routing
  • Somebody must change the routing information to bypass the failed part(s)
• Dynamic routing
  • If a failure can be detected, the failed part(s) can be bypassed automatically
  • Needless to say, it must remain deadlock free
Node Failure
• If the application has dynamic load balancing
  • The job stops using the failed node and rebalances the load
• If the application has static load balancing
  • Ex) stencil computation
  • Hard to rebalance the load => spare node
• Ex) 2D Jacobi iteration on a 2D array V(N,M):
  V'(i,j) = A * ( V(i-1,j) + V(i+1,j) + V(i,j-1) + V(i,j+1) )
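The Jacobi iteration above can be sketched directly. The coefficient A and the grid size are illustrative assumptions, and boundary values are held at zero for simplicity.

```python
# Minimal 5-point Jacobi sweep: V'(i,j) = A*(V(i-1,j)+V(i+1,j)+V(i,j-1)+V(i,j+1))
def jacobi_step(V, A=0.25):
    N, M = len(V), len(V[0])
    Vp = [[0.0] * M for _ in range(N)]
    for i in range(1, N - 1):
        for j in range(1, M - 1):
            Vp[i][j] = A * (V[i-1][j] + V[i+1][j] + V[i][j-1] + V[i][j+1])
    return Vp

# In a parallel run the 2D array is block-decomposed over a Cartesian
# process grid, and each rank exchanges a halo row/column with its four
# neighbors every step - which is why a failed node is hard to work
# around without a spare under static load balancing.
V = [[0.0] * 4 for _ in range(4)]
V[1][1] = 1.0
print(jacobi_step(V)[1][2])  # 0.25: the value spreads to a neighbor
```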
Spare Node Substitution
• Assume switches and links are all healthy
• A naive spare node substitution may result in a large number of packet collisions
  • Max. latency depends on # collisions
• Is there a way to avoid this situation?
[Figure: migrating a failed node's work to a spare node causes 4-5 possible packet collisions]
Spare Node
• Pros
  • Easy to program
  • Balanced load
• Cons
  • Lower node utilization
  • Additional packet collisions
[Figure: a 6x6 node grid with spare nodes; node 21 fails and is substituted by 0D, 1D, or 2D sliding]
Sliding Methods - Basic Idea
• Sliding methods
  • 0D - the naive method
  • 1D - up to 3 (worst case)
  • 2D - up to 2
  • and so on
• Hybrid sliding
  • Try the highest-degree method first
  • If it fails, try a lower-degree method
[Figure: a 6x6 node grid with spare nodes; node 21 fails and is substituted by 0D, 1D, or 2D sliding]
Repair
The K and FX100
Key technologies of the FX100 (Fujitsu slides, translated from Japanese):
• SPARC64™ XIfx
  • HPC-ACE2; L1 cache ways doubled
  • Enhanced superscalar execution: more out-of-order resources, improved branch prediction
  • 256-bit-wide SIMD: single-precision double-width mode, 8-byte integer instructions
  • Assistant cores: handle I/O, OS, and communication daemons
• Rack: 216 nodes per cabinet; CPU, memory, and optical modules directly water-cooled (90% water-cooling ratio)
• Chassis: 19-inch rack-mount, 12 nodes / 2U; Tofu2 between main units is optically connected
• CPU Memory Board: 3 CPUs, 3 x 8 Micron HMCs

The K Computer (2011): CPU with 8 cores + ICC (Tofu network); 1 system board = 4 CPUs and 4 ICCs (32 cores); 24 chassis (768 cores)
FX100 (2015): 32+2 cores + Tofu on one chip; 3 CPUs (nodes) per board; 4 system boards (384 cores); 18 chassis (6,912 cores)
http://accc.riken.jp/wp-content/uploads/2015/06/chiba.pdf
Fujitsu FX100
• A CPU memory board contains 3 nodes, with switches and links
• A Tofu unit consists of 12 nodes and is also a scheduling unit
• Tofu 6D coordinates: "XYZabc" (a=2, b=3, c=2)
  • The "XYZ" coordinates give the location of a Tofu unit
  • The "abc" coordinates give the location inside a Tofu unit
• A chassis contains 12 nodes
  • 3 chassis compose 3 Tofu units

From the white paper "FUJITSU Supercomputer PRIMEHPC FX100 - Evolution to the Next Generation" (http://www.fujitsu.com/global/products/computing/servers/supercomputer/primehpc-fx100/, https://www.fujitsu.com/global/Images/primehpc-fx100-hard-en.pdf):

Tofu Interconnect 2. The Tofu interconnect 2 (Tofu2) is an interconnect integrated into the SPARC64™ XIfx processor. Tofu2 enhances the bandwidth and functions of the Tofu interconnect (Tofu1) of the previous PRIMEHPC FX10 systems.

6D mesh/torus network. Tofu2 interconnects nodes to construct a system with a 6D mesh/torus network, like with Tofu1. The sizes of the three axes X, Y, and Z vary depending on the system configuration. The sizes of the other three axes, A, B, and C, are fixed at 2, 3, and 2, respectively. Each node has 10 ports. The network topology from the user's view is a virtual 1D/2D/3D torus. An arbitrary number of dimensions and size of each axis are specified by a user. The virtual torus space is mapped to the 6D mesh/torus network and reflected in the rank numbers. This virtual torus scheme improves the system fault tolerance and availability by enabling the region containing a failed node to be utilized as a torus.

High-speed 25 Gbps serial transmission. Each link of Tofu2 consists of four lanes of signals with a data transfer speed of 25.78125 Gbps and provides peak throughput of 12.5 GB/s. The link bandwidth is 2.5 times higher than that of Tofu1, which uses 8 lanes of 6.25 Gbps signals and provides 5 GB/s of throughput. Tofu2 connects 12 nodes in a PRIMEHPC FX100 main unit by electrical links, and inter-main unit links use optical transceiver modules because of a large transmission loss at 25 Gbps. Optical transceivers are even placed near the CPU to minimize the transmission loss. In contrast, Tofu1 does not use optical transceivers.

Optical-link dominant network. Twelve nodes in a main unit are connected in the configuration of (X, Y, Z, A, B, C) = (1, 1, 3, 2, 1, 2). The number of intra-main unit links is 20 (Figure 14). Therefore, 40 out of 120 ports are used for intra-main unit links, and the other 80 are used for inter-main unit links. For conventional HPC interconnects using 10 Gbps generation transmission technology, the ratio of optical links in the total network was up to one-third (A, B, and C in Figure 15). These interconnects partially used optical transmission and only used it to extend the wire length. In contrast, the ratio of optical links in Tofu2 is far beyond that of electrical links. Tofu2 is recognized as a next-generation HPC interconnect that mainly uses optical transmission.

[Figure 12: 6D mesh/torus topology model] [Figure 13: close placement of optical modules and CPU] [Figure 14: connection topology in a main unit]

From "The Tofu Interconnect" (IEEE Micro, Hot Interconnects) [2]:

The host bus connects a SPARC64 chip to the Tofu interconnect and PCI Express devices. Four SPARC64 chips on a board are interconnected with the A- and C-axis links. Three boards in a Tofu unit are interconnected with the B-axis links. Tofu units are interconnected with the X-, Y- and Z-axis links that form a 3D torus. The Z-axis links connect 17 Tofu units in two racks: 16 for computing nodes, and one for I/O nodes. The X-axis and Y-axis are expandable according to the number of columns and rows of racks. In the event of a single-board failure that decreases the B-axis's length, the 3D torus graph's embeddability is unaffected because the ABC 3D topology remains cubic.

One of the big challenges in building the K computer was system reliability. For instance, a mean time between failures (MTBF) of five years per node, assuming commodity processing nodes, would bring about two failures every hour in the 80,000-node system. We needed about one-hundredth of that failure rate. To minimize the failure rate, we integrated all active components of the Tofu interconnect into a single ICC chip, protected major data paths using error-correction code, and …

[Figure 1: a micrograph of the ICC chip, which integrates all active components of the Tofu interconnect: a Tofu network router (TNR), four Tofu network interfaces (TNIs), a Tofu barrier interface (TBI), a host bus interface, and two PCI Express root complexes]
[Figure 2: the ICC chip structure and interconnections along the six axes (a); a topological model of the A-, B-, and C-axes (b)]
The K and FX100
• The Tofu circuit is on the same CPU die; however, it can keep running while the CPU cores are shut down and powered off.
Various Units in FX100
• Unit of network: Tofu (12 nodes)
• Unit of scheduling: Tofu (so jobs do not interfere with other jobs)
• Physical unit: chassis - a chassis spans 3 Tofu units
• Replacement of a chassis
  • Affects 3 Tofu units (36 nodes, 1,152 cores)
  • Before the replacement, the jobs running on those 36 nodes must be aborted
  • During the replacement, the affected 36 nodes cannot accept jobs
  • Tofu is a direct network, so a replacement can affect the entire network because the XYZ connections for I/O are lost
• These 36 − Nf nodes are called the "apparent failure" in this talk (Nf: the number of actually failed nodes)
Repair Schedule
• The entire system must keep going as much as possible
• Replacement may cause more apparent failures as packaging density increases
• Replacement cannot take place as soon as a failure happens
  • The remedy for apparent failures is getting harder
  • The more frequent the system service, the higher the running cost
  • So repair is scheduled once a day, 2-3 times a week, once a week, and so on

The K's case: every morning, SEs replace the failed nodes
1. Shut down the chassis
2. Unplug the chassis
3. Replace the failed mother board
4. Plug the chassis back in
5. Reboot (K's nodes are disk-less)
Repair Interval
• The longer the repair interval, the larger the number of failed parts
• K: one node failure every 1-2 days
[Figure: number of failed components over time for short vs. long repair intervals - operation, repair time, apparent failure, and the average number of failed components]
Towards Exa-flops
• Higher failure rate
  • larger number of components
  • the end of Moore's law is close
• Longer time between repairs
  • to reduce running cost
  • denser packaging results in more apparent failures
  • larger impact on running jobs
➡ The system will always have one or more failed components
Network Resilience Towards Exa-scale and Beyond
My Personal Opinion
Failure will be Daily Life
• Assumption of current HPC: failure happens unexpectedly and rarely
• System design is based on particular rules and algorithms
  • Random failure breaks those rules and algorithms
• Node MTBF is already less than a day
• If failure happens daily, why don't we design HPC systems with failures in mind?
Failure Conscious Design
• Failure
  • happens randomly
  • # combinations is factorial!
• impossible to handle failures case by case
• impossible to predict the performance degradation due to failures
Stencil and Cartesian Topology
• The node failure problem in stencil computations, revisited
• The communication pattern of stencil computation fits a Cartesian topology very well
• When spare node substitutions take place, the fit is gone and performance degrades
[Figure: 5P-stencil communication performance degradation (# collisions: best, average, worst) over the number of failed nodes [7]]
Topology and Protocol
• Protocols of collective operations are optimized according to the topology
• If the conditions for H/W support are NOT met, a general protocol takes over
• Failures break those conditions
[Figure: slowdown over the number of node failures - K-Barrier, K-Allreduce, BGQ-Barrier, BGQ-Allreduce, BGQ-Barrier*, BGQ-Allreduce*]
Regular topology turns into random topology as the number of failed links increases
[Figure: a (full) dragonfly, then with 22/28 and 16/28 links surviving]
Regular topology turns into random topology as the number of failed links increases
[Figure: a (full) dragonfly, then with 22/28 and 16/28 links surviving - annotated with "qualitative change" and "quantitative change"]
Randomness may be an answer
• Can we rely on rules and algorithms which can be broken by failures?
  • Failures on regularity → qualitative change: hard to imagine
• What if we give up having such rules?
  • Failures on randomness → quantitative change: easier to imagine
• Let's start designing random systems from the beginning, and forget about failures in regular systems
➡ Random Topology
➡ Random Network
Random Topology (1)
• Goal: make a low-latency topology for HPC networks - low diameter and low average path hops
• Random topology is best [Koibuchi et al., ISCA 2012]
  • 100 times improvement
[Figure: average shortest path length [hops] vs. switch degree ≈ number of shortcuts in a 1,024-node network, for (a) non-random shortcuts and (b) random shortcuts]
Michihiro Koibuchi, http://research.nii.ac.jp/~koibuchi/pdf/hpca2013_slide.pdf
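A small experiment in the spirit of the random-shortcut result can be run in a few lines. The network size and shortcut budget below are illustrative assumptions, not those used by Koibuchi et al.: compare the average shortest-path hop count of a plain ring with a ring augmented by random shortcuts.

```python
# Average shortest-path hops: ring vs. ring + random shortcuts.
import random
from collections import deque

def avg_hops(n, edges):
    adj = [[] for _ in range(n)]
    for a, b in edges:
        adj[a].append(b)
        adj[b].append(a)
    total = 0
    for s in range(n):              # BFS from every node
        dist = [-1] * n
        dist[s] = 0
        q = deque([s])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if dist[v] < 0:
                    dist[v] = dist[u] + 1
                    q.append(v)
        total += sum(dist)
    return total / (n * (n - 1))

random.seed(0)
n = 64
ring = [(i, (i + 1) % n) for i in range(n)]
shortcuts = [tuple(random.sample(range(n), 2)) for _ in range(n)]
print(avg_hops(n, ring))              # about n/4 hops for a plain ring
print(avg_hops(n, ring + shortcuts))  # far fewer hops with shortcuts
```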
Random Topology (2)
Two approaches to quasi-randomness:
• Method A makes a non-random topology random
• Method B makes a random topology layout-friendly
[Figure: a randomness axis from low (not random) to high (fully random); Method A and Method B start from opposite ends and meet at the quasi-random topologies]
Michihiro Koibuchi, http://research.nii.ac.jp/~koibuchi/pdf/hpca2013_slide.pdf
Random Routing in Hypercube (Sid C-K Chau, https://www.cl.cam.ac.uk/teaching/1011/CompSysMod/RandBits_Lec2V2.pdf)
• For deterministic bit-fixing routing, the worst case requires at least 2^(n/2)/2 steps (exponential in n)
• But random bit-fixing routing requires O(n) steps with high probability (i.e., using more than O(n) steps has a vanishing probability converging to 0 as n → ∞)
• Random bit-fixing routing has two stages:
  1. Pick a random node r(i) in the hypercube independently, and use bit-fixing routing from i to r(i)
  2. Use bit-fixing routing from r(i) to d(i)
• Obviously, longer paths are needed for random bit-fixing routing. Then why is this better?
  • The intuition is that random routing can average out the worst-case configuration of deterministic routing
  • The probability that a randomly generated configuration is the worst case is very low, and vanishes for large n
  • This intuition is behind many randomized algorithms
[Figure: random bit-fixing routing on a 4D hypercube, and a two-stage configuration i → r(i) → d(i): 0000 → 0000 → 0000, 0001 → 0001 → 0100, 0010 → 1000 → 1000, 0011 → 0101 → 1100, 0100 → 0001 → 0001, 0101 → 1110 → 0101, 0111 → 1101 → 1101, 1110 → 0000 → 1011, 1111 → 1110 → 1111]
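The two-stage scheme above can be sketched directly. Node IDs are n-bit integers, and the low-to-high bit-correction order is an assumption made for the example (any fixed order works for bit-fixing).

```python
# Sketch of two-stage random bit-fixing (Valiant) routing on an
# n-dimensional hypercube.
import random

def bit_fix_path(src, dst, n):
    """Deterministic bit-fixing route: flip mismatched bits low-to-high."""
    path, cur = [src], src
    for b in range(n):
        if (cur ^ dst) & (1 << b):
            cur ^= 1 << b          # traverse the link along dimension b
            path.append(cur)
    return path

def random_bit_fix_path(src, dst, n, rng=random):
    """Stage 1: route src -> random intermediate r; stage 2: r -> dst."""
    r = rng.randrange(1 << n)
    return bit_fix_path(src, r, n) + bit_fix_path(r, dst, n)[1:]

n = 4
p = random_bit_fix_path(0b0000, 0b1111, n)
print(p[0], p[-1])  # starts at 0 and ends at 15, whatever r was chosen
```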
Dynamic Routing vs. Random Routing
• A switch has several routing candidates for a packet to go through
• Static routing: always choose a fixed one
• Dynamic routing: choose one according to the network status
• Random routing: choose one at random (does not have to be uniformly random)
Randomness in a Network
• Combination of regularity and randomness
  • Random topology: regular part + random part
    • Ex) ring + random shortcuts
  • Random (oblivious) routing (≠ Brownian motion)
    • random routing + regular routing
    • a node/switch on the way is randomly chosen
• What if a failure happens on the regular part?
  • The factorial nature can be relaxed
  • Ex) redundant links in the regular part of the topology

From Koibuchi's conclusions slide:
• Use of random shortcuts at HPC interconnects
  • Ring + random shortcuts is best
• Advantage of high-radix networks
  • Little variability of sampling and performance
• Random shortcut topology imposes no constraints on the number of switches and links
[Figure: the random shortcut topology (ring + random shortcuts) has up to 18% lower latency than a hypercube (non-random topology)]
My Last Word
"An eye for an eye, a tooth for a tooth"
→ Randomness for randomness
Randomness MAY save future supercomputers (not yet proven)

Thank you
References
1) High-radix router: "Microarchitecture of a High-Radix Router," John Kim, William J. Dally, et al., ISCA '05.
2) Tofu network: "The Tofu Interconnect," Yuichiro Ajima, et al., Hot Interconnects, 2012.
3) Dragonfly network: "Technology-Driven, Highly-Scalable Dragonfly Topology," John Kim, William J. Dally, et al., ISCA '08.
4) Routing algorithms: "A Survey and Evaluation of Topology-Agnostic Deterministic Routing Algorithms," J. Flich et al., IEEE Transactions on Parallel and Distributed Systems, vol. 23, no. 3, pp. 405-425, March 2012.
5) Shortest path finding algorithm (Dijkstra's algorithm): "A Note on Two Problems in Connexion with Graphs," E. W. Dijkstra, Numerische Mathematik, 1959.
6) Adaptive routing in Infiniband: "Fail-in-place Network Design: Interaction Between Topology, Routing Algorithm and Failures," J. Domke, T. Hoefler, and S. Matsuoka, SC '14, 2014.
7) Spare node substitution: "Sliding Substitution of Failed Nodes," Atsushi Hori, et al., Proceedings of the 22nd European MPI Users' Group Meeting, ACM, 2015.
8) Random algorithms including random routing: "Randomized Algorithms," Rajeev Motwani and Prabhakar Raghavan, Cambridge University Press, 1995.
9) Random network: "A Case for Random Shortcut Topologies for HPC Interconnects," Michihiro Koibuchi, et al., ISCA '12.
10) Another view on HPC network robustness: "Robustness Attributes of Interconnection Networks for Parallel Processing," Behrooz Parhami, keynote lecture, 1st Int'l Supercomputing Conf. (ISUM 2010), March 4, 2010. (https://www.ece.ucsb.edu/~parhami/pres_folder/parh10-isum-robustness-int-nets.ppt)