Design Exploration of Multi-tier Interconnection Networks for Exascale Systems

Javier Navaridas
[email protected]
University of Manchester
Manchester, UK

Josh Lant
[email protected]
University of Manchester
Manchester, UK

Jose A. Pascual
[email protected]
Univ. of the Basque Country
Donostia, Spain

Mikel Luján
[email protected]
University of Manchester
Manchester, UK

John Goodacre
[email protected]
University of Manchester
Manchester, UK

ABSTRACT
Interconnection networks are one of the main limiting factors when it comes to scaling out computing systems. In this paper, we explore the role that the hybridization of topologies can play in the design of a state-of-the-art exascale-capable computing system. More precisely, we compare several hybrid topologies against common single-topology networks when dealing with large-scale, application-like traffic. In addition, we explore how different aspects of the hybrid topology affect the overall performance of the system. In particular, we found that hybrid topologies can outperform state-of-the-art torus and fattree networks as long as the density of connections is high enough (one connection every two or four nodes seems to be the sweet spot) and the size of the subtori is limited to a few nodes per dimension. Moreover, we explored two different alternatives for the upper tiers of the interconnect, a fattree and a generalised hypercube, and found little difference between them, with the better choice depending mostly on the workload to be executed.

CCS CONCEPTS
• Computer systems organization → Interconnection architectures; Reconfigurable computing; • Networks → Network architectures; • Mathematics of computing → Graph theory.

KEYWORDS
Interconnection Networks, Multi-tier Topologies, Exascale Systems, Performance Evaluation, Simulation

ACM Reference Format:
Javier Navaridas, Josh Lant, Jose A. Pascual, Mikel Luján, and John Goodacre. 2019. Design Exploration of Multi-tier Interconnection Networks for Exascale Systems. In 48th International Conference on Parallel Processing (ICPP 2019), August 5–8, 2019, Kyoto, Japan. ACM, New York, NY, USA, 10 pages. https://doi.org/10.1145/3337821.3337903

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
ICPP 2019, August 5–8, 2019, Kyoto, Japan
© 2019 Association for Computing Machinery.
ACM ISBN 978-1-4503-6295-5/19/08...$15.00
https://doi.org/10.1145/3337821.3337903

1 INTRODUCTION
Our society has come to depend on information and computer systems as the platforms where we carry out most of our activities. This has both driven forward the development of a plethora of IT technologies and motivated the construction of increasingly larger computing facilities. Indeed, we are relentlessly approaching the ExaScale Milestone, in which a single computer system will be able to execute a mind-blowing 10^18 instructions per second. This magnificent computing power is required both in business and scientific contexts. For instance, in the context of business-centric computing, companies require increasingly higher computing power to support operations such as mining data from service records, offering on-line services and supporting increasingly large amounts of data. If we look at the world's largest companies, it is speculated that they have hundreds of thousands of servers to sustain their infrastructure. As examples of well-known companies, Google may have around one million servers scattered among 13 datacentres worldwide, whereas Amazon may have roughly half a million servers in 7 datacentres around the world.

We can find an analogous trend within the scientific community, with systems of similar sizes and an ever-increasing appetite for more computing power. In this context, applications have typically used number-crunching approaches, e.g. computer simulations of various natures (molecular dynamics, finite elements or weather modelling, to cite a few), which are carried out using increasingly finer-grain models to improve accuracy, which in turn requires greater computing power. However, in recent years, the advent of data analytics technologies has opened new avenues for scientific research. The highest exponent of data analytics in science is the Large Hadron Collider at CERN, which generates data at a stunning rate of 50 Petabytes per year. In order to analyse all the generated data, a Grid-like system with over 150 computing centres all over the world is used (see the Worldwide LHC Computing Grid website¹). This data generation rate will be dwarfed by the Square Kilometre Array project, which is expected to generate a mind-blowing amount of data exceeding an Exabyte per day once it is built by 2020 (see the Square Kilometre Array website²). At any rate, the advent of data analytics within the scientific community has motivated the convergence of datacentre and HPC architectures.

¹ Available at: http://wlcg.web.cern.ch/
² Available at: https://www.skatelescope.org/



In all these kinds of systems, the interconnection network (IN, hereafter) – a specific-purpose network that allows compute nodes to interchange messages with high throughput and low latency – is a key element. This is especially true for large-scale computing platforms because its performance has a definite impact on the overall execution time of applications, particularly for those that are fine-grained and communication intensive. Indeed, INs are widely acknowledged (e.g., [1, 13, 24, 27]) to be one of the limiting factors for scaling up computing systems, essentially because the communication and synchronisation penalties suffered by applications increase with the size of the system. Current trends show that the number of nodes used in datacentre networks or supercomputers can be in the hundreds of thousands [2, 8, 16, 17], and these numbers are expected to grow into the millions in the next decade, as anticipated by [24].

In this paper, we argue that, with the uncontrollable growth of networks in terms of endpoints, standard topologies such as the ones in use in most systems today will not be able to cope with such systems, either because of a lack of scalability or of practicality. Therefore, we will need to look for new network arrangements that simplify the design and deployment of systems without sacrificing their performance. Our objective is to analyse the suitability of hybrid topologies for large-scale systems as well as to understand the trade-offs involved in the design of such networks. We then study possible alternatives for a large-scale system based around the ExaNeSt technologies and some topologies of interest. In particular, we want to explore the different parameters affecting the hybridization of topologies so as to find the best trade-off when nesting the rigidity of the lower levels (backplane-connected torus) with the flexibility offered by our FPGA-based routers [9] in the higher levels. This flexibility would allow us to adopt different topologies, but for simplicity we will stick to the well-known fattree topology. For completeness, and given that we have several spare uplinks, we will also explore the suitability of the generalized hypercube topology for our workloads of interest.

With this in mind, we investigate where the sweet spot is when it comes to separating the different levels. In other words, we want to know to what extent we can expand the torus topology without severely affecting performance. In addition, we look into how dense the connectivity to the upper levels of the network needs to be in order to sustain adequate performance, i.e., what proportion of our Quad-FPGA daughterboards (QFDBs) should have their uplink active. This is important because it will have a definite impact on the amount of hardware required to implement the main interconnect and, in turn, on cost and power.

To study these aspects, we start our research with an analysis of the topologies, where we show the number of network components required by each configuration and, from these, estimate the cost and power overheads imposed by the extra components required to form the higher level of the network. From this analysis we see that there is little difference between using a fattree or a generalised hypercube in the higher levels. We also look at the distribution of distances and see that the generalised hypercube provides shorter paths by a slight margin. From our simulation results we confirm that the difference between the two arrangements is minimal except for specific workloads. We also found that, for most workloads, a higher density of connections is generally better, but that slightly reducing the density to one connection every 2 or even 4 QFDBs does not affect performance significantly, except for the heaviest loads, so there is room for reducing the cost and power of the system by thinning the connections. Note that this process is akin to the oversubscription mechanisms commonly implemented in large-scale computing platforms.

It is also apparent from our results that keeping the subtorus small is necessary, since in general the configurations with the largest subtorus required higher execution times. While this can be explained by the larger number of network hops required to reach the destination, it is somewhat unexpected, as a larger subtorus should also reduce the pressure on the higher-level network because locality is increased. It seems, though, that this effect is not significant when compared with the impact of the increase in path length. Apart from helping us to understand the trade-offs involved in the selection of the size of the subtorus and the connection density, our experiments also revealed some interesting case studies in which some of the workloads obtained unexpected results. We proceeded to analyse these unexpected results thoroughly.

2 RELATED WORK
Given the large interest that network architectures arouse, there is a large body of research on different approaches to organizing the interconnection network of computing systems, and great emphasis is put on the scalability of the designs.

For example, until very recently many computing systems have used the well-known torus topology [2, 7, 8, 15, 20], which has historically been used to interconnect massively parallel processors and has a massive body of research around it. Nodes in a torus are arranged in a d-dimensional grid with wrap-around links, so system deployment is trivial and there is no need for extra cabinets for the interconnect. However, the torus topology does not scale very well, as its distance-related properties grow with the number of connected nodes. Note that the underlying topology of our system is based on small subtori for exactly this reason.

Another common family of networks revolves around tree-like topologies such as the k-ary n-tree topology [31] (fattree), the k:k'-ary n-tree topology [29] (thintree) or the generalized tree topology (gtree). These are multi-stage networks that can provide very high throughput and even non-blocking behaviour and, thus, became the architecture of choice as soon as the technology enabled the use of large-radix switches.

One of the latest network organizations attracting great interest from the community is the Dragonfly topology. This is a large-radix, low-diameter, recursive topology that uses a group of high-radix routers (a network group) as a virtual router to increase the effective radix of the network. Large numbers of these network groups are interconnected in an all-to-all fashion using a given connection rule [14, 23]. Many such connection rules exist, and they have a great impact on the performance of the IN. While Dragonflies can sustain very high throughput in average cases, they are very sensitive to the communication patterns of the applications, and there exist many pathological scenarios, primarily involving unbalanced loads, that greatly reduce the performance of the interconnect.



Jellyfish is a random network designed to provide high connectivity in datacenters and was demonstrated to be able to outperform tree-like topologies in some scenarios. One of the main characteristics of Jellyfish is that it is incrementally expandable – i.e., it does not require any specific size, as opposed to most common topologies, which are restricted to relatively few configurations [33]. However, one of the main drawbacks of Jellyfish is its lack of structure, which brings many challenges from a practical perspective in terms of routing, physical layout, and wiring.

Another topology that has been shown to have great potential for large-scale systems is the Generalised Hypercube (GHC) topology [6, 37]. While the study of these topologies has typically been performed from a more formal perspective, there have been many proposals for the deployment of Generalised Hypercube-based systems, especially within the datacentre community. One such proposal is BCube, a server-centric network architecture designed for shipping-container modular datacenters [18], which serves as inspiration for adapting the GHC to be employed as the upper tier in our system. Interestingly, the Stellar construction, which was proposed to generate dual-port server-centric networks, was first exercised with a Generalised Hypercube topology [11].

Still in the realm of server-centric datacentre networks, we can highlight DPillar, which is somewhat inspired by the classic butterfly topology [26], but with servers connected between each stage of the topology and wrapped around. DPillar provides several nice properties such as scalability, network performance, and cost efficiency, which make it suitable for building large-scale future datacenters. However, the routing function that was originally proposed was shown to be highly inefficient, and an optimal shortest-path algorithm was later introduced [12].

The reader should note that all of these network arrangements are based on a single architecture, and little attention has been paid within the community to hybrid networks. While, arguably, the Dragonfly could be considered a hybrid topology in that it is generated recursively – and this could also apply to many other recursive topologies, such as DCell [17], FiConn [25], or the recursive diagonal torus [36] – there is little effort on investigating hybrid systems in which different topologies are used at different levels of the interconnection network. One such example is HCN/BCN, a recursively-defined family of networks, where the BCN construct is built using (copies of) HCNs by including an additional layer of interconnecting links [19]. However, this is a highly theoretical approach and no clear pathway to a practical deployment can be anticipated from the authors' discussion.

The research work that we perform here aims to fill this gap by looking at the hybridization of common topologies in the context of a real prototype system, driven by the physical constraints of the devices and boards developed within ExaNeSt. Our mezzanine boards enforce the use of a torus topology in the lower level of our communication infrastructure and, in this paper, we explore some promising candidates for the higher levels: a fattree and a generalized hypercube, both discussed above. The reason for focusing on these two is that the fattree is the de facto standard, whereas the generalised hypercube is selected because it is very suitable for the containerized deployment that we target in the second testbed of EuroEXA.

Figure 1: Overview of ExaNeSt architecture.

3 SYSTEM ARCHITECTURE
In this section we describe the main characteristics of our novel architecture, ExaNeSt; see [5, 22] for a more detailed description. The architecture has recently been showcased by means of a small, 2-cabinet prototype held at the Foundation for Research and Technology - Hellas (FORTH) and will be extended within EuroEXA. Our system will require millions of low-power-consumption ARM+FPGA MPSoCs to reach Exascale and includes a unified, low-latency interconnect and a fully distributed storage subsystem with data spread across the nodes using local Non-Volatile Memory (NVM) storage [35], managed by BeeGFS³, a state-of-the-art parallel file system that has been ported to our architecture. The ultimate objective is to provide near-data processing, which should be essential for emerging data-centric applications.

As shown in Fig. 1, our building block is a small form-factor QFDB containing 4× Zynq UltraScale+ MPSoCs⁴. Each QFDB is configured to use up to 10× communication ports attached to 10 Gbps transceivers. In addition, we employ the FPGA fabric in the MPSoC to implement a proprietary network protocol developed by our group [3, 4].

³ Available at: https://www.beegfs.com
⁴ See https://www.xilinx.com/support/documentation/white_papers/wp482-zu-pwr-perf.pdf



The system scales as follows: up to 16 QFDBs can be connected within a blade, which uses a backplane that delivers high-bandwidth connectivity whilst reducing the costs and power consumption of external cables and transceivers. For pragmatic reasons during the design phase of our architecture, the boards are arranged in the blade as a static 3D mesh-like structure (4×2×2) where each board uses 6 links for internal connections and exposes 4 links to the outside of the blade for uplinking. This mesh can be extended seamlessly to a torus topology using the external connectivity of the blades. One of the uplinks is dedicated to 10G Ethernet communications with the outside world, so we can use, at most, 3 links per QFDB to connect to the higher levels of our hybrid network [9]. With these constraints in mind, we need to stick to a subtorus as the low-level topology of our system and will look at well-known topologies for the upper levels: fattrees and generalised hypercubes, see below.

4 EXPERIMENTAL SET-UP
In this section we describe our experimental work. First we present the simulation environment and the application models that will be used to analyze the networks. Then we describe the networks we are considering in our study, the parameters of interest and the connection rules we use to explore the effect that the size of the subtorus and the density of connections have on the execution time.

4.1 Simulation Environment and Application Model

The evaluation has been carried out using our in-house developed simulator INRFlow [30], which models the behaviour of large-scale parallel systems, including the network topology (link arrangement), the workload generation and the scheduling policies (selection, allocation and mapping), and measures several static (application-independent) and dynamic (with applications) properties. During execution, the links of the network have capacities, and each flow is specified with a bandwidth reflecting the data that must be routed, while also respecting the causal relationships between network flows – i.e., some flows must finish before others are allowed to be injected. As a result, it provides realistic, flow-level simulation of general real-world workloads, as well as a good estimation of the completion times of a collection of application-inspired workloads.
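To make the flow-level model concrete, the sketch below shows one possible way such a simulator can advance time: active flows share link capacity (here via a simple max-min fair allocation), the earliest-finishing flow completes, and flows whose causal dependencies are then satisfied become eligible for injection. All names and data structures are illustrative assumptions and do not reflect INRFlow's actual implementation.

```python
# Minimal flow-level simulation sketch (illustrative only; not INRFlow's API).
# A workload is a dict: flow_id -> {"route": set(links), "size": bytes, "deps": set(flow_ids)}.

def max_min_rates(active, capacity):
    """Progressive-filling max-min fair share of link capacity among active flows."""
    rates, remaining, unfixed = {}, dict(capacity), set(active)
    while unfixed:
        share = {}
        for link, cap in remaining.items():
            users = [f for f in unfixed if link in active[f]]
            if users:
                share[link] = cap / len(users)
        bottleneck = min(share, key=share.get)              # most constrained link
        rate = share[bottleneck]
        for f in [f for f in unfixed if bottleneck in active[f]]:
            rates[f] = rate
            unfixed.remove(f)
            for link in active[f]:
                remaining[link] -= rate                     # capacity consumed on every hop
    return rates

def simulate(flows, capacity):
    """Advance time event by event until all flows (and their dependants) complete."""
    time, done, active, remaining = 0.0, set(), {}, {}
    def release():                                          # inject flows whose deps are met
        for fid, f in flows.items():
            if fid not in done and fid not in active and f["deps"] <= done:
                active[fid], remaining[fid] = f["route"], f["size"]
    release()
    while active:
        rates = max_min_rates(active, capacity)
        dt = min(remaining[f] / rates[f] for f in active)   # time until first completion
        time += dt
        for f in list(active):
            remaining[f] -= rates[f] * dt
            if remaining[f] <= 1e-9:
                done.add(f); del active[f]; del remaining[f]
        release()
    return time

# Example: flows 0 and 1 share link "a"; flow 2 may only start after flow 0 finishes.
flows = {0: {"route": {"a"}, "size": 10, "deps": set()},
         1: {"route": {"a", "b"}, "size": 10, "deps": set()},
         2: {"route": {"b"}, "size": 10, "deps": {0}}}
print(simulate(flows, {"a": 1.0, "b": 1.0}))
```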

Since one of our main objectives is to ensure the developed interconnect is, indeed, appropriate for large-scale systems as we approach exascale, we analyze our architecture with systems composed of over half a million Zynq FPGAs. Indeed, we need to ensure scalability and also to look for the best design decisions for the higher levels of our network. To obtain more realistic results, our experimental work considers a large collection of workloads which model real-world applications.

Collective Operations: Collective operations are central to many parallel applications and, thus, are very useful to provide an estimation of the performance of applications. In this study, we consider two different collective operations. The first one, Reduce, models a non-optimized N-to-1 collective in which all nodes send messages to the root task. This generates a lot of congestion in the network around the root task, as all tasks are sending traffic towards the same point. While an optimized, logarithmic implementation would be preferred in a real system, we use the non-optimized one to evaluate the behaviour of the networks under a pathological scenario with a hot spot. In addition, we also consider an optimised AllReduce collective operation based on a logarithmic implementation where only log2(tasks) steps are needed [34].
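As a concrete illustration of the two patterns, the sketch below generates the (source, destination) message lists for the naive N-to-1 Reduce and the rounds of a log2(N)-step recursive-doubling AllReduce; it is a schematic of the patterns described above, not the exact schedule used in our experiments.

```python
# Schematic traffic generators for the two collectives (illustrative only).

def naive_reduce(n, root=0):
    """Non-optimized N-to-1 Reduce: every task sends one message to the root."""
    return [(src, root) for src in range(n) if src != root]

def recursive_doubling_allreduce(n):
    """Logarithmic AllReduce: log2(n) rounds of pairwise exchanges (n a power of two)."""
    rounds, dist = [], 1
    while dist < n:
        rounds.append([(task, task ^ dist) for task in range(n)])  # partner = task XOR dist
        dist *= 2
    return rounds

# Example with 8 tasks: 7 messages converging on the root vs. 3 rounds of exchanges.
print(len(naive_reduce(8)), len(recursive_doubling_allreduce(8)))
```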

MapReduce: MapReduce [10] is a common application in the datacenter domain, commonly used to analyze large amounts of data (big data analytics). In MapReduce, a root task partitions and distributes the original data amongst all servers. Once computing nodes receive data from the root, they perform the mapping of the data, shuffle it to the other servers in an all-to-all fashion and then send their results back to the root.
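A minimal sketch of the three traffic phases just described (scatter of partitions, all-to-all shuffle, gather of results), assuming one task per server; the function name and structure are illustrative only.

```python
# Schematic MapReduce traffic phases (illustrative only).

def mapreduce_phases(n, root=0):
    workers = [t for t in range(n) if t != root]
    scatter = [(root, w) for w in workers]                           # root distributes partitions
    shuffle = [(a, b) for a in workers for b in workers if a != b]   # all-to-all among workers
    gather  = [(w, root) for w in workers]                           # results back to the root
    return scatter, shuffle, gather
```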

Sweep3D: Sweep3D is a common kernel in the HPC domain for the deterministic particle transport problem [28]. It performs a wavefront communication in which a grid of tasks is traversed diagonally, starting from one of the corners and advancing in all 3 dimensions towards the opposite corner of the 3D grid.
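The wavefront can be made explicit by grouping the tasks of an X x Y x Z virtual grid by their diagonal index i+j+k, with each task receiving from its lower neighbours in the previous front before sending to its higher ones. This is a simplified, single-corner sketch of the pattern, not the full Sweep3D kernel.

```python
# Single-corner wavefront sketch over an X x Y x Z task grid (illustrative only).
import itertools

def sweep_wavefronts(X, Y, Z):
    fronts, messages = {}, []
    for i, j, k in itertools.product(range(X), range(Y), range(Z)):
        fronts.setdefault(i + j + k, []).append((i, j, k))      # tasks grouped by diagonal
        for di, dj, dk in ((1, 0, 0), (0, 1, 0), (0, 0, 1)):    # forward neighbours
            ni, nj, nk = i + di, j + dj, k + dk
            if ni < X and nj < Y and nk < Z:
                messages.append(((i, j, k), (ni, nj, nk)))      # message from front d to d+1
    return fronts, messages

# Example: a 4x4x4 grid produces 10 wavefronts (diagonals 0 to 9).
print(len(sweep_wavefronts(4, 4, 4)[0]))
```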

Flood: This workload is similar to the Sweep3D workload above, but it sends data from a source node in all directions with several wavefronts being propagated at the same time, which exerts a much heavier pressure on the network.

Near Neighbors: This workload has a communication pattern found in many HPC codes such as LAMMPS [32] (a consolidated molecular dynamics simulator) or RegCM (climate modelling), which are some of ExaNeSt's target applications. The tasks are arranged in a virtual grid and communicate only with their neighbors as defined by a stencil pattern.

n-Bodies: n-Bodies is another common application model, broadly used in different physics domains to model the interaction between bodies (particles, planets, etc.). In our model, tasks are arranged in a virtual ring in which each task starts a chain of messages that travels clockwise across half of the ring.
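A hedged sketch of the ring pattern described above: each task starts a chain of messages that travels clockwise across half of the ring, with consecutive hops of a chain causally ordered (an even number of tasks is assumed here).

```python
# Schematic n-Bodies ring traffic (illustrative only; n assumed even).

def nbodies_chains(n):
    chains = []
    for start in range(n):
        hops = [((start + i) % n, (start + i + 1) % n) for i in range(n // 2)]
        chains.append(hops)          # hop i must complete before hop i+1 of the same chain
    return chains

# Example: with 8 tasks, each of the 8 chains traverses 4 consecutive ring hops.
print(len(nbodies_chains(8)), len(nbodies_chains(8)[0]))
```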

Unstructured Applications: Unstructured workloads often arise when considering system management traffic, system runtimes based on work-stealing, or graph analytics applications (a key application area within datacenters). We model four such applications. First, UnstructuredApp is based on fixed-length messages to model an unstructured application in which data has been partitioned evenly across the tasks. The second one, UnstructuredMgnt, follows the traffic size distribution produced by the management software in common large-scale systems, as described in [21]. The third one, UnstructuredHR, models an application in which a subset of the tasks is more likely to be targeted as destinations (hot tasks). Finally, Bisection models an application in which the tasks perform pair-wise communications, swapping pairs randomly every round.
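Of the four, Bisection is the easiest to make explicit: a random pairing of the tasks, reshuffled every round, with both members of each pair exchanging a message. The generator below is an illustrative assumption of that behaviour, not the simulator's code.

```python
# Schematic Bisection workload: random pairwise partners reshuffled every round.
import random

def bisection_rounds(n, rounds, seed=0):
    rng = random.Random(seed)
    schedule = []
    for _ in range(rounds):
        tasks = list(range(n))
        rng.shuffle(tasks)
        pairs = [(tasks[i], tasks[i + 1]) for i in range(0, n - 1, 2)]
        schedule.append(pairs + [(dst, src) for src, dst in pairs])  # both directions exchange
    return schedule
```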

4.2 Topologies
To perform the study, we have implemented in our simulator new topologies that hybridize a torus by nesting it into a tree (NestTree) as well as into a generalized hypercube (NestGHC).



Figure 2: Examples of the topologies under study.
(a) Torus 3D (4 × 4 × 2); wrap-around links not shown for the sake of simplicity.
(b) A torus nested in a generalised hypercube, NestGHC, topology (t=2, u=8), 4-ary 2-GHC.
(c) A 4-ary 2-tree.
(d) A torus nested in a fattree, NestTree, topology (t=2, u=8), 4-ary 2-tree.

These topologies accept a large number of parameters to define the size of the torus, the number of connections from within a torus to the upper network and the parameters of the upper network (levels/dimensions of the network plus ports per switch per level). For simplicity, we will denote them as NestTree(t, u) and NestGHC(t, u), where t is the number of nodes per dimension in the subtorus and 1/u is the density of uplinks per subtorus; in other words, there will be one uplink every u QFDBs. Note that these two parameters, t and u, are anticipated to have a definite impact on the characteristics of the system, both in terms of performance and of the extra hardware necessary to build the higher level of the interconnect. Understanding the trade-offs of these two parameters is, therefore, essential to design an effective interconnect.

In both cases, we implemented routing functions in which communications within a subtorus always stay within the subtorus, to reduce the pressure on the higher level. Communications between nodes in two different subtori are more complex, as they require routing from the source to the closest uplinked node in its subtorus (which could be the source itself), then routing minimally within the upper network to reach the uplinked node closest to the destination (again, this could be the destination itself) and, if needed, routing to the destination within its subtorus. Routing within a subtorus is performed using dimension-order routing. Routing within a tree uses minimal UP*/DOWN* routing. Finally, routing in a generalized hypercube uses e-cube routing, which traverses the generalized hypercube dimensions in order. We assume all links can transmit at 10 Gbps, similar to the transceivers available in our QFDBs.
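A sketch of the three-leg routing rule just described is shown below. The helper callables (same_subtorus, nearest_uplinked, dor_route, upper_tier_route) stand in for the dimension-order routing, the connection rule, and the UP*/DOWN* or e-cube upper-tier routing; they are placeholders, not functions taken from the simulator.

```python
# Three-leg routing sketch for the nested topologies (helpers are hypothetical callables).

def hybrid_route(src, dst, same_subtorus, nearest_uplinked, dor_route, upper_tier_route):
    """Return the concatenated list of hops from src to dst."""
    if same_subtorus(src, dst):
        return dor_route(src, dst)                     # stay inside the subtorus
    up_src = nearest_uplinked(src)                     # may be src itself
    up_dst = nearest_uplinked(dst)                     # may be dst itself
    path  = dor_route(src, up_src)                     # leg 1: to the local uplinked node
    path += upper_tier_route(up_src, up_dst)           # leg 2: fattree (UP*/DOWN*) or GHC (e-cube)
    path += dor_route(up_dst, dst)                     # leg 3: down to the destination
    return path
```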

In order to assess the suitability of hybridizing topologies, we will compare our proposals, NestTree and NestGHC, with two standard topologies: (i) a 3D torus, as this is the one we would implement if we refrained from using the extra uplinks in our QFDBs to build a higher-level network, and (ii) a fattree topology, as a representative of standard high-performance topologies. The fattree is expected to operate very proficiently with most workloads due to its non-blocking nature. We have depicted some examples of these four network architectures in Figure 2. Note that no over-subscription is applied to the fattrees under consideration (both on their own and nested). Also, we restrict our study to fattrees with three stages.

4.3 Connection rules
An important aspect of our hybrid topologies is the density of connections from the QFDBs to the higher levels of the network. In this paper, we consider 4 different connection densities, depicted in Figure 3.
u = 1: The simplest case, where all QFDBs have uplinks. There is no need for intra-torus communication when destinations are outside the subtorus.
u = 2: Only half of the nodes have an uplink connection. In this case, all non-connected nodes will have an uplinked neighbour at a single hop in the X dimension, and there will be no contention for the use of subtorus resources.



(t, u)   Average distance        Diameter
         NestGHC   NestTree      NestGHC   NestTree
(2, 8)   8.75      8.88          12        12
(2, 4)   7.31      7.44          8         8
(2, 2)   6.84      6.97          8         8
(2, 1)   5.87      5.98          6         6
(4, 8)   8.69      8.87          12        12
(4, 4)   7.31      7.44          8         8
(4, 2)   6.84      6.97          8         8
(4, 1)   5.87      5.98          6         6
(8, 8)   8.72      8.87          12        12
(8, 4)   7.32      7.44          11        11
(8, 2)   6.85      6.97          11        11
(8, 1)   5.88      5.99          11        11

Table 1: Average distance (for uniform traffic) and diameter for the hybrid topologies. For reference, the fattree in our study has a diameter of 6 (3 stages) and an average distance of 5.94. The torus has a diameter of 80 and an average distance of 40.

(t, u)   Switches               Cost Increase (est.)   Power Increase (est.)
         NestGHC   NestTree     NestGHC   NestTree     NestGHC   NestTree
(2, 8)   2048      2048         1.17%     1.17%        0.39%     0.39%
(2, 4)   3072      3072         1.76%     1.76%        0.59%     0.59%
(2, 2)   5120      5120         2.93%     2.93%        0.98%     0.98%
(2, 1)   8192      9216         4.69%     5.27%        1.56%     1.76%
(4, 8)   2048      2048         1.17%     1.17%        0.39%     0.39%
(4, 4)   3072      3072         1.76%     1.76%        0.59%     0.59%
(4, 2)   5120      5120         2.93%     2.93%        0.98%     0.98%
(4, 1)   8192      9216         4.69%     5.27%        1.56%     1.76%
(8, 8)   2048      2048         1.17%     1.17%        0.39%     0.39%
(8, 4)   3072      3072         1.76%     1.76%        0.59%     0.59%
(8, 2)   5120      5120         2.93%     2.93%        0.98%     0.98%
(8, 1)   8192      9216         4.69%     5.27%        1.56%     1.76%

Table 2: Number of switches and estimated cost and power overheads for the hybrid topologies. For reference, employing a fattree with our architecture would require 9216 switches and increases of 5.27% and 1.76% in terms of cost and power consumption, respectively.

u = 4: Only one of every 4 nodes has a connection. In this case, in order to minimise the number of hops in the subtorus, we connect the nodes at two opposite vertices of a 2 × 2 × 2 subgrid, so that all the nodes are still at most one hop away from a connected node.
u = 8: The lowest density we consider, with only one out of every 8 nodes having a connection; all the nodes communicate through the root of a 2 × 2 × 2 subgrid.
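One possible encoding of these rules is sketched below: given a node's coordinates inside its subtorus, it returns the uplinked node that serves it for each density u. The coordinate convention (basing the 2 × 2 × 2 subgrids on even coordinates) is our own assumption; the text above only fixes the one-hop property.

```python
# Hypothetical mapping from subtorus coordinates to the serving uplinked node.

def uplinked_node(x, y, z, u):
    if u == 1:                                        # every QFDB has its own uplink
        return (x, y, z)
    if u == 2:                                        # one uplink per pair: at most one hop in X
        return (x - x % 2, y, z)
    bx, by, bz = x - x % 2, y - y % 2, z - z % 2      # base corner of the 2x2x2 subgrid
    if u == 4:                                        # two opposite corners of the subgrid uplinked
        near_base = (x % 2) + (y % 2) + (z % 2) <= 1
        return (bx, by, bz) if near_base else (bx + 1, by + 1, bz + 1)
    if u == 8:                                        # a single "root" per 2x2x2 subgrid
        return (bx, by, bz)
    raise ValueError("unsupported uplink density")

# Example: with u = 4, node (1, 1, 0) is one hop away from its uplinked corner (1, 1, 1).
print(uplinked_node(1, 1, 0, 4))
```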

5 DISCUSSION OF RESULTS
In this section, we analyze the performance of large-scale ExaNeSt systems composed of 131,072 QFDBs (or around 50 cabinets). The number of endpoints is limited by the available memory in our servers.

5.1 Topological characteristics
We open the section with a simple topological analysis of the topologies under consideration. Table 1 shows the average distance and diameter of the networks for uniform traffic, which serves as a preliminary, raw estimate of the achievable performance. Table 2 shows the number of extra switches required to form the upper network topology, as well as rough back-of-the-envelope estimations of the cost and power overheads that each of the topologies would entail compared with a simple network using only the hard-wired torus topology. It is worth noticing that neither the cost nor the power overhead is really significant compared to the overall system, but it is still preferable to reduce cost and power as much as reasonably possible.
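The distance metrics reported in Table 1 can be reproduced for any topology with an all-pairs breadth-first search over its graph; a generic sketch follows (the adjacency dictionary is a stand-in, not the simulator's internal representation of the topologies).

```python
# Generic BFS computation of average distance (uniform traffic) and diameter.
from collections import deque

def distance_metrics(adjacency):
    """adjacency: node -> iterable of neighbouring nodes (undirected, connected graph)."""
    total, count, diameter = 0, 0, 0
    for src in adjacency:
        dist, frontier = {src: 0}, deque([src])
        while frontier:
            v = frontier.popleft()
            for w in adjacency[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    frontier.append(w)
        for node, d in dist.items():
            if node != src:
                total += d
                count += 1
                diameter = max(diameter, d)
    return total / count, diameter

# Example: a 4-node ring has average distance 4/3 and diameter 2.
ring = {i: [(i - 1) % 4, (i + 1) % 4] for i in range(4)}
print(distance_metrics(ring))
```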

If we compare the NestTree and the NestGHC, we can see that the two families of topologies have very similar characteristics, both in terms of the extra hardware required and the distance of the communications. As discussed above, it is clear that the amount of extra hardware required to build the topology greatly depends on the number of uplinks per subtorus, with very significant differences. We can also see that the average distance is greatly affected by the number of uplinks. However, notice that increasing the number of uplinks has a massive effect in terms of the number of extra switches required, so we need to find a trade-off between these two parameters.



Figure 3: The 4 different uplink densities considered in our study: (a) density 1:8 (u = 8), (b) density 1:4 (u = 4), (c) density 1:1 (u = 1), (d) density 1:2 (u = 2). Darker nodes represent uplinked nodes. The arrows show which paths are used to get to the uplinked nodes from the non-connected nodes; arrows of different colours have been used for the sake of clarity.

For this reason, we now move on to evaluate how these characteristics affect the overall execution time of the workloads described above.

5.2 Simulation work
Given the large variety of workloads we employ to evaluate the systems, and given that performing a case-by-case analysis would be too cumbersome while providing little insight, we split the discussion into 2 main types: (i) workloads generating a heavy utilization of the network, with long periods of congestion generated by a large proportion of the endpoints injecting traffic at once, and (ii) workloads with light network utilization, where inter-message causality limits the number of endpoints that inject traffic concurrently.

Let us start with the results of the heavy workloads as they are the most prominent, see Figure 4. With these workloads we can see that the simple torus topology fails to deliver appropriate performance (execution time is up to one order of magnitude slower than the fattree or the fastest of the hybrid topologies). This shows that moving towards a multi-tier interconnect with a hybrid approach, instead of simply scaling the system by expanding the lower-tier torus, was a sensible design decision. We can also observe that, provided the uplink density is high enough, the hybrid approach is capable of outperforming the single fattree topology in most cases, albeit by a smaller margin. This is because the availability of the lower-tier topology allows locality to be exploited, reducing the pressure on the upper tier. In addition, we can see that reducing density can have a severe effect on performance, especially if the subtorus is too large (with worst-case scenarios where execution time is well over an order of magnitude slower). Indeed, we can see that increasing the size of the subtorus generally increases the overall execution time.

If we compare the topology of the upper tier in our hybrid approach, we can see that there is normally little difference between the performance of a fattree and the generalized hypercube. There are only two exceptions: the first one is Bisection, where the fattree can deliver the workload much faster than the generalized hypercube. In contrast, UnstructuredHR executes more quickly in the generalized hypercube than in the fattree, but to a lesser extent.

The results with lighter workloads, shown in Figure 5, are also of interest. For instance, we can see that in Sweep3D and Flood some of the trends discussed above are inverted. The best-performing topology is the torus, because the topology better matches the grid-like nature of these two workloads and, therefore, it can outperform the other topologies. This is also apparent with our hybrid networks, where having longer dimensions in the subtorus helps improve performance (although not to the same extent as it loses performance in the heavy workloads). This contrasts with the results obtained with the Near Neighbors workload where, even though it has the same spatial pattern as Sweep3D and Flood, the torus topology still performed worse than the fattree and the best hybrid topologies because of the huge pressure placed on the network with all nodes sending at the same time.

MapReduce is another instance where the torus beats the other topologies, although by a slim margin. This is somewhat unexpected, as the traffic pattern is not especially well suited to this topology. Even so, we can see that, again, increasing the size of the subtorus in our hybrid networks is still counterproductive.

Finally, the Reduce collective is a special case where there is no noticeable difference between the different networks under evaluation. The reason for this is that the workload imposes congestion only at the root of the collective, which basically serializes packet delivery. In other words, the consumption port at the root becomes the bottleneck, so the performance of the network does not affect the total execution time.

Overall, we can see that hybrid networks employing different layers with different topologies can outperform those topologies in isolation. A single torus topology is not adequate for most workloads due to the large distances packets need to traverse and the high congestion that this can generate. The only cases where the torus outperforms the rest of the topologies are those where the workload does not put too much pressure on the network, especially if it matches the topology, such as in Sweep3D and Flood.

While the fattree usually outperforms the torus, our torus + fattree and torus + GHC alternatives tend to perform better if the density of uplink connections is high enough. In the cases where uplink density is reduced, the congestion created around the uplinked node severely affects performance, particularly for workloads imposing a huge pressure on the network. In our analysis, it seems that the inflection point is around a density of one uplink connection every 2 or 4 nodes; going above that cripples the network significantly. Considering the cost and power savings that can arise, it seems that a density of 1/4 could be the sweet spot for this topology, although it might not be enough if very heavy workloads are expected to be run on the system.



Figure 4: Normalized simulation time for the heavy workloads (UnstructuredApp, UnstructuredHR, Bisection, AllReduce, n-bodies and Near Neighbors). Each panel plots the normalized execution time of NestGHC, NestTree, Fattree and Torus3D against the parameters (t, u), varying the nodes per subtorus, t, and the uplinks per subtorus, u.

With regard to the best topology to use in the higher tiers, no significant difference was found between the fattree and the generalized hypercube, except for specific workloads.

6 CONCLUSIONS AND FUTURE WORK
In this paper, we have investigated the hybridization of topologies starting from a static, hardware-induced torus-like building block. Given that a torus substructure was imposed by pragmatic reasons related to the engineering of the boards, we decided to explore the best alternative for scaling up our platform rather than naively relying on a simple, relatively low-performance torus topology. To avoid the performance drop that we could expect from a torus topology, we decided to hybridise it by nesting it into a superior topology such as a fattree or a generalised hypercube.

With this in mind, we moved on to understand the trade-offs and build upon best practices to construct such a network. In particular, we wanted to know to what extent the underlying subtorus can be expanded without severely affecting performance. Not only that, we also need to understand how connection density affects the overall performance and look for trade-offs that allow the network to sustain adequate performance while, at the same time, using as few resources as necessary. This is important because it will have a definite impact on the amount of hardware required to implement the main interconnect and, in turn, on cost and power.



Figure 5: Normalized simulation time for the light workloads (UnstructuredMgnt, MapReduce, Reduce, Flood and Sweep3D). Each panel plots the normalized execution time of NestGHC, NestTree, Fattree and Torus3D against the parameters (t, u), varying the nodes per subtorus, t, and the uplinks per subtorus, u.

Indeed, we looked at the cost and power of different system configurations and found that the cost of deploying any of the hybrid topologies discussed here is relatively small; but, since our budget is limited, we still need to balance the amount of resources that is put into the construction of the network against that put into the procurement of computing devices.

To study these aspects, we start with an analysis of the topologies, where we show the number of network components required by each configuration and, from them, estimate the cost and power overheads imposed by the extra components required to form the higher level of the network. From this analysis we see that there is little difference between using a fattree or a generalised hypercube in the higher levels. We also look at the distribution of distances and see that the generalised hypercube provides shorter paths by a slight margin. From our simulation results we confirm that the difference between the two arrangements is minimal except for specific workloads. We also found that, for most workloads, a higher density of connections is generally better, but that slightly reducing the density to one connection every 2 or even 4 QFDBs does not affect performance significantly, so there is room for reducing the cost and power of the system by thinning the connections.

Our results also suggest that keeping the subtorus small is a good idea, since in general the configurations with the largest subtorus required a much higher execution time. While this can be explained by the larger number of network hops needed to reach the destination, it is somewhat unexpected, as a larger subtorus should also reduce the pressure on the higher-level network because communication locality should be increased. It seems, though, that this effect is not as significant as the impact of the increase in path length.

As future work we will continue with the development of the different subsystems of our architecture. In particular, we are developing several low-level aspects of our NIC and routers to incorporate essential features such as mechanisms for fault tolerance, low-level bandwidth scheduling to give priority to critical flows, and a holistic method for congestion control. Furthermore, we plan to look into other aspects of the scalability of the interconnect and how these mechanisms affect the scalability of our approach. With regard to the nesting of topologies, we would like to extend our analyses to incorporate other aspects of the design such as energy consumption or fault tolerance. For this we will revamp our simulation tools so as to be able to perform energy estimation at the scale we are interested in. In the longer term, once our prototype has incorporated enough nodes (at least a few hundred of them; currently only a few tens have been deployed), we would like to perform an investigation similar to this one, but on the real prototype. While the results would not be as interesting as the ones presented here, given its small scale, we expect that some insights can be gained from empirical experimentation.

ACKNOWLEDGMENTS
This research work has been supported by the ExaNeSt and EuroEXA projects, which are funded by the European Commission through the Horizon 2020 EU Research and Innovation programme under Grant Agreements number 671553 and 754337.

REFERENCES
[1] Dennis Abts and Bob Felderman. 2012. A Guided Tour Through Data-center Networking. Queue 10, 5, Article 10 (May 2012), 14 pages. https://doi.org/10.1145/2208917.2208919
[2] Y. Ajima et al. 2012. The Tofu Interconnect. IEEE Micro 32, 1 (2012), 21–31.
[3] Roberto Ammendola et al. 2012. APEnet+: a 3D Torus network optimized for GPU-based HPC Systems. Journal of Physics: Conference Series 396, 4 (December 2012), 042059. https://doi.org/10.1088/1742-6596/396/4/042059
[4] R. Ammendola et al. 2017. Low latency network and distributed storage for next generation HPC systems: the ExaNeSt project. Journal of Physics: Conference Series 898 (Oct 2017), 082045. https://doi.org/10.1088/1742-6596/898/8/082045
[5] Roberto Ammendola et al. 2017. The Next Generation of Exascale-Class Systems: The ExaNeSt Project. In 2017 Euromicro Conference on Digital System Design (DSD). 510–515.
[6] L. N. Bhuyan and D. P. Agrawal. 1984. Generalized Hypercube and Hyperbus Structures for a Computer Network. IEEE Trans. Comput. 33, 4 (April 1984). https://doi.org/10.1109/TC.1984.1676437
[7] R. Brightwell, K. Pedretti, and K. D. Underwood. 2005. Initial performance evaluation of the Cray SeaStar interconnect. In 13th Symposium on High Performance Interconnects (HOTI'05). 51–57. https://doi.org/10.1109/CONECT.2005.24
[8] Dong Chen et al. 2011. The IBM Blue Gene/Q interconnection network and message unit. In 2011 Intl. Conf. for High Performance Computing, Networking, Storage and Analysis. New York, NY, USA, Article 26, 10 pages.
[9] Caroline Concatto et al. 2018. A CAM-Free Exascalable HPC Router for Low-Energy Communications. In Architecture of Computing Systems – ARCS 2018. Springer International Publishing, Cham, 99–111.
[10] Jeffrey Dean and Sanjay Ghemawat. 2008. MapReduce: Simplified Data Processing on Large Clusters. Commun. ACM 51, 1 (Jan. 2008), 107–113. https://doi.org/10.1145/1327452.1327492
[11] Alejandro Erickson et al. 2017. The stellar transformation: From interconnection networks to datacenter networks. Computer Networks 113 (2017), 29–45. https://doi.org/10.1016/j.comnet.2016.12.001
[12] Alejandro Erickson et al. 2017. An optimal single-path routing algorithm in the Datacenter Network DPillar. IEEE Transactions on Parallel and Distributed Systems 28, 3 (March 2017), 689–703. https://doi.org/10.1109/TPDS.2016.2591011
[13] Jesús Escudero-Sahuquillo and Pedro Javier Garcia. 2016. High-performance interconnection networks in the Exascale and Big-Data Era. The Journal of Supercomputing 72, 12 (Dec 2016), 4415–4417. https://doi.org/10.1007/s11227-016-1893-6
[14] Greg Faanes et al. 2012. Cray Cascade: A Scalable HPC System Based on a Dragonfly Network. In Procs. of the Intl. Conf. on High Performance Computing, Networking, Storage and Analysis (SC '12). Los Alamitos, CA, USA, Article 103, 9 pages.
[15] A. Gara et al. 2005. Overview of the Blue Gene/L system architecture. IBM Journal of Research and Development 49, 2.3 (March 2005), 195–212.
[16] Albert Greenberg et al. 2009. VL2: A Scalable and Flexible Data Center Network. SIGCOMM Comput. Commun. Rev. 39, 4 (Aug. 2009), 51–62. https://doi.org/10.1145/1594977.1592576
[17] Chuanxiong Guo et al. 2008. DCell: A Scalable and Fault-tolerant Network Structure for Data Centers. In ACM Conference on Data Communication (SIGCOMM '08). New York, NY, USA, 75–86. https://doi.org/10.1145/1402958.1402968
[18] Chuanxiong Guo et al. 2009. BCube: A High Performance, Server-centric Network Architecture for Modular Data Centers. In ACM Conference on Data Communication (SIGCOMM '09). 63–74. https://doi.org/10.1145/1592568.1592577
[19] D. Guo et al. 2013. Expandable and Cost-Effective Network Structures for Data Centers Using Dual-Port Servers. IEEE Trans. Comput. 62, 7 (July 2013), 1303–1317. https://doi.org/10.1109/TC.2012.90
[20] IBM Blue Gene team. 2008. Overview of the IBM Blue Gene/P project. IBM Journal of Research and Development 52, 1/2 (2008), 199–220.
[21] Srikanth Kandula et al. 2009. The Nature of Data Center Traffic: Measurements & Analysis. In Proceedings of the 9th ACM SIGCOMM Conference on Internet Measurement (IMC '09). New York, NY, USA, 202–208. https://doi.org/10.1145/1644893.1644918
[22] M. Katevenis et al. 2016. The ExaNeSt Project: Interconnects, Storage, and Packaging for Exascale Systems. In 2016 Euromicro Conf. on Digital System Design. 60–67.
[23] John Kim et al. 2008. Technology-Driven, Highly-Scalable Dragonfly Topology. In Procs. of the 35th Annual Intl. Symposium on Computer Architecture (ISCA '08). Washington, DC, USA, 77–88.
[24] P. M. Kogge, P. La Fratta, and M. Vance. 2010. Facing the Exascale Energy Wall. In 2010 International Workshop on Innovative Architecture for Future Generation High Performance. 51–58. https://doi.org/10.1109/IWIA.2010.9
[25] Dan Li et al. 2011. Scalable and Cost-Effective Interconnection of Data-Center Servers Using Dual Server Ports. IEEE/ACM Transactions on Networking 19, 1 (February 2011), 102–114.
[26] Yong Liao et al. 2012. DPillar: Dual-port server interconnection network for large scale data centers. Computer Networks 56, 8 (May 2012), 2132–2147.
[27] O. Lysne et al. 2008. Interconnection Networks: Architectural Challenges for Utility Computing Data Centers. Computer 41, 9 (Sept 2008), 62–69. https://doi.org/10.1109/MC.2008.391
[28] Olaf M. Lubeck et al. 2009. Implementation and Performance Modeling of Deterministic Particle Transport (Sweep3D) on the IBM Cell/B.E. Scientific Programming 17 (2009), 199–208. https://doi.org/10.3233/SPR-2009-0266
[29] Javier Navaridas et al. 2010. Reducing complexity in tree-like computer interconnection networks. Parallel Comput. 36 (2010), 71–85.
[30] Javier Navaridas et al. 2019. INRFlow: An interconnection networks research flow-level simulation framework. Journal of Parallel and Distributed Computing 130 (August 2019).
[31] Fabrizio Petrini and Marco Vanneschi. 1997. k-ary n-trees: High Performance Networks for Massively Parallel Architectures. In Proceedings of the 11th International Symposium on Parallel Processing (IPPS '97). Washington, DC, USA, 87–.
[32] Steve Plimpton. 1995. Fast Parallel Algorithms for Short-Range Molecular Dynamics. J. Comput. Phys. 117, 1 (1995), 1–19. https://doi.org/10.1006/jcph.1995.1039
[33] Ankit Singla et al. 2012. Jellyfish: Networking Data Centers Randomly. In Procs. of the 9th USENIX Conf. on Networked Systems Design and Implementation (NSDI '12). 17–17.
[34] Rajeev Thakur and William D. Gropp. 2003. Improving the Performance of Collective Operations in MPICH. In Recent Advances in Parallel Virtual Machine and Message Passing Interface. 257–267.
[35] Qiumin Xu et al. 2015. Performance Analysis of NVMe SSDs and Their Implication on Real World Databases. In Procs. of the 8th ACM Intl. Systems and Storage Conf. New York, NY, USA, Article 6, 11 pages.
[36] Yulu Yang et al. 2001. Recursive Diagonal Torus: An interconnection network for massively parallel computers. IEEE Transactions on Parallel and Distributed Systems 12 (2001), 701–715. https://doi.org/10.1109/71.940745
[37] S. D. Young and S. Yalamanchili. 1991. Adaptive routing in generalized hypercube architectures. In Proceedings of the Third IEEE Symposium on Parallel and Distributed Processing. 564–571. https://doi.org/10.1109/SPDP.1991.218249