Is an Alli gator Better Than an Armadillo? Than anmeiyang/ecg702/proj/isalligator... · Is an Alli...

11
Is an Alli gator Better Than an Armadillo? D esigning a parallel machine would be much easier if one interconnection network were “best” for all applications and all operating environments (including hardware, soft- ware, and financial factors). Unfortunately, no such net- work exists. Furthermore, even for a fixed application domain and a fixed operating environment, selecting the best network may be difficult because many cost and performance metrics could be used. Suppose someone asked you to select the best animal. What features would you use to compare, say, an alligator and an armadillo? In some ways, the two are very similar: both have four legs, a rugged exterior, and sharp claws. However, in other ways the two are very different: one prefers a marshy environment, the other dry land; one has a long tail, the other a short tail; one is a reptile, the other a mammal. Which of the two, then, is a better animal and what makes it better? Is it which one could win a fight? Alligators are generally bigger, but how do you incorporate their need for more food because they are bigger? Moreover, what is a fair com- parison of their ability to fight? Must you consider two armadillos against one alligator? If so, should their combined weight equal that of the alli- gator? Or should they be end-to-end the same length? 18 1063-6552/97/$10.00 © 1997 IEEE IEEE Concurrency Network effectiveness differs considerably across applications and operating environments, but even if these variables were fixed, cost and performance metrics must be chosen. Scientifically determining the best network is as difficult as saying with certainty that one animal is better than another. Performance Evaluation Problems with Comparing Interconnection Networks Is an Alli gator Better Than an Armadillo? Kathy J. Liszka The University of Akron John K. Antonio Texas Tech University Howard Jay Siegel Purdue University .

Transcript of Is an Alli gator Better Than an Armadillo? Than anmeiyang/ecg702/proj/isalligator... · Is an Alli...

Page 1: Is an Alli gator Better Than an Armadillo? Than anmeiyang/ecg702/proj/isalligator... · Is an Alli gator Better Than an Armadillo? D esigning a parallel machine would be much easier

Is an Alligator BetterThan an Armadillo?

Designing a parallel machine would be much easier if oneinterconnection network were “best” for all applicationsand all operating environments (including hardware, soft-ware, and financial factors). Unfortunately, no such net-work exists. Furthermore, even for a fixed application

domain and a fixed operating environment, selecting the best network maybe difficult because many cost and performance metrics could be used.

Suppose someone asked you to select the best animal. What featureswould you use to compare, say, an alligator and an armadillo? In someways, the two are very similar: both have four legs, a rugged exterior, andsharp claws. However, in other ways the two are very different: one prefersa marshy environment, the other dry land; one has a long tail, the othera short tail; one is a reptile, the other a mammal. Which of the two, then,is a better animal and what makes it better? Is it which one could win afight? Alligators are generally bigger, but how do you incorporate theirneed for more food because they are bigger? Moreover, what is a fair com-parison of their ability to fight? Must you consider two armadillos againstone alligator? If so, should their combined weight equal that of the alli-gator? Or should they be end-to-end the same length?

18 1063-6552/97/$10.00 © 1997 IEEE IEEE Concurrency

Network effectivenessdiffers considerablyacross applications andoperating environments,but even if thesevariables were fixed,cost and performancemetrics must be chosen.Scientificallydetermining the bestnetwork is as difficult assaying with certaintythat one animal is betterthan another.

Performance Evaluation

Problems withComparingInterconnectionNetworks

Is an Alligator BetterThan an Armadillo?

Kathy J. LiszkaThe University of Akron

John K. AntonioTexas Tech University

Howard Jay SiegelPurdue University

.

Page 2: Is an Alli gator Better Than an Armadillo? Than anmeiyang/ecg702/proj/isalligator... · Is an Alli gator Better Than an Armadillo? D esigning a parallel machine would be much easier

October–December 1997 19

.

As we describe in the main text, thetask of fairly comparing two networksis complicated by the many parame-ters that characterize them. The listbelow is only representative, notexhaustive; the same is true of the sup-porting references. Both are meant togive some idea of the many factors thatcan vary across networks.

NETWORK USE

The way in which a network is used is critical to consider when measur-ing network performance. A networkcould be used such that only a fractionof the network’s peak performance isachieved. Experienced programmersattempt to design parallel code so thattheir application programs use the net-work in ways that are well-suited forthe target system.

Method of task decompositionand mappingBoth the method used to decomposea task and the method used to mapsubtasks onto the processors can affectnetwork loading characteristics.1 Con-sider the problem of how to best mapeight subtasks, say S0, ..., S7, onto eightprocessors, P0, ..., P7. Assume therequired communications occur atapproximately the same time and thatthe communication pattern amongsubtasks is that shown in Figure A1.Also assume the processors are inter-connected as the three-dimensionalhypercube in Figure A2. If subtask Si ismapped to processor Pi, some pathsfor the required communications willuse multiple links of the hypercube, asFigure A3 shows. An alternative map-ping, such as that in Figure A4, whereeach required path contains only onelink, can use the available communi-cation resources more effectively.

Processor and memory speedThe speed of a processor or a memorycan affect the rate at which data can bemoved into the network. If the speedis too low, the network’s peak perfor-mance may not be realized.

Modes of parallelism supportedThe amount of overhead processing

(such as that associated with messageprotocols) for interprocessor com-munications depends on the mode ofparallelism used. In SIMD mode,during a typical transfer each enabledprocessor sends a message to a dis-tinct processor, implicitly synchro-nizing the send and receive com-mands and implicitly identifying themessage. In contrast, MIMD moderequires overhead to identify whichprocessor sent the message, to specifywhat the data item is, and to signalwhen a message has been sent andreceived.

Frequency of communicationsThis parameter quantifies the rate atwhich messages are generated fortransmission through the network.For theoretical analyses and simula-tion studies, the frequency of com-munications is often assumed to be arandom quantity with a known prob-ability distribution. For real applica-tions, the frequency of communica-tions can vary during execution anddepend on the input data in a verycomplicated way, making it difficultto model accurately.

Message-size range anddistributionIn general, network performance de-pends on the distribution and range ofmessage sizes to be transmitted, whichis often difficult to predict for a givenapplication because it may be datadependent. For simplicity, researchersoften assume a uniform distribution ofmessage sizes to conduct simulationstudies. However, this assumption maynot always be accurate.

Message destination assumptionsWill the number of received messagesbe uniformly distributed among thedestination ports? Are the destinationsfor messages from a given source uni-formly distributed? Will the messagedestinations favor a subset of loca-tions, possibly causing a nonuniformuse of network links?2 To determinethe amount of interference in the net-work, designers must consider thesetypes of issues.

Network parameters: characteristics thatdefine and shape interconnection networks

(1)

(2)

(3)

(4)

S1

S0

S4

S7

S6

S5 S3

S2

P7P6

P4 P5

P3P2

P0 P1

S7S6

S4 S5

S3S2

S0 S1

S7S6

S4 S5

S3S2

S0 S1

Figure A. (1) Samplecommunication pattern amongeight subtasks in (2) a three-dimensional hypercube. (3) Theresult of assigning subtask Si toprocessor Pi. (4) The result of amore efficient mapping ofsubtasks to processor Pi.

(Continued on page 20)

Page 3: Is an Alli gator Better Than an Armadillo? Than anmeiyang/ecg702/proj/isalligator... · Is an Alli gator Better Than an Armadillo? D esigning a parallel machine would be much easier

20 IEEE Concurrency

These kinds of comparison questions apply to inter-connection networks. Suppose you are comparing theaverage message delay, for a given set of traffic condi-tions, for a hypercube network and a mesh network.Hypercube networks may involve more links thanmeshes. How do you incorporate that hypercube net-works may require more complex hardware? Should thetotal path width of all the links the networks employ bethe same? Or should the two networks require the samenumber of transistors per switch?

In this article, we explore the problems of determin-ing which metrics or weighted set of metrics designersshould use to compare networks and how they shouldapply these metrics to yield meaningful information.We also look at problems in conducting fair and scien-tific evaluations.

Choosing and applying metrics

In choosing and applying metrics, designers mustanswer three questions:

• What are the relevant metrics for my applicationdomain and operating environment?

• How should I weight each metric?• How can I make two network designs of equal “cost”

so that I can compare them “fairly”?

METRICS AND WEIGHTING ISSUES

There are many ways to characterize and measure theperformance of interconnection networks. We providea representative subset of metrics and briefly list issuesto consider in weighting them.

Latency When data is transferred among the interconnected proces-sors and memories, delays through the interconnectionnetwork can cause processor idle times, which can ulti-mately degrade overall system performance. Multitaskingcan help limit this source of processor idle time, but it maynot improve the total execution time of any single task. Theway you choose to characterize latency—for example, max-imum or average—depends on the application.

.

Hot-spot trafficIn this type of nonuniform traffic, mes-sages from distinct sources are destinedfor the same network output port atapproximately the same time.3 Applica-tions that use barrier-type synchroniza-tions may create hot spot traffic. Perfor-mance degradation caused by hot spottraffic is due primarily to the contentionfor common network resources amongthe messages converging toward thecommon destination port. A secondaryform of performance degradation occursbecause hot-spot traffic can affect back-ground traffic—network traffic not des-tined for the hot destination port.4

ROUTING CONTROL

Routing-control schemes aid in estab-lishing network connections for inter-processor communication. Approachesvary in time and space requirements,capability, and flexibility. In using anyof these schemes, designers must con-sider both deadlock avoidance and thetime required to compute the routingcontrol information.

DistributedThe routing decision is made locally ateach switch on the basis of a routing tagin the header of the message. When amessage arrives at a switch, the switchdetermines if it is the intended destina-

tion or, if not, onto which output linkthe message should be forwarded.

CentralizedA single control mechanism sets the con-nection patterns in all switches simulta-neously. No local routing decisions aremade at any switch. Centralized routingmechanisms are sometimes used forSIMD environments. Centralized con-trol can simplify the hardware design fora SIMD machine, but it is less flexiblethan distributed control. Compute timetends to be higher because all theswitches must be set.

AdaptiveThis scheme is suitable for networksthat have multiple paths betweensource-destination pairs, such as ahypercube or a multipath multistagenetwork.5 When local network conges-tion exists, a message may be dynami-cally rerouted through another, lesscongested, part of the network.6 Theperformance improvement provided byadaptive routing typically comes at thecost of more complex routing protocolsthan nonadaptive routing.

Table-basedOne approach is to store a lookup tableat each switch.7 A common implemen-tation indexes each table by destination

number and returns the output chan-nel (number) required to establish thenext link in the path. A table-basedscheme provides great flexibility, but atthe cost of table storage and lookuptime at each switch.

SWITCHING METHODOLOGY

There are two basic switching mecha-nisms for transmitting messages in anetwork: circuit switching and packetswitching. Combinations and varia-tions of these two approaches have alsobeen proposed and used. In general,approaches differ in how they treat theI/O buffering required at the switches

Circuit switchingIn circuit switching, a dedicated path isestablished from the source to the des-tination and maintained for the entiremessage transmission. Because the pathis dedicated, the transmission encoun-ters no contention once the path isestablished. However, the establish-ment of the path may be delayed ifother paths are already using links com-mon to that path.

Packet switchingIn packet switching, also called messageswitching or store-and-forward switch-ing, messages are split into packets,which are sent through the network.

Page 4: Is an Alli gator Better Than an Armadillo? Than anmeiyang/ecg702/proj/isalligator... · Is an Alli gator Better Than an Armadillo? D esigning a parallel machine would be much easier

October–December 1997 21

Fault toleranceThe network’s environment (an inaccessible satelliteversus a university laboratory, for example) influenceshow important fault tolerance will be. To quantifymetrics such as the number of faults tolerated and anydegradation that results from a fault, you must estab-lish a common fault model and a common fault-toler-ance criterion.1 The fault model characterizes the typesof faults considered (such as permanent versus tran-sient). The fault-tolerance criterion is the condition thatmust be met for the network to tolerate the fault. Forexample, in a multistage network, a fault may be con-sidered tolerated if a blocked message can be sent toan intermediate destination and then to the final des-tination in a second pass.

Bisection widthThis is the minimum number of wires that must be cut toseparate the network into two halves. Larger bisectionwidths are better when data in one half of the system maybe needed by the other half to perform a calculation.

Permuting abilityFor a system of N processors, a permutation is a set ofsource-destination pairs that may be mathematicallyrepresented as a bijection from the set {0, 1, ..., N – 1}onto itself. In this context, a permutation is the trans-ferring of a data item from each processor to a uniqueother processor, with all processors transmitting simul-taneously. Different networks can require differentamounts of time to process various permutations.2 Per-mutations can occur in MIMD (multiple-instructionstream, multiple-data stream) machines when a com-munication is preceded by a barrier synchronization,and in SIMD (single-instruction stream, multiple-datastream) machines.

Control complexityA Benes network can be set to perform any permutationin one pass;3 the multistage cube may require multiplepasses. However, a multistage cube network may be con-trolled in a distributed fashion by using the address ofthe destination processor;2 a Benes network requires a

.

Each packet has a header that containsrouting information. One commonpacket-switching scheme is outputbuffering. In this scheme, when a packetarrives at an intermediate switch, itsheader is stored in an input latch. Theoutput port of the switch is determined,and the packet proceeds to the appro-priate output buffer, which holds theentire packet. Thus, a message does notblock an entire path through the net-work, as in circuit switching.

Virtual cut-throughLike packet switching, virtual cut-through uses an input latch to receiveand store a packet’s header. It thenchecks the channel to the next switchfor availability. If the channel is free, thepacket bypasses, or “cuts through,” theoutput buffer and proceeds to the nextswitch; otherwise, the entire packet isstored in the output buffer, as in nor-mal packet switching.8

Partial cut-throughPartial cut-through is a variation of vir-tual cut-through. It operates on thesame principle of attempting to bypassthe output buffer if the channel is freeand otherwise storing the packet in thebuffer. However, with partial cut-through, if the channel becomes avail-able during buffering, the packet can

start transmitting to the next switch atthat point in time.

WormholeWormhole is another variation of vir-tual cut-through. In one approach towormhole routing, a message is initiallydivided into packets, as is done in packetswitching, and the packets are furtherdivided into flits (flow control digits).9

The first flit (the header, which con-tains the routing information) “cutsthrough” by proceeding directly to thenext switch in the route if the channel isfree. The trailing flits flow through thesame path in the network, following theheader flit in a pipeline fashion. If anoutput channel is not free when theheader arrives at a switch, the header isblocked at the switch, as are the flits fol-lowing the header. Once all channelsare acquired along the route, the pathis dedicated for the entire transmissionof all the flits in the packet.10

Conflict resolutionFor all switching mechanisms, a givenswitch output channel may be requiredby two or more messages at the sametime. This is called a conflict. The conflict-resolution scheme detects and resolves con-flicts by determining which requestsshould be deferred so that the remain-ing messages can proceed.

CombiningThis packet-switching enhancementcan lessen the performance degradationcaused by situations such as hot-spottraffic (defined earlier). For example, thecombining process can merge two readrequests for the same physical memorylocation into one that continues on tothe associated destination port, insteadof sending both the original requests.Combined messages can also be com-bined. This feature increases the com-plexity of the network switch, but maysignificantly affect performance whenshared variables are used for tasks thatrequire large numbers of processors toaccess them at approximately the sametimes—for example, a job queue or asemaphore (synchronization primi-tive).11 Another form of combininginvolves reduction operations.

HARDWARE IMPLEMENTATION

It is generally difficult to predict howvarying a single hardware implementa-tion parameter will affect network per-formance. Also, it is very complicatedand costly to develop accurate simula-tions for (or actually manufacture) twonetworks that differ in only one hard-ware implementation parameter. Thedifficulty of comparing two such net-

(Continued on page 22)

Page 5: Is an Alli gator Better Than an Armadillo? Than anmeiyang/ecg702/proj/isalligator... · Is an Alli gator Better Than an Armadillo? D esigning a parallel machine would be much easier

22 IEEE Concurrency

centralized control scheme (see sidebar, “Network para-meters: characteristics that define and shape intercon-nection networks”) and O(N log N) time to compute thenetwork switch settings (for N processors).

PartitionabilitySome networks may be partitioned into independentsubnetworks, each with the same properties as the orig-inal network.4 Partitionable systems include multiple-SIMD machines (such as the TMC CM-25), MIMDmachines (such as the IBM SP26), and reconfigurablemixed-mode machines (such as PASM, a partitionableSIMD/MIMD machine7). Partitionability allows mul-tiple users, aids fault tolerance, supports subtask paral-lelism, and provides the optimal size subset of processorsfor a given task.8

Graph theoretic definitionsYou can use graph theory parameters, such as degreeand diameter, to quantify network attributes. Thegraph nodes represent multistage network switches or

single-stage (point-to-point) network processors; thearcs represent communication links. Degree is the num-ber of communication links connected to a processoror switch. Diameter is the maximum value, across allnode pairs, for the shortest distance (number of links)between any two nodes. Designers must carefully con-sider the significance and implications of a given per-formance metric such as diameter. The diameter of the twisted cube is approximately half that of thehypercube, yet in some cases the hypercube can havea smaller message delay.9

Unique path vs. multipathSome networks, such as a multistage cube, have onlyone path between a given source and a given destina-tion. Others, such as the hypercube and extra stagecube,1 have multiple paths between source-destina-tion pairs. Multipath networks may be more costlyand difficult to control than unique-path networks,but they offer fault tolerance and enable routingaround busy links. This demonstrates how improving

.

works is compounded by the stronginterdependence among the varioushardware implementation parameters.For example, changing the maximumnumber of pins per chip may also affectthe physical chip size, which may affectthe number of chips that can be placedon a fixed-size printed-circuit board,which affects the number and physicallength of connections among the PCBsand chips.

Implementation technologyThe implementation technology re-lates to the physical materials and pro-duction methodology used to manu-facture the chips and boards. Althoughimplementation using a relativelyexpensive material or technology mayenable certain chips to run faster, de-signers must consider the power re-quirements of the chips and their inter-action with the other components.Several chip-attachment methods arealso possible, which affect packagingcosts, circuit use, circuit delay times,and overall system cost.

Design detailsThe design of a network includes thelogic-level description for the switches,interfaces, and controllers. Factors toconsider in logic circuit design are clockfrequency, communication delay, par-

titioning for multichip configurations,and fault tolerance.

Number and width ofcommunication linksFor a given interconnection scheme(such as a hypercube), increasing thewidth of the channels increases the frac-tion of a message that can be transmit-ted in one network cycle. This increasecomes at the “expense” of increasing thenumber of wires and the overall wiringcomplexity.12 Suppose you have a fixednetwork width, defined here as the prod-uct of the number of links and the widthper link. In some application domains,throughput may be better if you use afour-nearest-neighbor mesh with rela-tively wide communication links insteadof a hypercube with relatively narrowcommunication links. In other applica-tion domains, the reverse may be true.

Packaging constraintsPackaging constraints include thephysical size of the chips, requiredpower consumption, circuit delay, reli-ability, heat dissipation, interconnec-tion density, chip configuration, inter-board distances, and productioncosts.13 For example, if you use the sys-tem in an environment that has littleextra space for cooling fans, requiredpower consumption and heat dissipa-

tion can become critical constraints.

NUMBER OF PINS PER CHIP

Increasing the number of pins on a chipincreases the space needed on the boardto fan out the signals to and from thechip. However, using too few pins on achip may necessitate multiple-chip con-figurations that can, in turn, createmore complex (and less reliable) chip-interconnection schemes. One studycompared a variety of multidimensionalmeshes (including the hypercube)under a particular set of pin-out con-straints.12 This involved examining thebalance between the number of chan-nels associated with a processor and thewidth of each channel.

Number of chips per boardDepending on the design choices, aparticular network implementation mayhave multiple switches on a single chip,have one switch per chip, or requireseveral chips for a single switch. Whenmultiple chips are used, the designermust consider the physical sizes of thechips, the available area of each board,the connection complexity among thechips, mounting configurations for theboards, and so on.

Number of layers per boardTo accommodate the required connec-

Page 6: Is an Alli gator Better Than an Armadillo? Than anmeiyang/ecg702/proj/isalligator... · Is an Alli gator Better Than an Armadillo? D esigning a parallel machine would be much easier

October–December 1997 23

one network characteristic (fault tolerance) candegrade another (cost). If you use a weighted set ofmetrics, different weightings for fault tolerance andcost may result in a different decision about which net-work is best. A general methodology10 has been devel-oped to quantify the evaluation of network perfor-mance by constructing a weighted polynomial.However, fundamental to using this type of approachis deciding how to choose the weights.

Cost-effectivenessThis metric, or cost-performance ratio, is the network’sperformance divided by the cost to implement it. Ifcost-effectiveness is to be used as a scientific measure,it should be defined in terms of hardware required,rather than actual dollar cost, which can be easilychanged by a new marketing policy rather than a tech-nological advance. Although a commercial reality, itseems unscientific for one network design to becomemore cost-effective than another because one salesper-son decided to give a temporary discount on a particu-

lar component to meet an end-of-year sales quota.However, measuring hardware cost without convert-ing to dollars is awkward. What units should be used toadd, say, the cost of optical links to the cost of memoryfor message buffers? A related problem is how exactlyto determine if two network designs are indeed of equalcost. We address this later.

Other metricsThese include throughput (number of messages thatarrive at their destinations during a given time interval),multicasting time (how long one processor takes to sendthe same message to multiple processors), scalability(range of machine sizes for which a network design isappropriate), and what the network will add to the hostmachine’s volume, weight, and power consumption.Some network metrics are difficult to define quantita-tively, but should be considered nonetheless. For exam-ple, ease of use is important from a practical viewpoint.Can the user easily understand and effectively use thenetwork’s important features? Can the user or compiler

.

tions among multiple chips on a PCB,designers can use multiple layers toimplement nonplanar connection pat-terns. To determine the number of lay-ers to use, designers must consider thetrade-offs among design complexity,wire lengths, and required area. Usingmore than two layers generally requiressophisticated wiring schemes or layoutheuristics, which can increase designtime and cost.

ManufacturabilityWhen designing for a commercialproduct, being able to manufacture thenetwork economically and in a reason-able time are important issues. Somepossible considerations include com-ponent availability, use of custom versusoff-the-shelf chips, time to assembleand test parts, wiring rules, and wirelengths.

ScalabilityThis parameter refers to the range ofsizes (as measured, for example, by thenumber of the network’s external I/Oports) that are appropriate for a givenset of implementation parameters. Incommercial markets, in which a widerange of network sizes may be required,implementation choices may improvescalability. In general, however, imple-mentations designed for a wide range

of network sizes may be achieving scal-ability at the expense of degraded per-formance for a given size in the range.

References1. V. Chaudhary and J.K. Aggarwal, “A

Generalized Scheme for Mapping Par-allel Algorithms,” IEEE Trans. Paralleland Distributed Systems, Mar. 1993, pp.328−346.

2. T. Lang and L. Kurisaki, “NonuniformTraffic Spots (NUTS) in MultistageInterconnection Networks,” J. Paralleland Distributed Computing, Sept. 1990,pp. 55−67.

3. M. Kumar and G.F. Pfister, “TheOnset of Hot Spot Contention,” inProc. Int’l Conf. Parallel Processing, IEEEComputer Society Press, Los Alamitos,Calif., 1986, pp. 28−34.

4. M. Wang et al., “Using a Multipath Net-work for Reducing the Effects of HotSpots,” IEEE Trans. Parallel and Distrib-uted Systems, Mar. 1995, pp. 252−268.

5. D. Rau, J.A.B. Fortes, and H.J. Siegel,“Destination Tag Routing TechniquesBased on a State Model for the IADMNetwork,” IEEE Trans. Computers,Mar. 1992, pp. 274−285.

6. P.T. Gaughan and S. Yalamanchili,“Adaptive Routing Protocols forHypercube Interconnection Net-works,” Computer, Vol.26, No. 5, May1993, pp. 12−23.

7. C.B. Stunkel et al., “The SP2 High-Performance Switch,” IBM Systems J.,Vol. 34, No. 2, 1995, pp. 185−204.

8. P. Kermani and L. Kleinrock, “ATradeoff Study of Switching Systemsin Computer Communications Net-works,” IEEE Trans. Computers, Dec.1980, pp. 1052−1060.

9. L.M. Ni and P.K. McKinley, “A Sur-vey of Wormhole Routing Techniquesin Direct Networks,” Computer, Vol.26, No. 2, Feb. 1993, pp. 62−76.

10. W.J. Dally and C.L. Sietz, “Deadlock-Free Message Routing in Multiproces-sor Interconnection Networks,” IEEETrans. Computers, May 1987, pp. 547−553.

11. A. Gottlieb et al., “The NYU Ultra-computer—Designing an MIMDShared-Memory Parallel Computer,”IEEE Trans. Computers, Feb. 1983, pp.175−189.

12. S. Abraham and K. Padmanabhan,“Performance of Multicomputer Net-works under Pin-Out Constraints,” J.Parallel and Distributed Computing, July1991, pp. 237−248.

13. J.R. Nickolls, “Interconnection Archi-tecture and Packaging in Massively Par-allel Computers,” Proc. Packaging, Inter-connects, and Optoelectronics for the Designof Parallel Computers Workshop, 1992,pp. 4−8.

Page 7: Is an Alli gator Better Than an Armadillo? Than anmeiyang/ecg702/proj/isalligator... · Is an Alli gator Better Than an Armadillo? D esigning a parallel machine would be much easier

24 IEEE Concurrency

readily determine ways to use the network to obtain thebest possible performance?

MAKING FAIR COMPARISONS

A difficulty in comparing two network designs is deter-mining if they are of equal cost so that the comparisonis “fair.” For example, for more than 16 processors, thenumber of links and switch complexity for a hypercubeis greater than that for a four-nearest-neighbor mesh.To fairly compare the performance of these two net-works, how should the mesh be augmented to make it ofequal cost?

The problem of making fair comparisons is com-pounded by the numerous design and use parametersthat may be varied. Network designers and imple-menters must understand how these parameters affectnetwork performance. Major categories of parametersinclude network use, routing control,switching methodology, and hard-ware implementation. Designersmust select values for one or moreparameters within each categorybefore they can effectively evaluatemetrics. The sidebar, “Networkparameters: characteristics that de-fine and shape interconnection net-works,” gives specific examples in thefour categories.

Evaluating networks

Having selected from numerous metrics and parame-ters, analysts must now find ways to apply them sys-tematically. This consists of collecting additionalinsights through proven measurement tools and apply-ing evaluation strategies.

MEASUREMENT TOOLS

Measurement tools provide additional informationabout how the network can be expected to perform in itsintended environment. Although these tools are valu-able, analysts must apply them judiciously. The list oftools below is by no means exhaustive.

Mathematical modelsThese models provide important insight into basic net-work features and allow some degree of performanceprediction. However, those modeling and analyzing net-works for parallel processing systems often makeassumptions for the sake of tractability that may not be

totally realistic. When mathematically analyzing packet-switched networks, for example, analysts sometimesassume that the buffers for holding packets at eachswitch are of infinite length. The analytical results mustbe reasonable approximations of behavior when buffersizes are fixed and realistic.

Statistically-based simulationsSuch simulations characterize behavior on the basisof probabilistic models for network traffic. The com-plexity of network simulators varies depending on howclosely they account for the implementation details ofthe networks they are simulating. Some simulatorsmay account for the finite size of buffers for packet -switched networks, and thereby let analysts vary the“buffer size” parameter for different simulationstudies. In contrast, other simulators may assume

infinite buffer capacities, which sim-plifies the complexity of the simu-lator because it is no longer neces-sary to model the exceptionhandling that occurs when dataarrives at a full buffer.

Program tracingThis approach captures the applica-tion program’s traffic pattern, such asthe message source or destination andnumber of processor cycles elapsedbetween sending messages. Analysts

can use trace information to determine (either throughanalysis or simulation) the effectiveness of executing thesame program on a different machine. There is a poten-tial problem, however. Using program tracing whencommunication patterns have been optimized for a net-work on a given machine may not allow a fair compar-ison when used to generate traffic patterns for anothernetwork.11

BenchmarkingThis strategy involves executing a given task on severaldifferent machines and measuring and comparing the per-formance among systems. When using benchmarks tocompare networks, analysts must remember that bench-mark results are affected by many software layers andmany implementation details of the whole parallelmachine.12 First, many algorithmic approaches may beused to implement a benchmark task. The mapping ofthe benchmark algorithm onto the processors of the sys-tem affects the communication patterns, which may affect

.

Varying only onenetwork feature toisolate its impact canresult in an unfaircomparison due tothe interdependencyof the networkfeatures.

Page 8: Is an Alli gator Better Than an Armadillo? Than anmeiyang/ecg702/proj/isalligator... · Is an Alli gator Better Than an Armadillo? D esigning a parallel machine would be much easier

October–December 1997 25

the delay characteristics. The languageor compiler used to express or interpretthe algorithm and interprocessor datatransfers can significantly affect networkperformance. The measured perfor-mance is also a function of the operatingsystem (for example, interprocessorcommunication protocols) that executesthe algorithm.

The machines being benchmarkedmay vary considerably in hardware(such as technology used) and architecture (such as num-ber of processors), which could also affect comparativenetwork performance. Isolating the influence of thechosen network in the system becomes problematic, so interpreting benchmarking results must be donecarefully.

EVALUATION STRATEGIES

Given these measurement tools, how can designers fairlyevaluate network features or application performance?Even after they have decided on which weighted set ofmetrics to use and have selected values for the variousdesign and use parameters, they must still consider if itis possible to isolate the effect of changing a single fea-ture. They must also select application implementationsthat will make the comparison fair.

Classical experimental methodThe classical experimental method assumes that a givensystem is characterized by a set of parameters. To under-stand the effect a given parameter, say x, has on the sys-tem, all parameters except x are fixed and measurementsand comparisons are made on the system. For example,suppose the goal is to determine the effect of networktopology on performance, comparing a multidimen-sional mesh with a hypercube. Assume that the hyper-cube performs fastest using a distributed routing tag asa control scheme, and that the mesh performs fastestusing a routing-table lookup as a control scheme at eachswitch node. Should you compare the networks withboth using routing tags, both using table lookup, or eachusing what is best? If you use the same routing schemefor both, does this result in a fair comparison? What con-clusions can you draw about the topological differenceindependent of the routing scheme? Furthermore, as wehave shown, you must consider many other parametersin addition to the routing scheme if you want to fairlycompare topologies.

If you compare the hypercube with routing tags and

the mesh with table lookup, you invalidate the classicalexperimental method because two parameters are var-ied. Yet, without the classical experimental method, howcan you isolate the effect of one network design orimplementation parameter on performance?

This is an important dilemma that the interconnec-tion network research community must begin toaddress.

Fixed mappingSuppose two SIMD distributed-memory, parallel-processing systems differed only in the topology of theirinterconnection networks: one is a mesh, the other aring. Both systems have N processing elements (proces-sor-memory pairs), or PEs, labeled 0 to N – 1. Assumein the mesh topology that PE i is directly connected toPEs i – 1, i + 1, i + , and i – In the ring topol-ogy, PE i has direct connections only to PEs i – 1 and i+1. To simplify the presentation, we assume the num-ber of data words sent and the number of links traverseddetermine the time for each data transfer between PEs(we could derive similar examples that consider networksetup time, wormhole routing, and so on).

To apply benchmarking or the classical experimentalmethod to compare the effects of the network topolo-gies, suppose you decide to execute the same SIMDimage-smoothing algorithm on both systems. Thesmoothing algorithm involves performing a smoothingoperation for each pixel in an M × M image. Eachsmoothed pixel is the average value of the pixels in a 3× 3 window centered around that pixel. Figure 1a depictshow pixel values from an M × M image are mapped ontoN PEs interconnected as a × mesh, assumingsquare subimages. Each subimage is (M/ ) × (M/ )pixels. As the figure shows, the pixels are mapped ontothe PEs so that adjacent subimages are mapped to adja-cent PEs in the mesh.

During phase 1 of the algorithm, each PE executesthe smoothing operation for those pixels within the inte-

NN

NN

NN

.

PE NPE 0 PE 1 . . .

. . .

PE N –1

PE N –1

...

.... . .

(a) (b)

Mpixels

M pixels

pixels

PE iPE i

PEs

1 pixel

1 pixel

1 pixel

1 pixel

PE N N –1

M/ N

N

PEsN

pixelsM/ N

pixelsM/ N

pixelsM/ N

pixelsM/ N

pixelsM/ N

Figure 1. (a) Mapping an (M / ) × (M / ) subimage onto each PE of amesh and (b) pixels that must be received by PE i from neighboring PEs.

NN

Page 9: Is an Alli gator Better Than an Armadillo? Than anmeiyang/ecg702/proj/isalligator... · Is an Alli gator Better Than an Armadillo? D esigning a parallel machine would be much easier

26 IEEE Concurrency

rior (not along the edge) of its subimage. For this phase,each PE has all the pixel values required for all thesmoothing operations in its local memory; thus no datatransfers between PEs are required.

During phase 2, the M / pixel values along eachside of each square subimage are transferred to neigh-boring PEs, as Figure 1b shows. The four corner pixelsfrom the diagonal neighbors are sent through interme-diate PEs. The two pixel values needed by PE i from PEi − + 1 and PE i + + 1 will already have beentransferred into PE i + 1, so only one transfer each isneeded to move them from PE i + 1 to PE i. The situa-tion for the two pixel values from the other diagonalneighbors is similar. Therefore, four transfers areneeded for the diagonal pixels.

In phase 3 of the algorithm, the smoothing operationis applied to pixels on the boundary of each subimage.

Thus, during phases 1 and 3 combined, each PE per-forms M2/N smoothing operations and all N PEs do thisconcurrently. For phase 2, 4(M / ) + 4 data transfersbetween PEs are required on the N PE mesh, where for each transfer up to all N PEs may send data simul-taneously.8

For 16 PEs, Figure 2 illustrates the required datatransfers to PE 6 for both the mesh and ring topologies.We assume that the image is distributed among the PEsfor the ring network in the same way described earlierfor the mesh (Figure 1).

Consider the number of inter-PE data transfersneeded to smooth an M × M image using N PEs with aring network. The M / pixels along the right verti-cal boundary of each subimage must be transferred fromeach PE i – 1 to PE i. Likewise, with each left verticalboundary from PE i + 1 to PE i. To transfer the M /pixels along the upper horizontal subimage boundaryof each PE i + to PE i requires (M / ) = M datatransfers, because PE i +√N and PE i are separated by

links. Likewise, to transfer the M / pixels alongthe lower horizontal subimage boundary from each PEi – to PE i. As with the mesh network, pixels fromthe diagonal neighbors can be sent through intermedi-ate PEs. The total for the ring is 2M / + 2M + 4 inter-PE data transfers. For 1 < ≤ M, this is greater thanthe required number of inter-PE data transfers for themesh, given by 4(M/ ) + 4. For example, if M = 4,096and N = 256, then 8,708 inter-PE data transfers arerequired by the ring as compared to 1,028 for the meshtopology.

Is it fair to conclude that the ring topology is inferiorto the mesh, even just for this application, as the above

analysis seems to indicate? Consider the following vari-ation in the mapping scheme for the previously describedimage-smoothing algorithm. Instead of dividing the M× M image into N square (M/ ) × (M/ ) subimages,divide it into N rectangular (M/N) × M subimages.

Figure 3 shows a mapping of pixel values from theseN rectangular (M /N) × M subimages onto the ringtopology. Phase 1 still smooths the interior pixels of each subimage. In phase 2, pixel values along the hori-zontal boundaries (top row and bottom row) of the rec-tangular subimages are transferred to neighboring PEs,requiring 2M inter-PE data transfers with the ringtopology. Phase 3 still smoothes the subimage bound-ary pixels.

Consider if the N rectangular (M/N) × M subimagesapproach is used on a system with a mesh network with-out edge “wraparound” connections (for example, nodirect connections from PE 3 to PE 0 in Figure 2a). Phases1 and 3 are unchanged. In phase 2, the M pixels along eachof the horizontal boundaries (top row and bottom row) ofthe rectangular subimages are transferred from PE i + 1 to

NN

N

N

N

N

NN

NNN

N

N

N

NN

N

.

0

4

8

12

1

5

9

13

2

6

10

14

3

7

11

15

0 4

15 11

1 5

14 10

2 6

13 9

3 7

12 8

(a)

(b)

Figure 2. The source PEs for pixels needed by PE 6 for(a) the mesh and (b) the ring topologies for N = 16.

PE 0

M pixelsM pixels

PE 1M /N pixels

M /N pixels

.

.

.

PE N –1M/N pixels

(a) (b)

PE i

M pixels

Figure 3. Mapping (M /N) × M subimages onto the PEsof the ring.

Page 10: Is an Alli gator Better Than an Armadillo? Than anmeiyang/ecg702/proj/isalligator... · Is an Alli gator Better Than an Armadillo? D esigning a parallel machine would be much easier

October–December 1997 27

PE i and from PE i – 1 to PE i. Figure 4 shows the map-ping of the (M /N) × M rectangular subimages onto amesh. If PE i is directly connected to PEs i – 1 and i + 1,then 2M transfers are needed. However, PE 3 and PE 4,for example, need to exchange their boundary pixels, buthave no direct connection. Thus, M pixels are transferredfrom PE i − 1 (for example, 3) south to PE (i +1) − 1 (for example, 7) and then west across the row of PEsto PE i (for example, 4). This requires M transfers.Another M inter-PE data transfers are required tomove the M horizontal boundary pixel values from PE i

to i − 1. Therefore, the total communicationsrequired are 2M + 2M inter-PE data transfers for themesh. This is much greater than the ring for this map-ping. For example, if M = 4,096 and N = 256, the ringrequires 8,192 data transfers, while the mesh with no wrap-around connections requires 139,264 inter-PE data trans-fers. For this mapping strategy, the ring outperforms themesh. Thus, if square subimages are used, the mesh is bet-ter; if rectangular (block of rows) subimages are used, thering is better. The data allocation used determines whichnetwork is better.

The effect of the data allocation can have even moreeffect on system performance than just the number ofdata transfers. Consider image correlation, another win-dow-based task, using a window of size 21 × 21. Withrectangular subimages, PE i will need 10 rows of pixelsfrom each of PEs i + 1 and i – 1. The ring will need 10 •

2M transfers. With square subimages, PE i will need a10-pixel-wide perimeter from each of its eight neigh-boring PEs. The mesh will need 10 (4M / ) +102 • 4transfers. For M = 4,096 and N = 256, the ring willrequire 81,920 transfers and the mesh will require 10,640transfers, a difference of 71,280.

When considering the entire image, using a rectan-gular layout on the ring, there are 10 leftmost and 10rightmost columns of pixels in all PEs that will not beprocessed (because they do not have the 21 × 21 win-dow of pixels around them that is needed for the image-correlation process). Therefore, 10 • 2(M/N), or 320,

pixels will not require processing. (Using square sub-images, there is no such savings, because a subset of thePEs will not contain edge pixels of the entire image.) Ifthe time to process these 320 pixels is greater than thetime to perform 71,280 inter-PE transfers, the ring net-work will result in a smaller total execution time thanthe mesh. For the example image-correlation task witha 21 × 21 window, calculations involving 440 neighborsare needed for each of the 320 pixels, so it is possiblethat the ring network will indeed be a better choice evenwhen each network uses its best data layout.

Therefore, not only does the mapping of a task ontoa parallel machine affect the number of transfer stepsneeded for different networks, but considerations suchas the effect of avoiding processing edge pixels mayaffect overall performance and, hence, network choice.

We have shown that several funda-mental questions remain open inevaluating and comparing inter-connection networks:

• Which weighted collection of metrics should be usedto evaluate network performance?

• How can the relative cost-effectiveness of differentnetworks be determined using units other than dol-lars?

• Can analytically tractable theoretical models demon-strate sufficiently realistic behavior?

• How can benchmarks across different machines bedevised to evaluate a particular network feature?

• With the strong interdependence among application,system, and network parameters, how can the classi-cal experimental method be fairly applied to evalu-ate the effect of changing one parameter while hold-ing all others constant?

• What parameters should be held constant (and withwhat values) when comparing and evaluating net-works?

• In what sense can the implementation of two differ-ent networks be made “equal” for a meaningful com-parison?

N

N

NN

N

NN

NN

.

0

4

8

12

1

5

9

13

2

6

10

14

3

7

11

15

0

4

8

12

1

5

9

13

2

6

10

14

3

7

11

15

(a) (b)

Figure 4. Mapping (M /N) × M subimages onto the PEsof the mesh with the directions of transfers shown: (a)2M transfers; (b) 2( M) transfers.N

Page 11: Is an Alli gator Better Than an Armadillo? Than anmeiyang/ecg702/proj/isalligator... · Is an Alli gator Better Than an Armadillo? D esigning a parallel machine would be much easier

28 IEEE Concurrency

To establish a weighted set of metrics for a realisticenvironment, designers must formally incorporate qual-ity of service. Trade-offs, such as fault tolerance versusnetwork cost or number of processors supported versusphysical volume, will affect how metrics are weighted.These quality of service factors also apply to the designof other subsystems of parallel machines.

While partial approaches to exploring network per-formance have been proposed,9–11 no complete method-ology exists. Research must be done to answer the abovequestions before we can quantify the relative importanceof various network features. Only then may we be ableto say which network is best for a given situation.

ACKNOWLEDGMENTSThe authors thank Janet M. Siegel and Nancy Talbert for their com-ments, Muthucumaru Maheswaran for his assistance with the illus-trations, and Marcelia Sawyers for her careful typing. A preliminaryversion of portions of this material was presented at The New Fron-tiers: A Workshop on Future Directions of Massively Parallel Pro-cessing. This research was supported in part by Rome Laboratoryunder contracts F30602-92-C-0150 and F30602-92-C-0108.

REFERENCES

1. G.B. Adams III, D.P. Agrawal, and H.J. Siegel, “A Survey andComparison of Fault-Tolerant Multistage Interconnection Net-works,” Computer, June 1987, pp. 14–27.

2. D.H. Lawrie, “Access and Alignment of Data in an Array Proces-sor,” IEEE Trans. Computers, Dec. 1975, pp. 1145–1155.

3. V.E. Benes, Mathematical Theory of Connecting Networks and Tele-phone Traffic, Academic Press, San Diego, 1965.

4. H.J. Siegel, Interconnection Networks for Large-Scale Parallel Pro-cessing: Theory and Case Studies, 2nd ed., McGraw-Hill, New York,1990.

5. L.W. Tucker and G.G. Robertson, “Architectures and Applica-tions of the Connection Machine,” Computer, Aug. 1988, pp.26–38.

6. C.B. Stunkel et al., “The SP2 High-Performance Switch,” IBMSystems J., Vol. 34, No. 2, 1995, pp. 185–204.

7. H.J. Siegel et al., “The Design and Prototyping of the PASMReconfigurable Parallel Processing System,” in Parallel Comput-ing: Paradigms and Applications, A. Y. Zomaya, ed., Int’l ThomsonComputer Press, London, 1996, pp. 78–114.

8. H.J. Siegel, J.B. Armstrong, and D.W. Watson, “Mapping Computer-Vision-Related Tasks onto Reconfigurable Parallel-Processing Systems,” Computer, Feb. 1992, pp. 54–63.

9. S. Abraham and K. Padmanabhan, “The Twisted Cube Topol-ogy for Multiprocessors: A Study in Network Asymmetry,” J.Parallel and Distributed Computing, Sept. 1991, pp. 104–110.

10. M. Malek and W.W. Myre, “Figures of Merit for Interconnec-tion Networks,” Proc. Workshop Interconnection Networks for Par-allel and Distributed Processing, IEEE Press, Piscataway, N.J., 1980,pp. 74–83.

11. A.A. Chien and M. Konstantinidou, “Workloads and Perfor-mance Metrics for Evaluating Parallel Interconnects,” IEEE CSComputer Architecture Technical Committee Newsletter, Summer/Fall 1994, pp. 23–27.

12. S.H. Bokhari, “Multiphase Complete Exchange on Paragon, SP2,and CS-2,” IEEE Parallel & Distributed Technology, Fall 1996, pp.45–59.

Kathy J. Liszka is an assistant professor of mathematical sciences atthe University of Akron. Her research interests are parallel algorithmsand distributed computing. She received a PhD in computer sciencefrom Kent State University and is a member of the IEEE ComputerSociety, ACM, and Pi Mu Epsilon. Her address is the Dept. of Math-ematical Sciences, The University of Akron, Ayer Hall 313, Akron,OH 44325-4002; [email protected].

John K. Antonio is an associate professor of computer science atTexas Tech University. His research interests include high-perfor-mance embedded systems, reconfigurable computing, and heteroge-neous systems. He has coauthored more than 50 publications in theseand related areas. Antonio received a BS, an MS, and a PhD in elec-trical engineering from Texas A&M University. He is a member of theIEEE Computer Society, and of the Tau Beta Pi, Eta Kappa Nu, andPhi Kappa Phi honorary societies. He is the program chair for the1998 Heterogeneous Computing Workshop, and has been on theOrganizing Committee of the International Parallel Processing Sym-posium for the last five years. His address is the CS Dept., Texas TechUniversity, Box 43104, Lubbock, TX 79409-3104; [email protected].

Howard Jay Siegel is a professor of electrical and computer engi-neering and coordinator of the Parallel Processing Laboratory at Pur-due University. His research interests include heterogeneous com-puting, parallel algorithms, interconnection networks, and the PASMreconfigurable parallel computer system. He received a BSEE and aBS in management from the Massachusetts Institute of Technology,and an MA, MSE, and a PhD in computer science from PrincetonUniversity. He is an IEEE Fellow and will be inducted as an ACMFellow in 1998. Siegel has coauthored more than 240 technical papers,was a co-editor-in-chief of the Journal of Parallel and Distributed Com-puting, and served on the Editorial Boards of the IEEE Transactions onParallel and Distributed Systems and the IEEE Transactions on Comput-ers. His address is Parallel Processing Laboratory, ECE School, Pur-due University, West Lafayette, IN 47907-1285; [email protected].

.