
EE482 FINAL PROJECT REPORT

4-ary 4-cube Implementation of 1024 Node Interconnection Network

Jinyung Namkoong Nuwan Jayasena

Manman Ren Mohammed Haque

May 28th, 1999


Introduction

A 1024-node, 10Gb/s network with a 25% duty cycle has been designed. Out of many possible solutions, a 4-ary 4-cube network with 4-to-1 concentrators has been selected for both cost and performance reasons. Contrary to the popular aversion to networks of dimension higher than 3, this topology results in very efficient packaging given the prescribed constraints. It also has one of the smallest network diameters for its size. Because the 4-ary 4-cube network has a γavg of 0.5, a speedup of 2 can be provided with 10Gbps channels. (However, it was realized during simulation that this channel bandwidth presents a bottleneck at the output of the concentrators due to overhead.)

The flow control is a hybrid of SAF and VCT. Though the arbitration for the next hop begins before the entire packet has been received, and there are virtual channels to allow more routing options, a packet always stays together (i.e., there is no flit-level multiplexing). This scheme is expected to be as efficient as flit-level switch allocation for packet sizes as small as 16 bytes (which map to 3 phits in our architecture), and the packet overhead and token logic complexity are somewhat reduced. Routing is fully adaptive. The expectation is that with this kind of flow control, deadlocks will be very infrequent. This served its purpose at least for the short simulations. However, deadlock configurations are possible in this network, and an actual implementation should include some kind of deadlock recovery scheme. Alternatively, the routing can easily be changed to DOR, which should perform as well as fully adaptive routing, at least for random traffic.

Actual performance measurements are few due to the many bugs that came up during the last-minute effort to run the simulation. The code is still believed to be significantly bug-ridden. However, a 64-node 4-ary 2-cube version of the network has been tested at 25% load and verified to be at least functionally correct. Some channels are idle between packets when they should not be, suggesting there are still bugs leading to inefficiency, but the injected packets do arrive at their destinations.

One of the hard constraints for this design was the latency requirement of 125nsec. The average zero-load latency for random traffic is designed to be about half of this budget, to allow for contention delay. This almost certainly means that for the worst-case permutation (with a hop count of 8, instead of 4 for random traffic), the latency requirement would never be met. A large portion of this delay is the wire delay, which is abstracted to include the synchronization delay for the different clock frequencies of wire and router.

The rest of this report goes through the design highlights in more detail.


Topology Choice

Topology selection is probably the most important decision in network design. Almost all other design parameters depend heavily on this choice. In this design, the determining factors were packaging cost and the expected performance given by γ and the required speedup.

The two competing topologies were the Benes network and the torus.

Benes Network

A Benes network can be constructed by putting two butterfly networks back-to-back. Since channels are bidirectional, the second half of the Benes network can be folded onto the first half. The attractiveness of the Benes network comes from simple routing and flow control. It also turns out to be the cheapest possible solution, because the latter stages of a butterfly network are nicely localized, allowing many switches to fit into one chip. It is well known that γmax for a butterfly network under random traffic is 1. Under permutation traffic, however, γmax can be quite high. The elegance of the Benes network is that, with a proper routing scheme, this problem is solved, and all the channels can be made to have the same γmax of 1. The required flow control is also very simple in concept. If the routing on the forward path is completely random, any permutation pattern at the input, which congests certain channels, would be lost by the mid-point. That means the channels on the backward path see the same load as under a random traffic pattern. In other words, the channels on the forward path are evenly loaded due to random routing, and the ones on the reverse path are evenly loaded due to random traffic. This truly achieves equal γ for all channels for any permutation, which would be hard to do in a torus network. (Distributing traffic over all possible paths can be done, but doesn't guarantee equal γ. Evenly distributing traffic over all possible paths is more difficult, and still doesn't guarantee equal γ.)

The problem with this too-good-to-be-true story, however, is that it's not trivial to do truly random routing on the forward path. What needs to happen is that a packet has equal probability of landing on any particular point after the forward path. But one doesn't want to incur contention latency on the forward path, and it's not good to waste a free output channel when it's available (i.e., if a channel sits unused because a packet demands a particular output port, that's equivalent to increasing γ in proportion to the wait time, because the idle channel can't be used for a future packet). Deflection routing therefore seems the natural choice for the forward path. This routing is not truly random (as shown by another group's simulation) and results in sub-optimal performance.

One realization of the Benes network that seems to give the best tradeoff of packaging cost and hop count consists of two 8x8 switch stages and one 4x4 switch stage. With a speedup of 1, the total cost comes out to $56,576, which includes the cost of 160 chips, 40 boards, and 4 backplanes. Of course, because of overhead and routing inefficiency, a speedup of 1 is simply not enough. The chief reason for abandoning this topology was its limited headroom for speedup: the biggest gross speedup possible was only around 1.6. One way to provide more room for speedup was to reduce the radix of the switch so that it results in 4 stages of 4x4 switches. However, this makes the Benes topology lose both its cost and hop-count advantages over equivalent torus solutions. Therefore, the decision to drop this topology was mainly due to the speedup issue, combined with the fact that the 4-ary 4-cube results in γ of 0.5, allowing a speedup of 2 with the same number of pins required for a Benes with a speedup of 1.

Torus Network

There can be many choices for k and n, only a few of which are practical. Different choices of k and n differ in cost, the speedup we can provide, γmax, and the average number of hops. The last three factors affect network performance.

The initial decision was to do no slicing (i.e., at least one node fits in a chip) and to put a 4-to-1 concentrator inside each node. That means that each chip should accommodate at least 4+2*n channels. With the pin limitation of 256 signals per chip, the number of signals per channel (channel width) is limited as shown in the following table.

k                2             4             16            256
n                8             4             2             1
avg. # of hops   4             4             8             16
γmax (random)    0.25          0.5           2             32
pin limit        ≤15 sig/ch.   ≤28 sig/ch.   ≤51 sig/ch.   ≤128 sig/ch.

The first column, which would be an 8-dimensional hypercube, is eliminated because of its high dimension and lack of advantage over the 4-ary 4-cube in terms of average hop count. The last column is eliminated because of its large hop count and large γ. This leaves the 4-ary 4-cube and the 16-ary 2-cube. The hop count and γ rows clearly suggest that the 4-ary 4-cube is the winner. An irregular torus was not considered, since γmax is determined by the largest k, and a lower k in one dimension inevitably means a larger k in another dimension. Mixing torus with mesh was not considered for similar reasons. Since an efficient packaging of the 4-ary 4-cube is possible, it seems to be the best choice.
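As a cross-check of the table's performance rows, the following minimal sketch (assuming the standard k-ary n-cube estimates for bidirectional channels: average random-traffic hop count n·k/4 and peak channel load k/8) reproduces the figures for the two surviving candidates:

```python
# Sketch: standard k-ary n-cube estimates under uniform random traffic,
# assuming bidirectional channels (avg hops = n*k/4, gamma_max = k/8).
def avg_hops(k, n):
    return n * k / 4      # k/4 average hops per dimension, n dimensions

def gamma_max(k, n):
    return k / 8          # peak channel load relative to injection rate

for k, n in [(4, 4), (16, 2)]:        # the two surviving candidates
    print(f"{k}-ary {n}-cube: {avg_hops(k, n):.0f} hops, "
          f"gamma_max = {gamma_max(k, n)}")
# 4-ary 4-cube: 4 hops, gamma_max = 0.5
# 16-ary 2-cube: 8 hops, gamma_max = 2.0
```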

A feature of the 4-ary 4-cube is that however one divides the network in half, the bisection channel bandwidth is enough to support even the worst permutation traffic, where all nodes in one half communicate with all nodes in the other half. This is not true for any other torus with k>4. This is simply another way to interpret the fact that γmax(random) = 0.5.

Latency Issue

A latency of 125nsec is equivalent to 50 2.5nsec cycles. (Even though the final router actually runs at 5nsec cycles, it's easier to normalize latency to the given 400MHz clock.) In choosing the topology, the hop count is also an important factor, since it determines how many cycles can be allocated to each hop. For the 4-ary 4-cube topology, the average hop count for random traffic is 4, and the worst permutation hop count is 8. If this latency requirement is to be met for the worst permutation traffic, each hop should take fewer than 6 cycles. First, the wire delay is assumed to be 5nsec = 2 cycles. (This is a bit naïve, since it assumes that all the clocks are synchronous.) If each router delay is 4 cycles, only two cycles of the entire transit are left for contention latency. And this is even before concentrator/deconcentrator latency is taken into account. Therefore, it seems certain that this network will not meet the latency requirement for the very worst permutation. For random traffic, the average number of hops is 4, and the zero-load latency is about 28 cycles.
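The budget arithmetic above can be condensed into a few lines (a sketch using only the figures quoted in this section):

```python
# Sketch of the worst-case latency budget in normalized 2.5 ns cycles.
budget = 125 / 2.5                    # 125 ns requirement -> 50 cycles
worst_hops = 8                        # worst permutation on the 4-ary 4-cube
per_hop_budget = budget / worst_hops  # 6.25 cycles available per hop
wire, router = 2, 4                   # assumed cycles per hop (see above)
slack = budget - worst_hops * (wire + router)
print(per_hop_budget, slack)          # 6.25 cycles/hop, 2 cycles of slack
```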


Packaging and Cost

The idea is to start from a low dimension and build up to higher dimensions. Each board is a 4-ary 1-cube, and four adjacent boards form a 4-ary 2-cube. Since each backplane can contain up to 16 boards, a backplane can be a 4-ary 3-cube. Finally, four such backplanes are combined to form the 4-ary 4-cube.

It turns out that two choices are possible for the number of wires per channel. A channel width of 10 signals (10Gbps/channel) allows two nodes to fit in one chip, reducing cost, while providing a speedup of 2 for random traffic. On the other hand, a channel width of 20 signals (20Gbps/channel) results in only one node per chip, but the speedup is 4 for random traffic. Initially, the second option was chosen. However, the final choice is the first option (10 signals/channel) for its large cost saving compared to the other option. The plan was to change back to the 20Gbps/channel option if the simulation results were bad. However, there was not enough time to do enough simulation. Either way, the only difference between the two options is the number of chips; everything from the board level up stays the same.

If 2 nodes, including their concentrators, are to fit into one chip, the total number of channels per chip is 22, out of a maximum possible 25, as shown below in Figure 1. Two such chips form a 4-ary 1-cube per board. The channels for dimension 1 are internal to the board, channels for dimensions 2 and 3 are directed to the backplane edge of the board, channels for dimension 4 are directed outward through cables, and the injection/ejection channels run through cables along the same edge.
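The pin accounting behind the 22-of-25 figure can be sketched as follows (assuming 256 signal pins per chip and the 10-signal channels chosen above):

```python
# Sketch of the chip channel budget for the two-nodes-per-chip option.
PINS, SIG_PER_CH = 256, 10
max_channels = PINS // SIG_PER_CH    # 25 channels fit within the pin budget
per_node = 4 + 2 * 4                 # 4 inj/ej + 2 channels per dimension
internal = 2                         # the dimension-1 link joining the two
                                     # on-chip nodes uses 2 channel endpoints
                                     # that never reach the pins
chip_channels = 2 * per_node - internal
print(max_channels, chip_channels)   # 25 22
```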

Figure 1: A 4-ary 1-cube fits on a board [two chips, each holding two nodes and 22 channels, form the on-board ring; the d2 and d3 channels run to the backplane edge, while the d4 and injection/ejection channels (24 per board) run through cables]


Each edge of a board can support up to 32 channels. Only 16 of these are used on the backplane edge, and 24 on the cable edge.

The next level of packaging integrates boards onto the backplane, as shown in Figure 2. The wire bisection of the backplane is ample enough to accommodate its densest portion.

Figure 2: Connection between boards on one backplane [each board is a 4-ary 1-cube; four boards form a 4-ary 2-cube, with dimension-2 and dimension-3 channels wired across the backplane]

Figure 3: Inter-backplane connections [four backplanes joined by cables]


Finally, four such backplanes are connected through cables as shown in Figure 3. Since the longest cable length is well below 4m, and all the channels for the first three dimensions are shorter than 1m, every signal can run at the maximum rate of 1Gbps, and the wire propagation delay is assumed to be less than 1nsec.

The total cost is calculated in the table below. In the following calculation, we assume that all cables are 2 feet in length.

Cost Type                   # of Items      Unit Cost   Total Cost   Notes
Chips                       128             $200.00     $25,600.00
Boards                      64              $400.00     $25,600.00
Backplanes                  4               $800.00     $3,200.00
Connectors (on boards)      (16+24)*20*64   $0.10       $5,120.00    16 input + 24 network channel connections per board, 20 wires per channel, and 64 boards.
Cables                      ½*24*64*20      $0.30       $1,536.00    24 channels per board on cables, 64 boards, and 10 wires per channel.
Connectors (on backplanes)  (16*16)*20*4    $0.10       $2,048.00    16 connections per board, 16 boards per backplane, 10 signals per channel, and 4 backplanes.

Total cost estimate: $63,104
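As a quick consistency check, the bottom line can be re-derived from the rows above (a sketch; the cable row is taken at its stated line total, and all other rows use the table's own quantities and unit costs):

```python
# Sketch: cost roll-up from the table's line items.
costs = {
    "chips":                  128 * 200.00,
    "boards":                  64 * 400.00,
    "backplanes":               4 * 800.00,
    "board connectors":   (16 + 24) * 20 * 64 * 0.10,   # $5,120.00
    "cables":                 1536.00,                  # stated line total
    "backplane connectors": (16 * 16) * 20 * 4 * 0.10,  # $2,048.00
}
print(f"${sum(costs.values()):,.0f}")                   # $63,104
```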


Routing and Flow Control

The router operates off of a 5 ns clock cycle, which provides 50 bits of data every cycle over each of the 10-bit wide links whose signals each run at 1 Gb/s. The following discussion is in terms of these 50-bit physical digits (phits).

Packet Format

The network must support packet sizes from 8 bytes to 256 bytes. Therefore, each packet consists of a header phit and one or more data phits. A 16-byte packet split into 3 phits is shown below.

Figure 4: 16-byte packet mapped to phits [header phit: C, C, V, H, destination, data; body phits: C, C, V, H, data]

Each phit contains two bits (marked C in the figure above) for conveying credits for the reverse channel; the credit scheme is described in the Flow Control section below. This credit is per port, not per virtual channel. The router architecture provides each input port with two connections to the internal crossbar, leading to a maximum of 2 credits that may be conveyed per port in one 5 ns cycle. The number of C bits set in a phit identifies the number of credits returned. The valid bit (V) in each phit identifies whether the payload section of the phit contains valid data. Note that this valid bit does not apply to the C bits described above: credits can be sent even when the V bit is low. The header bit (H) identifies whether the phit is a header or a body phit. The presence of an explicit head bit allows packets to be transmitted back-to-back without having to insert an invalid phit between packets. The destination field consists of 10 bits and identifies the destination node in the 1024-node network. The upper 8 bits of this destination field (which identify one of the 256 nodes in the 4-ary 4-cube) use relative addressing, whereas the 2 LSBs, which are used by the deconcentrator, use an absolute address. The data field contains the payload, and occupies 36 bits in the header phit and 46 bits in body phits. This provides optimal utilization for 16-byte packets, with all of the available 128 payload bits utilized. For a 16-byte packet, the overhead of the C, V, H bits and the destination field is 14.7%.
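The field accounting can be verified with a short sketch (field widths as listed above: 4 control bits per 50-bit phit, and a 10-bit destination in the header phit only):

```python
# Sketch of the phit accounting for a 16-byte packet.
PHIT = 50                     # bits delivered per 5 ns cycle per channel
CTRL = 4                      # C, C, V, H bits carried in every phit
DEST = 10                     # destination field, header phit only
header_payload = PHIT - CTRL - DEST          # 36 bits
body_payload = PHIT - CTRL                   # 46 bits
payload = header_payload + 2 * body_payload  # 128 bits = 16 bytes exactly
overhead = 1 - payload / (3 * PHIT)
print(payload, f"{overhead:.1%}")            # 128 14.7%
```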

Flow Control

Virtual Cut Through (VCT) flow control is used. Wormhole flow control was initially considered, but given the large amount of memory available to each router, VCT was found to be feasible, and was opted for. Each router has four buffers associated with each input channel, corresponding to four virtual channels. Each of these buffers is large enough to accommodate an entire packet; therefore, switch and physical channel allocation is done on a packet basis (i.e., each packet is a single flit), with all phits of a packet immediately following the header.

Each source (sender) associated with a channel keeps a count of the number of available buffers at the receiver's end through a credit scheme. Note that packets need not go to a specific virtual channel buffer at the receiver, and may be placed in any available buffer. Therefore, only the number of free buffers needs to be known; hence, each credit in this scheme corresponds to a free virtual channel buffer and is associated with a physical channel (as opposed to the flit credits associated with individual virtual channels that are used in wormhole flow control).
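A behavioural sketch of this per-port scheme (hypothetical class and method names; one credit stands for one free packet-sized VC buffer at the receiver):

```python
# Sketch: per-port credit tracking as described above. A sender may
# launch a packet only while it holds at least one credit; a credit
# represents a whole free VC buffer, not a flit slot.
class PortCredits:
    def __init__(self, vc_buffers=4):
        self.credits = vc_buffers    # all receiver buffers start free

    def try_send(self):
        if self.credits == 0:
            return False             # no free buffer downstream: stall
        self.credits -= 1            # the packet will occupy one buffer
        return True

    def receive_credits(self, n):
        assert 0 <= n <= 2           # 0-2 credits arrive per phit (C bits)
        self.credits += n
```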

Routing

A relatively simple routing scheme is used, in which all virtual channels route according to the same routing function. This simplifies the implementation and allows the greatest flexibility in allocating virtual channel buffers. To avoid deadlock, the current implementation uses minimal dimension-order routing. Given that there are four virtual channel buffers at each destination for each channel, deadlock may be relatively infrequent (note that buffers, and not channels, are the resource leading to deadlock in this case). Therefore, the best resource utilization might be achieved by using a deadlock recovery scheme as opposed to a static avoidance scheme. However, this option could not be explored due to time limitations.
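For illustration, here is a sketch of a minimal dimension-order routing function (the representation is an assumption: the upper 8 destination bits held as four radix-4 offsets, one per dimension, stepped toward zero at each hop):

```python
# Sketch: minimal dimension-order routing on a 4-ary 4-cube over a
# relative address stored as four radix-4 offsets (0 means aligned).
def dor_next_hop(offsets):
    """Return (dimension, '+' or '-') for the next hop, or None to eject."""
    for dim, off in enumerate(offsets):   # resolve dimensions in order
        if off:
            # on a 4-node ring, offsets 1 and 2 are minimal going '+';
            # offset 3 is minimal going '-' (2 could go either way)
            return (dim, '+') if off <= 2 else (dim, '-')
    return None

def address_adjust(offsets, dim, direction):
    """The per-hop update done by the input module's address-adjust block."""
    offsets[dim] = (offsets[dim] + (-1 if direction == '+' else 1)) % 4
    return offsets
```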


Router Microarchitecture

Each router implements a node in a 4-ary 4-cube with an integrated 4:1 concentrator forprocessor connections. A high-level block diagram is shown in Figure 5.

Figure 5: Block diagram of router architecture

Note that the channels to and from the network are shown separately for clarity, but in fact operate over the same set of bidirectional physical links. Routing decisions are made by each of the input modules, and the vector of all possible output ports is sent to the switch allocation logic, which arbitrates among the requests and allocates the switch appropriately. The output module is simply a flip-flop.

Zero-load router latency is two 5nsec cycles. The first half cycle (20 gate delays) is used for routing, the second half of this cycle and the first half of the second cycle (40 gate delays) are used for switch allocation, and the remaining half cycle (20 gate delays) is used to pass through the switch.
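In cycle terms (a sketch, assuming 20 gate delays per half cycle as stated above):

```python
# Sketch of the two-cycle zero-load pipeline through one router.
gates_per_half_cycle = 20
stages = {"routing": 20, "switch allocation": 40, "crossbar": 20}
total = sum(stages.values())                 # 80 gate delays
print(total / (2 * gates_per_half_cycle))    # 2.0 cycles of 5 ns
```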

[Figure 5: nine input modules (IM0 through IM8) feed an 18x9 crossbar governed by the switch allocation logic and driving nine output modules (OM0 through OM8); a concentration/deconcentration block connects the processors to the router's injection/ejection port, while the remaining ports carry the channels to and from the network.]


Input Module (IM)

Shown below is the input module with all the connections and bit widths.

Figure 6: Input Module [four VC buffers fed from the 48-bit channel input (valid + head + 46 data bits), with 2 credit bits alongside; two routing blocks operate on the selected packets A and B; an address-adjust block updates the relative address; and reqA/reqB (9 bits each), tailA/tailB, and grantA/grantB connect to the SAL]

The input module takes in packets from the channel and directs them to an available VC buffer (the credit scheme ensures that any incoming packet has a place to go). It selects two of the VC buffers that hold valid packets, then performs routing on those selected packets. Proper bypassing allows an incoming packet to be passed on to the SAL in the next cycle if the VC buffers are empty. The routing information is passed to the SAL in the form of request vectors, and these requests are granted using the grant signals. When a grant is given for a packet, a credit signal is generated to increment the credit counter in the previous node. The tail signals to the SAL mark the last phit, so that the SAL can release the switch allocation after that phit. The address-adjust block modifies the appropriate field of the relative address. One artifact of the credit being per port rather than per VC is that the number of VCs per port can be varied without affecting the design too much.

Note that the grant signal doesn't have to tell which output port the packet has been assigned; it simply signifies that the packet has won the allocation, which advances the head pointer of the queue so that the next phit is at the front. Where the phit goes is determined by the switch configuration. Since much of the protocol resides in the SAL, the logic for the IM is simple, and should fit within 20 gate delays.

Each of the four VC buffers is a queue of 45 phits, the number required for the largest packet to fit in a VC buffer. The memory word size is 49 bits (data + valid + head + tail). Therefore, each VC buffer is 2205 bits, and since there are 36 such virtual channels in a router, the total memory per node is 79,380 bits.


All the memory is two-ported (one write and one read), and since we're allowed up to 500K/2² = 125Kbits, there's more than enough memory for one node. This is a mistake made early in the design: initially, one node per chip was planned, and there is enough memory for that case. However, for two nodes per chip, the memory is not enough. This means that the number of VCs per port should be reduced to 3.

There are two major steps in the input module: the first distributes the incoming packets to the proper VC buffers, and the second selects the appropriate VC buffers to be connected to the switch. The first step is done in a round-robin way; the second, first-come, first-served. Currently, if a VC buffer is selected for the switch input, a new assignment doesn't happen until that particular packet has been transmitted. A better way would be to select another packet when a packet is blocked.
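The memory accounting above can be sketched as follows (the 2² derating for two-ported memory is an assumption chosen to match the 500K-to-125Kbit figures quoted):

```python
# Sketch of the VC buffer memory budget.
MAX_PACKET_BITS = 256 * 8                   # largest packet: 256 bytes
HEADER_PAYLOAD, BODY_PAYLOAD = 36, 46       # bits per phit (see Figure 4)
# one header phit plus enough body phits for the rest -> 45 phits
phits = 1 + -(-(MAX_PACKET_BITS - HEADER_PAYLOAD) // BODY_PAYLOAD)
word = 46 + 3                               # data + valid + head + tail
buffer_bits = phits * word                  # 45 * 49 = 2205 bits per VC
node_bits = 9 * 4 * buffer_bits             # 9 ports x 4 VCs = 79,380 bits
chip_budget = 500_000 // 2**2               # 125 Kbits of two-ported memory
print(node_bits <= chip_budget)             # True: one node per chip fits
print(2 * node_bits <= chip_budget)         # False: two nodes do not
```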

Switch Allocation

A greedy allocation scheme is used. The issue is to do the allocation within 40 gate delays. Since the greedy algorithm is inherently biased, a better scheme would use some kind of round-robin priority. However, analysis shows that it's not possible to do a round-robin greedy scheme within the time budget.

Switch allocation top level module:

The module's interface signals are:

• Request signals Req17 to Req0, each 9 bits wide
• Credit lines Credit_incr7 to Credit_incr0, each 2 bits wide (00: no credit; 01 or 10: one more credit; 11: two more credits)
• Grant signal, 18 bits wide (each bit acknowledges the corresponding input request)
• Switch control signals Sw_ctrl8 to Sw_ctrl0, each 18 bits wide
• Tail signal, 18 bits wide (each bit flags the tail phit of the corresponding input request)
• Clock

Figure 7: Top-level diagram for switch allocation


Inside the SAL there are 4 sub-blocks: the switch allocation algorithm implementation, the credit counter, the hold logic, and the output port status logic, as shown in Figure 8. Each of the blocks is explained in detail later.

[The combinational switch allocation algorithm takes the request signals (Req17 to Req0, each 9 bits wide) and produces a candidate grant signal (Cl_grant, 18 bits wide) and candidate switch control signals (Cl_sw_ctrl8 to Cl_sw_ctrl0, each 18 bits wide); hold logic turns these into the final grant and Sw_ctrl outputs, while the credit counter and output port status blocks gate the allocation.]

Figure 8: Internal structure of the SAL

Switch allocation algorithm:

The greedy algorithm is implemented using a matrix of cells forming carry chains. Figure 9 below shows this matrix structure. The 18 request vectors from the input modules form the inputs to the 18 rows of the matrix, and the 9 switch control vectors are taken from the outputs of the 9 columns. The gray boxes indicate requests that can never occur, since under minimal routing a packet shouldn't go out through the channel it came in on. There are both vertical and horizontal carry chains: the vertical chain of each column ensures that at most one input port is assigned to that particular output port, while the horizontal chain makes sure that the input port corresponding to each row is assigned to at most one output port.


The issue is to minimize the critical path gate delay, which is achieved by minimizing both the number of gate delays per stage and the number of stages in the longest carry chain. Because implementing round-robin priority encoding would require at least one more mux delay per stage, the start point of the carry chain cannot be changed dynamically. This certainly introduces significant bias to the system, but the average latency is hopefully not too much affected by this bias. Figures 9 and 10 show the resulting carry propagation pattern and the cell schematic. This carry pattern, though biased in a peculiar fashion, achieves the least number of carry stages by overlapping the horizontal and vertical carries. Alternatively, it can be viewed as distorting the rectangle so that the critical-path diagonal distance is reduced. Analysis shows that the longest path takes 34 gate delays.

[The matrix rows correspond to the input ports (d1+, d1-, d2+, d2-, d3+, d3-, d4+, d4-, inj) and the columns to the output ports (d1+ through d4-, ej).]

Figure 9: Switch allocation algorithm matrix
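Behaviourally, the matrix computes a fixed-priority greedy match between the 18 crossbar inputs and the 9 outputs. The following sequential sketch computes the same function (the hardware resolves it with combinational carry chains rather than a scan):

```python
# Sketch: fixed-priority greedy allocation, equivalent to the matrix.
# requests[i] is the 9-entry request vector of crossbar input i.
def greedy_allocate(requests):
    taken = [False] * 9
    grants = {}                           # input index -> output index
    for i, req in enumerate(requests):    # horizontal chains: row priority
        for o in range(9):                # vertical chains: column priority
            if req[o] and not taken[o]:
                taken[o] = True
                grants[i] = o
                break                     # at most one output per input
    return grants

# Example: inputs 0 and 1 both want output 2; the fixed priority lets
# input 0 win, and input 1 falls through to its other requested output.
reqs = [[0] * 9 for _ in range(18)]
reqs[0][2] = reqs[1][2] = reqs[1][5] = 1
print(greedy_allocate(reqs))              # {0: 2, 1: 5}
```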

(Refer to the Appendix for the logic of the remaining SAL blocks.)

Concentration/Deconcentration

Four processor connections are concentrated onto a single network channel. The concentrator implements the processor interface, a limited amount of buffering to handle contention due to multiple overlapping requests, and the arbitration between multiple requesters. The deconcentrator is relatively simple and does not have to handle any arbitration or contention cases. Packets destined for processors connected to the same concentrator are routed to the network and received back through the first router's switch.


[Each cell takes its request bit (reqIn), the vertical carry in (vCin), the horizontal carry in (hCin), and the output-available signal (avail), and produces its switch control bit (Sw_ctrl) along with the vertical and horizontal carries out (vCout, hCout).]

Figure 10: Logic for each cell of the switch allocation matrix

Figure 11: Block diagram of concentrator/deconcentrator

[Four concentrator input modules (IM0 through IM3) buffer phits from the processors, raising full when their buffers fill. A header detect block issues requests to the round-robin arbiter and hold logic, which grants one input onto the network channel; an absolute-to-relative address conversion is applied on the way out. On the deconcentrator side, head/tail detect and hold logic steers phits arriving from the network to the destination processor and returns credits.]


The details of the processor interface are abstracted in this implementation with a simple full signal to the messaging interface. This signal is generated when the local buffering in the concentrator can no longer accommodate more data.

The buffers in the concentrator are phit buffers, compared to the packet buffers in the routers. Since the input links to the concentrator are each dedicated to one packet source, the phit-by-phit communication does not complicate the implementation or introduce extra overhead. When a processor sends a header phit to the concentrator, if its concentrator input buffer is empty, a request to the arbiter is generated immediately, and the grant will be generated within the same cycle if possible. In this case, the data effectively bypasses the buffer and passes through the concentrator in a single cycle. If there is data in the input buffer, however, any new packet is buffered, and any grant received is applied to the buffered packet(s) to retain FIFO ordering.

The arbitration within the concentrator is a simple many-to-one problem, and is implemented as a finite state machine. This FSM implementation is able to produce a grant within 2 to 3 gate delays of receiving a request (if the output channel is currently unoccupied). The grant generation occurs in the same cycle as the request is received whenever possible.
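A behavioural sketch of the 4-to-1 arbitration (modeled here with a rotating pointer; the actual design is an FSM that can grant combinationally in the same cycle):

```python
# Sketch: round-robin arbitration among the four processor inputs.
class ConcentratorArbiter:
    def __init__(self, n=4):
        self.n = n
        self.last = n - 1                # priority rotates after each grant

    def grant(self, requests, channel_busy):
        """requests: list of n bools; returns granted index or None."""
        if channel_busy:                 # hold: a packet is still in flight
            return None
        for k in range(1, self.n + 1):   # scan starting past the last winner
            i = (self.last + k) % self.n
            if requests[i]:
                self.last = i
                return i
        return None
```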

The concentrator also converts the absolute address of the destination to a relative address based on the local concentrator ID.

Output Module (OM)

If an output to the network must go off-chip, the output module contains the synchronization logic to connect the 1 Gb/s links to the 5 ns clock domain. If an output connects to another router on the same chip, the output module is not present, and the data is latched at the input of the next router.


Simulation Result

Simulation results were obtained using a Verilog simulation of a 4-ary 2-cube network with 4-to-1 concentration. We were not able to perform simulations of the full 4-ary 4-cube implementation due to the large size of the system. The figures reported here are from 1000-cycle simulations (after a 200-cycle warm-up).

Status of the Simulator

We were able to perform simulations and obtain latency figures using the simulator, but it needs to be improved and optimized for better performance. We noticed several inefficiencies in the Verilog implementation that were not inherent in the design of our network. One such inefficiency is that the switch is not fully utilized. The original design was capable of transmitting multiple packets back-to-back over the same link, and the packet format was designed to facilitate this. In the Verilog model, however, one idle cycle is present between packets. Another inefficiency is in the selection of virtual channels at an input module for requesting the switch. Each input module selects 2 of its 4 VCs to request the switch for. Ideally, this selection should occur every cycle except when packets are being transferred. However, our Verilog model selects two VCs to request for, and does not re-select until those selected VCs have been granted the switch. This hampers the ability of packets to bypass other packets that may be blocked. Our switch allocation mechanism is also not fair, which could lead to starvation. This manifests as a very large spread in packet latencies, since some packets receive the switch soon while others starve for a long time before being allocated the switch. While this presents a fairness issue, we believe it does not significantly affect the overall average latency. The bottleneck at the concentrator output is another significant factor degrading network performance.

Our original design was intended to be clocked at 200 MHz (5 ns cycle time). During simulation, however, as a result of all of the above inefficiencies, we found that the required load could not be handled by the network. Since we did not have adequate time to fully optimize the simulator or enhance the overall design, a 400 MHz clock (2.5 ns cycle time) was used for the simulator. Ideally, we would have liked to do another iteration of the design-simulate loop, using feedback from the simulation to improve both the design and the simulator.

Simulation Results

Table 1 lists latency measures obtained for the 4-ary 2-cube with 4-to-1 concentration (i.e., a 64-processor network) simulated at 25% traffic from each processor. Please note that our simulator does not model the wire and synchronization delays; these figures therefore exclude that delay.


Table 1: Latency for 25% loading on 4-ary 2-cube (excluding wire delays)

Traffic Pattern              Average Latency
Random                       41 cycles (102.5 ns)
Permutation with seed 888    25 cycles (62.5 ns)
Permutation with seed 234    27 cycles (67.5 ns)

In our simulations with one routing node, the average delay was found to be 17 cycles. This includes the time spent in the injection queue, overhead and contention at the concentrator, and the delay of the first router. Using this figure and the above latencies for the 4-ary 2-cube, we estimated the routing delay per hop in the 4-ary 2-cube, which requires 2 extra hops on average for random traffic compared to the 1-node network. Since a 4-ary 4-cube requires two extra hops compared to the 4-ary 2-cube on average for random traffic, the delay per hop was then used to extrapolate the 4-ary 2-cube figures to estimate average latencies in the 4-ary 4-cube. Please note that this reasoning applies to the random traffic case; the permutation traffic latencies are extrapolated similarly, but may not be very accurate. These estimates are listed in Table 2.
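The extrapolation, as a sketch (using the 17-cycle single-node figure and the Table 1 measurements):

```python
# Sketch: per-hop extrapolation from the 4-ary 2-cube to the 4-ary 4-cube.
single_node = 17                       # injection + concentration + 1 router
table1 = {"random": 41, "perm 888": 25, "perm 234": 27}
for name, latency in table1.items():
    per_hop = (latency - single_node) / 2   # the 2-cube adds 2 hops on avg
    print(name, latency + 2 * per_hop)      # the 4-cube adds 2 more hops
# random 65.0, perm 888 33.0, perm 234 37.0  (Table 2)
```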

Table 2: Latency estimates for 25% loading on 4-ary 4-cube (excluding wire delays)

Traffic Pattern              Average Latency Estimate
Random                       65 cycles (162.5 ns)
Permutation with seed 888    33 cycles (82.5 ns)
Permutation with seed 234    37 cycles (92.5 ns)

If wire delays are considered, we expect the worst case (for a 4-ary 4-cube) to be 7 wire delays of 2 clock cycles each (also allowing for clock recovery and synchronization). The remaining hop of the worst-case 8 hops is guaranteed to be on an on-chip channel, since two routers are located on one chip. The average case would therefore incur three wire and synchronization delays (with 1 hop on chip). Table 3 gives the average latency estimates with the wire delays included.
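Folding the wire delays into the Table 2 estimates (a sketch with 2 cycles per off-chip hop, 3 such hops on average and 7 in the worst case):

```python
# Sketch: adding wire/synchronization delay to the Table 2 estimates.
WIRE = 2                                          # cycles per off-chip hop
for name, est in {"random": 65, "perm 888": 33, "perm 234": 37}.items():
    print(name, est + 3 * WIRE, est + 7 * WIRE)   # avg hops / max hops
# random 71 79, perm 888 39 47, perm 234 43 51  (Table 3)
```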

Table 3: Latency estimates for 25% loading on 4-ary 4-cube (including wire delays)

Traffic Pattern              Average Latency With      Average Latency With
                             Average # of Hops         Maximum # of Hops
Random                       71 cycles (177.5 ns)      79 cycles (197.5 ns)
Permutation with seed 888    39 cycles (97.5 ns)       47 cycles (117.5 ns)
Permutation with seed 234    43 cycles (107.5 ns)      51 cycles (127.5 ns)

We were unable to identify the cause of the abysmal performance for random traffic. Our belief at this point is that the unfair switch allocation is a major factor. While we were unable to study the issue in detail due to time limitations, it was clear during simulation that a very long latency was experienced by a very small fraction of the packets, leading to an increased average latency. It is likely that some unlucky packets must traverse paths that require a sequence of least-preferred selection choices in the switch allocation, leading to extremely large latencies. Even after the end of the simulation's packet injection window, these unlucky packets may still continue to be delayed by packets that no longer affect the latency calculation. Permutation traffic, on the other hand, probably does not require worst-case choices at each step. While this does not provide an advantage in terms of the average latency of all packets, it does improve the average latency of packets injected within a given time window. Therefore, we believe that by improving the switch allocation scheme, we will be able to significantly improve the performance of random traffic.


Conclusion and Discussion

The 4-ary 4-cube topology seems to be a good solution for the problem at hand, both in terms of cost and hop count. There exists an efficient way to package such a topology given the constraints, and the low γ of 0.5 means that a 10Gbps channel can achieve a speedup of almost 2.

There are a couple of significant design errors that were detected too late. One is the concentrator bottleneck. Even though the 10Gbps channel provides γ of 0.5, the concentrator output, which is the injection channel, has a theoretical γ of 1. That means we are providing only a speedup of 1 for that link. Since a packet takes 3 cycles, and at a 25% duty cycle each node produces one packet every ten 5ns cycles, there are four packets (12 cycles to transmit) arriving at that link every 10 cycles. Fixing this problem requires an increase in speedup, a lower level of concentration, increased concentrator output bandwidth, or a packet format change. The second error is with the memory. To accommodate even the largest packet inside each VC buffer, each node requires about 79Kbits. Since all memory is two-ported (one read and one write), the total available memory per chip is 125Kbits. Therefore, if there's only one node per chip, there is more than enough memory; however, in our current solution there are two nodes per chip, resulting in a memory shortage. The solution may involve reducing the number of VCs per port to 3, or just putting one node per chip at the expense of cost.
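For concreteness, the concentrator arithmetic (a sketch using the 16-byte packet and duty-cycle figures above):

```python
# Sketch of the concentrator-output oversubscription.
packet_cycles = 3      # a 16-byte packet occupies the link for 3 cycles
interval = 10          # 25% duty cycle: one packet per node per ten cycles
processors = 4         # concentrated onto one injection channel
demand = processors * packet_cycles
print(demand / interval)       # 1.2 -> the injection link is overloaded
```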

It turns out that the latency requirement of 125ns (25 5nsec cycles) is a very strict one. Under the zero-load condition, the average hop involves a 2-cycle router delay and a one-cycle wire delay. Since the average hop count for random traffic is 4, this results in 16 cycles. Adding the concentrator/deconcentrator delay and the serialization delay, the zero-load latency turns out to be around 20 cycles. This leaves very little room for contention or overhead. Because of the synchronization delay, the wire delay is a significant portion of the total latency. One way to minimize this cost is to fit as many routers inside a chip as possible. However, each bit-slicing increases overhead. It would be interesting to see where the crossover point occurs.

With a sufficient number of VCs and a low network diameter, it seems that deadlock would be very unlikely in this network, even with fully adaptive routing. However, pathological deadlock conditions can occur and, in an actual implementation, need to be considered. A simple solution would be to use turn-restricted routing or DOR, as we did in the simulations.


Appendix: Other Logic Blocks inside SAL

These are some miscellaneous blocks used in SAL.

Credit counter module:

The purpose of this block is:
• to maintain a credit status for each output port of the router, after taking into consideration the received credit count from each of the input ports of the next downstream router, and
• to provide output port available status to the switch allocation block.

This block receives another signal from the switch allocation block that indicates whether the output port has been assigned to any request. If so, the credit status is decremented by one.

It also receives a credit-back signal from the downstream router once an input port there transmits one flit through its switch. The credit that is sent corresponds to the available buffer space for the whole input port, rather than for each VC of the input port. Since each input port can transmit from at most two of its four virtual channels, the credit-back signal is two bits wide, each bit corresponding to one of the two transmitting VCs. Therefore, at every cycle each output port has to update its previous credit count by an increment of 0, 1, or 2.

The other signal that this block receives is the output-state signal, which indicates whether this output is currently assigned to any packet. Therefore, the output can only be assigned to an incoming request if (see the sketch below):
• there is credit reserved for the output port, and
• the output port is currently not assigned to serve any request.
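A behavioural sketch of the counter (saturating at the four downstream VC buffers; in the hardware the count is a 3-bit code whose MSB doubles as credit_avail):

```python
# Sketch: credit counter for one output port (0..4 credits).
class CreditCounter:
    def __init__(self):
        self.count = 4               # downstream buffers all free at reset

    def update(self, sub, credit_incr):
        """sub: port allocated this cycle; credit_incr in {0, 1, 2}."""
        assert credit_incr in (0, 1, 2)
        self.count = min(4, max(0, self.count - int(sub) + credit_incr))

    @property
    def credit_avail(self):          # the MSB test in the real encoding
        return self.count > 0
```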


[The registered 3-bit credit count feeds three parallel update paths: subtract 1; subtract 1 then add 1; and subtract 1 then add 2 (the subtraction applies only when the sub input is asserted); credit_incr selects among them. The encoding of the credit count is: 0xx: no credit; 100: credit is 1; 101: credit is 2; 110: credit is 3; 111: credit is 4. The MSB of the credit count thus tells whether there is any credit at the downstream router, and directly drives the credit_avail output.]

Figure 11: Credit counter implementation

Each block inside the above module has a 2-gate delay; their gate-level implementations are shown in Figure 12.

Hold logic:

This logic block is needed to ensure that:
• once an output is assigned to an input port, its switch control signal is held for the whole duration of the packet (i.e., until the tail flit passes through it);
• the entry in the output state table is held high for the above duration; and
• the grant signal to the input port is held high for the same duration.

Implementation:

This logic takes into consideration the previous states of the grant signal, output-state, switch control signal, and tail-flit signal. It holds the above-specified signals asserted as long as the tail-flit signal is not asserted. Once the tail-flit signal is asserted (indicating that the next flit is the tail flit), it de-asserts the named signals in the following cycle.
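Sketched as a per-output register update (signal names are hypothetical, following the description above):

```python
# Sketch: the hold register for one output port. Grant, switch control,
# and output-state stay asserted from allocation until the cycle after
# the tail flit is seen.
def hold_next(held, new_grant, tail_seen):
    if tail_seen:                  # tail phit passing: release next cycle
        return False
    return held or new_grant       # latch on a new grant, hold otherwise
```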


Figure 12: Gate-level implementation of the update blocks [subtract 1 depending on the input sub signal; subtract 1 depending on sub, then add 1; subtract 1 depending on sub, then add 2]

Figure 13: Hold logic [the registered tail-flit signal gates a temporary signal to produce the held signal]

Page 25: FinalReportcva.stanford.edu/classes/ee382c/ee482b_spr99/docs/... · Title: FinalReport Author: lspeh Created Date: 5/29/1999 12:45:40 PM

Output port status logic:

[One hold logic / credit counter pair serves each of the nine output ports; the hold logic supplies the sub signals (sub0 through sub8) to its credit counter, and the counters' credit_out signals combine with the hold state to produce out_avail0 through out_avail8 for the allocation matrix.]

Figure 14: Output port status logic
