Winter 2006 ENGR 9861 – High Performance Computer Architecture March 2006 Interconnection...

Winter 2006

ENGR 9861 – High Performance Computer Architecture

March 2006

Interconnection Networks

Introduction

When considering interconnection networks for parallel computation, many concepts are shared with LANs (Local Area Networks) and WANs (Wide Area Networks).

Interconnection networks for parallel computing are a wide and interesting field, with many areas of theoretical and practical research.

Reference: "Parallel Computer Architecture", Ch. 10, Culler and Singh.

Introduction

There are different ways of looking at this topic:

Firstly, the interconnection structure often has mathematical properties that reflect the communication patterns of important algorithms (a regular structure).

Secondly, the design of the physical link between asynchronous elements is a huge area of research.

Thirdly, competition for shared resources within a network is also a large area of research.

Basic Definitions (Ch. 10.1)

[Figure: only the label "Communication Assist" survives.]

Basic Definitions

Some terms:

CA: Communication Assist
NI: Network Interface
Mem: Memory
P: Processor

Communication requirements for this generic view:

The interconnection network must provide network transactions that support the programming model.

Latency should be minimized.

An adequate number of concurrent transactions must be supported.

Basic Definitions

Physical Protocol: converts analog signals into digital ones.

Link Protocol: is responsible for grouping symbols into packets.

Node Level Protocol: is responsible for attaching information so that the target CA can accomplish the transfer.

Basic Definitions

We can look at an interconnection network (IN) as a graph that contains vertices (processing hosts or switch elements) and channels between vertices.

Channels have the following properties:

Width w (in bits)
Signaling rate f = 1 / T, where T is the cycle time

Basic Definitions

Channel bandwidth: b = w * f

The amount of data transferred across a link in one cycle is called a physical unit, or phit.

Switches connect input channels to output channels. The number of channels connected is called the switch degree.
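The channel definitions above can be turned into a small calculation. This is a minimal sketch; the function names and the sample values (a 16-bit channel with a 4 ns cycle) are assumptions chosen only for illustration.

```python
def channel_bandwidth(width_bits: int, cycle_time_s: float) -> float:
    """Channel bandwidth b = w * f in bits per second, with f = 1 / T."""
    signaling_rate_hz = 1.0 / cycle_time_s
    return width_bits * signaling_rate_hz


def phits_in_packet(packet_bits: int, width_bits: int) -> int:
    """Number of phits (w bits moved per cycle) needed to carry a packet."""
    return -(-packet_bits // width_bits)  # ceiling division


# Hypothetical channel: w = 16 bits, T = 4 ns
print(channel_bandwidth(16, 4e-9))
print(phits_in_packet(1000, 16))
```

A packet whose size is not a multiple of w still needs a whole number of cycles, which is why the phit count is rounded up.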

Network Components

The following components make up a network:

Topology: the structure of the network graph (2D grid, 3D cube, irregular, etc.).

A direct network has a host (processing element) connected to each switch.

An indirect network has hosts connected to only a subset of the available switches. The hosts are then on the edge of the network graph.

Network Components

Routing Algorithm: the path that a message takes through the network is called a route. The procedure that determines which route each message takes is called the routing algorithm.

Switching Strategy: how a message travels its route.

Circuit Switching: an end-to-end connection is established and the same route is used until the entire message is transferred. The route can be used in the reverse direction as well.

Packet Switching: the message is broken into packets, each with its own routing information. Each packet can be routed individually. This requires routing tag overhead.

Network Components

Flow Control Mechanism: controls when a message, or parts of it, moves along its route. Flow control becomes a necessity when several messages need the same network resource at the same time.

Flow control options for a message that cannot proceed: stalled in place, buffered, re-routed, or discarded.

Network Components

The unit of information that the nodes in a network accept or reject as a whole (the unit of flow control) is called a flit.

How big can a flit be?

ANS: It can be as big as the entire message or packet. It can be as small as a phit.

Some other terms used when talking about networks for parallel processing:

Diameter: the maximum, over all pairs of nodes, of the length of the shortest path between them.

Routing Distance: the number of links traversed between source and destination.

Average Distance: the average of the routing distance over all pairs of nodes.
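Diameter and average distance can be computed for any small network graph with a breadth-first search from every node. The sketch below assumes an unweighted graph given as an adjacency dictionary; the four-node chain used as input is a made-up example, not a network from the lecture.

```python
from collections import deque


def hop_counts(adj, src):
    """Shortest-path hop count from src to every reachable node (BFS)."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def diameter_and_average(adj):
    """Return (diameter, average routing distance) over all ordered node pairs."""
    total = 0
    pairs = 0
    diameter = 0
    for s in adj:
        for t, d in hop_counts(adj, s).items():
            if t != s:
                total += d
                pairs += 1
                diameter = max(diameter, d)
    return diameter, total / pairs


# Hypothetical network: four nodes in a chain, 0 - 1 - 2 - 3
chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(diameter_and_average(chain))
```

The average here assumes every source and destination pair is equally likely, which matches a uniform communication pattern.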


Packet Formatting

Packet Format

Header: contains routing and control information so that the switches know what to do when the packet arrives.

Payload: the information carried by the packet.

Trailer: usually contains an error-checking code.

Packet Format

In parallel processing networks, as in LANs, WANs and the Internet, we have the issues of encapsulation and fragmentation.

Encapsulation involves carrying information from a higher level of abstraction within the current layer.

Fragmentation involves splitting up the higher-level information into a sequence of messages.
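Fragmentation and the header/payload/trailer layout can be illustrated with a toy packet format. Everything about the format below is an assumption made for the example: a 2-byte header holding a destination and a sequence number (each below 256), and a 4-byte CRC-32 trailer. Real interconnects define their own layouts.

```python
import zlib


def fragment(message: bytes, dest: int, payload_size: int) -> list[bytes]:
    """Split a message into packets of the form header + payload + trailer."""
    packets = []
    for seq, start in enumerate(range(0, len(message), payload_size)):
        payload = message[start:start + payload_size]
        header = bytes([dest, seq])  # hypothetical header: destination, sequence number
        trailer = zlib.crc32(header + payload).to_bytes(4, "big")  # error-checking code
        packets.append(header + payload + trailer)
    return packets


def reassemble(packets: list[bytes]) -> bytes:
    """Check each trailer, then rebuild the message in sequence order."""
    message = b""
    for packet in sorted(packets, key=lambda p: p[1]):
        header, payload, trailer = packet[:2], packet[2:-4], packet[-4:]
        if zlib.crc32(header + payload).to_bytes(4, "big") != trailer:
            raise ValueError("corrupted packet")
        message += payload
    return message


packets = fragment(b"interconnection networks", dest=3, payload_size=10)
print(len(packets), reassemble(list(reversed(packets))))
```

Because each packet carries its own sequence number, the receiver can rebuild the message even when packets arrive out of order.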

Communication Performance

There are four components that affect the time to transfer n bits from source to destination:

$\text{Time}_{S-D}(n) = \text{Overhead} + \text{Routing Delay} + \text{Channel Occupancy} + \text{Contention Delay}$

Performance

Overhead: comes from getting the message into and out of the network (i.e., it can be caused by the CA).

Channel Occupancy: gives us a lower bound on latency. Channel occupancy can be viewed simply as the time taken for the message to get from source to destination on a direct link.

The source CA takes time to process the communication request.

Each channel traveled by the packet adds delay.

The destination CA takes some time to process the packet.

Performance

Overall, the occupancy of the channel is:

$(n + n_e) / b$

n = number of bits in the payload
n_e = number of bits in the header and trailer
b = channel bandwidth

We can also look at the fraction of the channel bandwidth that carries payload, which determines the effective bandwidth:

$n / (n + n_e)$
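The two expressions above are easy to evaluate for a concrete packet. This is a sketch with assumed numbers (a 1024-bit payload, 64 bits of header and trailer, and a 1 Gbit/s channel); the function names are illustrative.

```python
def channel_occupancy(n: int, n_e: int, b: float) -> float:
    """Time the packet holds the channel: (n + n_e) / b."""
    return (n + n_e) / b


def payload_fraction(n: int, n_e: int) -> float:
    """Fraction of the transferred bits that are payload: n / (n + n_e)."""
    return n / (n + n_e)


# Hypothetical packet: 1024 payload bits, 64 envelope bits, b = 1e9 bits/s
n, n_e, b = 1024, 64, 1e9
print(channel_occupancy(n, n_e, b))      # seconds on the channel
print(b * payload_fraction(n, n_e))      # useful bits per second
```

Larger payloads amortize the fixed envelope, so the payload fraction approaches 1 as n grows.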

Performance

Routing Delay: each channel in the route adds a small delay, and these delays accumulate (we will consider the time taken to cross the node-to-switch interface to be part of the routing delay).

Causes of routing delay:

Routing distance (h): the number of channels used in the route.

Switching delay (Δ): the time taken for a switch to select the proper output port.

h depends on the network topology, the routing algorithm used, and the specific nodes involved in the transaction.

Unloaded Latency (Based on Switching Strategy)

Store and Forward Routing (packet switched): the entire packet is received by the switch before it is forwarded on the next channel.

Latency:

$T_{sf}(n, h) = h \left( \frac{n}{b} + \Delta \right)$

where n is the number of bits in the packet, including header and trailer.

Unloaded Latency (Based on Switching Strategy)

Circuit switched: once the route is set up, it is maintained.

We therefore only encounter the switching latency when the route is set up.

Unloaded Latency (Based on Switching Strategy)

In the case of circuit switching we can note the following:

As the message size increases, the latency caused by the per-hop route setup (h * Δ), and hence by the topology, becomes insignificant.

How can we reduce the latency in the case of store and forward packet switching?

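Unloaded Latency (Based on Switching Strategy)

Solution: fragment the message into smaller packets. The smaller packets flow through the network in a pipelined fashion. The unloaded latency becomes:

$T_{sf}(n, h, n_p) = \frac{n - n_p}{b} + h \left( \frac{n_p}{b} + \Delta \right)$

where n_p is the size of the fragments. The second term has the same form as before.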

Unloaded Latency (Based on Switching Strategy)

The previous approach is commonly used in the Internet and in larger networks.

In the case of parallel processing, cut-through routing can be used: once the first few phits are received by the switch, the rest of the packet is routed straight to the output.

$T_{ct}(n, h) = \frac{n}{b} + h \Delta$

The value of Δ here will be different from the circuit-switched case.
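The effect of the switching strategy on unloaded latency can be compared numerically. The sketch below assumes the usual textbook forms of the three models: store and forward pays the full packet time at every hop, fragmentation pipelines fragments of size n_p, and cut-through pays the packet time once plus a per-hop switch delay. The function names and the sample numbers are illustrative only.

```python
def store_and_forward(n: float, h: int, b: float, delta: float) -> float:
    """Whole packet is received at each of the h hops before being forwarded."""
    return h * (n / b + delta)


def fragmented_store_and_forward(n: float, n_p: float, h: int, b: float, delta: float) -> float:
    """First fragment crosses h hops; the remaining n - n_p bits follow in a pipeline."""
    return (n - n_p) / b + h * (n_p / b + delta)


def cut_through(n: float, h: int, b: float, delta: float) -> float:
    """Packet streams through the switches; each hop adds only the switch delay."""
    return n / b + h * delta


# Hypothetical case: 1000-bit packet, 10 bits/cycle, 5 hops, 2-cycle switch delay
print(store_and_forward(1000, 5, 10, 2))                  # 510.0
print(fragmented_store_and_forward(1000, 100, 5, 10, 2))  # 150.0
print(cut_through(1000, 5, 10, 2))                        # 110.0
```

With these numbers the hop count multiplies the packet time only in plain store and forward, which is why routing distance matters far less once packets are pipelined.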

Unloaded Latency (Based on Switching Strategy)

Contention

As in traditional networking, contention occurs when two incoming messages need to be routed to the same output at the same time.

In store and forward routing the switch buffers an entire packet. If there is contention, one packet is switched to the output and the other is blocked until the next switching cycle.

Contention

In circuit switching, some type of probe is usually sent from the source node to the destination. If there is contention, the probe is resent some time later.

Cut-through routing can handle contention in two ways:

Virtual cut-through: route one of the packets into a buffer, then route it in the next switch cycle. This has the same penalty as store and forward routing under contention.

Contention

Wormhole: only a few flits from the head of the packet are buffered, and the tail portion is held in place along the route. This is similar to holding the circuit open from the sender's point of view.

Bandwidth

Looking at the bandwidth from the viewpoint of a single node, the channel bandwidth is higher than the bandwidth at which the node can send useful data:

$b_{eff} = b \, \frac{n}{n + n_e}$

Bandwidth

Taking into account the routing delay at the switch (Δ), we have the following expression:

b_eff = [expression not recoverable from the slide text]

w is included here in case the channel width is more than a bit. Note that n will be the size in phits.

Bandwidth

These expressions are useful for looking at the bandwidth available to one node.

What if we want a measure of the overall bandwidth in a network?

The most common measure is the bisection bandwidth: the sum of the bandwidths of the minimum set of channels that, if removed, partition the network into two equal, unconnected sets of nodes.

This has a nice property when considering a uniform communication pattern. What is it?

Bandwidth

ANSWER: Half of the messages are expected to cross the bisection in each direction.

With this in mind, what is wrong with this notion of global or "aggregate" bandwidth?

ANSWER: If communication is localized, few messages cross the bisection, so the bisection bandwidth gives a pessimistic estimate of the communication time.

Total Bandwidth and Average Link Utilization

With C the total number of channels in the network:

Total Bandwidth = C * b (bytes/sec) = C * w (bits/cycle) = C (phits/cycle)

Assume each of N hosts issues a packet every M cycles, with average routing distance h. Each packet then occupies h channels for l = n / w cycles.

The total load on the network is (N * h * l) / M phits/cycle.

Total Bandwidth and Average Link Utilization

The average link utilization, which must be less than 1, is the total load divided by the total bandwidth:

$\rho = \frac{N \, h \, l}{M \, C}$

This is discussed on p. 762 of the text.
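The load and utilization figures can be worked through for a concrete machine. This sketch takes the utilization as the total load (N * h * l / M phits per cycle) divided by the total bandwidth (C phits per cycle); the sample machine (64 hosts, 224 channels, 128-bit packets on 16-bit links) is an assumption for illustration.

```python
def total_load(N: int, h: float, l: float, M: float) -> float:
    """Phits per cycle injected into the network: each packet holds h channels for l cycles."""
    return N * h * l / M


def link_utilization(N: int, h: float, l: float, M: float, C: int) -> float:
    """Average fraction of the C channels that are busy."""
    return total_load(N, h, l, M) / C


# Hypothetical machine: 64 hosts, 224 channels, n = 128 bits, w = 16 bits
n, w = 128, 16
l = n / w                       # cycles each packet occupies a channel
rho = link_utilization(N=64, h=4, l=l, M=100, C=224)
print(rho)
```

Halving M (hosts injecting twice as often) doubles the utilization, so the formula also gives a quick estimate of how close a workload is to saturating the links.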

Bandwidth

The number of links or channels per node (C / N) is the total communication bandwidth available per node (phits/cycle/node).

This is consumed in direct proportion to the message size and to the routing distance.

Factors That Limit ρ

Before we look at why the link utilization is less than one (in some cases much less than one), let us consider the various properties of the network.

The number of links per network node is a property of the topology.

Factors That Limit ρ

Average routing distance depends on:

The topology
The routing algorithm
The program's communication pattern
The mapping of the program onto the machine

Often, good communication locality gives a small h, random communication gives the average routing distance, and a bad pattern crosses the entire diameter.

Factors That Limit ρ

Factors:

Communication may not be balanced over all links.

Even if it is balanced, the routing algorithm may not support the communication pattern of the program.

Contention for other networking resources may arise.

These factors affect the saturation point of the network.

What assumption is being made here?

Topology of INs

Before we discuss some different types of interconnection network topologies, we want to consider the following:

The number of host nodes connected to the network is defined to be N.

The characteristics of each topology will be discussed as a function of N.

Fully Connected Network

This type of network connects all inputs to all outputs. It can be considered a single big switch.

The diameter of such a network is 1.

The degree is N.

Unfortunately, if there is a hardware failure in such a network, the entire network goes down, or at least full connectivity is lost.

Fully Connected Network

A bus is an example of a fully connected network.

Its cost scales as O(N).

Bandwidth:

Total Bandwidth = O(1)
Bisection Bandwidth = O(1)

In practice, bandwidth scaling is worse than O(1), because the clock rate drops as the number of ports grows, due to RC delay.

Fully Connected Networks

A crossbar switch is another example.

Bandwidth is O(N).

Cost is O(N^2). Why? As more inputs and outputs are added, the total number of cross points grows as N^2.

Fully connected networks scale badly to large numbers of hosts. Usually only smaller components of the network (like a basic switching element) are fully connected.

Linear Arrays

Assume we have N nodes (0 .. N-1) assembled in a linear fashion.

Assume neighboring nodes are connected with a bi-directional link.

What is the diameter? ANS: N - 1

The average routing distance is roughly N/3 (for uniformly random source and destination).

The bisection is one link.

Linear Arrays

The route from node A to node B can be described by the operation B - A.

This result is termed the relative address.

It is a (log N)-bit number whose sign gives the direction, with positive values being away from node 0.

This arrangement provides no fault tolerance.

Ring: Bi-directional Links

Easily constructed by connecting the ends of a linear array together.

Degree: 2
Diameter: N/2
Bisection: 2 links
Average routing distance: roughly N/4

Note that there are two relative addresses, because we can travel in either direction.

The ring also provides better fault tolerance than the linear array.
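The two relative addresses on a bi-directional ring, and the choice between them, can be sketched as follows. The sign convention (positive for travel toward higher node numbers, negative for the opposite direction) and the tie-breaking rule are assumptions made for the example.

```python
def ring_relative_addresses(a: int, b: int, n: int) -> tuple[int, int]:
    """Return (forward, backward) relative addresses from node a to node b on an n-node ring."""
    forward = (b - a) % n                      # hops in the increasing direction
    backward = forward - n if forward else 0   # same trip the other way round (negative)
    return forward, backward


def shortest_relative_address(a: int, b: int, n: int) -> int:
    """Pick the relative address with the fewest hops; ties go to the forward direction."""
    forward, backward = ring_relative_addresses(a, b, n)
    return forward if forward <= -backward else backward


print(ring_relative_addresses(1, 6, 8))    # (5, -3)
print(shortest_relative_address(1, 6, 8))  # -3
```

Having two usable addresses is also what gives the ring its fault tolerance: if one direction is broken, the other relative address still reaches the destination.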

Ring: Unidirectional Links

If we have a ring that can only transmit in one direction, we have the following properties:

Diameter: N - 1
Average distance: N/2
Relative address: (B - A) mod N
Bisection width: 1


Higher Dimension Meshes and Tori

A d-dimensional array consists of $k_{d-1} \times k_{d-2} \times \dots \times k_0$ nodes, where $k_j$ is the number of nodes along dimension j.

If $0 \le i_j \le k_j - 1$ for $0 \le j \le d-1$, we can use a vector to locate any node in the mesh, i.e., the coordinates of a node are $\langle i_{d-1}, i_{d-2}, \dots, i_0 \rangle$.

Higher Dimension Meshes and Tori

Assume the length along each dimension is equal, so that $N = k^d$.

The degree of each node in a mesh varies between d and 2d. Nodes on the inside have the maximum degree and nodes on the corners have the minimum.

For example, for d = 3, nodes on the corners have 3 links or channels, and nodes on the inside have 6 channels. What about tori?

Higher Dimension Meshes and Tori

These arrays are called d-dimensional k-ary arrays.

To extend a mesh to a torus, the nodes on each edge are simply connected to those on the opposite edge.

Usually these types of structures are direct networks, meaning that every node contains a processing element.

The network scales by increasing k.

Higher Dimension Arrays and Tori

We can form a relative address by simply performing vector subtraction (unidirectional case):

$R = (b_{d-1} - a_{d-1},\ b_{d-2} - a_{d-2},\ \dots,\ b_0 - a_0)$

The dimensions can be routed in any order.

The diameter of the mesh is simply d * (k - 1).

If k is even, the bisection of a d-dimensional k-ary structure is $k^{d-1}$. If k is odd, it may be a little larger.
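The relative address and the mesh properties quoted above can be checked with a few lines of code. This is a sketch for a d-dimensional k-ary mesh with bi-directional links and even k; the function names and the sample coordinates are illustrative.

```python
def relative_address(a: tuple[int, ...], b: tuple[int, ...]) -> tuple[int, ...]:
    """Component-wise b - a, with coordinates listed from dimension d-1 down to 0."""
    return tuple(bj - aj for aj, bj in zip(a, b))


def routing_distance(a: tuple[int, ...], b: tuple[int, ...]) -> int:
    """Hops needed in a mesh: the sum of the distances in each dimension."""
    return sum(abs(r) for r in relative_address(a, b))


def mesh_properties(k: int, d: int) -> dict[str, int]:
    """Node count, diameter, and bisection width (even k) of a d-dimensional k-ary mesh."""
    return {"nodes": k ** d, "diameter": d * (k - 1), "bisection_links": k ** (d - 1)}


print(relative_address((0, 1, 3), (2, 1, 0)))  # (2, 0, -3)
print(routing_distance((0, 1, 3), (2, 1, 0)))  # 5
print(mesh_properties(4, 3))
```

The diameter is reached between opposite corners, for example from (0, 0, 0) to (3, 3, 3) in the 4-ary 3-dimensional case.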

Higher Dimension Arrays and Tori

The average distance is the sum of the average distances in each dimension.

Therefore the average h is roughly d * k/3 for a mesh.

Spatially, these networks scale in whatever dimension we have: in volume for d = 3 and in planar area for d = 2, assuming the shortest possible wiring.


Higher Dimension Arrays and Tori

Trees

With meshes, the average routing distance grows as $N^{1/d}$, the d-th root of N; trees offer logarithmic routing distance.

A binary tree has a degree of 3 (three connections per node).

Usually trees are used as indirect networks.

Indirect case:

The address of a leaf can be taken as a log2 N bit vector. This gives the path from the root to the host: 0 = left, 1 = right.

The diameter is 2 * log2 N.

The average routing distance is almost as large as the diameter of the network.

Trees

Relative addressing can be accomplished with the bit-wise XOR operation.

For example, to get the relative address from A to B, compute R = A XOR B. The position of the most significant 1 in R is how many levels we go up. Then we use the lower bits of B to go down to B. We may not have to go all the way to the root!

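Trees

A = 0001, B = 0101

A XOR B = 0100

The most significant 1 is in the third bit position (counting from the right), so the route goes up three levels and then follows the lower three bits of B (101) down to B.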
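The XOR rule for tree routing can be sketched for a complete binary tree whose hosts are the leaves. The function name and the textual turn labels are assumptions; the convention 0 = left, 1 = right follows the addressing described for the indirect case.

```python
def tree_route(a: int, b: int) -> tuple[int, list[str]]:
    """Return (levels to climb, turns to take on the way down) from leaf a to leaf b."""
    r = a ^ b                       # relative address
    up = r.bit_length()             # position of the most significant 1 (0 if a == b)
    # On the way down, read the lower `up` bits of b from most to least significant.
    down = ["right" if (b >> i) & 1 else "left" for i in range(up - 1, -1, -1)]
    return up, down


print(tree_route(0b0001, 0b0101))  # (3, ['right', 'left', 'right'])
```

For neighboring leaves such as 0100 and 0101 the route climbs only one level, which is the case where the packet never gets near the root.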

Trees

We can have trees of higher order, called k-ary trees.

We can also have fat trees, in which more bandwidth is assigned to the more important links as we go towards the root.

A big problem with ordinary trees is that all traffic between the two halves of the tree crosses the root, so the bisection is one link.

Butterflies

The construction of a butterfly is similar to that of a tree, but a butterfly has many roots.

In addition, many parallel algorithms communicate in a butterfly structure, e.g., the Fast Fourier Transform and Batcher's odd-even merge sort.

Butterflies

As a building block we start with 2 x 2 switch elements.

The basic building block is set up so that addressing can occur: a bit of 0 causes the straight edge to be followed, while a 1 causes a crossover.

In the case of a unidirectional indirect butterfly with N hosts, the bisection is N/2 links.


Butterflies

Butterflies

When considering scalability, butterflies can be better than meshes and trees, in that there are a total of N log2 N links (in the case of the previous figure), with packets crossing log2 N links on average.

Therefore, on average, there should not be any collisions.

How many links are in the bisection?


Butterflies

Hypercubes

If we take the original butterfly and collapse each straight column into a single switch of degree log2 N, we get something close to a hypercube arrangement.

We can cross the dimensions of the hypercube one at a time to get from source to destination.

Each node of a hypercube can embed a lower dimension mesh.

Text p. 778.
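Crossing the dimensions of a hypercube one at a time can be sketched with the same XOR idea used for trees. The sketch assumes nodes are numbered so that neighbors differ in exactly one address bit, and it corrects the differing bits from the lowest dimension upward; that order is an arbitrary choice for the example.

```python
def hypercube_route(src: int, dst: int) -> list[int]:
    """Nodes visited from src to dst, flipping one differing address bit per hop."""
    path = [src]
    node = src
    diff = src ^ dst        # a 1 marks every dimension that must be crossed
    dim = 0
    while diff:
        if diff & 1:
            node ^= 1 << dim    # cross dimension `dim`
            path.append(node)
        diff >>= 1
        dim += 1
    return path


print(hypercube_route(0b000, 0b101))  # [0, 1, 5]
```

The number of hops equals the number of 1 bits in src XOR dst, so the diameter of an N-node hypercube is log2 N.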


Hypercubes

From: http://linux.cs.sonoma.edu/~ravi/ces516sp04/Lectures/feb18.ppt


Some Example Architectures

Routing

Routing from source to destination is of primary importance to parallel computing.

We have already seen some examples of how a relative address is formed. In the case of a d = 3 cube, the relative address gives the shortest path in all three dimensions.

Routing

The routing algorithm decides, at each switch element, which output port to place the packet onto.

There are three ways to determine the output port based on the packet header:

Arithmetic
Source-based port select
Table lookup

Routing

Arithmetic

2D Mesh: each relative address contains the distance to be traveled in both the x and y directions, [Δx, Δy].

At switch (i, j) we perform the following routing:
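One plausible form of the per-switch decision for arithmetic routing in a 2D mesh is sketched below. It assumes dimension order routing with x corrected before y, and that each switch decrements the remaining distance in the direction it forwards; the port names are hypothetical.

```python
def next_port(dx: int, dy: int) -> tuple[str, int, int]:
    """Choose the output port from [dx, dy] and return the updated relative address."""
    if dx > 0:
        return "+x", dx - 1, dy
    if dx < 0:
        return "-x", dx + 1, dy
    if dy > 0:
        return "+y", dx, dy - 1
    if dy < 0:
        return "-y", dx, dy + 1
    return "host", 0, 0         # both distances are zero: deliver locally


def route(dx: int, dy: int) -> list[str]:
    """Sequence of ports taken until the packet is delivered."""
    ports = []
    while True:
        port, dx, dy = next_port(dx, dy)
        ports.append(port)
        if port == "host":
            return ports


print(route(2, -1))  # ['+x', '+x', '-y', 'host']
```

The switch needs only comparisons against zero and a decrement, which is what makes arithmetic routing cheap to implement.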

Routing

The switch looks at the routing information in the packet and modifies the distance in the appropriate direction. Dimension order routing takes each dimension in turn.

Source-based routing can also be used, in which the source node places the sequence of switch port numbers in the header.

Simple from the switch side.

The header may have variable size and may be large.

Routing

Table Driven Routing:

Similar to the Internet and WANs.

Switches have a table of information used for routing. The header contains an index that is used in the table to select the proper output port.

Tables must be updated, which requires switch-specific messages.

The table must be established in the first place.

Routing

Deterministic Routing

The route is determined solely by the source and destination. The status of the network is not considered.

Dimension order routing is one such example.

In the case of a 2D mesh, how else can we route a packet?

ANS: If one dimension is blocked, we could switch to the other dimension, etc.

Routing

Adaptive Routing

The route of the packet is determined by the source and destination, but may be influenced by network conditions.

In the previous case we could zig-zag across the mesh if both dimensions on the edges were congested.

Deadlock

Deadlock: occurs when a packet waits for an event that cannot occur.

Livelock: occurs when a packet keeps being routed but never arrives at its destination.

Indefinite Postponement: occurs when a packet waits for an event that can occur but never does.


Deadlock Example

Virtual Channels

One way of avoiding such deadlock is to implement virtual channels.

Virtual channels are used in wormhole routing and give each physical channel multiple buffers.

Assume we have 2 virtual channels in the previous example. Say packets at a node numbered higher than their destination are placed in the high channel, and packets at a node numbered lower than their destination are placed in the low channel.
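The high/low channel rule can be traced hop by hop. The sketch below assumes the network is a unidirectional ring in which packets travel toward higher node numbers and wrap around from node n-1 to node 0; the original example figure is not available, so that topology is an assumption. It only illustrates the channel choice and does not prove deadlock freedom.

```python
def virtual_channel(current: int, dest: int) -> str:
    """High channel while the packet sits at a node numbered above its destination."""
    return "high" if current > dest else "low"


def ring_channels(src: int, dest: int, n: int) -> list[tuple[int, str]]:
    """(node, virtual channel used on its outgoing link) for each hop from src to dest."""
    hops = []
    node = src
    while node != dest:
        hops.append((node, virtual_channel(node, dest)))
        node = (node + 1) % n       # unidirectional ring with wrap-around
    return hops


print(ring_channels(6, 1, 8))  # [(6, 'high'), (7, 'high'), (0, 'low')]
```

Under this rule a packet changes class at most once, at the wrap-around, so the high channels and the low channels each form a chain rather than a closed cycle of waiting buffers.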


Virtual Channels


Adaptive Routing

Other Topics In Interconnection Networks

Turn-Model Routing
Switch Design
Channel Buffers
Flow Control

There are differences between LANs and interconnection networks for parallel processing.

Global Communications.

SGI Origin Network

Named SPIDER (we'll see why when we look at its structure); supports 1.56 GB/s total bandwidth in both directions.

Each switch contains 6 pairs of unidirectional links.

Two nodes are connected to each switch, leaving 4 links to connect to other switches.

Routing is table based, so that message priority is supported.


SGI Origin Network
