CS575 Parallel Processing


Transcript of CS575 Parallel Processing

Page 1: CS575 Parallel Processing

CS575 Parallel Processing

Lecture three: Interconnection Networks. Wim Bohm, CSU

Except as otherwise noted, the content of this presentation is licensed under the Creative Commons Attribution 2.5 license.

Page 2: CS575 Parallel Processing


Interconnection networks

- Connect processors, memories, I/O devices
- Dynamic interconnection networks
  - Connect any to any using switches or busses
  - Two types of switches
    - On / off: 1 input, 1 output
    - Pass through / cross over: 2 inputs, 2 outputs
- Static interconnection networks
  - Connect point to point using “wires”

Page 3: CS575 Parallel Processing


Dynamic Interconnection Network: Crossbar

- Connects e.g. p processors to b memories
- p * b switch matrix (sketched below)
  - p horizontal lines, b vertical lines
  - Cross points: on/off switches
  - At most one switch on in each row and each column
  - Non-blocking: Pi to Mj does not block Pl to Mk
- Very costly, does not scale well
  - p * b switches, complex timing and checking
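As a rough illustration (not part of the slides), the switch matrix can be modeled in a few lines of Python; the function name and request format below are made up for the example. A set of connection requests is realizable exactly when no two requests share a row (processor) or a column (memory), which is why the crossbar is non-blocking.

```python
# Illustrative sketch: a p x b crossbar as a boolean switch matrix.
# Requests are (processor, memory) pairs; they can all be satisfied at once
# as long as no processor and no memory appears twice.

def crossbar_connect(p, b, requests):
    switch = [[False] * b for _ in range(p)]
    used_procs, used_mems = set(), set()
    for proc, mem in requests:
        if proc in used_procs or mem in used_mems:
            raise ValueError(f"request P{proc} -> M{mem} conflicts with an earlier one")
        used_procs.add(proc)
        used_mems.add(mem)
        switch[proc][mem] = True          # turn on the cross point (proc, mem)
    return switch

# P0 -> M2 and P1 -> M0 proceed in parallel: the crossbar is non-blocking.
print(crossbar_connect(4, 4, [(0, 2), (1, 0)]))
```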

Page 4: CS575 Parallel Processing


Dynamic Interconnection Network: Bus

- Connects processors, memories, I/O devices
  - Master: can issue a request to get the bus
  - Slave: can respond to a request once the bus is granted
  - If there are multiple masters, we need an arbiter
- Sequential
  - Only one communication at a time
  - Bottleneck
  - But simple and cheap

Page 5: CS575 Parallel Processing


Crossbar vs bus

- Crossbar
  - Scalable in performance
  - Not scalable in hardware complexity
- Bus
  - Not scalable in performance
  - Scalable in hardware complexity
- Compromise: multistage network

Page 6: CS575 Parallel Processing


Multi-stage network

- Connects n components to each other
- Usually built from O(n log n) 2x2 switches
- Cheaper than a crossbar
- Faster than a bus
- Many topologies
  - e.g. Omega (book fig 2.12), Butterfly, ...

Page 7: CS575 Parallel Processing


Static Interconnection Networks

- Fixed wires (channels) between devices
- Many topologies
  - Completely connected
    - n(n-1)/2 channels
    - Static counterpart of the crossbar
  - Star
    - One central PE for message passing
    - Static counterpart of the bus
  - Multistage network with a PE at each switch

Page 8: CS575 Parallel Processing


More topologies

- Necklace or ring
- Mesh / Torus
  - 2D, 3D
- Trees
  - Fat tree
- Hypercube
  - 2^n nodes in an n-D hypercube
  - n links per node in an n-D hypercube
  - Addressing: 1 bit per dimension (see the sketch below)
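A minimal sketch of the addressing scheme (Python; the function name is made up): with one bit per dimension, the n neighbours of a node are obtained by flipping each of its n bits in turn.

```python
# Sketch: hypercube addressing, 1 bit per dimension.
# Each node of an n-D hypercube has n neighbours, one per bit that can be flipped.

def hypercube_neighbors(node, n):
    return [node ^ (1 << d) for d in range(n)]     # flipping bit d moves along dimension d

print([format(x, '03b') for x in hypercube_neighbors(0b011, 3)])
# ['010', '001', '111']
```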

Page 9: CS575 Parallel Processing


Hypercube

- Two connected nodes differ in one bit
- An n-D hypercube can be divided in
  - 2 (n-1)-D cubes, in n ways
  - 4 (n-2)-D cubes
  - 8 (n-3)-D cubes
- To get from node s to node t
  - Follow the path determined by the differing bits
  - E.g. 01100 → 11000: 01100 → 11100 → 11000 (see the sketch below)
- Question: how many (simple) paths from one node to another?
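A minimal sketch of this routing idea (not the book's code): walk from s to t by fixing the differing bits one dimension at a time; scanning the bits from most significant to least significant reproduces the 01100 → 11100 → 11000 example above.

```python
# Sketch: path from s to t in an n-D hypercube, correcting the differing bits
# from the most significant dimension down.

def hypercube_path(s, t, n):
    path, cur = [s], s
    for d in reversed(range(n)):          # dimensions n-1 .. 0
        if (cur ^ t) & (1 << d):          # bit d still differs from the target
            cur ^= 1 << d                 # cross that dimension
            path.append(cur)
    return path

print([format(x, '05b') for x in hypercube_path(0b01100, 0b11000, 5)])
# ['01100', '11100', '11000']
```

Visiting the differing bits in a different order gives a different shortest path, which is one way to start on the question about the number of simple paths.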

Page 10: CS575 Parallel Processing


Measures of static networks

- Diameter
  - Maximal shortest path between two nodes
  - Ring: ⎣p/2⎦, hypercube: log(p), 2D wraparound mesh: 2⎣sqrt(p)/2⎦
- Connectivity
  - Measure of multiplicity of paths between nodes
  - Arc connectivity
    - Minimum #arcs to be removed to create two disconnected networks
    - Ring: 2, hypercube: log(p), mesh: 2, wraparound mesh: 4

Page 11: CS575 Parallel Processing


More measures

- Bisection width
  - Minimal #arcs to be removed to partition the network in two (off by one node) equal halves
  - Ring: 2, complete binary tree: 1, 2D mesh: sqrt(p)
  - Question: bisection width of a hypercube?
- Channel width
  - #bits communicated simultaneously over a channel
- Channel rate / bandwidth
  - Peak communication rate (#bits/second)
- Bisection bandwidth
  - Bisection width * channel bandwidth

Page 12: CS575 Parallel Processing


Summary of measures: p nodes

Network               | Diameter        | Bisection width | Arc connectivity | #links
----------------------|-----------------|-----------------|------------------|----------
Completely connected  | 1               | p^2/4           | p-1              | p(p-1)/2
Star                  | 2               | ⎣p/2⎦ *         | 1                | p-1
Ring                  | ⎣p/2⎦           | 2               | 2                | p
Complete binary tree  | 2 log((p+1)/2)  | 1               | 1                | p-1
Hypercube             | log(p)          | p/2             | log(p)           | p·log(p)/2

* The textbook mentions the bisection width of a star as 1, but the only way to split a star into (almost) equal halves is by cutting half of its links.
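The formulas in the table can be evaluated directly; the Python sketch below (illustrative only) does so for one value of p. It assumes p is a power of two for the hypercube, and the complete-binary-tree row strictly applies only when p = 2^d - 1.

```python
# Sketch: the table's formulas for one value of p (here p = 16).
# The complete-binary-tree formula assumes p = 2^d - 1 nodes, so its
# diameter below is not an integer for p = 16.

from math import log2, floor

p = 16
rows = {
    # network: (diameter, bisection width, arc connectivity, #links)
    "completely connected": (1, p * p / 4, p - 1, p * (p - 1) // 2),
    "star":                 (2, floor(p / 2), 1, p - 1),
    "ring":                 (floor(p / 2), 2, 2, p),
    "complete binary tree": (2 * log2((p + 1) / 2), 1, 1, p - 1),
    "hypercube":            (log2(p), p // 2, log2(p), p * int(log2(p)) // 2),
}
for net, (diam, bisect, arc, links) in rows.items():
    print(f"{net:22s} diameter={diam:6.2f}  bisection={bisect}  arc={arc}  links={links}")
```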

Page 13: CS575 Parallel Processing


Meshes and Hypercubes

- Mesh
  - Buildable, scalable, cheaper than hypercubes
  - Many (e.g. grid) applications map naturally
  - Cut-through routing works well in meshes
  - Commercial systems based on it
- Hypercube
  - Recursive structure nice for algorithm design
  - Often same O complexity as PRAMs
  - Often a hypercube algorithm is also good for other topologies, so a good starting point

Page 14: CS575 Parallel Processing


Embedding

- Relationship between two networks
- Studied by mapping one into the other. Why?
- G(V,E) → G'(V',E')
  - Graphs G, G'; vertices V, V'; edges E, E'
  - Map E → E', V → V'
- Congestion k: k (>1) edges of E map to one edge of E'
- Dilation k: one edge of E maps to a path of k edges in E'
- Expansion: |V'| / |V|
- Often we want congestion = dilation = expansion = 1 (see the sketch below)
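A small illustrative sketch (the function and data layout are made up, not from the book): given, for every edge of E, the path of E'-edges it is routed over, the three measures fall out directly. The example embeds a 4-node ring into a 2-cube with congestion = dilation = expansion = 1.

```python
# Sketch: compute congestion, dilation and expansion of an embedding.
# E_paths maps each edge of E to the list of E'-edges it is routed over.

from collections import Counter

def embedding_measures(V, V_prime, E_paths):
    load = Counter(e2 for path in E_paths.values() for e2 in path)
    congestion = max(load.values())                          # E-edges sharing one E'-edge
    dilation = max(len(path) for path in E_paths.values())   # longest image of an E-edge
    expansion = len(V_prime) / len(V)
    return congestion, dilation, expansion

# 4-node ring into a 2-cube, ring nodes numbered in Gray-code order 00 01 11 10:
V = [0, 1, 2, 3]
V_prime = [0b00, 0b01, 0b10, 0b11]
E_paths = {(0, 1): [(0b00, 0b01)], (1, 2): [(0b01, 0b11)],
           (2, 3): [(0b11, 0b10)], (3, 0): [(0b10, 0b00)]}
print(embedding_measures(V, V_prime, E_paths))   # (1, 1, 1.0)
```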

Page 15: CS575 Parallel Processing


Ring into hypercube

- Number the nodes of the ring s.t.
  - The Hamming distance between two adjacent nodes is 1
- Gray code provides such a numbering
  - Can be built recursively: binary reflected Gray code
  - 2 nodes: 0 1
  - 2^k nodes:
    - Take the Gray code for 2^(k-1) nodes
    - Concatenate it with the reflected Gray code for 2^(k-1) nodes
    - Put 0 in front of the first batch, 1 in front of the second
- Mesh can be embedded into a hypercube
  - (Toroidal) mesh = rings of rings
Page 16: CS575 Parallel Processing


Ring to hypercube, cont'd

Gray codes for 1, 2, and 3 bits:

  1 bit:  0 1
  2 bits: 00 01 11 10
  3 bits: 000 001 011 010 110 111 101 100

Recursive definition, i → G(i, dim), where || is concatenation:

  G(0, 1) = 0
  G(1, 1) = 1
  G(i, x+1) = 0 || G(i, x)                 for i < 2^x
            = 1 || G(2^(x+1) - i - 1, x)   for i >= 2^x
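A minimal sketch of the recursive construction above (standard binary reflected Gray code, in Python):

```python
# Sketch: binary reflected Gray code, built exactly as on the slide:
# take the (k-1)-bit code, append its reflection, prefix 0s then 1s.

def gray_code(bits):
    if bits == 1:
        return ["0", "1"]
    prev = gray_code(bits - 1)
    return ["0" + c for c in prev] + ["1" + c for c in reversed(prev)]

print(gray_code(3))
# ['000', '001', '011', '010', '110', '111', '101', '100']
```

Consecutive codes (and the last/first pair) differ in exactly one bit, so numbering a 2^k-node ring this way embeds it into a k-D hypercube with dilation 1.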

Page 17: CS575 Parallel Processing


2D Mesh into hypercube

- Note: in a 2D (toroidal) mesh
  - Rows: rings
  - Cols: rings
- 2^r * 2^s wraparound mesh into a 2^(r+s)-node cube
  - Map node (i,j) onto node G(i,r) || G(j,s)
  - Each row coincides with a subcube
  - Each column coincides with a subcube
  - S.t. if adjacent in the mesh then adjacent in the cube (see the sketch below)
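A minimal sketch of this mapping, reusing the gray_code function from the earlier sketch:

```python
# Sketch: node (i, j) of a 2^r x 2^s wraparound mesh maps to the cube node
# G(i, r) || G(j, s), i.e. the concatenation of the two Gray codes.

def mesh_to_cube(i, j, r, s):
    return gray_code(r)[i] + gray_code(s)[j]

r, s = 2, 3
print(mesh_to_cube(1, 4, r, s))   # node (1,4) of the 4 x 8 torus -> '01110'
print(mesh_to_cube(1, 5, r, s))   # its neighbour (1,5) in the same row -> '01111'
# Mesh neighbours (i or j changed by 1, with wraparound) get cube labels that
# differ in exactly one bit, so every mesh link maps onto a cube link.
```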

Page 18: CS575 Parallel Processing


Complete binary tree into hypercube

- Map the tree root to any cube node
- Left child: same node as its parent
- Right child at level j: invert bit j of the parent's node

Example (15-node complete binary tree into a 3-cube, root at 000), level by level:

  Level 0: 000
  Level 1: 000 001
  Level 2: 000 010 001 011
  Level 3: 000 100 010 110 001 101 011 111
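A minimal sketch of the rule above (Python; levels and bits counted from 1, least significant bit first, which reproduces the level-by-level labels shown):

```python
# Sketch: embed a complete binary tree of the given depth into a hypercube.
# Left child: same cube node as its parent; right child at level j: parent's
# node with bit j inverted (levels and bits counted from 1, LSB first).

def embed_tree(depth):
    levels = [[0]]                                    # root mapped to node 00...0
    for j in range(1, depth + 1):
        children = []
        for parent in levels[-1]:
            children.append(parent)                   # left child: same node
            children.append(parent ^ (1 << (j - 1)))  # right child: invert bit j
        levels.append(children)
    return levels

for j, lvl in enumerate(embed_tree(3)):
    print(j, [format(v, '03b') for v in lvl])
# Reproduces the levels listed above; note the map is many-to-one,
# since several tree nodes share the same cube node.
```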

Page 19: CS575 Parallel Processing


Routing Mechanisms

- Determine all source → destination paths
- Minimal: a shortest path
- Deterministic: one path per (src,dst) pair
  - Mesh: dimension ordered (XY routing)
  - Cube: E-cube routing
    - Send along the least significant 1 bit in src XOR dst (see the sketch below)
- Adaptive: many paths per (src,dst) pair
  - Minimal: only shortest paths
- Why adaptive? Discuss.
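A minimal sketch of E-cube routing as described above (Python):

```python
# Sketch: E-cube routing in a hypercube. At each step, send the message along
# the dimension of the least significant bit in which the current node still
# differs from the destination.

def ecube_route(src, dst):
    path, cur = [src], src
    while cur != dst:
        diff = cur ^ dst
        lowest = diff & -diff        # least significant 1 bit of cur XOR dst
        cur ^= lowest                # cross that dimension
        path.append(cur)
    return path

print([format(x, '03b') for x in ecube_route(0b010, 0b111)])
# ['010', '011', '111']  (deterministic: the same (src, dst) always gives this path)
```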

Page 20: CS575 Parallel Processing


Routing (communication) Costs

- Three factors
  - Startup time at the source (ts)
    - OS, buffers, error correction info, routing algorithm
  - Hop time (th)
    - The time it takes to get from one PE to the next
    - Also called node latency
  - Word transfer time (tw)
    - Inverse of channel bandwidth

Page 21: CS575 Parallel Processing


Two rout(switch)ing techniques

- Store and Forward: O(m·l)
  - Strict: the whole message travels from PE to PE
  - m words, l links: tcomm = ts + (m·tw + th)·l (see the sketch below)
  - Often th is much less than m·tw, so tcomm ≈ ts + m·l·tw
- Cut-through: O(m + l)
  - Non-strict: the message is broken into flits (packets)
  - Flits are pipelined through the network: tcomm = ts + l·th + m·tw
  - Circular path + finite flit buffers can give rise to deadlock
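The two cost formulas can be compared directly; in the sketch below the parameter values are made up purely for illustration.

```python
# Sketch: the two communication cost models from this slide.
# ts = startup time, th = per-hop time, tw = per-word time,
# m = message length in words, l = number of links on the path.

def store_and_forward(ts, th, tw, m, l):
    return ts + (m * tw + th) * l      # whole message forwarded on every link

def cut_through(ts, th, tw, m, l):
    return ts + l * th + m * tw        # flits pipelined across the l links

ts, th, tw = 50.0, 1.0, 0.5            # illustrative values only
m, l = 1000, 10
print(store_and_forward(ts, th, tw, m, l))   # 50 + (500 + 1) * 10 = 5060.0
print(cut_through(ts, th, tw, m, l))         # 50 + 10 + 500       = 560.0
```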