CS 258 Parallel Computer Architecture
Lecture 5
Routing (Con’t)
February 11, 2008
Prof. John D. Kubiatowicz
http://www.cs.berkeley.edu/~kubitron/cs258
Lec 6.2, 2/11/08, Kubiatowicz, CS258 ©UCB Spring 2008
Recall: Deadlock-free wormhole networks
• Basic dimension-order routing techniques don’t work for unidirectional k-ary d-cubes
  – only for k-ary d-arrays (bi-directional)
• Idea: add channels!
  – provide multiple “virtual channels” to break the dependence cycle
  – good for BW too!
  – Do not need to add links or xbar, only buffer resources
• This adds nodes to the channel dependence graph (CDG); does it remove edges?
[Figure: switch with input ports, cross-bar, and output ports]
Recall: Use of virtual channels for adaptation
• Want to route around hotspots/faults while avoiding deadlock
• “An Adaptive and Fault Tolerant Wormhole Routing Strategy for k-ary n-cubes,” Linder and Harden, 1991
  – General technique for k-ary n-cubes
    » Requires: 2^(n-1) virtual channels/lane!!!
• Alternative: Planar adaptive routing (Chien and Kim, 1995)
  – Divide dimensions into “planes”
    » i.e. in 3-cube, use X-Y and Y-Z
  – Route planes adaptively in order: first X-Y, then Y-Z
    » Never go back to a plane once you have left it
    » Can’t leave a plane until you have routed the lowest coordinate
  – Use Linder-Harden technique for series of 2-dim planes
    » Now need only 3 × (number of planes) virtual channels
• Alternative: two-phase routing
  – Provide set of virtual channels that can be used arbitrarily for routing
  – When blocked, use unrelated virtual channels for dimension-order (deterministic) routing
  – Never progress from deterministic routing back to adaptive routing
Breaking deadlock with virtual channels
• Packet switches from lo to hi channel
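The effect of the lo-to-hi switch on the channel dependence graph can be checked with a minimal Python sketch. It assumes a unidirectional ring of k nodes in which link i connects node i to node (i + 1) mod k, and assumes a packet moves to the hi channel when it takes the wrap-around link and stays there; the function names are hypothetical and not from the lecture.

```python
from itertools import product

def route(src, dst, k, use_virtual_channels):
    """Channels (link, vc) a packet holds going src -> dst on a unidirectional ring."""
    channels = []
    node, vc = src, "lo"
    while node != dst:
        link = node  # link i connects node i to node (i + 1) % k
        if use_virtual_channels and link == k - 1:
            vc = "hi"  # assumed rule: switch on the wrap-around link, never switch back
        channels.append((link, vc))
        node = (node + 1) % k
    return channels

def dependency_graph(k, use_virtual_channels):
    """Edge held -> wanted for every pair of consecutive channels on any route."""
    edges = {}
    for src, dst in product(range(k), repeat=2):
        path = route(src, dst, k, use_virtual_channels)
        for held, wanted in zip(path, path[1:]):
            edges.setdefault(held, set()).add(wanted)
            edges.setdefault(wanted, set())
    return edges

def has_cycle(edges):
    """Depth-first search; reaching a grey (in-progress) node means a cycle."""
    WHITE, GREY, BLACK = 0, 1, 2
    colour = {c: WHITE for c in edges}

    def visit(c):
        colour[c] = GREY
        for nxt in edges[c]:
            if colour[nxt] == GREY or (colour[nxt] == WHITE and visit(nxt)):
                return True
        colour[c] = BLACK
        return False

    return any(colour[c] == WHITE and visit(c) for c in edges)

single_vc_cyclic = has_cycle(dependency_graph(4, use_virtual_channels=False))
two_vc_cyclic = has_cycle(dependency_graph(4, use_virtual_channels=True))
```

With one channel per link the dependences close into a ring, so `single_vc_cyclic` is true; with the lo/hi split the graph becomes a chain and `two_vc_cyclic` is false.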
Unidirectional k-ary n-cubes
• n+1 virtual channels
  – (one wrap-around per channel)
  – Switch to new “level” whenever wrap around in any dim
• Any adaptive routing solution is possible as long as:
  – It doesn’t use more than n wrap-around channels
  – If want more adaptivity, can add more levels (and more virtual channels)
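The level rule can be sketched in Python as follows. This is an illustration only: it assumes dimension-order routing, and it assumes the hop that takes a wrap-around link is already counted on the new level, a detail the slide does not specify. `vc_levels` and `max_level` are hypothetical helper names.

```python
from itertools import product

def vc_levels(src, dst, k):
    """Return (node, level) for each hop of a dimension-order route
    on a unidirectional k-ary n-cube, where n = len(src)."""
    node, level, hops = list(src), 0, []
    for dim in range(len(src)):
        while node[dim] != dst[dim]:
            if node[dim] == k - 1:
                level += 1  # this hop uses the wrap-around link in `dim`
            node[dim] = (node[dim] + 1) % k
            hops.append((tuple(node), level))
    return hops

def max_level(k, n):
    """Highest level reached over all source/destination pairs."""
    nodes = list(product(range(k), repeat=n))
    return max(level for s in nodes for d in nodes
               for _, level in vc_levels(s, d, k))
```

Because a route wraps at most once per dimension, the level never exceeds n, which is why levels 0 through n (n+1 virtual channels) are enough under these assumptions.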
Bidirectional k-ary n-cube
• Need 2^(n-1) virtual networks
  – Except for lowest dimension, only involves single direction
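Switch Design
[Figure: switch block diagram. Input ports feed receivers and input buffers; a cross-bar connects them to output buffers and transmitters that drive the output ports; a control block performs routing and scheduling.]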
How do you build a crossbar?
[Figure: crossbar implementations with inputs I0–I3 and outputs O0–O3, including a RAM-based design with Din, Dout, addr, and phase signals.]
Input buffered switch
• Independent routing logic per input
  – FSM
• Scheduler logic arbitrates each output
  – priority, FIFO, random
• Head-of-line blocking problem
[Figure: input-buffered switch. Input ports with routing logic R0–R3 feed a cross-bar to the output ports, under control of scheduling logic.]
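Head-of-line blocking can be seen in a toy model. The sketch below assumes each input holds a queue of destination output numbers, each output accepts one packet per cycle, and lower-numbered inputs win arbitration; `drain_cycles` is a hypothetical name, and the `look_past_head` option stands in for any scheme that lets an input offer packets behind the head.

```python
from collections import deque

def drain_cycles(queues, look_past_head):
    """Count cycles needed to deliver every queued packet.

    queues[i] lists the destination outputs of the packets waiting at input i.
    With look_past_head=False only the head packet of each input may be sent.
    """
    queues = [deque(q) for q in queues]
    cycles = 0
    while any(queues):
        busy_outputs = set()
        for q in queues:  # fixed priority: lower-numbered inputs arbitrate first
            candidates = list(q) if look_past_head else list(q)[:1]
            for dest in candidates:
                if dest not in busy_outputs:
                    busy_outputs.add(dest)
                    q.remove(dest)  # removes the first packet bound for dest
                    break
        cycles += 1
    return cycles

# Input 0 holds packets for outputs 0 then 1; input 1 for outputs 0 then 2.
fifo_cycles = drain_cycles([[0, 1], [0, 2]], look_past_head=False)
relaxed_cycles = drain_cycles([[0, 1], [0, 2]], look_past_head=True)
```

In the FIFO case input 1 loses output 0 in the first cycle and its packet for the idle output 2 waits behind the blocked head, so draining takes 3 cycles instead of 2.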
Output Buffered Switch
• How would you build a shared pool?
[Figure: output-buffered switch. Input ports with routing logic R0–R3 feed buffers at each of the output ports, under a shared control block.]
Output scheduling
• n independent arbitration problems?
  – static priority, random, round-robin
• simplifications due to routing algorithm?
• general case is max bipartite matching
[Figure: input buffers with routing logic R0–R3 requesting outputs O0–O2 through a cross-bar to the output ports.]
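The gap between independent per-output arbitration and the general matching problem can be shown with a small Python sketch. It assumes each input may request several outputs (for example under adaptive routing); the greedy pass models static priority, and the second function is the standard augmenting-path method for maximum bipartite matching. Both function names are hypothetical.

```python
def greedy_matching(requests, num_outputs):
    """Static priority: each input, in order, grabs its first free output."""
    owner = [None] * num_outputs  # owner[o] = input granted output o
    for i, wants in enumerate(requests):
        for o in wants:
            if owner[o] is None:
                owner[o] = i
                break
    return sum(x is not None for x in owner), owner

def max_matching(requests, num_outputs):
    """Maximum bipartite matching by repeated augmenting-path search."""
    owner = [None] * num_outputs

    def try_assign(i, seen):
        for o in requests[i]:
            if o in seen:
                continue
            seen.add(o)
            # Take a free output, or displace its owner if the owner can move.
            if owner[o] is None or try_assign(owner[o], seen):
                owner[o] = i
                return True
        return False

    matched = sum(try_assign(i, set()) for i in range(len(requests)))
    return matched, owner

# Input 0 can use output 0 or 1; input 1 can only use output 0.
requests = [[0, 1], [0]]
greedy = greedy_matching(requests, num_outputs=2)
best = max_matching(requests, num_outputs=2)
```

Here the greedy pass grants only one connection because input 0 takes output 0, while the matching moves input 0 to output 1 and serves both inputs.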
When are virtual channels allocated?
• Two separate processes:
  – Virtual channel allocation
  – Switch/connection allocation
• Virtual Channel Allocation
  – Choose route and free output virtual channel
• Switch Allocation
  – For each incoming virtual channel, must negotiate switch on outgoing pin
• In ideal case (not highly loaded), would like to optimistically allocate a virtual channel
[Figure: hardware-efficient design for the crossbar, connecting input ports to output ports.]
Delay analysis of wormhole router
• “A Delay Model and Speculative Architecture for Pipelined Routers”
  – Li-Shiuan Peh and William Dally
• Canonical model for a virtual-channel router
  – Separate routing, virtual-channel allocation, and switch allocation
Virtual Channel Analysis
• Identified various complex modules within router
• Identified a pipelining model
  – Speculative Virtual Channel Allocation
• Developed process-independent models
  – Result permits the evaluation of number of pipelining stages
Process Independent Modeling
• How might we evaluate complexity of logic?
  – Ideally, have some measure that reflects algorithmic complexity, not technology-dependent computations
• What is a good normalization?
  – Single, minimum-sized inverter
  – Call the delay of this τ
Logical Effort: Delay in a Logic Gate
• Express delays in process-independent unit: $d = d_{abs} / \tau$
• Delay has two components: $d = f + p$
• Effort delay $f = gh$ (a.k.a. stage effort)
  – Again has two components
• g: logical effort
  – Measures relative ability of gate to deliver current
  – $g \equiv 1$ for inverter
• h: electrical effort $= C_{out} / C_{in}$
  – Ratio of output to input capacitance
  – Sometimes called fanout
• p: parasitic delay
  – Represents delay of gate driving no load
  – Set by internal parasitic capacitance
Delay Plots
d = f + p = gh + p
[Figure: normalized delay d (0 to 6) versus electrical effort h = Cout / Cin (0 to 5). For each gate the intercept is the parasitic delay p and the rise above it is the effort delay f.]
• Inverter: g = 1, p = 1, d = h + 1
• 2-input NAND: g = 4/3, p = 2, d = (4/3)h + 2
Computing Logical Effort
• DEF: Logical effort is the ratio of the input capacitance of a gate to the input capacitance of an inverter delivering the same output current.
• Measure from delay vs. fanout plots
• Or estimate by counting transistor widths
[Figure: transistor-level schematics of an inverter, a 2-input NAND, and a 2-input NOR, annotated with transistor widths]
  – Inverter: Cin = 3, g = 3/3
  – 2-input NAND: Cin = 4, g = 4/3
  – 2-input NOR: Cin = 5, g = 5/3
Catalog of Gates
• Logical effort of common gates
Gate type        Number of inputs
                 1     2       3          4               n
Inverter         1
NAND                   4/3     5/3        6/3             (n+2)/3
NOR                    5/3     7/3        9/3             (2n+1)/3
Tristate / mux   2     2       2          2               2
XOR, XNOR              4, 4    6, 12, 6   8, 16, 16, 8
Catalog of Gates
• Parasitic delay of common gates
  – In multiples of p_inv (≈ 1)
Gate type        Number of inputs
                 1     2     3     4     n
Inverter         1
NAND                   2     3     4     n
NOR                    2     3     4     n
Tristate / mux   2     4     6     8     2n
XOR, XNOR              4     6     8
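The two catalogs combine with d = gh + p to give a normalized stage delay. A minimal Python sketch, using only the NAND, NOR, and inverter entries from the tables; the function and constant names are hypothetical.

```python
from fractions import Fraction

INVERTER = (Fraction(1), 1)  # (logical effort g, parasitic delay p)

def nand(n):
    """n-input NAND: g = (n + 2) / 3, p = n."""
    return Fraction(n + 2, 3), n

def nor(n):
    """n-input NOR: g = (2n + 1) / 3, p = n."""
    return Fraction(2 * n + 1, 3), n

def stage_delay(gate, h):
    """Normalized delay d = g*h + p for electrical effort h."""
    g, p = gate
    return g * h + p

nand2_delay = stage_delay(nand(2), 3)   # (4/3)*3 + 2
inv_delay = stage_delay(INVERTER, 4)    # fanout-of-4 inverter
```

Exact fractions are used so the results can be compared directly with the table entries.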
Example: Ring Oscillator
• Estimate the frequency of an N-stage ring oscillator
  – Logical Effort: g = 1
  – Electrical Effort: h = 1
  – Parasitic Delay: p = 1
  – Stage Delay: d = gh + p = 2
  – Frequency: f_osc = 1/(2·N·d) = 1/(4N) (in normalized delay units)
• A 31-stage ring oscillator in a 0.6 μm process has a frequency of ~200 MHz
Example: FO4 Inverter
• Estimate the delay of a fanout-of-4 (FO4) inverter
  – Logical Effort: g = 1
  – Electrical Effort: h = 4
  – Parasitic Delay: p = 1
  – Stage Delay: d = gh + p = 5
• The FO4 delay is about
  – 200 ps in a 0.6 μm process
  – 60 ps in a 180 nm process
  – f/3 ns in an f μm process
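The FO4 figure and the ring-oscillator figure can be cross-checked against each other. The sketch below assumes the 200 ps FO4 delay corresponds to exactly 5 normalized delay units, which gives the unit delay for that process, and then applies f_osc = 1/(2·N·d); the variable and function names are hypothetical.

```python
FO4_DELAY_PS = 200.0         # FO4 delay quoted for the 0.6 um process
FO4_STAGE_DELAY = 1 * 4 + 1  # d = g*h + p with g = 1, h = 4, p = 1

# Unit delay implied by the FO4 figure (assumption: 200 ps is exactly 5 units).
unit_delay_ps = FO4_DELAY_PS / FO4_STAGE_DELAY

def ring_oscillator_mhz(stages, unit_delay_ps):
    """Frequency of a ring of inverters, each with g = 1, h = 1, p = 1."""
    stage_delay = 1 * 1 + 1  # d = 2 normalized units
    period_ps = 2 * stages * stage_delay * unit_delay_ps
    return 1e6 / period_ps   # 1/ps equals 1e6 MHz

freq_mhz = ring_oscillator_mhz(31, unit_delay_ps)
```

Under that assumption the unit delay is 40 ps and the 31-stage oscillator comes out near 201.6 MHz, consistent with the ~200 MHz quoted on the ring-oscillator slide.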
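Multistage Logic Networks
• Logical effort generalizes to multistage networks
• Path Logical Effort: $G = \prod g_i$
• Path Electrical Effort: $H = C_{out\text{-}path} / C_{in\text{-}path}$
• Path Effort: $F = \prod f_i = \prod g_i h_i$
[Figure: four-stage path with input capacitance 10, internal node capacitances x, y, z, and load capacitance 20]
  – Stage 1: g1 = 1, h1 = x/10
  – Stage 2: g2 = 5/3, h2 = y/x
  – Stage 3: g3 = 4/3, h3 = z/y
  – Stage 4: g4 = 1, h4 = 20/z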
Multistage Logic Networks
• Logical effort generalizes to multistage networks
• Path Logical Effort: $G = \prod g_i$
• Path Electrical Effort: $H = C_{out\text{-}path} / C_{in\text{-}path}$
• Path Effort: $F = \prod f_i = \prod g_i h_i$
• Can we write F = GH?
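Paths that Branch
• No! Consider paths that branch:
[Figure: a gate with input capacitance 5 drives two gates with input capacitance 15 each; each of those drives a load of 90.]
  – G = 1
  – H = 90 / 5 = 18
  – GH = 18
  – h1 = (15 + 15) / 5 = 6
  – h2 = 90 / 15 = 6
  – F = g1 g2 h1 h2 = 36 = 2GH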
Branching Effort
• Introduce branching effort
  – Accounts for branching between stages in path
  – $b = (C_{on\text{-}path} + C_{off\text{-}path}) / C_{on\text{-}path}$
  – $B = \prod b_i$
  – Note: $\prod h_i = BH$
• Now we compute the path effort
  – $F = GBH$
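The branching example from the previous slide can be recomputed to check that F = GBH agrees with the direct product of stage efforts. A minimal Python sketch, with the stage tuples laid out in an assumed format (logical effort, on-path load, off-path load, input capacitance):

```python
from math import prod, isclose

# Each stage: (g, C_on_path, C_off_path, C_in), numbers from the branching example.
stages = [
    (1.0, 15.0, 15.0, 5.0),   # first gate drives two 15-unit inputs
    (1.0, 90.0, 0.0, 15.0),   # second gate drives the 90-unit load alone
]
path_load = 90.0

G = prod(g for g, _, _, _ in stages)
B = prod((on + off) / on for _, on, off, _ in stages)
H = path_load / stages[0][3]

# Direct definition: product of g_i * h_i, with h_i the total load over C_in.
F_direct = prod(g * (on + off) / cin for g, on, off, cin in stages)
F_formula = G * B * H
```

Here B = 2 accounts for exactly the factor by which GH = 18 fell short of F = 36.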
Multistage Delays
• Path Effort Delay: $D_F = \sum f_i$
• Path Parasitic Delay: $P = \sum p_i$
• Path Delay: $D = \sum d_i = D_F + P$
Designing Fast Circuits
• Path delay: $D = \sum d_i = D_F + P$
• Delay is smallest when each stage bears same effort: $\hat{f} = g_i h_i = F^{1/N}$
• Thus minimum delay of N stage path is $D = N F^{1/N} + P$
• This is a key result of logical effort
  – Find fastest possible delay
  – Doesn’t require calculating gate sizes
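As a worked instance, the four-stage path from the Multistage Logic Networks slide (g = 1, 5/3, 4/3, 1; input capacitance 10; load 20; no branching, so B = 1) gives the following. The parasitic term P is left symbolic because the slide does not name the gate types.

```latex
\begin{align*}
G &= 1 \cdot \tfrac{5}{3} \cdot \tfrac{4}{3} \cdot 1 = \tfrac{20}{9},
\qquad H = \tfrac{20}{10} = 2,
\qquad F = GBH = \tfrac{40}{9} \approx 4.44 \\
\hat{f} &= F^{1/4} \approx 1.45 \\
D &= N F^{1/N} + P \approx 4 \times 1.45 + P \approx 5.8 + P
\end{align*}
```

The minimum delay follows from G, H, and N alone; no gate size has been computed yet.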
Gate Sizes
• How wide should the gates be for least delay?
  – $\hat{f} = gh = g \, C_{out} / C_{in}$
  – $\Rightarrow C_{in_i} = g_i C_{out_i} / \hat{f}$
• Working backward, apply capacitance transformation to find input capacitance of each gate given load it drives.
• Check work by verifying input cap spec is met.
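The backward pass can be sketched in Python for the four-stage path from the Multistage Logic Networks slide (logical efforts 1, 5/3, 4/3, 1; input capacitance 10; load 20). It assumes no branching (B = 1); the variable names are hypothetical.

```python
g = [1.0, 5 / 3, 4 / 3, 1.0]   # logical effort of each stage, input to output
c_in_spec, c_load = 10.0, 20.0

# Path effort and the equal per-stage effort that minimizes delay.
G = 1.0
for gi in g:
    G *= gi
H = c_load / c_in_spec
F = G * H                      # assumes B = 1 (no branching)
f_hat = F ** (1 / len(g))

# Work backward from the load: C_in = g * C_out / f_hat for each stage.
caps = []
c_out = c_load
for gi in reversed(g):
    c_in = gi * c_out / f_hat
    caps.append(c_in)
    c_out = c_in
caps.reverse()                 # [C_in of stage 1, x, y, z]

# Check work: the first stage should meet the input capacitance spec.
spec_met = abs(caps[0] - c_in_spec) < 1e-9
```

The last line is the check the slide asks for: the computed input capacitance of the first stage should equal the specified value of 10.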
How does this relate to Router Model?
• Example of results possible:
  – Evaluation of latency as function of VC-allocation algorithm complexity
  – Develop VC-allocator module as circuit, compute logical effort
Summary
• Deadlock-free if channel dependence graph is acyclic
  – limit turns to eliminate dependences
  – add separate channel resources to break dependences
  – combination of topology, algorithm, and switch design
• Switch design issues
  – input/output/pooled buffering, routing logic, selection logic
• Logical Effort
  – Technology-independent delay model: compared with inverter
  – d = gh + p
  – g: logical effort, h: electrical effort, p: parasitic delay
• “A Delay Model and Speculative Architecture for Pipelined Routers”
  – Speculation on virtual-channel allocation
  – Improves: low conflict latency and throughput