VLSI System Design Lecture 3 Timing
UC Regents Fall 2013/2014 © UCB CS 250 L3: Timing
2014-9-4
Professor Jonathan Bachrach
slides by John Lazzaro
CS 250 VLSI System Design
Lecture 3 – Timing
www-inst.eecs.berkeley.edu/~cs250/
TA: Colin Schmidt
... everything doesn’t happen at once.
Timing, the 10,000 ft view. Locally synchronous, globally asynchronous.
On the same page. Minimal set of timing concepts you need for project.
Break
RTL Examples. Better timing through micro-architecture.
Electrical details. Just so you know ...
View from 10,000 Ft.
Google I/O, 2012
[Figure: Moore’s Law. Transistor counts: 2 thousand, 1 million, 2.6 billion.]
Synchronous logic on a single clock domain is not practical for
a 2.6 billion transistor design
GALS: Globally Asynchronous, Locally Synchronous
both clocks. The basic GALS method focuses on point-to-point communication between blocks.
FIFO solutions
Another approach to interfacing locally synchronous blocks is using specially designed asynchronous FIFO buffers [8–10] and hiding the system synchronization problem within the FIFO buffers. Such a system can tolerate very large interconnect delays and is also robust with regard to metastability. Designers can use this method to interconnect asynchronous and synchronous systems and also to construct synchronous-synchronous and asynchronous-asynchronous interfaces. Figure 2 diagrams a typical FIFO interface, which achieves an acceptable data throughput [8]. In addition to the data cells, the FIFO structure includes an empty/full detector and a special deadlock detector.
The advantage of FIFO synchronizers is that they don’t affect the locally synchronous module’s operation. However, with very wide interconnect data buses, FIFO structures can be costly in silicon area. Also, they require specialized complex cells to generate the empty/full flags used for flow control. The introduced latency might be significant and unacceptable for high-speed applications.
As an alternative, Beigne and Vivet designed a synchronous-asynchronous FIFO based on the bisynchronous classical FIFO design using gray code, for the specific case of an asynchronous network-on-chip (NoC) interface [10]. Their aim was to maintain compatibility with existing design solutions and to use standard CAD tools. Thus, even with some performance degradation or suboptimal architecture, designers can achieve the main goal of designing GALS systems in the standard design environment.
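The gray-code idea mentioned in this excerpt can be sketched in a few lines. This is only an illustration, not the cited design: the function names and the 3-bit pointer width are assumptions. The point is that successive gray-coded pointer values differ in exactly one bit, so a FIFO pointer sampled by the other clock domain in mid-transition is off by at most one position.

```python
def bin_to_gray(n: int) -> int:
    """Convert a binary count to its reflected gray code."""
    return n ^ (n >> 1)


def gray_to_bin(g: int) -> int:
    """Convert a reflected gray code back to a binary count."""
    n = 0
    while g:
        n ^= g
        g >>= 1
    return n


def bits_changed(a: int, b: int) -> int:
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


PTR_BITS = 3  # hypothetical FIFO pointer width (8 entries)
DEPTH = 1 << PTR_BITS

# Gray-coded pointer values in counting order.
codes = [bin_to_gray(i) for i in range(DEPTH)]

# Bit flips between each pointer value and the next, including wrap-around.
flips = [bits_changed(codes[i], codes[(i + 1) % DEPTH]) for i in range(DEPTH)]

for i, code in enumerate(codes):
    print(f"{i}: {code:0{PTR_BITS}b}")
print("bit flips per increment:", flips)
```

A plain binary pointer can flip several bits on one increment (for example 011 to 100), which is why it is not safe to sample across clock domains without this encoding.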
Boundary synchronization
A third solution is to perform data synchronization at the borders of the locally synchronous island, without affecting the inner operation of locally synchronous blocks and without relying on FIFO buffers. For this purpose, designers can use standard two-flop, one-flop, predictive, or adaptive synchronizers for mesochronous systems, or locally delayed latching [1, 11]. This method can achieve very reliable data transfer between locally synchronous blocks. On the other hand, such solutions generally increase latency and reduce data throughput, resulting in limited applicability for high-speed systems. Table 1 summarizes the properties of GALS systems’ synchronization methods.
Advantages and limitations of GALS solutions
The scientific community has shown great interest in GALS solutions and architectures in the past two decades. However, this interest hasn’t culminated in many commercial applications, despite all reported advantages. There are several reasons why standard design practice has not adopted GALS techniques.
Design and system integration issues
Many proposed solutions require programmable ring oscillators. This is an inexpensive solution that allows full control of the local clock. However, it has significant drawbacks. Ring oscillators are impractical for industrial use. They need careful calibration because they are very sensitive to process, voltage, and temperature variations. Moreover, embedded ring oscillators consume additional power through continuous switching of the chained inverters.
On the other hand, careful design of the delay line can reduce its power consumption to a level below that of a corresponding clock tree. In addition,
Figure 2. Typical FIFO-based GALS system.
Globally Asynchronous, Locally Synchronous Design and Test
IEEE Design & Test of Computers
Synchronous modules are typically 50K-1M gates, so the synchronous logic approach works well without requiring heroics. Examples ...
The Power5 scans fetched instructions for branches (BP stage), and if it finds a branch, predicts the branch direction using three branch history tables shared by the two threads. Two of the BHTs use bimodal and path-correlated branch prediction mechanisms to predict branch directions [6, 7]. The third BHT predicts which of these prediction mechanisms is more likely to predict the correct direction [7]. If the fetched instructions contain multiple branches, the BP stage can predict all the branches at the same time. In addition to predicting direction, the Power5 also predicts the target of a taken branch in the current cycle’s eight-instruction group. In the PowerPC architecture, the processor can calculate the target of most branches from the instruction’s address and offset value. For
(March–April 2004)
[Figure 3 diagram: branch, load/store, fixed-point, and floating-point pipelines, with instruction fetch, group formation and instruction decode, out-of-order processing, branch redirects, and interrupts and flushes.]
Figure 3. Power5 instruction pipeline (IF = instruction fetch, IC = instruction cache, BP = branch predict, D0 = decode stage 0, Xfer = transfer, GD = group dispatch, MP = mapping, ISS = instruction issue, RF = register file read, EX = execute, EA = compute address, DC = data caches, F6 = six-cycle floating-point execution pipe, Fmt = data format, WB = write back, and CP = group commit).
[Figure 4 diagram: resources shared by two threads versus thread 0 and thread 1 resources, from program counter and instruction cache through dispatch, shared issue queues, and shared execution units (LSU0, LSU1, FXU0, FXU1, FPU0, FPU1, BXU, CRL) to group completion.]
Figure 4. Power5 instruction data flow (BXU = branch execution unit and CRL = condition register logical execution unit).
IBM Power 5 CPU - Dynamically Scheduled
Stars denote FIFOs that create separate synchronous domains. An example of how architecture and circuits work together.
Rocket uses GALS for accelerator interface
Your project interfaces with the RISC-V
pipeline and the memory system using
FIFOs.
Your timing closure is
independent of the CPU logic domain.
Today: Timing insights for your project
What we’re not doing. If this class was EE 241 and your project was an SRAM:
You could see through down to the layout.
Timing? Use SPICE on this hand-drawn schematic.
Technology X: The CS 250 timing challenge.
What we are doing --->
© Synopsys 2012
1986: Logic Compiler
Optimal Solutions, Inc. (aka Synopsys, Inc.)
Technology X – Provide automation and increase productivity for gate level designers
Logic Synthesis
If your accelerator is too slow ... two options:
Bottom-up: Take control away from logic synthesis. Use HDL as textual schematic. Also, use command-line tool flags. Sometimes necessary. Colin is the expert, ask in discussion section.
Top-down: Rework high-level micro-architecture. Let Technology X keep its job. Today.
A Logic Circuit Primer
“Models should be as simple as possible, but no simpler ...” Albert Einstein.
Inverters: A simple transistor model
1/28/04 ©UCB Spring 2004 CS152 / Kubiatowicz Lec3.5
Design Refinement
Informal System Requirement
Initial Specification
Intermediate Specification
Final Architectural Description
Intermediate Specification of Implementation
Final Internal Specification
Physical Implementation
refinement: increasing level of detail
Logic Components
° Wires: Carry signals from one point to another
• Single bit (no size label) or multi-bit bus (size label)
° Combinational Logic: Like function evaluation
• Data goes in, results come out after some propagation delay
° Flip-Flops: Storage elements
• After a clock edge, input copied to output
• Otherwise, the flip-flop holds its value
• Also: a “Latch” is a storage element that is level triggered
[Figure: flip-flop symbols (D/Q and 8-bit D[8]/Q[8]) and a combinational logic block with labeled bus widths.]
Elements of the design zoo
Basic Combinational Elements + DeMorgan Equivalence
Wire: Out = In
In Out
0 0
1 1
Inverter: Out = NOT In
In Out
0 1
1 0
NAND Gate: Out = NOT (A • B) = (NOT A) + (NOT B)
A B Out
0 0 1
0 1 1
1 0 1
1 1 0
NOR Gate: Out = NOT (A + B) = (NOT A) • (NOT B)
A B Out
0 0 1
0 1 0
1 0 0
1 1 0
DeMorgan’s Theorem
NAND drawn as an OR gate with inverted inputs:
A B (NOT A) (NOT B) Out
0 0 1 1 1
0 1 1 0 1
1 0 0 1 1
1 1 0 0 0
NOR drawn as an AND gate with inverted inputs:
A B (NOT A) (NOT B) Out
0 0 1 1 1
0 1 1 0 0
1 0 0 1 0
1 1 0 0 0
Delay Model:
CMOS
Review: General C/L Cell Delay Model
° Combinational Cell (symbol) is fully specified by:
• functional (input -> output) behavior
- truth-table, logic equation, VHDL
• load factor of each input
• critical propagation delay from each input to each output for each transition
- THL(A, o) = Fixed Internal Delay + Load-dependent-delay x load
° Linear model composes
[Figure: combinational logic cell with inputs A, B, ..., X driving load Cout. Plot of delay (Va -> Vout) versus Cout, with the internal delay, the delay per unit load, and Ccritical marked.]
Basic Technology: CMOS
° CMOS: Complementary Metal Oxide Semiconductor
• NMOS (N-Type Metal Oxide Semiconductor) transistors
• PMOS (P-Type Metal Oxide Semiconductor) transistors
° NMOS Transistor
• Apply a HIGH (Vdd) to its gate: turns the transistor into a “conductor”
• Apply a LOW (GND) to its gate: shuts off the conduction path
° PMOS Transistor
• Apply a HIGH (Vdd) to its gate: shuts off the conduction path
• Apply a LOW (GND) to its gate: turns the transistor into a “conductor”
[Figure: NMOS and PMOS transistor symbols, with Vdd = 5V and GND = 0V.]
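Basic Components: CMOS Inverter
[Figure: inverter symbol (In, Out) and circuit (PMOS connected to Vdd, NMOS connected to ground).]
° Inverter Operation
[Figure: one transistor is open while the other charges Out to Vdd or discharges it. Plot of Vout versus Vin, both ranging up to Vdd.]
pFET. A switch. “On” if gate is grounded.
nFET. A switch. “On” if gate is at Vdd.
[Figure: inverter with input “1” giving output “0”, and input “0” giving output “1”.]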
Correctly predicts logic output for simple static CMOS circuits.
Extensions to model subtler circuit families, or to predict timing, have not worked well ...
Transistors as water valves.
If electrons are water molecules, transistor strengths (W/L) are pipe diameters, and capacitors are buckets ...
An “on” p-FET fills up the capacitor with charge.
An “on” n-FET empties the bucket.
[Plots: water level versus time for the filling and emptying cases, with logic levels “1” and “0” marked.]
This model is often good enough ...
(Cartoon physics)
What is the bucket? A gate’s “fan-out”.
Driving other gates slows a gate down.
Spring 2003 EECS150 – Lec10-Timing Page 10
Gate Switching Behavior
• Inverter:
• NAND gate:
Driving wires slows a gate down.
“Fan-out”: The number of gate inputs driven by a gate’s output.
Driving its own parasitics slows a gate down.
Fanout
A closer look at fan-out ...
Series Connection
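[Figure: two inverters G1 and G2 in series (Vin -> V1 -> Vout), with C1 on node V1 and Cout on Vout. Voltage versus time plot of Vin, V1, and Vout between GND and Vdd, with delays d1 and d2 marked at Vdd/2.]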
° Total Propagation Delay = Sum of individual delays = d1 + d2
° Capacitance C1 has two components:
• Capacitance of the wire connecting the two gates
• Input capacitance of the second inverter
Calculating Aggregate Delays
° Sum delays along serial paths
° Delay (Vin -> V2) != Delay (Vin -> V3)
• Delay (Vin -> V2) = Delay (Vin -> V1) + Delay (V1 -> V2)
• Delay (Vin -> V3) = Delay (Vin -> V1) + Delay (V1 -> V3)
° Critical Path = The longest among the N parallel paths
° C1 = Wire C + Cin of Gate 2 + Cin of Gate 3
[Figure: gate G1 drives node V1 (capacitance C1), which fans out to G2 (output V2) and G3 (output V3).]
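The rules on this slide can be written out as a short calculation. This is a sketch under stated assumptions: the function names, the 20 fF wire capacitance, and the per-stage delays are hypothetical; only the 61 fF gate input load is taken from the NAND example elsewhere in these slides.

```python
def node_capacitance_ff(wire_c_ff, gate_input_caps_ff):
    """C1 = wire capacitance + input capacitance of every gate on the node."""
    return wire_c_ff + sum(gate_input_caps_ff)


def path_delay_ns(stage_delays_ns):
    """Delay of a serial path is the sum of its stage delays."""
    return sum(stage_delays_ns)


def critical_path(paths):
    """Return (name, delay) of the slowest of the parallel paths."""
    worst = max(paths, key=lambda name: path_delay_ns(paths[name]))
    return worst, path_delay_ns(paths[worst])


# Hypothetical values: 20 fF of wire plus two 61 fF gate inputs (G2 and G3).
c1_ff = node_capacitance_ff(20.0, [61.0, 61.0])

# Hypothetical stage delays in ns: both paths share the Vin -> V1 stage.
paths = {
    "Vin->V2": [0.8, 0.3],  # Delay(Vin -> V1) + Delay(V1 -> V2)
    "Vin->V3": [0.8, 0.5],  # Delay(Vin -> V1) + Delay(V1 -> V3)
}

worst_name, worst_delay_ns = critical_path(paths)
print(f"C1 = {c1_ff} fF")
print(f"critical path: {worst_name}, {worst_delay_ns:.2f} ns")
```

Note that adding a third gate to node V1 raises C1, which slows the shared Vin -> V1 stage and therefore every path through it.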
Characterize a Gate
° Input capacitance for each input
° For each input-to-output path:
• For each output transition type (H->L, L->H, H->Z, L->Z ... etc.)
- Internal delay (ns)
- Load dependent delay (ns / fF)
° Example: 2-input NAND Gate
[Figure: NAND gate with inputs A, B and output Out. Plot of delay A -> Out (Out: Low -> High) versus Cout: intercept 0.5 ns, slope 0.0021 ns / fF.]
For A and B: Input Load (I.L.) = 61 fF
For either A -> Out or B -> Out:
Tlh = 0.5ns Tlhf = 0.0021ns / fF
Thl = 0.1ns Thlf = 0.0020ns / fF
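The linear model above is easy to evaluate by hand or in code. A minimal sketch follows, using the NAND numbers from this slide; the class and method names are made up for illustration, and the chosen load (two NAND inputs, no wire capacitance) is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateTiming:
    """Linear delay model: delay = internal delay + slope * load."""
    t_lh_ns: float           # internal delay, output low -> high
    t_lhf_ns_per_ff: float   # load-dependent delay, low -> high
    t_hl_ns: float           # internal delay, output high -> low
    t_hlf_ns_per_ff: float   # load-dependent delay, high -> low
    input_load_ff: float     # capacitance presented by each input

    def delay_lh_ns(self, load_ff: float) -> float:
        return self.t_lh_ns + self.t_lhf_ns_per_ff * load_ff

    def delay_hl_ns(self, load_ff: float) -> float:
        return self.t_hl_ns + self.t_hlf_ns_per_ff * load_ff


# Characterization values from the 2-input NAND example on the slide.
NAND2 = GateTiming(
    t_lh_ns=0.5, t_lhf_ns_per_ff=0.0021,
    t_hl_ns=0.1, t_hlf_ns_per_ff=0.0020,
    input_load_ff=61.0,
)

# Assumed load: the output drives two NAND inputs, wire capacitance ignored.
load_ff = 2 * NAND2.input_load_ff

rise_ns = NAND2.delay_lh_ns(load_ff)
fall_ns = NAND2.delay_hl_ns(load_ff)
print(f"load = {load_ff} fF, low->high = {rise_ns:.4f} ns, high->low = {fall_ns:.4f} ns")
```

With this load, about a third of the low-to-high delay comes from the fan-out term rather than the fixed internal delay.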
A Specific Example: 2 to 1 MUX
Y = (A and !S) or (B and S)
[Figure: 2 x 1 Mux with inputs A, B, select S, and output Y, built from Gate 1, Gate 2, and Gate 3 connected by Wire 0, Wire 1, and Wire 2.]
° Input Load (I.L.)
• A, B: I.L. (NAND) = 61 fF
• S: I.L. (INV) + I.L. (NAND) = 50 fF + 61 fF = 111 fF
° Load Dependent Delay (L.D.D.): Same as Gate 3
• TAYlhf = 0.0021 ns / fF TAYhlf = 0.0020 ns / fF
• TBYlhf = 0.0021 ns / fF TBYhlf = 0.0020 ns / fF
• TSYlhf = 0.0021 ns / fF TSYhlf = 0.0020 ns / fF
Linear model works for reasonable fan-out
Gate Delay
• Fan-out:
• The delay of a gate is proportional to its output capacitance, because gates #2 and #3 turn on/off at a later time. (It takes longer for the output of gate #1 to reach the switching threshold of gates #2 and #3 as we add more output capacitance.)
[Figure: gate #1 driving gates #2 and #3.]
Delay time of an inverter driving 4 inverters.
FO4: Fanout of four delay.
Driving more gates adds delay.
Propagation delay graphs ...
Gate Delay
• Cascaded gates:
[Figure: inverter transfer function, Vout versus Vin, shown for cascaded gates with 1 -> 0 and 0 -> 1 transitions marked.]
Worst-case delay through combinational logic
Gate Delay
• “Fan-in”
• What is the delay in this circuit?
• Critical Path: the path with the maximum delay, from any
input to any output.
– In general, we include register set-up and clk-to-Q times in
critical path calculation.
• Why do we care about the critical path?
x = g(a, b, c, d, e, f)
T2 might be the worst-case delay path (critical path).
If d going 0-to-1 switches x 0-to-1, delay is T1.
If a going 0-to-1 switches x 0-to-1, delay is T2.
It would be surprising if T1 > T2.
[Figure: logic circuit with paths T1 and T2 marked, and 0 -> 1 transitions on the inputs and output.]
Why “might”? Wires have delay too ...
Wire Delay
• Even in those cases where the transmission line effect is negligible:
– Wires possess distributed resistance and capacitance
– Time constant associated with distributed RC is proportional to the square of the length
• For short wires on ICs, resistance is insignificant (relative to effective R of transistors), but C is important.
– Typically around half of C of gate load is in the wires.
• For long wires on ICs:
– busses, clock lines, global control signal, etc.
– Resistance is significant, therefore distributed RC effect dominates.
– signals are typically “rebuffered” to reduce delay:
[Figure: a long wire with taps v1, v2, v3, v4, and their waveforms versus time.]
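Why rebuffering helps follows directly from the square-law on this slide. The sketch below assumes a wire delay of k * length^2 and a fixed delay per inserted buffer; the constant k, the wire length, the buffer delay, and all function names are hypothetical values chosen for illustration.

```python
def wire_delay_ns(length_mm, k_ns_per_mm2):
    """Distributed RC delay grows with the square of the wire length."""
    return k_ns_per_mm2 * length_mm ** 2


def rebuffered_delay_ns(length_mm, n_segments, k_ns_per_mm2, t_buf_ns):
    """Split the wire into equal segments with a buffer between segments."""
    segment_ns = wire_delay_ns(length_mm / n_segments, k_ns_per_mm2)
    return n_segments * segment_ns + (n_segments - 1) * t_buf_ns


def best_segment_count(length_mm, k_ns_per_mm2, t_buf_ns, max_segments=32):
    """Segment count (1..max_segments) that minimizes the total delay."""
    return min(
        range(1, max_segments + 1),
        key=lambda n: rebuffered_delay_ns(length_mm, n, k_ns_per_mm2, t_buf_ns),
    )


# Hypothetical numbers: a 10 mm wire, k = 0.05 ns/mm^2, 0.1 ns per buffer.
LENGTH_MM, K, T_BUF_NS = 10.0, 0.05, 0.1

unbuffered_ns = wire_delay_ns(LENGTH_MM, K)
best_n = best_segment_count(LENGTH_MM, K, T_BUF_NS)
best_ns = rebuffered_delay_ns(LENGTH_MM, best_n, K, T_BUF_NS)

print(f"unbuffered: {unbuffered_ns:.2f} ns")
print(f"best: {best_n} segments, {best_ns:.2f} ns")
```

Splitting into n segments divides the quadratic term by n while adding only a linear buffer cost, so total delay becomes roughly linear in length. Past the optimum, extra buffers cost more than they save.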
Looks benign, but ...
Clocked Logic Circuits
From Delay Models to Timing Analysis
IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 36, NO. 11, NOVEMBER 2001, p. 1600
Fig. 1. Process SEM cross section.
The process was raised from [1] to limit standby power. Circuit design and architectural pipelining ensure low voltage performance and functionality. To further limit standby current in handheld ASSPs, a longer poly target takes advantage of the versus dependence and source-to-body bias is used to electrically limit transistor in standby mode. All core nMOS and pMOS transistors utilize separate source and bulk connections to support this. The process includes cobalt disilicide gates and diffusions. Low source and drain capacitance, as well as 3-nm gate-oxide thickness, allow high performance and low-voltage operation.
III. ARCHITECTURE
The microprocessor contains 32-kB instruction and data caches as well as an eight-entry coalescing writeback buffer. The instruction and data cache fill buffers have two and four entries, respectively. The data cache supports hit-under-miss operation and lines may be locked to allow SRAM-like operation. Thirty-two-entry fully associative translation lookaside buffers (TLBs) that support multiple page sizes are provided for both caches. TLB entries may also be locked. A 128-entry branch target buffer improves branch performance a pipeline deeper than earlier high-performance ARM designs [2], [3].
A. Pipeline Organization
To obtain high performance, the microprocessor core utilizes a simple scalar pipeline and a high-frequency clock. In addition to avoiding the potential power waste of a superscalar approach, functional design and validation complexity is decreased at the expense of circuit design effort. To avoid circuit design issues, the pipeline partitioning balances the workload and ensures that no one pipeline stage is tight. The main integer pipeline is seven stages, memory operations follow an eight-stage pipeline, and when operating in thumb mode an extra pipe stage is inserted after the last fetch stage to convert thumb instructions into ARM instructions. Since thumb mode instructions [11] are 16 b, two instructions are fetched in parallel while executing thumb instructions. A simplified diagram of the processor pipeline is shown in Fig. 2, where the state boundaries are indicated by gray. Features that allow the microarchitecture to achieve high speed are as follows.
Fig. 2. Microprocessor pipeline organization.
The shifter and ALU reside in separate stages. The ARM instruction set allows a shift followed by an ALU operation in a single instruction. Previous implementations limited frequency by having the shift and ALU in a single stage. Splitting this operation reduces the critical ALU bypass path by approximately 1/3. The extra pipeline hazard introduced when an instruction is immediately followed by one requiring that the result be shifted is infrequent.
Decoupled Instruction Fetch. A two-instruction deep queue is implemented between the second fetch and instruction decode pipe stages. This allows stalls generated later in the pipe to be deferred by one or more cycles in the earlier pipe stages, thereby allowing instruction fetches to proceed when the pipe is stalled, and also relieves stall speed paths in the instruction fetch and branch prediction units.
Deferred register dependency stalls. While register dependencies are checked in the RF stage, stalls due to these hazards are deferred until the X1 stage. All the necessary operands are then captured from result-forwarding busses as the results are returned to the register file.
One of the major goals of the design was to minimize the energy consumed to complete a given task. Conventional wisdom has been that shorter pipelines are more efficient due to re-
Example
• Parallel to serial converter:
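[Figure: parallel-to-serial converter with inputs a and b, a mux, and a register clocked by clk.]
$T \ge \mathrm{time}(clk \to Q) + \mathrm{time}(mux) + \mathrm{time}(setup)$
$T \ge \tau_{clk \to Q} + \tau_{mux} + \tau_{setup}$
f T
1 MHz 1 μs
10 MHz 100 ns
100 MHz 10 ns
1 GHz 1 ns
Timing Analysis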
What is the smallest T that
produces correct operation?
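The question above can be answered numerically once the three delays are known. A minimal sketch, assuming hypothetical delay values (0.2 ns clk-to-Q, 0.3 ns mux, 0.1 ns setup); the function names are also made up for illustration. The frequency-to-period conversion matches the table on the slide.

```python
def min_clock_period_ns(t_clk_to_q_ns, t_logic_ns, t_setup_ns):
    """Smallest T satisfying T >= clk-to-Q + logic delay + setup."""
    return t_clk_to_q_ns + t_logic_ns + t_setup_ns


def period_ns(freq_mhz):
    """Clock period in ns for a frequency in MHz (1 MHz -> 1000 ns)."""
    return 1000.0 / freq_mhz


def max_frequency_mhz(t_min_ns):
    """Highest clock frequency in MHz for a minimum period in ns."""
    return 1000.0 / t_min_ns


# Hypothetical delays for the register -> mux -> register path.
t_min_ns = min_clock_period_ns(t_clk_to_q_ns=0.2, t_logic_ns=0.3, t_setup_ns=0.1)
f_max_mhz = max_frequency_mhz(t_min_ns)
print(f"T_min = {t_min_ns:.2f} ns, f_max = {f_max_mhz:.0f} MHz")

# Reproduce the f / T table from the slide.
for f_mhz in (1, 10, 100, 1000):
    print(f"{f_mhz} MHz -> {period_ns(f_mhz)} ns")
```

This only covers the setup-side constraint shown on the slide; it says nothing about whether the path is also long enough to meet hold time.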
Timing Analysis and Logic Delay
If our clock period T > worst-case delay through CL, does this ensure correct operation?
1/28/04 ©UCB Spring 2004 CS152 / Kubiatowicz
General C/L Cell Delay Model
° Combinational Cell (symbol) is fully specified by:
• functional (input -> output) behavior
- truth-table, logic equation, VHDL
• Input load factor of each input
• Propagation delay from each input to each output for each transition
- THL(A, o) = Fixed Internal Delay + Load-dependent-delay x load
° Linear model composes
[Figure: combinational logic cell with inputs A, B, ... X driving a load Cout; plot of delay (Va -> Vout) versus Cout, marking the internal delay, the delay per unit load, and Ccritical.]
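A minimal sketch of the linear cell delay model above, in Python. The function names and the numbers (picoseconds, unit loads) are illustrative assumptions, not values from the slide. It shows that a cell's delay is a fixed internal delay plus a load-dependent term, and that delays of cells in series simply add.

```python
def cell_delay(internal_delay, delay_per_unit_load, load):
    """Linear model: fixed internal delay + (delay per unit load) * load."""
    return internal_delay + delay_per_unit_load * load


def chain_delay(cells):
    """Total delay of cells in series.

    Each entry is (internal_delay, delay_per_unit_load, load), where load is
    the load that cell drives (hypothetical values chosen by the caller).
    """
    return sum(cell_delay(d0, k, load) for d0, k, load in cells)


# Hypothetical cells: delays in ps, loads in unit loads.
print(cell_delay(20, 5, 4))                    # 20 + 5*4 = 40
print(chain_delay([(20, 5, 4), (30, 10, 2)]))  # 40 + 50 = 90
```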
Storage Element’s Timing Model
° Setup Time: Input must be stable BEFORE trigger clock edge
° Hold Time: Input must REMAIN stable after trigger clock edge
° Clock-to-Q time:
• Output cannot change instantaneously at the trigger clock edge
• Similar to delay in logic gates, two components:
- Internal Clock-to-Q
- Load dependent Clock-to-Q
[Figure: storage element (D, Q, Clk) timing diagram. D is “don’t care” outside the setup and hold window around the trigger clock edge; Q is unknown during the clock-to-Q interval.]
Clocking Methodology
[Figure: blocks of combination logic between storage elements, all driven by Clk.]
° All storage elements are clocked by the same clock edge
° The combination logic blocks:
• Inputs are updated at each clock tick
• All outputs MUST be stable before the next clock tick
Critical Path & Cycle Time
[Figure: combination logic between storage elements clocked by Clk.]
° Critical path: the slowest path between any two storage devices
° Cycle time is a function of the critical path
° must be greater than:
Clock-to-Q + Longest Path through Combination Logic + Setup
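The cycle-time bound above can be sketched in Python as follows. The graph encoding, the gate names, and every delay value are hypothetical, chosen only for illustration. The sketch finds the longest combinational path through a small acyclic netlist and adds clock-to-Q and setup to get the bound the cycle time must exceed.

```python
from functools import lru_cache


def longest_path_delay(graph, delays, sources):
    """Longest delay from any source through an acyclic gate graph.

    graph  maps gate -> list of gates it drives (assumed to be a DAG).
    delays maps gate -> propagation delay of that gate.
    """
    @lru_cache(maxsize=None)
    def worst_from(gate):
        downstream = max((worst_from(g) for g in graph.get(gate, [])), default=0)
        return delays[gate] + downstream

    return max(worst_from(s) for s in sources)


def cycle_time_bound(t_clk_to_q, t_longest_cl, t_setup):
    """Clock-to-Q + longest path through combination logic + setup."""
    return t_clk_to_q + t_longest_cl + t_setup


# Hypothetical netlist, delays in ps.
graph = {"and1": ["or1"], "xor1": ["or1"], "or1": []}
delays = {"and1": 100, "xor1": 150, "or1": 120}
t_cl = longest_path_delay(graph, delays, ["and1", "xor1"])
print(t_cl)                            # xor1 -> or1: 150 + 120 = 270
print(cycle_time_bound(50, t_cl, 30))  # 50 + 270 + 30 = 350
```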
Register: An Array of Flip-Flops
Combinational Logic
Flip Flops have internal delays ...
Value of D is sampled on positive clock edge. Q outputs sampled value for rest of cycle.
[Figure: flip-flop (D, Q, CLK) with a timing diagram of D and Q marking t_setup and t_clk-to-Q.]
Flip-Flop delays eat into “time budget”
1600 IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 36, NO. 11, NOVEMBER 2001
Spring 2003 EECS150 – Lec10-Timing Page 7
Example
• Parallel to serial converter:
[Figure: parallel-to-serial converter with inputs a and b, a mux, and a register clocked by clk.]
T ≥ time(clk→Q) + time(mux) + time(setup)
T ≥ τ_clk→Q + τ_mux + τ_setup
ALU “time budget”
Spring 2003 EECS150 – Lec10-Timing Page 8
General Model of Synchronous Circuit
[Figure: input, CL, reg, CL, reg, output in series, with a shared clock input and optional feedback.]
• In general, for correct operation:
T ≥ time(clk→Q) + time(CL) + time(setup)
T ≥ τ_clk→Q + τ_CL + τ_setup
for all paths.
• How do we enumerate all paths?
– Any circuit input or register output to any register input or circuit output.
– “setup time” for circuit outputs depends on what it connects to
– “clk-Q time” for circuit inputs depends on from where it comes.
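As a rough illustration of checking the inequality "for all paths", the Python sketch below computes the slack of each path and reports the ones that violate T ≥ τ_clk→Q + τ_CL + τ_setup. The path names and delay values are made up for the example, and a real timing tool would derive the path list from the netlist rather than take it as input.

```python
def timing_slack(period, t_clk_to_q, t_cl, t_setup):
    """Slack = T - (clk-to-Q + CL delay + setup); negative means the path fails."""
    return period - (t_clk_to_q + t_cl + t_setup)


def failing_paths(period, paths):
    """Return the names of paths whose slack is negative.

    paths maps a path name -> (t_clk_to_q, t_cl, t_setup). For a path that
    starts at a circuit input, the first entry stands in for the arrival
    time of that input (an assumption of this sketch).
    """
    return sorted(
        name
        for name, (t_cq, t_cl, t_su) in paths.items()
        if timing_slack(period, t_cq, t_cl, t_su) < 0
    )


# Hypothetical paths, all times in ps.
paths = {
    "reg_a->reg_b": (50, 300, 30),
    "reg_b->reg_a": (50, 450, 30),
    "input->reg_a": (80, 200, 30),
}
print(failing_paths(500, paths))  # only reg_b->reg_a needs more than 500 ps
print(failing_paths(530, paths))  # every path fits
```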
Clock skew also eats into “time budget”
Spring 2003 EECS150 – Lec10-Timing Page 18
Clock Skew (cont.)
[Figure: two registers separated by CL, clocked by CLK and CLK’; waveforms of CLK and CLK’ showing clock skew (delay in distribution).]
• If clock period T = T_CL + T_setup + T_clk→Q, circuit will fail.
• Therefore:
1. Control clock skew
a) Careful clock distribution. Equalize path delay from clock source to all clock loads by controlling wire delay and buffer delay.
b) don’t “gate” clocks.
2. T ≥ T_CL + T_setup + T_clk→Q + worst case skew.
• Most modern large high-performance chips (microprocessors) control end to end clock skew to a few tenths of a nanosecond.
Spring 2003 EECS150 – Lec10-Timing Page 19
Clock Skew (cont.)
[Figure: two registers separated by CL, clocked by CLK and CLK’, with the clock buffer reversed; waveforms of CLK and CLK’ showing clock skew (delay in distribution).]
• Note reversed buffer.
• In this case, clock skew actually provides extra time (adds to the effective clock period).
• This effect has been used to help run circuits at higher clock rates. Risky business!
As T → 0, which circuit fails first?
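The two skew cases can be put side by side in a short Python sketch. The sign convention is an assumption of this sketch, not something the slides state: skew is taken as the clock arrival delay at the launching register minus the arrival delay at the capturing register, so a late launch clock eats into the budget and a late capture clock adds to it. All numbers are hypothetical, in ps.

```python
def min_period_with_skew(t_clk_to_q, t_cl, t_setup,
                         launch_clk_delay, capture_clk_delay):
    """Smallest T meeting setup on one register-to-register path.

    Assumed convention: skew = launch_clk_delay - capture_clk_delay.
    Positive skew shrinks the usable cycle; negative skew extends it.
    """
    skew = launch_clk_delay - capture_clk_delay
    return t_clk_to_q + t_cl + t_setup + skew


# Launch clock arrives 40 ps late: the period must grow by the skew.
print(min_period_with_skew(50, 300, 30, launch_clk_delay=40, capture_clk_delay=0))
# Capture clock arrives 40 ps late (the "reversed buffer" case): extra time.
print(min_period_with_skew(50, 300, 30, launch_clk_delay=0, capture_clk_delay=40))
```

This sketch covers only the setup side of the budget; it does not model the hold-time risk that comes with delaying the capture clock.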
the total wire delay is similar to the total buffer delay. A patented tuning algorithm [16] was required to tune the more than 2000 tunable transmission lines in these sector trees to achieve low skew, visualized as the flatness of the grid in the 3D visualizations. Figure 8 visualizes four of the 64 sector trees containing about 125 tuned wires driving 1/16th of the clock grid. While symmetric H-trees were desired, silicon and wiring blockages often forced more complex tree structures, as shown. Figure 8 also shows how the longer wires are split into multiple-fingered transmission lines interspersed with Vdd and ground shields (not shown) for better inductance control [17, 18]. This strategy of tunable trees driving a single grid results in low skew among any of the 15 200 clock pins on the chip, regardless of proximity.
From the global clock grid, a hierarchy of short clock routes completed the connection from the grid down to the individual local clock buffer inputs in the macros. These clock routing segments included wires at the macro level from the macro clock pins to the input of the local clock buffer, wires at the unit level from the macro clock pins to the unit clock pins, and wires at the chip level from the unit clock pins to the clock grid.
Design methodology and results
This clock-distribution design method allows a highly productive combination of top-down and bottom-up design perspectives, proceeding in parallel and meeting at the single clock grid, which is designed very early. The trees driving the grid are designed top-down, with the maximum wire widths contracted for them. Once the contract for the grid had been determined, designers were insulated from changes to the grid, allowing necessary adjustments to the grid to be made for minimizing clock skew even at a very late stage in the design process. The macro, unit, and chip clock wiring proceeded bottom-up, with point tools at each hierarchical level (e.g., macro, unit, core, and chip) using contracted wiring to form each segment of the total clock wiring. At the macro level, short clock routes connected the macro clock pins to the local clock buffers. These wires were kept very short, and duplication of existing higher-level clock routes was avoided by allowing the use of multiple clock pins. At the unit level, clock routing was handled by a special tool, which connected the macro pins to unit-level pins, placed as needed in pre-assigned wiring tracks. The final connection to the fixed
Figure 6
Schematic diagram of global clock generation and distribution. (Diagram labels: PLL, bypass, reference clock in, reference clock out, clock distribution, clock out.)
Figure 7
3D visualization of the entire global clock network. The x and y coordinates are chip x, y, while the z axis is used to represent delay, so the lowest point corresponds to the beginning of the clock distribution and the final clock grid is at the top. Widths are proportional to tuned wire width, and the three levels of buffers appear as vertical lines. (Plot labels: delay, x, y, grid, tuned sector trees, sector buffers, buffer level 2, buffer level 1.)
Figure 8
Visualization of four of the 64 sector trees driving the clock grid, using the same representation as Figure 7. The complex sector trees and multiple-fingered transmission lines used for inductance control are visible at this scale. (Plot labels: delay, x, y, multiple-fingered transmission line.)
J. D. WARNOCK ET AL., IBM J. RES. & DEV., VOL. 46, NO. 1, JANUARY 2002, P. 32
Clock Tree Delays, IBM “Power” CPU
clock grid was completed with a tool run at the chip level, connecting unit-level pins to the grid. At this point, the clock tuning and the bottom-up clock routing process still have a great deal of flexibility to respond rapidly to even late changes. Repeated practice routing and tuning were performed by a small, focused global clock team as the clock pins and buffer placements evolved to guarantee feasibility and speed the design process.
Measurements of jitter and skew can be carried out using the I/Os on the chip. In addition, approximately 100 top-metal probe pads were included for direct probing of the global clock grid and buffers. Results on actual POWER4 microprocessor chips show long-distance skews ranging from 20 ps to 40 ps (cf. Figure 9). This is improved from early test-chip hardware, which showed as much as 70 ps skew from across-chip channel-length variations [19]. Detailed waveforms at the input and output of each global clock buffer were also measured and compared with simulation to verify the specialized modeling used to design the clock grid. Good agreement was found. Thus, we have achieved a “correct-by-design” clock-distribution methodology. It is based on our design experience and measurements from a series of increasingly fast, complex server microprocessors. This method results in a high-quality global clock without having to use feedback or adjustment circuitry to control skews.
Circuit design
The cycle-time target for the processor was set early in the project and played a fundamental role in defining the pipeline structure and shaping all aspects of the circuit design as implementation proceeded. Early on, critical timing paths through the processor were simulated in detail in order to verify the feasibility of the design point and to help structure the pipeline for maximum performance. Based on this early work, the goal for the rest of the circuit design was to match the performance set during these early studies, with custom design techniques for most of the dataflow macros and logic synthesis for most of the control logic, an approach similar to that used previously [20]. Special circuit-analysis and modeling techniques were used throughout the design in order to allow full exploitation of all of the benefits of the IBM advanced SOI technology.
The sheer size of the chip, its complexity, and the number of transistors placed some important constraints on the design which could not be ignored in the push to meet the aggressive cycle-time target on schedule. These constraints led to the adoption of a primarily static-circuit design strategy, with dynamic circuits used only sparingly in SRAMs and other critical regions of the processor core. Power dissipation was a significant concern, and it was a key factor in the decision to adopt a predominantly static-circuit design approach. In addition, the SOI technology, including uncertainties associated with the modeling of the floating-body effect [21–23] and its impact on noise immunity [22, 24–27] and overall chip decoupling capacitance requirements [26], was another factor behind the choice of a primarily static design style. Finally, the size and logical complexity of the chip posed risks to meeting the schedule; choosing a simple, robust circuit style helped to minimize overall risk to the project schedule with most efficient use of CAD tool and design resources. The size and complexity of the chip also required rigorous testability guidelines, requiring almost all cycle boundary latches to be LSSD-compatible for maximum dc and ac test coverage.
Another important circuit design constraint was the limit placed on signal slew rates. A global slew rate limit equal to one third of the cycle time was set and enforced for all signals (local and global) across the whole chip. The goal was to ensure a robust design, minimizing the effects of coupled noise on chip timing and also minimizing the effects of wiring-process variability on overall path delay. Nets with poor slew also were found to be more sensitive to device process variations and modeling uncertainties, even where long wires and RC delays were not significant factors. The general philosophy was that chip cycle-time goals also had to include the slew-limit targets; it was understood from the beginning that the real hardware would function at the desired cycle time only if the slew-limit targets were also met.
The following sections describe how these design constraints were met without sacrificing cycle time. The latch design is described first, including a description of the local clocking scheme and clock controls. Then the circuit design styles are discussed, including a description
Figure 9
Global clock waveforms showing 20 ps of measured skew. (Plot: voltage in volts, 0.0 to 1.5, versus time in ps, 0 to 2500; annotation: 20 ps skew.)
Some Flip Flops have “hold” time ...
[Figure: timing diagram of D against CLK marking t_setup and t_hold around the clock edge; D must stay stable here.]
[Figure: flip-flop (D, Q, CLK) with an inverter of delay t_inv.]
Does flip-flop hold time affect operation of this circuit? Under what conditions?
What is the intended function of this circuit?
For correct operation: t_clk-to-Q + t_inv > t_hold
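The hold condition above can be checked with a few lines of Python. This is a sketch under stated assumptions: the function name and the numbers are hypothetical, and the logic delay is taken to be the inverter delay on the path back to D, as the inequality on the slide suggests.

```python
def hold_ok(t_clk_to_q, t_logic, t_hold):
    """Hold check: D must not change until t_hold after the clock edge.

    The earliest D can change is t_clk_to_q + t_logic after the edge, so
    that sum must exceed t_hold. Unlike the setup check, the clock period
    T does not appear: slowing the clock cannot fix a hold violation.
    """
    return t_clk_to_q + t_logic > t_hold


# Hypothetical values in ps.
print(hold_ok(t_clk_to_q=50, t_logic=20, t_hold=60))  # 70 > 60
print(hold_ok(t_clk_to_q=10, t_logic=5, t_hold=30))   # 15 > 30 is false
```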
Searching for processor critical path
1600 IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 36, NO. 11, NOVEMBER 2001
Timing Analysis
What is the smallest T that produces correct operation?
Must consider all connected register pairs.
Why might I suspect this one?
Combinational paths for IBM Power 4 CPU
From “The circuit and physical design of the POWER4 microprocessor”, IBM J Res and Dev, 46:1, Jan 2002, J.D. Warnock et al.
netlist. Of these, 121 713 were top-level chip global nets, and 21 711 were processor-core-level global nets. Against this model 3.5 million setup checks were performed in late mode at points where clock signals met data signals in latches or dynamic circuits. The total number of timing checks of all types performed in each chip run was 9.8 million. Depending on the configuration of the timing run and the mix of actual versus estimated design data, the amount of real memory required was in the range of 12 GB to 14 GB, with run times of about 5 to 6 hours to the start of timing-report generation on an RS/6000* Model S80 configured with 64 GB of real memory. Approximately half of this time was taken up by reading in the netlist, timing rules, and extracted RC networks, as well as building and initializing the internal data structures for the timing model. The actual static timing analysis typically took 2.5–3 hours. Generation of the entire complement of reports and analysis required an additional 5 to 6 hours to complete. A total of 1.9 GB of timing reports and analysis were generated from each chip timing run. This data was broken down, analyzed, and organized by processor core and GPS, individual unit, and, in the case of timing contracts, by unit and macro. This was one component of the 24-hour-turnaround time achieved for the chip-integration design cycle. Figure 26 shows the results of iterating this process: A histogram of the final nominal path delays obtained from static timing for the POWER4 processor.
The POWER4 design includes LBIST and ABIST (Logic/Array Built-In Self-Test) capability to enable full-frequency ac testing of the logic and arrays. Such testing on pre-final POWER4 chips revealed that several circuit macros ran slower than predicted from static timing. The speed of the critical paths in these macros was increased in the final design. Typical fast ac LBIST laboratory test results measured on POWER4 after these paths were improved are shown in Figure 27.
Summary
The 174-million-transistor >1.3-GHz POWER4 chip, containing two microprocessor cores and an on-chip memory subsystem, is a large, complex, high-frequency chip designed by a multi-site design team. The performance and schedule goals set at the beginning of the project were met successfully. This paper describes the circuit and physical design of POWER4, emphasizing aspects that were important to the project’s success in the areas of design methodology, clock distribution, circuits, power, integration, and timing.
Figure 25: POWER4 timing flow. This process was iterated daily during the physical design phase to close timing.
[Flow diagram labels: core or chip wiring; extraction; Chipbench/EinsTimer; capacitance adjust; uplift analysis; non-uplift timing; noise impact on timing; Spice; GL/1; VIM; timer files; reports; asserts; analysis/update (wires, buffers). Step turnaround times range from < 12 hr to < 48 hr.]
Notes:
• Executed 2–3 months prior to tape-out
• Fully extracted data from routed designs
• Hierarchical extraction
• Custom logic handled separately (Dracula, Harmony)
• Extraction done for early and late
Extracted units (flat or hierarchical), incrementally extracted RLMs, custom NDRs, VIMs
Figure 26: Histogram of the POWER4 processor path delays.
[Axes: timing slack (ps), from −40 to 280, versus late-mode timing checks (thousands), from 0 to 200.]
Most wires have hundreds of picoseconds to spare.
The critical path
Post-Placement C-slow Retiming for the Xilinx Virtex FPGA
Nicholas Weaver*, UC Berkeley, Berkeley, CA
Yury Markovskiy, UC Berkeley, Berkeley, CA
Yatish Patel, UC Berkeley, Berkeley, CA
John Wawrzynek, UC Berkeley, Berkeley, CA
ABSTRACT
C-slow retiming is a process of automatically increasing the throughput of a design by enabling fine grained pipelining of problems with feedback loops. This transformation is especially appropriate when applied to FPGA designs because of the large number of available registers. To demonstrate and evaluate the benefits of C-slow retiming, we constructed an automatic tool which modifies designs targeting the Xilinx Virtex family of FPGAs. Applying our tool to three benchmarks: AES encryption, Smith/Waterman sequence matching, and the LEON 1 synthesized microprocessor core, we were able to substantially increase the total throughput. For some parameters, throughput is effectively doubled.
Categories and Subject Descriptors
B.6.3 [Logic Design]: Design Aids—Automatic synthesis
General Terms
Performance
Keywords
FPGA CAD, FPGA Optimization, Retiming, C-slow Retiming
* Please address any correspondence to [email protected]
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. FPGA’03, February 23–25, 2003, Monterey, California, USA. Copyright 2003 ACM 1-58113-651-X/03/0002 ...$5.00.
1. Introduction
Leiserson’s retiming algorithm[7] offers a polynomial time algorithm to optimize the clock period on arbitrary synchronous circuits without changing circuit semantics. Although a powerful and efficient transformation that has been employed in experimental tools[10][2] and commercial synthesis tools[13][14], it offers only a minor clock period improvement for a well constructed design, as many designs have their critical path on a single cycle feedback loop and can’t benefit from retiming.
Also proposed by Leiserson et al. to meet the constraints of systolic computation is C-slow retiming.¹ In C-slow retiming, each design register is first replaced with C registers before retiming. This transformation modifies the design semantics so that C separate streams of computation are distributed through the pipeline, greatly increasing the aggregate throughput at the cost of additional latency and flip-flops. This can automatically accelerate computations containing feedback loops by adding more flip-flops that retiming can then move around the critical path.
¹ This was originally defined to meet systolic slowdown requirements.
The effect of C-slow retiming is to enable pipelining of the critical path, even in the presence of feedback loops. To take advantage of this increased throughput, however, there needs to be sufficient task-level parallelism. This process will slow any single task but the aggregate throughput will be increased by interleaving the resulting computation.
This process works very well on many FPGA architectures as these architectures tend to have a balanced ratio of logic elements to registers, while most user designs contain a considerably higher percentage of logic. Additionally, many architectures allow the registers to be used independently of the logic in a logic block.
We have constructed a prototype C-slow retiming tool that modifies designs targeting the Xilinx Virtex family of FPGAs. The tool operates after placement: converting every design register to C separate registers before applying Leiserson’s retiming algorithm to minimize the clock period. New registers are allocated by scavenging unused array resources. The resulting design is then returned to Xilinx tools for routing, timing analysis, and bitfile generation.
We have selected three benchmarks: AES encryption, Smith/Waterman sequence matching, and the LEON 1
How to retime logic
Figure 1: A small graph before retiming. The nodes represent logic delays, with the inputs and outputs passing through mandatory, fixed registers. The critical path is 5.
microprocessor core, for which we can envision scenarios where ample task-level parallelism exists. The AES and Smith/Waterman benchmarks were also C-slowed by hand, enabling us to evaluate how well our automated techniques compare with careful, hand-designed implementations that accomplish the same goals.
The LEON 1 processor is a significantly larger synthesized design. Although it seems unusual, there is sufficient task-level parallelism to C-slow a microprocessor, as each stream of execution can be viewed as a separate task. The resulting C-slowed design behaves like a multithreaded system, with each virtual processor running slower but offering a higher total throughput.
This prototype demonstrates significant speedups on all 3 benchmarks, nearly doubling the throughput for the proper parameters. On the AES and Smith/Waterman benchmarks, these automated results compare favorably with careful hand-constructed implementations that were the result of manual C-slowing and pipelining.
In the remainder of the paper, we first discuss the semantic restrictions and changes that retiming and C-slow retiming impose on a design, the details of the retiming algorithm, and the use of the target architecture. Following the discussion of C-slow retiming, we describe our implementation of an automatic retiming tool. Then we describe the structure of all three benchmarks and present the results of applying our tool.
2. Conventional Retiming
Leiserson’s retiming treats a synchronous circuit as a directed graph, with delays on the nodes representing combinational delays and weights on the edges representing registers in the design. An additional node represents the external world, with appropriate edges added to account for all the I/Os. Two matrices are calculated, W and D, that represent the number of registers and critical path between every pair of nodes in the graph. Each node also has a lag value r that is calculated by the algorithm and used to change the number of registers on any given edge. Conventional retiming does not change the design semantics: all input and output timings remain unchanged, while imposing minor design constraints on the use of FPGA features. More details and formal proofs of correctness can be found in Leiserson’s original paper[7].
In order to determine whether a critical path P can be achieved, the retiming algorithm creates a series of constraints to calculate the lag on each node. All these constraints are of the form $x - y \le k$ that can be solved in $O(n^2)$ time by using the Bellman/Ford shortest path algorithm. The primary constraints ensure correctness: no edge will have a negative number of registers while every cycle will always contain the original number of registers. All IO passes through an intermediate node ensuring that input and output timings do not change. These constraints can be modified to ensure that a particular line will contain no registers or a mandatory minimum number of registers to meet architectural constraints.
Figure 2: The example in Figure 1 after retiming. The critical path is reduced from 5 to 4.
A second set of constraints attempts to ensure that every path longer than the critical path will contain at least one register, by creating an additional constraint for every path longer than the critical path. The actual constraints are summarized in Table 1.
This process is iterated to find the minimum critical path that meets all the constraints. The lag calculated by these constraints can then be used to change the design to meet this critical path. For each edge, a new register weight $w'$ is calculated, with $w'(e) = w(e) - r(u) + r(v)$.
An example of how retiming affects a simple design can be seen in Figures 1 and 2. The initial design has a critical path of 5, while after retiming the critical path is reduced to 4. During this process, the number of registers is increased, yet the number of registers on every cycle and the path from input to output remain unchanged. Since the feedback loop has only a single register and a delay of 4, it is impossible to further improve the performance by retiming.
Retiming in this form imposes only minimal design limitations: there can be no asynchronous resets or similar elements, as the retiming technique only applies to synchronous circuits. A synchronous global reset imposes too many constraints to allow retiming unless initial conditions are calculated and the global reset itself is now excluded from retiming purposes. Local synchronous resets and enables just produce small self-loops that have no effect on the correct operation of the algorithm.
Most other design features can be accommodated by simply adding appropriate constraints. As an example, all tristated lines can’t have registers applied to them, while mandatory elements such as those seen in synchronous memories can be easily accommodated by mandating registers on the appropriate nets.
Memories themselves can be retimed like any other element in the design, with dual ported memories treated as a single node for retiming purposes. Memories that are synthesized with a negative clock edge (to create the design illusion of asynchronous memories) can either be
Circles are combinational logic, labelled with delays.
Critical path is 5. We want to improve it without changing circuit semantics.
Add a register, move one circle. Performance improves by 20%.
“Technology X” can often do this.
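A minimal sketch of the edge-weight update from the paper excerpt, w'(e) = w(e) − r(u) + r(v), applied to a small hypothetical graph. The graph, node names, delays, and lag values below are assumptions for illustration and are not the graph in the paper's Figure 1; they were picked so the critical path also drops from 5 to 4. The sketch only applies a given lag assignment and measures the result; it does not solve for the lags.

```python
from functools import lru_cache

# Hypothetical circuit: node -> combinational delay. "host" stands for the outside world.
DELAY = {"host": 0, "a": 1, "b": 2, "c": 2}

# Edge (u, v) -> number of registers on that edge.
EDGES = {
    ("host", "a"): 1,
    ("a", "b"): 0,
    ("b", "c"): 0,
    ("c", "b"): 1,   # feedback loop b -> c -> b: one register, delay 4
    ("c", "host"): 1,
}

# A hand-picked lag assignment r(v); the host keeps lag 0.
LAG = {"host": 0, "a": -1, "b": 0, "c": 0}


def retime(edges, lag):
    """Apply w'(e) = w(e) - r(u) + r(v); reject negative register counts."""
    new_edges = {(u, v): w - lag[u] + lag[v] for (u, v), w in edges.items()}
    if any(w < 0 for w in new_edges.values()):
        raise ValueError("illegal retiming: negative register count")
    return new_edges


def clock_period(edges, delay):
    """Longest total node delay along register-free edges.

    Assumes the register-free subgraph is acyclic, as in any legal synchronous circuit.
    """
    succ = {}
    for (u, v), w in edges.items():
        if w == 0:
            succ.setdefault(u, []).append(v)

    @lru_cache(maxsize=None)
    def longest(node):
        return delay[node] + max((longest(m) for m in succ.get(node, [])), default=0)

    return max(longest(n) for n in delay)


before = clock_period(EDGES, DELAY)
after = clock_period(retime(EDGES, LAG), DELAY)
print("critical path before:", before, "after:", after)
```

Because each lag cancels around any cycle, the feedback loop still holds exactly one register after the update, which is why retiming alone cannot push this example below a period of 4.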
Power 4: Timing Estimation, Closure
Timing Estimation
Predicting a processor’s clock rate early in the project
From “The circuit and physical design of the POWER4 microprocessor”, IBM J Res and Dev, 46:1, Jan 2002, J.D. Warnock et al.
Power 4: Timing Estimation, Closure
Timing Closure
Meeting (or exceeding!) the timing estimate
From “The circuit and physical design of the POWER4 microprocessor”, IBM J Res and Dev, 46:1, Jan 2002, J.D. Warnock et al.
1600 IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 36, NO. 11, NOVEMBER 2001
Fig. 1. Process SEM cross section.
The process was raised from [1] to limit standby power.
Circuit design and architectural pipelining ensure low voltage
performance and functionality. To further limit standby current
in handheld ASSPs, a longer poly target takes advantage of the
versus dependence and source-to-body bias is used
to electrically limit transistor in standby mode. All core
nMOS and pMOS transistors utilize separate source and bulk
connections to support this. The process includes cobalt disili-
cide gates and diffusions. Low source and drain capacitance, as
well as 3-nm gate-oxide thickness, allow high performance and
low-voltage operation.
III. ARCHITECTURE
The microprocessor contains 32-kB instruction and data
caches as well as an eight-entry coalescing writeback buffer.
The instruction and data cache fill buffers have two and four
entries, respectively. The data cache supports hit-under-miss
operation and lines may be locked to allow SRAM-like oper-
ation. Thirty-two-entry fully associative translation lookaside
buffers (TLBs) that support multiple page sizes are provided
for both caches. TLB entries may also be locked. A 128-entry
branch target buffer improves branch performance a pipeline
deeper than earlier high-performance ARM designs [2], [3].
A. Pipeline Organization
To obtain high performance, the microprocessor core utilizes
a simple scalar pipeline and a high-frequency clock. In addition
to avoiding the potential power waste of a superscalar approach,
functional design and validation complexity is decreased at the
expense of circuit design effort. To avoid circuit design issues,
the pipeline partitioning balances the workload and ensures that
no one pipeline stage is tight. The main integer pipeline is seven
stages, memory operations follow an eight-stage pipeline, and
when operating in thumb mode an extra pipe stage is inserted
after the last fetch stage to convert thumb instructions into ARM
instructions. Since thumb mode instructions [11] are 16 b, two instructions are fetched in parallel while executing thumb instructions. A simplified diagram of the processor pipeline is shown in Fig. 2, where the state boundaries are indicated by gray. Features that allow the microarchitecture to achieve high speed are as follows.
Fig. 2. Microprocessor pipeline organization.
The shifter and ALU reside in separate stages. The ARM in-
struction set allows a shift followed by an ALU operation in a
single instruction. Previous implementations limited frequency
by having the shift and ALU in a single stage. Splitting this op-
eration reduces the critical ALU bypass path by approximately
1/3. The extra pipeline hazard introduced when an instruction is
immediately followed by one requiring that the result be shifted
is infrequent.
Decoupled Instruction Fetch. A two-instruction deep queue is
implemented between the second fetch and instruction decode
pipe stages. This allows stalls generated later in the pipe to be
deferred by one or more cycles in the earlier pipe stages, thereby
allowing instruction fetches to proceed when the pipe is stalled,
and also relieves stall speed paths in the instruction fetch and
branch prediction units.
Deferred register dependency stalls. While register depen-
dencies are checked in the RF stage, stalls due to these hazards
are deferred until the X1 stage. All the necessary operands are
then captured from result-forwarding busses as the results are
returned to the register file.
One of the major goals of the design was to minimize the en-
ergy consumed to complete a given task. Conventional wisdom
has been that shorter pipelines are more efficient due to re-
Floorplanning: essential to meet timing.
(Intel XScale 80200)
Break
Simple exercises for gaining intuition about timing for your process + EDA tools.
Thanks to Bhupesh Dasila, Open-Silicon Bangalore
Synthesize gate chains using hand-specified library cells
Exercises cell library and place and route tools.
Lets you know how many levels of logic you can use in the best case.
Helps you “see through” ... “Technology X”.
Synthesis constrained to 2ns clock.
Spring 2003 EECS150 – Lec10-Timing Page 11
Gate Delay
• Cascaded gates:
[Figure labels: Vin, Vout]
Delay of a chain of 3 inverters with strongest strength. “Guaranteed not to exceed” speed.
weak NANDs
Chain lengths ...
40 nm process 29 ps/gate av.
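A rough sketch of the "how many levels of logic" question, using the two numbers quoted on these slides (a 2 ns clock and about 29 ps per gate on average). The `overhead_ps` parameter, meant to stand for flip-flop clk-to-Q plus setup time, is an assumption added for illustration; the slides give no value for it.

```python
def max_logic_levels(clock_ps, gate_ps, overhead_ps=0.0):
    """Whole gate delays that fit in one clock period after subtracting overhead."""
    return int((clock_ps - overhead_ps) // gate_ps)


# Best case: the entire 2 ns period is available to the gate chain.
best_case = max_logic_levels(2000, 29)

# With a hypothetical 100 ps of clk-to-Q plus setup overhead.
with_overhead = max_logic_levels(2000, 29, overhead_ps=100)

print(best_case, with_overhead)
```

Wire delay and slower cells reduce the usable count further, which is why the gate-chain experiment is described as a best case.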
Spring 2003 EECS150 – Lec10-Timing Page 16
Wire Delay
• Even in those cases where the transmission line effect is negligible:
– Wires possess distributed resistance and capacitance
– Time constant associated with distributed RC is proportional to the square of the length
• For short wires on ICs, resistance is insignificant (relative to effective R of transistors), but C is important.
– Typically around half of C of gate load is in the wires.
• For long wires on ICs:
– busses, clock lines, global control signal, etc.
– Resistance is significant, therefore distributed RC effect dominates.
– signals are typically “rebuffered” to reduce delay
[Figure: a long wire rebuffered at points v1, v2, v3, v4, with the four voltages plotted against time.]
Force P&R to drive a long wire with a known buffer cell.
Vary driver strength, wire length, metal layer.
Shows the maximum distance two gates can be placed apart and still meet your clock period.
The distributed RC delay growing as the square of the length is clearly seen!
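A minimal sketch of the square-law behaviour and of why rebuffering helps, using an Elmore-style estimate of 0.5 · R_total · C_total for a distributed RC wire. The per-millimetre resistance and capacitance and the buffer delay are hypothetical values, not data from the experiment on the slide.

```python
def wire_delay_ps(length_mm, r_ohm_per_mm=100.0, c_pf_per_mm=0.2):
    """Elmore-style delay of a distributed RC wire; ohm * pF gives picoseconds."""
    r_total = r_ohm_per_mm * length_mm
    c_total = c_pf_per_mm * length_mm
    return 0.5 * r_total * c_total


def rebuffered_delay_ps(length_mm, n_segments, buffer_ps=15.0):
    """Delay when the wire is cut into equal segments with a buffer between each pair."""
    segment = wire_delay_ps(length_mm / n_segments)
    return n_segments * segment + (n_segments - 1) * buffer_ps


for length in (1, 2, 4):
    print(length, "mm unbuffered:", wire_delay_ps(length), "ps")
print("4 mm in 4 segments:", rebuffered_delay_ps(4, 4), "ps")
```

Doubling the length quadruples the unbuffered delay, while splitting the wire makes the wire term grow roughly linearly with length at the cost of the added buffer delays.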
Bhupesh Dasila
CS250, UC Berkeley Fall ’12, Lecture 04, Timing
Turning Rise/Fall Delay into Gate Delay
• Cascaded gates:
“transfer curve” for inverter.
CS250, UC Berkeley Fall ’12, Lecture 04, Timing
Driving Large Loads
‣ Large fanout nets: clocks, resets, memory bit lines, off-chip
‣ Relatively small driver results in long rise time (and thus large gate delay)
‣ Strategy: staged buffers
‣ Optimal trade-off between delay per stage and total number of stages ⇒ fanout of ∼4-6 per stage
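A sketch of the stage-count trade-off, assuming a simplified delay model in which each buffer stage costs (stage fanout + parasitic) in normalized units. The model, the parasitic value of 1.0, and the load ratio of 256 are assumptions for illustration, not figures from the slide.

```python
def chain_delay(load_ratio, n_stages, parasitic=1.0):
    """Normalized delay of n equal-fanout stages driving load_ratio times the input load."""
    stage_fanout = load_ratio ** (1.0 / n_stages)
    return n_stages * (stage_fanout + parasitic)


def best_stage_count(load_ratio, max_stages=12, parasitic=1.0):
    """Stage count with the lowest modeled delay."""
    return min(range(1, max_stages + 1),
               key=lambda n: chain_delay(load_ratio, n, parasitic))


ratio = 256  # hypothetical: the load is 256x the first buffer's input capacitance
n = best_stage_count(ratio)
print("stages:", n, "fanout per stage:", ratio ** (1.0 / n))
print("one big step:", chain_delay(ratio, 1), "staged:", chain_delay(ratio, n))
```

Under these assumptions the minimum lands at a per-stage fanout of 4, consistent with the range quoted above; fewer stages make each stage too slow, and more stages add too many stage overheads.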
Register file: Synthesize, or use SRAM?
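[Diagram: 32-entry, 32-bit register file built from flip-flops. R0 is the constant 0; R1 through R31 are registers with D, En, and Q pins clocked by clk. A DEMUX driven by sel(ws) (5 bits) and WE enables one register to load wd (32 bits). Two 32-input MUXes, driven by sel(rs1) and sel(rs2) (5 bits each), produce rd1 and rd2: “two read ports”.]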
Speed will depend on how large it lays out ...
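A small behavioral sketch of the register file in the diagram: two combinational read ports, one write port gated by WE, and R0 hard-wired to zero. The class and method names are hypothetical; this models function only and says nothing about area or speed.

```python
class RegFile:
    """Behavioral model: 2 read ports, 1 write port, R0 reads as constant 0."""

    def __init__(self, n_regs=32, width=32):
        self.mask = (1 << width) - 1
        self.regs = [0] * n_regs

    def read(self, rs1, rs2):
        """Combinational read: the two 32-input muxes selecting rd1 and rd2."""
        return self.regs[rs1], self.regs[rs2]

    def posedge(self, we, ws, wd):
        """Clock edge: the demux enables register ws when WE is set; R0 never changes."""
        if we and ws != 0:
            self.regs[ws] = wd & self.mask


rf = RegFile()
rf.posedge(we=True, ws=4, wd=7)
rf.posedge(we=True, ws=0, wd=99)   # ignored: R0 is the constant 0
print(rf.read(4, 0))
```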
Figure 3: Using the raw area data, the physical implementation team can get a more accurate area estimation early in the RTL development stage for floorplanning purposes. This shows an example of this graph for a 1-port, 32-bit-wide SRAM.
Synthesized, custom, and SRAM-based register files, 40nm
For small register files, logic synthesis is competitive.
Not clear if the SRAM data points include area for register control, etc.
[Plot legend: register file compiler, synthesis, SRAMs]
Bhupesh Dasila
Techniques
Pipelining
Starting point: A single-cycle processor
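[Diagram: single-cycle datapath. PC register (D, Q) with a +0x4 adder, instruction memory (Addr, Data), register file (rs1, rs2, ws, wd, WE, rd1, rd2), Ext unit, 32-bit ALU with op input, data memory (Addr, Din, Dout, WE), and a MemToReg mux.]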
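\[ \frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}} \]
Cycles per instruction: CPI == 1. This is good.
Seconds per cycle: Slow. This is bad.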
Challenge: Speed up clock while keeping CPI == 1
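A minimal sketch of the performance equation on this slide, Seconds/Program = Instructions/Program × Cycles/Instruction × Seconds/Cycle. The instruction count and both clock periods are hypothetical numbers chosen to show what the challenge is asking for.

```python
def seconds_per_program(instructions, cpi, clock_period_s):
    """Seconds/Program = Instructions/Program * Cycles/Instruction * Seconds/Cycle."""
    return instructions * cpi * clock_period_s


# Hypothetical single-cycle design: CPI of 1 but a long 10 ns clock period.
single_cycle = seconds_per_program(1_000_000, 1.0, 10e-9)

# The goal: a shorter clock period (here 2 ns) with CPI still equal to 1.
faster_clock = seconds_per_program(1_000_000, 1.0, 2e-9)

print(single_cycle, faster_clock, single_cycle / faster_clock)
```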
Reminder: How data flows after posedge
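[Diagram: after the positive clock edge, the new PC value flows through the instruction memory (Addr, Data), the decode logic, the register file read ports (rs1, rs2 to rd1, rd2), and the 32-bit ALU, while the +0x4 adder computes the next PC.]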
Next posedge: Update state and repeat
[Diagram: at the next positive edge, the PC register (D, Q) and the register file (WE, ws, wd) capture their new values.]
Observation: Logic idle most of cycle
[Diagram: the single-cycle datapath again: PC register with +0x4 adder, instruction memory, register file, Ext unit, 32-bit ALU, data memory, and MemToReg mux.]
For most of cycle, ALU is either “waiting” for its inputs, or “holding” its output
Ideal: a CPU architecture where each part is always “working”.
Inspiration: Automobile assembly line
Assembly line moves on a steady clock.
Each station does the same task on each car.
[Figure labels: car body shell, car chassis, merge station, bolting station, the clock]
Inspiration: Automobile assembly line
Simpler station tasks → more cars per hour.
Simple tasks take less time, clock is faster.
Inspiration: Automobile assembly line
Line speed limited by slowest task.
Most efficient if all tasks take same time to do.
Inspiration: Automobile assembly line
Simpler tasks, complex car → long line!
These lines go 24 x 7, and rarely shut down.
Lessons from car assembly lines
Faster line movement yields more cars per hour off the line.
Faster line movement requires more stages, each doing simpler tasks.
To maximize efficiency, all stages should take same amount of time (if not, workers in fast stages are idle).
“Filling”, “flushing”, and “stalling” assembly line are all bad news.
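The lessons above can be sketched numerically, assuming an idealized line with no stalls where the clock must be as long as the slowest stage. The stage times and item count are hypothetical.

```python
def line_stats(stage_times):
    """Clock set by the slowest stage; utilization = busy time / available time."""
    clock = max(stage_times)
    utilization = sum(stage_times) / (clock * len(stage_times))
    return clock, utilization


def total_time(n_items, stage_times):
    """Time to push n_items through: fill the line once, then one item per clock."""
    clock = max(stage_times)
    return (len(stage_times) + n_items - 1) * clock


unbalanced = [2, 2, 2, 6]   # same total work as below, but one slow stage
balanced = [3, 3, 3, 3]

print(line_stats(unbalanced), total_time(100, unbalanced))
print(line_stats(balanced), total_time(100, balanced))
```

With the same total work, the balanced line runs at twice the rate and keeps every stage busy; the fill term, `len(stage_times) - 1` clocks, is the cost paid every time the line is filled or flushed.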
Key Analogy: The instruction is the car
[Diagram: five-stage pipeline. Stage #1, Instruction Fetch: PC register (D, Q), +0x4 adder, instruction memory (Addr, Data). An instruction register (IR) at each stage boundary carries the instruction down the pipe; the IR in each of Stages #2 through #5 controls the hardware in that stage.]
“Data-stationary control”
Example: Decode & Register Fetch Stage
[Diagram: Stage #1, Instr Fetch (PC register, +0x4 adder, instruction memory, IR); Stage #2, Decode & Reg Fetch (register file with rs1, rs2, ws, wd, WE, rd1, rd2, plus Ext, with results latched in the A, B, and M registers and the IR passed along); Stage #3.]
A sample program:
ADD R4,R3,R2
OR R7,R6,R5
SUB R10,R9,R8
R’s chosen so that instructions are independent - like cars on the line.
Hazards: An instruction is not a car ...
[Figure: the three-stage datapath from the previous slide (Instr Fetch, Decode & Reg Fetch, Stage #3), with the new sample program in flight.]
New sample program:
ADD R4,R3,R2
OR R5,R4,R2
R4 not written yet ... wrong value of R4 fetched from RegFile, contract with programmer broken! Oops!
An example of a “hazard” -- we must (1) detect and (2) resolve all hazards to make a CPU that matches the ISA.
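The "detect" half of the hazard problem can be sketched in a few lines. This is a simplified illustration, not the lecture's hardware: the `Instr` record, the `window` parameter (how many older instructions are assumed to still be in flight) and the function names are all hypothetical, and the code only reports read-after-write dependences; it does not resolve them.

```python
from collections import namedtuple

# Hypothetical three-operand instruction record: dest <- src1 op src2.
Instr = namedtuple("Instr", ["op", "dest", "src1", "src2"])


def parse(text):
    """Turn 'ADD R4,R3,R2' into Instr('ADD', 'R4', 'R3', 'R2')."""
    op, regs = text.split(None, 1)
    dest, src1, src2 = [r.strip() for r in regs.split(",")]
    return Instr(op, dest, src1, src2)


def raw_hazards(program, window):
    """Return (producer_index, consumer_index, register) triples.

    A hazard is reported when an instruction reads a register written by
    one of the `window` instructions immediately ahead of it, that is, by
    an instruction assumed not to have written the register file yet.
    """
    instrs = [parse(line) for line in program]
    hazards = []
    for j, consumer in enumerate(instrs):
        for i in range(max(0, j - window), j):
            producer = instrs[i]
            if producer.dest in (consumer.src1, consumer.src2):
                hazards.append((i, j, producer.dest))
    return hazards


independent = ["ADD R4,R3,R2", "OR R7,R6,R5", "SUB R10,R9,R8"]
dependent = ["ADD R4,R3,R2", "OR R5,R4,R2"]

print(raw_hazards(independent, window=2))  # the first sample program
print(raw_hazards(dependent, window=2))    # OR reads R4, which ADD writes
```

In hardware the same comparison is done by comparators on the register-specifier fields of the IR copies held in adjacent pipeline stages.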
Performance Equation and Hazards
[Figure: the three-stage datapath from the previous slides (Instr Fetch, Decode & Reg Fetch, Stage #3).]
$$\frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}}$$
“Software slows the machine down” -- Seymour Cray
Some ways to cope with hazards make CPI > 1 (“stalling pipeline”).
Added logic to detect and resolve hazards increases clock period.
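As a worked example of the performance equation, the sketch below plugs in invented numbers (one million instructions, a 1 ns clock, a CPI of 1.3 with stalls, a 10% longer clock period with extra hazard logic). None of these values come from the lecture; they only show how each term scales run time.

```python
def seconds_per_program(instructions, cycles_per_instruction, seconds_per_cycle):
    """Seconds/Program = Instructions/Program x Cycles/Instruction x Seconds/Cycle."""
    return instructions * cycles_per_instruction * seconds_per_cycle


INSTRUCTIONS = 1_000_000   # hypothetical program length
PERIOD = 1e-9              # hypothetical 1 ns clock period

ideal = seconds_per_program(INSTRUCTIONS, 1.0, PERIOD)
with_stalls = seconds_per_program(INSTRUCTIONS, 1.3, PERIOD)          # stalls raise CPI
with_slower_clock = seconds_per_program(INSTRUCTIONS, 1.0, 1.1 * PERIOD)  # extra logic

print("ideal:            ", ideal)
print("stalling pipeline:", with_stalls)
print("longer period:    ", with_slower_clock)
```

Both ways of coping with hazards cost run time; which one costs less depends on how often the hazard occurs versus how much the extra logic lengthens the critical path.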
Superpipelining
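Superpipelining: Add more stages
$$\frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}}$$
Goal: Reduce critical path by adding more pipeline stages.
Difficulties: Added penalties for load delays and branch misses.
Ultimate Limiter: As logic delay per stage goes to 0, FF clk-to-Q and setup time set the floor on the clock period.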
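The "ultimate limiter" can be made concrete with a simple model. The sketch assumes the logic divides perfectly evenly across stages and uses invented delays (2000 ps of logic, 50 ps clk-to-Q, 30 ps setup); real designs never balance this well, so treat the numbers as illustrative only.

```python
def clock_period(total_logic_delay, stages, t_clk_to_q, t_setup):
    """Minimum clock period when logic is split evenly over `stages` stages.

    Each stage must fit: clk-to-Q of the launching flip-flop, its share of
    the logic, and the setup time of the capturing flip-flop.
    """
    return total_logic_delay / stages + t_clk_to_q + t_setup


LOGIC_PS = 2000.0     # hypothetical total combinational delay
CLK_TO_Q_PS = 50.0    # hypothetical flip-flop clk-to-Q
SETUP_PS = 30.0       # hypothetical flip-flop setup time

for n in (1, 5, 10, 20, 100):
    period = clock_period(LOGIC_PS, n, CLK_TO_Q_PS, SETUP_PS)
    overhead = (CLK_TO_Q_PS + SETUP_PS) / period
    print(n, "stages:", period, "ps,", round(100 * overhead), "% flip-flop overhead")
```

Going from 1 to 5 stages cuts the period by more than 4x, but each further doubling helps less, and the period can never drop below clk-to-Q plus setup.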
1600 IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 36, NO. 11, NOVEMBER 2001
Fig. 1. Process SEM cross section.
The process was raised from [1] to limit standby power. Circuit design and architectural pipelining ensure low voltage performance and functionality. To further limit standby current in handheld ASSPs, a longer poly target takes advantage of the versus dependence and source-to-body bias is used to electrically limit transistor in standby mode. All core nMOS and pMOS transistors utilize separate source and bulk connections to support this. The process includes cobalt disilicide gates and diffusions. Low source and drain capacitance, as well as 3-nm gate-oxide thickness, allow high performance and low-voltage operation.
III. ARCHITECTURE
The microprocessor contains 32-kB instruction and data caches as well as an eight-entry coalescing writeback buffer. The instruction and data cache fill buffers have two and four entries, respectively. The data cache supports hit-under-miss operation and lines may be locked to allow SRAM-like operation. Thirty-two-entry fully associative translation lookaside buffers (TLBs) that support multiple page sizes are provided for both caches. TLB entries may also be locked. A 128-entry branch target buffer improves branch performance a pipeline deeper than earlier high-performance ARM designs [2], [3].
A. Pipeline Organization
To obtain high performance, the microprocessor core utilizes a simple scalar pipeline and a high-frequency clock. In addition to avoiding the potential power waste of a superscalar approach, functional design and validation complexity is decreased at the expense of circuit design effort. To avoid circuit design issues, the pipeline partitioning balances the workload and ensures that no one pipeline stage is tight. The main integer pipeline is seven stages, memory operations follow an eight-stage pipeline, and when operating in thumb mode an extra pipe stage is inserted after the last fetch stage to convert thumb instructions into ARM instructions. Since thumb mode instructions [11] are 16 b, two instructions are fetched in parallel while executing thumb instructions. A simplified diagram of the processor pipeline is shown in Fig. 2, where the state boundaries are indicated by gray. Features that allow the microarchitecture to achieve high speed are as follows.
Fig. 2. Microprocessor pipeline organization.
The shifter and ALU reside in separate stages. The ARM instruction set allows a shift followed by an ALU operation in a single instruction. Previous implementations limited frequency by having the shift and ALU in a single stage. Splitting this operation reduces the critical ALU bypass path by approximately 1/3. The extra pipeline hazard introduced when an instruction is immediately followed by one requiring that the result be shifted is infrequent.
Decoupled Instruction Fetch. A two-instruction deep queue is implemented between the second fetch and instruction decode pipe stages. This allows stalls generated later in the pipe to be deferred by one or more cycles in the earlier pipe stages, thereby allowing instruction fetches to proceed when the pipe is stalled, and also relieves stall speed paths in the instruction fetch and branch prediction units.
Deferred register dependency stalls. While register dependencies are checked in the RF stage, stalls due to these hazards are deferred until the X1 stage. All the necessary operands are then captured from result-forwarding busses as the results are returned to the register file.
One of the major goals of the design was to minimize the energy consumed to complete a given task. Conventional wisdom has been that shorter pipelines are more efficient due to re-
Example: 8-stage ARM XScale: extra IF, ID, data cache stages.
Also, power!
[Embedded slide: CS 152 L10 Pipeline Intro (9), Fall 2004 © UC Regents]
Graphically Representing MIPS Pipeline
Can help with answering questions like:
how many cycles does it take to execute this code?
what is the ALU doing during cycle 4?
is there a hazard, why does it occur, and how can it be fixed?
[Figure: pipeline drawn as IM, Reg, ALU, DM, Reg, corresponding to stages IF, ID+RF, EX, MEM, WB, with an IR carried into each stage after IF.]
5 Stage
8 Stage
IF now takes 2 stages (pipelined I-cache).
ID and RF each get a stage.
ALU split over 3 stages.
MEM takes 2 stages (pipelined D-cache).
Note: Some stages now overlap, some instructions
take extra stages.
Superpipelining techniques ...
Split ALU and decode logic over several pipeline stages.
Pipeline memory: Use more banks of smaller arrays, add pipeline stages between decoders, muxes.
Remove “rarely-used” forwarding networks that are on critical path.
Pipeline the wires of frequently used forwarding networks.
Creates stalls, affects CPI.
Also: Clocking tricks (example: use posedge and negedge registers)
Hardware limits to superpipelining?
[Figure: "Cycle in FO4". CPU clock periods in FO4 delays (y-axis, 0 to 100) versus year (x-axis, 1985 to 2005) for: Intel 386, 486, Pentium, Pentium 2, Pentium 3, Pentium 4, Itanium; Alpha 21064, 21164, 21264; Sparc, SuperSparc, Sparc64; Mips; HP PA; Power PC; AMD K6, K7, x86-64. Source: Francois Labonte, Stanford University, 4/23/2003.]
Thanks to Francois Labonte, Stanford
CPU Clock Periods 1985-2005
FO4: How many fanout-of-4 inverter delays in the clock period.
Historical limit: about 12 FO4s
MIPS 2000: 5 stages
Pentium Pro: 10 stages
Pentium 4: 20 stages
* Power wall: Intel Core Duo has 14 stages
CPU DB: Recording Microprocessor History
With this open database, you can mine microprocessor trends over the past 40 years.
Andrew Danowitz, Kyle Kelley, James Mao, John P. Stevenson, Mark Horowitz, Stanford University
In November 1971, Intel introduced the world’s first single-chip microprocessor, the Intel 4004. It had 2,300 transistors, ran at a clock speed of up to 740 KHz, and delivered 60,000 instructions per second while dissipating 0.5 watts. The following four decades witnessed exponential growth in compute power, a trend that has enabled applications as diverse as climate modeling, protein folding, and computing real-time ballistic trajectories of angry birds. Today’s microprocessor chips employ billions of transistors, include multiple processor cores on a single silicon die, run at clock speeds measured in gigahertz, and deliver more than 4 million times the performance of the original 4004.
Where did these incredible gains come from? This article sheds some light on this question by introducing CPU DB (cpudb.stanford.edu), an open and extensible database collected by Stanford’s VLSI (very large-scale integration) Research Group over several generations of processors (and students). We gathered information on commercial processors from 17 manufacturers and placed it in CPU DB, which now contains data on 790 processors spanning the past 40 years.
In addition, we provide a methodology to separate the effect of technology scaling from improvements on other frontiers (e.g., architecture and software), allowing the comparison of machines built in different technologies. To demonstrate the utility of this data and analysis, we use it to decompose processor improvements into contributions from the physical scaling of devices, and from improvements in microarchitecture, compiler, and software technologies.
AN OPEN REPOSITORY OF PROCESSOR SPECS
While information about current processors is easy to find, it is rarely arranged in a manner that is useful to the research community. For example, the data sheet may contain the processor’s power, voltage, frequency, and cache size, but not the pipeline depth or the technology minimum feature size. Even then, these specifications often fail to tell the full story: a laptop processor operates over a range of frequencies and voltages, not just the 2 GHz shown on the box label.
Not surprisingly, specification data gets harder to find the older the processor becomes, especially for those that are no longer made, or worse, whose manufacturers no longer exist. We have been collecting this type of data for three decades and are now releasing it in the form of an open repository of processor specifications. The goal of CPU DB is to aggregate detailed processor specifications into a convenient form and to encourage community participation, both to leverage this information and to keep it accurate and current. CPU DB (cpudb.stanford.edu) is populated with desktop, laptop, and server processors, for which we use SPEC [13] as our performance-measuring tool. In addition, the database contains limited data on embedded cores, for which we are using the CoreMark benchmark for performance [5]. With time and help from the community, we hope to extend the coverage of embedded processors in the database.
[Figure: "FO4 Delays Per Cycle for Processor Designs". FO4 / cycle (y-axis, 0 to 140) versus year (x-axis, 1985 to 2015).]
FO4 delay per cycle is roughly proportional to the amount of computation completed per cycle.
Multithreading
[Slide credit: Krste, November 10, 2004, 6.823, L18--3]
Multithreading
How can we guarantee no dependencies between instructions in a pipeline?
-- One way is to interleave execution of instructions from different program threads on same pipeline
Interleave 4 threads, T1-T4, on non-bypassed 5-stage pipe

                        t0 t1 t2 t3 t4 t5 t6 t7 t8 t9
T1: LW r1, 0(r2)        F  D  X  M  W
T2: ADD r7, r1, r4         F  D  X  M  W
T3: XORI r5, r4, #12          F  D  X  M  W
T4: SW 0(r7), r5                 F  D  X  M  W
T1: LW r5, 12(r1)                   F  D  X  M  W

Last instruction in a thread always completes writeback before next instruction in same thread reads regfile
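The timing claim on this slide can be checked with a small scheduler model. The sketch assumes one instruction is fetched per cycle in strict round-robin order, that the register file is read in D and written in W, and that a write is not visible to a read in the same cycle; the function names are invented for this illustration.

```python
STAGES = ["F", "D", "X", "M", "W"]


def schedule(num_threads, instrs_per_thread):
    """Round-robin fetch, one instruction per cycle.

    Returns a list of (thread, seq, {stage: cycle}) in fetch order.
    """
    out = []
    cycle = 0
    for seq in range(instrs_per_thread):
        for thread in range(num_threads):
            timing = {stage: cycle + k for k, stage in enumerate(STAGES)}
            out.append((thread, seq, timing))
            cycle += 1
    return out


def writeback_precedes_read(sched):
    """True if, within every thread, each instruction's W cycle comes
    strictly before the next instruction's D (register read) cycle."""
    last_writeback = {}
    for thread, _seq, timing in sched:
        if thread in last_writeback and last_writeback[thread] >= timing["D"]:
            return False
        last_writeback[thread] = timing["W"]
    return True


for n in (1, 2, 3, 4, 5):
    print(n, "threads:", writeback_precedes_read(schedule(n, 3)))
```

Under these assumptions four threads is the smallest count that removes every same-thread dependence on a non-bypassed 5-stage pipe, which matches the slide's choice of T1-T4.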
[Slide credit: Krste, November 10, 2004, 6.823, L18--5]
Simple Multithreaded Pipeline
Have to carry thread select down pipeline to ensure correct state bits read/written at each pipe stage
[Figure: pipeline with four PCs and four GPR files chosen by a 2-bit thread select; other labels: +1, I$, IR, X, Y, D$.]
Multithreading of Static Pipelines
4 CPUs, each run at 1/4 clock
Many variants ...
Post-Placement C-slow Retiming for the Xilinx Virtex FPGA
Nicholas Weaver*, UC Berkeley, Berkeley, CA
Yury Markovskiy, UC Berkeley, Berkeley, CA
Yatish Patel, UC Berkeley, Berkeley, CA
John Wawrzynek, UC Berkeley, Berkeley, CA
ABSTRACT
C-slow retiming is a process of automatically increasing the throughput of a design by enabling fine grained pipelining of problems with feedback loops. This transformation is especially appropriate when applied to FPGA designs because of the large number of available registers. To demonstrate and evaluate the benefits of C-slow retiming, we constructed an automatic tool which modifies designs targeting the Xilinx Virtex family of FPGAs. Applying our tool to three benchmarks: AES encryption, Smith/Waterman sequence matching, and the LEON 1 synthesized microprocessor core, we were able to substantially increase the total throughput. For some parameters, throughput is effectively doubled.
Categories and Subject Descriptors
B.6.3 [Logic Design]: Design Aids - Automatic synthesis
General Terms
Performance
Keywords
FPGA CAD, FPGA Optimization, Retiming, C-slow Retiming
*Please address any correspondence to [email protected]
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. FPGA’03, February 23–25, 2003, Monterey, California, USA. Copyright 2003 ACM 1-58113-651-X/03/0002 ...$5.00.
1. Introduction
Leiserson’s retiming algorithm[7] offers a polynomial time algorithm to optimize the clock period on arbitrary synchronous circuits without changing circuit semantics. Although a powerful and efficient transformation that has been employed in experimental tools[10][2] and commercial synthesis tools[13][14], it offers only a minor clock period improvement for a well constructed design, as many designs have their critical path on a single cycle feedback loop and can’t benefit from retiming.
Also proposed by Leiserson et al to meet the constraints of systolic computation, is C-slow retiming [footnote 1]. In C-slow retiming, each design register is first replaced with C registers before retiming. This transformation modifies the design semantics so that C separate streams of computation are distributed through the pipeline, greatly increasing the aggregate throughput at the cost of additional latency and flip flops. This can automatically accelerate computations containing feedback loops by adding more flip-flops that retiming can then move around the critical path.
The effect of C-slow retiming is to enable pipelining of the critical path, even in the presence of feedback loops. To take advantage of this increased throughput however, there needs to be sufficient task level parallelism. This process will slow any single task but the aggregate throughput will be increased by interleaving the resulting computation.
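The change in semantics described above can be illustrated with a toy feedback loop. The sketch models a running-sum accumulator whose single feedback register has been replaced by C registers, so that C independent input streams take turns on the same adder. It illustrates the C-slow transformation only, not the paper's tool and not the retiming step; the function name and the equal-length-streams restriction are assumptions of this example.

```python
from collections import deque


def c_slow_accumulate(streams):
    """Run C running sums through one shared adder.

    `streams` is a list of C equal-length input sequences. The feedback
    loop holds C registers, so the value that comes back around on any
    cycle is the state written C cycles earlier, i.e. the state of the
    same stream.
    """
    c = len(streams)
    length = len(streams[0])
    loop = deque([0] * c)              # C feedback registers, reset to 0
    outputs = [[] for _ in range(c)]
    for t in range(c * length):
        stream = t % c                 # round-robin input selection
        x = streams[stream][t // c]
        state = loop.popleft()         # oldest register: this stream's state
        new_state = state + x          # the shared combinational logic
        loop.append(new_state)
        outputs[stream].append(new_state)
    return outputs


print(c_slow_accumulate([[1, 2, 3], [10, 20, 30]]))
```

Each stream sees one result every C cycles (it runs slower), but the extra registers in the loop are what retiming can then spread through the adder logic to shorten the clock period.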
This process works very well on many FPGA architectures as these architectures tend to have a balanced ratio of logic elements to registers, while most user designs contain a considerably higher percentage of logic. Additionally, many architectures allow the registers to be used independently of the logic in a logic block.
We have constructed a prototype C-slow retiming tool that modifies designs targeting the Xilinx Virtex family of FPGAs. The tool operates after placement: converting every design register to C separate registers before applying Leiserson’s retiming algorithm to minimize the clock period. New registers are allocated by scavenging unused array resources. The resulting design is then returned to Xilinx tools for routing, timing analysis, and bitfile generation.
We have selected three benchmarks: AES encryption, Smith/Waterman sequence matching, and the LEON 1
[Footnote 1] This was originally defined to meet systolic slowdown requirements.
At the logic level ...
Figure 1: A small graph before retiming. The nodes represent logic delays, with the inputs and outputs passing through mandatory, fixed registers. The critical path is 5. [Graph residue: IN and OUT terminals; node delays 1, 1, 1, 1, 2, 2.]
microprocessor core, for which we can envision scenarios where ample task-level parallelism exists. The AES and Smith/Waterman benchmarks were also C-slowed by hand, enabling us to evaluate how well our automated techniques compare with careful, hand designed implementations that accomplish the same goals.
The LEON 1 processor is a significantly larger synthesized design. Although it seems unusual, there is sufficient task level parallelism to C-slow a microprocessor, as each stream of execution can be viewed as a separate task. The resulting C-slowed design behaves like a multithreaded system, with each virtual processor running slower but offering a higher total throughput.
This prototype demonstrates significant speedups on all 3 benchmarks, nearly doubling the throughput for the proper parameters. On the AES and Smith/Waterman benchmarks, these automated results compare favorably with careful hand-constructed implementations that were the result of manual C-slowing and pipelining.
In the remainder of the paper, we first discuss the semantic restrictions and changes that retiming and C-slow retiming impose on a design, the details of the retiming algorithm, and the use of the target architecture. Following the discussion of C-slow retiming, we describe our implementation of an automatic retiming tool. Then we describe the structure of all three benchmarks and present the results of applying our tool.
2. Conventional Retiming
Leiserson’s retiming treats a synchronous circuit as a directed graph, with delays on the nodes representing combinational delays and weights on the edges representing registers in the design. An additional node represents the external world, with appropriate edges added to account for all the I/Os. Two matrices are calculated, W and D, that represent the number of registers and critical path between every pair of nodes in the graph. Each node also has a lag value r that is calculated by the algorithm and used to change the number of registers on any given edge. Conventional retiming does not change the design semantics: all input and output timings remain unchanged, while imposing minor design constraints on the use of FPGA features. More details and formal proofs of correctness can be found in Leiserson’s original paper[7].
Figure 2: The example in Figure 1 after retiming. The critical path is reduced from 5 to 4. [Graph residue: IN and OUT terminals; node delays 1, 1, 1, 1, 2, 2.]
In order to determine whether a critical path P can be achieved, the retiming algorithm creates a series of constraints to calculate the lag on each node. All these constraints are of the form $x - y \le k$ that can be solved in $O(n^2)$ time by using the Bellman/Ford shortest path algorithm. The primary constraints ensure correctness: no edge will have a negative number of registers while every cycle will always contain the original number of registers. All IO passes through an intermediate node ensuring that input and output timings do not change. These constraints can be modified to ensure that a particular line will contain no registers or a mandatory minimum number of registers to meet architectural constraints.
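Constraints of the form $x - y \le k$ can be solved with Bellman/Ford by treating each constraint as a graph edge from y to x with weight k. The sketch below is a generic difference-constraint solver written for illustration, not the paper's implementation; the function name, the data layout and the example constraints are all hypothetical. In the retiming setting the variables would be the node lags r(v).

```python
def solve_difference_constraints(variables, constraints):
    """Solve a system of constraints x - y <= k.

    `constraints` is a list of (x, y, k) triples. Returns a dict mapping
    each variable to a value satisfying every constraint, or None when
    the system is infeasible (the constraint graph has a negative cycle).
    """
    # Starting every distance at 0 is equivalent to adding a virtual
    # source with a zero-weight edge to every variable.
    dist = {v: 0 for v in variables}
    for _ in range(len(variables) + 1):
        changed = False
        for x, y, k in constraints:
            if dist[y] + k < dist[x]:      # relax edge y -> x of weight k
                dist[x] = dist[y] + k
                changed = True
        if not changed:
            return dist                    # converged: all constraints hold
    return None                            # still relaxing: negative cycle


feasible = [("a", "b", 1), ("b", "c", -2), ("c", "a", 1)]
infeasible = [("a", "b", -1), ("b", "a", -1)]

print(solve_difference_constraints(["a", "b", "c"], feasible))
print(solve_difference_constraints(["a", "b"], infeasible))
```

The retiming algorithm wraps a solver like this in a search over candidate critical-path values P, keeping the smallest P for which the constraint system is still feasible.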
A second set of constraints attempts to ensure that every path longer than the critical path will contain at least one register, by creating an additional constraint for every path longer than the critical path. The actual constraints are summarized in Table 1.
This process is iterated to find the minimum critical path that meets all the constraints. The lag calculated by these constraints can then be used to change the design to meet this critical path. For each edge e from u to v, a new register weight $w'$ is calculated, with $w'(e) = w(e) - r(u) + r(v)$.
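Applying the lags is a one-line computation per edge. The sketch below uses the paper's formula $w'(e) = w(e) - r(u) + r(v)$ for an edge from u to v; the three-node ring and its lag values are invented for this example, and the function name is hypothetical.

```python
def retime(edge_registers, lag):
    """Return new register counts w'(e) = w(e) - r(u) + r(v).

    `edge_registers` maps (u, v) edges to their register count w(e);
    `lag` maps each node to its lag r. Raises ValueError if a lag
    assignment would leave an edge with a negative register count.
    """
    retimed = {}
    for (u, v), w in edge_registers.items():
        w_new = w - lag[u] + lag[v]
        if w_new < 0:
            raise ValueError("edge %s->%s would get %d registers" % (u, v, w_new))
        retimed[(u, v)] = w_new
    return retimed


# Hypothetical ring a -> b -> c -> a holding two registers in total.
ring = {("a", "b"): 1, ("b", "c"): 0, ("c", "a"): 1}
lags = {"a": 0, "b": -1, "c": 0}

after = retime(ring, lags)
print(after)
print(sum(ring.values()), sum(after.values()))   # registers around the cycle
```

Because every node's lag is added once and subtracted once around any cycle, the register count of the cycle is preserved, which is the correctness property the primary constraints rely on.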
An example of how retiming a↵ects a simple design canbe seen in Figures 2 and 2. The initial design has a criticalpath of 5, while after retiming the critical path is reducedto 4. During this process, the number of registers is in-creased, yet the number of registers on every cycle andthe path from input to output remain unchanged. Sincethe feedback loop has only a single register and a delay of4, it is impossible to further improve the performance byretiming.
Retiming in this form imposes only minimal design lim-itations: there can be no asynchronous resets or similarelements, as the retiming technique only applies to syn-chronous circuits. A synchronous global reset imposes toomany constraints to allow retiming unless initial conditionsare calculated and the global reset itself is now excludedfrom retiming purposes. Local synchronous resets and en-ables just produce small, self loops that have no e↵ect onthe correct operation of the algorithm.
Most other design features can be accommodated bysimply adding appropriate constraints. As an example, alltristated lines can’t have registers applied to them, whilemandatory elements such as those seen in synchronousmemories can be easily accommodated by mandating reg-isters on the appropriate nets.
Memories themselves can be retimed like any other element in the design, with dual-ported memories treated as a single node for retiming purposes. Memories that are synthesized with a negative clock edge (to create the design illusion of asynchronous memories) can either be left unchanged or switched to operate on the positive edge, with constraints to mandate the placement of registers (footnote 2).

Figure 3: The example in Figure 2, 2-slowed. This design now operates on 2 independent data streams.

Condition                                    Constraint
normal edge from u → v                       $r(u) - r(v) \le w(e)$
edge from u → v must be registered           $r(u) - r(v) \le w(e) - 1$
edge from u → v can never be registered      $r(u) - r(v) \le 0$ and $r(v) - r(u) \le 0$
critical paths must be registered            $r(u) - r(v) \le W(u, v) - 1$ for all $u, v$ such that $D(u, v) > P$
Table 1: The constraint system used by the retiming process.
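Every row of Table 1 has the form "one lag minus another is at most a constant", so the system can be solved as a set of difference constraints. The sketch below uses Bellman-Ford relaxation, which is one standard way to do this and is an assumption here rather than a method named in the text. The function name and the example constraints are hypothetical, and the critical-path rows (which need the $W$ and $D$ matrices) are left out.

```python
def solve_difference_constraints(nodes, constraints):
    """Solve constraints of the form r(u) - r(v) <= c with Bellman-Ford relaxation.

    constraints: list of (u, v, c) tuples, each meaning r(u) - r(v) <= c
    Returns a dict of integer lags, or None if the system has no solution.
    """
    # Starting every lag at 0 is equivalent to a virtual source node that
    # reaches every node through a zero-weight edge.
    r = {node: 0 for node in nodes}
    for _ in range(len(nodes)):
        changed = False
        for u, v, c in constraints:
            if r[v] + c < r[u]:
                r[u] = r[v] + c
                changed = True
        if not changed:
            return r
    return None  # still changing after |V| passes: the constraints contradict each other


# Hypothetical loop a -> b -> c -> a with w = 0, 0, 2,
# plus a "must be registered" constraint on the edge b -> c.
nodes = ["a", "b", "c"]
constraints = [
    ("a", "b", 0),       # normal edge a -> b, w(e) = 0
    ("b", "c", 0),       # normal edge b -> c, w(e) = 0
    ("c", "a", 2),       # normal edge c -> a, w(e) = 2
    ("b", "c", 0 - 1),   # edge b -> c must be registered: w(e) - 1
]
lags = solve_difference_constraints(nodes, constraints)
```

The returned lags can then be substituted into the register-weight formula to obtain the retimed design.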
Initial register conditions can also be calculated if desired, but this process is NP-hard in the general case. Cong and Wu [3] have an algorithm that computes initial states by restricting the design to forward retiming only, so that it propagates the information and registers forward throughout the computation. This works because solving initial states for all registers moved forward is straightforward, but backward movement is NP-hard (footnote 3), as it reduces to satisfiability.
An important question is how to deal with multiple clocks. If the interfaces between the clock domains are registered by clocks from both domains, and all signals are unidirectional, each clock domain can be treated as an independent block with all signals crossing the domain treated as I/O. Due to the retiming-imposed constraints on I/O, the logical view of each input will not change. However, constraints may be needed to ensure that the physical registers remain in position to prevent asynchronous conditions from occurring on this interface.
3. C-slow retiming
C-slowing enhances retiming by simply replacing every register with a sequence of C separate registers before retiming occurs. The resulting design operates on C distinct execution tasks. Since all registers are duplicated, the computation proceeds in a round-robin fashion. The easiest way to utilize a C-slowed block is to simply multiplex and demultiplex C separate data streams, but a more sophisticated interface may be desired depending on the application.

Footnote 2: For some cases, this may produce a set of unsolvable constraints, thus requiring that the memory remain a negative-edge device.
Footnote 3: And may not possess a valid solution for nonsensical cases.

Figure 4: The example in Figure 3 after retiming. The combination of C-slowing and retiming reduced the critical path from 5 to 2.
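The round-robin behaviour of a C-slowed block can be sketched with a toy accumulator whose single feedback register is replaced by a chain of C registers. The accumulator, the function name, and the input values are hypothetical; the sketch only illustrates that interleaved inputs are processed as C independent streams.

```python
from collections import deque


def c_slowed_accumulator(samples, c):
    """Simulate an accumulator whose one state register is replaced by C registers.

    samples: inputs in round-robin order (stream 0, stream 1, ..., stream C-1, stream 0, ...)
    Returns the adder output seen on each clock cycle.
    """
    regs = deque([0] * c)        # chain of C registers in the feedback loop, reset to 0
    outputs = []
    for x in samples:
        total = regs[-1] + x     # the adder reads the last register in the chain
        outputs.append(total)
        regs.pop()               # shift the chain by one position per clock
        regs.appendleft(total)
    return outputs


# Two hypothetical streams, multiplexed cycle by cycle.
stream_a = [1, 2, 3]
stream_b = [10, 20, 30]
interleaved = [x for pair in zip(stream_a, stream_b) for x in pair]
outputs = c_slowed_accumulator(interleaved, c=2)

# Demultiplex: even cycles belong to stream A, odd cycles to stream B.
sums_a = outputs[0::2]
sums_b = outputs[1::2]
```

Each stream sees its own running sum, as if it had a private copy of the original accumulator.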
One possible interface is to register all inputs and outputs of a C-slowed block. Because of the additional edges retiming creates to track I/Os and to ensure a consistent interface, every stream of execution presents all outputs at the same time, with all inputs being registered on the next cycle. If only part of the design is C-slowed, but all parts operate on the same clock, the resulting design can be retimed as a complete whole while preserving all other semantics. We use these observations later when discussing the effects of C-slowing on a microprocessor core.
However, C-slowing imposes some more significant FPGA design constraints, as summarized in Table 2. Register clock enables and resets must be expressed as logic features, since each independent thread must see a different view of the reset or enable. Thus, they can remain features in the design but can’t be implemented by current FPGAs using the native enables and resets. Other specialized features, such as Xilinx SRL16s (footnote 4), can’t be utilized in a C-slow design for the same reasons.
One important issue is how to properly C-slow memory blocks. In most cases, one desires the complete illusion that each stream of execution is completely independent and unchanged. To create this illusion, the memory must be increased by a factor of C, with additional address lines driven by a counter. This ensures that each stream of execution enjoys a completely separate memory space.
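A rough sketch of this memory transformation follows. It assumes the counter supplies the high-order address bits, which is one possible arrangement and not something the text specifies; the class name and sizes are hypothetical.

```python
class CSlowedMemory:
    """Memory enlarged by a factor of C; a thread counter supplies the extra address bits."""

    def __init__(self, depth, c):
        self.depth = depth
        self.c = c
        self.cells = [0] * (depth * c)   # storage grown by a factor of C
        self.thread = 0                  # counter that advances once per clock cycle

    def _physical(self, addr):
        if not 0 <= addr < self.depth:
            raise IndexError("address outside the logical memory")
        # Assumption: the counter forms the high-order part of the address.
        return self.thread * self.depth + addr

    def read(self, addr):
        return self.cells[self._physical(addr)]

    def write(self, addr, value):
        self.cells[self._physical(addr)] = value

    def tick(self):
        self.thread = (self.thread + 1) % self.c


mem = CSlowedMemory(depth=4, c=2)
mem.write(1, 11)              # stream 0 writes logical address 1
mem.tick()
mem.write(1, 22)              # stream 1 writes the same logical address
mem.tick()
stream0_value = mem.read(1)   # stream 0 reads back its own data
mem.tick()
stream1_value = mem.read(1)   # stream 1 reads back its own data
```

Both streams use the same logical address, yet neither overwrites the other, because the counter selects a different region of the enlarged memory on each cycle.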
For dual-ported memories, this potentially enables greater freedom in retiming: the two ports can have different lags, as long as the difference in lag is $C - 1$ or less. After retiming, the difference in lag is added to the appropriate port’s thread counter. This ensures that each
Footnote 4: A mode where the LUT can act as a 16-bit shift register.
Synchronous logic we want to “multithread”. Critical path is 5.
2X multi-threading: double each register.
Modern synthesis will retime this as shown: critical path is now 2.
Good fit for GALS
Two input queues (red and green). The mux control logic implements turn-taking.
Outputs placed into two output queues.
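The queue arrangement can be sketched as follows. This is a behavioural illustration only: `step` is a hypothetical stand-in for the 2X multi-threaded logic block, pipeline latency is ignored, and leaving a slot idle when a queue is empty is an assumption, not something the slide states.

```python
from collections import deque


def run_two_threads(red_in, green_in, step):
    """Alternate between two input queues and route results to two output queues.

    step(thread, value) stands in for the shared logic block (hypothetical).
    Returns [red_out, green_out] as deques.
    """
    inputs = [deque(red_in), deque(green_in)]
    outputs = [deque(), deque()]
    turn = 0
    while inputs[0] or inputs[1]:
        if inputs[turn]:                     # assumption: an empty queue leaves its slot idle
            value = inputs[turn].popleft()
            outputs[turn].append(step(turn, value))
        turn ^= 1                            # strict turn-taking between the two threads
    return outputs


red_out, green_out = run_two_threads([1, 2, 3], [10], step=lambda thread, v: v * 2)
```

Each result lands in the output queue that matches the input queue it came from, so the two clients never see each other's data.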
Electrical Details
Flip Flops Revisited
Recall: Static RAM cell (6 Transistors)
[Figure: “cross-coupled inverters” forming the storage cell, with the two storage nodes labeled x; voltage labels Gnd, Vdd, and Vth, and noise marked on each side.]
Recall: Positive edge-triggered flip-flop
[Figure: flip-flop symbol with input D and output Q.]
A flip-flop “samples” right before the edge, and then “holds” value.
Delay in Flip-flops (Spring 2003 EECS150 – Lec10-Timing, page 14)
• Setup time results from delay through first latch.
• Clock to Q delay results from delay through second latch.
[Figure: timing diagram of D, clk, and Q marking the setup time and the clock-to-Q delay, above a flip-flop circuit clocked by clk and clk’. Circuit labels: “Sampling circuit” and “Holds value”.]
16 Transistors: Makes an SRAM look compact!
What do we get for the 10 extra transistors?
Clocked logic semantics.
Sensing: When clock is low
clk = 0, clk’ = 1
Will capture new value on posedge.
Outputs last value captured.
Capture: When clock goes high
clk = 1, clk’ = 0
Remembers value just captured.
Outputs value just captured.
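The sense and capture phases can be mimicked with a small behavioural model of two latches, one transparent while the clock is low and one transparent while it is high. This is a logic-level sketch under that assumption; it ignores all delays and does not model the 16-transistor circuit itself. The class names are hypothetical.

```python
class Latch:
    """Level-sensitive latch: transparent while enable is true, holds its value otherwise."""

    def __init__(self):
        self.q = 0

    def update(self, d, enable):
        if enable:
            self.q = d
        return self.q


class FlipFlop:
    """Positive edge-triggered flip-flop modelled as two latches on opposite clock phases."""

    def __init__(self):
        self.sampling = Latch()   # follows D while clk = 0
        self.holding = Latch()    # follows the sampling latch while clk = 1

    def update(self, d, clk):
        sampled = self.sampling.update(d, enable=not clk)
        return self.holding.update(sampled, enable=bool(clk))


ff = FlipFlop()
q_low = ff.update(d=1, clk=0)          # clock low: D is sensed, Q still shows the old value
q_high = ff.update(d=0, clk=1)         # clock high: Q shows the value sampled before the edge
q_still_high = ff.update(d=0, clk=1)   # D changed after the edge, but Q is unaffected
q_next = ff.update(d=0, clk=0)         # clock low again: Q keeps the captured value
```

Because the two latches are never transparent at the same time, a change on D can only reach Q across a rising clock edge.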
Flip Flop delays:
clk-to-Q? setup? hold?
CLK == 0: Sense D, but Q outputs old value.
CLK 0->1: Capture D, pass value to Q.
[Figure: CLK waveform with the setup, hold, and clk-to-Q intervals marked.]
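The three flip-flop delays combine into the usual register-to-register timing checks. The sketch below uses the standard textbook inequalities, which are not spelled out on the slide, and assumes zero clock skew and a single clk-to-Q value. All numbers are hypothetical, in picoseconds.

```python
def min_clock_period(clk_to_q, logic_max, setup):
    """Smallest period for which data launched at one edge is stable `setup` before the next edge."""
    return clk_to_q + logic_max + setup


def hold_is_met(clk_to_q, logic_min, hold):
    """True if newly launched data cannot reach the next flip-flop until `hold` after the same edge."""
    return clk_to_q + logic_min >= hold


# Hypothetical delays in picoseconds, assuming zero clock skew.
period = min_clock_period(clk_to_q=100, logic_max=700, setup=50)
safe = hold_is_met(clk_to_q=100, logic_min=20, hold=60)
```

A setup violation can be fixed by slowing the clock, while a hold violation cannot, since the clock period does not appear in the hold check.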
On Tuesday ... Power and Energy
[Figure: heat sink and heat source.]