EEs reach logic bottom-up - EECS Instructional Support ...cs152/fa04/lecnotes/lec2-2.6page.pdf ·...
CS 152 L03 Testing Processors () UC Regents Fall 2004 © UCB
2004-09-07
Dave Patterson
(www.cs.berkeley.edu/~patterson)
John Lazzaro
(www.cs.berkeley.edu/~lazzaro)
www-inst.eecs.berkeley.edu/~cs152/
CS152 – Computer Architecture and Engineering
Lecture 4 – Timing
Last Time: Test plan for your project
complete processor testing
Top-down testing
Bottom-up testing
unit testing
processor testing with self-checks
Which testing types are good for each epoch?
processor assembly complete
correctly executes single instructions
correctly executes short programs
Time
Epoch 1 Epoch 2 Epoch 3 Epoch 4
unit testing early
multi-unit testing later
multi-unit testing
processor testing with self-checks
multi-unit testing
unit testing
diagnostics
complete processor testing
verification
processor testing with self-checks
diagnostics
processor testing with self-checks
multi-unit testing
unit testing
diagnostics
Outline - Timing
Team networking break!
A clocked logic circuit primer
More clocked logic circuits
1600 IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 36, NO. 11, NOVEMBER 2001
Fig. 1. Process SEM cross section.
The process was raised from [1] to limit standby power. Circuit design and architectural pipelining ensure low voltage performance and functionality. To further limit standby current in handheld ASSPs, a longer poly target takes advantage of the versus dependence and source-to-body bias is used to electrically limit transistor in standby mode. All core nMOS and pMOS transistors utilize separate source and bulk connections to support this. The process includes cobalt disilicide gates and diffusions. Low source and drain capacitance, as well as 3-nm gate-oxide thickness, allow high performance and low-voltage operation.
III. ARCHITECTURE
The microprocessor contains 32-kB instruction and data caches as well as an eight-entry coalescing writeback buffer. The instruction and data cache fill buffers have two and four entries, respectively. The data cache supports hit-under-miss operation and lines may be locked to allow SRAM-like operation. Thirty-two-entry fully associative translation lookaside buffers (TLBs) that support multiple page sizes are provided for both caches. TLB entries may also be locked. A 128-entry branch target buffer improves branch performance a pipeline deeper than earlier high-performance ARM designs [2], [3].
A. Pipeline Organization
To obtain high performance, the microprocessor core utilizes a simple scalar pipeline and a high-frequency clock. In addition to avoiding the potential power waste of a superscalar approach, functional design and validation complexity is decreased at the expense of circuit design effort. To avoid circuit design issues, the pipeline partitioning balances the workload and ensures that no one pipeline stage is tight. The main integer pipeline is seven stages, memory operations follow an eight-stage pipeline, and when operating in thumb mode an extra pipe stage is inserted after the last fetch stage to convert thumb instructions into ARM instructions. Since thumb mode instructions [11] are 16 b, two instructions are fetched in parallel while executing thumb instructions. A simplified diagram of the processor pipeline is shown in Fig. 2, where the state boundaries are indicated by gray. Features that allow the microarchitecture to achieve high speed are as follows.
Fig. 2. Microprocessor pipeline organization.
The shifter and ALU reside in separate stages. The ARM instruction set allows a shift followed by an ALU operation in a single instruction. Previous implementations limited frequency by having the shift and ALU in a single stage. Splitting this operation reduces the critical ALU bypass path by approximately 1/3. The extra pipeline hazard introduced when an instruction is immediately followed by one requiring that the result be shifted is infrequent.
Decoupled Instruction Fetch. A two-instruction deep queue is implemented between the second fetch and instruction decode pipe stages. This allows stalls generated later in the pipe to be deferred by one or more cycles in the earlier pipe stages, thereby allowing instruction fetches to proceed when the pipe is stalled, and also relieves stall speed paths in the instruction fetch and branch prediction units.
Deferred register dependency stalls. While register dependencies are checked in the RF stage, stalls due to these hazards are deferred until the X1 stage. All the necessary operands are then captured from result-forwarding busses as the results are returned to the register file.
One of the major goals of the design was to minimize the energy consumed to complete a given task. Conventional wisdom has been that shorter pipelines are more efficient due to re-
Architects draw blocks ...
Circuit designers draw ???
Spring 2003 EECS150 - Lec10-Timing
Qualitative Analysis of Logic Delay
• Improved Transistor Model:
• We refer to transistor “strength” as the amount of current that flows for a given Vds and Vgs.
• The strength is linearly proportional to the ratio of W/L.
Logic is where they meet.
CS 152 L02 Design as a Team Sport () UC Regents Fall 2004 © UCB
Architects reach logic top-down ...
Next State Combinational Logic (inputs: R, G, Y, Change, Rst; outputs: next_R, next_Y, next_G)
wire next_R, next_Y, next_G;
assign next_R = rst ? 1'b1 : (change ? Y : R);
assign next_Y = rst ? 1'b0 : (change ? G : Y);
assign next_G = rst ? 1'b0 : (change ? R : G);
Is this structural Verilog?
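To see what the three assign statements compute, here is a minimal Python sketch of the same next-state function. The name `next_state` and the (R, Y, G) tuple encoding are illustrative assumptions, not part of the lecture.

```python
def next_state(r, y, g, change, rst):
    """Mirror the three Verilog assigns: one-hot (R, Y, G) next-state logic."""
    if rst:
        return (1, 0, 0)      # reset forces the R state
    if change:
        return (y, g, r)      # next_R = Y, next_Y = G, next_G = R
    return (r, y, g)          # no change: hold the current state

# Start from reset, then apply three 'change' pulses and record each state.
state = next_state(0, 0, 0, change=0, rst=1)
trace = [state]
for _ in range(3):
    state = next_state(*state, change=1, rst=0)
    trace.append(state)
print(trace)
```

Under these assumptions the single hot bit moves R, then G, then Y, then back to R.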
1/28/04 ©UCB Spring 2004 CS152 / Kubiatowicz
Lec3.33
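Basic Components: CMOS Logic Gates
NAND Gate: Out = NOT (A • B)
A B Out
0 0 1
0 1 1
1 0 1
1 1 0
NOR Gate: Out = NOT (A + B)
A B Out
0 0 1
0 1 0
1 0 0
1 1 0
[Figure: gate symbols and CMOS transistor circuits for the NAND and NOR gates, with inputs A, B, output Out, and supply Vdd]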
Basic Components: CMOS Logic Gates
4-input NAND Gate (inputs A, B, C, D; output Out)
More inputs => more asymmetric edge times!
Ideal versus Reality
° When input 0 -> 1, output 1 -> 0 but NOT instantly
• Output goes 1 -> 0: output voltage goes from Vdd (5 V) to 0 V
° When input 1 -> 0, output 0 -> 1 but NOT instantly
• Output goes 0 -> 1: output voltage goes from 0 V to Vdd (5 V)
° Voltage does not like to change instantaneously
[Figure: inverter In -> Out; Vin and Vout voltage versus time, where 1 => Vdd and 0 => GND]
Fluid Timing Model
° Water <-> Electrical Charge
° Tank Capacity <-> Capacitance (C)
° Water Level <-> Voltage
° Water Flow <-> Charge Flowing (Current)
° Size of Pipes <-> Strength of Transistors (G)
° Time to fill up the tank is proportional to C / G
[Figure: Reservoir (Level (V) = Vdd), Tank (capacity Cout, Tank Level = Vout), and Bottomless Sea (Sea Level = GND), connected through switches SW1 and SW2; the equivalent circuit shows Vdd, SW1, SW2, Cout, and Vout]
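The claim that fill time scales as C / G can be tried numerically. The sketch below assumes a first-order model, V(t) = Vdd * (1 - exp(-t * G / C)), which is a common simplification and not a formula taken from the slide; the component values are hypothetical.

```python
import math

def time_to_reach(fraction, c, g):
    """Time for the tank (capacitor) to fill to `fraction` of Vdd.

    Assumes V(t) = Vdd * (1 - exp(-t * g / c)), an illustrative
    first-order model rather than the lecture's own equation.
    """
    return -(c / g) * math.log(1.0 - fraction)

C = 100e-15   # hypothetical load capacitance: 100 fF
G = 1e-3      # hypothetical transistor conductance: 1 mS

t_half = time_to_reach(0.5, C, G)
t_half_double_c = time_to_reach(0.5, 2 * C, G)   # bigger tank
t_half_double_g = time_to_reach(0.5, C, 2 * G)   # bigger pipe
print(t_half, t_half_double_c, t_half_double_g)
```

In this model, doubling the tank doubles the fill time and doubling the pipe halves it.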
EEs reach logic bottom-up ...
Can you build a processor entirely out of NAND gates?
Small number of high-performance
logic circuits.
For some definition of performance.
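One way to explore the NAND question: a minimal Python sketch that builds NOT, AND, and OR from a single `nand` helper and checks them over all input combinations. The helper names are illustrative assumptions, and the sketch says nothing about performance.

```python
from itertools import product

def nand(a, b):
    """2-input NAND on bits 0/1."""
    return 1 - (a & b)

def not_(a):
    return nand(a, a)              # tie both inputs together

def and_(a, b):
    return not_(nand(a, b))        # NAND followed by an inverter

def or_(a, b):
    return nand(not_(a), not_(b))  # invert both inputs, then NAND

# Exhaustively compare against Python's bitwise operators.
nand_only_ok = all(
    not_(a) == 1 - a and and_(a, b) == (a & b) and or_(a, b) == (a | b)
    for a, b in product((0, 1), repeat=2)
)
print(nand_only_ok)
```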
Logic Synthesis bridges the gap
Design Refinement
Informal System Requirement
Initial Specification
Intermediate Specification
Final Architectural Description
Intermediate Specification of Implementation
Final Internal Specification
Physical Implementation
(refinement: increasing level of detail)
Logic Components
Elements of the design zoo
° Wires: Carry signals from one point to another
• Single bit (no size label) or multi-bit bus (size label)
° Combinational Logic: Like function evaluation
• Data goes in, results come out after some propagation delay
° Flip-Flops: Storage Elements
• After a clock edge, input copied to output
• Otherwise, the flip-flop holds its value
• Also: a “Latch” is a storage element that is level triggered
[Figure: flip-flop symbols (D -> Q, and 8-bit D[8] -> Q[8]) and a combinational logic block with bus widths 11 and 8]
Basic Combinational Elements + DeMorgan Equivalence
Wire: Out = In
In Out
0 0
1 1
Inverter: Out = NOT In
In Out
0 1
1 0
NAND Gate: Out = NOT (A • B)
A B Out
0 0 1
0 1 1
1 0 1
1 1 0
NOR Gate: Out = NOT (A + B)
A B Out
0 0 1
0 1 0
1 0 0
1 1 0
DeMorgan’s Theorem
NAND: Out = NOT (A • B) = (NOT A) + (NOT B)
NOR: Out = NOT (A + B) = (NOT A) • (NOT B)
A B | NOT A  NOT B | (NOT A) + (NOT B) | (NOT A) • (NOT B)
0 0 |   1      1   |         1         |         1
0 1 |   1      0   |         1         |         0
1 0 |   0      1   |         1         |         0
1 1 |   0      0   |         0         |         0
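The DeMorgan equivalences can be checked exhaustively, since there are only four input combinations. This is a small Python sketch; the function names are illustrative assumptions.

```python
from itertools import product

def inv(a):
    return 1 - a

def nand(a, b):
    return inv(a & b)

def nor(a, b):
    return inv(a | b)

inputs = list(product((0, 1), repeat=2))

# NAND equals OR of the inverted inputs; NOR equals AND of the inverted inputs.
demorgan_nand = all(nand(a, b) == (inv(a) | inv(b)) for a, b in inputs)
demorgan_nor = all(nor(a, b) == (inv(a) & inv(b)) for a, b in inputs)
print(demorgan_nand, demorgan_nor)
```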
assign next_R = rst ? 1'b1 : (change ? Y : R);
assign next_Y = rst ? 1'b0 : (change ? G : Y);
assign next_G = rst ? 1'b0 : (change ? R : G);
It’s easier to work at one level of
abstraction if you have a basic
understanding of the level below.
Administrivia - Team Networking Break!
First homework due 9/15.
Lab 1 due Monday 9/13.
Mini-Lab 2 this Friday (9/10). Remember to do the pre-lab!
Lab 2 goes out on 9/14. The first team lab (next break)
A Logic Circuit Primer
Inverters: A simple transistor model
Delay Model: CMOS
Review: General C/L Cell Delay Model
° Combinational Cell (symbol) is fully specified by:
• functional (input -> output) behavior
- truth-table, logic equation, VHDL
• load factor of each input
• critical propagation delay from each input to each output for each transition
- THL(A, o) = Fixed Internal Delay + Load-dependent-delay x load
° Linear model composes
[Figure: combinational logic cell with inputs A, B, ..., X driving load Cout; plot of delay (Va -> Vout) versus Cout showing the internal delay as the intercept, the delay per unit load as the slope, and Ccritical marked on the Cout axis]
Basic Technology: CMOS
° CMOS: Complementary Metal Oxide Semiconductor
• NMOS (N-Type Metal Oxide Semiconductor) transistors
• PMOS (P-Type Metal Oxide Semiconductor) transistors
° NMOS Transistor
• Apply a HIGH (Vdd) to its gate: turns the transistor into a “conductor”
• Apply a LOW (GND) to its gate: shuts off the conduction path
° PMOS Transistor
• Apply a HIGH (Vdd) to its gate: shuts off the conduction path
• Apply a LOW (GND) to its gate: turns the transistor into a “conductor”
Vdd = 5 V, GND = 0 V
Basic Components: CMOS Inverter
° Inverter Operation
[Figure: CMOS inverter symbol and circuit (PMOS between Vdd and Out, NMOS between Out and ground). One case shows a transistor labeled Open while Out is charged to Vdd (Charge); the other shows a transistor labeled Open while Out is discharged (Discharge). Plot of Vout versus Vin, both axes running to Vdd. In “1” gives Out “0”; In “0” gives Out “1”.]
pFET: A switch. “On” if gate is grounded.
nFET: A switch. “On” if gate is at Vdd.
Transistors as water valves
If electrons are water molecules, and a capacitor a bucket ...
An “on” p-FET fills up the capacitor with charge.
An “on” n-FET empties the bucket.
[Figure: water level versus time; the level rises from “0” to “1” as the bucket fills and falls from “1” to “0” as it empties. Inset slide: Gate Switching Behavior (Inverter; NAND gate).]
What is the bucket? A gate’s “fan-out”.
Driving other gates slows a gate down.
Driving wires slows a gate down.
“Fan-out”: The number of gate inputs driven by a gate’s output.
A closer look at fan-out ...
Gate Delay
• Fan-out:
• The delay of a gate is proportional to its output capacitance. Because of this, gates #2 and #3 turn on/off at a later time. (It takes longer for the output of gate #1 to reach the switching threshold of gates #2 and #3 as we add more output capacitance.)
[Figure: gate #1 driving the inputs of gates #2 and #3]
Driving more gates adds delay.
Series Connection
[Figure: two inverters G1 and G2 in series (Vin -> V1 -> Vout), with capacitance C1 at node V1 and Cout at Vout; waveforms of Vin, V1, and Vout versus time between GND and Vdd, with delays d1 and d2 measured at Vdd/2]
° Total Propagation Delay = Sum of individual delays = d1 + d2
° Capacitance C1 has two components:
• Capacitance of the wire connecting the two gates
• Input capacitance of the second inverter
Calculating Aggregate Delays
[Figure: gate G1 (Vin -> V1) driving gates G2 (V1 -> V2) and G3 (V1 -> V3), with capacitance C1 at node V1]
° Sum delays along serial paths
° Delay (Vin -> V2) != Delay (Vin -> V3)
• Delay (Vin -> V2) = Delay (Vin -> V1) + Delay (V1 -> V2)
• Delay (Vin -> V3) = Delay (Vin -> V1) + Delay (V1 -> V3)
° Critical Path = The longest among the N parallel paths
° C1 = Wire C + Cin of Gate 2 + Cin of Gate 3
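The rules above (sum along serial paths, take the longest of the parallel paths) can be sketched in a few lines of Python. All delay and capacitance values here are hypothetical, chosen only to make the arithmetic visible.

```python
# Hypothetical stage delays in ns (not from the lecture).
d_vin_v1 = 0.30
d_v1_v2 = 0.20
d_v1_v3 = 0.45

# Sum delays along each serial path.
paths = {
    "Vin->V2": d_vin_v1 + d_v1_v2,
    "Vin->V3": d_vin_v1 + d_v1_v3,
}

def critical_path(path_delays):
    """Return (name, delay) of the longest of the parallel paths."""
    name = max(path_delays, key=path_delays.get)
    return name, path_delays[name]

worst_name, worst_delay = critical_path(paths)

# C1 = wire capacitance + input capacitance of every gate driven (hypothetical fF values).
c_wire, cin_gate2, cin_gate3 = 10.0, 50.0, 50.0
c1 = c_wire + cin_gate2 + cin_gate3
print(worst_name, worst_delay, c1)
```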
Characterize a Gate
° Input capacitance for each input
° For each input-to-output path:
• For each output transition type (H->L, L->H, H->Z, L->Z ... etc.)
- Internal delay (ns)
- Load dependent delay (ns / fF)
° Example: 2-input NAND Gate (inputs A, B; output Out)
For A and B: Input Load (I.L.) = 61 fF
For either A -> Out or B -> Out:
Tlh = 0.5 ns, Tlhf = 0.0021 ns / fF
Thl = 0.1 ns, Thlf = 0.0020 ns / fF
[Figure: Delay A -> Out (Out: Low -> High) versus Cout; intercept 0.5 ns, slope 0.0021 ns / fF]
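The linear model can be applied directly to the NAND numbers above. In this Python sketch the function name `nand_delay_ns` is an illustrative assumption, and the example load (one NAND input, wire capacitance ignored) is a hypothetical choice.

```python
# 2-input NAND characterization from the slide.
T_LH_NS, T_LHF_NS_PER_FF = 0.5, 0.0021   # low -> high: internal delay, slope
T_HL_NS, T_HLF_NS_PER_FF = 0.1, 0.0020   # high -> low: internal delay, slope
NAND_INPUT_LOAD_FF = 61.0

def nand_delay_ns(load_ff, rising):
    """Delay = fixed internal delay + load-dependent delay x load."""
    if rising:
        return T_LH_NS + T_LHF_NS_PER_FF * load_ff
    return T_HL_NS + T_HLF_NS_PER_FF * load_ff

# Hypothetical load: the gate drives exactly one other NAND input.
load = NAND_INPUT_LOAD_FF
t_rise = nand_delay_ns(load, rising=True)
t_fall = nand_delay_ns(load, rising=False)
print(t_rise, t_fall)
```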
A Specific Example: 2 to 1 MUX
Y = (A and !S) or (B and S)
[Figure: 2 x 1 Mux with inputs A, B, S and output Y, built from Gate 1, Gate 2, and Gate 3 connected by Wire 0, Wire 1, and Wire 2]
° Input Load (I.L.)
• A, B: I.L. (NAND) = 61 fF
• S: I.L. (INV) + I.L. (NAND) = 50 fF + 61 fF = 111 fF
° Load Dependent Delay (L.D.D.): Same as Gate 3
• TAYlhf = 0.0021 ns / fF, TAYhlf = 0.0020 ns / fF
• TBYlhf = 0.0021 ns / fF, TBYhlf = 0.0020 ns / fF
• TSYlhf = 0.0021 ns / fF, TSYhlf = 0.0020 ns / fF
Linear model works for reasonable fan-out
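The MUX input loads and the load-dependent part of its output delay follow from the per-gate numbers. A short Python sketch, assuming (hypothetically) that Y drives some number of NAND inputs and that wire capacitance is ignored:

```python
IL_NAND_FF = 61.0   # input load of one NAND input (from the slide)
IL_INV_FF = 50.0    # input load of one inverter input (from the slide)

# A and B each see one NAND input; S sees an inverter plus a NAND input.
load_a = IL_NAND_FF
load_b = IL_NAND_FF
load_s = IL_INV_FF + IL_NAND_FF

# Load-dependent delay slopes at output Y (same as Gate 3).
T_LHF, T_HLF = 0.0021, 0.0020   # ns per fF

def load_dependent_delay_ns(slope, fanout, per_input_ff=IL_NAND_FF):
    """Load-dependent part only; the fixed internal delay is not included."""
    return slope * fanout * per_input_ff

# Hypothetical fan-out of 4 NAND inputs on Y.
ldd_rise = load_dependent_delay_ns(T_LHF, fanout=4)
ldd_fall = load_dependent_delay_ns(T_HLF, fanout=4)
print(load_s, ldd_rise, ldd_fall)
```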
Propagation delay graphs ...
[Figure: Gate Delay, cascaded gates; Vin and Vout waveforms for a 1->0 transition]
Intuition: Critical paths ...
Gate Delay
• What is the delay in this circuit?
• Critical Path: the path with the maximum delay, from any input to any output.
– In general, we include register set-up and clk-to-Q times in critical path calculation.
• Why do we care about the critical path?
x = g(a, b, c, d, e, f)
If d going 0-to-1 switches x 0-to-1, delay is T1.
If a going 0-to-1 switches x 0-to-1, delay is T2.
Would you be surprised if T1 > T2? Why?
T2 might be the critical (worst-case delay) path.
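To make the T1 versus T2 comparison concrete, here is a Python sketch that finds the longest delay from one input to the output of a small gate network. The netlist, gate names, and unit delays are hypothetical; the real circuit behind g(a, ..., f) is not given in the text.

```python
# Hypothetical netlist: gate output -> (gate delay in ns, list of gate inputs).
gates = {
    "n1": (1.0, ["a", "b"]),
    "n2": (1.0, ["n1", "c"]),
    "x":  (1.0, ["n2", "d"]),
}

def longest_delay(node, source):
    """Longest delay from `source` to `node`, or None if no path exists."""
    if node == source:
        return 0.0
    if node not in gates:
        return None                     # some other primary input
    delay, inputs = gates[node]
    best = None
    for signal in inputs:
        d = longest_delay(signal, source)
        if d is not None and (best is None or d > best):
            best = d
    return None if best is None else best + delay

t1 = longest_delay("x", "d")   # d enters at the last gate
t2 = longest_delay("x", "a")   # a passes through all three gates
print(t1, t2)
```

In this made-up network a passes through more gates than d, so T2 > T1 counting gate delays alone.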
Why “might”? Wires have delay too ...
Wire Delay
• Even in those cases where the transmission line effect is negligible:
– Wires possess distributed resistance and capacitance
– Time constant associated with distributed RC is proportional to the square of the length
• For short wires on ICs, resistance is insignificant (relative to effective R of transistors), but C is important.
– Typically around half of C of gate load is in the wires.
• For long wires on ICs:
– busses, clock lines, global control signal, etc.
– Resistance is significant, therefore distributed RC effect dominates.
– signals are typically “rebuffered” to reduce delay:
[Figure: a long wire with taps v1 through v4 and their waveforms versus time]
Looks benign, but ...
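Because distributed RC delay grows with the square of wire length, splitting a long wire into buffered segments can help. The Python sketch below uses a made-up constant `k` and a made-up buffer delay `t_buf`; only the quadratic scaling comes from the slide.

```python
def wire_delay_ns(length_mm, k=1.0):
    """Distributed RC delay, proportional to length squared (k is hypothetical, ns/mm^2)."""
    return k * length_mm ** 2

def rebuffered_delay_ns(length_mm, segments, t_buf=0.5, k=1.0):
    """Delay of the same wire split into equal segments with a buffer between each pair."""
    segment_length = length_mm / segments
    return segments * wire_delay_ns(segment_length, k) + (segments - 1) * t_buf

unbuffered = wire_delay_ns(4.0)
two_segments = rebuffered_delay_ns(4.0, segments=2)
four_segments = rebuffered_delay_ns(4.0, segments=4)
print(unbuffered, two_segments, four_segments)
```

With these assumed numbers each added buffer costs a fixed delay but cuts the quadratic wire term, so the total drops.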
Team Networking Break!
Optimal team size? Min/max?
Disagreements are inevitable. Build a bridge and get over it.
Lab 2 goes out on 9/14. Time to form your team.
Talk out little problems before they get big. Communicate.
Clocked Logic Circuits
From Delay Models to Timing Analysis
Example
• Parallel to serial converter:
– T ≥ time(clk→Q) + time(mux) + time(setup)
  T ≥ τ(clk→Q) + τ(mux) + τ(setup)
f T
1 MHz 1 µs
10 MHz 100 ns
100 MHz 10 ns
1 GHz 1 ns
Timing Analysis
What is the smallest T that produces correct operation?
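The constraint on T and the f versus T table can be tied together in a short Python sketch. The three delay values are hypothetical; the inequality T ≥ clk→Q + mux + setup is the one from the example above.

```python
def min_period_ns(t_clk_to_q, t_logic, t_setup):
    """Smallest clock period satisfying T >= clk->Q + logic + setup."""
    return t_clk_to_q + t_logic + t_setup

def frequency_mhz(period_ns):
    """f = 1 / T, with T in ns and f in MHz."""
    return 1000.0 / period_ns

# Hypothetical delays in ns for the parallel-to-serial converter.
t_min = min_period_ns(t_clk_to_q=0.5, t_logic=1.0, t_setup=0.5)
f_max = frequency_mhz(t_min)

# The f/T table: periods of 1 us, 100 ns, 10 ns, 1 ns.
table_mhz = [frequency_mhz(t) for t in (1000.0, 100.0, 10.0, 1.0)]
print(t_min, f_max, table_mhz)
```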
Timing Analysis and Logic Delay
Can T be smaller than worst-case delay through CL?
1/28/04 ©UCB Spring 2004 CS152 / Kubiatowicz
Lec3.9
General C/L Cell Delay Model
° Combinational Cell (symbol) is fully specified by:
• functional (input -> output) behavior
- truth-table, logic equation, VHDL
• Input load factor of each input
• Propagation delay from each input to each output for each transition
- THL(A, o) = Fixed Internal Delay + Load-dependent-delay x load
° Linear model composes
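The linear delay model above can be sketched in Python as follows. This is an illustrative sketch only: the class name `CellTiming`, the nanosecond units, and the example numbers are assumptions, and "composes" is read here as "delays of cells in series add".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CellTiming:
    """Linear delay model for one input-to-output transition (hypothetical type)."""
    internal_delay_ns: float        # fixed internal delay
    delay_per_unit_load_ns: float   # load-dependent delay per unit of load

    def delay_ns(self, load: float) -> float:
        # delay = fixed internal delay + load-dependent delay x load
        return self.internal_delay_ns + self.delay_per_unit_load_ns * load


def path_delay_ns(stages):
    """Sum the delays of (cell, load) stages connected in series."""
    return sum(cell.delay_ns(load) for cell, load in stages)


if __name__ == "__main__":
    gate = CellTiming(internal_delay_ns=0.5, delay_per_unit_load_ns=0.25)  # made-up values
    print(gate.delay_ns(4))                       # one gate driving 4 unit loads
    print(path_delay_ns([(gate, 4), (gate, 2)]))  # two gates in series
```

Because each cell's delay depends on its own load, adding fan-out to one gate lengthens every path that passes through it.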
[Figure: combinational logic cell (inputs A, B, ...; output X) driving load Cout, with a plot of delay (Va -> Vout) versus Cout. Labels: internal delay, delay per unit load, Ccritical.]
Storage Element’s Timing Model
[Figure: storage element with data input D, output Q, and clock input Clk.]
° Setup Time: Input must be stable BEFORE trigger clock edge
° Hold Time: Input must REMAIN stable after trigger clock edge
° Clock-to-Q time:
• Output cannot change instantaneously at the trigger clock edge
• Similar to delay in logic gates, two components:
- Internal Clock-to-Q
- Load dependent Clock-to-Q
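The setup and hold rules above amount to a window around the trigger clock edge in which D must not change. A minimal Python sketch of such a check is given here; the function name `setup_hold_violations`, the nanosecond units, and the open-interval treatment of the window boundaries are assumptions made for illustration.

```python
def setup_hold_violations(edge_ns, d_change_times_ns, t_setup_ns, t_hold_ns):
    """Return the times at which D changes inside the forbidden window.

    The window runs from (edge - t_setup) to (edge + t_hold). A change
    exactly on a boundary is treated as legal here (an assumption).
    """
    window_start = edge_ns - t_setup_ns
    window_end = edge_ns + t_hold_ns
    return [t for t in d_change_times_ns if window_start < t < window_end]


if __name__ == "__main__":
    # Clock edge at 10 ns, setup 1 ns, hold 0.5 ns (made-up values).
    bad = setup_hold_violations(10.0, [2.0, 9.5, 12.0], t_setup_ns=1.0, t_hold_ns=0.5)
    print(bad)  # changes of D that fall inside the setup/hold window
```

An empty result means D was stable throughout the window, so the sampled value is well defined.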
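[Figure: timing diagram of D and Q around the clock edge. D is "don't care" outside the setup and hold intervals; Q is unknown during the clock-to-Q interval.]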
Clocking Methodology
[Figure: combination logic blocks between storage elements clocked by Clk.]
° All storage elements are clocked by the same clock edge
° The combination logic blocks:
• Inputs are updated at each clock tick
• All outputs MUST be stable before the next clock tick
Critical Path & Cycle Time
[Figure: paths between storage elements clocked by Clk.]
° Critical path: the slowest path between any two storage devices
° Cycle time is a function of the critical path
° Cycle time must be greater than:
Clock-to-Q + Longest Path through Combination Logic + Setup
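The cycle-time bound above can be worked through numerically. The sketch below is illustrative: the function names, the nanosecond units, and the example delays are assumptions, and clock skew is ignored because the slide does not mention it.

```python
def min_cycle_time_ns(t_clk_to_q_ns, path_delays_ns, t_setup_ns):
    """Lower bound on cycle time: clock-to-Q + longest logic path + setup."""
    longest_path_ns = max(path_delays_ns)  # the critical path sets the bound
    return t_clk_to_q_ns + longest_path_ns + t_setup_ns


def max_frequency_mhz(cycle_time_ns):
    """Convert a cycle time in ns to a clock frequency in MHz."""
    return 1000.0 / cycle_time_ns


if __name__ == "__main__":
    # Made-up delays for three register-to-register paths, in ns.
    bound = min_cycle_time_ns(0.5, [2.0, 3.5, 1.0], 0.5)
    print(bound, max_frequency_mhz(bound))
```

Only the slowest path matters: shortening the 2.0 ns or 1.0 ns paths in this example would not change the bound, while shortening the 3.5 ns path would.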
Register:
An Array of Flip-Flops
1/28/04 ©UCB Spring 2004CS152 / Kubiatowicz
Combinational Logic
20
CS 152 L02 Design as a Team Sport () UC Regents Fall 2004 © UCB
Flip Flops have internal delays ...
[Figure: flip-flop with input D, output Q, and clock input CLK.]
Value of D is sampled on positive clock edge.
Q outputs sampled value for rest of cycle.
[Figure: timing diagram of D and Q around the positive clock edge, marking t_setup and t_clk-to-Q.]
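The sampling behavior described on this slide can be modeled in a few lines of Python. This is a purely behavioral sketch: the class name `DFlipFlop` and its `step` method are hypothetical, and the model ignores t_setup and t_clk-to-Q, so it shows only what value Q holds, not when it becomes valid.

```python
class DFlipFlop:
    """Behavioral model of a positive-edge-triggered D flip-flop (no timing)."""

    def __init__(self, initial_q=0):
        self.q = initial_q
        self._prev_clk = 0

    def step(self, clk, d):
        """Apply one (clk, d) sample and return Q."""
        if clk == 1 and self._prev_clk == 0:
            # Positive clock edge: sample D.
            self.q = d
        # Otherwise Q keeps the sampled value for the rest of the cycle.
        self._prev_clk = clk
        return self.q


if __name__ == "__main__":
    ff = DFlipFlop()
    samples = [(0, 1), (1, 1), (1, 0), (0, 0), (1, 0)]  # (clk, d) pairs
    print([ff.step(clk, d) for clk, d in samples])
```

Note that the change of D while the clock is high (third sample) does not reach Q; only the value present at the rising edge is captured.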
21
Conclusion -- Timing
Flip-flops: setup and clk-to-Q
Critical path limits clock speed
Logic delay: fan-out and wires
22