UC Regents Fall 2005 © UCB, CS 152 L13: Cache I
2005-10-13 John Lazzaro
(www.cs.berkeley.edu/~lazzaro)
CS 152 Computer Architecture and Engineering
Lecture 13 – Cache I
www-inst.eecs.berkeley.edu/~cs152/
A cosmic ray hits a DRAM cell ...
TAs: David Marquardt and Udam Saini
Last Time: DRAM design
[Diagram: the DRAM array.]
4096 rows, selected by a 1-of-4096 decoder driven by a 12-bit row address input.
2048 columns, each column 4 bits deep.
33,554,432 usable bits (the tester found good bits in a bigger array).
8192 bits delivered by the sense amps; the requested bits are selected and sent off the chip.
DRAM has high latency to first bit out. A fact of life.
Today: Caches and the Memory System
Memory Hierarchy: Technology motivation for caching.
Locality: Why caching works
Cache design: Final project component.
[Diagram: the five classic components of a computer: Processor (Control, Datapath), Memory, Input, Output.]
1977: DRAM faster than microprocessors
Apple ][ (1977)
[Photos: Steve Wozniak and Steve Jobs.]
CPU: 1000 ns. DRAM: 400 ns.
Since then: technology scaling ...
A circuit in 250 nm technology (introduced in 2000) is H nanometers long.
The same circuit in 180 nm technology (introduced in 2003) is 0.7 x H nanometers long.
Each dimension is 30% smaller; the area is 50% smaller.
Logic circuits use smaller C's, lower Vdd, and higher kn and kp to speed up clock rates.
DRAM scaled for more bits, not more MHz
Assume Ccell = 1 fF.
The bit line may have 2000 nFET drains; assume a bit line C of 100 fF, or 100*Ccell.
Ccell holds Q = Ccell*(Vdd - Vth).
When we dump this charge onto the bit line, what voltage do we see?
dV = [Ccell*(Vdd - Vth)] / [100*Ccell]
dV = (Vdd - Vth) / 100 ≈ tens of millivolts! In practice, scale the array to get a 60 mV signal.
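The charge-sharing arithmetic can be checked with a quick calculation (a sketch; the Vdd = 1.8 V and Vth = 0.5 V values are assumed for illustration, not taken from the slide):

```python
# Charge sharing when the 1 fF cell dumps its charge onto a ~100 fF
# bit line. Vdd and Vth below are assumed values for illustration.
C_cell = 1e-15        # F, storage cell
C_bit  = 100e-15      # F, bit line (~2000 nFET drains)
V_dd, V_th = 1.8, 0.5

q  = C_cell * (V_dd - V_th)   # charge stored: Q = Ccell*(Vdd - Vth)
dV = q / C_bit                # dV = (Vdd - Vth) / 100
print(f"dV = {dV * 1e3:.0f} mV")   # dV = 13 mV: tens of millivolts
```

Signals this small are why DRAM needs sense amplifiers, and why the array is sized to guarantee a 60 mV swing.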
1980-2003, CPU speed outpaced DRAM ...
[Plot: performance (1/latency), log scale from 10 to 10000, vs. year from 1980 to 2005. CPU: 60% per year (2X in 1.5 yrs), until the power wall circa 2005. DRAM: 9% per year (2X in 10 yrs). The gap grew 50% per year.]
Q. How do architects address this gap?
A. Put smaller, faster "cache" memories between CPU and DRAM. Create a "memory hierarchy".
Caches: Variable-latency memory ports
[Diagram: the CPU talks to a small, fast Upper Level Memory, which in turn talks to a large, slow Lower Level Memory; blocks Blk X and Blk Y move between the two levels.]
Data in the upper level is returned with lower latency.
Data in the lower level is returned with higher latency.
Cache replaces data, instruction memory
[Diagram: the five-stage pipelined datapath (IF Fetch, ID Decode, EX ALU, MEM, WB), with the Instr Mem and Data Memory blocks highlighted.]
Replace them with an Instruction Cache and a Data Cache of DRAM main memory.
Recall: Intel ARM XScale CPU (PocketPC)
From: IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 36, NO. 11, NOVEMBER 2001, p. 1600.
Fig. 1. Process SEM cross section.
The process was raised from [1] to limit standby power.
Circuit design and architectural pipelining ensure low voltage
performance and functionality. To further limit standby current
in handheld ASSPs, a longer poly target takes advantage of the
versus dependence and source-to-body bias is used
to electrically limit transistor in standby mode. All core
nMOS and pMOS transistors utilize separate source and bulk
connections to support this. The process includes cobalt disili-
cide gates and diffusions. Low source and drain capacitance, as
well as 3-nm gate-oxide thickness, allow high performance and
low-voltage operation.
III. ARCHITECTURE
The microprocessor contains 32-kB instruction and data
caches as well as an eight-entry coalescing writeback buffer.
The instruction and data cache fill buffers have two and four
entries, respectively. The data cache supports hit-under-miss
operation and lines may be locked to allow SRAM-like oper-
ation. Thirty-two-entry fully associative translation lookaside
buffers (TLBs) that support multiple page sizes are provided
for both caches. TLB entries may also be locked. A 128-entry
branch target buffer improves branch performance in a pipeline
deeper than earlier high-performance ARM designs [2], [3].
A. Pipeline Organization
To obtain high performance, the microprocessor core utilizes
a simple scalar pipeline and a high-frequency clock. In addition
to avoiding the potential power waste of a superscalar approach,
functional design and validation complexity is decreased at the
expense of circuit design effort. To avoid circuit design issues,
the pipeline partitioning balances the workload and ensures that
no one pipeline stage is tight. The main integer pipeline is seven
stages, memory operations follow an eight-stage pipeline, and
when operating in thumb mode an extra pipe stage is inserted
after the last fetch stage to convert thumb instructions into ARM
instructions. Since thumb mode instructions [11] are 16 b, two
instructions are fetched in parallel while executing thumb instructions.
(Fig. 2. Microprocessor pipeline organization.)
A simplified diagram of the processor pipeline is shown in Fig. 2, where the state boundaries are indicated by gray. Features that allow the microarchitecture to achieve high speed are as follows.
The shifter and ALU reside in separate stages. The ARM in-
struction set allows a shift followed by an ALU operation in a
single instruction. Previous implementations limited frequency
by having the shift and ALU in a single stage. Splitting this op-
eration reduces the critical ALU bypass path by approximately
1/3. The extra pipeline hazard introduced when an instruction is
immediately followed by one requiring that the result be shifted
is infrequent.
Decoupled Instruction Fetch. A two-instruction-deep queue is
implemented between the second fetch and instruction decode
pipe stages. This allows stalls generated later in the pipe to be
deferred by one or more cycles in the earlier pipe stages, thereby
allowing instruction fetches to proceed when the pipe is stalled,
and also relieves stall speed paths in the instruction fetch and
branch prediction units.
Deferred register dependency stalls. While register depen-
dencies are checked in the RF stage, stalls due to these hazards
are deferred until the X1 stage. All the necessary operands are
then captured from result-forwarding busses as the results are
returned to the register file.
One of the major goals of the design was to minimize the energy consumed to complete a given task. Conventional wisdom has been that shorter pipelines are more efficient due to re- ...
[Slide annotations: 32 KB Instruction Cache; 32 KB Data Cache; 180 nm process (introduced 2003).]
2005 Memory Hierarchy: Apple iMac G5
iMac G5: 1.6 GHz, $1299.00

                  Reg   L1 Inst  L1 Data  L2     DRAM   Disk
Size              1K    64K      32K      512K   256M   80G
Latency (cycles)  1     3        3        11     160    1e7

Let programs address a memory space that scales to the disk size, at a speed that is usually as fast as register access.
Registers are managed by the compiler; the caches are managed by hardware; DRAM and disk are managed by the OS, hardware, and the application.
Goal: the illusion of large, fast, cheap memory.
[Die photo: PowerPC 970FX, 90 nm, 58 M transistors, showing the registers (1K), L1 instruction cache (64K), L1 data cache (32K), and the 512K L2.]
Latency: A closer look
Read latency: the time to return the first byte of a random access.

                  Reg   L1 Inst  L1 Data  L2     DRAM   Disk
Size              1K    64K      32K      512K   256M   80G
Latency (cycles)  1     3        3        11     160    1e7
Latency (sec)     0.6n  1.9n     1.9n     6.9n   100n   12.5m
Hz                1.6G  533M     533M     145M   10M    80

Architect's latency toolkit:
(1) Parallelism. Request data from N 1-bit-wide memories at the same time. This overlaps the latency cost for all N bits and provides N times the bandwidth. Requests to N memory banks (interleaving) have the potential of N times the bandwidth.
(2) Pipeline memory. If memory has N cycles of latency, issue a request each cycle, and receive it N cycles later.
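Idea (2) can be sketched as a toy schedule (a model only; the 4-cycle latency and the addresses are illustrative):

```python
# Toy model of a pipelined memory: one request may issue per cycle,
# and each returns LATENCY cycles later. Latency per access is
# unchanged, but steady-state throughput is one result per cycle.
LATENCY = 4  # cycles (assumed for illustration)

def schedule(addresses, latency=LATENCY):
    """(address, issue cycle, data-ready cycle) for back-to-back reads."""
    return [(a, t, t + latency) for t, a in enumerate(addresses)]

for addr, issued, ready in schedule([0x00, 0x20, 0x40, 0x60]):
    print(f"addr {addr:#04x}: issue cycle {issued}, data in cycle {ready}")
```

Four dependent reads would take 16 cycles; pipelined back-to-back reads finish in 8.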
Programs with locality cache well ...
[Plot: memory address vs. time, one dot per memory access. Donald J. Hatfield, Jeanette Gerald: Program Restructuring for Virtual Memory. IBM Systems Journal 10(3): 168-192 (1971).]
Q. Point out bad locality behavior ...
[Annotations on the plot mark regions of spatial locality, temporal locality, and bad locality.]
The caching algorithm in one slide
Temporal locality: keep the most recently accessed data closer to the processor.
Spatial locality: move contiguous blocks in the address space to upper levels.
[Diagram: Upper Level Memory between the processor and Lower Level Memory; blocks Blk X and Blk Y move between levels.]
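The two policies can be sketched together as a tiny block-level LRU model (illustrative only; the 4-block capacity and access pattern are assumptions, and real caches implement this in hardware):

```python
from collections import OrderedDict

# Temporal locality: keep the most recently used blocks (LRU).
# Spatial locality: move whole multi-byte blocks, not single bytes.
BLOCK = 32      # bytes per block (32-byte blocks, as used later)
CAPACITY = 4    # blocks held in the upper level (illustrative)

cache = OrderedDict()   # block number -> block data, most recent last

def access(addr):
    blk = addr // BLOCK             # spatial: operate on whole blocks
    hit = blk in cache
    if hit:
        cache.move_to_end(blk)      # temporal: mark most recently used
    else:
        if len(cache) == CAPACITY:
            cache.popitem(last=False)   # evict the least recently used
        cache[blk] = f"<block {blk}>"
    return hit

hits = [access(a) for a in (0, 4, 32, 64, 96, 128, 0)]
print(hits)  # [False, True, False, False, False, False, False]
```

The second access hits (byte 4 shares block 0 with byte 0: spatial locality); the final access to address 0 misses because block 0 was evicted as least recently used.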
Caching terminology
[Diagram: as before, with Blk X resident in the upper level and Blk Y in the lower level.]
Hit: the data appears in an upper-level block (ex: Blk X).
Miss: data retrieval from the lower level is needed (ex: Blk Y).
Hit rate: the fraction of memory accesses found in the upper level.
Miss rate: 1 - hit rate.
Hit time: the time to access the upper level. Includes the hit/miss check.
Miss penalty: the time to replace a block in the upper level + deliver the data to the CPU.
Hit Time << Miss Penalty
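These terms combine into the standard average-memory-access-time formula, AMAT = hit time + miss rate x miss penalty (the formula is standard, though not named on the slide; the numbers below are illustrative):

```python
# Average memory access time from the terms defined above:
#   AMAT = hit time + miss rate * miss penalty
def amat(hit_time, miss_rate, miss_penalty):
    return hit_time + miss_rate * miss_penalty

# Illustrative: 1-cycle hits, 5% miss rate, 100-cycle miss penalty.
print(amat(1, 0.05, 100), "cycles")  # 6.0 cycles on average
```

Because Hit Time << Miss Penalty, even a small miss rate dominates the average.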
Admin: Final Xilinx Checkoff Tomorrow
Lab report due Monday, 11:59 PM.
Final project posted soon ...
Cache Design Example
Recall: Static Memory ...
Recall: Static Memory Cell Design
[Diagram: 6-transistor SRAM cell: cross-coupled inverters between Vdd and Gnd, with access transistors gated by the word line connecting to the bit line and its complement, !bitline.]
SRAM array: simpler than DRAM array
4/12/04 © UCB Spring 2004, CS152 / Kubiatowicz, Lec 19.13
Random Access Memory (RAM) Technology
° Why do computer designers need to know about RAM technology?
• Processor performance is usually limited by memory bandwidth
• As IC densities increase, lots of memory will fit on the processor chip
- Tailor on-chip memory to specific needs
- Instruction cache
- Data cache
- Write buffer
° What makes RAM different from a bunch of flip-flops?
• Density: RAM is much denser
Static RAM Cell: the 6-Transistor SRAM Cell
[Diagram: the cell storing complementary values (0 and 1), with a word line (row select) and bit / !bit lines. Annotation: the pullups may be replaced with resistive pullups to save area.]
° Write:
1. Drive the bit lines (bit = 1, !bit = 0)
2. Select the row
° Read:
1. Precharge bit and !bit to Vdd or Vdd/2 => make sure they are equal!
2. Select the row
3. The cell pulls one line low
4. The sense amp on the column detects the difference between bit and !bit
Typical SRAM Organization: 16-word x 4-bit
[Diagram: a 16-row x 4-column array of SRAM cells (Word 0, Word 1, ..., Word 15), one sense amp and one write driver/precharger per column, an address decoder driven by A0-A3, WrEn and Precharge controls, data inputs Din 0-Din 3, and data outputs Dout 0-Dout 3.]
Q: Which is longer: the word line or the bit line?
Logic Diagram of a Typical SRAM
[Diagram: a 2^N-word x M-bit SRAM with an N-bit address bus A, an M-bit data bus D, and WE_L and OE_L control inputs.]
° Write Enable is usually active low (WE_L)
° Din and Dout are combined to save pins:
• A new control signal, output enable (OE_L), is needed
• WE_L asserted (Low), OE_L deasserted (High):
- D serves as the data input pin
• WE_L deasserted (High), OE_L asserted (Low):
- D is the data output pin
• Both WE_L and OE_L asserted:
- Result is unknown. Don't do that!!!
° Although we could change the VHDL to do what we desire, we must do the best with what we've got (vs. what we need)
Word and bit lines slow down as the array grows larger! Architects specify the number of rows and columns.
Parallel data I/O lines: add muxes to select a subset of bits.
How could we pipeline this memory?
Cache Design Example
CPU address space: An array of "blocks"
[Diagram: memory as an array of 32-byte blocks, numbered 0, 1, 2, ..., 2^27 - 1.]
32-bit memory address: bits 31..5 select which block (27 bits); bits 4..0 select the byte # within the block (5 bits).
The job of a cache is to hold a "popular" subset of blocks.
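The block/byte split can be sketched in a few lines (field widths taken from the slide; the example address is illustrative):

```python
# Splitting a 32-bit address into block number and byte offset,
# for 32-byte blocks: 5 offset bits, 27 block-number bits.
BLOCK_BITS = 5

def split_address(addr):
    """Return (block #, byte # within the block)."""
    block = addr >> BLOCK_BITS               # bits 31..5: which block
    byte  = addr & ((1 << BLOCK_BITS) - 1)   # bits 4..0: byte offset
    return block, byte

print(split_address(0x44))   # (2, 4): byte 4 of block 2
```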
One Approach: Fully Associative Cache
Address split: Cache Tag (27 bits, address bits 31..5) and Byte Select (5 bits, bits 4..0; ex: 0x04).
[Diagram: the cache data holds 4 blocks of 32 bytes (Byte 0 ... Byte 31), each with a valid bit and a stored block # ("tag", bits 26..0). The incoming tag is compared against all four stored tags at once (one = comparator per block); any match raises Hit and returns the bytes of the "hit" cache line.]
Ideal, but expensive ...
Building a cache with one comparator
[Diagram: memory as an array of 32-byte blocks, numbered 0 through 2^27 - 1, with blocks colored by cache index. Blocks of a certain color may only appear in one line of the cache.]
32-bit memory address: bits 31..7 select which block (25 bits); bits 6..5 are the color, i.e. the cache index (2 bits); bits 4..0 select the byte # (5 bits).
Example: A Direct Mapped Cache
Address split: Cache Tag (25 bits, address bits 31..7), Index (2 bits, bits 6..5; ex: 0x01), and Byte Select (5 bits, bits 4..0; ex: 0x00).
[Diagram: the index selects one of four cache lines; each line holds a valid bit, a 25-bit cache tag (bits 24..0), and 32 bytes of cache data (Byte 0 ... Byte 31). A single = comparator checks the stored tag against the address tag; a match raises Hit and returns the bytes of the "hit" cache line.]
PowerPC 970: 64K direct-mapped Level-1 I-cache
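A behavioral model of the direct-mapped lookup, using the 25/2/5-bit split above (the stored line contents are illustrative, and a real cache does all of this in hardware):

```python
# Behavioral model of a direct-mapped lookup: 25-bit tag,
# 2-bit index (4 cache lines), 5-bit byte select.
NUM_LINES = 4
valid = [False] * NUM_LINES
tags  = [0] * NUM_LINES
data  = [bytes(32)] * NUM_LINES   # 32-byte blocks

def lookup(addr):
    byte  = addr & 0x1F           # bits 4..0: byte select
    index = (addr >> 5) & 0x3     # bits 6..5: only ONE line to check
    tag   = addr >> 7             # bits 31..7: compare with stored tag
    hit = valid[index] and tags[index] == tag
    return hit, (data[index][byte] if hit else None)

# Fill line 1 with block 0xA0's data (illustrative), then look it up.
valid[1], tags[1], data[1] = True, 0xA0 >> 7, bytes(range(32))
print(lookup(0xA5))   # (True, 5): hit, byte 5 of the block
```

One comparator suffices because each block color maps to exactly one line.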
The limits of direct-mapped caches ...
[Plot: memory address vs. time, one dot per memory access. Donald J. Hatfield, Jeanette Gerald: Program Restructuring for Virtual Memory. IBM Systems Journal 10(3): 168-192 (1971).]
What if both regions have the same block color?
Hybrid Design: Set Associative Cache
Address split: Cache Tag (26 bits), Index (2 bits; ex: 0x01), and Byte Select (4 bits). The cache block is halved (to 16 bytes) to keep the number of cached bits constant.
[Diagram: two ways, each with its own valid bits, cache tags, and cache data arrays of 16-byte blocks. The index selects one set; both stored tags are compared against the address tag (two = comparators raising HitLeft and HitRight), and the bytes of the "hit" set member are returned.]
"N-way" set associative: N is the number of blocks for each color.
PowerPC 970: 32K 2-way set-associative L1 D-cache
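A behavioral model of the 2-way lookup, using the 26/2/4-bit split above (tag match only; the valid-bit arrays and the fill policy shown are illustrative):

```python
# Behavioral model of a 2-way set-associative lookup: 26-bit tag,
# 2-bit index (4 sets), 4-bit byte select (16-byte blocks).
NUM_SETS, WAYS = 4, 2
valid = [[False] * WAYS for _ in range(NUM_SETS)]
tags  = [[0] * WAYS for _ in range(NUM_SETS)]

def lookup(addr):
    index = (addr >> 4) & 0x3    # bits 5..4: select the set
    tag   = addr >> 6            # bits 31..6
    # Both ways of the set are checked in parallel in hardware.
    return any(valid[index][w] and tags[index][w] == tag
               for w in range(WAYS))

# Two blocks with the same color can now coexist in one set:
for base in (0x10, 0x50):        # both map to set 1
    index, tag = (base >> 4) & 0x3, base >> 6
    way = 0 if not valid[index][0] else 1   # fill the first free way
    valid[index][way], tags[index][way] = True, tag

print(lookup(0x10), lookup(0x50))  # True True: no conflict eviction
```

A direct-mapped cache would have evicted one of these two blocks; the second way removes the conflict.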
The benefits of set-associativity ...
[Plot: memory address vs. time, one dot per memory access. Donald J. Hatfield, Jeanette Gerald: Program Restructuring for Virtual Memory. IBM Systems Journal 10(3): 168-192 (1971).]
What if both regions have the same block color?
Q. What costs (over direct mapped) do we pay for this benefit?
Conclusions ...
Program locality is why building a memory hierarchy makes sense.
Latency toolkit: hierarchy design, bit-wise parallelism, pipelining.
Cache operation: compare tags, detect hits, select bytes.
In practice: how many rows, how many columns, how many arrays.
Lectures: Coming up next ...
More on caches + intro to final project
Advanced topics begin ...