Lecture 3: Cache & I/O (TDTS10/lectures/16/lec3.pdf)
2016-11-07
Zebo Peng, IDA, LiTH – TDTS10 – Lecture 3
Lecture 3: Cache & I/O
Virtual memory
Cache memory
I/O operations
Mismatch of CPU and MM Speeds
[Figure: CPU vs. main-memory cycle times (nanoseconds, log scale), 1955–2015, showing a speed gap of about one order of magnitude, i.e., 10 times.]
Cache Memory

A cache is a very fast memory placed between the main memory and the CPU, used to hold segments of program and data from the main memory.

[Figure: CPU, Cache, and Main Memory, exchanging addresses and instructions/data.]
Zebo’s Cache Memory Model
Personal library for a high-speed reader
A computer is a “predictable and iterative reader”: a high cache hit ratio, e.g., 96%, is achievable even with a relatively small cache (e.g., 0.1% of the memory size).
Cache Memory Features
It is transparent to the programmers. The CPU still refers to the instructions/data by their addresses in the MM.
Only a very small part of the program/data in the main memory has its copy in the cache (e.g., 4MB cache with 8GB memory).
If the CPU wants to access program/data not in the cache (called a cache miss), the relevant block of the main memory will be copied into the cache.
Memory accesses in the near future will usually refer to the same word, or to words in its neighborhood, and will therefore not have to involve the main memory.
Locality of reference!
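The locality effect can be illustrated with a toy simulation: a cache that holds whole blocks serves a spatially local access stream far better than a scattered one. The block size and cache capacity below are illustrative choices, not values from the slides.

```python
def hit_ratio(addresses, block_size=4, num_slots=8):
    """Hit ratio of a tiny fully associative cache with FIFO eviction."""
    cache = []          # resident block numbers, oldest first
    hits = 0
    for addr in addresses:
        block = addr // block_size
        if block in cache:
            hits += 1
        else:
            cache.append(block)         # cache miss: load the whole block
            if len(cache) > num_slots:
                cache.pop(0)            # evict the oldest block
    return hits / len(addresses)

sequential = list(range(100))               # strong spatial locality
scattered = [i * 1000 for i in range(100)]  # no locality at all
print(hit_ratio(sequential))   # 0.75: within each 4-word block, 3 of 4 accesses hit
print(hit_ratio(scattered))    # 0.0: every access is a miss
```

The sequential stream hits on every word of a block after the first, while the scattered stream never reuses a loaded block.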
Cache Memory Performance

Average Access Time (AAT):

AAT = Phit x Tcache_access + (1 – Phit) x (Tmm_access + Tcache_access) x Block_size + Tchecking

where
Phit = the probability of a cache hit (cache hit ratio);
Tcache_access = cache access time;
Tmm_access = main memory access time;
Block_size = number of words in a cache block; and
Tchecking = the time needed to check for a cache hit or miss.

Ex. A computer has 8 MB MM with 100 ns access time, an 8 KB cache with 10 ns access time, Block_size = 4, Tchecking = 2.1 ns, and Phit = 0.97. The AAT will be 25 ns.
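The formula and the example can be checked with a short script:

```python
def average_access_time(p_hit, t_cache, t_mm, block_size, t_check):
    """AAT as on the slide: a hit costs one cache access; a miss loads the
    whole block (main-memory plus cache access per word); the hit/miss
    check is always paid."""
    return (p_hit * t_cache
            + (1 - p_hit) * (t_mm + t_cache) * block_size
            + t_check)

# The slide's example values (times in ns):
aat = average_access_time(p_hit=0.97, t_cache=10, t_mm=100,
                          block_size=4, t_check=2.1)
print(round(aat, 1))   # 25.0
```

The hit term contributes 9.7 ns, the miss term 0.03 x 110 x 4 = 13.2 ns, and checking 2.1 ns, giving 25 ns as stated.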
Cache Memory Performance (Cont’d)

With the example above, the cache and the main memory together behave as one composite memory:

[Figure: Main Memory (8 MB, 100 ns) + Cache (8 KB, 10 ns) = Composite Memory (8 MB, 25 ns).]
Cache Design
The size and nature of the copied block must be carefully designed, as well as the algorithm that decides which block to remove from the cache when it is full:
Cache block size (line size).
Total cache size.
Mapping function.
Replacement method.
Write policy.
Number of caches:
• Single, two-level, or three-level cache.
• Unified vs. split cache.
Split Data and Instruction Caches?
Split caches (Harvard architecture):
+ The CPU accesses the I-cache for instruction fetch and the D-cache for data.
+ Competition for the cache between instruction processing and execution units is eliminated.
+ Instruction fetch can proceed in parallel with memory access from the CPU for operands.
– One cache may be overloaded while the other is underutilized.

Unified caches:
+ Better balance of the load between instruction and data fetches, depending on the dynamics of the program execution.
+ Design and implementation are cheaper.
– Lower performance.
Direct Mapping Cache
Direct mapping: each block of the main memory is mapped into a fixed cache slot.

[Figure: each memory block maps to exactly one slot of the cache.]
Direct Mapping Cache Example

We have a 10,000-word MM and a 100-word cache; 10 memory cells are grouped into a block.

Memory address = Tag (2 digits) | Slot (1 digit) | Word (1 digit)

[Figure: a 100-word cache with slot numbers 0–9, holding blocks such as 0000-0009 through 0090-0099 from the 10,000-word memory (blocks 0000-0009, ..., 9990-9999); e.g., address 0115 has tag 01, slot 1, word 5.]
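The address decomposition in this example can be sketched as follows, using the decimal field widths from the slide:

```python
def direct_map(address):
    """Decompose a 4-digit decimal address for the slide's example:
    10,000-word memory, 100-word cache, 10-word blocks, 10 slots.
    Address digits: tag (2), slot (1), word (1)."""
    tag, rest = divmod(address, 100)
    slot, word = divmod(rest, 10)
    return tag, slot, word

print(direct_map(115))    # (1, 1, 5): block 0110-0119 goes to slot 1 with tag 01
# Blocks 0010-0019, 0110-0119, and 9910-9919 all compete for slot 1:
for a in (15, 115, 9915):
    print(direct_map(a))
```

All three addresses share slot 1 but carry different tags, which is exactly what the hit/miss check compares.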
Direct Mapping Pros & Cons

+ Simple to implement and therefore inexpensive.
+ Very fast checking time for cache hit or miss.
– Fixed location for blocks: if a program repeatedly accesses 2 blocks that map to the same cache slot, the cache miss rate is very high.

[Figure: two memory blocks competing for the same cache slot.]
Associative Mapping

A main memory block can be loaded into any slot of the cache. To determine whether a block is in the cache, a mechanism is needed to simultaneously examine every slot’s tag.

Memory address = Tag | Word

[Figure: a 100-word cache where each slot stores a 3-digit tag (e.g., 010, 287, 001, 297) alongside its block from the 10,000-word memory; a lookup for addresses 0106 and 0107 matches tag 010 in all slots in parallel.]
Fully Associative Organization
Set Associative Organization
The cache is divided into a number of sets (K).
Each set contains a number of slots (W).
A given block maps to any slot in a given set, e.g., block i can be in any slot of set j, where j = i mod K.
For example, 2 slots per set (W = 2): 2-way associative mapping.
A given block can be in one of 2 slots.
Direct mapping: W = 1 (no alternative).
Fully associative: K = 1 (W = total number of all slots in the cache, all mappings possible).
W is the most important parameter (typically 2-16).
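The set-mapping rule above can be sketched in a few lines:

```python
def cache_set(block_number, num_sets):
    """In a K-set, W-way cache, block i may occupy any of the W slots
    of set (i mod K). Direct mapping is W = 1; fully associative is K = 1."""
    return block_number % num_sets

K = 4                     # e.g., 8 slots, 2-way (W = 2) gives K = 4 sets
print(cache_set(5, K))    # 1
print(cache_set(9, K))    # 1: blocks 5 and 9 compete only within set 1
print(cache_set(9, 1))    # 0: fully associative (K = 1), one big set
```

Only the tags within one set have to be compared on a lookup, which is why W stays small (typically 2-16) in practice.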
Replacement Algorithms

With direct mapping, no replacement algorithm is needed. With associative mapping, a replacement algorithm is needed to determine which block to replace:

First-in-first-out (FIFO).
Least-recently used (LRU): replace the block that has been in the cache longest without a reference to it.
Least-frequently used (LFU): replace the block that has experienced the fewest references.
Random.
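LRU can be sketched with a small simulation; the access sequence and cache size below are illustrative.

```python
from collections import OrderedDict

def simulate_lru(accesses, num_slots):
    """Count misses under LRU: on a hit, the block becomes most recent;
    when the cache is full, evict the block unreferenced the longest."""
    cache = OrderedDict()   # keys in least- to most-recently-used order
    misses = 0
    for block in accesses:
        if block in cache:
            cache.move_to_end(block)       # refresh recency on a hit
        else:
            misses += 1
            if len(cache) == num_slots:
                cache.popitem(last=False)  # evict the least recently used
            cache[block] = True
    return misses

print(simulate_lru([1, 2, 3, 1, 4, 1, 2], num_slots=3))   # 5
```

The three cold misses load blocks 1-3; block 4 then evicts block 2 (least recently used), and the final access to block 2 misses again.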
Write Policy

The problem: how to keep the cache content and the main memory content consistent without losing too much performance?

Write through: all write operations are passed on to main memory:
If the addressed location is currently in the cache, the cache copy is also updated so that it stays coherent with the main memory.
For write operations, the CPU always slows down to main memory speed.
Since the percentage of writes is small (ca. 15%), this does not lead to too large a performance reduction.
Write Policy (Cont’d)

Write through with buffered write:
The same as write through, but instead of slowing the CPU down by writing directly to main memory, the write address and data are stored in a high-speed write buffer, which transfers the data to main memory while the CPU continues its task.
Higher speed, but more complex hardware.

Write back:
A write updates only the cache; the cache is not kept coherent with main memory. When a cache block is replaced, its content has to be written back to memory.
Good performance (usually several writes are performed on a cache block before it is replaced), but more complex hardware is needed.

Cache coherence problems are very complex and difficult to solve in multiprocessor systems (to be discussed later)!
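The write-back idea, a dirty bit that defers the memory write until eviction, can be sketched as below. The class and its FIFO victim choice are illustrative, not from the slides.

```python
class WriteBackCache:
    """Minimal write-back sketch: writes only set a dirty bit; main memory
    is updated when a dirty block is evicted."""
    def __init__(self, num_slots):
        self.num_slots = num_slots
        self.blocks = {}        # block number -> dirty flag
        self.writebacks = 0     # main-memory writes actually performed

    def write(self, block):
        self._load(block)
        self.blocks[block] = True            # mark dirty; no memory write yet

    def _load(self, block):
        if block not in self.blocks:
            if len(self.blocks) == self.num_slots:
                victim = next(iter(self.blocks))     # FIFO victim choice
                if self.blocks.pop(victim):
                    self.writebacks += 1             # dirty: write back now
            self.blocks[block] = False

cache = WriteBackCache(num_slots=2)
for _ in range(10):
    cache.write(7)        # ten writes to the same block...
cache.write(8)
cache.write(9)            # ...but block 7 is written back only once, on eviction
print(cache.writebacks)   # 1
```

Under write through, the same sequence would have caused twelve memory writes; write back collapses them into one, at the cost of the dirty-bit bookkeeping.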
Cache Architecture Examples

Intel Pentium (first introduced in 1993):
Two on-chip caches, one for data and one for instructions. Each cache: 8 KB. Line size: 32 bytes. 2-way set associative organization.

AMD Opteron 140 (introduced 2003):
Two L1 caches, one for instructions and one for data; 64 KB each, 2-way associative organization. L2 cache: 1 MB, 16-way associative organization.

ARM Cortex-A15 (introduced 2012):
Each core has separate L1 data and instruction caches: 64 KB (32 KB I-cache, 32 KB D-cache) per core. L2 cache, unified and common for all cores, up to 4 MB.
3-Level Cache Example

Intel Itanium 2 (introduced 2002):

                L1              L2             L3
Contents        Split D and I   Unified D + I  Unified D + I
Size            16 KB each      256 KB         3 MB
Line size       64 bytes        128 bytes      128 bytes
Associativity   4-way           8-way          12-way
Access time     1 cycle         5-7 cycles     14-17 cycles
Store policy    Write-through   Write-back     Write-back
Lecture 3: Cache & I/O
Virtual memory
Cache memory
I/O operations
Motivation for Virtual Memory
The physical main memory (RAM) is relatively limited in capacity.
It may not be big enough to store all the executing programs at the same time.
A program may need memory larger than the main memory size, but the whole program doesn’t need to be kept in the main memory at the same time.
Virtual Memory takes advantage of the fact that at any given instant of time, an executing program needs only a fraction of the memory that the whole program occupies.
The basic idea: Load only pieces of each executing program which are currently needed (on demand).
Paging of Memory
Divide programs (processes) into equal sized, small blocks, called pages.
Divide the primary memory into equal sized, small blocks called page frames.
Allocate the required number of page frames to a program.
A program does not require contiguous page frames!
The operating system (OS) is responsible for:
Maintaining a list of free frames.
Using a page table to keep track of the mapping between pages and page frames.
Logical vs. Physical Addresses

• The logical (virtual) address is the address used by the program.
• The physical address is the address in the actual main memory.

Implementation of the page tables:
In main memory: slow, since an extra memory access is needed.
In separate registers: fast but expensive.
In a cache (usually a dedicated one).
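The translation step can be sketched as follows; the page size and the page-table contents are hypothetical, chosen to match the slides' 4-digit decimal addresses.

```python
PAGE_SIZE = 1000   # one page = 1000 words, so a 4-digit address splits cleanly

# Hypothetical page table: page number -> frame number (None = not resident)
page_table = {0: 3, 1: None, 2: 0}

def translate(virtual_address):
    """Split the virtual address into page number and offset, look up the
    frame in the page table, and signal a page fault if the page is absent."""
    page, offset = divmod(virtual_address, PAGE_SIZE)
    frame = page_table.get(page)
    if frame is None:
        raise RuntimeError(f"page fault: page {page} not in main memory")
    return frame * PAGE_SIZE + offset

print(translate(2042))   # page 2 is in frame 0, so physical address 42
print(translate(5))      # page 0 is in frame 3, so physical address 3005
```

Accessing an address in page 1 raises the page-fault error, which corresponds to the interrupt that makes the OS load the page from secondary memory.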
Motivation for Virtual Memory I

To give the programmer a much bigger memory space than the main memory, with the help of the operating system. The virtual memory size can be very much bigger than the main memory size.

[Figure: program (virtual) addresses mapped onto main-memory addresses, with the remaining pages kept in secondary memory.]
Motivation for Virtual Memory II
To allow multiple programs to share main memory dynamically and efficiently.
You don’t need to negotiate with the others to share the physical memory addresses.
Each program gets a private virtual address space.
To avoid writing into each other’s memory locations.
It is the OS that translates a virtual address into a physical address. No program can read another's data.
Good for security (against malicious attack) and safety (against faults and errant access).
Page Fault

When accessing a VM page which is not in the main memory, a page fault occurs. The page must then be loaded from the secondary memory into the main memory by the OS.

[Figure: virtual address = page number + offset; the page map either points to a page in MM or triggers a page fault (interrupt to the OS).]
Page Replacement
When a page fault occurs and all page frames are occupied, one of them must be replaced.
If the replaced page has been modified during the time it resides in the main memory, the updated version should be written back to the secondary memory.
Our wish is to replace the page which will not be accessed in the future for the longest amount of time.
Problem — We don’t know exactly what will happen in the future.
Solution — We predict the future by studying the access patterns up till now (“learn from history”).
Replacement Algorithms

FIFO (First In, First Out): replace the page that has been in MM the longest time.
LRU (Least Recently Used): replace the page that has not been accessed for the longest time.
LFU (Least Frequently Used): replace the page with the smallest number of accesses during the latest time period.

The random replacement technique (used for caches) is not used for VM!
Lecture 3: Cache & I/O
Virtual memory
Cache memory
I/O operations
Input/Output Devices

Input/output devices provide a means for us to make use of a computer system.

[Figure: a computer system with input device, output device, and secondary memory attached to the computer.]

There are two major types of I/O devices:
Interactive devices, e.g., a multi-touch screen (iPad).
Indirect devices, e.g., a laser printer.

Secondary memories can also be considered as I/O devices, e.g., a magnetic tape.
Characteristics of I/O

I/O devices run at much lower speed than the CPU, due partially to neglect and partially to physical limitations. Therefore, special techniques must be used to control them, so that the CPU does not have to wait for them all the time.

Many I/O devices function as an interface between a computer system and other physical systems. Such an interface usually consists of A/D and D/A converters.

[Figure: a temperature sensor feeding the computer through an A/D converter, and the computer driving a heater switch through a D/A converter.]
An I/O Module

Responsible for the control of one or several devices, and for the exchange of data between the devices and the main memory and/or CPU registers.
Functions of an I/O Module

Control and timing.
CPU communication.
Device communication.
Data buffering.
Error detection and correction.
Control of I/O Devices (1)

Programmed I/O:
The operations are controlled by I/O instructions, for example, READ and WRITE.
The instructions specify:
• the particular I/O operation to perform; and
• the given device, by giving its address (its ID number).
The CPU will wait for the I/O operation to finish before it executes the next instruction.
Since the I/O devices are very slow, the CPU has to wait all the time instead of doing useful work.
It is a very simple but not an efficient method; it can be used in an embedded system.
Programmed I/O Example

Execution flow of a WRITE instruction (e.g., WRITE ‘P’, O1):
1. Select the I/O device.
2. Send data to the device interface.
3. Check the device status.
4. Ready? If no, repeat the status check; if yes, proceed to the next instruction.

The CPU does polling, or busy waiting. The I/O device sets the appropriate bits in the I/O status register.
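The busy-waiting loop can be sketched as below; the device and its `ready`/`accept` interface are invented for illustration and stand in for reading the status register and writing the data register.

```python
class SlowDevice:
    """Toy device that only becomes ready after several status checks."""
    def __init__(self, checks_needed):
        self.checks_left = checks_needed
        self.received = None

    def ready(self):                 # models reading the I/O status register
        self.checks_left -= 1
        return self.checks_left <= 0

    def accept(self, data):          # models writing to the device interface
        self.received = data

def programmed_io_write(device, data):
    """Busy-wait (polling) WRITE: check the status until ready, then send.
    Returns how many polls were wasted before the device became ready."""
    polls = 0
    while not device.ready():        # the CPU does no useful work here
        polls += 1
    device.accept(data)
    return polls

dev = SlowDevice(checks_needed=5)
print(programmed_io_write(dev, "P"))   # 4 wasted polls
print(dev.received)                    # P
```

Every iteration of the `while` loop is a CPU cycle spent doing nothing useful, which is exactly the inefficiency that interrupt-driven I/O removes.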
Control of I/O Devices (2)

Interrupt-driven I/O:
After the CPU sends an initialization signal to an I/O device, it continues with the execution of programs.
When the I/O device is ready, or wants to get the attention of the CPU, it sends an interrupt signal to the CPU.
When the CPU receives an interrupt signal, it will first finish the execution of the current instruction and then execute the interrupt service routine (ISR).
This mechanism frees the CPU from having to check the I/O devices periodically (polling) to see if they are in need of any attention.
Interrupt Service Routine (ISR)

Save all status information needed to resume execution of the current sequence of instructions. Put the saved PC value in a safe place!
Deal with the interrupt, for example, by reading data from the input device.
Restore the saved status information and then resume execution of the interrupted program.

[Figure: the current sequence is interrupted, the interrupt routine runs, and the original sequence is then resumed.]
Instruction Cycle with Interrupts

[Figure: fetch cycle (fetch next instruction) -> execute cycle (execute instruction) -> interrupt cycle (check for interrupt, taken only when interrupts are enabled), then back to fetch. On an interrupt, the hardware saves the current PC value and sets the PC to the first address of the ISR.]
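The cycle can be sketched as a toy model; the instruction names and the interrupt arrival points are illustrative, and the state save/restore is left implicit.

```python
def run(program, interrupts, isr):
    """Instruction cycle with interrupts: after each execute phase the CPU
    checks for a pending interrupt; if one arrived during that instruction,
    it runs the ISR before fetching the next instruction."""
    trace = []
    for pc, instr in enumerate(program):
        trace.append(instr)        # fetch cycle + execute cycle
        if pc in interrupts:       # interrupt cycle (interrupts enabled)
            trace.extend(isr)      # HW saves the PC, runs the ISR, restores
    return trace

print(run(["i0", "i1", "i2"], interrupts={1}, isr=["isr0", "isr1"]))
# ['i0', 'i1', 'isr0', 'isr1', 'i2']
```

Note that instruction `i1` completes before the ISR starts, matching the rule that the CPU first finishes the current instruction.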
Multiple Interrupts

A new interrupt may occur while the current interrupt is being processed (during execution of the ISR instructions):
Disabled-interrupt approach: the CPU ignores new interrupt signals while processing an interrupt.
Priority-based approach: an interrupt of higher priority will interrupt the processing of a lower-priority interrupt.

[Figure: a lower-priority interrupt arriving during an ISR is deferred until the ISR completes, after which the original sequence is resumed.]
[Figure: a higher-priority interrupt preempts the ISR of a lower-priority interrupt; the higher-priority ISR runs to completion, then the lower-priority ISR resumes, and finally the original sequence is resumed.]
Control of I/O Devices (3)

Direct Memory Access (DMA): allows the transfer of a whole block of data between an I/O device and the memory directly, without going through the CPU.

[Figure: the CPU sends control information to the DMA module, which transfers a block of N words starting at address A between the I/O device and main memory, and signals the CPU when the transfer is finished.]

A DMA mechanism has essentially the same function as the data transfer capabilities of the CPU.
I/O Operation Summary
Data exchanges of a computer system with the outside world are provided by the I/O devices.
Secondary memories can also be considered as I/O devices.
Since computers deal with many different types of users and also interface with many different physical systems, there is a large variety of I/O devices.
We need different techniques to control the I/O devices, due to the large differences between I/O operations.
The I/O function is usually the most unreliable part of a computer system, and we need techniques for error detection and recovery.