
Transcript of Lecture 3: Cache & I/O (TDTS10/lectures/16/lec3.pdf)

Zebo Peng, IDA, LiTH. 2016-11-07.

Lecture 3: Cache & I/O

Virtual memory

Cache memory

I/O operations


Mismatch of CPU and MM Speeds

[Figure: CPU vs. main-memory cycle time (ns, log scale from 10^4 down to 10^0), 1955–2015. The two curves diverge, leaving a speed gap of ca. one order of magnitude, i.e., 10 times.]


Cache Memory

A cache is a very fast memory placed between the main memory and the CPU, used to hold segments of the program and data of the main memory.

[Figure: CPU – Cache – Main Memory; addresses flow from the CPU towards the memory, and instructions and data flow back.]

Zebo’s Cache Memory Model

A personal library for a high-speed reader.

A computer is a “predictable and iterative reader,” so a high cache hit ratio, e.g., 96%, is achievable even with a relatively small cache (e.g., 0.1% of the memory size).

[Figure: the main memory (storage cells plus memory controller) with a small cache in front of it.]


Cache Memory Features

It is transparent to the programmers. The CPU still refers to the instructions/data by their addresses in the MM.

Only a very small part of the program/data in the main memory has its copy in the cache (e.g., 4MB cache with 8GB memory).

If the CPU wants to access program/data not in the cache (called a cache miss), the relevant block of the main memory will be copied into the cache.

Memory accesses in the near future will usually refer to the same word or to words in its neighborhood, and will not have to involve the main memory.

Locality of reference!


Cache Memory Performance

Average Access Time (AAT) =

    Phit × Tcache_access
    + (1 – Phit) × (Tmm_access + Tcache_access) × Block_size
    + Tchecking

where

    Phit = the probability of a cache hit (cache hit ratio);
    Tcache_access = cache access time;
    Tmm_access = main memory access time;
    Block_size = number of words in a cache block; and
    Tchecking = the time needed to check for cache hit or miss.

Ex. A computer has an 8 MB MM with 100 ns access time, an 8 KB cache with 10 ns access time, Block_size = 4, Tchecking = 2.1 ns, and Phit = 0.97. The AAT will be 25 ns.
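The arithmetic can be checked with a small Python sketch (the variable names simply mirror the slide’s symbols):

    # AAT = Phit*Tcache + (1 - Phit)*(Tmm + Tcache)*Block_size + Tchecking
    def average_access_time(p_hit, t_cache, t_mm, block_size, t_checking):
        hit_part = p_hit * t_cache
        # a miss fetches a whole block (Block_size words) via main memory
        miss_part = (1 - p_hit) * (t_mm + t_cache) * block_size
        return hit_part + miss_part + t_checking

    # 0.97*10 + 0.03*(100 + 10)*4 + 2.1 = 9.7 + 13.2 + 2.1 = 25.0 ns
    print(average_access_time(0.97, 10, 100, 4, 2.1))  # -> 25.0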


[Figure: for the example above, the 8 KB / 10 ns cache in front of the 8 MB / 100 ns main memory behaves like a composite memory of 8 MB with a 25 ns average access time.]


Cache Design

The size and nature of the copied block must be carefully designed, as well as the algorithm that decides which block to remove from the cache when it is full:

Cache block size (line size).

Total cache size.

Mapping function.

Replacement method.

Write policy.

Number of caches:

• Single, two-level, or three-level cache.

• Unified vs. split cache.


Split Data and Instruction Caches?

Split caches (Harvard architecture):

+ The CPU accesses the I-cache for instruction fetch, and the D-cache for data.

+ Competition for the cache between the instruction processing and execution units is eliminated.

+ Instruction fetch can proceed in parallel with memory access from the CPU for operands.

– One may be overloaded while the other is underutilized.

Unified caches:

+ Better balance of the load between instruction and data fetches, depending on the dynamics of the program execution.

+ Design and implementation are cheaper.

– Lower performance.


Direct Mapping Cache

Direct mapping: each block of the main memory is mapped into a fixed cache slot.

[Figure: main-memory blocks 1 and 2 each map to their own fixed slot in the cache.]


Direct Mapping Cache Example

We have a 10,000-word MM and a 100-word cache; 10 memory cells are grouped into a block.

Memory address (4 digits) = Tag (2 digits) | Slot (1 digit) | Word (1 digit)

[Figure: the 100-word cache has 10 slots (No. 0–9, holding cache words 00–09, 10–19, ..., 90–99). Each slot stores one 10-word block of the 10,000-word memory (0000–0009, 0010–0019, ..., 9990–9999) together with its 2-digit tag. Example: address 0115 splits into tag 01, slot 1, word 5, so it is a hit when slot 1 holds the block with tag 01.]
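The address split is easy to express in code; here is a minimal sketch of the example above (the function and variable names are my own):

    # 4-digit address -> Tag (2 digits) | Slot (1 digit) | Word (1 digit)
    def split_address(addr):
        tag = addr // 100          # which 100-word group of memory
        slot = (addr // 10) % 10   # which of the 10 cache slots
        word = addr % 10           # which word within the 10-word block
        return tag, slot, word

    slot_tags = {1: 1}             # slot 1 currently holds the block with tag 01
    tag, slot, word = split_address(115)
    print(tag, slot, word, slot_tags.get(slot) == tag)  # -> 1 1 5 True (a hit)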


Direct Mapping Pros & Cons

+ Simple to implement and therefore inexpensive.

+ Very fast checking time for cache hit or miss.

– Fixed location for blocks: if a program repeatedly accesses two blocks that map to the same cache slot, the cache miss rate is very high.

[Figure: two main-memory blocks contending for the same fixed cache slot.]


Associative Mapping

A main memory block can be loaded into any slot of the cache.

To determine if a block is in the cache, a mechanism is needed to simultaneously examine every slot’s tag.

Memory address (4 digits) = Tag (3 digits) | Word (1 digit)

[Figure: associative memory example. Each slot of the 100-word cache stores a 10-word block together with its 3-digit tag (e.g., 010, 287, 001, 297). Addresses 0106 and 0107 are found by matching tag 010 against all slots simultaneously.]


Fully Associative Organization

[Figure: fully associative cache organization; any block can occupy any slot, so the tag of an address is compared with every slot’s tag in parallel.]


Set Associative Organization

The cache is divided into a number of sets (K).

Each set contains a number of slots (W).

A given block maps to any slot in a given set, e.g., block i can be in any slot of set j.

For example, 2 slots per set (W = 2): 2-way associative mapping.

A given block can be in one of 2 slots.

Direct mapping: W = 1 (no alternative).

Fully associative: K = 1 (W = total number of all slots in the cache, all mappings possible).

W is the most important parameter (typically 2-16).
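A minimal sketch of W-way set-associative lookup with LRU replacement (my own illustration; the slides do not prescribe an implementation). Block i maps to set i mod K and may occupy any of that set’s W slots:

    class SetAssociativeCache:
        def __init__(self, num_sets, ways):            # K sets of W slots
            self.num_sets, self.ways = num_sets, ways
            self.sets = [[] for _ in range(num_sets)]  # blocks, LRU order first

        def access(self, block):
            s = self.sets[block % self.num_sets]
            if block in s:                  # hit: refresh the LRU order
                s.remove(block)
                s.append(block)
                return True
            if len(s) == self.ways:         # full set: evict the LRU block
                s.pop(0)
            s.append(block)                 # miss: load the block
            return False

    # ways = 1 gives direct mapping; num_sets = 1 gives a fully associative cache.
    cache = SetAssociativeCache(num_sets=4, ways=2)
    print([cache.access(b) for b in (0, 4, 0, 4, 8)])
    # -> [False, False, True, True, False]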


Replacement Algorithms

With direct mapping, there is no need for one.

With associative mapping, a replacement algorithm is needed in order to determine which block to replace:

First-in-first-out (FIFO).

Least-recently used (LRU): replace the block that has been in the cache longest with no reference to it.

Least-frequently used (LFU): replace the block that has experienced the fewest references.

Random.

[Each cache slot stores its tag together with the use info needed by the replacement algorithm.]


Write Policy

The problem: how to keep the cache content and the main memory content consistent without losing too much performance?

Write through: all write operations are passed to main memory:

If the addressed location is currently in the cache, the cache copy is also updated so that it is coherent with the main memory.

For write operations, the CPU always slows down to main memory speed.

Since the percentage of writes is small (ca. 15%), this does not lead to a very large performance reduction.


Write Policy (Cont’d)

Write through with buffered write:

The same as write through, but instead of slowing the CPU down by writing directly to main memory, the write address and data are stored in a high-speed write buffer, which transfers the data to main memory while the CPU continues its task.

Higher speed, but more complex hardware.

Write back: a write updates only the cache memory, which is then not kept coherent with main memory. When the cache block is replaced, its content has to be written back to memory.

Good performance (usually several writes are performed on a cache block before it is replaced), but more complex hardware is needed.

Cache coherence problems are very complex and difficult to solve in multiprocessor systems (to be discussed later)!
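To see why write back usually pays off when a block is written several times before being replaced, here is a minimal counting sketch (my own illustration, using a single-slot cache for simplicity):

    writes = [0, 0, 0, 0, 1]   # CPU writes, by block number

    # Write through: every CPU write becomes a main-memory write.
    through = len(writes)

    # Write back: memory is written only when a dirty block is evicted.
    back, cached, dirty = 0, None, False
    for b in writes:
        if dirty and cached != b:
            back += 1          # evicting a dirty block costs one memory write
        cached, dirty = b, True
    back += dirty              # final flush of the last dirty block
    print(through, back)       # -> 5 2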


Cache Architecture Examples

Intel Pentium (first introduced in 1993):

• Two on-chip caches, one for data and one for instructions.

• Each cache: 8 KB. Line size: 32 bytes.

• 2-way set associative organization.

AMD Opteron 140 (introduced 2003):

• Two L1 caches, one for instructions and one for data; 64 KB each, 2-way associative organization.

• L2 cache: 1 MB, 16-way associative organization.

ARM Cortex-A15 (introduced 2012):

• Each core has separate L1 data and instruction caches: 64 KB (32 KB I-cache, 32 KB D-cache) per core.

• The L2 cache, unified and common for all cores, is up to 4 MB.


3-Level Cache Example

Intel Itanium 2 (introduced 2002):

                   L1               L2              L3
    Contents       Split D and I    Unified D + I   Unified D + I
    Size           16 Kbytes each   256 Kbytes      3 Mbytes
    Line size      64 bytes         128 bytes       128 bytes
    Associativity  4-way            8-way           12-way
    Access time    1 cycle          5-7 cycles      14-17 cycles
    Store policy   Write-through    Write-back      Write-back


Lecture 3: Cache & I/O

Virtual memory

Cache memory

I/O operations


Motivation for Virtual Memory

The physical main memory (RAM) is relatively limited in capacity.

It may not be big enough to store all the executing programs at the same time.

A program may need memory larger than the main memory size, but the whole program doesn’t need to be kept in the main memory at the same time.

Virtual Memory takes advantage of the fact that at any given instant of time, an executing program needs only a fraction of the memory that the whole program occupies.

The basic idea: Load only pieces of each executing program which are currently needed (on demand).


Paging of Memory

Divide programs (processes) into equal sized, small blocks, called pages.

Divide the primary memory into equal sized, small blocks called page frames.

Allocate the required number of page frames to a program.

A program does not require contiguous page frames!

The operating system (OS) is responsible for:

Maintaining a list of free frames.

Using a page table to keep track of the mapping between pages and page frames.
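The translation that the page table supports can be sketched in a few lines (my own illustration; the page size and table contents are made up):

    PAGE_SIZE = 1024                 # assumed page size, in words

    page_table = {0: 5, 1: 2, 2: 7}  # page number -> page frame, kept by the OS

    def translate(logical_addr):
        page, offset = divmod(logical_addr, PAGE_SIZE)
        frame = page_table[page]     # a missing entry would mean a page fault
        return frame * PAGE_SIZE + offset

    print(translate(2100))           # page 2, offset 52 -> 7*1024 + 52 = 7220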


Logical vs. Physical Addresses

• Logical (virtual) address is the address used by the program.

• Physical address is the address in the actual main memory.

Implementation of the page tables:

Main memory — slow, since an extra memory access is needed.

Separate registers — fast but expensive.

Cache (usually a dedicated one).


Motivation for Virtual Memory I

To give the programmer a much bigger memory space than the main memory, with the help of the operating system. The virtual memory size is very much bigger than the main memory size.

[Figure: program addresses are mapped onto main-memory addresses, with the parts of the program that do not fit kept in secondary memory.]


Motivation for Virtual Memory II

To allow multiple programs to share main memory dynamically and efficiently.

You don’t need to negotiate with other programs to share the physical memory addresses.

Each program gets a private virtual address space.

To avoid writing into each other’s memory locations.

It is the OS that translates a virtual address into a physical address. No program can read another's data.

Good for security (against malicious attack) and safety (against faults and errant access).


Page Fault

When accessing a VM page which is not in the main memory, a page fault occurs.

The page must then be loaded from the secondary memory into the main memory by the OS.

[Figure: the page number of a virtual address (page number + offset) is looked up in the page map; if the page is not among the pages in MM, a page fault interrupt is raised to the OS.]


Page Replacement

When a page fault occurs and all page frames are occupied, one of them must be replaced.

If the replaced page has been modified during the time it resides in the main memory, the updated version should be written back to the secondary memory.

Ideally, we would replace the page that will not be accessed for the longest amount of time in the future.

Problem — We don’t know exactly what will happen in the future.

Solution — We predict the future by studying the access patterns up till now (“learn from history”).


Replacement Algorithms

FIFO (First In First Out) — replace the page that has been in MM for the longest time.

LRU (Least Recently Used) — replace the page that has not been accessed for the longest time.

LFU (Least Frequently Used) — replace the page that has the smallest number of accesses during the latest time period.

The random replacement technique (used for caches) is not used for VM!
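FIFO and LRU can be compared on a short reference string with a small sketch (my own illustration, using 3 page frames):

    from collections import OrderedDict

    def page_faults(refs, frames, lru):
        mem = OrderedDict()              # pages in MM, eviction candidate first
        faults = 0
        for p in refs:
            if p in mem:
                if lru:                  # LRU refreshes order on every access
                    mem.move_to_end(p)
                continue
            faults += 1                  # page fault
            if len(mem) == frames:
                mem.popitem(last=False)  # evict oldest / least recently used
            mem[p] = True
        return faults

    refs = [1, 2, 3, 1, 4, 1, 2]
    print(page_faults(refs, 3, lru=False),  # FIFO -> 6 faults
          page_faults(refs, 3, lru=True))   # LRU  -> 5 faults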


Lecture 3: Cache & I/O

Virtual memory

Cache memory

I/O operations


Input/Output Devices

Input/output devices provide a means for us to make use of a computer system.

[Figure: a computer system = the computer itself + secondary memory + input and output devices.]

There are two major types of I/O devices:

Interactive devices, e.g., a multi-touch screen (iPad).

Indirect devices, e.g., a laser printer.

Secondary memories can also be considered as I/O devices, e.g., a magnetic tape.


Characteristics of I/O

I/O devices run at much lower speeds than the CPU, due partially to neglect and partially to physical limitations.

Therefore, special techniques must be used to control them, so that the CPU does not have to wait for them all the time.

Many I/O devices function as an interface between a computer system and other physical systems. Such an interface usually consists of A/D and D/A converters.

[Figure: a temperature sensor feeds the computer through an A/D converter, and the computer drives a heater switch through a D/A converter.]


An I/O Module

Responsible for the control of one or several devices, and for the exchange of data between the devices and the main memory and/or CPU registers.

[Figure: the I/O module sits between the external devices and the CPU/MM.]


Functions of an I/O Module

Control and timing.

CPU communication.

Device communication.

Data buffering.

Error detection and correction.


Control of I/O Devices (1)

Programmed I/O:

The operations are controlled by I/O instructions, for example, READ and WRITE.

The instructions specify:

• the particular I/O operation to perform; and

• the given device, by giving its address (its ID number).

The CPU will wait for the I/O operation to be finished before it executes the next instruction.

Since the I/O devices are very slow, the CPU has to wait all the time instead of doing useful work.

It is a very simple but not an efficient method:

• it can still be used in an embedded system.


Programmed I/O Example

Execution flow of a WRITE instruction (e.g., WRITE ‘P’, O1):

[Flowchart: Start → select I/O device → send data to the device interface → check device status → Ready? If no, check the status again; if yes, continue with the next instruction.]

The CPU does polling, or busy waiting.

The I/O device sets the appropriate bits in the I/O status register.
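The busy-waiting loop can be sketched against a toy device model (all names here are my own, not a real device API):

    class ToyDevice:
        def __init__(self):
            self.cycles_left = 0
        def send(self, data):              # data goes to the device interface
            self.data, self.cycles_left = data, 3
        def ready(self):                   # the bit in the I/O status register
            self.cycles_left -= 1
            return self.cycles_left <= 0

    def programmed_write(device, data):
        device.send(data)
        polls = 0
        while not device.ready():          # busy waiting: no useful CPU work
            polls += 1
        return polls                       # then continue with next instruction

    print(programmed_write(ToyDevice(), "P"))  # -> 2 wasted polling rounds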


Control of I/O Devices (2)

Interrupt-driven I/O:

After the CPU sends an initialization signal to an I/O device, it continues with the execution of programs.

When the I/O device is ready, or wants to get the attention of the CPU, it sends an interrupt signal to the CPU.

When the CPU receives an interrupt signal, it will first finish the execution of the current instruction and then execute the interrupt service routine (ISR).

This mechanism is used to free the CPU from having to check the I/O devices periodically (polling) to see if they are in need of any attention.


Interrupt Service Routine (ISR)

Save all status information which is needed to resume execution of the current sequence of instructions. Put the saved PC value in a safe place!

Deal with the interrupt, for example, by reading data from the input device.

Restore the saved status information and then resume execution of the interrupted program.

[Figure: the current sequence is interrupted, the interrupt routine runs, and the original sequence is then resumed.]


Instruction Cycle w. Interrupts

[Flowchart: Start → fetch cycle (fetch next instruction) → execute cycle (execute instruction) → interrupt cycle (check for interrupt) → back to the fetch cycle, or Stop. With interrupts disabled, the interrupt cycle is skipped. HW: on an interrupt, the hardware saves the current PC value and sets the PC to the first address of the ISR.]
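A toy rendering of this cycle (my own illustration): after each execute cycle the CPU checks for a pending interrupt, lets the hardware save the PC, runs the ISR, and then resumes.

    def run(program, isr, interrupt_pending):
        pc = 0
        while pc < len(program):
            instr = program[pc]          # fetch cycle
            print("exec:", instr)        # execute cycle
            pc += 1
            if interrupt_pending:        # interrupt cycle (interrupts enabled)
                saved_pc = pc            # HW saves the current PC value ...
                for step in isr:         # ... and jumps to the ISR
                    print("isr: ", step)
                pc = saved_pc            # restore, resume original sequence
                interrupt_pending = False

    run(["i1", "i2", "i3"], ["read input device"], interrupt_pending=True)
    # -> exec: i1 / isr: read input device / exec: i2 / exec: i3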


Multiple Interrupts

A new interrupt may occur while the current interrupt is being processed (execution of the ISR instructions).

Disabled-interrupt approach: the CPU ignores new interrupt signals while processing an interrupt.

Priority-based approach: an interrupt of higher priority will interrupt the processing of a lower-priority interrupt.

[Figure: disabled-interrupt approach: a new interrupt arriving during the interrupt routine is deferred; the original sequence resumes only after the routine completes.]


[Figure: priority-based approach: a higher-priority interrupt preempts the ISR of a lower-priority interrupt; when the higher-priority ISR finishes, the lower-priority ISR resumes, and then the original sequence.]


Control of I/O Devices (3)

Direct Memory Access (DMA): allow the transfer of a whole block of data from an I/O device directly to the memory, without going through the CPU.

[Figure: the CPU passes control info to the DMA unit and stops driving the transfer itself; the DMA unit moves the block between the I/O device and main memory and signals the CPU when it has finished.]

A DMA mechanism has essentially the same function as the data transfer capabilities of the CPU.
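A minimal sketch of the idea (my own illustration): the CPU only sets up the transfer; the block moves without per-word CPU involvement, and the CPU is notified once, at the end.

    def dma_transfer(memory, device_block, dest):
        # performed by the DMA unit while the CPU keeps executing programs
        memory[dest:dest + len(device_block)] = device_block
        return "finish"                  # one interrupt at the end of the block

    memory = [0] * 12
    print(dma_transfer(memory, [1, 2, 3, 4], dest=4))  # -> finish
    print(memory)  # -> [0, 0, 0, 0, 1, 2, 3, 4, 0, 0, 0, 0]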


I/O Operation Summary

Data exchanges of a computer system with the outside world are provided by the I/O devices.

Secondary memories can also be considered as I/O devices.

Since computers deal with many different types of users, and also interface with many different physical systems, there is a large variety of I/O devices.

We need different techniques to control the I/O devices, due to the large differences between I/O operations.

The I/O function is usually the most unreliable part of a computer system, and we need techniques for error detection and recovery.