Introduction to Computer Organization and Architecture
Lecture 9
By Juthawut Chantharamalee
http://dusithost.dusit.ac.th/~juthawut_cha/home.htm
Outline
Virtual Memory Basics
Address Translation
Cache vs. VM
Paging
Replacement
TLBs
Segmentation
Page Tables
2Introduction to Computer Organization and Architecture
The Full Memory Hierarchy

Level         Capacity     Access Time             Cost
Registers     100s bytes   <10s ns
Cache         KB           10-100 ns               1-0.1 cents/bit
Main Memory   MB           200-500 ns              0.0001-0.00001 cents/bit
Disk          GB           10 ms (10,000,000 ns)   10^-5 - 10^-6 cents/bit
Tape          infinite     sec-min                 10^-8 cents/bit

Staging/transfer units between levels:
Registers <-> Cache:       instructions/operands, managed by program/compiler, 1-8 bytes
Cache <-> Main Memory:     blocks, managed by cache controller, 8-128 bytes
Main Memory <-> Disk:      pages, managed by the OS, 4K-16K bytes
Disk <-> Tape:             files, managed by user/operator, MBytes

Upper levels are faster; lower levels are larger.
Virtual Memory
Some facts of computer life:
Computers run lots of processes simultaneously.
There is no full address space of memory for each process.
Processes must share smaller amounts of physical memory among many processes.
Virtual memory is the answer! It divides physical memory into blocks and assigns them to different processes.
Virtual Memory
Virtual memory (VM) allows main memory (DRAM) to act like a cache for secondary storage (magnetic disk).
VM address translation provides a mapping from the virtual address of the processor to the physical address in main memory or on disk.
The compiler assigns data to a "virtual" address; the VA is translated to a real/physical address somewhere in memory.
(This allows any program to run anywhere; where it runs is determined by a particular machine and OS.)
VM Benefits
VM provides the following benefits:
Allows multiple programs to share the same physical memory
Allows programmers to write code as though they have a very large amount of main memory
Automatically handles bringing data in from disk
Virtual Memory Basics
Programs reference "virtual" addresses in a non-existent memory; these are then translated into real "physical" addresses.
The virtual address space may be bigger than the physical address space.
Divide physical memory into blocks, called pages: anywhere from 512 bytes to 16 MB (4 KB is typical).
Virtual-to-physical translation is done by an indexed table lookup; another cache holds recent translations (the TLB).
All of this is invisible to the programmer: it looks to your application like you have a lot of memory!
VM: Page Mapping
[Figure: Process 1's virtual address space and Process 2's virtual address space are each mapped, page by page, onto page frames in physical memory or onto disk.]
VM: Address Translation
[Figure: a virtual address splits into a virtual page number (here 20 bits) and a page offset (here 12 bits, i.e., log2 of the page size). The virtual page number indexes a per-process page table, located via a page table base register; each entry holds a physical page number plus valid, protection, dirty, and reference bits. The physical page number is concatenated with the unchanged page offset to form the address sent to physical memory.]
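The translation in the slide above can be sketched in a few lines of Python. This is my own illustration, not from the lecture; the page table is just a dict from VPN to PPN, and the example mapping is made up.

```python
# A sketch of page-table address translation:
# 32-bit virtual address, 4 KB pages -> 20-bit VPN, 12-bit offset.

PAGE_OFFSET_BITS = 12                     # log2(4096)

def translate(vaddr, page_table):
    """Map a virtual address to a physical address via a page table
    (a dict VPN -> PPN here; a None entry means the page is on disk)."""
    vpn = vaddr >> PAGE_OFFSET_BITS       # virtual page number
    offset = vaddr & ((1 << PAGE_OFFSET_BITS) - 1)
    ppn = page_table[vpn]
    if ppn is None:
        raise LookupError("page fault: page %#x is on disk" % vpn)
    return (ppn << PAGE_OFFSET_BITS) | offset

# Hypothetical example: virtual page 0x1 maps to physical page 0x300.
pt = {0x1: 0x300}
print(hex(translate(0x1ABC, pt)))         # -> 0x300abc
```

Note that the offset passes through unchanged; only the page-number bits are remapped.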
Example of Virtual Memory
Relieves the problem of making a program that was too large to fit in physical memory actually fit!
Allows a program to run in any location in physical memory (called relocation). This is really useful, as you might want to run the same program on lots of machines.
[Figure: the logical program occupies contiguous virtual address space (0, 4K, 8K, 12K) and consists of 4 pages: A, B, C, D. Of these, 3 pages (A, B, C) are in physical main memory (addresses 0 through 28K) and 1 page (D) is located on disk.]
Cache Terms vs. VM Terms
Some definitions/"analogies":
A "page" or "segment" of memory is analogous to a "block" in a cache.
A "page fault" or "address fault" is analogous to a cache miss.
The role of the cache is played by "real"/physical memory: if we go to main memory and our data isn't there, we need to get it from disk.
More Definitions and Cache Comparisons
These are more definitions than analogies:
With VM, the CPU produces "virtual addresses" that are translated by a combination of HW/SW to "physical addresses."
The "physical addresses" access main memory.
The process described above is called "memory mapping" or "address translation."
Cache vs. VM Comparisons (1/2)

Parameter           First-level cache     Virtual memory
Block (page) size   16-128 bytes          4096-65,536 bytes
Hit time            1-2 clock cycles      40-100 clock cycles
Miss penalty        8-100 clock cycles    700,000-6,000,000 clock cycles
  (Access time)     (6-60 clock cycles)   (500,000-4,000,000 clock cycles)
  (Transfer time)   (2-40 clock cycles)   (200,000-2,000,000 clock cycles)
Miss rate           0.5-10%               0.00001-0.001%
Data memory size    0.016-1 MB            4 MB-4 GB
Cache vs. VM Comparisons (2/2)
Replacement policy:
Replacement on cache misses is primarily controlled by hardware.
Replacement with VM (i.e., which page do I replace?) is usually controlled by the OS.
Because of the bigger miss penalty, we want to make the right choice.
Sizes:
The size of the processor address determines the size of VM.
Cache size is independent of the processor address size.
Virtual Memory
Timing's tough with virtual memory:
AMAT = Tmem + (1-h) * Tdisk
     = 100 ns + (1-h) * 25,000,000 ns
h (the hit rate) has to be incredibly (almost unattainably) close to perfect for this to work.
So: VM is a "cache," but an odd one.
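Plugging numbers into the AMAT formula above makes the point concrete. The arithmetic is mine, using the slide's 100 ns memory time and 25,000,000 ns disk time:

```python
# How good must the hit rate be for AMAT = Tmem + (1-h) * Tdisk
# to stay close to the DRAM access time?

T_MEM = 100            # ns, main memory access (from the slide)
T_DISK = 25_000_000    # ns, disk access (from the slide)

def amat(hit_rate):
    return T_MEM + (1 - hit_rate) * T_DISK

for h in (0.99, 0.9999, 0.999999):
    print(f"h = {h}: AMAT = {amat(h):,.0f} ns")
# Even h = 99.99% gives ~2,600 ns, 26x slower than DRAM alone.
# Only around h = 99.9999% does AMAT approach the 100 ns memory time.
```

This is why page faults must be vanishingly rare for VM to be usable at all.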
Paging Hardware
[Figure: the CPU issues a 32-bit virtual address, split into a page number and an offset. The page number indexes the page table to obtain a frame number; the frame number plus the unchanged offset form the 32-bit address into physical memory.]
How big is a page? How big is the page table?
Address Translation in a Paging System
[Figure: the program's virtual address (page # + offset) goes to the paging hardware. A page table pointer register, added to the page number, locates the page table entry in main memory; the frame # read from that entry is combined with the offset to form the physical address (frame # + offset).]
How Big Is a Page Table?
Suppose:
a 32-bit architecture
a page size of 4 kilobytes
Therefore the 32-bit address divides as:
Page number: upper 20 bits (2^20 pages)
Offset: lower 12 bits (2^12 = 4096 bytes per page)
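The slide's question can be answered with a little arithmetic. The 4-byte page table entry is my assumption; the slides do not state a PTE size here:

```python
# Page table size for a 32-bit address space with 4 KB pages.

ADDR_BITS = 32
PAGE_SIZE = 4 * 1024                        # 4 KB
OFFSET_BITS = PAGE_SIZE.bit_length() - 1    # 12
VPN_BITS = ADDR_BITS - OFFSET_BITS          # 20

entries = 1 << VPN_BITS                     # 2**20 = 1,048,576 entries
PTE_SIZE = 4                                # assume one 4-byte entry per page
table_bytes = entries * PTE_SIZE

print(f"{entries:,} entries x {PTE_SIZE} B = "
      f"{table_bytes // (1024 * 1024)} MB per process")
# -> 1,048,576 entries x 4 B = 4 MB per process
```

So every process carries megabytes of page table, which motivates the multi-level tables discussed later.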
Test Yourself
A processor asks for the contents of virtual memory address 0x10020. The paging scheme in use breaks this into a VPN of 0x010 and an offset of 0x020.
PTR (a CPU register that holds the address of the page table) has a value of 0x100, indicating that this process's page table starts at location 0x100.
The machine uses word addressing, and the page table entries are each one word long.
PTR = 0x100
Memory reference: VPN = 0x010, OFFSET = 0x020
Test Yourself

ADDR      CONTENTS
0x00000   0x00000
0x00100   0x00010
0x00110   0x00022
0x00120   0x00045
0x00130   0x00078
0x00145   0x00010
0x10000   0x03333
0x10020   0x04444
0x22000   0x01111
0x22020   0x02222
0x45000   0x05555
0x45020   0x06666

What is the physical address calculated?
1. 0x10020
2. 0x22020
3. 0x45000
4. 0x45020
5. none of the above
PTR = 0x100; VPN = 0x010, OFFSET = 0x020
Test Yourself (continued; same memory contents as above)
What is the physical address calculated?
What is the contents of this address returned to the processor?
How many memory accesses in total were required to obtain the contents of the desired address?
PTR = 0x100; VPN = 0x010, OFFSET = 0x020
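The exercise above can be worked mechanically. This walkthrough is mine, not part of the slides; it uses only the two memory words the lookup actually touches:

```python
# Working the "Test Yourself" example: word-addressed memory,
# one-word PTEs, PTR = 0x100, VPN = 0x010, offset = 0x020.

memory = {
    0x00110: 0x00022,   # page table entry for VPN 0x010 (at PTR + VPN)
    0x22020: 0x02222,   # the data word itself
}

PTR, VPN, OFFSET = 0x100, 0x010, 0x020

pte_addr = PTR + VPN                 # 0x110: where the PTE lives
frame = memory[pte_addr]             # 0x022: physical frame number
# The offset field is 3 hex digits wide here, so shift the frame by 12 bits.
paddr = (frame << 12) | OFFSET       # 0x22020 -- answer choice 2
data = memory[paddr]                 # 0x02222 returned to the processor

print(hex(paddr), hex(data))         # -> 0x22020 0x2222
# Two memory accesses in total: one for the PTE, one for the data.
```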
Another Example
[Figure: a 16-byte logical memory (bytes a through p at addresses 0-15) with 4-byte pages is mapped through the page table {0 -> 5, 1 -> 6, 2 -> 1, 3 -> 2} into a 32-byte physical memory. Page 0 (a,b,c,d) lands at physical addresses 20-23, page 1 (e,f,g,h) at 24-27, page 2 (i,j,k,l) at 4-7, and page 3 (m,n,o,p) at 8-11.]
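The mapping in the example above can be reproduced directly. This sketch is mine, not from the slides:

```python
# The example's mapping: 4-byte pages, page table {0: 5, 1: 6, 2: 1, 3: 2}.

PAGE_SIZE = 4
page_table = {0: 5, 1: 6, 2: 1, 3: 2}

logical = "abcdefghijklmnop"          # logical memory, addresses 0..15
physical = [None] * 32                # 32-byte physical memory

for laddr, byte in enumerate(logical):
    page, offset = divmod(laddr, PAGE_SIZE)
    physical[page_table[page] * PAGE_SIZE + offset] = byte

print(physical[20:24])                # -> ['a', 'b', 'c', 'd']
print(physical[4:8])                  # -> ['i', 'j', 'k', 'l']
```

Each 4-byte logical page lands wherever its page-table entry says, even though the logical addresses are contiguous.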
Replacement policies
Block Replacement
Which block should be replaced on a virtual memory miss?
Again, we'll stick with the strategy that it's a good thing to eliminate page faults. Therefore, we want to replace the LRU block.
Many machines use a "use" or "reference" bit, periodically reset, which gives the OS an estimate of which pages are being referenced.
Writing a Block
What happens on a write?
We don't even want to think about a write-through policy! The time involved with accesses, VM, the hard disk, etc. is so great that it is not practical.
Instead, a write-back policy is used, with a dirty bit to tell whether a block has been written.
Mechanism vs. Policy
Mechanism:
paging hardware
trap on page fault
Policy:
fetch policy: when should we bring in the pages of a process?
1. load all pages at the start of the process
2. load only on demand: "demand paging"
replacement policy: which page should we evict, given a shortage of frames?
Replacement Policy
Given a full physical memory, which page should we evict? What policy?
Random
FIFO: first-in, first-out
LRU: least recently used
MRU: most recently used
OPT: optimal (evict the page that will not be used for the longest time in the future)
Replacement Policy Simulation
Example sequence of page numbers: 0 1 2 3 4 2 2 3 7 1 2 3
FIFO? LRU? OPT?
How do you keep track of LRU info? (another data structure question)
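Two of the policies above can be simulated in a few lines. This is my own sketch, not from the slides; the 3-frame memory size is an assumption, and the reference string follows the slide's example:

```python
# Simulating FIFO and LRU page replacement on a reference string.
from collections import OrderedDict

def simulate(refs, frames, policy):
    """Count page faults; policy is 'fifo' or 'lru'."""
    resident = OrderedDict()              # pages in memory, oldest first
    faults = 0
    for page in refs:
        if page in resident:
            if policy == "lru":
                resident.move_to_end(page)   # refresh recency on a hit
            continue                         # FIFO ignores hits
        faults += 1
        if len(resident) == frames:
            resident.popitem(last=False)     # evict front of the queue
        resident[page] = True
    return faults

refs = [0, 1, 2, 3, 4, 2, 2, 3, 7, 1, 2, 3]
print("FIFO faults:", simulate(refs, 3, "fifo"))
print("LRU  faults:", simulate(refs, 3, "lru"))
```

The only difference between the two policies is whether a hit refreshes a page's position in the eviction order, which is exactly the "data structure question" the slide raises.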
Page Tables and Lookups
1. It's slow! We've turned every access to memory into two accesses to memory.
Solution: add a specialized "cache" called a "translation lookaside buffer (TLB)" inside the processor.
2. It's still huge! Even worse: we're ultimately going to have a page table for every process. Suppose 1024 processes; that's 4 GB of page tables!
Paging/VM (1/3)
[Figure: the CPU issues virtual page 42; the page table maps it to physical frame 356 in physical memory. An invalid entry (i) means the page is on disk, and the operating system must bring it in.]
Paging/VM (2/3)
[Figure: the same translation (virtual page 42 to physical frame 356), but now the page table itself resides in physical memory.]
Place the page table in physical memory.
However: this doubles the time per memory access!!
Paging/VM (3/3)
[Figure: a special-purpose cache inside the CPU holds recent translations, so most references avoid the page table in memory.]
A special-purpose cache for translations.
Historically called the TLB: Translation Lookaside Buffer.
It's a cache!
Translation Cache
Just like any other cache, the TLB can be organized as fully associative, set associative, or direct mapped.
TLBs are usually small, typically not more than 128-256 entries even on high-end machines. This permits a fully associative lookup on those machines; most mid-range machines use small n-way set-associative organizations.
Note: 128-256 entries times 4 KB-16 KB per entry is only 512 KB-4 MB, so the L2 cache is often bigger than the "span" of the TLB.
Translation with a TLB
[Figure: the CPU presents a virtual address (VA) to the TLB. On a TLB hit, the physical address (PA) goes straight to the cache, and a cache hit returns data to the CPU. On a TLB miss, the full translation (page table walk) produces the PA, which then probes the cache and, on a cache miss, main memory.]

Translation Cache
A way to speed up translation is to use a special cache of recently used page table entries. This has many names, but the most frequently used is Translation Lookaside Buffer, or TLB.
A TLB entry holds: virtual page number (the tag), physical frame number, and dirty, reference, valid, and access bits.
It is really just a cache (a special-purpose cache) on the page table mappings.
TLB access time is comparable to cache access time (much less than main memory access time).
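The structure above can be sketched as a tiny software TLB. This is my own illustration, not from the lecture; the 4-entry capacity and LRU replacement are assumptions (real TLBs are hardware and often fully associative with other replacement schemes):

```python
# A tiny fully associative TLB with LRU replacement, sitting in front
# of a page table (here just a dict VPN -> PPN).
from collections import OrderedDict

class TLB:
    def __init__(self, page_table, entries=4):
        self.page_table = page_table      # the backing page table
        self.entries = entries
        self.cache = OrderedDict()        # VPN -> PPN, recency-ordered
        self.hits = self.misses = 0

    def lookup(self, vpn):
        if vpn in self.cache:
            self.hits += 1
            self.cache.move_to_end(vpn)   # refresh recency
            return self.cache[vpn]
        self.misses += 1                  # walk the page table instead
        ppn = self.page_table[vpn]
        if len(self.cache) == self.entries:
            self.cache.popitem(last=False)  # evict the LRU translation
        self.cache[vpn] = ppn
        return ppn

tlb = TLB({v: v + 0x100 for v in range(16)})
for vpn in [1, 2, 1, 1, 3]:
    tlb.lookup(vpn)
print(tlb.hits, tlb.misses)               # -> 2 3
```

Repeated references to the same page hit in the TLB and never touch the page table, which is the entire point.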
An Example of a TLB
[Figure: the virtual address splits into a page frame address <30 bits> and a page offset <13 bits>. Each TLB entry holds a valid bit <1>, read and write permission bits <2> each, a tag <30>, and a physical address field <21>. The low-order 13 bits of the address pass through unchanged; a 32:1 mux selects the matching entry's <21>-bit physical page number, which is concatenated with the 13-bit offset to form a 34-bit physical address. Read/write policies and permissions are checked as part of the lookup.]
The "Big Picture" and TLBs
Address translation is usually on the critical path, which determines the clock cycle time of the processor.
Even in the simplest cache, TLB values must be read and compared.
The TLB is usually smaller and faster than the cache-address-tag memory, so that multiple TLB reads don't increase the cache hit time.
TLB accesses are usually pipelined because they are so important!
The "Big Picture" and TLBs
[Flowchart: a virtual address first goes to the TLB. On a TLB miss, stall and try to read the translation from the page table, setting it in the TLB; a page fault at this step means replacing a page from disk. On a TLB hit, a read tries the cache: a cache hit delivers data to the CPU, a cache miss stalls. A write instead goes to the cache/buffer memory write path.]
Pages Are Cached in a Virtual Memory System
We can ask the same four questions we did about caches.
Q1: Block placement
The choice: lower miss rates with complex placement, or vice versa. The miss penalty is huge, so choose a low miss rate ==> place a page anywhere in physical memory (similar to the fully associative cache model).
Q2: Block addressing: use an additional data structure
Fixed-size pages: use a page table. Virtual page number ==> physical page number, then concatenate the offset.
A tag bit indicates presence in main memory.
Normal Page Tables
The size is the number of virtual pages. The purpose is to hold the translation of VPN to PPN.
This permits easy page relocation. Make sure to keep tags to indicate whether a page is mapped.
Potential problem: consider a 32-bit virtual address and 4 KB pages. 4 GB / 4 KB = 1 M words required just for the page table! We might even have to page in the page table itself...
Consider how the problem gets worse on 64-bit machines with even larger virtual address spaces!
One answer: multi-level page tables.
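A multi-level table avoids allocating entries for untouched regions of the address space. This sketch is mine, not from the slides; the 10/10/12 bit split is one common choice for 32-bit addresses with 4 KB pages:

```python
# A two-level page table: 10-bit outer index, 10-bit inner index,
# 12-bit offset. Inner tables are allocated only when first touched.

OFFSET_BITS, LEVEL_BITS = 12, 10

class TwoLevelPageTable:
    def __init__(self):
        self.outer = {}                       # outer index -> inner table

    def _split(self, vaddr):
        vpn = vaddr >> OFFSET_BITS
        return vpn >> LEVEL_BITS, vpn & ((1 << LEVEL_BITS) - 1)

    def map(self, vaddr, frame):
        hi, lo = self._split(vaddr)
        self.outer.setdefault(hi, {})[lo] = frame

    def translate(self, vaddr):
        hi, lo = self._split(vaddr)
        frame = self.outer[hi][lo]            # KeyError here = page fault
        return (frame << OFFSET_BITS) | (vaddr & ((1 << OFFSET_BITS) - 1))

pt = TwoLevelPageTable()
pt.map(0x12345678, 0xABC)
print(hex(pt.translate(0x12345678)))          # -> 0xabc678
print(len(pt.outer))                          # only 1 inner table allocated
```

A sparse address space that touches a handful of regions needs only a handful of inner tables, instead of the full 4 MB flat table.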
Inverted Page Tables
Similar to a set-associative mechanism.
Make the page table reflect the number of physical pages (not virtual), and use a hash mechanism:
virtual page number ==> hash ==> index into the inverted page table.
Compare the virtual page number with the tag to make sure it is the one you want.
If it matches: check that the page is in memory; OK if yes, page fault if not.
If it does not match: miss; go to the full page table on disk to get the new entry. This implies 2 disk accesses in the worst case. It trades an increased worst-case penalty for a decrease in the capacity-induced miss rate, since there is now more room for real pages with a smaller page table.
Inverted Page Table
[Figure: the page number is hashed to index the inverted table. Each entry holds a page tag, a valid bit (V), and a frame number; an equality check (=) between the stored page tag and the presented page number confirms the match (OK), and the frame number is combined with the offset to form frame + offset.]
Only entries for pages in physical memory are stored.
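The hash-and-compare scheme above can be sketched as follows. This is my own illustration, not from the slides; the 8-frame size and linear probing on collisions are assumptions (real designs often use chaining):

```python
# An inverted page table: one entry per physical frame, indexed by a
# hash of the VPN, with linear probing on collisions.

NUM_FRAMES = 8

class InvertedPageTable:
    def __init__(self):
        # One slot per physical frame; each holds the VPN mapped there.
        self.slots = [None] * NUM_FRAMES

    def insert(self, vpn):
        start = hash(vpn) % NUM_FRAMES
        for i in range(NUM_FRAMES):           # probe for a free frame
            frame = (start + i) % NUM_FRAMES
            if self.slots[frame] is None:
                self.slots[frame] = vpn
                return frame
        raise MemoryError("no free frames: must evict")

    def translate(self, vpn):
        start = hash(vpn) % NUM_FRAMES
        for i in range(NUM_FRAMES):
            frame = (start + i) % NUM_FRAMES
            if self.slots[frame] == vpn:      # tag comparison
                return frame
        raise LookupError("page fault")       # go to the full table on disk

ipt = InvertedPageTable()
f = ipt.insert(0x12345)
print(ipt.translate(0x12345) == f)            # -> True
```

The table's size is proportional to physical memory, not virtual, which is what makes it attractive for huge address spaces.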
Address Translation Reality
The translation process using page tables takes too long! Use a cache to hold recent translations: the Translation Lookaside Buffer.
Typically 8-1024 entries
Block size same as a page table entry (1 or 2 words)
Only holds translations for pages in memory
1-cycle hit time
Highly or fully associative
Miss rate < 1%
A miss goes to main memory (where the whole page table lives)
Must be purged on a process switch
Back to the 4 Questions
Q3: Block replacement (pages in physical memory)
LRU is best, so use it to minimize the horrible miss penalty.
However, real LRU is expensive. The page table contains a use tag; on access, the use tag is set. The OS checks the tags every so often, records what it sees, and resets them all. On a miss, the OS decides which page has been used the least.
Basic strategy: the miss penalty is so huge that you can afford to spend a few OS cycles to help reduce the miss rate.
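The use-bit scheme above can be sketched as an "aging" approximation of LRU. This is my own illustration, not from the slides; the 8-bit age counter is an assumption:

```python
# Reference-bit approximation to LRU: hardware sets a use bit on each
# access; the OS periodically shifts the sampled bits into a per-page
# aging counter and clears them. Smallest counter = best eviction victim.

class AgingLRU:
    def __init__(self, pages):
        self.use_bit = {p: 0 for p in pages}   # set by the "MMU" on access
        self.age = {p: 0 for p in pages}       # history of sampled use bits

    def access(self, page):
        self.use_bit[page] = 1                 # what the hardware would do

    def os_tick(self):
        """OS timer interrupt: fold each use bit into the age counter."""
        for p in self.age:
            self.age[p] = (self.age[p] >> 1) | (self.use_bit[p] << 7)
            self.use_bit[p] = 0                # reset for the next interval

    def victim(self):
        return min(self.age, key=self.age.get) # least-recently-used estimate

lru = AgingLRU(["A", "B", "C"])
lru.access("A"); lru.access("B"); lru.os_tick()
lru.access("A"); lru.os_tick()
print(lru.victim())                            # -> C
```

Pages referenced in recent intervals accumulate high-order bits, so the page with the smallest counter is the one least recently used, to within the sampling interval.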
Last Question
Q4: Write policy
Always write-back, due to the access time of the disk. So you need to keep tags to show when pages are dirty and need to be written back to disk when they're swapped out.
Anything else is pretty silly. Remember: the disk is SLOW!
Page Sizes
An architectural choice.
Large pages are good:
reduces page table size
amortizes the long disk access
if spatial locality is good, then the hit rate will improve
Large pages are bad:
more internal fragmentation: if everything is random, each structure's last page is only half full, and half of bigger is still bigger. If there are 3 structures per process (text, heap, and control stack), then 1.5 pages are wasted per process.
process start-up takes longer, since at least 1 page of each type is required prior to start, and the transfer-time penalty is higher.
More on TLBs
The TLB must be on chip; otherwise it is worthless.
Small TLBs are worthless anyway, and large TLBs are expensive; high associativity is likely.
==> The price of CPUs is going up! That's OK as long as performance goes up faster.
Selecting a Page Size
Reasons for a larger page size:
The page table size is inversely proportional to the page size; therefore memory is saved.
A fast cache hit time is easy when cache size < page size (virtually addressed caches); a bigger page makes this feasible as cache size grows.
Transferring larger pages to or from secondary storage, possibly over a network, is more efficient.
The number of TLB entries is restricted by clock cycle time, so a larger page size maps more memory, thereby reducing TLB misses.
Reasons for a smaller page size:
Want to avoid internal fragmentation: don't waste storage; data must be contiguous within a page.
Quicker process start for small processes: don't need to bring in more memory than needed.
Memory Protection
With multiprogramming, a computer is shared by several programs or processes running concurrently.
Need to provide protection; need to allow sharing.
Mechanisms for providing protection:
Provide base and bound registers: Base ≤ Address ≤ Bound
Provide both user and supervisor (operating system) modes
Provide CPU state that the user can read but cannot write: base and bound registers, user/supervisor bit, exception bits
Provide a method to go from user to supervisor mode and vice versa:
system call: user to supervisor
system return: supervisor to user
Provide permissions for each page or segment in memory
Pitfall: Address Space Too Small
One of the biggest mistakes that can be made when designing an architecture is to devote too few bits to the address.
The address size limits the size of virtual memory, and it is difficult to change, since many components depend on it (e.g., PC, registers, effective-address calculations).
As program size increases, larger and larger address sizes are needed:
8 bit: Intel 8080 (1975)
16 bit: Intel 8086 (1978)
24 bit: Intel 80286 (1982)
32 bit: Intel 80386 (1985)
64 bit: Intel Merced (1998)
Virtual Memory Summary
Virtual memory (VM) allows main memory (DRAM) to act like a cache for secondary storage (magnetic disk).
The large miss penalty of virtual memory leads to different strategies from caches: fully associative placement, TLB + page table, LRU, write-back.
Designs:
paged: fixed-size blocks
segmented: variable-size blocks
hybrid: segmented paging or multiple page sizes
Avoid a small address size.
Summary 2: Typical Choices

Option                 TLB                   L1 Cache          L2 Cache         VM (page)
Block Size             4-8 bytes (1 PTE)     4-32 bytes        32-256 bytes     4k-16k bytes
Hit Time               1 cycle               1-2 cycles        6-15 cycles      10-100 cycles
Miss Penalty           10-30 cycles          8-66 cycles       30-200 cycles    700k-6M cycles
Local Miss Rate        0.1-2%                0.5-20%           13-15%           0.00001-0.001%
Size                   32B-8KB               1-128 KB          256KB-16MB
Backing Store          L1 Cache              L2 Cache          DRAM             Disks
Q1: Block Placement    Fully or set assoc.   DM                DM or SA         Fully associative
Q2: Block ID           Tag/block             Tag/block         Tag/block        Table
Q3: Block Replacement  Random (not last)     N.A. for DM       Random (if SA)   LRU/LFU
Q4: Writes             Flush on PTE write    Through or back   Write-back       Write-back
The End
Lecture 9