Caching in Operating Systems Design & Systems Programming David E. Culler CS162 – Operating...

Caching in Operating Systems Design & Systems Programming

David E. Culler

CS162 – Operating Systems and Systems Programming

Lecture 19

October 13, 2014

Reading: A&D 9.6-7 HW 4 going outProj 2 out today

cs162 fa14 L19 2

Objectives

• Recall and solidify understanding the concept and mechanics of caching.

• Understand how caching and caching effects pervade OS design.

• Put together all the mechanics around TLBs, Paging, and Memory caches

• Solidify understanding of Virtual Memory

10/13/14

cs162 fa14 L19 3

Review: Memory Hierarchy• Take advantage of the principle of locality to:

– Present as much memory as in the cheapest technology– Provide access at speed offered by the fastest technology

L3 Cache

(shared)

Registers

Core

Core

Secondary Storage

(Disk)

Processor

MainMemory(DRAM)

1 10,000,000 (10 ms)

Speed (ns): 10-30 100

100BsSize (bytes): MBs GBs TBs

Registers

L1 Cache

L1 Cache

L2 Cache

L2 Cache

0.3 3

10kBs 100kBs

Secondary Storage

(SSD)

100,000(0.1 ms)

100GBs

10/13/14

cs162 fa14 L19 4

Examples

• vmstat –s• top• mac-os utility/activity

10/13/14

cs162 fa14 L19 5

Where does caching arise in Operating Systems ?

10/13/14

cs162 fa14 L19 6


• Direct use of caching techniques– paged virtual memory (mem as cache for disk)– TLB (cache of PTEs)– file systems (cache disk blocks in memory)– DNS (cache hostname => IP address translations)– Web proxies (cache recently accessed pages)

• Which pages to keep in memory?

10/13/14

cs162 fa14 L19 7


• Indirect - dealing with cache effects• Process scheduling

– which and how many processes are active ?– large memory footprints versus small ones ?– priorities ?

• Impact of thread scheduling on cache performance– rapid interleaving of threads (small quantum) may degrade cache

performance• increase ave MAT !!!

• Designing operating system data structures for cache performance.

• All of these are much more pronounced with multiprocessors / multicores

10/13/14

cs162 fa14 L19 8

MP $

10/13/14

bus

Memory

$

P

$

P

$

P

* * *

cs162 fa14 L19 9

Working Set Model (Denning ~70)

• As a program executes it transitions through a sequence of “working sets” consisting of varying sized subsets of the address space

10/13/14

Time

Addr

ess

cs162 fa14 L19 10

Cache Behavior under WS model

• Amortized by fraction of time the WS is active• Transitions from one WS to the next• Capacity, Conflict, Compulsory misses• Applicable to memory caches and pages. Others ?

10/13/14

Hit

Rate

Cache Size

new working set fits

0

1

cs162 fa14 L19 11

Another model of Locality: Zipf

• Likelihood of accessing item of rank r is α1/ra

• Although rare to access items below the top few, there are so many that it yields a “heavy tailed” distribution.

• Substantial value from even a tiny cache• Substantial misses from even a very large one

10/13/14

1 4 7 10 13 16 19 22 25 28 31 34 37 40 43 46 490%

2%

4%

6%

8%

10%

12%

14%

16%

18%

20%

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

P access(rank) = 1/rank

pop a=1

Hit Rate(cache)

Rank

Popu

lari

ty (%

acc

esse

s)

Estim

ated

Hit

Rate

cs162 fa14 L19 12


• Maintaining the correctness of various caches

• TLB consistent with PT across context switches ?• Across updates to the PT ?• Shared pages mapped into VAS of multiple

processes ?

10/13/14

cs162 fa14 L19 13

Going into detail on TLB

10/13/14

cs162 fa14 L19 14

What Actually Happens on a TLB Miss?

• Hardware traversed page tables:– On TLB miss, hardware in MMU looks at current page

table to fill TLB (may walk multiple levels)• If PTE valid, hardware fills TLB and processor never knows• If PTE marked as invalid, causes Page Fault, after which kernel

decides what to do afterwards

• Software traversed Page tables (ala MIPS)– On TLB miss, processor receives TLB fault– Kernel traverses page table to find PTE

• If PTE valid, fills TLB and returns from fault• If PTE marked as invalid, internally calls Page Fault handler

• Most chip sets provide hardware traversal– Modern operating systems tend to have more TLB faults

since they use translation for many things

10/13/14

cs162 fa14 L19 15

What happens on a Context Switch?

• Need to do something, since TLBs map virtual addresses to physical addresses– Address Space just changed, so TLB entries no longer valid!

• Options?– Invalidate TLB: simple but might be expensive

• What if switching frequently between processes?

– Include ProcessID in TLB• This is an architectural solution: needs hardware

• What if translation tables change?– For example, to move page from memory to disk or vice versa…– Must invalidate TLB entry!

• Otherwise, might think that page is still in memory!10/13/14

cs162 fa14 L19 16

What TLB organization makes sense?

• Needs to be really fast– Critical path of memory access – Seems to argue for Direct Mapped or Low Associativity

• However, needs to have very few conflicts!– With TLB, the Miss Time extremely high!– This argues that cost of Conflict (Miss Time) is much higher

than slightly increased cost of access (Hit Time)• Thrashing: continuous conflicts between accesses

– What if use low order bits of page as index into TLB?• First page of code, data, stack may map to same entry• Need 3-way associativity at least?

– What if use high order bits as index?• TLB mostly unused for small programs

CPU TLB Cache Memory

10/13/14

cs162 fa14 L19 17

TLB organization: include protection

• How big does TLB actually have to be?–Usually small: 128-512 entries–Not very big, can support higher associativity

• TLB usually organized as fully-associative cache–Lookup is by Virtual Address–Returns Physical Address + other info

• What happens when fully-associative is too slow?–Put a small (4-16 entry) direct-mapped cache in front–Called a “TLB Slice”

• When does TLB lookup occur relative to memory cache access?

–Before memory cache lookup?–In parallel with memory cache lookup?

10/13/14

cs162 fa14 L19 18

• As described, TLB lookup is in serial with cache lookup:

• Machines with TLBs go one step further: they overlap TLB lookup with cache access.– Works because offset available early

Reducing translation time further

Virtual Address

TLB Lookup

V AccessRights PA

V page no. offset10

P page no. offset10

Physical Address

10/13/14

cs162 fa14 L19 19

Overlapping TLB & Cache Access (1/2)

• Main idea: – Offset in virtual address exactly covers the “cache

index” and “byte select”– Thus can select the cached byte(s) in parallel to

perform address translation

OffsetVirtual Page #

indextag / page # byte

virtual address

physical address

10/13/14

cs162 fa14 L19 20

• Here is how this might work with a 4K cache:

• What if cache size is increased to 8KB?– Overlap not complete– Need to do something else. See CS152/252

TLB 4K Cache

10 200

4 bytes

index 1 K

page # disp20

assoclookup

32

Hit/Miss

PA Data Hit/Miss

=PA

Overlapping TLB & Cache Access (1/2)

10/13/14

cs162 fa14 L19 21

Putting Everything Together: Address Translation

Physical Address:

OffsetPhysicalPage #

Virtual Address:

OffsetVirtualP2 index

VirtualP1 index

PageTablePtr

Page Table (1st level)

Page Table (2nd level)

Physical Memory:

10/13/14

cs162 fa14 L19 22


PageTablePtr


Putting Everything Together: TLB

OffsetPhysicalPage #

Virtual Address:


VirtualP1 index

Physical Memory:

Physical Address:

…

TLB:

10/13/14

cs162 fa14 L19 23


PageTablePtr


Virtual Address:


VirtualP1 index

…

TLB:

Putting Everything Together: Cache

Offset

Physical Memory:

Physical Address:PhysicalPage #

…

tag: block:cache:

index bytetag

10/13/14

cs162 fa14 L19 24

Admin: Projects

• Project 1– deep understanding of OS structure, threads, thread

implementation, synchronization, scheduling, and interactions of scheduling and synchronization

– work effectively in a team• effective teams work together with a plan=> schedule three 1-hour joint work times per week

• Project 2– exe load and VAS creation provided for you– syscall processing, FORK+EXEC, file descriptors backing user

file handles, ARGV• registers & stack frames

– two development threads for team• but still need to work together

10/13/14

cs162 fa14 L19 25

Virtual Memory – the disk level

10/13/14

cs162 fa14 L19 26

Reacall: the most basic OS function

10/13/14

cs162 fa14 L19 27

Loading an executable into memory

• .exe– lives on disk in the file system– contains contents of code & data segments, relocation entries and symbols– OS loads it into memory, initializes registers (and initial stack pointer)– program sets up stack and heap upon initialization: CRT0

10/13/14

disk (huge) memory

code

data

info

exe

cs162 fa14 L19 28

Create Virtual Address Space of the Process

• Utilized pages in the VAS are backed by a page block on disk– called the backing store– typically in an optimized block store, but can think of it like a file

10/13/14

disk (huge) memory

code

data

heap

stack

kernel

process VAS

sbrk

kernel code & data

user pageframes

user pagetable

cs162 fa14 L19 29


• User Page table maps entire VAS• All the utilized regions are backed on disk

– swapped into and out of memory as needed• For every process

10/13/14

disk (huge, TB) memory

code

data

heap

stack

kernel

process VAS (GBs)

kernel code & data

user pageframes

user pagetable

code

data

heap

stack

cs162 fa14 L19 30


• User Page table maps entire VAS– resident pages to the frame in memory they occupy– the portion of it that the HW needs to access must be

resident in memory10/13/14


code

data

heap

stack

kernel

VAS – per process

kernel code & data

user pageframes

user pagetable

code

data

heap

stack

PT

cs162 fa14 L19 31

Provide Backing Store for VAS

• User Page table maps entire VAS• Resident pages mapped to memory frames• For all other pages, OS must record where to find them on

disk10/13/14


code

data

heap

stack

kernel

kernel code & data

user pageframes

user pagetable

code

data

heap

stack

VAS – per process

cs162 fa14 L19 32

What data structure is required to map non-resident pages to disk?

• FindBlock(PID, page#) => disk_block

• Like the PT, but purely software• Where to store it?• Usually want backing store for resident pages

too.• Could use hash table (like Inverted PT)

10/13/14

cs162 fa14 L19 33

Provide Backing Store for VAS

10/13/14

disk (huge, TB)memory

kernel code & data

user pageframes

user pagetable

code

data

heap

stack

code

data

heap

stack

kernel

VAS 1 PT 1

code

data

heap

stack

kernel

VAS 2 PT 2heap

stack

data

cs162 fa14 L19 34

On page Fault …

10/13/14


kernel code & data

user pageframes

user pagetable

code

data

heap

stack

code

data

heap

stack

kernel

VAS 1 PT 1

code

data

heap

stack

kernel

VAS 2 PT 2heap

stack

data

active process & PT

cs162 fa14 L19 35

On page Fault … find & start load

10/13/14


kernel code & data

user pageframes

user pagetable

code

data

heap

stack

code

data

heap

stack

kernel

VAS 1 PT 1

code

data

heap

stack

kernel

VAS 2 PT 2heap

stack

data

active process & PT

cs162 fa14 L19 36

On page Fault … schedule other P or T

10/13/14


kernel code & data

user pageframes

user pagetable

code

data

heap

stack

code

data

heap

stack

kernel

VAS 1 PT 1

code

data

heap

stack

kernel

VAS 2 PT 2heap

stack

data

active process & PT

cs162 fa14 L19 37

On page Fault … update PTE

10/13/14


kernel code & data

user pageframes

user pagetable

code

data

heap

stack

code

data

heap

stack

kernel

VAS 1 PT 1

code

data

heap

stack

kernel

VAS 2 PT 2heap

stack

data

active process & PT

cs162 fa14 L19 38

Eventually reschedule faulting thread

10/13/14


kernel code & data

user pageframes

user pagetable

code

data

heap

stack

code

data

heap

stack

kernel

VAS 1 PT 1

code

data

heap

stack

kernel

VAS 2 PT 2heap

stack

data

active process & PT

cs162 fa14 L19 39

Where does the OS get the frame?

• Keeps a free list• Unix runs a “reaper” if memory gets too full• As a last resort, evict a dirty page first

10/13/14

cs162 fa14 L19 40

How many frames per process?

• Like thread scheduling, need to “schedule” memory resources– allocation of frames per process

• utilization? fairness? priority?

– allocation of disk paging bandwith

10/13/14

cs162 fa14 L19 41

Historical Perspective

• Mainframes and minicomputers (servers) were “always paging”– memory was limited– processor rates <> disk xfer rates were much closer

• When overloaded would THRASH– with good OS design still made progress

• Modern systems hardly every page– primarily a safety net + lots of untouched “stuff”– plus all the other advantages of managing a VAS

10/13/14

cs162 fa14 L19 42

Summary• Virtual address space for protection, efficient use of

memory, AND multi-programming.– hardware checks & translates when present– OS handles EVERYTHING ELSE

• Conceptually memory is just a cache for blocks of VAS that live on disk– but can never access the disk directly

• Address translation provides the basis for sharing– shared blocks of disk AND shared pages in memory

• How else can we use this mechanism?– sharing ???– disks transfers on demand ???– accessing objects in blocks using load/store instructions

10/13/14

Caching in Operating Systems Design & Systems Programming David E. Culler CS162 – Operating...

Documents

Transcript of Caching in Operating Systems Design & Systems Programming David E. Culler CS162 – Operating...