Chapter 3 Memory Management: Page Management Li Wensheng ...
Outline
  Data Structure
  Page Scanner Operation
  Page-out Algorithm
  Hardware Address Translation Layer
Pages: The Basic Unit of Solaris Memory
  Physical memory is divided into pages.
  A page’s identity is its vnode/offset pair.
  The hardware address translation (HAT) and address space layers manage the mapping between a physical page and its virtual address space.
The Page Structure
The Page Hash List
  The global hash list is an array of pointers to linked lists of pages.
  The VM system hashes pages that have an identity onto the global hash list so that they can be located by vnode/offset.
  Three page functions search the global page hash list:
    page_find()
    page_lookup()
    page_lookup_nowait()
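The hash list described above can be pictured with a small model. This is only a sketch: the bucket count, the hash function, and the names `Page`, `hash_insert`, and `hash_find` are assumptions made for illustration and are not the kernel's actual structures or the real page_find() family.

```python
# Hypothetical model of a global page hash list keyed by (vnode, offset).
PAGE_HASH_SIZE = 8   # assumed bucket count, chosen only for illustration
PAGE_SIZE = 8192     # assumed page size


class Page:
    def __init__(self, vnode, offset):
        self.vnode = vnode    # identity: the object backing the page
        self.offset = offset  # identity: byte offset within that vnode
        self.next = None      # next page on the same hash chain


# The "array of pointers to linked lists of pages".
page_hash = [None] * PAGE_HASH_SIZE


def page_hash_index(vnode, offset):
    # Assumed hash: mix the vnode identity with the page number.
    return (hash(vnode) + offset // PAGE_SIZE) % PAGE_HASH_SIZE


def hash_insert(page):
    # Push the page onto the front of its hash chain.
    i = page_hash_index(page.vnode, page.offset)
    page.next = page_hash[i]
    page_hash[i] = page


def hash_find(vnode, offset):
    # Walk one chain looking for an exact vnode/offset match.
    p = page_hash[page_hash_index(vnode, offset)]
    while p is not None:
        if p.vnode == vnode and p.offset == offset:
            return p
        p = p.next
    return None


libc = "vnode:libc.so"  # stand-in for a vnode pointer
hash_insert(Page(libc, 0))
hash_insert(Page(libc, 8192))
found = hash_find(libc, 8192)
missing = hash_find(libc, 16384)
```

A lookup only walks the one chain selected by the hash, which is why pages can be located by identity without scanning all of physical memory.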
Locating Pages by Vnode/Offset Identity
MMU-Specific Page Structures
  The kernel needs to keep machine-specific data about every page, e.g. the HAT information that describes how the page is mapped by the MMU.
  This data is kept in struct machpage.
  The contents of the machine-specific page structure are hidden from the generic kernel; only the machine-specific HAT layer can see or manipulate them.
Machine-Specific Page Structures: sun4u Example
Physical Page Lists
  The kernel keeps a segmented global physical page list, consisting of segments of contiguous physical memory.
  Contiguous physical memory segments are added during system boot.
  Segments can also be added and deleted dynamically when physical memory is added and removed while the system is running.
Arrangement of the Physical Page Lists
Free List and Cache List
  Both lists hold pages that are not mapped into any address space and that have been freed by page_free().
  Free list
    Pages do not have a vnode/offset associated with them.
    Pages are put on the free list when a process exits.
    The list is generally very small.
  Cache list
    Pages still have a vnode/offset.
    Holds pages from seg_map free-behind and from seg_vn executables and libraries (for reuse).
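The split between the two lists can be sketched as follows. The dictionaries and the helper names `free_page` and `reclaim` are hypothetical stand-ins, not the kernel's page_free() implementation; the only rule taken from the slides is that pages with a vnode/offset go to the cache list and all others go to the free list.

```python
free_list = []   # pages with no identity
cache_list = []  # pages that still carry a vnode/offset


def free_page(page):
    # Route a freed page by whether it still has an identity.
    if page.get("vnode") is not None:
        cache_list.append(page)
    else:
        free_list.append(page)


def reclaim(vnode, offset):
    # Try to reuse a cached page by identity; return None on a miss.
    for page in cache_list:
        if page["vnode"] == vnode and page["offset"] == offset:
            cache_list.remove(page)
            return page
    return None


free_page({"vnode": None, "offset": None})             # page with no identity
free_page({"vnode": "/usr/lib/libc.so", "offset": 0})  # page of a library
hit = reclaim("/usr/lib/libc.so", 0)
miss = reclaim("/usr/lib/libc.so", 8192)
```

Keeping the identity on cached pages is what allows a later request for the same vnode/offset to be satisfied without going back to the backing store.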
The Page-Level Interfaces
  Method | Description
  page_create() | Creates pages. Page coloring is based on a hash of the vnode offset. page_create() is provided for backward compatibility only. Don’t use it if you don’t have to. Instead, use the page_create_va() function so that pages are correctly colored.
  page_create_va() | Creates pages, taking into account the virtual address they will be mapped to. The address is used to calculate page coloring.
  page_exists() | Tests that a page for vnode/offset exists.
  page_find() | Searches the hash list for a page with the specified vnode and offset that is known to exist and is already locked.
  page_first() | Finds the first page on the global page hash list.
  page_free() | Frees a page. Pages with vnode/offset go onto the cache list; other pages go onto the free list.
  page_isfree() | Checks whether a page is on the free list.
  page_ismod() | Checks whether a page is modified. This function checks only the software bit in the page structure. To sync the MMU bits with the page structure, you may need to call hat_pagesync() before calling page_ismod().
The Page-Level Interfaces (Cont.)
  Method | Description
  page_isref() | Checks whether a page has been referenced; checks only the software bit in the page structure. To sync the MMU bits with the page structure, you may need to call hat_pagesync() before calling page_isref().
  page_isshared() | Checks whether a page is shared across more than one address space.
  page_lookup() | Finds a page representing the specified vnode/offset. If the page is found on a free list, then it will be removed from the free list.
  page_lookup_nowait() | Finds a page representing the specified vnode/offset that is not locked or on the free list.
  page_needfree() | Informs the VM system that we need some pages freed up. Calls to page_needfree() must be symmetric; that is, they must be followed by another page_needfree() with the same amount of memory multiplied by -1 after the task is complete.
  page_next() | Finds the next page on the global page hash list.
The Page Throttle
  Implemented in the page_create() and page_create_va() functions.
  Causes page creates to block when the PG_WAIT flag is specified, that is, when available memory is less than the system global throttlefree.
  throttlefree is set to the same value as minfree.
  Memory allocated through the kernel memory allocator specifies PG_WAIT and is subject to the page-create throttle.
Page Sizes
  System | System Type | MMU Page Size Capability | Solaris 2.x Page Size
  Early SPARC systems | sun4c | 4K | 4K
  microSPARC-I, -II | sun4m | 4K | 4K
  SuperSPARC-I, -II | sun4m | 4K, 4M | 4K, 4M
  UltraSPARC-I, -II | sun4u | 8K, 64K, 512K, 4M | 8K, 4M
  Intel x86 architecture | i86pc | 4K, 4M | 4K, 4M
Page Coloring
  Page placement policy affects processor performance.
  The optimal placement of pages often depends on the memory access patterns of the application:
    in a random order
    in some sort of strided order
  How can page placement affect performance? Consider the UltraSPARC-I & -II implementations:
    The L1 cache is 16 Kbytes.
    The L2 (external) cache can vary between 512 Kbytes and 8 Mbytes.
    The L2 cache is arranged in lines of 64 bytes, and transfers are done to and from physical memory in 64-byte units.
Page Coloring (Cont.)
  Assume:
    we have a 32-Kbyte L2 cache
    a page size of 8 Kbytes
    four page-sized slots on the L2 cache
  The cache does not necessarily read and write 8-Kbyte units from memory; it does that in 64-byte chunks, so the 32-Kbyte cache has 512 addressable slots (32768 / 64).
Page Coloring (Cont.)
  Offsets 0 and 32768 map to the same cache line. If we were now to access these two addresses, a cache ping-pong effect occurs.
  We program to virtual memory rather than physical memory. The OS must provide a sensible mapping between virtual memory and physical memory.
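The collision in this example can be checked with a few lines of arithmetic. The sketch assumes a direct-mapped, physically indexed cache with the sizes used in the example (32-Kbyte cache, 64-byte lines, 8-Kbyte pages), which gives 32768 / 64 = 512 lines and 4 page-sized slots; the function names are made up for illustration.

```python
CACHE_SIZE = 32 * 1024  # 32-Kbyte cache from the example
LINE_SIZE = 64          # bytes per cache line
PAGE_SIZE = 8 * 1024    # 8-Kbyte pages

NUM_LINES = CACHE_SIZE // LINE_SIZE   # addressable 64-byte slots
NUM_COLORS = CACHE_SIZE // PAGE_SIZE  # page-sized slots ("colors")


def cache_line(paddr):
    # Direct-mapped: the line index is the line number modulo the line count.
    return (paddr // LINE_SIZE) % NUM_LINES


def page_color(paddr):
    # The page-sized cache slot that a physical page falls into.
    return (paddr // PAGE_SIZE) % NUM_COLORS


collide = cache_line(0) == cache_line(32768)
```

Any two physical addresses that differ by a multiple of the cache size land on the same line, so alternating accesses to them keep evicting each other.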
Page Coloring (Cont.)
  Physical pages are assigned to an address space in the order they appear in the free list.
  Page coloring algorithm:
    The free list of physical pages is organized into specifically colored bins, one color bin for each slot in the physical cache.
    When a page is put on the free list, the page_free() algorithms assign it to a color bin.
    When a page is consumed from the free list (by the page_create_va() function), the virtual-to-physical algorithm takes the page from a physical color bin.
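A colored free list can be sketched as one bin per cache slot. Everything here is an illustrative assumption: four colors, 8-Kbyte pages, a policy that matches the physical color to the color of the virtual address, and a fallback to the next bin when the preferred one is empty. The class and method names are hypothetical.

```python
PAGE_SIZE = 8192
NUM_COLORS = 4  # assumed: one bin per page-sized cache slot


class ColoredFreeList:
    def __init__(self, num_colors=NUM_COLORS):
        self.bins = [[] for _ in range(num_colors)]

    def color_of(self, addr):
        # Color is the page number modulo the number of bins.
        return (addr // PAGE_SIZE) % len(self.bins)

    def free(self, paddr):
        # Freeing a page files it under its physical color.
        self.bins[self.color_of(paddr)].append(paddr)

    def alloc_for(self, vaddr):
        # Prefer the bin whose color matches the virtual address,
        # then try the remaining bins in order.
        want = self.color_of(vaddr)
        n = len(self.bins)
        for step in range(n):
            candidates = self.bins[(want + step) % n]
            if candidates:
                return candidates.pop()
        return None  # no free pages at all


fl = ColoredFreeList()
for pfn in range(8):
    fl.free(pfn * PAGE_SIZE)

a = fl.alloc_for(0x10000)  # virtual page 8, color 0
b = fl.alloc_for(0x12000)  # virtual page 9, color 1
```

With this policy, consecutive virtual pages receive physical pages of consecutive colors, so they do not compete for the same cache slot.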
Page Coloring (Cont.)
  The kernel supports a default algorithm and two optional algorithms.
  The default algorithm was chosen according to the following criteria:
    Fairly consistent, repeatable results
    Good overall performance for the majority of applications
    Acceptable performance across a wide range of applications
Solaris Page Coloring Algorithms
  No. | Name | Description | Solaris 2.5.1 | Solaris 2.6 | Solaris 7
  0 | Hashed VA | The physical page color bin is chosen on a hashed algorithm to ensure even distribution of virtual addresses across the cache. | Default | Default | Default
  1 | P.Addr = V.Addr | The physical page color is chosen so that physical addresses map directly to the virtual addresses (as in the example). | Yes | Yes | Yes
  2 | Bin Hopping | Physical pages are allocated with a round-robin method. | Yes | Yes | Yes
  6 | Kessler’s Best Bin | Kessler best bin algorithm. Keeps a per-process history of used colors and chooses the least used color; if multiple, uses the largest bin. | E10000 only (default) | E10000 only (default) | Not available
Page Scanner
  The page scanner is the memory management daemon that manages systemwide physical memory.
  When there is a memory shortage, the page scanner runs to steal memory from address spaces by:
    taking pages that haven’t been used recently
    syncing them up with their backing store
    freeing them
  If paged-out virtual memory is required again, a memory page fault occurs.
Page Scanner (Cont.)
  The balancing of page stealing and page faults determines which parts of virtual memory will be backed by physical memory and which will be moved out to swap.
    global page replacement / local page replacement
  The subtleties of which pages are stolen govern the memory allocation policies and can affect different workloads in different ways.
    Enhancements to minimize page stealing from extensively shared libraries and executables
    Priority paging to prevent application, shared library, and executable paging on systems with ample memory
Page Scanner Operation
  Tracks page usage by reading per-page hardware bits from the MMU for each page.
    Two bits for each page: reference bit and modify bit
  Is awakened when the amount of memory on the free-page list falls below a system threshold,
    typically 1/64th of total physical memory.
  Scans through pages in physical page order, looking for pages that haven’t been used recently to page out to the swap device and free.
Two-handed Clock Algorithm
  The front hand clears the referenced and modified bits for each page.
  The back hand inspects the referenced and modified bits some time later.
  Pages that haven’t been referenced or modified are swapped out and freed.
  The scan rate is controlled by the amount of free memory on the system.
  The gap between the front and back hand is fixed by a boot-time parameter, handspreadpages.
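The interaction of the two hands can be simulated in a few lines. This is a simplified sketch, not the kernel's scanner: it makes a single linear pass instead of cycling, it treats a workload as a hypothetical map from step number to the pages touched at that step, and the names `Page` and `clock_scan` are invented for the example.

```python
class Page:
    def __init__(self, pfn):
        self.pfn = pfn
        self.referenced = False
        self.modified = False


def clock_scan(pages, handspread, accesses):
    # Run the front hand over all pages with the back hand `handspread`
    # pages behind it. `accesses` maps a step number to the page numbers
    # touched at that step. Returns the pages the back hand found unused.
    n = len(pages)
    stolen = []
    for step in range(n + handspread):
        if step < n:
            # Front hand: clear the bits.
            pages[step].referenced = False
            pages[step].modified = False
        for pfn in accesses.get(step, ()):
            # The workload runs between hand movements.
            pages[pfn].referenced = True
        back = step - handspread
        if back >= 0:
            # Back hand: inspect the same page some time later.
            p = pages[back]
            if not p.referenced and not p.modified:
                stolen.append(p.pfn)
    return stolen


pages = [Page(i) for i in range(6)]
# Page 0 is touched at step 1 and page 2 at step 3, both before the
# back hand reaches them.
stolen = clock_scan(pages, handspread=2, accesses={1: [0], 3: [2]})
```

Pages 0 and 2 survive because they were used inside the window between the two hands; a larger hand spread gives every page more time to prove it is in use.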
Introduction to the Page-out Algorithm
  Steals pages when free memory is lower than lotsfree
  Scanner runs:
    Starts scanning at slowscan (pages/sec)
    Four times/second when memory is short
    Awoken by the page allocator if memory is very low
  Puts memory out to “backing store”
  Uses a least recently used (LRU) process
  A kernel thread does the scanning
Page Scanner Parameters
  Parameter | Description | Min | Default
  lotsfree | starts stealing anonymous memory pages | 512K | 1/64th of memory
  desfree | scanner is started at 100 times/second | minfree | ½ of lotsfree
  minfree | start scanning every time a new page is created | - | ½ of desfree
  throttlefree | page_create routine makes the caller wait until free pages are available | - | minfree
  fastscan | scan rate (pages per second) when free memory = minfree | slowscan | minimum of 64 MB/s or ½ memory size
  slowscan | scan rate (pages per second) when free memory = lotsfree | - | 100
  maxpgio | max number of pages per second that the swap device can handle | ~60 | 60 or 90 pages per spindle
  handspreadpages | number of pages between the front hand (clearing) and back hand (checking) | 1 | fastscan
  min_percent_cpu | CPU usage when free memory is at lotsfree | - | 4% (~1 clock tick) of a single CPU
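The defaults in the table chain together, which a short calculation makes concrete. This sketch assumes 8-Kbyte pages and reads the table literally (lotsfree has a 512-Kbyte floor, fastscan is the smaller of half of memory and 64 Mbytes per second); the function name and the dictionary layout are hypothetical.

```python
PAGE_SIZE = 8192  # assumed 8-Kbyte pages


def default_scanner_params(physmem_bytes, page_size=PAGE_SIZE):
    total_pages = physmem_bytes // page_size
    # 1/64th of memory, but not less than 512 Kbytes.
    lotsfree = max(total_pages // 64, (512 * 1024) // page_size)
    desfree = lotsfree // 2
    minfree = desfree // 2
    # Half of memory per second, capped at 64 Mbytes per second.
    fastscan = min(total_pages // 2, (64 * 1024 * 1024) // page_size)
    return {
        "lotsfree": lotsfree,         # pages
        "desfree": desfree,           # pages
        "minfree": minfree,           # pages
        "throttlefree": minfree,      # pages
        "slowscan": 100,              # pages per second
        "fastscan": fastscan,         # pages per second
        "handspreadpages": fastscan,  # pages
    }


params_1g = default_scanner_params(1024 * 1024 * 1024)
```

For a 1-Gbyte system this gives lotsfree of 2048 pages (16 Mbytes) and a fastscan of 8192 pages per second, because the 64-Mbyte cap is lower than half of memory.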
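Scan Rate Parameters (Assuming No Priority Paging)
  [Figure: scan rate versus amount of free memory. Scanning starts at slowscan (default 100) when free memory falls to lotsfree (1/64 of memory) and gets faster as the amount of free memory approaches 0, up to fastscan (default ½ physical memory).]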
Scan Rate Parameters Calculation
  lotsfree is calculated at startup as 1/64th of memory.
  The slowscan parameter is 100 by default on Solaris systems.
  fastscan is set to total physical memory/2, limited to 64 Mbytes per second.
  If total physical memory is 1 Gbyte (8-Kbyte pages), then
    lotsfree = 2048 pages
    fastscan = 8192 pages/sec
  If free memory falls to 12 Mbytes (1536 pages), the scan rate lies between slowscan and fastscan.
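The 12-Mbyte case can be worked through numerically. The sketch assumes the scan rate is interpolated linearly between slowscan (when free memory equals lotsfree) and fastscan (when free memory reaches zero); that assumption and the function name `scan_rate` are illustrative rather than taken from the slides.

```python
def scan_rate(freemem, lotsfree, slowscan, fastscan):
    # Pages per second to scan for a given amount of free memory (in pages).
    if freemem > lotsfree:
        return 0  # no shortage: the scanner is not running
    freemem = max(freemem, 0)
    # Weighted average of the two rates; the weight shifts toward fastscan
    # as free memory shrinks.
    return (slowscan * freemem + fastscan * (lotsfree - freemem)) // lotsfree


rate_at_12mb = scan_rate(1536, lotsfree=2048, slowscan=100, fastscan=8192)
rate_at_lotsfree = scan_rate(2048, lotsfree=2048, slowscan=100, fastscan=8192)
rate_at_zero = scan_rate(0, lotsfree=2048, slowscan=100, fastscan=8192)
```

Under this assumption, 1536 free pages is three quarters of lotsfree, so the rate is 75 pages per second from the slowscan term plus 2048 from the fastscan term, or 2123 pages per second.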
Not Recently Used Time
  The time between the front hand and the back hand passing over a page
    Short time: only the most active pages remain intact
    Long time: only the largely unused pages are stolen
  Varies from just a few seconds to several hours, according to:
    the number of pages between the front and back hand
    the scan rate
  Example
    Scan rate: 2000 pages/sec
    Hand spread: 8192 pages
    Clear/check time: about 4 seconds (8192 / 2000)
Shared Library Optimizations
  Prevents the scanner from stealing pages from extensively shared libraries.
  Looks at the share reference count for each page; if the page is shared more than a certain amount, then it is skipped during the page scan operation.
  Threshold parameter: po_share
    Range 8 to 134217728; by default, starts at 8
    A page shared by more than po_share processes will be skipped
    Each time around, it is decremented ?
The Priority Paging Algorithm
  Purpose: overcome adverse behavior that results from the memory pressure caused by the file system.
  Puts a higher priority on a process’s pages: its heap, stack, shared libraries, and executables.
  Permits the scanner to pick only file system cache pages when ample memory is available, and to steal application pages only when there is a true memory shortage.
The Priority Paging Algorithm (Cont.)
  Introduces a new paging parameter, cachefree.
  When the amount of free memory lies between cachefree and lotsfree, the page scanner steals only file system cache pages.
  The scanner wakes up when memory falls below cachefree rather than below lotsfree.
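The effect of cachefree on what the scanner may steal can be written as a small decision function. The page labels "filecache" and "process" and the function name `stealable` are hypothetical; only the two thresholds and their ordering come from the slides.

```python
def stealable(page_kind, freemem, lotsfree, cachefree):
    # Decide whether the scanner may steal a page of the given kind.
    # page_kind is "filecache" or "process" (heap, stack, libraries,
    # executables).
    if freemem >= cachefree:
        return False  # no shortage: the scanner is idle
    if freemem >= lotsfree:
        # Between cachefree and lotsfree: file system cache pages only.
        return page_kind == "filecache"
    return True       # true memory shortage: anything may be stolen


LOTSFREE, CACHEFREE = 2048, 4096  # example values in pages
mild_file = stealable("filecache", 3000, LOTSFREE, CACHEFREE)
mild_proc = stealable("process", 3000, LOTSFREE, CACHEFREE)
severe_proc = stealable("process", 1000, LOTSFREE, CACHEFREE)
```

The band between the two thresholds acts as a buffer: file system I/O can push free memory down into it without causing application pages to be paged out.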
Scan Rate Interpolation with the Priority Paging Algorithm
  [Figure: scan rate versus free memory with priority paging; between cachefree and lotsfree the scanner pages only the file system cache.]
Page Scanner CPU Utilization Clamp
  Purpose: to prevent the page-out daemon from using too much processor time
  Two parameters:
    min_percent_cpu, default 4% of a single CPU
    max_percent_cpu, default 80% of a single CPU
  The CPU time that can be used ranges from min_percent_cpu to max_percent_cpu:
    min_percent_cpu when free memory is at lotsfree (cachefree with priority paging enabled)
    max_percent_cpu if free memory were to fall to zero
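The slide gives only the two endpoints of the clamp. A minimal sketch, assuming the allowance grows linearly between them, looks like this; the linear shape, the function name, and the `wake_threshold` parameter (lotsfree, or cachefree with priority paging) are assumptions made for illustration.

```python
def scanner_cpu_percent(freemem, wake_threshold,
                        min_percent_cpu=4, max_percent_cpu=80):
    # CPU share (percent of one CPU) the scanner may use at a given
    # amount of free memory, interpolated between the two endpoints.
    freemem = min(max(freemem, 0), wake_threshold)
    span = max_percent_cpu - min_percent_cpu
    shortfall = (wake_threshold - freemem) / wake_threshold
    return min_percent_cpu + span * shortfall


at_threshold = scanner_cpu_percent(2048, 2048)
at_half = scanner_cpu_percent(1024, 2048)
at_zero = scanner_cpu_percent(0, 2048)
```

The point of the clamp is that even a severe shortage cannot let the scanner consume an entire CPU.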
Parameters That Limit Pages Paged Out
  maxpgio
    limits the rate at which I/O is queued to the swap devices
    defaults to 40 or 60 I/Os per second
    is often set to 100 times the number of swap spindles
    can also indirectly affect file system throughput
Page Scanner Implementation
  Implemented as two kernel threads:
    Page scanner thread: scans pages
    Page-out thread: pushes the dirty pages queued for I/O
Page Scanner Architecture
Waking Up the Scanner: schedpaging()
  Called four times per second by a callout
  Triggered by the clock() thread if memory falls below minfree
  Triggered by the page allocator if memory falls below throttlefree
  Calculates two setup parameters for the page scanner thread:
    the number of pages to scan
    the number of CPU ticks that the scanner thread can consume
  Triggers the scanner through a condition variable
Page Scanner Thread
  Cycles through the physical page list.
  The front and back hand each have a page pointer.
  The front hand is incremented first to clear the referenced and modified bits of the page it points to.
  The back hand is then incremented to check the status of the page it points to (using the check_page() function):
    If modified, the page is placed in the dirty page queue.
    If not referenced, the page is freed.
Page-out Thread
  Uses a preinitialized list of async buffer headers as the queue for I/O requests.
  The number of entries is controlled by the parameter async_request_size, initialized to 256.
  Requests to queue more I/Os will be blocked:
    if the entire queue is full
    if the rate of pages queued has exceeded maxpgio
  Removes I/O entries from the queue.
  Initiates I/O by calling the vnode putpage().
The Memory Scheduler
  Swaps out entire processes to conserve memory by:
    removing all of a process’s thread structures and private pages
    setting flags in the process table to indicate that this process has been swapped out
  Not expensive, but affects the process’s performance
  Launched at boot time
  Does nothing unless memory is less than desfree
  Looks for processes that it can completely swap out
    soft-swap out / hard-swap out
Soft Swapping
  Takes place when the 30-second average for free memory is below desfree.
  The memory scheduler looks for processes that have been inactive for at least maxslp seconds.
  If one is found, the memory scheduler:
    swaps out the thread structures for each thread
    pages out all of the private pages of memory for that process
Hard Swapping
  Takes place when all of the following are true:
    At least two processes are on the run queue, waiting for CPU.
    The average free memory over 30 seconds is consistently less than desfree.
    Excessive paging is going on (determined to be true if page-out + page-in > maxpgio).
  Uses a much more aggressive approach to find memory:
    First, the kernel is requested to unload all modules and cache memory that are not currently active.
    Then, processes are sequentially swapped out until the desired amount of free memory is returned.
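The soft and hard swapping conditions can be combined into one decision function. This is a sketch of the rules as the slides state them, with hypothetical names (`swap_mode`, `soft_swap_candidates`) and a made-up process record; it is not the memory scheduler's actual code.

```python
def swap_mode(avg_free_30s, desfree, runq_len, pageins, pageouts, maxpgio):
    # Return "none", "soft", or "hard" for the current memory state.
    if avg_free_30s >= desfree:
        return "none"  # the scheduler does nothing above desfree
    if runq_len >= 2 and (pageins + pageouts) > maxpgio:
        return "hard"  # runnable processes are waiting and paging is excessive
    return "soft"


def soft_swap_candidates(procs, maxslp=20):
    # Processes that have been inactive for at least maxslp seconds.
    return [p["pid"] for p in procs if p["slept_secs"] >= maxslp]


procs = [
    {"pid": 101, "slept_secs": 45},
    {"pid": 102, "slept_secs": 3},
    {"pid": 103, "slept_secs": 20},
]
mode_quiet = swap_mode(900, desfree=1024, runq_len=1,
                       pageins=10, pageouts=10, maxpgio=60)
mode_busy = swap_mode(900, desfree=1024, runq_len=3,
                      pageins=50, pageouts=40, maxpgio=60)
victims = soft_swap_candidates(procs)
```

Soft swapping only removes processes that were idle anyway, which is why hard swapping is reserved for the case where runnable work is already suffering.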
Memory Scheduler Parameters
  Parameter | Effect on Memory Scheduler
  desfree | If the average amount of free memory falls below desfree for 30 seconds, then the memory scheduler is invoked.
  maxslp | When soft-swapping, the memory scheduler starts swapping processes that have slept for at least maxslp seconds. The default for maxslp is 20 seconds and is tunable.
  maxpgio | When the run queue is greater than 2, free memory is below desfree, and the paging rate is greater than maxpgio, then hard swapping occurs, unloading kernel modules and process memory.
Introduction to HAT
  The Hardware Address Translation (HAT) layer:
    controls the hardware that manages mapping of virtual to physical memory
    provides interfaces that implement the creation and destruction of mappings between virtual and physical memory
    provides a set of interfaces to probe and control the MMU
    implements all of the low-level trap handlers to manage page faults and memory exceptions
Solaris Virtual Memory Layers
Solaris Memory Model HAT layer
Address Space
  Process Address Space
    Process text and data
    Stack (anon memory) and libraries
    Heap (anon memory)
  Kernel Address Space
    Kernel text and data
    Kernel map space (data structures, caches)
    32-bit kernel map (64-bit kernels only)
    Trap table
    Critical virtual memory data structures
    Mapping of the file system cache (segmap)
The Address Space
Role of the HAT Layer in Virtual-to-Physical Translation
  Hides the platform-specific implementation.
  Is used by the segment drivers to implement the segment driver’s view of virtual-to-physical translation.
  Uses the hat structure to hold top-level translation information:
    The hat structure is platform specific.
    The hat structure is referenced by the address space structure.
  HAT-specific data structures exist in every page to represent the translation information at a page level.
  The HAT layer is called when the segment drivers want to manipulate the hardware MMU.
Summary of HAT Functions
  Function | Description
  hat_chgattr() | Changes the protections for the supplied virtual address range.
  hat_clrattr() | Clears the protections for the supplied virtual address range.
  hat_free_end() | Informs the HAT layer that a process has exited.
  hat_free_start() | Informs the HAT layer that a process is exiting.
  hat_get_mapped_size() | Returns the number of bytes that have valid mappings.
  hat_getattr() | Gets the protections for the supplied virtual address range.
  hat_memload() | Creates a mapping for the supplied page at the supplied virtual address. Used to create mappings.
  hat_setattr() | Sets the protections for the supplied virtual address range.
  hat_stats_disable() | Finishes collecting stats on an address space.
  hat_stats_enable() | Starts collecting page reference and modification stats on an address space.
  hat_swapin() | Allocates resources for a process that is about to be swapped in.
  hat_swapout() | Frees resources for a process that is about to be swapped out.
  hat_sync() | Synchronizes the struct page software referenced and modified bits with the hardware MMU.
  hat_unload() | Unloads a mapping for the given page at the given address.
Virtual Memory Contexts & Address Spaces
  A virtual memory context is a set of virtual-to-physical translations that maps an address space.
  Contexts change when:
    the scheduler wants to switch execution from one process to another
    a trap or interrupt from user mode to kernel occurs
  Virtual memory context zero refers to the kernel context.
  The HAT layer implements functions to create, delete, and switch virtual memory contexts.
  Different hardware MMUs support different numbers of concurrent virtual memory contexts.
Hardware Translation Acceleration
  Translation lookaside buffer (TLB)
    a hardware cache of recent translations
    The number of entries in the TLB is typically 64 on SPARC systems.
  TLB fill
    by hardware, such as on Intel and older SPARC implementations
    by software algorithms, as on the UltraSPARC architecture
The UltraSPARC-I & -II HAT
  The UltraSPARC-I & -II MMUs do the following:
    Implement mapping between a 44-bit virtual address and a 41-bit physical address
    Support page sizes of 8 Kbytes, 64 Kbytes, 512 Kbytes, and 4 Mbytes
Virtual-to-Physical Translation
Translation Table Entry (TTE)
  A TTE is a translation map entry, one for each page.
  A TTE contains a virtual address tag and the high bits of the physical address.
  TTEs must be loaded into the TLB.
  When the MMU finds the TTE that matches the virtual page number and current context, it retrieves the physical page information.
Relationship of TLBs, TSBs, and TTEs
  Translation Software Buffer (TSB)
    a software cache of TTEs
    a direct-mapped cache of the TLB
    an array of TTEs in regular physical memory
TSB Size
  Memory Size | Kernel TSB Entries | Kernel TSB Size | User TSB Entries | User TSB Size
  < 32 Mbytes | - | - | 2048 | 128 Kbytes
  32 Mbytes–64 Mbytes | 4096 | 256 Kbytes | 8192–16383 | 512 Kbytes–1 Mbyte
  32 Mbytes–2 Gbytes | 4096–262,144 | 512 Kbytes–16 Mbytes | 16384–524,287 | 1 Mbyte–32 Mbytes
  2 Gbytes–8 Gbytes | 262,144 | 16 Mbytes | 524,288–2,097,511 | 32 Mbytes–128 Mbytes
  > 8 Gbytes | 262,144 | 16 Mbytes | 2,097,512 | 128 Mbytes
Address Space Identifiers
  Address space identifiers (ASIs):
    describe the MMU mode and hardware used to access pages
    are derived from the instruction being executed and the current trap level
    are grouped into three different modes of physical memory access
  The MMU translation context used to index TLB entries is derived from the ASI.
  ASI | Description | Derived Context
  Primary | The default address translation; used for regular SPARC instructions | The address space translation is done through TLB entries that match the context number in the MMU primary context register.
  Secondary | A secondary address space context; used for accessing another address space context without requiring a context switch | The address space translation is done through TLB entries that match the context number in the MMU secondary context register.
  Nucleus | The address translation used for TLB miss handlers, system calls, and interrupts | The nucleus context is always zero (the kernel’s context).
UltraSPARC-I & -II Watchpoint Implementation
  Watchpoint registers describe the address of watchpoints for the address space.
    Virtual address / physical address
  Watchpoint traps are generated when watchpoints are enabled and the data MMU detects a load or store to the virtual or physical address specified by the virtual address data watchpoint register or the physical data watchpoint register.
UltraSPARC-I & -II Protection Modes
  Condition: TTE in D-MMU | Condition: TTE in I-MMU | Condition: Writable Attribute Bit | Resultant Protection Mode
  Yes | No | 0 | Read-only
  No | Yes | Don’t Care | Execute-only
  Yes | No | 1 | Read/Write
  Yes | Yes | 0 | Read-only/Execute
  Yes | Yes | 1 | Read/Write/Execute
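The table can be read as a function of three inputs. The sketch below encodes exactly the five rows shown; the function name is hypothetical, and the case of a page with no TTE in either MMU is not covered by the table, so it returns None here as an assumption.

```python
def protection_mode(in_dmmu, in_immu, writable):
    # Map (TTE in D-MMU, TTE in I-MMU, writable bit) to a protection mode.
    if in_dmmu and in_immu:
        return "Read/Write/Execute" if writable else "Read-only/Execute"
    if in_dmmu:
        return "Read/Write" if writable else "Read-only"
    if in_immu:
        return "Execute-only"  # the writable bit is "don't care" here
    return None                # not covered by the table


mode_data_ro = protection_mode(True, False, 0)
mode_text = protection_mode(False, True, 1)
mode_all = protection_mode(True, True, 1)
```

Execute permission follows from the presence of a TTE in the instruction MMU, while write permission depends only on the writable bit of a TTE in the data MMU.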
UltraSPARC-I & -II MMU-Generated Traps
  Trap | Description
  Instruction_access_miss | A TTE for the virtual address of an instruction was not found in the instruction TLB.
  Instruction_access_exception | An instruction privilege violation or invalid instruction address occurred.
  Data_access_MMU_miss | A TTE for the virtual address of a load was not found in the data TLB.
  Data_access_exception | A data access privilege violation or invalid data address occurred.
  Data_access_protection | A data write was attempted to a read-only page.
  Privileged_action | An attempt was made to access a privileged address space.
  Watchpoint | Watchpoints were enabled and the CPU attempted to load or store at the address equivalent to that stored in the watchpoint register.
  Mem_address_not_aligned | An attempt was made to load or store from an address that is not correctly word aligned.
TLB Performance and Large Pages
  Large pages
    are typically 4 Mbytes in size
    optimize the effectiveness of the hardware TLB
  Memory performance is largely influenced by the effectiveness of the TLB because of the time spent servicing TLB misses.
  TLBs are limited in size: only 64 entries in UltraSPARC-I and -II.
TLB Reach
  TLB reach: the amount of memory that the TLB can address concurrently
  TLB reach = TLB entries * page size
    64 * 8 Kbytes, or 512 Kbytes
  Ways to increase TLB reach:
    Increase the number of entries in the TLB
    Increase the page size that each entry reflects
    A trade-off method: use two or more different page sizes at the same time
      8-Kbyte, 64-Kbyte, 512-Kbyte, or 4-Mbyte pages
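The reach formula is easy to evaluate for the page sizes listed. This is a plain arithmetic sketch; the function names and the example mix of 60 small and 4 large entries are made up for illustration.

```python
KB = 1024
MB = 1024 * KB


def tlb_reach(entries, page_size):
    # Memory covered when every TLB entry maps a page of the same size.
    return entries * page_size


def mixed_tlb_reach(entry_page_sizes):
    # Memory covered when entries map pages of different sizes.
    return sum(entry_page_sizes)


reach_8k = tlb_reach(64, 8 * KB)
reach_4m = tlb_reach(64, 4 * MB)
# Hypothetical mix: 60 entries map 8-Kbyte pages, 4 entries map 4-Mbyte pages.
reach_mixed = mixed_tlb_reach([8 * KB] * 60 + [4 * MB] * 4)
```

With all 64 entries on 8-Kbyte pages the reach is 512 Kbytes, while 4-Mbyte pages raise it to 256 Mbytes; even a handful of large-page entries accounts for most of the reach in the mixed case.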
Solaris Support for Large Pages
  8 Kbytes
    a good mix of performance across the range of smaller machines to larger machines
    hurts large-memory scientific applications and large-memory databases
    hurts kernel performance
  4 Mbytes
    speeds up the kernel code path
    frees up valuable TLB slots for hungry applications
    accelerates graphics performance
Large-Page Database Performance Improvements
  Database | Performance Improvement
  Oracle TPC-C | 12%
  Informix TPC-C | 1%
  Informix TPC-D | 6%