Chapter 3 Memory Management: Page Management

Li Wensheng


Outline

Data Structure
Page Scanner Operation
Page-out Algorithm
Hardware Address Translation Layer

Pages: The Basic Unit of Solaris Memory

Physical memory is divided into pages.
A page's identity is its vnode/offset pair.
The hardware address translation (HAT) and address space layers manage the mapping between a physical page and its virtual address space.


The Page Structure


The Page Hash List

The global hash list is an array of pointers to linked lists of pages.

The VM system hashes pages that have an identity onto the global hash list so that they can be located by vnode/offset.

Three page functions search the global page hash list:
  page_find()
  page_lookup()
  page_lookup_nowait()
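As a rough illustration of the lookup described above, the sketch below models the global hash list as an array of buckets, each holding a singly linked chain of pages keyed by vnode/offset. It is a minimal Python model, not the kernel's code: the bucket count, the hash function, and the names Page, PageHash, and bucket_index are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional

NBUCKETS = 8  # illustrative size only; a real table would be far larger


@dataclass
class Page:
    vnode: str                      # stand-in for a vnode pointer
    offset: int                     # byte offset within the vnode
    next: Optional["Page"] = None   # next page in the same hash bucket


def bucket_index(vnode, offset):
    # Hypothetical hash function, not the one the kernel uses.
    return hash((vnode, offset)) % NBUCKETS


class PageHash:
    def __init__(self):
        # An array of pointers to linked lists of pages.
        self.buckets = [None] * NBUCKETS

    def insert(self, page):
        i = bucket_index(page.vnode, page.offset)
        page.next = self.buckets[i]   # push onto the front of the chain
        self.buckets[i] = page

    def find(self, vnode, offset):
        page = self.buckets[bucket_index(vnode, offset)]
        while page is not None:       # walk the chain for this bucket
            if page.vnode == vnode and page.offset == offset:
                return page
            page = page.next
        return None
```

A lookup touches only one chain, so its cost depends on the chain length rather than on the total number of pages.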


Locating Pages by Vnode/Offset Identity


MMU-Specific Page Structures

The kernel needs to keep machine-specific data about every page, e.g. the HAT information that describes how the page is mapped by the MMU.

This data is kept in struct machpage.

The contents of the machine-specific page structure are hidden from the generic kernel; only the machine-specific HAT layer can see or manipulate them.


Machine-Specific Page Structures: sun4u Example


Physical Page Lists

The kernel keeps a segmented global physical page list, consisting of segments of contiguous physical memory.

Contiguous physical memory segments are added during system boot.

Segments can also be added and deleted dynamically when physical memory is added or removed while the system is running.


arrangement of the physical page lists


Free List and Cache List

Both lists hold pages that are not mapped into any address space and that have been freed by page_free().

Free list
  Pages do not have a vnode/offset associated with them.
  Pages are put on the free list when a process exits.
  The list is generally very small.

Cache list
  Pages still have a vnode/offset.
  Used for seg_map free-behind and for seg_vn executables and libraries (for reuse).
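The choice between the two lists can be sketched as follows. This is a minimal Python model under stated assumptions: pages are plain dictionaries, and page_free_sketch, free_list, and cache_list are hypothetical names standing in for the kernel's page_free() and its lists.

```python
from collections import deque

free_list = deque()    # pages with no identity
cache_list = deque()   # pages that still carry a vnode/offset


def page_free_sketch(page):
    """Place a freed page on the cache list or the free list.

    `page` is a dict; a page with an identity has a non-None 'vnode' key
    (an illustrative representation, not the kernel's page structure).
    """
    if page.get("vnode") is not None:
        cache_list.append(page)   # identity kept, so the page can be reclaimed
        return "cache"
    free_list.append(page)        # anonymous leftovers, e.g. after process exit
    return "free"
```

Keeping the identity on the cache list is what allows a later lookup by vnode/offset to find the page again without reading it from backing store.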

The Page-Level Interfaces

Method Description

page_create() Creates pages. Page coloring is based on a hash of the vnode offset. page_create() is provided for backward compatibility only. Don’t use it if you don’t have to. Instead, use the page_create_va() function so that pages are correctly colored.

page_create_va() Creates pages, taking into account the virtual address they will be mapped to. The address is used to calculate page coloring.

page_exists() Tests that a page for vnode/offset exists.

page_find() Searches the hash list for a page with the specified vnode and offset that is known to exist and is already locked

page_first() Finds the first page on the global page hash list

page_free() Frees a page. Pages with vnode/offset go onto the cache list; other pages go onto the free list

page_isfree() Checks whether a page is on the free list

page_ismod() Checks whether a page is modified. This function checks only the software bit in the page structure. To sync the MMU bits with the page structure, you may need to call hat_pagesync() before calling page_ismod().


The Page-Level Interfaces (Cont.)

Method Description

page_isref() Checks whether a page has been referenced; checks only the software bit in the page structure. To sync the MMU bits with the page structure, you may need to call hat_pagesync() before calling page_isref().

page_isshared() Checks whether a page is shared across more than one address space.

page_lookup() Finds a page representing the specified vnode/offset. If the page is found on a free list, then it will be removed from the free list

page_lookup_nowait() Finds a page representing the specified vnode/offset that is not locked or on the free list

page_needfree() Informs the VM system we need some pages freed up. Calls to page_needfree() must be symmetric, that is they must be followed by another page_needfree() with the same amount of memory multiplied by -1, after the task is complete.

page_next() Finds the next page on the global page hash list.


The Page Throttle

The page throttle is implemented in the page_create() and page_create_va() functions.

It causes page creates to block when the PG_WAIT flag is specified and available memory is less than the system global throttlefree.

throttlefree is set to the same value as minfree.

Memory allocated through the kernel memory allocator specifies PG_WAIT and is subject to the page-create throttle.
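The blocking condition can be written as a single predicate. The sketch below is a hedged Python model: the numeric value of PG_WAIT and the function name page_create_would_block are assumptions for illustration, and the sketch deliberately says nothing about what a caller without PG_WAIT experiences.

```python
PG_WAIT = 0x1  # illustrative flag value, not taken from the kernel headers


def page_create_would_block(freemem, throttlefree, flags):
    """Return True when a page create would be throttled.

    The caller blocks only if it passed PG_WAIT and available memory
    (in pages) is below throttlefree.
    """
    return bool(flags & PG_WAIT) and freemem < throttlefree
```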

Page Sizes

System | Architecture | MMU Page Size Capability | Solaris 2.x Page Size
Early SPARC systems | sun4c | 4K | 4K
microSPARC-I, -II | sun4m | 4K | 4K
SuperSPARC-I, -II | sun4m | 4K, 4M | 4K, 4M
UltraSPARC-I, -II | sun4u | 8K, 64K, 512K, 4M | 8K, 4M
Intel x86 architecture | i86pc | 4K, 4M | 4K, 4M


Page Coloring

Page placement policy affects processor performance.
The optimal placement of pages often depends on the memory access patterns of the application, which may access memory
  in a random order, or
  in some sort of strided order.

How can page placement affect performance? Consider the UltraSPARC-I and -II implementations:
  The L1 cache is 16 Kbytes.
  The L2 (external) cache can vary between 512 Kbytes and 8 Mbytes.
  The L2 cache is arranged in lines of 64 bytes, and transfers are done to and from physical memory in 64-byte units.

Page Coloring (Cont.)

Assume we have
  a 32-Kbyte L2 cache
  a page size of 8 Kbytes
  and therefore four page-sized slots in the L2 cache

The cache does not necessarily read and write 8-Kbyte units from memory; it does that in 64-byte chunks, so a 32-Kbyte cache has 512 addressable lines.

Offsets 0 and 32768 map to the same cache line. If we were to access these two addresses alternately, a cache ping-pong effect would occur.

We program to virtual memory rather than physical memory, so the OS must provide a sensible mapping between virtual memory and physical memory.
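The collision can be checked with a few lines of arithmetic. The sketch below assumes a direct-mapped, physically indexed cache with the example's sizes (32-Kbyte cache, 64-byte lines, 8-Kbyte pages); the function names cache_line and page_slot are illustrative.

```python
CACHE_SIZE = 32 * 1024   # bytes, from the example
LINE_SIZE = 64           # bytes per cache line
PAGE_SIZE = 8 * 1024     # bytes per page


def cache_line(addr):
    """Index of the cache line a physical address maps to (direct-mapped)."""
    return (addr % CACHE_SIZE) // LINE_SIZE


def page_slot(addr):
    """Which of the page-sized cache slots (colors) the address falls in."""
    return (addr % CACHE_SIZE) // PAGE_SIZE
```

Addresses 0 and 32768 (32 Kbytes apart) give the same cache_line() value, so alternating accesses evict each other, while addresses in neighboring pages land in different page slots.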

Page Coloring (Cont.)

In a simple implementation, physical pages are assigned to an address space in the order they appear in the free list.

Page coloring algorithm:
  The free list of physical pages is organized into specifically colored bins, one color bin for each slot in the physical cache.
  When a page is put on the free list, the page_free() algorithms assign it to a color bin.
  When a page is consumed from the free list (in the page_create_va() function), the virtual-to-physical algorithm takes the page from a physical color bin.

The kernel supports a default algorithm and two optional algorithms.

The default algorithm was chosen according to the following criteria:
  Fairly consistent, repeatable results
  Good overall performance for the majority of applications
  Acceptable performance across a wide range of applications

Solaris Page Coloring Algorithms

No. | Name | Description | Solaris 2.5.1 | Solaris 2.6 | Solaris 7
0 | Hashed VA | The physical page color bin is chosen by a hashed algorithm to ensure even distribution of virtual addresses across the cache. | Default | Default | Default
1 | P.Addr = V.Addr | The physical page color is chosen so that physical addresses map directly to the virtual addresses (as in the example). | Yes | Yes | Yes
2 | Bin Hopping | Physical pages are allocated with a round-robin method. | Yes | Yes | Yes
6 | Kessler's Best Bin | Kessler best bin algorithm. Keeps a per-process history of used colors and chooses the least-used color; if there are several, uses the largest bin. | E10000 only (default) | E10000 only (default) | Not available



Page Scanner

The page scanner is the memory management daemon that manages systemwide physical memory.

When there is a memory shortage, the page scanner runs to steal memory from address spaces by
  taking pages that haven't been used recently
  syncing them up with their backing store
  freeing them

If paged-out virtual memory is required again, a memory page fault occurs.

Page Scanner (Cont.)

The balance between page stealing and page faults determines which parts of virtual memory will be backed by physical memory and which will be moved out to swap.

Global page replacement / local page replacement
  The subtleties of which pages are stolen govern the memory allocation policies and can affect different workloads in different ways.

Enhancements:
  Minimizing page stealing from extensively shared libraries and executables.
  Priority paging, to prevent application, shared library, and executable paging on systems with ample memory.


Page Scanner Operation

The scanner tracks page usage by reading per-page hardware bits from the MMU for each page.
  Two bits are kept for each page: the reference bit and the modify bit.

It is awakened when the amount of memory on the free-page list falls below a system threshold, typically 1/64th of total physical memory.

It scans through pages in physical page order, looking for pages that haven't been used recently to page out to the swap device and free.


Two-handed Clock Algorithm

The front hand clears the referenced and modified bits for each page.
The back hand inspects the referenced and modified bits some time later.
Pages that haven't been referenced or modified are swapped out and freed.
The scan rate is controlled by the amount of free memory on the system.
The gap between the front and back hand is fixed by a boot-time parameter, handspreadpages.
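The interplay of the two hands can be simulated in a few lines. The sketch below is a simplified Python model of the description above, not the kernel's scanner: pages are indexes into two bit arrays, one step advances each hand by one page, and the function name two_handed_scan and the touches argument (which pages the workload references at each step) are assumptions for illustration.

```python
def two_handed_scan(npages, handspread, steps, touches):
    """Simulate the two clock hands for `steps` steps.

    touches maps a step number to the page indexes the workload
    references during that step. Returns the page indexes the back
    hand would select for freeing, in order.
    """
    referenced = [True] * npages    # assume every page starts out referenced
    modified = [False] * npages
    freed = []
    for step in range(steps):
        front = step % npages
        referenced[front] = False           # front hand clears both bits
        modified[front] = False
        for idx in touches.get(step, ()):   # workload activity sets the bit again
            referenced[idx] = True
        if step >= handspread:              # back hand trails by handspread pages
            back = (step - handspread) % npages
            if not referenced[back] and not modified[back]:
                freed.append(back)          # untouched since it was cleared
    return freed
```

For example, with 8 pages, a hand spread of 2, and a workload that touches page 0 at step 1 and page 2 at step 3, two_handed_scan(8, 2, 8, {1: [0], 3: [2]}) selects pages 1, 3, 4, and 5: pages 0 and 2 survive because they were referenced between the two hands.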



Introduction to page-out algorithm

Steals pages when free memory is lower than lotsfree.

The scanner
  starts scanning at slowscan (pages/sec)
  runs four times per second when memory is short
  is awoken by the page allocator if memory is very low

Puts memory out to "backing store."
Uses a least recently used (LRU) style process.

Kernel threads do the scanning.

Page Scanner Parameters

Parameter | Description | Min | Default
lotsfree | Scanner starts stealing anonymous memory pages | 512K | 1/64th of memory
desfree | Scanner is started at 100 times/second | minfree | ½ of lotsfree
minfree | Start scanning every time a new page is created | - | ½ of desfree
throttlefree | page_create routine makes the caller wait until free pages are available | - | minfree
fastscan | Scan rate (pages per second) when free memory = minfree | slowscan | Minimum of 64 MB/s or ½ memory size
slowscan | Scan rate (pages per second) when free memory = lotsfree | - | 100
maxpgio | Max number of pages per second that the swap device can handle | ~60 | 60 or 90 pages per spindle
handspreadpages | Number of pages between the front hand (clearing) and back hand (checking) | 1 | fastscan
min_percent_cpu | CPU usage when free memory is at lotsfree | - | 4% (~1 clock tick) of a single CPU

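Scan Rate Parameters (Assuming No Priority Paging)

[Figure: scan rate versus amount of free memory. The scanner starts scanning at slowscan (default 100) when free memory falls to lotsfree (1/64 of memory) and scans faster as the amount of free memory approaches 0, up to fastscan (default ½ of physical memory).]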


Scan Rate Parameters calculation

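lotsfree is calculated at startup as 1/64th of memory.
slowscan is 100 pages/sec by default on Solaris systems.
fastscan is set to total physical memory / 2, with a maximum of 64 Mbytes/sec (8192 pages/sec with 8-Kbyte pages).

If total physical memory is 1 Gbyte, then
  lotsfree = 2048 pages
  fastscan = 8192 pages/sec

If free memory falls to 12 Mbytes (1536 pages), the scan rate is interpolated between slowscan and fastscan.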
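The numbers above can be reproduced with a short calculation. The sketch below assumes 8-Kbyte pages and a linear interpolation that gives slowscan when free memory equals lotsfree and fastscan when free memory reaches zero; the interpolation form and the function name scan_rate are assumptions for illustration, not taken from the kernel source.

```python
PAGE_SIZE = 8 * 1024  # assumed 8-Kbyte pages


def scan_rate(freemem, lotsfree, slowscan, fastscan):
    """Pages/sec to scan, linearly interpolated on free memory (in pages)."""
    if freemem > lotsfree:
        return 0  # no shortage: the scanner is idle in this sketch
    return ((lotsfree - freemem) * fastscan + freemem * slowscan) // lotsfree


total_pages = (1 << 30) // PAGE_SIZE                 # 1 Gbyte of memory
lotsfree = total_pages // 64                         # 1/64th of memory
slowscan = 100
fastscan = min(total_pages // 2, (64 << 20) // PAGE_SIZE)  # capped at 64 MB/s

rate_at_12mb = scan_rate(1536, lotsfree, slowscan, fastscan)
```

Under these assumptions lotsfree is 2048 pages, fastscan is 8192 pages/sec, and the rate at 1536 free pages comes out at 2123 pages/sec.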


Not Recently Used Time

The time between the pass of the front hand and the pass of the back hand.
  Short time: only the most active pages remain intact.
  Long time: only the largely unused pages are stolen.

It varies from just a few seconds to several hours, according to
  the number of pages between the front and back hand
  the scan rate

Example
  Scan rate: 2000 pages/sec
  Hand spread: 8192 pages
  Clear/check time: about 4 seconds
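The clear/check time in the example is simply the hand spread divided by the scan rate. A minimal sketch, with the function name clear_check_seconds chosen for illustration:

```python
def clear_check_seconds(handspreadpages, scan_rate_pages_per_sec):
    """Seconds between the front hand clearing a page and the back hand checking it."""
    return handspreadpages / scan_rate_pages_per_sec


example = clear_check_seconds(8192, 2000)   # about 4.1 seconds
at_slowscan = clear_check_seconds(8192, 100)  # about 82 seconds at 100 pages/sec
```

The same hand spread gives a much longer window at a low scan rate, which is why the not-recently-used time stretches as memory pressure eases.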


Shared Library Optimizations

Prevents the scanner from stealing pages from extensively shared libraries.

The scanner looks at the share reference count for each page; if the page is shared more than a certain amount, it is skipped during the page scan operation.

Threshold parameter: po_share
  Range 8 to 134217728; by default, starts at 8.
  A page shared by more than po_share processes will be skipped.
  Each time around, it is decremented ?


The Priority Paging Algorithm

Purpose: overcome adverse behavior that results from the memory pressure caused by the file system.

Puts a higher priority on a process's pages: its heap, stack, shared libraries, and executables.

Permits the scanner to
  pick only file system cache pages when ample memory is available
  steal application pages only when there is a true memory shortage


The Priority Paging Algorithm

Introduces a new paging parameter, cachefree.
When the amount of free memory lies between cachefree and lotsfree, the page scanner steals only file system cache pages.
The scanner wakes up when memory falls below cachefree rather than below lotsfree.
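The three regimes can be expressed as a small decision function. This is a hedged sketch: it assumes cachefree is larger than lotsfree, measures everything in pages, and uses the hypothetical name scanner_targets with descriptive return strings.

```python
def scanner_targets(freemem, cachefree, lotsfree):
    """Which pages the scanner may steal under priority paging.

    Assumes cachefree > lotsfree; all values are in pages.
    """
    if freemem >= cachefree:
        return "idle"                          # scanner not woken
    if freemem >= lotsfree:
        return "file system cache pages only"  # between cachefree and lotsfree
    return "all pages"                         # true shortage: application pages too
```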

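Scan Rate Interpolation with the Priority Paging Algorithm

[Figure: scan rate versus free memory with priority paging. In the region between cachefree and lotsfree the scanner pages only the file system cache.]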


Page Scanner CPU Utilization Clamp

Purpose: to prevent the page-out daemon from using too much processor time.

Two parameters:
  min_percent_cpu, default 4% of a single CPU
  max_percent_cpu, default 80% of a single CPU

The CPU time that can be used ranges from min_percent_cpu to max_percent_cpu:
  min_percent_cpu when free memory is at lotsfree (cachefree with priority paging enabled)
  max_percent_cpu if free memory were to fall to zero
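The slide gives only the two endpoints of the clamp. The sketch below assumes the allowance grows linearly between them as free memory drops; that linear form and the function name scanner_cpu_percent are assumptions for illustration.

```python
def scanner_cpu_percent(freemem, threshold,
                        min_percent_cpu=4.0, max_percent_cpu=80.0):
    """CPU allowance (percent of one CPU) for the scanner.

    threshold is lotsfree, or cachefree when priority paging is enabled.
    Assumes a linear ramp between the two documented endpoints.
    """
    freemem = max(0, min(freemem, threshold))     # clamp into [0, threshold]
    shortfall = (threshold - freemem) / threshold
    return min_percent_cpu + shortfall * (max_percent_cpu - min_percent_cpu)
```

With the defaults, the allowance is 4% at the threshold, 80% at zero free memory, and 42% halfway between.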

Parameters That Limit Pages Paged Out

maxpgio
  Limits the rate at which I/O is queued to the swap devices.
  Defaults to 40 or 60 I/Os per second.
  Often set to 100 times the number of swap spindles.
  Can also indirectly affect file system throughput.


Page Scanner Implementation

Implemented as two kernel threads:
  Page scanner thread: scans pages
  Page-out thread: pushes the dirty pages queued for I/O


Page Scanner Architecture


Scanner Wakeup and schedpaging()

The scanner is woken up
  by schedpaging(), called four times per second by a callout
  by the clock() thread if memory falls below minfree
  by the page allocator if memory falls below throttlefree

schedpaging()
  calculates two setup parameters for the page scanner thread:
    the number of pages to scan
    the number of CPU ticks that the scanner thread can consume
  triggers the scanner through a condition variable


Page scanner thread

Cycles through the physical page list.
The front and back hand each have a page pointer.
The front hand is incremented first, to clear the referenced and modified bits for the page it points to.
The back hand is then incremented, to check the status of the page it points to (using the check_page() function):
  If modified, the page is placed in the dirty page queue.
  If not referenced, the page is freed.


Page-out thread

Uses a preinitialized list of async buffer headers as the queue for I/O requests.
  The number of entries is controlled by the parameter async_request_size, initialized to 256.

Requests to queue more I/Os will be blocked
  if the entire queue is full, or
  if the rate of pages queued has exceeded maxpgio.

The thread removes I/O entries from the queue and initiates I/O by calling the vnode putpage().


The Memory Scheduler

Swaps out entire processes to conserve memory by
  removing all of a process's thread structures and private pages
  setting flags in the process table to indicate that the process has been swapped out
Not expensive, but affects the process's performance.

Launched at boot time; does nothing unless memory is less than desfree.

Looks for processes that it can completely swap out: soft swap-out / hard swap-out.


Soft Swapping

Takes place when the 30-second average for free memory is below desfree.

The memory scheduler looks for processes that have been inactive for at least maxslp seconds.

If such a process is found, the scheduler
  swaps out the thread structures for each thread
  pages out all of the private pages of memory for that process


Hard Swapping

Takes place when all of the following are true:
  At least two processes are on the run queue, waiting for CPU.
  The average free memory over 30 seconds is consistently less than desfree.
  Excessive paging is going on (determined to be true if page-out + page-in > maxpgio).

Uses a much more aggressive approach to find memory:
  First, the kernel is requested to unload all modules and cache memory that are not currently active.
  Then, processes are sequentially swapped out until the desired amount of free memory is returned.


Memory Scheduler Parameters

Parameter Effect on Memory Scheduler

desfree If the average amount of free memory falls below desfree for 30 seconds, then the memory scheduler is invoked.

maxslp When soft-swapping, the memory scheduler starts swapping processes that have slept for at least maxslp seconds. The default for maxslp is 20 seconds and is tunable

maxpgio When the run queue is greater than 2, free memory is below desfree, and the paging rate is greater than maxpgio, then hard swapping occurs, unloading kernel modules and process memory.
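The soft and hard swapping conditions can be combined into one decision function. The sketch below is a simplified Python model of the rules stated in these slides, using the "at least two processes on the run queue" wording; the function name swap_mode and its argument names are assumptions for illustration.

```python
def swap_mode(avg_freemem_30s, desfree, runq_len, pageins, pageouts, maxpgio):
    """Return 'none', 'soft', or 'hard' for the memory scheduler.

    avg_freemem_30s and desfree are in pages; pageins and pageouts are
    per-second rates compared against maxpgio.
    """
    if avg_freemem_30s >= desfree:
        return "none"                     # scheduler does nothing
    excessive_paging = pageins + pageouts > maxpgio
    if runq_len >= 2 and excessive_paging:
        return "hard"                     # unload modules, swap processes out
    return "soft"                         # swap out long-sleeping processes only
```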



Introduction to HAT

The Hardware Address Translation (HAT) layer
  controls the hardware that manages the mapping of virtual to physical memory
  provides interfaces that implement the creation and destruction of mappings between virtual and physical memory
  provides a set of interfaces to probe and control the MMU
  implements all of the low-level trap handlers to manage page faults and memory exceptions


Solaris Virtual Memory Layers


Solaris Memory Model HAT layer

Address Space

Process address space
  Process text and data
  Stack (anon memory) and libraries
  Heap (anon memory)

Kernel address space
  Kernel text and data
  Kernel map space (data structures, caches)
  32-bit kernel map (64-bit kernels only)
  Trap table
  Critical virtual memory data structures
  Mapping of the file system cache (segmap)


The Address Space


Role of the HAT layer in virtual-to-physical translation

The HAT layer hides the platform-specific implementation and is used by the segment drivers to implement the segment driver's view of virtual-to-physical translation.

The hat structure holds the top-level translation information.
  The hat structure is platform specific.
  It is referenced by the address space structure.
  HAT-specific data structures in every page represent the translation information at the page level.

The HAT layer is called when the segment drivers want to manipulate the hardware MMU.

Summary of HAT Functions

Function Description

hat_chgattr() Changes the protections for the supplied virtual address range.

hat_clrattr() Clears the protections for the supplied virtual address range.

hat_free_end() Informs the HAT layer that a process has exited.

hat_free_start() Informs the HAT layer that a process is exiting.

hat_get_mapped_size() Returns the number of bytes that have valid mappings.

hat_getattr() Gets the protections for the supplied virtual address range.

hat_memload() Creates a mapping for the supplied page at the supplied virtual address. Used to create mappings.

hat_setattr() Sets the protections for the supplied virtual address range.

hat_stats_disable() Finishes collecting stats on an address space.

hat_stats_enable() Starts collecting page reference and modification stats on an address space.

hat_swapin() Allocates resources for a process that is about to be swapped in.

hat_swapout() Frees resources for a process that is about to be swapped out.

hat_sync() Synchronizes the struct page software referenced and modified bits with the hardware MMU.

hat_unload() Unloads a mapping for the given page at the given address.

Virtual Memory Contexts & Address Spaces

A virtual memory context is a set of virtual-to-physical translations that maps an address space.

Contexts change when
  the scheduler wants to switch execution from one process to another
  a trap or interrupt from user mode to kernel mode occurs

virtual memory context zero refers to kernel context

HAT layer implements functions to create, delete, and switch virtual memory contexts

Different hardware MMUs support different numbers of concurrent virtual memory contexts


Hardware Translation Acceleration

Translation lookaside buffer (TLB)
  A hardware cache of recent translations.
  The number of entries in the TLB is typically 64 on SPARC systems.

TLB fill is done by
  hardware, as on Intel and older SPARC implementations, or
  software algorithms, as on the UltraSPARC architecture.

The UltraSPARC-I & -II HAT

The UltraSPARC-I & -II MMUs do the following:
  Implement mapping between a 44-bit virtual address and a 41-bit physical address
  Support page sizes of 8 Kbytes, 64 Kbytes, 512 Kbytes, and 4 Mbytes


Virtual-to-Physical Translation


Translation Table Entry (TTE)

A TTE is a translation map entry, one for each page.
A TTE contains a virtual address tag and the high bits of the physical address.
TTEs must be loaded into the TLB.
When the MMU finds the TTE that matches the virtual page number and current context, it retrieves the physical page information.


Relationship of TLBs, TSBs, and TTEs

Translation Software Buffer (TSB)
  a software cache of TTEs
  a direct-mapped cache of the TLB
  an array of TTEs in regular physical memory

TSB Size

Memory Size | Kernel TSB Entries | Kernel TSB Size | User TSB Entries | User TSB Size
< 32 Mbytes | - | - | 2048 | 128 Kbytes
32 Mbytes–64 Mbytes | 4096 | 256 Kbytes | 8192–16383 | 512 Kbytes–1 Mbyte
32 Mbytes–2 Gbytes | 4096–262,144 | 512 Kbytes–16 Mbytes | 16384–524,287 | 1 Mbyte–32 Mbytes
2 Gbytes–8 Gbytes | 262,144 | 16 Mbytes | 524,288–2,097,511 | 32 Mbytes–128 Mbytes
> 8 Gbytes | 262,144 | 16 Mbytes | 2,097,512 | 128 Mbytes


Address Space Identifiers

Address space identifiers (ASIs)
  describe the MMU mode and hardware used to access pages
  are derived from the instruction being executed and the current trap level
  are grouped into three different modes of physical memory access

The MMU translation context used to index TLB entries is derived from the ASI.

ASI | Description | Derived Context
Primary | The default address translation; used for regular SPARC instructions | The address space translation is done through TLB entries that match the context number in the MMU primary context register.
Secondary | A secondary address space context; used for accessing another address space context without requiring a context switch | The address space translation is done through TLB entries that match the context number in the MMU secondary context register.
Nucleus | The address translation used for TLB miss handlers, system calls, and interrupts | The nucleus context is always zero (the kernel's context).


UltraSPARC-I & II Watchpoint Implementation

Watchpoint registers describe the addresses of watchpoints for the address space.
  A watchpoint can be set on a virtual address or a physical address.

Watchpoint traps are generated when watchpoints are enabled and the data MMU detects a load or store to the virtual or physical address specified by the virtual address data watchpoint register or the physical data watchpoint register.

UltraSPARC-I & -II Protection Modes

TTE in D-MMU | TTE in I-MMU | Writable Attribute Bit | Resultant Protection Mode
Yes | No | 0 | Read-only
No | Yes | Don't care | Execute-only
Yes | No | 1 | Read/Write
Yes | Yes | 0 | Read-only/Execute
Yes | Yes | 1 | Read/Write/Execute
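The protection-mode table can be encoded as a small function, which makes the "don't care" case explicit. This is an illustrative Python sketch; the function name protection_mode and the None result for a page with no TTE in either MMU (a case the table does not list) are assumptions.

```python
def protection_mode(tte_in_dmmu, tte_in_immu, writable):
    """Resultant protection mode for a page, following the table above."""
    if not tte_in_dmmu and not tte_in_immu:
        return None                      # no mapping; not covered by the table
    if not tte_in_dmmu:
        return "Execute-only"            # writable bit is a don't-care here
    data = "Read/Write" if writable else "Read-only"
    return data + "/Execute" if tte_in_immu else data
```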


UltraSPARC-I & -II MMU-Generated Traps

Trap Description

Instruction_access_miss A TTE for the virtual address of an instruction was not found in the instruction TLB

Instruction_access_exception An instruction privilege violation or invalid instruction address occurred

Data_access_MMU_miss A TTE for the virtual address of a load was not found in the data TLB

Data_access_exception A data access privilege violation or invalid data address occurred

Data_access_protection A data write was attempted to a read-only page

Privileged_action An attempt was made to access a privileged address space

Watchpoint Watchpoints were enabled and the CPU attempted to load or store at the address equivalent to that stored in the watchpoint register

Mem_address_not_aligned An attempt was made to load or store from an address that is not correctly word aligned


TLB Performance and Large Pages

Large pages, typically 4 Mbytes in size, optimize the effectiveness of the hardware TLB.

Memory performance is largely influenced by the effectiveness of the TLB because of the time spent servicing TLB misses.

TLBs are limited in size: only 64 entries in UltraSPARC-I and -II.


TLB reach

TLB reach: the amount of memory that the TLB can address concurrently.

TLB reach = TLB entries × page size
  e.g. 64 × 8 Kbytes = 512 Kbytes

Ways to increase TLB reach:
  Increase the number of entries in the TLB.
  Increase the page size that each entry reflects.
  As a trade-off, use two or more different page sizes at the same time: 8-Kbyte, 64-Kbyte, 512-Kbyte, or 4-Mbyte pages.
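The reach formula is easy to evaluate for each supported page size. The sketch below assumes a 64-entry TLB in which every entry maps a page of the same size; the function name tlb_reach and the reach_by_page_size table are illustrative.

```python
KB = 1024
MB = 1024 * KB


def tlb_reach(entries, page_size):
    """Bytes of memory the TLB can map at once: entries times page size."""
    return entries * page_size


# Reach of a 64-entry TLB for each UltraSPARC page size, in bytes.
reach_by_page_size = {
    size: tlb_reach(64, size)
    for size in (8 * KB, 64 * KB, 512 * KB, 4 * MB)
}
```

With 8-Kbyte pages the reach is 512 Kbytes; with 4-Mbyte pages the same 64 entries cover 256 Mbytes.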


Solaris Support for Large Pages

8-Kbyte pages
  give a good mix of performance across the range from smaller machines to larger machines
  hurt large-memory scientific applications and large-memory databases
  hurt kernel performance

4-Mbyte pages
  speed up the kernel code path
  free up valuable TLB slots for hungry applications
  accelerate graphics performance

Large-Page Database Performance Improvements

Database | Performance Improvement
Oracle TPC-C | 12%
Informix TPC-C | 1%
Informix TPC-D | 6%


End
