Introduction to ACPI Based Memory Hot-Plug · ACPI & Memory Hot-Plug •ACPI: Advanced...
Transcript of Introduction to ACPI Based Memory Hot-Plug · ACPI & Memory Hot-Plug •ACPI: Advanced...
Agenda
1. Why need Memory Hot-Plug2. ACPI & Memory Hot-Plug3. Memory hot-add4. Memory hot-remove5. Movable node6. Bootmem handling7. Future work
2
memory(in use)
memory(in error)
Why need memory Hot-Plug
3
Load Load
memory(in use)
memory(idle)
memory(idle)
2. Reduce power consumption.
1. Balance the load.
3. Handle hardware error.
4
Why need memory Hot-Plug
Hardware (ACPI registers)
Firmware (ACPI BIOS)
Guest OS
VMware
Guest OS
Guest OS
5. ACPI provides sufficient conditions• With the help of ACPI, hardware and
fireware are now able to support memory hotplug physically.
4. Guest OS should support memory hotplug.• VMware supports virtual memory device
hotplug for the virtual machine.• The similar feature is being developped for
KVM.Memory device virtualization
+
Agenda
1. Why need Memory Hot-Plug2. ACPI & Memory Hot-Plug3. Memory hot-add4. Memory hot-remove5. Movable node6. Bootmem handling7. Future work
5
ACPI & Memory Hot-Plug• ACPI: Advanced Configuration and Power Interface
6
ACPI is an interface specification of Operating System-directed motherboard device configuration and Power Management.
-- ACPI Specification 5.0
Kernel (Software)
ACPI Driver
Hardware
Methods(dynamic)
ACPI BIOS (Firmware)
ACPI Tables(static)
ACPI Registers
Run Time Boot Time
OS layer framework. Event handling API
Static info used only at boot time. DSDT SRAT ……
Event driven model. Event registers Control registers ……SCI
(System Control Interrupt)
Dynamic methods used at run time. _EJ0 _STA ……
Kernel
Memory Hot-Plug Subsystem
ACPI & Memory Hot-Plug
7
ACPI Driver
Hardware
Methods(dynamic)
ACPI BIOS
ACPI Registers
Event info
Call event handlerCall API
ACPI Tables(static)
Generate SCI(System Control Interrupt)
Call ACPI Method
Hardware operation
Read ACPI Tables
Install event handler
• ACPI and Memory Hot-Plug Run time processBoot time process
Call device dependent code
Hot-Plug happens
Memory Device Driver
Kernel
Memory Hot-Plug Subsystem
ACPI & Memory Hot-Plug
8
ACPI Driver
Hardware
Methods(dynamic)
ACPI BIOS
ACPI Registers
ACPI Tables(static)
• Main jobs of Memory Hot-Plug
Memory Device Driver
Main Jobs Node data
Direct mapping
Virtual memory mapping
Page online & offline
Page migration
Event handler
ACPI & Memory Hot-Plug
Physical space
Userspace
……
Kernelspace
hole
virtual memory map (1TB)
direct mapping (64TB)
kernel text mapping
module mapping space
hole
holevmalloc/ioremap space
hole
block X
movable pages (used)
otherpage_structs
page_structs ofblock X
……
……
process (128TB)
9
……
• Things associated with Memory Hot-Plug
2. User processes’ pagetable.
1. Memory block to be hot-plugged.
3. Kernel direct mappingpagetable.
4. Virtual memory mappingpages.
5. Virtual memory mappingpagetable.2
1
5
3
4
Agenda
1. Why need Memory Hot-Plug2. ACPI & Memory Hot-Plug3. Memory hot-add4. Memory hot-remove5. Movable node6. Bootmem handling7. Future work
10
11
Memory hot-add• Generic memory definition & sparse memory model
endstart
section
pages
Physical memory range
block blockblock …… block……
section
pages
section
pages……
Sparse memory
model
Sparse memory model divides a memory range into several sections, in which the memory is contiguous.(128MB per section, one section per block by default on x64)
Generic memory definition divides a memory range into several blocks.(128MB per block by default on x64)
block X
pages(invalid)
block X
pages(invalid)
Memory hot-add• Add memory (1)
Physical space……
Kernel space
hole
virtual memory map
(1TB)
direct mapping
(64TB)
kernel text mapping
module mapping space
hole
holevmalloc/ioremap space
hole
……
12
unmapped
unmapped
free pages
page_structs
page_structs
block X
pages(invalid)
empty
Blocks in the memory range are hot-added one by one.
Memory hot-add• Add memory (2)
Physical space……
Kernel space
hole
virtual memory map
(1TB)
direct mapping
(64TB)
kernel text mapping
module mapping space
hole
holevmalloc/ioremap space
hole
……
13
NEW
NEW
page_structs
page_structs
block X
pages(invalid)
NEW
empty
1
2
1. Initialize direct mappingpagetable.
2. Allocate virtual memory mapping pages.
3
3. Initialize virtual memory mapping pagetable.
Memory hot-add• Add memory (3)
Physical space……
Kernel space
hole
virtual memory map (1TB)
direct mapping (64TB)
kernel text mapping
module mapping space
hole
holevmalloc/ioremap space
hole
……
14
page_structs
page_structs
block X (offline)
pages
page_structs ofblock X The newly added pages are
offline and not present.
echo online >/sys/devices/system/memory/memoryX/state
Memory hot-add• Online pages
Physical space……
Kernel space
hole
virtual memory map (1TB)
direct mapping (64TB)
kernel text mapping
module mapping space
hole
holevmalloc/ioremap space
hole
……
15
page_structs
page_structs
block X (online)
pages
page_structs ofblock X
User space
process n (128TB)
……
process 1 (128TB)
process 3 (128TB)
process i (128TB)
process 2 (128TB)
Memory hot-add
• Configuration– mm/Kconfig
config MEMORY_HOTPLUGbool "Allow for memory hot-add“depends on SPARSEMEM || X86_64_ACPI_NUMAdepends on HOTPLUG && ARCH_ENABLE_MEMORY_HOTPLUGdepends on (IA64 || X86 || PPC_BOOK3S_64 || SUPERH || S390)
16
Agenda
1. Why need Memory Hot-Plug2. ACPI & Memory Hot-Plug3. Memory hot-add4. Memory hot-remove5. Movable node6. Bootmem handling7. Future work
17
Memory hot-remove• Remove memory (1)
Physical space
block X (online)
movable pages (used)
……
pages(free)
pages(free)
Kernel space
hole
virtual memory map (1TB)
direct mapping (64TB)
kernel text mapping
module mapping space
hole
holevmalloc/ioremap space
hole
process n (128TB)
…
page_structs
page_structs ofblock X
……
……
……process 1 (128TB)
process 3 (128TB)
process i (128TB)
process 2 (128TB)
User space18
…
Memory hot-remove• Remove memory (2)
Physical space
process n (128TB)
…process 1 (128TB)
process 3 (128TB)
process i (128TB)
process 2 (128TB)
1
1
1. Unmap user pages.• Kernel will generate a
page fault for each process who access these pages, and the process will wait till the migration is over.
2. Allocate new pages.
3. Copy data from old pages to new pages.
pages(used)
pages(used)
block X (online)
……
page_structs
page_structs ofblock X
……
……
……
movable pages (used)
2
2
3
3
19User space
…
pages(used)
pages(used)
Memory hot-remove• Remove memory (3)
Physical space
block X (offline)
……
page_structs
page_structs ofblock X
……
……
……
movable pages (isolated)
process n (128TB)
……
process 1 (128TB)
process 3 (128TB)
updating
updating
process i (128TB)
process 2 (128TB)
4. Update user processes’pagetable.
• Also wake up all the processes waiting for these pages.
5. Isolate old pages.• Pages are not in the
buddy system, and won’t be allocated to anyone.
4
54
20User space
6. Set the block state to offline.
6
pages(used)
pages(used)
removing
Memory hot-remove• Remove memory (4)
Physical space……
Kernel space
hole
virtual memory map (1TB)
direct mapping (64TB)
kernel text mapping
module mapping space
hole
holevmalloc/ioremap space
hole
page_structs
……
……
……
freeing
freeing
freeing
freeing
freeing
6
7
6. Free kernel direct mappingpagetable.
7. Free virtual memory mapping pages.8
8. Free virtual memory mapping pagetable.
21
removed
pages(used)
pages(used)
Memory hot-remove• Remove memory (5)
Physical space……
Kernel space
hole
virtual memory map (1TB)
direct mapping (64TB)
kernel text mapping
module mapping space
hole
holevmalloc/ioremap space
hole
process n (128TB)
……
page_structs
……
……
……process 1 (128TB)
process 3 (128TB)
freed
freed
updated
updated
process i (128TB)
process 2 (128TB)
freed
freed
22
freed
User space
Memory hot-remove• Post work: automatically remove the node
ZONE_MOVABLE
node i
cpu
CPU in use
CPU idle
unmovable memory
movable memory
memory removedCPU removed
removedmemory
node i
cpu cpu cpu cpu
RemoveCPUs
removedmemory
cpu cpu cpu
Removememory
ZONE_MOVABLE
All CPUs, memoryare removed ?
NO
YES
NO
Set node state to offline
Remove /sys files
Free wait_table
Clear pgdat
Node hot-remove
23
Memory hot-remove
• Merged into Linux 3.9• Configuration
– mm/Kconfig
config MEMORY_HOTREMOVEbool "Allow for memory hot remove"select MEMORY_ISOLATIONselect HAVE_BOOTMEM_INFO_NODE if X86_64depends on MEMORY_HOTPLUG && ARCH_ENABLE_MEMORY_HOTREMOVEdepends on MIGRATION
24
Memory hot-remove• Kernel pages cannot be hot-removed
25
direct mapping(64TB)
Userspace
Kernelspace
user mapping
(128TB)
Physical space
User page
User page
User page
Kernel page
Kernel pageKernel page
Kernel page
variable
1. migrate
not migratableva = pa + offset(1-1 mapped)
User page
Kernel page
not hot-removable
hot-removable
2. hot-remove
Agenda
1. Why need Memory Hot-Plug2. ACPI & Memory Hot-Plug3. Memory hot-add4. Memory hot-remove5. Movable node6. Bootmem handling7. Future work
26
memory
Movable node
memory
node0
cpu cpu
node1 node2
cpu cpucpu
27
• NUMA: Non-Uniform Memory Access– A node consists of a set of CPUs and memory.– CPUs access memory in the same node faster.– Meaningful in 64bits platform only. 32bits platform has only
one node.
cpu
memory
Fast FastFast
Movable node
ZONE_NORMAL
ZONE_MOVABLE
node
cpu cpu
28
ZONE_DMA
ZONE_DMA32
ZONE_HIGLMEM
ZONE_DMA / ZONE_DMA_32 (64bits only): used for DMA.
ZONE_NORMAL: Memory directly mapped, used by kernel and user space.
ZONE_HIGHMEM: For 32bits kernel to access memory not directly mapped.
ZONE_MOVABLE: Memory can be migrated. (user space only)
• Zone: Different types of memory in a node
…
Kernel allocates ZONE_NORMAL on each node evenly
Movable node• Problem
ZONE_DMA
ZONE_DMA32
ZONE_NORMAL
ZONE_MOVABLE
…
node 0
cpu cpu
cpu cpu
CPU in use
CPU idle
unmovable memory
movable memory
ZONE_NORMAL
ZONE_MOVABLE
node 1
cpu cpu
cpu cpu
node i
Each node has ZONE_NORMAL
No node is hot-removable
ZONE_NORMAL
ZONE_MOVABLE
cpu cpu
cpu cpu
Kernel may use
ZONE_NORMAL
29
Configure a node to have only
ZONE_MOVABLE
Movable node• Solution
…
node 0
cpu cpu
cpu cpu
ZONE_NORMAL
ZONE_MOVABLE
node 1
cpu cpu
cpu cpu
ZONE_MOVABLE
node i
cpu cpu
cpu cpu
The node is hot-removable
CPU in use
CPU idle movable memory
ZONE_DMA
ZONE_DMA32
ZONE_NORMAL
ZONE_MOVABLE
Movable node has no ZONE_NORMAL
Kernel can not use
ZONE_MOVABLE
unmovable memory
30
Processor Local APIC/SAPIC AffinityProcessor Local
APIC/SAPIC AffinityProcessor Local x2APIC Affinity
Mainly useful information
SRAT
Local x2APIC ID PXM (proximity domain) ……
Static information of NUMA architecture.
KernelACPI Driver
Hardware
Methods
ACPI BIOS
Tables
Registers
Memory AffinityMemory AffinityMemory Affinity
Processor Local APIC/SAPIC AffinityProcessor Local
APIC/SAPIC AffinityProcessor Local APIC/SAPIC Affinity
APIC ID or SAPIC ID/EID PXM (proximity domain) ……
Memory range PXM (proximity domain) Hotpluggable flag ……
Movable node• Static configuration
– SRAT: System Resource Affinity Table
31
Movable node• movablecore = acpi
unmovable memory
movable memory
…
ZONE_DMA
ZONE_DMA32
ZONE_NORMAL
node 0
ZONE_MOVABLE
node 1
ZONE_MOVABLE
node i
ZONE_NORMAL
node n
…
Node 0unhotpluggable
Node 1hotpluggable
Node nunhotpluggable
Node ihotpluggable
… …
• Use SRAT to arrange ZONE_MOVABLE.• Only for memory hotplug users.• Still being pushing.(New)
SRAT memory affinities
32
Movable node• The old way (no performance lost)
– kernelcore / movablecore = nn {G|M|K} (Old)
unmovable memory
movable memory
…
• Allocate ZONE_MORMAL in each node evenly.• For regular users.
ZONE_DMA
ZONE_DMA32
ZONE_NORMAL(same size)
ZONE_MOVABLE
node 0
ZONE_NORMAL(same size)
ZONE_MOVABLE
node 1
ZONE_NORMAL(same size)
ZONE_MOVABLE
node i
ZONE_NORMAL(same size)
ZONE_MOVABLE
node n
…
33
ZONE_NORMAL
ZONE_MOVABLE
Movable node• Dynamic configuration
offline memory
node i
cpu cpu
cpu cpu
memory offlineCPU offline
mem_section XXX
mem_section XXX+1
CPU in use
CPU idle movable memory
unmovable memory
node i
cpu cpu
cpu cpu
online memory
1. online_kernel (NEW)Set to ZONE_NORMAL.
2. online_movable (NEW)Set to ZONE_MOVABLE.
3. online (Improved)Keep the previous state.(ZONE_NORMAL for the first time)
echo COMMAND >/sys/devices/system/node/nodei/memoryXXX/state
Rule: ZONE_MOVABLE should always be after ZONE_NORMAL, never overlaps.
34
online_kernel
online_movable
offline online
offline and online again
Performancedown!
Movable node
ZONE_NORMAL
unmovablenode
cpu cpu
cpu cpu
ZONE_MOVABLE
movablenode
ZONE_NORMAL
ZONE_MOVABLE
unmovablenode
cpu cpu
cpu cpu
cpu cpu
cpu cpu
CPU in use
CPU idle movable memory
unmovable memory
CPU in use by kernel
35
• Drawback
No good enough way to solve this problem now.
Movable node
• Merged into Linux 3.8• Configuration
– mm/Kconfig
config MOVABLE_NODEboolean "Enable to assign a node which has only movable memory"depends on HAVE_MEMBLOCKdepends on NO_BOOTMEMdepends on X86_64depends on NUMAdefault n
36
Agenda
1. Why need Memory Hot-Plug2. ACPI & Memory Hot-Plug3. Memory hot-add4. Memory hot-remove5. Movable node6. Bootmem handling7. Future work
37
node0 node1
Bootmem handling
lowmemory
…
memblock.memory[]: All the present memory in the system.
memblock.reserve[]: Allocated memory.
memory1 memory2 memory4
node2memory5
node3memory3 memory6 memory7
38
• Memblock: A bootmem allocator– Consists of two arraies
…… …
Buddy systemorder
0
1
MAX
……
……
……
……
…
Free unused memory to buddy system at last.
Bootmem handling• Problem at boot time
– Bootmem allocator memblock may allocate hotpluggablememory for kernel at boot time.
Hotpluggablememory
Kernel data
Kernel data
ZONE_MOVABLE
Movable node
Not hot-removable39
2. Parse SRAT (too late).
Allocated by memblock
1. Memblock is ready.Boot time
3. Initialize ZONEs.
Hotpluggable memory ranges are unknown.
Bootmem handling
DEFAULTHOTPLUGGABLE
40
• Solution1. Parse SRAT earlier, before memblock starts to work.2. Introduce flags into memblock, and reserve hotpluggable
memory with a special flag in memblock.
2. Parse SRAT earlier.
Boot time
No memoryallocation.
1. Memblock is ready. 3. Initialize ZONEs.
ZONE_MOVABLE
Movable node
Hot-removable
Hotpluggablememory Kernel data
node0 node1
Boot time
Bootmem handling
lowmemory
…
memblock.memory[]
memblock.reserve[]
unhotpluggable memoryhotpluggable memory
memory1 memory2 memory4
node2memory5
node3memory3 memory6 memory7
Allocated memory ranges will be put into reserve[].
1. Memblock is ready.
empty
41
All the memory ranges in the system are put into memory[].
Boot time
node0 node1
Bootmem handling
lowmemory
…
memblock.memory[]
memblock.reserve[]
memory1 memory2 memory4
node2memory5
node3memory3 memory6 memory7
…
DEFAULT
unhotpluggable memoryhotpluggable memory
2. Before parsing SRAT.
• Reserve kernel _data, _text, setup data, … , with flag DEFAULT.
• Any node the kernel resides in is unhotpluggable.(Not necessary to be node 0)
• No new memory allocation, so no hotpluggable memory could be used by the kernel.
42
Boot time
node0 node1
Bootmem handling
lowmemory
…
memblock.memory[]
memblock.reserve[]
memory1 memory2 memory4
node2memory5
node3memory3 memory6 memory7
… HOTPLUGGABLE HOTPLUGGABLE
unhotpluggable memoryhotpluggable memory
3. Parsing SRAT.
Reserve hotpluggable memory with flag HOTPLUGGABLE.
43
DEFAULT
Boot time
node0 node1
Bootmem handling
lowmemory
…
memblock.memory[]
memblock.reserve[]
memory1 memory2 memory4
node2memory5
node3memory3 memory6 memory7
… HOTPLUGGABLE HOTPLUGGABLE … … …
unhotpluggable memoryhotpluggable memory
4. After parsing SRAT, hotpluggable memory has been reserved.
44
DEFAULT DEFAULT
No hotpluggable memory used by kernel.
Boot time
Bootmem handling
memblock.reserve[]
… HOTPLUGGABLE HOTPLUGGABLE … … …
unhotpluggable memoryhotpluggable memory
5. Memory initialization has been finished.
Free hotpluggable memory to buddy system. (NEW)
Buddy systemorder
0
1
MAX
……
……
……
……
45
DEFAULT DEFAULT
Agenda
1. Why need Memory Hot-Plug2. ACPI & Memory Hot-Plug3. Memory hot-add4. Memory hot-remove5. Movable node6. Bootmem handling7. Future work
46
Future work
• Node local pagetable and vmemmap.– Improve performance.
• Migrate user pages pinned in memory.– For those who pin pages for a long time.
• User space tools, like libnuma and numactl.– A library of functions.– Commands.
47