Presentation of Chapter 4, LINUX Kernel Internals

Zhihua (Scott) Jiang

Computer Science Department

University of Maryland, Baltimore County

Baltimore, MD 21250

<[email protected]>

Guideline

• The Architecture-independent Memory Model in LINUX

• The Virtual Address Space for a Process

• Block Device Caching

• Paging Under LINUX

The architecture-independent memory model

• Pages of Memory

• Virtual Address Space

• Converting the Linear Address

• The Page Directory

• The Page Middle Directory

• The Page Table

Pages of memory

• Defined by the PAGE_SIZE macro in asm/page.h

• For x86, the page size is 4 Kbytes

• Alpha uses 8 Kbytes
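The page size is a power of two, so address arithmetic reduces to shifts and masks. The following is a minimal user-space sketch assuming the x86 value of 4 Kbytes; the SKETCH_ names and helper functions are illustrative and are not the kernel's own definitions.

```c
#include <assert.h>
#include <limits.h>

/* Illustrative values for x86; an Alpha build would use a shift of 13. */
#define SKETCH_PAGE_SHIFT 12
#define SKETCH_PAGE_SIZE  (1UL << SKETCH_PAGE_SHIFT)   /* 4096 bytes */
#define SKETCH_PAGE_MASK  (~(SKETCH_PAGE_SIZE - 1))    /* clears the offset bits */

/* Round an address up to the next page boundary. */
static unsigned long sketch_page_align(unsigned long addr)
{
    assert(addr <= ULONG_MAX - (SKETCH_PAGE_SIZE - 1)); /* no wrap-around */
    return (addr + SKETCH_PAGE_SIZE - 1) & SKETCH_PAGE_MASK;
}

/* Number of whole pages needed to hold `bytes` bytes. */
static unsigned long sketch_pages_needed(unsigned long bytes)
{
    return sketch_page_align(bytes) >> SKETCH_PAGE_SHIFT;
}
```

For example, a request for 10000 bytes rounds up to three pages under these assumptions.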

Virtual address space

• A virtual address is given by a segment selector and the offset within the segment

• C pointers hold the offsets

• Segment selectors are defined in asm/segment.h

– KERNEL_DS (segment selector for kernel data)

– USER_DS (segment selector for user data)

• By changing the segment selector register, a system function can be given pointers into the kernel segment.

– Used by the UMSDOS file system to simulate a Unix file system

Continued

• The MMU of an x86 processor converts the virtual address to a linear address

• The linear address is 32 bits wide, giving a 4 Gbyte address space

– 3 Gbytes for the user segment

– 1 Gbyte for the kernel segment

• Alpha does not support segmentation

– Offset addresses for the user segment must not overlap with the offset addresses for the kernel segment

Converting the linear address

Figure: Linear address conversion in the architecture-independent memory model
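The conversion splits a linear address into an index for the page directory, an index for the page middle directory, an index for the page table, and an offset within the page. The sketch below shows that split with configurable field widths. The struct and function names are hypothetical. The x86 widths in the usage note (10, 0, 10, and 12 bits) assume the two-level x86 layout, where the middle directory is folded away.

```c
#include <assert.h>

/* Hypothetical layout description; the field names are illustrative only. */
struct sketch_layout {
    unsigned pgd_bits;   /* bits indexing the page directory */
    unsigned pmd_bits;   /* bits indexing the page middle directory (0 if folded) */
    unsigned pte_bits;   /* bits indexing the page table */
    unsigned off_bits;   /* bits for the offset within the page */
};

struct sketch_parts {
    unsigned long pgd, pmd, pte, offset;
};

/* Extract `bits` bits from `addr`, starting at bit position `shift`. */
static unsigned long sketch_field(unsigned long addr, unsigned shift, unsigned bits)
{
    if (bits == 0)
        return 0;
    return (addr >> shift) & ((1UL << bits) - 1);
}

static struct sketch_parts sketch_split(unsigned long addr, struct sketch_layout l)
{
    struct sketch_parts p;
    unsigned pte_shift = l.off_bits;
    unsigned pmd_shift = pte_shift + l.pte_bits;
    unsigned pgd_shift = pmd_shift + l.pmd_bits;

    /* All fields together must fit into one machine word. */
    assert(pgd_shift + l.pgd_bits <= 8 * sizeof(unsigned long));

    p.offset = sketch_field(addr, 0, l.off_bits);
    p.pte    = sketch_field(addr, pte_shift, l.pte_bits);
    p.pmd    = sketch_field(addr, pmd_shift, l.pmd_bits);
    p.pgd    = sketch_field(addr, pgd_shift, l.pgd_bits);
    return p;
}
```

With the layout {10, 0, 10, 12}, the address 0xC0101234 splits into page directory index 0x300, page table index 0x101, and offset 0x234.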

The virtual address space for a process

• The User Segment

• Virtual Memory Areas

• The System Call brk

• Mapping Functions

• The Kernel Segment

• Static Memory Allocation in the Kernel Segment

• Dynamic Memory Allocation in the Kernel Segment

The user segment

• In user mode, a process can access only the user segment

• Each process has its own page tables

• System call fork

– child and parent processes have different page directories and page tables

– however, the page tables for the kernel segment are shared by all processes

• System call clone

– old and new threads share the memory fully

Continued

• Some explanation of shared libraries in the user segment

– Originally, libraries were linked into one binary, which was efficient

– The drawback is the growth in the length of the binary

– Shared libraries are stored in separate files and loaded at program start

– Linked to static addresses

– ELF allows shared libraries to be loaded during program execution

– This requires compiled code with no absolute address references

Virtual memory areas

• A process does not use all of its functions at any one time

• Processes can share code if they run the same executable file

• A copy-on-write strategy is used for memory management

The system call brk

• The brk field points to the end of the BSS segment for non-statically initialized data

• Used for allocating or releasing dynamic memory

• The system call brk can be used to find the current value of the pointer or, subject to protection checks, to set it to a new one

• The request is rejected if the memory required exceeds the estimated size

• The function sys_brk() calls do_mmap() to map a private and anonymous area between the old and new values of brk
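From user space, the break can be queried and moved through the C library wrapper sbrk(). The sketch below assumes a Linux system with glibc. The function name sketch_probe_break is hypothetical, and the code only illustrates the query-and-set behaviour described above, not the kernel side.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <unistd.h>

/* Move the program break up by `bytes`, measure the change, then restore it.
 * Returns the observed growth in bytes, or -1 if the request was rejected. */
static long sketch_probe_break(long bytes)
{
    assert(bytes > 0);

    char *old_brk = sbrk(0);            /* query the current break */
    if (sbrk(bytes) == (void *)-1)      /* ask for more memory */
        return -1;
    char *new_brk = sbrk(0);            /* query again */

    long grown = (long)(new_brk - old_brk);
    sbrk(-bytes);                       /* release the area again */
    return grown;
}
```

The sketch assumes that nothing else (for example, a malloc() call in another thread) moves the break between the calls.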

Mapping functions

• The C library provides the following functions in sys/mman.h

– caddr_t mmap(caddr_t addr, size_t len, int prot, int flags, int fd, off_t off);

– int munmap(caddr_t addr, size_t len);

– int mprotect(caddr_t addr, size_t len, int prot);

– int msync;
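A short user-space sketch of the mapping calls follows. It assumes a current Linux C library, where the prototypes use void * rather than caddr_t and where MAP_ANONYMOUS is available. The function name sketch_map_cycle is hypothetical.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

/* Map `len` bytes of private anonymous memory, write to it, make it
 * read-only, and unmap it. Returns 0 on success, -1 on any failure. */
static int sketch_map_cycle(size_t len)
{
    assert(len > 0);

    unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    memset(p, 0xAB, len);                   /* touch every byte of the area */
    int ok = (p[0] == 0xAB && p[len - 1] == 0xAB);

    if (mprotect(p, len, PROT_READ) != 0)   /* drop write permission */
        ok = 0;
    if (munmap(p, len) != 0)                /* remove the mapping */
        ok = 0;
    return ok ? 0 : -1;
}
```

After the mprotect() call, a write to the area would raise a fault, so the sketch only reads the area before that point.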

The kernel segment

• On the x86 architecture, a system call is generally initiated by triggering software interrupt 128 (0x80).

• Every process in system mode sees the same kernel segment

• On the Alpha architecture, the kernel segment cannot start at address 0

• PAGE_OFFSET gives the offset between physical and virtual addresses

Static memory allocation in the kernel segment

• The initialization routine for character-oriented devices is called as follows

memory_start = console_init(memory_start, memory_end);

• The routine reserves memory by returning a value higher than the parameter memory_start

• The memory between memory_start and the return value can be used as desired by the initialized component
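The calling convention above amounts to a simple bump allocation at boot time. The following user-space sketch imitates it for an imaginary component. The names sketch_dev_init and sketch_dev_table and the size of 256 bytes are assumptions for illustration and are not taken from the kernel.

```c
#include <assert.h>

/* Hypothetical state for an imaginary device; not a real kernel structure. */
struct sketch_dev_table {
    unsigned long base;   /* start of the area claimed at boot */
    unsigned long size;   /* number of bytes claimed */
};

static struct sketch_dev_table sketch_table;

/* Take the current start of free memory, claim what the component needs,
 * and return the new start of free memory. */
static unsigned long sketch_dev_init(unsigned long memory_start,
                                     unsigned long memory_end)
{
    const unsigned long need = 256;      /* arbitrary illustrative size */

    assert(memory_start <= memory_end);
    if (memory_end - memory_start < need)
        return memory_start;             /* not enough room: reserve nothing */

    sketch_table.base = memory_start;
    sketch_table.size = need;
    return memory_start + need;          /* caller stores this as memory_start */
}
```

Because the caller assigns the return value back to memory_start, each subsequent initialization routine sees only the memory that is still unclaimed.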

Dynamic memory allocation in the kernel segment

• In the LINUX kernel, kmalloc() and kfree() are used for dynamic memory allocation

– void * kmalloc(size_t size, int priority);

– void kfree(void *obj);

• To increase efficiency, the memory reserved is not initialized

• In LINUX kernel 1.2, __get_free_pages() can only reserve contiguous areas of memory of 4, 8, 16, 32, 64, and 128 Kbytes in size

• kmalloc() can reserve far smaller areas of memory
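The six sizes listed for __get_free_pages() are the x86 page size doubled zero to five times, that is, 4 Kbytes shifted left by an order between 0 and 5. The sketch below works through that arithmetic. The SKETCH_ names and both helper functions are hypothetical.

```c
#include <assert.h>

#define SKETCH_PAGE_SIZE 4096UL   /* x86 page size */
#define SKETCH_MAX_ORDER 5        /* 4096 << 5 = 128 Kbytes */

/* Size in bytes of a contiguous block of the given order. */
static unsigned long sketch_order_size(unsigned order)
{
    assert(order <= SKETCH_MAX_ORDER);
    return SKETCH_PAGE_SIZE << order;
}

/* Smallest order whose block holds `bytes`, or -1 if it exceeds 128 Kbytes. */
static int sketch_order_for(unsigned long bytes)
{
    unsigned order;

    for (order = 0; order <= SKETCH_MAX_ORDER; order++)
        if (sketch_order_size(order) >= bytes)
            return (int)order;
    return -1;
}
```

A request for 4097 bytes already needs order 1 (8 Kbytes), which shows why a finer-grained allocator such as kmalloc() is useful for small areas.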

Continued

• sizes[] contains descriptors for the different sizes of memory area

– one manages memory suitable for DMA

– the other is responsible for ordinary memory

Continued

Structures for kmalloc

Continued

• kmalloc() and kfree() are restricted to the size of one page of memory

• vmalloc() and vfree() handle sizes that are a multiple of the size of one page of memory

• The maximum value of size is limited by the amount of physical memory available

• Memory reserved by vmalloc() won’t be copied to external storage

Continued

• vmalloc() compared with kmalloc()

– the size of the area of memory requested can be better adjusted to actual needs

– limited only by the size of free physical memory and not by its fragmentation (as kmalloc() is)

– does not return any physical address

– reserved memory can be non-consecutive pages

– not suitable for reserving memory for DMA

Block Device Caching

• Block Buffering

• The update and bdflush Processes

• List Structures for the Buffer Cache

• Using the Buffer Cache

Block Buffering

• Block size may be 512, 1024, 2048, or 4096 bytes

• Blocks are held in memory via a buffering system

• A special case applies for blocks taken from files opened with the flag O_SYNC

– These are transferred to disk every time their contents are modified

• Data is organized so that frequently requested data lie very close together and can be kept in the processor cache

The update and bdflush Processes

• At periodic intervals, the update process calls the system call bdflush with a parameter

• All modified buffer blocks are written back to disk, along with all superblock and inode information

• bdflush writes back the number of block buffers marked “dirty” given in the bdflush parameter

• It is always activated when a block is released by means of brelse()

• It is also activated when new block buffers are requested or the size of the buffer cache needs to be reduced

List structure for the buffer cache

• LINUX manages its block buffers via a number of different doubly linked lists

• Block buffers in use are managed in a set of special LRU lists

LRU list (index)   Description
BUF_CLEAN          Block buffers not managed in other lists - content matches relevant block on hard disk
BUF_UNSHARED       Block buffers formerly (but no longer) managed in BUF_SHARED
BUF_LOCKED         Locked block buffers (b_lock != 0)
BUF_LOCKED1        Locked block buffers for inodes and superblocks
BUF_DIRTY          Block buffers with contents not matching the relevant block on hard disk
BUF_SHARED         Block buffers situated in a page of memory mapped to the user segment of a process

The various LRU lists
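A set of doubly linked LRU lists indexed by list type can be sketched as follows. This is a simplified user-space illustration: struct sketch_buf is a hypothetical stand-in for the kernel's buffer descriptor, the sketch_ function names are invented, and the numeric values of the list indices are an assumption based on the order of the table.

```c
#include <assert.h>
#include <stddef.h>

/* List indices named as in the table; the numeric values are an assumption. */
enum { BUF_CLEAN, BUF_UNSHARED, BUF_LOCKED, BUF_LOCKED1, BUF_DIRTY, BUF_SHARED,
       SKETCH_NR_LIST };

/* Hypothetical, simplified buffer descriptor. */
struct sketch_buf {
    struct sketch_buf *prev, *next;
    int list;                        /* which LRU list the buffer is on */
};

/* One sentinel per list; an empty list points at itself. */
static struct sketch_buf sketch_lru[SKETCH_NR_LIST];

static void sketch_lru_init(void)
{
    for (int i = 0; i < SKETCH_NR_LIST; i++)
        sketch_lru[i].prev = sketch_lru[i].next = &sketch_lru[i];
}

/* Append at the tail, the most recently used end of the list. */
static void sketch_insert(struct sketch_buf *b, int list)
{
    assert(list >= 0 && list < SKETCH_NR_LIST);
    struct sketch_buf *head = &sketch_lru[list];

    b->list = list;
    b->prev = head->prev;
    b->next = head;
    head->prev->next = b;
    head->prev = b;
}

static void sketch_remove(struct sketch_buf *b)
{
    b->prev->next = b->next;
    b->next->prev = b->prev;
    b->prev = b->next = NULL;
}

/* Move a buffer to another list, e.g. BUF_CLEAN to BUF_DIRTY after a write. */
static void sketch_refile(struct sketch_buf *b, int new_list)
{
    sketch_remove(b);
    sketch_insert(b, new_list);
}

/* Least recently used buffer of a list, or NULL if the list is empty. */
static struct sketch_buf *sketch_oldest(int list)
{
    struct sketch_buf *head = &sketch_lru[list];
    return head->next == head ? NULL : head->next;
}
```

Using sentinel nodes keeps insertion and removal free of special cases for empty lists, which is one common way to build such lists and not necessarily how the kernel does it.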

Using the buffer cache

• The function bread() is called to read a block

• A variant of bread(), breada(), reads into the buffer cache not only the block requested but also a number of following blocks

Paging under LINUX

• Page Cache and Management

• Finding a Free Page

• Page Errors and Reloading a Page

Page Cache and Management

• LINUX can save pages to external media in 2 ways

– using a complete block device as the external medium, typically a partition on a hard disk

– using fixed-length files on a file system for its external storage

• Data that belong together are stored in a cache line (16 bytes)

Finding a free page

• __get_free_pages() is called to reserve physical pages of memory

– unsigned long __get_free_pages(int priority, unsigned long order, int dma);

Priority       Description
GFP_BUFFER     Free page to be returned only if free pages are still available in physical memory
GFP_ATOMIC     The function __get_free_page() must not interrupt the current process, but a page should be returned if possible
GFP_USER       The current process may be interrupted to swap pages
GFP_KERNEL     This parameter is the same as GFP_USER
GFP_NOBUFFER   The buffer cache won’t be reduced by an attempt to find a free page in memory
GFP_NFS        The difference between this and GFP_USER is that the number of pages reserved for GFP_ATOMIC is reduced from min_free_pages to five. This will speed up NFS operations

Priorities for the function __get_free_page()

Page errors and reloading a page

• do_page_fault() is called when a page fault interrupt is generated

– void do_page_fault(struct pt_regs *regs, unsigned long error_code);

• If the address is in a virtual memory area, the legality of the read or write operation is checked by reference to the flags for the virtual memory area; do_no_page() or do_wp_page() is then called
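The decision described above can be sketched as a small classification function. This is a simplified user-space illustration, not the kernel's code. The error code bits follow the x86 page-fault convention (bit 0 set for a protection violation on a present page, bit 1 set for a write access). The SKETCH_ names, the reduced permission flags, and the in_vma parameter are assumptions.

```c
#include <assert.h>

/* x86 page-fault error code bits. */
#define SKETCH_PF_PROT  0x1UL   /* 0: page not present, 1: protection violation */
#define SKETCH_PF_WRITE 0x2UL   /* 0: read access, 1: write access */

/* Simplified permission flags for a hypothetical virtual memory area. */
#define SKETCH_VM_READ  0x1U
#define SKETCH_VM_WRITE 0x2U

enum sketch_action {
    SKETCH_BAD_ACCESS,   /* outside any area, or operation not permitted */
    SKETCH_NO_PAGE,      /* page missing: the case handled by do_no_page() */
    SKETCH_WP_PAGE       /* write to a protected page: the do_wp_page() case */
};

static enum sketch_action sketch_classify(int in_vma, unsigned vm_flags,
                                          unsigned long error_code)
{
    int is_write = (error_code & SKETCH_PF_WRITE) != 0;
    int present  = (error_code & SKETCH_PF_PROT) != 0;

    assert((vm_flags & ~(SKETCH_VM_READ | SKETCH_VM_WRITE)) == 0);

    if (!in_vma)
        return SKETCH_BAD_ACCESS;
    if (is_write && !(vm_flags & SKETCH_VM_WRITE))
        return SKETCH_BAD_ACCESS;       /* area does not allow writing */
    if (!is_write && !(vm_flags & SKETCH_VM_READ))
        return SKETCH_BAD_ACCESS;       /* area does not allow reading */
    if (!present)
        return SKETCH_NO_PAGE;
    if (is_write)
        return SKETCH_WP_PAGE;
    return SKETCH_BAD_ACCESS;           /* read protection fault on a present page */
}
```

The write-protected case is where copy-on-write takes effect: a permitted write to a present but protected page leads to a private copy being made.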
