Operating System Design - Linux Instructor: Ching-Chi Hsu TA:Yung-Yu Chuang.

Post on 19-Jan-2016


Operating System Design - Linux

Instructor: Ching-Chi Hsu

TA:Yung-Yu Chuang

Introduction to Linux (Nov. 1991, Linus Torvalds)

• Multi-tasking

• Demand loading & Copy On Write

• Paging (not swapping)

• Shared Libraries

• POSIX 1003.1

• Protected Mode

• Support different file systems and executable formats

Multitasking

(Figure: CPU timelines for multitasking. A task runs until it requires service, and the CPU would otherwise sit idle while the request is served; with time-sharing, a timer interrupt also switches tasks when the time slice expires.)

• Based on i386 and Linux 2.0.33

• Topics
– initialization
– memory management (free space management, virtual memory management)
– process management (context switching, scheduling)
– system call

Resources for Tracing Linux

• http://odie.csie.ntu.edu.tw/~osd

• TLK, KHG, Linux Kernel Internals

• Source code browser

• Intel Programmer’s manual

Source Tree for Linux

/usr/src/linux
    arch
        i386
            kernel
            boot
            mm
        ????
    drivers
        char
        block
        scsi
        net
    fs
        nfs
        ext2
        proc
        ....
    include
        linux
        asm-i386
        asm-????
    init
    ipc
    kernel
    lib
    modules
    net

How to compile Linux Kernel

1. make config (or make menuconfig)
2. make depend
3. One of:
   make boot (generate a compressed bootable Linux kernel, arch/i386/boot/zImage)
   make zdisk (generate the kernel and write it to disk: dd if=zImage of=/dev/fd0)
   make zlilo (generate the kernel and copy it to /vmlinuz)

lilo: Linux Loader

i386

• Segmented Addressing (segment:offset)

• Paging(Virtual Memory)

• Call Gate (Protection)

• TSS (Context Switching)

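(Figure: segment translation. The SELECTOR holds an INDEX and a TI bit; TI picks the GDT (located by GDTR) or the LDT (located by LDTR), and INDEX picks a descriptor in that table. The descriptor's BASE plus the OFFSET gives the Linear Address, which must lie between BASE and BASE+LIMIT.)

(Figure: 8-byte descriptor layout. Bits 0-31: BASE 15:0, LIMIT 15:0. Bits 32-63: BASE 31:24, G, D, 0, AVL, LIMIT 19:16, P, DPL, S, TYPE, BASE 23:16.)

Desc., Call gate, TSS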

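(Figure: two-level paging. CR3 points to the page directory. The linear address is split into a directory index (ddd), a table index (ttt), and an offset (ooo). The PDE selects a page table, the PTE gives the address of a 4K page (Page Addr.) plus a present bit (P), and the offset is added to the page address. Pages that are not present in physical memory live on disk.)

(Figure: the 4GB linear address space with the OS at the top, and the privilege rings 3, 2, 1, 0.)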

Call Gate

A call through a TSS gate causes a context switch.

(Figure: a TSS gate refers to a TSS descriptor in the GDT, which locates the TSS holding CS, DS, ES, …, IP, SP0, SP1, SP2, SP3, CR3, ….)

CPU

• #RESET
– real-address mode
– self-test
– EAX contains error code
– EDX contains CPU id
– CR0

i386 Initialization

(Figure: CR0 layout with the PG, TS, EM, MP, and PE bits; the remaining bits are RESERVED.)

Register State

Register        Value
EFLAGS          0XXXX0002H
EIP             0000FFF0H
CS*             0F000H
DS**            0000H
SS              0000H
ES**            0000H
FS              0000H
GS              0000H
IDTR (base)     00000000H
IDTR (limit)    03FFH
DR7             0000H

* invisible part: 0FFFF0000 (base), 0FFFF (limit)
** invisible part: 0 (base), 0FFFF (limit)

FFFF0H: ROM-BIOS address
* do some tests
* initialize the interrupt vectors at physical address 0
* load the first sector of a bootable device to 0x7C00 (boot/bootsect.S)
* jump to 0x7C00 and run

Linux Kernel on Disk (vmlinux, 1,133,665 bytes)

(Figure: layout of the boot disk image /usr/src/linux/arch/i386/boot/zImage: bootsect.S (1 sector), Setup.S (4 sectors), then the self-extracting kernel image, which consists of a decompression module and the compressed kernel image (vmlinux.out, 455,321 bytes) produced from vmlinux (executable).)

(Figure sequence: memory layout below 1M during boot, with I/O & BIOS above A0000 and the A20 line gating access beyond 1M.)

1. The BIOS loads bootsect.S (0.5K bytes) at 7C00 and sets IP there.
2. bootsect.S copies itself to 90000 and IP continues in the copy.
3. Setup.S (2K bytes) is loaded at 90200.
4. vmlinux (508K bytes) is loaded at 10000 (64K).

SETUPSECS = 4               ! nr of setup-sectors
BOOTSEG   = 0x07C0          ! original address of boot-sector
INITSEG   = DEF_INITSEG     ! we move boot here - out of the way 0x9000
SETUPSEG  = DEF_SETUPSEG    ! setup starts here, 0x9020
SYSSEG    = DEF_SYSSEG      ! system loaded at 0x10000 (65536)

<omitted>

        mov     ax,#BOOTSEG
        mov     ds,ax
        mov     ax,#INITSEG
        mov     es,ax
        mov     cx,#256
        sub     si,si
        sub     di,di
        cld
        rep
        movsw
        jmpi    go,INITSEG      ! Execute moved bootsect
go:

Copy bootsect.S to 0x90000

<omit>
load_setup:
        xor     dx, dx          ! drive 0, head 0
        mov     cl,#0x02        ! sector 2, track 0
        mov     bx,#0x0200      ! address = 512, in INITSEG
        mov     ah,#0x02        ! service 2, nr of sectors
        mov     al,setup_sects  ! (assume all on head 0, track 0)
                                ! Setup_sects=4
        int     0x13            ! read it (BIOS routine)
        jnc     ok_load_setup   ! ok - continue

        push    ax              ! dump error code
        call    print_nl
        mov     bp, sp
        call    print_hex
        pop     ax

        jmp     load_setup
ok_load_setup:

Try to load setup.S from (drive 0, head 0, sector 2, track 0) to memory 0x90200

<omit>
! Print some inane message
        mov     ah,#0x03        ! read cursor pos
        xor     bh,bh
        int     0x10
        mov     cx,#9
        mov     bx,#0x0007      ! page 0, attribute 7 (normal)
        mov     bp,#msg1        ! .byte 13,10 .ascii "Loading"
        mov     ax,#0x1301      ! write string, move cursor
        int     0x10            ! BIOS routine

! ok, we've written the message, now
! we want to load the system (at 0x10000)
        mov     ax,#SYSSEG
        mov     es,ax           ! segment of 0x010000
        call    read_it         ! Read 508K to 0x10000 (64K), one . per track
        call    kill_motor      ! Stop floppy motor
        call    print_nl
<omit>
        jmpi    0, SETUPSEG     ! Jump to 0x90200 (setup.S)

Print "\nLoading"

setup.S

• Check memory size

• set keyboard, video adapter, get HD data

• switch to protected mode
– set GDT
– set IDT
– set PE bit (flush pipe)

start:
        jmp     start_of_setup
! ------------------------ start of header --------------------------------
!
! SETUP-header, must start at CS:2 (old 0x9020:2)
!
        .ascii  "HdrS"          ! Signature for SETUP-header
        .word   0x0201          ! Version number of header format
                                ! (must be >= 0x0105
                                ! else old loadlin-1.5 will fail)

<omit>
start_of_setup:

        …………… (check signature)

good_sig:
        mov     ax,cs           ! aka #SETUPSEG
        sub     ax,#DELTA_INITSEG ! aka #INITSEG
        mov     ds,ax           ! DS=9000

loader_ok:
! Get memory size (extended mem, kB)
        mov     ah,#0x88
        int     0x15
        mov     [2],ax          ! Store memory size in 0x90002 (bootsect.S)

<omit>
(disable interrupts)
(move kernel image to 0x1000)

end_move_self:
        lidt    idt_48          ! load idt with 0,0
        lgdt    gdt_48          ! load gdt with whatever appropriate

idt_48:
        .word   0
        .word   0, 0

gdt_48:
        .word   0x800
        .word   512+gdt, 0x9

          BASE            Limit
idt_48    0,0             0
gdt_48    0x9, 512+gdt    0x800 (2048)

gdt:
        .word   0,0,0,0         ! dummy

        .word   0,0,0,0         ! unused

        .word   0xFFFF          ! 4Gb - (0x100000*0x1000 = 4Gb)
        .word   0x0000          ! base address=0
        .word   0x9A00          ! code read/exec
        .word   0x00CF          ! granularity=4096, 386 (+5th nibble of limit)

        .word   0xFFFF          ! 4Gb - (0x100000*0x1000 = 4Gb)
        .word   0x0000          ! base address=0
        .word   0x9200          ! data read/write
        .word   0x00CF          ! granularity=4096, 386 (+5th nibble of limit)

(Figure: 8-byte descriptor layout. Bits 0-31: BASE 15:0, LIMIT 15:0. Bits 32-63: BASE 31:24, G, D, 0, AVL, LIMIT 19:16, P, DPL, S, TYPE, BASE 23:16.)

GDT entries, in order: null, not used, code, data.

code: BASE=0x00000000, LIMIT=0xFFFFF, G=1 (4G), DPL=0, type=1010 (code, non-conforming, r/x, not accessed)

data: BASE=0x00000000, LIMIT=0xFFFFF, G=1 (4G), DPL=0, type=0010 (data, expand-up, r/w, not accessed)

! that was painless, now we enable A20, no wrapped

        call    empty_8042
        mov     al,#0xD1        ! command write
        out     #0x64,al
        call    empty_8042
        mov     al,#0xDF        ! A20 on
        out     #0x60,al
        call    empty_8042

<omit>

        mov     ax,#1           ! protected mode (PE) bit
        lmsw    ax              ! This is it! Load into CR0
        jmp     flush_instr     ! Flush pipe
flush_instr:
        xor     bx,bx           ! Flag to indicate a boot

! NOTE: For high loaded big kernels we need a
!       jmpi    0x100000,KERNEL_CS
!
! but we yet haven't reloaded the CS register, so the default size
! of the target offset still is 16 bit.
! However, using an operand prefix (0x66), the CPU will properly
! take our 48 bit far pointer. (INTeL 80386 Programmer's Reference
! Manual, Mixing 16-bit and 32-bit code, page 16-6)
        db      0x66,0xea       ! prefix + jmpi-opcode
code32: dd      0x1000          ! will be set to 0x100000 for big kernels
        dw      KERNEL_CS       ! KERNEL_CS=0x10

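(Figure: selector layout, shown for KERNEL_CS = 0x10 = 0000 0000 0001 0000b. Bits 15..3: INDEX = 2; bit 2: TI = 0 (0: GDT, 1: LDT); bits 1..0: RPL = 0.)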

Decompress Kernel

startup_32:                             # (gcc entry point)
        cld
        cli
        movl $(KERNEL_DS),%eax          # KERNEL_DS=0x18
        mov %ax,%ds
        mov %ax,%es
        mov %ax,%fs
        mov %ax,%gs

<omit>

        lss SYMBOL_NAME(stack_start),%esp
        xorl %eax,%eax
1:      incl %eax                       # check that A20 really IS enabled
        movl %eax,0x000000              # loop forever if it isn't
        cmpl %eax,0x100000
        je 1b

        ( clear BSS )

/*
 * Do the decompression, and jump to the new kernel..
 */
        subl $16,%esp                   # place for structure on the stack
        pushl %esp                      # address of structure as first arg
        call SYMBOL_NAME(decompress_kernel)     # decompress kernel to 100000
        orl  %eax,%eax                  # gunzip 1.0.3
        jnz  3f
        xorl %ebx,%ebx
        ljmp $(KERNEL_CS), $0x100000    # jump to decompressed kernel

head.S (copy parameters from 0x90000)

Address   Contents
100000    head.S code (EIP)
101000    swapper_pg_dir
102000    pg0
103000    empty_bad_page
104000    empty_bad_page_table
105000    empty_zero_page
106000    stack, idt, gdt

(Figure: the same layout from 100000 to 106000, with swapper_pg_dir as PG_DIR and pg0 as PG0. CR3 points to PG_DIR; directory entries 0 and 768 both point to PG0, which maps the first 4M of physical memory.)

Setup Paging Table & Enable Paging

(Figure: the same layout from 100000 to 106000, with GDTR pointing to the gdt.)

Selector   GDT entry
           NULL
           0 (not used)
0x10       C0000000 1G DPL=0 code
0x18       C0000000 1G DPL=0 data
0x23       00000000 3G DPL=3 code
0x2b       00000000 3G DPL=3 data
           0 (not used)
           0 (not used)
           2*NR_TASKS entries

Setup GDT

(Figure: the same layout from 100000 to 106000, with IDTR pointing to the idt. All IDT entries, 0 through 255, are gates in the GDT code segment that point to ignore_int.)

Setup IDT

call setup_paging

setup_paging:
        movl $1024*2,%ecx               /* 2 pages - swapper_pg_dir+1 page table */
        xorl %eax,%eax
        movl $ SYMBOL_NAME(swapper_pg_dir),%edi /* swapper_pg_dir is at 0x1000 */
        cld;rep;stosl
/* Identity-map the kernel in low 4MB memory for ease of transition */
/* set present bit/user r/w */
        movl $ SYMBOL_NAME(pg0)+7,SYMBOL_NAME(swapper_pg_dir)
/* But the real place is at 0xC0000000 */
/* set present bit/user r/w */
        movl $ SYMBOL_NAME(pg0)+7,SYMBOL_NAME(swapper_pg_dir)+3072
        movl $ SYMBOL_NAME(pg0)+4092,%edi
        movl $0x03ff007,%eax            /* 4Mb - 4096 + 7 (r/w user,p) */
        std
1:      stosl                           /* fill the page backwards - more efficient :-) */
        subl $0x1000,%eax
        jge 1b
        cld

        movl $ SYMBOL_NAME(swapper_pg_dir),%eax
        movl %eax,%cr3                  /* cr3 - page directory start */
        movl %cr0,%eax
        orl $0x80000000,%eax
        movl %eax,%cr0                  /* set paging (PG) bit */
        ret                             /* this also flushes the prefetch-queue */

(Figure: entry layout. Bits 31..12: Page Address; bit 6: D; bit 5: A; bit 2: U/S; bit 1: R/W; bit 0: P.)

Format of PDE & PTE

        lgdt gdt_descr

gdt_descr:
        .word (8+2*NR_TASKS)*8-1
        .long 0xc0000000+SYMBOL_NAME(gdt)

ENTRY(gdt)
        .quad 0x0000000000000000        /* NULL descriptor */
        .quad 0x0000000000000000        /* not used */
        .quad 0xc0c39a000000ffff        /* 0x10 kernel 1GB code at 0xC0000000 */
        .quad 0xc0c392000000ffff        /* 0x18 kernel 1GB data at 0xC0000000 */
        .quad 0x00cbfa000000ffff        /* 0x23 user   3GB code at 0x00000000 */
        .quad 0x00cbf2000000ffff        /* 0x2b user   3GB data at 0x00000000 */
        .quad 0x0000000000000000        /* not used */
        .quad 0x0000000000000000        /* not used */
        .fill 2*NR_TASKS,8,0            /* space for LDT's and TSS's etc */

(setup data segments and clear BSS)

        call setup_idt

setup_idt:
        lea ignore_int,%edx
        movl $(KERNEL_CS << 16),%eax
        movw %dx,%ax            /* selector = 0x0010 = cs */
        movw $0x8E00,%dx        /* interrupt gate - dpl=0, present */

        lea SYMBOL_NAME(idt),%edi
        mov $256,%ecx
rp_sidt:
        movl %eax,(%edi)
        movl %edx,4(%edi)
        addl $8,%edi
        dec %ecx
        jne rp_sidt
        ret

(Figure: interrupt gate. Low 32 bits: SELECTOR, OFFSET (low half). High 32 bits: OFFSET (high half), 8E00.)

ignore_int: just prints "Unknown Interrupt"

        lidt idt_descr
        ljmp $(KERNEL_CS),$1f
1:      movl $(KERNEL_DS),%eax  # reload all the segment registers
        mov %ax,%ds             # after changing gdt.
        mov %ax,%es
        mov %ax,%fs
        mov %ax,%gs

        call SYMBOL_NAME(start_kernel)  # jump to C main routine

start_kernel

asmlinkage void start_kernel(void)
{
        setup_arch(&command_line, &memory_start, &memory_end);
        memory_start = paging_init(memory_start,memory_end);
        trap_init();
        init_IRQ();

        <-------------- omit ---------------->

        memory_start = console_init(memory_start,memory_end);

        memory_start = kmalloc_init(memory_start,memory_end);
        sti();                          /* enable interrupts */

        memory_start = inode_init(memory_start,memory_end);
        memory_start = file_table_init(memory_start,memory_end);
        memory_start = name_cache_init(memory_start,memory_end);

        mem_init(memory_start,memory_end);

        <---------- omit ------------->

        printk(linux_banner);

        sysctl_init();
        kernel_thread(init, NULL, 0);
        cpu_idle(NULL);
}

setup_arch

(Figure: the kernel sits at 1M; memory_start marks the end of the kernel image and memory_end the end of physical memory.)

memory_start = (unsigned long) &_end;

memory_end = (1<<20) + (EXT_MEM_K<<10);
memory_end &= PAGE_MASK;

#define PARAM empty_zero_page
#define EXT_MEM_K (*(unsigned short *) (PARAM+2))

init_task.mm->start_code = TASK_SIZE;   /* 0xC0000000 */
init_task.mm->end_code = TASK_SIZE + (unsigned long) &_etext;
init_task.mm->end_data = TASK_SIZE + (unsigned long) &_edata;
init_task.mm->brk = TASK_SIZE + (unsigned long) &_end;

/* "mem=XXX[kKmM]" overrides the BIOS-reported memory size */

if (c == ' ' && *(const unsigned long *)from == *(const unsigned long *)"mem=") {
        memory_end = simple_strtoul(from+4, &from, 0);
        if ( *from == 'K' || *from == 'k' ) {
                memory_end = memory_end << 10;
                from++;
        } else if ( *from == 'M' || *from == 'm' ) {
                memory_end = memory_end << 20;
                from++;
        }
}

paging_init

(Figure: page tables pg1 ... pgn are allocated at memory_start, after the kernel at 1M. Page directory entries 0, 1, ..., n and 768, 769, ... point to pg0, pg1, pg2, ..., pgn; each page table maps 4M.)

start_mem = PAGE_ALIGN(start_mem);
address = 0;
pg_dir = swapper_pg_dir;
while (address < end_mem) {

        /* map the memory at virtual addr 0xC0000000 */
        pg_table = (pte_t *) (PAGE_MASK & pgd_val(pg_dir[768]));
        if (!pg_table) {
                pg_table = (pte_t *) start_mem;
                start_mem += PAGE_SIZE;
        }

        /* also map it temporarily at 0x0000000 for init */
        pgd_val(pg_dir[0]) = _PAGE_TABLE | (unsigned long) pg_table;
        pgd_val(pg_dir[768]) = _PAGE_TABLE | (unsigned long) pg_table;
        pg_dir++;

        for (tmp = 0 ; tmp < PTRS_PER_PTE ; tmp++,pg_table++) {
                if (address < end_mem)
                        set_pte(pg_table, mk_pte(address, PAGE_SHARED));
                else
                        pte_clear(pg_table);
                address += PAGE_SIZE;
        }
}
local_flush_tlb();      /* move cr3, r?; mov r?, cr3; */
return free_area_init(start_mem, end_mem);

free_area_init

1. Set min_free_pages
2. Initialize swap cache
3. Mark all pages reserved
4. Initialize Buddy system for free memory management

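Free Memory Management (Tanenbaum)

• Bitmap

• Linked list (first-fit, next-fit, best-fit, quick-fit)

(Figure: 16 allocation units, with boundaries numbered 0 2 4 6 8 10 12 14 16.)

Bitmap (1 = hole): 0011000011100100

Linked list (P = process, H = hole; start, length):
P 0 2, H 2 2, P 4 4, H 8 3, P 11 2, H 13 1, P 14 2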

Buddy System

(Figure: buddy system example over 16 pages, with boundaries numbered 0 2 4 6 8 10 12 14 16. Steps, in order: Initialization; request A (2); request B (1); request C (2); free A; request D (1); free B; free D; free C.)

(Figure sequence: state of the free_area lists for orders 0 to 3, their bitmaps, and the mem_map pages 0 to 15 after each of the steps Request D (1), Free B, Free D, Free C, and Request 2.)

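(Figure: memory layout built by paging_init and free_area_init, in order: Kernel, pg1-pgn, swap cache, mem_map, free_area[].bitmap, then start_mem. One of the areas is annotated "(4 bytes per page)".)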

typedef struct page {
        /* these must be first (free area handling) */
        struct page *next;
        struct page *prev;
        struct inode *inode;
        unsigned long offset;
        ………..
        atomic_t count;
        unsigned flags;
        unsigned dirty:16, age:8;
        ……...
        unsigned long map_nr;   /* page->map_nr == page - mem_map */
} mem_map_t;

(Figure: initial state of free_area for orders 0 to 3, with all bitmap bits 0, over mem_map pages 0 to 15.)

unsigned long free_area_init(unsigned long start_mem, unsigned long end_mem)
{
        /*
         * select nr of pages we try to keep free for important stuff
         * with a minimum of 48 pages. This is totally arbitrary
         */
        i = (end_mem - PAGE_OFFSET) >> (PAGE_SHIFT+7);
        if (i < 24)
                i = 24;
        i += 24;        /* The limit for buffer pages in __get_free_pages is
                         * decreased by 12+(i>>3) */
        min_free_pages = i;

        start_mem = init_swap_cache(start_mem, end_mem);
        mem_map = (mem_map_t *) start_mem;
        p = mem_map + MAP_NR(end_mem);
        start_mem = LONG_ALIGN((unsigned long) p);
        memset(mem_map, 0, start_mem - (unsigned long) mem_map);

        do {
                --p;
                p->flags = (1 << PG_DMA) | (1 << PG_reserved);
                p->map_nr = p - mem_map;
        } while (p > mem_map);

        /* NR_MEM_LISTS is 6 */
        for (i = 0 ; i < NR_MEM_LISTS ; i++) {
                unsigned long bitmap_size;
                init_mem_queue(free_area+i);
                mask += mask;                           /* mask *=2 */
                end_mem = (end_mem + ~mask) & mask;
                /* should be i+1 */
                bitmap_size = (end_mem - PAGE_OFFSET) >> (PAGE_SHIFT + i);
                bitmap_size = (bitmap_size + 7) >> 3;
                bitmap_size = LONG_ALIGN(bitmap_size);
                free_area[i].map = (unsigned int *) start_mem;
                memset((void *) start_mem, 0, bitmap_size);
                start_mem += bitmap_size;
        }
        return start_mem;
}

trap_init

1. Setup interrupt routines
2. Int 0x80 for system call
3. Setup TSS and LDT in GDT for each task

486 Exceptions

Vector   Class      Description
0        Fault      Divide by Zero
1        Fault      Debug
…..
0B       Fault      Not Present
…..
0D       Fault      General Protection
0E       Fault      Page Fault
…..
20-FF    Int/Trap   Used for OS

void trap_init(void)
{
        set_call_gate(&default_ldt,lcall7);
        set_trap_gate(0,&divide_error);
        set_trap_gate(1,&debug);
        set_trap_gate(2,&nmi);
        set_system_gate(3,&int3);       /* int3-5 can be called from all */
        set_system_gate(4,&overflow);
        set_system_gate(5,&bounds);
        set_trap_gate(6,&invalid_op);
        set_trap_gate(7,&device_not_available);
        set_trap_gate(8,&double_fault);
        set_trap_gate(9,&coprocessor_segment_overrun);
        set_trap_gate(10,&invalid_TSS);
        set_trap_gate(11,&segment_not_present);
        set_trap_gate(12,&stack_segment);
        set_trap_gate(13,&general_protection);
        set_trap_gate(14,&page_fault);
        set_trap_gate(15,&spurious_interrupt_bug);
        set_trap_gate(16,&coprocessor_error);
        set_trap_gate(17,&alignment_check);

        for (i=18;i<48;i++)
                set_trap_gate(i,&reserved);
        set_system_gate(0x80,&system_call);
        /* set up GDT task & ldt entries */
        p = gdt+FIRST_TSS_ENTRY;
        set_tss_desc(p, &init_task.tss);        /* init_task: hardwired task #0 */
        p++;
        set_ldt_desc(p, &default_ldt, 1);
        p++;

        for(i=1 ; i<NR_TASKS ; i++) {
                p->a=p->b=0;
                p++;
                p->a=p->b=0;
                p++;
        }

set_call_gate(a, addr) set_gate(a, 12, 3, addr)

set_trap_gate(n, addr) set_gate(&idt[n], 15, 0, addr)

set_system_gate(n, addr) set_gate(&idt[n], 15, 3, addr)

set_intr_gate(n, addr) set_gate(&idt[n], 14, 0, addr)

#define _set_gate(gate_addr,type,dpl,addr) \
__asm__ __volatile__ ("movw %%dx,%%ax\n\t" \
        "movw %2,%%dx\n\t" \
        "movl %%eax,%0\n\t" \
        "movl %%edx,%1" \
        :"=m" (*((long *) (gate_addr))), \
         "=m" (*(1+(long *) (gate_addr))) \
        :"i" ((short) (0x8000+(dpl<<13)+(type<<8))), \
         "d" ((char *) (addr)),"a" (KERNEL_CS << 16) \
        :"ax","dx")

(Figure: gate descriptor layout. Bits 0-31: SEGMENT SELECTOR, OFFSET 15:0. Bits 32-63: OFFSET 31:16, P, DPL, TYPE, 000, RESERVED.)

Descriptor in IDT

mem_init

• Reserve kernel and I/O pages

• Return all unused pages to buddy system

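(Figure: physical memory as seen by mem_init. Low memory from start_low_mem (4K) up to 0xA0000 is returned to the buddy system; 0xA0000 to 0x100000 is reserved; from 0x100000 come the kernel code (up to end_text) and data, followed by pg1-pgn, swap_cache, mem_map, free_area[].map, and the Console, PCI & FS allocations; memory from start_mem to high_mem is returned to the buddy system.)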

void mem_init(unsigned long start_mem, unsigned long end_mem)
{
        end_mem &= PAGE_MASK;
        high_memory = end_mem;

        /* mark usable pages in the mem_map[] */
        start_low_mem = PAGE_ALIGN(start_low_mem);

        start_mem = PAGE_ALIGN(start_mem);

        /*
         * IBM messed up *AGAIN* in their thinkpad: 0xA0000 -> 0x9F000.
         * They seem to have done something stupid with the floppy
         * controller as well..
         */
        while (start_low_mem < 0x9f000) {
                clear_bit(PG_reserved, &mem_map[MAP_NR(start_low_mem)].flags);
                start_low_mem += PAGE_SIZE;
        }

        while (start_mem < high_memory) {
                clear_bit(PG_reserved, &mem_map[MAP_NR(start_mem)].flags);
                start_mem += PAGE_SIZE;
        }

        for (tmp = 0 ; tmp < high_memory ; tmp += PAGE_SIZE) {
                if (tmp >= MAX_DMA_ADDRESS)     /* 16M */
                        clear_bit(PG_DMA, &mem_map[MAP_NR(tmp)].flags);
                if (PageReserved(mem_map+MAP_NR(tmp))) {
                        if (tmp >= 0xA0000 && tmp < 0x100000)
                                reservedpages++;
                        else if (tmp < (unsigned long) &_etext)
                                codepages++;
                        else
                                datapages++;
                        continue;
                }
                mem_map[MAP_NR(tmp)].count = 1;

                free_page(tmp);
        }

        tmp = nr_free_pages << PAGE_SHIFT;

        printk("Memory: %luk/%luk available (%dk kernel code, %dk reserved, %dk data)\n",
                tmp >> 10,
                high_memory >> 10,
                codepages << (PAGE_SHIFT-10),
                reservedpages << (PAGE_SHIFT-10),
                datapages << (PAGE_SHIFT-10));

        return;
}

#define free_page(addr) free_pages((addr),0)

void free_pages(unsigned long addr, unsigned long order)
{
        unsigned long map_nr = MAP_NR(addr);

        if (map_nr < MAP_NR(high_memory)) {
                mem_map_t * map = mem_map + map_nr;
                if (PageReserved(map))
                        return;
                if (atomic_dec_and_test(&map->count)) {
                        delete_from_swap_cache(map_nr);
                        free_pages_ok(map_nr, order);
                        return;
                }
        }
}

static inline void free_pages_ok(unsigned long map_nr, unsigned long order)
{
        struct free_area_struct *area = free_area + order;
        unsigned long index = map_nr >> (1 + order);
        unsigned long mask = (~0UL) << order;

        cli();

#define list(x) (mem_map+(x))
        map_nr &= mask;

        nr_free_pages -= mask;          /* -mask = 1+~mask */
        while (mask + (1 << (NR_MEM_LISTS-1))) {
                if (!change_bit(index, area->map) )
                        break;
                remove_mem_queue(list(map_nr ^ -mask)); /* neighbor */
                mask <<= 1;
                area++;
                index >>= 1;
                map_nr &= mask;
        }
        add_mem_queue(area, list(map_nr));
#undef list
}

extern inline unsigned long get_free_page(int priority)
{
        unsigned long page;

        page = __get_free_page(priority);
        if (page)
                memset((void *) page, 0, PAGE_SIZE);
        return page;
}

#define __get_free_page(priority) __get_free_pages((priority),0,0)

unsigned long __get_free_pages(int priority, unsigned long order, int dma)
{
        unsigned long flags;
        int reserved_pages;

        if (order >= NR_MEM_LISTS)
                return 0;
        if (intr_count && priority != GFP_ATOMIC) {
                static int count = 0;
                if (++count < 5) {
                        printk("gfp called nonatomically from interrupt %p\n",
                                __builtin_return_address(0));
                        priority = GFP_ATOMIC;
                }
        }
        reserved_pages = 5;
        if (priority != GFP_NFS)
                reserved_pages = min_free_pages;
        if ((priority == GFP_BUFFER || priority == GFP_IO) && reserved_pages >= 48)
                reserved_pages -= (12 + (reserved_pages>>3));
        save_flags(flags);

repeat:
        cli();
        if ((priority==GFP_ATOMIC) || nr_free_pages > reserved_pages) {
                RMQUEUE(order, dma);
                restore_flags(flags);
                return 0;
        }
        restore_flags(flags);
        if (priority != GFP_BUFFER && try_to_free_page(priority, dma, 1))
                goto repeat;
        return 0;
}

/*
 * Some ugly macros to speed up __get_free_pages()..
 */
#define MARK_USED(index, order, area) \
        change_bit((index) >> (1+(order)), (area)->map)
#define CAN_DMA(x) (PageDMA(x))
#define ADDRESS(x) (PAGE_OFFSET + ((x) << PAGE_SHIFT))

#define RMQUEUE(order, dma) \
do { struct free_area_struct * area = free_area+order; \
     unsigned long new_order = order; \
        do { struct page *prev = memory_head(area), *ret; \
                while (memory_head(area) != (ret = prev->next)) { \
                        if (!dma || CAN_DMA(ret)) { \
                                unsigned long map_nr = ret->map_nr; \
                                (prev->next = ret->next)->prev = prev; \
                                MARK_USED(map_nr, new_order, area); \
                                nr_free_pages -= 1 << order; \
                                EXPAND(ret, map_nr, order, new_order, area); \
                                restore_flags(flags); \
                                return ADDRESS(map_nr); \
                        } \
                        prev = ret; \
                } \
                new_order++; area++; \
        } while (new_order < NR_MEM_LISTS); \
} while (0)

#define EXPAND(map,index,low,high,area) \
do { unsigned long size = 1 << high; \
        while (high > low) { \
                area--; high--; size >>= 1; \
                add_mem_queue(area, map); \
                MARK_USED(index, high, area); \
                index += size; \
                map += size; \
        } \
        map->count = 1; \
        map->age = PAGE_INITIAL_AGE; \
} while (0)

kernel_thread

call sys_clone();

if (StackIsChanged() /* new process */) {
        call fn(args);
        sys_exit();
} else {
        /* do nothing */
        /* task[0] goes through here */
}

CPU_idle()

sys_idle()

schedule()

static inline pid_t kernel_thread(int (*fn)(void *), void * arg, unsigned long flags)
{
        long retval;

        __asm__ __volatile__(
                "movl %%esp,%%esi\n\t"
                "int $0x80\n\t"         /* Linux/i386 system call */
                "cmpl %%esp,%%esi\n\t"  /* child or parent? */
                "je 1f\n\t"             /* parent - jump */
                "pushl %3\n\t"          /* push argument */
                "call *%4\n\t"          /* call fn */
                "movl %2,%0\n\t"        /* exit */
                "int $0x80\n"
                "1:\t"
                :"=a" (retval)
                :"0" (__NR_clone), "i" (__NR_exit),
                 "r" (arg), "r" (fn),
                 "b" (flags | CLONE_VM)
                :"si");
        return retval;
}

System Calls

/*
 * This file contains the system call numbers. (unistd.h)
 */

#define __NR_setup                0     /* used only by init, to get system going */
#define __NR_exit                 1
#define __NR_fork                 2
#define __NR_read                 3
#define __NR_write                4
#define __NR_open                 5
……..
#define __NR_clone              120
……..
#define __NR_sched_rr_get_interval      161
#define __NR_nanosleep          162
#define __NR_mremap             163

.data   /* entry.S */
ENTRY(sys_call_table)
        .long SYMBOL_NAME(sys_setup)            /* 0 */
        .long SYMBOL_NAME(sys_exit)
        .long SYMBOL_NAME(sys_fork)
        .long SYMBOL_NAME(sys_read)
        .long SYMBOL_NAME(sys_write)
        .long SYMBOL_NAME(sys_open)             /* 5 */
        ……..
        .long SYMBOL_NAME(sys_clone)            /* 120 */
        ……..
        .long SYMBOL_NAME(sys_sched_rr_get_interval)
        .long SYMBOL_NAME(sys_nanosleep)
        .long SYMBOL_NAME(sys_mremap)
        .long 0,0
        .long SYMBOL_NAME(sys_vm86)
        .space (NR_syscalls-166)*4              /* 256 */

Pseudo Code for System Call

if (sys_call_num >= NR_syscalls)
        return -ENOSYS;
else {
        if (sys_call_table[sys_call_num] == NULL)
                return -ENOSYS;
        if (PF_TRACESYS) {
                syscall_trace();
                call sys_call_table[sys_call_num];
                syscall_trace();
        } else
                call sys_call_table[sys_call_num];
}

ENTRY(system_call)
        pushl %eax                      # save orig_eax, for syscall_trace (strace)
        SAVE_ALL

STACK

         0(%esp) - %ebx
         4(%esp) - %ecx
         8(%esp) - %edx
         C(%esp) - %esi
        10(%esp) - %edi
        14(%esp) - %ebp         # SAVE_ALL
        18(%esp) - %eax
        1C(%esp) - %ds
        20(%esp) - %es
        24(%esp) - %fs
        28(%esp) - %gs
        2C(%esp) - orig_eax     # pushl %eax
        30(%esp) - %eip
        34(%esp) - %cs          # push by CPU, int 0x80
        38(%esp) - %eflags
        3C(%esp) - %oldesp      # push by CPU, stack switching
        40(%esp) - %oldss

        movl $-ENOSYS,EAX(%esp)
        cmpl $(NR_syscalls),%eax        # EAX=SYS_CALL_NUM
        jae ret_from_sys_call
        movl SYMBOL_NAME(sys_call_table)(,%eax,4),%eax
        testl %eax,%eax
        je ret_from_sys_call

        ……..
        testb $0x20,flags(%ebx)         # PF_TRACESYS
        jne 1f
        call *%eax
        movl %eax,EAX(%esp)             # save the return value
        jmp ret_from_sys_call
        ALIGN
1:      call SYMBOL_NAME(syscall_trace)
        movl ORIG_EAX(%esp),%eax
        call SYMBOL_NAME(sys_call_table)(,%eax,4)
        movl %eax,EAX(%esp)             # save the return value

        call SYMBOL_NAME(syscall_trace)

sys_clone

asmlinkage int sys_clone(struct pt_regs regs)
{
        unsigned long clone_flags;
        unsigned long newsp;

        clone_flags = regs.ebx;
        newsp = regs.ecx;
        if (!newsp)
                newsp = regs.esp;
        return do_fork(clone_flags, newsp, &regs);
}

do_fork

• Copy process structure from parent

int do_fork(unsigned long clone_flags, unsigned long usp, struct pt_regs *regs)
{
        int nr;
        int error = -ENOMEM;
        unsigned long new_stack;
        struct task_struct *p;

        p = (struct task_struct *) kmalloc(sizeof(*p), GFP_KERNEL);
        if (!p)
                goto bad_fork;
        new_stack = alloc_kernel_stack();       /* get_free_page(GFP_KERNEL) */
        if (!new_stack)
                goto bad_fork_free_p;
        error = -EAGAIN;
        nr = find_empty_process();
        if (nr < 0)
                goto bad_fork_free_stack;

        *p = *current;

        if (p->exec_domain && p->exec_domain->use_count)
                (*p->exec_domain->use_count)++;
        if (p->binfmt && p->binfmt->use_count)
                (*p->binfmt->use_count)++;

        p->did_exec = 0;
        p->swappable = 0;
        p->kernel_stack_page = new_stack;
        *(unsigned long *) p->kernel_stack_page = STACK_MAGIC;
        p->state = TASK_UNINTERRUPTIBLE;
        p->flags &= ~(PF_PTRACED|PF_TRACESYS|PF_SUPERPRIV);
        p->flags |= PF_FORKNOEXEC;
        p->pid = get_pid(clone_flags);
        p->next_run = NULL;
        p->prev_run = NULL;
        p->p_pptr = p->p_opptr = current;
        p->p_cptr = NULL;
        init_waitqueue(&p->wait_chldexit);
        p->signal = 0;

        p->it_real_value = p->it_virt_value = p->it_prof_value = 0;
        p->it_real_incr = p->it_virt_incr = p->it_prof_incr = 0;
        init_timer(&p->real_timer);
        p->real_timer.data = (unsigned long) p;
        p->leader = 0;          /* session leadership doesn't inherit */
        p->tty_old_pgrp = 0;
        p->utime = p->stime = 0;
        p->cutime = p->cstime = 0;

        p->start_time = jiffies;
        task[nr] = p;
        SET_LINKS(p);
        nr_tasks++;

        error = -ENOMEM;
        /* copy all the process information */
        if (copy_files(clone_flags, p))
                goto bad_fork_cleanup;
        if (copy_fs(clone_flags, p))
                goto bad_fork_cleanup_files;

        if (copy_sighand(clone_flags, p))
                goto bad_fork_cleanup_fs;
        if (copy_mm(clone_flags, p))
                goto bad_fork_cleanup_sighand;
        copy_thread(nr, clone_flags, usp, p, regs);
        p->semundo = NULL;

        /* ok, now we should be set up.. */
        p->swappable = 1;
        p->exit_signal = clone_flags & CSIGNAL;
        p->counter = (current->counter >>= 1);
        wake_up_process(p);     /* state=TASK_RUNNING, insert into run_queue */
        ++total_forks;
        return p->pid;

        /* error handler */
}

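Process's Virtual Memory

(Figure: the mm field of task_struct points to an mm_struct (count, pgd, mmap, mmap_avl, mmap_sem). mmap heads a list of vm_area_struct entries linked by vm_next; each holds vm_end, vm_start, vm_flags, vm_inode, and vm_ops. In the example one area describes the code and the next the data. vm_ops points to the operations nopage, wppage, swapout, ….)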

struct mm_struct {
        int count;
        pgd_t * pgd;
        unsigned long start_code, end_code, start_data, end_data;
        unsigned long start_brk, brk, start_stack, start_mmap;
        unsigned long arg_start, arg_end, env_start, env_end;
        unsigned long rss, total_vm, locked_vm;
        unsigned long def_flags;
        struct vm_area_struct * mmap;
        struct vm_area_struct * mmap_avl;
        struct semaphore mmap_sem;
};

#define INIT_MM { \
                1, \
                swapper_pg_dir, \
                0, 0, 0, 0, \
                0, 0, 0, 0, \
                0, 0, 0, 0, \
                0, 0, 0, \
                0, \
                &init_mmap, &init_mmap, MUTEX }

struct vm_area_struct {
        struct mm_struct * vm_mm;       /* VM area parameters */
        unsigned long vm_start;
        unsigned long vm_end;
        pgprot_t vm_page_prot;
        unsigned short vm_flags;
/* AVL tree of VM areas per task, sorted by address */
        short vm_avl_height;
        struct vm_area_struct * vm_avl_left;
        struct vm_area_struct * vm_avl_right;
/* linked list of VM areas per task, sorted by address */
        struct vm_area_struct * vm_next;
/* more */
        struct vm_operations_struct * vm_ops;
        unsigned long vm_offset;
        struct inode * vm_inode;
        unsigned long vm_pte;           /* shared mem */
};

#define INIT_MMAP { &init_mm, 0, 0x40000000, PAGE_SHARED, VM_READ | VM_WRITE | VM_EXEC }

copy_thread

Copy TSS from parent and set some private fields

void copy_thread(int nr, unsigned long clone_flags, unsigned long esp,
        struct task_struct * p, struct pt_regs * regs)
{
        int i;
        struct pt_regs * childregs;

        p->tss.es = KERNEL_DS;
        p->tss.cs = KERNEL_CS;
        p->tss.ss = KERNEL_DS;
        p->tss.ds = KERNEL_DS;
        p->tss.fs = USER_DS;
        p->tss.gs = KERNEL_DS;
        p->tss.ss0 = KERNEL_DS;
        p->tss.esp0 = p->kernel_stack_page + PAGE_SIZE;
        p->tss.tr = _TSS(nr);
        childregs = ((struct pt_regs *) (p->kernel_stack_page + PAGE_SIZE)) - 1;
        p->tss.esp = (unsigned long) childregs;
        p->tss.eip = (unsigned long) ret_from_sys_call;
        *childregs = *regs;

        childregs->eax = 0;
        childregs->esp = esp;
        p->tss.back_link = 0;
        p->tss.eflags = regs->eflags & 0xffffcfff;      /* iopl is always 0 for a new process */
        p->tss.ldt = _LDT(nr);
        set_tss_desc(gdt+(nr<<1)+FIRST_TSS_ENTRY,&(p->tss));

        p->tss.bitmap = offsetof(struct thread_struct,io_bitmap);
        for (i = 0; i < IO_BITMAP_SIZE+1 ; i++) /* IO bitmap is actually SIZE+1 */
                p->tss.io_bitmap[i] = ~0;
}

ret_from_sys_call

• All slow interrupts and system calls end here

ret_from_sys_call:
        cmpl $0,SYMBOL_NAME(intr_count)         /* handle interrupts */
        jne 2f
9:      movl SYMBOL_NAME(bh_mask),%eax
        andl SYMBOL_NAME(bh_active),%eax
        jne handle_bottom_half

1:      sti
        cmpl $0,SYMBOL_NAME(need_resched)       /* to see if we need reschedule */
        jne reschedule
        ………….

2:      RESTORE_ALL

#define RESTORE_ALL \
        ………….. \
        popl %ebx; \
        popl %ecx; \
        popl %edx; \
        popl %esi; \
        popl %edi; \
        popl %ebp; \
        popl %eax; \
        pop %ds; \
        pop %es; \
        pop %fs; \
        pop %gs; \
        addl $4,%esp; \
        iret

schedule

• task->counter: dynamic priority

• task->priority: static priority

• timer interrupt (100Hz):

        jiffies++
        if (current->counter <= 0)
                need_resched = 1;

• run queue: links all RUNNABLE tasks

asmlinkage void schedule(void)
{
        int c;
        struct task_struct * p;
        struct task_struct * prev, * next;
        unsigned long timeout = 0;

        /* check alarm, wake up any interruptible tasks that have got a signal */

        allow_interrupts();

        if (intr_count)
                goto scheduling_in_interrupt;

        if (bh_active & bh_mask) {
                intr_count = 1;
                do_bottom_half();
                intr_count = 0;
        }

        need_resched = 0;
        prev = current;
        cli();
        /* move an exhausted RR process to be last.. */
        if (!prev->counter && prev->policy == SCHED_RR) {
                prev->counter = prev->priority;
                move_last_runqueue(prev);
        }
        ………….
        p = init_task.next_run;
        sti();
        c = -1000;
        next = idle_task;
        while (p != &init_task) {
                int weight = goodness(p, prev, this_cpu);
                if (weight > c)
                        c = weight, next = p;
                p = p->next_run;
        }

        /* if all runnable processes have "counter == 0", re-calculate counters */
        if (!c) {
                for_each_task(p)
                        p->counter = (p->counter >> 1) + p->priority;
        }
        if (prev != next) {
                kstat.context_swtch++;
                …………..
                switch_to(prev,next);
        }
        return;
}

#define switch_to(prev,next) do { \
__asm__("movl %2,"SYMBOL_NAME_STR(current_set)"\n\t" \
        "ljmp %0\n\t" \
        ……………..
        : /* no outputs */ \
        :"m" (*(((char *)&next->tss.tr)-4)), \
         "r" (prev), "r" (next)); \
} while (0)

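(Figure: process #1 executes int 80 and enters system_call; at ret_from_sys_call, need_resched is set, so schedule is called and switch_to transfers control to process #2. Process #2 resumes after its own earlier switch_to, returns to ret_from_sys_call, and leaves the kernel with iret.)

Process Switching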

Page Fault

When a page fault occurs:

• The CPU pushes onto the stack: old SS, old ESP, EFLAGS, CS, EIP, error_code (error_code ends up on top)
• error_code bits: U/S, W/R, P
• CR2 contains the fault address
• Jump to the interrupt handling routine for int 0x0E

ENTRY(page_fault)
        pushl $ SYMBOL_NAME(do_page_fault)
        jmp error_code

STACK

         0(%esp) - %ebx
         4(%esp) - %ecx
         8(%esp) - %edx
         C(%esp) - %esi
        10(%esp) - %edi
        14(%esp) - %ebp         # pushl …..
        18(%esp) - %eax
        1C(%esp) - %ds
        20(%esp) - %es
        24(%esp) - %fs
        28(%esp) - %gs          # addr. of do_page_fault
        2C(%esp) - orig_eax     # error_code pushed by CPU
        30(%esp) - %eip
        34(%esp) - %cs          # push by CPU, exception 0x0E
        38(%esp) - %eflags
        3C(%esp) - %oldesp      # push by CPU, stack switching
        40(%esp) - %oldss

error_code:
        push %fs
        push %es
        push %ds
        pushl %eax
        xorl %eax,%eax
        pushl %ebp
        pushl %edi
        pushl %esi
        pushl %edx
        decl %eax                       # eax = -1
        pushl %ecx
        pushl %ebx
        cld
        xorl %ebx,%ebx                  # zero ebx
        xchgl %eax, ORIG_EAX(%esp)      # orig_eax (get the error code. )
        mov %gs,%bx                     # get the lower order bits of gs
        movl %esp,%edx
        xchgl %ebx, GS(%esp)            # get the address and save gs.
        pushl %eax                      # push the error code (argument)
        pushl %edx

        movl $(KERNEL_DS),%edx
        mov %dx,%ds
        mov %dx,%es
        movl $(USER_DS),%edx
        mov %dx,%fs

        movl SYMBOL_NAME(current_set),%eax

        call *%ebx                      # call do_page_fault

        addl $8,%esp                    # make a similar stack as system call

        jmp ret_from_sys_call

do_page_fault

• This routine handles page faults. It determines the address, and the problem, and then passes it off to one of the appropriate routines.

• error_code:

bit 0 == 0 means no page found,

1 means protection fault

bit 1 == 0 means read, 1 means write

bit 2 == 0 means kernel, 1 means user-mode

asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long error_code)
{
        void (*handler)(struct task_struct *,
                        struct vm_area_struct *,
                        unsigned long,
                        int);
        struct task_struct *tsk = current;
        struct mm_struct *mm = tsk->mm;
        struct vm_area_struct * vma;
        ….

        /* get the address */
        __asm__("movl %%cr2,%0":"=r" (address));
        vma = find_vma(mm, address);
        if (!vma)
                goto bad_area;
        if (vma->vm_start <= address)
                goto good_area;
        …...

/*
 * Something tried to access memory that isn't in our memory map..
 * Fix it, but check if it's kernel or user first..
 */
bad_area:
        if (error_code & 4) {   /* user mode, kill it */
                tsk->tss.cr2 = address;
                tsk->tss.error_code = error_code;
                tsk->tss.trap_no = 14;
                force_sig(SIGSEGV, tsk);
                return;
        }

        …...
}

good_area:
        handler = do_no_page;
        switch (error_code & 3) {
                default:        /* 3: write, present */
                        handler = do_wp_page;
                        /* fall through */
                case 2:         /* write, not present */
                        if (!(vma->vm_flags & VM_WRITE))
                                goto bad_area;
                        break;
                case 1:         /* read, present */
                        goto bad_area;
                case 0:         /* read, not present */
                        if (!(vma->vm_flags & (VM_READ | VM_EXEC)))
                                goto bad_area;
        }
        handler(tsk, vma, address, write);
        .…..
        return;

         not present                             present
write    check if you can write, do_no_page      check if you can write, do_wp_page
read     check if you can read, do_no_page       bad_area

do_no_page

1. Address is present in memory: just return.
2. Address is in the swap area: call do_swap_page to swap it in.

(Figure: cr3 and the page tables of tsk lead to a page that is loaded from disk.)

3. If no nopage routine is defined in the vm_area_struct, get a free page and link it in (uninitialized data).
4. If a nopage routine is defined in the vm_area_struct, call it (file_mmap_nopage, which tries to share pages with other tasks).

(Figure: cr3 and the page tables of tsk lead to a page obtained with get_free_page.)

do_wp_page

1. Address not present: return.
2. Page is PAGE_RW: return.
3. If the page is referenced by only one task (count==1), make it PAGE_RW.
4. If the page is referenced by more than one task, copy it to a new page and make the copy PAGE_RW.

(Figure: tsk1 and tsk reference the same page through their cr3 page tables; the faulting task gets a copy as a new page, set PAGE_RW.)