monolithic kernel:
The entire operating system resides in the kernel, so the implementations of all system calls run in supervisor mode.
downside of monolithic kernel:
Interfaces between the parts of the operating system are often complex, so it is easy for developers to make mistakes.
In a monolithic kernel every mistake is critical: one error in supervisor mode can bring down the whole kernel, and a reboot might be required.
micro kernel:
The amount of system code that runs in supervisor mode is minimized so that the risk of mistakes is reduced.
(OS services running as processes are called "servers".)
For example, the file system is implemented in user mode as a server.
When an application like the shell wants to read or write a file, it sends a message to the file server through the kernel interface.
The kernel interface consists of a few low-level functions for starting applications, accessing hardware, and sending messages to other user processes.
xv6 is a monolithic kernel, like most Unix operating systems.
Address space layout:
The address space contains user memory starting at virtual address zero.
Instructions come first, followed by global variables, then the user stack, and finally heap memory that can be expanded through malloc().
The maximum address is 2^38-1, i.e. MAXVA-1 (this limit, the trampoline, and the trapframe page belong to the RISC-V version of xv6).
xv6 uses code in a trampoline page to transition into the kernel and back; the trapframe is needed to save/restore the context of the process.
Each process has two stacks, a user stack & a kernel stack.
user stack: application code executes on this stack.
kernel stack: system call code and interrupt handling execute on this stack.
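The 2^38-1 limit from the address space layout can be checked with a small sketch. It assumes the definition used by the RISC-V xv6, where MAXVA is one bit short of the 39-bit Sv39 range; max_user_address is a hypothetical helper added only for illustration.

```c
#include <stdint.h>

/* Sv39: three 9-bit page-table indices plus a 12-bit page offset give
   39 bits. The RISC-V xv6 uses one bit less so that addresses never need
   sign extension (assumption: this mirrors its MAXVA definition). */
#define MAXVA (1ULL << (9 + 9 + 9 + 12 - 1))

/* Hypothetical helper: the highest usable virtual address is MAXVA - 1. */
static uint64_t max_user_address(void) {
    return MAXVA - 1;
}
```

With these definitions MAXVA is 2^38 and the highest address is 0x3fffffffff.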
How the first address space is organized:
The first step in providing isolation is setting up the kernel to run in its own address space.
BIOS: when a PC boots, it starts executing a program called the BIOS (Basic Input/Output System).
Its job is to load the boot loader from the boot sector (the first 512-byte sector of the boot disk) into memory at 0x7c00, then transfer control to that code.
Boot loader: occupies memory from 0x7c00 through 0x7e00. It loads the xv6 kernel from disk into memory, then transfers control to the kernel. It comprises two source files, bootasm.S & bootmain.c.
When the boot loader starts, the processor simulates an Intel 8088: 16-bit registers and 20-bit memory addresses.
->segment registers CS (Code Segment), DS (Data Segment), SS (Stack Segment), and ES (Extra Segment) provide the additional bits.
ex) (CS << 4) + offset register = 20-bit memory address
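The segment arithmetic can be written as a short sketch. real_mode_addr is a hypothetical name; the function only models the shift-and-add and is not taken from xv6.

```c
#include <stdint.h>

/* Real-mode translation: shift the 16-bit segment left by four bits and
   add the 16-bit offset. The sum can exceed 20 bits, which is where the
   A20 line becomes relevant. */
static uint32_t real_mode_addr(uint16_t segment, uint16_t offset) {
    return ((uint32_t)segment << 4) + offset;
}
```

For instance, real_mode_addr(0x07c0, 0x0000) and real_mode_addr(0x0000, 0x7c00) both name the boot loader's load address 0x7c00.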
The BIOS does not guarantee anything about the contents of %ds, %es, and %ss, so the first order of business after disabling interrupts is to set %ax to zero and then copy that zero into %ds, %es, and %ss.
start:
  cli                         # BIOS enabled interrupts; disable

  # Zero data segment registers DS, ES, and SS.
  xorw    %ax,%ax             # Set %ax to zero
  movw    %ax,%ds             # -> Data Segment
  movw    %ax,%es             # -> Extra Segment
  movw    %ax,%ss             # -> Stack Segment
To remain compatible with older software written for the 8088's 20-bit addresses, the 21st address line (A20) is forced to zero by default. The A20 gate is enabled through the keyboard controller.
If the second bit of the keyboard controller’s output port is low, the 21st physical address bit is always cleared; if high, the 21st bit acts normally. The boot loader must enable the 21st address bit using I/O to the keyboard controller on ports 0x64 and 0x60.
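The effect of the A20 gate can be modeled in a few lines. This is a sketch of the address masking only, not of the keyboard-controller protocol; physical_addr and A20_BIT are hypothetical names.

```c
#include <stdbool.h>
#include <stdint.h>

#define A20_BIT (1u << 20)  /* the 21st physical address line */

/* With the gate disabled, bit 20 is always cleared, so addresses just
   above 1 MB wrap around to low memory as they did on the 8088. */
static uint32_t physical_addr(uint32_t linear, bool a20_enabled) {
    return a20_enabled ? linear : (linear & ~A20_BIT);
}
```

With the gate disabled, the real-mode address 0xffff:0xffff (linear 0x10ffef) reaches physical 0x0ffef; with it enabled, the address is used unchanged.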
  # Physical address line A20 is tied to zero so that the first PCs
  # with 2 MB would run software that assumed 1 MB.  Undo that.
seta20.1:
  inb     $0x64,%al               # Wait for not busy
  testb   $0x2,%al                # bit 1 set: keyboard controller is still busy
  jnz     seta20.1

  movb    $0xd1,%al               # 0xd1 -> port 0x64
  outb    %al,$0x64               # command: write to the controller's output port

seta20.2:
  inb     $0x64,%al               # Wait for not busy
  testb   $0x2,%al
  jnz     seta20.2

  movb    $0xdf,%al               # 0xdf -> port 0x60
  outb    %al,$0x60
The BIOS starts the processor in real mode.
Protected mode allows 32-bit addresses.
In protected mode, a segment register holds an index into the GDT.
GDT (Global Descriptor Table): on x86, memory management is controlled through tables of descriptors. Each table entry specifies a base physical address, a maximum virtual address called the limit, and permission bits for the segment.
To enable protected mode, set the CR0_PE bit (bit 0) in %cr0.
Enabling protected mode does not immediately change how the processor translates logical to physical addresses; it is only when one loads a new value into a segment register that the processor reads the GDT and changes its internal segmentation settings. (ljmp is used to load %cs with a new segment selector.)
  # Switch from real to protected mode.  Use a bootstrap GDT that makes
  # virtual addresses map directly to physical addresses so that the
  # effective memory map doesn't change during the transition.
  lgdt    gdtdesc                 # load global descriptor table
  movl    %cr0, %eax
  orl     $CR0_PE, %eax           # enable protected mode by setting bit 0 of cr0
  movl    %eax, %cr0

//PAGEBREAK!
  # Complete the transition to 32-bit protected mode by using a long jmp
  # to reload %cs and %eip.  The segment descriptors are set up with no
  # translation, so that the mapping is still the identity mapping.
  ljmp    $(SEG_KCODE<<3), $start32   # %cs = selector SEG_KCODE<<3, %eip = start32
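The shift by three in SEG_KCODE<<3 follows from the layout of an x86 segment selector: the low three bits carry the privilege level and the table indicator, and the rest is the descriptor index. A minimal sketch of that decoding, assuming SEG_KCODE is 1 as in xv6's headers (decode_selector and struct selector_fields are hypothetical):

```c
#include <stdint.h>

#define SEG_KCODE 1  /* kernel code descriptor index (assumed to match xv6) */

struct selector_fields {
    unsigned index;  /* which descriptor table entry */
    unsigned ti;     /* table indicator: 0 = GDT, 1 = LDT */
    unsigned rpl;    /* requested privilege level, 0..3 */
};

/* Split a 16-bit segment selector into its three fields. */
static struct selector_fields decode_selector(uint16_t sel) {
    struct selector_fields f;
    f.index = sel >> 3;
    f.ti = (sel >> 2) & 1;
    f.rpl = sel & 3;
    return f;
}
```

Decoding SEG_KCODE<<3 yields index 1 in the GDT with privilege level 0, which is the kernel code segment the ljmp selects.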
start32 (the first 32-bit code) loads the data segment registers with the SEG_KDATA selector, then sets up a stack in unused memory so that the C code in bootmain.c can run.
The stack grows down from 0x7c00 ($start) toward 0x00000, away from the boot loader.
Finally the boot loader calls the C function bootmain(). Its job is to load the kernel from disk into memory and transfer control to the kernel. It returns only if something has gone wrong. In that case, the code writes to port 0x8a00, to which nothing is connected on a real machine; under a simulator (Bochs) the write hands control back to the simulator. The code then loops forever.
  # Set up the stack pointer and call into C.
  movl    $start, %esp
  call    bootmain

  # If bootmain returns (it shouldn't), trigger a Bochs
  # breakpoint if running under Bochs, then loop.
  movw    $0x8a00, %ax            # 0x8a00 -> port 0x8a00
  movw    %ax, %dx
  outw    %ax, %dx
  movw    $0x8ae0, %ax            # 0x8ae0 -> port 0x8a00
  outw    %ax, %dx
spin:
  jmp     spin
bootmain.c expects to find the kernel image (in ELF format) on disk starting at the second sector.
It reads the first 4096 bytes of the image, which contain the ELF header, into memory at 0x10000, and then checks whether this is an ELF executable.
It then loads each program segment from disk (readseg()) and, where memsz is larger than filesz, zeroes the remainder of the segment (stosb()). Finally, it calls the entry point given in the ELF header.
void
bootmain(void)
{
  struct elfhdr *elf;
  struct proghdr *ph, *eph;
  void (*entry)(void);
  uchar* pa;

  elf = (struct elfhdr*)0x10000;  // scratch space

  // Read 1st page off disk
  readseg((uchar*)elf, 4096, 0);

  // Is this an ELF executable?
  if(elf->magic != ELF_MAGIC)
    return;  // let bootasm.S handle error

  // Load each program segment (ignores ph flags).
  ph = (struct proghdr*)((uchar*)elf + elf->phoff);
  eph = ph + elf->phnum;
  for(; ph < eph; ph++){
    pa = (uchar*)ph->paddr;
    readseg(pa, ph->filesz, ph->off);
    if(ph->memsz > ph->filesz)
      stosb(pa + ph->filesz, 0, ph->memsz - ph->filesz);
  }

  // Call the entry point from the ELF header.
  // Does not return!
  entry = (void(*)(void))(elf->entry);
  entry();
}
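The per-segment work in the loop (copy filesz bytes, then zero-fill up to memsz) can be tried in user space. The sketch below replaces the disk with an in-memory image and uses memcpy/memset in place of readseg()/stosb(); struct segment and load_segment are hypothetical stand-ins, not xv6 code.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the three program header fields used here. */
struct segment {
    size_t off;     /* offset of the segment inside the image */
    size_t filesz;  /* bytes actually stored in the image */
    size_t memsz;   /* bytes the segment occupies once loaded */
};

/* Copy the stored bytes, then zero the part that exists only in memory
   (for example, uninitialized globals). */
static void load_segment(unsigned char *dst, const unsigned char *image,
                         const struct segment *seg) {
    memcpy(dst, image + seg->off, seg->filesz);
    if (seg->memsz > seg->filesz)
        memset(dst + seg->filesz, 0, seg->memsz - seg->filesz);
}
```

The caller must provide at least memsz bytes at dst; bootmain makes the same assumption about the physical memory at ph->paddr.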
The kernel is compiled and linked to run at virtual addresses starting at 0x80100000 (as kernel.asm shows).
The paging hardware is not yet enabled at this point. Once it is enabled, virtual address 0x80100000 will map to physical address 0x00100000.
kernel.ld sets the load address in the ELF headers so that the boot loader places the kernel in physical memory starting at 0x00100000.
The boot loader then starts executing the kernel at entry.
First, set the page directory:
entry loads the physical address of entrypgdir into register %cr3.
This sets up a page table that maps virtual addresses starting at 0x80000000 (KERNBASE) to physical addresses starting at 0x0.
The entry page table is defined in main.c.
->Entry 0 maps virtual addresses 0:0x400000 to physical addresses 0:0x400000. This mapping is required as long as entry is executing at low addresses, and is removed later. Entry 512 maps virtual addresses KERNBASE:KERNBASE+0x400000 to physical addresses 0:0x400000, where the boot loader placed the kernel's instructions and data. This restricts the kernel's instructions and data to 4MB.
Second, enable paging:
Setting CR0_PG in register %cr0 turns on the paging hardware, after which the kernel can use high addresses.
Third, set up the stack pointer:
The assembler directive ".comm" reserves KSTACKSIZE bytes for the symbol stack (a common symbol, normally placed in .bss).
Fourth, jump to main and switch to executing at high addresses.
# Entering xv6 on boot processor, with paging off.
.globl entry
entry:
  # Turn on page size extension for 4Mbyte pages
  movl    %cr4, %eax
  orl     $(CR4_PSE), %eax
  movl    %eax, %cr4
  # Set page directory
  movl    $(V2P_WO(entrypgdir)), %eax   # macro V2P_WO subtracts KERNBASE to get the physical address
  movl    %eax, %cr3
  # Turn on paging.
  movl    %cr0, %eax
  orl     $(CR0_PG|CR0_WP), %eax
  movl    %eax, %cr0

  # Set up the stack pointer.
  movl $(stack + KSTACKSIZE), %esp      # stack grows down

  # Jump to main(), and switch to executing at
  # high addresses. The indirect call is needed because
  # the assembler produces a PC-relative instruction
  # for a direct jump.
  mov $main, %eax
  jmp *%eax

.comm stack, KSTACKSIZE
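The two-entry page directory can be modeled to see why both 0x00100000 and 0x80100000 reach the same physical memory. This sketch assumes 4 MB pages (as enabled through CR4_PSE) and that both populated entries have physical base 0; pdx and entry_translate are hypothetical helpers, not the xv6 macros themselves.

```c
#include <stdint.h>

#define KERNBASE       0x80000000u
#define PDXSHIFT       22           /* directory index is the top 10 bits */
#define SUPERPAGE_MASK 0x003fffffu  /* offset inside a 4 MB page */

/* Page directory index of a virtual address. */
static uint32_t pdx(uint32_t va) { return va >> PDXSHIFT; }

/* Translate va through a directory with only entries 0 and 512 present,
   both pointing at physical base 0. Returns 1 and stores the physical
   address on success, 0 if va is unmapped. */
static int entry_translate(uint32_t va, uint32_t *pa) {
    uint32_t idx = pdx(va);
    if (idx != pdx(0) && idx != pdx(KERNBASE))
        return 0;
    *pa = va & SUPERPAGE_MASK;
    return 1;
}
```

Here pdx(KERNBASE) is 512, matching the entry number in the notes, and any address beyond the first 4 MB of either window is unmapped.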
main() initializes several devices and subsystems, then calls userinit() to create the first process.
userinit() first calls allocproc(), which looks for an UNUSED slot in the process table and marks it EMBRYO.
allocproc() also sets up the new process with a specially prepared kernel stack and set of kernel registers that cause it to ‘return’ to user space when it first runs. It does so by arranging for the process to execute forkret and then trapret.
forkret() will return to whatever address is at the bottom of the stack. allocproc() places the address of trapret() there, so trapret() runs next and restores the user registers from the values stored at the top of the kernel stack.
userinit() writes the initial user register values at the top of the kernel stack.
These values form a 'struct trapframe', which stores the user registers.
allocproc():
  // Set up new context to start executing at forkret,
  // which returns to trapret.
  sp -= 4;                      // decrement; push
  *(uint*)sp = (uint)trapret;

  sp -= sizeof *p->context;
  p->context = (struct context*)sp;
  memset(p->context, 0, sizeof *p->context);
  p->context->eip = (uint)forkret;
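The stack that allocproc() prepares can be pictured as three regions carved from the top of the kernel stack downward: the trap frame, one word holding the address of trapret, and the saved context. The sketch below computes only the offsets. carve_kstack and struct kstack_layout are hypothetical, and the structure sizes are parameters because they are assumptions here rather than values taken from the notes.

```c
#include <stddef.h>

#define KSTACKSIZE 4096  /* assumed size of one kernel stack */

struct kstack_layout {
    size_t tf_off;       /* start of the trap frame */
    size_t ret_off;      /* word holding the address of trapret */
    size_t context_off;  /* saved context whose eip is forkret */
};

/* Mimic the sp arithmetic in allocproc(), using offsets from the base of
   the stack instead of real pointers. */
static struct kstack_layout carve_kstack(size_t trapframe_size,
                                         size_t context_size) {
    struct kstack_layout l;
    size_t sp = KSTACKSIZE;   /* start at the top; the stack grows down */
    sp -= trapframe_size;
    l.tf_off = sp;
    sp -= 4;                  /* one 32-bit return address */
    l.ret_off = sp;
    sp -= context_size;
    l.context_off = sp;
    return l;
}
```

Calling carve_kstack(76, 20), sizes that are plausible for 32-bit x86 but should be checked against the headers, places the context at offset 3996 with the return address directly above it.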
The first process is going to execute a small program (initcode.S).
It needs memory to hold that program.
userinit() calls setupkvm() to create a page table that initially maps only memory that the kernel uses.
The initial contents of the process's memory are the compiled form of initcode.S. The linker embeds that binary in the kernel and defines two symbols, _binary_initcode_start and _binary_initcode_size, that give its location and size.
userinit() copies that binary into the new process’s memory by calling inituvm(), which allocates one page of physical memory, maps virtual address zero to that memory, and copies the binary to that page.
userinit():
  initproc = p;
  if((p->pgdir = setupkvm()) == 0)
    panic("userinit: out of memory?");
  inituvm(p->pgdir, _binary_initcode_start, (int)_binary_initcode_size);
Then userinit() sets up the trap frame with the initial user mode state.
After calling userinit(), main() calls mpmain() to start the first process.
mpmain() calls scheduler() to find a RUNNABLE process and run it.
scheduler():
It sets the per-cpu variable proc to the selected process (initproc).
switchuvm() tells the hardware to start using the page table of the selected process. Changing page tables while running in the kernel is possible because setupkvm() gives every process's page table identical mappings for kernel code and data.
swtch() performs a context switch to the target process's kernel thread. The current context is not a process but the per-cpu scheduler context, so swtch() saves the current hardware registers in per-cpu storage (cpu->scheduler) rather than in any process's kernel thread context.
proc.c scheduler()
void
scheduler(void)
{
  struct proc *p;
  struct cpu *c = mycpu();
  c->proc = 0;
  ....
      // Switch to chosen process.  It is the process's job
      // to release ptable.lock and then reacquire it
      // before jumping back to us.
      c->proc = p;
      switchuvm(p);
      p->state = RUNNING;

      swtch(&(c->scheduler), p->context);
      switchkvm();
  ....
}
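The part of scheduler() elided above has to find a RUNNABLE process before switching to it. A minimal user-space sketch of such a scan follows; it leaves out locking and the actual context switch, and pick_runnable and this cut-down struct proc are hypothetical.

```c
#include <stddef.h>

enum procstate { UNUSED, EMBRYO, SLEEPING, RUNNABLE, RUNNING, ZOMBIE };

/* Cut-down process record: only the fields this sketch needs. */
struct proc {
    enum procstate state;
    int pid;
};

/* Return the first RUNNABLE entry and mark it RUNNING, or NULL if no
   process is ready to run. */
static struct proc *pick_runnable(struct proc *table, size_t n) {
    for (size_t i = 0; i < n; i++) {
        if (table[i].state != RUNNABLE)
            continue;
        table[i].state = RUNNING;
        return &table[i];
    }
    return NULL;
}
```

In the real scheduler the selected entry would next be passed to switchuvm() and swtch(), as in the listing.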
swtch() saves the current callee-saved registers on the current stack, switches stacks, and then loads the saved registers of the target context. The final ret instruction pops the target process’s %eip from the stack, finishing the context switch.
swtch.S
swtch:
  movl 4(%esp), %eax    # first argument: struct context **old
  movl 8(%esp), %edx    # second argument: struct context *new

  # Save old callee-saved registers
  pushl %ebp
  pushl %ebx
  pushl %esi
  pushl %edi

  # Switch stacks
  movl %esp, (%eax)
  movl %edx, %esp

  # Load new callee-saved registers
  popl %edi
  popl %esi
  popl %ebx
  popl %ebp
  ret
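The push order in swtch determines the memory layout of a saved context: the last register pushed (%edi) sits at the lowest address, and the return address that ret will pop sits just above the four saved registers. The sketch below expresses that layout as a C structure; resume_address is a hypothetical helper that reads the return address through a saved stack pointer.

```c
#include <stddef.h>
#include <stdint.h>

/* Field order mirrors the stack after the four pushes in swtch, from the
   lowest address upward. */
struct context {
    uint32_t edi;
    uint32_t esi;
    uint32_t ebx;
    uint32_t ebp;
    uint32_t eip;  /* return address that the final ret pops */
};

/* Given the stack pointer stored by swtch, find where execution resumes. */
static uint32_t resume_address(const uint32_t *saved_sp) {
    return saved_sp[offsetof(struct context, eip) / sizeof(uint32_t)];
}
```

This is why allocproc() only needs to fill in context->eip: the four pops consume the zeroed registers and ret jumps to forkret.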
allocproc() already set context->eip to forkret, so the ret in swtch() starts executing forkret().
forkret() does some initialization that must run in the context of a regular process with its own kernel stack, so it cannot be done from main().
Then forkret() returns. allocproc() arranged for the word on the stack just after the context to be the address of trapret(), so trapret() runs next, with %esp pointing at the process's trap frame.
trapret() uses pop instructions to restore registers from the trap frame, just as swtch() did with the kernel context:
trapasm.S
  # Call trap(tf), where tf=%esp
  pushl %esp
  call trap
  addl $4, %esp       # pop the tf argument; %esp points at the trap frame again

  # Return falls through to trapret...
.globl trapret
trapret:
  popal               # restore general-purpose registers
  popl %gs
  popl %fs
  popl %es
  popl %ds            # restore %gs, %fs, %es, %ds
  addl $0x8, %esp     # skip two fields, trapno and errcode
  iret                # pops %eip, %cs, %eflags, %esp, and %ss from the stack
The process now begins executing at tf->eip, which is virtual address 0, the first instruction of initcode.S.
At this point, %eip holds zero and %esp holds 4096.
For the first process, inituvm() set up the page table so that virtual address zero refers to the physical memory allocated for this process (allocuvm() does the same when a process grows later, e.g. through sbrk()), and set a flag (PTE_U) that tells the paging hardware to allow user code to access that memory. The fact that userinit() set up the low bits of %cs to run the process’s user code at CPL=3 means that the user code can only use pages with PTE_U set, and cannot modify sensitive hardware registers such as %cr3.
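The interaction between CPL and PTE_U can be summarized as a small predicate. This is a simplified model that ignores write protection and page-directory-level flags; the flag values follow the x86 page table entry format, while cpl_of and may_access are hypothetical helpers.

```c
#include <stdbool.h>
#include <stdint.h>

#define PTE_P    0x001u  /* present */
#define PTE_W    0x002u  /* writeable */
#define PTE_U    0x004u  /* user-accessible */
#define DPL_USER 3       /* privilege level of user code */

/* The current privilege level is held in the low two bits of %cs. */
static unsigned cpl_of(uint16_t cs) { return cs & 3; }

/* Simplified check: a page must be present, and code running at CPL 3
   additionally needs PTE_U. Kernel code (CPL 0) may use any present page. */
static bool may_access(uint32_t pte, uint16_t cs) {
    if (!(pte & PTE_P))
        return false;
    if (cpl_of(cs) == DPL_USER)
        return (pte & PTE_U) != 0;
    return true;
}
```

A selector such as 0x1b (low bits 3) models user code here, and 0x08 models the kernel code segment.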
initcode.S invokes the exec() system call.
It pushes three values on the stack: $argv, $init, and $0.
Then it sets %eax to SYS_exec and executes int $T_SYSCALL.
exec() replaces initcode with the program named by $init, which is /init.
Init creates a new console device file if needed and then opens it as file descriptors 0, 1, and 2. Then it loops, starting a console shell, handling orphaned zombies until the shell exits, and repeating.
initcode.S
.globl start
start:
  pushl $argv
  pushl $init
  pushl $0  // where caller pc would be
  movl $SYS_exec, %eax
  int $T_SYSCALL

# for(;;) exit();
exit:
  movl $SYS_exit, %eax
  int $T_SYSCALL
  jmp exit

# char init[] = "/init\0";
init:
  .string "/init\0"

# char *argv[] = { init, 0 };
.p2align 2
argv:
  .long init
  .long 0
init.c
// init: The initial user-level program

#include "types.h"
#include "stat.h"
#include "user.h"
#include "fcntl.h"

char *argv[] = { "sh", 0 };

int
main(void)
{
  int pid, wpid;

  if(open("console", O_RDWR) < 0){
    mknod("console", 1, 1);
    open("console", O_RDWR);
  }
  dup(0);  // stdout
  dup(0);  // stderr

  for(;;){
    printf(1, "init: starting sh\n");
    pid = fork();
    if(pid < 0){
      printf(1, "init: fork failed\n");
      exit();
    }
    if(pid == 0){
      exec("sh", argv);
      printf(1, "init: exec sh failed\n");
      exit();
    }
    while((wpid=wait()) >= 0 && wpid != pid)
      printf(1, "zombie!\n");
  }
}