Linux kernel booting process we learned what the kernel does in its earliest stage. In the next steps the kernel will initialize different things like
initrd mounting, lockdep initialization, and many other things, before we can see how the kernel runs the first init process.
Paging is a mechanism that translates a linear memory address to a physical address. If you have read the previous parts of this book, you may remember that we saw segmentation in real mode, where physical addresses are calculated by shifting a segment register left by four bits and adding an offset. We also saw segmentation in protected mode, where we used the descriptor tables and the base addresses from descriptors, combined with offsets, to calculate physical addresses. Now we will see paging in 64-bit mode.
Paging provides a mechanism for implementing a conventional demand-paged, virtual-memory system where sections of a program’s execution environment are mapped into physical memory as needed.
x86_64 version of the Linux kernel, but we will not go into too much detail (at least in this post).
IA-32e paging mode we need to do the following things:
4096 bytes for the x86_64 Linux kernel. To perform the translation from a linear address to a physical address, special structures are used. Every structure is 4096 bytes and contains 512 entries (this is only for IA32_EFER.LME modes). Paging structures are hierarchical and the Linux kernel uses 4 levels of paging in the x86_64 architecture. The CPU uses part of a linear address to identify an entry in the next, lower-level paging structure, a physical memory region (page frame), or a physical address within that region (page offset). The address of the top-level paging structure is located in the cr3 register. We have already seen this in arch/x86/boot/compressed/head_64.S:
cr3 is used to store the address of the top-level structure, the Page Global Directory as it is called in the Linux kernel. cr3 is a 64-bit register and has the following structure:
2^48 or 256 TBytes of linear-address space may be accessed at any given time.
The cr3 register stores the address of the top-level (level-4) paging structure.
47:39 bits of the given linear address store an index into the level-4 paging structure,
38:30 bits store an index into the level-3 paging structure,
29:21 bits store an index into the level-2 paging structure,
20:12 bits store an index into the level-1 paging structure and
11:0 bits provide the offset into the physical page in bytes.
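The bit ranges above can be sketched as a few shifts and masks. This is a minimal Python illustration rather than kernel code: the constant and function names are hypothetical, and the example address is built from known indices purely so the result is easy to check by eye.

```python
# Illustrative sketch of the 4-level split; names are hypothetical and are
# not taken from the kernel sources.
PAGE_SHIFT = 12                      # bits 11:0 are the byte offset in a 4096-byte page
INDEX_BITS = 9                       # each paging structure holds 512 = 2**9 entries
INDEX_MASK = (1 << INDEX_BITS) - 1


def split_linear_address(addr):
    """Return (level4, level3, level2, level1, offset) for a linear address."""
    offset = addr & ((1 << PAGE_SHIFT) - 1)   # bits 11:0
    level1 = (addr >> 12) & INDEX_MASK        # bits 20:12
    level2 = (addr >> 21) & INDEX_MASK        # bits 29:21
    level3 = (addr >> 30) & INDEX_MASK        # bits 38:30
    level4 = (addr >> 39) & INDEX_MASK        # bits 47:39
    return level4, level3, level2, level1, offset


# Build an address from indices 1, 2, 3, 4 and offset 5, then split it again.
example = (1 << 39) | (2 << 30) | (3 << 21) | (4 << 12) | 5
print(hex(example), split_linear_address(example))
```

Each index is 9 bits wide because a 4096-byte structure holds 512 entries, so four indices plus the 12-bit offset account for exactly 48 bits of the address.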
CPL (current privilege level). If CPL < 3 it is a supervisor-mode access, otherwise it is a user-mode access. For example, the top-level page table entry contains access bits and has the following structure (see arch/x86/include/asm/pgtable_types.h for the bit offset definitions):
x86_64 uses 4-level page tables. Their names are:
System.map file, which stores the virtual addresses of the functions that are used by the kernel. For example:
0xffffffff81efe497 here. I doubt you really have that much RAM installed. But anyway, x86_64_start_kernel will be executed. The address space is 2^64 wide, but that is too large, which is why a smaller address space is used, only 48 bits wide. So we have a situation where the usable virtual address space is limited to 48 bits, but addressing is still performed with 64-bit pointers. How is this problem solved? Look at this diagram:
sign extension. Here we can see that the lower 48 bits of a virtual address can be used for addressing. Bits 63:48 can be either only zeroes or only ones. Note that the virtual address space is split into 2 parts:
0x00007fffffffffff and kernel space occupies the highest part, up to 0xffffffffffffffff. Note that bits 63:47 are 0 for userspace and 1 for kernel space. All addresses in kernel space or userspace, in other words addresses whose upper bits 63:48 are all zeroes or all ones, are called canonical addresses. There is a non-canonical area between these memory regions. Together these two memory regions (kernel space and user space) span exactly 2^48 bytes. We can find the virtual memory map with 4 level page tables in the Documentation/x86/x86_64/mm.txt:
0xffff87ffffffffff to prevent access to the non-canonical area, but was later extended by 3 bits for the hypervisor.
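The canonical-address rule (bits 63:48 must all be copies of bit 47) can be expressed as a small check. The following is a minimal Python sketch, not kernel code; sign_extend_48 and is_canonical are hypothetical helper names chosen for this illustration.

```python
MASK64 = (1 << 64) - 1            # keep values within 64 bits
LOW48 = (1 << 48) - 1             # bits 47:0
HIGH16 = MASK64 ^ LOW48           # bits 63:48


def sign_extend_48(addr):
    """Copy bit 47 of addr into bits 63:48 and return the 64-bit result."""
    low = addr & LOW48
    if low & (1 << 47):
        return low | HIGH16       # bit 47 set: upper bits become all ones
    return low                    # bit 47 clear: upper bits become all zeroes


def is_canonical(addr):
    """True when the upper 16 bits already match the sign extension of bit 47."""
    return (addr & MASK64) == sign_extend_48(addr)


for candidate in (0x00007fffffffffff, 0x0000800000000000, 0xffffffff81efe497):
    print(hex(candidate), is_canonical(candidate))
```

The first and last addresses are canonical (top of userspace and a kernel address), while 0x0000800000000000 falls in the non-canonical area because bit 47 is set but bits 63:48 are zero.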
ffff880000000000. This virtual memory region is for direct mapping of all the physical memory. After the memory space which maps all the physical addresses comes the guard hole. It needs to be between the direct mapping of all the physical memory and the vmalloc area. After the virtual memory map for the first terabyte and the unused hole after it, we can see the kasan shadow memory. It was added by a commit and provides the kernel address sanitizer. After the next unused hole we can see the esp fixup stacks (we will talk about them in other parts of this book) and the start of the kernel text mapping from the physical address - 0. We can find the definition of this address in the same file as the
0x1000000. So we have the start point of the kernel, 0xffffffff80000000, and the offset, 0x1000000; the resulting virtual address is 0xffffffff80000000 + 0x1000000 = 0xffffffff81000000.
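The addition above is easy to get wrong by dropping a hex prefix, so here is the same arithmetic as a short Python sketch. The names KERNEL_MAP_BASE, PHYS_LOAD_OFFSET and offset_into_kernel_text are hypothetical labels for this illustration; only the numeric values come from the text.

```python
KERNEL_MAP_BASE = 0xffffffff80000000   # start point of the kernel mapping quoted in the text
PHYS_LOAD_OFFSET = 0x1000000           # offset quoted in the text (16 MiB)

# Virtual address at which the kernel text starts.
kernel_text_start = KERNEL_MAP_BASE + PHYS_LOAD_OFFSET


def offset_into_kernel_text(addr):
    """Distance in bytes from the start of the kernel text to addr."""
    return addr - kernel_text_start


print(hex(kernel_text_start))
# The System.map address quoted earlier lies a little under 16 MiB into the text.
print(hex(offset_into_kernel_text(0xffffffff81efe497)))
```

Both operands must be read as hexadecimal: 0x1000000 is 16 MiB, not one million.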
.text region there is the virtual memory region for kernel modules, vsyscalls, and an unused hole of 2 megabytes.
63:48 - bits not used;
47:39 - bits store an index into the paging structure level-4;
38:30 - bits store an index into the paging structure level-3;
29:21 - bits store an index into the paging structure level-2;
20:12 - bits store an index into the paging structure level-1;
11:0 - bits provide the offset into the physical page in bytes.
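To see these fields for a concrete address, the kernel text address 0xffffffff81000000 can be printed in binary, grouped by the bit ranges listed above. This is a small Python sketch for illustration only; FIELDS and binary_fields are hypothetical names.

```python
# (high bit, low bit) of each field, from most to least significant.
FIELDS = [(63, 48), (47, 39), (38, 30), (29, 21), (20, 12), (11, 0)]


def binary_fields(addr):
    """Return addr as space-separated binary groups, one group per field."""
    parts = []
    for high, low in FIELDS:
        width = high - low + 1
        value = (addr >> low) & ((1 << width) - 1)
        parts.append(format(value, "0{}b".format(width)))
    return " ".join(parts)


print(binary_fields(0xffffffff81000000))
# 1111111111111111 111111111 111111110 000001000 000000000 000000000000
```

Read as decimal numbers, the four index groups are 511, 510, 8 and 0, and the page offset is 0.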
paging and we can go ahead in the kernel source code and see the first initialization steps.