Transition to 64-bit mode
In the previous part, we saw the transition from real mode into protected mode. At that point, two crucial things changed:
The processor can now address up to four gigabytes of memory
Privilege levels were set for memory access
Despite this, the kernel is still in its early setup mode. There are many different things that the early setup code should prepare before we reach the main kernel's entry point. Right now, the processor operates in protected mode. However, protected mode is not the main mode in which x86_64 processors should operate – it exists only for backward compatibility. The next crucial step is to switch to the native mode for x86_64 - long mode.
The main characteristic of this new mode (as with all the earlier modes) is the way it defines the memory model. In real mode, the memory model was relatively simple, and each memory location was formed based on the base address specified in a segment register, plus some offset. In protected mode, the global and local descriptor tables contain descriptors that describe memory areas. All the memory accesses in long mode are based on the new mechanism called paging. One of the crucial goals of the kernel setup code before it can switch to the long mode is to set up paging.
In this chapter, we will see how the kernel switches to long mode in detail.
[!NOTE] There will be lots of assembly code in this part, so if you are not familiar with that, read another set of my posts about assembly programming.
The 32-bit kernel entry point location
The last point where we stopped was the jump instruction to the kernel's entry point in protected mode. This jump was located in the arch/x86/boot/pmjump.S and looks like this:
jmpl *%eax # Jump to the 32-bit entrypoint
The eax register contains the address of the 32-bit entry point. What is this address? To answer this question, we can read the Linux kernel x86 boot protocol document:
When using bzImage, the protected-mode kernel was relocated to 0x100000
We can make sure that this is the 32-bit entry point of the Linux kernel by using the GNU GDB debugger and running the Linux kernel in the QEMU virtual machine. To do this, you can run the following command in one terminal:
sudo qemu-system-x86_64 -kernel ./linux/arch/x86/boot/bzImage \
-nographic \
-append "console=ttyS0 nokaslr" -s -S \
-initrd /boot/initramfs-6.17.0-rc3-g1b237f190eb3.img
[!NOTE] You need to pass your own kernel image and initrd image to the -kernel and -initrd command line options.
After this, run the GNU GDB debugger in another terminal and pass the following commands:
As soon as the debugger stops at the breakpoint, we can inspect the registers to make sure that the eax register contains 0x100000, the address of the 32-bit kernel entry point:
From the previous part, you may remember:
First of all, we preserve the address of the boot_params structure in the esi register.
So the esi register holds the pointer to boot_params. Let's inspect it to make sure that this is really the case. For example, we can take a look at the command line string that we passed to the virtual machine:
We got it 🎉
Now we know where we are, so let's take a look at the code and continue exploring the Linux kernel.
First steps in the protected mode
The 32-bit entry point is defined in arch/x86/boot/compressed/head_64.S assembly source code file:
First of all, it is worth knowing why the directory is named compressed. It's because the kernel is in the bzImage file, which is a compressed package that contains the kernel image and kernel setup code. In all previous chapters, we were researching the kernel setup code. The next two big steps, which the kernel's setup code should do before we see the entry point of the kernel itself, are:
Switch to long mode
Decompress the kernel image and jump to its entry point
In this part, we will focus only on switching to long mode. The kernel image decompression will be covered in the next chapters. Returning to the current kernel code, you can find the following two files in the arch/x86/boot/compressed directory:
We will focus only on the head_64.S file. Yes, the file name contains the 64 suffix, despite the kernel being in the 32-bit protected mode at the moment. The explanation for this situation is simple. Let's look at arch/x86/boot/compressed/Makefile. We can see the following make goal here:
The first line contains the following target - $(obj)/head_$(BITS).o. This means that make will select the file during the kernel build process based on the $(BITS) value. This make variable is defined in arch/x86/Makefile and its value depends on the kernel's configuration:
Since we are considering the kernel for the x86_64 architecture, we assume that CONFIG_X86_64 is set to y. As a result, the head_64.S file will be used during the kernel build process. Let's start to investigate what the kernel does in this file.
Reload the segments if needed
As we already know, our start is in arch/x86/boot/compressed/head_64.S assembly source code file. The entry point is defined by the startup_32 symbol.
At the beginning of the startup_32, we can see the cld instruction, which clears the DF or direction flag bit in the flags register:
When the direction flag is clear, string operations such as stos or scas increment the index registers esi or edi after each step, so memory is processed from lower to higher addresses. We need to clear the direction flag because later we will use string operations for tasks such as clearing space for page tables or copying data.
The next instruction, cli, disables interrupts. We have already seen it in the previous chapter. The interrupts are disabled "twice" because modern bootloaders can start the kernel directly from this point, bypassing the code path that we have seen in the previous chapters.
After these two simple instructions, the next step is to calculate the difference between the address the kernel was compiled to run at and the address where it was actually loaded. If we take a look at the linker script, we will see the following definition:
This means that the code in this section is compiled to run at address zero. We can also see this in the output of the objdump utility:
We can see that both the linker script and the objdump utility indicate that the address of the startup_32 function is 0, but this is not where the kernel was loaded. This is the address that the code was compiled for, also known as the link-time address. Why was it done like that? The answer is – for simplicity. By telling the linker to set the address of the very first symbol to zero, each next symbol becomes a simple offset from 0. As we already know, the kernel was loaded at the 0x100000 address. The difference between the address where the kernel was loaded and the address with which the kernel was compiled is called the relocation delta. Once the delta is known, the code can reach any variable or function by adding this delta to their compile-time addresses.
We know both these addresses based on the experiment above, and as a result, we know the value of the delta. Now let's take a look at how the kernel calculates this difference:
The call instruction is used to get the physical address where the kernel is actually loaded. This trick works because after the call instruction is executed, the stack should have the return address on top. This return address will be exactly the address of the label 1.
In the code above, the kernel sets up a temporary mini stack where the return address will be stored after the call instruction. Right after the call, we pop this address from the stack and save it in the ebp register. The last instruction, subl, subtracts from ebp the offset of the label 1 relative to startup_32, which the rva macro computes at build time. The result, stored in the ebp register, is the address where startup_32 actually resides in memory.
The rva macro is defined in the same source code file and looks like this:
Schematically, it can be represented like this:
Starting from this moment, the ebp register contains the physical address of the startup_32 symbol. Next, it will be used to calculate the offset to any other symbols or structures in memory.
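The arithmetic behind the call/pop trick described above can be modelled in a few lines of Python. This is only an illustrative sketch: the link-time offset of the label 1 (0x10 here) is a made-up value, and the helper names are hypothetical, not kernel symbols.

```python
# Illustrative model of the call/pop trick; this is not kernel code.
LINK_ADDR_STARTUP_32 = 0x0  # link-time address of startup_32 (zero, per the linker script)
LINK_ADDR_LABEL_1 = 0x10    # hypothetical link-time address of the label "1"


def rva(link_addr):
    """Offset of a symbol relative to startup_32, known at link time."""
    return link_addr - LINK_ADDR_STARTUP_32


def runtime_base(return_address):
    """Mimic `popl %ebp` followed by `subl $rva(1b), %ebp`.

    return_address is the value pushed by the call instruction,
    that is, the run-time address of the label 1.
    """
    return return_address - rva(LINK_ADDR_LABEL_1)


def runtime_address(base, link_addr):
    """Run-time address of any symbol once the load base is known."""
    return base + rva(link_addr)


# If the kernel is loaded at 0x100000, the call pushes 0x100000 + rva(label 1).
base = runtime_base(0x100000 + rva(LINK_ADDR_LABEL_1))
print(hex(base))
```

The same `base + rva(symbol)` pattern is what the later code relies on whenever it needs the address of a structure such as the Global Descriptor Table.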
The very first such structure that we need to access is the Global Descriptor Table. To switch to long mode, we need to update the previously loaded Global Descriptor Table with 64-bit segments:
Since the ebp register now contains the physical address of the beginning of the kernel in protected mode, the first line of the code shown above uses it to calculate the address of the gdt structure. In the last two lines, we write this address into the gdt structure at offset 2 and load the new Global Descriptor Table with the lgdt instruction.
The new Global Descriptor Table looks like this:
The new Global Descriptor table contains five descriptors:
32-bit kernel code segment
64-bit kernel code segment
32-bit kernel data segment
Task state segment descriptor
Continuation (upper half) of the task state segment descriptor
We already saw the loading of the Global Descriptor Table in the previous part, and now we are doing almost the same. The difference is that the new table contains a code segment descriptor with CS.L = 1 and CS.D = 0, which is what the processor requires for execution in 64-bit mode.
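To make the CS.L and CS.D bits more tangible, here is a small Python sketch that extracts them from a raw 64-bit segment descriptor. The bit positions (53 for L, 54 for D, 55 for G) follow the architectural descriptor layout. The two descriptor values are assumptions about what the kernel's table contains and are used here purely for illustration.

```python
def decode_flags(descriptor):
    """Extract the L, D and G bits from a 64-bit segment descriptor."""
    return {
        "L": (descriptor >> 53) & 1,  # 1 = 64-bit code segment
        "D": (descriptor >> 54) & 1,  # default operand size (1 = 32-bit)
        "G": (descriptor >> 55) & 1,  # granularity (1 = limit counted in 4 KiB units)
    }


# Assumed descriptor values, for illustration only.
CODE32_DESCRIPTOR = 0x00CF9A000000FFFF  # 32-bit code segment
CODE64_DESCRIPTOR = 0x00AF9A000000FFFF  # 64-bit code segment

for name, value in (("32-bit", CODE32_DESCRIPTOR), ("64-bit", CODE64_DESCRIPTOR)):
    print(name, decode_flags(value))
```

The only difference between the two values is the flags nibble: 0xC (D set, L clear) for the 32-bit segment and 0xA (L set, D clear) for the 64-bit one.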
After the new Global Descriptor Table is loaded, the next step is to set up the stack:
In the previous step, we loaded a new Global Descriptor Table; however, all the segment registers may still have selectors from the old table. If those selectors point to invalid entries in the new Global Descriptor Table, the next memory access can cause a General Protection Fault. Setting them to __BOOT_DS, which is a well-known descriptor, should fix this potential fault and allow us to set the proper stack pointed to by boot_stack_end.
The last action after we loaded the new Global Descriptor Table is to reload the cs register:
Since the cs register cannot be loaded with the mov instruction, a trick with the lretl instruction is used to set cs to the correct value. This instruction pops two values from the top of the stack, then puts the first value into the eip register and the second value into the cs register. From this moment, we have proper kernel code selector and instruction pointer values.
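The order of the pushes matters for this trick, so it may help to model the stack explicitly. In the Python sketch below, the stack is a list whose last element is the top of the stack. The selector value 0x08 and the target address are hypothetical values chosen for the example.

```python
def far_return(stack):
    """Model a far return: pop the instruction pointer first, then the code selector."""
    eip = stack.pop()
    cs = stack.pop()
    return cs, eip


HYPOTHETICAL_CS = 0x08         # selector of the code segment we want in cs
HYPOTHETICAL_TARGET = 0x100040  # address of the instruction to continue from

stack = []
stack.append(HYPOTHETICAL_CS)      # pushed first, so it is popped second
stack.append(HYPOTHETICAL_TARGET)  # pushed last, so it is popped first

cs, eip = far_return(stack)
print(hex(cs), hex(eip))
```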
Just a couple of steps separate us from transitioning into the long mode. As mentioned at the beginning of this chapter, one of the most crucial steps is to set up paging. But before that, the kernel needs to do the last preparations, which we will see in the next sections.
Last steps before paging setup
As we mentioned in the previous section, there are a couple of additional steps before we can set up paging and switch to long mode. These steps are:
Verification of CPU
Calculation of the relocation address
Enabling PAE mode
In the next sections we will take a look at these steps.
CPU verification
Before the kernel can switch to long mode, it checks that it runs on a suitable x86_64 processor by running this piece of code:
The verify_cpu function is defined in arch/x86/kernel/verify_cpu.S and executes the CPUID instruction to check the details of the processor on which the kernel is running. In our case, the most crucial check is for long mode and SSE support. This function returns the result in the eax register. Its value is 0 on success and 1 on failure. If long mode is not supported by the current processor, the kernel jumps to the no_longmode label, which stops the CPU with the hlt instruction:
If everything is ok, the kernel continues its work.
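The following Python sketch shows the kind of bit test that such a check boils down to. It is a simplified model, not a port of verify_cpu: the function name is hypothetical, only three feature bits are examined, and the inputs are the EDX values that CPUID would return for leaves 0x00000001 and 0x80000001.

```python
# Feature bits as documented for the CPUID instruction.
SSE_BIT = 1 << 25   # leaf 0x00000001, EDX: SSE
SSE2_BIT = 1 << 26  # leaf 0x00000001, EDX: SSE2
LM_BIT = 1 << 29    # leaf 0x80000001, EDX: long mode


def verify_cpu_model(leaf1_edx, ext_leaf1_edx):
    """Return 0 if the required features are present, 1 otherwise."""
    required = SSE_BIT | SSE2_BIT
    if (leaf1_edx & required) != required:
        return 1
    if not (ext_leaf1_edx & LM_BIT):
        return 1
    return 0


# A CPU that reports SSE and SSE2 but no long mode fails the check.
print(verify_cpu_model(SSE_BIT | SSE2_BIT, 0))
```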
Calculation of the kernel relocation address
The next step is to calculate the address for the kernel decompression. The kernel image mainly consists of two parts:
Kernel's setup and decompressor code
Chunk of compressed kernel code
We can see it looking at the arch/x86/boot/compressed/vmlinux.lds.S linker script:
There are three sections at the beginning of the linker script above:
.head.text - section where we are now
.rodata..compressed - section with the compressed kernel image
.text - section with the decompressor code
The kernel decompression happens in-place, which is the same place where the compressed kernel is. This means that the parts of the decompressed kernel image will overwrite the parts of the compressed image during the decompression process. It may sound dangerous – if the decompressed part overwrites the decompressor code or the part of the compressed kernel image that is not decompressed yet, this will lead to code or image corruption.
One way to avoid this problem is to allocate a buffer for the decompressed kernel image and copy the compressed image outside of it. But this is not the most effective way in terms of memory consumption, and it may not work on devices that do not have enough memory to hold both kernel images.
The second way to avoid this problem is to allocate a buffer for the decompressed kernel image, but copy the compressed image to the end of this buffer and leave some room at the beginning of this buffer for the parts of the decompressed kernel. Of course, the kernel decompressor must choose the right parameters, so that the pointer to the end of the decompressed data never overtakes the pointer to the data that is still compressed.
Schematically, it can be represented like this:
The buffer for the decompressed kernel starts at the address specified by the LOAD_PHYSICAL_ADDR macro, which by default expands to the 0x1000000 address. Since the kernel was loaded below this address (at 0x100000), the kernel setup code should copy itself, the compressed kernel image, and the decompressor code to this buffer. In addition, to have some room for the safe in-place decompression, it should calculate a special offset from the beginning of this buffer.
We can see this calculation in the following code:
Although it may look scary, it is not as complex as it seems. Let's take a closer look at it and try to understand what it does.
The ebp register contains the physical address where the protected-mode kernel was loaded. We know that this address is 0x100000. This address is rounded up to a two-megabyte boundary, and the result is compared with LOAD_PHYSICAL_ADDR:
If this value is equal to or greater than LOAD_PHYSICAL_ADDR, we leave it as is.
Otherwise, we put the value of LOAD_PHYSICAL_ADDR (which is 0x1000000) into the ebx register.
At this moment, we have the pointer to the beginning of the buffer where the kernel image is relocated and decompressed in the ebx register.
The last two lines are the most interesting. Using them, the kernel calculates the address to which the compressed kernel image and the decompressor are moved for safe in-place decompression. At first, we add BP_init_size to the ebx register. BP_init_size is the larger of two values: the size of the uncompressed kernel image code (from _text to _end) and the size of the kernel setup code + compressed kernel image + decompressor code. At this moment, the ebx register points to the end of the decompression buffer. On the last line of the code, we move this pointer back to the new place of the startup_32 symbol within the decompression buffer.
As a result, we get something like this:
The decompressor code decompresses the compressed kernel image starting from the beginning of the buffer and gradually overwrites the compressed kernel image. As mentioned above, the size of the gap between the beginning of the decompression buffer and startup_32 must be safe enough not to overwrite still-compressed parts of the image with the decompressed ones. The calculation of this gap highly depends on the compression method the kernel uses and is encoded in BP_init_size. Here I will skip all the details about this calculation, but if you are interested, you can find more details in the comment located in the arch/x86/boot/header.S file.
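The calculation described above can be summarised with a short Python model. It is a sketch under stated assumptions: the function and parameter names are hypothetical, `init_size` stands for the value of BP_init_size, and `image_size` stands for the distance from startup_32 to the end of the loaded image, which is the amount the pointer is moved back by. The numeric values in the example are invented.

```python
LOAD_PHYSICAL_ADDR = 0x1000000  # default value mentioned in the text
ALIGNMENT = 0x200000            # two-megabyte alignment mentioned in the text


def align_up(address, alignment):
    """Round address up to the next multiple of alignment (a power of two)."""
    return (address + alignment - 1) & ~(alignment - 1)


def relocation_target(load_address, init_size, image_size):
    """Return (buffer_start, new_startup_32) for in-place decompression."""
    buffer_start = align_up(load_address, ALIGNMENT)
    if buffer_start < LOAD_PHYSICAL_ADDR:
        buffer_start = LOAD_PHYSICAL_ADDR
    buffer_end = buffer_start + init_size    # end of the decompression buffer
    new_startup_32 = buffer_end - image_size  # image is placed at the end of the buffer
    return buffer_start, new_startup_32


# Invented sizes: a 48 MiB buffer and a 16 MiB loaded image.
start, target = relocation_target(0x100000, 0x3000000, 0x1000000)
print(hex(start), hex(target))
```

The gap between `start` and `target` is the room that the decompressor can fill before it reaches the data that is still compressed.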
Enabling PAE mode
The next step before the kernel can switch the processor into the long mode is to set up the so-called PAE mode:
The kernel does this by setting the X86_CR4_PAE bit in the cr4 control register. This tells the processor that the page table entries will be enlarged from 32 to 64 bits. We will see this process soon.
Set up paging
At this moment, we have almost finished the preparations needed to switch the processor into 64-bit long mode. The next crucial step is to build page tables. But before we take a look at the process of page table setup, let's try to briefly understand what paging is.
In protected mode, each memory access is interpreted through a segment descriptor stored in the Global Descriptor Table. The situation changes significantly in long mode.
In 64-bit mode, segmentation is disabled. The base and limit fields of most segment descriptors are ignored, and the processor treats the address space as a flat linear range. Of course, code, data, and stack segments still exist, but only formally. The processor still requires valid segment selectors, but they no longer perform address translation in the traditional sense.
Instead, memory translation in long mode relies almost entirely on the mechanism called paging.
Each program now operates with addresses that are called virtual. When a program references a virtual address, the processor interprets the address as a 64-bit linear address and translates it through a multi-level structure called page tables.
[!NOTE] Modern x86_64 processors support five-level paging, but we will skip it in this post and focus on four-level paging.
Let’s briefly see what happens when the processor needs to translate a virtual address into a physical one.
In four-level paging mode, a virtual address is 64 bits long. However, only the lower 48 bits are actually used for translation to a physical address. These 48 bits are divided into several parts:
Each group of 9 bits selects an entry in one level of the page-table hierarchy. Since 9 bits can represent 512 values, each page table contains exactly 512 entries. Each entry of a page table occupies 8 bytes, so a single page table fits into one 4-kilobyte page.
When the processor translates a virtual address, it performs the following steps:
It reads the cr3 control register to obtain the physical address of the top-level page table called PML4.
It extracts bits 47–39 of the virtual address and uses them as an index of the PML4 page table.
The selected PML4 entry contains the physical address of the next-level table called PDPT.
Bits 38–30 are selected to find an entry in the PDPT.
Bits 29–21 are selected to find an entry in the PD.
Bits 20–12 select an entry in the PT.
Bits 11–0 provide the offset inside the resulting physical page.
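The bit slicing from the steps above is easy to reproduce. The following Python sketch splits a virtual address into the four table indices and the page offset; the function name is hypothetical and exists only for this illustration.

```python
def split_virtual_address(va):
    """Split a virtual address into four-level paging indices and the page offset."""
    return {
        "pml4": (va >> 39) & 0x1FF,  # bits 47-39
        "pdpt": (va >> 30) & 0x1FF,  # bits 38-30
        "pd": (va >> 21) & 0x1FF,    # bits 29-21
        "pt": (va >> 12) & 0x1FF,    # bits 20-12
        "offset": va & 0xFFF,        # bits 11-0
    }


# The 32-bit kernel entry point used earlier in this chapter.
print(split_virtual_address(0x100000))
```

Each index is masked with 0x1FF because nine bits can select one of 512 entries.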
In addition to the physical address of the next-level table, each page table entry contains flags. Most of them are located in the lower 12 bits, while the NX flag occupies the highest bit of the entry. These flags are:
P (Present) - Indicates whether the page or page table entry is valid and exists in memory. If cleared, accessing the corresponding address causes a page fault.
RW (Read/Write) - Determines whether write operations are permitted. If cleared, the page is read-only; if set, writes are allowed (subject to privilege rules).
US (User/Supervisor) - Controls privilege-level access. If cleared, the page is accessible only in supervisor mode. If set, it may also be accessed from user mode.
PWT (Page-Level Write-Through) - Controls the caching policy. If set, write-through caching is used; otherwise, write-back caching is typically applied.
PCD (Page Cache Disable) - Disables caching for the referenced page when set. Commonly used for memory-mapped I/O regions.
A (Accessed) - Set automatically by the processor when the page-table entry is used during address translation. Useful for page replacement decisions.
D (Dirty) - Set automatically by the processor when a write operation occurs to a mapped page. Indicates that the page has been modified.
PS (Page Size) - Determines whether the entry maps a large page (e.g., 2 MiB or 1 GiB) instead of pointing to a lower-level page table.
NX (No-Execute) - Prevents instruction execution from the referenced page when set. Used to enforce executable/non-executable memory protections.
You might wonder how an 8-byte entry can contain both a 64-bit physical address of the next-level page table and flags at the same time. The reason is that each page table is aligned on a four-kilobyte boundary. As a result, the lower 12 bits of its physical address are always zero. These 12 bits are therefore used to store the flags.
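A short Python sketch can show how the address and the flags share one 64-bit value. The bit positions follow the architectural layout of a page table entry. The address mask assumes the architectural maximum of 52 physical address bits, and the function name is hypothetical.

```python
# Bit positions of the flags inside a 64-bit page table entry.
FLAG_BITS = {
    "P": 0, "RW": 1, "US": 2, "PWT": 3, "PCD": 4,
    "A": 5, "D": 6, "PS": 7, "NX": 63,
}

# Bits 51-12 hold the physical address of a 4 KiB-aligned table or page.
ADDRESS_MASK = 0x000FFFFFFFFFF000


def decode_entry(entry):
    """Return (physical_address, set_of_flag_names) for a page table entry."""
    flags = {name for name, bit in FLAG_BITS.items() if (entry >> bit) & 1}
    return entry & ADDRESS_MASK, flags


# An entry that points to a table at 0x1000 with the Present, Read/Write and User flags.
address, flags = decode_entry(0x1007)
print(hex(address), sorted(flags))
```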
Now that we know how the processor translates a virtual address to a physical address using paging, it is time to take a look at the structure of page tables.
A page table in x86_64 is a four-kilobyte memory area that contains 512 entries. Each entry occupies 8 bytes. In four-level paging mode with four-kilobyte pages, four such tables participate in the translation of a virtual address:
Level 4 (PML4) - The top-level page table. Each entry points to a Page Directory Pointer Table (PDPT).
Level 3 (PDPT) - The third-level table. Each entry points to a Page Directory (PD) or, if the PS bit is set, directly maps a 1 GiB page.
Level 2 (PD) - The second-level table. Each entry points to a Page Table (PT) or, if the PS bit is set, directly maps a 2 MiB page.
Level 1 (PT) - The first-level table. Each entry points directly to a 4 KiB physical memory page.
Each table has the same internal structure. The only difference between them is how their entries are interpreted. As we already know, an entry in a page table is 64 bits wide. It contains two types of information:
A physical address of either the next-level page table or a physical memory page
A set of control flags that define access permissions and status information
If you are interested in this topic, you can find more information about page tables and page table entries structure in the Intel® 64 and IA-32 Architectures Software Developer Manuals.
Now that we know a little about paging, we can return to the kernel and see how it builds the early page tables needed to switch to long mode. But before we jump directly to the code, we need to remember one important thing. The kernel will be relocated to the address stored in the ebx register, as seen above. So, all structures, including the page tables, should be addressed relative to this address.
The page table structure for boot is defined in the same source code file and looks like this:
The kernel needs to fill this structure with the proper page table entries for early 64-bit code. First of all, it fills the whole memory area occupied by the page tables with zeros for safety:
At the beginning, we load the address of the start of the page table area into the edi register. After this, the kernel fills the memory area that will be occupied by the page tables with zeros. The boot page tables will have the following structure:
1 level4 table
1 level3 table
4 level2 tables that map everything with 2M pages
After the kernel clears the memory region reserved for the page tables, it starts populating it with entries. At the start, it fills the first and only entry of the top-level page table. The following snippet shows this:
In the code above, the kernel fills the first entry of the top-level page table with the address of the next-level page table, which is located at the pgtable + 0x1000 address and has 0x7 flags. In our case, the flags 0x7 are:
Present
Read/Write
User
In the next step, the kernel builds four Page Directory entries in the Page Directory Pointer table with the same Present+Read/Write/User flags:
In the code above, we can see the filling of the first four entries of the 3rd-level page table. The first entry of the 3rd-level page table is located at the offset 0x1000 from the beginning of the top-level page table. The value of the eax register is similar to the 4th-level page table entry, with the difference that now it points to the 2nd-level page table. Next, the kernel fills the four entries of the 3rd-level page table in a loop that runs until the value of the ecx register reaches zero. As soon as these page table entries are filled, the kernel proceeds to the next-level page table:
Here we fill four page directory tables with 2048 entries in total (512 entries per table). The first entry is located at the offset 0x2000 from the beginning of the top-level page table. Each entry maps a two-megabyte chunk of memory with the following flags:
Present
Read/Write
Large Page
Global
Compared to the entries of the higher-level tables, the User flag is not set here, and two additional flags appear. The Large Page flag tells the processor to use two-megabyte pages, and the Global flag tells it to keep the TLB entry across reloads of the cr3 register.
There is no need to populate the lowest-level page tables yet. Every entry in the 2nd-level page directory has the Large Page bit set, which means each entry directly maps a two-megabyte region of physical memory. During the address translation, the page-walk procedure stops at the 2nd-level page table, and the lower 21 bits of the virtual address are used as the offset inside that two-megabyte page.
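To check that this layout really maps the first four gigabytes one-to-one, we can rebuild it in a small Python model and walk it the way the processor would. This is a sketch, not the kernel's code: memory is modelled as a dictionary from physical address to 64-bit entry, the names are hypothetical, and the two-megabyte entries carry only a simplified flag set (Present, Read/Write, Large Page), since those are the bits the walk depends on.

```python
PRESENT = 1 << 0
READ_WRITE = 1 << 1
USER = 1 << 2
PAGE_SIZE = 1 << 7

TABLE_FLAGS = PRESENT | READ_WRITE | USER            # 0x7, as in the text
LARGE_PAGE_FLAGS = PRESENT | READ_WRITE | PAGE_SIZE  # simplified for this model

ADDR_MASK_4K = 0x000FFFFFFFFFF000  # address of a 4 KiB-aligned table
ADDR_MASK_2M = 0x000FFFFFFFE00000  # address of a 2 MiB-aligned page


def build_boot_page_tables(pgtable):
    """Build 1 level-4, 1 level-3 and 4 level-2 tables starting at pgtable."""
    mem = {}
    # Level 4: a single entry that points to the level-3 table.
    mem[pgtable] = (pgtable + 0x1000) | TABLE_FLAGS
    # Level 3: four entries, one per level-2 table.
    for i in range(4):
        mem[pgtable + 0x1000 + i * 8] = (pgtable + 0x2000 + i * 0x1000) | TABLE_FLAGS
    # Level 2: 2048 entries, each mapping a 2 MiB page.
    for i in range(2048):
        mem[pgtable + 0x2000 + i * 8] = (i * 0x200000) | LARGE_PAGE_FLAGS
    return mem


def translate(mem, cr3, va):
    """Walk the tables; return the physical address or None if unmapped."""
    table = cr3
    for shift in (39, 30):
        entry = mem.get(table + ((va >> shift) & 0x1FF) * 8, 0)
        if not entry & PRESENT:
            return None
        table = entry & ADDR_MASK_4K
    entry = mem.get(table + ((va >> 21) & 0x1FF) * 8, 0)
    if not entry & PRESENT or not entry & PAGE_SIZE:
        return None
    # The walk stops here: the lower 21 bits are the offset in the 2 MiB page.
    return (entry & ADDR_MASK_2M) | (va & 0x1FFFFF)


PGTABLE = 0x4000000  # hypothetical 4 KiB-aligned location of the tables
tables = build_boot_page_tables(PGTABLE)
print(hex(translate(tables, PGTABLE, 0x100000)))
```

Six tables of 4096 bytes each are enough to cover the whole four-gigabyte range, because every level-2 entry maps two megabytes.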
The page tables are now fully prepared. The last remaining step is to tell the processor where the top-level page table resides. As we know, this is done by loading the physical address of the top-level page table into the cr3 control register:
From this moment, the page tables that cover four gigabytes of memory are ready, and the processor knows where to find them. Paging itself is turned on a bit later, when the kernel sets the paging bit in the cr0 control register right before the jump to 64-bit code. The kernel is ready for the transition into long mode.
The transition into 64-bit mode
Only the last steps remain before the Linux kernel can switch the processor into long mode. The first one is setting the LME flag in EFER (Extended Feature Enable Register), a model-specific register identified by the number 0xC0000080:
This is the Long Mode Enable bit, and it is mandatory to set this bit to enable long mode.
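It can be useful to see how this bit interacts with the other enable bits. The Python sketch below models the three registers as plain integers and shows that long mode becomes active only when PAE, LME and paging are all enabled. The bit positions are architectural; the function names and the order of the steps in the model are illustrative assumptions.

```python
MSR_EFER = 0xC0000080  # number of the EFER model-specific register
EFER_LME = 1 << 8      # Long Mode Enable
X86_CR4_PAE = 1 << 5   # Physical Address Extension
X86_CR0_PE = 1 << 0    # Protection Enable (already set in protected mode)
X86_CR0_PG = 1 << 31   # Paging


def long_mode_active(cr0, cr4, efer):
    """Long mode is active only with paging, PAE and LME all enabled."""
    return bool(cr0 & X86_CR0_PG) and bool(cr4 & X86_CR4_PAE) and bool(efer & EFER_LME)


def enable_long_mode_steps():
    """Apply the three steps one by one and record the state after each."""
    cr0, cr4, efer = X86_CR0_PE, 0, 0
    states = []
    cr4 |= X86_CR4_PAE  # step 1: enable PAE
    states.append(long_mode_active(cr0, cr4, efer))
    efer |= EFER_LME    # step 2: set the Long Mode Enable bit
    states.append(long_mode_active(cr0, cr4, efer))
    cr0 |= X86_CR0_PG   # step 3: turn paging on
    states.append(long_mode_active(cr0, cr4, efer))
    return states


print(enable_long_mode_steps())
```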
In the next step, we can see the preparation for the jump to the long mode entrypoint. To do this jump, the kernel pushes the kernel code segment selector along with the address of the long mode entrypoint onto the stack:
Since the stack now contains the kernel code segment selector and the address of the entrypoint, the kernel executes the last instruction in protected mode:
The CPU pops the address of startup_64, which is the long mode entrypoint, from the stack and jumps there:
The Linux kernel is now in 64-bit mode! 🎉
Conclusion
This is the end of the third part about Linux kernel insides. If you have questions or suggestions, feel free to ping me on X - 0xAX, drop me an email, or just create an issue.