Kernel decompression

In the previous part, we saw the transition from protected mode into long mode, but what we have in memory is not yet a kernel image that is ready to run. We are still in the kernel setup code, which must decompress the kernel and pass control to it. The next step before we see the Linux kernel entrypoint is kernel decompression.

First steps in the long mode

The point where we stopped in the previous chapter is the lret instruction, which performed a "jump" to the 64-bit entry point located in arch/x86/boot/compressed/head_64.S:

	.code64
	.org 0x200
SYM_CODE_START(startup_64)

This is the first 64-bit code that we see. Before decompression, the kernel must complete a few final steps. These steps are:

  • Disabling the interrupts

  • Unification of the segment registers

  • Calculation of the kernel relocation address

  • Reload of the Global Descriptor Table

  • Load of the Interrupt Descriptor Table

All of this we will see in the next sections.

Disabling the interrupts

The 64-bit entrypoint starts with the same two instructions as the 32-bit one:

	cld
	cli

As we already know from the previous part, the first instruction clears the direction flag bit in the flags register, and the second instruction disables interrupts.

Just as the bootloader can load the Linux kernel at the 32-bit entrypoint instead of the 16-bit one, it can also switch the processor into 64-bit long mode by itself and start the kernel from the 64-bit entry point.

The kernel executes these two instructions in case the bootloader didn't perform them before transferring control to the kernel. Clearing the direction flag ensures that memory copying operations proceed in the correct direction, and disabling interrupts prevents them from disrupting the kernel decompression process.

Unification of the segment registers

After these two instructions are executed, the next step is to unify segment registers:

Segment registers are not used in long mode, so the kernel resets them to zero.

Calculation of the kernel relocation address

The next step is to compute the difference between the location the kernel was compiled to be loaded at and the location where it is actually loaded:

This operation is very similar to what we have seen already in the Calculation of the kernel relocation address section of the previous chapter.

[!TIP] It is highly recommended to read carefully Calculation of the kernel relocation address before trying to understand this code.

This piece of code is almost a 1:1 copy of what we have seen in protected mode. If you understood it back then, you shouldn't have any problems understanding it now. The main purpose of this code is to set up the rbp and rbx registers: rbp holds the base address where the kernel will be decompressed, and rbx holds the address where the kernel image with the decompressor code should be relocated for safe decompression.

The only difference from the protected mode code is that the kernel can now use rip-based addressing to get the address of startup_32, so it does not need the tricks with the call and popl instructions that were required in protected mode. Everything else is the same as what we have already seen in the previous chapter, and it is repeated here for one reason: if the bootloader starts the kernel at the 64-bit entry point, the protected mode code is skipped.
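The address calculation described above can be sketched in C. This is only an illustration of the arithmetic, not the kernel's code: the function names, the `min_addr` parameter (standing in for the compile-time load address), and the `buffer_size` and `image_size` parameters are assumptions made for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Round addr up to the next multiple of align (align must be a power of two). */
static uint64_t align_up(uint64_t addr, uint64_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return (addr + align - 1) & ~(align - 1);
}

/*
 * Address where the kernel will be decompressed (kept in rbp in the text).
 * min_addr is a hypothetical stand-in for the address the kernel was
 * compiled to be loaded at.
 */
static uint64_t decompress_base(uint64_t loaded_at, uint64_t align,
				uint64_t min_addr)
{
	uint64_t base = align_up(loaded_at, align);

	return base < min_addr ? min_addr : base;
}

/*
 * Address the compressed image is moved to (kept in rbx in the text).
 * Assumption of this sketch: the image is placed so that it ends exactly
 * where the decompression buffer ends.
 */
static uint64_t relocation_target(uint64_t base, uint64_t buffer_size,
				  uint64_t image_size)
{
	assert(image_size <= buffer_size);
	return base + buffer_size - image_size;
}
```

Placing the compressed image at the end of the buffer leaves the beginning of the buffer free for the decompressed output, which is why the relocation is described as "safe".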

After these addresses are obtained, the kernel sets up the stack for the decompressor code:

Reload of the Global Descriptor Table

The next step is to set up a new Global Descriptor Table. Yes, one more time 😊 There are at least two reasons to do this:

  1. The bootloader can load the Linux kernel starting from the 64-bit entrypoint, and the kernel needs to set up its own Global Descriptor Table in case the one from the bootloader is not suitable.

  2. The kernel might be configured with support for 5-level paging, and in this case, the kernel needs to jump to 32-bit mode again to enable it safely.

The "new" Global Descriptor Table has the same entries but is pointed to by the gdt64 symbol:

The only difference is that in 64-bit mode the lgdt instruction loads the GDTR register from a 10-byte operand, whereas in 32-bit mode the operand is 6 bytes. To load the new Global Descriptor Table, the kernel writes its address to the GDTR register using the lgdt instruction:
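The size difference mentioned above follows from the layout of the lgdt operand: a 16-bit limit followed by the base address, which is 32 bits wide in protected mode and 64 bits wide in long mode. The sketch below only illustrates that layout. The struct and function names are made up for this example, and the `packed` attribute is a GCC/Clang extension.

```c
#include <assert.h>
#include <stdint.h>

/* Operand of lgdt in 32-bit mode: 16-bit limit + 32-bit base = 6 bytes. */
struct gdt_ptr32 {
	uint16_t limit;
	uint32_t base;
} __attribute__((packed));

/* Operand of lgdt in 64-bit mode: 16-bit limit + 64-bit base = 10 bytes. */
struct gdt_ptr64 {
	uint16_t limit;
	uint64_t base;
} __attribute__((packed));

_Static_assert(sizeof(struct gdt_ptr32) == 6, "32-bit operand is 6 bytes");
_Static_assert(sizeof(struct gdt_ptr64) == 10, "64-bit operand is 10 bytes");

/* Build a 64-bit operand; the limit is the table size in bytes minus one. */
static struct gdt_ptr64 make_gdt_ptr64(uint64_t table_addr, uint16_t table_bytes)
{
	struct gdt_ptr64 ptr;

	assert(table_bytes > 0);
	ptr.limit = (uint16_t)(table_bytes - 1);
	ptr.base = table_addr;
	return ptr;
}
```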

Load of the Interrupt Descriptor Table

After the new Global Descriptor Table is loaded, the next step is to load the new Interrupt Descriptor Table:

The load_stage1_idt function is defined in arch/x86/boot/compressed/idt_64.c and uses the lidt instruction to load the address of the new Interrupt Descriptor Table. At this point, the Interrupt Descriptor Table has only NULL entries, so no interrupts are handled. As you may remember, the interrupts are disabled at this moment anyway. The valid interrupt handlers will be loaded after kernel relocation.

The next steps after this are closely related to the setup of 5-level paging, if it is configured using the CONFIG_PGTABLE_LEVELS=5 kernel configuration option. This feature extends the virtual address space beyond the traditional 4-level paging scheme, but it is still relatively uncommon in practice and not essential for understanding the mainline boot flow. As mentioned in the previous chapter, for clarity and focus, we’ll set it aside and continue with the standard 4-level paging case.

Kernel relocation

Since the calculation of the base address for the kernel relocation is done, the kernel setup code can copy the compressed kernel image and the decompressor code to the memory area pointed by this address:

The set of assembly instructions above copies the compressed kernel image and decompressor code to the memory area that starts at the address held in the rbx register. The copy covers the memory contents between the startup_32 symbol and _bss-8, which includes:

  • 32-bit kernel setup code

  • compressed kernel image

  • decompressor code

Because of the std instruction, the copy runs backward, from higher memory addresses to lower ones.
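The reason for copying backward is that the destination lies above the source and the two regions may overlap. A forward copy would overwrite source data before it has been read. Below is a minimal C sketch of the same idea, assuming the size is a multiple of 8 bytes. The function name is hypothetical, and the kernel does this with string instructions rather than a C loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Copy `bytes` bytes from src to dst in 8-byte words, starting with the
 * last word and moving toward lower addresses. This is safe when dst
 * lies above src and the two regions overlap.
 */
static void copy_backward(uint64_t *dst, const uint64_t *src, size_t bytes)
{
	size_t words;

	assert(bytes % 8 == 0);
	words = bytes / 8;
	while (words > 0) {
		words--;
		dst[words] = src[words];
	}
}
```

With a six-word buffer holding 1, 2, 3, 4, 0, 0, copying the first four words two slots higher yields 1, 2, 1, 2, 3, 4: every source word is read before its slot is overwritten.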

After the copying is performed, the kernel needs to reload the previously loaded Global Descriptor Table in case it was overwritten or corrupted during the copy procedure:

Finally, the kernel jumps to the relocated code:

The last actions before the kernel decompression

In the previous section, we saw the kernel relocation. The very first task after this jump is to clear the .bss section. This step is needed because the .bss section holds all uninitialized global and static variables, which C code expects to be initialized with zeros. By clearing it, the kernel ensures that all the following code, including the decompressor, begins with a proper .bss memory area without any garbage in it.

The following code does that:

The assembly code above should be pretty easy to understand if you read the previous parts. It clears the value of the eax register and uses its value to fill the memory region of the .bss section between the _bss and _ebss symbols.
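In C terms, clearing the region is a loop that stores zero into every 8-byte word between two boundaries. The sketch below is an illustration only: the function name is made up, and the `start` and `end` parameters play the role of the _bss and _ebss symbols.

```c
#include <assert.h>
#include <stdint.h>

/* Store zero into every 8-byte word in the half-open range [start, end). */
static void clear_region(uint64_t *start, uint64_t *end)
{
	assert(start <= end);
	for (uint64_t *p = start; p < end; p++)
		*p = 0;
}
```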

In the next step, the kernel fills the new Interrupt Descriptor Table with the call:

This function is defined in arch/x86/boot/compressed/idt_64.c and looks like this:

We can skip the part of the code wrapped with CONFIG_AMD_MEM_ENCRYPT as it is not of main interest for us right now, but let's try to understand the rest of the function's body. It is similar to the first stage of the Interrupt Descriptor Table. It loads the entries of this table using the lidt instruction, which we have already seen before. The only difference is that it sets up two interrupt handlers:

  • PF - Page fault interrupt handler

  • NMI - Non-maskable interrupt handler

The first interrupt handler is set because the initialize_identity_maps function (which we will see very soon) may trigger a page fault exception. This exception can be triggered, for example, when Address space layout randomization is enabled and the randomly chosen physical and virtual addresses are ones for which the page tables do not have an entry.

The second interrupt handler is needed to avoid a triple fault if a non-maskable interrupt arrives during kernel decompression. For this, at least a dummy NMI handler is required.
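To make the idea of "setting up two handlers" more concrete, here is a sketch of how an entry of a 64-bit Interrupt Descriptor Table is laid out and filled in. The struct, macro, and function names are invented for this example and do not match the kernel's helpers. The vector numbers (2 for NMI, 14 for page fault) and the 16-byte gate layout come from the x86-64 architecture.

```c
#include <assert.h>
#include <stdint.h>

/* Layout of one 16-byte gate descriptor in a 64-bit IDT. */
struct idt_gate {
	uint16_t offset_low;   /* handler address bits 0..15   */
	uint16_t selector;     /* code segment selector        */
	uint8_t  ist;          /* interrupt stack table index  */
	uint8_t  type_attr;    /* present bit, DPL, gate type  */
	uint16_t offset_mid;   /* handler address bits 16..31  */
	uint32_t offset_high;  /* handler address bits 32..63  */
	uint32_t reserved;
} __attribute__((packed));

#define VECTOR_NMI 2
#define VECTOR_PF  14
#define GATE_INTERRUPT 0x8E /* present, DPL 0, 64-bit interrupt gate */

/* Fill in the gate for `vector` so that it points at `handler`. */
static void set_gate(struct idt_gate *idt, unsigned int vector,
		     uint64_t handler, uint16_t code_selector)
{
	struct idt_gate *gate;

	assert(vector < 256);
	gate = &idt[vector];
	gate->offset_low = (uint16_t)(handler & 0xFFFF);
	gate->selector = code_selector;
	gate->ist = 0;
	gate->type_attr = GATE_INTERRUPT;
	gate->offset_mid = (uint16_t)((handler >> 16) & 0xFFFF);
	gate->offset_high = (uint32_t)(handler >> 32);
	gate->reserved = 0;
}

/* Reassemble the handler address stored in a gate. */
static uint64_t gate_handler(const struct idt_gate *gate)
{
	return (uint64_t)gate->offset_low |
	       ((uint64_t)gate->offset_mid << 16) |
	       ((uint64_t)gate->offset_high << 32);
}
```

All entries that are never passed to `set_gate` stay zeroed, which corresponds to the NULL entries of the first-stage table.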

After the Interrupt Descriptor Table is re-loaded, the initialize_identity_maps function is called:

This function is defined in arch/x86/boot/compressed/ident_map_64.c and clears the memory area for the top-level page table identified by the top_level_pgt pointer to initialize a new page table. Yes, the kernel needs to initialize page tables one more time, even though we have already seen the initialization and setup of the early page tables in the previous chapter. The reason for "one more" page table is that if the kernel was loaded at the 64-bit entrypoint, it uses the page table built by the bootloader. Since the kernel was relocated to a new place, the decompressor code can overwrite these page tables during decompression.

The new page table is built in a very similar way to the previous page table. Each virtual address directly corresponds to the same physical address. That is why it is called the identity mapping.

Now let's take a look at the implementation of this function. It starts by initializing an instance of the x86_mapping_info structure called mapping_info:

This structure provides information about memory mappings and a callback to allocate space for page table entries. The context field is used for tracking the allocated page tables. The page_flag and kernpg_flag fields define various page attributes (such as present, writable, or executable), which are reflected in their names.
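The combination of an allocation callback and an opaque context pointer is a common C pattern. The sketch below shows the pattern in isolation. It is not the kernel's x86_mapping_info definition: every name here (`mapping_desc`, `pgt_pool`, `pool_alloc_page`, `SKETCH_PAGE_SIZE`) is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096

/* Hypothetical bookkeeping for a pool of page-table pages. */
struct pgt_pool {
	unsigned char *buf; /* start of the pool            */
	size_t size;        /* pool size in bytes           */
	size_t used;        /* bytes handed out so far      */
};

/* Hypothetical, simplified analogue of a mapping description. */
struct mapping_desc {
	void *(*alloc_page)(void *context); /* returns one zeroed page or NULL */
	void *context;                      /* passed back to alloc_page       */
	uint64_t page_flags;                /* attributes for mapped pages     */
};

/* Allocation callback: hand out the next unused page of the pool. */
static void *pool_alloc_page(void *context)
{
	struct pgt_pool *pool = context;
	unsigned char *page;

	assert(pool->used <= pool->size);
	if (pool->size - pool->used < SKETCH_PAGE_SIZE)
		return NULL;
	page = pool->buf + pool->used;
	pool->used += SKETCH_PAGE_SIZE;
	memset(page, 0, SKETCH_PAGE_SIZE);
	return page;
}
```

The code that builds page tables only sees `alloc_page` and `context`, so it does not need to know where the pages come from or how many remain.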

In the next step, the kernel reads the address of the top-level page table from the cr3 control register and compares it with _pgtable. If you read the previous chapter, you remember that _pgtable is the page table initialized by the early kernel setup code before switching to long mode. If we came from startup_32, which is exactly our case, the cr3 register contains the same address as _pgtable. In this case, the kernel reuses and extends this page table:

Otherwise, the new page table is built:

At this stage, new identity mappings are added to cover the essential regions needed for the kernel to continue the boot process:

  • the kernel image itself (from _head to _end)

  • the boot parameters provided by the bootloader

  • the kernel command line

All of the actual work is performed by the kernel_add_identity_map function defined in the same file:

The kernel_add_identity_map function walks the page table hierarchy and ensures that there are page table entries which provide a 1:1 mapping into the virtual address space. If such entries do not exist, new entries are allocated with the flags that we have seen during the initialization of mapping_info.
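The following toy model illustrates what "ensure that a 1:1 entry exists" means. It is a deliberately simplified sketch with several assumptions: a single flat table instead of a multi-level hierarchy, a 2 MiB mapping granularity, and invented names throughout.

```c
#include <assert.h>
#include <stdint.h>

#define HUGE_PAGE_SIZE (2ULL << 20) /* 2 MiB granularity (assumption) */
#define TABLE_ENTRIES  512          /* toy table covers the first 1 GiB */
#define ENTRY_PRESENT  0x1ULL

/*
 * Make sure [start, end) is identity mapped in the toy table.
 * Returns the number of entries that had to be created.
 */
static unsigned int add_identity_map(uint64_t *table, uint64_t start,
				     uint64_t end)
{
	unsigned int added = 0;

	/* Widen the range to whole 2 MiB pages. */
	start &= ~(HUGE_PAGE_SIZE - 1);
	end = (end + HUGE_PAGE_SIZE - 1) & ~(HUGE_PAGE_SIZE - 1);
	assert(end <= TABLE_ENTRIES * HUGE_PAGE_SIZE);

	for (uint64_t addr = start; addr < end; addr += HUGE_PAGE_SIZE) {
		uint64_t *entry = &table[addr / HUGE_PAGE_SIZE];

		if (*entry & ENTRY_PRESENT)
			continue;               /* already mapped, keep it */
		*entry = addr | ENTRY_PRESENT;  /* physical == virtual */
		added++;
	}
	return added;
}

/* Translate a virtual address; returns UINT64_MAX if it is not mapped. */
static uint64_t translate(const uint64_t *table, uint64_t vaddr)
{
	uint64_t entry;

	assert(vaddr < TABLE_ENTRIES * HUGE_PAGE_SIZE);
	entry = table[vaddr / HUGE_PAGE_SIZE];
	if (!(entry & ENTRY_PRESENT))
		return UINT64_MAX;
	return (entry & ~(HUGE_PAGE_SIZE - 1)) | (vaddr & (HUGE_PAGE_SIZE - 1));
}
```

Calling `add_identity_map` twice for overlapping ranges creates entries only once, and `translate` returns its argument unchanged for every mapped address, which is the defining property of an identity mapping.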

After all the identity mapping page table entries were initialized, the kernel updates the cr3 control register with the address of the top page table:

At this point, all the preparations needed to decompress the kernel image are done. Now the kernel decompressor code is ready to decompress the kernel:

After the kernel is decompressed, the last instructions of the decompressor code transfer control to the Linux kernel by jumping to the address of the kernel's entrypoint. The early setup phase is complete, and the Linux kernel starts its job 🎉

In the next section, let's see how the kernel decompression works.

Kernel decompression

Right now, we are finally at the last point before we see the kernel entrypoint. The last remaining step is only to decompress the kernel and switch control to it.

The kernel decompression is performed by the extract_kernel function defined in arch/x86/boot/compressed/misc.c. This function starts with the video mode and console initialization that we already saw in the previous parts. The kernel needs to do this again because it does not know if the kernel was loaded in real mode or whether the bootloader used the 32-bit or 64-bit boot protocol.

We will skip all these initialization steps as we already saw them in the previous chapters. After the first initialization steps are done, the decompressor code stores the pointers to the start of the free heap memory and to the end of it:

The main reason to set up the heap borders is that the kernel decompressor code uses the heap intensively during decompression.
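A heap described by just two pointers, its start and its end, is usually served by a simple "bump" allocator. The sketch below shows how such an allocator can work. It is an illustration under stated assumptions, not the decompressor's actual allocator: the struct and function names are invented, and the 8-byte alignment is a choice made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* A heap described only by its two borders. */
struct boot_heap {
	uintptr_t free_ptr; /* next free byte                   */
	uintptr_t end_ptr;  /* one past the last usable byte    */
};

/* Record the heap borders. */
static void heap_init(struct boot_heap *heap, void *start, size_t size)
{
	assert(start != NULL);
	heap->free_ptr = (uintptr_t)start;
	heap->end_ptr = (uintptr_t)start + size;
}

/* Hand out `size` bytes, or NULL when the heap is exhausted. */
static void *heap_alloc(struct boot_heap *heap, size_t size)
{
	/* Round the next free address up to 8 bytes. */
	uintptr_t p = (heap->free_ptr + 7) & ~(uintptr_t)7;

	if (p > heap->end_ptr || size > heap->end_ptr - p)
		return NULL;
	heap->free_ptr = p + size;
	return (void *)p;
}
```

An allocator like this never frees individual blocks. That is acceptable for a short-lived stage such as decompression, where the whole heap is abandoned once the job is done.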

After the initialization of the heap, the kernel calls the choose_random_location function from arch/x86/boot/compressed/kaslr.c. This function chooses a random location in memory to write the decompressed kernel to. It performs work only if address randomization is enabled. At this point, we will skip it and move to the next step, as it is not the most crucial point in the kernel decompression. If you are interested in what this function does, you can find more information in the next chapter.

Now let's get back to the extract_kernel function. Since we assume that the kernel address randomization is disabled, the address where the kernel image will be decompressed is stored in the output parameter without any change. The value from this variable is obtained from the rbp register as calculated in the previous steps.

The next action before the kernel is decompressed is to perform the sanity checks:

After all these checks, we can see the familiar message on the screen of our computers:

The kernel setup code starts decompression by calling the decompress_kernel function:

This function performs the following actions:

  1. Decompress the kernel

  2. Parse kernel ELF binary

  3. Handle relocations

The kernel decompression is performed by the helper function __decompress. The implementation of this function depends on which compression algorithm was used to compress the kernel and is located in one of the following files:

I will not describe each implementation here, as this information is about compression algorithms rather than something specific to the Linux kernel.

After the kernel is decompressed, two more functions are called: parse_elf and handle_relocations. Let's take a short look at them.

The kernel binary, which is called vmlinux, is an ELF executable file. As a result, after decompression we have not just a "piece" of code that we can jump to, but an ELF file with headers, program segments, debug symbols, and other information. We can verify this by inspecting vmlinux with the readelf utility:

The parse_elf function acts as a minimal ELF loader. It reads the ELF program headers of the decompressed kernel image and uses them to determine which segments must be loaded and where each segment should be placed in physical memory.

At this point, the parse_elf function has completed loading the decompressed kernel image into memory. Each PT_LOAD segment has been copied from the ELF file into its proper location. The kernel’s code, data, and other segments are now present at the chosen load address. However, it might not be sufficient to make the kernel fully runnable.
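The segment loading described above can be sketched as a small user-space function. It is not the kernel's parse_elf: the function name, the `dest_size` bound, and the `link_base` parameter are assumptions of this example, and it relies on the `<elf.h>` header shipped with Linux C libraries.

```c
#include <assert.h>
#include <elf.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Copy every PT_LOAD segment of the ELF image at `elf` into `dest`.
 * `dest` stands for the memory at link-time address `link_base`, so each
 * segment lands at dest + (p_paddr - link_base).
 * Returns the number of segments copied.
 */
static int load_segments(const unsigned char *elf, unsigned char *dest,
			 size_t dest_size, uint64_t link_base)
{
	Elf64_Ehdr ehdr;
	int loaded = 0;

	memcpy(&ehdr, elf, sizeof(ehdr));
	assert(memcmp(ehdr.e_ident, ELFMAG, SELFMAG) == 0);

	for (unsigned int i = 0; i < ehdr.e_phnum; i++) {
		Elf64_Phdr phdr;
		uint64_t offset;

		memcpy(&phdr, elf + ehdr.e_phoff + (uint64_t)i * ehdr.e_phentsize,
		       sizeof(phdr));
		if (phdr.p_type != PT_LOAD)
			continue; /* only loadable segments are copied */

		assert(phdr.p_paddr >= link_base);
		offset = phdr.p_paddr - link_base;
		assert(offset <= dest_size && phdr.p_filesz <= dest_size - offset);

		/* memmove tolerates overlapping source and destination */
		memmove(dest + offset, elf + phdr.p_offset, phdr.p_filesz);
		loaded++;
	}
	return loaded;
}
```

Headers, symbol tables, and debug sections are never referenced by a PT_LOAD program header in this sketch, so they are simply left behind.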

The kernel was originally linked assuming a specific base address. If the address space layout randomization is enabled, the kernel can instead be loaded at a different physical and virtual address. As a result, any absolute addresses embedded within the kernel image will still reflect the original link-time address rather than the actual load address. To resolve this, the kernel image includes a relocation table that identifies all locations containing such absolute references.

The handle_relocations function processes this table and adjusts each affected value by applying the relocation delta, which is the difference between the actual load address and the link-time base address.
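Applying a relocation delta amounts to adding the same value to every location listed in the table. The sketch below shows that core step only. It assumes a simplified table of 32-bit offsets that each point to a 64-bit absolute address. The real table format is not described in this chapter, and the function and parameter names are made up.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Add `delta` to the 64-bit value stored at each listed offset of `image`.
 * `offsets` is a simplified, hypothetical relocation table.
 */
static void apply_relocations(unsigned char *image, size_t image_size,
			      const uint32_t *offsets, size_t count,
			      uint64_t delta)
{
	for (size_t i = 0; i < count; i++) {
		uint64_t value;

		assert(image_size >= sizeof(value));
		assert(offsets[i] <= image_size - sizeof(value));

		/* memcpy avoids unaligned access through a cast pointer */
		memcpy(&value, image + offsets[i], sizeof(value));
		value += delta;
		memcpy(image + offsets[i], &value, sizeof(value));
	}
}
```

If the kernel ends up exactly at its link-time address, the delta is zero and the values stay unchanged.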

Once the relocations are applied, the decompressor code jumps to the kernel entrypoint. Its address is stored in the rax register, as we already have seen above.

Now we are in the kernel 🎉🎉🎉

The kernel entrypoint is the startup_64 function from arch/x86/kernel/head_64.S. This is our next stop, but it will be in the next set of chapters - Kernel initialization process.

Conclusion

This is the end of the third part about Linux kernel insides. If you have questions or suggestions, feel free to ping me on X - 0xAX, drop me an email, or just create an issue.

