Kernel decompression
In the previous part, we saw the transition from protected mode into long mode, but what we have in memory is not yet a kernel image ready to run. We are still in the kernel setup code. The kernel itself has already been loaded by the bootloader, but it is in a compressed form. Before we can reach the real kernel entry point, this compressed blob must be unpacked into memory and prepared for execution. So the last step before we see the Linux kernel entrypoint is kernel decompression.
Preparing to Decompress the Kernel
The point where we stopped in the previous chapter is the jump to the 64-bit entry point, which is located in arch/x86/boot/compressed/head_64.S:
.code64
.org 0x200
SYM_CODE_START(startup_64)
The 64-bit entrypoint starts with the same two instructions as the 32-bit one:
cld
cli
As you may already know from the previous part, the first instruction clears the direction flag bit in the flags register and the second instruction disables interrupts. Just as a bootloader can load the Linux kernel and jump to the 32-bit entrypoint right after the load, it can also switch the processor into 64-bit long mode and jump to this entrypoint directly. These two instructions have to be executed in case the bootloader did not do it before jumping to the kernel. The cleared direction flag makes memory copying operations move in the expected direction, and disabled interrupts make sure that nothing can break or interrupt the kernel decompression process.
After these two instructions are executed for safety, the next step is to unify the data segment registers:
xorl %eax, %eax
movl %eax, %ds
movl %eax, %es
movl %eax, %ss
movl %eax, %fs
movl %eax, %gs
Since we loaded a new Global Descriptor Table before we jumped into long mode, all the data segment registers have to be unified and set to zero, as the Linux kernel uses the flat memory model.
The next step is to compute the physical address where the kernel was actually loaded and, based on it, the address to which the decompressor will be relocated:
#ifdef CONFIG_RELOCATABLE
leaq startup_32(%rip) /* - $startup_32 */, %rbp
movl BP_kernel_alignment(%rsi), %eax
decl %eax
addq %rax, %rbp
notq %rax
andq %rax, %rbp
cmpq $LOAD_PHYSICAL_ADDR, %rbp
jae 1f
#endif
movq $LOAD_PHYSICAL_ADDR, %rbp
1:
/* Target address to relocate to for decompression */
movl BP_init_size(%rsi), %ebx
subl $ rva(_end), %ebx
addq %rbp, %rbx
This operation is very similar to what we already saw in the Calculation of the kernel relocation address section of the previous chapter. This piece of code is almost a 1:1 copy of the protected mode version; the only difference is that the kernel can now use rip-relative addressing to get the address of startup_32 in memory. Everything else is the same as in the previous chapter and is done for the same reason - to handle the case where the bootloader starts the kernel directly in 64-bit mode. As a result, the rbx register contains the physical address to which the decompressor will be relocated. A rough worked example of this arithmetic is shown below.
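To make this arithmetic a bit more tangible, here is a small standalone C sketch of the same calculation. All numbers in it are hypothetical and serve only to illustrate the rounding and the computation of the relocation target:
#include <stdio.h>

int main(void)
{
        /* Hypothetical values, chosen only to illustrate the assembly above. */
        unsigned long load_physical_addr = 0x1000000;  /* LOAD_PHYSICAL_ADDR           */
        unsigned long load_addr          = 0x3fa0000;  /* where the bootloader put us  */
        unsigned long align              = 0x200000;   /* boot_params.kernel_alignment */
        unsigned long init_size          = 0x1800000;  /* boot_params.init_size        */
        unsigned long decomp_size        = 0xb0000;    /* rva(_end): decompressor size */

        /* Round the load address up to the kernel alignment (ends up in %rbp). */
        unsigned long rbp = (load_addr + align - 1) & ~(align - 1);
        if (rbp < load_physical_addr)
                rbp = load_physical_addr;

        /* Relocate the decompressor to the end of the decompression buffer (%rbx). */
        unsigned long rbx = rbp + init_size - decomp_size;

        printf("decompression buffer at %#lx, decompressor relocated to %#lx\n", rbp, rbx);
        return 0;
}
Placing the relocated decompressor at the very end of the buffer leaves the room in front of it free for the decompressed kernel.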
Once this address is obtained, we can set up the stack for the decompressor code:
leaq rva(boot_stack_end)(%rbx), %rsp
The next step is to set up a new Global Descriptor Table. Yes, one more time 😊 There are at least two reasons to do this. The first, as you may already guess, is that the bootloader may load the Linux kernel starting from the 64-bit entrypoint, and the kernel needs to set up its own Global Descriptor Table in case the one from the bootloader is not suitable. The second reason is that the kernel might be configured with support for 5-level paging, and in this case the kernel needs to jump back to 32-bit mode to set it up safely. That is why gdt64 has a 32-bit code segment:
/* Make sure we have GDT with 32-bit code segment */
leaq gdt64(%rip), %rax
addq %rax, 2(%rax)
lgdt (%rax)
/* Reload CS so IRET returns to a CS actually in the GDT */
pushq $__KERNEL_CS
leaq .Lon_kernel_cs(%rip), %rax
pushq %rax
lretq
.Lon_kernel_cs:
The definition of gdt64 can be found in the same file and basically contains the same fields as the Global Descriptor Table for 32-bit mode, with the single difference that lgdt in 64-bit mode loads a GDTR that is 10 bytes in size, compared to the 6-byte GDTR in 32-bit mode:
SYM_DATA_START_LOCAL(gdt64)
.word gdt_end - gdt - 1
.quad gdt - gdt64
SYM_DATA_END(gdt64)
.balign 8
SYM_DATA_START_LOCAL(gdt)
.word gdt_end - gdt - 1
.long 0
.word 0
.quad 0x00cf9a000000ffff /* __KERNEL32_CS */
.quad 0x00af9a000000ffff /* __KERNEL_CS */
.quad 0x00cf92000000ffff /* __KERNEL_DS */
.quad 0x0080890000000000 /* TS descriptor */
.quad 0x0000000000000000 /* TS continued */
SYM_DATA_END_LABEL(gdt, SYM_L_LOCAL, gdt_end)
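To see why the kernel keeps both a 32-bit and a 64-bit code segment here, it can help to decode the two descriptor values above. The small standalone program below was written for this chapter only - it is not kernel code - and extracts the flag bits that distinguish the two segments:
#include <stdint.h>
#include <stdio.h>

/* Print the interesting bits of an x86 segment descriptor. */
static void decode(const char *name, uint64_t desc)
{
        unsigned access = (desc >> 40) & 0xff;  /* present, DPL, type          */
        unsigned flags  = (desc >> 52) & 0xf;   /* G, D/B, L, AVL              */

        printf("%-14s access=%#04x G=%u D/B=%u L=%u\n", name, access,
               (flags >> 3) & 1,    /* granularity: limit in 4 KiB units       */
               (flags >> 2) & 1,    /* default operand size: 1 means 32-bit    */
               (flags >> 1) & 1);   /* long mode code segment: 1 means 64-bit  */
}

int main(void)
{
        decode("__KERNEL32_CS", 0x00cf9a000000ffffULL);  /* D/B=1, L=0: 32-bit */
        decode("__KERNEL_CS",   0x00af9a000000ffffULL);  /* D/B=0, L=1: 64-bit */
        return 0;
}
The only difference between the two code descriptors is in the D/B and L bits: the first describes a 32-bit code segment, the second a 64-bit one.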
After the new Global Descriptor Table is loaded, the next step is to load a new Interrupt Descriptor Table:
/*
* RSI holds a pointer to a boot_params structure provided by the
* loader, and this needs to be preserved across C function calls. So
* move it into a callee saved register.
*/
movq %rsi, %r15
call load_stage1_idt
The load_stage1_idt function is defined in arch/x86/boot/compressed/idt_64.c and uses the lidt instruction to load the address of the new Interrupt Descriptor Table. At this moment, the Interrupt Descriptor Table has NULL entries to avoid handling interrupts. The valid interrupt handlers will be loaded after the kernel relocation.
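For reference, the function looks roughly like this. The snippet below is a simplified sketch - the exact body depends on the kernel version and on options such as CONFIG_AMD_MEM_ENCRYPT:
/* Set up the IDT before jumping to the relocated decompressor code. */
void load_stage1_idt(void)
{
        boot_idt_desc.address = (unsigned long)boot_idt;

        if (IS_ENABLED(CONFIG_AMD_MEM_ENCRYPT))
                set_idt_entry(X86_TRAP_VC, boot_stage1_vc);

        load_boot_idt(&boot_idt_desc);
}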
The next steps are closely related to the setup of 5-level paging if it is enabled with the CONFIG_PGTABLE_LEVELS=5 kernel configuration option. This feature extends the virtual address space beyond the traditional 4-level paging scheme, but it is still relatively uncommon in practice and not essential for understanding the mainline boot flow. For clarity and focus, we will set it aside and continue with the standard 4-level paging case.
Since the Linux kernel now has a valid stack, the kernel setup code can copy the compressed kernel image to the address that we calculated above and stored in the rbx register:
leaq (_bss-8)(%rip), %rsi
leaq rva(_bss-8)(%rbx), %rdi
movl $(_bss - startup_32), %ecx
shrl $3, %ecx
std
rep movsq
cld
The set of assembly instructions above copies the decompressor image (together with the compressed kernel) to a safe place so that it does not overlap important regions of memory during the decompression process. The copy is performed backwards (note the std before rep movsq) because the source and destination ranges may overlap. After the decompressor code has been moved, the kernel needs to reload the previously loaded Global Descriptor Table in case it was overwritten during the copy:
leaq rva(gdt64)(%rbx), %rax
leaq rva(gdt)(%rbx), %rdx
movq %rdx, 2(%rax)
lgdt (%rax)
And finally, it jumps to the relocated code:
leaq rva(.Lrelocated)(%rbx), %rax
jmp *%rax
The last steps before kernel decompression
In the previous section we looked at how the kernel’s decompressor code relocates itself and then jumps to the new entry point at its final address. The very first task after this jump is to clear the .bss
section.
This step is needed because the .bss
section holds all uninitialized global and static variables. By definition, they must start out as zero in C
code. Since the relocation code only copied the initialized part of the image, the .bss
may contain random garbage values. By clearing it, the kernel ensures that the decompressor begins in a predictable state before it proceeds further.
The following code does that:
xorl %eax, %eax
leaq _bss(%rip), %rdi
leaq _ebss(%rip), %rcx
subq %rdi, %rcx
shrq $3, %rcx
rep stosq
The assembly code above should be pretty easy to understand if you have read the previous parts. It zeroes the eax register and fills the memory area of the .bss section between the _bss and _ebss labels with it - essentially a hand-written memset to zero.
In the next step, the kernel fills the new Interrupt Descriptor Table with the call:
call load_stage2_idt
This function is defined in arch/x86/boot/compressed/idt_64.c and looks like this:
void load_stage2_idt(void)
{
        boot_idt_desc.address = (unsigned long)boot_idt;

        set_idt_entry(X86_TRAP_PF, boot_page_fault);
        set_idt_entry(X86_TRAP_NMI, boot_nmi_trap);

#ifdef CONFIG_AMD_MEM_ENCRYPT
        /*
         * Clear the second stage #VC handler in case guest types
         * needing #VC have not been detected.
         */
        if (sev_status & BIT(1))
                set_idt_entry(X86_TRAP_VC, boot_stage2_vc);
        else
                set_idt_entry(X86_TRAP_VC, NULL);
#endif

        load_boot_idt(&boot_idt_desc);
}
We can skip the part of the code wrapped in CONFIG_AMD_MEM_ENCRYPT as it is not our main interest, and focus on the rest of the function's body. It is similar to the first stage Interrupt Descriptor Table setup: the table is loaded with the lidt instruction that we already saw during the first stage. The difference is that this time two interrupt handlers are set up:
- #PF - page fault
- #NMI - non-maskable interrupt
The first handler is set because the initialize_identity_maps function, which we will see very soon, may need it. The second handler is needed to prevent a triple fault if a non-maskable interrupt arrives during kernel decompression, so at least a dummy NMI handler is required.
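To give an idea of what that page fault handler does, here is a simplified sketch of do_boot_page_fault() from arch/x86/boot/compressed/ident_map_64.c (the real function also contains SEV-related handling and may differ between kernel versions). It simply extends the identity mapping on demand, 2 MiB at a time, around the faulting address:
void do_boot_page_fault(struct pt_regs *regs, unsigned long error_code)
{
        unsigned long address = native_read_cr2();
        unsigned long end;

        /* Map the whole 2 MiB page that contains the faulting address. */
        address &= PMD_MASK;
        end      = address + PMD_SIZE;

        kernel_add_identity_map(address, end);
}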
After the Interrupt Descriptor Table
is re-loaded, the initialize_identity_maps
function is called:
movq %r15, %rdi
call initialize_identity_maps
This function is defined in arch/x86/boot/compressed/ident_map_64.c and clears the memory area for the top level page table pointed to by top_level_pgt in order to initialize a new page table. Yes, the kernel needs to initialize page tables one more time, despite the fact that we already saw the initialization and setup of the early page tables in the previous chapter. The reason for "one more" page table is that if the kernel was booted through the 64-bit entrypoint, it is running on the page table built by the bootloader. To be sure that it can make any changes to the memory mapping when necessary, the kernel loads its own page table.
The new page table is built so that each virtual address directly corresponds to the same physical address - that is why it is called an identity mapping. Now that we know a little bit of theory, let's return to practice and take a look at the most crucial parts of this function.
This function starts by initializing an instance of the x86_mapping_info
structure called mapping_info
:
mapping_info.alloc_pgt_page = alloc_pgt_page;
mapping_info.context = &pgt_data;
mapping_info.page_flag = __PAGE_KERNEL_LARGE_EXEC | sme_me_mask;
mapping_info.kernpg_flag = _KERNPG_TABLE;
This structure provides information about memory mappings and a callback to allocate space for page table entries. The context field is used for tracking the allocated page tables. The page_flag and kernpg_flag fields define page attributes (such as present, writable, or executable), as their names suggest.
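For reference, the structure is defined in arch/x86/include/asm/init.h and looks roughly like this (the exact set of fields varies slightly between kernel versions):
struct x86_mapping_info {
        void *(*alloc_pgt_page)(void *); /* allocate buffer for a page table */
        void *context;                   /* context for alloc_pgt_page       */
        unsigned long page_flag;         /* page flag for PMD or PUD entry   */
        unsigned long offset;            /* ident mapping offset             */
        bool direct_gbpages;             /* PUD level 1 GiB page support     */
        unsigned long kernpg_flag;       /* kernel pagetable flag override   */
};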
In the next step, the kernel reads the address of the top level page table from the cr3 control register and compares it with _pgtable. If you read the previous chapter, you may remember that the _pgtable pointer stores the address of the top level page table initialized by the early kernel setup code. If we came from startup_32 - and this is exactly our case - the cr3 register will contain the same address as _pgtable. In this case, the existing page table is reused by the kernel. Otherwise, a new page table hierarchy is built:
top_level_pgt = read_cr3_pa();
if (p4d_offset((pgd_t *)top_level_pgt, 0) == (p4d_t *)_pgtable) {
        pgt_data.pgt_buf = _pgtable + BOOT_INIT_PGT_SIZE;
        pgt_data.pgt_buf_size = BOOT_PGT_SIZE - BOOT_INIT_PGT_SIZE;
        memset(pgt_data.pgt_buf, 0, pgt_data.pgt_buf_size);
} else {
        pgt_data.pgt_buf = _pgtable;
        pgt_data.pgt_buf_size = BOOT_PGT_SIZE;
        memset(pgt_data.pgt_buf, 0, pgt_data.pgt_buf_size);
        top_level_pgt = (unsigned long)alloc_pgt_page(&pgt_data);
}
We saw that the kernel does not overwrite the existing bootstrap page tables. Instead, it reuses them and extends the mapping. At this stage, new identity mappings are added to cover the essential regions needed for the kernel to continue booting:
- the kernel image itself (from _head to _end)
- the boot parameters provided by the bootloader
- the kernel command line
All of the actual work is performed by the kernel_add_identity_map
function defined in the same source code file:
kernel_add_identity_map((unsigned long)_head, (unsigned long)_end);

boot_params_ptr = rmode;
kernel_add_identity_map((unsigned long)boot_params_ptr,
                        (unsigned long)(boot_params_ptr + 1));

cmdline = get_cmd_line_ptr();
kernel_add_identity_map(cmdline, cmdline + COMMAND_LINE_SIZE);
This helper function walks the page table hierarchy and ensures that page table entries exist which provide a 1:1 mapping into the virtual address space. If such entries do not exist, new entries are allocated with the flags that we saw during the initialization of mapping_info.
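A slightly condensed version of this helper (simplified from the same file; details may differ between kernel versions) shows the idea: the requested range is rounded out to 2 MiB boundaries and handed to the generic identity mapping code together with mapping_info:
unsigned long kernel_add_identity_map(unsigned long start, unsigned long end)
{
        int ret;

        /* Align the boundaries to 2 MiB. */
        start = round_down(start, PMD_SIZE);
        end   = round_up(end, PMD_SIZE);
        if (start >= end)
                return start;

        /* Build the 1:1 mapping in the page table pointed to by top_level_pgt. */
        ret = kernel_ident_mapping_init(&mapping_info, (pgd_t *)top_level_pgt,
                                        start, end);
        if (ret)
                error("Error: kernel_ident_mapping_init() failed\n");

        return start;
}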
After all the identity mapping page table entries have been initialized, the kernel updates the cr3 control register with the address of the top level page table:
write_cr3(top_level_pgt);
At this point, all the preparations needed to decompress the kernel are done, and the decompressor code is ready to do the actual work:
/* pass struct boot_params pointer and output target address */
movq %r15, %rdi
movq %rbp, %rsi
call extract_kernel /* returns kernel entry point in %rax */
/*
* Jump to the decompressed kernel.
*/
movq %r15, %rsi
jmp *%rax
After the kernel is decompressed, the code jumps to the Linux kernel entrypoint - the place where we finally leave the kernel setup code behind and where the Linux kernel starts its job 🎉
In the next section, let's see how the kernel is decompressed.
Kernel decompression
We are at the last point before we see the kernel entrypoint: after the kernel is decompressed, we will see the jump to it. The kernel decompression is performed in the extract_kernel function, which is defined in arch/x86/boot/compressed/misc.c. This function starts with the video mode and console initialization that we already saw in the previous parts. We need to do this again because we don't know whether the kernel was loaded in real mode or whether the bootloader used the 32-bit or 64-bit boot protocol.
We will skip all these initialization steps as we already saw them in the previous chapters. After the first initialization steps, we store pointers to the start of the free heap memory and to the end of it:
free_mem_ptr = heap; /* Heap */
free_mem_end_ptr = heap + BOOT_HEAP_SIZE;
After the heap pointers are initialized, the next step is the call of the choose_random_location function from arch/x86/boot/compressed/kaslr.c. This function chooses a random location in memory to which the kernel will be decompressed, and it does real work only if address randomization is enabled. For now we will skip it and move on, as it is not the most crucial point of the kernel decompression; if you are interested in what this function does, you can find more information in the next chapter.
Now let's get back to the extract_kernel function. Since we assume that kernel address randomization is disabled, the address where the kernel image will be decompressed is stored in the output variable without any change. This value comes from the rbp register, which we calculated in the previous steps. Before the kernel is decompressed, a few last sanity checks are performed:
if ((unsigned long)output & (MIN_KERNEL_ALIGN - 1))
        error("Destination physical address inappropriately aligned");
if (virt_addr & (MIN_KERNEL_ALIGN - 1))
        error("Destination virtual address inappropriately aligned");
#ifdef CONFIG_X86_64
if (heap > 0x3fffffffffffUL)
        error("Destination address too large");
if (virt_addr + needed_size > KERNEL_IMAGE_SIZE)
        error("Destination virtual address is beyond the kernel mapping area");
#else
if (heap > ((-__PAGE_OFFSET-(128<<20)-1) & 0x7fffffff))
        error("Destination address too large");
#endif
#ifndef CONFIG_RELOCATABLE
if (virt_addr != LOAD_PHYSICAL_ADDR)
        error("Destination virtual address changed when not relocatable");
#endif
After all these checks we will see the familiar message:
Decompressing Linux...
and the kernel setup code starts decompression by calling the decompress_kernel function:
entry_offset = decompress_kernel(output, virt_addr, error);
This function performs the following three actions (a condensed sketch follows the list):
1. Decompress the kernel
2. Parse the kernel ELF binary
3. Handle relocations
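A condensed sketch of decompress_kernel() (simplified here; the exact body depends on the kernel version) shows all three steps in order:
static size_t decompress_kernel(unsigned char *outbuf, unsigned long virt_addr,
                                void (*error)(char *x))
{
        size_t entry_offset;

        /* 1. Unpack the compressed payload into the output buffer. */
        __decompress(input_data, input_len, NULL, NULL, outbuf, output_len,
                     NULL, error);

        /* 2. Parse the decompressed vmlinux ELF and place its segments. */
        entry_offset = parse_elf(outbuf);

        /* 3. Patch absolute references if the kernel was not loaded at the
         *    address it was linked for. */
        handle_relocations(outbuf, output_len, virt_addr);

        return entry_offset;
}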
The kernel decompression itself is performed by the helper function __decompress. The implementation of this function depends on which compression algorithm was used to compress the kernel and is located in one of the following files:
#ifdef CONFIG_KERNEL_GZIP
#include "../../../../lib/decompress_inflate.c"
#endif
#ifdef CONFIG_KERNEL_BZIP2
#include "../../../../lib/decompress_bunzip2.c"
#endif
#ifdef CONFIG_KERNEL_LZMA
#include "../../../../lib/decompress_unlzma.c"
#endif
#ifdef CONFIG_KERNEL_XZ
#include "../../../../lib/decompress_unxz.c"
#endif
#ifdef CONFIG_KERNEL_LZO
#include "../../../../lib/decompress_unlzo.c"
#endif
#ifdef CONFIG_KERNEL_LZ4
#include "../../../../lib/decompress_unlz4.c"
#endif
#ifdef CONFIG_KERNEL_ZSTD
#include "../../../../lib/decompress_unzstd.c"
#endif
I will not describe each implementation here, as this information is about the compression algorithms themselves rather than something specific to the Linux kernel.
After the kernel is decompressed, two more functions are called: parse_elf
and handle_relocations
. Let's take a short look at them.
The kernel binary, which is called vmlinux, is an ELF executable file. As a result, after decompression we have not just a "piece" of code we can jump to, but an ELF file with headers, program segments, debug symbols and other information. We can easily verify this by inspecting vmlinux with the readelf utility:
readelf -l vmlinux
Elf file type is EXEC (Executable file)
Entry point 0x1000000
There are 5 program headers, starting at offset 64
Program Headers:
Type Offset VirtAddr PhysAddr
FileSiz MemSiz Flags Align
LOAD 0x0000000000200000 0xffffffff81000000 0x0000000001000000
0x0000000000893000 0x0000000000893000 R E 200000
LOAD 0x0000000000a93000 0xffffffff81893000 0x0000000001893000
0x000000000016d000 0x000000000016d000 RW 200000
LOAD 0x0000000000c00000 0x0000000000000000 0x0000000001a00000
0x00000000000152d8 0x00000000000152d8 RW 200000
LOAD 0x0000000000c16000 0xffffffff81a16000 0x0000000001a16000
0x0000000000138000 0x000000000029b000 RWE 200000
...
...
...
The parse_elf function is responsible for reading the ELF headers and extracting information about the segments and where they should live in memory.
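A simplified sketch of parse_elf() (64-bit case only, error handling and some details trimmed) shows how it does this - it walks the program headers and moves every PT_LOAD segment to the place within the output buffer where the segment expects to live:
static size_t parse_elf(void *output)
{
        Elf64_Ehdr ehdr;
        Elf64_Phdr *phdrs, *phdr;
        void *dest;
        int i;

        memcpy(&ehdr, output, sizeof(ehdr));
        if (ehdr.e_ident[EI_MAG0] != ELFMAG0 || ehdr.e_ident[EI_MAG1] != ELFMAG1 ||
            ehdr.e_ident[EI_MAG2] != ELFMAG2 || ehdr.e_ident[EI_MAG3] != ELFMAG3)
                error("Kernel is not a valid ELF file");

        phdrs = malloc(sizeof(*phdrs) * ehdr.e_phnum);
        memcpy(phdrs, output + ehdr.e_phoff, sizeof(*phdrs) * ehdr.e_phnum);

        for (i = 0; i < ehdr.e_phnum; i++) {
                phdr = &phdrs[i];

                if (phdr->p_type != PT_LOAD)
                        continue;

                /* Move the segment to its expected offset inside the output buffer. */
                dest = output + (phdr->p_paddr - LOAD_PHYSICAL_ADDR);
                memmove(dest, output + phdr->p_offset, phdr->p_filesz);
        }

        free(phdrs);
        return ehdr.e_entry - LOAD_PHYSICAL_ADDR;
}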
In addition, this ELF file contains relocation entries. Normally, the Linux kernel is loaded at a fixed virtual address, but as we have seen before, kernel address randomization may be enabled, and in that case the kernel ends up at a random address. Relocation entries record which instructions or data references contain absolute addresses that need to be patched with the actual load address. The handle_relocations function goes through the relocation table and applies the necessary adjustments so that all references point to the correct locations in memory.
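The mechanics can be illustrated with a tiny standalone toy example. This is not the real handle_relocations() - the real function in arch/x86/boot/compressed/misc.c reads its relocation table from the end of the decompressed image and also deals with 64-bit and inverse 32-bit entries - but it shows the core idea of patching every recorded reference by the load delta:
#include <stdint.h>
#include <stdio.h>

int main(void)
{
        /* A pretend image holding two absolute references into itself. */
        uint64_t image[4]    = { 0x1000000, 0x1000040, 0, 0 };
        size_t reloc_table[] = { 0, 1 };        /* slots that store addresses */
        uint64_t linked_at   = 0x1000000;       /* LOAD_PHYSICAL_ADDR         */
        uint64_t loaded_at   = 0x3200000;       /* where the kernel ended up  */
        uint64_t delta       = loaded_at - linked_at;

        /* Walk the relocation table and patch every recorded reference. */
        for (size_t i = 0; i < sizeof(reloc_table) / sizeof(reloc_table[0]); i++)
                image[reloc_table[i]] += delta;

        printf("first reference now points to %#llx\n",
               (unsigned long long)image[0]);
        return 0;
}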
After the kernel is relocated, we return from the extract_kernel function and the code jumps to the kernel entrypoint, whose address is stored in the rax register, as we already saw above.
Now we are in the kernel 🎉🎉🎉
Conclusion
This is the end of the third part about Linux kernel insides. If you have questions or suggestions, feel free to ping me on X - 0xAX, drop me an email, or just create an issue.
Links
Here is the list of the links that you may find useful during reading of this chapter: