Transition to 64-bit mode
In the previous part, we saw the transition from real mode into protected mode. At this point, two crucial things changed - the processor can address up to 4 gigabytes of memory and privilege levels were introduced for memory access. Despite this, the kernel is still in its early setup mode. There are many different things that have to be prepared and configured before we reach the main kernel's entry point. Since we are learning the Linux kernel for x86_64 processors, protected mode is not the main mode in which the processor should operate. The next crucial step is to switch to the native mode for x86_64 - long mode.
The main characteristic of this new mode, as with all the earlier modes, is the way it defines the memory model. In real mode, the memory model was relatively simple and each memory location was formed from the base address specified in a segment register plus some offset. Protected mode introduced the Global and Local descriptor tables with descriptors which describe memory areas. All the memory accesses in long mode are based on the new mechanism called paging. One of the crucial goals of the kernel before it can switch to long mode is to set up paging. This and all other details needed to switch to long mode we will see in this chapter.
[!NOTE] There will be lots of assembly code in this part, so if you are not familiar with that, you might want to consult a book about it or read another set of my posts about assembly programming.
The 32-bit kernel entry point location
The last point where we stopped was the jump to the kernel's entry point in protected mode. This jump is defined in the arch/x86/boot/pmjump.S and looks like this:
jmpl *%eax # Jump to the 32-bit entrypoint
The value of the eax register contains the address of the 32-bit entry point. What is this address? To answer this question, we can read the Linux kernel x86 boot protocol document:
When using bzImage, the protected-mode kernel was relocated to 0x100000
We can make sure that this is the 32-bit entry point of the Linux kernel using the GNU GDB debugger and running the Linux kernel in the QEMU virtual machine. To do this, you can run the following command in one terminal:
sudo qemu-system-x86_64 -kernel ./linux/arch/x86/boot/bzImage \
-nographic \
-append "console=ttyS0 nokaslr" -s -S \
-initrd /boot/initramfs-6.17.0-rc3-g1b237f190eb3.img
[!NOTE] You need to pass your own kernel image and initrd image to the -kernel and -initrd command line options.
After this, run the GNU GDB debugger in another terminal and pass the following commands:
$ gdb
(gdb) target remote :1234
(gdb) hbreak *0x100000
(gdb) c
Continuing.
Breakpoint 1, 0x0000000000100000 in ?? ()
As soon as the debugger stops at the breakpoint, we can inspect the registers to make sure that the eax register contains 0x100000 - the address of the 32-bit kernel entry point:
eax 0x100000 1048576
ecx 0x0 0
edx 0x0 0
ebx 0x0 0
esp 0x1ff5c 0x1ff5c
ebp 0x0 0x0
esi 0x14470 83056
edi 0x0 0
eip 0x100000 0x100000
eflags 0x46 [ PF ZF ]
From the previous part, you may remember:

First of all, we preserve the address of the boot_params structure in the esi register.
So the esi register has the pointer to the boot_params structure. Let's inspect it to make sure. For example, we can take a look at the command line string that we passed to the virtual machine:
(gdb) x/s ((struct boot_params *)$rsi)->hdr.cmd_line_ptr
0x20000: "console=ttyS0 nokaslr"
We got it 🎉
For reference, the whole layout of the boot_params structure can be seen with the ptype command:
(gdb) ptype struct boot_params
type = struct boot_params {
struct screen_info screen_info;
struct apm_bios_info apm_bios_info;
__u8 _pad2[4];
__u64 tboot_addr;
struct ist_info ist_info;
__u64 acpi_rsdp_addr;
__u8 _pad3[8];
__u8 hd0_info[16];
__u8 hd1_info[16];
struct sys_desc_table sys_desc_table;
struct olpc_ofw_header olpc_ofw_header;
__u32 ext_ramdisk_image;
__u32 ext_ramdisk_size;
__u32 ext_cmd_line_ptr;
__u8 _pad4[112];
__u32 cc_blob_address;
struct edid_info edid_info;
struct efi_info efi_info;
__u32 alt_mem_k;
__u32 scratch;
__u8 e820_entries;
__u8 eddbuf_entries;
__u8 edd_mbr_sig_buf_entries;
__u8 kbd_status;
__u8 secure_boot;
__u8 _pad5[2];
__u8 sentinel;
__u8 _pad6[1];
struct setup_header hdr;
__u8 _pad7[36];
__u32 edd_mbr_sig_buffer[16];
struct boot_e820_entry e820_table[128];
__u8 _pad8[48];
struct edd_info eddbuf[6];
__u8 _pad9[276];
}
Now we know where we are, so let's take a look at the code and proceed with learning the Linux kernel.
First steps in the protected mode
The 32-bit entry point is defined in the arch/x86/boot/compressed/head_64.S assembly source code file:
.code32
SYM_FUNC_START(startup_32)
First of all, it is worth asking: why is the directory named compressed? The answer is that the kernel is shipped in the bzImage file. This file is a compressed package consisting of the kernel image and the kernel setup code. In all the previous chapters we were researching the kernel setup code. The next two big steps remaining before we see the entry point of the kernel are:
switch to long mode
decompress the kernel image and jump to its entrypoint
In this part we will focus on the first big step; the steps leading to the kernel decompression and the decompression itself we will see in the next chapters. Returning to the current kernel code, you may find the two following files in the arch/x86/boot/compressed directory:

head_32.S
head_64.S
In our case, we will consider only the head_64.S file. Yes, the file is named with the 64 suffix despite the fact that the kernel is in 32-bit protected mode at this moment. The explanation for this situation is simple. Let's look at arch/x86/boot/compressed/Makefile. We may see the following make goal here:
vmlinux-objs-y := $(obj)/vmlinux.lds $(obj)/kernel_info.o $(obj)/head_$(BITS).o \
$(obj)/misc.o $(obj)/string.o $(obj)/cmdline.o $(obj)/error.o \
$(obj)/piggy.o $(obj)/cpuflags.o
The first line contains the following target - $(obj)/head_$(BITS).o. This means that make will select the file during the kernel build process based on the value of $(BITS). This make variable is defined in the arch/x86/Makefile make file and its value depends on the kernel configuration:
ifeq ($(CONFIG_X86_32),y)
BITS := 32
...
...
else
BITS := 64
...
...
endif
Since we consider the kernel for the x86_64 architecture, we assume that CONFIG_X86_64 is set to y. As a result, the head_64.S file will be used during the kernel build process. Let's start to investigate what the kernel does in this file.
Reload the segments if needed
As we already know, our start is in the arch/x86/boot/compressed/head_64.S assembly source code file. The entry point is defined by the startup_32 symbol.
In the beginning of the startup_32, we can see the cld instruction which clears the DF or direction flag bit in the flags register:
.code32
SYM_FUNC_START(startup_32)
/*
* 32bit entry is 0 and it is ABI so immutable!
* If we come here directly from a bootloader,
* kernel(text+data+bss+brk) ramdisk, zero_page, command line
* all need to be under the 4G limit.
*/
cld
When the direction flag is clear, all string operations, which are usually used for copying data, like for example stos, scas and others, will increment the index registers esi or edi. We need to clear the direction flag because later we will use string operations to perform various tasks such as clearing space for page tables or copying data.
The next instruction disables interrupts - cli. We have already seen it in the previous chapter. The interrupts are disabled "twice" because a modern bootloader can load the kernel starting directly from this entry point, not only the one that we have seen in the first chapter.
After these two simple instructions, the next step is to calculate the difference between where the kernel was compiled to run and where it actually was loaded. If we take a look at the linker script, we will see the following definition:
SECTIONS
{
/* Be careful parts of head_64.S assume startup_32 is at
* address 0.
*/
. = 0;
This means that the code in this section is compiled to run at the address zero. We also can see this in the output of the objdump utility:
$ objdump -D /home/alex/disk/dev/linux/arch/x86/boot/compressed/vmlinux | less
/home/alex/disk/dev/linux/arch/x86/boot/compressed/vmlinux: file format elf64-x86-64
Disassembly of section .head.text:
0000000000000000 <startup_32>:
0: fc cld
1: fa cli
We may see that both the linker script and the objdump utility tell us that the address of the startup_32 function is 0, but this is not where the kernel was loaded. This is only the address that the code was compiled for, also called the link-time address. Why was it done like that? The answer is simplicity. By telling the linker to set the address of the very first symbol to zero, each next symbol becomes a simple offset from 0. As we already know, the kernel was loaded at the 0x100000 address. The difference between this address and zero is called the relocation delta. Once that delta is known, the code can reach any variable or function by adding this delta to its compile-time address.
We know these addresses, and as a result the value of the delta, based on the experiment we have seen above. Now let's take a look at how the kernel calculates this difference:
leal (BP_scratch+4)(%esi), %esp
call 1f
1: popl %ebp
subl $ rva(1b), %ebp
The call instruction is used to get the real address of the kernel. This trick works because after the call instruction is executed, the stack has the return address on its top. In the code above we set up a temporary mini stack to get the address of the kernel and execute the call to the nearest label 1. Since the top of the stack contains the return address, we pop it into the ebp register. Using the last instruction we subtract rva(1b) - the link-time offset between the label 1 and startup_32 - from the return address that we got at the previous step.
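To make the arithmetic concrete, here is a minimal C sketch of the same calculation. The concrete offset of the label is hypothetical and only illustrates the idea:

```c
/* A minimal sketch of the call/pop trick arithmetic. The concrete
 * numbers are hypothetical and only illustrate the idea. */
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint64_t load_addr = 0x100000; /* where the kernel was loaded        */
    uint64_t rva_1b    = 0x12;     /* link-time offset of label 1
                                      from startup_32 (hypothetical)     */

    /* What `call 1f; 1: popl %ebp` leaves in ebp: the runtime
     * address of the label 1. */
    uint64_t popped = load_addr + rva_1b;

    /* `subl $ rva(1b), %ebp` yields the runtime address of
     * startup_32 - the beginning of the kernel image. */
    uint64_t ebp = popped - rva_1b;

    printf("startup_32 is at %#lx\n", (unsigned long)ebp);
    return 0;
}
```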
Starting from this moment, the ebp register contains the address of the beginning of the kernel image, and using it we can calculate the offset to any other symbol or structure in memory. The first such structure that we will access is the Global Descriptor Table. To switch to long mode, we need to update the previously loaded Global Descriptor Table with 64-bit segments:
leal rva(gdt)(%ebp), %eax
movl %eax, 2(%eax)
lgdt (%eax)
Where the new Global Descriptor Table is:
SYM_DATA_START_LOCAL(gdt)
.word gdt_end - gdt - 1
.long 0
.word 0
.quad 0x00cf9a000000ffff /* __KERNEL32_CS */
.quad 0x00af9a000000ffff /* __KERNEL_CS */
.quad 0x00cf92000000ffff /* __KERNEL_DS */
.quad 0x0080890000000000 /* TS descriptor */
.quad 0x0000000000000000 /* TS continued */
SYM_DATA_END_LABEL(gdt, SYM_L_LOCAL, gdt_end)
The new Global Descriptor Table contains five descriptors:

32-bit kernel code segment
64-bit kernel code segment
32-bit kernel data segment
Task state descriptor
Second task state descriptor
We already saw the loading of the Global Descriptor Table in the previous part, and now we're doing almost the same here, but for the 64-bit code segment we set the descriptor flags CS.L = 1 and CS.D = 0 for execution in 64-bit mode.
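To see the difference between the 32-bit and 64-bit code segment descriptors concretely, here is a small sketch that decodes the L and D flags, which live in bits 53 and 54 of a segment descriptor:

```c
/* A sketch decoding the L (bit 53) and D (bit 54) flags of the
 * code segment descriptors from the GDT above. */
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint64_t kernel32_cs = 0x00cf9a000000ffffULL; /* __KERNEL32_CS */
    uint64_t kernel_cs   = 0x00af9a000000ffffULL; /* __KERNEL_CS   */

    printf("__KERNEL32_CS: L=%d D=%d\n",
           (int)((kernel32_cs >> 53) & 1), (int)((kernel32_cs >> 54) & 1));
    printf("__KERNEL_CS:   L=%d D=%d\n",
           (int)((kernel_cs >> 53) & 1), (int)((kernel_cs >> 54) & 1));
    return 0;
}
```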
After the new Global Descriptor Table is loaded, the kernel can set up the new stack:
movl $__BOOT_DS, %eax
movl %eax, %ds
movl %eax, %es
movl %eax, %fs
movl %eax, %gs
movl %eax, %ss
/* Setup a stack and load CS from current GDT */
leal rva(boot_stack_end)(%ebp), %esp
At the previous step we loaded the new Global Descriptor Table, but all the segment registers may still have selectors from the old table. If those selectors point to invalid entries in the new Global Descriptor Table, the next memory access can cause a General Protection Fault. Setting them to __BOOT_DS, which is a known-good descriptor, avoids this potential fault and allows us to set up a proper stack pointed to by boot_stack_end.
The last action after we loaded the new Global Descriptor Table is to reload the cs segment register:
pushl $__KERNEL32_CS
leal rva(1f)(%ebp), %eax
pushl %eax
lretl
1:
Since we cannot change the cs register using a simple mov instruction, we need to apply a trick with the lretl instruction. This instruction fetches two values from the top of the stack and puts the first value into the eip register and the second value into the cs register. From this moment we have proper kernel code selector and instruction pointer values.
Just a couple of steps separate us from the transition into long mode. As it was mentioned in the beginning of this chapter, one of the most crucial of them is to set up paging. But before this task, the kernel needs to do the last preparations, which we will see in the next sections.
Last steps before paging setup
As we mentioned in the previous section, there are a couple of additional steps before we can set up paging and switch to long mode. These steps are:

Verification of CPU
Calculation of the relocation address
Enabling PAE mode
In the next sections we will take a look at these steps.
CPU verification
Before the kernel can switch to long mode, it needs to check that it runs on a suitable x86_64 processor. This is done by the next piece of code:
/* Make sure cpu supports long mode. */
call verify_cpu
testl %eax, %eax
jnz .Lno_longmode
The verify_cpu function is defined in arch/x86/kernel/verify_cpu.S and executes the cpuid instruction to check the details of the processor on which the kernel is running. In our case, the most crucial checks are for long mode and SSE support. The function sets the eax register to 0 on success and 1 on failure. If long mode is not supported by the current processor, the kernel jumps to the .Lno_longmode label which just stops the CPU with the hlt instruction:
.code32
SYM_FUNC_START_LOCAL_NOALIGN(.Lno_longmode)
/* This isn't an x86-64 CPU, so hang intentionally, we cannot continue */
1:
hlt
jmp 1b
If everything is ok, the kernel proceeds with its work.
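As a rough user-space approximation of this check, we can read the same cpuid bit ourselves. This is only a sketch using the compiler's <cpuid.h> helper, not the kernel's actual verify_cpu code:

```c
/* A user-space sketch of the long mode check. It uses the
 * compiler-provided <cpuid.h> helper rather than the kernel's
 * verify_cpu routine. */
#include <cpuid.h>
#include <stdio.h>

int main(void)
{
    unsigned int eax, ebx, ecx, edx;

    /* CPUID leaf 0x80000001 reports extended processor features. */
    if (!__get_cpuid(0x80000001, &eax, &ebx, &ecx, &edx))
        return 1;

    /* Bit 29 of edx is the long mode (64-bit) support flag. */
    printf("long mode is %ssupported\n",
           (edx & (1u << 29)) ? "" : "not ");
    return 0;
}
```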
Calculation of the kernel relocation address
The next step is to calculate the address for the kernel decompression. The kernel consists of two parts:
Relatively small decompressor code
Chunk of compressed kernel code
Obviously, the final decompressed kernel code will be bigger than the compressed image. The memory area where the decompressed kernel should be located may overlap with the area where the compressed image is located. In this case, the compressed image could be overwritten during the decompression process. To avoid this, the kernel will copy the compressed part to a place safe for decompression. This is done by the following code:
#ifdef CONFIG_RELOCATABLE
movl %ebp, %ebx
movl BP_kernel_alignment(%esi), %eax
decl %eax
addl %eax, %ebx
notl %eax
andl %eax, %ebx
cmpl $LOAD_PHYSICAL_ADDR, %ebx
jae 1f
#endif
movl $LOAD_PHYSICAL_ADDR, %ebx
1:
/* Target address to relocate to for decompression */
addl BP_init_size(%esi), %ebx
subl $ rva(_end), %ebx
The ebp register contains the current address of the beginning of the kernel image. We put this address into the ebx register and align it up to the kernel alignment boundary (2 megabytes by default). If the resulting address is equal to or bigger than LOAD_PHYSICAL_ADDR, which is 0x1000000, we use it as is, otherwise we set it to 0x1000000. Once we have the beginning of the address where the compressed kernel image should be moved, we add to it BP_init_size, which is the size of the decompressed kernel image. This allows us to copy the compressed kernel image behind the memory area where the kernel will be decompressed. In the end we just subtract the address of the _end symbol from the value in ebx to get the new base address of the decompressor code.
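The same computation can be sketched in C. The variable names are mine, not the kernel's, and except for the defaults named in the text, the concrete values are hypothetical:

```c
/* A sketch of the relocation target computation. The variable
 * names are made up for illustration. */
#include <stdint.h>
#include <stdio.h>

#define LOAD_PHYSICAL_ADDR 0x1000000u /* 16 MB, the default */

int main(void)
{
    uint32_t load_addr = 0x100000;  /* ebp: where the image is now   */
    uint32_t alignment = 0x200000;  /* BP_kernel_alignment: 2 MB     */
    uint32_t init_size = 0x2000000; /* BP_init_size (hypothetical)   */
    uint32_t rva_end   = 0x400000;  /* rva(_end) (hypothetical)      */

    /* Align the current address up to the alignment boundary. */
    uint32_t target = (load_addr + alignment - 1) & ~(alignment - 1);

    /* Never relocate below LOAD_PHYSICAL_ADDR. */
    if (target < LOAD_PHYSICAL_ADDR)
        target = LOAD_PHYSICAL_ADDR;

    /* Place the compressed image at the end of the area reserved
     * for the decompressed kernel. */
    target += init_size;
    target -= rva_end;

    printf("relocate compressed image to %#x\n", target);
    return 0;
}
```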
Enabling PAE mode
The next step is to set up the so-called PAE mode:
/* Enable PAE mode */
movl %cr4, %eax
orl $X86_CR4_PAE, %eax
movl %eax, %cr4
We do it by setting the X86_CR4_PAE bit in the cr4 control register. This tells the CPU that the page table entries we will see soon are enlarged from 32 to 64 bits.
Setup paging
At this moment we are almost finished with the preparations needed to switch the processor into 64-bit mode. One of the last steps is to build page tables. But before we take a look at the process of page table setup, let's briefly try to understand what paging is.
As we mentioned in the beginning of this chapter, on x86_64 the processor must have paging enabled to use long mode. Paging lets the processor translate virtual addresses, or addresses used by the code, into physical addresses. The translation of virtual addresses into physical ones is done using special structures - page tables. All the memory is considered as an array of sequential blocks called pages. Each page is described by a special descriptor called a PTE or page table entry. The page table entries are stored in page tables, which form a predefined hierarchy:
PML4 - top level table, each entry points to a PDPT
PDPT - 3rd level table, each entry points to a PD
PD - 2nd level table, each entry points to a PT
PT - 1st level table, each entry points to a 4 kilobyte physical page
The physical address of the top level table must be stored in the cr3 register.
When the processor needs to translate a virtual address into the corresponding physical address, it splits the virtual address into the following parts:

Bits 47:39 - index of the entry in the PML4 table
Bits 38:30 - index of the entry in the PDPT table
Bits 29:21 - index of the entry in the PD table
Bits 20:12 - index of the entry in the PT table
Bits 11:0 - offset of the byte inside the 4 kilobyte page
Knowing the index of the corresponding entry in each table, the CPU obtains the physical address.
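To make the split concrete, here is a small sketch that extracts each index from a virtual address. Each table has 512 entries, so every index is 9 bits wide:

```c
/* A sketch of splitting a 48-bit virtual address for 4-level
 * paging with 4 KB pages. */
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint64_t vaddr = 0xffff888000123456ULL; /* arbitrary example */

    unsigned pml4_idx = (vaddr >> 39) & 0x1ff; /* bits 47:39 */
    unsigned pdpt_idx = (vaddr >> 30) & 0x1ff; /* bits 38:30 */
    unsigned pd_idx   = (vaddr >> 21) & 0x1ff; /* bits 29:21 */
    unsigned pt_idx   = (vaddr >> 12) & 0x1ff; /* bits 20:12 */
    unsigned offset   = vaddr & 0xfff;         /* bits 11:0  */

    printf("PML4=%u PDPT=%u PD=%u PT=%u offset=%#x\n",
           pml4_idx, pdpt_idx, pd_idx, pt_idx, offset);
    return 0;
}
```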
The next goal of the kernel is to build a structure similar to the description above in order to switch to long mode. Let's take a look at how it is implemented in the kernel. First of all, we need to fill the memory area for the page tables, specified by the pgtable symbol, with zeros for safety:
leal rva(pgtable)(%ebx), %edi
xorl %eax, %eax
movl $(BOOT_INIT_PGT_SIZE/4), %ecx
rep stosl
After we cleaned the memory area for the page tables, we can start to fill it. First of all, we need to fill the first entry of the top-level page table:
leal rva(pgtable + 0)(%ebx), %edi
leal 0x1007 (%edi), %eax
movl %eax, 0(%edi)
addl %edx, 4(%edi)
This adds the first entry to the top-level page table. This entry contains a reference to the first lower-level table, which is located 0x1000 bytes further. The 0x7 in the low bits represents the flags of the page table entry:
Present
Read/Write
User
Each page table entry is a 64-bit structure, no matter whether it is a PML4, PDPT, PD or PT entry. The format is almost the same among all the levels. The difference is only in the address field, which stores the physical address of the next page table in the hierarchy. Besides the address field, a page table entry contains flags like:
P - present bit
RW - read/write bit
US - user/supervisor bit
PWT - Page-level Write-Through bit controlling caching of the page
PCD - Page Cache Disable bit controlling caching of the page
A - accessed page bit
D - dirty page bit
PS - page size bit
NX - No-Execute bit
More information about page tables and the structure of page table entries can be found in the Intel® 64 and IA-32 Architectures Software Developer Manuals.
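As a sketch, the 0x7 value from the code above can be decomposed into these flag bits. The constant names here are illustrative, not taken from the kernel headers:

```c
/* Illustrative page table entry flag bits; the constant names are
 * made up for this sketch. */
#include <stdint.h>
#include <stdio.h>

#define PTE_PRESENT (1ULL << 0) /* P  */
#define PTE_WRITE   (1ULL << 1) /* RW */
#define PTE_USER    (1ULL << 2) /* US */

int main(void)
{
    /* A PML4 entry pointing to a PDPT placed 0x1000 bytes after
     * the PML4, as in the assembly above (hypothetical address). */
    uint64_t pml4_phys = 0x1c00000;
    uint64_t entry = (pml4_phys + 0x1000)
                     | PTE_PRESENT | PTE_WRITE | PTE_USER;

    printf("PML4[0] = %#llx\n", (unsigned long long)entry); /* ...1007 */
    return 0;
}
```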
In the next step we will build four entries in the Page Directory Pointer Table with the same Present, Read/Write and User flags, each pointing to a Page Directory:
leal rva(pgtable + 0x1000)(%ebx), %edi
leal 0x1007(%edi), %eax
movl $4, %ecx
1: movl %eax, 0x00(%edi)
addl %edx, 0x04(%edi)
addl $0x00001000, %eax
addl $8, %edi
decl %ecx
jnz 1b
In the code above, we may see the filling of the first four entries of the 3rd level page table. The table is located at the offset 0x1000 from the beginning of the page tables area. The value of the eax register is built the same way as in the 4th level page table entry. Then we just fill the four entries of this table in a loop while the value of ecx is not zero. As soon as these entries are filled, it is the turn of the next level page table:
leal rva(pgtable + 0x2000)(%ebx), %edi
movl $0x00000183, %eax
movl $2048, %ecx
1: movl %eax, 0(%edi)
addl %edx, 4(%edi)
addl $0x00200000, %eax
addl $8, %edi
decl %ecx
jnz 1b
Here we fill four page directories with 2048 entries in total. These tables are located at the offset 0x2000 from the beginning of the page tables area. Each entry maps a 2 megabyte chunk of memory with the same Present, Read/Write and Large Page flags, but in addition there is the Global flag. This additional flag tells the processor to keep the TLB entry across reloads of the cr3 register.
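The same loop can be sketched in C. Note that 2048 entries of 2 megabytes each identity-map the first 4 gigabytes of memory:

```c
/* A C sketch of the page directory filling loop above. 2048
 * entries of 2 MB each identity-map the first 4 GB of memory. */
#include <stdint.h>

#define PDE_FLAGS 0x183ULL /* Present | Read/Write | Large | Global */

static void fill_page_directories(uint64_t *pd /* at pgtable + 0x2000 */)
{
    for (uint64_t i = 0; i < 2048; i++)
        pd[i] = i * 0x200000ULL + PDE_FLAGS; /* 2 MB per entry */
}
```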
These were the last page table entries the kernel fills at this point. There is no need to fill the 1st level page tables (PT) because every entry at the 2nd level was filled with the Large Page bit set, so each such entry directly maps a 2 megabyte region. During the address translation, the page-walk procedure goes through PML4 → PDPT → PD and stops at the PD level, and the lower 21 bits of the virtual address are used as the offset inside that 2 megabyte page.
Now we can store the physical address of the top-level page table in the cr3 register:
leal rva(pgtable)(%ebx), %eax
movl %eax, %cr3
The page tables are ready, and the processor will start using them as soon as paging is enabled during the switch to long mode. The kernel is now prepared for the transition.
The transition into 64-bit mode
Only the last steps remain before the Linux kernel can switch the CPU into long mode. The first one is setting the EFER.LME flag in the special model specific register located at the address 0xC0000080:
movl $MSR_EFER, %ecx
rdmsr
btsl $_EFER_LME, %eax
wrmsr
EFER.LME is the Long Mode Enable bit, and setting it is a mandatory step to enable 64-bit mode.
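For illustration, the read-modify-write of this MSR could be sketched with inline assembly like this. It is only a sketch; the real code runs in the kernel, and rdmsr/wrmsr are privileged instructions:

```c
/* A sketch of the EFER read-modify-write. rdmsr/wrmsr are
 * privileged instructions, so this only works at ring 0. */
#include <stdint.h>

#define MSR_EFER 0xc0000080u /* address of the EFER MSR */
#define EFER_LME (1u << 8)   /* Long Mode Enable bit    */

static inline void enable_long_mode(void)
{
    uint32_t lo, hi;

    /* rdmsr returns the 64-bit MSR value in edx:eax. */
    __asm__ volatile("rdmsr" : "=a"(lo), "=d"(hi) : "c"(MSR_EFER));
    lo |= EFER_LME;
    /* wrmsr takes the new value from edx:eax. */
    __asm__ volatile("wrmsr" :: "a"(lo), "d"(hi), "c"(MSR_EFER));
}
```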
In the next step, we may see the preparation of the jump to the long mode entry point. To do this jump, the kernel stores the kernel code segment selector along with the address of the long mode entry point on the stack:
leal rva(startup_64)(%ebp), %eax
pushl $__KERNEL_CS
pushl %eax
Everything is ready. Since our stack contains the kernel code segment selector and the address of the entry point, the kernel executes its last instruction in protected mode:
lret
The CPU pops the address of startup_64 and the new cs selector from the stack and jumps there:
.code64
.org 0x200
SYM_CODE_START(startup_64)
The Linux kernel is now in 64-bit mode 🎉
Conclusion
This is the end of the third part about Linux kernel insides. If you have questions or suggestions, feel free to ping me on X - 0xAX, drop me an email, or just create an issue.