From bootloader to kernel
If you’ve read my earlier posts about assembly language for Linux x86_64, you might see that I started to get interested in low-level programming. I’ve written a set of articles on assembly programming for x86_64 Linux and, in parallel, began exploring the Linux kernel source code. I’ve always been fascinated by what happens under the hood — how programs execute on a CPU, how they’re laid out in memory, how the kernel schedules processes and manages resources, how the network stack operates at a low level, and many other details. This series is a way of sharing my journey.
[!NOTE] This is not official Linux kernel documentation; it is a learning project. I’m not a professional Linux kernel developer, and I don’t write kernel code as part of my daily job. Learning how the Linux kernel works is just my hobby. If you find anything unclear, spot an error, or have questions or suggestions, feel free to reach out: you can always ping me on X (0xAX), send me an email, or open a new issue. Your feedback is always welcome and appreciated.
The main goal of this series is to provide a guide to the Linux kernel for readers who want to begin learning how it works. We will explore not only what the kernel does, but also try to understand how and why it does it. Although the series is meant to be understandable for anyone interested in the Linux kernel, some prior knowledge is highly recommended before you start reading these notes. If you want to experiment with the kernel code, it is best to have a Linux distribution installed. Besides that, these pages contain a lot of C and assembly code, so a good understanding of both languages is required.
[!IMPORTANT] I started writing this series when the latest version of the kernel was 3.18. A lot has changed since then, and I am in the process of updating the content to reflect modern kernels where possible, now focusing on v6.16+. I’ll continue revising the posts as the kernel evolves.
That’s enough introduction — let’s dive into the Linux kernel!
The Magic Power Button - What happens next?
Although this is a series of posts about the Linux kernel, we will not jump straight into kernel code. First, let’s step back and look at what happens before the kernel even comes into play. Everything starts with turning on the computer, and that is where we will start as well.
When you press the "magic" power button on your laptop or desktop computer, the motherboard sends a signal to the power supply. In response, the power supply delivers the proper amount of electricity to other components of the computer. Once the motherboard receives the power good signal, it triggers the CPU to start. The CPU then performs a reset: it clears any leftover data in its registers and loads predefined values into each of them, preparing for the very first instructions of the boot process.
Each x86_64 processor begins execution in a special mode called real mode. This mode exists for historical reasons - to be compatible with the earliest processors. Real mode is supported on all x86-compatible processors — from the original 8086 to today’s modern 64-bit CPUs.
The 8086 was a 16-bit microprocessor, which means that its general-purpose registers and instruction pointer were 16 bits wide. However, the chip was designed with a 20-bit physical memory address bus, the set of electrical lines used to select memory locations. With 20 address lines, the CPU can form addresses from 0x00000 to 0xFFFFF, giving access to exactly 1 MB (2^20 bytes) of physical memory.
Because the registers on 8086 processors were only 16 bits wide, the largest value they could hold was 0xFFFF (65,535). This means that, using just a single 16-bit value, the CPU could directly address only 64 KB of memory at a time. This leads us to the question: how can a processor with 16-bit registers access 20-bit addresses? The answer is memory segmentation.
To make use of the entire 1 MB space provided by the 20-bit address bus, the 8086 used a scheme called memory segmentation. Memory is addressed through fixed-size segments of 65_536 bytes each. Instead of using just one value to identify a memory location, the CPU uses two:
Segment selector: identifies the starting point (base address) of a 64 KB segment. For code, it is the value of the cs (code segment) register.
Offset: specifies how far into that segment the target address is. For code, it is the value of the ip register.
In real mode, the base address for a given segment selector is calculated as:
Base Address = Segment Selector << 4
To compute the final physical memory address, the CPU adds the base address to the offset:
Physical Address = Base Address + Offset
For example, if the value of cs:ip is 0x2000:0x0010, then the corresponding physical address will be:
>>> hex((0x2000 << 4) + 0x0010)
'0x20010'
If we take the largest possible values for the segment selector and the offset, 0xFFFF:0xFFFF, the resulting address will be:
>>> hex((0xffff << 4) + 0xffff)
'0x10ffef'
This gives us the address 0x10FFEF, which is 65_520 bytes beyond 0xFFFFF, the last address within the first 1 MB. Since the original 8086 CPU could only access the first 1 MB of memory, any address above 0xFFFFF would wrap around to the beginning of the address space. On modern 386+ CPUs the physical bus is wider even in real mode, but the address computation is still based on the segment:offset pair.
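The two formulas and the wrap-around behaviour can be combined into one small helper. This is only a sketch: the function name real_mode_address and the address_bits parameter are made up for illustration, and treating a wider bus as a simple bit mask ignores details such as the A20 gate.

```python
def real_mode_address(segment: int, offset: int, address_bits: int = 20) -> int:
    """Return the physical address for segment:offset, wrapped to the bus width."""
    base = (segment & 0xFFFF) << 4             # Base Address = Segment Selector << 4
    linear = base + (offset & 0xFFFF)          # Physical Address = Base Address + Offset
    return linear & ((1 << address_bits) - 1)  # keep only the available address lines


print(hex(real_mode_address(0x2000, 0x0010)))                   # 0x20010
print(hex(real_mode_address(0xFFFF, 0xFFFF)))                   # 0xffef, wrapped on a 20-bit bus
print(hex(real_mode_address(0xFFFF, 0xFFFF, address_bits=32)))  # 0x10ffef, no wrap
```

With the default 20 address lines, 0xFFFF:0xFFFF lands at 0xFFEF near the bottom of memory, which is the wrap-around described above.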
Now that we understand the basics of real mode and its memory addressing limitations, let’s return to the state after a hardware reset.
First code executed after reset
The system has just been powered on, the reset signal has been released, and the processor is waking up to execute its first instructions. The 80386 and later CPUs set the following register values after a hardware reset:
| Register | Value | Meaning |
|---|---|---|
| ip | 0xFFF0 | Instruction pointer; execution starts here within the current code segment |
| cs (selector) | 0xF000 | Visible code segment selector value after reset |
| cs (base) | 0xFFFF0000 | Hidden descriptor base address loaded into cs during reset |
In real mode, the base address is normally formed by shifting the 16-bit segment selector value 4 bits left to produce a 20-bit physical address. After a hardware reset, however, the first instruction is located at a special address. The segment selector in the cs register is loaded with 0xF000, but the hidden base address is loaded with 0xFFFF0000. Instead of using the usual formula, the processor uses this hidden value as the base address of the first instruction. Given the base address and the offset (from the ip register), the starting address will be:
>>> hex(0xffff0000 + 0xfff0)
'0xfffffff0'
We got 0xFFFFFFF0, which is 16 bytes below 4 GB. This is the very first address from which the CPU starts execution after reset. This address has a special name: the reset vector. It is the memory location at which the CPU expects to find the first instruction to execute after reset. Usually it contains a jump (jmp) instruction which points to the BIOS or UEFI entry point. For example, if we take a look at the source code of coreboot, we will see it there:
/* This is the first instruction the CPU runs when coming out of reset. */
.section ".reset", "ax", %progbits
.globl _start
_start:
jmp _start16bit
To prove that this code is located at the 0xFFFFFFF0 address, we may take a look at the linker script:
. = 0xfffffff0;
_X86_RESET_VECTOR = .;
.reset . : {
*(.reset);
. = _X86_RESET_VECTOR_FILLING;
BYTE(0);
}
The address 0xFFFFFFF0 is much larger than 0xFFFFF (1 MB). How can the CPU access this address in real mode? The answer is simple. Most likely you have something more modern than an 8086 CPU with its 20-bit address bus. More modern processors also start in real mode, but with a 32-bit or wider address bus.
When the CPU wakes up, it reads the jump at the 0xFFFFFFF0 address, jumps into the firmware, and the long chain of the boot process begins. This is the very first step on the way to booting the Linux kernel.
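To see why the hidden base matters, it may help to compare the address that the usual selector << 4 formula would give with the one the CPU really uses after reset. The snippet below is just an illustrative sketch; the helper name first_instruction_address is hypothetical.

```python
FOUR_GIB = 1 << 32


def first_instruction_address(cs_base: int, ip: int) -> int:
    """Add the code segment base and the instruction pointer within 32 bits."""
    return (cs_base + ip) & (FOUR_GIB - 1)


cs_selector, cs_hidden_base, ip = 0xF000, 0xFFFF0000, 0xFFF0

usual = first_instruction_address(cs_selector << 4, ip)  # what the normal formula gives
actual = first_instruction_address(cs_hidden_base, ip)   # what the CPU uses after reset

print(hex(usual))         # 0xffff0
print(hex(actual))        # 0xfffffff0
print(FOUR_GIB - actual)  # 16
```

The normal formula would point just below 1 MB, while the hidden base places the reset vector 16 bytes below 4 GB.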
From Power-On to Bootloader
We stopped at the point when a CPU jumps from the reset vector to the firmware. On a legacy PC, that means the BIOS. On modern computers it is UEFI. In the next chapters we will see the booting processes on a legacy PC using the BIOS, and later UEFI.
The first job of the BIOS is to bring the system into a working state. It runs a series of hardware checks and initializations (memory tests, peripheral setup, chipset configuration), all part of the POST routine. Once everything is checked, the next step is to find an operating system to boot. The BIOS doesn’t pick just a random disk. It follows a boot order, a list stored in its configuration.
When the BIOS tries to boot from a hard drive, it looks for a boot sector. On hard drives partitioned with an MBR partition layout, the boot code is stored in the first 446 bytes of the first sector, where each sector is 512 bytes. The final two bytes of the first sector must be 0x55 and 0xAA. These last two bytes tell the BIOS something like "yes, this device is bootable". Once the BIOS finds a valid boot sector, it copies it to the fixed memory location 0x7C00, jumps there, and starts executing it.
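The check described above can be sketched in a few lines. This is not how any particular BIOS implements it; the function names has_boot_signature and boot_code are made up for illustration, and only the sizes and the two magic bytes come from the text.

```python
SECTOR_SIZE = 512                     # size of one sector in bytes
BOOT_CODE_SIZE = 446                  # boot code area at the start of an MBR sector
BOOT_SIGNATURE = bytes([0x55, 0xAA])  # the last two bytes of a bootable sector


def has_boot_signature(sector: bytes) -> bool:
    """Return True if the first sector ends with the 0x55 0xAA marker."""
    return len(sector) >= SECTOR_SIZE and sector[510:512] == BOOT_SIGNATURE


def boot_code(sector: bytes) -> bytes:
    """Return the part of the first sector that holds the boot code."""
    return sector[:BOOT_CODE_SIZE]


blank = bytes(SECTOR_SIZE)                 # all zeros: not bootable
bootable = bytes(510) + BOOT_SIGNATURE     # zeros plus the signature
print(has_boot_signature(blank))           # False
print(has_boot_signature(bootable))        # True
print(len(boot_code(bootable)))            # 446
```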
In general, real mode's memory map is as follows:

| Address range | Description |
|---|---|
| 0x00000000–0x000003FF | Real Mode Interrupt Vector Table |
| 0x00000400–0x000004FF | BIOS Data Area |
| 0x00000500–0x00007BFF | Unused |
| 0x00007C00–0x00007DFF | Bootloader |
| 0x00007E00–0x0009FFFF | Unused |
| 0x000A0000–0x000BFFFF | Video RAM (VRAM) Memory |
| 0x000B0000–0x000B7FFF | Monochrome Video Memory |
| 0x000B8000–0x000BFFFF | Color Video Memory |
| 0x000C0000–0x000C7FFF | Video ROM BIOS |
| 0x000C8000–0x000EFFFF | BIOS Shadow Area |
| 0x000F0000–0x000FFFFF | System BIOS |
We can do a simple experiment and create a very primitive boot code:
;;
;; Note: this example is written using NASM assembler
;;
[BITS 16]
boot:
;; Symbol to print
mov al, '!'
;; TTY-style text output
mov ah, 0x0e
;; Display page number
mov bh, 0x00
;; Color
mov bl, 0x07
;; Interrupt call
int 0x10
jmp $
times 510-($-$$) db 0
db 0x55
db 0xaa
You can build and run this code using the following commands:
nasm -f bin boot.S && qemu-system-x86_64 boot -nographic
This will instruct the QEMU virtual machine to use the boot binary that we just built as a disk image. Since the binary generated by the assembly code above fulfills the requirements of the boot sector (we end it with the magic sequence), QEMU will treat the binary as the master boot record (MBR) of a disk image.
If you did everything correctly, you will see something like this after running the command above:
SeaBIOS (version 1.17.0-5.fc42)
iPXE (https://ipxe.org) 00:03.0 CA00 PCI2.10 PnP PMM+06FCAEC0+06F0AEC0 CA00
Booting from Hard Disk...
!
Of course, a real-world boot sector has "slightly" more code, which loads an operating system instead of printing an exclamation mark, but it may be interesting to experiment. In this example, the code is executed in 16-bit real mode, as specified by the [BITS 16] directive. After starting, it calls the 0x10 interrupt, which just prints the ! symbol. The times directive pads the rest of the sector with zeros up to byte 510. At the end we "hard-code" the last two magic bytes, 0x55 and 0xAA. To exit from the virtual machine, press Ctrl+a x.
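To make the padding arithmetic of the times directive more tangible, here is a rough sketch that builds the same 512-byte image by hand. The machine code bytes are my own manual encoding of the instructions from the example, so treat them as an assumption rather than as the assembler's guaranteed output; the build_boot_sector helper is hypothetical.

```python
# Hand-encoded 16-bit machine code for the example above (an assumption,
# not taken from actual NASM output).
CODE = bytes([
    0xB0, 0x21,  # mov al, '!'
    0xB4, 0x0E,  # mov ah, 0x0e
    0xB7, 0x00,  # mov bh, 0x00
    0xB3, 0x07,  # mov bl, 0x07
    0xCD, 0x10,  # int 0x10
    0xEB, 0xFE,  # jmp $
])


def build_boot_sector(code: bytes) -> bytes:
    """Pad the code with zeros up to byte 510 and append the magic bytes."""
    if len(code) > 510:
        raise ValueError("boot code does not fit into one sector")
    padding = bytes(510 - len(code))  # same idea as: times 510-($-$$) db 0
    return code + padding + bytes([0x55, 0xAA])


sector = build_boot_sector(CODE)
print(len(sector))         # 512
print(sector[-2:].hex())   # 55aa
```

Writing the resulting bytes to a file should give an image of the same shape as the one built with nasm: code first, zeros in the middle, and the signature in the last two bytes.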
From this point onwards, the BIOS hands control over to the bootloader.
The Bootloader Stage
There are a number of different bootloaders that can boot the Linux kernel, such as GRUB 2, syslinux, systemd-boot, and others. The Linux kernel has a Boot protocol which specifies the requirements for a bootloader to implement Linux support. In this chapter, we will take a short look at how GRUB 2 loads the kernel.
Continuing from where we left off: the BIOS has now selected a boot device, found its boot sector, loaded it into memory, and passed control to the code located there. The GRUB 2 bootloader consists of multiple stages. The first stage of the boot code is in the boot.S source code file. Due to the limited amount of space in the boot sector, this code has a single goal: to load the core image into memory and jump to it.
The core image starts with diskboot.S, which is usually stored right after the first sector of the disk. The code from the diskboot.S file loads the rest of the core image into memory. The core image contains the code of the loader itself and drivers for reading different filesystems. After the whole core image is loaded into memory, the execution continues from the grub_main function. This is where GRUB sets up the environment it needs to operate:
Initializes the console so messages and menus can be displayed.
Sets the root device: the disk from which GRUB will read files, modules, and configuration files.
Loads and parses the GRUB configuration file.
Loads required modules.
Once these tasks are complete, we see the familiar GRUB menu where we can choose the operating system we want to load. When we select one of the menu entries, GRUB executes the boot command, which boots the selected operating system. So how does the loader load the Linux kernel? To answer this question, we need to get back to the Linux kernel boot protocol.
As we can read in the documentation, the bootloader must load the kernel into memory, fill some fields in the kernel setup header, and pass control to the kernel code. The very first part of the kernel code is the so-called kernel setup header and setup code. The kernel setup header is a special structure embedded in the early Linux boot code; it provides fields that describe how the kernel should be loaded and started. The setup header starts at offset 0x01F1 from the beginning of the kernel image. We may look at the boot linker script to confirm the value of this offset:
. = ASSERT(hdr == 0x1f1, "The setup header has the wrong offset!");
The kernel setup header is split into two parts, and the first part starts with the following fields:
.globl hdr
hdr:
.byte setup_sects - 1
root_flags: .word ROOT_RDONLY
syssize: .long ZO__edata / 16
ram_size: .word 0 /* Obsolete */
vid_mode: .word SVGA_MODE
root_dev: .word 0 /* Default to major/minor 0/0 */
boot_flag: .word 0xAA55
The bootloader may fill in those fields of the setup header that are marked as type write or modify in the Linux boot protocol. The values set by the bootloader are taken from its configuration or calculated during boot. We will not go over full descriptions and explanations of all the fields of the kernel setup header here. Instead, we will take a closer look at individual fields when we meet them during our exploration of the kernel code.
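The first part of the header shown above has a fixed layout, so it can be decoded with the struct module. The following is a minimal sketch: the SetupHeaderStart type and the parse_setup_header_start function are hypothetical names, and the sample values in the synthetic image are made up. The field order and sizes follow the assembly listing (.byte is one byte, .word is two, .long is four).

```python
import struct
from typing import NamedTuple

HDR_OFFSET = 0x1F1       # offset asserted by the boot linker script
HDR_FORMAT = "<BHIHHHH"  # little-endian: .byte, .word, .long, then four .word fields


class SetupHeaderStart(NamedTuple):
    setup_sects: int
    root_flags: int
    syssize: int
    ram_size: int
    vid_mode: int
    root_dev: int
    boot_flag: int


def parse_setup_header_start(image: bytes) -> SetupHeaderStart:
    """Decode the fields between hdr (0x1F1) and the end of the first sector."""
    return SetupHeaderStart(*struct.unpack_from(HDR_FORMAT, image, HDR_OFFSET))


# Synthetic 512-byte "image" with made-up field values, for illustration only.
fake = bytearray(512)
struct.pack_into(HDR_FORMAT, fake, HDR_OFFSET, 31, 1, 0x1234, 0, 0xFFFF, 0, 0xAA55)
header = parse_setup_header_start(bytes(fake))
print(hex(header.boot_flag))  # 0xaa55
print(fake[510:512].hex())    # 55aa
```

Note how the little-endian boot_flag value 0xAA55 ends up as the bytes 0x55 0xAA at the very end of the first 512 bytes. Passing the first 512 bytes of a real bzImage to parse_setup_header_start should decode the real values the same way.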
According to the Linux kernel boot protocol, memory will be mapped as follows after loading the kernel:
~ ~
| Protected-mode kernel |
100000 +------------------------+
| I/O memory hole |
0A0000 +------------------------+
| Reserved for BIOS | Leave as much as possible unused
~ ~
| Command line | (Can also be below the X+10000 mark)
X+10000 +------------------------+
| Stack/heap | For use by the kernel real-mode code.
X+08000 +------------------------+
| Kernel setup | The kernel real-mode code.
| Kernel boot sector | The kernel legacy boot sector.
X +------------------------+
| Boot loader | <- Boot sector entry point 0000:7C00
001000 +------------------------+
| Reserved for MBR/BIOS |
000800 +------------------------+
| Typically used by MBR |
000600 +------------------------+
| BIOS use only |
000000 +------------------------+
... where the address X is as low as the design of the boot loader permits.
We can see that when the bootloader transfers control to the kernel, execution starts right after the kernel’s boot sector, that is, at the address X plus the length of the boot sector. The value of X depends on how the kernel is loaded. For example, if I load the kernel directly with qemu, the kernel image starts at 0x10000:
hexdump -C /tmp/dump | grep MZ
00010000 4d 5a 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |MZ..............|
The Linux kernel image starts with the bytes 4D 5A, as you can see at the beginning of the kernel setup code:
.code16
.section ".bstext", "ax"
#ifdef CONFIG_EFI_STUB
# "MZ", MS-DOS header
.word IMAGE_DOS_SIGNATURE
If you want to get a similar memory dump, follow these steps. First of all, you need to build the kernel. If you do not know how to do it, you can find detailed instructions here. On the diagram above, we can see that the protected-mode kernel starts at 0x100000. Knowing this address, we can start the kernel in the qemu virtual machine with the following command:
sudo qemu-system-x86_64 -kernel ./linux/arch/x86/boot/bzImage \
-nographic \
-append "console=ttyS0 nokaslr" \
-initrd /boot/initramfs-6.17.0-rc1-g8f5ae30d69d7.img -s -S
After the virtual machine is started, we can attach the debugger to it, set up a breakpoint on the entry point and get the dump:
gdb vmlinux
(gdb) target remote :1234
(gdb) hbreak *0x100000
(gdb) c
Continuing.
Breakpoint 1, 0x0000000000100000 in ?? ()
(gdb) dump binary memory /tmp/dump 0x0000 0x20000
After this you should be able to find your dump in /tmp/dump.
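If you prefer to search the dump programmatically, something like the following rough equivalent of hexdump -C | grep MZ may help. It is only a sketch: find_mz_rows is a hypothetical helper, and it reports only 16-byte rows that begin with the MZ bytes, which is a stricter match than grep performs.

```python
from typing import List


def find_mz_rows(dump: bytes, row_size: int = 16) -> List[int]:
    """Return offsets of rows (as hexdump would print them) that start with 'MZ'."""
    return [
        offset
        for offset in range(0, len(dump), row_size)
        if dump[offset:offset + 2] == b"MZ"
    ]


# Synthetic dump for illustration: zeros with the signature placed at 0x10000.
sample = bytearray(0x20000)
sample[0x10000:0x10002] = b"MZ"
print([hex(offset) for offset in find_mz_rows(bytes(sample))])  # ['0x10000']
```

For the real dump, the bytes could be read with open("/tmp/dump", "rb").read() and passed to the same function.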
If we load the Linux kernel using the GRUB 2 bootloader, this X address will be 0x90000. Let's take a look at how to do it and check. First of all, you need to prepare a disk image with the kernel and GRUB 2. To do so, execute the following commands:
qemu-img create hdd.img 64M
parted hdd.img --script mklabel msdos
parted hdd.img --script mkpart primary ext2 1MiB 100%
parted hdd.img --script set 1 boot on
sudo losetup -fP hdd.img
sudo mkfs.ext2 /dev/loop0p1
sudo mkdir -p /mnt/tmp
sudo mount /dev/loop0p1 /mnt/tmp
sudo mkdir -p /mnt/tmp/boot/grub
sudo grub2-install \
--target=i386-pc \
--boot-directory=/mnt/tmp/boot \
/dev/loop0
sudo cp ./arch/x86/boot/bzImage /mnt/tmp/boot/
sudo tee /mnt/tmp/boot/grub/grub.cfg > /dev/null <<EOF
terminal_input serial
terminal_output serial
set timeout=0
set default=0
set debug=linux
menuentry "Linux" {
linux /boot/bzImage
}
EOF
sudo umount /mnt/tmp
sudo losetup -d /dev/loop0
Now we can run the qemu virtual machine with our image:
qemu-system-x86_64 -drive format=raw,file=hdd.img -m 256M -s -S -no-reboot -no-shutdown -vga virtio
Connect with the gdb debugger and set up a breakpoint:
$ gdb
(gdb) target remote localhost:1234
Remote debugging using localhost:1234
(gdb) break *0x90200
Breakpoint 1 at 0x90200
(gdb) c
Continuing.
If you did everything correctly, you will see the GRUB 2 prompt in the qemu window. Execute the following commands:
set pager=1
set debug=all
linux /boot/bzImage
boot
During the execution of the linux command, you will see the debug line:
relocator: min_addr = 0x0, max_addr = 0xffffffff, target = 0x90000
That confirms that the kernel image will be loaded at the 0x90000 address. During execution of the boot command, the breakpoint should be hit. In the debugger, you can execute the i r command and see that we are at 0x9020:0x0000:
rip 0x0 0x0
cs 0x9020 36896
If you continue to execute si commands in the debugger CLI, you will step through the early kernel setup code instruction by instruction. If you exit from the debugger, you will see the kernel loading procedure continue.
The Beginning of the Kernel Setup Stage
The bootloader has now loaded the Linux kernel and the kernel setup code into memory, filled the header fields, and then jumped to the corresponding memory address. Finally, we are in the kernel 🎉
Technically, the kernel itself hasn't run yet, only the early kernel setup code. The kernel setup part must first switch from real mode to protected mode and then to long mode, configure the kernel decompressor, and finally decompress the kernel and jump to it. Execution of the kernel setup code starts in arch/x86/boot/header.S at the _start symbol:
_start:
# Explicitly enter this as bytes, or the assembler
# tries to generate a 3-byte jump here, which causes
# everything else to push off to the wrong offset.
.byte 0xeb # short (2-byte) jump
.byte start_of_setup-1f
1:
The very first instruction we encounter here is the jump specified by the 0xEB opcode. The second byte is the jump distance. If you’ve never met the Nf syntax before, 1f means the next label 1 that appears in the code. Immediately after those two bytes comes the label 1, which is located right before the beginning of the second part of the kernel setup header. Right after the second part of the setup header, we see the .entrytext section, which starts at the start_of_setup label. This is exactly where execution continues. But where are we jumping from? After the kernel setup code receives control from the bootloader, the first jmp instruction is located at offset 0x200 from the start of the loaded kernel image. This can be seen in both the Linux kernel boot protocol and the GRUB 2 source code:
segment = grub_linux_real_target >> 4;
state.gs = state.fs = state.es = state.ds = state.ss = segment;
state.cs = segment + 0x20;
state.ip = 0;
Here, grub_linux_real_target is the physical load address of the setup code. As we have seen in the previous section, this address is usually 0x90000. Shifting it right by four bits divides it by 16, converting the physical address into a segment value; that’s how real mode memory segmentation works. Then GRUB adds 0x20 to cs before starting execution. Why 0x20? Remember that in real mode, physical addresses are computed as:
Physical = (cs << 4) + ip
With ip = 0 and cs increased by 0x20, the offset from the start of the loaded image is:
0x20 << 4 = 0x200
This is 512 bytes — exactly the offset where our jump instruction resides in the image.
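The same arithmetic can be replayed step by step. This is a sketch that mirrors the GRUB snippet above; the function name grub_real_mode_entry is hypothetical and the load address 0x90000 is the one observed earlier in this chapter.

```python
from typing import Tuple


def grub_real_mode_entry(real_target: int) -> Tuple[int, int, int]:
    """Return (data segment, cs, physical entry address) for a load address."""
    segment = real_target >> 4  # physical address -> real mode segment value
    cs = segment + 0x20         # skip 0x20 paragraphs = 0x200 bytes
    ip = 0
    entry = (cs << 4) + ip      # Physical = (cs << 4) + ip
    return segment, cs, entry


segment, cs, entry = grub_real_mode_entry(0x90000)
print(hex(segment))          # 0x9000
print(hex(cs))               # 0x9020
print(hex(entry))            # 0x90200
print(hex(entry - 0x90000))  # 0x200
```

The resulting cs value of 0x9020 and the entry address 0x90200 match what the debugger session showed.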
After the jump to the start_of_setup label, the kernel setup code enters the very first phase of its real work:
Unifying the segment registers
Establishing a valid stack
Clearing the .bss section
Transitioning into C code
In the next sections, we’ll walk through each of these steps in detail.
Aligning the segment registers
First of all, the kernel setup code ensures that the ds and es segment registers point to the same address. Next, it clears the direction flag using the cld instruction:
.section ".entrytext", "ax"
start_of_setup:
# Force %es = %ds
movw %ds, %ax
movw %ax, %es
cld
We need both of these to clear the .bss section properly a bit later. From this point on, we are sure that the ds and es segment registers hold the same value: 0x9000.
Stack Setup
We need to prepare the environment for C code. The next step is to set up the stack. Let's take a look at the next lines of the code:
movw %ss, %dx
cmpw %ax, %dx # %ds == %ss?
movw %sp, %dx
je 2f # -> assume %sp is reasonably set
Here we compare the values of the ss and ds registers. According to the comment around this code, only old versions of the LILO bootloader may set these registers to different values. So we will skip all the "edge cases" and consider only the case when the value of the ss register is equal to ds. Since the values of these registers are equal, we jump to the 2 label:
2: # Now %dx should point to the end of our stack space
andw $~3, %dx # dword align (might as well...)
jnz 3f
movw $0xfffc, %dx # Make sure we're not zero
3: movw %ax, %ss
movzwl %dx, %esp # Clear upper half of %esp
sti # Now we should have a working stack
The dx register stores the stack pointer value, which should point to the top of the stack. The value of the stack pointer is 0x9000. The GRUB 2 bootloader sets it while loading the Linux kernel image, and the address is defined by:
#define GRUB_LINUX_SETUP_STACK 0x9000
At the next step, the andw instruction clears the two lowest bits of dx, rounding the stack pointer down to a four-byte boundary. If the result is not zero, we jump to the label 3 and use the aligned value. If the result is zero, we set it to 0xFFFC instead. The reason is that we cannot have a stack pointer equal to zero, as it grows down when something is pushed onto the stack. The 0xFFFC value is the highest 4-byte aligned address below 0x10000.
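The alignment logic of the andw / jnz / movw sequence can be modelled in a few lines. This is only an illustration of the three instructions; the function name setup_stack_pointer is made up.

```python
def setup_stack_pointer(sp: int) -> int:
    """Model of: andw $~3, %dx; jnz 3f; movw $0xfffc, %dx."""
    dx = sp & ~3 & 0xFFFF  # round down to a multiple of four (16-bit register)
    if dx == 0:            # jnz not taken: the aligned value is zero
        dx = 0xFFFC        # fall back to the top of the 64 KB segment
    return dx


print(hex(setup_stack_pointer(0x9000)))  # 0x9000, already aligned
print(hex(setup_stack_pointer(0x9003)))  # 0x9000, rounded down
print(hex(setup_stack_pointer(0x0000)))  # 0xfffc, zero is replaced
print(hex(setup_stack_pointer(0x0002)))  # 0xfffc, rounds down to zero first
```

With the value 0x9000 set by GRUB 2, the stack pointer passes through unchanged.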
From this point we have a correct stack, which starts at 0x9000:0x9000 and grows down.
BSS Setup
Before the kernel can switch to C code, two final tasks must be done:
Verify the “magic” signature.
Clear the .bss section.
The first is the signature checking:
cmpl $0x5a5aaa55, setup_sig
jne setup_bad
This simply compares the setup_sig value placed by the linker with the magic number 0x5A5AAA55. If they are not equal, the setup code reports a fatal error and stops execution. The main goal of this check is to ensure we are actually running a valid Linux kernel setup binary, loaded into the proper place by the bootloader.
With the magic number confirmed, and knowing our segment registers and stack are already in the proper state, the only initialization left is to clear the .bss section. This section of memory is used to store statically allocated, uninitialized data. Let's take a look at the initialization of this memory area:
movw $__bss_start, %di
movw $_end+3, %cx
xorl %eax, %eax
subw %di, %cx
shrw $2, %cx
rep stosl
The main goal of this code is to clear, or in other words to fill with zeros, the memory area between __bss_start and _end. To fill this memory area with zeros, the rep stosl instruction is used. The stosl instruction stores the value of the eax register at the destination pointed to by es:di and then advances di by four. That is why we unified the values of the ds and es registers. The rep prefix repeats the stosl instruction the number of times given by the cx register.
To clear this memory area, we first set its borders: from __bss_start to _end + 3. We add 3 bytes to the _end address because we are going to write zeros in double words (4 bytes at a time). Adding three bytes ensures that when we later divide by four, any remainder at the end of the memory area still gets covered. After we set up the borders of the memory area and fill eax with 0 using the xor instruction, rep stosl does its job.
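The rounding trick is easier to see with numbers. The sketch below models how the count in cx is computed; the function name bss_dword_count and the sample addresses are hypothetical and chosen only to show the effect of adding 3 before the shift.

```python
def bss_dword_count(bss_start: int, end: int) -> int:
    """Model of: movw $_end+3, %cx; subw %di, %cx; shrw $2, %cx."""
    cx = (end + 3) & 0xFFFF          # _end + 3, kept within 16 bits
    cx = (cx - bss_start) & 0xFFFF   # length of the area plus 3
    return cx >> 2                   # number of 4-byte stores for rep stosl


# 8 bytes to clear: exactly two double words.
print(bss_dword_count(0x100, 0x108))  # 2
# 10 bytes to clear: the remainder of 2 bytes still needs a third store.
print(bss_dword_count(0x100, 0x10A))  # 3
# Without the +3, integer division would drop the remainder.
print((0x10A - 0x100) >> 2)           # 2
```

In the second case three stores write 12 bytes, so the last store may run up to three bytes past _end; that is the price of clearing the area in 4-byte steps.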
The effect of this code is that zeros are written through all the memory from __bss_start to _end. To know their exact addresses, we can inspect the setup.elf file with the readelf utility:
$ readelf -a arch/x86/boot/setup.elf | grep bss
[12] .bss NOBITS 00003f00 004efc 001380 00 WA 0 0 32
00 .bstext .header .entrytext .inittext .initdata .text .text32 .rodata .videocards .data .signature .bss
145: 00005280 0 NOTYPE GLOBAL DEFAULT 12 __bss_end
169: 00003f00 0 NOTYPE GLOBAL DEFAULT 12 __bss_start
These are offsets inside the setup segment. Since in our case the kernel image is loaded at physical address 0x90000, the symbols translate to:
__bss_start = 0x90000 + 0x3f00 = 0x93F00
__bss_end = 0x90000 + 0x5280 = 0x95280
The following diagram illustrates how the setup image, .bss
, and the stack region are laid out in memory:
Jump to C code
At this point we have initialized the stack and the .bss section. The last instruction of the early kernel setup assembly is a call into C code:
calll main
The main() function is located in the arch/x86/boot/main.c source code file. We will see what happens there in the next chapter.
Conclusion
This is the end of the first part about Linux kernel insides. If you have questions or suggestions, feel free to ping me on X (0xAX), drop me an email, or just create an issue. In the next part, we will see the first C code that executes in the Linux kernel setup, the implementation of memory routines such as memset and memcpy, the earlyprintk and early console implementation and initialization, and much more.