From bootloader to kernel
If you read my previous posts, you might have noticed that I have been involved with low-level programming for some time. I wrote some posts about assembly programming for x86_64 Linux and, at the same time, started to dive into the Linux kernel source code.
I have a great interest in understanding how low-level things work, how programs run on my computer, how they are located in memory, how the kernel manages processes and memory, how the network stack works at a low level, and many many other things. So, I decided to write yet another series of posts about the Linux kernel for the x86_64 architecture.
Note that I'm not a professional kernel hacker and I don't write code for the kernel at work. It's just a hobby. I just like low-level stuff, and it is interesting for me to see how these things work. So if you notice anything confusing, or if you have any questions or remarks, ping me on Twitter, drop me an email, or just create an issue. I appreciate it.
All posts will also be accessible in the project repository and, if you find something wrong with my English or the post content, feel free to send a pull request.
Note that this isn't official documentation, just learning and sharing knowledge.
Required knowledge
Understanding C code
Understanding assembly code (AT&T syntax)
Anyway, if you're just starting to learn such tools, I will try to explain some parts during this and the following posts. Alright, this is the end of the simple introduction. Let's start to dive into the Linux kernel and low-level stuff!
I started writing these posts at the time of the 3.18 Linux kernel, and many things have changed since that time. If there are changes, I will update the posts accordingly.
Although this is a series of posts about the Linux kernel, we won't start directly from the kernel code. As soon as you press the magical power button on your laptop or desktop computer, it starts working. The motherboard sends a signal to the power supply device. After receiving the signal, the power supply provides the proper amount of electricity to the computer. Once the motherboard receives the power good signal, it tries to start the CPU. The CPU resets all leftover data in its registers and sets predefined values for each of them.
The 80386 and later CPUs define the following predefined data in CPU registers after the computer resets:
IP          0xfff0
CS selector 0xf000
CS base     0xffff0000
An address consists of two parts: a segment selector, which has a base address; and an offset from this base address. In real mode, the associated base address of a segment selector is Segment Selector * 16. Thus, to get a physical address in memory, we need to multiply the segment selector part by 16 and add the offset to it:
PhysicalAddress = Segment Selector * 16 + Offset
For example, if CS:IP is 0x2000:0x0010, then the corresponding physical address will be:
0x2000 * 16 + 0x0010 = 0x20010
But, if we take the largest segment selector and offset, 0xffff:0xffff, then the resulting address will be:
0xffff * 16 + 0xffff = 0x10ffef
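The address arithmetic above can be sketched in a few lines of C. This is only an illustration: the function names are mine (hypothetical), and the second helper assumes that a disabled A20 line simply forces address bit 20 to zero, so that addresses wrap around at one megabyte.

```c
#include <stdint.h>

/* Physical address of selector:offset in real mode:
 * the selector is multiplied by 16 (shifted left by 4) and the offset is added. */
uint32_t real_mode_address(uint16_t selector, uint16_t offset)
{
    return ((uint32_t)selector << 4) + offset;
}

/* The same address as seen with the A20 line disabled (assumption: bit 20 is
 * masked off, so anything past the first megabyte wraps around to low memory). */
uint32_t real_mode_address_a20_off(uint16_t selector, uint16_t offset)
{
    return real_mode_address(selector, offset) & 0xFFFFFu;
}
```

With these helpers, 0x2000:0x0010 maps to 0x20010, and 0xffff:0xffff maps to 0x10ffef, or to 0x00ffef once the wraparound is applied.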
Ok, now we know a little bit about real mode and its memory addressing. Let's get back to discussing register values after reset.
The CS register consists of two parts: the visible segment selector and the hidden base address. In real-address mode, the base address is normally formed by shifting the 16-bit segment selector value 4 bits to the left to produce a 20-bit base address. However, during a hardware reset the segment selector in the CS register is loaded with 0xf000 and the base address is loaded with 0xffff0000. The processor uses this special base address until CS changes.
The starting address is formed by adding the base address to the value in the EIP register:
0xffff0000 + 0xfff0 = 0xfffffff0
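To make the difference between the visible selector and the hidden base more concrete, here is a small C sketch. The struct and function names are hypothetical. The selector and base values come from the text, and I assume the architectural EIP reset value of 0xfff0.

```c
#include <stdint.h>

/* State of CS and EIP right after a hardware reset. */
struct reset_state {
    uint16_t cs_selector; /* visible part of CS */
    uint32_t cs_base;     /* hidden base address, loaded specially at reset */
    uint32_t eip;         /* assumed reset value of the instruction pointer */
};

struct reset_state cpu_reset_state(void)
{
    struct reset_state s;
    s.cs_selector = 0xf000;
    s.cs_base = 0xffff0000u;
    s.eip = 0xfff0;
    return s;
}

/* The address actually used for the first fetch: hidden base + EIP. */
uint32_t first_instruction_address(struct reset_state s)
{
    return s.cs_base + s.eip;
}

/* What the usual real-mode rule (selector * 16) would give instead. */
uint32_t usual_real_mode_address(struct reset_state s)
{
    return ((uint32_t)s.cs_selector << 4) + s.eip;
}
```

The first function yields 0xfffffff0, while the usual rule would yield 0xffff0. The special base is what makes the difference until CS is reloaded.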
For example:
Build and run this with:
You will see:
You can see a binary dump of this using the objdump utility:
A real-world boot sector has code for continuing the boot process and a partition table instead of a bunch of 0's and an exclamation mark. :) From this point onwards, the BIOS hands control over to the bootloader.
NOTE: As explained above, the CPU is in real mode. In real mode, calculating the physical address in memory is done as follows:
PhysicalAddress = Segment Selector * 16 + Offset
just as explained above. We have only 16-bit general purpose registers, which have a maximum value of 0xffff, so if we take the largest values the result will be:
0xffff * 16 + 0xffff = 0x10ffef
In general, real mode's memory map is as follows:
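As an illustration, the conventional PC layout of the first megabyte can be written as a lookup table in C. Treat this as a sketch under assumptions: the ranges below are the commonly documented ones rather than something taken from this post, real firmware may differ in the details, and the names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

struct region {
    uint32_t start;
    uint32_t end;      /* inclusive */
    const char *name;
};

/* Conventional real-mode memory map (assumed, may vary between machines). */
static const struct region real_mode_map[] = {
    { 0x00000, 0x003FF, "Real mode interrupt vector table" },
    { 0x00400, 0x004FF, "BIOS data area" },
    { 0x00500, 0x07BFF, "Unused" },
    { 0x07C00, 0x07DFF, "Boot sector loaded by the BIOS" },
    { 0x07E00, 0x9FFFF, "Unused" },
    { 0xA0000, 0xBFFFF, "Video memory" },
    { 0xC0000, 0xC7FFF, "Video ROM BIOS" },
    { 0xC8000, 0xEFFFF, "BIOS shadow area" },
    { 0xF0000, 0xFFFFF, "System BIOS" },
};

/* Returns the region containing the address, or NULL above the first megabyte. */
const struct region *find_region(uint32_t address)
{
    size_t count = sizeof(real_mode_map) / sizeof(real_mode_map[0]);

    for (size_t i = 0; i < count; i++) {
        if (address >= real_mode_map[i].start && address <= real_mode_map[i].end)
            return &real_mode_map[i];
    }
    return NULL;
}
```

For example, looking up 0x7c00 returns the boot sector region, and 0xffff0 falls into the system BIOS region.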
At the start of execution, the BIOS is not in RAM, but in ROM.
The grub_main function initializes the console, gets the base address for modules, sets the root device, loads/parses the grub configuration file, loads modules, etc. At the end of execution, the grub_main function moves grub to normal mode. The grub_normal_execute function (from the grub-core/normal/main.c source code file) completes the final preparations and shows a menu to select an operating system. When we select one of the grub menu entries, the grub_menu_execute_entry function runs, executing the grub boot command and booting the selected operating system.
As we can see in the kernel boot protocol, memory will be mapped as follows after loading the kernel:
When the bootloader transfers control to the kernel, it starts at:
where X is the address of the kernel boot sector being loaded. In my case, X is 0x10000, as we can see in a memory dump:
Here we can see the memory address of the entry point, which is 0x0000000001000000. Let's go ahead.
Booting in QEMU
Attaching GDB to QEMU
The bootloader has now loaded the Linux kernel into memory, filled the header fields, and then jumped to the corresponding memory address. We now move directly to the kernel setup code.
It may look a bit strange at first sight, as there are several instructions before it. A long time ago, the Linux kernel had its own bootloader. Now, however, if you run, for example,
then you will see:
The actual kernel setup entry point is:
The bootloader (GRUB 2 and others) knows about this point (at an offset of 0x200 from MZ) and jumps directly to it, despite the fact that header.S starts from the .bstext section, which prints an error message:
The kernel setup entry point is:
This is the first code that actually runs (aside from the previous jump instructions, of course). After the kernel setup part receives control from the bootloader, the first jmp instruction is located at the 0x200 offset from the start of the kernel real mode, i.e., after the first 512 bytes. This can be seen in both the Linux kernel boot protocol and the GRUB 2 source code:
In my case, the kernel is loaded at the physical address 0x10000. This means that segment registers have the following values after kernel setup starts:
gs = fs = es = ds = ss = 0x1000
cs = 0x1020
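A short C sketch shows where these segment values come from, assuming the load address of 0x10000 from my setup and the 0x200 entry offset mentioned above. The helper names are hypothetical.

```c
#include <stdint.h>

/* Offset of the setup entry point from the start of the setup code. */
#define SETUP_ENTRY_OFFSET 0x200u

/* Real-mode segment value whose base is the given (16-byte aligned) address. */
uint16_t segment_for(uint32_t physical)
{
    return (uint16_t)(physical >> 4);
}

/* ds, es, fs, gs and ss point at the start of the loaded setup code. */
uint16_t setup_data_segment(uint32_t load_address)
{
    return segment_for(load_address);
}

/* cs points 0x200 bytes further, so that cs:0 is the setup entry point. */
uint16_t setup_code_segment(uint32_t load_address)
{
    return segment_for(load_address + SETUP_ENTRY_OFFSET);
}
```

For a load address of 0x10000, the data segments are 0x1000 and cs is 0x1020, that is, 0x20 paragraphs (512 bytes) above the data segments.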
After the jump to start_of_setup, the kernel needs to do the following:
Make sure that all segment register values are equal
Set up a correct stack, if needed
Let's look at the implementation.
First of all, the kernel ensures that the ds and es segment registers point to the same address. Next, it clears the direction flag using the cld instruction:
As I wrote earlier, grub2 loads kernel setup code at address 0x10000 by default and cs at 0x1020 because execution doesn't start from the start of the file, but from the jump here:
This can lead to 3 different scenarios:
ss has a valid value 0x1000 (as do all the other segment registers besides cs)
ss is invalid and the CAN_USE_HEAP flag is set (see below)
ss is invalid and the CAN_USE_HEAP flag is not set (see below)
Let's look at all three of these scenarios in turn:
Here we align dx (which contains the value of sp as given by the bootloader) to 4 bytes and check if it is zero. If it is, we set dx to 0xfffc (the last 4-byte aligned address in a 64KB segment). If it is not zero, we continue to use the value of sp given by the bootloader (0xf7f4 in my case). Afterwards, we put the value of ax (0x1000) into ss. We now have a correct stack:
and as we can read in the boot protocol:
If the CAN_USE_HEAP bit is set, we put heap_end_ptr into dx (which points to _end) and add STACK_SIZE (the minimum stack size, 1024 bytes) to it. After this, if the addition did not produce a carry (it will not, since dx = _end + 1024), we jump to label 2 (as in the previous case) and make a correct stack.
When CAN_USE_HEAP is not set, we just use a minimal stack from _end to _end + STACK_SIZE:
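The three scenarios can be summarized in one C function. This is a sketch of the logic described above, not the kernel's actual code (which is assembly). The function and parameter names are hypothetical, CAN_USE_HEAP is assumed to be bit 7 of loadflags as in the boot protocol, and STACK_SIZE is the 1024 bytes quoted above.

```c
#include <stdint.h>

#define CAN_USE_HEAP 0x80u  /* assumed: bit 7 of the loadflags header field */
#define STACK_SIZE   1024u  /* minimum stack size quoted in the text */

/* Returns the stack pointer the setup code ends up with.
 * end stands for the address of the _end symbol. */
uint16_t setup_stack_pointer(uint16_t ss, uint16_t ds, uint16_t sp,
                             uint16_t end, uint16_t heap_end_ptr,
                             uint8_t loadflags)
{
    uint32_t dx;

    if (ss == ds) {
        /* Scenario 1: ss is valid, keep the bootloader's sp. */
        dx = sp;
    } else {
        /* Scenarios 2 and 3: make up a new stack above the heap or above _end. */
        dx = (loadflags & CAN_USE_HEAP) ? heap_end_ptr : end;
        dx += STACK_SIZE;
        if (dx > 0xFFFFu)
            dx = 0;         /* 16-bit carry: prevent wraparound */
    }

    dx &= ~3u;              /* align to 4 bytes */
    if (dx == 0)
        dx = 0xFFFC;        /* last 4-byte aligned address in the segment */

    return (uint16_t)dx;
}
```

With ss == ds == 0x1000 and sp == 0xf7f4, the function returns 0xf7f4 unchanged, which matches the first scenario.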
If the magic number matches, knowing we have a set of correct segment registers and a stack, we only need to set up the BSS section before jumping into the C code.
The BSS section is used to store statically allocated, uninitialized data. Linux carefully ensures this area of memory is first zeroed using the following code:
That's all! We have the stack and BSS, so we can jump to the main() C function:
The processor starts working in real mode. Let's back up a little and try to understand memory segmentation in this mode. Real mode is supported on all x86-compatible processors, from the 8086 CPU all the way to the modern Intel 64-bit CPUs. The 8086 processor has a 20-bit address bus, which means that it could work with a 0-0xFFFFF or 1 megabyte address space. But it only has 16-bit registers, which have a maximum address of 2^16 - 1 or 0xffff (64 kilobytes).
Memory segmentation is used to make use of all the address space available. All memory is divided into small, fixed-size segments of 65536 bytes (64 KB). Since we cannot address memory above 64 KB with 16-bit registers, an alternate method was devised.
The address 0x10ffef is the last of the 65520 bytes that can be addressed past the first megabyte. Since only one megabyte is accessible in real mode, 0x10ffef becomes 0x00ffef with the A20 line disabled.
Adding the base address 0xffff0000 to the EIP reset value 0xfff0, we get 0xfffffff0, which is 16 bytes below 4GB. This point is called the reset vector. It's the memory location at which the CPU expects to find the first instruction to execute after reset. It contains a jump (jmp) instruction that usually points to the BIOS (Basic Input/Output System) entry point. For example, if we look in the coreboot source code, we see:
Here we can see the jmp instruction opcode, which is 0xe9, and its destination address at _start16bit - ( . + 2).
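The expression _start16bit - ( . + 2) is a relative displacement: in 16-bit code the 0xe9 opcode is followed by a 16-bit offset that is counted from the end of the instruction. Here is a rough C sketch of that arithmetic. The function names are hypothetical, and all addresses are offsets inside the current code segment.

```c
#include <stdint.h>

/* Displacement to store after the 0xe9 opcode.
 * disp_location is the offset of the 2-byte displacement field itself,
 * so disp_location + 2 is the offset of the next instruction. */
int16_t jmp_rel16(uint16_t disp_location, uint16_t target)
{
    return (int16_t)(target - (disp_location + 2));
}

/* Where the CPU ends up after executing the jump. */
uint16_t jmp_target(uint16_t disp_location, int16_t displacement)
{
    return (uint16_t)(disp_location + 2 + displacement);
}
```

For instance, with the opcode at offset 0xfff0 the displacement field sits at 0xfff1, and a jump to a (made-up) target at offset 0xe05b needs a negative displacement because the target lies below the reset vector.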
We also see that the reset section is 16 bytes and is compiled to start from the address 0xfffffff0:
Now the BIOS starts. After initializing and checking the hardware, the BIOS needs to find a bootable device. A boot order is stored in the BIOS configuration, controlling which devices the BIOS attempts to boot from. When attempting to boot from a hard drive, the BIOS tries to find a boot sector. On hard drives partitioned with an MBR partition layout, the boot sector is stored in the first 446 bytes of the first sector, where each sector is 512 bytes. The final two bytes of the first sector are 0x55 and 0xaa, which indicate to the BIOS that this device is bootable. Once the BIOS finds the boot sector, it copies it into a fixed memory location at 0x7c00, jumps there, and starts executing it.
This will instruct QEMU to use the boot binary that we just built as a disk image. Since the binary generated by the assembly code above fulfills the requirements of the boot sector (we end it with the magic sequence), QEMU will treat the binary as the master boot record (MBR) of a disk image. Note that when providing a boot binary image to QEMU, setting the origin to 0x7c00 (using [ORG 0x7c00]) is unneeded.
In this example, we can see that the code will be executed in 16-bit real mode. After starting, it calls the 0x10 interrupt, which just prints the ! symbol. The times directive pads the binary with zeros up to the 510th byte, and the binary finishes with the two magic bytes 0x55 and 0xaa.
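To see what such a boot sector looks like as raw bytes, here is a C sketch that builds an equivalent 512-byte image in memory. It is an illustration, not the example from this post: the machine code bytes are my own hand-encoding of the instructions named in the comments, and the function names are hypothetical.

```c
#include <stdint.h>
#include <string.h>

#define SECTOR_SIZE 512

/* Fills sector with a tiny real-mode program followed by the boot signature. */
void build_boot_sector(uint8_t sector[SECTOR_SIZE])
{
    static const uint8_t code[] = {
        0xb0, '!',   /* mov al, '!'   character to print            */
        0xb4, 0x0e,  /* mov ah, 0x0e  BIOS teletype output function */
        0xb7, 0x00,  /* mov bh, 0x00  display page                  */
        0xb3, 0x07,  /* mov bl, 0x07  attribute                     */
        0xcd, 0x10,  /* int 0x10      call the BIOS video service   */
        0xeb, 0xfe,  /* jmp $         loop forever                  */
    };

    memset(sector, 0, SECTOR_SIZE);     /* zero padding up to byte 510 */
    memcpy(sector, code, sizeof(code));
    sector[510] = 0x55;                 /* magic bytes that mark the  */
    sector[511] = 0xaa;                 /* sector as bootable         */
}

/* The check the BIOS performs on the last two bytes of the first sector. */
int is_bootable(const uint8_t sector[SECTOR_SIZE])
{
    return sector[510] == 0x55 && sector[511] == 0xaa;
}
```

Writing the resulting 512 bytes to a file would give an image of the same shape as the boot binary discussed above.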
Here 0x10ffef is equal to (1MB + 64KB - 16B) - 1. An 8086 processor (which was the first processor with real mode), in contrast, has a 20-bit address line. Since 2^20 = 1048576 is 1MB and 2^20 - 1 is the maximum address that could be used, this means that the actual available memory is 1MB.
At the beginning of this post, I wrote that the first instruction executed by the CPU is located at address 0xFFFFFFF0, which is much larger than 0xFFFFF (1MB). How can the CPU access this address in real mode? The answer is in the documentation:
There are a number of bootloaders that can boot Linux, such as GRUB 2. The Linux kernel has a boot protocol which specifies the requirements for a bootloader to implement Linux support. This example will describe GRUB 2.
Continuing from before, now that the BIOS has chosen a boot device and transferred control to the boot sector code, execution starts from GRUB 2's boot image. Its code is very simple, due to the limited amount of space available. It contains a pointer which is used to jump to the location of GRUB 2's core image. The core image begins with a small loader, which is usually stored immediately after the first sector in the unused space before the first partition. This loader reads the rest of the core image, which contains GRUB 2's kernel and drivers for handling filesystems, into memory. After loading the rest of the core image, it executes the grub_main function.
As we can read in the kernel boot protocol, the bootloader must read and fill some fields of the kernel setup header, which starts at offset 0x01f1 from the kernel setup code. You may look at the boot linker script to confirm the value of this offset. The kernel header starts from:
The bootloader must fill this and the rest of the headers (which are only marked as being type write in the Linux boot protocol, such as in this example) with values either received from the command line or calculated during booting. (We will not go over full descriptions and explanations for all fields of the kernel setup header for now, but we shall do so when discussing how the kernel uses them. You can find a description of all fields in the boot protocol.)
Before trying to debug the kernel, please see Booting in QEMU and Attaching GDB to QEMU.
Finally, we are in the kernel! Technically, the kernel hasn't run yet. First, the kernel setup part must configure things such as the decompressor and some memory management related things, to name a few. After all these things are done, the kernel setup part will decompress the actual kernel and jump to it. Execution of the setup part starts from header.S at the _start symbol.
Actually, the file header.S starts with the magic number MZ (see image above), the error message that displays and, following that, the PE header:
It needs this to load an operating system with UEFI support. We won't be looking into its inner workings right now but will cover it in upcoming chapters.
Here we can see a jmp instruction opcode (0xeb) that jumps to the start_of_setup-1f point. In Nf notation, 2f, for example, refers to the local label 2:. In our case, it's label 1: that is present right after the jump, and contains the rest of the setup header. Right after the setup header, we see the .entrytext section, which starts at the start_of_setup label.
Set up the BSS
Jump to the C code (the main() function)
This jump is at a 512-byte offset from the MZ magic number at the start of the file. We also need to align cs from 0x1020 to 0x1000, as well as all other segment registers. After that, we set up the stack:
This code pushes the value of ds to the stack, followed by the address of label 6, and executes the lretw instruction. When the lretw instruction is executed, it loads the address of label 6 into the instruction pointer register and loads cs with the value of ds. Afterward, ds and cs will have the same values.
Almost all of the setup code is for preparing the C language environment in real mode. The next step is checking the ss register's value and setting up a correct stack if ss is wrong:
In the first scenario, ss has a correct address (0x1000). In this case, we go to label 2:
In the second scenario, ss != ds. First, we put the value of _end (the address of the end of the setup code) into dx and check the loadflags header field using the testb instruction to see whether we can use the heap. loadflags is a bitmask header defined as:
The last two steps that need to happen before we can jump to the main C code are setting up the BSS area and checking the "magic" signature. First, signature checking:
This simply compares the signature with the magic number 0x5a5aaa55. If they are not equal, a fatal error is reported.
First, the __bss_start address is moved into di. Next, the _end + 3 address (the +3 rounds the size up to a multiple of 4 bytes) is moved into cx. The eax register is cleared (using the xor instruction), and the BSS section size (cx - di) is calculated and put into cx. Then, cx is divided by four (the size of the 32-bit value that stosl stores), and the stosl instruction is used repeatedly, storing the value of eax (zero) into the address pointed to by di, automatically increasing di by four, repeating until cx reaches zero. The net effect of this code is that zeros are written through all words in memory from __bss_start to _end:
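The same steps can be written down in C. This is just a sketch of the loop described above, with a plain byte array standing in for the setup segment. The function name and the segment parameter are hypothetical, while the variable names mirror the registers.

```c
#include <stdint.h>
#include <string.h>

/* Zeroes the BSS in 4-byte steps, the way the rep; stosl loop does.
 * segment stands for the memory of the setup segment, and
 * bss_start and end are the offsets of __bss_start and _end. */
void clear_bss(uint8_t *segment, uint16_t bss_start, uint16_t end)
{
    uint16_t di = bss_start;
    uint16_t cx = (uint16_t)(end + 3 - di); /* size, padded for rounding up */

    cx >>= 2;                               /* number of 4-byte stores */
    for (; cx != 0; cx--) {
        memset(&segment[di], 0, 4);         /* one stosl with eax == 0 */
        di += 4;
    }
}
```

Note that because of the rounding, up to three bytes past _end may be cleared as well when the BSS size is not a multiple of four.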
The main() function belongs to the kernel setup code. You can read about what it does in the next part.
This is the end of the first part about Linux kernel insides. If you have questions or suggestions, ping me on Twitter, drop me an email, or just create an issue. In the next part, we will see the first C code that executes in the Linux kernel setup, the implementation of memory routines such as memset, memcpy, earlyprintk, early console implementation and initialization, and much more.
Please note that English is not my first language and I am really sorry for any inconvenience. If you find any mistakes, please send me a PR.