First steps in the kernel setup code

We started our journey into the Linux kernel in the previous part, where we walked through the very early stages of the boot process and the first assembly instructions of the Linux kernel code. Among other things, this code was responsible for preparing the environment for the C programming language. At the end of the chapter we reached a symbolic milestone: the very first call of a C function. This function has the classical name main and is defined in the arch/x86/boot/main.c source code file.

From here on, we will see assembly code less and less often, but it is not the end 🤓 We will still meet some assembly code on our way. But now it is time for more "high level" logic!

In this part, we’ll keep digging through the kernel’s setup code and cover:

What protected mode is on x86 processors
Setup of early heap and console
Detection of available memory
Validation of a CPU
Initialization of a keyboard

Time to explore these steps in detail!

Protected mode

The Linux kernel for x86_64 operates in a special mode called long mode. One of the main goals of all the kernel setup code is to switch to this mode. But before it can get there, the kernel must switch the CPU into protected mode.

What is protected mode? From the previous chapter we already know that the CPU currently operates in real mode. For us, this mostly means memory segmentation. As a short reminder, to access a memory location, a combination of two CPU registers is used:

A segment register (cs, ds, ss, or es), which holds the segment selector.
A general purpose register, which specifies the offset within the segment.

The main motivation for switching from real mode is its memory limitation. As we saw in the previous part, real mode can address only 2²⁰ bytes, which is just 1 MB of RAM. Obviously, modern software, including an operating system kernel, needs more. To break this constraint, a new processor mode was introduced: protected mode.

Protected mode was introduced to the x86 architecture in 1982 with the 80286 and remained the primary operating mode of Intel processors until the introduction of x86_64 and long mode. This mode brought many changes and improvements, but one of the most crucial was in memory management. The 20-bit address bus of real mode was widened, first to 24 bits on the 80286 and then to 32 bits on the 80386. A 32-bit address bus allows access to 4 gigabytes of memory versus the 1 megabyte available in real mode.

Memory management in protected mode is divided into two mostly independent mechanisms:

Segmentation
Paging

For now, our attention stays on segmentation. We’ll return to paging later, once we enter 64-bit long mode.

Memory segmentation in protected mode

In protected mode, memory segmentation was completely redesigned. Fixed 64 KB real mode segments are gone. Instead, each segment is now defined by a special data structure called a Segment Descriptor, which specifies the properties of a memory segment. The segment descriptors are stored in another special structure called the Global Descriptor Table, or GDT. Whenever the CPU needs to find an actual physical memory address, it consults this table. The GDT itself is just a block of memory whose address is stored in a special CPU register called gdtr. This is a 48-bit register that consists of two parts:

The size of the Global Descriptor Table (16 bits)
The address of the Global Descriptor Table (32 bits)

Later, we will see exactly how the Linux kernel builds and loads its GDT. For now, it’s enough to know that the CPU provides a dedicated instruction to load the table’s address into the GDTR register:

lgdt gdt

As mentioned above, the GDT contains segment descriptors which describe memory segments. Now let's see what a segment descriptor looks like. Each descriptor is 64 bits in size and packs together a base address, a limit, and a set of flags.

Do not worry! I know it may look a little intimidating at first glance, especially in comparison to the relatively simple addressing in real mode, but we will go through it in detail. We will start with the lowest bits and move up.

The first field is LIMIT 15:0. It holds the low 16 bits of the segment limit. The remaining 4 bits of the limit are located at bits 51:48. Together they form a 20-bit value that describes the size of a segment. With a 20-bit limit, it may seem that the maximum size of a memory segment is 1 MB, but this is not the case. The maximum size of a segment also depends on the G (granularity) bit at position 55:

If G=0 - the value of the LIMIT field is interpreted in bytes.
If G=1 - the value of the LIMIT field is interpreted in 4 KB units called pages.

Based on this, we can easily calculate that the maximum size of a segment is 4 GB: with G=1 the limit counts up to 2²⁰ pages of 4 KB each.
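The calculation can be written down as a few lines of C. This is only a sketch: segment_size_bytes is a hypothetical helper that does not exist in the kernel, and it assumes an ordinary expand-up segment, where the limit is the last valid offset.

```c
#include <stdint.h>

/*
 * Hypothetical helper: size in bytes of an expand-up segment,
 * given the raw 20-bit LIMIT value and the G (granularity) bit.
 */
static uint64_t segment_size_bytes(uint32_t limit, int g)
{
	limit &= 0xFFFFF;	/* LIMIT is only 20 bits wide */

	if (g)
		/* 4 KB units: the low 12 bits of the effective limit are ones */
		return (((uint64_t)limit << 12) | 0xFFF) + 1;

	return (uint64_t)limit + 1;	/* byte units */
}
```

With the largest possible limit, 0xFFFFF, the function returns 1 MB for G=0 and 4 GB for G=1.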

The next field is BASE. We can see that it is split into three parts. The first part occupies bits 16 to 31, the second part occupies bits 32 to 39, and the third part occupies bits 56 to 63. The main goal of this field is to store the base address of a segment.

The remaining fields of a segment descriptor are flags which control different aspects of a segment, for example the type of memory. Let's take a look at the description of these flags:

Type - describes the type of a memory segment.
S - distinguishes system segments from code and data segments.
DPL - provides information about the privilege level of a segment. It can be a value from 0 to 3, where 0 is the most privileged level.
P - tells the CPU whether a segment is present in memory.
AVL - a bit available for use by system software. It is ignored by the Linux kernel.
L - indicates whether a code segment contains 64-bit code.
D / B - has a different meaning depending on the type of a segment.
- For a code segment: controls the default operand and address size. If the bit is clear, it is a 16-bit code segment. Otherwise it is a 32-bit code segment.
- For a stack segment, or in other words a data segment pointed to by the ss register: controls the default stack pointer size. If the bit is clear, it is a 16-bit stack segment and stack operations use the sp register. Otherwise it is a 32-bit stack segment and stack operations use the esp register.
- For an expand-down data segment: specifies the upper bound of the segment. If the bit is clear, the upper bound is 0xFFFF or 64 KB. Otherwise, it is 0xFFFFFFFF or 4 GB.
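To make the layout more tangible, here is a minimal sketch that unpacks a raw 64-bit descriptor into its fields. The struct seg_desc type and the decode_descriptor function are hypothetical names used only for illustration. The bit positions of the flags follow the standard x86 descriptor layout: Type at bits 43:40, S at bit 44, DPL at bits 46:45, P at bit 47, AVL at bit 52, L at bit 53, D/B at bit 54, and G at bit 55.

```c
#include <stdint.h>

/* Hypothetical unpacked view of a code or data segment descriptor. */
struct seg_desc {
	uint32_t base;
	uint32_t limit;		/* raw 20-bit value, not scaled by G */
	uint8_t type, s, dpl, p, avl, l, db, g;
};

static struct seg_desc decode_descriptor(uint64_t raw)
{
	struct seg_desc d;

	/* LIMIT 15:0 lives in bits 15:0, LIMIT 19:16 in bits 51:48 */
	d.limit = (uint32_t)(raw & 0xFFFF) |
		  ((uint32_t)((raw >> 48) & 0xF) << 16);

	/* BASE 23:0 lives in bits 39:16, BASE 31:24 in bits 63:56 */
	d.base = (uint32_t)((raw >> 16) & 0xFFFFFF) |
		 ((uint32_t)((raw >> 56) & 0xFF) << 24);

	d.type = (uint8_t)((raw >> 40) & 0xF);
	d.s    = (uint8_t)((raw >> 44) & 0x1);
	d.dpl  = (uint8_t)((raw >> 45) & 0x3);
	d.p    = (uint8_t)((raw >> 47) & 0x1);
	d.avl  = (uint8_t)((raw >> 52) & 0x1);
	d.l    = (uint8_t)((raw >> 53) & 0x1);
	d.db   = (uint8_t)((raw >> 54) & 0x1);
	d.g    = (uint8_t)((raw >> 55) & 0x1);

	return d;
}
```

For example, decoding the value 0x00CF9A000000FFFF (an illustrative flat 32-bit code segment, not a value taken from the kernel sources) gives base 0, limit 0xFFFFF, G=1, D/B=1, P=1, S=1, and DPL=0.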

If the S flag of a segment descriptor is set, the descriptor describes either a code or a data segment; otherwise it describes a system segment. If the highest-order bit of the Type field is clear, the descriptor describes a data segment; otherwise it describes a code segment. The remaining three bits of a data segment descriptor are interpreted as:

Accessed - indicates whether a segment has been accessed since the last time the kernel cleared this bit.
Write-Enable - determines whether a segment is writable or read-only.
Expansion-Direction - determines whether the segment expands down (valid offsets lie above the limit) or up.

For a code segment, these three bits are interpreted as:

Accessed - indicates whether a segment has been accessed since the last time the kernel cleared this bit.
Read-Enable - determines whether a segment is execute-only or execute-read.
Conforming - determines how privilege level changes are handled when transferring execution to that segment.

In the tables below you can find full information about the possible states of the flags for code and data segments.

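A data segment Type field:

| E (Expand-Down) | W (Writable) | A (Accessed) | Description                      |
|-----------------|--------------|--------------|----------------------------------|
| 0               | 0            | 0            | Read-Only                        |
| 0               | 0            | 1            | Read-Only, accessed              |
| 0               | 1            | 0            | Read/Write                       |
| 0               | 1            | 1            | Read/Write, accessed             |
| 1               | 0            | 0            | Read-Only, expand-down           |
| 1               | 0            | 1            | Read-Only, expand-down, accessed |
| 1               | 1            | 0            | Read/Write, expand-down          |
| 1               | 1            | 1            | Read/Write, expand-down, accessed |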

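A code segment Type field:

| C (Conforming) | R (Readable) | A (Accessed) | Description                        |
|----------------|--------------|--------------|------------------------------------|
| 0              | 0            | 0            | Execute-Only                       |
| 0              | 0            | 1            | Execute-Only, accessed             |
| 0              | 1            | 0            | Execute/Read                       |
| 0              | 1            | 1            | Execute/Read, accessed             |
| 1              | 0            | 0            | Execute-Only, conforming           |
| 1              | 0            | 1            | Execute-Only, conforming, accessed |
| 1              | 1            | 0            | Execute/Read, conforming           |
| 1              | 1            | 1            | Execute/Read, conforming, accessed |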
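The same information can be expressed as a small function that turns the 4-bit Type value of a code or data segment into a description. This is just an illustrative sketch; describe_type is a hypothetical helper, and it assumes that bit 3 selects code versus data, bit 2 is E or C, bit 1 is W or R, and bit 0 is A.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: describe the Type field of a code/data descriptor. */
static void describe_type(unsigned int type, char *buf, size_t len)
{
	int accessed = type & 0x1;	/* A */
	int rw       = type & 0x2;	/* W for data, R for code */
	int ec       = type & 0x4;	/* E for data, C for code */

	if (type & 0x8)			/* highest bit set: code segment */
		snprintf(buf, len, "%s%s%s",
			 rw ? "Execute/Read" : "Execute-Only",
			 ec ? ", conforming" : "",
			 accessed ? ", accessed" : "");
	else				/* highest bit clear: data segment */
		snprintf(buf, len, "%s%s%s",
			 rw ? "Read/Write" : "Read-Only",
			 ec ? ", expand-down" : "",
			 accessed ? ", accessed" : "");
}
```

For instance, a Type value of 0x3 describes a data segment that is "Read/Write, accessed", while 0xA describes an "Execute/Read" code segment.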

So far, we’ve looked at how a segment descriptor defines the properties of a memory segment: its base, limit, type, and different flags. But how does the CPU actually refer to one of these descriptors during execution? Just like in real mode, it uses segment registers. In protected mode, however, they are handled differently: they contain segment selectors. Each segment descriptor has an associated segment selector, which is a 16-bit structure.

The meaning of the fields is:

Index (bits 15:3) - the entry number of the descriptor in the descriptor table.
TI (bit 2) - indicates where to search for the descriptor
- If the value of the bit is 0, a descriptor will be searched in the Global Descriptor Table.
- If the value of this bit is 1, a descriptor will be searched in the Local Descriptor Table.
RPL (bits 1:0) - the privilege level requested by the selector.
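Splitting a selector into these fields takes only a few shifts and masks. The sketch below assumes the standard x86 selector layout (Index in bits 15:3, TI in bit 2, RPL in bits 1:0); the struct selector type and the decode_selector function are hypothetical names for illustration.

```c
#include <stdint.h>

/* Hypothetical unpacked view of a 16-bit segment selector. */
struct selector {
	uint16_t index;	/* entry number in the descriptor table */
	uint8_t  ti;	/* 0 = GDT, 1 = LDT */
	uint8_t  rpl;	/* requested privilege level, 0..3 */
};

static struct selector decode_selector(uint16_t sel)
{
	struct selector s;

	s.index = sel >> 3;
	s.ti    = (sel >> 2) & 0x1;
	s.rpl   = sel & 0x3;

	return s;
}
```

For example, the selector 0x10 refers to entry 2 of the GDT with RPL 0, and the selector 0x2B refers to entry 5 of the GDT with RPL 3.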

When a program running in protected mode references memory, the CPU needs to calculate the proper physical address. The following steps are needed to get a physical address in protected mode:

A segment selector is loaded into one of the segment registers.
The CPU tries to find the associated segment descriptor in the Global Descriptor Table based on the Index value from the segment selector. If the descriptor is found, it is loaded into a special hidden part of this segment register.
The physical address is the base address from the segment descriptor plus the offset, which comes from the instruction pointer or from the memory location referenced by the executed instruction. Strictly speaking, the result is a linear address, but it equals the physical address as long as paging is disabled.
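The steps above can be sketched as a tiny software model. This is a deliberately simplified illustration and not how the hardware is implemented: struct flat_desc and translate are hypothetical names, the table stores an already decoded base and byte-granular limit, the Local Descriptor Table is not modelled, and privilege and type checks are skipped.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, already decoded descriptor: base and last valid offset. */
struct flat_desc {
	uint32_t base;
	uint32_t limit;
};

/*
 * Translate selector:offset into a linear address using a model GDT.
 * Returns 0 on success and -1 if the access would fault.
 */
static int translate(const struct flat_desc *gdt, size_t entries,
		     uint16_t selector, uint32_t offset, uint32_t *addr)
{
	size_t index = selector >> 3;

	if (selector & 0x4)		/* TI = 1: LDT, not modelled here */
		return -1;
	if (index == 0 || index >= entries)	/* null or out of table */
		return -1;
	if (offset > gdt[index].limit)	/* outside of the segment */
		return -1;

	*addr = gdt[index].base + offset;
	return 0;
}
```

With a table whose entry 1 has base 0x00100000 and limit 0xFFFF, the pair 0x08:0x1234 translates to 0x00101234, while 0x08:0x10000 is rejected because the offset is beyond the limit.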

In the next part, we will see the transition into protected mode. But before the kernel can be switched to protected mode, we need to do some more preparations.

Let's continue from the point where we have stopped in the previous chapter.

Back to the Kernel: Entering main.c

As we already mentioned at the beginning of this chapter, one of the kernel's first main goals is to switch the processor into protected mode. But before this can happen, the kernel needs to do some preparations.

If we look at the very beginning of the main function from the arch/x86/boot/main.c, the very first thing we will see is a call of the init_default_io_ops function.

This function is defined in arch/x86/boot/io.h and looks like:

static inline void init_default_io_ops(void)
{
	pio_ops.f_inb  = __inb;
	pio_ops.f_outb = __outb;
	pio_ops.f_outw = __outw;
}

This function initializes function pointers for:

reading a byte from an I/O port
writing a byte to an I/O port
writing a word (16-bit) to an I/O port

These callbacks will be used to write data to the serial console, which will be initialized in one of the next steps. All the operations will be executed with the help of the inb, outb, and outw macros, which are defined in the same file:

#define inb  pio_ops.f_inb
#define outb pio_ops.f_outb
#define outw pio_ops.f_outw

The __inb, __outb, and __outw themselves are inline functions from the arch/x86/include/asm/shared/io.h:

#define BUILDIO(bwl, bw, type)						\
static __always_inline void __out##bwl(type value, u16 port)		\
{									\
	asm volatile("out" #bwl " %" #bw "0, %w1"			\
		     : : "a"(value), "Nd"(port));			\
}									\
									\
static __always_inline type __in##bwl(u16 port)				\
{									\
	type value;							\
	asm volatile("in" #bwl " %w1, %" #bw "0"			\
		     : "=a"(value) : "Nd"(port));			\
	return value;							\
}

BUILDIO(b, b, u8)
BUILDIO(w, w, u16)
BUILDIO(l,  , u32)

All of these functions use in and out assembly instructions which send the given value to the given port or read the value from the given port. If the syntax is not familiar to you, you can read the chapter about inline assembly.
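The pattern of function pointers hidden behind macros can be tried out in user space. The sketch below is an assumption-laden model: the layout of struct port_io_ops is a guess based on the assignments shown above, and fake_ports, fake_inb, fake_outb, and init_fake_io_ops are hypothetical stand-ins that replace the real in and out instructions with accesses to an ordinary array.

```c
#include <stdint.h>

/* Stand-in for the I/O port space; real code would touch hardware. */
static uint8_t fake_ports[0x10000];

static uint8_t fake_inb(uint16_t port)
{
	return fake_ports[port];
}

static void fake_outb(uint8_t value, uint16_t port)
{
	fake_ports[port] = value;
}

/* Assumed shape of the callback table (only byte operations here). */
struct port_io_ops {
	uint8_t (*f_inb)(uint16_t port);
	void    (*f_outb)(uint8_t value, uint16_t port);
};

static struct port_io_ops pio_ops;

/* Callers write inb(port) and outb(value, port) as if they were functions. */
#define inb  pio_ops.f_inb
#define outb pio_ops.f_outb

static void init_fake_io_ops(void)
{
	pio_ops.f_inb  = fake_inb;
	pio_ops.f_outb = fake_outb;
}
```

After init_fake_io_ops() is called, outb(0x41, 0x3f8) stores the byte in the fake port array and inb(0x3f8) reads it back. The benefit of the indirection is that the callbacks can be swapped without touching any of the callers.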

After the initialization of the I/O callbacks, the next step is to copy the kernel setup header filled in by the bootloader into the corresponding field of the C boot_params structure. This makes the fields of the kernel setup header more easily accessible. The copying is handled by the copy_boot_params function with the help of memcpy:

	memcpy(&boot_params.hdr, &hdr, sizeof(hdr));

Do not confuse this memcpy with the memcpy function from the C standard library. While the kernel is in the early initialization phase, there is no way to load any library. For this reason, an operating system kernel provides its own implementation of such functions. The kernel's memcpy is defined in copy.S. If you have already started to miss assembly code, this is a good time to bring some back:

SYM_FUNC_START_NOALIGN(memcpy)
	pushw	%si
	pushw	%di
	movw	%ax, %di
	movw	%dx, %si
	pushw	%cx
	shrw	$2, %cx
	rep movsl
	popw	%cx
	andw	$3, %cx
	rep movsb
	popw	%di
	popw	%si
	retl
SYM_FUNC_END(memcpy)

First of all, we can see that memcpy and other routines which are defined there, start and end with the two macros - SYM_FUNC_START_NOALIGN and SYM_FUNC_END. The SYM_FUNC_START_NOALIGN just specifies the given symbol name as .globl to make it visible for other functions. The SYM_FUNC_END just expands to an empty string in our case.

Although this function is written in assembly language, the implementation of memcpy is relatively simple. First, it pushes the values of the si and di registers onto the stack to preserve them, because they will change during the execution of memcpy. Next comes the handling of the function's parameters. The parameters of this function are passed through the ax, dx, and cx registers. This is because the kernel setup code is built with the -mregparm=3 option. So:

ax will contain the address of boot_params.hdr
dx will contain the address of hdr
cx will contain the size of hdr in bytes

The rep movsl instruction copies data from the memory pointed to by the si register to the memory pointed to by the di register. Each iteration copies 4 bytes, and the number of iterations is taken from the cx register. For this reason, the size of the setup header is divided by 4 using the shrw instruction. After this step, rep movsb copies the remaining bytes, whose count (the size modulo 4) is computed by the andw instruction.
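For readers who prefer C, the same two-phase copy can be sketched as follows. This is not the kernel's code, only a rough equivalent of the assembly above; setup_memcpy is a hypothetical name chosen to avoid clashing with the standard library function.

```c
#include <stddef.h>

/* Rough C equivalent of the assembly memcpy: dwords first, then the tail. */
static void *setup_memcpy(void *dst, const void *src, size_t len)
{
	unsigned char *d = dst;
	const unsigned char *s = src;
	size_t dwords = len >> 2;	/* shrw $2, %cx */
	size_t tail   = len & 3;	/* andw $3, %cx */

	/* rep movsl: copy 4 bytes per iteration */
	for (; dwords; dwords--) {
		d[0] = s[0];
		d[1] = s[1];
		d[2] = s[2];
		d[3] = s[3];
		d += 4;
		s += 4;
	}

	/* rep movsb: copy the 0..3 bytes that are left */
	for (; tail; tail--)
		*d++ = *s++;

	return dst;
}
```

For an 11-byte buffer, the first loop runs twice and copies 8 bytes, and the second loop copies the remaining 3 bytes.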

From this point, the setup header is copied into a proper place and we can move forward.

Console initialization

As soon as the kernel setup header is copied into the boot_params.hdr, the next step is to initialize the serial console by calling the console_init function. Very soon we will be able to print something from within the kernel code!

The console_init function is defined in arch/x86/boot/early_serial_console.c. As the very first step, it tries to find the earlyprintk option in the kernel's command line. If the search is successful, it parses the port address and baud rate and initializes the serial port.

[!NOTE] If you want to know which other options you can pass in the kernel command line, you can find more information in The kernel's command-line parameters document.

Let's take a look at these two steps in detail.

The possible values of the earlyprintk command line option are:

serial,0x3f8,115200
serial,ttyS0,115200
ttyS0,115200

The parameters define the name of a serial port, the port address, and the baud rate. The pointer to the kernel command line is stored in the kernel setup header and can be accessed through boot_params.hdr.cmd_line_ptr. The parse_earlyprintk function tries to find the earlyprintk option in the kernel command line, parses it if it was found, and initializes the serial console parameters with one of the values above. If the earlyprintk option is given and contains valid values, the initialization of the serial console takes place in the early_serial_init function. There is nothing specific to the Linux kernel in the initialization of a serial console, so we will skip this part. If you want to dive deeper by yourself, you can study arch/x86/boot/early_serial_console.c step by step.
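To give a feeling for what such parsing involves, here is a minimal sketch that handles only the first form, serial,<port address>,<baud rate>. It is not the kernel's parse_earlyprintk: struct early_serial, parse_earlyprintk_value, and uart_divisor are hypothetical names, the sketch uses the C standard library, which the real setup code cannot do, and the divisor formula assumes a 16550-style UART with a 115200 Hz base rate.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical result of parsing the option value. */
struct early_serial {
	unsigned int port;	/* I/O port address, e.g. 0x3f8 */
	unsigned int baud;	/* baud rate, e.g. 115200 */
};

/* Parse "serial,<hex port>,<baud>"; returns 0 on success, -1 otherwise. */
static int parse_earlyprintk_value(const char *arg, struct early_serial *out)
{
	char *end;

	if (strncmp(arg, "serial,", 7) != 0)
		return -1;
	arg += 7;

	/* Base 16 accepts an optional "0x" prefix. */
	out->port = (unsigned int)strtoul(arg, &end, 16);
	if (end == arg || *end != ',')
		return -1;
	arg = end + 1;

	out->baud = (unsigned int)strtoul(arg, &end, 10);
	if (end == arg || out->baud == 0)
		return -1;

	return 0;
}

/* Divisor latch value for a UART with a 115200 Hz base rate. */
static unsigned int uart_divisor(unsigned int baud)
{
	return 115200 / baud;
}
```

Parsing serial,0x3f8,115200 yields port 0x3f8 and a divisor of 1. The ttyS0 forms are intentionally rejected by this sketch.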

After the serial port initialization we can see the first output:

	if (cmdline_find_option_bool("debug"))
		puts("early console in setup code\n");

The puts function uses the inb and outb callbacks that we have seen above during the initialization of the I/O callbacks.

From this point we can print messages from the kernel setup code 🎉. Time to move to the next step.

Heap initialization

We have seen the initialization of the stack and bss memory areas in the previous chapter. The next step is to initialize the heap memory area. The heap initialization takes place in the init_heap function:

static void init_heap(void)
{
	char *stack_end;

	if (boot_params.hdr.loadflags & CAN_USE_HEAP) {
		stack_end = (char *) (current_stack_pointer - STACK_SIZE);
		heap_end = (char *) ((size_t)boot_params.hdr.heap_end_ptr + 0x200);
		if (heap_end > stack_end)
			heap_end = stack_end;
	} else {
		/* Boot protocol 2.00 only, no heap available */
		puts("WARNING: Ancient bootloader, some functionality may be limited!\n");
	}
}

First of all, init_heap checks the CAN_USE_HEAP flag from the kernel setup header. If it is not set, we'll see the warning message. If the heap is enabled, its end is set to boot_params.hdr.heap_end_ptr, filled in by the bootloader, plus 512 bytes, or to the end of the stack if the value specified by the bootloader is above it. The beginning of the heap is right after the end of the .bss area. The stack size is 1024 bytes. As a result, the heap occupies the memory between the end of the .bss area and heap_end, and never extends past the bottom of the stack.

Now the heap is initialized, although we will see the usage of it in the next chapters.
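The clamping logic of init_heap can be isolated into a pure function and tried with sample numbers. This is only a sketch: compute_heap_end is a hypothetical helper, the addresses in the example are illustrative and not taken from a real boot, and STACK_SIZE is redefined here with the 1024-byte value mentioned above.

```c
#include <stdint.h>

#define STACK_SIZE 1024	/* stack size mentioned in the text */

/*
 * Hypothetical helper mirroring init_heap: the heap ends 0x200 bytes
 * after heap_end_ptr, but never above the bottom of the stack.
 */
static uint32_t compute_heap_end(uint32_t heap_end_ptr, uint32_t stack_pointer)
{
	uint32_t stack_end = stack_pointer - STACK_SIZE;
	uint32_t heap_end  = heap_end_ptr + 0x200;

	if (heap_end > stack_end)
		heap_end = stack_end;

	return heap_end;
}
```

With heap_end_ptr = 0xde00 and a stack pointer of 0xfff0, the heap ends at 0xe000. If the bootloader had passed 0xfe00 instead, the result would be clamped to the bottom of the stack at 0xfbf0.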

CPU validation

The next step is the validation of the CPU on which the kernel is running. The kernel has to do this to make sure that all the required functionality will work correctly on the given CPU.

The validate_cpu function from arch/x86/boot/cpu.c validates the CPU. This function calls check_cpu, which checks the CPU model and its flags using the cpuid instruction. It checks CPU flags such as the presence of long mode, checks the processor's vendor, and makes vendor-specific preparations such as turning on extensions like SSE and SSE2:

int validate_cpu(void)
{
	u32 *err_flags;
	int cpu_level, req_level;

	check_cpu(&cpu_level, &req_level, &err_flags);

	if (cpu_level < req_level) {
		printf("This kernel requires an %s CPU, ",
		       cpu_name(req_level));
		printf("but only detected an %s CPU.\n",
		       cpu_name(cpu_level));
		return -1;
	}

If the CPU level is lower than the required level specified by the CONFIG_X86_MINIMUM_CPU_FAMILY kernel configuration option, the function returns an error and the kernel setup process is aborted.
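The idea of a CPU level can be illustrated with a small sketch. It is a simplification built on assumptions: the helper names are hypothetical, the real check_cpu does considerably more work, and the sketch assumes that a CPU reporting long mode is treated as level 64 while other CPUs keep their family number as the level. The one hard fact used here is that the long mode flag is bit 29 of the edx value returned by cpuid leaf 0x80000001.

```c
#include <stdint.h>

#define CPUID_EXT_FEATURES	0x80000001u	/* leaf that reports the LM flag */
#define CPUID_EDX_LM_BIT	29		/* long mode bit in edx */

/* Check the long mode flag in the edx value of the extended leaf. */
static int edx_has_long_mode(uint32_t edx)
{
	return (edx >> CPUID_EDX_LM_BIT) & 1;
}

/* Assumption: long mode capable CPUs are reported as level 64. */
static int detect_cpu_level(int family, uint32_t ext_edx)
{
	return edx_has_long_mode(ext_edx) ? 64 : family;
}

/* Mirrors the comparison in validate_cpu: 0 if suitable, -1 otherwise. */
static int check_level(int cpu_level, int req_level)
{
	return cpu_level < req_level ? -1 : 0;
}
```

Under these assumptions, a family 6 CPU without the long mode flag would fail the check against a required level of 64, while the same CPU with the flag set would pass.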

Memory detection

Once the kernel is sure that the CPU it is running on is suitable, the next stage is to detect the available memory in the system. This task is handled by the detect_memory function, which queries the system firmware to obtain a map of physical memory regions. To do this, the kernel uses a special BIOS service, 0xE820, but it can fall back to legacy BIOS services like 0xE801 or 0x88. In this chapter, we will see only the implementation of the 0xE820 interface.

The detect_memory function is defined in arch/x86/boot/memory.c and, as just mentioned, tries to get information about the available memory:

void detect_memory(void)
{
	detect_memory_e820();

	detect_memory_e801();

	detect_memory_88();
}

Let's look at the crucial part of the implementation of the detect_memory_e820 function. First of all, the detect_memory_e820 function initializes the biosregs structure with the special values related to the 0xE820 BIOS interface:

	initregs(&ireg);
	ireg.ax  = 0xe820;
	ireg.cx  = sizeof(buf);
	ireg.edx = SMAP;
	ireg.di  = (size_t)&buf;

ax register contains the number of the BIOS service
cx register contains the size of the buffer which will contain the data about available memory
di register contains the address of the buffer which will contain memory data
edx register contains the SMAP magic number

After the registers are filled with the needed values, the kernel can ask the 0xE820 BIOS interface about available memory. The kernel does this by invoking the 0x15 BIOS interrupt, which returns information about one memory region. The kernel repeats this operation in a loop until information about all the memory regions has been collected.
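The shape of this loop can be modelled in user space with a fake BIOS. Everything in the sketch is a stand-in: fake_bios_map and fake_e820_call replace the real interrupt, struct e820_entry is an assumed layout for one region (address, size, type), and the continuation value models the convention of the 0xE820 interface, where the BIOS hands back a value that selects the next entry and returns zero after the last one.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed layout of one memory region: 1 = usable, 2 = reserved. */
struct e820_entry {
	uint64_t addr;
	uint64_t size;
	uint32_t type;
};

/* Stand-in for the firmware's table. */
static const struct e820_entry fake_bios_map[] = {
	{ 0x00000000, 0x0009fc00, 1 },
	{ 0x0009fc00, 0x00000400, 2 },
	{ 0x00100000, 0x3fee0000, 1 },
};

#define FAKE_MAP_LEN (sizeof(fake_bios_map) / sizeof(fake_bios_map[0]))

/*
 * Fake "int 0x15, ax = 0xe820": stores entry number cont in *out and
 * returns the next continuation value, or 0 after the last entry.
 */
static uint32_t fake_e820_call(uint32_t cont, struct e820_entry *out)
{
	*out = fake_bios_map[cont];
	return (cont + 1 < FAKE_MAP_LEN) ? cont + 1 : 0;
}

/* Collect up to max entries; returns the number of entries stored. */
static size_t collect_memory_map(struct e820_entry *table, size_t max)
{
	size_t count = 0;
	uint32_t cont = 0;

	if (max == 0)
		return 0;

	do {
		cont = fake_e820_call(cont, &table[count]);
		count++;
	} while (cont != 0 && count < max);

	return count;
}
```

The loop stops either when the fake BIOS reports that there are no more entries or when the caller's table is full, whichever comes first.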

After the information is collected, the kernel prints a message about the available memory regions. You can find it in the dmesg output:

[    0.000000] e820: BIOS-provided physical RAM map:
[    0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009fbff] usable
[    0.000000] BIOS-e820: [mem 0x000000000009fc00-0x000000000009ffff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000000f0000-0x00000000000fffff] reserved
[    0.000000] BIOS-e820: [mem 0x0000000000100000-0x000000003ffdffff] usable
[    0.000000] BIOS-e820: [mem 0x000000003ffe0000-0x000000003fffffff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000fffc0000-0x00000000ffffffff] reserved

Keyboard initialization

Once memory detection is complete, the kernel proceeds with initializing the keyboard using the keyboard_init function:

static void keyboard_init(void)
{
	struct biosregs ireg, oreg;

	initregs(&ireg);

	ireg.ah = 0x02;		/* Get keyboard status */
	intcall(0x16, &ireg, &oreg);
	boot_params.kbd_status = oreg.al;

	ireg.ax = 0x0305;	/* Set keyboard repeat rate */
	intcall(0x16, &ireg, NULL);
}

This function performs two tasks using BIOS interrupt 0x16:

Gets the state of the keyboard, which contains information about the state of certain modifier keys, for example whether Caps Lock is active or not.
Sets the keyboard repeat rate, which determines how long a key must be held down before it begins repeating.
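The status byte stored in boot_params.kbd_status is a set of bit flags. The sketch below decodes it using the conventional layout of the value returned by BIOS interrupt 0x16 with ah = 0x02; the KBD_* names and the helper functions are hypothetical and do not come from the kernel sources.

```c
#include <stdint.h>

/* Conventional bit layout of the BIOS keyboard shift-status byte. */
#define KBD_RIGHT_SHIFT	0x01
#define KBD_LEFT_SHIFT	0x02
#define KBD_CTRL	0x04
#define KBD_ALT		0x08
#define KBD_SCROLL_LOCK	0x10
#define KBD_NUM_LOCK	0x20
#define KBD_CAPS_LOCK	0x40
#define KBD_INSERT	0x80

static int caps_lock_active(uint8_t kbd_status)
{
	return (kbd_status & KBD_CAPS_LOCK) != 0;
}

static int any_shift_pressed(uint8_t kbd_status)
{
	return (kbd_status & (KBD_LEFT_SHIFT | KBD_RIGHT_SHIFT)) != 0;
}
```

For example, a status byte of 0x60 means that both Num Lock and Caps Lock are active and no shift key is pressed.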

Gathering system information

After we went through the most essential hardware interfaces, like the CPU, I/O, memory map, and keyboard, the next couple of steps query the BIOS for additional information about the system. The information which the kernel is going to gather is not strictly required for entering protected mode, but it provides useful details that later parts of the kernel may rely on.

The following information is going to be collected:

Information about Intel SpeedStep
Information about Advanced Power Management
Information about Enhanced Disk Drive

At this moment we will not dive into the details of each of these queries, but we will get back to them in the next parts when we use this information. For now, let's just take a short look at these functions:

	/* Query Intel SpeedStep (IST) information */
	query_ist();

	/* Query APM information */
#if defined(CONFIG_APM) || defined(CONFIG_APM_MODULE)
	query_apm_bios();
#endif

	/* Query EDD information */
#if defined(CONFIG_EDD) || defined(CONFIG_EDD_MODULE)
	query_edd();
#endif

The first one gets information about Intel SpeedStep. This information is obtained by calling the 0x15 BIOS interrupt, and the result is stored in the boot_params structure. The returned information describes the support of Intel SpeedStep and the settings around it. If it is supported, this information will be passed later by the kernel to the power management subsystems.

The next one gets information about Advanced Power Management. The logic of this function is pretty similar to the one described above. It uses the same 0x15 BIOS interrupt to obtain the information and stores it in the boot_params structure. The returned information describes the support of APM, which was the power management subsystem before ACPI became the standard.

The last function gets information about the Enhanced Disk Drive from the BIOS. This time the 0x13 BIOS interrupt is used to obtain the information. The returned information describes the disks and their characteristics like geometry and mapping information.

Conclusion

This is the end of the second part about Linux kernel insides. If you have questions or suggestions, feel free to ping me on X - 0xAX, drop me an email, or just create an issue. In the next part, we will continue to deal with the preparations before transitioning into protected mode and the transition itself.

