Kernel load address randomization

Last updated 2 years ago

Introduction

This is the sixth part of the Kernel booting process series. In the previous part we took a look at the final stages of the Linux kernel boot process, but we skipped some important, more advanced parts.

As you may remember, the entry point of the Linux kernel is the start_kernel function defined in the main.c source code file. The kernel is loaded at the address stored in LOAD_PHYSICAL_ADDR, which depends on the CONFIG_PHYSICAL_START kernel configuration option (0x1000000 by default):

config PHYSICAL_START
	hex "Physical address where the kernel is loaded" if (EXPERT || CRASH_DUMP)
	default "0x1000000"
	---help---
	  This gives the physical address where the kernel is loaded.
      ...
      ...
      ...

This value may be changed during kernel configuration, but the load address can also be configured to be a random value. For this purpose, the CONFIG_RANDOMIZE_BASE kernel configuration option should be enabled during kernel configuration.

Now, the physical address where the Linux kernel image will be decompressed and loaded will be randomized. This part considers the case when the CONFIG_RANDOMIZE_BASE option is enabled and the load address of the kernel image is randomized for security reasons.

Page Table Initialization

Before the kernel decompressor can look for a random memory range to decompress and load the kernel to, the identity mapped page tables should be initialized. If the bootloader used the 16-bit or 32-bit boot protocol, we already have page tables. But there may be problems if the kernel decompressor selects a memory range which is valid only in a 64-bit context. That's why we need to build new identity mapped page tables.

Indeed, the first step in randomizing the kernel load address is to build new identity mapped page tables. But first, let's reflect on how we got to this point.

In the previous part, we followed the transition to long mode and jumped to the kernel decompressor entry point - the extract_kernel function. Randomization begins with a call to the choose_random_location function:

void choose_random_location(unsigned long input,
                            unsigned long input_size,
                            unsigned long *output,
                            unsigned long output_size,
                            unsigned long *virt_addr)
{}

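This function takes five parameters:

  • input;
  • input_size;
  • output;
  • output_size;
  • virt_addr.

Let's try to understand what these parameters are. The first parameter, input, is just the input_data parameter of the extract_kernel function from the arch/x86/boot/compressed/misc.c source code file, cast to unsigned long:
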
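asmlinkage __visible void *extract_kernel(void *rmode, memptr heap,
                                          unsigned char *input_data,
                                          unsigned long input_len,
                                          unsigned char *output,
                                          unsigned long output_len)
{
	...
	choose_random_location((unsigned long)input_data, input_len,
	                       (unsigned long *)&output,
	                       max(output_len, kernel_total_size),
	                       &virt_addr);
	...
}

This parameter is passed through assembly from the arch/x86/boot/compressed/head_64.S source code file:

leaq	input_data(%rip), %rdx

input_data is generated by the little mkpiggy program. If you've tried compiling the Linux kernel yourself, you may find the output generated by this program in the linux/arch/x86/boot/compressed/piggy.S source code file. In my case this file looks like this:

.section ".rodata..compressed","a",@progbits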
.globl z_input_len
z_input_len = 6988196
.globl z_output_len
z_output_len = 29207032
.globl input_data, input_data_end
input_data:
.incbin "arch/x86/boot/compressed/vmlinux.bin.gz"
input_data_end:

As you can see, it contains four global symbols. The first two, z_input_len and z_output_len, are the sizes of the compressed and uncompressed vmlinux.bin.gz archive. The third is our input_data parameter, which points to the Linux kernel image's raw binary (stripped of all debugging symbols, comments and relocation information). The last symbol, input_data_end, points to the end of the compressed Linux kernel image.

So, the first parameter to the choose_random_location function is the pointer to the compressed kernel image that is embedded into the piggy.o object file.

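The second parameter of the choose_random_location function is z_input_len.

The third and fourth parameters of the choose_random_location function are the address of the decompressed kernel image and its length respectively. The decompressed kernel's address came from the arch/x86/boot/compressed/head_64.S source code file and is the address of the startup_32 function aligned to a 2 megabyte boundary. The size of the decompressed kernel is given by z_output_len which, again, is found in piggy.S.

The last parameter of the choose_random_location function is the virtual address at which the kernel is loaded. As can be seen, by default, it coincides with the default physical load address: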

unsigned long virt_addr = LOAD_PHYSICAL_ADDR;

The physical load address is defined by the configuration options:

#define LOAD_PHYSICAL_ADDR ((CONFIG_PHYSICAL_START \
				+ (CONFIG_PHYSICAL_ALIGN - 1)) \
				& ~(CONFIG_PHYSICAL_ALIGN - 1))
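
As a rough illustration of the rounding performed by this macro, here is a small standalone sketch. The helper name load_physical_addr is hypothetical, and the alignment of 0x200000 mentioned in the note below is an assumption about a typical x86_64 configuration rather than something stated in this part.

```c
#include <stdint.h>

/*
 * Hypothetical helper mirroring the LOAD_PHYSICAL_ADDR macro:
 * round physical_start up to the next multiple of physical_align.
 * physical_align must be a power of two for the mask trick to work.
 */
static uint64_t load_physical_addr(uint64_t physical_start,
				   uint64_t physical_align)
{
	return (physical_start + (physical_align - 1)) &
	       ~(physical_align - 1);
}
```

With a start of 0x1000000 and an assumed alignment of 0x200000, the result stays at 0x1000000 because the start address is already aligned. A start of 0x1000001 would be rounded up to 0x1200000.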

We've covered choose_random_location's parameters, so let's look at its implementation. This function starts by checking the nokaslr option in the kernel command line:

if (cmdline_find_option_bool("nokaslr")) {
	warn("KASLR disabled: 'nokaslr' on cmdline.");
	return;
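}

We exit choose_random_location if the option is specified, leaving the kernel load address unrandomized. Information related to this can be found in the kernel's documentation:

kaslr/nokaslr [X86]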

Enable/disable kernel and module base offset ASLR
(Address Space Layout Randomization) if built into
the kernel. When CONFIG_HIBERNATION is selected,
kASLR is disabled by default. When kASLR is enabled,
hibernation will be disabled.
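
To make the idea of a boolean command line option more concrete, here is a minimal sketch of a whole-word lookup in a space-separated command line. The function option_present is a hypothetical, simplified stand-in and not the kernel's cmdline_find_option_bool; it assumes options are separated by single spaces and ignores quoting.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical simplified lookup: returns true if `option` appears in
 * `cmdline` as a whole word delimited by spaces or the string boundaries.
 */
static bool option_present(const char *cmdline, const char *option)
{
	size_t len = strlen(option);
	const char *p = cmdline;

	if (len == 0)
		return false;

	while ((p = strstr(p, option)) != NULL) {
		bool starts_word = (p == cmdline) || (p[-1] == ' ');
		bool ends_word = (p[len] == '\0') || (p[len] == ' ');

		if (starts_word && ends_word)
			return true;
		p += 1;	/* keep searching after this partial match */
	}
	return false;
}
```

With this sketch, a command line such as "root=/dev/sda1 nokaslr quiet" matches, while "nokaslrx" does not, because the match must be a complete word.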

Let's assume that we didn't pass nokaslr on the kernel command line and the CONFIG_RANDOMIZE_BASE kernel configuration option is enabled. In this case we add the kASLR flag to the kernel load flags:

boot_params->hdr.loadflags |= KASLR_FLAG;

Now, we call another function:

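initialize_identity_maps();

The initialize_identity_maps function is defined in the arch/x86/boot/compressed/kaslr_64.c source code file. This function starts by initializing an instance of the x86_mapping_info structure called mapping_info:

mapping_info.alloc_pgt_page = alloc_pgt_page;
mapping_info.context = &pgt_data;
mapping_info.page_flag = __PAGE_KERNEL_LARGE_EXEC | sev_me_mask;
mapping_info.kernpg_flag = _KERNPG_TABLE;

The x86_mapping_info structure is defined in the arch/x86/include/asm/init.h header file and looks like this:

struct x86_mapping_info {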
	void *(*alloc_pgt_page)(void *);
	void *context;
	unsigned long page_flag;
	unsigned long offset;
	bool direct_gbpages;
	unsigned long kernpg_flag;
};

This structure provides information about memory mappings. As you may remember from the previous part, we have already set up page tables to cover the range 0 to 4G. This won't do, since we might generate a randomized address outside of the 4 gigabyte range. So, the initialize_identity_maps function initializes the memory for new page table entries. Let's go through the fields of the x86_mapping_info structure.

alloc_pgt_page is a callback function that is called to allocate space for a page table entry. The context field is an instance of the alloc_pgt_data structure. We use it to track allocated page tables. The page_flag and kernpg_flag fields are page flags. The first represents flags for PMD or PUD entries. The kernpg_flag field represents overridable flags for kernel pages. The direct_gbpages field is used to check if huge pages are supported and the last field, offset, represents the offset between the kernel's virtual addresses and its physical addresses up to the PMD level.

The alloc_pgt_page callback just checks that there is space for a new page, allocates it in the pgt_buf field of the alloc_pgt_data structure and returns the address of the new page:

entry = pages->pgt_buf + pages->pgt_buf_offset;
pages->pgt_buf_offset += PAGE_SIZE;

Here's what the alloc_pgt_data structure looks like:

struct alloc_pgt_data {
	unsigned char *pgt_buf;
	unsigned long pgt_buf_size;
	unsigned long pgt_buf_offset;
};
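
The allocation step described above behaves like a simple bump allocator over a fixed buffer. Here is a minimal standalone sketch of that idea. The names pgt_pool, pool_alloc_page and SKETCH_PAGE_SIZE are hypothetical, and a page size of 4096 bytes is assumed.

```c
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096UL	/* assumed page size */

/* Hypothetical counterpart of alloc_pgt_data. */
struct pgt_pool {
	unsigned char *buf;	/* start of the page table buffer */
	unsigned long size;	/* total size of the buffer in bytes */
	unsigned long offset;	/* offset of the next free page */
};

/* Hand out the next free page, or NULL if the buffer is exhausted. */
static void *pool_alloc_page(struct pgt_pool *pool)
{
	unsigned char *entry;

	if (pool->offset + SKETCH_PAGE_SIZE > pool->size)
		return NULL;

	entry = pool->buf + pool->offset;
	pool->offset += SKETCH_PAGE_SIZE;
	return entry;
}
```

Pages are never freed in this scheme. The offset only moves forward, which is enough for a short-lived decompressor stage.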

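The last goal of the initialize_identity_maps function is to initialize pgt_buf_size and pgt_buf_offset. As we are only in the initialization phase, the initialize_identity_maps function sets pgt_buf_offset to zero:

pgt_data.pgt_buf_offset = 0;

pgt_data.pgt_buf_size will be set to 77824 or 69632 depending on which boot protocol was used by the bootloader (64-bit or 32-bit). The same is done for pgt_data.pgt_buf. If a bootloader loaded the kernel at startup_32, pgt_data.pgt_buf will point to the end of the already initialized page table in the arch/x86/boot/compressed/head_64.S source code file:

pgt_data.pgt_buf = _pgtable + BOOT_INIT_PGT_SIZE;

Here, _pgtable points to the beginning of the _pgtable region. On the other hand, if the bootloader used the 64-bit boot protocol and loaded the kernel at startup_64, the early page tables should already be built by the bootloader itself and _pgtable will just point to those instead:

pgt_data.pgt_buf = _pgtable;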

As the buffer for new page tables is initialized, we may return to the choose_random_location function.

Avoiding Reserved Memory Ranges

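After the stuff related to identity page tables is initialized, we can choose a random memory location to extract the kernel image to. But as you may have guessed, we can't just choose any address. There are certain reserved memory regions which are occupied by important things like the initrd and the kernel command line, and these must be avoided. The mem_avoid_init function will help us do this:

mem_avoid_init(input, input_size, *output);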

All unsafe memory regions will be collected in an array called mem_avoid:

struct mem_vector {
	unsigned long long start;
	unsigned long long size;
};

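static struct mem_vector mem_avoid[MEM_AVOID_MAX];

Here, MEM_AVOID_MAX is from the mem_avoid_index enum which represents different types of reserved memory regions:

enum mem_avoid_index {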
	MEM_AVOID_ZO_RANGE = 0,
	MEM_AVOID_INITRD,
	MEM_AVOID_CMDLINE,
	MEM_AVOID_BOOTPARAMS,
	MEM_AVOID_MEMMAP_BEGIN,
	MEM_AVOID_MEMMAP_END = MEM_AVOID_MEMMAP_BEGIN + MAX_MEMMAP_REGIONS - 1,
	MEM_AVOID_MAX,
};

Both are defined in the arch/x86/boot/compressed/kaslr.c source code file.

Let's look at the implementation of the mem_avoid_init function. The main goal of this function is to store information about reserved memory regions, with descriptions given by the mem_avoid_index enum, in the mem_avoid array and to create new pages for such regions in our new identity mapped buffer. The mem_avoid_init function does the same thing for all elements of the mem_avoid_index enum, so let's look at a typical example of the process:

mem_avoid[MEM_AVOID_ZO_RANGE].start = input;
mem_avoid[MEM_AVOID_ZO_RANGE].size = (output + init_size) - input;
add_identity_map(mem_avoid[MEM_AVOID_ZO_RANGE].start,
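		 mem_avoid[MEM_AVOID_ZO_RANGE].size);

The mem_avoid_init function first tries to avoid memory regions currently used to decompress the kernel. We fill an entry from the mem_avoid array with the start address and the size of the relevant region and call the add_identity_map function, which builds the identity mapped pages for this region. The add_identity_map function is defined in the arch/x86/boot/compressed/kaslr_64.c source code file and looks like this:

void add_identity_map(unsigned long start, unsigned long size)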
{
	unsigned long end = start + size;

	start = round_down(start, PMD_SIZE);
	end = round_up(end, PMD_SIZE);
	if (start >= end)
		return;

	kernel_ident_mapping_init(&mapping_info, (pgd_t *)top_level_pgt,
				  start, end);
}

The round_up and round_down functions are used to align the start and end addresses to a 2 megabyte boundary.
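
The effect of this rounding can be checked with a small standalone sketch. The names pmd_range, pmd_align_range and SKETCH_PMD_SIZE are hypothetical; the sketch assumes a 2 megabyte PMD size, as described above, and uses plain bit masks instead of the kernel's round_up and round_down helpers.

```c
#include <stdint.h>

#define SKETCH_PMD_SIZE (2ULL << 20)	/* 2 megabytes */

struct pmd_range {
	uint64_t start;	/* rounded down to a 2 megabyte boundary */
	uint64_t end;	/* rounded up to a 2 megabyte boundary */
};

/* Widen [start, start + size) so both ends sit on 2 megabyte boundaries. */
static struct pmd_range pmd_align_range(uint64_t start, uint64_t size)
{
	struct pmd_range r;
	uint64_t end = start + size;

	r.start = start & ~(SKETCH_PMD_SIZE - 1);
	r.end = (end + SKETCH_PMD_SIZE - 1) & ~(SKETCH_PMD_SIZE - 1);
	return r;
}
```

For example, a 4 kilobyte region starting at 0x1234567 is widened to the range 0x1200000 to 0x1400000, so a single 2 megabyte page covers it.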

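In the end this function calls the kernel_ident_mapping_init function from the arch/x86/mm/ident_map.c source code file and passes the previously initialized mapping_info instance, the address of the top level page table and the start and end addresses of the memory region for which a new identity mapping should be built.

The kernel_ident_mapping_init function sets default flags for new pages if they were not already set: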

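if (!info->kernpg_flag)
	info->kernpg_flag = _KERNPG_TABLE;

It then starts to build new 2-megabyte (because of the PSE bit in mapping_info.page_flag) page entries (PGD -> P4D -> PUD -> PMD if we're using five-level page tables or PGD -> PUD -> PMD if four-level page tables are used) associated with the given addresses.

for (; addr < end; addr = next) {
	p4d_t *p4d;

	/* Address covered by the next Page Global Directory entry */
	next = (addr & PGDIR_MASK) + PGDIR_SIZE;
	if (next > end)
		next = end;

	p4d = (p4d_t *)info->alloc_pgt_page(info->context);
	result = ident_p4d_init(info, p4d, addr, next);
	/* Stop only on failure; otherwise continue with the next entry */
	if (result)
		return result;
}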

The first thing this for loop does is find the address covered by the next entry of the Page Global Directory. If that address is greater than the end of the given memory region, we clamp it to end. After this, we allocate a new page with the x86_mapping_info callback that we looked at previously and call the ident_p4d_init function. The ident_p4d_init function will do the same thing, but for the lower level page directories (p4d -> pud -> pmd).
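
The stepping logic of this loop can be tried in isolation. The sketch below only counts how many Page Global Directory entries a range touches; it does not build any page tables. The names are hypothetical, and a PGDIR_SHIFT of 39 is assumed, which corresponds to four-level paging on x86_64.

```c
#include <stdint.h>

#define SKETCH_PGDIR_SHIFT 39	/* assumed: four-level paging */
#define SKETCH_PGDIR_SIZE  (1ULL << SKETCH_PGDIR_SHIFT)
#define SKETCH_PGDIR_MASK  (~(SKETCH_PGDIR_SIZE - 1))

/* Count the top-level entries touched by the range [addr, end). */
static unsigned int count_pgd_chunks(uint64_t addr, uint64_t end)
{
	unsigned int chunks = 0;
	uint64_t next;

	for (; addr < end; addr = next) {
		/* Start of the region covered by the next top-level entry */
		next = (addr & SKETCH_PGDIR_MASK) + SKETCH_PGDIR_SIZE;
		if (next > end)
			next = end;
		chunks++;
	}
	return chunks;
}
```

A range that fits inside one 512 gigabyte region yields a single iteration, while a range that straddles a 512 gigabyte boundary yields two.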

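That's all.

We now have new page entries related to reserved addresses in our page tables. We haven't reached the end of the mem_avoid_init function, but the rest is similar. It builds pages for the initrd and the kernel command line, among other things.

Now we may return to the choose_random_location function.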

Physical address randomization

After the reserved memory regions have been stored in the mem_avoid array and identity mapped pages are built for them, we select the region with the lowest available address to decompress the kernel to:

min_addr = min(*output, 512UL << 20);

Note that min_addr is the smaller of the current output address and 512 megabytes, so the minimum never exceeds 512 megabytes. This limit was selected to avoid unknown things in lower memory.

The next step is to select random physical and virtual addresses to load the kernel to. We start with the physical address:

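random_addr = find_random_phys_addr(min_addr, output_size);

The find_random_phys_addr function is defined in the same source code file as choose_random_location:

static unsigned long find_random_phys_addr(unsigned long minimum,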
                                           unsigned long image_size)
{
	minimum = ALIGN(minimum, CONFIG_PHYSICAL_ALIGN);

	if (process_efi_entries(minimum, image_size))
		return slots_fetch_random();

	process_e820_entries(minimum, image_size);
	return slots_fetch_random();
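}

The main goal of the process_efi_entries function is to find all suitable memory ranges in fully accessible memory to load the kernel. If the kernel is compiled and run on a system without EFI support, we continue to search for such memory regions in the e820 regions. All memory regions found will be stored in the slot_areas array:

struct slot_area {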
	unsigned long addr;
	int num;
};

#define MAX_SLOT_AREA 100

static struct slot_area slot_areas[MAX_SLOT_AREA];

The kernel will select a random index from this array to decompress the kernel to. The selection process is conducted by the slots_fetch_random function. The main goal of the slots_fetch_random function is to select a random memory range from the slot_areas array via the kaslr_get_random_long function:

slot = kaslr_get_random_long("Physical") % slot_max;

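The kaslr_get_random_long function is defined in the arch/x86/lib/kaslr.c source code file and, as its name suggests, returns a random number. Note that the random number can be generated in a number of ways depending on kernel configuration and features present in the system (for example, using the time stamp counter, rdrand, or some other method).

We now have a random physical address to decompress the kernel to.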
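
The way a single random number is turned into an address can be sketched as follows. This is a simplified illustration, not the kernel's slots_fetch_random: the random value and the alignment are passed in as parameters, the name pick_slot is hypothetical, and the sketch assumes that each area holds num consecutive candidate positions spaced one alignment unit apart.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified counterpart of the slot_area structure shown above. */
struct slot_area {
	uint64_t addr;		/* first candidate address in the area */
	unsigned int num;	/* number of candidate positions */
};

/*
 * Map `random` onto one of the candidate positions across all areas.
 * Returns 0 if there are no candidates at all.
 */
static uint64_t pick_slot(const struct slot_area *areas, size_t count,
			  uint64_t align, uint64_t random)
{
	uint64_t total = 0, slot;
	size_t i;

	for (i = 0; i < count; i++)
		total += areas[i].num;
	if (total == 0)
		return 0;

	slot = random % total;
	for (i = 0; i < count; i++) {
		if (slot >= areas[i].num) {
			/* Not in this area: skip its positions */
			slot -= areas[i].num;
			continue;
		}
		return areas[i].addr + slot * align;
	}
	return 0;
}
```

Counting positions across all areas before taking the remainder gives every candidate position the same chance, regardless of how the free memory is split into areas.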

Virtual address randomization

After selecting a random physical address for the decompressed kernel, we generate identity mapped pages for the region:

random_addr = find_random_phys_addr(min_addr, output_size);

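if (*output != random_addr) {
	add_identity_map(random_addr, output_size);
	*output = random_addr;
}

From now on, output will store the base address of the memory region where the kernel will be decompressed. Currently, we have only randomized the physical address. We can randomize the virtual address as well on the x86_64 architecture:

if (IS_ENABLED(CONFIG_X86_64))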
	random_addr = find_random_virt_addr(LOAD_PHYSICAL_ADDR, output_size);

*virt_addr = random_addr;

In architectures other than x86_64, the randomized physical and virtual addresses are the same. The find_random_virt_addr function calculates the number of virtual memory ranges needed to hold the kernel image. It calls the kaslr_get_random_long function, which we have already seen being used to generate a random physical address.
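
The slot calculation for the virtual address can be sketched in a few lines. This is an illustration under assumptions, not the kernel's implementation: all constants are passed in as parameters, the function names are hypothetical, and the values mentioned in the note below (a 1 gigabyte kernel image window and a 2 megabyte alignment) are assumed rather than taken from this part.

```c
#include <stdint.h>

/* Round value up to a multiple of align (align must be a power of two). */
static uint64_t sketch_align_up(uint64_t value, uint64_t align)
{
	return (value + align - 1) & ~(align - 1);
}

/* Number of aligned positions where the image fits below window_size. */
static uint64_t count_virt_slots(uint64_t window_size, uint64_t minimum,
				 uint64_t image_size, uint64_t align)
{
	minimum = sketch_align_up(minimum, align);
	image_size = sketch_align_up(image_size, align);
	return (window_size - minimum - image_size) / align + 1;
}

/* Pick one of those positions using an externally supplied random value. */
static uint64_t random_virt_addr(uint64_t window_size, uint64_t minimum,
				 uint64_t image_size, uint64_t align,
				 uint64_t random)
{
	uint64_t slots = count_virt_slots(window_size, minimum,
					  image_size, align);

	minimum = sketch_align_up(minimum, align);
	return (random % slots) * align + minimum;
}
```

With an assumed 1 gigabyte window, a minimum of 0x1000000, the z_output_len value of 29207032 bytes from piggy.S and a 2 megabyte alignment, this gives 491 possible positions for the kernel image.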

At this point we have randomized both the base physical (*output) and virtual (*virt_addr) addresses for the decompressed kernel.

That's all.

Conclusion

This is the end of the sixth and last part concerning the Linux kernel's booting process. We will not see any more posts about kernel booting (though there may be updates to this and previous posts). We will now turn to other parts of the Linux kernel instead.

The next chapter will be about kernel initialization and we will study the first steps taken in the Linux kernel initialization code.

If you have any questions or suggestions write me a comment or ping me on twitter.

Please note that English is not my first language, and I am really sorry for any inconvenience. If you find any mistakes please send me a PR to linux-insides.

Links

  • Address space layout randomization
  • Linux kernel boot protocol
  • long mode
  • initrd
  • Enumerated type
  • four-level page tables
  • five-level page tables
  • EFI
  • e820
  • time stamp counter
  • rdrand
  • x86_64
  • Previous part