Hello! While Leonid is preparing for his first open lesson in our “Linux Administrator” course, we continue to talk about booting the Linux kernel. Let's go!
Understanding a system that works without failures is good preparation for fixing the inevitable breakdowns
The oldest joke in open source software is the claim that “the code is self-documenting”. Experience shows that reading the source code is like listening to the weather forecast: sensible people still go outside and look at the sky. What follows are tips for inspecting and investigating the boot of Linux systems with familiar debugging tools. Analyzing the boot process of a system that works well prepares users and developers to deal with the inevitable failures.
In some ways, the boot process is surprisingly simple. The kernel starts up single-threaded and synchronous on a single core, which can seem almost comprehensible even to the pitiful human mind. But how does the kernel itself get started? What do the initrd (initial ramdisk) and bootloaders do? And wait, why is the LED on the Ethernet port always on?

Read on for answers to these and some other questions; the code for the described demos and exercises is also available on GitHub.
The beginning of boot: the OFF state
Wake-on-LAN
The OFF state means that the system has no power, right? The apparent simplicity is deceptive. For example, the Ethernet LED is lit even in this state because Wake-on-LAN (WOL) is enabled on your system. To check whether this is the case, run:
$# sudo ethtool <interface name>
where <interface name> might be, for example, eth0 (ethtool is found in the Linux package of the same name). If “Wake-on” in the output shows g, remote hosts can boot the system by sending a MagicPacket. If you do not intend to wake your system remotely and do not want others to do so, turn WOL off either in the system BIOS menu or with:
$# sudo ethtool -s <interface name> wol d
The processor that responds to the MagicPacket may be part of the network interface, or it may be the Baseboard Management Controller (BMC).
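To make the MagicPacket a little less magical, here is a minimal Python sketch of its payload: six 0xFF bytes followed by sixteen repetitions of the target's MAC address. The function name and the example MAC address are illustrative assumptions, and the sketch only builds the bytes without sending anything.

```python
def build_magic_packet(mac: str) -> bytes:
    """Return the 102-byte Wake-on-LAN payload for a MAC like 'aa:bb:cc:dd:ee:ff'."""
    hex_digits = mac.replace(":", "").replace("-", "")
    if len(hex_digits) != 12:
        raise ValueError(f"expected a 48-bit MAC address, got {mac!r}")
    mac_bytes = bytes.fromhex(hex_digits)  # raises ValueError on non-hex input
    # Synchronization stream (6 x 0xFF), then the MAC repeated 16 times.
    return b"\xff" * 6 + mac_bytes * 16


if __name__ == "__main__":
    packet = build_magic_packet("00:11:22:33:44:55")  # placeholder MAC
    print(len(packet), packet[:12].hex())
```

In practice such a payload is usually delivered as a broadcast on the local network, which is why a host with “Wake-on: g” can be started by any machine on the same segment.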
Intel Management Engine, Platform Controller Hub, and Minix
The BMC is not the only microcontroller (MCU) that may be “listening” on a nominally powered-off system. x86_64 systems also include the Intel Management Engine (IME) software suite for remote system management. A wide range of devices, from servers to laptops, include this technology, which enables features such as KVM Remote Control and Intel Capability Licensing Service. According to Intel's own detection tool, the IME has unpatched vulnerabilities. The bad news is that disabling the IME is difficult. Trammell Hudson created the me_cleaner project, which wipes some of the most egregious IME components, such as the embedded web server, but using it may also turn the system on which it is run into a brick.
The IME firmware and the System Management Mode (SMM) software that follows it at boot are based on the Minix operating system and run on the separate Platform Controller Hub processor rather than on the main system CPU. The SMM then launches the Unified Extensible Firmware Interface (UEFI) software, about which much has already been written, on the main processor. The Coreboot group at Google has started the breathtakingly ambitious Non-Extensible Reduced Firmware (NERF) project, which aims to replace not only UEFI but also early Linux user-space components such as systemd. While we wait for the results, Linux users can buy laptops from Purism, System76, or Dell on which the IME is disabled, and we can hope for laptops with 64-bit ARM processors.
Bootloaders
Besides launching buggy spyware, what does early boot firmware do? The job of a bootloader is to provide a newly powered processor with the resources it needs to run a general-purpose operating system like Linux. At power-on there is not only no virtual memory, but no DRAM either until its controller is brought up. The bootloader then turns on the power supplies and scans buses and interfaces to locate the kernel image and the root filesystem. Popular bootloaders such as U-Boot and GRUB support familiar interfaces like USB, PCI, and NFS, as well as more embedded-specific devices such as NOR and NAND flash. Bootloaders also interact with hardware security devices such as the Trusted Platform Module (TPM) to establish a chain of trust from the earliest stage of boot.
Figure: Running the U-Boot bootloader in the sandbox on the build host.
The popular open source U-Boot bootloader is supported on systems ranging from the Raspberry Pi to Nintendo devices, motherboards, and Chromebooks. There is no system log, and when something goes wrong there may not even be any console output. To make debugging easier, the U-Boot team provides a sandbox in which patches can be tested on the build host or even in a Continuous Integration system. Playing with the U-Boot sandbox is easy on a system where common development tools such as Git and the GNU Compiler Collection (GCC) are installed:
$# git clone git://git.denx.de/u-boot; cd u-boot
$# make ARCH=sandbox defconfig
$# make; ./u-boot
=> printenv
=> help
That's all: you are running U-Boot on x86_64 and can test tricky features such as repartitioning of mock storage devices, TPM-based key manipulation, and hotplug of USB devices. The U-Boot sandbox can even be single-stepped under the GDB debugger. Development with the sandbox is 10 times faster than testing by reflashing the bootloader onto a board, and a “bricked” sandbox can be recovered by pressing Ctrl+C.
Starting up the kernel
Provisioning a booting kernel
After completing its tasks, the bootloader jumps to the kernel code that it has loaded into main memory and begins executing it, passing along any command-line options that the user specified. What kind of program is the kernel? Running file /boot/vmlinuz shows that it is a bzImage. The Linux source tree contains an extract-vmlinux tool that can be used to decompress the file:
$# scripts/extract-vmlinux /boot/vmlinuz-$(uname -r) > vmlinux
$# file vmlinux
vmlinux: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), statically linked, stripped
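The idea behind this kind of extraction can be sketched in a few lines of Python: scan the image for a compression signature, try to decompress from there, and keep the result only if it looks like an ELF file. The sketch below handles gzip only and is an illustration under that assumption; the real script also tries other compressors, and the function name here is hypothetical.

```python
import zlib
from typing import Optional

GZIP_MAGIC = b"\x1f\x8b\x08"  # gzip signature plus the "deflate" method byte
ELF_MAGIC = b"\x7fELF"


def extract_gzip_elf(image: bytes) -> Optional[bytes]:
    """Return the first gzip-compressed ELF payload found in image, or None."""
    offset = image.find(GZIP_MAGIC)
    while offset != -1:
        try:
            # wbits=31 selects gzip framing; bytes after the stream are ignored.
            payload = zlib.decompressobj(31).decompress(image[offset:])
        except zlib.error:
            payload = b""  # false positive: the signature was not a real stream
        if payload.startswith(ELF_MAGIC):
            return payload
        offset = image.find(GZIP_MAGIC, offset + 1)
    return None
```

Reading /boot/vmlinuz-$(uname -r) into memory and passing it to this function would only succeed for a gzip-compressed kernel; for anything else it returns None.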
The kernel is an Executable and Linking Format (ELF) binary, just like Linux user-space programs. This means that we can use commands from the binutils package, such as readelf, to inspect it. Compare, for example, the output of the following:
$# readelf -S /bin/date
$# readelf -S vmlinux
The list of sections in binary files is mostly similar.
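What file and readelf report comes from the first bytes of the binary. As a rough illustration, the following Python sketch decodes the ELF identification bytes and the object type, roughly reproducing the first words of the file output. The function and table names are illustrative assumptions, and only a handful of type values are mapped.

```python
import struct

ELF_CLASSES = {1: "32-bit", 2: "64-bit"}
ELF_DATA = {1: "LSB", 2: "MSB"}
ELF_TYPES = {1: "relocatable", 2: "executable", 3: "shared object", 4: "core"}


def describe_elf(header: bytes) -> str:
    """Summarize the first 18 bytes of an ELF file, roughly as file(1) does."""
    if len(header) < 18 or header[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    elf_class = ELF_CLASSES.get(header[4], "unknown class")
    data = ELF_DATA.get(header[5], "unknown byte order")
    # e_type is a 16-bit field right after the 16 identification bytes.
    byte_order = ">" if header[5] == 2 else "<"
    (e_type,) = struct.unpack_from(byte_order + "H", header, 16)
    return f"ELF {elf_class} {data} {ELF_TYPES.get(e_type, 'other')}"


if __name__ == "__main__":
    with open("/bin/date", "rb") as binary:
        print(describe_elf(binary.read(18)))
```

Pointing the same function at the extracted vmlinux makes the comparison between the kernel and an ordinary program concrete.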
So the kernel must start up much like other Linux ELF binaries ... but how do user-space programs actually start? In the main() function, right? Not quite.
Before the main() function can run, programs need an execution context that includes heap and stack memory plus file descriptors for stdin, stdout, and stderr. User-space programs get these resources from the standard library (glibc on most Linux systems). Consider the following:
$# file /bin/date
/bin/date: ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, for GNU/Linux 2.6.32, BuildID[sha1]=14e8563676febeb06d701dbee35d225c5a8e565a, stripped
ELF binaries have an interpreter, just as Bash and Python scripts do. But it does not need to be specified with #! as in scripts, because ELF is the native format of Linux. The ELF interpreter provisions the binary with the resources it needs by calling the _start() function, which is available in the glibc source package and can be inspected with GDB. The kernel obviously has no interpreter and must provision itself, but how?
Inspecting the kernel's startup with GDB answers this question. First, install the kernel debug package, which contains an unstripped version of vmlinux, for example with apt-get install linux-image-amd64-dbg. Alternatively, compile and install your own kernel from source, for example by following the instructions in the excellent Debian Kernel Handbook.
Running gdb vmlinux followed by info files shows the ELF section init.text. List the start of program execution in init.text with l *(address), where address is the hexadecimal start of init.text. GDB will indicate that the x86_64 kernel starts up in the file arch/x86/kernel/head_64.S, where we find the assembly function start_cpu0() and code that explicitly creates a stack and decompresses the zImage before calling the x86_64_start_kernel() function. 32-bit ARM kernels have the similar arch/arm/kernel/head.S. start_kernel() is not architecture-specific, so the function lives in the kernel's init/main.c. Arguably, start_kernel() is the real main() function of Linux.
From start_kernel() to PID 1
The kernel's hardware manifest: the device tree and ACPI tables
At boot, the kernel needs information about the hardware beyond the processor type for which it was compiled. The instructions in the code are supplemented by configuration data that is stored separately. There are two main ways of storing this data: the device tree and ACPI tables. From these files the kernel learns, at each boot, what hardware it has to run.
For embedded devices, the device tree is the manifest of installed hardware. The device tree is simply a file that is compiled at the same time as the kernel source and is usually located in /boot alongside vmlinux. To see what is in the binary device tree on an ARM device, just run the strings command from the binutils package on a file whose name matches /boot/*.dtb, since dtb stands for device-tree binary. The device tree can be modified by editing the JSON-like files that compose it and rerunning the special dtc compiler provided with the kernel source. The device tree is a static file whose path is usually passed to the kernel by the bootloader on the command line, but in recent years a device tree overlay facility has been added, with which the kernel can dynamically load additional fragments in response to hotplug events after boot.
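Beyond strings, a .dtb file can be recognized by its fixed header. The following Python sketch decodes the ten big-endian 32-bit header fields of a flattened device tree blob and checks the 0xd00dfeed magic number. The function name is an illustrative assumption, and the sketch stops at the header without walking the tree itself.

```python
import struct

FDT_MAGIC = 0xD00DFEED
FDT_HEADER = struct.Struct(">10I")  # ten big-endian 32-bit fields
FDT_FIELDS = (
    "magic", "totalsize", "off_dt_struct", "off_dt_strings", "off_mem_rsvmap",
    "version", "last_comp_version", "boot_cpuid_phys",
    "size_dt_strings", "size_dt_struct",
)


def parse_fdt_header(blob: bytes) -> dict:
    """Decode the header of a flattened device tree blob (.dtb)."""
    if len(blob) < FDT_HEADER.size:
        raise ValueError("too short to contain a device tree header")
    fields = dict(zip(FDT_FIELDS, FDT_HEADER.unpack_from(blob)))
    if fields["magic"] != FDT_MAGIC:
        raise ValueError("not a flattened device tree blob")
    return fields
```

Feeding it the first 40 bytes of a file from /boot/*.dtb would show, among other things, the total size of the blob and the format version that dtc produced.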
The x86 family and many enterprise-grade ARM64 devices use the alternative Advanced Configuration and Power Interface (ACPI) mechanism. Unlike the device tree, the ACPI information is stored in the virtual filesystem /sys/firmware/acpi/tables, which the kernel creates at boot by accessing the onboard ROM. To read the ACPI tables, use the acpidump command from the acpica-tools package. Here is an example:
Figure: ACPI tables on Lenovo laptops are all set for Windows 2001.
Yes, your Linux system is ready for Windows 2001, should you care to install it. ACPI has both methods and data, in contrast to the device tree, which is more of a hardware description language. ACPI methods remain active after boot. For example, if you run the acpi_listen command (from the acpid package) and then close and open the laptop lid, you will see that the ACPI functionality has been running the whole time. Temporarily and dynamically overwriting the ACPI tables is possible, but changing them permanently involves interacting with the BIOS menu at boot or reflashing the ROM. If you are going to that much trouble, perhaps you should simply install coreboot, the open source firmware replacement.
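For a quick look without acpidump, every ACPI table starts with the same 36-byte header, which the Python sketch below decodes. The function name is an illustrative assumption; the table file name used in the usage note is also an assumption about what a given machine exposes, and reading it normally requires root.

```python
import struct

# Signature, length, revision, checksum, OEM ID, OEM table ID,
# OEM revision, creator ID, creator revision (little-endian, 36 bytes).
ACPI_HEADER = struct.Struct("<4sIBB6s8sI4sI")


def parse_acpi_table(table: bytes) -> dict:
    """Decode the common header of an ACPI table and verify its checksum."""
    if len(table) < ACPI_HEADER.size:
        raise ValueError("too short to contain an ACPI table header")
    signature, length, revision, _checksum, oem_id, oem_table_id, *_rest = (
        ACPI_HEADER.unpack_from(table)
    )
    return {
        "signature": signature.decode("ascii", "replace"),
        "length": length,
        "revision": revision,
        "oem_id": oem_id.decode("ascii", "replace").strip("\x00 "),
        "oem_table_id": oem_table_id.decode("ascii", "replace").strip("\x00 "),
        # All bytes of a valid table sum to zero modulo 256.
        "checksum_ok": len(table) >= length and sum(table[:length]) % 256 == 0,
    }
```

For example, passing the contents of /sys/firmware/acpi/tables/DSDT (if that table is present) shows the vendor strings that the firmware author baked into the table.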
From start_kernel() to user space
The code in init/main.c is surprisingly readable and, amusingly, still carries Linus Torvalds' original copyright from 1991-1992. The lines found in dmesg | head on a freshly booted system originate mostly from this source file. The first CPU is registered with the system, global data structures are initialized, and the scheduler, interrupt handlers (IRQs), timers, and console are brought up one by one, in strict order. Until timekeeping_init() runs, all timestamps are zero. This part of kernel initialization is synchronous, meaning that execution happens in exactly one thread and no function runs until the previous one has completed and returned. As a result, the dmesg output will be fully reproducible, even between two systems, as long as they have the same device tree or ACPI tables. At this stage Linux behaves like one of the real-time operating systems (RTOS) that run on MCUs, for example QNX or VxWorks. This situation persists into the rest_init() function, which start_kernel() calls at its very end.
Figure: Summary of the early kernel boot process.
The rather humbly named rest_init() spawns a new thread that runs kernel_init(), which in turn invokes do_initcalls(). Users can watch initcalls in action by appending initcall_debug to the kernel command line, which produces dmesg entries every time an initcall function runs. initcalls pass through eight sequential levels: early, core, postcore, arch, subsys, fs, device, and late. The part of the initcalls most visible to users is the probing and setup of all the processors' peripherals: buses, network, storage, displays, and so on, accompanied by the loading of their kernel modules.
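With initcall_debug enabled, the log contains one “initcall ... returned ... after ... usecs” line per function, which makes it easy to rank initcalls by duration. The Python sketch below does that for a captured dmesg text. The helper name is an illustrative assumption, and the function names in the usage example are hypothetical placeholders, not real log output.

```python
import re

INITCALL_RE = re.compile(
    r"initcall (?P<fn>[\w.]+)\S*(?: \[\w+\])? "
    r"returned (?P<ret>-?\d+) after (?P<usecs>\d+) usecs"
)


def slowest_initcalls(dmesg_text: str, top: int = 5) -> list:
    """Return up to `top` (microseconds, function) pairs, slowest first."""
    timings = []
    for line in dmesg_text.splitlines():
        match = INITCALL_RE.search(line)
        if match:
            timings.append((int(match["usecs"]), match["fn"]))
    return sorted(timings, reverse=True)[:top]


if __name__ == "__main__":
    # Hypothetical sample lines in the shape initcall_debug produces.
    sample = (
        "[    0.100000] calling  foo_init+0x0/0x20 @ 1\n"
        "[    0.100100] initcall foo_init+0x0/0x20 returned 0 after 97 usecs\n"
        "[    0.200000] initcall bar_init+0x0/0x40 returned -19 after 1500 usecs\n"
    )
    for usecs, name in slowest_initcalls(sample):
        print(f"{usecs:>8} us  {name}")
```

Sorting by duration is only one possible view; grouping by return value would instead highlight initcalls that failed to find their hardware.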
rest_init() also spawns a second thread on the boot processor, which begins by running cpu_idle() while it waits for the scheduler to assign it work.
kernel_init() also sets up symmetric multiprocessing (SMP). In recent kernels, this point can be found in the dmesg output by looking for the line “Bringing up secondary CPUs ...”. SMP proceeds by “hotplugging” CPUs, meaning that it manages their lifecycle with a state machine that is notionally similar to the one used for devices such as hotplugged USB sticks. The kernel's power-management system frequently takes individual cores offline and wakes them as needed, so the same CPU hotplug code is called over and over on a machine that is not busy. You can observe the power-management system invoking CPU hotplug with the BCC tool called offcputime.py.
Note that the code in init/main.c has almost finished executing by the time smp_init() runs: the boot processor has completed most of the one-time initialization that the other cores do not need to repeat. Nonetheless, per-CPU threads must be spawned for each core to manage interrupts (IRQs), workqueues, timers, and power events on each one. For example, look at the per-CPU threads that service softirqs and workqueues with the ps -o psr command, where the PSR field means “processor”. Each core must also host its own timers and cpuhp hotplug handlers.
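The processor number that ps shows as PSR can also be read directly from /proc, where it is field 39 of /proc/<pid>/stat. The Python sketch below extracts it from one such line; the function name is an illustrative assumption.

```python
def last_cpu_from_stat(stat_line: str) -> int:
    """Return field 39 ('processor') from a /proc/<pid>/stat line."""
    # The command name (field 2) is parenthesized and may contain spaces,
    # so split only what follows the last closing parenthesis.
    after_comm = stat_line.rsplit(")", 1)[1].split()
    # after_comm[0] is field 3 (state), so field 39 sits at index 36.
    return int(after_comm[36])


if __name__ == "__main__":
    with open("/proc/self/stat") as stat_file:
        print("last ran on CPU", last_cpu_from_stat(stat_file.read()))
```

Running the same function over the stat files of the ksoftirqd threads should show one thread pinned to each core.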
And finally, how does user space start? Near its end, kernel_init() looks for an initrd that can run the init process on its behalf. If it finds none, the kernel executes init directly. Why, then, might one want an initrd?
Early user space: who ordered the initrd?
Besides the device tree, another file path that is optionally provided to the kernel at boot is that of the initrd. The initrd often lives in /boot alongside the bzImage file vmlinuz on x86, or alongside the similar uImage and device tree on ARM. The contents of the initrd can be listed with the lsinitramfs tool, which is part of the initramfs-tools-core package. A distribution's initrd image contains minimal /bin, /sbin, and /etc directories, along with kernel modules and some files in /scripts. All of this should look more or less familiar, since the initrd is for the most part simply a minimal Linux root filesystem. The similarity is a bit deceptive, because nearly all the executables in /bin and /sbin inside the ramdisk are symlinks to the BusyBox binary, which makes the /bin and /sbin directories 10 times smaller than their glibc counterparts.
Why bother creating an initrd if all it does is load some modules and then start init on the regular root filesystem? Consider an encrypted root filesystem. Decryption may depend on loading a kernel module that is stored in /lib/modules on the root filesystem ... and, unsurprisingly, in the initrd as well. The crypto module could be statically compiled into the kernel instead of being loaded from a file, but there are several reasons not to do so. For example, statically compiling the kernel with modules may make it too large to fit in the available storage, or static compilation may violate the terms of a software license. Unsurprisingly, storage, network, and human input device (HID) drivers may also be present in the initrd: basically any code that is not part of the kernel proper but is needed to mount the root filesystem. The initrd is also a place where users can stash their own custom ACPI table code.
Figure: Having some fun with the rescue shell and a custom initrd.
An initrd is also great for testing filesystems and storage devices. Put the test tools in the initrd and run your tests from memory rather than from the object under test.
Finally, when init runs, the system is up! Since the secondary processors are now running, the machine has become the asynchronous, preemptible, unpredictable, high-performance creature that we all know and love. Indeed, ps -o pid,psr,comm -p 1 is likely to show that the user-space init process is no longer running on the boot processor.
Summing up
The Linux boot process sounds forbidding, given the number of different pieces of software that take part even on the simplest embedded device. Looked at differently, the boot process is rather simple, since the bewildering complexity caused by preemption, RCU, and race conditions is absent at boot. Focusing only on the kernel and PID 1 overlooks the large amount of work that bootloaders and auxiliary processors do to prepare the platform for the kernel to run. The kernel is certainly different from other Linux programs, but applying to it the same tools that are used to inspect other ELF binaries gives some insight into its structure. Studying the boot process while it works well prepares you for the failures when they come.
THE END
As usual, we look forward to your comments and questions, either here or at our open lesson, where Leonid will be the one answering them.