Memory management in the Linux kernel. Yandex Workshop

Hello! My name is Roman Guschin. At Yandex, I work on the Linux kernel. Some time ago I gave a seminar for system administrators with a general overview of the memory management subsystem in Linux, along with some of the problems we ran into and how we solved them. Most of the material describes the “vanilla” Linux kernel (3.10), but some of it is specific to the kernel used at Yandex. The seminar may well be of interest not only to system administrators, but to anyone who wants to learn how memory management works in Linux.

The main topics covered at the seminar:

Tasks and components of the memory management subsystem;
Hardware capabilities of the x86_64 platform;
How physical and virtual memory are described in the kernel;
Memory Management Subsystem API;
Reclaiming previously occupied memory;
Monitoring tools;
Memory Cgroups;
Compaction - defragmenting physical memory.

Below the cut you will find a more detailed outline of the talk, covering the basic concepts and principles.

Tasks and components of the memory management subsystem

The main task of the subsystem is to allocate physical memory to the kernel and to userspace processes, and to reclaim and redistribute it when all memory is in use.

Main components:

Buddy allocator: manages the pool of free memory.
Page replacement (“LRU” reclaim model): decides whose memory to take when free memory has run out.
PTE management: maintains the page translation tables.
SLUB allocator: the kernel's internal allocator.
And so on.
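
To make the buddy allocator's role a little more concrete, here is a toy sketch in Python of the general buddy technique: free blocks are kept per order (a block of order n is 2**n pages), larger blocks are split on allocation, and a freed block is merged with its "buddy" when that buddy is free too. The class and method names are invented for illustration, and this is not how the kernel code is organized.

```python
class ToyBuddyAllocator:
    """Toy buddy allocator over 2**max_order pages (illustration only)."""

    def __init__(self, max_order):
        self.max_order = max_order
        # free_lists[order] holds the start page of every free block
        # that is 2**order pages long.
        self.free_lists = [set() for _ in range(max_order + 1)]
        self.free_lists[max_order].add(0)

    def alloc(self, order):
        """Return the start page of a 2**order block, or None if none fits."""
        for current in range(order, self.max_order + 1):
            if self.free_lists[current]:
                break
        else:
            return None  # no free block is large enough
        start = min(self.free_lists[current])
        self.free_lists[current].remove(start)
        # Split the block down to the requested order; the upper half of
        # every split goes back to the free lists.
        while current > order:
            current -= 1
            self.free_lists[current].add(start + (1 << current))
        return start

    def free(self, start, order):
        """Free a block and merge it with its buddy for as long as possible."""
        while order < self.max_order:
            buddy = start ^ (1 << order)  # the buddy differs in exactly one bit
            if buddy not in self.free_lists[order]:
                break
            self.free_lists[order].remove(buddy)
            start = min(start, buddy)
            order += 1
        self.free_lists[order].add(start)
```

For example, with ToyBuddyAllocator(4), allocating one page splits the single 16-page block into free blocks of 8, 4, 2 and 1 pages; once everything has been freed again, the pieces merge back into one 16-page block.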

Hardware capabilities of the x86_64 platform

The NUMA scheme implies that a certain amount of memory is attached to each physical processor, and that processor can access it fastest. Accessing memory attached to other processors is much slower.

How is physical and virtual memory described in the kernel?

Physical memory in the kernel is described by three structures: nodes (pg_data_t), zones (struct zone), and pages (struct page). Each process has its own virtual address space, described by struct mm_struct. That address space is, in turn, divided into regions (struct vm_area_struct).
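
From userspace, the regions of a process can be inspected through /proc/<pid>/maps, where each line describes one region. Below is a minimal Python sketch that parses this format. The sample lines are made up for illustration, and the Region tuple and parse_maps function are hypothetical names, not part of any library.

```python
from collections import namedtuple

# One parsed line of /proc/<pid>/maps (hypothetical helper type).
Region = namedtuple("Region", "start end perms offset path")


def parse_maps(text):
    """Parse text in /proc/<pid>/maps format into a list of Region tuples."""
    regions = []
    for line in text.splitlines():
        # Fields: address range, permissions, offset, device, inode, [path]
        fields = line.split(None, 5)
        if len(fields) < 5:
            continue
        start, end = (int(x, 16) for x in fields[0].split("-"))
        path = fields[5].strip() if len(fields) > 5 else ""
        regions.append(Region(start, end, fields[1], int(fields[2], 16), path))
    return regions


# Made-up sample input; on a Linux machine, read /proc/self/maps instead.
SAMPLE_MAPS = """\
00400000-0040c000 r-xp 00000000 08:01 1234 /bin/cat
7ffd1000-7ffd3000 rw-p 00000000 00:00 0 [stack]
7f000000-7f001000 rw-p 00000000 00:00 0
"""

for region in parse_maps(SAMPLE_MAPS):
    size_kb = (region.end - region.start) // 1024
    print(f"{region.perms} {size_kb:>6} KiB  {region.path or '(anonymous)'}")
```

Regions with an empty path are anonymous memory; regions with a file path are backed by that file.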

Memory Management Subsystem API

The kernel interacts with the memory management subsystem through functions such as __get_free_page(), kmalloc(), kfree(), and vmalloc(). They are responsible for allocating free pages and large and small chunks of memory, as well as for freeing them. There is a whole family of such functions that differ in small details, for example, in whether the area is zeroed.

User programs interact with the mm subsystem through the functions mmap(), munmap(), brk(), mlock(), and munlock(). There are also the posix_fadvise() and madvise() functions, which can give the kernel "advice". The kernel, however, is not obliged to take that advice into account in its heuristics.
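
As a small userspace illustration, the sketch below uses Python's standard mmap module, which wraps mmap(), munmap() and (since Python 3.8) madvise(). The function name and the choice of MADV_SEQUENTIAL are arbitrary, picked only for the example; the hint is guarded because neither the method nor the constant exists on every platform.

```python
import mmap


def anonymous_mapping_demo(length=mmap.PAGESIZE * 4):
    """Map anonymous memory, write to it, give the kernel a hint, unmap."""
    # fileno -1 requests an anonymous mapping (not backed by a file).
    region = mmap.mmap(-1, length)
    try:
        region[:5] = b"hello"
        first = bytes(region[:5])
        # madvise() is only a hint and the kernel may ignore it.
        if hasattr(region, "madvise") and hasattr(mmap, "MADV_SEQUENTIAL"):
            region.madvise(mmap.MADV_SEQUENTIAL)
        return first, len(region)
    finally:
        region.close()  # releases the mapping, i.e. munmap()


data, size = anonymous_mapping_demo()
print(data, size)
```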

Memory reclaim

The system always tries to keep a certain amount of memory free. This makes allocation much faster, because memory does not have to be reclaimed at the very moment it is needed.

Pages that are in constant use (system libraries, etc.) are called the working set. Evicting them from memory slows down the entire system. The overall rate of memory consumption in the system is called memory pressure. This value can vary greatly depending on how loaded the system is.

All memory in the system that is not used by the kernel itself can be divided into two parts: anonymous memory and file memory. They differ in that for file memory we know for certain that each piece of it corresponds to a file, and it can be dropped back there.

LRU model

LRU stands for least recently used. It is an abstraction that suggests evicting the pages that have gone unaccessed the longest. It cannot be implemented exactly in Linux, because all we know about a particular page is whether it has been accessed at all. To track how often pages are accessed at least approximately, the active, inactive and unevictable lists are used. The last of these contains pages locked by the user, which will not be evicted from memory under any circumstances.

There are clear rules for moving pages between the inactive and active lists. Under memory pressure, pages from the inactive list can either be evicted from memory or promoted to the active list. Pages from the active list are moved to the inactive list if they have not been accessed for a long time.
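
The movement between the two lists can be sketched with a toy simulation in Python. This is a deliberately simplified model built on assumptions that are not taken from the kernel: a single "referenced" flag per page, new pages entering the inactive list, and a demotion rule that fires whenever the active list is longer than the inactive one. The class name and the capacity parameter are invented for the example.

```python
from collections import OrderedDict


class TwoListLRU:
    """Toy active/inactive page lists (assumes capacity >= 1)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.active = OrderedDict()    # page -> referenced flag, oldest first
        self.inactive = OrderedDict()  # page -> referenced flag, oldest first
        self.evicted = []              # pages thrown out, in eviction order

    def touch(self, page):
        """Access a page, bringing it into memory if it is not there yet."""
        if page in self.active:
            self.active[page] = True
        elif page in self.inactive:
            self.inactive[page] = True
        else:
            # "Memory pressure": reclaim until there is room for the new page.
            while len(self.active) + len(self.inactive) >= self.capacity:
                self._reclaim_step()
            self.inactive[page] = False

    def _reclaim_step(self):
        # Demote the oldest active page when the active list has grown
        # larger than the inactive list (assumed rule for this toy model).
        if self.active and len(self.active) > len(self.inactive):
            page, _ = self.active.popitem(last=False)
            self.inactive[page] = False
        # Look at the oldest inactive page: promote it if it was referenced
        # since it was added, otherwise evict it.
        page, referenced = self.inactive.popitem(last=False)
        if referenced:
            self.active[page] = False
        else:
            self.evicted.append(page)
```

With a capacity of three pages, touching a, b, c, then a again, then d promotes a to the active list (it was referenced while inactive) and evicts b, the oldest unreferenced page.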

Monitoring tools

The top utility shows memory consumption statistics for the system. The vmtouch program shows how much of a particular file is in memory. Exhaustive information on the number of file, active and inactive pages can be found in /proc/vmstat. The buddy allocator statistics are in /proc/buddyinfo, and the slub allocator statistics are in /proc/slabinfo. It is often useful to look at perf top, where all the problems with fragmentation are clearly visible.
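
Both /proc/vmstat and /proc/buddyinfo are plain text and easy to process. The Python sketch below parses the two formats: "name value" pairs for vmstat, and per-zone counts of free blocks by order for buddyinfo. The function names and the sample values are made up for illustration; on a real Linux machine you would pass in the contents of the files themselves.

```python
def parse_vmstat(text):
    """Parse /proc/vmstat-style text ("name value" per line) into a dict."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return stats


def parse_buddyinfo(text):
    """Parse /proc/buddyinfo-style text into {(node, zone): [count per order]}."""
    zones = {}
    for line in text.splitlines():
        parts = line.split()
        # Expected shape: Node <n>, zone <name> <order 0 count> <order 1 count> ...
        if len(parts) < 5 or parts[0] != "Node" or parts[2] != "zone":
            continue
        node = int(parts[1].rstrip(","))
        zones[(node, parts[3])] = [int(x) for x in parts[4:]]
    return zones


def free_pages(counts):
    """Total free pages in a zone: a free block of order n holds 2**n pages."""
    return sum(count << order for order, count in enumerate(counts))


# Made-up sample data in the same format as the real files.
VMSTAT_SAMPLE = "nr_free_pages 1000\nnr_inactive_file 200\nnr_active_file 300\n"
BUDDY_SAMPLE = "Node 0, zone   Normal   4 2 1 0 0 0 0 0 0 0 1\n"

stats = parse_vmstat(VMSTAT_SAMPLE)
print("file pages on LRU lists:", stats["nr_inactive_file"] + stats["nr_active_file"])
for (node, zone), counts in parse_buddyinfo(BUDDY_SAMPLE).items():
    print(f"node {node} zone {zone}: {free_pages(counts)} free pages")
```

A zone with plenty of free pages in low orders but zeros in the high-order columns is a sign of fragmentation: memory is free, but not in large contiguous blocks.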

Memory cgroups

Cgroups grew out of the desire to isolate a group of several processes, combine them logically, and limit their total memory consumption to a certain amount. If the group reaches its limit, memory has to be freed from what belongs to this particular group (this is called target reclaim). If the system as a whole has simply run out of memory and the free pool needs to be refilled, this is called global reclaim. From the accounting point of view, each page belongs to exactly one cgroup: the one that first read it.
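
The accounting rule and target reclaim can be illustrated with a toy model in Python. Everything here is an assumption made for the sketch: the class names, limits counted in whole pages, and reclaiming the oldest page of the group first. It is not the kernel's data model.

```python
class ToyCgroup:
    """A group with a page limit (assumes limit_pages >= 1)."""

    def __init__(self, name, limit_pages):
        self.name = name
        self.limit_pages = limit_pages
        self.pages = []  # pages charged to this group, oldest first


class ToyAccounting:
    """Charges each page to the first group that accesses it."""

    def __init__(self):
        self.owner = {}  # page -> ToyCgroup the page is charged to

    def access(self, group, page):
        """Access a page from a group; return the pages reclaimed as a result."""
        if page in self.owner:
            # Already charged to whichever group accessed it first.
            return []
        reclaimed = []
        # Target reclaim: the group is at its limit, so memory is freed
        # only from this group, never from others.
        while len(group.pages) >= group.limit_pages:
            victim = group.pages.pop(0)
            del self.owner[victim]
            reclaimed.append(victim)
        group.pages.append(page)
        self.owner[page] = group
        return reclaimed
```

If group a accesses page p1 first, a later access from group b leaves the page charged to a. When a then exceeds its own limit, p1 is reclaimed from a, while b's pages are left alone.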

Compaction

Compaction is a mechanism for defragmenting physical memory. It is described in some detail in this article. The mechanism was broken for a long time, approximately from version 3.3 to version 3.7. This showed up on some machines with a heavily fragmenting workload: after two weeks of uptime, all the processors were busy exclusively with compaction and did no useful work.

Source: https://habr.com/ru/post/231957/
