Very often, administrators configure a system by setting up only the basics - IP address, DNS, hostname, installed packages - and leave everything else to application settings. In most cases that is enough: Linux ships with very reasonable defaults, and most people live happily with them. Among beginners, sysctl is the stuff of legend - something the more experienced have heard of and perhaps even tweaked once or twice.
But the moment comes when, in his wanderings around the system, the admin meets this beast - sysctl. Most likely he meets someone from the net.ipv4 or vm family: probably net.ipv4.ip_forward, if the journey is toward building a router, or vm.swappiness, if he is worried about his penguin's swollen swap. The first beast lets the penguin take packets in on one wing and pass them out the other (it enables routing), and the second helps deal with swap usage on a quiet system and regulate it on a loaded one.

Meeting either of these “animals” opens the gates to a whole world of system settings and kernel components. There are several families, and the names of most of them speak for themselves: net - the network; kern - the kernel; vm - memory and caches. But that is the fairy-tale version; reality is far more interesting.
Having discovered this trove of settings, an inexperienced administrator flies toward it like a moth to a flame: he wants to tune them all, right now, to make the system better and more optimal. Oh yes! The temptation is great and the goal is noble, but the chosen path sometimes leads somewhere else entirely. The name of that path is “google-tuning”.
After all, how tempting it is to quickly google “optimal sysctl settings” and apply some recipe from the top of an article without really reading what is written below it - TL;DR, after all. In most cases the system does not get any worse, because the load or the configuration never reaches the point where the problems would surface. Only later, with experience, digging into the settings for a specific case, do you realize that what was written there was nonsense.
Take, for example, the parameters net.ipv4.tcp_mem, tcp_rmem and tcp_wmem. They look very similar - three numbers like “4096 87380 6291456” - but here is the catch: for tcp_[rw]mem these are bytes, while for tcp_mem they are pages. If you give all three parameters the same values, you get an “explosive” configuration, because tcp_mem governs the total memory consumed by TCP, while tcp_[rw]mem govern the per-socket buffers.
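If you want to see this mismatch for yourself, it is enough to read the three parameters back; a quick sketch (the values printed will of course differ from machine to machine):

    sysctl net.ipv4.tcp_rmem     # bytes:  min default max
    sysctl net.ipv4.tcp_wmem     # bytes:  min default max
    sysctl net.ipv4.tcp_mem      # PAGES:  min pressure max
    getconf PAGESIZE             # page size on this host, usually 4096 bytes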
And there are sysctls you never think about until you run into them and get burned, such as net.ipv4.conf.all.rp_filter. The nastiest part is that it only bites when you need asymmetric routing, that is, when your router has two or more interfaces and traffic can arrive on one interface and return through another - which is fairly rare. It is called Reverse Path Filtering, and it drops packets arriving from a source address that is not routed back through the interface they came in on.
There are also very useful parameters, but they require careful reading of the documentation and some calculations to figure out how they can help in your particular situation. I repeat: the defaults in Linux are good enough. This is especially true of the TCP settings and the network stack as a whole.
The knobs you end up turning most often are:
vm.swappiness - controls how aggressively memory is pushed out to swap, with the aim of keeping as much RAM as possible available to applications. At the default value of 60 the system moves to swap pages that have not been used for a long time; at 0 or 1 it tries to use swap only when it cannot allocate physical memory or when the amount of free memory approaches the value set in vm.min_free_kbytes. Using swap is neither good nor bad in itself - it all depends on the situation and the memory usage profile - and this knob lets us dial the system's attitude to swap anywhere from 0 (“I don't like it at all”) to 100 (“oh yes, swap it all!”).
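A minimal sketch of how this knob is usually inspected and changed; the value 10 and the file name under /etc/sysctl.d/ are illustrations, not recommendations:

    sysctl vm.swappiness                                    # show the current value
    sysctl -w vm.swappiness=10                              # change it at runtime only
    echo 'vm.swappiness = 10' > /etc/sysctl.d/90-swap.conf  # persist across reboots
    sysctl --system                                         # reload all sysctl config files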
vm.min_free_kbytes - (already mentioned under vm.swappiness) determines the minimum amount of free memory the kernel must keep in reserve. Set it too small and the system can break down under memory pressure; set it too large and the OOM (Out Of Memory) killer will visit you often.
vm.overcommit_memory - allows or forbids “allocating” more memory than actually exists: 0 - the system checks heuristically each time whether there is enough free memory; 1 - the system always assumes there is memory and allows the allocation as long as memory physically remains; 2 - forbid overcommit: the total that can be allocated is capped at swap plus a configurable share of RAM (vm.overcommit_ratio). This can bite when you have an application such as redis that consumes more than half of memory and decides to write its data to disk, for which it forks and copies all the data; if the available memory is not enough, either the write to disk fails or the OOM killer arrives and kills something you need.
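To get a feel for where you stand before touching this knob, you can look at the current mode and at what the kernel has already committed; a sketch:

    sysctl vm.overcommit_memory vm.overcommit_ratio    # the ratio only matters in mode 2
    grep -i commit /proc/meminfo                        # CommitLimit vs Committed_AS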
net.ipv4.ip_forward - enables or disables packet routing. We run into this knob when setting up a router. Everything here is more or less clear: 0 - off; 1 - on.
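Turning a host into a router then looks roughly like this (the file name under /etc/sysctl.d/ is just one common convention):

    sysctl -w net.ipv4.ip_forward=1                             # enable until reboot
    echo 'net.ipv4.ip_forward = 1' > /etc/sysctl.d/30-fwd.conf  # make it permanent
    sysctl --system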
net.ipv4.conf.{all,default,<interface>}.rp_filter - controls Reverse Path Filtering: 0 - do not check, disabled; 1 - “strict” mode, dropping packets whose replies would not leave through the interface the packet arrived on; 2 - “loose” mode, rejecting only packets whose source address is not routable at all (with a default route present this should, in my view, behave much like 0).
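For an asymmetric-routing setup the usual fix is to relax the filter; a sketch (eth1 is a hypothetical interface name, and keep in mind that the kernel applies the maximum of the "all" and per-interface values):

    sysctl -w net.ipv4.conf.all.rp_filter=2     # "loose" mode everywhere
    sysctl -w net.ipv4.conf.eth1.rp_filter=2    # and on the interface in question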
net.ipv4.ip_local_port_range - defines the minimum and maximum ports used for local client sockets. If your system makes a large number of outgoing connections to network resources, you may run out of local ports for new connections; this parameter lets you adjust the range used to establish them. It is also useful for keeping the ephemeral range away from the high ports your own services “listen” on.
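A sketch of checking and widening the range (the numbers are only an example; make sure none of your listening services fall inside it):

    sysctl net.ipv4.ip_local_port_range                    # often "32768 60999" by default
    sysctl -w net.ipv4.ip_local_port_range="16384 60999"   # more ports for outgoing connections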
net.ipv4.ip_default_ttl - the default packet lifetime (TTL). You may need it to mislead your mobile operator when using a phone as a modem, or to make sure packets from a given host do not travel beyond the network perimeter.
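For example (the value 65 is purely illustrative - one hop more than the typical default of 64):

    sysctl -w net.ipv4.ip_default_ttl=65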
net.core.netdev_max_backlog - sets the size of the packet queue between the network card and the kernel. If the kernel does not keep up with processing and the queue fills, new packets are dropped. You may need to raise it in certain situations to ride out peak loads without network problems.
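Before raising it blindly, it is worth checking whether this queue actually overflows; a sketch:

    sysctl net.core.netdev_max_backlog
    cat /proc/net/softnet_stat    # 2nd column (hex) counts packets dropped from this queue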
The two parameters below govern the TCP connection queues. They are the first candidates for tuning on loaded web services, once problems actually appear.
To understand them, you need to know how a TCP connection is established:
- The program opens a listening socket: socket() -> listen(). As a result it gets, for example, *:80 (port 80 on all interfaces).
- The client establishes a connection: a) it sends a SYN packet to the server (this is where tcp_max_syn_backlog comes into play); b) it receives a SYN-ACK from the server; c) it sends an ACK to the server (and here somaxconn is already at work).
- Through the accept() call the connection is picked up and handed to the process for working with the specific client.
net.core.somaxconn - the size of the queue of established connections waiting for accept(). If the application sees short load spikes, increasing this parameter may help.
net.ipv4.tcp_max_syn_backlog - the size of the queue of connections that are not yet established (SYN received, handshake not finished).
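A sketch of raising both queues on a busy web server; the values are illustrative, and remember that the backlog argument the application passes to listen() also caps the accept queue:

    sysctl -w net.core.somaxconn=1024
    sysctl -w net.ipv4.tcp_max_syn_backlog=2048
    ss -lnt    # for listening sockets, Send-Q shows the effective backlog in use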
For most TCP settings, as indeed for all the others, you need to understand the mechanisms these settings govern.
For example, how TCP transfers data. Since the protocol “guarantees delivery”, it requires the other side of the connection to confirm receipt. These confirmations are called ACKnowledgements, and each one acknowledges a received range of bytes. We also know that we can push data onto the network in blocks up to the MTU size (for simplicity, 1500 bytes), while we need to transfer much more - say 1,500,000 bytes, i.e. 1000 frames - so the data will be split up. If the two servers sit on the same network, one patch cord apart, we will not notice any problem; but if we are talking to a remote system, waiting for a confirmation of every single packet takes a long time, because we must wait for our packet to get THERE and for the acknowledgement to come back, and this badly hurts the transfer rate. To solve this, the TCP window was introduced, with 16 bits reserved for its size in the TCP header. Roughly speaking, it is the number of bytes we may send without waiting for confirmation. In 16 bits we can store at most 2^16 = 65536 bytes (roughly 65K), which in the age of multi-gigabit networks is not much at all.
To see how we could transfer data, say, from Moscow to Novosibirsk (RTT (Round Trip Time) = 50 ms) over a 1 Gbit/s channel, let's do a few calculations (what follows is VERY rough).
- Without a TCP window. We can send only 1500 bytes per 0.05 s: 1500 / 0.05 = 30,000 bytes/second. Not much, given that the channel is 1 Gbit/s, roughly 100 MB/s. ~30 KB/s vs ~100 MB/s - we clearly fail to use the available bandwidth.
- With a TCP window of 65536 (the maximum that fits in the header), i.e. we can send all 65K at once: 65536 / 0.05 = 1,310,720 ≈ 1.25 MB/s. 1.25 MB/s vs ~100 MB/s is still not enough - the difference is about 80-fold.
- So how big a window do we need to use at least 900 Mbit/s? Working backwards: (900,000,000 / 8) B/s * 0.05 s = 5,625,000 B ≈ 5.36 MB. That is the window size we need to transfer data efficiently. But since the link is very long, there may also be losses along the way, which hurt throughput too. To be able to advertise a window larger than 65K through a 16-bit field, the tcp_window_scaling option was introduced.
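If you want to repeat this back-of-the-envelope calculation for your own link, the same arithmetic in the shell (900 Mbit/s and an RTT of 50 ms are the assumptions from above):

    # bandwidth-delay product: bytes that must be "in flight" to keep the pipe full
    echo $(( 900000000 / 8 * 50 / 1000 ))    # -> 5625000 bytes, ~5.36 MB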
net.ipv4.tcp_window_scaling - 0 disables window scaling; 1 enables it. Both ends of the connection must support it; it exists to make full use of the channel bandwidth.
But being able to advertise a window larger than 65K is not enough: you also need to be able to hold all that data in the memory allocated to the socket, in our case in the TCP buffers:
net.ipv4.tcp_wmem, net.ipv4.tcp_rmem - the read and write buffer settings look the same: three numbers in bytes, “min default max” - the minimum guaranteed buffer size, the default size, and the maximum size beyond which the system will not grow the buffer. Approach these with an understanding of how many connections you expect, how much data you intend to move, and how slow your clients and servers are.
net.ipv4.tcp_mem - memory management settings for the TCP stack as a whole: three numbers in PAGES, “min pressure max”: min - below this threshold the system does not bother reclaiming buffer memory; pressure - the threshold at which the system starts trying to reduce buffer memory consumption, where possible; max - the threshold beyond which no more memory is allocated and buffers can no longer grow.
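Putting it together, a hedged sketch of what a configuration sized for the ~5 MB window derived above might look like (the numbers are illustrative, not a recommendation):

    sysctl -w net.ipv4.tcp_rmem="4096 87380 6291456"    # bytes: min default max
    sysctl -w net.ipv4.tcp_wmem="4096 65536 6291456"    # bytes: min default max
    sysctl net.ipv4.tcp_mem                              # PAGES, not bytes - see the warning above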
Sometimes applications crash and, as the saying goes, “dump core”. That core (a memory dump) can then be used to debug the problem with gdb - definitely a useful thing once it is set up. By default the core dump is saved to an anonymous file named core in the application's working directory, possibly with the application's pid appended, but you can make things much more convenient:
kernel.core_pattern - lets you specify the name (and path) template used to save core dumps. For example, “/var/core/%E.%t.%p” saves the core to the /var/core/ directory, putting the full path of the crashed program into the file name (with / replaced by !) and appending the event timestamp and the application's pid. You can even pipe the core to an external program for analysis. See man 5 core for details.
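A sketch of putting this into practice (the directory and pattern repeat the example above; the directory must exist and be writable, and the core size limit must allow dumps at all):

    mkdir -p /var/core
    sysctl -w kernel.core_pattern='/var/core/%E.%t.%p'
    ulimit -c unlimited    # per-shell limit that allows cores to be written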
All of this is just the tip of the iceberg; describing everything in detail would take a whole book.
Good luck in the console.
This small note was prepared as part of our "Linux Administrator" course. The topic was chosen by a vote of our students, so we hope you find it interesting and useful.