
WAL in PostgreSQL: 1. Buffer cache

The previous series was devoted to isolation and multiversion concurrency in PostgreSQL, and today we are starting a new one, about the write-ahead logging mechanism. Let me remind you that the material is based on the administration training courses that Pavel Luzanov and I teach, but it does not repeat them verbatim and is intended for thoughtful reading and independent experimentation.

This series will consist of four parts:


Why do you need journaling?


While the DBMS is running, part of the data it deals with is kept in RAM and written to disk (or other non-volatile storage) in a deferred manner. The less often this happens, the less I/O there is and the faster the system works.
But what happens in the event of a failure, for example, a power outage or a crash in the DBMS or operating system code? The entire contents of RAM are lost, and only the data written to disk survives (with some types of failures the disk can suffer too, but then only a backup will help). In principle, I/O can be organized so that the data on disk is always kept consistent, but this is difficult and not very efficient (as far as I know, only Firebird took this path).

Usually, and PostgreSQL is no exception, the data written to disk is inconsistent, and special recovery actions are required after a crash to restore consistency. Journaling is exactly the mechanism that makes this possible.

Buffer cache


Strange as it may seem, we begin the discussion of journaling with the buffer cache. The buffer cache is not the only structure stored in RAM, but it is one of the most important and complex. Understanding how it works is important in itself, and, besides, using it as an example we will see how data is exchanged between RAM and disk.

Caching is used everywhere in modern computing systems; a processor alone can have three or four levels of cache. In general, any cache exists to smooth out the performance difference between two kinds of memory, one of which is relatively fast but too scarce for everyone, while the other is relatively slow but available in abundance. The buffer cache smooths the difference between access times to RAM (nanoseconds) and to disk (milliseconds).

Note that the operating system also has its own disk cache, which solves the same problem. For this reason, DBMSs usually try to avoid double caching by accessing the disk directly, bypassing the OS cache. But PostgreSQL does not: all data is read and written through ordinary file operations.

In addition, disk array controllers, and even the disks themselves, have caches of their own. This fact will come in handy when we get to the question of reliability.

But back to the database buffer cache.

It is called that because it is an array of buffers. Each buffer holds room for one data page (block), plus a header. Among other things, the header contains:

  - the physical location of the page (file number and block number within the file);
  - a dirty flag, showing that the page has been changed and sooner or later must be written to disk;
  - the usage count;
  - the pin count.

The buffer cache is located in the shared memory of the server and is accessible to all processes. To work with data - read or modify it - processes read pages into the cache. While a page is in the cache, we work with it in RAM and save on disk accesses.



Initially the cache contains empty buffers, all of them linked into a free-buffer list. The meaning of the "next victim" pointer will become clear a bit later. A hash table is used to quickly find the desired page in the cache.

Searching for a page in the cache


When a process needs to read a page, it first tries to find it in the buffer cache using the hash table. The hash key is the file number and the page number within the file. In the corresponding hash table bucket the process finds a buffer number and checks whether that buffer really holds the desired page. As with any hash table, collisions are possible here; in that case the process has to check several pages.
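The lookup can be modeled roughly like this (an illustrative Python sketch, not the actual implementation; the real hash table is partitioned and protected by locks, and all numbers here are made up):

```python
# The cache maps a page "tag" -- (file number, fork, block number) --
# to a buffer number. On a hit we still re-check the tag, since the
# entry may be stale or a collision may point at a different page.

buffers = {}      # buffer number -> tag of the page currently in it
hash_table = {}   # page tag -> buffer number

def find_buffer(relfile, fork, block):
    """Return the buffer number holding the page, or None on a cache miss."""
    tag = (relfile, fork, block)
    buf = hash_table.get(tag)
    if buf is not None and buffers.get(buf) == tag:
        return buf
    return None   # the page must be read from disk

# Page 0 of the main fork of file 16384 sits in buffer 15735:
buffers[15735] = (16384, 0, 0)
hash_table[(16384, 0, 0)] = 15735

print(find_buffer(16384, 0, 0))   # 15735 - cache hit
print(find_buffer(16384, 0, 1))   # None  - cache miss
```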

The use of a hash table has long drawn criticism. This structure makes it fast to find a buffer by page, but it is completely useless if, for example, you need to find all the buffers occupied by a particular table. Still, no one has proposed a good replacement.

If the desired page is found in the cache, the process must "pin" the buffer by incrementing the pin count (several processes can do this at the same time). While the buffer is pinned (the counter value is greater than zero), the buffer is considered in use and its contents must not change "radically". For example, a new row version may appear on the page - this bothers no one thanks to multiversioning and visibility rules. But a different page cannot be read into a pinned buffer.

Eviction


It may happen that the needed page is not found in the cache. In that case it must be read from disk into some buffer.

While there are still free buffers in the cache, the first free one is taken. But sooner or later they run out (the database is usually larger than the memory allocated for the cache), and then one of the occupied buffers has to be chosen, the page residing there evicted, and the new page read into the freed slot.

The eviction mechanism relies on the fact that on every access to a buffer, processes increment the usage count in the buffer header. So buffers that are used less often than others have a lower counter value and are good candidates for eviction.

The clock-sweep algorithm goes around the buffers (using the "next victim" pointer), decreasing their usage counts by one. The first buffer selected for eviction is the one that:

  1. has a zero usage count,
  2. and is not pinned (zero pin count).

Note that if all buffers have a nonzero usage count, the algorithm has to make more than one full circle, decrementing the counters until one of them finally drops to zero. To avoid "spinning in circles", the maximum value of the usage count is capped at 5. Even so, with a large buffer cache this algorithm can cause significant overhead.
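The sweep can be sketched in a few lines of Python (a toy illustration, not the PostgreSQL source; the `Buffer` class is a simplified stand-in for the real buffer headers):

```python
# Clock-sweep: advance the "next victim" pointer around the buffer array,
# decrementing usage counts of unpinned buffers, until an unpinned buffer
# with a zero usage count is found.

MAX_USAGE = 5   # usage counts are capped to avoid endless circling

class Buffer:
    def __init__(self, usage=0, pins=0):
        self.usage = min(usage, MAX_USAGE)
        self.pins = pins

def clock_sweep(buffers, next_victim):
    """Return (victim index, new next-victim); assumes some buffer is evictable."""
    i = next_victim
    while True:
        buf = buffers[i]
        if buf.pins == 0:
            if buf.usage == 0:
                return i, (i + 1) % len(buffers)  # evict this buffer
            buf.usage -= 1                        # give it one more lap
        i = (i + 1) % len(buffers)                # pinned buffers are skipped

bufs = [Buffer(usage=2), Buffer(usage=0, pins=1), Buffer(usage=0), Buffer(usage=4)]
victim, nxt = clock_sweep(bufs, 0)
print(victim)   # 2: the first unpinned buffer whose usage count is zero
```

Buffer 1 has a zero count but is pinned, so the pointer passes it by; buffer 0 merely loses one unit of its usage count.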

Once a buffer is found, the following happens to it.

The buffer is pinned to show other processes that it is in use. Besides pinning, other locking mechanisms are also used, but we will discuss them separately.

If the buffer is dirty, that is, contains changed data, the page cannot simply be discarded - it must first be written to disk. This is not a good situation, since the process that is about to read a page has to wait while someone else's data is written, but the effect is smoothed out by the checkpoint and background writer processes, which will be discussed later.

Then the new page is read from disk into the chosen buffer. The usage count is set to one. In addition, a reference to the loaded page must be registered in the hash table so that it can be found later.

The "next victim" pointer now points to the following buffer, and the newly loaded one has time to accumulate a usage count before the pointer goes around the entire buffer cache and comes back to it.

With our own eyes


As is customary in PostgreSQL, there is an extension that lets us look inside the buffer cache.

=> CREATE EXTENSION pg_buffercache; 

Let's create a table and insert one row into it.

 => CREATE TABLE cacheme(
      id integer
    ) WITH (autovacuum_enabled = off);
 => INSERT INTO cacheme VALUES (1);

What will be in the buffer cache? At the very least, the page to which the row was added must appear there. Let's check with the following query, which selects only the buffers related to our table (by the relfilenode file number) and decodes the fork number (relforknumber):

 => SELECT bufferid,
      CASE relforknumber
        WHEN 0 THEN 'main'
        WHEN 1 THEN 'fsm'
        WHEN 2 THEN 'vm'
      END relfork,
      relblocknumber, isdirty, usagecount, pinning_backends
    FROM pg_buffercache
    WHERE relfilenode = pg_relation_filenode('cacheme'::regclass);

  bufferid | relfork | relblocknumber | isdirty | usagecount | pinning_backends
 ----------+---------+----------------+---------+------------+------------------
     15735 | main    |              0 | t       |          1 |                0
 (1 row)

And so it is: there is one page in the buffer. It is dirty (isdirty), its usage count equals one (usagecount), and it is not pinned by any process (pinning_backends).

Now let's add another row and repeat the query. To save typing, we insert the row in another session and repeat the long query with the \g command.

 | => INSERT INTO cacheme VALUES (2); 

 => \g 
  bufferid | relfork | relblocknumber | isdirty | usagecount | pinning_backends
 ----------+---------+----------------+---------+------------+------------------
     15735 | main    |              0 | t       |          2 |                0
 (1 row)

No new buffers appeared - the second row fit on the same page. Note that the usage count has increased.

 | => SELECT * FROM cacheme; 
 |  id
 | ----
 |   1
 |   2
 | (2 rows)

 => \g 
  bufferid | relfork | relblocknumber | isdirty | usagecount | pinning_backends
 ----------+---------+----------------+---------+------------+------------------
     15735 | main    |              0 | t       |          3 |                0
 (1 row)

The counter also increases after the page is accessed for reading.

And what if we vacuum the table?

 | => VACUUM cacheme; 

 => \g 
  bufferid | relfork | relblocknumber | isdirty | usagecount | pinning_backends
 ----------+---------+----------------+---------+------------+------------------
     15731 | fsm     |              1 | t       |          1 |                0
     15732 | fsm     |              0 | t       |          1 |                0
     15733 | fsm     |              2 | t       |          2 |                0
     15734 | vm      |              0 | t       |          2 |                0
     15735 | main    |              0 | t       |          3 |                0
 (5 rows)

Vacuum created a visibility map (one page) and a free space map (three pages, the minimum size of this map).

Well, and so on.

Setting the size


The cache size is set by the shared_buffers parameter. The default is a ridiculous 128 MB. This is one of the parameters worth increasing right after installing PostgreSQL.

 => SELECT setting, unit FROM pg_settings WHERE name = 'shared_buffers'; 
  setting | unit
 ---------+------
  16384   | 8kB
 (1 row)

Keep in mind that changing this parameter requires a server restart, since all the memory for the cache is allocated at server start.

What considerations should guide the choice of a suitable value?

Even the largest database has a limited set of "hot" data that is being actively worked with at any given moment. Ideally, this set should fit in the buffer cache (plus some room for "one-time" data). If the cache is smaller, actively used pages will constantly evict one another, creating excessive I/O. But mindlessly increasing the cache is wrong too: with a very large cache the overhead of maintaining it grows, and besides, RAM is needed for other things.

Thus the optimal buffer cache size differs from system to system: it depends on the data, the application, and the load. Unfortunately, there is no magic value that suits everyone equally well.

The standard recommendation is to take 1/4 of RAM as a first approximation (for Windows before PostgreSQL 10, a smaller size was recommended).
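As a sanity check of the arithmetic (purely illustrative; the 16 GB figure is an assumed machine size, not a recommendation):

```python
# The "1/4 of RAM" rule of thumb, expressed in the 8 KB pages that
# pg_settings reports for shared_buffers.
ram_gb = 16                                   # assumed amount of RAM
shared_buffers_bytes = ram_gb * 1024**3 // 4  # a quarter of RAM
pages = shared_buffers_bytes // 8192          # 8 KB pages
print(shared_buffers_bytes // 1024**2, "MB =", pages, "pages")
# 4096 MB = 524288 pages
```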

Then look at the actual situation. The best approach is an experiment: increase or decrease the cache size and compare system performance. Of course, this requires a test bench and the ability to reproduce a typical workload - in a production environment such experiments are a dubious pleasure.

Be sure to check out Nikolay Samokhvalov's talk at PgConf-2019: "An Industrial Approach to Tuning PostgreSQL: Experiments on Databases".

But some information about what is going on can be gleaned directly on a live system using the same pg_buffercache extension - the main thing is to look at it from the right angle.

For example, you can study the distribution of buffers according to the degree of their use:

 => SELECT usagecount, count(*)
    FROM pg_buffercache
    GROUP BY usagecount
    ORDER BY usagecount;

  usagecount | count
 ------------+-------
           1 |   221
           2 |   869
           3 |    29
           4 |    12
           5 |   564
             | 14689
 (6 rows)

In this case the many rows with an empty usage count are free buffers. Hardly surprising for a system in which nothing is happening.

You can see what fraction of each table in our database is cached and how actively that data is used (here "actively used" means buffers with a usage count greater than 3):

 => SELECT c.relname,
      count(*) blocks,
      round( 100.0 * 8192 * count(*) / pg_table_size(c.oid) ) "% of rel",
      round( 100.0 * 8192 * count(*) FILTER (WHERE b.usagecount > 3) / pg_table_size(c.oid) ) "% hot"
    FROM pg_buffercache b
      JOIN pg_class c ON pg_relation_filenode(c.oid) = b.relfilenode
    WHERE b.reldatabase IN (
            0, (SELECT oid FROM pg_database WHERE datname = current_database())
          )
      AND b.usagecount IS NOT NULL
    GROUP BY c.relname, c.oid
    ORDER BY 2 DESC
    LIMIT 10;

           relname           | blocks | % of rel | % hot
 ----------------------------+--------+----------+-------
  vac                        |    833 |      100 |     0
  pg_proc                    |     71 |       85 |    37
  pg_depend                  |     57 |       98 |    19
  pg_attribute               |     55 |      100 |    64
  vac_s                      |     32 |        4 |     0
  pg_statistic               |     27 |       71 |    63
  autovac                    |     22 |      100 |    95
  pg_depend_reference_index  |     19 |       48 |    35
  pg_rewrite                 |     17 |       23 |     8
  pg_class                   |     16 |      100 |   100
 (10 rows)

Here, for example, you can see that the vac table occupies the most space (we used this table in one of the previous topics), but nobody has accessed it for a long time, and it has not been evicted only because the free buffers have not yet run out.

You can come up with other slices of this data that provide food for thought. Just bear in mind that such queries:


One more point. Do not forget that PostgreSQL works with files through ordinary operating system calls, so double caching occurs: pages end up both in the database buffer cache and in the OS cache. Thus, a buffer cache "miss" does not always lead to real I/O. But the OS eviction strategy differs from the DBMS's: the operating system knows nothing about the meaning of the data it has read.

Bulk eviction


Operations that read or write data in bulk run the risk of quickly flushing useful pages out of the buffer cache with "one-time" data.

To prevent this, so-called buffer rings are used for such operations: each operation is allocated a small part of the buffer cache, and eviction works only within its ring, so the rest of the buffer cache data does not suffer.

For a sequential scan of a large table (one whose size exceeds a quarter of the buffer cache), 32 pages are allocated. If another process also needs the table data while it is being scanned, it does not start reading the table from the beginning but joins the existing buffer ring; after finishing its scan, it goes back for the "skipped" beginning of the table.
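The idea can be sketched as follows (an illustrative Python model, not PostgreSQL code; `scan_with_ring` is a made-up helper): the scan cycles through the same 32 slots, so no matter how large the table, it touches only 32 buffers.

```python
# A toy model of a buffer ring: a sequential scan of a big table reuses
# the same small set of buffer slots round-robin instead of claiming a
# new buffer for every page it reads.

RING_SIZE = 32

def scan_with_ring(n_pages, ring_size=RING_SIZE):
    """Scan n_pages pages; return the ring's contents after the scan."""
    ring = [None] * ring_size
    for page in range(n_pages):
        slot = page % ring_size   # reuse a slot, evicting the page it held
        ring[slot] = page
    return ring

ring = scan_with_ring(4097)   # a 4097-page table
print(len(ring))              # 32: the scan never occupies more buffers
print(ring[0])                # 4096: slot 0 last held page 4096
```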

Let's check. To do this, we create a table so that one row occupies a whole page - that makes counting easier. The default buffer cache size is 128 MB = 16384 pages of 8 KB each. So we need to insert more than 4096 row-pages into the table.

 => CREATE TABLE big(
      id integer PRIMARY KEY GENERATED ALWAYS AS IDENTITY,
      s char(1000)
    ) WITH (fillfactor=10);
 => INSERT INTO big(s) SELECT 'FOO' FROM generate_series(1,4096+1);

Let's analyze the table.

 => ANALYZE big;
 => SELECT relpages FROM pg_class WHERE oid = 'big'::regclass;

  relpages
 ----------
      4097
 (1 row)

Now we have to restart the server to clear the cache of the table data read by ANALYZE.

 student$ sudo pg_ctlcluster 11 main restart 

After the restart, let's read the whole table:

 => EXPLAIN (ANALYZE, COSTS OFF) SELECT count(*) FROM big; 
                              QUERY PLAN
 ---------------------------------------------------------------------
  Aggregate (actual time=14.472..14.473 rows=1 loops=1)
    ->  Seq Scan on big (actual time=0.031..13.022 rows=4097 loops=1)
  Planning Time: 0.528 ms
  Execution Time: 14.590 ms
 (4 rows)

And let's make sure that the table pages occupy only 32 buffers in the cache:

 => SELECT count(*) FROM pg_buffercache WHERE relfilenode = pg_relation_filenode('big'::regclass); 
  count
 -------
     32
 (1 row)

If we prohibit sequential scans, the table will be read by index:

 => SET enable_seqscan = off;
 => EXPLAIN (ANALYZE, COSTS OFF) SELECT count(*) FROM big;

                                         QUERY PLAN
 -------------------------------------------------------------------------------------------
  Aggregate (actual time=50.300..50.301 rows=1 loops=1)
    ->  Index Only Scan using big_pkey on big (actual time=0.098..48.547 rows=4097 loops=1)
          Heap Fetches: 4097
  Planning Time: 0.067 ms
  Execution Time: 50.340 ms
 (5 rows)

In this case no buffer ring is used, and the entire table ends up in the buffer cache (along with almost the entire index):

 => SELECT count(*) FROM pg_buffercache WHERE relfilenode = pg_relation_filenode('big'::regclass); 
  count
 -------
   4097
 (1 row)

Buffer rings are used in a similar way by vacuum (also 32 pages) and by the bulk write operations COPY IN and CREATE TABLE AS SELECT (usually 2048 pages, but no more than 1/8 of the whole buffer cache).

Temporary tables


Temporary tables are an exception to the general rule. Since temporary data is visible to only one process, it has no business being in the shared buffer cache. Moreover, temporary data exists only within a single session, so it does not need to be protected against failures.

Temporary data uses a cache in the local memory of the process that owns the table. Since such data is available to only one process, it does not need to be protected with locks. The local cache uses the usual eviction algorithm.

Unlike the shared buffer cache, memory for the local cache is allocated as needed, since temporary tables are far from being used in all sessions. The maximum amount of memory for the temporary tables of a single session is limited by the temp_buffers parameter.

Warming up the cache


After a server restart, some time has to pass for the cache to "warm up", that is, to fill with actively used, live data. Sometimes it can be useful to read the data of certain tables into the cache right away, and a dedicated extension serves this purpose:

 => CREATE EXTENSION pg_prewarm; 

Previously, the extension could only read certain tables into the buffer cache (or into the OS cache only). But PostgreSQL 11 taught it to save the current cache state to disk and restore it after a server restart. To use this, the library has to be added to shared_preload_libraries and the server restarted.

 => ALTER SYSTEM SET shared_preload_libraries = 'pg_prewarm'; 

 student$ sudo pg_ctlcluster 11 main restart 

After the restart, if the pg_prewarm.autoprewarm parameter has not been changed, the autoprewarm master background process starts automatically and, once every pg_prewarm.autoprewarm_interval, dumps the list of cached pages to disk (don't forget to account for this extra process when setting max_worker_processes).

 => SELECT name, setting, unit FROM pg_settings WHERE name LIKE 'pg_prewarm%'; 
               name               | setting | unit
 ---------------------------------+---------+------
  pg_prewarm.autoprewarm          | on      |
  pg_prewarm.autoprewarm_interval | 300     | s
 (2 rows)

 postgres$ ps -o pid,command --ppid `head -n 1 /var/lib/postgresql/11/main/postmaster.pid` | grep prewarm 
 10436 postgres: 11/main: autoprewarm master 

Right now the big table is not in the cache:

 => SELECT count(*) FROM pg_buffercache WHERE relfilenode = pg_relation_filenode('big'::regclass); 
  count
 -------
      0
 (1 row)

Assuming that all of its contents are critically important, we can read it into the buffer cache by calling the following function:

 => SELECT pg_prewarm('big'); 
  pg_prewarm
 ------------
        4097
 (1 row)

 => SELECT count(*) FROM pg_buffercache WHERE relfilenode = pg_relation_filenode('big'::regclass); 
  count
 -------
   4097
 (1 row)

The list of pages is dumped to the autoprewarm.blocks file. To see it, we could just wait until the autoprewarm master process fires for the first time, but instead we trigger the dump manually:

 => SELECT autoprewarm_dump_now(); 
  autoprewarm_dump_now
 ----------------------
                  4340
 (1 row)

The number of dumped pages is greater than 4097 - the pages of system catalog objects already read by the server are counted in as well. And here is the file:

 postgres$ ls -l /var/lib/postgresql/11/main/autoprewarm.blocks 
 -rw------- 1 postgres postgres 102078  29 15:51 /var/lib/postgresql/11/main/autoprewarm.blocks 

Now restart the server again.

 student$ sudo pg_ctlcluster 11 main restart 

And right after the start, our table is in the cache again.

 => SELECT count(*) FROM pg_buffercache WHERE relfilenode = pg_relation_filenode('big'::regclass); 
  count
 -------
   4097
 (1 row)

This is exactly what the autoprewarm master process does: it reads the file, groups the pages by database, sorts them (so that reading from disk is as sequential as possible), and hands them over to a separate autoprewarm worker background process for processing.

To be continued.

Source: https://habr.com/ru/post/458186/

