Last time we looked at the structure of one of the important shared memory objects, the buffer cache. The possibility of losing information from RAM is the main reason we need failure recovery mechanisms. Today we will talk about these mechanisms.
The log
Alas, there are no miracles: to survive the loss of information in RAM, everything needed must be written to disk (or another non-volatile device) in a timely manner.
And this is exactly what is done. Along with changing the data, a log of those changes is kept as well. When we change something on a page in the buffer cache, we create a log entry for that change. The entry contains the minimum information sufficient to repeat the change if it becomes necessary.
For this to work, the log entry must get to disk before the modified page does. Hence the name: write-ahead log (WAL).
If a failure occurs, the data on disk ends up in an inconsistent state: some pages were written earlier, some later. But the log survives, and it can be read to replay those operations that had already been performed before the failure but whose result never reached the disk.
Why not simply force the data pages themselves to disk instead of doing the work twice? It turns out that logging is more efficient.
First, the log is a sequential stream of writes, and even HDDs handle sequential writes quite well. Writing the data pages themselves, however, is random I/O, because the pages are scattered across the disk more or less randomly.
Second, a log entry can be much smaller than a page.
Third, when writing the log, there is no need to make sure that the data on disk is consistent at every arbitrary moment in time (a requirement that would make life much harder).
And fourth, as we will see later, the log (once it exists) can be used not only for recovery, but also for backup and replication.
All operations that could lead to inconsistency on disk in the event of a failure must be logged. In particular, the following actions are recorded in the log:
- changes to pages in the buffer cache (as a rule, pages of tables and indexes) - because a changed page does not get to disk immediately;
- transaction commits and aborts - because the status change happens in the XACT buffers and also does not get to disk immediately;
- file operations (creation and deletion of files and directories, for example, creating files when a table is created) - because these operations must happen in step with the data changes.
The log does not record:
- operations on unlogged tables - their name speaks for itself;
- operations on temporary tables - there is no point in logging them, since the lifetime of such tables does not exceed the lifetime of the session that created them.
Before PostgreSQL 10, hash indexes were not logged either (they mostly served only to map hash functions to different data types), but this has since been fixed.
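As a quick illustration (not from the original walkthrough, and using the pg_current_wal_insert_lsn function described below), we can compare how much log the same insert generates for an ordinary table and for an unlogged one; the table names t_logged and t_unlogged are made up:
=> CREATE TABLE t_logged(n integer);
=> CREATE UNLOGGED TABLE t_unlogged(n integer);
=> SELECT pg_current_wal_insert_lsn() AS lsn0 \gset
=> INSERT INTO t_logged SELECT generate_series(1,1000);
=> SELECT pg_current_wal_insert_lsn() - :'lsn0'::pg_lsn AS logged_bytes;
=> SELECT pg_current_wal_insert_lsn() AS lsn1 \gset
=> INSERT INTO t_unlogged SELECT generate_series(1,1000);
=> SELECT pg_current_wal_insert_lsn() - :'lsn1'::pg_lsn AS unlogged_bytes;
The first number should be on the order of kilobytes, while the second should be close to zero, since the insert into the unlogged table is not logged.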
Logical structure

Logically, the log can be thought of as a sequence of entries of various lengths. Each entry contains data about a certain operation, preceded by a standard header. The header includes, among other things:
- the number of the transaction to which the entry relates;
- the resource manager - the system component responsible for the entry;
- a checksum (CRC) - used to detect data corruption;
- the length of the entry and a link to the previous entry.
The data itself can have different formats and meanings. For example, it can be a fragment of a page that must be written over its content at a certain offset. The specified resource manager “understands” how to interpret the data in its entry. There are separate managers for tables, for each type of index, for transaction statuses, and so on. The complete list of them can be obtained with the command
pg_waldump -r list
Physical structure
The log is stored on disk as files in the $PGDATA/pg_wal directory. By default, each file is 16 MB. The size can be increased to avoid having a large number of files in one directory. Before PostgreSQL 11 this could only be done when compiling the source code, but now the size can be specified at cluster initialization (the --wal-segsize option).
Log entries go into the currently used file; when it fills up, the next file starts being used.
Special buffers for the log are allocated in the server's shared memory. The size of this log cache is set by the wal_buffers parameter (the default value implies automatic tuning: 1/32 of the buffer cache is allocated).
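If you want to check what a particular server uses, both values can simply be looked at (the output depends on how the cluster was initialized and configured, so it is not shown here):
=> SHOW wal_segment_size;
=> SHOW wal_buffers;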
The log cache is organized much like the buffer cache, but it works mainly in ring-buffer mode: entries are added at the “head” and written to disk from the “tail”.
The write (“tail”) and insert (“head”) positions are returned by the pg_current_wal_lsn and pg_current_wal_insert_lsn functions, respectively:
=> SELECT pg_current_wal_lsn(), pg_current_wal_insert_lsn();
 pg_current_wal_lsn | pg_current_wal_insert_lsn
--------------------+---------------------------
 0/331E4E64         | 0/331E4EA0
(1 row)
To refer to a specific entry, the pg_lsn data type is used (LSN = log sequence number): a 64-bit number representing the byte offset of the entry from the beginning of the log. An LSN is displayed as two 32-bit hexadecimal numbers separated by a slash.
You can find out which file a given position falls into, and at what offset from the beginning of that file:
=> SELECT file_name, upper(to_hex(file_offset)) file_offset FROM pg_walfile_name_offset('0/331E4E64');
        file_name         | file_offset
--------------------------+-------------
 000000010000000000000033 | 1E4E64
                      \ /    \    /
                       0/33  1E4E64
The file name consists of two parts. The upper 8 hexadecimal digits are the timeline number (it is used when restoring from a backup); the rest corresponds to the upper digits of the LSN (the remaining lower digits of the LSN give the offset).
Log files can be viewed in the file system in the $PGDATA/pg_wal/ directory, but since PostgreSQL 10 they can also be listed with a special function:
=> SELECT * FROM pg_ls_waldir() WHERE name = '000000010000000000000033';
           name           |   size   |      modification
--------------------------+----------+------------------------
 000000010000000000000033 | 16777216 | 2019-07-08 20:24:13+03
(1 row)
Write-ahead logging
Let's see how logging happens and how write-ahead is enforced. Create a table:
=> CREATE TABLE wal(id integer);
=> INSERT INTO wal VALUES (1);
We will look at the header of the table page. For that we need an already familiar extension:
=> CREATE EXTENSION pageinspect;
Start a transaction and note the insert position in the log:
=> BEGIN;
=> SELECT pg_current_wal_insert_lsn();
 pg_current_wal_insert_lsn
---------------------------
 0/331F377C
(1 row)
Now let's perform some operation, for example, update the row:
=> UPDATE wal set id = id + 1;
This change was also logged, and the insert position has moved:
=> SELECT pg_current_wal_insert_lsn();
 pg_current_wal_insert_lsn
---------------------------
 0/331F37C4
(1 row)
To guarantee that a modified data page is not evicted to disk before the log entry, the page header stores the LSN of the last log entry related to this page:
=> SELECT lsn FROM page_header(get_raw_page('wal',0));
    lsn
------------
 0/331F37C4
(1 row)
Keep in mind that the log is shared by the whole cluster and new entries get into it all the time. Therefore the LSN on the page can be less than the value pg_current_wal_insert_lsn returned a moment ago. But since nothing else is happening in our system, the numbers match.
Now complete the transaction.
=> COMMIT;
The commit record also goes to the log, and the position changes once more:
=> SELECT pg_current_wal_insert_lsn();
 pg_current_wal_insert_lsn
---------------------------
 0/331F37E8
(1 row)
A commit changes the status of the transaction in a structure called XACT (we have already talked about it). The statuses are stored in files, but a cache is used for them as well; it occupies 128 pages in shared memory. So the LSN of the last log entry is tracked for XACT pages too, but this information is kept not in the page itself, but in RAM.
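As a side note (a sketch, not part of the original walkthrough), the recorded status can be looked up from SQL by saving the transaction ID into a psql variable:
=> BEGIN;
=> SELECT txid_current() AS xid \gset
=> COMMIT;
=> SELECT txid_status(:xid);
For a committed transaction, txid_status returns 'committed'.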
At some point the created log entries will be written to disk. Exactly when - we will discuss another time; in our case it has already happened:
=> SELECT pg_current_wal_lsn(), pg_current_wal_insert_lsn();
 pg_current_wal_lsn | pg_current_wal_insert_lsn
--------------------+---------------------------
 0/331F37E8         | 0/331F37E8
(1 row)
From this point on, the data and XACT pages may be evicted from the cache. But if they had to be evicted earlier, this would be detected and the log entries would be forced to disk first.
Knowing two LSN positions, you can compute the size of the log entries between them (in bytes) simply by subtracting one position from the other. You just need to cast the positions to the pg_lsn type:
=> SELECT '0/331F37E8'::pg_lsn - '0/331F377C'::pg_lsn;
 ?column?
----------
      108
(1 row)
In this case, updating the row and committing took 108 bytes in the log.
In the same way you can estimate how much log the server generates per unit of time under a given load. This is important information that will be needed for tuning (which we will talk about next time).
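For example, a rough sketch of such a measurement (assuming psql and its \gset command; the one-minute interval and the start_lsn variable name are arbitrary):
=> SELECT pg_current_wal_insert_lsn() AS start_lsn \gset
=> SELECT pg_sleep(60);
=> SELECT pg_current_wal_insert_lsn() - :'start_lsn'::pg_lsn AS bytes_per_minute;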
Now let's use the pg_waldump utility to look at the log entries we created.
The utility can work with an LSN range (as in this example) or select entries for a specified transaction. It should be run as the postgres OS user, since it needs access to the log files on disk.
postgres$ /usr/lib/postgresql/11/bin/pg_waldump -p /var/lib/postgresql/11/main/pg_wal -s 0/331F377C -e 0/331F37E8 000000010000000000000033
rmgr: Heap len (rec/tot): 69/ 69, tx: 101085, lsn: 0/331F377C, prev 0/331F3014, desc: HOT_UPDATE off 1 xmax 101085 ; new off 2 xmax 0, blkref #0: rel 1663/16386/33081 blk 0
rmgr: Transaction len (rec/tot): 34/ 34, tx: 101085, lsn: 0/331F37C4, prev 0/331F377C, desc: COMMIT 2019-07-08 20:24:13.945435 MSK
Here we see the headers of two entries.
The first is a HOT_UPDATE operation belonging to the Heap resource manager. The file name and page number are given in the blkref field and match the updated table page:
=> SELECT pg_relation_filepath('wal');
 pg_relation_filepath
----------------------
 base/16386/33081
(1 row)
The second entry is a COMMIT related to the Transaction resource manager.
Not the most readable format, but you can figure it out if necessary.
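For instance, a call along the following lines (a sketch; the transaction ID is taken from the output above) should show only the entries belonging to that transaction:
postgres$ /usr/lib/postgresql/11/bin/pg_waldump -p /var/lib/postgresql/11/main/pg_wal -x 101085 000000010000000000000033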
Recovery
When the server is started, the postmaster process is launched first, and it in turn launches the startup process, whose job is to ensure recovery if a failure has occurred.
To determine whether recovery is needed, the startup process reads the special control file $PGDATA/global/pg_control and looks at the cluster state. We can check the state ourselves using the pg_controldata utility:
postgres$ /usr/lib/postgresql/11/bin/pg_controldata -D /var/lib/postgresql/11/main | grep state
Database cluster state: in production
A cleanly stopped server will have the state “shut down”. If the server is not running but the state still says “in production”, it means the DBMS crashed, and recovery will then be performed automatically.
To recover, the startup process reads the log sequentially and applies the entries to the pages when necessary. Whether it is necessary can be checked by comparing the LSN of the page on disk with the LSN of the log entry. If the LSN of the page is greater, the entry must not be applied; in fact it cannot be, because the entries are designed to be applied strictly in sequence.
There are exceptions, though. Some entries are full page images (FPI), and such an image can obviously be applied to a page in any state: it overwrites everything that was there anyway. Also, a change of transaction status can be applied to any version of an XACT page, so there is no need to store LSNs inside such pages.
Pages are changed during recovery in the buffer cache, just as during normal operation - for this the postmaster launches the necessary background processes.
Log entries are applied to files in a similar way: for example, if an entry says that a file must exist but it does not, the file is created.
And at the very end of the recovery process, all unlogged tables are overwritten with the “blanks” from their init forks.
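To convince yourself that such a “blank” exists, you can check (with superuser rights) that an additional init file appears next to the table's main file; the table name u_test here is made up:
=> CREATE UNLOGGED TABLE u_test(n integer);
=> SELECT size FROM pg_stat_file(pg_relation_filepath('u_test') || '_init');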
This is a very simplified outline of the algorithm. In particular, we have not yet said anything about where to start reading the log entries from (that conversation will have to wait until we look at checkpoints).
One last clarification. “Classically,” the recovery process consists of two phases. In the first phase (roll forward), log entries are replayed and the server redoes all the work lost in the failure. In the second phase (roll back), transactions that were not committed at the moment of the failure are rolled back. But PostgreSQL does not need the second phase. As we saw earlier, thanks to the multiversion implementation, there is no need to physically roll back changes; it is enough that the commit bit is not set in XACT.
To be continued.