Last time we got acquainted with the structure of the buffer cache, one of the main objects in shared memory, and concluded that to recover after a failure, when the contents of RAM are lost, we need to keep a write-ahead log (WAL).

The problem we left unresolved is that we do not know where to start replaying log records during recovery. Starting from the beginning, as the King advised Alice, will not work: it is impossible to keep all log records from the moment the server was started - this is potentially a huge volume and an equally huge recovery time. We need a point that gradually moves forward and from which recovery can start (and, accordingly, before which all log records can safely be deleted). This is the checkpoint, which we will discuss today.
Checkpoint
What property should a checkpoint have? We must be sure that all log records starting from the checkpoint will be applied to pages that have already been written to disk. Otherwise, during recovery, we could read a too old version of a page from disk, apply a log record to it, and thereby irreversibly corrupt the data.
How do we get a checkpoint? The simplest option is to periodically suspend the whole system and flush all dirty pages of the buffer cache (and of the other caches) to disk. (Note that the pages are only written, not evicted from the cache.) Such points would satisfy the condition, but, of course, nobody wants to work with a system that constantly freezes for an indefinite, yet quite noticeable, time.
Therefore, in practice things are somewhat more complicated: a checkpoint turns from a point into an interval. First we start the checkpoint. After that, without interrupting normal work and, if possible, without creating load spikes, we slowly flush dirty buffers to disk.

When all buffers that were dirty at the moment the checkpoint started have been written, the checkpoint is considered complete. Now (but not earlier) the moment it started can be used as the point from which recovery begins. And the log records up to this point are no longer needed.

The checkpoint is performed by a special background process called checkpointer.

The duration of writing dirty buffers is determined by the checkpoint_completion_target parameter. It specifies the fraction of the time between two neighboring checkpoints during which the writing takes place. The default value is 0.5 (as in the figures above), that is, writing takes half the time between checkpoints. The value is usually increased, up to 1.0, for greater uniformity.
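For example, if checkpoints occur, say, every 5 minutes, then with the value 0.5 the writes are spread over roughly the first 2.5 minutes of each interval, while a value of 0.9 would spread them over about 4.5 minutes, making the I/O load more even (the numbers here are purely illustrative).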
Let us take a closer look at what happens when a checkpoint is executed.
First, the checkpointer flushes the transaction status (XACT) buffers to disk. Since there are few of them (only 128), they are written right away.
Then the main work begins: writing dirty pages from the buffer cache. As already mentioned, all the pages cannot be flushed at once, since the buffer cache can be large. Therefore, all the pages that are dirty at that moment are first marked in the buffer headers with a special flag.

Then the checkpointer gradually walks through all the buffers and writes the marked ones to disk. Recall that the pages are not evicted from the cache but only written out, so we need not pay attention to the usage count of a buffer or to whether it is pinned.
Marked buffers can also be written by backend processes, depending on who gets to a buffer first. In either case, the flag that was set is cleared upon writing, so (for checkpoint purposes) a buffer is written only once.
Naturally, pages continue to change in the buffer cache while the checkpoint is in progress. But new dirty buffers are not flagged, and the checkpointer does not have to write them.

At the end of its work, the process creates a log record marking the completion of the checkpoint. This record contains the LSN of the moment the checkpoint started. Since the checkpoint writes nothing to the log when it starts, a log record of any kind can be located at that LSN.
In addition, the indication of the latest completed checkpoint is updated in the $PGDATA/global/pg_control file. Until the checkpoint finishes, pg_control points to the previous checkpoint.

To watch the checkpoint at work, let's create a table; its pages will get into the buffer cache and will be dirty:
=> CREATE TABLE chkpt AS SELECT * FROM generate_series(1,10000) AS g(n);
=> CREATE EXTENSION pg_buffercache;
=> SELECT count(*) FROM pg_buffercache WHERE isdirty;
 count
-------
    78
(1 row)
Remember the current position in the log:
=> SELECT pg_current_wal_insert_lsn();
 pg_current_wal_insert_lsn
---------------------------
 0/3514A048
(1 row)
Now we will perform a checkpoint manually and make sure that no dirty pages are left in the cache (as noted above, new dirty pages may appear, but in our case no changes occurred while the checkpoint was in progress):
=> CHECKPOINT;
=> SELECT count(*) FROM pg_buffercache WHERE isdirty;
 count
-------
     0
(1 row)
Let's see how the checkpoint is reflected in the log:
=> SELECT pg_current_wal_insert_lsn();
 pg_current_wal_insert_lsn
---------------------------
 0/3514A0E4
(1 row)
postgres$ /usr/lib/postgresql/11/bin/pg_waldump -p /var/lib/postgresql/11/main/pg_wal -s 0/3514A048 -e 0/3514A0E4
rmgr: Standby len (rec/tot): 50/ 50, tx: 0, lsn: 0/3514A048, prev 0/35149CEC, desc: RUNNING_XACTS nextXid 101105 latestCompletedXid 101104 oldestRunningXid 101105
rmgr: XLOG len (rec/tot): 102/ 102, tx: 0, lsn: 0/3514A07C, prev 0/3514A048, desc: CHECKPOINT_ONLINE redo 0/3514A048; tli 1; prev tli 1; fpw true; xid 0:101105; oid 74081; multi 1; offset 0; oldest xid 561 in DB 1; oldest multi 1 in DB 1; oldest/newest commit timestamp xid: 0/0; oldest running xid 101105; online
Here we see two records. The last of them is the checkpoint record (CHECKPOINT_ONLINE). The LSN of the checkpoint start is shown after the word redo, and this position corresponds to the log record that was the latest at the moment the checkpoint started.
We find the same information in the control file:
postgres$ /usr/lib/postgresql/11/bin/pg_controldata -D /var/lib/postgresql/11/main | egrep 'Latest.*location'
Latest checkpoint location:           0/3514A07C
Latest checkpoint's REDO location:    0/3514A048
Recovery
Now we are ready to clarify the recovery algorithm outlined in the last article.
If the server fails, then at the next start the startup process will detect this by looking into the pg_control file and finding a state other than "shut down". In this case automatic recovery is performed.
First, the startup process reads the position of the checkpoint start from that same pg_control. (For completeness, note that if the backup_label file exists, the checkpoint record is read from it instead; this is needed for restoring from backups, but it is a topic for a separate series.)
Then it reads the log starting from the position found, applying log records to pages one by one (when it is necessary, as we discussed last time).
Finally, all unlogged tables are overwritten with the images of their init forks.
This is where the startup process finishes its work, and the checkpointer process immediately performs a checkpoint in order to fix the recovered state on disk.
You can simulate a crash by forcibly stopping the server in immediate mode.
student$ sudo pg_ctlcluster 11 main stop -m immediate --skip-systemctl-redirect
(The --skip-systemctl-redirect option is needed here because PostgreSQL is installed on Ubuntu from a package. It is controlled by the pg_ctlcluster command, which calls systemctl, which in turn calls pg_ctl. With all these wrappers, the mode name gets lost on the way. The --skip-systemctl-redirect option lets us do without systemctl and keep this important information.)
Check the status of the cluster:
postgres$ /usr/lib/postgresql/11/bin/pg_controldata -D /var/lib/postgresql/11/main | grep state
Database cluster state: in production
At startup, PostgreSQL understands that a failure has occurred and recovery is required.
student$ sudo pg_ctlcluster 11 main start
postgres$ tail -n 7 /var/log/postgresql/postgresql-11-main.log
2019-07-17 15:27:49.441 MSK [8865] LOG:  database system was interrupted; last known up at 2019-07-17 15:27:48 MSK
2019-07-17 15:27:49.801 MSK [8865] LOG:  database system was not properly shut down; automatic recovery in progress
2019-07-17 15:27:49.804 MSK [8865] LOG:  redo starts at 0/3514A048
2019-07-17 15:27:49.804 MSK [8865] LOG:  invalid record length at 0/3514A0E4: wanted 24, got 0
2019-07-17 15:27:49.804 MSK [8865] LOG:  redo done at 0/3514A07C
2019-07-17 15:27:49.824 MSK [8864] LOG:  database system is ready to accept connections
2019-07-17 15:27:50.409 MSK [8872] [unknown]@[unknown] LOG:  incomplete startup packet
The need for recovery is noted in the message log: database system was not properly shut down; automatic recovery in progress. Then replay of log records starts from the position shown in "redo starts at" and continues as long as further log records can be obtained. Recovery finishes at the position shown in "redo done at", after which the DBMS starts working with clients (database system is ready to accept connections).
And what happens when the server is stopped normally? To flush dirty pages to disk, PostgreSQL disconnects all clients and then performs a final checkpoint.
Remember the current position in the log:
=> SELECT pg_current_wal_insert_lsn();
 pg_current_wal_insert_lsn
---------------------------
 0/3514A14C
(1 row)
Now let's stop the server normally:
student$ sudo pg_ctlcluster 11 main stop
Check the status of the cluster:
postgres$ /usr/lib/postgresql/11/bin/pg_controldata -D /var/lib/postgresql/11/main | grep state
Database cluster state: shut down
And in the log we will find a single record about the final checkpoint (CHECKPOINT_SHUTDOWN):
postgres$ /usr/lib/postgresql/11/bin/pg_waldump -p /var/lib/postgresql/11/main/pg_wal -s 0/3514A14C
rmgr: XLOG len (rec/tot): 102/ 102, tx: 0, lsn: 0/3514A14C, prev 0/3514A0E4, desc: CHECKPOINT_SHUTDOWN redo 0/3514A14C; tli 1; prev tli 1; fpw true; xid 0:101105; oid 74081; multi 1; offset 0; oldest xid 561 in DB 1; oldest multi 1 in DB 1; oldest/newest commit timestamp xid: 0/0; oldest running xid 0; shutdown
pg_waldump: FATAL: error in WAL record at 0/3514A14C: invalid record length at 0/3514A1B4: wanted 24, got 0
(The scary FATAL message only means that pg_waldump has read to the end of the log.)
Run the instance again.
student$ sudo pg_ctlcluster 11 main start
Background writing
As we have found out, the checkpointer is one of the processes that write dirty pages from the buffer cache to disk. But it is not the only one.
If a backend process needs to evict a page from a buffer and the page turns out to be dirty, the backend will have to write it to disk itself. This is not a good situation, since it leads to waits; it is much better when writing happens asynchronously, in the background.
Therefore, in addition to the checkpointer process, there is also a background writer process (background writer, bgwriter, or simply writer). This process uses the same buffer search algorithm as the eviction mechanism. There are essentially two differences.
- It uses its own pointer instead of the "next victim" pointer. The background writer's pointer may run ahead of the "victim" pointer but never lags behind it.
- While walking through the buffers, the usage count is not decreased.
The buffers that get written are those that simultaneously:
- contain changed data (are dirty),
- are not pinned (pin count = 0),
- have a zero usage count (usage count = 0).
Thus, the background writer runs ahead of eviction and finds the buffers that are likely to be evicted soon. Ideally, thanks to this, backend processes should find that the buffers they select can be reused without stopping to write them out.
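Roughly, the current candidates for background writing can be estimated with a pg_buffercache query of this kind (only a sketch: the real selection also depends on the position of the background writer's pointer):

=> SELECT count(*)
   FROM pg_buffercache
   WHERE isdirty AND usagecount = 0 AND pinning_backends = 0;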
Configuration
The checkpointer is usually configured from the following considerations.
First, we need to decide what volume of log files we can afford to keep (and what recovery time we are comfortable with). The more the better, but for obvious reasons this value will be limited.
Next, we can measure how long it takes to generate this volume under normal load. We have already seen how to do this (we need to note the log positions and subtract one from the other).
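For instance, if a position was noted earlier (the LSN below is taken from the examples above purely for illustration), the amount of WAL generated since then can be computed like this:

=> SELECT pg_size_pretty(
     pg_wal_lsn_diff(pg_current_wal_insert_lsn(), '0/3514A048')
   );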
This time will be our usual interval between checkpoints. We write it into the checkpoint_timeout parameter. The default value of 5 minutes is clearly too small; the time is usually increased, say, to half an hour. To repeat: the less often checkpoints can be afforded, the better, since this reduces overhead.
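Say, if half an hour suits us, the setting could look like this (the value is illustrative; substitute your own estimate):

=> ALTER SYSTEM SET checkpoint_timeout = '30min';
=> SELECT pg_reload_conf();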
However, it is possible (and even likely) that the load will sometimes be higher than usual, and too many log records will be generated within the time specified by the parameter. In that case we would like the checkpoint to be performed more often. To this end, the max_wal_size parameter specifies the volume allowed within one checkpoint cycle. If the actual volume turns out to be larger, the server initiates an unscheduled checkpoint.
Thus, most checkpoints happen on schedule, once every checkpoint_timeout units of time. But under increased load a checkpoint is triggered more often, when the volume reaches max_wal_size.
It is important to understand that the max_wal_size parameter does not at all define the maximum size that the log files can occupy on disk:
- To recover after a failure, we need to keep the files accumulated since the last completed checkpoint, plus the files accumulated while the current checkpoint is running. So the total size can be roughly estimated as (1 + checkpoint_completion_target) × max_wal_size (a sketch of this calculation follows the list).
- Before version 11, PostgreSQL also kept the files accumulated since the checkpoint before last, so up to version 10 you have to put 2 instead of 1 in the formula above.
- The max_wal_size parameter is only a wish, not a hard limit. The size may turn out to be larger.
- The server has no right to erase log files that have not yet been transferred through replication slots or have not yet been written to the archive by continuous archiving. If this functionality is used, constant monitoring is needed, because the server's disk can easily overflow.
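As a rough illustration of the first estimate in this list, the value can be computed directly from the current settings (just a sketch; as noted above, the actual size on disk can differ):

=> SELECT pg_size_pretty(
     ((1 + current_setting('checkpoint_completion_target')::float)
       * pg_size_bytes(current_setting('max_wal_size')))::bigint
   );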
For completeness, not only the maximum size can be set but also the minimum: the min_wal_size parameter. The point of this setting is that the server does not delete log files while they fit into min_wal_size, but simply renames and reuses them. This saves a little on constantly creating and deleting files.
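The current values of the parameters mentioned here can be checked, for example, via pg_settings (the list of names is simply the set discussed above):

=> SELECT name, setting, unit
   FROM pg_settings
   WHERE name IN ('checkpoint_timeout', 'checkpoint_completion_target',
                  'max_wal_size', 'min_wal_size');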
It makes sense to configure the background writer after the checkpointer has been set up. Together these processes must manage to write dirty buffers before backend processes need them.
The background writer works in rounds of at most bgwriter_lru_maxpages pages, sleeping between rounds for bgwriter_delay.
The number of pages to be written in one round is determined by the average number of buffers that backend processes have requested since the previous round (a moving average is used to smooth out unevenness between runs without depending on long history). The computed number of buffers is then multiplied by the bgwriter_lru_multiplier coefficient (but in any case it will not exceed bgwriter_lru_maxpages).
The default values are: bgwriter_delay = 200ms (most likely too much: a lot can happen in 1/5 of a second), bgwriter_lru_maxpages = 100, bgwriter_lru_multiplier = 2.0 (an attempt to respond to demand ahead of time).
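For example (the numbers are purely illustrative): if backends have requested on average 300 buffers since the previous round, the background writer will aim at 300 × 2.0 = 600 buffers, but with the default bgwriter_lru_maxpages = 100 it will write at most 100 of them in this round.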
If the process finds no dirty buffers at all (that is, nothing is happening in the system), it falls into a deep sleep, from which it is brought out by a server process requesting a buffer. After that, the process wakes up and works in the usual way again.
Monitoring
Checkpoints and background writing can and should be tuned based on feedback from monitoring.
The checkpoint_warning parameter outputs a warning if checkpoints caused by exceeding the log size are performed too often. Its default value is 30 seconds, and it should be brought in line with the checkpoint_timeout value.
The log_checkpoints parameter (disabled by default) makes it possible to get information about checkpoints being performed from the server message log. Let's turn it on.
=> ALTER SYSTEM SET log_checkpoints = on;
=> SELECT pg_reload_conf();
Now let's change something in the data and perform a checkpoint.
=> UPDATE chkpt SET n = n + 1;
=> CHECKPOINT;
In the message log we will see something like this:
postgres$ tail -n 2 /var/log/postgresql/postgresql-11-main.log
2019-07-17 15:27:55.248 MSK [8962] LOG:  checkpoint starting: immediate force wait
2019-07-17 15:27:55.274 MSK [8962] LOG:  checkpoint complete: wrote 79 buffers (0.5%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.001 s, sync=0.013 s, total=0.025 s; sync files=2, longest=0.011 s, average=0.006 s; distance=1645 kB, estimate=1645 kB
Here you can see how many buffers were written, how the set of log files changed after the checkpoint, how long the checkpoint took, and the distance between neighboring checkpoints.
But probably the most useful information is the statistics of the checkpointer and the background writer in the pg_stat_bgwriter view. The view is one for the two because both tasks were once performed by a single process; later their functions were split, but the view remained.
=> SELECT * FROM pg_stat_bgwriter \gx
-[ RECORD 1 ]---------+------------------------------
checkpoints_timed     | 0
checkpoints_req       | 1
checkpoint_write_time | 1
checkpoint_sync_time  | 13
buffers_checkpoint    | 79
buffers_clean         | 0
maxwritten_clean      | 0
buffers_backend       | 42
buffers_backend_fsync | 0
buffers_alloc         | 363
stats_reset           | 2019-07-17 15:27:49.826414+03
Here, among other things, we see the number of checkpoints performed:
- checkpoints_timed - on schedule (upon reaching checkpoint_timeout),
- checkpoints_req - on demand (including upon reaching max_wal_size).
A large value of checkpoints_req (compared to checkpoints_timed) indicates that checkpoints occur more often than expected.
Also important is the information about the number of pages written:
- buffers_checkpoint - by the checkpointer,
- buffers_backend - by backend processes,
- buffers_clean - by the background writer.
In a well-tuned system, the buffers_backend value should be substantially less than the sum of buffers_checkpoint and buffers_clean.
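This share is easy to check with a query along these lines (just a convenience calculation over the same view):

=> SELECT buffers_checkpoint, buffers_clean, buffers_backend,
     round(100.0 * buffers_backend /
           greatest(buffers_checkpoint + buffers_clean + buffers_backend, 1), 1)
       AS backend_write_pct
   FROM pg_stat_bgwriter;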
The maxwritten_clean value is also useful for tuning background writing: it shows how many times the background writer stopped a round because it exceeded bgwriter_lru_maxpages.
You can reset the accumulated statistics using the following call:
=> SELECT pg_stat_reset_shared('bgwriter');
To be continued.