While tuning postgresql.conf, you may have noticed that it contains a full_page_writes parameter. The comment next to it says something about partial page writes, and people generally leave it set to on - which is a good thing, as I will explain later in this post. Still, it is useful to understand what full_page_writes actually does, because its impact on performance can be quite significant.
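As a side note, here is a minimal sketch of how you might check and change the setting from SQL (the value shown is just the default; full_page_writes only needs a configuration reload, not a restart):

    -- check the current value
    SHOW full_page_writes;

    -- change it cluster-wide and reload the configuration
    ALTER SYSTEM SET full_page_writes = on;
    SELECT pg_reload_conf();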
Unlike my previous post about setting up checkpoints, this is not a guide on how to tune the server. There is not that much you can tweak, really, but I will show you how some application-level decisions (e.g. the choice of data types) can interact with full-page writes.
Partial Writes / Torn Pages
So what are full-page writes? As the comment in postgresql.conf says, they are a way to recover from partial page writes - PostgreSQL uses 8kB pages (by default), while other parts of the stack use different chunk sizes. Linux filesystems typically use 4kB pages (smaller pages are possible, but 4kB is the maximum on x86), and at the hardware level old drives use 512B sectors while new ones write data in larger chunks (usually 4kB, or even 8kB).
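If you want to verify the PostgreSQL side of this on your own system, a quick sketch (both values are fixed at compile time, so they are read-only):

    -- data page size, 8192 bytes by default
    SHOW block_size;

    -- WAL block size, also 8192 bytes by default
    SHOW wal_block_size;

The filesystem block size can be checked at the OS level (e.g. with stat -f on Linux).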
So when PostgreSQL writes an 8kB page, the other layers of the storage stack may break it into smaller chunks that are written separately. This is a write-atomicity problem - the 8kB PostgreSQL page may be split into two 4kB filesystem pages, and then into 512B sectors. Now, what happens if the server crashes (power failure, kernel bug, ...)?
Even if the server uses a storage system designed to cope with such failures (SSDs with capacitors, RAID controllers with batteries, ...), the kernel already splits the data into 4kB pages, so it is entirely possible that the database wrote an 8kB data page but only part of it made it to disk before the crash.
At this point you are probably thinking that this is exactly why we have the transaction log (WAL), and you are right! After starting the server, the database will read the WAL (since the last completed checkpoint) and apply the changes again to make sure the data files are complete. Simple.
But there is a catch - recovery does not apply the changes blindly; it often needs to read the data pages first, which assumes the page is not already broken in some way, for example by a partial write. Which seems slightly self-contradictory, because to fix data corruption we assume the data is not corrupted.
Full-page writes are a way around this conundrum - whenever a page is modified for the first time after a checkpoint, the whole page is written into the WAL. This guarantees that during recovery the first WAL record touching a page stores the complete page, so we never have to read the possibly-broken page from the data file.
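If you are on a reasonably recent release, here is a simple sketch of how to watch how many full-page images your server actually generates (the pg_stat_wal view only exists since PostgreSQL 14):

    -- cumulative WAL statistics since the last stats reset
    SELECT wal_records,   -- total number of WAL records
           wal_fpi,       -- number of full-page images written
           wal_bytes      -- total WAL generated, in bytes
      FROM pg_stat_wal;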
Write amplification
Of course, the downside is an increase in the amount of WAL - changing a single byte on an 8kB page logs the whole page into the WAL. The full-page write only happens on the first write after a checkpoint, so making checkpoints less frequent is one way to improve the situation - typically there is a short "burst" of full-page writes right after a checkpoint, followed by relatively few of them until the next checkpoint.
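A minimal sketch of stretching the checkpoint interval - the exact values are purely illustrative and depend on your recovery-time and disk-space budget:

    -- trigger timed checkpoints at most every 30 minutes
    ALTER SYSTEM SET checkpoint_timeout = '30min';
    -- allow enough WAL so that checkpoints are not forced by WAL volume instead
    ALTER SYSTEM SET max_wal_size = '16GB';
    -- spread the checkpoint I/O over most of the interval
    ALTER SYSTEM SET checkpoint_completion_target = 0.9;
    SELECT pg_reload_conf();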
UUID vs. BIGSERIAL keys
But there are some unexpected interactions with design decisions made at the application level. Let's assume we have a simple table with a primary key, either a UUID or a BIGSERIAL, and we insert data into it. Will there be a difference in the amount of WAL generated (assuming we insert the same number of rows)?
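Something along these lines, say - the table and column names are made up for illustration, and gen_random_uuid() is built in only since PostgreSQL 13 (older releases need the pgcrypto or uuid-ossp extension):

    -- sequential keys
    CREATE TABLE test_bigserial (
        id      bigserial PRIMARY KEY,
        payload text
    );

    -- random keys
    CREATE TABLE test_uuid (
        id      uuid PRIMARY KEY DEFAULT gen_random_uuid(),
        payload text
    );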
It seems reasonable to expect roughly the same amount of WAL in both cases, but the following charts clearly show there is a huge difference in practice:

These are the WAL sizes produced by an hour-long test, throttled to 5000 inserts per second. With a BIGSERIAL primary key this results in ~2GB of WAL, while the UUID produces more than 40GB. The difference is more than noticeable, and most of the WAL is associated with the index backing the primary key. Let's look at the types of WAL records:

Clearly, the vast majority of the records are full-page images (FPI), i.e. the result of full-page writes. But why is that happening?
Of course, it is due to the inherent randomness of UUIDs. New BIGSERIAL values are sequential, and so they get inserted into the same leaf pages of the btree index. As only the first modification of a page triggers a full-page write, only a tiny fraction of the WAL records are FPIs. With UUIDs it is a completely different story, of course - the values are not sequential at all, and each insert is likely to hit a brand new leaf page of the index (assuming the index is sufficiently large).
There is not much the database can do about it - the workload is simply random in nature, which triggers a large number of full-page writes.
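If you want to reproduce this kind of measurement yourself, one option is to note the WAL position before and after the test from SQL, and then break the generated WAL down by record type with pg_waldump --stats (on pre-10 releases the tool is called pg_xlogdump and the functions pg_current_xlog_location / pg_xlog_location_diff); the LSN below is of course just a placeholder:

    -- before the test: remember the current WAL position
    SELECT pg_current_wal_lsn();   -- e.g. 0/3000060 (placeholder)

    -- after the test: how much WAL was generated in between
    SELECT pg_size_pretty(
             pg_wal_lsn_diff(pg_current_wal_lsn(), '0/3000060'::pg_lsn)
           );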
Of course, it is not that hard to get similar write amplification even with BIGSERIAL keys. It just takes a different kind of workload - for example with UPDATEs, randomly updating rows with a uniform distribution, the chart looks like this:

Suddenly the difference between the data types is gone - access is random in both cases, resulting in almost exactly the same amount of WAL produced. Another difference is that most of the WAL is now associated with the "heap", i.e. the tables, and not with the indexes. The test cases were set up to make the HOT UPDATE optimization possible (i.e. updates that do not have to touch the indexes), which pretty much eliminates all the index-related WAL traffic.
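A rough sketch of what "making HOT updates possible" means in practice - keep some free space on the heap pages with a lower fillfactor, do not index the column being updated, and check the HOT counters afterwards (the names and the fillfactor value are illustrative):

    -- leave ~10% free space on each heap page so new row versions fit on the same page
    CREATE TABLE test_hot (
        id      bigserial PRIMARY KEY,
        counter integer NOT NULL DEFAULT 0   -- updated column, deliberately not indexed
    ) WITH (fillfactor = 90);

    -- after running the update workload, compare regular vs. HOT updates
    SELECT n_tup_upd, n_tup_hot_upd
      FROM pg_stat_user_tables
     WHERE relname = 'test_hot';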
But you may object that most applications do not update the whole data set. Usually only a small subset of the data is "active" - people only care about posts from the last few days on a forum, unresolved orders in an e-shop, and so on. How does that change the results?
Thankfully, pgbench supports non-uniform distributions, and with, for example, an exponential distribution touching a 1% subset of the data about 25% of the time, the charts look like this:

And after making the distribution even more skewed, touching 1% of the data about 75% of the time:

This again shows how big a difference the choice of data types can make, and also how important it is to tune for HOT updates.
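In case you want to experiment with skewed workloads yourself, here is a rough sketch of a custom pgbench script using an exponential distribution (random_exponential is available in pgbench since PostgreSQL 9.6; the table, range and skew parameter are illustrative and do not reproduce the exact 1%/25% split used above):

    \set aid random_exponential(1, 1000000, 10)
    UPDATE test_bigserial SET payload = md5(random()::text) WHERE id = :aid;

You would then run it with something like pgbench -n -f skewed-update.sql -T 3600 -R 5000, where -R throttles the rate to 5000 transactions per second.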
8kB and 4kB pages
Another interesting question is how much WAL traffic could be saved by using smaller pages in PostgreSQL (which requires compiling a custom package, e.g. by passing --with-blocksize=4 to configure). In the best case this could save up to 50% of the WAL, thanks to logging only 4kB pages instead of 8kB ones. For the workload with uniformly distributed updates it looks like this:

So the savings are not quite 50%, but the drop from ~140GB to ~90GB is still quite significant.
Do we even need full-page writes?
This may seem preposterous after explaining all the dangers of partial writes, but maybe disabling full-page writes could be a viable option, at least in some cases.
Firstly, I wonder whether Linux filesystems are still vulnerable to partial writes. The parameter was introduced in PostgreSQL 8.1, released back in 2005, so maybe the many filesystem improvements since then have made this a non-issue. Probably not universally, for arbitrary workloads, but maybe with some additional assumption (e.g. using 4kB pages in PostgreSQL) it would be sufficient? Also, PostgreSQL never overwrites just a part of an 8kB page - it always writes the full page.
I have done a lot of testing recently, trying to trigger a partial write, and I have not managed to cause a single one yet. Of course, that is no proof the problem does not exist. But even if it is still an issue, data checksums may be sufficient protection (they will not fix the problem, but they will at least let you know about a broken page).
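For completeness, a quick sketch of how to check whether data checksums are enabled on a cluster - they are selected at initdb time (e.g. with initdb --data-checksums) rather than in postgresql.conf:

    -- 'on' if the cluster was initialized with data checksums
    SHOW data_checksums;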
Secondly, many systems nowadays rely on streaming replication standbys - instead of waiting for the server to reboot after a hardware failure (which can take quite a while) and then spending even more time on recovery, the system simply fails over to a hot standby. If the database on the failed primary is removed (and then re-cloned from the new primary), partial writes are a non-issue.
But I am afraid that if we started recommending that approach, then "I don't know how the data got corrupted, I just set full_page_writes=off on the systems!" would become one of the most common sentences right before the death of a DBA (along with "I saw this snake on reddit, it's not poisonous").
Conclusion
There is not much you can do to tune full-page writes directly. For most workloads, the majority of full-page writes happen right after a checkpoint and then disappear until the next one. So it is important to tune checkpoints so that they do not happen too frequently.
Some application-level decisions can increase the randomness of writes to tables and indexes - for example, UUIDs are inherently random, turning even a plain INSERT workload into random index updates. The schema used in the examples was rather trivial - in practice there would be secondary indexes, foreign keys, and so on. But using BIGSERIAL as the primary key (and keeping the UUID as a secondary key) would at least reduce the write amplification.
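As a sketch of that last suggestion (the names are again made up): keep a sequential BIGSERIAL as the primary key that other tables reference, and store the externally visible UUID as a separate unique column:

    CREATE TABLE orders (
        id          bigserial PRIMARY KEY,                           -- internal, sequential key
        external_id uuid NOT NULL DEFAULT gen_random_uuid() UNIQUE,  -- public identifier
        created_at  timestamptz NOT NULL DEFAULT now()
    );

    -- other tables and foreign keys reference the compact, sequential key
    CREATE TABLE order_items (
        order_id bigint NOT NULL REFERENCES orders (id),
        item     text
    );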
I am really interested in a discussion about the need for full-page writes on current kernels and filesystems. Sadly, I have not found many resources on the topic, so if you have relevant information, let me know.