📜 ⬆️ ⬇️

MVCC-3. String Versions

So, we looked at issues related to isolation , and made a digression about organizing data at a low level . And finally got to the most interesting - to the string versions.

Headline


As we have said, each row can be simultaneously in the database in several versions. One version must be somehow distinguished from the other. To this end, each version has two marks defining the “time” of the action of this version (xmin and xmax). In quotes - because it is not a time as such that is used, but a special increasing counter. And this counter is a transaction number.

(As usual, in fact, everything is more complicated: the transaction number cannot increase all the time due to the limited digit capacity of the counter. But we will consider these details in detail when we get to the freeze.)

When a string is created, the xmin value is set to the number of the transaction that executed the INSERT command, and xmax is not populated.
')
When a row is deleted, the xmax value of the current version is marked with the transaction number that performed the DELETE.

When a string is modified by an UPDATE command, two operations are actually performed: DELETE and INSERT. The current version of the string is set to xmax, equal to the number of the transaction that performed the DELETE. Then a new version of the same line is created; its xmin value is the same as the xmax value of the previous version.

The xmin and xmax fields are included in the row version header. In addition to these fields, the header contains others, for example:


As a result, the header is quite large - at least 23 bytes for each version of the string, and usually more because of the NULL bitmap. If the table is “narrow” (that is, it contains few columns), the overhead can take more than useful information.

Insert


Let us take a closer look at how string operations are performed at a low level, and begin with an insert.

For experiments, create a new table with two columns and an index for one of them:

=> CREATE TABLE t( id serial, s text ); => CREATE INDEX ON t(s); 

Insert one line, starting the transaction.

 => BEGIN; => INSERT INTO t(s) VALUES ('FOO'); 

Here is our current transaction number:

 => SELECT txid_current(); 
  txid_current -------------- 3664 (1 row) 

Let's look at the content of the page. The pageinspect extension's heap_page_items function allows you to get information about pointers and string versions:

 => SELECT * FROM heap_page_items(get_raw_page('t',0)) \gx 
 -[ RECORD 1 ]------------------- lp | 1 lp_off | 8160 lp_flags | 1 lp_len | 32 t_xmin | 3664 t_xmax | 0 t_field3 | 0 t_ctid | (0,1) t_infomask2 | 2 t_infomask | 2050 t_hoff | 24 t_bits | t_oid | t_data | \x0100000009464f4f 

Note that heg (heap) in PostgreSQL denotes tables. This is another strange use of the term — a heap is a well - known data structure that has nothing to do with a table. Here this word is used in the sense of “everything is piled up in a heap”, in contrast to the ordered indexes.

The function displays the data "as is" in a format that is difficult to read. To understand, we will leave only a part of the information and decipher it:

 => SELECT '(0,'||lp||')' AS ctid, CASE lp_flags WHEN 0 THEN 'unused' WHEN 1 THEN 'normal' WHEN 2 THEN 'redirect to '||lp_off WHEN 3 THEN 'dead' END AS state, t_xmin as xmin, t_xmax as xmax, (t_infomask & 256) > 0 AS xmin_commited, (t_infomask & 512) > 0 AS xmin_aborted, (t_infomask & 1024) > 0 AS xmax_commited, (t_infomask & 2048) > 0 AS xmax_aborted, t_ctid FROM heap_page_items(get_raw_page('t',0)) \gx 
 -[ RECORD 1 ]-+------- ctid | (0,1) state | normal xmin | 3664 xmax | 0 xmin_commited | f xmin_aborted | f xmax_commited | f xmax_aborted | t t_ctid | (0,1) 

Here is what we did:


What do we see? When you insert a row in the tabular page, a pointer appears with the number 1, which refers to the first and only version of the row.

In the row version, the xmin field is filled with the current transaction number. The transaction is still active, so both the xmin_committed and xmin_aborted bits are not set.

The ctid field of the string version refers to the same string. This means that a newer version does not exist.

The xmax field is filled with the fictitious number 0, since this version of the line has not been deleted and is relevant. Transactions will not pay attention to this number, since the xmax_aborted bit is set.

We will take another step towards improving readability by adding information bits to transaction numbers. And we will create a function, since we will need the query more than once:

 => CREATE FUNCTION heap_page(relname text, pageno integer) RETURNS TABLE(ctid tid, state text, xmin text, xmax text, t_ctid tid) AS $$ SELECT (pageno,lp)::text::tid AS ctid, CASE lp_flags WHEN 0 THEN 'unused' WHEN 1 THEN 'normal' WHEN 2 THEN 'redirect to '||lp_off WHEN 3 THEN 'dead' END AS state, t_xmin || CASE WHEN (t_infomask & 256) > 0 THEN ' (c)' WHEN (t_infomask & 512) > 0 THEN ' (a)' ELSE '' END AS xmin, t_xmax || CASE WHEN (t_infomask & 1024) > 0 THEN ' (c)' WHEN (t_infomask & 2048) > 0 THEN ' (a)' ELSE '' END AS xmax, t_ctid FROM heap_page_items(get_raw_page(relname,pageno)) ORDER BY lp; $$ LANGUAGE SQL; 

In this form, it is much clearer what is happening in the header of the version of the line:

 => SELECT * FROM heap_page('t',0); 
  ctid | state | xmin | xmax | t_ctid -------+--------+------+-------+-------- (0,1) | normal | 3664 | 0 (a) | (0,1) (1 row) 

Similar, but significantly less detailed, information can be obtained from the table itself, using the pseudo columns xmin and xmax:

 => SELECT xmin, xmax, * FROM t; 
  xmin | xmax | id | s ------+------+----+----- 3664 | 0 | 1 | FOO (1 row) 

Fixing


Upon successful completion of the transaction, you need to remember its status - note that it is fixed. To do this, use a structure called XACT (and before version 10 it was called CLOG (commit log) and this name can still be found in different places).

XACT is not a system catalog table; these are files in the PGDATA / pg_xact directory. They have two bits for each transaction: committed and aborted - just like in the header of the version of the string. This information is divided into several files solely for convenience, we will return to this issue when we consider the freezing. And work with these files is carried out page by page, as well as with all others.

So, when committing a transaction in XACT, the committed bit is set for this transaction. And this is all that happens when fixing (although we are not talking about the pre-recording journal yet).

When any other transaction turns to the table page we just looked at, it will have to answer a few questions.

  1. Is the xmin transaction complete? If not, the created version of the string should not be visible.
    This check is performed by viewing another structure that is located in the shared memory of the instance and is called ProcArray. It contains a list of all active processes, and each one contains the number of its current (active) transaction.
  2. If it is completed, then how - by fixing or cancellation? If canceled, the version of the string should not be visible either.
    That's what XACT is for. But, although the last pages of XACT are stored in buffers in RAM, it is expensive to check XACT every time. Therefore, the transaction status found out once is written to the xmin_committed and xmin_aborted bits of the string version. If one of these bits is set, then the status of the xmin transaction is considered known and the next transaction will no longer have to contact XACT.

Why are these bits not set by the insert transaction itself? When an insertion occurs, the transaction does not yet know whether it will complete successfully. And at the time of fixing, it is not clear exactly which lines in which particular pages were changed. There may be many such pages, and it is not profitable to memorize them. In addition, part of the pages can be pushed from the buffer cache to disk; reading them again to change the bits would mean slowing down the fixation significantly.

The downside to saving is that after a change, any transaction (even performing a simple read - SELECT) can begin to change data pages in the buffer cache.

So, fix the change.

 => COMMIT; 

Nothing has changed in the page (but we know that the status of the transaction is already recorded in XACT):

 => SELECT * FROM heap_page('t',0); 
  ctid | state | xmin | xmax | t_ctid -------+--------+------+-------+-------- (0,1) | normal | 3664 | 0 (a) | (0,1) (1 row) 

Now the transaction that first accessed the page will have to determine the status of the xmin transaction and write it into the information bits:

 => SELECT * FROM t; 
  id | s ----+----- 1 | FOO (1 row) 

 => SELECT * FROM heap_page('t',0); 
  ctid | state | xmin | xmax | t_ctid -------+--------+----------+-------+-------- (0,1) | normal | 3664 (c) | 0 (a) | (0,1) (1 row) 

Deletion


When deleting a line in the xmax field of the current version, the number of the current deleting transaction is written, and the xmax_aborted bit is reset.

Note that the set value xmax corresponding to the active transaction acts as a row lock. If another transaction is about to update or delete this row, it will have to wait for the xmax transaction to complete. We'll talk more about locks later. For now, we only note that the number of row locks is unlimited. They do not occupy space in RAM and system performance does not suffer from their number. True, “long” transactions have other disadvantages, but more on that later.

Delete the line.

 => BEGIN; => DELETE FROM t; => SELECT txid_current(); 
  txid_current -------------- 3665 (1 row) 

We see that the transaction number is recorded in the xmax field, but the information bits are not set:

 => SELECT * FROM heap_page('t',0); 
  ctid | state | xmin | xmax | t_ctid -------+--------+----------+------+-------- (0,1) | normal | 3664 (c) | 3665 | (0,1) (1 row) 

Cancel


Discarding changes works in the same way as committing, only in XACT, the aborted bit is set for the transaction. Cancellation is as fast as commit. Although the command is called ROLLBACK, there is no change rollback: everything that the transaction has managed to change in the data pages remains unchanged.

 => ROLLBACK; => SELECT * FROM heap_page('t',0); 
  ctid | state | xmin | xmax | t_ctid -------+--------+----------+------+-------- (0,1) | normal | 3664 (c) | 3665 | (0,1) (1 row) 

When accessing the page, the status will be checked and the prompt bit xmax_aborted will be set in the string version. The xmax number itself remains on the page, but no one will look at it.

 => SELECT * FROM t; 
  id | s ----+----- 1 | FOO (1 row) 

 => SELECT * FROM heap_page('t',0); 
  ctid | state | xmin | xmax | t_ctid -------+--------+----------+----------+-------- (0,1) | normal | 3664 (c) | 3665 (a) | (0,1) (1 row) 

Update


The update works as if the current version of the line was deleted first and then a new one was inserted.

 => BEGIN; => UPDATE t SET s = 'BAR'; => SELECT txid_current(); 
  txid_current -------------- 3666 (1 row) 

The request returns one line (new version):

 => SELECT * FROM t; 
  id | s ----+----- 1 | BAR (1 row) 

But in the page we see both versions:

 => SELECT * FROM heap_page('t',0); 
  ctid | state | xmin | xmax | t_ctid -------+--------+----------+-------+-------- (0,1) | normal | 3664 (c) | 3666 | (0,2) (0,2) | normal | 3666 | 0 (a) | (0,2) (2 rows) 

The remote version is marked with the current transaction number in the xmax field. Moreover, this value is written over the old one, since the previous transaction was canceled. And the xmax_aborted bit is cleared because the status of the current transaction is not yet known.

The first version of the line now refers to the second (t_ctid field), as a newer one.

The second index appears in the index page and the second line refers to the second version in the tabular page.

As with deletion, the xmax value in the first version of the line indicates that the line is locked.

Well, complete the transaction.

 => COMMIT; 

Indices


So far we have only talked about tabular pages. And what happens inside the indices?

Information in index pages is highly dependent on the specific type of index. And even one type of index has different types of pages. For example, a B-tree has a page with metadata and “regular” pages.

However, usually in the page there is an array of pointers to the lines and the lines themselves (as well as in the table page). In addition, at the end of the page there is space for special data.

Rows in indexes can also have very different structures depending on the type of index. For example, for a B-tree, the rows related to leaf pages contain the value of the index key and a link (ctid) to the corresponding row of the table. In general, the index can be arranged in a completely different way.

The most important point is that there are no row versions in any type of index. Well, or we can assume that each line is represented by exactly one version. In other words, there are no xmin and xmax fields in the header of the index line. We can assume that the links from the index lead to all tabular versions of the rows — so that you can figure out which of the versions the transaction will see only by looking in the table. (As usual, this is not the whole truth. In some cases, the visibility map allows you to optimize the process, but we will take a closer look at this later.)

At the same time, in the index page, we find pointers to both versions, both the current one and the old one:

 => SELECT itemoffset, ctid FROM bt_page_items('t_s_idx',1); 
  itemoffset | ctid ------------+------- 1 | (0,2) 2 | (0,1) (2 rows) 

Virtual transactions


In practice, PostgreSQL uses an optimization that “saves” transaction numbers.

If a transaction only reads data, then it does not affect the visibility of row versions. Therefore, at first the serving process issues a virtual virtual number (virtual xid). The number consists of a process ID and a sequence number.

Issuance of this number does not require synchronization between all processes and therefore is performed very quickly. We will get acquainted with another reason for using virtual numbers when we talk about freezing.

Virtual numbers are not counted in snapshots of data.

At different points in time, there may well be virtual transactions with numbers that have already been used, and this is normal. But such a number cannot be recorded in the data pages, because the next time the page is accessed it may lose all meaning.

 => BEGIN; => SELECT txid_current_if_assigned(); 
  txid_current_if_assigned -------------------------- (1 row) 

If a transaction starts changing data, it is given a real, unique transaction number.

 => UPDATE accounts SET amount = amount - 1.00; => SELECT txid_current_if_assigned(); 
  txid_current_if_assigned -------------------------- 3667 (1 row) 

 => COMMIT; 

Nested transactions


Save points


In SQL, savepoints are defined that allow you to cancel a part of a transaction, without interrupting it completely. But this does not fit the above scheme, because the status of a transaction is one for all its changes, and physically no data is rolled back.

To implement this functionality, a transaction with a save point is divided into several separate nested transactions (subtransaction), the status of which can be managed separately.

Nested transactions have their own number (larger than the main transaction number). The status of nested transactions is recorded in the usual way in XACT, but the final status depends on the status of the main transaction: if it is canceled, all nested transactions are also canceled.

The transaction nesting information is stored in files in the PGDATA / pg_subtrans directory. Files are accessed via buffers in the instance's shared memory, organized in the same way as XACT buffers.

Do not confuse nested transactions and autonomous transactions. Autonomous transactions do not depend on each other, and nested - depend. There are no autonomous transactions in ordinary PostgreSQL, and, perhaps, for the better: they are needed very, very rarely in the case, and their presence in other DBMS provokes abuse, from which then everyone suffers.

Clear the table, start the transaction and insert the line:

 => TRUNCATE TABLE t; => BEGIN; => INSERT INTO t(s) VALUES ('FOO'); => SELECT txid_current(); 
  txid_current -------------- 3669 (1 row) 

 => SELECT xmin, xmax, * FROM t; 
  xmin | xmax | id | s ------+------+----+----- 3669 | 0 | 2 | FOO (1 row) 

 => SELECT * FROM heap_page('t',0); 
  ctid | state | xmin | xmax | t_ctid -------+--------+------+-------+-------- (0,1) | normal | 3669 | 0 (a) | (0,1) (1 row) 

Now put a save point and insert another line.

 => SAVEPOINT sp; => INSERT INTO t(s) VALUES ('XYZ'); => SELECT txid_current(); 
  txid_current -------------- 3669 (1 row) 

Notice that the txid_current () function returns the number of the main, not nested, transaction.

 => SELECT xmin, xmax, * FROM t; 
  xmin | xmax | id | s ------+------+----+----- 3669 | 0 | 2 | FOO 3670 | 0 | 3 | XYZ (2 rows) 

 => SELECT * FROM heap_page('t',0); 
  ctid | state | xmin | xmax | t_ctid -------+--------+------+-------+-------- (0,1) | normal | 3669 | 0 (a) | (0,1) (0,2) | normal | 3670 | 0 (a) | (0,2) (2 rows) 

Roll back to the save point and insert the third line.

 => ROLLBACK TO sp; => INSERT INTO t VALUES ('BAR'); => SELECT xmin, xmax, * FROM t; 
  xmin | xmax | id | s ------+------+----+----- 3669 | 0 | 2 | FOO 3671 | 0 | 4 | BAR (2 rows) 

 => SELECT * FROM heap_page('t',0); 
  ctid | state | xmin | xmax | t_ctid -------+--------+----------+-------+-------- (0,1) | normal | 3669 | 0 (a) | (0,1) (0,2) | normal | 3670 (a) | 0 (a) | (0,2) (0,3) | normal | 3671 | 0 (a) | (0,3) (3 rows) 

In the page, we continue to see the line added by the canceled nested transaction.

We fix the changes.

 => COMMIT; => SELECT xmin, xmax, * FROM t; 
  xmin | xmax | id | s ------+------+----+----- 3669 | 0 | 2 | FOO 3671 | 0 | 4 | BAR (2 rows) 

 => SELECT * FROM heap_page('t',0); 
  ctid | state | xmin | xmax | t_ctid -------+--------+----------+-------+-------- (0,1) | normal | 3669 (c) | 0 (a) | (0,1) (0,2) | normal | 3670 (a) | 0 (a) | (0,2) (0,3) | normal | 3671 (c) | 0 (a) | (0,3) (3 rows) 

Now you can clearly see that each nested transaction has its own status.

Note that nested transactions cannot be used in SQL explicitly, that is, you cannot start a new transaction without completing the current one. This mechanism is implicitly used when using savepoints, and also when handling PL / pgSQL exceptions and in a number of other, more exotic, cases.

 => BEGIN; 
 BEGIN 
 => BEGIN; 
 WARNING: there is already a transaction in progress BEGIN 
 => COMMIT; 
 COMMIT 
 => COMMIT; 
 WARNING: there is no transaction in progress COMMIT 

Errors and atomicity of operations


What happens if an error occurs during the operation? For example:

 => BEGIN; => SELECT * FROM t; 
  id | s ----+----- 2 | FOO 4 | BAR (2 rows) 

 => UPDATE t SET s = repeat('X', 1/(id-4)); 
 ERROR: division by zero 

An error has occurred. Now the transaction is considered aborted and no operation in it is allowed:

 => SELECT * FROM t; 
 ERROR: current transaction is aborted, commands ignored until end of transaction block 

And even if you try to commit the changes, PostgreSQL will announce the cancellation:

 => COMMIT; 
 ROLLBACK 

Why can't I continue the transaction after a failure? The fact is that the error could arise in such a way that we would get access to a part of the changes — the atomicity would not be broken even by the transaction, but by the operator. As in our example, where the operator before the error managed to update one line:

 => SELECT * FROM heap_page('t',0); 
  ctid | state | xmin | xmax | t_ctid -------+--------+----------+-------+-------- (0,1) | normal | 3669 (c) | 3672 | (0,4) (0,2) | normal | 3670 (a) | 0 (a) | (0,2) (0,3) | normal | 3671 (c) | 0 (a) | (0,3) (0,4) | normal | 3672 | 0 (a) | (0,4) (4 rows) 

It must be said that in psql there is a mode that still allows the transaction to continue working after a failure, as if the actions of the erroneous operator are being rolled back.

 => \set ON_ERROR_ROLLBACK on => BEGIN; => SELECT * FROM t; 
  id | s ----+----- 2 | FOO 4 | BAR (2 rows) 

 => UPDATE t SET s = repeat('X', 1/(id-4)); 
 ERROR: division by zero 

 => SELECT * FROM t; 
  id | s ----+----- 2 | FOO 4 | BAR (2 rows) 

 => COMMIT; 

It is easy to guess that in this mode, psql actually puts an implicit savepoint before each command, and in case of failure initiates a rollback to it. This mode is not used by default, since the installation of savepoints (even without rollback to them) is associated with significant overheads.

Continued.

Source: https://habr.com/ru/post/445820/


All Articles