Introductory
In PostgreSQL, there is such an interesting technical solution - before actually starting to change something in the files of the database itself, the DBMS writes already translated commands to the internal format in a special log - Write-Ahead Log, and after successful completion of the transaction makes a mark in this log. This was done to recover from failures, but in the end, the inquisitive mind of the developers came to the idea of using this log for backup and replication. In principle, it is logical, all the moves in it are recorded, moreover, you can not just restore the data from the backup, but also restore the state of the database to a certain point in time, interrupting the playback of the WAL-log records at the right moment.
However, let's consider such a scenario - let's say on Monday you made a basic backup and started archiving WAL logs, on Wednesday you fulfilled a deletion request with an erroneous mask, and found it only on Friday, when the manager reported about the disappearance of some necessary record. In this situation, we can only recover from the backup until Wednesday, losing all the work of managers for Thursday and Friday.
A logical question arises, is it possible to make the playback of WAL logs from Monday to Friday, while excluding our “erroneous” request?
')
Normally, I would limit myself to a forum question, but I had 2 distributions of FreeBSD, 10 tarballs with PostgreSQL source codes of different versions, 10GB of space on the screw, gcc, two relatively unloaded weeks, as well as tequila, rum, a box of beer and fragmentary memories of the syntax of the C language. Not that it was the necessary supply for a solution, but since I looked into the source codes, it was difficult to stop ...
So, for the experiments, FreeBSD 10 and PostgreSQL 9.2.8 are taken from its ports. The client of the corresponding version can be set up using pkg, there is no need to change anything in it I apologize in advance for possible captaincy, but the text was written both for beginners and for quickly refreshing everything in my head if necessary, so all the teams are detailed.
Server setup and basic setup
root@leninzhiv> cd /usr/ports/databases/postgresql92-server root@leninzhiv> make fetch root@leninzhiv> make extract
The downloaded source file is deployed to the work folder in the port directory. I honestly did not understand how to rebuild the source code after the changes, there is no sort of make rebuild, make clean in turn simply demolishes this folder with all the changes. So I just copied the work folder to my home directory, made changes there, then copied it to the port folder and ran make install.
So far, we are not changing anything, just set postgres:
root@leninzhiv> make install
Create folders for archives:
root@leninzhiv> mkdir -p /usr/db_archive/wal root@leninzhiv> mkdir -p /usr/db_archive/data root@leninzhiv> chown -R pgsql:wheel /usr/pg_archive
Postgres requires that the data directory has access only for the user to the poet, changing the rights:
root@leninzhiv> chmod 0700 /usr/pg_archive/data
We make a primitive setting. It makes sense to go under pgsql postgresovuyu uchetku so that there is less trouble with file permissions.
root@leninzhiv> su - pgsql pgsql@leninzhiv> initdb -D /usr/local/pgsql/data
Uncomment and edit the archiving parameters of WAL logs in /usr/local/pgsql/data/postgresql.conf:
archive_mode = on
wal_level = archive
archive_command = 'test! -f / usr / db_archive / wal /% f && cp% p / usr / db_archive / wal /% f '
(An example is there in the cells)
max_wal_senders = 1
In /usr/local/pgsql/data/pg_hba.conf uncomment the line
local replication pgsql trust
We start the server
pgsql@leninzhiv> /usr/local/etc/rc.d/postgresql start
Making a base backup
pgsql@leninzhiv> pg_basebackup -D /usr/db_archive/data/
We check that the / usr / db_archive / data / folder should contain a copy of the data directory, and / usr / db_archive / wal / should contain WAL files of the type approximately 000000010000000000000003
Copy to the folder with backup data directory config for recovery
cp /usr/local/share/postgresql/recovery.conf.sample /usr/db_archive/data/recovery.conf
and in it we will uncomment and edit the recovery command (the example is also close in the comments).
restore_command = 'cp / usr / db_archive / data /% f% p'
Make entries:
pgsql@leninzhiv> psql -U pgsql -d postgres
postgres=
Delete the entries:
postgres=
We change records, we introduce new, disco
postgres=
We discover that deleting records by mask was not a very good idea, and together with Karlin we deleted Kaplan, Karas and Karman.
Stop the server
pgsql@leninzhiv> /usr/local/etc/rc.d/postgresql stop pgsql@leninzhiv> exit root@leninzhiv>
and start thinking what to do.
Go to the source
As you remember, after make extract, I copied the work folder from the port directory to my home folder, and made changes to it. Therefore, go there. If someone can tell how to make changes in the source in the port folder itself, everything will be rebuilt normally after the changes made to the code will be extremely grateful.
First, I set myself the goal to find the place where they are read from the WAL-logs recording file.
I found the file with WAL-related code by searching for the string "WAL" in the contents of the files in the work / postgresql-9.2.8 / src directory and common sense, it turned out to be xlog.c
I do not know how to trace programs in C, so just at the beginning of each function I added a record of its name to the file, I assembled it and launched it.
The file has the following result:
bool check_wal_buffers(int *newval, void **extra, GucSource source) void assign_xlog_sync_method(int new_sync_method, void *extra) Size XLOGShmemSize(void) static int XLOGChooseNumBuffers(void) bool check_wal_buffers(int *newval, void **extra, GucSource source) void XLOGShmemInit(void) Size XLOGShmemSize(void) static void ReadControlFile(void) void StartupXLOG(void) static void ReadControlFile(void) static char * str_time(pg_time_t tnow) static void ValidateXLOGDirectoryStructure(void) static void readRecoveryCommandFile(void) static List * readTimeLineHistory(TimeLineID targetTLI) static bool read_backup_label(XLogRecPtr *checkPointLoc, bool *backupEndRequired, bool *backupFromStandby) static XLogRecord * ReadCheckpointRecord(XLogRecPtr RecPtr, int whichChkpt) static XLogRecord * ReadRecord(XLogRecPtr *RecPtr, int emode, bool fetching_ckpt) static bool XLogPageRead(XLogRecPtr *RecPtr, int emode, bool fetching_ckpt, bool randAccess) static int XLogFileReadAnyTLI(uint32 log, uint32 seg, int emode, int sources) ... static XLogRecord * ReadRecord(XLogRecPtr *RecPtr, int emode, bool fetching_ckpt) static bool XLogPageRead(XLogRecPtr *RecPtr, int emode, bool fetching_ckpt, bool randAccess) static bool RecordIsValid(XLogRecord *record, XLogRecPtr recptr, int emode) static bool recoveryStopsHere(XLogRecord *record, bool *includeThis) static void CheckRecoveryConsistency(void) static XLogRecord * ReadRecord(XLogRecPtr *RecPtr, int emode, bool fetching_ckpt) static bool XLogPageRead(XLogRecPtr *RecPtr, int emode, bool fetching_ckpt, bool randAccess) ...
In general, I got the impression that the main action takes place in the ReadRecord -> XLogPageRead -> RecordIsValid -> RecoveryStopsHere -> CheckRecoveryConsistency cycle.
A closer acquaintance with the ReadRecord function showed that it returns a record in two places - as a return record and as a return (XLogRecord *) buffer, with the above simple method we specify that in the recovery process from WAL-logs, the return goes through the return (XLogRecord *) buffer. Perfectly! We write the result to the file.
The structure of the XLogRecord type can be viewed in the xlog.h file and it is rather laconic:
typedef struct XLogRecord { pg_crc32 xl_crc; XLogRecPtr xl_prev; TransactionId xl_xid; uint32 xl_tot_len; uint32 xl_len; uint8 xl_info; RmgrId xl_rmid; } XLogRecord;
Well, if we have a length, then we use it to output the contents of the record to the file, before adding (XLogRecord *) buffer we add the buffer:
FILE *pf2 = fopen("/usr/local/pgsql/data/log3.txt", "a"); char *buf_poi = buffer; for (uint32 i=0; i < record->xl_tot_len; i++) {fputc(*buf_poi, pf2); buf_poi++;} fprintf(pf2, "\n crc32: %u \n xl_xid=%i \n", record->xl_crc, record->xl_xid); fclose(pf2);
We demolish the old Postgres, collect and install a new one:
root@leninzhiv> cd /usr/ports/databases/postgresql92-server root@leninzhiv> make deinstall
I remind that we copied the work directory into the home folder and made all changes to the code there. Now copy it to the place of the work folder in the port directory.
root@leninzhiv> rm -R /usr/ports/databases/postgresql92-server/work root@leninzhiv> cp -R ~/work /usr/ports/databases/postgresql92-server/work root@leninzhiv> make install
Delete the database files and copy the base backup in their place. WAL-files themselves catch up.
root@leninzhiv> su - pgsql pgsql@leninzhiv> /usr/local/etc/rc.d/postgresql stop pgsql@leninzhiv> rm -R /usr/local/pgsql/data pgsql@leninzhiv> cp -R /usr/db_archive/data /usr/local/pgsql/data pgsql@leninzhiv> /usr/local/etc/rc.d/postgresql start pgsql@leninzhiv> psql -U pgsql -d postgres
postgres=
pgsql@leninzhiv> /usr/local/etc/rc.d/postgresql stop
We look at the contents of the log3.txt file, first we go through many large records, apparently creating service tables and data, towards the end we see:
#{####T#####r###R#### ###### ####0###@###### #e######## ###gNikolaev crc32: 3682278083 l_xid=1002 W# #####U#####,### ###`#######›Ќ%ћ###### crc32: 3423214679 xl_xid=1002 r
We see that between the familiar names of Nikolaev and Petrov there are 4 similar entries and one dissimilar, under one transaction number. Apparently, these are delete commands, which means that commands like “erase row 50 in table 64822” are already recorded in the WAL-log. Basically, as expected. We add a check, which, when xl_xid = 1003, returns NULL instead of a record.
Again, delete the old Postgres, build and install a new one, start the recovery ...
Deleted recordings on the spot! True, everything that should have happened after the deletion did not happen :( Well, it didn’t work out of a rush. In general, this is understandable, because before playing the record, integrity checks and all that happens are performed.
So goal number 2 is to find where the record is “playing”. A quick search for the use of readRecord in the same file led me to the void StartupXLOG (void) function ... And here I clearly understood that I had gone the wrong way until now, because almost immediately after the second or third appearance of the readRecord call in this function (they are next) immediately comes firstly a chic diagnostic piece, and secondly, immediately after the comment “Now apply the WAL record itself” - the record play command RmgrTable [record-> xl_rmid] .rm_redo (EndRecPtr, record);
Change this piece of code to
if (record->xl_xid==1003) {} else RmgrTable[record->xl_rmid].rm_redo(EndRecPtr, record);
Again we reassemble, we start, we check ... Victory! Deleted records in place and changes made after deletion are also in place!
Focusing on the terrain
Well, this is undoubtedly good, but we solved the problem on a very limited set of data, but how to find the necessary entry in the logs of the working database?
Let's return to the mentioned smart diagnostic piece in the StartupXLOG function:
#ifdef WAL_DEBUG if (XLOG_DEBUG || (rmid == RM_XACT_ID && trace_recovery_messages <= DEBUG2) || (rmid != RM_XACT_ID && trace_recovery_messages <= DEBUG3)) { StringInfoData buf; initStringInfo(&buf); appendStringInfo(&buf, "REDO @ %X/%X; LSN %X/%X: ", ReadRecPtr.xlogid, ReadRecPtr.xrecoff, EndRecPtr.xlogid, EndRecPtr.xrecoff); xlog_outrec(&buf, record); appendStringInfo(&buf, " - "); RmgrTable[record->xl_rmid].rm_desc(&buf, record->xl_info, XLogRecGetData(record)); elog(LOG, "%s", buf.data); pfree(buf.data); } #endif
You can simply include the output in the logs, uncommenting #define WAL_DEBUG in pg_config_manual.h and adding wal_debug = on to the postgresql.conf file, but I, by habit, sent the output to a separate file. This piece, as I understand it, displays the command description using the rm_desc function (in this case, RmgrTable is an array of functions?), It looks like this:
REDO @ 0/3015500; LSN 0/3015578: prev 0/30154D0; xid 1002; len 82: Heap - insert: rel 1663/12318/16386; tid 0/9 REDO @ 0/3015578; LSN 0/30155A8: prev 0/3015500; xid 1002; len 12: Transaction - commit: 2014-06-06 08:38:27.537874+00 REDO @ 0/30155A8; LSN 0/30155E0: prev 0/3015578; xid 1003; len 21: Heap - delete: rel 1663/12318/16386; tid 0/1 REDO @ 0/30155E0; LSN 0/3015618: prev 0/30155A8; xid 1003; len 21: Heap - delete: rel 1663/12318/16386; tid 0/4 REDO @ 0/3015618; LSN 0/3015650: prev 0/30155E0; xid 1003; len 21: Heap - delete: rel 1663/12318/16386; tid 0/5 REDO @ 0/3015650; LSN 0/3015688: prev 0/3015618; xid 1003; len 21: Heap - delete: rel 1663/12318/16386; tid 0/8 REDO @ 0/3015688; LSN 0/30156B8: prev 0/3015650; xid 1003; len 12: Transaction - commit: 2014-06-06 08:38:27.54153+00 REDO @ 0/30156B8; LSN 0/3015738: prev 0/3015688; xid 1004; len 89: Heap - hot_update: rel 1663/12318/16386; tid 0/7; new 0/10
This is an already familiar piece with transaction number 1003, and from it we can see that yes, these are four commands for deletion and one confirmation of the transaction. In the delete commands, we see rel - a table identifier in the format “oid namespace / oid of the database / oid of the table”. Corresponding numbers can be obtained by queries.
SELECT oid, spcname FROM pg_catalog.pg_tablespace;
SELECT oid, datname FROM pg_catalog.pg_database;
and, suddenly,
SELECT oid, relname FROM pg_catalog.pg_class;
The second landmark is the timestamp in the transaction description. Well, there is no need to explain anything, if we know when this crime was commited, we will find the corresponding records.
Well, as an alternative way, you can go back to viewing records in the cracks, and navigate by scraps of text that were passed as parameters to the INSERT and UPDATE commands, if we remember the requests with which parameters were made shortly before or after the desired "erroneous" request. In the case of UPDATE, however, you can find only those strings that were used as a new value, if the string was used to search for records, then it is not found in WAL logs.
And finally, I can note that in the PostgreSQL 9.3 controller, the pg_xlogdump utility has appeared, which seems to be aimed at solving the problem of providing the contents of WAL logs in human-readable form. If you are interested in some features, it makes sense to write to developers.
It is possible that the use of this method on the archives of the working database will have some kind of pitfalls. For example, how will UPDATE work if we “skip” the deletion of a part of records on a database that uses frequent vacuuming? I did not check. But in any case, in the case it is better to have at least some hope of correcting the error than none at all.