Dear readers, this is the second article in my database series. Here is a table of contents of the articles planned for the series:
- How to set a different time zone for different databases on the same server.
- How to keep logs of data changes made by users, storing them in a separate database so that the main database is not cluttered with garbage and does not grow.
- How to build your own file system on top of blob fields in the database. Why is it convenient? Questions of file storage efficiency: how to get maximum speed while occupying minimum space.
I was surprised by the number of comments on the first article, so I want to say right away that I do not claim mine is the only correct implementation. I am sure creative people will find many other ways to accomplish this task. But when I implemented it back in the day, I did not find a single article describing such functionality and had to build it from scratch, even though, in my opinion, the task is a common one. The implementation I describe is fully working and is used by me in practice.
I also welcome constructive criticism. Sometimes people write interesting things, and you get to look at the problem from an angle you had not considered and improve your mechanisms.
So, let's begin.
The database is Firebird 3.
The wording of the task is as follows:
it is necessary to keep detailed logs of data changes made by users (insert, update, delete), but to write them to another database on another server. This is needed so that the main database does not grow by leaps and bounds, stays convenient to back up and restore, works quickly, does not accumulate garbage, and does not hold rarely needed information. This mechanism was developed for a cloud-based enterprise automation platform (https://erp-platforma.com), but I first ran into this problem while working in the billing department of a telecom company. There was a billing database (let's call it the primary DB) that all the systems worked with, and a database for the clients' personal account area (let's call it the LK DB), which also had to contain most of the data from the primary DB. This was done because nobody liked the idea of the externally facing personal account looking straight into the company's main database, because you never know...
To get data from the primary DB into the LK DB (and in some cases a reverse replication mechanism was implemented, from the LK DB server back to the primary DB), insert, update and delete triggers were created that wrote all changes into an "intermediate" table. At a certain interval an external script unloaded new records from this table into text files, which were shipped to the LK DB server and executed against that database. It all seemed fine, BUT the database grew by leaps and bounds, reaching GIANT sizes within a few months because of this table where all the logs accumulated. Obsolete data remained in it and was not removed right away, because deletion is a slow process that builds up garbage and makes the database slower over time. At least that is how the developers reasoned. As a result it got to the point where every 3 months we had to delete this data and do a backup-restore of the database, because the server simply ran out of disk space. A backup-restore of such volumes, on such a server, took almost a full day. That meant a full-day stop of the entire company every 3 months. And it had to be done. The help desk took notes on paper during that time...
Developing my cloud platform with this experience in mind, I was already aware of the issue and decided to build a detailed logging system free of these flaws. Especially considering the volumes: a single server may host not one database but a thousand, and doing a backup-restore of a thousand databases is simply unrealistic.
First I want to describe the general concept.
To keep the records structured, they are written in packets.
There are two tables: one with information about a packet and one with the data of that packet. Let's call them LOG_PACKET and LOG (a rough DDL sketch follows the field lists below).
LOG_PACKET
- Unique identifier
- Record date
- Author (user id)
- Processed flag (0 or 1)
- Record type (insert, update, delete)
- Table number (the table whose changes are being logged)
- Data table number (which LOG table holds the data)
- ID of the logged record
- Number of processing attempts
- Last processing date
LOG
- Unique identifier
- Blob field for the OLD data
- Blob field for the NEW data
- Packet ID
- Table field number (name of the table field)
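To make this concrete, here is a rough DDL sketch of these two tables as I understand them. The column names follow those that appear in the trigger and script code later in the article (TYPE_, TABLE_, NUM_TABLE, AVTOR, ID_ZAPISI, CHECK_, BLOB_OLD, BLOB_NEW, PACKET, NUM_POLE); the names and types of the remaining columns are my own placeholders, not taken from the original:

```sql
CREATE TABLE LOG_PACKET (
  ID         BIGINT NOT NULL,   -- unique identifier (e.g. filled from a generator)
  DATA_REC   TIMESTAMP,         -- record date (placeholder name)
  AVTOR      INTEGER,           -- id of the user who made the change
  CHECK_     SMALLINT,          -- processed flag: 0 = new, 1 = processed
  TYPE_      SMALLINT,          -- record type: 1 = insert, 2 = update, 3 = delete
  TABLE_     INTEGER,           -- number (or name) of the logged table
  NUM_TABLE  SMALLINT,          -- data table number: 1 = LOG_1, 2 = LOG_2
  ID_ZAPISI  BIGINT,            -- id of the changed record in the source table
  ATTEMPTS   SMALLINT,          -- number of processing attempts (placeholder name)
  DATA_LAST  TIMESTAMP          -- last processing date (placeholder name)
);

CREATE TABLE LOG_1 (
  ID        BIGINT NOT NULL,    -- unique identifier
  BLOB_OLD  BLOB SUB_TYPE 0,    -- field value before the change
  BLOB_NEW  BLOB SUB_TYPE 0,    -- field value after the change
  PACKET    BIGINT,             -- reference to LOG_PACKET.ID
  NUM_POLE  INTEGER             -- number (or name) of the changed field
);
-- LOG_2 and LOG_3 (see below) have exactly the same structure.
```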
In the trigger fired when data changes, a record with information about the change is written to LOG_PACKET, and the data itself (before and after the change) is written to the LOG table.
Then an external script, launched at a certain interval, finds unprocessed records in LOG_PACKET, copies them to the log database on another server, and, using their identifiers, finds all the related records in the LOG table and copies them as well.
Next, we need to get rid of the garbage in the main database. Deleting data from LOG would slow the database down and accumulate trash. There is a faster and better way: DROP TABLE.
PS: I will not go into detail about why DROP is preferable to DELETE; it is described very well in this article.
Therefore, the following solution applies.
There are three LOG tables. On the first day data is written to LOG_1, on the next day to LOG_2, and so on, alternating: one day LOG_1, the next day LOG_2. While data is being written to LOG_2 and all packets referring to LOG_1 have been processed, LOG_1 is dropped and re-created (DROP TABLE LOG_1, then CREATE TABLE LOG_1). This way all its records are removed at once and the table becomes clean. The next day the same is done with LOG_2.
But it is not that simple. Since LOG_1 and LOG_2 are referenced in the triggers of all logged tables, they cannot be dropped. A trick is needed here: turn them into VIEWs and write the data in the triggers through these VIEWs.
```sql
CREATE OR ALTER VIEW V_LOG_1 (ID, BLOB_OLD, BLOB_NEW, PACKET, NUM_POLE) AS
  select p.id, p.blob_old, p.blob_new, p.packet, p.num_pole
  from LOG_1 p;
```
But even so, the table cannot be dropped, because the VIEW used in the triggers refers to it.
To work around this, create a table LOG_3 that is a copy of the LOG_1 and LOG_2 structure. In the VIEW we replace LOG_1 with LOG_3 and voila: the LOG_1 table can now be dropped and re-created. After that we switch the VIEW back from LOG_3 to LOG_1, and everything keeps working.
This operation can be carried out safely: at that moment there are guaranteed to be no writes to LOG_1, because on that day the writes go to LOG_2. The cleanup always runs against the table that is not being written to on that day. The next day the same operation is performed on the LOG_2 table.
Also, this operation should run only if all the LOG_PACKET records for that table have been processed. If they have not, the table must not be touched, because that would mean losing data. First you need to understand why the data has not been processed; for example, the script that transfers it to the logs database may have hung. If the cleanup was skipped on a given day, the next attempt is made a day later, and so on.
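A minimal sketch of that daily cleanup, assuming the table and view definitions above; in practice the maintenance script runs these statements one by one (each committed separately), and only after it has checked that every LOG_PACKET record pointing at LOG_1 has already been processed:

```sql
-- Today writes go to LOG_2, so LOG_1 can be rotated.
-- 1. Point the view at the spare table so nothing references LOG_1 any more.
CREATE OR ALTER VIEW V_LOG_1 (ID, BLOB_OLD, BLOB_NEW, PACKET, NUM_POLE) AS
  select p.id, p.blob_old, p.blob_new, p.packet, p.num_pole
  from LOG_3 p;

-- 2. Drop the data table and re-create it with the same (assumed) structure
--    as in the sketch above: instantly empty, no garbage left behind.
DROP TABLE LOG_1;
CREATE TABLE LOG_1 (
  ID        BIGINT NOT NULL,
  BLOB_OLD  BLOB SUB_TYPE 0,
  BLOB_NEW  BLOB SUB_TYPE 0,
  PACKET    BIGINT,
  NUM_POLE  INTEGER
);

-- 3. Point the view back at the fresh LOG_1; the triggers never noticed a thing.
CREATE OR ALTER VIEW V_LOG_1 (ID, BLOB_OLD, BLOB_NEW, PACKET, NUM_POLE) AS
  select p.id, p.blob_old, p.blob_new, p.packet, p.num_pole
  from LOG_1 p;
```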
We have covered the concept; now let's work out the optimal scheme.
First, you need a table with information about which table we are writing to today. It should hold a single record (so that reading it is a fast operation) and have no indexes. It should contain the number of the current log table, the date of the last update, and whether logging is allowed at all (the administrator should be able to disable logging entirely to save space if it is not needed).
For example:
```sql
CREATE TABLE LOG_INFO (
  ZAPIS    SMALLINT,
  DATA     TIMESTAMP,
  ID_TABLE SMALLINT
);
```
In every trigger of a logged table, the first check is whether logging is enabled at all:
select p.zapis from LOG_INFO p into :zapis;
If it is, the next check determines which table the data should be written to:
```sql
select first 1
       case when p.data = current_date then p.id_table
            else case when p.id_table = 1 then 2
                      when p.id_table = 2 then 1
                 end
       end,
       case when p.data = current_date then 1 else -1 end
from LOG_INFO p
into :log_info, :log_info_check;
```
PS: for the skeptics, FIRST 1 is needed to rule out the trigger failing if more than one record suddenly ends up in the LOG_INFO table. In that case writes to all logged tables would fail (with a "multiple rows in singleton select" error), and FIRST 1 guarantees this cannot happen. With this query we:
1) check whether the stored date matches the current one: if it does, we write to the current table; if not, we switch to the other table (the day has rolled over to the next one);
2) set a flag indicating that LOG_INFO has to be updated to point to the new table now that the day has changed.
The next step: if the day has rolled over and LOG_INFO needs updating, update it:
```sql
if (log_info_check = -1) then
  UPDATE LOG_INFO SET ID_TABLE = :log_info, DATA = current_date;
```
That is, the first write on the new day updates the information about which table to write to, and after that no more resources are spent on it.
Next, write the data to LOG_PACKET and get its ID:
```sql
if (inserting) then TYPE_ = 1;
if (updating) then TYPE_ = 2;
if (deleting) then TYPE_ = 3;

if (TYPE_ in (1, 2)) then
  INSERT INTO LOG_PACKET (TYPE_, TABLE_, NUM_TABLE, AVTOR, ID_ZAPISI)
  VALUES (:TYPE_, 15, :log_info, new.avtor, new.id)
  RETURNING ID into :id_packet;
else
  INSERT INTO LOG_PACKET (TYPE_, TABLE_, NUM_TABLE, AVTOR, ID_ZAPISI)
  VALUES (:TYPE_, 15, :log_info, old.avtor, old.id)
  RETURNING ID into :id_packet;
```
Then, depending on the table number obtained, the data is written to V_LOG_1 or V_LOG_2.
For example, the entry might look like this:
```sql
if (log_info = 1) then
begin
  if (TYPE_ = 1) then
  begin
    INSERT INTO V_LOG_1 (BLOB_OLD, BLOB_NEW, PACKET, NUM_POLE) VALUES (NULL, new.n_687, :id_packet, 687);
    INSERT INTO V_LOG_1 (BLOB_OLD, BLOB_NEW, PACKET, NUM_POLE) VALUES (NULL, new.n_688, :id_packet, 688);
    INSERT INTO V_LOG_1 (BLOB_OLD, BLOB_NEW, PACKET, NUM_POLE) VALUES (NULL, new.n_689, :id_packet, 689);
    INSERT INTO V_LOG_1 (BLOB_OLD, BLOB_NEW, PACKET, NUM_POLE) VALUES (NULL, new.n_690, :id_packet, 690);
    INSERT INTO V_LOG_1 (BLOB_OLD, BLOB_NEW, PACKET, NUM_POLE) VALUES (NULL, new.n_691, :id_packet, 691);
  end
  if (TYPE_ = 2) then
  begin
    if (new.n_687 <> old.n_687) then
      INSERT INTO V_LOG_1 (BLOB_OLD, BLOB_NEW, PACKET, NUM_POLE) VALUES (old.n_687, new.n_687, :id_packet, 687);
    if (new.n_688 <> old.n_688) then
      INSERT INTO V_LOG_1 (BLOB_OLD, BLOB_NEW, PACKET, NUM_POLE) VALUES (old.n_688, new.n_688, :id_packet, 688);
    if (new.n_689 <> old.n_689) then
      INSERT INTO V_LOG_1 (BLOB_OLD, BLOB_NEW, PACKET, NUM_POLE) VALUES (old.n_689, new.n_689, :id_packet, 689);
    if (new.n_690 <> old.n_690) then
      INSERT INTO V_LOG_1 (BLOB_OLD, BLOB_NEW, PACKET, NUM_POLE) VALUES (old.n_690, new.n_690, :id_packet, 690);
    if (new.n_691 <> old.n_691) then
      INSERT INTO V_LOG_1 (BLOB_OLD, BLOB_NEW, PACKET, NUM_POLE) VALUES (old.n_691, new.n_691, :id_packet, 691);
  end
  if (TYPE_ = 3) then
  begin
    INSERT INTO V_LOG_1 (BLOB_OLD, BLOB_NEW, PACKET, NUM_POLE) VALUES (old.n_687, NULL, :id_packet, 687);
    INSERT INTO V_LOG_1 (BLOB_OLD, BLOB_NEW, PACKET, NUM_POLE) VALUES (old.n_688, NULL, :id_packet, 688);
    INSERT INTO V_LOG_1 (BLOB_OLD, BLOB_NEW, PACKET, NUM_POLE) VALUES (old.n_689, NULL, :id_packet, 689);
    INSERT INTO V_LOG_1 (BLOB_OLD, BLOB_NEW, PACKET, NUM_POLE) VALUES (old.n_690, NULL, :id_packet, 690);
    INSERT INTO V_LOG_1 (BLOB_OLD, BLOB_NEW, PACKET, NUM_POLE) VALUES (old.n_691, NULL, :id_packet, 691);
  end
end
```
PS: I have made my life a bit easier because the system has its own built-in programming language and all my tables and fields are identified by numbers rather than names. Thanks to this, such triggers are created fully automatically when a new table appears and are re-created when its fields change. That is, the user only has to click the "Enable logs" button in the table editor and everything happens for him automatically. For readers with an ordinary database it will be more complicated: such a trigger has to be created by hand, and NUM_POLE and NUM_TABLE should be text fields, i.e. they should store the names of tables and fields. Alternatively, introduce a lookup table that assigns numbers to tables and fields and take the values from there.
Next, you need a table with a flag indicating whether a write has occurred. This saves system resources, because querying a table with a single field is much faster than scanning LOG_PACKET looking for new records.
For example:
CREATE TABLE LOG_PACKET_FLAG (ID SMALLINT);
And in the trigger on the LOG_PACKET table, when a new packet record is added, execute:
UPDATE LOG_PACKET_FLAG SET ID = 1;
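For completeness, a sketch of what that trigger on LOG_PACKET might look like (the trigger name is my own, and LOG_PACKET_FLAG is assumed to always contain exactly one row):

```sql
SET TERM ^ ;

CREATE OR ALTER TRIGGER LOG_PACKET_AI FOR LOG_PACKET
ACTIVE AFTER INSERT POSITION 0
AS
BEGIN
  -- Raise the "there is new work" flag; the transfer script lowers it back to 0.
  UPDATE LOG_PACKET_FLAG SET ID = 1;
END^

SET TERM ; ^
```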
When the script that transfers data to the logs database starts, it checks whether there are any new records (if there are none, there is nothing more to do):
select first 1 p.id from LOG_PACKET_FLAG p
This will work much faster than, for example:
select count(*) from LOG_PACKET p where p.check_=0
At the end of the transfer, check whether new records appeared while we were transferring the existing ones, and if not, execute:
UPDATE LOG_PACKET_FLAG SET ID = 0;
In general, save the resources of your systems.
Next, consider the process of transferring the records.
First, we fetch all new LOG_PACKET records from the main database and, in a loop, INSERT this data into the logging database.
After the packet record has been created in the logging database, we need to transfer that packet's data from the LOG_1 (LOG_2) table.
Extracting data from blob fields and inserting it somewhere else is a messy business involving a fair amount of dancing with a tambourine. After many experiments I concluded that the simplest and most efficient way is to run, from the script, a procedure in the logging database that itself queries the main database, i.e. to pull the blob fields with a direct query from the logging database to the main database.
Here is the procedure in the logging database that loads the packet data; the script calls it right after adding the packet record.
```sql
create or alter procedure BLOB_INS (
  SELECT_ varchar(250),
  BASE_   varchar(100),
  USER_   varchar(50),
  PASS_   varchar(50),
  PACKET  integer)
as
  declare variable BLOB_OLD BLOB SUB_TYPE 0 SEGMENT SIZE 80;
  declare variable BLOB_NEW BLOB SUB_TYPE 0 SEGMENT SIZE 80;
  declare variable NUM_POLE INTEGER;
begin
  FOR EXECUTE STATEMENT (:select_)
      ON EXTERNAL :base_ AS USER :user_ PASSWORD :pass_
      INTO :BLOB_OLD, :BLOB_NEW, :NUM_POLE
  DO
    INSERT INTO LOG (BLOB_OLD, BLOB_NEW, PACKET, NUM_POLE)
    VALUES (:BLOB_OLD, :BLOB_NEW, :PACKET, :NUM_POLE);
end
```
The SELECT_ parameter should be passed in from the script, looking roughly like this:
select p.blob_old, p.blob_new, p.num_pole from log_'.< >.' p where p.packet='.< >
After the packet has been processed successfully, a flag must be set in the main database marking this packet as processed.
If errors occur while adding the packet data, I update the number of attempts; after 5 attempts a flag is set indicating that the data was processed with an error, and the program stops trying to process it. Without this mechanism you could end up in an infinite loop on some error, and the program would stop doing its job. Faulty packets can be dealt with later; the program should keep working.
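To tie the transfer steps together, here is an outline of the statements the external script executes, in the order described above. The column name CHECK_ appears earlier in the article; the ATTEMPTS column and the parameter values are my own placeholders:

```sql
-- In the MAIN database: is there anything to do at all?
-- (the script checks whether the returned value is 1)
select first 1 p.id from LOG_PACKET_FLAG p;

-- In the MAIN database: fetch the unprocessed packets.
select p.id, p.type_, p.table_, p.num_table, p.avtor, p.id_zapisi
from LOG_PACKET p
where p.check_ = 0;

-- For each packet, in the LOGGING database: create the packet record...
INSERT INTO LOG_PACKET (ID, TYPE_, TABLE_, NUM_TABLE, AVTOR, ID_ZAPISI)
VALUES (:id, :type_, :table_, :num_table, :avtor, :id_zapisi);

-- ...and pull its blob data across with the procedure above
-- (:select_ is the query against LOG_1 or LOG_2 that the script builds).
EXECUTE PROCEDURE BLOB_INS (:select_, :main_db_path, :db_user, :db_password, :id);

-- Back in the MAIN database: mark the packet as processed...
UPDATE LOG_PACKET SET CHECK_ = 1 WHERE ID = :id;
-- ...or, on error, bump the attempt counter instead (placeholder column name):
-- UPDATE LOG_PACKET SET ATTEMPTS = ATTEMPTS + 1 WHERE ID = :id;

-- Finally, if no new packets appeared during the run, lower the flag.
UPDATE LOG_PACKET_FLAG SET ID = 0;
```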
One small piece of advice: set up a separate virtual machine for the logging database. You can give this machine fewer cores and less memory, and a slower disk subsystem; there is no need to keep logs on SSDs the way you do with the main databases that carry the primary workload.
There is no demand for high speed here, because this is a deferred operation. It does not matter whether the logs are written in one minute or one and a half; the main thing is that they get written. Users access this data very rarely, only when some problem occurs, and it is fine if the log page takes 200 ms longer to load.
In general, this way you simply save resources; it is better to give them to the loaded machines.