Continuing the series on the theory and practice of migrating data warehouses to PostgreSQL, let's talk about the problems you may run into when moving away from the popular MySQL DBMS. To avoid boring everyone with unnecessary rhetoric, today's story will be more abstract and problem-oriented.
What follows is a list of typical mistakes in the design and operation of MySQL databases that can affect schema adaptation, code rework, and data transfer. The presence of these little things in various tricky combinations is one of the reasons the existing "universal" tools are unlikely to cope with your particular database.
That is why in the previous article I recommended not wasting time searching for a silver bullet and instead writing a quick-and-dirty tool of your own that actually works. This article is meant to make writing such a tool easier by pointing out potential pitfalls whose presence you can quickly check for.
Let's get down to business.
Default values

MySQL is notorious for its desire to simplify and ease the developer's work. As a rule, strict mode, which would dutifully report the absence of correct default values when inserting data into tables, is disabled at the start of operation.
In other words, you may have a column declared in a table without a DEFAULT clause. When inserting into the table you forget to specify that column, but MySQL does not get confused and inserts something "default" for you: typically 0, an empty string, or a similar "zero" value of the corresponding data type. PostgreSQL, in the same situation, raises a not-null constraint violation, so be prepared to fix the corresponding situations in the application code.
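A minimal sketch of the difference, using a hypothetical users table (column names are for illustration only; the MySQL behavior described in the comments assumes strict mode is off):

-- Hypothetical table.
CREATE TABLE users (
    id    serial PRIMARY KEY,
    login varchar(64) NOT NULL,
    score integer     NOT NULL
);

-- MySQL with strict mode off would silently insert '' and 0 for omitted
-- NOT NULL columns. PostgreSQL refuses:
INSERT INTO users (login) VALUES ('alice');
-- ERROR:  null value in column "score" violates not-null constraint

-- One way to keep the old application code working is to make the
-- implicit default explicit in the schema:
ALTER TABLE users ALTER COLUMN score SET DEFAULT 0;
INSERT INTO users (login) VALUES ('alice');   -- now succeeds, score = 0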
Separately, I would like to note MySQL's love of inserting inadequate default values into DATE / DATETIME / TIMESTAMP fields. Sloppy work with such fields can leave values like "0000-00-00" or "0000-00-00 00:00:00" in your database. Understandably, PostgreSQL will not let these through. You either have to fix the data in the source database or force-replace such values with NULL during the import, adjusting the application code and the structure of the corresponding table accordingly.
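One possible way to do the replacement on the fly is on the MySQL side, in the query that feeds the export (the orders table and created_at column are hypothetical):

-- Turn zero dates into NULLs while reading the data out of MySQL:
SELECT id,
       NULLIF(created_at, '0000-00-00 00:00:00') AS created_at
FROM   orders;

-- The target column in PostgreSQL then has to be nullable:
-- ALTER TABLE orders ALTER COLUMN created_at DROP NOT NULL;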
Strictness of constraints, STRICT MODE

Continuing the topic of default values and constraints: MySQL is also quite lenient about violating them. Suppose you have declared a field of type VARCHAR(255). MySQL has the unhealthy habit of silently truncating strings that do not fit into the declared length. PostgreSQL will raise an error.
In the context of this problem it is also worth noting that, with strict mode off, byte sequences that are invalid in the declared column encoding are silently cut out by MySQL. PostgreSQL always complains about invalid byte sequences. Be prepared to additionally sanitize the input data when preparing the application code for PostgreSQL.
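A small sketch of the truncation difference (the table and the overly long value are purely illustrative):

CREATE TABLE notes (body varchar(10));

-- MySQL with strict mode off would store 'abcdefghij' and drop the rest.
INSERT INTO notes VALUES ('abcdefghijklmnop');
-- ERROR:  value too long for type character varying(10)

-- If truncation really is the desired behavior, it has to be made explicit:
INSERT INTO notes VALUES (left('abcdefghijklmnop', 10));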
And just in case, check in advance (after making a backup) that your entire MySQL database consistently lives in a single encoding, and that none of the programmers created a table with a VARCHAR field in some Swedish collation or the like through phpMyAdmin.
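One quick way to check is to query MySQL's information_schema; the schema name mydb is an assumption, substitute your own:

-- List every distinct character set / collation used by text columns.
SELECT character_set_name,
       collation_name,
       COUNT(*) AS columns_using_it
FROM   information_schema.columns
WHERE  table_schema = 'mydb'
  AND  character_set_name IS NOT NULL
GROUP  BY character_set_name, collation_name
ORDER  BY columns_using_it DESC;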
Transactions and autocommit

Programmers are often afraid of using transactions or, out of incompetence, consider them "heavy" and "slow". In reality, executing queries without a transaction is impossible, so for such specialists MySQL helpfully works in AUTOCOMMIT mode, hiding the whole vicious practice of transactional processing "under the hood". PostgreSQL behaves similarly out of the box: every statement issued outside an explicit transaction runs in its own, immediately committed transaction, and clients such as psql and most drivers additionally expose an AUTOCOMMIT flag you can toggle.
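A minimal sketch of the two modes in PostgreSQL (the events and accounts tables are hypothetical):

-- Outside an explicit transaction every statement is committed on its own,
-- which matches MySQL's AUTOCOMMIT behavior:
INSERT INTO events (payload) VALUES ('logged in');   -- committed immediately

-- Grouping statements that must succeed or fail together:
BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
UPDATE accounts SET balance = balance + 100 WHERE id = 2;
COMMIT;   -- or ROLLBACK to undo both updates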
Of particular interest is MySQL's ability to mix, within one transaction, table engines that do not support transactional processing. The documentation essentially warns that if you mix transactional and non-transactional tables in a transaction, changes to the non-transactional tables take effect immediately and cannot be rolled back.
If possible, try to unify the engines before migration. If not, migrate using autocommit or try to ensure at least nominal transaction handling for such situations. It certainly won't make things worse.
The Memory table engine deserves a paragraph of its own. Its peculiarity is that it does not persist data to disk but stores it entirely in memory. The amount of memory is fixed and set by a configuration parameter that can only be changed by restarting the database server. There are two nuances associated with operating this engine:
a table that grows beyond the permissible size is instantly "swapped" to disk, the carriage turns back into a pumpkin, sometimes taking the whole database with it (if the Memory engine was used to squeeze out performance);
as a rule, such tables hold data that is considered temporary and not a pity to lose. After the first loss it turns out that the data was actually mission critical and cannot be lost. By then the database and application architecture are firmly tied to these tables, so everyone starts praying over them, and any full restart of the database turns into a peculiar ritual of converting the tables to a persistent on-disk format, converting them back to Memory, and similar shamanism.
In general, PostgreSQL is simply more adequate in terms of memory management and, provided enough of it is allocated to your database, will happily keep such a table in RAM and serve rows to the application's queries from there. If, however, the load on these tables is very high, there is not enough memory, or there is simply no time to rework the application to use persistent tables, you can emulate the behavior of Memory tables by creating similar ones in PostgreSQL with the UNLOGGED keyword (CREATE UNLOGGED TABLE ...). Quoting the documentation:
If specified, the table is created as an unlogged table. Data written to unlogged tables is not written to the write-ahead log (see Chapter 29), which makes them considerably faster than ordinary tables. However, they are not crash-safe: an unlogged table is automatically truncated after a crash or unclean shutdown. The contents of an unlogged table are also not replicated to standby servers. Any indexes created on an unlogged table are automatically unlogged as well.
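For example, a rough stand-in for a Memory table holding session data might look like this (table name and columns are made up for illustration); it is fast and not WAL-logged, but, unlike Memory, survives a normal restart:

CREATE UNLOGGED TABLE session_cache (
    session_id  text PRIMARY KEY,
    payload     text,
    created_at  timestamptz DEFAULT now()
);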
Saving on matches

Many MySQL architects, who often forget such extremely important things as correct primary keys and maintaining relational integrity, nevertheless consider it their duty to specify very precisely the width and signedness of the numbers stored in each integer field. MySQL offers a large variety of integer types (INT, SMALLINT, TINYINT, etc.), for each of which you can specify a display width and the presence or absence of negative values. For example, TINYINT holds values from -128 to 127, while UNSIGNED TINYINT holds 0 to 255.
What the above-mentioned architect never thinks about is the future. Sooner or later a value arrives in such a field that does not fit within the specified bounds, and MySQL, with strict mode disabled, silently clamps it to the nearest representable value. I have repeatedly seen databases where, because of such nonsense, objectively important and very real efficiency figures were "clipped" to the maximum value the format allows, which in turn led to less efficient operation of the system as a whole.
Knowledgeable people have already guessed that a similar situation occurs with DECIMAL / NUMERIC types, in which something important, for example money, is traditionally stored. In such cases we are already talking about losing real profit. PostgreSQL is spared this zoo of opportunities to shoot yourself in the foot (or even in the head) and will always complain when a value goes beyond the declared limits.
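PostgreSQL has no unsigned integer types, so a common approach when porting the schema is to take the next wider signed type and, if the original range matters, pin it down with a CHECK constraint. A sketch with a hypothetical counters table:

CREATE TABLE counters (
    -- MySQL UNSIGNED TINYINT (0..255): smallint plus an explicit range check.
    retries  smallint NOT NULL CHECK (retries BETWEEN 0 AND 255),
    -- MySQL UNSIGNED INT (0..4294967295) does not fit into integer, use bigint.
    hits     bigint   NOT NULL CHECK (hits >= 0)
);

-- Out-of-range values now fail loudly instead of being silently clamped:
INSERT INTO counters (retries, hits) VALUES (300, 1);
-- ERROR:  new row for relation "counters" violates check constraint "counters_retries_check"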
INSERT IGNORE + ON DUPLICATE KEY ... UPDATE

Some of the constructs most dearly loved by MySQL developers have no direct analogues in PostgreSQL.
Pervasive use of INSERT IGNORE strongly hints at problems in the design of your database schema. Nevertheless, this behavior can be emulated by writing a stored procedure that performs the INSERT and, in case of a unique-violation exception, catches it and silently swallows it, as sketched below.
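A minimal sketch of such an emulation, assuming a table accounts(id, login) with a unique constraint on login (all names are illustrative):

CREATE OR REPLACE FUNCTION insert_ignore_account(p_login text)
RETURNS void AS $$
BEGIN
    INSERT INTO accounts (login) VALUES (p_login);
EXCEPTION
    WHEN unique_violation THEN
        -- Swallow the duplicate, mimicking MySQL's INSERT IGNORE.
        NULL;
END;
$$ LANGUAGE plpgsql;

SELECT insert_ignore_account('alice');  -- inserts the row
SELECT insert_ignore_account('alice');  -- duplicate, silently ignored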
ON DUPLICATE KEY ... UPDATE, the so-called UPSERT (UPdate + inSERT), was nevertheless recognized as a necessary feature by the PostgreSQL community. Its release is planned for version 9.5 of the DBMS as INSERT ... ON CONFLICT. Until then, it can be emulated with a construct along the lines of the one below:
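This sketch follows the well-known merge_db example from the PostgreSQL documentation; the table db(a, b) is illustrative:

CREATE TABLE db (a integer PRIMARY KEY, b text);

CREATE OR REPLACE FUNCTION merge_db(key integer, data text) RETURNS void AS $$
BEGIN
    LOOP
        -- First try to update an existing row.
        UPDATE db SET b = data WHERE a = key;
        IF found THEN
            RETURN;
        END IF;
        -- Not there: try to insert it. If another session inserts the same
        -- key concurrently, we get a unique_violation and retry the UPDATE.
        BEGIN
            INSERT INTO db (a, b) VALUES (key, data);
            RETURN;
        EXCEPTION WHEN unique_violation THEN
            -- Do nothing, and loop to try the UPDATE again.
        END;
    END LOOP;
END;
$$ LANGUAGE plpgsql;

SELECT merge_db(1, 'david');
SELECT merge_db(1, 'dennis');   -- updates the existing row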
Working with time in MySQL

The problem of clamping time values in the MySQL data types available for this purpose has already been mentioned. Separately, consider the variety of options for storing such information. By itself this is not a categorical problem; all the trouble stems from developers not knowing the difference between the types and, as a result, casually using all of them at once without any visible logic. A situation where DATETIME, TIMESTAMP, and BIGINT (for unix timestamp values) are all used to store time is quite common and can be observed in almost every second database.
TIMESTAMP differs from DATETIME in that it is converted to and from the session time zone (stored internally as UTC), while DATETIME is stored literally, with no time zone handling. With this in mind, plan the corresponding data types in PostgreSQL (timestamp with / without time zone) and be ready to apply the time zone correction on the fly during import (if the database happens to be tuned to a specific time zone and this is relied upon indiscriminately).
An equally important problem is storing unix timestamps in fields of type BIGINT. MySQL has the nasty practice of adding the number of seconds corresponding to the time zone offset to this value, which ideologically should stay in UTC. This behavior does not conform to the standard, so starting with version 9.2 PostgreSQL requires an unpleasant workaround to get an epoch value corrected for the time zone: the SQL construct EXTRACT(EPOCH FROM tzvalue::TIMESTAMP WITH TIME ZONE), which extracts the epoch from the given timestamp after explicitly converting it to the time-zone-aware type.
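A sketch of the difference, assuming a non-UTC session time zone (the literal value and the zone are arbitrary):

SET TIME ZONE 'Europe/Moscow';

-- Since 9.2, a timestamp WITHOUT time zone is treated as if it were UTC:
SELECT extract(epoch FROM '2015-06-01 12:00:00'::timestamp);
-- 1433160000, regardless of the session time zone

-- Casting to timestamp WITH time zone first interprets the value in the
-- session time zone, shifting the epoch by the zone offset (UTC+3 here):
SELECT extract(epoch FROM '2015-06-01 12:00:00'::timestamp with time zone);
-- 1433149200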
Working with time zones is one of those operations where errors are excusable, since the task is genuinely non-trivial. An explanation of the correct approaches to working with time zones in PostgreSQL can be found in the excellent article at
www.depesz.com/2014/04/04/how-to-deal-with-timestamps.
Locks

Programmers often like to manage database locks manually: explicitly taking shared and exclusive locks or, worse, changing the default transaction isolation level. Strictly speaking, doing this is categorically contraindicated. Nevertheless, in practice I have seen situations where architects decided to enforce the integrity of operations by bluntly locking the tables being updated, while few worried that the OLTP load from the user interfaces hits the very same tables. Very often such urges come from the fact that MySQL transactions are not applicable because of non-transactional table engines or the general madness of the settings reigning over the database: for example, a long-running query for which the timeouts have been cranked to the limit, making normal transactions unusable.
The best solution to this problem during migration is to try to use honest transactions in PostgreSQL, which is far more likely to cope well with high concurrency. If locking nevertheless remains an integral part of the logic, study the advisory locking mechanism (see the sketch below). Advisory locks are virtual and do not lead to physical locking of data; used sensibly within transactions, they carry significantly less risk and are gentler on the resources and performance of the DBMS. But, of course, nobody absolves you of the need to think with your head and carefully track every use of locks in the application code.
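A minimal sketch of a transaction-scoped advisory lock; the lock key 42 and the report_totals table are arbitrary examples:

BEGIN;
-- Serialize a critical section between sessions without locking any rows;
-- the lock is released automatically at COMMIT or ROLLBACK.
SELECT pg_advisory_xact_lock(42);

UPDATE report_totals
SET    amount = amount + 100
WHERE  report_id = 7;
COMMIT;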
Conclusion

We have reviewed only a small part of the amazing things that can happen on a MySQL database server. One could argue at great length about the intricacies of replication in MySQL and the ways of reproducing it by PostgreSQL means, which is exactly what Oleg Tsarev will do at the upcoming PG Day'15 Russia. In the best case the problem is solved with the hot-standby mechanism; in the worst case you will have to master non-trivial tools like Slony in support and operation!
I hope these examples clearly illustrate why migration is a complex task that is practically unsolvable with existing off-the-shelf tools. Only a thoughtful, manual approach and the old military wisdom "know your data" are guaranteed to work.