Versioned migration of database structure: main approaches

The problems of database version control and migration between versions have been raised more than once, both on Habr ( 1 , 2 , 3 , etc.) and elsewhere on the Internet (mostly in English).

In the first section of this article, I consider the main problems that development teams face when making any changes to the database structure. In the second section, I try to identify the main general approaches to storing and maintaining changes to the database structure during development.

Terminology

Database - the set of all database objects (tables, procedures, triggers, etc.), static data (immutable data stored in lookup tables), and user data (data that changes while the application is in use).

Database structure - the collection of all database objects and static data. User data is not part of the database structure.

Database version - a specific state of the database structure. Usually the version has a number associated with the version number of the application.

Migration, in this context, is updating the database structure from one version to another (usually a newer one).

The term migration seems to be used in this sense in many sources (the migrations in the Active Record gem, part of Ruby on Rails, did much to popularize it). However, the term is ambiguous: someone unfamiliar with the context would more likely assume it refers to moving a database from one DBMS to another (MySQL => Oracle), or even to process/data migration between cluster nodes. Therefore, in cases where the context is not obvious, I suggest using a more precise term: versioned database migration.

Why do you need it?

Developers who have already encountered the problem of out-of-sync database and application versions can skip this section. Here I will remind you why the application and database versions must be kept in parity, and what common problem arises otherwise.

The database version must match the application version.


So, imagine the following situation: a team of several programmers is developing an application that actively uses a database. From time to time, the application is delivered to production - for example, a website deployed to a web server.
Any programmer may, in the course of writing application code, need to change the database structure, as well as the data stored in it. A simple example: suppose one of the tables has a non-nullable string field. This field does not always contain data, in which case an empty string is stored there. At some point, you decide that storing empty strings is semantically incorrect in some cases (see 1 , 2 ), and that it is correct to store NULLs. To implement this, you need the following steps:

1. Change the field type to nullable:
ALTER TABLE myTable CHANGE COLUMN myField myField VARCHAR(255) NULL DEFAULT NULL;

2. Since the production database probably already contains empty strings in this table, you make a judgment call and treat them as absence of information. Therefore, you need to replace them with NULLs:
UPDATE myTable SET myField = NULL WHERE myField = '';

3. Modify the application code so that, when reading data from this field, it responds adequately to NULLs. Writing to this field now also requires NULL instead of an empty string.

Point 3 shows that the application and the database are inseparable parts of one whole. This means that when delivering a new version of the application to production, the database version must be updated as well; otherwise, the application simply will not be able to work correctly. In this example, if only the application is updated to the new version, then at some point a NULL will be written to the non-nullable field, which is an obvious error.

Thus, updating the application version requires correct versioned database migration .

Is it that simple?


Having realized that parity between database and application versions is necessary, you need to make sure that migrations to the required version will always be performed correctly. But where is the problem here? After all, at first glance there is nothing complicated!

Here again we turn to a living example. Suppose the programmers on a team write their changes to the database structure and data into a separate file as a set of SQL queries (both DDL and DML). Then, each time you deploy the latest version of the application, you also update the database to the latest version by executing the queries from this SQL file... But wait: from which version are you updating the database to the latest one? "From the previous one"? But do you remember exactly what the previous version was (it was released two months ago)? If not, how are you going to update it? Without accurate information about the state of the structure and data, a correct migration is impossible: if you inadvertently execute queries that have already been executed once, this can lead to data loss or broken integrity.
A simple example is replacing passwords with their MD5 sums, as shown below. If you execute such a query a second time, the data can be restored only from a backup. In general, any UPDATE, DELETE, or even INSERT executed repeatedly can have extremely undesirable consequences, to say nothing of a repeated TRUNCATE or DROP (although such cases are much less likely).
By the way, from this point of view, executing queries too few times is no less dangerous to a working application than executing them too many times.
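
To make the danger concrete, here is a minimal sketch of the password example, assuming a hypothetical users table with a plaintext password column:
-- Hypothetical example: replacing plaintext passwords with their MD5 sums.
-- The first run hashes the passwords as intended; a second, accidental run
-- hashes the hashes, and the original values can be restored only from a backup.
UPDATE users SET password = MD5(password);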

Thus, we can conclude that during versioned migration all queries must be executed exactly once and, moreover, in the correct order . Order matters because some changes may depend on others (as in the example with the nullable field).


General principles of versioned migration

In the previous section, we outlined the essential criteria for the versioned migration process itself: every change must be applied exactly once, and changes must be applied in the correct order.
Now let us single out more practical criteria, to understand what to require from the process of creating and storing migrations. In my opinion, for most development teams the important points are how easy it is to create and apply migrations, how convenient it is to comment on the database structure, and how well the approach survives parallel development in several repository branches - exactly the points weighed in the pros and cons of each method below.

The migration baseline


As it turns out, most approaches share a common principle: they need a base (baseline) - a certain reference state of the database from which one can start. This concept is described quite well in Scott Allen's article Versioning Databases - The Baseline.

Simply put, the baseline is a dump of the database structure for the version taken as the base one. With a baseline at hand, it is always possible to create the database from scratch. After applying all migrations created during development to this database, we obtain a database whose structure corresponds to the latest version.

Next, we will consider three approaches to the organization of versioned database migration.


Incremental change method

This method is well described in the article "Versioning Databases - Change Scripts" by Scott Allen. A similar approach is also described in the article “Managing SQL scripts and continuous integration” by Michael Baillon.

File structure


An example of how a folder with migration files might look in this case:
Database
|- Baseline.sql
|- 0001.03.01.sql
|- 0002.03.01.sql
|- 0003.03.01.sql
|- 0004.03.02.sql
|- 0005.03.02.sql
|- 0006.03.02.sql
'- 0007.03.02.sql

In this example, all files created during the development of version 03 are stored in a folder. However, the folder may be common to all versions of the application.

In any case, the very first file to appear in such a folder is the baseline (Baseline.sql). After that, any change to the database is committed to the repository as a new migration file with a name of the form [file number].[major version].[minor version].sql .

In fact, in this example the file name contains the full version number of the database. That is, after the migration file named 0006.03.02.sql is executed, the database is updated from the state corresponding to version 03.02.0005 to version 03.02.0006.

Version history storage


The next step is to add a special table to the database in which the history of all changes in the database will be stored.
CREATE TABLE MigrationHistory
(
    Id INT NOT NULL AUTO_INCREMENT,
    MajorVersion VARCHAR(2),
    MinorVersion VARCHAR(2),
    FileNumber VARCHAR(4),
    Comment VARCHAR(255),
    DateApplied DATETIME,

    PRIMARY KEY (Id)
)


This is just an example of what such a table might look like; if necessary, it can be simplified or extended.

In the Baseline.sql file, you will need to add the first entry to this table:
INSERT INTO
MigrationHistory ( MajorVersion, MinorVersion, FileNumber, Comment, DateApplied )
VALUES ( '03' , '01' , '0000' , 'Baseline' , NOW() )


After each migration file is executed, a record with all the migration data is added to this table.
The current database version can be obtained from the record with the latest date.
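
For example, a minimal sketch of such a query in MySQL:
SELECT MajorVersion, MinorVersion, FileNumber
FROM MigrationHistory
ORDER BY DateApplied DESC
LIMIT 1;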

Automatic migration


The final touch in this approach is a program / script that will update the database from the current version to the latest.

Automating the migration is fairly simple: the number of the last applied migration can be read from the MigrationHistory table, and then all that is needed is to apply every file with a higher number. Since the files can be sorted by number, the order of migration poses no problems.

Such a script also has the task of adding records of completed migrations to the MigrationHistory table.

As an added convenience, such a script can also create the current version of the database from scratch, by first applying the baseline and then performing the standard migration to the latest version.
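
In SQL terms, the bookkeeping such a script performs might look like this (a sketch; the file number and comment text are only illustrative):
-- Determine the number of the last applied migration:
SELECT MAX(FileNumber) FROM MigrationHistory;

-- After successfully applying, say, 0003.03.01.sql, record that fact:
INSERT INTO
    MigrationHistory ( MajorVersion, MinorVersion, FileNumber, Comment, DateApplied )
VALUES ( '03', '01', '0003', 'Description of the change', NOW() );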

Pros, Cons, Conclusions


Quick and easy migration to the latest version;
Version numbering mechanism. The current version number is stored directly in the database;
For maximum convenience, automation tools for migrations are needed;
It is inconvenient to comment on the database structure. If comments are added to Baseline.sql, they will be lost in the next version, because the baseline will again be generated from scratch as a dump of the new structure; besides, such comments would quickly become obsolete;
There are problems with parallel development in several repository branches. Since migration files are numbered sequentially, files with different DDL/DML queries can appear under the same numbers in different branches. As a result, when merging branches you will have to either manually rework the files and their order, or start a new, "merged" branch with a new Baseline.sql that incorporates the changes from both branches.

This method, in various forms, is quite widespread; moreover, it is easy to simplify and adapt to the needs of a project.
Ready-made scripts for incremental migration can be found on the Internet and embedded in your project.


Idempotent change method

This method is described in the Phil Haack article Bulletproof Sql Change Scripts Using INFORMATION_SCHEMA Views. A similar approach is also described in an answer to this question on StackOverflow.

Idempotency here refers to the property of an object to remain unchanged under a repeated attempt to change it.
On this topic, a funny scene from "Friends" comes to mind :)

The main idea of this approach is to write migration files so that they can be executed against the same database more than once: the first time any of the SQL commands is executed, the changes are applied; on all subsequent attempts, nothing happens.

This idea is easiest to understand by example. Suppose you need to add a new table to the database. If you want the query to succeed even when the table already exists, MySQL has a short syntax for that purpose:
CREATE TABLE IF NOT EXISTS myTable
(
id INT (10) NOT NULL ,
myField VARCHAR (255) NULL ,
PRIMARY KEY (id)
);

Thanks to the IF NOT EXISTS clause, MySQL will attempt to create the table only if a table with that name does not already exist. However, this syntax is not available in all DBMSs, and even in MySQL it cannot be used with every command. Therefore, let us consider a more universal way:
IF NOT EXISTS
(
SELECT *
FROM information_schema.tables
WHERE table_name = 'myTable'
AND table_schema = 'myDb'
)
THEN
CREATE TABLE myTable
(
id INT (10) NOT NULL ,
myField VARCHAR (255) NULL ,
PRIMARY KEY (id)
);
END IF ;

In the last example, the role of the condition is played by a query that checks whether the table myTable exists in the database myDb. The table will actually be created only if it is missing. Thus, the query above is idempotent.

It is worth noting that MySQL, for some reason, forbids executing DDL queries inside conditional statements. But this restriction is easy to work around: it is enough to put all such queries into the body of a stored procedure:
DELIMITER $$

CREATE PROCEDURE sp_tmp() BEGIN

IF NOT EXISTS
(
    --
    -- A query that checks whether the change has already been applied.
    --
)
THEN
    --
    -- Queries that change the database structure.
    --
END IF;

END;
$$

DELIMITER ;

CALL sp_tmp();

DROP PROCEDURE sp_tmp;
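
For instance, a filled-in version of this template for adding the newfield column used in the examples below might look like this (myDb and the column definition are assumptions for illustration):
DELIMITER $$

CREATE PROCEDURE sp_tmp() BEGIN

IF NOT EXISTS
(
    -- Check whether the column already exists:
    SELECT *
    FROM information_schema.columns
    WHERE table_schema = 'myDb'
      AND table_name = 'myTable'
      AND column_name = 'newfield'
)
THEN
    -- Apply the change only if it has not been applied yet:
    ALTER TABLE myTable ADD COLUMN newfield INT(4) NOT NULL;
END IF;

END;
$$

DELIMITER ;

CALL sp_tmp();

DROP PROCEDURE sp_tmp;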


What kind of beast is information_schema?


Complete information about the database structure can be obtained from special system tables located in the database named information_schema . This database and its tables are part of the SQL-92 standard, so this method can be used on any of the modern DBMSs. The previous example uses the information_schema.tables table, which stores data about all the tables. Similarly, one can check the existence and metadata of table fields, stored procedures, triggers, schemas, and, in fact, any other objects in the database structure.

A complete list of these tables, with detailed information on their purpose, can be found in the text of the standard. A short list appears in the Phil Haack article already mentioned above. But the easiest way, of course, is simply to open this database on any working database server and look around.
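
For example, to see every table of the hypothetical myDb database along with its type and storage engine, one query is enough:
SELECT table_name, table_type, engine
FROM information_schema.tables
WHERE table_schema = 'myDb';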

Usage example


So, you know how to create idempotent SQL queries. Now we will consider how this approach can be used in practice.

An example of how a folder with SQL files might look in this case:
Database
|- 3.01
| |- Baseline.sql
| '- Changes.sql
|
'- 3.02
|- Baseline.sql
'- Changes.sql

In this example, a separate folder is created for each minor version of the database. When each new folder is created, a baseline is generated and written to Baseline.sql. Then, during development, all necessary changes are written to the Changes.sql file in the form of idempotent queries.

Suppose that during the development process at different times programmers needed the following changes in the database:
a) create a table myTable;
b) add a newfield field to it;
c) add some data to the myTable table.

All three changes are written so that they will not be applied twice. As a result, no matter which intermediate state the database is in, executing the Changes.sql file always migrates it to the latest version.

For example, one of the developers created the myTable table on his local copy of the database, committed change a) to the Changes.sql file in the shared code repository, and forgot about it for a while. Now, if he executes this file against his local database, change a) will be skipped, and changes b) and c) will be applied.
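
Change c), adding data, can be made idempotent as well. A minimal sketch, assuming a hypothetical row with id = 1 as the data being added:
-- Insert the row only if it is not there yet:
INSERT INTO myTable (id, myField)
SELECT 1, 'initial value'
FROM DUAL
WHERE NOT EXISTS ( SELECT 1 FROM myTable WHERE id = 1 );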

Pros, Cons, Conclusions


Very convenient migration from any intermediate version to the latest: you only need to execute one file (Changes.sql) against the database;
Situations in which data is lost are possible and have to be watched for. An example is deleting a table and then creating another table with the same name: if only the name is checked, both operations (deletion and creation) will be repeated every time the script is executed, even though they have already been performed once;
Making changes idempotent requires extra time (and code) when writing them.

Because updating the database to the latest version is very simple, and can even be done manually, this method shows itself favorably when you have many production servers and need to update them frequently.


Method of treating the database structure as source code

Unfortunately, I did not find separate articles on this approach; I would be grateful for links to existing ones, if there are any. UPD: In his article, Absent describes his experience implementing a similar approach using a hand-written diff utility.

The main idea of this method is reflected in the heading: the database structure is source code, just like PHP, C#, or HTML code. Therefore, instead of storing migration files (with queries that change the database structure) in the code repository, you store only the current database structure, in declarative form.

Implementation example


For simplicity, we will assume that each revision of the repository always contains exactly one SQL file: CreateDatabase.sql. As an aside, I note that the source-code analogy can be taken even further, with the structure of each database object stored in a separate file. The structure can also be stored as XML or in other formats supported by your DBMS.

The CreateDatabase.sql file will store the CREATE TABLE , CREATE PROCEDURE , etc. commands that create the entire database from scratch. When changes to table structure are needed, they are made directly in the existing DDL queries that create the tables. The same applies to changes in stored procedures, triggers, and so on.

For example, in the current version of the repository, there is already a table myTable, and in the CreateDatabase.sql file it looks like this:
CREATE TABLE myTable
(
id INT (10) NOT NULL ,
myField VARCHAR (255) NULL ,
PRIMARY KEY (id)
);

If you need to add a new field to this table, you simply add it to the existing DDL query:
CREATE TABLE myTable
(
id INT (10) NOT NULL ,
myField VARCHAR (255) NULL ,
newfield INT(4) NOT NULL ,
PRIMARY KEY (id)
);

After that, the modified sql file is submitted to the code repository.

Performing migrations between versions


With this method, the procedure for updating the database to a newer version is not as straightforward as in the other methods. Since only a declarative description of the structure is stored for each version, a difference in the form of ALTER , DROP , and CREATE queries has to be generated for each migration. Automatic diff utilities will help here, such as the Schema Synchronization Tool that is part of SQLyog ; TOAD , available for many DBMSs; Dklab_pgmigrator for PostgreSQL by Dmitry Koterov ; and the SQL Comparison SDK from RedGate.

To migrate from one database version to another, you will have to restore the structure of the original and final versions on two temporary databases, and then generate a migration script. However, this procedure can be automated and should not take much time.
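
For the myTable example above, the generated migration script might boil down to a single statement:
-- Generated difference between the two versions of CreateDatabase.sql:
ALTER TABLE myTable ADD COLUMN newfield INT(4) NOT NULL;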

How to deal with data changes?


From time to time, when updating the database version on production servers, you need to update not only the database structure, but also the data stored in it. A typical example is transferring data from a table with the old structure into new tables for the purpose of normalization. Since the data on the production servers already exists and is in use, it is not enough to simply create the new tables and drop the old ones; the existing data must also be transferred.
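
A minimal sketch of such a transfer, assuming a hypothetical denormalized orders table whose customer_name column is being extracted into a separate customers table (with id and name columns, and a new customer_id column already added to orders):
-- Populate the new lookup table from the existing data:
INSERT INTO customers (name)
SELECT DISTINCT customer_name FROM orders;

-- Point each order at its customer:
UPDATE orders o
JOIN customers c ON c.name = o.customer_name
SET o.customer_id = c.id;

-- Only after that can the old column be dropped:
ALTER TABLE orders DROP COLUMN customer_name;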

In the previous methods, as far as storing and executing migrations is concerned, data differed little from the database structure. But in this method, data changes stand apart: they cannot be stored in the code repository in declarative form, because the data differs from server to server. Nor can such data-changing queries be generated automatically: human intervention is required.

This problem has several more or less acceptable solutions; the most common one is to store the data-change scripts separately and merge them into the generated migration scripts by hand (see the conclusions below).

Pros, Cons, Conclusions


It is convenient to observe changes in the structure between versions using the means of the version control system;
Like any source code, the structure of the database is convenient to comment;
In order to create from scratch a clean database of the latest version, you need to execute just one file;
Migration scripts are more reliable than in other methods, since they are generated automatically;
Migrating from newer versions to older ones is almost as easy as from old to new (problems can arise only with the data changes described above);
When two repository branches are merged, merging the database structure is simpler than with the other approaches;
Data changes will have to be stored separately, and then manually inserted into the generated migration scripts;
Manually performing migrations is very inconvenient; automated tools are needed.

This method has many positive qualities. If you are not afraid of the described problems with data changes, and if production server updates are rare, I recommend using this method.


Ready-made solutions for versioned database migration

The methods described above can be used without any third-party solutions; however, there are also ready-to-use products, each with its own ideology and original approach, worthy of separate articles. When choosing a versioned migration solution, be sure to consider such products as well.

Some of them are discussed in a recent article “Approaches for versioning databases” by Denis Gladkikh .

Only a small sample of the ready-to-use version migration systems can be listed here; the Active Record migrations mentioned at the beginning of this article are one well-known example.

Finally

There are many different ways to store and apply changes to databases. Which one you choose for your project is not that important; the main thing is to adopt one of the methods and follow it consistently. The rest is details. This is the idea I tried to convey in this article, along with a classification of the simplest methods and my own thoughts, in the hope that they will help you choose the solution that suits you best.

Source: https://habr.com/ru/post/121265/

