SQL and relational databases have been with us for over forty years. During this time the SQL standard has gone through many revisions, and the development process apparently does not stop there. Relational databases have reigned supreme as data stores for decades, and they still reign today; only recently have alternative approaches begun to crowd them a little.
SQL is almost omnipotent when it comes to data retrieval. (Not everyone knows this, but a single SQL query can even draw a Mandelbrot set.) Yet one problem remains conceptually unsolved in it: the problem of data schema migration.

I think at various stages of our IT careers we have all faced how difficult it can be to keep the structure of a production database under control and update it as new versions of the software are deployed. Bugs that return after they seemed to have been fixed, "no such column in the table" errors, complaints like "I fixed that stored procedure, and you overwrote it!" - does this sound familiar?
The acuteness of this problem is especially clear when contrasted with how thoroughly the task of source code version control has been solved: systems like Git or Subversion coordinate the joint work of many developers, never lose a single change, record every move, and assemble the results of the team's work together. But this blissful picture ends where the applicability of version control systems ends. When it comes to the structure of the data schema, the usefulness of version control systems is still very limited.
It is so easy and tempting to change the structure directly on the database! If such changes go undocumented, a catastrophe is brewing the moment the system needs further development. But even after we realize the need to document every change to the database structure, we find that doing so is very inconvenient, and the DDL language for defining the data schema does not help us here. For example, adding a column to a table requires an ALTER TABLE ... ADD command. It makes sense to execute this statement exactly once, at the very moment when the column is not yet in the table but is already needed. Running the script at any other time will either fail outright or, worse, leave the database structure in an incorrect state.
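For illustration, a minimal sketch (the table and column names here are hypothetical):

ALTER TABLE OrderHeader ADD customer_email VARCHAR(100);
-- Running the same script a second time fails on most databases with something like:
-- ERROR: column "customer_email" of relation "OrderHeader" already exists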
How do we aggregate all the changes made by the developers into a single result, and how do we control its rollout?
Approach #1: accumulating change scripts
A practical way to solve this task by hand, without any additional tools, is to keep a folder of numbered scripts that modify the database structure, accumulate new scripts in that folder, and enforce the strictest discipline among the project's developers. The procedures for recording changes in the next script file and for "rolling" those changes onto a database must be formalized.
A more advanced and convenient solution is to use a system such as Liquibase. Liquibase keeps track of the number of the last data structure change that was applied and ensures that the structure modification scripts are executed linearly, one and only one time each. This is already a lot, and using Liquibase definitely eliminates the chaos that otherwise reigns in the development and updating of the data schema.
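As a rough illustration, here is what a changelog might look like in Liquibase's SQL format (the author name and statements are made up for this sketch):

--liquibase formatted sql

--changeset ivan:1
CREATE TABLE OrderHeader (
    id VARCHAR(30) NOT NULL,
    customer_name VARCHAR(100),
    CONSTRAINT Pk_OrderHeader PRIMARY KEY (id)
);

--changeset ivan:2
ALTER TABLE OrderHeader ADD customer_email VARCHAR(100);

Liquibase records each applied changeset (by id, author and checksum) in its DATABASECHANGELOG tracking table and skips changesets that have already run, so every change is executed exactly once.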
Nevertheless, Liquibase and other tools based on accumulating change scripts do not make modifying the data schema as easy and routine as modifying the source code of your application.
First, the change scripts keep piling up. A long-running project drags along a big "tail" of such scripts, most of which have long lost their relevance, because they describe changes made long ago. Keeping the whole "tail" is pointless, since a change in one place in the log may cancel out a change in another: for example, during development we tried some option and abandoned it, yet both changes stay in the changelog forever. Looking at the sprawling log, it becomes impossible to see our current understanding of the database structure. And if we want to reproduce the database structure "from scratch" on, say, a completely new project, the script run will have to repeat the entire path of its evolutionary development!
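A sketch of such a pair of mutually cancelling changesets (hypothetical column, SQL-format changelog again):

--changeset ivan:37
ALTER TABLE OrderHeader ADD discount_percent DECIMAL(5,2);

--changeset ivan:52
ALTER TABLE OrderHeader DROP COLUMN discount_percent;

Both entries must stay in the changelog forever, even though together they change nothing.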
Second, with this approach there is a huge gap between the complexity of modifying the database structure through changesets and the ease of modifying the program's source code.
Suppose you have code describing a class in your favorite programming language, and while improving the system you need to add one method to the class and remove another. What do you do? You simply edit the source code of the class and commit it to version control, don't you? You are not forced to create a changeset with commands like these:
alter class drop method foo;
alter class add method bar(…) {
…
}
Why do we have to do this for the data schema structure?
The reason, of course, is that the database also contains data, and something has to be done with that data when the structure changes. We will return to this problem, but first let's look at another possible approach to solving our task.
Approach #2: an idempotent DDL script
Problems similar to those that arise when migrating a database schema have long been familiar in server configuration.
How do we automate the installation of the required software on a server and keep its configuration up to date? We can write a shell script that, when executed on an empty virtual machine, installs and configures everything on it. This is the analogue of a database creation script (a CREATE TABLE ... script), and its problem is the same: it can be executed only once and only on an empty machine. If the machine is already deployed and running, and the new system specification requires, say, a different Java version, what do we do? Add a change script that removes the old version and installs the new one? And if we then need to reproduce the configuration on an empty machine, do we really have to walk through every step we have historically gone through?
The key question that arises here is: can we edit the infrastructure or the data schema as easily as we edit source code, by directly changing its description and committing it to version control?
For the server configuration problem, the answer to this question was the emergence of the Infrastructure as Code (IaC) principle and a whole class of configuration management systems: Ansible, Chef, Puppet, etc. All systems of this type rest on two main principles: idempotency and convergence. Both terms are borrowed from mathematics. Setting aside unnecessary formalism, for our purposes they mean the following:
- An idempotent and convergent script describes not the actions to be performed on an object, but the state the object should be brought to.
- Idempotency means that if such a script has executed successfully and the object is already in the desired state, re-running the script changes nothing and has no effect. For example, if a configuration management script declares which packages must be installed and those packages are already present, re-running the script simply performs no operations.
- Convergence means that if the script did not run or failed part-way, re-running it moves the system toward the desired state. For example, if installing one of the packages failed because the network went down while the package was being downloaded, re-running the script installs the missing package (and the packages installed last time stay in place).
These principles are general and go beyond software configuration of machines. For example, Terraform lets you describe the required set of virtual servers and their hardware configuration as code and keep your fleet of virtual machines in the cloud in that state.
I think this is enough to see how these principles could help us control the data schema. Suppose that our database supports a keyword CONVERGE and we have the following script:
CONVERGE TABLE OrderHeader(
    id VARCHAR(30) NOT NULL,
    date DATETIME DEFAULT GETDATE(),
    customer_id VARCHAR(30),
    customer_name VARCHAR(100),
    CONSTRAINT Pk_OrderHeader PRIMARY KEY (id)
);
The keyword CONVERGE should be read as "bring the table to the desired structure". That is: if the table does not exist, create it; if it does exist, see which fields it has, with which types, which indexes, which foreign keys, which default values, and so on, and change whatever is needed to bring the table to the desired form.
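For example (a purely hypothetical scenario), if the live database already contained OrderHeader but without the customer_name column and with customer_id declared as VARCHAR(20), a convergent tool would compute and execute only the minimal difference, roughly:

-- exact ALTER syntax varies by database; MS SQL Server flavour shown
ALTER TABLE OrderHeader ADD customer_name VARCHAR(100);
ALTER TABLE OrderHeader ALTER COLUMN customer_id VARCHAR(30);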
If databases supported such a keyword, that is, if it were possible to write idempotent and convergent DDL scripts, there would be no need for this article. We would all simply keep "CONVERGE TABLE" scripts describing the currently desired data schema under version control and work with them the same way we work with source code: need a new field in a table - add it; need a different set of fields in an index - edit it. (I hear your question: what about data migration? Be patient, I will get to that soon.)
Unfortunately, the world of relational databases is not yet moving towards support for truly idempotent DDL. All that databases have done so far in this direction is the CREATE ... IF NOT EXISTS construct, which is, frankly, a rather weak attempt. A CREATE TABLE IF NOT EXISTS script does behave idempotently on the surface (it will not raise an error if the table already exists), but not convergently (it will not modify the structure of a table that has already been created).
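A small sketch of this limitation (PostgreSQL-style syntax):

CREATE TABLE IF NOT EXISTS OrderHeader(
    id VARCHAR(30) NOT NULL,
    customer_name VARCHAR(100),
    CONSTRAINT Pk_OrderHeader PRIMARY KEY (id)
);
-- If OrderHeader already exists without customer_name, this statement succeeds
-- silently and the column is still missing: idempotent, but not convergent.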
We have to rely on external tools. Idempotent and convergent DDL is available, for example, in the Celesta system. So that the script looks like regular DDL to developers and development tools (for example, visual ER-diagram editors), Celesta uses the keyword CREATE, although in Celesta it carries the meaning of the hypothetical CONVERGE. At each start, Celesta compares the current structure of the database it is connected to with the desired structure described in a CelestaSQL DDL script and, if necessary, executes the minimum required series of CREATE / ALTER / DROP commands. At present four database types are supported: PostgreSQL, Oracle, MS SQL Server and H2 (the latter mainly for unit testing).
An idempotent script defining the database structure cannot simply be executed linearly. As you know, database objects depend on one another: tables reference each other through foreign keys, views and indexes depend on tables, and so on. Therefore, before running a series of create/rebuild operations, the objects must be ordered by their dependencies: formally speaking, a topological sort of the dependency graph is performed, and the objects are then processed starting with those that depend on nothing else. It is also often necessary to drop foreign keys temporarily and restore them after modifying the tables they connect. Celesta solves this problem, which lets the upgrade run smoothly in the vast majority of cases.
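A rough sketch of that foreign-key dance, continuing the hypothetical example (OrderLine is an invented child table; exact syntax varies by database):

-- drop the key that would block the change...
ALTER TABLE OrderLine DROP CONSTRAINT Fk_OrderLine_OrderHeader;
-- ...modify both sides of the relationship...
ALTER TABLE OrderHeader ALTER COLUMN id VARCHAR(36) NOT NULL;
ALTER TABLE OrderLine ALTER COLUMN order_id VARCHAR(36) NOT NULL;
-- ...and restore the key once the tables agree again
ALTER TABLE OrderLine ADD CONSTRAINT Fk_OrderLine_OrderHeader
    FOREIGN KEY (order_id) REFERENCES OrderHeader(id);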
Data migration
So what about transforming the data, since a simple ALTER is not always enough? What if, for example, we want to add a NOT NULL field without a DEFAULT value to a non-empty table? Unless such a field is pre-filled with data, the database will refuse to execute the ALTER TABLE ... ADD statement. What if we want to add a foreign key, but not all the data in the table satisfies the constraint? What if the application logic has changed and we want to move data from one column to another?
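To take just the first case (hypothetical column again): on a non-empty table this statement is rejected, because the existing rows would have no value for the new column:

ALTER TABLE OrderHeader ADD processed_by VARCHAR(30) NOT NULL;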
All these questions are perfectly valid, but first note that most of the changes you make to the database while developing your application require no migration at all: a simple ALTER is sufficient. You do not need to migrate data if you simply add a new table, or a new column to a table (nullable or with a DEFAULT value). You do not need to migrate data when you add or rebuild an index. Nothing needs to be migrated when a view's query changes. Practical experience with Celesta shows that the vast majority of changes developers make are exactly of this kind.
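A few sketches of such migration-free changes (hypothetical names; on the major databases, adding a NOT NULL column with a DEFAULT back-fills the existing rows):

-- new nullable column: existing rows simply get NULL
ALTER TABLE OrderHeader ADD delivery_note VARCHAR(255);
-- new column with a default value
ALTER TABLE OrderHeader ADD status VARCHAR(10) NOT NULL DEFAULT 'new';
-- new or rebuilt index: no data is touched
CREATE INDEX Idx_OrderHeader_customer ON OrderHeader(customer_id);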
If data migration really is necessary, then yes: you have to write and execute a one-time migration script. But there is no need to keep this script "for the ages", and the story of debugging and applying it is much simpler than in systems built around a changelog.
Consider adding a foreign key to a non-empty table in which not all records satisfy the new constraint. During the upgrade Celesta will try to create the key with an ALTER TABLE ... ADD CONSTRAINT ... FOREIGN KEY command. If it succeeds, fine; if not, the process stops and Celesta reports which object it could not fully update, for what reason, and with what error message from the database.
For changelog systems there is nothing worse than a changeset that ran halfway and got stuck in an intermediate state: the schema ends up "between" two revisions, and the situation can only be resolved manually. The lack of transactional DDL rollback in many databases adds further difficulties.
For "convergent" systems, unlike "changelog" systems, an update that did not fully complete is not a problem: the system compares the actual current state of the database with the desired one in order to generate the ALTER commands, and changes that were not applied on one attempt it will try to finish on the next.
Faced with a situation in which the update cannot be applied automatically (a rare situation, I repeat), you can write a one-time script - say, one that fills the reference table with the necessary data and thereby creates the conditions for Celesta to complete the update automatically. This script can be debugged on a fresh copy of the production database, then executed on the production database, after which the Celesta update is run.
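A minimal sketch of such a one-time script, assuming a hypothetical Customer reference table that the future foreign key from OrderHeader.customer_id should point to:

-- insert placeholder rows for customers referenced by orders but missing from Customer
INSERT INTO Customer (id, name)
SELECT DISTINCT oh.customer_id, oh.customer_name
FROM OrderHeader oh
WHERE oh.customer_id IS NOT NULL
  AND NOT EXISTS (SELECT 1 FROM Customer c WHERE c.id = oh.customer_id);
-- after this, the convergent update can create the constraint automatically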
After all this, your script can simply be thrown away, because you will never need it again! Your working databases are already in the structural state you need, and if you plan to create a new database from scratch, there is no reason to force it to walk the whole path you took while modifying the structure during development.
Perhaps this approach will seem less reliable to some than using changelog systems, which force you to think through the data migration steps every time you change the structure and to embed those steps in a changeset. But on reflection it becomes clear that the reliability of changelog systems in this regard is illusory. As we know, there are no programs without potential errors, and this rule applies to data modification scripts as well. The fact that a changeset was debugged against an existing data set and worked correctly does not guarantee with 100% certainty that it will run without errors on any data. With idempotent DDL, at least we do not declare a data modification script immutable and protected by a revision's checksum attribute. If an error does occur, we can simply repeat the update attempts until the system converges to the desired structure. And your data will not be lost, since Celesta never automatically drops columns or tables containing data, leaving that operation to be performed manually.
* * *
Unfortunately, CelestaSQL can only be used together with the Celesta system for building business logic code (no longer true - see the update!), so for your project today I would, perhaps, recommend Liquibase. However, Celesta is an open project, and one possible direction of its development is a general-purpose tool for migrating database structures.
Although personally I would prefer that database vendors eventually implement real support for idempotent and convergent DDL.
UPD: Since the beginning of 2018, CelestaSQL can be used for idempotent updates of an arbitrary database via the 2bass system.