Approaches to versioning database changes

Much better disciplinary restrictions to remove instrumental expansion
The author of the article

Introduction

When developing an information system, that is, a program aimed at storing, working with data, processing, analyzing and visualizing a database, one of the cornerstones is the task of developing a database. When I first began to ask myself this question, it seemed that whatever I did, it would still be crooked.

During 5 years of development of several corporate IS, I have set and tried to solve questions, how to make this or that aspect of database development convenient. I was looking for tools to help do something with the database, methodology. Surprisingly, there are few developments in this area. And in each approach you can immediately see - this is impossible, it will be inconvenient here, there are too many disciplinary rules (see the epigraph) ... In this article I tried to collect those trips that I consider most effective, and one, in addition to the collected ones, will present the crown of my quest, which I consider the most "armor-piercing."

Evolution of the problem

I will give an example of how a database development task can become a problem and evolve. If the global formulation of the problem with all the features is interesting at once, you can skip this paragraph.

There is one instance of the DB, it is prod, and the users work on it, and the programmers watch and change the data and the circuit. It seems comfortable, everything is very simple, nothing superfluous, Occam's razor and all that.
Hmm, but programmers program right on the prod right away - and if they do something wrong? Will they overwrite the data, or delete something important? Yes, or simply load the server in search of the desired optimization. Any solution to such an already existing situation will be bad for users. No, it would be necessary to separate the database for dev development from prod. The question arises - how to transfer changes from dev to prod?
Hmm, how can we keep track of when and what has changed in the database? We follow the application code, and the code of the stored procedures and everything else in the database? Will we script some scripts with alter 's? And how to create create ? After all, he is repeatedly non-rolling, unlike alter 'a.
Hmm, all the programmers are on the same dev-server, it’s a test for users. That is, testing users may stumble upon what the programmer has not yet completed, which should not be visible to the user. Or just broken. It would be necessary to separate the dev with test.
(in fact, many IP can not pay much attention to this problem)
But the same situation with programmers - if two programmers change the database, then there may be some artifacts. One changes something, and the application to this is not yet ready - and for all the project does not start. It would be necessary for each programmer on the DB. So that each has its own sandbox. And when the database is very large?
Oops, how do you transfer all changes to prod now? If you use the comparison schemes, then the database of one programmer will overwrite the changes of another programmer. And how to transfer changes from one programmer to another instance? Transfer backup from prod to each programmer? And then change it as a roll on this backup? No, nonsense, somehow here some scripts should play a role.
Wait, what about the data that should be the same for all instances (directories and persistent directories)? Will they be transferred with backups? Or still script insert 's or merge ' s and version scripts? And how to run them - always all at once, or is it only those that have changed? And how to regulate the order?
And it seems that it happens that there are such data and procedures that should be different in different instances! Settings, replication, communication with other services. With sharding . How to deal with them? Transfer of backup and subsequent change script, its for each instance? This is possible, but not all differences can be done - for example, memory optimized tables cannot be converted to regular tables, not counting the fact that it can be impossible to raise backup with them when the test server or programmer instance does not have enough memory .
By the way, I ran into this when I needed to make memory optimized tables for 20 GB, and even when the prod had a whole system of synchronization with another system, which should not be on other instances.
Mmm, and how to make it so that you can easily raise a whole new DB instance at once? What if all the settings of both the database and SQL Server are scripted and versioned?
This is convenient when each programmer has his own SQL Server, and not when the server is one with a bunch of databases. Or sharding. Or a new test loop for a new big business feature.

Some of the points of this evolution may seem redundant and generally inventing problems for yourself, but when solved with a convenient tool, database development will become almost as simple and painless as the development of ordinary application code.

Having collected in one heap all possible problems associated with the development of the database, which I saw when developing information systems, I came to a global formulation of the problem statement in my vision.

Full universal formulation of the problem

Tasks and problems that may arise during the development of the database:

Must be versioned:
A. DB schema. That is, you need to be able to save the history of database schema changes in any form.
This is the most basic goal , on which all other possibilities are hung. Versioning the database schema is the cornerstone of database development. There may be two ways. You can save the flow of changes in all database objects with one global stream, and you can somehow version each specific object. I think (see p6) that the second way is in the end much more practical.
B. Some data: dynamic references, persistent references (persistent, static).
The goal is important if the application code depends on this data (then they will be persistent). That is, if a table corresponds to one enum in the code. Or, otherwise, the goal is secondary, but very useful for controlling the addition / change of static and not-so data, as well as for automating the deployment of database changes and release of different contours (P2) - if there are more than one.
C. Data conversion scripts (migration scripts). They can be associated with either a change in the database schema or a business task. For such changes, order is important, and sometimes repeated execution is required.
Not all IP such scripts will exist. But, if they exist, this goal is required .
D. DB Settings.
Examples of important database settings that should be versioned - if the service broker should be enabled (it is required for sql dependency ), change tracking (it can be used for more complex caches), the read_committed_snapshot option should be the required user database for the application, setting up error alerts , setting support for memory optimized tables . The advantage of saving and versioning database settings is that you can easily make a new database that works completely the same as the main one. And that in a convenient form, you can conduct a code review , coordinate changes. And also just to have a story - what, when and by whom it was made.
E. Some other objects of SQL Server are jobes, some server settings, alerts, logins, linked servers (for integration with other systems).
This may be necessary when the test circuits must completely repeat the prod, along with the processing in the jobs. That seems like a completely normal situation.
In addition, if all the settings, Jobs, etc. are scripted, then you can easily pick up a new contour. I launched one script - and the entire new SQL environment is fully working.
Many contours are needed: one prod (or maybe not one! - for example, when sharding, or when the product is boxed and each client has its own database), many test ones are not just two test contours (prod-test, dev-test) , but also along the contour for each programmer, or along the contour for a large feature.
A. The support of many contours means that it must be possible to change one contour into any other contour, without spoiling it and not wiping out the changes that it already may also have.
That is, if each programmer has his own sandbox database, changes to one programmer need to be transferred not only to prod, but to all other instances. Even if there another programmer did something, too.
B. It happens that the circuit may differ slightly in different circuits - for example, additional objects (views, storages) are needed in the test environment, and maybe vice versa, a piece of database, integration with another system, etc. should be excluded from some test circuit. Also in the case when there is memory optimized tables in one instance, and in the other they do not fit in memory and these tables should be normal.
C. In different circuits, the versioned data may be different - for example, application launch settings, access to other systems. Also test data for test contours.
You can draw an analogy with the case when different programmers, or on different publish-host applications must have different app.config files - which is achieved using the Visual Studio app config transformation feature. Similarly, something in the database may differ in different instances.
D. Same for database and server settings. For example, on prod / prod-test, all jobs are needed, and on local databases, programmers do not have everything. Support for memory optimized tables - on prod'e needed, but on other circuits may not be.
E When restoring a backup from a prod to another instance, you need to be able to bring the restored database into compliance with the desired path.
Sometimes this may not be possible (again, if there are memory optimized tables ), but if it is possible, then this is a very handy feature when you need to transfer all the data from one DB Instance to another, which is most easily done by transferring backup.
Tasks related to the database version:
A. When a programmer makes changes to the database, not all of which need to be in the prod, you need to leave unnecessary ones - if the programmer made them for his test purposes - and transfer them to the prod (and other instances) only necessary.
It’s convenient for a programmer to leave everything he wants in his sandbox, although transferring to other instances is exactly what is needed. That is, if he has made for himself some kind of convenient and necessary view only for him, a procedure, or even a table - this should not fall on prod.
This clause is almost equivalent to 2bc, but here it means that there may be database objects that can be not versioned, but simply manually created in your sandbox.
B. Nakat to some specific version of the database (larger than the current on the selected contour).
This is necessary if the development of the database is separate from the development of the application. If the database changes go along with the application change, then there should be no different future versions of the database. That is, there should be an “not ready / spoils / not supported by the application approach — don't commit, at least to the common repository branch”. And if there are changes in the database, they should be backward compatible. For example, deleting / renaming a column should be accompanied by a corresponding change in the application, and adding should not be necessary. In article 3 ( Evolutionary database design ), the author calls such changes destructive - I agree with this term.
C. ~~Rollback to the old version~~ , by analogy with a simple git checkout with the usual programming of the application - is impossible in general. Because, in the presence of data in the tables, it is impossible to convert them to the old (that is, arbitrary) scheme.
Sometimes it can be useful to create a clean database with this schema and given reference books (or without reference books) - for tests, for example.
A. When testing automation ( continuous integration ), you can create a fresh, clean database for running tests, or with filling with random data (I don’t know free tools for this, redgate , devart have paid tools ) or a subset of working data ( Jailer can help with this).
In practice, real benefit from this can be extracted only if this process is simple and quick as a click of a finger - otherwise no one will do it. After all, you can use the usual test database.
If some change has already been done, in the normal mode, provide for it not to be rolled again, or make the changes idempotent.
In practice, the rule “make all changes idempotent” is too disciplinary. Much better disciplinary restrictions to remove instrumental expansion.
For each database object you need to be able to see the history of its changes.
A. For control, you need to be able to see what changes will be rolled up in specific SQL scripts (“cumulative scripts”).
B. Code review is highly desirable. Moreover, an explicit alter statement is preferable to comparing create table statements (on the basis of which diff is made and subsequently rolled), since it is better to control actions from the database, rather than the declaration. And for procedures and similar objects, you must be able to see the diff of the body.
Used tools need licenses. With the shared db approach (which can provoke problems and conflicts of changes - see p2), one is enough (when only one specially selected person rolls back changes), and at the approach, when each programmer has his own database, each programmer is licensed.

In my practice, I was guided by the following points:

1 and 2 - from cover to cover;
3a;
3b did not use it, because always in my projects the databases with the application were a complete one system and were fully developed together;
4 - I tried to use it, but the tests as a whole did not take root very well because they require an additional resource, well, or paradigm rearrangements for TDD;
5 - required;
6 - important, although quite rare;
licenses were not required because the tool for the chosen approach is free.

So I consider all items as mandatory except 3b and 4.

Approaches

I came to the conclusion that it is advisable to highlight the approaches:

Comparison of target database and source database.
Comparing the scripted schema (and data) with the target database.
Based on sequential (incremental) manual SQL scripts.
Based on manual independent SQL scripts whose structure repeats the DB schema.

In this list, approaches are sorted by increase in utility.

1 Comparison of target database and source database

Tool

Redgate SQL Compare . There is also http://compalex.net/ , but it only works with php. There are other tools for comparing database schemas.

Methodology

In addition to the prod database, it is the target database, the dev database is being made, it is the source database.

Changes are made in the source database in some way. Moreover, this source database is not test in the conventional sense, because you cannot do anything with it - it is understood that all changes (at least, changes to the database schema) must be transferred to the target database. Further, these changes can be scripted, but these scripts are not used in any way afterwards, because if you use them and roll them in any way, then the whole point of the approach disappears, comparing the schemes becomes meaningless. These scripts can only play the role of a change history. But which may be different from reality, because you can change something visually in Management Studio (or another GUI for the database) and forget to script it. Or scripted wrong. Then, at the moment of deployment to the target database, a diff script is created (with the help of the tool) that rolls in, bringing the target database to a state of equal circuit with the source.

pros

For those who are afraid to write sql. Changes in the source database can be made by visual means (on the other hand, I do not consider this a good practice).
Scripts need to be written only for data conversion (again, I believe that scripts for the programmer should be the starting point, and you should not leave them).

Minuses

Everything that is connected with several circuits (n2) is impossible.
If I make changes on my selected test sandbox database, I have to transfer these changes to the source database for the prod. If you do this in the same way, then it gives a hole - transferring those random or personal changes that I made to my database (I can do everything in my sandbox, everything, everything, everything, right?). After all, only the necessary changes are scripted, but all are transferred. In addition, I will jam the changes that another programmer suffers.
If there are at least some differences in the second circuit, you need to make another source database for it. That is, if there are any settings or data different from the prod in the test loop, then either a second source database is needed for it, or in some other way to solve this problem.
Lack of control over change transfer scripts - the scripts generated by the tool will be rolled. Sometimes it can be bad, you need to take into account the specifics. For example, some changes require the re-creation of a table. What you need to do or very carefully, or do not do (if the table is often used, very large).
If we take this diff script, look at it and edit it, then again the whole point of comparing the schemes disappears - you can write the scripts right away.
The scripted history of object changes may differ from the real one. Because the tool compares exactly the scheme of the bases, and not the scripts - you can (by chance, of course) change something and not add it to the scripts. Or add to the scripts is not entirely correct.
Anyway, additional efforts and tools are needed to support:
- migration scripts (data conversion),
- versioning of reference books and persistent data.
The tool does not override the database settings and SQL Server settings.
If someone else rolls something on the source database, it will drag too when deployed.
The target database is dependent on another database in the sense that prod must be bundled with its source.
There are no free tools, but redgate SQL Compare is expensive. And you need a license for the tool on all hosts that will be migrated to any target database.

The approach does not solve the problem

1b - it is implied that for this purpose it is necessary to use SQL Data Compare; 1c; 1d; 1e.
For 2 - additional and very significant efforts. It’s easier to drop several contours than to support them.
3a - the tool overrides the schema changes of either all objects or selected ones. That is, when deploying, you need to view a list of all objects and uncheck unnecessary ones. In principle, this solves the problem, but ... Nobody will do that. Again, the rule is “much better disciplinary restrictions to remove instrumental expansion”. Something is missing .gitignore file type.
6 - additional efforts are needed, the history of changes may differ from the real one.

2 Comparison of the scripted schema (and data) with the target database

Tool

I do not know which tool could compare two DB schemas not by the bases themselves, but by the scripts. Therefore, this approach is theoretical. But, if you find or make such a tool, it seems that the approach would be quite good. Especially if this tool could compare data.

Methodology

The role of the source database here would play a directory with scripts that completely create the database — both the schema and the versioned data (directories, persistent directories). That is, the programmer makes changes to these scripts, runs a tool that compares the entire directory with the target database and creates a diff script, which is either saved for code review , or rolled up immediately.
Since there is no tool, one can only fantasize about how this tool could compare data and settings of the database and SQL Server.

pros

The task of the history of changes in each object of the database (p6) is perfectly solved.
The sandbox task (p3a) was solved perfectly, as well as in the remaining approaches based on scripts. No changes needed - do not make scripts.

Minuses

Similar to approach 1, everything that is connected with several contours (n2) is impossible.
Similar to approach 1, there is no control over the rolling scripts.
The same problem with migration scripts.

The approach does not solve the problem

1c.
1d, 1e - it is unknown which tool.
For 2 - additional and very significant efforts. It’s easier to drop several contours than to support them.

3 Based on sequential (incremental) manual SQL scripts

Tool

flyway db . Perhaps there are alternatives ( https://github.com/lecaillon/Evolve - I'm not ready to talk about this tool, but it seems to be doing something similar).

Methodology

The approach methodology is the simplest. Enchantingly simple. As required, sql-scripts of changes are written - arbitrary, both for changing the schema and for changing the data. It does not matter which scripts. Files are numbered, folded into a daddy. At the right time, the tool is launched, which in the order of numbering rolls in new, that is, not yet executed script files on the selected database. Rolled remembers in a special plate, that is, the script will not be re-executed.

This is how Qiwi works. Or I worked when I participated in the development of the payment system. But there without tools, tools are replaced by disciplinary rules. There are several QA-employees who monitor the special git repository and roll in new scripts, first on the test database - see if anything has broken, then, if everything is fine, on prod.

pros

Extremely, extremely simple. You write a script, you put it in a daddy, and that's it. I think thanks to this, this approach is the most common.
Full (absolutely full) control on the rolled changes. No script generation.
Including it is possible to store both data (p1b), migration scripts (p1c), and settings for both the database and SQL Server (p1de) in scripts.
You can roll up to a specific version (p3b).
The problem of many contours has been solved, but without differences in them (n2a).

Minuses

It is impossible to single out a history of changing a single object, alter 's for one object are scattered across many scripts, many files.
With the parallel work of two programmers, they can create scripts with the same numbers.
Since the approach is scripted, there are not enough features:
- Some scripts would like to roll when they change. That is, he added a line to the script of the versioned reference book, and it turned out, and the line appeared in the table. In this form, you can store a history of data changes (see approach 4).
- Some scripts would like to roll in with each deploy'e - this is, for example, some sort of cleaning, entering of persistent directories that must be versioned (therefore, they cannot be entered manually into the database).
The sequence of arbitrary scripts is extremely difficult to understand. Creating tables, their alter 's, adding lines to directories, migrations are scattered almost randomly in one sequence. I would like to have an alphabetical sorting of files, various daddies. In a word, I want to see the database structure in the scripts. You can, of course, come up with something - make a bunch of daddies, make a huge bat, launch the tool on these daddies in the right order ... Yes, this is the beginning of the next approach, the 4th.

The approach does not solve the problem

2bcde,
6b.

4 Based on manual independent SQL scripts whose structure repeats the database schema.

Tool

liquibase . Perhaps there are alternatives ( redgate SQL Source Control , https://www.quora.com/What-are-the-alternatives-to-LiquiBase - but I don’t quite know how they work).

Methodology

Idea

To create and modify the schema for each database object, create a file in which there will be a script responsible for this object - so when versioning files, we get the change history for each object. We put these files in daddies, repeating the structure of the database. Since the sequence of execution of scripts is important, we enter control files containing a sequence of roll up scripts, and the tool makes writing these control files simple enough and decides what change to roll and what doesn’t, if it was already rolled up or filtered. In addition, if you need a difference in something in different DB instances, enter the values of the variables that the tool uses, modifying the scripts for each instance as necessary. In addition, you can enter filters on scripts, and, depending on the context (“only schema change”, “only directories import”, “create / update only such a piece of scheme”) filter scripts.

To change a table, you need to add a script (changeset) with the alter or create index operator or some other file to its create file. Or, you can modify the existing corresponding changeset, if it is possible to make it re-rolling.

To change the procedure / function / trigger / view you need to change the code in the file corresponding to this object. In order for this script to be re-rolling, you need to create this object with an empty body in the first changeset, and the alter operator with the desired body in the second (sorry, SQL Server has no create or alter operator). Then the first changeset will be executed only once, and the second - when changed.

Well, for the deploy directly, we do the bat-file (s) that launch the tool with the necessary context and settings. Thus, the required deploy will be launched by running the appropriate bat'nik.

You can configure the launch logs (which changesets to execute and how long) are saved. Adding them to .gitignore.

Files

Make the following folder structure:

- <DB name>

CLR_Objects
There will be 1 sql-file for each CLR assembly. In this file, first drop function , then drop assembly , then create assembly , then create function . The assemblies themselves will have to be stored in the script as base64 , but you can make a simple tool that automates this conversion:
```
Clipboard.SetText("0x" + string.Join(“”, File.ReadAllBytes(path).Select(b => b.ToString("x2"))) ) 
```
Tables
There will be a table for each table in the directory in which the table and script creation files for the versioned directory are separate.
- <table name1>
- <table name2>
- ...
Types
There will be 1 sql-file for each type.
Functions
There will be 1 sql-file for each function.
Views
There will be 1 sql file for each view.
Stored_procedures
There will be 1 sql-file for each stored procedure.
Triggers
There will be 1 sql-file for each trigger.
Migration_Scripts
There will be a sequence of data change scripts.

- core
There will be the tool itself and its configuration files.

As you can see, the idea of these folders is to repeat the database schema and match each object to its one file.

In the head folder <DB name> there will be files:

Database creation script ( create database , alter database - settings, users, roles)
create_db.sql
File (s) with scripts that change SQL Server settings and the database itself ( sp_configure , sp_add_job and its brethren, sysmail_add_account_sp and its brethren, alter database etc). Although scripts Jobs can be stored for convenience in various files.
config_db.sql
File with a sequence of roll up tables. The includeAll tool command executes scripts in alphabetical order, which incorrectly creates tables due to the presence of links in them, so you need a file that governs the sequence of rolling tables.
tables.xml
File with a sequence of roll up view. Views can depend on each other, so their roll order is important.
views.xml
A file with a sequence of data change scripts. Ordering is also important for data migrations.
migrations.xml
File with commands tool how to create a database schema. It will contain links to files 1 and 2 with the creation and configuration of the database and server, and then to the folders CLR_Objects , Tables (more precisely, the file tables.xml ), Types , Functions , Views (more precisely, the file views.xml ), Stored_Procedures , Triggers - precisely in this sequence. Also in this file, you can set values for variables that scripts will depend on - if there are differences in different instances of the database.
master_Scheme.xml
File with all commands tool. It first contains a link to the previous schema creation file, then additional actions - data conversion / migration / data cleaning (link to the Migration_Scripts folder, or rather, the file migrations.xml )
master.xml

Note that the sequence is not important for creating / modifying stored procedures, triggers, functions, and types. Therefore, it is enough to includeAll command of the tool that rolls them in alphabetical order.

How to use

For each use case, you need to create a bat file that runs the tool with the appropriate context — for example, deploy_local_scheme.bat , deploy_local_full.bat , deploy_prod_scheme.bat , deploy_prod_full.bat , deploy_ <programmer> .bat etc. In one project, I had as many as 18 such files - there was a whole system of data migration, and it was necessary to regulate when to perform which migration.

In addition to the context, the bat file itself must contain the connection string and the name of the tool command.
Possible commands:

update . The main command to execute all the necessary changesets.
status . Shows what changesets will be rolled up.
udpateSQL . Shows a full sql script that collects all the rollout changesets.
changelogSync . Pushes new changesets as executed without rolling them. It should be used when the change is rolled by hand. Should be used as seldom as possible.

For the convenience of viewing the execution logs, I also output them to a text file:
> %outputfilename% 2>&1

Changesets can be tagged with attributes:

runOnChange - if =true instructs the tool to roll a changeset when it changes (if you write =false , then after changing the changeset there will be an error like "a rolled changeset has been changed");
runAlways - indicates to roll a changeset on every deploy.

If you need to change the schema so that some data modification scripts break down, that is, migration scripts (for example, if you need to change the table name, columns, delete something), then you need to write the corresponding alter or sp_rename in the file corresponding to this table, and change the necessary scripts accordingly. Further, for one-time scripts, you need to make them so that the tool does not give an error that the thumbnail changeset has changed. This is achieved in two ways - either the changelogSync command, or manually change the corresponding row in the tool table by updating the md5-sum there — the tool's value will prompt it.

pros

Full (absolutely full) control on the rolled changes. Just as in approach 3.
Similar to approach 2, it is very convenient to carry out code review , look at the change history of all objects.
Differences in contours (n2bcde) are supported by variables. That is, you can make a bat-file on the contour, in which the values of variables will be specified, and this value can be used in the script, or changese can be labeled with labels that can be filtered or not depending on the contour.
Even a clean database can be created (p4).

Minuses

Difficult in the initial setup. But this is only the difficulty of entry. When using such complexity there is no need - just add the changeset script to the desired file, and that's it.
In a long-term development, when one table needs to be changed many times over time, in some cases changesets with the alter table operator will multiply. But in most cases this will not happen - if you always write reusable changesets. This is a minus with respect to approach 2 (in which you can edit the create statements), but for approach 3, this is an improvement — in general, all changes will be added linearly one after another.
The tool is very poorly maintained and developed. It is written by one person as a hobby. But today's functionality is enough. In addition, the tool is simple, you can do it yourself in a month.

The approach does not solve the problem

Only optional in my vision p3b. Victory.

Source: https://habr.com/ru/post/330662/

All Articles

Approaches to versioning database changes

Introduction

Evolution of the problem

Full universal formulation of the problem

Approaches

Similar articles

1 Comparison of target database and source database

Tool

Methodology

pros

Minuses

The approach does not solve the problem

2 Comparison of the scripted schema (and data) with the target database

Tool

Methodology

pros

Minuses

The approach does not solve the problem

3 Based on sequential (incremental) manual SQL scripts

Tool

Methodology

pros

Minuses

The approach does not solve the problem

4 Based on manual independent SQL scripts whose structure repeats the database schema.

Tool

Methodology

Idea

Files

How to use

pros

Minuses

The approach does not solve the problem

More articles: