Database as Code. Digging deeper

In IT projects, code is written by everyone. Engineers manage Kubernetes clusters with a few lines, disperse clouds with Terraform, and roll tons of configurations on Ansible, Chef, and Puppet. QA write understandable business test scenarios on Spock and Cucumber. Analysts are fluent, often better than developers, talking in SQL. Project documentation in Markdown, AsciiDoc or LaTEX formats is "compiled" into the required format on the build server. Well, the developers themselves, these code tamers, immediately speak a lot of languages for every life situation - client, server, script, functional, etc.

The code has long ceased to be mysterious gibberish and is now in one form or another accessible and understandable to many, even the prime ministers . And all this code is involved in the standard life cycle - is under the control of VCS, subjected to code review, automated testing, CI, CD. Common tools and approaches, performance and quality metrics are used. And all together it has a proud name - "Everything as code".

However, the world of database traditionally stands apart from this extravaganza of progress and technology. The process of developing and maintaining a database does not change for years and continues to instill horror and fear in developers, administrators and users around the world. But is it possible to present the database in the form of a regular code? Get closer to the basic development process, use common tools and approaches? About this under the cut.

Typical project

Most of us every day, whether we like it or not, have to deal with a variety of databases. Design, develop, administer, solve problems with performance, competitive access and that's it. At the same time, requirements for collecting, processing and storing data are constantly being tightened - for some, 100% consistency is needed, for others it is not. Some are stored for years, others a few seconds. The loss of one will lead to millions in losses, while the loss of others will not be noticeable to the business.

In this regard, the Polyglot Persistence approach has appeared and is being used successfully for a long time, as a result of which to meet one single DBMS on any project today will be a great success. Instead, for each specific task, the most suitable base is selected, without trying to somehow get out and solve all their problems in the good old relational. And, for example, the microservice architecture allows you to isolate work from a specific DBMS within one service and squeeze the maximum benefit out of it, without harming the cleanliness of the architecture of the entire project. As a result, " Relational " (where to go without it) + " Key-value DB " + " Wide Column DB " + " Document-oriented DB " + " Search engine DB " + " Graph DB " + " Something else DB " = " SUCCESS! ". In addition, the number of instances of these databases — sharding, replication, distributed DBMS — is increasing. Plus, all this must be "stretched" and maintained in different environments - development, test, food, etc.

Not to be unsubstantiated, I’ll give a few excerpts from a fresh poll from the giant company on database management and development solutions Quest Software - DBA Problems: Trends in Database Administration (to download, you have to go through a simple registration, but it's worth it):

Question number 2: “ How many instances of relational databases are running in your company? ” - according to the results, 24% of respondents use from 100 to 500 instances, and 19% use more than 500.
Question 4: “ How many different DBMS (SQL and NoSQL) are used in your company? ” - 56% use from 2 to 3 different DBMS, 40% use more than 5, and only 4% of respondents have the only one DBMS.
Question 18: “ What happens to the number of database instances in your projects? ” - 2/3 of respondents mark their annual increase.
Question # 30 (as a result): " What are the main problems the DBA will face in the next 3 years? "
1. 51% - " Increasing the number of DBMS instances "
2. 40% - "The ability to administer new non-relational databases "
3. 32% - " Reducing IT-budget" (which, as expected, swells due to the first two problems)

Are we ready for this?

It is no secret that in order to work effectively with a particular DBMS, specialized tools are needed, such as IDE and DB managers, monitoring tools, modeling tools, migration schemes, and much more. However, most of these tools and environments were invented and developed back in those glorious times when a single relational DBMS was enough for any project, and such buzz words as " Agile ", " DevOps " and " CI / CD ", not invented. And since then, little has changed, because the area of development and maintenance of the database has always been fairly closed and conservative, and most developers associate with something ancient, complex and incomprehensible. Today, when practically the most disparate DBMS and dozens / hundreds of their instances work in almost any project management, the usual everyday problems of developers and DBA become even more acute.

Traditionally, all kinds of IDEs are the most popular and indispensable tool for working with databases. Usually this is a complex solution that combines a variety of functions for developing and administering into a single working environment under its interface, mainly graphical, making our work with the database more productive and comfortable - all that remains is to click on the desired menu item. And indeed, few of us do without such cool things as EMS SQL Manager , Toad , dbForge, and many others .

One of the main problems of such "boxes" is that under the pressure of new features and new DBMS (which do not cease to appear on the market) they turn out to be quite complex and continue to become more complex each time. And both in use (often people have to learn not the features of a particular DBMS, but the tricks and features of the IDE itself), and in their implementation (therefore, such systems are usually closed and very paid). Even many (including me), the favorite DBeaver , the flagship among the open sourse DB IDE, provides support for NoSQL solutions (and which project is now without them) only in its Enterprise version . In addition to all this, there is a non-illusory risk that the support of any very necessary and interesting functionality of the DBMS can not be expected soon, until the IDE developers consider it necessary (or they may not consider it at all).

In response to the current situation, users themselves often do not wait until the necessary feature or support of the new DBMS appears, but they solve the incoming problems on their own and in the most appropriate way for themselves and their project. GitHub just teems with such solutions of varying degrees of complexity. Basically, these are sql-scripts for all occasions ( dataegret / pg-utils , NikolayS / postgres_dba , gwenshap / Oracle-DBA-Scripts , ktaranov / sqlserver-kit , lestatkim / opensql ). Or console utilities, such as top-like pg_activity and pgcenter . As well as all sorts of web-based tools, ranging from completely independent clients and monitoring tools ( pg_web , pg_hero ) to, for example, just html wrappers over system tables ( pg_web_stats , pg_stats_viewer ). As a result, a whole class of solutions of the form "programmers-for-programmers", "DBA-for-DBA" is formed, with the help of which simple, and not so much, DBMS users share their experiences with each other directly.

Everything as code

A similar picture was once observed long ago in the area of server administration and software configuration management. Then, with the constant emergence of new requirements and technologies, the infrastructure of projects became noticeably more complex, the number of servers and software running on it increased dramatically. And there was also a need for more frequent, and then continuous, delivery of software with new business features to the customer. Existing tools for configuring and managing all these facilities did not cope well, resulting in the “Infrastructure as Code” approach and such tools as Ansible, Chef, Salt, Puppet, and others, where infrastructure configuration is performed on a special DSL. That gives more flexibility and freedom of creativity. And the code on such DSL participates in the standard life cycle along with the main application code (in Java, C #, Ruby, or any other programming language), namely, it is stored in the version control system (with all the attendant problems - branch, fork, code review), going to builds on the build server, running automatic tests, etc.

In the future, the impact of this approach has become noticeable in other areas:

" Test as Code " - Spock , Cucumber , Gatling , Postman
" Pipeline as Code " - Jenkins pipeline
" Documentation as code " - Markdown, AsciiDoc, LaTeX
" Diagram as Code " - PlantUML , Graphviz , Mermaid , flowchart.js
" DNS as code " - DNSControl , OctoDNS
" Presentation as code " - reveal.js

More and more people are no longer afraid to write code, this is becoming the norm. This allows a more flexible configuration, and also allows the use of common tools (version control, metrics, visualization, reports, testing, etc.) in the general flow. In many specialized tools, the graphical interface is replaced or supplemented by a more flexible software interface, various DSL and configuration files.

More information on this topic can be found in the report by Alexander Tarasov ( aatarasoff ) "Everything as a Code" .

Database as code

And what about this in the vast world of databases? Having typed in Google a simple phrase " Database as code ", I did not find anything interesting, except for a single (but what!) Post on DZone - " Database as Code: a Novel Concept ".

We talk a lot about the database-first approach. We talk a lot about the fact that data is the most valuable asset of an enterprise. But what about the concept presented by Dan North on the third slide of his presentation ? What if you treat a database like code?

It sounds tempting, and on slides # 5 and # 6 (unfortunately, I did not find the video) 2 approaches to managing the database schema "through code" are described - through incremental migrations (Liquibase, Flyway) and idempotent DDL scripts ( Redgate ). Thus, the scheme is also code, it is under the control of VCS, builds on a build server, automatic tests are performed and so on.

What do we need to change our way of working with databases? We must stop treating our database as a magical artifact or as a unique scenario and look at it from the same angle as we look at the usual code of our application.

Powerfully said again, I already have tears in my eyes. Unfortunately, both in the report and in the post itself, “as a code”, only changes to the database schema (migration of the database schemas) are considered:

We treat the source code of our application as a treasure. And this is absolutely true, the code is the heart of any application. But shouldn't the database changes also be under control of the version control system, be automated, be ready for release on demand and obey the “DevOps-laws”, like the main code?

But wait a minute ...

DB is not only a schema!

Most modern DBMSs provide their own query language, with which we can not only receive and modify the data stored in it (the so-called DML) and operate on its schema (the so-called DDL), but generally get it (I'm not afraid of this word) Any meta information about the current database and its state (from system tables and views) and perform almost any operations on it (start the database, stop, convert to read only, collect statistics, manage memory, etc.). And such code is also quite suitable for the concept described in the previous paragraph.

And when it comes to the database and query language, the first to come to mind is, of course, the good old SQL. But can we count on it (and at the same time on relational databases, with which the majority is closely associated) now and in the near future, given the huge growth in demand for NoSQL solutions?

"After all, is SQL already everything? ..."

You can rightly ask me what kind of SQL I am talking about here in the era of NoSQL and schemaless databases, when " dusty relationals live their life only in the bloody legacy of Interprise ". Early Habr (as a litmus test of trends in the technology world) was also once filled with a negative attitude towards SQL and relationals and foreshadowed their early death (the number of votes is shown in parentheses, and the number of comments is indicated through a slash):

2009 - “MongoDB or how to stop loving SQL” ( +115 / 99 )
2010 - “NoSQL Collection Point” ( +53 / 45 )
2011 - “Are relational databases doomed?” ( +77 / 131 )
2012 - "NoSQL database: we understand the essence" ( +129 / 75 )

However, from about the 12th year, the mood on Habré changes noticeably:

2012 - “Think 1000 times before using noSQL” ( +109 / 130 )
2013 - “NoSQL and Big Data - deception of workers?” ( +75 / 286 )
2014 - “Why you should never use MongoDB” (+143/245) ( +143 / 245 )
2017 - “Why SQL prevails over NoSQL, and where it will lead in the future” ( +22 / 105 )
2018 - "No, you do not need machine learning. You need SQL" ( +102 / 417 )

At the same time, anti-sql and anti-relational moods continue to take place, but the community’s reaction to them is already completely different:

2014 - “Disadvantages of RDBMS or RDBMS vs NoSQL” ( -34 / 79 )
2015 - “Isn't it time for relational databases to be dumped in history?” ( -7 / 309 )

Confirmation of these attitudes can be found in the recent reports by Konstantin Osipov ( kostja ) "NewSQL: SQL does not go anywhere" and Andrei Nikolayenko "Non-relational SQL" . I take the responsibility to bring a brief summary of both presentations:

" SQL in NoSQL " - SQL (as it is not strange) feels pretty comfortable in NoSQL as well. Practically in all NoSQL databases, it is possible to use any sql-like query language, such as CQL in Cassandra and ScyllaDB, AQL in Aerospike and N1QL in Couchbase, and in some cases a full (or as close as possible to it) ANSI SQL - as in Tarantool , ClickHouse and CrateDB .
" SQL in the clouds " - Cloud giants provide support for a variety of SQL storage: Amazon RDS , Amazon Aurora , Oracle Cloud , Google Cloud SQL , Microsoft SQL Azure , Alibaba Cloud ApsaraDB , Yandex Managed Databases .
" SQL in BigData " - SQL has long been tightly integrated into the Hadoop infrastructure. Back in 2009, Hive appeared with its HiveQL, and then, year after year, solutions began to appear that directly support SQL - Impala (2011), Spark (2013), Kudu (2014), Presto (2015), Phoenix (2017) .
" SQL in Potochnikov " - SQL has become almost the standard for stream data handlers: Flink , Samza , Storm , Apex , and more recently, Kafka .
" SQL in NewSQL " - Many modern DBMSs are actively developing towards NewSQL , where they combine the advantages of both NoSQL and classic relational solutions, including extensive SQL support ( CockroachDB , FoundationDB , MemSQL ).

Need more arguments with pictures!

I will give some more arguments further, but if everything is clear, then we can safely proceed to the next paragraph.

Database history in "No notations":

Also recently, there are many new and trendy DBMSs that are “add-ons” over well-known and proven relational DBMSs. For example, Postgres-based ( Timescale and ToroDB ), MySQL-based ( RadonDB ) and even based on sqllite (which has always been positioned as a simple and reliable embedded database) appeared rqlite - “light distributed relational storage”. Or implement the interface of a popular database (for example, MySQL for InfiniDB and TiDB ). This approach is good because with the new data model we remain on the same familiar platform that we know how to configure and administer.
')
Well, for the very “uncirculated” guys, there are also attempts to create a common universal language “a la SQL”, such as Eclipse JNoSQL , or rather its JNoSQL Aphrodite subproject.

Google, which is one of the pioneers of the NoSQL movement, is also making more and more emphasis on SQL. At first it was Spanner , and now BigQuery is actively developing:

SQL interfaces acquire not only data warehouses and data handlers. For example, using osquery (from Facebook itself) and fsql in SQL, you can get a lot of useful information from any OS or perform various operations with file systems.

In general, SQL feels quite comfortable in modern conditions, including far beyond the boundaries of relationals. With virtually any data source, you can "talk" in the SQL or SQL-like language, get the necessary information about the data or metadata of the repository and perform some operations (for example, create something, delete, start, stop, etc.)

Hereinafter (for convenience and with your permission), by “SQL” I will mean not only the “same” standard relational SQL, but also all its subspecies (including far from relational ones), and indeed any embedded in DB QL, with which you can do something useful.

Why all this?

And the fact that in the distant 92nd year, a wonderful ANSI standard appeared , according to which any relational DBMS must be able to describe its internal structure in a special scheme - the Information schema . So using the standard query language for service tables / views, you can get the metadata of any database - as it contains charts, tables, indexes, columns, etc. But in fact it is not limited to the same data scheme in the same way (even if outside the standard) you can get information about processes, sessions, query execution plans, disk subsystem, memory utilization and much more.

Despite the fact that this standard appeared almost a quarter of a century ago and only for relational databases, almost all modern NoSQL and NewSQL databases also implement something similar. For example, in Cassandra there are several system schemes (or rather keyspaces - system, system_auth, system_schema, system_traces), which can be accessed using the already mentioned CQL. ClickHouse has a special “system” scheme . The developers at CockroachDB generally swung to the implementation of standard information_schema . And even document Mongo pleases us with system collections .

At the same time, there is a constant increase in the number and expansion of such system representations. For example, in Postgresql (as a database with one of the most active communities), in addition to implementing the information_schema itself, the pg_catalog native scheme and pg_stat representations have several official extensions in the form of the pg_stat_statements and pg_buffercache presentations . As well as additional third-party views, such as pg_active_session_history , pg_store_plans and pg_stat_kcache .

Almost any node corresponds to any system table that displays its structure and / or state. That is, it can tell about its internal structure and the state of the database itself through its query language. In addition, many DBMS provides the ability to not only get useful metadata using their query language, but also to perform maintenance operations — stop, start, clear, collect statistics, etc. Some things are possible. In some DBMS, the "ALTER" command can be applied not only to tables or columns, but also to other objects, for example, to a session (" alter session set sql_trace = true "), data files (" alter tablespace add datafile ") or the entire system (" alter system kill session ").

Knowledge of these tables and queries will help both in the development and administration, and in mastering the next new DBMS. It is clear that in some bases such capabilities are highly developed, and the “code” can do almost everything, while in others it is much more modest, but there is definitely a tendency that more and more DBMS developers will develop such capabilities in their products, and we can to operate them effectively.

Instead of conclusions and summaries

Unfortunately, SQL has long been firmly perceived by many as a "low-level" language, a kind of "bytecode" for the database. Which is not accepted, and in some societies it is even deeply indecent to "write with your hands" and in general to contact him in any way. We have long been accustomed to, that numerous DB-tools and frameworks themselves generate and execute for us " tricky and complex " SQL queries to the system, and this process can only be observed. But let's get out of the comfort zone of warm tube-based graphical interfaces and code generators and look at our DBMS from the point of view of regular code . Moreover, we have everything necessary for this.

Now in the case. Users of any DBMS have the ability to describe almost all aspects of working from their database in the form of a code, in a simple, understandable and familiar to many language. This may not be ANSI SQL, but any SQL-like dialect, or even its own built-in QL or API, depending on the database used. How is this useful?

Without being tied to any tool, you can get any information about the database in the required form, or perform any operation on the database;
Since this is still ordinary code , then it should be treated as any other "real" code. Namely, it should be clear, structured and formatted according to an accepted standard, as well as participate in all stages of the standard code life cycle - be under the control of VCS, undergo Code review, participate in CI / CD pipeline, etc. (and not " lying "somewhere in a folder called" SQL-tricks "). That will increase its quality and value in the team.

I am interested in your opinion on this subject, write in the comments - we will discuss.

In the next post, I plan to illustrate the ideas described above with specific examples, and also tell you how we at CROC will try to apply them in an experimental open source graphical DB-manager that is being developed as part of a research project.

Source: https://habr.com/ru/post/333376/

All Articles