
The magical world of PostgreSQL: an interview with Nikolai Samokhvalov

Today we talk with Nikolai, a "fighter" for promoting new technologies in the database world, a member of our program committee and an active speaker at various conferences. The main topics are self-managed DBMSs, DBA AI, clouds, NoSQL, built-in database integrity mechanisms, his talks at RIT++ and HighLoad++ Siberia, and plenty of practical tips and examples that are useful in real work for both developers and DBAs.



- Tell us a little about yourself. Where did you study, where did you start, and what are you doing now?

MIPT (Fiztekh), FUPM faculty, class of 2004, with a base department at ISP RAS. Back in 2005, together with a small team of developers from MIPT and MSU, we built MyKrug, historically the first Russian social network, and then several other social networking projects.
In 2006-2007 I took part in creating the XML data type for Postgres, and around the same time I started running Postgres meetups in Moscow (the word "meetup" did not even exist here yet). That has now grown into the Russian-speaking group #RuPostgres, present on Meetup.com and YouTube. On Meetup.com, by the way, we are the second largest Postgres community in the world with more than 1,600 members - we are ahead of the SF Bay Area group but still behind New York.

Now I am based in the Valley and regularly visit Moscow. I help companies get the most out of Postgres, and increasingly that means the Amazon and Google clouds.

- What is your current focus in the context of working with data?

One of the hottest topics in the database world today is self-managed DBMSs. Over the past year Oracle has been actively pushing the story that its cloud DBMS is autonomous and does not need a "driver" (a DBA), and meetups with Andy Pavlo - the Carnegie Mellon University professor behind the idea of self-driving databases and creator of the Peloton DBMS - draw 200+ people. By the way, he recently accepted our invitation and will come to Highload++ 2018.

Meanwhile Postgres, while being, of course, the best open-source database and enjoying rapidly growing popularity, still requires a huge amount of knowledge and manual work to operate.

I am sure the situation will change for the better in the coming years. There will be more understandable and automated tools that help both DBAs and developers, and the means of database tuning and query optimization will keep evolving.

- So a DBA today has to have extensive knowledge in many different areas? Does the real work go beyond knowing the database itself?

A DBA's work is far from trivial. Inside and around a database there is always a huge amount of service data: internal statistics, logs, monitoring metrics. To find performance bottlenecks you have to be able to quickly make sense of this pile of numbers, and that requires a lot of knowledge and experience. When people assess the work of a competent DBA, I often hear phrases like "black magic" or "excellent intuition". In reality such a DBA is armed, first of all, with a large store of knowledge and many years of experience.

And yet, at the same time, DBA work today is still largely manual! For example: you found the slowest query in pg_stat_statements. But to optimize it you need a concrete query with concrete parameter values - without them you will not get the execution plan, the real query timing, or EXPLAIN output (the situation will improve somewhat in Postgres 12 if the patch from PostgresPro is accepted). So what does a DBA do now? One of two things: either dig through the logs to find concrete examples of the query, or make parameter values up out of thin air, in the end optimizing a spherical horse in a vacuum. The latter works well only if you have years of experience - if you have that "intuition".
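
To illustrate that first step, here is roughly what it looks like (a sketch; column names are as in Postgres 10/11, where the total time column is still called total_time):

```sql
-- top 10 normalized queries by total time spent executing them
SELECT queryid,
       calls,
       round(total_time::numeric, 2) AS total_ms,
       round(mean_time::numeric, 2)  AS mean_ms,
       rows,
       left(query, 100)              AS query_sample
FROM pg_stat_statements
ORDER BY total_time DESC
LIMIT 10;
```

The catch described above is exactly here: pg_stat_statements shows only the normalized query text, without the concrete parameter values needed to reproduce the plan.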

And then, once a specific solution to the problem has been chosen - say, we need to add an index - something even funnier happens. If the DBA is experienced and has access to production, the index may well be debugged, and even created, by hand directly in production. Bypassing git, yes. If the project really does insist that DDL go through git, then the index gets named something like i_tmp_killme, and the developers are asked to add a migration and release the index later under a more meaningful name.

If the DBA's authority is beyond question, obedient developers will not ask any questions. But in companies with a good culture - git flow, code review, devops, and curious developers - you have to explain, before any real action on the "production" databases, why this particular index was chosen and exactly how it speeds up the query. In the Valley, by the way, developers are quite often meticulous: everything has to be chewed over and justified. And here the clouds help a lot. It is very convenient to create a replica of the production database in AWS RDS in a couple of clicks, run `explain (analyze, buffers)` on it in different ways, find the best solution, and present it to the developers with concrete estimates of the performance improvement and detailed diagnostics. RDS has automated database provisioning very well, but it does not in any way remove the mass of manual work needed to optimize queries and tune the database - nobody will (for now!) run the experiments for you and prepare the explanation for the developers.
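
A minimal sketch of that workflow on a cloned or replica database (the table, column, and index names here are invented for illustration):

```sql
-- how the query behaves today
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders WHERE customer_id = 12345 AND status = 'pending';

-- candidate index, created on the clone rather than in production
CREATE INDEX orders_customer_status_idx ON orders (customer_id, status);

-- the same query again: compare timings, buffer reads, and the chosen plan
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders WHERE customer_id = 12345 AND status = 'pending';
```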

As a result, the work of a Postgres DBA today feels like driving a beautiful high-speed car with a manual gearbox. Do I like driving stick? Yes - but not every day.

- In effect you have just sketched the beginning of an algorithm for finding problem queries. Can this knowledge serve as the basis for new useful "add-ons"?

Right. That is why my area of interest now is DBA AI: artificial intelligence that helps "cook" Postgres quickly and efficiently, without purely manual work, at a large and growing scale (usually in the clouds). Nothing like this exists in nature yet, but work in this direction is already underway (one of the interesting, although still purely research, efforts is the already mentioned Peloton).

- At RIT++ 2017, in your talk "Database First!" about common mistakes in using an RDBMS, you said that the DBMS is the heart of an IT system and that not using the capabilities of the database is nonsense. What examples can you give in support of that? Most interesting are the cases where using standard database control mechanisms helped to avoid errors, or the opposite - where ignoring standard features led to disastrous results. For example, the absence of foreign keys and, possibly, other mechanisms that seem ordinary at first glance.

In that same talk I argued that "disastrous results" are observed in most projects where support for data integrity and "cleanliness" is taken out of the database and implemented on the application side - in PHP, Ruby, Python, or whatever else. As such a project accumulates data, "dirty data" piles up: "orphan rows" (no foreign keys were used, users were deleted, and the data that belonged to them was forgotten), duplicates (a uniqueness check was not implemented, or was implemented incorrectly, on the application side), invalid or simply wrong values. A separate question is how you feel about such problems. If it is a small blog, maybe it is not a big deal. But if it is a billing system... As soon as you move data validation outside the DBMS, you accept the possibility that someone (a person or a program) will show up who bypasses that validation. Not to mention that your own implementation of the checks may be far from ideal.
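
This is roughly how such "orphan rows" surface later (the tables here are hypothetical):

```sql
-- comments whose author no longer exists: impossible with a foreign key in place,
-- routine in projects that validate only in application code
SELECT c.id, c.user_id
FROM comments c
LEFT JOIN users u ON u.id = c.user_id
WHERE u.id IS NULL;
```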

So it is useful to know and apply the capabilities the database offers for enforcing data integrity constraints. There are only a handful of them: uniqueness (in practice implemented via a unique index), foreign keys, user-defined constraints (achieved with the CHECK keyword), the prohibition of NULL values, and Postgres also has special exclusion constraints, which are convenient, for example, for guaranteeing non-overlapping intervals.
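
A small sketch that uses all of these at once (the schema is invented for illustration; the exclusion constraint over a plain integer column needs the btree_gist extension):

```sql
CREATE EXTENSION IF NOT EXISTS btree_gist;

CREATE TABLE rooms (
  id   int  PRIMARY KEY,
  name text NOT NULL
);

CREATE TABLE bookings (
  id      bigserial PRIMARY KEY,
  room_id int       NOT NULL REFERENCES rooms (id),            -- foreign key
  guests  int       NOT NULL CHECK (guests BETWEEN 1 AND 10),  -- CHECK constraint
  period  tstzrange NOT NULL,                                  -- NOT NULL everywhere
  code    text      NOT NULL UNIQUE,                           -- unique index under the hood
  -- exclusion constraint: the same room cannot be booked for overlapping periods
  EXCLUDE USING gist (room_id WITH =, period WITH &&)
);
```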

Using suitable data types is also an important tool for keeping data clean. A simple example of an obvious and very common mistake: storing email addresses in a text (varchar) column and putting a regular unique index on it. For emails we of course need case-insensitive checks, so the citext data type is better (it is not "in the box", but the citext extension is available in most Postgres installations), or a functional index like `create index ... using btree (lower(email))`. In the latter case, by the way, do not forget to rebuild statistics (`vacuum analyze ...`) so that Postgres knows how the values of the expression `lower(email)` are distributed in the table.
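
Both variants side by side, as a sketch (the users tables are invented):

```sql
-- Option 1: the citext extension makes comparisons case-insensitive
CREATE EXTENSION IF NOT EXISTS citext;
CREATE TABLE users_v1 (
  id    bigserial PRIMARY KEY,
  email citext NOT NULL UNIQUE   -- 'Foo@Example.com' and 'foo@example.com' now collide
);

-- Option 2: plain text plus a unique functional index
CREATE TABLE users_v2 (
  id    bigserial PRIMARY KEY,
  email text NOT NULL
);
CREATE UNIQUE INDEX users_v2_email_lower_key
  ON users_v2 USING btree (lower(email));

-- rebuild statistics so the planner knows the distribution of lower(email)
VACUUM ANALYZE users_v2;
```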

Besides competent use of data types and support for various kinds of integrity constraints, the DBMS lets you implement complex data validation logic - for example, when modifying data in one table you can run several checks against several tables at once. That is a job for triggers. Based on my experience, which spans very different projects and very different people, I am willing to assert: dislike of triggers is directly proportional to the developer's ignorance of databases. It is that simple a formula.
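
A sketch of such a cross-table check (the orders/order_items tables are hypothetical; starting with Postgres 11 you would write EXECUTE FUNCTION instead of EXECUTE PROCEDURE):

```sql
-- forbid adding or changing items of an order that has already been shipped
CREATE OR REPLACE FUNCTION forbid_items_on_shipped_order() RETURNS trigger AS $$
BEGIN
  IF EXISTS (SELECT 1 FROM orders o
             WHERE o.id = NEW.order_id AND o.status = 'shipped') THEN
    RAISE EXCEPTION 'order % is already shipped, its items cannot be modified', NEW.order_id;
  END IF;
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER order_items_guard
  BEFORE INSERT OR UPDATE ON order_items
  FOR EACH ROW EXECUTE PROCEDURE forbid_items_on_shipped_order();
```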

No one in their right mind would claim that PL/pgSQL or, say, PL/Python is super fast. On ordinary arithmetic computations PL/pgSQL (and, by the way, plain SQL too) is noticeably slower than C and even PHP - so much slower that only a madman would use it for such purposes at any significant scale (or someone who adopts, say, the MADlib library, which I respect very much and sometimes use with pleasure). But for working with very large amounts of data, when you need to find the right needles in haystacks, there is nothing better than being "next to the data", when all the indexes available in the database and the absence of network round trips play on your side, and you use one of the most popular programming languages in the world - SQL. Not using these capabilities where they are clearly profitable - now that is nonsense! A competently written trigger executes quickly and is fairly easy to debug (the pg_stat_statements and auto_explain extensions help with profiling and debugging, with the options `pg_stat_statements.track = all` and `auto_explain.log_triggers = on` enabled) and, most importantly, it is a boundary that dirty data cannot cross.
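
A sketch of turning that profiling on (both extensions must be listed in shared_preload_libraries, which requires a restart; the duration threshold here is just an example value):

```sql
-- in postgresql.conf (restart required):
--   shared_preload_libraries = 'pg_stat_statements,auto_explain'

ALTER SYSTEM SET pg_stat_statements.track = 'all';        -- also track statements inside functions/triggers
ALTER SYSTEM SET auto_explain.log_min_duration = '500ms'; -- log plans of anything slower than 0.5 s
ALTER SYSTEM SET auto_explain.log_analyze = on;           -- include actual timings (adds overhead)
ALTER SYSTEM SET auto_explain.log_triggers = on;          -- add per-trigger timing to the logged plans
SELECT pg_reload_conf();
```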

- Continuing the previous question: why are PostgreSQL's built-in capabilities for controlling and manipulating data better and more advantageous than self-written constructions?

One obvious reason: the built-in capabilities were designed, and have been polished for many years, by very smart people - Postgres creators like Tom Lane.

We have already discussed another, architectural, reason; it only remains to illustrate it. How many entrances does your house have? One? Two? When you leave, how many doors do you lock and check? Surely not ten? A modern project can have a website, a back office, various APIs for mobile applications, and external users as well. If you implement integrity support in the application, your house ends up with many doors through which visitors (in the case of a database - data) enter and leave. And if you are even more "lucky" and the project is developed not by two or three programmers but by a large team or even a group of teams, then - hello, here we are. Your doors are built and maintained by different people and teams, who often also write in different languages... You see where this is going, right? Either you have to limit and restrain the development of the project ("you can't connect this ready-made GUI for working with data - the manager would be able to write anything into the database; better we build our own admin panel!.."), or you somehow have to keep the same data-control logic in sync across different subsystems, often in different languages.

In fact I am saying trivial things here, but my personal observations show how alive stereotypes like "stored procedures are evil" or "foreign keys are a big performance hit" still are, how much wheel-reinventing they lead to, and how dearly projects end up paying for them.

- There are a lot of discussions and questions around PostgreSQL releases. You are probably often asked about the right way to move from versions 9.3-9.6 to 10.
Is using the pg_upgrade tool always justified? Have you encountered situations in practice where this tool should be feared?

There was a very painful bug in version 9.3; many people (including me) had sleepless nights because of it. Fortunately, that is in the past. Of course there is no 100% insurance against bugs, but today pg_upgrade is practically an industry standard, and using it is reasonable in most situations where a few minutes of downtime are acceptable. The lucky ones who are already in the clouds with a managed database such as AWS RDS do not think about it at all - they simply schedule a maintenance window and upgrade. If you do have to think about it, you should definitely experiment as much as possible and do at least a few test runs on a cloned database (at full volume and on identical infrastructure - provided, of course, that data volumes and resources allow it). Here, again, it is tempting to use the clouds, but at a lower level - plain EC2 machines in Amazon, for example. This is very cheap if you remember about such an option as spot instances.

A fresh and detailed case study of how 1,500 databases with a total volume of 55 TB across 6 data centers were upgraded in just 15 minutes: https://bricklen.imtqy.com/2018-03-27-Postgres-10-upgrade/ . Notice how many tests they ran before doing anything in production. The main formula of that article is universal: "Test, find problems, fix; lather, rinse, repeat." Here I really want to start talking about automating experiments again, but I have probably bored you with that already.

If even such a short downtime is unacceptable, then you have to consider solutions that are more labor-intensive to implement - first of all, those based on pglogical (there is a fresh post on this topic).



- At RIT++ in May you are announced with the talk "Continuous Database Administration". The first part of the description is devoted to the postgres_dba tool. Please tell us about it.

postgres_dba is a "semi-automatic" Swiss-army knife for working with Postgres. Any experienced DBA has a collection of useful scripts accumulated over the years that answer different questions - from "which tables in this database take up the most space" to estimating bloat, finding unused or redundant indexes, and working with pg_stat_statements. I went through more than one such collection of scripts and built an interactive menu on top of them - so now, in the native Postgres console, psql, you can "talk" to Postgres and get answers to your questions very quickly. It is also very easy to add any of your own reports or interactive scripts to the set (for example, something for adding/removing database users with password generation done as safely as possible).

This tool, of course, does not make the DBA's work fully automated, but it already speeds it up noticeably. I have noticed that it brings the DBA closer to the databases she or he works with: reports are now a couple of keystrokes away, a lot of time is saved, and full-fledged "communication" with the database happens very quickly - and therefore more often. The project is open, and I invite you to try it and take part in its development: https://github.com/NikolayS/postgres_dba .

- The second part of the talk consists of several blocks devoted to automation, cloud solutions, and AI-related topics. Which of these areas do you think will be actively used in the near future?

There are several directions at once. First, Amazon, Google, and Microsoft already provide so-called managed databases - the provider takes care of database deployment, replication settings, switchover/failover, automatic backups, and basic monitoring for you. But such offerings, although based on open-source products, are currently not made in the spirit of FLOSS. AWS RDS does not even let you download your backups, let alone replicate anywhere other than to another RDS instance. Google Cloud SQL for Postgres, although it announced GA in April, is still extremely poor in terms of Postgres configuration - it does not even let you log queries, only errors.

But in general these are all success stories of building proprietary solutions on top of open source, especially when it comes to RDS and RDS Aurora. At the same time, I believe that open counterparts will appear in the near future, ready to work anywhere (in the Amazon or Google cloud, in a private cloud, or on-premise), no less advanced, with different options for building HA setups and backup policies, and with full access to the OS and file system and superuser access. By my observation this niche definitely exists and is not yet occupied. The building blocks for it, by the way, have already been created and battle-tested in numerous companies: Patroni and Stolon for bringing up a master and replicas with auto-failover, the Kubernetes operators from Zalando and Crunchy that appeared not long ago, and backup solutions such as WAL-E, WAL-G, and pgBackRest.

By the way, I strongly recommend that engineers from Russia - everyone, actually, not only DBAs - move their mindset towards the clouds as soon as possible, if they have not already done so. A strange story has emerged here. It is well known that in most cases the lag in adopting new technologies in Russia compared with the Valley is two to three years, but in the case of the clouds this lag is clearly larger. In the US the clouds today are practically a commodity (sorry, I do not know the Russian equivalent of that word), while at our conferences - RIT, Highload++, and others - "cloud" talks could until recently be counted on one hand.


Source: https://habr.com/ru/post/359306/

