
DZ Online Tech: Postgres Professional

Hey.

Last year I began filming a series of programs/interviews on digital business transformation (they are here, for anyone interested; subscribe). Those programs sat at the junction of IT and business, but were nevertheless more about business.

In the process it became clear that quite a few topics have real depth from a programming point of view. So this year we started shooting a series of interviews under the general label "DZ Online Tech", now with an emphasis on what is under the hood. And since not everyone has the patience to watch video, these interviews are transcribed; today I am publishing the first one, with Ivan Panchenko of Postgres Professional.
If you are interested in the original, here it is:


(By the way, I can't promise that every episode will be transcribed, so if you like it, subscribe on YouTube: everything shows up there earlier and reliably.)

For those who prefer to read, here is the transcript:

Good afternoon, Ivan. Postgres is widely perceived as an import-substitution tool, and the best-known driver of its adoption is the idea of "let's replace some Western DBMS with Postgres". But it is known to have plenty of merits of its own, and I would like to go through them briefly. If we choose from a clean slate, without the "import / not import" lens, why Postgres?

Good day. First of all, most people choose Postgres not because of import substitution. Yes, there are indeed government measures restricting the procurement of foreign software, and Postgres Pro is in the registry, so it is suitable for import substitution. But what we are really seeing is the reflection of a global trend: Postgres is being chosen more and more everywhere. What shows up here as import substitution is, in fact, a consequence of deeper, more fundamental processes.

Postgres has matured. It is a good alternative to the older commercial databases. All over the world people now realize that Postgres is a product of the same class as the heavyweights: Oracle, Microsoft SQL Server and DB2. The last of these is almost forgotten, but it is nevertheless a good product.

Still, DB2 is gradually drifting out of the market, while Oracle and Microsoft SQL Server remain on it. Postgres is the third player now flowing into the market globally. It has been doing so for a long time: about 10 years ago, when good Windows support appeared in Postgres, and then replication, people began to take it seriously and to talk about it as a database for business, for industry, for serious work.

So far the picture you paint is "there were two good ones, and now there is one more good one". But surely there are things that put it ahead and make people prefer it to the rest.

Yes, there are several such things. First, Postgres is an open-source product with a particular license. At the same time it has commercial clones; one of them is ours. But the fact that it derives from an open-source product is in itself a big advantage.

The second point is that Postgres is a highly extensible database. That extensibility has also been a driver of its development.

It must be said that Postgres's extensibility is exactly where the Russian contribution is most visible. For the last 15-17 years our team has worked mainly on its extensibility mechanisms.

For example, at Rambler in 2000, where I first worked with my current colleagues, we used Postgres's extensibility to increase the performance of the Rambler news service 30-fold. How? We were programmers who were fluent in C and could read documentation. Those two things allowed us to create a new index type for fast searches over arrays, which goes beyond the classical relational model but is useful.

Specifically, at Rambler, replacing an extra join with an array that could be searched through a clever index increased performance 30-fold. You have to understand that back then all the Rambler content projects ran on machines whose power did not exceed that of a modern smartphone: a Pentium II at 400 MHz with 500 MB of memory. If you were lucky, a Pentium III (it had just appeared) at up to 800 MHz, and a server could hold two of those. And all of this served millions of requests.
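To give a feel for the kind of rewrite described here, below is a minimal sketch using the array operators and GIN indexing available in stock modern Postgres; the schema is illustrative, not the actual Rambler one, and the custom index type the team built predates this built-in machinery.

```sql
-- Instead of a separate news_tags join table, keep tag ids
-- inline in an integer array on the news row itself.
CREATE TABLE news (
    id    serial PRIMARY KEY,
    title text,
    tags  integer[]
);

-- A GIN index makes "which rows contain these tags" fast.
CREATE INDEX news_tags_idx ON news USING gin (tags);

-- All news tagged with both 7 and 42, with no join at all.
SELECT id, title
FROM news
WHERE tags @> ARRAY[7, 42];
```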

Clearly, code had to be optimized very seriously back then, otherwise none of it would have worked. Postgres's extensibility helped. We sat down, thought, and within a reasonable time, say a month, built what later formed the basis of everything that now exists in Postgres around JSON and full-text search.

Thanks to this, support for semi-structured data appeared in Postgres quickly. What is that? Roughly speaking, key-value data. In 2004 we made the hstore extension for Postgres, to store key-value information inside a single database field. It was not JSON; JSON came into fashion later. Back then there was hstore. Why hstore? We looked at Perl, which has hashes, and made a hash with almost the same syntax in Postgres.

Initially this thing existed as a separate extension; later it was committed to the standard set shipped with Postgres. And it was very popular: JSON overtook it in popularity only recently.
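For readers who have not met it, here is a minimal hstore sketch; hstore ships with Postgres as a contrib extension, and the table and key names below are illustrative.

```sql
CREATE EXTENSION IF NOT EXISTS hstore;

-- One flat hash per row, with Perl-style key => value syntax.
CREATE TABLE profiles (
    id    serial PRIMARY KEY,
    attrs hstore
);

INSERT INTO profiles (attrs)
VALUES ('city => "Moscow", lang => "ru"');

-- Fetch a value by key ...
SELECT attrs -> 'city' FROM profiles;

-- ... and index keys and values so containment lookups stay fast.
CREATE INDEX profiles_attrs_idx ON profiles USING gin (attrs);
SELECT id FROM profiles WHERE attrs @> 'lang => "ru"';
```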

So, in a certain sense, you put an end to the silly SQL versus NoSQL argument by creating a DBMS that has the properties of both at once.

In a sense, yes. From NoSQL we took the good part, namely the flexibility, without losing transactionality, data integrity, and everything else you normally get in a proper DBMS.

So, hstore. But hstore was single-level, a one-level hash; it is not JSON. Arrays in Postgres existed before us; what we added was index support, so you could search quickly both over what is in arrays and over what lies inside hstore, by keys and by values. A useful thing, and people around the world began to use it. Then came JSON. It appeared in Postgres fairly recently, around 2011-2012. There were three different attempts to implement it; all of them were thrown out, and it was done another way. At first, JSON was simply text stored in a field of a special type.

Did your team do JSON?

No, our team did something important for JSON. When JSON appeared in Postgres, it was just a text field, and to extract any value from it, the JSON had to be parsed every time. Of course that worked slowly, but it did allow storing multi-level data.

But there is little point in such a text field. In my own projects, for example, I had kept weakly structured data in plain text fields anyway, just without calling it a field of a special type. The real question was how to work with this JSON efficiently. What our team did, in 2014, was JSONB.

It is faster because the data has already been decomposed and parsed, parsed in such a way that you can get a value by key quickly. When we did JSONB, we bolted index search onto it, and then it became really popular.
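A hedged sketch of the difference, in standard Postgres syntax; the column and key names are illustrative.

```sql
-- jsonb stores a parsed binary representation, so key lookups
-- do not re-parse the text on every access, unlike the json type.
CREATE TABLE events (
    id   serial PRIMARY KEY,
    body jsonb
);

INSERT INTO events (body)
VALUES ('{"user": {"id": 7, "lang": "ru"}, "type": "click"}');

-- Fast key access on already-parsed data.
SELECT body -> 'user' ->> 'lang' FROM events;

-- The GIN index is what made jsonb take off: containment
-- queries stop scanning the whole table.
CREATE INDEX events_body_idx ON events USING gin (body);
SELECT id FROM events WHERE body @> '{"type": "click"}';
```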

Okay. Returning to specific properties: what else distinguishes Postgres from other DBMSs and serves as a reason to use it?

Everything we said about JSON at such length is a manifestation of Postgres's flexibility. There are many places where it can be extended. Another side of that flexibility is geodata support. In exactly the same way that special indexes were made for searching JSON, arrays and text, roughly the same machinery was used to build indexes for fast search over geo-information, over spatial data. There, too, we put our rather large Russian hand into the business.

The result is PostGIS, which is now a very common way of working with geographic information around the world.
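A minimal PostGIS sketch; PostGIS is a separately installed extension, and the table and column names here are illustrative.

```sql
CREATE EXTENSION IF NOT EXISTS postgis;

CREATE TABLE places (
    id   serial PRIMARY KEY,
    name text,
    geom geography(Point, 4326)   -- lon/lat points on the sphere
);

-- A spatial GiST index: the same extensible-index machinery,
-- applied to geodata.
CREATE INDEX places_geom_idx ON places USING gist (geom);

-- Everything within roughly 1 km (distance is in meters).
SELECT name
FROM places
WHERE ST_DWithin(
          geom,
          ST_SetSRID(ST_MakePoint(37.618, 55.751), 4326)::geography,
          1000);
```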

Do users develop their own Postgres extensions for their projects? Is that a significant share of its use?

No, of course not; that is the "cream" on top. There are people who need it because their task grows into it. First and foremost, it is a universal database that supports the standards very well and is safely suitable for almost any task.

If you are starting a startup, for example, take Postgres and you can't go wrong. First, it is functionally extensible. Second, if you use the open-source version, you do not become hostage to a license you have to acquire.

Doesn't it make you, as an employee of a commercial company, a little unhappy to say that?

These are laws of nature; you cannot argue with them. There is open source, and there is business built on open source, which is quite an unusual thing, different from ordinary business. But the role of open source will apparently keep growing. At least that is what the success of Postgres suggests, along with Gartner's insistent charts showing that open-source databases are used more and more often, and there is no getting away from it.



Then why are you, as a company, needed at all? Here is open source: go, download it, and everything is fine?

Actually, you go and download it, and then you run into a problem. Sooner or later people hit problems. Granted, most users will use the open-source version and have no problems: their databases are small and will not grow. But in some cases you need the help of a professional.

Previously the model was simple. I am talking about Postgres here, because different products are at different stages of this evolutionary path. Postgres was originally just a university project. People were not thinking about money; they were interested in solving the problem. Then it was a volunteer project: people built it for themselves, while at the same time working at companies like ours.

And how did you make that decision at Rambler?

Rambler is a good example. We did it because we needed it. We were not at all concerned with business users, with whether anyone else needed it or not.

Postgres companies initially began to emerge not in Russia but in other countries. The first such company originated in Japan, with Fujitsu money; quite a lot was invested there, and it developed Postgres and things around Postgres. Then, fairly soon after that, 2ndQuadrant appeared in England and EnterpriseDB in America. These are all companies that make money on Postgres, and they all start by executing orders. They are engaged in...

Custom development.

Custom development at the DBMS level. We do that too, and we did such things even before we were a company, precisely because Postgres is extensible. "We need full-text search that not only works like this, but also correctly ignores our French accents, or our German umlauts. Build it and we'll pay you." Or: "JSON is a very good thing. Attach an index to it, and our free classified-ads portal will fly a little faster than the rest."
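As a flavor of the accents-and-umlauts kind of request, here is a minimal sketch using the stock unaccent contrib module in modern Postgres; the actual custom work described in the interview predates and differs from this.

```sql
CREATE EXTENSION IF NOT EXISTS unaccent;

-- Strip diacritics before building the search vector,
-- so a query for 'creme & cafe' matches the accented text.
SELECT to_tsvector('simple', unaccent('Crème brûlée au café'))
       @@ to_tsquery('simple', 'creme & cafe');   -- returns true
```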

These guys really did finance development, paying individual developers or, in some cases, ordering the features they wanted from the young, newly formed companies. And then, gradually, the need for commercial versions of Postgres began to arise.

And where did these commercial forks come from in the first place? First of all, Postgres has a clever license that legally allows you to make a commercial product out of it. A BSD-like license, not the GPL. You can take the very same code, even without making any changes to it, keep the couple of lines from the license agreement, add something of your own, and declare: this is my "Vasya Pupkin DB", and I am selling it. Go to market and sell it, even for billions; please, no one minds. Whether anyone will buy Vasya Pupkin's "Vasya Pupkin DB" is a separate question. Most likely not. But what if he adds something of his own to it? And that is what people started doing.

We have a picture where green and red flags mark the commercial and non-commercial forks of Postgres. There are a lot of them; we counted 40-50. Most of them, of course, were not successful: they folded and were forgotten. Some large organizations, such as Amazon or Salesforce, may keep a fork for themselves. The best-known commercial forks are, of course, Greenplum and EnterpriseDB, along with our Postgres Pro and the Japanese Fujitsu Enterprise Postgres.



The British, too, have recently made an enterprise version of Postgres. What for? There are specialized forks: Greenplum, for example, is a database for massively parallel analytical computation. About a decade ago they took Postgres, changed the planner very heavily, and added distributed query processing, losing transactional integrity along the way. For analytical databases that does not matter: the data is already in place, and the job is to gather it and distribute the query properly across the servers. That is how they made Greenplum.

This thing is used in Russia too; in particular, Tinkoff Bank has boasted of running it. A good thing, and it had a certain commercial success, but it lagged behind open source, because it had accumulated a large degree of incompatibility. They changed the planner very heavily and did many good things, and they deserve credit for that, but open-source development went a different way. The main thing here is not to stray too far: the farther you go, the harder it is to merge later.

That is the first example: Postgres versions built around some specific functionality. It is one reason commercial companies arise. What else is there?

Greenplum did a really big functional thing that pushes Postgres strongly sideways. It is no longer a universal product, but it solves an important, in-demand task well.

Another group of forks is our Postgres Pro and EnterpriseDB. The latter, oddly enough, despite being thoroughly American, is in the business of migration from Oracle.

In America, too, people are moving off Oracle, because paying for every sneeze is expensive; not everyone wants to, and not everyone has the means. So EnterpriseDB offers Oracle migration services worldwide, and to make the move easier, they chose to implement in EnterpriseDB some of the features that exist in Oracle.

And what about our Postgres Pro? We decided not to go that route, although we did implement some features from Oracle, for example asynchronous autonomous transactions. In the long run we do not see a future in following Oracle, because "following" means you will always lag behind, you will still never be 100% compatible, and you will still have to constantly prove your at least partial compatibility. That is a difficult problem.
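For context on autonomous transactions: stock Postgres does not have them, and Postgres Pro's implementation is not shown here. A common community workaround is a loopback connection through the dblink contrib module, which commits independently of the caller's transaction. A hedged sketch, assuming local connections are permitted and a hypothetical audit_log table exists:

```sql
CREATE EXTENSION IF NOT EXISTS dblink;

-- The inner INSERT commits on its own connection, so it survives
-- even if the calling transaction later rolls back.
SELECT dblink_exec(
    'dbname=' || current_database(),
    'INSERT INTO audit_log (msg) VALUES (''kept even on rollback'')');
```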

True, you could write a parser for Oracle's PL/SQL language; thank God, its syntax is documented, but you will not find very strict documentation of its semantics. Moreover, for full compatibility you would have to reproduce it "bug for bug". Try doing that, and then try proving that you got it right. We realized that this does not really solve the migration problem, and decided it was better to move forward. Better to simply make the best product.

On the other hand, in that picture, do you even need compatibility at the level of full coverage? I am not talking about "bug for bug". Isn't the challenge to close a notional 80% of the migration cost?

It really seems that way, but in practice migrating a system looks like this: you are handed a complete system. It is a black box wrapped in black paper and tied with a black rope. You do not know what is inside. Is your 80% in there, or does something stick out past it? It often happens that the system was built ad hoc, in three storeys, and the builders of the first floor are long gone.

And therefore, to migrate this thing, you have to do reverse engineering: not even to rewrite it, but just to understand whether it works or not. That is, if you have perfect test coverage, then yes, you can talk...

Which never happens, of course...

...You can say: "We replace the engine, run the tests. 100% passed, everything is verified. Go." In reality that does not happen; it is the ideal case. So it turns out you cover 80%, and with the rest you will still suffer, and you do not know in advance with what. And migration has other, much more significant problems.

Every product, every DBMS, has its strengths. In Postgres, for example, it is the work with JSON and with arrays that we talked about. Many things that are optimally written one way for Oracle are optimally written differently for Postgres. So if you migrate "dumbly", you end up with a minimal working subset that is compatible with both, and you exploit the advantages of neither the old database nor the new one.
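One classic illustration of "written differently" (an editorial example, not one from the interview): a hierarchy walk that Oracle expresses with CONNECT BY PRIOR is idiomatically rewritten in Postgres as a recursive CTE. The employees table here is hypothetical.

```sql
-- All direct and indirect reports of employee 1, Postgres-style.
WITH RECURSIVE subordinates AS (
    SELECT id, manager_id, name
    FROM employees
    WHERE id = 1                          -- the root of the tree
    UNION ALL
    SELECT e.id, e.manager_id, e.name
    FROM employees e
    JOIN subordinates s ON e.manager_id = s.id
)
SELECT * FROM subordinates;
```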

Say a couple of words about the future. Where is all this going? Will you kill off all the competitors?

I do not think we will kill all the competitors. In principle, our goals are good, "sweet" ones. We came not to kill but to build. Our task is to create a very good database. Ideally, the best one.

And what does that mean? What is the criterion of this goodness? I doubt that Microsoft's goal is to make a very bad database.

Yes, of course, everyone is aiming for that. Hence the current trends in the database world. Since data volumes are growing faster than anything else, the main database trend now is distribution. Whoever solves the problem of a distributed database best will be great.

That was Ivan Panchenko. Thank you, Ivan.

Thank you.

Source: https://habr.com/ru/post/352650/

