
#RuPostgres Live: questions and answers with Avito experts. Transcript of the broadcast

About a month ago my colleagues from the DBA team took part in a live meetup on the #RuPostgres Live YouTube channel, where they answered questions from Nikolay Samokhvalov and from viewers who sent them in through a form or joined the broadcast. It turned out to be an interesting and informative conversation about PostgreSQL and about our experience with different versions and tasks, so we decided to make a text transcript of the meeting and enrich it with useful links. If you have questions, ask them in the comments - we will try to answer.



First, let me introduce the participants:


Nikolay Samokhvalov - PostgreSQL evangelist, co-founder of the Russian PostgreSQL community (2008) and of a number of successful startups built on PostgreSQL.


Konstantin Evteev - head of the Avito OLTP unit.


Sergey Burladyan - database architect.


Dmitry Vagin - DBA team lead.


And now, on to the transcript!


Nikolay Samokhvalov: Hello, dear YouTube viewers. My name is Nikolay Samokhvalov. This is #RuPostgres Live #2: questions and answers with Avito experts, and with me today are Konstantin Evteev, Dmitry Vagin and Sergey Burladyan. These are not all the Postgres experts at Avito, as I understand it. How many people do you have on the team who know Postgres?
Konstantin Evteev: The Postgres team - in fact, the OLTP unit - consists of 13 people, who in turn are split up: some work on the platform, some on product development.
Nikolay Samokhvalov: Are there women?
Konstantin Evteev: Yes.
Nikolay Samokhvalov: Right now you can write questions in the YouTube chat. We also have questions prepared in advance, including some from me personally. Let's talk about how Postgres is cooked at Avito. And the first question: is it true that you are still on 9.2?
Konstantin Evteev: There is no 9.2 anymore. We use all the versions of Postgres that are officially supported by the community.
Nikolay Samokhvalov: Its end-of-life was just recently...
Konstantin Evteev: In September. We saw it coming and upgraded from 9.2 to various versions, including 9.4, 9.5, 9.6.
Nikolai Samokhvalov: 10 ...?
Konstantin Evteev: Not yet. We have 10 in testing.
Nikolay Samokhvalov: What is the reason for such a motley picture?
Konstantin Evteev: The mixed picture is primarily due to the fact that before an upgrade we first test the version, and only then upgrade. We look at how it behaves, what kind of performance it shows, and after that (for the next instance) we may test another version and upgrade to 9.4 or 9.5. Because any upgrade involves downtime, and there is no upgrading for the sake of upgrading - we do it for the sake of getting the functionality and performance we need.
Nikolai Samokhvalov: So, you have four different versions?
Konstantin Evteev: We don't have 9.3. There are 9.4, 9.5, 9.6, and 10 in testing.
Nikolay Samokhvalov: If anyone out there still has 9.3, upgrade right away. One of the questions on YouTube was: "Is it worth upgrading from 9.3 straight to 9.6, or do I need an intermediate version?" I would say you should upgrade straight to 10.
Sergey Burladyan: You can skip the intermediate versions.
Nikolay Samokhvalov: I don't quite understand the logic of this question. People probably think the newer versions are more raw. But 9.6 has been around for a year.
Sergey Burladyan: Someone may be worried that, say, the on-disk format has changed and the indexes will have to be recreated. But pg_upgrade supports skipping versions, so you can relax.
Nikolay Samokhvalov: Would you start moving your project to 10?
Sergey Burladyan: Not yet. We have a rule of thumb that we upgrade around the fourth minor release. At 10.4 we could probably go.
Nikolai Samokhvalov: Is this an official policy?
Sergey Burladyan: Semi-official.
Konstantin Evteev: But at the same time we have dev and test environments where we test the newest versions. Version 10, in particular, we tested starting from the alphas and filed several bug reports: 1, 2.
Nikolay Samokhvalov: How do you do it?
Konstantin Evteev: We have sampling on the one hand, and on the other hand fixtures for testing our application. After sampling the data we prepare an image that we deploy into our cloud. Besides the database itself, the dev and test environments also include a large amount of infrastructure: we bring up external indexes, various caches, queues and other infrastructure components there. After deployment the external indexes have to be initialized. We make very active use of logical replication. On the sampled data the whole associated infrastructure has to be brought up. So rolling out the sample data leads to the initialization of about a dozen databases; through the PgQ queue mechanism our copies are populated, data is delivered to the external indexes, and so on.
Nikolay Samokhvalov: We are approaching the question: how much of your data is stored in Postgres?
Konstantin Evteev: The total size of our databases is more than 15 TB. They run on dozens of servers. The typical workload for them is about 10 thousand transactions per second.
Nikolay Samokhvalov: A very interesting question came up on Facebook recently. There is production data you would like to test against in a test environment, and maybe even in dev. That's the ideal case: you can see where things slow down. It's clear that, first, not every developer should be shown all the data, and second, you can't easily drag a huge 15 TB dataset into the dev environment. How do you prepare these environments, how do you collect the data?
Konstantin Evteev: How do we cut a sample? Some data is related, some is not. We have two main entities in the project - the user and the ad. We take a sample of users, collect all their ads, and then select from the remaining tables. Most likely you only need a limited number of rows - we set the limits in our script.
Nikolay Samokhvalov: What percentage?
Konstantin Evteev: It depends on the parameters and on the problem we are solving in the test. In most cases large volumes are not needed.
Dmitry Vagin: I'll tell you a little about the whole mechanism. Every morning we run a script that pulls out certain users and their ads: a random selection plus pre-wired test users. Plus a lot of dependencies, with some limits. The whole assembly from the production databases takes about half an hour. Then Docker images are built from this and pushed to the registry.
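To make the mechanism concrete, here is a minimal sketch of what such a sampler could look like in plain SQL. All names (users, ads, fixed_test_users, the status column, the limits) are hypothetical and only illustrate the idea; the real script also pulls in many dependent lookup tables and anonymizes the data.

```sql
-- A separate schema for the sampled data that will go into the image.
CREATE SCHEMA IF NOT EXISTS sample;

-- A random batch of users that have active ads ...
CREATE TABLE sample.users AS
SELECT u.*
FROM   users u
WHERE  u.id IN (SELECT user_id FROM ads WHERE status = 'active')
ORDER  BY random()          -- naive; a production sampler would avoid a full sort
LIMIT  1000;

-- ... plus the pre-wired test users.
INSERT INTO sample.users
SELECT u.*
FROM   users u
JOIN   fixed_test_users f ON f.user_id = u.id
WHERE  NOT EXISTS (SELECT 1 FROM sample.users s WHERE s.id = u.id);

-- All ads belonging to the sampled users; other dependent tables follow the same pattern.
CREATE TABLE sample.ads AS
SELECT a.*
FROM   ads a
JOIN   sample.users su ON su.id = a.user_id;
```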
Nikolay Samokhvalov: How is the random selection done? TABLESAMPLE, or something else?
Dmitry Vagin: No. We have counters for how many users have active ads, and the script randomly takes a batch of users. The criterion is that the search results in this test image should be about two pages for Moscow, so that you can test things more or less sensibly. Simple conditions.
Nikolay Samokhvalov: So it turns out that's not much data.
Dmitry Vagin: There are a lot of lookup tables, a lot of dependent data.
Nikolay Samokhvalov: But there are not many ads themselves. If you are doing something new - say you forgot an index - will you only find out in production? Or will the test environment show it?
Dmitry Vagin: We will see this in testing.
Konstantin Evteev: "Not much" and "forgotten index" - it all depends on the amount of data. If there are a few gigabytes, a missing index will show, it will be noticeable. And we do have a few gigabytes there. If developers need load testing, the necessary database can be restored from a backup onto the dev cluster; we do that from time to time. The main thing I wanted to mention is that when these images are assembled, obfuscation takes place - all the data is anonymized.
Nikolay Samokhvalov: Did the programmers write that specifically?
Konstantin Evteev: Yes, so that real user data is not visible.
Nikolay Samokhvalov: That's cool. But presumably if the schema changes, it has to be rewritten. That takes extra time.
Konstantin Evteev: Among other things, yes. If you work with a data sample, that is unavoidable when bringing up dev environments. Before you can develop against prod you have to do double the work: first do the development in the dev and test environments. That is why, when new projects appear, we campaign across the whole team not to use data sampling but to write fixtures, so that this single point of failure is not everywhere.
Nikolay Samokhvalov: What languages do you mostly use? Ruby or PHP?
Konstantin Evteev: Fixtures can be written in different languages.
Nikolay Samokhvalov: And what do you use?
Konstantin Evteev: PHP.
Nikolay Samokhvalov: Are you trying to move away from sampling towards synthetic data for testing?
Konstantin Evteev: Yes. For one thing, tests get written first; for another, there is no need to separately maintain the data sampling, including dealing with the sampler breaking. It can break if someone changed the structure but did not write a migration or did not add the table to the sample...
Nikolay Samokhvalov: As one well-known Postgres person said, life is richer. Real data distributions are different, right? Maybe it also makes sense to take live production data? I just heard that fixtures are cool, but it seems to me you need to look at the real distribution.
Konstantin Evteev: That is what staging is for. When there is a need for load testing, there is a set of servers where you can deploy a full backup of the prod database.
Nikolay Samokhvalov: It would be interesting to ask about monitoring. How do you notice that indexes are missing? What do you use? I know you have a talk about monitoring - Dmitry gave it. A couple of words: what tools?
Dmitry Vagin: The idea is simple: take everything there is, push it into Graphite or somewhere else, and then build dashboards. When an index is missing on a very loaded table, it is usually immediately obvious - nothing on the site works, or you can see it in the top queries in pg_stat_activity and so on. For more complex cases you have to dig into the graphs on the dashboards. Say there are too many disk reads on some table. You look at which queries hit it, and perhaps the index needs to be built a bit differently.
Nikolay Samokhvalov: Why not take something that already exists?
Dmitry Vagin: You can, but when you take something existing, you have to figure out how it collects these metrics and what you end up seeing: what it gathers, how it processes it, where it stores it and what you see in the end. So I cannot just take and trust some number reported by some utility, for example Munin. Though you can dig into the Munin plugin and see what is there...
Nikolay Samokhvalov: The same with Zabbix - you can see how everything works there.
Dmitry Vagin: Zabbix simply will not cope with our volumes.
Nikolay Samokhvalov: Interesting.
Dmitry Vagin: Even our Graphite does not always manage to take everything in.
Nikolay Samokhvalov: A question from the chat: how do you run VACUUM FULL on a huge table, over 100 GB?
Dmitry Vagin: There are two options. The first is: don't.
Nikolai Samokhvalov: Bad option.
Dmitry Vagin: Well... Why vacuum full?
Nikolay Samokhvalov: Fair point, actually. I did not immediately understand this question. Strike out "full". Nobody runs VACUUM FULL just like that, there is no need to.
Dmitry Vagin: Regular vacuum does the job.
Nikolay Samokhvalov: OK. We discussed before the broadcast that the 15 TB figure is even decreasing. Thanks to what?
Konstantin Evteev: We called it the "tail". There is active data - hot ads. Then there are, say, deleted and blocked ones, and ones deleted forever, meaning you have deleted them and no longer see them in the interface. We still have to store all that information, but it can be stored elsewhere - also in Postgres.
Nikolai Samokhvalov: In 9.2?
Konstantin Evteev: (Laughs). Well, for example, yes. We wrote a script that collects all the old data and moves it to another database accessible only to the back office. We had a high growth rate on the main database; after launching the script, moving the data to the tail and reindexing, the volume dropped and the growth rate slowed down, almost stopped. Periodically we reindex.
Nikolay Samokhvalov: Do you use pg_repack or any of its analogs?
Konstantin Evteev: We do, among other things. Now Dima will tell a story.
Dmitry Vagin: I once deliberately turned off autovacuum on one table. And forgot to turn it back on. Two weeks later I saw that the table had grown from 20 GB to 300 GB. I thought: "Okay, this needs fixing." Autovacuum had then worked honestly, the table was half empty, and you somehow have to shrink it. What are the options? Take downtime, turn off the production load, run VACUUM FULL - not realistic. The second option is to use the PG Compact script written by Sergey. It is an old, well-known idea - update rows from the end of the table...
Nikolay Samokhvalov: I'll jump in here. This is an example of an epic open-source fail: Hubert depesz Lubaczewski raises the topic, then other folks pick it up, then Sergey Konoplev joins in, then Maxim Boguk joins in, and everyone ends up with their own version. There are three such forks, and now I am hearing about a fourth! I do not know, maybe it is NDAs or something, but everyone writes their own and cannot agree. I tried to use all of them, but for myself I settled on pg_repack. Just at the moment I was thinking about it, Amazon added official support for pg_repack, so it is practically becoming the industry standard. And you say you use updates. For those who do not know, pg_repack creates a separate table and triggers to track changes. Why not pg_repack? And the second question: why a fourth version of the same thing?
Sergey Burladyan: pg_repack edits the Postgres system catalogs directly. That seems scary to us.
Nikolai Samokhvalov: Directly?
Sergey Burladyan: Yes.
Nikolay Samokhvalov: It does not seem to scare Amazon.
Sergey Burladyan: Yes. And besides, it uses triggers, which is extra load on production. It takes space - almost double the volume. So we did not want to use pg_repack. The approach of constantly updating rows so that they move into free space seems more suitable to us.
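A minimal sketch of this "defragmentation by UPDATE" idea, assuming a hypothetical bloated_table with an id column and the default 8 kB block size. The real scripts (pgcompact, pgcompacttable and the Avito version) do this in small batches, take care that HOT updates do not keep the row on the same page, and repeat until the table stops shrinking.

```sql
-- Rewrite the row versions that live on the last ~100 pages of the table:
-- the updates place new versions into free space earlier in the heap,
-- after which plain VACUUM can truncate the emptied tail pages.
UPDATE bloated_table t
SET    id = id     -- value does not change, we only force a new row version
WHERE  (t.ctid::text::point)[0] >=                            -- page number of the row
       (SELECT pg_relation_size('bloated_table') / 8192 - 100);

VACUUM bloated_table;   -- truncates the now-empty pages at the end of the file
```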
Nikolai Samokhvalov: Looks like Windows defragmentation.
Sergey Burladyan: Yes.
Nikolay Samokhvalov: Why your own version?
Sergey Burladyan: We just wanted to make a simple version so that it is clear how it works. It does not have the features of the versions you mentioned; we have, roughly speaking, a call to update and a call to vacuum. It is also written in Perl. Today I would probably do it in Python. This thing is rarely needed, but it does get used.
Nikolay Samokhvalov: Let's go back a little. How do you tell that a table is bloated? There is one company that does not show a bloat graph, explaining that to show it properly you need extensions that create a large load. That is, to measure bloat properly you have to get into the guts and shake them, and that is very noticeable. How do you do it?
Dmitry Vagin: We do not look at bloat. Vacuum works fine. It is enough not to turn autovacuum off and everything will be fine. Ours is slightly more aggressive than the default.
Nikolay Samokhvalov: 1%, 2%? There is a threshold that determines when to run analyze and when to run vacuum itself, depending on how dirty the table is.
Sergey Burladyan: autovacuum_vacuum_scale_factor = 0.0014. The default vacuum_cost_limit is, I believe, 200; we have 700. That has also been turned up so that it works more often.
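For reference, a sketch of how the settings Sergey quotes can be applied, either instance-wide or per table; hot_table is a hypothetical name for a heavily updated table, and the values are simply the ones mentioned above.

```sql
-- Instance-wide (ALTER SYSTEM writes postgresql.auto.conf, available since 9.4):
ALTER SYSTEM SET autovacuum_vacuum_scale_factor = 0.0014;  -- default is 0.2
ALTER SYSTEM SET vacuum_cost_limit = 700;                  -- default is 200
SELECT pg_reload_conf();

-- Or only for a single heavily updated table:
ALTER TABLE hot_table SET (autovacuum_vacuum_scale_factor = 0.0014);
```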
Nikolay Samokhvalov: The most interesting part is still what percentage.
Sergey Burladyan: As long as there is space on the disks, you do not have to think about it. You start thinking when it is clear that queries are slowing down, let's say.
Nikolay Samokhvalov: Monitoring?
Sergey Burladyan: Yes.
Nikolay Samokhvalov: There was a question in a different direction, about security. Do you store ACLs in Postgres?
Sergey Burladyan: ACLs are not used. Row-level security is not used.
Nikolay Samokhvalov: What tools do you use for replication and failover of your large databases?
Sergey Burladyan: We do not use ready-made tools. There is a set of our own scripts for failover. We also use Londiste - our patched version, which lets us recover automatically after a failover.
Nikolai Samokhvalov: Londiste is your main tool now?
Sergey Burladyan: Yes, the main one for logical replication.
Nikolay Samokhvalov: Do you have PL/pgSQL code?
Konstantin Evteev: Yes, a lot.
Nikolay Samokhvalov: Do you like it?
Konstantin Evteev: Depends on the case. But on the whole, yes.
Nikolai Samokhvalov: Tell us how you prepare, test, debug.
Konstantin Evteev: I'll tell you more about the versioning of PL/pgSQL code. The question is: how do you version the code and schema of the database together with the application code? After all, when you have many application servers and they roll out gradually, you have to keep both the old and the new versions of the PL/pgSQL code working.
If we are talking about the database schema, we handle it through migrations. All of them must be backward compatible, because both the old and the new application versions can be in production at the same time. What to do with the code? We came to two approaches, and we are now in the process of moving from one to the other.
The first approach is to keep all our PL/pgSQL stored procedures alongside the application code. We wrote a wrapper that reads the catalog: each procedure is kept in a separate file, a checksum is computed from it and compared with the checksum stored in the database, i.e. it checks whether that version of the procedure already exists. If not, it creates a new stored procedure with a suffix for the specific version and registers it in a table. For example, create_user, version 1. The table simply stores 1, and from it a dictionary is built that ships with the application; when the application calls the create_user binding, the dictionary tells it whether the suffix should be 1, 2 or 3.
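A rough sketch of what this checksum-and-suffix scheme might look like. The registry table, the function body and the suffix convention here are hypothetical, purely to illustrate the idea.

```sql
-- Registry maintained by the deploy wrapper: current version and checksum per procedure.
CREATE TABLE IF NOT EXISTS proc_versions (
    proc_name text PRIMARY KEY,
    version   integer NOT NULL,
    checksum  text    NOT NULL
);

-- The wrapper compares the checksum of create_user.sql with proc_versions.checksum;
-- if it differs, it installs the file under a new versioned name ...
CREATE OR REPLACE FUNCTION create_user_v2(p_login text) RETURNS bigint AS $$
    INSERT INTO users (login) VALUES (p_login) RETURNING id;
$$ LANGUAGE sql;

-- ... and records the new version, so the dictionary shipped with the application
-- knows that a call to "create_user" should now go to create_user_v2.
INSERT INTO proc_versions (proc_name, version, checksum)
VALUES ('create_user', 2, '<md5 of create_user.sql>')
ON CONFLICT (proc_name) DO UPDATE
SET version = EXCLUDED.version, checksum = EXCLUDED.checksum;   -- 9.5+ syntax
```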
There are drawbacks:


  1. When there are many projects, all of this has to be introduced and supported in each of them.
  2. Procedures that call other procedures cannot be versioned this way.

Accordingly, we came to the second approach: we simply keep a pool of database users. Before an application deployment the users are rotated in a circle, and for one of them the schema with the stored procedures is recreated from scratch. The new application deployment then runs with the new user and the new schema.

Nikolay Samokhvalov: Once, at creation time. Cool. I saw in the presentation that you use session variables in Postgres quite a lot - GUC variables (Grand Unified Configuration).
Konstantin Evteev: Yes, I'll tell you about that. There is no need to be afraid of them; we have studied this case. Where did we need it? We deliver data to our analytical subsystem. We do it with the help of PgQ and deferred queues. We hang a deferred trigger on the table that reacts to the first change. If a record is changed several times, we need to catch exactly the first change: at that moment we still see the old values of the record, and the new data we can select directly from the database.
Then we had the next task. What if we have many related tables and, along with a change in this table, we just want to throw a signal that an object in the neighboring table has changed? In that case, when we change data in the neighboring table, we set session variables. For example, we set a key: user 5 performed some action - only in the current session. And in this variable we keep key-value pairs, in case there are several such users. Then, directly in the deferred trigger, we check these keys, the session variables, and if needed we mix in additional signals. This way we get a derivative of the data: it does not exist in this form at the source, and a completely different data set arrives at the receiver. We tested all of this under load and found no overhead from setting and reading session variables (at least we did not notice any).
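A minimal sketch of this session-variable trick under assumed names (related_table, main_table, the custom GUC avito.changed_users, an analytics_queue in PgQ, which is assumed to be installed). Custom variables are written with set_config() and read with current_setting(); the two-argument form of current_setting() used here is available from 9.6, on older versions the missing-parameter error has to be caught instead.

```sql
-- Trigger function for the related table: remember, only within the current
-- transaction, which users were touched.
CREATE OR REPLACE FUNCTION remember_changed_user() RETURNS trigger AS $$
BEGIN
    PERFORM set_config(
        'avito.changed_users',
        coalesce(current_setting('avito.changed_users', true), '') || NEW.user_id || ',',
        true);                                   -- true = local to the current transaction
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

-- Deferred trigger on the main table: mix the collected signal into the event
-- sent to PgQ, producing "derived" data that does not exist at the source in this form.
CREATE OR REPLACE FUNCTION enqueue_change() RETURNS trigger AS $$
BEGIN
    PERFORM pgq.insert_event(
        'analytics_queue', 'row_changed',
        row_to_json(NEW)::text || ';changed_users=' ||
        coalesce(current_setting('avito.changed_users', true), ''));
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE CONSTRAINT TRIGGER main_table_changed
    AFTER INSERT OR UPDATE ON main_table
    DEFERRABLE INITIALLY DEFERRED
    FOR EACH ROW EXECUTE PROCEDURE enqueue_change();
```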
Nikolay Samokhvalov: Do you use JSONB or JSON?
Konstantin Evteev: In most cases, no. In some places we are starting to use it, but we do not use the functions for working with JSONB. Why? It turned out that we are CPU-bound. The JSONB format itself is convenient for storage and for work, but since we are CPU-bound, all that processing is done at the application level. By the way, we do still actively use hstore. Historically it appeared earlier, but it has the same CPU problems.
Nikolay Samokhvalov: I saw that in your slides. When you talk about key-value, do you mean hstore?
Konstantin Evteev: Yes.
Nikolay Samokhvalov: And indexes? Do you index hstore? GIN? Are GIN and GiST used at all?
Konstantin Evteev: They were used in one case. There is a story about it: you need to read the documentation carefully. In particular, we had a case where we had to set fastupdate to off, because the index worked very well up to a certain size - writes to that table were quite active - and then updates started to freeze.
Sergey Burladyan: This is described in the Postgres documentation.
Dmitry Vagin: When there are changes in a column with a GIN index, in order to apply the changes quickly it puts them into a small pending area, and on reads it consults both the main tree and those pending changes in memory. At some point the space runs out and it has to merge it all into the large index. At that moment everything stalled for two minutes, everything was blocked on it.
Nikolay Samokhvalov: Does that mean there were too many updates per unit of time?
Dmitry Vagin: There are always a lot of updates - our production load is considerable.
Nikolay Samokhvalov: The 10,000 transactions, or however many...
Dmitry Vagin: Yes, yes, yes. And there is simply no moment when it can quietly flush things in the background. As a result it tells us: "I have hit the limit, I need to flush everything. Wait." And we wait. In the end we turned fastupdate off, and everything became fine.
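For reference, fastupdate is a storage parameter of the GIN index and can be switched without rebuilding it; the index and table names below are hypothetical.

```sql
-- Disable the pending-list mechanism: every change goes straight into the main
-- GIN tree, so individual writes get a bit more expensive, but the periodic
-- multi-minute flushes of the pending list disappear.
ALTER INDEX ads_attrs_gin_idx SET (fastupdate = off);
-- Entries already accumulated in the pending list are merged by the next VACUUM.

-- Or at creation time:
-- CREATE INDEX ads_attrs_gin_idx ON ads USING gin (attrs) WITH (fastupdate = off);
```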
Nikolay Samokhvalov: Did every update become slower overall?
Dmitry Vagin: We did not notice that.
Nikolay Samokhvalov: Of those 10,000 transactions, how many are writes?
Konstantin Evteev: About 1,500 - 90,000 per minute. It differs between databases, by the way; this example is about one of them. Roughly 10 thousand transactions per second is the figure on the other databases too. On some of them writes are roughly 500 transactions per second and everything else is reads, and there are databases where, out of a total load of 8 thousand transactions per second, 4 thousand are writes.
Nikolay Samokhvalov: Different versions, different databases - are these different services with different teams?
Konstantin Evteev: There are separate services with a microservice architecture - one service, one database. And there is the legacy monolith, which has several databases; some of them are connected via Londiste and event streaming - we push events through PgQ - and some are connected at the application level.
Nikolay Samokhvalov: Do you use PostgreSQL as an analytical database?
Konstantin Evteev: We used to, not anymore. We have Vertica, and we are very happy with it.
Nikolay Samokhvalov: How often do you ship data from Postgres?
Konstantin Evteev: In real time, with minimal delay. It is pushed through PgQ, and Vertica picks it up. We have one dedicated server for this; the Vertica team then takes the data from there in a certain order. Thanks to PgQ, all the data is ordered by the sequence.
Nikolay Samokhvalov: If something is deleted or updated, is it also updated in Vertica, or do you keep the new version and the old one as well? A row got updated, for example.
Konstantin Evteev: Vertica stores all the versions. There is a very good talk by Nikolay Golov, and an article on Habr, about how storage is implemented in Vertica.
Nikolay Samokhvalov: What about other databases? Is there SQLite?
Konstantin Evteev: In general, our approach in the company is that when a team starts a new project, it is free to choose from the tech radar whichever set of technologies is more convenient and better suited to its tasks.
Nikolay Samokhvalov: Is ClickHouse not on the radar?
Konstantin Evteev: It is. ClickHouse is used for storing metrics.
Nikolai Samokhvalov: CockroachDB?
Konstantin Evteev: No. For one solution SQLite turned out to be a good fit; it is already in production. Postgres was considered too, but SQLite delivered the required TPS, so we chose it.
Nikolay Samokhvalov: Do you help all the teams with Postgres?
Konstantin Evteev: Yes. We do DBaaS - we provide a database to given requirements. If necessary, we act as a kind of SWAT team that helps implement the business logic of specific teams. We can teach, or implement a particular business feature ourselves.
Nikolay Samokhvalov: Is there an option to use something other than Postgres? Say MongoDB if you want, or Tarantool?
Konstantin Evteev: For certain tasks there are both Tarantool and MongoDB.
Nikolay Samokhvalov: Do you use MySQL or MariaDB?
Konstantin Evteev: MySQL is not in production. Maybe in some internal projects.
Nikolay Samokhvalov: More questions from the chat. What can you advise for online migration from 9.3 to 9.5 or 9.6? Londiste, Bucardo or something else? Sergey, maybe you know best?
Sergey Burladyan: We simply have not considered such migration paths, because at first glance they most likely just cannot cope with our traffic. Neither Londiste nor Bucardo could logically replicate our entire database.
Nikolay Samokhvalov: Let's imagine (I know this is not the case) that you had everything on 9.3. Say you are out of luck and you need something fresher. What would you do?
Sergey Burladyan: We would take downtime and use pg_upgrade.
Nikolay Samokhvalov: How long would the Avito website be down?
Sergey Burladyan: We use pg_upgrade, a standard Postgres utility, and it has a great mode of working with hardlinks. Thanks to this mode the downtime is about 3-5 minutes, 10 at most.
Nikolay Samokhvalov: So: take the documentation, study it, and do everything with the standard tools, using this pg_upgrade approach with hardlinks. That is the answer.
Sergey Burladyan: Yes.
Nikolay Samokhvalov: And what about the replicas?
Sergey Burladyan: pg_upgrade can only upgrade the master, because it needs to write to the server; it cannot upgrade a standby. And we have standbys in production. The master alone is not enough for us - part of the read load goes to the standbys, so we can no longer run just the master without them. (We used to do it that way: upgrade the master, then quietly recreate the standbys, and that was it.) Fortunately, the Postgres authors described a special procedure in the documentation which uses the rsync utility to upgrade the standbys after pg_upgrade. Its trick is simply to use rsync's ability to copy hardlinks: roughly speaking, it recreates on the standby the same hardlink structure as on the master. The standby already has the same data as the master, so if you create the same hardlinks there during the downtime, it works out... When you run rsync, the master must be shut down.
Nikolay Samokhvalov: If there are several replicas, the downtime will grow - you have to keep an eye on that.
Sergey Burladyan: Yes.
Nikolay Samokhvalov: Do you do this manually or is it automated?
Sergey Burladyan: We prepare for the task, prepare the set of commands that need to be executed, and then execute them. Manually.
Nikolay Samokhvalov: Tell me about some of the pitfalls you have hit while working with Postgres - things you would like to improve, problems, bugs you have found recently.
Sergey Burladyan: We find bugs in Postgres periodically, because we have a serious load. A couple of days ago we found out that, as it turns out, plans inside a trigger are not invalidated when the table is changed. There is a trigger, a session, it is working; then we do an ALTER TABLE, change a column, and the trigger does not see the change and starts failing with errors. You have to do another ALTER after the ALTER COLUMN in the same transaction so that it drops its cache. We have also found bugs in vacuum. The cache thing is not really reportable as a bug, though.
Nikolai Samokhvalov: Maybe not as a bug, but as ...
Sergey Burladyan: You could write about this to the mailing list.
Nikolay Samokhvalov: You should.
Sergey Burladyan: I agree .
Nikolay Samokhvalov: Do you get deadlocks, and how do you fight them?
Sergey Burladyan: We do. We look at the places where they appear and try to resolve them.
Nikolay Samokhvalov: Does monitoring report them?
Sergey Burladyan: Yes, you can see in the logs that a deadlock occurred.
Nikolay Samokhvalov: But nobody is going to read logs of that volume every day. Say there were 10 deadlocks in the last minute - will you get an SMS?
Konstantin Evteev: No, but they surface immediately. How do we deal with them? For deadlocks detected by Postgres, one of the queries gets killed - and which one is unknown. The way to deal with them is to take locks in the same order: you have to sort when taking locks, or go from the more specific object to the more general one, in a sorted order.
But the scarier deadlocks are the ones Postgres cannot detect. Say you have a microservice architecture: you open a transaction, take a lock on a resource, and then the application makes a request to another service, which implicitly goes to the same database and tries to take a lock on the same resource. That is an eternal deadlock which is not detected in any way; it will probably only fall off later on a timeout.
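The "take locks in the same order" rule can be illustrated with a minimal sketch on a hypothetical accounts table: two transactions that lock the same pair of rows in opposite orders can deadlock, while agreeing on a single sort order makes that impossible.

```sql
-- Deadlock-prone: session A locks id = 1 then id = 2, session B locks id = 2 then id = 1.
-- Safe pattern: every transaction locks the rows it needs in one agreed-upon order.
BEGIN;

SELECT *
FROM   accounts
WHERE  id IN (1, 2)       -- all rows this transaction is going to modify
ORDER  BY id              -- the global ordering everyone follows
FOR UPDATE;

-- ... perform the updates on the locked rows ...

COMMIT;
```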
Dmitry Vagin: About the SMS with ten deadlocks: if a deadlock occurs, one transaction fails; if that somehow affects production, an exception is thrown, an error lands in Sentry and later arrives by e-mail. Basically, everything is visible.
Nikolay Samokhvalov: The consequences will surface. There is a question I wanted to cover in more detail. How do you make backups? What tools do you use, how do you verify them, how long are they kept, how far back can you restore?
Sergey Burladyan: We make them with pg_basebackup, with our own wrapper script around it. We use Point-in-Time Recovery (PITR): we continuously archive the WAL files and can recover to any point in the past. Additionally, on one of the standbys WAL replay is delayed by 12 hours.
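Two small probes that go with this scheme are sketched below: checking on the primary that WAL archiving is keeping up, and checking on the delayed standby how far behind real time replay is. (The 12-hour delay itself is configured on the standby with recovery_min_apply_delay; these queries are only illustrative.)

```sql
-- On the primary: is the WAL archiver healthy?  (pg_stat_archiver exists since 9.4.)
SELECT archived_count, last_archived_wal, failed_count, last_failed_wal
FROM   pg_stat_archiver;

-- On the delayed standby: how far behind real time is replay?
-- For a replica delayed by 12 hours this should hover around 12 hours.
SELECT now() - pg_last_xact_replay_timestamp() AS replay_lag;
```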
Nikolai Samokhvalov: There was a report on this topic last year.
Sergey Burladyan: Yes.
Nikolay Samokhvalov: Those interested can find the video, the PDF, the article. Would you be able to restore a backup from a month ago?
Konstantin Evteev: Yes.
Sergey Burladyan: Kostya means that the backups of our databases are additionally backed up by the DevOps department and kept in their storage. They back them up with Bareos; it used to be Bacula. They take our backups and store them separately.
Nikolay Samokhvalov: How much space do they take?
Sergey Burladyan: Ours are about 60 TB, I believe. How much the DevOps copies take, I do not know.
Nikolay Samokhvalov: Have you considered utilities like WAL-E?
Sergey Burladyan: No, we have our own script. We know what it does, it is simple - it just archives the WALs into our archive.
Nikolay Samokhvalov: I am being asked to publish your version of the script, that compactor. Can you do that?
Sergey Burladyan: I think so. We are planning to upload our patched version of Londiste to GitHub before HighLoad, and maybe this script too.
Nikolay Samokhvalov: Send it to me, I will tweet it too.
Now, a topic that cannot be ignored - search. How is it all arranged, including full-text search, what do you use? Why does the word "zoo" come up here again?
Konstantin Evteev: We do not currently use full-text search in Postgres. We used it until recently, but with sufficiently large data volumes and a large number of requests we run into the CPU - we are CPU-bound. But we have a great Sphinx team.
Nikolay Samokhvalov: If you are CPU-bound, why not spread the load across several machines, shard it?
Konstantin Evteev: We used it for a very narrow task with a very small amount of data. We realized the volume would grow, and after that the task was moved over to Sphinx, which we have had for a long time.
Nikolay Samokhvalov: How is it sharded?
Konstantin Evteev: If I am not mistaken... I would rather not name a number.
Nikolay Samokhvalov: Approximately.
Konstantin Evteev: There are articles and transcripts, including on Habr, and videos from HighLoad - better to look there. What I do know is how we deliver data to these search engines.
The data that needs to be delivered to Sphinx - if we are talking about serving the active ads on the site - we materialize and send with Londiste to a separate machine, where in turn we take this data and split it into many pieces; roughly speaking, partitioning kicks in. Then we send this data to Sphinx in many parallel streams. This way, within half an hour we can completely rebuild the index of the site's active ads, several times over if needed. In addition, we implemented real-time delivery to the Sphinx indexes via PgQ: when the data arrives at the indexing server, we move it into another queue, and a consumer there works directly with Sphinx.
Nikolay Samokhvalov: One more question. Do I understand correctly that the delay is now quite large for the search engine? That is, if I post an ad...
Konstantin Evteev: Those are product requirements. Right now it is half an hour.
Nikolay Samokhvalov: So if I add an ad myself, I will not find it through full-text search?
Konstantin Evteev: It will appear in the search results within half an hour. That is partly a restriction on the product side: after all, it is not known what kind of ad you will add - good or bad.
Nikolay Samokhvalov: Will it appear in the listing right away? Or how - once it reaches the slave?
Konstantin Evteev: From a technical point of view we can deliver it to the index very quickly. But before actually showing it, it has to be checked.
Nikolay Samokhvalov: Ah, you mean moderation.
Konstantin Evteev: Yes, moderation. You never know what is written in an ad, what kind of goods might be offered. If we are talking about our queues, then technically the delivery to the external index takes a few seconds.
Nikolay Samokhvalov: I got distracted by other questions - did I understand correctly that the queue into Sphinx also goes through PgQ?
Konstantin Evteev: Yes.
Nikolay Samokhvalov: So you really like PgQ that much?
Konstantin Evteev: Yes, we love it very much. And it is very fast. Formally the guaranteed SLA is a few minutes, but in practice it is a few seconds.
Nikolai Samokhvalov: Do you have Kafka at all?
Konstantin Evteev: No. But we do have NSQ.
Nikolay Samokhvalov: Why that one?
Konstantin Evteev: It has proven itself very well - a good queue. So it is used here as well: on the one hand we have the queue that goes from the database through PgQ, and through NSQ go, among other things, clickstream events, events from applications...
Nikolay Samokhvalov: "Do you use PgBouncer?" I actually will not ask this question either, because your colleague Viktor Yagofarov gave a talk about it in San Francisco. I would rather post a link on Twitter; you can look at the PDF there, and maybe there is even a video.
Konstantin Evteev: He also gave the same talk in St. Petersburg.
Nikolay Samokhvalov: Do you use Materialized views?
Konstantin Evteev: Materialized views in the Postgres sense - no, because they have to be refreshed manually. We use materialized views in our own interpretation. There are great materials about it; I will probably give you a link, and you can post it as well, because it is a very cool approach - a realtime materialized view implemented with triggers. [References: 1, 2]
Nikolay Samokhvalov: Most likely I have already posted it.
Konstantin Evteev: Probably, yes. So: we put triggers that write the update time, now(), into the record. Then, in the deferred trigger, we check: if this time is not equal to now(), we simply delete the record, re-select it from the view and insert it back into the table, which is, in effect, our materialized view.
Nikolay Samokhvalov: So it is actually a table?
Konstantin Evteev: Yes, it is just a table. We maintain it with triggers.
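A minimal sketch of such a trigger-maintained "realtime materialized view", under assumed names: user_stats is an ordinary view, user_stats_mat is the table kept in sync with it, and ads is the source table. The real implementation also stores the update time (now()) on the row so that the deferred trigger refreshes each row only once per transaction; that optimization is left out here.

```sql
-- The "materialized view" is just a regular table, refreshed row by row.
CREATE TABLE user_stats_mat (
    user_id      bigint PRIMARY KEY,
    active_ads   integer NOT NULL,
    refreshed_at timestamptz NOT NULL DEFAULT now()
);

CREATE OR REPLACE FUNCTION refresh_user_stats() RETURNS trigger AS $$
DECLARE
    v_user bigint := CASE WHEN TG_OP = 'DELETE' THEN OLD.user_id ELSE NEW.user_id END;
BEGIN
    -- Drop the stale row and re-materialize it from the ordinary view user_stats.
    DELETE FROM user_stats_mat WHERE user_id = v_user;
    INSERT INTO user_stats_mat (user_id, active_ads)
    SELECT user_id, active_ads FROM user_stats WHERE user_id = v_user;
    RETURN NULL;   -- return value is ignored for AFTER triggers
END;
$$ LANGUAGE plpgsql;

-- Deferred, so the refresh happens at commit time, after the source rows have settled.
CREATE CONSTRAINT TRIGGER ads_refresh_user_stats
    AFTER INSERT OR UPDATE OR DELETE ON ads
    DEFERRABLE INITIALLY DEFERRED
    FOR EACH ROW EXECUTE PROCEDURE refresh_user_stats();
```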
Nikolay Samokhvalov: Are there many of these? What data volumes?
Konstantin Evteev: Some have tens of millions of rows, some hundreds of millions.
Nikolay Samokhvalov: And what tasks are they used for?
[The rest of the transcript did not survive the extraction of the source page; only fragments remain. From those fragments, the conversation went on to cover: tsvector and JSONB; Vertica, BI and Greenplum; running Postgres on Debian GNU/Linux, the OOM killer, read-ahead, NUMA and huge pages; SSDs; partitioning and sharding with pg_pathman and PL/Proxy; ltree, range types, btree_gist and constraint exclusion; load distribution between master and standby; PgBouncer; an earlier upgrade from 9.2 to 9.4 using pg_upgrade with hardlinks and rsync; reading the release notes; a recommendation from Christophe Pettus; and monitoring with pg_stat_statements, listen/notify, brubeck, Grafana dashboards and PG Metricus.]
Nikolay Samokhvalov: See you in the next broadcasts!



Source: https://habr.com/ru/post/344780/

