
10 ways to achieve HighLoad and BigData out of thin air



Ilya Kosmodemyansky (hydrobiont)


There are typical mistakes people make when working with databases. I do not invent them; since we do a lot of remote database support, we simply collect them, often the same ones from different customers, and keep a kind of rating of what we have gathered. Today I will talk about these things.

Ilya Kosmodemyansky: Some things will be common to any database, but I will naturally mention a couple of PostgreSQL-specific points, since we work mainly with PostgreSQL, and PostgreSQL is traditionally used a lot on the web.

As an epigraph, consider the following observation:


As in electronics, there are two kinds of problems: either there is a contact where there should not be one, or there is no contact where there should be. Databases are the same story. People store everything in the database, from pictures to absolutely amazing things that perhaps should not be stored there at all. Or, on the contrary, they do not store in the database things that would be worth keeping there: one cache is used for some things, another cache for others, some NoSQL storage for the third, and so on, when perhaps it would be better, and safer, to keep them in the database.

Let us briefly go through these things. They are not strictly ordered by severity, but we will sort that out as we go.


As a disclaimer: there will be some trolling; I may not hold back in places. When you tell people in a boring way, "Guys, don't do it like this," everyone naturally says "yeah" and then does it anyway. When some emotion appears, a person thinks: "Damn! Who writes like that?", opens git, and there it is — who wrote this yesterday? That way it sticks better.

So, the first problem.



And it is first partly because I fail to observe it only on very rare occasions. With Oracle and DB2 this problem probably happens rarely, because there every additional node costs hellish money; that limits the flight of fantasy, though it is also not very relevant to the web. With open source, with free stuff, just go ahead! And it is not strictly a PostgreSQL thing: with MySQL it is even more characteristic, and with NoSQL especially so.

People love to scale, because scaling is "good", they wrote about it in books. Just because it is "good". Why it is good, few people can explain, and this is where quite a lot of problems come from.

It is especially telling that for the past 10 years or so I have constantly heard that everything must be scale-out and under no circumstances scale-up, because scale-up is supposedly very bad.

Since the time that idea was invented, computers have become considerably more powerful. The tasks have grown too, but not to the same extent that computers have grown in power.

There are people who deal with Big Data for objective reasons: physicists who get data from the hadron collider; astrophysicists who study the stars; biologists who research genomes; people who write computer games and work out game scenarios (the state space there often has more elements than there are atoms in the universe), and they believe this should work.

Objective tasks where you really need that much data and it will never fit on one machine are actually quite rare. Maybe a stock exchange and things like that.

So it may well be that scaling, especially right from the start, is not such a good idea. Let us look at a typical case to make it closer and clearer how this happens.



We have decided that a lot of people will use our website. Usually, when marketing together with one of the co-founders draws up plans for how much load we will have, a sensible programmer can divide the estimate by a hundred straight away, because it is unlikely to be that big. But if it is, you should be ready to do something about it. Naturally, people immediately think about scaling and, I would even say, about premature scaling.

A typical story: since we are going to have a million active users on the site, let us shard them by the date the user was created. Why by the creation date? Because it seemed right to us. There can be many such cases; this is just one of them.

As a result, after a while we get a pile of problems. Our users are not completely independent of one another, and to assemble something as banal for a relational database as a friend list, once the friends no longer live on the same machine, we need special services and so on. For any kind of user-to-user interaction, be it a chat or something else, it turns out that users inevitably live on different machines, and we have "arrived": the whole logic immediately gets more complicated, the support gets more complicated, and so on.
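To make this concrete, here is a minimal sketch of the kind of query that is trivial on one machine and painful across shards; the users and friendships tables and their columns are hypothetical, not from the talk.

    -- A friend list on a single machine: one ordinary JOIN.
    SELECT u.id, u.name
    FROM friendships f
    JOIN users u ON u.id = f.friend_id
    WHERE f.user_id = 42;

    -- Once users are sharded by registration date, user 42 and their friends
    -- almost certainly sit on different shards, so this one statement turns
    -- into several remote calls plus a merge step in application code.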

Quite a while ago, on one project, I had to estimate: if we put in ten machines with PostgreSQL (which back then was not very fast for a number of things), these machines would have to be of such-and-such capacity to withstand the planned load, at such-and-such a cost. We compared. Buying an Oracle Enterprise license and putting it on one big disk array simply turned out to be cheaper in plain money. Such a case is entirely possible. And today it is even easier in this respect, because databases have become better: Oracle, MySQL and PostgreSQL have all made a lot of progress over the past 10 years.

But then there is an even bigger problem. Suddenly it turns out that only a few of the hundred machines we sawed everything into are actually active. Freshly registered users, because of the pattern of how they live on our site, are unevenly active: users who registered long ago and settled on certain machines do almost nothing, and those machines sit idle.

Which brings another problem: an idle machine does not warm up its cache. A database is faster when it serves data from the cache rather than from disk. So if a request to "get data about some user" arrives there once an hour, at some not-so-wonderful moment it turns out the request is several times slower, because the cache is cold and everything works badly.

So the problem is that we have installed a hundred machines, made the corresponding changes in the infrastructure, and got an unworkable system that we immediately need to redo.

And this case is, in general, much broader. First we sharded by one key, then we realized that in fact it had to be done in a completely different way, and started redoing it. We redid it, found out we need to do it yet again, because the web is a moving target and, strictly speaking, conditions change faster than we planned.

Therefore my typical advice is: first grow up to the resources of one machine on a normal relational database and see where we get at this pace. Then calculate how much it costs to upgrade this machine and how much longer we can live on more fashionable, better, faster hardware. And only after we have calculated this, make a decision about scaling.

This approach has actually saved more than one project, because premature scaling eats a lot of the team's resources and a lot of money, and in the end it only comes out more expensive and worse.



The second point, which is also high in my rating, is what I would call "Big Data from scratch". The business wants to keep "data for all time". A typical story: we compute some statistics; we collect, say, user activity as time series. In fact we only need aggregates from it, plus the latest data — the last day, the last week, the last hour. Everything else is pre-computed aggregates, and those aggregates might need to be recomputed maybe once a year. In other cases I have seen, they were never recomputed at all, but several terabytes were kept around just lying there. This is a typical example of bad Big Data.

Explaining this to the business is the programmer's task, because the business's fear of losing some data and then not being able to compute the analytics is perfectly understandable. Data is money, it is that simple. But instead of keeping four terabytes hot, you can do many interesting things. You can archive the data properly, keep even the raw data somewhere on an archive machine, so that if necessary you can recompute the aggregates you use for statistics, or compute a new aggregate if you need one. As a result, the hot data that has just arrived will take up a hundred gigabytes instead of four terabytes, and it will be much easier to work with.

Moreover, many databases, PostgreSQL among them, have tools that let you do this. For example, PL/Proxy allows you to move archived data to a remote machine and, if necessary, pull data from there to return a result set.

There are plenty of methods, and partitioning into archive and hot data helps a lot in this situation. Operating a database that is a thousand times larger than it needs to be, when you do not actually need 90% of that data, is always a pain, and you do not have to live with it.
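As an illustration of the hot/archive split, here is a minimal sketch using declarative partitioning (available since PostgreSQL 10; at the time of the talk the same effect was usually achieved with inheritance-based partitioning or pg_partman). Table and column names are made up.

    CREATE TABLE events (
        user_id    bigint,
        created_at timestamptz NOT NULL,
        payload    jsonb
    ) PARTITION BY RANGE (created_at);

    CREATE TABLE events_2017_05 PARTITION OF events
        FOR VALUES FROM ('2017-05-01') TO ('2017-06-01');   -- hot data
    CREATE TABLE events_2016 PARTITION OF events
        FOR VALUES FROM ('2016-01-01') TO ('2017-01-01');   -- going cold

    -- When 2016 goes cold: pre-compute the aggregates you actually use,
    -- detach the partition, dump it and ship the dump to a cheap archive machine.
    ALTER TABLE events DETACH PARTITION events_2016;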



People love universal approaches. Entity-Attribute-Value is, frankly, a scourge for a DBA, but it is so attractive for the programmer to design: everything is universal, we do not need to change the table structure when we need to add something new, we can simply register a new attribute type, a new attribute, and put a new value in.

Convenient, fine, but as a result it all grows into three or four tables (if you have a separate type table), which are huge, all your data lives in them, and you join them all the time. Optimizing those JOINs is actually very difficult, because in such a setup your data will most likely sit in some "text" column: putting it into proper types is hard, and the efficiency of indexing such data is orders of magnitude worse than if you kept it in ordinary relational tables.
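A minimal sketch of what this looks like in practice (all names are hypothetical): the same question that is one indexed predicate on a normal table becomes a pile of joins over text values in EAV.

    -- The EAV layout: three generic tables hold everything.
    CREATE TABLE entities    (id bigint PRIMARY KEY, entity_type text);
    CREATE TABLE attributes  (id bigint PRIMARY KEY, name text);
    CREATE TABLE attr_values (entity_id bigint, attribute_id bigint, value text);

    -- "Users older than 30 from Berlin" turns into self-joins over text:
    SELECT e.id
    FROM entities e
    JOIN attr_values v1 ON v1.entity_id = e.id
    JOIN attributes  a1 ON a1.id = v1.attribute_id AND a1.name = 'age'
    JOIN attr_values v2 ON v2.entity_id = e.id
    JOIN attributes  a2 ON a2.id = v2.attribute_id AND a2.name = 'city'
    WHERE e.entity_type = 'user'
      AND v1.value::int > 30
      AND v2.value = 'Berlin';

    -- The relational version is one table with real types and real indexes:
    -- CREATE TABLE users (id bigint PRIMARY KEY, age int, city text);
    -- SELECT id FROM users WHERE age > 30 AND city = 'Berlin';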

And then imagine how much documentation you have to keep just to figure out which attribute means what. A relational schema documents itself to some extent, especially if there is a reasonable description attached to it. EAV you simply will not understand without good documentation, under any circumstances.

What the EAV part becomes is usually called the "core". When it reaches a state where it is completely unusable, it gets renamed with the proud name "core", and everything around it starts getting hung with various add-ons where, in fact, denormalized data is stored just to have at least some fast access to it.

In the end nothing remains of the initially "simple" design idea. It works slowly and poorly, and any change to the schema has to be propagated into a pile of disparate things, often written by different people and even different departments, and as a result we never efficiently reach the goal we originally stated.



The same with ORM. I do not want to start a long holy war about whether ORM is good or bad, but as a DBA I really do not like ORMs. In principle I understand the desire to quickly prototype something on an ORM, but then a strange phenomenon begins: we realize that the ORM noticeably slows the database down, so we start converting some queries to plain SQL, or, even "better", try to train the ORM to write good queries. That is a terrible thing. I once watched a person courageously struggle with an ORM-generated query while knowing how to write it correctly — he already had the correct hand-written query, and he was trying to coax the ORM into generating the same query so that it would work normally. The pointlessness of this is quite obvious.

Moreover, you could give a separate talk on the topic "let us guess from the SQL query logs which ORM sent them". Which one is this (on the slide above)? Many ORMs sin with this, but if I add that this IN list can be of enormous size, then most likely it is Django; in Ruby on Rails the lists tend to be shorter.

You can often guess the ORM, and this particular pattern is very bad. Why? Because the list in the IN can be anything; the optimizer has no idea what to do with it. If it is replaced with some kind of JOIN, things will most likely work tolerably well, and that can be optimized. With this IN you can do nothing without very dirty hacks. And in an ORM virtually all queries come out more or less like this.
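A sketch of the difference, with made-up tables: the first query is the shape an ORM often emits after collecting ids in the application, the second gives the planner something it can actually optimize.

    -- ORM style: the id list is an opaque literal the optimizer knows nothing about.
    SELECT * FROM orders
    WHERE user_id IN (1, 2, 3, /* ... thousands of ids ... */ 99998, 99999);

    -- Hand-written equivalent: a JOIN the planner can turn into a hash or merge join.
    SELECT o.*
    FROM orders o
    JOIN users  u ON u.id = o.user_id
    WHERE u.created_at > now() - interval '1 day';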

So yes, please prototype on your favorite ORM, but understand that sooner or later it will become inconvenient and bad, and above all bad in terms of performance.



This is a purely PostgreSQL thing about Big Data. PostgreSQL has autovacuum. When PostgreSQL inserts, it inserts; when it updates, it does an insert plus a delete, and the delete is not really a delete either, but simply removes the old tuple from visibility.

As a result we get a heavily bloated table. It may contain 100 thousand live rows yet weigh some hellish number of gigabytes, simply because there are a few million dead tuples in it that we do not see. They are pure ballast and turn our database into a typical home-grown Big Data example.

Naturally, if nothing is tuned and everything works badly, the very first reaction of the people operating it is: "Let's turn off autovacuum! It is the longest-running process, it gets in the way of everything, we cannot add an index or run any DDL while autovacuum is going over the table. Let's turn it off!"

That ends very badly, and a lot could be said about it.

Here is a reference with recommendations on how to deal with autovacuum and what to do. The main thing: never turn it off, because the consequences are absolutely monstrous. You get Big Data "out of the blue" even faster than with time-series data that you do not really need and that just takes up space in the database.
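For orientation, here is a sketch of the direction such tuning usually takes: make autovacuum run more often and more gently instead of turning it off. The numbers are purely illustrative, not a recommendation; take them from the reference above and from your own workload.

    # postgresql.conf (illustrative values only)
    autovacuum = on                        # never off
    autovacuum_max_workers = 10
    autovacuum_vacuum_scale_factor = 0.01  # react after ~1% of rows change
    autovacuum_vacuum_cost_delay = 5ms     # gentle but frequent

    -- The same thresholds can be set per table for the busiest ones:
    ALTER TABLE user_events
        SET (autovacuum_vacuum_scale_factor = 0.01,
             autovacuum_analyze_scale_factor = 0.01);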



"JOIN is evil!" is another such thing. I have some idea of where people get it into their heads. For example, people who worked with MySQL a long time ago, around version 3-point-something, often carry around the idea that JOIN is evil.

What can I say? JOIN is good. Why is the relational model so successful? Who among you has heard that the relational model is a backward thing whose days are numbered? I think everyone has heard the opinion that schema-less NoSQL is coming and will conquer everything. It is not happening.

Why? Because data is conveniently stored one way, as blocks on disk, and conveniently retrieved another way, preferably at a higher level. And when we retrieve data in a high-level way from different tables, the relational model is very convenient for doing the optimization. If we use the relational model, we should use it to its full potential.

What is the alternative to doing a JOIN? Very simple: we pull data from two or three tables into our favorite programming language, onto our application server or back-end, wherever. Now it lives in our application, takes up memory, and so on, and those tables can easily be large. We actually need only ten rows in total, yet we are hauling out huge sheets of data.

And then we do the JOIN by hand, a "manual sunset". First we walk over the data in a loop, then we realize that is slow and start inventing some hashing algorithm.

In effect we are starting to reinvent a database, a "full-size working model of a locomotive". And it will still be somewhat bad, because the real optimizer collects information about a heap of parameters in order to choose the JOIN algorithm — nested loop, hash, merge or whatever else there is. In our programming language we would have to write all of that ourselves, again. Why do that?
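The point in one hypothetical query: instead of dragging both tables into the application and matching rows there, let the database return the ten rows you actually need.

    SELECT u.name, o.total
    FROM users  u
    JOIN orders o ON o.user_id = u.id
    WHERE o.created_at > now() - interval '1 hour'
    ORDER BY o.total DESC
    LIMIT 10;
    -- The planner chooses nested loop, hash or merge join from statistics it
    -- already maintains; application code would have to reimplement all of that.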

Still, who here has never seen anyone do something like this? Really never? Everyone else has met it at least once somewhere, which suggests that this idea is for some reason popular among the people.

Use the power of relational algebra that is given to you. SQL is a good high-level language. If you need to get ten rows after passing through very large tables, JOIN will seriously help you, unlike doing it by hand and ignoring the tools made for exactly that.



Another downright painful subject is inventing your own replication. Why "slony"? Because PostgreSQL has a replication system called Slony, not much loved by many — I would say deservedly not much loved — and very old: it predates the built-in replication. It is trigger-based replication, very troublesome to maintain.

Nevertheless, despite it having been around for many years and working tolerably if you configure it properly (precisely because it is a very old product, a lot of bugs have been shaken out of it), people regularly try to invent something of their own instead.

PostgreSQL has at least three widely used replication methods: built-in log shipping / Hot Standby, Slony, and Londiste on top of PgQ, each with its advantages and disadvantages. Frankly, in ninety-nine percent of cases I would not advise using anything other than the built-in log shipping. Nevertheless, people keep trying to invent something of their own, with goals of their own.

Whenever I have seen such home-grown things, they always worked somehow wrong, because you do not take into account all the problems you actually have. Replication is the processing of distributed transactions, and processing distributed transactions is always hard; there is no magic there. If the built-in replication does not give you some functionality, for example a multi-master, that means something: there is a reason it does not exist. If you try to invent it yourself, you will most likely step on every rake, and plenty of people do.

So first of all I would advise: if the database already has some technology, think about why it is used so widely, and use it.

One more important point about replication: the built-in replication does not work at the high level of SQL. When you write your own replication, you most likely still operate at the level of tables, triggers, stored procedures and SQL statements. That is usually slow and, as a rule, fraught with conflicts.

Getting any deeper is much harder: rewriting the database so that it has a different replication mechanism is a very serious task, and you had better understand exactly what you are doing if you can dig that deep. The built-in replication is replication by the transaction log: information about changes is written not as SQL statements, but as whatever is needed to return an 8-kilobyte page to its previous state or, conversely, to bring it to the new state. This log is shipped to the replica and applied there, and that is far more efficient and orders of magnitude more reliable than any other replication method. And it is already done: it is built in, you can just take it and use it.
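For reference, a minimal sketch of what switching on the built-in streaming replication amounts to; parameter values are illustrative and the exact file layout depends on the version (recovery.conf up to PostgreSQL 11, postgresql.conf plus standby.signal from 12 on).

    # Primary, postgresql.conf
    wal_level = replica            # 'hot_standby' on old 9.x versions
    max_wal_senders = 5

    # Standby
    hot_standby = on               # allow read-only queries on the replica
    # recovery.conf (or postgresql.conf on 12+):
    standby_mode = 'on'            # only in recovery.conf, not needed from 12 on
    primary_conninfo = 'host=primary-host user=replicator password=...'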



A typical programmer's answer to the administrator — and a sign that the processes of interaction between operations and development in that company are fundamentally broken.

Everyone knows you need to use EXPLAIN. Not everyone uses it, but everyone knows they should. You need to see whether the query runs optimally, whether you forgot an index, and a lot of similar things. But it often happens that a person looked at the plan in the development environment, and the simplest thing turns out differently in production, which simply has 100,000 times more data. A different plan will be chosen, the query will run slowly, one index will be needed, then another, then something else, and so on. There can be many problems.
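The check itself is one line; the point is to run it against production-sized data, not the hundred rows in a development database. Table and predicate here are made up.

    EXPLAIN (ANALYZE, BUFFERS)
    SELECT * FROM orders WHERE customer_id = 42;
    -- On a dev database with 100 rows this is a harmless Seq Scan;
    -- on production with 100,000 times more data the plan, the indexes needed
    -- and the runtime all look completely different.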


Moreover, a database is a very complex system whose workload depends on a heap of strange parameters. For example, your project shares the database with another project; they work on it together, while in your development environment you only have your own project. That other project generates its own load because, say, every office worker in Russia visits its site over morning coffee. Your project is completely unrelated to it, yet you observe amazing slowdowns, because the database is dramatically slower during those morning hours.

So it is very important for developers to have some kind of access to production, or properly established procedures for working with the DBAs. One of the worst situations is when admins simply try not to let the "stupid programmers" in anywhere — like the proverbial Soviet cleaning lady who complains that "they keep walking and trampling around here" and would close the school altogether just so nobody walks back and forth. That is not right.

At the very least, developers need access to some monitoring of what is actually happening in the database. In PostgreSQL, for example, there is pg_stat_statements, which shows which queries are slow, how often they are called and how much time they consume.
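A typical starting point — a sketch of the kind of query a developer can run against pg_stat_statements (column names vary slightly between versions: total_time before PostgreSQL 13, total_exec_time after).

    -- Requires the pg_stat_statements extension to be installed and loaded.
    SELECT query, calls, total_time, rows
    FROM pg_stat_statements
    ORDER BY total_time DESC
    LIMIT 10;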

If the admin does not let developers in at all, sooner or later it ends with "everything works on my test environment", because in that situation the developer feels no responsibility for his code or his project. He should be able to look at production and see how it all really behaves, because the difference there can be very, very significant.



I by no means want to offend Java programmers; all kinds of programmers like to get too clever here. It is a natural role: programmers are inquisitive people, they find it interesting to try something new and to use the approaches their language offers.

Say we have Java. Java has threads, and it is well known that you can make those threads work so that everything runs faster than without them. But further down the stack there is a database. We want to load a lot of data into the database, so we push it there in several streams. It seems slow, so we parallelize even more, and the developer in this situation does not care what actually happens downstream. And what actually happens is this: the database has, say, 10 workers, because that is how many cores the machine has left after the service processes, and all those streams land on those 10 workers.

What happens next? Naturally, the streams start stupidly fighting each other, and in this situation that is not a good idea. The programmer needs to picture the whole stack in more and better detail, because if you do things like this, the results are very strange.
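One quick way to see this happening is to look at how many backends are actually working versus sitting in contention; a sketch against the standard pg_stat_activity view:

    SELECT state, count(*)
    FROM pg_stat_activity
    GROUP BY state
    ORDER BY count(*) DESC;
    -- If far more sessions are 'active' than the machine has cores,
    -- the threads are mostly fighting each other, not doing useful work.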

The second point is a case straight from real life. We usually do the following: we monitor slow queries — top slow queries and why they are slow. We look at the report: a query is in the top. What? Why?

How does the DBA on duty usually check such a problem? He takes this query and runs it in a transaction (the query may change data, and in PostgreSQL everything is transactional, so you can say BEGIN, EXPLAIN ANALYZE and ROLLBACK so as not to spoil anything). Everything works, and moreover it works in milliseconds. He looks at the report: the query is consistently, stably very slow.
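The on-duty check described above looks roughly like this (the UPDATE is a made-up stand-in for the query from the report):

    BEGIN;
    EXPLAIN ANALYZE
    UPDATE accounts SET balance = balance - 100 WHERE id = 42;
    ROLLBACK;   -- the query really ran, but nothing is left changed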

It turned out that the guys had decided to try coroutines, and this query is launched from a script through coroutines. The coroutines fight among themselves, and as a result the query is slow — not because the query is bad, but because it is fighting on the Python side: data is sent to it slowly and comes back from it slowly, and everything sits there waiting.

So when you reach for fashionable concurrency tricks, you have to understand how they interact with the database and how the data actually flows in and out.



The next point is also a very common one; you see it as soon as you open a slow-query report.

We have a query that returns some rows — many rows, a million. And we are, say, a web site, so we work through a web front end. How often do you think you need to display a million rows on a web front end in some form? And yet people do display them, often: into some search listing, some page like that, and so on. How many people do you think have read those million rows? Nobody, obviously.

If you parse your slow queries and see, via EXPLAIN or otherwise, that a query returns that many rows, and this is not ETL, not a nightly export to analytics and so on, then think: who is ever going to read this?

This is a sure sign of an outright error. It is a query you need to look at and understand: either there is more data now than there once was, or someone simply wrote it with a mistake; either way it is a big problem. And most importantly, for a web front end such a query is completely useless. It is a mistake; you just need to remove it.

The same story with count(*) counters, which I mention at every conference where I talk about how to work with databases correctly from the web. And in a great many cases I still see this in slow-query reports.

People love showing counters to users. At the same time we have a highly loaded site, and on its front page these counters tick quickly under concurrency. If they are not ticking quickly, they can be shown very simply: PostgreSQL has the planner's statistics, and you can write a procedure that returns that data, updated every once in a while; it is approximate, and that is good enough.
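A sketch of the two options, with a hypothetical users table: the exact count scans everything, while the planner's own estimate in pg_class is instant and perfectly good for a front-page counter.

    -- Exact: a full Seq Scan of the table on every page view.
    SELECT count(*) FROM users;

    -- Approximate: the planner statistics, refreshed by autovacuum/ANALYZE.
    SELECT reltuples::bigint AS approx_rows
    FROM pg_class
    WHERE relname = 'users';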

But people do the following: they display the real counter. You will see that 261,526 users have registered today, and when you refresh the page a second later, you will find the figure has changed by 15,000. What is the use of having such an "exact" number?

And an exact count(*) is expensive: in PostgreSQL, count(*) means a Seq Scan over the whole table, and on a large table that count(*) takes a long time — 20 seconds, say.

Why do it at all? Who needs it?

Nobody. Remove such counters. If a page on a high-load site spends even a few seconds, or even half a second, on SQL queries, that is not the web! Nobody needs such a web. The user is used to the computer being slow, and even that is more or less a thing of the past these days; for the user, the web has to respond instantly. If a page takes a long time to load, the user has gone to another site and, who knows, placed an order there or whatever else. So these things must always be weeded out.

And the traditional main tip in any talk about databases: "Know your data!" You need to know how your data behaves.

Tom Kyte (a vice president at Oracle, a very smart guy who knows a great deal about how Oracle works, about SQL queries and so on; he can run a panel discussion where he optimizes, on the spot, queries the audience hands him — that is the highest class) was once asked: "What tools do you use to optimize SQL queries so well?" He said: "It's very simple: I close my eyes and try to imagine how it works."

Roughly speaking, it is simply useful to know what data you have and what your queries are, and to think a little when you work with them. That is the main secret of successful work with databases in general, with PostgreSQL in particular, and with any technology you deal with.

That is all I have. We still have a reasonable amount of time for questions. Thank you, I will be happy to answer them.

Question: You said that autovacuum must not be turned off. And if the database is append-only, does it make sense in that case to disable it and not use it?

Answer: It does not. You should still keep it on.

It is covered in the link, but there are a few points.

First, besides vacuuming as such, this daemon also handles analyze: it updates the statistics. If you only append, you still change the number of records in the tables, and the optimizer needs to know that in order to choose good plans.

Second, you have pg_catalog, which gets updated — all the internal system tables. If you disable autovacuum on it, the result can be very surprising: some pg_class table grows to several gigabytes and everything becomes very sad. For that reason alone you should not do it.

It is better to configure it sensibly and let it work. In some cases, for tables that are updated rarely, you can individually tell autovacuum to work on them less intensively. But in practice I would not advise it, because if the table grows large and autovacuum only kicks in after 90% of rows have been updated, that autovacuum will run for a long time and seriously interfere with the system, even though it runs rarely. That is not good practice.

Question: About large result sets. If there is such a query, can you just put a LIMIT on it?

Answer: Yes, you can set a LIMIT, but, as a rule, if a million rows are being returned, no one has set one. And with LIMIT you need to understand the following: to get the data you actually want, you need to understand how it is sorted and which part of it you are taking. Here you already have to deal with the specific implementation.

Question: You said at the beginning that the business wants to store all data for all time. We have roughly that situation. How is it better to approach the problem of archiving data? Say there is some big table, huge logs, that you cannot just put somewhere. We may need them someday, but in fact we only need the latest part, and the big "tail" that remains has to be put somewhere, archived somehow, but preserved so that you can still work with it.

You mentioned PL/Proxy briefly; it would be interesting to hear what kind of tool it is and whether it is suitable for this. Or are there other tools?

Answer: There are many methods. First of all, the simple way is to keep a separate machine, perhaps a simpler one, just with a lot of disk space: it does not need special performance, just space. And you can move this "cold tail" there.

Then it is a question of how exactly to do it. The simplest way is to just dump the table, take the dump to that host and restore it there.

On such an archive host there will not be much concurrency, so, please, let it live there. You can fetch data from it at any time, go and look, work with it.

If you need faster access, you can write a stored procedure in PL/Proxy. This is one of the methods — maybe not the best, not suitable for everything, but it works. What is this technology? You have two stored procedures: one on the host where your live data lives, written in the PL/Proxy language, which does nothing except call a stored procedure with exactly the same signature, written in PL/pgSQL, on the remote archive machine, and return its result to you via a remote call. It is quite a good technology and works fairly fast. Skype developed it back in the day, before they were under Microsoft. It really works well.
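A minimal sketch of the pair of functions being described; host, table and column names are invented, and the proxy side assumes the plproxy extension is installed. The talk mentions PL/pgSQL on the remote side; plain SQL is used here for brevity.

    -- On the archive machine: the function that does the real work.
    CREATE FUNCTION get_archived_events(p_user_id bigint)
    RETURNS TABLE (event_id bigint, created_at timestamptz, payload text)
    LANGUAGE sql AS $$
        SELECT event_id, created_at, payload
        FROM events_archive
        WHERE user_id = p_user_id;
    $$;

    -- On the live machine: a PL/Proxy function with the same signature
    -- that only forwards the call to the archive host.
    CREATE FUNCTION get_archived_events(p_user_id bigint)
    RETURNS TABLE (event_id bigint, created_at timestamptz, payload text)
    LANGUAGE plproxy AS $$
        CONNECT 'host=archive-host dbname=archive';
    $$;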

In such a setup you can, for example, create a VIEW that UNIONs a query over the hot-data table on the active server with the result of the PL/Proxy stored procedure. Then, if you access this view and only need the hot data according to the WHERE condition, you effectively only touch the hot table, and for the archived part PL/Proxy knows where to go, sees that the result there is empty, and an empty result comes into the UNION. That is one of the techniques for doing this.
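Continuing the sketch above, the view could look something like this, with a no-argument PL/Proxy function archived_events() (hypothetical) standing in for the archive side:

    CREATE VIEW all_events AS
        SELECT user_id, created_at, payload
        FROM events                    -- hot data on this server
        UNION ALL
        SELECT user_id, created_at, payload
        FROM archived_events();        -- PL/Proxy call to the archive machine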

Question: You said there is no need to rush into scaling. But if you still need to provide for scaling and think about it in advance, what approaches should be considered? What can you advise?

Answer: The first piece of advice, surprisingly, is not to scale in advance, but simply, while preparing for launch, to understand how much data there will actually be.

It is clear that no load testing guarantees reproducing the situation as it will be in production. In any case, the first thing you need is good monitoring. Ideally you should have a graph of your data growth and of the database's response to it. You watch the charts and the trend. If you see that marketing has started actively selling some services and your curve has gone up, you will notice it on the chart and then make a perfectly reasoned decision about what is simpler and cheaper for you: buy some fast new SSDs and add RAM, or, if that is too expensive, put in a couple of cheap machines to offload some specific tasks.

I would say the "zeroth" task is to understand whether you need to scale at all, and then think based on that.

For example, PostgreSQL on a reasonably decent server — one that costs, relatively speaking, 30 thousand dollars, which is a reasonable price for a serious server today — can withstand a very large load without even noticing. And far from every project ever reaches such a load.

I would rather say it makes sense to scale by workload. For example, if you have a web application, it sends a lot of OLTP requests, and those go to the master. If you also need to run some analytics on top, that will most likely not work well, because long and short queries do not coexist well in one database, especially a versioned (MVCC) one, for a whole list of transaction-related reasons.

In that situation it is better to scale by simply putting up a replica and sending read queries there. Scale not by spreading data across different places, but by task: if your queries are longer than a few milliseconds, they go to the replica; if they are short, OLTP-like, they live on the master. That is the direction I would advise thinking in.

Question: About the previous question, moving logs away. Besides PL/Proxy, is it possible to use a foreign data wrapper? Have you used it?

Answer: You can use a foreign data wrapper, but there it is much more at your own risk. PL/Proxy does not fully support transactions, in the sense that it can only autocommit — you cannot manage transactions there — but it is a transactional RPC call, and it can call quite far away. A foreign data wrapper, firstly, is completely opaque to the optimizer, and secondly, it does not always work well with transactions, because God knows who wrote the particular wrapper.

Foreign data wrappers into PostgreSQL are generally a rather strange thing: it turns out that on one side the query executes with one plan and on the other side with another, and that can be extremely non-optimal.

In principle, you can use a foreign data wrapper for this purpose, but I would recommend much simpler solutions and would reach for a foreign data wrapper only if I really wanted to.
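For completeness, a sketch of what the postgres_fdw variant looks like; server, user and table names are all made up.

    CREATE EXTENSION postgres_fdw;

    CREATE SERVER archive_srv
        FOREIGN DATA WRAPPER postgres_fdw
        OPTIONS (host 'archive-host', dbname 'archive');

    CREATE USER MAPPING FOR CURRENT_USER
        SERVER archive_srv
        OPTIONS (user 'app', password 'secret');

    CREATE FOREIGN TABLE events_archive (
        user_id    bigint,
        created_at timestamptz,
        payload    text
    ) SERVER archive_srv OPTIONS (table_name 'events');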

Question: And with PL/Proxy it does not turn out that there will be a different plan?

Answer: Naturally it can turn out that way, but in my experience there are fewer problems with it on this front.

Question: Have you encountered solutions of this kind: put the archive table in a separate tablespace on a separate disk, put that disk into some shared storage with OCFS2, and have a separate read-only server work with that OCFS2?

Answer: That is not the best idea, because if someone suddenly yanks at that tablespace for some reason, it dramatically increases the latency of everything, because all that stuff starts climbing into shared buffers.

Question: I mean reading it from another server?

Answer: You mean the tablespace on this server being read by another server? That should not be done under any circumstances, even read-only, because all the meta-information in the tablespace, in the pages, refers to this server, and it will not work.

Question: But doesn't Oracle do something similar?

Answer: No. Oracle has RAC, and that makes a big difference.

Question: What do you do about ORM? Do you refuse ORM altogether as soon as you start a project? Or do you rewrite everything before releasing into production? What is your policy?

Answer: Let's just say I do not write anything myself; I only clean up the consequences. Among our customers there are many who do it differently. There are people who use an ORM and live with it; they have business processes in place for optimizing the ORM, and they think it is fine. That is completely normal.

From my personal point of view, an ORM is not very convenient to use, because SQL is a convenient thing, and constructing queries in Java looks, in my opinion, a bit wild. But tastes differ; there are plenty of people who, on the contrary, love doing it that way.

Experience suggests the following: there are a lot of projects that start entirely on an ORM; projects that have been successfully operated for many years while remaining entirely on an ORM, I have not met a single one. Typically, people start pulling critical queries out of the ORM and writing them by hand. Probably 90% of projects that started with an ORM live exactly by this scheme.

I have also seen projects that simply went and rewrote everything away from the ORM — much more rarely, because it usually costs a lot — and in every such case the performance situation changed by orders of magnitude. It brings serious advantages. The only thing is that you need to hire programmers who know SQL and can write it, and accordingly spend time on somewhat slower development at first.

ORM largely originated as an idea from vendor development: when people build some system for a customer and really need to shorten the time until the acceptance act is signed. They deliver it, and the fact that it will be hard to maintain afterwards is just a bonus — support is paid for separately.

When we live on the web, it is a completely different story. We have to do things differently, because as a rule we have no act-signing milestone; we ourselves will later spend the time on all these things. Something like that.

Contacts


" Hydrobiont
" ik@postgresql-consulting.com


Source: https://habr.com/ru/post/329222/

