How do you survive a tenfold growth in the number of database queries without moving to a more powerful server, and keep the system running? I'll tell you how we fought the performance degradation of our database and how we optimized SQL queries to serve as many users as possible without increasing the cost of computing resources.
I build a service for managing business processes in construction companies. About 3 thousand companies work with us, and more than 10 thousand people use our system for 4-10 hours every day. It handles various tasks of planning, alerts, warnings, validations... We use PostgreSQL 9.6. The database holds about 300 tables, and up to 200 million queries (about 10 thousand distinct ones) hit it every day. On average that is 3-4 thousand queries per second; at the busiest moments, more than 10 thousand per second. Most queries are OLAP. Inserts, updates and deletes are much rarer, so the OLTP load is relatively small. I give all these figures so that you can judge the scale of our project and understand how our experience can be useful to you.
The first picture. Lyrical
When we started development, we didn't think much about what kind of load the database would face or what we would do if the server stopped coping. When designing the database, we followed the general recommendations and tried not to shoot ourselves in the foot, but we did not go beyond general advice like "do not use the Entity Attribute Value pattern". We designed according to the principles of normalization, avoided data redundancy and did not worry about speeding up particular queries. As soon as the first users arrived, we ran into a performance problem. As usual, we were completely unprepared for it. The first problems were simple; as a rule, everything was solved by adding a new index. But the time came when simple patches stopped working. Realizing that we lacked experience and that it was getting harder and harder to understand what was causing the problems, we hired specialists who helped us configure the server properly, set up monitoring, and showed us where to look to get statistics.
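The statistics in the queries below come from the pg_stat_statements extension. As a reminder of how it is enabled, here is a minimal sketch; the specific settings are illustrative assumptions, not our production values:

-- in postgresql.conf (a server restart is required):
--   shared_preload_libraries = 'pg_stat_statements'
--   pg_stat_statements.max = 10000    -- how many distinct statements to track (illustrative)
--   pg_stat_statements.track = all
-- then, in the target database:
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;
-- quick sanity check: the heaviest statements by total execution time
SELECT query, calls, total_time, mean_time
FROM pg_stat_statements
ORDER BY total_time DESC
LIMIT 10;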
The second picture. Statistical
So, we have about 10 thousand different queries running against our database every day. Among these 10 thousand there are monsters that are executed 2-3 million times with an average run time of 0.1-0.3 ms, and there are queries with an average run time of 30 seconds that are called 100 times a day.
Optimizing all 10 thousand queries was not an option, so we decided to figure out where to direct our efforts to improve database performance in the right way. After several iterations, we began to divide queries into groups.
TOP queries
These are the heaviest queries, the ones the server spends the most time on in total. They are either called very often, or they take a very long time (the long and frequent ones were optimized back in the first iterations of the fight for speed). Moreover, it is important to separate the top queries by total execution time and, separately, by IO time: the ways to optimize them differ slightly.
Working with TOP queries is the usual practice in every company. There are few of them, and optimizing even a single one can free up 5-10% of resources. However, as the project matures, optimizing TOP queries becomes an increasingly non-trivial task. All the simple methods have already been exhausted, and the "heaviest" query takes away "only" 3-5% of resources. If the TOP queries together take less than 30-40% of the time, then most likely you have already made the effort to make them fast, and it is time to move on to optimizing the queries in the next group.
It remains to answer the question of how many queries to include in this group. I usually take at least 10 but no more than 20, and I try to ensure that the time of the first and the last query in the TOP group differs by no more than a factor of 10. That is, if the query execution time drops sharply from the 1st place to the 10th, I take the TOP-10; if the drop-off is smoother, I increase the group size to 15 or 20.
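To see how sharp that drop-off is, a sketch like the following over pg_stat_statements can help (the cutoff of 20 rows here is only an illustration):

-- share of total time per statement and how many times the heaviest statement
-- is "bigger" than each of the first 20
SELECT rn,
       round((total_time / sum_time * 100)::numeric, 2) AS time_percent,
       round((first_time / total_time)::numeric, 1)     AS top1_is_n_times_bigger
FROM (
    SELECT total_time,
           ROW_NUMBER()            OVER (ORDER BY total_time DESC) AS rn,
           FIRST_VALUE(total_time) OVER (ORDER BY total_time DESC) AS first_time,
           SUM(total_time)         OVER ()                         AS sum_time
    FROM pg_stat_statements
) AS t
WHERE rn <= 20
ORDER BY rn;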

Middle peasants (medium)
These are all the queries that come right after the TOP, except for the last 5-10%. It is usually in optimizing these queries that the opportunity to greatly increase server performance lies. They can "weigh" up to 80% of the total time. But even if their share has exceeded 50%, it is time to look at them more closely.
Tail
As already said, these queries come at the end and take 5-10% of the time. You can simply forget about them; but if you use automatic query analysis tools, optimizing them can also turn out to be cheap.
How to evaluate each group?
I use an SQL query that helps make such an assessment for PostgreSQL (and I'm sure a similar query can be written for many other DBMSs):
SQL query to evaluate the size of the TOP-MEDIUM-TAIL groups:

SELECT sum(time_top) AS sum_top, sum(time_medium) AS sum_medium, sum(time_tail) AS sum_tail
FROM (
    SELECT CASE WHEN rn <= 20 THEN tt_percent ELSE 0 END AS time_top,
           CASE WHEN rn > 20 AND rn <= 800 THEN tt_percent ELSE 0 END AS time_medium,
           CASE WHEN rn > 800 THEN tt_percent ELSE 0 END AS time_tail
    FROM (
        SELECT total_time / (SELECT sum(total_time) FROM pg_stat_statements) * 100 AS tt_percent,
               query,
               ROW_NUMBER() OVER (ORDER BY total_time DESC) AS rn
        FROM pg_stat_statements
        ORDER BY total_time DESC
    ) AS t
) AS ts
The result of the query is three columns, each containing the percentage of time spent processing the queries of that group. Inside the query there are two numbers (in my case, 20 and 800) that separate the queries of one group from another.
This is roughly how the shares of the query groups compared at the moment we started the optimization work and how they compare now.

The diagram shows that the share of TOP queries has dropped sharply, while the "middle peasants" have grown.
Initially, the TOP group was full of outright blunders. Over time these childhood diseases disappeared, the share of TOP queries shrank, and more and more effort had to be spent to speed up the difficult queries.
To get the text of these queries we use a query like this:

SELECT *
FROM (
    SELECT ROW_NUMBER() OVER (ORDER BY total_time DESC) AS rn,
           total_time / (SELECT sum(total_time) FROM pg_stat_statements) * 100 AS tt_percent,
           query
    FROM pg_stat_statements
    ORDER BY total_time DESC
) AS t
WHERE rn <= 20
Here is a list of the most commonly used tricks that helped us speed up TOP queries:
- Redesigning parts of the system, for example, handling notification logic through a message broker instead of periodic database queries
- Adding or modifying indexes
- Rewriting ORM queries in pure SQL
- Rewriting the lazy data loading logic
- Caching through data denormalization. For example, a delivery is linked to a request only through a chain of intermediate tables (Delivery -> Invoice -> ... -> Request). To avoid joining the whole chain in every query, we duplicated a direct reference to the request in the Delivery table (a sketch follows this list).
- Caching static reference tables and rarely changing tables in application memory.
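A minimal sketch of that denormalization, with hypothetical table and column names (delivery, invoice, request_id and so on are assumptions for illustration): a duplicated request_id column on the Delivery table, kept consistent by a trigger.

-- hypothetical schema: delivery -> invoice -> request
ALTER TABLE delivery ADD COLUMN request_id bigint;  -- duplicated ("cached") reference

-- backfill the duplicated reference once
UPDATE delivery d
SET request_id = i.request_id
FROM invoice i
WHERE i.id = d.invoice_id;

-- keep it up to date on insert and on change of the invoice link
CREATE OR REPLACE FUNCTION delivery_fill_request_id() RETURNS trigger AS $$
BEGIN
    SELECT i.request_id INTO NEW.request_id
    FROM invoice i
    WHERE i.id = NEW.invoice_id;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trg_delivery_fill_request_id
BEFORE INSERT OR UPDATE OF invoice_id ON delivery
FOR EACH ROW EXECUTE PROCEDURE delivery_fill_request_id();

-- now "all deliveries of a request" no longer needs the intermediate joins:
SELECT * FROM delivery WHERE request_id = 42;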
Sometimes the changes amounted to a substantial redesign, but they took 5-10% of the load off the system and were worth it. Over time the payoff got smaller and smaller, while the redesign required got more and more serious.
Then we turned our attention to the second group, the group of middle peasants. It contains far more queries, and it seemed that analyzing the whole group would take a lot of time. However, most of the queries turned out to be very simple to optimize, and many problems repeated themselves dozens of times in different variations. Here are examples of some typical optimizations that we applied to dozens of similar queries; each group of optimized queries offloaded the database by 3-5%.
- Instead of checking for the existence of records with COUNT, which leads to a full table scan, we use EXISTS (see the sketch after this list)
- We got rid of DISTINCT (there is no universal recipe here, but sometimes you can get rid of it easily, speeding the query up 10-100 times).
For example, instead of a query selecting all drivers from the large delivery table (DELIVERY)
SELECT DISTINCT P.ID, P.FIRST_NAME, P.LAST_NAME FROM DELIVERY D JOIN PERSON P ON D.DRIVER_ID = P.ID
we made a query against the relatively small PERSON table
SELECT P.ID, P.FIRST_NAME, P.LAST_NAME FROM PERSON P WHERE EXISTS(SELECT D.ID FROM DELIVERY D WHERE D.DRIVER_ID = P.ID)
It would seem that we used a correlated subquery, yet it gives a speedup of more than 10 times.
- In many cases, an exact COUNT was replaced by the calculation of an approximate value (see the sketch after this list)
- Instead of UPPER(s) LIKE 'JOHN%' we use s ILIKE 'John%'
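Two minimal sketches of the first and third rewrites, with a hypothetical table name (delivery is an assumption). The first replaces an existence check via COUNT with EXISTS; the second reads an approximate row count from the planner statistics in pg_class instead of an exact COUNT, which is one common way to do it and is only as fresh as the last ANALYZE:

-- existence check: instead of counting all matching rows ...
SELECT count(*) > 0 FROM delivery WHERE driver_id = 42;
-- ... stop at the first match
SELECT EXISTS(SELECT 1 FROM delivery WHERE driver_id = 42);

-- approximate row count: instead of an exact (and expensive) count ...
SELECT count(*) FROM delivery;
-- ... read the planner's estimate, refreshed by ANALYZE / autovacuum
SELECT reltuples::bigint AS approx_rows
FROM pg_class
WHERE relname = 'delivery';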
Each individual query was sometimes sped up 3-1000 times. Despite the impressive numbers, at first it seemed pointless to us to optimize a query that runs in 10 ms, sits in the third hundred of the heaviest queries and accounts for hundredths of a percent of the total database load. But by applying the same recipe to a group of similar queries we won back several percent. In order not to waste time manually reviewing hundreds of queries, we wrote several simple scripts that used regular expressions to find similar queries. As a result, automatically searching for groups of queries allowed us to further improve performance with modest effort.
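Our actual scripts are not shown here; as a rough illustration of the idea, similar statements can be grouped right in SQL by normalizing the query text with regular expressions (the normalization rules below are assumptions, not our real ones):

-- group statements whose text differs only in literals and IN-list length
SELECT
    regexp_replace(
        regexp_replace(
            regexp_replace(query, '\d+', 'N', 'g'),   -- replace numeric literals
            '''[^'']*''', '''S''', 'g'),              -- replace string literals
        'IN \([^)]*\)', 'IN (...)', 'g')              -- collapse IN lists
        AS normalized_query,
    count(*)        AS variants,
    sum(calls)      AS total_calls,
    sum(total_time) AS total_time
FROM pg_stat_statements
GROUP BY normalized_query
ORDER BY total_time DESC
LIMIT 50;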
As a result, we have been running on the same hardware for three years now. The average daily load is about 30%; at peaks it reaches 70%. The number of queries, like the number of users, has grown roughly 10 times. And all this thanks to constant monitoring of those same TOP and MEDIUM groups of queries. As soon as a new query appears in the TOP group, we analyze it immediately and try to speed it up. We review the MEDIUM group once a week using the query analysis scripts. If we come across new queries that we already know how to optimize, we change them quickly. Sometimes we find new optimization methods that can be applied to several queries at once.
According to our forecasts, the current server will withstand another 3-5 times increase in the number of users. True, we have one more trump card up our sleeve: we still have not moved SELECT queries to a read replica, as is commonly recommended. We avoid doing that deliberately, because we first want to fully exhaust the possibilities of "smart" optimization before bringing in the "heavy artillery".
A critical look at the work done might suggest vertical scaling instead: buy a more powerful server rather than spend specialists' time. The server might not even cost that much, especially since we have not yet exhausted the limits of vertical scaling. However, only the number of queries grew 10 times. Over several years the functionality of the system has expanded, and there are now more kinds of queries; thanks to caching, the functionality that existed before is served by fewer, and more efficient, queries. So you can safely multiply by another factor of 5 to get the real acceleration coefficient. By the most conservative estimates, then, the speedup was 50 times or more. Scaling the server vertically by a factor of 50 would cost more, especially considering that an optimization, once done, keeps working, while the bill for a rented server arrives every month.