
Monitoring PostgreSQL: Queries

In 2008, the pgsql-hackers mailing list began discussing an extension for collecting statistics on queries. Starting with version 8.4, the pg_stat_statements extension ships with postgres and lets you obtain various statistics about the queries the server processes.
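As a quick reminder (a minimal sketch, not tied to our setup), the extension has to be preloaded and then created in the database you want to inspect:

# postgresql.conf: the module must be preloaded, then the server restarted
shared_preload_libraries = 'pg_stat_statements'

-- in the target database
CREATE EXTENSION pg_stat_statements;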


Typically, this extension is used by database administrators as a data source for reports (the data is the sum of the counters since the moment they were last reset). But on top of these statistics you can build query monitoring, that is, look at how the statistics change over time. This is extremely useful for finding the causes of various problems and, in general, for understanding what is happening on the database server.


I will describe which per-query metrics our agent collects, how we group and visualize them, and also some of the pitfalls we ran into along the way.


pg_stat_statements


So, what do we have in the pg_stat_statements view (my example is from 9.4):


postgres=# \d pg_stat_statements
          View "public.pg_stat_statements"
       Column        |       Type       | Modifiers
---------------------+------------------+-----------
 userid              | oid              |
 dbid                | oid              |
 queryid             | bigint           |
 query               | text             |
 calls               | bigint           |
 total_time          | double precision |
 rows                | bigint           |
 shared_blks_hit     | bigint           |
 shared_blks_read    | bigint           |
 shared_blks_dirtied | bigint           |
 shared_blks_written | bigint           |
 local_blks_hit      | bigint           |
 local_blks_read     | bigint           |
 local_blks_dirtied  | bigint           |
 local_blks_written  | bigint           |
 temp_blks_read      | bigint           |
 temp_blks_written   | bigint           |
 blk_read_time       | double precision |
 blk_write_time      | double precision |

Here all queries are grouped: we get statistics not for each individual query, but for a group of queries that are identical from the point of view of postgres (more on this below). All counters grow from the moment the server started or from the last reset of the counters (pg_stat_statements_reset()).
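For an ad-hoc look at the accumulated counters, a query along these lines is enough (column names as in the 9.4 view above; in newer versions total_time is called total_exec_time):

-- top query groups by accumulated execution time since the last reset
SELECT userid, dbid, queryid, calls, total_time, rows,
       shared_blks_hit, shared_blks_read
FROM pg_stat_statements
ORDER BY total_time DESC
LIMIT 10;

-- reset the counters, e.g. before a load test
SELECT pg_stat_statements_reset();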



blk_read_time and blk_write_time are collected only when the track_io_timing setting is enabled.
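A sketch of how to check and enable it (ALTER SYSTEM needs 9.4+ and superuser rights; it is worth measuring the timer overhead with pg_test_timing first):

SHOW track_io_timing;

ALTER SYSTEM SET track_io_timing = on;
SELECT pg_reload_conf();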


Separately, it is worth noting that only completed queries get into pg_stat_statements. That is, if a query has already been doing something heavy for an hour and has not finished yet, it will only be visible in pg_stat_activity.
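So currently running heavy statements have to be caught separately, for example with something like this:

-- active statements running for more than a minute
SELECT pid, now() - query_start AS duration, state, query
FROM pg_stat_activity
WHERE state = 'active'
  AND now() - query_start > interval '1 minute'
ORDER BY duration DESC;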


How queries are grouped


For a long time I believed that queries were grouped by their actual execution plan. The only thing that puzzled me was that queries with a different number of arguments in IN were shown separately, even though they should have the same plan.


If you look at the code, you can see that the hash is actually computed from the "significant" parts of the query after parsing. Starting with 9.4, it is exposed in the queryid column.
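A simple way to see the grouping in action is to run two statements that differ only in a constant and check that they land in one row (the orders table here is purely illustrative):

SELECT count(*) FROM orders WHERE id = 1;
SELECT count(*) FROM orders WHERE id = 2;

SELECT queryid, query, calls
FROM pg_stat_statements
WHERE query LIKE 'SELECT count(*) FROM orders%';
-- expected: one row, calls = 2, constants replaced with a placeholder (? or $1 depending on version)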


In practice, we have to additionally normalize and group queries in the agent. For example, IN lists with a different number of arguments are collapsed into a single "(?)" placeholder, and arguments that arrived in postgres inlined into the query text are replaced with placeholders by us. The task is further complicated by the fact that the query text may not be complete.


Before 9.4 the query text is truncated to track_activity_query_size; since 9.4 the query text is stored outside of shared memory and that restriction is removed, but we cut the query to 8 KB anyway, because if queries contain very long strings, the agent's requests put a noticeable extra load on postgres.


Because of this, we cannot run the query through an SQL parser for additional normalization; instead we had to write a set of regular expressions for additional cleanup of queries. This is not a great solution, since we constantly have to add new cases, but we have not come up with anything better yet.
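The same idea can be sketched right in SQL with regexp_replace; the agent does this on its own side, and the patterns below are simplified illustrations rather than the real rule set:

SELECT regexp_replace(
         regexp_replace(
           query,
           'IN\s*\([^)]*\)', 'IN (?)', 'gi'),   -- collapse IN lists of any length
         '''[^'']*''|\m\d+\M', '?', 'g')        -- replace string and numeric literals
FROM pg_stat_statements;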


Another problem is that postgres records in the query field the first query that arrived in the group, before normalization and with its original formatting preserved, so after a counter reset the text may be different for the same group. Developers also very often put comments into queries, for example to indicate the name of the function that issued the query or the ID of the user request, and these are saved in query as well.


In order not to produce new metrics for the same queries every time, we strip comments and extra whitespace.
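In SQL terms the cleanup is roughly this (again a simplified sketch; it does not handle nested /* */ comments):

SELECT btrim(
         regexp_replace(
           regexp_replace(
             regexp_replace(query, '/\*.*?\*/', '', 'g'),  -- drop /* ... */ comments
             '--[^\r\n]*', '', 'g'),                       -- drop -- line comments
           '\s+', ' ', 'g'))                               -- collapse runs of whitespace
FROM pg_stat_statements;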


Problem statement


We built our postgres monitoring together with our friends from PostgreSQL Consulting. They suggested what is most useful when troubleshooting database problems, which metrics are not particularly useful, and advised us on postgres internals.


As a result, we want the monitoring to answer the following questions:



Collecting metrics


Sending counters for all queries into monitoring is actually quite expensive. We decided that we are only interested in the TOP-50 queries, but we cannot simply take the top by total_time, because a new query's total_time would take a long time to catch up with the old ones.


We decided to take the top by the derivative (rate) of total_time per minute. To do this, once a minute the agent reads pg_stat_statements in full and keeps the previous counter values. For each counter of each query a rate is computed; then we additionally try to merge queries that postgres considered different but that are the same for us, summing their statistics. Then we take the top, make separate metrics for those queries, and sum everything else into a single query = "~other" metric.
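Expressed in plain SQL rather than inside the agent, the idea looks roughly like this (statements_prev is a hypothetical table holding the snapshot taken a minute earlier; on 9.4 total_time is in milliseconds):

SELECT s.dbid, s.userid, s.queryid, s.query,
       (s.calls      - p.calls)      / 60.0 AS calls_per_sec,
       (s.total_time - p.total_time) / 60.0 AS time_ms_per_sec
FROM pg_stat_statements s
JOIN statements_prev    p USING (dbid, userid, queryid)
ORDER BY time_ms_per_sec DESC
LIMIT 50;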


As a result, we get 11 metrics for each query in the top. Each metric has a set of qualifying parameters (tags):


 {"database": "<db>", "user": "<user>", "query": "<query>"} 


Interpreting the metrics


Users very often ask about the physical meaning of the postgresql.query.time.* metrics. Indeed, it is not obvious what a sum of response times shows, but such metrics are usually good at revealing how different processes relate to each other.


But if we agree to ignore time spent waiting on locks, we can very roughly assume that during the whole processing time of a query postgres is using some resource (CPU or disk). Such metrics have the dimension of resource seconds spent per second. You can also convert this into the percentage utilization of a CPU core by a query, if you multiply it by 100%.
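Continuing the hypothetical statements_prev snapshot query above: since total_time is in milliseconds, the conversion to "percent of one core (or disk)" is just a division, 1000 ms of query time per second of wall clock being 100% of one core:

SELECT s.query,
       (s.total_time - p.total_time) / 60.0                AS time_ms_per_sec,
       (s.total_time - p.total_time) / 60.0 / 1000.0 * 100 AS pct_of_one_core
FROM pg_stat_statements s
JOIN statements_prev p USING (dbid, userid, queryid)
ORDER BY pct_of_one_core DESC
LIMIT 10;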


Let's see what we got


Before relying on metrics, you need to check whether they tell the truth. For example, let's try to figure out what caused an increase in the number of write operations on the database server:




Let's check whether postgres was writing to disk at that time:




Next, let's find out which queries "dirtied" pages:





We see an approximate match, but the per-query graph does not exactly repeat the graph of buffer flushes. This is because blocks are written out in the background, and the disk load profile changes slightly in the process.
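To attribute writes to queries we look at the dirtied/written block counters; ad hoc, without rates, it can be as simple as this (block counters are in 8 KB pages):

SELECT query, calls,
       shared_blks_dirtied, shared_blks_written, temp_blks_written
FROM pg_stat_statements
ORDER BY shared_blks_dirtied DESC
LIMIT 10;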


Now let's see what the picture looks like for reads:







Here we also see a correlation, but no exact match. This is because postgres reads blocks not directly from disk but through the file system cache, so we do not see part of the disk load: some of the blocks are served from the cache.
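The per-query read picture comes from the *_blks_read counters, keeping in mind that these are reads into shared buffers, not necessarily physical disk reads:

SELECT query, calls,
       shared_blks_read, local_blks_read, temp_blks_read,
       blk_read_time          -- non-zero only with track_io_timing = on
FROM pg_stat_statements
ORDER BY shared_blks_read DESC
LIMIT 10;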


CPU utilization can also be attributed to specific queries, but you cannot expect absolute accuracy, since there are waits on various locks and so on:
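A rough per-query CPU estimate can be obtained by subtracting the I/O time from the total time; this is my own approximation, it needs track_io_timing and still ignores lock waits:

SELECT query, calls, total_time,
       total_time - blk_read_time - blk_write_time AS approx_cpu_time_ms
FROM pg_stat_statements
ORDER BY approx_cpu_time_ms DESC
LIMIT 10;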






Summary



Our demo site has an example of per-query metrics, but the load there is not very interesting, so it's better to look at your own metrics (we have a free, no-obligation two-week trial).

Source: https://habr.com/ru/post/311028/

