
MySQL Performance: Real-Life Tips and Tricks

I promised yesterday to write an article about real cases of MySQL database optimization.
Today I had to get up early in the morning to keep that promise.
It is still hard to keep my thoughts in order this early, so please do not judge too harshly the occasional slips and lapses in this article.

Lately I have often had to work on optimizing the performance of websites. And as a rule, the bottleneck in these sites is precisely the database, with mistakes both in the architecture and in the queries themselves: starting with incorrectly placed indexes, or their complete absence, and the wrong (wasteful) choice of data types for a particular field, and ending with a completely illogical database architecture and equally illogical queries.

In this article I will describe several techniques that were used for an application with over 4 million users and about 100 million hits per day, and at the end I will describe a problem that was solved recently; perhaps the esteemed community will offer me more efficient solutions to it than the one I arrived at.

For large MySQL tables containing hundreds of thousands or millions of records, queries with GROUP BY are often critical, because in most cases, if we look at the EXPLAIN of such a query, we will see in the Extra field: Using temporary; Using filesort.
For example:

explain
select *
from `tags`
group by tag_text;



Extra: Using index; Using temporary; Using filesort
That is, a temporary table is used for the grouping, which is then sorted, and the sort happens without using any index.

In general, if GROUP BY is present in a query, MySQL will always sort the results. If the order of the output is not important to us, it is better to get rid of this sort. This can be done by adding ORDER BY NULL to the query. In total we get:

explain
select *
from `tags`
group by tag_text
order by null;



Extra: Using index; Using temporary

As a rule, grouping is often done on string columns, which is rather slow. You can achieve a significant performance gain if you can sacrifice the "accuracy" of the grouping, or rather of the grouped values. This can be useful when evaluating logs or other reports where approximate numbers are acceptable.

For example, the following query will execute much faster than the previous one:
select *
from `tags`
group by crc32(tag_text)
order by null;


Sometimes LIMIT in queries is a big enough problem by itself. I will not dwell here on cases where people pull out 100 records, and sometimes even 1000, when they really use 10. I will say the following: LIMIT brings a benefit only when the query uses an index on the field we sort by, because otherwise Using temporary; Using filesort eliminates all the benefit of the limit. You should also avoid limits like LIMIT 1000000, 25, since 1000025 records will still be read and only then will the first 1000000 be thrown away. This is often used for pagination, and many programmers justify it by the fact that users mostly visit the new pages (the latest in chronological order), i.e. queries with such limits are rarely executed... Yes, users do not often go to pages from a year or two ago, but when a search bot comes to the site, it walks through all the pages, and this bot, while indexing the site content, brings our database server down.
The usual solution to this problem is to calculate something from the results of the previous page, for example the last id of a record on the previous page, so that the condition in the query looks something like this:
WHERE bla-bla AND node_id > id_from_previous_page ORDER BY node_id LIMIT 25
or something like this:
WHERE position BETWEEN 1000 AND 1025
These constructs are much faster than LIMIT 1000, 25.
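
A minimal sketch of this approach, assuming a hypothetical `posts` table with an indexed `node_id` column and that the application remembered the last id shown on the previous page (here 12345):

-- assumes an index on node_id; 12345 stands in for the last id from the previous page
SELECT *
FROM `posts`
WHERE node_id > 12345
ORDER BY node_id
LIMIT 25;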

Here is a solution to this problem proposed by %rumkin%.
The issue of paginated navigation, taking into account the current dislike of extra GET parameters, can be solved as follows:

SELECT *
FROM `table`
WHERE id > X * (Y - 1)
LIMIT X;


Where X is the number of results per page and Y is the number of the current page. This way we do without pre-queries, extra GET parameters and other complications.
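
For illustration, a hedged example with 25 results per page and page 3 (X = 25, Y = 3), so the condition becomes id > 50; note that this returns exactly the rows of page 3 only if the ids are sequential without gaps:

SELECT *
FROM `table`
WHERE id > 25 * (3 - 1)  -- i.e. id > 50
LIMIT 25;  -- an ORDER BY id could be added to guarantee the row order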

Some suggest solving these problems with horizontal sharding (partitioning). But using it to solve the pagination problem is, IMHO, a perversion to say the least... Although more about sharding later...

Also, some people think that the SQL_NO_CACHE SQL_CALC_FOUND_ROWS construction is faster than two queries: the first with LIMIT and the second with SELECT COUNT(*).
You can read a debunking of that legend here:

http://www.mysqlperformanceblog.com/2007/08/28/to-sql_calc_found_rows-or-not-o-sql_calc_found_rows/
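
For reference, a hedged sketch of the two approaches being compared, using a hypothetical `posts` table and an arbitrary condition:

-- variant 1: two separate queries
SELECT id, title FROM `posts` WHERE author_id = 42 ORDER BY id LIMIT 0, 25;
SELECT COUNT(*) FROM `posts` WHERE author_id = 42;

-- variant 2: SQL_CALC_FOUND_ROWS plus FOUND_ROWS()
SELECT SQL_CALC_FOUND_ROWS id, title FROM `posts` WHERE author_id = 42 ORDER BY id LIMIT 0, 25;
SELECT FOUND_ROWS();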

I will not write about index placement in general; much has been said about that already. I will only note that for now you should not rely on the index merge algorithm and, where possible, should replace it with composite indexes (indexes over several fields); you can read more about this here:

http://www.mysqlperformanceblog.com/2008/08/22/multiple-column-index-vs-multiple-indexes/

http://dev.mysql.com/doc/refman/5.0/en/index-merge-optimization.html
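
A small sketch of what is meant, with hypothetical column names: two single-column indexes push the optimizer towards index merge, while one composite index serves both conditions at once.

-- two separate indexes: WHERE first = ... AND second = ... may resort to index merge
ALTER TABLE `table` ADD INDEX idx_first (first);
ALTER TABLE `table` ADD INDEX idx_second (second);

-- one composite index covering both conditions
ALTER TABLE `table` ADD INDEX idx_first_second (first, second);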

Sometimes it is also appropriate to rewrite such a query with UNION, or rather with UNION ALL. UNION is short for UNION DISTINCT, so when we combine the results of two queries with plain UNION, the union runs slower. This is not because UNION ALL, unlike UNION, avoids a temporary table: in fact it does use one, it just does not admit it (in the sense that it does not show it in EXPLAIN), which can only be seen with SHOW STATUS. The point is that UNION ALL creates its temporary table without a UNIQUE KEY, while UNION DISTINCT creates it with one; hence the difference. So sometimes a query of this type can be the more optimal variant

select *
from `table`
where first = 'A'

UNION ALL

select *
from `table`
where second = 'B' AND first != 'A'

UNION ALL

select *
from `table`
where third = 'C' AND second != 'B' AND first != 'A'


than

select *
from `table`
where third = 'C' OR second = 'B' OR first = 'A'


One more thing I wanted to write about is covering indexes, or more precisely the queries that use them.
In short, the essence is this: in the query we use as a condition (WHERE) and return (SELECT) only fields that are part of a single composite index. As a result, all MySQL has to do is walk the index tree and return the result. The index is in memory, we never touch the row data on disk, and everything is nice and fast.
A classic example of this is a query of the form:

SELECT user_password FROM `users` WHERE user_name = 'maghamed';
given an index on the fields (user_name, user_password).
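
A minimal sketch of the index that makes this query covering (assuming the `users` table contains these columns):

-- both the WHERE column and the selected column are in the index,
-- so MySQL can answer the query from the index alone (Using index in EXPLAIN)
ALTER TABLE `users` ADD INDEX idx_user_name_password (user_name, user_password);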

To reinforce the material, here is a less standard example on the same topic :-)

SELECT
  `log`.visitor_id, `url`.url
FROM (
  SELECT id
  FROM log
  WHERE ip = '127.0.0.1'
  ORDER BY ts DESC
  LIMIT 50, 10
) l
JOIN log ON (l.id = log.id)
JOIN url ON (url.id = log.url_id)
ORDER BY ts DESC;


Although this query looks terrifying, if we have a covering index on the fields (ip, ts, id), the derived (inner) query will use it and execute very quickly, including its LIMIT, which will also be processed using only the index. After that we self-join back to the table to pull in the remaining fields (visitor_id).
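
A sketch of that covering index on the `log` table (assuming the column names from the query above):

-- lets the derived query (WHERE ip ... ORDER BY ts ... LIMIT) run entirely from the index
ALTER TABLE log ADD INDEX idx_ip_ts_id (ip, ts, id);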



And finally, I will describe a case that I had to solve not long ago.
There is a blogging site where users write articles and news on various topics; in general, something like Habr, only with a more general-interest orientation :-)
We needed to build a tracking system for the authors, something like Google Analytics, so that the authors of articles could see who visits their articles, from where and when. That is, the statistics should include the total number of views over a given period, statistics of referrers from other sites, and statistics of the most popular search queries that brought users from search engines to a post.

All these statistics are collected in a log, and every hour a cron job transfers them from the log to the database, after which a new log file is created and the statistics for the next hour are written there.

So, what we have in total:
articles (let's call them entries; each has its ID)
referrers are collected and processed for each article, plus the phrases by which people arrive from search engines
the reports that need to be generated:
views over the last hour, two hours, 6 hours, 12 hours, a day, a week
search phrases for each article over the same intervals
referrers for each article over the same intervals
the most "popular" search phrases over the same intervals
the most "popular" referrers over the same intervals

There was a solution that worked fine up to a certain point, but after the number of site visits increased sharply, it began to work slowly.

Here is the structure of the table for the general statistics report; roughly the same tables exist for the referrer and search-phrase statistics:

CREATE TABLE `mt_daily_entry_stats` (
  `daily_entry_stats_entry_id` INTEGER(11) UNSIGNED NOT NULL,
  `daily_entry_stats_views` INTEGER(11) UNSIGNED NOT NULL,
  `daily_entry_stats_date` DATETIME NOT NULL,
  PRIMARY KEY (`daily_entry_stats_entry_id`, `daily_entry_stats_date`),
  KEY `daily_entry_stats_date` (`daily_entry_stats_date`)
) ENGINE = InnoDB


And, accordingly, depending on the selected statistics period, queries like the following were executed against this table:

SELECT
  `stats`.`daily_entry_stats_entry_id`,
  SUM(`stats`.`daily_entry_stats_views`) AS `entry_stats_views`
FROM
  `mt_daily_entry_stats` AS `stats`
WHERE
  `stats`.`daily_entry_stats_date` > NOW() - INTERVAL 24 HOUR
GROUP BY
  `stats`.`daily_entry_stats_entry_id`
HAVING
  `entry_stats_views` > 1000


This solution began to work slowly, and we had to understand why.

Here is the EXPLAIN of this query:

select_type: SIMPLE
table: stats
type: range
possible_keys: daily_entry_stats_date
key: daily_entry_stats_date
key_len: 8
ref: NULL
rows: 97644
Extra: Using where; Using temporary; Using filesort

1. ENGINE = InnoDB is used, which means the primary key is clustered: it is stored directly in the same file as the data (unlike MyISAM, where indexes are stored in a separate file), the data is sorted by this key, and it is appended to every secondary key. Therefore it is very important that the PRIMARY KEY be as small as possible, and queries against InnoDB tables that use the PRIMARY KEY execute very quickly.

What we have: a composite key PRIMARY KEY (`daily_entry_stats_entry_id`, `daily_entry_stats_date`),
which takes 4 bytes (INT) + 8 bytes (DATETIME) = 12 bytes.

2. Since the DATETIME data type takes up quite a lot of space (8 bytes), it would seem more appropriate to represent dates that fall between 1970 and 2038 as TIMESTAMP. But the manual says that TIMESTAMP values are converted from the current time zone for storage and back to it on retrieval, i.e. two additional operations are performed when saving and reading the date. So in this case it is best to store the date in an INT.
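
A hedged sketch of that INT variant (the column name is hypothetical, not the schema actually used here): the date is stored as a Unix timestamp and compared against UNIX_TIMESTAMP() in the query.

-- hypothetical column: `daily_entry_stats_ts` INT(10) UNSIGNED NOT NULL
SELECT daily_entry_stats_entry_id,
       SUM(daily_entry_stats_views) AS entry_stats_views
FROM mt_daily_entry_stats
WHERE daily_entry_stats_ts > UNIX_TIMESTAMP() - 24 * 3600
GROUP BY daily_entry_stats_entry_id
HAVING entry_stats_views > 1000;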

While working on this, many solutions were proposed: various kinds of caching, creating a temporary in-memory table holding the results so that queries would go to it instead, and so on.

We even experimented with sharding. But sharding was also ineffective in this case, since the report intervals overlap; for example, the 12-hour report includes the data of the 6-hour report. Therefore reports for long periods would still require data from several shards, and partitioning (degenerate sharding, corrected at %andry%'s request) is only supported in version 5.1, which is not yet a release, so it all remained mere attempts.

After trying various options, I settled on a variant with denormalization. That is, it was decided to create several tables, one for each statistics interval: stats_hour, stats_2hour, stats_6hour, stats_12hour, stats_day.
This gets rid of GROUP BY ... HAVING, which turns into a plain WHERE.

And accordingly, when we need statistics for some time interval, we simply query the table for that interval.
True, the number of statistics tables has grown, and the data is stored redundantly, but together with caching (memcached) this works quite well.

Moreover, it is quite simple to add new intervals or change the existing ones; all that is needed is to modify the cron script that fills these tables.
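
A minimal sketch of what one of these per-interval tables and its report query might look like (the names and columns are hypothetical; the final schema is not given here):

-- hypothetical per-interval table, refilled by the cron script
CREATE TABLE `stats_day` (
  `entry_id` INT(11) UNSIGNED NOT NULL,
  `views` INT(11) UNSIGNED NOT NULL,
  PRIMARY KEY (`entry_id`)
) ENGINE = InnoDB;

-- the report becomes a plain WHERE instead of GROUP BY ... HAVING
SELECT entry_id, views
FROM `stats_day`
WHERE views > 1000;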
My solution does not claim to be the best; it would be interesting to see how you would solve such a problem.
I hope to read about it in the comments :-)

Source: https://habr.com/ru/post/38907/

