📜 ⬆️ ⬇️

MySQL and JOINs

The reason for writing this article was some debates in one of the groups of linkedin related to MySQL, as well as communication with colleagues and habroles :-)

In this article I wanted to write what JOINs are in general in MySQL and how you can optimize queries with them.


')

What are JOINs in MySQL?



In MySQL, the term JOIN is used much more broadly than might be supposed. Here JOIN can be called not only a query that combines the results from several tables, but also a query to one table, for example, SELECT on one table - this is also a join.

That's because the algorithm for executing joins in MySQL is implemented using nested loops. Those. each subsequent JOIN is an additional nested loop. To execute the query and return all the records that satisfy the condition MySQL performs a cycle and runs through the records of the first table in parallel, checking the conditions described in the request body when there are records that meet the conditions — in the nested loop, the second table looks for records that match the first and satisfy the verification conditions and m .d

Standard query with INNER JOIN

SELECT
*
FROM
Table1
INNER JOIN
Table2 ON P1(Table1,Table2)
INNER JOIN
Table3 ON P2(Table2,Table3)
WHERE
P(Table1,Table2,Table3).

* This source code was highlighted with Source Code Highlighter .


where P is the table splicing conditions and filters in the WHERE condition.

You can imagine such a pseudo code for such a query.

FOR each row t1 in Table1 {
IF(P(t1)) {
FOR each row t2 in Table2 {
IF(P(t2)) {
FOR each row t3 in Table3 {
IF P(t3) {
t:=t1||t2||t3; OUTPUT t;
}
}
}
}
}
}

* This source code was highlighted with Source Code Highlighter .


where the construction t1 || t2 || t3 means the concatenation of columns from different tables.

If the query contains OUTER JOINs, for example, LEFT OUTER JOIN

SELECT
*
FROM
Table1
LEFT JOIN
(
Table2 LEFT JOIN Table3 ON P2(Table2,Table3)
)
ON P1(Table1,Table2)
WHERE
P(Table1,Table2,Tabke3)

* This source code was highlighted with Source Code Highlighter .


then the algorithm for executing this query MySQL will look something like this

FOR each row t1 in T1 {
BOOL f1:=FALSE;
FOR each row t2 in T2 such that P1(t1,t2) {
BOOL f2:=FALSE;
FOR each row t3 in T3 such that P2(t2,t3) {
IF P(t1,t2,t3) {
t:=t1||t2||t3; OUTPUT t;
}
f2=TRUE;
f1=TRUE;
}
IF (!f2) {
IF P(t1,t2,NULL) {
t:=t1||t2||NULL; OUTPUT t;
}
f1=TRUE;
}
}
IF (!f1) {
IF P(t1,NULL,NULL) {
t:=t1||NULL||NULL; OUTPUT t;
}
}
}

* This source code was highlighted with Source Code Highlighter .


You can read more about it here - dev.mysql.com/doc/refman/5.1/en/nested-joins.html

So, as we can see, JOINs are just a group of nested loops. So why in MySQL and UNION and SELECT and queries with SUBQUERY are also joins?

MySQL optimizer tries to lead queries to the form to which it is more convenient for it to process and execute queries according to the standard scheme.

With SELECT, everything is clear - just a loop without nested loops. All UNIONs are executed as separate queries and the results are added to a temporary table, and then MySQL is already working with this table, i.e. looping through the entries in it. Subquery is the same story.

Leading everything to one pattern, for example, MySQL rewrites all RIGHT JOIN requests for LEFT JOIN equivalents.

But the strategy of executing queries through nested loops imposes some limitations, for example, in connection with such a scheme, MySQL does not support the execution of FULL OUTER JOIN queries.

But the result of such a query can be obtained using the UNION of two queries on LEFT JOIN and on RIGHT JOIN.
An example of the request itself can be viewed on the wiki link.

JOIN query execution plan



Unlike other RDBMSs, MySQL does not generate bytecodes to execute the query, instead MySQL generates a list of instructions in a tree form that the query execution engine follows when executing the query.
This tree has the following form and has the name "left-deep tree"
image

Unlike balanced trees (Bushy plan), which are used in other DBMS (for example Oracle)

image

JOIN optimization



We now turn to the most interesting - to optimize joins.
MySQL optimizer, namely the part that is responsible for optimizing JOINs, selects the order in which the existing tables will be glued together, because You can get the same result (dataset) with a different order of tables in the gluing. MySQL optimizer estimates the cost of various plans and selects with the lowest cost. The unit of evaluation is a single-read operation of a 4-kilobyte data page from arbitrary disk space.

For the selected plan, you can find out the cost by executing the command

SHOW SESSION STATUS LIKE 'Last_query_cost';

after completing the query that interests us. The variable Last_query_cost is a session variable. The description of the Last_query_cost variable in the MySQL documentation can be found here - dev.mysql.com/doc/refman/5.1/en/server-status-variables.html#option_mysqld_Last_query_cost

Estimation is based on statistics: the number of memory pages occupied by the table and / or indexes for this table, cardinality (number of unique values) of indexes, the length of records and indexes, their distribution, etc. During its evaluation, the optimizer does not expect any parts to fall into the cache, the optimizer assumes that each read operation is a disk access.

Sometimes the analyzer-optimizer cannot analyze all possible execution plans and chooses the wrong one. For example, if we have an INNER JOIN in 3 tables, then the analyzer has 3 possible options! = 6, and if we have gluing together on 10 tables, then there are already 10 possible options! = 3628800 ... MySQL cannot analyze so many variants, therefore in this case it uses the " greedy " search algorithm.

And just to solve this problem, we can use the STRAIGHT_JOIN construction. In fact, I am opposed to similar hacks like FORCE INDEX and STRAIGH_JOIN, more precisely, against their mindless use wherever possible and impossible. In this case, you can :-) Having clarified (either experimentally making queries with STRAIGH_JOIN and evaluating Last_query_cost, or empirically) the necessary order of joins, you can rewrite the query with the tables in the appropriate order and add STRAIGH_JOIN to this query, so we immediately kill two hares - we define the correct execution plan for the request (this is the main hare) and save time at the “Statistic” stage (All stages of the request can be viewed by setting the profiling of requests with the SET PROFILING = 1 command, I described this in my previous article on profiling of queries in MySQL)

But you should not apply this hack to all queries, expecting to optimize on matches and save time on drawing up a plan for query execution by the optimizer and add STRAIGH_JOIN to all queries with joins , since the data is changing and the gluing, which is optimal now, may cease to be optimal with time, and then the requests to start very much lag.

Also, as mentioned above, the joins are placed in temporary tables, so it is often appropriate to use the “derived table” in which we impose all the necessary conditions on the sample, and also indicate the LIMIT and the sort order. In this case, we will get rid of data redundancy in the temporary table, as well as perform the sorting at an early stage (according to the result of one sample, and not the final splicing, which will reduce the size of the records that will be sorted).

The standard example of the approach described above. A simple sample for a lot of things to many: news and tags to them.

SELECT
t.tid, t.description, n.nid, n.title, n. extract , n.modtime
FROM
(
SELECT
n.nid
FROM
news n
WHERE
n.type = 1321
AND n.published = 1
AND status = 1
ORDER BY
n.modtime DESC
LIMIT
200
) as news
INNER JOIN
news n ON n.nid = news.nid
INNER JOIN
news_tag nt ON n.nid = nt.nid
INNER JOIN
tags t ON nt.tid = t.tid

* This source code was highlighted with Source Code Highlighter .


And lastly there is a small problem, which I sometimes ask during interviews :-)

There is a news blogger site. There are such entities as news and comments to them.

The task - you need to write a request that displays a list of 10 news of a certain type (set by the user) sorted by publication time in chronological order, as well as for each of these news to show no more than 10 recent comments, i.e. if there are more comments, we show only the last 10.

Everything needs to be done in one request. Yes, this may not be the best way, and you are free to propose another solution :-)

Source: https://habr.com/ru/post/44807/


All Articles