📜 ⬆️ ⬇️

Unexpected results from a simple MySQL query


Recently, I spent a lot of time trying to understand why a single query produces such a strange result. On MariaDB, it produced an obviously incorrect result, but on the good old MySQL it was not executed at all. Those. The request was launched, but it was not possible to wait for its completion. In order to figure out what was wrong, I had to do a little research. But let's get everything in order.


It was necessary to count the number of users who wrote to the forum at least once during the day in the last month.
Suppose forum messages are in the following table:
CREATE TABLE `forum_posts` ( `id` int(11) unsigned NOT NULL AUTO_INCREMENT, `user_id` int(10) unsigned NOT NULL, `created` datetime NOT NULL, PRIMARY KEY (`id`) ) ENGINE=InnoDB DEFAULT CHARSET=utf8; 

which contains the following data
 INSERT INTO `forum_posts` (`user_id`, `created`) VALUES (1, '2013-01-01'), (1, '2013-01-01'), (2, '2013-01-01'), (2, '2013-01-01'), (2, '2013-01-01'), (3, '2013-01-01'), (3, '2013-01-01'), (4, '2013-01-01'), (5, '2013-01-01'), (5, '2013-01-01'), (1, '2013-01-02'), (1, '2013-01-02'), (2, '2013-01-02'), (2, '2013-01-02'), (3, '2013-01-02'), (3, '2013-01-02'), (4, '2013-01-02'), (4, '2013-01-02'), (1, '2013-02-02'), (1, '2013-02-02'), (2, '2013-02-02'), (2, '2013-02-02'), (3, '2013-02-02'), (3, '2013-02-02'), (4, '2013-02-02'), (4, '2013-02-02'), (5, '2013-02-02'), (5, '2013-02-02'); 


Request which gives the necessary data ::
 SELECT DATE(fp.created) dt, count(*) qnt FROM forum_posts fp WHERE fp.id IN ( SELECT fp2.id FROM forum_posts fp2 WHERE fp2.created >= '2013-01-01' AND fp2.created < '2013-02-01' GROUP BY DATE(fp2.created), fp2.user_id ) GROUP BY dt; 

The result of this query was unexpected, he returned:
 2013-01-01 10 2013-01-02 8 

Not believing my eyes, I rushed to perform the subquery separately - it returned the correct result (9 lines). That is, the query with the WHERE id IN (...) condition WHERE id IN (...) returned more records than the transferred identifiers in IN (...) . It became clear that the evening was long.

What we have?


We have a simple query with a subquery. The subquery cuts values ​​by WHERE , groups by date and user, and then passes the resulting identifiers into an external query, which groups only by date. But, as the test shows, the database ignores the results of the subquery.

Since I do not believe in mysticism and do not really trust myself in the evenings, I was attracted to solving the problem of Comrade xzander . He ran a query on the dump of my database (in fact, there were about 4 million lines). The result was unexpected - the request failed in 10 minutes. Comparing the environment, we found that I run a query on MariaDB (5.5.30), and it is on MySQL (5.5.28)
')
Having dug deeper, we realized that MySQL and MariaDB have a different approach to query execution. The “smart” MySQL optimizer decides that it is necessary to first execute an external query, and each line with a subquery, having executed it as a result N times . Thus, on a large database, such a query will be executed for a very long time.

MariaDB comes smarter - it executes the query quickly, but ignores GROUP BY in the subquery and, as it turned out, this is not a bug at all, but a feature . So much for the full compatibility and transparency of the transition.

It would seem, how did I find myself in such a situation? Why subquery, if you can do a join? It's simple: in this project Doctrine was used, and in it, as in any decent ORM, in order to make a join, you must first describe the connection. Describing the connection of the table to itself only for the sake of one query is somehow too much.

What to do?


It turned out that it was not difficult to deceive the optimizer: it is enough to convince him that the subquery is more complicated than it seems to him by adding HAVING 1 to the subquery

So here’s the query that works correctly on MariaDB:
 SELECT DATE(fp.created) dt, count(*) qnt FROM forum_posts fp WHERE fp.id IN ( SELECT fp2.id FROM forum_posts fp2 WHERE fp2.created >= '2013-01-01' AND fp2.created < '2013-02-01' GROUP BY DATE(fp2.created), fp2.user_id HAVING 1 ) GROUP BY dt; 

On MySQL, you probably still have to rewrite the query, getting rid of the subquery.

What conclusions can we draw?


  1. It is necessary to check the results of even simple queries. Not to notice that such a request returns incorrect data, it was easy.
  2. The transition from MySQL to MariaDB is not as transparent as they say.
  3. You should not be tied to the ORM when you need to perform complex statistical queries.


UPD.
From the alno submission , it turned out that instead of a crutch with HAVING 1 , it suffices to replace fp2.id in the subquery with MIN(fp2.id) .
And petropavel decided that it was still a bug, not a feature, and created a ticket in the MariaDB bug tracker. Let's see what the developers think about it.

Source: https://habr.com/ru/post/181195/


All Articles