Conditional grouping

Periodically, a problem arises that requires grouping a data set with the condition that for attributes that are not participating in a grouping, you need to take a tuple with the maximum value in one of the fields.

Let's look at a simple example.
There is a table:

CREATE TABLE IF NOT EXISTS shop ( id INT NOT NULL AUTO_INCREMENT, article INT(4) ZEROFILL NOT NULL, dealer VARCHAR(45) NOT NULL, price DECIMAL(8,2) NOT NULL, PRIMARY KEY (id)) ENGINE = InnoDB;

It is necessary for all the article to find a dealer with the maximum price.

For this problem there are several obvious and simple solutions, but I know one of them, which far surpasses all others.
Faced with this task? Want to see a new way to solve it? I ask under the cat.

Even the official documentation of mysql.com has not bypassed this task, and 3 solutions are proposed:
Before each request I will indicate the index and time of its execution. The table is filled with 100,000 entries.

 DELIMITER $$ CREATE PROCEDURE InsertRand() BEGIN DECLARE i INT; SET i = 1; START TRANSACTION; WHILE i <= 100000 DO INSERT INTO shop (article, dealer, price) VALUES (CEIL(RAND() * 9999), CEIL(RAND() * 999), RAND() * 9999); SET i = i + 1; END WHILE; COMMIT; END$$ DELIMITER ;

The first idx (article) 2,169 c:

 SELECT article, dealer, price FROM shop s1 WHERE price=(SELECT MAX(s2.price) FROM shop s2 WHERE s1.article = s2.article);

Second idx (article, price) 0,203 c

 SELECT s1.article, dealer, s1.price FROM shop s1 JOIN ( SELECT article, MAX(price) AS price FROM shop GROUP BY article ) AS s2 ON s1.article = s2.article AND s1.price = s2.price;

Third idx (article, price) 0.593 c

 SELECT s1.article, s1.dealer, s1.price FROM shop s1 LEFT JOIN shop s2 ON s1.article = s2.article AND s1.price < s2.price WHERE s2.article IS NULL;

Well, now my decision:

Attention! Use this method at your own risk! In future versions of MySQL, grouping behavior may change.

This decision is based on the fact that group by for attributes to which group operations are not indicated and which do not participate in the grouping, takes the first value encountered. Thus, if the data set is pre-sorted, we get the tuples with the desired maximum value.

4. idx (price) 0,328 c

 SELECT article, dealer, price FROM ( SELECT article, dealer, price FROM shop ORDER BY price desc) as t GROUP BY article ORDER BY NULL;

Since If previous examples were without any sorting, and group by automatically adds it, then you need to specify ORDER BY NULL so that the data are not further sorted, otherwise the results will be incompatible.
But why do we need to create an intermediate table, because we can get the sorted data using the index:

5. idx (article, price) 0,110 c

 SELECT article, dealer, price FROM shop use index (idx) GROUP BY article DESC ORDER BY NULL;

Bonus solution:

The solution was found on the Mitch Dickinson blog. It does not claim to be the fastest, but it is very original.

6. idx (article) 0,202 c

 SELECT article, SUBSTRING_INDEX(GROUP_CONCAT(dealer ORDER BY price DESC),',',1) AS dealer, MAX(price) AS price FROM shop GROUP BY article;

In the comments dm9 gave another 1 solution, which was described in the documentation for earlier versions:

 SELECT article, SUBSTRING( MAX( CONCAT(LPAD(price,6,'0'),dealer) ), 7) AS dealer, 0.00+LEFT(MAX( CONCAT(LPAD(price,6,'0'),dealer) ), 6) AS price FROM shop GROUP BY article;

In general, the article is devoted not so much to solving a specific task, as to examining an example of how you can go beyond the generally accepted solutions and try to find alternative ways. For me, databases have always been interesting precisely because it is possible to find interesting ways to optimize, which can significantly save time and effort.

PS: A careful reader probably noticed that methods 4-6 for each article give only 1 supplier with a maximum price, unlike the first methods, in which all suppliers return. But in solving this problem, I was interested in any of the suppliers, so this problem was immaterial.

PPS: The alternative method proposed in this article shows itself well with average table sizes. With the number of records over a million, the best method would be 2. Moreover, if the number of records is already so large, I highly recommend this information precalculate in separate tables.

Source: https://habr.com/ru/post/159767/

All Articles