📜 ⬆️ ⬇️

Grouping serial posts close in time

Good afternoon, Habr!

The project encountered the following task: there is a news feed of photos, which users can post to only one photo at a time, and they need to be displayed together as a gallery. In other words, all sample lines need to be logically combined into several “time windows” for each author and use this when displaying.

It begs to group the following posts one by one, but this does not work: if two users simultaneously upload hundreds of photos in parallel and slowly, they are added to the tape one by one, and when viewing posts, it will be unpleasant to alternate.
')
Behind the MySQL solution

Formulation of the problem



Immediately, I’ll make a reservation that grouping posts during selection is wrong: it’s very desirable that groups of images remain static and nothing falls off from anywhere. Thus, each post must clearly belong to some kind of “group”, visually represented by the gallery.

The solution is not a panacea, but there is a range of tasks where exactly this approach can be useful.

First, create a table for the experiments:

CREATE TABLE `feed`( `id` INT UNSIGNED NOT NULL AUTO_INCREMENT, `tm` INT UNSIGNED NOT NULL COMMENT 'timestamp', `user_id` INT UNSIGNED NOT NULL COMMENT 'author id', `image` VARCHAR(255) NOT NULL COMMENT 'posted image filename', `group` INT UNSIGNED NULL DEFAULT NULL COMMENT 'post group', PRIMARY KEY(`id`), INDEX(`user_id`), INDEX(`tm`,`group`) ); 


The `feed` is a list of posts. Each post has time to add tm , a link to the user_id , the actual picture, and also we add a special column group which allows you to group images into the gallery. When adding a new entry group=NULL .

The option is wrong


At first it seems: we select the most recent post, then we select the posts of the same user within a radius of one hour, and assign them all to group= id-- . Only in this case each post will turn out to belong only to its own group. No, it does not fit :)

Grouping



First you need to determine the criterion of the time proximity of posts:

 SET @granularity:=60*60; 


So, all posts within one hour are grouped into one gallery.

Then we make the following logical move: we give each post to become the “basis” for the group:

 SELECT `g`.`id` AS `group` FROM `feed` `g`; 


And such a group will contain lines in the hourly radius from the “warp” (time difference - within one hour):

 SELECT `g`.`id` AS `group`, `f`.* FROM `feed` `g` CROSS JOIN `feed` `f` ON (`f`.`user_id` = `g`.`user_id` AND `f`.`tm` BETWEEN `g`.`tm`-@granularity AND `g`.`tm` ) 


So, now each line has a number of candidates for the basics. Selection criteria: choose the post containing the greatest number of posts in its hourly radius as the “basis”.
In order not to strain MySQL with extra calculations - instead of a radius, we use the criterion `id` :

 SELECT MAX(`g`.`id`) AS `group`, `f`.* FROM `feed` `g` CROSS JOIN `feed` `f` ON (`f`.`user_id` = `g`.`user_id` AND `f`.`tm` BETWEEN `g`.`tm`-@granularity AND `g`.`tm` ) GROUP BY `f`.`id` 


So, we selected all lines and for each of them set group.

I note that there is a nuance. When displaying such a tape, we will now sort by `group` DESC . Then, if the MAX() function is used in the above code, then the most recent “group” (which received the last update) will jump to the very top in tape sorting.
This behavior can be easily changed: then we get permanent groups, such that elements cannot move from one to another: it’s enough to use the MIN() function: the oldest post will always become the basis, and the group can only be supplemented by new incoming photos:

 SELECT MIN(`g`.`id`) AS `group`, `f`.* FROM `feed` `g` CROSS JOIN `feed` `f` ON (`f`.`user_id` = `g`.`user_id` AND `f`.`tm` BETWEEN `g`.`tm` AND `g`.`tm`+@granularity ) GROUP BY `f`.`id` 


Now we need to update the table based on the results of this query: set the value of the `group` column. MySQL does not allow you to update the table from which you are reading in a single UPDATE request, so you must first transfer our sample to a temporary table:

 CREATE TEMPORARY TABLE `_feedg` SELECT MAX(`g`.`id`) AS `group`, `f`.`id` FROM `feed` `g` CROSS JOIN `feed` `f` ON (`f`.`user_id` = `g`.`user_id` AND `f`.`tm` BETWEEN `g`.`tm`-@granularity AND `g`.`tm` ) WHERE `f`.`group` IS NULL OR `f`.`tm` >= (UNIX_TIMESTAMP()-2*@granularity) GROUP BY `f`.`id`; 


Pay attention to the WHERE condition that appears: it is used for optimization so that the rearrangement is carried out only at the very top of the table, among the latest records.

Now, using a temporary table, you can update the original:

 UPDATE `feed` `f` CROSS JOIN `_feedg` `g` USING(`id`) SET `f`.`group` = `g`.`group`; 


It is desirable to perform the query after each insertion, but for optimization it is possible to do this by cron, leaving part of the work on the output algorithm.

UPD: as Melkij suggested - indeed, you can update if the read request is made a subquery instead of a JOIN. Then we completely remove the CREATE TEMPORARY TABLE , and the UPDATE query will look like this:

 UPDATE `feed` `f` CROSS JOIN ( SELECT MAX(`g`.`id`) AS `group`, `f`.`id` FROM `feed` `g` CROSS JOIN `feed` `f` ON (`f`.`user_id` = `g`.`user_id` AND `f`.`tm` BETWEEN `g`.`tm`-@granularity AND `g`.`tm`) WHERE `f`.`group` IS NULL OR `f`.`tm` >= (UNIX_TIMESTAMP()-2*@granularity) GROUP BY `f`.`id` ) `g` USING(`id`) SET `f`.`group` = `g`.`group`; 


Selections



Now, how to correctly select from such a table?

If `group` for all lines, then

 SELECT * FROM `feed` ORDER BY `group` DESC, `tm` DESC; 


However, if the above query runs on the crown and, therefore, we have part of the lines for which group=NULL , part of the output logic should be assigned to the script renderer, and the sample should be done like this:

 SELECT * FROM `feed` ORDER BY `group` IS NULL, `group` DESC, `tm` DESC; 


Links



My question on stackoverflow is: Stackoverflow: Grouping serial posts in a user feed . Here you can admire how it is done in Oracle using "temporary windows".

SQLfiddle, play around : SQLfiddle

I hope I was helpful.

Source: https://habr.com/ru/post/150063/


All Articles