Not so long ago, the Rekko Challenge 2019, a recommendation systems contest from the Okko online cinema, was held on the Boosters platform. For me it was the first experience of competing with a leaderboard (previously I had only tried my hand at a hackathon). The task was interesting and familiar to me from practice, and there was a prize pool, so it made sense to participate. As a result, I took 14th place, for which the organizers sent a commemorative T-shirt. Nice. Thanks.
In this article I will briefly immerse you in the task, walk through the hypotheses I put forward, and show how to get far in a recommendation systems contest and land in the top 15 without any experience in stacking, which should be especially useful for those who are only planning to enter competitions.
Recommender Systems
The main goal of recommender systems is to give the user what they want to buy (unfortunately, such a hypertrophied view is imposed on us by commercial applications). There are different problem statements (ranking, finding similar items, predicting a specific item) and, accordingly, different ways to solve them. Well, we all love having a choice, and each problem comes with a set of several potential solutions. The various approaches are well described in the article Anatomy of Recommender Systems. Of course, no one has repealed the NFL theorem, which means that in a competition we can try different algorithms.
Formulation of the problem
You can read more about the task and the data in the organizers' article; TL;DR, here I will describe the minimum needed to understand the context.
The dataset contains just over ten thousand films with anonymized attributes. The following user-item interaction matrices are available:
transactions - facts of users buying, renting, or watching content by subscription;
ratings - users' ratings of films;
bookmarks - events of adding a film to bookmarks.
All information covers a certain period of time, presented in arbitrary units that are tied to real time.
Content came with the following set of attributes:
You can read about them in detail in the organizers' article, but I want to point out right away what caught my eye: the "attributes" parameter. It contained a bag of categorical features with a cardinality of about 36 thousand, with an average of 15 values per film. At first glance, these values encode exactly the most basic attributes that describe content: actors, directors, country, and the subscriptions or collections a film belongs to.
We need to predict 20 films that test users will watch in the next two months. The test users are 50 thousand out of 500 thousand users; on the leaderboard they are split in half, 25 thousand each in public/private.
Metrics
The organizers chose Mean Normalized Average Precision at 20 elements (MNAP@20) as the metric. The main difference from regular MAP is that for users who watched fewer than 20 films in the test period, normalization is done not by k but by the actual number of films watched.
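To make the difference concrete, here is a minimal sketch of the metric as I understand it (my own illustration, not the organizers' reference implementation):

def mnap_at_k(recommended, relevant, k=20):
    """Mean Normalized Average Precision@k.

    recommended: dict user_id -> ordered list of predicted item ids
    relevant:    dict user_id -> set of items actually watched in the test period
    Normalization is by min(k, number of watched items), not by k.
    """
    ap_sum, n_users = 0.0, 0
    for user, rel in relevant.items():
        if not rel:
            continue
        hits, precision_sum = 0, 0.0
        for i, item in enumerate(recommended.get(user, [])[:k], start=1):
            if item in rel:
                hits += 1
                precision_sum += hits / i
        ap_sum += precision_sum / min(k, len(rel))
        n_users += 1
    return ap_sum / max(n_users, 1)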
Getting down to the solution. First of all, I had to decide how to validate. Since we need to predict films in the future, a simple split by users won't do. Because time was anonymized, I first had to at least partially decipher it. To do this, I took several users, plotted their transactions, and spotted a clear seasonality. The assumption was that it is daily, and knowing the time gap between days we can calculate what period the data covers. It turned out to be transactions over 6 months, which was later confirmed in the Telegram channel where the contest was discussed.
The chart above shows the hourly frequency of transactions using one month of data as an example. The three prominent peaks each week look like Friday, Saturday, and Sunday evenings.
Consequently, we have six months of viewing history, and films must be predicted for the next two. We will use the last third of the training sample's time span as the validation dataset.
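The split itself is trivial once the timeline is decoded. A sketch, assuming transactions is a pandas DataFrame with user_uid, element_uid, and ts columns (the column names are from memory):

# transactions: pandas DataFrame loaded from the transactions file
t_min, t_max = transactions['ts'].min(), transactions['ts'].max()
threshold = t_max - (t_max - t_min) / 3  # the last third of the observed time goes to validation

train = transactions[transactions['ts'] < threshold]
valid = transactions[transactions['ts'] >= threshold]

# target for validation: for every user, the set of films they interacted with in that window
valid_target = valid.groupby('user_uid')['element_uid'].apply(set).to_dict()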
The next few submissions showed that the split was chosen well: the local validation score correlated excellently with the leaderboard.
Attempt to deanonymize data
To begin with, I decided to try to deanonymize all the films in order to:
generate a bunch of features from the content meta-information; at the very least, the following come to mind: genres, cast, membership in subscriptions, text description, etc.;
throw external interactions into the mix to reduce the sparsity of the matrix. Yes, the competition rules did not prohibit the use of external data. Of course, there was no hope of a match with open datasets, but nobody forbade parsing Russian portals.
Seemingly logical motivation which, I expected, would become the killer feature of my solution.
First of all, I decided to parse the Okko website and pull out all the films along with their properties (rating, duration, age restrictions and so on). Well, "parse" is a strong word - everything turned out to be quite simple; in this case you could just use the API:
By going to the catalog and choosing a specific genre or subscription, you only had to open any of its elements. In response to the request above, the entire array of films in that genre/subscription with all their attributes falls out. Very convenient :)
This is what the attributes of one element looked like in the response:
It remained to go through all the genres, parse the JSON, and then resolve duplicates, since one film can belong to several genres/subscriptions.
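Roughly, the collection loop looked like the sketch below; the endpoint URL and the JSON field names here are placeholders, not the real Okko API:

import requests

CATALOG_URL = 'https://example.com/api/catalog/{slug}'  # placeholder for the real endpoint
slugs = ['comedy', 'drama', 'thriller']                 # hypothetical genre/subscription slugs

films = {}
for slug in slugs:
    payload = requests.get(CATALOG_URL.format(slug=slug)).json()
    for element in payload.get('items', []):   # the field name is a guess
        films[element['id']] = element         # keyed by id, so duplicates resolve themselves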
Yes, here I got lucky and saved a lot of time. If this is not your case and you need to parse HTML content, there are articles on Habr that can help, for example, here.
"It's in the bag," I thought, "all that's left is to match them." "The bag is empty," I realized the next day: the data did not match at all. More on that below.
Firstly, the catalog sizes differed significantly: 10,200 in the dataset versus 8,870 collected from the site. This follows from the historical nature of the dataset: I downloaded only what is on the site now, while the competition data is from 2018, and some of the films have since become unavailable. Oops.
Secondly, among the potential attributes for matching, I had a persistent intuition only about the following:
feature5 - age restriction. It was easy enough to figure out. The cardinality of the attribute is 5 unique float values plus "-1". Among the collected data, the "ageAccessType" attribute was found, with cardinality exactly 5. The mapping looked as follows:
feature2 - a transformed film rating from KinoPoisk. Initially, at the EDA stage, the idea that we are dealing with a rating was suggested by the parameter's correlation with the total number of views. Later, the presence of the "kinopoiskRating" parameter in the site data confirmed the belief that this rating comes from KinoPoisk.
One step closer to a match! Now it remained to find the reverse transformation for the anonymized feature2 parameter.
Here's what the distribution of values looks like in feature2 :
And this is the distribution of the kinopoiskRating parameter values:
When I showed these pictures to my colleague Sasha, he immediately saw that this was a power of three. Mathematicians don't hold three in high esteem, but the number Pi is quite another matter. As a result, it turned out like this:
That seems to be it, but not quite. The distributions look identical, but the nominal values and counts still don't line up. If we knew a certain number of matched examples, we would only have to fit a linear function to find the scaling factor. But we didn't have any.
"Fit", by the way, is not the most suitable word here. We need a solution with an error of almost zero: the collected data has a precision of two decimal places, and if you consider that there are many films with a rating of 6.xx, including films with identical ratings, then precision is worth fighting for.
What else can you try? You can rely on the minimum and maximum values and use MinMaxScaler, but the unreliability of this method immediately raises doubts. Let me remind you that the numbers of films did not match in the first place: our dataset is historical, while the site reflects the current state. That is, there is no guarantee that the films with the minimum and maximum ratings are the same in both sets (it turned out they had different age restrictions, and their durations did not match at all), nor any understanding of how often Okko refreshes the daily-changing KinoPoisk rating in its API.
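For the record, the rescaling attempt looked roughly like this; the power-of-pi transform is the guess described above, and the min/max anchoring is exactly the unreliable step:

import numpy as np

# kinopoisk_ratings: ratings parsed from the site; feature2: anonymized values from the dataset
transformed = np.power(kinopoisk_ratings, np.pi)   # our guess at the forward transform

# anchor on min/max - unreliable, since the two catalogs only partially overlap
a = (feature2.max() - feature2.min()) / (transformed.max() - transformed.min())
b = feature2.min() - a * transformed.min()
restored = a * transformed + b

# sanity check: compare the shapes of the two distributions via quantiles
qs = np.linspace(0.01, 0.99, 99)
print(np.corrcoef(np.quantile(restored, qs), np.quantile(feature2, qs))[0, 1])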
So it became clear that I needed more attribute candidates for matching.
What else is interesting?
feature1 - some kind of date. In theory, the organizers promised that relations between dates were preserved, which implied a linear transformation. In general, it should have been the same transformation as for the ts attribute in the interaction matrices. But if you look at the distribution of films by feature_1 ...
... then the hypothesis that this is the film's release date is immediately swept away. Intuition suggests that the number of films produced by the industry should grow over time. This is confirmed below.
There are 14 attributes in the data collected from the site. Unfortunately, values were actually present only for the following:
None of the examples above resembled feature_1. Well, I had no more ideas for matching, and it seems all the fuss with this task was in vain. Of course, I had heard of contests where people labeled data by hand, but not when it comes to hundreds or thousands of items.
Solution
1. Simple models
Realizing that things were not so simple, I humbly started working with what I had. To begin with, I wanted to start simple. I tried several solutions often used in the industry, namely collaborative filtering (in our case, item-based) and matrix factorization.
The community makes extensive use of a couple of Python libraries suitable for these tasks: implicit and LightFM. The first can do ALS-based factorization as well as nearest-neighbour collaborative filtering with several options for preprocessing the item-item matrix. The second has two distinctive features:
Factorization is based on SGD, which makes it possible to use sampling-based loss functions, including WARP .
It uses a hybrid approach, combining user and item attributes with the model in such a way that the latent vector of a user is the sum of the latent vectors of their attributes, and similarly for items. This approach becomes extremely convenient when there is a cold-start problem for users or items.
We had no user attributes (apart from the possibility of deriving them from interactions with films), so I used only item attributes.
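For reference, one such configuration looked roughly like the sketch below (hyperparameters are illustrative, not the best ones from the table):

import numpy as np
from lightfm import LightFM

# interactions: scipy.sparse user x item matrix with binary values
# item_features: scipy.sparse item x feature matrix built from the film attributes
model = LightFM(no_components=64, loss='warp', learning_rate=0.05, random_state=42)
model.fit(interactions, item_features=item_features, epochs=30, num_threads=8)

# scores for one user over all items; already-watched films are filtered out afterwards
user_id = 0
scores = model.predict(user_id, np.arange(interactions.shape[1]), item_features=item_features)
top20 = np.argsort(-scores)[:20]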
In total, 6 configurations went into the hyperparameter search. A combination of the three matrices, with ratings converted to binary values, was used as the interaction matrix. Comparative results with the best hyperparameters for each configuration are in the table below.
Model                                         Test MNAP@20
Implicit ALS                                  0.02646
Implicit Cosine kNN CF                        0.03170
Implicit TFIDF kNN CF                         0.03113
LightFM (without item features), BPR loss     0.02567
LightFM (without item features), WARP loss    0.02632
LightFM with item features, WARP loss         0.02635
As you can see, classical collaborative filtering proved noticeably better than the other models. Not perfect, but not much is required of a baseline. A submission with this configuration gave 0.03048 on the public leaderboard. I don't remember the position at the time, but at the moment the competition closed this submission would definitely have landed in the top 80 and secured a bronze medal.
2. Hello Boosting
What could be better than one model? Correct: several models.
So the next option was an ensemble or, in the context of recommendations, a second-level ranking model. As the approach I took this article by the folks from Avito. I seemed to be cooking strictly by the recipe, stirring occasionally and seasoning with film attributes. The only deviation was the number of candidates: I took the top 200 from LightFM, because with 500,000 users more simply did not fit into memory.
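Schematically, the second level was built like this; a sketch of the general recipe rather than my exact code, with an illustrative feature set:

import lightgbm as lgb

# candidates: DataFrame with a row per (user_uid, element_uid) pair, where element_uid is one of
# the user's top-200 items from LightFM; 'lfm_score' is the first-level score, the rest are film
# attributes and user statistics; 'target' is 1 if the film was watched in the validation period.
feature_cols = ['lfm_score', 'feature_1', 'feature_2', 'user_n_transactions']  # illustrative

candidates = candidates.sort_values('user_uid')
group_sizes = candidates.groupby('user_uid').size().values  # LightGBM needs per-user group sizes

ranker = lgb.LGBMRanker(objective='lambdarank', n_estimators=500, learning_rate=0.05)
ranker.fit(candidates[feature_cols], candidates['target'], group=group_sizes)

# at inference: score each test user's top-200 candidates and keep the 20 best
candidates['rank_score'] = ranker.predict(candidates[feature_cols])
top20 = (candidates.sort_values(['user_uid', 'rank_score'], ascending=[True, False])
                   .groupby('user_uid').head(20))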
As a result, the score I got on validation was worse than with a single model.
After several days of experimenting, it dawned on me that nothing was working and nothing would start working on its own. Or I just didn't know how to cook it (spoiler: the second answer is the correct one). What was I doing wrong? Two reasons came to mind:
On the one hand, taking the top 200 from the first-level model makes sense for generating "hard negative" samples, i.e. films that are also relevant to the user but were not watched. On the other hand, some of these films may be watched during the test period, yet we present these examples as negatives. I then decided to reduce this risk and recheck the hypothesis with the following experiment: for the training sample I took all the positive examples plus random negatives. The score on the test sample did not improve. It should be clarified that for the test the sampling still included the top predictions from the first-level model, because on the leaderboard nobody will tell me all the positive examples.
Of the 10,200 films in the catalog, only 8,296 appeared in any interactions. Nearly 2,000 films received no user attention at all, partly because they were not available for purchase, rental, or within a subscription. People in the chat asked whether unavailable films could become available during the test period. The answer was yes, so it was definitely impossible to throw them out. Thus, I assumed that almost 2,000 more films would become available in the next 2 months. Otherwise, why put them into the dataset?
3. Neural networks
The previous section raised a question: how can we work with films that have no interactions at all? Yes, we recall item features in LightFM, but as we remember, they did not take off. What else? Neural networks!
The open-source arsenal has a couple of fairly popular high-level libraries for recommender systems: Spotlight by Maciej Kula (the author of LightFM) and TensorRec. The first has PyTorch under the hood, the second TensorFlow.
Spotlight can factorize implicit/explicit datasets with neural networks and model sequences. At the same time, its factorization has no out-of-the-box way to add user/item features, so we drop it.
TensorRec, by contrast, only does factorization, but it is an interesting framework:
representation graph - a way of transforming the input data into embeddings (it can be set separately for users and items), on which the computations in the prediction graph will be based. The choice consists of layers with different activation options; it is also possible to take an abstract class and plug in a custom transformation made of a sequence of Keras layers.
prediction graph allows you to select the operation at the end: your favorite dot product, euclidean and cosine distance.
loss - there is also plenty to choose from here. I was pleased to see an implementation of WMRB (essentially the same WARP, but it can be trained in batches and distributed).
Most importantly, TensorRec is able to work with contextual features, and indeed the author admits that he was originally inspired by the idea of LightFM. Well, let's see. We take interactions (only transactions) and item features.
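A sketch of what such a configuration looks like, assuming the tensorrec API as I remember it (class names may differ between versions):

import tensorrec

# interactions: sparse user x item matrix (transactions only)
# user_features: identity matrix, since we have no user attributes
# item_features: sparse item x feature matrix built from the film attributes
model = tensorrec.TensorRec(
    n_components=64,
    item_repr_graph=tensorrec.representation_graphs.ReLURepresentationGraph(),
    prediction_graph=tensorrec.prediction_graphs.DotProductPredictionGraph(),
    loss_graph=tensorrec.loss_graphs.WMRBLossGraph(),
)
model.fit(interactions, user_features, item_features, epochs=10, verbose=True)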
We send it off to search over various configurations and wait. Compared to LightFM, training and validation take a long time.
In addition, there were a couple of inconveniences I had to deal with:
Changing the verbose flag of the fit method did not change anything, and no callbacks are provided for you. I had to write a function that internally trained one epoch via the fit_partial method and then ran validation on train and test (in both cases on subsamples, to speed up the process).
Overall, the author of the framework did a great job and uses tf.SparseTensor everywhere. However, keep in mind that the predictions, including those for validation, come back as a dense vector of length n_items for each user. Two tips follow: generate predictions in a loop over user batches (the library method has no such parameter) with top-k filtering, and stock up on RAM; a rough sketch of both workarounds is below.
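The workaround for both points, again as a sketch under the same API assumptions:

import numpy as np

TOP_K, N_EPOCHS, BATCH_SIZE = 20, 10, 1000

for epoch in range(N_EPOCHS):
    model.fit_partial(interactions, user_features, item_features, epochs=1)
    # run validation on subsamples of train/test users here to keep it fast

# predictions come back dense (users_in_batch x n_items), so score users in batches
# and immediately keep only the top-k items to avoid blowing up RAM
top_items = {}
for start in range(0, user_features.shape[0], BATCH_SIZE):
    batch = user_features[start:start + BATCH_SIZE]
    scores = model.predict(batch, item_features)
    part = np.argpartition(-scores, TOP_K, axis=1)[:, :TOP_K]
    order = np.argsort(np.take_along_axis(-scores, part, axis=1), axis=1)
    for i in range(batch.shape[0]):
        top_items[start + i] = part[i, order[i]].tolist()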
Ultimately, with the best configuration I managed to squeeze out 0.02869 on my test sample. The leaderboard showed something similar.
Well, what was I hoping for? That adding non-linearity on top of item features would double the metric? Naive.
4. Back to the task
So wait a moment. It seems I again got carried away juggling neural networks. What hypothesis did I want to test when I got into this? The hypothesis was the following: "In the next 2 months of the held-out leaderboard period there will be almost 2,000 new films, and some of them will take a large share of the views."
So you can check it in 2 steps:
It would be good to see how many new films appear in our honestly split test period compared to train. If we take only views, there are just 240 (!) "new" films. The hypothesis immediately wobbled: it seems unlikely that purchases of new content can differ so dramatically from one period to the next.
We finish it off. We can train a model that uses only the representation based on item features (in LightFM, for example, this happens by default if we do not append the identity matrix to the attribute matrix). Then, for inference, we feed this model only (!) our unavailable, never-watched films. We submit these results and get 0.0000136.
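The check itself, sketched with LightFM; here the item feature matrix deliberately has no identity block, so the embeddings of never-watched films come purely from their attributes:

import numpy as np
from lightfm import LightFM

# item_features: item x attribute matrix WITHOUT the identity part
model = LightFM(no_components=64, loss='warp')
model.fit(interactions, item_features=item_features, epochs=30, num_threads=8)

n_items = interactions.shape[1]
cold_items = np.array(sorted(set(range(n_items)) - set(interactions.nonzero()[1])))

user_id = 0
scores = model.predict(user_id, cold_items, item_features=item_features)
top20_cold = cold_items[np.argsort(-scores)[:20]]  # submit only films like these and look at the score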
Bingo! This means that you can stop squeezing semantics out of movie attributes. By the way, later, at DataFest, guys from OKKO said that most of the inaccessible content was just some old movies.
Time to try something new, and the new is the well-forgotten old. In our case not quite forgotten, but what we did a couple of days ago. Collaborative filtering?
5. Tuning the baseline
How can I help the CF baseline?
Idea number 1
On the Internet I found a presentation about using the likelihood ratio test to filter out insignificant films.
Below I leave the Python code for computing the LLR score, which I hacked together to test this idea.
LLR calculation
import numpy as np
from scipy.sparse import csr_matrix
from tqdm import tqdm


class LLR:
    def __init__(self, interaction_matrix, interaction_matrix_2=None):
        interactions, lack_of_interactions = self.make_two_tables(interaction_matrix)
        if interaction_matrix_2 is not None:
            interactions_2, lack_of_interactions_2 = self.make_two_tables(interaction_matrix_2)
        else:
            interactions_2, lack_of_interactions_2 = interactions, lack_of_interactions
        self.num_items = interaction_matrix.shape[1]
        self.llr_matrix = np.zeros((self.num_items, self.num_items))
        # k11 - item-item co-occurrence
        self.k_11 = np.dot(interactions, interactions_2.T)
        # k12 - how many times row elements were bought without column elements
        self.k_12 = np.dot(interactions, lack_of_interactions_2.T)
        # k21 - how many times column elements were bought without row elements
        self.k_21 = np.dot(lack_of_interactions, interactions_2.T)
        # k22 - how many times elements were not bought together
        self.k_22 = np.dot(lack_of_interactions, lack_of_interactions_2.T)

    def make_two_tables(self, interaction_matrix):
        interactions = interaction_matrix
        if type(interactions) == csr_matrix:
            interactions = interactions.todense()
        interactions = np.array(interactions.astype(bool).T)
        lack_of_interactions = ~interactions
        interactions = np.array(interactions, dtype=np.float32)
        lack_of_interactions = np.array(lack_of_interactions, dtype=np.float32)
        return interactions, lack_of_interactions

    def entropy(self, k):
        N = np.sum(k)
        return np.nansum((k / N + (k == 0)) * np.log(k / N))

    def get_LLR(self, item_1, item_2):
        k = np.array([[self.k_11[item_1, item_2], self.k_12[item_1, item_2]],
                      [self.k_21[item_1, item_2], self.k_22[item_1, item_2]]])
        LLR = 2 * np.sum(k) * (self.entropy(k)
                               - self.entropy(np.sum(k, axis=0))
                               - self.entropy(np.sum(k, axis=1)))
        return LLR

    def compute_llr_matrix(self):
        for item_1 in range(self.num_items):
            for item_2 in range(item_1, self.num_items):
                self.llr_matrix[item_1, item_2] = self.get_LLR(item_1, item_2)
                if item_1 != item_2:
                    self.llr_matrix[item_2, item_1] = self.llr_matrix[item_1, item_2]

    def get_top_n(self, n=100, mask=False):
        filtered_matrix = self.llr_matrix.copy()
        for i in tqdm(range(filtered_matrix.shape[0])):
            ind = np.argpartition(filtered_matrix[i], -n)[-n:]
            filtered_matrix[i][[x for x in range(filtered_matrix.shape[0]) if x not in ind]] = 0
        if mask:
            return filtered_matrix != 0
        else:
            return filtered_matrix
The resulting matrix can then be used as a mask to keep only the most significant interactions, and here there are two options: use a threshold, or keep the top-k elements with the highest values. The test can also be used to combine several kinds of influence on a purchase into a single score for ranking items; in other words, it shows how important, for example, adding to bookmarks is for a possible conversion into a purchase. It looks promising, but in practice filtering by the LLR score gave a very modest gain, and combining several scores only made the result worse. Apparently the method is not for this data. On the plus side, while figuring out how to implement the idea I had to dig under the hood of implicit.
An example of applying this custom logic to implicit is below.
Modification of the matrix in implicit
import numpy as np
from scipy.sparse import csr_matrix
from tqdm import tqdm
from implicit.nearest_neighbours import CosineRecommender, NearestNeighboursScorer

# LLR scores: an n_items * n_items matrix.
llr = LLR(train_csr, train_csr)
llr.compute_llr_matrix()
# For every item id, keep only the 500 most significant neighbours as a boolean mask.
llr_based_mask = llr.get_top_n(n=500, mask=True)

# Fit a regular CosineRecommender over the full catalog.
model = CosineRecommender(K=10200)
model.fit(train_csr.T)

# model.similarity holds the co-occurrence (cosine) item-item matrix; apply the mask to it.
masked_matrix = np.array(model.similarity.todense()) * llr_based_mask

# After fit() the scorer is built from model.similarity, so rebuild it from the masked matrix.
model.scorer = NearestNeighboursScorer(csr_matrix(masked_matrix))

# Inference: recommend as usual.
test_predict = {}
for id_ in tqdm(np.unique(test_csr.nonzero()[0])):
    test_predict[id_] = model.recommend(id_, train_csr, filter_already_liked_items=True, N=20)

# recommend() returns tuples (item_id, score); keep only the ids.
test_predict_ids = {k: [x[0] for x in v] for k, v in test_predict.items()}
# Note: digging into the scorer internals is complicated by the fact that they are written in Cython.
Idea number 2
Another idea came up that could improve predictions of simple collaborative filtering. The industry often uses some kind of decay function for old ratings, but our case is not that simple. We probably need to account for two types of users:
users who mostly watch new releases on the service;
"catch-up" viewers, i.e. those who came relatively recently and may watch previously popular films.
Thus, the collaborative component can be split into these two groups automatically. To do that, I came up with a function that, instead of the implicit "1" or "0" at the intersection of a user and a film, puts a value showing how important this film is in the user's viewing history:
where StartTime is the film's release date, ΔWatchTime is the difference between the release date and the date the user watched it, and α and β are hyperparameters.
The first term is responsible for boosting the score of recently released films, the second for accounting for users' affinity for old films. If a film was released long ago and the user bought it right away, then today that fact should not carry much weight. If it is a new release, we should recommend it to more users who also watch new releases. And if the film is quite ancient and the user watched it only now, that can likewise matter for users similar to them.
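I will not reproduce the exact formula here, but qualitatively it behaved like the sketch below; this is my reconstruction of the idea, with exponential terms chosen for illustration rather than the actual function:

import numpy as np

def interaction_weight(start_time, watch_time, now, alpha, beta):
    """Value to put into the interaction matrix instead of a plain 1.

    start_time: the film's release date; watch_time: when the user watched it (same arbitrary units).
    """
    novelty = np.exp(-alpha * (now - start_time))        # high for freshly released films
    delta_watch = watch_time - start_time
    late_interest = 1.0 - np.exp(-beta * delta_watch)    # high when an old film is watched only now
    return novelty + late_interest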
A small matter remains: tuning the coefficients α and β. Leave the search running overnight and the job is done. I had no prior assumptions about the range at all, so it started out wide; below are the results of the local optimum search.
With this idea, a simple model on a single matrix gave a score of 0.03627 on local validation and 0.03685 on the public LB, which immediately looks like a good boost over the previous results. At that point it put me roughly in the top 20.
Idea number 3
The final idea was that old films the user watched a long time ago can be ignored entirely. This kind of filtering in CF is often called pruning. Let's sweep over several values and test the hypothesis:
It turns out that for users with a long history it makes sense to keep only the last 25 interactions.
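The pruning itself is a couple of lines over the transaction log (same assumed column names as before):

MAX_HISTORY = 25  # the best value found on validation

pruned = (train.sort_values('ts')
               .groupby('user_uid')
               .tail(MAX_HISTORY))  # keep only each user's most recent interactions

# the user x item matrix for collaborative filtering is then rebuilt from `pruned`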
In total, by working with the data we reached a score of 0.040312 on the local test sample, and the submission with these parameters gave 0.03870 and 0.03990 on the public/private parts respectively, which secured me 14th place and a T-shirt.
A note on organizing experiments
Running projects in Jupyter notebooks is a thankless job: you constantly get lost in code scattered across several notebooks, and tracking results only in output cells is downright dangerous for reproducibility. So data scientists cope as best they can. My colleagues and I built our own cookiecutter-data-science framework, Ocean, a tool for creating project structure and managing projects. You can read about its advantages in our article. Largely thanks to its clean separation of experiments, I did not lose my mind or get confused while testing hypotheses.
The secret to success (not mine)
As became known right after the private part of the leaderboard opened, almost all the top solutions used simple models at the first level and boosting with additional features at the second. As I understood later, my main blunder in that experiment was the way I sampled data for training the second-level model. I tried this:
Realizing that this was nonsense, I started looking for quite sophisticated methods, like this one:
In the end I found the right way on a slide from the presentation of Evgeny Smirnov, who took 2nd place: split the test part of the first-level model by users. It sounds trivial, but this idea never occurred to me.
Conclusion
Contests are contests. Heavy models really do give more accuracy, especially if you know how to cook them. Will this experience come in handy? Instead of an answer I'll say that after the competition ended, the organizers shared a spoiler - a feature importance plot from their production model - from which it is obvious that they use the same "successful" approach in production. So the same recipe runs in their prod as well.
So for me the experience of participating turned out to be genuinely useful professionally. Thanks for reading the article; questions, suggestions, and comments are welcome.