
The cold-start problem in news feed personalization

Today we would like to talk about our research on news feed personalization as part of the favoraim project. The idea of showing the user only those news items (hereinafter, entries) that will interest him is not new and quite natural, and there are well-established, proven models for solving this problem.

The principle behind these algorithms is similar: we analyze the user's reaction (feedback) to previous entries and try to predict his reaction to current events. If the predicted reaction is "positive", the event gets into the feed; if "negative", it does not. We will discuss later what a "positive" and a "negative" reaction means, but for now let's consider what kinds of feedback we can get from the user:

- an explicit rating on a numeric scale (for example, from 1 to 5);
- a binary "like"/"dislike" rating (+1/-1);
- a view with no explicit rating at all (positive-only feedback).
For the first and second cases we can predict the rating a user would give an event he has not yet rated, based on the ratings other users gave the same event. For this we can use a correlation model (for example, k-nearest neighbors (kNN), where we predict the rating of an entry from the ratings that "similar" users gave the same entry), or, preferably, a latent model based on matrix factorization (for example, SVD or SVD++). In the third case, when we only have data about viewed entries, we assume that the user views interesting entries much more often than uninteresting ones and treat a view as a positive rating. Here we have only positive feedback (positive-only feedback), and it is better to use a different model, for example Bayesian personalized ranking (BPR). There is no point in describing these models here; there is plenty of material on the topic online.
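To make the latent approach concrete, here is a minimal sketch of an SVD-like matrix factorization model trained with stochastic gradient descent. The ratings, matrix sizes, and hyperparameters are all illustrative, not taken from the actual favoraim system:

```python
import numpy as np

# Toy ratings as (user, item, rating) triples; real data would come from the feed.
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0),
           (1, 2, 1.0), (2, 1, 2.0), (2, 2, 5.0)]
n_users, n_items, n_factors = 3, 3, 2

rng = np.random.default_rng(0)
P = rng.normal(scale=0.1, size=(n_users, n_factors))  # user latent factors
Q = rng.normal(scale=0.1, size=(n_items, n_factors))  # item latent factors

lr, reg = 0.01, 0.05  # learning rate and regularization, chosen arbitrarily
for epoch in range(200):
    for u, i, r in ratings:
        err = r - P[u] @ Q[i]                  # prediction error on this rating
        P[u] += lr * (err * Q[i] - reg * P[u])  # SGD step on user factors
        Q[i] += lr * (err * P[u] - reg * Q[i])  # SGD step on item factors

# Predicted rating of item 2 by user 0, who has not rated it yet.
print(P[0] @ Q[2])
```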
Predicted ratings can be used both for ranking entries and as a filter (if the event is "interesting" we show it, if "not interesting" we do not). To filter, we need some threshold value above which we consider a rating positive. Here you can invent as many heuristics as you like: for the first case from our list of reactions, the user's average rating suggests itself; for the second, 0. If we want a more principled threshold, we can use ROC analysis.
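One common way to pick a threshold from a ROC curve is to maximize Youden's J statistic (TPR minus FPR). A small sketch with made-up labels and predicted scores:

```python
import numpy as np
from sklearn.metrics import roc_curve

# y_true: whether the user actually found the entry interesting (toy values);
# y_score: ratings predicted by the latent model (toy values).
y_true  = np.array([1, 1, 0, 1, 0, 0, 1, 0])
y_score = np.array([4.2, 3.9, 2.1, 3.5, 2.8, 1.5, 4.8, 3.0])

fpr, tpr, thresholds = roc_curve(y_true, y_score)
best = np.argmax(tpr - fpr)            # Youden's J statistic
print("threshold:", thresholds[best])  # show entries scoring above this value
```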

And now the problem: in order to predict a user's rating for an entry, we need someone to have already rated that entry, and preferably many people. In practice we will constantly face entries that have no ratings at all; the obvious example is an entry that has just been added. This is the so-called cold-start problem, and we will discuss it in more detail.

Obviously, to solve this problem we have to stop treating the event as a black box, as the previous approach did, and extract some characteristics of the entry itself. As such characteristics we will use the "topics" of the entry. We proceed from the assumption that the user rates an entry because he liked or disliked one or several of its topics. How we extract the topics of an entry was described in the previous article (Automatic placement of search tags). This model is a rather rough approximation, and you can come up with any number of entries with identical topics that provoke different reactions from the same user, but the resulting error is relatively small.

If the user has shown any activity in the system, we will already have quite a lot of information about his attitude to any given topic (views and ratings of entries with this topic, search queries containing it, and so on). So we have some information about a particular user's attitude to the topics of a particular entry, and we need to predict whether that entry will be "interesting" to him.

Let's digress a little and consider an example from biology: a simplified model of a nerve cell (Fig. 1). It works as follows: electrical impulses arrive from the external environment through the dendrites into the nucleus of the cell. An impulse can be either positive or negative, and different dendrites have different throughput, i.e. an impulse passing through different dendrites differs in absolute value. The nucleus of the neuron accumulates charge, and when the accumulated charge becomes large enough, the cell generates an impulse that is transmitted along the axon. This behavior is described by the McCulloch-Pitts model (Fig. 2).

[Fig. 1: a simplified model of a nerve cell]

[Fig. 2: the McCulloch-Pitts model]

Now let's apply this model to our problem. To the inputs of the neuron we feed data about the user's attitude to the event's topics (his ratings of other events with these topics, the number of his search queries with these topics, the number of events with these topics he has viewed, and so on). We then compute the weighted sum $S = \sum_i w_i x_i$ and, if $S > 0$, consider the event "interesting"; otherwise, "uninteresting". Such a model is also called a linear classifier.
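In code the classifier is essentially a one-liner. The feature values and weights below are hypothetical, just to show the shape of the computation:

```python
import numpy as np

def classify(x, w):
    """McCulloch-Pitts style linear classifier: 'interesting' if S > 0."""
    s = np.dot(w, x)  # accumulated "charge" S = <w, x>
    return 1 if s > 0 else -1

# Hypothetical features of one entry's topics for a given user: average rating
# of other entries with these topics, number of matching search queries, views.
x = np.array([0.7, 3.0, 12.0])
w = np.array([1.5, 0.2, 0.05])  # weights still to be learned (see below)
print(classify(x, w))
```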

All that remains is to determine the weights $w_i$; this process is called training the classifier. The principle is as follows: we take the entries that users have already rated (the training set), compute the feature values $x_i$ for each of them, and use our classifier to predict the already-known rating of the entry. The task is to find the weights $w_i$ that give the smallest number of errors on the training set. This is solved by any known optimization method. Most sources recommend stochastic gradient descent (SGD) here, but applying it naively will not work for this task. There is a set of heuristics that do make the method effective for this problem, but in our case we used genetic algorithms, which seemed preferable on relatively small training samples.
Genetic algorithms model the process of natural selection to solve an optimization problem. They work as follows: a population of "individuals" is created whose genotype encodes the values of the variables (in our case, the weights $w_i$). Individuals in the population interbreed and undergo mutations, and those individuals that solve the task better have a greater chance of "surviving", i.e. remaining in the population. This method has, among others, the following advantages:

- it does not require the objective function to be differentiable, or even continuous;
- it is less prone to getting stuck in local minima than gradient methods;
- it is easy to parallelize.
The main disadvantage is speed: on very large amounts of data, stochastic gradient descent works incomparably faster. But it all depends on the task.

We now turn to the mathematical part. From a training set $X^\ell = (x_i, y_i)_{i=1}^{\ell}$ we need to build a classification algorithm $a(x, w) = \mathrm{sign}\, f(x, w)$ with two classes $Y = \{-1, +1\}$, where $f(x, w)$ is the discriminant function and $w$ is the parameter vector; $f(x, w) = 0$ defines the separating surface. In our linear case the function has the form $f(x, w) = \langle x, w \rangle$. The value $M_i(w) = y_i f(x_i, w)$ is the margin of the object $x_i$; if $M_i(w) < 0$, the algorithm $a$ makes a mistake on $x_i$.

At this point one is tempted to drop everything and start training the classifier by minimizing the number of errors on the training set, $Q(w) = \sum_{i=1}^{\ell} [M_i(w) < 0]$. However, this would not be entirely correct: some objects may end up very close to the separating surface without affecting this functional, since they are still classified correctly on the training set, yet a small change in $x$ can flip the class of such an object, so on a control sample we would most likely not get the best results. On top of that, we would have to minimize a discrete function, which is a very dubious pleasure. To solve this problem we introduce a non-increasing, non-negative loss function $\mathcal{L}(M)$ and minimize the functional $Q(w) = \sum_{i=1}^{\ell} \mathcal{L}(M_i(w))$, thereby encouraging large margins, whose magnitude can be viewed as the confidence of the classification. Most often the logarithmic loss $\mathcal{L}(M) = \log(1 + e^{-M})$ is used, but the choice is not limited to it.

Along with replacing the loss function, it is recommended to add to the functional $Q$ a penalty term that prohibits too large values of the norm of the weight vector, $\frac{\tau}{2}\lVert w \rVert^2$. Adding this regularization term reduces the risk of overfitting and makes the weight vector more stable with respect to small changes in the training set. The regularization parameter $\tau$ is chosen either from a priori considerations or by cross-validation (sliding control).
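Putting the pieces together, the functional we actually minimize is:

$$Q(w) = \sum_{i=1}^{\ell} \log\bigl(1 + e^{-y_i \langle x_i, w \rangle}\bigr) + \frac{\tau}{2}\lVert w \rVert^2 \;\to\; \min_w .$$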

Optimizing this functional with a genetic algorithm is fairly straightforward. For the first experiments with the data, the evoj library is quite suitable; later on, it is better to use a co-evolutionary algorithm to escape local minima more effectively.
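evoj is a Java library; for illustration, here is a minimal self-contained sketch of the same idea in Python, minimizing the regularized log-loss above. The training set is synthetic, and the GA settings (population size, mutation scale, number of generations) are made up:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic training set: feature vectors (user-topic statistics) and labels +/-1.
X = rng.normal(size=(40, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = np.sign(X @ true_w)

TAU = 0.1  # regularization parameter; chosen by cross-validation in practice

def fitness(w):
    """Regularized log-loss Q(w); lower is better."""
    margins = y * (X @ w)
    return np.sum(np.log1p(np.exp(-margins))) + TAU / 2 * np.dot(w, w)

pop = rng.normal(size=(50, 3))  # initial population of weight vectors
for generation in range(200):
    # Crossover: average two random parents; mutation: add Gaussian noise.
    idx = rng.integers(0, len(pop), size=(25, 2))
    children = pop[idx].mean(axis=1) + rng.normal(scale=0.1, size=(25, 3))
    pop = np.vstack([pop, children])
    # Selection: keep the 50 individuals with the lowest loss.
    pop = pop[np.argsort([fitness(w) for w in pop])][:50]

print("best weights:", pop[0], "loss:", fitness(pop[0]))
```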

In conclusion: this method can be used in conjunction with any latent model by passing the predicted rating of an entry as one more input of the linear classifier.

Source: https://habr.com/ru/post/227701/

