My name is Vasily, and for a little over three months I have been working as a mathematician at Surfingbird.
The first serious task I faced at the company was the cold start problem. In this article I will describe the essence of the problem and the main directions for solving it.
The problem statement for recommender systems has already been described by Sergei Nikolenko in the article Recommender Systems: Problem Statement.
Most recommender systems are based on so-called collaborative filtering methods, and ours is no exception. Collaborative filtering algorithms rely only on information about the ratings users give and do not analyze the content of resources (in our case, web pages). As a result, these algorithms only work well once a user has accumulated a sufficiently large number of ratings, typically 10-20. The task of producing relevant recommendations for new users and new sites is called the cold start problem.
So, the cold start problem splits into a cold start for users (what do we show new users?) and a cold start for sites (whom do we recommend newly added sites to?). Let's take them in order.
Cold start for users
A cold start for users can rely on the demographic data that users themselves provide during registration. In our task of recommending web pages, we know the user's gender, date of birth, and location; we take these characteristics as the basic ones. More demographic data can be obtained as well: with the help of social network APIs we can find out education level, social status, and other characteristics.
There are two main approaches to applying demographic information about a user in recommendations:
- Expert stereotypes are drawn up for the various demographic categories; that is, an expert decides what to show each category on a cold start. The obvious drawbacks of this approach are the cost of the expert's work and the fact that users are only recommended popular sites subjectively selected by the expert. The volume of expert work also grows significantly as the number of categories increases.
- Demographic categories are determined automatically by identifying clusters of users with similar interests. Recommendations are then based on the ratings given by users from the same category, that is, users of the same age, gender, location, and so on.
The second approach does not require the involvement of experts and provides the ability to create an unlimited number of clusters, so I’ll focus on it in more detail.
To build demographic categories it is natural to use clustering methods. The objects being clustered are users; the attributes (features) of the objects are the user's normalized demographic data: gender, age, location, education, and others. For clustering on demographic data it is natural to use the k-means method, since each cluster is then defined by the point at its center and is, as a consequence, easy to interpret. The distance from an object (user) to a cluster center can, generally speaking, be defined in infinitely many ways; it is customary to use the Euclidean metric in the feature space.
However, in our task we need to take rating data into account, not just demographics. The situation is saved by the SVD topics computed for all sites, which can also be added as object features during clustering. In that case the distance between users is determined both by the similarity of their demographic data and by their ratings.
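To make this concrete, here is a minimal sketch of such clustering with scikit-learn. The feature columns (gender, age, location id, two SVD factors) and all their values are made up for illustration; the real feature set is richer.

```python
# A minimal sketch of clustering users with k-means; feature columns and
# values are invented for illustration, not the production pipeline.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Hypothetical user-feature matrix: gender, age, location_id, svd_1, svd_2.
users = np.array([
    [0, 25, 3,  0.12, -0.40],
    [1, 31, 3,  0.08, -0.35],
    [0, 19, 7, -0.50,  0.22],
    [1, 44, 7, -0.45,  0.30],
], dtype=float)

scaler = StandardScaler().fit(users)
X = scaler.transform(users)  # normalize so age does not dominate the distance

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)           # cluster index of each user
print(kmeans.cluster_centers_)  # interpretable cluster centers

# A new user who filled in demographics is assigned to the nearest center;
# his SVD factors are unknown at registration, so they are set to zero here.
new_user = scaler.transform([[1.0, 22.0, 3.0, 0.0, 0.0]])
print(kmeans.predict(new_user))
```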
Once the user clusters are built, and for each new user who has provided demographic data we know the cluster he belongs to, we can improve cold-start recommendations using group recommendations or filter bots. Let's look at each in more detail.
The most natural method is group recommendations (group recommendations to an individual user), whose name speaks for itself: we greet a new user with the recommendations that most users in his demographic category like.
There are a number of strategies for aggregating the ratings of individual users into a group recommendation. For example, a group rating can be computed as follows:
$GR = \prod_i r_i^{w_i}$,

where $r_i$ is the rating of the $i$-th user and $w_i$ is the weight of the $i$-th user. The product is taken over all users, or over a group selected by some criterion (for example, age and gender).
We set the weights $w_i$ higher for users whose age, gender, and location match the new user's, and lower for everyone else (the values are selected manually).
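As a sanity check, here is a minimal sketch of this formula in Python. Normalizing the weights to sum to one is my assumption (it keeps the result on the original rating scale); the product is computed in log space, so the ratings are assumed positive.

```python
# Weighted geometric mean GR = prod_i r_i^{w_i}, computed in log space.
import numpy as np

def group_rating(ratings, weights):
    r = np.asarray(ratings, dtype=float)   # ratings must be positive
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                        # normalization is an assumption here
    return float(np.exp(np.sum(w * np.log(r))))

# Users matching the new user's age/gender/location get higher weights.
print(group_rating([5, 4, 2], [1.0, 1.0, 0.2]))  # ~4.16
```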
An alternative approach is filter bots, which generate default ratings for a new user. That is, at registration, filter bots automatically generate several ratings for the user based on his demographic data, and these ratings are then used by the collaborative filtering algorithms on a cold start. The advantages of this approach are the ease of implementation and the fact that existing algorithms do not need to be changed.
In addition, filter bots and group recommendations can be combined: the group ratings are then taken as the filter bots' default ratings.
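A minimal sketch of what a filter bot might do at registration; the function name, the ratings store, and the precomputed per-cluster group ratings are all hypothetical interfaces, not the actual system's API.

```python
# Hypothetical filter bot: writes a few default ratings for a new user so
# that standard collaborative filtering has data to work with immediately.
def inject_bot_ratings(new_user_id, cluster_group_ratings, ratings_store, top_n=10):
    # cluster_group_ratings: {site_id: group rating in the user's demographic cluster}
    best = sorted(cluster_group_ratings.items(), key=lambda kv: kv[1], reverse=True)
    for site_id, rating in best[:top_n]:
        ratings_store[(new_user_id, site_id)] = rating  # default "bot" rating

ratings_store = {}
inject_bot_ratings(42, {"site_a": 4.2, "site_b": 3.1, "site_c": 4.8},
                   ratings_store, top_n=2)
print(ratings_store)  # {(42, 'site_c'): 4.8, (42, 'site_a'): 4.2}
```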
Cold start for web pages
To solve the problem of a cold start for new web pages, various methods of analyzing text and other page content (pictures, video, flash, links, etc.) are used.
The main methods of semantic text analysis that I would like to highlight are LDA and relevance feedback.
The general scheme of recommendations based on a page's text content is roughly as follows. First, the useful content of every site is collected (ads, menus, and the like are discarded). The words in the text are preprocessed: stop words are removed and lemmatization is performed. Next, a single vocabulary is compiled along with a table of word occurrences in the texts of web pages (the pages' content profiles). The words are weighted by TF-IDF, and for overly long texts only the top N highest-weighted words are kept.
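A minimal sketch of this pipeline with scikit-learn; for brevity the toy texts are in English and already lemmatized, so only TF-IDF weighting and top-N selection are shown.

```python
# Build TF-IDF content profiles and keep the top-N words of each page.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

pages = [
    "recommender system predict user rating web page",
    "photo gallery cat picture funny cat",
    "recommender algorithm collaborative filtering rating",
]

vectorizer = TfidfVectorizer(stop_words="english")  # drops English stop words
profiles = vectorizer.fit_transform(pages)          # pages x vocabulary matrix
vocab = np.array(vectorizer.get_feature_names_out())

N = 3  # for long texts, keep only the N highest-weighted words
for row in profiles.toarray():
    top = np.argsort(row)[::-1][:N]
    print([(vocab[i], round(row[i], 2)) for i in top if row[i] > 0])
```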
The relevance feedback algorithm compiles a tag profile (a set of keywords) for each user based on the content profiles of the pages that the user likes. New sites are then recommended to the users whose tag profiles correlate most strongly with the content profile of the newly added page.
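A minimal sketch of this matching step, with made-up TF-IDF vectors standing in for the content profiles built above.

```python
# Relevance feedback sketch: aggregate liked pages into a tag profile and
# score a new page by cosine similarity. Vectors here are toy values.
import numpy as np

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

# Toy TF-IDF content profiles over a 4-word vocabulary.
liked_pages = np.array([
    [0.9, 0.0, 0.4, 0.0],
    [0.7, 0.1, 0.6, 0.0],
])
user_tag_profile = liked_pages.sum(axis=0)  # the user's aggregated keywords

new_page = np.array([0.8, 0.0, 0.5, 0.1])  # profile of a newly added page
print(cosine(user_tag_profile, new_page))  # high score -> recommend to this user
```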
The LDA algorithm works differently. By training a probabilistic model, the words of the vocabulary are grouped into a fixed number of topics (for example, 100). For each web page a probability distribution over the topics is built, that is, a vector of 100 features, each of which characterizes how strongly the page corresponds to that topic. To predict the probability of a like, a logistic regression is trained for each user on the LDA features of the pages the user has viewed, so that each user also receives a weight vector over all the LDA topics. When a new site is added, its LDA features are computed first, and then the site is recommended to the users with the highest predicted like probability, which is easily calculated from the known features of the user and the web page.
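A minimal sketch of this chain with scikit-learn, using toy word counts, two topics instead of 100, and invented like/dislike labels for a single user.

```python
# LDA features for pages, then per-user logistic regression on those features.
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.linear_model import LogisticRegression

counts = np.array([        # pages x vocabulary word-count matrix
    [5, 0, 1, 0],
    [0, 4, 0, 3],
    [4, 1, 0, 0],
    [0, 3, 1, 4],
])
lda = LatentDirichletAllocation(n_components=2, random_state=0)
topics = lda.fit_transform(counts)   # per-page distribution over topics

likes = np.array([1, 0, 1, 0])       # one user's reactions to these pages
clf = LogisticRegression().fit(topics, likes)  # user's weights over topics

new_page = lda.transform([[3, 0, 2, 0]])  # LDA features of a newly added site
print(clf.predict_proba(new_page)[0, 1])  # predicted like probability
```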
Each of the algorithms mentioned in this chain deserves a separate article with practical examples, but that is for the future...