Hello! My name is Sergey, I am a mathematician, and I lead the development of the Surfingbird recommender system. With this article we open a series devoted to machine learning, and to recommender systems in particular. I don't know yet how many installments the series will have, but I will try to write them regularly. Today I will explain what recommender systems are in general and state the problem a little more formally; in the following installments we will start talking about how to solve it and how our recommender system, Tachikoma, learns.

Recommender systems are models that know better than you do what you want. Recently I heard a telling anecdote: as you know, supermarket chains usually try to predict what to advertise to you (this is one example of a recommender system); in particular, a supermarket may try to detect from a change in purchasing habits that a woman has become pregnant, and start making use of that. So, they say that an infuriated father once burst into a supermarket's office because his schoolgirl daughter had begun receiving advertisements for diapers and children's clothes by mail; the manager had to apologize at length and explain that all recommender models are probabilistic, and that mistakes are quite possible. A couple of months later the father came back and apologized himself: it turned out that he knew far from everything about his own daughter...
Collaborative filtering systems are models that try to predict how much you will like a particular product, taking as input how you and other users have rated this and other products in the past. Collaborative filtering is the most popular type of recommender system today. In Surfingbird, for example, your ratings come from the like and dislike buttons (and also from the fact that you viewed a page and gave it no rating, but more on that later). The more data we have about your preferences, the more interesting pages we can recommend to you!
Here are some other examples of well-known recommender systems.
- Amazon is one of the leaders in this field; Amazon recommends books and other products to you based on what you bought, what you viewed, what ratings you gave, what reviews you left... Yes, as is usually the case, Big Brother collects everything, even if it is not yet clear how to use it.
- Netflix - this company is little known in Russia, and it does not operate there, but it was Netflix that made the biggest splash in the scientific community when it announced the famous Netflix Prize, promising $1M to whoever improved the quality of its prediction algorithm by 10% (we will talk about the Netflix Prize and the lessons learned from it in one of the following installments). The core business of Netflix is film rental; the company has now switched to streaming video, but for the first ten years of its life it sent physical DVDs by mail, which had to be sent back before you could get the next one (the money was charged as a subscription). Of course, it is hard for a Russian to understand how one could pay money to download a movie, or not even download it but just watch it online. Still, the model turned out to be very successful, and the data set published for the Netflix Prize was for several years the main benchmark for collaborative filtering systems (Netflix has since removed it from open access because of possible de-anonymization, and the Yahoo! KDD Cup Dataset has replaced it).
- Last.fm and Pandora recommend music. They follow different recommendation strategies: Last.fm uses, besides the ratings of other users, exclusively "external" data about the music (the artist, style, date, tags, etc.), while Pandora relies on the "content" of the musical composition, using a very interesting idea, the Music Genome Project, in which professional musicians analyze each composition according to several hundred attributes (unfortunately, Pandora is currently unavailable in Russia). True, nobody yet knows how to analyze songs well automatically, and this is another interesting application of machine learning...
- Google, Yahoo!, Yandex - can we say that they also recommend sites to users? Formally, yes, but in reality these are different systems: search engines try to predict how relevant a given document is to a given query, while recommenders try to predict what rating a given user will give to a given product. Of course, the success of search engines owes a great deal to models based on user data (click logs), and, of course, search results are often personalized, but the task is still somewhat different. Somewhat closer to our task is the problem of which advertisements to show to the user (AdSense, Yandex.Direct, etc.): here users really do "vote with their feet" for advertisements, and you need to "recommend" those that are likely to cause a positive reaction. But the matter is complicated by the economic side of the issue (advertisers pay money, and an auction has to be arranged among them for the right to place an advertisement), so we will not consider these tasks now either. However, the leading search engines have many side projects based on recommender systems - for example, we have already mentioned Yahoo! Music.
But let us get back to the matter at hand. Imagine that we have a set of users and a set of products (for Surfingbird these are web pages, for Netflix movies, for Last.fm tracks), and some users have somehow rated some products. Formally speaking, the data consists of triples of the form $(i, a, r_{i,a})$, where $i$ denotes a user, $a$ is a product, and $r_{i,a}$ is the rating that user $i$ assigned to product $a$.
You can picture this data as a matrix in which each row corresponds to a user and each column to a product. Our task is to predict the unknown elements of the matrix; more precisely, to predict which of the unknown elements will be the largest in their row, that is, which products a given user will like most.
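As a minimal sketch of this representation, the triples can be stored as a dictionary of rows in which unknown entries are simply absent. The toy data and the names `build_matrix`, `top_unrated`, and `mean_rating` are illustrative assumptions, not part of Surfingbird; `mean_rating` is only a placeholder predictor that a real model would replace.

```python
from collections import defaultdict

# Toy (user, product, rating) triples; 1 = like, -1 = dislike (invented data).
ratings = [
    ("alice", "page1", 1),
    ("alice", "page2", -1),
    ("bob", "page1", 1),
    ("bob", "page3", 1),
    ("carol", "page2", 1),
    ("dave", "page3", -1),
]

def build_matrix(triples):
    """Store the rating matrix as {user: {product: rating}}; unknown cells are absent."""
    matrix = defaultdict(dict)
    for user, product, rating in triples:
        matrix[user][product] = rating
    return dict(matrix)

def top_unrated(matrix, user, products, predict, k=1):
    """Return the k products the user has not rated with the highest predicted rating."""
    candidates = [p for p in products if p not in matrix.get(user, {})]
    return sorted(candidates, key=lambda p: predict(user, p), reverse=True)[:k]

matrix = build_matrix(ratings)
products = sorted({p for _, p, _ in ratings})

def mean_rating(user, product):
    """Placeholder predictor: the product's average rating over all users (ignores the user)."""
    values = [row[product] for row in matrix.values() if product in row]
    return sum(values) / len(values) if values else 0.0

print(top_unrated(matrix, "carol", products, mean_rating))
```

Here the "recommendation" for a user is just the unrated column with the largest predicted value in that user's row; everything interesting lives inside the predictor.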
Collaborative filtering systems have several common problems that any model has to solve in one way or another.
- The rating matrix is usually very sparse: there are typically a great many users and products, and far fewer ratings than the product of those two numbers, because the average user rates very few products; the remaining elements of the matrix are unknown to us, and they are precisely what must be predicted.
- The cold start problem. For users: when a new user arrives who has no ratings yet, what should we do with them? If there are no ratings at all, that is not so bad: we can simply recommend the most popular products. But what if the user has already rated something, but so far very little? For products: how many ratings does a new product need before we can recommend it with confidence? And where will those ratings come from if we do not recommend it to anyone?
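Both problems can be illustrated on toy data. The sketch below, with invented ratings and the hypothetical helper names `density` and `most_popular`, measures what fraction of the matrix is filled in and implements the simplest cold-start fallback mentioned above: recommending the most popular products. Counting likes as the measure of popularity is an assumption made for this example.

```python
from collections import Counter

# Toy (user, product, rating) triples; 1 = like, -1 = dislike (invented data).
ratings = [
    ("alice", "page1", 1),
    ("alice", "page2", -1),
    ("bob", "page1", 1),
    ("bob", "page3", 1),
    ("carol", "page1", 1),
    ("carol", "page3", -1),
]

def density(triples):
    """Fraction of the user-by-product matrix that is actually filled in."""
    users = {u for u, _, _ in triples}
    products = {p for _, p, _ in triples}
    known = {(u, p) for u, p, _ in triples}
    return len(known) / (len(users) * len(products))

def most_popular(triples, k=2):
    """Cold-start fallback: products ordered by number of likes."""
    likes = Counter(p for _, p, r in triples if r > 0)
    return [p for p, _ in likes.most_common(k)]

print(density(ratings))
print(most_popular(ratings, k=1))
```

In a real system the density would be far lower than in this toy example, and a popularity fallback only helps a user with no ratings at all; it does nothing for a new product that nobody has rated yet.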
In the next installments we will talk about what to do with these and other problems, and about how to predict unknown ratings in general. Stay tuned!