Predicting player churn in World of Tanks: a Yandex Data Factory lecture for the Small ShAD
Machine learning is Yandex's core expertise. It grew out of the needs of search, for whose ranking we developed the now well-known Matrixnet technology. In 2014 Yandex began applying its ML knowledge outside its own services, and Yandex Data Factory appeared. It is an international business line that solves complex mathematical problems for other companies.
One of its projects is predicting the churn (outflow) of World of Tanks players. Ilya Trofimov told the Small ShAD audience not only about the project with Wargaming, but also about what machine learning is in general and which business tasks it can help with. The audience consists of high school students interested in mathematics and computer science.
Ilya graduated from the Faculty of Physics of Moscow State University in 2007 with a degree in theoretical physics, and in 2011 from the School of Data Analysis with a specialization in data analysis. At Yandex he worked on applying machine learning to optimize ad impressions; now he solves problems of analyzing large volumes of data at Yandex Data Factory. He lectures at the School of Business Administration on “Machine learning on big data”.
Which of you plays World of Tanks? One, two, three, four. Apparently you are shy; in fact, everyone plays. Fine. And who plays online games at all? About seventy percent, I would say. What is World of Tanks? It is an online game made by the wonderful Belarusian company Wargaming. They also have World of Warplanes and World of Warships. The game itself is free and multiplayer; those who play know all of this. It has some paid items: you can buy the so-called "gold" for real money.
According to statistics, the game has more than a million accounts, which is a lot. And here is an interesting figure: on January 19, 2014, more than a million players were online playing World of Tanks. It is also interesting that game companies are now very large: Wargaming has 4,000 employees, while Yandex, as far as I remember, has 6,000. They may look like toys, but in reality this is a very serious business, and in their work these companies try to use mathematics, for example. That is what today's lecture is about.
I used the word churn (outflow). What is churn? It can occur in absolutely any business: banks, cellular operators, Internet and cable TV providers, insurance companies, the same online games, and so on. If you are a customer of a bank or a cellular operator and you decide to leave it, that is called churn. I have put mobile operators in bold because the task is especially relevant for them. Why do you think we need to deal with churn? Why do we need to predict it?
To predict.
To gather statistics.
To retain them.
We need to look at this churn: which groups leave and which do not. But the most important thing is to try to retain these people. Take mobile operators specifically. There are two main tasks: predicting churn, that is, predicting which customer will leave, and prevention, that is, retention. For example, I found a picture on the Internet, “The share of mobile operators in the USA over the past few years”. The main operators there are Verizon, AT&T and Sprint Mobile, plus, as you can see, a few others. Look at how little the shares change: very little, which means this is a very competitive market. If you prevent the churn of, say, 1% of customers, that is already pretty good. And if you prevent 1% of churn every year, it accumulates and becomes very noticeable for the business.
One can draw an analogy with a leaking barrel. There is a barrel, and water trickles out of it little by little. It may be only a trickle, but over a year a lot is lost, and if you plug the hole, the barrel stays full. In terms of business impact, efforts to predict and prevent churn are quite substantial. You can honestly come up with new tariff plans and various perks to keep customers, or you can simply prevent churn.
Another observation explains why this area is developing so actively among mobile operators in particular. Say you are with MTS and you decide to switch to MegaFon. What do you do? You take out one SIM card, insert another, and that is all: MTS can no longer reach you easily. It can only do so by e-mail, provided you registered in your personal account, and clearly that is usually not the case. So cellular operators face this problem: you need to predict which client wants to leave and, while he has not left yet and can still be called or sent an SMS, make him an offer he cannot refuse and somehow try to retain him. All the largest Russian operators are doing this. Now I need to give a short introduction.
Who knows what machine learning is? Well, half of you have heard of it and half have not, so I need to explain it first. Look, we want to predict something. To put it grandly, we predict the future. For example, we want to predict that a person will buy a product from an online store, and which product it will be. You can try to predict that a tweet will get a lot of retweets; people do make such predictions, including at Yandex. You can predict whether a treatment will be effective or not, and choose the most effective one. You can predict that a boy and a girl will make a good couple. Why not? You can predict a stock price next week, and many people who trade on the stock exchange do this professionally. The topic of this lecture is predicting that a player will stop playing an online game. There are a few more tasks that are not about prediction but are very similar to what I will describe: for example, recognizing a person in a photo, or recognizing handwritten text and entering it into a computer. Which of these problems do you think are solved with machine learning, and which are not? Facebook offers to tag which of your friends are in your photos. Handwritten text is a rather old problem, and it is also being solved.
That a person will buy a product in an online store. You are right: all online stores have recommendation blocks such as “we recommend” or “bought together with this product”, and so on. All of that is built on predicting what you will buy. That a tweet will get a lot of retweets: I have already said that people do this and can do it. Which treatment will be effective? Some of you say you cannot predict that; others say that if you collect a lot of information, you can predict practically anything. That a boy and a girl will make a good couple: I do not know how scientific tests like “what animal do you look like?” are; I do not think they are based on any technology. Still, there was a general remark that everything depends on the amount of information. So for the last item, that a boy and a girl will make a good couple: if you collect a lot of information, you can predict better than random guessing. That is right. There are foreign dating services that use machine learning to match couples. You can predict anything; as was remarked, everything depends on the amount of information, and you need to use the right methods. So what is machine learning about?
Machine learning is about training a machine, that is, a computer or a cluster of computers, to solve problems. Some of them people do easily, such as recognizing photos or handwriting. Others people cannot do, such as predicting that a tweet will get many retweets or predicting a stock price next week. A computer can do both. Recognition of people in photos has reached a good level in the past few years and has even slightly exceeded the human level: the computer now recognizes people in photos better than ordinary people do, which is an interesting observation. There is a lecture about machine learning at the Small ShAD; you can watch it if you have not seen it.
How is the training built? The computer is trained on examples. You need what is called a training sample. In the most general form the task looks like this: there is input data, there is an answer, and there are many such examples that must be shown to the computer so that it learns to predict the answer from the input data. Formally, the input data is usually denoted by a vector X, a vector of numbers. The answer is usually denoted by Y, and there are two main options: if Y belongs to a finite set, we call it a classification problem; if Y belongs to R, the set of real numbers, we call it a regression problem. Mathematically, you need to construct a function mapping X to Y that makes as few errors as possible on the training sample.
Let's work out what X and Y are in these tasks. For example, how do we predict that a person will buy a product in an online store? What is X and what is Y? The product is Y; X is information about the person: who he is, how many purchases he has made, what his interests are, how he behaves on the site. Y is the product itself: we need to choose one product from the set Y.
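The formal setup described above can be written out compactly. This is only a restatement of the lecture's definitions; the index n for the number of examples and the example label set {+1, -1} are notational assumptions added here.

```latex
\begin{aligned}
&\text{training sample: } (x_1, y_1), \dots, (x_n, y_n), \qquad x_i \in X,\; y_i \in Y \\
&\text{classification: } Y \text{ is a finite set, e.g. } Y = \{+1, -1\} \\
&\text{regression: } Y = \mathbb{R} \\
&\text{goal: a function } f : X \to Y \text{ that makes as few errors as possible on the sample}
\end{aligned}
```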
That a tweet will get a lot of retweets. Here you need to predict the number of retweets. This is a regression task: we want to predict a number, not a class. What can be used as input? The average number of retweets of previous tweets; the content, since some topics are more popular than others; hashtags; the number of followers and information about them; and also the time when the tweet is written. There are advertising agencies that do PR on Twitter, and there are special services that suggest what time to post a tweet so that it collects more retweets, to promote a celebrity or a product.
The stock price next week. What is X and what is Y? Y is the future price of the stock, and X is information about the state of the exchange. Is it classification or regression? The stock price is a real number that you need to predict, so predicting the price is regression. X is information about the exchange: information about trades in the stock, the current price, yesterday's price, the price a week ago, and so on.
Recognizing a person in a photo is classification. Y is a set of people from which you need to choose one, and X is the photo itself. If it is a photo from a social network, the people in it will be from your circle of friends, which simplifies the task. Mostly the photo itself is used: a photo consists of pixels, and each pixel has its own color. In general this is a separate topic at the Small ShAD; neural networks are used there, and it is a science of its own. How do we predict that a player will stop playing an online game? This is a classification task: the player either leaves or does not. You can also predict how soon he will leave; we did not do that in this project, but it would be a regression task, and everything there is different, with other methods. X is information about the player.
The machine is trained on examples. That is the whole beauty of machine learning: as you have seen, it can work in absolutely any field. You can understand nothing about medicine and predict the effectiveness of treatment; you can understand nothing about online games and predict player churn, and so on. As Archimedes said: “Give me a place to stand and I will move the Earth.” Here it sounds a little different: give me a series of the form input data, answer, input data, answer, and I will learn to predict the answer, almost without understanding the subject area. That is the interesting thing.
Let's go back to player churn and jump from theory to practice. What can be done here? Take the time axis. There is the current moment and there is the past, and from the past we can extract some factors, some parameters. In this task, for example, I used the history for the last six months. Then we fix a control interval in the future (for example, one month) and say that if a player has not played a single battle during that month, he has left. If there was at least one battle, he has not left. That is our classification task.
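The labeling rule just described (no battles inside the control interval means the player has left) can be sketched in a few lines. This is an illustration only, not the project's code: the function name `churn_label`, the 30-day default and the +1 / -1 encoding are assumptions.

```python
from datetime import date, timedelta

CHURNED, STAYED = 1, -1  # assumed encoding: +1 = left, -1 = stayed


def churn_label(battle_dates, now, control_days=30):
    """Return CHURNED if no battle falls inside the control interval after `now`."""
    end = now + timedelta(days=control_days)
    played = any(now < d <= end for d in battle_dates)
    return STAYED if played else CHURNED


now = date(2014, 6, 1)
active = [date(2014, 5, 20), date(2014, 6, 10)]  # one battle inside the control month
silent = [date(2014, 5, 20), date(2014, 7, 15)]  # next battle only after the control month
print(churn_label(active, now), churn_label(silent, now))
```

Battles before `now` are ignored by the label; they are the raw material for the input parameters.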
The input data I used in this project: the number of tanks destroyed, the number of battles, the number of clan battles, the share of victories, free experience, gold, whether the account is premium or not. In general, I tried to pull in everything there is in the game. An interesting point here is that in practical tasks you should feed in everything you can, because a computer can discover patterns that are not obvious to a person. You need to rely on mathematics and the computer rather than on your intuition, which can often fail.
What else can you do that is interesting? You can look at all these parameters at different depths, just as in the stock price example, where we could look at the price a day ago, a week ago, a month ago. I used all the parameters with a depth of up to six months, i.e. 24 weeks. As a result, each player gets 501 parameters. Clearly you cannot work with 501 parameters by hand and by eye, and you cannot visualize them in Excel: people perceive two-dimensional space, three-dimensional space is already hard to draw on a computer, and 501-dimensional space is impossible. Some mathematics is needed here. The answer in this problem is a binary variable (the player left or did not leave). For certain reasons it is convenient to denote the two values +1 and -1.
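To illustrate what "the same parameter at different depths" can mean, here is a minimal sketch that sums one weekly statistic over several look-back windows. The window sizes, the function name `depth_features` and the sample numbers are hypothetical; the lecture does not say how the 501 parameters were actually composed.

```python
def depth_features(weekly_values, depths=(1, 4, 12, 24)):
    """Sum one weekly statistic over the most recent d weeks, for each depth d.

    `weekly_values` is ordered from oldest to newest; the depths are illustrative.
    """
    return {f"last_{d}w": sum(weekly_values[-d:]) for d in depths}


# 24 weeks of battle counts for one hypothetical player whose activity fades out
battles = [40] * 12 + [20] * 8 + [5] * 3 + [0]
print(depth_features(battles))
```

Applying the same windows to every raw statistic is one way a handful of counters turns into hundreds of columns per player.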
Look, we get a giant table in which each row is a player, and for each player we know X and Y. Y is whether he left or not, and X is a vector of 501 features, or parameters. I could not fit all the parameters on the slide. It is a sheet you cannot work with by eye in Excel, which is why it is sometimes called big data. What would you do if you were given such a sheet: a bunch of players, for each of whom you know whether he left or not, plus 501 numbers? What do you do with it? Analyze the information.
The first idea is this: we take the number of hours played per month and draw a graph; if the graph is falling, we predict that the person will leave. For example, we can extrapolate it with a straight line, and if the line reaches 0, the person leaves. This is quite a reasonable approach, but its drawback is that it uses only one factor, the number of hours in the game per month. There was an idea about premium accounts, but I did not catch the specific algorithm for using them. You can also use indicators that stop growing: experience is not growing, the share of kills is not growing. You can use priorities: if we see that the number of battles and the number of hours in the game are decreasing, let's say that is 10 virtual points; if the player has not paid money for a long time, that is another 20 virtual points; and we accumulate the points. The more points, the greater the likelihood that the person will leave.
We have a certain set of hypotheses: for example, that a person with a premium account is less likely to leave; that if his share of victories, his share of kills and his experience do not grow, he is dissatisfied with the game and leaves. This is all a normal human approach, but what do I want to do? I want to take all 501 parameters and have the computer itself find many such patterns, even ones a person could not guess.
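Two of the hand-made heuristics suggested by the audience can be sketched as follows: extrapolating monthly hours with a straight line, and accumulating virtual points. The least-squares fit, the function names and the 90-day payment threshold are assumptions made for the illustration; only the 10 and 20 point values come from the lecture.

```python
def linear_trend(values):
    """Least-squares slope and intercept for values at t = 0, 1, 2, ... (needs 2+ points)."""
    n = len(values)
    t_mean = (n - 1) / 2
    v_mean = sum(values) / n
    cov = sum((t - t_mean) * (v - v_mean) for t, v in enumerate(values))
    var = sum((t - t_mean) ** 2 for t in range(n))
    slope = cov / var
    return slope, v_mean - slope * t_mean


def predicts_leaving(monthly_hours):
    """Extrapolate one month ahead; a forecast at or below zero counts as leaving."""
    slope, intercept = linear_trend(monthly_hours)
    return slope * len(monthly_hours) + intercept <= 0


def churn_points(activity_falling, days_since_payment, payment_gap_days=90):
    """Accumulate virtual points; `payment_gap_days` is a hypothetical threshold."""
    points = 0
    if activity_falling:
        points += 10  # battles and hours in the game are decreasing
    if days_since_payment > payment_gap_days:
        points += 20  # no money paid for a long time
    return points


print(predicts_leaving([30, 20, 10]), predicts_leaving([30, 28, 31]))
print(churn_points(True, 200))
```

Both rules are written by a person; the rest of the lecture is about letting the computer find such rules itself.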
What else can you do? For example, put the premium account on the X axis. There are two options: here the player has a premium account, here he does not. We calculate the probability of leaving in the next month. I will make up the numbers off the top of my head, just to illustrate the idea that one of you suggested. Say there are 1,000 players with premium and 10,000 without. We look at what happened in the next month. How many left? Suppose 100 of the premium players left and 2,000 of the non-premium players left. What is the probability of leaving? With premium it is 10%, without premium 20%. This is a made-up example; I do not remember the real numbers exactly. You can do this for each parameter and see how much it affects the outcome. I do not want to show you more fictional examples. You have proposed a lot of ideas, but the task is to make the computer generate such hypotheses: it can generate more of them and do it more efficiently. The idea about points is also a good one, because it lets us take many factors into account at once. There is a method called logistic regression that does practically this.
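The made-up premium example is simple enough to check directly. The sketch below uses exactly the illustrative numbers from the lecture (not real data); the function and variable names are assumptions.

```python
def leave_rate(left, total):
    """Share of players in a group who left during the control month."""
    return left / total


# (players who left, players in the group), using the lecture's made-up numbers
groups = {"premium": (100, 1000), "no premium": (2000, 10000)}
for name, (left, total) in groups.items():
    print(f"{name}: {leave_rate(left, total):.0%}")
```

Repeating this for every parameter gives a first, one-factor-at-a-time picture of what is associated with leaving.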
I have a few more math slides. We need to understand what a classification error is. We take some sample, some set of examples, and compute forecasts for it. Our forecast is the value of the function f.
If our prediction coincides with reality, the error is 0; if it does not, the error is 1. We simply add up the errors over all the examples and call the result the classification error, i.e. the number of cases we did not guess. In machine learning there is an observation that it is not hard to construct a function f whose error on a fixed sample is zero. You may even guess how to build such a function. We have a training set of input-answer pairs, denoted X and Y, and we want to build a function mapping X to Y that makes as few errors as possible.
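The classification error defined above is just a count of wrong guesses. A minimal sketch, with a hypothetical one-feature rule and a tiny hand-made sample:

```python
def classification_error(f, sample):
    """Count the examples (x, y) in `sample` where the prediction f(x) differs from y."""
    return sum(1 for x, y in sample if f(x) != y)


def rule(battles_last_week):
    """Hypothetical rule: fewer than 10 battles means the player leaves (+1)."""
    return 1 if battles_last_week < 10 else -1


sample = [(0, 1), (3, 1), (25, -1), (8, -1), (40, -1)]
print(classification_error(rule, sample))  # 1: only (8, -1) is misclassified
```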
How do we do this on a given sample? A reasonable question arises: why do we need machine learning at all if you can always build a function that makes no mistakes? The answer is that such a function simply returns the known Y for each known X. If a new X arrives that was not seen before, it can do nothing. You could find, for each new X, the closest X in the training set and return the corresponding Y. This works a little better, but still not very well.
How do we deal with this? Here is how you work in machine learning. The X, Y pairs have come to you, and from the very beginning you divide them into two parts: one we call “training” and the other “test”. The function f(x) is built only on the training set, and its quality, i.e. the classification error, is measured on the test set, which we did not use when building the function. Clearly, in the made-up example where f only returns the Y values it already knows, the classification error on new data will be very large, so that function is bad.
The main task of machine learning is how to build the function f on the training set so that its error on the new test set is small. The interesting point here is that we work only with the training set, and then new data arrives from somewhere out of the blue, and it also has to be classified well. This is what machine learning is about.
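Both functions mentioned here are easy to write down. The sketch below builds a pure lookup table and a nearest-neighbor variant; the two-number feature vectors (battles last week, gold) and the squared Euclidean distance are assumptions chosen for the illustration.

```python
def memorizer(training):
    """Lookup table: perfect on training inputs, no answer (None) for anything new."""
    table = {tuple(x): y for x, y in training}
    return lambda x: table.get(tuple(x))


def nearest_neighbor(training):
    """Answer with the label of the closest training input (squared Euclidean distance)."""
    def predict(x):
        _, label = min(
            training,
            key=lambda pair: sum((a - b) ** 2 for a, b in zip(pair[0], x)),
        )
        return label
    return predict


training = [((0, 50), 1), ((2, 900), -1), ((30, 100), -1)]
f_memo, f_nn = memorizer(training), nearest_neighbor(training)
print(f_memo((0, 50)), f_memo((1, 60)), f_nn((1, 60)))  # 1 None 1
```

The lookup table has zero error on the sample it was built from, which is exactly why error on that sample alone says little about quality.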
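A minimal sketch of the training/test protocol, assuming a random shuffle and a test share of about one third; the helper names and the toy data are hypothetical. It also shows why the lookup-table function fails the test.

```python
import random


def train_test_split(pairs, test_share=0.34, seed=0):
    """Shuffle a copy of the (x, y) pairs and cut it into training and test parts."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_share)
    return shuffled[n_test:], shuffled[:n_test]


def error_rate(f, sample):
    """Share of examples in `sample` that f gets wrong."""
    return sum(f(x) != y for x, y in sample) / len(sample)


# toy data: x is battles in the last week, y is +1 (left) or -1 (stayed)
pairs = [(battles, 1 if battles < 10 else -1) for battles in range(50)]
train, test = train_test_split(pairs)

lookup = dict(train)


def memorized(x):
    """Returns the stored answer, or None for an x it has never seen."""
    return lookup.get(x)


print(error_rate(memorized, train), error_rate(memorized, test))  # 0.0 1.0
```

Every x in the toy data is unique, so the memorizing function is wrong on every test example while being perfect on the training part.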
In this particular churn task the sample was 50,000 players. These are players who have fought at least 500 battles in the last 3 months and, in addition, have had at least one battle in the last week. Five hundred battles means a fairly active player. I divided the sample this way: 33,000 players for training and 17,000 for the test.
Now I will talk about the decision tree, one of the oldest machine learning methods. It builds hierarchical conditions. If the number of battles in the last week is less than 10 and, in addition, gold is less than 100, the player will leave. If the number of battles is less than 10 but gold is more than 100, he will not leave. The question is how to build such a tree. It can combine many factors, which is what we need, because we want to combine 501 factors.
The first suggestion was this. How do we build a decision tree? If we know which factors strongly influence leaving, we can build the tree without problems: if the strength, the priority of the factors is known, the tree is easy to build. True, it is not yet clear exactly how.
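Written as code, the example tree is nothing more than nested conditions. The thresholds are the ones from the example; the branch for players with 10 or more battles, and the handling of gold equal to exactly 100, are not specified in the lecture and are assumptions here.

```python
def tree_predict(battles_last_week, gold):
    """The two-level example tree; returns True if the player is predicted to leave."""
    if battles_last_week < 10:
        if gold < 100:
            return True   # few battles and little gold: leaves
        return False      # few battles but enough gold: stays
    return False          # assumed branch: active players are predicted to stay


print(tree_predict(3, 20), tree_predict(3, 500), tree_predict(40, 0))
```

A learned tree has the same shape; the difference is that the factors and thresholds are chosen by an algorithm from the training set rather than by hand.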
Any more ideas? For each factor, or parameter, we compute how strongly it alone affects leaving; that is step one. For example, we can build a graph with the value of this particular factor on the X axis and the probability of leaving on the Y axis. If the graph is a horizontal line, nothing depends on this factor; if it is a curve, the probability of leaving depends on the value of the factor. Next, we rank all the factors by strength and somehow build a tree. We still need some coefficient, some magic that will solve all the problems.
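The audience's suggestion can be sketched like this: bin each factor, compute the share of leavers per bin, and score the factor by how far that share moves between bins. The scoring rule (maximum minus minimum), the bin edges and the toy data are assumptions; this is not the criterion the lecture goes on to use.

```python
def leave_probability_by_bin(values, left, edges):
    """Share of leavers among players whose factor value falls in each bin.

    `edges` are ascending cut points; a value lands in bin i, where i is the
    number of edges it is greater than or equal to.
    """
    counts = [[0, 0] for _ in range(len(edges) + 1)]  # [leavers, players] per bin
    for v, gone in zip(values, left):
        i = sum(v >= e for e in edges)
        counts[i][0] += gone
        counts[i][1] += 1
    return [l / n if n else None for l, n in counts]


def strength(probabilities):
    """Assumed score: spread between the highest and lowest bin probability."""
    known = [p for p in probabilities if p is not None]
    return max(known) - min(known)


left = [1, 1, 0, 0, 0, 0]  # 1 = the player left
factors = {
    "battles": ([1, 2, 3, 15, 20, 30], [10]),
    "win_share": ([45, 55, 48, 52, 47, 51], [50]),
}
ranked = sorted(
    factors,
    key=lambda name: strength(leave_probability_by_bin(factors[name][0], left, factors[name][1])),
    reverse=True,
)
print(ranked)
```

A flat graph corresponds to a spread of zero, which is what the hypothetical `win_share` data produces here.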
What are ordinary factors and what are strong factors affecting leaving? You have given me ideas that I will now talk about.