
Deep reinforcement learning of a virtual manager in the game against inefficiency



Google DeepMind's successes are now widely known and talked about. DQN (Deep Q-Network) algorithms beat humans by a comfortable margin in a growing number of games. The achievements of recent years are impressive: in literally tens of minutes of training, the algorithms learn to beat a human at Pong and other Atari games. Recently things have moved into the third dimension: algorithms beat humans at Doom in real time, and also learn to drive cars and fly helicopters.


Deep reinforcement learning was also used to train AlphaGo, which played thousands of games against itself. Back in 2015, when this was not yet fashionable, the management of Phobos, in the person of Alexey Spassky, anticipated this trend and commissioned the Research & Development department to conduct a study. The task was to review existing machine learning technologies and assess whether they could be used to automate winning in management "games". In this article, therefore, we discuss the design of a self-learning algorithm in which a virtual manager plays against a live team in order to increase its productivity.


An applied machine learning problem is classically solved in the following steps:



This article covers the key decisions made while designing the intelligent agent.
More detailed descriptions of the stages, from setting the task to presenting the results, will follow in subsequent articles if readers are interested. This way we hope to tell the story of a multidimensional and ambiguous research result without losing clarity.


Algorithm selection


So, for the task of maximizing the efficiency of team management, we decided to use deep reinforcement learning, namely Q-learning. The intelligent agent builds a utility function Q for each action available to it, based on the reward or punishment it receives when the environment transitions to a new state. This lets it choose a behavior strategy not at random but by taking into account the experience of its previous interaction with the game environment.
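
For reference, here is a minimal sketch of the tabular Q-learning loop that underlies this idea. The state and action encodings are placeholders for illustration, not the ones used in our agent:

```python
import random
from collections import defaultdict

# Q-table: maps (state, action) to an estimated utility.
Q = defaultdict(float)

ALPHA = 0.1    # learning rate (LF)
GAMMA = 0.9    # discount factor (DF)
EPSILON = 0.1  # exploration rate

def choose_action(state, actions):
    """Epsilon-greedy choice: mostly exploit learned utilities, sometimes explore."""
    if random.random() < EPSILON:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def update(state, action, reward, next_state, next_actions):
    """Classic Q-learning update:
    Q(s, a) += alpha * (r + gamma * max_a' Q(s', a') - Q(s, a))"""
    best_next = max(Q[(next_state, a)] for a in next_actions) if next_actions else 0.0
    Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])
```

In DQN the table is replaced by a neural network that approximates Q, but the update rule and the reward-driven logic stay the same.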


The main reason for choosing DQN is that with this method the agent needs no model of the environment, either to train or to select an action. This is a critical requirement, for the simple reason that a formalized model of a team of people with practical predictive power does not yet exist. However, an analysis of the successes of artificial intelligence in logic games shows that the advantages of an approach based on expert knowledge become more pronounced as the environment grows more complex. This is the case in checkers and chess, where model-based evaluation of actions has been more successful than Q-learning.



One of the reasons reinforcement learning has not yet left office clerks without a job is that the method does not scale well. The Q-learning agent is an active learner: it must repeatedly try every action in every situation in order to build its own Q-function that estimates the utility of every possible action in every possible state.



If, as in old arcade games, the number of actions is bounded by the buttons on the joystick and the states by the position of the ball, the agent needs only tens of minutes or hours of training to beat a human. In chess or GTA 5, however, the combinatorial explosion makes the space of combinations of game states and possible actions far too large for such a learner to traverse.


Hypothesis and model


To use Q-learning effectively for team management, we need to minimize the dimensionality of the environment's states and actions.
Our solutions:



An example of a simple game with online learning:
https://cs.stanford.edu/people/karpathy/convnetjs/demo/rldemo.html






The diagram shows the states of the three game environments for the three agents that manage work on a task.


States:



The list of actions is different for each of the three agents. The Project Manager agent assigns the developer and the tester, the time estimate, and the priority of the task. The Dev and QA agents are personal to each developer and tester. When a task is passed on to the next stage, the agents receive a reward; when it comes back, a punishment.
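
A rough sketch of how the three agent types and their action spaces could be laid out. All names, fields, and reward values here are illustrative assumptions, not the actual production code:

```python
from dataclasses import dataclass

@dataclass
class Task:
    """Simplified task card, loosely modelled on a YouTrack issue."""
    estimate_hours: float
    priority: int
    assignee: str = ""
    tester: str = ""
    returns: int = 0          # how many times the task bounced back
    spent_hours: float = 0.0

# Hypothetical action spaces for the three agent types.
PM_ACTIONS = ["assign_developer", "assign_tester", "set_estimate", "set_priority"]
DEV_ACTIONS = ["start_task", "request_clarification", "submit_for_review"]
QA_ACTIONS = ["start_testing", "return_to_developer", "close_task"]

# Reward signal shared by all agents: passing a task forward is rewarded,
# getting it back is punished, closing it pays the most (values are illustrative).
REWARD_FORWARD = +1.0
REWARD_RETURNED = -1.0
REWARD_CLOSED = +10.0
```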


All agents receive the greatest reward when the task is closed. The discount factor (DF) and learning rate (LF) for Q-learning were also chosen so that the agents focus on closing the task. In the general case, the reinforcement is calculated according to an optimal control formula that takes into account, among other things, the difference between the time estimate and the actual costs, the number of times the task was returned, and so on. The advantage of this solution is its ability to scale to a larger team.
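
A hedged sketch of how such a reinforcement might be computed. The exact optimal control formula and its coefficients are not published here, so every weight below is an assumption:

```python
def reinforcement(estimate_hours: float, spent_hours: float,
                  returns: int, closed: bool, passed_forward: bool) -> float:
    """Illustrative reward: closing the task dominates,
    overruns and returned tasks are penalised.
    All weights are assumptions, not the actual coefficients."""
    if closed:
        reward = 10.0          # the largest reward: the task is closed
    elif passed_forward:
        reward = 1.0           # task handed over to the next stage
    else:
        reward = -1.0          # task came back
    # Penalty for the gap between the estimate and the real time spent.
    reward -= 0.5 * max(0.0, spent_hours - estimate_hours)
    # Penalty for every time the task was returned.
    reward -= 1.0 * returns
    return reward
```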


Conclusion


The hardware on which the calculations were run was a GeForce GTX 1080.


For the mini-games described above, with tasks created and managed in YouTrack, the control functions converged to above-average values (employee productivity increased relative to working under a human manager) for 3 people out of 5. Overall productivity (in hours) almost doubled. None of the participants in the test group were satisfied: four were dissatisfied, and one abstained from rating.


Nevertheless, we concluded for ourselves that to use the method "in combat", expert knowledge in psychology needs to be brought into the model. The total duration of development and testing was more than a year.



Source: https://habr.com/ru/post/319768/

