NeurIPS is currently considered the top conference in the world of machine learning. Today I will tell you about my experience of participating in NeurIPS competitions: how to compete with the best academics in the world, take a prize place, and publish a paper.
NeurIPS supports bringing machine learning methods into various scientific disciplines. About 10 competition tracks are launched every year to solve real problems from the academic world, and the winners present their new developments and algorithms at the conference. What I am most passionate about is reinforcement learning (RL), so for the second year in a row I have been participating in the RL competitions held for NeurIPS.
Secondly, this conference is a global event: scientists from all over the world gather in one place, and you can talk to any of them.
In addition, the whole conference is filled with the latest scientific achievements and state-of-the-art results, which anyone working in data science really needs to know and follow.
Starting to participate in such competitions is quite simple. If you know enough DL to train a ResNet, that is enough: register and go ahead. There is always a public leaderboard on which you can soberly assess your level compared to other participants. And if something is not clear, there are always channels in Slack / Discord / Gitter, etc., to discuss any questions that arise. If the topic is really “yours”, nothing will stop you from getting the coveted result: in all the competitions I participated in, all the approaches and solutions were studied and implemented right in the course of the competition.
A person's gait is the result of the interaction of muscles, bones, vision, and the inner ear. When the central nervous system is impaired, movement disorders can occur, including gait disturbance, or abasia.
Researchers from the Stanford neuromuscular biomechanics laboratory decided to bring machine learning into this area of treatment, so that they could experiment and test their theories on a virtual skeleton model rather than on living people.
The participants were given a virtual human skeleton (in the OpenSim simulator) with a prosthesis in place of one leg. The task was to teach the skeleton to move in a given direction at a given speed; both the direction and the speed could change during the simulation.
Reinforcement Learning (RL) is an area that deals with decision theory and the search for optimal behavior policies.
Recall how you teach a dog new tricks: you repeat some action, give a treat when the trick is done, and nothing when it is not. The dog has to figure out the strategy of behavior (a “policy” in RL terms) that maximizes the number of treats received.
Formally, we have an agent (the dog) that learns from the history of interactions with the environment (the owner). The environment evaluates the agent's actions and gives it a reward (a treat): the better the behavior, the bigger the reward. Accordingly, the agent's task is to find a policy that maximizes the total reward over the whole time of interaction with the environment.
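To make this loop concrete, here is a minimal sketch of one episode in the competition environment. The ProstheticsEnv interface below comes from the osim-rl package and is written from memory, so details (reset/step signatures, the action_space attribute) may differ between versions; any gym-like environment looks the same.

```python
# A minimal agent-environment interaction loop (sketch, interface from memory).
import numpy as np
from osim.env import ProstheticsEnv  # assumption: osim-rl is installed

env = ProstheticsEnv(visualize=False)
observation = env.reset()

total_reward, done = 0.0, False
while not done:
    # a trained policy would map observation -> muscle activations here;
    # random activations in [0, 1] are used just to show the loop
    action = np.random.uniform(0.0, 1.0, size=env.action_space.shape)
    observation, reward, done, info = env.step(action)
    total_reward += reward

print("episode reward:", total_reward)
```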
Developing this idea further: rule-based solutions are software 1.0, where all the rules are set by the developer; supervised learning is software 2.0, where the system learns from existing examples and finds dependencies in the data; reinforcement learning is a step further, where the system itself learns to explore, experiment, and find the required dependencies in its solutions. The further we go, the closer we get to the way a person learns.
The task looks like a typical reinforcement learning problem with a continuous action space (RL for continuous action spaces). It differs from the usual RL setting in that instead of choosing one specific action (pressing a joystick button), the action itself has to be predicted precisely, and there are infinitely many possibilities.
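For illustration, here is how the two kinds of action spaces look in gym notation; the 19 muscle activations below are an illustrative number, not necessarily the exact dimensionality of the competition environment.

```python
# Discrete vs continuous action spaces (gym notation, illustrative sizes).
from gym.spaces import Discrete, Box
import numpy as np

joystick = Discrete(4)  # pick one of 4 buttons
muscles = Box(low=0.0, high=1.0, shape=(19,), dtype=np.float32)  # 19 muscle activations

print(joystick.sample())  # e.g. 2
print(muscles.sample())   # e.g. [0.13, 0.87, ...] -- infinitely many possibilities
```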
The basic approach to solving it, Deep Deterministic Policy Gradient (DDPG), was invented in 2015, which is a long time ago by DL standards, and the area continues to develop actively in application to robotics and real-world RL. There is plenty to improve: robustness of approaches (so as not to break a real robot), sample efficiency (so as not to collect data from real robots for months), and other RL problems (the exploration vs exploitation trade-off, etc). In this competition we were not given a real robot, just a simulation, but the simulator itself was 2000 times slower than the open-source analogues (on which everyone checks their RL algorithms), which brought the problem of sample efficiency to a new level.
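For reference, below is a condensed sketch of a single DDPG update step in PyTorch. The network sizes, hyperparameters, and replay-buffer interface are placeholder assumptions for illustration, not the solution we used in the competition.

```python
# A condensed sketch of one DDPG update step (actor-critic with target networks).
# Dimensions and hyperparameters are illustrative placeholders.
import torch
import torch.nn as nn

obs_dim, act_dim, gamma, tau = 32, 19, 0.99, 0.005

def mlp(inp, out, final=None):
    layers = [nn.Linear(inp, 256), nn.ReLU(), nn.Linear(256, out)]
    if final is not None:
        layers.append(final)
    return nn.Sequential(*layers)

actor = mlp(obs_dim, act_dim, nn.Tanh())    # deterministic policy mu(s), output in [-1, 1]
critic = mlp(obs_dim + act_dim, 1)          # Q(s, a)
actor_target = mlp(obs_dim, act_dim, nn.Tanh())
critic_target = mlp(obs_dim + act_dim, 1)
actor_target.load_state_dict(actor.state_dict())
critic_target.load_state_dict(critic.state_dict())

actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

def ddpg_update(batch):
    # batch sampled from a replay buffer; r and done have shape [batch, 1]
    s, a, r, s_next, done = batch

    # critic: regress Q(s, a) onto the one-step TD target
    with torch.no_grad():
        q_next = critic_target(torch.cat([s_next, actor_target(s_next)], dim=1))
        target = r + gamma * (1 - done) * q_next
    q = critic(torch.cat([s, a], dim=1))
    critic_loss = nn.functional.mse_loss(q, target)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # actor: maximize Q(s, mu(s)) by gradient ascent
    actor_loss = -critic(torch.cat([s, actor(s)], dim=1)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    # slowly track the online networks with the targets (Polyak averaging)
    for net, net_t in [(actor, actor_target), (critic, critic_target)]:
        for p, p_t in zip(net.parameters(), net_t.parameters()):
            p_t.data.mul_(1 - tau).add_(tau * p.data)
```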
The competition itself took place in three stages, during which the task and conditions were somewhat modified.
The main quality metric was the total reward over the simulation, which showed how well the skeleton adhered to the given direction and speed over the whole distance.
During the 1st and 2nd stages, each participant's progress was displayed on the leaderboard. The final solution had to be submitted as a Docker image; there were restrictions on running time and resources.
Because the leaderboard is public, nobody shows their best model: they hold it back in order to give out “a little more than usual” in the final round and surprise their rivals.
Last year there was a small incident with the evaluation of solutions in the very first round. At that time, testing went through HTTP interaction with the platform, and the testing conditions leaked: it was possible to find out exactly in which situations the agent was evaluated and to overfit it to those conditions only, which, of course, did not solve the real problem. That is why it was decided to switch the submission system to Docker images launched on the organizers' remote servers. Dbrain uses the same system for scoring competition results for exactly these reasons.
Ideally, your knowledge and skills should be at the same level and complement each other. For example, this year I brought our team onto PyTorch and contributed the initial ideas for a distributed agent training system.
How to find a team? First, you can join the ranks of ODS (OpenDataScience) and look for like-minded people there. Secondly, for RL folks there is a separate Telegram chat, the RL club. Thirdly, you can take the wonderful Practical RL course from ShAD, after which you will definitely pick up a couple of acquaintances.
However, it is worth remembering the “submit first” policy: if you want to team up, first build your own solution, submit it, appear on the leaderboard, and show your level. As practice shows, such teams are much more balanced.
As I already wrote, if the topic is “yours”, nothing will stop you. This means that you do not just like the field, it inspires you: you burn with it, you want to become the best in it.
I met RL 4 years ago, while taking the Berkeley 188x Intro to AI course, and I still do not cease to be amazed at the progress in this area.
The third, but just as important, point: you need to be able to do what you promised, invest in the competition every day, and just... solve it. Every day. No innate talent can compare with the ability to do something, even a little bit, every single day. That is what motivation is needed for. To succeed at this, I advise reading Deep Work and the ternaus AMA.
Another extremely important skill is the ability to distribute your effort and use your free time properly. Combining a full-time job with participating in competitions is not a trivial task. The most important thing in these conditions is not to burn out and to withstand the whole load. To do this you need to manage your time properly, soberly assess your strength, and not forget to rest in time.
At the final stage of a competition there is usually a moment when, in just a week, you need to do not just a lot, but A LOT. For the sake of a better result, you need to be able to force yourself to sit down and make the last dash to the coveted prize.
Why might you have to put in extra work for the benefit of the competition? The answer is quite simple: deadline extensions. In such competitions the organizers often cannot predict everything, so the easiest way out is to give the participants more time. This year the competition was extended 3 times: first by a month, then by a week, and at the very last moment (24 hours before the deadline) by another 2 days. And if during the first two extensions it was enough to simply re-plan the extra time, during the last two days you just had to plow.
Among other things, do not forget about theory: stay aware of what is happening in the field and be able to spot what is relevant. For example, for last year's solution our team started from the following articles:
This year a “couple” more were added:
I also recommend OpenAI's compilation of articles on reinforcement learning and its Mendeley version. And if you are interested in reinforcement learning, join the RL club and RL papers.
The reward averaged over 10 test episodes served as the final evaluation of a solution.
The graph shows the results of testing our agent: in 9 out of 10 episodes our skeleton ran just fine (average reward 9955.66), but one episode... episode 3 did not go its way (reward 9870). It was this mistake that dropped the total score to 9947 (-8 points).
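A quick back-of-the-envelope check of that drop (for simplicity, the nine good episodes are assumed to be exactly equal to their average):

```python
# One weak episode drags the 10-episode mean down by roughly 8 points.
good_episodes = [9955.66] * 9   # the nine "fine" episodes (assumed equal to their average)
bad_episode = 9870.0            # episode 3
final_score = (sum(good_episodes) + bad_episode) / 10
print(final_score)              # ~9947.1, i.e. about 8 points below 9955.66
```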
And finally, do not forget about plain luck. Do not think this is a controversial point; on the contrary, a little luck strongly rewards continuous work on yourself: even if the probability of getting lucky is only 10%, a person who entered the competition 100 times will succeed far more often than someone who tried once and abandoned the idea.
This year there were a couple of changes. First, I no longer simply wanted to participate in this competition; I wanted to win it. Secondly, the team also changed: Alexey Grinchuk, Anton Pechenko, and me. We did not manage to win outright, but we again took 3rd place.
Our solution will be officially presented at NeurIPS, so for now we will limit ourselves to a few details. Building on last year's solution and this year's successes of off-policy reinforcement learning (the articles above), we added a number of our own developments, which we will describe at NeurIPS, and got Distributed Quantile Ensemble Critic, with which we took third place.
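To give a flavor of what a quantile critic is about, below is a generic quantile regression (quantile Huber) loss in the spirit of distributional RL papers such as QR-DQN. It is only an illustration of the idea, not our competition implementation, which we will describe at NeurIPS.

```python
# Generic quantile Huber loss for a distributional critic (illustration only,
# not the competition implementation).
import torch

def quantile_huber_loss(pred_quantiles, target_quantiles, kappa=1.0):
    """pred_quantiles, target_quantiles: tensors of shape [batch, n_quantiles]."""
    n = pred_quantiles.shape[1]
    # midpoints of the n quantile fractions: (2i + 1) / (2n)
    taus = (torch.arange(n, dtype=pred_quantiles.dtype,
                         device=pred_quantiles.device) + 0.5) / n

    # pairwise TD errors between every target and every predicted quantile
    td = target_quantiles.unsqueeze(1) - pred_quantiles.unsqueeze(2)  # [batch, n_pred, n_tgt]
    huber = torch.where(td.abs() <= kappa,
                        0.5 * td.pow(2),
                        kappa * (td.abs() - 0.5 * kappa))
    # asymmetric weighting pushes each predicted quantile toward its fraction tau
    weight = (taus.view(1, -1, 1) - (td.detach() < 0).float()).abs()
    return (weight * huber).mean()
```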
All of our developments (the distributed training system, algorithms, etc.) will be published and available in Catalyst.RL after NeurIPS.
Our team confidently held 1st place throughout almost the entire competition. However, the big guys had other plans: 2 weeks before the end, two big players entered at once, FireWork (Baidu) and nnaisense (Schmidhuber). And while nothing could be done about the Chinese Google, we managed to fight the Schmidhuber team for second place for quite a long time, losing only by a minimal margin. That seems pretty good for amateurs.
I strongly recommend not trying to explain to the American officer who interviews you that you are going to the conference because you train virtual skeletons to run in simulations. Just say you are going to the conference to give a talk.
Participation in NeurIPS is an experience that is hard to overestimate. Do not be scared by the loud headlines: you just need to pull yourself together and start solving.
And check out Catalyst.RL, really.
Source: https://habr.com/ru/post/430712/