MERC-2017, the first machine learning competition from Neurodata Lab, held on our own platform Datacombats (an updated, full-fledged version of the platform is coming soon), is now officially closed. Time to sum up and comment on the results. A visualization of the statistics can be found in the previous post of our blog.
How the idea of the competition came about
Our laboratory, owing to the nature and specifics of the emotional data we deal with, works on machine learning, or more precisely on the recognition of emotions in audio and video streams. This area is not particularly well studied: on average only two to five specialized articles are published per year, and the accuracy of state-of-the-art approaches is far from acceptable. While basic emotions are recognized reasonably well in images (photos) of people, things are more complicated with speech. In addition, there have been almost no serious attempts to use body language and eye movements, even though these are very informative sources. The problem of combining several channels is not encouraging either: nothing works better than naively concatenating all the available features or the predictions of the individual models.
Another pressing problem for us is that in real life not all the data is available. For example, the person's face may be out of frame, the hand movements may not be captured in close-up, or there may be technical failures: say, the microphone breaks down, creating serious interference and noise.
Both problems suggest a huge number of hypotheses that might help solve them, and testing each one ourselves is physically difficult. That is why we decided to organize a competition and appeal to the collective mind.
Task setting and provided data
We have our own labeled emotional dataset recorded in laboratory conditions:
RAMAS. It consists of about 7 hours of video in which pairs of actors play out scenes from everyday life, following the outline of a scenario that allows a high degree of improvisation and variation in how the material is presented. Each recording is labeled by external annotators (at least 5-6 people per episode) for the emotions manifested in it. These labels are what the participants had to predict from the features we computed from the video.
We did not give the participants the video files themselves, to rule out the possibility of them doing their own manual labeling, and limited ourselves to features. Here we ran into the first pitfall: we had to explain to people which features we computed and how. For sound, body, and eyes one can point to reference articles, which we did, but the face required a different approach. We had two working options. The first was to extract 68 key points on the face: it is clear what each one is responsible for, the result can be nicely visualized, and it is easy to explain why it is the way it is, but our internal experiments showed that a significant amount of information is lost with this approach. We abandoned it in favor of the following scheme: detect faces in every frame of the video, run them through a pre-trained neural network, take the output of the penultimate fully connected layer, and then reduce the dimension using principal component analysis. We covered this and other technical aspects in more detail in the contest description.
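As an illustration, here is a minimal sketch of this kind of face-feature pipeline in Python; the detect_face and cnn_embed interfaces, the embedding size, and the number of principal components are assumptions rather than the actual components we used.

```python
import numpy as np
from sklearn.decomposition import PCA

def face_embeddings(frames, detect_face, cnn_embed, emb_dim=512):
    """One embedding per frame; zeros where no face is detected (a gap)."""
    embs = []
    for frame in frames:
        box = detect_face(frame)             # hypothetical detector: crop box or None
        if box is not None:
            top, bottom, left, right = box
            crop = frame[top:bottom, left:right]
            embs.append(cnn_embed(crop))     # penultimate FC-layer output of a pre-trained CNN
        else:
            embs.append(np.zeros(emb_dim))   # face out of frame -> missing data
    return np.vstack(embs)

# Fit PCA on embeddings pooled over the training videos and reuse
# the same projection for every episode.
pca = PCA(n_components=64)                    # 64 is an illustrative choice
# train_matrix = np.vstack([face_embeddings(v, detect_face, cnn_embed) for v in train_videos])
# pca.fit(train_matrix)
# reduced = pca.transform(face_embeddings(test_video, detect_face, cnn_embed))
```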
Of course, we wanted to reduce the probability of cheating (brute force, grid search, fitting to the leaderboard) to a minimum (we did not manage to reduce the probability of trees to a minimum, though, but more on that below). To do this, simply put, we renamed the feature tables so that participants could not fit to specific actors, introduced a final check of the solutions on a sample that was not available during the competition, and required that the same model be used at both stages.
Solutions
By the end of the first stage we had received about twenty solutions. This is not many, and roughly half of them beat 52.5% accuracy on the public test sample (52.5% is the accuracy of our own baseline, in which we naively fed everything into an LSTM network). At the second stage we released the private part of the test sample.
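For context, a naive baseline of this kind might look roughly as follows; this is a minimal sketch, and the layer sizes, sequence length, and loss function are assumptions rather than our actual baseline.

```python
import numpy as np
from tensorflow.keras import layers, models

SEQ_LEN, N_FEATURES, N_CLASSES = 100, 256, 7   # assumed shapes, not the real ones

model = models.Sequential([
    layers.Masking(mask_value=0.0, input_shape=(SEQ_LEN, N_FEATURES)),
    layers.LSTM(128),
    layers.Dense(N_CLASSES, activation="sigmoid"),   # per-emotion probabilities
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["binary_accuracy"])

# X: (n_episodes, SEQ_LEN, N_FEATURES) -- all channels concatenated per frame,
# y: (n_episodes, N_CLASSES)           -- annotator labels.
# model.fit(X, y, epochs=10, batch_size=32)
```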
We asked participants to send us the predictions, the model itself, and a short report with guiding questions about what the model consists of, what was done with the features, and so on. The leader on the public part of the sample submitted his solution quickly and held his position until the end of the competition. We opened the report he sent hoping to learn something we did not know ourselves.
It turned out that the channels should simply be combined head-on, the gaps in the data should be replaced with the median value, and the model should be XGBoost with a magic number of trees. We immediately checked it on both test samples, and everything matched. In our own experiments we had, for example, trained generative networks to predict the gaps in one channel from the others (it worked relatively poorly), and here the answer turned out to be simple all along. So much the better. In any case, a lot of work went into the features and post-processing, but we had hoped for more in terms of handling the gaps and fusing the channels.
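In hedged form, that recipe boils down to something like the sketch below; the hyperparameters, including the number of trees, are placeholders rather than the winner's values, and the input arrays are hypothetical.

```python
import numpy as np
import xgboost as xgb

def fuse_channels(channel_features):
    """Head-on feature-level fusion: concatenate per-second features of all channels."""
    return np.hstack(channel_features)

def impute_median(X, medians=None):
    """Replace missing values (NaN) with per-column medians computed on the training set."""
    if medians is None:
        medians = np.nanmedian(X, axis=0)
    X = X.copy()
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = medians[cols]
    return X, medians

# train_channels / test_channels / y_train are hypothetical placeholders for the prepared data.
# X_train, medians = impute_median(fuse_channels(train_channels))
# clf = xgb.XGBClassifier(n_estimators=500, max_depth=6, learning_rate=0.05)
# clf.fit(X_train, y_train)
# X_test, _ = impute_median(fuse_channels(test_channels), medians)
# predictions = clf.predict_proba(X_test)
```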
At the second stage, two of the submitted solutions were obviously flawed, since their metrics on the public and private tests differed by 0.3. The rest kept us company for the following week while we took them apart completely: we had to analyze the reports, run the scripts, and compare the results obtained on our side with the predictions that had been sent in.
Checking the solutions
The analysis of the methods was not difficult: the solutions differed from one another in the preprocessing of the features, the model architectures, and a few minor details. Almost all of them ran out of the box exactly as they should. Curiously, the third and fourth positions swapped places: working with the features turned out to be more effective than optimizing the neural network architecture. The fourth-place solution had apparently overfitted to the public test sample, which cost it 6% on the private one.
The real fun was checking second place. The team's solution consisted of an ensemble of 22 heavy LSTM networks. We could not reproduce the results on the first attempt, but by about the fifth the puzzle came together. Next time we will certainly add a limit on the running speed of the algorithms.
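For reference, decision-level fusion of such an ensemble amounts to little more than averaging the per-model predictions; in the sketch below the build_lstm helper and the list of configurations are assumptions for illustration.

```python
import numpy as np

def ensemble_predict(models, X):
    """Average the per-class probabilities of all models in the ensemble."""
    preds = [m.predict(X) for m in models]    # each: (n_samples, n_classes)
    return np.mean(preds, axis=0)

# models = [build_lstm(**cfg) for cfg in configs]   # e.g. 22 differently configured LSTMs
# fused = ensemble_predict(models, X_test)          # this is the step that ran ~30x slower than real time
```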
The final rating and prizes were distributed as follows:

It is funny that two completely different approaches, in first and second place, differ only in the fourth decimal place. This is, of course, a coincidence rather than a sign of fierce competition, but still.
On the other hand, our hopes regarding the handling of missing data and the fusion of channels did not materialize. The participants paid almost no attention to this problem, focusing instead on feature processing and architecture. Still, this is invaluable experience, and we will take all its nuances into account when preparing our next contests.
Summary of the winning solutions:
1st place: tEarth
Model: lightXGB
Channel aggregation: feature-level
Missing values: replaced with the mean.
Features: for each second the medians of the features are computed and the model is trained on this data + post-processing of the predictions + training only on highly consistent data (aas >= 0.8).
2nd place: 10011000
Model: an ensemble of LSTM models trained with different parameters, 22 in total
Channel aggregation: decision-level
Missing values: filled with 0
Features: computing the predictions takes more than a day for each of the test sets (~30 times slower than real time).
3rd place: lechatnoir
Model: a plain LSTM
Channel aggregation: the eye and face channels are combined, some of the Kinect features are discarded, then decision-level
Missing values: filled with 0
Comments from the jury members
Alexey Potapov, Doctor of Technical Sciences, Professor at the Department of Computer Photonics and Video Informatics at ITMO:
“First, I will note that the simplicity of the methods used by the participants (XGBoost, an LSTM ensemble) is in fact relative: some time ago they would have seemed quite complicated. So I would say that these are simply modern, well-established methods whose use does not require excessive R&D effort, for which the participants had no time under these conditions. Even in longer competitions such an outcome is quite typical. In that sense it is quite natural, although a “miracle” in the form of an original model could have happened. It should be emphasized that this result does not mean that these methods, in this formulation of the problem, are really close to optimal (that is not ruled out either, although it is doubtful).
As for the nearly equal accuracy of the first two places, this is curious, although agreement to the fourth decimal place looks like a coincidence. It may be due to a property of the data itself: in principle, no noticeable improvement can be achieved because the classes are not separable on the provided features. But a priori I do not believe that. For example, on ImageNet the best solutions before the successful application of convolutional networks in 2012 were very close to one another, but then deep networks beat all of them by a large margin. It seems to me that a similar situation is possible in this problem as well. Perhaps it can be achieved with the help of the same generative models (although expecting such a breakthrough within a month-long competition is somewhat optimistic). But why such different models still give similar quality, I find it difficult to answer.
What other methods could have been used? Obviously, other recurrent cells (GRU, pseudo-recurrent networks, and so on), or convolutional networks again, although it would of course be more promising to develop original models.”
Pavel Prikhodko, Ph.D., Skolkovo Institute of Science and Technology, IITP RAS:
“The fact that the leading solutions do not show much overfitting is a good sign. How well they generalize is an open question; one would need to try (at least on source data that has been distorted or modified in some way). Most likely, a gain in accuracy could be obtained by solving the problem end-to-end, computing convolutional-network features from the video and feeding them into an LSTM (especially since the features were already produced by convolutional networks). Besides, the labeling of emotions strictly speaking depends on the expert, and each expert may have their own threshold for recognizing an emotion. There are regions of the feature space where different people will hear different emotions. It seems to me that this is an essential property of the problem, and a multi-label formulation is worth considering. First of all, I would look at generative models or other approaches (for example, varying the speaker's parameters) as a way to greatly increase the size of the training set and make the model more resilient to perturbations of the data. 7 hours is still not a large dataset in terms of volume, so increasing it by an order of magnitude could give a noticeable gain in quality.”
The text was prepared by Gregory Sterling, mathematician, expert in machine learning and data analysis at Neurodata Lab.