
How do you write about a conference whose key word was "data"? For this text about SmartData, held in St. Petersburg, we decided it would be interesting to use specific numbers as headings. The resulting data is very heterogeneous: a neural network is unlikely to extract anything useful from it, but you might.
0
That is the number of previous SmartData conferences: this was the first. For the preparation, this meant that we could not rely on feedback from a previous event. When organizing conferences, we carefully compare the ratings given to the talks and read the reviews to move in the right direction - you could say we run on reinforcement learning. But the first time there is always a lack of data, and you have to experiment. Did the experiment succeed? You can decide for yourself, and we will give examples of what could be heard at the conference.
20700
That is how many views the August Habr post by Vitaly Khudobakhshov collected; the post formed the basis of the opening keynote at SmartData. The observation that people with different names are lonely at significantly different rates is strikingly counterintuitive: you immediately want to object. But the talk differed from the post precisely because it took into account the objections raised after publication: it turned out that the first explanations that come to mind, like "this pattern was distorted by bots", are not confirmed.

As a result, the first talk of the conference turned out to be both amusing and quite serious about its subject: there was something to laugh at and something to think about.
2001
That is the number of GitHub stars for the CatBoost library, as announced in the talk by Anna Veronika Dorogush. By now the number has already grown to 2041 (perhaps the audience of the talk contributed to that). As the talk also noted, the InfoWorld website recently included CatBoost in its list of the best tools for machine learning.

In general, while there is a lot of public hype around neural networks, professionals are also actively interested in gradient boosting, knowing that it can be much better suited to heterogeneous data. And the two work even better together: among other things, the talk mentioned that Yandex uses neural networks and gradient boosting in combination to achieve the best results.
2
That is how many gradient boosting libraries Alexey Natekin compared: in the talk "Cards, Boosting, Two Chairs" he examined XGBoost and LightGBM and concluded that using the GPU for gradient boosting is impractical. And perhaps limiting himself to two played a cruel joke on him. The same Anna Veronika Dorogush came to his talk and, during the Q&A, began to argue convincingly from experience with a third library: "Let's start with the implementation on several cards. About LightGBM, I agree that it is very hard to get running, but offhand: CatBoost on several cards ...".

The discussion started right after the talk and then moved to the discussion area. It is always interesting when a lively debate breaks out, rather than everyone just nodding and going their separate ways!
TWO MILLION EIGHT HUNDRED FIFTY THOUSAND, KARL!
Here we are quoting Ivan Drokin's slide verbatim. He described how many rubles a particular project would have had to spend if, instead of computer vision, human annotators had been used to determine the position of parts on the work surface.

He then moved on to the main content of the talk, showing that if there is no suitable dataset of real photos at hand to train on, artificially generated ones can be used instead. But the remark about money is telling in its own way. Work with "big and smart data" is partly connected to the academic world, and Alexey Potapov from ITMO, for example, spoke at SmartData, but the conference did not turn into a scientific symposium detached from mundane things like money. Much of it was devoted not to abstract data in a vacuum but to real industrial problems, where both the size of the dataset and the size of the budget matter.
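To make the idea of training on artificially generated images more concrete, here is a minimal sketch. It is not the pipeline from the talk: the "part" is just a white rectangle on a black background, and every name in it (`make_sample`, `make_dataset`, the image size and part size) is a made-up assumption for illustration. The point is only that when you generate the picture yourself, the label (the position of the part) comes for free and nobody has to annotate it by hand.

```python
import random
from typing import List, Tuple

Image = List[List[int]]          # grayscale pixels: 0 = background, 255 = part
Box = Tuple[int, int, int, int]  # label: (top, left, height, width)


def make_sample(rng: random.Random, size: int = 32,
                part_h: int = 6, part_w: int = 10) -> Tuple[Image, Box]:
    """Draw one rectangular "part" at a random position; return image and label."""
    # randint is inclusive on both ends, so the part always fits in the image.
    top = rng.randint(0, size - part_h)
    left = rng.randint(0, size - part_w)
    image = [[0] * size for _ in range(size)]
    for row in range(top, top + part_h):
        for col in range(left, left + part_w):
            image[row][col] = 255
    return image, (top, left, part_h, part_w)


def make_dataset(n: int, seed: int = 0) -> List[Tuple[Image, Box]]:
    """Generate n labeled synthetic samples; the seed makes the set reproducible."""
    rng = random.Random(seed)
    return [make_sample(rng) for _ in range(n)]


dataset = make_dataset(100)
```

A realistic generator would render the parts with varying lighting, textures and noise so that a model trained on the synthetic set transfers to real photos; the sketch leaves all of that out.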
14,000,000,000
records are delivered to the feeds of Odnoklassniki users every day. It is not surprising that the company pays great attention to the question of how to build a news feed in the best possible way. At the conference, Dmitry Bugaychenko talked about the technical implementation: machine learning is used there, but the talk was not specifically about it but about everything around it, and words like Hadoop were heard. They may sound less impressive and high-level than "neural network", but, again, for practitioners they mean no less, and according to audience ratings the talk took second place at the conference.

We immediately had a question for Dmitry, not about the main content of the talk but about the following. For an ordinary social network user, such a news feed can be a great improvement over a simple chronological one. But advanced users often say "I have already subscribed to exactly what I want in my feed, do not interfere with your clever algorithms" and constantly switch the Facebook feed from "top stories" to "most recent". Does Odnoklassniki want to introduce a button that lets you opt out of all this machine learning?
Dmitry's answer was this: introducing "buttons for advanced users" carries the risk that less advanced users will press them too and then suffer. So a more correct approach would be for the system itself to understand which users need the most straightforward feed. Identifying such users is not a trivial task, but that is the direction they want to move in.
2
Would you get two Earth-to-Mars distances if you burned the data humanity generates in one day onto DVDs and stacked the discs in a pile? Between talks, attendees could take a break from the flow of complex information with something more entertaining. The conference was sponsored by Sberbank-Technologies, EPAM and First Line Software, and the third of these collected similar facts about big data at its booth, inviting visitors to guess "true or not". You can try too - we hide the answer under the spoiler.
Hidden text: No. At First Line we were told that the distance would actually be that from the Earth to the Moon, not to Mars. You can guess that the statement is false if you recall that the distance between the Earth and Mars (unlike the Moon) varies greatly.
3.5
speakers from Yandex were at the conference. First, the already mentioned Anna Veronika Dorogush.
Second, Artyom Grigoriev, who talked about crowdsourcing based on the experience of Yandex.Toloka. The word "toloka" (a form of village mutual aid, that is, a kind of crowdsourcing) is listed on gramota.ru with the stress on the second syllable, while the representative of the service of the same name stressed the last one. We are now trying to figure out which option is correct and whether crowdsourcing can help us find out.
Third, Vladimir Krasilshchik, who spoke about the "right structure" of a banking system, drawing on his pre-Yandex experience as well. His idea that every event must be stored with four timestamps ("the time when the event happened, the time when we learned about it, and two marks for the interval in which it applies") drew objections from some of the audience, so there was an active discussion with him after the talk.
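The four-timestamp idea is easier to discuss with something concrete in front of you. Below is a minimal sketch of how such a record could look; it is an assumption-laden illustration, not the design from the talk. The field names, the `visible_as_of` helper, the half-open validity interval and the sample "rate" event are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass(frozen=True)
class EventRecord:
    """One event carrying four timestamps (field names are hypothetical)."""
    payload: str
    occurred_at: datetime          # when the event happened
    recorded_at: datetime          # when we learned about it
    valid_from: datetime           # start of the interval in which it applies
    valid_to: Optional[datetime]   # end of that interval; None means open-ended


def visible_as_of(events: List[EventRecord], known_at: datetime,
                  effective_at: datetime) -> List[EventRecord]:
    """Events we already knew about at `known_at` that apply at `effective_at`."""
    result = []
    for e in events:
        known = e.recorded_at <= known_at
        started = e.valid_from <= effective_at
        # Assumption: the validity interval is half-open, [valid_from, valid_to).
        not_ended = e.valid_to is None or effective_at < e.valid_to
        if known and started and not_ended:
            result.append(e)
    return result


# Made-up example: a change that applies to October but was recorded on Oct 5.
rate_change = EventRecord(
    payload="rate changed",
    occurred_at=datetime(2017, 10, 2),
    recorded_at=datetime(2017, 10, 5),
    valid_from=datetime(2017, 10, 1),
    valid_to=datetime(2017, 11, 1),
)

# On Oct 3 we did not know about it yet; on Oct 6 we did.
before = visible_as_of([rate_change], datetime(2017, 10, 3), datetime(2017, 10, 10))
after = visible_as_of([rate_change], datetime(2017, 10, 6), datetime(2017, 10, 10))
```

Keeping "when we learned about it" separate from "when it applies" is what lets the same query give different answers for different moments of knowledge, which is useful when you need to reproduce what the system believed on a given day.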

And what about the "half" speaker? Ivan Yamshchikov works with Yandex as an "external consultant", so it is not quite clear how to count him. But the audience reviews are clear: his closing keynote about "creative and artistic intelligence" was the one the audience liked most. Here, as with the opening keynote, it turned out to be "both fun and serious". When words about "exploring the space" are illustrated with photographs of children squeezing dogs, everyone gets the point. And when it is said of such a photo that the feedback there will obviously soon change from positive to negative, it draws laughter.

And when Ivan played music from his own project Neurona, where a neural network composed lyrics in the spirit of Kurt Cobain, it came across as very accessible. But none of that negates the seriousness of the work behind it, or the fact that humanity has gained new capabilities (even if we do not consider them "real work").
While the lead of Ivan's talk is obvious, the rest of the audience feedback (which is still coming in) has yet to be properly processed and turned into conclusions. And when we do that, the next SmartData will be even smarter than the first!
