
Conferences on the same topic can look completely different, and when an event is brand new, it is hard to know in advance what to expect. If a conference is devoted to “big and smart data”, does that mean it is aimed at giant companies and employees of small ones have no business there? And will it lean so heavily toward data science that people without an academic degree are better off staying away?
Ahead of the SmartData conference, which will be held for the first time in St. Petersburg on October 21, we decided to find out and put our questions to two members of its program committee: Vitaly Khudobakhshov (Odnoklassniki) and Roman Poborchiy (p0b0rchy). They dispelled many concerns, and the conversation turned out to be not only about the conference but also about the state of the industry: what is happening in machine learning right now, why small companies are getting into data mining, and why managers are buying tickets to a technical conference on big data.
JUG.ru: The list of topics on the site includes machine learning, and this field seems to be booming right now. Could the conference talks become obsolete while they are still being prepared?
Vitaly: In fact, the technology is not changing all that fast. More importantly, most companies today are playing catch-up.
There is a “cutting edge”, like DeepMind, which tells no one anything but keeps doing things. But even they often do things that are not that difficult; with a large budget they simply do not have to sweat it, and can afford to bang their heads against the same wall for a long time before it pays off.
And, of course, not everyone can afford to invest that much. But, first, there is now a lot of well-developed open-source code that is ready to use, and second, a lot of information is available. So most people are only now starting to use all this. If you look, for example, at NVIDIA's share price over the past three years, it becomes clear that deep learning is only now really taking off. Judge simply by the demand for graphics cards: cryptocurrency has clearly affected it, but sales of graphics cards for deep learning have already surpassed sales of graphics cards for gaming. That is a good indicator that deep learning, for all the hype, is something that really works.
At Odnoklassniki we first tried deep learning a year and a half ago, and now, when students come to us and say, “Oh, do you have some hardware so we can train a net?”, we pull a Tesla P40 out of our pocket. A few years ago, a paper about training a network to play classic Atari 2600 games was published in Nature, a very serious journal; today a student at, say, MIPT can write a model that plays the same games better. There is really nothing that complicated about it: once someone has done it, anyone can repeat it, and many can even do it better.
In other words, objectively there are no enormous technological shifts under way; the essentials are already known and accessible, and the question now is how well the audience can absorb and adopt all of it. That is exactly what the conference helps with.
Roman: I would like to speak not about neural networks specifically but about “big and smart data” as a whole: there is a very wide gap in how far a given technology has been adopted. Some things have been in use in some places since 2008, for example, while elsewhere, oddly enough, they are only now being discovered. Even though everyone seems to follow the papers, I see that the real gap across the industry is very large.
JUG.ru: Well, surely not everyone follows the papers. There are many small companies that do not aspire to world domination or revolutionary innovation; they do fairly standard things and do not really follow the cutting edge. Will they find something useful at SmartData or not?
Vitaly: They will, actually.
Essentially, the conference shows that a certain range of problems can be solved in a certain way. We are not doing string theory here, after all. So even companies that are not very advanced may be able to adopt something, and gain some advantage in the market as a result.
Because data mining today is nothing other than a market advantage. It lets you be better. And for a small company, becoming better cheaply is very valuable. Look at it this way: you can hire some smart person to decide what to sell to whom. Or you can download a library and train a random forest model, which will do it better.
There is a wonderful, very old story about how the whole idea of recommendations appeared at Amazon. Once upon a time, Amazon had a staff of experts who wrote recommendations by hand. Then a student simply came in and wrote a collaborative filtering algorithm, and everyone was fired because the algorithm was just better.
I have a whole series of talks in which I show students and data mining beginners (and not only beginners) how to do item-to-item collaborative filtering on MapReduce in a single slide. I show that anyone can do it. And this is what we want to convey: you do not have to be Yandex, Mail.Ru Group or Google. Of course, it is very cool when you are Google. But it is also very cool that we can take the open-source code of some big companies and benefit from it: use such an algorithm in day-to-day work and gain an advantage in the market, even if your company has five people. That is very much our audience.
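To give a feel for how little code item-to-item collaborative filtering needs, here is a minimal sketch in plain Python that imitates the map and reduce phases in memory. It is not the code from Vitaly's slide and does not run on Hadoop; the toy data, the function names (`map_phase`, `reduce_phase`, `item_similarities`, `recommend`) and the choice of cosine similarity over binary "user bought item" vectors are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt

# Hypothetical toy data: user -> set of items the user bought.
PURCHASES = {
    "alice": {"book", "lamp", "pen"},
    "bob": {"book", "pen"},
    "carol": {"lamp", "pen"},
    "dave": {"book"},
}


def map_phase(purchases):
    """Emit ((item_a, item_b), 1) for every pair of items one user bought."""
    for items in purchases.values():
        for a, b in combinations(sorted(items), 2):
            yield (a, b), 1


def reduce_phase(pairs):
    """Sum the emitted counts per item pair, as a reducer does per key."""
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return counts


def item_similarities(purchases):
    """Cosine similarity between items over binary user vectors."""
    popularity = defaultdict(int)  # item -> number of users who bought it
    for items in purchases.values():
        for item in items:
            popularity[item] += 1

    sims = defaultdict(dict)
    for (a, b), together in reduce_phase(map_phase(purchases)).items():
        score = together / sqrt(popularity[a] * popularity[b])
        sims[a][b] = score
        sims[b][a] = score
    return sims


def recommend(item, sims, top_n=2):
    """Return up to top_n items most similar to the given item."""
    ranked = sorted(sims[item].items(), key=lambda kv: (-kv[1], kv[0]))
    return [other for other, _ in ranked[:top_n]]


SIMS = item_similarities(PURCHASES)
print(recommend("book", SIMS))
```

On this toy data, `recommend("book", SIMS)` ranks `pen` ahead of `lamp`, because `book` and `pen` share two buyers while `book` and `lamp` share only one. In a real MapReduce job the same two functions would become the mapper and the reducer, with the popularity counts computed in a separate pass.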
JUG.ru: Since data science is listed among the topics, I would like to clarify: how much of the conference will be “academic”, and how much “industrial”?
Vitaly: There are two different storylines here. One is big data and smart data in production. This is what people who go to the Joker conference, for example, are used to. When you work with production every day, not much of data science actually gets used: linear regression, logistic regression, deep learning.
The other storyline is what is being done in theory, as well as in one-off tasks: do an analysis, get a result, and hand that analysis to the boss, who looks at it and makes a decision.
There is a giant gap between what has to be done automatically every day and what can be done once. You need to understand that these are two completely different situations and two different audiences, and that what runs in production today was worked out on paper 30, 40 or 50 years ago, maybe even 100 to 200 years ago. At the same time, what will be in production tomorrow is what is being worked out on paper now.
We, the program committee, start from the practical benefit to conference participants. So we ask the speakers: “Is what you want to tell and show us already in production, or is it something you have only worked out on paper?”
At the same time, I personally believe both storylines matter. You should not write off hardcore science simply because people who currently write Java have never heard of it. What is hardcore now is real value tomorrow.
Opinions on this differ within the program committee. It is a question of presentation: how far can hardcore material be made accessible to the public, so that it can be understood not only by the narrow group who come specifically for it, but also by people who, roughly speaking, just want to write their MapReduce on Hadoop. Even if there are fewer formulas than there could be, the potential value will be clear, and anyone who wants more detail can open the paper and read it.
JUG.ru: And how do you see the conference audience?
Vitaly: We see them as real professionals who come from practice, from programming, and who want to absorb the culture of data science and data mining, and perhaps introduce it in their own company.
In addition, many large companies have an R&D department: people who know much more than ordinary developers and who work on things that do not reach production, or not right away. I myself, for example, am effectively in the R&D department at Odnoklassniki. People usually come to me with a question nobody knows the answer to. Such people are, of course, also our audience, even if they are a minority. They may not be very interested in production topics, but they do want to hear the talk by Alexey Potapov or other well-known scientists who look ahead in data analysis and artificial intelligence.
Beyond that, our audience also includes managers who want to get into the subject and learn how to set tasks. After all, tasks usually come from above. To get any data mining done, a manager has to understand which problems can be solved by data analysis at all and which cannot, and then come to the engineers, to the data miner, and discuss it with them. Banking, for example, is an ultraconservative business where data mining could be useful, but its management knows little about data mining. Banks do have algorithmic trading, which is a slightly different story, but in general they are very conservative.
Some managers have already bought tickets; they told me so themselves. Perhaps not every manager will be interested, because many simply do not realize that this is important. But many do.
JUG.ru: The mention of managers may surprise many people, because JUG.ru Group conferences are not usually associated with them. Are the talks themselves aimed primarily at techies? Won't they have to be adapted so that managers can follow them?
Vitaly: Primarily at techies, of course. But you need to understand that this is not Joker. There will be no Shipilev with his “right now we put on gloves, crawl into the guts of the JVM and see what is in there.” We are talking about real cases with real data. Such problems have, roughly speaking, an engineering component and a domain component, and we are doing everything we can to make the talks lean toward the domain side.
Roman: I would add this: the layer of people who work with machine learning in some way, or who face the problem of truly large amounts of data, is currently much thinner than the layer of Java programmers. So in the Java world, even if you pick a narrow subset of Java programmers and hold a conference just for that segment, you can still gather a large audience. In our case, it seems more logical to include a wider variety of things for different people. Besides, we are still studying the audience; this is our first time. Once the conference has taken place, we will see what the audience is like, using the data methods we know.
JUG.ru: Roman, many people know you as a speaker coach; you have already reviewed talks on Habr, explaining things like “laser pointers are evil”. Since you are on the SmartData program committee and are reviewing the talks, do you, in addition to the content, make sure there are no laser pointers anywhere?
Roman: Yes, I eradicate them wherever I find them; really, things are better without them. But laser pointers aside, there are other things. For instance, I see that the wonderful codeware slide deck on how to present code on slides has been making the rounds. Of course, we will try to ensure that the code in all presentations is formatted accordingly. We will also try to make sure that speakers do not forget to say what problem they are solving and what recommendations they want to leave the audience with at the end, and that everything they put on the slides is more or less relevant and legible. Those things, yes, of course we will try to do.
JUG.ru: It has become clearer what will be at the conference. A final question: what will not, and cannot, be at SmartData?
Roman: We are trying very hard to filter out bullshit. If you want, you can say a lot of impressive words about big data without saying anything of substance. We are for specifics.
Vitaly: There are two types of unsuitable talks. One is bullshit, and the other is when, in the fight for the fifth decimal place, people forget about practical benefit and do stacking and blending for the sake of stacking and blending rather than to achieve specific goals.
Both are bad because they are divorced from reality. And we want the conference to be connected with reality.
SmartData will take place on October 21; tickets are already on sale on the conference website (and they get more expensive over time). Its main topics are:
- Data and its processing (Spark, Kafka, Storm, Flink)
- Storage (databases, NoSQL, IMDG, Hadoop, cloud storage)
- Data science (machine learning, neural networks, data analysis)