The Open Data Science (ODS) community is already known on Habr for its open machine learning course (OpenML). Today we talk with its creator about the history of ODS, its people, and the most popular machine learning methods (according to Kaggle and industry projects). Interesting facts and technical details are below the cut.
Alexey Natekin (natekin). Founder of several machine learning and data analysis projects. Dictator and coordinator of Open Data Science, the largest online community of data scientists in Eastern Europe.
- How did the idea of creating the community come about? How did it all start? Why did you choose Slack, and who was there at the beginning?
Alexey Natekin: It was 2014, when the third cohort of our free educational program at DM Labs (DM: Data Mining) was coming to an end. We had already covered the material and wanted to move on to working on projects together. That was the first iteration of fun tasks: analyzing porn tags, detecting depression from social networks, working with game data from Dota. From the very start, the idea was that the projects should be open and should attract not only course participants but also more enthusiastic people to the movement.
Having experimented with makeshift chats on VKontakte and a self-made WordPress site, we found that none of it was suitable for real work. Fortunately, at the beginning of 2015 Slack was slowly becoming known, and we adopted it as the platform for projects and communication. It turned out to be very convenient, and we settled on it almost immediately.
At that point we didn't think much about wrapping all this in some beautiful ideology, and we simply called the whole thing Open Data Science. The phrase appeared only in the name of one conference (the Open Data Science Conference), and the combination of open science and collaborative education in DS (Data Science), teaching others while learning something yourself, was a good foundation and exactly what we wanted. After all, Open Data and Open Science already existed; it only remained to invent and implement the missing link.

We started as a project chat, in which many expert and technology channels for DS discussions quickly emerged. You could say the focus shifted toward a local, Russian-language Stack Overflow. The projects lived their own lives, while the activity moved into general discussions.
Fortunately, we quickly accumulated a critical mass of expertise in key DS-related areas and technologies, and the first core of people formed at ODS. At any moment you could ask about whatever area interested you and get good advice from someone who actually understood it.
At that time, professional DS communities in the form they exist in now either didn't exist at all, were the audience of some recurring meetup on a single topic, or were closed and strongly tied to a specific place (for example, to students of one university).
We were in favor of joining forces from the start, so we began to integrate with meetups and various groups: the Moscow Independent DS Meetup, ML trainings (which started before ODS, after the SNA Hackathon in St. Petersburg), meetups on R and Big Data, and then DeepHack, the Deep Learning Meetup, the DS Meetup from Mail.ru, and many others.
An interesting fact: at one of the winter get-togethers we found out that the paid subscription for that meetup was running out, and so that yet another crowd of Big Data spammers wouldn't get their hands on it, we hurriedly chipped in to pay for the orphaned Mail.ru MDSM account; the payments are still coming off my card :)
Historically, the most active people with whom we built ODS were themselves event organizers. So we not only helped each other with speakers, PR and organizational issues, but also quickly began inventing and running new events and formats. Among them are DataFest, the largest DS conference in Eastern Europe and the CIS; the DS Case Club, an invaluable series of events about the real benefits of DS for business; and data breakfasts, with their unusual but very popular format.
And, of course, we cooperated with companies: for example, with Yandex we ran the Data & Science series of events, and with Sberbank, the Sberbank Data Science Journey. At the last count we have accumulated more than 20 regular events across the CIS.
Expansion wasn't long in coming: we shared our experience and launched events and DS development in other cities and countries. First Moscow and St. Petersburg, then Yekaterinburg, Novosibirsk, Nizhny Novgorod and Kaliningrad, then Ukraine with Kiev, Lviv and Odessa, Belarus with Minsk, and more and more new cities across the CIS.
The admin and organizing team now has 35 people from 4 countries. There are active participants holding meetups in the USA, Germany, Norway and Israel, but we are still working on going global. We already have 7.5 thousand people in Slack across 20 time zones, more than 3 thousand of whom log in at least once a week. So there is global potential.
Do we have analogues and competitors? It would be very cool if there were at least one analogue in the world of what we grew into; we would cooperate with it. Unfortunately, in the USA, which is considered the leader in DS/ML, there is nothing analogous to us in spirit, and it is unlikely to appear.
Meetups there are cluttered with paid marketing noise, and local communities are very tightly tied to universities and companies (a separate get-together at Google, a separate one at Amazon, a separate one at Facebook, and so on). And at a Machine Learning Meetup it is hard to find people who actually do machine learning: out of 100 attendees you typically won't find even 10; the rest are the casually interested, onlookers, evangelists and PR people. At specialized conferences, though, the level really is world-class, and there are almost no random people.
The AI Researchers Slack, created right after last year's NIPS conference, gathered 1,000 people over 9 months and even had Ian Goodfellow drop by in its first week, yet it is essentially dead: it has 24,000 messages in total. We write 30,000 posts per week, and in total we have almost 1.5 million. There is KDnuggets, as a kind of DS blogging platform. Arguably the largest DS community communicates on Kaggle (I would not be surprised if we have more messages than they do). But we have not yet seen an analogue that combines the site, the events, and other initiatives such as education.
- The phrase "to make xgboost" has turned into a meme, so what is xgboost and why should it be made?
Alexey Natekin: Xgboost is a specific implementation of the gradient boosting algorithm. The algorithm itself is used everywhere as an extremely powerful general-purpose model: in production, including the search engines of Yandex, Mail.ru, Microsoft and Yahoo, and on competitive platforms like Kaggle. Xgboost is, first of all, very efficiently written code that lets you train models faster and more productively. And second, xgboost adds extra goodies and regularization so that the trees themselves are more stable.
Xgboost itself is good enough to be used without any trickery. On Kaggle and beyond, it has become the no-brainer solution that you take off the shelf knowing it will usually give a very good result. So there you have it: a free lunch, of course, does not exist (note: a reference to the No Free Lunch theorem), but xgboost is worth trying on your next task.
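As a minimal sketch of what "just taking xgboost off the shelf" looks like (the synthetic dataset and parameter values here are purely illustrative assumptions, not a recommendation):

```python
# A minimal, illustrative sketch: train an out-of-the-box xgboost classifier
# on synthetic data. Assumes the xgboost package is installed.
import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=10_000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Near-default settings already tend to give a strong baseline.
model = xgb.XGBClassifier(n_estimators=300, max_depth=6, learning_rate=0.1)
model.fit(X_train, y_train)

print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```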
However, on Kaggle the main goal is to get the best result on the leaderboard. Time and again, that best result came down to dozens of people fighting over the third, fourth and fifth decimal place, trying to squeeze out a little extra accuracy. It has long been known how to carefully scrape up such crumbs with the help of stacking and multi-level ensembles of algorithms.
If, very carefully and competently, you train a couple of dozen models instead of one, and then a couple of dozen more on top of their predictions, you can scrape off a bit more accuracy. The price is a much higher computational cost that no one in their right mind would pay in practice, but Kaggle never promised sound minds or realistic tasks.
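To make the idea concrete, here is a minimal two-level stacking sketch; the base models, dataset and fold count are illustrative assumptions, not anyone's actual winning pipeline:

```python
# Two-level stacking sketch: first-level models produce out-of-fold predictions,
# and a second-level model learns to combine them.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, cross_val_score

X, y = make_classification(n_samples=2000, n_features=30, random_state=0)

# First level: several diverse base models.
base_models = [
    RandomForestClassifier(n_estimators=200, random_state=0),
    GradientBoostingClassifier(n_estimators=200, random_state=0),
    LogisticRegression(max_iter=1000),
]

# Out-of-fold probabilities become features for the second level,
# which avoids leaking training labels into the meta-model.
meta_features = np.column_stack([
    cross_val_predict(m, X, y, cv=5, method="predict_proba")[:, 1]
    for m in base_models
])

# Second level: a simple model blends the base predictions.
meta_model = LogisticRegression()
print("stacked CV accuracy:",
      cross_val_score(meta_model, meta_features, y, cv=5).mean())
```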
Thus, the winning solution, the one that took first place, was often several layers of stacked xgboost. Xgboost, because it is powerful and fast. Stacked, because you have to take first place.
The phrase "to make xgboost" is, in essence, a jab at the senseless and merciless nature of Kaggle contests, many of which can be solved with brute force and a solution that is terrible from the point of view of practical usefulness, but that nevertheless wins on Kaggle.
- They say that xgboost shows excellent results in practical applications. Can you give an example where xgboost really is an order of magnitude stronger than its competitors? And is there a rationalization for why this happens on such data?
Alexey Natekin: It depends on what you consider a competitor. In general, gradient boosting sits under the hood of such a wide range of applications that it is hard to list them all: antifraud and antispam, all sorts of forecasts in financial companies, and the search engines of the largest companies, as I mentioned above. As for Kaggle tasks, despite their toy problem statements, most contests where the data is not very sparse and is not images are also usually won by gradient boosting plus some ensemble on top, especially when you need to squeeze out extra accuracy.
It cannot really be said that boosting is an order of magnitude stronger than its competitors, since no business application will tell anyone what results particular models achieved on real data. And the result of boosting itself is often not that far ahead of its closest competitors: random forest, a neural network, or an SVM.
It is simply that boosting is the most mature in terms of implementations and is a very flexible family of methods, which makes it easy to tune boosting to a specific task. As for the rationalization of why exactly boosting works and what the trick is, I can recommend a couple of my tutorials (1, 2) and one by Alexander Dyakonov.
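If you want to see for yourself how small the gap usually is, a quick sanity-check sketch along these lines will do; the dataset and model settings are illustrative assumptions, and the ranking will vary from task to task:

```python
# Compare boosting with its closest competitors (random forest, SVM, a small
# neural network) on one dataset via cross-validation.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

models = {
    "gradient boosting": GradientBoostingClassifier(random_state=0),
    "random forest": RandomForestClassifier(n_estimators=300, random_state=0),
    "SVM (RBF)": make_pipeline(StandardScaler(), SVC()),
    "neural net (MLP)": make_pipeline(StandardScaler(),
                                      MLPClassifier(max_iter=2000, random_state=0)),
}

for name, model in models.items():
    score = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name:20s} CV accuracy: {score:.3f}")
```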
- In the announcement of your talk at Smartdataconf it is noted that people expect the same GPU speed-up from xgboost as from neural networks. Can you explain intuitively how neural networks get such a performance boost?
Alexey Natekin: How does your average data satanist, a Stakhanovite of intellectual labor, reason? There are neural networks; cool hardware and top-end video cards with thousands of specially optimized cores were built for them. And Nvidia itself says: our hardware will advance and accelerate AI, whatever that means. Hence the unrealistic expectation that the GPU can be used across a much wider range of tasks.
In neural networks, the bulk of the operations performed both during training and during prediction is matrix multiplication. These days it is more honest to say it is work with tensors, but the essence is the same: you need a lot of routine matrix operations.
GPUs are ideal for this: you get many more cores at a lower price and power consumption, and the lack of an extended instruction set can safely be ignored.
The flip side is that sorting arrays and doing recursive computations is hard on the GPU, while computing convolutions and multiplying matrices is very cheap. This raises the obvious question: what happens if we try to train the decision trees commonly used in practice on the GPU? And how meaningful is that idea by design, as they say?
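To give a feel for the scale of the difference, here is a tiny timing sketch (assuming PyTorch and a CUDA-capable card; the exact numbers depend entirely on your hardware):

```python
# The same matrix multiplication, timed on CPU and on GPU.
import time
import torch

n = 4096
a_cpu = torch.randn(n, n)
b_cpu = torch.randn(n, n)

start = time.time()
a_cpu @ b_cpu
print(f"CPU matmul: {time.time() - start:.3f} s")

if torch.cuda.is_available():
    a_gpu, b_gpu = a_cpu.cuda(), b_cpu.cuda()
    torch.cuda.synchronize()          # make sure the copies have finished
    start = time.time()
    a_gpu @ b_gpu
    torch.cuda.synchronize()          # wait for the kernel before stopping the clock
    print(f"GPU matmul: {time.time() - start:.3f} s")
```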
- And why doesn't the same happen for xgboost and other machine learning algorithms, and what needs to be done to use the GPU for training them?
Alexey Natekin: The GPU is great for algorithms that are adapted to the GPU. As I already said, you are not going to build indexes on a GPU. For wider use of GPUs in machine learning tasks beyond neural networks, we need efficient GPU implementations of the algorithms. Or, more likely, new algorithms inspired by the existing CPU versions but redesigned for efficient computation on the GPU. Or we can wait for a wider release of Intel Phi, about which all sorts of legends circulate. But that is no longer a GPU, and it is a completely different story.
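For what it is worth, xgboost itself does ship a GPU-accelerated histogram tree method; a minimal sketch, assuming a CUDA-enabled xgboost build and an NVIDIA card (on a CPU-only build this parameter will simply fail):

```python
# Training xgboost with histogram-based tree construction on the GPU.
import xgboost as xgb
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=100_000, n_features=50, random_state=0)
dtrain = xgb.DMatrix(X, label=y)

params = {
    "objective": "binary:logistic",
    "max_depth": 6,
    "eta": 0.1,
    "tree_method": "gpu_hist",   # build tree histograms on the GPU
}
booster = xgb.train(params, dtrain, num_boost_round=200)
```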
- And finally, a question about hardware: which parameters should I pay attention to when buying GPU cards for machine learning? What are people at the top of Kaggle using now?
Alexey Natekin: In practice, people mostly buy the 1080 Ti, since it has the best balance of price, speed and 11 gigabytes of memory. Between the Titan X and the 1080 Ti they choose the latter. Datasets stopped fitting into memory long ago, so the more memory you have, the more you can stuff into processing at once. But in general everyone is waiting to see what the next generation of cards brings: as soon as they appear, they will have to be snapped up very quickly.
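Since memory is the parameter being stressed here, a tiny sketch for checking how much you actually have to work with (assuming PyTorch; any CUDA-aware library exposes the same information):

```python
# Print the name and total memory of the first CUDA device, if any.
import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"{props.name}: {props.total_memory / 1024**3:.1f} GiB total memory")
else:
    print("No CUDA device found")
```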
If you are as hooked on machine learning and data analysis as we are, you may find these talks at our upcoming SmartData 2017 conference interesting; it will be held in St. Petersburg on October 21: