How we took part in the OpenData hackathon

Hello everyone. In this article I want to tell you about Why So Serious Hack: what brought us there in the first place, how classic hackathons differ from contest-style hackathons, and what helped us win.

Hackathon as a contest

Let's start with how this event differed from a standard hackathon. At a typical hackathon, teams are asked to build a prototype of some idea that could later grow into a startup.

Lately, though, along with the hype around artificial intelligence and machine learning, contest-style hackathons have started to appear. Instead of pitching a product, teams compete for the best prediction accuracy on a dataset provided by the organizers.

How is this different from a Kaggle competition, then? The difference is that it matters not only whether you achieve the best accuracy, but also how you present the work: how the data was explored, what interesting patterns were found, and why particular algorithms were chosen.

Sometimes the jury also wants to hear how the proposed algorithm could be monetized, that is, why anyone in the real world would need what you built. As the organizers of this hackathon advised in the invitation letter: “Do not treat the hackathon as a sports programming competition. Even though creative tasks can be formulated very concretely, the teams that win are often the ones that put their soul into the project and rethought the task statement.” I think we managed to do that, and I'll write about it too.

Hackathon organization

So as not to bore everyone right away with stories about feature correlations and the metric used to evaluate predictions, first a few words about how the hackathon was organized.

We learned about the hackathon from our VKontakte feed. Despite the loud announcements that it was being organized by Open Government, ITMO University and the Institute for Law Enforcement at the European University, the venue was rather small and there were few teams (5-6, as far as I remember).

However, the announcement also said that the hackathon's technology stack was Linked Data, AI & ML. With the so-called AI & ML everything was clear (tree-based models worked well), but we knew nothing about Linked Data and were curious to try it. So we put together a team of five people from the Academic University and decided to take part.

The hackathon took place on Vasilyevsky Island, at ITMO University. The food was good (in my opinion, and compared with other hackathons): we ate in the ITMO canteen, where the kind staff advised us on which cakes and salads to take. To get to the canteen we had to go down six floors of the steep stairs typical of old St. Petersburg buildings, which let us stretch our legs after hours at a laptop. The venue itself also had a coffee point with fruit and cookies, for which thanks as well.

Case and solution

Now, finally, to the task we had to solve. In short: predict the risk category of an enterprise from features such as company category, TIN, region, and so on. By the way, one more point of karma for the organizers: solutions were checked through a Telegram bot (@WSSHackEvaluatorBot), which reported only your solution's score and your place in the overall standings. You could not see where the other teams stood or what scores they had. That kind of intrigue does not let you relax.

This work (assigning a risk category to an enterprise) is usually done by specially trained people. As everyone can already guess, the downside is that you pay salaries to meatbags while a machine could do the job and take the human factor out of it.

When you see tabular data, you immediately think that there are standard ways to process it and get a result. Of course, that doesn't work right out of the box. The data has to be cleaned and corrected. We had to clean up, for example, cases where the letter O appeared in a number instead of the digit 0, and to correct cases where a long, incomprehensible string of the form “PART 9 OF ARTICLE 9 OF THE FEDERAL LAW of December 26, 2008 N 294-, 2018-06-15 <...>” actually means fire safety.
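A minimal sketch of this kind of cleanup, for illustration only. The function names and the `BASIS_RULES` lookup table are hypothetical, and the single rule in it is taken from the one example mentioned above; a real table would have to be built by inspecting the dataset.

```python
# Letters that are easy to type instead of zero: Latin and Cyrillic "O"/"o".
_O_LIKE = str.maketrans({"O": "0", "o": "0", "\u041e": "0", "\u043e": "0"})


def fix_numeric_field(raw: str) -> str:
    """Replace O-like letters with 0, but only if the result is all digits.

    If the value still contains non-digits after the replacement, it was
    probably not a number at all, so the stripped original is returned.
    """
    stripped = raw.strip()
    candidate = stripped.translate(_O_LIKE)
    return candidate if candidate.isdigit() else stripped


# Hypothetical lookup: a marker substring -> a short, readable category.
BASIS_RULES = [
    ("PART 9 OF ARTICLE 9 OF THE FEDERAL LAW", "fire safety"),
]


def normalize_basis(text: str, rules) -> str:
    """Map a long legal reference to a short label; keep unknown text as-is."""
    upper = text.upper()
    for marker, label in rules:
        if marker in upper:
            return label
    return text
```

Guarding the replacement with `isdigit()` keeps the fix from corrupting text fields that legitimately contain the letter O.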

At the same time, we dropped the columns that seemed irrelevant to us. For example, we removed the inspecting body, because we thought the risk category could not depend on which organization did the inspection. It turned out we were wrong: different people work in different bodies and assign whatever categories they see fit. What a twist!

And now to the most interesting part: the Linked Data model, which we first met at this hackathon. At the very beginning the mentors told us: here is your dataset, of course, but most likely you will have to pull in open data from other sources.

For example, this is how we derived the postal code from the tax authority code, and the city from the postal code. From there you can take a couple more datasets and add the city's population and the ratio of urban to rural residents. It turned out that the new features correlated well with the target variable. Of course, this linked data approach had its problems too: many datasets are paid, or their APIs do not allow many queries.
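The chain of lookups described above can be sketched roughly like this. Everything here is an assumption made for illustration: the field names, the `enrich` function, and the shape of the three lookup tables (in practice they would be loaded from external open datasets).

```python
def enrich(record, index_by_tax_code, city_by_index, stats_by_city):
    """Add city-level features to one record via chained lookups.

    tax authority code -> postal index -> city -> population statistics.
    Any missing link in the chain yields None for the derived features.
    """
    out = dict(record)
    index = index_by_tax_code.get(record.get("tax_authority_code"))
    city = city_by_index.get(index)
    stats = stats_by_city.get(city)

    out["city"] = city
    if stats is None:
        out["population"] = None
        out["urban_to_rural"] = None
        return out

    urban, rural = stats["urban"], stats["rural"]
    out["population"] = urban + rural
    # Avoid division by zero for areas with no rural residents.
    out["urban_to_rural"] = urban / rural if rural else None
    return out


# Placeholder tables with obviously fake values, only to show the shape.
index_by_tax_code = {"0000": "000000"}
city_by_index = {"000000": "Exampleville"}
stats_by_city = {"Exampleville": {"urban": 800, "rural": 200}}

row = enrich({"tax_authority_code": "0000"},
             index_by_tax_code, city_by_index, stats_by_city)
```

Returning `None` for a broken chain, rather than raising, keeps rows with missing external data in the dataset so the model or a later imputation step can deal with them.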

A question may arise here: how did it even occur to us to add these features? It's simple. First, there were mentors at the hackathon. The following thoughts apply to hackathons in general.

You can and should always ask the mentors questions, because that is what they are there for. By default they know more about the task than you do, so neglecting this opportunity means giving your rivals a big head start. Besides, while talking to a mentor you can learn not only about the task but also about what the jury wants to hear in the presentations. That is why our team always has a dedicated person who talks to the mentors, writes to them, calls them, and makes sure they remember us.

Second (on where the features came from): since people label this data by hand, there must be rules for them to follow so that the assigned categories are reasonably consistent. We googled these rules, found them surprisingly easily, and were able to derive many useful features from the ones we already had. The importance of these features did turn out to be high, which we showed in the final presentation to support the argument that a person really can be replaced by an algorithm.
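The text does not say how feature importance was measured. One common, model-agnostic way to check it is permutation importance: shuffle one feature's column and see how much accuracy drops. The sketch below is an assumption-laden illustration of that technique, with a hypothetical toy model and made-up feature names.

```python
import random


def accuracy(model, rows, labels):
    """Share of rows for which the model's prediction matches the label."""
    hits = sum(1 for row, y in zip(rows, labels) if model(row) == y)
    return hits / len(labels)


def permutation_importance(model, rows, labels, feature, seed=0):
    """Accuracy drop after shuffling one feature across rows.

    A value near zero means the model barely relies on the feature.
    """
    rng = random.Random(seed)
    shuffled = [row[feature] for row in rows]
    rng.shuffle(shuffled)
    permuted = [{**row, feature: value} for row, value in zip(rows, shuffled)]
    return accuracy(model, rows, labels) - accuracy(model, permuted, labels)


# Toy data: "rule_flag" stands in for a rule-derived feature, "noise" for
# an irrelevant column. Both names are hypothetical.
rows = [{"rule_flag": i % 2, "noise": i} for i in range(20)]
labels = [row["rule_flag"] for row in rows]


def toy_model(row):
    """Stand-in model that predicts the label from the rule-derived feature."""
    return row["rule_flag"]
```

Calling `permutation_importance(toy_model, rows, labels, "noise")` returns exactly zero here, because the toy model never reads that column.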

Results

In terms of prediction accuracy, however, we were beaten by the cool guys from Mat-Mech (the Mathematics and Mechanics faculty). But our presentation (again, in my opinion) was better. You can reach the same or the opposite conclusion yourself by looking at the reports. On accuracy, the Mat-Mech guys (red pandas) came first and we (AU-Rocks) came second.

By the way, about the names: when we compete as an Academic University team, we always choose a team name with the AU prefix, and the red pandas do something similar. I think it's very cool when you come to an event and can recognize people you know by their team name, and they can recognize you.

So, back to the presentations. We presented a little better; the other guys had higher accuracy. So there was real intrigue about who would win. And since we took a prize, I think it is important to repeat a thought that is so often voiced in IT: even the coolest project or result may not be understood without a good presentation.

Of course, the quality of a presentation depends both on the structure of the talk (our slides, by the way, may be useful to someone as an example of what to cover) and on the speaker. And although many people advise bringing a dedicated person onto a hackathon team who knows how to present results well, I believe you need to become that person yourself. That is, go to hackathons in order to level up this skill too.

And since I've started talking about why to go to hackathons, I would also like to mention that besides networking, interesting tasks, prizes, new ideas and projects for your CV, a hackathon is also a great way to take a weekend break from the daily routine.

I'll finish, perhaps, by announcing several more posts like this one (at least three, I think) about other hackathons where AU teams managed to win prizes. Most likely we will also write about the failures, because it is easier and faster to learn from other people's mistakes, so sharing that experience is no less important.

Post written in collaboration with avgaydashenko.

Source: https://habr.com/ru/post/354150/
