How students from Perm reached the final of the international data analysis championship Data Mining Cup 2019

Hello. In this article I will describe our experience of taking part in the Data Mining Cup 2019 (DMC) data analysis competition, and how we managed to place among the top 10 teams and take part in the in-person final in Berlin.

I am writing on behalf of our team, which consists of me (Alexander Perevalov) and my colleague Sergey Bobkov. We are undergraduates of the Perm Polytechnic. In our free time from work and study, we take part in data science competitions.

What is DMC and how did we hear about it

The Data Mining Cup is a worldwide student data analysis championship held once a year. Its history began 20 years ago, long before Kaggle, so you could say the DMC was running data analysis competitions before it became mainstream.
DMC is organized by PrudSys, a German company working in retail intelligence. Originally, only individual participation was allowed; later, students were allowed to form teams within a university. The maximum number of teams per university, by the way, is only 2. University affiliation is also strictly checked: to participate, you need an email address on your institution's domain and must send a copy of your student ID.

At present, if you compare the level of participants in the DMC and on Kaggle, Kaggle's is of course much higher. This is due to the DMC's students-only restriction and to Kaggle's popularity. A distinctive feature of the DMC is the absence of a leaderboard, which removes the problem of overfitting to it.

I found out about the Data Mining Cup while our university group was on an internship in Germany. When we got home, in mid-April, my friend and teammate suggested that I take part. To be honest, I was skeptical about the idea; however, once I learned that this year's data and task were rather simple, we set about solving it.

How we solved the task

In 2019, the task was self-checkout fraud detection. You have surely come across self-checkout terminals in supermarkets. These devices work either under the supervision of a store employee or fully automatically. Self-checkout lets supermarkets reduce staffing costs and shorten queues. There is one problem, however: human nature being what it is, sooner or later there is a temptation to “not scan” a product we would like to see in our refrigerator. Preventing this requires control, but of a kind that does not embarrass or annoy shoppers.

So, given labeled data on self-checkout transactions, we need to build a mathematical model that automatically classifies each transaction as fraudulent or non-fraudulent. In other words, we are solving a binary classification problem.

The data was as follows:

The training set contained only about 1,800 examples, while the test set contained 499,000. The training set was also imbalanced: only 4% of transactions were fraudulent, so accuracy (the proportion of correct answers) is clearly a useless metric here. Surprisingly, there were no missing values in the data, and some of the features were uniformly distributed. From this we can conclude that the data was generated artificially.
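
To make the point about accuracy concrete, here is a small sketch. The counts are illustrative assumptions derived from the rounded figures above (1,800 rows, 4% fraud), not the real dataset, and the "model" is a hypothetical baseline that never flags anyone.

```python
# Illustrative counts only: 1,800 transactions with 4% fraud
# (rounded figures from the text, not the real training data).
n_total = 1800
n_fraud = 72  # 4% of 1,800
n_honest = n_total - n_fraud

# Hypothetical baseline: label every transaction as "not fraud" (0).
true_labels = [1] * n_fraud + [0] * n_honest
predictions = [0] * n_total

# Accuracy: share of transactions labeled correctly.
correct = sum(t == p for t, p in zip(true_labels, predictions))
accuracy = correct / n_total

# Recall: share of actual frauds that the baseline caught.
caught = sum(1 for t, p in zip(true_labels, predictions) if t == 1 and p == 1)
recall = caught / n_fraud

print(f"accuracy = {accuracy:.2f}")  # accuracy = 0.96
print(f"recall   = {recall:.2f}")    # recall   = 0.00
```

The baseline scores 96% accuracy while catching no fraud at all, which is why accuracy tells us nothing useful on data this imbalanced.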

The organizers also provided their own metric in the form of a confusion matrix with payoffs measured in monetary units:

                        Actual: Fraud    Actual: Not fraud
Predicted: Fraud        5 euros (TP)     -25 euros (FP)
Predicted: Not fraud    -5 euros (FN)    0 euros (TN)

After analyzing it, it became clear to us that precision matters more in this case, since the largest loss is incurred when we mistakenly label an honest shopper as a fraudster.
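
As a rough illustration of how such a metric can be computed, here is a minimal sketch. The payoffs come from the organizers' matrix; the function name `monetary_score` and the toy labels are our own assumptions for the example, not part of the official evaluation code.

```python
# Payoffs in euros, taken from the organizers' matrix.
PAYOFF = {"TP": 5, "FP": -25, "FN": -5, "TN": 0}


def monetary_score(y_true, y_pred):
    """Sum the payoff over all transactions (1 = fraud, 0 = not fraud)."""
    total = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == 1 and actual == 1:
            total += PAYOFF["TP"]  # fraud correctly flagged
        elif predicted == 1 and actual == 0:
            total += PAYOFF["FP"]  # honest shopper wrongly flagged
        elif predicted == 0 and actual == 1:
            total += PAYOFF["FN"]  # fraud missed
        else:
            total += PAYOFF["TN"]  # honest shopper left alone
    return total


# Toy labels, invented purely for illustration.
y_true = [1, 1, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1]

print(monetary_score(y_true, y_pred))  # -20
```

In this toy case, two caught frauds earn 10 euros, but a single false alarm costs 25, so one wrongly accused shopper outweighs several correct detections.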

Our solution followed the classic stages:

Basic data analysis
Analysis of features, their descriptive statistics and distributions
Removal of outliers
Feature generation
Building a model and tuning its parameters
Validation and final forecast

Slides describing our solution are available at: www.docdroid.net/2XEDfYg/dmc-2019-1.pdf
The GitHub repository is here: github.com/Perevalov/dmc2019 (everything is scattered across different branches, as we have not yet had time to tidy it up)

Organizational side of preparing for the final

After we submitted our final solution in early May, we started waiting for the results. Under the organizers' rules, the top 10 teams are invited to the on-site final in Berlin, held as part of the Retail Intelligence Summit 2019 conference: Smart Decisions for Smart Retail.

For reference, in 2019, 149 teams from 114 universities located in 28 countries participated in the DMC.

To be honest, we did not even hope to reach the final, but at the end of May the coveted invitation letter arrived. On top of that, the organizers offered to cover the finalists' expenses up to 500 euros and to provide one night's accommodation at the hotel where the event was held.

Without hesitation, we bought tickets to Berlin and applied for visas. For poor students like us, the cost of a 2-day trip turned out to be rather high. The Perm-Berlin-Perm tickets and the visa came to about 40,000 rubles per person, which is a little more than 500 euros.

Since we were representing our university at the event, we decided to ask it for financial support. After all, Perm Polytech runs a program for developing Russian-German relations and strongly supports students who show initiative (or so it seemed to us). Having secured the approval and signature of the head of our department, we went to the department of science and innovation. There began a month-long bureaucratic saga, which ended roughly like this: "There is no money, but you hold on." Of course, we were a little upset, but we did not lose heart. Now it is laughable to read the various statements by our university's top management about the "need to support young scientists" and other nonsense. But that is a lyrical digression.

We received our visas in just 2 weeks. In the same period we prepared our presentation, and on the evening of July 2 we went to the airport.

Presenting at the Data Mining Cup final and the awards ceremony

We arrived in Berlin on the morning of July 3 and went to the nHow Hotel, where the conference was held. The level of organization was, of course, high. No wonder, since participation cost 1,000 euros per person (free for us). And here is the hotel itself:

Our talk was scheduled for 4:30 p.m. It took place in the main conference hall and was, of course, in English. The presentation itself, by the way, did not count toward the final ranking, which was based solely on the final score, and only the organizers knew that.

The top 10 teams included universities such as George Washington University (USA), University of Geneva (Switzerland), Chemnitz University of Technology (Germany), University of Iowa (USA), and others. And, of course, our own Perm National Research Polytechnic University.

The conference hall looked like this:

One small disappointment was that we had to present not with slides but with a single poster displayed on the screen, so the participants' talks were not very informative. However, it was possible to walk up and view each team's paper poster in the conference hall. Most teams used stacking, blending and ensembling (we were among them). Some participants also raised the decision threshold of their classification models, and a couple of teams managed without any feature generation at all and built a model on the original features.
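
To show why a raised threshold makes sense under the organizers' payoffs (5 for TP, -25 for FP, -5 for FN, 0 for TN), here is a minimal sketch. It assumes the model outputs well-calibrated fraud probabilities; the helper `profit` and the toy data are hypothetical, and we are not claiming that any team chose its threshold exactly this way.

```python
from fractions import Fraction

# Payoffs in euros from the organizers' matrix.
TP, FP, FN, TN = 5, -25, -5, 0

# With fraud probability p, flagging pays p*TP + (1-p)*FP on average,
# while not flagging pays p*FN + (1-p)*TN. Flagging is worthwhile when
#   p > (TN - FP) / ((TP - FN) + (TN - FP))
break_even = Fraction(TN - FP, (TP - FN) + (TN - FP))
print(break_even)  # 5/7, roughly 0.71


def profit(y_true, probs, threshold):
    """Total payoff when every transaction with prob > threshold is flagged."""
    total = 0
    for actual, p in zip(y_true, probs):
        flagged = p > threshold
        if flagged:
            total += TP if actual == 1 else FP
        else:
            total += FN if actual == 1 else TN
    return total


# Toy labels and predicted probabilities, invented for illustration.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
probs = [0.95, 0.80, 0.60, 0.65, 0.55, 0.30, 0.10, 0.05]

print(profit(y_true, probs, 0.5))                # -35
print(profit(y_true, probs, float(break_even)))  # 5
```

On this toy data the default 0.5 threshold loses money because of two false alarms, while the break-even threshold of 5/7 gives up one fraud but stays in profit.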

By the way, we were the smallest team - only 2 people.

After the presentations came a festive dinner and the awards ceremony. We hoped for a prize but understood that it was unlikely, so our modest wish was “just not to be 10th.” It turned out exactly as we wished: we took an honorable 9th place. Naturally, it was a little disappointing, but the very fact that we reached the final alongside such serious universities already says a lot. The winners were a team from the University of Iowa (USA), although you would not say they came from the States (see photo):

The prizes for 1st, 2nd and 3rd place were 2,000, 1,000 and 500 euros, respectively. The final ranking was as follows:

Conclusions

We do not regret taking part in this competition. At the very least, it is a +1 for the portfolio, plenty of useful contacts, and an opportunity to represent our city and country at an international event.

I advise all data scientists to take part in events like this. It's cool!

Source: https://habr.com/ru/post/458930/
