How to get a 5 (the top grade) using data analysis?

Hi, Habr! I am sure there are many students among us, and probably all of them have run into subjects where the granite of science is hard enough to break your teeth on. So I want to tell you how my hobby, data science, helped me pass one of the hardest subjects of the semester with a 5, the top grade. If that sounds interesting, read on.

Background

I study at the ITMO Computing Department. In the middle of last semester, while preparing for the umpteenth lab, I got the idea of applying my hobby, data science, to make the task a little easier. Two minutes later I had an IPython notebook open and was deep in the process...

As a result, I drew some conclusions about how lab difficulty is distributed and how the lab topics correlate. These conclusions seemed interesting and plausible to me, so I tried to apply them in practice. At the end of the semester I got my 5 and happily left the study in the depths of GitHub. But just a couple of days ago I had a chance to share the idea and the conclusions of this mini-study with people who have no direct connection to the department or to the subject it covers, and I heard a lot of positive feedback. So I decided to write up this small but very applied study on Habr.

Where to get the data?

I began, as expected, by looking for data that would let us draw useful conclusions. The university's electronic grading system is closed, so you can't see anyone's grades but your own. I did not want to run surveys, since that would take a lot of time and effort and generally isn't worth it. Fortunately, many teachers keep an open gradebook in Google Docs, and that is what helps us here. I found the gradebook for the subject in question from a previous year, parsed it and got a small dataset of about 100 grades. At the time I was preparing to submit the 4th lab, which came after the first 3 labs and one homework assignment (DZ). In total, each record (student) had 6 grades:

Labs 1-4 (5-point scale)
1 homework assignment (5-point scale)
Final semester grade (100-point scale)
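
For illustration, the parsed gradebook can be thought of as a small table with one row per student and six columns. The sketch below is not the original parsing code: it assumes the Google Docs gradebook was exported as CSV, and the column names and the two sample rows are made up.

```python
import csv
import io

# Hypothetical CSV export of the gradebook; the column names and values
# are invented for illustration and are not the real data.
SAMPLE_CSV = """lab1,lab2,lab3,lab4,hw,final
4,5,4,3,4,71
5,5,4,4,5,88
"""

def load_grades(text):
    """Parse CSV text into a list of dicts with integer grades."""
    reader = csv.DictReader(io.StringIO(text))
    return [{name: int(value) for name, value in row.items()} for row in reader]

grades = load_grades(SAMPLE_CSV)
print(len(grades), "students,", len(grades[0]), "grades each")
```

Keeping each student as a dict keyed by work name makes it easy to pull out one column at a time for the statistics that follow.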

Visualization and data analysis

Once I had the data, I immediately started visualizing it. First, let's look at how all the grades relate to one another, pair by pair.

I couldn't extract anything particularly useful or interesting from the off-diagonal elements. The diagonal, however, shows the distribution of each grade, so you can see what grades people usually get for each work. For example, we can see right away that for the first lab most people, for some reason, get a 4. Perhaps the teachers have not yet worked out the level of most students and so play it safe. This does not show up in the later labs.

The same plot also lets us judge how hard each work is. For example, it struck me right away that the 2nd lab is the only one whose modal grade is 5. This suggests that it is the easiest lab.
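
The "modal grade" comparison can be reproduced without any plotting. This is only a sketch of the idea, not the original notebook code: the grades below are made up, and treating a higher mode as "easier" is the same assumption the text makes.

```python
from collections import Counter

# Made-up grades (5-point scale), one list per work; not the real data.
grades = {
    "lab1": [4, 4, 4, 5, 3, 4],
    "lab2": [5, 5, 4, 5, 3, 5],
    "lab3": [3, 4, 4, 3, 4, 5],
}

def modal_grade(values):
    """Return the most frequent grade (the smallest one on ties)."""
    counts = Counter(values)
    best = max(counts.values())
    return min(g for g, c in counts.items() if c == best)

modes = {work: modal_grade(values) for work, values in grades.items()}
# Works sorted from the highest modal grade (presumably the easiest) down.
ranking = sorted(modes, key=modes.get, reverse=True)
print(modes)
print(ranking)
```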

I also noticed that the average final grade (bottom row) sits well to the left of the 74-point mark (at our university, 74 or below is a 3). Yet judging by the lab grades, most people were on track for a 4, so it is the exam they should be afraid of.

So this plot helps answer two interesting questions:

How do the labs rank by difficulty?
What grade is it realistic to aim for?

Moving on. The next chart, it seemed to me, was even more informative.

This is a heatmap of the correlation matrix between every pair of grades, and hence between the topics of the corresponding works.

The last row / column is the most interesting for us. Take the row, for example. It shows how strongly the grade for each work correlates with the final semester grade. Here you can see that the grade for the second lab has almost no effect on the final one. Does this mean that this topic comes up extremely rarely on the exam and tests? YES!

Lab 1 and the homework play a slightly more important role, so it would be good to get to grips with those topics. And the squares for the 3rd and 4th labs say that these topics are worth mastering as thoroughly as possible in order to get a decent grade.
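
The row discussed above is just the correlation of each work's grades with the final grade. Here is a minimal sketch, assuming the Pearson coefficient (the article does not say which coefficient was used) and using made-up numbers chosen so that one lab correlates weakly and another strongly.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Made-up grades, one position per student; not the real data.
grades = {
    "lab2": [5, 4, 5, 4, 5, 4],
    "lab3": [3, 4, 5, 3, 4, 5],
}
final = [60, 72, 90, 62, 75, 88]

# One row of the correlation matrix: each work against the final grade.
for work, values in grades.items():
    print(work, round(pearson(values, final), 2))
```

A coefficient near 0 corresponds to a pale square in the heatmap, and one near 1 to a saturated square.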

In other words, with an ordinary heatmap we could almost peek at the exam questions long before the exam!

So this chart helps solve one of the hardest problems a student faces: "which exam question should be reread 10 times, and which 100".

Moreover, this is not limited to the exam. The chart shows that when preparing for the homework assignment, it matters more to work through the material of the 1st and 2nd labs, while the 3rd can be given a little less attention.

And the most interesting part: once you think about the topics of all the labs, every conclusion becomes quite explainable and only somewhat unexpected.

Building a predictive model

Of course, I couldn't pass up machine learning. I tried many models, and in the end the smallest absolute error, ±0.2, came from sklearn's random forest.
But I wanted to share the model with anyone interested, so I trained an ordinary linear regression, extracted its weights and published them. That way anyone can predict their grade for the 4th lab with simple arithmetic (this model's MAE was 0.3).
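
To show what "simple arithmetic" means here, below is a sketch of applying linear regression weights by hand and measuring the mean absolute error (MAE). The weights, intercept and students are placeholders invented for this example: the published values are not given in the article.

```python
# Hypothetical weights of a linear model; the published values are not
# given in the article, so these numbers are placeholders.
INTERCEPT = 0.5
WEIGHTS = {"lab1": 0.2, "lab2": 0.1, "lab3": 0.4, "hw": 0.2}

def predict_lab4(student):
    """Predict the lab 4 grade as intercept + sum(weight * grade)."""
    return INTERCEPT + sum(w * student[name] for name, w in WEIGHTS.items())

def mean_absolute_error(actual, predicted):
    """Average of |actual - predicted| over all students."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Made-up students and lab 4 grades; not the real data.
students = [
    {"lab1": 4, "lab2": 5, "lab3": 4, "hw": 4},
    {"lab1": 5, "lab2": 5, "lab3": 3, "hw": 5},
]
actual_lab4 = [4, 4]

predicted = [predict_lab4(s) for s in students]
print(predicted, mean_absolute_error(actual_lab4, predicted))
```

A random forest cannot be reduced to a formula like this, which is a reasonable reason to publish the slightly less accurate linear model.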

In fact, though, the main value of this mini-study lies in the previous section and the conclusions drawn there.

Conclusions

That's the whole study. As you can see, even from a fairly small sample you can draw genuinely useful conclusions. And the main idea I wanted to get across is that data analysis is very useful even for everyday questions.

I hope you found the article interesting. Thanks!

Source: https://habr.com/ru/post/321534/
