Kaggle - our excursion to the kingdom of overfit

Kaggle is a platform for machine learning contests. On Habré they often write about her: 1 , 2 , 3 , 4 , etc. Kaggle contests are interesting and practical. The first places are usually accompanied by good prizes (top contests - more than 100k dollars). Recently, on Kaggle offered to recognize:

And many many others.

I have long wanted to try, but something always interfered. I developed many systems related to image processing: the subject matter is close. Skills lie more in the practical part and classical Computer Vision (CV) algorithms than in modern Machine Learning techniques, so it was interesting to evaluate my knowledge at the world level, and to improve understanding of convolutional networks.
')
And suddenly it all came together. Dropped a couple of weeks is not very busy schedule. At kaggle held an interesting competition on a close topic. I updated myself comp. And most importantly, he knocked out vasyutka and Nikkolo to form a company.

I must say that we have not achieved the enchanting results. But I consider 18th place out of 1.5 thousand participants to be quite good. And considering that this is our first experience of participating in kaggle, that from 3 months of the competition we participated only 2.5 weeks, that all the results were obtained on one single video card - it seems to me that we performed well.

What will this article be about? First, about the problem itself and our method for solving it. Secondly, about the process of solving CV tasks. I wrote a lot of articles on Habré about machine vision ( 1 , 2 , 3 ), but the scribbling and theory is always better supported by example. And to write articles for some commercial problem for obvious reasons is impossible. Now finally tell about the process. Moreover, here he is the most common, well illustrating how problems are solved. Thirdly, the article about what goes after the solution of the idealized problem in a vacuum: what will happen when the problem collides with reality.

Task analysis

The task we started doing was as follows: “Identify the driver in the photo to one of ten groups: careful driving, telephone in his right hand, telephone in his right ear, telephone in his left hand, telephone in his left ear, tuning music, drinking liquid , stretching back, hair-dressing (painting lips, scratching the back of his head), talking with a neighbor ”. But, as it seems to me, it is better to look at the examples once:

0

one

four

five

eight

English

c0: safe driving
c1: texting - right
c2: talking on the phone - right
c3: texting - left
c4: talking on telephone - left
c5: operating the radio
c6: drinking
c7: reaching behind
c8: hair and makeup
c9: talking to passenger

********************************************
Everything seems clear and obvious. But it is not so. What class do these two examples belong to?

The first example is grade 9, conversation. The second example is the zero class, safe driving.
According to our assessment, the accuracy of a person when recognizing a class on the basis is about 94%. At the same time, the first and tenth grades make the most confusion. At the beginning of our participation, the first places had an accuracy of approximately 97% of correct recognitions. Yes Yes! Robots are already better than people!

Some details:

The size of the base for learning 22 thousand images.
Approximately 2 thousand per class.
There are 27 drivers in the training base.
Test database - 79 thousand images.

Today, the main means for solving problems of this kind are convolutional neural networks. They perform image analysis on many levels, independently highlighting key features and their relationships. You can read about convolutional networks here , here and here . Convolutional networks have a number of disadvantages:

It takes a lot of different data to learn. Two thousand per class is less enough, although not enough. But the fact that there are only 27 drivers is very limiting.
Convolution networks are prone to "overfit". When training, they may catch on with some insignificant sign, peculiar to several images, but which does not carry significant weight. We have such a sign was the opening of the sun visor. All the photos with him went to the 8th grade with a gun. Anyway: when overfitting the grid, it simply can remember all 22 thousand input images. Here is a good overfit example .
Convolution networks are ambiguous due to the fact that they are complex. Approximately the same solution on different frameworks can yield fundamentally different results. A few slightly modified network parameters fundamentally change the result. Many people call the configuration of networks something akin to art.

An alternative to convolutional networks may be the manual management of low-level features. Highlight hands. Face position Facial expression. Open / closed visor at the car.
Convolution networks are many different. The classic approach is to use the most common networks from the Zoo ( caffe theano keras ). This is primarily VGG16, VGG19, GoogleNet, ResNet. For these networks, there are many variations, plus you can use techniques that accelerate learning. Of course, this approach is used by all participants. But the basic good result can be obtained only on it.

Our setup

All calculations in our work were carried out on a single GTX1080. This is the latest game card from NVIDIA. Not the best option from what is on the market, but quite good.
We wanted to use a cluster with three Tesla in one of the works, but due to a number of technical difficulties this did not work out. We also thought about using some old video card from a 4Gb laptop, but as a result we decided not to go this way, there was much less speed there.
The framework used is Caffe . Keras could also be used with Theano, which would certainly enhance our result due to a slightly different implementation of the training. But we did not have time for this, so Caffe was used to the maximum.

RAM - 16Gb, the maximum when training was used 10Gb. The processor is the last i5.

If anyone is interested, but nothing special

A few words about the rules

I think that most of the readers never participated in kaggle, so I’ll take a quick look at what the rules of the competition are:

Participants are presented with two sets of data. Training - on which target results are signed. Test - by which you need to make recognition.
The participant must recognize all the images from the test set and send them to the site in csv-format (text file of the form “image number” - probability of the first class, probability of the second class, ....).
In total, you can make 5 attempts to send.
After each sending, the user is told the current percentage of the test base. But, you need to take into account one interesting point. The user is told the result is not for the entire database, but only for a small part of it. In our problem, it was 39%.
The final result of the competition is considered for the remaining piece of the base (61%) after the close of the competition. This can lead to serious permutations of the participants.
The first three places - prize. If a participant enters them, he must publish his decision and the organizer will check it.
The solution should not:
- Contain proprietary data
- Use user-marked test selection during training. In this case, it is permitted to mark out a training set.

A few words about the metric

Suppose we invented some recognition mechanism and recognized all the images. How will they be further checked? In this problem, we used the approach of calculating multiclass logarithmic losses . In short, it can be written as:

y is the decision matrix. A unit if the object belongs to a class.
p - matrix of answers that the user has sent. It is best to record the probability of belonging to a class.
M number of classes
N is the number of objects
In the neighborhood of zero, the value is replaced by a constant.
We estimated that the probability “95%” corresponds to a logloss value of approximately “0.2”, the value “0.1” corresponds to a probability of “97.5%”. But this is a rough estimate.
We will return to this function, but a little lower.

The first steps

The theory is good. But where to start? Let's start with the simplest: take the CaffeNet grid, which is attached to Caffe and for which there is an example.
After I did the same thing, I immediately got the result “0.7786”, which was somewhere in the 500th place. It's funny, but for many people the result was much worse. At the same time, it is worth noting that 0.77 approximately corresponds to 80-85% of correct recognitions.
We will not dwell on this grid, which is already quite outdated. Take something standard modern. To the standard can be considered:

VGG family :
- VGG-16
- VGG-19
ResNet family :
- ResNet-50
- ResNet-101
- ResNet-152
Inception family

About non-standard methods can be read below in the section "Ideas that have not rolled."
Since we started somewhere in two and a half months after the start of the competition, it made sense to explore the forum. The forum advised VGG-16. The author of the post assured that he received a solution with a loss of "0.23" on the basis of this network.

Consider the decision of the author:

He used the pre-trained VGG network. This greatly increases the speed of learning.
He taught not one, but 8 grids. When learning each grid, only 1/8 of the input database (22/8 thousand images) was fed to the input.
The resulting solution gave a level of 0.3-0.27
He received the final result by adding the result of these 8 grids.

The decision to repeat with us failed. However, many who could not do it. Although the author laid out the training script on Keras'e. Apparently on Keras it could be reached, but not on caffe. The third place winner also counted VGG on Keras, and all the other grids on Caffe and Theano.

As for us, pure VGG gave 0.4, which, of course, improved our result at that time, but only up to 300th place.

As a result, we decided to give up VGG and tried out the training of the pre-trained ResNet-50 ( here you can read interestingly what it is). Which immediately gave us 0.3-0.29.

A small remark: we never used the technique of “breaking the base into 8 parts”. Although, most likely, it would bring us a little extra accuracy. But such training would take several days, which was unacceptable to us.

Why do I need to split the base into 8 parts and train independent grids? Suppose the first of the grids is always mistaken in favor of situation A, when choosing from A and B. The second grid, on the contrary, decides B. Moreover, both grids are often mistaken about the real decision. But the sum of the grids will more correctly estimate the risk of A / B. In fact, in most disputable situations, it will deliver 50% - A, 50% - B. This approach minimizes our losses. We achieved it differently.

To improve accuracy with 0.3, we did the following:

We trained several ResNet-50 grids with various hyperparameters. All of them gave somewhere 0.3-0.28, but their sum was more.
We trained the ResNet-101 net, which in itself gave somewhere around 0.25
We trained the ResNet-50 grid with image changes, which gave somewhere 0.26

The image changes are the following: when training, instead of the original training picture, a rotated picture is given, the picture is cropped, the picture is noisy with noise. This increases stability and accuracy.
The final result of the addition of all the grids was at the level of "0.23-0.22."

To new heights

Result 0.22 was somewhere around 100 places. This is a good result. In fact, the maximum, which gives the correctly configured network. But to go further you need to stop, think and reflect on what has been done.
The easiest way to do this is to see Confusion Matrix. In essence, this concept hides a budget of errors. How and when are we mistaken. Here is the matrix that we got:

In this matrix, the x-axis are the objects of the original class. On the y-axis - where they were assigned. For example, out of 100% of the objects of the zero class, 72% were successfully assigned to him, 0.8% to the first class, 16.8% to the ninth.
The following conclusions can be drawn from the matrix:

The greatest number of errors - zero and ninth grade. This is logical. I myself often cannot understand whether a person is talking or not. Above in the text was an example.
The second in the number of errors - the eighth grade. Any class where the hand is next to the head can be referred to the eighth class, if the information is not enough. This is talking on the phone. These are conversations with a neighbor, with active gestures. In some situations, when a person drinks a liquid - it can also be similar.

Therefore, it is necessary to develop an algorithm that can more correctly distinguish these three classes.
In order to do this, we used the following ideas:

In the opinion of a person, 90% of the information allowing to distinguish between grade 0 and grade 9 is contained on the face At the same time, in most cases, when classifying in grade 8 was wrong, the final decision is also made on the area around the face.
All the grids above were sharpened to a resolution of 224 * 224. At the same time, the quality of the face sags heavily.

So we need to keep the resolution around the face and use it to refine the 0-8-9 grades. In total, we had three ideas on how to do this. About two of them will be written below in the section "Ideas that are not rolled." I rolled the following idea:
We train a simple Haar classifier for face selection. In principle, the face can even be well distinguished by color. Given that we know well where the person should be.
The competition rules did not prohibit manual marking of the training base. Therefore, we noted somewhere on 400 images of the face. And we got a very good selection of frames recognized automatically (faces were found correctly for 98-99% of frames):

Having trained ResNet-100 in images, we got an accuracy somewhere around 80%. But the addition of learning results to the amount of networks used gave an additional 0.02 on the test sample, moving us to the thirties area.

Ideas that are not rolled

Break the slender outline of the narrative and take a small step to the side. Without this step, everything is fine, but with this step it becomes clear what was going on in my head at that moment.
There are much more ideas that do not produce results in any research task than ideas that produce results. And sometimes it happens that ideas for one reason or another can not be used. There is a small list of ideas that we have already tried to test at the time of entering thirtieth places.

The first idea was as simple as a log. We already wrote on Habré ( 1 , 2 ) about networks that can color images according to the class affiliation of objects. It seemed to us that this is a very good approach. You can teach the network to detect exactly what you need: telephones, hands, an open visor near the car. We even spent two days on the layout, configuration and training of SegNet. And suddenly we realized that SegNet has a closed non-OpenSource license. And therefore we can not honestly use it. I had to refuse. And the results of automatic markup were promising (several approaches are shown here at once).

The first:

Second:

And here is the markup process:

The second idea was that the resolution of 224 * 224 is not enough for us to make a decision: class 0 or 9. The greatest problems are caused by the loss of permission on faces. But we knew that the face is almost always located in the upper left part of the image. Therefore, we dragged the pictures, having received such cute tadpoles with the maximum resolution in places of interest to us:

Not a ride. The result was something like the usual training + strongly correlated with what we had.

The next idea was quite large and comprehensive. Posts on the contest forum prompted us to think: what does the network see in reality? What interests her?
There is a whole selection of articles on this topic: 1 , 2 , 3

The forum Kaggle cited such cool pictures:

Naturally, we decided to invent our own bicycle for this task. Moreover, it is written very simply.

We take the original image and begin to move the black square on it. At the same time, we look at how the response of the system changes.

We will draw the result as a heat map.

The image here shows an example of incorrect classification of a class 9 image (conversation with a neighbor). The image is classified as "drawn to the control of music." And indeed, it seems. After all, the network itself does not see 3d. She sees a hand that lies in the direction of the dashboard switch. So what that hand is on the foot.

Having looked at another few dozens of errors, we realized that again everything rested there: the grid does not look at what is happening on the face.

Therefore, we have come up with a different way of learning. A set was fed to the network's input, where half of the pictures came directly from the training base, and half was overwhelmed with everything except the face:

And about a miracle, the grid began to work much better by orders of magnitude! For example, by a mistakenly classified man:

Or, differently (as it was, as it became):

At the same time, the network worked well for the other classes (it noted the correct zones of ineres).
We already wanted to celebrate the victory. Loaded network for verification - worse. Combined with our best - the final result has not improved. I don't even remember if we began to add it to our best answer. We thought for a long time that we had some kind of mistake, but did not find anything.

It seems like on the one hand - the network has become more correct to look and a lot of what errors to correct. On the other hand, somewhere began to do new. But at the same time there was no significant statistical difference.

Ideas that are not rolled was much more. There was a Dropout, which gave us almost nothing. There were additional different noises: also did not help. But about this is nothing beautiful and you will not write.

Let's return to our sheep

We stopped somewhere around the 30th places. Left a little. A lot of ideas have already failed, 25 test projects have accumulated on the computer. There was no improvement. Our knowledge of neural networks gradually began to be exhausted. So go google the current contest forum and the old kaggle forums. And the solution was found. It was called "pseudo labeling" and "Semi-supervised learning" . And it leads to the dark side. Rather gray. But it was announced by the admins of the contest as legal.
In short: we use a test sample for learning by marking it with an algorithm trained in a training sample. Sounds weird and messy. But if you think about it, it is fun. By pushing into the network objects that are marked for it, we do not improve anything in a local perspective. But. First, we learn to highlight those signs that give the same result, but easier. In essence, we teach convolutional levels in such a way that they learn to distinguish new features. Maybe in some next image they stand out and help better. Secondly, we protect the network from retraining and overfit by introducing pseudo-arbitrary data that will not worsen the network.

Why does this path lead to the gray side? Because formally using a test sample for training is prohibited. But after all we here do not use it for training. Only for stabilization. Especially admins allowed.

Result: + 10 positions. We fall into the twenty.

The final graph of the walk results looked something like this (in the beginning we did not spend all the attempts per day):

And again about LogLoss

Somewhere in the beginning of the article I mentioned the fact that I will come back to LogLoss. It's not that easy with him. You can see that log (0) is minus infinity => if you suddenly put 0 in the class where the answer is one, then we get minus infinity.

Unpleasant But the organizers have protected us from this. They say that they replace the value under log with max (a, 10 ^ (- 15)). So we get an additive -15 / N to the result. That equals -0.000625 to the result for each wrong image for a public check and -0.0003125 for a closed one. Ten images affect the error in the third digit. And this is the position.
But the error can be reduced. Suppose instead of 10 ^ (- 15) we put 10 ^ (- 4). Then if we make a mistake we get -4 / N. But if we guess correctly, then we will also have losses. Instead of log (1) = 0, we take log (0.9999), which is -4 * 10 ^ (- 5). If we make a mistake every 10 attempts, then it is certainly more profitable for us than a loss of 10 ^ (- 15).

And then the magic begins. We need to combine 6-7 results to optimize the LogLoss metric.

In total, we made 80 shipments of the result. About 20-30 of them were dedicated to optimization of losses.

I think that 5-6 places have been played for this account. Although, as it seems to me, everyone is doing this.
All this magic did Vasyutka .I don't even know what the last option looked like. Only his description, which we have kept, reached two paragraphs.

What we didn't

At the end of the competition, we still have a small stack of ideas. Probably, if we had time, it would have cost us five more positions. But we understood that this was clearly not a way to the top 3, so we didn’t throw all the remaining forces into the fight.

We have not investigated the partitioning of the base into small pieces and their independent learning. In theory, this adds stability. We used a similar approach in semi-supervised learning, but also did not fully explore the whole field of possibilities.
, , . . : , . - 2-3 . : , , , .
( ). , . , , ... . . .
. , . , , . , . /. 2 — .

After the end of the competition, many participants publish their decisions. Here there is carried out the statistics on the top 20 of them. Now, about half of the top 20 have published solutions.
Let's start with the best published. Third place .

I will say right away. I do not like the decision and it seems to be a violation of the rules of the competition.
The authors of the decision noted that all the examples were shot sequentially. To do this, they analyzed the test sample in automatic mode and found adjacent frames. Neighboring frames are images with minimal changes => they belong to the same class => for all close decisions you can make a single answer.
And yes, it helps terribly. If you are talking on the phone holding it in your left hand - there are frames where the phone is not visible and it is not clear whether you are talking or scratching your head. But looking at the next frame, you can clarify everything.
I would not mind if statistics thus accumulated and background machine was subtracted. But to reverse engineer a video is for me beyond good and evil. Why-explain below. But, of course, the decision is up to the organizers.

I really like the fifth solution. It is so cool and so trivial. The thought occurred to me about ten times during the competition. But every time I swept it away. "What for?!". "Why will it even work at all ?!" "Too lazy to waste time on this hopelessness."
I did not even discuss it with my comrades. As it turned out - in vain. The idea here is:

Take two pictures of the same class. Split in half and glue. Submit to training. Everything.
I don’t really understand why it works (except that it stabilizes the sample: 5 million samples are cool, it’s not 22 thousand). And the guys had 10 TitanX cards. Maybe it played an important role.

The sixth solution is poorly described, I did not understand it. Ninth is very similar to ours. And the accuracy is not much different. Apparently, the guys were able to train the network better and better. They did not describe in detail due to which a small increase.

The tenth decision implemented a part of our ideas but a little bit differently:

Cut the area with a person to increase the resolution -> apply for training. The same problems that we solved by cutting the face are solved. But, apparently, better.

15 solution - everything is like ours. Even individuals were also cut out (plus the helm area from which we refused).
But ... They trained 150 models and folded them. 150 !!!

19, 20 solution - everything is like with us, but without faces. And 25 trained models.

From toys to business

Suppose you are an insurance company that wants to implement a system for determining what the driver is doing. You have assembled a base, several teams have proposed algorithms. What we have:

Algorithm that uses data about neighboring frames. This is strange. If you wanted your algorithm to work this way, you would suggest analyzing not photos, but videos. Moreover.This algorithm is almost impossible to tie with the video as such. If you have algorithms that recognize a single frame - it can be very easily added to the algorithm that works with video. The algorithm that uses the “5 nearest frames” is very difficult.
Algorithms that use 150 neural networks. This is a huge computing power. Such a product does not turn out to be massive. 2-3 networks is for a good maximum. Ok, let 10 be the limit. It's one thing if your job was to detect cancer. There such costs are permissible. There you need to fight for every percentage. But your goal is a mass product.

But, still got some good and interesting models.
Let's go ahead. And understand how these models work. Unfortunately, I don’t have them, so I’ll be testing ours, which gave the 18th result, which is not so bad in principle.
From all articles which I wrote on Habr my favorite about how it is necessary to collect base. From this side we will come to the analysis. What do we know about the collected database?

When recruiting the base drivers did not drive the car. The car was traveling by truck, the drivers were given the task of what to do.
The entire base was recruited during the daytime hours. Apparently the workers. No low sun. No evening shots.
All situations are perfectionary. People do not hold the phone with their left hand at the right ear. All phones have the same plus or minus.
The base was recruited in America. You ask how I understood? There is not a single car with a manual box ...

For the first time, four situations will suffice. But they are much more. All situations can never be foreseen. That is why the base should be typed real and not simulated.
Go.I made the frames below myself.
As influenced by the fact that when recruiting the base drivers did not drive. Driving is a rather complicated process. It is necessary to turn the head by 120 degrees, to look at traffic lights, to turn sharply at intersections. All this is not in the database. Consequently, such situations are defined as “conversation with a neighbor.” Here is an example:

Day hours. This is a very big problem. Of course, you can make a system that will look at the driver at night including the IR illumination. But in the IR illumination a person looks completely different. It will be necessary to completely redo the algorithm. Most likely to train one per day and one at night. But there is not only the problem of the night. Evening - when it is still light but already dark for the camera and there are noises. The grid starts to get confused. Weights on the neurons are walking. The first of the pictures is recognized as a conversation with a neighbor. The second picture jumps between “pulling back”, “meykap” (which is logical within the network, because I am reaching for the visor), “talking with a neighbor”, “safe driving”. The sun on the face is a very unpleasant factor, you know ...

About perfectionism. Here is a more or less realistic situation:

And the weights on the neurons: 0.04497785 0.00250986 0.23483475 0.05593431 0.40234038 0.01281587 0.00142132 0.00118973 0.19188504 0.0520909

Maximum on that phone at the left ear. But the network has gone a little bit.
About Russia generally keep quiet. Almost all the frames with a hand on the gear knob are recognized as “pulling back”:

It’s obvious that there are a lot of problems. I didn’t add pictures where I tried to trick the network (by turning on the flashlight on my phone, weird caps, etc.). This is real and it is deceiving the network.

Panic.Why is everything so bad? You promised 97% !!

And the problem is that NO computer vision system is done from the first iteration. There should always be a startup process. Starting with a simple test sample. With a set of statistics. With the correction of emerging problems. And they will always be. At the same time, 90% need to be corrected, not programmatically, but administratively. I got caught trying to trick the system - get a pie. Someone does not fix the camera? Be kind, support humanly. And not the fool himself.
Making the development you need to be ready, that before you get the perfect result you will need to redo everything 2-3 times.

And it seems to me that for this task everything is not bad, but, on the contrary, good. The 95-97% test performance shown is good and means that the system is promising. This means that the final system can be brought to plus or minus the same accuracy. You just need to invest in the development of another 2-3 times more power.

By the way. Pro camera mount. apparently, how the camera was fastened by picking up statistics prevents the passenger. The way I fastened the camera - gives a slightly different picture for which the statistics are falling. But does not interfere with the passenger. I think that a solution with a camera that interferes with the passenger will be unsuitable, most likely they will be recompiling the base. Or strongly dosobirat.

It is also not clear where the images are planned to be processed. In the module that will shoot? Then you need to have a decent computational power there. Send to server? Then it is obviously single frames. Save on the map and once a week to transfer from your home computer? Inconvenient form factor.

How much time is spent on the task of such a plan from scratch

Dohrena. I have never seen a task brought to release in less than half a year. Probably, applications like Prism, of course, can be deployed in a month, or even faster, if the infrastructure is there. But any task where the result is important and which is not an art project is a long time.
Specifically for this task. At least 70 people were involved in the recruiting of the base. 5 of which were at least 5 — the attendants who drove the truck sat with a notebook. And people 65-100 are people who entered the database (I do not know how many of them there are). It’s hardly less than 1-2 months to organize such a move, to collect everything, check it out with the organizers. The contest itself went 3 months, but you can make a good decision on the recruited database in 2-3 weeks (which we did). But to bring this decision to the working sample can be released as early as 1-3 months. It is necessary to optimize the solution for iron, to build transmission algorithms, to refine the solution so that it is idle in the absence of a driver, nothing is chased. And so on and so forth. Such things always depend on the problem statement and on where and how the solution should be used.
And then the second stage begins: trial operation. You need to put the system to people 40-50 in the car for a couple of days each and see how it works. Draw conclusions when it does not work, summarize and rework the system. And here begins the swamp, where it is almost impossible to estimate the implementation dates a priori before the start of work. It will be necessary to make the right decisions: in what situations the system is being finalized, in which stubbing, and in which we will fight administratively. This is the most difficult thing that few people can do.

findings

Conclusions will be on kaggle. I like it. I was glad to our level. From time to time it seemed to me that we are behind the times, we choose not optimal methods in our work. A hike all the rules. From the side of assessing oneself in the surrounding world, kaggle is very good.

From the problem solving side, kaggle is an interesting tool. Probably this is one of the correct and comprehensive methods that allows you to conduct a good R & D. You just need to understand that the finished product does not smell here. But to understand and appreciate the complexity and ways to solve the problem is normal.

We will not participate yet. Forces to fight for the first places you need to spend a lot. But, if there is an interesting problem - why not.

- PS They asked me to repeat this whole story at a Yandex training session. And there it turns out they write. If anyone needs:

Source: https://habr.com/ru/post/307078/

All Articles