The Planet: Understanding the Amazon from Space competition recently took place on kaggle.com.
Before it, I had never worked on image recognition, so I thought it was a great chance to learn how to work with images. Moreover, according to people in the chat, the entry threshold was very low; someone even called it “MNIST on steroids”.
Task
As with every competition, it all begins with the problem statement and the quality metric. The task was as follows: satellite images of the Amazon were given, and for each image it was necessary to assign labels: road, river, field, forest, clouds, clear sky, and so on (17 in total). A single image could contain several classes at once. As you can see, in addition to terrain types there were also classes describing weather conditions, and, logically enough, there can be only one weather condition per image: it cannot be both clear and cloudy at the same time. While solving the competition I did not look at the data, hoping the machine would figure everything out on its own, so I had to dig in later to provide example images:

Baseline
How to solve this problem? Take a convolutional neural network pre-trained on a large dataset and fine-tune its weights on your own set of images. In theory I had, of course, heard of this, and it all seems understandable, but I had never gotten around to actually implementing it. The first thing to do is to choose a framework. I took the easy path and used keras, as it has very good documentation and clear, human-readable code.
Now, a little more detail about the procedure of retraining the weights (in the industry this is called fine-tuning). Take a convolutional neural network, for example VGG16, trained on the ImageNet dataset of several million images to predict one of 1000 classes. Fine, it predicts a cat, a dog, a car; so what? In our case the classes are completely different... The point is that the lower layers of the network have already learned to recognize basic components of an image, such as lines and gradients. We can simply remove the top layer of 1000 neurons and put a layer of 17 neurons in its place (exactly as many classes as can appear in the satellite images).
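In Keras terms, this head replacement can be sketched roughly as follows. This is only a sketch: `weights=None` keeps the snippet self-contained, whereas real fine-tuning would start from `weights="imagenet"`; the 256-unit intermediate layer and the input size are illustrative choices, not the author's exact setup.

```python
# Sketch: replace VGG16's 1000-class head with a 17-neuron one.
# weights=None avoids downloading ImageNet weights here; for real
# fine-tuning you would pass weights="imagenet".
from tensorflow.keras.applications import VGG16
from tensorflow.keras.layers import Dense, Flatten
from tensorflow.keras.models import Model

base = VGG16(weights=None, include_top=False, input_shape=(128, 128, 3))
x = Flatten()(base.output)
x = Dense(256, activation="relu")(x)       # small intermediate layer (illustrative size)
out = Dense(17, activation="sigmoid")(x)   # 17 labels, one sigmoid per label
model = Model(inputs=base.input, outputs=out)
model.compile(optimizer="adam", loss="binary_crossentropy")
```

Sigmoid activations are used on the 17 outputs (rather than a softmax) because several labels can be present in one image at once, so the classes are not mutually exclusive.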
Thus, the neural network will predict the probability of each of the 17 classes. How do we decide whether a specific class is present in the image? The simplest idea is to cut off by a threshold: if the probability is greater than, say, 0.2, we include the class; if it is less, we do not. You can choose this threshold for each class separately, and I could not think of anything smarter than tuning each of these “thresholds”, as they were called in the chat, independently (which is of course not quite right: improving one threshold can influence the choice of another).
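The independent per-class search can be sketched like this in pure NumPy. F1 is used here only as a stand-in for the competition metric, and all names are illustrative:

```python
import numpy as np

def f1(y_true, y_pred):
    """Simple F1 score for one binary label."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def tune_thresholds(probs, labels, grid=np.arange(0.05, 0.95, 0.05)):
    """Greedy search: each class's cut-off is chosen independently,
    ignoring interactions between classes (exactly the simplification
    described in the text)."""
    n_classes = probs.shape[1]
    best = np.full(n_classes, 0.2)  # 0.2 as the default starting cut-off
    for c in range(n_classes):
        scores = [f1(labels[:, c], (probs[:, c] > t).astype(int)) for t in grid]
        best[c] = grid[int(np.argmax(scores))]
    return best
```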
No sooner said than done. The result: the top 90% of the leaderboard. Yes, that is bad, but you have to start somewhere. Here I want to point out a huge advantage of the competition format: the forum, where a crowd of people, professionals and amateurs alike, work on the same problem, and even ready-made baselines get published. From the discussions I immediately realized I was doing it wrong. The thing is, the weights need to be trained in two stages:
- “Freeze” the weights of all layers except the last one (the 17 neurons) and train only that layer.
- After the loss drops to a certain value and keeps oscillating around it, “unfreeze” all the weights and train the network as a whole.
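The two-stage schedule might look like this in Keras. Again a self-contained sketch: `weights=None` stands in for the pre-trained weights, the `fit` calls are left as comments, and the head is illustrative:

```python
from tensorflow.keras.applications import VGG16
from tensorflow.keras.layers import Dense, GlobalAveragePooling2D
from tensorflow.keras.models import Model

base = VGG16(weights=None, include_top=False, input_shape=(128, 128, 3))
x = GlobalAveragePooling2D()(base.output)
out = Dense(17, activation="sigmoid")(x)
model = Model(base.input, out)

# Stage 1: freeze every pre-trained layer and train only the new head.
for layer in base.layers:
    layer.trainable = False
model.compile(optimizer="adam", loss="binary_crossentropy")
# model.fit(...) until the loss plateaus

# Stage 2: unfreeze everything and fine-tune the whole network,
# typically with a lower learning rate so the pre-trained filters
# are not destroyed.
for layer in base.layers:
    layer.trainable = True
model.compile(optimizer="adam", loss="binary_crossentropy")
# model.fit(...) again
```

Note that after changing `trainable` flags the model must be compiled again for the change to take effect.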
After these simple manipulations, I got quality that was at least somewhat acceptable, in my opinion. What to do next?
Augmentations
I had heard that you can enrich the dataset with augmentation: applying transformations to the images fed into the neural network. Indeed, if we rotate an image, the river or the road in it will not disappear, yet we now have more than one training image. I decided not to overthink it and only rotated the images by angles that are multiples of 90 degrees, and also mirrored them. Thus, the size of the training dataset grew 8 times, which immediately improved the quality.
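The eight variants (four 90-degree rotations, each optionally mirrored) can be generated with NumPy's `rot90` and `fliplr`:

```python
import numpy as np

def dihedral_augment(img):
    """Return the 8 versions of an image: rotations by 0/90/180/270
    degrees, plus the mirror of each. For a satellite tile, none of
    these transforms changes which labels are present."""
    variants = []
    for k in range(4):
        rotated = np.rot90(img, k)
        variants.append(rotated)
        variants.append(np.fliplr(rotated))
    return variants
```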

Then I thought: why not do the same at the prediction phase, taking the network outputs for the transformed images and averaging them? It seemed this would make the predictions more stable. It was easy to implement, since I had already learned how to transform the images. Unexpectedly, it really worked: with the same network architecture, I climbed 100 places on the leaderboard. Only later did I learn from the chat that it had all been invented before me, and this technique is called Test Time Augmentation.
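Test-time augmentation is then just averaging the model's outputs over those same eight transforms. A sketch with a stand-in `predict` function (in reality it would be the trained network's forward pass):

```python
import numpy as np

def tta_predict(img, predict):
    """Average class probabilities over the 8 dihedral transforms.
    `predict` maps one image to a vector of 17 class probabilities."""
    preds = []
    for k in range(4):
        rotated = np.rot90(img, k)
        preds.append(predict(rotated))
        preds.append(predict(np.fliplr(rotated)))
    return np.mean(preds, axis=0)
```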
Ensemble 1
Recently, an opinion has taken hold that to win a competition you just need to “stack some xgboosts”. Yes, without building ensembles you are unlikely to reach the top, but without good engineering and proper validation, stacking alone will give no result at all. So before embarking on an ensemble, a considerable amount of work needs to be done.
Which models can be combined in an image recognition task? An obvious idea is to use various architectures of convolutional neural networks; you can read about their topologies in this article.
I used the following:
- VGG16
- VGG19
- Resnet101
- Resnet152
- Inception_v3
How to ensemble them? To start, you can simply average the predictions of all the networks. If they are not strongly correlated and give approximately the same quality, then, in my experience, averaging almost always works. It worked this time too.
Team
At the very least, I reached the edge of the bronze-medal zone. By then a week remained until the end of the competition, which is when the so-called merger deadline comes: the moment after which forming teams is prohibited. And so, 20 minutes before the deadline, I think: why not actually team up with someone? I look at the leaderboard, hoping to find someone from the chat near me. But no one is online. Only asanakoy, who at that moment was a full 40 places above me. What if? So I wrote to him. With 2 minutes left before the deadline, I got the answer that asanakoy did not mind teaming up, provided I was ready to keep contributing.
Artyom Sanakoev turned out to be a PhD student in computer vision with competition victories already under his belt, which, of course, could not fail to please me. Teaming up is a huge plus of participating in kaggle competitions, because it lets newcomers like me learn something new from more experienced colleagues directly while solving a problem together. Do try to participate, and if you have at least some success and a strong desire to do something, you will surely be taken into a team where they will teach you and show you everything. The first thing you want after a merger is an instant boost from combining your solutions. That is what we did, rising 30 places at once, which was quite good considering the minimal effort involved. Thanks to Artyom, I also realized that I had badly under-trained my networks: a single one of his models beat all of mine combined in quality. But there was still time to fix everything with my teammate's advice.
Ensemble 2
A few days remained until the end of the competition. Artyom left for the CVPR conference, and I stayed behind with a pile of predictions from various networks. Averaging is nice, but there are more advanced techniques, such as stacking. The idea is to split the training set into, for example, 5 parts, train the model on 4 of them, predict on the 5th, and do this for all 5 folds. An explanatory picture:

More about stacking, again with a competition example, can be read here.
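The fold procedure described above amounts to producing “out-of-fold” predictions: every training example gets a prediction from a model that never saw it. A minimal NumPy sketch, with a toy “model” (the training-fold mean) standing in for a real network:

```python
import numpy as np

def out_of_fold_predictions(X, y, n_folds=5, seed=0):
    """Split the data into n_folds parts; for each part, 'train' on
    the other parts and predict on it. The toy model here simply
    predicts the mean of y over its training folds."""
    n = len(X)
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    folds = np.array_split(idx, n_folds)
    oof = np.empty(n)
    for f in range(n_folds):
        val_idx = folds[f]
        train_idx = np.concatenate([folds[g] for g in range(n_folds) if g != f])
        prediction = y[train_idx].mean()  # stand-in for model.fit / model.predict
        oof[val_idx] = prediction
    return oof
```

Because the folds partition the data, the returned array covers every example exactly once; these out-of-fold predictions become the training features for the second-level model.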
How to apply it to our task: we have each network's predicted probabilities for each of the 17 classes, for a total of 6 networks * 17 classes = 102 features. This becomes our new training set. On the resulting table I trained a binary classifier for each class (17 classifiers in total). As a result, the output was label probabilities for each image, to which the previously used greedy threshold algorithm could be applied. But, to my surprise, stacking gave worse results than simple averaging. I explained this to myself by the fact that we had different splits of the training set into folds: Artyom used 5 folds, and I only 4 (the more folds the better, but you need to spend more time training models). It was then decided to do stacking only on the predictions of my own neural networks, and then take a weighted sum of the result with Artyom's predictions. Lightgbm, xgboost, and a three-layer perceptron were used as second-level models, and their outputs were averaged.
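Assembling the second-level feature table from the first-level predictions is a simple reshape; the sizes below (6 networks, 17 classes) come from the text, everything else is illustrative:

```python
import numpy as np

def stack_features(network_preds):
    """network_preds: array of shape (n_networks, n_images, n_classes)
    holding each network's class probabilities. Returns a table of
    shape (n_images, n_networks * n_classes), used to train one binary
    classifier per class at the second level."""
    n_networks, n_images, n_classes = network_preds.shape
    return network_preds.transpose(1, 0, 2).reshape(n_images, n_networks * n_classes)
```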

And here stacking really paid off: on the leaderboard we climbed into solid silver-medal territory. There was almost no time left for new ideas, so I added one more neural network to the stacking, with a last layer of four neurons and softmax activation, which predicted only the weather conditions. If we know a narrower structure of the problem, why not use it? This gave an improvement, though I cannot say it was a very strong one.
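The point of the extra weather network is that the weather classes are mutually exclusive, so a softmax forces exactly one of them, unlike independent sigmoids. A small NumPy illustration (the logit values and the class order are made up here):

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax: non-negative outputs summing to 1."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

# Hypothetical logits for the 4 weather classes
# (e.g. clear, cloudy, partly cloudy, haze).
weather_logits = np.array([2.0, -1.0, 0.5, -0.5])
weather_probs = softmax(weather_logits)

# Exactly one weather condition is then chosen by argmax.
predicted = int(np.argmax(weather_probs))
```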
Results
In the end, we finished in 17th place out of almost a thousand teams, which seems very good for a first deep learning competition. Gold medals started at 11th place, and I realized that we had been really close; our solution probably differed from the top ones only in implementation details.
What else could be done
- Many people wrote on the forum that the DenseNet architecture shows very good results, but due to my lack of experience I could not get it working.
- Train the same kind of folds, but more of them (they wrote in the chat that they used 10 folds)
- Use not one model but several to predict the weather
In conclusion, I would like to thank the responsive community of the ods.ai chat, where you can always ask for advice and will almost always get help. Incidentally, two teams from the chat managed to take 3rd and 7th places, respectively.