
How to farm Kaggle

* farming - (gaming slang) the long, tedious repetition of certain in-game actions with a specific goal (gaining experience, collecting resources, etc.).


Introduction


Recently (on October 1) a new session of an excellent DS/ML course started (I highly recommend it as a first course to anyone who wants to, as it is now fashionable to say, "get into" DS). And, as usual, after finishing any course, graduates ask the same question: where do I now get practical experience to consolidate the still raw theoretical knowledge? Ask this on any relevant forum and the answer will most likely be the same: go solve Kaggle. Kaggle is fine, but where do you start, and how do you use this platform most effectively to level up your practical skills? In this article the author will try to give his own answers to these questions, and also to map out the main rakes lying around the field of competitive DS, so that you level up faster and have fun doing it.


A few words about the course from its creators:


The mlcourse.ai course is one of the largest activities of the OpenDataScience community. @yorko and company (~60 people) demonstrate that you can get solid skills outside a university, and even entirely for free. The main idea of the course is an optimal combination of theory and practice. On the one hand, the presentation of the basic concepts does not shy away from mathematics; on the other hand, a lot of homework, Kaggle Inclass competitions and projects will, with a certain investment of effort on your part, give you excellent machine learning skills. Add to that the competitive nature of the course: there is an overall student rating, which is very motivating. The course also stands out because it takes place in a genuinely lively community.


The course includes two Kaggle Inclass competitions. Both are very interesting, and feature engineering works well in them. The first is user identification by the sequence of visited sites. The second is predicting the popularity of an article on Medium. The main benefit comes from the two homework assignments in which you have to get creative and beat the baselines of these competitions.


Having paid tribute to the course and its creators, let's get back to our story...


I remember myself a year and a half ago: the course (still the first version) from Andrew Ng was finished, the specialization from MIPT was finished, a pile of books had been read. A fair amount of theoretical knowledge, yet any attempt to solve a basic practical task ended in a stupor. No, it was clear how to approach the problem and which algorithms to use, but the code was written with great difficulty, with constant trips to the sklearn/pandas documentation, and so on. Why? Because there were no established pipelines and no feeling of having the code "at your fingertips".


That will not do, the author thought, and went to Kaggle. It was scary to start with a live competition right away, so the first step was the Getting Started competition "House Prices: Advanced Regression Techniques", in which the approach to efficient leveling up described in this article took shape.


There is no know-how in what follows; all the techniques and methods are obvious and predictable, but that does not make them any less effective. At least, by following them the author managed to earn the Kaggle Competition Master badge in six months and three competitions in solo mode and, at the time of writing, to enter the top 200 of the Kaggle world ranking. By the way, this also answers the question of why the author dared to write an article of this kind at all.


In a nutshell, what is Kaggle?


Kaggle is one of the best-known platforms for Data Science competitions. For each competition the organizers post a description of the problem, the data for solving it, the metric by which solutions will be evaluated, and they set the deadlines and the prizes. Participants get from 3 to 5 attempts per day (at the organizers' discretion) to "submit" (send in their own solution).


The data is split into a training sample (train) and a test sample (test). For the training part the value of the target variable (target) is known; for the test part it is not. The participants' task is to build a model which, trained on the training part of the data, gives the best possible result on the test part.


Each participant makes predictions for the test sample and sends the result to Kaggle; a robot (which knows the target variable for the test) evaluates the submitted result, and the score appears on the leaderboard.
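For those who have never submitted anything, a minimal sketch of this cycle is below. The file names and the "id" / "target" column names are hypothetical: every competition defines its own data and submission format.

    # A minimal sketch of the train -> predict -> submit cycle.
    # File names and column names ("id", "target") are hypothetical.
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier

    train = pd.read_csv('train.csv')
    test = pd.read_csv('test.csv')

    features = [c for c in train.columns if c not in ('id', 'target')]  # assume numeric features

    model = RandomForestClassifier(n_estimators=200, random_state=42)
    model.fit(train[features], train['target'])

    # Predictions for the test part (where the target is hidden) go into the submission file.
    submission = pd.DataFrame({
        'id': test['id'],
        'target': model.predict_proba(test[features])[:, 1],
    })
    submission.to_csv('submission.csv', index=False)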


But it is not quite that simple: the test data, in turn, is split in some proportion into a public part and a private part. During the competition the submitted solution is evaluated, by the metric set by the organizers, on the public part of the data, and the score is placed on the so-called public leaderboard, by which participants can judge the quality of their models. The final solution (usually two, at the participant's choice) is evaluated on the private part of the test data, and that result goes to the private leaderboard, which is visible only after the end of the competition and by which, in fact, the final standings are determined and the prizes, goodies and medals are handed out.


So during the competition the only information available to participants is how their model behaved (what score it got) on the public part of the test data. If, in the spherical-horse-in-a-vacuum case, the private part of the data has the same distribution and statistics as the public part, everything is fine. If not, then a model that looked good on the public part may fall apart on the private part, that is, it has overfit. This is where the infamous "flight" happens, when people drop from 10th place on the public leaderboard to 1000th-2000th place on the private one, because the model they chose overfit and could not deliver the required accuracy on data it had never seen.




How do you avoid this? First of all, you need to build a correct validation scheme, which is exactly what is taught in the first lessons of almost any DS course. Because if your model cannot give a correct prediction on data it has never seen, then no matter what sophisticated techniques you use and no matter how complex the neural networks you build, you cannot put such a model into production: its results are worth nothing.
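A minimal sketch of such a validation scheme is below (the data here is hypothetical, and the split strategy, whether KFold, StratifiedKFold, GroupKFold or a time-based split, must match how the test set was actually built):

    # A sketch of local validation: estimate the model's quality on data it has not seen.
    # X and y are assumed to be numpy arrays; ROC AUC is used purely as an example metric.
    import numpy as np
    from sklearn.model_selection import StratifiedKFold
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score

    def validate(X, y, n_splits=5, seed=42):
        skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
        scores = []
        for train_idx, valid_idx in skf.split(X, y):
            model = LogisticRegression(max_iter=1000)
            model.fit(X[train_idx], y[train_idx])
            preds = model.predict_proba(X[valid_idx])[:, 1]
            scores.append(roc_auc_score(y[valid_idx], preds))
        # If the mean local score moves together with the public leaderboard, the scheme is sane.
        return np.mean(scores), np.std(scores)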


For each competition on Kaggle a separate page is created with a section for the data, a description of the metric and, most interesting for us, the forum and the kernels.


The forum is a forum like any other: people write, discuss and share ideas. Kernels are more interesting. In essence, a kernel is the ability to run your code with direct access to the competition data in the Kaggle cloud (similar to Amazon AWS, Google GCE, etc.). Limited resources are allocated to each kernel, so if the dataset is not too large you can work with it directly from the browser on the Kaggle website: write the code, run it, submit the result. Two years ago Kaggle was bought by Google, so it is no surprise that under the hood this functionality runs on Google Compute Engine.


Moreover, there have been several competitions (the most recent being Mercari) where the data could be worked with only through Kernels. A very interesting format that levels out differences in participants' hardware and forces you to turn your brain on and optimize code and approaches, since the kernels had a hard resource limit, at that time 4 cores / 16 GB RAM / 60 minutes run-time / 1 GB of scratch and output disk space. While working on that competition the author learned more about optimizing neural networks than from any theoretical course. It was not quite enough for gold, I finished 23rd solo, but I got plenty of experience and pleasure out of it...


Taking this opportunity, I want to once again say thanks to colleagues from ods.ai, Artur Stepanenko (arthur), Konstantin Lopukhin (kostia) and Sergey Fironov (sergeif), for advice and support in that competition. In general there were many memorable moments; Konstantin Lopukhin (kostia), who took first place together with Paweł Jankiewicz, afterwards posted in the chat what was dubbed the "reference humiliation in 75 lines": a kernel of 75 lines of code that produces a result in the gold zone of the leaderboard. That is, of course, a must see :)


Well, we got distracted. So, people write code and publish kernels with solutions, interesting ideas and so on. Usually, within a couple of weeks of the start of a competition, one or two excellent EDA (exploratory data analysis) kernels appear, with a detailed description of the dataset, statistics, characteristics, etc., plus a couple of baselines (basic solutions) which, of course, do not show the best result on the leaderboard, but can be used as a starting point for your own solution.


Why Kaggle?




Honestly, it does not matter much which platform you play on; Kaggle is simply one of the first and most popular, with an excellent community and a fairly comfortable environment (I hope they will eventually polish the kernels for stability and performance, though many people still remember the hell that went on during Mercari). Overall the platform is very convenient and self-sufficient, and its badges are still valued.


A small digression on the topic of competitive DS in general. Very often, in articles, talks and other conversations, you hear the idea that it is all nonsense, that competition experience has nothing to do with real problems, and that the people there just tune the 5th decimal place, which is insanity detached from reality. Let's look at this question in a bit more detail:


As practicing DS specialists, unlike academia and science, we should and will solve business problems in our work. That is (here a nod to CRISP-DM), to solve a problem you need to:


* understand the business problem
* assess the potential effect of solving it
* translate it into a DS problem and choose a metric that approximates the business goal
* collect and prepare the data
* choose a model, train and validate it, and roll the solution out to production
The first four items on this list are not taught anywhere (correct me if such courses have appeared, I will sign up without hesitation); you can only absorb them from the experience of colleagues working in the industry. But the last point, from the choice of the model onwards, can and should be trained in competitions.


In any competition most of that work has already been done for us by the organizers. We have a described business goal, an approximating metric has been chosen, the data has been collected, and our task is to build a working pipeline out of all this Lego. And this is where the skills get trained: how to handle missing values, how to prepare data for neural networks and for trees (and why neural networks need a special approach), how to build validation correctly, how not to overfit, how to choose hyperparameters, and a dozen or two more "hows" whose competent execution distinguishes a good specialist in our profession.
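To give one concrete example of such a "how": trees and neural networks really do want different data preparation. Below is a rough sketch with hypothetical column names, not the author's pipeline.

    # Tree-based models (LightGBM/XGBoost) tolerate NaNs and only need categories as numbers;
    # neural networks want imputed, scaled numeric features and one-hot encoded categories.
    import pandas as pd
    from sklearn.preprocessing import StandardScaler

    def prep_for_trees(df, cat_cols):
        out = df.copy()
        for c in cat_cols:
            out[c] = out[c].astype('category').cat.codes  # -1 marks missing categories
        return out

    def prep_for_nn(df, cat_cols, num_cols):
        out = df.copy()
        out[num_cols] = out[num_cols].fillna(out[num_cols].median())
        out[num_cols] = StandardScaler().fit_transform(out[num_cols])
        return pd.get_dummies(out, columns=cat_cols, dummy_na=True)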


What you can farm on Kaggle




Basically, and this is reasonable, all newcomers come to Kaggle to gain and level up practical experience, but one should not forget that besides that there are at least two more goals:


* farming medals and ranking points
* farming prize money
The main thing to remember is that these three goals are completely different, they require different approaches, and you should not mix them, especially at the initial stage!


It is not for nothing that "at the initial stage" is emphasized. Once you level up, these three goals will merge into one and will be pursued in parallel, but while you are just starting out, do not mix them! That way you will avoid pain, disappointment and resentment toward this unjust world.


Let's go briefly through the goals, from the bottom up:



(*) blending of public kernels is a medal-farming technique in which the published kernels with the best scores on the public leaderboard are taken, their predictions are averaged (blended), and the result is submitted. As a rule, this method leads to severe overfitting (to the public part) and a flight downwards on the private part, but occasionally it gets a submission almost into silver. At the initial stage the author does not recommend this approach (read below about the belt and the pants).
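Technically a blend of public kernels is nothing more than the following (the file names are made up):

    # Averaging the predictions of several public submission files.
    import pandas as pd

    files = ['kernel_a.csv', 'kernel_b.csv', 'kernel_c.csv']  # hypothetical public submissions
    subs = [pd.read_csv(f) for f in files]

    blend = subs[0].copy()
    blend['target'] = sum(s['target'] for s in subs) / len(subs)  # simple mean of predictions
    blend.to_csv('blend_submission.csv', index=False)

Which is also why it inherits whatever overfitting to the public part those kernels already have.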


I recommend choosing "experience" as your first goal and sticking to it until you feel ready to work on two or three goals at the same time.


There are two more points worth mentioning (Vladimir Iglovikov (ternaus), thanks for the reminder).


The first is converting the effort invested in Kaggle into a new, more interesting and/or better paid job. However much the Kaggle badges get devalued these days, for people who understand, the "Kaggle Competition Master" line in a resume, and other achievements, are still worth something.


As an illustration of this point, two interviews (one, two) with our colleagues Sergey Mushinsky (cepera_ang) and Alexander Buslaev (albu).


And also the opinion of Valery Babushkin (venheads):


Valery Babushkin is Head of Data Science at X5 Retail Group (current team size is 30 people plus 20 open positions for 2019)


Team Leader, Analytics, Yandex Advisor


Kaggle Competition Master is an excellent proxy metric for evaluating a future team member. Of course, in light of recent events, with teams of 30 people and undisguised "locomotives", the profile needs a slightly more thorough look than before, but it is still a matter of a few minutes. A person who has earned the Master title can, with high probability, write at least medium-quality code, has a reasonable grasp of machine learning, knows how to clean data and how to build stable solutions. If there is no Master title, the mere fact of participation is also a plus: at least the candidate knows Kaggle exists and did not begrudge the time to get into it. And if they ran something other than public kernels and their solution beat those results (which is fairly easy to check), that is grounds for a detailed conversation about technical details, which is much better and more interesting than classical theory questions, whose answers say less about how a person will cope with the actual work. The only thing to beware of, and something I have run into, is that some people think a DS job is pretty much like Kaggle, which is fundamentally wrong. Many also think that DS = ML, which is another mistake.


The second point is that many problems can be written up as preprints or articles, which on the one hand keeps the knowledge the collective mind gave birth to during a competition from dying in the depths of the forum, and on the other hand adds another line to the authors' portfolio and +1 to visibility, which in any case has a positive effect both on a career and on the citation index.


For example, a list of works by our colleagues based on several competitions.

Authors (in alphabetical order):


Andrei O., Ilya, albu, aleksart, alex.radionov, almln, alxndrkalinin, cepera_ang, dautovri, davydov, fartuk, golovanov, ikibardin, kes, mpavlov, mvakhrushev, n01z3, rakhlin, rauf, snikolenko, ternaus, twoleggedeye, versus, vicident, zfturbo


Competition and the corresponding article title:

* Dstl Satellite Imagery Feature Detection - "Satellite imagery feature detection using deep convolutional neural network: A Kaggle competition"
* Carvana Image Masking Challenge - "TernausNet: U-Net with VGG11 Encoder Pre-Trained on ImageNet for Image Segmentation"
* MICCAI2017: Gastrointestinal Image ANAlysis (GIANA) - "Angiodysplasia Detection and Localization Using Deep Convolutional Neural Networks"
* MICCAI2017: Robotic Instrument Segmentation - "Automatic Instrument Segmentation in Robot-Assisted Surgery Using Deep Learning"
* DEEPGLOBE - CVPR18: Road Extraction - "Fully convolutional network for automatic road extraction from satellite imagery"
* DEEPGLOBE - CVPR18: Building Detection - "TernausNetV2: Fully convolutional network for instance segmentation"
* DEEPGLOBE - CVPR18: Land Cover Classification - "Feature pyramid network for multi-class land segmentation"
* Pediatric Bone Age Challenge - "Pediatric Bone Age Assessment Using Deep Convolutional Neural Networks"
* IEEE's Signal Processing Society - Camera Model Identification - "Camera Model Identification Using Convolutional Neural Networks"
* TensorFlow Speech Recognition Challenge - "Deep Learning Approaches for Understanding Simple Speech Commands"
* ICIAR2018-Challenge - "Deep Convolutional Neural Networks for Breast Cancer Histology Image Analysis"
* Diabetic Retinopathy Detection - "Deep Learning classification framework"
* DEEPGLOBE - CVPR18: Land Cover Classification - "Land Cover Classification from Satellite Imagery With U-Net and Lovasz-Softmax Loss"
* DEEPGLOBE - CVPR18: Land Cover Classification - "Land Cover Classification With Superpixels and Jaccard Index Post-Optimization"
* DEEPGLOBE - CVPR18: Building Detection - "Building Detection from Satellite Imagery Using a Composite Loss Function"
* The Marinexplore and Cornell University Whale Detection Challenge - "North Atlantic Right Whale Call Detection with Convolutional Neural Networks"
* NIPS 2017: Learning to Run - "Run, skeleton, run: physics-based simulation"
* NIPS 2017: Learning to Run - "Learning to Run challenge solutions: Adapting reinforcement learning methods for neuromusculoskeletal environments"
* ImageNet Large Scale Visual Recognition Challenge 2013 (ILSVRC2013) - "Comparison of Regularization Methods for ImageNet Classification with Deep Convolutional Neural Networks"
* MS-Celeb-1M (2017) - "Doppelganger Mining for Face Representation Learning"
* Disguised Faces in the Wild (DFW) 2018 - "Hard Example Mining with Auxiliary Embeddings"
* Favorita grocery sales forecasting - "Sales forecasting using WaveNet within the framework of the Kaggle competition"

How to avoid the pain of lost medals




Let it go!


Let me explain. In almost every competition, closer to its end, someone publishes a kernel with a solution that shifts the whole leaderboard up, and you, with your solution, accordingly move down. And every time the pain starts on the forum! How come, I had a solution in silver, and now I am not even in bronze. What are you doing, put everything back.


Remember: Kaggle is competitive DS. Where you end up on the leaderboard depends only on you. Not on the guy who published the kernel, not on whether the stars aligned, but on how much effort you put into your solution and whether you used every possible way to improve it.


If a public kernel knocks you off your place on the leaderboard, it was not your place.

Instead of pouring out your pain about the injustice of the world, thank that guy. Seriously: a public kernel with a better solution than yours means you missed something in your pipelines. Find out what exactly, improve your pipeline, and overtake the whole crowd of hamsters sitting on the same score. Remember, to get your place back you only need to be slightly better than that public kernel.


How upset I was by this in my first competition; I almost gave up. One moment you are in silver, the next you are at the bottom of the leaderboard. It is fine: you just have to pull yourself together, figure out where and what you missed, rework your solution, and reclaim your place.


Moreover, this will only sting at an early stage of your competitive career. The more experienced you become, the less the published kernels and the alignment of the stars affect you. In one of the recent competitions (TalkingData, where our team took 8th place) such a kernel was also published, but it earned exactly one line in our team chat from Pavel Pleskov (ppleskov): "Guys, I blended it with our solution and it only got worse, so we throw it away." That is, all the useful signal that kernel was extracting from the data had already been extracted by our models.


And about the medals - remember:


"a belt without equipment is needed only to maintain pants" (C)

Where, in what, and how to write code




Here is my recommendation: python 3.6 in jupyter notebook under ubuntu. Python has long been the de facto standard in DS thanks to the huge number of libraries and the community; jupyter, especially with jupyter_contrib_nbextensions, is very convenient for rapid prototyping, analysis and data processing; ubuntu is convenient in itself, and some of the data processing is sometimes easier to do in bash :)


After installing jupyter_contrib_nbextensions, I recommend immediately enabling:



And your life will become much simpler and more pleasant.


As soon as your pipelines become more or less stable, I recommend moving the code into separate modules right away. Believe me, you will rewrite it more than once, and not twice, and not even five times. And that is normal.


There is also the opposite approach, in which participants try to use jupyter notebook as rarely as possible and only when necessary, preferring to write their pipelines as scripts right away. (An adherent of this option is, for example, Vladimir Iglovikov (ternaus).)


And there are those who try to combine jupyter with an IDE, for example pycharm.


Each approach has a right to exist, each has its pros and cons; as they say, tastes differ. Choose whatever you are comfortable working with.


But in any case, make it a rule:


save the code for every submission / OOF run you make (see below).

(*) OOF - out of folds, a technique for obtaining model predictions for the training part of the dataset using cross-validation. It is indispensable for later assembling several solutions into an ensemble. It is taught in courses or easily googled; a minimal sketch is given below.
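A minimal sketch of the idea, with a hypothetical model and numpy arrays for data:

    # OOF: every training object gets a prediction from a model that did not see it during fit.
    import numpy as np
    from sklearn.model_selection import KFold
    from sklearn.linear_model import Ridge

    def get_oof(model, X, y, X_test, n_splits=5, seed=42):
        oof = np.zeros(len(X))
        test_preds = np.zeros((n_splits, len(X_test)))
        kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
        for i, (tr_idx, val_idx) in enumerate(kf.split(X)):
            model.fit(X[tr_idx], y[tr_idx])
            oof[val_idx] = model.predict(X[val_idx])
            test_preds[i] = model.predict(X_test)
        # The OOF vector later serves as a feature for the next level of an ensemble.
        return oof, test_preds.mean(axis=0)

    # usage (X, y, X_test assumed to be prepared numpy arrays):
    # oof_ridge, test_ridge = get_oof(Ridge(alpha=1.0), X, y, X_test)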


How? Well, there are at least three options:


* manually save a copy of the code (the notebook plus the modules) for each submission
* commit the code for each submission to git
* use a specialized experiment-tracking framework
In general, the community is gradually moving toward the third option, since both the first and the second have their flaws; still, they are simple and reliable, and frankly, for Kaggle they are quite enough.




One more thing about python for those who are not programmers: do not be afraid of it. Your task is to understand the basic constructs and the core of the language, enough to read other people's kernels and write your own libraries. There are many good beginner courses on the web; perhaps the comments will point out exactly where to look. Unfortunately (or fortunately) I cannot judge the quality of such courses, so I do not include links in the article.


So, on to the framework




Note


All further description is based on working with tabular and textual data. Image competitions, where ResNet/VGG-style models and serious hardware rule, are a separate story and are not covered here; the author's only experience with them was the Camera Identification competition, survived largely thanks to the ods.ai community.



The author's working environment is a set of jupyter notebooks plus separate modules.

Raw data is read from CSV once and saved to a fast binary format (feather/pickle/hdf); all subsequent work goes through these binary files instead of re-reading CSV, which is noticeably slower. In especially memory-hungry cases (for example, TalkingData) memmap-based storage helped fit the dataset needed for lgb.





A good place to start is a Getting Started competition (in the author's case it was House Prices: Advanced Regression Techniques): read the kernels, poke around in the data, and use it to assemble your first end-to-end pipeline.






Solving several competitions in this mode, you notice each time that there are fewer and fewer notes in the notebooks and more and more code in the modules. Gradually the analysis stage boils down to reading a solution description, saying "aha, so that's how it's done!", and adding one or two new tricks or approaches to your collection.


After that, this mode gives way to working on mistakes. The foundation is ready; now you just need to apply it correctly. After each competition, read the write-ups of the top solutions and look at what you did not do, what could have been done better, what you missed, or where you plainly screwed up, as happened to me in Toxic. I was doing quite well, right on the edge of gold, and then dropped 1500 positions on the private leaderboard. It hurt to the point of tears... but I calmed down, found the mistake, wrote a post in the slack, and learned the lesson.


A sign that you have finally reached working mode is the moment one of the top-solution write-ups is published under your nickname.


What should roughly be in your pipelines by the end of this stage:



The author built separate metaclasses for linear and for tree-based models, with a single external interface, to smooth over the API differences between libraries. Now it is possible to run, for example, LGB or XGB over the same processed dataset in a uniform way, with a single line.
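The author's KudsonLGB / KudsonXGB classes mentioned below are not public, so here is only a minimal sketch of the idea: one external interface that hides the API differences between the libraries.

    # A unified fit/predict wrapper over LightGBM and XGBoost (a sketch, not the author's code).
    import lightgbm as lgb
    import xgboost as xgb

    class WrappedLGB:
        def __init__(self, params=None):
            self.params = params or {'objective': 'binary', 'verbosity': -1}
            self.booster = None

        def fit(self, X, y, X_val, y_val):
            self.booster = lgb.train(self.params, lgb.Dataset(X, y),
                                     valid_sets=[lgb.Dataset(X_val, y_val)])
            return self

        def predict(self, X):
            return self.booster.predict(X)

    class WrappedXGB:
        def __init__(self, params=None):
            self.params = params or {'objective': 'binary:logistic'}
            self.booster = None

        def fit(self, X, y, X_val, y_val):
            self.booster = xgb.train(self.params, xgb.DMatrix(X, label=y),
                                     evals=[(xgb.DMatrix(X_val, label=y_val), 'val')])
            return self

        def predict(self, X):
            return self.booster.predict(xgb.DMatrix(X))

    # The rest of the pipeline sees only fit()/predict() and does not care which library is inside.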



Let's summarize


Any sport, and competitive DS is also a sport, means a lot of sweat and a lot of work. That is neither good nor bad, it is a fact. Participation in competitions (if you approach the process correctly) trains technical skills very well, and it also builds a kind of sporting spirit: when you really do not want to do anything and everything is falling apart, you still get up, go to the laptop, rework the model and launch it again, to shave off that unfortunate 5th decimal place.


So go solve Kaggle: farm experience, medals and fun!


A couple of words about the author's pipelines




In this section I will try to describe the main idea behind the pipelines and modules assembled over a year and a half. Again, this approach does not claim to be universal or unique, but perhaps it will help someone.



    def do_cat_dummy(data, attrs, prefix_sep='_ohe_', params=None):
        # do something
        return _data, new_attrs

The input is a dataset, the attributes to work on, a prefix for the new attributes and additional parameters. The output is a new dataset with the new attributes plus the list of those attributes. This new dataset is then saved to a separate pickle/feather file.


What this gives us is the ability to quickly assemble datasets for training from pre-generated building blocks. For example, for categorical features we do three kinds of processing at once, Label Encoding / OHE / frequency encoding, save them into three separate feathers, and then at the modeling stage we simply play with these blocks, creating elegant training datasets with one elegant movement.
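A sketch of how such a block could be produced and stored; the file names here are illustrative, and do_cat_dummy is the function with the convention described above.

    # Build one "block" of features and save it; one file per encoding of the same columns.
    import pandas as pd

    data = pd.read_feather('attrs_base.feather')       # base dataset, already in binary format
    cat_attrs = ['cat1', 'cat67']                       # hypothetical categorical columns

    ohe_data, ohe_attrs = do_cat_dummy(data, cat_attrs, prefix_sep='_ohe_')
    ohe_data[ohe_attrs].reset_index(drop=True).to_feather('cat67_ohe.feather')

    # Label Encoding and frequency versions of the same columns go to their own files,
    # and at the modeling stage the blocks are assembled as in the author's snippet below.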


    pickle_list = [
        'attrs_base',
        'cat67_ohe',
        # 'cat67_freq',
    ]
    short_prefix = 'base_ohe'

    _attrs, use_columns, data = load_attrs_from_pickle(pickle_list)
    cat_columns = []

If you need to assemble a different dataset, you change pickle_list, reload, and work with the new dataset.


The main set of functions for tabular data (numeric and categorical) includes various category encodings, projections of numeric attributes onto categorical ones, and various other transformations.


    def do_cat_le(data, attrs, params=None, prefix='le_'):
    def do_cat_dummy(data, attrs, prefix_sep='_ohe_', params=None):
    def do_cat_cnt(data, attrs, params=None, prefix='cnt_'):
    def do_cat_fact(data, attrs, params=None, prefix='bin_'):
    def do_cat_comb(data, attrs_op, params=None, prefix='cat_'):
    def do_proj_num_2cat(data, attrs_op, params=None, prefix='prj_'):
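The implementations are not public; to show the convention, here is a guess at what a frequency-encoding function of this family might look like inside:

    # A sketch of do_cat_cnt-style frequency encoding following the same signature convention:
    # takes (data, attrs, ...), returns (new dataset, list of new attributes).
    def do_cat_cnt(data, attrs, params=None, prefix='cnt_'):
        _data = data.copy()
        new_attrs = []
        for col in attrs:
            new_col = prefix + col
            _data[new_col] = _data[col].map(_data[col].value_counts())  # how often each value occurs
            new_attrs.append(new_col)
        return _data, new_attrs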

A universal Swiss-army knife for combining attributes, to which we pass the list of source attributes and the list of transformation functions, and get back, as usual, a list of new attributes.


    def do_iter_num(data, attrs_op, params=None, prefix='comb_'):

Plus various additional specific converters.


Text data is handled by a separate module that includes various methods of preprocessing, tokenization, lemmatization/stemming, conversion into frequency representations, and so on. Everything is standard, built on sklearn, nltk and keras.
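For example, the frequency-representation part of that route is, in a minimal sketch, just sklearn's TF-IDF (the column name and parameters here are hypothetical):

    # Turn a text column into a sparse TF-IDF matrix; the fitted vectorizer is reused for test.
    from sklearn.feature_extraction.text import TfidfVectorizer

    def do_text_tfidf(data, text_col, params=None):
        vec = TfidfVectorizer(max_features=50000, ngram_range=(1, 2),
                              sublinear_tf=True, strip_accents='unicode')
        matrix = vec.fit_transform(data[text_col].fillna(''))
        return matrix, vec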


Time series are also handled by a separate module, with functions for transforming the original dataset both for ordinary tasks (regression/classification) and for sequence-to-sequence ones. Thanks to François Chollet for polishing keras so that building seq2seq models no longer looks like a voodoo ritual for summoning demons.


The same module, by the way, contains functions for ordinary statistical analysis of a series: a stationarity test, STL decomposition, etc. They help a lot at the initial stage of analysis, to get a feel for the series and see what it actually is.
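A sketch of that first "get a feel for the series" step with statsmodels (the series here is synthetic):

    # Stationarity check (ADF test) plus STL decomposition into trend / seasonality / residual.
    import numpy as np
    import pandas as pd
    from statsmodels.tsa.stattools import adfuller
    from statsmodels.tsa.seasonal import STL

    idx = pd.date_range('2017-01-01', periods=365, freq='D')
    series = pd.Series(np.sin(np.arange(365) * 2 * np.pi / 7) + np.random.normal(0, 0.1, 365),
                       index=idx)

    adf_stat, p_value = adfuller(series)[:2]
    print(f'ADF statistic: {adf_stat:.3f}, p-value: {p_value:.3f}')  # small p-value -> stationary

    decomposition = STL(series, period=7).fit()
    decomposition.plot()  # trend, weekly seasonality and residual (needs matplotlib)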



    _fpreproc = fpr_target_enc
    _fpreproc_params = fpr_target_enc_params

    _fpreproc_params.update(**{
        'use_columns' : cat_columns,
    })


That is, to work with LGB, we create a model


    model_to_use = 'lgb'
    model = KudsonLGB(task='classification')

For XGB:


    model_to_use = 'xgb'
    metric_name = 'auc'
    task = 'classification'
    model = KudsonXGB(task=task, metric_name=metric_name)

And all further functions operate on model.



This pipeline was battle-tested once again in the recent Home Credit competition; attentive and careful use of all the blocks and modules brought 94th place and silver.


The author is generally ready to voice the seditious idea that, for tabular data and a properly built pipeline, the final submit in any competition should land in the top 100 of the leaderboard. Naturally there are exceptions, but on the whole the statement seems to hold.


About teamwork




It is not that simple. Whether to solve Kaggle in a team or solo depends a lot on the person (and on the team), but my advice to those who are just starting out is to try going solo. Why? I will try to explain my point of view:



Useful tips from Captain Obvious, and the promised map of the rakes :)




These tips reflect the author's experience, are not dogma, and can (and should) be checked in your own experiments.



And finally:



Useful links




General


http://ods.ai/ - for those who want to join the best DS community :)
https://mlcourse.ai/ - the ods.ai course site
https://www.Kaggle.com/general/68205 - a post about the course on Kaggle


In general, I highly recommend, in the same mode as described in this article, watching the mltrainings video series: there are many interesting approaches and techniques.


Video



Courses


The methods and approaches to solving problems on Kaggle are covered in more detail in the second course of the specialization, "How to Win a Data Science Competition: Learn from Top Kagglers".


Extracurricular reading:



Conclusion




The topic of Data Science in general and competitive Data Science in particular is as inexhaustible as the atom (C). In this article the author has only scratched the surface of leveling up practical skills with the help of competition platforms. If it got you interested: join in, look around, accumulate experience and write your own articles. The more good content, the better for all of us!


Anticipating the question: no, the author's pipelines and libraries are not yet publicly available.


Many thanks to the colleagues from ods.ai, Vladimir Iglovikov (ternaus), Yuri Kashnitsky (yorko), Valery Babushkin (venheads), Alexey Pronkin (pronkin_alexey), Dmitry Petrov (dmitry_petrov) and Artur Kuzin (n01z3), for reviewing the article before publication, for the edits and the reviews.


Special thanks to Nikita Zavgorodny (njz) for the final proofreading.


Thank you for your attention, I hope this article will be useful to someone.


My nickname on Kaggle / ods.ai: kruegger



Source: https://habr.com/ru/post/426227/

