Hi, Habr!

In the previous two posts (one, two) we reviewed the basic algorithms and techniques used by participants in Kaggle competitions. Today I would like to go further and talk about the difficulties researchers run into when developing algorithms for the case where there is a lot of data and training has to be done on samples that do not fit into memory. It is worth noting right away that this happens quite often, even on Kaggle itself (in this task the training sample is several gigabytes in size, and a beginner may simply not know what to do with it). Below we will look at machine learning algorithms and tools that can cope with this problem.
Many who are familiar with machine learning know that quite good quality can often be obtained with simple linear models, provided that the features have been well selected and engineered (which we discussed earlier). These models are also attractive for their simplicity and often even interpretability (for example, SVM, which maximizes the width of the separating margin). However, linear methods have another very important advantage: during training, the model parameters can be updated incrementally, i.e. the weight-update step is performed each time a new object arrives. In the literature such methods are often called Online Machine Learning.
Without going into details, in a nutshell it works like this: to fit the parameters of a particular linear method (for example, the weights in logistic regression), some initial values of these parameters are chosen, and then, as each successive object of the training sample arrives, the weights are updated. Linear methods allow exactly this kind of training, and it is clear that there is no need to keep all objects in memory at the same time.
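To make this more concrete, here is a minimal sketch (in no way VW's actual code) of such an online update for logistic regression trained with stochastic gradient descent; the sparse feature encoding, learning rate and toy data are chosen purely for illustration:

import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def online_logistic_update(weights, x, y, lr=0.1):
    """One SGD step for logistic regression: x is {feature_index: value}, y is +1 or -1."""
    # prediction with the current weights
    p = sigmoid(sum(weights.get(i, 0.0) * v for i, v in x.items()))
    # gradient of the log-loss with the target mapped to {0, 1}
    error = p - (1.0 if y == 1 else 0.0)
    for i, v in x.items():
        weights[i] = weights.get(i, 0.0) - lr * error * v

# objects arrive one at a time, so the whole sample never has to sit in memory
weights = {}
for x, y in [({0: 1.0, 3: 2.5}, 1), ({0: 1.0, 5: 0.7}, -1)]:  # a toy "stream" of objects
    online_logistic_update(weights, x, y)
print(weights)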
To date, one of the best-known implementations of such methods is the Vowpal Wabbit package, which can be briefly described in a few points:
- It trains only linear models . At the same time, as we already know, the quality of these methods can be improved by adding new features, by tuning the loss function, and also by using low-rank factorizations (we will talk more about this in the next articles)
- The training set is processed with a stochastic gradient optimizer, which makes it possible to learn on samples that do not fit into memory.
- A very large number of features can be handled by hashing them (the so-called hashing trick, see the sketch after this list), which makes it possible to train models even when the full set of weights simply does not fit in memory
- Active learning mode is supported , in which training objects can be fed in over the network, even from several machines.
- Training can be parallelized across multiple machines.
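A simplified sketch of the hashing trick from the third point (VW has its own fast hash function; Python's built-in hash is used here only for illustration and is not stable across runs):

NUM_BITS = 18               # plays the same role as vw's -b option
NUM_BUCKETS = 2 ** NUM_BITS

def hash_features(raw_features):
    """Map arbitrary string features onto a fixed-size index space."""
    hashed = {}
    for name, value in raw_features.items():
        idx = hash(name) % NUM_BUCKETS                  # index in [0, 2^b - 1]
        hashed[idx] = hashed.get(idx, 0.0) + value      # colliding features are simply summed
    return hashed

print(hash_features({"sex_male": 1.0, "age": 22.0, "passenger_class_3": 1.0}))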
So, let us look at how to work with this tool in practice and what results can be obtained with it. As an example, consider the well-known Titanic: Machine Learning from Disaster task. This is probably not the best example, because there is little data in this problem. However, since the article is intended primarily for newcomers to machine learning, this post will be a good continuation of the official Tutorial. Besides, it will later be quite easy to rewrite the code used here for a real task (current at the time of writing), the Click-Through Rate Prediction competition, whose training sample is larger than 5 GB.
Before describing the specific steps, note that the code below was written quite a while ago (even before Vowpal Wabbit became popular), and the project has been actively updated lately, so everything that follows is correct only up to minor details, which the author leaves for the reader to check.
Recall that in the task at hand we are asked to build a classifier that predicts, for a particular person (a passenger of the Titanic), whether he survived or not. We will not describe the problem statement and the given features in detail; those interested can find this information on the competition page.
So let's start with the fact that Vowpal Wabbit accepts input in a specific format:

label |A feature1:value1 |B feature2:value2

which on the whole is no different from the familiar object-feature matrix, except that features can be grouped into namespaces (A and B in the example above), so that during training some of them can be switched off. Thus, after downloading the training and test samples, the first step is to convert the data into a format that Vowpal Wabbit can read.
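For example, after conversion one Titanic passenger could be represented by a line roughly like this (an illustrative sketch; the real lines are produced by the script in the next section):

-1 '1 |f passenger_class_3 last_name_braund title_mr sex_male age:22 siblings_onboard:1 family_members_onboard:0 embarked_S

Here -1 is the label, '1 is a tag (the PassengerId, which VW carries through to its predictions), |f opens a namespace called f, categorical features are written as plain tokens and numeric features as name:value.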
Preparation of training and test samples
To do this, you can take a simple script (or simply use the excellent phraug2 library) that reads the train.csv file line by line and converts each object of the training set into the desired format. Note that in the case of two-class classification the label takes the values +1 or -1:
import csv
import re

# note: this script, like the rest of the code in the post, was written for Python 2
i = 0

def clean(s):
    # keep only word characters and lowercase the result
    return " ".join(re.findall(r'\w+', s, flags=re.UNICODE | re.LOCALE)).lower()

with open("train_titanic.csv", "r") as infile, open("train_titanic.vw", "wb") as outfile:
    reader = csv.reader(infile)
    for line in reader:
        i += 1
        if i > 1:  # skip the header row
            vw_line = ""
            if str(line[1]) == "1":
                vw_line += "1 '"
            else:
                vw_line += "-1 '"
            vw_line += str(line[0]) + " |f "
            vw_line += "passenger_class_" + str(line[2]) + " "
            vw_line += "last_name_" + clean(line[3].split(",")[0]).replace(" ", "_") + " "
            vw_line += "title_" + clean(line[3].split(",")[1]).split()[0] + " "
            vw_line += "sex_" + clean(line[4]) + " "
            if len(str(line[5])) > 0:
                vw_line += "age:" + str(line[5]) + " "
            vw_line += "siblings_onboard:" + str(line[6]) + " "
            vw_line += "family_members_onboard:" + str(line[7]) + " "
            vw_line += "embarked_" + str(line[11]) + " "
            outfile.write(vw_line[:-1] + "\n")
Similarly, we proceed with the test sample:
i = 0
with open("test_titanic.csv", "r") as infile, open("test_titanic.vw", "wb") as outfile:
    reader = csv.reader(infile)
    for line in reader:
        i += 1
        if i > 1:  # skip the header row
            vw_line = ""
            vw_line += "1 '"  # the label is unknown here, so a dummy 1 is written
            vw_line += str(line[0]) + " |f "
            vw_line += "passenger_class_" + str(line[1]) + " "
            vw_line += "last_name_" + clean(line[2].split(",")[0]).replace(" ", "_") + " "
            vw_line += "title_" + clean(line[2].split(",")[1]).split()[0] + " "
            vw_line += "sex_" + clean(line[3]) + " "
            if len(str(line[4])) > 0:
                vw_line += "age:" + str(line[4]) + " "
            vw_line += "siblings_onboard:" + str(line[5]) + " "
            vw_line += "family_members_onboard:" + str(line[6]) + " "
            vw_line += "embarked_" + str(line[10]) + " "
            outfile.write(vw_line[:-1] + "\n")
At the output we get two files, train_titanic.vw and test_titanic.vw. It is worth noting that this is often the most difficult and longest stage: preparing the sample. After that we only need to run the machine learning methods on it a few times and we get the result almost immediately.
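Before moving on to training, it does not hurt to glance at the result of the conversion; something as simple as this will do:

# quick sanity check: print the first two converted lines
with open("train_titanic.vw") as f:
    for _ in range(2):
        print(f.readline().strip())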
Training linear models in Vowpal Wabbit
We work from the command line by running the vw utility with parameters passed to it. We will not dwell on a detailed description of all the parameters and will simply run one of the examples:
vw train_titanic.vw -f model.vw --binary --passes 20 -c -q ff --adaptive --normalized --l1 0.00000001 --l2 0.0000001 -b 24

Here we say that we want to solve a binary classification problem (--binary), make 20 passes over the training set (--passes 20, with -c enabling the cache that repeated passes require), apply L1 and L2 regularization (--l1 0.00000001 --l2 0.0000001) together with adaptive, normalized updates (--adaptive --normalized), and save the resulting model to model.vw. The -b 24 parameter sets the number of bits of the hash function (as mentioned at the beginning, all features are hashed, and the hash values range from 0 to 2^b - 1). It is also worth noting the -q ff parameter, which says that we additionally want pairwise (quadratic) features over namespace f in the model; this is a very useful feature of VW that can sometimes significantly improve the quality of the algorithms (a conceptual sketch follows below).
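Conceptually, -q ff takes every pair of features inside namespace f and adds their interaction as a new feature (VW does this through hashing; the sketch below just concatenates the names to show the idea):

from itertools import combinations

def quadratic_features(feature_names):
    """Build pairwise interaction features from a list of feature names."""
    return {a + "^" + b: 1.0 for a, b in combinations(feature_names, 2)}

print(quadratic_features(["sex_male", "passenger_class_3", "embarked_S"]))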
After some time we will get a trained model. It remains only to run the algorithm on the test sample.
vw -d test_titanic.vw -t -i model.vw -p preds_titanic.txt

and then convert the result into the format expected by kaggle.com:
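For reference, each line that vw writes to preds_titanic.txt contains the prediction followed by the tag we stored after the apostrophe (the PassengerId); with --binary the prediction is -1 or 1, so a line looks roughly like this (values are made up for illustration):

1 892

This is why the script below simply splits each line on a space.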
import csv

with open("preds_titanic.txt", "r") as infile, open("kaggle_preds.csv", "wb") as outfile:
    outfile.write("PassengerId,Survived\n")
    for line in infile.readlines():
        # the tag (PassengerId) is the second field on each line
        kaggle_line = str(line.split(" ")[1]).replace("\n", "")
        # the prediction itself comes first; with --binary it is -1 or 1
        if str(int(float(line.split(" ")[0]))) == "1":
            kaggle_line += ",1\n"
        else:
            kaggle_line += ",0\n"
        outfile.write(kaggle_line)
This simple solution already shows fairly good quality: a score above 0.79 on the leaderboard. We applied a rather trivial model; by tuning the parameters and "playing" with the features, the result can be improved a bit further (the reader is invited to do this as an exercise). I hope this introduction will help beginners cope with large volumes of data in machine learning competitions!