
How to quickly write a machine learning algorithm and roll it out to production

Nowadays, data analysis is increasingly used in fields that are often far removed from IT, and the tasks a specialist faces in the early stages of a project differ radically from those faced by large companies with mature analytics departments. In this article I will discuss how to quickly build a useful prototype and prepare a simple API so that an application programmer can use it.

As an example, consider the problem of predicting the price of pipes, hosted on the Kaggle competition platform. The description and data can be found here. In practice, there are very often tasks where you need to quickly make a prototype with very little data, or even without any real data at all until the first deployment. In those cases you have to approach the task creatively, start with simple heuristics, and treasure every query or labeled object. In our model situation, fortunately, there are no such problems, so we can begin right away with a review of the data, a definition of the problem, and attempts to apply algorithms.

Unpacking the archive with the data, you will find that it contains roughly a couple dozen csv files. According to the description on the competition page, the main ones are train_set.csv and test_set.csv. They contain the basic price information. The remaining files contain auxiliary, somewhat less important data. Let's look at them in more detail.

Placing the archive in the data subdirectory of the project root directory and unarchiving it with the commands

$ cd data/
$ unzip data.zip
$ cd ..

We can see what is in the files of interest to us:

$ head data/competition_data/train_set.csv
tube_assembly_id,supplier,quote_date,annual_usage,min_order_quantity,bracket_pricing,quantity,cost
TA-00002,S-0066,2013-07-07,0,0,Yes,1,21.9059330191461
TA-00002,S-0066,2013-07-07,0,0,Yes,2,12.3412139792904
TA-00002,S-0066,2013-07-07,0,0,Yes,5,6.60182614356538
TA-00002,S-0066,2013-07-07,0,0,Yes,10,4.6877695119712
TA-00002,S-0066,2013-07-07,0,0,Yes,25,3.54156118026073
TA-00002,S-0066,2013-07-07,0,0,Yes,50,3.22440644770007
TA-00002,S-0066,2013-07-07,0,0,Yes,100,3.08252143576504
TA-00002,S-0066,2013-07-07,0,0,Yes,250,2.99905966403855
TA-00004,S-0066,2013-07-07,0,0,Yes,1,21.9727024365273

We see columns with data describing the object (instance) whose price we will predict: the assembly identifier (completely opaque at the moment, but part of the magic of machine learning is that, to use data effectively, you sometimes don't need to understand what it means), the supplier number, the date, and so on. Pay particular attention to the penultimate column with the number of units in the delivery. Finally, the last column is the label: the price. The more units of goods in the delivery, the lower the price per unit, which is consistent with our understanding of how the real world works.

Let's look now at the file with test data.

$ head data/competition_data/test_set.csv
id,tube_assembly_id,supplier,quote_date,annual_usage,min_order_quantity,bracket_pricing,quantity
1,TA-00001,S-0066,2013-06-23,0,0,Yes,1
2,TA-00001,S-0066,2013-06-23,0,0,Yes,2
3,TA-00001,S-0066,2013-06-23,0,0,Yes,5
4,TA-00001,S-0066,2013-06-23,0,0,Yes,10
5,TA-00001,S-0066,2013-06-23,0,0,Yes,25
6,TA-00001,S-0066,2013-06-23,0,0,Yes,50
7,TA-00001,S-0066,2013-06-23,0,0,Yes,100
8,TA-00001,S-0066,2013-06-23,0,0,Yes,250
9,TA-00003,S-0066,2013-07-07,0,0,Yes,1

We see the same columns except for the last one, the price to be predicted. This is logical for a competition: it is on these data that the answers must be formed, submitted to the competition website, and scored for an intermediate result.

As we noted above, in addition to the two main data files we also have quite a few auxiliary ones at our disposal. It makes sense to look at least at a couple of them to get a basic idea of the data they contain.

$ head data/competition_data/tube.csv
tube_assembly_id,material_id,diameter,wall,length,num_bends,bend_radius,end_a_1x,end_a_2x,end_x_1x,end_x_2x,end_a,end_x,num_boss,num_bracket,other
TA-00001,SP-0035,12.7,1.65,164,5,38.1,N,N,N,N,EF-003,EF-003,0,0,0
TA-00002,SP-0019,6.35,0.71,137,8,19.05,N,N,N,N,EF-008,EF-008,0,0,0
TA-00003,SP-0019,6.35,0.71,127,7,19.05,N,N,N,N,EF-008,EF-008,0,0,0
TA-00004,SP-0019,6.35,0.71,137,9,19.05,N,N,N,N,EF-008,EF-008,0,0,0
TA-00005,SP-0029,19.05,1.24,109,4,50.8,N,N,N,N,EF-003,EF-003,0,0,0
TA-00006,SP-0029,19.05,1.24,79,4,50.8,N,N,N,N,EF-003,EF-003,0,0,0
TA-00007,SP-0035,12.7,1.65,202,5,38.1,N,N,N,N,EF-003,EF-003,0,0,0
TA-00008,SP-0039,6.35,0.71,174,6,19.05,N,N,N,N,EF-008,EF-008,0,0,0
TA-00009,SP-0029,25.4,1.65,135,4,63.5,N,N,N,N,EF-003,EF-003,0,0,0

So this is simply a mapping from the “assembly identifier” mentioned above to the material, diameter, and so on. This information is probably quite useful for training the algorithm.

Finally, let's see what is contained in the file with the bill of materials, that is, the components from which the assemblies are made.

$ head data/competition_data/bill_of_materials.csv
tube_assembly_id,component_id_1,quantity_1,component_id_2,quantity_2,component_id_3,quantity_3,component_id_4,quantity_4,component_id_5,quantity_5,component_id_6,quantity_6,component_id_7,quantity_7,component_id_8,quantity_8
TA-00001,C-1622,2,C-1629,2,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA
TA-00002,C-1312,2,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA
TA-00003,C-1312,2,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA
TA-00004,C-1312,2,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA
TA-00005,C-1624,1,C-1631,1,C-1641,1,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA
TA-00006,C-1624,1,C-1631,1,C-1641,1,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA
TA-00007,C-1622,2,C-1629,2,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA
TA-00008,C-1312,2,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA
TA-00009,C-1625,2,C-1632,2,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA

Again we simply see a mapping between the “assembly identifier” and its components. Pay attention to the abundance of fields with the value "NA". As a rule, they denote gaps in the data, but in this example they correspond to cases where there are simply not enough components to fill all 8 positions provided in the header.
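Although we will not use this file in the model built below, here is a minimal sketch of how the NA-padded columns could be collapsed into a per-assembly component count; the helper name and the assumption that quantities are integers are mine, not the article's.

def load_component_counts(path='data/competition_data/bill_of_materials.csv'):
    """Return a dict: tube_assembly_id -> total number of components (NA fields skipped)."""
    assembly_to_component_count = dict()
    with open(path) as input_stream:
        input_stream.readline()  # skip the header
        for line in input_stream:
            fields = line.strip().split(',')
            tube_assembly_id = fields[0]
            # quantities sit in every second column: quantity_1, quantity_2, ...
            quantities = [int(value) for value in fields[2::2] if value != 'NA']
            assembly_to_component_count[tube_assembly_id] = sum(quantities)
    return assembly_to_component_count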

It would seem that the data has now been studied in general terms and it is time to proceed directly to building the algorithm. Not so fast: before building an algorithm, we should understand what we want from it and how we will compare two algorithms to decide which is worse and which is better. First, we will have to set aside part of the labeled data to check the quality of the trained algorithm; in machine learning, this part of the data is called the validation sample. Second, we will have to exclude this part of the data from the data we train on. Otherwise, a model that simply memorizes the label (in our case, the price) for every object it saw during training and predicts at random in every other case would give perfect predictions on the validation sample but absolutely terrible ones in real use. The labeled data that remains after excluding the validation sample is called the training sample. The process of training a model on the training sample and checking the quality of its work on the validation sample is called the validation procedure. But that is not all: besides choosing a validation procedure (which, by the way, may change during the experiments), we also need to choose a way to assess the quality of the predictions given the experimental results. That is, a function that takes the two corresponding arrays as input and returns an assessment of how well the predictions match reality. Such a function is called a quality metric, and in real situations the result often depends much more on its choice than on which algorithm we pick and how we tune it. Without going into the details of this delicate process, we will use the Root Mean Squared Logarithmic Error (RMSLE) proposed by the organizers.
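To make the notion of a quality metric concrete, here is a minimal sketch of RMSLE written as a plain function over two arrays of equal length; the function name rmsle and its signature are my own illustration, not code from the article.

import math

def rmsle(predictions, actuals):
    """Root Mean Squared Logarithmic Error of predictions against actual values."""
    squared_log_errors = [
        (math.log(prediction + 1) - math.log(actual + 1)) ** 2
        for prediction, actual in zip(predictions, actuals)
    ]
    return math.sqrt(sum(squared_log_errors) / len(squared_log_errors))

>>> print(rmsle([2.0, 3.0, 5.0], [2.0, 3.0, 5.0]))
0.0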

So, the fundamental questions are settled and we can start writing the code that loads the data and trains our algorithm. To test the basic pipeline, we split the sample into training and validation parts in a 70/30 ratio, use two simple features, and take one of the simplest algorithms: linear regression.

Let's write a basic function to load the data and check that it works:

def load_data():
    list_of_instances = []
    list_of_labels = []
    with open('./data/competition_data/train_set.csv') as input_stream:
        header_line = input_stream.readline()
        columns = header_line.strip().split(',')
        for line in input_stream:
            new_instance = dict(zip(columns[:-1], line.split(',')[:-1]))
            new_label = float(line.split(',')[-1])
            list_of_instances.append(new_instance)
            list_of_labels.append(new_label)
    return list_of_instances, list_of_labels

>>> list_of_instances, list_of_labels = load_data()
>>> print(len(list_of_instances), len(list_of_labels))
30213 30213
>>> print(list_of_instances[:3])
[{'annual_usage': '0', 'quote_date': '2013-07-07', 'tube_assembly_id': 'TA-00002', 'min_order_quantity': '0', 'bracket_pricing': 'Yes', 'quantity': '1', 'supplier': 'S-0066'}, {'annual_usage': '0', 'quote_date': '2013-07-07', 'tube_assembly_id': 'TA-00002', 'min_order_quantity': '0', 'bracket_pricing': 'Yes', 'quantity': '2', 'supplier': 'S-0066'}, {'annual_usage': '0', 'quote_date': '2013-07-07', 'tube_assembly_id': 'TA-00002', 'min_order_quantity': '0', 'bracket_pricing': 'Yes', 'quantity': '5', 'supplier': 'S-0066'}]
>>> print(list_of_labels[:3])
[21.9059330191461, 12.3412139792904, 6.60182614356538]

The results are in line with expectations. Now let's write a function that converts objects (instances) into feature vectors (samples).

def is_bracket_pricing(instance):
    if instance['bracket_pricing'] == 'Yes':
        return [1]
    elif instance['bracket_pricing'] == 'No':
        return [0]
    else:
        raise ValueError

def get_quantity(instance):
    return [int(instance['quantity'])]

def to_sample(instance):
    return is_bracket_pricing(instance) + get_quantity(instance)

>>> print(list(map(to_sample, list_of_instances[:3])))
[[1, 1], [1, 2], [1, 5]]

Later, when there are many different features, they will all move into a features.py file reserved specifically for them, where they will acquire variations, auxiliary functions and, in particularly neglected cases, even unit tests.
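As an illustration, here is a minimal sketch of what such a test might look like, assuming the feature functions have been moved into features.py; the test_features.py file and the specific test cases are my own example, not code from the article.

# test_features.py -- hypothetical companion to features.py
import unittest

import features  # assumes get_quantity, is_bracket_pricing, etc. live here

class TestFeatures(unittest.TestCase):

    def test_get_quantity(self):
        self.assertEqual(features.get_quantity({'quantity': '5'}), [5])

    def test_is_bracket_pricing_rejects_unknown_values(self):
        with self.assertRaises(ValueError):
            features.is_bracket_pricing({'bracket_pricing': 'Maybe'})

if __name__ == '__main__':
    unittest.main()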

Now, in principle, we have everything needed to train the first, simplest machine learning model. One small (but important) point: we agreed that we would optimize for the metric proposed in the contest conditions,

$$\mathrm{RMSLE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\log(p_i + 1) - \log(a_i + 1)\right)^2},$$

where $p_i$ is the value predicted by the model and $a_i$ is the actual value. So that our efforts in selecting features and tuning the model correspond more closely to this goal, we apply the function $f(x) = \log(x + 1)$ to the labels. The point is that most regression methods in machine learning are designed to minimize the mean squared error (MSE), and it is to optimizing exactly this quality metric that we reduce our task by applying the above function to the labels.

import math

def to_interim_label(label):
    return math.log(label + 1)

def to_final_label(interim_label):
    return math.exp(interim_label) - 1

>>> print(to_final_label(to_interim_label(42)))
42.0

It seems we were not mistaken with the functions that convert the original labels into ones more convenient for optimization and back, and these functions are indeed mutually inverse. Now let's initialize the model and train it on the resulting feature vectors and interim labels.

>>> from sklearn.linear_model import LinearRegression
>>> model = LinearRegression()
>>> list_of_samples = list(map(to_sample, list_of_instances))
>>> TRAIN_SAMPLES_NUM = 20000
>>> train_samples = list_of_samples[:TRAIN_SAMPLES_NUM]
>>> train_labels = list(map(to_interim_label, list_of_labels[:TRAIN_SAMPLES_NUM]))
>>> model.fit(train_samples, train_labels)

Now the model is trained and we can check how well it works by comparing its predictions with the actual values on the validation sample.

>>> import numpy
>>> validation_samples = list_of_samples[TRAIN_SAMPLES_NUM:]
>>> validation_labels = list(map(to_interim_label, list_of_labels[TRAIN_SAMPLES_NUM:]))
>>> squared_errors = []
>>> for sample, label in zip(validation_samples, validation_labels):
...     prediction = model.predict(numpy.array(sample).reshape(1, -1))[0]
...     squared_errors.append((prediction - label) ** 2)
...
>>> mean_squared_error = sum(squared_errors) / len(squared_errors)
>>> print('Mean Squared Error: {0}'.format(mean_squared_error))
Mean Squared Error: 0.8251727558694294

Is the resulting error large or small? A priori it is impossible to say, but given that we use only two features and one of the simplest models, our prediction is most likely far from optimal. Let's experiment, for example, by adding new features. In the main data file, besides the two fields already used (bracket_pricing and quantity), there is a min_order_quantity field (even if its values are not often different from 0). Let's try to use its value as a new feature for training the algorithm.

def get_min_order_quantity(instance):
    return [int(instance['min_order_quantity'])]

def to_sample(instance):
    return (is_bracket_pricing(instance) + get_quantity(instance)
            + get_min_order_quantity(instance))

Mean Squared Error: 0.8234554779286141

As you can see, the error has decreased slightly, which means our algorithm has improved. We will not stop there and will keep adding features one after another.

def get_annual_usage(instance):
    return [int(instance['annual_usage'])]

def to_sample(instance):
    return (is_bracket_pricing(instance) + get_quantity(instance)
            + get_min_order_quantity(instance) + get_annual_usage(instance))

Mean Squared Error: 0.8227852260998361

The next unused field, quote_date, does not have a straightforward interpretation as a number or a fixed-length set of numbers, so we have to think a little about how to feed its value to the algorithm as numeric input. Of course, the year, the month, and the day could each serve as a new feature, but the most logical first approximation is the number of days since some fixed date, for example since January 1 of the year zero, as a day that precedes the earliest date in the file. As a first approximation we can also assume that a year always has 365 days and each of the 12 months has 30 days. And let the apparent mathematical incorrectness of this assumption not trouble us: we can always refine the formulas later and check on the validation sample whether the corresponding feature improves prediction quality.

def get_absolute_date(instance):
    # rough "days since year zero"; note that the month multiplier below is 12,
    # an even cruder approximation than the 30 days per month mentioned above
    return [365 * int(instance['quote_date'].split('-')[0])
            + 12 * int(instance['quote_date'].split('-')[1])
            + int(instance['quote_date'].split('-')[2])]

def to_sample(instance):
    return (is_bracket_pricing(instance) + get_quantity(instance)
            + get_min_order_quantity(instance) + get_annual_usage(instance)
            + get_absolute_date(instance))

Mean Squared Error: 0.8216646342919645

As you can see, even a feature that is not quite mathematically and astronomically correct nevertheless helped improve the prediction quality of our model. Now let's turn to a new type of feature, contained in the tube_assembly_id and supplier fields. Each of these fields holds identifier values for the assembly and the supplier. They are neither binary nor quantitative in nature; they describe the type of object from a fixed list. In machine learning, such properties of objects and the features derived from them are called categorical. The assembly category itself is unlikely to help us, since these identifiers are not repeated in the test_set.csv file, and we quite correctly split the labeled sample so that there is (practically) no corresponding overlap between the training and validation parts. Nevertheless, let's try to extract something useful from the value of the supplier field. First, let's see which codes occur in the file with labeled data.

>>> with open('./data/competition_data/train_set.csv') as input_stream:
...     header_line = input_stream.readline()
...     suppliers = set()
...     for line in input_stream:
...         new_supplier = line.split(',')[1]
...         suppliers.add(new_supplier)
...
>>> print(len(suppliers))
57
>>> print(suppliers)
{'S-0058', 'S-0013', 'S-0050', 'S-0011', 'S-0070', 'S-0104', 'S-0012', 'S-0068', 'S-0041', 'S-0023', 'S-0092', 'S-0095', 'S-0029', 'S-0051', 'S-0111', 'S-0064', 'S-0005', 'S-0096', 'S-0062', 'S-0004', 'S-0059', 'S-0031', 'S-0078', 'S-0106', 'S-0060', 'S-0090', 'S-0072', 'S-0105', 'S-0087', 'S-0080', 'S-0061', 'S-0108', 'S-0042', 'S-0027', 'S-0074', 'S-0081', 'S-0025', 'S-0024', 'S-0030', 'S-0022', 'S-0014', 'S-0054', 'S-0015', 'S-0008', 'S-0007', 'S-0009', 'S-0056', 'S-0026', 'S-0107', 'S-0066', 'S-0018', 'S-0109', 'S-0043', 'S-0046', 'S-0003', 'S-0006', 'S-0097'}

As we can see, there are not that many of them. To start with, we can try the simplest standard approach for such cases: map each value of the field to an array with a single 1 at the position of the corresponding identifier and 0 in all other positions (one-hot encoding).

SUPPLIERS_LIST = ['S-0058', 'S-0013', 'S-0050', 'S-0011', 'S-0070', 'S-0104', 'S-0012',
                  'S-0068', 'S-0041', 'S-0023', 'S-0092', 'S-0095', 'S-0029', 'S-0051',
                  'S-0111', 'S-0064', 'S-0005', 'S-0096', 'S-0062', 'S-0004', 'S-0059',
                  'S-0031', 'S-0078', 'S-0106', 'S-0060', 'S-0090', 'S-0072', 'S-0105',
                  'S-0087', 'S-0080', 'S-0061', 'S-0108', 'S-0042', 'S-0027', 'S-0074',
                  'S-0081', 'S-0025', 'S-0024', 'S-0030', 'S-0022', 'S-0014', 'S-0054',
                  'S-0015', 'S-0008', 'S-0007', 'S-0009', 'S-0056', 'S-0026', 'S-0107',
                  'S-0066', 'S-0018', 'S-0109', 'S-0043', 'S-0046', 'S-0003', 'S-0006',
                  'S-0097']

def get_supplier(instance):
    if instance['supplier'] in SUPPLIERS_LIST:
        supplier_index = SUPPLIERS_LIST.index(instance['supplier'])
        result = [0] * supplier_index + [1] + [0] * (len(SUPPLIERS_LIST) - supplier_index - 1)
    else:
        result = [0] * len(SUPPLIERS_LIST)
    return result

def to_sample(instance):
    return (is_bracket_pricing(instance) + get_quantity(instance)
            + get_min_order_quantity(instance) + get_annual_usage(instance)
            + get_absolute_date(instance) + get_supplier(instance))

Mean Squared Error: 0.7992338454746866

As we can see, the average error dropped noticeably. Most likely this means that the field we added contains information valuable to the algorithm, and it is worth trying to use it again later in some less trivial way. We have already discussed that the tube_assembly_id field is unlikely to help us directly, but it is still worth checking experimentally.

def get_assembly(instance):
    assembly_id = int(instance['tube_assembly_id'].split('-')[1])
    result = [0] * assembly_id + [1] + [0] * (25000 - assembly_id - 1)
    return result

def to_sample(instance):
    return (is_bracket_pricing(instance) + get_quantity(instance)
            + get_min_order_quantity(instance) + get_annual_usage(instance)
            + get_absolute_date(instance) + get_supplier(instance)
            + get_assembly(instance))

Of course, this particular way of turning the assembly identifier into a feature vector looks somewhat clumsy, especially considering the number of possible values. However, if the reader has a more constructive way to use this field directly and without additional data, feel free to discuss it in the comments and even try it within the code currently available. Meanwhile, recall that besides the main file with the training sample we have several other files with auxiliary data, and let's try to experiment with them. For example, to make up for the relative (if predictable) failure of using the tube_assembly_id field value directly, we can try to take revenge using the data contained in the specs.csv file, which describes the corresponding specifications in anonymized form.

$ head data/competition_data/specs.csv -n 20
tube_assembly_id,spec1,spec2,spec3,spec4,spec5,spec6,spec7,spec8,spec9,spec10
TA-00001,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA
TA-00002,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA
TA-00003,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA
TA-00004,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA
TA-00005,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA
TA-00006,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA
TA-00007,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA
TA-00008,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA
TA-00009,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA
TA-00010,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA
TA-00011,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA
TA-00012,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA
TA-00013,SP-0004,SP-0069,SP-0080,NA,NA,NA,NA,NA,NA,NA
TA-00014,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA
TA-00015,SP-0063,SP-0069,SP-0080,NA,NA,NA,NA,NA,NA,NA
TA-00016,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA
TA-00017,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA
TA-00018,SP-0007,SP-0058,SP-0070,SP-0080,NA,NA,NA,NA,NA,NA
TA-00019,SP-0080,NA,NA,NA,NA,NA,NA,NA,NA,NA

It seems this is simply an enumeration of the specifications corresponding to each value. Accordingly, it seems reasonable to choose as the feature vector an array of ones and zeros in which the ones stand at positions corresponding to the specifications that are present and zeros everywhere else. To use the data from the additional file, we will have to make a small change to the code structure.

def get_assembly_specs(instance, assembly_to_specs):
    result = [0] * 100
    for spec in assembly_to_specs[instance['tube_assembly_id']]:
        result[int(spec.split('-')[1])] = 1
    return result

def to_sample(instance, additional_data):
    return (is_bracket_pricing(instance) + get_quantity(instance)
            + get_min_order_quantity(instance) + get_annual_usage(instance)
            + get_absolute_date(instance) + get_supplier(instance)
            + get_assembly_specs(instance, additional_data['assembly_to_specs']))

def load_additional_data():
    result = dict()
    assembly_to_specs = dict()
    with open('data/competition_data/specs.csv') as input_stream:
        header_line = input_stream.readline()
        for line in input_stream:
            tube_assembly_id = line.split(',')[0]
            specs = []
            for spec in line.strip().split(',')[1:]:
                if spec != 'NA':
                    specs.append(spec)
            assembly_to_specs[tube_assembly_id] = specs
    result['assembly_to_specs'] = assembly_to_specs
    return result

additional_data = load_additional_data()
list_of_samples = list(map(lambda x: to_sample(x, additional_data), list_of_instances))

Mean Squared Error: 0.7754770419953809

Our efforts were not in vain: the target metric improved by about 0.024, which is not bad at this stage. For the time being we can stop optimizing the algorithm here and discuss how to provide an application programmer with a convenient API to the trained model.

First, save the trained model to disk.

import pickle

with open('./data/model.mdl', 'wb') as output_stream:
    output_stream.write(pickle.dumps(model))

Now we will create a generate_response.py script, in which we will use the results obtained earlier.

import pickle

import numpy

import research

class FinalModel(object):

    def __init__(self, model, to_sample, additional_data):
        self._model = model
        self._to_sample = to_sample
        self._additional_data = additional_data

    def process(self, instance):
        return self._model.predict(numpy.array(self._to_sample(
            instance, self._additional_data)).reshape(1, -1))[0]

if __name__ == '__main__':
    with open('./data/model.mdl', 'rb') as input_stream:
        model = pickle.loads(input_stream.read())
    additional_data = research.load_additional_data()
    final_model = FinalModel(model, research.to_sample, additional_data)
    print(final_model.process({'tube_assembly_id': 'TA-00001', 'supplier': 'S-0066',
                               'quote_date': '2013-06-23', 'annual_usage': '0',
                               'min_order_quantity': '0', 'bracket_pricing': 'Yes',
                               'quantity': '1'}))

2.357692493326624

Now an application programmer can load the model and its accompanying variables in the same way in any script and use it to predict values on new incoming data.
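For example, a minimal sketch of how such client code might look; it reuses FinalModel from generate_response.py and the helper functions from the research module, and assumes to_final_label is importable from there (the exact snippet is my illustration, not code from the article).

import pickle

import generate_response
import research

# load the trained model and auxiliary data once at startup
with open('./data/model.mdl', 'rb') as input_stream:
    model = pickle.loads(input_stream.read())
final_model = generate_response.FinalModel(model, research.to_sample,
                                           research.load_additional_data())

# the model predicts the interim (log-scale) label; convert it back to a price
interim_prediction = final_model.process({
    'tube_assembly_id': 'TA-00001', 'supplier': 'S-0066', 'quote_date': '2013-06-23',
    'annual_usage': '0', 'min_order_quantity': '0', 'bracket_pricing': 'Yes',
    'quantity': '1',
})
print(research.to_final_label(interim_prediction))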

In the next installment: further optimization of the model, more features, more complex algorithms and tuning of their hyperparameters and, if necessary, more advanced validation procedures.

Source: https://habr.com/ru/post/351074/

