
How to debug machine learning models


I was thinking, mostly from a teacher's point of view, about how to teach debugging machine learning models. Personally, I find it useful to look at a model in terms of errors of different kinds: Bayes error (how often even the best possible classifier is wrong), approximation error (what we lose by restricting the hypothesis class), estimation error (caused by a limited sample size), and optimization error (what happens if we do not find the global optimum of the optimization problem). I realized that it helps to attribute the error to a specific area and then fix the shortcomings in that particular area.

For example, my usual error correction strategy consists of the following steps:
  1. First, make sure the problem is not in the optimizer. You can check this by adding features that match the class labels perfectly and verifying that classification on the training data then works correctly (see the first sketch after this list). If it does not, the problem is most likely in the optimizer, or the sample is too small.

  2. Remove all features except the ones added in step 1 and check that classification still works. If it does, gradually add the original features back in increasing portions (usually growing exponentially). If at some point the model stops working, you have too many features or too little data.

  3. Remove the added features and substantially enlarge your hypothesis class, for example by adding many quadratic features. Check that classification works. If not, perhaps you need a better hypothesis class.

  4. Cut the training data in half. As the amount of training data grows, test accuracy usually approaches an asymptote, so if halving the data has a noticeable effect, you are still far from that asymptote and should collect more data (see the second sketch after this list).
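
A minimal sketch of the checks from steps 1 and 2, assuming a scikit-learn-style workflow; the dataset and the logistic regression model here are made up purely for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy stand-in data; in practice X and y are your real training set.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))
y = rng.integers(0, 2, size=1000)

# Step 1: append a "cheat" feature that matches the labels perfectly.
X_cheat = np.hstack([X, y.reshape(-1, 1)])
clf = LogisticRegression(max_iter=1000).fit(X_cheat, y)
print(f"train accuracy with a perfect feature: {clf.score(X_cheat, y):.3f}")
# If this is not close to 1.0, suspect the optimizer or too little data.

# Step 2: keep only the cheat feature, then add real features back in
# exponentially growing portions and watch where training accuracy drops.
for k in [0, 1, 2, 4, 8, 16, 20]:
    X_sub = np.hstack([X[:, :k], y.reshape(-1, 1)])
    acc = LogisticRegression(max_iter=1000).fit(X_sub, y).score(X_sub, y)
    print(f"{k:2d} real features + cheat -> train accuracy {acc:.3f}")
```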
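And a sketch of the data-size check from step 4, under the same assumptions: train on the full training set and on a random half of it, then compare test accuracy.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Toy data with a bit of real signal; replace with your own dataset.
rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 20))
y = (X[:, 0] + 0.5 * rng.normal(size=2000) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

full_acc = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).score(X_te, y_te)
half = len(X_tr) // 2
half_acc = LogisticRegression(max_iter=1000).fit(X_tr[:half], y_tr[:half]).score(X_te, y_te)

print(f"test accuracy on all data: {full_acc:.3f}, on half the data: {half_acc:.3f}")
# A noticeable gap means you are still far from the asymptote: get more data.
```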

The trouble is that the usual error decomposition comes from theory, and theories tend to lose sight of some things because of their level of abstraction. Usually the abstraction starts from a point where the final goal has already been reduced to an i.i.d. / PAC learning setup (independent, identically distributed samples / probably approximately correct learning), so we cannot see all the types of errors: the abstraction hides them.

To better understand what is going on, I built a flowchart that covers all the types of errors I know of that can sneak into machine learning.
The flowchart is shown below.

I tried to give the steps reasonable names (on the left side of the rectangles) and then to attach a real-life example from advertising to each of them. Let's go through all the steps and look at what errors can occur at each one.


  1. In the first step, we set the goal of increasing our company's profit, and to achieve it we decide to optimize the display of advertising banners. Already at this step we limit the maximum increase in profit we can hope for, since it might be more worthwhile to focus not on advertising but, say, on building a better product. This is essentially a business decision, but it raises one of the main questions: are we working on the right thing?

  2. Now that we have a concrete mechanism (optimizing advertising), it needs to be turned into a learning task (or not). Here we decided to do it by predicting where users will click and then using those predictions to place banners as well as possible. Can click information be used to predict growth in sales? That question is itself an area of active research. But as soon as you decide to predict clicks, you already lose something because of the mismatch between the prediction task and the goal of optimizing banner placement.

  3. Now we need to collect some data. We can do this by logging interactions with a system that is already running. Here we pick up a whole zoo of errors, since the data is not collected from the system we plan to deploy (it is still being built), which leads to problems with distribution shift.

  4. You probably cannot log everything that happens in the current system, so you can only collect some subset of the information. Suppose you logged queries, banners, and clicks. Then we lose everything that was not logged, such as the time of day, the day of the week, or information about the user, all of which may also matter. This, too, caps the profit you can reach.

  5. Then a representation of the data is usually chosen, for example quadratic interactions between the set of query keywords and the set of banner keywords, with a + or - label depending on whether the user clicked on the banner (a small sketch of such interaction features follows this list). This is the point where theoretical analysis can be brought in, but it is mostly limited to the notion of Bayes error: the more information we capture and the better the representation, the smaller this error will be.

  6. Next, a hypothesis class needs to be chosen. Personally, I would choose decision trees. This is where my approximation error comes from.

  7. We need to collect training data. In the real world there are no truly i.i.d. samples, so whatever data we take will always carry some error. Its distribution may differ from the distribution of the test data (for example, because behavior differs from month to month), and the samples may not be independent (because behavior changes little from one second to the next). All of this costs us accuracy.

  8. Now we train our model, possibly tuning hyperparameters as well. This is the step where estimation errors appear.

  9. Then we select test data to measure the model's performance. Of course, this data only tells us how well the model will work in the future, if it tells us anything at all. In practice the sample is unlikely to be representative, if only because the data will change over time.

  10. Once predictions are made on the test sample, we have to choose criteria for judging success: accuracy, F-measure, area under the ROC curve, and so on (see the metrics sketch after this list). How well these metrics track what really matters to us (the increase in profit) determines how well we achieve the main goal. If the metric does not match it, we may end up with a loss instead of a profit.
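
As an illustration of the representation from step 5, here is a small sketch with made-up query and banner texts; the interaction ("quadratic") features are simply the outer product of the two bag-of-words vectors, and CountVectorizer is just one convenient way to build them:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Made-up (query, banner, clicked) triples; real logs would be used instead.
queries = ["cheap flights paris", "running shoes", "cheap hotels paris"]
banners = ["paris flight deals", "best running shoes sale", "luxury watches"]
clicked = np.array([1, 1, 0])  # label: did the user click the banner?

vec_q = CountVectorizer().fit(queries)
vec_b = CountVectorizer().fit(banners)
Q = vec_q.transform(queries).toarray()  # query bag of words
B = vec_b.transform(banners).toarray()  # banner bag of words

# Quadratic (interaction) features: one feature per (query word, banner word) pair.
X = np.array([np.outer(q, b).ravel() for q, b in zip(Q, B)])
print(X.shape)  # (n_examples, n_query_words * n_banner_words)
```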
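And for the metrics in step 10, a minimal sketch with invented predictions, using the standard scikit-learn implementations:

```python
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]                   # actual clicks
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]                   # hard predictions
y_score = [0.9, 0.2, 0.4, 0.8, 0.1, 0.6, 0.7, 0.3]  # predicted click probabilities

print("accuracy:", accuracy_score(y_true, y_pred))
print("F1:      ", f1_score(y_true, y_pred))
print("ROC AUC: ", roc_auc_score(y_true, y_score))
# None of these is profit; check how well each metric tracks the business goal.
```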

(A small note: although I arranged the steps in a certain order, you do not have to follow it; many stages can be swapped. In addition, the process of improving the system can involve many iterations and dependencies.)

Some of the aspects mentioned are still active research topics. Problems such as sample selection bias, domain adaptation, and covariate shift can all arise from a mismatch between the training and test data. For example, if classification on the training set works correctly but generalization is terrible, I often try randomly mixing the test and training samples and checking whether generalization improves (a minimal sketch of this check is below). If it does, we are most likely dealing with a domain adaptation problem.
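
A minimal sketch of that remix check, assuming you already have X_train, y_train, X_test, y_test as arrays and use a scikit-learn classifier (the random forest here is just a placeholder): pool the data, re-split it at random, and compare generalization.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def remix_check(X_train, y_train, X_test, y_test, model=None, seed=0):
    """Compare generalization on the original split vs. a random re-split."""
    model = model or RandomForestClassifier(random_state=seed)

    # Accuracy with the original (possibly shifted) train/test split.
    orig_acc = model.fit(X_train, y_train).score(X_test, y_test)

    # Pool everything and re-split at random with the same test fraction.
    X = np.vstack([X_train, X_test])
    y = np.concatenate([y_train, y_test])
    frac = len(y_test) / len(y)
    Xm_tr, Xm_te, ym_tr, ym_te = train_test_split(X, y, test_size=frac, random_state=seed)
    mixed_acc = model.fit(Xm_tr, ym_tr).score(Xm_te, ym_te)

    # A big improvement after mixing points to a train/test distribution
    # mismatch (covariate shift / domain adaptation), not a modeling problem.
    return orig_acc, mixed_acc
```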

When developing new metrics for evaluating models (such as BLEU for machine translation), try to take into account how well they correspond to the final goal (as described in step 10).


Source: https://habr.com/ru/post/320482/

