Translator's note: before this revision, the chapter was titled "Systematic and Random: The Two Main Sources of Error" — that is, I translated "bias" and "variance" as "systematic error" and "random error". However, user @Phaker rightly pointed out in the comments that in the Russian machine learning literature the terms «смещение» and «разброс» are the established translations of these concepts. I looked at the work of K.V. Vorontsov, deservedly one of the authorities on machine learning in Russia, and at the resources of the professional community, and agreed with @Phaker's comment. Although, in my view, there is a deep analogy between "bias" and "variance" in learning algorithms and the "systematic" and "random" errors of a physical experiment, it is still correct to use the terminology established in this field. I have therefore revised the translation of this and subsequent chapters, replacing "systematic and random errors" with "bias and variance", and will stick to this terminology from now on.
Suppose your training, validation and test sets come from the same distribution. Then, you might think, you should always collect more training data, since it can only improve the quality of the algorithm. Is that true?
Even though having more data cannot hurt, unfortunately new data does not always help as much as you might expect. In some cases, working on getting additional data is a waste of effort. So how do you decide when to add data, and when not to bother?
In machine learning there are two main sources of error: bias and variance. Understanding them will help you decide whether adding more data is worthwhile, and will also help you choose tactics for improving the quality of the classifier.
Suppose you hope to build a cat recognizer with 5% error. Right now, your classifier has a 15% error rate on the training set and 16% on the validation set. In this case, adding training data is unlikely to improve quality significantly; you should concentrate on other changes to the system. In fact, adding more examples to your training set only makes it harder for your algorithm to do well on that set (why this is so is explained in the following chapters).
If your error rate on the training set is 15% (corresponding to 85% accuracy), but your target is a 5% error rate (95% accuracy), then the first problem to solve is improving the algorithm's performance on the training set. Performance on the validation/test set is usually worse than performance on the training set. So if the approaches you are using cannot get above 85% accuracy on examples your algorithm has already seen, there is no way they will reach 95% accuracy on examples it has never seen.
Suppose, as noted above, your algorithm's error rate on the validation set is 16% (84% accuracy). We can break this 16% error into two components:

- First, the algorithm's error rate on the training set — 15% in this example. We think of this, informally, as the algorithm's bias.
- Second, how much worse the algorithm does on the validation set than on the training set — 1% in this example. We think of this, informally, as the algorithm's variance.
Author's note: statistics has more precise definitions of bias and variance, but we need not worry about them. Roughly speaking, the bias is your algorithm's error rate on the training set when the training set is very large. The variance is how much worse the algorithm does on the test set compared to the training set with the same parameter settings. If the loss is the mean squared error, one can write down formulas defining these two quantities and prove that the total error equals the sum of the bias and the variance. But for our purpose of improving machine learning algorithms, these informal definitions of bias and variance are good enough.
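As a minimal sketch, assuming you already have the measured error rates as fractions, the informal decomposition above can be computed like this:

```python
def bias_variance_split(train_error, val_error):
    """Informal split: bias ~ training error,
    variance ~ gap between validation and training error."""
    bias = train_error
    variance = val_error - train_error
    return bias, variance

# The running example from the text: 15% training error, 16% validation error.
bias, variance = bias_variance_split(0.15, 0.16)
print(f"bias ≈ {bias:.0%}, variance ≈ {variance:.0%}")  # bias ≈ 15%, variance ≈ 1%
```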
Some changes to a learning algorithm address the first component of the error — bias — and improve performance on the training set. Other changes address the second component — variance — and help the algorithm generalize better to the validation and test sets. To choose the most effective changes to make to the system, it is extremely useful to understand how each of these two components contributes to the overall error.
Author's note: there are also approaches that reduce bias and variance simultaneously by making significant changes to the system architecture. But such changes are usually harder to find and implement.
Developing good intuition for how much of the error comes from bias and how much from variance will help you choose effective ways to improve your algorithm.
Consider our cat classification task. An "ideal" classifier (for example, a human) could achieve nearly perfect performance on this task.
Suppose your algorithm performs as follows:

- Training error = 1%
- Validation error = 11%
What is wrong with this classifier? Applying the definitions from the previous chapter, we estimate the bias at 1% and the variance at 10% (= 11% − 1%). Thus, our algorithm has high variance. It has a very low error rate on the training set, but it fails to generalize to the validation set. In other words, we are dealing with overfitting.
Now consider the following situation:

- Training error = 15%
- Validation error = 16%
Here we estimate the bias at 15% and the variance at 1%. This classifier fits the training set poorly, while its validation error is only slightly higher than its training error. It therefore has high bias but low variance. We say that the algorithm is underfitting.
Now consider the following error breakdown:

- Training error = 15%
- Validation error = 30%
In this case the bias is 15% and the variance is also 15%. This classifier has both high bias and high variance: it does poorly on the training set (high bias), and its quality on the validation set is much worse still (high variance). This case is hard to describe in terms of overfitting/underfitting — the classifier is, in a sense, both overfitting and underfitting at once.
Finally, consider the following situation:

- Training error = 0.5%
- Validation error = 1%
This classifier is working great: it has low bias and low variance. Congratulations on this excellent result!
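A rough way to turn the four examples above into a diagnostic, assuming an illustrative 5% threshold rather than any official rule:

```python
def diagnose(train_error, val_error, threshold=0.05):
    """Label the bias/variance regime of a classifier from its error rates."""
    bias = train_error
    variance = val_error - train_error
    high_bias = bias > threshold
    high_variance = variance > threshold
    if high_bias and high_variance:
        return "high bias and high variance"
    if high_bias:
        return "high bias (underfitting)"
    if high_variance:
        return "high variance (overfitting)"
    return "low bias and low variance"

# The four examples from this chapter.
for train_e, val_e in [(0.01, 0.11), (0.15, 0.16), (0.15, 0.30), (0.005, 0.01)]:
    print(f"train={train_e:.1%}, val={val_e:.1%}: {diagnose(train_e, val_e)}")
```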
In our cat recognition example, the "ideal" error rate — the level achievable by an "optimal" classifier — is close to 0%. A human looking at a picture can almost always tell whether there is a cat in it, and we can hope that sooner or later a machine will do just as well.
But some problems are harder. For example, suppose you are building a speech recognition system and discover that 14% of the audio clips have so much background noise, or such unintelligible speech, that even a human cannot make out what was said. In that case, even the most "optimal" speech recognition system will have an error rate of around 14%.
Suppose that on this speech recognition problem our algorithm achieves:

- Training error = 15%
- Validation error = 30%
The classifier's performance on the training set is already close to the optimal error rate of 14%. Thus, there is not much room left for reducing the bias (that is, for improving performance on the training set). However, the algorithm generalizes poorly to the validation set, so there is plenty of room for reducing the variance.
This case is similar to the third example in the previous chapter, which also had a 15% training error and a 30% validation error. If the optimal error rate were around 0%, then a 15% training error would leave a lot of room for improvement, and efforts to reduce the algorithm's bias could be very fruitful. But if the optimal error rate cannot be below 14%, then a similar training error rate (around 14–15%) tells us that the possibilities for reducing bias are nearly exhausted.
For problems where the optimal error rate is far from zero, a more detailed breakdown of the error is useful. Continuing the speech recognition example above, the total error of 30% on the validation set can be decomposed into the following components (errors on the test set can be analyzed in the same way):

- Optimal error rate ("unavoidable bias"): 14% — even the best possible system would make these errors.
- Avoidable bias: 1% — the difference between the training error and the optimal error rate.
- Variance: 15% — the difference between the validation error and the training error.
Author's note: if the avoidable bias is negative, your algorithm is doing better on the training set than the optimal error rate. This means you are overfitting the training set — the algorithm has memorized the training examples (and their labels). In this case you should focus on methods that reduce variance, not on further reducing bias.
Relating this to our earlier definitions, bias and avoidable bias are connected as follows:
Bias = Optimal error rate ("unavoidable bias") + Avoidable bias
Author's note: these definitions are chosen to convey insight into how to improve a learning algorithm. They differ from the formal definitions of bias and variance used in statistics. Technically, what I call "bias" here should be called "the error we attribute to bias", and "avoidable bias" should be "the error attributed to bias in excess of the optimal error rate".
The avoidable bias shows how much worse your algorithm does on the training set than the "optimal classifier" would.
The basic idea of variance stays the same. In theory, we can always reduce the variance to nearly zero by training on a large enough training set. Thus, all variance is "avoidable" given enough data, and there is no such thing as "unavoidable variance".
Consider one more example, in which the optimal error rate is 14% and we have:

- Training error = 15%
- Validation error = 16%
In the previous chapter we would have called a classifier with these numbers a high-bias classifier; now we would say that its avoidable bias is 1% and its variance is about 1%. The algorithm is therefore already working quite well, and there is little room left for improvement: its performance is only 2% worse than the optimal error rate.
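The same three-way breakdown can be written as a small helper; this is a minimal sketch assuming you already have an estimate of the optimal error rate, and the function name is just an illustration:

```python
def decompose(optimal_error, train_error, val_error):
    """Split total error into unavoidable bias, avoidable bias and variance."""
    unavoidable_bias = optimal_error
    avoidable_bias = train_error - optimal_error
    variance = val_error - train_error
    return unavoidable_bias, avoidable_bias, variance

# The two speech recognition examples from the text (optimal error rate = 14%).
for train_e, val_e in [(0.15, 0.30), (0.15, 0.16)]:
    u, a, v = decompose(0.14, train_e, val_e)
    print(f"unavoidable bias {u:.0%}, avoidable bias {a:.0%}, variance {v:.0%}")
```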
These examples show that knowing the optimal error rate is very helpful for deciding what to do next. In statistics, the optimal error rate is also called the Bayes error rate.
How do you find out the optimal error rate? For tasks that humans do well, such as recognizing images or transcribing audio clips, you can ask annotators to label the data and then measure the error rate of the human labels on the training set. This gives an estimate of the optimal error rate. If you are working on a problem that is hard even for humans (for example, predicting which movie to recommend or which ad to show a user), estimating the optimal error rate is much more difficult.
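A minimal sketch of such an estimate, where `human_labels` and `true_labels` are hypothetical lists you would collect from annotators and from your reference labels:

```python
def human_error_rate(human_labels, true_labels):
    """Fraction of examples where the human annotator disagrees with the reference label."""
    mistakes = sum(h != t for h, t in zip(human_labels, true_labels))
    return mistakes / len(true_labels)

human_labels = ["cat", "cat", "dog", "cat"]   # what the annotators said
true_labels  = ["cat", "dog", "dog", "cat"]   # reference labels
print(f"estimated optimal error rate ≈ {human_error_rate(human_labels, true_labels):.0%}")
```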
In the section "Comparing to Human-Level Performance" (Chapters 33 through 35), I will discuss in more detail how to compare a learning algorithm's performance with the level a human can achieve.
In the last few chapters, you learned how to estimate avoidable/unavoidable bias and variance by looking at the error rates on the training and validation sets. The next chapter discusses how to use this analysis to decide whether to focus on techniques that reduce bias or on techniques that reduce variance. The two families of techniques are very different, so which ones you should apply to improve your project depends strongly on whether your current problem is high bias or high variance.
Read on!
Here is the simplest formula for addressing bias and variance problems (see the sketch below):

- If you have high avoidable bias, increase the size of your model (for example, add layers or neurons to your neural network).
- If you have high variance, add data to your training set.
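Written out as a sketch, with an arbitrary 2% tolerance chosen purely for illustration:

```python
def next_step(optimal_error, train_error, val_error, tolerance=0.02):
    """Apply the 'simplest formula': tackle avoidable bias first, then variance."""
    avoidable_bias = train_error - optimal_error
    variance = val_error - train_error
    if avoidable_bias > tolerance:
        return "increase the model size (e.g. add layers/neurons)"
    if variance > tolerance:
        return "add data to the training set"
    return "performance is close to optimal; look elsewhere for gains"

# The cat classifier from the start of the chapter, optimal error rate ~0%.
print(next_step(optimal_error=0.0, train_error=0.15, val_error=0.16))
```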
If you can keep increasing the size of the neural network and keep adding training data without limit, this will produce good results on a great many machine learning problems.
In practice, growing the model eventually runs into computational limits, since training very large models is slow. You can also exhaust the available training data. (Even on the entire internet, the number of cat pictures is finite!)
Different model architectures — for example, different neural network architectures — will give different bias and variance on your task. A wave of recent research in deep learning has produced a large number of innovative neural network architectures, so if you are working with neural networks, the research literature can be an excellent source of inspiration. There are also many excellent open-source implementations, for example on GitHub. However, the results of trying new architectures are much less predictable than the simple formula above — grow the model and add data.
Increasing the model size usually reduces bias, but it can also increase variance, raising the risk of overfitting. However, this overfitting problem usually arises only when you are not using regularization. If you include a well-designed regularization method in the model, you can usually grow the model safely without increasing overfitting.
Suppose you are doing deep learning with L2 regularization or dropout (Translator's note: you can read about dropout, for example, here: https://habr.com/company/wunderfund/blog/330814/), with regularization parameters that work well on the validation set. If you increase the model size, performance usually stays the same or improves; a significant drop is unlikely. The only reason to avoid a bigger model is the increased computational cost.
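A hedged sketch of this "grow the model, keep the regularization" idea using Keras; the layer sizes, dropout rate and L2 coefficient below are placeholder values you would tune on the validation set, not recommendations from the text:

```python
import tensorflow as tf
from tensorflow.keras import layers, regularizers

def build_classifier(hidden_units=512, l2_coef=1e-4, dropout_rate=0.5):
    """Binary cat / not-cat classifier with L2 regularization and dropout."""
    return tf.keras.Sequential([
        layers.Dense(hidden_units, activation="relu",
                     kernel_regularizer=regularizers.l2(l2_coef)),
        layers.Dropout(dropout_rate),
        layers.Dense(hidden_units, activation="relu",
                     kernel_regularizer=regularizers.l2(l2_coef)),
        layers.Dropout(dropout_rate),
        layers.Dense(1, activation="sigmoid"),
    ])

# Doubling hidden_units enlarges the model; with L2 + dropout in place,
# validation quality typically stays the same or improves.
model = build_classifier(hidden_units=1024)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```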
You may have heard of the "bias–variance trade-off". Among the many changes you could make to a learning algorithm, some reduce bias at the cost of increasing variance, or vice versa. This is what is meant by a "trade-off" between bias and variance.
For example, increasing the model size — adding neurons and/or layers to a neural network, or adding input features — usually reduces bias but can increase variance. Conversely, adding regularization often increases bias but reduces variance.
Today, we usually have access to plenty of data and enough computing power to train large neural networks (deep learning). So the trade-off is less of a constraint, and we have many tools for reducing bias without seriously hurting variance, and vice versa.
For example, you can usually increase the size of the network and tune the regularization so as to reduce bias without noticeably increasing variance. And adding data to the training set usually reduces variance without affecting bias.
If you manage to choose a model architecture that is well suited to the task, you can reduce bias and variance at the same time. But choosing such an architecture can be hard.
In the next few chapters, we discuss more specific techniques for addressing bias and variance.
If your learning algorithm suffers from high avoidable bias, you can try the following approaches:

- Increase the model size (for example, the number of neurons or layers), so that it can fit the training set better. If this also increases variance, use regularization, which usually eliminates the increase.
- Modify the input features based on insights from error analysis: the analysis may suggest new features that help the algorithm with a particular category of errors. New features can in principle increase variance; if so, use regularization.
- Reduce or eliminate regularization (L2 regularization, L1 regularization, dropout): this reduces avoidable bias but increases variance.
- Modify the model architecture (for example, the neural network architecture) so that it suits your task better: this can affect both bias and variance.
One method that is not helpful:

- Add more training data: this helps with variance problems, but it usually has no significant effect on bias.
Only once the algorithm performs well on the training set can you expect acceptable results from it on the validation/test set.
In addition to the techniques described above for addressing high bias, I sometimes also carry out error analysis on the training data, following the same procedure as the error analysis on the Eyeball validation set. This can help if your algorithm has high bias — that is, if it is not fitting the training set well.
For example, suppose you are building a speech recognition system for an app and have collected a training set of audio clips from volunteers. If your system is doing poorly on the training set, you might listen to a set of about 100 examples that the algorithm gets wrong, to understand the main categories of training set errors. As with error analysis on the validation set, you can count the errors by category:
| Audio clip | Loud background noise | User spoke too fast | Too far from microphone | Comments |
|---|---|---|---|---|
| 1 | X | | | Car noise |
| 2 | X | | X | Restaurant noise |
| 3 | | X | X | User shouting across the room |
| 4 | X | | | Cafe noise |
| % of total | 75% | 25% | 50% | |
In this example, you might discover that your algorithm has particular trouble with training examples that contain a lot of background noise. You can then focus on techniques that help it do better on noisy training examples.
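One simple way to produce such a tally, assuming you have hand-tagged a sample of misclassified training clips (the tag names below are hypothetical):

```python
from collections import Counter

# Each entry is the set of error tags assigned to one misclassified clip.
misclassified = [
    {"background_noise"},                      # clip 1: car noise
    {"background_noise", "too_far_from_mic"},  # clip 2: restaurant noise
    {"spoke_too_fast", "too_far_from_mic"},    # clip 3: shouting across the room
    {"background_noise"},                      # clip 4: cafe noise
]

counts = Counter(tag for tags in misclassified for tag in tags)
for tag, n in counts.most_common():
    print(f"{tag}: {n / len(misclassified):.0%} of examined clips")
```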
You can also double-check whether a human could make out these audio clips, by having someone listen to the same recordings the learning algorithm gets. If there is so much background noise that no one can understand what was said, then it may be unreasonable to expect any algorithm to recognize such utterances correctly. Later chapters discuss the benefits of comparing your algorithm's quality with the level achievable by humans.
If your algorithm suffers from high variance, you can try the following approaches:

- Add more training data: this is the simplest and most reliable way to reduce variance, provided you can get the data and have enough compute to process it.
- Add regularization (L2 regularization, L1 regularization, dropout): this reduces variance but increases bias.
- Add early stopping (stop gradient descent early, based on validation error): this reduces variance but increases bias, and behaves much like a regularization method.
- Use feature selection to reduce the number or type of input features: this can help with variance, but may also increase bias; it is most useful when the training set is small.
- Decrease the model size (number of neurons/layers): use with caution — this can reduce variance but is likely to increase bias, and regularization is usually a better way to fight overfitting.
Here are two additional tactics, repeated from the earlier chapter on reducing bias:

- Modify the input features based on insights from error analysis.
- Modify the model architecture so that it suits your task better: this can affect both bias and variance.
Source: https://habr.com/ru/post/420591/