Earlier (habrahabr.ru/post/145309) we reviewed approaches to universal artificial intelligence (AI). But what is universal AI? What exactly do modern practical AI systems lack to be called universal? To make the discussion more concrete, let us consider this question using the example of machine learning, which is a necessary component of AI.
Machine learning has long outgrown the boundaries of academic artificial intelligence and has reached "ordinary programmers." Many applied tasks use practical methods of data mining, pattern recognition, statistical inference, and so on. Large companies organize courses and schools on automated data analysis, create university departments, participate in organizing conferences, and spread the accumulated knowledge of machine learning by other means. Scientists are not lagging behind either: they invent new methods and develop new subfields, such as the currently fashionable transfer learning.
However, this breadth-wise expansion of the field does not particularly contribute to its depth. On the contrary, simplified interpretations of the foundations of machine learning are spreading (even though these interpretations are often backed by rather sophisticated theory), which allow one to obtain particular practical solutions. The latter is not bad in itself, but the problem of universal machine learning methods remains unsolved (and even ignored). On the one hand, myths about "universal" machine learning methods (such as neural networks), which in reality are not universal, have been popularized in the applied community. On the other hand, theorists have lost faith in the possibility of creating universal machine learning methods, so any attempt to propose a new universal method is a priori labeled "yet another classifier" and dismissed with a reference to the "No Free Lunch" theorem.
The most striking thing is that a theory of truly universal machine learning has existed for half a century, and many of the arguments of modern researchers look very naive in the context of this theory. What is this theory? How does it address machine learning problems? Why has it not become widespread so far? And is it still possible to build a universal machine learning system? We will try to get closer to the answers to these questions, but first we will show why strong AI needs universal learning.
The main problems of machine learning can be demonstrated with such a "simple" task as function approximation and extrapolation. Ultimately, an intelligent agent needs to establish a connection between its sensory input and motor output, or learn to extrapolate the magnitude of the reinforcement received from the environment depending on the actions performed, so the connection of this task with AI is undeniable.
Let's look at the sequence of points shown in Fig. 1. How should it be continued? At first it seems perfectly natural that to solve this problem we need to find a function that passes as closely as possible to the existing points, and this function will be the best solution.

Fig. 1. How should the points be continued?
In mathematics there are various theorems stating that an arbitrary function, on whose form no significant restrictions are imposed (except, perhaps, continuity or differentiability), can be approximated with arbitrary accuracy by functions of a certain type. In particular, it is precisely because of such theorems that artificial neural networks (ANNs) have the reputation of universal approximators. It is amusing that this reputation became attached precisely to ANNs, and not, say, to polynomials, for which similar theorems were proved much earlier.
But what do these theorems mean in terms of machine learning? The fact that approximation of some functions by other functions is possible in principle does not at all mean that it can be achieved in practice. After all, in practice the training sample is always of finite size, and moreover it is often noisy. Let's see how approximation works under such conditions. We will use polynomials, since for them the result has a more transparent interpretation.
We will choose the parameters of polynomials of different degrees so that these polynomials pass as closely as possible to the points of the training sample, that is, we use the least squares method, which in this case has a very simple implementation. Fig. 2 shows the solutions for a linear function, a parabola, and a cubic function. The last (red) point was not included in the training sample and corresponds to the true dependence.

Fig. 2. Fitted polynomials of the 1st, 2nd, and 3rd degrees.
So far so good: by increasing the degree of the polynomial, we can both reduce the deviations from the points of the training sample and improve the prediction accuracy. But try increasing the degree further. Fig. 3 shows the solutions for polynomials of the 4th, 7th, and 10th degrees.

Fig. 3. Fitted polynomials of the 4th, 7th, and 10th degrees.
As you can see, the accuracy on the training sample increases, and the 10th-degree polynomial passes exactly through all the points, but the prediction accuracy drops dramatically! This is the well-known effect of overfitting. To some, this effect may seem completely mystical: "We are improving the approximation on the available sample, beyond which there is no data! Why does the prediction actually get worse?" To others it will seem trivial: "What did you expect if you fit your model to the noise in the data?" However, regardless of one's attitude to this problem, it does not go away. It manifests itself, of course, in ANNs as well. This problem is omnipresent in machine learning: it is typical not only for approximation and extrapolation, but also for pattern recognition, segmentation, clustering, and so on. How can it be dealt with?
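For concreteness, here is a minimal numpy sketch of this experiment; the sample points, the noise level, and the assumed true dependence sin(x) are illustrative, not the data behind the figures:

```python
import numpy as np

# Illustrative noisy sample from an assumed true dependence y = sin(x);
# the last point is held out and used only to measure prediction error.
rng = np.random.default_rng(0)
x_train = np.linspace(0.0, 3.0, 11)
y_train = np.sin(x_train) + rng.normal(0.0, 0.05, x_train.size)
x_test, y_test = 3.5, np.sin(3.5)

for degree in (1, 2, 3, 4, 7, 10):
    # Least squares fit; very high degrees are ill-conditioned and numpy may warn.
    coeffs = np.polyfit(x_train, y_train, degree)
    train_err = np.sqrt(np.mean((np.polyval(coeffs, x_train) - y_train) ** 2))
    test_err = abs(np.polyval(coeffs, x_test) - y_test)
    print(f"degree {degree:2d}: train RMSE {train_err:.4f}, held-out error {test_err:.4f}")
```

The training error decreases monotonically with the degree, while the error at the held-out point eventually explodes.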
Different approaches are possible here. In the case of some classes of neural networks, training is deliberately interrupted before the ANN has managed to adapt to the data too closely (special heuristic criteria have even been developed to determine the most suitable moment to stop training). For this reason, the phenomenon has a second name: overtraining. In simple terms, it is cramming, in which the system concentrates on specific examples. The same shortcoming is characteristic of teaching students, if their training and the assessment of its quality are carried out on the same tests.
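A minimal sketch of the early-stopping mechanism, assuming a deliberately over-flexible polynomial model trained by plain gradient descent (the data, the degree, the learning rate, and the patience threshold are all illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 30)
y = np.sin(2.0 * np.pi * x) + rng.normal(0.0, 0.05, x.size)

# Hold out every second point as a validation set that is not used for the updates.
x_tr, y_tr = x[::2], y[::2]
x_val, y_val = x[1::2], y[1::2]

degree = 10
X_tr = np.vander(x_tr, degree + 1)     # monomial features of an overly flexible model
X_val = np.vander(x_val, degree + 1)

w = np.zeros(degree + 1)
best_val, best_w, best_step, patience = np.inf, w.copy(), 0, 0
for step in range(50000):
    grad = 2.0 * X_tr.T @ (X_tr @ w - y_tr) / len(y_tr)   # gradient of the training MSE
    w -= 0.05 * grad
    val_mse = np.mean((X_val @ w - y_val) ** 2)
    if val_mse < best_val:
        best_val, best_w, best_step, patience = val_mse, w.copy(), step, 0
    else:
        patience += 1
        if patience > 1000:            # validation error has stopped improving
            break

print(f"best validation MSE {best_val:.5f} reached at step {best_step}")
```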
Another simple way of dealing with this effect is cross-validation, the idea of which is to set aside a test part from the training set. Training is then performed on one part of the sample, and the quality of the result is evaluated on the other part. That is, since we want our solution to work well not on the data on which it was trained, but on new data, let's test exactly that. This approach has its drawbacks. First, the size of the training sample decreases, and therefore the quality of the constructed model also deteriorates. Secondly, such an approach is still prone to overfitting, albeit a weaker form of it: an overly close fit indirectly occurs with respect to the held-out sample (it is enough to recall how schoolchildren are drilled for the Unified State Exam without much benefit for their actual learning). To weaken this effect, the training sample is divided into many parts (say, 10), and the quality of the solution is tested on each of them in turn, which, however, increases the computation time and exacerbates the first drawback. Thirdly, this approach is not so easy to apply in unsupervised learning.
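A simple k-fold cross-validation sketch for choosing the polynomial degree (again on assumed illustrative data):

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0.0, 3.0, 30)
y = np.sin(x) + rng.normal(0.0, 0.05, x.size)

def cv_error(x, y, degree, k=10):
    """Mean squared error of a degree-`degree` polynomial, averaged over k folds."""
    idx = np.arange(len(x))
    folds = np.array_split(idx, k)
    errors = []
    for fold in folds:
        train = np.setdiff1d(idx, fold)              # train on everything except the fold
        coeffs = np.polyfit(x[train], y[train], degree)
        pred = np.polyval(coeffs, x[fold])
        errors.append(np.mean((pred - y[fold]) ** 2))
    return np.mean(errors)

for degree in range(1, 11):
    print(f"degree {degree:2d}: CV error {cv_error(x, y, degree):.5f}")
```

The cross-validation error reaches its minimum at a moderate degree and grows again for the degrees that overfit.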
In any case, neither early stopping nor cross-validation answers the question about the nature of overfitting, and both look strange: if stopping learning earlier, or not using all of the available information, makes things better, then something must be wrong with the learning process itself. Let's try to figure this out. Overfitting occurs when the space of models is too large, for example, when polynomials of arbitrary degree or neural networks with any number of neurons are used (because of this, it is sometimes believed that in practice such spaces are meaningless to consider: there is too much "creative freedom" in them, which is not fashionable nowadays even when teaching people). Hence, the problem is to choose the best model out of a very large number of them. To solve it, we need at least an adequate criterion of model quality.
A hint about the nature of overfitting is given by probability theory. By Bayes' rule, the best model M for the data D is the model with the maximum posterior probability P(M|D), which is proportional to the likelihood of the data under this model, P(D|M), multiplied by the prior probability of the model, P(M):

P(M|D) ∝ P(D|M) P(M).

Here the accuracy with which the model describes the data corresponds to the likelihood, which, under the assumption of normally distributed errors (deviations of the model from the data), is expressed directly in terms of the standard deviation. If we maximize only the likelihood (that is, use the maximum likelihood method), we get exactly the overfitting effect. This means that, in accordance with Bayes' rule, choosing a less accurate but not overfitted model can only be ensured by the prior probabilities P(M).
The problem of prior probabilities in statistical inference is fundamental. After all, these probabilities are a priori, which means they must somehow be determined before the data are received. If we have many data sets, then the prior probability of a model is the probability that this particular model will turn out to be the best for some arbitrary set. But this only complicates the task: when determining prior probabilities over many data sets, we again need some prior probabilities so that overfitting does not occur, and so on. Although we have found out that the problem of overfitting is connected with the problem of assigning prior probabilities, this has not made it much easier.
An interesting observation, however, can be made if Bayes' rule is taken in logarithmic form: instead of maximizing the posterior probability itself, we minimize its negative logarithm:

−log₂ P(M|D) = −log₂ P(D|M) − log₂ P(M) + const.

The negative logarithm of a probability is simply an amount of information: −log₂ P(M) is the amount of information in the model (the length of its description in bits), and −log₂ P(D|M) is the additional amount of information in the data given the model (in effect, the length of the description of the data's deviations from the model, which characterizes its accuracy). If we set aside the way the amount of information is computed through probabilities, we can see the following. The amount of information in a model is its complexity. Say, the more terms in our polynomial, or the more neurons and connections between them in an ANN, the more bits are needed to encode such a model.
The need to find a compromise between the complexity of a model and its accuracy is expressed in the minimum description length principle, which states that the best model is the one for which the sum of the length of the description of the data's deviations from the model (or of the data encoded with the help of the model) and the length of the description of the model itself is minimal. This principle is sometimes called a formalization of Occam's razor, which has long advised not to multiply entities beyond necessity.
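As a rough illustration, here is a toy two-part code length for the polynomial example; the 32 bits per coefficient and the Gaussian coding of residuals are crude simplifying assumptions, not part of the MDL principle itself:

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(0.0, 3.0, 30)
y = np.sin(x) + rng.normal(0.0, 0.05, x.size)

BITS_PER_COEFF = 32.0    # assumed fixed-precision code for each model parameter

def description_length(x, y, degree):
    """Toy two-part code length L(M) + L(D|M), in bits."""
    coeffs = np.polyfit(x, y, degree)
    residuals = y - np.polyval(coeffs, x)
    sigma2 = max(float(np.mean(residuals ** 2)), 1e-12)
    model_bits = (degree + 1) * BITS_PER_COEFF
    # Negative log2-likelihood of the residuals under a Gaussian with MLE variance;
    # it is defined only up to an additive constant (the measurement precision of y),
    # which does not affect the comparison between degrees.
    data_bits = 0.5 * len(y) * np.log2(2.0 * np.pi * np.e * sigma2)
    return model_bits + data_bits

for degree in range(1, 11):
    print(f"degree {degree:2d}: total description length "
          f"{description_length(x, y, degree):7.1f} bits")
```

The model part grows with the degree while the data part shrinks; their sum is minimized at a moderate degree, without any held-out data.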
It is important here that the complexity of a model can be considered directly, and not through its prior probability. In fact, there is a consistent way to introduce the notion of the amount of information prior to the notion of probability (and to derive probability theory from information theory, rather than vice versa), and thereby to resolve a number of fundamental problems inherent both in probability theory and in the classical (Shannon) information theory based on it. But that is a topic for a separate discussion.
So, the higher the complexity of a model, the lower its prior probability should be. In practice, such a penalty on models for complexity replaces other methods of eliminating the overfitting effect rather well. Overfitting occurs when a model that is too complex is chosen. The simpler a model, the more general it is. Obviously, overfitting is associated with insufficient generalization. It is precisely the lack of generalization that causes the poor quality of learning by cramming, or of teaching tailored to the tests. There is, of course, the other extreme, when a single simple model is used to describe too much, regardless of the errors or of the fact that the model does not simplify ("does not compress") the data. The information criterion allows one to find the right compromise.
However, the general problem of machine learning is still not solved by this. Correct generalization may be impossible simply because the required model is absent from the model space being used.
Let's go back to the polynomial approximation example. If the true model is a polynomial, then everything will be fine: the description length criterion will allow us to select a polynomial with the optimal number of parameters, providing the best (given the available data) prediction. But what happens if the data are generated by some other function? Will it save us that this function can in principle be approximated by some polynomial?
Take, for example, the exponential function. It is described by a series of terms of the form x^n/n!. But this series has an infinite number of terms, while the sample of points available to us is finite, so the reconstructed polynomial will also have a finite degree. Hence, when extrapolating, the error will still accumulate very quickly, and no criterion will help us. Likewise, if the data contain a harmonic dependence, it will not be extrapolated well either by polynomials or by exponentials, whose finite series cannot be periodic. Of course, some types of dependencies can be extrapolated better or worse by other types, but in general, if the model space does not contain exactly the type of dependence present in the data, we will not get the best possible prediction.
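A short demonstration of how the extrapolation error accumulates (the interval, the degrees, and the target function are arbitrary illustrative choices):

```python
import numpy as np

# Fit a polynomial of each degree to sin(x) on [0, 2*pi] and extrapolate to 3*pi.
x_train = np.linspace(0.0, 2.0 * np.pi, 50)
y_train = np.sin(x_train)                 # noiseless, to isolate the model-class issue
x_far = 3.0 * np.pi

for degree in (3, 5, 7, 9):
    coeffs = np.polyfit(x_train, y_train, degree)
    fit_err = np.max(np.abs(np.polyval(coeffs, x_train) - y_train))
    extrap_err = abs(np.polyval(coeffs, x_far) - np.sin(x_far))
    print(f"degree {degree}: max error on [0, 2pi] {fit_err:.4f}, "
          f"error at 3*pi {extrap_err:.3g}")
```

Inside the training interval the fit improves with the degree, yet just half a period beyond it the polynomial diverges wildly from the periodic dependence.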
The same can be said not only about prediction but also, say, about recognition. In recognition this problem manifests itself most clearly as the problem of invariants. If the images of a certain class satisfy some invariant, the recognition system will not work well enough unless it can learn exactly this invariant. In particular, this is why neural networks for image recognition either require that they be given features that are already invariant to geometric transformations of images, or require training samples of huge size. Thus, if during training a perceptron is shown recognizable objects only in one part of the image, it will not be able to recognize them when they are later presented in another part of the image.
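A toy numeric illustration of this claim, using 1D "images", two made-up class prototypes, and the classic perceptron rule (all of these are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(4)
IMG, PAT = 20, 5
proto = {+1: rng.normal(size=PAT), -1: rng.normal(size=PAT)}   # two class prototypes

def sample(label, positions, n):
    """Place the class prototype at a random allowed position and add a little noise."""
    xs = np.zeros((n, IMG))
    for i in range(n):
        p = rng.choice(positions)
        xs[i, p:p + PAT] = proto[label] + rng.normal(0.0, 0.1, PAT)
    return xs

train_pos = np.arange(0, 4)     # objects appear only in the left part of the "image"
test_pos = np.arange(11, 15)    # ...and are later shown only in the right part

X = np.vstack([sample(+1, train_pos, 100), sample(-1, train_pos, 100)])
y = np.array([+1] * 100 + [-1] * 100)

w, b = np.zeros(IMG), 0.0
for epoch in range(50):                        # classic perceptron learning rule
    for xi, yi in zip(X, y):
        if yi * (w @ xi + b) <= 0:
            w, b = w + yi * xi, b + yi

def accuracy(positions):
    Xt = np.vstack([sample(+1, positions, 200), sample(-1, positions, 200)])
    yt = np.array([+1] * 200 + [-1] * 200)
    pred = np.where(Xt @ w + b >= 0.0, 1, -1)
    return float(np.mean(pred == yt))

print(f"accuracy at the training positions: {accuracy(train_pos):.2f}")
print(f"accuracy at the unseen positions:   {accuracy(test_pos):.2f}")
```

The learned weights are tied to the pixels where the objects appeared during training, so at the unseen positions the accuracy falls to chance level.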
It is precisely this kind of criticism of perceptrons that was contained in the famous book by Minsky and Papert: perceptrons are unable to build non-local invariants (that is, to perform the corresponding generalization) and require a training sample of exponentially large size in order to learn to recognize an image in all its variations. Even if perceptrons do not overfit in this case, they do not achieve the main goal of effective learning: to identify the underlying regularities that allow prediction, or recognition of patterns in a form that has not been encountered before.
The inability to express regularities of arbitrary types is characteristic, naturally, not only of specific ANN architectures but also of other machine learning methods. All of them use some limited way of representing regularities, for example, a specific family of basis functions over which the data are decomposed. The same can be said about symbolic learning, for example when decision trees or sets of rules are used. In complex knowledge-based systems this problem may be veiled, but it is not solved at all.
Naturally, there is a desire to combine different methods that take into account different possible regularities in the data, and this approach is indeed quite popular now. However, a simple combination of non-universal methods increases their capabilities only slightly, since developers can manually build in only a relatively small number of different families of regularities.
But is it possible to give a machine learning system the ability to identify any possible regularity? Is there a space in which arbitrary regularities, not specifically provided by the developer in advance, can be constructively described? Such a space not only exists; it is also well known. It is the space of algorithms.
Need to reproduce an exponential dependence? Such an algorithm is easy to construct. Need polynomials, harmonic functions, the Fibonacci sequence, the factorial? There are algorithms for those too. Need to recognize the classes of even and odd numbers? The corresponding algorithm is trivial! Now think how "easy" it would be to solve all these tasks at once using an ANN.
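As a crude stand-in for a search over the space of algorithms, here is a toy brute-force enumeration of a tiny expression "language"; the grammar, the depth limit, the string-length scoring, and the example datasets are all illustrative assumptions, and real universal induction (discussed below) searches over programs weighted by their algorithmic complexity:

```python
import itertools
import math

def fact(n):
    n = int(n)
    if not 0 <= n <= 12:            # guard against astronomically large arguments
        raise ValueError("argument out of range")
    return math.factorial(n)

# A tiny "space of programs": expressions over x, small constants and a few operations.
LEAVES = ["x", "1", "2", "3"]
UNARY = ["math.exp({})", "math.sin({})", "fact({})"]
BINARY = ["({} + {})", "({} * {})", "({} ** {})"]

def expressions(depth):
    if depth == 0:
        yield from LEAVES
        return
    yield from expressions(depth - 1)
    for sub in expressions(depth - 1):
        for op in UNARY:
            yield op.format(sub)
    for a, b in itertools.product(expressions(depth - 1), repeat=2):
        for op in BINARY:
            yield op.format(a, b)

def fits(expr, data, tol=1e-6):
    try:
        return all(abs(eval(expr, {"math": math, "fact": fact, "x": x}) - y) < tol
                   for x, y in data)
    except (ValueError, OverflowError, ZeroDivisionError, TypeError):
        return False

# Example "training samples" generated by dependencies of very different kinds.
datasets = {
    "squares":     [(1, 1), (2, 4), (3, 9), (4, 16)],
    "exponential": [(0, 1.0), (1, math.e), (2, math.e ** 2)],
    "factorial":   [(1, 1), (2, 2), (3, 6), (4, 24)],
}

for name, data in datasets.items():
    # The shortest expression (by string length) consistent with the data: a crude
    # surrogate for the minimum description length over this toy space.
    found = min((e for e in expressions(2) if fits(e, data)), key=len, default=None)
    print(f"{name}: {found}")
```

Even this naive enumeration recovers expressions like (x * x), math.exp(x), and fact(x) from a handful of points, something no fixed family of basis functions would do, although its cost grows explosively with the expression depth.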
Naturally, for the comparison to be fair, we need a method that would itself build all these algorithms when presented with the corresponding training set. But the main thing for us here is that the space of algorithms at least contains suitable solutions for all such problems.

So, for a machine learning system to be universal, it must be able to derive any regularity. Modern machine learning methods do not satisfy this requirement, which is why it is doubtful that a universal AI can be built on their basis. For such an AI, it is natural to use the set of algorithms as the "space of all possible regularities." One can, of course, argue about whether the set of algorithms really contains all possible regularities (we have, however, already written that the algorithmic approach is sufficient for implementing AI: habrahabr.ru/post/145929). In any case, modern machine learning methods work in highly restricted subspaces of algorithms, and if it becomes possible to use algorithms of arbitrary form "merely" as models, the capabilities of machine learning will expand unimaginably.

Since the overwhelming majority of learning methods are themselves implemented as algorithms, the idea of searching through algorithms as potential solutions to learning problems may even seem trivial. So much the better. However, there are many algorithms capable of reproducing the available data (including the algorithm that simply "memorizes" the training sample), and the fact that the description length criterion allows one to choose the best solution to the learning problem is not so obvious.

A universal method that works in the space of algorithms and uses the criterion of algorithmic complexity for the general solution of the induction (machine learning) problem was proposed by R. Solomonoff back in 1964. True, this method requires practically infinite computational resources, which is quite understandable given that it involves a brute-force search over algorithms. Not surprisingly, pragmatically minded researchers completely ignored it. However, this does not mean that one should search where it is light rather than where the thing was lost. To create a strong AI, it is necessary to get as close as possible to an efficient implementation of such a universal learning method; otherwise (that is, using only modern practical machine learning methods), the AI will not really be able to learn anything. It is from understanding this fact, in particular, that such a field of research as universal AI arises. But how to get closer to solving this problem is a topic for another, difficult conversation.