LSearch and ξ_tl
In the previous part ( http://habrahabr.ru/post/150056/ ) we considered the basic models of "unbiased" universal algorithmic intelligence, which we called ideal minimal intelligence (IMI), since these models, not being oriented toward any particular class of environments, are as compact as possible. However, it is clear that they are far from sufficient for creating a real AI.
The obvious reason why universal prediction based on algorithmic probability, as well as the IMI models built on it, cannot be applied in practice is its incomputability. This incomputability is due to two circumstances:
1) averaging is carried out over an infinite set of models;
2) among these models there are non-halting algorithms (and, due to the undecidability of the halting problem, it is impossible in the general case to determine whether an algorithm loops on the given data or not).
R. Solomonoff pointed out [Solomonoff, 1986] that a big step toward the practical use of algorithmic probability is the search procedure proposed by L.A. Levin (a student of A.N. Kolmogorov) [Levin, 1973], now called LSearch, which allows one to obtain a solution to any problem in P or NP in time proportional to the optimal time multiplied by a constant factor.
The general idea of this procedure is to allocate to each algorithmic model q an amount of testing time proportional to its a priori probability ξ(q) = 2^(–l(q)). With fixed resources, it is possible to calculate in advance how many processor ticks to allocate to the execution of each particular model (algorithm). If a solution is sought until some satisfactory quality is achieved, the models can be run in parallel with priorities ξ(q) (or, in the case of sequential execution, at each step the model q to continue computing is chosen with probability ξ(q)).
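To make the scheduling idea more concrete, here is a minimal sketch in Python (the article itself contains no code). It implements the classic phase-based variant of Levin search: in phase k every program of length l(q) ≤ k is run for 2^(k–l(q)) steps, so the time share of each program is proportional to 2^(–l(q)). The interpreter run_program and the predicate is_solution are hypothetical placeholders, not part of any real library.

```python
from itertools import product

def lsearch(run_program, is_solution, max_phase=20):
    """Levin-style search sketch: in phase k, every binary program q with
    l(q) <= k is run from scratch for 2**(k - l(q)) steps, so each program
    receives a share of time roughly proportional to 2**(-l(q))."""
    for k in range(1, max_phase + 1):
        for length in range(1, k + 1):
            for bits in product("01", repeat=length):
                q = "".join(bits)
                steps = 2 ** (k - length)
                # run_program is a hypothetical interpreter: it executes q
                # for at most `steps` steps and reports (output, halted).
                output, halted = run_program(q, steps)
                if halted and is_solution(output):
                    return q, output
    return None  # resources exhausted without a satisfactory solution
```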
Obviously, LSearch is computationally expensive, exponential in complexity, and applicable only to problems of very low dimensionality (Solomonoff repeatedly noted this, pointing out that LSearch is a blind search, better than which, however, nothing can be offered for exact problem solving without additional a priori information). But, at least, on the basis of LSearch a computable variant of algorithmic probability (or algorithmic complexity) can be constructed, which is optimal in the indicated sense. In particular, it can be noted that the presence of non-halting algorithms is not a problem for LSearch, since they are indistinguishable from merely long-running algorithms whose execution also does not finish within the available resources.
A subtle question, however, is whether the prediction should take into account the results generated by algorithms that did not have time to halt. This is a peculiarity of the prediction setting: we may well expect non-halting behavior from the natural processes being predicted, which, however, is connected not with constant modification of the "history" but with the formation of the "future". In this case, say, an algorithm that outputs the infinite sequence 010101... will be simpler than an algorithm that outputs the same sequence but of some fixed size. In this sense, the requirement that q output the history x_{<k} (as in the previous part, x_{<k} denotes the sequence of sensory signals and reinforcement signals, that is, all inputs received by the agent up to the step with number k–1) and then halt (as is done in AIξ) may not be entirely natural. At the current level of detail of the models of universal intelligence, this subtlety does not play a fundamental role.
Since a computable modification of algorithmic probability, for which optimality of computation time up to a multiplicative constant is proved, can be constructed on the basis of LSearch, it is quite natural that the next, more detailed extension of the IMI models may incorporate LSearch.
Obviously, a time-limited LSearch procedure will only have time to execute the shorter and faster algorithms. Generalizing somewhat, we can introduce the following distribution ξ_tl over the set of programs limited in length and running time:

ξ_tl(q) = 2^(–l(q)) if l(q) ≤ l and t(q) ≤ t, and ξ_tl(q) = 0 otherwise,

where l(q) is the length of program q and t(q) is its running time.
As in the previous part, we introduce the algorithmic probabilities themselves as

P_tl(x) = Σ_{q: U(q)=x*} ξ_tl(q),

where the sum runs over the programs q (within the given length and time limits) whose output on the reference machine U begins with the string x.
It should be borne in mind that the distribution written in this form is not normalized. Normalization is easy to perform, but one can also do without it. Note also that the conditional probabilities of interest to us are easy to introduce by analogy with P_ALP(x'|x), which gives an analogue of formula (42) in [Hutter, 2007].
Obviously,

lim_{t,l→∞} P_tl(x) = P_ALP(x),

that is, when using this distribution, the intelligent agent will tend to the (unlimited) universal one as the resources tend to infinity.
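The resource-bounded probability can be illustrated by a brute-force sketch (illustrative only; run_program is a hypothetical interpreter of binary programs on the chosen reference machine): enumerate all programs up to length l, run each for at most t steps, and sum the weights 2^(–l(q)) of those whose output begins with the observed string.

```python
from itertools import product

def p_tl(x, l_max, t_max, run_program):
    """Resource-bounded approximation of algorithmic probability: sum the
    weights 2**(-l(q)) over all binary programs q with l(q) <= l_max whose
    output, produced within at most t_max steps, begins with the string x.
    run_program is a hypothetical interpreter returning that output."""
    total = 0.0
    for length in range(1, l_max + 1):
        for bits in product("01", repeat=length):
            q = "".join(bits)
            output = run_program(q, t_max)
            if output.startswith(x):
                total += 2 ** (-length)
    return total

# Conditional probabilities, by analogy with P_ALP(x'|x), could then be taken
# as the ratio p_tl(x + x_next, l_max, t_max, run) / p_tl(x, l_max, t_max, run).
```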
This distribution can be used directly to modify AIξ-type models when searching for environment models. A joint increase of the values t and l can be carried out in the same way as in LSearch, so that at time T each program q is executed for 2^(–l(q))·T steps; by time T, those programs for which 2^(–l(q))·T > t(q) will have had time to finish, that is, the programs are effectively ordered by the value 2^(l(q))·t(q). The first model q that reproduces the history of the agent's interaction with the environment and performs the prediction will have the smallest value of 2^(l(q))·t(q), the logarithm of which can be called the Levin complexity (this definition of complexity has a deep meaning that requires a separate discussion).
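For reference, the quantity whose logarithm is meant here is usually written as the Levin complexity (a standard definition, not specific to this article):

```latex
Kt(x) = \min_{q} \bigl\{\, l(q) + \log_2 t(q) \;:\; U(q) = x \,\bigr\}
```

where U is the reference machine; minimizing l(q) + log₂ t(q) is the same as minimizing 2^(l(q))·t(q).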
After a model with minimal complexity is found, the search need not stop until the computational resources are exhausted, since universal prediction in IMI models implies the use of many models compatible with the history. However, even when only the single best model is used, the computation time will be about O(2^L·T), where L and T are the corresponding characteristics of that minimal model. As can be seen, such a computation scheme (similar to LSearch) has a slowing factor of 2^L, due to the fact that, before this model is discovered, about 2^L programs have to be executed, spending up to T time steps on each of them. In addition, a universal agent has to perform the prediction for different chains of its own actions (here it is tempting to separate the tasks of prediction and action selection, but their naive separation violates the universality of the agent).
In any case, the multiplicative slowing constant 2^L is gigantic. Moreover, in the case of the real world, the value of L (the length of the "true" model of the environment) may be unbounded from above. As a result, the complexity of finding the minimal deterministic model of the environment will grow exponentially with the length of the history. And in the case of limited resources, this will make unattainable the convergence of the behavior of an agent based on the ξ_tl distribution to the behavior of an agent using some "correct" a priori distribution μ (since convergence in the case of ξ(q) = 2^(–l(q)) occurred precisely as the volume of observational data increased, once this volume began to exceed the complexity of the data source).
M. Hutter introduces the model AIξ_tl [Hutter, 2005]. It is interesting, however, that he does not simply use ξ_tl instead of ξ, but also proposes a search method that is optimal in execution time up to an additive, rather than a multiplicative, constant. The main idea of this method is to iterate not over all algorithms, but only over those for which it has been proved that the values of the objective function they predict are not overestimated. In this case, it is not the models of the environment that are enumerated, but the programs of the agent itself, which, however, output not only actions but also reinforcement values (current and expected in the history). The proofs are generated in a formal system by a separate algorithm working in parallel with the procedures for executing the models.
Some doubt is raised by the possibility of a rigorous constructive proof of the required property for programs of rational behavior without enumerating other programs. In this sense, the attractive-looking replacement of the multiplicative slowing constant by an additive one may turn out to be ineffective in practice. In any case, the additional time required to search for proofs is too long for a real AI to act in accordance with this approach. Since we will not use the AIξ_tl model itself in what follows, detailed explanations of it are not given here; it is better to consult the original source.
Effective and pragmatic universal intelligence.
In [Wang, 2007], the concept of effective intelligence is introduced as intelligence that fulfills its purpose under conditions of limited computing resources and information. AIξ_tl works under conditions of limited resources and information (the environment is assumed to be a priori unknown), and it is optimal in time up to a certain slowing factor. However, this optimality is too weak. Can it be improved?
The obvious limitation of AIξ_tl is that the choice of each subsequent action does not become simpler as the history grows, but becomes more complicated as the length of the history increases, and with it the complexity of the shortest algorithms that can reproduce this history (of course, this remark does not apply to AIξ, which iterates over all algorithms, which, however, is absolutely unrealistic). The selection of each new action is performed independently of the previous one by the same method. As a result, the constant slowing factor does not decrease from step to step; in essence, it is multiplied by the number of times the agent selects an action. It is quite natural to consider not just the optimality of each choice of action separately, but the optimality of the whole process of the agent's functioning. The effectiveness of an intelligent agent can, in principle, increase from step to step. The slowing factor will not disappear, but it will become additive: it will simply cease to be multiplied by the number of steps. And this result cannot be improved in principle. It can be characterized in terms of Pareto optimality, just like the distribution ξ: there is no intelligence algorithm that would be at least as good in all environments and strictly better in at least one.
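For clarity, this Pareto-optimality property can be written out explicitly (paraphrasing the standard formulation in [Hutter, 2005]; the value notation V below is assumed here, not taken from the article): there is no policy p such that

```latex
\forall \mu \in \mathcal{M}:\ V^{p}_{\mu} \ge V^{p^{\xi}}_{\mu}
\quad\text{and}\quad
\exists \mu \in \mathcal{M}:\ V^{p}_{\mu} > V^{p^{\xi}}_{\mu},
```

where V^p_μ is the expected total reinforcement of policy p in environment μ and p^ξ is the ξ-based agent.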
One can, of course, weaken the requirement of convergence of the agent to optimal behavior. After all, what is usually required is not optimal behavior, but merely good enough behavior. However, in the limit (with infinite resources) there is no reason to give up optimal behavior. The problem is rather in the use of Pareto optimality.
Actual effectiveness in all environments cannot be achieved at once. However, we are not interested in all possible environments at all. Rather, we would like to preserve the Pareto optimality and universality of the IMI models, but add to them higher efficiency for our particular world. In this connection, the concept of pragmatic (effective) AI, introduced in [Goertzel, 2010], is useful. In more familiar terminology, such an AI should be called "biased". We emphasize that "bias" does not necessarily mean loss of universality. The Pareto set includes many variants of optimal AI, some of which turn out to be much more "pragmatic" (that is, effective in our particular world) than an arbitrarily chosen variant.
In particular, we noted that the distribution ξ(q) can differ depending on the chosen universal machine. The choice of this machine does not matter from the point of view of Pareto optimality, but it is of great importance for increasing the pragmatism of AI.
Let us return to the methodological principles introduced in the article ( http://habrahabr.ru/post/145309/ ): AI should be universal, and this universality should be preserved when resource constraints are introduced, which largely determine the architecture of a real efficient AI, which should also rely on a priori information about the world to speed up learning. Now we can clarify what these principles mean for creating a universal, effective, pragmatic AI. With respect to ξ(q), preservation of universality means ξ(q) > 0 for any algorithm q; introducing a priori information to increase learning speed (increasing pragmatism) means choosing a universal machine that defines an a priori distribution ξ(q) most adequate to our world; taking resource constraints into account (increasing efficiency) means using the distribution ξ_tl(q) and refining it as new information arrives.
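As a toy illustration of how a priori information can be introduced without losing universality (this particular mixture construction and the distribution nu are illustrative assumptions, not a recipe from the text): any program keeps a nonzero weight, while most of the probability mass is shifted toward models adequate to our world.

```python
def biased_prior(q, nu, epsilon=0.01):
    """A 'pragmatic' yet universal prior over programs q (bit strings):
    nu(q) is a hypothetical domain-specific distribution concentrating
    mass on models of our world, while the 2**(-len(q)) component keeps
    the resulting weight strictly positive for every program,
    preserving universality."""
    return (1.0 - epsilon) * nu(q) + epsilon * 2 ** (-len(q))
```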
The choice of the reference machine.
The possibility of choosing the reference machine as a way of introducing a priori information has already been noted by various authors [Solomonoff, 2010], [Pankov, 2008], and it has even been pointed out that its choice affects the "relative intelligence of agents" [Hibbard, 2009]. But what does this mean in practice?
There is a danger of understanding the choice of the reference machine too literally, that is, as a choice between, say, Turing machines, the lambda calculus, formal grammars, Markov algorithms, etc. All these formalisms do indeed define different ξ(q), but one can hardly expect that one of them is substantially "more pragmatic" than the others. A similar understanding of the choice of the reference machine, but in a slightly more practical vein, leads to an attempt to select (or create) the most appropriate programming language. Indeed, any language also defines some ξ(q). And the choice of a suitable language can, in specific cases, facilitate the task of building algorithmic models. There are various works based on similar considerations, for example [Hall, 2008], [Looks, 2009].
All this, of course, has to do with increasing the pragmatism of universal AI. Even interpreting some data strings as numbers with the operations defined on them already introduces a considerable inductive bias, facilitating the construction of models of our world. And even the mere use of the above-mentioned formalisms already provides some inductive bias toward our world, since they have a relatively simple physical implementation (that is, the world itself is a universal "machine" that "emulates" other universal machines with varying difficulty). However, even the most cursory analysis of, say, the problem of image interpretation shows that such an inductive bias is far from enough. Indeed, a minimal program that generates a given image and is written in some more or less ordinary language will have an enormous size, and it will be impossible to construct it by direct brute force.
The brain contains a huge number of specialized procedures for processing sensory information, without an a priori specification of which intelligence cannot be pragmatic. A similar conclusion can be drawn about other cognitive functions, both low-level and higher ones. It may even seem that only these specialized functions (which arose for our world) should be dealt with, and that the models of universal AI are irrelevant. But we repeat once again that in order to create a strong AI one must by all means try to preserve the property of universality and resist the temptation to take the simpler path of an effective, pragmatic, but non-universal intelligence. Indeed, the human intellect (despite all its "pragmatism") is characterized by universality even in the smallest things. Without such universality, a person would never be able to notice, say, a dead pixel on a monitor or a cross formed by stars in the sky. No computer vision system is able to form such arbitrary visual concepts unless they are incorporated a priori in its description space. A completely transparent reason for this is non-universality.
Thus, natural intelligence has a huge "bias" (in particular, in the form of inductive bias), which ensures its "pragmatism" and determines its structure. This bias can be described in terms of the choice of the reference machine in the definition of ξ(q); however, this interpretation should not be understood in a simplified way. To emphasize this, we will call this "bias" cognitive bias, since in humans it is realized in cognitive functions (such as memory, attention, planning, etc.) that are absent in IMI models.
Note that the term cognitive bias (cognitive distortion) is already "taken" in psychology. There it has a more negative character, pointing to systematic "mistakes" made by humans in the process of thinking. Especially noteworthy are errors in reasoning about probabilities, which seem to indicate that human thinking does not use the rules of statistical inference. Cognitive distortions deserve separate consideration. Here we note that the concept of cognitive bias introduced by us, although more general than the concept of cognitive distortions in psychology, does not contradict it. On the contrary, one can hope that the two notions allow clarifying each other's content.
Adaptability and self-optimization.
It was noted above that the IMI models considered are non-incremental in structure: each of them is a call, in an infinite loop, of one and the same algorithm; from step to step only the input string of this algorithm (representing the history of the agent's interaction with the environment) changes. It is intuitively clear that an IMI in this form will repeatedly redo the work of predicting the input x and choosing the output (action) y. If a priori information can be put into the intellect by choosing the distribution ξ(q) in order to increase efficiency, then it is quite natural to modify this distribution as information accumulates.
Such incremental learning (although in the narrower context of the induction problem) was also considered by Solomonoff [Solomonoff, 2003]. His approach contains important ideas, but on the whole it is heuristic (his ideas largely stem from heuristic programming) and insufficient.
Similar ideas were expressed by Schmidhuber, who proposed an adaptive LSearch, the idea of which is that as soon as a program q is found that solves the current problem, its probability ξ(q) is significantly increased; however, this method of modifying ξ(q) does not in itself guarantee optimality [Schmidhuber, 2007a].
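A crude sketch of this adaptive idea (the boost factor and the renormalization below are illustrative assumptions; Schmidhuber's actual update rule is not reproduced here):

```python
def update_prior(weights, solver_q, boost=2.0):
    """After program solver_q has solved the current problem, increase its
    weight and renormalize, so that subsequent LSearch rounds allocate it
    more time. 'weights' maps programs to their current xi(q) values."""
    weights = dict(weights)          # do not mutate the caller's dictionary
    weights[solver_q] *= boost
    norm = sum(weights.values())
    return {q: w / norm for q, w in weights.items()}
```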
Schmidhuber [Schmidhuber, 2007b] proposed a different approach (called the "Gödel machine"), which implies strictly justified self-optimization (covering not only the distribution ξ(q), but also other components of the agent's intelligence, and even the objective function). However, although this approach looks more theoretically sound, in it an act of self-optimization is performed only when it can be strictly proved that it leads to an increase in future reinforcement. Although this guarantees that such self-optimization at least does not worsen the agent's behavior, it is doubtful that it will be very effective, since rigorously proving that particular self-modification steps will lead the agent to a guaranteed increase in future reinforcement is unlikely to be possible in many cases, due to the inductive nature of the problem itself (we expressed a similar methodological doubt regarding AIξ_tl). Although one could instead use proofs that a given self-optimization increases not the future reinforcement itself, but the likelihood of its increase, the question of choosing a proper axiomatic system for the Gödel machine remains unanswered, which makes this machine a "framework model" requiring significant development. At the same time, the indication of the need for self-optimization that includes not only the refinement of ξ(q), but also the programs for searching both for models and for optimal chains of actions, seems perfectly fair (it is worth noting that in the context of infinite resources even a modification of ξ(q) taking the existing history into account is not needed, since the prediction is always performed over the full history). Unfortunately, a suitable theory of self-optimization is absent in the literature and remains to be developed.
Redundancy of IMI.
We have already held a preliminary discussion of whether the IMI models are sufficiently universal. However, given the difficulty of achieving such universality, it is also appropriate to ask whether it is not redundant. Perhaps the pursuit of such universality is meaningless? Although algorithmic incompleteness entails the weakness of methods such as those of machine learning, and it should be agreed that achieving algorithmic completeness is necessary in one way or another, a number of considerations can be put forward according to which there may be more general principles preceding algorithmic universality. First of all, these are evolutionary considerations.
Indeed, at first glance it seems that animals, at least, do not possess universal intelligence. It turns out that universal intelligence arises as a late evolutionary superstructure. This, it would seem, not only fails to agree with our desire to develop a strong AI by increasing the effectiveness of universal intelligence (rather than by gradually increasing the universality of an effective pragmatic intelligence), but also implies the primacy of the principles of evolutionary development.
Moreover, it can be said that the intelligence of each individual person is not all that universal. Of course, people can, in principle, learn anything. In the history of science, people have invented a wide variety of models (including the universal machine). Children can learn any language or master ways of thinking (say, in the form of mathematical methods) that were unknown to people a century ago. The very ability of a person to program almost unconditionally proves the algorithmic universality of his thinking. But at the same time, this universality is very relative: people more often demonstrate the limitations of their thinking, being unable to master something new or to go beyond the usual stereotypes. Even in science, a change of generations of scientists is often required for a new paradigm to become fully accepted.
Of course, this can be attributed to the fact that even from universal intelligence, under conditions of limited resources and a relatively short lifetime, one should not expect too much. But at the same time, the banal fact that human thinking is in many ways a social phenomenon draws attention to itself. In the previous part we already noted the need to take into account interactions between intelligent agents, but in a somewhat different (rather "competitive") aspect. Here we are talking about the fact that universal intelligence may have to be attributed not to a separate individual, but to society. However, society "strives" for survival (or "optimizes" some implicit "objective function"), and its multi-agent nature is only one of its "architectural features" (although perhaps a very important one); therefore, the above considerations do not contradict the consideration of general models of universal algorithmic intelligence.
Returning to the evolutionary considerations, it can also be noted that the evolutionary "search" proceeds in an "algorithmically complete space" (we use quotation marks to emphasize the conventionality of such a description). Indeed, arbitrary "algorithms" (in particular, programs of animal and human behavior) can be written in the genetic code, that is, the reading of the genetic code is carried out by a "universal machine". In this sense, evolution is also "algorithmically universal". Moreover, it turns out to be a "self-optimizing search", because such search metaheuristics widely used in AI as crossover were not present a priori in evolution, but were "invented" by it. And there are many such "inventions". At the same time, the "invention" of the first metaheuristics took billions of years, whereas later the rate of evolution increased significantly (especially if scientific and technical progress is regarded as a continuation of a single global evolutionary process). In this sense, global evolution (the beginning of which should perhaps be placed at the time of the emergence of replicators, which are also non-trivial "machines") was originally an inefficient, unbiased, but self-optimizing universal "intellect" that gradually created an effective pragmatic intelligence, namely human intelligence. Perhaps the analogy between evolution and intelligence is not entirely complete, but the properties of "algorithmic universality" and "self-optimizing search" can be traced very clearly in both cases. From this, in particular, we can conclude that an unbiased intellect would have to at least repeat the entire course of evolution, which makes such an intelligence unpragmatic and practically unrealizable.
Consideration of society or evolution may clarify some points regarding the objective function of individual agents, but it raises additional questions regarding the criteria of social and evolutionary development. As for the rest, such consideration does not relieve us of the need to solve the problem of self-optimizing search in algorithmically complete spaces. We can study this problem using evolution modeling as an example, but ultimately we will need to create a pragmatic AI. Thus, consideration of the evolutionary and social aspects of intelligence does not provide any more general principles for its research and development, but, on the contrary, further emphasizes the importance of the subject under discussion.
Findings.
IMI models are fairly easy to make computable and to a certain extent take into account resource constraints.
Literature.
(Goertzel, 2010) Goertzel B. Toward a Formal Characterization of Real-World General Intelligence // In: E. Baum, M. Hutter, E. Kitzelmann (Eds.), Advances in Intelligent Systems Research. 2010. Vol. 10 (Proc. 3rd Conf. on Artificial General Intelligence, AGI 2010, Lugano, Switzerland, March 5-8, 2010). P. 19–24.
(Hall, 2008) Hall S. VARIAC: an Autogenous Cognitive Architecture // In: Frontiers in Artificial Intelligence and Applications (Proc. 1st AGI Conference). 2008. V. 171. P. 176–187.
(Hibbard, 2009) Hibbard B. Bias and No Free Lunch in Formal Measures of Intelligence // Journal of Artificial General Intelligence. 2009. V. 1. P. 54–61.
(Hutter, 2005) Hutter M. Universal Artificial Intelligence: Sequential Decisions Based on Algorithmic Probability / Springer. 2005. 278 p.
(Hutter, 2007) Hutter M. Universal Algorithmic Intelligence: A Mathematical Top→Down Approach // In: Artificial General Intelligence. Cognitive Technologies, B. Goertzel and C. Pennachin (Eds.). Springer. 2007. P. 227–290.
(Levin, 1973) Levin L.A. Universal sequential search problems // Problems of Information Transmission. 1973. Vol. 9. No. 3. P. 115–116.
(Looks, 2009) Looks M., Goertzel B. Program Representation for General Intelligence // In: B. Goertzel, P. Hitzler, M. Hutter (Eds.), Advances in Intelligent Systems Research. V. 8 (Proc. 2nd Conf. on Artificial General Intelligence, AGI 2009, Arlington, Virginia, USA, March 6-9, 2009). P. 114–119.
(Pankov, 2008) Pankov S. A Computational Approximation to the AIXI Model // Frontiers in Artificial Intelligence and Applications (Proc. 1st AGI Conference). 2008. V. 171. P. 256–267.
(Schmidhuber, 2007a) Schmidhuber J. The New AI: General & Sound & Relevant for Physics // In: Artificial General Intelligence. Cognitive Technologies, B. Goertzel and C. Pennachin (Eds.). Springer. 2007. P. 175–198.
(Schmidhuber, 2007b) Schmidhuber J. Gödel Machines: Fully Self-Referential Optimal Universal Self-Improvers // In: Artificial General Intelligence. Cognitive Technologies, B. Goertzel and C. Pennachin (Eds.). Springer. 2007. P. 199–226.
(Solomonoff, 1986) Solomonoff R. The Application of Algorithmic Probability to Problems in Artificial Intelligence // In: L.N. Kanal and J.F. Lemmer (Eds.), Uncertainty in Artificial Intelligence. Elsevier Science Publishers. 1986. P. 473–491.
(Solomonoff, 2003) Solomonoff R. Progress in Incremental Machine Learning // Technical Report IDSIA-16-03, IDSIA. 2003.
(Solomonoff, 2010) Solomonoff R. Algorithmic Probability, Heuristic Programming and AGI // In: E. Baum, M. Hutter, E. Kitzelmann (Eds.), Advances in Intelligent Systems Research. 2010. V. 10 (Proc. 3rd Conf. on Artificial General Intelligence, Lugano, Switzerland, March 5-8, 2010). P. 151–157.
(Wang, 2007) Wang P. The Logic of Intelligence // In: Artificial General Intelligence. Cognitive Technologies, B. Goertzel and C. Pennachin (Eds.). Springer. 2007. P. 31–62.