
Chemists like to say that chemistry is the study of dirty substances by pure methods, physics the study of pure substances by dirty methods, and physical chemistry, they say, examines dirty substances using dirty methods. In areas traditionally related to artificial intelligence (pattern recognition, solving NP-hard problems, text processing, and so on), most tasks are dirty: that is, they are poorly amenable to formal description and lack clear criteria for the correctness of a solution. I don't know how chemists manage, but programmers rarely succeed in solving such tasks without getting dirty themselves. Programming dirty tasks is dirty too, and here "dirty" does not mean "bad". This article is not about how to stay clean and sterile. This article is about how, armed with a crowbar, courage and patience, to dive into the deep lithospheric layers and survive.
So, suppose you need to develop a system that demonstrates complex behavior (for example, helping grandmothers across the road or, for something more exotic, recognizing text in an image). If the task does not seem dirty enough to you, try to write a working system, improve the quality of its work as far as possible, and then improve it further still. It is desirable that performance does not degrade along the way; ideally, it should improve.
Science knows many tricks
There are two persistent myths about how complex, poorly formalizable problems get solved.
Myth 1. The great power of mathematics
It is a misconception that in the world of complex problems everything is settled by a mighty hurricane of mathematics. In the popular imagination, programmers who solve AI-like problems are egg-headed scientists in white coats who spend almost all their time at the blackboard grinding through furious curvilinear integrals. I hasten to disappoint you: mathematical methods, although extremely useful in the right cases, have a depressingly limited scope. A better approach is to isolate the subtasks most amenable to science and use their solutions as separate components of the system.

For example, for simple cases the problem of clustering objects, that is, dividing them into groups, is well studied, and methods for solving it have been developed on the foundations of probability theory and mathematical statistics, where the formulas look charmingly appealing on a blackboard. The same theoretical foundation underlies Bayesian networks, a promising and recently intensively studied methodology for building complex systems. But in general, the mathematical gears are too beautiful and mesh too tightly with each other not to jam on the large clods of dirt that fall into them. Sooner or later you have to take off the white coat and leave the quiet of the office.
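To show what the clean end of the spectrum looks like, here is a minimal sketch of one-dimensional k-means clustering with two clusters; the sample points and the fixed iteration count are assumptions made purely for illustration:

    #include <cmath>
    #include <iostream>
    #include <vector>

    // A minimal sketch: 1-D k-means with k = 2. The data and the fixed
    // number of iterations are invented for demonstration only.
    int main() {
        std::vector<double> points = {1.0, 1.2, 0.8, 5.0, 5.3, 4.9};
        double c1 = points.front(), c2 = points.back();  // naive initialization

        for (int iter = 0; iter < 10; ++iter) {
            double sum1 = 0, sum2 = 0;
            int n1 = 0, n2 = 0;
            // Assignment step: attach each point to the nearest centroid.
            for (double p : points) {
                if (std::abs(p - c1) < std::abs(p - c2)) { sum1 += p; ++n1; }
                else                                     { sum2 += p; ++n2; }
            }
            // Update step: move each centroid to the mean of its cluster.
            if (n1 > 0) c1 = sum1 / n1;
            if (n2 > 0) c2 = sum2 / n2;
        }
        std::cout << "centroids: " << c1 << " " << c2 << "\n";
    }

On data this tame the formulas really are charming; the trouble begins when real objects stop looking like this sample.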
Myth 2. The Big Neural Network will solve everything.
You just need to choose the right number of layers and the right architecture. After that it only remains to train the network, and the wise synapses will themselves point out the True Path. Sometimes genetic algorithms are offered as an alternative: it is believed that they, too, have a little thinker inside. The situation here is very similar to what was said about mathematics. Yes, in limited cases these methods help, but in general they are not a panacea. Tuning a neural network is still a torment or, if you prefer, an art. Another problem is that the result is something very much like a black box, whose behavior becomes seriously hard to predict or change. And the behavior, of course, does need changing.
The fundamental limitation of "beautiful" approaches is that they all depend too heavily on some nice assumption. For the classification problem, for example, the assumption may be that the feature vectors of the objects follow a mixture of Gaussian distributions, or that the features are independent of one another. In the case of neural networks, it would be great if the classes were at least approximately linearly separable in feature space; at worst, let them be linearly separable in some other, even implicitly defined, space.
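To make the independence assumption tangible, here is a minimal sketch of a naive Bayes classifier over binary features; the training sample is invented, and the independence of the features is exactly the nice assumption the model leans on:

    #include <cmath>
    #include <iostream>
    #include <vector>

    // A minimal sketch of naive Bayes over binary features with Laplace
    // smoothing. The training sample below is invented for illustration.
    struct NaiveBayes {
        std::vector<double> pFeatureGivenClass[2];  // P(feature = 1 | class)
        double pClass[2];

        void Train(const std::vector<std::vector<int>>& x, const std::vector<int>& y) {
            size_t f = x[0].size();
            double count[2] = {1, 1};  // class counts, smoothed
            std::vector<double> ones[2];
            ones[0].assign(f, 1); ones[1].assign(f, 1);
            for (size_t i = 0; i < x.size(); ++i) {
                ++count[y[i]];
                for (size_t j = 0; j < f; ++j) ones[y[i]][j] += x[i][j];
            }
            for (int c = 0; c < 2; ++c) {
                pClass[c] = count[c] / (count[0] + count[1]);
                for (size_t j = 0; j < f; ++j)
                    pFeatureGivenClass[c].push_back(ones[c][j] / (count[c] + 1));
            }
        }

        int Classify(const std::vector<int>& x) const {
            double logP[2];
            for (int c = 0; c < 2; ++c) {
                logP[c] = std::log(pClass[c]);
                for (size_t j = 0; j < x.size(); ++j) {
                    double p = pFeatureGivenClass[c][j];
                    logP[c] += std::log(x[j] ? p : 1 - p);  // independence assumed here
                }
            }
            return logP[1] > logP[0];
        }
    };

    int main() {
        NaiveBayes nb;
        nb.Train({{1, 0}, {1, 1}, {0, 0}, {0, 1}}, {1, 1, 0, 0});
        std::cout << nb.Classify({1, 0}) << "\n";  // expected: 1
    }

The moment real features start correlating, the charming multiplication of probabilities quietly stops meaning what it claims to mean.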
In real life, for some reason, objects stubbornly refuse to obey us and keep stepping outside the limits laid down by one method or another. And then it turns out that adapting the theoretical beauty to harsh reality requires straining your brain very hard, painfully tuning the system or modifying the chosen solution. And when the initial premises begin to crack at the seams, you have to abandon the system you have loved and nurtured through years of suffering and turn to competing approaches.
Heuristic programming
It turns out that we must somehow arrange things so that, as our picture of reality is refined, we can adapt to the new conditions with little bloodshed. What is needed is an expressive language for describing reality and mapping that description onto processor instructions. Unfortunately, nothing more flexible and efficient for this purpose than a general-purpose programming language exists yet. So you have to write code, and a lot of it.
So, at a certain stage of development, a program that solves a really dirty task effectively turns into an expert system written in a general-purpose programming language (if you managed to get by with a specialized language, then you have an expert system in the classical sense of the term, but that does not change the essence of the reasoning). An integral feature of such a program-turned-expert-system is an abundance of heuristics expressed directly in code. Heuristics are actions, rules and criteria that are questionable from the point of view of scientific validity, something like if (word.LettersCount() >= 4) {...}. A piece of a program composed of heuristics is itself a heuristic of the next level, and so on up to the level of the entire system. Such code is easy to recognize by the names of its variables, methods and classes: all those suspiciousImage, looksLikeTable(), GoodTextExtractor. The comments give it away, too.
Writing heuristics is a fascinating occupation and an interesting art, however unusual that may sound. It is a great feeling to discover that letters are approximately square and to use this fact to estimate the width of inter-word gaps in the text from the height of the dark connected components in the image (the width is less reliable to use, since letters are glued together more often within a line than between lines). Dirty, isn't it?
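Here is a sketch of that very heuristic, assuming we already have the bounding boxes of the dark connected components; the 0.6 coefficient is an invented tuning constant of exactly the questionable kind discussed above:

    #include <algorithm>
    #include <vector>

    // Bounding box of a dark connected component found in the image.
    struct Box { int width; int height; };

    // Letters are roughly square, so the typical letter height is a usable
    // estimate of letter width, and an inter-word gap is "something
    // noticeably wider than a letter". The 0.6 factor is pure guesswork.
    int EstimateWordGapWidth(std::vector<Box> components) {
        if (components.empty()) return 0;
        // Heights are more reliable than widths: letters glue together
        // horizontally within a line far more often than vertically.
        std::sort(components.begin(), components.end(),
                  [](const Box& a, const Box& b) { return a.height < b.height; });
        int medianHeight = components[components.size() / 2].height;
        return static_cast<int>(medianHeight * 0.6);
    }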
In the rest of this article, the words "heuristic" and "dirty" as applied to programming are used interchangeably.
It's alive

Beyond a certain level of complexity, the best description of a system is the system itself. Nature has plenty of examples of extremely complex systems: the economy, the genome, a biocenosis. Programming has them too, but in heuristic programming the effect that produces a feeling of lost control is especially vivid.
Suppose we are writing a system that analyzes an image and identifies the pictures, text, tables and similar objects in it. It seems reasonable to use heuristics that take into account some measure of confidence that a given object is a piece of text. So we need a classifier. It is great that the task of building a classifier is well studied. The question is what to take as the criterion of the quality of its work. Common sense says that we should collect a base of images actually encountered in practice and demand a large proportion of correctly classified objects and, accordingly, a small proportion of false positives. But when the classifier is used inside a large living system, things are not so simple. It may well turn out that improving the quality of the classifier degrades the quality of the system as a whole.
How can this happen? Suppose the classification of a certain kind of font that occurs frequently in images has seriously improved, at the cost of a slight degradation on rarely encountered fonts; it would seem excellent. The problem is that those rare fonts are used most often in headings. There is far less heading text than body text on a page, so a single classifier error in a heading is harder for the rest of the system to detect and correct. Besides, headings can be abnormally large and therefore have an increased chance of being mistaken for a picture. As a result, the weak negative effects of the change to the classifier outweigh the strong positive ones. Does this mean the classifier's quality criterion has to be made more complicated? In the general case, it turns out that the simplest criterion for the quality of a component of a system is the quality of the system as a whole.
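In code, such a criterion might look like the sketch below; every name in it (ScoreThroughWholeSystem, the classifier variants, the dummy score) is hypothetical, and the only point is that a component is judged by running the entire pipeline on a corpus:

    #include <iostream>
    #include <vector>

    // Hypothetical scaffolding: a classifier variant is judged by the
    // end-to-end score of the whole system, not by its own accuracy.
    struct Page {};
    struct Classifier { double aggressiveness; };

    // Stand-in for the full pipeline: run recognition on a page with the
    // given classifier and score the result against the ground truth.
    double ScoreThroughWholeSystem(const Classifier& c, const Page& page) {
        (void)page;  // the real pipeline would consume the page here
        return 1.0 - 0.1 * c.aggressiveness;  // dummy score for the sketch
    }

    double SystemQuality(const Classifier& c, const std::vector<Page>& corpus) {
        double total = 0;
        for (const Page& p : corpus) total += ScoreThroughWholeSystem(c, p);
        return corpus.empty() ? 0 : total / corpus.size();
    }

    int main() {
        std::vector<Page> corpus(100);
        Classifier oldVariant{0.3}, newVariant{0.5};
        // The new variant may win on standalone accuracy and still lose here.
        std::cout << SystemQuality(oldVariant, corpus) << " vs "
                  << SystemQuality(newVariant, corpus) << "\n";
    }

The price, of course, is that every comparison of two variants now costs a full run of the system over the corpus.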
The situation is very like the one produced by the evolution of the genome. It makes no sense to decide whether a given gene is harmful on the basis of abstract reasoning; what matters is whether the gene increases the individual's chances of survival. In a program, as in a genome, complex and non-trivial interactions between heuristics/genes are bound to arise.
It is known that the first blood group increases susceptibility to cholera but reduces sensitivity to malaria; thus, the usefulness of the blood-group gene set depends on where a person lives. In a heuristic software system, dependencies no less bizarre and unexpected are ubiquitous.
So the system becomes frighteningly complex. It is tempting to accept it as it is and develop it solely according to the principle of overall utility, the way evolution develops the genome. If you have a couple of billion years to spare and resources the size of a planet, this approach is quite workable, and you may skip the second part of the article. Mere mortals may wonder whether there is some way to contain the growth of complexity. The short answer: there is no universal recipe, but one must try. More detailed reasoning follows in the second part.
Dmitry Lyubarsky (now on Habr: MityaLsky)
Technology Development Department

UPD from 0:42 05/31/2012: the second part has appeared.