
Limitations of image recognition algorithms



No, this will not be about image recognition algorithms themselves - it will be about the limitations of their use, in particular when building AI.

In my opinion, visual recognition by humans and by computer systems are so different that they have little in common. When a person says “I see,” he is in fact thinking more than seeing, which cannot be said of a computer system equipped with image recognition software.
I know the thought is not new, but I propose to verify it once more using the example of a robot claiming to possess intelligence. The test question: how must a robot see the world around it in order to become fully human-like?

Of course, the robot must recognize objects. Oh yes, the algorithms can handle that - by training on initial samples, as I understand it. But that is far too little!

I.
First, every object in the surrounding world consists of a set of other objects and is in turn part of larger objects. I call this property nesting. What if some object simply has no name, and is therefore absent from the base of training samples on which the algorithm is trained - what should the robot recognize in that case?

The cloud I am currently watching through the window has no named parts, although it obviously consists of edges and a middle. There are no special terms for the edges or the middle of a cloud; they were simply never coined. To refer to an unnamed object I use a verbal formulation (“cloud” is the type of object, “edge of the cloud” is the verbal formulation), and producing such formulations is beyond what an image recognition algorithm can do.

It turns out that, without a logical block on top of it, the algorithm is not good for much. If the algorithm detects a part of a whole object, it will not always be able to figure out what it is seeing, and accordingly the robot will not be able to say what it is.
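As a rough illustration of what such a logical block might add, here is a minimal Python sketch. The label set, the stand-in classifier, and the composition rule are all invented for illustration; no real recognition library is assumed.

```python
# Illustrative sketch only: a fixed-vocabulary recognizer can answer with
# nothing but its trained labels; naming an unnamed part requires a
# separate compositional (logical) step on top of it.

TRAINED_LABELS = {"cloud", "tree", "house"}  # hypothetical training vocabulary

def classify(candidate):
    """Stand-in for a trained recognizer: a known label or nothing."""
    return candidate if candidate in TRAINED_LABELS else None

def describe(part, parent):
    """Fall back to a verbal formulation such as 'edge of the cloud'."""
    recognized = classify(part)
    return recognized if recognized is not None else f"{part} of the {parent}"

print(describe("cloud", "sky"))   # -> cloud               (a named object)
print(describe("edge", "cloud"))  # -> edge of the cloud   (an unnamed part)
```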

II.
Second, the list of objects that make up the surrounding world is not closed: it is constantly being added to.

A person has the ability to construct new objects of reality and to assign names to newly discovered objects, for example new species of fauna. A horse with a human head and torso he would call a centaur, but first he would work out that the head and torso are human and everything else is a horse, thereby recognizing the observed object as something new. That is how the human brain works. An algorithm, lacking the relevant training data, will classify such a creature either as a person or as a horse: since it does not operate with the characteristics of types, it cannot establish their combination.
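The same idea can be sketched in Python: if the parts of a creature are classified separately, an unfamiliar combination can at least be flagged as a new type instead of being forced into “human” or “horse”. The part labels and the table of known combinations are purely illustrative assumptions.

```python
# Illustrative sketch: combining part-level classifications into a verdict
# about the whole. Labels and the combination table are made up.

KNOWN_COMBINATIONS = {
    frozenset({"human head", "human torso", "human legs"}): "human",
    frozenset({"horse head", "horse body", "horse legs"}): "horse",
}

def identify(parts):
    """Return a known type, or describe a new combination of known parts."""
    known = KNOWN_COMBINATIONS.get(frozenset(parts))
    if known is not None:
        return known
    return "new type: " + " + ".join(sorted(parts))

print(identify({"human head", "human torso", "horse body", "horse legs"}))
# -> new type: horse body + horse legs + human head + human torso
```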

For a robot to become human-like, it must be able to identify types of objects that are new to it and to assign names to those types. The description of a new type should be built from characteristics of known types. If the robot cannot do this, what the hell do we need it for, handsome as it is?

Suppose we send a scout robot to Mars. The robot sees something unusual but can identify objects only in terms known on Earth. What will that give the people listening to the robot's verbal reports? Sometimes something, of course (if Earth-like objects are found on Mars), and in other cases nothing (if Martian objects turn out to be unlike Earth ones).

An image is another matter: a person would be able to see everything, evaluate it correctly, and name it - not by means of a pre-trained image recognition algorithm, but by means of the far more cunningly arranged human brain.

III.
Third, there is a certain problem with the individualization of objects.

The surrounding world consists of specific items; in fact, you can only ever see specific items. In some cases they need to be individualized verbally, for which either personal names are used (“George Petrov”) or a simple reference to a particular object, spoken or implied (“the table”). What I call types of objects (“people”, “tables”) are just collective names for objects with certain common characteristics.

Image recognition algorithms, if trained on appropriate samples, can recognize both individualized and non-individualized objects, and that is good. Face recognition in crowded places and all that. The bad thing is that such algorithms will not understand which items deserve to be treated as possessing individuality and which are absolutely not worth it.

The robot, as the possessor of AI, should occasionally burst out with messages like:
- Oh, and I already saw this old woman a week ago!
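Such a message presupposes some form of re-identification: the robot keeps a memory of individuals it has seen and compares each new observation against it. A minimal sketch, assuming a hypothetical numeric descriptor of the observed person and an arbitrary similarity threshold:

```python
# Illustrative sketch: re-identifying an individual by comparing a
# hypothetical embedding against remembered ones. The embeddings and the
# threshold are placeholders, not the output of any real face model.

import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

memory = []  # list of (when_seen, embedding)

def observe(embedding, when, threshold=0.9):
    for seen_at, stored in memory:
        if cosine_similarity(embedding, stored) > threshold:
            return f"Oh, I already saw this person {seen_at}!"
    memory.append((when, embedding))
    return "A new individual; remembering them."

print(observe([0.90, 0.10, 0.20], "a week ago"))  # -> A new individual...
print(observe([0.91, 0.12, 0.19], "today"))       # -> Oh, I already saw...
```

The matching itself is the easy part; the hard part, as argued above, is deciding which objects deserve an entry in that memory at all.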

But it should not overuse such remarks when it comes to blades of grass, especially since there are reasonable doubts about whether the computing power would even suffice for such a task.

I do not know where the fine line runs between an old woman and the countless blades of grass in a field, which in themselves are no less individual than the old woman, but which, from the standpoint of individualization, are of no interest to a person. What is a recognized image in this sense? Almost nothing - only the beginning of the difficult, even painful, perception of surrounding reality.

IV.
Fourth, the dynamics of objects, determined by their mutual spatial arrangement. This, I tell you, is something!

I am sitting in a deep armchair in front of the fireplace, and now I am trying to get up.
- What do you see, robot?

From our everyday point of view, the robot sees me getting up from the chair. What should it answer? Presumably the relevant answer would be:
“I see you getting up from your chair.”

To do this, the robot must know who I am, what a chair is, and what it means to get up...

An image recognition algorithm, after appropriate training, will be able to recognize me and the chair; then, by comparing frames, it will be able to establish that I have moved away from the chair. But what does it mean to “get up”? How does “getting up” actually occur in physical reality?

If I have already risen and walked away, everything is quite simple. After I moved away from the chair, none of the objects in the study changed their spatial position relative to each other, except for me, who was originally in the chair and some time later ended up away from it. It is permissible to conclude that I got up from the chair.
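A minimal sketch of that simple-case rule, assuming hypothetical per-frame object positions rather than the output of any real tracker:

```python
# Illustrative sketch: "left the chair" inferred from two frames in which
# everything except the person keeps its position. Coordinates are invented.

def distance(a, b):
    return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5

def left_the_chair(before, after, person="person", chair="chair", near=0.5):
    was_near = distance(before[person], before[chair]) < near
    is_near = distance(after[person], after[chair]) < near
    others_static = all(
        distance(before[obj], after[obj]) < 0.05
        for obj in before if obj != person
    )
    return was_near and not is_near and others_static

before = {"person": (1.0, 1.0), "chair": (1.2, 1.0), "fireplace": (3.0, 1.0)}
after  = {"person": (2.5, 1.0), "chair": (1.2, 1.0), "fireplace": (3.0, 1.0)}
print(left_the_chair(before, after))  # -> True
```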

If I am still in the process of getting up from the chair, things are somewhat more complicated. I am still close to the chair, but the mutual spatial position of the parts of my body has changed:


A person watching my behavior will instantly conclude that I am getting up from the chair. For a person this is not so much a logical conclusion as a visual perception: he will literally see me rising from the chair, although in reality he only sees a change in the relative position of the parts of my body. In fact it is a logical conclusion, one that someone must either explain to the robot or that the robot must work out on its own.

Both are equally difficult:


And what about image recognition algorithms? On their own, they will never be able to determine that I am getting up from the chair.

V.
Fifth: “getting up” is an abstract concept, defined by a change in the characteristics of material objects, in this case by a change in their mutual spatial position. The same is true of abstract concepts in general: they do not exist in the material world by themselves but depend entirely on material objects, even though we often perceive them as directly observable.

Shift your jaw to the right or left without opening your mouth - what is this action called? Nothing at all, no doubt because such a movement is simply uncharacteristic of people. A robot using the algorithms under discussion will see it, but what of it? The needed name will be absent from the base of training samples, and the robot will struggle to name the action it has registered. And image recognition algorithms are not trained to produce detailed verbal formulations for unnamed actions, or for other abstract concepts.

In essence this duplicates the first point, only with respect to abstract concepts rather than objects. The remaining points, both earlier and later, can also be tied to abstract concepts - just note the increased level of complexity when working with abstractions.

VI.
Sixth, cause and effect relationships.

Imagine that you see a pickup fly off the road and tear down a fence. The cause of the fence being demolished is the movement of the pickup, and in turn the movement of the pickup has, as its effect, the demolition of the fence.

- I saw it with my own eyes!
That is the answer to the question of whether you saw what happened or inferred it. But what did you actually see?

Several objects in motion:


Based on visual perception, the robot must realize that fences do not normally change their shape and location: here it happened as a result of contact with the pickup. The cause-object and the effect-object must be in contact with each other; otherwise there is no causality in their relationship.

Here, though, we fall into a logical trap, because other objects besides the cause-object can also be in contact with the effect-object.

For example, at the moment the pickup hit the fence, a jackdaw was perched on it. The pickup and the jackdaw were in contact with the fence at the same time: how do we determine which contact caused the fence to be demolished?

Probably using repeatability:


Thus the conclusion that the fence was demolished by the pickup is not really an observation but the result of analysis based on observing objects in contact with each other.
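A minimal sketch of that “contact plus repeatability” analysis; the event records and object names are invented for illustration:

```python
# Illustrative sketch: among objects in contact with the fence when it was
# demolished, prefer the one whose contact is most reliably followed by
# the same effect across repeated observations.

from collections import Counter

# Each record: (objects in contact with the fence, fence demolished?)
observed_events = [
    ({"pickup", "jackdaw"}, True),
    ({"jackdaw"}, False),   # a jackdaw alone never demolishes a fence
    ({"pickup"}, True),     # a pickup alone does
]

def likely_cause(events):
    hits, totals = Counter(), Counter()
    for contacts, demolished in events:
        for obj in contacts:
            totals[obj] += 1
            if demolished:
                hits[obj] += 1
    return max(totals, key=lambda obj: hits[obj] / totals[obj])

print(likely_cause(observed_events))  # -> pickup
```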

On the other hand, influence can also act at a distance, for example the effect of a magnet on an iron object. How is a robot to guess that bringing a magnet near a nail causes the nail to rush toward the magnet? The visual picture suggests nothing of the kind:


As you can see, tracking causal relationships is very difficult, even when a witness declares with iron conviction that he saw everything with his own eyes. On their own, image recognition algorithms are powerless here.

VII.
Seventh and last: the choice of goals for visual perception.

The surrounding visual picture may consist of hundreds or thousands of objects nested within each other, many of which constantly change their spatial position and other characteristics. Obviously the robot does not need to perceive every blade of grass in a field, or every person on a city street: it needs to perceive only what is important, depending on the task at hand.

Clearly, you cannot simply tune an image recognition algorithm to perceive some objects and ignore others, since it may not be known in advance what deserves attention and what does not, especially as current goals may change along the way. A situation may arise in which you first need to perceive many thousands of nested objects - literally every one of them - analyze them, and only then decide which are essential for the current task and which are of no interest. This is how a person perceives the world: he sees only what is important, paying no attention to uninteresting background events. How he manages it is a mystery.
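To make the problem concrete, here is a minimal sketch of goal-dependent filtering with an invented relevance table; the catch, as just noted, is precisely that such a table cannot be fixed in advance:

```python
# Illustrative sketch: all objects are detected, but only those relevant
# to the current task are reported. Labels and the relevance table are
# made up, and in reality the table is exactly what cannot be pre-defined.

RELEVANT_TO = {
    "report the weather": {"cloud", "sun", "rain"},
    "warn of an attack": {"martian", "spaceship", "weapon"},
}

def report(detections, current_task):
    relevant = RELEVANT_TO.get(current_task, set())
    return [obj for obj in detections if obj in relevant]

scene = ["cloud", "blade of grass", "martian", "spaceship", "rock"]
print(report(scene, "report the weather"))  # -> ['cloud']
print(report(scene, "warn of an attack"))   # -> ['martian', 'spaceship']
```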

And the robot, even one equipped with the most modern and sophisticated image recognition algorithms?.. If, during an attack by Martian aliens, it begins its report with the weather and continues with a description of the landscape spreading out before it, it may never get around to reporting the attack itself.

Findings

  1. Simple recognition of visual images will not replace human eyes.
  2. Image recognition algorithms are an auxiliary tool with a very narrow scope.
  3. For a robot to begin not just to think but even merely to see in a human way, it needs algorithms not only for pattern recognition but also for that same full-fledged and so far unattainable human thinking.

Source: https://habr.com/ru/post/450422/

