
Machine Vision vs Human Intuition: Algorithms for Disruption of Object Recognition



Machine logic is flawless: a machine does not make mistakes as long as its algorithm works properly and the input meets the specified requirements. Ask a machine to plot a route from point A to point B, and it will build the optimal one, taking into account distance, fuel consumption, availability of gas stations, and so on. This is pure calculation. The machine will never say: "Let's take this road instead; I have a better feeling about it." Machines may beat us in speed of calculation, but intuition remains one of our trump cards. Mankind has spent decades trying to create a machine that resembles the human brain, but how much do the two really have in common? Today we will look at a study in which scientists, doubting the unsurpassed machine "vision" of convolutional neural networks, ran an experiment on duping an object recognition system with an algorithm whose task was to create "fake" (adversarial) images. How successful was the algorithm's sabotage, did people recognize the images better than the machines, and what does this research mean for the future of the technology? We will find the answers in the scientists' report. Go.

The basis of the study


Object recognition technologies based on convolutional neural networks (CNNs) allow a machine, roughly speaking, to distinguish a swan from the digit 9, or a cat from a bicycle. The technology is developing rapidly and is already used in many fields, the most obvious being self-driving vehicles. Many argue that a CNN-based object recognition system can be considered a model of human vision. However, this claim is too bold, and here is why: it turns out to be much easier to fool a machine than a human (at least in matters of object recognition). CNN systems are very vulnerable to adversarial (hostile, if you will) algorithms that do everything they can to prevent the system from doing its job, generating images that the CNN will misclassify.
Researchers divide these images into two categories: "fooling" images (completely meaningless ones) and "perturbed" images (real ones with small, targeted changes). The first are nonsense images that the system nevertheless recognizes as something familiar. For example, a set of lines can be classified as a "baseball", and multi-colored digital noise as a "battleship". The second category consists of images that would normally be classified correctly, but which the adversarial algorithm slightly distorts, so to speak, in the eyes of the CNN. For example, a handwritten digit 6 will be classified as a 5 after the addition of just a few pixels.
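To make the setup concrete, here is a minimal sketch of how such a system commits to a label: it loads a pretrained AlexNet from torchvision (the same architecture the study attacked) and reports the most confident class for an arbitrary input. The file name is a hypothetical placeholder, not an asset from the study.

```python
import torch
from torchvision import models, transforms
from PIL import Image

# Pretrained AlexNet, the architecture targeted in the study.
model = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)
model.eval()

# Standard ImageNet preprocessing.
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

img = preprocess(Image.open("fooling_image.png").convert("RGB")).unsqueeze(0)
with torch.no_grad():
    probs = torch.softmax(model(img), dim=1)
conf, idx = probs.max(dim=1)
# Even for meaningless noise, the network commits to some class,
# often with high confidence -- there is no "I don't know" option.
print(f"class index {idx.item()} with confidence {conf.item():.1%}")
```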

Just imagine the harm such algorithms could do. Swap the classification of road signs for autonomous vehicles, and accidents become inevitable.

Below are examples of "fake" images that fool a CNN trained to recognize objects, along with the labels such a system assigned to them.


Image number 1



But the most curious thing is that a human may not succumb to the adversarial algorithm's trickery and, relying on intuition, may classify such images correctly. According to the scientists, no one had previously made a practical comparison of machine and human performance against adversarial fake images. That is exactly what the researchers decided to do.

To this end, several images produced by adversarial algorithms were prepared. The subjects were told that a machine had classified these (fake) images as familiar objects, i.e. had recognized them incorrectly. The subjects' task was to determine exactly how the machine had classified the images: what they thought the machine "saw" in them, whether such a classification made sense, and so on.

A total of 8 experiments were conducted, using 5 types of adversarial images created without any regard for human vision. In other words, they were made by machines, for machines. The results of these experiments were very entertaining, but let's not spoil anything and consider everything in order.

Experimental results


Experiment number 1: fooling images with incorrect labels


In the first experiment, 48 fooling images were used, created by an algorithm to counteract a CNN-based recognition system called AlexNet. The system classified these images as, for example, a "gear wheel" or a "bagel" (2a).


Image number 2

In each trial, the subject (there were 200 in total) saw one fooling image and two labels: the classification label the CNN system assigned to it, and a label drawn at random from the other 47 images. The subjects had to choose the label they thought the machine had produced.

As a result, most subjects picked the machine's label rather than the random one. Classification accuracy, i.e. the degree of agreement between subject and machine, was 74%. Statistically, 98% of subjects chose machine labels at a rate above chance (2d, "% of subjects who agreed with the machine"). 94% of the images showed very high human-machine agreement; that is, of the 48 images, only 3 were classified by people differently from the machine.
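To make "above chance" concrete: with two labels per trial, a random guesser would agree with the machine 50% of the time, and a one-sided binomial test shows whether a subject's agreement rate is reliably higher. Here is a minimal sketch with illustrative trial counts, not the paper's actual data:

```python
from scipy.stats import binomtest

n_trials = 48   # one trial per fooling image (an assumption for illustration)
n_agree = 36    # trials where this subject picked the machine's label
result = binomtest(n_agree, n_trials, p=0.5, alternative="greater")
# A small p-value means the subject agreed with the machine more often
# than coin-flipping could plausibly explain.
print(f"agreement {n_agree / n_trials:.0%}, p = {result.pvalue:.4f}")
```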

Thus, the subjects showed that a person is able to tell the machine's real label from a random one for a fooling image, that is, to anticipate how a CNN-based program will behave.

Experiment number 2: the first choice against the second


The researchers wondered why the subjects recognized the images so well and separated the machine's labels from the erroneous ones. Perhaps the subjects marked the orange-yellow ring as a "bagel" because a real bagel has precisely this shape and roughly this color. Associations and intuitive choices based on experience and knowledge could be helping people here.

To check this, the random label was replaced with the machine's second-choice classification. For example, AlexNet classified the image of the orange-yellow ring as a "bagel", while its second choice was "pretzel".

The subjects' task was to choose, for all 48 images, between the machine's first-choice label and the one that took second place (2c).

The graph in the center of image 2d shows the results of this test: 91% of the subjects preferred the machine's first-choice label, and the human-machine agreement level was 71%.

Experiment number 3: many-way classification


The experiments above are quite simple in that the subjects choose between just two answers (the machine's label and a random label). In reality, during image recognition the machine weighs hundreds or even thousands of candidate labels before settling on the most suitable one.

In this test, the labels for all 48 images were placed in front of the subjects at once. They had to pick the most suitable label for each image from this whole set.

As a result, 88% of the subjects chose the same labels as the machine, and the degree of agreement was 79%. Curiously, even when subjects failed to pick the exact label the machine chose, in 63% of such cases they chose one of the machine's top 5 labels. That is, the machine ranks all the labels in a list from the most suitable to the least suitable (an exaggerated example: "bagel", "pretzel", "rubber ring", "tire", and so on, all the way down to "hawk in the night sky").
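Extracting that ranking is straightforward, as a sketch shows below; `model` and `img` are assumed to be set up as in the earlier AlexNet snippet.

```python
import torch

with torch.no_grad():
    probs = torch.softmax(model(img), dim=1).squeeze(0)
# The network scores every class; topk exposes the ranked shortlist,
# mirroring the "top five labels" the subjects tended to land in.
top_probs, top_idxs = probs.topk(5)
for rank, (p, i) in enumerate(zip(top_probs, top_idxs), start=1):
    print(f"{rank}. class {i.item():4d}  p = {p.item():.3f}")
```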

Experiment number 3b: "what is it?"


In this test, the scientists slightly changed the rules. Instead of asking subjects to "guess" which label the machine would choose for a particular image, they simply asked the subjects what they themselves saw in front of them.

Recognition systems based on convolutional neural networks select the most fitting label for an image; this is a fairly clear and logical process. In this test, by contrast, the subjects relied on intuitive thinking.

As a result, 90% of the subjects chose a label that the machine also chose. The level of human-machine agreement across the images was 81%.

Experiment No. 4: TV static noise


The scientists note that in the previous experiments the images, although unusual, had distinct features that could push subjects towards the correct (or wrong) label. For example, the "baseball" image is not a ball, but it has the lines and colors of a real baseball. That is a bright, distinctive feature. But if an image has no such features and is essentially static noise, can a person recognize anything in it at all? That is what the scientists decided to check.


Image number 3a

In this test, subjects were shown 8 images of static that the CNN system recognized as specific objects (for example, a robin). Alongside them were a label and normal images matching it (8 static images, 1 "robin" label, and 5 photos of that bird). The subject had to choose the 1 of the 8 static images that best fit the given label.

You can check yourself: above you see an example of such a test. Which of the three images best fits the "robin" label, and why?

81% of subjects chose the same static image as the machine. Moreover, for 75% of the images, subjects picked the image that the machine itself ranked as the best fit for the label (from the ranked list of options we discussed earlier).

For this particular test, you may have questions, as I did. The thing is, in the static images suggested above I personally see three distinct features that distinguish them from one another, and in only one image does that feature strongly resemble that very robin (I think you know exactly which of the three). So my personal and very subjective opinion is that this test is not especially revealing. Although, perhaps, among the other variants the static images really were indistinguishable and unrecognizable.

Experiment number 5: "doubtful" digits


The tests above were based on images that cannot be immediately and unambiguously classified as a particular object; there is always a drop of doubt. Fooling images are quite straightforward in their business: spoil the image beyond recognition. But there is a second type of adversarial algorithm, one that adds (or removes) only a small detail in the image, and that alone can completely derail the CNN's recognition procedure. Add a few pixels, and the digit 6 magically turns into a 5 (1c).
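For illustration, here is a minimal sketch of this kind of perturbation attack in the spirit of the classic fast gradient sign method (FGSM); it is a generic textbook technique, not the study's specific algorithm. `model`, `x`, and `y` are assumed to be a small LeNet-style digit classifier and a labeled batch of images.

```python
import torch
import torch.nn.functional as F

def fgsm_perturb(model, x, y, epsilon=0.1):
    # x: batch of images in [0, 1]; y: their true labels.
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    # Nudge each pixel slightly in the direction that increases the loss:
    # a tiny, often imperceptible change that can flip the prediction,
    # e.g. turning a "6" into a "5" in the classifier's eyes.
    x_adv = x + epsilon * x.grad.sign()
    return x_adv.clamp(0, 1).detach()
```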

Scientists consider such algorithms among the most dangerous. Slightly alter the image of a road sign, and a self-driving vehicle will misread the speed limit (for example, 75 instead of 45), which can lead to sad consequences.


Image number 3b

In this test, the scientists asked subjects to pick not the correct answer, but precisely the machine's incorrect one. The test used 100 digit images modified by the adversarial algorithm (the CNN system LeNet changed its classification of them, i.e. the algorithm had worked successfully). The subjects had to say which digit they thought the machine saw. As expected, 89% of the subjects successfully passed this test.

Experiment number 6: photos and localized "distortion"


The scientists note that it is not only object recognition systems that are being developed, but also the adversarial algorithms that hinder them. Previously, to get an image misclassified, one had to distort (change, delete, damage, etc.) about 14% of all the pixels in the target image. Now this figure is much smaller: it is enough to add a small image inside the target, and the classification breaks.


Image number 4

This test used a fairly new adversarial algorithm called LaVAN, which places a small image localized at one spot on the target photo. As a result, the object recognition system can recognize a subway train as a milk can (4a). The most significant features of this algorithm are the tiny fraction of damaged pixels (only 2% of the target image) and the fact that it does not need to distort the whole image or its main (most significant) part.

The test used 22 images damaged by LaVAN (the CNN recognition system Inception V3 was successfully fooled by this algorithm). The subjects had to classify the malicious insert in the photo. 87% of the subjects were able to do this successfully.
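To make the scale of the change concrete, here is a rough sketch of the mechanics of a localized patch covering about 2% of the pixels; the real LaVAN attack additionally optimizes the patch contents against the classifier, which is omitted here.

```python
import torch

def apply_patch(image, patch, top=0, left=0):
    # image: (3, H, W) tensor; patch: (3, h, w) tensor with h*w << H*W.
    patched = image.clone()
    h, w = patch.shape[1:]
    patched[:, top:top + h, left:left + w] = patch
    return patched

image = torch.rand(3, 299, 299)   # Inception V3 input size
patch = torch.rand(3, 42, 42)     # 42*42 / (299*299) ≈ 2% of the pixels
adv = apply_patch(image, patch, top=0, left=0)
```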

Experiment number 7: three-dimensional objects


The images we saw before are two-dimensional, like any photo, picture or newspaper clipping. Most adversarial algorithms successfully manipulate such images. However, these pests work only under certain conditions, that is, they have a number of restrictions.



But, as we know, progress has not bypassed adversarial algorithms either. Among them has appeared one capable of distorting not only two-dimensional images but three-dimensional objects as well, causing misclassification by the object recognition system. Using 3D graphics software, such an algorithm misleads CNN-based classifiers (in this case, Inception V3) from different distances and viewing angles. The most amazing thing is that such fooling 3D models can even be printed on a 3D printer, i.e. turned into real physical objects, and the object recognition system will still classify them incorrectly (for example, an orange as an electric drill). And all of this due to minor changes in the texture of the target object (4b).
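A conceptual sketch of why such attacks survive changes in distance and angle: the texture perturbation is optimized in expectation over random viewpoint transformations (the "expectation over transformation" idea). The model, target, and crude view simulation below are illustrative assumptions, not the study's actual pipeline.

```python
import torch
import torch.nn.functional as F

def random_view(x):
    # Crude stand-in for a viewpoint change: random crop + rescale.
    _, _, h, w = x.shape
    top = int(torch.randint(0, h // 4, (1,)))
    left = int(torch.randint(0, w // 4, (1,)))
    crop = x[:, :, top:top + 3 * h // 4, left:left + 3 * w // 4]
    return F.interpolate(crop, size=(h, w), mode="bilinear",
                         align_corners=False)

def eot_gradient(model, x_adv, target, n_views=8):
    # Average the attack loss over several simulated views, so the
    # resulting gradient pushes the texture towards fooling the model
    # from *any* angle, not just one.
    x_adv = x_adv.clone().detach().requires_grad_(True)
    loss = 0.0
    for _ in range(n_views):
        loss = loss + F.cross_entropy(model(random_view(x_adv)), target)
    (loss / n_views).backward()
    return x_adv.grad  # descend along this to push all views toward `target`
```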

For an object recognition system, such an adversarial algorithm is a serious adversary. But a human is not a machine; we see and think differently. In this test, subjects were shown images of three-dimensional objects with the texture changes described above, from three angles. The subjects were also given correct and erroneous labels. They had to determine which labels were correct and which were not, and why, i.e. whether they could see the texture changes in the images.

As a result, 83% of subjects successfully coped with the task.

For a more detailed acquaintance with the nuances of the study, I strongly recommend looking at the scientists' report.

And at this link you will find the images, data and code files that were used in the study.

Epilogue


The work gave the scientists grounds for a simple and fairly obvious conclusion: human intuition can be a source of very important data and a tool for making the right decision and/or perceiving information. A person is able to intuitively understand how an object recognition system will behave, which labels it will choose and why.

There are several reasons why it is easier for a person to see a real image and recognize it correctly. The most obvious is the way information is obtained: the machine receives the image in digital form, while the person sees it with their own eyes. For a machine, a picture is a data set; make changes to it, and you can distort its classification. For us, the image of a subway train will always be a subway train, and not a can of milk, because we see it.

The scientists also emphasize that such tests are hard to evaluate, because a person is not a machine, and a machine is not a person. Take, for example, the test with the "bagel" and the "gear wheel". The recognition system classifies these images as a "bagel" and a "gear wheel", i.e. declares that they are these objects. A person sees that the images merely resemble a "bagel" and a "gear wheel", but are not them. This is the fundamental difference in the perception of visual information between a person and a program.

Thank you for your attention, stay curious, and have a good working week, folks.


Source: https://habr.com/ru/post/445372/
