
A program has been taught to select realistic sounds for photos



Looking at a photo, a person can easily guess which sound should correspond to this frame.

Knowledge of sounds comes from life experience. We observe various events and hear the sounds that accompany them, and over time the brain accumulates a large collection. When looking at a photo, a person runs a quick associative search through memory, picks the most appropriate sound, and plays it back in the mind.
A new program for selecting sounds for photos, developed by specialists at Disney Research and ETH Zurich (the Swiss Federal Institute of Technology in Zurich), works on roughly the same principle. The authors deliberately tried to copy the human process of establishing the relationship between sound and picture.

Information about sounds can come from more than direct experience. In kindergarten, every child is taught that the cow says “moo”.

To a very large extent, the brain's collection of sounds is replenished by movies and computer games, which often show events that people have no real-life experience of. That is why almost everyone knows what a pistol shot sounds like, although few have heard one in reality. It is plausible that sounds from movies and games account for more than half of all the sounds a person accumulates in memory over a lifetime.

The Disney Research program was also taught to compile a collection of sounds from a video sequence. This is not such an easy task, because the system must filter out a large number of extraneous sounds and determine exactly which object corresponds to which sound.

Interpreting visual content is a key task of machine vision. In recent years, many impressive results have been obtained in this area: object classification and recognition, segmentation, tracking, and 3D reconstruction. But teaching neural networks the relationship between visual content and audio data is still a rather unexplored area.

In this regard, it should be noted that the human brain is capable of amazing things. For example, it can pick a "suitable" sound for something that cannot make one at all, such as a growing flower. The authors of the new program did not set out to copy the brain's capacity for such fantasies, although that is probably possible too.

How to generate sound


One way to choose a sound for an object is to synthesize it from the physical characteristics of the object in the video. But this approach can provide sound for only a very limited number of objects.

In contrast, the system from Disney Research and ETH Zurich collects samples of ready-made sounds from real videos. The video shows examples of clips that were used for training.


Then the system was taught to separate the desired sound from extraneous ones. The main principle of this procedure is to find a sound that recurs in all the videos of a given object. That sound is treated as the sound of the object, and everything else as background noise.
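The principle of keeping only the sound that recurs across videos of the same object can be illustrated with a small sketch. This is not the authors' algorithm: the feature vectors, the `radius` threshold, the `min_fraction` parameter and the function names are all assumptions made up for the example. Each video is represented as a list of audio segment features, and a segment is kept only if a similar segment appears in enough of the other videos.

```python
import math

def cross_video_support(videos, radius):
    """For every audio segment, count in how many OTHER videos a similar segment occurs.

    videos: list of videos, each a list of feature vectors (one per audio segment).
    Returns a list of lists of counts with the same shape as `videos`.
    """
    support = []
    for i, segments in enumerate(videos):
        counts = []
        for seg in segments:
            n = 0
            for j, other in enumerate(videos):
                # A video "supports" the segment if it contains a close match.
                if j != i and any(math.dist(seg, o) <= radius for o in other):
                    n += 1
            counts.append(n)
        support.append(counts)
    return support

def filter_object_sound(videos, radius, min_fraction=0.5):
    """Keep segments that recur in at least `min_fraction` of the other videos."""
    support = cross_video_support(videos, radius)
    needed = min_fraction * (len(videos) - 1)
    return [
        [seg for seg, n in zip(segments, counts) if n >= needed]
        for segments, counts in zip(videos, support)
    ]

# Toy data (hypothetical 2-D features): the object's sound sits near (1.0, 1.0),
# while every video also contains its own unrelated background segment.
videos = [
    [(1.0, 1.0), (5.0, 0.0)],
    [(1.1, 0.9), (0.0, 7.0)],
    [(0.9, 1.05), (-4.0, -4.0)],
]
kept = filter_object_sound(videos, radius=0.5)
print(kept)
```

In this toy run only the segments clustered near (1.0, 1.0) survive, because each background segment has no counterpart in the other videos. A real system would work on audio descriptors of much higher dimension and would need a more robust notion of similarity.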

Once the system has learned to isolate the sound that belongs to a particular object, only a trivial task remains, because computer vision systems now recognize objects in video quite well.

The researchers ran experiments on 9 types of objects, with 10–20 video samples of 15–90 s for each type. A kNN classifier was used to select the desired sounds.
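The article does not say how exactly the kNN classifier was applied, so the following is only a generic sketch of k-nearest-neighbours classification. The feature vectors, the labels "dog" and "door", and the function name `knn_predict` are hypothetical. A query vector receives the label that is most common among its k closest training examples.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Label `query` by majority vote among its k nearest training examples.

    train: list of (feature_vector, label) pairs.
    """
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy training set (hypothetical 2-D audio features with object labels).
train = [
    ((1.0, 1.0), "dog"), ((1.2, 0.8), "dog"), ((0.9, 1.1), "dog"),
    ((5.0, 5.0), "door"), ((5.2, 4.9), "door"), ((4.8, 5.1), "door"),
]

label_a = knn_predict(train, (1.1, 1.0))
label_b = knn_predict(train, (5.0, 4.8))
print(label_a, label_b)
```

With an odd k and two classes there are no tied votes, which is why k=3 is used here. With more classes a tie-breaking rule would be needed.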



A survey showed that people recognize the sounds filtered by the program much better than the unfiltered ones.



What is it for


Besides the most obvious goal, self-learning for robots and other artificial intelligence systems that replicate functions of the human brain, matching sounds to visual objects is useful in many applications of machine vision and multimedia. One example is automating the work of a Foley artist, a specialist who records sound effects for movies and computer games.

It is well known that the sound recorded during filming is not very expressive. To make a film more expressive, sound effects are added to the footage separately afterwards. The result is a much more striking movie. In addition, the Foley artist helps fix cases where the real sound does not fit the picture. For example, when a hero in a movie hits an opponent hard, the actors in reality only simulate the blows. In this case the Foley artist corrects the defect by adding realistic sounds of crunching bones, squelching flesh, leaking brains and other attractive effects.

Another possible application of the program is voicing the surrounding world for people with hearing impairments. They would not just hear the surrounding sounds, but hear them in the best quality: rich, without unnecessary noise, like in the movies. People without hearing impairments may even come to envy them, just as one-legged athletes now envy double amputees, who have a competitive advantage: more advanced bionic prostheses let them run much faster and easily beat one-legged (and even two-legged) athletes.

Such augmented reality technologies will most likely be in demand in the entertainment industry, where a person perceives the surrounding reality through a computer interface. Finally, we will be able to block unwanted people out of our world (as in the TV series “Black Mirror”). The system will simply filter out the sound of their voices and replace it with another, permitted sound. The image of a blocked person will be replaced with another object, with the corresponding sounds generated. Alternatively, you can simply replace the voices of office colleagues and relatives with more pleasant ones. For example, a friend's voice can be given a sexy accent during evening caresses, missing sounds can be added, and so on.

Source: https://habr.com/ru/post/399317/

