
Machine hearing: SoundNet, a neural network trained to recognize objects by sound


Left: an attempt to recognize the scene and objects from sound alone. Right: the actual sound source

Neural networks have recently made considerable progress in recognizing objects and scenes in video. These achievements were made possible by training on massive datasets of labeled objects (see, for example, "Learning Deep Features for Scene Recognition using Places Database", NIPS 2014). Looking at a photograph or video clip, a computer can determine the scene almost exactly, picking one suitable description out of 401 scene categories: "cluttered kitchen", "stylish kitchen", "teenager's bedroom", and so on. In the field of sound understanding, however, neural networks had not yet shown comparable progress. Researchers at MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL) have addressed this gap by developing the machine learning system SoundNet.

In fact, the ability to determine the scene from sound is just as important as determining the location from video. After all, the picture from a camera can be blurred or simply not carry enough information. But as long as the microphone works, a robot can still figure out where it is.

From a scientific standpoint, training the SoundNet neural network is a fairly standard task. The CSAIL researchers exploited the natural synchronization between vision and sound, teaching the network to automatically extract a sound representation of a scene from unlabeled video. About 2 million Flickr videos (26 TB of data) were used for training, along with an annotated sound dataset of 50 categories and roughly 2,000 samples.
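The paper frames this as a teacher-student setup: a pretrained vision network supplies soft labels for the video frames, and the sound network is trained to reproduce them from the audio track alone. Below is a minimal sketch of that idea in PyTorch; the function name `distillation_step`, the tensor shapes and the exact loss wiring are illustrative assumptions, not the authors' code.

```python
# A minimal sketch (not the authors' code) of the teacher-student idea behind
# SoundNet: a pretrained vision network "teaches" a sound network by providing
# soft labels for unlabeled video, and the sound network learns to match them.

import torch
import torch.nn.functional as F

def distillation_step(vision_teacher, sound_student, frames, waveform, optimizer):
    """One training step on a single (video frames, audio track) pair."""
    with torch.no_grad():
        # Teacher: class probabilities (e.g. over scene categories) predicted
        # from the sampled video frames, averaged into one target distribution.
        teacher_probs = F.softmax(vision_teacher(frames), dim=1).mean(dim=0, keepdim=True)

    # Student: the sound network predicts the same categories from raw audio.
    student_log_probs = F.log_softmax(sound_student(waveform), dim=1)

    # KL divergence pushes the sound network's output toward the teacher's.
    loss = F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```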

SoundNet Neural Network Architecture
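SoundNet itself is a stack of one-dimensional convolutions applied directly to the raw waveform, which is progressively downsampled into a compact feature vector. The sketch below shows a network in that spirit; the layer count, channel widths and kernel sizes are illustrative guesses, not the exact configuration from the paper.

```python
# A rough sketch of a SoundNet-style architecture: 1-D convolutions over the
# raw waveform. All hyperparameters below are illustrative assumptions.

import torch
import torch.nn as nn

class SoundNetLike(nn.Module):
    def __init__(self, num_classes=401):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=64, stride=2, padding=32),
            nn.BatchNorm1d(16), nn.ReLU(), nn.MaxPool1d(8),
            nn.Conv1d(16, 32, kernel_size=32, stride=2, padding=16),
            nn.BatchNorm1d(32), nn.ReLU(), nn.MaxPool1d(8),
            nn.Conv1d(32, 64, kernel_size=16, stride=2, padding=8),
            nn.BatchNorm1d(64), nn.ReLU(),
            nn.Conv1d(64, 128, kernel_size=8, stride=2, padding=4),
            nn.BatchNorm1d(128), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),           # collapse the time axis
        )
        self.classifier = nn.Linear(128, num_classes)

    def forward(self, waveform):               # waveform: (batch, 1, samples)
        x = self.features(waveform).squeeze(-1)
        return self.classifier(x)

# Example: one second of 22.05 kHz audio -> scores over 401 scene categories.
scores = SoundNetLike()(torch.randn(1, 1, 22050))
```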

Although the network was trained under visual supervision, the system performs very well on its own, from sound alone, on at least three standard acoustic benchmarks the developers tested it on. Moreover, testing showed that the network learned by itself to recognize sounds characteristic of certain scenes, even though the developers never provided labeled samples of those objects. From the unlabeled video alone, the network worked out which scene corresponds to the sound of a cheering crowd (a stadium) and to birdsong (a lawn or a park). Along with the scene, the network also recognizes the specific object that produces the sound.
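In practice, this kind of evaluation usually means freezing the trained sound network, taking its hidden activations as features, and fitting a small classifier on the labeled benchmark. The following sketch illustrates that workflow using the hypothetical `SoundNetLike` model from the previous snippet and a linear SVM; the helper `extract_features` and the dataset variable names are assumptions for illustration.

```python
# Sketch: reuse the frozen sound network's activations as features for a
# small annotated benchmark, then fit a simple classifier on top of them.

import numpy as np
import torch
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

def extract_features(sound_net, waveforms):
    """Run each clip through the frozen network's convolutional stack."""
    sound_net.eval()
    with torch.no_grad():
        feats = [
            sound_net.features(
                torch.as_tensor(w, dtype=torch.float32).view(1, 1, -1)
            ).squeeze().numpy()
            for w in waveforms
        ]
    return np.stack(feats)

# train_clips / test_clips: lists of raw audio arrays,
# train_labels / test_labels: scene labels from the annotated dataset.
# clf = LinearSVC().fit(extract_features(net, train_clips), train_labels)
# preds = clf.predict(extract_features(net, test_clips))
# print(accuracy_score(test_labels, preds))
```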

The video shows several examples of object recognition by sound. First the sound plays and the recognition result is shown while the picture remains blurred, so you can test yourself: can you determine the location and the objects present from sound alone as accurately as the neural network does? For example, what does the song "Happy Birthday to You," sung in chorus by several people, most likely indicate? The correct answer: the object is burning candles, and the scene is a restaurant, café or bar.


"Machine vision has become so good that we can transfer this technology to other areas," said Carl Vondrick, a student at the Massachusetts Institute of Technology in electrical engineering and computer science, one of the authors of the scientific work. - We used the natural relationship between computer vision and sound. It was possible to achieve a large scale due to a multitude of unpartitioned video materials so that the neural network learned to understand sound. ”

SoundNet was tested on two standard audio datasets, where it recognized objects with 13–15% higher accuracy than the best previous systems. On a dataset with 10 sound categories, SoundNet classifies sounds with 92% accuracy; on a dataset with 50 categories it reaches 74%. For comparison, on the same datasets humans achieve about 96% and 81% accuracy on average.



Even people sometimes cannot tell exactly what they are hearing. Try the experiment yourself: have a colleague start an arbitrary YouTube video and, without looking at the monitor, try to say what is happening, where the sounds come from and what is on the screen. You will not always guess correctly. So the challenge for artificial intelligence is genuinely hard, yet SoundNet copes with it rather well.

In the future, such programs may find practical uses. For example, your mobile phone could automatically recognize that you have entered a public place such as a cinema or theater and lower the ringer volume. Once the movie starts and the audience falls quiet, the phone could mute itself and switch to vibrate.

Orientation by sound will also help in control software for autonomous robots and other machines.

In security systems and smart homes, such a system could react to specific sounds in a specific way, for example the sound of breaking glass. In the smart cities of the future, recognizing street noise will help identify its causes and combat noise pollution.

The paper was published in open access on arXiv.org on October 27, 2016 (arXiv:1610.09001).

Source: https://habr.com/ru/post/399659/

