
Adaptive Noise Reduction

While working on our dialogue system ( http://habrahabr.ru/post/235763/ ), we ran into a problem that at first seemed insurmountable: in real, production conditions, the performance of the ASR system turned out to be significantly lower than expected. One factor that consistently hurt performance was background noise, which comes in many forms. In our experiments, city-street noise and the noise of crowds proved especially unpleasant for ASR and hard to neutralize.

It became clear that the problem would have to be solved, or the voice system would simply have no real value.

The original plan was simple: find an analytical solution for these two specific types of noise and neutralize them. But while experimenting with several algorithms, it turned out that, first, the voice was distorted quite strongly, and second, our training sample contained far more types of noise than just those two. On top of that, we didn't want (relatively) clean speech to be modified in any way, since that hurt the word error rate in recognition.

We needed an adaptive scheme.
Deep neural networks are an excellent solution in cases where an analytical solution for a function is hard or impossible to derive. And that is exactly the function we wanted: one that transforms a noisy speech signal into a noise-free one. Here is what we ended up with:
  1. The model works ahead of the filter bank. The alternative was noise reduction after the filter bank (for example, in mel space), which would also run faster. But we wanted a model that works on material that a person, or an external ASR system, will then hear. This way we can clean any voice stream of noise.
  2. The model is adaptive. It removes any kind of background noise, including music, screams, a passing truck, or a circular saw, which compares favorably with other commercially available systems. But if the signal is not noisy, it passes through almost unchanged. In other words, this is a speech enhancement model: it recovers a clean voice from a bad signal. One interesting further experiment will be testing the model's ability to restore a studio-quality voice from 8 kHz to 44 kHz. We do not expect much, but perhaps it will surprise us.
  3. This particular model only works with 8 kHz audio. This is because at the time training started we had a lot of 8 kHz material. We now have enough 44 kHz material and can build whatever model is needed; in the meantime, our area of interest lies in the telecommunications field.
  4. Training time (after all our optimizations) is about two weeks on 40 cores. The first version trained for almost two months on the same hardware. Moreover, the model continues to reduce its error on the test data, so we will keep training it.
  5. Processing speed on a relatively modern processor is now about 20× real time on a single core. A 20-core hyper-threaded server can handle approximately 700 voice streams simultaneously, which is almost 90 Mbps of raw data.
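The throughput figure in point 5 can be sanity-checked with simple arithmetic. This is a back-of-the-envelope sketch, assuming 16-bit PCM telephony audio (the sample format is not stated in the article):

```python
# Rough check of the "almost 90 Mbps" raw-data figure for 700 streams.
# Assumption (not stated above): audio is 16-bit PCM at 8 kHz.

sample_rate_hz = 8_000      # the model operates on 8 kHz audio
bits_per_sample = 16        # assumed 16-bit PCM
streams = 700               # simultaneous streams quoted for the server

raw_mbps = streams * sample_rate_hz * bits_per_sample / 1e6
print(f"{raw_mbps:.1f} Mbit/s")  # prints "89.6 Mbit/s"
```

Which indeed lands just under 90 Mbps, consistent with the number above.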

We went a little further and implemented the network on complex numbers. The idea boils down to restoring not only the amplitude but also the phase, which otherwise has to be taken from the noisy signal. The phase of the noisy signal degrades the quality of the restored sound if you want to listen to it afterwards. So we will soon offer a "high quality" option, at the cost of roughly halving the processing speed. For use in ASR it makes no difference, of course.
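The article does not publish the network itself, so as a purely illustrative sketch of the idea, here is a toy NumPy example of complex masking: a complex-valued mask corrects both the amplitude and the phase of a time-frequency bin, whereas a real (magnitude-only) mask would keep the noisy phase. The mask values here are stubs standing in for a network's output:

```python
import numpy as np

def apply_complex_mask(noisy_stft, mask):
    """Multiply each time-frequency bin by a complex mask,
    adjusting both its amplitude and its phase."""
    return noisy_stft * mask

# Toy example: one STFT bin with a noisy amplitude and phase.
noisy_bin = 2.0 * np.exp(1j * 0.3)   # amplitude 2.0, phase 0.3 rad
mask_bin = 0.5 * np.exp(-1j * 0.3)   # halve the amplitude, undo the phase error

clean_bin = apply_complex_mask(noisy_bin, mask_bin)
print(round(abs(clean_bin), 3), round(float(np.angle(clean_bin)), 3))
# prints "1.0 0.0" — both amplitude and phase are restored
```

A magnitude-only mask could only have produced the `1.0`; the phase would have stayed at `0.3`, which is exactly the artifact the complex-valued variant avoids.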

Before long we will also switch to our new recurrent model, which boasts a radical improvement in the quality of the restored voice thanks to a much wider receptive field. The relatively small window size of the current model leads to small artifacts at the transitions between different types of noise, or when the noise profile partially overlaps the speech profile (a child crying in the background).

I would like to show some interesting pictures.
(Figures: four input/output pairs showing the noisy input signal and the denoised output.)

For those who want to play with the model, welcome to our website ( http://sapiensapi.com/#speechapidemo ). There you can upload your own file or use a prepared one, add noise to it, or remove noise from it. The interface is quite simple.

For API lovers, we offer a free test API through Mashape ( http://www.mashape.com/tridemax/speechapi ).

If you have any questions, write to me at tridemax@sapiensapi.com , or welcome to the comments.

Source: https://habr.com/ru/post/256857/

