
Classifying Sounds with TensorFlow



Igor Panteleev, Software Developer, DataArt

Many services have been created for recognizing human speech; it is enough to recall Pocketsphinx or the Google Speech API. They can convert phrases recorded as an audio file into printed text quite accurately. But none of these applications can sort the various sounds picked up by a microphone: was it human speech, an animal cry, or music? We were faced with the need to answer this question, so we decided to create test projects for classifying sounds using machine learning algorithms. This article describes which tools we chose, what problems we ran into, how we trained a model for TensorFlow, and how to run our open source solution. The recognition results can also be uploaded to the DeviceHive IoT platform so that cloud services and third-party applications can make use of them.

Selection of tools and models for classification


First, we had to choose the software for working with neural networks. The first solution that seemed suitable to us was the pyAudioAnalysis library.

The main problem in machine learning is getting a good data set. There are plenty of such sets for speech recognition and music classification. For the classification of arbitrary sounds, things are not so good, but we eventually found a data set of "urban" sounds.

During testing, we encountered the following problems:


The next solution was the Google AudioSet dataset, which is based on tagged YouTube video clips and is available for download in two formats:

  1. CSV files that contain the following information about each fragment: the ID of the YouTube video, the start and end time of the fragment, and one or more labels assigned to it.
  2. Extracted audio features stored as TensorFlow Record files. These audio features are compatible with the YouTube-8M models; a minimal sketch of how such records can be read is shown right after this list.

This solution also suggests using the TensorFlow VGGish model to extract features from the audio stream. It met most of our requirements, so we decided to choose it.
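
As an illustration, here is a minimal sketch of reading the frame-level AudioSet records (assuming TensorFlow 1.x, which YouTube-8M and VGGish target; the field names come from the AudioSet release):

    import tensorflow as tf

    def read_audioset_tfrecord(path):
        # Each record is a tf.SequenceExample: video metadata in the context,
        # plus one quantized 128-byte embedding per 0.96 s frame.
        for record in tf.python_io.tf_record_iterator(path):
            example = tf.train.SequenceExample()
            example.ParseFromString(record)
            video_id = example.context.feature['video_id'].bytes_list.value[0]
            labels = list(example.context.feature['labels'].int64_list.value)
            frames = example.feature_lists.feature_list['audio_embedding'].feature
            embeddings = [f.bytes_list.value[0] for f in frames]
            yield video_id, labels, embeddings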

Training the model


The next task was to figure out how the YouTube-8M interface works. It is designed for video but, fortunately, can work with audio as well. This library is quite flexible, but it has a hard-coded number of classes, so we made a small change to allow the number of classes to be passed as a parameter. YouTube-8M can work with two types of data: aggregated (video-level) features and frame-level features. Google AudioSet provides its data as frame-level features. Next, we had to choose a model for training.

Resources, Time, and Accuracy


Graphics processing units (GPUs) are better suited to machine learning than central processing units (CPUs). You can find more information about this here, so we will not dwell on it and go straight to our configuration. For the experiments, we used a PC with a single NVIDIA GTX 970 4 GB graphics card.

In our case, training time did not matter much. Note that one or two hours of training was enough to make an initial judgment about the chosen model and its accuracy.

Of course, we want to get the highest possible accuracy, but training a more complex model (which should provide greater accuracy) requires more RAM (or video memory, when a GPU is used).

Model selection


A full list of YouTube-8M models with descriptions is available here. Since our training data comes as frame-level features, we have to use a model that accepts them. Google AudioSet is split into three parts: a balanced train set, an unbalanced train set, and an evaluation set. You can read more about this here.

For training and evaluation, we used a modified version of YouTube-8M. You can find it here.

Balanced training


In this case, the command looks like this:

python train.py --num_classes=527 --train_dir=/path_to_logs --model=ModelName

For LstmModel, we changed the base learning rate to 0.001 in accordance with the documentation. We also reduced lstm_cells to 256, because we did not have enough video memory.
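
As a rough illustration, the LSTM run could look something like this (the flag names exist in the YouTube-8M starter code, but the exact set of flags we passed may have differed):

python train.py --num_classes=527 --train_dir=/path_to_logs --model=LstmModel --base_learning_rate=0.001 --lstm_cells=256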

Let's look at the training results.





Model name | Training time | Evaluation at the last step | Average evaluation
Logistic   | 14m 3s        | 0.5859                      | 0.5560
Dbof       | 31m 46s       | 1.000                       | 0.5220
Lstm       | 1h 45m 53s    | 0.9883                      | 0.4581


We managed to get good figures at the training stage, but this does not mean that we will achieve similar results on the full evaluation.

Unbalanced training


The unbalanced data set contains many more samples, so we set the number of training epochs to 10 (in practice we had to stop at five, because training took a lot of time).





Model name | Training time | Evaluation at the last step | Average evaluation
Logistic   | 2h 4m 14s     | 0.8750                      | 0.5125
Dbof       | 4h 39m 29s    | 0.8848                      | 0.5605
Lstm       | 9h 42m 52s    | 0.8691                      | 0.5396


Training log


If you want to examine our log files, you can download and extract them from this link. After downloading, run tensorboard --logdir /path_to_train_logs/ and open the address TensorBoard reports (by default http://127.0.0.1:6006).

More about training


YouTube-8M accepts many parameters, and many of them affect the training process.

For example, you can adjust the learning rate and the number of epochs, which will change the training process significantly. There are also three loss functions and other useful variables that can be adjusted and modified to improve the results; an example invocation is shown below.
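
For instance, a run with a different loss function and training schedule might look like this (the flags exist in the YouTube-8M starter code; the values are purely illustrative, and --num_classes is the parameter added in our fork):

python train.py --num_classes=527 --train_dir=/path_to_logs --model=ModelName --base_learning_rate=0.01 --num_epochs=5 --label_loss=HingeLoss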

Using a trained model with audio capture devices


When we have trained models, it's time to add code to interact with them.

Microphone Audio Capture


We need to get the audio data from a microphone somehow. We will use the PyAudio library, which provides a simple interface and works on most platforms.
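
A minimal capture sketch, assuming a 16 kHz mono stream (our capture.py is more elaborate, with buffering and a configurable capture interval):

    import pyaudio

    RATE = 16000     # VGGish expects 16 kHz mono input
    CHUNK = 1024     # frames per buffer
    SECONDS = 5      # length of one captured sample

    p = pyaudio.PyAudio()
    stream = p.open(format=pyaudio.paInt16, channels=1, rate=RATE,
                    input=True, frames_per_buffer=CHUNK)

    frames = []
    for _ in range(int(RATE / CHUNK * SECONDS)):
        frames.append(stream.read(CHUNK))
    raw_audio = b''.join(frames)   # raw 16-bit PCM, ready for preprocessing

    stream.stop_stream()
    stream.close()
    p.terminate()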

Sound preparation


As mentioned earlier, we use the TensorFlow VGGish model as a tool for extracting features. Here is a brief explanation of the transformation process:

For visualization, we used a "dog bark" sample from the UrbanSound data set.

  1. Convert the audio to 16 kHz mono.
  2. Compute a spectrogram using the short-time Fourier transform (STFT) with a window size of 25 ms, a hop of 10 ms, and a periodic Hann window.
  3. Compute a mel spectrogram by mapping the spectrogram onto 64 mel bins.
  4. Compute a stabilized log mel spectrogram as log(mel spectrum + 0.01), where the offset is used to avoid taking the logarithm of zero.
  5. Frame these features into non-overlapping examples of 0.96 seconds, where each example covers 64 mel bands and 96 frames of 10 ms each.
  6. Feed the resulting data into the VGGish model to convert it into an embedding vector. A code sketch of these steps follows the list.
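
Here is a minimal sketch of this preprocessing, using librosa for illustration (the real VGGish preprocessing in vggish_input.py differs in some details, such as the exact mel filterbank):

    import numpy as np
    import librosa

    def waveform_to_log_mel_examples(wav_path):
        # 1. Load the file and resample to 16 kHz mono.
        y, sr = librosa.load(wav_path, sr=16000, mono=True)

        # 2-3. STFT with a 25 ms window and 10 ms hop (Hann window by default),
        # mapped onto 64 mel bands.
        mel = librosa.feature.melspectrogram(
            y=y, sr=sr, n_fft=int(0.025 * sr), hop_length=int(0.010 * sr),
            n_mels=64)

        # 4. Stabilized log mel spectrogram: the 0.01 offset avoids log(0).
        log_mel = np.log(mel + 0.01)

        # 5. Split into non-overlapping 0.96 s examples: 96 frames x 64 bands.
        frames = 96
        n = log_mel.shape[1] // frames
        examples = log_mel[:, :n * frames].T.reshape(n, frames, 64)
        return examples  # these patches are what VGGish turns into embeddings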

Classification


Finally, we need an interface to transfer data to the neural network and get the results.

We took the YouTube-8M interface as a basis, but changed it to remove the serialization/deserialization step, so the embeddings are fed to the model directly (see the sketch below).
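
A rough sketch of this idea, assuming TensorFlow 1.x (the tensor names below are illustrative assumptions, not necessarily the names used in our fork):

    import numpy as np
    import tensorflow as tf

    def classify(embeddings, checkpoint_dir):
        # embeddings: float array of shape (num_frames, 128) produced by VGGish.
        ckpt = tf.train.latest_checkpoint(checkpoint_dir)
        with tf.Session() as sess:
            saver = tf.train.import_meta_graph(ckpt + '.meta')
            saver.restore(sess, ckpt)
            graph = tf.get_default_graph()
            # Assumed tensor names; inspect the graph of your own checkpoint.
            model_input = graph.get_tensor_by_name('model_input:0')
            num_frames = graph.get_tensor_by_name('num_frames:0')
            predictions = graph.get_tensor_by_name('predictions:0')
            probs = sess.run(predictions, feed_dict={
                model_input: embeddings[np.newaxis, ...],
                num_frames: [embeddings.shape[0]]})
        return probs[0]   # one probability per class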

Here you can see the result of our work. Let's look at it in more detail.

Installation


PyAudio depends on libportaudio2 and portaudio19-dev, so you need to install these system packages for it to work.
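
On Debian or Ubuntu, for example, they can be installed with:

sudo apt-get install libportaudio2 portaudio19-dev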

In addition, you will need some Python libraries. You can install them with pip: pip install -r requirements.txt

You also need to download the archive with the saved models and extract it into the project root. You can find it here.

Launch


Our project offers three interfaces to choose from.

Pre-recorded audio file


Just run python parse_file.py path_to_your_file.wav, and you will see something like: Speech: 0.75, Music: 0.12, Inside, large room or hall: 0.03

The result depends on the input data. These values are taken from the neural network's prediction; a higher value means a higher probability that the input belongs to that class.
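
For illustration, here is roughly how such a line can be produced from the 527-way probability vector, using the class_labels_indices.csv file that ships with Google AudioSet (the helper name is ours):

    import csv
    import numpy as np

    def format_predictions(probs, labels_csv='class_labels_indices.csv', top_k=3):
        # Map class indices to human-readable AudioSet names.
        with open(labels_csv) as f:
            names = [row['display_name'] for row in csv.DictReader(f)]
        top = np.argsort(probs)[::-1][:top_k]
        return ', '.join('{}: {:.2f}'.format(names[i], probs[i]) for i in top)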

Capturing and processing data from a microphone


python capture.py starts a process that continuously captures data from your microphone. Every 5-7 seconds (by default) it passes the captured data on for classification, and you will see the results in the same format as in the previous example. You can run it with the --save_path=/path_to_samples_dir/ parameter; in that case all the captured data is saved in the specified folder in WAV format. This feature is useful if you want to try different models on the same samples. Use the --help parameter to get more information.

Web interface


The python daemon.py command implements a simple web interface, available by default at http://127.0.0.1:8000. We use the same code as in the previous example. You can see the last ten predictions on the events page.



Integration with IoT


The last, very important point is integration with the IoT infrastructure. If you run the web interface mentioned in the previous section, you will find the DeviceHive client connection status and settings on the main page. While the client is connected, predictions are sent to the specified device as notifications.
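
For illustration, a notification could be pushed to DeviceHive with a plain REST call along these lines (the server URL, device ID and token are placeholders for your own deployment, and the endpoint path is our assumption based on the DeviceHive REST API):

    import requests

    SERVER_URL = 'http://localhost/api/rest'     # placeholder
    DEVICE_ID = 'audio-analysis-demo'            # placeholder
    TOKEN = '<your JWT access token>'            # placeholder

    def send_prediction(label, probability):
        # Wrap one prediction into a DeviceHive notification.
        payload = {'notification': 'sound_classified',
                   'parameters': {'label': label, 'probability': probability}}
        resp = requests.post(
            '{}/device/{}/notification'.format(SERVER_URL, DEVICE_ID),
            json=payload,
            headers={'Authorization': 'Bearer ' + TOKEN})
        resp.raise_for_status()

    send_prediction('Speech', 0.75)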



Conclusion


TensorFlow is a very flexible tool that can be useful in many machine learning applications for image and sound recognition. Used in tandem with an IoT platform, it allows you to build intelligent solutions with great potential. In a "smart city" it could be used for security, for example to recognize the sound of breaking glass or a gunshot. Even in tropical forests, such a solution could be used to track the routes of wild animals and birds by analyzing their voices. The IoT platform can be configured to send notifications about sounds within range of the microphone. Such a solution can be installed on local devices (although it can also be deployed as a cloud system) to minimize traffic and cloud computing costs, and it can be configured to send only notifications, without attaching the raw audio. Do not forget that this is an open source project, so you can use it to build your own services.

Source: https://habr.com/ru/post/343464/

