Development of acoustic dataset for training neural network

Once in an interview, one well-known Russian musician said: “We are working to lie and spit at the ceiling.” I can not agree with this statement, because the fact that it is laziness that is the driving force in the development of technology, there can be no dispute. Indeed, over the past century alone, we have moved from steam engines to digital industrialization, and now the artificial intelligence that the science fiction writers and futurologists of the past century wrote about is becoming an increasing reality of our world every day. Computer games, mobile devices, smart watches and more basically use algorithms associated with machine learning mechanisms.

Nowadays, due to the growth of computing abilities of graphic processors and the large amount of data that has appeared, neural networks have become popular, using which they can solve classification and regression problems, teaching them using prepared data. There are already a lot of articles already written about how to train neural networks and what frameworks to use for this. But there is an earlier task, which also needs to be solved, and this is the task of forming an array of data - dataset, for further training of the neural network. This will be discussed in this article.
')

Not so long ago there was a need to build an acoustic classifier of automobile noise capable of distinguishing from a general audio data stream: broken glass, opening doors and operating a car engine in various modes. Development of the classifier was not difficult, but where to get dataset to meet all the requirements?

Google came to the rescue (no offense to Yandex - I’ll talk about its advantages a little later), with the help of which it turned out to select several main clusters containing the necessary data. I want to note in advance that the sources indicated in this article include a large amount of acoustic information, with various classes, allowing to form data for different tasks. We now turn to the review of these sources.

Freesound.org

Most likely, Freesound.org provides the largest amount of acoustic data, being a joint repository of licensed musical samples, which currently has more than 230,000 copies of sound effects. Each sound sample can be distributed under a different license, so it is better to get acquainted with the license agreement in advance. For example, the license zero (cc0) has the status “Without copyright”, and allows you to copy, modify and distribute, including commercial use, and allows you to use the data absolutely legally.

For the convenience of finding elements of acoustic information in a variety of freesound.org, the developers have provided an API designed to analyze, search and download data from repositories. To work with it you need to get access, for this you need to go to the form and fill in all the necessary fields, after which the generation of the individual key will take place.

The developers of freesound.org provide API for different programming languages, thus allowing to solve the same task with different tools. The list of supported languages and access links to them on GitHub are listed below.

Python was used to achieve this goal, since this beautiful dynamic typing programming language gained its popularity due to its ease of use, completely crossing the myth about the complexity of software development. A module for working with freesound.org for python can be cloned from the github.com repository.

Below is the program code, which consists of two parts and demonstrates all the ease of use of this API. The first part of the program code performs the task of data analysis, the result of which is the density of data distribution for each requested class, and the second part downloads data from the repositories of freesound.org for selected classes. The distribution density when searching for acoustic information with the keywords glass, engine, door is shown below in a pie chart as an example.

Sample code for analyzing freesound.org data

import plotly import plotly.graph_objs as go import freesound import os import termcolor #      def histogram(data, filename = "tmp_histogram.html"): data = [ go.Histogram( histfunc="count", x=data, name="count",textfont=dict(size=15) ), ] plotly.offline.plot({ "data": data, "layout": go.Layout(title="Histogram") }, auto_open=True, filename=filename) pass #      freesound.org def freesound_analysis(search_tokens, output, lim_page_count = 1, key = None): lim_page_count = int(lim_page_count) try: client = freesound.FreesoundClient() client.set_token(key,"token") print(termcolor.colored("Authorisation successful ", "green")) except: print(termcolor.colored("Authorisation failed ", "red")) classes = list() for token in search_tokens: try: results = client.text_search(query=token,fields="id,name,previews") output_catalog = os.path.normpath(output) if not os.path.exists(output_catalog): os.makedirs(output_catalog) page_count = int(0) while True: for sound in results: try: classes.append(token) info = "Data has been getter: " + str(sound.name) print(termcolor.colored(info, "green")) except: info = "Data has not been getter: " + str(sound.name) print(termcolor.colored(info, "red")) page_count += 1 if (not results.next) or (lim_page_count == page_count): page_count = 0 break results = results.next_page() except: print(termcolor.colored(" Search is failed ", "red")) histogram(classes) pass

Sample code to download freesound.org data

 #   def freesound_download(search_tokens, output, lim_page_count = 1, key = None): lim_page_count = int(lim_page_count) #  .     try: client = freesound.FreesoundClient() client.set_token(key,"token") print(termcolor.colored("Authorisation successful ", "green")) except: print(termcolor.colored("Authorisation failed ", "red")) for token in search_tokens: try: results = client.text_search(query=token,fields="id,name,previews") output_catalog = os.path.normpath(output + "\\" + str(token)) if not os.path.exists(output_catalog): os.makedirs(output_catalog) page_count = int(0) while True: for sound in results: try: sound.retrieve_preview(output_catalog) info = "Saved file: " + str(output_catalog) + str(sound.name) print(termcolor.colored(info, "green")) except: info = str("Sound can`t be saved to " + str(output_catalog) + str(sound.name) ) print(termcolor.colored(info, "red")) page_count += 1 if not results.next or lim_page_count == page_count: page_count = 0 break results = results.next_page() except: print(termcolor.colored(" Search is failed ", "red"))

A feature of freesound is that audio data analysis can be performed without downloading an audio file, allowing you to get MFCC, spectral energy, spectral centroid and other factors. Read more about lowlevel information in the freesound.ord documentation .

Using the freesound.org API, the time spent on sampling and unloading data is minimized, saving working hours on exploring other sources of information, since high accuracy of the acoustic classifier requires volume data with high variability, representing data with different harmonics on one same class of events.

YouTube-8M and AudioSet

I think that youtube is not particularly required in the submission, but still Wikipedia tells us that youtube is a video hosting site that provides video display services to users, forgetting to say that youtube is a huge database, and this source just needs to be used in machine learning and Google Inc provides us with a project called YouTube-8M Dataset .

YouTube-8M Dataset is a data set that includes more than a million video files from YouTube in high quality. If you give more accurate information, then by May 2018 there were 6.1M videos with 3862 classes. This dataset is distributed under the Creative Commons Attribution 4.0 International (CC BY 4.0) license . This license allows you to copy and distribute material in any medium and format.

You are probably wondering: where is the video data, when acoustic information is needed for the task, and you will be very right. The fact is that Google provides not only video content, but also separately allocated a subproject with audio data called AudioSet .

AudioSet - provides a set of data obtained from YouTube clips, where a set of data is presented in a class hierarchy, using an ontology file , its graphical representation is located below.

This file allows you to get an idea of the nesting of classes, as well as access to youtube videos. To upload data from the Internet space, you can use the python module - youtube-dl, which allows you to download audio or video content, depending on the required task.

AudioSet is a cluster divided into three sets: test, training (balanced) and training (unbalanced) datasets .

Let's consider this cluster and analyze each of the sets separately in order to have an idea of the classes contained.

Training (balanced)

According to the documentation, this dataset consists of 22,176 segments selected from various videos selected by keywords, providing each class with at least 59 copies. If you look at the density distribution of root classes in the hierarchy of the set, then we see that the largest group of audio files is represented by the Music class.

Organized classes are decomposed into subsets of classes, making it possible to obtain more detailed information when used. This balanced training set has a distribution density which shows that there is a balance, but also certain classes stand out from the general form.

The distribution of classes, the number of elements which exceeds the average value

The average duration of each of the audio files is 10 seconds, a more detailed information is presented on the disk diagram, which shows that the length of the part of the files differs from the main set. This diagram is also presented.

Diagram of the duration of one and a half percent, other than the average, from a balanced set of audiosets

Training (unbalanced)

The advantage of this dataset is its size. Just imagine that, according to the documentation, this set includes 2,042,985 segments and in comparison with a balanced dataset it represents a lot of variation, but the entropy of this set is much higher.

In this set, the average duration of each of the audio files is also equal to 10 seconds, the disk diagram for this dataset is presented below.

Duration chart, other than the average, from an unbalanced audioset set

Test set

This set is very similar to a balanced set with the advantage that the elements of these sets do not intersect. Their distribution is presented below.

The distribution of classes, the number of elements which exceeds the average value

The average duration of one segment from this dataset is also equal to 10 seconds.

and the remaining part has the duration represented on the disk diagram

An example of program code for analyzing and downloading acoustic data in accordance with the selected data set:

 import plotly import plotly.graph_objs as go from collections import Counter import numpy as np import os import termcolor import csv import json import youtube_dl import subprocess #      def histogram(data,hist_mean= True, filename = "tmp_histogram.html"): if hist_mean == True: cdata = Counter(data) mean_number_classes = np.asarray([cdata[x] for x in cdata]).mean() ldata = list() for name in cdata: if cdata[name] > mean_number_classes: ldata += list(Counter({name:cdata[name]}).elements()) trace_mean_data = go.Histogram(histfunc="count", x=ldata, name="count" ) trace_data = go.Histogram(histfunc="count", x=data, name="count", text="" ) trace = [ trace_data, trace_mean_data] plotly.offline.plot({ "data": trace, "layout": go.Layout(title="stack") }, auto_open=True, filename=filename) pass #       def pie_chart(labels, values = None, filename = "tmp_pie_chart.html", textinfo = 'label+value'): if labels == None: raise Exception("Can not create pie chart, because labels is None") if values == None: data = Counter(labels) labels = list() values = list() for name in data: labels.append(name) values.append(data[name]) trace = go.Pie(labels=labels, values=values,textfont=dict(size=20),hoverinfo='label+percent', textinfo=textinfo, marker=dict(line=dict(color='#000000', width=2)) ) plotly.offline.plot([trace], filename='basic_pie_chart') pass #          def audioset_analysis(audioset_file, inputOntology): if not os.path.exists(inputOntology) or not os.path.exists(audioset_file): raise Exception("Can not found file") with open(audioset_file, 'r') as fe: csv_data = csv.reader(fe) sx = list() with open(inputOntology) as f: data = json.load(f) duration_hist = list() for row in csv_data: if row[0][0] == '#': continue classes = row[3:] try: color = "green" tmp_duration = str(float(row[2]) - float(row[1])) info = str("id: ") + str(row[0]) + str(" duration: ") + tmp_duration duration_hist.append(tmp_duration) for cl in classes: for dt in data: cl = str(cl).strip().replace('"',"") if cl == dt['id'] and len(dt['child_ids']) == 0: sx.append(dt['name']) info += str(" ")+str(dt['name']) + str(",") except: color = "red" info = "File has been pass: " + str(row[0]) continue print(termcolor.colored(info, color)) histogram(sx, filename="audioset_class") pie_chart(duration_hist, textinfo="percent + label", filename="audioset_duration")

 #   youtube def youtube_download(filepath, ytid): ydl_opts = { 'format': 'bestaudio/best', 'outtmpl': os.path.normpath(filepath), 'postprocessors': [{ 'key': 'FFmpegExtractAudio', 'preferredcodec': 'wav', 'preferredquality': '192', }], } with youtube_dl.YoutubeDL(ydl_opts) as ydl: ydl.download(['https://www.youtube.com/watch?v={}'.format(ytid)]) pass #    ffmpeg def cutOfPartFile(filename,outputFile, start, end, frequency = 44100): duration = float(end) - float(start) command = 'ffmpeg -i ' command += str(filename)+" " command += " -ar " + str(frequency) command += " -ss " + str(start) command += " -t " + str(duration) + " " command += str(outputFile) subprocess.call(command,shell=True) pass #    yotube        def audioset_converter(incatalog,outcatalog, token = "*.wav", frequency = 44100): find_template = os.path.join(incatalog,token) files = glob(find_template); for file in files: _,name = os.path.split(file) name = os.path.splitext(name)[0] duration = str(name).split("_")[1:3] filename = name.split("_")[0] +"."+ token.split(".")[1]; outfile = os.path.join(outcatalog,filename) cutOfPartFile(file,outfile,start=duration[0],end=duration[1]) #    audioset def audioset_download(audioset_file, outputDataset, frequency = 44100): t,h = os.path.split(audioset_file) h = h.split(".") outputDataset_full = os.path.join(outputDataset,str(h[0])+"_full") outputDataset = os.path.join(outputDataset,str(h[0])) if not os.path.exists(outputDataset): os.makedirs(outputDataset) if not os.path.exists(outputDataset_full): os.makedirs(outputDataset_full) with open(audioset_file, 'r') as fe: csv_data = csv.reader(fe) duration_hist = list() for row in csv_data: if row[0][0] == '#': continue try: color = "green" tmp_duration = str(float(row[2]) - float(row[1])) info = str("id: ") + str(row[0]) + str(" duration: ") + tmp_duration duration_hist.append(tmp_duration) save_full_file = str(outputDataset_full) + str("//")+ str(row[0]).lstrip()+str("_") +str(row[1]).lstrip() + str("_").lstrip() + str(row[2]).lstrip() + str('.%(ext)s') youtube_download(save_full_file,row[0]) except: color = "red" info = "File has been pass: " + str(row[0]) continue print(termcolor.colored(info, color)) audioset_converter(outputDataset_full,outputDataset, frequency = frequency)

For more detailed information on analyzing the audioset data, or downloading this data from the yotube space in accordance with the ontology file and the selected audioset set , the program code is freely available in the GitHub repository .

urbansound

Urbansound is one of the largest datasets with markup sound events whose classes belong to the urban environment. This set was called taxonomic (categorical), i.e. each class is divided into subclasses belonging to it. Such a set can be represented as a tree.

To upload the urbansound data for later use, simply go to the page and click download .

Since there is no need to use all subclasses in the task, but only a single class associated with the car is needed, it is first necessary to filter the necessary classes using the meta file located in the root of the directory obtained when unzipping the downloaded file.

After unloading all the necessary data from these sources, it turned out to form a dataset containing more than 15,000 files. This amount of data allows you to go to the task of learning the acoustic classifier, but there remains an unresolved question regarding the “purity” of the data, i.e. training set includes data not related to the necessary classes of the problem to be solved. For example, when listening to files from the “broken glass” class, you can find people talking about “how not to beat the glass well”. Therefore, we face the task of filtering data and as a tool for solving this kind of task, a tool is great, the core of which was developed by Belarusian guys and received the strange name “Yandex.Toloka”.

Yandex.Toloka

Yandex.Toloka is a crowdfunding project created in 2014 to mark up or collect a large amount of data for further use in machine learning. In fact, this tool allows you to collect, mark up and filter data using a human resource. Yes, this project not only allows you to solve problems, but also allows other people to earn. The financial burden in this case falls on your shoulders, but due to the fact that the performers are more than 10,000 interpreters, the results of the work will be obtained in the near future. A good description of the operation of this tool can be found in the Yandex company blog .

In general, the use of money is not particularly difficult, since for publication of the task you just need to register on the site , the minimum amount is 10 US dollars, and a properly executed task. How to correctly form the task you can see the documentation of Yandex.Tolok or there is not a bad article on Habré . From myself, I want to add to this article that even if the pattern that is suitable for the requirement of your task is missing, then its development will take no more than a few hours of work, with a break for coffee and a cigarette, and the performers can be obtained by the end of the working day.

Conclusion

In machine learning in solving the problem of classification or regression, one of the primary tasks is the development of a reliable data set - dataset. This article examined the sources of information with a large amount of acoustic data allowing to form and balance the required data set for a specific task. The presented program code allows you to simplify the operation of uploading data to a minimum, allowing you to reduce the time to receive data, spend the rest on the development of the classifier.

As for my task, after collecting the data, from all the sources presented in this article and the subsequent filtering of the data, it was possible to form the necessary datasets for learning the acoustic classifier based on the neural network. I hope that this article will allow you and your team to save time and spend it on the development of new technologies.

PS A software module developed in python, for analyzing and uploading acoustic data for each of the submitted sources, you can find it in the github repository

Source: https://habr.com/ru/post/427397/

All Articles

Development of acoustic dataset for training neural network

More articles: