
Random forest vs. neural network: which will better cope with recognizing gender from speech (part 1)

Historically, deep learning has been most successful with images: classification, recognition, segmentation. However, as they say, data science does not live by convolutional networks alone.



We have tried to put together a guide to solving problems related to speech processing. The most popular and sought-after of them is probably speech recognition proper, analysis at the semantic level, but we will turn to a simpler task: determining the gender of the speaker. The toolkit, however, is almost the same in both cases.



/ Photo justin lincoln / CC-BY


What our algorithm will “hear”



Characteristics of the voice that we will use


First of all, you need to understand the physics of the process: how a male voice differs from a female one. The anatomy of the human vocal tract is covered in reviews and specialized literature, but the basic back-of-the-envelope explanation is quite transparent: the vocal folds, whose oscillations produce the sound wave before it is shaped by the other organs of speech, have different thickness and tension in men and women, which leads to a different fundamental frequency (also known as pitch). For men it usually lies in the range of 65-260 Hz, for women 100-525 Hz. In other words, a male voice most often sounds lower than a female one.



It would be naive to assume that pitch alone is enough: as you can see, the two ranges overlap considerably. Moreover, during speech the fundamental frequency is a variable parameter (it changes, for example, to convey intonation), it cannot be determined at all for many consonant sounds, and the algorithms that estimate it are far from perfect.



To human perception, the individuality of a voice lies not only in its pitch but also in its timbre: the combination of all the frequencies present in the voice. In a sense it can be described by the spectrum, and this is where the mathematics comes to the rescue.



Sound is a non-stationary signal, so its spectrum averaged over time is unlikely to tell us anything meaningful. It is therefore reasonable to look at the spectrogram, the spectrum at each moment in time, and at its statistics. The audio signal is split into overlapping 25-50 millisecond segments called frames; for each frame the spectrum is computed with the fast Fourier transform, and then its moments are calculated. The centroid, entropy, variance, skewness and kurtosis are used most often, much the same quantities as one computes for random variables and time series.
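To make this concrete, here is a minimal numpy sketch of how such per-frame spectral moments could be computed. It is our own illustration, not the feature extractor used in the experiments below; the frame and hop lengths are example values, and the input file path is a placeholder.

import numpy as np
from scipy.io import wavfile

def frame_spectral_moments(path, frame_ms=25, hop_ms=10):
    """Per-frame spectral centroid, spread, skewness, kurtosis and entropy."""
    sr, x = wavfile.read(path)          # sampling rate and raw samples
    x = x.astype(float)
    if x.ndim > 1:                      # assume we want a single channel
        x = x.mean(axis=1)
    frame = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    feats = []
    for start in range(0, len(x) - frame, hop):
        seg = x[start:start + frame] * np.hanning(frame)   # windowed frame
        spec = np.abs(np.fft.rfft(seg))                    # magnitude spectrum
        freqs = np.fft.rfftfreq(frame, d=1.0 / sr)
        p = spec / (spec.sum() + 1e-10)                    # treat as a distribution
        centroid = (freqs * p).sum()
        spread = np.sqrt(((freqs - centroid) ** 2 * p).sum())
        skew = ((freqs - centroid) ** 3 * p).sum() / (spread ** 3 + 1e-10)
        kurt = ((freqs - centroid) ** 4 * p).sum() / (spread ** 4 + 1e-10)
        entropy = -(p * np.log2(p + 1e-10)).sum()
        feats.append([centroid, spread, skew, kurt, entropy])
    return np.array(feats)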



Mel-frequency cepstral coefficients (MFCC) are also widely used; you can read about them, for example, here. They are meant to solve two problems. First, human perception of sound is not linear in either frequency or amplitude, so some (logarithmic) scaling is required. Second, the spectrum of a speech signal itself varies quite smoothly with frequency, so its description can be reduced to a handful of numbers without much loss of accuracy. As a rule, 12 mel-cepstral coefficients are used, each representing the logarithm of the spectral energy within a certain frequency band (whose width grows with frequency).
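For reference, MFCCs are available off the shelf in several Python packages. A minimal sketch with librosa is shown below; it only illustrates the feature itself, not the extractor used in our experiments, and the file name and frame parameters are placeholders.

import librosa

# load a mono signal at its native sampling rate ('speech.wav' is a placeholder)
y, sr = librosa.load('speech.wav', sr=None)

# 12 mel-frequency cepstral coefficients per 25 ms frame with a 10 ms hop
mfcc = librosa.feature.mfcc(
    y=y, sr=sr, n_mfcc=12,
    n_fft=int(0.025 * sr), hop_length=int(0.010 * sr)
)
print(mfcc.shape)  # (12, number_of_frames)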



It is this set of features (pitch, spectrogram statistics, MFCC) that we will use for classification.





/ Photo by Daniel Oines / CC-BY



We solve the classification problem



Machine learning begins with data. Unfortunately, there is no open, widely used dataset for gender identification along the lines of ImageNet for image classification or IMDB for sentiment analysis of texts. One could take the well-known speech-recognition corpus TIMIT, but it is paid (which imposes certain restrictions on its public use), so we will use VCTK, a freely available corpus of about 7 GB. It was designed for speech synthesis, but it suits us in every respect: there is audio and metadata for 109 speakers. For each of them we take 4 random utterances 1-5 seconds long and try to determine the gender of the speaker.
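A possible way to pick the 4 random utterances per speaker is sketched below. It assumes the standard layout of the corpus, with one folder of wav files per speaker; the root path and folder name are placeholders you would adjust to your copy of the data.

import os
import random

# Assumed layout: VCTK-Corpus/wav48/<speaker_id>/<utterance>.wav
VCTK_ROOT = 'VCTK-Corpus/wav48'
UTTERANCES_PER_SPEAKER = 4

random.seed(0)  # make the selection reproducible
selected = {}
for speaker in sorted(os.listdir(VCTK_ROOT)):
    speaker_dir = os.path.join(VCTK_ROOT, speaker)
    if not os.path.isdir(speaker_dir):
        continue
    wavs = [f for f in os.listdir(speaker_dir) if f.endswith('.wav')]
    selected[speaker] = random.sample(wavs, UTTERANCES_PER_SPEAKER)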







In a computer, sound is stored as a sequence of numbers: the deviations of the microphone membrane from its equilibrium position. The sampling rate is most often chosen between 8 and 96 kHz, so even a single second of single-channel sound is represented by at least 8 thousand numbers, each encoding the membrane's displacement at one of eight thousand instants per second. For those who have heard of WaveNet, the neural-network architecture for audio synthesis, this may not look like a problem, but in our case such an approach is overkill. A sensible pre-processing step is to compute features that drastically reduce the number of parameters describing the sound. Here we turned to openSMILE, a convenient package that can compute almost everything related to sound.
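openSMILE is normally driven from the command line via its SMILExtract tool. A rough sketch of calling it from Python is shown below; the config file and audio file names are placeholders, -C/-I/-O are the standard options for config, input and output, and the exact output format depends on the configuration you supply.

import subprocess

cmd = [
    'SMILExtract',
    '-C', 'my_features.conf',   # feature-set configuration (our own, hypothetical file)
    '-I', 'p225_001.wav',       # input audio
    '-O', 'p225_001.csv',       # where the per-frame features are written
]
subprocess.run(cmd, check=True)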



The code is written in Python, and the Random Forest implementation, which coped best with the classification, is taken from the sklearn library. It is also curious to see how neural networks handle this task, but that will be a separate post.



Solving a classification problem means building, from the training data, a function that maps the same kind of parameters to a class label, and does so reasonably accurately. In our case we need the classifier, given the feature set of an arbitrary audio file, to answer whose speech is recorded in it: a man's or a woman's.



An audio file consists of many frames, and usually there are far more frames than training examples. If we train on individual frames we are unlikely to get anything worthwhile, so it makes sense to reduce the number of parameters. In principle each frame could be classified separately, but because of outliers the end result would again be unimpressive. The golden mean is to compute statistics of the features over all frames of an audio file.



In addition, we need a way to validate the classifier, to make sure it does everything correctly. In speech-processing tasks a model is considered to generalize poorly if it works well only for the speakers it was trained on rather than for everyone. Otherwise the model is said to be speaker-independent, which is in itself a good property. To check this it is enough to split the speakers into groups: train on some and measure accuracy on the rest.



That is what we will do.



The table with the data is stored in the file data.csv; the column names are in the first row, and if you wish you can print it to the screen or inspect it by hand.
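For example, a quick peek at the header and the first few rows might look like this (the column order matches the way the data is indexed in the code below).

import csv

with open('data.csv', 'r') as c:
    r = csv.reader(c, delimiter=',')
    print(next(r))          # column names: gender, speaker, filename, time, pitch, ...
    for _ in range(3):
        print(next(r))      # a few sample rows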



We import the necessary libraries and read the data:



import csv, os
import numpy as np
from sklearn.ensemble import RandomForestClassifier as RFC
from sklearn.model_selection import GroupKFold

# read data
with open('data.csv', 'r') as c:
    r = csv.reader(c, delimiter=',')
    header = next(r)
    data = []
    for row in r:
        data.append(row)
data = np.array(data)

# preprocess: column 0 is the gender label, 1 the speaker id, 2 the source
# file name, 3 the frame time stamp; column 4 is the pitch, and columns 4
# onwards form the full feature set
genders = data[:, 0].astype(int)
speakers = data[:, 1].astype(int)
filenames = data[:, 2]
times = data[:, 3].astype(float)
pitch = data[:, 4:5].astype(float)
features = data[:, 4:].astype(float)


Now we need to set up speaker-wise cross-validation. The GroupKFold iterator built into sklearn works as follows: each point in the sample belongs to a group, in our case to one of the speakers. The set of all speakers is split into equal parts; each part is held out in turn, the classifier is trained on the remaining speakers, and its accuracy on the held-out ones is recorded. The average accuracy over all parts is taken as the accuracy of the classifier.



def subject_cross_validation(clf, x, y, subj, folds):
    gkf = GroupKFold(n_splits=folds)
    scores = []
    for train, test in gkf.split(x, y, groups=subj):
        clf.fit(x[train], y[train])
        scores.append(clf.score(x[test], y[test]))
    return np.mean(scores)


When everything is ready, we can run the experiments. First, let's try classifying individual frames: the classifier receives a frame's feature vector, and the target label is the label of the file the frame was taken from. Let's compare classification by pitch alone against classification by all features (pitch + spectral statistics + MFCC):



# classify frames separately
score_frames_pitch = subject_cross_validation(RFC(n_estimators=100), pitch, genders, speakers, 5)
print('Frames classification on pitch, accuracy:', score_frames_pitch)
score_frames_features = subject_cross_validation(RFC(n_estimators=100), features, genders, speakers, 5)
print('Frames classification on all features, accuracy:', score_frames_features)


As expected, the accuracy is low: 66 and 73% of frames classified correctly. That is not much, not a lot better than a random classifier, which would give about 50%. The main reason is the junk in the sample: for 64% of the frames the fundamental frequency could not be computed at all. There may be two causes: the frame either contained no speech at all (silence, sighs) or belonged to a consonant sound. And while the former can be discarded with a clear conscience, the latter only with reservations: we believe we can still manage to separate male and female speech correctly using the remaining frames.



In fact, we want to classify not frames but whole audio files. We can compute various statistics over the temporal sequences of features and then classify those:



def make_sample(x, y, subj, names, statistics=[np.mean, np.std, np.median, np.min, np.max]):
    avx = []
    avy = []
    avs = []
    keys = np.unique(names)
    for k in keys:
        idx = names == k
        v = []
        for stat in statistics:
            v += stat(x[idx], axis=0).tolist()
        avx.append(v)
        avy.append(y[idx][0])
        avs.append(subj[idx][0])
    return np.array(avx), np.array(avy).astype(int), np.array(avs).astype(int)

# aggregate features over the frames of each audio file
average_features, average_genders, average_speakers = make_sample(features, genders, speakers, filenames)
average_pitch, average_genders, average_speakers = make_sample(pitch, genders, speakers, filenames)


Now each audio file is represented by a single vector. We compute the mean, standard deviation, median, minimum and maximum of the features and classify them:



# train models on pitch and on all features
score_pitch = subject_cross_validation(RFC(n_estimators=100), average_pitch, average_genders, average_speakers, 5)
print('Utterance classification on pitch, accuracy:', score_pitch)
score_features = subject_cross_validation(RFC(n_estimators=100), average_features, average_genders, average_speakers, 5)
print('Utterance classification on features, accuracy:', score_features)


97.2% is quite another matter; everything seems fine. It remains to discard the junk frames, recompute the statistics and enjoy the result:



# skip all frames without pitch
filter_idx = pitch[:, 0] > 1
filtered_average_features, filtered_average_genders, filtered_average_speakers = make_sample(
    features[filter_idx], genders[filter_idx], speakers[filter_idx], filenames[filter_idx])
score_filtered = subject_cross_validation(RFC(n_estimators=100), filtered_average_features, filtered_average_genders, filtered_average_speakers, 5)
print('Utterance classification on averaged features over filtered frames, accuracy:', score_filtered)


Hooray, the 98.4% mark has been reached. By tuning the model's parameters (and the choice of the model itself) this number can probably be pushed a bit higher, but it would not bring us qualitatively new knowledge.
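If one did want to squeeze out a little more, a standard move would be a grid search over the forest's hyperparameters using the same speaker-wise splits. A sketch is given below; the parameter grid is arbitrary, not a set of values we have actually validated.

from sklearn.ensemble import RandomForestClassifier as RFC
from sklearn.model_selection import GridSearchCV, GroupKFold

# example grid; the speaker groups keep the evaluation speaker-independent,
# just like subject_cross_validation above
param_grid = {
    'n_estimators': [100, 300, 500],
    'max_depth': [None, 10, 20],
    'min_samples_leaf': [1, 3, 5],
}
gkf = GroupKFold(n_splits=5)
search = GridSearchCV(RFC(), param_grid, cv=gkf)
search.fit(filtered_average_features, filtered_average_genders,
           groups=filtered_average_speakers)
print(search.best_params_, search.best_score_)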



Conclusion



Machine learning for speech processing is objectively hard. A head-on solution usually falls well short of the desired result, and one often has to scrape together an extra 1-2% of accuracy by changing something seemingly insignificant but justified by the physics or the mathematics. Strictly speaking, this process can be continued indefinitely, but...



In the next and final part of our introductory guide we will look in detail at whether neural networks cope with this task better, and examine different experimental setups, network architectures and related questions.



Worked on the material:





Stay with us.







Source: https://habr.com/ru/post/334136/


