Recognizing emotions has always been an exciting challenge for scientists. I have recently been working on an experimental Speech Emotion Recognition (SER) project to explore the potential of this technology. For this, I picked the most popular repositories on GitHub and used them as the basis of my project.
Before diving into the project, it is worth recalling the main bottlenecks of SER.
Main obstacles
- emotions are subjective; even people interpret them differently, and the very concept of "emotion" is hard to define;
- annotating audio is hard. Should we label every single word, a sentence, or the whole conversation? And which set of emotions should we use for recognition?
- collecting data is not easy either. A lot of audio can be collected from movies and news broadcasts, but both sources are "biased": news is supposed to be neutral, and actors' emotions are acted. It is hard to find an "objective" source of audio data;
- labeling the data requires a lot of human time. Unlike drawing bounding boxes on images, it requires specially trained annotators to listen to whole recordings, analyze them and label them. Those labels then have to be assessed by many other people, because the judgments are subjective.
Project Description
A convolutional neural network is used to recognize emotions in audio recordings. And yes, the owner of the repository did not cite any sources.
Data Description
The repository uses two datasets, RAVDESS and SAVEE; I only adapted RAVDESS for my model. RAVDESS contains two types of data: speech and song.
Dataset
RAVDESS (The Ryerson Audio-Visual Database of Emotional Speech and Song):
- 12 actors and 12 actresses recorded speech and song versions of their performances;
- actor #18 has no song recordings;
- the Disgust, Neutral and Surprised emotions are absent from the song data.
Emotion breakdown (distribution chart)
Feature extraction
When it comes to speech recognition tasks, Mel-Frequency Cepstral Coefficients (MFCCs) are the state-of-the-art feature, even though they date back to the 1980s.
A quote from the MFCC Tutorial:
"This shape determines what sound comes out. If we can determine the shape accurately, this should give us an accurate representation of the phoneme being produced. The shape of the vocal tract manifests itself in the envelope of the short-time power spectrum, and the job of MFCCs is to accurately represent this envelope."
Waveform / Spectrogram
We use MFCCs as the input feature. If you want to learn more about what MFCCs are, this tutorial is for you. Loading the data and converting it to MFCC features is easy to do with the librosa Python package.
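For illustration, here is roughly how a single file can be turned into a fixed-length MFCC vector. The file path, sampling rate, offset and duration below are my own illustrative choices, not the exact settings of the original repository; with these values the feature length comes out to about 216 frames, which matches the model input shape used later.

import numpy as np
import librosa

# Illustrative settings, not the exact parameters of the original repository.
data, sample_rate = librosa.load('RAVDESS/Actor_01/03-01-01-01-01-01-01.wav',
                                 sr=44100, offset=0.5, duration=2.5)
mfcc = librosa.feature.mfcc(y=data, sr=sample_rate, n_mfcc=13)  # shape: (13, n_frames)
feature = np.mean(mfcc, axis=0)                                 # average over the 13 coefficients
x = np.expand_dims(feature, axis=-1)                            # -> (n_frames, 1) for a Conv1D input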
Default Model Architecture
The author built a CNN model with Keras, using 7 layers: six Conv1D layers and one Dense layer.
model = Sequential()
model.add(Conv1D(256, 5, padding='same', input_shape=(216, 1)))
The author commented out layers 4 and 5 in the latest release (September 18, 2018), and the saved model file does not match the network provided, so I could not reproduce the reported accuracy of 72%.
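For reference, here is a minimal sketch of what such a six-Conv1D-plus-Dense stack could look like in Keras. Only the first layer and the overall layer count come from the description above; the remaining filter sizes, the activations and the softmax output are my assumptions.

from keras.models import Sequential
from keras.layers import Conv1D, Activation, Flatten, Dense

model = Sequential()
model.add(Conv1D(256, 5, padding='same', input_shape=(216, 1)))  # layer 1 (from the repository)
model.add(Activation('relu'))
model.add(Conv1D(128, 5, padding='same'))                        # layer 2 (assumed filter count)
model.add(Activation('relu'))
model.add(Conv1D(128, 5, padding='same'))                        # layer 3
model.add(Activation('relu'))
model.add(Conv1D(128, 5, padding='same'))                        # layers 4 and 5: the ones
model.add(Activation('relu'))                                    # commented out in the latest release
model.add(Conv1D(128, 5, padding='same'))
model.add(Activation('relu'))
model.add(Conv1D(128, 5, padding='same'))                        # layer 6
model.add(Activation('relu'))
model.add(Flatten())
model.add(Dense(10))                                             # one Dense output layer, 10 classes
model.add(Activation('softmax'))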
The model is trained naively with batch_size=16 and epochs=700, without any learning-rate schedule or the like. The loss function is categorical_crossentropy and the evaluation metric is accuracy.
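In Keras terms that corresponds to something like the following; the optimizer is not named in the text, so rmsprop here is just an assumption.

# Sketch of the training setup described above; the optimizer choice is an assumption.
model.compile(loss='categorical_crossentropy', optimizer='rmsprop', metrics=['accuracy'])
cnnhistory = model.fit(x_traincnn, y_train,
                       batch_size=16, epochs=700,
                       validation_data=(x_testcnn, y_test))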
My experiment
In the RAVDESS dataset each actor performs 8 emotions, speaking and singing 2 sentences, 2 times each. As a result, each actor yields 4 examples of each emotion, except for the Neutral, Disgust and Surprised emotions mentioned above. Each clip lasts about 4 seconds, and the first and last seconds are usually silence.
Typical sentences are "Kids are talking by the door" and "Dogs are sitting by the door".
Observation
After I picked one actor and one actress from the dataset and listened to all of their recordings, I realized that men and women express emotions differently. For example:
- male anger (Angry) is simply louder;
- male joy (Happy) and sadness (Sad) show up as laughing and crying tones during the "silent" parts;
- female joy (Happy), anger (Angry) and sadness (Sad) are all louder;
- female disgust (Disgust) contains retching sounds.
Experiment repetition
The author dropped the Neutral, Disgust and Surprised classes, turning this into 10-class recognition on the RAVDESS dataset. Trying to repeat the author's experiment, I got the following result:
However, I discovered a data leak: the validation dataset was identical to the test dataset. So I redid the data split, holding out the data of two actors and two actresses so that the model never sees them before the test:
- actors 1 to 20 are used for the Train/Valid sets with an 8:2 split;
- actors 21 to 24 are held out for the test set;
- Train set shape: (1248, 216, 1);
- Valid set shape: (312, 216, 1);
- Test set shape: (320, 216, 1) (held out).
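A minimal sketch of such an actor-based split; all_files, the get_actor_id helper and the use of scikit-learn's train_test_split are my own illustrative choices, relying on the fact that RAVDESS file names end with the actor number.

from sklearn.model_selection import train_test_split

def get_actor_id(filename):
    # RAVDESS file names look like '03-01-06-01-02-01-12.wav'; the last field is the actor number.
    return int(filename.split('-')[-1].split('.')[0])

test_files = [f for f in all_files if get_actor_id(f) >= 21]      # actors 21-24: held-out test set
trainval_files = [f for f in all_files if get_actor_id(f) <= 20]  # actors 1-20: train/valid

train_files, valid_files = train_test_split(trainval_files,
                                            test_size=0.2,        # the 8:2 ratio
                                            random_state=42)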
I re-trained the model and here is the result:
Performance test
The train/validation loss curves show that the model does not converge for the chosen 10 classes. So I decided to reduce the task complexity and keep only the male emotions. I held out two actors for the test set and put the rest into the train/valid sets with an 8:2 ratio; this ensures there is no imbalance in the dataset. I then trained on the male and female data separately to run the test.
Male dataset:
- Train set: 640 samples from actors 1-10;
- Valid set: 160 samples from actors 1-10;
- Test set: 160 samples from actors 11-12.
Baseline: male
Female dataset:
- Train set: 608 samples from actresses 1-10;
- Valid set: 152 samples from actresses 1-10;
- Test set: 160 samples from actresses 11-12.
Baseline: female
As you can see, the confusion matrices are different.
Men: Angry and Happy are the dominant predicted classes, but they are not easily confused with each other.
Women: Sad and Happy are the dominant predicted classes; Angry and Happy are easily confused.
Recalling the observations from the exploratory data analysis, I suspect that female Angry and Happy get confused because both are expressed simply by raising the voice.
On top of that, I was curious what would happen if I simplified the task even further and kept only Positive, Neutral and Negative classes, or only Positive and Negative. So I grouped the emotions into 2 and 3 classes, respectively, as follows.
2 classes:
- Positive: Happy, Calm;
- Negative: Angry, Fearful, Sad.
3 classes:
- Positive: Happy;
- Neutral: Calm, Neutral;
- Negative: Angry, Fearful, Sad.
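A small sketch of how this regrouping could be expressed in code; the exact label strings are assumptions about how the emotion labels are encoded.

# Hedged sketch: map the original emotion labels onto the coarse classes above.
TWO_CLASS_MAP = {
    'happy': 'positive', 'calm': 'positive',
    'angry': 'negative', 'fearful': 'negative', 'sad': 'negative',
}
THREE_CLASS_MAP = {
    'happy': 'positive',
    'calm': 'neutral', 'neutral': 'neutral',
    'angry': 'negative', 'fearful': 'negative', 'sad': 'negative',
}

def regroup(labels, mapping):
    return [mapping[label] for label in labels]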
Before starting this experiment, I tuned the model architecture on the male data with 5-class recognition.
I added 2 Conv1D layers, one MaxPooling1D layer and 2 BatchNormalization layers, and changed the dropout value to 0.25. Finally, I switched the optimizer to SGD with a learning rate of 0.0001.
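The fragment below sketches those changes; only the layer types, the dropout value and the optimizer settings come from the description above, while the exact position of each new layer in the stack and the filter sizes are my assumptions.

from keras.layers import Conv1D, BatchNormalization, MaxPooling1D, Dropout
from keras.optimizers import SGD

# Sketch of the added pieces (their exact position in the stack is an assumption).
model.add(Conv1D(128, 5, padding='same'))
model.add(BatchNormalization())
model.add(MaxPooling1D(pool_size=8))
model.add(Conv1D(128, 5, padding='same'))
model.add(BatchNormalization())
model.add(Dropout(0.25))

opt = SGD(lr=0.0001)
model.compile(loss='categorical_crossentropy', optimizer=opt, metrics=['accuracy'])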
from keras.callbacks import ReduceLROnPlateau, ModelCheckpoint

lr_reduce = ReduceLROnPlateau(monitor='val_loss', factor=0.9, patience=20, min_lr=0.000001)
mcp_save = ModelCheckpoint('model/baseline_2class_np.h5', save_best_only=True,
                           monitor='val_loss', mode='min')
cnnhistory = model.fit(x_traincnn, y_train, batch_size=16, epochs=700,
                       validation_data=(x_testcnn, y_test), callbacks=[mcp_save, lr_reduce])
To train the model, I reduce the learning rate when the validation loss plateaus and save only the best model, i.e. the one with the minimum val_loss. Here are the results for the different target classes.
New model performance
Results: men, 5 classes; women, 5 classes; men, 2 classes; men, 3 classes.
Augmentation
Even after strengthening the model architecture, the optimizer and the learning rate, the model still did not converge during training. I assumed this was a data-quantity problem, since there are only about 800 samples. That led me to audio augmentation methods, with which I eventually doubled the dataset size. Let's take a look at these methods.
Men, 5 classes
Dynamic range change

def dyn_change(data):
    """Apply a random amplification to the signal."""
    dyn_change = np.random.uniform(low=1.5, high=3)
    return data * dyn_change
Pitch shift

def pitch(data, sample_rate):
    """Shift the pitch by a random number of semitones."""
    bins_per_octave = 12
    pitch_pm = 2
    pitch_change = pitch_pm * 2 * np.random.uniform()
    data = librosa.effects.pitch_shift(data.astype('float64'), sample_rate,
                                       n_steps=pitch_change,
                                       bins_per_octave=bins_per_octave)
    return data
Shifting

def shift(data):
    """Shift the signal in time by a random offset."""
    s_range = int(np.random.uniform(low=-5, high=5) * 500)
    return np.roll(data, s_range)
Adding white noise

def noise(data):
    """Add white noise to the signal (the amplitude factor here is an illustrative choice)."""
    noise_amp = 0.005 * np.random.uniform() * np.amax(data)
    return data.astype('float64') + noise_amp * np.random.normal(size=data.shape[0])
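A hedged sketch of how these functions could be used to double the training set: augment every raw waveform, keep the original label, and re-extract the MFCC features. The variable names raw_waveforms and labels, and the feature settings, are my own illustrative choices.

# Double the training data with noisy copies; augmented samples keep their original labels.
augmented = [noise(w) for w in raw_waveforms]
all_waveforms = raw_waveforms + augmented
all_labels = labels + labels

features = [np.mean(librosa.feature.mfcc(y=w, sr=44100, n_mfcc=13), axis=0)
            for w in all_waveforms]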
Notice that augmentation significantly increases accuracy, up to 70+% in the general case. Adding white noise is especially effective, raising accuracy to 87.19%; however, the test accuracy and the F1 score drop by more than 5%. That gave me the idea of combining several augmentation methods for a better result.
Combining several methods
White noise + shift
Testing the augmentations on the male data
Men, 2 classes
White noise + shift, applied to all samples.
White noise + shift, applied only to the positive samples, since the 2-class set is imbalanced (towards the negative samples).
Pitch + white noise, applied to all samples.
Pitch + white noise, applied to the positive samples only.
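For the positive-only variants, a combined augmentation can be applied conditionally, roughly like this; the label check and the variable names are assumptions.

# Sketch: white noise on top of a random shift, applied only to positive samples
# to rebalance the 2-class set.
aug_waves, aug_labels = [], []
for wave, label in zip(raw_waveforms, labels):
    if label == 'positive':
        aug_waves.append(noise(shift(wave)))
        aug_labels.append(label)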
Conclusion
In the end, I only managed to experiment properly with the male dataset. I re-split the data to avoid imbalance and, as a consequence, data leakage. I tuned the model on male voices because I wanted to simplify the task as much as possible to get started. I also ran tests with different augmentation methods; adding white noise and shifting worked well on the imbalanced data.
Findings
- emotions are subjective and hard to pin down;
- decide in advance which emotions are relevant for your project;
- do not blindly trust content from GitHub, even if it has many stars;
- keep data splitting in mind;
- exploratory data analysis always gives good insight, but you need to be patient when working with audio data;
- decide what you will feed into your model: a sentence, a whole recording, or an exclamation?
- the lack of data is an important limiting factor in SER, yet building a good emotional dataset is a complex and expensive task;
- simplify your model if you lack data.
Further improvement
- I used only the first 3 seconds of each clip as input to reduce the overall data size, while the original project used 2.5 seconds. I would like to experiment with full-length recordings;
- the data could be preprocessed: trim the silence, normalize the length by zero-padding, etc. (see the sketch below);
- try recurrent neural networks for this task.
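As an illustration of the preprocessing point above, here is a minimal sketch of trimming silence and zero-padding to a common length; the target length and the top_db threshold are illustrative values.

import numpy as np
import librosa

def preprocess(wave, target_len=4 * 22050, top_db=25):
    # Trim leading/trailing silence, then zero-pad (or cut) to a fixed length.
    trimmed, _ = librosa.effects.trim(wave, top_db=top_db)
    if len(trimmed) < target_len:
        trimmed = np.pad(trimmed, (0, target_len - len(trimmed)), mode='constant')
    return trimmed[:target_len]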