
Emotion recognition using a convolutional neural network


Recognizing emotions has always been an exciting challenge for researchers. Recently, I have been working on an experimental SER (Speech Emotion Recognition) project to understand the potential of this technology; for this, I selected the most popular repositories on GitHub and made them the basis of my project.

Before we dive into the project, it is worth recalling the main bottlenecks of SER.

Main obstacles




Project Description


The goal is to use a convolutional neural network to recognize emotions in audio recordings. And yes, the owner of the repository did not cite any sources.

Data Description


The repository uses two datasets, RAVDESS and SAVEE; I only adapted RAVDESS for my model. RAVDESS contains two types of data: speech and song.
RAVDESS dataset (The Ryerson Audio-Visual Database of Emotional Speech and Song):


Emotion Breakdown:


Emotion Distribution Chart:


Feature extraction


For speech recognition tasks, mel-frequency cepstral coefficients (MFCCs) remain the state-of-the-art feature, even though the technique dates back to the 1980s.

Quote from the MFCC tutorial:
This shape determines what sound comes out. If we can determine the shape accurately, this should give us an accurate representation of the phoneme being produced. The shape of the vocal tract manifests itself in the envelope of the short-time power spectrum, and the job of MFCCs is to accurately represent this envelope.


Waveform


Spectrogram

We use MFCCs as the input feature. If you want to learn more about what MFCCs are, this tutorial is for you. Loading the audio and converting it to MFCCs can be done easily with the librosa Python package.
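As an illustration, here is a minimal sketch of that step; the file name, offset/duration and the averaging over the coefficient axis are my assumptions, not the repository's exact settings:

import librosa
import numpy as np

# Illustrative only: 'audio.wav' and the 0.5 s offset / 2.5 s duration are placeholder values.
data, sample_rate = librosa.load('audio.wav', duration=2.5, offset=0.5)

# 13 MFCCs per frame; averaging over the coefficient axis gives one value per frame,
# so a fixed clip length yields a fixed-size 1-D vector (compatible with an input
# shape like (216, 1) used by the model below) -- this averaging choice is an assumption.
mfccs = librosa.feature.mfcc(y=data, sr=sample_rate, n_mfcc=13)
feature = np.mean(mfccs, axis=0)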

Default Model Architecture


The author built a CNN model with the Keras package, using 7 layers: six Conv1D layers and one dense layer (Dense).

model = Sequential()
model.add(Conv1D(256, 5, padding='same', input_shape=(216, 1)))  # 1
model.add(Activation('relu'))
model.add(Conv1D(128, 5, padding='same'))  # 2
model.add(Activation('relu'))
model.add(Dropout(0.1))
model.add(MaxPooling1D(pool_size=(8)))
model.add(Conv1D(128, 5, padding='same'))  # 3
model.add(Activation('relu'))
# model.add(Conv1D(128, 5, padding='same'))  # 4
# model.add(Activation('relu'))
# model.add(Conv1D(128, 5, padding='same'))  # 5
# model.add(Activation('relu'))
# model.add(Dropout(0.2))
model.add(Conv1D(128, 5, padding='same'))  # 6
model.add(Activation('relu'))
model.add(Flatten())
model.add(Dense(10))  # 7
model.add(Activation('softmax'))
opt = keras.optimizers.rmsprop(lr=0.00001, decay=1e-6)

The author commented out layers #4 and #5 in the latest release (September 18, 2018), and the final weights file does not fit the network as provided, so I could not reach the same accuracy of 72%.

The model is trained straightforwardly with batch_size=16 and epochs=700, without any learning rate schedule, etc.

# Compile Model
model.compile(loss='categorical_crossentropy', optimizer=opt, metrics=['accuracy'])

# Fit Model
cnnhistory = model.fit(x_traincnn, y_train, batch_size=16, epochs=700,
                       validation_data=(x_testcnn, y_test))

Here categorical_crossentropy is the loss function, and accuracy is the evaluation metric.

My experiment


Exploratory data analysis


In the RAVDESS dataset, each actor performs 8 emotions, speaking and singing 2 sentences, 2 times each. As a result, each actor yields 4 examples of each emotion, except for the neutral, disgust, and surprise emotions mentioned above. Each audio clip lasts about 4 seconds, and the first and last seconds are usually silent.
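RAVDESS encodes this metadata in its file names as seven dash-separated fields (modality, vocal channel, emotion, intensity, statement, repetition, actor, with odd actor numbers being male). Here is a minimal sketch of how the emotion and gender labels could be parsed from a file name; the helper name is mine, and the emotion codes follow the official RAVDESS documentation:

import os

EMOTIONS = {'01': 'neutral', '02': 'calm', '03': 'happy', '04': 'sad',
            '05': 'angry', '06': 'fearful', '07': 'disgust', '08': 'surprised'}

def parse_ravdess_name(path):
    """Return (emotion, gender) parsed from a RAVDESS file name such as 03-01-06-01-02-01-12.wav."""
    parts = os.path.basename(path).split('.')[0].split('-')
    emotion = EMOTIONS[parts[2]]                      # third field is the emotion code
    gender = 'male' if int(parts[6]) % 2 == 1 else 'female'  # odd actor IDs are male
    return emotion, gender

print(parse_ravdess_name('03-01-06-01-02-01-12.wav'))  # ('fearful', 'female')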

Typical sentences:

Observation


After selecting the data of 1 actor and 1 actress and listening to all of their recordings, I realized that men and women express their emotions in different ways. For example:


Experiment repetition


The author removed the neutral, disgust, and surprised classes to set up 10-class recognition on the RAVDESS dataset. Trying to reproduce the author's experiment, I got the following result:



However, I found that there is a data leak: the dataset used for validation is identical to the test dataset. So I redid the data split, holding out the data of two actors and two actresses so that it is not seen during training:
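A minimal sketch of what such an actor-wise hold-out could look like; the DataFrame columns and the held-out actor IDs here are my own assumptions, not the exact ones used in the experiment:

from sklearn.model_selection import train_test_split

# Assumes a pandas DataFrame `df` with 'actor', 'feature' and 'label' columns (my naming).
test_actors = {21, 22, 23, 24}   # two actors and two actresses held out; illustrative IDs only

test_df = df[df['actor'].isin(test_actors)]
rest_df = df[~df['actor'].isin(test_actors)]

# Train/valid split on the remaining actors, stratified by label.
train_df, valid_df = train_test_split(rest_df, test_size=0.2,
                                      stratify=rest_df['label'], random_state=42)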


I re-trained the model and here is the result:


Performance test


From the Train/Valid Loss graph it is clear that the model does not converge for the selected 10 classes. So I decided to reduce the complexity of the task and keep only the male emotions. I held out two actors for the test set and put the rest into the train/valid sets at an 8:2 ratio, making sure there is no imbalance in the dataset. Then I trained on the male and female data separately to run the test.

Male dataset


Baseline: male


Female dataset


Baseline: female


As you can see, the confusion matrices are different.

Men: Angry and Happy are the dominant predicted classes in the model, but they are not confused with each other.

Women: Sad and Happy are the dominant predicted classes in the model; Angry and Happy are easily confused.

Recalling the observations from the exploratory data analysis, I suspect that the female Angry and Happy classes are confusingly similar because women express both simply by raising their voice.

On top of that, I was curious what would happen if I simplified the model even further, keeping only the Positive, Neutral, and Negative classes, or even just Positive and Negative. In short, I grouped the emotions into 2 and 3 classes respectively (a sketch of such a grouping is shown after the lists below).

2 classes:


3 classes:


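Such a grouping can be expressed as a simple mapping over the emotion labels. The exact assignment of emotions to Positive/Neutral/Negative below is my assumption, since the original lists are not reproduced in this article:

# Hypothetical grouping -- the composition of each class is an assumption.
THREE_CLASS_MAP = {
    'happy': 'positive', 'calm': 'positive',
    'neutral': 'neutral',
    'sad': 'negative', 'angry': 'negative', 'fearful': 'negative',
}
TWO_CLASS_MAP = {
    'happy': 'positive', 'calm': 'positive',
    'sad': 'negative', 'angry': 'negative', 'fearful': 'negative',
}

# emotion_labels is assumed to be a list of per-sample emotion names.
labels_3 = [THREE_CLASS_MAP[e] for e in emotion_labels]
labels_2 = [TWO_CLASS_MAP[e] for e in emotion_labels if e in TWO_CLASS_MAP]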
Before starting that experiment, I tuned the model architecture on the male data, doing 5-class recognition.

# New model
target_class = 5  # number of target classes

model = Sequential()
model.add(Conv1D(256, 8, padding='same', input_shape=(X_train.shape[1], 1)))  # 1
model.add(Activation('relu'))
model.add(Conv1D(256, 8, padding='same'))  # 2
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(Dropout(0.25))
model.add(MaxPooling1D(pool_size=(8)))
model.add(Conv1D(128, 8, padding='same'))  # 3
model.add(Activation('relu'))
model.add(Conv1D(128, 8, padding='same'))  # 4
model.add(Activation('relu'))
model.add(Conv1D(128, 8, padding='same'))  # 5
model.add(Activation('relu'))
model.add(Conv1D(128, 8, padding='same'))  # 6
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(Dropout(0.25))
model.add(MaxPooling1D(pool_size=(8)))
model.add(Conv1D(64, 8, padding='same'))  # 7
model.add(Activation('relu'))
model.add(Conv1D(64, 8, padding='same'))  # 8
model.add(Activation('relu'))
model.add(Flatten())
model.add(Dense(target_class))  # 9
model.add(Activation('softmax'))
opt = keras.optimizers.SGD(lr=0.0001, momentum=0.0, decay=0.0, nesterov=False)

I added 2 Conv1D layers, one MaxPooling1D layer, and 2 BatchNormalization layers; I also changed the dropout value to 0.25. Finally, I changed the optimizer to SGD with a learning rate of 0.0001.

lr_reduce = ReduceLROnPlateau(monitor='val_loss', factor=0.9, patience=20, min_lr=0.000001)
mcp_save = ModelCheckpoint('model/baseline_2class_np.h5', save_best_only=True,
                           monitor='val_loss', mode='min')
cnnhistory = model.fit(x_traincnn, y_train, batch_size=16, epochs=700,
                       validation_data=(x_testcnn, y_test),
                       callbacks=[mcp_save, lr_reduce])

To train the model, I applied learning rate reduction on plateau and kept only the best model, the one with the minimum val_loss. Here are the results for the different target classes.
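As a usage note, the checkpoint saved by ModelCheckpoint can later be restored for evaluation; a minimal sketch, using the file path from the snippet above:

from keras.models import load_model

# Restore the checkpoint with the lowest validation loss and evaluate it on the test set.
best_model = load_model('model/baseline_2class_np.h5')
loss, acc = best_model.evaluate(x_testcnn, y_test, batch_size=16)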

New model performance


Men, 5 classes



Women, 5 classes

Men, 2 classes


Men, 3 classes


Augmentation


Even after strengthening the model architecture, the optimizer, and the learning rate, the model still did not converge during training. I assumed this was a data quantity problem, since we only have about 800 samples. This led me to audio augmentation methods, with which I eventually doubled the datasets. Let's take a look at these methods.

Men, 5 classes


Dynamic value change

def dyn_change(data):
    """Randomly scale the amplitude of the signal."""
    dyn_change = np.random.uniform(low=1.5, high=3)
    return (data * dyn_change)



Pitch Adjustment

def pitch(data, sample_rate):
    """Randomly shift the pitch of the signal."""
    bins_per_octave = 12
    pitch_pm = 2
    pitch_change = pitch_pm * 2 * (np.random.uniform())
    data = librosa.effects.pitch_shift(data.astype('float64'), sample_rate,
                                       n_steps=pitch_change,
                                       bins_per_octave=bins_per_octave)
    return data


Shift

def shift(data):
    """Randomly shift the signal along the time axis."""
    s_range = int(np.random.uniform(low=-5, high=5) * 500)
    return np.roll(data, s_range)


Adding White Noise

def noise(data):
    """Add white noise to the signal."""
    # any distribution can be used, see: https://docs.scipy.org/doc/numpy-1.13.0/reference/routines.random.html
    noise_amp = 0.005 * np.random.uniform() * np.amax(data)
    data = data.astype('float64') + noise_amp * np.random.normal(size=data.shape[0])
    return data
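To double the dataset, each augmentation can be applied to the raw waveform before MFCC extraction and the augmented copies appended to the originals. A minimal sketch of that idea, assuming a list raw_waveforms of loaded signals and a helper extract_mfcc() like the one sketched earlier (both names are mine, not the repository's):

augmented_features, augmented_labels = [], []
for data, label in zip(raw_waveforms, labels):
    augmented_features.append(extract_mfcc(data))           # original sample
    augmented_features.append(extract_mfcc(noise(data)))    # augmented copy (white noise)
    augmented_labels.extend([label, label])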


It is noticeable that augmentation significantly increases accuracy, up to 70+% in the general case. Especially the addition of white noise, which raises the accuracy to 87.19%; however, the test accuracy and F1 score drop by more than 5%. That gave me the idea of combining several augmentation methods for a better result.

Combining several methods


White noise + shift
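Combining augmentations is just function composition over the raw waveform; a minimal sketch using the noise() and shift() helpers defined above:

def noise_and_shift(data):
    """Apply a random time shift and then white noise to the same sample."""
    return noise(shift(data))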


Testing augmentation on men


Men, 2 classes


White noise + shift

For all samples


White noise + shift

Only for the positive samples, since the 2-class set is unbalanced (towards the negative samples).
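A minimal sketch of augmenting only the minority (positive) class, continuing the hypothetical helpers and labels from the sketches above:

for data, label in zip(raw_waveforms, labels_2):
    augmented_features.append(extract_mfcc(data))
    augmented_labels.append(label)
    if label == 'positive':                       # augment only the minority class
        augmented_features.append(extract_mfcc(noise_and_shift(data)))
        augmented_labels.append(label)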


Pitch + White Noise
For all samples


Pitch + White Noise

For positive samples only


Conclusion


In the end, I was only able to experiment with the male dataset. I re-split the data to avoid imbalance and, as a consequence, data leakage. I tuned the model while experimenting with male voices, since I wanted to simplify the model as much as possible to get started. I also ran tests with different augmentation methods; adding white noise and shifting worked well on the unbalanced data.

Findings



Further improvement


Source: https://habr.com/ru/post/461435/

