Voice, sound, and sound-wave analysis: acoustics is one of the most interesting and complex data channels in the multimodal approach to detecting and recognizing human emotions. Among other things, working with this source of information poses problems of a different order for researchers, and solving them opens up new scientific and technological prospects. At Neurodata Lab, where emotions are our core subject, we have tackled one fundamental problem: separating the voices in a single-channel recording, reaching an accuracy above 91-93% for English, Russian, and some other key languages (the experiments continue; priority is given to the first two).

Of course, at the moment we are preparing a full-fledged paper, as well as assembling and packaging a future commercial product, so here we only briefly outline our work in this area, with an invitation to discuss the results after they are published and presented at conferences in the first half of 2018.
So, what do we have as of today? A working prototype of a system that solves the following tasks under the following conditions:
- The input is a single-channel recording of a conversation between two (potentially more) people, in WAV format;
- All fragments where two (or more) voices sound simultaneously are removed from the recording; this removal is needed so that the speech of a specific person can be processed further, for example, to determine voice characteristics and the speaker's emotional state;
- The remaining fragments of the recording are divided into two groups so that each group contains the speech of only one specific person;
- The output is two audio channels: in the first, the speech of one person is heard; in the second, that of the other. The original timing is preserved. A minimal sketch of this pipeline is given below.
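To make the flow concrete, here is a minimal end-to-end sketch of such a pipeline. The three callables, the soundfile I/O library, and the output file names are assumptions for illustration; the prototype's actual interfaces are not published here.

```python
import numpy as np
import soundfile as sf  # assumed I/O library; not named in the source


def separate_two_voices(wav_path, phrase_selector, overlap_detector,
                        group_phrases):
    """End-to-end sketch of the pipeline (hypothetical API).

    The three callables stand in for the subsystems described below;
    their real interfaces are not published.
    """
    signal, sr = sf.read(wav_path)
    out_a = np.zeros_like(signal)   # timing is preserved: outputs keep
    out_b = np.zeros_like(signal)   # the original sample positions

    # 1. Cut the recording into phrases between micropauses.
    phrases = phrase_selector(signal, sr)
    # 2. Discard phrases where two or more voices sound at once.
    solo = [(s, e) for (s, e) in phrases
            if not overlap_detector(signal[s:e], sr)]
    # 3. Group the remaining phrases into two speakers by voice.
    speaker_a, speaker_b = group_phrases(solo, signal, sr)
    for s, e in speaker_a:
        out_a[s:e] = signal[s:e]
    for s, e in speaker_b:
        out_b[s:e] = signal[s:e]

    sf.write("speaker_a.wav", out_a, sr)
    sf.write("speaker_b.wav", out_b, sr)
```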
The technological basis of the solution consists of three subsystems:
- A phrase selector;
- A simultaneous speech detector;
- Voice ID.
Phrase selector
In this context, a phrase is a continuous stretch of speech between two micropauses. The notion is imprecise and conditional: the output of the phrase selector depends strongly on the peculiarities of pronunciation (abrupt or "smooth", continuous speech), on the parameters that define a "micropause", and so on. With typical settings, a phrase is usually a sequence of phonemes, syllables, or sometimes words, lasting from 0.2 seconds to several seconds. The exact settings of the phrase selector will be given in its technical description. A minimal sketch of such segmentation follows.
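For illustration, here is a minimal sketch of phrase segmentation under an assumed energy-based silence criterion; the frame size, thresholds, and pause/phrase durations below are hypothetical, not the selector's actual settings.

```python
import numpy as np


def split_into_phrases(signal, sr, frame_ms=20, energy_thresh=1e-4,
                       min_pause_s=0.15, min_phrase_s=0.2):
    """Split a mono signal into phrases separated by micropauses.

    A hypothetical energy-based criterion; the actual selector's
    features and thresholds are not published here.
    """
    frame = int(sr * frame_ms / 1000)
    n = len(signal) // frame
    # Per-frame energy; frames below the threshold count as silence.
    energy = np.array([np.mean(signal[i * frame:(i + 1) * frame] ** 2)
                       for i in range(n)])
    voiced = energy > energy_thresh

    phrases, start, silence_run = [], None, 0
    for i, v in enumerate(voiced):
        if v:
            if start is None:
                start = i
            silence_run = 0
        elif start is not None:
            silence_run += 1
            # A silence run long enough to be a micropause ends the phrase.
            if silence_run * frame / sr >= min_pause_s:
                end = i - silence_run + 1
                if (end - start) * frame / sr >= min_phrase_s:
                    phrases.append((start * frame, end * frame))
                start, silence_run = None, 0
    if start is not None and (n - start) * frame / sr >= min_phrase_s:
        phrases.append((start * frame, n * frame))
    return phrases  # list of (sample_start, sample_end)
```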
The point of using the phrase selector is as follows. If we discard the moments when two voices sound simultaneously, the remaining recording is an alternating (non-overlapping) sequence of single-voice sections, and in most cases the change of speaker falls on phrase boundaries.
This assumption is not entirely accurate: in practice, a non-trivial transition from one speaker to another within a phrase does happen. However, such cases are indeed rare, and in the proposed prototype their main negative impact comes down to the incorrect formation of the reference fragments of the two speakers' voices; it is partially mitigated by the way those reference fragments are formed.
Thus, modulo phrases containing a speaker transition, the remaining work (after extracting phrases and discarding moments of simultaneous speech) reduces to voice identification of the phrases. A sketch of such grouping follows.
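As an illustration of that reduction, here is a hypothetical greedy grouping that assigns each phrase to one of two speakers using a pairwise scorer same_voice(a, b) in [0, 1], following the Voice ID convention described below (a score under 0.5 means "same speaker"). The prototype's actual grouping logic, including how reference fragments are formed, is not published.

```python
def group_phrases(phrases, same_voice, thresh=0.5):
    """Greedily assign phrases to two speakers via pairwise voice scores.

    `phrases` is a list of audio fragments; `same_voice(a, b)` is a
    hypothetical scorer returning a value in [0, 1], where values below
    `thresh` mean "same speaker". This is an illustrative sketch, not
    the prototype's actual logic.
    """
    if not phrases:
        return [], []
    speaker_a = [phrases[0]]          # seed speaker A with the first phrase
    speaker_b = []
    for p in phrases[1:]:
        # Compare against a reference fragment of each speaker.
        score_a = same_voice(speaker_a[0], p)
        if not speaker_b:
            # A clearly different voice seeds speaker B.
            (speaker_a if score_a < thresh else speaker_b).append(p)
            continue
        score_b = same_voice(speaker_b[0], p)
        # Assign to whichever reference the phrase is closer to.
        (speaker_a if score_a <= score_b else speaker_b).append(p)
    return speaker_a, speaker_b
```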
Simultaneous speech detector
Beyond its primary function (we need only fragments of single-voice speech), the detector lets us keep only those phrases (or parts of them) where a single voice sounds (modulo phrases with a speaker transition, discussed above), thereby reducing the task to the voice identification problem.
The simultaneous speech detector is based on a visual observation: in regions of simultaneous speech, the log spectrogram (or its time derivative) contains characteristic irregularities that are absent in single voices and are easily distinguishable by eye. Examples will be given in the detector's description.
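For reference, the two representations mentioned above can be computed as follows; the STFT parameters are illustrative, not the detector's settings.

```python
import numpy as np
from scipy import signal as sps


def log_spectrogram(x, sr, n_fft=512, hop=128):
    """Log-magnitude spectrogram and its time derivative.

    Parameter values are illustrative, not the detector's settings.
    """
    f, t, spec = sps.stft(x, fs=sr, nperseg=n_fft, noverlap=n_fft - hop)
    log_spec = np.log(np.abs(spec) + 1e-8)
    # Finite difference along time approximates the time derivative.
    d_log_spec = np.diff(log_spec, axis=1)
    return log_spec, d_log_spec
```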
Given this observation, the solution is based on 2D convolutional networks, which are well suited to distinguishing such graphic features. The current prototype also contains additional 1D convolutional network components to improve detection quality.
The idea behind the detector turned out to be quite successful in the sense that it catches not only moments of simultaneous speech but, as a rule, other harmful sound events as well: applause, laughter (especially audience laughter), and so on.
The detector outputs a number from 0 to 1. For classification, if this number is less than 0.5, the fragment under consideration is assumed to contain no two simultaneous voices; otherwise the voices are considered to "overlap". A minimal sketch of such a detector is given below.
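Here is a minimal sketch of such a detector, written in PyTorch for illustration: a small 2D CNN over log-spectrogram patches ending in a sigmoid, thresholded at 0.5. The actual architecture, input shape, and the additional 1D components are not published; this only shows the general idea.

```python
import torch
import torch.nn as nn


class OverlapDetector(nn.Module):
    """Hypothetical 2D-CNN sketch of a simultaneous-speech detector.

    Operates on log-spectrogram patches (1 x freq_bins x time_frames).
    """

    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(1),   # collapse to one vector per patch
        )
        self.head = nn.Sequential(nn.Flatten(), nn.Linear(32, 1))

    def forward(self, log_spec):
        # Score in (0, 1); >= 0.5 is read as "voices overlap".
        return torch.sigmoid(self.head(self.features(log_spec)))


detector = OverlapDetector()
patch = torch.randn(1, 1, 128, 64)      # (batch, 1, freq bins, frames)
overlap = detector(patch).item() >= 0.5
```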
The main limitation in applying the detector at present is recordings with noticeable reverberation ("echoing" rooms, distinct echoes, etc.), in which the effect of simultaneous speech is, in a sense, reproduced.
Voice ID
This is one of the prototype's main subsystems, and it solves the following problem: given two single-voice speech fragments of arbitrary length, determine whether they belong to one person's voice or to the voices of different people.
It is based on a neural network trained on 100 male and 100 female voices (the samples are continuously being expanded and diversified). The result is a number from 0 to 1: if it is less than 0.5, the fragments are considered to belong to one person's voice; otherwise, to different people. A sketch of such a pairwise comparison is given below.
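The source does not describe the network's architecture; one common way to implement such a pairwise verifier is a siamese setup, sketched here as an assumption: a shared encoder embeds each fragment, and the distance between embeddings is mapped to a score in (0, 1).

```python
import torch
import torch.nn as nn


class VoiceVerifier(nn.Module):
    """Hypothetical siamese-style verifier for the Voice ID task.

    A shared 1D-CNN encoder embeds each fragment's feature sequence
    (e.g. log-mel frames, shape: batch x n_mels x frames); the distance
    between the two embeddings is mapped to a score in (0, 1), where
    < 0.5 means "same speaker". The actual Voice ID architecture is
    not published; this only illustrates the input/output contract.
    """

    def __init__(self, n_mels=40, emb_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv1d(n_mels, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(64, emb_dim, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(),  # fragment embedding
        )
        self.scale = nn.Linear(1, 1)  # maps distance to a logit

    def forward(self, frag_a, frag_b):
        dist = torch.norm(self.encoder(frag_a) - self.encoder(frag_b),
                          dim=1, keepdim=True)
        return torch.sigmoid(self.scale(dist))  # < 0.5 => same voice


verifier = VoiceVerifier()
a = torch.randn(1, 40, 200)   # ~2 s of log-mel frames (hypothetical)
b = torch.randn(1, 40, 150)   # pooling handles the different lengths
same_person = verifier(a, b).item() < 0.5
```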
The quality of the decision depends directly on the length of the speech fragments: the shorter they are, the lower the quality. In practice, the error becomes significant on fragments shorter than 0.3-0.4 seconds. We will say more about this in the identifier's technical description and in the paper.
At present, we continue to refine the solution for the shortest possible speech fragments, and the results are certainly encouraging.
Graphically, the scheme is shown in the figure:

Project curator: Mikhail Grinenko, Ph.D., scientific consultant at Neurodata Lab for deep learning and data analysis.