Hello, colleagues! In this article I will briefly describe the challenges we ran into while building a system for classifying customer calls to a contact center.
Identifying the topics of calls is used to track trends and to find recordings worth listening to. Traditionally, this task is solved by having the operator assign the appropriate tag, but with this approach the human factor plays a large role and many operator man-hours are spent.

To solve this problem, our team at Data4 developed a system that determines call topics based on text classification.
The input was a two-channel WAV file with a sampling rate of 8 kHz. The file was transcribed with a speech recognition system. Our experience showed that recognition quality for spontaneous Russian speech on our data was 60-70% by the WER (word error rate) metric. This quality makes it difficult to use methods that decompose sentences into graph structures and makes the transcripts hard to read manually, but it is sufficient for statistical analysis. A toy illustration of how WER is computed is given below.
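For reference, WER can be computed with, for example, the jiwer library; the sentences below are made-up examples, not our data:

```python
# A minimal illustration of the WER metric using the jiwer library.
# The reference and hypothesis are toy examples, not real transcripts.
from jiwer import wer

reference = "добрый день я хотел бы узнать баланс по счету"
hypothesis = "добрый день я хотел узнать баланс счету"

# Two words were dropped out of nine, so WER is about 0.22.
print(wer(reference, hypothesis))
```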
We also tested the hypothesis that, besides the text itself, speech characteristics such as pauses, interruptions, and the ratio of the operator's speaking time to the subscriber's speaking time can affect prediction quality. To extract these features, we used a speech presence (voice activity) detector; a rough sketch of this kind of feature extraction is given below.
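The sketch below shows how such features could be computed from a two-channel recording with a simple energy-based voice activity detector; the file name, frame length, and threshold are illustrative assumptions, not our production settings:

```python
# Sketch: energy-based speech presence detection on a 2-channel, 8 kHz WAV file.
# Channel 0 is assumed to be the operator, channel 1 the subscriber.
import numpy as np
from scipy.io import wavfile

def speech_mask(signal, rate, frame_ms=30, threshold=0.02):
    """Return a boolean array: True for frames whose RMS energy exceeds the threshold."""
    frame_len = int(rate * frame_ms / 1000)
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len).astype(np.float64)
    frames /= 32768.0  # 16-bit PCM -> [-1, 1]
    rms = np.sqrt((frames ** 2).mean(axis=1))
    return rms > threshold

rate, audio = wavfile.read("call.wav")             # hypothetical file name
operator, subscriber = audio[:, 0], audio[:, 1]

op_mask = speech_mask(operator, rate)
sub_mask = speech_mask(subscriber, rate)

features = {
    "operator_speech_share": op_mask.mean(),
    "subscriber_speech_share": sub_mask.mean(),
    "speech_ratio": op_mask.sum() / max(sub_mask.sum(), 1),  # operator vs subscriber
    "silence_share": (~op_mask & ~sub_mask).mean(),          # pauses
    "overlap_share": (op_mask & sub_mask).mean(),            # both speak: interruptions
}
```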
Evaluation showed that the features obtained from signal processing did not make a positive contribution to our model. Training was conducted on a small sample (1 thousand records per class); with a larger training sample, a different result is possible.
To build a text-based classifier, the texts had to be converted into feature vectors. For this we used the TF-IDF method. TF-IDF is a statistical measure of how important a word is to a document within a collection of documents: the weight of a word is proportional to the number of times it is used in the document and inversely proportional to how often it is used in the other documents of the collection. To reduce dimensionality, word forms were lemmatized.
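In the classic formulation, the weight of a term t in a document d is tf-idf(t, d) = tf(t, d) * log(N / df(t)), where tf(t, d) is the number of occurrences of t in d, N is the number of documents in the collection, and df(t) is the number of documents that contain t; libraries such as scikit-learn use smoothed variants of this formula.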

To exclude both rarely used and overly frequent words, we used a stop-word list for the Russian language and experimentally limited the feature vector length to 3000 and the minimum token frequency to 2. We added obscene vocabulary, interjections, conjunctions, and particles to the stop-word list, since in the overwhelming majority of cases they were either the result of speech recognition errors or carried no essential information. The remaining words carry enough information for their vector representation to be used to train the topic classifier. A sketch of such a vectorizer is given below.
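A minimal sketch of this vectorization step, assuming scikit-learn for TF-IDF, pymorphy2 for lemmatization, and the NLTK Russian stop-word list; the variable names and the exact stop-word set are assumptions:

```python
# Sketch: TF-IDF vectorization of lemmatized transcripts.
import nltk
import pymorphy2
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer

nltk.download("stopwords")
morph = pymorphy2.MorphAnalyzer()
russian_stopwords = stopwords.words("russian")  # extended in practice with obscene words,
                                                # interjections, conjunctions, particles

def lemmatize(text):
    """Replace each word with its normal form (naive whitespace split for the sketch)."""
    return " ".join(morph.parse(word)[0].normal_form for word in text.split())

vectorizer = TfidfVectorizer(
    preprocessor=lemmatize,
    stop_words=russian_stopwords,
    max_features=3000,   # feature vector length, limited experimentally
    min_df=2,            # minimum token frequency
)

# texts: list of transcribed calls (strings)
# X = vectorizer.fit_transform(texts)
```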

F-measure was chosen as the quality metric. The F-measure takes into account both precision and recall and is calculated by the formula F = 2 * P * R / (P + R), where P is precision and R is recall.
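For example, with P = 0.9 and R = 0.8, F = 2 * 0.9 * 0.8 / (0.9 + 0.8) ≈ 0.85.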
To minimize overfitting, L2 regularization and 10-fold cross-validation were used.
We used binary classifiers, based on the assumption that each topic can be distinguished by contrasting it with all the remaining topics, with the topics themselves organized as a tree. A sketch of this one-vs-rest setup is given below.
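A sketch of this setup, assuming the L2-regularized logistic regression from scikit-learn and the TF-IDF features from the previous sketch; the function and variable names are illustrative:

```python
# Sketch: one binary classifier per topic, evaluated with 10-fold cross-validation.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def train_topic_classifiers(X, topics):
    """X: TF-IDF feature matrix, topics: topic label for each call."""
    topics = np.asarray(topics)
    classifiers = {}
    for topic in np.unique(topics):
        y = (topics == topic).astype(int)                  # this topic vs all the rest
        clf = LogisticRegression(penalty="l2", max_iter=1000)
        scores = cross_val_score(clf, X, y, cv=10, scoring="f1")
        print(f"{topic}: mean F1 = {scores.mean():.2f}")
        clf.fit(X, y)
        classifiers[topic] = clf
    return classifiers
```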

Testing of algorithms showed that logistic regression and a random forest of decision trees give the best results for classifying the texts of the calls. Logistic regression showed stable results across several data sets, while the random forest achieved the maximum quality but required additional manual tuning when the data set changed. A sketch of this comparison is given below.
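The sketch assumes the scikit-learn implementations of both algorithms; the hyperparameters are placeholders, not the values we ended up with:

```python
# Sketch: comparing logistic regression and a random forest on the same binary task.
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def compare_models(X, y):
    """X: TF-IDF feature matrix, y: binary labels for one topic (as in the previous sketch)."""
    models = {
        "logistic_regression": LogisticRegression(penalty="l2", max_iter=1000),
        "random_forest": RandomForestClassifier(n_estimators=300, random_state=42),
    }
    for name, model in models.items():
        scores = cross_val_score(model, X, y, cv=10, scoring="f1")
        print(f"{name}: mean F1 = {scores.mean():.2f} (+/- {scores.std():.2f})")
```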
By the weighted F1 metric, a quality of 0.98 was achieved for classes containing at least 1 thousand examples. It should be noted that such quality was reached only on part of the test data. For some classes containing 250-300 examples, the maximum value was 0.7. This is explained by how clearly the topics are separated from one another and by how frequent a topic is in the training texts. Thus, the quality of classifying calls as target or non-target will be higher than the quality of classifying customer requests by specific service, and more common topics will be classified better than rarer ones.
Summary:
To classify the topics of contact center calls, it is rational to use an algorithm based on logistic regression to achieve stable quality, or an algorithm based on a random forest of decision trees, which needs additional tuning. The input to the algorithm is a feature vector obtained from the text. To achieve high quality by the F1 metric, the training set should contain at least 1 thousand examples of each class.
Useful links for working with texts:
BigARTM - State-of-the-art Topic Modeling
Gensim - Topic Modelling for Humans
Review of approaches to text classification
Neural network classification
SVM classification

P.S. I thank Anna Larionov for her contribution to the preparation of the article and the development of the solution.