In October, the Okdesk cloud service team took part in a hackathon in Penza, where we developed a "boxed" Telegram bot for Okdesk. The bot lets the clients of service companies submit service requests, exchange comments on them, and rate how the requests were fulfilled, all from their favorite messenger.
We planned to write an article about this on Habr, but stopped ourselves in time. Honestly, who wants to read today that yet another Telegram bot was built at yet another hackathon? So instead we wrote a sequel to our article on machine learning for classifying tech support requests. In this article, we describe how, once the algorithm has been trained, to turn it into a working service that takes the text of a client request as input and returns the category the request belongs to.
We strongly recommend reading the first part before this text (or at least bookmarking it). Below is a brief summary.
So, service companies provide services to their customers. Clients send requests to customer support: for example, "the Internet does not work" or "a transaction won't post in 1C." In a service company, different people handle different areas: Internet problems are the responsibility of the system administrators group, while 1C problems "fall" on the 1C support group. Routing requests to groups can be assigned to a dispatcher, but that means extra costs (salary) and slower resolution (the dispatcher's routing time is added to the time it takes to resolve the request). It is logical to hand the routing task to a "smart algorithm" that can determine from the request text which area it belongs to.
To solve this problem, a training sample of 1200 request texts with labeled categories (14 categories) was taken. A dictionary (a set of words relevant to the classification) was compiled from the training sample, and all requests were "projected" onto the dictionary (i.e., each request text was mapped to a vector in the dictionary space); the best classification algorithm was then selected by training on the vectors of this sample. To classify a request, the algorithm receives a vector in the dictionary space as input and predicts the request's category as output.
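As a toy illustration of this projection (not the actual code from the first part; the five-word dictionary and the request text here are made up), this is how a short request becomes a vector in the dictionary space:

import numpy as np

# a made-up dictionary (the real one is built from the training sample)
words = [u'internet', u'work', u'1c', u'printer', u'error']
text = [u'the', u'internet', u'does', u'not', u'work']

# each component of the vector counts how often the corresponding
# dictionary word occurs in the request text
vec = np.zeros(len(words))
for w in text:
    if w in words:
        vec[words.index(w)] += 1
print(vec)  # [1. 1. 0. 0. 0.]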
The best algorithm achieved 74.5% classification accuracy on the training sample (quite good for 14 categories), and grateful readers wrote to us in private messages that the same approach showed 92% accuracy on their data (which is already a fully "production" result).
All of this work was done on a laptop; for practical use, the result obtained on the laptop somehow has to be turned into a web service.
Recall that the classification of a new request is carried out in two stages:
1. the request text is projected onto the dictionary space (the result is a vector of word counts);
2. the resulting vector is fed to the trained algorithm, which predicts the category.
Thus, to transfer the classification mechanism from a laptop to a web service, we need to “unload” the resulting dictionary and the trained algorithm.
Further in the text we will use variables from the first part of the article.
Unloading the dictionary is simple. In the first part, we collected all the dictionary words (order is important!) in the list variable words. Now we need to write the words from the words variable to a text file, one word per line.
# the codecs module makes it easy to write utf8 text files
import codecs
# write each dictionary word to words.txt on its own line
with codecs.open('words.txt', 'w', encoding='utf8') as wordfile:
    wordfile.writelines(i + '\n' for i in words)
Training the algorithm is an operation that takes a lot of machine time, so it makes no sense to repeat it every time the classifier starts. If there is a way to save an already trained algorithm so that it can run on other machines, we should use it.
Python offers two options for saving the algorithm.
The built-in pickle module:
import pickle
saved = pickle.dumps(classifier)
classifier2 = pickle.loads(saved)
pickle serializes the model dump into a variable (a bytes string), and that variable can then be written to a file.
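For example, a minimal sketch of writing the pickle dump to disk and reading it back (the classifier.pickle filename is our own choice here):

import pickle

# serialize the trained model into a bytes string...
saved = pickle.dumps(classifier)
# ...write it to disk...
with open('classifier.pickle', 'wb') as f:
    f.write(saved)
# ...and later read it back and restore the model
with open('classifier.pickle', 'rb') as f:
    classifier2 = pickle.loads(f.read())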
The joblib library:
from sklearn.externals import joblib
joblib.dump(classifier, 'classifier.pkl')
classifier2 = joblib.load('classifier.pkl')
joblib does not write the model dump into a variable; it writes the dump directly to a .pkl file.
For our task, it is necessary to save the dump of the algorithm to a file:
from sklearn.externals import joblib
# save the best estimator found during model selection in the first part
joblib.dump(optimazer_tree.best_estimator_, 'model_tree.pkl')
Now we have a second file: model_tree.pkl, a dump of the trained algorithm.
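As a quick sanity check (our own addition, not part of the original workflow), you can load the dump back and make sure the restored estimator agrees with the original; here X_train is a hypothetical name for the training feature matrix from the first part:

from sklearn.externals import joblib

restored = joblib.load('model_tree.pkl')
# the restored estimator should give the same predictions as the original
assert (restored.predict(X_train) == optimazer_tree.best_estimator_.predict(X_train)).all()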
The script that will classify new requests should be able to do the following:
- load the dictionary and the trained algorithm from the files prepared above;
- accept the text of a new request (we will pass it via the command line);
- project the text onto the dictionary space;
- output the category predicted by the algorithm.
Let's get started. First, we import the necessary libraries. We met most of the libraries required for the script either in the first part of the article or (in the case of codecs) just above. The only new one is sys, which we need to work with the command line parameters (through which the request text will be passed to the script).
import numpy as np
import re
import codecs
import sys
# joblib is needed to load the model dump saved earlier
from sklearn.externals import joblib
Now we will load the dictionary and the trained algorithm:
# read the dictionary from words.txt
with codecs.open('words.txt', 'r', encoding='utf8') as wordsfile:
    wds = wordsfile.readlines()
# strip the trailing newline from each word
words = []
for i in wds:
    words.append(i[:-1])
# load the trained algorithm from model_tree.pkl
estimator = joblib.load('model_tree.pkl')
Before projecting a new text onto the dictionary space, we need to split it into words (after first converting the text to lower case). We declare the corresponding functions:
# convert a string to lower case
def lower(str):
    return str.lower()

# split a string into words, treating punctuation, digits and whitespace as separators
def splitstring(str):
    words = []
    # every character listed in the regexp class acts as a separator
    for i in re.split('[;,.,\n,\s,:,-,+,-,(,),=,-,/,«,»,-,@,-,-,\d,!,?,"]', str):
        words.append(i)
    return words
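A quick illustration of what these functions return (the request text here is made up):

# example: split a lower-cased request into words
print(splitstring(lower(u'The Internet does not work!')))
# -> [u'the', u'internet', u'does', u'not', u'work', u'']
# (empty strings can appear around separators, but they are harmless:
# they never match a dictionary word)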
Finally, we declare a function that takes the text of a new request as input and returns the predicted category:
# take the text of a new request and return the predicted category
def class_func(new_issue):
    # lower-case the text and split it into words
    new_issue_words = splitstring(lower(new_issue))
    # a zero vector of length len(words)
    new_issue_vec = np.zeros((1, len(words)))
    # the j-th component counts how many times the j-th dictionary word occurs in new_issue
    for j in new_issue_words:
        if j in words:
            new_issue_vec[0][words.index(j)] += 1
    # feed the vector to the trained algorithm
    return estimator.predict(new_issue_vec)
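A hypothetical call (the request text is made up; predict() returns an array with a single element, the category label):

predicted = class_func(u'the Internet does not work')
print(predicted[0])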
Recall that we plan to pass the text of the new request via the command line. To read the command line arguments in the script, we use the sys library. sys.argv returns the list of command line arguments, with the script name as the zeroth element (this does not matter to us: the script name is not in the dictionary, so it disappears during projection). Since each word of the new request's text arrives as a separate argument, we need to glue the arguments back together into one text inside the script:
new_issue = u''
# glue the command line arguments back into a single text
for i in sys.argv:
    new_issue += ' ' + i.decode('cp1251')
# output the predicted category
print(class_func(new_issue))
Important! Depending on the encoding used in the console, the argument of decode in new_issue += ' ' + i.decode('cp1251') may need to be different.
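A sketch of a slightly more portable variant (our own suggestion, assuming Python 2 as in the article): take the encoding from the console instead of hard-coding it.

import sys

# use the console's own encoding if it is known, otherwise fall back to utf8
encoding = sys.stdin.encoding or 'utf8'
# skip argv[0] (the script name) and glue the remaining arguments together
new_issue = u' '.join(arg.decode(encoding) for arg in sys.argv[1:])
print(class_func(new_issue))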
So, alongside the dictionary file and the dump of the trained algorithm, we now have a third file: a script that does all the work of predicting the category of a new request.
Next, all three files must be placed in one folder on the server, and the necessary environment (Python with the libraries) must be set up. The script can then be invoked from the command line in any way you like (for example, write a simple service in your favorite language that receives the text of a new request over the network, passes it to the script, gets the answer, and sends it back).
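As a minimal sketch of such a wrapper (our own example, not part of the original project; it assumes Flask is installed and that the script above is saved as classify.py):

import subprocess
from flask import Flask, request

app = Flask(__name__)

@app.route('/classify', methods=['POST'])
def classify():
    text = request.form['text']
    # pass each word of the request text as a separate command line argument,
    # exactly as the classification script expects
    out = subprocess.check_output(['python', 'classify.py'] + text.split())
    return out

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)

You could then test it with something like curl -X POST -d 'text=the Internet does not work' http://localhost:5000/classify. Spawning a process per request is fine for a sketch; for real load, it would make more sense to load the model once and call class_func directly inside the web service.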
That's it! Good luck solving your ML problems. As for us, we will keep developing Okdesk, the most convenient (in our company's opinion :)) help desk system for customer service in service companies, while continuing to explore how "smart algorithms" can help solve service-related problems.
Source: https://habr.com/ru/post/342796/