I am the CTO of Preply, and I want to tell a little about what every programmer dreams of, namely complex and interesting tasks in simple projects. To be more precise, it is about how you can add a bit of science to a business and get some benefit as a result. In this article I will try to describe one case of using Machine Learning in a real project.
Problem
We are the Preply tutoring platform, and everyone wants to deceive us.
Users on our site leave requests for tutors and, after the two sides agree on the conditions, pay for the lessons through the site. If classes are held over Skype, we accept all payments through the site; if tutor and student meet in person, our commission is the cost of the first lesson.
For some reason, tutors and students try to circumvent paying for lessons through the site. To do this, they use the internal messaging system, which is designed for clarifying the details of upcoming classes and becomes available after an application is sent to a tutor. Here are some examples of contact exchange:
My Skype vasiliy.p, tel +789123456. So at 19:00 on April 1!
Good evening! You could write your number, or call mine: +78-975-12-34
I do not want to pay before the lesson, my name is Vasily Pupkin - find me on VKontakte
An experienced programmer will immediately say: “What is the problem? Just write regular expressions for the possible message variants.” There is no problem, but this solution has several disadvantages:
- It is difficult to foresee all variants of incorrect messages (that is, those that contain contacts). For example, the first version of the product had a set of regular expressions for phone numbers, but it misfired and blocked messages like:
Friday - from 13 00-15 00-15 30 ... how much will a group lesson cost?
In a more complicated case, a regular expression for e-mail addresses was used, intended to block messages like:
vasya (dog) pupkin (dot) ru
but at the same time it blocked a completely harmless text (see the sketch after this list):
I know English like a dog: I understand everything, but I cannot say.
The word “Skype” is even more difficult: it is very hard to distinguish messages containing attempts to exchange Skype contacts:
please add me in Skype - vasya82pupkin
from clarifying messages:
do you want to have skype or local lessons?
- There is no control over the threshold of trust: a message is either blocked or not, and to change the logic you need to dig into the code. In real life, errors of the first type (false alarms) are much more costly than errors of the second type (skipped messages): after a false alarm the user writes to support, a support manager spends time apologizing for the incorrect blocking and unblocking the message, not to mention the spoiled experience of using the service. On the other hand, users who exchange contacts rarely become our customers, so errors of the second type are cheaper: we do not make money on those users anyway (yes, this is business).
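To make the first disadvantage concrete, here is a sketch of the kind of naive pattern that produces such false alarms (the production regexes are not shown in this article, so the pattern below is purely illustrative; in Russian, “dog” is slang for the @ sign):

```python
import re

# a hypothetical naive pattern for obfuscated e-mail addresses:
# a word, then "dog" (the @ sign), then another word nearby
email_like = re.compile(r'\w+.{0,3}dog.{0,3}\w+', re.IGNORECASE)

print(bool(email_like.search('vasya (dog) pupkin (dot) ru')))   # True: correctly blocked
print(bool(email_like.search('I know English like a dog: I understand everything, '
                             'but I cannot say.')))             # True: a false alarm
```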
At some point I decided to spend a weekend making the blocking process more scientific. Below is what came of it. I will say right away that my goal was not to do everything correctly, precisely and scientifically, but rather to make something that worked without mistakes and had a positive effect on revenue.
Solution
I decided to try three machine learning methods for classifying messages as correct or incorrect, which I remembered from Andrew Ng’s Machine Learning course on Coursera.
The first problem is preparing the training base. We had over 50,000 messages previously classified by the old system. I took only 5,000 of them and spent about 2–3 hours correcting the classification in those messages where the previous system had made mistakes. In theory, the larger the base, the better, but in the real world it is quite difficult to manually prepare a large sample (in other words, laziness).
One of the nuances of the time-consuming sample preparation process is its ethics. I confess, it would not be very comfortable for me to read other people's messages, so beforehand I jumbled the words so that suspicious messages were visible on a cursory review, but without an understanding of the content. For example:
It was:
I would like to start classes in February, is it possible? I can also tell you the exact time in January, but it definitely won't be until 18:00
It became:
time not classes from February is possible? Exactly in January, too, before 18:00 I’m sure but I can say I would like to start a month, it’s
It was:
I would be glad to be useful, my phone. (012) 345-678 Call, we will agree, thank you
It became:
tel. be glad I would be helpful, my Call, we will agree, thanks (012) 345-678
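The jumbling script itself is trivial; a minimal sketch (the original is not shown in this article) could be:

```python
import random

def jumble(text):
    # destroy the word order but keep the tokens, so phone numbers and
    # e-mail fragments stay visible on a cursory review while the
    # meaning of the message is hidden
    words = text.split()
    random.shuffle(words)
    return ' '.join(words)

print(jumble('I would be glad to be useful, my phone. (012) 345-678'))
```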
The result is a csv file with ~5000 lines, where incorrect messages are marked with zero and correct ones with one. After that, working with the data, we identified a set of message characteristics that "by eye" have an impact on the classification:
- suspicion of a phone number;
- suspicion of an e-mail address;
- suspicion of Skype contacts;
- suspicion of a URL;
- suspicion of social networks;
- correct words that come with numbers (time, currency);
- message length;
- suspicious words: find, add, mine;
- … and so on.
After defining the characteristics, I wrote several regular expressions for each of them, for example:
```python
# -*- coding: utf-8 -*-
import re

SEPARATOR = "|"

# One regular expression per characteristic. The Cyrillic alternatives of the
# original patterns were lost in translation and are elided here.
reg_arr = [
    re.compile(u'facebook|linkedin|vkontakt', re.IGNORECASE | re.UNICODE),  # social networks
    re.compile(u'skype', re.IGNORECASE | re.UNICODE),                       # Skype mentions
    re.compile(u'[а-яіА-ЯІ].*\s[a-zA-Z]', re.IGNORECASE | re.UNICODE),      # Cyrillic followed by Latin
    re.compile(r'\d{3}[-\.\s]??\d{3}[-\.\s]??\d{4}|\(\d{3}\)\s*\d{3}[-\.\s]??\d{4}|\d{3}[-\.\s]??\d{4}'),  # phone numbers
    # ...
    re.compile('http|www|\.com', re.IGNORECASE),                            # urls
    re.compile(u'my', re.IGNORECASE | re.UNICODE),                          # suspicious words
    re.compile(u'find', re.IGNORECASE | re.UNICODE),
    re.compile(u'add', re.IGNORECASE | re.UNICODE),
    # ...
    re.compile(r'\w+@\w+', re.IGNORECASE),                                  # e-mail
    re.compile('^.{0,50}$', re.IGNORECASE),                                 # short message
    re.compile('^.{50,200}$', re.IGNORECASE),                               # medium-length message
    # ...
]

def feature_vector(text):
    # 1 if the pattern is found in the text, 0 otherwise
    return [1 if x.search(text) else 0 for x in reg_arr]

fi = open('db_machine.csv', 'r')
fo = open('db_machine_result.csv', 'w')
for line in fi:
    [text, result] = line.split(SEPARATOR)
    output = feature_vector(text)
    output.append(result.strip())
    fo.write(",".join(str(x) for x in output) + "\n")
fo.close()
fi.close()
```
Accordingly, after processing all messages against all characteristics (by now we have about a hundred of them), we write the feature vector and the classification result for each message into a file.
After preparing the data, the sample must be split into three parts: for training (train set), for parameter selection (cross-validation set), and for evaluation (test set). Following the advice from the course, the three parts are taken in the proportion 60/20/20:
```python
import random

# attach a random key to each line, then sort, to shuffle the sample
with open('db_machine_result.csv', 'r') as source:
    data = [(random.random(), line) for line in source]
data.sort()
n = len(data)

with open('db_machine_result_train.csv', 'w') as target:
    for _, line in data[:int(n*0.60)]:
        target.write(line)
with open('db_machine_result_cross.csv', 'w') as target:
    for _, line in data[int(n*0.60):int(n*0.80)]:
        target.write(line)
with open('db_machine_result_test.csv', 'w') as target:
    for _, line in data[int(n*0.80):]:
        target.write(line)
```
Then, guided by the principle of not reinventing the wheel and getting results as quickly as possible, I took the scripts from the Coursera Machine Learning course and simply ran our samples through the logistic regression, SVM and neural network algorithms. The scripts come straight from the course; the SVM one, for example, looks like this:
```matlab
clear; close all; clc

data_train = load('db_machine_result_train.csv');
X = data_train(:, 1:end-1); y = data_train(:, end);

data_val = load('db_machine_result_cross.csv');
Xval = data_val(:, 1:end-1); yval = data_val(:, end);

data_test = load('db_machine_result_test.csv');
Xtest = data_test(:, 1:end-1); ytest = data_test(:, end);

[C, sigma] = dataset3Params(X, y, Xval, yval);
```
See how the svmTrain / svmPredict functions are implemented on the course website or, for example, here.
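For illustration, a rough Python port of the course's svmPredict for the Gaussian kernel could look as follows; the model dict mirrors the Octave model structure (support vectors X, labels y stored as ±1, the alphas, intercept b, kernel width sigma), and this is a sketch rather than the exact function we run in production:

```python
import numpy as np

def svm_predict(model, X):
    # pairwise squared distances between the rows of X and the support vectors
    d2 = (np.sum(X ** 2, axis=1)[:, None]
          + np.sum(model['X'] ** 2, axis=1)[None, :]
          - 2.0 * X.dot(model['X'].T))
    K = np.exp(-d2 / (2.0 * model['sigma'] ** 2))        # Gaussian kernel values
    scores = K.dot(model['alphas'] * model['y']) + model['b']
    return (scores >= 0).astype(int)                     # 1 = correct, 0 = contains contacts
```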
For all the algorithms, the internal parameters were tuned on the cross-validation sample (λ for regularization, σ and C for the Gaussian kernel, size for the size of the hidden layer of the neural network). The final accuracy results for some of them are shown below:
| Method | Parameters | Accuracy |
|---|---|---|
| Neural network | size = 30, λ = 1 | 96.41% |
| Neural network | size = 30, λ = 0.01 | 97.88% |
| Neural network | size = 30, λ = 0.001 | 98.16% |
| Logistic regression | λ = 0 | 97.51% |
| Logistic regression | λ = 0.01 | 97.88% |
| Logistic regression | λ = 1 | 98.16% |
| SVM | linear kernel (λ = 0.001, σ = 0.001) | 96.48% |
| SVM | Gaussian kernel (λ = 0.1, σ = 0.1) | 97.14% |
| SVM | Gaussian kernel (λ = 0.001, σ = 0.001) | 98.89% |
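The dataset3Params function mentioned above does this tuning as a plain grid search over candidate values. For illustration, an equivalent search in Python with scikit-learn (not what we used; the Octave scripts did the job) might look like this:

```python
import numpy as np
from sklearn.svm import SVC

train = np.loadtxt('db_machine_result_train.csv', delimiter=',')
cross = np.loadtxt('db_machine_result_cross.csv', delimiter=',')
X, y = train[:, :-1], train[:, -1]
Xval, yval = cross[:, :-1], cross[:, -1]

best = (0.0, None, None)
for C in (0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30):
    for sigma in (0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30):
        # scikit-learn parameterizes the Gaussian kernel as gamma = 1 / (2 * sigma^2)
        clf = SVC(C=C, kernel='rbf', gamma=1.0 / (2.0 * sigma ** 2)).fit(X, y)
        acc = clf.score(Xval, yval)          # accuracy on the cross-validation set
        if acc > best[0]:
            best = (acc, C, sigma)

print('best cross-validation accuracy %.4f with C = %s, sigma = %s' % best)
```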
Here it should be clarified that while the system was being prepared, the results were much worse (96.6% for SVM, for example), and debugging gave very tangible improvements. We ran logistic regression, as the simplest and fastest method, on the real data of the entire sample and reviewed the classification results. I was surprised to find that the system was smarter than me: in about 30% of the cases where its output disagreed with the labels, the error was in the human classification (as I wrote, I looked through ~5000 messages and, as it turned out, made about 30-40 labeling errors), while the system had classified them correctly. During debugging we corrected these errors in the database, and the accuracy of the method rose accordingly. Moreover, we expanded the feature vector whenever we saw an interesting pattern that the system did not handle.
We chose the SVM method; its characteristics on the full sample were as follows:

| Forecast \ Fact | Correct | Incorrect |
|---|---|---|
| Correct | 4998 | 36 |
| Incorrect | 11 | 390 |
Since the classes are skewed, I will also give the comparison metrics for the algorithm:
| Precision | Recall | Accuracy |
|---|---|---|
| 99.28% | 99.78% | 99.13% |
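These numbers follow directly from the confusion matrix above, taking “correct” as the positive class:

Precision = 4998 / (4998 + 36) ≈ 99.28%
Recall = 4998 / (4998 + 11) ≈ 99.78%
Accuracy = (4998 + 390) / (4998 + 36 + 11 + 390) ≈ 99.13%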
As a result, we decided to use SVM with a Gaussian kernel to filter messages on the site. It is more complex than logistic regression and works more slowly, but it gives noticeably better results.
The complete message processing path looks like this:
- the user sends a message on the site; Backbone.js creates a model on the client machine and sends a POST request to the server API;
- the server API, written with Django and Tastypie, runs Django form validation on the incoming message;
- the first validator pulls the user profile from the database and checks whether the user is already marked as a violator (no need to check further, a 403 response) or has already made payments through the site (no need to check further, just a 201 response);
- the svmPredict validator returns the result of checking the message text; if the user has violated the rules, the corresponding flag is set in their profile, otherwise everything is fine, the user receives a 201 response from the API, and the message is written to the database;
- if the message contained contacts or the user was already a violator, a 403 response is returned to the client, upon receiving which Backbone renders a message telling the user that they are breaking the rules, and the user is marked in the database as a violator.
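The article does not show the server code, so here is only a minimal sketch of how such a validator chain could be wired into Tastypie, reusing feature_vector and svm_predict from the sketches above; the profile fields and methods (is_violator, has_payments, mark_violator) are hypothetical:

```python
import numpy as np
from tastypie.validation import Validation

# feature_vector and svm_predict are the sketches shown earlier;
# model is assumed to be the trained SVM model loaded at startup

class MessageValidation(Validation):
    def is_valid(self, bundle, request=None):
        profile = request.user.profile                  # hypothetical accessor
        if profile.is_violator:                         # hypothetical flag
            return {'__all__': 'user is blocked'}       # the resource maps this to the 403 above
        if profile.has_payments:                        # hypothetical flag
            return {}                                   # paying users are trusted, skip the check
        features = np.array([feature_vector(bundle.data['text'])])
        if svm_predict(model, features)[0] == 0:        # 0 = message contains contacts
            profile.mark_violator()                     # hypothetical method
            return {'__all__': 'message contains contact details'}
        return {}
```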
So far it works well and we are happy about it.
Conclusions
It is very simple to understand why Machine Learning works better than the old system: it reveals connections between the characteristics that remained hidden from expert observation. For example, we used to have a regular expression and several if-conditions for one event: if the text contains both Cyrillic and Latin, a few numbers, and the message is short, then it is most likely an exchange of contacts. Now we simply detect the individual events, and the system itself figures out the connections between them and builds the rules for us.
We now use SVM in production to classify messages, thanks to its good accuracy. We use it in a very simple way: we took the set of weights of the optimal model and use the ported Python svmPredict function mentioned above for classification. In an ideal world we would build a feedback loop, so that an administrator points out classification errors and the system adjusts its weights and improves. But our project lives in the real world, where time = money, and we are still enjoying the fact that the number of support requests about incorrect blocking has halved. It would also be interesting to balance the threshold of trust and, with it, the errors of the first and second types, but so far everything suits us. Measuring the number of “skipped message” errors is quite difficult; I will only note that the conversion of applications into payments did not fall after the introduction of the system. In other words, even if there are more skips, it does not hurt the business, and by eye the skips have also become fewer. So this is a very good result for one weekend.
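For what it's worth, balancing the threshold of trust would be a small change: the cut-off at zero inside the svm_predict sketch above becomes a tunable parameter:

```python
import numpy as np

def svm_predict_with_threshold(model, X, threshold=0.0):
    # same Gaussian-kernel scores as in the svm_predict sketch above
    d2 = (np.sum(X ** 2, axis=1)[:, None]
          + np.sum(model['X'] ** 2, axis=1)[None, :]
          - 2.0 * X.dot(model['X'].T))
    K = np.exp(-d2 / (2.0 * model['sigma'] ** 2))
    scores = K.dot(model['alphas'] * model['y']) + model['b']
    # raising the threshold blocks more messages (fewer skips, more false
    # alarms); lowering it trades the errors the other way
    return (scores >= threshold).astype(int)
```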
If the topic interests you, I am ready to write about the collaborative filtering approach we use for tutor recommendations. If you want the code, get in touch as well: there is nothing secret in it, and in this article I wanted to describe the pipeline rather than the implementation.
PS: We are growing, and we are looking for two intelligent and responsible programmers for our Kiev office: an intern and a more experienced developer to close the tasks my two hands cannot cover. Our stack is Python/Django and JS/Backbone. Many interesting tasks and best practices.
Email dmytro@preply.com