The analysis of restaurant user reviews was part of the SentiRuEval-2015 evaluation task held at the Dialog-2015 conference. In this article we will discuss what such analyzers actually do, why they are needed in practice, and how to build such a tool yourself using the Meanotek NeuText API.
Aspect-based analysis of reviews is usually divided into several stages. Consider, for example, the sentence "Japanese dishes were tasty, but the waiter was slow." At the first stage we extract the important words and phrases: here they are "Japanese dishes", "tasty", "waiter", "slow". This tells us what the sentence is about. Next, we may want to group the terms, for example assigning "dishes" and "tasty" to food and "waiter" to service. Such grouping makes it possible to produce aggregated statistics. Finally, we may want to determine the sentiment of each term, that is, whether something positive or negative is said about it.
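To make the stages concrete, here is a minimal C# sketch of what the output of each stage might look like for this sentence. The types, field names and category labels are purely illustrative and are not part of any API described below:

using System;
using System.Collections.Generic;

// Illustrative only: a simple record showing what each stage of
// aspect-based review analysis produces for one sentence.
class AspectTerm
{
    public string Text;      // stage 1: the extracted term, e.g. "Japanese dishes"
    public string Category;  // stage 2: "Food", "Service", ...
    public string Sentiment; // stage 3: "positive" / "negative"
}

class PipelineSketch
{
    static void Main()
    {
        // Stage 1: extract the important terms from the sentence.
        var terms = new List<AspectTerm>
        {
            new AspectTerm { Text = "Japanese dishes" },
            new AspectTerm { Text = "tasty" },
            new AspectTerm { Text = "waiter" },
            new AspectTerm { Text = "slow" }
        };

        // Stage 2: group the terms into aspect categories.
        terms[0].Category = "Food";    terms[1].Category = "Food";
        terms[2].Category = "Service"; terms[3].Category = "Service";

        // Stage 3: assign sentiment where it is expressed.
        terms[1].Sentiment = "positive";
        terms[3].Sentiment = "negative";

        foreach (var t in terms)
            Console.WriteLine($"{t.Text} | {t.Category} | {t.Sentiment}");
    }
}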
Why is this needed? Answering that question is not as simple as it seems. In the view of the task organizers, the goal is ultimately to assess the sentiment of the review as a whole, that is, to conclude that the author of a given review considers the service good and the interior bad. But this task is already solved today by other means: most review sites ask the user to fill in per-aspect ratings manually when leaving a review. The availability of such information sharply reduces the value of this kind of automated analysis. You can, of course, extract additional aspects or analyze posts from forums, but for the end user all of this is of secondary importance.
But suppose we need to solve the inverse problem. For example, you are a restaurant owner, and you want to know why the "interior" section shows a bad rating. On sites with manual per-aspect ratings the link back to the source text is mostly lost, so you would have to read through every review with a negative score to find the relevant information. And reviews, frankly, tend to be long and full of filler, along the lines of "my friend had a birthday yesterday. We got together and spent a long time deciding where to go. Usually we..." and so on. By extracting the important aspect terms, you can highlight them in the text, show only the sentences that contain them, or even count them and display a summary like the one below (a short counting sketch follows the list):
Loud music - 34
Smoky - 8
Air conditioning - 4
Dirty restroom - 2
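Here is a rough sketch of how such a summary could be built once the aspect terms have been extracted. The extraction itself is assumed to have already happened (for example, by the model trained below); only the normalization and counting are shown:

using System;
using System.Collections.Generic;
using System.Linq;

class TermSummary
{
    // Counts how often each (normalized) aspect term occurs across reviews.
    static Dictionary<string, int> CountTerms(IEnumerable<string> extractedTerms)
    {
        return extractedTerms
            .Select(t => t.Trim().ToLowerInvariant())
            .GroupBy(t => t)
            .ToDictionary(g => g.Key, g => g.Count());
    }

    static void Main()
    {
        // In practice this list would come from the model's output over all reviews.
        var terms = new[] { "Loud music", "loud music", "Smoky", "Air conditioning" };
        foreach (var pair in CountTerms(terms).OrderByDescending(p => p.Value))
            Console.WriteLine(pair.Key + " - " + pair.Value);
    }
}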
By reducing the labor cost of this analysis we improve business efficiency, and the restaurant can respond more quickly to problems and requests from visitors. Of course, not every restaurant owner actually monitors reviews this carefully, but that is a separate matter.
Implementation: To implement this with the Meanotek NeuText API you will need a free API key; if you do not have one yet, you can get it here. As last time, we need a training sample. In the training data created by the SentiRuEval-2015 developers, explicit terms (dishes, food, waiter, restaurant, table, etc.) and implicit terms (tasty, loud, over-salted) are distinguished. You can use the ready-made markup or come up with your own notation.
The SentiRuEval-2015 source samples are publicly available as XML files. The data comes without any preprocessing (no splitting into sentences, words, etc.), so we have prepared it in the format our API expects (download). Our sample contains only explicit aspect terms and is split into two files: rest_expl_train.txt with the data for training the model and rest_expl_test.txt with the data for checking the results (both are made from the original SentiRuEval-2015 training sample). The format is one token per line, with the label in the second column (a small preprocessing sketch follows the format example below):
Japanese | explicit |
dishes | explicit |
were | |
tasty | implicit |
, | |
waiter | explicit |
... | |
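For illustration, here is a minimal sketch of how a review sentence could be turned into this one-token-per-line format. This is not the script that actually produced the downloadable files; the whitespace-based tokenization and the hard-coded list of explicit terms are simplifying assumptions (in a real converter the labels would come from the SentiRuEval XML markup):

using System;
using System.IO;
using System.Linq;
using System.Text.RegularExpressions;

class TrainFileWriter
{
    static void Main()
    {
        // Tokens of the example sentence; punctuation is already space-separated
        // here to keep the tokenizer trivial (real data needs proper tokenization).
        string sentence = "Japanese dishes were tasty , but the waiter was slow .";

        // Hypothetical: which tokens belong to explicit aspect terms.
        var explicitTokens = new[] { "Japanese", "dishes", "waiter" };

        using (var writer = new StreamWriter("rest_expl_example.txt"))
        {
            foreach (string token in Regex.Split(sentence, @"\s+"))
            {
                string label = explicitTokens.Contains(token) ? "explicit" : "";
                writer.WriteLine(token + " | " + label + " |");
            }
        }
    }
}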
There is one subtle point here. If "waiter" and "fish dishes" happen to stand next to each other in the text, as in "brought the waiter fish dishes", the whole span "waiter fish dishes" will be marked as a single term, although there are really two: "waiter" and "fish dishes". For this reason the first and last words of a term are often given separate labels, so that adjacent terms can be split apart later. This is not always justified, though: without enough training examples of such adjacent terms the model may still fail to learn where a term begins and ends, and the larger number of classes requires a larger training sample to reach adequate quality. So it makes sense to try both options and compare the results, but if time is short, the first option (a single label per term) works too.
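If you do try the second notation, only the label column changes. Since you can define your own notation, the label names below ("explicit-single", "explicit-first", "explicit-last") are made up purely for illustration; the point is that the problematic phrase becomes separable, at the cost of growing the label set from one class to three:

brought | |
the | |
waiter | explicit-single |
fish | explicit-first |
dishes | explicit-last |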
You can create the model the same way as in the previous example, where we extracted product names. To simplify working with the API, you can use the client library for the .NET Framework:

Model MyModel = new Model("your api key", "RestExplModel");
MyModel.CreateModel();
Console.WriteLine("Model created");
MyModel.UploadTrainData("rest_expl_train.txt");
Console.WriteLine("Training data uploaded");
MyModel.UploadTestData("rest_expl_test.txt");
Console.WriteLine("Test data uploaded");
MyModel.TrainModel();
An executable example is also included, which lets you upload arbitrary files and check the results without writing any code.
After training the model, we request statistics on the test sample, as well as the analysis of a new example:
Console.WriteLine(MyModel.GetValidationResults());
string p = MyModel.GetPredictionsJson("Japanese dishes were tasty, but the waiter was slow");
Console.WriteLine(p);
Here is what we got:
[output screenshot]
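If you want to process the returned JSON programmatically rather than just print it, you will need to parse it. The exact response schema is not shown here, so the sketch below assumes a simple array of objects with "word" and "class" fields; inspect the raw output first and adjust the field names if the real schema differs. It uses Json.NET (Newtonsoft.Json):

using System;
using Newtonsoft.Json.Linq; // Json.NET

class PredictionReader
{
    // Assumed response shape: [ { "word": "...", "class": "explicit" }, ... ]
    // The real field names may differ; print the raw JSON first and adjust.
    static void PrintAspectTerms(string json)
    {
        foreach (JObject token in JArray.Parse(json))
        {
            string word = (string)token["word"];
            string label = (string)token["class"];
            if (label == "explicit")
                Console.WriteLine(word);
        }
    }

    static void Main()
    {
        string sample = "[{\"word\":\"waiter\",\"class\":\"explicit\"},{\"word\":\"was\",\"class\":\"\"}]";
        PrintAspectTerms(sample); // prints: waiter
    }
}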
Especially for this article, I also put together an online demo in PHP: a form where you can enter text and see the extracted terms.
More information about the text information extraction API can be found in the previous post, and the technical details of how it works are described in our paper published in the proceedings "Computational Linguistics and Intellectual Technologies" (in English).