InterSystems iKnow. Part one. iKnow and beach holidays

I have long wanted to write an article about iKnow technology. Three years have passed since it first appeared, but so far nothing has been published about applying it in Russian-language solutions. The explanation is quite simple: there was no full support for the Russian language. With each new release since Cache 2013.1, however, the situation has improved. Finally, we decided to implement our first project on iKnow. Read on to find out how it went, what worked, and what did not.

So, as I said, until now there were no real Russian-language solutions built with iKnow, although support for the Russian semantic model first appeared in version 2013.1. At some point it became clear that what works for Latin languages is not suitable for Russian (or for any other Slavic language). The reason is the variety of forms that a single word can take. When iKnow analyzes a text, it counts concepts (for our purposes, a concept can be thought of as a noun), and different grammatical forms of the same word, such as “apple” and “apples” in their various cases, are treated as completely different terms and counted separately. As a result, iKnow may interpret a recipe for charlotte (apple cake), for example, as an article on the maintenance of electric ovens, because the single form “oven” (turn on the oven, open the oven, turn off the oven, and so on) occurs in the text more often than any one form of the word “apple”. That was the difficulty. Lemmatization, which reduces a word to its normal (dictionary) form, helped to overcome it.
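The counting problem described above can be illustrated with a small Python sketch. It is not iKnow code: the `LEMMAS` table is a hypothetical hand-written mapping used only for illustration, whereas a real system would rely on a morphological dictionary such as the Hunspell-based lemmatizer mentioned in this article.

```python
from collections import Counter

# Hypothetical toy lemma table (assumption for illustration only).
LEMMAS = {
    "яблоко": "яблоко",    # apple, nominative singular
    "яблоки": "яблоко",    # apples, nominative plural
    "яблок": "яблоко",     # of apples, genitive plural
    "духовка": "духовка",  # oven, nominative singular
    "духовку": "духовка",  # oven, accusative singular
}

def count_terms(tokens, lemmatize=False):
    """Count terms, optionally mapping each word form to its lemma first."""
    if lemmatize:
        tokens = [LEMMAS.get(token, token) for token in tokens]
    return Counter(tokens)

# Three different forms of "apple" and one repeated form of "oven".
tokens = ["яблоко", "яблоки", "яблок", "духовку", "духовку"]

raw_counts = count_terms(tokens)
lemma_counts = count_terms(tokens, lemmatize=True)

top_raw = raw_counts.most_common(1)[0]
top_lemma = lemma_counts.most_common(1)[0]
```

Without lemmatization the most frequent term is a form of “oven”; once forms are mapped to lemmas, “apple” comes out on top.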
Then, in the Cache 2015.1 Field Test, support for lemmatization appeared, implemented using the Hunspell library. This made it possible to create full-fledged applications for analyzing texts in Russian and Ukrainian. I immediately wanted to build something that would be a good practical example of using iKnow rather than a useless analogue of “Hello world”. And such a task was found!
We were given a database of 27,000 reviews of the 100 most popular hotels in Turkey and Egypt. The set of priority tasks was determined right away. But let me describe everything in order.
What is a tourist review? First of all, it is unstructured data, or simply text (the term “unstructured text”, which many people like to use, seems meaningless to me). People returning from a holiday (we considered beach holidays) go to the portal and rate the hotel they stayed in, either as a whole or by category, such as service, food, hospitality, and so on. Then they describe the holiday in their own words, noting what was good and what was bad. Numeric ratings (for example, on a five-point scale) are metadata that the portal administration can easily use to calculate hotel ratings. Often, however, people just write the text and forget to give ratings. There are quite a lot of such cases: in our project, more than half of the reviews contain only text, with no numeric ratings. For rating purposes, such a review is useless. This was the first task to solve: to teach iKnow to calculate a hotel's rating solely from the text of a review.
The remaining tasks were also formulated fairly quickly:

calculate the rating of individual hotel categories (comfort, service, food, hospitality, territory, location);
assess how well this calculation corresponds to the ratings given by the review authors themselves;
synthesize a summary phrase about staying at the hotel (for example, “Of 653 guests at the hotel, 278 people (43%) noted the courtesy and friendliness of the staff, 220 holidaymakers (34%) liked the food in the restaurants, and 76 guests (12%) would like to stay here again”);
learn to identify the most useful reviews in order to show them to portal visitors first;
find suspicious and paid-for reviews, which are written in exchange for a reward for promotional purposes and often have little to do with the sad reality.

There were other tasks, but I will limit myself to describing how I solved the ones above.
Now I need to explain what iKnow is and what you can get from it without significant effort. iKnow is a technology for analyzing texts. The iKnow API is a set of functions for working with unstructured data. There is also a GUI that lets you visualize the results of indexing texts and extract useful information from the data. When we load a text into iKnow, the output is the same text, but split into concepts and the relationships between them. Concepts in a sentence are, as a rule, subjects and objects. Relationships between concepts are in most cases verbs, verb forms or prepositions. In addition, iKnow can accurately count how many times the terms “hotel”, “sea”, “beach” or “food poisoning” are mentioned in a tourist review.

Example of splitting a sentence into concepts and relationships. Concepts are highlighted in yellow, relationships are underlined, and insignificant words are marked in gray.

What else we can get from the text depends mainly on our imagination and, to a lesser extent, on the richness of the iKnow API.
How can a hotel be rated from the text of a review? Below is one approach to calculating numerical characteristics of a text.
The first step is to break the whole text into parts. The iKnow API allows splitting into sentences or paths. Sentences are simple: a sentence is a part of the text bounded by periods, question marks or exclamation marks, as well as semicolons. A path is a part of a sentence in which interrelated concepts appear. In practice, paths and sentences coincide in most cases. Only compound or complex sentences consist of several paths.
In the room we found the ants promised in the reviews, called the reception desk and told them about our problem.

Taken as a whole, this is a sentence; its parts are paths.
We divided the text into sentences.
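As a rough illustration of the sentence-splitting rule described above (periods, question marks, exclamation marks and semicolons as boundaries), here is a minimal Python sketch. It is an assumption-laden stand-in, not the iKnow splitter: the `split_sentences` helper is hypothetical and ignores abbreviations, decimal numbers and similar edge cases.

```python
import re

# Split at whitespace that follows '.', '?', '!' or ';'.
_SENTENCE_BOUNDARY = re.compile(r"(?<=[.?!;])\s+")

def split_sentences(text):
    """Return the non-empty sentences of a text, punctuation included."""
    parts = _SENTENCE_BOUNDARY.split(text)
    return [part.strip() for part in parts if part.strip()]

review = "The sea was warm. The food was poor; we ate out! Would we return?"
sentences = split_sentences(review)
```

Each element of `sentences` can then be analyzed on its own in the later steps.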
The second task is to understand what a sentence is about. In other words, we need to determine which hotel category the sentence refers to. This requires dictionaries of so-called functional markers. For example, if a sentence contains the terms “restaurant”, “juice” or “tea”, then it is about the category “food”, and these terms are included in the marker dictionary for that category.

The drinks in the restaurant and bars include powdered juices, carbonated drinks, tea, instant coffee, wine and beer, and strong alcohol in bars.

Our functional marker dictionary contained about 300 terms specific to the 6 rated hotel categories. Importantly, without lemmatization the system would work properly only if all forms of these 300 words were registered in the dictionary. At first that is not an impossible task. But what if the dictionary grows?
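The lookup of functional markers can be sketched in Python as follows. This is a simplified illustration, not the project's code: the “food” markers come from the example above, while the “territory” markers and the `categories_of` helper are hypothetical. The input is assumed to be already lemmatized.

```python
# Marker dictionaries keyed by hotel category.
CATEGORY_MARKERS = {
    "food": {"restaurant", "juice", "tea"},  # terms named in the article
    "territory": {"garden", "pool"},         # hypothetical example markers
}

def categories_of(lemmas):
    """Return the set of categories whose markers occur among the lemmas."""
    present = set(lemmas)
    return {
        category
        for category, markers in CATEGORY_MARKERS.items()
        if markers & present
    }

sentence_lemmas = ["drink", "restaurant", "bar", "juice", "tea", "coffee"]
found = categories_of(sentence_lemmas)
```

Because matching is done on lemmas, each marker needs to appear in the dictionary only once, in its normal form.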
At the first stage, the marker dictionaries were compiled manually. They included terms encountered while reading through the first 200 reviews. At the second stage, the dictionary was expanded automatically by a dictionary learning algorithm built with iKnow features. As a result, the size of the dictionary grew by about an order of magnitude on average.
Now we know what the sentence is about. It remains to find out whether the author liked it or not. In other words, we need to determine the emotional coloring of the sentence. For this, a dictionary of emotional markers was compiled. In Russian, emotional coloring is usually conveyed by adjectives (delicious coffee, convenient entry into the sea, and so on). We can also take into account nouns with a clear coloring (dirt, mass alcohol poisoning, joy).
And now the “magic” begins, when numerical scores, graphs and tables are produced from the text. For each individual review we can count the number of positive and negative terms related to the hotel categories, and then determine the share of positive ones. For the calculation, I used the following formula:

Score = positive / (positive + negative)

The result is a number from 0 to 1, and the better the hotel, the closer the value is to 1. Multiplying this number by a factor, for example 5, gives a hotel rating on a five-point scale, which is what we did.
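The calculation above can be sketched in Python. The marker sets reuse the examples given in this article; the function names and the decision to return `None` when a review contains no emotional markers at all are assumptions made for this illustration.

```python
# Emotional markers taken from the examples in the article.
POSITIVE_MARKERS = {"delicious", "convenient", "joy"}
NEGATIVE_MARKERS = {"dirt", "poisoning"}

def count_markers(lemmas):
    """Count positive and negative emotional markers among the lemmas."""
    positive = sum(1 for word in lemmas if word in POSITIVE_MARKERS)
    negative = sum(1 for word in lemmas if word in NEGATIVE_MARKERS)
    return positive, negative

def review_score(positive, negative, scale=5):
    """Share of positive markers, scaled to a rating from 0 to `scale`."""
    total = positive + negative
    if total == 0:
        return None  # assumption: no markers means no rating
    return scale * positive / total

lemmas = ["delicious", "coffee", "convenient", "entry", "sea", "dirt", "joy"]
positive, negative = count_markers(lemmas)
rating = review_score(positive, negative)
```

Here three positive markers and one negative marker give a share of 0.75, or 3.75 on a five-point scale.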
The next task was to make sure that all this makes sense. This was perhaps the key moment of the work. To understand how well the ratings calculated with iKnow correlate with the authors' ratings, we built a graph that includes all the hotels we rated.

Figure 1. Correlation of author and calculated ratings.

The blue values are the averages of the authors' hotel ratings, and the green values are those calculated by iKnow. As you can see, the correlation is clearly present. Although it is too early to draw final conclusions, it is already clear that this approach works and can be developed further. Of course, such an algorithm for quantifying hotels and their parameters only works with a statistically large number of reviews: in our case, ratings were computed from at least 20 reviews per hotel. By the way, to build the analytics I used another InterSystems technology, DeepSee.
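The comparison above was made visually on a chart. One possible way to put a number on it, shown here only as a sketch, is to average the ratings per hotel (skipping hotels with fewer than 20 reviews, as in the project) and compute the Pearson correlation coefficient between the two series. The choice of Pearson's coefficient and the helper names are assumptions, not something the project is stated to have used.

```python
from math import sqrt
from statistics import mean

MIN_REVIEWS = 20  # threshold mentioned in the article

def hotel_averages(scores_by_hotel, min_reviews=MIN_REVIEWS):
    """Average rating per hotel, keeping only hotels with enough reviews."""
    return {
        hotel: mean(scores)
        for hotel, scores in scores_by_hotel.items()
        if len(scores) >= min_reviews
    }

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_x, mean_y = mean(xs), mean(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)

# Made-up numbers for illustration only.
averages = hotel_averages({"Hotel A": [4.0] * 20, "Hotel B": [5.0] * 3})
coefficient = pearson([3.0, 4.0, 5.0], [2.5, 3.5, 4.5])
```

A coefficient close to 1 would correspond to the two curves on the chart rising and falling together.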
The next tasks to solve were finding useful reviews and paid-for reviews. It is all fairly simple: you just need to formulate appropriate criteria. Here, for example, are the criteria for the usefulness of a review:

the review covers as many hotel categories as possible, so that you can read about the level of service, room comfort, quality of food, and so on;
the review is emotionally balanced: it should not consist only of bare criticism or only of delighted description. As a rule, people like some things and are unhappy with others, so the review must contain both positive and negative emotional markers;
the hotel rating calculated from the review is close to the hotel's average rating.
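The three usefulness criteria above can be combined into a simple check, sketched here in Python. The thresholds `min_categories` and `max_deviation` are hypothetical values chosen for illustration; the article does not state which thresholds were used in the project.

```python
def is_useful(categories_covered, positive, negative,
              review_rating, hotel_average,
              min_categories=4, max_deviation=0.5):
    """Apply the three usefulness criteria to one review.

    min_categories and max_deviation are assumed thresholds.
    """
    covers_enough = len(categories_covered) >= min_categories
    balanced = positive > 0 and negative > 0
    close_to_average = abs(review_rating - hotel_average) <= max_deviation
    return covers_enough and balanced and close_to_average

useful = is_useful(
    categories_covered={"service", "comfort", "food", "location"},
    positive=6, negative=2,
    review_rating=3.75, hotel_average=4.0,
)
one_sided = is_useful(
    categories_covered={"service", "comfort", "food", "location"},
    positive=8, negative=0,
    review_rating=5.0, hotel_average=4.0,
)
```

A review with only positive markers fails the balance criterion even if it covers many categories.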

Similar criteria were formulated for assessing how suspicious a review is. It would be too presumptuous for iKnow to state unequivocally that a review is paid-for, but it is quite possible to flag a review as suspicious and draw the attention of the tourist portal's administrator to it.
DeepSee deserves a separate mention: it supports dimensions and measures built on iKnow data. Reviews contain a lot of interesting and, most importantly, useful information for tourists beyond static ratings. With the help of the analyzer, we found that ratings vary from month to month and from year to year. All of this is clearly visible on the DeepSee charts.

Figure 2. Monthly hotel rating change

Let me summarize the work done. We managed to implement the first meaningful Russian-language project on iKnow. With enough imagination and the richness of the iKnow API, you can build quite complex solutions. In the future, we plan to develop the hotel review analyzer into a universal tool, because there are a great many reviews on the Internet: of movies, cars, phones, and so on.
But it would be wrong to talk only about the merits, without mentioning the problems that had to be faced. There are two such problems:

handling of negation in sentences;
imperfect lemmatization.

Negation handling first appeared in iKnow in version 2014.1. But using it in a review analyzer is quite problematic. How, for example, should the system evaluate this phrase:

I can not say that we did not like the beach.

or

All in all, our own holiday went WELL, but I would never go to this hotel again!

iKnow can detect the presence of negation in a sentence, but it is difficult to determine with certainty what exactly is being negated. This problem is especially characteristic of Russian, which has no rigidly fixed word order in the sentence. How to deal with such sentences is up to you. The options are: invert the score of the emotional markers (multiply by -1), ignore such sentences altogether, or leave everything as it is. In any case, this leads either to a loss of information or to a loss of accuracy in the analysis of individual sentences. And it is not at all clear how to analyze sarcasm.
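The three options for sentences with negation can be written down as a small policy function. This Python sketch assumes that each emotional marker in a sentence has already been scored as +1 or -1 and that a separate flag says whether negation was detected; the function name and policy labels are hypothetical.

```python
def apply_negation_policy(marker_scores, has_negation, policy="keep"):
    """Adjust the marker scores of one sentence according to a policy.

    policy: "flip"   - multiply every score by -1,
            "ignore" - drop the sentence from the calculation,
            "keep"   - leave the scores unchanged.
    """
    if not has_negation or policy == "keep":
        return list(marker_scores)
    if policy == "flip":
        return [-score for score in marker_scores]
    if policy == "ignore":
        return []
    raise ValueError(f"unknown policy: {policy}")

scores = [1, -1, 1]
flipped = apply_negation_policy(scores, has_negation=True, policy="flip")
ignored = apply_negation_policy(scores, has_negation=True, policy="ignore")
kept = apply_negation_policy(scores, has_negation=True, policy="keep")
```

None of the policies handles a double negation such as the first example phrase correctly in every case, which is exactly the trade-off described above.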
There is also a problem with lemmatization. It exists and it works, but sometimes it produces absolutely wonderful results. For example, the prepositional case of the noun “hotel” (“об отеле”, “about the hotel”) is reduced to the livestock term “calving” (“отёл”), and “green wild plum” is transformed into the sanitary “green wild plum”.
However, I think we should be optimistic and believe that over time these problems will be solved. Work on improving iKnow continues.
This article contains no technical details. I will be happy to cover them, and to explain how to create your own iKnow applications, in the sequel.

Source: https://habr.com/ru/post/243217/
