
How we came up with a text analysis system

Hello everyone. This is the first post on the blog of our startup, Meanotek, and it is mostly introductory. To keep it from being boring, we will tell the story of how a practical task led us to build a full-fledged system for "understanding" text with a computer, and what came of it.

The idea of teaching a computer to communicate in human language first came to me back in school, when I had one of the first Soviet analogs of the IBM PC at home, with GW-BASIC as the programming language. Naturally, the idea did not get far at the time and was pushed aside by more pressing matters, but it unexpectedly resurfaced many years later because of a specific practical need.

Actually, the idea came up while working on another project: a review search site, reviewdot.ru. The idea behind reviewdot.ru was this: the user enters a query, for example "DSLR camera for beginners", and gets a list of links to reviews on the Internet that address exactly that question. Or, for a query like "what breaks in Indesit washing machines?", links to reviews from Indesit owners whose machines had broken down. Let us set aside the question of how valuable such a resource is for people and talk a little about the technical side of the implementation.

Clearly, to search reviews on the Internet, you first have to collect those reviews from different sites. Reviews live on very different sites: review aggregators, online stores, individual blogs, and so on. The classic approach to this problem is to download the pages, select the needed blocks from the markup with an HTML parser library, and store them in a database. However, it turned out that this approach did not suit our limited resources (the two of us worked on the project in time free from work and rest). First, adding each new site requires analyzing its structure by hand and writing data extraction rules, which is not always trivial even with an HTML parser and XPath queries. Second, sites tend to change their structure. With 5-6 sites this is tolerable, since major redesigns are rare. But when the sources number in the thousands, the effort required to keep the solution working grows in proportion to their number. On top of that, product categories are named differently on every site, so they have to be matched against each other and maintained in a single category tree. In short, the amount of dull, monotonous work grows quickly and monotonously, and it made the task too heavy for us.
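To make the pain concrete, here is a minimal sketch of what the "classic" approach looks like in practice. The site names and XPath rules here are purely hypothetical illustrations, not anything we actually used: the point is that every supported source needs its own hand-written rule, and any redesign of that source silently breaks it.

```python
# A minimal sketch of the "classic" per-site extraction approach.
# Site names and XPath rules below are hypothetical illustrations.
import requests
from lxml import html

# Every supported site needs its own hand-written rule, and every
# redesign of that site breaks its rule.
SITE_RULES = {
    "example-shop.ru": '//div[@class="review-text"]/text()',
    "example-blog.ru": '//article[@class="post"]//p/text()',
}

def extract_reviews(url: str) -> list[str]:
    """Download a page and pull out review blocks with the site's XPath rule."""
    page = requests.get(url, timeout=10)
    tree = html.fromstring(page.content)
    domain = url.split("/")[2]
    rule = SITE_RULES.get(domain)
    if rule is None:
        raise KeyError(f"No extraction rule written for {domain} yet")
    return [t.strip() for t in tree.xpath(rule) if t.strip()]
```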
Therefore, after struggling for a while, we gave up on that path and decided to go another way: teach the computer to find reviews on any web page by itself. There were no off-the-shelf tools for this, so after some effort we built a simple algorithm that extracts text blocks from web pages and classifies them into two types: review / not a review. The program thus acquired a rudimentary intelligence, and the cost of maintaining the system no longer depended on the number of data sources. We could add as many source sites as we wanted, and the labor costs stayed constant. True, there was some loss in quality: about 15% of the texts that ended up in the database were not reviews, and some percentage of actual reviews were missed, but these shortcomings were nothing compared to the savings achieved.
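For illustration, here is a toy sketch of the block-classification idea, using off-the-shelf scikit-learn rather than whatever we actually wrote back then; the training samples are hypothetical stand-ins:

```python
# A toy sketch of the block classifier: represent each text block with
# bag-of-words features and learn "review / not a review" from labelled
# examples. The training data below is a hypothetical stand-in.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

blocks = [
    "Bought this camera a month ago, the autofocus is great",  # review
    "Add to cart | Delivery | Payment methods",                # navigation
    "The washing machine broke after two weeks, very unhappy", # review
    "Copyright 2014, all rights reserved",                     # footer
]
labels = [1, 0, 1, 0]  # 1 = review, 0 = not a review

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(blocks, labels)

# Any block from any site, seen or unseen, goes through the same model:
print(clf.predict(["Terrible hotel, the room was never cleaned"]))  # e.g. [1]
```

The key property is that nothing in the model is tied to a particular site's markup, which is exactly what decouples maintenance cost from the number of sources.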

From this we drew the following conclusions:

  1. Even a very simple element of machine "intelligence" can replace a large amount of monotonous manual work, and it removes the link between maintenance cost and the number of data sources.
  2. A modest loss in quality is acceptable when it is outweighed by the savings in effort.

The second thought is rather trivial, but the first observation seemed interesting to us. We added an automatic "extractor" of product names and a category "assigner" to the program, so the system could now "discover" new products and product categories on its own. Over time, several hundred categories were "discovered" this way, including some we had not even thought of at the start. For example, the program decided that a "hotel" is also, in some sense, a product, and started adding different hotels and reviews about them to the database, which was never planned. As a result, knowledge was not only extracted but also, to a certain extent, acquired a structure of its own.
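As a purely hypothetical illustration of how an extractor and an assigner might be wired together (our real components were learned from data, not hand-coded like this):

```python
# Hypothetical sketch of a product-name "extractor" and a category
# "assigner"; the pattern and category list are illustrative assumptions.
import re

KNOWN_CATEGORIES = {"camera", "washing machine", "hotel"}

def extract_product(text: str) -> str | None:
    # Toy heuristic: a capitalised brand word followed by a model token,
    # e.g. "Indesit WISL". A learned extractor replaces this pattern.
    m = re.search(r"\b([A-Z][a-z]+\s+[A-Z0-9][\w-]+)\b", text)
    return m.group(1) if m else None

def assign_category(text: str) -> str | None:
    # Toy assigner: pick the first known category mentioned in the text.
    # Unmatched products are candidates for "discovering" a new category.
    lowered = text.lower()
    for cat in KNOWN_CATEGORIES:
        if cat in lowered:
            return cat
    return None
```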

All of this again suggested that the practical value of computer text analyzers is underestimated. Of course, there are all kinds of text analysis APIs nowadays, but the range of tasks they can solve is strictly limited: usually extracting some predefined entities, such as company names, products, and people's names, plus sentiment assessment (positive / negative / neutral). This has obvious practical applications, but the list of what could potentially be done is much longer.

For example:
  1. A natural-language voice interface in every mobile application (not just a select few)
  2. New types of question-answering applications (such as picking a suitable product from requirements stated in natural language)
  3. Analysis of incoming e-mail to identify customers' frequent problems, with automatic replies for the typical ones


Ready-made text analysis APIs, as a rule, cannot "discover" new concepts in a given subject area and use them to improve the quality of analysis. Before the experience with reviewdot, we thought such self-improving systems did not work well enough to be practically useful, but here we were convinced of quite the opposite.

And these are potentially hundreds of specialized tasks that are theoretically possible but not yet covered in practice. Of course, we cannot solve every applied task ourselves, and after some reflection the idea of a business scheme was born:

The client (often a software developer) has a task or an idea for a new application → he contacts us, and we implement the feature he needs free of charge and provide it through a web API or offline → if the application is successful, the developer earns income and pays for access to the web API.

Clearly, there are many difficulties here as well. For example, for the system to scale to new tasks, we must avoid having to write a separate solution for each user. It is much better to have a kernel that can potentially do everything: one intellectual architecture for all tasks, which is simply trained on a certain number of examples. We already had a prototype of this from the work on reviewdot. The truth is, we were too lazy to use the standard language-processing toolchain: part-of-speech taggers, morphological analyzers, tree-building parsers, and similar tools with (semi-manual) feature selection for classification require a lot of monotonous work, and that, as noted above, was our weak point. So, after studying the literature, we settled on text analysis algorithms based on neural networks. They worked slightly worse, but having written one core, we could use it almost everywhere, and when it did not fit somewhere, we looked for a way to extend or modify it until it did. This approach sometimes produced terrible solutions, but it worked nonetheless.
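To make the "one core for all tasks" idea concrete, here is a minimal sketch in PyTorch (not our actual code): a single small network over token embeddings, where only the training data and the size of the output layer change from task to task.

```python
# A minimal sketch (an assumption, not Meanotek's actual architecture) of
# one shared text core reused across tasks: the body of the network is
# identical, only the task-specific output head differs.
import torch
import torch.nn as nn

class TextCore(nn.Module):
    def __init__(self, vocab_size: int, num_classes: int, dim: int = 64):
        super().__init__()
        self.embed = nn.EmbeddingBag(vocab_size, dim)  # averaged word embeddings
        self.head = nn.Linear(dim, num_classes)        # task-specific output

    def forward(self, token_ids: torch.Tensor, offsets: torch.Tensor):
        return self.head(self.embed(token_ids, offsets))

# The same core is instantiated per task; only num_classes differs:
review_clf = TextCore(vocab_size=50_000, num_classes=2)  # review / not a review
sentiment = TextCore(vocab_size=50_000, num_classes=3)   # pos / neg / neutral
```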

In the fall of 2014, we found an open evaluation of sentiment analysis systems for Russian at the Dialog conference and decided to take part, to see how our "handicraft" approach to language analysis compared with professional solutions. This led to several cycles of improvements and sleepless nights (a topic for a separate article), but in the end we not only confirmed the universality of the approach, but also took several first and second places on different tracks, which put our system among the most accurate.

Thus, the idea finally matured, and we started working, still in test mode, looking for people with interesting tasks or application projects and offering to build an analyzer for them. So far only two such projects are in development, but we hope others will appear.

Source: https://habr.com/ru/post/256303/

