
Automatic generation of meaningful unique texts

Every SEO specialist knows that for search engines to favor a site, it must contain unique texts. And not just arbitrary strings of words, but meaningful sentences, preferably on the site's topic. This is especially a problem for aggregators, which take information from other sites, and for online stores, where product specifications and data are largely the same everywhere. The standard practice in this situation is therefore to order unique texts from copywriters. This costs from 50 to 300 rubles per 1,000 characters. If your site has 10,000 pages, unique texts quickly become a significant expense.

In this article we will talk about algorithmic text generation methods and tell you about our experience with them.

Let us clarify right away that we are talking about generating meaningful and useful texts, not text-like garbage, which is easy to produce in large quantities. One often hears the opinion that this task cannot be solved automatically, but in practice this belief is already outdated.

As an example task, consider automatically generating product descriptions from reviews. That is, given several user reviews of a product collected from different sites, automatically create a short unique text that summarizes the information in them. This task is harder than, say, generating text from the product's specifications, since we must first extract some information from the reviews and only then create a new text based on it.
Suppose we are working with phone reviews. What information can we extract? At the surface level, we can determine whether a review is positive or negative using a text classifier, and then extract a list of the aspects of the phone it mentions. The simplest way is dictionary-based: look for occurrences of words such as "convenience", "screen", "battery", "loudness", etc. A more accurate way of identifying aspects and their evaluations can be based on a trainable system for extracting information from text.
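The dictionary-based approach can be sketched roughly as follows. This is only an illustration: the keyword lists, the polarity words, and the function name `extract_aspects` are all hypothetical, and polarity is decided by naive word counting rather than by a trained classifier.

```python
import re

# Hypothetical lexicons, for illustration only.
ASPECT_KEYWORDS = {
    "convenience": {"convenient", "comfortable", "handy"},
    "screen": {"screen", "display"},
    "battery": {"battery", "charge"},
    "volume": {"volume", "loud", "quiet", "loudness"},
}
POSITIVE = {"good", "great", "excellent", "convenient", "loud", "bright"}
NEGATIVE = {"bad", "poor", "weak", "quiet", "dim", "awful"}


def extract_aspects(review: str) -> dict:
    """Return {aspect: '+' or '-'} for the aspects mentioned in a review."""
    result = {}
    # Score each sentence separately so polarity words stay near their aspect.
    for sentence in re.split(r"[.!?]+", review.lower()):
        words = set(re.findall(r"[a-z]+", sentence))
        score = len(words & POSITIVE) - len(words & NEGATIVE)
        for aspect, keywords in ASPECT_KEYWORDS.items():
            # Record the aspect only if the sentence has a clear polarity.
            if words & keywords and score != 0:
                result[aspect] = "+" if score > 0 else "-"
    return result


review = "Very convenient phone. The screen is great. The speaker is too quiet, volume is poor."
print(extract_aspects(review))
```

A real system would need morphological normalization (especially for Russian) and negation handling, which this sketch ignores.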

Thus, we can obtain data of the form {convenience: +, volume: -, screen: +, ...}. Not a lot of information, but enough to start with. Now we need to create a text. Let's see how this can be done.

Templates. The first thing that comes to mind is to use templates. That is, prepare sentences such as "This phone is very convenient", "The volume is good", etc. in advance. Then go through the list of aspects and insert the corresponding sentences. For our example, the result looks something like this:

This phone is very convenient. The volume leaves much to be desired. The screen is quite good.

The text is relatively meaningful and more or less readable, but it quickly ceases to be unique, since the variety of options is small. This is bad for search engines, and readers will get annoyed over time.
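A minimal sketch of the template approach, assuming one fixed sentence per (aspect, polarity) pair. The `TEMPLATES` table and the `describe` function are hypothetical names, and the negative-polarity sentences for convenience and screen are invented for the example.

```python
# Hypothetical templates: one fixed sentence per (aspect, polarity) pair.
TEMPLATES = {
    ("convenience", "+"): "This phone is very convenient.",
    ("convenience", "-"): "This phone is not very convenient.",
    ("volume", "+"): "The volume is good.",
    ("volume", "-"): "The volume leaves much to be desired.",
    ("screen", "+"): "The screen is quite good.",
    ("screen", "-"): "The screen is disappointing.",
}


def describe(aspects: dict) -> str:
    """Join the template sentence for each known (aspect, polarity) pair."""
    sentences = [
        TEMPLATES[(name, polarity)]
        for name, polarity in aspects.items()
        if (name, polarity) in TEMPLATES
    ]
    return " ".join(sentences)


print(describe({"convenience": "+", "volume": "-", "screen": "+"}))
```

Since every input maps to exactly one output, two products with the same aspect list get identical descriptions, which is the uniqueness problem noted above.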

Formal grammar. Consider the following set of rules:

$convenience ← $phone is $conv
$phone ← $this $phone-ex
$conv ← $mod $conv-ex
$mod ← very
$mod ← quite
$mod ←
$phone-ex ← phone
$phone-ex ← device
$this ← this
$this ← the
$conv-ex ← convenient $use
$conv-ex ← handy
$use ← to use
$use ←

Let's start with the topmost rule and substitute the symbols on the right-hand side: $convenience => $phone is $conv => $this $phone-ex is $mod $conv-ex => this device is quite convenient

If the rule for each substitution is chosen at random, different sentences result. For example, the same set of rules can also generate "the phone is very convenient" and "this device is very convenient to use".

This set of rules describes many different sentence variants and provides much greater variability. With some diligence, you can write rules that generate varied and fairly readable texts.
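Random expansion of such a grammar can be sketched in a few lines. The dictionary below is an English rendering of the rule set, so its exact right-hand sides are an assumption of this sketch, as are the names `GRAMMAR`, `expand`, and `generate`.

```python
import random

# Each symbol maps to its alternative right-hand sides; [] is the empty expansion.
GRAMMAR = {
    "$convenience": [["$phone", "is", "$conv"]],
    "$phone": [["$this", "$phone-ex"]],
    "$conv": [["$mod", "$conv-ex"]],
    "$mod": [["very"], ["quite"], []],
    "$phone-ex": [["phone"], ["device"]],
    "$this": [["this"], ["the"]],
    "$conv-ex": [["convenient", "$use"], ["handy"]],
    "$use": [["to", "use"], []],
}


def expand(symbol, rng):
    """Recursively replace a symbol with a randomly chosen right-hand side."""
    if symbol not in GRAMMAR:
        return [symbol]  # terminal word: emit as-is
    production = rng.choice(GRAMMAR[symbol])
    words = []
    for part in production:
        words.extend(expand(part, rng))
    return words


def generate(rng=random):
    """Generate one sentence starting from the topmost rule."""
    return " ".join(expand("$convenience", rng))


for _ in range(3):
    print(generate())
```

Even this tiny rule set yields 36 distinct sentences (2 × 2 × 3 × 3 choices), and each added alternative multiplies that number.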

For example, here is a phone description generated this way, from reviewdot.ru:

We studied 295 reviews. There is reason to believe that this number is sufficient for an analysis. The vast majority of people are quite happy with this phone, but there are some not very good opinions.

Advantages: users who left reviews generally single out the design and sufficient ease of use as advantages. In addition, reviewers are generally satisfied with the quality of the battery, volume, sound, camera, keyboard, case, plastic, durability, and screen.
Disadvantages: reliability is commonly mentioned as a disadvantage.


The disadvantages of this method are a limited vocabulary and rather high labor costs: writing the rules takes time and effort.

For English, there are many ready-made language generation packages, which also include a rule-based sentence planning system and the generation system itself. One example is SimpleNLG, and there are many others, from simple to very advanced. For Russian the situation is somewhat worse, but as we have seen, writing a simple generator based on a formal grammar is relatively easy, and it can do quite a lot.

Neural networks. Our latest development is a text-generating neural network. A paper about it was recently published in the proceedings of the Dialog-2015 conference (the paper, in English, is available here). This system learns to generate new texts from examples.

The principle of its operation is similar to the one we already described in the article on the chatbot. The difference is that there is an additional layer of neurons that simultaneously receives information about the current word of the sentence and the set of aspects the sentence contains. The list of aspects is encoded as a vector in which each dimension corresponds to one aspect, and the value of that dimension (1 or 0) encodes the presence or absence of the aspect in the sentence. The task of the neural network is to predict the next word, given the current word and the aspect vector. Below is a diagram from our paper, with the captions translated into Russian:

[Figure: diagram of the network from our paper.]
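The input encoding can be illustrated without the network itself. The sketch below only shows how a sentence and its aspects might be turned into (current word, aspect vector) → next word training pairs. The aspect inventory, the `<s>` and `</s>` boundary markers, and the function names are assumptions; the actual architecture and preprocessing are described in the paper, not here.

```python
# Hypothetical aspect inventory; the real system's list is not given in the article.
ASPECTS = ["battery", "screen", "sound", "design", "price"]
START, END = "<s>", "</s>"  # assumed sentence boundary markers


def encode_aspects(present):
    """Vector with 1 where the aspect occurs in the sentence and 0 elsewhere."""
    return [1 if aspect in present else 0 for aspect in ASPECTS]


def training_pairs(tokens, present):
    """Build ((current word, aspect vector), next word) examples for one sentence."""
    vector = encode_aspects(present)
    sequence = [START] + tokens + [END]
    return [((current, vector), nxt) for current, nxt in zip(sequence, sequence[1:])]


pairs = training_pairs(["beautiful", "screen", "good", "sound"], {"screen", "sound"})
for (word, vec), target in pairs:
    print(word, vec, "->", target)
```

The aspect vector stays the same at every step of a sentence, so the network is reminded at each word which aspects the sentence is supposed to cover.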



Given a list of aspects, the trained neural network is able to generate new sentences. Here are examples of the resulting texts:

Comfortable player, battery. Convenient sound, metal body. Small price, and easy to use. Screen, 2 SIM cards, 2 battery.

Big weight, big screen, attractive, good camera. The design + good super camera sits down, almost all games (some ask for RAM) are large.

Battery, speed is beautiful ,. Design, sound, functionality, the mass of different days is enough. Beautiful, a few back, processor, responsive sensor. Beautiful screen, color reproduction. Design, battery, do not brake, practical.

Build quality, user-friendly interface. Great amold display, buttons, camera, and all games. design, fast internet, good at catching the net. Bright size, pleasantly heavy, fits well in the hand. Beautiful screen, speed, Internet, java. Rich battery, fast functionality, reliability. strong, expensive percent, great sound, fast percent

It has a flash drive, Java applications, a card reader, which is not a brick, it is quite miniature, the speaker is not buggy. The case quality especially large buttons, good equipment.

The main disadvantage is a certain clumsiness of the texts, with grammatical and semantic errors. The advantages are variety, a more natural feel, and no need to develop rules by hand. One possible application is to generate a lot of texts and then manually fix the awkward spots, which is still much faster than writing by hand from scratch, especially if you plan to write texts based on an analysis of real reviews.

And of course, the model is not limited to the domain of reviews: in principle, it can be trained on any text.

In conclusion, I would like to quote a short fragment from Pierre Boulle's science-fiction story “The Perfect Robot” (1953):

“If the noun ‘sheep’ is chosen, the robot will be able to combine this word grammatically with a suitable adjective, in other words, to choose the right one from such phrases as ‘liquid sheep’, ‘misty sheep’ or ‘white sheep’, excluding those that violate the rules of agreement in grammatical gender and number, such as, for example, ‘radiant sheep’ or ‘white sheep’.”
“‘Liquid sheep’ is a meaningless phrase,” the Spirit of Contradiction interrupted.
“Let me finish! All in due time... We do not foresee any particular complications at the next stage: forming a complete phrase according to the rules of syntax. These rules are precisely defined, so the machine will be able to absorb them just as the human brain does, and perhaps even better. So we will achieve the formation of a certain number of grammatically correct phrases, like ‘a liquid sheep flies in a pointed sky’ or ‘a white sheep eats grass’...”
“That's where I've caught you!” the Spirit of Contradiction rejoiced. “Most of your phrases, grammatically correct as you say, will be meaningless!”

“They will be perfect in terms of form...”

Phrases like “a card reader, which is not a brick, it is quite miniature” invariably remind me of “a liquid sheep flies in a pointed sky”, but on the whole one can say that half a century later the task of automatic text writing has moved from the realm of fantasy to practical application.

Source: https://habr.com/ru/post/259355/

