Have you ever wondered why the texts of the Russian classics are valued so highly, and their authors considered masters of the word? The point is clearly not just the plots, not just what is written, but also how it is written. Yet when skimming diagonally, this is hard to appreciate. Besides, we have nothing to compare the text of a great novel with: why, exactly, is it so wonderful that this particular word appears in this particular place, and why is it better than some other? To some extent, the actual word choice could be contrasted with the potential alternatives found in the writer's drafts. A writer does not pour out his text in one inspired sitting from beginning to end; he agonizes, chooses between options, crosses out those that seem insufficiently expressive, and searches for new ones. But drafts do not exist for all texts, and those that do are fragmentary and hard to read. We can, however, run an experiment of our own: replace every replaceable word with a similar one, and read the classic text side by side with one that never existed, but could have appeared in some parallel universe. Along the way, we can try to answer why a given word in a given context is better than another that is similar to it, yet different.
And now all this (except for the actual reading) can be done automatically.
Habr has already seen an article on using distributional semantics to search for so-called "pies" (pirozhki, a genre of comic verse). Distributional semantics is a fairly simple but remarkably effective idea: the meaning of a word is related to its surroundings in text. Words with similar meanings appear in similar contexts, and vice versa. A word's contexts can be represented as a vector (hence "vector models"), which lets us compute similarities and differences in meaning between words. For Russian, an excellent service has been built on this idea, RusVectōrēs, which not only lets you find the most semantically similar words, but also provides free access to the models trained by the resource's creators.
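The core idea can be illustrated with a toy sketch (the vectors below are made up for illustration, not taken from any real model): represent each word as a vector of context counts and measure semantic similarity as the cosine of the angle between vectors, which is what services like RusVectōrēs do at scale.

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "context count" vectors (invented for illustration):
vec = {
    "blue":  [9.0, 1.0, 0.5],
    "azure": [8.0, 2.0, 0.4],
    "table": [0.5, 0.2, 7.0],
}

# Words used in similar contexts get a cosine close to 1:
print(cosine(vec["blue"], vec["azure"]))  # high
print(cosine(vec["blue"], vec["table"]))  # low
```

Real models work the same way, only with dense vectors of a few hundred dimensions learned from a corpus rather than raw context counts.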
Let's take five classic Russian novels. Dmitry Bykov once said that the novels most significant for Russian literature are those whose titles contain the conjunction "and": "Crime and Punishment", "War and Peace", "Fathers and Sons", "The Master and Margarita". To these we add "Eugene Onegin", also an important Russian novel, albeit one in verse.
In addition, we will need a model from the RusVectōrēs website, built on the texts of the Russian National Corpus and the Russian Wikipedia; the Gensim library will let us work with it. To search this model for so-called quasi-synonyms, that is, the nearest neighbors in semantic space (these neighbors are not always "real" synonyms; counterintuitively, the semantically closest word is often an antonym), we need the normalized form of each word (the lemma), which a morphological analyzer, a program able to find this form, should provide. At first I thought of using Mystem from Yandex, but in the end I settled on another one, pymorphy2; it will become clear why later.
```python
import re

import gensim
import pymorphy2

model = gensim.models.KeyedVectors.load_word2vec_format(
    "ruwikiruscorpora_0_300_20.bin.gz", binary=True)
model.init_sims(replace=True)
morph = pymorphy2.MorphAnalyzer()
```
So: we walk through the text, take each word, normalize it (that is, restore the infinitive for verbs or the nominative singular for nouns), look up the most similar word in the model, and make sure that word is the same part of speech as the original, because by default it will not necessarily be. Look, for example, at the words closest to blueness: alongside nouns such as blackness there are the adjective blue and the verb to turn blue. Not to mention that some words are missing from the model altogether: for low-frequency words in the source corpus, no vector is built, for optimization reasons, so no quasi-synonym can be found. In that case we simply keep the original word. Also, there is no point in seeking substitutes for pronouns, prepositions, and other function words.
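The part-of-speech filtering just described can be sketched as a small helper. This is a hypothetical function, under the assumption that the model's vocabulary entries come in the RusVectōrēs lemma_POS format (e.g. "синева_S") and that gensim's most_similar() has already produced a list of (entry, score) pairs:

```python
def pick_same_pos(candidates, target_pos):
    """Return the first candidate lemma whose POS tag matches target_pos.

    candidates -- list of ("lemma_POS", similarity) pairs, e.g. what a call
                  like model.most_similar("синий_A", topn=10) might return
                  (hypothetical; the exact tag set depends on the model);
    target_pos -- the POS tag of the original word, e.g. "A".
    """
    for entry, _score in candidates:
        if "_" not in entry:
            continue
        lemma, pos = entry.rsplit("_", 1)
        if pos == target_pos:
            return lemma
    return None  # nothing of the right POS: keep the original word


neighbours = [("синева_S", 0.72), ("голубой_A", 0.70), ("чернота_S", 0.65)]
print(pick_same_pos(neighbours, "A"))  # голубой
```

Returning None for the "no suitable neighbor" case matches the fallback described above: the original word stays in place.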
But now the most interesting part. In the source text the word stands in some inflected form, while from the model we obtained a lemma. It now has to be put back into that same form, so that the text stays coherent and readable and does not read like some bad translation from an alien tongue along the lines of Mom to wash frame. And this is exactly where pymorphy2 can help: it can not only parse words but also put them into the required form according to given grammatical tags. Mystem cannot do this. So if, while parsing, we remember not only the lemma the analyzer produced but also the set of tags describing the word's form (number, case, and so on), it is convenient to feed those tags back into the same program to generate the form. True, it turned out that the pymorphy2 generator does not recognize all the tags the pymorphy2 analyzer produces; that is, its right hand does not always know what its left hand is doing. But that's no disaster: a little dancing with a tambourine and we get the desired form:
```python
def flection(lex_neighb, tags):
    tags = str(tags)
    tags = re.sub(',[AGQSPMa-z-]+? ', ',', tags)
    tags = tags.replace("impf,", "")
    tags = re.sub('([A-Z]) (plur|masc|femn|neut|inan)', '\\1,\\2', tags)
    tags = tags.replace("Impe neut", "")
    tags = tags.split(',')
    tags_clean = []
    for t in tags:
        if t:
            if ' ' in t:
                t1, t2 = t.split(' ')
                t = t2
            tags_clean.append(t)
    tags = frozenset(tags_clean)
    prep_for_gen = morph.parse(lex_neighb)
    # ... (generation of the inflected form; the full function is shown further below)
```
Now we put everything together, not forgetting to handle capitalization and punctuation. Voilà! Alternative Russian literature that never existed, on our screen:
Talk about Juvenal
In the middle of the note leave vale,
Yes, I remembered, though not without a sin
From the Aeneid two poems.
Talk about Juvenal,
At the end of the letter put vale,
Yes, he remembered, though not without sin,
From the Aeneid two verses.
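One of the "putting it together" details mentioned above is capitalization: the substitute comes back from the model in lowercase, while in the text it may open a line or a sentence. A minimal sketch of such a helper (a hypothetical function, not the article's actual code):

```python
def match_case(original, replacement):
    """Copy the capitalization pattern of the word being replaced."""
    if original.isupper() and len(original) > 1:
        return replacement.upper()       # e.g. all-caps emphasis
    if original[:1].isupper():
        return replacement.capitalize()  # sentence- or line-initial word
    return replacement

print(match_case("Друзья", "приятели"))  # Приятели
```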
But still something is not right. In some places, the text still exceeds the permissible degree of incoherence:
Friends of Lyudmila and Ruslan!
With the heroine of my story
Without epilogues, sow half an hour
Let me meet you
Friends of Lyudmila and Ruslan!
With the hero of my novel
Without prefaces, this time
Let me introduce you
The problem is nouns! From the model we get a word of the same part of speech and put it into the same form, but we did not take into account gender, a fixed feature of each noun. Different nouns have different genders, and correct form generation will not save us from disagreement. So we introduce a rule: nouns suggested by the model are additionally checked for gender:
```python
def flection(lex_neighb, tags):
    tags = str(tags)
    tags = re.sub(',[AGQSPMa-z-]+? ', ',', tags)
    tags = tags.replace("impf,", "")
    tags = re.sub('([A-Z]) (plur|masc|femn|neut|inan)', '\\1,\\2', tags)
    tags = tags.replace("Impe neut", "")
    tags = tags.split(',')
    tags_clean = []
    for t in tags:
        if t:
            if ' ' in t:
                t1, t2 = t.split(' ')
                t = t2
            tags_clean.append(t)
    tags = frozenset(tags_clean)
    prep_for_gen = morph.parse(lex_neighb)
    ana_array = []
    for ana in prep_for_gen:
        if ana.normal_form == lex_neighb:
            ana_array.append(ana)
    for ana in ana_array:
        try:
            flect = ana.inflect(tags)
        except Exception:
            print(tags)
            return None
        if flect:
            word_to_replace = flect.word
            return word_to_replace
    return None
```
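The gender check itself can be expressed as a small predicate. This is a sketch under the assumption that grammemes are available as plain sets of strings (pymorphy2 exposes them via parse.tag.grammemes); the function name is hypothetical:

```python
# pymorphy2's gender grammemes:
GENDERS = {"masc", "femn", "neut"}

def same_gender(orig_grammemes, cand_grammemes):
    """True when two nouns carry the same grammatical gender
    (or neither carries one, e.g. plural-only forms)."""
    return (GENDERS & set(orig_grammemes)) == (GENDERS & set(cand_grammemes))

print(same_gender({"NOUN", "femn", "sing"}, {"NOUN", "femn"}))  # True
print(same_gender({"NOUN", "femn", "sing"}, {"NOUN", "masc"}))  # False
```

A candidate noun failing this check would simply be skipped in favor of the next nearest neighbor.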
Got better:
Friends of Lyudmila and Ruslan!
With the character of my story ...
etc.
Now you can do a slow reading:
Chapter 1
Never talk to strangers
One day in spring, at the hour of an unprecedentedly hot sunset, two citizens appeared in Moscow, at the Patriarch's Ponds. The first of them, dressed in a gray summer suit, was short, plump, and bald, carried his respectable hat in his hand like a pie, and on his well-shaven face sat supernaturally large glasses in black horn-rimmed frames. The second, a broad-shouldered, reddish, shaggy-haired young man in a checkered cap pushed back on his head, wore a cowboy shirt, chewed-up white trousers, and black slippers.
Chapter 1
Never talk to the unexplained
By chance in the spring, at noon of an unprecedentedly hot sunrise, in Kazan, on the Metropolitan brooks, two fellow citizens appeared. The first of them, dapper in a ten-year-old blue couple, was a tiny growth, was overweight, had a baldness, was pulling his decent hat in his hand with a pie, and on his badly shaved face there were glasses of otherworldly diameters in a black horn arch. The second - a broad-shouldered, curly, blond-haired dark-skinned mind in a colorful cap pinched on the forehead - was in a jacket, chewed white pants and black slippers.
He had successfully avoided meeting his landlady on the stairs. His garret was right under the roof of a tall five-story building and resembled a closet more than a dwelling. His landlady, from whom he rented this garret with dinner and service, lived one flight below in a separate apartment, and every time he went out he had to pass the landlady's kitchen, its door almost always thrown open onto the stairs.
He happily avoided talking with his neighbor on the step. The room forced him under the very roof of the low nine-storey mansion and looked more like a locker than a flat. His housing neighbor, in whom he hired this room with dinner and servants, was located one step below, in a separate apartment, and eating once, when released onto the embankment, he naturally had to go to meet her husband’s dining room, almost always tightly opened on the step.
Well, and of course, for all the serious and almost scholarly aims I wrote about at the beginning of the article, in places the result is simply funny, if politically incorrect:
- Are you a fascist? - inquired Unsupervised.
- Me?.. - the assistant professor echoed, and suddenly grew thoughtful. - Yes, perhaps, a fascist... - he said.
- Are you a German? - inquired Homeless.
- Me?.. - the professor echoed, and suddenly grew thoughtful. - Yes, perhaps, a German... - he said.
In fact, not everything agrees well even now. Words should be selected with other grammatical categories in mind, transitivity and voice among the obvious candidates. And pymorphy2 is not perfect at choosing a lemma: it suggests the most likely one, but the most likely answer is not always the right one. For example, the form young is recognized as the genitive of the noun young woman; the word most similar to that noun is dark-skinned woman, and pymorphy2 happily puts dark-skinned woman into the genitive case. This is how the whitish young man above ended up a blond-haired dark-skinned mind.
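One possible mitigation (a sketch of mine, not code from the article) is to prefer a parse whose part of speech matches what we expect from context, instead of blindly taking the highest-scoring one:

```python
from collections import namedtuple

# A stand-in for pymorphy2's Parse objects, which carry a lemma,
# a POS tag, and a probability score:
Parse = namedtuple("Parse", ["normal_form", "pos", "score"])

def best_parse(parses, expected_pos):
    """Return the most likely parse with the expected POS,
    falling back to the overall most likely parse."""
    matching = [p for p in parses if p.pos == expected_pos]
    return matching[0] if matching else parses[0]

# Hypothetical parses of the ambiguous form, sorted by score
# as pymorphy2 returns them:
parses = [
    Parse("молодая", "NOUN", 0.6),   # "young woman": most likely overall
    Parse("молодой", "ADJF", 0.4),   # the adjective we actually wanted
]
print(best_parse(parses, "ADJF").normal_form)  # молодой
```

This would not solve every case, since deciding the expected POS itself requires context, but it avoids the grossest mismatches.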
In some cases the form would be determined better with the letter ё, which is usually not printed in the texts of novels:
```python
>>> morph.parse('черт')
[Parse(word='черт', tag=OpencorporaTag('NOUN,inan,femn plur,gent'),
       normal_form='черта', score=0.588235,
       methods_stack=((<DictionaryAnalyzer>, 'черт', 55, 8),)),
 Parse(word='черт', tag=OpencorporaTag('NOUN,anim,masc sing,nomn'),
       normal_form='чёрт', score=0.411764,
       methods_stack=((<DictionaryAnalyzer>, 'черт', 3019, 0),))]
>>> morph.parse('чёрт')
[Parse(word='чёрт', tag=OpencorporaTag('NOUN,anim,masc sing,nomn'),
       normal_form='чёрт', score=1.0,
       methods_stack=((<DictionaryAnalyzer>, 'чёрт', 3019, 0),))]
```
Therefore:
To laugh and think to oneself:
"When will the features take you!"
To sigh and think to oneself:
"When will the devil take you!"
A description of the "vector novels" idea and additional materials can be found on a dedicated page.
All the replacement code is available on GitHub.
Enjoy reading!
UPD: kdenisk provided authoritative texts of the novels, which made it possible to get rid of a number of flaws in form recognition. Filtering by verb voice has been added.
Source: https://habr.com/ru/post/326380/